English Pages [356] Year 2021
Methods in Molecular Biology 2229
Filippo Menolascina Editor
Synthetic Gene Circuits Methods and Protocols
METHODS IN MOLECULAR BIOLOGY
Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK
For further volumes: http://www.springer.com/series/7651
For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily-reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.
Synthetic Gene Circuits Methods and Protocols
Edited by
Filippo Menolascina School of Engineering, Institute for Bioengineering, University of Edinburgh, Edinburgh, UK
ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-1031-2 ISBN 978-1-0716-1032-9 (eBook) https://doi.org/10.1007/978-1-0716-1032-9 © Springer Science+Business Media, LLC, part of Springer Nature 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
Preface

Synthetic Biology is an emerging engineering discipline with an ambitious goal: empowering scientists with the ability to program new functions into cells, just as they would with computers. The field is predicated upon the “Design-Build-Test-Learn” (DBTL) cycle. Borrowed from engineering, this framework guides human experts through the process of translating requirements into synthetic circuit designs, then building, characterizing, and redesigning them by “learning” from “mistakes”. Systematic in nature, the DBTL cycle could, until now, translate into lengthy and expensive iterations. The recent advent of laboratory automation in biology has radically changed this landscape. Computer algorithms that automate the design of biocircuits have emerged over the past few years. Robots that can be instructed to take such designs and physically assemble the DNA constructs are becoming more and more affordable. Miniaturization, introduced by microfluidics, increases throughput and cuts reagent costs, ultimately enabling faster and cheaper screening of candidate circuit designs. Mathematical models, once a prerogative of quantitative scientists, can now be built automatically with minimal human effort. Experiments themselves can now be designed by computer programs to save time and money. Combined, such advances can significantly speed up the engineering of biological circuits, yet only a fraction of this potential has been realized so far.

This book aims to fill this gap. By bringing together some of the most prominent scientists and engineers in synthetic biology, this volume aims to provide the reader with clear, immediately actionable protocols to implement and exploit automated DBTL in their research and development efforts. Following the natural evolution of a project in synthetic biology, we first outline the techniques to model and simulate biological systems (Chapters 1 and 2). Chapters 3 and 4 show how such models can be used to design and redesign biological systems automatically. We then move on to laboratory automation: while Chapter 5 guides the reader in the setup of an automated biolaboratory, Chapters 6 and 7 provide a step-by-step guide on how to perform Computer-Aided Design, Planning, and Verification of DNA constructs using the rich toolbox of software developed at the Edinburgh Genome Foundry, one of the most automated public biofoundries worldwide. The following three chapters delve into protocols for high-throughput characterization of gene circuits: through RNA sequencing (Chapter 8) or via microfluidics, using either bacterial cell-free extracts (Chapter 9) or live mammalian cells (Chapter 10). Computational and experimental procedures to infer models automatically, with minimal effort, are outlined in Chapters 11 and 12, respectively. Metabolic burden, a common source of divergence between model predictions and experiments, is the focus of the following three chapters. While Chapters 13 and 14 focus on computational techniques to predict such burden from models, Chapter 15 illustrates how sensors can be designed and developed to measure metabolic burden experimentally. Chapter 16 concludes the volume, offering the reader a broader, yet practical, perspective on how DNA parts can be engineered in mammalian cells to sense, and respond to, intracellular signals in general.

Working on this volume we aimed at distilling our collective experience into a set of steps and advice that will ideally help our readers jump-start their journey into
automating the design, construction, testing, and modeling of biocircuits. We hope the result will be met with favor. I personally wish to thank all the authors for their contributions: editing this book and venturing in their science were a tremendously enjoyable learning experience for me. Edinburgh, UK
Filippo Menolascina
Contents

Preface
Contributors

1 Qualitative Modeling, Analysis and Control of Synthetic Regulatory Circuits
   Madalena Chaves and Hidde de Jong
2 Stochastic Differential Equations for Practical Simulation of Gene Circuits
   Jesús Picó, Alejandro Vignoni, and Yadira Boada
3 Using Models to (Re-)Design Synthetic Circuits
   Giselle McCallum and Laurent Potvin-Trottier
4 Automated Biocircuit Design with SYNBADm
   Irene Otero-Muras and Julio R. Banga
5 Setting Up an Automated Biomanufacturing Laboratory
   Marilene Pavan
6 Computer-Aided Design and Pre-validation of Large Batches of DNA Assemblies
   Valentin Zulkower
7 Computer-Aided Planning for the Verification of Large Batches of DNA Constructs
   Valentin Zulkower
8 Characterizing Genetic Parts and Devices Using RNA Sequencing
   Deepti Vipin, Zoya Ignatova, and Thomas E. Gorochowski
9 Steady-State Cell-Free Gene Expression with Microfluidic Chemostats
   Nadanai Laohakunakorn, Barbora Lavickova, Zoe Swank, Julie Laurent, and Sebastian J. Maerkl
10 A Microfluidic/Microscopy-Based Platform for on-Chip Controlled Gene Expression in Mammalian Cells
   Mahmoud Khazim, Elisa Pedone, Lorena Postiglione, Diego di Bernardo, and Lucia Marucci
11 Optimal Experimental Design for Systems and Synthetic Biology Using AMIGO2
   Eva Balsa-Canto, Lucia Bandiera, and Filippo Menolascina
12 A Cyber-Physical Platform for Model Calibration
   Lucia Bandiera, David Gomez-Cabeza, Eva Balsa-Canto, and Filippo Menolascina
13 Prediction of Cellular Burden with Host–Circuit Models
   Evangelos-Marios Nikolados, Andrea Y. Weiße, and Diego A. Oyarzún
14 A Practical Step-by-Step Guide for Quantifying Retroactivity in Gene Networks
   Andras Gyorgy
15 Engineering Sensors for Gene Expression Burden
   Alice Boo and Francesca Ceroni
16 Engineering Protein-Based Parts for Genetic Devices in Mammalian Cells
   Giuliano Bonfá, Federica Cella, and Velia Siciliano

Index
Contributors

EVA BALSA-CANTO • (Bio)Process Engineering Group, IIM-CSIC (Spanish National Research Council), Vigo, Spain
LUCIA BANDIERA • School of Engineering, Institute for Bioengineering, The University of Edinburgh, Edinburgh, UK; SynthSys - Centre for Synthetic and Systems Biology, The University of Edinburgh, Edinburgh, UK
JULIO R. BANGA • BioProcess Engineering Group, IIM-CSIC, Spanish National Research Council, Vigo, Spain
YADIRA BOADA • Synthetic Biology and Biosystems Control Lab, I.U. de Automática e Informática Industrial (ai2), Universitat Politècnica de Valencia, Valencia, Spain; Centro Universitario EDEM, Escuela de Empresarios, La Marina de València, Valencia, Spain
GIULIANO BONFÁ • Istituto Italiano di Tecnologia, Largo Barsanti e Matteucci, Naples, Italy
ALICE BOO • Department of Bioengineering, Imperial College London, London, UK; Imperial College Centre for Synthetic Biology, Imperial College London, London, UK
FEDERICA CELLA • Istituto Italiano di Tecnologia, Largo Barsanti e Matteucci, Naples, Italy; University of Genoa, Genoa, Italy
FRANCESCA CERONI • Imperial College Centre for Synthetic Biology, Imperial College London, London, UK; Department of Chemical Engineering, Imperial College London, London, UK
MADALENA CHAVES • Université Côte d’Azur, Inria, INRAE, CNRS, Sorbonne Université, Biocore Team, Sophia Antipolis, France
HIDDE DE JONG • Université Grenoble Alpes, Inria, Grenoble, Montbonnot, Saint Ismier Cedex, France
DIEGO DI BERNARDO • Telethon Institute of Genetics and Medicine, Pozzuoli (NA), Italy
DAVID GOMEZ-CABEZA • School of Engineering, Institute for Bioengineering, The University of Edinburgh, Edinburgh, UK
THOMAS E. GOROCHOWSKI • School of Biological Sciences, University of Bristol, Bristol, UK
ANDRAS GYORGY • New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
ZOYA IGNATOVA • Institute for Biochemistry and Molecular Biology, Department of Chemistry, University of Hamburg, Hamburg, Germany
MAHMOUD KHAZIM • Department of Engineering Mathematics, University of Bristol, Bristol, UK; School of Cellular and Molecular Medicine, University of Bristol, Bristol, UK; BrisSynBio, Bristol, UK
NADANAI LAOHAKUNAKORN • School of Biological Sciences, Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, UK
JULIE LAURENT • Institute of Bioengineering, School of Engineering, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
BARBORA LAVICKOVA • Institute of Bioengineering, School of Engineering, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
SEBASTIAN J. MAERKL • Institute of Bioengineering, School of Engineering, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
LUCIA MARUCCI • Department of Engineering Mathematics, University of Bristol, Bristol, UK; School of Cellular and Molecular Medicine, University of Bristol, Bristol, UK; BrisSynBio, Bristol, UK
GISELLE MCCALLUM • Department of Biology, Concordia University, Montreal, QC, Canada
FILIPPO MENOLASCINA • School of Engineering, Institute for Bioengineering, The University of Edinburgh, Edinburgh, UK; SynthSys - Centre for Synthetic and Systems Biology, The University of Edinburgh, Edinburgh, UK
EVANGELOS-MARIOS NIKOLADOS • School of Biological Sciences, University of Edinburgh, Edinburgh, UK
IRENE OTERO-MURAS • BioProcess Engineering Group, IIM-CSIC, Spanish National Research Council, Vigo, Spain
DIEGO A. OYARZÚN • School of Biological Sciences, University of Edinburgh, Edinburgh, UK; School of Informatics, University of Edinburgh, Edinburgh, UK
MARILENE PAVAN • Lanzatech Inc., Skokie, IL, USA
ELISA PEDONE • Department of Engineering Mathematics, University of Bristol, Bristol, UK; School of Cellular and Molecular Medicine, University of Bristol, Bristol, UK; BrisSynBio, Bristol, UK
JESÚS PICÓ • Synthetic Biology and Biosystems Control Lab, I.U. de Automática e Informática Industrial (ai2), Universitat Politècnica de Valencia, Valencia, Spain
LORENA POSTIGLIONE • Department of Engineering Mathematics, University of Bristol, Bristol, UK; School of Cellular and Molecular Medicine, University of Bristol, Bristol, UK; BrisSynBio, Bristol, UK
LAURENT POTVIN-TROTTIER • Department of Biology, Concordia University, Montreal, QC, Canada; Center for Applied Synthetic Biology, Concordia University, Montreal, QC, Canada; Department of Physics, Concordia University, Montreal, QC, Canada
VELIA SICILIANO • Istituto Italiano di Tecnologia, Largo Barsanti e Matteucci, Naples, Italy
ZOE SWANK • Institute of Bioengineering, School of Engineering, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
ALEJANDRO VIGNONI • Synthetic Biology and Biosystems Control Lab, I.U. de Automática e Informática Industrial (ai2), Universitat Politècnica de Valencia, Valencia, Spain
DEEPTI VIPIN • Institute for Biochemistry and Molecular Biology, Department of Chemistry, University of Hamburg, Hamburg, Germany
ANDREA Y. WEIßE • School of Informatics, University of Edinburgh, Edinburgh, UK
VALENTIN ZULKOWER • Edinburgh Genome Foundry, SynthSys, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
Chapter 1

Qualitative Modeling, Analysis and Control of Synthetic Regulatory Circuits

Madalena Chaves and Hidde de Jong

Abstract

Qualitative modeling approaches are promising and still underexploited tools for the analysis and design of synthetic circuits. They can make predictions of circuit behavior in the absence of precise, quantitative information. Moreover, they provide direct insight into the relation between the feedback structure and the dynamical properties of a network. We review qualitative modeling approaches by focusing on two specific formalisms, Boolean networks and piecewise-linear differential equations, and illustrate their application by means of three well-known synthetic circuits. We describe various methods for the analysis of state transition graphs, discrete representations of the network dynamics that are generated in both modeling frameworks. We also briefly present the problem of controlling synthetic circuits, an emerging topic that could profit from the capacity of qualitative modeling approaches to rapidly scan a space of design alternatives.

Key words Qualitative modeling, Gene regulatory networks, Synthetic circuits, Boolean models, Piecewise-linear differential equation models, Network control
1 Introduction

Over the past decade, the construction of synthetic circuits in living cells has been facilitated by the development of increasingly powerful techniques in molecular biology, from DNA synthesis to parts libraries to genome editing [1–4]. Moreover, bioinformatics tools supporting the in silico design of plasmids and genomes have become standard in every laboratory. Notwithstanding these technological advances, the biological implementation of synthetic circuits remains a highly challenging task, because the interactions between the circuit elements, and between the circuit and the cellular chassis, may have unforeseen dynamic consequences [5]. The difficulty of predicting and understanding circuit dynamics becomes even more pressing as the size and scope of synthetic circuits increase, as has been the case in recent years.
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_1, © Springer Science+Business Media, LLC, part of Springer Nature 2021
Probably the most promising way to get a grasp on the relation between the structure and behavior of synthetic circuits is the use of mathematical models. The development of such models is important for both the a-priori design and the a-posteriori analysis of synthetic regulatory circuits. For example, libraries of circuit components have been used to explore the design space and find constructions that are Pareto optimal, in the sense of having the best trade-off between multiple, conflicting optimization criteria [6]. Moreover, the analysis of mathematical models of synthetic oscillators has shown that circuits combining a negative with positive feedback lead to increased stability of oscillations [7]. While mathematical modeling is a basic part of the toolbox of engineers in other disciplines, its application to (synthetic) biology encounters specific problems that cause it to be still underexploited today. One important difficulty for modeling biological systems on the molecular level is the absence of reliable in-vivo values of parameters. This difficulty is further amplified by the fact that the modeling of synthetic circuits and their interactions with the cellular environment may require a large number of equations and parameters. The size and nonlinearity of these models make quantitative estimation of the parameters from usually incomplete and noisy time-series data a real challenge [8–10]. An advantage of qualitative models is that they do not need quantitative parameter values for making predictions about the network dynamics. This comes at a price, of course, namely that the predictions are less precise. Usually, only qualitative patterns can be predicted, such as that a protein concentration increases over time instead of increasing from 1 to 2 μM. For many purposes this is sufficient though and in some cases even desirable. 
Indeed, by focusing on qualitative aspects of the system dynamics, one may gain a better insight into key structural properties of the network that give rise to a certain dynamical behavior. This might make it possible, for example, to perform a preliminary screening of possible network structures capable of displaying a certain desired property or to understand structural causes of undesired side-effects when analyzing data on circuit performance.

Qualitative models have been traditionally proposed for gene regulatory networks [11–13], since the switching dynamics of gene expression lend themselves particularly well to the approximations underlying most qualitative models. As a matter of fact, the activity levels of genes can be seen as consisting of distinct discrete states, typically ON and OFF. Moreover, switches between these discrete states can be seen as following a regulatory logic, due to the combinatorial effect of transcription factors and other regulators. In recent years, similar approximations have been shown to be useful for other types of networks as well, notably signal transduction networks [14, 15]. In the latter case, the activity levels of signaling proteins like kinases and phosphatases are assumed to correspond
to distinct discrete states and their combinatorial effect on the activity of other signaling proteins to follow a regulatory logic. Through the examples in this review, however, we will focus on gene regulatory networks. We discuss two different qualitative modeling approaches: Boolean or logical models vs. piecewise-linear differential equation models. The first approach is based on discrete models, whereas models in the second approach are continuous but have dynamics that can be analyzed in a qualitative manner via discrete abstractions, that is, discrete descriptions of an underlying continuous dynamics. While the two types of models draw upon different mathematical concepts and methods, it is quite remarkable that the discrete representations of the network dynamics that are eventually used are quite similar. In both cases, the dynamics are well described by the so-called state transition graphs, consisting of network states and transitions between these states. As a consequence, the methods for analyzing the two types of models are quite similar in practice, and are concerned with, for example, finding attractors in the state transition graph, verifying properties of paths in the state transition graphs, reducing large state transition graphs to smaller, more insightful graphs, and composing properties of the state transition graph of the entire network from the properties of the network modules. In our description of these methods, we will emphasize their applicability to both types of qualitative models. Much work in systems and synthetic biology has been concerned with analyzing the relation between the structure of synthetic circuits and their dynamic properties. A complementary question, which has started to emerge in recent years, is the control of synthetic circuits through suitably chosen inputs, so as to steer the network behavior towards a desired objective [16, 17]. 
The question of control can be understood in a wider sense as well, namely how a synthetic circuit can bring a naturally evolved network in the cell into a certain desired state. Both control questions raise issues that can be fruitfully addressed using the qualitative modeling approaches discussed in this review, and we will discuss some burgeoning approaches by means of recent examples from the literature. This chapter does not aim at covering the whole breadth of qualitative modeling approaches proposed in systems and synthetic biology, nor does it intend to provide much detail on the mathematical bases of the formalisms and methods covered here. Previous reviews of qualitative modeling approaches available in the literature have already done this (see Note 8.1). Our specific contributions are threefold. First, we focus on the use of qualitative modeling in the design and analysis of synthetic networks as compared to naturally evolved networks. Second, we provide an integrated discussion of two different modeling frameworks,
having different strengths and weaknesses, structured around a common discrete representation of the network dynamics, state transition graphs. Third, we highlight an emerging topic in the design and analysis of synthetic circuits, namely the integration of considerations of network control in the design phase.
2 Examples of Synthetic Regulatory Circuits

The modeling frameworks and analysis methods discussed in this chapter will be illustrated by means of three simple examples of synthetic regulatory circuits: the toggle switch, a synthetic oscillator, and the IRMA network (Fig. 1). The toggle switch is among the first synthetic networks described in the literature [18] and is still much used for illustrating design and control principles in synthetic biology [6, 19]. Moreover, the network motif of the toggle switch has been found to play an important role in a variety of natural processes, such as in the development of the body plan in insect embryos [20]. The toggle switch implemented in Escherichia coli consists of two genes, lacI and tetR (Fig. 1a). The two genes encode transcription factors, LacI and TetR, that mutually repress each other by binding to the promoter region of the tetR and lacI genes, respectively. A gfp reporter gene is co-transcribed with tetR. The inhibitory activity of LacI can be modulated by adding the non-metabolizable inducer molecule isopropyl β-D-1-thiogalactopyranoside (IPTG) to the medium. Similarly, repression by TetR can be released by adding anhydrotetracycline (aTc). The toggle switch can display bistable behavior, depending on parameter values, due to the presence of a positive feedback loop (see [21] for another example of a simple bistable synthetic network with positive feedback).

By contrast, negative feedback loops have been associated with oscillatory behaviors. The repressilator [22], a circuit implementing a three-gene negative feedback loop, is the first example of a synthetic oscillator. Just as toggle switches have proven useful for understanding developmental switches, repressilators and other synthetic oscillators are interesting models for analyzing temporally repetitive processes, such as the cell cycle and circadian rhythms.
The notions of positive and negative feedback, and their relation to circuit dynamics, are further developed in Note 8.2, following the analysis in [23]. In this section, we introduce another example of an oscillatory circuit, combining a negative feedback loop with positive feedback [24] (Fig. 1b). The circuit consists of two genes, lacI and glnG. The latter encodes nitrogen regulator I (NRI), a transcription regulator that is active when phosphorylated. The E. coli strain in which this circuit has been implemented was modified in such a way as to make phosphorylation of NRI constitutive, independent of the cellular (nitrogen) state. Both lacI and glnG are activated by phosphorylated NRI (NRIp). The negative feedback loop involving both genes, lacI and glnG, is thus modulated by a positive (autoregulatory) feedback loop on glnG. Network topologies combining negative with positive feedback have been argued to give rise to more robust oscillations [7].

Fig. 1 Three examples of synthetic regulatory circuits. (a) Toggle switch in E. coli. (b) Oscillator in E. coli. (c) IRMA network in S. cerevisiae. Genes (blue/green blocks) are preceded by a promoter region (red block). Gene names are in italic font, protein names in roman font. The regulation of gene expression by the proteins encoded by genes is represented by solid lines ending in a symbol indicating the type of regulation: activation of gene expression is indicated by the ▹ symbol and repression by the ⊣ symbol. External metabolites and other small molecules (IPTG, aTc, galactose), shown in orange, may affect the strength of the regulatory interactions

The final circuit used as a running example in this chapter is the IRMA network (Fig. 1c). This network has more genes than the others considered above and it has been implemented in a eukaryotic microorganism, the yeast Saccharomyces cerevisiae [25]. The IRMA network consists of five genes and includes both transcriptional regulation and protein–protein interactions. In particular, the genes CBF1, GAL4, SWI5, and ASH1 encode transcriptional regulators arranged in a negative feedback loop with a superimposed positive feedback loop. Swi5 also activates the gene GAL80, whose product Gal80 binds to and inactivates Gal4, thus giving rise to an additional negative feedback loop. The action of Gal80 on Gal4 is inhibited when the metabolite galactose is present in the medium.

In what follows, the toggle switch and the oscillator circuit will be used to explain the principles of the modeling approaches. The IRMA network, which has the most complex dynamics, will illustrate the methods developed for the analysis of large state transition graphs.
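The link between loop sign and behavior mentioned in this section can be checked mechanically: the sign of a feedback loop is the product of the signs of the interactions along it. The following is a minimal Python sketch, not taken from the chapter. The dictionary encoding, the function name `loop_sign`, and the placeholder labels `X1` to `X3` for the three repressilator genes are assumptions made for illustration.

```python
from math import prod

# Signed interactions: (regulator, target) -> +1 (activation) or -1 (repression).
# Toggle switch: LacI and TetR repress each other.
TOGGLE_SWITCH = {("LacI", "TetR"): -1, ("TetR", "LacI"): -1}

# Repressilator: three repressors in a ring. X1..X3 are placeholder labels
# (an assumption), not the gene names of the actual construct.
REPRESSILATOR = {("X1", "X2"): -1, ("X2", "X3"): -1, ("X3", "X1"): -1}


def loop_sign(edges, cycle):
    """Return the sign of a feedback loop as the product of its edge signs.

    `cycle` lists the nodes in order; the last node feeds back to the first.
    """
    pairs = zip(cycle, cycle[1:] + cycle[:1])
    return prod(edges[pair] for pair in pairs)


toggle_sign = loop_sign(TOGGLE_SWITCH, ["LacI", "TetR"])            # +1
repressilator_sign = loop_sign(REPRESSILATOR, ["X1", "X2", "X3"])   # -1

print("toggle switch loop sign:", toggle_sign)
print("repressilator loop sign:", repressilator_sign)
```

Two repressions in a row give a positive loop, consistent with the bistability of the toggle switch, whereas the odd number of repressions in the repressilator gives a negative loop, consistent with oscillations.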
3 Boolean Models

Boolean models are a simplified but intuitive modeling framework, using discrete variables and translating the topology of the graph of interactions into a set of logical operations. To apply this framework to biological networks, the concentration of each biological component (protein, mRNA) is represented by 0 or 1, depending on whether it is weakly or highly expressed, and its activity is represented by a logical rule describing the combination of interactions influencing that component. The concentrations are defined in continuous time, but their value is allowed to change only at a discrete set of time instants. The time units are arbitrary and only sequences of states have biological meaning. The general form of a Boolean model is

$$X_i^+ = f_i(X_1, \ldots, X_n), \tag{1}$$

where $n$ is the number of variables, $X_i \in \{0, 1\}$ denotes the Boolean concentration of variable $i$ at the current time, and $X_i^+$ is the value at the next evaluation time, to be computed from the current values by applying the logical rule $f_i : \{0, 1\}^n \to \{0, 1\}$. The function $f_i$ may depend only on a subset of the variables. The discrete concentrations evolve according to an updating schedule, which defines the order in which the variables change to their next values. Common schedules include synchronous updates, where the rules for all components are simultaneously applied, and asynchronous updates, where at most one rule is applied at each time instant (see Note 8.3). This concise but abstract formalism becomes quite useful when studying complex biological networks, where the structure and topology of interactions are well known but few quantitative details are available; for instance, whenever the concentrations of most of
the components involved are not measured, and the reaction rates and other parameters are unknown. In such cases, a Boolean model of the network provides a global qualitative view of the dynamical behavior of the network, using all the available information on the network, but without introducing unknown parameters.

The first application of Boolean models to biological networks was suggested by Stuart Kauffman in 1969 [11] and by René Thomas around the same time [13]. Both used the Boolean representation to describe genetic regulatory networks, where events such as mRNA transcription and protein translation may be thought of as being “turned on” or “turned off” (1 or 0). The last 20 years have witnessed an increasing availability of genomic and proteomic data, the discovery of new biological molecules and pathways, and the multiplication of interactions among biological components. Nevertheless, it is still difficult to obtain detailed parameter sets to characterize each biological reaction or interaction.

On the mathematical side, several methods have been proposed to better characterize Boolean models and introduce quantitative elements: probabilistic and stochastic approaches [26, 27], complex updating schedules [28–30], model reduction [31, 32], attractor computation [33–35], characterization of state transition graphs [36], network interconnections [37], and control methods [38–40]. All these advances sparked a new wave of interest in Boolean models for application to a wide range of biological networks, from the cellular division cycle in various organisms [41–43], to signal transduction networks [15, 44], cancer-related networks [45], or pattern formation [46, 47]. A large collection of recent examples can be found in a special issue of the journal Frontiers in Physiology [48].

In addition to Boolean models, there are several approaches using discrete and logical functions to describe biological networks.
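As a brief aside, the general form of Eq. 1 and the synchronous and asynchronous updating schedules can be made concrete with a short Python sketch. This is an illustration under stated assumptions rather than the authors' implementation: the two-variable network with mutual repression (the toggle switch motif without inputs) is chosen as an example, and the names `RULES`, `sync_successor`, and `async_successors` are hypothetical.

```python
from itertools import product

# Logical rules f_i of an illustrative two-variable network with mutual
# repression (assumed example): L+ = NOT T, T+ = NOT L.
RULES = {
    "L": lambda s: not s["T"],
    "T": lambda s: not s["L"],
}
NAMES = list(RULES)


def sync_successor(state):
    """Synchronous schedule: apply all rules simultaneously (Eq. 1)."""
    s = dict(zip(NAMES, state))
    return tuple(int(RULES[name](s)) for name in NAMES)


def async_successors(state):
    """Asynchronous schedule: at most one variable changes per step."""
    target = sync_successor(state)
    successors = set()
    for i, (current, nxt) in enumerate(zip(state, target)):
        if current != nxt:
            successors.add(state[:i] + (nxt,) + state[i + 1:])
    # A state in which no rule calls for a change is its own successor.
    return successors or {state}


states = list(product((0, 1), repeat=len(NAMES)))
sync_graph = {s: sync_successor(s) for s in states}
async_graph = {s: async_successors(s) for s in states}
# States left unchanged by every rule, i.e. steady states of the model.
fixed_points = [s for s in states if sync_successor(s) == s]

for s in states:
    print(s, "sync ->", sync_graph[s], "| async ->", sorted(async_graph[s]))
print("fixed points:", fixed_points)
```

The two dictionaries are the state transition graphs under each schedule. The asynchronous version returns a set because a state may have several possible successors, which is why the choice of updating schedule changes the structure of the graph.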
Further work by René Thomas and collaborators extends Boolean models in several ways [23], such as the inclusion of multiple discrete levels, by assigning parameters to the transition graph edges to indicate different concentration thresholds. Among other formal methods, Petri nets have been successfully applied to model biological systems [49]. A Petri net is defined through a graph with two types of nodes (places and transitions), connected by weighted directed edges. Places may be marked by a number of tokens that enable transitions. Petri nets are especially suitable to model biochemical and metabolic networks, as the incidence matrix of the net reflects the stoichiometry matrix [50].

3.1 The Toggle Switch
As a first example, consider the toggle switch, a network with two components L and T (for LacI and TetR protein expression, respectively), and two inputs A and I (for aTc and IPTG concentration, respectively). Both variables and inputs take values 0 or 1. To write
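[Figure 2 panels, reconstructed from the extracted text. The transition graphs of panels b and d connect the states 00, 01, 10, and 11 as listed in the tables; the images are not reproduced.]

(a) Synchronous schedule
  L T    L+ T+
  00     11
  01     01
  10     10
  11     00

(c) Asynchronous schedule
  L T    L+ T+
  00     10, 01
  01     01
  10     10
  11     10, 01

Fig. 2 Toggle switch without inputs ($I = A = 0$). (a and b) Synchronous updating schedule and corresponding transition graph. (c and d) Asynchronous updating schedule and corresponding transition graph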
the logical rule for L activity, we translate the interactions into logical operations: protein LacI is produced only when its repressor TetR is not present or when input aTc is present, since aTc prevents inhibition by TetR. This can be written as $f_L(T, A) = \neg T \vee A$. A similar argument gives the logical rule for TetR: $f_T(L, I) = \neg L \vee I$. The Boolean model of the toggle switch is then:

$$L^{+} = f_L(T, A) = \neg T \vee A, \tag{2}$$

$$T^{+} = f_T(L, I) = \neg L \vee I, \tag{3}$$

where both $A$ and $I$ are constant inputs. A constant variable may be defined by the rule $A^{+} = A$, but here we will simply analyze separately the cases $A = 0$ or $A = 1$, and $I = 0$ or $I = 1$. From the logical rules, we can construct the synchronous truth table for the model, containing the successor of each state (Fig. 2a). The overall behavior of the system can also be represented in terms of a state transition graph, where each state is connected to its successor by an arrow (Fig. 2b). The states $LT$ that satisfy $L = f_L(T, A)$ and $T = f_T(L, I)$ are called point attractors and correspond to steady states of the system (see Subheadings 5.1 and 5.2 for more details).

The structure of the state transition graph depends on the updating schedule. A short review of common schedules is given in Note 8.3 below. With the synchronous updating schedule, all variables are simultaneously updated, so there is exactly one successor for each state. In the corresponding state transition graph, there are two point attractors, 01 and 10, representing two steady states: either LacI is at high concentration, thus inhibiting TetR, which must be at low concentration (state 10), or the opposite. There is also a cyclic attractor, composed of the two states {00, 11}. This cycle does not have any valid biological interpretation, as it requires that both LacI and TetR change concentrations simultaneously; it appears as an artifact of the synchronous updating schedule. To overcome this problem, a common approach is to add the hypothesis that exactly one variable may change its value at each updating
step, thus yielding an asynchronous schedule, where each state can have up to $n$ successors. Under this hypothesis, all point attractors remain unchanged, but cyclic attractors are often resolved into more realistic state trajectories. Applying this hypothesis to the toggle switch yields the truth table in Fig. 2c and the state transition graph in Fig. 2d, where the states 00 and 11 can each move to either of the two attractors. We conclude that, independently of its initial state and without inputs, the toggle switch will eventually converge to a state where only one of the proteins is strongly expressed. For other combinations of the inputs $A$ and $I$, it is easy to predict their effect: a constant $A = 1$ (with $I = 0$) implies $L^{+} = L = 1$, which eventually sets $T^{+} = T = 0$, hence leading to the state 10. Conversely, a constant $I = 1$ (with $A = 0$) eventually leads to the state 01. Setting both inputs to 1 leads to the state 11, where both proteins are strongly expressed since no inhibition remains.

3.2 The Oscillator with Positive Feedback
This network is also composed of two genes, lacI and glnG, encoding two proteins, LacI and NRI, both regulated by the phosphorylated form of the transcription factor NRI. The protein LacI represses transcription of glnG and, in turn, the input IPTG lifts LacI repression. In general, the phosphorylated transcription factor NRI will activate genes lacI and glnG at different concentrations or activity thresholds, that is, whenever protein NRI is above a first threshold concentration $\theta_N^1$, transcription of glnG is activated, and when NRI becomes higher than a second threshold concentration $\theta_N^2$, it induces activity of lacI. The experimental system [51] implies that $\theta_N^1 < \theta_N^2$. These distinct thresholds of activation for NRI require a variable with at least three discrete concentration levels, while Boolean variables have only two levels. To resolve this problem, a generalized logical model would consider a multi-leveled variable $N$ to describe the concentration of protein NRI (as in the corresponding PLDE model, Subheading 4.2). Alternatively, Boolean models can also be extended as suggested in [52], by creating two different Boolean variables, $N_1$ and $N_2$, to represent $N$ as follows:

$$N_1 = \begin{cases} 0, & N < \theta_N^1, \\ 1, & N > \theta_N^1, \end{cases} \qquad N_2 = \begin{cases} 0, & N < \theta_N^2, \\ 1, & N > \theta_N^2. \end{cases}$$

These two variables will evolve according to different Boolean rules, but should always satisfy $N_1 \geq N_2$, by definition of the thresholds. More specifically, if $N$ is a logical variable with three levels {0, 1, 2}, then $N_1$ and $N_2$ allow us to code for those three levels in a Boolean notation, that is, "0 = 00," "1 = 10," and "2 = 11," so that the higher concentration of $N$ corresponds to both $N_1$ and $N_2$ ON, while the intermediate concentration of $N$ corresponds to $N_1$ ON and $N_2$ OFF. Note that the Boolean state $(N_1, N_2) = (0, 1)$ does
not encode for any level of variable $N$ and does not take part in the state transition graph of the Boolean model. Therefore, three variables will be considered: $L$ for LacI and $N_1$, $N_2$ for NRI protein expression. The input IPTG is denoted $I$. To assign the rules for variables $N_1$ and $N_2$, we will consider that NRI transcription is activated in a first stage by the positive feedback loop and that in a second stage LacI repression comes into play. Thus $N_1$ is regulated by $N_2$ only, while $N_2$ is regulated both by $N_1$ and $L$. The Boolean rules for the oscillator with positive feedback become:

$$L^{+} = N_2, \tag{4}$$

$$N_1^{+} = N_2, \tag{5}$$

$$N_2^{+} = (\neg L \vee I) \wedge N_1. \tag{6}$$
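As a minimal sketch of how Eqs. 4–6 translate into successor tables, the following Python fragment evaluates the rules under both updating schedules for a constant input. The function names (`rules`, `async_successors`, `label`) are illustrative choices, not part of the chapter, and the asynchronous schedule is assumed to flip exactly one variable that disagrees with its rule.

```python
from itertools import product


def rules(state, I=0):
    """Synchronous image of a state (L, N1, N2) under Eqs. 4-6."""
    L, N1, N2 = state
    return (N2, N2, int(((not L) or I) and N1))


def async_successors(state, I=0):
    """Successors obtained by updating exactly one variable at a time."""
    target = rules(state, I)
    succ = [state[:i] + (target[i],) + state[i + 1:]
            for i in range(3) if target[i] != state[i]]
    # A state in which no variable is called to change is a point attractor.
    return succ or [state]


def label(state):
    """Write a state as a string such as '010'."""
    return "".join(map(str, state))


if __name__ == "__main__":
    for s in product((0, 1), repeat=3):
        asyn = ", ".join(label(t) for t in async_successors(s))
        print(f"{label(s)}  synchronous: {label(rules(s))}  asynchronous: {asyn}")
```

Running the loop for $I = 0$ lists, for each state, its single synchronous successor and its one to three asynchronous successors, which can be compared line by line with a truth table drawn by hand.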
The input $I = 1$ induces the expression of NRI, followed by its phosphorylation, and subsequent expression of LacI, so the system converges to state 111. In the case $I = 0$, the synchronous and asynchronous updating schedules lead to quite different state transition graphs, but both contain only one attractor, consisting of the origin with all proteins weakly expressed. In the synchronous case, however, the transition $010 \to 001$ is an artifact of the simultaneous updating of $N_1$ and $N_2$. This problem is resolved in the asynchronous state transition graph (Fig. 3b), where the states 001 and 101 are transient and do
[Figure 3 panels, reconstructed from the extracted text. Panels b and c show graphs over the eight states below, with components C1 to C5 marked in panel c; the images are not reproduced.]

(a) Truth table
  L N1 N2    L+ N1+ N2+ (synchronous)    L+ N1+ N2+ (asynchronous)
  000        000                         000
  001        110                         101, 011, 000
  010        001                         000, 011
  011        111                         111
  100        000                         000
  101        110                         111, 100
  110        000                         010, 100
  111        110                         110

Fig. 3 Oscillator with positive feedback and zero input ($I = 0$). (a) Truth table for synchronous and asynchronous updating schedules. (b) Asynchronous state transition graph. (c) Hierarchical state transition graph after decomposition into strongly connected components (see Subheading 5.2 for the corresponding analysis and definition of the components $C_i$)
not have any incoming arrows from other states. In this graph, the effect of the negative feedback loop between LacI and NRI can be observed in the cyclic orbit which is reached whenever NRI is above its intermediate threshold concentration ($N_1 = 1$): $111 \to 110 \to 010 \to 011 \to 111$. However, this cyclic orbit is not an attractor itself and the Boolean model predicts that all trajectories eventually converge to the point attractor formed by the origin (see the transitions from the states 010 and 110 to 000). In this example, the global behavior of the Boolean model differs from that of the corresponding PLDE model in Subheading 4.2, even though both models have the same point attractor at the origin and the cyclic orbit of the Boolean model corresponds exactly to the orbit depicted in Fig. 6b. However, in the PLDE model, the cyclic orbit is also an attractor, and there are trajectories converging either to the origin or to a (damped) periodic orbit depending on the initial conditions. In this case, the PLDE model allows for a more detailed description of the continuous state space, as discussed below (Subheading 4.2).

3.3 The IRMA Circuit
This circuit is composed of five genes encoding five proteins, Ash1, Cbf1, Gal4, Gal80, and Swi5, and one input $G$ (galactose). One of the proteins (Swi5) is a transcription factor for three of the genes. In this circuit, the different activity thresholds of Swi5 relative to the three genes will play an important role in determining the dynamical properties of the system. These thresholds define the (distinct) concentrations of Swi5 which trigger the transcription of the three genes. If $S$ denotes the (continuous) concentration of protein Swi5, then transcription of gene ASH1 is activated whenever $S > \theta_S^a$. Similarly, transcription of genes CBF1 and GAL80 is initiated when $S > \theta_S^c$ and $S > \theta_S^g$, respectively. From the analysis in [51], the activity threshold for CBF1 should be the lowest, and in this section we will consider that $\theta_S^c < \theta_S^g < \theta_S^a$. These distinct thresholds for $S$ require a logical variable with at least four discrete concentration levels so, as in the oscillator example, an extended Boolean model will be constructed [52], by creating three different Boolean variables to represent $S$ as follows:

$$S_c = \begin{cases} 0, & S < \theta_S^c, \\ 1, & S > \theta_S^c, \end{cases} \qquad S_g = \begin{cases} 0, & S < \theta_S^g, \\ 1, & S > \theta_S^g, \end{cases} \qquad S_a = \begin{cases} 0, & S < \theta_S^a, \\ 1, & S > \theta_S^a. \end{cases}$$

These three variables will evolve according to different Boolean rules, but should always satisfy $S_c \geq S_g \geq S_a$ since, by definition of the thresholds, $S_a = 1$ implies $S_g = 1$, which implies $S_c = 1$. In particular, this signifies that the Boolean states where $S_c < S_g$ or $S_g < S_a$ do not have biological meaning for the IRMA circuit and are not a part of the state transition graph.
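The threshold encoding described above can be sketched in a few lines of Python. The helper names `encode` and `is_admissible` are hypothetical, and the discrete level is assumed to count how many thresholds the concentration exceeds (0 to 3 for Swi5, with thresholds listed in increasing order).

```python
from itertools import product


def encode(level, n_thresholds):
    """Boolean code of a discrete level: bit k is 1 iff the level exceeds threshold k."""
    return tuple(int(level > k) for k in range(n_thresholds))


def is_admissible(bits):
    """Admissible codes are non-increasing, e.g. S_c >= S_g >= S_a."""
    return all(a >= b for a, b in zip(bits, bits[1:]))


if __name__ == "__main__":
    # The four Swi5 levels, written as (S_c, S_g, S_a).
    for level in range(4):
        print(level, encode(level, 3))
    # Codes that violate the ordering carry no biological meaning.
    forbidden = [b for b in product((0, 1), repeat=3) if not is_admissible(b)]
    print("forbidden codes:", forbidden)
```

With three thresholds, only four of the eight Boolean combinations are admissible; the same helper with two thresholds gives the codes 00, 10, and 11 used for NRI in the oscillator example.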
The IRMA circuit can now be translated into Boolean rules as follows, with $A$, $C$, $G4$, and $G80$ denoting the concentrations of Ash1, Cbf1, Gal4, and Gal80, respectively:

$$G4^{+} = C, \tag{7}$$

$$A^{+} = S_a, \tag{8}$$

$$C^{+} = S_c \wedge \neg A, \tag{9}$$

$$G80^{+} = S_g, \tag{10}$$

$$S_c^{+} = G4 \vee S_g, \tag{11}$$

$$S_g^{+} = ((\neg G80 \vee G) \wedge S_c) \vee S_a, \tag{12}$$

$$S_a^{+} = G4 \wedge S_g, \tag{13}$$
where the rules for the three Swi5 variables indicate that, first, Swi5 is activated by protein Gal4 and then, in a second step, the inhibition by Gal80 comes into play. If Swi5 is already above its second threshold (with $S_g = 1$), then continued activation by Gal4 further increases Swi5 production. The dependence of the Swi5 variables on each other guarantees that the "forbidden" states (i.e., those satisfying $S_c < S_g$ or $S_g < S_a$) do not enter into the dynamics: $S_c$ (respectively, $S_g$) should be in the ON state whenever $S_g$ is (respectively, $S_a$); conversely, $S_a$ cannot become ON unless $S_g$ is.

The IRMA Boolean model has 64 states; the state transition graph corresponding to the asynchronous updating schedule with $G = 1$ is shown in Fig. 4. In the absence of galactose ($G = 0$), there is only one attractor, the origin, corresponding to the state in which all proteins are weakly expressed. In the presence of galactose ($G = 1$), the inhibition of SWI5 by Gal80 is inactive, and the system has two attractors: the origin and a cyclic attractor with eight states,

$$0001110 \to 0011110 \to 1011110 \to 1011111 \to 1111111 \to 1101111 \to 0101111 \to 0101110 \to 0001110, \tag{14}$$

where each state lists the variables in the order of Eqs. 7–13, that is, $G4$, $A$, $C$, $G80$, $S_c$, $S_g$, $S_a$. In this cyclic attractor, Swi5 is always expressed above its second threshold ($S_c = S_g = 1$), hence Gal80 is also always expressed. The sequence of activations is then Cbf1 ON, Gal4 ON, $S_a$ ON, Ash1 ON, followed by their repression in the same order.

The IRMA circuit was implemented in yeast and experiments report the response to input $G$ (see Figure 3 in [25]): once galactose is added ($G = 1$), the transcription of gene SWI5 is "switched on," and all proteins rapidly increase their concentration before going back to a (possibly new) steady state, possibly showing an oscillatory behavior. Conversely, once galactose is removed ($G = 0$), all proteins are observed to decrease to a weakly expressed state.
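The cyclic attractor of Eq. 14 can be checked mechanically. The sketch below encodes Eqs. 7–13 and verifies that, for $G = 1$, every state of the cycle has the next state of the cycle as its only asynchronous successor. It assumes that state strings list the variables in the order of the equations (G4, A, C, G80, Sc, Sg, Sa); the function and constant names are illustrative only.

```python
def irma_rules(state, G):
    """Synchronous image of (G4, A, C, G80, Sc, Sg, Sa) under Eqs. 7-13."""
    G4, A, C, G80, Sc, Sg, Sa = state
    return (
        C,                             # Eq. 7
        Sa,                            # Eq. 8
        Sc & (1 - A),                  # Eq. 9
        Sg,                            # Eq. 10
        G4 | Sg,                       # Eq. 11
        (((1 - G80) | G) & Sc) | Sa,   # Eq. 12
        G4 & Sg,                       # Eq. 13
    )


def async_successors(state, G):
    """Successors obtained by updating exactly one variable at a time."""
    target = irma_rules(state, G)
    succ = [state[:i] + (target[i],) + state[i + 1:]
            for i in range(len(state)) if target[i] != state[i]]
    return succ or [state]  # no pending change: point attractor


def parse(text):
    """Turn a string such as '0001110' into a tuple of bits."""
    return tuple(int(ch) for ch in text)


CYCLE = [parse(s) for s in ("0001110", "0011110", "1011110", "1011111",
                            "1111111", "1101111", "0101111", "0101110")]
ORIGIN = (0,) * 7

if __name__ == "__main__":
    for cur, nxt in zip(CYCLE, CYCLE[1:] + CYCLE[:1]):
        print(cur, "->", async_successors(cur, G=1), "expected", [nxt])
    print("origin is a fixed point:", async_successors(ORIGIN, G=1) == [ORIGIN])
```

Because each state of the cycle has a single successor, the eight states form a set without outgoing transitions. Calling `async_successors(CYCLE[0], G=0)` instead returns two successors, one of which leaves the cycle by switching $S_g$ off.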
Fig. 4 State transition graph of the IRMA model, for the case $G = 1$. The yellow nodes represent the two attractors. This graph was constructed in the software platform Cytoscape [53]
The Boolean model is consistent with these experiments: in the absence of galactose the only attractor is the origin, while in the presence of galactose the proteins generally increase their concentrations and enter into a cyclic attractor, which may correspond to sustained oscillations (see Subheading 4 below), or damped oscillations and convergence to a new steady state. Even in the presence of galactose, the origin remains an attractor so another possible behavior is that, after a rapid onset, the system returns to a state where all proteins are weakly expressed.
4 Piecewise-Linear Differential Equation Models

Piecewise-linear differential equation (PLDE) models retain the logical functions that occur in the Boolean models discussed above, but embed them in a system of differential equations. This gives rise to the following general definition of the dynamics of the concentration of the product of gene $i$ (typically a protein) [12, 54, 55]:

$$\dot{x}_i = \sum_{l \in L_i} \kappa_i^l \, b_i^l(x) - \gamma_i x_i, \qquad 1 \leq i \leq n, \tag{15}$$

where $x \in \Omega$ denotes a vector of $n$ protein concentrations, and $\Omega$ a subset of $\mathbb{R}^n_{\geq 0}$. The synthesis rate is composed of a sum of positive synthesis constants $\kappa_i^l$, each modulated by a regulation function $b_i^l(x) : \Omega \to \{0, 1\}$, with $l$ in an index set $L_i$. A regulation function is an algebraic expression of step functions $s^{+}(x_j, \theta_j)$ or $s^{-}(x_j, \theta_j)$ which formalizes the regulatory logic of gene expression,
analogously to the Boolean functions in Subheading 3. $\theta_j$ is a so-called threshold for the concentration $x_j$, such that $s^{+}(x_j, \theta_j)$ evaluates to 1 if $x_j > \theta_j$, and to 0 if $x_j < \theta_j$, while $s^{-}(x_j, \theta_j) = 1 - s^{+}(x_j, \theta_j)$. The step functions thus capture the switch-like character of gene regulation. The degradation of a gene product has first-order kinetics, with a positive degradation constant $\gamma_i$.

For any value of $x$, the functions $b_i^l(x)$ evaluate to either 0 or 1. It can be shown that, when subdividing $\Omega$ into hyper-rectangular regions by the threshold planes, $\sum_{l \in L_i} \kappa_i^l \, b_i^l(x)$ is constant in every region. In other words, every region is associated with a system of $n$ decoupled, linear equations, thus making Eq. 15 a piecewise-linear model. As a consequence, the local dynamics in the regions is straightforward to analyze, in the sense that all solutions monotonically converge to the steady state of the local linear system [12]. However, before reaching this state, the solutions may leave the region in which the linear system is defined and enter another. When piecing together the local dynamics in all regions, the possibly complex global dynamics of the network can be reconstructed.

The solutions of the PLDE models are not well-defined on the threshold planes, where the step functions switching from 0 to 1 (or from 1 to 0) may cause a discontinuity in the right-hand side of Eq. 15 for one or several $i$. Several ways to resolve this complication have been proposed in the literature [51, 56–60], a subject briefly summarized in Note 8.4.

The fact that in every region the system reduces to a simple linear model suggests an intuitive, abstract description of the dynamics of a regulatory circuit. Since in each region the system behaves in a qualitatively uniform manner, the region can be associated with a qualitative state, and the existence of solutions entering one region from another with transitions between qualitative states.
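To make the local dynamics concrete, the linear system in a single region can be solved in closed form. The symbols $\mu_i^D$ (the constant synthesis rate in a region $D$) and $\phi_i^D$ (the steady state of the local linear system) are notation introduced here for illustration only:

```latex
\begin{aligned}
\dot{x}_i &= \mu_i^D - \gamma_i x_i,
  \qquad \mu_i^D = \sum_{l \in L_i} \kappa_i^l \, b_i^l(x) \ \text{for } x \in D, \\
x_i(t) &= \phi_i^D + \bigl(x_i(0) - \phi_i^D\bigr) e^{-\gamma_i t},
  \qquad \phi_i^D = \mu_i^D / \gamma_i, \\
t_i^{*} &= \frac{1}{\gamma_i} \ln \frac{\phi_i^D - x_i(0)}{\phi_i^D - \theta_i}
  \qquad \text{if } \theta_i \text{ lies strictly between } x_i(0) \text{ and } \phi_i^D .
\end{aligned}
```

Each coordinate thus moves monotonically toward $\phi_i^D$, and $t_i^{*}$ is the time at which it would reach a threshold $\theta_i$ located on its way. Provided the solution is still in $D$, the smallest such time over all coordinates indicates which threshold plane is crossed first.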
The sets of states and transitions between states form a graph, the state transition graph [12]. The properties of this graph, such as the occurrence of attractor states or cycles, can be related to dynamical properties of the underlying PLDE model, such as (stable or unstable) steady states and limit cycles [57, 61–64]. The construction of a state transition graph from a PLDE model follows simple rules, which do not need exact values for the parameters but exploit qualitative orderings of parameters [55]. The relation between the PLDE model and its state transition graph can be formally grounded in discrete abstractions developed in hybrid systems theory [65]. The state transition graphs thus obtained are closely related to the graphs describing the dynamics of multi-level logical models [66].

PLDE models of gene regulatory networks in the form of Eq. 15 were first proposed by Leon Glass and Stuart Kauffman [12] and are also known as Glass networks. They have proven to be powerful tools for exploring the possible dynamics of regulatory circuits, such as the onset of chaotic dynamics [67, 68]. However,
they have also been used for modeling actual regulatory networks, for example, in microbiology [69–71]. Computer tools allowing the definition of PLDE models of regulatory networks and their qualitative analysis are available, such as Genetic Network Analyzer (GNA) [72, 73]. Recent publications present the (qualitative) analysis of more general classes of PLDE models [74], while other work presents the related formalism of hybrid automata and their application to circuit modeling [75].

4.1 The Toggle Switch
A PLDE model for the toggle switch can be developed, analogously to the Boolean model in Subheading 3.1. We again use the variables $L$ (LacI), $T$ (TetR), $I$ (IPTG), and $A$ (aTc), but now treat them as concentrations taking their values in $\mathbb{R}_{\geq 0}$. Similarly, we introduce for each of the variables a concentration threshold, labeled $\theta_L$, $\theta_T$, $\theta_I$, and $\theta_A$, respectively. With these definitions, the step function $s^{+}(L, \theta_L)$ evaluates to 1 if $L$ is present at a high concentration, above its threshold $\theta_L$, and to 0 if $L$ is present at a low concentration, below its threshold. As in the Boolean model, we would like to express that the gene encoding TetR is expressed when the concentration of $L$ is low or that of $I$ high, in other words $\neg s^{+}(L, \theta_L) \vee s^{+}(I, \theta_I)$. An equivalent formulation is obtained using de Morgan's law: $\neg(s^{+}(L, \theta_L) \wedge \neg s^{+}(I, \theta_I)) = \neg(s^{+}(L, \theta_L) \wedge s^{-}(I, \theta_I))$, which can be interpreted as saying that TetR is not expressed when LacI is present at a high concentration and not inhibited due to the presence of IPTG. The latter expression can be rewritten in algebraic form as $(1 - s^{+}(L, \theta_L) \, s^{-}(I, \theta_I))$. Similarly, the regulation of LacI by TetR and aTc gives rise to the step function expression $(1 - s^{+}(T, \theta_T) \, s^{-}(A, \theta_A))$. Boolean expressions of step functions can always be translated into equivalent algebraic expressions [54]. With the above considerations, the model for the toggle switch reads as

$$\dot{L} = \kappa_L \bigl(1 - s^{+}(T, \theta_T) \, s^{-}(A, \theta_A)\bigr) - \gamma_L L, \tag{16}$$

$$\dot{T} = \kappa_T \bigl(1 - s^{+}(L, \theta_L) \, s^{-}(I, \theta_I)\bigr) - \gamma_T T, \tag{17}$$
where $I$ and $A$ are considered constant inputs. The dynamics of this model can be analyzed in the plane, where we assume for the time being that IPTG and aTc are absent from the medium, that is, $I = A = 0$ and therefore $s^{-}(I, \theta_I) = s^{-}(A, \theta_A) = 1$. The thresholds for $T$ and $L$ divide the phase space into four regions (Fig. 5a), in each of which the model of Eqs. 16–17 reduces to a simple linear model. For example, in the region S1, defined by the inequalities $0 \leq L < \theta_L$ and $0 \leq T < \theta_T$, we have $\dot{L} = \kappa_L - \gamma_L L$ and $\dot{T} = \kappa_T - \gamma_T T$. In this region all solution trajectories (monotonically) converge to the asymptotically stable steady state of the linear system given by $(\kappa_L/\gamma_L, \kappa_T/\gamma_T)'$. This so-called focal point is here assumed to lie outside S1, in the region S3, which amounts to assuming that $\kappa_L/\gamma_L > \theta_L$ and $\kappa_T/\gamma_T > \theta_T$ (Fig. 5a). In other
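[Figure 5: (a) phase plane with axes $L$ and $T$, thresholds $\theta_L$ and $\theta_T$, focal values $\kappa_L/\gamma_L$ and $\kappa_T/\gamma_T$, and regions S1 to S4; (b) state transition graph with states S1 (00), S2 (10), S3 (11), and S4 (01). Images not reproduced.]

Fig. 5 PLDE model of toggle switch in the absence of inputs ($I = A = 0$). (a) Phase plane analysis. Some example solutions are shown (solid curves). (b) State transition graph. The names of the states correspond to the names of the regions and the states have been labeled with the values of $s^{+}(L, \theta_L)$ and $s^{+}(T, \theta_T)$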
words, the solution trajectories starting in S1 will leave the region after some (finite) time and enter S2 or S4. Similarly, in region S2, where $L > \theta_L$ and $0 \leq T < \theta_T$, the model becomes $\dot{L} = \kappa_L - \gamma_L L$ and $\dot{T} = -\gamma_T T$, and the solutions converge towards the focal state $(\kappa_L/\gamma_L, 0)'$. Since this focal state is included in S2, the solutions in S2 will never leave the region, and $(\kappa_L/\gamma_L, 0)'$ is a (stable) steady state of the system. The phase plane analysis, when carried out in all four regions, indicates that the system has two stable steady states, in which either $L$ or $T$ is above its threshold (and the other variable below its threshold).

When now associating each region with a qualitative state of the same name, and the solutions entering one region from another with transitions between qualitative states, we obtain the state transition graph shown in Fig. 5b. For clarity, the states have been labeled with the values of $s^{+}(L, \theta_L)$ and $s^{+}(T, \theta_T)$, indicating whether the concentrations of $L$ and $T$ are above or below their thresholds in that region. Notice that for generating the graph, we did not need to specify quantitative values for the parameters, but only needed to know how $\kappa_L/\gamma_L$ and $\kappa_T/\gamma_T$ were positioned with respect to their respective thresholds, an observation that holds more generally [55, 65]. As can be seen by comparing the graph in Fig. 5b with that in Fig. 2d, the qualitative dynamics of the PLDE model of the toggle switch and the Boolean asynchronous model are equivalent. Independently of the initial state, the system will reach one of the two stable equilibria (attractors) of the regulatory circuit.

The state transition graph provides a qualitative picture of the dynamics of the network. The transitions in the graph correspond to events of qualitative importance, such as the crossing of a threshold that switches off a gene. In the transition from S1 to S2 in Fig. 5, for example, LacI exceeds its threshold concentration and starts to inhibit the expression of tetR.
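The phase plane argument can be illustrated numerically. The sketch below integrates Eqs. 16–17 with a simple forward Euler scheme. The parameter values in `PARAMS` are arbitrary illustrative numbers chosen only to satisfy $\kappa_L/\gamma_L > \theta_L$ and $\kappa_T/\gamma_T > \theta_T$; they are not taken from the chapter, and the value of the step function exactly at a threshold is a convention of this sketch.

```python
def step_plus(x, theta):
    """s+(x, theta): 1 above the threshold, 0 otherwise (convention at x == theta)."""
    return 1.0 if x > theta else 0.0


def toggle_rhs(L, T, p, A=0.0, I=0.0):
    """Right-hand sides of Eqs. 16-17 for constant inputs A and I."""
    s_minus_A = 1.0 - step_plus(A, p["theta_A"])
    s_minus_I = 1.0 - step_plus(I, p["theta_I"])
    dL = p["kappa_L"] * (1.0 - step_plus(T, p["theta_T"]) * s_minus_A) - p["gamma_L"] * L
    dT = p["kappa_T"] * (1.0 - step_plus(L, p["theta_L"]) * s_minus_I) - p["gamma_T"] * T
    return dL, dT


def simulate(L, T, p, dt=1e-3, t_end=20.0):
    """Forward Euler integration without inputs; returns the final (L, T)."""
    for _ in range(int(t_end / dt)):
        dL, dT = toggle_rhs(L, T, p)
        L, T = L + dt * dL, T + dt * dT
    return L, T


# Hypothetical parameter values, for illustration only.
PARAMS = dict(kappa_L=2.0, kappa_T=2.0, gamma_L=1.0, gamma_T=1.0,
              theta_L=1.0, theta_T=1.0, theta_A=1.0, theta_I=1.0)

if __name__ == "__main__":
    print(simulate(0.5, 0.2, PARAMS))  # starts in S1 with L ahead of T
    print(simulate(0.2, 0.5, PARAMS))  # starts in S1 with T ahead of L
```

With these values, a trajectory starting in S1 with $L$ ahead of $T$ crosses $\theta_L$ first, enters S2, and approaches $(\kappa_L/\gamma_L, 0)'$; the mirrored initial condition approaches $(0, \kappa_T/\gamma_T)'$. A fixed-step Euler scheme is only a rough tool near threshold planes, where the extended solution concepts of Note 8.4 apply.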
Figure 5a does not show that some solutions in regions S1 and S3 may reach the intersection of the thresholds $\theta_L$ and $\theta_T$. The vector field in the region these solutions are about to enter, for example S3 from S1, points in the opposite direction, which precludes straightforward continuation of the solutions. In order to resolve this problem, the definition of the solutions of the PLDE systems needs to be extended to the thresholds, an issue that is further developed in Note 8.4. It suffices here to say that, when doing this, the threshold intersection turns out to be another steady state, but an unstable one, unlike the two stable steady states.

4.2 The Oscillator with Positive Feedback
In developing the PLDE model of the oscillator with positive feedback (Fig. 1b), as in the Boolean model, we will not distinguish between the phosphorylated and non-phosphorylated forms of NRI, but rather build upon the fact that in the strain considered, phosphorylation of NRI is constitutive. Contrary to the Boolean model, however, we introduce only a single variable for the NRI concentration ($N$), in addition to a variable for the LacI concentration ($L$) and the input IPTG ($I$). $N$ has two different threshold concentrations, a first threshold for activation of the promoter driving NRI expression and a second threshold for the promoter driving LacI expression. These two thresholds will be referred to as $\theta_N^1$ and $\theta_N^2$, respectively. The limitation to two state variables makes it possible to display the dynamics of the model in the phase plane, which will be convenient for illustrative purposes. This results in the following PLDE model of the network:

$$\dot{L} = \kappa_L \, s^{+}(N, \theta_N^2) - \gamma_L L, \tag{18}$$

$$\dot{N} = \kappa_N \bigl(1 - s^{+}(L, \theta_L) \, s^{-}(I, \theta_I)\bigr) \, s^{+}(N, \theta_N^1) - \gamma_N N. \tag{19}$$
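A rough numerical sketch of Eqs. 18–19 is given below. It integrates the model with forward Euler and records the sequence of qualitative labels visited, where a label concatenates $s^{+}(L, \theta_L)$ with $s^{+}(N, \theta_N^1) + s^{+}(N, \theta_N^2)$. The parameter values are hypothetical, chosen only so that $\theta_N^1 < \theta_N^2 < \kappa_N/\gamma_N$ and $\theta_L < \kappa_L/\gamma_L$, and all function names are illustrative.

```python
def s_plus(x, theta):
    """s+(x, theta): 1 above the threshold, 0 otherwise (convention at x == theta)."""
    return 1 if x > theta else 0


def oscillator_rhs(L, N, p, I=0.0):
    """Right-hand sides of Eqs. 18-19 for a constant input I."""
    s_minus_I = 1 - s_plus(I, p["theta_I"])
    dL = p["kappa_L"] * s_plus(N, p["theta_N2"]) - p["gamma_L"] * L
    dN = (p["kappa_N"] * (1 - s_plus(L, p["theta_L"]) * s_minus_I)
          * s_plus(N, p["theta_N1"]) - p["gamma_N"] * N)
    return dL, dN


def label(L, N, p):
    """Qualitative label: s+(L) followed by the number of N thresholds exceeded."""
    return f"{s_plus(L, p['theta_L'])}{s_plus(N, p['theta_N1']) + s_plus(N, p['theta_N2'])}"


def simulate(L, N, p, dt=1e-3, t_end=20.0):
    """Forward Euler with I = 0; returns final (L, N) and the labels visited in order."""
    trace = [label(L, N, p)]
    for _ in range(int(t_end / dt)):
        dL, dN = oscillator_rhs(L, N, p)
        L, N = L + dt * dL, N + dt * dN
        if label(L, N, p) != trace[-1]:
            trace.append(label(L, N, p))
    return L, N, trace


# Hypothetical parameter values, for illustration only.
PARAMS = dict(kappa_L=2.0, gamma_L=1.0, theta_L=1.0, kappa_N=3.0, gamma_N=1.0,
              theta_N1=1.0, theta_N2=2.0, theta_I=1.0)

if __name__ == "__main__":
    print(simulate(0.5, 0.5, PARAMS))         # N starts below its first threshold
    print(simulate(0.0, 1.5, PARAMS)[2][:8])  # first labels when N starts above it
```

When $N$ starts below $\theta_N^1$, neither gene is expressed and both concentrations decay to zero without leaving the state labeled 00. When $N$ starts above $\theta_N^1$, the recorded labels show repeated threshold crossings; a fixed-step scheme chatters near threshold planes, so the long-run trace should be read with caution.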
The regulatory logic embedded in the equation for $N$ agrees with the details of the molecular implementation of the regulatory circuit, where for the gene to be expressed, NRI needs to be present and LacI to be absent or inactive. Moreover, the choice of promoters in the circuit guarantees that $\theta_N^1 < \theta_N^2$ [24]. Figure 6a shows the phase plane analysis of the oscillator model, under the assumption that IPTG is absent ($I = 0$) and that $\kappa_L/\gamma_L > \theta_L$ and $\kappa_N/\gamma_N > \theta_N^2$. Notice that any other choice of the parameter inequalities would be inconsistent with the implementation of the regulatory circuit, as it would imply that even when the proteins were expressed, the concentrations of NRI and LacI would never rise to a level where they can influence the expression of their target genes. Interestingly, the analysis shows that the system has the potential to generate oscillations in the regions where $N > \theta_N^1$. Below this threshold, however, the system falls back to the trivial stable steady state $(0, 0)'$. The oscillations and the steady state show
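[Figure 6: (a) phase plane with axes $L$ and $N$, thresholds $\theta_L$, $\theta_N^1$, and $\theta_N^2$, focal values $\kappa_L/\gamma_L$ and $\kappa_N/\gamma_N$, and regions S1 to S6; (b) state transition graph over the states S1 to S6, labeled 00, 10, 01, 11, 02, and 12; (c) refined phase plane of the lower-left portion with domains Z1 to Z5; (d) state transition graph over Z1 to Z5 and the points $(0, 0)$, $(\theta_L, 0)$, $(0, \theta_N^1)$, and $(\theta_L, \theta_N^1)$. Images not reproduced.]

Fig. 6 PLDE model of oscillator with positive feedback in the absence of input ($I = 0$). (a) Phase plane analysis. Some example solutions are shown (solid curves). (b) State transition graph. The names of the states correspond to the names of the regions and the states have been labeled with the values of $s^{+}(L, \theta_L)$ and $s^{+}(N, \theta_N^1) + s^{+}(N, \theta_N^2)$. (c) Refined phase plane analysis of the lower-left portion of the phase plane with example solutions. (d) State transition graph corresponding to the analysis in c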
up as two attractors in the state transition graph (Fig. 6b). The states in the graph have been labeled with the values of $s^{+}(L, \theta_L)$ and $s^{+}(N, \theta_N^1) + s^{+}(N, \theta_N^2)$. Generally speaking, the occurrence of cyclic attractors in a state transition graph is not sufficient to conclude that the oscillations are stable, that is, that the underlying PLDE system has a limit cycle (Subheading 5.1). Indeed, numerical simulations suggest that the oscillations are damped, consistent with the experimental data [24], and converge to a steady state located on the intersection of the threshold planes $L = \theta_L$ and $N = \theta_N^2$ [51]. In order to identify this steady state, the analysis of the model needs to be extended to the threshold planes, as explained in Note 8.4. The same extension is necessary to show that solutions can slide on the threshold $N = \theta_N^1$ separating the two basins of attraction and reach an unstable steady state at $(0, \theta_N^1)$. The results of such an extended analysis, focusing on the lower-left portion of the phase plane, are shown in Fig. 6c, d. The computer tool GNA has been used to generate these results. Notice that the solutions sliding on the threshold $N = \theta_N^1$ appear as the sequence of transitions $(\theta_L, \theta_N^1) \to Z_4 \to (0, \theta_N^1)$. While
these subtle aspects of the dynamics are absent from the Boolean model of Subheading 3.2, both models agree in predicting oscillations and a stable steady state.

4.3 The IRMA Circuit
Whereas the example networks in the previous two sections are small and can be analyzed in the phase plane, this is not the case for the IRMA network. The model has five genes, ASH1, CBF1, GAL4, GAL80, and SWI5, and one input, galactose. The PLDE model previously developed for this network [76] has five state variables, one for each protein concentration ($A$, $C$, $G4$, $G80$, and $S$), and one input variable, representing the galactose concentration ($G$):

$$\dot{A} = \kappa_A^0 + \kappa_A \, s^{+}(S, \theta_S^a) - \gamma_A A, \tag{20}$$

$$\dot{C} = \kappa_C^1 \, s^{+}(S, \theta_S^c) + \kappa_C^2 \, s^{+}(S, \theta_S^c) \, s^{-}(A, \theta_A) - \gamma_C C, \tag{21}$$

$$\dot{G4} = \kappa_{G4}^0 + \kappa_{G4} \, s^{+}(C, \theta_C) - \gamma_{G4} \, G4, \tag{22}$$

$$\dot{G80} = \kappa_{G80}^0 + \kappa_{G80} \, s^{+}(S, \theta_S^g) - \gamma_{G80} \, G80, \tag{23}$$

$$\dot{S} = \kappa_S^0 + \kappa_S \, s^{+}(G4, \theta_{G4}) \bigl(1 - s^{+}(G80, \theta_{G80}) \, s^{-}(G, \theta_G)\bigr) - \gamma_S S. \tag{24}$$

We also use the following inequalities on the parameters, which were estimated from experimental data [76]:

$$0 < \kappa_A^0/\gamma_A < \theta_A < (\kappa_A^0 + \kappa_A)/\gamma_A, \tag{25}$$

$$0 < \kappa_C^1/\gamma_C < \theta_C < (\kappa_C^1 + \kappa_C^2)/\gamma_C, \tag{26}$$

$$0 < \kappa_{G4}^0/\gamma_{G4} < \theta_{G4} < (\kappa_{G4}^0 + \kappa_{G4})/\gamma_{G4}, \tag{27}$$

$$0 < \theta_{G80} < \kappa_{G80}^0/\gamma_{G80} < (\kappa_{G80}^0 + \kappa_{G80})/\gamma_{G80}, \tag{28}$$

$$0 < \kappa_S^0/\gamma_S < \theta_S^c < \theta_S^a < \theta_S^g < (\kappa_S^0 + \kappa_S)/\gamma_S. \tag{29}$$
Despite the size of the network, the transition graph can be generated and analyzed to find different attractors corresponding to steady states or oscillations. The graph has 64 states, when restricting the analysis to regions between thresholds like in Fig. 6a, b. The results correspond to those obtained with the Boolean model (Subheading 3.3). However, when doing a more refined analysis like in Fig. 6c, d, necessary for identifying steady states on threshold planes, this number quickly jumps to above 7000 states. Without additional tools, this makes it practically impossible to analyze the qualitative dynamics of the network in detail. In the following section, we will zoom in on some of the tools developed for coping with large networks and state transition graphs, both in Boolean and PLDE models.
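The count of 64 regions between thresholds follows from a simple product: a variable with $k$ thresholds contributes $k + 1$ intervals. The sketch below reproduces this count, assuming one threshold each for $A$, $C$, $G4$, and $G80$ and three for $S$, as in Eqs. 20–24. The names are illustrative, and the sketch does not attempt to reproduce the larger state count of the refined analysis, which depends on how the tool partitions the threshold planes.

```python
from math import prod

# Number of thresholds per state variable in the IRMA PLDE model (Eqs. 20-24).
THRESHOLDS = {"A": 1, "C": 1, "G4": 1, "G80": 1, "S": 3}


def count_regions(thresholds):
    """Regions strictly between thresholds: product of (thresholds + 1) per variable."""
    return prod(k + 1 for k in thresholds.values())


if __name__ == "__main__":
    print(count_regions(THRESHOLDS))
    # With one threshold per gene, the count doubles with every added gene.
    print([count_regions({f"x{i}": 1 for i in range(n)}) for n in (5, 10, 20)])
```

The second line printed illustrates why explicit enumeration quickly becomes impractical as the number of genes grows.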
5 Analysis of Network Dynamics

To a large extent, the analysis of network dynamics reduces to the analysis of state transition graphs. Various approaches have been proposed for this purpose and will be briefly reviewed in this section. As in the discussion of the modeling frameworks, the different approaches will be illustrated by means of the examples in Fig. 1.
5.1 Analysis of Attractors and Their Stability
Attractors in a state transition graph are (minimal) sets of states which do not have any outgoing transitions, that is, transitions from a state inside to a state outside the attractor. Usually, two different types of attractors are distinguished: point attractors, consisting of a single state, and cyclic attractors, consisting of a set of states forming one or several cycles. The interest of attractors for the study of network dynamics is that, starting from an initial state in the graph, the system reaches an attractor after a finite number of transitions and then indefinitely remains there. For this reason, attractors have been associated with end-points of developmental trajectories in higher organisms [46, 47] or possible responses of microorganisms to a challenge from their environment [69]. In synthetic biology, attractors may correspond to different functional states and thus form an objective of circuit design [6]. Although new measurement techniques have made it possible to follow the transient dynamics of networks, for instance, by using fluorescent reporter proteins, in many cases attractors remain the only reliably observable states of the system. Given a state transition graph, the identification of attractors is straightforward. Point attractors can be found by inspecting all individual states and cyclic attractors by looking for strongly connected components (SCCs) of the graph. An SCC is a set of states which are mutually connected, that is, there exists a directed pathway from each state to any other in the SCC. An SCC is also a maximal set, in the sense that it contains every state mutually connected to any other state in the SCC. An SCC may have incoming edges and outgoing transitions, and for it to be an attractor, it needs to be a terminal SCC, that is, have no outgoing transitions. 
Since the size of the state transition graphs grows exponentially with the number of variables (genes), however, this enumeration approach may not be feasible in many situations of practical interest. Several approaches for identifying attractors that do not require the prior generation of the state transition graph have been developed. These approaches are based, for example, on the solution of a constraint satisfaction problem [77], a satisfiability problem [78, 79], a problem formulated in the answer set programming framework [80], or a temporal logic query [81].
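The enumeration approach described above can be sketched in a few lines of Python for small networks. This is only an illustration: the rules used (L+ = N2, N1+ = N2, N2+ = not L and N1, with states written in the order L, N1, N2) are an assumption inferred from the oscillator reduction example in Subheading 5.2 with I = 0, and the function names are hypothetical. A state is reported as part of an attractor when every state reachable from it can reach it back, which is the terminal-SCC condition.

```python
from itertools import product

def async_successors(state, rules):
    """Asynchronous updating: one variable at a time moves to its rule value."""
    succ = []
    for i, rule in enumerate(rules):
        target = int(rule(state))
        if target != state[i]:
            succ.append(state[:i] + (target,) + state[i + 1:])
    return succ

def reachable(start, graph):
    """All states reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def attractors(rules, n):
    """Terminal SCCs of the explicitly generated state transition graph."""
    graph = {s: async_successors(s, rules) for s in product((0, 1), repeat=n)}
    reach = {s: reachable(s, graph) for s in graph}
    found = []
    for s in graph:
        # s is in an attractor iff every state reachable from s can return to s
        if all(s in reach[t] for t in reach[s]) and reach[s] not in found:
            found.append(reach[s])
    return found

# Assumed rules for the oscillator with positive feedback, I = 0; state = (L, N1, N2)
oscillator = [
    lambda x: x[2],                 # L+  = N2
    lambda x: x[2],                 # N1+ = N2
    lambda x: (not x[0]) and x[1],  # N2+ = not L and N1
]

result = attractors(oscillator, 3)
print(result)
```

The graph has 2^n states, so this brute-force version is only practical for a handful of variables, which is the limitation that motivates the constraint-based approaches cited above.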
In the case of PLDE models, the attractors in the state transition graph map to properties of the underlying differential equation systems. In particular, point attractors in the graph correspond to stable steady states, whereas cyclic attractors may represent (stable or damped) oscillations (Subheading 4). However, other states in the graph may be interesting as well for studying the dynamic properties of the PLDE system. For instance, in the state transition graph in Fig. 6d, $(0, \theta_N^1)'$ corresponds to an unstable steady state of the PLDE system, just like the threshold intersection $(\theta_L, \theta_T)'$ in Fig. 5. These equilibria are located on the separatrix between two stable attractors. Methods have been developed for identifying all equilibria of a regulatory circuit described by a PLDE system, and determining their stability [57, 78]. In general, determining if a cyclic attractor in a state transition graph corresponds to a limit cycle or a damped oscillator is a much more complex problem [63, 64]. The question cannot usually be decided using parameter inequalities of the type introduced in Subheading 4 only, and requires numerical analysis [51]. The above analysis methods have been implemented in a variety of computer tools, such as GINsim for logical models [82] and GNA for PLDE models [73]. Applying the latter tool to the IRMA model indicates that, when cells are growing on galactose ($s^+(G, \theta_G) = 1$), the network has three steady states: one stable, one unstable, and one whose stability cannot be determined from the local structure of the state transition graph. The latter steady state lies in a region where a cyclic attractor is present as well and numerical simulations suggest that the cycle in the graph corresponds to a stable limit cycle (and the point attractor to an unstable steady state).
Although the data are not entirely conclusive, because they are noisy and quantify mRNA concentrations at the population level, they seem to indicate that oscillatory patterns may occur for at least some of the network components [25]. The analysis of attractors makes it possible to determine which attractors can be reached from a given initial state. As will be seen below, the attractor structure of a state transition graph forms a suitable starting point for network reduction, but is also an important consideration in circuit design. It notably makes it possible to test whether certain desired functional states can be reached and undesired states avoided.
5.2 Reduction of State Transition Graphs
For high-dimensional systems, state transition graphs are typically handled through a square matrix of size $2^n$, which is the number of states in the graph for a model with $n$ Boolean variables. Numerical operations on state transition graphs are thus limited by the memory capacities of current computers, which cannot deal efficiently in real time with networks of $n > 25$ (approximately). Methods that enable the analysis of large networks are thus critical, for example, when studying the interactions of a synthetic circuit with a host network.
Hierarchical Graphs and Goal-Oriented Reduction: A classical tool for state transition graph analysis is its decomposition into SCCs. Once the states are partitioned into SCCs, the latter become the nodes of a new directed graph with no cycles and its terminal SCCs are the attractors of the original state transition graph. In the oscillator example of Fig. 3b, there are five SCCs:
$$C_1 = \{001\}, \quad C_2 = \{101\}, \quad C_3 = \{100\}, \quad C_4 = \{010, 011, 111, 110\}, \quad C_5 = \{000\} \qquad (30)$$
and the corresponding hierarchical graph is shown in Fig. 3c, where C5 is the only terminal SCC. These decomposition techniques can be found, for instance, in [83]. Other methods for the analysis and reduction of the state transition graph are goal-oriented and can be applied more generally, for asynchronous graphs, automata networks, and other multilevel logical models [84]. Identifying PLDE Parameter Regions with the Same Qualitative Dynamics: An original approach was recently developed to explore the correspondence between the state transition graphs of Boolean or multi-level logical models and the parameters in a class of PLDE models (see [85] and references therein). The idea is to characterize regions in the parameter space that lead to the same local dynamics and hence to the same state transition graph. This approach identifies different parameter regions, each corresponding to a specific state transition graph and a set of logical rules compatible with the system dynamics. A Morse graph, whose nodes are the SCCs of the state transition graph, is associated with each parameter region. In addition, these parameter regions are related through a parameter graph. A software tool DSGRN (Dynamic Signatures Generated by Regulatory Networks) is available to perform these computations [85]. Among other applications, this tool lists all possible dynamical behaviors compatible with the regulatory network, suggests minimal network rules, or rules that exhibit a specific dynamic behavior in the most robust way. For synthetic biology circuits, which are affected by stochastic perturbations in molecule concentrations, it is important to guarantee topological robustness, in the sense that the qualitative dynamics is the same even though parameters may suffer perturbations. 
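The SCC decomposition and the resulting hierarchical graph can be sketched as follows. This is a minimal illustration using Tarjan's algorithm, not the method of [83]; the oscillator rules (I = 0, state order L, N1, N2) are assumed from the reduction example later in this subheading, and all function names are hypothetical.

```python
from itertools import product

def async_graph(rules, n):
    """Explicit asynchronous state transition graph as an adjacency dict."""
    graph = {}
    for s in product((0, 1), repeat=n):
        graph[s] = [s[:i] + (int(r(s)),) + s[i + 1:]
                    for i, r in enumerate(rules) if int(r(s)) != s[i]]
    return graph

def tarjan_sccs(graph):
    """Strongly connected components (recursive Tarjan, fine for small graphs)."""
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:          # v is the root of an SCC
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(frozenset(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

def hierarchical_graph(graph, sccs):
    """Acyclic graph whose nodes are the SCCs."""
    owner = {s: c for c in sccs for s in c}
    edges = {c: set() for c in sccs}
    for s, succ in graph.items():
        for t in succ:
            if owner[s] != owner[t]:
                edges[owner[s]].add(owner[t])
    return edges

# Assumed oscillator rules, I = 0; state = (L, N1, N2)
rules = [lambda x: x[2], lambda x: x[2], lambda x: (not x[0]) and x[1]]
graph = async_graph(rules, 3)
sccs = tarjan_sccs(graph)
dag = hierarchical_graph(graph, sccs)
terminal = [c for c in sccs if not dag[c]]   # SCCs without outgoing edges
as_strings = sorted(sorted("".join(map(str, s)) for s in c) for c in sccs)
print(as_strings, terminal)
```

Under these assumed rules the sketch recovers the five components of Eq. 30, with the component containing 000 as the only terminal one.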
Reduction of the Regulatory Network: A different class of model reduction algorithms focuses on reducing the size n of the network of regulatory interactions and is based on iteratively suppressing variables, by linking the incoming edges of the variable to be suppressed directly to its outgoing edges. Every occurrence of the suppressed variable is substituted by its Boolean rule [31]. To illustrate this methodology, consider the oscillator with positive feedback example from Subheading 3.2. To eliminate variable N2,
we replace it by its Boolean rule, to obtain the following reduced model:
$$L^+ = (\neg L \vee I) \wedge N_1, \qquad N_1^+ = (\neg L \vee I) \wedge N_1.$$
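The substitution step can be checked on this small example with a short Python sketch. The full three-variable rules (L+ = N2, N1+ = N2, N2+ = (not L or I) and N1) are an assumption consistent with the reduced model above, the inducer is fixed at I = 0, and the helper names are hypothetical.

```python
from itertools import product

I = 0  # inducer absent (assumption for this example)

def n2_rule(L, N1):
    """Boolean rule of the eliminated variable N2."""
    return ((not L) or I) and N1

# Full model, state = (L, N1, N2); rules assumed from the text
full = [
    lambda x: x[2],                   # L+  = N2
    lambda x: x[2],                   # N1+ = N2
    lambda x: n2_rule(x[0], x[1]),    # N2+ = (not L or I) and N1
]

# Reduced model, state = (L, N1): each occurrence of N2 replaced by its rule
reduced = [
    lambda x: n2_rule(x[0], x[1]),    # L+  = (not L or I) and N1
    lambda x: n2_rule(x[0], x[1]),    # N1+ = (not L or I) and N1
]

def async_edges(rules, n):
    """Set of asynchronous transitions (one variable changes per step)."""
    edges = set()
    for s in product((0, 1), repeat=n):
        for i, r in enumerate(rules):
            v = int(r(s))
            if v != s[i]:
                edges.add((s, s[:i] + (v,) + s[i + 1:]))
    return edges

def fixed_points(rules, n):
    """States in which every variable already equals its rule value."""
    return [s for s in product((0, 1), repeat=n)
            if all(int(r(s)) == s[i] for i, r in enumerate(rules))]

reduced_edges = sorted(async_edges(reduced, 2))
print(reduced_edges)
print(fixed_points(full, 3), fixed_points(reduced, 2))
```

The comparison of fixed points covers point attractors only; cyclic attractors would need the SCC analysis of the state transition graph.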
This method is shown to preserve all attractors. Indeed, for the case $I = 0$, the asynchronous transition graph of this reduced model is $01 \leftrightarrows 11 \to 10 \to 00$, which contains the attractor 00 and also keeps a reduced form ($01 \leftrightarrows 11$) of the oscillatory behavior with $N_1 = 1$ (compare with Fig. 3b). The above procedure can be applied iteratively to eliminate variables from the network [86], excepting those variables which contain self-loops, in which case the Boolean rule substitution does not apply. A novel procedure for reduction of networks was introduced in [32], which allows for a very efficient attractor computation method. It is based, first, on a network expansion that removes all negative interactions and, second, on the identification of the so-called stable motifs, which contain sets of variables which are fixed in each attractor. The network expansion consists of adding a composite node to represent each conjunction and, in networks with negative feedback loops, a complementary node ($\bar{x}_i$) for each variable, whose rule is the negated rule of the variable ($\bar{x}_i^+ = \neg f_i(x)$). For the oscillator example with $I = 0$, the expanded network becomes:
$$L^+ = N_2, \quad N_1^+ = N_2, \quad N_2^+ = \bar{L}N_1, \quad (\bar{L}N_1)^+ = \bar{L} \wedge N_1,$$
$$\bar{L}^+ = \bar{N}_2, \quad \bar{N}_1^+ = \bar{N}_2, \quad \bar{N}_2^+ = L \vee \bar{N}_1.$$
A stable motif is a strongly connected set of nodes such that: (1) it does not contain both a variable and its complementary node and (2) if it contains a composite node, then it must contain its two inputs. In the current example, there is only one such stable motif, $\bar{N}_1 \leftrightarrows \bar{N}_2$. This motif represents an attractor of the network and indicates that both $N_1$ and $N_2$ stabilize at the value 0 in the attractor. In addition, since $L^+ = N_2$, it follows that 000 is the only attractor of the network. Note that, for larger stable motifs, an iterative procedure involving network reduction as above [31] and stable motif identification is applied.
5.3 Formal Verification of Network Properties Using Model Checking
Besides the detection and reachability of attractors, other dynamical properties may be of interest for network analysis and design. For example, in order to validate a model, it is important to know if there exist paths in the graph in which the predicted qualitative ordering of events, that is, the temporal sequence of changes in gene activity or protein concentrations, is consistent with experimental observations.
Methods for model checking provide a formal framework for testing a large variety of temporal properties of labeled state transition graphs, that is, graphs in which the states and/or transitions have been annotated with features such as the value of the variables identifying a state. In order to unambiguously specify the properties to be verified, formal languages called temporal logics have been proposed [87]. Properties stated in temporal logic can be verified using algorithms that efficiently run through the graph to check if a property is true or false. While model checking has found widespread use in computer science and engineering, applications in systems and synthetic biology have also emerged in the past 15 years (see [88, 89] for reviews). The application of model checking for the analysis of qualitative models of biological regulatory circuits is supported by a number of computer tools [73, 82, 90–92].
Temporal Logic to Describe Circuit Properties: A large variety of temporal logics have been used to formalize temporal properties of state transition graphs, differing in such characteristics as the possibility to express properties on a single path or on branching paths, to add or ignore quantitative constraints on the timing of events, etc. In this chapter, we will illustrate the description of circuit properties using the classical Computation Tree Logic (CTL) [87]. The expressions in CTL are interpreted on so-called Kripke structures, which are very close to the labeled state transition graphs describing the dynamics of Boolean and PLDE models (Subheadings 3 and 4). A Kripke structure consists of a set of states, a set of transitions between states, a set of atomic propositions describing features of states, a labeling function mapping a state to the atomic propositions that are satisfied in the state, and an initial state.
A CTL formula can be recursively constructed from atomic propositions by means of standard Boolean operators describing a state and pairs of quantifiers ranging over paths. For example, for the Boolean model of the toggle switch in Fig. 2d, the temporal logic formula EF $(\neg L \wedge T)$ means that, starting from the initial state, there exists a path (E) such that in the future (F) the formula $\neg L \wedge T$ holds. This property is true for the initial state 00, but false when starting from 10. In the case of the refined state transition graphs associated with PLDE models, the CTL formula AG EF $(L = 0 \wedge N = 0)$ means that from any state it is always possible to get to the trivial state in which the concentrations of both LacI and NRI are 0. This property is false when the initial states comprise the entire phase space, as the concentrations are predicted to oscillate when starting above the threshold $N = \theta_N^1$.
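The EF example above can be reproduced with a small explicit-state sketch. The graph below is an assumption: it is the asynchronous graph one obtains for a toggle switch without inducers under the rules L+ = not T and T+ = not L, with self-loops added on the stable states so that every state has a successor, as a Kripke structure requires. The functions are illustrative and are not the algorithms used by dedicated model checkers.

```python
def ef(graph, target):
    """States satisfying EF target (least fixed point of backward reachability)."""
    sat = set(target)
    changed = True
    while changed:
        changed = False
        for state, successors in graph.items():
            if state not in sat and any(t in sat for t in successors):
                sat.add(state)
                changed = True
    return sat

def ag(graph, target):
    """States satisfying AG target, using AG p = not EF (not p)."""
    states = set(graph)
    return states - ef(graph, states - set(target))

# Assumed toggle switch graph; states are strings "LT"
toggle = {
    "00": ["10", "01"],
    "11": ["01", "10"],
    "10": ["10"],   # stable state, self-loop
    "01": ["01"],   # stable state, self-loop
}

not_L_and_T = {s for s in toggle if s == "01"}      # labeling for (not L and T)
ef_states = ef(toggle, not_L_and_T)
print("00" in ef_states, "10" in ef_states)

stable = {"10", "01"}
ag_ef_stable = ag(toggle, ef(toggle, stable))       # AG EF (stable state)
print(sorted(ag_ef_stable))
```

Nested formulas such as AG EF are evaluated from the inside out on sets of states, which is the same principle that symbolic model checkers apply to implicitly encoded graphs.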
Model Checkers to Verify Network Properties: The verification of properties expressed in temporal logic requires highly efficient, specialized computer tools, called model checkers. Model checking algorithms can run on explicitly generated state transition graphs, but as the number of states exponentially grows with the size of the regulatory circuit, this approach quickly becomes infeasible. Another approach is based on the symbolic encoding of the state transition graph and the temporal logic formula to verify, reducing the verification of the property to, for example, the solution of a Boolean satisfiability problem or the reduction of a binary decision diagram [87]. In the latter approach, the state transition graph is not explicitly generated but implicitly contained in the symbolic encoding of the problem. Powerful model checking tools exist and most tools for biological network analysis and design call upon these tools through a dedicated front-end or by generating an appropriate input file. Figure 7 shows the verification of the property AG EF $(L = 0 \wedge N = 0)$ for the PLDE model of the oscillator with positive feedback. The model and the property are entered in the tool GNA, which calls the model checker NuSMV to test the property [93]. The property is found to be false, as expected, since above the threshold concentration $\theta_N^1$ oscillations can occur. A counterexample is shown in the screenshot of the formal verification.
Fig. 7 Verifying reachability properties of the oscillator with positive feedback using model checking. The property AG EF Zero, with Zero equal to $(L = 0 \wedge N = 0)$, is tested for the PLDE model of the oscillator in the absence of IPTG (Fig. 6). GNA and NuSMV show the property to be false and a counterexample is shown in the form of oscillations in the concentrations of LacI and NRI
The IRMA model has been analyzed using the above formal verification tools [76]. The objective of the study was to verify that the network structure and the observed data are compatible by (1) expressing the measured RT-qPCR expression patterns of the genes as temporal logic formulae and (2) testing if there are combinations of parameter inequalities for which the model predictions are compatible with the observations. Surprisingly, among the almost 5000 possible combinations of parameter inequalities, only a handful turned out to be consistent with the data. The ordering of the different activation thresholds of Swi5 inferred from the data was corroborated by independent measurements of the promoter activities. This and other examples from the literature [94, 95] illustrate the interest of using temporal logic and model checking for supporting the analysis and design of synthetic circuits.
5.4 Modular Analysis of Network Dynamics
Networks in synthetic biology are often constructed by coupling small networks, or modules, through known interactions so as to obtain new dynamical behaviors [96]. To take advantage of this modular approach, a recent method [37, 97] proposes to analyze a Boolean network as the interconnection of two or more smaller modules. In particular, this method calculates the attractors of the full network from the attractors of the modules, thus avoiding the calculation of the full state transition graph. To illustrate this interconnection method, we will study a hypothesized synthetic biology coupling between the toggle switch (module $\Sigma_A$) and the IRMA circuit (module $\Sigma_B$).
Input/Output Characterization of the Modules' Attractors: The first step is to characterize the asymptotic input/output behavior (or the attractors) of each module and identify the variable(s) of each module which will influence some variable(s) in the other module, in other words, identify the outputs and inputs for each module. The full network will be obtained by interconnecting the output of each module to the input of the other. For simplicity, we assume that each module has a single input and a single output where the inputs are as given above, with $u$ denoting the aTc concentration for the toggle switch (but fixing $I = 0$) and $G$ for the IRMA circuit. For the outputs we will consider LacI for the toggle switch and Gal4 for IRMA (see Fig. 8). Next, the attractors of each module are computed for each input and they are classified in terms of their output values, so that $A_{u\alpha}$ denotes an attractor of module $\Sigma_A$ subject to input $u$ and whose output is $\alpha$ (both $u$ and $\alpha$ are Boolean values):
$$\text{Attractors of } \Sigma_A: \quad A_{01} = \{10\}, \quad A_{00} = \{01\}, \quad A_{11} = \{10\},$$
where 10 and 01 are the two attractors of the toggle switch when the input is 0; for each of these, the output $L$ takes the values 1 and 0, respectively. In the case of input $u = 1$, the toggle switch has only
Fig. 8 Input/output interconnected IRMA (module $\Sigma_B$) and toggle switch (module $\Sigma_A$). The bold arrows denote the feedback interconnection: the output of one network becomes the input to the other
one attractor 10, whose output is 1. With this input/output characterization of the attractors, $A_{01}$ and $A_{11}$ are different objects, even if they contain the same state. Similarly, for the IRMA circuit, we obtain the attractors $B_{v\beta}$ of module $\Sigma_B$ corresponding to input $v$ and whose output is $\beta$:
$$\text{Attractors of } \Sigma_B: \quad B_{00} = \{0000000\}, \quad B_{10} = \{0000000\},$$
$$B^c_{10} = \{0001110, 0011110, 0101111, 0101110\}, \quad B^c_{11} = \{1011110, 1011111, 1111111, 1101111\},$$
where the cyclic attractor $B^c_{10} \cup B^c_{11}$ is separated into two sets, according to the values of the output G4.
Asymptotic Graph and Attractors of the Interconnected Network: The second step of the method is to construct the so-called asymptotic graph, which is a directed graph where the nodes are all possible pairs $A_{u\alpha} \times B_{v\beta}$ ($3 \times 4 = 12$ in this example) and the edges are assigned through reachability properties. Consider, for instance, node $A_{01} \times B^c_{11}$. The output 1 from $A_{01}$ implies that the input to module $\Sigma_B$ is also 1, hence states in $B^c_{11}$ may evolve to $B^c_{10}$, so an edge $A_{01} \times B^c_{11} \to A_{01} \times B^c_{10}$ is added to the asymptotic graph. Conversely, the output 1 from $B^c_{11}$ forces module $\Sigma_A$ to switch to input 1; the states in $A_{01}$ eventually converge to attractor $A_{11}$, so an edge $A_{01} \times B^c_{11} \to A_{11} \times B^c_{11}$ is added to the asymptotic graph. The theoretical result [37] states that the asymptotic graph contains (a representative of) all the attractors of the interconnection meaning that, in practice, the attractors of the asymptotic graph are those of the full system. However, because the asymptotic graph is a reduced version of the full interconnected system, it contains fewer pathways and, in some cases, may generate other spurious attractors (see [97, 98] for example). In this example, observation of Fig. 9 shows that the interconnected system has three attractors: two steady states
Fig. 9 Asymptotic graph for the interconnection between the toggle switch and the IRMA circuit. Bold arrows denote a cyclic attractor of the interconnected system. There are two other point attractors, $A_{00} \times B_{00}$ and $A_{01} \times B_{10}$
$A_{00} \times B_{00} = \{010000000\}$ and $A_{01} \times B_{10} = \{100000000\}$ which, unsurprisingly, correspond to the product of the two toggle switch attractors with the origin attractor of the IRMA circuit. In addition, the cyclic attractor of the IRMA module is also maintained, coupled to the state 10 of the toggle switch. This analysis suggests that, if the two modules are synthetically coupled through Gal4 and LacI, as indicated, three types of asymptotic behavior may co-exist. Note that different interconnection schemes may lead to different behavior. From a synthetic biology perspective, this appears as a promising tool for useful predictions and testing the dynamics of modular systems, without the need to compute the full state transition graphs: in the example, the latter has $2^9$ nodes, while the asymptotic graph has only 12 nodes.
5.5 Probabilistic Analysis of State Transition Graphs
Although state transition graphs provide a global characterization of the trajectories of the system, they give no quantitative information on the probability of the system following a given trajectory, or the frequency of observing a given attractor. To introduce more detailed state transition graphs, it is useful to consider their representation as a square matrix $M$, where the entry $m_{ij}$ is 1 if there is a transition from node $i$ to node $j$. For synchronous updating schedules, each row has only one non-zero entry, while asynchronous updating allows for multiple non-zero entries. The first step towards a more quantitative description is to suppose that each transition has its own associated probability, $0 \le m_{ij} \le 1$, where each row $i$ adds up to 1, $\sum_{j=1}^{2^n} m_{ij} = 1$. This idea leads to a natural generalization of state transition graphs as Markov chains, thereby allowing the computation of several quantitative measures such as expected convergence time, expected reachability in a fixed number of steps, or the probability of convergence to a given attractor.
Asynchronous Updating Schedules: Two questions can be asked: (1) how to assign the transition probabilities $m_{ij}$ and (2) how to compute a trajectory in the state transition graph. Answers to these questions are given, for instance, in [99] which discusses state transition graphs as Markov chains. The first question is related to the biological knowledge and experimental data on the system,
which is often incomplete. For piecewise-linear models, [100] suggests computing the (relative) size of the region of domain $i$ whose initial conditions lead to a trajectory into domain $j$. Transition probabilities are thus given in terms of the PLDE model's parameters. The tool MaBoSS [101] answers question (2) by applying a stochastic algorithm (Gillespie) to evolve within the state transition graph and obtain different realizations. The transition probabilities are either supplied by the user or assigned by default. One of the outputs of MaBoSS is the probability of reaching a given attractor. The modular analysis described in Subheading 5.4 also allows for a probabilistic extension [97]: it starts from the hypothesis that each module attractor $A_{u\alpha}$ and $B_{v\beta}$ is observed with a certain probability and propagates their products along the asymptotic graph using conditional probabilities. The attractors' probabilities may be obtained from biological data or other tools (for instance, applying MaBoSS to each module). As an example, for the IRMA/toggle switch interconnection, assume that the probability of observing each module attractor is defined as $P(A_{u\alpha}) = a_{u\alpha}$ and $P(B_{v\beta}) = b_{v\beta}$, with $b_{11}/2 = P(B^c_{10}) = P(B^c_{11})$. To propagate these values throughout the asymptotic graph, we need to specify the probability $\rho$ that module $\Sigma_A$ is updated. Then we assign transition probabilities as follows: $P(A_{00} \times B^c_{11} \to A_{11} \times B^c_{11}) = \rho\, a_{00}\, b_{11}/2$ and $P(A_{00} \times B^c_{11} \to A_{00} \times B_{00}) = (1 - \rho)\, a_{00}\, b_{11}/2$. The total probability of reaching an attractor is given by the sum of all incoming transitions.
Synchronous Updating Schedules: In this case, each node has exactly one outgoing edge and belongs to a single basin of attraction, so the Markov chain representation as above is of no avail. An alternative method to generate probabilistic transitions for synchronous Boolean networks is proposed in [26]. The idea is to consider a family of Boolean rules for each variable $i$, $\{f_i^k : k = 1, \ldots, K_i\}$, and associated probabilities $\{c_i^k : k = 1, \ldots, K_i\}$ with $\sum_{k=1}^{K_i} c_i^k = 1$. At each updating time, there is a probability $c_i^k$ that rule $f_i^k$ is selected to compute the next value of variable $i$. The functions $f_i^k$ might represent the uncertainty in the characterization of the dynamics of variable $i$, or some perturbation effect. In this way, one can compute the probability $P_i(x)$ that variable $i$ takes value 0 at the next update, which depends only on the current state $x = (x_1, x_2, \ldots, x_n)$ of the network. Therefore, taking all possible combinations of the values $P_i(x)$ and $1 - P_i(x)$ for all $i$ gives the probability of transition from state $x$ to any other state $y$. These probabilities are time-independent and may also be written as a Markov chain to represent the state transition matrix. In synthetic biology approaches, a calibrated probabilistic state transition graph can help to improve circuit design to increase the probability of observing a desired behavior, quantitatively predict
the response of the circuit to different inputs, and also allow for a better regulation and control of the system, for which some techniques will be discussed in the next section.
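The construction of a transition matrix for a probabilistic synchronous Boolean network can be sketched as follows. The rule families and their probabilities are purely hypothetical (a two-variable toggle-like network in which each variable either follows its repression rule or keeps its current value), and the helper works with the probability that a variable equals 1 at the next update; its complement is the probability of value 0. Exact fractions are used so that the row sums can be checked without rounding error.

```python
from fractions import Fraction
from itertools import product

def pbn_transition_matrix(families, n):
    """Transition probabilities x -> y when each variable picks one rule at random.

    families[i] is a list of (rule, probability) pairs for variable i.
    """
    states = list(product((0, 1), repeat=n))
    matrix = {}
    for x in states:
        # probability that variable i equals 1 at the next update
        p_one = [sum(c for rule, c in family if rule(x)) for family in families]
        row = {}
        for y in states:
            p = Fraction(1)
            for i in range(n):
                p *= p_one[i] if y[i] == 1 else 1 - p_one[i]
            row[y] = p
        matrix[x] = row
    return matrix

# Hypothetical rule families; state = (L, T)
families = [
    [(lambda x: not x[1], Fraction(9, 10)),   # L+ = not T
     (lambda x: bool(x[0]), Fraction(1, 10))],  # L+ = L (no change)
    [(lambda x: not x[0], Fraction(4, 5)),    # T+ = not L
     (lambda x: bool(x[1]), Fraction(1, 5))],   # T+ = T (no change)
]

M = pbn_transition_matrix(families, 2)
row_sums = [sum(row.values()) for row in M.values()]
print(M[(0, 0)][(1, 1)], M[(1, 0)][(1, 0)], row_sums)
```

Because the variables are updated independently given the current state, each row is a product of per-variable probabilities and sums to 1, so the result can be treated as a Markov chain.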
6
Control of Network Dynamics
A dynamical system, in either of the qualitative forms studied above, Eq. 1 or 15, may have one or more input variables which act on the system in some way, for instance, to induce gene transcription, repress an inhibitory interaction, or regulate the activity of a transcription factor. In general, these input variables are easily manipulated in the lab to obtain the desired degree of regulation. In the toggle switch example, IPTG and aTc are input variables and galactose is an input for the IRMA circuit. More generally, a dynamical system with inputs may be studied as a control system:
$$\dot{x} = f(x, u), \qquad (31)$$
where $x = (x_1, \ldots, x_n)$, $f = (f_1, \ldots, f_n)$ is the vector field which governs the dynamics, and $u$ is called the control vector, of dimension $p$, and takes values in some set $U \subseteq \mathbb{R}^p_{\geq 0}$. Note that $p = 2$ for the toggle switch and $p = 1$ for both the oscillator and the IRMA circuit. For control systems, the most frequently asked question is of the form: given a target state, is there an input function $U(t)$ driving the system towards that state?
6.1 Control Strategies
There are different ways to answer this question, but a first distinction can be made between open-loop and closed-loop control. In open-loop control, the function $U(t)$ is determined independently of the dynamics of the system (31). As an example, consider the toggle switch and suppose the target state is the one where both LacI and TetR are strongly expressed, which is not a steady state of the system without inputs. With the help of either the Boolean or the PLDE models, we know that the following input will effectively drive the toggle switch to the target state: $U(t) = (A(t), I(t)) = (A_{high}, I_{high})$, i.e., both inputs should be at a constant but sufficiently high concentration. An attractive open-loop method for practical use is known as "bang-bang" control. The idea is to use only two constant values for the input, $I_{high}$ and $I_{low}$, and apply them sequentially, by intervals of appropriate length. This strategy tends to accelerate convergence to the target state. This is useful when only a limited number of input values are available, as is often the case in synthetic biology experiments. In contrast, a closed-loop control strategy takes into account the evolution of the system and uses the current state to "correct" the input. If all variables are observable, the control function is
written in terms of the system variables, $U(t) = k(x(t))$, where the function $k : \mathbb{R}^n_{\geq 0} \to \mathbb{R}^p_{\geq 0}$ is called a feedback control law. In the case of linear systems, a common method is to measure the difference between the current state $x(t)$ and the target state $x^*$ and let $U(t)$ be proportional to this difference, with a size $n$ constant square matrix $K$: $\dot{x} = f(x(t)) + K(x^* - x(t))$, so the system tends to reduce the difference between its trajectory and the target state. Alternatively, for a smoother response, the integral of this difference can also be used, such as $U(t) = \tilde{K} w(t) + K(x^* - x(t))$ with $\dot{w}(t) = x^* - x(t)$ and $\tilde{K}$ another size $n$ constant square matrix. Application of both proportional and integral feedback control leads to the classic PI controller [17].
6.2 Control for Boolean Models
In Boolean models, a control system still has the form of Eq. 31 where $u$ takes values in a discrete set $U \subseteq \{0, 1\}^p$. A control function at time $t$ corresponds to a discrete sequence of input values, $U[t] = [u_1, \ldots, u_t]$. To construct a Boolean control function, there are several approaches that take advantage of the discrete nature of the system and are interpreted as a protocol for interventions. The idea is to successively avoid pathways that lead away from the target state. In [102], two types of control actions are introduced, deletion of a node or deletion of an edge in the regulatory network. The first action corresponds to setting that node at a constant value, while deletion of edge $x_i \to x_j$ is encoded in the logical rules by:
$$f_j(x, u_{i,j}) = f_j(x_1, \ldots, \neg u_{i,j} \wedge x_i, \ldots, x_n), \qquad (32)$$
where $u_{i,j} = 0$ implies no control is exerted and $u_{i,j} = 1$ implies that $x_i$ no longer influences $x_j$. In its general form, an input $u_{i,j}$ is added for every edge in the network. For probabilistic Boolean networks, algorithms were developed that solve the problems of optimal finite-horizon [38] or infinite-horizon [103] control. The goal is to drive the system from an initial state $z_0$ to a desired target state $z_M$ in a finite (or infinite) number of steps while minimizing the cost associated with each state transition. Finite- (or infinite-) horizon corresponds to the case of a fixed (or very large) time window available for application of a given treatment. For the infinite-horizon problem [103], the cost is of the form
$$J_\Pi(z_0) = \lim_{M \to \infty} \frac{1}{M} E\left\{ \sum_{t=0}^{M-1} \tilde{g}(z_t, \mu_t, w_t) \right\},$$
where $\tilde{g}(z_t, \mu_t, w_t)$ is the cost associated with the transition $z_t \to w_t$ applying control $\mu_t$ at state $z_t$ at time $t$. The algorithm calculates a control sequence $\Pi^* = (\mu_0^*, \mu_1^*, \ldots, \mu_M^*, \ldots)$ that minimizes $J_\Pi(z_0)$.
Moreover, the authors find that a stationary policy, that is, $\mu_k^* = \mu^*$ for all $k$, is easier to apply and gives good results in a melanoma metastasis model.
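The edge-deletion control of Eq. 32 can be illustrated with a small Python sketch. The network is the oscillator with positive feedback at I = 0 under assumed rules (L+ = N2, N1+ = N2, N2+ = not L and N1, state order L, N1, N2), the controlled edge L -> N2 is chosen only for illustration, and the wrapper name is hypothetical. The sketch lists point attractors only; cyclic attractors would require an analysis of the state transition graph.

```python
from itertools import product

def with_edge_control(rule, i):
    """Wrap a rule so that it sees (not u) and x_i in place of x_i, as in Eq. 32."""
    def controlled(x, u):
        masked = x[:i] + (int((not u) and x[i]),) + x[i + 1:]
        return rule(masked)
    return controlled

def fixed_points(rules, n, u):
    """Point attractors of the controlled network for a constant control value u."""
    return [s for s in product((0, 1), repeat=n)
            if all(int(r(s, u)) == s[k] for k, r in enumerate(rules))]

# Assumed oscillator rules; state = (L, N1, N2)
rules = [
    lambda x, u: x[2],                                     # L+  = N2 (uncontrolled)
    lambda x, u: x[2],                                     # N1+ = N2 (uncontrolled)
    with_edge_control(lambda x: (not x[0]) and x[1], 0),   # N2+ with control on L -> N2
]

no_control = fixed_points(rules, 3, u=0)
edge_deleted = fixed_points(rules, 3, u=1)
print(no_control, edge_deleted)
```

With u = 1 the repression of N2 by L is removed, so N2 simply follows N1 and, under these assumed rules, a second point attractor 111 appears next to 000.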
6.3 Control of Synthetic Circuits
In synthetic biology the main control question is often related to the robustness of a circuit with respect to perturbations in the environment, maintaining homeostasis [104, 105], or the reliability and the predictability of circuit functioning [16, 17]. Applications of closed-loop control techniques to synthetic biology circuits may involve a computer interface within the experimental setup [17]. In this in silico approach, real-time measurements are sent to the computer, where a calibrated mathematical model of the circuit is used for online simulation of the PI controller, which returns the updated input value. This was the methodology used in [19] to control the toggle switch. The first objective was to drive the system to the unstable steady state corresponding to both LacI and TetR at their threshold concentrations. To do this, the authors applied a PI controller through a computer interface, computing aTc and IPTG separately, and succeeded in maintaining the system near the unstable steady state. A second experiment consisted of forcing the toggle switch with periodic control, in an open-loop configuration. Independent pulses of aTc and IPTG were applied to the synthetic circuit with different periods. The toggle switch responded with periodic oscillations, but only for carefully chosen periods of forcing. In the experiments [19], both inputs were used to control the system, and they were independently computed. However, a recent theoretical result shows that a single input (in this case aTc) suffices to control the toggle switch to the unstable steady state, $x^* = (\theta_L, \theta_T)$ [106]. The novelty is a feedback control law which is piecewise-constant in regions of the state space: $U(L(t)) = u_{\min} < 1$ if $L(t) < \theta_L$, that is, LacI is weakly expressed and the control law decreases the influence of TetR on LacI; conversely, $U(L(t)) = u_{\max} > 1$ if $L(t) > \theta_L$.
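The measure, compute, and actuate loop of an in silico PI controller can be sketched on a toy model. Everything below is an assumption made for illustration: the plant is a hypothetical one-variable expression model dx/dt = gain*u - decay*x integrated with an explicit Euler step, the gains are arbitrary, and the code is not the controller or the circuit model used in [19].

```python
def simulate_pi(setpoint, kp, ki, steps=4000, dt=0.01, gain=1.0, decay=1.0):
    """Discrete-time PI loop around a hypothetical first-order expression model."""
    x, integral = 0.0, 0.0
    trace = []
    for _ in range(steps):
        error = setpoint - x                  # "measurement" compared with the target
        integral += error * dt                # integral of the error, w(t)
        u = max(0.0, kp * error + ki * integral)  # inducer input cannot be negative
        x += dt * (gain * u - decay * x)      # Euler step of dx/dt = gain*u - decay*x
        trace.append(x)
    return trace

trace = simulate_pi(setpoint=1.0, kp=2.0, ki=1.0)
print(round(trace[-1], 4))
```

The integral term is what removes the steady-state offset that a purely proportional law would leave; in an experiment the Euler step would be replaced by the real circuit and the measurement by, for example, a fluorescence readout.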
A similar approach on control of PLDE with affine controls is discussed in [107], with the goal of either generating sustained oscillations in a system where they do not occur naturally or, conversely, suppressing oscillations by damping, with applications to a bacterial model. Implementation of feedback control laws in a cellular environment remains one of the challenges in synthetic biology, even though in silico techniques using PI controllers and optogenetic devices (where gene transcription is controlled by light signals) are increasingly used [16, 17]. Two main directions can be identified in current synthetic biology approaches [16]: first, the design and implementation of new circuits with biological components help to understand the fundamental mechanisms guiding and regulating cellular behavior; second, the design of controllers for natural regulatory and
Qualitative Modeling of Synthetic Circuits
metabolic networks, to improve a particular aspect of the system. In this case, possible objectives in bioengineering include increasing the production of specific biochemical products or metabolic components, changing the cell’s resource allocation strategy, or even controlling the distribution of a given product throughout a cell population [108].
7 Concluding Remarks
In this chapter we have discussed the modeling and analysis of synthetic circuits using qualitative formalisms. We focused on two widely used formalisms, Boolean and other logical models and piecewise-linear differential equation models. Although the two formalisms are quite different at first sight, they are built upon similar modeling assumptions. Moreover, they both allow a discrete description of the network dynamics by means of state transition graphs. This means that many of the methods developed for the analysis of properties of state transition graphs are applicable to models developed in the two formalisms. We illustrated this convergence by modeling three prototypical synthetic circuits in both the Boolean and the PLDE formalism and analyzing their properties. We also discussed one emerging trend in the engineering of synthetic circuits, namely to view them from a control perspective. This review, structured around three example circuits and avoiding technical details, is certainly not exhaustive. In Note 8.1 some pointers to further reviews are provided.
While Boolean and PLDE models are quite similar in many respects, they also have differences and, depending on the situation, one formalism may be more appropriate than the other. Boolean models provide a natural encoding of regulatory functions and they have been used to model very large networks [14, 44]. PLDE models are more closely related to classical kinetic models and they can account for subtle but potentially important regulatory phenomena, such as the occurrence of steady states on threshold (hyper)planes (Subheading 4). Other formalisms, like multi-valued logical models and hybrid automata, borrow aspects from both [66, 75].
A major advantage of the use of qualitative models of synthetic circuits is that they allow the rapid exploration of dynamical consequences of design choices, in particular the choice of the network topology and the logic of gene regulation.
Without going through the lengthy and difficult process of parameter estimation in quantitative models, key properties of the dynamics, such as the occurrence of bistability and oscillations, can be rapidly assessed by means of the methods discussed in Subheading 5. Although some care should be exercised in directly translating the results of a qualitative analysis to quantitative properties of the network dynamics, the
initial screening enabled by qualitative approaches may speed up the network design phase and focus attention on high-potential candidate network designs before their actual biological implementation.
While the qualitative modeling and analysis of synthetic regulatory circuits is thus valuable in itself, it may also provide a stepping stone towards a more detailed and precise, quantitative analysis of the network dynamics. Computer tools have been developed for converting logical models into ordinary differential equation models, by transforming Boolean functions into expressions of sigmoidal functions [109], in the spirit of the representation of the regulatory logic in PLDE models (Subheading 4). Tools for the numerical simulation of networks described by PLDE models have also been developed, capable of taking into account the dynamics on threshold planes [51]. More generally, the development of the SBML qual format [110], for the representation and exchange of qualitative models, has made it possible to analyze a model by means of different tools and facilitates the passing back-and-forth between qualitative and quantitative formalisms. The SBML qual format has emerged from the Consortium for Logical Models and Tools (CoLoMoTo) (www.colomoto.org), an active and dynamic working group stimulating the use of qualitative modeling for a variety of biological applications.
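To make the idea of turning a Boolean function into an expression of sigmoidal functions more concrete, the following Python sketch interpolates a Boolean function multilinearly between the corners of the unit cube and passes each input through a normalized Hill function. It is a minimal illustration in the spirit of such conversions and does not claim to reproduce the tool of [109]. The function names, the Hill parameters `k` and `n`, and the example rule "A AND NOT B" are assumptions made for this sketch.

```python
from itertools import product


def normalized_hill(x, k=0.5, n=4):
    """Hill function rescaled so that it maps 0 to 0 and 1 to 1."""
    def hill(v):
        return v**n / (v**n + k**n)
    return hill(x) / hill(1.0)


def continuous_extension(bool_fn, n_inputs):
    """Return a continuous function on [0, 1]^n that agrees with bool_fn
    on the Boolean corners (multilinear interpolation of Hill-transformed
    inputs)."""
    def f(*xs):
        hs = [normalized_hill(x) for x in xs]
        total = 0.0
        # Sum the interpolation weights of the corners where bool_fn is true.
        for corner in product((0, 1), repeat=n_inputs):
            if bool_fn(*corner):
                weight = 1.0
                for h, bit in zip(hs, corner):
                    weight *= h if bit else (1.0 - h)
                total += weight
        return total
    return f


# Hypothetical example: a gene activated by A and repressed by B.
target = continuous_extension(lambda a, b: bool(a and not b), 2)
```

The resulting function could serve as the production term of an ordinary differential equation, for instance as `dx/dt = kappa * target(a, b) - gamma * x`, where `kappa` and `gamma` are again hypothetical rate constants.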
8 Notes
8.1 Reviews on Qualitative Modeling
Qualitative modeling approaches have been discussed as part of general reviews of the modeling of regulatory networks [111–115]. Moreover, for most of the modeling formalisms discussed in this chapter dedicated reviews are available: Boolean and other logical models [116–118], piecewise-linear differential equation models [119], Petri nets [82], and hybrid systems [89, 120].
8.2 Dynamic Properties of Positive and Negative Feedback Loops
One of the first studies spelling out at length the interest of positive and negative feedback loops for the functioning of regulatory networks is the book on logical modeling by René Thomas [23]. Here, it was conjectured that positive feedback loops are a prerequisite for multistability, that is, the co-occurrence of multiple stable steady states (point attractors). On the other hand, negative feedback loops were hypothesized to be necessary for stable oscillations. Later work has confirmed the conjectures, both for the case of positive and negative feedback loops [121–123]. Notice that the criteria have been proposed for deterministic ODE models, and that the existence of feedback loops provides necessary but not sufficient conditions. For example, in Fig. 5, if we choose $\kappa_L/\gamma_L < \theta_L$ and $\kappa_T/\gamma_T < \theta_T$, then the toggle switch has only a single stable steady state. Corresponding proofs of the conjectures in the discrete, logical context have also been developed [124, 125].
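The toggle switch example can be checked with a short Python sketch. It assumes a PLDE toggle switch of the form $\dot{L} = \kappa_L s^-(T, \theta_T) - \gamma_L L$, $\dot{T} = \kappa_T s^-(L, \theta_L) - \gamma_T T$, where $s^-$ is a decreasing step function. This form and all parameter values are assumptions of the sketch and may differ from the model behind Fig. 5. The code enumerates the four regular domains delimited by the thresholds and keeps the focal points that lie inside their own domain. Steady states located on the thresholds themselves are not considered.

```python
from itertools import product


def step_minus(x, theta):
    """Decreasing step function: 1 below the threshold, 0 above it."""
    return 1.0 if x < theta else 0.0


def regular_steady_states(kL, gL, thL, kT, gT, thT):
    """Steady states inside regular domains of the assumed toggle switch

        dL/dt = kL * s^-(T, thT) - gL * L
        dT/dt = kT * s^-(L, thL) - gT * T
    """
    steady = []
    for L_high, T_high in product((False, True), repeat=2):
        # Representative point inside the domain, used to evaluate the steps.
        L_rep = 2.0 * thL if L_high else 0.5 * thL
        T_rep = 2.0 * thT if T_high else 0.5 * thT
        # Focal point: where the local linear dynamics would converge.
        focal_L = kL * step_minus(T_rep, thT) / gL
        focal_T = kT * step_minus(L_rep, thL) / gT
        # It is a steady state only if it lies in the domain it belongs to.
        if (focal_L > thL) == L_high and (focal_T > thT) == T_high:
            steady.append((focal_L, focal_T))
    return steady


# Production strong enough to cross the thresholds: two stable steady states.
bistable = regular_steady_states(kL=2.0, gL=1.0, thL=1.0, kT=2.0, gT=1.0, thT=1.0)
# kappa/gamma below both thresholds: a single stable steady state remains.
monostable = regular_steady_states(kL=0.5, gL=1.0, thL=1.0, kT=0.5, gT=1.0, thT=1.0)
```

Both parameter sets describe the same positive feedback loop, which illustrates that the loop is necessary but not sufficient for multistability.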
8.3 Updating Schedules for Boolean Models
Boolean variables are defined in continuous time, but their state is allowed to change only at a discrete set of time instants. An updating schedule essentially determines the order in which the variables change their state, and it may be deterministic, where the same order is applied at each iteration [30], or non-deterministic, where the order is given by a random or stochastic process [126]. A deterministic updating schedule $s$ may be defined as a function $s : \{1, \ldots, n\} \to \{1, \ldots, m\}$, where $m \le n$, $s(i) < s(j)$ means that variable $i$ is updated before variable $j$, and $s(i) = s(j)$ indicates that variables $i$ and $j$ are updated simultaneously. The case $m = 1$ denotes the synchronous updating schedule and the case $m = n$ denotes an asynchronous sequential schedule. In the case of random schedules, both $m = m_t$ and $s(i) = s_t(i)$ depend on the current iteration time $t$. If $x_i[t]$ denotes the state of variable $i$ at time $t$, then the state at the next iteration, $x_i[t+1]$, is given by:
$$x_i[t+1] = f_i\left(x_1[t + \Delta t_i^1], \ldots, x_n[t + \Delta t_i^n]\right), \tag{33}$$
where $\Delta t_i^j = 0$ if $s_t(i) \le s_t(j)$ and $\Delta t_i^j = 1$ if $s_t(i) > s_t(j)$. In general, each realization of an updating schedule leads to a different trajectory. The dynamic properties of various updating schedules have been studied in the literature (see [28, 30, 126] for some examples).
8.4 Discontinuities in Piecewise-Linear Differential Equation Models
The use of step functions results in PLDE models with favorable mathematical properties, except at the thresholds where discontinuities occur. As explained in Subheading 4, these discontinuities arise from the fact that when a protein concentration crosses a threshold, it may change the rate at which some genes are expressed, and thus switch the local vector field in a region. While the dynamics at the thresholds are often ignored, this is potentially dangerous as it may cause steady states and other important dynamical properties of the system to be missed. In order to deal with the discontinuities in a mathematically rigorous manner, the PL differential equations have been generalized to differential inclusions [56]. Several different extensions have been proposed, such as Filippov extensions [56, 57], Aizerman–Pyatnitskii extensions [51, 60], and hyper-rectangular overapproximations of the former [55]. The latter overapproximations can be computed with qualitative information only, i.e., the parameter inequalities mentioned in Subheading 4, and have been implemented in the tool GNA [73]. For relatively mild conditions on the types of regulatory functions, the three extensions are equivalent in practice [51]. Other approaches for dealing with discontinuities in piecewise-linear models have been proposed [58, 59], but are less amenable to the automated qualitative analysis of higher-dimensional networks.
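Returning to the updating schedules of Note 8.3, the effect of a deterministic schedule can be sketched in Python as follows. The function name, the 0-based indexing of variables, and the two-gene Boolean toggle switch used as an example are assumptions of this sketch. Variables are processed block by block in increasing order of $s(i)$, so that a variable updated in a later block sees the new values of variables updated in earlier blocks, while variables in the same block are updated simultaneously.

```python
def update_once(state, functions, schedule):
    """One iteration of a Boolean network under a deterministic schedule.

    state:     tuple of 0/1 values x_1[t], ..., x_n[t]
    functions: list of callables f_i, each taking the full state tuple
    schedule:  list where schedule[i] is s(i), the block in which
               variable i is updated
    """
    current = list(state)
    for block in sorted(set(schedule)):
        # Variables of one block are computed from the same snapshot
        # and committed together (simultaneous update).
        snapshot = tuple(current)
        for i, s_i in enumerate(schedule):
            if s_i == block:
                current[i] = int(functions[i](snapshot))
    return tuple(current)


# Boolean toggle switch: each gene is on when its repressor is off.
toggle = [lambda x: not x[1], lambda x: not x[0]]

synchronous = update_once((0, 0), toggle, schedule=[1, 1])  # case m = 1
sequential = update_once((0, 0), toggle, schedule=[1, 2])   # case m = n
```

Starting from the state where both genes are off, the synchronous schedule switches both genes on at once, whereas the sequential schedule lets the first gene switch on and thereby keeps the second gene off. The two schedules thus produce different trajectories from the same initial state.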
Acknowledgements
We would like to thank our friend and colleague Jean-Luc Gouzé for a critical reading of the manuscript and many useful discussions. This work has been supported by the ANR projects Maximic (ANR-17-CE40-0024-01) and ICycle (ANR-16-CE33-001601), and Inria IPL CoSy.
References
1. Kosuri S, Church G (2014) Large-scale de novo DNA synthesis: technologies and applications. Nat Methods 11(5):499–507
2. Csörgő B, Nyerges A, Pósfai G, Féher T (2016) System-level genome editing in microbes. Curr Opin Microbiol 33:113–122
3. Decoene T, Paepe BD, Maertens J, Coussement P, Peters G, Maeseneire SD, Mey MD (2018) Standardization in synthetic biology: an engineering discipline coming of age. Crit Rev Biotechnol 38(5):647–656
4. Nielsen A, Der B, Shin J, Vaidyanathan P, Paralanov V, Strychalski E, Ross D, Densmore D, Voigt C (2016) Genetic circuit design automation. Science 352(6281):aac7341
5. Kwok R (2010) Five hard truths for synthetic biology. Nature 463(7279):288–290
6. Otero-Muras I, Banga J (2017) Automated design framework for synthetic biology exploiting Pareto optimality. ACS Synth Biol 6(7):1180–1193
7. Purcell O, Savery N, Grierson C, di Bernardo M (2010) A comparative analysis of synthetic genetic oscillators. J R Soc Interface 7(52):1503–1524
8. Ashyraliyev M, Nanfack YF, Kaandorp J, Blom J (2009) Systems biology: parameter estimation for biochemical models. FEBS J 276(4):886–902
9. Berthoumieux S, Brilli M, de Jong H, Kahn D, Cinquemani E (2011) Identification of metabolic network models from incomplete high-throughput datasets. Bioinformatics 27(13):i186–i195
10. Villaverde A, Banga J (2013) Reverse engineering and identification in systems biology: strategies, perspectives and challenges. J R Soc Interface 11(91):20130505
11. Kauffman S (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol 22(3):437–467
12. Glass L, Kauffman S (1973) The logical analysis of continuous, nonlinear biochemical control networks. J Theor Biol 39:103–129
13. Thomas R (1973) Boolean formalization of genetic control circuits. J Theor Biol 42:563–585
14. Rodríguez-Jorge O, Kempis-Calanis L, Abou-Jaoudé W, Gutiérrez-Reyna D, Hernandez C, Ramirez-Pliego O, Thomas-Chollier M, Spicuglia S, Santana M, Thieffry D (2019) Cooperation between T cell receptor and Toll-like receptor 5 signaling for CD4+ T cell activation. Sci Signal 12(577):eaar3641
15. Saez-Rodriguez J, Simeoni L, Lindquist J, Hemenway R, Bommhardt U, Arndt B, Haus UU, Weismantel R, Gilles E, Klamt S, Schraven B (2007) A logical model provides insights into T cell receptor signaling. PLoS Comput Biol 3(8):e163
16. Hsiao V, Swaminathan A, Murray R (2018) Control theory for synthetic biology. IEEE Control Syst Mag 38:32–62
17. Del Vecchio D, Dy AJ, Qian Y (2016) Control theory meets synthetic biology. J R Soc Interface 13:20160380
18. Gardner T, Cantor C, Collins J (2000) Construction of a genetic toggle switch in Escherichia coli. Nature 403(6767):339–342
19. Lugagne JB, Carrillo S, Kirch M, Köhler A, Batt G, Hersen P (2017) Balancing a genetic toggle switch by real-time feedback control and periodic forcing. Nat Commun 8:1671
20. Jaeger J, Surkova S, Blagov M, Janssens H, Kosman D, Kozlov K, Manu, Myasnikova E, Vanario-Alonso C, Samsonova M, Sharp D, Reinitz J (2004) Dynamic control of positional information in the early Drosophila embryo. Nature 430(6997):368–371
21. Becskei A, Serrano L (2000) Engineering stability in gene networks by autoregulation. Nature 405(6786):590–591
22. Elowitz M, Leibler S (2000) A synthetic oscillatory network of transcriptional regulators. Nature 403(6767):335–338
23. Thomas R, D'Ari R (1990) Biological feedback. CRC Press, Boca Raton
24. Atkinson M, Savageau M, Myers J, Ninfa A (2003) Development of genetic circuitry exhibiting toggle switch or oscillatory behavior in Escherichia coli. Cell 113(5):597–608
25. Cantone I, Marucci L, Iorio F, Ricci M, Belcastro V, Bansal M, Santini S, di Bernardo M, di Bernardo D, Cosma M (2009) A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell 137:172–181
26. Shmulevich I, Dougherty E, Kim S, Zhang W (2002) Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 18(2):261–274
27. Mori T, Flöttmann M, Krantz M, Akutsu T, Klipp E (2015) Stochastic simulation of Boolean rxncon models: towards quantitative analysis of large signaling networks. BMC Syst Biol 9(45):1–9
28. Chaves M, Albert R, Sontag E (2005) Robustness and fragility of Boolean models for genetic regulatory networks. J Theor Biol 235(3):431–449
29. Gonzalez A, Naldi A, Sànchez L, Thieffry D, Chaouiya C (2006) GINsim: a software suite for the qualitative modelling, simulation and analysis of regulatory networks. BioSystems 84(2):91–100
30. Aracena J, Goles E, Moreira A, Salinas L (2009) On the robustness of update schedules in Boolean networks. BioSystems 97(1):1–8
31. Naldi A, Rémy E, Thieffry D, Chaouiya C (2011) Dynamically consistent reduction of logical regulatory graphs. Theor Comput Sci 412(21):2207–2218
32. Zañudo J, Albert R (2013) An effective network reduction approach to find the dynamical repertoire of discrete dynamic networks. Chaos 23(2):025111
33. Irons D (2006) Improving the efficiency of attractor cycle identification in Boolean networks. Physica D 217:7–21
34. Akutsu T, Melkman A, Tamura T, Yamamoto M (2011) Determining a singleton attractor of a Boolean network with nested canalyzing functions. J Comput Biol 18(10):1275–1290
35. Veliz-Cuba A, Aguilar B, Hinkelmann F, Laubenbacher R (2014) Steady state analysis of Boolean molecular network models via model reduction and computational algebra. BMC Bioinform 15:221
36. Lorenz T, Siebert H, Bockmayr A (2013) Analysis and characterization of asynchronous state transition graphs using extremal states. Bull Math Biol 75(6):920–938
37. Tournier L, Chaves M (2013) Interconnection of asynchronous Boolean networks, asymptotic and transient dynamics. Automatica 49(4):884–893
38. Datta A, Choudhary A, Bittner ML, Dougherty ER (2003) External control in Markovian genetic regulatory networks. Mach Learn 52(1–2):169–181
39. Laschov D, Margaliot M (2012) Controllability of Boolean control networks via the Perron-Frobenius theory. Automatica 48(6):1218–1223
40. Yang JM, Lee CK, Cho KH (2018) Global stabilization of Boolean networks to control the heterogeneity of cellular responses. Front Physiol 9:774
41. Li F, Long T, Lu Y, Ouyang Q, Tang C (2004) The yeast cell-cycle network is robustly designed. Proc Natl Acad Sci USA 101(14):4781–4786
42. Fauré A, Naldi A, Chaouiya C, Thieffry D (2006) Dynamical analysis of a generic boolean model for the control of the mammalian cell cycle. Bioinformatics 22(14):124–131
43. Ortiz-Gutiérrez E, García-Cruz K, Azpeitia E, Castillo A, Sánchez M, Alvarez-Buylla E (2015) A dynamic gene regulatory network model that recovers the cyclic behavior of Arabidopsis thaliana cell cycle. PLoS Comput Biol 11(9):e1004486
44. Calzone L, Tournier L, Fourquet S, Thieffry D, Zhivotovsky B, Barillot E, Zinovyev A (2010) Mathematical modelling of cell-fate decision in response to death receptor engagement. PLoS Comput Biol 6(3):e1000702
45. Zhang R, Shah M, Yang J, Nyland S, Liu X, Yun J, Albert R, Loughran TP Jr (2008) Network model of survival signaling in large granular lymphocyte leukemia. Proc Natl Acad Sci USA 105(42):16308–16313
46. Sánchez L, Thieffry D (2001) A logical analysis of the Drosophila gap-gene system. J Theor Biol 211:115–141
47. Albert R, Othmer HG (2003) The topology of the regulatory interactions predicts the expression pattern of the Drosophila segment polarity genes. J Theor Biol 223:1–18
48. Barberis M, Helikar T (eds) (2019) Logical modeling of cellular processes: from software development to network dynamics. Frontiers Media, Lausanne
49. Chaouiya C (2007) Petri net modelling of biological networks. Brief Bioinform 8(4):210–219
50. Heiner M, Koch I (2004) Petri net based model validation in systems biology. In: Cortadella J, Reisig W (eds) Applications and theory of Petri nets 2004. Springer, Berlin, pp 216–237
51. Acary V, de Jong H, Brogliato B (2014) Numerical simulation of piecewise-linear models of gene regulatory networks using complementarity systems. Physica D 269:103–119
52. van Ham P (1979) How to deal with variables with more than two levels. In: Thomas R (ed) Kinetic logic: a Boolean approach to the analysis of complex regulatory systems. Lecture notes in biomathematics, vol 29. Springer, Berlin, pp 326–343
53. Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
54. Mestl T, Plahte E, Omholt S (1995) A mathematical framework for describing and analysing gene regulatory networks. J Theor Biol 176(2):291–300
55. de Jong H, Gouzé JL, Hernandez C, Page M, Sari T, Geiselmann J (2004) Qualitative simulation of genetic regulatory networks using piecewise-linear models. Bull Math Biol 66(2):301–340
56. Gouzé JL, Sari T (2002) A class of piecewise linear differential equations arising in biological models. Dynam Syst 17(4):299–316
57. Casey R, de Jong H, Gouzé JL (2006) Piecewise-linear models of genetic regulatory networks: equilibria and their stability. J Math Biol 52(1):27–56
58. Ironi L, Panzeri L, Plahte E, Simoncini V (2011) Dynamics of actively regulated gene networks. Physica D 240(8):779–794
59. Plahte E, Kjóglum S (2005) Analysis and generic properties of gene regulatory networks with graded response functions. Physica D 201(1):150–176
60. Machina A, Edwards R, van den Driessche P (2013) Singular dynamics in gene network models. SIAM J Appl Math 12(1):95–125
61. Glass L (1975) Classification of biological networks by their qualitative dynamics. J Theor Biol 54(1):85–107
62. Glass L, Pasternack J (1978) Prediction of limit cycles in mathematical models of biological oscillations. Bull Math Biol 40(3):27–44
63. Edwards R (2000) Analysis of continuous-time switching networks. Physica D 146(1–4):165–199
64. Farcot E (2006) Geometric properties of a class of piecewise affine biological network models. J Math Biol 52(3):373–418
65. Batt G, de Jong H, Page M, Geiselmann J (2008) Symbolic reachability analysis of genetic regulatory networks using discrete abstractions. Automatica 44(4):982–989
66. Thomas R, Thieffry D, Kaufman M (1995) Dynamical behaviour of biological regulatory networks: I. Biological role of feedback loops and practical use of the concept of the loop-characteristic state. Bull Math Biol 57(2):247–276
67. Edwards R, Siegelmann H, Aziza K, Glass L (2001) Symbolic dynamics and computation in model gene networks. Chaos 11(1):160–169
68. Mestl T, Lemay C, Glass L (1996) Chaos in high-dimensional neural and gene networks. Physica D 98(1):33–52
69. de Jong H, Geiselmann J, Batt G, Hernandez C, Page M (2004) Qualitative simulation of the initiation of sporulation in B. subtilis. Bull Math Biol 66(2):261–299
70. Monteiro P, Dias P, Ropers D, Oliveira A, Sá-Correia I, Teixeira M, Freitas A (2011) Qualitative modelling and formal verification of the FLR1 gene mancozeb response in Saccharomyces cerevisiae. IET Syst Biol 5(5):308–316
71. Sepulchre JA, Reverchon S, Nasser W (2007) Modeling the onset of virulence in a pectinolytic bacterium. J Theor Biol 44(2):239–257
72. de Jong H, Geiselmann J, Hernandez C, Page M (2003) Genetic network analyzer: qualitative simulation of genetic regulatory networks. Bioinformatics 19(3):336–344
73. Batt G, Besson B, Ciron P, de Jong H, Dumas E, Geiselmann J, Monte R, Monteiro P, Page M, Rechenmann F, Ropers D (2012) Genetic network analyzer: a tool for the qualitative modeling and simulation of bacterial regulatory networks. Methods Mol Biol 804:439–462
74. Huttinga Z, Cummins B, Gedeon T, Mischaikow K (2018) Global dynamics for switching systems and their extensions by linear differential equations. Physica D 367:19–37
75. Ghosh R, Tomlin C (2004) Symbolic reachable set computation of piecewise affine hybrid automata and its application to biological modelling: Delta-Notch protein signalling. Syst Biol 1(1):170–183
76. Batt G, Page M, Cantone I, Goessler G, Monteiro P, de Jong H (2010) Efficient parameter search for qualitative models of regulatory networks using symbolic model checking. Bioinformatics 26(18):i603–i610
77. Devloo V, Hansen P, Labbé M (2003) Identification of all steady states in large networks by logical analysis. Bull Math Biol 65:1025–1051
78. de Jong H, Page M (2008) Search for steady states of piecewise-linear differential equation models of genetic regulatory networks. IEEE/ACM Trans Comput Biol Bioinform 5(2):208–222
79. Dubrova E, Teslenko M (2011) A SAT-based algorithm for finding attractors in synchronous Boolean networks. IEEE/ACM Trans Comput Biol Bioinform 8(5):1393–1399
80. Abdallah EB, Folschette M, Roux O, Magnin M (2017) ASP-based method for the enumeration of attractors in non-deterministic synchronous and asynchronous multi-valued networks. Algorithms Mol Biol 12:20
81. Klarner H, Siebert H (2015) Approximating attractors of Boolean networks by iterative CTL model checking. Front Bioeng Biotechnol 3:130
82. Chaouiya C, Naldi A, Thieffry D (2012) Logical modelling of gene regulatory networks with GINsim. Methods Mol Biol 804:463–479
83. Cormen T, Leiserson C, Rivest R, Stein C (2001) Introduction to algorithms. MIT Press and McGraw-Hill, Cambridge
84. Paulevé L (2018) Reduction of qualitative models of biological networks for transient dynamics analysis. IEEE/ACM Trans Comput Biol Bioinform 15(4):1167–1179
85. Cummins B, Gedeon T, Harker S, Mischaikow K (2018) DSGRN: examining the dynamics of families of logical models. Front Physiol 9:549
86. Veliz-Cuba A (2011) Reduction of Boolean network models. J Theor Biol 289:167–172
87. Clarke E, Grumberg O, Peled D (1999) Model checking. MIT Press, Boston
88. Carrillo M, Góngora P, Rosenblueth D (2012) An overview of existing modeling tools making use of model checking in the analysis of biochemical networks. Front Plant Sci 3:155
89. Bartocci E, Lió P (2016) Computational modeling, formal analysis, and tools for systems biology. PLoS Comput Biol 12(1):e1004591
90. Bernot G, Comet JP, Richard A, Guespin J (2004) Application of formal methods to biological regulatory networks: extending Thomas' asynchronous logical approach with temporal logic. J Theor Biol 229(3):339–347
91. Calzone L, Fages F, Soliman S (2006) BIOCHAM: an environment for modeling biological systems and formalizing experimental knowledge. Bioinformatics 22(14):1805–1807
92. Kwiatkowska M, Norman G, Parker D (2011) PRISM 4.0: Verification of probabilistic real-time systems. In: Gopalakrishnan G, Qadeer S (eds) Proceedings of 23rd international conference computer aided verification (CAV'11). Lecture notes in computer science, vol 6806. Springer, Berlin, pp 585–591
93. Monteiro P, Dumas E, Besson B, Mateescu R, Page M, Freitas A, de Jong H (2009) A service-oriented architecture for integrating the modeling and formal verification of genetic regulatory networks. BMC Bioinform 10:450
94. Batt G, Belta C, Weiss R (2008) Temporal logic analysis of gene networks under parameter uncertainty. IEEE Trans Autom Control 53:215–229
95. Courbet A, Amar P, Fages F, Renard E, Molina F (2018) Computer-aided biochemical programming of synthetic microreactors as diagnostic devices. Mol Syst Biol 14(6):e7845
96. Perez-Carrasco R, Barnes C, Schaerli Y, Isalan M, Briscoe J, Page K (2018) Combining a toggle switch and a repressilator within the AC-DC circuit generates distinct dynamical behaviors. Cell Syst 6(4):521–530
97. Chaves M, Tournier L (2018) Analysis tools for interconnected Boolean networks with biological applications. Front Physiol 9:586
98. Chaves M, Carta A (2015) Attractor computation using interconnected Boolean networks: testing growth rate models in E. coli. Theor Comput Sci 599:47–63
99. Bourdon J, Eveillard D, Siegel A (2011) Integrating quantitative knowledge into a qualitative gene regulatory network. PLoS Comput Biol 7(9):1–11
100. Chaves M, Farcot E, Gouzé JL (2013) Probabilistic approach for predicting periodic orbits in piecewise affine differential models. Bull Math Biol 75(6):967–987
101. Stoll G, Viara E, Barillot E, Calzone L (2012) Continuous time Boolean modeling for biological signaling: application of Gillespie algorithm. BMC Syst Biol 6(1):116
102. Murrugarra D, Veliz-Cuba A, Aguilar B, Laubenbacher R (2016) Identification of control targets in Boolean molecular network models via computational algebra. BMC Syst Biol 10:94
103. Pal R, Datta A, Dougherty ER (2006) Optimal infinite-horizon control for probabilistic Boolean networks. IEEE Trans Signal Process 54(6):2375–2387
104. Miller M, Hafner M, Sontag E, Davidsohn N, Subramanian S, Purnick P, Lauffenburger D, Weiss R (2016) Modular design of artificial tissue homeostasis: robust control through synthetic cellular heterogeneity. PLoS Comput Biol 8:e1002579
105. Aoki S, Lillacci G, Gupta A, Baumschlager A, Schweingruber D, Khammash M (2019) A universal biomolecular integral feedback controller for robust perfect adaptation. Nature 570(7762):533–537
106. Chambon L, Gouzé JL (2019) A new qualitative control strategy for the genetic toggle switch. IFAC-PapersOnLine 52(1):532–537
107. Edwards R, Kim S, van den Driessche P (2011) Control design for sustained oscillation in a two-gene regulatory network. J Math Biol 62(4):453–478
108. Liu D, Mannan A, Han Y, Oyarzún D, Zhang F (2018) Dynamic metabolic control: towards precision engineering of metabolism. J Ind Microbiol Biotechnol 45(7):535–543
109. Wittmann D, Krumsiek J, Saez-Rodriguez J, Lauffenburger D, Klamt S, Theis F (2009) Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling. BMC Syst Biol 3:98
110. Chaouiya C, Bérenguier D, Keating S, Naldi A, Van Iersel M, Rodriguez N, Dräger A, Büchel F, Cokelaer T, Kowal B, Wicks B, Gonçalves E, Dorier J, Page M, Monteiro P, Von Kamp A, Xenarios I, de Jong H, Hucka M, Klamt S, Thieffry D, Le Novère N, Saez-Rodriguez J, Helikar T (2013) SBML qualitative models: a model representation format and infrastructure to foster interactions between qualitative modelling formalisms and tools. BMC Syst Biol 7(1):135
111. de Jong H (2002) Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol 9(1):67–103
112. Fisher J, Henzinger T (2007) Executable cell biology. Nat Biotechnol 25(11):1239–1250
113. Karlebach G, Shamir R (2008) Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol 9(10):770–780
114. Le Novère N (2015) Quantitative and logic modelling of molecular and gene networks. Nat Rev Genet 16(3):146–158
115. de Jong H, Ropers D (2006) Strategies for dealing with incomplete information in the modeling of molecular interaction networks. Brief Bioinform 7(4):354–363
116. Bornholdt S (2008) Boolean network models of cellular regulation: prospects and limitations. J R Soc Interface 5(Suppl 1):S85–S94
117. Wang RS, Saadatpour A, Albert R (2012) Boolean modeling in systems biology: an overview of methodology and applications. Phys Biol 9(5):055001
118. Abou-Jaoudé W, Traynard P, Monteiro P, Saez-Rodriguez J, Helikar T, Thieffry D, Chaouiya C (2016) Logical modeling and dynamical analysis of cellular networks. Front Genet 7:94
119. Glass L, Edwards R (2018) Hybrid models of genetic networks: mathematical challenges and biological relevance. J Theor Biol 458:111–118
120. Li X, Omotere O, Qian L, Dougherty E (2017) Review of stochastic hybrid systems with applications in biological systems modeling and analysis. EURASIP J Bioinform Syst Biol 2017(1):8
121. Gouzé JL (1998) Positive and negative circuits in dynamical systems. J Biol Syst 6(1):11–15
122. Soulé C (2003) Graphic requirements for multistationarity. ComPlexUs 1(3):123–133
123. Snoussi E (1998) Necessary conditions for multistationarity and stable periodicity. J Biol Syst 6(1):3–9
124. Remy E, Ruet P, Thieffry D (2008) Graphic requirement for multistability and attractive cycles in a Boolean dynamical framework. Adv Appl Math 41(3):335–350
125. Richard A, Comet JP (2007) Necessary conditions for multistationarity in discrete dynamical systems. Discr Appl Math 155(18):2403–2413
126. Deng X, Geng H, Matache M (2006) Dynamics of asynchronous random Boolean networks with asynchrony generated by stochastic processes. BioSystems 88(1–2):16–34
Chapter 2
Stochastic Differential Equations for Practical Simulation of Gene Circuits
Jesús Picó, Alejandro Vignoni, and Yadira Boada
Abstract
The Chemical Langevin Equation approach allows simple stochastic simulation of gene circuits under many practical situations where the number of molecules of the species involved is not extremely low. Here, we describe methods and a computational framework to simulate a population of cells containing gene circuits of interest. These methods account for both intrinsic and extrinsic noise sources, and allow us to have both individual cell-related species and population-related ones. The protocol covers aspects related to proper description of the system and setting the software tools. It also helps to deal with the optimization of data storage and the simulation precision versus computational time issue. Finally, it also gives practical tests to assess the validity of the underlying technical assumptions.
Key words Synthetic biology, Gene circuits, Stochastic modeling, Chemical Langevin equation
1 Introduction
Noise is pervasive in the cellular mechanisms underlying gene expression [32]. As a consequence, a variation of protein expression levels appears in every cell within a population of cells [28]. This stochasticity in protein expression levels is often referred to as gene expression noise [9, 12]. Gene expression noise cannot be avoided and generates phenotypic variability that may have a relevant impact on cellular functions, including, e.g., the stress response, metabolism, development, the cell cycle, circadian rhythms, and aging [1, 30]. Indeed, noise propagates to downstream genes at the single-cell level and eventually causes variations within an isogenic population [24, 30]. These variations may determine the fate of individual cells and that of a whole population, being beneficial in some contexts and harmful in others [11]. At the gene level, noise can be traced back to intrinsic sources due to stochastic fluctuations in transcription and translation reacting mechanisms, and extrinsic ones corresponding to gene
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_2, © Springer Science+Business Media, LLC, part of Springer Nature 2021
41
42
Jesu´s Pico´ et al.
independent fluctuations in gene expression due to external factors [9, 11, 19]. The first arise as a consequence of the discrete random nature of the molecular reaction events involved in gene expression. The latter are mainly caused by global fluctuations in the amounts of biochemical resources (e.g. available number of RNA polymerases and ribosomes, plasmids copy number, etc.) and noise in upstream genes. Both intrinsic and extrinsic noise should be taken into account to perform stochastic simulations of gene circuits [7, 17, 37, 40]. Besides the different mechanisms originating noise, the reactions involved in gene circuits take place within the spatially organized volume of the cell. In this chapter, algorithms that simulate spatially distributed stochastic systems are not considered. Therefore, we assume a well-mixed homogeneous system where diffusion processes are much faster than reaction ones, so we can ignore the spatial distribution of reactions within the cell and cells within the culture. For spatial stochastic modeling, the reader is referred to [3, 27]. The continuous deterministic approach to model and simulate gene circuits provides a good approximation to the average behavior of the relevant variables (i.e. biochemical species). Yet, it fails to fully characterize the system behavior when stochasticity due to noise plays a relevant role [11, 34] as, for instance, when the number of molecules of the involved species is small and random fluctuations are thus significant relative to the overall gene expression average values. In these cases, the stochastic system may show behaviors, like, for instance, bimodality and oscillations, that their deterministic counterparts do not show. 
In addition, making use of stochastic models and the additional information conveyed by noisy experimental measurements may allow improved estimation of model parameters [6, 33].
The direct approach to simulate a biochemical stochastic system consists of accounting for the probability that each of the biochemical species in the system has a given number of copies. The Chemical Master Equation (CME) provides an accurate account of the evolution of the probability distribution of the number of molecules of each species in the system [38, 39]. It yields an infinite-dimensional system of differential equations, one for each possible state of the system. That is, the CME has to be solved for every possible number of molecules of each species. Therefore, an analytic solution of the CME is not possible in the general case. Workarounds include numerical approaches providing a solution in a truncated state-space, as in [20, 26], or simulations of its exact sample paths using Gillespie's stochastic simulation algorithm (SSA) [14]. The SSA computes single realizations of the underlying Markov jump process to obtain a numerical estimation of the probability distribution of each species in the system. The properties deduced about the probabilistic nature of the process from multiple runs can be made arbitrarily accurate by averaging over a sufficient number of runs to reduce the Monte Carlo error associated with the estimates [39]. Thus, the SSA is exact in the sense that the statistics from the CME are reproduced precisely. However, it comes at a high computational cost even for a few species, in particular if the number of molecules has large fluctuations or if many reactions occur per unit time. In the first case, a large number of samples have to be simulated to obtain statistically accurate results, whereas in the second case single simulations become expensive since the time between reaction events becomes small [35].
There are two main approaches to speed up the SSA. The first one, still an exact method, is based on factoring out reaction propensities, and is known as the partial-propensities method [29]. The second one essentially consists of lumping together reactions and updating the state vector only after many reactions have fired. This is the so-called tau-leaping approximation; it and its variants introduce approximation errors that remain small as long as the state vector updates are relatively small [16]. Pushing these approximation approaches further leads to continuous stochastic simulation methods.
As an alternative to the numerical approaches above, the Chemical Langevin Equation (CLE) approach is a continuous stochastic simulation method that approximates the CME by a system of stochastic differential equations (SDE). In contrast to the CME, which leads to an infinite-dimensional system, the CLE gives a system of SDEs of order equal to the number of species [16]. The CLE approach is a practical way to model gene expression noise when the number of molecules of the species is sufficiently large [13, 14].
To apply the CLE approach and simulate genetic circuits, the mathematical model of the circuit dynamics must first be expressed in the proper form. Then, it is necessary to account for both intrinsic and extrinsic noise sources, and to allow for individual cell-related species, population-related ones, or both. In addition, an efficient computational framework for stochastic simulation must be set up. Aspects related to the optimization of data storage, the trade-off between simulation precision and computational time, and practical tests to assess the validity of the underlying technical assumptions must also be considered.
2 Materials
To clarify the protocol steps, the synthetic gene circuit depicted in Fig. 1 will be used as an example when needed. This circuit, hereafter denoted as the QS/Fb circuit, integrates a cell-to-cell communication mechanism and an intracellular feedback loop [5]. As a consequence, there are both intracellular biochemical species at the individual cell level and extracellular ones at the population level. The circuit aims to achieve a desired mean value of the expression of a protein of interest while minimizing its variability in time and across the population of cells. Therefore, analysis of the circuit performance requires stochastic simulations.
Fig. 1 QS/Fb circuit. The gene circuit aims to regulate the mean expression of a protein of interest while minimizing the noise strength. To this end, it relies on the combination of cell-to-cell communication based on quorum sensing (QS) via exchange of a diffusible molecule, and intracellular negative feedback (Fb). The Fb subsystem regulates the expression of the protein of interest inside each cell, minimizing its noise strength. The QS subsystem induces consensus among the cells, thus achieving homogeneous expression across the population of cells
2.1 Getting the Model in Proper Form
1. Define a vector containing the number of molecules of the biochemical species for the population of cells. The dynamics of the circuit will later be expressed using this vector, which contains the number of molecules of the relevant biochemical species as the model state variables (see Note 1). Set the number N of cells to be simulated. This protocol assumes N is constant throughout the simulation. This is consistent with experiments carried out under continuous operation in turbidostats and microfluidic devices. Refer to Note 2 on how to estimate the population size N so as to get statistically correct results taking into account the computational cost. Refer to Note 3 to relate the population size and the optical density. Consider that all N cells have the same set of relevant intracellular biochemical species. Refer to Note 4 on how to deal with heterogeneous cells. For a system with c common intracellular species for all N cells and e extracellular species, define the column vector $n = [n^1, \ldots, n^N, n_{c+1}, \ldots, n_{c+e}]^T$ containing all vectors $n^i = [n_1, \ldots, n_c]^i$ with the number of molecules for the c intracellular species in the i-th cell, and the variables $n_{c+1}, \ldots, n_{c+e}$ containing the number of molecules of each extracellular species (see Example 1).
Example 1 For the QS/Fb circuit the relevant species are the intracellular $n_1$ (PI), $n_2$ (R), $n_3$ ((R·A)$_2$), the small molecule $n_4$ (A) that can freely diffuse across the cellular membrane, and $n_6$ (R·A). $n_6$ can be obtained algebraically as a function of the others, $n_6 = g(n_2, n_3, n_4)$ (see Point 4 below on how to handle algebraically determined species). Additionally, $n_5$ (Ae) represents the amount of extracellular molecules of the diffusible species A. With this set of species, $n = [n^1, \ldots, n^N, n_5]^T$, with $n^i = [n_1^i, n_2^i, n_3^i, n_4^i, n_6^i]$ for the five intracellular species in each i-th cell.
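The layout of the vector n described in Point 1 can be sketched as a small index helper. This is only an illustration under stated assumptions: the type `StateLayout` and its member names are hypothetical and are not part of OpenFPM or of the chapter's repository, and indices are 0-based.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the chapter's code): describes the layout of
// n = [n^1, ..., n^N, n_{c+1}, ..., n_{c+e}]^T stored as one flat array.
struct StateLayout {
    std::size_t N;  // number of cells
    std::size_t c;  // intracellular species per cell
    std::size_t e;  // extracellular species

    // Total number of entries in n.
    std::size_t size() const { return N * c + e; }

    // Position of intracellular species j (0-based) of cell i (0-based).
    std::size_t intracellular(std::size_t i, std::size_t j) const {
        assert(i < N && j < c);
        return i * c + j;
    }

    // Position of extracellular species k (0-based), stored after all cells.
    std::size_t extracellular(std::size_t k) const {
        assert(k < e);
        return N * c + k;
    }
};
```

For instance, with 240 cells, four non-algebraic intracellular species per cell and one extracellular species, `StateLayout{240, 4, 1}.size()` gives 961 entries, and the extracellular species sits in the last position.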
2. Define the vector of reaction propensities. For a system with N cells where
– $r_c$ reactions are common to all cells and affect both the dynamics of intracellular species and those of the extracellular ones, and
– $r_e$ reactions only affect the dynamics of extracellular species,
define the column vector a(n) (see Eq. 1) containing all vectors $a(n)^i$ of propensities for the $r_c$ intracellular reactions in the i-th cell, and the reaction propensities $a_{r_c+1}(n), \ldots, a_{r_c+r_e}(n)$ for the extracellular reactions.
$$
a(n) = \begin{bmatrix} a(n)^1 \\ a(n)^2 \\ \vdots \\ a(n)^N \\ a_{r_c+1}(n) \\ \vdots \\ a_{r_c+r_e}(n) \end{bmatrix},
\qquad
a(n)^i = \begin{bmatrix} a_1(n) \\ \vdots \\ a_{r_c}(n) \end{bmatrix}^i
\tag{1}
$$
Refer to Note 4 on how to deal with heterogeneous cells with different sets of intracellular reactions. Refer to Note 5 on how to model the reaction propensities. Refer to Notes 6, 7 and 8 on how to deal with lumped propensities obtained from a reduced order model (see Example 2) and validate them (see Example Note 5). Refer to Note 7 on how to deal with diffusion of molecules across the cell membrane.
Example 2 For the QS/Fb circuit, consider the set of reactions among the species listed below. This set includes some pseudo-reactions: the first reaction with the lumped functional propensity $f_1(n_3^i)$ resulting from a previous model-order reduction (see Notes 6 and 7), and the 9th reaction accounting for the diffusion process (see Note 7). The corresponding vector of propensities is given after the reactions. For each cell, the last propensity $D V_c n_5$ depends on the extracellular species $n_5$ (Ae), but it is included as an intracellular reaction as it affects the dynamics of the intracellular species $n_4^i$ (A). On the contrary, the propensity function $d_{Ae} n_5$ only directly affects the dynamics of the extracellular species $n_5$. Therefore, it is included as an extracellular reaction in the vector of propensities. Refer to Example 12 for the software code implementation.
$$
\begin{aligned}
(R\cdot A)_2^i &\xrightarrow{\;f_1(n_3^i)\;} PI^i + (R\cdot A)_2^i \\
PI^i &\xrightarrow{\;d_I\;} \emptyset \\
\emptyset &\xrightarrow{\;C_{LR}\;} R^i \\
R^i + A^i &\underset{k_1^-}{\overset{k_1^+}{\rightleftharpoons}} R\cdot A^i \\
R^i &\xrightarrow{\;d_R\;} \emptyset \\
R\cdot A^i + R\cdot A^i &\underset{k_2^-}{\overset{k_2^+}{\rightleftharpoons}} (R\cdot A)_2^i \\
(R\cdot A)_2^i &\xrightarrow{\;d_{RA2}\;} \emptyset \\
PI^i &\xrightarrow{\;k_A\;} PI^i + A^i \\
A^i &\underset{D V_c}{\overset{D}{\rightleftharpoons}} A_e \\
A^i &\xrightarrow{\;d_A\;} \emptyset \\
A_e &\xrightarrow{\;d_{Ae}\;} \emptyset
\end{aligned}
$$
$$
a(n) = \begin{bmatrix} a(n)^1 \\ a(n)^2 \\ \vdots \\ a(n)^N \\ d_{Ae}\, n_5 \end{bmatrix},
\qquad
a(n)^i = \begin{bmatrix}
f_1(n_3^i) \\ d_I\, n_1^i \\ C_{LR} \\ k_1^-\, n_6^i \\ k_1^+\, n_2^i n_4^i \\ d_R\, n_2^i \\ k_2^+\, (n_6^i)^2 \\ k_2^-\, n_3^i \\ d_{RA2}\, n_3^i \\ k_A\, n_1^i \\ d_A\, n_4^i \\ D\, n_4^i \\ D V_c\, n_5
\end{bmatrix}
$$
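As an illustration of Example 2, the per-cell propensity vector could be evaluated as in the sketch below. It is not the chapter's implementation (that is the one referred to as Example 12): the struct `CellParams`, its field names, and the function `cell_propensities` are hypothetical, and the lumped propensity $f_1(n_3^i)$ is passed in as an already evaluated number because its functional form is defined elsewhere (Notes 6 and 7).

```cpp
#include <array>
#include <cassert>

// Hypothetical parameter container; field names mirror the symbols of Example 2.
struct CellParams {
    double dI, CLR, k1m, k1p, dR, k2p, k2m, dRA2, kA, dA, D, Vc;
};

// Propensities a(n)^i of the 13 intracellular reactions of one cell, in the
// same order as the vector a(n)^i of Example 2.
// n1..n4, n6: intracellular species of the cell; n5: extracellular Ae.
// f1_value: lumped propensity f1(n3), evaluated by the caller (assumption).
std::array<double, 13> cell_propensities(double n1, double n2, double n3,
                                         double n4, double n6, double n5,
                                         double f1_value, const CellParams& p)
{
    assert(n1 >= 0 && n2 >= 0 && n3 >= 0 && n4 >= 0 && n5 >= 0 && n6 >= 0);
    std::array<double, 13> a{{
        f1_value,           // production of PI (lumped)
        p.dI * n1,          // degradation of PI
        p.CLR,              // constitutive production of R
        p.k1m * n6,         // R.A -> R + A
        p.k1p * n2 * n4,    // R + A -> R.A
        p.dR * n2,          // degradation of R
        p.k2p * n6 * n6,    // R.A + R.A -> (R.A)2
        p.k2m * n3,         // (R.A)2 -> R.A + R.A
        p.dRA2 * n3,        // degradation of (R.A)2
        p.kA * n1,          // production of A
        p.dA * n4,          // degradation of A
        p.D * n4,           // diffusion of A out of the cell
        p.D * p.Vc * n5     // diffusion of Ae into the cell
    }};
    return a;
}
```

The single extracellular propensity $d_{Ae} n_5$ is not part of this per-cell vector; it would be appended once, after the N per-cell blocks.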
3. Define the extended stoichiometry matrix. Consider a system with N cells where $r_c$ common intracellular reactions and $r_e$ extracellular reactions take place among c common intracellular non-algebraic species for all N cells and e extracellular non-algebraic species. Non-algebraic species are those that cannot be obtained algebraically as a function of others (see Point 4 below to consider algebraic species). For such a system, the extended stoichiometry matrix S has a block structure:
$$
S = \begin{bmatrix} I_N \otimes S_{cc} & 0_{cN \times r_e} \\ 1_{1 \times N} \otimes S_{ec} & S_{ee} \end{bmatrix}
\tag{2}
$$
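where
– $S_{cc}$ is a $c \times r_c$ matrix formed by the stoichiometric coefficients for the c intracellular non-algebraic species accounting only for the $r_c$ intracellular reactions
– $0_{cN \times r_e}$ is a $cN \times r_e$ null matrix
– $S_{ec}$ is an $e \times r_c$ matrix formed by the stoichiometric coefficients for the e extracellular non-algebraic species accounting for the interactions with the intracellular ones via the $r_c$ intracellular reactions affecting them
– $S_{ee}$ is an $e \times r_e$ matrix formed by the stoichiometric coefficients for the e extracellular non-algebraic species accounting only for the $r_e$ extracellular reactions
– $I_N$ is the $N \times N$ identity matrix
– $1_{1 \times N}$ is a $1 \times N$ row vector of ones and $\otimes$ is the Kronecker product (see Note 9).
Example 3 For the QS/Fb circuit, using the reactions defined in Example 2, with the rows of $S_{cc}$ corresponding to $n_1, \ldots, n_4$ and the columns ordered as the propensities in $a(n)^i$, we have
$$
\begin{aligned}
S_{cc} &= \begin{bmatrix}
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 & 1 & -1 & -1 & 1
\end{bmatrix} \\
S_{ec} &= \begin{bmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & -1 \end{bmatrix} \\
S_{ee} &= \begin{bmatrix} -1 \end{bmatrix}
\end{aligned}
\tag{3}
$$
If we consider N = 2 cells, the extended stoichiometry matrix takes the form:
$$
S = \begin{bmatrix}
1 \cdot S_{cc} & 0 \cdot S_{cc} & 0_{4 \times 1} \\
0 \cdot S_{cc} & 1 \cdot S_{cc} & 0_{4 \times 1} \\
1 \cdot S_{ec} & 1 \cdot S_{ec} & S_{ee}
\end{bmatrix}
\tag{4}
$$
Refer to Example 14 for the software code implementation.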
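The block structure of Eq. 2 can be assembled without forming explicit Kronecker products, as in the following sketch. It is an illustration only, not the implementation referred to as Example 14; the alias `Matrix` and the function name `extended_stoichiometry` are hypothetical, and a dense integer matrix is assumed for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;  // dense, row-major (assumption)

// Builds S = [ I_N (x) Scc , 0 ; 1_{1xN} (x) Sec , See ] as in Eq. 2.
Matrix extended_stoichiometry(const Matrix& Scc, const Matrix& Sec,
                              const Matrix& See, std::size_t N)
{
    assert(!Scc.empty() && !Sec.empty() && !See.empty());
    const std::size_t c = Scc.size(), rc = Scc[0].size();
    const std::size_t e = See.size(), re = See[0].size();
    assert(Sec.size() == e && Sec[0].size() == rc);

    Matrix S(N * c + e, std::vector<int>(N * rc + re, 0));
    for (std::size_t i = 0; i < N; ++i) {
        // Diagonal block i: copy of Scc (this is I_N (x) Scc).
        for (std::size_t r = 0; r < c; ++r)
            for (std::size_t q = 0; q < rc; ++q)
                S[i * c + r][i * rc + q] = Scc[r][q];
        // Bottom block i: copy of Sec (this is 1_{1xN} (x) Sec).
        for (std::size_t r = 0; r < e; ++r)
            for (std::size_t q = 0; q < rc; ++q)
                S[N * c + r][i * rc + q] = Sec[r][q];
    }
    // Bottom-right block: See.
    for (std::size_t r = 0; r < e; ++r)
        for (std::size_t q = 0; q < re; ++q)
            S[N * c + r][N * rc + q] = See[r][q];
    return S;
}
```

With the matrices of Example 3 and N = 2, the result would have 9 rows and 27 columns, matching Eq. 4. For large N a sparse representation would likely be preferable, since most entries are zero.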
4. Consider algebraically determined species. Algebraic relationships among the species in the circuit may arise as a product of model-order reduction (see Note 6). Thus, for each i-th cell, the vector of propensities $a(n)^i$ may depend on species $n^i_{a_1}, \ldots, n^i_{a_a}$ that in turn can be obtained as an algebraic function of the species in the system. These algebraic species need not be considered in the vector of species defined in Point 1 above. The algebraic functions $n^i_{a_j} = g^i_j(n)$, $j = 1, \ldots, a$ will be used as constraints during the simulation of the system dynamics (see Point 5 below). For a system with $c_a$ common intracellular algebraic species for all N cells and $e_a$ extracellular algebraic species:
(a) define the column vector $n_a = [n_a^1, \ldots, n_a^N, n_{c_a+1}, \ldots, n_{c_a+e_a}]^T$ containing all vectors $n_a^i = [n_{a_1}, \ldots, n_{a_{c_a}}]^i$ for the number of molecules of the $c_a$ intracellular algebraic species in the i-th cell, and the variables $n_{c_a+1}, \ldots, n_{c_a+e_a}$ containing the number of molecules of each algebraic extracellular species (see Example 4).
(b) define the column vector $g(n) = [g_a^1(n), \ldots, g_a^N(n), g_{c_a+1}(n), \ldots, g_{c_a+e_a}(n)]^T$ containing all vectors of the algebraic functions $g_a^i(n) = [g_{a_1}(n), \ldots, g_{a_{c_a}}(n)]^i$ for the $c_a$ intracellular algebraic species in the i-th cell, and the functions $g_{c_a+1}(n), \ldots, g_{c_a+e_a}(n)$ for the algebraic extracellular species (see Example 4).
Example 4 For the QS/Fb circuit, notice that the fourth component in the i-th cell vector of propensities $a(n)^i$ depends on the species $n_6^i$ (see Example 1). This one can be obtained algebraically as a function of the others, $n_6^i = g_6^i(n_2^i, n_3^i, n_4^i)$, so there was no need to explicitly consider it as a component of the vector containing the number of molecules of the biochemical species. To simulate N cells we set the algebraic constraint $n_6 = g_6(n)$ with $n_6 = [n_6^1, \ldots, n_6^N]^T$ and $g_6(n) = [g_6^1(n_2^1, n_3^1, n_4^1), \ldots, g_6^N(n_2^N, n_3^N, n_4^N)]^T$. Refer to Example 16 for the software code implementation.
5. Define the general structure of the dynamic model. See Note 1 for the technical background. Split the column vector n such that $n = [n_{na}, n_a]$, where $n_{na}$ contains the number of molecules for the set of non-algebraic species and $n_a$ for the algebraic ones. The temporal evolution of the number of molecules for each biochemical species of interest will be expressed as:
$$
\begin{aligned}
n_{na}(t + \delta t) &= n_{na}(t) + S\, a(n(t))\, \delta t + S\, N_t \sqrt{a(n(t))}\, \sqrt{\delta t} \\
n_a(t + \delta t) &= g(n(t + \delta t))
\end{aligned}
\tag{5}
$$
See Methods 3.1 for software coding instructions to simulate the temporal evolution of the number of molecules of each biochemical species of interest in the system. See Point 2.2.1 for the definition of $N_t$.
Example 5 For the QS/Fb circuit, define $n_{na} = [n_{na}^1, \ldots, n_{na}^N, n_5]^T$, with $n_{na}^i = [n_1^i, n_2^i, n_3^i, n_4^i]$ (see Example 1) and $n_a = n_6$ as defined in Example 4. The propensities vector a(n(t)) and the extended stoichiometry matrix S were defined in Examples 2 and 3, respectively.
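A single update of the non-algebraic species according to Eq. 5 can be sketched in plain C++ as follows. This is a minimal illustration under stated assumptions, not the OpenFPM implementation of Methods 3.1: the function name `cle_step` is hypothetical, S is stored as a dense matrix, and the sketch neither enforces non-negative molecule numbers nor updates the algebraic species (the caller would apply g afterwards).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One Euler-Maruyama step of Eq. 5 for the non-algebraic species:
//   n <- n + S a dt + S N_t sqrt(a) sqrt(dt),
// where N_t is diagonal with independent N(0,1) entries, one per reaction.
// S has one row per species and one column per reaction.
template <class Rng>
void cle_step(std::vector<double>& n,
              const std::vector<std::vector<double>>& S,
              const std::vector<double>& a,
              double dt, Rng& rng)
{
    assert(S.size() == n.size());
    std::normal_distribution<double> normal(0.0, 1.0);

    // Per-reaction increment: drift a*dt plus diffusion N(0,1)*sqrt(a)*sqrt(dt).
    std::vector<double> flux(a.size());
    for (std::size_t r = 0; r < a.size(); ++r) {
        assert(a[r] >= 0.0);
        flux[r] = a[r] * dt + normal(rng) * std::sqrt(a[r]) * std::sqrt(dt);
    }

    // Both terms share the factor S, so S is applied once to the summed flux.
    for (std::size_t s = 0; s < n.size(); ++s) {
        assert(S[s].size() == a.size());
        for (std::size_t r = 0; r < a.size(); ++r)
            n[s] += S[s][r] * flux[r];
    }
}
```

A typical call would pass a `std::mt19937` engine as `rng`, recompute the propensities from the updated state, and repeat for each time step.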
2.2 Accounting for Noise and Computing Long-Term Statistics
1. Define the intrinsic noise matrix. The diffusion term in the Euler–Maruyama discrete formulation of the CLE given by Eq. 5 accounts for the intrinsic noise (see Note 1). For a system with N cells, $r_c$ intracellular reactions, and $r_e$ extracellular ones, define the $(N r_c + r_e) \times (N r_c + r_e)$ matrix $N_t$ as a diagonal matrix with $N r_c + r_e$ continuous independent normal random variables with zero mean and unit variance ($N_{ii}(\mu, \sigma^2) = \mathcal{N}(0, 1)$). Refer to Example 13 for the software code implementation. See there how to skip some reactions so they are not affected by intrinsic noise.
2. Define the extrinsic noise characteristics. Time-invariant dynamics are assumed. That is, the system parameters may take random values around a nominal one, but they remain constant in time. The time-variant case, not covered here, requires setting a stochastic differential equation for the temporal evolution of each parameter for each cell in the population. Consider extrinsic noise by randomizing the values of the model parameters. Each model parameter θ has a nominal value $\theta_n$. The value of the parameter assigned to the i-th cell is $\theta^i = \theta_n (1 + CV_\theta\, \mathcal{N}(0, 1))$, where $\mathcal{N}(0, 1)$ is the standard normal distribution, and $CV_\theta$ is a user-defined coefficient of variation for the parameter θ. Refer to Example 10 for the software code implementation of extrinsic noise.
3. Compute long-term first statistics for the time evolution of the species of interest in the population of cells. To compute the long-term moments of the species of interest, such as the mean (μ) and standard deviation (σ), and derived statistics such as the noise strength $\eta^2$ (squared coefficient of variation, $\eta^2 = \sigma^2/\mu^2$) for a population of N cells, follow these steps:
(a) For a population of N cells, run a simulation (see Subheading 3) of length T units of time, with discrete-time sampling δt, ensuring T is large enough so that the steady state is reached and maintained for $T_s$ units of time.
Refer to Note 2 to estimate the appropriate value of N providing representative statistics. Notice only one realization of the
simulation (one run) is required to compute the long-term first moments if the system is ergodic, that is, if averaging over a long enough time along one realization is equivalent to obtaining statistics drawn from many realizations at each time instant. Refer to Note 10 on computational assessment of the ergodicity of the system.
(b) Using the data for the N cells at every discrete-time instant $t_k$, calculate the mean $m_{n_j}(t_k)$ and the variance $s^2_{n_j}(t_k)$ for each species of interest $n_j$ across the population at each discrete-time instant $t_k$:
$$
\begin{aligned}
m_{n_j}(t_k) &= \frac{1}{N} \sum_{i=1}^{N} n_j^i(t_k) \\
s^2_{n_j}(t_k) &= \frac{1}{N} \sum_{i=1}^{N} \left( n_j^i(t_k) - m_{n_j}(t_k) \right)^2
\end{aligned}
$$
where $n_j^i(t_k)$ is the value of species $n_j$ in the i-th cell at time instant $t_k$. Refer to Example 17 to see parallel software code implementing these expressions.
(c) For each species of interest $n_j$, calculate the long-term total mean and variance using the law of total variance: the total variance is the sum of the mean of the variance plus the variance of the mean [4]:
$$
\begin{aligned}
\mu_{n_j} &= \frac{1}{T_s} \sum_{k=0}^{f} m_{n_j}(t_k) \\
\sigma^2_{n_j} &= \frac{1}{T_s} \sum_{k=0}^{f} s^2_{n_j}(t_k) + \frac{1}{T_s} \sum_{k=0}^{f} \left( m_{n_j}(t_k) - \mu_{n_j} \right)^2
\end{aligned}
$$
(d) Obtain additional statistics. The noise strength is obtained as:
$$
\eta^2_{n_j} = \frac{\sigma^2_{n_j}}{\mu^2_{n_j}}
$$
2.3 Software
1. Select a software platform. To implement and run the simulation algorithm of the CLE-based model, efficient software platforms alleviate the computational cost. Here we consider the C++ version of the scalable Open Framework for particle and Particle-Mesh codes (OpenFPM, available at http://openfpm.mpi-cbg.de/). OpenFPM allows efficient parallel computation using fully parallel particle and mesh algorithms [18].
2. OpenFPM server installation. To install OpenFPM in Linux or OSX, clone the repository and use the following lines to install it in the default location (refer to Note 11 on how to install other possible OpenFPM configurations and on troubleshooting):

    git clone https://github.com/IBirdSoft/openfpm_pdata_2.0.0.git
    cd openfpm_pdata_2.0.0
    ./install
    sudo make install
After successful installation, either run the line

    source ~/openfpm_vars

before each compilation of your OpenFPM client code or incorporate it into your .bashrc system file (or equivalent). The following three files are required in the working directory:
– main.cpp, the OpenFPM client program that implements the algorithm (refer to Methods 3.1),
– langevin.mk, a file that includes the locations of the OpenFPM library in the particular installation (see below), and
– Makefile, with the compiler configuration (see below).
The file langevin.mk has all the specific locations of the OpenFPM library in the particular installation. To create the file langevin.mk, copy into the working directory the file example.mk located in /usr/local/openfpm_pdata/include/ (where /usr/local is the default installation directory). If the installation directory is different, substitute /usr/local with the one where the OpenFPM library was installed:

    cp /usr/local/openfpm_pdata/include/example.mk langevin.mk
Finally, create the file Makefile shown below, which specifies the compiler configuration. The recipe lines (those starting with $(CC) and rm) must be indented with a tab character:

    include langevin.mk
    CC = mpic++
    LDIR =
    OBJ = main.o
    %.o: %.cpp
    	$(CC) -O3 -c --std=c++11 -o $@ $< $(INCLUDE_PATH)
    langevin: $(OBJ)
    	$(CC) -o $@ $^ $(CFLAGS) $(LIBS_PATH) $(LIBS)
    all: langevin
    .PHONY: clean all
    clean:
    	rm -f *.o *~ core langevin
3. OpenFPM client program. The OpenFPM client program is the C++ program, called main.cpp, that implements the simulation algorithm with the CLE model generated as in Materials 2.1.5 (refer to Methods 3.1). To obtain the source code, clone the repository:

    git clone https://github.com/sc2cl/population_cle.git
The repository includes the full source code of the OpenFPM client main.cpp, together with the Matlab and Python scripts to show different usage options.
3 Methods
3.1 Define the OpenFPM Client Program main.cpp
The dynamic CLE model as defined in Materials 2.1.5 and the computation of the long-term statistics of the species of interest are implemented in an OpenFPM client program called main.cpp. It contains three functions: main() with the main algorithmic steps to execute the simulation and generate the selected output, input_data() to read the parameters of the model from a file, and evolve_time() to compute the system states—that is, the number of molecules of the species of interest—at each simulation time step δt. The last two functions are called from main(). Their main features are given next. For further details refer to the complete code available at https://github.com/sc2cl/population_cle. 1. Function main( ) The pseudo-code with the contents of the main( ) function is given below:
     1  Initialize OpenFPM library
     2  Get Arguments from input:
     3      Number of cells to simulate
     4      Variance of parametric extrinsic noise
     5      Extracellular species initial condition
     6      Output selection:
     7          Histograms for every time,
     8          Long-term histogram,
     9          Population statistics for every time, or Long-term statistics
    10
    11  Initialize distributed random number generator
    12  Initialize variables:
    13      Variables common to all processors
    14      Distributed domain
    15  Call to the function input_data()
    16  Initialize time
    17  Allocate memory for the statistics and histograms
    18  Loop from 0 to final time:
    19      Call to the function evolve_time()
    20      Obtain population statistics or histograms for each time step:
    21          Iterate over the domain
    22          Gather distributed variables into one processor
    23          Write the output files (one per state)
    24  end loop
    25  Obtain long-term population statistics or histograms:
    26      Compute long term statistics
    27      Write the output files
    28  Finalize the OpenFPM library
Line 1 of the main( ) pseudo-code is standard for any OpenFPM client (see Example 6). Example 6 For the QS/Fb example, the code corresponding to the first line of the pseudo-code is
    310  // Initialize the library
    311  openfpm_init(&argc, &argv);
    312  Vcluster<> & v_cl = create_vcluster();
Line 312 initializes v_cl, the Vcluster object that manages the parallelization and the operations between the processors. This will be used, for instance, in Example 18 to execute sums over all the processors. Lines 2–10 of the main( ) pseudo-code are standard C++ code for obtaining arguments from the command line. Refer to the main.cpp code (Lines 314–354) in the repository for further details. Line 11 sets the random number generator. Refer to Note 12 to see the implementation and code details. Lines 12–14 initialize the variables and data structures to be used. There are two types: normal variables and distributed ones. Normal variables are common to all processors. For example, normal variables can be used for the extracellular species. Distributed variables are defined as nodes in a distributed grid using the data structure type grid_dist_id. Every node of the grid represents an individual cell. A node may contain several variables, such as the state variables of that cell (e.g., the intracellular species) and the specific values for its parameters (Line 412 in Example 7). Once the code is compiled and executed over several processors, the grid is decomposed and each processor executes actions on the subset of nodes of the grid assigned to it [18]. This process is transparent for the user.
Example 7 The code to create the grid with 240 cells reads like:

    378  // Creation of the distributed grid domain
    379  Box<2,double> domain({0.0,0.0},{1.0,1.0});
    380  size_t a = 10;
    381  size_t b = 24;
    394  size_t sz[2] {a, b};
Line 394 defines the size of the grid, i.e., the number of cells, as N = a × b.

    399  // Vector for x5 (as normal variable, not a distributed one)
    400  openfpm::vector<double> x5;

Line 400 creates an openfpm::vector variable, a normal variable that is not distributed across the processors, in this case for the extracellular species x5.

    411  // Create a distributed grid in 2D
    412  grid_dist_id<2, double, aggregate<double,double,double,double,double,double[23]>> g1(sz, domain, g);
Line 412 creates the distributed grid g1. The first argument sets the grid as a two-dimensional one. The second one specifies double precision. The variable aggregate is a vector containing as many double precision variables as states in the cell and an array of doubles with size equal to the number of model parameters of the cell. For the QS/Fb example, there are five doubles, one for each of the five intracellular species (see Example 1) and one double array of size 23 (double[23]) for the parameters of each cell. To make access to these variables more user friendly, the following lines are defined at the beginning of the code:

    14  const size_t I = 0;
    15  const size_t R = 1;
    16  const size_t RA2 = 2;
    17  const size_t A = 3;
    18  const size_t RA = 4;
    19
    20  const size_t Parameters = 5;
These lines (Lines 14–20) define names for the variables in each cell. To access the variables inside each cell use g1.template get<WHO>(key), where WHO is one of the six previously defined names (I, R, RA2, A, RA, Parameters), and key is the identity of the cell to be accessed. Refer to Example 10, Line 273, to see how to obtain the key.

    413  grid_dist_id<2, double, aggregate<double,double,double,double,double,double[23]>> g2(g1.getDecomposition(), sz, g);
Line 413 creates a second grid g2 with the same characteristics as g1, the one created in Line 412. Computing the time evolution of the system makes use of two grids: one for the values at the current discretized time instant, and another one for the previous one. Alternating between these two grids yields efficient memory usage, as the amount of memory used is independent of the number of time steps. See Example 8 for details. Line 15 of the main( ) pseudo-code calls the function input_data() to obtain the parameters of the CLE model from the file param.dat (see the pseudo-code in Methods 3.1). Line 16 initializes the time variables and the time step δt (refer to Note 13 on how to select the time step, and the source code in Lines 422–461 for the implementation). Line 17 allocates memory for long-term statistics and histograms. Lines 18–23 implement the integration loop by calling the function evolve_time() to numerically solve the stochastic differential equations as an iterative solution of the CLE model in Materials 2.1.5 (see Example 8).
Example 8 The code to implement the time step integration of the QS/Fb CLE model reads like:

    // Loop in time to solve the differential equations
    for (int j = 0; j < (N-1); j++)
    {
        // j increments by 1 in each time step. Always read from the first
        // argument and write into the second, but swap the positions in
        // memory between grids 1 and 2.
        if (j%2 == 0)
        {
            evolve_time(g1, g2, T, sT, x5, engine, NDist,
                        stats_vec, Ncells, j, dAee, val4);
        }
        else
        {
            evolve_time(g2, g1, T, sT, x5, engine, NDist,
                        stats_vec, Ncells, j, dAee, val4);
        }
where j increments by 1 at every time step. The first argument of the function evolve_time() is the grid to be read, and the second one is the grid to write into. Thus, in the even time steps, the current information is read from g1 and the updated one written into g2, and the other way round in the odd time steps (read from g2 and write into g1). The other arguments of this function are: the time steps (T and sT), the vector of the extracellular species (x5), the engine of the random number generator (engine), the normal distribution generator (NDist), the statistics vector (stats_vec), the number of cells (Ncells), the step counter (j), the degradation of the extracellular species (dAee), and the initial condition of the extracellular species (val4).

        // Increment the time
        t += T;
    }
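The alternation between two grids used in Example 8 can be illustrated independently of OpenFPM with two plain buffers. The sketch below is only an analogy under stated assumptions: `run_ping_pong` is a hypothetical name, `std::vector<double>` stands in for the distributed grids g1 and g2, and `evolve` stands in for evolve_time().

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Advances `steps` time steps with two buffers that swap roles: in even
// steps read g1 and write g2, in odd steps read g2 and write g1. Memory use
// is therefore independent of the number of steps.
// Returns a reference to the buffer that holds the most recent state.
std::vector<double>& run_ping_pong(
    std::vector<double>& g1, std::vector<double>& g2, int steps,
    const std::function<void(const std::vector<double>&,
                             std::vector<double>&)>& evolve)
{
    assert(g1.size() == g2.size() && steps >= 0);
    for (int j = 0; j < steps; ++j) {
        if (j % 2 == 0)
            evolve(g1, g2);  // read current state from g1, write update to g2
        else
            evolve(g2, g1);  // read current state from g2, write update to g1
    }
    // With an even number of steps (including zero) the latest state is in
    // g1; with an odd number it is in g2.
    return (steps % 2 == 0) ? g1 : g2;
}
```

The design choice is the same as in the text: the update never overwrites the state it is reading from, and no per-step copies are kept.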
Lines 20–23 of the main( ) pseudo-code implement a loop over the grid domain to obtain the values of the states per cell, and gather into one processor all the information distributed in the other processors to write the output file (see Lines 478–555 in the source code for details). For output analysis purposes, the temporal evolution of the i-th cell is stored at each multiple D of the time step δt. The storage decimation value D depends on the number of cells N in the population, the size of the simulation time step δt, and the desired maximum size of the output files (see Example 9). Refer to Note 14 on how to select D.
Example 9 One realization of the QS/Fb CLE model was run over a simulation time of T = 800 min using a time step δt = 0.025 min. Keeping every calculated data point generates a total of 32,000 time points for each of 240 cells with 4 states per cell, that is, about 30.7 million data points in a 250 MB file. However, using D = 10, i.e., keeping one in ten data points, yields a total of about three million data points stored in a 26 MB file.
Lines 25–27 compute the long-term statistics and write the output file. These are standard implementations of the equations in Materials 2.2.3(c). Refer to Lines 616–694 in the source code for further details. Finally, Line 28 finalizes the execution, closing the library appropriately.
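The long-term statistics of Materials 2.2 (population mean and variance at each time instant, combined through the law of total variance) can be sketched serially as follows. This is an illustration only, not the parallel code of the repository: `LongTermStats` and `long_term_stats` are hypothetical names, and the normalization $1/T_s$ is taken here as one over the number of stored steady-state time points, which is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result container for one species of interest.
struct LongTermStats {
    double mean;            // mu
    double variance;        // sigma^2
    double noise_strength;  // eta^2 = sigma^2 / mu^2
};

// x[k][i]: value of the species in cell i at stored time instant t_k
// (steady-state samples only).
LongTermStats long_term_stats(const std::vector<std::vector<double>>& x)
{
    assert(!x.empty());
    const double K = static_cast<double>(x.size());
    std::vector<double> m(x.size()), s2(x.size());

    for (std::size_t k = 0; k < x.size(); ++k) {
        assert(!x[k].empty());
        const double N = static_cast<double>(x[k].size());
        double sum = 0.0;
        for (double v : x[k]) sum += v;
        m[k] = sum / N;                      // population mean at t_k
        double sq = 0.0;
        for (double v : x[k]) sq += (v - m[k]) * (v - m[k]);
        s2[k] = sq / N;                      // population variance at t_k
    }

    double mu = 0.0;
    for (double v : m) mu += v;
    mu /= K;                                 // long-term mean

    // Law of total variance: mean of the variances + variance of the means.
    double var = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k)
        var += s2[k] + (m[k] - mu) * (m[k] - mu);
    var /= K;

    return {mu, var, var / (mu * mu)};
}
```

For two time instants with cell values {1, 3} and {5, 7}, the per-instant means are 2 and 6, both per-instant variances are 1, and the total variance is 1 + 4 = 5, which coincides with the variance of the pooled values {1, 3, 5, 7}.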
2. Function input_data( ) The pseudo-code with the contents of the input_data( ) function is given below. This function sets the values of the model parameters for all the cells in the population. It reads the initial values from an external file provided by the user and modifies them adding extrinsic noise if so specified.
    1  Read file param.dat
    2  Write into parameters array
    3  Iterate over all the cells:
    4      Initialize the states of cell i
    5      Read parameters array and add parametric extrinsic noise if required
    6      Write new parameters into the cell i
    7      Increment i
    8  Write parameters that are external to the cells
    9  Return the distributed grid with the parameters
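Steps 3–7 of the pseudo-code can be illustrated without the distributed grid. The sketch below assumes the multiplicative perturbation (1 + σ·N(0,1)) used by the QS/Fb implementation for extrinsic noise. The type CellParameters and the function perturb_parameters are hypothetical stand-ins, and a plain std::vector replaces the OpenFPM grid.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical stand-in for the per-cell parameter storage of the distributed grid.
using CellParameters = std::vector<double>;

// Give every cell its own copy of the nominal parameters. When extrinsic noise
// is enabled, each parameter is multiplied by (1 + sigma * N(0,1)).
std::vector<CellParameters> perturb_parameters(const std::vector<double>& nominal,
                                               std::size_t n_cells,
                                               bool extrinsic_noise,
                                               double sigma,
                                               std::mt19937& engine) {
    assert(sigma >= 0.0);
    std::normal_distribution<double> ndist(0.0, 1.0);
    std::vector<CellParameters> cells(n_cells, nominal);
    if (!extrinsic_noise) return cells;  // all cells keep the nominal values
    for (auto& cell : cells)
        for (double& p : cell)
            p *= 1.0 + sigma * ndist(engine);
    return cells;
}
```

Passing the engine by reference keeps a single random stream for the whole population, so each cell and each parameter receives an independent draw.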
Example 10 below shows relevant parts of the code implementing the input_data() function for the QS/Fb case.

Example 10

    267  // Create iterator for the distributed grid
    268  auto dom_init = g1.getDomainIterator();
    269  // Iterate
    270  while (dom_init.isNext())
    271  {
    272      // Get the actual position from the iterator in the subdomain
    273      auto key = dom_init.get();
    274      // Initialize the grid
    275      g1.template get(key) = 0.0;
Line 268 creates the domain iterator. In each iteration, Line 273 obtains the identity key of the current cell. Then, key is used to assign the initial conditions of the states (Lines 275–279) and the parameters to the current cell:

    281  for (int l = 0; l < 19; l++)
    282  {
    283      double aux_param = (EXTRINSIC_NOISE * enoise_sigma * NDist(en) + 1) * parameters[l];
    284
    285      g1.template get<Parameters>(key)[l] = aux_param;
    286      g2.template get<Parameters>(key)[l] = aux_param;
    287
    288  }
In Line 283, when EXTRINSIC_NOISE = 1, extrinsic noise is added to each parameter l by multiplying it by $(1 + \sigma_e N_l)$, where $\sigma_e$ is a user-defined coefficient of variation previously introduced as an argument to the program in Line 1 of the pseudo-code (see Materials 2.1.2), and $N_l$ = NDist(en) is a random number generated with the random number generator (see Note 12). Refer to Lines 249–305 of the source code of the QS/Fb implementation for further details.

3. Function evolve_time( ) The pseudo-code with the contents of the evolve_time( ) function is given below. This function updates the states of the system at each discrete-time step.
    1   Initialize variables (states and statistics)
    2   Create a domain iterator
    3   Execute until domain is covered
    4       Read parameters for the cell i
    5       Calculate propensities of cell i using parameters and previous values of the states
    6       Generate random numbers to include noise
    7       Calculate the deterministic and stochastic terms
    8       Compute the new values of the states of cell i
    9       Compute the algebraic restrictions
    10      Append the contribution of cell i to the extracellular species
    11      Append the contribution of cell i to the population mean calculation
    12      Increment the iterator
    13  Compute the sum of the partial extracellular species and means over all the processors
    14  Compute the new values of the extracellular species
    15  Create a new domain iterator
    16  Execute until domain is covered
    17      Append the contribution of cell i to the population variance calculation
    18      Increment the iterator
    19  Compute the sum of the partial variances over all the processors
    20  Return the updated statistics and extracellular species
Lines 1–3 of the pseudo-code are equal to the ones in Example 10 (Lines 267–273). Line 4 saves into local variables the parameters of the i-th cell obtained from the grid, as seen in Example 11 for the QS/Fb case.
Example 11 Saving the value of the first parameter (degradation rate of the species I) of the key cell in the variable dI.

    // Copy parameters from the distributed grid variable Parameters into the named variables

    // I parameters
    double dI = g_dist_read.template get<Parameters>(key)[0];
Line 5 calculates the propensities as shown in Example 12.

Example 12

    // Propensities
    // I
    double X11 = dI * g_dist_read.template get(key);
    double c1 = (pI * kI * pN_I) / dmI;
    double c2 = 1 / (kdLux + g_dist_read.template get<RA2>(key));
    double X12 = c1 * c2 * (kdLux + alphaI * g_dist_read.template get<RA2>(key));
Note that all these operations are performed on g_dist_read, which is the grid containing the number of molecules of the species in the previous discrete-time instant. Line 6 generates random numbers to incorporate the noise terms (see Note 12 and Example 13).
Example 13 To generate random numbers, the code for the QS/Fb example reads:

    118  // Noises
    119  double x1_noise1 = NDist(en);
    120  double x1_noise2 = NDist(en);

Each time the function NDist(en) is called, it generates a new independent random number.

    138  x4_noise_difu = 0;

Line 138 shows how to exclude intrinsic noise from a reaction by simply setting the corresponding noise variable to zero.

Line 7 calculates the deterministic and stochastic terms taking into account the stoichiometry (see Example 14).

Example 14 The code in the QS/Fb CLE model for the species I in the key cell reads:

    // Deterministic and stochastic terms with stoichiometry included
    double x1_det = T * (-X11 + X12);
    double x1_sto = sT * (-sqrt(std::abs(X11)) * x1_noise1 + sqrt(std::abs(X12)) * x1_noise2);
Line 8 computes the new number of molecules of the species of interest by adding the deterministic and stochastic terms to the number of molecules at the previous discrete-time instant (see Example 15).

Example 15 The code in the QS/Fb CLE model for the species I in the key cell reads:

    // Compute new value
    g_dist_write.template get(key) = g_dist_read.template get(key) + x1_det + x1_sto;
Recall that between consecutive calls to the function evolve_time() the grids g1 and g2 are alternately assigned to g_dist_read and g_dist_write, as mentioned in Example 8.
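The update of Examples 14 and 15, together with the alternation between the read and write grids, can be sketched for a single species outside OpenFPM. The sketch assumes that sT is the square root of the time step T, as suggested by the Euler–Maruyama scheme of Note 1. The functions cle_step and simulate and the birth-death example (constant production k, degradation d·n) are illustrative and are not taken from the QS/Fb source code.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <utility>

// One Euler-Maruyama step for a species with a degradation propensity a_deg
// and a production propensity a_prod, mirroring x1_det and x1_sto.
// Assumption: sT = sqrt(T).
double cle_step(double n_prev, double a_deg, double a_prod,
                double T, double noise_deg, double noise_prod) {
    assert(T > 0.0);
    const double sT = std::sqrt(T);
    const double det = T * (-a_deg + a_prod);
    const double sto = sT * (-std::sqrt(std::abs(a_deg)) * noise_deg
                             + std::sqrt(std::abs(a_prod)) * noise_prod);
    return n_prev + det + sto;
}

// Iterate a birth-death process, swapping the "read" and "write" values at
// every step in the same way g1 and g2 exchange roles between calls.
double simulate(double n0, double k, double d, double T, int steps,
                bool intrinsic_noise, std::mt19937& engine) {
    std::normal_distribution<double> ndist(0.0, 1.0);
    double n_read = n0, n_write = n0;
    for (int j = 0; j < steps; ++j) {
        // Setting a noise variable to zero removes intrinsic noise for that reaction.
        const double z1 = intrinsic_noise ? ndist(engine) : 0.0;
        const double z2 = intrinsic_noise ? ndist(engine) : 0.0;
        n_write = cle_step(n_read, d * n_read, k, T, z1, z2);
        std::swap(n_read, n_write);  // the value just written is read in the next step
    }
    return n_read;
}
```

With intrinsic noise disabled the iteration reduces to the explicit Euler method and, for this birth-death example, settles at k/d.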
Line 9 calculates the algebraic constraint using the newly computed values of the states (see Example 16).

Example 16 The code that implements the algebraic relationship giving the species R.A ($n_6^i$) as a function of $n_2^i$, $n_3^i$, $n_4^i$ (R, A, (R.A)2) for the i = key cell in the QS/Fb CLE model (see Example 4) is:

    // Algebraic constraints
    double c6 = 2 * k_2 * kd1 * g_dist_write.template get<RA2>(key) + k_1 * g_dist_write.template get(key) * g_dist_write.template get(key);
    double c7 = 8 * k_2 * c6;
    double c8 = k_1 + dRA;
    double c9 = (kd1 * kd2) * c8 * c8;
    double c10 = (kd2 * c8) / (4 * k_2);
    double c11 = c7 / c9 + 1;
    g_dist_write.template get<RA>(key) = c10 * (sqrt(std::abs(c11)) - 1); // % This is R.A
Line 10 adds the contribution of the i-th cell to the value of the extracellular species, and Line 11 adds the contribution of the i-th cell to the variable accounting for the mean number of molecules of the intracellular species (see Example 17).

Example 17 The code in the QS/Fb model obtaining the updated number of molecules for the extracellular species Ae is:

    182  // Write the partial value of Ae from the present cell.
    183  tot_A_partial += T * (-1) * X44 + sT * (-1) * sqrt(std::abs(X44)) * x4_noise_difu;
    184
    185  // Add to the mean the term corresponding to the present cell.
    186  x1mean += g_dist_write.template get(key) / Ncells;
Line 186 implements a partial calculation of the mean of x1 in the following way:

$$x_{1\mathrm{mean}} = \frac{1}{N}\sum_{k=1}^{N} x_1^k$$

where N is the total number of cells and $x_1^i$ is the number of molecules of the species x1 (PI) in the i-th cell. Recall that this code is actually executed on several processors at the same time. For example, considering a hypothetical distribution of N cells over two processors:

$$P_1 = \{\mathrm{cell}_i : i = 1, 2, \ldots, M\}, \qquad P_2 = \{\mathrm{cell}_i : i = M+1, M+2, \ldots, N\}$$

where $P_q$ is the q-th processor and M < N. Then, the calculation of $x_{1\mathrm{mean}}$ becomes

$$x_{1\mathrm{mean}} = \overbrace{\frac{1}{N}\sum_{k=1}^{M} x_1^k}^{P_1} + \overbrace{\frac{1}{N}\sum_{k=M+1}^{N} x_1^k}^{P_2}$$

When the iterator is in the j-th cell, with j ≤ M, the calculation is executed in processor $P_1$, so the partial calculation of $x_{1\mathrm{mean}}$ is

$$x_{1\mathrm{mean}}(P_1, \mathrm{cell}_j) = \frac{1}{N}\sum_{k=1}^{j} x_1^k = \frac{1}{N}\sum_{k=1}^{j-1} x_1^k + \frac{1}{N} x_1^j = x_{1\mathrm{mean}}(P_1, \mathrm{cell}_{j-1}) + \frac{1}{N} x_1^j$$

On the contrary, when the iterator is in the j-th cell, with M < j ≤ N, the calculation is executed in processor $P_2$, and the partial calculation of $x_{1\mathrm{mean}}$ is

$$x_{1\mathrm{mean}}(P_2, \mathrm{cell}_j) = \frac{1}{N}\sum_{k=M+1}^{j} x_1^k = \frac{1}{N}\sum_{k=M+1}^{j-1} x_1^k + \frac{1}{N} x_1^j = x_{1\mathrm{mean}}(P_2, \mathrm{cell}_{j-1}) + \frac{1}{N} x_1^j$$

After the iterator covers the whole grid of N cells, each processor has finished its partial calculation of $x_{1\mathrm{mean}}$:

$$x_{1\mathrm{mean}}(P_1) = \frac{1}{N}\sum_{k=1}^{M} x_1^k, \qquad x_{1\mathrm{mean}}(P_2) = \frac{1}{N}\sum_{k=M+1}^{N} x_1^k$$
The mean value corresponding to the whole population of N cells is obtained as the sum of the distributed partial values gathered from all the processors (two processors in the example):

$$x_{1\mathrm{mean}} = x_{1\mathrm{mean}}(P_1) + x_{1\mathrm{mean}}(P_2)$$

The corresponding software code is shown in Example 18.

Line 12 in the pseudo-code of evolve_time( ) increments the iterator, and the previous code is executed for all the cells. Once the iterator reaches the end, Line 13 computes the mean number of molecules of the intracellular species across the population by summing the partial results obtained in each of the computing processors. To this end, the functions sum and execute are used as shown in Example 18 (Lines 197–201). Line 13 of the pseudo-code also computes the number of molecules of the extracellular species as the sum of the values obtained in each of the distributed processors (see Example 18, Line 196).

Example 18 Aggregating the partial mean values of the species in the QS/Fb example:

    195  // Execute the sum of the means and A_partial over all the processors
    196  v_cl.sum(tot_A_partial);
    197  v_cl.sum(x1mean);
    198  v_cl.sum(x2mean);
    199  v_cl.sum(x3mean);
    200  v_cl.sum(x4mean);
    201  v_cl.execute();
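The two-processor argument can be reproduced serially. In the sketch below each "processor" accumulates its partial mean over its own range of cells, dividing every term by the total number of cells, and a plain sum stands in for the reduction performed by v_cl.sum() and v_cl.execute(). The names partial_mean and reduce_sum are hypothetical and no MPI communication is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Partial mean accumulated by one processor over its own cells [first, last).
// Each term is divided by the total number of cells N, as in Line 186.
double partial_mean(const std::vector<double>& x, std::size_t first,
                    std::size_t last, std::size_t n_cells_total) {
    assert(first <= last && last <= x.size());
    double acc = 0.0;
    for (std::size_t k = first; k < last; ++k)
        acc += x[k] / static_cast<double>(n_cells_total);
    return acc;
}

// Serial stand-in for the reduction across processors: the population mean
// is the sum of the partial means.
double reduce_sum(const std::vector<double>& partials) {
    double total = 0.0;
    for (double p : partials) total += p;
    return total;
}
```

For four cells holding 2, 4, 6 and 8 molecules split as M = 1, the partial means are 0.5 and 4.5, and their sum is the population mean 5.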
Line 14 computes the new values of the extracellular species (see Example 19).
Example 19 Parts of the code computing the updated value of the number of molecules of the extracellular species Ae in the QS/Fb example:

    // Noise for x5
    double x5_noise = NDist(en);
    // Stochastic part of x5, x5.last is the previous value of x5
    double x5_sto = sT * (-sqrt(std::abs(dAe * x5.last())) * x5_noise);
    // Calculate the new x5 value
    double x5to_add = x5.last() + T * (-dAe * x5.last()) + x5_sto + tot_A_partial;
    // Add the last calculated value to x5 vector
    x5.add(x5to_add);
Lines 15–19 perform the same operations as in Examples 17–18: creation of an iterator, iteration over the domain, calculation of the partial variance over the population, and sum of the partial variances once the domain is covered. See Lines 203–223 in the source code for details about the implementation.

3.2 Compilation
Next, compile the OpenFPM client program main.cpp with the command make. The result of the compilation is the executable program langevin. For a successful compilation it is mandatory to have both the langevin.mk and Makefile files in the working directory with the compiler configuration mentioned in Materials 2.3.
3.3 Simulation
Stochastic simulations of the dynamic model are obtained by executing the program langevin (the compiled main.cpp). A stand-alone parallel execution of the langevin program can be run as follows:
    mpirun -np 4 ./langevin param.dat 240 0.1 0 1 0 1 0
where -np 4 sets four processor cores to run the parallel simulation, ./langevin is the name of the executable program, and param.dat is the input file. The input file param.dat is a CSV text file with the nominal values of the parameters, ordered and separated by commas. The first three numeric arguments correspond to the number of cells to be simulated (240), the user-defined coefficient of variation for the extrinsic noise (0.1), and the initial number of molecules of the extracellular species (0). The remaining four arguments configure:
– Simulation with intrinsic noise (1) or deterministic simulation (0).
– Long-term population histograms (1) or not (0). See Example 20, right part of the plot.
– Population statistics (mean and variance) at every time step (1) or not (0). See Example 21.
– Temporal response of all cells at each time step (1) or not (0). See Example 20.
An execution of langevin returns by default the long-term population statistics in the output file output.dat (see Example 22). Additionally, the file param.dat can be prepared and the langevin program executed in different ways. See Note 15 for a parametric sweep performed in MATLAB®. See Note 16 for a Python script to start an execution.

Example 20 N = 240 cells were simulated to obtain trajectories and histograms of the intracellular species for the QS/Fb system. Each species has a normalized endpoint distribution computed over one realization of the CLE model, including intrinsic and extrinsic noise. All endpoint histograms show a well-shaped normal distribution, and they differ only in their means and noise strengths. The mean and standard deviation of each species (μ ± σ) were computed using the last third of the data of each species over the whole population to avoid the effect of the transient.
[Figure for Example 20: trajectories versus Time [min] (left) and normalized endpoint histograms, Frequency (right), for PI, R, (R.A)2, and A molecules.]
Example 21 Population statistics at each time step, comparing stochastic and deterministic results of the QS/Fb CLE model, are depicted below. This is a single realization computed over 800 min for the four intracellular species considering a population of N = 240 cells. The stochastic (solid line) and deterministic (dashed line) results come from two independent simulations under the same initial conditions. The average numbers of molecules of each species obtained in both simulations closely match.

[Figure for Example 21: population statistics versus Time (min) for PoI/LuxI, LuxR, (LuxR.AHL)2, and AHL molecules. Legend: Deterministic, Stochastic (Mean), Stochastic (Variance).]
Example 22 Long-term population statistics for the QS/Fb system output (n1 (PI)) using different sets of model parameters are illustrated below. Quorum sensing (orange dots) in the QS/Fb system reduces the PoI/LuxI noise strength. Without quorum sensing (purple dots), i.e. when the diffusion of AHL molecules is inhibited, the PoI/LuxI noise strength increases.
4 Notes

1. Chemical Langevin Equation
The general form of the stochastic differential Chemical Langevin Equation is

$$dn(t) = S\,a(n)\,dt + S\sqrt{a(n)}\,dW \qquad (6)$$

where n(t) is a vector containing the number of molecules of each species in the model, $S_{K \times J}$ is the stoichiometry matrix, with K the number of species and J the number of biochemical reactions, $a(n)_{J \times 1}$ is the vector of propensities containing the reaction kinetics, and dW are scalar independent Brownian motions associated with each reaction [16]. The first term on the right-hand side of the equation corresponds to the deterministic kinetics, also called the macroscopic drift term within this stochastic context. The second term, accounting for intrinsic noise, is the so-called diffusion term. Notice that the deterministic drift term grows with the size of the system (i.e. the number of molecules), while the diffusion term grows with the square root of the size of the system. Therefore, the relative weight of the stochastic term with respect to the deterministic one scales as the inverse square root of the size of the system. That is, as the number of molecules of the species increases, the solution of the SDE (6) approaches that of the deterministic model, in the sense that the fluctuations around the deterministic solution have a smaller relative size. The Euler–Maruyama discretization method [15] can be used for generating sample paths of the stochastic process driven by the CLE. It describes the temporal evolution of the number of molecules of each biochemical species in the system as:

$$n(t + \delta t) = n(t) + S\,a(n)\,\delta t + S\,N\sqrt{a(n)}\,\sqrt{\delta t} \qquad (7)$$

where $N(0,1)_{J \times J}$ is a diagonal matrix containing J statistically independent normal random variables, and δt is the discretization time step.

2. Selecting the size N of the population of cells
Recall that this protocol assumes N is constant throughout the simulation. To estimate the population size N that gives statistically correct results at minimum computational cost, run a set of simulations changing the size of the population of cells and the culture volume while keeping the cell density constant (see Note 3), and evaluate the effect of the changes on the statistical information of interest (e.g. noise strength). Simulations at different OD values can be used to assess the potential effect of cell density on the system behavior (e.g. by affecting cell-to-cell communication mechanisms) (see Example Note 1).

Example Note 1 The next figure shows the results obtained for the QS/Fb circuit when comparing the noise strength of protein n1 (PI) at the different OD600 values defined in the table below. (A) Noise strength does not appreciably change for OD ∈ [0.005, 5], obtained either by changing the volume Vext and keeping the cell number N = 240 (blue squares) or by changing both N and Vext (green squares). (B) Noise strength for different N and Vext keeping OD600 = 0.3 constant.
N fixed
N (cells)    240     240     240     240     240     240
Vext (μL)    0.06    0.03    0.006   0.003   0.0006  0.0003
OD600        0.005   0.01    0.05    0.1     0.5     1

Variable N and Vext
N (cells)    240     240     1200    2400    4800    12000
Vext (μL)    0.03    0.006   0.015   0.006   0.006   0.003
OD600        0.01    0.05    0.1     0.5     1       5

OD fixed
N (cells)    240     1200    2400    4800    12000
Vext (μL)    0.001   0.005   0.01    0.02    0.05
OD600        0.3     0.3     0.3     0.3     0.3
3. Relating the cell population size N and optical density
Optical density (OD) is a dimensionless measurement commonly used to estimate the concentration of bacterial or other cells in a liquid culture [36]. Typically, the OD of a cell sample is measured at a wavelength of 600 nm (OD600). Its value depends on the number of cells N and the volume of the culture Vext as:

$$OD = N \, \frac{1}{V_{ext}} \, \frac{1}{N_{OD=1}}$$

where $N_{OD=1}$ is the quantity of cells contained in one volumetric unit of culture when the optical density is OD = 1 (see Example Note 2).

Example Note 2 Considering that $N_{OD=1} = 8 \times 10^{5}$ is the quantity of cells contained in 1 μL of bacterial culture when the OD is 1 (Source: Agilent, E. coli Cell Culture Concentration from OD600 Calculator) and $V_{ext} = 1 \times 10^{-3}$ μL as a typical culture volume in a microfluidic device, we need N = 240 cells to simulate a scenario corresponding to OD = 0.3.
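The relation between N, Vext and OD is simple enough to check numerically. The following sketch assumes Vext is given in μL and the calibration constant in cells per μL at OD = 1; the function names optical_density and cells_for_od are illustrative.

```cpp
#include <cassert>
#include <cmath>

// OD = N / (V_ext * N_OD1), with V_ext in microlitres and N_OD1 in cells
// per microlitre at OD = 1.
double optical_density(double n_cells, double v_ext_ul, double n_od1_per_ul) {
    assert(v_ext_ul > 0.0 && n_od1_per_ul > 0.0);
    return n_cells / (v_ext_ul * n_od1_per_ul);
}

// Inverse relation: number of cells needed to reach a target OD in a culture
// volume V_ext, rounded to the nearest integer.
long cells_for_od(double od, double v_ext_ul, double n_od1_per_ul) {
    assert(od >= 0.0);
    return std::lround(od * v_ext_ul * n_od1_per_ul);
}
```

With the values of Example Note 2, cells_for_od(0.3, 1e-3, 8e5) returns 240, and optical_density(240, 0.06, 8e5) reproduces the first column of the "N fixed" table (OD600 = 0.005).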
4. Dealing with heterogeneous cells
In case there are several populations of cells, with different sets of intracellular reactions and species but sharing the extracellular species and reactions, extend the model by creating one grid per group of different cells (see Example 7 in Methods 3.1).

5. Reaction propensities
The reaction propensities a(n) in Eq. 6 can be obtained by applying the mass action kinetics formalism [10]. The law of mass action states that the rate of a chemical reaction is proportional to the product of the reactant concentrations, each raised to a power given by the stoichiometry of the reaction. If one of the required reactants is lacking, the reaction will not take place. The reaction proceeds faster as the concentration of the required reactants (also called substrates) increases. The proportionality to the product of the reactant concentrations accounts for the probability of encounter (collision) among the reactants. Thus, the rationale behind mass action kinetics is that the rate at which a reaction proceeds is proportional to the probability that the required reactants encounter each other. This probability, in turn, is proportional to the product of their concentrations. Consider a system with m species and a reaction $R_j$ relating them:

$$R_j:\; p_{j1} n_1 + \ldots + p_{jm} n_m \;\underset{k_j^-}{\overset{k_j^+}{\rightleftharpoons}}\; q_{j1} n_1 + \ldots + q_{jm} n_m$$

where $p_{ji}$, $q_{ji}$ are the consumption and production stoichiometry coefficients for the i-th species, and $k_j^+$, $k_j^-$ the specific reaction rates of the forward and reverse reactions, respectively. Define the net stoichiometry coefficient $r_{ji} = q_{ji} - p_{ji}$. According to the mass action formalism, reaction $R_j$ will contribute to the deterministic dynamics of the i-th species, $n_i$, as:

$$\dot{n}_i(t) = r_{ji}\, k_j^+ \prod_{s=1}^{m} n_s^{p_{js}} - r_{ji}\, k_j^- \prod_{s=1}^{m} n_s^{q_{js}} + \ldots$$

where the other reactions contribute analogous terms. The functions $a_j^+(n) = k_j^+ \prod_{s=1}^{m} n_s^{p_{js}}$ and $a_j^-(n) = k_j^- \prod_{s=1}^{m} n_s^{q_{js}}$ are the propensity functions corresponding to the forward and reverse j-th reactions.

6. Model reduction
Direct application of mass action kinetics to the set of reactions may result in dynamic models with many states (biochemical species) and model parameters. Model reduction yields a model with fewer variables and, thus, fewer first-order differential equations, i.e. lower order. There are some advantages in reducing a dynamic differential model:
– Large-order models have many parameters (i.e. specific reaction rates). The values of these parameters must be obtained using experimental data related to the corresponding reactions. The experimental difficulties and computational cost of the parameter estimation process increase with the number of parameters.
– In practice, there are reactions that proceed at much faster rates than others. This means that there are very different time scales associated with each reaction. The large differences in the time scales among the different species in the reaction network cause great difficulties for simulating the temporal evolution of the circuit species and for understanding the basic principles of its operation.
The reduction process should yield a model more amenable to computational analysis, while avoiding excessive reduction that would lead to a lack of biological relevance. In particular, the species in the reduced model must not be lumped ones. The resulting lumped parameters in this reduced model must be easy to associate with experimental tuning knobs. Model reduction can be carried out by means of the Quasi Steady-State Approximation (QSSA) of the fast chemical species. In essence, QSSA is a singular perturbation method [21, 22] that considers the time-scale separation among the different dynamics [25, 42]. In particular, a common assumption is that binding reactions of transcription factors to gene promoters occur very fast in comparison with those corresponding to transcription, translation, and degradation. Additional algebraic relationships among variables can be obtained through system invariants. In the case of reaction networks, it can be observed that some reactions are a linear combination of other ones. Then, the linear combination of the number of molecules (alternatively, concentrations) of the species involved remains constant in time. These linear combinations, so-called moieties, can be understood as a kind of quasi-species that remain invariant, i.e. keep a constant number of molecules (see Example Note 3). The reduced order models can be expressed as a reduced set of equivalent pseudo-reactions with lumped functional propensities (see Example Note 4.1).
Example Note 3 The set of reactions below represents the conversion of the substrate X2 into the product X4 catalyzed by the enzyme X1:

$$X_1 + X_2 \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} X_3, \qquad X_3 \xrightarrow{k_2} X_1 + X_4, \qquad X_4 \xrightarrow{d_4} \emptyset$$

where x3 is the intermediate substrate-enzyme complex. Application of the mass action kinetics gives the dynamic balances for the four species in the system:

$$\begin{aligned}
\dot{x}_1 &= k_{-1} x_3 - k_1 x_1 x_2 + k_2 x_3 \\
\dot{x}_2 &= k_{-1} x_3 - k_1 x_1 x_2 \\
\dot{x}_3 &= -k_{-1} x_3 + k_1 x_1 x_2 - k_2 x_3 \\
\dot{x}_4 &= k_2 x_3 - d_4 x_4
\end{aligned}$$

Assuming the association–dissociation reaction between the enzyme X1 and the substrate X2 to produce the intermediate complex X3 is much faster than the other reactions, we can apply the quasi-steady-state assumption:

$$\dot{x}_2 \approx 0 \;\Rightarrow\; x_3 = \frac{k_1}{k_{-1}}\, x_1 x_2$$

On the other hand, the sum of the first and third equations in the dynamic balances is zero. That is, the sum of free and bound enzyme is constant, equal to the total amount of enzyme in the system:

$$\dot{x}_1 + \dot{x}_3 = 0 \;\Rightarrow\; x_1 + x_3 = c$$

From the expressions above, one has the reduced order dynamics:

$$\dot{x}_4 = \frac{k_2\, c\, x_2}{\frac{k_{-1}}{k_1} + x_2} - d_4 x_4$$
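The intermediate algebra leading to the reduced dynamics can be written out explicitly. The steps below are a sketch that uses only the quasi-steady-state relation and the enzyme conservation law of this example; the shorthand $K = k_{-1}/k_1$ is introduced here for compactness and does not appear in the original text.

```latex
\begin{aligned}
x_3 &= \frac{k_1}{k_{-1}}\, x_1 x_2, \qquad x_1 = c - x_3 \\
x_3 &= \frac{k_1}{k_{-1}}\, (c - x_3)\, x_2
\;\Longrightarrow\;
x_3 \left( 1 + \frac{k_1}{k_{-1}}\, x_2 \right) = \frac{k_1}{k_{-1}}\, c\, x_2 \\
x_3 &= \frac{c\, x_2}{K + x_2}, \qquad K = \frac{k_{-1}}{k_1} \\
\dot{x}_4 &= k_2 x_3 - d_4 x_4 = \frac{k_2\, c\, x_2}{K + x_2} - d_4 x_4
\end{aligned}
```

The result has the familiar Michaelis–Menten form, with the lumped constant K playing the role of a half-saturation constant.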
7. Pseudo-reactions and lumped propensities
The reduced order models can be expressed as a reduced set of equivalent pseudo-reactions with associated lumped functional propensities (see Example Note 4.1). Pseudo-reactions may also be used to represent physical diffusion processes (see Example Note 4.2).

Example Note 4.1 Consider the reduced order dynamics of the product obtained in Example Note 3:

$$\dot{x}_4 = \frac{k_2\, c\, x_2}{\frac{k_{-1}}{k_1} + x_2} - d_4 x_4$$

One may consider an equivalent pseudo-reaction such that application of mass action kinetics to it will produce the dynamics above:

$$X_2 \xrightarrow{f(x_2)} X_2 + X_4, \qquad X_4 \xrightarrow{d_4} \emptyset$$

where $f(x_2) = \dfrac{k_2\, c\, x_2}{\frac{k_{-1}}{k_1} + x_2}$. Notice f(x2) can be considered as a lumped propensity function.

Example Note 4.2 Consider the diffusion process of the species Ai, Ae in Example 2 across the cell membrane. Although it is a physical process, it can be modeled by means of the pseudo-reaction:

$$A_i \underset{D V_c}{\overset{D}{\rightleftharpoons}} A_e$$

where Vc is the ratio between the cell volume and the extracellular one. The corresponding propensity terms (see Example 2) will generate the appropriate terms in the dynamic balances:

$$\begin{aligned}
\dot{n}_4^i &= D V_c\, n_5 - D\, n_4^i + \ldots \\
\dot{n}_5 &= -N D V_c\, n_5 + D \sum_{i=1}^{N} n_4^i + \ldots
\end{aligned}$$
8. Statistical validation of the lumped propensities
In Notes 6 and 7, lumped propensity functions are obtained as a result of considering reaction invariants and the dependence of slow reactions on fast ones. The use of higher-order terms in stochastic simulation is justified from the point of view of the computational implementation [8, 31]. Usually, stochastic algorithms treat all the reaction events alike, spending the great majority of their time simulating the many relatively uninteresting fast reaction events rather than explicitly simulating only the slow reactions. Nevertheless, statistical validation of the high-order functional propensities can be done by simulating the associated pseudo-reaction using the CLE approach, and then comparing this result with the one obtained by simulating the set of corresponding original reactions using Gillespie's direct method SSA [5, 41]. To this end, for the relevant species involved in the associated pseudo-reaction, obtain Box-and-Whisker plots of the SSA and CLE realizations. Perform a Kruskal–Wallis test [23] to test if there is any statistically significant difference between their medians (see Example Note 5).
Example Note 5 In the QS/Fb case, there is one lumped propensity: the Hill-like function f(n3) modeling the repressible promoter PI/(R.A)2. Transcription and degradation of PI can be described (see Example 2) using the equivalent set of pseudo-reactions:

$$(R \cdot A)_2^i \xrightarrow{f(n_3^i)} mPI^i + (R \cdot A)_2^i, \qquad mPI^i \xrightarrow{dm_I} \emptyset$$

The original set of reactions is

$$\begin{aligned}
& gPI \xrightarrow{C_I} gPI + mPI \\
& gPI + (R \cdot A)_2 \underset{k_{-lux}}{\overset{k_{-lux}/kd_{lux}}{\rightleftharpoons}} gPI \cdot (R \cdot A)_2 \\
& gPI \cdot (R \cdot A)_2 \xrightarrow{\alpha C_I} gPI \cdot (R \cdot A)_2 + mPI \\
& mPI \xrightarrow{dm_I} \emptyset
\end{aligned}$$

where

$$f(n_3^i) = \frac{C_I\, p_I\, (kd_{lux} + \alpha_I n_3^i)}{dm_I\, (kd_{lux} + n_3^i)}$$

To validate the propensity function f(n3), both sets of reactions were simulated. For one single cell (i = 1) and under the same conditions, the set of pseudo-reactions was run using the CLE, and the original ones using the Gillespie direct method (SSA). For one realization, the figure below shows how the CLE trajectory (A right) matches the SSA one (A left) very well during the whole simulation period. Both SSA and CLE trajectories have similar distributions with small differences between their first statistical moments ($\mu_{SSA} \approx \mu_{CLE}$ and $\sigma_{SSA} \approx \sigma_{CLE}$) (see B). The noise strength of mRNApoI/luxI for the SSA distribution ($\eta^2_{SSA} = 0.008$) closely matches that for the CLE ($\eta^2_{CLE} = 0.0072$). Subplot C shows the Box-and-Whisker plots for the messenger RNA of poI/luxI in both SSA and CLE realizations. Their medians (red line) are practically the same ($\tilde{n}_1^{SSA} = 127.7$ and $\tilde{n}_1^{CLE} = 126.1$ molecules). The Kruskal–Wallis test [23] reveals that there is no statistically significant difference between their medians at the 95.0% confidence level ([test statistic, p-value] = [2.09067 × 10^-6, 1.0]).
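The lumped propensity of this example is a plain algebraic function of the number of (R.A)2 molecules, so it can be evaluated in isolation. The sketch below follows the expression for f(n3) given above; the function name lumped_propensity and the argument names are illustrative and are not taken from the QS/Fb source code.

```cpp
#include <cassert>

// Hill-like lumped propensity
//   f(n3) = (C_I * p_I / dm_I) * (kdlux + alpha_I * n3) / (kdlux + n3).
// Argument names are illustrative.
double lumped_propensity(double n3, double C_I, double p_I, double dm_I,
                         double kdlux, double alpha_I) {
    assert(n3 >= 0.0 && kdlux > 0.0 && dm_I > 0.0);
    return (C_I * p_I / dm_I) * (kdlux + alpha_I * n3) / (kdlux + n3);
}
```

The two limits are easy to read off: with no (R.A)2 present the function equals C_I·p_I/dm_I, and for large n3 it tends to α_I·C_I·p_I/dm_I, so a value of α_I below one corresponds to repression.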
[Figure for Example Note 5: (A) SSA (left) and CLE (right) trajectories versus Time [min]; (B) normalized counts of the SSA and CLE distributions; (C) Box-and-Whisker plots of the SSA and CLE realizations.]
9. Kronecker product
If A is an m × n matrix and B is a p × q matrix, the Kronecker product A ⊗ B is the mp × nq matrix:

$$A \otimes B = \begin{bmatrix} a_{11} B & \ldots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \ldots & a_{mn} B \end{bmatrix}$$

10. Number of realizations. Checking ergodicity
Stochastic simulations have a high computational cost. Hence, it may be very useful to check whether one realization of the CLE model for the whole population is enough to characterize long-term statistics such as the mean, variance, or noise strength of each biochemical species. To this end, perform several realizations of the CLE model. For each realization, select the values at the last time point of each species for every i-th cell. Use a MANOVA test ([d,p,stats] = manova1([A I R RA2 Realization]) in Matlab) to determine whether there are differences in the means of the species among the different realizations. The results of the MANOVA test include d, an estimate of the dimension of the space containing the group means. If d = 0, there is no statistically significant evidence to reject the hypothesis that the realizations have the same mean, with p-value p and the test statistic Wilks' lambda (stats.lambda). In addition, it is interesting to evaluate whether the Mahalanobis distance between the means of each realization is close to zero (values of the elements of stats.gmdist). The MANOVA test, however, assumes the variables are normally distributed. The non-parametric Kruskal–Wallis test can be performed for each one of the species separately to assess whether the data of each realization come from the same distribution (see Example Note 6).
Example Note 6 In the QS/Fb system, three realizations were performed with the same set of parameters and conditions for a population of N cells. The steady-state portion of each species of interest was selected for every i-th cell. The time average was obtained over this steady-state time window, resulting in an averaged number of molecules of each species per cell. The figure below depicts the matrix scatter plot for the three realizations (using a different color for each one). Notice the distributions for all four species are unimodal and well shaped. A MANOVA analysis shows no statistically significant evidence to reject the hypothesis that the three realizations have the same mean and variance, with p-value = [0.5961, 0.6730] and Wilks' lambda λ = [0.9910, 0.9978]. In addition, the Mahalanobis distance between the means of each realization is close to zero (DM = [0.0132, 0.0320, 0.0363]). This analysis confirms that one realization of the simulation of a population with N interconnected cells over a long enough simulation time provides representative long-term moments of the population. Since the MANOVA test assumes normality, the Kruskal–Wallis test was performed for the three realizations in each one of the species. The results are, for A: [statistic, p-value] = [0.610148, 0.737069]; I: [statistic, p-value] = [0.427088, 0.807717]; R: [statistic, p-value] = [2.22063, 0.309456]; and (R.A)2: [statistic, p-value] = [0.344232, 0.841881]. Since all the p-values are greater than or equal to 0.01, there is no statistically significant difference between the medians of the species from the different realizations (with a 99.0% confidence level).
[Figure: Species distributions and matrix scatter plot of A, PI, R, and (R.A)2 for the three realizations.]
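As a rough illustration of the per-species check in Example Note 6, the following Python sketch computes the Kruskal–Wallis statistic H using only the standard library. It is not the chapter's code: the function names are illustrative, no tie correction is applied, and the p-value helper assumes exactly three realizations, in which case H is compared against a chi-square distribution with 2 degrees of freedom and the p-value reduces to exp(-H/2).

```python
import math

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (average ranks for ties, no tie correction)."""
    # Pool all observations, remembering which group each one came from.
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    i = 0
    while i < n_total:
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < n_total and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += avg_rank
        i = j + 1
    weighted = sum(r * r / len(group) for r, group in zip(rank_sums, groups))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)

def p_value_three_groups(h):
    """Chi-square survival function with 2 degrees of freedom: exp(-h / 2)."""
    return math.exp(-h / 2.0)

if __name__ == "__main__":
    # Toy data only, not the QS/Fb simulation output.
    realizations = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    h = kruskal_wallis_h(realizations)
    print(h, p_value_three_groups(h))
```

Applied to the statistic reported above for species A, p_value_three_groups(0.610148) gives about 0.7371, in line with the listed p-value.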
11. OpenFPM Server installation
Installing to the default location places OpenFPM in /usr/local. You need write privileges in that directory; otherwise the installation fails. To choose another directory for the dependencies and the library, use:

    git clone https://github.com/IBirdSoft/openfpm_pdata_2.0.0.git
    cd openfpm_pdata_2.0.0
    ./install -i "/where/you/want/to/put/dependencies" \
        -c "--prefix=/where/you/want/to/install"
    make install

For more information on installation and troubleshooting, the reader is referred to [18] and http://openfpm.mpi-cbg.de/troubleshoot.
12. Random Number Generator (RNG)
Seeding a random number generator is a highly nontrivial task, since one has to eliminate all structure in the input seeds. For unbiased sampling, we use a Mersenne Twister random number generator (MTRNG). To ensure a unique seed on every processor, we mix the processor ID and the user seed with the mixing procedure of a hash-based random number generator, Saru [2], as shown in Example Note 7.
Example Note 7 The code that implements the distributed random number generators for the QS/Fb CLE model reads:

    // Random number generator
    size_t seed1 = v_cl.getProcessUnitID();
    srand(time(NULL));
    size_t seed0 = rand();
    srand(seed0);
    size_t seed2 = getuid() + rand();

    // Saru mixing of the processor ID (seed1) and the user seed (seed2) [2]
    seed2 += seed1 << 16;
    seed1 += seed2 << 11;
    seed2 += ((signed int) seed1) >> 7;
    seed1 ^= ((signed int) seed2) >> 3;
    seed2 *= 0xA5366B4D;
    seed2 ^= seed2 >> 10;
    seed2 ^= ((signed int) seed2) >> 19;
    seed1 += seed2 ^ 0x6d2d4e11;
    seed1 = 0x79dedea3 * (seed1 ^ (((signed int) seed1) >> 14));
    seed2 = (seed1 + seed2) ^ (((signed int) seed1) >> 8);
    size_t MTseed = 0xABCB96F7 + (seed2 >> 1);

    std::mt19937_64 engine(MTseed);  // Use the global key as a seed for the PRNG
    std::normal_distribution<double> NDist(0, 1);  // double Normal distribution with mean 0 and std 1

The above initial mixing does not create any correlation, and MTseed is used to seed the MTRNG (the test harness ensures that the behavior of this seeding is indeed chaotic). This seeding ensures uncorrelated RNG streams on different processor IDs.
13. Time step selection
A Poisson random variable with large enough mean is well approximated by a normal random variable with the same mean and variance. Over a short time interval, the number of firings of a reaction j follows a Poisson distribution with mean n_j δt. If every reaction j is expected to fire many times over [t, t + δt), the number of firings can therefore be approximated by a normal random variable instead of a Poisson one. The CLE approximation thus implies two conditions: (1) δt is small enough to assume a constant propensity during the interval [t, t + δt); and (2) δt is large enough that the expected number of occurrences of each reaction in [t, t + δt) is much larger than 1 [13]. Although the two conditions pull in opposite directions, both can be satisfied simultaneously when molecular population numbers are large. To select an appropriate δt, start from a large initial value. Run the simulation for the whole system with N = 1, or for a computationally affordable subset, using SSA. Check whether the CLE simulation converges and gives positive values for the species. If so, compare the results with those obtained with SSA, used as the gold standard. Keep halving the time step until the CLE simulation converges and agrees well with the SSA results.
14. Storage memory (decimation)
The implementation of the algorithm makes efficient use of memory by saving only the data points corresponding to the previous time and the present time, as mentioned in Example 7. This way, the memory needed for an execution is independent of the length of the simulation and scales as 2N with the number of cells N. When the user needs to keep the full trajectories of all states for every cell, a strategy for saving memory is needed. Storing data points every D time steps is a feasible solution to this problem. The best value for D is the largest one that does not deteriorate the statistical properties. To find the best D value: for one realization, compute the statistics of interest (e.g., mean and noise strength, $\mu_k$ and $\eta_k^2$, respectively) of each species k for each i-th cell. This is D = 1. Double the value of D (D = 2), which is equivalent to taking a sample every other time step, and compute the new values of the statistics of interest. Repeat this process until the statistics of interest start to depart significantly from their initial values (see Example Note 8).
Example Note 8 For the QS/Fb case, the figure below shows how storage memory use decreases as the number of samples is reduced by the decimation process. Decimation to D = 32 yields approximately a 95% reduction of the memory required to save the data and keeps the long-term statistics without significant changes from the initial ones.
[Figure: relative memory usage and time step (min) as a function of the decimation iteration.]
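The decimation search of Note 14 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 5% relative tolerance, and the synthetic trajectory are all hypothetical, the trajectory is assumed to have a nonzero mean (as molecule counts do), and the statistics tracked are the mean and the noise strength (variance divided by the squared mean) of a single species in a single cell.

```python
import random
import statistics

def mean_and_noise(samples):
    """Return (mean, noise strength eta^2 = variance / mean^2) of a trajectory."""
    mu = statistics.fmean(samples)
    return mu, statistics.pvariance(samples, mu) / (mu * mu)

def largest_decimation(trajectory, rel_tol=0.05, max_factor=1024):
    """Double D until the decimated statistics drift more than rel_tol from D = 1."""
    mu0, eta0 = mean_and_noise(trajectory)
    best, d = 1, 2
    while d <= max_factor and len(trajectory) // d >= 2:
        mu, eta = mean_and_noise(trajectory[::d])  # keep every d-th sample
        if abs(mu - mu0) > rel_tol * abs(mu0) or abs(eta - eta0) > rel_tol * eta0:
            break
        best, d = d, 2 * d
    return best

if __name__ == "__main__":
    # Synthetic stand-in for one species in one cell: a noisy, correlated signal.
    rng = random.Random(0)
    x, traj = 1000.0, []
    for _ in range(20000):
        x += 0.05 * (1000.0 - x) + rng.gauss(0.0, 10.0)
        traj.append(x)
    d = largest_decimation(traj)
    print("D =", d, "stored fraction =", 1.0 / d)
```

The stored fraction of samples is 1/D, so each doubling of D halves the memory needed for the saved trajectories. In practice the check would be repeated for every species and cell, keeping the smallest D found.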
15. Matlab Parameter Sweep
Example Note 9 For the QS/Fb case, a parameter sweep can be carried out as follows.

     96  % Generate matrix X with all parameter combinations
     97  D_v = [2 0];
     98  kA_v = [0.04 0];
     99  pI_v = [0.2 0.4 2 4 10];
    100  kdLux_v = [10 100 200 500 1000 2000];
    101  alphaI_v = [0.01 0.1];
    102  dR_v = [0.02 0.07 0.2];
    103  pR_v = [0.2 0.4 2 4 10];
    104
    105  X = transpose(combvec(D_v, pI_v, kdLux_v, alphaI_v, dR_v, pR_v, kA_v));
    106  %%
    107  for xpop = 1:size(X,1)
    108  % With and without QS
    109  D = X(xpop,1);
    110  % LuxI
    111  pI = X(xpop,2);      % translation rate of LuxI mRNA [1/min]. b*dmI from Weber. [3.0928 - 6.1856]
    112  kdLux = X(xpop,3);   % dissociation cte (LuxR.A)2 to promoter [molecules], Bucler et al [1 1000] nM
    113  alphaI = X(xpop,4);  % leakage of the repressor Plux 0.01-0.1
    114
    115  % LuxR
    116  dR = X(xpop,5);
    117  pR = X(xpop,6);
    118  kA = X(xpop,7);
    119
    120  % Writing parameters to a struct in the proper order to be read by the
    121  % langevin C++ program.
    122  param_out = struct('dI', dI, 'pI', pI, 'kI', kI, 'pN_luxI', pN_luxI, 'dmI', dmI, ...
             'kdLux', kdLux, 'alphaI', alphaI, 'dR', dR, 'pR', pR, 'cR', cR, 'dmR', dmR, ...
             'k_1', k_1, 'kd1', kd1, 'k_2', k_2, 'kd2', kd2, 'dRA2', dRA2, 'kA', kA, ...
             'dA', dA, 'D', D, 'Vcell', Vcell, 'Vext', Vext, 'dRA', dRA, 'dAe', dAe);
    123
    124
    125  % Write a file named param.dat, with the struct param_out
    126  struct2csv_append(param_out, 'param.dat', 'W');
    127
    128
    129  % Executing the external C++ program langevin with 4 cores, and with the file param.dat as input
    130  command = ['mpirun -np 4 ./langevin param.dat ' num2str(Ncells) ' ' num2str(ruido) ' ' ...
             num2str(ahl_e_0) ' ' num2str(STO) ' ' num2str(HISTO) ' ' num2str(TEMPO) ' ' num2str(TEMPOT)];
    131  system(command);

In Lines 96–105, a matrix with all the combinations of parameters is generated from the individual parameter vectors using the combvec command. In Lines 107–118, a loop is executed in which the parameter combinations are recovered one by one; they are saved into the param.dat file in Lines 122 and 126. The langevin program is executed with its corresponding arguments in Lines 130–131. The remaining part of the code (Lines 133–173) reads the output.dat file and imports the data into a Matlab variable Data, which is saved into a Data.mat Matlab binary file for further analysis. To execute this code in Linux, it is necessary to declare an alias for Matlab:

    alias matlabc='env LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libstdc++.so.6 /matlab/installation/directory/R2014a/bin/matlab -nodesktop -nojvm -nosplash'

This way, Matlab has the location of the C++ library in its environment.
16. Python execution
Another option for executing langevin is to run it through a Python script. An example for the QS/Fb case is shown below.
Example Note 10 For the QS/Fb case, the Python 3 code of Simulate_PCLE.py reads:

    75  # Arguments of the langevin program.
    76  Ncells = 240
    77  ruido = 0.15
    78  ahl_e_0 = 0/Vc
    79  STO = 1     # STO = 0 is deterministic simulation, and STO = 1 is stochastic.
    80  HISTO = 0   # HISTO = 1 for obtaining the steady state histogram
    81  TEMPO = 1   # TEMPO = 1 For obtaining the temporal response of the means and std
    82  TEMPOT = 0  # TEMPOT = 1 For obtaining the temporal response of all cells
    83
    84  print('Saving params...')
    85  # Params list
    86  param_out = [dI, pI, kI, pN_luxI, dmI, kdLux, alphaI, dR, pR, cR, dmR, k_1, kd1, k_2, kd2, dRA2, kA, dA, D, Vcell, Vext, dRA, dAe]
    87
    88  # Write a file named param.dat, with the list param_out
    89  filename = 'param.dat'
    90  with open(filename, 'w') as myfile:
    91      wr = csv.writer(myfile, quoting=csv.QUOTE_NONE)
    92      wr.writerow(param_out)
    93
    94  print('Params saved in ' + filename)
    95  # Executing the external C++ program langevin with 4 cores, and with the file param.dat as input
    96  command = 'mpirun -np 4 ./langevin param.dat ' + str(Ncells) + ' ' + str(ruido) + ' ' + str(ahl_e_0) + ' ' + str(STO) + ' ' + str(HISTO) + ' ' + str(TEMPO) + ' ' + str(TEMPOT)
    97  print(command)
    98  os.system(command)

In Lines 1–71, the necessary parameters are assigned to Python variables (code not shown; refer to the repository for details). In Lines 75–82, the execution parameters are set. In Line 86, the parameters are gathered into a list, which is saved into the param.dat file in Lines 89–92. The langevin program is executed with its corresponding arguments in Lines 96–98 using the os.system Python function.
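The Matlab sweep of Example Note 9 can also be driven from Python. The sketch below is one possible way to do it and is not taken from the chapter's repository: it uses itertools.product in place of combvec (feeding the grids in reverse so that the first parameter varies fastest, as in the Matlab matrix X) and builds an argument list for subprocess.run instead of a shell string. The helper names and the dry_run flag are hypothetical; the grids and run arguments are copied from the listings above, with ahl_e_0 taken as 0.

```python
import itertools
import subprocess

# Parameter grids copied from the Matlab sweep (Example Note 9).
GRID = {
    "D": [2, 0],
    "pI": [0.2, 0.4, 2, 4, 10],
    "kdLux": [10, 100, 200, 500, 1000, 2000],
    "alphaI": [0.01, 0.1],
    "dR": [0.02, 0.07, 0.2],
    "pR": [0.2, 0.4, 2, 4, 10],
    "kA": [0.04, 0],
}

def all_combinations(grid):
    """Yield one dict per parameter combination, first key varying fastest."""
    names = list(grid)
    # product() varies its last argument fastest, so feed the grids reversed.
    for values in itertools.product(*(grid[n] for n in reversed(names))):
        yield dict(zip(reversed(names), values))

def langevin_command(ncells, ruido, ahl_e_0, sto, histo, tempo, tempot, ncores=4):
    """Build the argument list for the langevin program."""
    args = [ncells, ruido, ahl_e_0, sto, histo, tempo, tempot]
    return ["mpirun", "-np", str(ncores), "./langevin", "param.dat"] + [str(a) for a in args]

def run_sweep(grid, dry_run=True):
    """Loop over all combinations; only launch langevin when dry_run is False."""
    count = 0
    for combo in all_combinations(grid):
        # Writing param.dat for this combination is omitted here; it would
        # merge `combo` with the fixed parameters as in the listing above.
        cmd = langevin_command(240, 0.15, 0.0, 1, 0, 1, 0)
        if not dry_run:
            subprocess.run(cmd, check=True)
        count += 1
    return count

if __name__ == "__main__":
    print(run_sweep(GRID), "combinations")
```

With these grids the sweep contains 2 x 5 x 6 x 2 x 3 x 5 x 2 = 3600 combinations, so a dry run is a cheap way to check the loop before committing compute time.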
Acknowledgement
This work is partially supported by grant MINECO/AEI, EU DPI2017-82896-C2-1-R.

References
1. Acar M, Mettetal JT, van Oudenaarden A (2008) Stochastic switching as a survival strategy in fluctuating environments. Nat Genet 40(4):471–475
2. Afshar Y, Schmid F, Pishevar A, Worley S (2013) Exploiting seeding of random number generators for efficient domain decomposition parallelization of dissipative particle dynamics. Comput Phys Commun 184(4):1119–1128
3. Andrews SS, Dinh T, Arkin AP (2009) Stochastic models of biological processes. Springer New York, New York, pp 8730–8749
4. Basak S, Chabakauri G (2010) Dynamic mean-variance asset allocation. Rev Financ Stud 23(8):2970–3016
5. Boada Y, Vignoni A, Picó J (2017) Engineered control of genetic variability reveals interplay among quorum sensing, feedback regulation, and biochemical noise. ACS Synth Biol 6(10):1903–1912
6. Boada Y, Vignoni A, Picó J (2019) Multiobjective identification of a feedback synthetic gene circuit. IEEE Trans Control Syst Technol 1–16
7. Cai L, Friedman N, Xie XS (2006) Stochastic protein expression in individual cells at the single molecule level. Nature 440(7082):358–362
8. Cao Y, Gillespie DT, Petzold LR (2005) The slow-scale stochastic simulation algorithm. J Chem Phys 122(1):014116
9. Chalancon G, Ravarani CN, Balaji S, Martinez-Arias A, Aravind L, Jothi R, Madan Babu M (2012) Interplay between gene expression noise and regulatory network architecture. Trends Genet 28(5):221–232
10. Chellaboina V, Bhat S, Haddad M, Bernstein D (2009) Modeling and analysis of mass-action kinetics. IEEE Control Syst 29(4):60–78
11. Eldar A, Elowitz MB (2010) Functional roles for noise in genetic circuits. Nature 467(7312):167–173
12. Elowitz MB, Levine AJ, Siggia ED, Swain PS (2002) Stochastic gene expression in a single cell. Science 297(5584):1183–1186
13. Gillespie DT (2000) The chemical Langevin equation. J Chem Phys 113:297–306
14. Gillespie DT (2007) Stochastic simulation of chemical kinetics. Annu Rev Phys Chem 58:35–55
15. Higham DJ (2001) An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Rev 43(3):525–546
16. Higham DJ (2008) Modeling and simulating chemical reactions. SIAM Rev 50(2):347–368
17. Hilfinger A, Paulsson J (2011) Separating intrinsic from extrinsic fluctuations in dynamic biological systems. Proc Natl Acad Sci 108(29):12167–12172
18. Incardona P, Leo A, Zaluzhnyi Y, Ramaswamy R, Sbalzarini IF (2019) OpenFPM: a scalable open framework for particle and particle-mesh codes on parallel computers. Comput Phys Commun 241:155–177
19. Jones DL, Brewster RC, Phillips R (2014) Promoter architecture dictates cell-to-cell variability in gene expression. Science 346(6216):1533–1536
20. Kazeev V, Khammash M, Nip M, Schwab C (2014) Direct solution of the chemical master equation using quantized tensor trains. PLoS Comput Biol 10(3):e1003359
21. Khalil HK (1996) Nonlinear systems, 3rd edn. Prentice-Hall, New Jersey
22. Kokotovic P, Khalil H, O'Reilly J (1986) Singular perturbation methods in control: analysis and design. Academic Press, Orlando
23. Kruskal WH, Wallis WA (1952) Use of ranks in one-criterion variance analysis. J Am Stat Assoc 47(260):583–621
24. Labhsetwar P, Cole JA, Roberts E, Price ND, Luthey-Schulten ZA (2013) Heterogeneity in protein expression induces metabolic variability in a modeled Escherichia coli population. Proc Natl Acad Sci USA 110(34):14006–14011
25. Mélykúti B, Hespanha JP, Khammash M (2014) Equilibrium distributions of simple biochemical reaction systems for time-scale separation in stochastic reaction networks. J R Soc Interface 11(97):20140054
26. Munsky B, Khammash M (2006) The finite state projection algorithm for the solution of the chemical master equation. J Chem Phys 124(4):044104
27. Murray JD (1989) Mathematical biology. Springer, Berlin
28. Novick A, Weiner M (1957) Enzyme induction as an all-or-none phenomenon. Proc Natl Acad Sci USA 43(7):553
29. Ostrenko O, Incardona P, Ramaswamy R, Brusch L, Sbalzarini IF (2017) pSSAlib: the partial-propensity stochastic chemical network simulator. PLoS Comput Biol 13(12):e1005865
30. Raj A, van Oudenaarden A (2008) Nature, nurture, or chance: stochastic gene expression and its consequences. Cell 135(2):216–226
31. Rao CV, Arkin AP (2003) Stochastic chemical kinetics and the quasi-steady-state assumption: application to the Gillespie algorithm. J Chem Phys 118(11):4999–5010
32. Raser JM, O'Shea EK (2005) Noise in gene expression: origins, consequences, and control. Science 309(5743):2010–2013
33. Ruess J, Lygeros J (2015) Moment-based methods for parameter inference and experiment design for stochastic biochemical reaction networks. ACM Trans Model Comput Simul 25(2):8
34. Samoilov M, Plyasunov S, Arkin AP (2005) Stochastic amplification and signaling in enzymatic futile cycles through noise-induced bistability with oscillations. Proc Natl Acad Sci USA 102(7):2310–2315
35. Schnoerr D, Sanguinetti G, Grima R (2017) Approximation and inference methods for stochastic biochemical kinetics: a tutorial review. J Phys A: Math Theor 50(9):093001
36. Sutton S (2006) Measurement of cell concentration in suspension by optical density. Microbiology 585:210-8336
37. Swain PS, Elowitz MB, Siggia ED (2002) Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc Natl Acad Sci 99(20):12795–12800
38. Van Kampen N (2011) Stochastic processes in physics and chemistry. North-Holland Personal Library, Elsevier Science
39. Wilkinson DJ (2006) Stochastic modelling for systems biology. Mathematical and computational biology series, 2nd edn. Chapman and Hall/CRC, London
40. Wilkinson DJ (2009) Stochastic modelling for quantitative description of heterogeneous biological systems. Nat Rev Genet 10(2):122–133
41. Woods ML, Leon M, Perez-Carrasco R, Barnes CP (2016) A statistical approach reveals designs for the most robust stochastic gene oscillators. ACS Synth Biol 5(6):459–470
42. Zagaris A, Kaper HG, Kaper TJ (2004) Analysis of the computational singular perturbation reduction method for chemical kinetics. J Nonlinear Sci 14(1):59–91
Chapter 3
Using Models to (Re-)Design Synthetic Circuits
Giselle McCallum and Laurent Potvin-Trottier

Abstract
Mathematical models play an important role in the design of synthetic gene circuits, by guiding the choice of biological components and their assembly into novel gene networks. Here, we present a guide for biologists to build and utilize models of gene networks (synthetic or natural) to analyze dynamical properties of these networks while considering the low numbers of molecules inside cells that result in stochastic gene expression. We start by describing how to write down a model and discussing the level of detail to include. We then briefly demonstrate how to simulate a network's dynamics using deterministic differential equations that assume high numbers of molecules. To consider the role of stochastic gene expression in single cells, we provide a detailed tutorial on running stochastic Gillespie simulations of a network, including instructions on coding the Gillespie algorithm with example code. Finally, we illustrate how a combination of quantitative experimental characterization of a synthetic circuit and mathematical modeling can guide the iterative redesign of a synthetic circuit to achieve the desired properties. This is shown using a classic synthetic oscillator, the repressilator, which we recently redesigned into the most precise and robust synthetic oscillator to date. We thus provide a toolkit for synthetic biologists to build more precise and robust synthetic circuits, which should lead to a deeper understanding of the dynamics of gene regulatory networks.

Key words Synthetic gene circuits, Mathematical modeling, Dynamical gene network, Stochastic simulations, Gillespie algorithm, Synthetic oscillator, Synthetic biology, Biological oscillations
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_3, © Springer Science+Business Media, LLC, part of Springer Nature 2021

1 Introduction
Models are simplified representations of the world and a core component of science. They help us understand how the world works, for example, via simple mathematical equations that approximately describe the movement of objects under a range of conditions. They can also help us design and engineer systems, for example by using mathematical models to ensure an electronic circuit will function as intended. Models have been at the core of synthetic biology since its beginnings in 2000 with the publication of two gene circuits: an oscillator called the repressilator [1] and a bistable toggle switch [2]. They have helped define synthetic biology as a new field, in which biologists moved from modifying existing biological systems to designing and engineering novel gene circuits. In these seminal papers, mathematical models were used to determine the required parameters for the oscillatory or bistable behavior of the systems and guided the choice of biological parts that met these requirements.
Even though this chapter will focus on the "redesign" of synthetic circuits, we want to emphasize that modeling should be used throughout the design-build-test-learn cycle of synthetic biology. Models should be created and analyzed before building a circuit: even with advances in DNA synthesis and assembly, it is still much faster to model a circuit in silico than to build it, and there is no point in building a circuit if it cannot work under any conditions. After building and testing a circuit, models can help to understand why the circuit might not behave as intended and to learn about the dynamics of the assembled gene networks. Measurements of a circuit's properties need to be reliable, as it is necessary to distinguish and isolate variability coming from the measurements, the environment, or the synthetic circuit itself. This is a key component of the redesign: experimental data will only be useful if it actually represents a circuit's behavior. As a final warning, we would like to remind the reader that if the model fits the data, it does not mean that it is "correct." It simply means that it is consistent with the data and cannot be ruled out, but it is possible (and even likely) that many more models can fit the data equally well. Nevertheless, these models can be useful to make predictions about a system's function, informing processes such as redesigning features of the circuit to improve its behavior.
Modeling of gene circuits expressed in vivo is complicated by the fact that cells are not macroscopic-sized test tubes: their size and the finite, small number of molecules they contain mean that the chemical reactions inside a cell will happen by chance when molecules collide with each other. Thus, the number of molecules, such as mRNAs and proteins, will vary over time, meandering around the mean value. This is due to stochastic chemistry and is referred to as stochastic gene expression. These random fluctuations are important to take into account in the design and modeling of synthetic circuits, as they do not merely make the circuit's behavior "noisier," but can have counterintuitive effects, such as turning a non-oscillating system into an oscillating one [3, 4]. In 2016, using a combination of careful single-cell microscopy experiments and stochastic modeling, we took the original repressilator circuit that appeared to have rather poor oscillatory properties and iteratively transformed it into by far the most precise and robust synthetic oscillator to date [5]. A microfluidic device nicknamed the "mother machine" [6–8] enabled us to track single cells under carefully controlled growth conditions and separate environmental noise from variability intrinsic to the circuit. Modeling of the circuit using these measurements while considering the stochastic nature of the chemical reactions enabled us to build a version of the circuit with improved precision and robustness. Many synthetic biology projects are left at the proof-of-principle stage, where the function of the circuit is approximately what was originally intended. We believe that understanding why these circuits do not function exactly as intended is as useful as making them in the first place: it will provide a greater understanding of gene regulatory networks while leading to precise and robust systems that can be used in applications.
In this chapter, we aim to provide a guide to (re-)designing gene circuits using modeling, using our previous work with the repressilator as an example. Obviously, there is no universal step-by-step protocol to redesign a synthetic circuit. However, we hope to provide a useful guide to synthetic biologists from various scientific backgrounds to help them incorporate modeling while building and analyzing their circuits. To complement existing resources, we will focus on dynamic systems and stochastic analysis [9–13]. While this chapter focuses on synthetic circuits, the same modeling approach is just as useful for natural gene circuits. We will start by explaining how to write down a simple model and use it to write down the differential equations representing a system's dynamics. Then, we will discuss how to model the effects of stochastic gene expression on the circuit by performing stochastic (Gillespie) simulations and understanding sources of noise in a system. In the spirit of this series, we will provide a step-by-step protocol (and code) to run Gillespie simulations, which provide an exact representation of the stochastic system while being simple and fast to run, making them tremendously useful. Finally, we will describe an example of how these models can be used to guide the redesign of synthetic circuits.

2 Materials
All models used in this chapter can be solved or simulated using built-in or custom-coded functions in programming languages such as Matlab, Octave, Python (numpy), Mathematica, C, and Fortran. All code for the models discussed in Subheading 3 can be found at our source code repository (https://github.com/potvinlab/MiMB_circuitmodeling.git) and can be easily run in Matlab or Octave, an open-source alternative. While software and packages to run stochastic simulations are already available in many languages, writing these algorithms from scratch takes about as long as understanding existing code and is much more instructive (see the list of software on the Gillespie Wikipedia page [14] and the Wolfram Alpha Demonstrations Project [15]).

3 Methods
3.1 Writing Down a Model
3.1.1 Abstracting the Circuit: Sketching Its Diagram
The first step in writing down a model is to record all the interactions between the molecules in the circuit or impacting it, and the chemical reactions that create and eliminate them in a diagram (sometimes called the network topology). Here, we must consider the level of detail we want to include in the model. While it is important to include enough details to accurately reflect the underlying processes we wish to learn about, too much detail can weigh down the model and distract from the effects that particular variables can have on the behavior of the system. For example, considering relativity while calculating the movement of a ball through the air will obscure the understanding of the simple system while adding futile precision. Consider the repressilator circuit: in this network, three genes encode different repressor proteins (LacI, TetR, and λ CI), each of which represses the expression of the next gene in the circuit in a single feedback loop. Because of the odd number of repressors, this effectively leads to autorepression with a delay, producing out-of-phase oscillations of the three proteins. The simplest model of this circuit contains only the repressors as variables and considers that the proteins directly repress each other’s production (Fig. 1a). While this model can still lead to oscillations, it ignores many important biological parameters, such as transcription rates and difference between mRNAs’ and proteins’ half-lives (see Note 1). We could also model fluctuations in gene copies due to plasmid copy number, the switching of the promoter between the active and inactive state, the number of RNA polymerases and ribosomes, multimerization of the repressors, and enzymatic degradation of the repressor via proteases—all biological parameters with an impact on our circuit (Fig. 1b). However, these details will not necessarily provide valuable insight on the behavior of our circuit and will add many new variables and parameters to the model. 
While they may not be included in the equations, keeping these details in mind is helpful when analyzing the model and circuit's behavior, as they might help explain unexpected results. Our favorite approach is to start with the simplest model that can lead to some understanding of the system. Then, complexity can be progressively added if it is necessary to explain the observed behavior or to test the effects of a particular component of the system. It is important to consider that in biology there are still many unknowns (and unknown unknowns), and that adding many parameters to the model will not make it a better representation of reality. Powerful approaches have been developed to rigorously model systems with many unknown interactions, but they are beyond the scope of this chapter [16, 17]. In our example, we will include mRNA (m) and proteins (P) as variables in our network. The transcription of an mRNA is repressed by the previous protein in the network (i.e., P1 inhibits the transcription of m2), and proteins are translated proportionally to the number of mRNAs. Both mRNA and proteins are depleted from the cell by a combination of dilution due to cell growth and active degradation (Fig. 1c). We will account for the dimerization of repressors and their affinity to their promoters in Subheading 3.1.2.

Fig. 1 Network topology diagram. Examples of possible network topology diagrams displaying the interactions between molecular species in the repressilator. Models can range from being very minimalistic (a) to including many detailed processes and interactions between your circuit and the cell (b). How much detail you should include depends on what you want to learn, but starting simple is key. (c) We have chosen a simple model of the repressilator that includes both mRNA and proteins as variables

3.1.2 Mass Action Equations
The next step in building our model is to write an expression describing each reaction in our system according to the law of mass action, which states that the rate of a reaction is proportional to the concentration of the reactants. For example, for the reaction $x + y \rightarrow z$, the rate of production of z is calculated as $\frac{d[z]}{dt} = k_1 [x][y]$, where $k_1$ is a constant known as the mass action rate constant that indicates the rate per reactant (or proportionality) of the reaction. Intuitively, this means that if you have twice as many molecules of one reactant, the reaction rate will be twice as high because collisions between molecules are twice as likely to happen. The mass action equation is then written as $x + y \xrightarrow{k_1} z$.
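As a small numerical check of the proportionality described above, the following Python sketch evaluates the mass action rate for the reaction x + y → z. The function name, the rate constant, and the concentrations are illustrative values chosen for this example, not parameters from the chapter.

```python
def mass_action_rate(k1, x, y):
    """Rate of production of z for x + y -> z under mass action: k1 * [x] * [y]."""
    return k1 * x * y

if __name__ == "__main__":
    k1 = 0.5  # hypothetical rate constant
    base = mass_action_rate(k1, 10.0, 4.0)
    print(base)                                    # rate at [x] = 10, [y] = 4
    print(mass_action_rate(k1, 20.0, 4.0) / base)  # doubling [x] alone
    print(mass_action_rate(k1, 20.0, 8.0) / base)  # doubling both reactants
```

Doubling one reactant doubles the rate, while doubling both reactants quadruples it, since the rate is proportional to each concentration separately.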
We can thus write the mass action equations for all the chemical reactions included in our model. For the repressilator, the equations are:

$$\emptyset \xrightarrow{f(P_{i-1})} m_i$$
$$m_i \xrightarrow{\beta_m} \emptyset$$
$$m_i \xrightarrow{\lambda_P} P_i + m_i$$
$$P_i \xrightarrow{\beta_P} \emptyset$$

for each repressor ($i = 1, 2, 3$, where $P_0 = P_3$ by definition). $\lambda_P$ is the rate of translation of repressor ($P_i$) per mRNA per unit time, and $\beta_m$ and $\beta_P$ are the rates of elimination of mRNA and protein, respectively (determined by dilution due to cell growth and active degradation). The rate of transcription of $m_i$ is the value of the function $f(P_{i-1})$ describing repression of the promoter by the previous protein ($P_{i-1}$) in the network. Here, we decide to use the following function, called a Hill function:

$$f(P_{i-1}) = \frac{\lambda_m K^h}{K^h + P_{i-1}^h}$$
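As a minimal numerical sketch of the Hill function and of the net rates implied by the four reactions, the Python snippet below uses the order-of-magnitude estimates listed in Table 1 (λm = 4.1 mRNA/min, K = 7 proteins, h = 2, βm = 0.1/min, λP = 1.8 proteins/mRNA/min, βP = 0.027/min) as default values. The function names are illustrative and the snippet is an assumption-laden sketch, not the code from the chapter's repository.

```python
def hill_repression(p_prev, lam_m=4.1, K=7.0, h=2.0):
    """Transcription rate f(P_{i-1}) = lam_m * K**h / (K**h + P_{i-1}**h)."""
    return lam_m * K**h / (K**h + p_prev**h)

def repressilator_rhs(m, p, lam_m=4.1, K=7.0, h=2.0,
                      beta_m=0.1, lam_p=1.8, beta_p=0.027):
    """Net production rates (dm/dt, dP/dt) for the three genes.

    m and p are length-3 lists; gene i is repressed by protein i-1,
    and Python's index -1 implements the convention P_0 = P_3.
    """
    dm = [hill_repression(p[i - 1], lam_m, K, h) - beta_m * m[i] for i in range(3)]
    dp = [lam_p * m[i] - beta_p * p[i] for i in range(3)]
    return dm, dp

if __name__ == "__main__":
    for level in (0.0, 7.0, 70.0):
        print(level, hill_repression(level))
    print(repressilator_rhs([1.0, 0.0, 0.0], [0.0, 0.0, 7.0]))
```

Without repressor the function returns the maximal rate λm, at P = K it returns half of it, and well above K transcription is almost fully shut off.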
The Hill equation is classically used to describe cooperative binding of ligands to a receptor and is useful in describing many biological processes, as it describes nonlinear switching of a system between 0 and 1 (a fully "off" and fully "on" state). In our model, the Hill function is used to approximate the (possibly partial) cooperative binding of the repressor proteins to their promoters (see Note 2). Here, h is the Hill coefficient representing this cooperative binding. The parameter K in the Hill function is the threshold of repression, that is, the repressor level at which the promoter is half-repressed ($f(K) = \lambda_m/2$); it accounts for the affinity that a repressor has for its binding site in a promoter. $\lambda_m$ is the maximal transcription rate, reached when there is no repression of the gene encoding the mRNA. Using the mass action equations, we can now write the ordinary differential equations (ODEs) that describe the dynamics of our system and find a deterministic solution of the system (see Subheading 3.2), assuming that the numbers of molecules are very high. This may not be an accurate approximation in all situations, but it can provide an intuition about the system's behavior. In order to consider the effects of the finite number of molecules, we can also simulate the reactions stochastically (see Subheading 3.3).

3.1.3 Parameter Estimation
Before proceeding, it is useful to obtain an order-of-magnitude estimate for the biological parameters of our model. There are a few resources that can be extremely helpful with this task, such as BioNumbers [18], an online database containing molecular biology numbers from many cell types and species (from peer-reviewed sources), and Ron Milo and Rob Phillips’ book Cell Biology by the Numbers [19]. For example, in our model (assuming that our network is being expressed in Escherichia coli), several studies listed on BioNumbers have shown translation rates of ~8 aa/mRNA/s [20, 21], which we can adapt to the units required for our model (see Note 3). Table 1 contains a complete list of parameters in the example repressilator model, and order-of-magnitude estimates of these parameters that serve as a starting point for our simulations.

Table 1 Estimated parameter values for the mRNA-protein repressilator model

| Parameter | Description | Est. value | Units | Source |
|---|---|---|---|---|
| $\lambda_{m_i}$ | Max transcription rate | 4.1; 150 | mRNA min^-1; mRNA τp^-1 (see footnote a) | [49] |
| K | Threshold of repression (½ molecules are bound to promoters) | 7 | proteins | [5] |
| h | Hill coefficient of cooperativity | 2 | Unitless | [50] |
| $\beta_m$ | mRNA elimination rate (combination of degradation and dilution) | 0.1; 3.6 | min^-1; τp^-1 (see footnote a) | [51]b |
| $\lambda_{P_i}$ | Translation rate | 1.8; 65 | proteins mRNA^-1 min^-1; proteins mRNA^-1 τp^-1 (see footnote a) | [21, 51]b |
| $\beta_P$ | Combination of dilution due to cell growth and active degradation of protein | 0.027; 1 | min^-1; τp^-1 (see footnote a) | |

a For parameter scans discussed in Subheading 3.3.8, we have set all values for rate constant parameters to units of inverse protein lifetime, τp^-1 (see Note 1), by scaling rate parameter values assuming a cell division time (and protein half-life) of 25 min, with no degradation of our proteins by proteases (making τP = 36 min)
b Values were found on BioNumbers [18]

It is easy to vary these parameters by orders of magnitude in silico, but it is important to remember what they physically represent: rates of chemical reactions in a cell. As such, they must remain physically realistic. For example, binding constants cannot be so high that molecules would need to diffuse faster than they would in water. Note that although testing a range of parameters computationally is easier than experimentally, the dimensionality (and therefore the computational load) expands as the number of parameters increases (if you want to test a range of n values for each of x parameters, you must run $n^x$ simulations). This provides further motivation to keep the model as simple as possible.
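The rescaling described in footnote a of Table 1 and the growth of a parameter scan can be illustrated with a short calculation. This is a minimal sketch, not the authors' code: the dictionary keys and the helper `grid_size` are names chosen here for illustration, and the only inputs are the per-minute values from Table 1 and the assumed 25 min half-life.

```python
import math

# Rate constants from Table 1, in per-minute units.
rates_per_min = {
    "lambda_m": 4.1,    # mRNA / min
    "beta_m": 0.1,      # 1 / min
    "lambda_P": 1.8,    # proteins / mRNA / min
    "beta_P": 0.027,    # 1 / min
}

half_life_min = 25.0                       # protein half-life assumed in Table 1
tau_P_min = half_life_min / math.log(2)    # protein lifetime, about 36 min

# Multiplying a per-minute rate by tau_P expresses it per protein lifetime.
rates_per_tau = {name: value * tau_P_min for name, value in rates_per_min.items()}


def grid_size(n_values, n_parameters):
    """Number of simulations needed to test n values for each of x parameters."""
    return n_values ** n_parameters


for name, value in rates_per_tau.items():
    print(f"{name}: {value:.3g} per tau_P")
print("simulations for 10 values of 4 parameters:", grid_size(10, 4))
```

The rescaled values agree with the second entries in Table 1 to within rounding, which is all that an order-of-magnitude estimate requires.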
3.2 Deterministic Solution
3.2.1 Writing Ordinary Differential Equations (ODEs)
Let’s start by assuming that our system is evolving in a macroscopic test tube in which all of our reactants are present in high numbers, and the system is homogeneously mixed. Under this assumption, we can write a set of deterministic ordinary differential equations (ODEs) that describe the dynamics of our system. Here, we write one equation for each species in the system, describing its rate of change (i.e., its production rate minus its overall depletion rate). For example, for a molecule x produced at a constant rate and eliminated at a constant rate per molecule,

$$\frac{dx}{dt} = \lambda - \beta\, x(t)$$

where λ and β are the production and elimination rate constants, respectively. It is often useful to know the concentration of your components at steady state. To get this value, we can simply set $dx/dt = 0$ (at steady state, concentration does not change) and solve the equation for x. Here, the steady-state value of x is (intuitively) determined by its rate of production divided by its elimination rate constant: $x_{ss} = \lambda/\beta$.

We can use the same strategy to build a set of ODEs describing the repressilator. In the spirit of simplicity, let’s assume for now that the parameters are roughly equal for each mRNA and repressor. We can therefore use the same parameter values for all equations in our symmetrical system:

$$\frac{dm_i}{dt} = \frac{\lambda_m K^h}{K^h + P_{i-1}^h} - m_i \beta_m$$

$$\frac{dP_i}{dt} = m_i \lambda_P - P_i \beta_P$$

where i is the gene index as defined above. In total, our model will consist of six ODEs with two terms each, describing the rate of change of three mRNAs and three repressors over time.
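For the single-species example, the ODE can be solved in closed form, which makes the approach to steady state explicit. The worked solution here is standard for a linear first-order ODE; the initial value $x_0 = x(0)$ is a symbol introduced for this illustration.

```latex
\frac{dx}{dt} = \lambda - \beta x
\quad\Longrightarrow\quad
x(t) = \frac{\lambda}{\beta} + \left(x_0 - \frac{\lambda}{\beta}\right) e^{-\beta t},
\qquad
\lim_{t \to \infty} x(t) = \frac{\lambda}{\beta} = x_{ss}
```

Whatever the initial value, the distance from steady state shrinks by a factor of $e$ every $1/\beta$ time units, so the elimination rate constant also sets how quickly the steady state is reached.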
3.2.2 Solving the System of ODEs
A system of ODEs describing nonlinear biochemical networks cannot usually be solved analytically. However, for a given set of parameters and initial values for the components, a well-behaved system can be solved relatively easily using a numerical solver. Such solvers are available in most scientific programming environments (e.g., MATLAB (see code at https://github.com/potvinlab/MiMB_circuitmodeling.git), Python). For example, using the ode23 function built into MATLAB, we can solve our system of equations with the parameters in Table 1 over a specified time. In Fig. 2a, we can see that for our estimated parameter values and chosen set of initial conditions, our system exhibits sustained oscillations. For more detailed information on ODE models of biological networks, please refer to the chapter in this book titled Modelling frameworks: Ordinary Differential Equations.
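As an illustration of the numerical approach, the six repressilator ODEs can be integrated with a few lines of standard-library Python. This is a sketch under stated assumptions rather than the code in the authors' repository: it swaps in a hand-written fixed-step fourth-order Runge-Kutta integrator in place of ode23, uses the τp-scaled values from Table 1, and picks an arbitrary asymmetric initial state. The function names `repressilator_rhs` and `rk4` are chosen here for illustration.

```python
def repressilator_rhs(state, lam_m, lam_p, beta_m, beta_p, K, h):
    """Right-hand side of the six repressilator ODEs.

    state = [m1, m2, m3, P1, P2, P3]; gene i is repressed by protein i-1,
    with P0 taken to be P3 (Python's index -1 gives exactly this wrap-around).
    """
    m, P = state[:3], state[3:]
    dm = [lam_m * K**h / (K**h + P[i - 1]**h) - beta_m * m[i] for i in range(3)]
    dP = [lam_p * m[i] - beta_p * P[i] for i in range(3)]
    return dm + dP


def rk4(rhs, state, t_end, dt, **params):
    """Integrate with a fixed-step classical Runge-Kutta scheme."""
    times, trace = [0.0], [list(state)]
    for k in range(int(round(t_end / dt))):
        k1 = rhs(state, **params)
        k2 = rhs([s + 0.5 * dt * a for s, a in zip(state, k1)], **params)
        k3 = rhs([s + 0.5 * dt * a for s, a in zip(state, k2)], **params)
        k4 = rhs([s + dt * a for s, a in zip(state, k3)], **params)
        state = [s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
        times.append((k + 1) * dt)
        trace.append(state)
    return times, trace


# Table 1 values in units of protein lifetime (tau_P).
params = dict(lam_m=150.0, lam_p=65.0, beta_m=3.6, beta_p=1.0, K=7.0, h=2.0)
# The initial state must be asymmetric: identical starting values for the three
# genes keep the trajectory on the symmetric solution, which cannot oscillate.
times, trace = rk4(repressilator_rhs, [0.0, 0.0, 0.0, 10.0, 0.0, 0.0],
                   t_end=40.0, dt=0.005, **params)
P1 = [state[3] for state in trace]
print("P1 range over the second half of the run:",
      min(P1[len(P1) // 2:]), max(P1[len(P1) // 2:]))
```

An adaptive solver such as ode23 chooses its own step size; the fixed step used here is small compared with the fastest rate in the model, which is sufficient for this well-behaved system but would need checking for other parameter sets.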
[Figure 2: panels (a) Deterministic, (b) Stochastic, and (c) zoomed-in view; axes: Copy Number versus Time (τp); traces P1, P2, P3]

Fig. 2 Time traces for the proteins of the repressilator. (a) Deterministic numerical solution to the system of ODEs, solved with the parameter set shown in Table 1. This set of parameters leads to sustained oscillations. (b) Time traces for three proteins, simulated stochastically with the same parameter set as in a. The stochastic system still produces sustained oscillations, with some noise in period and amplitude of peaks. (c) Zooming in, we can see copy number changing in discrete steps, with one protein being produced or degraded in time steps of various lengths

3.2.3 Parameter Space Analysis and Bifurcation Diagrams
Often, we are interested in understanding the behavior of the system over a range of parameters. For example, you might be interested in choosing components (and indirectly parameters) for your circuit that will lead to a specific behavior. For a deterministic system, it can be possible to analytically determine the parameter boundary that will give oscillations (or other behaviors like damped oscillations or stability of equilibria) using a method called linear stability and bifurcation analysis. While the detailed process of linear stability analysis is outside the scope of this chapter (see Note 4), this approach has previously been used to analyze the parameter space of the repressilator model and find the boundary of the region of parameter space that gives rise to oscillations [1, 22]. Sometimes, combinations of parameters (such as the ratio between them), rather than individual parameters themselves, determine a system’s behavior. Here, two combinations of parameters, α and β, where

$$\alpha = \frac{\lambda_m \lambda_P}{\beta_P \beta_m K}, \qquad \beta = \frac{\beta_P}{\beta_m}$$

determine whether the system oscillates and are used in linear stability analysis to determine the boundary of the oscillation space. A plot of this boundary, called the bifurcation diagram (Fig. 3), shows that there are many sets of parameters for which the system can oscillate. From this diagram, we can infer that increasing cooperativity/nonlinearity helps the oscillations (by broadening the region of parameter space that sustains oscillations) and that ideally the mRNAs’ half-lives should be similar to the proteins’ half-lives. When interpreting this diagram, it is important to keep in mind that not all parameter values in this space are biologically relevant, physically possible, or easy to achieve experimentally. While these differential equations ignore the stochastic noise inside cells, they can provide useful insights into the behavior of dynamic systems and networks by providing analytical relations for the parameters.

[Figure 3: log-log plot of β = βP/βm (vertical axis) versus α = λmλP/(βPβmK) (horizontal axis); boundary curves for h = 1.35, h = 1.5, h = 2, and h = 3]

Fig. 3 Bifurcation diagram. Plot showing the boundary of the αβ parameter space that gives rise to oscillations. Thick lines indicate the boundary at various h values, as determined by linear stability analysis (the parameter combinations that lead to oscillations lie to the right of each line). Increasing cooperativity (h), increasing α, and having β = 1 all increase the parameter space that supports oscillations

3.3 Stochastic Simulations
So far, we have been operating under the assumption that molecules in our system are present in high numbers and therefore behave according to deterministic dynamics. In cells, this assumption is not always correct: many molecules such as mRNAs and proteins are present in low copy numbers [23–28]. Individual chemical reactions will happen by chance when molecules collide with each other, such that numbers of molecules will fluctuate over time (or across cells in a population), making their respective cellular processes (like gene expression) stochastic in nature. Levels of molecules can also fluctuate even if they are present in higher numbers, as this noise can be transmitted from one molecule to the next (for example, if proteins are translated from a noisy mRNA). Therefore, we cannot predict or calculate exactly how these numbers will change over time (which chemical reaction will happen and when); we can only calculate the probabilities that the system will have a particular number of molecules at a given time. These probabilities are described by the chemical master equation (CME, [11, 29, 30]), a set of differential equations for the change over time in the probability of having a certain number of molecules. Even for the simplest processes, such as one molecule being produced at a constant rate and eliminated at a constant rate per molecule (i.e., a Poisson birth and death process), the CME is represented by an infinite number of coupled differential equations.

While the CME is generally not solvable except in special cases, it is possible to calculate the moments of the probability distribution (e.g., average, variance, autocorrelation) by approximating the rates of the chemical reactions as linear around the average. This is called the linear noise approximation (or a first-order van Kampen expansion in physical chemistry, and many other names in other fields) and is exact when the rates are linear functions of the number of molecules. This is outside the scope of this chapter, and we direct the interested reader to the following references [10, 11, 28].

Here, rather than analytically solving the CME, we will focus on simulating one realization (one sample path, or an example time trace) of this continuous-time, discrete-state Markov process using an approach known as a Gillespie simulation [23, 31]. This algorithm is very simple to implement and is exact in the sense that simulated time traces converge to the correct probability distribution and its moments (average, variance, autocorrelation, etc.) (Fig. 4a, b). Using this algorithm, we can simulate our genetic circuits and measure the impact of the stochastic chemistry inside cells on our circuits for different designs.
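To make the structure of the CME concrete, it can be written out for the birth and death process mentioned above (production at rate λ, elimination at rate β per molecule). The notation $P(x, t)$ for the probability of having x molecules at time t is introduced here for illustration.

```latex
\frac{dP(x,t)}{dt} = \lambda\, P(x-1,t) + \beta\,(x+1)\, P(x+1,t) - (\lambda + \beta x)\, P(x,t),
\qquad x = 0, 1, 2, \ldots, \quad P(-1,t) \equiv 0
```

There is one such equation for every possible value of x, which is why the system is infinite. This is one of the special cases that can be solved: at steady state the solution is a Poisson distribution, $P_{ss}(x) = \frac{(\lambda/\beta)^x}{x!} e^{-\lambda/\beta}$, whose mean and variance both equal $\lambda/\beta$, the deterministic steady-state value.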
Continuing with the example of the repressilator, we will describe and demonstrate how to implement the Gillespie algorithm to stochastically model gene regulatory networks.

3.3.1 Stochastic Notation
Similar to the deterministic system, we can write a set of equations describing the rates for each possible reaction in the system, using the following notation (see Note 5):

$$x \xrightarrow{\lambda} x + n$$

where n is the change in x value resulting from a reaction and λ is the average rate of the reaction. In our example of simple birth and death of a molecule, n = +1 and −1, respectively, but it can take other values (e.g., 2) in other cases, such as production in a burst or oligomerization of molecules. In all cases, n should be an integer, as molecules can only exist in integer numbers. For the repressilator, the reaction equations in stochastic notation are as follows:

$$m_i \xrightarrow{\;\lambda_m K^h / (K^h + P_{i-1}^h)\;} m_i + 1$$

$$m_i \xrightarrow{\;\beta_m m_i\;} m_i - 1$$

$$P_i \xrightarrow{\;\lambda_P m_i\;} P_i + 1$$

$$P_i \xrightarrow{\;\beta_P P_i\;} P_i - 1$$

We will use the reaction rate expressions from these equations in our Gillespie algorithm.

[Figure 4: (a) simulated time traces of x versus time, converging to x_ss; (b) steady-state distribution (frequency versus x) centered on x_ss; (c) random walk on a lattice of x values (0 to 5), with transitions x → x+1 at rate λ and x → x−1 at rate β·x, starting from t0]

Fig. 4 Stochastic simulation of a birth and death process. (a) Single time traces (thin gray lines) simulated using the Gillespie algorithm for a species x, which is produced at rate λ and degraded at rate β·x. Although each trace is different, their statistical properties eventually converge to the correct probability distribution (colored lines) and its moments (e.g., mean and standard deviation). (b) Once steady state is reached, the probability distribution does not change. (c) Random walk on a lattice. Starting at a given x value, the system can move to a value of x + 1 or x − 1 with probabilities that depend on the production and degradation rate, respectively

3.3.2 Simulating a Time-Trace: The Gillespie Algorithm
Instead of analytically solving the CME, we will simulate one realization of the stochastic process. The idea behind the Gillespie algorithm is quite simple: we initialize the system to an (arbitrary) initial value (number of molecules at time t), and then let chemical reactions happen randomly. To be exact, we need to pick these reactions from their proper probability distribution, describing both when the reaction is going to happen and which one will happen. For example, consider our molecule x from the previous section (produced with rate λ and degraded with rate β·x), and imagine our cell has 3 molecules (x = 3) at a particular time point (t0). The next chemical reaction will either be the production or degradation of a molecule, leading to either x = 4 or x = 2. The time until this next reaction happens is also stochastic and depends on the current state of the system (Fig. 4c).

3.3.3 Gillespie Algorithm: Time to Next Reaction
After assigning an (arbitrary) initial value to all molecular species at the time zero of our simulation (e.g., x = 3 in Fig. 4c), we will first calculate the time to the next reaction, given the current state of our system. We can imagine the system sitting in one state, simply waiting for the next reaction to occur. We know that the probability of that chemical reaction happening per time unit is constant over time, regardless of how long we waited. As an analogy, imagine the waiting time for landing on black while playing roulette. It does not matter how long you wait or how many times you have already landed on red: the probability of landing on black on a given spin is always the same. Another example is radioactive decay, a stochastic process where the probability of one nucleus decaying is constant over time. This property is called memorylessness, because the stochastic process does not have a “memory” of how long it waited in a state. The only continuous probability distribution with this property is the exponential distribution, here with T as the time to the first reaction, λ as the average rate of the reaction, and τ as a given time interval (see Note 6):

$$P(T > \tau) = e^{-\lambda \tau}$$

$$P(T \le \tau) = 1 - P(T > \tau) = 1 - e^{-\lambda \tau} = F(\tau)$$

This is the cumulative distribution function (CDF) of the time to the first reaction: if we take the derivative of this function, we get the exponential probability density function, which gives the probability density that the time to the first reaction is around τ:

$$p(T = \tau) = \frac{dF(\tau)}{d\tau} = \lambda e^{-\lambda \tau}$$
In our algorithm, we therefore want to sample from this exponential distribution. Because most programming languages include a function to generate random numbers uniformly distributed between 0 and 1, we use the CDF to map the distribution of interest to a uniform distribution (see Note 7). This process is referred to as inverting the distribution. In our example, λ is the total rate of all reactions in the system: because the reactions are independent, the rate of any one reaction happening is the sum of the rates of each reaction in our system,

$$\lambda_{tot} = \sum_{i=1}^{N} \lambda_i$$

where N is the number of reactions in the system. Because we know the current number of molecules in the system, we can calculate these rates at that particular time point and, using a randomly generated number (here named $r_1$) between 0 and 1, solve for the time to the next reaction τ with the equation

$$\tau = -\frac{\ln(1 - r_1)}{\lambda_{tot}} = -\frac{\ln(r_1)}{\lambda_{tot}}$$

Because $r_1$ is a randomly chosen number uniformly distributed between 0 and 1, so is $1 - r_1$, and we can replace $1 - r_1$ with $r_1$ to simplify the expression (see Note 8).

[Figure 5: reaction rates λ1, λ2, λ3, ..., λN laid end to end with total λtot = Σ λi, then normalized as λ1/λtot, ..., λN/λtot on a line from 0 to 1, with r2 marking a point on that line]

Fig. 5 Choosing a reaction. Rates of all reactions are calculated given the current system state and normalized by the sum of the rates of all possible reactions (λtot), such that their cumulative sum is 1. Rates are aligned, and a number r2 uniformly distributed between 0 and 1 is generated, whose value will determine which of these reactions will occur

3.3.4 Gillespie: Choosing a Reaction
We now know that a reaction happened at time t0 + τ, but we still don’t know which reaction occurred. Using the rate of each reaction λi (again using the current state of the system), we first normalize these rates by λtot, such that if we line them up, their cumulative sum is 1, thus building a cumulative distribution function (see Note 9). We now generate a second random number between 0 and 1 (r2), which will fall somewhere on this line of normalized reaction rates, determining the reaction that will occur (Fig. 5). Reactions with higher rates take up more of the space in the vector, and are therefore chosen with higher probability.
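The two random draws described so far (the waiting time and the choice of reaction) can be sketched as two small functions. This is an illustrative sketch using only Python's standard library; the function names and the `rng` argument are assumptions made here, not part of the original protocol.

```python
import math
import random


def time_to_next_reaction(rates, rng=random):
    """Sample tau from an exponential distribution with rate lambda_tot."""
    lam_tot = sum(rates)
    r1 = 1.0 - rng.random()      # random() lies in [0, 1), so r1 lies in (0, 1]
    return -math.log(r1) / lam_tot


def choose_reaction(rates, rng=random):
    """Return the index of the reaction picked by a uniform random number r2."""
    lam_tot = sum(rates)
    r2 = rng.random()
    cumulative, chosen = 0.0, None
    for index, rate in enumerate(rates):
        if rate <= 0:
            continue             # a reaction with zero rate can never fire
        chosen = index           # fallback in case of floating-point round-off
        cumulative += rate / lam_tot
        if r2 < cumulative:
            break
    return chosen


# Usage: two reactions with rates 1 and 3 (arbitrary illustrative values).
rng = random.Random(0)
draws = [choose_reaction([1.0, 3.0], rng) for _ in range(20000)]
waits = [time_to_next_reaction([1.0, 3.0], rng) for _ in range(20000)]
print("fraction of draws that picked reaction 1:", sum(draws) / len(draws))
print("mean waiting time:", sum(waits) / len(waits))
```

With these rates, the second reaction should be picked about three times out of four, and the mean waiting time should be close to 1/λtot = 0.25.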
3.3.5 Gillespie: Updating the State of the System
Once we know which of the N reactions will occur and the time it will take, we must update both the time of our simulation and quantities of all the species in the system. To update the time, simply add the randomly sampled τ value to the current time. To update the quantity of reactants, we add or subtract the appropriate value to each species involved in the randomly picked reaction (see Note 10). For example, if we chose the transcription of m1 as our next reaction, we would update the system by adding +1 to the current value of m1.
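Putting the three steps together (time to next reaction, choice of reaction, state update) gives a complete simulation loop for the repressilator. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses the τp-scaled values from Table 1, an arbitrary initial state, and a fixed seed so that runs are repeatable. The names `propensities` and `gillespie` are chosen here for illustration.

```python
import math
import random


def propensities(m, P, lam_m, lam_p, beta_m, beta_p, K, h):
    """Rates of the 12 repressilator reactions for the current state."""
    rates = []
    for i in range(3):
        rates.append(lam_m * K**h / (K**h + P[i - 1]**h))  # m_i -> m_i + 1
        rates.append(beta_m * m[i])                          # m_i -> m_i - 1
        rates.append(lam_p * m[i])                           # P_i -> P_i + 1
        rates.append(beta_p * P[i])                          # P_i -> P_i - 1
    return rates


def gillespie(n_reactions, params, seed=0):
    """Simulate n_reactions events; return event times and protein counts."""
    rng = random.Random(seed)
    t, m, P = 0.0, [0, 0, 0], [10, 0, 0]     # arbitrary initial state
    times, proteins = [t], [tuple(P)]
    for _ in range(n_reactions):
        rates = propensities(m, P, **params)
        lam_tot = sum(rates)                 # always > 0: transcription never stops fully
        # Time to next reaction, by inverting the exponential CDF.
        t += -math.log(1.0 - rng.random()) / lam_tot
        # Choose the reaction by walking along the normalized rates.
        r2, cumulative, chosen = rng.random(), 0.0, None
        for index, rate in enumerate(rates):
            if rate <= 0:
                continue                     # zero-rate reactions cannot fire
            chosen = index                   # fallback for floating-point round-off
            cumulative += rate / lam_tot
            if r2 < cumulative:
                break
        # Update the state: four reactions per gene, in the order listed above.
        gene, kind = divmod(chosen, 4)
        if kind == 0:
            m[gene] += 1
        elif kind == 1:
            m[gene] -= 1
        elif kind == 2:
            P[gene] += 1
        else:
            P[gene] -= 1
        times.append(t)
        proteins.append(tuple(P))
    return times, proteins


params = dict(lam_m=150.0, lam_p=65.0, beta_m=3.6, beta_p=1.0, K=7.0, h=2.0)
times, proteins = gillespie(5000, params)
print("simulated time (tau_P):", times[-1], "final protein counts:", proteins[-1])
```

Because each event changes a single molecule count by one, many events are needed to cover even one protein lifetime at these rates; the 5000 events used here are only enough to check that the loop runs, not to observe a full oscillation.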
3.3.6 Gillespie: Iterating the Algorithm

The steps of the algorithm above are iterated a chosen number of times (n chemical reactions), with time and quantity of reactants being updated at each iteration. The number of iterations should be large enough to properly characterize the resulting time trace. For example, if you are interested in the statistical properties of your system around steady state, you should run your simulation far enough past the time that steady state is reached that you have sufficient points to sample to calculate these statistics. Note that different species in your model may evolve at different time scales and it might take many reactions until you can sufficiently sample your slow species (this may be computationally challenging). After running the simulation for n iterations at the parameters listed in Table 1, our system shows regular oscillations, but with some noise in the period and amplitude of the oscillations (Fig. 2b). Zooming in, we can see the discrete production and depletion steps of our proteins, and the different sized τ intervals in time (Fig. 2c). The steps of the Gillespie algorithm can be summarized as follows:

1. Initialize the system at t0 to a chosen set of reactant quantities
2. Calculate all reaction rates (λi(x1, x2, ...)) and their sum $\lambda_{tot} = \sum_{i=1}^{N} \lambda_i$, using the current state of the system (quantity of each reactant, x1, x2, ..., at t0)
3. Use λtot and a randomly generated r1 in (0,1) to calculate the time τ to the next reaction using inverse sampling of the exponential distribution: $\tau = -\ln(r_1)/\lambda_{tot}$
4. Normalize all reaction rates (λi) by λtot and align them. Randomly generate r2 in (0,1), whose value determines which reaction happens at time t0 + τ
5. Update the system according to the chosen reaction, adding or subtracting the appropriate amount to or from the quantity of each species involved in the chosen reaction
6. Repeat steps 2–5 n times, updating the state of the system at each iteration

3.3.7 Characterization of Results
After simulating the system for many steps, we can then characterize its properties. For example, using a long time trace, we can calculate the probability distribution of the number of molecules at steady state (P(X = x), Fig. 4b), or moments of the distribution, such as the mean number of molecules or the fluctuations around the mean (i.e., variance). The specific measure by which a gene circuit is characterized will depend on its desired behavior. For oscillators, the autocorrelation of the protein copy number is a convenient measure of the quality of the oscillations. The autocorrelation function represents the correlation coefficient of a trace at two time points separated by a time lag (ΔτP), and thus should be equal to 1 after one period for a perfect oscillator. It captures both phase drift (how quickly the oscillations become de-synchronized) and amplitude noise (see Note 11).

3.3.8 Parameter Scan
As with the deterministic solution to our model, we should assess how the behavior of the system changes as a function of its parameters. Here, it is useful to set one parameter equal to 1 and vary the other parameter values relative to it, which minimizes the number of parameters to scan. Typically, we set βP to 1, switching the time units of the simulation to protein lifetimes (τp) and scaling the other rate values accordingly (Table 1, see Note 1). Quantifying the autocorrelation for a range of parameters shows that the stochastic system oscillates over a broader range of parameters than the deterministic system (Fig. 6a, b), “smoothing” out the bifurcation transition (Fig. 6b). While the recommendations from the deterministic analysis still hold (e.g., increasing cooperativity to expand oscillation space), the requirements are much less stringent in the stochastic system. This tells us that we have more leeway when choosing parts (and their corresponding parameters) with which to build and redesign our circuits than the deterministic analysis would indicate.

[Figure 6: (a) Deterministic and Stochastic time traces for Hill = 1.5 and Hill = 2; axes: Copy Number versus Time (τP); (b) heatmaps of β versus α (log scales); color scale: Correlation after one period]

Fig. 6 Scanning the parameter space of the stochastic system. (a) Comparison of deterministic solution and stochastic simulation of the system with different Hill coefficients. With the chosen parameters, the stochastic system still shows sustained oscillations with low cooperativity, whereas the deterministic solution shows damped oscillations. (b) Heatmap of the autocorrelation of the time traces of the proteins after one period. Here, $\alpha = \lambda_m \lambda_P / (\beta_P \beta_m K)$ and $\beta = \beta_P / \beta_m$. The thick black line indicates the deterministic bifurcation boundary, and the pink dot corresponds to the parameter values used in simulations in a. As in the deterministic bifurcation calculation, increasing cooperativity increases the size of the oscillation space. However, in the stochastic regime, it is possible to maintain oscillations outside the predicted bifurcation boundary, with both high and low cooperativity constants

3.4 Using Models to Redesign the Circuit: An Example
Now that we know how to simulate our synthetic circuits, we can incorporate data from the first design to help us understand and improve their behavior. This process is obviously very specific to particular circuits, so here we will use our previous experience redesigning the repressilator as an example. Some recommendations are general, such as reducing propagation of stochastic noise (as this can be transmitted between molecules), and we will emphasize these. It might also be necessary to iterate the design-build-test loop multiple times, making small changes to the circuit, then quantifying its properties and analyzing the results.

Initially, the repressilator was designed using the bifurcation diagram in Fig. 3. The guidelines were therefore to have strong promoters (to increase α) and high cooperativity, while ensuring that the proteins’ half-lives were similar to the mRNAs’ (β = 1). Therefore, repressors that multimerize and bind strongly were chosen, and they were targeted for fast degradation to reduce their half-lives. The assembled circuit did indeed oscillate, but its performance appeared much lower than that of natural oscillators or other subsequently published synthetic oscillators [32–39].

For the redesign of a circuit, it is crucial that the experimental data accurately represent the circuit’s behavior. Therefore, for single-cell dynamic properties such as oscillations, we evaluated the performance of the circuit using a microfluidic device nicknamed the mother machine [6–8], which enables us to track thousands of single cells under constant growth conditions for hundreds of cell divisions. Comparing these data to the original experiments performed on agar pads (where growth conditions change rapidly as cells start to compete for nutrients) revealed that the oscillatory properties appeared much improved, suggesting that the circuit is sensitive to changes in growth conditions.
This illustrates how separating variability from the environment and intrinsic noise of a circuit can aid in its redesign, as we can then change or eliminate components that are highly sensitive to environmental noise (e.g., growth conditions). We also observed high amplitude noise between the peaks of the oscillations (Fig. 7), and we thus decided to investigate the fluorescent read-out for the oscillations as a potential source of noise. The original design included one plasmid for the repressors and another, noisy plasmid carrying the fluorescent read-out to track oscillations. Therefore, this amplitude noise could be simply an artifact of our measurements. Indeed, transferring the reporter to the repressilator plasmid greatly reduced peak amplitude noise (Fig. 7). In doing so, we also made a serendipitous discovery: the fluorescent reporter originally targeted for degradation interfered with degradation of the repressors, adding noise to the oscillations. This was an example of the unknowns in biology and emphasizes the need for both experiment and modeling.

[Figure 7: Redesigning a synthetic circuit: the repressilator. Plasmid maps of the original and redesigned circuits (pSC101 ori, colE1 ori; promoters PR, PLtetO1, PLlacO1; genes λcI, tetR, lacI, with or without ssrA tags; reporters gfp-asv, mVenus). Step 1, precise characterization: analyze single cells under constant growth conditions (“Mother Machine”). Step 2, identify and eliminate sources of noise: (a) amplitude noise: integrate reporter to remove noise from reporter plasmid; (b) noisy decay: model decay to guide redesign, remove degradation to increase peak protein number, add titration sponge to increase repression threshold and cooperativity. Time traces: GFP or YFP concentration versus time (τP)]

Fig. 7 Redesigning the repressilator. Outline of steps taken to redesign the repressilator circuit to achieve high robustness and precision. In step 1, we characterized the circuit in single cells at constant growth rates, which improved oscillations compared to the original experimental setup. In step 2, we identified and eliminated intrinsic sources of noise in the circuit. These included variable copy numbers of reporter plasmids (a), low peak amplitude due to degradation of repressors and apparent low K values of repressors (b). Integrating the fluorescent reporter onto the repressilator plasmid, removing degradation, and adding a titration sponge improved precision of the circuit, leading to the most precise performance of a synthetic oscillator to date

We also observed that the shape of the oscillations was strongly non-sinusoidal (Fig. 7), which mathematical modeling of our system told us was characteristic of very low repression thresholds (K) (as expected for these strong repressors). In such a regime, the promoters operate in a switch-like fashion (they are either completely on or off), and the period can be decomposed into three subphases in which each repressor decays from its peak value down to its repression threshold while its production is completely off (Fig. 7). After P1 decays below its threshold, production of P2 is derepressed, which will immediately inhibit production of P3 and initiate its decay phase. The length of the period is thus determined by the sum of the three decay times, and we can analyze each decay independently. While this analysis is specific to this circuit, such pseudo-steady-state analysis, or time scale separation, is a general technique that can be useful in analyzing many types of circuits. A detailed analysis of the decay phase showed that two factors were necessary for precise timing: (1) high peak amplitude, averaging the timing of the decay over many steps, and (2) relatively high repression thresholds, as the elimination time of the last few molecules (to fall below a low K value) is very noisy (Fig. 2c), which in turn causes large variation in period. These recommendations were implemented by (1) removing protein degradation, thus letting proteins accumulate to higher numbers (and also possibly removing a source of noise) and (2) adding decoy binding sites for the repressors (called a “titration sponge”).
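The two timing recommendations can be checked with a simple calculation on the decay phase alone. The sketch below is an illustration under simplifying assumptions, not the detailed analysis referred to in the text: production is taken to be completely off, each molecule is eliminated independently at rate βP, and the peak and threshold values are arbitrary. Under these assumptions, with n molecules present the wait for the next loss is exponential with rate βP·n, so the time to fall from the peak to the threshold is a sum of independent exponential waits whose mean and variance can be added up directly. The function name `decay_time_stats` is chosen here for illustration.

```python
import math


def decay_time_stats(peak, threshold, beta_p=1.0):
    """Mean and coefficient of variation (CV) of the time for a protein count
    to fall from `peak` to `threshold` molecules with production switched off."""
    counts = range(threshold + 1, peak + 1)
    # Each loss at n molecules is exponential with rate beta_p * n:
    # mean 1 / (beta_p * n) and variance 1 / (beta_p * n)**2.
    mean = sum(1.0 / (beta_p * n) for n in counts)
    variance = sum(1.0 / (beta_p * n) ** 2 for n in counts)
    return mean, math.sqrt(variance) / mean


for peak, threshold in [(1000, 1), (1000, 10), (100, 10)]:
    mean, cv = decay_time_stats(peak, threshold)
    print(f"peak={peak:5d} threshold={threshold:3d} "
          f"mean={mean:.2f} tau_P  CV={cv:.3f}")
```

In this simplified picture, raising the threshold removes the slow, highly variable final steps, and raising the peak adds many fast, low-variance steps, so both changes lower the relative variability of the decay time.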
These decoy sites (present in much higher copy numbers than the actual sites) soak up free repressors, effectively increasing the repression threshold (and increasing effective cooperativity at the same time) (Fig. 7). This linear molecular titration [40–44] is a very versatile tool that enables the tuning of repression curves that would be experimentally difficult to change otherwise and has been used in a variety of applications, from timers in natural biological systems [7] to controllers for perfect adaptation [45]. After implementing these changes, the oscillations of the repressilator were extremely precise, taking more than 13 periods before accumulating half a period of phase drift.

As demonstrated, mathematical modeling and careful experimental characterization are both critical components of designing and redesigning synthetic genetic circuits. Models can provide valuable insights into required parameter values and possible system behaviors, and should guide the initial engineering of a circuit. It is important to carefully characterize this initial circuit experimentally, isolating variability originating from the growth environment, reporters or measurements, and other sources of extrinsic noise from the variability of the circuit itself. These data can then be used together with continued mathematical modeling to hypothesize changes to the circuit that could improve its behavior. After testing these changes, the circuit’s behavior can be characterized again and improved iteratively. Here, we provided a brief guide to building deterministic models and running stochastic simulations of simple circuits, as these methods are simple and fast to execute, while being extremely informative. We propose that incorporating modeling in the design-build-test loop of synthetic biology while pursuing a precision and robustness that rivals natural circuits will lead to novel insights into the design of natural and synthetic gene networks.
4 Notes

1. It is often convenient to set one of the parameter values of the system equal to 1. For example, in the parameter scan discussed in Subheading 3.3.8, this method minimizes the number of parameters that we need to scan through. Typically, we choose the protein elimination rate, setting $\beta_P = 1$ and scaling all other rate values accordingly. This sets the unit of time for all rate parameters to the protein lifetime. This parameter has an intuitive connotation, in both the human realm and the molecular one. It is a measure of how long molecules “live” in the system on average before they are eliminated. To illustrate this important parameter, we will consider a simple system where proteins are eliminated at a constant rate per molecule (if there is degradation by protease, it is not saturated), and they are no longer produced. Therefore,

$$\frac{dP(t)}{dt} = -\beta_P P(t)$$

$$P(t) = P_0 e^{-\beta_P t}$$

where $P_0$ is the initial amount of protein, and $\beta_P$ is the decay rate constant. The number of proteins thus decays exponentially (Fig. 8). The lifetime is defined as $\tau_P = \beta_P^{-1}$, or the time it takes to decay to $e^{-1}$ of the initial amount. It is often more intuitive to think about the half-life of the protein ($t_{1/2}$), or the time at which half the initial quantity has decayed. These time constants are simply related by a proportionality constant:

$$e^{-t/\tau_P} = 2^{-t/t_{1/2}} = e^{-\ln(2)\,t/t_{1/2}}$$

$$t_{1/2} = \tau_P \ln(2)$$
Using Models to (Re-)Design Synthetic Circuits
[Figure 8: protein amount P versus time t for the decay P(t) = P0·e^(−βP·t); the curve falls to P0/2 at t1/2 = ln(2)/βP = τP·ln(2) and to P0·e^(−1) at t = τP]
Fig. 8 Exponential protein decay. For proteins being eliminated at a constant rate (with no production), the population will decay exponentially. The half-life (t1/2) is the time at which the population has reached half of its initial value. The lifetime (τP) of the protein is the average time that a protein will exist in the system. βP is the elimination rate constant
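The relations between the elimination rate constant, the lifetime, and the half-life summarized in Fig. 8 can be checked numerically. The following Python sketch is only an illustration: the function names and the 25 min half-life are assumptions chosen for the example, not part of the original protocol.

```python
import math


def lifetime_from_half_life(t_half):
    """Return the lifetime tau_P = t_half / ln(2)."""
    return t_half / math.log(2)


def protein_remaining(p0, beta_p, t):
    """Return P(t) = P0 * exp(-beta_P * t) for decay without production."""
    return p0 * math.exp(-beta_p * t)


t_half = 25.0                      # min, illustrative half-life (assumption)
tau_p = lifetime_from_half_life(t_half)
beta_p = 1.0 / tau_p               # elimination rate constant, in 1/min

print(f"tau_P = {tau_p:.1f} min")
print(protein_remaining(100.0, beta_p, t_half))  # half of the initial amount
print(protein_remaining(100.0, beta_p, tau_p))   # initial amount divided by e
```

Rescaling time by tau_P, as suggested in Note 1, is then equivalent to using beta_p = 1 in the second function.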
The lifetime does not merely represent how quickly molecules are eliminated but is also a natural timescale for the system, indicating how quickly the system adopts a new steady state. This is why we usually set the timescale of our simulation to $\tau_P$. In the case of our example in which proteins are not actively degraded, we can assume the protein half-life is 25 min, meaning that

$$\tau_P = \frac{25\ \text{min}}{\ln(2)} \approx 36\ \text{min}$$

2. The Hill function is a common function in biology that describes cooperative binding of ligands to a receptor. This function was originally derived by Hill in 1910 to describe O2 molecules binding to hemoglobin [46]. It is typically written in the following form:

$$\theta = \frac{x^h}{K^h + x^h}$$

where $\theta$ is the fraction of receptors bound by ligand, x is the concentration of free ligand, K is the concentration of ligand at which half of the population is bound, and h is the cooperativity coefficient. h determines the nonlinearity of this function: higher h values make the curve sharper (Fig. 9). The Hill function has been used extensively to model unknown activation or repression functions because of its flexibility in describing nonlinear, sigmoidal functions (note that h can be non-integer). In our model, we use $1 - \theta$ to calculate the transcription rate of a gene based on the fraction of unbound promoters available for expression.
[Figure 9: fraction of repressors bound (from 0 to 1) versus repressor concentration, with curves for h = 1, 1.5, 3, and 6; all curves pass through 1/2 at [repressors] = K]
Fig. 9 Hill function. The Hill function is used in our model to describe binding of repressors to their promoters. Transcription rate is calculated as a function of the number of unbound promoters (available for expression). As the cooperativity coefficient h increases, the transition between a gene being fully unbound (expressed with rate λm) and fully bound (repressed) becomes sharper
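To make the role of the cooperativity coefficient concrete, the Hill function and the resulting transcription rate can be written as two short Python functions. This is a minimal sketch under stated assumptions: the function names and the numerical values of K and λm are illustrative, and the transcription rate is taken to be λm multiplied by the unbound fraction, as described in the caption of Fig. 9.

```python
def hill_bound_fraction(x, K, h):
    """Return theta = x**h / (K**h + x**h), the bound fraction."""
    return x**h / (K**h + x**h)


def transcription_rate(repressor, lambda_m, K, h):
    """Return lambda_m * (1 - theta): expression from unbound promoters."""
    return lambda_m * (1.0 - hill_bound_fraction(repressor, K, h))


K = 10.0        # repressor level giving half-maximal binding (illustrative)
lambda_m = 2.0  # maximal transcription rate (illustrative)

for h in (1, 1.5, 3, 6):
    below = hill_bound_fraction(0.5 * K, K, h)
    at_k = hill_bound_fraction(K, K, h)
    above = hill_bound_fraction(2.0 * K, K, h)
    print(f"h={h}: theta(K/2)={below:.3f}, theta(K)={at_k:.3f}, theta(2K)={above:.3f}")
```

For every h the bound fraction equals 1/2 at x = K, while the values at K/2 and 2K move toward 0 and 1 as h grows, which is the sharpening shown in Fig. 9.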
3. Many values provided on BioNumbers may not be in the exact units required for your model but can easily be adapted for use. For example, we can change the translation rate from aa mRNA$^{-1}$ s$^{-1}$ to proteins mRNA$^{-1}$ min$^{-1}$ as follows:

$$\frac{8\ \text{aa}}{1\ \text{mRNA} \cdot \text{s}} \times \frac{60\ \text{s}}{\text{min}} \times \frac{3\ \text{bp}}{\text{aa}} \times \frac{1\ \text{protein}}{821\ \text{bp}} \approx \frac{1.75\ \text{protein}}{\text{mRNA} \cdot \text{min}}$$

4. While a detailed description of linear stability analysis is outside the scope of this chapter, there are many helpful resources available. Two helpful texts that cover this topic are Strogatz’s Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering [47] and Epstein and Pojman’s An Introduction to Nonlinear Chemical Dynamics: Oscillations, Waves, Patterns, and Chaos [48].

5. Note that in stochastic notation, the arrows in the reaction equations have a different meaning than in the mass action equations in Subheading 3.1.2. Here, rather than a transformation from one product to another, an arrow means a change in quantity of a reactant. Note also that in this notation, the full rate of this change is indicated above the arrow (without the implicit multiplication by the left-hand side). Finally, note that in a stochastic simulation, we measure numbers rather than concentrations of reactants. Here, if we wanted to measure concentration, we would need to include cell growth and
division in our stochastic equations, tracking the cell’s volume over time. Instead, we approximate this process by adding constant dilution of our molecules, and tracking absolute numbers of molecules in a “typical” cell volume over time.

6. The memorylessness property is defined as:

$$P(T > t + s \mid T > t) = P(T > s)$$

where t and s are positive real numbers. This means that the conditional probability that the time to the first event is greater than t + s, knowing that it is greater than t, is equal to the probability that the time is greater than s. In other words, it does not matter how long you have already waited. We can show that the exponential distribution satisfies this property using the definition of conditional probability ($P(A \mid B)\,P(B) = P(A \text{ and } B)$):

$$P(T > t + s \mid T > t) = \frac{P(T > t + s \text{ and } T > t)}{P(T > t)} = \frac{P(T > t + s)}{P(T > t)} = P(T > s)$$

because if T > t + s, then T is necessarily greater than t. Substituting the exponential distribution satisfies this equality:

$$\frac{e^{-\lambda(t+s)}}{e^{-\lambda t}} = e^{-\lambda s}$$

Note that it is possible to show that the exponential is the only continuous distribution with this property, but we do not include the proof here, as it is not particularly pedagogical.

7. Inverse transform sampling is easier to understand visually (Fig. 10). Intuitively, the idea is to map the uniform probability density to the CDF. If we compare regions where the slope of the CDF (the probability density function, since $p(T = t) = \frac{dF(t)}{dt}$) is different, we see that the regions with higher slope will take up a higher proportion of the uniform distribution and will therefore have a higher probability. This can be proven mathematically: if U is a uniform random variable between 0 and 1, then $F^{-1}(U)$ has $F(x)$ as its CDF:

$$P\left(F^{-1}(U) \le x\right) = P(U \le F(x)) = F(x)$$

where we applied F to both sides of the inequality and then used the CDF of the uniform distribution ($P(U \le x) = x$).

8. We realize it is a bit of a jump between these equations. For those who like to see the steps, they are as follows:

$$r_1 = 1 - e^{-\lambda_{tot}\tau}$$
[Figure 10: cumulative distribution P(T ≤ t) versus time (t), with two equal 15% intervals of the uniform distribution mapped onto the time axis]
Fig. 10 Intuition for inverse transform sampling. Equal probabilities (15%) on the uniform distribution are mapped to different probabilities on the CDF, where higher slopes (corresponding to the PDF) correspond to higher probabilities
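$$e^{\lambda_{tot}\tau} = \frac{1}{1 - r_1}$$

Remember that if $y = e^x$, then $x = \ln(y)$. Therefore:

$$\lambda_{tot}\tau = \ln\left(\frac{1}{1 - r_1}\right)$$

Also remember that $\ln(x^y) = y \cdot \ln(x)$:

$$\tau = \frac{-\ln(1 - r_1)}{\lambda_{tot}}$$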
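The final expression for τ translates directly into code. The Python sketch below draws waiting times by inverse transform sampling and then checks two properties empirically: the mean waiting time should be close to 1/λtot, and the samples should be approximately memoryless (Note 6). The function name, the seed, and the values of λtot, t, and s are illustrative assumptions.

```python
import math
import random


def sample_waiting_time(lambda_tot, rng=random):
    """Draw tau = -ln(1 - r1) / lambda_tot, with r1 uniform on [0, 1)."""
    r1 = rng.random()
    return -math.log(1.0 - r1) / lambda_tot


rng = random.Random(0)   # fixed seed so the run is reproducible
lambda_tot = 2.0         # total rate of all reactions (illustrative)
samples = [sample_waiting_time(lambda_tot, rng) for _ in range(100_000)]

mean_tau = sum(samples) / len(samples)
print("mean waiting time:", mean_tau, "expected:", 1.0 / lambda_tot)

# Empirical memorylessness check: P(T > t + s | T > t) versus P(T > s).
t, s = 0.3, 0.4
longer_than_t = [x for x in samples if x > t]
conditional = sum(x > t + s for x in longer_than_t) / len(longer_than_t)
unconditional = sum(x > s for x in samples) / len(samples)
print("conditional:", conditional, "unconditional:", unconditional)
```

Because `random.random()` returns values in [0, 1), the argument of the logarithm stays strictly positive.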
9. Gillespie in practice: rate vector function. When setting up the Gillespie algorithm, there are a few tricks that make things more efficient and cleaner. For example, after writing the entire set of equations, it is helpful to assemble the rate definitions for each equation into a vector, and build this vector into a function that accepts current reactant values as an input (called the rate vector function, or rvf). This allows us to easily calculate the individual rates of all reactions for a given system state and to quickly sum the rates to calculate $\lambda_{tot}$. In the case of the repressilator, the rate vector function is defined as:

$$\mathrm{rvf}(m, P) = \left[ \frac{\lambda_m K^h}{K^h + P_3^h},\ \beta_m m_1,\ \lambda_P m_1,\ \beta_P P_1,\ \frac{\lambda_m K^h}{K^h + P_1^h},\ \beta_m m_2,\ \lambda_P m_2,\ \beta_P P_2,\ \frac{\lambda_m K^h}{K^h + P_2^h},\ \beta_m m_3,\ \lambda_P m_3,\ \beta_P P_3 \right]$$
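A rate vector function of this form might be written as follows in Python. This is a sketch rather than the authors' implementation: the argument names mirror the symbols in the expression above, the reaction order follows that expression, and the parameter values in the usage example are made up for illustration.

```python
def rvf(m, P, lambda_m, beta_m, lambda_P, beta_P, K, h):
    """Return the 12 repressilator reaction rates for the current state.

    m and P hold the mRNA and protein counts (m1, m2, m3) and (P1, P2, P3).
    Gene 1 is repressed by P3, gene 2 by P1, and gene 3 by P2.
    """
    rates = []
    for i in range(3):
        repressor = P[i - 1]  # for i = 0 this is P[-1], i.e. P3
        rates.extend([
            lambda_m * K**h / (K**h + repressor**h),  # mRNA production
            beta_m * m[i],                            # mRNA elimination
            lambda_P * m[i],                          # protein production
            beta_P * P[i],                            # protein elimination
        ])
    return rates


# Illustrative (made-up) parameter values and state.
params = dict(lambda_m=60.0, beta_m=0.5, lambda_P=2.0, beta_P=1.0, K=10.0, h=3.0)
rates = rvf([1, 0, 2], [5, 0, 20], **params)
lambda_tot = sum(rates)  # total rate used to draw the next waiting time
print(rates)
print(lambda_tot)
```

Keeping the rates in a fixed order matters because the same ordering is used to index the reactions of the stoichiometry matrix.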
10. Gillespie in practice: Stoichiometry matrix. To easily determine the value to add or subtract to each species in the event that a given reaction occurs, we build a stoichiometry matrix for our system. This is an M by N matrix, where M is the number of species/reactants in the system, and N is the number of possible reactions. Each row of the matrix corresponds to one reactant, and each column gives the change in quantity of each reactant, should a given reaction occur. As an example, see the stoichiometry matrix of the repressilator (Table 2). Column i corresponding to the chosen reaction will give the values to be added or subtracted from each reactant and can be used to directly update a vector containing the current values of each species. Note that in the case of the repressilator, the matrix is quite simple, as all reactions are independent and lead to an increase or decrease of 1 molecule. Matrices can be more complex, especially for systems with coupled reactions. For example, consider a system in which monomers of molecule y are produced at rate $\lambda_y$ and dimerize irreversibly to form molecule Y at rate $\lambda_Y$. For this system, our reaction equations would be written as

$$y \xrightarrow{\lambda_y} y + 1$$

$$(y, Y) \xrightarrow{\lambda_Y y^2} (y - 2, Y + 1)$$
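Table 2 Stoichiometry matrix for the repressilator model (displayed with one row per reaction and one column per species)

| Reaction | m1 | P1 | m2 | P2 | m3 | P3 |
|---|---|---|---|---|---|---|
| m1 → m1 + 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| m1 → m1 − 1 | −1 | 0 | 0 | 0 | 0 | 0 |
| P1 → P1 + 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| P1 → P1 − 1 | 0 | −1 | 0 | 0 | 0 | 0 |
| m2 → m2 + 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| m2 → m2 − 1 | 0 | 0 | −1 | 0 | 0 | 0 |
| P2 → P2 + 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| P2 → P2 − 1 | 0 | 0 | 0 | −1 | 0 | 0 |
| m3 → m3 + 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| m3 → m3 − 1 | 0 | 0 | 0 | 0 | −1 | 0 |
| P3 → P3 + 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| P3 → P3 − 1 | 0 | 0 | 0 | 0 | 0 | −1 |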
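The repressilator stoichiometry matrix of Table 2 is regular enough to be generated programmatically. The Python sketch below stores it with one row per species and one column per reaction, as in the text of Note 10, and uses the same reaction order as Table 2. The helper names and the example state are assumptions made for illustration.

```python
SPECIES = ["m1", "P1", "m2", "P2", "m3", "P3"]


def repressilator_stoichiometry():
    """Return S with S[j][i] = change in species j when reaction i occurs."""
    n_species = len(SPECIES)
    n_reactions = 2 * n_species
    S = [[0] * n_reactions for _ in range(n_species)]
    for j in range(n_species):
        S[j][2 * j] = 1       # production of species j
        S[j][2 * j + 1] = -1  # elimination of species j
    return S


def apply_reaction(state, S, i):
    """Add column i of S to the state vector and return the new state."""
    return [count + S[j][i] for j, count in enumerate(state)]


S = repressilator_stoichiometry()
state = [3, 10, 0, 7, 1, 25]          # current counts, in SPECIES order
state = apply_reaction(state, S, 3)   # fourth reaction: P1 -> P1 - 1
print(dict(zip(SPECIES, state)))
```

In a full simulation the index passed to `apply_reaction` would be the reaction chosen at the current step, so the update never needs reaction-specific code.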
Table 3 Example stoichiometry matrix for a system with coupled reactions

| | Production: $y \xrightarrow{\lambda_y} y + 1$ | Dimerization: $(y, Y) \xrightarrow{\lambda_Y y^2} (y - 2, Y + 1)$ |
|---|---|---|
| y | 1 | −2 |
| Y | 0 | 1 |
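As a rough illustration of how a stoichiometry matrix with coupled reactions is used, the following Python sketch runs a small Gillespie simulation of the production and dimerization system of Table 3. Several choices are assumptions rather than part of the original text: the parameter values, the seed, the function names, the way the reaction is selected from a second uniform random number, and the guard that sets the dimerization rate to zero when fewer than two monomers are present.

```python
import math
import random

# Rows: species (y, Y); columns: reactions (production, dimerization).
S = [[1, -2],
     [0, 1]]


def rates(y, lambda_y, lambda_Y):
    """Return the two reaction rates; dimerization needs two monomers."""
    dimerization = lambda_Y * y**2 if y >= 2 else 0.0
    return [lambda_y, dimerization]


def simulate(t_max, lambda_y=10.0, lambda_Y=0.1, seed=0):
    """Simulate until t_max and return the event times and states (y, Y)."""
    rng = random.Random(seed)
    t, state = 0.0, [0, 0]
    times, states = [t], [tuple(state)]
    while True:
        r = rates(state[0], lambda_y, lambda_Y)
        lambda_tot = sum(r)
        t += -math.log(1.0 - rng.random()) / lambda_tot  # waiting time
        if t > t_max:
            break
        u = rng.random() * lambda_tot
        i = 1 if (r[1] > 0.0 and u >= r[0]) else 0       # chosen reaction
        state = [x + S[j][i] for j, x in enumerate(state)]
        times.append(t)
        states.append(tuple(state))
    return times, states


times, states = simulate(t_max=5.0)
print("number of events:", len(times) - 1)
print("final state (y, Y):", states[-1])
```

Each dimerization removes two monomers and adds one dimer in a single update, which is exactly what the second column of the matrix encodes.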
The stoichiometry matrix for this system would be written as in Table 3.

11. Gillespie in practice: Resampling time trace data. Because τ varies between reactions, the time trace output by the Gillespie algorithm will have points separated by non-uniform time steps. For further analysis (for example, to calculate the autocorrelation), it is often helpful to resample the data at regular time intervals. To do this, we sample the number of reactants at every $t_{resample}$ time interval of the output, such that the length of our final resampled data matrix will be $t_{max}/t_{resample}$, where $t_{max}$ is the time length of the simulation. $t_{resample}$ is chosen so that the resampling is frequent enough to capture the system’s behavior.

References
1. Elowitz MB, Leibler S (2000) A synthetic oscillatory network of transcriptional regulators. Nature 403:335–338. https://doi.org/10.1038/35002125
2. Gardner TS, Cantor CR, Collins JJ (2000) Construction of a genetic toggle. Nature 403:339–342. https://doi.org/10.1038/35002131
3. Vilar JMG, Kueh HY, Barkai N, Leibler S (2002) Mechanisms of noise-resistance in genetic oscillators. Proc Natl Acad Sci U S A 99:5988–5992. https://doi.org/10.1073/pnas.092133899
4. McKane AJ, Newman TJ (2005) Predator-prey cycles from resonant amplification of demographic stochasticity. Phys Rev Lett 94:218102. https://doi.org/10.1103/PhysRevLett.94.218102
5. Potvin-Trottier L, Lord ND, Vinnicombe G, Paulsson J (2016) Synchronous long-term oscillations in a synthetic gene circuit. Nature 538:514–517. https://doi.org/10.1038/nature19841
6. Wang P, Robert L, Pelletier J et al (2010) Robust growth of Escherichia coli. Curr Biol 20:1099–1103. https://doi.org/10.1016/j.cub.2010.04.045
7. Norman TM, Lord ND, Paulsson J, Losick R (2013) Memory and modularity in cell-fate decision making. Nature 503:481–486. https://doi.org/10.1038/nature12804
8. Potvin-Trottier L, Luro S, Paulsson J (2018) Microfluidics and single-cell microscopy to study stochastic processes in bacteria. Curr Opin Microbiol 43:186–192. https://doi.org/10.1016/j.mib.2017.12.004
9. Alon U (2007) An introduction to systems biology: design principles of biological circuits. Chapman & Hall/CRC, Boca Raton, FL
10. Phillips R, Kondev J, Theriot J et al (2013) Physical biology of the cell, 2nd edn. Garland Science, New York, NY
11. Munsky B, Hlavacek WS, Tsimring LS (2018) Quantitative biology: theory, computational methods, and models. MIT Press, Cambridge, MA
12. Ingalls B (2013) Mathematical modelling in systems biology: an introduction. MIT Press, Cambridge, MA
13. Bialek WS (2012) Biophysics: searching for principles. Princeton University Press, Princeton, NJ
14. Wikipedia (2019) Gillespie algorithm. https://en.wikipedia.org/wiki/Gillespie_algorithm
15. Kernst OK (2015) Gillespie’s stochastic simulation algorithm for chemical reactions. In: Wolfram Alpha Demonstr. https://demonstrations.wolfram.com/GillespiesStochasticSimulationAlgorithmForChemicalReactions/
16. Hilfinger A, Norman TM, Paulsson J (2016) Exploiting natural fluctuations to identify kinetic mechanisms in sparsely characterized systems. Cell Syst 2:251–259. https://doi.org/10.1016/j.cels.2016.04.002
17. Hilfinger A, Norman TM, Vinnicombe G, Paulsson J (2016) Constraints on fluctuations in sparsely characterized biological systems. Phys Rev Lett 116:058101. https://doi.org/10.1103/PhysRevLett.116.058101
18. Milo R, Jorgensen P, Moran U et al (2010) BioNumbers--the database of key numbers in molecular and cell biology. Nucleic Acids Res 38:D750–D753. https://doi.org/10.1093/nar/gkp889
19. Milo R, Phillips R (2016) Cell biology by the numbers. Garland Science, New York, NY
20. Guet CC, Bruneaux L, Min TL et al (2008) Minimally invasive determination of mRNA concentration in single living bacteria. Nucleic Acids Res 36:e73. https://doi.org/10.1093/nar/gkn329
21. Siwiak M, Zielenkiewicz P (2013) Transimulation - protein biosynthesis web service. PLoS One 8:e73943. https://doi.org/10.1371/journal.pone.0073943
22. Elowitz MB (1999) Transport, assembly, and dynamics in systems of interacting proteins. PhD Thesis. Princeton University, Princeton, NJ
23. El Samad H, Khammash M, Petzold L, Gillespie D (2005) Stochastic modelling of gene regulatory networks. Int J Robust Nonlinear Control 15:691–711. https://doi.org/10.1002/rnc.1018
24. Paulsson J (2005) Models of stochastic gene expression. Phys Life Rev 2:157–175. https://doi.org/10.1016/j.plrev.2005.03.003
25. Elowitz MB, Levine AJ, Siggia ED, Swain PS (2002) Stochastic gene expression in a single cell. Science 297:1183–1186. https://doi.org/10.1126/science.1070919
26. Ozbudak EM, Thattai M, Kurtser I et al (2002) Regulation of noise in the expression of a single gene. Nat Genet 31:69–73. https://doi.org/10.1038/ng869
27. Raj A, Van Oudenaarden A (2009) Single-molecule approaches to stochastic gene expression. Annu Rev Biophys 38:255–270. https://doi.org/10.1146/annurev.biophys.37.032807.125928
28. Paulsson J (2004) Summing up the noise in gene networks. Nature 427:415–418. https://doi.org/10.1038/nature02257
29. McQuarrie DA (1967) Stochastic approach to chemical kinetics. J Appl Probab 4:413–478. https://doi.org/10.2307/3212214
30. van Kampen NG (2007) Stochastic processes in physics and chemistry, 3rd edn. Elsevier, Amsterdam
31. Gillespie DT (1977) Exact stochastic simulation of coupled chemical reactions. J Phys Chem 81:2340–2361. https://doi.org/10.1021/j100540a008
32. Mihalcescu I, Hsing W, Leibler S (2004) Resilient circadian oscillator revealed in individual cyanobacteria. Nature 430:81–85. https://doi.org/10.1038/nature02533
33. Teng S-W, Mukherji S, Moffitt JR et al (2013) Robust circadian oscillations in growing cyanobacteria require transcriptional feedback. Science 340:737–740. https://doi.org/10.1126/science.1230996
34. Chabot JR, Pedraza JM, Luitel P, van Oudenaarden A (2007) Stochastic gene expression out-of-steady-state in the cyanobacterial circadian clock. Nature 450:1249–1252. https://doi.org/10.1038/nature06395
35. Stricker J, Cookson S, Bennett MR et al (2008) A fast, robust and tunable synthetic gene oscillator. Nature 456:516–519. https://doi.org/10.1038/nature07389
36. Tigges M, Dénervaud N, Greber D et al (2010) A synthetic low-frequency mammalian oscillator. Nucleic Acids Res 38:2702–2711. https://doi.org/10.1093/nar/gkq121
37. Danino T, Mondragón-Palomino O, Tsimring L, Hasty J (2010) A synchronized quorum of genetic clocks. Nature 463:326–330. https://doi.org/10.1038/nature08753
38. Mondragón-Palomino O, Danino T, Selimkhanov J et al (2011) Entrainment of a population of synthetic genetic oscillators. Science 333:1315–1319. https://doi.org/10.1126/science.1205369
39. Prindle A, Selimkhanov J, Li H et al (2014) Rapid and tunable post-translational coupling of genetic circuits. Nature 508(7496):387–391. https://doi.org/10.1038/nature13238
40. Buchler NE, Louis M (2008) Molecular titration and ultrasensitivity in regulatory networks. J Mol Biol 384:1106–1119. https://doi.org/10.1016/j.jmb.2008.09.079
41. Buchler NE, Cross FR (2009) Protein sequestration generates a flexible ultrasensitive response in a genetic network. Mol Syst Biol 5:272. https://doi.org/10.1038/msb.2009.30
42. Genot AJ, Fujii T, Rondelez Y (2012) Computing with competition in biochemical networks. Phys Rev Lett 109:1–5. https://doi.org/10.1103/PhysRevLett.109.208102
43. Lee T-H, Maheshri N (2012) A regulatory role for repeated decoy transcription factor binding sites in target gene expression. Mol Syst Biol 8:576. https://doi.org/10.1038/msb.2012.7
44. Brewster RC, Weinert FM, Garcia HG et al (2014) The transcription factor titration effect dictates level of gene expression. Cell 156:1312–1323. https://doi.org/10.1016/j.cell.2014.02.022
45. Lillacci G, Aoki SK, Gupta A et al (2019) A universal rationally-designed biomolecular integral feedback controller for robust perfect adaptation. Nature 570:533–537. https://doi.org/10.1038/s41586-019-1321-1
46. Hill A (1910) The possible effects of the aggregation of the molecules of haemoglobin on its oxygen dissociation curve. J Physiol 40:4–7
47. Strogatz S (2015) Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering, 2nd edn. Westview Press, Boulder, CO
48. Epstein IR, Pojman JA (1998) An introduction to nonlinear chemical dynamics: oscillations, waves, patterns, and chaos. Oxford University Press, New York, NY
49. Weiße AY, Oyarzún DA, Danos V et al (2015) Mechanistic links between cellular trade-offs, gene expression, and growth. Proc Natl Acad Sci U S A 112:E1038–E1047. https://doi.org/10.1073/pnas.1416533112
50. Niederholtmeyer H, Sun ZZ, Hori Y et al (2015) Rapid cell-free forward engineering of novel genetic ring oscillators. eLife 4:1–18. https://doi.org/10.7554/elife.09771
51. Taniguchi Y, Choi PJ, Li G-W et al (2010) Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single cells. Science 329:533–538. https://doi.org/10.1126/science.1188308
Chapter 4
Automated Biocircuit Design with SYNBADm
Irene Otero-Muras and Julio R. Banga
Abstract
SYNBADm is a Matlab toolbox for the automated design of biocircuits using a model-based optimization approach. It enables the design of biocircuits with pre-defined functions starting from libraries of biological parts. SYNBADm makes use of mixed integer global optimization and allows both single and multi-objective design problems. Here we describe a basic protocol for the design of synthetic gene regulatory circuits. We illustrate step-by-step how to solve two different problems: (1) the (single objective) design of a synthetic oscillator and (2) the (multi-objective) design of a circuit with switch-like behavior upon induction, with a good compromise between performance and protein production cost.
Key words Automated design, Biological parts, Global optimization, Mixed Integer Nonlinear Programming, Multi-objective optimization, Trade-offs, Synthetic biology
1 Introduction
Synthetic biology aims to provide a framework for the rational bottom-up engineering of biocircuits with a priori defined functionalities. Computational tools can play a major role in the design of these synthetic biosystems [1]. The challenge is to map sequence to function in a predictable manner [2] such that, given a function of interest, we obtain the DNA sequence encoding the transcriptional circuit to be implemented in cells to execute it. One approach to automated design is inspired by the design of electronic circuits, and relies on truth tables and logic gates (see [3] and references therein). A second main approach is based on continuous dynamic models, usually with mechanistic meaning (see [4] and references therein). Both approaches aim at meeting the principles of modularity, orthogonality, predictability, and reliability (enumerated by Xiang et al. [5] as key principles of automated design). SYNBADm [4] belongs to the second family of methods, and combines continuous dynamic simulation with advanced mixed integer optimization solvers. A key novelty of this toolbox is that it can handle multi-objective design problems, i.e. those with conflicting design criteria. For these situations,
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_4, © Springer Science+Business Media, LLC, part of Springer Nature 2021
SYNBADm provides the set of best trade-offs between the objectives (usually called the Pareto-optimal set). In the next section we describe the formalism and methods used in this software. In Subheading 3 we indicate how to install and initialize SYNBADm. In Subheadings 4 and 5 we describe how to design an oscillator (single objective design) and a circuit with switch-like behavior (multi-objective design), respectively.
2 Methods
SYNBADm combines a dynamic modeling formalism with model-based design methods using global optimization. These two main components are described below.
2.1 The Modeling Framework
SYNBADm is based on continuous dynamic descriptions of the behavior of biological circuits. In particular, it uses models based on nonlinear deterministic ordinary differential equations (ODEs). The dynamics of gene regulatory networks are internally encoded by a superstructure of the form:

$$\frac{dc}{dt} = f(c, w, y, k), \qquad (1)$$

where c is the vector of species concentrations, w is a vector of tunable (real) parameters, y is a vector of binary variables that describes the topology of the circuit, and k is a vector of fixed parameters. Given a specific library of components and user-defined design objective(s) and specifications, SYNBADm automatically generates the corresponding superstructure(s) that optimize(s) the objective(s) and is compatible with the specifications. Currently, SYNBADm allows us to use two classes of libraries, denoted as Mass Action and Hill-type libraries, respectively. The user decides which type of library is more convenient depending on the desired level of model granularity. The superstructure in Eq. 1 corresponding to a mass action library is a plain mass action kinetic model describing the dynamics of genes, intermediates, mRNAs, and proteins. The superstructure for a Hill library, also in the form of Eq. 1, is a reduced model describing only the dynamics of the proteins with Hill kinetics. In both cases, a library is constituted by the following elements:
• a set of promoters (P),
• a set of protein Coding Sequences (CDS),
• a set of Inducers (I),
• CDS to P relations (what promoter is affected by each transcript),
• I to CDS relations (what transcript is affected by each inducer).

[Figure 1: library of components (promoters P1 to P3, RBS, CDS 1 to 5, Ter) assembled into devices numbered 1 to 15, which are combined into a system]
Fig. 1 Biological components: Promoters, Ribosome Binding Site (RBS), Protein Coding Sequences (CDS), and Terminator (Ter); Devices and Systems
To illustrate how the vector y of binary variables encodes the structure of a particular system (biocircuit), we use the example in Fig. 1. In this case we have three different promoters and five coding sequences. This gives a total of 15 different devices. These 15 devices are internally labeled from 1 to 15. In this way, the structure of any circuit (which is a combination of devices) is given by a vector of 15 binary variables. If a device j is part of the circuit, then the corresponding component is $y_j = 1$ (and zero otherwise). The circuit in the figure is given, in terms of structure, by the vector y = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], or equivalently by a 5 × 3 matrix where the entry $Y_{i,k}$ corresponds to a device with promoter k and coding sequence i:

$$Y = \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \qquad (2)$$

2.2 Design as an Optimization Problem
In the framework described above, a biocircuit is completely characterized by a vector of integer (and/or binary) variables (defining the topology) and a vector of real variables (the tunable parameters of the parts). The design problem is formulated as follows: given a set (library) of components, a design objective, and a set of design specifications, find the optimal biocircuit given by the connectivity of a certain subset of parts and the value of their tunable parameters. In other words, SYNBADm performs a simultaneous global search in topology and parameter spaces by means of an
optimization algorithm and a mixed integer description of the biocircuit dynamics. The toolbox allows single and multi-objective formulations. In the latter case, instead of a single solution, SYNBADm gives a Pareto-optimal set of solutions (i.e. the best trade-offs between the objectives). From a mathematical point of view, the design problem is formulated as a mixed integer multi-objective dynamic optimization problem. Technical details can be found in [6]. In addition to model-based optimal design, SYNBADm can also solve dynamic simulations, i.e. given a model of a biocircuit and certain initial conditions, obtain its time-dependent behavior by solving the corresponding initial value problem.

2.3 Optimization Solvers
The class of mixed integer dynamic optimization problems considered above is NP-hard. Although these problems can be solved with purely stochastic methods (such as simulated annealing, genetic algorithms, etc.), that would be extremely costly computationally, since these methods require a very large number of evaluations of the cost function (and therefore many simulations of the explored biocircuits). To avoid this, SYNBADm includes four global optimization solvers which are based on metaheuristics, combining stochastic global search with efficient local search methods. The main advantage is that we keep the global character of the search (escaping from local solutions) while increasing efficiency dramatically thanks to the deterministic local solvers. Currently, SYNBADm offers the following metaheuristics:
• eSS (enhanced Scatter Search by Egea et al. [7]): for Mixed Integer Nonlinear Programming problems; it handles constraints and incorporates the local solver MISQP (Mixed Integer Sequential Quadratic Programming by Exler et al. [8]).
• MITS (Mixed Integer Tabu Search by Exler et al. [9]): for Mixed Integer Nonlinear Programming problems; it incorporates the local solver MISQP.
• ACOmi (Ant Colony Optimization for mixed integer domain by Schlueter et al. [10]): for Mixed Integer Nonlinear Programming problems; it incorporates the local solver MISQP.
• VNS (Variable Neighborhood Search by Mladenovic et al. [11]): for Integer Nonlinear Programming problems; it does not handle constraints. Use this solver for unconstrained single objective problems with integer (or binary) variables only.
It should be noted that, due to their stochastic and heuristic nature, these methods cannot guarantee global optimality. However, it should also be recalled that deterministic global optimization methods, which could in principle offer such guarantees, are in practice too computationally costly to be applied to problems of realistic size. In contrast, the metaheuristics considered in
SYNBADm usually provide near-globally optimal solutions in reasonable computation times. We have obtained excellent results on benchmark problems with known solutions, and we have obtained similar or better solutions for a number of published case studies [6]. Besides, for many problems we can formulate objective functions that are bounded, so we can easily assess how good these solutions are despite not being able to guarantee their global optimality. For problems with constraints (e.g. minimum and maximum number of devices) and/or mixed integer-continuous variables, eSS, MITS, or ACOmi can be used. All three are suitable for single and multi-objective design, but their comparative performance will be problem dependent. For new problems with no prior knowledge about the expected solution, it is good practice to use all of them. In our experience, eSS usually performs well in synthetic gene circuit design independently of the balance between the number of real and integer variables. MITS and ACOmi usually perform better for problems without (or with few) real decision variables. However, due to their stochastic nature, we recommend solving each new problem with these three different solvers and cross-checking the results.

2.4 Practical Examples
Below we consider two practical examples, giving a step-by-step description of their solution using SYNBADm. In these examples, we consider design problems for two different target behaviors: (1) an oscillator and (2) a circuit with a switch-like behavior upon induction. For these case studies we will use no a priori information about the configurations or architectures leading to the behaviors of interest. We will make use of built-in SYNBADm libraries that we will modify to match the biological components available in each case.

3 SYNBADm Installation and Initialization
SYNBADm is available under a GPLv3 license at https://sites.google.com/site/synbadm, and runs under the MATLAB numerical computing environment (www.mathworks.com).
1. First, make sure that a compatible C++ compiler is installed. SYNBADm allows fast dynamic simulation by automatically converting dynamic models to C code. This feature requires the installation of a compatible C++ compiler. For more information, go to: http://es.mathworks.com/support/compilers. Alternatively, dynamic models can be integrated with Matlab ODE solvers (without requiring a C++ compiler) but the execution times will be much longer.
2. Unzip and copy the toolbox folder to a directory of your choice (it is important to (re)name the main folder as SYNBAD).
3. (For Linux users only) The first time you use SYNBADm, you need to (1) in Matlab, change to the SYNBAD main folder and run >>SYNBAD_install; and (2) compile the default library files (in order to use C++ integrators) by changing to the folder MA_library and executing:
>>SYNBAD_Makelibrary_MA_C(MA_input_library)
then changing to the folder HL_library and executing:
>>SYNBAD_Makelibrary_HL_C(HL_input_library)
This step is needed only the first time you use SYNBADm, before running >>SYNBAD_Startup.
4. From Matlab, change to the SYNBAD directory and run >>SYNBAD_Startup, which adds all the relevant files to the path. Remember to run >>SYNBAD_Startup in every new Matlab session.
5. Test the installation by running any of the examples available in the Examples folder. For example 1, execute >>Run_Example_1. If the installation is correct, the optimization will run and the corresponding results will be stored in the file RESULTS_DESIGN.mat.
4 Design of an Oscillator with SYNBADm
4.1 Definition of the Problem
The goal is to find an endogenous oscillator (i.e. the system oscillates without the need of an external inducer) starting from a library of available components. We consider mass action kinetics, because we are interested in tracking the concentrations of proteins and mRNAs (see Note 1). We have the following components available: two different promoters Pλ and Ptet that we denote, respectively, by P1, P2 and four protein coding sequences (cI, tetR, lacI, luxI). In addition, we consider an extra promoter P3, repressed by lacI with tunable promoter strength. To generate the corresponding SYNBADm library we will use as a template the built-in mass action library and modify it accordingly.
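To make the mass action description more concrete, the following Python sketch integrates the reactions that Note 1 associates with a single device (a promoter repressed by one protein and driving one transcript), using the rate constants listed in Fig. 3. It is only an illustration under stated assumptions and not SYNBADm code: the explicit Euler scheme, the initial state, the closed repressor pool, and the function name simulate_device are all hypothetical choices made here.

```python
# Sketch only (not SYNBADm code): mass action model of one device, a promoter P
# repressed by protein T1 and driving transcript T2, following Note 1.

def simulate_device(repressor_total, t_end=100.0, dt=0.002):
    """Integrate the device with explicit Euler and return the final state."""
    nav = 6.0221415e23 * 1e-14 / 1e9   # NA*V/1e9, as in Fig. 3
    kf_pt, kb_pt = nav, 0.5            # repressor-promoter binding / unbinding
    ktransc, kleak = 0.00005, 0.12     # transcription: bound / free promoter
    ktransl, kdeg_m, kdeg = 0.1, 0.001, 0.001

    # Assumed initial state: one unit of free promoter, no mRNA, no protein.
    # The repressor is treated as a closed pool (no synthesis or degradation).
    p, pt, t1, m, t2 = 1.0, 0.0, float(repressor_total), 0.0, 0.0
    for _ in range(int(round(t_end / dt))):
        binding = kf_pt * p * t1 - kb_pt * pt          # net P + T1 <-> PT1
        dm = kleak * p + ktransc * pt - kdeg_m * m     # mRNA balance
        dprot = ktransl * m - kdeg * t2                # protein balance
        p -= binding * dt
        pt += binding * dt
        t1 -= binding * dt
        m += dm * dt
        t2 += dprot * dt
    return {"P": p, "PT1": pt, "T1": t1, "mRNA": m, "T2": t2}


free = simulate_device(repressor_total=0.0)
repressed = simulate_device(repressor_total=10.0)
print("mRNA without repressor:", round(free["mRNA"], 3))
print("mRNA with repressor:   ", round(repressed["mRNA"], 3))
```

With these values the repressed promoter transcribes far less than the free one, which is the behavior that the optimizer combines across devices when it searches for an oscillating ring.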
4.2 Preparing the Library of Components
1. In the folder MA_Library (within USR_Libraries), we open the script MA_input_library that we are going to use as a template. Before doing any modification, we save the script with a different name, MA_input_library_EX1.
    library.name_of_function='MAex1';
    library.promoters={'P1','P2','P3'}; % list of promoters
    library.transcripts={'cI','tetR','lacI','luxI'}; % list of protein coding regions
    library.prom_tf={'cI','tetR','lacI'}; % transcript affecting each promoter
    library.inducers={}; % list of inducers
    library.ind_tr={}; % transcript affected by each inducer
Fig. 2 Library input file for example 1: MA_input_library_EX1.m
2. Now, in the file MA_input_library_EX1 we fill in the fields of the structure library, as indicated in Fig. 2, where name_of_function contains a short name to identify the library files, promoters is a row cell array of strings containing the names of the promoters, transcripts is a row cell array of strings containing the names of the protein coding regions, prom_tf is the list of transcripts repressing each promoter, inducers is a row cell array of strings containing the names of the inducers (empty in this case), and ind_tr is a row cell array of strings containing the names of the transcript being bound by each inducer (empty in this case).
3. In order to generate the library files we call the SYNBADm mass action library function:
>>SYNBAD_Makelibrary_MA_C(MA_input_library_EX1)
This generates the C++ library code, for efficient integration with CVODES.
4. We can check the ordinary differential equations by opening the generated odefile MAex1_odefile_c.
5. The values of default initial conditions are stored in MAex1_default_states.m.
6. The values of the default parameters are stored in MAex1_default_parameters.m. We are going to use the values in the literature [12] for all the parameters. The default value for kf_pt_3 is also taken from the literature (same source), but note that this parameter is a decision variable in the optimization problem. We use as a template the file MAex1_default_parameters.m, and after the corresponding modifications, we save it as MAex1_parameters_1.m. The values of the parameters are shown in Fig. 3.
4.3 Defining the Objective Function
SYNBADm has a number of built-in objective functions included in the folder USR_ObjFuns. The function OF_Oscil is the objective function specifically suited to the design of oscillators. Therefore, in this case we do not need to define an ad hoc objective function and can use the built-in function OF_Oscil directly. We only need to adapt the list of species to the library that we have defined for our problem. To do this, we open OF_Oscil and substitute the default list of species with the one we are currently using, i.e.: trnsc = {'cI','tetR','lacI','luxI'}; The objective function is based on the autocorrelation function and has been described in [13].
    function k=MAex1_parameters

    NA = 6.0221415e23; % Avogadro
    V = 1e-14; % Cell volume
    NAV = NA*V/1e9; % For concentration in nM

    kf_pt_1=NAV; kf_pt_2=NAV; kf_pt_3=NAV;
    kb_pt_1=0.5; kb_pt_2=0.5; kb_pt_3=0.5;
    kdeg_pt_1=0.075; kdeg_pt_2=0.075; kdeg_pt_3=0.075;
    ktransc_1=0.00005; ktransc_2=0.00005; ktransc_3=0.00005;
    kleak_1=0.12; kleak_2=0.09; kleak_3=0.01;
    ktransl_1=0.1; ktransl_2=0.1; ktransl_3=0.1; ktransl_4=0.1;
    kdeg_m_1=0.001; kdeg_m_2=0.001; kdeg_m_3=0.001; kdeg_m_4=0.001;
    kdeg_1=0.001; kdeg_2=0.001; kdeg_3=0.001; kdeg_4=0.001;
Fig. 3 Parameter values considered for example 1: MAex1_parameters_1.m
4.4 Solving the Single Objective Optimization Problem
In the library USR_inputs, we create the input file (a pre-existing input file can be used as a template), and we save it as Oscillator_MAex1.m (see Fig. 4). In this file we indicate:
1. The model options, including type of library, name of the odefile to be used, number of variables of each type to be considered for the design, and the names of the scripts containing the values of parameters, initial conditions, etc.
2. The options for the design, including the name of the script with the objective function to be used, the lower and upper bounds for the decision variables, the minimum and maximum number of devices allowed in the final system, and the indices of the parameters to be tuned, in this case inputs.design.idx = {3}, as it corresponds to the parameter kf_pt_3.
3. The options for simulation, including the time interval for the integration inputs.simulate.tspan.
4. The options for the MINLP solvers (we choose the optimization solver, and in the case of ESS, MITS, or ACO we also choose the local solver to be used). In the presence of integer or binary variables, the local solver to be used is MISQP.
    %================================
    % MIXED INTEGER MODEL FRAMEWORK
    %================================
    inputs.model.libtype = 'MA_Library'; % Choose 'MA_Library' | 'HL_Library'
    inputs.model.ode_name = 'MAex1_odefile_c';
    inputs.model.n_integer_var = 0;
    inputs.model.n_real_var = 1;
    inputs.model.n_binary_var = 12;
    inputs.model.def_param_function = 'MAex1_parameters_1';
    inputs.model.def_state_function = 'MAex1_default_states';
    inputs.model.transc_promot_function = 'MAex1_transcripts_and_promoters';
    inputs.model.u_values = [];
    %============================
    % DESIGN PROBLEM OPTIONS
    %============================
    inputs.design.objective = 'OF_Oscil';
    inputs.design.idx = {3};
    inputs.design.par_x = [];
    inputs.design.var_L = zeros(1,13);
    inputs.design.var_U = ones(1,13);
    inputs.design.var_0 = zeros(1,13);
    inputs.design.D_max = 3; % only applies in MITS, ESS, ACO
    inputs.design.D_min = 3; % only applies in MITS, ESS, ACO
    %====================================
    % SIMULATE OPTIONS
    %====================================
    inputs.simulate.var_circuit = [];
    inputs.simulate.tspan = 0:10:40000;
    inputs.simulate.objective = {'OF_Oscil'};
    %==================================
    % MINLP SOLVER OPTIONS
    %==================================
    inputs.optsol.optsolver = 'ESS'; % Choose MINLP solver 'ESS' | 'MITS' | 'ACO' | 'VNS'
    inputs.optsol.maxtime = 100;
    inputs.optsol.maxeval = [];
    % ess options
    inputs.optsol.ess.local.solver = 'misqp';
    %==================================
    % IVP SOLVER OPTIONS
    %==================================
    inputs.ivpsol.rtol = 1.0D-7; % [] IVP solver integration tolerances
    inputs.ivpsol.atol = 1.0D-7;
Fig. 4 SYNBAD design input file for example 1: Oscillator_MAex1.m
5. The options for the integration, mainly the tolerances for the initial value problem (IVP) solver.
Once the input file is completed, we call (from the main directory) the function that solves the single objective design problem:
>>SYNBAD_Design_SO(Oscillator_MAex1)
When the computation time selected in the design input file has elapsed (100 s in this case), the best solution found is stored in the file RESULTS_DESIGN.mat. Note that the design problem might have more than one optimal solution and, because we use global optimization solvers, the solution obtained by SYNBADm might differ between calls to SYNBAD_Design_SO. Here we find the following solution:
results.xbest = [0.013209502186800 0 0 1 1 0 0 0 1 0 0 0 0]
which corresponds to the circuit in Fig. 5. The value of the objective function for the optimal circuit is results.fbest = -0.739787373366868. We recommend saving the mat file containing the results under a different name (RESULTS_DESIGN_EX1_T1) in the USR_Results folder, to avoid overwriting the results in further calls to the single objective design function.
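As a reading aid, the sketch below decodes a solution vector of this shape (one real entry followed by twelve binary entries, i.e. 3 promoters × 4 transcripts) into promoter-transcript devices. The ordering of the binary block is an assumption made here, not something taken from the SYNBADm documentation: the entries are read transcript by transcript, with one flag per promoter. Under that assumption the vector above decodes to a three-device ring, which is consistent with the circuit labels of Fig. 5. The function name decode_design is hypothetical.

```python
# Sketch only: decode a design vector [real entries | binary entries].
# ASSUMPTION: binary entries are grouped by transcript, one flag per promoter.

PROMOTERS = ["P1", "P2", "P3"]
TRANSCRIPTS = ["cI", "tetR", "lacI", "luxI"]


def decode_design(xbest, n_real=1):
    """Return (real_values, [(promoter, transcript), ...]) for active devices."""
    real_values = list(xbest[:n_real])
    binary = list(xbest[n_real:])
    expected = len(PROMOTERS) * len(TRANSCRIPTS)
    if len(binary) != expected:
        raise ValueError(f"expected {expected} binary entries, got {len(binary)}")
    devices = []
    for t_idx, transcript in enumerate(TRANSCRIPTS):
        for p_idx, promoter in enumerate(PROMOTERS):
            if binary[t_idx * len(PROMOTERS) + p_idx] == 1:
                devices.append((promoter, transcript))
    return real_values, devices


xbest = [0.013209502186800, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
tuned, devices = decode_design(xbest)
print("kf_pt_3 =", tuned[0])
print(devices)  # [('P3', 'cI'), ('P1', 'tetR'), ('P2', 'lacI')]
```

Since P1, P2, and P3 are repressed by cI, tetR, and lacI, respectively, these three devices form a cyclic chain of repressions (cI represses P1, which drives tetR; tetR represses P2, which drives lacI; lacI represses P3, which drives cI).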
Fig. 5 Oscillator found by SYNBADm (matrix and circuit scheme). Panels: circuit superstructure matrix (active pairs in red) over the promoters P1, P2, P3 and the transcripts t1 (cI), t2 (tetR), t3 (lacI), t4 (luxI); circuit scheme with kf_pt_3 = 0.0132
Fig. 6 Dynamics of all the species involved in the oscillator obtained by SYNBADm (one panel per species: P1, P2, P3, P1cI, P2tetR, P3lacI, cI, tetR, lacI, and the mRNAs cIm, tetRm, lacIm; horizontal axis: time t, ×10^4)
4.5 Simulating the Dynamics of a Circuit
If we want to simulate the dynamics of a circuit (the optimal circuit found by SYNBADm or any other combination), we only have to fill in the simulation options in the input file Oscillator_MAex1.m, including the vector describing the circuit, inputs.simulate.var_circuit. Importantly, if we choose a circuit other than the solution found, the vector needs to preserve the same structure in terms of the number and class (binary or real) of the entries. The simulation is carried out by running: >>SYNBAD_Simulate(Oscillator_MAex1). Using the solution found, the dynamics of all the species involved are automatically plotted, as shown in Fig. 6.
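To check whether a simulated trajectory is actually periodic, an autocorrelation-based score in the spirit of the objective described in [13] can be useful. The sketch below is a simplified stand-in written for illustration; it is not the implementation of OF_Oscil, and the scoring rule (minus the highest autocorrelation peak after the first negative lag) is an assumption chosen here so that sustained oscillations score close to -1 and non-oscillating signals score close to 0.

```python
import math


def autocorrelation(x):
    """Normalized autocorrelation of a sequence for all non-negative lags."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    var = sum(v * v for v in d)
    if var == 0.0:  # constant signal: no structure to correlate
        return [0.0] * n
    return [sum(d[i] * d[i + lag] for i in range(n - lag)) / var
            for lag in range(n)]


def oscillation_score(x):
    """Minus the highest autocorrelation peak after the first negative lag."""
    ac = autocorrelation(x)
    first_negative = next((lag for lag, v in enumerate(ac) if v < 0.0), None)
    if first_negative is None:
        return 0.0
    return -max(max(ac[first_negative:]), 0.0)


sine = [math.sin(2.0 * math.pi * i / 50.0) for i in range(400)]
decay = [math.exp(-i / 40.0) for i in range(400)]
print("sustained oscillation:", round(oscillation_score(sine), 3))
print("monotone decay:       ", round(oscillation_score(decay), 3))
```

A strongly negative value, such as the results.fbest reported for the oscillator example, would therefore indicate a pronounced repeat of the signal after one period under this kind of measure.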
5 Design of a Switch-Like Circuit with SYNBADm
5.1 Definition of the Problem
The second example consists of finding circuits that behave as switches upon stimulation by different inducers (starting from a library containing different promoters, protein coding sequences, and inducers). By switch-like performance we understand, as in [14], that the steady state level of LacI is high upon aTc induction and low upon IPTG induction, whereas the steady state level of tetR is low upon aTc induction and high upon IPTG induction. In order to ensure an optimal use of the cell resources, we take into account the protein production cost as a second optimization objective. In this case we consider Hill kinetics (we are interested only in the dynamics of the proteins involved, see Note 2). We have the following components available: five different promoters Plac1, Plac2, Pλ, Ptet, ParaC that we denote, respectively, by P1 ... P5, and four protein coding sequences (tetR, lacI, cI, araC). The first two promoters are both repressed by lacI but with different affinities. To generate the corresponding SYNBADm library we will use the built-in Hill kinetics library as a template and modify it accordingly.
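The effect of the two different lacI affinities can be illustrated with the Hill-type rate of Note 2, r = α / (1 + K·T^n). The following sketch evaluates it for the two lac promoters using the constants of Fig. 8 (K_Plac1 = 10, K_Plac2 = 0.01, Hill coefficient 4) and, as an example, the production constant of tetR. The function names and the choice of repressor levels are illustrative assumptions, and units are left as in the source.

```python
# Sketch only: Hill-type repression rate r = alpha / (1 + K * T**n) (Note 2).

def hill_repression_rate(alpha, K, repressor, n):
    """Lumped transcription-translation rate of a repressed promoter."""
    return alpha / (1.0 + K * repressor ** n)


def half_repression_level(K, n):
    """Repressor level at which the rate drops to alpha / 2."""
    return K ** (-1.0 / n)


ALPHA_TETR = 1.215                        # alpha_tetR in Fig. 8
K_LAC = {"Plac1": 10.0, "Plac2": 0.01}    # K_Plac1, K_Plac2 in Fig. 8
N_HILL = 4                                # prom_nhill entry for both promoters

for name, K in K_LAC.items():
    rates = [hill_repression_rate(ALPHA_TETR, K, lacI, N_HILL)
             for lacI in (0.0, 1.0, 5.0)]
    print(name, [round(r, 4) for r in rates],
          "half-repression at lacI =", round(half_repression_level(K, N_HILL), 3))
```

Under these values Plac1 is already strongly repressed at a lacI level of 1, whereas Plac2 is almost unaffected, which gives the optimizer two qualitatively different lacI-responsive parts to choose from.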
5.2 Preparing the Library of Components
1. In the folder HL_Library (within USR_Libraries), we open the script HL_input_library that we are going to use as a template. Before doing any modification, we save the script with a different name, HL_input_library_EX2.
2. Now, in the file HL_input_library_EX2 we fill in the fields of the structure library, as indicated in Fig. 7, where name_of_function contains a short name to identify the library files, promoters is a row cell array of strings containing the names of the promoters, transcripts is a row cell array of strings containing the names of the protein coding regions, prom_tf is the list of transcripts repressing each promoter, prom_nhill contains the Hill coefficients for each repressor-promoter pair, inducers is a row cell array of strings containing the names of the inducers, and ind_tr is a row cell array of strings containing the names of the transcript being bound by each inducer.
3. In order to generate the library files we call the SYNBADm Hill kinetics library function:
>>SYNBAD_Makelibrary_HL_C(HL_input_library_EX2)
This generates the C++ library code, for efficient integration with CVODES.
4. We can check the ordinary differential equations by opening the generated odefile HLex2_odefile_c.
5. The values of default initial conditions are stored in HLex2_default_states.m.
6. The values of the default parameters are stored in HLex2_default_parameters.m. We are going to use the values in the literature [14] for all the parameters. We use as a template the file HLex2_default_parameters.m, and after making the corresponding modifications (the values of the parameters are shown in Fig. 8), we save the script as HLex2_parameters_1.m.
    library.name_of_function='HLex2';
    library.promoters={'Plac1','Plac2','Plambda','Ptet','ParaC'}; % list of promoters
    library.transcripts={'tetR','lacI','cI','araC'}; % list of protein coding regions
    library.prom_tf={'lacI','lacI','cI','tetR','araC'}; % transcript affecting each promoter
    library.prom_nhill=[4,4,2,2,2]; % Hill coefficients of Repressor-Promoter pairs
    library.inducers={'IPTG','aTc'}; % list of inducers
    library.ind_tr={'lacI','tetR'}; % transcript affected by each inducer
Fig. 7 Library input file for example 2: HL_input_library_EX2.m
    function k=HLex2_parameters_1
    K_Plac1=10;
    K_Plac2=0.01;
    K_Plambda=0.33;
    K_Ptet=0.014;
    K_ParaC=2.5;
    alpha_tetR=1.215;
    alpha_lacI=1.215;
    alpha_cI=2.92;
    alpha_araC=1.215;
    kdeg_tetR=0.0346;
    kdeg_lacI=0.0346;
    kdeg_cI=0.0693;
    kdeg_araC=0.0115;
    kf_lacIIPTG=0.05;
    kf_tetRaTc=0.05;
    kb_lacIIPTG=0.1;
    kb_tetRaTc=0.1;
    kdeg_lacIIPTG=0.0693;
    kdeg_tetRaTc=0.0693;
Fig. 8 Parameter values considered for example 2: HLex2_parameters_1.m
5.3 Defining the Objective Functions
The first objective function must encode the desired switch-like behavior: namely, the steady state level of LacI must be high upon aTc and low upon IPTG induction, whereas the steady state level of tetR must be low upon aTc and high upon IPTG induction, as it has been defined in [14]. To ensure that we achieve the desired functionality at a minimal consumption of cell resources we consider as a second objective a proxy of the protein production cost as defined elsewhere [15]. Both objective functions can be defined by modifying accordingly the templates denoted, respectively, as OF_Switch and OF_Cost available in SYNBADm, in the folder USR_ObjFuns. We save the corresponding functions as OF1_Switch and OF2_Cost in USR_ObjFuns.
Fig. 9 Scheme of the ε-constraint strategy as implemented in SYNBADm (axes: Objective 1, bounded by O1L and O1U, and Objective 2, bounded by O2L and O2U; sol1 and sol2 mark the extremes of the Pareto front)
5.4 Solving the Multi-Objective Optimization Problem
SYNBADm solves bi-objective optimization problems using the ε-constraint strategy [16]. First, we choose our objective 1 and objective 2 and solve, for each of them, a single objective optimization problem. In this way we obtain the extremes of the Pareto front, denoted in Fig. 9 by sol1 and sol2. In this example, we choose the circuit performance as our first objective and the protein production cost as the second objective. In order to solve the first single objective optimization problem, we create the corresponding input file Switch_HLex2_OBJ1.m in the library USR_inputs, as indicated in Fig. 10. Once this file is created, we execute:
>>SYNBAD_Design_SO(Switch_HLex2_OBJ1)
obtaining as a result the first extreme of the Pareto front (sol1 in Fig. 9); we rename the mat file as sol1.mat for storage purposes. We proceed in the same manner to solve the second single objective optimization problem: we create the input file Switch_HLex2_OBJ2.m in the library USR_inputs, as indicated in Fig. 10, modifying only the objective, inputs.design.objective = {'OF2_Cost'}. Once this file is created, we execute:
>>SYNBAD_Design_SO(Switch_HLex2_OBJ2)
to obtain the second extreme of the Pareto front (sol2 in Fig. 9), which we store as sol2.mat.
Fig. 10 SYNBADm design input file for example 2, single objective search for the extreme of the Pareto: Switch_HLex2_OBJ1.m
Now we are in a position to solve the bi-objective optimization problem. In the library USR_inputs, we create the input file Switch_HLex2.m (see Fig. 11). In this file we indicate the two objectives to optimize, inputs.modesign.objective1 and inputs.modesign.objective2. We also indicate the coordinates of the two extreme points of the Pareto front obtained as solutions of the single objective optimization problems, respectively, inputs_modesign_min_objective_1 and inputs_modesign_min_objective_2. Finally, we need to indicate the number of intervals of the ε-constraint, i.e. the number of intervals into which we divide the y-axis in the objective space (see Fig. 9). We then solve the bi-objective optimization problem by executing (from the main directory) the function:
>>SYNBAD_Design_MO(Switch_HLex2)
Before starting the optimization, SYNBADm indicates the approximate computation time (calculated as the maximum computation time selected by the user times the number of intervals considered for the ε-constraint strategy). Once the optimization has finished, the results are stored in the file RESULTS_MO_DESIGN.mat. Note that the design problem is bi-objective, and therefore we obtain a Pareto set of optimal solutions. In order to plot the Pareto front obtained, we call:
>>SYNBAD_Plot_Pareto(Switch_HLex2, RESULTS_MO_DESIGN)
The Pareto optimal front obtained from the available library of components is depicted in Fig. 12. The circuit P2 shows a good compromise between both objectives.
Fig. 11 SYNBADm design input file for example 2: Switch_HLex2.m
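The ε-constraint strategy used in this section can be sketched on a toy problem. In the Python example below, the candidate circuits are reduced to made-up (performance, cost) pairs, both to be minimized, and each constrained subproblem is solved by brute-force enumeration; in SYNBADm each subproblem is instead a mixed-integer optimization handled by the selected MINLP solver. The function name epsilon_constraint and all numerical values are hypothetical and are not the results of Fig. 12.

```python
# Sketch only: epsilon-constraint on a finite list of (objective1, objective2)
# pairs. Objective 1 is minimized while objective 2 is bounded by epsilon.

def epsilon_constraint(candidates, n_intervals):
    """Return the Pareto points found by sweeping the bound on objective 2."""
    sol1 = min(candidates, key=lambda c: (c[0], c[1]))  # extreme for objective 1
    sol2 = min(candidates, key=lambda c: (c[1], c[0]))  # extreme for objective 2
    o2_low, o2_up = sol2[1], sol1[1]
    front = []
    for k in range(n_intervals + 1):
        eps = o2_low + (o2_up - o2_low) * k / n_intervals
        feasible = [c for c in candidates if c[1] <= eps]
        best = min(feasible, key=lambda c: (c[0], c[1]))
        if best not in front:
            front.append(best)
    return sorted(front)


# Made-up (performance, cost) pairs for six hypothetical circuits.
toy_circuits = [(-1.00, 2500.0), (-0.95, 1800.0), (-0.90, 2000.0),
                (-0.85, 1300.0), (-0.80, 1600.0), (-0.75, 1000.0)]
pareto = epsilon_constraint(toy_circuits, n_intervals=5)
print(pareto)
```

Each of the six bounds (the two extremes plus four intermediate levels) costs one single objective optimization, which is why the expected run time scales with the number of intervals times the maximum time per optimization.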
6 Notes
1. Reactions Associated with the Biological Devices in a SYNBADm Library of Mass Action Type. The kinetic formalism is adapted and further extended from the work by Pedersen and Phillips [12]. Within this framework, all the reactions are endowed with mass action kinetics. Next we enumerate the reactions (and kinetic parameters) corresponding to the parts, combinations of parts, and devices of interest:
Fig. 12 Pareto front of solutions obtained in Example 2 (horizontal axis: Objective 1 (Performance); vertical axis: Objective 2 (Cost); selected solutions are shown with their circuit schemes and promoter-repressor matrices)
1. A promoter P1 negatively regulated by a protein T1:
$$P_1 + T_1 \underset{k_{bpt}}{\overset{k_{fpt}}{\rightleftharpoons}} P_1T_1 \xrightarrow{k_{transc}} P_1T_1 + mT_2 \qquad (3)$$
where P1 is the promoter, T1 is the repressor protein, P1T1 is the repressor–promoter complex, and mT2 is the mRNA of the transcribed protein. The parameters kfpt, kbpt, and ktransc are the protein-promoter binding rate constant, the protein-promoter unbinding rate constant, and the rate of transcription in the bound state.
2. A promoter P1 that is not regulated by any transcription factor:
$$P_1 \xrightarrow{k_{leak}} P_1 + mT_2 \qquad mT_2 \xrightarrow{k_{degpt}} \emptyset \qquad (4)$$
where kleak is the constitutive rate of transcription in the absence of transcription factors and kdegpt is the rate constant for the mRNA degradation.
3. A protein coding sequence T2 introduces the reactions of translation:
$$mT_2 \xrightarrow{k_{transl}} mT_2 + T_2 \qquad (5)$$
where ktransl is the rate constant corresponding to the translation of mRNA, and degradation:
$$T_2 \xrightarrow{k_{deg}} \emptyset \qquad mT_2 \xrightarrow{k_{degm}} \emptyset \qquad (6)$$
where kdeg and kdegm are the degradation rate constants of protein and mRNA, respectively.
4. The complete set of reactions for a device P1 T2 in the presence of the repressor protein T1:
$$P_1 + T_1 \underset{k_u}{\overset{k_b}{\rightleftharpoons}} P_1T_1 \xrightarrow{k_{tb}} P_1 + mT_2$$
$$mT_2 \xrightarrow{k_{dm}} \emptyset \qquad T_2 \xrightarrow{k_d} \emptyset \qquad mT_2 \xrightarrow{k_r} mT_2 + T_2 \qquad (7)$$
5. The presence of an external inducer binding repressor T1 adds the following reactions:
$$I + T_1 \underset{k_{ui1}}{\overset{k_{bi1}}{\rightleftharpoons}} IT_1 \xrightarrow{k_{di}} \emptyset$$
where kbi1, kui1, and kdi are the constants of binding, unbinding, and degradation of the inducer complex, respectively.
2. Reactions Associated with the Biological Devices in a SYNBADm Library of Hill Type. The kinetic formalism is adapted from [14] and further extended. Within this framework, the device P1 T2, where P1 is a promoter negatively regulated by a protein T1, is associated with the reactions:
$$P_1 \xrightarrow{r_t} P_1 + T_2 \qquad (8)$$
$$T_2 \xrightarrow{k_d} \emptyset \qquad (9)$$
The first reaction has Hill-type kinetics, with the rate rt of the lumped transcription and translation given by the expression:
$$r_t = \frac{\alpha_{p1}}{1 + K_{p1} T_1^{n}} \qquad (10)$$
where αp1 and Kp1 are constants associated with the promoter and repressor, respectively, and n is a Hill-like coefficient. The second reaction corresponds to the degradation of the protein T2 (with first order mass action kinetics). The presence of an external inducer binding repressor T1 will add the following reactions (with mass action kinetics):
$$I + T_1 \underset{k_{ui1}}{\overset{k_{bi1}}{\rightleftharpoons}} IT_1 \xrightarrow{k_{di}} \emptyset$$
where kbi1, kui1, and kdi are the constants of binding, unbinding, and degradation of the inducer complex, respectively.
Funding: This research was funded by the Spanish Ministry of Science, Innovation and Universities, project SYNBIOCONTROL (ref. DPI2017-82896-C2-2-R).
References
1. Marchisio MA, Stelling J (2009) Computational design tools for synthetic biology. Curr Opin Biotechnol 20(4):479–485
2. Rodrigo G, Landrain TE, Shen S, Jaramillo A (2013) A new frontier in synthetic biology: automated design of small RNA devices in bacteria. Trends Genet 29(9):529–536
3. Nielsen AAK, Der BS, Shin J, Vaidyanathan P, Paralanov V, Strychalski EA, Ross D, Densmore D, Voigt CA (2016) Genetic circuit design automation. Science 352(6281):aac7341
4. Otero-Muras I, Henriques D, Banga JR (2016) Synbadm: a tool for optimization-based automated design of synthetic gene circuits. Bioinformatics 32(21):3360–3362
5. Xiang Y, Dalchau N, Wang B (2018) Scaling up genetic circuit design for cellular computing: advances and prospects. Nat Comput 17(4):833–853
6. Otero-Muras I, Banga JR (2017) Automated design framework for synthetic biology exploiting Pareto optimality. ACS Synth Biol 6(7):1180–1193
7. Egea JA, Marti R, Banga JR (2010) An evolutionary method for complex-process optimization. Comput Oper Res 37:315–324
8. Exler O, Schittkowski K (2007) A trust region SQP algorithm for mixed-integer nonlinear programming. Optim Lett 1(3):269–280
9. Exler O, Antelo LT, Egea JA, Alonso AA, Banga JR (2008) A Tabu search-based algorithm for mixed-integer nonlinear problems and its application to integrated process and control system design. Comput Chem Eng 32(8):1877–1891
10. Schlueter M, Egea JA, Banga JR (2009) Extended ant colony optimization for non-convex mixed integer nonlinear programming. Comput Oper Res 36(7):2217–2229
11. Hansen P, Mladenovic N, Moreno-Perez JA (2010) Variable neighbourhood search: methods and applications. Ann Oper Res 175(1):367–407
12. Pedersen M, Phillips A (2009) Towards programming languages for genetic engineering of living cells. J R Soc Interface 6:S437–S450
13. Otero-Muras I, Banga JR (2016) Design principles of biological oscillators through optimization: Forward and reverse analysis. PLoS ONE 11(12):e0166867
14. Dasika MS, Maranas CD (2008) Optcircuit: an optimization based method for computational design of genetic circuits. BMC Syst Biol 2:24
15. Szekely P, Sheftel H, Mayo A, Alon U (2013) Evolutionary tradeoffs between economy and effectiveness in biological homeostasis systems. PLoS Comput Biol 9(8):e1003163
16. Otero-Muras I, Banga JR (2014) Multicriteria global optimization for biocircuit design. BMC Syst Biol 8:113
Chapter 5
Setting Up an Automated Biomanufacturing Laboratory
Marilene Pavan
Abstract
Laboratory automation is a key enabling technology for genetic engineering that can lead to higher throughput, more efficient and accurate experiments, better data management and analysis, a shorter DBT (Design, Build, and Test) cycle turnaround, increased reproducibility, and savings in lab resources. Choosing the correct framework among the many options available in terms of software, hardware, and the skills needed to operate them is crucial for the success of any automation project. This chapter explores the multiple aspects to be considered for the solid development of a biofoundry project, including available software and hardware tools, resources, strategies, partnerships, and collaborations in the field needed to speed up the translation of research results into solutions to important societal problems.
Key words Laboratory automation, Synthetic biology, Hardware, Software, Throughput, Machine learning, Liquid handling, Metabolic engineering, Standardization, Reproducibility
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_5, © Springer Science+Business Media, LLC, part of Springer Nature 2021
1 Introduction
Synthetic biology provides the opportunity to produce thousands of complex molecules and to address, in a sustainable way, key challenges of modern society ranging from agriculture and chemicals production to clean energy [1–3]. It brings together engineers and biologists to design and build synthetic genetic circuits that encode novel components, networks, and pathways, and to use these constructs to reprogram organisms. However, the execution of these experiments is still largely dependent on the artisanal work of skillful researchers. The low throughput of this approach limits the speed of the DBT (Design-Build-Test) cycle, and manual procedures involve human errors and a consequent lack of reproducibility. Indeed, almost 50% of the variance in human error can be explained by stress, repetition, fatigue, and work environment, all of which are strongly associated with traditional laboratory work [4]. Also, due to the complexity of biological networks and pathways, designing, building, testing, and replicating the large number of constructs needed to achieve optimal solutions for biological questions is virtually impossible when using manual techniques and procedures. As an example, it took over 10 years and more than 100 million dollars to develop the biosynthetic process for 1,3-propanediol [5–8]. The advancement of the manufacturing of chemicals using biological systems also depends on the development of new tools, techniques, and skills that can speed up work and provide new insights at the interface of synthetic biology, metabolic engineering, and automation, which will also be explored in this chapter [6].
Laboratory automation is a key enabling technology for synthetic biology. In fact, a recent survey showed that up to 89% of publications in the biomedical literature have some methods that are supported by existing commercial robotic labs [9]. Without automated, high-throughput pipelines, it is virtually impossible to achieve an efficient and effective design-build-test (DBT) cycle [1, 5, 10–13]. The concept of laboratory automation is not new: it had already become mainstream in the 1950s. In 1963 it was predicted that the market for clinical laboratory instruments would grow 20% annually, driven by recent advances in miniaturization and speed of biological sample analysis [14]. Modern automated systems involve intricate workflows, with different levels of integration of both software and hardware components. These can be off-the-shelf or customized, each with different specifications, costs, and skills needed to operate them. Laboratory automation can increase the throughput of the laboratory, free researchers from repetitive tasks, and allow them to dedicate themselves to more intellectually relevant activities, while at the same time avoiding injuries caused by repetitive procedures. It also reduces human error, allows procedures to be standardized, increases reproducibility and accuracy, saves money and time, and makes it possible to collect, store, retrieve, and analyze the large amount of scientific data being produced [15].
Although the many options available can easily fill the gaps in different setups and environments, from a researcher's standpoint it can be a burden to choose between different systems, vendors, and the skills needed to operate them. Failure to choose the appropriate system can lead to frustration and, consequently, to a barrier to adoption. Having a person fully dedicated to assessing the lab's current processes (and consequently their automation needs) is key to ensuring the success of the project and the adoption of automation in the lab routine. This includes prospecting for different solutions and partners based on those needs, and educating themselves and others in the lab on the new, automated processes. Similarly, it is crucial that the vendor or provider, for both software and hardware, is open to working with the lab to develop features in collaboration and to adapt their tools to the lab's needs [16]. Continuous team communication and training are also key to getting the team on board. One should be prepared to explain the benefits of adopting the new procedures and technologies, have a clear roadmap in hand, and have the team fully trained and supported during operations.
2 Identifying the Need for Automation
Biofoundries provide an integrated infrastructure to enable the rapid design, construction, and testing of genetically modified organisms [10]. As mentioned above, automation can lead to higher throughput, more efficient and accurate experiments, better data management and analysis, a shorter DBT turnaround, increased reproducibility, and savings in lab resources [12, 17, 18]. Poor recording of experiments, disparities in how research is conducted, and run-to-run differences, among other factors, make it impossible to reproduce experiments and results, resulting in a waste of time and money. For example, an analysis of past studies indicates that the cumulative (total) prevalence of irreproducible preclinical research exceeds 50%, resulting in approximately US$28 billion/year spent on preclinical research that is not reproducible, in the United States alone [19]. A typical automated lab workflow is presented below. Working out how the lab's research fits into each (or all) of these steps is key to identifying the need for automation in some or all of them. As a consequence, the whole process can be fully integrated, decreasing human intervention as much as possible, or semi-automated, which is usually more flexible and better suited to smaller labs and lower throughput [5].
2.1 Design
Software tools are available today to help with experimental design and with data management and analysis. They are fundamental in guiding the best strategy at both the abstract level (choosing the best genetic candidates to be tested) and the practical level (the DNA assembly strategy to be executed). Automated, well-informed design increases both the number of designs that can be generated and the speed at which they are generated, and it helps to narrow down the design space by prioritizing the best candidates to be built and tested, saving lab resources [20–24].
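As a rough illustration of what narrowing down the design space can mean in practice, the sketch below enumerates every combination of a few parts and keeps only the highest-scoring candidates. It is a minimal sketch under stated assumptions: the part names, the strength tables, and the scoring function are hypothetical placeholders and do not come from any of the design tools cited above.

```python
from itertools import product

# Hypothetical part libraries; names are placeholders, not catalogue IDs.
PROMOTERS = ["pA", "pB", "pC"]
RBS_SITES = ["rbs1", "rbs2"]
GENES = ["geneX", "geneY"]

# Assumed relative strengths, used only to build a toy score.
PROMOTER_STRENGTH = {"pA": 0.2, "pB": 0.9, "pC": 0.5}
RBS_STRENGTH = {"rbs1": 1.0, "rbs2": 0.4}


def enumerate_designs(promoters, rbs_sites, genes):
    """Return every promoter/RBS/gene combination as a dict."""
    return [
        {"promoter": p, "rbs": r, "gene": g}
        for p, r, g in product(promoters, rbs_sites, genes)
    ]


def toy_score(design):
    """Toy expression score: promoter strength times RBS strength."""
    return PROMOTER_STRENGTH[design["promoter"]] * RBS_STRENGTH[design["rbs"]]


def prioritize(designs, score, top_n):
    """Keep the top_n designs according to the supplied scoring function."""
    return sorted(designs, key=score, reverse=True)[:top_n]


designs = enumerate_designs(PROMOTERS, RBS_SITES, GENES)
shortlist = prioritize(designs, toy_score, top_n=4)

print(f"{len(designs)} possible designs, {len(shortlist)} selected to build")
for d in shortlist:
    print(d, round(toy_score(d), 2))
```

In a real pipeline the score would come from prior data or a predictive model rather than a fixed table. The structure stays the same: enumerate, score, and send only the shortlist to the build step.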
2.2 Build
Worklist-based instructions for liquid handlers are the most common approach in automated facilities. Usually a .csv file is generated, containing instructions on where to aspirate liquids (DNA parts, reagents, media, etc.) from and where to dispense them. Ideally, the build instructions should be linked to the design and sample inventory pipelines. As an example of the advantage of automating the build process in the lab, DNA assembly can
Marilene Pavan
consume 33–50% of a scientist's time and is generally inefficient [5, 12]. In addition, the multiple iterations and repetitions needed to achieve the correct construct and to introduce variability into the design space make it, potentially, the most tedious and error-prone procedure in a molecular biology lab, on top of the increase in costs.
2.3 Test
The test tools and equipment should link the experiments being performed in the lab with the respective constructs created by the design tool and assembled by the build pipeline. They also collect and store the data generated while testing the synthetic genetic circuits. Fragment analysis, sequencing, transcriptomics, fermentation, proteomics, and metabolomics can usually be automated, though analytical techniques remain among the most difficult steps to automate to date.
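The link between design, build, and test described above can be sketched with a worklist in which every transfer row carries the identifier of the construct it belongs to, so that test readings can later be mapped back to constructs. This is only an illustrative sketch: the column names, plate names, construct IDs, volumes, and readings are assumptions, and each liquid handler vendor defines its own worklist format.

```python
import csv
import io

# Hypothetical build plan: construct ID -> (source plate, source well, volume in uL).
BUILD_PLAN = {
    "C001": [("parts", "A1", 2.0), ("parts", "B1", 2.0), ("reagents", "A1", 6.0)],
    "C002": [("parts", "A1", 2.0), ("parts", "C1", 2.0), ("reagents", "A1", 6.0)],
}
# Destination well assigned to each construct on the assembly plate.
DEST_WELLS = {"C001": "A1", "C002": "A2"}

# Assumed column names; real worklist formats are vendor specific.
FIELDS = ["construct_id", "source_plate", "source_well",
          "dest_plate", "dest_well", "volume_ul"]


def make_worklist(build_plan, dest_wells, dest_plate="assembly"):
    """Expand the build plan into one row per liquid transfer."""
    rows = []
    for construct_id, transfers in build_plan.items():
        for source_plate, source_well, volume_ul in transfers:
            rows.append({
                "construct_id": construct_id,
                "source_plate": source_plate,
                "source_well": source_well,
                "dest_plate": dest_plate,
                "dest_well": dest_wells[construct_id],
                "volume_ul": volume_ul,
            })
    return rows


def to_csv(rows):
    """Serialize the worklist rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def link_results(dest_wells, readings):
    """Map readings keyed by well back to the construct in that well."""
    well_to_construct = {well: cid for cid, well in dest_wells.items()}
    return {well_to_construct[well]: value
            for well, value in readings.items() if well in well_to_construct}


worklist = make_worklist(BUILD_PLAN, DEST_WELLS)
worklist_csv = to_csv(worklist)
# Made-up plate reader values, for illustration only.
results = link_results(DEST_WELLS, {"A1": 1520.0, "A2": 310.0})
print(worklist_csv)
print(results)
```

Keeping the construct ID in every row is one simple way to preserve traceability; a sample inventory or LIMS would normally hold this mapping instead of an in-memory dictionary.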
2.4 Learn
Machine learning (ML) [25, 26] is being used to interpret the data generated by the test cycle. The vast amount of data generated by automated experiments is better suited to computer algorithms than to human minds. These algorithms can help to understand the behavior and predictability of genetic circuits, narrowing down the space of genetic circuits to be constructed, and speeding up and saving money across the whole process. In a study published in 2019, Opgenorth and colleagues [13] reported on the implementation of two DBT cycles to optimize 1-dodecanol production from glucose using 60 engineered Escherichia coli MG1655 strains. They used the data produced in the first DBT cycle to train several machine learning algorithms and to suggest protein profiles for the second DBT cycle that would increase production. These strategies resulted in a 21% increase in dodecanol titer in Cycle 2.
Finally, together with the information presented above, the following 10 questions should be asked when considering the need for automation: (1) Does the lab need more reproducible and accurate results? (2) Would the laboratory's research line benefit from increasing the throughput of experiments (more data generation leading to faster answers)? (3) Does the lab staff have the time and resources to train the lab students and employees, continuously, on automation? (4) Does the lab want, and have the time, to work with vendors to co-develop and adapt the protocols and processes to an automated setup? (5) Is it important for lab operations that students and employees stop performing some repetitive tasks (benefiting from walk-away time and avoiding injuries caused by repetitive tasks)? (6) How will the costs associated with automation (purchase, maintenance, consumables, training, co-development) impact the lab? (7) Does the lab need better documentation of the protocols being performed and the results being generated? (8) Are the lab protocols and experiments standard and frequent enough to benefit from an automated procedure? (9) Would it be beneficial to achieve financial stability in lab operations? (10) Does the lab have the required space for an automated system?
3 Strategy
When implementing automation for the first time, projects might fail to achieve the desired results. Failure and frustration can often be traced back to the following:
1. The scope of the work has been underestimated (at all levels of seniority involved). Properly managing and implementing any automation project usually requires considerable investment in time, training, and money. A written vision is useful and must be extensively communicated so that the purpose of the project is understood by everyone in the lab [27].
2. The selection of the appropriate approach to automation should be guided by a deep understanding of how the lab currently works (protocols, personnel, time, costs, throughput, frequency of the protocols being performed, how flexible they are, how the workflow conditions may change over time) and of the vision for where the laboratory is heading with automation adoption (how automation will change the work) [28].
3. Unsuccessful choices in system flexibility and data management systems, and a poor-quality partnership with vendors, also have a direct impact on the success of the automation project.
Some considerations that should be made by the person(s) responsible for the project, together with the lab group, are:
• Budget: Assess the available budget and communicate it clearly to all involved parties from the very beginning of the project. Take into consideration the costs of highly skilled scientists, potentially expensive reagents and labware, hardware and software, system maintenance, and necessary training. Vendors can help to analyze these costs, as can the appropriate internal departments in companies and universities (such as business development, finance, corporate relations, and project management). Talking with experts and independent consultants in laboratory automation can also help to understand the costs involved and avoid future surprises. Finally, a range of solutions of different sizes, costs, throughput, and flexibility is available today to choose from [29, 30].
• Expertise: In general, researchers who can navigate the biology, hardware, and software fields tend to learn automation faster and to teach others more easily. Still, the person(s) responsible for the project should ask for, and the vendors should provide, training and post-sale support to develop methods, and vendors should be ready to answer any questions in a timely manner. Knowing the skills and knowledge available in the lab is key to forecasting the amount of training and support that will be required to program and run the automated system as smoothly as possible.
• Protocols: The main aspect to consider here is how likely the conditions are to change for the protocols targeted for automation. The more they change, the more flexible and accommodating the system should be. This is important to consider when determining the system's flexibility and degree of integration (fully versus semi-automated workflow).
• Throughput: This is frequently the first question asked by all vendors. Having a clear throughput expectation per period (for example, the number of experiments, tests, strains, and compounds to be processed and the amount of data to be produced) is key to identifying the system that best suits the workflow.
• Software: Efficient collection, storage, and sharing of data and protocols are required to properly manage and interpret the large amount of data generated by an automated laboratory. Efficient and standardized data capture and analysis help to keep track of the predictability of biological behavior and the reproducibility of results, ultimately leading to better experimental design [22]. Software is important to provide experimental strategies [20] and robotic instructions [21], manage repositories of parts and reagents [31], integrate equipment and schedule protocols (developers include ThermoFisher, Biosero, HighRes, and Synthace), and store and interpret data and protocols [23, 32]. It is normal for an automated system to demand more than one software tool to cover all the items described above.
• Hardware: A thermocycler is itself a piece of automation equipment; manually operated, it is enough to run PCR (polymerase chain reaction) in most labs. However, as experimental throughput and complexity increase, a laboratory will increasingly need peripherals that integrate seamlessly with other equipment and with robotic arms, replacing human handling. Some of them (centrifuges, sealers, peelers, incubators, and thermocyclers, for example) might make sense if integrated with robotic arms to provide the walk-away component characteristic of many automated systems. Others, such as liquid handlers, vary in footprint and cost (examples include the Opentrons OT-2, Hamilton STAR, and Tecan Evo, among others) and in the sample volume transferred (Labcyte Echo, iDOT). Automated freezers offer advantages such as better control and storage of biological samples, which can be achieved regardless of the setup (integrated or standalone). To choose the correct individual piece of equipment, before deciding on capabilities, capacity, and integration, keep in mind: (1) which equipment the laboratory will use in standalone mode (manually operated), integrated with other equipment, or both; (2) which footprint, cost, and throughput the laboratory is aiming for; (3) the reasons for automation (walk-away component, avoiding repetitive tasks, increased throughput, cost per reaction). Figure 1 shows an example, at the University of Edinburgh, of a fully integrated automated system. Another integrated system, at LanzaTech Inc. (Fig. 2), is enclosed in an oxygen-free environment, which adds a different level of complexity, to manage anaerobic organisms.
• Building (infrastructure): Be sure to check with the person responsible for infrastructure in the department that the room where the system is going to sit meets the energy, water, and gas requirements, that the floor supports the weight of the system, and that the system follows all (bio)safety requirements.
4 Business Plan
A business plan is an effective tool not only to provide a complete, big-picture justification for the investment in automation but also to secure the capital and resources needed to run the facility [33]. Below are the key strategic elements that should be considered and presented in any plan, whether the biofoundry is located in a university or in a private company.
4.1 Funding: Government
Federal investments in synthetic biology research contribute to foundational knowledge and technological development to facilitate commercial applications in platform organisms, process pathways, and related biotechnologies [34]. In the United States, some of the most important federal funders include the National Science Foundation (NSF), Department of Energy (DOE), Department of Health & Human Services (HHS, including National Institutes of Health (NIH)), Department of Defense (DOD, including Defense
Fig. 1 Edinburgh Genome Foundry. The foundry has a complete, fully integrated automated system for the design, build, and test of hundreds of genetic variants per day
Fig. 2 Enclosed biofoundry at LanzaTech Inc., which allows anaerobic microorganisms to be modified and tested
Advanced Research Projects Agency (DARPA)), the Navy, the Air Force, Department of Agriculture (USDA), and National Aeronautics & Space Administration (NASA), each of them with their own scientific focus [1]. Funded efforts in biomanufacturing
automation for scientific research and partnerships in the United States include the Agile Biofoundry (DOE) and the Broad Institute (DARPA).
What to do: Review the programs available in your country that support the acquisition of new equipment, keep track of dates, participate in the agencies' informational calls, and submit your project with a clear indication, supported by consistent and reasonable data, of how the new technology will support and advance your research.
4.2 Funding: Fee-for-Service Model and Project Partnerships
Maintaining an automated system can be expensive. A fee-for-service model might be a good option to provide funding (in exchange for providing services to others) and to diversify the business (keeping your biofoundry busy while attracting new scientific partnerships). Examples of foundries located inside universities and working under the fee-for-service model are the Edinburgh Genome Foundry (https://www.genomefoundry.org/) and the DAMP Lab (https://www.damplab.org). Private companies include Transcriptic (https://www.transcriptic.com/) and Emerald (https://www.emeraldcloudlab.com/), which offer "cloud-based" labs. Another interesting and effective way to diversify and optimize your research pipeline is to establish partnerships with other companies, university groups, and national laboratories such as iBioFab [35], the London DNA Foundry (https://www.londonbiofoundry.org/), Ginkgo Bioworks (https://www.ginkgobioworks.com/), Zymergen (https://www.zymergen.com/), Amyris (https://amyris.com/), and the Agile Biofoundry (https://agilebiofoundry.org/), among others.
What to do: explore, inside your company or university, the possibilities for diversifying your business and obtaining more funding and scientific partnerships to help set up and support the biofoundry. Usually the finance department and departments related to Open Innovation, Industry Partnerships, Entrepreneurship, and Business Development are good places to start. They can help not only with ideas but also with all the paperwork necessary to put together the business plan. Also, talk with existing groups operating under one of the models above to learn from them. No less important, the balance between the lab's own research and experiments and those carried out for others must be very clearly defined.
4.3 Partnerships
Although more and more companies are offering lower-cost and more accessible lab automation solutions, partnerships with existing public or private biofoundries [10, 12] might provide a good solution for first-time users or for those already using automation in their lab but in need of different protocols and
methodologies to be developed and implemented. Startup companies, while evaluating where to spend their limited funds, can also benefit from these partnerships for their proof-of-concept experiments before deciding on an in-house automated system. Companies like Transcriptic (https://www.transcriptic.com/), Emerald (https://www.emeraldcloudlab.com/) [12], and GenoFab (https://genofab.com/) are cloud-based laboratories that can provide services that enable scale and efficiency in the discovery process. On the software side, companies such as TeselaGen (https://teselagen.com/), Benchling (https://www.benchling.com/), and Riffyn (https://riffyn.com/) offer on-demand projects and ready-to-use software solutions for scientific design, automation, and data analysis. Public and private biofoundries and biofoundry consortia such as the Global Biofoundry Alliance (https://www.biofoundries.org/) and the Agile Biofoundry (https://agilebiofoundry.org/) can also be great partners, as they bring together a vast number of contacts, resources, and expertise from all over the world [10]. Finally, vendors such as Thermo, Opentrons, HighRes, Molecular Devices, and Biosero are valuable partners as well and provide guidance based specifically on the client's needs and available resources. No matter the nature and goals of the partnership, communication is key. Limited freedom and flexibility can be barriers when setting up a partnership and contracting services, so it is very important to know the partner's capabilities and its flexibility to adapt to the researcher's protocols. Finally, another great source of knowledge and contacts is conferences (focused or not on biomanufacturing automation) such as SLAS (Society for Laboratory Automation and Screening), IWBMA (International Workshop in Biomanufacturing Automation), IWBDA (International Workshop on Bio-Design Automation), and SynBioBeta.
What to do: research potential partners and map their capabilities; set up calls and visits with them; explore partnership options; have in mind a clear process or service you want to develop, the outcomes you want to achieve, the timeline, and the resources available (personnel, space, budget); and be open to participating in conferences, co-developing protocols, and writing publications together with the partners.
4.4 Education
Sometimes the gains from an automated lab setup are less obvious than scaling up and speeding up experiments, but they are equally important. Through research carried out in an automated setup, a new generation of scientists is being trained at the interface of systems biology, synthetic biology, molecular biology, bioinformatics, strain engineering, and metabolic engineering, along with hardware and software engineering. These skill sets are critical in basic and applied R&D but rarely acquired in traditional research programs [1].
Companies like Opentrons, together with scientific competitions such as iGEM (https://igem.org), are actively promoting the automation of engineering biology throughout the competition. Initiatives such as STEM Pathways at Boston University (https://stempathways.org), the JBEI educational programs (http://www.jbei.org/education/), and the Earlham Institute (https://www.earlham.ac.uk/learning) are also actively helping to engage a new generation of scientists in the biomanufacturing field. Outreach activities that use an automated setup not only get a new generation of researchers interested in a specific scientific field but also promote the lab's research activities and capabilities to other groups, fostering collaborations and attracting talent. Finally, for some federal agencies, outreach programs are actually a requirement [36].
What to do: look for inspiration in existing programs inside and outside your institution to set up your own science outreach and training programs. It can be very useful to develop a system based on qualitative and quantitative metrics to track the effectiveness of the program.
4.5 System Maintenance and Personnel
Oftentimes, the use of grant funds to pay for system maintenance is not permitted. In this case, check with the finance department in the company or university for alternatives. Usually, the equipment comes with a one-year warranty contract, and an extension can be negotiated. It is important to involve the department responsible for purchases and partnerships in those negotiations. Preventive maintenance and service contracts for the period after the warranty expires should also be negotiated. Finally, consider learning the basics of system maintenance so that you do not have to rely heavily on maintenance services. Also be very clear in the business plan about how you plan to implement daily cleaning and maintenance procedures. Personnel are often the most expensive part of an automation project because it requires highly specialized scientists in biotechnology and software-related fields. Usually a scientist working on automation projects learns a programming language, works well in a diverse team, is a good problem solver, builds solid, long-term relationships with other laboratories, vendors, and partners, can understand and adapt scientific protocols to automated systems, helps with grant writing, and has great discipline in maintaining the system and documenting the research carried out there.
5 Exciting Ventures in the Automation Field: Enabling Technologies
Achieving the ideal bioengineering scenario where commercially viable biochemicals and pharmaceutical compounds, effective and fast medical diagnosis and treatment, effective climate change
solutions such as carbon recycling, and other solutions to societal problems can be translated faster from the laboratory to society requires the development of powerful enabling tools and methodologies related to the biomanufacturing field, some of which are described below.
5.1 Machine Learning
The design of scientific experiments, the generation of robotic instructions, the tracking of samples, experimental procedures, and results, and the analysis of the large volume of data produced by an automated framework are simply not possible without the support of a solid software structure. Private and public research groups are working not only to build this framework [20, 21, 23] but also to make it as flexible, user friendly, and smart as possible [37]. Machine learning algorithms also bring the promise of eliminating the historically nonintegrated, trial-and-error approach to constructing and testing synthetic pathways. Even with the improving capabilities in DNA synthesis and automation, which make it possible to build and test thousands of different genetic constructs, a more informed decision that narrows down the design space and optimizes the number of variants tested represents important savings in research costs and time. The availability of large amounts of high-quality data produced by automated facilities would enable computational biologists to produce robust theories [38], and the theory derived from these data sets would allow experimentalists to design better experiments and tackle questions of general relevance. In silico tools for the predictive design of microbial cell factories, for example, allow the optimal genetic combination in multigene pathways to be predicted for high producers. In the work published by Jervis and colleagues [26], the development and training of machine learning algorithms was shown to boost monoterpenoid production titers by over 60% while screening under 3% of a library, using an automated screening pipeline. For another good reference, see Opgenorth et al., 2019 [13]. Machine learning algorithms have also been used for drug discovery [39], cell image analysis [40], and metabolic flux analysis [41].
Enough resources should be dedicated to the unique opportunity to learn from the results generated by the Design, Build, and Test phases and to incorporating this new knowledge into subsequent cycles, in order to avoid obstacles to increased synthetic biology productivity [42].
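The idea of screening a small part of a library and letting a model prioritize the rest can be illustrated with a deliberately simple sketch. It uses a k-nearest-neighbors average, chosen here only because it fits in a few lines of standard-library Python; it is not the method used in the studies cited above. The features (assumed relative promoter and RBS strengths) and the titers are made-up numbers for illustration.

```python
import math

# Hypothetical results from a first DBT cycle:
# ((promoter strength, RBS strength), measured titer in arbitrary units).
TRAINING_DATA = [
    ((0.1, 0.1), 5.0),
    ((0.9, 0.1), 20.0),
    ((0.1, 0.9), 15.0),
    ((0.9, 0.9), 60.0),
]

# Hypothetical designs that have not been built yet.
CANDIDATES = [(0.8, 0.7), (0.2, 0.3), (0.4, 0.9)]


def knn_predict(train, query, k=2):
    """Predict a titer as the mean of the k closest training designs."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(titer for _, titer in nearest) / len(nearest)


def rank_candidates(train, candidates, k=2):
    """Return (predicted titer, design) pairs, best prediction first."""
    scored = [(knn_predict(train, c, k), c) for c in candidates]
    return sorted(scored, reverse=True)


ranked = rank_candidates(TRAINING_DATA, CANDIDATES)
for predicted, design in ranked:
    print(design, round(predicted, 1))
```

The ranked list is what feeds the next Design step: only the top candidates are built and tested, and their measurements are appended to the training data for the following cycle.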
5.2 Microfluidics and Cell-Free Systems
Engineering biological systems holds great potential to generate high-value compounds. However, the Design, Build, and Test (DBT) cycle involved in the process can be slow, laborious, expensive, and hard to automate and scale. In vitro prototyping and biomanufacturing offer a powerful technical solution for
automation and high-throughput screening, as they promise to be faster than microbial cell factories and are suitable for miniaturized hardware footprints and reaction volumes. In addition, crude cell extract is often easy and inexpensive to produce in any molecular biology laboratory, can be produced at scale [43], and can be stored for long periods [44]. Cell-free protein synthesis (CFPS)/Transcription-Translation (TX-TL) systems usually involve cell growth, lysis, and DNA extraction from the lysate, followed by the addition of NTPs, amino acids, an energy source and, finally, the DNA of interest [45–47]. A number of studies have demonstrated their efficacy for E. coli [48–50], yeast [51], and even nontraditional organisms [52, 53]. Cell-free systems have also been used for educational purposes [54], for studies requiring the construction of minimal cells, improving the understanding of biological processes with minimal metabolic interference [55, 56], and for prototyping pathways [52]. CFPS systems are also suitable for use in microfluidic chips [57]. Microfluidic platforms offer an alternative route to an automated pipeline at a lower price and footprint while maintaining throughput and reproducibility equivalent to those of liquid handling robotics. According to DeMello et al. 2019, "microfluidics describes the investigation of analytical systems that manipulate, process and control small volumes of fluids (typically on the picolitre to nanoliter scale)" [57]. Droplet microfluidics (in which very low volumes of reagents and biological material are encapsulated into monodisperse droplets), for example, has shown exceptional promise for biological experiments such as DNA assembly, transformation and transfection, cell culturing and sorting, and genetic circuit prototyping [58].
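To give a sense of the picolitre scale mentioned above, the following sketch computes the volume of a spherical droplet from its diameter, using V = πd³/6, and the number of such droplets that fit in one microlitre. The 50 µm diameter is an assumed example value, not a figure taken from the cited works.

```python
import math


def droplet_volume_pl(diameter_um):
    """Volume in picolitres of a spherical droplet of the given diameter (um)."""
    volume_um3 = math.pi / 6.0 * diameter_um ** 3  # sphere volume, V = pi * d^3 / 6
    return volume_um3 / 1000.0  # 1 pL equals 1000 cubic micrometres


def droplets_per_microlitre(diameter_um):
    """Number of droplets of the given diameter contained in 1 uL (1e6 pL)."""
    return 1e6 / droplet_volume_pl(diameter_um)


diameter = 50.0  # assumed droplet diameter in micrometres
print(f"{droplet_volume_pl(diameter):.2f} pL per droplet")
print(f"{droplets_per_microlitre(diameter):.0f} droplets per microlitre")
```

Under this assumption each droplet holds roughly 65 pL, so a single microlitre of reagent corresponds to on the order of fifteen thousand separate reactions.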
Microfluidic chips also offer great control over the experiments via the control of temperature, evaporation, droplet generation rate, size, and flow, addition of selection medium, oxygen ratio, and throughput [59, 60].
5.3 DNA Assembly and Strain Development
Standardization and modularity have the potential to make engineering biology more predictable while enabling the miniaturization and automation of DNA assembly methods. The first attempt in synthetic biology to introduce some degree of standardization was the implementation of the BioBrick standard, adopted by the iGEM competition [61]. Today, a number of modular, highly efficient, automation-friendly tools and methodologies are available for DNA construction, the most widely used being Gibson Assembly [62]. This one-step, isothermal, scarless, in vitro recombination approach uses exonuclease, DNA polymerase, and DNA ligase activities to join DNA fragments with appropriate overlaps. Another widely adopted tool, Golden Gate assembly [63], along with its variants such as Modular Cloning (MoClo) [64] and BASIC [65], among others [66–68],
takes advantage of the ability of Type IIS restriction enzymes to cut DNA outside their recognition site. In this way, short overhangs complementary to those of the neighboring parts can be introduced into the DNA sequence adjacent to the enzyme recognition site, and the parts are later joined by T4 DNA ligase. Several vendors sell Gibson and Golden Gate assembly enzyme mixes and offer online tools and instructions for properly designing oligos and experiments in general. Several reagents (oligos, gene fragments, competent cells) can also be delivered in SBS-standard plates. Ellis et al. have published a very complete review of a plethora of powerful DNA assembly methods [69].
5.4 Open Science
The more automation and throughput are introduced into the lab routine, the more biological material exchange and collaborations are likely to be needed. Exchanging biological material saves the time and money that would otherwise be spent resynthesizing and retesting existing, well-characterized DNA parts and strains. Material Transfer Agreements (MTAs) set out the legal terms and conditions under which researchers share biological materials, ensuring and respecting the rights of the creators and promoting safe practices and responsible research [70]. However, the process of obtaining an agreement can be very bureaucratic and time-consuming. Fortunately, there are initiatives such as the OpenMTA (https://www.openplant.org/openmta/), which relaxes restrictions on the redistribution and commercial use of biomaterials while supporting the practical realities of technology transfer by being flexible enough to accommodate the needs of different groups worldwide [70]. Widespread adoption of this system is highly desirable in order to accelerate and simplify the MTA process. In addition, community labs such as BioBlaze (https://www.bioblaze.org/), BioCurious (http://biocurious.org/), and GenSpace (https://www.genspace.org/); material exchange initiatives such as the OpenMTA and the Free Genes project (https://biobricks.org/freegenes/); outreach initiatives such as the Community Biotechnology Initiative (CBI) [71] and the iGEM competition; and low-cost robots like the Opentrons OT-2 (https://opentrons.com/) are facilitating, promoting, and democratizing access to automation.
5.5 Metrology and Standardization
Metrics, standards, and modularity are intrinsic characteristics of synthetic biology and make it possible to track and evaluate experimental reproducibility, cost, time, and efficiency.
Metrics and standards also allow genetic engineering automation to be critically analyzed against the same parameters, even to determine when automation is warranted, based on factors such as assembly methodology, protocol details, and number of samples [72, 73]. A
number of tools and steps can be used or taken into consideration to evaluate an automated system and to improve its reproducibility over time [74]. Evaluating these metrics collaboratively among different laboratories also provides valuable information on the robustness of the protocol, the methodology, and the system as a whole [17, 72]. A study carried out during the 2016 iGEM competition by 92 institutions around the world brilliantly showcases collaborative efforts in the metrology and standardization fields, with the objective, in this case, of tackling the lack of comparable units for measuring fluorescence. The participating groups measured fluorescence from E. coli transformed with three engineered test plasmids, plus positive and negative controls, using simple, low-cost unit calibration protocols designed for use with a plate reader and/or flow cytometer. The results of this project provided not only comparable units but also valuable information about data collection and processing, precision, and instances of protocol failure [75]. Recently, synthetic biologists have developed open-source software called SynBioHub that facilitates the sharing of information about engineered biological systems. By connecting to relevant repositories, the software allows users to browse, upload, and download data in various standard formats, regardless of their location or representation. SynBioHub also provides a central reference point for other resources to link to, delivering design information in a standardized format using the Synthetic Biology Open Language (SBOL), a data exchange standard for descriptions of genetic parts, devices, modules, and systems. The goals of this standard are to allow scientists to exchange designs of biological parts and systems, to facilitate the storage of genetic designs in repositories, and to facilitate the description of genetic designs in publications [76, 77].
The lab automation field itself developed a standard that today guides the whole lab automation industry, facilitating adoption and stimulating competition in the sector. It began with the microplate standard defined by the American National Standards Institute (ANSI) and the Society for Biomolecular Screening (SBS), now part of the Society for Laboratory Automation and Screening (SLAS); it is known as the SBS standard.
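One practical consequence of a shared microplate format is that well names can be converted to row and column indices in the same way for any compliant plate, which is what lets worklists and instrument software interoperate. The sketch below assumes the common 96-well (8 rows by 12 columns) and 384-well (16 rows by 24 columns) layouts; the function names are hypothetical and do not belong to any vendor API.

```python
import string

# Assumed layouts: number of wells -> (rows, columns).
PLATE_FORMATS = {96: (8, 12), 384: (16, 24)}


def well_to_index(well, wells=96):
    """Convert a well name such as 'B3' to zero-based (row, column) indices."""
    rows, cols = PLATE_FORMATS[wells]
    row = string.ascii_uppercase.index(well[0].upper())
    col = int(well[1:]) - 1
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError(f"{well} is outside a {wells}-well plate")
    return row, col


def index_to_well(row, col, wells=96):
    """Convert zero-based (row, column) indices back to a well name."""
    rows, cols = PLATE_FORMATS[wells]
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError(f"({row}, {col}) is outside a {wells}-well plate")
    return f"{string.ascii_uppercase[row]}{col + 1}"


def all_wells(wells=96):
    """List every well name in row-major order (A1, A2, ..., last well)."""
    rows, cols = PLATE_FORMATS[wells]
    return [index_to_well(r, c, wells) for r in range(rows) for c in range(cols)]


print(well_to_index("H12"))          # last well of a 96-well plate
print(index_to_well(15, 23, 384))    # last well of a 384-well plate
print(len(all_wells(96)), len(all_wells(384)))
```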
6 Conclusion
Automation of genetic circuit synthesis is a recurrent technical theme in scientific publications and roadmaps [78, 79] due to rapid advances in software, high-throughput analysis, and DNA assembly. Using automation to conduct large numbers of
experiments in parallel is making it increasingly possible to address biological system function from a digital rather than analogue perspective. By applying the engineering principles of characterization, standardization, and modularization to biological systems, allied to a range of innovations in miniaturization, automation, and metrology, predictability and development speed can be increased and costs reduced. Previously intractable challenges can be readdressed, and the potential to commercialize useful applications enhanced. By enabling concepts to be translated more rapidly and reliably into commercially viable processes, the cost of market entry may be reduced, competitiveness enhanced, and the delivery of benefits accelerated [79]. The development and adoption of automation-friendly enabling technologies and the collaboration between different groups and biofoundries have enabled the integration of software development, liquid handling robotics, and protocol development, empowering the design, build, test, and learn cycle in order to develop scalable new biotechnology applications. However, effective automation requires the proper skills, physical space, dedication, training, and funding to achieve scalability and reproducible results. Challenges might include the proper translation of a scientific protocol to an automated framework, adequate system flexibility, proper throughput forecasting, and an adequate software ecosystem for data management and analysis. Biomanufacturing automation brings the powerful capability to accelerate scientific discovery while developing and sharing standardized protocols and techniques, promoting training and education, developing and adopting metrology standards, decreasing costs and time, and fostering partnerships to ultimately deliver transformative technologies that address complex problems in a sustainable manner.
Acknowledgments
This work was supported by the U.S. Department of Energy, Office of Biological and Environmental Research in the DOE Office of Science [Grant Number DE-SC0018249].
References
1. Si T, Zhao H (2016) A brief overview of synthetic biology research programs and roadmap studies in the United States. Synth Syst Biotechnol 1(4):258–264
2. Khalil AS, Collins JJ (2010) Synthetic biology: applications come of age. Nat Rev Genet 11(5):367–379
3. Smanski MJ, Zhou H, Claesen J et al (2016) Synthetic biology to access and expand nature’s chemical diversity. Nat Rev Microbiol 14(3):135–149
4. Yeow JA, Ng PK, Tan KS et al (2014) Effects of stress, repetition, fatigue and work environment on human error in manufacturing industries. J Appl Sci 14(24):3464–3471. https://doi.org/10.3923/jas.2014.3464.3471
5. Chao R, Mishra S, Si T, Zhao H (2017) Engineering biological systems using automated biofoundries. Metab Eng 42:98–108
6. Studies L (2015) Industrialization of biology: a roadmap to accelerate the advanced manufacturing of chemicals
7. Nielsen J, Keasling JD (2016) Engineering cellular metabolism. Cell 164(6):1185–1197
8. Karim AS, Dudley QM, Jewett MC (2016) Cell-free synthetic systems for metabolic engineering and biosynthetic pathway prototyping. In: Wittmann C, Liao JC (eds) Industrial biotechnology. Wiley, Weinheim
9. Groth P, Cox J (2017) Indicators for the use of robotic labs in basic biomedical research: a literature analysis. PeerJ 5:e3997. https://doi.org/10.7717/peerj.3997
10. Hillson N, Caddick M, Cai Y et al (2019) Building a global alliance of biofoundries. Nat Commun 10:2040
11. Carbonell P, Jervis AJ, Robinson CJ et al (2018) An automated design-build-test-learn pipeline for enhanced microbial production of fine chemicals. Commun Biol 1:66. https://doi.org/10.1038/s42003-018-0076-9
12. Hayden EC (2014) The automated lab. Nature 516(7529):131–132
13. Opgenorth P, Costello Z, Okada T et al (2019) Lessons from two design-build-test-learn cycles of Dodecanol production in Escherichia coli aided by machine learning. ACS Synth Biol 8(6):1337–1351. https://doi.org/10.1021/acssynbio.9b00020
14. Olsen K (2012) The first 110 years of laboratory automation: technologies, applications, and the creative scientist. J Lab Autom 17(6):469–480. https://doi.org/10.1177/2211068212455631
15. Chapman T (2003) Lab automation and robotics: automation on the move. Nature 421(6923):661, 663, 665-6. https://doi.org/10.1038/421661a
16. Lundberg K (2012) Increase user adoption rates and realize a higher rate of return on your LIMS investment. GenomeWeb 1–8
17. Phillips P, Lithgow GJ, Driscoll M (2017) A long journey to reproducible results. Nature 548:387–388
18. Teytelman L (2018) No more excuses for non-reproducible methods. Nature 560(7719):411
19. Freedman LP, Cockburn IM, Simcoe TS (2015) The economics of reproducibility in preclinical research. PLoS Biol 13(6):e1002165. https://doi.org/10.1371/journal.pbio.1002165
20. Densmore DM, Bhatia S (2014) Bio-design automation: software + biology + robots. Trends Biotechnol 32:111–113
21. Hillson NJ, Rosengarten RD, Keasling JD (2012) J5 DNA assembly design automation software. ACS Synth Biol 1(1):14–21. https://doi.org/10.1021/sb2000116
22. Morrell WC, Birkel GW, Forrer M et al (2017) The experiment data depot: a web-based software tool for biological experimental data storage, sharing, and visualization. ACS Synth Biol 6(12):2248–2259. https://doi.org/10.1021/acssynbio.7b00204
23. Nielsen AAK, Der BS, Shin J et al (2016) Genetic circuit design automation. Science 352(6281):aac7341. https://doi.org/10.1126/science.aac7341
24. Appleton E, Densmore D, Madsen C, Roehner N (2017) Needs and opportunities in bio-design automation: four areas for focus. Curr Opin Chem Biol 40:111–118
25. Costello Z, Martin HG (2018) A machine learning approach to predict metabolic pathway dynamics from time-series multiomics data. NPJ Syst Biol Appl 4:19. https://doi.org/10.1038/s41540-018-0054-3
26. Jervis AJ, Carbonell P, Vinaixa M et al (2019) Machine learning of designed translational control allows predictive pathway optimization in Escherichia coli. ACS Synth Biol 8(1):127–136. https://doi.org/10.1021/acssynbio.8b00398
27. Hale AN (1999) 5 Building realistic automated production lines for genetic analysis. In: Craig AG, Hoheisel JD (eds) Methods in microbiology. Academic Press, San Diego
28. O’Sullivan B (2019) Points to consider when planning for lab automation projects. In: HighRes Bio. https://highresbio.com/blog/points-to-consider-when-planning-for-lab-automation-projects/
29. Opentrons (2019) Guide to choosing a lab automation platform. In: Opentrons. https://insights.opentrons.com/the-automated-pipetting-revolution-is-here
30. Butler JM (2012) New technologies and automation. In: Advanced topics in forensic DNA typing. Elsevier Academic Press, San Diego
31. Ham TS, Dmytriv Z, Plahar H et al (2012) Design, implementation and practice of JBEI-ICE: an open source biological part registry platform and tools. Nucleic Acids Res 40(18):e141. https://doi.org/10.1093/nar/gks531
32. Oberortner E, Cheng JF, Hillson NJ, Deutsch S (2017) Streamlining the design-to-build transition with build-optimization software tools. ACS Synth Biol 6(3):485–496. https://doi.org/10.1021/acssynbio.6b00200
33. Cohen L (2019) Writing your business plan. Nat Biotechnol 20(Suppl):BE33–BE35. https://doi.org/10.1038/nbt0602suppBE33
34. Clark DP, Pazdernik NJ (2016) Synthetic biology: report to congress 2013. Biotechnology 419–445. https://doi.org/10.1016/B978-0-12-385015-7.00013-2
35. Chao R, Liang J, Tasan I et al (2017) Fully automated one-step synthesis of single-transcript TALEN pairs using a biological foundry. ACS Synth Biol 6:678–685. https://doi.org/10.1021/acssynbio.6b00293
36. NSF Broader impacts review criterion. https://www.nsf.gov/pubs/2007/nsf07046/nsf07046.jsp
37. Segal M (2019) An operating system for the biology lab. Nature 573(7775):S112–S113
38. Carbonell P, Radivojevic T, García Martín H (2019) Opportunities at the intersection of synthetic biology, machine learning, and automation. ACS Synth Biol 8(7):1474–1477. https://doi.org/10.1021/acssynbio.8b00540
39. Lima AN, Philot EA, Trossini GHG et al (2016) Use of machine learning approaches for novel drug discovery. Expert Opin Drug Discov 11(3):225–239
40. Kan A (2017) Machine learning applications in cell image analysis. Immunol Cell Biol 95(6):525–530
41. Ghosh A, Ando D, Gin J et al (2016) 13C metabolic flux analysis for systematic metabolic engineering of S. cerevisiae for overproduction of fatty acids. Front Bioeng Biotechnol 4:76. https://doi.org/10.3389/fbioe.2016.00076
42. Lawson CE, Harcombe WR, Hatzenpichler R et al (2019) Common principles and best practices for engineering microbiomes. Nat Rev Microbiol 17(12):725–741. https://doi.org/10.1038/s41579-019-0255-9
43. Kwon YC, Jewett MC (2015) High-throughput preparation methods of crude extract for robust cell-free protein synthesis. Sci Rep 5:8663. https://doi.org/10.1038/srep08663
44. Karim AS, Jewett MC (2018) Cell-free synthetic biology for pathway prototyping. Methods Enzymol 608:31–57
45. Perez JG, Stark JC, Jewett MC (2016) Cell-free synthetic biology: engineering beyond the cell. Cold Spring Harb Perspect Biol 8(12):a023853. https://doi.org/10.1101/cshperspect.a023853
46. Garamella J, Marshall R, Rustad M, Noireaux V (2016) The all E. coli TX-TL toolbox 2.0: a platform for cell-free synthetic biology. ACS Synth Biol 5(4):344–355. https://doi.org/10.1021/acssynbio.5b00296
47. Gregorio NE, Levine MZ, Oza JP (2019) A user’s guide to cell-free protein synthesis. Methods Protoc 2:24. https://doi.org/10.3390/mps2010024
48. Kay JE, Jewett MC (2015) Lysate of engineered Escherichia coli supports high-level conversion of glucose to 2,3-butanediol. Metab Eng 32:133–142. https://doi.org/10.1016/j.ymben.2015.09.015
49. Rustad M, Eastlund A, Marshall R et al (2017) Synthesis of infectious bacteriophages in an E. coli-based cell-free expression system. J Vis Exp (126):56144. https://doi.org/10.3791/56144
50. Dudley QM, Nash CJ, Jewett MC (2019) Cell-free biosynthesis of limonene using enzyme-enriched Escherichia coli lysates. Synth Biol 4(1):ysz003. https://doi.org/10.1093/synbio/ysz003
51. Schoborg JA, Clark LG, Choudhury A et al (2016) Yeast knockout library allows for efficient testing of genomic mutations for cell-free protein synthesis. Synth Syst Biotechnol 1:2–6. https://doi.org/10.1016/j.synbio.2016.02.004
52. Karim AS, Dudley QM, Juminaga A et al (2019) In vitro prototyping and rapid optimization of biosynthetic enzymes for cellular design. bioRxiv. https://doi.org/10.1101/685768
53. Moore SJ, MacDonald JT, Wienecke S et al (2018) Rapid acquisition and model-based analysis of cell-free transcription–translation reactions from nonmodel bacteria. Proc Natl Acad Sci U S A 115(19):E4340–E4349. https://doi.org/10.1073/pnas.1715806115
54. Huang A, Nguyen PQ, Stark JC et al (2018) Biobits™ explorer: a modular synthetic biology education kit. Sci Adv 4(8):eaat5105. https://doi.org/10.1126/sciadv.aat5105
55. Jewett MC, Forster AC (2010) Update on designing and building minimal cells. Curr Opin Biotechnol 21(5):697–703
56. Caschera F, Noireaux V (2016) Compartmentalization of an all-E. coli cell-free expression system for the construction of a minimal cell. Artif Life 22(2):185–195
57. Gulati S, Rouilly V, Niu X et al (2009) Opportunities for microfluidic technologies in synthetic biology. J R Soc Interface 6:S493–S506
58. Gach PC, Iwai K, Kim PW et al (2017) Droplet microfluidics for synthetic biology. Lab Chip 17:3388–3400
59. Gach PC, Shih SCC, Sustarich J et al (2016) A droplet microfluidic platform for automating genetic engineering. ACS Synth Biol 5(5):426–433. https://doi.org/10.1021/acssynbio.6b00011
60. Lashkaripour A, Rodriguez C, Ortiz L, Densmore D (2019) Performance tuning of microfluidic flow-focusing droplet generators. Lab Chip 19(6):1041–1053. https://doi.org/10.1039/C8LC01253A
61. Shetty RP, Endy D, Knight TF (2008) Engineering BioBrick vectors from BioBrick parts. J Biol Eng 2:5. https://doi.org/10.1186/1754-1611-2-5
62. Gibson DG, Young L, Chuang RY et al (2009) Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat Methods 6(5):343–345. https://doi.org/10.1038/nmeth.1318
63. Engler C, Kandzia R, Marillonnet S (2008) A one pot, one step, precision cloning method with high throughput capability. PLoS One 3(11):e3647. https://doi.org/10.1371/journal.pone.0003647
64. Weber E, Engler C, Gruetzner R et al (2011) A modular cloning system for standardized assembly of multigene constructs. PLoS One 6(2):e16765. https://doi.org/10.1371/journal.pone.0016765
65. Storch M, Casini A, Mackrow B et al (2015) BASIC: a new biopart assembly standard for idempotent cloning provides accurate, single-tier DNA assembly for synthetic biology. ACS Synth Biol 4(7):781–787. https://doi.org/10.1021/sb500356d
66. Lai HE, Moore S, Polizzi K, Freemont P (2018) EcoFlex: a multifunctional moclo kit for E. coli synthetic biology. Methods Mol Biol 1772:429–444
67. Iverson SV, Haddock TL, Beal J, Densmore DM (2016) CIDAR MoClo: improved MoClo assembly standard and new E. coli part library enable rapid combinatorial design for synthetic and traditional biology. ACS Synth Biol 5(1):99–103. https://doi.org/10.1021/acssynbio.5b00124
68. Sarrion-Perdigones A, Falconi EE, Zandalinas SI et al (2011) GoldenBraid: an iterative cloning system for standardized assembly of reusable genetic modules. PLoS One 6(7):e21622. https://doi.org/10.1371/journal.pone.0021622
69. Casini A, Storch M, Baldwin GS, Ellis T (2015) Bricks and blueprints: methods and standards for DNA assembly. Nat Rev Mol Cell Biol 16(9):568–576
70. Kahl L, Molloy J, Patron N et al (2018) Opening options for material transfer. Nat Biotechnol 36(10):923–927
71. Kong DS, Thorsen TA, Babb J et al (2017) Open-source, community-driven microfluidics with metafluidics. Nat Biotechnol 35(6):523–529
72. Walsh DI, Pavan M, Ortiz L et al (2019) Standardizing automated DNA assembly: best practices, metrics, and protocols using robots. SLAS Technol 24(3):282–290. https://doi.org/10.1177/2472630318825335
73. Ortiz L, Pavan M, McCarthy L et al (2017) Automated robotic liquid handling assembly of modular DNA devices. J Vis Exp (130):54703. https://doi.org/10.3791/54703
74. Jessop-Fabre MM, Sonnenschein N (2019) Improving reproducibility in synthetic biology. Front Bioeng Biotechnol 7:18. https://doi.org/10.3389/fbioe.2019.00018
75. Beal J, Haddock-Angelli T, Baldwin G et al (2018) Quantification of bacterial fluorescence using independent calibrants. PLoS One 13(6):e0199432. https://doi.org/10.1371/journal.pone.0199432
76. Madsen C, McLaughlin JA, Mısırlı G et al (2016) The SBOL stack: a platform for storing, publishing, and sharing synthetic biology designs. ACS Synth Biol 5(6):487–497. https://doi.org/10.1021/acssynbio.5b00210
77. McLaughlin JA, Myers CJ, Zundel Z et al (2018) SynBioHub: a standards-enabled design repository for synthetic biology. ACS Synth Biol 7(2):682–688. https://doi.org/10.1021/acssynbio.7b00403
78. Bioeconomy G (2019) A research roadmap for the next-generation bioeconomy
79. Clarke LJ, Kitney RI (2016) Synthetic biology in the UK – an outline of plans and progress. Synth Syst Biotechnol 1(4):243–257
Chapter 6
Computer-Aided Design and Pre-validation of Large Batches of DNA Assemblies
Valentin Zulkower
Abstract
Type-2S restriction enzymes allow the routine assembly of large batches of synthetic constructs from individual genetic parts. However, design flaws in the part sequence can cause assembly failures, incurring troubleshooting costs and project delays. As a result, the careful design and checking of the assembly plan is often a bottleneck of large assembly projects, and may require computational support. This chapter demonstrates the use of two free and open-source web applications accelerating this task by automating genetic part design and simulating type-2S cloning to detect potential assembly issues.
Key words Computer-aided design, Computer-aided manufacturing, DNA assembly, Synthetic Biology
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_6, © Springer Science+Business Media, LLC, part of Springer Nature 2021
1 Introduction
Advances in DNA synthesis technologies and robotics over the past two decades have significantly reduced the costs and completion times of Synthetic Biology projects [1, 2]. In particular, various methods relying on type-2S restriction enzymes (which can cleave DNA outside of their recognition site) enable the assembly of reusable genetic parts into DNA constructs ranging typically from 2000 to 20,000 nucleotides in size [3–5]. However, these methods require the sequence of each genetic part to be standardized, which may involve the removal of internal type-2S restriction sites and the addition of flanking restriction sites determining the part’s relative position in the assembled construct. Design flaws at this stage can result in assembly failure or artifacts which are often long and costly to troubleshoot. This chapter presents software solutions to help suppress human error in part standardization, and ensure that the final DNA sequences conform to the researcher’s expectations. We describe two web applications routinely used at the Edinburgh Genome Foundry (EGF) to design large projects involving hundreds of different genetic parts and constructs, and released as part of the EGF’s collection of public web applications (https://cuba.genomefoundry.org). The first application streamlines the standardization of large sets of genetic parts with respect to a user-selected assembly standard. The second application uses cloning simulation to predict final assembly sequences and detect flawed assembly plans. Both applications are free and open-source, and rely on open-source computational libraries developed at the EGF (https://edinburgh-genome-foundry.github.io/).
2 Batch Part Standardization
Several assembly standards based on type-2S restriction enzymes have been proposed over the last decade, including MoClo [6], YeastFab [7], and EMMA [8]. These standards differ in their choice of restriction enzyme(s) and assembly overhangs. As a consequence, each standard enforces a specific set of sequence design rules to ensure that its genetic parts can be properly assembled together, and that the resulting constructs have the expected biological function. We will first see an example of part standardization “by hand,” before showing how this can be automated for larger batches using a dedicated web application.
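As a preview of the manual procedure outlined in Subheading 2.1, the sequence edits for an EMMA p9 part can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used by the EGF application: the function names, the default spacer nucleotide "A" (the "N" of step 6), and the optional `padding` argument (the extra base pairs of step 7) are hypothetical choices, and the sketch only detects internal BsmBI sites rather than removing them.

```python
BSMBI_SITE = "CGTCTC"  # BsmBI recognition site quoted in Subheading 2.1


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def find_sites(seq, site=BSMBI_SITE):
    """Return the 0-based start positions of `site` on either strand."""
    seq = seq.upper()
    patterns = {site, reverse_complement(site)}
    return [i for i in range(len(seq) - len(site) + 1)
            if seq[i:i + len(site)] in patterns]


def standardize_emma_p9(cds, spacer="A", padding=""):
    """Wrap a coding sequence for EMMA position p9 (hypothetical helper).

    Adds the in-frame "CA" dimer, the GCGT/TGCT overhangs, and the
    flanking BsmBI sites. Internal sites are reported, not fixed.
    """
    cds = cds.upper()
    internal = find_sites(cds)
    if internal:
        raise ValueError("internal BsmBI site(s) at positions %s" % internal)
    core = "GCGT" + "CA" + cds + "TGCT"
    left_flank = BSMBI_SITE + spacer               # CGTCTCN
    right_flank = reverse_complement(left_flank)   # NGAGACG
    part = padding + left_flank + core + right_flank + padding
    if len(find_sites(part)) != 2:
        raise ValueError("flanking sequences created an extra BsmBI site")
    return part
```

Calling `standardize_emma_p9` on a coding sequence that still contains CGTCTC or GAGACG raises an error, which mirrors the situation where codon juggling is needed before the flanks can be added.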
2.1 Manual Standardization of a Genetic Part (Outline)
Here, we detail the steps involved in standardizing a Green Fluorescent Protein (GFP) sequence for use at position “p9” of the EMMA assembly standard, which makes it possible to express other proteins (placed at position “p7”) with a downstream GFP fusion, via a peptide chain in position “p8.”
1. Obtain a GFP-encoding nucleotide sequence, e.g., from the NCBI website (https://www.ncbi.nlm.nih.gov/nuccore/L29345.1).
2. Open the sequence in the editor of your choice. Sequence files in text or FASTA format can be opened in any text editor, while files in Genbank format require specialized software such as Benchling (https://www.benchling.com/) or Snapgene (https://www.snapgene.com).
3. Add the two-nucleotide sequence “CA” at the beginning of the sequence in order to make the sequence compatible with position p9 of the EMMA standard. This dimer will complete position p9’s left overhang GCGT to form the assembly scar GCGTCA, encoding a short Alanine–Serine peptide chain. Omitting the addition of “CA” will result in an out-of-frame GFP sequence and a biologically dysfunctional protein.
4. Add GCGT and TGCT on the left and right sides of the sequence, respectively. These will be the sequences of the part’s “sticky ends” after cleavage of the DNA by BsmBI, and will anneal with parts for position 8 on the left and position 10 on the right.
5. Ensure that neither the recognition sequence of the BsmBI enzyme (CGTCTC) nor its reverse complement (GAGACG) appears in the sequence. If the parts are intended to be used in hierarchical (two-step) assemblies, then the BsaI recognition site should also be removed from the sequence. Use synonymous codon juggling to preserve the protein’s sequence while removing restriction sites [9].
6. Add the sequence CGTCTCN to the left of the sequence (where CGTCTC is a BsmBI recognition site and N is a nucleotide of your choice), and add the reverse complement NGAGACG on the right.
7. If the sequence is to be ordered from a commercial provider as linear DNA, add around 5 base pairs on each end of the sequence (see https://international.neb.com/tools-and-resources/usage-guidelines/cleavage-close-to-the-end-of-dna-fragments).
8. Order a DNA preparation of the resulting sequence.
For projects involving a large number of parts, manual standardization can be time consuming and error-prone. The next sections describe the usage of the web application developed at the EGF to automate these steps and apply an assembly standard’s rules to large batches of genetic parts at once.
2.2 Preparing the Necessary Data Files
1. Specify the assembly standard by creating a spreadsheet (using Microsoft Excel or LibreOffice) on the model of Fig. 1a. Some standards may not have dedicated names for the different part positions, in which case arbitrary position names can be chosen by the user, as long as the specified overhangs comply with the standard. Save the spreadsheet in Excel format (.xls or .xlsx) or CSV format (.csv). The resulting file will be referred to as the Standard Definition File.
2. Make sure that each part to be processed is named after the template “POSITION_part-name,” where the POSITION attribute refers to the part position, as defined in the Standard Definition File. For instance, a GFP sequence to be standardized for EMMA’s p9 position should be named “p9_GFP.”
3. Gather the sequences of all parts to be processed in a single file, which can be either (1) a Fasta file in the format shown in Fig. 1c, or (2) a zip file containing the sequences as separate files in the Genbank format, each file named after the part sequence it provides, for instance “p9_GFP.gb.” This file will be referred to as the Sequences File.
Fig. 1 Input and output of the web-based part standardization application. (a) User-created spreadsheet defining the assembly standard to follow. (b) Screenshot of the web application showing the web form in its entirety. (c) Sample files from the part standardization report: PDF summary of the report (front) and Fasta file listing all standardized sequences for ordering from a DNA synthesis company
If some genetic part regions must be protected against arbitrary modifications (such as promoter regions or coding sequences), then these parts should be provided in Genbank format, with annotations indicating design constraints, as explained in the next section.
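Before uploading a Sequences File in Fasta format, it can help to confirm that every record follows the “POSITION_part-name” template. The following is a minimal sketch of such a check, assuming a plain multi-record Fasta text; the function names and the example position names are hypothetical, and the set of valid positions must be taken from the user's own Standard Definition File.

```python
def read_fasta(text):
    """Parse Fasta text into a {record name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the part name.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def check_part_names(records, positions):
    """Return {part name: problem} for names not matching POSITION_part-name."""
    problems = {}
    for name in records:
        position, separator, part_name = name.partition("_")
        if not separator or not part_name:
            problems[name] = "name does not follow POSITION_part-name"
        elif position not in positions:
            problems[name] = "unknown position %r" % position
    return problems
```

An empty dictionary returned by `check_part_names` means that all record names can be matched to a position of the standard.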
2.3 Protecting some Part Regions against Modifications
1. Before adding the part’s corresponding Genbank file to the zip archive, open the Genbank file in a sequence editor, for instance, the free software Snapgene Viewer (snapgene.com) or Benchling (benchling.com).
2. Locate a sequence region which should be protected against modifications and add an annotation at this location (the method for which may vary from one sequence editor to another). The Genbank type of the annotation should be “misc_feature,” and the label of the annotation should be either “@keep” (to forbid any mutation in the region) or “@cds” (to allow codon juggling only, i.e., mutations that do not change the translated protein sequence). Note that many more design constraints are available, as listed in the documentation of the underlying sequence optimizer DNA Chisel (https://edinburgh-genome-foundry.github.io/DnaChisel).
3. Save the resulting Genbank record to a file (e.g., “p9_GFP.gb”) and add the file to the zip archive.
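To make the difference between the two labels concrete, the sketch below checks whether an edited sequence still respects a “@keep” or “@cds” annotation. It is not DNA Chisel's implementation; it is a simplified illustration that assumes the annotated region lies on the forward strand, starts in frame, and is edited by substitutions only (so coordinates are unchanged). The helper names are hypothetical, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate an in-frame coding sequence, ignoring trailing bases."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def respects_label(original, edited, start, end, label):
    """Check the region [start, end) of an edited sequence against a label."""
    before, after = original[start:end], edited[start:end]
    if label == "@keep":
        # No mutation at all is allowed in the region.
        return before == after
    if label == "@cds":
        # Only synonymous changes (codon juggling) are allowed.
        return len(before) == len(after) and translate(before) == translate(after)
    raise ValueError("unsupported label: %r" % label)
```

For example, replacing the arginine codon CGT by CGC in `ATGCGTCTCAAA` removes the BsmBI site CGTCTC while leaving the translation unchanged, so the edit is acceptable under “@cds” but not under “@keep.”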
2.4 Using the Web Application
1. With the web browser of your choice (we recommend a modern version of Google Chrome or Firefox), connect to the application at the following address: https://cuba.genomefoundry.org/domesticate_part_batches.
2. The application consists of a one-page form shown in Fig. 1b. In the rest of this protocol, letters in parentheses (a), (b), etc. refer to annotations in this figure.
3. Enter the name of the assembly standard used in (a). This information is optional and only used for reference in the report produced by the application.
4. Drag and drop the Standard Definition File in the upload box (b).
5. Drag and drop the Sequences File in the upload box (c).
6. When using Genbank records, if the name of the different parts is provided in the file name (e.g., “p9_GFP.gb”) rather than in the Genbank’s metadata, set the selection menu in (d) to the “Use file names as parts IDs” option.
7. Tick the checkbox (e) to allow sequence edits. If the box is left unticked and some of the provided sequences cannot be standardized without sequence modifications, the standardization of these parts will fail, and the failures will be signaled in the resulting report with an indication for troubleshooting. If the box is ticked, make sure that sensitive elements have been protected as explained in the previous section.
8. Click on the “Domesticate” button (f) to start the standardization of the parts. This will take a few seconds to a few minutes depending on the number of parts to process (a progress bar will be displayed).
2.5 Output
1. As the automated standardization process ends, a button marked “Download Report” appears below the form. Clicking the button will download a multi-file zipped report (the Standardization Report).
2. Open Report.pdf, located at the root of the Standardization Report, and review, for each part of the batch, the domestication variant used (to check that each part has indeed been standardized for the intended position), as well as the number of nucleotides added or modified during the standardization.
3. Parts for which standardization failed due to unsatisfiable constraints will be indicated by messages both in the web interface and in the PDF report. Refer to the subfolder “error_reports” of the Standardization Report for more information, in particular the location of the problematic regions of the parts.
4. The subfolder “sequences_to_order” in the Standardization Report contains the sequences of all domesticated parts, in three formats (CSV, Excel, and FASTA). These files can be uploaded on the website of commercial providers (which will generally accept one of these formats) to order DNA preparations of the standardized parts. However, before any DNA ordering, it is recommended to check the overall validity of the assembly plan, as detailed in the next section.
3 Batch Type-2S Assembly Pre-validation Via Cloning Simulation
Type-2S assembly protocols typically consist of mixing standard genetic parts together with a restriction enzyme (to produce linear parts with single-stranded overhangs) and a ligase (to assemble complementary parts into a circular plasmid). An assembly plan can be defined as the set of parts that must be mixed together to obtain each desired construct. Despite the apparent simplicity of the task, errors can be introduced during the writing of an assembly plan. Some advanced standards, such as EMMA, allow the assembly of up to 25 parts at once, and assemblies of fewer than 25 parts can be created by inserting biologically neutral “connector parts” to cover unused positions. While offering more freedom to the designer, such a standard is also more complex to use. Simple mistakes, such as a flawed part standardization, the omission of a part in the assembly plan, or the omission of a connector, can lead to the failure of multiple assemblies at once, yielding either unexpected sequences of DNA constructs, or a total absence of viable clones at the end of the assembly protocol. In both cases, any design flaw uncovered at this stage requires the design and ordering of new genetic parts, delaying the project by months and adding thousands of dollars to its budget.
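The restriction and ligation steps described above can be mimicked in silico with very little code. The sketch below is a deliberately simplified illustration and not the algorithm of the EGF application: it assumes linear parts flanked by one forward and one reverse BsmBI site (in the CGTCTCN / NGAGACG layout used for standardized parts), 4-nucleotide overhangs, and ligation in the forward orientation only. The function names are hypothetical.

```python
BSMBI_SITE = "CGTCTC"
OVERHANG = 4  # length of the single-stranded overhangs left by BsmBI


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def digest(seq, site=BSMBI_SITE):
    """Return the fragment released from a part, overhangs included."""
    seq = seq.upper()
    i = seq.find(site)
    j = seq.rfind(reverse_complement(site))
    if i == -1 or j == -1 or j < i:
        raise ValueError("expected a forward and a reverse site around the part")
    # The enzyme cuts one nucleotide away from its recognition site.
    return seq[i + len(site) + 1:j - 1]


def assemble(fragments):
    """Ligate fragments by matching overhangs; return the circular sequence."""
    remaining = list(fragments)
    first = current = remaining.pop(0)
    # Each fragment contributes everything but its right overhang,
    # which is provided by the left overhang of the next fragment.
    construct = current[:-OVERHANG]
    while remaining:
        matches = [f for f in remaining if f[:OVERHANG] == current[-OVERHANG:]]
        if len(matches) != 1:
            raise ValueError("%d fragments fit overhang %s"
                             % (len(matches), current[-OVERHANG:]))
        current = matches[0]
        remaining.remove(current)
        construct += current[:-OVERHANG]
    if current[-OVERHANG:] != first[:OVERHANG]:
        raise ValueError("ends do not match: the assembly is not circular")
    return construct
```

With this model, leaving a part out of the list typically triggers one of the two errors, which is the kind of flaw that a cloning simulation is meant to catch before ordering DNA.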
This section describes a web-based software application relying on cloning simulation, that is, the in-silico modeling of restriction and ligation reactions [10], to predict the outcome of assembly reactions, and validate assembly plans prior to any DNA ordering or cloning work. The application offers several advantages for assembly planning, as it is able, with minimal input, to detect part standardization issues, find omitted or redundant parts in an assembly plan, and produce the final sequence of the assemblies, which will be useful for quality control at the end of the manufacturing process, as discussed in the next chapter.
3.1 Preparing the Necessary Data Files
1. Gather the sequences of all parts involved in the assembly. The sequences could be spread across different Fasta and Genbank files, but for practicality, we recommend a single Fasta file or a zip file containing the sequences as separate files in the Genbank format, each file named after the part sequence it provides.
2. Specify the assembly plan by creating a spreadsheet on the model of Fig. 2a. Save the spreadsheet in Excel format (.xls or .xlsx) or CSV format (.csv). The resulting file will be referred to as the Assembly Plan Spreadsheet in the rest of this section.
3. The web application offers the possibility to omit connector parts in the Assembly Plan Spreadsheet, and instead have the necessary connectors for each construct automatically selected. This requires gathering the sequences of all connector parts available, as a single Fasta file or a zip file containing the sequences as separate files in the Genbank format, each file named after the connector part it provides. These file(s) will be referred to as Connector Sequences in this section.
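A quick consistency check between the Assembly Plan Spreadsheet and the available part sequences can be done before using the web application. The sketch below assumes a plan saved in CSV format, comma-delimited and without a header row, where each line gives a construct name followed by its parts (the layout described in the caption of Fig. 2); the actual layout should be checked against the figure. The function names and part names are hypothetical.

```python
import csv
import io


def read_assembly_plan(csv_text):
    """Parse CSV text into a {construct name: [part names]} dictionary."""
    plan = {}
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells:
            plan[cells[0]] = cells[1:]
    return plan


def missing_parts(plan, available_parts):
    """Return {construct: [parts without a sequence]} for incomplete constructs."""
    missing = {}
    for construct, parts in plan.items():
        absent = [part for part in parts if part not in available_parts]
        if absent:
            missing[construct] = absent
    return missing
```

The `available_parts` argument can be any collection of part names, for instance the record names of the Fasta file gathered in step 1.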
3.2 Using the Web Application
1. Connect to the following application with the web browser of your choice: https://cuba.genomefoundry.org/simulate_gg_assemblies.
2. The application consists of a one-page form shown in Fig. 2b. In the rest of this protocol, letters in parentheses (a), (b), etc. refer to annotations in this figure.
3. Select the enzyme to be used for the assembly (a). Options are BsaI, BsmBI (this option is also suitable for other type-2S BsmBI isoschizomers such as Esp3I), BbsI, or the default option “Autoselect,” which will attempt to guess the intended enzyme based on the presence of recognition sites in the part sequences of each assembly.
4. Drag and drop all sequence files in the upload box (b).
5. Tick the checkbox “Provide a list of assemblies” and drag the Assembly Plan Spreadsheet in the appearing box (c). Note that this step can be skipped if the assembly plan consists of a single assembly.
Fig. 2 Input and output of the web-based cloning simulation application. (a) Screenshot of the web application showing the web form in its entirety. (b) User-created spreadsheet specifying the assembly plan. Each line starts with the name of the construct to be assembled, followed by the list of parts in each assembly. (c) Organization of the Cloning Simulation Report file. (d) Schema of the Genbank record of “Construct 1” as predicted by the application from the assembly plan of panel B. (e) Part connection schema for “Construct 2.” The circularity of the schema indicates that the parts will indeed assemble properly into a circular plasmid
6. When using Genbank records, if the name of the standard parts is provided in the file names (e.g., “p9_GFP.gb”) rather than in the Genbank’s internal ID field, choose option “Use file names as parts IDs” in the selection menu in (d).
7. The checkboxes in (e) provide customization options for the final report. Check “Ensure each line gives a single assembly” to flag in the report the construct definitions which may lead to several valid assemblies (i.e., combinatorial assemblies). Check “Ensure that no part is forgotten in the assemblies” to flag constructs for which only a subset of the parts will assemble into a valid circular assembly, other parts being redundant.
8. If the assembly plan requires completion by connectors, tick the “Autoselect connectors” box and drag the Connector Sequences in the appearing box.
9. Click on the “Predict Final Constructs” button (f) to start the cloning simulation. This will take a few seconds to a few minutes depending on the number of parts to process (a progress bar will be displayed).
3.3 Output
1. As the cloning simulation process ends, a “Download” button appears below the form (g). Clicking on the button will download a multi-file zipped report on the user’s computer, referred to as the Cloning Simulation Report in the rest of this section (Fig. 2c).
2. File “assembly_plan.csv” provides the assembly plan, possibly complemented with auto-selected connectors to form valid assemblies. Note that if no connector completion was necessary, this file is identical to the input assembly plan, and is attached in the Cloning Simulation Report for traceability.
3. File “all_parts.csv” provides the alphabetical list of all parts involved in the assembly plan, including auto-selected connectors, and can be used as a materials checklist when carrying out the assembly plan.
4. The Cloning Simulation Report features one folder for each assembly. A folder contains a Genbank record with the predicted assembly sequence, as well as a schema of the assembly (Fig. 2d) and the construct’s “connections graph” showing connections between the different parts of the assembly (Fig. 2e).
5. Review the connections graph to detect design problems. Any connections graph that is not perfectly circular indicates an invalid assembly plan. A linear connections graph, for instance, may indicate that parts are missing from the assembly plan. A part appearing at several places in the connections graph indicates that it was digested into more than one fragment, that is, the part sequence contains internal restriction sites which should be removed.
6. Carefully review the final Genbank sequences to ensure that the final sequences are biologically viable. For instance, check that open reading frames spanning several assembly parts are not affected by unwanted base pair deletions or insertions due to assembly scars. Also check that the final constructs feature a replication origin and the adequate resistance marker.
7. For convenience, copies of each assembly’s Genbank record are gathered in the “all_records” folder. This set of Genbank records can be used later on as the input of other software applications, for example, for automated quality control, as will be discussed in the next chapter.
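The circularity criterion of step 5 can also be checked programmatically. The sketch below is only an illustration, not the application's own implementation: it assumes that the connections graph is available as a list of (part, neighbouring part) pairs, a hypothetical representation chosen for this example, and the function name is likewise hypothetical.

```python
from collections import defaultdict


def is_single_cycle(connections):
    """Return True if the connections form exactly one closed ring.

    `connections` is a list of (part_a, part_b) pairs (hypothetical format).
    In a valid circular assembly every part has exactly two connections
    and all parts are reachable from one another.
    """
    degree = defaultdict(int)
    neighbours = defaultdict(set)
    for part_a, part_b in connections:
        degree[part_a] += 1
        degree[part_b] += 1
        neighbours[part_a].add(part_b)
        neighbours[part_b].add(part_a)
    # A linear graph has end parts with one connection; a part cut into
    # several fragments shows up with more than two connections.
    if not degree or any(count != 2 for count in degree.values()):
        return False
    # Walk the graph to make sure there is one ring and not several.
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(neighbours)


ring = [("backbone", "promoter"), ("promoter", "gene"),
        ("gene", "terminator"), ("terminator", "backbone")]
print(is_single_cycle(ring))       # closed ring
print(is_single_cycle(ring[:-1]))  # linear graph: a connection is missing
```

A check of this kind only flags the structural problems described in step 5; the biological review of step 6 still has to be done on the sequences themselves.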
Valentin Zulkower
References
1. Kosuri S, Church GM (2014) Large-scale de novo DNA synthesis: technologies and applications. Nat Methods 11(5):499–507. https://doi.org/10.1038/nmeth.2918
2. Chao R, Mishra S, Si T, Zhao H (2017) Engineering biological systems using automated biofoundries. Metab Eng 42:98–108. https://doi.org/10.1016/j.ymben.2017.06.003
3. Engler C, Kandzia R, Marillonnet S (2008) A one pot, one step, precision cloning method with high throughput capability. PLoS One 3(11):e3647. https://doi.org/10.1371/journal.pone.0003647
4. Tsuge K, Sato Y, Kobayashi Y, Gondo M, Hasebe M, Togashi T et al (2015) Method of preparing an equimolar DNA mixture for one-step DNA assembly of over 50 fragments. Sci Rep 5:10655. https://doi.org/10.1038/srep10655
5. Lin D, O’Callaghan CA (2018) MetClo: Methylase-assisted hierarchical DNA assembly using a single type IIS restriction enzyme. Nucleic Acids Res 46:e113. https://doi.org/10.1093/nar/gky596
6. Weber E, Engler C, Gruetzner R, Werner S, Marillonnet S (2011) A modular cloning system for standardized assembly of multigene constructs. PLoS One 6(2):e16765. https://doi.org/10.1371/journal.pone.0016765
7. Guo Y, Dong J, Zhou T, Auxillos J, Li T, Zhang W et al (2015) YeastFab: the design and construction of standard biological parts for metabolic engineering in Saccharomyces cerevisiae. Nucleic Acids Res 43(13):e88. https://doi.org/10.1093/nar/gkv464
8. Martella A, Matjusaitis M, Auxillos J, Pollard SM, Cai Y (2017) EMMA: an extensible mammalian modular assembly toolkit for the rapid design and production of diverse expression vectors. ACS Synth Biol 6(7):1380–1392. https://doi.org/10.1021/acssynbio.7b00016
9. Richardson SM, Wheelan SJ, Yarrington RM, Boeke JD (2006) GeneDesign: rapid, automated design of multikilobase synthetic genes. Genome Res 16:550–556. https://doi.org/10.1101/gr.4431306
10. Pereira F, Azevedo F, Carvalho Â, Ribeiro GF, Budde MW, Johansson B (2015) Pydna: a simulation and documentation tool for DNA assembly strategies using python. BMC Bioinformatics 16(1):142. https://doi.org/10.1186/s12859-015-0544-x
Chapter 7

Computer-Aided Planning for the Verification of Large Batches of DNA Constructs

Valentin Zulkower

Abstract
Restriction digest analysis and Sanger sequencing are among the most commonly used techniques to check the sequence of synthetic DNA constructs. However, both require careful preparation to select restriction enzymes or DNA primers adapted to the expected construct sequences. In projects involving manufacturing of large batches of synthetic constructs, the task can be tedious and error-prone. This chapter demonstrates the use of two free and open-source web applications providing fast and automated selection of enzymes and sequencing primers for DNA construct verification.

Key words Computer-aided manufacturing, DNA assembly, DNA verification, Sanger sequencing, Restriction digest analysis, Synthetic Biology
1 Introduction

The assembly of standard genetic parts into circular plasmids is one of the most common operations in modern genetic engineering. In a typical protocol, DNA parts are fused together via enzymatic ligation or PCR [1] and the assembly product is transformed into bacteria. The bacteria are then plated to obtain isolated colonies, and each colony can be cultivated and lysed in order to obtain a high-concentration preparation of the assembled plasmid. However, only a fraction of the colonies may carry plasmids with the expected sequence, due to undesired phenomena such as homologous DNA recombination in the bacterial cells, or parts mis-annealing during the assembly reaction [2]. As a consequence, assemblies featuring either a large number of parts or impeding sequence patterns (such as homologies and tandem repeats) may require the verification of over 20 colonies to obtain a valid plasmid preparation. Quality control can therefore account for a significant proportion of the costs and planning time spent on a high-throughput DNA assembly project.
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_7, © Springer Science+Business Media, LLC, part of Springer Nature 2021
While advances in Next Generation Sequencing (NGS) have significantly decreased DNA verification costs [3], current NGS methods still require complex processing of the DNA samples to be analyzed, and are only price-competitive for large batches of hundreds to thousands of assemblies. Pre-existing methods such as restriction digests and Sanger sequencing [4] remain popular solutions for routine DNA verification, although they require careful planning, in particular in the selection of ad hoc restriction enzymes and sequencing primers, in order to obtain decisive results. This chapter presents two web applications routinely used at the Edinburgh Genome Foundry (EGF) to plan the verification of large assembly sets via restriction digests and Sanger sequencing. The first application automates the selection of a minimal set of restriction enzymes for the verification of an assembly batch. The second application automates the selection of primers for Sanger sequencing verification of assembly batches, with a focus on primer reuse across constructs to reduce prices and protocol complexity.
2 Automated Enzyme Selection for Restriction Digest

DNA assembly verification by restriction digest analysis consists in digesting a DNA construct preparation using selected restriction enzymes, and separating the resulting DNA fragments into distinct “migration bands” via gel or capillary electrophoresis. Comparing band migrations to that of DNA ladders (containing calibrated DNA fragments of known sizes) enables the estimation of the different fragment sizes, also called the “band profile” of a construct. Band profiles resulting from the digestion of a DNA sequence by a given set of enzymes can easily be predicted using pen and paper, and observations differing significantly from the predictions indicate an invalid construct. While less informative than DNA sequencing (which will be addressed in the next section), restriction digests provide a simple method for first-pass screening of mis-assembled constructs in a few hours for less than $1 per assembly, and laboratory automation advances in recent years have increased the possible batch size from a few dozen to a few thousand constructs [5].

However, the selection of enzymes adapted to the constructs to be verified can be challenging. An ideal digest would consist in one or two enzymes cutting the construct in several places in order to produce a multi-band profile, while ensuring that the bands are well spaced (to prevent bands from fusing together and becoming indistinguishable) and that fragment sizes are in the same range as the ladder’s bands (typically from a few dozen to a few thousand base pairs) so that they can be measured with good precision. Additionally, when assembling many constructs, one may want to find a single enzyme (or enzyme mix) suitable for all constructs, which would allow a single digestion mix to be prepared for the project, greatly simplifying the protocol and saving reagents. If need be, the objective could be relaxed to finding a pair of digests such that any construct in the batch can be digested using one of the two options. This section describes a software solution automating the search for such optimal enzymes.

2.1 Using the Web Application
1. With the web browser of your choice (we recommend Google Chrome or Firefox), connect to the application at the following address: https://cuba.genomefoundry.org/domesticate_part_batches. The application consists in a simple one-page form shown in Fig. 1a. In the rest of this section, letters in parentheses (a), (b), etc. refer to annotations in this figure.
2. Make sure that the selection box in (a) indicates “Good patterns for all constructs.”
3. Choose the ideal range for the number of bands in a band profile (b). Profiles with fewer than three bands are generally considered too generic for good screening, and more than 8 bands generally result in crowded band patterns.
4. Drag and drop the sequences of all constructs, in Genbank or Fasta format, in the upload box (c).
5. Tick the checkbox in (d) to indicate that the sequences are circular plasmids.
6. Indicate which ladder will be used among the options proposed in (e), or the ladder with the closest range if it does not appear in the options.
7. Enter all available enzymes as a comma-separated list in the text box (f). Note that a few pre-set enzyme lists are available in the selection on top of the text box.
8. Choose the maximum number of enzymes in a given digest, as well as the maximum number of digests accepted to verify the batch. For instance, when asking for 3–6 bands via a single digest with 2 enzymes suitable for a batch of 10 constructs, the application will return a digestion plan as shown in Fig. 1b, relying on a mix of enzymes AseI and EcoRI. When asking for 4–6 bands and 2 possible digests for the same constructs, the application will return a digestion plan consisting of AseI+EcoRI, completed this time by a SphI+XhoI digest (as shown in Fig. 1c). Notice how this second digest provides 4-band patterns for constructs C5, C6, and C8, for which digest AseI+EcoRI only produced three bands. As a consequence, each construct in the batch can be verified using either AseI+EcoRI or SphI+XhoI.
9. Optionally, tick the boxes in (h) for the application to return detailed plots showing, for each digest, which regions of a construct correspond to the different bands in the band profile (Fig. 1d).
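The criteria used in steps 3 and 6 (number of bands, band spacing, and agreement with the ladder range) can be illustrated with a short sketch. This is not the application's code: the function names, the default ladder range of 100–10,000 bp and the 15% minimum size difference between neighbouring bands are assumptions made for this example, and cut positions are taken as given rather than computed from enzyme recognition sites.

```python
def band_sizes(sequence_length, cut_positions, circular=True):
    """Predict fragment sizes (bp) from the cut positions of a digest."""
    cuts = sorted(set(cut_positions))
    if not cuts:
        return [sequence_length]
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    if circular:
        # On a plasmid, the last fragment wraps around the origin.
        sizes.append(sequence_length - cuts[-1] + cuts[0])
    else:
        sizes = [cuts[0]] + sizes + [sequence_length - cuts[-1]]
    return sorted((s for s in sizes if s > 0), reverse=True)


def is_good_profile(sizes, band_range=(3, 8), size_range=(100, 10000),
                    min_ratio=1.15):
    """Check band count, ladder range and spacing (assumed thresholds)."""
    if not band_range[0] <= len(sizes) <= band_range[1]:
        return False
    if any(not size_range[0] <= s <= size_range[1] for s in sizes):
        return False
    ordered = sorted(sizes, reverse=True)
    # Neighbouring bands must differ enough in size to be told apart.
    return all(big / small >= min_ratio
               for big, small in zip(ordered, ordered[1:]))


profile = band_sizes(5000, [200, 1000, 2600])
print(profile, is_good_profile(profile))
```

A digest producing two fragments of the same size, for example cuts at 500, 2000 and 3500 bp on the same 5 kb plasmid, would be rejected by the spacing test even though it yields three fragments.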
Fig. 1 Form and output of the web-based enzyme selection application. (a) Screenshot of the web application showing the web form in its entirety. (b) Example plot returned by the application for a batch of 10 constructs. The selected enzymes are indicated on the left. (c) Plot returned by the application in complement to the one in panel B when the user requires two different digests. (d) Example of construct map returned by the application (here construct C1, with construct features blurred as not relevant for this chapter). Yellow features indicate the construct regions corresponding to the different bands (labeled a, b, c, d) of the digestion pattern
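Choosing a small number of digests such that every construct is verified by at least one of them, as in panels b and c, can be viewed as a set-cover problem. The sketch below uses a simple greedy heuristic; this is an assumption made for illustration, it does not guarantee a minimal solution, and it is not claimed to be the method used by the application. The input mapping (digest name to the set of constructs with a satisfactory band profile) and its contents are hypothetical.

```python
def select_digests(good_constructs_by_digest, all_constructs, max_digests=2):
    """Greedily pick digests until all constructs are covered.

    Returns the chosen digests and the set of constructs left uncovered.
    """
    remaining = set(all_constructs)
    chosen = []
    while remaining and len(chosen) < max_digests:
        # Pick the digest covering the most still-uncovered constructs
        # (ties are broken alphabetically for reproducibility).
        best = max(sorted(good_constructs_by_digest),
                   key=lambda d: len(good_constructs_by_digest[d] & remaining))
        gained = good_constructs_by_digest[best] & remaining
        if not gained:
            break
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


constructs = ["C%d" % i for i in range(1, 11)]
# Hypothetical data mirroring the example: AseI+EcoRI fails for C5, C6, C8.
good = {
    "AseI+EcoRI": set(constructs) - {"C5", "C6", "C8"},
    "SphI+XhoI": {"C1", "C5", "C6", "C8"},
    "BamHI": {"C1", "C2"},
}
print(select_digests(good, constructs, max_digests=2))
```

With `max_digests=1` the same call would return the single best digest together with the constructs it cannot verify, which makes the trade-off between protocol simplicity and coverage explicit.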
3 Automated Primer Selection and Design for Sanger Sequencing

Sanger sequencing [4] enables the determination of a DNA molecule’s sequence by pairing the DNA preparation to be sequenced with different primers. Each sequencing primer is a small oligonucleotide (typically 20 nucleotides long) homologous to a specific region of the DNA molecule, and produces a Sanger read determining the sequence of the segment located between two points ~100 bp and ~1000 bp downstream of the homologous region, for a typical cost of $1–2 per read. To ensure the success of Sanger sequencing, each primer should be designed so as to be specific to the region targeted, have a melting temperature of 55–65 °C (to avoid weak reads or high background), and avoid strong secondary structure that could impair annealing; this design can be automated via software [6].

The Sanger validation of large batches of assemblies may therefore require the design and purchase of hundreds of primers, as well as hundreds of sequencing reactions, adding significantly to the overall time and cost of the DNA assembly process. However, for batches of constructs assembled from standard genetic parts, the number of necessary primers can be greatly reduced, as (1) constructs sharing common genetic parts present homologies, making it possible to use a primer with several constructs, and (2) some primers ordered to sequence a given batch may be reusable in the next. From this perspective, we describe here a web application automating the selection of primers for a given batch of constructs, via a strategy minimizing the number of reads and the number of new primers required.

3.1 Preparing the Necessary Data Files
1. Prepare a zip archive containing the expected sequence of all constructs in the batch as separate Genbank files (referred to as the Constructs Sequences Archive in the rest of this section). The name of each Genbank file should reflect the construct’s name. If only part of these sequences should be covered, refer to the instructions below.
2. Optionally, prepare a Fasta file gathering the sequences of all primers already available to you. This file will be referred to as the Primers Sequences File in the rest of this section.
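For large batches, the two input files described above can be produced with a few lines of scripting. The sketch below is one possible way of doing so with the Python standard library; the function names, the accepted file extensions (.gb, .gbk) and the example primer are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def build_constructs_archive(genbank_dir, archive_path):
    """Zip all Genbank files of a folder; file names act as construct names."""
    files = sorted(p for p in Path(genbank_dir).iterdir()
                   if p.suffix.lower() in {".gb", ".gbk"})
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # Store files at the root of the archive, without folder paths.
            archive.write(path, arcname=path.name)
    return [p.stem for p in files]


def write_primers_fasta(primers, fasta_path):
    """Write a {name: sequence} dictionary as a Fasta file."""
    with open(fasta_path, "w") as handle:
        for name, sequence in primers.items():
            handle.write(">%s\n%s\n" % (name, sequence.upper()))


# Example usage (paths and primer are hypothetical):
# build_constructs_archive("constructs/", "constructs_sequences.zip")
# write_primers_fasta({"P001": "ATGCGTACGTTAGCCTAGGA"}, "primers.fa")
```

Keeping one Genbank file per construct, named after the construct, means the names returned by `build_constructs_archive` can be compared directly against the assembly plan before uploading.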
3.2 Indicating Regions to Cover and Primer-Free Regions
The full Sanger sequencing of a 10-kb plasmid typically requires 20 reads to ensure a 2× coverage (where each nucleotide is read twice). Consequently, sequencing a hundred plasmids will require two thousand reads, and possibly hundreds of different primers, making it costly and logistically challenging. To reduce the complexity of the sequencing plan, one may want to restrict sequencing to some regions of interest. For instance, when assembling several genetic parts into a receptor vector, the sequencing of the receptor region may be deemed unnecessary. One may also decide to only sequence regions at the junctions between successive genetic parts, as these locations may be more prone to assembly artifacts. Moreover, one may want to avoid using a primer annealing at these junctions, as the primer may not be able to anneal in case of artifacts at this location, leading to no read at all.

Fig. 2 Input and output of the web-based primer selection application. (a) Schematic representation of an assembly’s Genbank record, with part junctions annotated to indicate that these regions in particular should be covered by sequencing, and should not be an annealing location for primers. (b) Screenshot of the web application showing the web form in its entirety. (c) Example output schema showing the sequencing plan for 2 constructs. Short red triangles indicate primer annealing locations, blue features indicate Sanger reads from newly designed primers, and purple features indicate Sanger reads using available primers

The following steps show how to specify regions to cover and prevent primers at certain locations, using Genbank annotations as shown in Fig. 2a. These steps are optional and can be skipped if the whole construct sequence should be covered and no location is unfit for primers.
1. Before adding the construct’s corresponding Genbank file to the zip archive, open the Genbank file in a sequence editor, for instance, the free software Snapgene Viewer (see snapgene.com) or Benchling (benchling.com).
2. Find the location of a sequence region which should be covered by sequencing.
3. Add an annotation at this location (the method for which may vary from one sequence editor to another). The Genbank type of the annotation should be “misc_feature,” and the label of the annotation “cover.”
4. Likewise, find the location of a sequence region for which primers should be avoided. Add an annotation at this location with Genbank type “misc_feature,” and the label of the annotation “no_cover.”
5. Save the resulting Genbank record to a file and add the file to the Constructs Sequences Archive.

3.3 Using the Web Application
1. With the web browser of your choice, connect to the application at the following address: https://cuba.genomefoundry.org/select_primers.
2. The application consists in a simple one-page form shown in Fig. 2b. In the rest of this section, letters in parentheses (a), (b), etc. refer to annotations in this figure.
3. Make sure the validation type is set to “Sanger sequencing” (a).
4. In selection box (b) indicate whether the primers should produce reads on the 3′–5′ strand, the 5′–3′ strand, or both, the latter corresponding to a 2× coverage where each nucleotide is read once from each direction.
5. Drag the Constructs Sequences Archive in the upload box (c).
6. Tick the box (d) to indicate that the constructs to validate are circular.
7. Optionally, drag the Primers Sequences File in the upload box (e).
8. Specify the expected read size (f) and the target annealing temperature of the primers (g). The default values provided are typical, but these parameters may slightly vary depending on protocol details, and must be checked with the sequencing laboratory.
9. Specify the number of digits used in the name formatting for new primers (h). For instance, a value of 3 will result in primer names of the form P001, P002, etc. Name collisions with existing primers specified in the Primers Sequences File will be automatically avoided.
10. Click on “Select primers” (i) to launch the primer selection, which may take a few minutes (progress bars will be displayed).
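The naming scheme of step 9 can be pictured with the following sketch. How the application actually resolves name collisions is not described here, so the approach shown (skipping any name already taken) is an assumption; the function name is hypothetical, and the prefix “P” is taken from the example names above.

```python
def new_primer_names(n_new, existing_names, digits=3, prefix="P"):
    """Generate zero-padded primer names that avoid existing ones."""
    taken = set(existing_names)
    names = []
    index = 1
    while len(names) < n_new:
        candidate = "%s%0*d" % (prefix, digits, index)
        if candidate not in taken:  # skip names already in use
            names.append(candidate)
            taken.add(candidate)
        index += 1
    return names


print(new_primer_names(3, {"P001", "P003"}))
```

Choosing enough digits for the expected lifetime of the primer collection (for example 4 digits for up to 9999 primers) keeps names sortable across batches.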
3.4 Output
1. As the automated primer selection process ends, a “Download” button appears below the form. Clicking on the button will download a multi-file zipped report on the user’s computer, referred to as the Primer Selection Report in the rest of this section, which describes an optimized Sanger sequencing plan for the batch of constructs. It consists of the three files described below.
2. File “coverage_plots.pdf” features schemas indicating the location of primer annealing and expected read regions, as shown in Fig. 2c, and can be used to quickly review the sequencing plan.
3. File “primers_list.csv” lists all primers used in the sequencing plan, in spreadsheet format, and can be used as a checklist before starting the preparation of the sequencing. All newly designed primers appear at the top of the list, making it easy to order their sequences from commercial primer providers.
4. File “primers_per_record.csv” is a spreadsheet associating each construct of the batch with the list of primers necessary to sequence it (or cover the regions of interest). It is meant to be used either as a checklist if the sequencing reactions are prepared manually, or as a data file for generating robotic pick-lists if sample handling is automated.
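As an illustration of the last point, the sketch below expands a construct-to-primers table into one row per sequencing reaction, the usual starting point for a pick-list. The exact layout of “primers_per_record.csv” is not specified in this chapter, so the layout assumed here (no header, construct name in the first column, semicolon-separated primer names in the second) is hypothetical and should be adapted to the actual file.

```python
import csv
import io


def reactions_from_csv(csv_text, primer_separator=";"):
    """Return one (construct, primer) pair per sequencing reaction.

    Assumed layout: column 1 = construct, column 2 = primer names joined
    by `primer_separator` (hypothetical; check against the real file).
    """
    reactions = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip empty or incomplete lines
        construct = row[0].strip()
        for primer in row[1].split(primer_separator):
            primer = primer.strip()
            if primer:
                reactions.append((construct, primer))
    return reactions


example = "C1,P001;P002\nC2,P002\n"
for construct, primer in reactions_from_csv(example):
    print(construct, primer)
```

Each resulting pair can then be mapped to source and destination wells according to the plate layout used by the liquid handler.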
References
1. Casini A, Storch M, Baldwin GS, Ellis T (2015) Bricks and blueprints: methods and standards for DNA assembly. Nat Rev Mol Cell Biol 16(9):568–576. https://doi.org/10.1038/nrm4014
2. Potapov V, Ong JL, Kucera RB, Langhorst BW, Bilotti K, Pryor JM et al (2018) Comprehensive profiling of four base overhang ligation fidelity by T4 DNA ligase and application to DNA assembly. ACS Synth Biol 7(11):2665–2674. https://doi.org/10.1021/acssynbio.8b00333
3. Shapland EB, Holmes V, Reeves CD, Sorokin E, Durot M, Platt D et al (2015) Low-cost, high-throughput sequencing of DNA assemblies using a highly multiplexed Nextera process. ACS Synth Biol 4(7):860–866. https://doi.org/10.1021/sb500362n
4. Sanger F, Coulson AR (1975) A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol 94(3):441–448. https://doi.org/10.1016/0022-2836(75)90213-2
5. Dharmadi Y, Patel K, Shapland E, Hollis D, Slaby T, Klinkner N et al (2014) High-throughput, cost-effective verification of structural DNA assembly. Nucleic Acids Res 42(4):e22. https://doi.org/10.1093/nar/gkt1088
6. Hancock JM, Zvelebil MJ, Hancock JM (2004) PRIMER3. In: Dictionary of bioinformatics and computational biology. https://doi.org/10.1002/9780471650126.dob0560.pub2
Chapter 8

Characterizing Genetic Parts and Devices Using RNA Sequencing

Deepti Vipin, Zoya Ignatova, and Thomas E. Gorochowski

Abstract
Synthetic genetic circuits are composed of many parts that must interact and function together to produce a desired pattern of gene expression. A challenge when assembling circuits is that genetic parts often behave differently within a circuit, potentially impacting the desired functionality. Existing debugging methods based on fluorescent reporter proteins allow for only a few internal states to be monitored simultaneously, making diagnosis of the root cause impossible for large systems. Here, we present a tool called the Genetic Analyzer which uses RNA sequencing data to simultaneously characterize all transcriptional parts (e.g., promoters and terminators) and devices (e.g., sensors and logic gates) in complex genetic circuits. This provides a complete picture of the inner workings of a genetic circuit enabling faults to be easily identified and fixed. We construct a complete workflow to coordinate the execution of the various data processing and analysis steps and explain the options available when adapting these for the characterization of new systems.

Key words Genetic circuits, Genetic parts, Characterization, Biometrology, RNA-seq, Synthetic biology
1 Introduction

Synthetic genetic circuits allow us to reprogram the behavior of living cells [1]. They consist of many genetic parts and devices that must work together to regulate gene expression [2]. A challenge often faced when constructing complex genetic circuits is that individual parts behave differently when assembled with other components. Such contextual effects can arise due to changes in the local sequence composition [3, 4], uncharacterized interactions between parts [5, 6], competition for shared cellular resources [7–9], and many other factors [10]. As the size and complexity of genetic circuits grow [11–13], these malfunctions make it increasingly difficult to construct a working system. Furthermore, because there are numerous potential points of failure, it is difficult to exhaustively test every part and single out the root cause. What is needed is a way of measuring the function of every component in the context of the complete circuit.

Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_8, © Springer Science+Business Media, LLC, part of Springer Nature 2021

To date, methods to characterize the performance of genetic parts and devices have mostly relied on the read-out of fluorescent reporter proteins as proxies for gene expression levels [11, 12, 14, 15]. Genes of interest are tagged with a fluorescent protein or a fluorescent protein is co-expressed with a gene of interest. The major benefit of this approach is that fluorescence can be easily monitored in real-time across entire populations using a plate reader or even in single cells using flow cytometry. However, a number of limitations also exist. First, only a limited number of fluorescent reporters can be used concurrently due to spectral overlap [16], and second, modifications must be made to the circuit which too may alter the behavior of the system [2].

Next-generation sequencing has revolutionized many areas of biological research offering a holistic view of many cellular processes [17]. For example, RNA sequencing (RNA-seq) [18] can be used to assess transcriptional regulation [19], ribosome profiling (Ribo-seq) [20] can offer insight into protein translation [21], and numerous other methods exist allowing us to assess the binding sites of transcriptional regulators (ChIP-seq) [22] and even the secondary structure of RNA molecules within a cell (SHAPE-seq, PARS-seq) [23, 24], to name but a few. Unlike fluorescent reporters, sequencing methods do not require the modification of the host cell or a synthetic genetic circuit and provide a complete, cellwide snapshot [17, 18, 25]. Although such methods would allow for the simultaneous characterization of many types of genetic part, sequencing has so far seen limited use in synthetic biology.
This is changing with sequencing methods being recently used to characterize transcriptional and translational processes across entire genetic circuits [19, 21] and the application of multiplexing techniques to significantly decrease costs [26]. In this chapter, we demonstrate how RNA-seq data can be used to characterize genetic parts, devices, and the host cell response to a synthetic genetic circuit [19]. We assume that RNA-seq data have already been collected for the range of possible conditions the circuit can be exposed to and show how a computational tool called the Genetic Analyzer can be used to process and analyze these data. Step-by-step instructions are given on how to install and create new analysis workflows and the customizations that are necessary to study new systems of interest (Fig. 1). We also provide brief descriptions of how various tools are used at each step and the principles underpinning the characterization of transcriptional promoters and terminators (Fig. 2) and genetic devices like small-molecule sensors and logic gates (Fig. 3).
[Fig. 1 flowchart, from START to END. Section labels: Data pre-processing; Characterization. Input files: data/bed/S.bed, data/fasta/S.fasta, data/fastq/S.fastq, data/gff/S.gff, data/settings.txt. Steps shown: 00_setup.sh (create temporary and result directories); 01_map_reads.sh (map RNA-seq reads using BWA and Samtools); 02_count_reads.sh (estimate read count per gene using HTSeq); 03_fragment_distributions.sh (generate fragment length distribution of mapped genes); 04_read_analysis.sh (calculate FPKMs per gene and TMM between-sample normalization factors); generation of normalized profiles, correcting edge effects; 06_de_analysis.sh (evaluate differential gene expression between sets of samples); 07_part_analysis.sh (calculate promoter and terminator performance); fitting of response functions for genetic devices; 09_clean_up.sh. Files shown: tmp/S/S.bam; results/S/S.mapped.reads.txt, S.counts.txt, S.gene.lengths.txt, S.fragment.distribution.txt; results/S.de.analysis.txt, count.matrix.txt, mapped.reads.matrix.txt, gene.lengths.matrix.txt, norm.factors.matrix.txt, fpkm.normed.matrix.txt]
Fig. 1 Overview of the workflow. Major analyses shown in boxes with dependencies and flows between analyses shown by arrows. Dashed arrows denote input and output files to each step with “S” denoting a prefix that would be replaced with a specific sample name
[Fig. 2 panels: Promoter; Terminator. Rows: Genetic design; Transcription. Annotations: δJ, Te, TSS, TTS. x-axis: Position (bp)]
Fig. 2 Method for characterizing promoter and terminator parts. For both types of part, a small region of the transcription profile before and after each part is used to estimate changes in RNA polymerase (RNAP) flux [14, 19]. For promoters, a sharp increase in RNAP flux occurs at the transcription start site (TSS) and the absolute change in RNAP flux from before to after δJ captures the promoter strength. For terminators, a drop in RNAP flux occurs at the transcription termination site (TTS) as RNAP physically unbind from the DNA. As this is a stochastic process, the fractional drop in RNAP flux across the part is related to the termination efficiency Te (i.e., percentage of RNAP that terminate). Genetic design shown with Synthetic Biology Open Language Visual (SBOL Visual) symbols [34] and produced using DNAplotlib [35, 36]
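The two quantities described in this caption can be sketched numerically as follows. This is an illustration under simplifying assumptions rather than the Genetic Analyzer's own code: the transcription profile is taken as a plain list of per-base values, the RNAP flux on each side of a part is estimated as the mean over a fixed window (10 bp here, an arbitrary choice), and the termination efficiency is taken to be exactly the fractional drop in flux. Function and parameter names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def promoter_strength(profile, part_start, part_end, window=10):
    """Absolute change in flux (delta J) from before to after the part."""
    before = mean(profile[part_start - window:part_start])
    after = mean(profile[part_end:part_end + window])
    return after - before


def termination_efficiency(profile, part_start, part_end, window=10):
    """Fractional drop in flux across the part (assumed to equal Te)."""
    before = mean(profile[part_start - window:part_start])
    after = mean(profile[part_end:part_end + window])
    if before == 0:
        raise ValueError("no transcription entering the terminator")
    return 1.0 - after / before


# Toy profiles: each part occupies positions 20 to 25.
promoter_profile = [2.0] * 20 + [10.0] * 5 + [50.0] * 20
terminator_profile = [100.0] * 20 + [60.0] * 5 + [5.0] * 20
print(promoter_strength(promoter_profile, 20, 25))
print(termination_efficiency(terminator_profile, 20, 25))
```

Note that the termination efficiency is undefined when no transcription enters the terminator, which is why conditions with an active upstream promoter are needed to characterize it.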
[Fig. 3 panels: (a) Sensor, with output promoter Pout and RNAP fluxes Joff and Jon in the absence (−) and presence (+) of inducer; (b) NOT-gate, with input promoter Pin, output promoter Pout, input flux Jin and output fluxes Joff and Jout, for circuit states 1–4]

Fig. 3 Method for quantifying response function of genetic devices. (a) Sensors are characterized by the activity δJ of an output promoter Pout in the presence (+) and absence (−) of an inducer molecule, or another environmental factor [19]. (b) Genetic gates, such as a NOT-gate, are characterized by the relationship between the total RNAP flux acting as input to the gate Jin and the activity δJout of the output promoter Pout [19]. Steady-state measurements across a range of input combinations of a circuit can then be used to fit a response function (e.g., Hill equation) for the device [12]. The response functions of each device are shown on the right of each panel, and the transcription profiles and the RNAP flux measurements used to calculate these are shown to the left
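To make the idea of fitting a response function concrete, the sketch below evaluates a repressing Hill function for a NOT-gate and fits its threshold and cooperativity by a coarse grid search. The parameterization (y_min, y_max, K, n), the use of the observed extremes as y_min and y_max, and the grid search itself are assumptions chosen to keep the example dependency-free; they are not claimed to be the fitting procedure of the Genetic Analyzer, and the data points are synthetic.

```python
def hill_repression(j_in, y_min, y_max, k, n):
    """Output promoter activity of a NOT-gate for a given input flux."""
    return y_min + (y_max - y_min) * k ** n / (k ** n + j_in ** n)


def fit_hill_grid(j_in_values, j_out_values, k_grid, n_grid):
    """Pick the (K, n) pair minimizing the sum of squared errors."""
    y_min, y_max = min(j_out_values), max(j_out_values)
    best = None
    for k in k_grid:
        for n in n_grid:
            error = sum(
                (hill_repression(x, y_min, y_max, k, n) - y) ** 2
                for x, y in zip(j_in_values, j_out_values))
            if best is None or error < best[0]:
                best = (error, k, n)
    return {"y_min": y_min, "y_max": y_max, "k": best[1], "n": best[2]}


# Synthetic steady-state measurements generated with K = 10 and n = 2.
inputs = [0.1, 1.0, 10.0, 100.0, 1000.0]
outputs = [hill_repression(x, 1.0, 100.0, 10.0, 2.0) for x in inputs]
fit = fit_hill_grid(inputs, outputs, k_grid=[1.0, 10.0, 100.0],
                    n_grid=[1.0, 2.0, 4.0])
print(fit["k"], fit["n"])
```

In practice a continuous optimizer and measurements spanning the transition region of the gate would give more reliable parameter estimates than a coarse grid.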
2 Materials
2.1 Software Dependencies
The Genetic Analyzer requires that the following software tools and packages are installed and accessible from a command prompt. In most cases, newer versions of the software should be compatible. However, if issues are encountered, we recommend using the precise versions listed below.
(a) Python version 2.7.9 [27]; we recommend using a packaged Python distribution such as Anaconda (www.continuum.io) or Enthought (www.enthought.com).
(b) R version 3.2.1 [28].
(c) edgeR version 3.8.6 [29].
(d) BWA version 0.7.4 [30].
(e) SAMtools version 1.4 [31].
(f) HTSeq version 0.9.1 [32].
(g) Git version 2.21.0.
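Whether these tools are reachable from the command prompt can be checked quickly before starting. The sketch below is a convenience check only and is written for a current Python 3 interpreter, not the Python 2.7 environment listed above; the executable names used (python, R, bwa, samtools, htseq-count, git) are the usual command names for these tools but may differ between installations, and edgeR is omitted because it is an R package rather than an executable.

```python
import shutil

# Usual command names; adjust if your installation differs.
REQUIRED_EXECUTABLES = ["python", "R", "bwa", "samtools", "htseq-count", "git"]


def missing_executables(names=REQUIRED_EXECUTABLES, which=shutil.which):
    """Return the commands that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required executables were found.")
```

This check confirms presence only; the installed versions still need to be compared with the list above if problems are encountered.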
2.2 Installation of the Genetic Analyzer
1. The Genetic Analyzer forms part of a collection of tools for analyzing sequencing data. A copy of the latest Genetic Analyzer can be downloaded by running the following command:
   git clone https://github.com/VoigtLab/MIT-BroadFoundry
2. A directory called “MIT-BroadFoundry” will have been created. Within this, the “genetic-analyzer” directory contains code related to the Genetic Analyzer workflow. All analysis scripts can be found in the “bin” directory and an example workflow is given in the “circuit example” directory. To ensure that analysis scripts can be found by each script, it is essential that each stage of the analysis workflow is executed from within a workflow directory (a description of each stage is provided in Subheading 3 below). This will ensure that the relative paths used in the scripts point to the correct code.

2.3 Sequencing Data

The Genetic Analyzer assumes that sequencing data will be provided in a standardized form to allow for automated processing [19]. In particular, it requires that paired-end RNA-seq data with FASTQ files is provided for read 1 and read 2 of each fragment. We recommend preparing strand-specific RNA-seq sequencing libraries [26] to allow for the multiplexing of multiple samples during a single run, and sequencing these libraries using an Illumina sequencer (e.g., HiSeq 2500). It is essential that a sufficient number of reads are generated per sample to allow for accurate quantification of genetic parts and devices. Although the precise number is dependent on the size of the host genome and synthetic genetic constructs present, for Escherichia coli cells, we find that approximately four million reads per sample is sufficient for accurate measurements from large genetic circuits [19].

3 Methods

In the following sections, we explain the key steps and processes required to characterize genetic parts and devices, and to understand the transcriptional response of the host cell upon introduction of a synthetic genetic circuit. It is assumed that a copy of the entire workflow and all scripts are available in the current path and that all commands are run from this location (see Note 1).
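As a pre-flight check before running the workflow, the read depth recommended in Subheading 2.3 (approximately four million reads per sample for E. coli) can be verified directly from the FASTQ files. The sketch below is not part of the Genetic Analyzer; it assumes standard four-line FASTQ records, optionally gzip-compressed, and the function name and example file path are hypothetical.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Example usage (file name is hypothetical):
# n_reads = count_fastq_reads("data/fastq/state_1_R1.fastq")
# if n_reads < 4000000:
#     print("Fewer reads than the ~4 million suggested for E. coli")
```

For paired-end data, the read 1 and read 2 files of a sample should give the same count; a mismatch usually indicates a truncated or corrupted file.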
3.1 Initial Workflow Setup
1. The first step is the creation of a new workflow to store all sequencing data, metadata about the host system and synthetic genetic circuits being studied, and the generated results. To create a new workflow, it is advised that a copy of the “circuit example” directory is made and renamed as appropriate. Because the workflow relies on the specific location of certain files, keeping the same directory structure within a workflow is essential. Once a new workflow directory has been created, a number of key files must be edited and added within the “data” directory. We recommend editing and renaming the examples provided to ensure that correct file formats are maintained.
2. Within the “data/bed” directory, a BED file [33] (*.bed) must be present that provides the genomic regions for which transcription profiles will be created. The chromosome names used in this file must be identical to the names used in the provided reference sequences (see step 3 below). The format of this file is a line containing the chromosome name, start location, and end location (tab separated) for each region required. It is vital that transcription profiles are generated for regions containing every part that will be characterized and it is not necessary for the entire chromosome of the host cell to be considered, unless genetic parts within the host will be measured. 3. Within the “data/fasta” directory, a FASTA file (*.fa or *.fasta) must be present containing reference sequences for the host genome and any other genetic constructs present. A standard multi-FASTA format is used and the names of the chromosomes must match their use in other files. 4. Within the “data/fastq” directory, all raw FASTQ files (*.fq or *.fastq) from the sequencing must be present. There should be a pair of FASTQ files for each state of the circuit (e.g., combination of inducer molecules) corresponding to the paired-end reads produced by the sequencer. 5. Within the “data/gff” directory, a GFF file (*.gff) must be present describing the location of all features that will be needed by the analysis workflow. This file should provide a detailed annotation of the reference sequences giving the start and end location of each feature, the strand that the feature resides (+ or –), as well as information about the type and other metadata related to the part itself (if relevant). Table 1 describes the five different features that can be used, and the related metadata needed to capture regulatory links between parts and other crucial information for part characterization. 
It should be noted that differential gene expression analysis (see Subheading 3.4) will only consider features of the “gene” type.

Table 1 Custom workflow feature types and parameters for GFF files

| Type | Parameter (a) | Description |
|---|---|---|
| promoter | Name | Name of the promoter. Promoter features are not used for calculating promoter strengths |
| promoter_unit | Name | Name of the promoter unit. This contains one or more promoters and is used as the main feature for calculating promoter strengths |
| | promoter_names (b) | Names of the individual promoters |
| | promoter_types (b) | The type of each promoter, either “repress” if repressed by a gene in the circuit or “induced” if an output promoter from a sensor |
| | promoter_ns (b) | n values of the Hill functions characterizing the response function of each individual promoter making up the promoter unit |
| | chrom_inputs (b) | Chromosome of the input promoter units |
| | promoter_unit_inputs (b) | Other promoter units that act as inputs, driving expression of any genes that have a regulatory effect on this promoter unit (e.g., repressors). For induced promoters this should be a colon-separated list of each circuit state and a 0 or 1, if inactive or active, respectively (e.g., “state1 > 0:state2 > 1:state3 > 0”) |
| gene | Name | Name of the gene. Gene features should span the coding region of a protein and are used to calculate differential gene expression (see Subheading 3.4) |
| transcript | Name | Name of the transcript. Transcript features are used to generate the transcription profiles |
| | start_site | Nucleotide position within the transcript where cleavage by a ribozyme takes place. A position of ten would be the tenth nucleotide in the transcript, not the reference sequence |
| terminator | Name | Name of the terminator. Terminator features should span the full length of the part |

(a) Parameters are provided as key-value pairs in the form “key = value.” Multiple parameters should be semi-colon separated, that is, “parameter1 = value1; parameter2 = value2”
(b) Where multiple promoters make up a promoter unit, the values for each promoter within the unit should be comma separated, corresponding to the promoters in sequence, e.g., “Name = P1,P2”

6. Finally, the “data/setting.txt” file contains a tab-delimited table where each row corresponds to a particular sample/state of the circuit (e.g., combination of inducers). Existing rows for the circuit example should be edited or removed as necessary. In addition, the first row is a header and the second row should remain untouched as it states a central location for storing results collating data from all circuit samples. We recommend using relative paths so that the entire workflow directory can be easily moved without causing issues relating to absolute paths changing.
7. Before any analysis can be performed, a number of additional directories must be created to provide a location for temporary files and analysis results. This process is performed by the “00_setup.sh” script, which must first be edited to create separate directories for each sample in the “results” and “tmp” directories. The names of these directories should match precisely the sample names in the “data/setting.txt” file.
8. The required directories are then created by running the command: sh 00_setup.sh
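The file conventions described in this subheading lend themselves to a quick sanity check before any scripts are run. The following is a minimal sketch only and is not part of the published workflow: the function names, the example attribute string, and the example sequence names are all hypothetical, and it assumes attributes are written as semicolon-separated key=value pairs with comma-separated values for multi-promoter units, as stated in the Table 1 footnotes.

```python
def parse_attributes(field):
    """Parse a GFF attribute field written as 'key=value;key=value'.

    Comma-separated values (used in Table 1 for promoter units made of
    several promoters) are returned as lists; single values as strings.
    """
    attributes = {}
    for pair in field.split(";"):
        if not pair.strip():
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        values = [v.strip() for v in value.split(",")]
        attributes[key.strip()] = values if len(values) > 1 else values[0]
    return attributes


def bed_chromosomes(bed_text):
    """Chromosome names (first tab-separated column) used in BED text."""
    return {line.split("\t")[0] for line in bed_text.splitlines() if line.strip()}


def fasta_names(fasta_text):
    """Sequence names (first word after '>') found in multi-FASTA text."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names


# Hypothetical example data (names and coordinates are made up).
unit = parse_attributes("Name=unit1;promoter_names=P1,P2;promoter_types=repress,induced")
bed = "plasmid1\t100\t2500\n"
fasta = ">plasmid1 example construct\nATGC\n>host_chromosome\nGGCC\n"
missing = bed_chromosomes(bed) - fasta_names(fasta)
print(unit["promoter_names"], "missing from FASTA:", sorted(missing))
```

Any name reported as missing would indicate a mismatch between the BED and FASTA files of the kind that steps 2 and 3 warn against.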
3.2 Data Preprocessing
1. Once a complete workflow is set up, the raw RNA sequencing data for each sample need to be mapped to the reference sequences. This is performed by the “01_map_reads.sh” script, which calls the “map_reads.py” script to coordinate the SAMtools [31] and BWA [30] software for each sample. This script should be edited to include entries for each sample present in the “data/setting.txt” file.
2. The mapping of sequencing reads is then performed by running the command: sh 01_map_reads.sh
3. This creates BAM files [31] for each sample in the “tmp” directory.
4. The next step is to generate read counts for each gene feature in the GFF file of the system being studied. This is used when calculating differential gene expression in Subheading 3.4. This process is performed by the “02_count_reads.sh” script, which calls the “count_reads.py” script for each sample. This script should be edited to include entries for each sample present in the “data/setting.txt” file.
5. Read counts for each gene are then calculated by running the command: sh 02_count_reads.sh
6. The script will create multiple output files in the “results” directory for each sample: “SAMPLE.counts.txt” containing read counts for each gene, “SAMPLE.gene.lengths.txt” containing the length of each gene used for calculations of gene expression in Fragments Per Kilobase of transcript per Million mapped reads (FPKM) units, and “SAMPLE.mapped.reads.txt” containing information about the total number of mapped reads. In all cases, “SAMPLE” in the filename is replaced by the full sample name.
7. The next step is to generate fragment length distributions that are needed to allow for the correction of reduced read counts at the ends of transcripts. This process is performed by the “03_fragment_distributions.sh” script, which calls the “fragment_distributions.py” script for each sample. This script should be edited to include entries for each sample present in the “data/setting.txt” file.
8. The distribution of fragment lengths for each sample is then produced by running the command: sh 03_fragment_distributions.sh
9. The script will create output files in the “results” directory for each sample containing the fragment length distributions. These files will be named in the format “SAMPLE.fragment.distribution.txt” where SAMPLE is replaced by the sample name.
10. The final step of the preprocessing is to collate the data generated for each sample separately and calculate the trimmed mean of M-values (TMM) normalization factors using the edgeR package [29], which are needed to enable comparison of read counts between samples. This process is performed by the “04_read_analysis.sh” script, which calls the “read_analysis.py” script. These scripts should not need to be edited.
11. Collation of all read data is then performed by running the command: sh 04_read_analysis.sh
12. The script will create five output files in the “results” directory: “norm.factors.matrix.txt” containing TMM between-sample normalization factors, “mapped.reads.matrix.txt” containing mapped read counts, “count.matrix.txt” containing read counts for each gene, “gene.lengths.matrix.txt” containing the length of each gene, and “fpkm.normed.matrix.txt” containing normalized FPKM expression values for each gene.
3.3 Generating Transcription Profiles
1. Once the RNA-seq data have been preprocessed, the next step is to generate transcription profiles for specified regions of the host genome, as well as any synthetic genetic constructs that might be contained on plasmids. This process is performed by the “05_transcription_profiles.sh” script, which should be edited such that calls to the “transcription_profile.py” script are made for each sample. Chromosomes for which profiles should be created are specified with the “-chroms” option.
2. Transcription profiles are then created by running the command: sh 05_transcription_profiles.sh
3. The script will create pairs of output files in the “results” directory for each sample. These files will be named using the formats “SAMPLE.fwd.norm.profiles.txt” and “SAMPLE.rev.norm.profiles.txt” where SAMPLE is replaced by the full name of the sample. These files contain transcriptional profiles for the forward and reverse strands of the regions specified in the user-provided BED file (see Subheading 3.1).
3.4 Analyzing Differential Gene Expression to Understand the Host Response
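Differential expression analysis builds on the per-gene read counts, gene lengths, and mapped-read totals written during preprocessing (Subheading 3.2). As a reminder of how these quantities relate, the sketch below applies the standard FPKM definition. It is an illustration under stated assumptions rather than the workflow's own code: the function name is hypothetical, and scaling the mapped-read total by a TMM factor is an assumption about how a between-sample normalization factor might enter the calculation.

```python
def fpkm(count, gene_length_bp, mapped_reads, norm_factor=1.0):
    """Fragments Per Kilobase of transcript per Million mapped reads.

    norm_factor rescales the library size; multiplying the mapped-read
    total by a TMM factor is an assumption made for this sketch.
    """
    if gene_length_bp <= 0 or mapped_reads <= 0 or norm_factor <= 0:
        raise ValueError("lengths, read totals and factors must be positive")
    effective_library = mapped_reads * norm_factor
    return count * 1e9 / (gene_length_bp * effective_library)


# Hypothetical gene: 500 fragments, 1000 bp long, 1 million mapped reads.
example = fpkm(500, 1000, 1_000_000)
print(example)
```

Because both gene length and sequencing depth appear in the denominator, doubling either one halves the reported value for the same raw count.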
1. Synthetic circuits can impart a significant burden on a host cell, which is often manifested by changes in gene expression. Differential gene expression analysis allows for shifts in expression to be quantified in a robust manner, correcting for potential between-sample variations due to differences in sequencing depth. This analysis is performed by the “06_de_analysis.sh” script, which should be edited to call the “de_analysis.py” script for user-specified sets of samples to compare. For example, a user can use the options “-group1 1,2,3 -group2 4,5,6” to compare differences in gene expression between samples {1, 2, 3} and {4, 5, 6}. The numbers correspond to the sample in that row of the “data/settings.txt” file. The “-output_prefix” option can be used to provide a filename prefix for the output file containing the results. This enables multiple differential gene expression analyses to be performed simultaneously.
2. Differential gene expression analysis is then performed by running the command: sh 06_de_analysis.sh
3. The script will create output files in the “results” directory for each analysis performed. These will be named in the format “PREFIX.de.analysis.txt” where PREFIX is replaced by the user-provided “-output_prefix” in the “06_de_analysis.sh” script.
3.5 Characterizing Promoters and Terminators
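The part characterization in this subheading compares the height of a transcription profile just before and just after each part. A minimal sketch of that comparison is given here. It is not the code in “part_profile_analysis.py”: the function names, the fixed averaging window, and the made-up profile are all assumptions, and the definitions used (a difference for promoters, a fold reduction for terminators) are one common choice that may differ from the published implementation.

```python
def mean(values):
    return sum(values) / len(values)


def flanking_levels(profile, start, end, window=10):
    """Mean profile height just before and just after a part.

    profile is a list of per-nucleotide values; start and end are
    0-based positions of the part, with end exclusive.
    """
    upstream = profile[max(0, start - window):start]
    downstream = profile[end:end + window]
    if not upstream or not downstream:
        raise ValueError("part lies too close to the edge of the profile")
    return mean(upstream), mean(downstream)


def promoter_strength(profile, start, end, window=10):
    """Increase in profile height across a promoter (assumed definition)."""
    before, after = flanking_levels(profile, start, end, window)
    return after - before


def terminator_strength(profile, start, end, window=10):
    """Fold reduction in profile height across a terminator (assumed)."""
    before, after = flanking_levels(profile, start, end, window)
    return float("inf") if after == 0 else before / after


# Made-up profile: promoter at 20-25, transcribed region, terminator at 45-50.
profile = [2.0] * 25 + [10.0] * 25 + [1.0] * 20
p_strength = promoter_strength(profile, 20, 25)
t_strength = terminator_strength(profile, 45, 50)
print(p_strength, t_strength)
```

Averaging over a window rather than reading single positions reduces the influence of nucleotide-level noise in the profile, at the cost of requiring that no other part lies within the window.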
1. To characterize the performance of promoter and terminator parts, we analyze changes in a transcription profile (i.e., RNAP flux) from the start to the end of the part (see Fig. 2) [19]. To perform this task, the “07_part_analysis.sh” script makes calls to the “part_profile_analysis.py” script. These scripts load and analyze each promoter and terminator part in the workflow’s GFF file for every sample in the “data/settings.txt” file. These scripts should not need to be edited.
2. Genetic part characterization is performed by running the command: sh 07_part_analysis.sh
3. The script will create two output files in the “results” directory: “promoter.profile.perf.txt” containing estimates of promoter strengths and “terminator.profile.perf.txt” containing terminator efficiencies calculated from the transcription profiles (see Fig. 2).
3.6 Quantifying the Response Function of Genetic Devices
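The Hill-like response functions fitted in this subheading relate an input promoter activity to an output promoter activity, both taken from Subheading 3.5. The sketch below shows one commonly used parameterization together with a deliberately simple fit. Everything in it is an assumption made for illustration: the function names, the parameter grid, and the synthetic data are hypothetical, and a coarse grid search stands in for whatever optimizer “promoter_fitting.py” actually uses.

```python
import math


def hill_repress(x, ymin, ymax, K, n):
    """Output falls from ymax towards ymin as the input x increases."""
    return ymin + (ymax - ymin) * K**n / (K**n + x**n)


def hill_activate(x, ymin, ymax, K, n):
    """Output rises from ymin towards ymax as the input x increases."""
    return ymin + (ymax - ymin) * x**n / (K**n + x**n)


def fit_hill(xs, ys, model, ymin=None, ymax=None):
    """Coarse grid search for K and n, minimizing squared log10 error.

    ymin and ymax default to the smallest and largest observed outputs.
    All outputs must be positive for the logarithm to be defined.
    """
    ymin = min(ys) if ymin is None else ymin
    ymax = max(ys) if ymax is None else ymax
    best = None
    for i in range(-20, 21):            # K from 0.01 to 100 (log-spaced)
        K = 10 ** (i / 10)
        for j in range(10, 41):         # n from 1.0 to 4.0 in steps of 0.1
            n = j / 10
            error = sum(
                (math.log10(model(x, ymin, ymax, K, n)) - math.log10(y)) ** 2
                for x, y in zip(xs, ys)
            )
            if best is None or error < best[0]:
                best = (error, K, n)
    return {"ymin": ymin, "ymax": ymax, "K": best[1], "n": best[2]}


# Synthetic (made-up) data generated from a repressing response function.
inputs = [0.01, 0.1, 1.0, 10.0, 100.0]
outputs = [hill_repress(x, 0.1, 10.0, 1.0, 2.0) for x in inputs]
fit = fit_hill(inputs, outputs, hill_repress, ymin=0.1, ymax=10.0)
print(fit)
```

The synthetic inputs span four orders of magnitude on purpose. If the sampled states covered only the saturated or only the unsaturated part of the curve, many combinations of K and n would fit equally well.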
1. In addition to measuring the performance of genetic parts in isolation, we can also infer the functional response of many parts that work together in concert as a genetic device. Examples include sensor modules and genetic logic gates in which promoters act as inputs and outputs. To characterize these types of genetic device, we fit a steady-state response function to capture how the input and output promoter activities (calculated in Subheading 3.5) vary together across all states of the system. It should be noted that when characterizing genetic devices, it is essential that the samples taken span the full range of possible inputs the system may be exposed to. This ensures that inputs vary over their full range and improves the fitting of a response function. In this workflow, we allow for genetic devices that have activating and repressing Hill-like response functions. The fitting of the response function to experimental data is performed by the “08_promoter_fitting.sh” script. This calls the “promoter_fitting.py” script for each set of samples corresponding to a particular condition. The script will need to be updated for the samples to be processed. If, for example, you have assayed a circuit in two separate types of growth media, then the samples for one medium should be fitted separately from those for the other. This will, therefore, require two calls to the “promoter_fitting.py” script with the appropriate samples given as arguments to the “-samples” option.
2. Genetic device characterization is performed by running the command: sh 08_promoter_fitting.sh
3. The script will create the “fitted.promoter.perf.txt” output file in the “results” directory. This contains fitted response functions for each genetic device (see Fig. 3).
3.7 Removing Temporary Files and Logs
1. Once the complete workflow has been run and all required analysis performed, a clean-up step can be used to remove all temporary files and logs. This will ensure that any generated results remain untouched but will not allow for intermediate steps to be rerun out of order (some of the temporary files are necessary for many of the analyses). 2. The clean-up step is performed by the “09_clean_up.sh” script. Before running, this file must be updated to include entries to delete all contents from the “tmp” and “logs” directories (including any sub-directories). Once edited, the script can be executed using: sh 09_clean_up.sh
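The effect of the clean-up step can be pictured with a short sketch. This is not the content of “09_clean_up.sh”, which is a shell script that must be edited by hand as described above. The Python function below is a hypothetical equivalent that empties a directory while leaving the directory itself in place, and the workflow path in the usage comment is an assumption.

```python
import shutil
from pathlib import Path


def empty_directory(directory):
    """Delete everything inside `directory` but keep the directory itself.

    Returns the number of top-level entries that were removed.
    """
    removed = 0
    for entry in Path(directory).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # sub-directories, e.g. one per sample
        else:
            entry.unlink()         # plain files and symbolic links
        removed += 1
    return removed


# Hypothetical usage from inside a workflow directory:
# for name in ("tmp", "logs"):
#     empty_directory(Path("my_workflow") / name)
```

As with the shell script, the deletion cannot be undone, so it is worth confirming that everything needed has been written to the “results” directory first.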
4 Notes
1. This workflow has been tested on Linux and macOS operating systems and assumes that a UNIX-compatible command prompt running a standard shell (e.g., sh, bash, or zsh) is available. For Windows users, we recommend installing the Windows Subsystem for Linux (WSL), which will provide access to the required command prompt that is able to execute the scripts in the workflow. This subsystem will require all the prerequisite tools to be installed and working (see Subheading 2 for details).
Acknowledgments
D.V. and Z.I. were supported by the EU H2020 SynCrop European Training Network (grant 764591). T.E.G. was supported by BrisSynBio, a BBSRC/EPSRC Synthetic Biology Research Centre (grant BB/L01386X/1) and a Royal Society University Research Fellowship (grant UF160357).
References
1. Greco FV, Tarnowski MJ, Gorochowski TE (2019) Living computers powered by biochemistry. Biochemist 41:14–18
2. Brophy JAN, Voigt CA (2014) Principles of genetic circuit design. Nat Methods 11:508
3. Kosuri S et al (2013) Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc Natl Acad Sci U S A 110(34):14024
4. Mutalik VK et al (2013) Precise and reliable gene expression via standard transcription and translation initiation elements. Nat Methods 10:354
5. Schmidl SR et al (2019) Rewiring bacterial two-component systems by modular DNA-binding domain swapping. Nat Chem Biol 15(7):690–698
6. Scott SR, Hasty J (2016) Quorum sensing communication modules for microbial consortia. ACS Synth Biol 5(9):969–977
7. Gorochowski TE, Avcilar-Kucukgoze I, Bovenberg RAL, Roubos JA, Ignatova Z (2016) A minimal model of ribosome allocation dynamics captures trade-offs in expression between endogenous and synthetic genes. ACS Synth Biol 5(7):710–720
8. Gyorgy A et al (2015) Isocost lines describe the cellular economy of genetic circuits. Biophys J 109(3):639–646
9. Qian Y, Huang H-H, Jiménez JI, Del Vecchio D (2017) Resource competition shapes the response of genetic circuits. ACS Synth Biol 6(7):1263–1272
10. Cardinale S, Arkin AP (2012) Contextualizing context for synthetic biology – identifying causes of failure of synthetic biological systems. Biotechnol J 7(7):856–866
11. Nielsen AAK et al (2016) Genetic circuit design automation. Science 352(6281):aac7341
12. Stanton BC, Nielsen AAK, Tamsir A, Clancy K, Peterson T, Voigt CA (2014) Genomic mining of prokaryotic repressors for orthogonal logic gates. Nat Chem Biol 10(2):99–105
13. Woodruff LBA et al (2016) Registry in a tube: multiplexed pools of retrievable parts for genetic design space exploration. Nucleic Acids Res 45(3):1553–1565
14. Canton B, Labno A, Endy D (2008) Refinement and standardization of synthetic biological parts and devices. Nat Biotechnol 26:787
15. Kelly JR et al (2009) Measuring the activity of BioBrick promoters using an in vivo reference standard. J Biol Eng 3(1):4
16. Kleeman B et al (2018) A guide to choosing fluorescent protein combinations for flow cytometric analysis based on spectral overlap. Cytometry A 93(5):556–562
17. Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 17:333
18. Stark R, Grzelak M, Hadfield J (2019) RNA sequencing: the teenage years. Nat Rev Genet 20(11):631–656
19. Gorochowski TE et al (2017) Genetic circuit characterization and debugging using RNA-seq. Mol Syst Biol 13(11):952
20. Ingolia NT (2014) Ribosome profiling: new views of translation, from single codons to genome scale. Nat Rev Genet 15:205
21. Gorochowski TE, Chelysheva I, Eriksen M, Nair P, Pedersen S, Ignatova Z (2019) Absolute quantification of translational regulation and burden using combined sequencing approaches. Mol Syst Biol 15(5):e8719
22. Park PJ (2009) ChIP–seq: advantages and challenges of a maturing technology. Nat Rev Genet 10(10):669–680
23. Del Campo C, Bartholomäus A, Fedyunin I, Ignatova Z (2015) Secondary structure across the bacterial transcriptome reveals versatile roles in mRNA regulation and function. PLoS Genet 11(10):e1005613
24. Strobel EJ, Yu AM, Lucks JB (2018) High-throughput determination of RNA structures. Nat Rev Genet 19(10):615–634
25. Conway T et al (2014) Unprecedented high-resolution view of bacterial operon architecture revealed by RNA sequencing. MBio 5(4):e01442–e01414
26. Shishkin AA et al (2015) Simultaneous generation of many RNA-seq libraries in a single reaction. Nat Methods 12:323
27. Sanner MF (1999) Python: a programming language for software integration and development. J Mol Graph Model 17(1):57–61
28. R. C. Team (2013) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
29. Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139–140
30. Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25(14):1754–1760
31. Li H et al (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25(16):2078–2079
32. Anders S, Pyl PT, Huber W (2014) HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics 31(2):166–169
33. Quinlan AR, Hall IM (2010) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6):841–842
34. Beal J et al (2019) Communicating structure and function in synthetic biology diagrams. ACS Synth Biol 8(8):1818–1825
35. Der BS et al (2017) DNAplotlib: programmable visualization of genetic designs and associated data. ACS Synth Biol 6(7):1115–1119
36. Bartoli V, Dixon DOR, Gorochowski TE (2018) Automated visualization of genetic designs using DNAplotlib. In: Braman JC (ed) Synthetic biology: methods and protocols. Springer New York, New York, NY, pp 399–409
Chapter 9
Steady-State Cell-Free Gene Expression with Microfluidic Chemostats
Nadanai Laohakunakorn, Barbora Lavickova, Zoe Swank, Julie Laurent, and Sebastian J. Maerkl
Abstract
Cell-free synthetic biology offers an approach to building and testing gene circuits in a simplified environment free from the complexity of a living cell. Recent advances in microfluidic devices allowed cell-free reactions to run under nonequilibrium, steady-state conditions enabling the implementation of dynamic gene regulatory circuits in vitro. In this chapter, we present a detailed protocol to fabricate a microfluidic chemostat device which enables such an operation, detailing essential steps in photolithography, soft lithography, and hardware setup.
Key words Microfluidics, Cell-free, Synthetic biology, Steady-state gene expression
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_9, © Springer Science+Business Media, LLC, part of Springer Nature 2021
1 Introduction
One of the enduring challenges in synthetic biology today is the overwhelming difficulty of predictive forward-engineering, despite major efforts to characterize, standardize, and mathematically model synthetic biological parts and systems [1]. Even if parts such as promoters and regulators are initially well-characterized, combining them together into larger subsystems typically changes the context of the parts as well as the host cell, resulting in diminished predictive accuracy, and in some cases, a loss of the original function altogether. Functional designs are therefore usually not developed in a purely rational manner, but require rounds of empirical design-build-test cycles. While this approach can certainly yield functional designs, it is preferable to ultimately develop more efficient and rational ways of engineering gene circuits.
Within synthetic biology, the adoption of cell-free systems has become increasingly widespread [2]. From an engineering perspective, they behave as a very simplified “host cell,” providing a constant and controllable environment in which to build synthetic gene networks. Cell-free systems are thus well suited for rational, bottom-up engineering of biomolecular systems [3, 4]. Furthermore, the functionality of cell-free systems can be expanded by inclusion of additional components [5], and they provide a system for quantitative analysis including mRNA and protein concentrations [6, 7]. A second key benefit is that their ease of preparation and scalability also accelerate design-build-test cycles, resulting in their adoption as an efficient rapid prototyping platform. Both lysate [8, 9] and recombinant [10] cell-free reaction systems can now be readily generated using standard laboratory equipment at reasonably low costs.
Microfluidics have allowed these benefits of cell-free synthetic biology to be more fully realized [11]. By increasing the throughput, lowering reagent consumption and providing control and quantitative monitoring of thousands of reactions in parallel, they have enabled precise characterization of cell-free gene circuits both in integrated chips [12, 23] and in encapsulated droplets [13, 14].
Batch cell-free reactions typically run to chemical equilibrium as substrates are exhausted, reaction products accumulate, and enzymatic machinery degrades. To maintain a more life-like nonequilibrium steady state, large-scale continuous exchange or continuous flow reactors have been used to feed the reaction with small molecules and wash away products through ultrafiltration membranes [15]. At the microfluidic level, microchemostat devices have been developed which replenish not only substrates but also the enzymatic machinery, while at the same time diluting away reaction products [16, 17]. These microchemostats enable long-term steady-state reactions, and also allow for the investigation of biologically relevant dynamical behaviors such as oscillations [16, 17] and pattern formation [18].
In this chapter, we describe the entire process of designing, fabricating, and operating a microfluidic chemostat device. The chip we chose as an example is a revised and simplified version of the microchemostat presented in Niederholtmeyer et al. 2013 [16], and is shown in Fig. 1. The operation of the device first involves selecting an input solution using the multiplexer unit and directing it to one of eight separate reactor rings. Each reactor contains four output ports, located at specific positions around the ring. Opening these ports exchanges a fixed fraction of the reactor volume, with the exact fraction depending on the position of the port. The placement of these ports allows the reactor to be loaded with a reaction of fixed composition, and importantly also allows a dilution step to occur which preserves this composition. In between dilution steps, the reaction is mixed using a peristaltic pump. Full details are given in Subheading 3.5.

Fig. 1 (a) A two-layer microchemostat design consists of a thin control layer sandwiched between a glass slide and a thicker flow layer. (b) Applying pressure to channels in the control layer pushes up valves which close off channels in the flow layer. (c) The chip contains eight individual chemostat reactors. Four control lines serve as dual-function valves and peristaltic pump. Actuating these lines sequentially mixes the liquid inside the reactors
Labels in the figure: flow layer, 14 µm channel height (rounded), 100 µm width; control layer, 40 µm channel height, 100 or 40 µm width; glass slide; unpressurized; pressurized; 9-input multiplexer; dual-function valve and peristaltic pump; individual microreactor; outlet; scale bar 2 mm.

We describe the photolithographic steps required to print the chip design on a chrome mask, and subsequently transfer it onto silicon wafers. Once fabricated, these silicon molds can be used for multiple rounds of soft lithography, in which they serve to cast polydimethylsiloxane (PDMS) devices. Finally, the hardware required for operating the chip is described, and a standard experiment outlined. Related protocols are available in the literature [19, 20].
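The idea that repeatedly exchanging a fixed fraction of the reactor volume leads to a steady state can be made concrete with a toy calculation. The sketch below is purely illustrative and rests on assumptions that are not taken from this chapter: a product is synthesized at a constant amount per interval between dilutions, the incoming solution contains none of it, and the function names and numbers are hypothetical.

```python
def simulate_dilution(fraction, synthesis_per_step, n_steps, c0=0.0):
    """Product concentration after each synthesis-plus-dilution cycle."""
    c = c0
    history = [c]
    for _ in range(n_steps):
        c += synthesis_per_step   # product made between two dilution steps
        c *= (1.0 - fraction)     # a fixed fraction is replaced by fresh solution
        history.append(c)
    return history


def steady_state(fraction, synthesis_per_step):
    """Fixed point of the cycle: c = (c + s) * (1 - f), so c = s * (1 - f) / f."""
    return synthesis_per_step * (1.0 - fraction) / fraction


# Hypothetical numbers: 20% of the volume exchanged per step, 1 unit made per step.
history = simulate_dilution(0.2, 1.0, 200)
target = steady_state(0.2, 1.0)
print(history[-1], target)
```

In this toy model the steady-state level depends only on the ratio of synthesis to dilution, which is why a larger exchanged fraction or more frequent dilution lowers the plateau.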
2 Materials
The photolithography steps were carried out in a Class 100 clean room at EPFL. Soft lithography was done in a dedicated space in a standard wet lab. Specialized machines, consumables, and chemicals are listed below.
2.1 Photolithography Machines
1. VPG200 photoresist laser writer (Heidelberg Instruments Mikrotechnik GmbH).
2. HMR900 mask processor (Hamatech APE GmbH).
3. Optispin SB20 spin coater and VB20 hotplate (ATMsse GmbH).
4. MJB4 mask aligner (Süss MicroTec AG).
5. Tepla 300 plasma stripper (PVA Tepla AG).
6. LSM250 spin coater and HP200 hotplate (Sawatec AG).
7. AccuPlate thermal accumulator and hot plate system (Detlef Gestigkeit).
2.2 Photolithography Consumables
1. AZ 9260 positive photoresist (MicroChemicals GmbH).
2. GM1070-SU8 negative photoresist (Gersteltec).
3. 1-methoxy-2-propyl-acetate (PGMEA) developer (Sigma).
4. AZ 400 K developer (Merck).
5. AZ 351B developer (Merck).
6. Cr01 chrome etchant (Technic).
7. Hexa-methyl-disilazane (HMDS) primer (Technic).
8. Silicon wafers, diameter 100 ± 0.5 mm, thickness 525 ± 25 μm, P-type (boron-doped), and resistivity 0.1–100 Ωcm (Siegert).
9. SLM5 5″ blank chrome mask (Nanofilm).
2.3 Soft Lithography Machines
1. ARE-250 centrifugal mixer (Thinky).
2. SCS G3P-8 spin coater (Specialty Coating Systems Inc.).
3. Schmidt Press manual hole puncher and 21-gauge (OD 0.04″) pins (Technical Innovations, Inc.).
4. Diener Femto 40 kHz low pressure plasma oven with O2 supply (Diener electronic GmbH + Co. KG).
5. Universal Oven UF110, 108 L (Memmert).
6. SZX10 dissection microscope with DF PLANO 1.25× objective and KL 1500 LCD light source (Olympus).
2.4 Soft Lithography Consumables
1. Trimethylchlorosilane (Sigma).
2. Sylgard 184 polydimethylsiloxane (PDMS) elastomer and curing agent (Dow Corning).
3. Glass slides 76 × 26 × 1 mm, 631–1550 (VWR).
2.5 Microfluidic Hardware
1. 12-station aluminium pneumatic manifold with 24 V 3-way normally open solenoid valves (S10MM-31-24-2/A, Pneumadyne).
2. Polycarbonate manual luer manifold (Cole-Parmer).
3. Custom relay circuit board (see Note 1).
4. Type 10, 2–60 psi and 2–25 psi pressure regulators (Marsh Bellofram).
5. 0.1–3 bar pressure gauge (Riegler & Co. KG).
2.6 Microfluidic Connectors
1. Synflex 1201-M06 polyethylene (PE) tubing, OD 6 mm, ID 4 mm (Eaton).
2. Low-density polyethylene (PE-LD) tubing, OD 1/8″, ID 1/16″ (Tuyau).
3. Tygon tubing, OD 0.06″, ID 0.02″ (Cole-Parmer).
4. Fluorinated ethylene-propylene (FEP) tubing, OD 1/16″, ID 1/32″ (Upchurch).
5. Polyetheretherketone (PEEK) tubing, OD 1/32″, ID 0.18 mm (Vici).
6. Luer stubs 12 mm, 23 and 20 ga.
7. Male-to-male and 1/16″ barb to male luer adaptors.
8. Stainless steel connecting pins, OD 0.65 mm, ID 0.35 mm, 8 mm (Unimed).
9. Brass Series G pneumatic fittings (Serto AG).
10. Blue Series pneumatic fittings (Riegler & Co. KG).
2.7 Microscope Hardware
1. Ti2 Eclipse Inverted Microscope (Nikon).
2. Objectives: CFI Achro 4× NA 0.1 (Nikon); CFI S Plan Fluor 20× NA 0.45 ELWD DIC N1 (Nikon).
3. Filters: F36–504 mCherry HC filter set (Semrock); FITC (Nikon).
4. Microscope enclosure and heater (Okolab).
5. Sola SM II Light Engine (Lumencor).
6. Orca-Flash 4.0 V3 Digital CMOS Camera (Hamamatsu).
2.8 Software
1. AutoCAD 2019 (Autodesk).
2. CleWin (WieWeb).
3. LabView 2018 (National Instruments).
4. Matlab 2019 (Mathworks).
2.9 Experimental Reagents
1. TX-TL cell-free extract, ribosomes and energy solution, prepared as in [13]. 2. DNA template, prepared as in [14].
3 Methods
3.1 Design of Microfluidic Devices
1. Design the device (see Note 2) in AutoCAD 2019 or other software with similar functionality. A specific example is shown in Fig. 1, and other designs are available on our webpage (see Note 3). Export the final design as a .dxf file.
2. Using CleWin, convert the designs to a machine-compatible .cif file ready for photomask fabrication.
3. During curing, the PDMS layers will differentially shrink, with the thicker flow layer shrinking more than the thinner control layer, which remains attached to the rigid mold. Thus, it is crucial to enlarge the entire flow layer design by 1.5%. This can be done in CleWin during the conversion.
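To give a feel for what a 1.5% enlargement means at the scale of a chip, the sketch below scales a set of design coordinates. It is only an illustration: in practice the scaling is applied in CleWin as described in step 3, and the function name, the choice of micrometres as units, and uniform scaling about the design origin are assumptions made here.

```python
def scale_points(points, factor=1.015, center=(0.0, 0.0)):
    """Scale (x, y) coordinates uniformly about a chosen center point."""
    cx, cy = center
    return [(cx + (x - cx) * factor, cy + (y - cy) * factor) for x, y in points]


# Hypothetical flow-layer features, in micrometres from the design origin.
features = [(0.0, 0.0), (20000.0, 0.0), (20000.0, 10000.0)]
scaled = scale_points(features)
offsets = [round(sx - x, 3) for (sx, _), (x, _) in zip(scaled, features)]
print(offsets)
```

A feature 20 mm from the origin moves by 300 µm, several times the 100 µm channel width given in Fig. 1, which suggests why leaving out the correction would make it hard to register the flow layer against the control layer across the whole chip.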
3.2 Photolithography for Mask and Wafer Fabrication
3.2.1 Mask Fabrication
1. Expose chrome masks with the VPG200 laser writer, using a 20 mm write lens (see Note 4) and 48% intensity. Make sure that the polarity and mirroring of the mask are correct (see Note 5). 2. Next, process the exposed masks using the HMR900 mask processor. This involves the following automated steps: 3. First purge the machine with deionized (DI) water. 4. Then develop for 100 s with a diluted developer mixture (AZ 351B:DI water in the ratio 1:3.75) and rinse with DI water. 5. Etch through the chrome layer for 60 s using the Cr01 etchant, and rinse. 6. Finally, strip the photoresist using the AZ 400 K developer for 35 s, followed by a final rinse and drying with CO2. The completed masks should be completely dry before use.
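Dilutions such as the 1:3.75 developer mixture in step 4 are given as ratios of parts, so the volumes depend on the total amount needed. The helper below is a small hypothetical sketch of that arithmetic. The 475 mL total is an arbitrary example, not a volume specified in this protocol, and on the HMR900 the mixing is part of the automated process.

```python
def split_by_ratio(total, parts):
    """Divide a total amount among components given as relative parts."""
    whole = sum(parts)
    return [total * p / whole for p in parts]


# AZ 351B : DI water = 1 : 3.75, for a hypothetical total of 475 mL.
developer_ml, water_ml = split_by_ratio(475.0, (1, 3.75))
print(round(developer_ml, 1), round(water_ml, 1))
```

The same helper applies to any other mixture in this chapter that is specified as a ratio of parts.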
3.2.2 Flow Mold Fabrication
1. Prime a clean Si wafer with hexamethyldisilazane (HMDS) (see Note 6) for 10 s in vacuum, using the VB20 hotplate. 2. Transfer the wafer onto the Optispin SB20 spin coater and dispense a few ml of positive-resist AZ9260 onto the center of the wafer, taking care to avoid bubbles (see Notes 7 and 8). 3. Spin coat at 920 rpm for 100 s, followed by 60 s relaxation at 0 rpm. This deposits a 14-μm layer of photoresist on the surface of the wafer.
4. When the spin coating has finished, immediately transfer the wafer to a preheated hotplate, and “softbake” for exactly 6 min at 115 °C.
5. Transfer the wafer to an opaque storage box and allow it to rehydrate for a minimum of 1 h (see Note 9).
6. Load the appropriate chrome mask onto the MJB4 mask aligner, and expose for 2 cycles at 18 s per cycle, with a waiting time of 10–15 s between each cycle, using the Hg-i line (365 nm) at 20 mW/cm2 (see Notes 10 and 11). Use the following parameters: expose type = hard, alignment gap = 30, WEC type = cont, N2 purge = NO, and WEC-offset = OFF.
7. Develop immediately (maximal waiting time is 1 h) by transferring the wafer to a bath of diluted AZ 400 K developer (1:3 developer:DI water). Develop face-up and gently agitate the wafer in the bath for 10 min (see Note 12).
8. Rinse with DI water, then carefully but rapidly dry the wafer with N2, and inspect features under a microscope. If photoresist residues remain, develop further until all the residues are removed and repeat the cleaning and drying.
9. Finally, transfer the wafer to the AccuPlate hotplate, and carry out a “reflow” bake using the following program to round off features (see Note 13): 1 h ramp up to 170 °C, 2 h at 170 °C, and 1 h ramp down to room temperature.
3.2.3 Control Mold Fabrication
1. Clean the Si wafer with 2.45 GHz O2 plasma in the Tepla 300 Plasma Stripper, using 500 W for 7 min and 400 ml/min of O2.
2. Transfer the wafer onto the LSM250 spin coater and dispense a few ml of negative resist GM1070-SU8 onto the center of the wafer, taking care to avoid bubbles.
3. Spin coat a 40-μm layer of photoresist onto the wafer using the following program: 5 s/0–500 rpm, 5 s/500 rpm, 21 s/500–1933 rpm, 40 s/1933 rpm, 1 s/1933–2933 rpm, 1 s/2933–1933 rpm, 5 s/1933 rpm, and 26 s/1933–0 rpm.
4. When the spin coating has finished, immediately transfer the wafer to the hotplate and carry out an initial relaxation followed by a softbake using the following program (see Note 14): 30 min at 30 °C, then 3000 s ramp 30 °C to 130 °C, 300 s at 130 °C, and then 3000 s ramp 130 °C to 30 °C.
5. Load the appropriate chrome mask onto the MJB4 mask aligner and expose for 1 cycle at 16 s, using the Hg-i line (365 nm) at 20 mW/cm2. Use the following parameters: expose type = soft, alignment gap = 30, WEC type = cont, N2 purge = NO, and WEC-offset = OFF.
6. Transfer the wafer to the HP200 hotplate for a postexposure bake using the following program: 2400 s ramp 30 °C to 90 °C, 2400 s at 90 °C, 2700 s at 60 °C, and 2700 s at 30 °C.
7. Transfer the wafer to an opaque storage box and wait from 1 h to overnight before development.
8. Develop by transferring the wafer to a bath of propylene-glycol-methyl-ether-acetate (PGMEA) developer (see Note 15). Gently agitate the wafer in the bath for 2 min before transferring to a bath of new developer for a further 1 min.
9. Rinse with isopropanol. If a reaction is visible (white residues appear), return the wafer to PGMEA for 30–60 s before rinsing with isopropanol again. Let it dry naturally.
10. Inspect features under a microscope and carefully develop further if needed. Avoid overdevelopment, which can lead to breaking of features.
11. Finally, transfer to the hotplate and carry out a “hardbake” using the following program: 30 min ramp to 135 °C, 2 h at 135 °C, and then 30 min ramp down to room temperature.
3.3 Soft Lithography for Device Fabrication
3.3.1 Silanization of Wafers
1. Before first use, place wafers inside a sealed box with a few drops (0.5 mL) of trimethylchlorosilane and incubate for at least 12 h. Repeat the silanization before each use for 10 min.
3.3.2 Casting and Curing of PDMS Devices
1. In two plastic cups, weigh out and add PDMS elastomer and curing agent in a ratio of 5:1 (50 g:10 g) for the flow layer and 20:1 (20 g:1 g) for the control layer.
2. Defoam the mixture using the ARE-250 centrifugal mixer, by mixing at 2000 rpm for 1 min followed by defoaming at 2200 rpm for 2 min. These values correspond to machine settings specific to the ARE-250, which is not a standard centrifuge but a “planetary” mixer, i.e., the samples spin on a platform which itself revolves around a central axis.
3. Clean both flow and control wafers using pressurised N2.
4. Put the flow layer wafer on aluminium foil inside a glass petri dish. Make sure the foil covers the dish and contains the PDMS fully. Pour all of the 5:1 PDMS mixture on top of the wafer and place the dish inside a vacuum desiccator for 40 min to degas the mixture.
5. Put the control layer wafer in the SCS G3P-8 spin coater, and carefully pour a few mL of the 20:1 PDMS onto the center of the wafer. To coat the wafer, run the following program: Step 0, rpm = 0, disp = 2, ramp = 0.0, dwell = 0; Step 1, rpm = 1420, disp = none, ramp = 20.0, dwell = 35; Step 2, rpm = 100, disp = none, ramp = 20.0, dwell = 1; and Step 3, rpm = 100, disp = none, ramp = 1.0, dwell = 0.
6. After coating, the PDMS layer will be uneven due to the high 40-μm features. Place the wafer on aluminium foil in a second petri dish, cover to protect from dust, and set aside on the bench for 40 min.
7. Then bake both flow and control wafers in an oven at 80 °C. The flow layer is baked for 20 min, and the control layer for 25 min. Timings for this step must be exact (see Note 16).
8. Remove the wafers from the oven. Using a sharp scalpel, cut out each design from the flow layer, and immediately place on top of the corresponding control layer region, roughly aligning the two layers.
9. Once all the devices have been roughly aligned in this way, transfer the control wafer to a stereo dissection microscope, and align the two layers by manually lifting off and carefully placing the top layer in its precise position (see Note 17).
10. Put the aligned devices back into the oven at 80 °C and bake for a minimum of 1 h 30 min.
11. Cut the multilayer devices off the wafer using a scalpel.
12. Using the hole puncher, punch through all the channel inlets.
13. Protect the PDMS surfaces from dust using Scotch tape. The completed PDMS devices can now be stored in a clean petri dish until the next step.
3.3.3 Bonding of PDMS Devices to a Glass Slide
1. Clean glass slides using pressurised N2.
2. Remove any residual dust from the slide and feature surface of the PDMS device using Scotch tape (see Note 18).
3. Switch on the Femto plasma oven and place the slide and PDMS device bonding-side up.
4. Pump out the chamber for at least 15 min to ensure a clean vacuum environment.
5. Switch on the O2 for 2 min at a flow rate of 25 sccm and 0.1 bar, then apply 30 s of plasma at 100% power (which corresponds to a plasma of 40 kHz and 100 W; see Note 19).
6. Immediately ventilate the plasma byproducts before opening the chamber. Put the PDMS and glass together and manually apply even, moderate pressure for a few seconds (see Note 20). Then, put the bonded device into an oven at 80 °C for 1 h to overnight.
7. The completed devices can finally be stored at room temperature until use (see Note 21).
[Figure 2: schematic. Labelled components: compressed air supply; regulator; control branch with electric manifold and solenoid valve (to relay board + PC); flow branch with manual manifold; PE tubing (OD 6 mm) with PE tubing male luer; 23 ga and 20 ga luer stubs; male-to-male luer adaptor; water-filled control line; connector pins (ID 0.35 mm); PEEK tubing (ID 0.18 mm); reagent line; buffers and TX-TL reagents; connections to chip.]
Fig. 2 Pneumatic connections for the setup. The compressed air supply is split into two independently regulated branches. Pressure in the control branch is switched using electric valves while the flow branch is controlled manually. Buffers and other input solutions are stored in Tygon tubing, while cell-free (TX-TL) reagents are stored in FEP–PEEK tubing
3.4 Hardware Setup
3.4.1 Regulation of Control Layer Pressure
Air pressure is supplied to the setup using polyethylene (PE) tubing connected directly to the laboratory compressed air supply. A schematic of the setup’s pneumatic connections is shown in Fig. 2.
1. Connect one branch of the input air supply to a regulator, and direct the regulated output supply to the aluminium electric manifold.
2. The electric manifold directs air pressure to the chip’s control lines. Attach Tygon tubing (ID 0.02") to the manifold using appropriate adaptors as shown in Fig. 2. The tubing contains a 23 ga luer stub on one end (used for filling and connecting to the manifold) and a stainless steel connector pin on the other (used for connecting to the chip).
3. Plug the electric manifold into the relay board, which links via USB to a PC running control software written in LabVIEW. An example of the code and full documentation can be found online (see Note 22).
3.4.2 Regulation of Flow Layer Pressure
1. Connect the other branch of the input air supply to a regulator, and connect the regulated supply to the manual luer manifold. 2. Adjust the pressure as required (typically ~0.3 bar).
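The protocol quotes flow-layer pressure in bar (~0.3 bar) and control-layer pressure in psi (~10 psi for filling, ~20–30 psi for operation), so a regulator gauge in one unit may need converting to the other. The following is a minimal sketch of that conversion, assuming only the standard factors 1 psi = 6894.757 Pa and 1 bar = 100,000 Pa; the function names are illustrative and are not part of the authors' control software.

```python
PA_PER_PSI = 6894.757   # pascals per psi (standard conversion factor)
PA_PER_BAR = 100_000.0  # pascals per bar (by definition)


def psi_to_bar(psi: float) -> float:
    """Convert a gauge pressure from psi to bar."""
    return psi * PA_PER_PSI / PA_PER_BAR


def bar_to_psi(bar: float) -> float:
    """Convert a gauge pressure from bar to psi."""
    return bar * PA_PER_BAR / PA_PER_PSI


if __name__ == "__main__":
    # Flow-layer set point quoted in the protocol (bar).
    print(f"0.3 bar = {bar_to_psi(0.3):.2f} psi")
    # Control-layer set points quoted in the protocol (psi).
    for psi in (10, 20, 30):
        print(f"{psi} psi = {psi_to_bar(psi):.2f} bar")
```

On these factors the flow layer runs at roughly 4.4 psi, well below the control-layer operating pressure of roughly 1.4–2.1 bar.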
3.5
Device Operation
3.5.1 Filling Control Lines
1. Lower the control manifold pressure to ~10 psi.
2. Using the PC software, close all the control line valves.
3. Fill each Tygon line with deionised water (see Note 23) through the connecting pin, using a syringe attached to a luer stub.
4. Connect each line to the appropriate control channel inlet.
5. Once all the lines are connected, open all valves. This pressurizes the control channels, pushing air into the PDMS and allowing them to fill with water. Wait until the channels are completely filled with water, which can take up to 20 min. Slowly raise the pressure up to ~20–30 psi.
6. Visually inspect all the valves to check that they actuate fully.
3.5.2 Filling Flow Lines
1. Make sure the appropriate manual manifold valve is closed.
2. Basic reagents such as buffers and chemicals are held in ID 0.02" Tygon tubing. First, assemble the tubing, which consists of a length of Tygon with a 23 ga luer stub on one end and a connector pin on the other.
3. Attach a syringe to the luer stub and carefully draw up the required reagent into the tubing. Make sure there are no bubbles.
4. Attach the connector pin to the appropriate flow inlet, before removing the syringe and attaching the luer stub to the manual manifold.
5. Make sure valves are in the appropriate configuration on the chip before opening the flow manifold valve and allowing the reagent to fill the device. Typically, a pressure of ~0.3 bar is ideal for the flow lines.
6. For the cell-free extract, follow the previous steps, but instead draw up the solution into the FEP coil through the PEEK tubing. Attach the PEEK tubing directly to the chip.
7. An important requirement for long-term steady-state reactions is that the cell-free extract is kept separate from the energy and DNA solutions. If required, cooling elements can be added to further prevent degradation of the solutions [16, 20].
3.5.3 Cell-Free Expression
1. The device can be characterized as shown in Fig. 3.
2. A typical experimental program is shown in Fig. 4. First switch on the environmental chamber to 29 °C.
3. Load each reactor with cell-free extract, energy solution, and DNA in the ratio 40%, 40%, and 20%, respectively.
4. The reactor contents are mixed by actuating the four multifunction valves sequentially at a frequency of 20 Hz.
5. Dilution involves flowing cell-free extract, energy solution, and DNA into the reactors in the ratio 8%, 8%, and 4%, respectively. This corresponds to a 20% dilution of the reactor, which preserves the original reaction composition.
6. The dilution rate can be varied by adjusting the interval between dilution steps.
[Figure 3: panels a–f. Recoverable plot labels: panel titles “Loading” (solution A) and “Dilution” (solution B); axes YFP fluorescence [RFU], Time [s], Cycle, Experimentally-determined load %, and Theoretical load %; legends for ring number (rings 1–8), dilution % (4, 12, 20, 60), and chip number (1–4); linear fit y = 1.01x − 0.26, R² = 0.9996.]
Fig. 3 Basic operations and characterization of the chip. (a) Initial loading is achieved by flowing an input solution (solution A, green) first through one side of the reactor, then the other. (b) Dilution takes place by flushing an input solution (solution B, yellow) through different outlets. The dilution fraction is controlled by the geometric positioning of the outlets and is fixed for a given design. (c) After loading 20% of a reactor with YFP, actuating the peristaltic pump at 20 Hz mixes the solution in ~100 s. (d) This shows the fluorescence from all eight reactor rings, initially loaded with 20% YFP, and repeatedly diluted with buffer. (e) Experimentally determined dilution fraction for each of the eight reactors. (f) Experimentally determined load fraction vs theoretical load fraction for four different chips
7. Image the resulting fluorescence using the microscope setup. Software for the analysis, example images, and full documentation can be found online (see Note 24).
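The arithmetic behind steps 3–6 can be checked with a short calculation. The sketch below is illustrative only: it assumes ideal mixing and an exact dilution fraction, takes the 15 min dilution interval from the Fig. 4 caption, and uses function names that are hypothetical rather than taken from the authors' software. It shows that replacing 20% of the reactor with extract, energy solution, and DNA in the ratio 8%, 8%, and 4% leaves the 40%, 40%, and 20% composition unchanged, how a tracer washes out when diluted with buffer, and what first-order dilution rate the stepwise protocol is equivalent to.

```python
import math


def dilute(conc: float, fraction: float, inflow_conc: float = 0.0) -> float:
    """One dilution step: replace `fraction` of the reactor volume with inflow."""
    return conc * (1.0 - fraction) + inflow_conc * fraction


def washout(c0: float, fraction: float, n_steps: int) -> list:
    """Tracer concentration after each of n_steps dilutions with plain buffer."""
    concs = [c0]
    for _ in range(n_steps):
        concs.append(dilute(concs[-1], fraction))
    return concs


def effective_dilution_rate(fraction: float, interval_min: float) -> float:
    """Continuous first-order rate (per min) equivalent to stepwise dilution."""
    return -math.log(1.0 - fraction) / interval_min


if __name__ == "__main__":
    loading = {"extract": 0.40, "energy": 0.40, "DNA": 0.20}  # initial load
    step = {"extract": 0.08, "energy": 0.08, "DNA": 0.04}     # added per step
    f = sum(step.values())                                    # 20% dilution
    after = {k: loading[k] * (1.0 - f) + step[k] for k in loading}
    print("composition after one dilution step:", after)

    print("buffer washout:", washout(1.0, 0.20, 3))

    rate = effective_dilution_rate(0.20, 15.0)  # 15 min interval (Fig. 4)
    print(f"effective rate: {rate:.4f} per min, "
          f"half-life: {math.log(2) / rate:.1f} min")
```

Shortening the interval between dilution steps raises the effective rate proportionally, which is the knob described in step 6.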
[Figure 4: panels a–d. Recoverable plot labels: initial loading (Load A 40%, Load B 40%, Load C 20%); dilution step (Load A 8%, Load B 8%, Load C 4%); CFP, YFP, DNA-cy5, mCherry (tracer for solution A), and deGFP (steady-state expression) fluorescence [RFU] plotted against Time [hours]; legend for ring number (rings 1–8).]
Fig. 4 Typical experimental operation of the chip. (a) The chip is initially loaded with three solutions A–C (green, yellow, and blue) in the ratio 40%, 40%, and 20%, and (b) subsequently diluted with the same solutions in the ratio 8%, 8%, and 4%. (c) Carrying out this process using aqueous solutions of three different fluorescent tracers demonstrates that steady-state concentrations are maintained over many hours. (d) Steady-state cell-free expression can be achieved by adding as the three solutions cell-free lysate (solution A), energy solution (solution B), and DNA template (solution C). The lysate is labeled with an mCherry tracer to assess its concentration (left), while the reaction produces deGFP, which reaches a steady-state concentration when production and dilution rates are equal (right). Here, a dilution step was carried out every 15 min
4 Notes
1. A custom relay board is used to control the electric manifold actuation; any appropriate controller can be used in its place, for instance the 24-channel USB24PRMx (EasyDAQ).
2. Excellent guidance is available, e.g., [21].
3. Designs for microfluidic devices are available online at http://lbnc.epfl.ch/microfluidic_designs.html.
4. The 20 mm lens provides the highest write speed, taking ~4 min to write a 100 × 100 mm mask with 2 μm edge resolution and 1 mm stripe width. Higher resolutions are possible but not necessary for soft lithography.
5. This is the step most often done incorrectly. The flow layer uses positive-resist AZ and requires a DARK-mode mask. The control layer uses negative-resist SU8 and requires a CLEAR-mode mask. Finally, as the exposure is chrome-side-down, the masks must be MIRRORED AT Y.
6. HMDS priming enhances photoresist adhesion. Alternatively, the wafer can also be treated with O2 plasma or thermally dehydrated.
7. Pouring directly from the bottle introduces fewer bubbles than using a plastic pipettor.
8. Opening the cap of the AZ9260 bottle a few minutes before use, to allow the release of air bubbles, can also help minimize bubbles.
9. Homogeneous rehydration is important for efficient exposure, and the minimum rehydration time is a function of the photoresist thickness (5 μm, 8 min; 20 μm, 2 h).
10. The mercury lamp contains spectral lines at 365, 405, and 436 nm. On the MJB4 machine, the i-line filter is installed, which passes only the 365 nm line. Without the filter, the exposure is broadband. The exposure mode must be taken into account during exposure time calculations.
11. For 15 μm AZ9260, the recommended dose is 580 mJ/cm² for i-line exposure and 660 mJ/cm² for broadband.
12. The recommended development time is around 45 s per μm of AZ9260.
13. Rounded features are crucial for the flow layer as they allow valves to close completely.
14. The variable here is the ramp time, which depends on the specific type of SU8 used.
15. It is highly recommended to develop the wafer upside down. Prepare two baths of PGMEA.
16. The precise timing here is important. The PDMS should set sufficiently so that it is not too sticky, but not so much that the resulting multilayer device does not bond together.
17. This step requires the most practice. Alignment should be completed as quickly and precisely as possible to ensure optimal bonding. Air bubbles are typically caused by buckling of the PDMS layers, and can be removed by first ensuring the top layer is completely flat, and then with gentle application of pressure. Putting weights on top of the PDMS during subsequent baking can also help.
18. This step is important as the presence of dust between the glass and PDMS can compromise bonding or render the device nonfunctional.
19. Plasma treatment converts methylsiloxane to siloxyl groups on the PDMS surface, enabling its covalent cross-linking to silica-containing glass. There is, however, an optimum amount of treatment, as over-treating increases the surface roughness of the PDMS and decreases the effective contact area [22].
20. The binding can be checked by putting the chip against a black piece of paper. Regions which are not bound will show up as bubble-like features.
21. In our experience, devices can still be functional after 6 months’ storage.
22. https://github.com/nadanai263/lbnc-cellfree2
23. Ideally, all Tygon control lines should have the same length and the same amount of water. The larger the volume of water in the line, the faster the pressure transfer and valve actuation, due to the incompressibility of water; in practice, care must be taken so the water does not get into the electric manifold, so do not fill the lines fully. Finally, make certain that there are no air bubbles where the line connects to the chip.
24. https://github.com/nadanai263/lbnc-cellfreeview
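Notes 10–12 refer to exposure-time and development-time calculations without spelling them out. As a rough sketch, assuming the usual relation dose = intensity × time (mJ/cm² = mW/cm² × s) and the 45 s per μm rule of thumb from Note 12, the numbers can be worked out as follows. The function names are hypothetical, and the 15 μm thickness is simply the example quoted in Note 11.

```python
def exposure_time_s(dose_mj_cm2: float, intensity_mw_cm2: float) -> float:
    """Exposure time in seconds needed to deliver a dose at a given intensity."""
    return dose_mj_cm2 / intensity_mw_cm2


def delivered_dose_mj_cm2(intensity_mw_cm2: float, s_per_cycle: float,
                          cycles: int = 1) -> float:
    """Total dose in mJ/cm^2 delivered by `cycles` exposures of s_per_cycle each."""
    return intensity_mw_cm2 * s_per_cycle * cycles


def development_time_s(thickness_um: float, s_per_um: float = 45.0) -> float:
    """Development time from the per-micrometre rule of thumb (Note 12)."""
    return thickness_um * s_per_um


if __name__ == "__main__":
    # Note 11: 580 mJ/cm^2 recommended for 15 um AZ9260 with i-line exposure.
    print(exposure_time_s(580.0, 20.0), "s at 20 mW/cm^2")
    # Flow-mold exposure in the protocol: 2 cycles of 18 s at 20 mW/cm^2.
    print(delivered_dose_mj_cm2(20.0, 18.0, cycles=2), "mJ/cm^2 delivered")
    # Note 12 applied to the same 15 um example thickness.
    print(development_time_s(15.0) / 60.0, "min development")
```

If the lamp intensity drifts from 20 mW/cm², rescaling the exposure time with the first function keeps the delivered dose constant.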
Acknowledgments
This work was supported by an HFSP Program Grant RGP0032/2015; the European Research Council under the European Union’s Horizon 2020 research and innovation program Grant 723106; and the École Polytechnique Fédérale de Lausanne.
References
1. Purnick P, Weiss R (2009) The second wave of synthetic biology: from modules to systems. Nat Rev Mol Cell Biol 10:410–422
2. Garenne D, Noireaux V (2019) Cell-free transcription-translation: engineering biology from the nanometer to the millimetre scale. Curr Opin Biotechnol 58:19–27
3. Takahashi MK et al (2015) Characterizing and prototyping genetic networks with cell-free transcription-translation reactions. Methods 86:60–72
4. Perez JG et al (2016) Cell-free synthetic biology: engineering beyond the cell. Cold Spring Harb Perspect Biol 8:a023853
5. de Maddalena LL et al (2016) GreA and GreB enhance expression of Escherichia coli RNA polymerase promoters in a reconstituted transcription-translation system. ACS Synth Biol 5:929–935
6. Niederholtmeyer H, Xu L, Maerkl SJ (2013) Real-time mRNA measurement during an in vitro transcription and translation using binary probes. ACS Synth Biol 2:411–417
7. Wick S et al (2019) PERSIA for direct fluorescence measurements of transcription, translation, and enzyme activity in cell-free systems. ACS Synth Biol 8:1010–1025
8. Kwon Y-C, Jewett MC (2015) High-throughput preparation methods of crude extract for robust cell-free protein synthesis. Sci Rep 5:8663
9. Sun ZZ et al (2013) Protocols for implementing an Escherichia coli based TX-TL cell-free expression system for synthetic biology. J Vis Exp 79:1–15
10. Lavickova B, Maerkl SJ (2019) A simple, robust, and low-cost method to produce the PURE cell-free system. ACS Synth Biol 8:455–462
11. Dubuc E et al (2019) Cell-free microcompartmentalised transcription-translation for the prototyping of synthetic communication networks. Curr Opin Biotechnol 58:72–80
12. Niederholtmeyer H et al (2015) Rapid cell-free forward engineering of novel genetic ring oscillators. eLife 4:1–18
13. Hori Y et al (2017) Cell-free extract based optimization of biomolecular circuits with droplet microfluidics. Lab Chip 17:3037–3042
14. Chang J-C et al (2018) Microfluidic device for real-time formulation of reagents and their subsequent encapsulation into double emulsions. Sci Rep 8:8143
15. Spirin A et al (1988) A continuous cell-free translation system capable of producing polypeptides in high yield. Science 242:1162–1164
16. Niederholtmeyer H et al (2013) Implementation of cell-free biological networks at steady state. Proc Natl Acad Sci 110:15985–15990
17. Karzbrun E et al (2014) Programmable on-chip DNA compartments as artificial cells. Science 345:829–832
18. Tayar A et al (2017) Synchrony and pattern formation of coupled genetic oscillators on a chip of artificial cells. Proc Natl Acad Sci 114:11609–11614
19. Rockel S, Geertz M, Maerkl SJ (2012) MITOMI: a microfluidic platform for in vitro characterization of transcription factor-DNA interaction. Methods Mol Biol 786:97–114
20. van der Linden AJ et al (2019) A multilayer microfluidic platform for the conduction of prolonged cell-free gene expression. J Vis Exp 152:e59655
21. Ferry MS, Razinkov IA, Hasty J (2012) Microfluidics for synthetic biology: from design to execution. Methods Enzymol 497:295–372
22. Chau K et al (2011) Dependence of the quality of adhesion between poly(dimethylsiloxane) and glass surfaces on the composition of the oxidizing plasma. Microfluid Nanofluid 10:907–917
23. Swank Z, Laohakunakorn N, Maerkl SJ (2019) Cell-free gene-regulatory network engineering with synthetic transcription factors. Proc Natl Acad Sci U S A 116:5892–5901
Chapter 10
A Microfluidic/Microscopy-Based Platform for on-Chip Controlled Gene Expression in Mammalian Cells
Mahmoud Khazim, Elisa Pedone, Lorena Postiglione, Diego di Bernardo, and Lucia Marucci
Abstract
Applications of control engineering to mammalian cell biology have been recently implemented for precise regulation of gene expression. In this chapter, we report the main experimental and computational methodologies to implement automatic feedback control of gene expression in mammalian cells using a microfluidics/microscopy platform.
Key words Feedback control, Mammalian cell, Microfluidics, Cell segmentation, PDMS, Control algorithms
1
Introduction
In recent years, feedback control has been widely used for controlling gene expression across cellular species and applications. In-cell feedback is implemented within cells by means of gene regulatory networks involving, for example, positive and negative feedback loops. Instead, in silico feedback control implements the control action externally: cellular outputs are measured, usually by microscopy using fluorescent proteins, and actuators provide cells with the control inputs (e.g., inducer molecules) to minimize the control error. Here, we report the main experimental and computational methods we employed for external feedback control of gene expression in mammalian cells using a microfluidics/microscopy platform [1–3]. The PDMS microfluidic device we used [4] has been optimized for long-term mammalian cell culturing, imaging, and precise delivery of two media to cells; it consists of 33 individual cuboid culture chambers adjoined to a main perfusion channel via a wide opening on one side of each chamber. For a detailed description of the device, please refer to [4]. The experimental protocols described here would need to be adapted if using a different microfluidics setup.
Mahmoud Khazim, Elisa Pedone, and Lorena Postiglione contributed equally to this work. Diego di Bernardo and Lucia Marucci contributed equally to this work.
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_10, © Springer Science+Business Media, LLC, part of Springer Nature 2021
2 Materials
2.1 Chip Fabrication
1. Master silicon wafer (Silicon Valley Microelectronics, USA).
2. Chlorotrimethylsilane (TCSM).
3. Aluminum foil.
4. Vacuum degassing chamber (Bel-Art).
5. Oven.
6. Sonicator (Camlab).
7. Acetone, methanol, isopropyl alcohol, and distilled water.
8. Pressurized nitrogen.
9. Polydimethylsiloxane (PDMS) Sylgard 184 Elastomer base (Dow Corning).
10. 0.75-mm biopsy punch (World Precision Instruments; 504529).
11. Cover glasses (Hirschmann 24 × 60 mm, T 0.13–0.17 mm).
12. O2 plasma asher (DienerZepto).
2.2 Chip Loading
1. PDMS microfluidic chips.
2. Dissociation reagent.
3. Phosphate-buffered saline (PBS) solution.
4. 15-mL Falcon tube.
5. Centrifuge.
6. 10-mL (Terumo, IVS10) and 2.5-mL syringes (Terumo, IVS03).
7. 23-gauge needles (BD, 300800).
8. PTFE Tubing (06417-21_PTFE #24 AWG Thin Wall Tubing, Cole-Parmer Inc.).
9. 22-gauge 90° bent metal pins (922050-90BTE, Metcal).
10. Vacuum aspirator.
11. Tissue culture microscope.
12. Loading buffer: knockout Dulbecco’s modified Eagle’s medium (DMEM), 15% fetal bovine serum, 1× nonessential amino acids, 1× GlutaMax, 1× 2-mercaptoethanol, 1000 U/mL LIF.
Mammalian Cell Control
2.3 Microfluidic-Based Time-lapse
1. 10-mL (Terumo, IVS10) and 50-mL syringes (Terumo, SS + 50 L1).
2. 23-gauge needles (BD, 300800).
3. 22-gauge metal pins (922050-90BTE, Metcal).
4. Y-junction (Ziggy’s tubes and wires, Inc. HSCY-23).
5. PTFE Tube (06417-21_PTFE #24 AWG Thin Wall Tubing, Cole-Parmer Inc.).
6. Fluorescent dye (e.g., Atto488 or 467 dye from ThermoFisher, Sulforhodamine from Sigma).
3
Methods
3.1 Fabrication of PDMS Replica Molding
For brevity, the protocol reported here assumes that a master mold is available; therefore, only the steps to produce device replica are included. For master mold fabrication, please refer to the original publication [4].
3.1.1 Silanization of Master Mold Wafer
Prior to replication of microfluidic devices, the master mold is exposed to chlorotrimethylsilane vapors to passivate its surfaces and prevent the PDMS from adhering to the master. This process is not required for each replica fabrication; it is done prior to the first use of the master and whenever it becomes difficult to peel off the PDMS.
• Place the master silicon wafer in a vacuum degassing chamber (Fig. 1a).
• Place 2–10 μL of the silanization agent in an open Eppendorf tube, stood in an aluminum foil cap, adjacent to the master, and apply the vacuum for 15–30 min to allow silanes to form a monolayer on the surface of the master.
3.1.2 PDMS Microfluidic Device Preparation
PDMS base is mixed with curing agent and poured onto a silicon master mold wafer. The mixture is degassed and then cured. The cured PDMS is peeled off and autoclaved, and then ports are punched using a biopsy punch.
• Prepare PDMS by mixing Sylgard 184 Elastomer base and curing agent in a 10:1 ratio. Mix the base and curing agent well using a lab spatula. The amount of PDMS is scaled to the dimensions of the mold. For a 4-in. wafer and petri dish, use 50 g of PDMS/curing agent mixture in a 10:1 ratio (approximately 45 g of PDMS base and 5 g of curing agent; an exact 10:1 split of 50 g is about 45.5 g and 4.5 g).
Fig. 1 Steps and equipment for the fabrication of microfluidic devices by PDMS replica molding. (a) Silanization and surface treatment of the master by placing in a degassing chamber with TCSM. (b) The master is placed in a glass petri dish covered with aluminum foil. (c) PDMS and curing agent are mixed, poured over the master, and degassed. (d) After curing, the PDMS replicas are cut out and ports are made by use of a reusable biopsy punch. (e) The punched devices are sonicated in isopropyl alcohol followed by sonication in H2O. (f) The cleaned replica is bonded to glass coverslips using a plasma asher
• Place the master mold into a glass petri dish of similar area, covered with aluminum foil (Fig. 1b).
• Pour the PDMS mix onto the master mold and degas in a vacuum degassing chamber for 30 min or until all bubbles have been removed (Fig. 1c).
• Place the master mold with PDMS in an oven to cure for 1 h at 80 °C.
• Gently peel off the cured PDMS, starting from the edge of the mold, and release it from the master mold.
• Autoclave the cured PDMS for 30 min at 121 °C in an autoclavable paper bag to ensure long-term viability of cells in the device.
• Using a 0.75-mm biopsy punch, punch holes to create fluidic ports for access of cells and media (Fig. 1d) (see Note 1).
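The base-to-curing-agent ratio given at the start of this section has to be converted into two weighed masses whenever the batch size changes. The helper below is a minimal sketch of that arithmetic, assuming the ratio is by mass; the function name and its parameters are illustrative and do not come from the original protocol.

```python
def split_pdms(total_g: float, base_parts: float = 10.0,
               curing_parts: float = 1.0) -> tuple:
    """Return (base_g, curing_agent_g) for a batch of total_g at base:curing."""
    if total_g <= 0 or base_parts <= 0 or curing_parts <= 0:
        raise ValueError("masses and ratio parts must be positive")
    base_g = total_g * base_parts / (base_parts + curing_parts)
    return base_g, total_g - base_g


if __name__ == "__main__":
    # 50 g batch at 10:1, as suggested for a 4-in. wafer and petri dish.
    base, curing = split_pdms(50.0)
    print(f"base: {base:.1f} g, curing agent: {curing:.1f} g")
```

The same function covers other ratios by changing `base_parts`, for example `split_pdms(60.0, base_parts=5.0)` for a 5:1 mixture.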
3.1.3 Cleaning and Bonding of PDMS Chips to Glass Coverslips
Punched PDMS devices are sonicated to dislodge PDMS shavings from the ports. Coverslips are cleaned and dried. Finally, the PDMS devices and coverslips are placed in a plasma asher and bonded by bringing the surfaces into contact and optionally baking overnight to strengthen the bond between the device and coverslip.
• Place punched PDMS devices in isopropyl alcohol and sonicate for 10 min (Fig. 1e).
• Sonicate in distilled water for 10 min (Fig. 1e).
• Dry using pressurized nitrogen.
• For each molded, punched PDMS device, clean a thin 24 × 60 mm cover glass in acetone, methanol, isopropyl alcohol, and distilled water and then dry with pressurized nitrogen.
• Expose the PDMS devices, with layers facing up, and cover glasses to oxygen plasma in an O2 plasma asher, at 50–70% power for 2 min (Fig. 1f).
• Bring the PDMS device into contact with the cover glass with layers facing down to form a strong irreversible bond between surfaces.
• Use a microscope to check for any faults.
• Optional: bake the bonded devices at 90 °C overnight.
3.2 Chip Loading
3.2.1 Pins Preparation and Wetting of the Device
Prior to trapping the cells in the microfluidic device chambers via on-chip vacuum, the device needs to be prewet such that device channels are filled with fluid while the culture chambers remain filled with air. Also, pins need to be prepared. To release the pins from the syringe adaptors (Fig. 2a), incubate the pins with isopropyl alcohol for 24–48 h (Fig. 2b). The pins can be used to connect fluidic lines to the microfluidic chip (Fig. 3) for all future experiments.
Fig. 2 Release of metallic pins from adaptors. Metallic pins before (a) and after (b) 24/48-h isopropyl alcohol (isopropanol) incubation
Fig. 3 Microfluidic device wetting, cell loading, and preculture. (a) Microfluidic device, which has been bonded to a glass coverslip. Media is flushed through the microfluidic channels starting from port 5 (b) followed by filling through port 2 (c). (d) The wetted device is fastened onto a lab microscope, and a vacuum pump is attached to ports 3 and 4. Cells are pushed through port 1 and loaded via the vacuum into the cell chambers. (e) The loaded microfluidic device for preculture has ports 2, 6, and 7 plugged using a pin with a short length of PTFE tubing tied at the end to stop fluidic flow through the ports (*). A 10-mL syringe with media is attached to port 1 for overnight perfusion (blue arrow) and media flows out of port 5 (red arrow). (f) A 10-mL syringe with media is fastened onto a makeshift rig and attached to port 1
• Connect a 2.5-mL syringe, via an attached 23-gauge sterile needle, to a 10-cm section of PTFE #24 AWG tubing with a 90° bent metal pin connected at the end.
• Aspirate a 1-mL volume of media into the syringe and connect the pin to port 5 of the microfluidic device (Fig. 3b). Apply gentle pressure until fluid fills all main device ports to the top (ports 6, 7, and 2).
• Remove the pin and tubing from port 5 and wet connect to port 2 (Fig. 3c); in a wet connection, a fluid droplet from the tubing is applied to the port prior to connecting to ensure bubbles do not enter the device. Apply very gentle pressure to the syringe until port 1 is filled to the top.
3.2.2 Shear-Free Cell Loading Via on-Chip Vacuum
Once the chip has been wetted, the chip is attached to a vacuum pump and cells are loaded into the chip. Cells are visually monitored via laboratory microscope while being vacuum-loaded. Cells that remain in the main device channels are flushed out using fresh media, while cells in the culture chambers are shielded and resistant to convective flow.
• Connect the on-chip vacuum to ports 3 and 4 and fasten the chip to a laboratory microscope to enable monitoring of cell loading (Fig. 3d).
• Wash mammalian cells (previously kept in complementary media and maintained in a tissue culture incubator at 37 °C and 5% CO2) with PBS; detach cells from the culture dish and place into a centrifuge tube.
• Centrifuge cells to form a pellet and resuspend in complete media, at a density of 2 × 10⁶ cells per 100 μL of media. If cells are too concentrated, dilute with extra media.
• Aspirate the cell suspension into a fresh 2.5-mL syringe via needle-attached tubing and metal pin, and wet connect to port 1 of the device.
• Gently apply pressure to the syringe with cell suspension until cells are visible in the main perfusion channel upon inspection via a tissue culture microscope.
• Once the presence of cells in the main channel is confirmed, stop the flow by releasing syringe pressure until cells are apparent at the entrance of the chip chambers.
• To begin cell loading, turn the vacuum on and visually monitor cells entering the chambers; mechanical finger tapping of the tube near the port can enhance cell loading as it avoids cells getting stuck on the walls of the device.
• Once loaded, turn the vacuum off and disconnect the vacuum ports.
• Use a new syringe and tubing with fresh media to flush out untrapped cells from the ports by wet connecting to port 1 and applying a gentle pressure through the main channel, out of the remaining ports of the device. Care needs to be taken so that the fluid flow through the device is not too strong, as this can cause cells properly trapped in the chambers to be washed out.
3.2.3 Preculture of Cells in the Microfluidic Device
The device with cells in the culture chambers can now be precultured in the incubator, overnight and up to 48 h, to allow cells to attach and proliferate inside the device prior to undertaking control experiments on the microscope.
212
Mahmoud Khazim et al. l
3.3 Microfluidics/ Microscopy-Based Time-lapse
Plug ports 2, 6, and 7 by using 90 bent metal pins with a small amount of tubing which can be tied at the end to stop the flow (Fig. 3e).
l
Fill a 10-mL syringe with up to 5 mL of culture media onto a makeshift rig and attach by needle, tubing and pin to port 1 (Fig. 3f). The hydrostatic pressure difference between the syringe fluid and the opening in port 5 allows media to flow through the device, establishing a slow perfusion flow.
l
Place rig with culture media and microfluidic device into an incubator and culture overnight.
• Measure and cut three sections of PTFE #24 AWG tubing for collecting the waste media and two sections of tubing for time-controlled delivery of culture media to cells. The length of the tubing is around 120 cm for the output (waste) and 200 cm for the input (delivered) media.
• Connect the Y-junction to a short tube of about 20 cm and two of the 120-cm waste output tubes (Fig. 4).
• Connect the short section of tubing (Fig. 4a) to a 23-gauge needle and the longer sections of tubing to metal pins (Fig. 4b, c).
• Similarly, connect each of the remaining tubes (one output and two inputs) to a 23-gauge needle on one end and a metal pin on the other end.
• Attach each fluidic line to a 50-mL syringe and slowly fill them with 12 mL of culture media. Note that this amount of media is enough for experiments
\[
u(t) = \begin{cases} u_{max} & \text{if } e(t) \ge 0 \\ u_{min} & \text{if } e(t) < 0 \end{cases}
\]
where the control error $e(t) = r(t) - y(t)$ is the difference between the reference signal r and the system output y; u is the control input. Usually, in controlling biological systems, the system output is a fluorescent protein, while the input is represented by inducer molecule(s), provided by the actuators (see Subheading 3.3.3). This control strategy, although simple, succeeds in keeping the system output close to the desired reference. Typically, the controlled variable oscillates around the reference; this is acceptable if the oscillations are sufficiently small [1, 5]. In the Relay control strategy, when the output value is very close to the reference, the control error can rapidly change sign, thus causing the control input to continuously switch (chattering
phenomenon [6]); this can be reduced by adding hysteresis ε to the controller, modifying the control law as follows:
\[
u(t) = \begin{cases} u_{max} & \text{if } e(t) \ge \varepsilon \\ u_{min} & \text{if } e(t) < -\varepsilon \end{cases}
\]
The drawback of this controller is that the amplitude of the oscillations around the set-point increases. For examples of Relay implementation in mammalian cells, see [1–3].
3.5.2 Proportional-Integral (PI) Controller
• The PI output $\hat{u}(t)$ is a function of the control error $e(t)$, and it is defined as $\hat{u}(t) = k_p e(t) + k_i \int_0^t e(\tau)\,d\tau$.
• For a control system with a wide range of operating conditions, the control action may reach the actuator limits; if an integral action is used, the error will continue to be integrated, meaning that the integral term and the control output may become very large. The control signal will then remain saturated even when the error changes, and it may take a long time before the integrator and the controller output come back inside the saturation range (integrator windup [5]). An anti-windup compensation scheme can be added to the PI controller to prevent the windup phenomenon and to keep the control input $\hat{u}$ from becoming too large (refer to [5] for examples of anti-windup schemes).
• The proportional and integral gains of the PI controller ($k_p$ and $k_i$, respectively) have to be tuned using a dynamical model of the system to be controlled (e.g., following the Ziegler and Nichols method [5]).
• If the biological system under investigation can only be fed either with or without the inducer molecule, in a mutually exclusive manner, the continuous signal $\hat{u}(t)$ has to be converted into a discrete (on/off) signal. The control technique that satisfies this constraint couples the PI regulator with a pulse-width modulator (PWM). Specifically, at each sampling time kT, the PWM algorithm calculates the duty cycle of the input, $d_k = \hat{u}/T$, as the ratio between the control input $\hat{u}$ and the sampling time T. The input u(t) is computed as follows:
\[
u(t) = \begin{cases} u_{max} & \text{if } kT \le t < (k + d_k)T \\ u_{min} & \text{if } (k + d_k)T \le t < (k + 1)T \end{cases}
\]
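The relay and PI-PWM laws of Subheadings 3.5.1 and 3.5.2 can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in [1–3]: the function and class names, the default input levels (0 and 1), the rectangle-rule integral, and the rule of holding the previous input inside the hysteresis band are assumptions made for the example. The integrator is frozen while the duty cycle is saturated, which is one simple anti-windup choice among those discussed in [5].

```python
def relay(error, u_prev, eps=0.0, u_min=0.0, u_max=1.0):
    """Relay law; eps > 0 adds a hysteresis band, eps = 0 gives the plain relay."""
    if error >= eps:
        return u_max
    if error < -eps:
        return u_min
    return u_prev  # assumption: inside the band the previous input is held


class PIPWM:
    """PI regulator whose output is turned into a duty cycle d_k in [0, 1]."""

    def __init__(self, kp, ki, T):
        self.kp, self.ki, self.T = kp, ki, T
        self.integral = 0.0  # running integral of the control error

    def duty_cycle(self, error):
        candidate = self.integral + error * self.T      # rectangle-rule update
        u_hat = self.kp * error + self.ki * candidate   # PI output
        d_raw = u_hat / self.T                          # d_k = u_hat / T
        d_k = min(max(d_raw, 0.0), 1.0)                 # actuator limits
        if d_raw == d_k:
            # Anti-windup (assumed scheme): accept the integrator update
            # only when the duty cycle is not saturated.
            self.integral = candidate
        return d_k


def pwm_input(t, k, d_k, T, u_min=0.0, u_max=1.0):
    """Input applied at time t within sampling interval k, kT <= t < (k+1)T."""
    return u_max if t < (k + d_k) * T else u_min
```

With a hypothetical sampling time of T = 60 and gains kp = 30, ki = 0, an error of 1 gives a duty cycle of 0.5, so the inducer is delivered for the first half of the interval and withdrawn for the second half.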
The PI-PWM controller was used, for example, in [1] to control gene expression from the tetracycline-inducible promoter in CHO cells. The first advantage of the PI controller is that, besides its very simple implementation, it guarantees zero steady-state error for constant references and the rejection of constant disturbances at steady state. Moreover, the PI controller does not require a model of the controlled system, although an idea of its dynamics is necessary for tuning the gains. On the other hand, a PI controller does not achieve satisfactory performance when tracking time-varying references unless the reference dynamics are much slower than the closed-loop system dynamics.
3.5.3 Model Predictive Control (MPC)
Model predictive control (MPC) is a well-established technique for controlling multivariable systems subject to constraints. Applications of MPC to regulate gene expression and signaling pathway activity in mammalian cells are reported in [2].
• Given a desired control reference, MPC aims at finding the optimal control input to minimize the difference between the target value and the measured value, by means of a dynamical model of the system being controlled and a cost function.
• To speed up computation, a discretized version of the dynamical model describing the biological system is used, assuming that the input is piecewise constant during the sampling period T (zero-order hold method):
\[
x_{k+1} = A x_k + B u_k, \qquad y_k = C x_k
\]
where, for example, in the case of a three-state system with one input, $x_k = \left(x_1(kT),\, x_2(kT),\, x_3(kT)\right)^T$ are the system states, $u_k = u(kT)$ is the control input, and $y_k = y(kT)$ is the system output, with k being a natural number ($k \in \{1, 2, \ldots\}$).
• Starting from the experimental data, at each sampling time kT, the MPC controller uses the discrete model to predict the dynamic behavior of the system to be controlled over a defined prediction horizon and to determine the input such that an open-loop objective function is minimized [7]. An example of a cost function to be minimized is the sum of squared control errors (SSE), defined as follows:
\[
SSE_k = \sum_{i=k+1}^{k+N} (N + 1 + k - i)\,\varepsilon_i^2
\]
where N defines the length of the prediction horizon in terms of sampling intervals; $(N + 1 + k - i)$ is a weighting factor that weights the control error samples at the beginning of the prediction horizon more than those at the end.
• The MPC strategy requires a mathematical model of the process being controlled to calculate the control input. Note that these models are used only to synthesize the controllers and not to estimate biological parameters quantitatively; thus, the uniqueness of the identified parameters is not ensured, only the models' ability to predict the system output given the input.
• Use the microfluidics platform in open loop to measure input/output time-series data. For example, deliver an input to the cells in the microfluidic device as a series of pulses of inducer molecule with variable duration but fixed amplitude (square waves) and measure the mean fluorescence in the cell population, which is taken as the output of the system [8].
• If the biological processes being controlled are fluorescent proteins driven by inducible promoters, their dynamics can be assumed to be well approximated by linear state-space models.
• Derive the dynamical model from the input–output data by using black-box or gray-box identification approaches [2, 9].
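Once a discrete model is available, the receding-horizon idea of MPC can be illustrated with a toy example. The sketch below uses a scalar model with an on/off input; the coefficients (a, b, c), the horizon N = 5, the two input levels, and the exhaustive search over all 2^N input sequences are assumptions made for brevity and are not taken from [2]. The cost is the weighted SSE of this subheading, with weight N on the first predicted sample and 1 on the last.

```python
from itertools import product


def predict(x, u_seq, a, b, c):
    """Predicted outputs for x_{k+1} = a*x_k + b*u_k, y_k = c*x_k."""
    outputs = []
    for u in u_seq:
        x = a * x + b * u
        outputs.append(c * x)
    return outputs


def sse(y_pred, reference):
    """Weighted SSE: the j-th predicted sample (j = 0..N-1) has weight N - j."""
    n = len(y_pred)
    return sum((n - j) * (reference - y) ** 2 for j, y in enumerate(y_pred))


def mpc_step(x, reference, a, b, c, N=5, levels=(0.0, 1.0)):
    """First input of the sequence that minimizes the SSE over the horizon."""
    best = min(product(levels, repeat=N),
               key=lambda seq: sse(predict(x, seq, a, b, c), reference))
    return best[0]


def closed_loop(x0, reference, a, b, c, steps=40, N=5):
    """Apply only the first optimal input at each step, then re-optimize."""
    x, ys = x0, []
    for _ in range(steps):
        u = mpc_step(x, reference, a, b, c, N)
        x = a * x + b * u   # here the model itself stands in for the cells
        ys.append(c * x)
    return ys
```

In an experiment, the state update inside `closed_loop` would be replaced by the measured fluorescence (or a state estimate derived from it); exhaustive search is only practical for short horizons and few input levels.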
4 Notes
1. The chip fabrication procedure is best performed in a clean room environment to avoid the inclusion of impurities in the chip.
2. To monitor correct media flow and measure which input is delivered to the cells, add a fluorescent dye to one of the two syringes with media.
3. To check for correct chip perfusion, we suggest also imaging the DAW junction of the chip, where the media coming from the two actuation syringes mix.
4. The microscope settings need to be adjusted depending on the resolution of the camera and the brightness of each fluorescent tag.
Acknowledgments
This work was funded by Medical Research Council grant MR/N021444/1 to L.M., by the Engineering and Physical Sciences Research Council grants EP/R041695/1 and EP/S01876X/1 to L.M., and by BrisSynBio, a BBSRC/EPSRC Synthetic Biology Research Centre (BB/L01386X/1) to L.M.
References
1. Fracassi C, Postiglione L, Fiore G, di Bernardo D (2016) Automatic control of gene expression in mammalian cells. ACS Synth Biol 5(4):296–302. https://doi.org/10.1021/acssynbio.5b00141
2. Postiglione L, Napolitano S, Pedone E, Rocca DL, Aulicino F, Santorelli M, Tumaini B, Marucci L, di Bernardo D (2018) Regulation of gene expression and signaling pathway activity in mammalian cells by automated microfluidics feedback control. ACS Synth Biol 7(11):2558–2565. https://doi.org/10.1021/acssynbio.8b00235
3. Pedone E, Postiglione L, Aulicino F, Rocca DL, Montes-Olivas S, Khazim M, di Bernardo D, Pia Cosma M, Marucci L (2019) A tunable dual-input system for on-demand dynamic gene expression regulation. Nat Commun 10(1):4481. https://doi.org/10.1038/s41467-019-12329-9
4. Kolnik M, Tsimring LS, Hasty J (2012) Vacuum-assisted cell loading enables shear-free mammalian microfluidic culture. Lab Chip 12(22):4732–4737. https://doi.org/10.1039/c2lc40569e
5. Åström KJ, Murray RM (2010) Feedback systems: an introduction for scientists and engineers. Princeton University Press
6. Utkin V, Lee H (2006) Chattering problem in sliding mode control systems. Paper presented at the international workshop on variable structure systems, Alghero, Sardinia, Italy
7. Morari M, Lee JH (1999) Model predictive control: past, present and future. Comput Chem Eng 23(4):667–682. https://doi.org/10.1016/S0098-1354(98)00301-9
8. Fiore G, Menolascina F, di Bernardo M, di Bernardo D (2013) An experimental approach to identify dynamical models of transcriptional regulation in living cells. Chaos 23(2):025106. https://doi.org/10.1063/1.4808247
9. Menolascina F, Fiore G, Orabona E, De Stefano L, Ferry M, Hasty J, di Bernardo M, di Bernardo D (2014) In-vivo real-time control of protein expression from endogenous and synthetic gene networks. PLoS Comput Biol 10(5):e1003625. https://doi.org/10.1371/journal.pcbi.1003625
Chapter 11
Optimal Experimental Design for Systems and Synthetic Biology Using AMIGO2
Eva Balsa-Canto, Lucia Bandiera, and Filippo Menolascina
Abstract
Dynamic modeling in systems and synthetic biology is still quite a challenge: the complex nature of the interactions results in nonlinear models, which include unknown parameters (or functions). Ideally, time-series data support the estimation of model unknowns through data fitting. Goodness-of-fit measures would lead to the best model among a set of candidates. However, even when state-of-the-art measuring techniques allow for an unprecedented amount of data, not all data suit dynamic modeling. Model-based optimal experimental design (OED) is intended to improve model predictive capabilities. OED can be used to define the set of experiments that would (a) identify the best model or (b) improve the identifiability of unknown parameters. In this chapter, we present a detailed practical procedure to compute optimal experiments using the AMIGO2 toolbox.
Key words Biological systems, Dynamic models, Optimal experimental design, Practical identifiability
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_11, © Springer Science+Business Media, LLC, part of Springer Nature 2021
1 Introduction
The ultimate aim of systems biology is the discovery of the design principles of life, while the ultimate aim of synthetic biology is to apply those design principles to synthesize novel biological systems. Both the discovery and the synthesis rely on a combination of data and mechanistic mathematical models that capture the most relevant features of the system. Ideally, models offer the means for prediction and, thus, for design.
Model building is an iterative process which goes back and forth between model refinement and data validation. The first steps of the modeling process include (a) defining the question to be addressed, (b) generating hypotheses, (c) selecting the modeling framework, and (d) formulating one (or several) candidate model(s). Steps (b)–(d) often rely on prior knowledge and observations.
This chapter focuses on modeling biological networks. The first steps of modeling (a and b) define the topology of the network: which are the biomolecules of interest and whether they interact with each other. The next steps (c and d) describe the kinetics and strengths of biomolecular interactions within the system. If we consider the cell as a well-stirred reactor, we can explain the behavior of the network using a set of ordinary differential equations which determine concentration changes as prescribed by kinetic laws. The model would read as follows:
\[
\frac{dx}{dt} = f(x, u, \theta, t); \qquad x(t_0) = x_0 \tag{1}
\]
where x, u, and θ are the vectors of state variables, inputs, and model parameters, respectively. The dynamics of the system (1) depend on the initial conditions ($x_0$) and the parameter values.
Parameter estimation offers the means to reconcile models with data [1]. The underlying idea is to solve a nonlinear optimization problem to compute unknown and nonmeasurable kinetic constants that maximize the likelihood of the data $\tilde{y}$. The experimental data consist of a matrix of values corresponding to individual measurements obtained under the conditions specified by an experimental scheme ε. We encode the experimental data and the model predictions in the following vectors:
\[
\tilde{y} = \left[\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_d, \ldots, \tilde{y}_{n_d}\right], \qquad y = \left[y_1, y_2, \ldots, y_d, \ldots, y_{n_d}\right] \tag{2}
\]
where $\tilde{y}$ represents the experimental data and $y = g(x, u, \theta, t)$ the corresponding model predictions; d represents a specific experimental condition defined by the subindexes ε (for the experiment), o (for the observables in experiment ε), and s (for the sampling times in experiment ε); $n_d$ is the total number of such conditions, that is, the number of data. Accordingly, the operators to be defined in the sequel can be condensed as follows:
\[
\sum_{d=1}^{n_d} (\cdot) = \sum_{i=1}^{n_\varepsilon} \left( \sum_{j=1}^{n_o^{\varepsilon}} \left( \sum_{k=1}^{n_s^{o,\varepsilon}} (\cdot) \right) \right) \tag{3}
\]
Output additive experimental noise is often assumed in such a way that:
\[
\tilde{y}_d = y_d + e_d \tag{4}
\]
where $e_d$ belongs to a sequence of independent random variables with probability density $\Pi(e_d)$. In many practical examples, experimental noise is assumed to be Gaussian, and its variance $\sigma_d^2$ is known for all d's (homoscedastic case) or unknown and dependent on d (heteroscedastic case).
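The additive noise model of Eq. 4 can be illustrated by generating pseudo-data from a vector of model predictions. The following is a minimal sketch; the function name, the example predictions, and the noise levels are hypothetical, and "heteroscedastic" is taken here to mean a standard deviation proportional to the signal, which is only one possible choice.

```python
import random


def pseudo_data(y_model, sigma, seed=0):
    """Return y_d + e_d with e_d ~ N(0, sigma_d^2).

    sigma is either one standard deviation shared by all data points
    or a list with one standard deviation per data point.
    """
    rng = random.Random(seed)  # fixed seed so the pseudo-data are reproducible
    if isinstance(sigma, (list, tuple)):
        sigmas = list(sigma)
    else:
        sigmas = [sigma] * len(y_model)
    return [y + rng.gauss(0.0, s) for y, s in zip(y_model, sigmas)]


y = [10.0, 20.0, 40.0, 80.0]                          # hypothetical predictions y_d
homo = pseudo_data(y, sigma=2.0)                      # same sigma for every point
hetero = pseudo_data(y, sigma=[0.1 * v for v in y])   # sigma equal to 10% of the signal
```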
When information about the nature of the experimental noise is available, parameter estimation looks for the parameter values that give the highest probability to the measured data:
\[
J_{llk}(\theta) = \ln\left(\pi(\tilde{y} \mid \theta)\right) \tag{5}
\]
The probability density function will condition the exact formulation of the cost function. In practice, homoscedastic Gaussian additive noise is often assumed, and the resulting cost function is similar to the well-known generalized least squares, with weights set to the inverse of the variance of the experimental data. The parameter estimation problem is formulated as finding the parameter values that minimize:
\[
J_{llk}(\theta) = \sum_{d=1}^{n_d} \frac{\left(y_d(\theta) - \tilde{y}_d\right)^2}{\sigma_d^2} \tag{6}
\]
Precise parameter estimates require informative data, that is, informative experimental schemes ε. The precision of the parameter estimates can be measured in terms of the volume and eccentricity of the confidence hyperellipsoid. In this regard, the confidence hyperellipsoid informs about the quality and quantity of information provided by the experimental scheme for parameter estimation. Related to this, the concept of practical identifiability refers to the (im)possibility of assigning precise values to model parameters.
Performing experiments to obtain a rich enough set of experimental data for nonlinear dynamic modeling is costly and time-consuming. Model-based optimal experiment design (OED, see, e.g., [2, 3]) allows devising the minimum set of experiments for model identification, that is, for either model selection or parameter estimation. Mathematically, the OED problem can be formulated as a dynamic optimization problem where the objective is to find the observables, time-varying inputs (stimuli), initial conditions, sampling times, and experiment duration that maximize or minimize a performance index related to the experiment information content. Figure 1 illustrates the concept of the experimental scheme. The definition of the information depends on the aim of optimal experimental design.
For the case of model selection, the information typically relates to the differences in predictions between candidate models:
\[
J_{OED,MS}\left(u^{\varepsilon}(t),\, t_s^{\varepsilon},\, n_s^{\varepsilon},\, t_f^{\varepsilon}\right) = \Psi\left(y_A^{\varepsilon},\, y_B^{\varepsilon}\right) \tag{7}
\]
where $y_A^{\varepsilon}$ and $y_B^{\varepsilon}$ correspond to the observables as predicted by models A and B given the experimental conditions ε. The functional Ψ may correspond to, for example, the Euclidean distance between the models.
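As a simplified illustration of Eq. 7, the sketch below takes Ψ as the Euclidean distance between the predictions of two candidate models and picks, among a few candidate designs, the one that separates the models most. The two Hill-type steady-state models, their parameter values, and the candidate input levels are hypothetical; a real design would act on time-varying inputs and dynamic model predictions.

```python
import math


def hill(u, vmax, km, h):
    """Steady-state Hill response to an input level u."""
    return vmax * u ** h / (km ** h + u ** h)


def psi(y_a, y_b):
    """Euclidean distance between the predictions of two candidate models."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_a, y_b)))


def best_design(candidates, model_a, model_b):
    """Candidate set of input levels that maximizes psi (model discrimination)."""
    return max(candidates,
               key=lambda levels: psi([model_a(u) for u in levels],
                                      [model_b(u) for u in levels]))


def model_a(u):
    return hill(u, vmax=1.0, km=5.0, h=1.0)   # hypothetical candidate A


def model_b(u):
    return hill(u, vmax=1.0, km=5.0, h=2.0)   # hypothetical candidate B


designs = [(5.0, 5.0, 5.0), (1.0, 5.0, 25.0), (100.0, 200.0, 400.0)]
chosen = best_design(designs, model_a, model_b)
```

Inputs at which both candidates predict the same output (here, the half-saturation level) or at which both are saturated carry little information for discrimination, which is why the design spanning the transition region is preferred.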
Fig. 1 Concept of the experimental scheme. It includes the number of experiments and/or replicates, stimulation conditions, measured states, experiment duration, and sampling times
For the case of parameter estimation, the determinant or the eigenvalues of the Fisher information matrix provide a measure of the statistical quality of the parameter estimates, that is, a measure of the volume and eccentricity of the confidence hyperellipsoid [4]. The Fisher information matrix reads as follows:
\[
F = \underset{\tilde{y} \mid \mu}{E} \left\{ \left[ \frac{dJ_{llk}(\theta)}{d\theta} \right] \left[ \frac{dJ_{llk}(\theta)}{d\theta} \right]^{T} \right\} \tag{8}
\]
with E being the expected value and μ a near-optimum value of the parameters. The Cramér–Rao inequality provides a lower bound on the covariance of the estimators (under given conditions):
\[
C \ge F^{-1}(\mu) \tag{9}
\]
The confidence interval for a given parameter $\mu_i$ is then given by the following:
\[
\pm\, t_{\alpha/2}^{\gamma} \sqrt{C_{ii}} \tag{10}
\]
where $t_{\alpha/2}^{\gamma}$ is given by Student's t-distribution, γ is the number of degrees of freedom, and $(1 - \alpha) \times 100\%$ is the confidence level.
Both the parameter estimation and the optimal experimental design problems can be cast as nonlinear programming problems (NLP) subject to dynamic and algebraic constraints. For the case of optimal experimental design, the stimuli are parametrized to transform the function u(t) into a vector $w \in \mathbb{R}^{\rho}$, with ρ the number of parameters required to characterize the stimuli profile (e.g., number of steps or pulses, stimuli switching times). Figure 2 shows the iterative procedure for their resolution. In an outer iteration, the NLP solver will generate candidate solutions, and in an inner iteration, the initial value problem (IVP) solver will compute model predictions and parametric sensitivities for the given candidate solution. As for the selection of the initial value problem solver, possibly the most popular are the Runge–Kutta methods in their implicit and explicit versions, the Adams–Bashforth methods, and the backward differentiation formula (BDF)-based methods. For an extensive review of methods, see [5].
Fig. 2 Iterative solution of the parameter estimation and optimal experimental design problems. The iterative solution requires an NLP solver to generate candidate solutions at each iteration k, and an IVP solver to solve the model equations, plus the parametric sensitivities in the case of OED, to evaluate the cost function and constraints
NLP solvers are designed to generate, from one or several initial guesses, a sequence of solutions (iterates) that eventually converge to the minimum of the cost function (Eqs. 5 and 6). The way these iterates are computed allows the first classification of NLP solvers into three major groups: local, global, and hybrid optimizers (Fig. 3). Local methods use information about the cost function and possibly its gradient and its Hessian to compute search directions. These methods guarantee convergence to a local optimum, which is the global optimum if the problem is convex. Interested readers are referred to, for example, the book by Fletcher (1987) [6] for extensive descriptions of local optimizers; further, Seber and Wild (1989) [7] and Schittkowski (2002) [8] describe Levenberg–Marquardt and Gauss–Newton-based methods for the specific case of least squares problems. The use of adjoint sensitivities may largely improve the efficiency of evaluating the gradient of the cost function [9], thereby improving the overall convergence rate of local indirect methods. The use of first- and projected second-order
sensitivities also enhances the convergence rate of the solution of dynamic optimization problems such as those solved in optimal experimental design [10]. However, the nonlinear character of the dynamic biological models often leads to multimodality, and thus, local methods may end up in suboptimal solutions. Lin and Stadtherr (2006) [11] or Polisetty et al. (2006) [12] suggested the use of global deterministic optimizers for parameter estimation. Although very promising and powerful, there are still limitations to their application, mainly due to the rapid increase of computational cost with the size of the considered system and the number of its parameters. Alternatively, stochastic global optimization algorithms make use of pseudorandom sequences to determine search directions toward the global optimum. The main advantage of these methods is that they rapidly arrive at the proximity of the solution, which makes them particularly attractive to implement hybrid global-local optimizers suitable for dynamic optimization and optimal experimental design [13] and parameter estimation [14]. Villaverde et al. (2019) [15] conclude, in their benchmark of optimizers for parameter estimation, that the best performance is achieved with hybrid approaches such as the enhanced Scatter Search method (eSS, [16, 17]).
Fig. 3 Classification of NLP solvers and some popular examples
AMIGO2 [18] is a multiplatform MATLAB-based toolbox designed to automate the solution of optimization problems which are at the core of systems and synthetic biology: (a) the iterative identification of dynamic models, (b) the use of optimality principles for predicting biological behavior, and (c) the multiobjective optimal control of biological systems. This chapter presents a protocol to iteratively build predictive mathematical models using parameter estimation and optimal experimental design as implemented in the AMIGO2 toolbox [18].
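To make the link between the experimental scheme and parameter precision concrete, the following Python sketch evaluates the weighted least squares cost of Eq. 6, the Fisher information matrix, and approximate confidence half-widths (Eqs. 9 and 10) for a one-state decay model, and compares two sampling schemes. It is an illustration under stated assumptions, independent of AMIGO2: the model, the parameter values, the noise level, and the sampling times are hypothetical; the FIM is written in its standard form for Gaussian noise (sum of sensitivity outer products divided by the variance); and the normal quantile 1.96 replaces Student's t.

```python
import math


def model(t, theta):
    """y(t) = x0*exp(-k*t), the solution of dx/dt = -k*x with x(0) = x0."""
    x0, k = theta
    return x0 * math.exp(-k * t)


def sensitivities(t, theta):
    """Partial derivatives of y(t) with respect to (x0, k)."""
    x0, k = theta
    return math.exp(-k * t), -x0 * t * math.exp(-k * t)


def cost(theta, times, data, sigma):
    """Weighted least squares cost (Eq. 6) with a common standard deviation."""
    return sum(((model(t, theta) - y) / sigma) ** 2
               for t, y in zip(times, data))


def fisher(theta, times, sigma):
    """Entries (F11, F12, F22) of the symmetric 2x2 FIM for Gaussian noise."""
    f11 = f12 = f22 = 0.0
    for t in times:
        s1, s2 = sensitivities(t, theta)
        f11 += s1 * s1 / sigma ** 2
        f12 += s1 * s2 / sigma ** 2
        f22 += s2 * s2 / sigma ** 2
    return f11, f12, f22


def half_widths(theta, times, sigma, quantile=1.96):
    """det(F) and confidence half-widths from the Cramer-Rao bound C >= F^-1."""
    f11, f12, f22 = fisher(theta, times, sigma)
    det = f11 * f22 - f12 * f12
    c11, c22 = f22 / det, f11 / det   # diagonal of the inverse of a 2x2 matrix
    return det, quantile * math.sqrt(c11), quantile * math.sqrt(c22)


theta = (10.0, 0.1)                # hypothetical near-optimum parameters (x0, k)
early = [0.5, 1.0, 1.5, 2.0]       # all samples taken early in the decay
spread = [0.5, 5.0, 10.0, 20.0]    # samples spread over the decay
```

Calling `half_widths(theta, early, 0.5)` and `half_widths(theta, spread, 0.5)` shows that, with the same number of samples, the spread scheme yields a larger determinant of the FIM and a narrower interval for the decay rate, which is the kind of trade-off that OED for parameter estimation optimizes.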
2 Materials
2.1 Toolbox Download and License
The AMIGO2 toolbox and the corresponding documentation are available at: https://sites.google.com/site/amigo2toolbox/. The toolbox is provided as a zip file with a password; it is free of charge for academic purposes under the Creative Commons license. For further details on license conditions, please visit http://creativecommons.org/licenses/by-nc-nd/3.0.
2.2 Toolbox Requirements and Installation Guide
AMIGO2 has been implemented in MATLAB and tested in several MATLAB versions. It can also interface with C code for model simulation and parameter estimation. For full capabilities, the user will require the following additional software:
• Cytoscape is needed for network visualization.
• MATLAB optimization toolbox is required to use the local optimizers fmincon (SQP method for constrained problems, suitable for dynamic optimization) or lsqnonlin (a least-squares local NLP solver, suited for parameter estimation).
• MATLAB symbolic manipulation toolbox is used to evaluate exact Jacobians and for network visualization.
• C compiler (e.g., gcc) is required to use AMIGO2-enhanced modes with C.
The most computationally demanding step in all tasks in AMIGO2 is the solution of the system dynamics, that is, the set of ordinary differential equations (ODE). In this regard, the tool can automatically generate C code, which is then automatically mexed to CVODES (Sundials) as included in AMIGO2.
The toolbox does not require installation. Once unzipped, open a MATLAB session and move to the AMIGO2 path. The code is initialized by typing AMIGO_Startup. The startup automatically adds AMIGO2 to the path and generates mex options files. From that moment on, users can access the Help from the MATLAB help Supplemental Software section.
2.3 Code Structure
AMIGO2 is organized in four main modules: the preprocessor, the numerical kernel, the postprocessor, and the module of main tasks. Figure 4 presents the code structure once unzipped.
• Help folder keeps all toolbox-related documentation.
• Examples folder keeps several implemented examples that the user may consider as templates to address new problems.
• Inputs folder, initially empty, is devoted to keeping new inputs created by users.
• Kernel folder keeps mathematical functions, NLP solvers, IVP solvers, and auxiliary code.
• Postprocessor folder keeps all MATLAB functions to generate reports, structures, and figures.
• Preprocessor folder keeps all MATLAB functions to generate MATLAB or C code, to mex files when required, and to create necessary paths.
• Release_info folder contains the file AMIGO_release_info.m with all details about the current release.
• Results folder, initially empty, is devoted, by default, to keeping all results. Users may create other results folders.
Fig. 4 Code structure. The code is organized in user-oriented folders (Examples, Inputs, Help, and Results), code folders (Preprocessor, Postprocessors, Add-ons, Release-info), and tasks (Startup, Prep, SModel, SObs, SData, LRank, GRank, ContourP, RIdent, PE, REG_PE, PE-PostAnalysis, OED, IOC, and DO)
Inputs to the code are kept in a MATLAB structure called inputs. Different tasks require different inputs. For the purpose of this chapter, we will use the following:
• inputs.model: To include all information about the model, that is, the number of states, parameters, and stimuli; their names; model equations; and a nominal value of the parameters.
• inputs.exps: To specify the experimental scheme, that is, the number of experiments and, for each experiment, its initial and stimulation conditions, observables, sampling times, experiment duration, and available experimental data and experimental noise.
• inputs.DOsol: To specify the optimal experimental design for model selection, that is, the objective functional, the control vector parameterization approach, initial conditions for the experiment, experiment duration, and constraints.
• inputs.OEDsol: To specify the objective of the optimal experimental design for parameter estimation.
• inputs.ivpsol: To specify the IVP solver, the sensitivity solver, and the integration tolerances. Defaults are set for these inputs.
• inputs.nlpsol: To define the NLP solver and the corresponding parameters, for example, the maximum number of function evaluations or the maximum computational time for optimization. Defaults are set for these inputs.
• inputs.plotd: To specify the characteristics of the display of results. Defaults are set for these inputs.
2.4 Basics on the Use of AMIGO2
As described in the previous section, AMIGO2 is organized in tasks or tools. Every task is devoted to a specific problem in systems and synthetic biology. The following tasks are of interest for optimal experimental design:
• AMIGO_Prep interprets the "inputs" structure and creates the necessary files for other tasks.
• AMIGO_SModel, AMIGO_SObs, and AMIGO_SData are devoted to model simulation. Models or observables are evaluated, and results are plotted against experimental data. Pseudo-data can be generated for numerical tests or synthetic problems.
• AMIGO_OED solves the model-based optimal experimental design problem to improve parametric identifiability.
• AMIGO_DO solves multi- and single-objective dynamic optimization problems using the control vector parameterization (CVP) approach [19]. This tool can be used for optimization-based modeling or stimulation design, among others. It is, therefore, applicable to the OED for model selection.
3 Methods
3.1 Illustrative Example: Modeling of an Inducible Promoter
We consider the modeling of an inducible promoter in Saccharomyces cerevisiae to motivate the protocol. Promoters are defined as the DNA sequence surrounding the transcription start site, which is bound by the basal transcription machinery, thus allowing transcriptional initiation. Knowledge about the promoter architecture led to the development of inducible promoter systems. Here, we consider a chemically regulated system which is induced by the presence of IPTG and controls the expression of a fluorescent reporter, Citrine. The model reads as follows [20]:
\[
\dot{L}_0 = k_{LacI} - (2 k_2\, IPTG_i + k_d) L_0 + k_{m2} L_1 - k_1 (2 G_{2,0} + G_{2,1}) L_0 + k_{m1} (G_{2,1} + 2 G_{2,2}) \tag{11}
\]
\[
\dot{L}_1 = k_2 (2 L_0 - L_1)\, IPTG_i - (k_{m2} + k_d) L_1 + 2 k_{m2} L_2 \tag{12}
\]
\[
\dot{L}_2 = k_2 L_1\, IPTG_i - (2 k_{m2} + k_d) L_2 \tag{13}
\]
\[
\dot{Lac12} = k_{Lac12} - (k_{TP1} + k_d)\, Lac12 \tag{14}
\]
\[
\dot{Lac12m} = k_{TP1}\, Lac12 - k_d\, Lac12m \tag{15}
\]
\[
\dot{G}_{2,0} = -2 k_1 L_0 G_{2,0} + (k_{m1} + k_d) G_{2,1} \tag{16}
\]
\[
\dot{G}_{2,1} = 2 k_1 L_0 G_{2,0} + 2 (k_{m1} + k_d) G_{2,2} - (k_{m1} + k_1 L_0 + k_d) G_{2,1} \tag{17}
\]
\[
\dot{G}_{2,2} = -2 (k_{m1} + k_d) G_{2,2} + k_1 L_0 G_{2,1} \tag{18}
\]
\[
\dot{IPTG}_i = \frac{k_{cat}\, Lac12m\, IPTG_e}{K_m + IPTG_e} - (k_{out} - k_d + 2 k_2 L_0 + k_2 L_1)\, IPTG_i + (k_d + k_{m2}) L_1 + 2 (k_d + k_{m2}) L_2 \tag{19}
\]
\[
\dot{Cit}_m = k_C G_{2,0} + l_k k_C (G_{2,1} + G_{2,2}) - k_d\, Cit_m \tag{20}
\]
where $L_0$ is the LacI2 repressor; $L_1$ and $L_2$ correspond to the LacI2·IPTGi and LacI2·2IPTGi complexes; Lac12 and Lac12m represent the Lac12 protein in the cytoplasm and membrane, respectively; $G_{2,0}$ corresponds to the gene that codes for Citrine; $G_{2,1}$ and $G_{2,2}$ are the G1·IPTGi and G1·2IPTGi complexes. Unknown kinetic constants include $k_{LacI}$, $k_2$, $k_d$, $k_{m2}$, $k_1$, $k_{m1}$, $k_{Lac12}$, $k_{TP1}$, $k_{cat}$, $K_m$, $k_{out}$, $k_C$, and $l_k$.
As a second candidate model, we consider the reduced model proposed by Bandiera et al. (2018). The model structure builds on the assumption of time-scale separation between the expression of the repressor LacI, its dimerization and subsequent binding to the operator sites and IPTG, considered at quasi-steady state, and Citrine expression. The model reads as follows:
\[
\dot{Cit}_{mrna} = \alpha_1 + V_{m1} \frac{IPTG_e^{h}}{K_{m1}^{h} + IPTG_e^{h}} - d_1\, Cit_{mrna} \tag{21}
\]
\[
\dot{Cit}_{foldedP} = \alpha_2\, Cit_{mrna} - (d_2 + K_f)\, Cit_{foldedP} \tag{22}
\]
\[
\dot{Cit}_{fluo} = K_f\, Cit_{foldedP} - d_2\, Cit_{fluo} \tag{23}
\]
where Citmrna, CitfoldedP, and Citfluo are the concentrations of Citrine mRNA, immature folded protein, and matured (fluorescent) protein. The model describes transcription, translation, and maturation of the fluorescent reporter. Transcription depends on the concentration of the inducer IPTGe through a Hill equation, where α1 is the basal transcription rate; Vm1 is the maximal induced transcription rate; h is the Hill coefficient; Km1 is the Michaelis–Menten coefficient. Translation occurs at a rate α2, and the folded protein matures at rate Kf.
mRNA and protein are subject to linear degradation occurring at rates d1 and d2, respectively. Both models, referred to here as MA and MB, respectively, were fitted to the time-series data available in [20]. Details on the fitting can be found in [21]. The best parameter values will be used here as the reference for optimal experimental design. Here, we show the definition of the reduced model (MB) in AMIGO2:
inputs.model.input_model_type='charmodelC'; % Model type: charmodelC for C code
inputs.model.n_st=4; % Number of states
inputs.model.n_par=9; % Number of model parameters
inputs.model.n_stimulus=1; % Number of inputs
inputs.model.st_names=char('Cit_mrna','Cit_foldedP',...
    'Cit_fluo','CitAUB'); % Names of states
inputs.model.par_names=char('alpha1','Vm1','h1','Km1','d1','alpha2',...
    'd2','Kf','sc_molec'); % Names of the parameters
inputs.model.stimulus_names=char('IPTGe'); % Names of the stimuli or inputs
inputs.model.eqns=... % ODEs system dynamics
    char('dCit_mrna=alpha1+Vm1*(IPTGe^h1/(Km1^h1+IPTGe^h1))-d1*Cit_mrna',...
    'dCit_foldedP=alpha2*Cit_mrna-(d2+Kf)*Cit_foldedP',...
    'dCit_fluo=Kf*Cit_foldedP-d2*Cit_fluo',...
    'dCitAUB=sc_molec*dCit_fluo');
inputs.model.par=[0.000377125304442752, 0.00738924359598526, 1.53333782244337, 5.01927275636639, 0.00118831480244382, 0.0461264539194078, 0.000475563708997018, 0.000301803966012407, 68.8669567134881]; % Reference value for the parameters
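As a quick sanity check outside AMIGO2, the reduced model of Eqs. (21) to (23) can be integrated with a few lines of Python. The sketch below is not part of the toolbox: it assumes a hand-written fourth-order Runge–Kutta integrator and uses the reference parameter values listed above; the function names (`hill`, `rhs`, `rk4_simulate`, `steady_state`) are illustrative only. Under sustained induction at IPTGe = 1000, the simulated state should settle close to the closed-form steady state.

```python
# Reference parameters of the reduced model, copied from inputs.model.par above.
P = {
    "alpha1": 0.000377125304442752, "Vm1": 0.00738924359598526,
    "h1": 1.53333782244337, "Km1": 5.01927275636639,
    "d1": 0.00118831480244382, "alpha2": 0.0461264539194078,
    "d2": 0.000475563708997018, "Kf": 0.000301803966012407,
    "sc_molec": 68.8669567134881,
}

def hill(iptg, p):
    """Hill term of Eq. (21)."""
    return iptg ** p["h1"] / (p["Km1"] ** p["h1"] + iptg ** p["h1"])

def rhs(y, iptg, p):
    """Right-hand side of Eqs. (21)-(23); y = [Cit_mrna, Cit_foldedP, Cit_fluo]."""
    mrna, folded, fluo = y
    return [
        p["alpha1"] + p["Vm1"] * hill(iptg, p) - p["d1"] * mrna,
        p["alpha2"] * mrna - (p["d2"] + p["Kf"]) * folded,
        p["Kf"] * folded - p["d2"] * fluo,
    ]

def rk4_simulate(y0, iptg, p, dt, n_steps):
    """Classical RK4 with a constant inducer level (illustrative integrator)."""
    y = list(y0)
    for _ in range(n_steps):
        k1 = rhs(y, iptg, p)
        k2 = rhs([yi + 0.5 * dt * ki for yi, ki in zip(y, k1)], iptg, p)
        k3 = rhs([yi + 0.5 * dt * ki for yi, ki in zip(y, k2)], iptg, p)
        k4 = rhs([yi + dt * ki for yi, ki in zip(y, k3)], iptg, p)
        y = [yi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
    return y

def steady_state(iptg, p):
    """Closed-form steady state; the last entry is fluorescence in A.U."""
    mrna = (p["alpha1"] + p["Vm1"] * hill(iptg, p)) / p["d1"]
    folded = p["alpha2"] * mrna / (p["d2"] + p["Kf"])
    fluo = p["Kf"] * folded / p["d2"]
    return [mrna, folded, fluo, p["sc_molec"] * fluo]

y_end = rk4_simulate([0.0, 0.0, 0.0], 1000.0, P, dt=10.0, n_steps=5000)
y_ss = steady_state(1000.0, P)
print(y_ss)
```

The time step and horizon are expressed in the (unstated) time unit of the rate constants and were chosen only so that the slowest mode has decayed; they are assumptions of this sketch, not settings taken from the protocol.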
3.2 Optimal Experimental Design for Model Selection in AMIGO2
Currently, AMIGO2 does not offer a specific task to solve the problem of OED for model selection. Still, it is possible to use AMIGO_DO (dynamic optimization) for that purpose. Note that the use of DO implies that it is possible to measure regularly over time. AMIGO_DO requires the definition of inputs.model, inputs.DOsol, inputs.ivpsol, inputs.nlpsol, and inputs.plotd.
3.2.1 Definition of the Objective Functional
The first step in the protocol corresponds to the definition of the objective functional that will characterize the differences between the models. Several possibilities exist. Here, we include a couple of examples:
1. The square root of the integral of the squared differences between the predicted fluorescence of the two models:

$$J_{OED,MS} = \left[\int_{t=0}^{t=t_f} \left(Cit_{AU,A} - Cit_{AU,B}\right)^2 dt\right]^{1/2} \tag{24}$$

where CitAU,A and CitAU,B correspond to the fluorescence in arbitrary units as predicted by models A and B, respectively. A scaling parameter transforms Citm (in model A) or Citfluo (in model B) into the A.U. of the experimental data.
2. The same integral of the squared differences, divided by the experiment duration:

$$J_{OED,MS} = \left[\frac{1}{t_f}\int_{t=0}^{t=t_f} \left(Cit_{AU,A} - Cit_{AU,B}\right)^2 dt\right]^{1/2} \tag{25}$$
Note that this second possibility penalizes longer experiments. Alternative formulations to penalize experiment duration are possible. Solving the problem as a multiobjective problem would also make it possible to find the Pareto front of solutions offering the best compromise between model differences and experiment duration. Here, we focus on the single-objective case. The definition of the cost function in AMIGO2 requires the declaration of both models in sequence plus one or more additional ODEs to account for the objective functional(s). In this way, models and objective functional are solved simultaneously, reducing the computational effort. For the particular case of the functionals in Eqs. (24) and (25), the additional ODEs read:
Equation 24: 'dJ_OED_MS=(CitAUA-CitAUB)^2'
Equation 25: 'dtfinal=1',... 'dJ_OED_MS=(1/tfinal)*(CitAUA-CitAUB)^2'
3.2.2 Definition of the Optimization Problem
The definition of the dynamic optimization problem requires the following elements: the initial conditions for model simulation; the tentative experiment duration; the type of optimization problem (minimization or maximization); the definition of the objective; and the control vector parameterization (type of input interpolation, number of discretization elements, initial guess and bounds for the inputs, and bounds for the experiment duration). The corresponding inputs are shown below. For illustrative purposes, we will assume that the experiment may last between 8 and 24 h, and the input profile corresponds to a step-wise profile with five elements of fixed duration (i.e., the experiment is split into five segments of equal duration). The input parameterization can be easily modified in the inputs structure to consider steps of varying duration, pulse-wise profiles, or linear-wise profiles. It should be noted that the use of steps or linear-wise profiles with elements of varying duration increases the multimodality of the optimization problem. In general, solving a case with ten constant duration steps is simpler than solving a case with five steps whose duration is also
optimized, even if the NLP problem has the same number of decision variables.
inputs.DOsol.y0=[6.105e-05 0.32025 419.98 10251 598.88 1247.8 0.99998 1.9501e-05 9.5071e-11 3.1415e+09 17157 6.5338 387.69 246.04 16944 0 0]; % Initial conditions (state after overnight at maximum IPTGe)
inputs.DOsol.tf_type='od'; % Experiment duration: fixed or designed
inputs.DOsol.tf_guess=15*3600; % Tentative experiment duration
inputs.DOsol.DOcost_type='max'; % Type of problem: max/min
inputs.DOsol.DOcost='sqrt(J_OED_MS)'; % Objective functional
% CVP DETAILS
inputs.DOsol.u_interp='stepf'; % Stimuli definition: 'sustained' | 'step' | 'stepf' | 'linear'
inputs.DOsol.n_steps=5;
inputs.DOsol.u_guess=500*ones(1,inputs.DOsol.n_steps); % Guess for the input
inputs.DOsol.u_min=zeros(1,inputs.DOsol.n_steps);
inputs.DOsol.u_max=1000*ones(1,inputs.DOsol.n_steps); % Min/max for the input
inputs.DOsol.t_con=linspace(0,inputs.DOsol.tf_guess,inputs.DOsol.n_steps+1); % Input switching times: initial and final time
inputs.DOsol.tf_min=8*3600;
inputs.DOsol.tf_max=24*3600; % Min/max for the experiment duration
3.2.3 Definition of the Numerical Methods
The user selects the initial value problem solver plus the optimizer. AMIGO_DO allows for successive input refinements, and therefore, the user may activate that possibility.
% SIMULATION
%
inputs.ivpsol.ivpsolver='cvodes'; % IVP solver: 'cvodes' (default, C) |
% 'ode15s' (default, MATLAB, sbml) | 'ode113' | 'ode45'
inputs.ivpsol.rtol=1.0D-7; % [] IVP solver integration tolerances
inputs.ivpsol.atol=1.0D-7;
% OPTIMIZATION
%
inputs.nlpsol.nlpsolver='local_fmincon'; % [] NLP solver:
% LOCAL: 'local_fmincon'|'local_n2fb'|'local_dn2fb'|'local_dhc'|
% 'local_ipopt'|'local_solnp'|'local_nomad'|
% MULTISTART: 'multi_fmincon'|'multi_n2fb'|'multi_dn2fb'|'multi_dhc'|
% 'multi_ipopt'|'multi_solnp'|'multi_nomad'|
% GLOBAL: 'de'|'sres'
% HYBRID: 'hyb_de_fmincon'|'hyb_de_n2fb'|'hyb_de_dn2fb'|'hyb_de_dhc'|
% 'hyp_de_ipopt'|'hyb_de_solnp'|'hyb_de_nomad'|
% 'hyb_sres_fmincon'|'hyb_sres_n2fb'|'hyb_sres_dn2fb'|'hyb_sres_dhc'|
% 'hyp_sres_ipopt'|'hyb_sres_solnp'|'hyb_sres_nomad'
% METAHEURISTICS:
% 'ess' or 'eSS' (default)
% Note that the corresponding defaults are in files:
% OPT_solvers\DE\de_options.m; OPT_solvers\SRES\sres_options.m;
% OPT_solvers\eSS_**\ess_options.m
%
inputs.nlpsol.reopt='off'; % Reoptimization
inputs.nlpsol.reopt_local_solver='fmincon'; % Optimiser for reoptimization
inputs.nlpsol.n_reOpts=2; % Number of reoptimizations
3.2.4 Running the Code
The first step is to preprocess the model to generate the necessary scripts: C code for model simulation and the objective function. After preprocessing, AMIGO_DO can be run.
>> IP_model_selection % Reads the inputs structure
>> AMIGO_Prep(inputs) % Runs preprocess
>> AMIGO_DO(inputs) % Solves the OED problem
The local optimizer (fmincon) was not able to converge: the final solution corresponds to the initial guess. At this point, we switch to a global optimizer. In this particular case, we have selected eSS [16]. As an alternative, it is recommended to check differential evolution [22], as it has been quite successful in solving dynamic optimization problems [13].
inputs.nlpsol.nlpsolver='ess';
inputs.nlpsol.eSS.maxeval=100000; % Maximum number of function evaluations
inputs.nlpsol.eSS.maxtime=120; % Maximum CPU time in seconds
inputs.nlpsol.eSS.local.solver='fmincon'; % Local solver refinements
inputs.nlpsol.eSS.local.finish='fminsearch';
Since the optimizer is stochastic, it is advised to run the code several times to check for convergence. Note that if all runs provide different solutions, the maximum computation time should be increased. Ideally, all runs should end up in the same optimum.
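Each objective evaluation requested by the optimizer amounts to one simulation of the two candidate models together with the extra ODE for the objective functional, as described in Subheading 3.2.1. The following Python sketch illustrates that construction on two hypothetical first-order models (they stand in for MA and MB and are not the models of this chapter); `model_distance` and its arguments are names invented for the illustration. With `penalize_duration=True` the integral is divided by the final time after integration, which corresponds to Eq. (25) as written.

```python
import math

def model_distance(k_a, k_b, u_steps, t_final, n_sub=200, penalize_duration=False):
    """Integrate two toy models plus the objective ODE under a step-wise input.

    state = [x_a, x_b, J], with dx/dt = u - k*x for each model and
    dJ/dt = (x_a - x_b)^2, mirroring 'dJ_OED_MS=(CitAUA-CitAUB)^2'.
    """
    def rhs(state, u):
        x_a, x_b, _ = state
        return [u - k_a * x_a, u - k_b * x_b, (x_a - x_b) ** 2]

    state = [0.0, 0.0, 0.0]
    dt = t_final / len(u_steps) / n_sub      # equal-duration steps
    for u in u_steps:                        # piecewise-constant input
        for _ in range(n_sub):               # classical RK4 inside each step
            k1 = rhs(state, u)
            k2 = rhs([s + 0.5 * dt * k for s, k in zip(state, k1)], u)
            k3 = rhs([s + 0.5 * dt * k for s, k in zip(state, k2)], u)
            k4 = rhs([s + dt * k for s, k in zip(state, k3)], u)
            state = [s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                     for s, a, b, c, d in zip(state, k1, k2, k3, k4)]
    j = state[2]
    if penalize_duration:
        j /= t_final                         # Eq. (25): average over the experiment
    return math.sqrt(j)

u_profile = [1.0, 0.0, 1.0, 0.0, 1.0]        # pulse-wise profile with five steps
print(model_distance(1.0, 2.0, u_profile, 10.0))
print(model_distance(1.0, 2.0, u_profile, 10.0, penalize_duration=True))
```

Because the integrand is non-negative, the unpenalized distance can only grow with the experiment duration, which is consistent with the optimum for Eq. (24) sitting at the upper bound on the final time.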
Fig. 5 Optimal experimental design for model selection. (a) The optimal IPTG profile corresponds to a pulsewise profile, starting from the absence of stimulation. Response time of model B to IPTG pulses is shorter than the corresponding response time for model A. (b) The optimal IPTG profile starts from no stimulation for more than half the experiment, and after that, the IPTG value increases to a final value of 152. The experiment lasts the minimum allowed of 8 h. In both cases, models respond differently, thus being distinguishable
The optimal IPTG profile, as obtained from the maximization of the models’ distance defined in Eq. (24), corresponds to a pulse-wise profile, shown in Fig. 5a. Since there is no penalty on the final time, the optimum corresponds to the maximum of 24 h. The optimal IPTG profile, as obtained from the maximization of the models’ distance defined in Eq. (25), corresponds to the step-wise profile shown in Fig. 5b. Note that the optimum is achieved in an experiment lasting 8 h.
3.3 Optimal Experimental Design for Parameter Estimation in AMIGO2
AMIGO_OED offers the possibility of solving the optimal experimental design problem for parameter estimation. The problem is formulated as a dynamic optimization problem in which the objective is to find the experimental scheme that optimizes a specific scalar functional of the Fisher information matrix (for instance, maximizing its determinant or its minimum eigenvalue), subject to a given set of constraints. The model is defined as in Subheading 3.1.
3.3.1 Definition of the Objective Functional
The toolbox predefines various OED problems. The most widely used are as follows:
• D-optimum design corresponds to the maximization of the determinant of the Fisher information matrix. This design is particularly suited to cases in which parameters are not highly correlated but poorly identifiable.
• E-optimum design corresponds to the maximization of the minimum eigenvalue of the Fisher information matrix. This design is particularly adequate for those cases in which parameters are highly correlated, that is, the confidence hyperellipsoid is highly eccentric.
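The two criteria can be illustrated on a deliberately small example. The Python sketch below assumes a hypothetical two-parameter output y(t) = a·exp(−b·t) with homoscedastic noise of standard deviation sigma, builds the 2 × 2 Fisher information matrix from the analytical sensitivities, and evaluates the determinant (D-criterion) and the minimum eigenvalue (E-criterion) for two sampling schemes. The model, the function names, and the numbers are assumptions made for illustration; they are unrelated to AMIGO2's internal implementation.

```python
import math

def fim_exponential(times, a, b, sigma):
    """FIM entries (f11, f12, f22) for y(t) = a*exp(-b*t), noise std sigma."""
    f11 = f12 = f22 = 0.0
    for t in times:
        s_a = math.exp(-b * t)            # dy/da
        s_b = -a * t * math.exp(-b * t)   # dy/db
        f11 += s_a * s_a / sigma ** 2
        f12 += s_a * s_b / sigma ** 2
        f22 += s_b * s_b / sigma ** 2
    return f11, f12, f22

def d_criterion(fim):
    """Determinant of the symmetric 2x2 FIM (to be maximized)."""
    f11, f12, f22 = fim
    return f11 * f22 - f12 ** 2

def e_criterion(fim):
    """Minimum eigenvalue of the symmetric 2x2 FIM (to be maximized)."""
    f11, f12, f22 = fim
    half_gap = 0.5 * math.sqrt((f11 - f22) ** 2 + 4.0 * f12 ** 2)
    return 0.5 * (f11 + f22) - half_gap

single = fim_exponential([1.0], a=2.0, b=0.5, sigma=0.1)
spread = fim_exponential([0.5, 1.0, 2.0, 4.0], a=2.0, b=0.5, sigma=0.1)
print(d_criterion(single), d_criterion(spread))
print(e_criterion(single), e_criterion(spread))
```

A single sampling time cannot separate the two parameters, so its Fisher information matrix is singular (zero determinant up to rounding), whereas spreading the samples makes both criteria strictly positive.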
Note that the evaluation of the Fisher information matrix requires information about the expected (or typical) experimental noise in the system. The definition of the objective functional would read as follows:
%==================================
% OBJECTIVE FUNCTIONAL RELATED DATA
%==================================
inputs.PEsol.id_global_theta='all'; % Parameters to be considered for OED: 'all' | user selected
inputs.PEsol.global_theta_guess=inputs.model.par; % Nominal value of the parameters to compute the FIM
inputs.exps.noise_type='homo_var'; % Type of experimental noise: 'homo' | 'homo_var' | 'hetero'
inputs.exps.std_dev{1}=0.1; % Standard deviation of the noise for each experiment, e.g., 0.05 for 5%
inputs.OEDsol.OEDcost_type='Eopt'; % FIM-based criterion: 'Dopt' | 'Eopt' | 'Aopt' | 'Emod'
3.3.2 Definition of the Optimization Problem
The user needs to define what is being designed: initial conditions, stimuli condition, observation function, experiment duration, and number and location of sampling times. Inputs are shown in the sequel. In this particular example, we will assume that we design a single 24-h experiment, and the input profile corresponds to a step-wise profile with five elements of fixed duration.
%===========================================
% DEFINITION OF EXPERIMENT 1: TO BE DESIGNED
%===========================================
inputs.exps.exp_type{1}='od';
inputs.exps.n_obs{1}=1; % Number of observables
inputs.exps.obs_names{1}=char('CitAU'); % Name of observables
inputs.exps.obs{1}=char('CitAU=CitAUB'); % Observation function
% Initial conditions for the experiment
inputs.exps.exp_y0{1}=[6.5338 387.69 246.04 16944];
inputs.exps.t_f{1}=24*3600; % Experiment duration
inputs.exps.ts_type{1}='fixed'; % [] Type of sampling times: 'fixed' (default) | 'od' (to be designed)
inputs.exps.n_s{1}=244; % Number of sampling times, every 5 min
inputs.exps.u_type{1}='od'; % Type of stimulation: 'fixed' | 'od' (to be designed)
inputs.exps.u_interp{1}='stepf'; % Stimuli definition for experiment 1
% OPTIONS: u_interp: 'sustained' | 'step' | 'linear' (default) | 'pulse-up' | 'pulse-down'
inputs.exps.n_steps{1}=5; % Number of pulses _|-|_|-|_
inputs.exps.u_min{1}=0*ones(1,inputs.exps.n_steps{1});
inputs.exps.u_max{1}=1000*ones(1,inputs.exps.n_steps{1}); % Max/min u
inputs.exps.u_guess{1}=1000*rand(1,inputs.exps.n_steps{1}); % Guess u
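The step-wise parameterization used above (five steps of fixed, equal duration over 24 h, with a random initial guess between the input bounds) can be pictured with the following Python sketch. It only mimics the idea behind a fixed-duration step profile; `switching_times` and `step_input` are hypothetical helper names, not AMIGO2 functions, and the random seed is arbitrary.

```python
import bisect
import random

def switching_times(t_final, n_steps):
    """n_steps equal segments need n_steps + 1 switching times (cf. linspace)."""
    return [t_final * i / n_steps for i in range(n_steps + 1)]

def step_input(t, t_con, u_levels):
    """Piecewise-constant input: level of the segment that contains time t."""
    idx = bisect.bisect_right(t_con, t) - 1
    idx = min(max(idx, 0), len(u_levels) - 1)   # clamp t = t_final to last step
    return u_levels[idx]

random.seed(0)                                   # reproducible illustrative guess
n_steps = 5
t_con = switching_times(24 * 3600, n_steps)      # seconds
u_guess = [random.uniform(0.0, 1000.0) for _ in range(n_steps)]

for t in (0.0, 6 * 3600.0, 12 * 3600.0, 24 * 3600.0):
    print(t, step_input(t, t_con, u_guess))
```

With this parameterization the optimizer only has to choose the five levels, which keeps the number of decision variables small.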
3.3.3 Definition of the Numerical Methods
The evaluation of the Fisher information matrix requires the solution of the model parametric sensitivities. The toolbox implements several possibilities including CVODES (for C code), a modification of ode15s for sensitivity computation (for MATLAB models), and a couple of finite differences schemes which may be used for C, MATLAB, or blackbox models.
% SIMULATION
inputs.ivpsol.ivpsolver='cvodes'; % IVP solver: 'cvodes' (default, C) |
% 'ode15s' (default, MATLAB, sbml) | 'ode113' | 'ode45'
inputs.ivpsol.senssolver='cvodes'; % Sensitivities solver: 'cvodes' (default, C) | 'sensmat' (matlab) |
% Finite differences: 'fdsens2' | 'fdsens5'
inputs.ivpsol.rtol=1.0D-7; % Solver integration tolerances
inputs.ivpsol.atol=1.0D-7;
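To give an idea of what a finite-difference sensitivity scheme computes, the Python sketch below approximates the parametric sensitivity of a hypothetical output x(t) = x0·exp(−k·t) with a central difference and compares it with the analytical value −t·x0·exp(−k·t). This is a generic textbook scheme written for illustration; it is not a reimplementation of the 'fdsens2' or 'fdsens5' options, and the relative step size is an assumption.

```python
import math

def output(k, t, x0=1.0):
    """Toy model output: solution of dx/dt = -k*x with x(0) = x0."""
    return x0 * math.exp(-k * t)

def central_difference_sensitivity(model, k, t, rel_step=1e-4):
    """Approximate d(model)/dk at (k, t) with a central finite difference."""
    h = rel_step * abs(k) if k != 0.0 else rel_step
    return (model(k + h, t) - model(k - h, t)) / (2.0 * h)

k_nominal, t_sample = 0.5, 2.0
approx = central_difference_sensitivity(output, k_nominal, t_sample)
exact = -t_sample * output(k_nominal, t_sample)
print(approx, exact)
```

For ODE models without a closed-form solution, `output` would be replaced by a call to the simulator, at the cost of two extra simulations per parameter.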
Fig. 6 Optimal input profiles for parameter estimation. (a) Presents the optimal input to achieve maximum information, that is, to maximize the determinant of the Fisher information matrix (Dopt). (b) Presents the optimal input to achieve minimum correlation, that is, to maximize the minimum eigenvalue of the Fisher information matrix (Eopt). The profiles are completely different: for Dopt, the optimum corresponds to a pulse-wise profile starting from full stimulation, whereas for Eopt, it seems more convenient to use steps of different magnitudes
As for the optimizers, the definition is similar to that in Subheadings 3.2.3 and 3.2.4.
3.3.4 Running the Code
The first step is to preprocess the model to generate the necessary scripts: C code for model simulation and the objective function. After preprocessing, AMIGO_OED can be run.
>> IP_OED % Reads the inputs structure
>> AMIGO_Prep(inputs) % Runs preprocess
>> AMIGO_OED(inputs) % Solves the OED problem
As an initial guess, we considered a random input profile, which corresponds to a minimum eigenvalue (Eopt) of 5.29 × 10^-11 and a volume (Dopt) of 9.23 × 10^9. After optimizing to maximize the minimum eigenvalue, Eopt is 4.13 × 10^-5, while the volume of the corresponding Fisher information matrix is 1.14 × 10^11. When solving the problem to maximize the volume of the information, Dopt is 2.54 × 10^18, while the minimum eigenvalue is 3.56 × 10^-5. Figure 6 shows the optimal input profiles for both cases. The protocol can be complemented with new experiment designs in an online optimal experimental design scheme such as the one suggested in [21]. The underlying idea is to iteratively improve the quality of the parameter estimates given previous results. For this purpose, the iterative procedure could eventually switch between objectives and focus the attention on the less accurately estimated parameters.
Acknowledgments
The authors acknowledge financial support from the Spanish Ministry of Science, Innovation and Universities and the European Union FEDER (project grant RTI2018-093744-B-C33). This work was also supported by a Royal Society of Edinburgh-MoST grant, EPSRC grants EP/R035350/1 and EP/S001921/1 to Dr. Menolascina, and the EPSRC grant EP/P017134/1-CONDSYC to Dr. Bandiera.
References
1. Jaqaman K, Danuser G (2006) Linking data to models: data regression. Nat Rev Mol Cell Biol 7(11):813–819
2. Balsa-Canto E, Alonso AA, Banga JR (2008) Computational procedures for optimal experimental design in biological systems. IET Syst Biol 2(4):163–172
3. Kreutz C, Timmer J (2009) Systems biology: experimental design. FEBS J 276(4):923–942
4. Walter E, Pronzato L (1997) Identification of parametric models from experimental data. Springer
5. Quarteroni A, Sacco R, Saleri F (2000) Numerical mathematics. Springer-Verlag, New York
6. Fletcher R (1987) Practical methods of optimization. Wiley, Chichester
7. Seber GAF, Wild CJ (1989) Nonlinear regression. Wiley series in probability and mathematical statistics. Wiley, New York
8. Schittkowski K (2002) Numerical data fitting in dynamical systems. Kluwer, Dordrecht
9. Fröhlich F, Kaltenbacher B, Theis FJ, Hasenauer J (2017) Scalable parameter estimation for genome-scale biochemical reaction networks. PLoS Comp Biol 13(1):e1005331
10. Balsa-Canto E, Banga JR, Alonso AA, Vassiliadis VS (2002) Restricted second order information for the solution of optimal control problems using control vector parameterization. J Proc Cont 12(2):243–255
11. Lin Y, Stadtherr MA (2006) Deterministic global optimization for parameter estimation of dynamic systems. Ind Eng Chem Res 45:8438–8448
12. Polisetty P, Voit E, Gatzke E (2006) Identification of metabolic system parameters using global optimization methods. Theor Biol Med Model 3:4
13. Balsa-Canto E, Vassiliadis VS, Banga JR (2005) Dynamic optimization of single- and multistage systems using a hybrid stochastic-deterministic method. Ind Eng Chem Res 44(5):1514–1523
14. Rodriguez-Fernandez M, Mendes P, Banga JR (2006) A hybrid approach for efficient and robust parameter estimation in biochemical pathways. Biosystems 83(2–3):248–265
15. Villaverde AF, Fröhlich F, Weindl D, Hasenauer J, Banga JR (2019) Benchmarking optimization methods for parameter estimation in large kinetic models. Bioinformatics 35(5):830–838
16. Egea JA, Balsa-Canto E, García M-SG, Banga JR (2009) Dynamic optimization of nonlinear processes with an enhanced scatter search method. Ind Eng Chem Res 48(9):4388–4401
17. Egea JA, Martí R, Banga JR (2010) An evolutionary method for complex-process optimization. Comp Oper Res 37(2):315–324
18. Balsa-Canto E, Henriques D, Gabor A, Banga JR (2016) AMIGO2, a toolbox for dynamic modeling, optimization and control in systems biology. Bioinformatics 32(21):3357–3359
19. Vassiliadis VS, Sargent RWH, Pantelides CC (1994) Solution of a class of multi-stage dynamic optimization problems: 1, problems without path constraints, 2, problems with path constraints. Ind Eng Chem Res 33(2111–2122):2123–2133
20. Gnügge R, Dharmarajan L, Lang M, Stelling J (2016) An orthogonal permease–inducer–repressor feedback loop shows bistability. ACS Synth Biol 5:1098–1107
21. Bandiera L, Hou Z, Kothamachu V, Balsa-Canto E, Swain P, Menolascina F (2018) On-line optimal input design increases the efficiency and accuracy of the modelling of an inducible synthetic promoter. Processes 6(9):148
22. Storn R, Price K (1997) Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. J Glob Optim 11:341–359
Chapter 12
A Cyber-Physical Platform for Model Calibration
Lucia Bandiera, David Gomez-Cabeza, Eva Balsa-Canto, and Filippo Menolascina
Abstract
Synthetic biology has so far made limited use of mathematical models, mostly because their inference has been traditionally perceived as expensive and/or difficult. We have recently demonstrated how in silico simulations and in vitro/vivo experiments can be integrated to develop a cyber-physical platform that automates model calibration and leads to saving 60–80% of the effort. In this book chapter, we illustrate the protocol used to attain such results. By providing a comprehensive list of steps and pointing the reader to the code we use to operate our platform, we aim at providing synthetic biologists with an additional tool to accelerate the pace at which the field progresses toward applications.
Key words Synthetic biology, Mathematical modeling, System identification, Optimal experimental design, Microfluidics
1 Introduction
Despite a booming community and some notable successes of synthetic biology, synthesizing new genetic circuits remains extremely time-consuming. This is mostly due to the fact that their building blocks, so-called parts, are rarely properly characterized. Mathematical models are uniquely suited to address this problem. However, despite being an engineering discipline, synthetic biology has so far made limited use of them, mostly because their inference has been traditionally perceived as expensive and/or difficult. Our group recently proposed [1] to combine Optimal Experimental Design (OED) and microscopy/microfluidics in a cyber-physical platform (Fig. 1a) that automates model calibration, i.e., the identification of parameters in a model. Given a part of interest and an initial model for it, this system iteratively identifies the most informative experiment to refine parameter estimates and runs such experiments (off-line). In the on-line configuration, the system periodically uses the newly
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_12, © Springer Science+Business Media, LLC, part of Springer Nature 2021
Fig. 1 Cyber-physical platform and test case used to illustrate its implementation. (a) In the cyber-physical platform, the computer where the OED algorithm is implemented quantifies gene expression and uses Parameter Estimation/OED to stimulate the cells with inputs that maximize the amount of information extracted per experiment. Such input is translated in a stimulus for the cells in the microfluidic device using a Hydrostatic Pressure Modulation System (HPMS, see [7]). A microscope is used to observe cells and close the OED loop. (b) An inducible promoter in engineered yeast cells, presented in [2], is considered in the following. (c) Ordinary differential equation model used to mathematically formalize the behavior of the inducible promoter
acquired experimental data to update the model and design an optimal experiment for the new model, iterating until robust estimates are reached. To achieve automatic model inference, the platform we developed integrates seamlessly microfluidics, fluorescence videomicroscopy, real-time optimization, and fluidic actuation (Fig. 1a). In a computational study [1] we determined that, compared to standard experiments used in systems and synthetic biology, this platform allows the error of parametric estimates to be reduced by 60% when used in the off-line configuration. The on-line (Fig. 2) mode allows increasing this figure to over 80%. In the following, we consider the problem of identifying the model of an inducible promoter in yeast S. cerevisiae as an example of the application of our platform. We readapted the model from the original paper [2] and, as a first step, we run structural identifiability, sensitivity, and practical identifiability analyses. These analyses aim at determining the parameters the Optimal Experimental Design should focus on. In fact, there is no reason for optimizing the design of experiments for parameters that are not theoretically (structurally) identifiable, parameters that do not sufficiently affect the dynamical behavior of the model or that, in practical terms, have little influence on the output. This is a crucial step as (a) there
Fig. 2 On-line vs off-line OED. In off-line OED (a) the input (red signal) is optimized before the beginning of the experiment and then applied during the experiment while the output (green) is recorded. The experiment stops at τH = τS, when the data are gathered for a potential new iteration. At this point, the off-line and on-line modes differ: in on-line OED (b), at τS < τH a new parameter estimation routine is run on the input/output data acquired up until then (0 < t < τS). The resulting model ℳ(p1) is used to design a new optimal input u₂* that maximizes the information content of subexperiment 2 when it is administered to the cells
is no reason for optimizing the design of experiments for parameters that are not theoretically/practically identifiable and (b) by allowing us to focus only on identifiable parameters, this analysis reduces the dimensionality of the Fisher Information Matrix and the computational complexity of the Optimal Experimental Design procedure. These steps usually take place well before the experiments themselves. The day before the experiment, we fabricate the microfluidic devices and inoculate the overnight (O/N) cultures. On the day of the experiment, we start by mounting the microfluidic device on the microscope, connecting the fluidic lines, and loading the cells that carry the part of interest. Then, the experiment management algorithm is started: this software triggers the acquisition of images from the microscope, segments/tracks cells, quantifies the expression levels of the gene of interest and records the input/output data. The experiment either stops after one iteration (off-line OED) or the new data are used to (a) update the model and (b) design a new input to refine it (Fig. 2). The procedure is
repeated until parametric convergence or external factors bring the experiment to a halt. In the following, we describe all wet- and dry-lab procedures to automate model inference. The reader is provided with pointers to additional protocols and code where necessary to maximize the applicability of the outlined procedure.
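The on-line loop described above (design an input, run a subexperiment, re-estimate the parameters, repeat until convergence) can be summarized in a short Python skeleton. Everything in it is a placeholder chosen for illustration: the `design`, `run_subexperiment`, and `estimate` callables stand in for the OED step, the microfluidic experiment, and the parameter estimation step, and the toy system is a noise-free static gain. It is a sketch of the control flow under these assumptions, not the code that operates the platform.

```python
def online_oed(design, run_subexperiment, estimate, p_init, n_sub, tol):
    """Iterate design -> experiment -> estimation until the estimate settles."""
    p = p_init
    inputs, outputs = [], []
    for _ in range(n_sub):
        u = design(p)                       # most informative input for current model
        y = run_subexperiment(u)            # apply it to the cells, record the output
        inputs.append(u)
        outputs.append(y)
        p_new = estimate(inputs, outputs)   # refit using all data gathered so far
        converged = abs(p_new - p) <= tol * abs(p_new)
        p = p_new
        if converged:
            break
    return p, len(inputs)

# Toy stand-ins (hypothetical): static system y = gain * u, least-squares fit.
TRUE_GAIN = 2.5

def toy_design(p):
    return 10.0                             # placeholder: always use the largest input

def toy_experiment(u):
    return TRUE_GAIN * u

def toy_estimate(us, ys):
    return sum(u * y for u, y in zip(us, ys)) / sum(u * u for u in us)

p_hat, n_used = online_oed(toy_design, toy_experiment, toy_estimate,
                           p_init=1.0, n_sub=10, tol=1e-6)
print(p_hat, n_used)
```

In the off-line configuration the same loop would be executed with a single iteration, the new design being carried over to a separate experiment.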
2 Materials
Prepare and store all reagents at room temperature, unless otherwise stated. Comply with laboratory health and safety and waste disposal regulations.
2.1 Computational Tools
1. Structural identifiability analysis: MATLAB toolbox STRIKE-GOLDD [3].
2. Practical identifiability and sensitivity analyses, parameter estimation, optimal experimental design: MATLAB toolbox AMIGO2 [4].
3. Image processing: cell segmentation and tracking are performed using the ImageJ plugins U-Net [5] and Lineage Mapper [6].
2.2 Microfluidic Device Fabrication
1. Soft lithography: wafer of the MFD005a microfluidic device [7], disposable plastic cup and spatula, silicone elastomer kit (Sylgard 184), desiccator, vacuum pump, and oven.
2. Polydimethylsiloxane (PDMS) processing: razor blades, sterile disposable scalpel, 1 mm disposable biopsy puncher, magic tape, Kimwipes, 5-mL disposable syringe, 25G needle, DI water, fine bore polythene tubing, high-precision cover slips (24 × 60 mm, No. 1.5H), light source, UV cleaner, and oven.
2.3 Microfluidic Experimental Setup
1. Supplement 60 mL of synthetic complete (SC) media with the appropriate sugar, e.g., glucose: 20% w/V solution in water. Weigh 20 g of glucose and transfer it to a 100-mL graduated cylinder containing about 60 mL of water. Make it up to 100 mL with water and filter-sterilize through a 0.22 μm filter.
2. Prepare a chemical inducer solution, e.g., isopropyl β-D-1-thiogalactopyranoside (IPTG): 0.1 M solution in double distilled water. Weigh 0.238 g of IPTG powder in a weighing tray and transfer to a 25-mL beaker containing 8 mL of water. Make it up to 10 mL with water and filter-sterilize through a 0.22 μm filter. Prepare 1 mL aliquots and store them at −20 °C.
3. Prepare a fluorescent dye to track the inducer concentration, e.g., sulforhodamine B: 0.1% w/V solution in water. Weigh 10 mg of sulforhodamine B powder in a 15-mL Falcon tube and resuspend in 8 mL of water. Make it up to 10 mL with water and filter-sterilize through a 0.22 μm filter. Store in the dark.
4. Colonies of the S. cerevisiae strain under investigation.
5. Microfluidic devices.
6. Seven 25G needles, six 50-mL and one 5-mL sterile, disposable syringes.
7. Fine bore polythene tubing (SMME800/100/120), electric tape, Kimwipes.
8. Hydrostatic pressure linear actuators system [7].
9. Nikon Eclipse TI fluorescence microscope.
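The quantities weighed in items 1 to 3 follow from simple arithmetic, which the Python sketch below reproduces. The helper names are hypothetical; the molar mass of IPTG (238.30 g/mol) is the only value not stated in the list above.

```python
IPTG_MOLAR_MASS = 238.30  # g/mol, molar mass of IPTG (C9H18O5S)

def mass_for_molar_solution(molarity_mol_per_l, volume_ml, molar_mass_g_per_mol):
    """Grams of solute for a solution of given molarity and final volume."""
    return molarity_mol_per_l * (volume_ml / 1000.0) * molar_mass_g_per_mol

def mass_for_percent_wv(percent_wv, volume_ml):
    """Grams of solute for a % w/V solution (g per 100 mL) of given volume."""
    return percent_wv / 100.0 * volume_ml

iptg_g = mass_for_molar_solution(0.1, 10.0, IPTG_MOLAR_MASS)   # 0.1 M in 10 mL
glucose_g = mass_for_percent_wv(20.0, 100.0)                   # 20% w/V in 100 mL
dye_mg = 1000.0 * mass_for_percent_wv(0.1, 10.0)               # 0.1% w/V in 10 mL
print(round(iptg_g, 3), glucose_g, round(dye_mg, 1))
```

The same helpers can be reused when a different inducer or final volume is required.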
3 Methods
3.1 Structural Identifiability
1. In MATLAB, create a .mat file with information on your model: list the symbolic variables (syms) and specify the model states (x), the output variables (h), the unknown parameters (p), the dynamic equations (f), the vector of initial conditions (ics), the known initial conditions (known_ics), and the inputs (u) (see Note 1).
2. Open the file options.m and specify as modelname the string given to the generated .mat file. If it does not already exist, create a directory called results where the Structural Identifiability results will be saved. If the complexity of the model requires decomposition for the analysis to be run, additionally specify the directory path for the MEIGO software.
3. Select the desired identifiability options for the computation of the generalized observability-identifiability matrix (see Note 2). With the inducible promoter model example, the rank of the matrix has been computed symbolically (value set to 0), the states have not been replaced with known initial conditions (value set to 0), identifiability of initial conditions and input observability have been checked (value set to 1), finding of identifiable combinations, checks for unidentifiability and model decomposition have not been selected (value set to 0), and the maximum time allowed for computing one Lie derivative has been set to 1000 s. To resolve structural identifiability issues, the basal transcriptional rate α has been fixed (see Note 3). This parameter is therefore defined in the vector of previously identified parameters (prev_ident_pars) since its value has been fixed.
4. Start the Structural Identifiability analysis (see Note 4) by running the script called STRIKE_GOLDD.m.
3.2 Sensitivity Analysis
1. In AMIGO2, define the model to be analyzed in the Matlab structure inputs.model by setting the number of states (.n_st), parameters (.n_par), and inputs (.n_stimulus) as integers and specifying their names (.st_names, .par_names, and .stimulus_names). Add the differential model equations (.eqns) as character vectors. Nominal parameter values can be defined in inputs.model.par. In addition, indicate where the results will be saved in .pathd.results_folder and .pathd.short_name.
2. Define the experimental conditions that will be used for the sensitivity analysis. Multiple experimental conditions can be specified in a cell array. These are defined in the Matlab structure inputs.exps, which contains the number of experiments (.n_exp), the model initial conditions (.exp_y0), the experiment duration (.t_f), the name (.obs_names) and number (.n_obs) of observables, as well as their corresponding state variable in the model (.obs). The experiment definition is completed by the type of stimuli applied (.u_interpret), the input switching times (.t_con), the values of the input (.u), and the number (.n_s) and location (.t_s) of sampling times. In the inducible promoter model, we hypothesize the use of three random step profile experiments of 24 h. The piecewise constant inputs are composed of 180 min steps, while the inducer concentration is randomly sampled in the ranges 0–10 μM, 0–30 μM, and 0–100 μM.
3. Provide information about the model parameters in the Matlab structure inputs.PEsol. Define a character list of the parameters to be considered in the analysis (.id_global_theta), their upper (.global_theta_max) and lower (.global_theta_min) bounds, and an initial estimate of their values (.global_theta_guess). For a more exhaustive analysis, we selected 30 different initial guesses for the parameter vector, drawn by Latin hypercube sampling within the parameter bounds on a logarithmic scale.
4. If desired, select the initial value problem (ivp) solver in the Matlab structure inputs.ivpsol.ivpsolver and the sensitivities solver in inputs.ivpsol.senssolver, as well as the absolute (inputs.ivpsol.atol) and relative (inputs.ivpsol.rtol) tolerances. We selected cvodes as both the ivp and sensitivities solver, with absolute and relative tolerances set to 10⁻⁸.
5. Specify the number of samples, taken within the parameter bounds, that will be used for the analysis (inputs.rank.gr_samples). The default value, also used in our example, is 10,000 samples.
6. To include all the necessary functions and tools in the Matlab working directory, run the AMIGO2 script AMIGO_Startup.m or manually add the AMIGO2 folder and subfolders to the path.
7. Run the AMIGO2 function AMIGO_Prep(), with the Matlab structure inputs as an argument, for the preprocessing step. This compiles and generates mex functions for the required tasks.
8. Run the function AMIGO_GRank(), with the Matlab structure inputs as an argument, to perform the Global Sensitivity analysis (see Note 5) and obtain the list of model parameters, ranked by their decreasing ability to affect the output of the model (Fig. 3).
Fig. 3 Results of global sensitivity analysis on the inducible promoter model considering a multiexperiment scheme composed of three random dynamic inputs. (a) Importance factors computed by AMIGO2 to quantify global sensitivity considering a random initial guess for the parameter vector. Note that, while the output sensitivity values depend on parameter estimates, the ranking of the kinetic rates remains conserved across multiple runs. (b) Box plots, overlaid with swarm plots, of the importance factor δ_p^msqr, computed for 30 random initial guesses of the parameter vector, for each parameter of the model (p). Decreasing values of the importance factor (from left to right) relate to a smaller sensitivity of the model output to the parameter
3.3 Practical Identifiability
1. Follow steps 1 and 2 from Subheading 3.2 to create the Matlab structure inputs.exps, which will additionally be populated with the experimental data. To define the type of data used, specify real or pseudo as a string in .data_type. Then, introduce the data (.exp_data) and their associated error (.error_data).
2. Follow steps 3 and 4 from Subheading 3.2.
3. If desired, select the cost function for the Parameter Estimation problem in the Matlab structure inputs.PEsol.PEcost_type. The cost function can be lsq (least squares, used in our example), llk (log-likelihood), or user-defined. From the different options in AMIGO2, select the desired type of weights in inputs.PEsol.lsq_type or inputs.PEsol.llk_type. In our example, this was set to .lsq_type = "Q_expmax".
4. Define the optimization algorithm to be used in the Matlab structure inputs.nlpsol.nlpsolver, as well as the hyperparameters associated with it (see Note 6). In our example, the selection was .nlpsolver = "eSS" and, in the Matlab structure inputs.nlpsol.eSS, .maxeval = 10,000, .maxtime = 100, .local.solver = "lsqnonlin", and .local.finish = "lsqnonlin".
5. Select the number of iterations in which noisy simulated data will be generated from the experimental profiles and the parameter estimation problem will be solved (see Note 7).
6. Follow steps 6 and 7 from Subheading 3.2.
7. Run the function AMIGO_RIdent(), with the Matlab structure inputs as an argument, to perform the robust Practical Identifiability analysis (see Note 8). An example of practical identifiability results is shown in Fig. 4.
Fig. 4 Results of the practical identifiability analysis. Example of joint plot of the most (Kr) and least (kf) identifiable parameters in the inducible promoter model, as selected from a comparison of the coefficient of variation of their estimates. The marginal distributions were computed from the 95% confidence interval on parameter estimates inferred on 600 in silico realizations of the experimental data from a user-specified initial guess. The bivariate plot conveys information about the correlation between the parameters (e.g., weak correlation in this example)
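The comparison of coefficients of variation mentioned in the caption of Fig. 4 can be illustrated with a short sketch. This is not AMIGO2 code: it is a minimal Python example that assumes the parameter estimates from each iteration have been exported as plain lists, and the function names and numerical values are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(estimates):
    """Sample standard deviation divided by the absolute mean of the estimates."""
    m = mean(estimates)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return stdev(estimates) / abs(m)


def rank_by_cv(estimates_by_parameter):
    """Return (name, CV) pairs sorted from most to least identifiable (lowest CV first)."""
    cvs = {name: coefficient_of_variation(values)
           for name, values in estimates_by_parameter.items()}
    return sorted(cvs.items(), key=lambda item: item[1])


# Hypothetical estimates from five iterations (illustrative numbers only).
example = {
    "Kr": [9.8, 10.1, 10.0, 10.2, 9.9],
    "kf": [0.5, 1.6, 0.9, 2.2, 0.3],
}
ranking = rank_by_cv(example)
most_identifiable = ranking[0][0]
least_identifiable = ranking[-1][0]
```

A low coefficient of variation indicates that repeated estimation on noisy data returns similar values, which is the sense in which a parameter is called practically identifiable here.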
3.4 Microfluidic Experiments
3.4.1 Microfluidic Device Fabrication
To enhance the efficiency of microfluidic device fabrication, perform all the steps described below on the same day. Soft lithography (steps 1–4) should be executed in a clean or semiclean room to prevent dust and debris accumulation in the devices. Perform all procedures wearing nitrile gloves.
1. Mixing the PDMS. A PDMS prepolymer is prepared by mixing the curing agent and the silicone monomer in a 1:9 ratio by mass. To compute the mass of prepolymer required to obtain a 0.5 cm thick PDMS mold, measure the diameter d (in cm) of the patterned region of the wafer. The volume of the mold is (π/8)d² cm³ and, as the density of the prepolymer is ρ = 1.1 g/cm³, the mass of the mixture is (11/80)πd² g. In a plastic cup, weigh (99/800)πd² g of silicone elastomer and (11/800)πd² g of silicone curing agent. Stir the mixture with a plastic spatula for approximately 3 min or until it becomes white and foamy.
2. Degassing the PDMS. To remove the bubbles generated when mixing, place the plastic cup into a desiccator. With the valve in the open position, apply vacuum for 10–15 min: the decrease in pressure causes the bubbles to expand and migrate to the PDMS surface. Close the valve, turn off the pump, and quickly release the vacuum to pop the bubbles. Repeat this cycle until the air bubbles are completely removed.
3. Pouring the PDMS. Slowly pour the PDMS on the master, selecting a point without features. As this step will generate new bubbles, cover the Pyrex petri dish containing the wafer and repeat step 2 until no bubbles are visible.
4. Curing. While keeping the wafer as flat as possible, place it in the oven at 60 °C for 3 h.
5. Removing the PDMS mold. Remove the wafer from the oven and let it cool down to room temperature. Using a sterile scalpel, excise the PDMS in a circle comfortably containing the patterned region of the wafer. To prevent damage to the wafer, insert the scalpel in the PDMS at the minimum depth that allows air to appear at the cut site. Keeping at a distance from the patterned area, slowly lift the PDMS layer from one side and allow it to peel off from the wafer. Cover the petri dish containing the wafer and store it in a safe place.
6. Cutting and punching the PDMS devices. To aid visualization of the features, cover both surfaces of the PDMS mold with overlapping strips of magic tape. Using a razor blade, cut the patterned region of PDMS along the external perimeter. Remove the tape in a single movement and place the PDMS feature-side up. Using a light source to enhance contrast, place the biopsy puncher at the location of a port and orient it perpendicular to the PDMS. Apply downward pressure until the puncher breaks through the PDMS. Use the puncher plunger to eject the PDMS core, lift the PDMS layer, and carefully pull the puncher out of the hole while rotating it counterclockwise. Following the steps above, punch all ports in all devices. Cover the PDMS with magic tape again and, using a razor blade, isolate single microfluidic chips following the grid.
7. Cleaning device ports. Insert a 25G needle in a short length of tubing and connect the needle adapter to a 5-mL disposable syringe filled with double-distilled water. Insert the free end of the tubing in a port and apply pressure. Water should flow through the port, removing PDMS debris. Repeat this procedure on all ports on both sides of the devices.
8. Bonding chips to coverslips. Warm up the plasma cleaner. After 15 min, run two cycles of vacuum (30 s), plasma (45 s), and pressure release. The plasma cleaner is ready for use when a bright pink plasma is visible in the chamber. Remove dust from the device by covering it with magic tape and insert it in the plasma cleaner, feature-side up. Using kimwipes, gently wipe both sides of a high-precision coverslip until it is completely free of dust and insert it in the chamber of the plasma cleaner. Turn on the vacuum (30 s) and apply the plasma (45 s). Turn off the vacuum and gently release the pressure. Remove the device and the coverslip from the plasma cleaner and quickly bond them by letting the device fall on the coverslip from a 45° angle. Do not apply downward pressure, which could cause the features to collapse. Transfer the bonded chip to a 60 °C oven for 15 min. Repeat the above steps for each chip. Place the devices in a petri dish and store them at room temperature.
3.4.2 Overnight Culture
1. On the day before the experiment, under the fume hood, pick an isolated, average-size colony from an SC plate supplemented with the appropriate sugar (e.g., 2% w/v glucose) and inoculate it in a 20-mL test tube containing 5 mL of SC media supplemented with sugar (e.g., 2% w/v glucose) and the highest concentration of the chemical inducer to be used (e.g., 1000 μM IPTG).
2. Grow the cell culture overnight in a shaking incubator at 30 °C, 230 rpm.
3. On the day of the experiment, measure the optical density (OD600) of the cell culture. Dilute the cell culture to an OD600 of approximately 0.1 in fresh media with the same composition as that used for the overnight culture.
4. Grow in a shaking incubator at 30 °C, 230 rpm for 2–3 h or until the cell culture reaches the middle exponential phase (OD600 ∈ [0.3, 0.5]). In the meantime, proceed with the following steps.
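The dilution in step 3 and the growth-phase check in step 4 reduce to simple arithmetic. The sketch below is a hedged Python illustration rather than part of the protocol: it assumes that OD600 scales linearly with cell density over the measured range (so that C1V1 = C2V2 applies), and the function names, the measured OD600 of 2.0, and the 5 mL final volume are hypothetical.

```python
def dilution_volumes(od_measured, od_target, final_volume_ml):
    """Volumes (culture, fresh media) in mL needed to reach od_target.

    Uses C1 * V1 = C2 * V2, assuming OD600 is proportional to cell density.
    """
    if od_measured <= 0 or od_target <= 0:
        raise ValueError("optical densities must be positive")
    if od_target > od_measured:
        raise ValueError("cannot reach a higher OD600 by dilution")
    culture_ml = od_target * final_volume_ml / od_measured
    media_ml = final_volume_ml - culture_ml
    return culture_ml, media_ml


def in_mid_exponential(od, low=0.3, high=0.5):
    """True if the OD600 lies in the interval [0.3, 0.5] quoted in step 4."""
    return low <= od <= high


# Hypothetical overnight culture at OD600 = 2.0, diluted to 0.1 in 5 mL.
culture_ml, media_ml = dilution_volumes(2.0, 0.1, 5.0)
```

With these example numbers, 0.25 mL of overnight culture is added to 4.75 mL of fresh media.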
3.4.3 Syringe Preparation
1. Prepare seven lengths of tubing (4 × 70 cm, 2 × 150 cm, 1 × 20 cm), seven 25G needles, six 50-mL syringes, and one 5-mL syringe.
2. Under the fume hood, connect a 25G needle to a free end of each length of tubing. Remove the plunger from the syringes and connect a needle adapter to each of them (the shortest length of tubing is for the 5-mL syringe).
3. Fill the 5-mL syringe, to be used for device wetting, with SC media supplemented with sugar.
4. Using paper tape, label the 50-mL syringes with sequential numbers 1–6 (1 and 2 are the input syringes and are connected to the longer tubing). The volume and composition of the solutions to add to each syringe are specified in Table 1. For each syringe, using a P1000 pipette, make contact with the inside of the Luer stub adapter and load 1 mL of the appropriate solution. This reduces the formation of bubbles that would prevent the flow of solutions in the tubing. Using a 10-mL stripette, load the residual volume by letting it run down the syringe wall before reaching the bottom.
5. Raise the syringe relative to the free end of the tubing until a droplet appears at the end of the latter. Examine the line to check for the absence of air bubbles.
6. Cover the barrel flange with Parafilm and, using the tip of a pair of scissors, pierce a hole in it.
7. In the microscope room, place syringes 1 and 2 on the linear actuators at a height of 50 cm from the microscope stage. Attach syringes 3, 4, and 5 to the microscope chamber, at heights of 18, 23, and 23 cm above the stage, respectively (see Note 9).
3.4.4 Wetting the Microfluidic Chip
1. Secure the microfluidic device to the lid of a petri dish, acting as a chip holder, on one side of the coverslip using paper tape. Examine the quality of the device features at 10× magnification to verify correct punching of the ports and the absence of debris obstructing the channels.
2. Apply pressure to the 5-mL syringe containing media until the short length of tubing is free from air bubbles and droplets appear at its free end.
3. Insert the free end of the tubing in port 5 and apply gentle pressure to enable media flow while preventing the chip from being lifted off the coverslip. When a media droplet appears at port 4, detach the tubing from port 5 by applying a counter pressure and connect it to port 4 (see Note 10). Repeat the above procedure for ports 3, 1, and 2.
4. Under the microscope, verify the absence of air bubbles in the chip. If air bubbles are present, repeat the procedure above.
5. Using kimwipes, remove the excess media on top of the device.
Table 1 Content of the 50 mL syringes for the microfluidic experiment
Syringe identifier | Syringe content
1 | SC media complemented with the appropriate sugar, inducer, and fluorescent dye for a total volume of 10 mL (e.g., 8.73 mL SC media, 1 mL 20% glucose, 100 μL IPTG 0.1 M, and 170 μL Sulforhodamine B 1 mM)
2 | SC media complemented with the appropriate sugar for a total volume of 10 mL
3 | 10 mL of SC media
4 | 10 mL of SC media
5 | 5 mL of SC media
6 | 5 mL of cell culture
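The example recipe for syringe 1 in Table 1 can be checked with a short calculation. The Python sketch below is only an illustration of that arithmetic: the function name and the dictionary layout are hypothetical, while the stock concentrations and volumes are those quoted in the table.

```python
def final_concentration(stock_concentration, stock_volume_ml, total_volume_ml):
    """Concentration after dilution, in the same unit as the stock concentration."""
    return stock_concentration * stock_volume_ml / total_volume_ml


# Volumes (mL) of the components listed for syringe 1 in Table 1.
volumes_ml = {
    "SC media": 8.73,
    "20% glucose": 1.0,
    "IPTG 0.1 M": 0.1,
    "Sulforhodamine B 1 mM": 0.17,
}
total_ml = sum(volumes_ml.values())

glucose_percent = final_concentration(20.0, volumes_ml["20% glucose"], total_ml)
iptg_um = final_concentration(0.1e6, volumes_ml["IPTG 0.1 M"], total_ml)  # 0.1 M = 100,000 uM
dye_um = final_concentration(1.0e3, volumes_ml["Sulforhodamine B 1 mM"], total_ml)  # 1 mM = 1000 uM
```

The components add up to 10 mL and give 2% glucose and 1000 μM IPTG, matching the example concentrations used for the overnight culture, together with 17 μM Sulforhodamine B.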
3.4.5 Connecting Syringes to the Chip
1. Remove the device from the petri dish and secure it to the sample holder using electrical tape. Clean the lower side of the coverslip using 70% EtOH and kimwipes.
2. At 10× magnification, re-examine the chip for the absence of air bubbles and debris in the ports, the channels, and the chamber. Using the wetting syringe, cover the ports of the chip with media.
3. Check for the absence of air bubbles in the lines and, operating at the height of the stage, connect syringe 5 to its port. Proceed by connecting syringes 4, 1, 2, and 3.
4. Verify that no air bubbles were introduced in the microfluidic device during the procedure and secure the tubing to the stage using paper tape.
3.4.6 Calibration of the Microfluidic Device
Fig. 5 Architecture of the MFD005a microfluidic device [7]. The microfluidic chamber, the ports, the Dial-A-Wave (DAW) junction, and the mixing channel are highlighted. Black arrows depict the direction of media flow in running conditions
1. Select the microscope channels to be used (DIC and the channel for the fluorescent dye used to track the inducing media, e.g., sulforhodamine) and set the field of view at the DAW junction of the microfluidic device (Fig. 5).
2. Specify the minimum (hmin) and maximum (hMax) heights of the actuators, which generate an approximate mixing ratio of 0% (i.e., absence of fluorescent signal in the channel feeding the chamber) and 100% (i.e., fluorescent signal detected across the entire width of the main channel). From these, the average height (hmean, mixing ratio of approximately 50%) can be retrieved.
3. To enable an accurate, a posteriori estimate of the 0% and 100% mixing ratios, heights that correspond to pressures that slightly overshoot the central channel of the DAW junction should be considered. To this aim, we specify a range for the actuator's height equal to [hmean − 0.6 (hMax − hmin), hmean + 0.6 (hMax − hmin)].
4. Program the actuators to perform three triangle input waves, centered on a mixing ratio of 50% and with period T = 6 min. Specify the amplitude of the actuator steps and acquire images from the two selected channels at each step. In our case, the step length was set to 2 s (see Notes 11 and 12).
5. Save all fluorescence and DIC images as matrices in a Matlab cell array.
6. From the fluorescence images, extract the region of interest (mixing channel) (see Note 13).
7. Applying an edge detection function (e.g., edge() in Matlab) to one DIC image, detect the boundaries of the mixing channel and make a binary mask with unitary entries for the pixels inside the channel. Next, perform an element-wise multiplication between the mask and each fluorescence image. This selects the pixels inside the channel, which are required for the computation of the mixing ratio.
8. Extract all the pixel values within the region of interest and use a two-component Gaussian Mixture Model to identify the distributions corresponding to background and sulforhodamine fluorescence (see Note 14). Then, define a threshold to discriminate them as a multiple of the standard deviation associated with one of the distributions, chosen according to the level of noise of the fluorescence images (e.g., 3 standard deviations away from the mean to recover 99.7% of the data) (see Note 15).
9. Compute the mixing ratio as the number of pixels above the threshold divided by the total number of nonzero pixels in the selected region of interest.
10. Use a curve-fitting algorithm (least squares curve fit in our case) to map the input pressure of one of the actuators to the computed output mixing ratio. This provides refined estimates of hmin, hMax, and hmean to be used [7] (see Note 16).
11. Save the results and verify the calibration accuracy. Visually, the accuracy of calibration can be assessed through the input–output relation plot (data and fit), the percentage error of the fit versus the mixing ratio, or a comparison between the fluorescence images and a binary mask generated using the selected threshold value.
3.4.7 Loading the Cells
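Before loading the cells, the calibration arithmetic of Subheading 3.4.6 (the actuator range of step 3, the threshold of step 8, and the mixing ratio of step 9) can be checked offline. The Python sketch below is illustrative only and is not the authors' Matlab implementation: the function names, the example heights, and the pixel values are hypothetical, and the background mean and standard deviation are assumed to have been obtained already from the two-component Gaussian Mixture Model fit.

```python
def actuator_range(h_min, h_max, margin=0.6):
    """Range [h_mean - margin*(h_max - h_min), h_mean + margin*(h_max - h_min)]."""
    if h_max <= h_min:
        raise ValueError("h_max must be larger than h_min")
    h_mean = (h_min + h_max) / 2
    half_width = margin * (h_max - h_min)
    return h_mean - half_width, h_mean + half_width


def threshold_from_background(bg_mean, bg_sd, n_sd=3.0):
    """Threshold placed n_sd standard deviations above the background mean."""
    return bg_mean + n_sd * bg_sd


def mixing_ratio(masked_pixels, threshold):
    """Pixels above threshold divided by all nonzero pixels in the masked region.

    Zero entries are treated as pixels outside the channel mask and are ignored.
    """
    inside = [p for p in masked_pixels if p != 0]
    if not inside:
        raise ValueError("no nonzero pixels in the region of interest")
    above = sum(1 for p in inside if p > threshold)
    return above / len(inside)


# Hypothetical heights (cm) and pixel intensities, for illustration only.
low, high = actuator_range(40.0, 50.0)
thr = threshold_from_background(bg_mean=100.0, bg_sd=5.0)
ratio = mixing_ratio([0, 0, 100, 105, 98, 900, 950, 1000], thr)
```

In this toy case the range extends 1 cm beyond each of the two nominal heights, and half of the pixels inside the channel exceed the threshold, corresponding to a mixing ratio of 50%.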
1. Using a spectrophotometer, measure the OD600 of the cell culture to verify whether it has reached the middle exponential phase.
2. Prepare the cell syringe, with ID 6 (see Subheading 3.4.3, step 4), and attach it at a height of more than 23 cm above the microscope stage.
3. Disconnect syringe 5 and connect syringe 6 to port 5. Move syringe 4 to a higher position while keeping it below the cell syringe.
4. Monitoring in live DIC, at 60× magnification, verify cell flow. Flicking the tubing of syringe 6 might help to perturb the flow. To prevent premature clogging of the device, the initial number of cells in the trap should be low, ideally below 10.
5. When satisfied with the number of cells in the trap, adjust the syringes to the running position: gently lower syringe 4 to 23 cm above the stage, bring the cell syringe to the same height, disconnect the cell reservoir from port 5, and plug in syringe 5.
6. Verify the absence of air bubbles in the ports, channels, and the chamber.
3.4.8 Microscope Setup
1. Using a kimwipe wetted with 70% EtOH, clean the 40× objective. Add oil and set the focus.
2. With the help of the stage controller to navigate the device, mark the position of the chamber and of the DAW junction.
3. Select the DIC and fluorescence channels (e.g., sulforhodamine, citrine) to be acquired during the experiment. For each of them, specify the exposure time.
4. Specify the sampling frequency and the number of acquired images. These two fields determine the duration of the experiment.
5. Load the text file containing the dynamic perturbation profile to be administered to the cells.
6. Specify the path to the folder in which the images will be stored and start the acquisition.
3.5 Image Processing
In this section, we describe image processing and extraction of fluorescence time-series data at the single-cell and population level. While the computational tools we employ were selected for their flexibility toward alternative imaged cell types, numerous approaches are currently available for segmentation of yeast cells from brightfield/DIC images [8–11].
3.5.1 Fine-Tuning of the Weights of the Convolutional Neural Network
1. Open manually annotated images in ImageJ (see Note 17).
2. Open the U-Net Job manager (Plugins → U-Net → Job Manager) and select Fine-tuning.
3. Use pretrained weights, available in the U-Net example 2d_cell_net_v0.caffemodel.h5, as a starting point for transfer learning (see Note 18).
4. Subdivide the set of annotated images into a training set (67%) and a test set (33%). Both sets should contain representative samples of the images to segment.
5. Specify the number of evaluations of the loss function used to optimize the network weights. While 1.5 × 10⁵ iterations are normally enough, the number should be significantly increased when the network is trained from scratch. In addition, set the learning rate (1 × 10⁻⁵) and the validation interval (150).
6. Specify the file name and path where the resultant weights will be saved (see Note 19).
7. Untick the selection "labels are classes" (see Note 20).
8. Press OK to start the fine-tuning.
9. Statistics plots are generated in real time during training. Among these, the Loss function and the Intersection Over Union plots are the most informative (Fig. 6).
10. Once newly optimized weights are available, qualitatively check the performance of the network on samples in the validation set and on manually annotated images not included in the tuning.
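To make the Intersection Over Union plot of step 9 easier to interpret, the metric can be computed by hand on two small binary masks. The Python sketch below is a simplified, pixel-level illustration and is not the U-Net plugin's own implementation, which may evaluate the metric per detected object; the function name and the example masks are hypothetical.

```python
def intersection_over_union(predicted, truth):
    """Pixel-level IoU between two binary masks given as nested lists of 0/1."""
    if len(predicted) != len(truth):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    union = 0
    for row_p, row_t in zip(predicted, truth):
        if len(row_p) != len(row_t):
            raise ValueError("masks must have the same number of columns")
        for p, t in zip(row_p, row_t):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    if union == 0:
        raise ValueError("both masks are empty")
    return intersection / union


# Hypothetical 2 x 3 masks: two pixels overlap, four pixels are covered in total.
predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
iou = intersection_over_union(predicted_mask, ground_truth)
```

Identical masks give a value of 1 and disjoint masks give 0; the example above sits at 0.5.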
3.5.2 Image Segmentation
1. To isolate the section of the images corresponding to the cell trap, crop all DIC (and fluorescence) images using a rectangular region of interest located at the same coordinates.
2. In ImageJ, open the DIC image to be segmented.
3. Open the U-Net Job manager (Plugins → U-Net → Job Manager) and select Segmentation.
4. Select the caffemodel.h5 file containing the CNN weights obtained in Subheading 3.5.1 and press OK (see Note 18).
5. Once a binary mask (Fig. 7, top central panels) has been generated, verify that its size in pixels matches that of the original DIC and fluorescence images and save the masks in TIFF format (see Note 21).
Fig. 6 Statistics plots generated by U-Net [5] during fine-tuning. (a) Plot of the intersection over union metric as a function of the number of iterations in fine-tuning. The metric, computed as the ratio between the overlap of the cell objects predicted by the convolutional neural network with those identified in the ground truth (i.e., manually annotated images) and the cell area encompassed by both, quantifies the accuracy of cell detection. A value above 0.5 suggests a good prediction. (b) Cross-entropy loss, computed on the training (gray line) and validation (blue line) sets, as refinement of the network weights occurs. In this example, convergence to optimal weights is achieved after a limited number of iterations
3.5.3 Cell-Tracking and Extraction of Fluorescence Time-Series
1. To identify single cells from the population in the binary image, open each image as a matrix in Matlab and label connected components by applying the bwlabel function. Then, save the matrix as an image in TIFF format.
2. Open Lineage Mapper (Plugins → Tracking → Lineage Mapper).
3. Specify the path and file names for the images to be tracked, as well as the identifier of the directory and files in which the results, i.e., masks with the tracking indexes, will be stored.
4. Populate the fields corresponding to the tracking parameters, following the instructions provided by the plugin developers [12] (see Note 22).
5. Press the "track" tab.
6. In Matlab, import each mask in the time series as a matrix.
7. For each cell index in the mask, compute the average of the fluorescence signal of the corresponding pixels at each time point. This yields a vector of raw fluorescence, whose entries are the average fluorescence of a given cell at each time point.
8. Correct the single-cell fluorescence vector by subtracting the background signal at each time point (see Note 23). The background can be computed as the average of the nonzero entries of the matrix obtained by the entry-wise product of the fluorescence image and the complement of the binary mask obtained at step 5 of Subheading 3.5.2.
9. Merge the single-cell fluorescence vectors in a matrix, with the number of rows equal to the number of cells and the number of columns equal to the number of time points in the experiment. Perform column-wise computation of the mean and standard deviation of the fluorescence signal across the cell population (Fig. 7, bottom panel) (see Note 24).
Fig. 7 Visual representation of the outcome of a microfluidic experiment in which the response of cells to a random stepwise input (blue line) is measured by fluorescence microscopy. DIC images, together with the associated binary masks, acquired at 0 and 24 h are shown (top panels). The mean fluorescence across the cell population (black line) and its standard deviation (gray shaded area) are computed from single-cell time series. Representative single-cell data are shown as yellow, pink, and purple lines. Note that the bottom panel reports in silico data
3.6 Parameter Estimation
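The experimental data used for parameter estimation come from steps 7–9 of Subheading 3.5.3, so it can be useful to verify that computation on a toy frame first. The Python sketch below is a hedged illustration, not the authors' Matlab script: images are represented as nested lists, a label of 0 is assumed to denote background, every cell is assumed to be present at every time point, and all names and numbers are hypothetical.

```python
from statistics import mean, stdev


def background_level(fluorescence, labels):
    """Mean of the nonzero fluorescence pixels lying outside every cell (label 0)."""
    values = [f for row_f, row_l in zip(fluorescence, labels)
              for f, lab in zip(row_f, row_l) if lab == 0 and f != 0]
    if not values:
        raise ValueError("no background pixels available")
    return mean(values)


def corrected_cell_means(fluorescence, labels):
    """Background-subtracted mean fluorescence for each cell index in one frame."""
    background = background_level(fluorescence, labels)
    pixels_by_cell = {}
    for row_f, row_l in zip(fluorescence, labels):
        for f, lab in zip(row_f, row_l):
            if lab != 0:
                pixels_by_cell.setdefault(lab, []).append(f)
    return {cell: mean(v) - background for cell, v in pixels_by_cell.items()}


def population_stats(matrix):
    """Column-wise mean and standard deviation (rows: cells, columns: time points)."""
    columns = list(zip(*matrix))
    return [mean(c) for c in columns], [stdev(c) for c in columns]


# Hypothetical 2 x 3 frame with two tracked cells (labels 1 and 2).
frame = [[10, 10, 50],
         [10, 60, 70]]
tracked = [[0, 0, 1],
           [0, 2, 2]]
corrected = corrected_cell_means(frame, tracked)

# Hypothetical matrix: two cells (rows) followed over two time points (columns).
means, sds = population_stats([[40.0, 44.0], [55.0, 50.0]])
```

The resulting column-wise means and standard deviations correspond to the black line and the gray shaded area described in the caption of Fig. 7.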
1. Follow steps 1 and 2 from Subheading 3.2 and step 1 from Subheading 3.3 to create the inputs.exps Matlab structure. Populate it with the experimental data (Fig. 8a, b) obtained from Subheadings 3.4 and 3.5.
2. Follow steps 3 and 4 from Subheading 3.2 and steps 3 and 4 from Subheading 3.3 to specify the options related to parameter estimation, optimization, and ivp solvers. To ensure convergence of the optimization algorithm, an adequate number of evaluations of the cost function (.maxeval) and maximum computation time (.maxtime) should be used. In our example, we set them to 2 × 10⁵ and 5 × 10³, respectively.
3. Follow steps 6 and 7 from Subheading 3.2 to prepare the inputs Matlab structure to perform Parameter Estimation.
4. Run the function AMIGO_PE(), passing the inputs Matlab structure as an argument (see Note 25), to obtain parameter estimates and their associated uncertainty (Fig. 8c). The goodness of fit can be assessed by measuring the distance between the model output (Fig. 8a, b, green line), computed with the inferred parameters, and the experimental data.
Fig. 8 Comparison of pseudo-experiments in which random (a, orange line) or optimally designed (b, cyan line) inputs are used to gather data for parameter estimation. To exemplify the outcome of OED and PE in the cyber-physical platform, pseudo-data were obtained here by sampling the model output, in response to the shown input, and adding 5% Gaussian noise. The green line represents the response of the model calibrated on the data. (c) Distributions of the estimate of parameter γf, inferred with the two input profiles when assuming a uniform prior, are compared to the true parameter value. The higher informative content of the optimally designed input is reflected in the location (i.e., centered on the true value) and width of the distribution
3.7 Optimal Experimental Design for Model Calibration
1. Follow step 1 in Subheading 3.2 to create the inputs Matlab structure that contains the ODEs.
2. Create the inputs.exps Matlab structure as described in step 2 from Subheading 3.2. To specify the properties of the experimental scheme that will be optimized, select the type of experiment as optimally designed (.exp_type = "od"), the allowed bounds for the inducer (.u_min, .u_max) (see Note 26), the noise type (.noise_type), and the standard deviation (.std_dev) associated with the experiment. In our example, we constrained optimal experimental design (see Note 27) to the identification of an optimal perturbation profile. This was defined as a stepwise input with segments of fixed duration (.u_interpret = "stepf"). The total number of steps is defined in .n_steps.
3. Follow steps 3 and 4 from Subheading 3.2 and steps 3 and 4 from Subheading 3.3 to set the options related to parameter estimation, optimization, and ivp solvers. To improve the convergence of the optimization algorithm, .maxeval and .maxtime were set to 5 × 10⁴ and 3 × 10⁴, respectively.
4. In inputs.OEDsol.OEDcost_type, select the scalar measure of the Fisher Information Matrix (FIM) to be used for optimal experimental design (see Note 28).
5. Follow steps 6 and 7 from Subheading 3.2 to prepare the inputs Matlab structure to perform Optimal Experimental Design.
6. Run the function AMIGO_OED(), with the inputs Matlab structure as an argument, to obtain the optimal input profile (Fig. 8b, cyan line).
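The structure of a stepwise input with segments of fixed duration, as used in step 2, can be sketched in a few lines. The Python example below does not perform any optimization and is not AMIGO2 code: it only builds a random profile with the same shape as the decision variables (one level per segment, bounded by a minimum and a maximum) and evaluates it at a given time. The function names are hypothetical, and the 24 h duration, 180 min steps, and 0–10 μM range are taken from the random experiments described in Subheading 3.2.

```python
import bisect
import random


def fixed_step_profile(t_f, n_steps, u_min, u_max, seed=None):
    """Switching times and random input levels for n_steps equal-length segments."""
    if n_steps < 1 or t_f <= 0 or u_max < u_min:
        raise ValueError("invalid profile specification")
    rng = random.Random(seed)
    step = t_f / n_steps
    switching_times = [i * step for i in range(n_steps + 1)]
    levels = [rng.uniform(u_min, u_max) for _ in range(n_steps)]
    return switching_times, levels


def input_at(t, switching_times, levels):
    """Value of the piecewise-constant input at time t (last level holds at t_f)."""
    if t < switching_times[0] or t > switching_times[-1]:
        raise ValueError("time outside the experiment")
    index = bisect.bisect_right(switching_times, t) - 1
    return levels[min(index, len(levels) - 1)]


# 24 h (1440 min) experiment with eight 180 min steps, inducer in 0-10 uM.
times, levels = fixed_step_profile(t_f=1440, n_steps=8, u_min=0.0, u_max=10.0, seed=1)
value_at_start = input_at(0, times, levels)
value_after_first_switch = input_at(180, times, levels)
```

In an optimal design, the entries of the levels list would be chosen by the optimizer rather than drawn at random, while the switching times stay fixed.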
4 Notes
1. Follow the nomenclature specified within brackets when assigning the name to each vector. If the initial conditions are unknown, define the corresponding vector as empty. The vector of known initial conditions (known_ics) is binary and has the same length and order as the state variables vector. Entries are set to 1 if the initial condition for the corresponding state variable is known.
2. Structural identifiability analysis is computationally expensive due to its high memory consumption. Limited computational resources or high complexity of the model, as determined by the number of parameters and states, represent challenges to the analysis. In such cases, the rank of the observability-identifiability matrix can be computed numerically (opts.numeric = 1). To overcome the risk of obtaining an artificial decrease of the matrix rank, the analysis should be re-run several times to ensure convergence to the correct result. It is worth noting that a parameter known to be identifiable (from an alternative analysis or because its value has been fixed) can be specified as a symbolic variable in prev_ident_pars, with the advantage of reducing the complexity of the problem or resolving structural identifiability issues. As an alternative, memory consumption can be restricted by reducing the maximum time allowed for the computation of Lie derivatives (opts.maxLietime), although this might lead to uncertain identifiability results for some parameters. Finally, models with a large number of states can be decomposed (opts.forcedecomp, opts.decomp, and opts.decomp_user) on the grounds that parameters found to be identifiable in a submodel will be identifiable in the whole model. Generation of submodels is an optimized process performed by the software MEIGO [13]. In our example, the computational time was approximately 1 min.
3. In some instances, an in-depth structural identifiability assessment of a model might require a multiexperiment analysis (e.g., use of different inputs in one experiment). In these cases, we suggest the use of the software GenSSI 2.0 [14], which has a structure and syntax similar to STRIKE-GOLDD and a clear user manual on how to set up the required scripts in the associated GitHub repository.
4. The analysis aims to determine whether unique values can be assigned to model parameters from ideal output measurements (i.e., continuous and noise-free). STRIKE-GOLDD performs structural identifiability analysis as an extension of the observability concept (i.e., the possibility to infer the internal state of the model from time-finite output measurements), where parameters are considered as states without dynamics. To test the structural identifiability of the model, STRIKE-GOLDD uses Lie derivatives of the output function to build a generalized observability-identifiability matrix. A full rank of the observability-identifiability matrix denotes local structural identifiability of the model, while a lower rank indicates unidentifiability for a given set of parameters [3].
5. Global ranking of the parameters is performed to assess the relative influence of each parameter on the model predictions, as quantified by relative parametric sensitivities. The general case of multiple observables and experiments is considered in AMIGO2, making use of diverse importance factors [15]. For a broader analysis of the parametric sensitivities, AMIGO2 uses n samples from the parameter space obtained by Latin hypercube sampling. Since the analysis considers the generic case, in which the isolated effect of time points, observables, or experiments cannot be explored, it is recommended to run a similar analysis for these elements. This supports an improved understanding of which parameters exert a more relevant effect on a particular observable in a particular experimental scheme.
6. AMIGO2 offers a set of local, global, and hybrid optimization algorithms, which are described in the software's theoretical background documentation [16]. Due to the general nonconvexity of the problems usually encountered, we recommend the use of a hybrid optimizer such as enhanced Scatter Search (eSS) combined with an indirect solver, which uses control vector parameterization to avoid issues with nonsmooth functions. However, this comes at a higher computational cost (indirect methods make use of gradient descent methods, which are faster). If only proximity to the global optimum is required, use of a global optimizer, such as Differential Evolution, can be sufficient. Finally, local methods can be used if a multistart for the initial cost function evaluation is used. Due to the nonlinear character of biological models, the use of only local solvers might lead to suboptimal solutions.
7. This cycle needs to be performed for a sufficient number of iterations to obtain robust and reliable results. While the minimum recommended number of trials is 500 (600 were used in our example), the value can be adjusted according to the available time and computational resources.
8. Practical identifiability analysis is performed to quantify the expected uncertainty of the parameter estimates in relation to a specific experimental scheme. Monte Carlo sampling of the parameter space is used to generate noise-corrupted pseudo-data subject to an experimental profile, and parameter estimation is performed for each time series. Principal Component Analysis (PCA) is then applied to the 0.95–0.05 interquartile range of the hyper-ellipsoid approximated by the samples, so that the uncertainty of the estimates or the correlation between parameters can be estimated [16].
9. The height of the syringe above the microscope stage is measured from the bottom of the meniscus of the media in the syringe.
10. By merging the media droplet on the port with the one at the free end of the tubing, you can reduce the risk of air bubbles entering the device.
11. It is recommended to define in advance a mask containing the region of interest of the DAW junction (i.e., the mixing channel). In the first iteration step, verify that the mask overlays with the acquired DIC image. This simplifies the definition of the region of interest at the DAW junction.
12. To improve the stability of the procedure, the periodic signal should be preceded and followed by a constant input at 50% mixing ratio for 30 s.
Lucia Bandiera et al.
13. To prevent bias in the computation of the mixing ratio, due to false positive/negative inclusions of pixels in the region of interest, make sure that the latter covers an area that exceeds the walls of the mixing channel.
14. A different function in Matlab can be used; some examples are mixGaussEm() or fitgmdist(). A sufficiently high number of iterations prevents convergence to a local minimum, in which the function would fail to return two Gaussian distributions with different means.
15. Noise filtering might be required to enhance the quality of the binary mask computed with the fluorescence threshold. For example, noise could be detected by the presence of outlier pixels.
16. To automate the procedure, the mentioned functions can be integrated into a script controlling the actuators within Matlab or in a designed GUI/platform.
17. To perform the fine-tuning of the convolutional neural network, manually annotated images, acting as ground-truth samples, are required. You can use either full images, if they have a low cell count, or a selected region. The manually annotated images should constitute a representative set of the experimental time-lapse. In our example, we used a combination of 12 images (frames with a low number of cells) and subsections of images (frames with a higher number of cells). Open in ImageJ the DIC image to be annotated and, using the elliptical selection tab, draw an initial contour around each cell in the image. Ctrl+T will add the region of interest (ROI) to the ROI Manager. Select each ROI and, using the brush selection tool, refine the initial contour (ROIs must not overlap, but they can be adjacent) and update the ROI. Transfer the ROIs into the original image as an Overlay (Image → Overlay → From ROI Manager) and save the resulting image in TIFF format.
18. U-Net offers different options inherent to the memory and computational allocation of the process (i.e., fine-tuning or segmentation). Among these, the user is asked to execute the computation on CPU or GPU. Since the process is not optimized for CPUs, we recommend the use of a GPU: this reduces the computational time from days to hours. It is worth noting that, due to the dependency of the developed patch for Caffe (deep learning framework), both CPU and GPU implementations rely on a Linux operating system. The U-Net plugin is implemented to support computation on a remote machine (e.g., Amazon Web Services) that can be accessed through an SSH connection.
19. The set of weights for the Convolutional Neural Network is always saved in the remote host unless the user specifies that an additional copy should be stored locally.
20. The option “labels are classes” is of interest only when segmentation or selection of different cell types is performed.
21. When segmenting a large number of images, the procedure can be automated by running it in batch-mode using ImageJ Macros code. We recommend recording the procedure for one image (Plugins → Macros → Record...) to obtain the basic Macros code for the segmentation. Then, integrate the section in a loop to iterate over all the images in a script that can be run in the Macros console (Plugins → Macros → Startup Macros...).
22. In our example, we selected Minimum Object size = 4, Maximum Centroid Displacement = 50, Enable Division = yes, Enable Fusion = no, Weight Cell Overlap = 0, Weight Cell Centroid Displacement = 100, Weight Cell Size = 0.75, Minimum Division Overlap = 0, Daughter Size Similarity = 30, Daughter Aspect ratio similarity = 75, Mother Circularity Threshold = 50, Number of frames to check for circularity = 5, Minimum Cell Lifespan = 30, Cell Death Delta Centroid Threshold = 0, and allow cell density and border cells to affect the confidence index.
23. As cells grow in a monolayer within the microfluidic device, by the end of the experiment the number of pixels corresponding to the background will be very low, impeding an appropriate correction for the background fluorescence level. Under the hypothesis of minimal variation in time of the background fluorescence signal, we suggest subtracting the average background computed over the previous segment of the experiment whenever the number of background pixels falls below a user-specified threshold.
24. The computation of single cell fluorescence vectors enables the screening of cells that have been imaged for a minimum amount of time defined by the experimentalist and the exclusion of abnormal cells that can populate the last frames.
25. Parameter estimation (i.e., model calibration) aims to estimate unknown model parameters. Here, parameter estimation is framed as a nonlinear optimization problem whose objective is to minimize a predefined distance measure (cost function) between experimental data and model predictions [15]. AMIGO2 implements both the weighted least squares and the log-likelihood cost functions, to be selected depending on the information available on the noise (homoscedastic or heteroscedastic) corrupting the data. While these scalar measures assume normally distributed noise, the software allows the introduction of alternative, user-defined cost functions.
26. Here, considering experimental constraints due to the use of a microfluidic platform, we cast OED as a constrained optimization problem that searches for the most informative stepwise perturbation profile composed of segments of fixed duration. This corresponds to optimizing the concentration of the inducer administered to the cells at each step. It is worth mentioning that AMIGO2 supports the optimization of other control variables: number and location of sampling times, observed species, initial conditions, and experimental duration. In general, the selection of the most suitable strategy will depend on the biological system, the complexity of its mathematical description, and limitations of the experimental platform used for data gathering.
27. Optimal experimental design (OED) is a branch of statistics that searches for the most informative and least resource-intensive experimental scheme (here for model calibration). By increasing the informative content of the acquired data, OED helps overcome issues that affect parameter estimation (e.g., practical identifiability). To quantify the amount of information of an experiment, AMIGO2 makes use of the Fisher Information Matrix (FIM) to solve a general dynamic optimization problem minimizing or maximizing a scalar measure that relates to the shape and size of the hyper-ellipsoid associated with the FIM [15].
28. Multiple scalar measures of the FIM (i.e., optimality criteria) are available in the scientific literature. AMIGO2 implements D-optimality (Determinant), E-optimality (Eigenvalue), A-optimality (Average), and DoverE-optimality. D-optimality seeks to minimize the determinant of the inverse of the FIM, E-optimality to maximize the minimum eigenvalue of the FIM, and A-optimality to minimize the trace of the inverse of the FIM. In our examples, following its widespread adoption, we selected D-optimality.

References

1. Bandiera L, Hou Z, Kothamachu V, Balsa-Canto E, Swain P, Menolascina F (2018) On-line optimal input design increases the efficiency and accuracy of the modelling of an inducible synthetic promoter. Processes 6(9):148
2. Gnügge R, Dharmarajan L, Lang M, Stelling J (2016) An orthogonal permease-inducer-repressor feedback loop shows bistability. ACS Synth Biol 5(10):1–29
3. Villaverde AF, Barreiro A, Papachristodoulou A (2016) Structural identifiability of dynamic systems biology models. PLoS Comput Biol 12(10):1–22
4. Balsa-Canto E, Henriques D, Gábor A, Banga JR (2016) AMIGO2, a toolbox for dynamic modeling, optimization and control in systems biology. Bioinformatics 32(21):3357–3359
5. Falk T et al (2019) U-net: deep learning for cell counting, detection, and morphometry. Nat Methods 16(1):67–70
6. Chalfoun J, Majurski M, Dima A, Halter M, Bhadriraju K, Brady M (2016) Lineage mapper: a versatile cell and particle tracker. Sci Rep 6:1–9
7. Ferry MS, Razinkov IA, Hasty J (2011) Microfluidics for synthetic biology, vol 497, 1st edn. Elsevier Inc., San Diego
8. Versari C et al (2017) Long-term tracking of budding yeast cells in brightfield microscopy: CellStar and the Evaluation Platform. J R Soc Interface 14:20160705
9. Dimopoulos S, Mayer CE, Rudolf F, Stelling J (2014) Accurate cell segmentation in microscopy images using membrane patterns. Bioinformatics 30(18):2644–2651
10. Bredies K, Wolinski H (2011) An active-contour based algorithm for the automated segmentation of dense yeast populations on transmission microscopy images. Comput Vis Sci 14(7):341–352
11. Bakker E, Swain PS, Crane MM (2018) Morphologically constrained and data informed cell segmentation of budding yeast. Bioinformatics 34(1):88–96
12. “Lineage Mapper User Guide.” [Online]. https://github.com/USNISTGOV/LineageMapper/wiki/User-Guide
13. Egea JA, Henriques D, Cokelaer T, Villaverde AF, Julio R (2014) MEIGOR: a software suite based on metaheuristics for global optimization in systems biology and bioinformatics. Continuous and mixed-integer problems: enhanced scatter search, pp. 1–33
14. Ligon TS, Fröhlich F, Chiş OT, Banga JR, Balsa-Canto E, Hasenauer J (2018) GenSSI 2.0: multi-experiment structural identifiability analysis of SBML models. Bioinformatics 34(8):1421–1423
15. Balsa-Canto E, Alonso AA, Banga JR (2010) An iterative identification procedure for dynamic modeling of biochemical networks. BMC Syst Biol 4:11
16. “AMIGO2 Documentation.” [Online]. https://sites.google.com/site/amigo2toolbox/doc
Chapter 13

Prediction of Cellular Burden with Host–Circuit Models

Evangelos-Marios Nikolados, Andrea Y. Weiße, and Diego A. Oyarzún

Abstract

Heterologous gene expression draws resources from host cells. These resources include vital components to sustain growth and replication, and the resulting cellular burden is a widely recognized bottleneck in the design of robust circuits. In this tutorial we discuss the use of computational models that integrate gene circuits and the physiology of host cells. Through various use cases, we illustrate the power of host–circuit models to predict the impact of design parameters on both burden and circuit functionality. Our approach relies on a new generation of computational models for microbial growth that can flexibly accommodate resource bottlenecks encountered in gene circuit design. Adoption of this modeling paradigm can facilitate fast and robust design cycles in synthetic biology.

Key words Cellular burden, Growth models, Whole-cell modeling, Gene circuit design, Synthetic biology, Resource allocation
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_13, © Springer Science+Business Media, LLC, part of Springer Nature 2021

1 Introduction

The grand goal of Synthetic Biology is to engineer living systems with novel functions. The approach relies on the combination of biological knowledge with design strategies from engineering sciences [1–4]. Engineering principles, such as modularity and standardization, have led to gene circuits with a wide range of functions such as cellular oscillators [5, 6], memory devices [7], and biosensors [8, 9]. As synthetic biology matures into an engineering discipline of its own, mathematical modeling is playing an increasingly important role in the design of biological circuitry [10]. Moreover, model-based design offers opportunities for other fields such as computer-aided design [11], control theory [12], and machine learning [13] to contribute with new methods and protocols for gene circuit design.

The success of the celebrated “design–build–test–learn” cycle [14] relies on the availability of good quality models for circuit function. A major drawback of current modeling frameworks, however, is the implicit assumption that biological circuits function in isolation from their host. This simplification limits the predictive power of circuit models and slows down the iterations between system design, testing, and characterization. In reality, gene circuits interact with their host in many ways, including the consumption of molecular resources such as amino acids, nucleotides, or energy, as well as the use of major components of the genetic machinery such as polymerases and ribosomes. Competition for a limited pool of host resources produces a two-way interplay between synthetic circuits and the native physiology of the host [15]. This interplay is commonly known as burden and perturbs the homeostatic balance of the host, resulting in slowed growth, reduced biosynthesis, and the induction of stress responses [16]. Since such effects can impact circuit behavior, they create feedback effects that can potentially break down circuit function [17–19]. As a result, individual modeling of circuit parts and their connectivity is not sufficient to predict circuit function accurately.

In a seminal study on host–circuit interactions, Tan and colleagues [20] studied a simple circuit consisting of T7 RNA polymerase that activates its own expression in Escherichia coli. Contrary to what standard mathematical models would predict, the circuit displayed bistable dynamics. The authors showed that synthesis of the polymerase produced an indirect, growth-mediated, positive feedback loop, which when included in their model was able to reproduce the observed bistability. This study was the first empirical demonstration that growth defects can drastically change circuit function.

A number of subsequent works have focused on the sources and impact of burden on gene circuits. For example, Ceroni et al. showed that genes with weaker ribosomal binding strength are less taxing on the host resources [21]. Other works have focused on strategies to mitigate burden. An and Chin built a gene expression system that combines orthogonal transcription by T7 RNA polymerase and translation by orthogonal ribosomes [22]. The system reported in [23] allows resources to be allocated among competing genes, while [24] built libraries of promoters that tune expression of burdensome proteins and decrease cellular stress. The work by Shopera et al. showed that negative feedback control can reduce the cross-talk between gene circuits [25]. Another strategy for reducing burden was proposed in [26] using an orthogonal ribosome for translation of heterologous genes. A particularly attractive strategy is to exploit burden to improve functionality. For example, Rugbjerg and colleagues increased metabolite production by coupling pathway expression to that of essential endogenous genes [27], while [28] employed stress-response promoters to build a feedback system with increased protein yield.
Host-Circuit Modelling
As a result of the increasing interest in cellular burden and host–circuit interactions, the modeling community has devoted substantial attention to improving models for gene circuits and their interaction with a host. A key challenge is to find a suitable level of model complexity with enough detail to describe tunable circuit parts but without excessive granularity that makes models impractical. At one end of the complexity spectrum, a number of works have proposed simple resource allocation models for the interplay between circuit and host genes [29–31]. Using different modeling approaches and assumptions, these models generally predict a linear relation between expression of native and heterologous genes. Increases in the expression of one gene cause a linear drop in the expression of another gene, as a result of a limited abundance of ribosomes for translation. At the other end of the spectrum, the whole-cell model of Mycoplasma genitalium [32] was an ambitious attempt to describe all layers of cellular organization under a single computational model. A subsequent work demonstrated the use of the whole-cell model in conjunction with gene circuits [33]. Yet to date such whole-cell models have not been built for bacterial hosts commonly employed in synthetic biology, and their high complexity prevents their systematic use in circuit design and optimization. A number of approaches have sought to find a middle ground between model complexity and tractability. Inspired by the widely established “bacterial growth laws” [19, 34], Weiße and colleagues built a mechanistic growth model for Escherichia coli [35]. The model uses a coarse-grained partition of the proteome to describe how cells allocate their resources across various gene expression tasks. It accurately predicts growth rate from the interplay between metabolism and gene expression, and can be extended with a wide range of genetic circuits. 
Applications of the mechanistic growth model include the design of orthogonal ribosomes [26], the addition of extra layers of regulation [36], and its extension to single-cell growth dynamics [37]. Most recently, Nikolados et al. employed the model to study the impact of growth defects in various exemplar circuits [38]. In this tutorial we describe how mechanistic growth models can be employed to simulate gene circuits together with the host physiology (Fig. 1). In Subsection 2 we first revisit the bacterial growth laws and explain the core principles of the mechanistic growth model. In Subsection 3 we present how to extend the growth model with heterologous genes. We illustrate the methodology with a number of transcriptional logic gates in Subsection 4. We conclude the chapter with a perspective for future research in the field.
[Figure 1 schematic. Labels recovered from the figure: cellular host; circuit parts (genes, promoters, RBS); host-circuit model; circuit function; resource usage; translation; protein expression; design space (parameter 1, parameter 2); growth defects; dbl time; time; ribosomes.]
Fig. 1 Host–circuit modeling. Integrated host–circuit models provide a quantitative basis to study the impact of design parameters on circuit function and genetic burden on their host
2 Coarse-Grained Models for Bacterial Growth

We begin by describing the bacterial growth laws that form the basis for most current models for growth. Our focus is on coarse-grained models that describe cell physiology using lumped variables representing aggregates of molecular species. We deliberately exclude whole-cell models [32] and genome-scale models [39], both of which have been discussed extensively in the literature [40–42] and so far have found relatively limited applications in gene circuit design.
2.1 Bacterial Growth Laws
Bacterial growth has been an active topic of study for many decades. The celebrated work of Nobel laureate Jacques Monod provided a key quantitative description for growth [43], based on the observation that bacteria in batch cultures exhibit several phases of growth:

• Lag phase: cells do not immediately start to grow after nutrient induction, as they first must adapt to the new environment; RNA and proteins are produced as the cell prepares for division.
• Exponential phase: cells duplicate at a constant rate, so that their number grows exponentially as $N(t) = N_0 2^{t/\tau}$, with $\tau$ being the average doubling time. Equivalently, the number of cells can be expressed as $N(t) = N_0 e^{\lambda t}$, where $\lambda = \log 2/\tau$ is the growth rate.
• Stationary phase: cell replication stops because an essential nutrient has been depleted from the batch. The number of cells remains constant during this phase.
• Death phase: cells begin to die, resulting in a decreasing cell population.
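The equivalence between the doubling-time form and the exponential form of the growth curve can be checked numerically. The following is a minimal sketch in Python; the function names and the 30 min doubling time are hypothetical choices for illustration and do not come from the chapter.

```python
import math


def growth_rate_from_doubling_time(tau):
    """Return the growth rate lambda = ln(2) / tau (inverse units of tau)."""
    if tau <= 0:
        raise ValueError("doubling time must be positive")
    return math.log(2) / tau


def cells_doubling_form(n0, t, tau):
    """Cell number from N(t) = N0 * 2**(t / tau)."""
    return n0 * 2 ** (t / tau)


def cells_exponential_form(n0, t, lam):
    """Cell number from N(t) = N0 * exp(lambda * t)."""
    return n0 * math.exp(lam * t)


tau = 30.0  # hypothetical doubling time in minutes
lam = growth_rate_from_doubling_time(tau)

# Three doubling times have elapsed at t = 90 min, so both forms
# should give an eightfold increase over the initial population.
n_doubling = cells_doubling_form(1000, 90.0, tau)
n_exponential = cells_exponential_form(1000, 90.0, lam)
```

Both expressions describe the same curve; the growth rate form is the one used in the growth laws and in the mechanistic model that follow.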
The vast majority of studies on bacterial growth focus on the exponential phase, and to date this remains the best characterized growth phase. A widely used empirical model for exponential growth is given by Monod’s law, which relates the instantaneous growth rate and the substrate concentration:

$$\lambda = \frac{\lambda_{\max}\, s}{s + K_s}, \tag{1}$$

where $s$ is the growth substrate, $\lambda_{\max}$ is the maximum growth rate possible in the substrate, and $K_s$ is the substrate concentration for which growth rate is half maximal. Equation 1 describes the hyperbolic dependence of the growth rate $\lambda$ on the concentration of a growth-limiting nutrient $s$ in the medium. Measurements of bacterial cells growing at different rates [44, 45] have revealed a central role of ribosome synthesis in maintaining exponential growth [46, 47]. In particular, the ribosomal mass fraction, $\phi_R$, has been shown to increase linearly with growth rate [44, 48]. This is the second growth law, described mathematically as:

$$\phi_R = \phi_R^{\min} + \frac{\lambda}{\kappa_t}, \tag{2}$$

where $\phi_R^{\min}$ is an offset term and $\kappa_t$ is a phenomenological parameter related to protein synthesis. The third growth law relates to growth inhibition. It has been shown that sublethal antibiotic doses targeting ribosomal activity produce a negative linear relation between growth rate and the ribosomal mass fraction [19]. Mathematically, this growth law can be described by:

$$\phi_R = \phi_R^{\max} - \frac{\lambda}{\kappa_n}, \tag{3}$$

where the parameter $\kappa_n$ describes the nutrient capacity of the growth medium and $\phi_R^{\max}$ is the maximum allocation to ribosomal synthesis in the limit of complete translational inhibition.

Taken together, Eqs. 1–3 provide a remarkably simple description of exponential growth. Yet a common caveat of such descriptions is their lack of explicit links between phenomenological parameters and the molecular processes that drive growth. Some works have indeed found quantitative descriptions of model parameters in terms of intracellular properties [19, 34]. However, another strand of research has moved away from phenomenological models toward mechanistic descriptions of cell physiology [35, 49, 50]. Notably, earlier work by Molenaar and colleagues [51] proposed a model that integrates metabolism and protein biosynthesis into a resource allocation model. A key assumption in that approach is that microbes adjust their proteome composition to maximize growth. This leads to growth predictions that rely on an optimality principle, without the need of a mechanistic description of how cellular constituents contribute to growth and replication.

2.2 A Mechanistic Model of Bacterial Growth
Here we describe a mechanistic model that predicts bacterial growth rate from first principles [35]. The model, illustrated in Fig. 2, reproduces the bacterial growth laws and, at the same time, contains detailed mechanisms for nutrient metabolism, transcription, and translation. It employs a partition of the proteome similar to an earlier work [51], but it does not require the assumption of growth maximization. The model is versatile and can predict how cells reallocate their proteome composition under various types of perturbations, including nutrient shifts, genetic modifications, and antibiotic treatments.

[Figure 2 schematic. Labels recovered from the figure: ribosomes; proteome; transcription; translation; energy; enzymes; metabolism; nutrients.]

Fig. 2 Mechanistic model for bacterial growth. The model predicts growth rate from the allocation of two cellular resources (energy and ribosomes) among the various processes that fuel growth and replication [35]

The model combines nutrient import and its conversion to cellular energy with the biosynthetic processes of transcription and translation. In its basic form, the model includes 14 intracellular variables: an internalized nutrient $s_i$; a generic form of energy, denoted $a$, that models the total pool of intracellular molecules required to fuel biosynthesis, such as ATP and amino acids; and four types of proteins: ribosomes $p_r$, transporter enzymes $p_t$, metabolic enzymes $p_m$, and house-keeping proteins $p_q$. The model also contains the corresponding free and ribosome-bound mRNAs for each protein type, denoted by $m_x$ and $c_x$, respectively, with $x \in \{r, t, m, q\}$. The model can be described by the chemical reactions listed in Table 1. From these reactions we model the cell as a system of ordinary differential equations, describing the rate of change of the numbers of molecules per cell of a particular species. Next we explain in detail how the model equations are built.

The environment, or growth medium, of the cell contains a single nutrient described by the constant parameter $s$. A transporter protein $p_t$ takes up the external nutrient, which is present at a fixed concentration; once internalized, the nutrient $s_i$ is catabolized by a metabolic enzyme $p_m$. The dynamics of the internalized nutrient obey:

$$\dot{s}_i = v_{imp} - v_{cat} - \lambda s_i. \tag{4}$$
Similarly to the bacterial growth laws described in Subsection 2.1, the growth rate is denoted by $\lambda$. All intracellular species are assumed to be diluted at a rate $\lambda$ because of partitioning cellular content between daughter cells at division. Nutrient import ($v_{imp}$) and catabolism ($v_{cat}$) are assumed to follow Michaelis–Menten kinetics:

$$v_{imp} = p_t \frac{v_t\, s}{K_t + s}, \qquad v_{cat} = p_m \frac{v_m\, s_i}{K_m + s_i}, \tag{5}$$

where $v_t$ and $v_m$ are maximal rates, while $K_t$ and $K_m$ are Michaelis–Menten constants. Since translation is known to dominate energy consumption [48], the model neglects other energy-consuming processes. Using $c_x$ to denote the complex between a ribosome and the mRNA for a protein $p_x$, the translation rate for every protein obeys

$$v_x = c_x \frac{\gamma(a)}{n_x}. \tag{6}$$

The parameter $n_x$ in Eq. 6 is the length of the protein $p_x$ in terms of amino acids, and the term $\gamma(a)$ represents the net rate of translational elongation. Assuming that each elongation step consumes a fixed amount of energy [35], the net elongation rate depends on the energy resource by:

$$\gamma(a) = \frac{\gamma_{\max}\, a}{K_\gamma + a}, \tag{7}$$

where $\gamma_{\max}$ is the maximal elongation rate and $K_\gamma$ is the energy required for a half-maximal rate. From Eq. 6 we can compute the total energy consumption by translation of all proteins and get a differential equation for the net turnover of energy:

$$\dot{a} = n_s v_{cat} - \sum_{x \in \{r, t, m, q\}} n_x v_x - \lambda a, \tag{8}$$
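Table 1 Chemical reactions in the mechanistic growth model [35]

| Species | Transcription | Dilution/degradation | Ribosome binding | Dilution | Translation | Dilution |
|---|---|---|---|---|---|---|
| Ribosomes | $\varnothing \xrightarrow{w_r} m_r$ | $m_r \xrightarrow{\lambda + d_m} \varnothing$ | $p_r + m_r \underset{k_u}{\overset{k_b}{\rightleftharpoons}} c_r$ | $c_r \xrightarrow{\lambda} \varnothing$ | $n_r a + c_r \xrightarrow{v_r} p_r + m_r + p_r$ | $p_r \xrightarrow{\lambda} \varnothing$ |
| Transporter enzyme | $\varnothing \xrightarrow{w_t} m_t$ | $m_t \xrightarrow{\lambda + d_m} \varnothing$ | $p_r + m_t \underset{k_u}{\overset{k_b}{\rightleftharpoons}} c_t$ | $c_t \xrightarrow{\lambda} \varnothing$ | $n_t a + c_t \xrightarrow{v_t} p_r + m_t + p_t$ | $p_t \xrightarrow{\lambda} \varnothing$ |
| Metabolic enzyme | $\varnothing \xrightarrow{w_m} m_m$ | $m_m \xrightarrow{\lambda + d_m} \varnothing$ | $p_r + m_m \underset{k_u}{\overset{k_b}{\rightleftharpoons}} c_m$ | $c_m \xrightarrow{\lambda} \varnothing$ | $n_m a + c_m \xrightarrow{v_m} p_r + m_m + p_m$ | $p_m \xrightarrow{\lambda} \varnothing$ |
| House-keeping proteins | $\varnothing \xrightarrow{w_q} m_q$ | $m_q \xrightarrow{\lambda + d_m} \varnothing$ | $p_r + m_q \underset{k_u}{\overset{k_b}{\rightleftharpoons}} c_q$ | $c_q \xrightarrow{\lambda} \varnothing$ | $n_q a + c_q \xrightarrow{v_q} p_r + m_q + p_q$ | $p_q \xrightarrow{\lambda} \varnothing$ |

| Species | Process | Reaction | Dilution |
|---|---|---|---|
| Internal nutrient | Nutrient import | $s \xrightarrow{v_{imp}} s_i$ | $s_i \xrightarrow{\lambda} \varnothing$ |
| Energy molecules | Metabolism | $s_i \xrightarrow{v_{cat}} n_s a$ | $a \xrightarrow{\lambda} \varnothing$ |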
where the sum runs over all types of protein in the cell. Overall, energy is created by metabolizing $s_i$ and lost through translation and dilution by growth. The positive term in Eq. 8 determines the energy yield per molecule of internalized nutrient from Eq. 4. The parameter $n_s$ describes the nutrient efficiency of the growth medium. In rapidly growing E. coli, it is known that transcription has a minor role in energy consumption [52]. We therefore model transcription as an energy-dependent process, but with a negligible impact on the overall energy pool. If $w_{x,\max}$ denotes the maximal transcription rate, the effective transcription rate has the form

$$w_x = w_{x,\max} \frac{a}{\theta_x + a}, \tag{9}$$

for all proteins except house-keeping ones, i.e., $x \in \{r, t, m\}$. We assume that the transcription of house-keeping mRNAs is subject to negative autoregulation so as to keep constant expression levels in various growth conditions:

$$w_q = w_{q,\max} \underbrace{\frac{a}{\theta_q + a}}_{\text{energy-dependent transcription}} \; \underbrace{\frac{1}{1 + (p_q / K_q)^{h_q}}}_{\text{negative autoregulation}}. \tag{10}$$
In Eqs. 9 and 10, the parameter $\theta_x$ denotes a transcriptional threshold, while $K_q$ and $h_q$ are regulatory parameters. The differential equations for the number of mRNAs ($m_x$) are therefore:

$$\dot{m}_x = w_x - (\lambda + d_m) m_x + v_x - k_b p_r m_x + k_u c_x, \tag{11}$$

where $x \in \{r, t, m, q\}$. In Eq. 11, mRNAs are produced through transcription with rate $w_x$, while mRNAs are lost through dilution $\lambda$ and degradation with rate $d_m$. At the same time, mRNAs bind and unbind with ribosomes, so that the ribosome–mRNA complexes ($c_x$) follow

$$\dot{c}_x = -\lambda c_x - v_x + k_b p_r m_x - k_u c_x, \tag{12}$$

where $k_b$ and $k_u$ are the rate constants of binding and unbinding. Translation contributes with a positive term to Eq. 11 and a negative term to Eq. 12. The differential equations for protein abundance are therefore:

$$\dot{p}_x = v_x - \lambda p_x, \qquad x \in \{t, m, q\}. \tag{13}$$

We note that Eq. 13 applies to all proteins except free ribosomes. The equation for free ribosomes $p_r$ includes an additional term:

$$\dot{p}_r = v_r - \lambda p_r + \sum_{x \in \{r, t, m, q\}} \left( v_x - k_b p_r m_x + k_u c_x \right). \tag{14}$$
Through Eq. 14 the model accounts for competition among different mRNAs for free ribosomes, as well as ribosomal autocatalysis. Ribosomal transcripts sequester free ribosomes for their own translation, and the pool of free ribosomes can increase as a result of translation of new ribosomes and, at the same time, the release of ribosomes engaged in translation of non-ribosomal mRNAs. Finally, it can be shown (details in [35]) that under the assumption of constant average mass, the specific growth rate can be computed in terms of the total number of ribosomes engaged in translation:

$$\lambda = \frac{\gamma(a)}{M} \sum_{x \in \{r, t, m, q\}} c_x, \tag{15}$$
where $M$ is the constant cell mass. Overall, Eqs. 4–15 constitute the core of the mechanistic growth model. Equations 8 and 14, in particular, model the availability of energy and ribosomes, both regarded as cellular resources shared between metabolism and protein biosynthesis. The model contains 22 parameters. For E. coli, some parameter values were mined directly from the literature and others were estimated with Bayesian inference on published growth data [19, 35]. The parameter values are shown in Table 2. We note that no component of the proteome is assumed to be subject to active degradation. As we shall see in the next sections, the core model can be extended with gene circuits of varying complexity.
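The step from constant mass to the growth rate in Eq. 15 can be sketched in a few lines. The derivation below assumes that cell mass is counted in amino acids as the sum of all proteins plus the ribosomes bound in translating complexes; this definition of $M$ is an assumption made here for illustration, and the full argument is given in [35].

```latex
\begin{align*}
M &= \sum_{x} n_x p_x + n_r \sum_{x} c_x, \qquad x \in \{r, t, m, q\},\\
\dot{M} &= \sum_{x} n_x \dot{p}_x + n_r \sum_{x} \dot{c}_x
         = \sum_{x} n_x v_x - \lambda M
         = \gamma(a) \sum_{x} c_x - \lambda M,\\
\dot{M} = 0 \;&\Longrightarrow\; \lambda = \frac{\gamma(a)}{M} \sum_{x} c_x .
\end{align*}
```

The second line uses Eqs. 12–14, in which the binding and unbinding terms cancel between free ribosomes and complexes, and the last equality in that line uses $n_x v_x = c_x \gamma(a)$ from Eq. 6.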
Table 2 Model parameters for an Escherichia coli host, taken from [35]

| Parameter | Value | Parameter | Value |
|---|---|---|---|
| $s$ | $10^4$ (molecules) | $M$ | $10^8$ (aa) |
| $n_r$ | 7459 (aa/molecules) | $\theta_r$ | 427 (molecules) |
| $\gamma_{\max}$ | 1260 (aa/min molecules) | $K_\gamma$ | 7 (molecules) |
| $K_t$ | 1000 (molecules) | $K_m$ | 1000 (molecules) |
| $v_t$ | 726 (min$^{-1}$) | $v_m$ | 5800 (min$^{-1}$) |
| $w_{r,\max}$ | 930 (molecules/min) | $w_{m,\max}$, $w_{t,\max}$ | 4.14 (molecules/min) |
| $w_{q,\max}$ | 949 (molecules/min) | $d_m$ | 0.1 (min$^{-1}$) |
| $K_q$ | 152,219 (molecules) | $h_q$ | 4 |
| $\theta_q$, $\theta_t$, $\theta_m$ | 4.38 (molecules) | $n_q$, $n_t$, $n_m$ | 300 (aa/molecules) |
| $k_u$ | 1 (min$^{-1}$) | $k_b$ | 0.0095 (min$^{-1}$ molecules$^{-1}$) |

Units of aa correspond to number of amino acids per cell
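To make the structure of Eqs. 4–15 concrete, the right-hand side of the core model can be collected into a single function. The following is a minimal sketch in plain Python, not the authors' implementation: the dictionary keys, the function name `rhs`, and the state `y0` are hypothetical, and the value of the nutrient efficiency `ns` is an assumption because it is not listed in Table 2. All other values are taken from Table 2.

```python
TYPES = ("r", "t", "m", "q")  # ribosomal, transporter, metabolic, house-keeping

# Values from Table 2, except "ns", which Table 2 does not list (assumed here).
PAR = {
    "s": 1e4, "M": 1e8, "gmax": 1260.0, "Kgamma": 7.0,
    "vt": 726.0, "Kt": 1000.0, "vm": 5800.0, "Km": 1000.0,
    "dm": 0.1, "kb": 0.0095, "ku": 1.0, "Kq": 152219.0, "hq": 4,
    "ns": 0.5,  # ASSUMPTION: illustrative nutrient efficiency
    "n": {"r": 7459.0, "t": 300.0, "m": 300.0, "q": 300.0},
    "wmax": {"r": 930.0, "t": 4.14, "m": 4.14, "q": 949.0},
    "theta": {"r": 427.0, "t": 4.38, "m": 4.38, "q": 4.38},
}


def rhs(y, par=PAR):
    """Return (derivatives, growth rate) for the 14 state variables.

    y maps 'si', 'a' and, for each x in TYPES, 'm'+x, 'c'+x, 'p'+x
    to molecule numbers per cell.
    """
    a, si, pr = y["a"], y["si"], y["pr"]
    gamma = par["gmax"] * a / (par["Kgamma"] + a)                       # Eq. 7
    lam = gamma * sum(y["c" + x] for x in TYPES) / par["M"]             # Eq. 15
    v_imp = y["pt"] * par["vt"] * par["s"] / (par["Kt"] + par["s"])     # Eq. 5
    v_cat = y["pm"] * par["vm"] * si / (par["Km"] + si)                 # Eq. 5
    v = {x: y["c" + x] * gamma / par["n"][x] for x in TYPES}            # Eq. 6
    w = {x: par["wmax"][x] * a / (par["theta"][x] + a) for x in TYPES}  # Eq. 9
    w["q"] /= 1.0 + (y["pq"] / par["Kq"]) ** par["hq"]                  # Eq. 10

    d = {"si": v_imp - v_cat - lam * si}                                # Eq. 4
    d["a"] = (par["ns"] * v_cat
              - sum(par["n"][x] * v[x] for x in TYPES) - lam * a)       # Eq. 8
    released = 0.0  # net flux of ribosomes returning to the free pool
    for x in TYPES:
        bind = par["kb"] * pr * y["m" + x] - par["ku"] * y["c" + x]
        d["m" + x] = w[x] - (lam + par["dm"]) * y["m" + x] + v[x] - bind  # Eq. 11
        d["c" + x] = -lam * y["c" + x] + bind - v[x]                      # Eq. 12
        d["p" + x] = v[x] - lam * y["p" + x]                              # Eq. 13
        released += v[x] - bind
    d["pr"] += released                                                   # Eq. 14
    return d, lam


# Hypothetical state (not a steady state), used only to exercise the function.
y0 = {"si": 100.0, "a": 50.0,
      "mr": 5.0, "mt": 2.0, "mm": 2.0, "mq": 20.0,
      "cr": 50.0, "ct": 20.0, "cm": 20.0, "cq": 100.0,
      "pr": 1000.0, "pt": 500.0, "pm": 500.0, "pq": 10000.0}
dy, lam = rhs(y0)

# Consistency check: with lambda from Eq. 15, total protein mass relaxes to M.
n = PAR["n"]
mass = sum(n[x] * y0["p" + x] for x in TYPES) + n["r"] * sum(y0["c" + x] for x in TYPES)
dmass = sum(n[x] * dy["p" + x] for x in TYPES) + n["r"] * sum(dy["c" + x] for x in TYPES)
```

To simulate growth, a function of this kind would be handed to an ODE integrator suited to stiff systems and run until the state settles. The mass check at the end is a convenient way to catch sign errors when the equations are typed in, since the rate of change of total protein mass should equal $\lambda (M - \text{mass})$.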
3 Modeling Gene Circuits Coupled with Their Host

In this section we discuss how to extend the mechanistic growth model with heterologous gene circuits. The extended model can be employed for predicting the impact of genetic parameters, such as promoter strengths or gene length, on the growth rate of the host strain and the resulting heterologous expression levels. We first describe the steps needed to extend the model, and then illustrate the ideas with a simple model for an inducible gene. This is a simple example that contains all the elements needed by more complex circuits.
3.1 Extending the Model with Heterologous Genes
The extension of the model requires three steps:

Step 1: Add New Model Species: First, we include mass balance equations for the expression of each heterologous gene. This requires three additional species per gene: the transcript, the mRNA–ribosomal complex, and the protein, all of which follow dynamics similar to Eqs. 11–13:

$$\begin{aligned}
\dot{p}_i^c &= v_i^c - (\lambda + d_p)\, p_i^c, \\
\dot{m}_i^c &= w_i^c - (\lambda + d_m)\, m_i^c + v_i^c - k_{b,i}^c\, p_r m_i^c + k_{u,i}^c\, c_i^c, \\
\dot{c}_i^c &= -\lambda c_i^c + k_{b,i}^c\, p_r m_i^c - k_{u,i}^c\, c_i^c - v_i^c,
\end{aligned} \tag{16}$$
where the superscript c denotes heterologous species and the subscript i denotes the ith heterologous gene. The ribosomal binding parameters $k_{b,i}^c$ and $k_{u,i}^c$ are specific to each gene and can be used, for example, to model different ribosomal binding sequences. The translation rate $v_i^c$ is modeled similarly as that of native genes in Eq. 6:

$$v_i^c = \frac{c_i^c}{n_i^c}\, \frac{\gamma_{max}\, a}{a + K_\gamma}, \tag{17}$$
with $n_i^c$ being the length of the ith circuit protein. Likewise, the transcription rate is similar to Eq. 9:

$$w_i^c = w_{max,i}^c\, \frac{a}{\theta^c + a}\, R_i, \tag{18}$$

where $w_{max,i}^c$ is the maximal transcription rate. Note that we have included an additional term $R_i$ to model regulatory interactions by other genes. Complex circuit connectivities can be modeled by suitable choices of the function $R_i$. Later in Subheading 4 we exemplify this with models for transcriptional logic gates.

Step 2: Modify Allocation of Resources: Second, we include the additional consumption of energy and ribosomes in the model. Starting from the resource equations in Eqs. 8 and 14, we write:
Evangelos-Marios Nikolados et al.
$$\dot{a} = n_s v_{cat} - \sum_x n_x v_x - \underbrace{\sum_i n_i^c v_i^c}_{\text{energy consumption by foreign genes}} - \lambda a, \tag{19}$$

$$\dot{p}_r = v_r - \lambda p_r + \sum_x \left( v_x - k_b\, p_r m_x + k_u\, c_x \right) + \underbrace{\sum_i \left( v_i^c - k_{b,i}^c\, p_r m_i^c + k_{u,i}^c\, c_i^c \right)}_{\text{consumption of free ribosomes by foreign genes}}. \tag{20}$$

Step 3: Adjust Growth Rate Prediction: Third, we update the prediction of growth rate in Eq. 15 to include translation of heterologous genes:

$$\lambda = \frac{\gamma(a)}{M} \Big( \sum_x c_x + \underbrace{\sum_i c_i^c}_{\text{ribosomal complexes}} \Big). \tag{21}$$

3.2 Simulation of an Inducible Gene
Inducible expression systems are widely employed as building blocks of complex gene circuits. As an example, we consider a reporter gene (rep) under the control of an inducible promoter, modeled by the reactions in Table 3. The model contains mRNAs of the heterologous gene, which can reversibly bind to free ribosomes of the host, pr. Protein translation consumes energy (a) and, at the same time, proteins and other model species are diluted by cell growth. In contrast to native proteins of the host, however, we assume that heterologous proteins are tagged for degradation by proteases, a strategy often employed to accelerate protein turnover [53]. This active degradation is modeled by the parameter dp,rep in Table 3. We do not explicitly model the molecular mechanism for induction, as this will depend on the particular implementation of choice. For example, in the tetR inducible system, the inducer anhydrotetracycline (aTc) activates gene expression by reversible binding to the tetracycline repressor tetR, whereas in the lac inducible system, the inducer Isopropyl-β-D-thiogalactoside (IPTG) binds to allosteric sites of the lac repressor lacR. Instead, we lump the induction mechanism into an effective transcription rate, denoted as wrep in Table 3.
Table 3 Reactions for an inducible reporter gene

| Process | Reaction | Rate |
|---|---|---|
| Transcription | ∅ → m_rep | w_rep |
| Dilution/degradation | m_rep → ∅ | λ + d_m,rep |
| Ribosome binding | p_r + m_rep ⇌ c_rep | k_b (binding), k_u (unbinding) |
| Dilution | c_rep → ∅ | λ |
| Translation | n_rep a + c_rep → p_r + m_rep + p_rep | v_rep |
| Dilution/degradation | p_rep → ∅ | λ + d_p,rep |
Using the general circuit equations 16–18 of Subsection 3.1, for the inducible gene Eq. 16 becomes:

$$\begin{aligned}
\dot{p}_{rep} &= v_{rep} - (\lambda + d_{p,rep})\, p_{rep}, \\
\dot{m}_{rep} &= w_{rep} - (\lambda + d_{m,rep})\, m_{rep} + v_{rep} - k_{b,rep}\, p_r m_{rep} + k_{u,rep}\, c_{rep}, \\
\dot{c}_{rep} &= -\lambda c_{rep} + k_{b,rep}\, p_r m_{rep} - k_{u,rep}\, c_{rep} - v_{rep}.
\end{aligned} \tag{22}$$

The rate of reporter translation follows from Eq. 17:

$$v_{rep} = \frac{c_{rep}}{n_{rep}}\, \frac{\gamma_{max}\, a}{a + K_\gamma}, \tag{23}$$

where $n_{rep}$ is the length of the reporter in amino acids. Likewise, the transcription rate in Eq. 18 becomes:

$$w_{rep} = w_{max,rep}\, \frac{a}{\theta^c + a}. \tag{24}$$

Note that in the transcription rate, the regulatory term is $R_i = 1$, because the inducible system does not contain any regulatory interactions.

Before simulating the expression of the heterologous protein, we first need to obtain an estimate for the proteome composition of the wild-type. This is required to initialize the host–circuit simulations with a physiologically realistic cellular composition. To this end, we first simulate Eqs. 4–15 for the “wild-type model” until steady state. The results, summarized in Fig. 3a, show that host proteins are translated at different rates with most of the translating ribosomes bound to mRNAs of house-keeping proteins. However, a sizeable fraction is bound to ribosomal mRNA, highlighting how the growth model accounts for ribosomal autocatalysis. A closer look (Fig. 3a, bottom) reveals that translation-engaged ribosomes account for approximately two-thirds of the total ribosomal fraction in the form of mRNA–ribosomal complexes, with one-third remaining free.

Next, we simulate heterologous expression using the maximal transcription rate $w_{max,rep}$ in Eq. 24 to describe the effect of different gene induction strengths. As shown in the dose–response curve in Fig. 3b, the model predicts that increased induction causes an increase in expression. We observe, however, that protein expression reaches a maximum at a critical induction strength and subsequently drops sharply for stronger induction. This reflects the limitations that resource competition imposes on the expression of a heterologous gene [38]. To understand the main source of the resource limitations, we use the model to explore the synthesis rates of the various components of the proteome. Because growth rate is linearly related to the total rate of translation (Eq. 21), we can draw direct conclusions for cellular growth as well.
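Equations 22–24 can be sketched as a right-hand-side function for the three reporter species. The sketch below is a minimal illustration, assuming the host state (energy a, free ribosomes p_r, growth rate λ) is supplied by the caller; the reporter length, the transcription threshold, and the induction strength are hypothetical placeholder values, while the binding constants and half-lives follow the caption of Fig. 3 and the elongation parameters follow Table 2.

```python
import math


def half_life_to_rate(t_half_min):
    """First-order rate constant (1/min) for a given half-life in minutes."""
    return math.log(2) / t_half_min


REP = {
    "k_b": 1e-2,                    # binding rate constant, Fig. 3 caption
    "k_u": 1e-2,                    # dissociation rate constant, Fig. 3 caption
    "d_m": half_life_to_rate(2.0),  # transcript half-life of 2 min
    "d_p": half_life_to_rate(4.0),  # protein half-life of 4 min
    "gamma_max": 1260.0,            # Table 2
    "K_gamma": 7.0,                 # Table 2
    "n": 300.0,                     # reporter length (aa): hypothetical
    "theta": 4.38,                  # transcription threshold: hypothetical
    "w_max": 100.0,                 # induction strength (mRNAs/min): hypothetical
}


def reporter_rhs(m, c, p, a, p_r, lam, q=REP):
    """Time derivatives (dm, dc, dp) of Eq. 22 for the reporter gene."""
    v = (c / q["n"]) * q["gamma_max"] * a / (a + q["K_gamma"])  # Eq. 23
    w = q["w_max"] * a / (q["theta"] + a)                       # Eq. 24
    bind = q["k_b"] * p_r * m    # ribosome binding flux
    unbind = q["k_u"] * c        # ribosome dissociation flux
    dm = w - (lam + q["d_m"]) * m + v - bind + unbind
    dc = -lam * c + bind - unbind - v
    dp = v - (lam + q["d_p"]) * p
    return dm, dc, dp


# Example call with an arbitrary (hypothetical) host state.
print(reporter_rhs(m=5.0, c=3.0, p=2.0, a=10.0, p_r=100.0, lam=0.02))
```

In a full host–circuit simulation this function would be evaluated together with the host equations, since a, p_r, and λ change with the load imposed by the reporter.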
Fig. 3 Simulation of an inducible gene. (a) Steady state translation rates and ribosomal abundance predicted for the wild-type Escherichia coli model, parameterized as in Table 2. (b) Predicted steady state expression of a heterologous gene for increasing induction strength. The pie charts indicate translation rates and ribosomal abundance as in the left panel. The inset shows the predicted growth rate, relative to the wild-type. The induction strength was modeled with the parameter $w_{max,rep}$ in Eq. 24. The binding rate constant was set equal to the dissociation rate constant, so that $k_{b,rep} = 1 \times 10^{-2}$ min^-1 molecules^-1, $k_{u,rep} = 1 \times 10^{-2}$ min^-1. Transcript and protein half-lives were set to two and four minutes, respectively [5], so that $d_{m,rep} = \ln 2/2$ min^-1 and $d_{p,rep} = \ln 2/4$ min^-1. Axes in panel (b): gene induction (mRNAs/min) versus expression (# of molecules); inset: growth rate (% of WT)

As shown in Fig. 3b (inset), the model predicts a sigmoidal decrease in growth rate for stronger gene induction. At low induction, expression of the foreign gene is mostly at the expense of house-keeping proteins, while ribosomes, transporters, and metabolic enzymes show little decrease. This suggests that the host can compensate for this load through transcriptional regulation and repartitioning of the proteome (Fig. 3b). As the induction of the reporter gene increases, circuit mRNAs dominate the mRNA population, hence increasing the competition for free ribosomes. Finally, for sufficiently strong induction, ribosomal scarcity leads to reduction of all proteins, which in turn leads to the drop in growth rate observed in Fig. 3b (inset). These results are in agreement with the widespread conception that ribosomal availability is a major control node for cellular physiology [19, 54, 55], with depletion of free ribosomes being the main source of burden for translation of circuit genes [21, 31].
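The resource terms that couple a circuit to its host (Eqs. 19–21) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: each circuit gene is represented by a dictionary with hypothetical keys (`m`, `c`, `n`, `k_b`, `k_u`), the elongation parameters are taken from Table 2, and the host state passed in the example is arbitrary.

```python
def gamma(a, gamma_max=1260.0, K_gamma=7.0):
    """Elongation rate gamma(a), with Table 2 values as defaults."""
    return gamma_max * a / (a + K_gamma)


def circuit_load(genes, a, p_r):
    """Circuit contributions to the resource equations.

    Returns (energy, ribo_flux):
      energy    = sum_i n_i * v_i, the term subtracted in Eq. 19
      ribo_flux = sum_i (v_i - k_b,i * p_r * m_i + k_u,i * c_i), the term added in Eq. 20
    """
    energy = 0.0
    ribo_flux = 0.0
    for g in genes:
        v = g["c"] / g["n"] * gamma(a)  # translation rate, Eq. 17
        energy += g["n"] * v
        ribo_flux += v - g["k_b"] * p_r * g["m"] + g["k_u"] * g["c"]
    return energy, ribo_flux


def growth_rate(a, host_complexes, genes, M=1e8):
    """Eq. 21: growth rate including heterologous ribosomal complexes."""
    total = sum(host_complexes) + sum(g["c"] for g in genes)
    return gamma(a) / M * total


# One hypothetical circuit gene and an arbitrary host state.
genes = [{"m": 50.0, "c": 100.0, "n": 300.0, "k_b": 1e-2, "k_u": 1e-2}]
energy, ribo_flux = circuit_load(genes, a=7.0, p_r=1000.0)
print(energy, ribo_flux, growth_rate(7.0, [1600.0], genes))
```

A negative `ribo_flux` means that the circuit is a net sink of free ribosomes at that instant. Note that evaluating Eq. 21 at a fixed host state says nothing about burden on its own; the drop in growth rate emerges only when the full set of equations is integrated and the host complexes and energy adjust to the load.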
4 Simulation of Transcriptional Logic Gates

There has been substantial interest in gene constructs that mimic digital electronic circuitry [6, 56, 57]. Cellular logic gates, in particular, have been used to produce desired behaviors in response to various inputs such as temperature, pH, and small molecules [58–60]. Multiple logic gates can be combined to build larger information-processing circuits with advanced cellular functions [8].

To illustrate our simulation strategy in more complex circuitry, here we build host–circuit models for cellular logic gates based on transcriptional regulators [61]. We first build and simulate the models for NOT, AND, and NAND gates shown in Fig. 4. To highlight the power of our approach for circuit design, we then use the host–circuit models to predict circuit function across the design space, using combinations of RBS strength and growth media.

Fig. 4 Logic gates based on transcriptional regulators. (a) The NOT gate contains two genes connected in cascade. Repression of gene 2 inverts the input signal. (b) The AND gate contains three genes, in which two transcriptional activators jointly trigger the expression of a third output gene. (c) The NAND gate contains four genes and is the composition of an AND and a NOT gate. Circuit connectivities are based on the implementation by Wang et al. [61]

As discussed in Subsection 3.1, we model the circuits by adding extra genes to the growth model and modifying the mass balance and growth rate equations. We model the circuit connectivity by choosing suitable regulatory terms $R_i$ in the transcription rates in Eq. 18, and the gate inputs via the maximal transcription rate $w_{max,i}^c$. To compare our host–circuit simulations with those of traditional models, we built circuit-only models using mass balance equations for mRNAs and proteins:

$$\begin{aligned}
\dot{m}_i^c &= w_i^c R_i - (\lambda^{eff} + d_m)\, m_i^c, \\
\dot{p}_i^c &= k_i^{eff} m_i^c - (\lambda^{eff} + d_p)\, p_i^c,
\end{aligned} \tag{25}$$
where the subscript i denotes the ith circuit gene and we assume a constant dilution rate, $\lambda^{eff} = 0.022$ min^-1, which is equal to the growth rate predicted by the model for the wild-type with a nutrient efficiency of $n_s = 0.5$. The effective translation rates are fixed to $k_1^{eff} = k_2^{eff} = 16.8$ min^-1 and $k_3^{eff} = 0.61$ min^-1 for the AND gate, and $k_1^{eff} = k_2^{eff} = 13.86$ min^-1, $k_3^{eff} = 0.058$ min^-1, and $k_4^{eff} = 347$ min^-1 for the NAND gate. In all cases, we assume that mRNAs and proteins are actively degraded with rate constants $d_m = \ln 2/2$ min^-1 and $d_p = \ln 2/4$ min^-1.
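For reference, the steady state of the isolated model follows from setting the time derivatives in Eq. 25 to zero. The short derivation below only restates Eq. 25; the asterisk notation for steady-state values is introduced here for illustration.

```latex
\begin{aligned}
0 &= w_i^c R_i - (\lambda^{\mathrm{eff}} + d_m)\, m_i^{c*}
  &&\Rightarrow\quad
  m_i^{c*} = \frac{w_i^c R_i}{\lambda^{\mathrm{eff}} + d_m}, \\
0 &= k_i^{\mathrm{eff}}\, m_i^{c*} - (\lambda^{\mathrm{eff}} + d_p)\, p_i^{c*}
  &&\Rightarrow\quad
  p_i^{c*} = \frac{k_i^{\mathrm{eff}}\, w_i^c R_i}
                  {(\lambda^{\mathrm{eff}} + d_m)(\lambda^{\mathrm{eff}} + d_p)}.
\end{aligned}
```

With the values quoted above, the two denominators are approximately 0.369 min^-1 and 0.195 min^-1. Because $R_i$ depends on the steady-state levels of upstream proteins, these expressions are evaluated in cascade order, from the input genes to the output gene.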
4.1 Host-Aware NOT Gate

The NOT gate contains two genes in cascade, where gene 1 codes for a transcriptional repressor that inhibits the expression of gene 2; the circuit diagram is shown in Fig. 4a. We first model the NOT gate in isolation using Eq. 25. We choose the regulatory functions $R_i$ as

$$R_1 = 1, \qquad R_2 = \frac{1}{1 + \left( \dfrac{p_1^c}{K_1^c} \right)^{h}}. \tag{26}$$

The choice of $R_2$ models the inhibition of gene 2, and different inhibitory strengths and cooperativity effects can be described by suitable choices of the threshold $K_1^c$ and Hill coefficient h. We fix $K_1^c = 250$ molecules and h = 2. As shown in Fig. 5a, the isolated models correctly predict the expected circuit function, with stronger induction of the input gene 1 gradually suppressing the expression of the output proteins ($p_2^c$), with strong induction resulting in minimal output yield. In other words, the gate has high output only when the input signal is low, in effect acting as an inverter of the input signal.

To simulate the host-aware NOT gate, we follow the procedure outlined in Subsection 3.1. The host-aware simulations shown in Fig. 5b suggest that the function of the NOT gate remains largely unaffected by host–circuit interactions. For intermediate input levels, simulations predict an increase in growth rate of up to 50% with respect to a basal case. Such apparent growth benefit is a consequence of the circuit architecture (Fig. 4a): an increase in the input causes a stronger repression of gene 2 and thus relieves the burden on the host. But since the expression of the repressor coded by gene 1 also burdens the host, for high inputs the expression of gene 1 counteracts the growth advantages gained by repression of gene 2, resulting in an overall drop in growth rate.
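The isolated NOT gate of Eqs. 25 and 26 can be sketched at steady state as follows. This is an illustration rather than the authors' simulation code: the effective translation rates and the transcription rate of gene 2 are hypothetical placeholders (the text does not list them for the NOT gate), whereas λ_eff, d_m, d_p, the threshold, and the Hill coefficient are the values quoted above.

```python
import math

LAMBDA_EFF = 0.022        # dilution rate (1/min), from the text
D_M = math.log(2) / 2     # mRNA degradation rate constant (1/min)
D_P = math.log(2) / 4     # protein degradation rate constant (1/min)
K1, H = 250.0, 2.0        # repression threshold (molecules) and Hill coefficient


def steady_protein(w, k_eff, R=1.0):
    """Steady state of Eq. 25 for one gene: mRNA first, then protein."""
    m = w * R / (LAMBDA_EFF + D_M)
    return k_eff * m / (LAMBDA_EFF + D_P)


def not_gate_output(w_in, w_out=100.0, k_eff_1=1.0, k_eff_2=1.0):
    """Output protein level for a given input transcription rate.

    w_out, k_eff_1 and k_eff_2 are hypothetical values chosen for illustration.
    """
    p1 = steady_protein(w_in, k_eff_1)        # repressor level (R_1 = 1)
    R2 = 1.0 / (1.0 + (p1 / K1) ** H)         # Eq. 26
    return steady_protein(w_out, k_eff_2, R2)


inputs = [10 ** e for e in range(5)]          # 10^0 ... 10^4 mRNAs/min
outputs = [not_gate_output(w) for w in inputs]
for w, out in zip(inputs, outputs):
    print(f"input {w:>6} mRNAs/min -> output {out:.3g} molecules")
```

The output falls monotonically as the input grows, which is the inverter behavior expected from the isolated model; the absolute numbers depend on the placeholder parameters and should not be compared with Fig. 5a.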
Fig. 5 Host-aware simulation of a NOT gate. (a) Gate output predicted by a model isolated from the cellular host. Inset shows the Boolean truth table for the NOT gate (input 1 gives output 0; input 0 gives output 1). (b) Output and growth rate predictions from host-aware model of the NOT gate. Growth rate is normalized to a basal case. Axes: input (mRNAs/min) versus output (# molecules) and growth rate (% of basal)
4.2 Host-Aware AND Gate

The AND gate comprises two genes that co-activate a third output gene (Fig. 4b). As built in the original implementation [61], the promoter for gene 3 is activated only when both the co-dependent enhancer-binding proteins, encoded by genes 1 and 2, are present in a heteromeric complex. Consequently, the regulatory functions for the AND gate are:

$$R_1 = 1, \qquad R_2 = 1, \qquad R_3 = \frac{\left( \dfrac{p_1^c}{K_1^c} \right)^{h_1} \left( \dfrac{p_2^c}{K_2^c} \right)^{h_2}}{\left[ 1 + \left( \dfrac{p_1^c}{K_1^c} \right)^{h_1} \right] \left[ 1 + \left( \dfrac{p_2^c}{K_2^c} \right)^{h_2} \right]}, \tag{27}$$

with $K_1^c = 200$ molecules and $h_1 = 2.381$ for the activation by gene 1, and $K_2^c = 3000$ molecules and $h_2 = 1.835$ for the activation by gene 2; these values are similar to the parameter values estimated in Wang et al. [61]. Simulations of the isolated model (Fig. 6a) show that, as expected, the gate has a high output only when the input signals are high. This agrees with the expected truth table of the AND, shown in the inset of Fig. 6a. In contrast, simulations of the host-aware model, shown in Fig. 6b, suggest a strong impact of host–circuit interactions. The host-aware model predicts a bell-shaped response surface, where the output reaches a maximal value for an intermediate level of the inputs, beyond which the output drops monotonically. Such loss-of-function coincides with a drop in growth rate observed for increased levels of either input, as seen in the right panel of Fig. 6b, and thus suggests a link between growth defects and poor circuit function.
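The regulatory function $R_3$ in Eq. 27 is the product of two activating Hill functions, which the following Python sketch makes explicit. The function names are illustrative, and the protein levels used to probe the four corners of the truth table (1 and 10^6 molecules) are arbitrary choices, not values from the simulations.

```python
def hill(p, K, h):
    """Activating Hill function (p/K)^h / (1 + (p/K)^h)."""
    x = (p / K) ** h
    return x / (1.0 + x)


def and_gate_R3(p1, p2, K1=200.0, h1=2.381, K2=3000.0, h2=1.835):
    """Regulatory term R_3 of Eq. 27 with the parameter values given in the text."""
    return hill(p1, K1, h1) * hill(p2, K2, h2)


LOW, HIGH = 1.0, 1e6  # arbitrary protein levels (molecules) used as logic 0 and 1
for p1 in (LOW, HIGH):
    for p2 in (LOW, HIGH):
        print(f"p1={p1:g}, p2={p2:g} -> R3={and_gate_R3(p1, p2):.4f}")
```

Only the high/high combination gives a value of $R_3$ close to one, matching the AND truth table, and when both activators sit exactly at their thresholds the term equals 0.25.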
Fig. 6 Host-aware simulation of an AND gate. (a) Output predicted by a model isolated from the cellular host. Inset shows the Boolean truth table for the AND gate (output is 1 only when input 1 and input 2 are both 1). (b) Output and growth rate predictions from host-aware model of the AND gate across the input space. Growth rate is normalized to the basal case in lower left corner of the heatmap. Axes: input 1 (mRNAs/min) versus input 2 (mRNAs/min), with output (# molecules) or growth rate (% of basal) as the color scale
4.3 Host-Aware NAND Gate

The NAND gate is the negation of an AND gate, and thus produces a low output only when both inputs are high. As shown in Fig. 4c, the gate has four genes connected as the composition of an AND and NOT gate. As with the previous two cases, we simulate the isolated model using Eq. 25. The regulatory functions for the NAND gate are:

$$\begin{aligned}
R_1 &= 1, \\
R_2 &= 1, \\
R_3 &= \frac{\left( \dfrac{p_1^c}{K_1^c} \right)^{h_1} \left( \dfrac{p_2^c}{K_2^c} \right)^{h_2}}{\left[ 1 + \left( \dfrac{p_1^c}{K_1^c} \right)^{h_1} \right] \left[ 1 + \left( \dfrac{p_2^c}{K_2^c} \right)^{h_2} \right]}, \\
R_4 &= \frac{1}{1 + \left( \dfrac{p_3^c}{K_3^c} \right)^{h_3}},
\end{aligned} \tag{28}$$

with parameter values for $R_3$ equal to those for $R_3$ of the AND gate in Eq. 27, and parameter values for $R_4$ equal to those of $R_2$ for the NOT gate in Eq. 26. As shown in Fig. 7, simulations reveal substantially different predictions between the isolated and host-aware models of the NAND gate. The host-aware model predicts a complex relation between inputs and output that differs from the ideal response predicted by the isolated model. Host-aware simulations produce the correct response across a range of the input space (Fig. 7b), but display significant distortions possibly caused by the loss-of-function of the AND component shown in Fig. 6b. The impact of host–circuit interactions can also be observed in the predicted growth rate, which suggests a growth advantage for intermediate levels of the inputs. This is a result of the architecture of the NOT gate, akin to what we observed in Fig. 5b.
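The composition of the AND and NOT components in Eq. 28 can be illustrated by chaining the two regulatory terms. The sketch below is deliberately simplified: instead of solving Eq. 25 for gene 3, it assumes that the level of protein 3 is proportional to $R_3$ with a hypothetical maximal level `p3_max`. The Hill parameters are those quoted for Eqs. 26 and 27; the logic levels are arbitrary.

```python
def hill(p, K, h):
    """Activating Hill function (p/K)^h / (1 + (p/K)^h)."""
    x = (p / K) ** h
    return x / (1.0 + x)


def nand_regulation(p1, p2, p3_max=2500.0):
    """Return (R3, R4) of Eq. 28 for given levels of proteins 1 and 2.

    p3_max is a hypothetical level of protein 3 at full activation; the
    proportional map p3 = p3_max * R3 stands in for the dynamics of gene 3.
    """
    R3 = hill(p1, 200.0, 2.381) * hill(p2, 3000.0, 1.835)  # AND component
    p3 = p3_max * R3                                       # simplifying assumption
    R4 = 1.0 / (1.0 + (p3 / 250.0) ** 2.0)                 # NOT component
    return R3, R4


LOW, HIGH = 1.0, 1e6  # arbitrary protein levels (molecules) used as logic 0 and 1
for p1 in (LOW, HIGH):
    for p2 in (LOW, HIGH):
        R3, R4 = nand_regulation(p1, p2)
        print(f"p1={p1:g}, p2={p2:g} -> R3={R3:.4f}, R4={R4:.4f}")
```

Under these assumptions $R_4$ stays near one unless both inputs are high, reproducing the NAND truth table. How strongly the output is repressed depends on how far the level of protein 3 exceeds the threshold $K_3^c$, which is the sensitivity that the host-aware simulations expose when the AND component loses function.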
Fig. 7 Host-aware simulation of a NAND gate. (a) Output predicted by a model isolated from the cellular host. Inset shows the Boolean truth table for the NAND gate (output is 0 only when input 1 and input 2 are both 1). (b) Output and growth rate predictions from host-aware model of the NAND gate across the input space. Growth rate is normalized to the basal case in lower left corner of the heatmap. Axes: input 1 (mRNAs/min) versus input 2 (mRNAs/min), with output (# molecules) or growth rate (% of basal) as the color scale
4.4 Impact of Design Parameters on Circuit Function
In this final section, we conduct a series of simulations that mimic experiments commonly used in circuit design. These aim to explore the impact of design parameters and growth media on circuit function.
4.4.1 Ribosomal Binding Sites (RBS)
A number of studies have shown that RBS strength is a key modulator of cellular burden [21, 29–31]. Here we examine the impact of RBS strengths on the AND and NAND gates from the previous section. Using the notation in our model, see e.g. Eq. 16, we define the RBS strength as:

$$\mathrm{RBS}_i = \frac{k_{b,i}^c}{k_{u,i}^c}, \tag{29}$$

where $k_{b,i}^c$ is the mRNA-ribosome binding rate constant (in units of min^-1 molecules^-1), and $k_{u,i}^c$ is the dissociation rate constant (in units of min^-1). We simulated the AND and NAND gates with variable RBS strengths and gene induction strengths. As shown in Fig. 8a (left), the AND gate retains its function for increasing RBS strength. We observe that for the same induction, designs with stronger RBS lead to increased circuit yield. At the same time, the simulations predict (Fig. 8a, left) a larger bell-shaped response surface, suggesting that a stronger RBS yields a slightly larger design space in which the output reaches a higher maximal value for the same range of inputs. In all cases, however, after the output reaches a maximal value, we find a monotonic drop in circuit yield. The loss-of-function coincides with a drop in growth rate observed in all designs (Fig. 8a, right), which becomes more pronounced with stronger RBS.

As shown in Fig. 8b, the impact of RBS is more notable for the NAND gate. For designs with stronger RBS (insets Fig. 8b, left), but weak induction, the gate displays a behavior akin to that of the basal case. For intermediate induction, increasing RBS strength has more detrimental effects on the circuit’s function. Specifically, the NOT component fails to fully repress the AND component, thus distorting the region where the circuit is functional. A further increase in RBS strength, however, greatly impairs the system, leading to near-total loss-of-function across the entire response surface (insets Fig. 8b, left). Likewise, for stronger RBS and intermediate levels of the input, we observe loss of the growth advantage gained by the NOT gate component (Fig. 8b, right).

4.4.2 Nutrient Quality
Bacterial growth is known to depend critically on the quality of the growth media. As a final illustration of our approach, we used the host-aware models to explore the impact of media on the function of the transcriptional logic gates. We model the quality of the media via the nutrient efficiency parameter $n_s$ in Eqs. 4 and 19, which determines the energy yield per molecule of internalized nutrient. Our simulations suggest that nutrient quality affects the quantity of output, but not the specific response of the AND gate (Fig. 9a). As the quality of the growth medium improves, the gene expression capacity of the host increases and, as a result, we observe an increase in the operational range of the circuit.

However, this is not the case for the NAND gate, which displays a more complex behavior for low nutrient quality. As seen in Fig. 9b, richer media improve the function of the gate, compared to the basal case (Fig. 7a). This is because an increase in nutrient quality improves the output of the gate’s AND component, which in turn leads to a stronger input for the NOT component, and hence stronger repression. On the contrary, poor nutrient quality leads to loss-of-function for the circuit. As observed in Fig. 9a, poorer media correspond to significantly decreased expression of the AND gate, which is also true for the AND component of the NAND gate. This translates to very weak input for the NOT component, which in turn does not properly repress gene 4 (Fig. 4c), resulting in the loss of gate functionality (Fig. 9b).

Fig. 8 Impact of ribosomal binding site (RBS) strength. (a) Output and growth rate predictions for the AND gate in Fig. 4b and three RBS strengths (basal RBS, RBS ×10, RBS ×50). (b) Output and growth rate predictions for the NAND gate in Fig. 4c. RBS strengths were computed from Eq. 29 by simultaneously increasing the binding rate constant $k_{b,i}^c \in \{10^{-2}, 10^{-1.5}, 10^{-1.155}\}$ and decreasing the dissociation rate constant $k_{u,i}^c \in \{10^{-2}, 10^{-2.5}, 10^{-2.855}\}$ in a pairwise manner for i = 3 (AND gate) and i = 4 (NAND gate). Gene induction strengths were varied in the range $10^0 \le w_{max,i}^c \le 10^4$ mRNAs/min for i = 1, 2 in both gates, and fixed $w_{max,3}^c = 375$ mRNAs/min for the AND gate, and $w_{max,3}^c = 375$ mRNAs/min and $w_{max,4}^c = 250$ mRNAs/min for NAND gate. Axes: input 1 (mRNAs/min) versus input 2 (mRNAs/min), with output (# molecules) or growth rate (% of basal) as the color scale

Fig. 9 Impact of growth media on circuit function. (a) Simulations of the AND gate in Fig. 4b in various growth media. (b) Simulations of the NAND gate in Fig. 4c in various growth media. In both cases the nutrient quality parameter was set to $n_s \in \{0.2, 0.6, 1.0\}$; all other model parameters are identical to the simulations in Figs. 6 and 7b. Axes: input 1 (mRNAs/min) versus input 2 (mRNAs/min), with output (# molecules) as the color scale
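The design sweeps of this section can be set up with a few lines of Python. The sketch below computes the RBS strengths of Eq. 29 from the rate constants listed in the caption of Fig. 8 and builds the parameter grids described in the captions of Figs. 8 and 9. The variable names and the grid resolution (21 points per input) are assumptions made for illustration.

```python
# Binding and dissociation rate constants from the caption of Fig. 8.
k_b = [10 ** -2, 10 ** -1.5, 10 ** -1.155]   # min^-1 molecules^-1
k_u = [10 ** -2, 10 ** -2.5, 10 ** -2.855]   # min^-1

# Eq. 29: RBS strength is the ratio of binding to dissociation rate constants.
rbs = [b / u for b, u in zip(k_b, k_u)]
fold = [r / rbs[0] for r in rbs]             # fold change relative to the basal design

# Log-spaced induction grid from 10^0 to 10^4 mRNAs/min (resolution is an assumption).
n_points = 21
induction = [10 ** (4 * i / (n_points - 1)) for i in range(n_points)]

# Nutrient efficiency values from the caption of Fig. 9.
n_s_values = [0.2, 0.6, 1.0]

# Every (input 1, input 2) pair to be simulated for each RBS design or medium.
input_pairs = [(w1, w2) for w1 in induction for w2 in induction]

print([round(f, 1) for f in fold], len(input_pairs))
```

The pairwise changes in the two rate constants give fold changes of roughly 10 and 50 over the basal design, consistent with the "RBS ×10" and "RBS ×50" panel labels of Fig. 8.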
5 Discussion

In this chapter we discussed host-aware modeling in Synthetic Biology. Starting from the three bacterial growth laws, we presented a deterministic model to simulate the dynamics of a bacterial host [35]. We showed how to incorporate synthetic gene circuits into the host model, and used this methodology to simulate host-aware versions of various gene circuits. Finally, we examined the impact of host–circuit interactions on the gates, for combinations of inputs, RBS strength, and growth media of different nutrient quality.

While we focused on host–circuit competition for energy and free ribosomes, in practice gene circuits also consume other components that may become resource bottlenecks, such as RNA polymerases and σ-factors for transcription, or amino acids and tRNAs for translation. Molecular species associated with these processes can be readily incorporated into the growth model. For instance, instead of a single energy resource a, the catabolism of the
internalized nutrient $s_i$ by the metabolic protein $p_m$ could also produce a pool of amino acids, which would then participate in the downstream transcription and translation processes. Explicit models of amino acid pools could be employed to study amino acid recycling after protein degradation, or global effects such as upregulation of transcription triggered by nutrient starvation [36, 62]. Such extensions, however, need to be handled with caution since they can increase model complexity, and ultimately obscure the relations between different sources of burden.

A grand goal of Synthetic Biology is to produce target phenotypes through rational design of gene circuits. As with other engineering disciplines, predictive models are an essential step to accelerate the design cycle, yet current models in synthetic biology are largely under-powered for this task. Integrated host–circuit models can effectively bridge this gap and offer a flexible framework to account for a wide range of resource bottlenecks. For example, recent data [63, 64] suggest highly nonlinear relations between growth rate and heterologous expression and a sizeable burden caused by metabolic imbalances typically found in pathway engineering [65]. Such findings raise compelling prospects for the integration of mechanistic cell models with large-scale characterization data, ultimately paving the way for more robust and predictable Synthetic Biology.

References

1. Andrianantoandro E, Basu S, Karig DK, Weiss R (2006) Synthetic biology: new engineering rules for an emerging discipline. Mol Syst Biol 2(1):2006.0028
2. Canton B, Labno A, Endy D (2008) Refinement and standardization of synthetic biological parts and devices. Nat Biotechnol 26(7):787
3. Ninfa AJ, Selinsky S, Perry N, Atkins S, Song QX, Mayo A, Arps D, Woolf P, Atkinson MR (2007) Using two-component systems and other bacterial regulatory factors for the fabrication of synthetic genetic devices. Methods Enzymol 422:488–512
4. Teo JJ, Woo SS, Sarpeshkar R (2015) Synthetic biology: a unifying view and review using analog circuits. IEEE Trans Biomed Circ Syst 9(4):453–474
5. Elowitz MB, Leibler S (2000) A synthetic oscillatory network of transcriptional regulators. Nature 403(6767):335
6. Hasty J, McMillen D, Collins JJ (2002) Engineered gene circuits. Nature 420(6912):224
7. Gardner TS, Cantor CR, Collins JJ (2000) Construction of a genetic toggle switch in Escherichia coli. Nature 403(6767):339
8. Tabor JJ, Salis HM, Simpson ZB, Chevalier AA, Levskaya A, Marcotte EM, Voigt CA, Ellington AD (2009) A synthetic genetic edge detection program. Cell 137(7):1272–1281
9. Mannan AA, Liu D, Zhang F, Oyarzún DA (2017) Fundamental design principles for transcription-factor-based metabolite biosensors. ACS Synth Biol 6:1851–1859
10. Oyarzún DA, Stan G-BV (2013) Synthetic gene circuits for metabolic control: design trade-offs and constraints. J R Soc Interf 10:20120671
11. Nielsen AA, Der BS, Shin J, Vaidyanathan P, Paralanov V, Strychalski EA, Ross D, Densmore D, Voigt CA (2016) Genetic circuit design automation. Science 352(6281):aac7341
12. Chaves M, Oyarzún DA (2019) Dynamics of complex feedback architectures in metabolic pathways. Automatica 99:323–332
13. Carbonell P, Radivojevic T, García Martín H (2019) Opportunities at the intersection of synthetic biology, machine learning, and automation. ACS Synth Biol 8:1474–1477
14. Hughes RA, Ellington AD (2017) Synthetic DNA synthesis and assembly: putting the synthetic in synthetic biology. Cold Spring Harbor Perspect Biol 9:a023812
15. Rondelez Y (2012) Competition for catalytic resources alters biological network dynamics. Phys Rev Lett 108(1):018102
16. Cardinale S, Arkin AP (2012) Contextualizing context for synthetic biology–identifying causes of failure of synthetic biological systems. Biotechnol J 7(7):856–866
17. Gyorgy A, Del Vecchio D (2014) Limitations and trade-offs in gene expression due to competition for shared cellular resources. In: 2014 IEEE 53rd Annual Conference on Decision and Control (CDC), pp. 5431–5436. IEEE, New York
18. Mather WH, Hasty J, Tsimring LS, Williams RJ (2013) Translational cross talk in gene networks. Biophys J 104(11):2564–2572
19. Scott M, Gunderson CW, Mateescu EM, Zhang Z, Hwa T (2010) Interdependence of cell growth and gene expression: origins and consequences. Science 330(6007):1099–1102
20. Tan C, Marguet P, You L (2009) Emergent bistability by a growth-modulating positive feedback circuit. Nat Chem Biol 5(11):842
21. Ceroni F, Algar R, Stan G-B, Ellis T (2015) Quantifying cellular capacity identifies gene expression designs with reduced burden. Nat Methods 12(5):415
22. An W, Chin JW (2009) Synthesis of orthogonal transcription-translation networks. Proc Natl Acad Sci
23. Segall-Shapiro TH, Meyer AJ, Ellington AD, Sontag ED, Voigt CA (2014) A resource allocator for transcription based on a highly fragmented T7 RNA polymerase. Mol Syst Biol 10(7):742
24. Pasini M, Fernández-Castañé A, Jaramillo A, de Mas C, Caminal G, Ferrer P (2016) Using promoter libraries to reduce metabolic burden due to plasmid-encoded proteins in recombinant Escherichia coli. New Biotechnol 33(1):78–90
25. Shopera T, He L, Oyetunde T, Tang YJ, Moon TS (2017) Decoupling resource-coupled gene expression in living cells. ACS Synth Biol 6(8):1596–1604
26. Darlington APS, Kim J, Jiménez JI, Bates DG (2018) Dynamic allocation of orthogonal ribosomes facilitates uncoupling of co-expressed genes. Nat Commun 9:695
27. Rugbjerg P, Sarup-Lytzen K, Nagy M, Sommer MOA (2018) Synthetic addiction extends the productive life time of engineered Escherichia coli populations. Proc Natl Acad Sci 115(10):2347–2352
28. Ceroni F, Boo A, Furini S, Gorochowski TE, Borkowski O, Ladak YN, Awan AR, Gilbert C, Stan G-B, Ellis T (2018) Burden-driven feedback control of gene expression. Nat Methods 15(5):387
29. Gyorgy A, Jiménez JI, Yazbek J, Huang H-H, Chung H, Weiss R, Del Vecchio D (2015) Isocost lines describe the cellular economy of genetic circuits. Biophys J 109(3):639–646
30. Carbonell-Ballestero M, Garcia-Ramallo E, Montañez R, Rodriguez-Caso C, Macía J (2015) Dealing with the genetic load in bacterial synthetic biology circuits: convergences with the Ohm’s law. Nucleic Acids Res 44(1):496–507
31. Gorochowski TE, Avcilar-Kucukgoze I, Bovenberg RA, Roubos JA, Ignatova Z (2016) A minimal model of ribosome allocation dynamics captures trade-offs in expression between endogenous and synthetic genes. ACS Synth Biol 5(7):710–720
32. Karr JR, Sanghvi JC, Macklin DN, Gutschow MV, Jacobs JM, Bolival Jr B, Assad-Garcia N, Glass JI, Covert MW (2012) A whole-cell computational model predicts phenotype from genotype. Cell 150(2):389–401
33. Purcell O, Jain B, Karr JR, Covert MW, Lu TK (2013) Towards a whole-cell modeling approach for synthetic biology. Chaos 23(2):025112
34. Klumpp S, Zhang Z, Hwa T (2009) Growth rate-dependent global effects on gene expression in bacteria. Cell 139:1366–1375
35. Weiße AY, Oyarzún DA, Danos V, Swain PS (2015) Mechanistic links between cellular trade-offs, gene expression, and growth. Proc Natl Acad Sci 112(9):E1038–E1047
36. Liao C, Blanchard AE, Lu T (2017) An integrative circuit–host modelling framework for predicting synthetic gene network behaviours. Nat Microbiol 2(12):1658
37. Thomas P, Terradot G, Danos V, Weiße AY (2018) Sources, propagation and consequences of stochasticity in cellular growth. Nat Commun 9(1):1–11
38. Nikolados E-M, Weiße AY, Ceroni F, Oyarzún DA (2019) Growth defects and loss-of-function in synthetic gene circuits. ACS Synth Biol 8(6):1231–1240
39. O’Brien EJ, Lerman JA, Chang RL, Hyduke DR, Palsson B (2013) Genome-scale models of metabolism and gene expression extend and refine growth phenotype prediction. Mol Syst Biol 9:693
40. Carrera J, Covert MW (2015) Why build whole-cell models? Trends Cell Biol 25(12):719–722
41. Karr JR, Takahashi K, Funahashi A (2015) The principles of whole-cell modeling. Curr Opin Microbiol 27:18–24
42. O’Brien EJ, Monk JM, Palsson BO (2015) Using genome-scale models to predict biological capabilities. Cell 161(5):971–987
43. Monod J (1949) The growth of bacterial cultures. Ann Rev Microbiol 3(1):371–394
44. Schaechter M, Maaløe O, Kjeldgaard NO (1958) Dependency on medium and temperature of cell size and chemical composition during balanced growth of Salmonella typhimurium. Microbiology 19(3):592–606
45. Neidhardt FC, Magasanik B (1960) Studies on the role of ribonucleic acid in the growth of bacteria. Biochim Biophys Acta 42:99–116
46. Dennis PP, Ehrenberg M, Bremer H (2004) Control of rRNA synthesis in Escherichia coli: a systems biology approach. Microbiol Mol Biol Rev 68(4):639–668
47. Maaløe O (1979) Regulation of the protein-synthesizing machinery—ribosomes, tRNA, factors, and so on. In: Biological Regulation and Development, pp. 487–542. Springer, New York
48. Bremer H, Dennis PP, et al (1996) Modulation of chemical composition and other parameters of the cell by growth rate. EcoSal Cell Mol Biol 2(2):1553–1569
49. Maitra A, Dill KA (2015) Bacterial growth laws reflect the evolutionary importance of energy efficiency. Proc Natl Acad Sci 112(2):406–411
50. Bosdriesz E, Molenaar D, Teusink B, Bruggeman FJ (2015) How fast-growing bacteria robustly tune their ribosome concentration to approximate growth-rate maximization. FEBS J 282(10):2029–2044
51. Molenaar D, Van Berlo R, De Ridder D, Teusink B (2009) Shifts in growth strategies reflect tradeoffs in cellular economics. Mol Syst Biol 5(1):323
52. Russell JB, Cook GM (1995) Energetics of bacterial growth: balance of anabolic and catabolic reactions. Microbiol Mol Biol Rev 59(1):48–62
53. McGinness KE, Baker TA, Sauer RT (2006) Engineering controllable protein degradation. Mol Cell 22(5):701–707
54. Vind J, Sørensen MA, Rasmussen MD, Pedersen S (1993) Synthesis of proteins in Escherichia coli is limited by the concentration of free
ribosomes: expression from reporter genes does not always reflect functional mRNA levels. J Mol Biol 231(3):678–688 55. Dong H, Nilsson L, Kurland CG (1995) Gratuitous overexpression of genes in Escherichia coli leads to growth inhibition and ribosome destruction. J Bacteriol 177(6):1497–1504 56. Lim WA (2010) Designing customized cell signalling circuits. Nat Rev Mol Cell Biol 11 (6):393 57. Khalil AS, Collins JJ (2010) Synthetic biology: applications come of age. Nat Rev Genet 11 (5):367 58. Joshi N, Wang X, Montgomery L, Elfick A, French C (2009) Novel approaches to biosensors for detection of arsenic in drinking water. Desalination 248(1–3):517–523 59. Paitan Y, Biran I, Shechter N, Biran D, Rishpon J, Ron EZ (2004) Monitoring aromatic hydrocarbons by whole cell electrochemical biosensors. Anal Biochem 335(2):175–183 60. Saeidi N, Wong CK, Lo T-M, Nguyen HX, Ling H, Leong SSJ, Poh CL, Chang MW (2011) Engineering microbes to sense and eradicate Pseudomonas aeruginosa, a human pathogen. Mol Syst Biol 7(1):521 61. Wang B, Kitney RI, Joly N, Buck M (2011) Engineering modular and orthogonal genetic logic gates for robust digital-like synthetic biology. Nat Commun 2:508 62. Hartline CJ, Mannan AA, Liu D, Zhang F, Oyarzu´n DA (2020) Metabolite sequestration enables rapid recovery from fatty acid depletion in Escherichia coli. mBio 11:e03112–e03119 63. Cambray G, Guimaraes JC, Arkin AP (2018) Evaluation of 244,000 synthetic sequences reveals design principles to optimize translation in Escherichia coli. Nat Biotechnol 36 (10):1005 64. Borkowski O, Bricio C, Murgiano M, Rothschild-Mancinelli B, Stan GB, Ellis T (2018) Cell-free prediction of protein expression costs for growing cells. Nat Commun 9 (1):1457 65. Liu D, Mannan AA, Han Y, Oyarzu´n DA, Zhang F (2018) Dynamic metabolic control: towards precision engineering of metabolism. J Ind Microbiol Biotechnol 45:535–543
Chapter 14
A Practical Step-by-Step Guide for Quantifying Retroactivity in Gene Networks
Andras Gyorgy
Abstract
One of the fundamental properties of engineered large-scale complex systems is modularity. In synthetic biology, genetic parts exhibit context-dependent behavior. Here, we describe and quantify a major source of such behavior: retroactivity. In particular, we provide a step-by-step guide for characterizing retroactivity to restore the modular description of genetic modules. Additionally, we also discuss how retroactivity can be leveraged to quantify and maximize robustness to perturbations due to interconnection of genetic modules.
Key words Retroactivity, Gene transcription networks, Modularity, Synthetic biology, Context-dependence, Model order reduction, Loading
1 Introduction
Modularity greatly simplifies the design and analysis of complex systems. Although biological systems comprise motifs at the structural level [2, 34, 42, 53, 57], these modules display context-dependent behavior [8, 11, 44, 46, 59, 65], hindering the rational design of large-scale synthetic genetic circuits [22, 38, 51]. As a result, genetic modules currently need to be re-designed through a lengthy and ad hoc process every time they are inserted into a different system [11, 59]. The development of even simple circuit components thus requires an iterative process in which slight modifications are tested and then tuned [4, 60], and the (optimal) characterization of each part is slow and costly [7]. Sources of context-dependence include interactions among parts due to spatial co-localization [13, 18, 63], dependence on the host organism and strain [6], growth-dependence [9, 56, 64], environmental dependence [10, 20, 47, 48, 66], the limited availability of shared cellular resources [12, 23, 24, 26, 44, 52, 58], and retroactivity due to the composition of modules [19, 28–30, 33]. Here, we focus on this last source of context-dependence, capturing how a downstream module perturbs the dynamic state
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_14, © Springer Science+Business Media, LLC, part of Springer Nature 2021
Fig. 1 Experimental demonstration of retroactivity, adapted from [43]. Upon addition of DOX, rtTa binds to the promoter pTET, expressing SKN7m, which then triggers GFP production in the output module
of its upstream module in the process of receiving information from the latter [17, 54]. This is illustrated in Fig. 1: addition of the load module affects the upstream (input) module, and as a result, the output of the system as well. Perturbations due to retroactivity can have dramatic effects on the upstream module’s behavior [5], for instance, by changing the behavior of a toggle switch [41], one of the most widely used genetic modules with applications ranging from clocks [50] to frequency multipliers [14]. Retroactivity also has profound effects on the robustness of modules [45], so accounting for it is essential for the predictable design of complex systems by combining small modules (e.g., accurate and sensitive biosensors [3]). Here, we provide a step-by-step quantitative framework to accurately predict how protein expressions become coupled as a result of retroactivity. In particular, we demonstrate that the dynamic effects of loading due to interconnections can be fully captured via appropriate retroactivity matrices. To this end, we detail a workflow comprising five major steps:
Step 1: derive mathematical model of modules;
Step 2: compute internal retroactivity of modules;
Step 3: compute external retroactivity of modules;
Step 4: compute scaling and mixing retroactivity of modules;
Step 5: bound the effects of retroactivity.
The results presented here summarize the main results of [25]; interested readers are encouraged to consult the original publication for more details.
2 Materials
The standard mechanistic model of gene transcription networks includes protein production, decay, and reversible binding reactions between transcription factors (TFs) and promoter sites, required for transcriptional regulation. Genetic modules are thus a set of TFs, communicating with each other by having TFs
produced in one module regulate the expression of TFs produced in a different module. After introducing the reactions that govern the behavior of gene networks, we present the two main mathematical tools we later use to quantify retroactivity and its effects.
2.1 Biochemical Reactions
The production of TF $x_i$ is regulated by its parents $p_{i,1}, p_{i,2}, \ldots$ (where $p_{i,j} = x_k$ for some $k$): they bind to the promoter of $x_i$, and form complexes $c_{i,1}, c_{i,2}, \ldots$ with the promoter according to
\[ c_{i,j} + p_{i,l} \underset{\beta_{i,k,j}}{\overset{\alpha_{i,j,k}}{\rightleftharpoons}} c_{i,k}, \tag{1} \]
where $\alpha_{i,j,k}$ and $\beta_{i,k,j}$ are the corresponding association/dissociation rates. Each of these complexes, in turn, produces $x_i$ with a different rate $\pi_{i,j}$ (incorporating features such as the RBS strength and the promoter strength) according to
\[ c_{i,j} \overset{\pi_{i,j}}{\longrightarrow} c_{i,j} + x_i, \tag{2} \]
where we use a one-step production process (see Note 1) encapsulating both transcription and translation [2]. Finally, we consider external induction and decay of $x_i$ modeled by
\[ \emptyset \underset{\delta_i}{\overset{\zeta_i(t)}{\rightleftharpoons}} x_i, \tag{3} \]
where $\delta_i$ denotes the protein decay, whereas $\zeta_i(t)$ represents the production rate that may be due to external inputs or perturbations (inducer, noise or disturbance). Finally, we assume that the total concentration of the promoter, denoted by $\eta_i$, for each transcription component is conserved, so that $\eta_i = \sum_{j=0}^{C_i} c_{i,j}$, where $C_i$ is the number of possible complexes formed with the promoter of $x_i$. The concentration $\eta_i$ is proportional to the copy number of plasmids from which the genes are expressed, which can be easily tuned [35].
2.2 Model Order Reduction via Time-Scale Separation
Consider the dynamics
\[ \begin{aligned} \dot{x} &= f(t, x, z, \varepsilon), & x(t_0) &= \xi(\varepsilon), \\ \varepsilon \dot{z} &= g(t, x, z, \varepsilon), & z(t_0) &= \chi(\varepsilon), \end{aligned} \tag{4} \]
where $\xi(\varepsilon)$ and $\chi(\varepsilon)$ depend smoothly on $\varepsilon$ and $t_0 \in [0, t_1)$, and let $x(t, \varepsilon)$ and $z(t, \varepsilon)$ denote the solution of 4. Furthermore, let $z = h(t, x)$ denote an isolated root of $0 = g(t, x, z, 0)$. In addition to some smoothness properties (see Theorem 11.1 in [32] for technical details), assume that $\dot{x} = f(t, x, h(t, x), 0)$ with $x(t_0) = \xi(0)$ has a unique solution $\bar{x}(t)$, and that the origin is an exponentially stable equilibrium point of
\[ \frac{dy}{d\tau} = g(t, x, y + h(t, x), 0) \]
with $y := z - h(t, x)$ and $\tau := (t - t_0)/\varepsilon$. Then, there exists a positive constant $\varepsilon^*$ such that for all $\chi(0) - h(t_0, \xi(0))$ in a compact subset of the region of attraction of this equilibrium and $0 < \varepsilon < \varepsilon^*$, the dynamics in 4 has a unique solution $x(t, \varepsilon)$, $z(t, \varepsilon)$ on $[t_0, t_1]$, such that $x(t, \varepsilon) - \bar{x}(t) = O(\varepsilon)$. Moreover, for any $t_b > t_0$, there is $\varepsilon^{**} \le \varepsilon^*$ such that $z(t, \varepsilon) - h(t, \bar{x}(t)) = O(\varepsilon)$ holds uniformly for $t \in [t_b, t_1]$ whenever $\varepsilon < \varepsilon^{**}$. For details, see Theorem 11.1 in [32].
2.3 Contraction Theory
A system $\dot{x} = f(x, t)$ is called contracting [40] if there exists a square matrix $\Theta(x, t)$ with the following two properties: (1) $\Theta^T \Theta$ is uniformly positive definite and (2) the symmetric part of the generalized Jacobian
\[ J(x, t) := \left( \dot{\Theta} + \Theta \frac{\partial f}{\partial x} \right) \Theta^{-1} \]
is uniformly negative definite. The absolute value of the largest eigenvalue of the symmetric part of $J$ is called the system’s contraction rate with respect to the metric $\Theta$.
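As a minimal illustration of the contraction property, the following Python sketch considers the scalar production/decay dynamics $\dot{x} = \zeta(t) - \delta x$ with the trivial metric $\Theta = 1$, for which $J = -\delta$ and the contraction rate is $\delta$. The parameter values, the input $\zeta(t)$, and the forward-Euler integrator are assumptions chosen only for illustration; they are not taken from the chapter.

```python
import math

def simulate(x0, delta, t_end, dt=1e-3):
    """Forward-Euler integration of dx/dt = zeta(t) - delta * x (hypothetical input zeta)."""
    x, t = x0, 0.0
    for _ in range(int(round(t_end / dt))):
        zeta = 1.0 + math.sin(t)          # illustrative time-varying production rate
        x += dt * (zeta - delta * x)
        t += dt
    return x

delta = 0.5      # decay rate; with Theta = 1 this is also the contraction rate
t_end = 6.0
gap0 = 3.0       # initial distance between the two trajectories

xa = simulate(0.0, delta, t_end)
xb = simulate(gap0, delta, t_end)

distance = abs(xa - xb)
predicted = gap0 * math.exp(-delta * t_end)   # exponential convergence at rate delta
print(f"distance after t={t_end}: {distance:.5f}, predicted: {predicted:.5f}")
```

Any two trajectories driven by the same input forget their initial conditions at the contraction rate, which is the property used later when bounding the effect of retroactivity.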
3 Methods
To quantify the effect of retroactivity, we next detail a workflow comprising the five major steps outlined in Subheading 1.
3.1 Step 1: Mathematical Model of Modules
When a TF $x_i$ belongs to the module, we call it an internal TF, otherwise it is an external TF. Further, we identify external TFs that are parents to internal TFs as inputs to the module. Consider first a network of $n$ transcription factors and the reactions given in 1–3. Let $x$, $u$, and $c$ denote the concentration vectors of internal TFs, inputs, and TF-promoter complexes, respectively.
1. Introduce the reaction flux vector $v$ containing all the reaction rates in the system such that $v$ is partitioned into $r$ and $r^*$, where $r$ is composed of the fast reactions in 1, whereas $r^*$ contains the slow processes in 2–3:
\[ r^* = \begin{pmatrix} \vdots \\ \zeta_i \\ \delta_i x_i \\ \pi_{i,j} c_{i,j} \\ \vdots \end{pmatrix}, \qquad r = \begin{pmatrix} \vdots \\ \alpha_{i,j,k}\, c_{i,j}\, p_{i,l}^{m_{i,l}} \\ \beta_{i,k,j}\, c_{i,k} \\ \vdots \end{pmatrix}. \tag{5} \]
2. According to [36], write the dynamics of $x$ and $c$ as
\[ \begin{pmatrix} \dot{c} \\ \dot{x} \end{pmatrix} = \underbrace{\begin{bmatrix} 0 & A \\ B^* & B \end{bmatrix}}_{N_{st}} \underbrace{\begin{pmatrix} r^*(x, c) \\ r(x, c, u) \end{pmatrix}}_{v(x, c, u)}, \]
where $N_{st}$ is the stoichiometry matrix (the upper left block matrix is the zero matrix as DNA is not produced/degraded [1]).
3. Once the context of the module is present, represent all the quantities related to the context with an overbar. In this case, the dynamics of the species in the module ($c$ and $x$) and in the context ($\bar{c}$ and $\bar{x}$) can be written as
\[ \begin{pmatrix} \dot{c} \\ \dot{\bar{c}} \\ \dot{x} \\ \dot{\bar{x}} \end{pmatrix} = \begin{bmatrix} 0 & 0 & A & 0 \\ 0 & 0 & 0 & \bar{A} \\ B^* & 0 & B & E \\ 0 & \bar{B}^* & \bar{E} & \bar{B} \end{bmatrix} \begin{pmatrix} r^* \\ \bar{r}^* \\ r \\ \bar{r} \end{pmatrix}. \tag{6} \]
Here, the upper left block matrix is zero as DNA is assumed to be a conserved species; the off-diagonal block matrices in the upper right block matrix are zero since $r$ and $\bar{r}$ encapsulate the binding/unbinding reactions in the module and in its context, respectively; and the off-diagonal block matrices in the lower left block matrix are zero as $r^*$ and $\bar{r}^*$ encapsulate the production/decay reactions in the module and its context, respectively.
4. Introduce $s := E\bar{r}$ to describe the effective rate of change of $x$ due to intermodular binding reactions (presence of context), as the stoichiometry matrix $E$ represents how internal TFs of the module participate in binding/unbinding reactions in the context of the module ($\bar{E}$ can be interpreted similarly).
Milestone 1: With $g(x, c) := B^* r^*(x, c)$ we obtain
\[ \begin{aligned} \dot{c} &= A r(x, c, u), \\ \dot{x} &= g(x, c) + B r(x, c, u), \end{aligned} \tag{7} \]
which we call the isolated dynamics of a module. Conversely,
\[ \begin{aligned} \dot{c} &= A r(x, c, u), \\ \dot{x} &= g(x, c) + B r(x, c, u) + s(\bar{x}, \bar{c}, \bar{u}), \end{aligned} \tag{8} \]
is called the connected dynamics of a module.
Insight from Milestone 1: We refer to $s$ as the retroactivity to the output of the module, encompassing retroactivity applied to the module due to the context of the module. Similarly, we call $r$ the retroactivity to the input of a module, representing retroactivity originating inside the module. The above description has two major drawbacks. First, it involves microscopic parameters that are hard to measure, for instance, association rate constants. As a result, its practical usability is limited. Second, it fails to provide insights into how retroactivity affects a module’s dynamics, and more importantly, how the dynamics and behavior change once the module is interconnected with other modules as part of a larger system (see Note 2).
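The stoichiometric form in Step 1 can be made concrete with a deliberately small, hypothetical module: a single TF $x$ that binds as a monomer to its own promoter, with free promoter $c_0$ and bound complex $c_1$. The Python sketch below, which uses only the standard library, builds $N_{st}$ and the flux vector $v = (r^*, r)$ for this module and evaluates $N_{st} v$. The module, the species ordering, and all numerical values are assumptions made for illustration.

```python
# Species order: (c0, c1, x). Flux order: r* = (zeta, delta*x, pi0*c0, pi1*c1), r = (alpha*c0*x, beta*c1).
N_ST = [
    # zeta  decay  prod0  prod1  bind  unbind
    [0,     0,     0,     0,    -1,    1],   # dc0/dt : upper-left block is zero (DNA conserved), upper-right is A
    [0,     0,     0,     0,     1,   -1],   # dc1/dt
    [1,    -1,     1,     1,    -1,    1],   # dx/dt  : [B* | B]
]

def fluxes(c0, c1, x, p):
    """Reaction flux vector v = (r*, r) for the hypothetical self-binding TF."""
    r_star = [p["zeta"], p["delta"] * x, p["pi0"] * c0, p["pi1"] * c1]   # slow production/decay
    r = [p["alpha"] * c0 * x, p["beta"] * c1]                            # fast binding/unbinding
    return r_star + r

def rhs(c0, c1, x, p):
    """Isolated dynamics (Eq. 7) evaluated as N_st * v."""
    v = fluxes(c0, c1, x, p)
    return [sum(n * vi for n, vi in zip(row, v)) for row in N_ST]

params = {"zeta": 1.0, "delta": 1.0, "pi0": 0.0, "pi1": 0.5, "alpha": 1000.0, "beta": 1000.0}
dc0, dc1, dx = rhs(4.0, 1.0, 0.3, params)
print(dc0, dc1, dx)
```

Because the promoter rows of $N_{st}$ only involve binding and unbinding, $\dot{c}_0 + \dot{c}_1 = 0$, which is the conservation of total promoter concentration $\eta_i$ assumed in Subheading 2.1.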
3.2 Step 2: Internal Retroactivity
Here we derive the reduced order model of the isolated dynamics of a module when the module has no inputs.
1. The binary matrix $V_i$ has as many columns as the number of TFs in the module, and as many rows as the number of parents of $x_i$, such that its $(j, k)$ element is 1 if the $j$th parent of $x_i$ is $x_k$, otherwise the entry is zero. That is, an entry in the following matrix
\[ V_i = \begin{array}{c|ccc} & x_1 & x_2 & \cdots \\ \hline p_{i,1} & & & \\ p_{i,2} & & & \\ \vdots & & & \end{array} \]
is 1 if the species indexing the corresponding row and column are the same, otherwise the entry is zero, yielding $p_i = V_i x$. Furthermore, let $\Phi$ denote the set of TFs having parents from inside the module.
2. The matrix $\Psi_i$ has as many columns as the number of complexes formed with the promoter of $x_i$, and as many rows as the number of parents of $x_i$. That is, the $(j, k)$ element in the following matrix
\[ \Psi_i = \begin{array}{c|ccc} & c_{i,1} & c_{i,2} & \cdots \\ \hline p_{i,1} & & & \\ p_{i,2} & & & \\ \vdots & & & \end{array} \]
is $m$ if the $j$th parent of $x_i$ is bound as an $m$-multimer in $c_{i,k}$ ($m = 0$ if the $j$th parent is not bound).
3. Since $A$ in 7 has a block diagonal structure [36] with blocks $A_i$, we can write $\dot{c}_i = A_i r_i(p_i, c_i)$ where $r_i(p_i, c_i)$ denotes the reaction flux vector corresponding to reversible binding reactions with the promoter of $x_i$. Let $c_i = \gamma_i(p_i)$ denote the vector of concentrations of complexes with the promoter of $x_i$ at the quasi-steady state, obtained by setting $0 = A_i r_i(p_i, c_i)$, and similarly, let $c = \gamma(x)$ be the locally unique solution of $0 = A r(x, c)$ from 7.
4. Let $\gamma_{i,j}(p_i)$ denote the $j$th entry in $\gamma_i(p_i)$ and $C_i$ the number of complexes with the promoter of $x_i$. Define
\[ H_i(p_i) = \sum_{j=0}^{C_i} \pi_{i,j}\, \gamma_{i,j}(p_i) \tag{9} \]
(see Note 3) and introduce
\[ h(x) = \begin{pmatrix} \zeta_1 + H_1(p_1) - \delta_1 x_1 \\ \zeta_2 + H_2(p_2) - \delta_2 x_2 \\ \vdots \\ \zeta_N + H_N(p_N) - \delta_N x_N \end{pmatrix}. \tag{10} \]
5. Define the retroactivity $R_i(p_i)$ of TF $x_i \in \Phi$ (see Note 3) as
\[ R_i(p_i) = \Psi_i \frac{d\gamma_i(p_i)}{dp_i}. \tag{11} \]
6. Introduce the internal retroactivity of a module as
\[ R(x) = \sum_{\{ i \,\mid\, x_i \in \Phi \}} V_i^T R_i(p_i) V_i. \tag{12} \]
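The assembly in Eq. 12 can be sketched in a few lines of Python. The example below uses a two-TF module in which $x_1$ has parents $x_1$ and $x_2$ (so $V_1 = I$) and $x_2$ has the single parent $x_1$ (so $V_2 = [1 \; 0]$), mirroring the oscillator discussed later in the chapter. The numerical entries of $R_1$ and $R_2$ are placeholders, and the small matrix helpers are written out by hand to keep the sketch dependency-free.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def matadd(a, b):
    return [[a[i][j] + b[i][j] for j in range(len(a[0]))] for i in range(len(a))]

def internal_retroactivity(terms, n_tfs):
    """Eq. 12: R = sum_i V_i^T R_i V_i over TFs with parents inside the module."""
    total = [[0.0] * n_tfs for _ in range(n_tfs)]
    for V_i, R_i in terms:
        total = matadd(total, matmul(matmul(transpose(V_i), R_i), V_i))
    return total

# Placeholder node retroactivities evaluated at some operating point (illustrative values).
a, b, c, d, e = 0.4, 1.0, 0.2, 0.3, 2.0
V1, R1 = [[1, 0], [0, 1]], [[b, c], [d, e]]   # x1 has parents x1 and x2
V2, R2 = [[1, 0]], [[a]]                      # x2 has the single parent x1

R = internal_retroactivity([(V1, R1), (V2, R2)], n_tfs=2)
print(R)
```

The scalar retroactivity $a$ of the second node is mapped by $V_2$ onto the (1, 1) entry of $R$, because the loading it describes is applied to $x_1$.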
Milestone 2: Let $(c(t), x(t))$ be the solution of the isolated module dynamics 7 with initial condition $(c_0, x_0)$. The solution $\hat{x}(t)$ of
\[ \dot{x} = [I + R(x)]^{-1} h(x) \tag{13} \]
with initial condition $\hat{x}(0) = \hat{x}_0$ well approximates $x(t)$ when $\hat{x}_0 + x_B(\gamma(\hat{x}_0)) = x_0 + x_B(c_0)$, where
\[ x_B(c) = \sum_{\{ i \,\mid\, x_i \in \Phi \}} V_i^T \Psi_i c_i. \tag{14} \]
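Milestone 2 can be checked numerically on a toy case. The sketch below simulates a hypothetical single-TF module in which $x$ binds its own promoter as a monomer, and compares three models: the full mechanistic model (Eq. 7), the reduced model with internal retroactivity (Eq. 13), and the Hill-type model obtained by setting $R = 0$. All parameter values, the forward-Euler scheme, and the step size are assumptions for illustration; binding is made about three orders of magnitude faster than decay so that time-scale separation holds.

```python
def simulate(mode, p, dt=1e-4, t_end=5.0, sample_every=1000):
    """Euler simulation; mode is 'full', 'reduced' (Eq. 13) or 'hill' (R = 0)."""
    eta, k = p["eta"], p["beta"] / p["alpha"]   # k: dissociation constant
    x, c1 = 0.0, 0.0                            # c1(0) = gamma(x0) = 0, so all models share x(0)
    samples = []
    for n in range(int(round(t_end / dt))):
        if mode == "full":
            bind = p["alpha"] * (eta - c1) * x - p["beta"] * c1      # net binding flux
            prod = p["zeta"] + p["pi0"] * (eta - c1) + p["pi1"] * c1
            x, c1 = x + dt * (prod - p["delta"] * x - bind), c1 + dt * bind
        else:
            h = p["zeta"] + eta * (p["pi0"] + p["pi1"] * x / k) / (1 + x / k) - p["delta"] * x
            R = (eta / k) / (1 + x / k) ** 2 if mode == "reduced" else 0.0
            x += dt * h / (1 + R)
        if (n + 1) % sample_every == 0:
            samples.append(x)
    return samples

params = {"zeta": 1.0, "delta": 1.0, "pi0": 0.0, "pi1": 0.5,
          "eta": 5.0, "alpha": 1000.0, "beta": 1000.0}

full = simulate("full", params)
reduced = simulate("reduced", params)
hill = simulate("hill", params)

err_reduced = max(abs(f - r) for f, r in zip(full, reduced))
err_hill = max(abs(f - r) for f, r in zip(full, hill))
print(f"max error, reduced model with R: {err_reduced:.4f}")
print(f"max error, Hill model (R = 0):   {err_hill:.4f}")
```

With a high promoter concentration relative to the dissociation constant, the model that ignores $R$ rises far too quickly, whereas the reduced model tracks the full one closely. A stiff ODE solver would be the more usual choice than fixed-step Euler for the full model; Euler is used here only to avoid external dependencies.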
Insight from Milestone 2: The reduced order dynamics in 13 reveal how internal retroactivity $R$ of the module affects its dynamics (see Note 4). When $R = 0$, we have $\dot{x} = h(x)$, the commonly used Hill function-based model for gene transcription networks [2]. Moreover, 13 describes how changes in the total concentration of TFs $h(x)$ relate to changes $\dot{x}$ in the concentration of free TFs. Specifically, to change the concentration of free TFs by one unit, the module has to change the total concentration of TFs by $(I + R)$ units, as $R$ units are “spent on” changing the concentration of bound TFs. Having $R = 0$ implies that the module’s effort on affecting the total concentration of TFs is entirely spent on changing the concentration of free TFs. By contrast, $\|R\| \to \infty$ implies that no matter how much the total concentration of TFs changes, it is not possible to achieve any changes in the free concentration of some of the TFs. Therefore, the internal retroactivity $R$ describes how “stiff” the module is against changes in $x$ due to loading applied by internal connections. The retroactivity $R_i(p_i)$ of each TF can be interpreted similarly.
3.3 Step 3: External Retroactivity
Here, we extend the reduced order model in 13 to the case in which the module has external TFs as inputs.
1. Let $u = (u_1, u_2, \ldots, u_W)^T$ denote the concentration vector of TFs external to the module, and define $\Omega$ as the set of TFs having parents from outside the module (external TFs).
2. The binary matrix $D_i$ has as many columns as the number of inputs of the module, and as many rows as the number of parents of $x_i$, such that its $(j, k)$ element is 1 if the $j$th parent of $x_i$ is $u_k$, otherwise the entry is zero. That is, an entry in the following matrix
\[ D_i = \begin{array}{c|ccc} & u_1 & u_2 & \cdots \\ \hline p_{i,1} & & & \\ p_{i,2} & & & \\ \vdots & & & \end{array} \]
is 1 if the species indexing the corresponding row and column are the same, otherwise the entry is zero, yielding $p_i = [\, V_i \;\; D_i \,]\,(x^T \;\; u^T)^T$. Note that in the presence of input $u$, both $h(\cdot)$ and $R(\cdot)$ given in 10 and 12, respectively, depend on $x$ and $u$, as some of the parents of internal TFs are external TFs. Similarly, we now have $r(x, c, u)$ instead of $r(x, c)$.
3. Accordingly, the internal retroactivity in 12 becomes
\[ R(x, u) = \sum_{\{ i \,\mid\, x_i \in \Phi \}} V_i^T R_i(p_i) V_i. \tag{15} \]
4. Define the external retroactivity as
\[ Q(x, u) = \sum_{\{ i \,\mid\, x_i \in \Phi \cap \Omega \}} V_i^T R_i(p_i) D_i. \tag{16} \]
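To see what $Q$ does, consider a hypothetical promoter to which an internal TF $x$ and an external TF $u$ bind competitively as monomers, so that $p_i = (x, u)$, $V_i = (1 \; 0)^T$ and $D_i = (0 \; 1)^T$. Then $R$ and $Q$ in Eqs. 15 and 16 are the (1, 1) and (1, 2) entries of the competitive-binding retroactivity. The Python sketch below assumes $h \equiv 0$ (no net production or decay of $x$) and a linear ramp in $u$, and integrates the reduced dynamics of Milestone 3, which then read $\dot{x} = -(1 + R)^{-1} Q \dot{u}$. All parameter values are illustrative assumptions.

```python
def bound_x(x, u, eta, kx, ku):
    """Quasi-steady-state concentration of promoter bound by x (competitive, monomers)."""
    return eta * (x / kx) / (1 + x / kx + u / ku)

def retroactivities(x, u, eta, kx, ku):
    """R = d(bound_x)/dx and Q = d(bound_x)/du for the competitive promoter."""
    D = 1 + x / kx + u / ku
    R = eta * (1 / kx) * (1 + u / ku) / D ** 2
    Q = -eta * x / (kx * ku) / D ** 2        # negative: u displaces x from the promoter
    return R, Q

eta, kx, ku = 2.0, 1.0, 1.0
x, u = 1.0, 0.0
x_start = x
u_dot, dt = 1.0, 1e-3                        # linear ramp in the external TF

total_start = x + bound_x(x, u, eta, kx, ku)
for _ in range(5000):                        # integrate up to t = 5
    R, Q = retroactivities(x, u, eta, kx, ku)
    x += dt * (-(Q * u_dot) / (1 + R))       # reduced dynamics with h = 0
    u += dt * u_dot
total_end = x + bound_x(x, u, eta, kx, ku)

print(f"free x: {x_start:.3f} -> {x:.3f}; total x: {total_start:.3f} -> {total_end:.3f}")
```

The free concentration of $x$ rises as $u$ accumulates even though the total (free plus bound) concentration of $x$ stays essentially constant, which is the effect attributed to external retroactivity.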
Milestone 3: Let $(c(t), x(t))$ be the solution of the isolated module dynamics 7 with initial condition $(c_0, x_0)$ and with smooth input $u(t)$. The solution $\hat{x}(t)$ of
\[ \dot{x} = [I + R(x, u)]^{-1} \left[ h(x, u) - Q(x, u)\dot{u} \right] =: f(x, u, \dot{u}) \tag{17} \]
with initial condition $\hat{x}(0) = \hat{x}_0$ well approximates $x(t)$ when $\hat{x}_0 + x_B(\gamma(\hat{x}_0, u(0))) = x_0 + x_B(c_0)$ with $x_B(\cdot)$ defined in 14.
Insight from Milestone 3: The reduced order dynamics in 17 reveal the role that the external retroactivity $Q$ plays. Recall that $h(x, u) = 0$ implies that the total concentrations of internal TFs are constant from 10. In this case, 17 reduces to $\dot{x} = -(I + R)^{-1} Q \dot{u}$, where $x$ is the concentration vector of free internal TFs. This means that the concentrations of free internal TFs can still be changed subsequent to changes in the external TFs (input), despite the fact that the total concentration (free and bound) of internal TFs remains unaffected. Therefore, $Q$ captures the phenomenon by which external TFs force internal TFs to bind/unbind, for instance, by competing for the same binding sites.
3.4 Step 4: Scaling and Mixing Retroactivity
Next, consider the interconnection of the module together with its context.
1. The binary matrix $U$ has as many rows as the number of inputs of the module, and as many columns as the number of TFs in the context, such that its $(j, k)$ element is 1 if the $j$th input of the module is the $k$th internal TF of the context ($u_j = \bar{x}_k$), otherwise the entry is zero. That is, an entry in the following matrix
\[ U = \begin{array}{c|ccc} & \bar{x}_1 & \bar{x}_2 & \cdots \\ \hline u_1 & & & \\ u_2 & & & \\ \vdots & & & \end{array} \]
is 1 if the species indexing the corresponding row and column are the same, otherwise the entry is zero, yielding $u = U\bar{x}$. Define $\bar{U}$ similarly for the context, yielding $\bar{u} = \bar{U}x$.
2. Define the scaling retroactivity of the module as
\[ S(x, \bar{x}) = \sum_{\{ i \,\mid\, x_i \in \Omega \}} [D_i U]^T R_i(p_i) D_i U. \tag{18} \]
3. Define the mixing retroactivity of the module as
\[ M(x, \bar{x}) = \sum_{\{ i \,\mid\, x_i \in (\Phi \cap \Omega) \}} [D_i U]^T R_i(p_i) V_i. \tag{19} \]
4. Define the scaling retroactivity of the context as
\[ \bar{S}(x, \bar{x}) = \sum_{\{ i \,\mid\, \bar{x}_i \in \bar{\Omega} \}} [\bar{D}_i \bar{U}]^T \bar{R}_i(\bar{p}_i) \bar{D}_i \bar{U}. \tag{20} \]
5. Define the mixing retroactivity of the context as
\[ \bar{M}(x, \bar{x}) = \sum_{\{ i \,\mid\, \bar{x}_i \in (\bar{\Phi} \cap \bar{\Omega}) \}} [\bar{D}_i \bar{U}]^T \bar{R}_i(\bar{p}_i) \bar{V}_i. \tag{21} \]
6. Introduce $x' := (x^T \;\; \bar{x}^T)^T$ and $c' := (c^T \;\; \bar{c}^T)^T$ together with
\[ r'(x', c') = \begin{pmatrix} r(x, c, U\bar{x}) \\ \bar{r}(\bar{x}, \bar{c}, \bar{U}x) \end{pmatrix}, \qquad A' = \begin{bmatrix} A & 0 \\ 0 & \bar{A} \end{bmatrix}, \tag{22} \]
and let $c' = \gamma'(x')$ be an isolated root of $0 = A' r'(x', c')$.
Milestone 4: Let $(c'(t), x'(t))$ be the solution of the dynamics 6 with initial condition $(c'_0, x'_0)$. The solution $\hat{x}'(t)$ of
\[ \begin{pmatrix} \dot{x} \\ \dot{\bar{x}} \end{pmatrix} = \begin{bmatrix} I + (I + R)^{-1}\bar{S} & (I + R)^{-1}\bar{M} \\ (I + \bar{R})^{-1} M & I + (I + \bar{R})^{-1} S \end{bmatrix}^{-1} \underbrace{\begin{pmatrix} f(x, U\bar{x}, U\dot{\bar{x}}) \\ \bar{f}(\bar{x}, \bar{U}x, \bar{U}\dot{x}) \end{pmatrix}}_{\text{isolated dynamics}} \tag{23} \]
of the module and of its context with initial condition $\hat{x}'(0) = \hat{x}'_0$ well approximates $x'(t)$ when $\hat{x}'_0 + x'_B(\gamma'(\hat{x}'_0)) = x'_0 + x'_B(c'_0)$, where
\[ x'_B(c') = \begin{pmatrix} \sum_{\{ i \,\mid\, x_i \in \Phi \}} V_i^T \Psi_i c_i \\ \sum_{\{ i \,\mid\, \bar{x}_i \in \bar{\Phi} \}} \bar{V}_i^T \bar{\Psi}_i \bar{c}_i \end{pmatrix}. \]
Insight from Milestone 4: The reduced order dynamics in 23 describe how the dynamics of the module and that of the context change upon interconnection, as they relate the connected dynamics to the isolated dynamics through the internal, scaling, and mixing retroactivity matrices. First, zero matrices $S$, $M$, $\bar{S}$, and $\bar{M}$ lead to no alteration in the dynamics upon interconnection. When $\bar{M} = 0$, the dynamics of the module after interconnection become
\[ \dot{x} = \left[ I + (I + R)^{-1}\bar{S} \right]^{-1} \underbrace{f(x, U\bar{x}, U\dot{\bar{x}})}_{\text{isolated dynamics of the module}}, \tag{24} \]
that is, $\bar{S}$ determines how the isolated dynamics of the module get “scaled” upon interconnection. Complementing this effect, the dynamics of the context enter into the module’s dynamics through the mixing retroactivity $\bar{M}$ of the context, referring to the “mixing” of the dynamics of the module and that of its context. When $\bar{M} \neq 0$, a perturbation applied in the context can result in a response in the upstream module, even without TFs in the context regulating TFs in the module, leading to a counter-intuitive transmission of signals from downstream (context) to upstream (module).
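The scaling effect in Eq. 24 can be illustrated with a scalar sketch. Assume, hypothetically, an upstream TF $x$ with isolated dynamics $\dot{x} = \zeta - \delta x$ and no internal binding ($R = 0$), driving several identical downstream promoters in the context, each binding $x$ as a monomer. The scaling retroactivity of the context is then the sum of the downstream node retroactivities, and Eq. 24 reduces to $\dot{x} = (\zeta - \delta x)/(1 + \bar{S}(x))$. The Python code below compares the time needed to reach half of the steady state for different numbers of loads; all parameter values are assumptions.

```python
import math

def half_rise_time(n_loads, zeta=1.0, delta=1.0, eta_bar=1.0, k_bar=1.0, dt=1e-3):
    """Time for x to reach half of its steady state zeta/delta under Eq. 24 (scalar case, R = 0)."""
    x, t = 0.0, 0.0
    target = 0.5 * zeta / delta
    while x < target:
        # Scaling retroactivity of the context: n_loads identical monomer-binding promoters.
        S_bar = n_loads * (eta_bar / k_bar) / (1 + x / k_bar) ** 2
        x += dt * (zeta - delta * x) / (1 + S_bar)
        t += dt
    return t

t_isolated = half_rise_time(0)
t_two_loads = half_rise_time(2)
t_five_loads = half_rise_time(5)

print(f"isolated: {t_isolated:.3f} (ln 2 = {math.log(2):.3f})")
print(f"2 loads:  {t_two_loads:.3f}")
print(f"5 loads:  {t_five_loads:.3f}")
```

The steady state is unchanged by the load, but the response slows down as more downstream binding sites are added, because part of every newly produced TF molecule is sequestered by the context.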
3.5 Step 5: Error Due to Retroactivity
Here, we provide three distinct ways to quantify the disturbance on the module dynamics due to retroactivity from its context when parameter values are known (see Note 5). For simplicity, we focus on the case when $\bar{M} = 0$. Let
\[ \dot{x} = f(x, u, \dot{u}) \tag{25} \]
denote the dynamics of the module in isolation from 17. Once the module is connected to its context, its dynamics change according to
\[ \dot{x} = \left[ I + (I + R)^{-1}\bar{S} \right]^{-1} f(x, u, \dot{u}) =: \tilde{f}(x, u, \dot{u}) \tag{26} \]
from 24. Let $x(t)$ and $\tilde{x}(t)$ denote the solutions of 25 and 26, respectively, with identical initial conditions.
1. Introduce
\[ \mu(x, u) := \left\| \left[ I + (I + R)^{-1}\bar{S} \right]^{-1} - I \right\|_2. \tag{27} \]
2. If they exist, define $\hat{l}$, $\hat{f}$, and $\hat{\mu}$ such that (i) $f(x, u, \dot{u})$ has Lipschitz constant $\hat{l}$, (ii) $\| f(x, u, \dot{u}) \|_2 \le \hat{f}$, and (iii) $\mu(x, u) \le \hat{\mu}$.
3. Let $\sigma_{\min}(I + R)$ denote the smallest singular value of $(I + R)$, and similarly, let $\sigma_{\max}(\bar{S})$ stand for the greatest singular value of $\bar{S}$, and define
\[ \hat{\mu} = \max_{x, \bar{x}} \frac{\sigma_{\max}(\bar{S})}{\sigma_{\min}(I + R) - \sigma_{\max}(\bar{S})} \]
provided that $\sigma_{\max}(\bar{S}) < \sigma_{\min}(I + R)$.
4. If the system 25 is contracting [40] with rate $\lambda > 0$ and metric transformation $\Theta(x, t)$, then denote by $\kappa(x, t)$ the condition number of $\Theta(x, t)$, and let $\hat{\kappa} > 0$ be such that $\hat{\kappa} \ge \kappa(x, t)$.
Milestone 5: The change in dynamics of a module due to retroactivity from its context is bounded according to
\[ \frac{\| f(x, u, \dot{u}) - \tilde{f}(x, u, \dot{u}) \|_2}{\| f(x, u, \dot{u}) \|_2} \le \mu(x, u). \tag{28} \]
Similarly, the difference between trajectories of 25 and 26 is bounded as
\[ \| x(t) - \tilde{x}(t) \|_2 \le \frac{\hat{\mu} \hat{f}}{\hat{l}} \left[ e^{\hat{l} t} - 1 \right], \]
and also by
\[ \| x(t) - \tilde{x}(t) \|_2 \le \frac{\hat{\mu} \hat{f} \hat{\kappa}}{\lambda}. \]
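The bounds of Step 5 can be evaluated by hand in the scalar case. The Python sketch below assumes a module with isolated dynamics $\dot{x} = \zeta - \delta x$ starting from $x(0) = 0$, a constant internal retroactivity $R$, and a constant scalar scaling retroactivity $\bar{S}$ of the context; holding $\bar{S}$ constant is a simplification made only for this example. For this system $\Theta = 1$, so $\hat{\kappa} = 1$ and $\lambda = \delta$, while $\hat{f} = \zeta$ bounds $|f|$ on $0 \le x \le \zeta/\delta$.

```python
zeta, delta = 1.0, 1.0
R, S_bar = 0.0, 0.5                                  # illustrative constant retroactivities

# Eq. 27 in the scalar case, and its singular-value upper bound from item 3.
mu = abs(1.0 / (1.0 + S_bar / (1.0 + R)) - 1.0)      # equals S_bar / (1 + R + S_bar)
mu_hat = S_bar / ((1.0 + R) - S_bar)                 # valid because S_bar < 1 + R

f_hat = zeta            # |f| <= zeta on the invariant interval [0, zeta/delta]
lam = delta             # contraction rate with Theta = 1
kappa_hat = 1.0         # condition number of Theta = 1
bound = mu_hat * f_hat * kappa_hat / lam             # contraction-based trajectory bound

x, x_tilde, dt, max_gap = 0.0, 0.0, 1e-3, 0.0
for _ in range(10000):                               # simulate up to t = 10
    x += dt * (zeta - delta * x)                                          # isolated, Eq. 25
    x_tilde += dt * (zeta - delta * x_tilde) / (1 + S_bar / (1 + R))      # connected, Eq. 26
    max_gap = max(max_gap, abs(x - x_tilde))

print(f"mu = {mu:.3f}, mu_hat = {mu_hat:.3f}")
print(f"largest trajectory gap = {max_gap:.3f}, bound = {bound:.3f}")
```

In this example the bound is conservative: the observed gap between isolated and connected trajectories stays well below $\hat{\mu}\hat{f}\hat{\kappa}/\lambda$. Increasing $R$ or decreasing $\bar{S}$ lowers both $\mu$ and $\hat{\mu}$.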
Insight from Milestone 5: The above results suggest that the module becomes more robust to interconnection as $\hat{\mu}$ decreases, for instance, by increasing $\min_{x, \bar{x}} \sigma_{\min}(I + R)$ or by decreasing $\max_{x, \bar{x}} \sigma_{\max}(\bar{S})$. Such a metric can be used not only in the design of gene transcription networks (low values of $\hat{\mu}$ lead to modules that behave almost the same when connected or isolated), but also during their analysis, for instance, by enhancing existing partitioning methods based on other measures (e.g., edge betweenness [21], its extension to directed graphs with nonuniform weights [67], round trip distance [61] or retroactivity [55]) with respect to robustness to interconnection. The bounds on the difference in dynamics and trajectories upon interconnection can be used to specify the fan-out of a module [37]: the amount of “load” a module can tolerate while satisfying certain design specifications, such as switching time in the case of a toggle, or period and amplitude in the case of an oscillator.
3.6 Illustrating the Effects of Intermodular Connections
To illustrate both the steps detailed above and the effect of intermodular connections on the dynamics of interconnected modules, we consider first a natural recurring network motif, then a commonly used synthetic genetic module.
Example 1: Single-Input Motif: The single-input motif in Fig. 2a is a recurrent motif in gene transcription networks [31, 57]. Here, we show that the dynamic performance (speed) of the module and its robustness to interconnection with its context are not independent, and that this trade-off can be analyzed by focusing on the interplay between the internal retroactivity $R$ of the module and the scaling retroactivity $\bar{S}$ of the context. Let $\dot{x}_1 = f(x_1)$ denote the isolated dynamics of the module from 7. Furthermore, we have $\bar{D}_i = 1$ for $i = 1, 2, \ldots, l$ and $\bar{U} = 1$, so that $\bar{S}(x_1) = \sum_{i=1}^{l} \bar{R}_i(x_1)$ by 20, where $\bar{R}_i(x_1)$ is the retroactivity of TF $\bar{x}_i$ in the context. According to 24, the dynamics of the module upon interconnection modify to
\[ \dot{x}_1 = \underbrace{\frac{1 + R(x_1)}{1 + R(x_1) + \bar{S}(x_1)}}_{\text{effect of the context}} f(x_1) = \underbrace{[1 - \mu(x_1)]}_{\text{effect of the context}} f(x_1), \]
where $\mu(x_1) = \bar{S}(x_1) / [1 + R(x_1) + \bar{S}(x_1)]$. The smaller $\mu(x_1)$, the more robust the module to interconnection. From a design perspective, if speed is a priority, one should choose a strong RBS with a low-copy number plasmid, or alternatively, a promoter with high dissociation constant $k_1$. By contrast, if robustness to interconnection is central, a weak RBS with a high-copy number plasmid (or with low $k_1$) is a better choice. If both speed and robustness to interconnection are desired, other design approaches may be
Fig. 2 (a) Single input motif. (b) The response time increases with the load. (c) High internal retroactivity counteracts the effect of loading
required, such as the incorporation of insulator devices, as proposed in other works [27].
Example 2: Oscillator: The common clock design in Fig. 3a is based on two TFs, one of which is an activator and the other is a repressor [5, 15, 62]. Here, we illustrate that while internal retroactivity acts against sustained oscillations (Fig. 3c), scaling retroactivity of the context promotes them. To see this, note that $V_1 = I$ and $V_2 = [\, 1 \;\; 0 \,]$, whereas $h(x)$ and $R(x)$ can be constructed by considering $R_1(x_1, x_2)$, $R_2(x_1)$, $H_1(x_1, x_2)$, and $H_2(x_1)$, respectively, in Tables 1 and 2. With this, we write $R_2 = a$ and
\[ R_1 = \begin{bmatrix} b & c \\ d & e \end{bmatrix}. \]
Then, we obtain that 13 takes the form
\[ \begin{pmatrix} \dot{x}_1 \\ \dot{x}_2 \end{pmatrix} = \underbrace{\begin{bmatrix} \dfrac{1 + e}{(1 + a + b)(1 + e) - cd} & \dfrac{-c}{(1 + a + b)(1 + e) - cd} \\[2ex] \dfrac{-d}{(1 + a + b)(1 + e) - cd} & \dfrac{1 + a + b}{(1 + a + b)(1 + e) - cd} \end{bmatrix}}_{[I + R(x)]^{-1}} \underbrace{\begin{pmatrix} H_1(x_1, x_2) - \delta_1 x_1 \\ H_2(x_1) - \delta_2 x_2 \end{pmatrix}}_{h(x)}. \]
Therefore, the activator and repressor dynamics are slowed down asymmetrically (diagonal terms in $[I + R(x)]^{-1}$) due to internal retroactivity. In particular, in the case when $c, d \ll 1 + e \ll 1 + a + b$, the activator slows down compared to the repressor, quenching the oscillations (Fig. 3d) [16]. To restore sustained oscillations, we have to render the repressor dynamics slower with respect to the activator dynamics by adding extra loading for the repressor (Fig. 3a, right panel) [28]. In this case, we have $R_3(x_2) > 0$ given in Table 1, which, due to 13, will yield the following change
Fig. 3 (a) AR-clock. (b) AR-clock with load. (c) Neglecting retroactivity, the isolated AR-clock displays sustained oscillations. (d) When internal retroactivity is accounted for, oscillations are quenched. (e) Oscillations can be restored by loading the repressor, thus increasing the scaling retroactivity
in the above dynamics: instead of e, we will have e + R3 > e, rendering the dynamics of the repressor slower with respect to the activator dynamics, restoring oscillations (Fig. 3e).
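The asymmetric slow-down in Example 2, and its correction by loading the repressor, can be checked with the closed-form inverse of the 2 x 2 matrix $I + R$. In the Python sketch below, the values of $a$, $b$, $c$, $d$, $e$ and of the extra load are placeholders chosen so that $c, d \ll 1 + e \ll 1 + a + b$; in practice they would be evaluated from Table 1 along the trajectory.

```python
def inverse_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def speed_factors(a, b, c, d, e, load=0.0):
    """Diagonal entries of [I + R]^{-1}: scaling of activator and repressor dynamics."""
    i_plus_r = [[1 + a + b, c], [d, 1 + e + load]]   # the load R3 is added to e
    inv = inverse_2x2(i_plus_r)
    return inv[0][0], inv[1][1]

a, b, c, d, e = 5.0, 5.0, 0.1, 0.1, 1.0

act, rep = speed_factors(a, b, c, d, e)
act_loaded, rep_loaded = speed_factors(a, b, c, d, e, load=9.0)

ratio_unloaded = act / rep                 # activator speed relative to repressor
ratio_loaded = act_loaded / rep_loaded
print(f"activator/repressor speed ratio without load: {ratio_unloaded:.3f}")
print(f"activator/repressor speed ratio with load:    {ratio_loaded:.3f}")
```

Without the load the activator is scaled down much more strongly than the repressor; adding $R_3$ to $e$ slows the repressor and brings the two time scales back toward balance, consistent with the restoration of oscillations described in the text.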
4
Notes 1. We treat gene expression as a one-step process, neglecting mRNA dynamics. This assumption is based on the fact that mRNA dynamics occur on a time scale much faster than protein production/decay [2]. Additionally, including mRNA dynamics is not relevant for the study of retroactivity, and would yield only minor changes in our results (see [25] for details). 2. While the most widely used modeling approach employing Hill functions conceals the effects of retroactivity, the framework presented here reveals and quantifies these effects. Furthermore, this framework only involves measurable macroscopic parameters. 3. For the most common binding types, we provide the expressions of Ri( pi) and Hi( pi) in Tables 1 and 2. In particular, if node xi has no parents, we have that Hi ¼ π i,0ηi and its node retroactivity is not defined. In the single parent case, node xi has one parent, y binding as an n-multimer with dissociation
Table 1 Retroactivity Ri of a node for the most common binding types Binding type
Ri
Single parent
Independent
Competitive
Cooperative
ηi y 1þky
2
n2 y n1 ky
2
ηi n2 y n1 2 k 6 y 6 1þ y 6 ky 6 6 6 0 4 2
3 0
7 7 7 7 7 2 m1 ηi m z 7 2 k 5 z z 1 þ kz
n2 y n1 kz þ z m 6 k kz y 6 ηi 2 6 n1 m 4 y zn ny mz 1þky þ kz ky kz 2 n2 y n1 kz þ z m 6 k kz y 6 ηi 2 6 n1 4 ny y n mz m 1þky þzkz ky kz
3 ny n mz m1 7 ky kz 7 7 2 m1 k þ y n 5 m z y kz ky
3 ny n mz m1 7 ky kz 7 n7 n 2 m1 ky þ y 5 y m z ky kz ky
308
Andras Gyorgy
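Table 2 Hill function $H_i$ for the most common binding types

Binding type: Single parent
$$H_i = \eta_i \frac{\pi_{i,0}+\pi_{i,1}\frac{y^n}{k_y}}{1+\frac{y^n}{k_y}}$$

Binding type: Independent
$$H_i = \eta_i \frac{\pi_{i,0}+\pi_{i,1}\frac{y^n}{k_y}+\pi_{i,2}\frac{z^m}{k_z}+\pi_{i,3}\frac{y^n}{k_y}\frac{z^m}{k_z}}{1+\frac{y^n}{k_y}+\frac{z^m}{k_z}+\frac{y^n}{k_y}\frac{z^m}{k_z}}$$

Binding type: Competitive
$$H_i = \eta_i \frac{\pi_{i,0}+\pi_{i,1}\frac{y^n}{k_y}+\pi_{i,2}\frac{z^m}{k_z}}{1+\frac{y^n}{k_y}+\frac{z^m}{k_z}}$$

Binding type: Cooperative
$$H_i = \eta_i \frac{\pi_{i,0}+\pi_{i,1}\frac{y^n}{k_y}+\pi_{i,3}\frac{y^n}{k_y}\frac{z^m}{k_z}}{1+\frac{y^n}{k_y}+\frac{y^n}{k_y}\frac{z^m}{k_z}}$$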
constant $k_y$. In the case of independent, competitive, and cooperative binding, node $x_i$ has two parents, $y$ and $z$, binding as multimers with multimerization factors $n$ and $m$, and with dissociation constants $k_y$ and $k_z$, respectively. The total concentration of the promoter of $x_i$ is denoted by $\eta_i$. The production rates $\pi_{i,0}$, $\pi_{i,1}$, $\pi_{i,2}$, and $\pi_{i,3}$ correspond to the promoter complexes without parents, with $y$ only, with $z$ only, and with both $y$ and $z$, respectively.
4. The main technical assumptions are that (a) there is a separation of time scale between production/degradation of proteins and the reversible binding reactions between TFs and DNA, and that (b) the corresponding quasi-steady state is locally exponentially stable. Assumption (a) is justified by the fact that gene expression is on the time scale of minutes to hours while binding reactions are on the time scale of subsecond to second [2]. Assumption (b) is implicitly made any time Hill function-based models are used in gene regulatory networks.
5. Since cellular systems are highly stochastic and experience disturbances from many sources, parameter values are uncertain. To handle their effects on the behavior of interconnected components, one can use dissipativity analysis and SOSTOOLS [49], or study the robustness of low-copy and high-copy genetic circuits to noise [39].

References

1. Akerlund T, Nordstrom K, Bernander R (1995) Analysis of cell size and DNA content in exponentially growing and stationary-phase batch cultures of Escherichia coli. J Bacteriol 177:6791–6797
2. Alon U (2007) Network motifs: theory and experimental approaches. Nat Rev Genet 8(6):450–461
3. Aris H, Borhani S, Cahn D, O’Donnell C, Tan E, Xu P (2019) Modeling transcriptional factor cross-talk to understand parabolic kinetics, bimodal gene expression and retroactivity in biosensor design. Biochem Eng J 144:209–216. https://doi.org/10.1016/j.bej.2019.02.005. http://www.sciencedirect.com/science/article/pii/S1369703X19300452
4. Arpino JAJ, Hancock EJ, Anderson J, Barahona M, Stan GBV, Papachristodoulou A, Polizzi K (2013) Tuning the dials of Synthetic Biology. Microbiol 159(7):1236–1253. https://doi.org/10.1099/mic.0.067975-0
5. Atkinson MR, Savageau MA, Myers JT, Ninfa AJ (2003) Development of genetic circuitry exhibiting toggle switch or oscillatory behavior in Escherichia coli. Cell 113(5):597–607
6. Balagadde FK, You L, Hansen CL, Arnold FH, Quake SR (2005) Long-term monitoring of bacteria undergoing programmed population control in a microchemostat. Science 309(5731):137–140
7. Bandiera L, Hou Z, Kothamachu VB, Balsa-Canto E, Swain PS, Menolascina F (2018) On-line optimal input design increases the efficiency and accuracy of the modelling of an inducible synthetic promoter. Processes 6(9). https://doi.org/10.3390/pr6090148. http://www.mdpi.com/2227-9717/6/9/148
8. Borkowski O, Ceroni F, Stan G, Ellis T (2016) Overloaded and stressed: whole-cell considerations for bacterial synthetic biology. Curr Opin Microbiol 33:123–130. https://doi.org/10.1016/j.mib.2016.07.009
9. Bremer H, Dennis P (1996) Modulation of chemical composition and other parameters of the cell by growth rate in Escherichia coli and Salmonella: cellular and molecular biology. ASM Press, Washington
10. C CM, Nieto JM, S SP, Falconi M, Gualerzi CO, Juarez A (2002) Temperature- and H-NS-dependent regulation of a plasmid-encoded virulence operon expressing Escherichia coli hemolysin. J Bacteriol 184(18):5058–5066
11. Cardinale S, Arkin AP (2012) Contextualizing context for synthetic biology – identifying causes of failure of synthetic biological systems. Biotechnol J 7(7):856–866
12. Ceroni F, Algar R, Stan GB, Ellis T (2015) Quantifying cellular capacity identifies gene expression designs with reduced burden. Nat Methods 12(5):415–418
13. Cox RS, Surette MG, Elowitz MB (2007) Programming gene expression with combinatorial promoters. Mol Syst Biol 3:145
14. Cuba Samaniego C, Franco E (2018) A robust molecular network motif for period-doubling devices. ACS Synth Biol 7(1):75–85. PMID: 29227103. https://doi.org/10.1021/acssynbio.7b00222
15. Danino T, Mondragon-Palomino O, Tsimring L, Hasty J (2010) A synchronized quorum of genetic clocks. Nature 463(7279):326–330
16. Del Vecchio D (2007) Design and analysis of an activator-repressor clock in E. coli. In: Proceedings of the American Control Conference, pp 1589–1594
17. Del Vecchio D, Ninfa AJ, Sontag ED (2008) Modular cell biology: retroactivity and insulation. Nature/EMBO Mol Syst Biol 4:161
18. Du L, Villareal S, Forster AC (2012) Multigene expression in vivo: supremacy of large versus small terminators for T7 RNA polymerase. Biotechnol Bioeng 109(4):1043–1050
19. Franco E, Friedrichs E, Kim J, Jungmann R, Murray R, Winfree E, Simmel FC (2011) Timing molecular motion and production with a synthetic transcriptional clock. Proc Natl Acad Sci 108(40):E787
20. Giladi H, Goldenberg D, Koby S, Oppenheim AB (1995) Enhanced activity of the bacteriophage lambda PL promoter at low temperature. FEMS Microbiol Rev 17(1–2):135–140
21. Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821–7826
22. Guido NJ, Wang X, Adalsteinsson D, McMillen D, Hasty J, Cantor CR, Elston TC, Collins JJ (2006) A bottom-up approach to gene regulation. Nature 439(7078):856–860
23. Gyorgy A (2018) Sharing resources can lead to monostability in a network of bistable toggle switches. IEEE Control Syst Lett 3(2):308–313. https://doi.org/10.1109/LCSYS.2018.2871128
24. Gyorgy A, Murray RM (2016) Quantifying resource competition and its effects in the TX-TL system. In: 55th IEEE Conference on Decision and Control (CDC), IEEE, pp 3363–3368. https://doi.org/10.1109/CDC.2016.7798775
25. Gyorgy A, Vecchio DD (2014) Modular composition of gene transcription networks. PLoS Comput Biol 10(3):e1003486
26. Gyorgy A, Jiménez JI, Yazbek J, Huang HH, Chung H, Weiss R, Del Vecchio D (2015) Isocost lines describe the cellular economy of genetic circuits. Biophys J 109(3):639–646. https://doi.org/10.1016/j.bpj.2015.06.034
27. Jayanthi S, Del Vecchio D (2011) Retroactivity attenuation in bio-molecular systems based on timescale separation. IEEE Trans Autom Control 56(4):748–761
28. Jayanthi S, Del Vecchio D (2012) Tuning genetic clocks employing DNA binding sites. PLoS One 7(7):e41019
29. Jayanthi S, Nilgiriwala KS, Del Vecchio D (2013) Retroactivity controls the temporal dynamics of gene transcription. ACS Synth Biol 2(8):431–441
30. Jiang P, Ventura AC, Sontag ED, Merajver SD, Ninfa AJ, Del Vecchio D (2011) Load-induced modulation of signal transduction networks. Sci Signal 4(194):ra67
31. Kalir S, McClure J, Pabbaraju K, Southward C, Ronen M, Leibler S, Surette MG, Alon U (2001) Ordering genes in a flagella pathway by analysis of expression kinetics from living bacteria. Science 292(5524):2080–2083
32. Khalil HK (2002) Nonlinear systems. Prentice Hall, Upper Saddle River
33. Kim Y, Paroush Z, Nairz K, Hafen E, Jiménez G, Shvartsman SY (2011) Substrate-dependent control of MAPK phosphorylation in vivo. Mol Syst Biol 7:467
34. Kirschner MW, Gerhart JC (2006) The plausibility of life: Resolving Darwin’s dilemma. Yale University Press, New Haven
35. Kittleson JT, Cheung S, Anderson JC (2011) Rapid optimization of gene dosage in Escherichia coli using dial strains. J Biol Eng 5:10
36. Klipp E, Liebermeister W, Wierling C, Kowald A, Lehrach H, Herwig R (2009) Systems biology: a textbook. Wiley, Hoboken
37. Kyung KH, Sauro HM (2010) Fan-out in gene regulatory networks. J Biol Eng 4:16
38. Lauffenburger DA (2000) Cell signaling pathways as control modules: complexity for simplicity? Proc Natl Acad Sci 97(10):5031–5033
39. Lee JW, Gyorgy A, Cameron DE, et al. (2016) Creating single-copy genetic circuits. Mol Cell 63(2):329–336. https://doi.org/10.1016/j.molcel.2016.06.00
40. Lohmiller W, Slotine JJE (1998) On contraction analysis for non-linear systems. Automatica 34(6):683–696
41. Lyons SM, Xu W, Medford J, Prasad A (2014) Loads bias genetic and signaling switches in synthetic and natural systems. PLoS Comput Biol 10(3):e1003533
42. Milo R, Shen-Orr SS, Kashtan N, Chlovskii DB, Alon U (2002) Network motifs: simple building blocks of complex networks. Science 298(5594):824–827
43. Mishra D, Rivera PM, Lin A, Vecchio DD, Weiss R (2014) A load driver device for engineering modularity in biological networks. Nat Biotechnol 32(12):1268–1275
44. Moore SJ, MacDonald JT, Wienecke S, Ishwarbhai A, Tsipa A, Aw R, Kylilis N, Bell DJ, McClymont DW, Jensen K, Polizzi KM, Biedendieck R, Freemont PS (2018) Rapid acquisition and model-based analysis of cell-free transcription–translation reactions from nonmodel bacteria. Proc Natl Acad Sci. https://doi.org/10.1073/pnas.1715806115. http://www.pnas.org/content/early/2018/04/16/1715806115.full.pdf
45. Mou S, Del Vecchio D (2015) How retroactivity impacts the robustness of genetic networks. In: 2015 54th IEEE Conference on Decision and Control (CDC), pp 1551–1556. https://doi.org/10.1109/CDC.2015.7402431
46. Nagaraj VH, Greene JM, Sengupta AM, Sontag ED (2017) Translation inhibition and resource balance in the TX-TL cell-free gene expression system. Synt Biol 2(1):1–7. https://doi.org/10.1093/synbio/ysx005
47. Neupert J, Karcher D, Bock R (2008) Design of simple synthetic RNA thermometers for temperature-controlled gene expression in Escherichia coli. Nucleic Acids Res 36(19):e124
48. Perez-Martin J, Espinosa M (1994) Correlation between DNA bending and transcriptional activation at a plasmid promoter. J Mol Biol 241(1):7–17
49. Prescott TP, Gyorgy A (2015) Isocost lines describe the cellular economy of genetic circuits. In: Proceedings of the IEEE Conference on Decision and Control
50. Purcell O, di Bernardo M, Grierson CS, Savery NJ (2011) A multi-functional synthetic gene network: a frequency multiplier, oscillator and switch. PLOS One 6(2):1–12. https://doi.org/10.1371/journal.pone.0016140
51. Purnick PEM, Weiss R (2009) The second wave of synthetic biology: from modules to systems. Nat Rev Mol Cell Biol 10(6):410–422
52. Qian Y, Huang HH, Jiménez JI, Del Vecchio D (2017) Resource competition shapes the response of genetic circuits. ACS Synth Biol 6(7):1263–1272. https://doi.org/10.1021/acssynbio.6b00361
53. Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551–1555
54. Saez-Rodriguez J, Kremling A, Gilles ED (2005) Dissecting the puzzle of life: modularization of signal transduction networks. Comput Chem Eng 29(3):619–629
55. Saez-Rodriguez J, Gayer S, Ginkel M, Gilles ED (2008) Automatic decomposition of kinetic models of signaling networks minimizing the retroactivity among modules. Bioinformatics 24(16):213–219
56. Scott M, Gunderson C, Mateescu E, Zhang Z, Hwa T (2010) Interdependence of cell growth and gene expression: origins and consequences. Science 330:1099–1102
57. Shen-Orr SS, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31(1):64–68
58. Siegal-Gaskins D, Tuza ZA, Kim J, Noireaux V, Murray RM (2014) Gene circuit performance characterization and resource usage in a cell-free “Breadboard”. ACS Synth Biol 3:416–425. https://doi.org/10.1021/sb400203p
59. Slusarczyk AL, Lin A, Weiss R (2012) Foundations for the design and implementation of synthetic genetic circuits. Nat Rev Genet 13(6):406–420
60. Smanski MJ, Bhatia S, Zhao D, Park Y, Woodruff L BA, Giannoukos G, Ciulla D, Busby M, Calderon J, Nicol R, Gordon DB, Densmore D, Voigt CA (2014) Functional optimization of gene clusters by combinatorial design and assembly. Nat Biotechnol 32(12):1241–1249
61. Sridharan GV, Hassoun S, Lee K (2011) Identification of biochemical network modules based on shortest retroactive distances. PLoS Comput Biol 7(11):e1002262
62. Stricker J, Cookson S, Bennett MR, Mather WH, Tsimring LS, Hasty J (2008) A fast, robust and tunable synthetic gene oscillator. Nature 456(7221):516–519
63. Tamsir A, Tabor JJ, Voigt CA (2011) Robust multicellular computing using genetically encoded nor gates and chemical ‘wires’. Nature 469(7329):212–215
64. Tan C, Marguet P, You L (2009) Emergent bistability by a growth-modulating positive feedback circuit. Nat Chem Biol 5(11):842–848
65. Weiße AY, Oyarzún DA, Danos V, Swain PS (2015) Mechanistic links between cellular trade-offs, gene expression, and growth. Proc Natl Acad Sci 112(9):E1038–E1047. https://doi.org/10.1073/pnas.1416533112
66. Yates EA, Philipp B, Buckley C, Atkinson S, Chhabra SR, Sockett RE, Goldner M, Dessaux Y, Camara M, Smith H, Williams P (2002) N-acylhomoserine lactones undergo lactonolysis in a pH-, temperature-, and acyl chain length-dependent manner during growth of Yersinia pseudotuberculosis and Pseudomonas aeruginosa. Infect Immun 70(10):5635–5646
67. Yoon J, Blumer A, Lee K (2006) An algorithm for modularity analysis of directed and weighted biological networks based on edge-betweenness centrality. Bioinformatics 22(24):3106–3108
Chapter 15

Engineering Sensors for Gene Expression Burden

Alice Boo and Francesca Ceroni

Abstract

RNA-seq enables the analysis of gene expression profiles across different conditions and organisms. Gene expression burden slows down growth, which results in poor predictability of gene constructs and product yields. Here, we describe how we applied RNA-seq to study the transcriptional profiles of Escherichia coli when burden is elicited during heterologous gene expression. We then present how we selected early responsive promoters from our RNA-seq results to design sensors for gene expression burden. Finally, we describe how we used one of these sensors to develop a burden-driven feedback regulator to improve cellular fitness in engineered E. coli.

Key words Synthetic construct, Gene expression burden, RNA-seq, Sensor, Feedback
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_15, © Springer Science+Business Media, LLC, part of Springer Nature 2021
1 Introduction

In cell engineering, cells are modified with synthetic constructs to express molecules of interest. The expression of exogenous proteins has been shown to cause detrimental physiological changes in the host cells [1], usually leading to decreased growth and poor yields, a phenomenon known as cellular burden [2]. Burden can stem from the specific role of a protein and its interactions with the intracellular environment [3, 4]. However, recent work in the field of synthetic biology has provided evidence that gene expression burden is mainly caused by competition between the host cell and the synthetic constructs for the intracellular resources needed for gene expression, with ribosome uptake and energy consumption shown to play a major role [1, 5, 6]. Burden is not only detrimental for the cells; it is also the major cause of the poor predictability of the behavior of synthetic constructs. Understanding the cell’s response to burden is thus crucial to counteract it and to identify strategies for a more robust design of gene expression devices.

We recently combined multiplex RNA-seq with an in vivo assay to reveal the major transcriptional changes occurring in E. coli when a set of inducible synthetic constructs is expressed. We found that native promoters related to the heat-shock response activate rapidly in response to synthetic expression, regardless of the construct in use. We termed these promoters natural biosensors for burden, as they allow early detection of gene expression burden occurring in the cell. Using these promoters, we built a CRISPR/dCas9-based feedback regulation system that automatically adjusts synthetic construct expression in response to burden. Cells equipped with this general-use controller maintain capacity for native gene expression to ensure robust growth and outperform unregulated cells in terms of protein yields in batch production.
Cells are transformed with constructs of interest and grown over time in a plate reader. Induction of gene expression is performed. Cells are harvested at 15 and 60 min after induction of gene expression. Total RNA is isolated and genomic DNA removed. rRNA is removed from the extract and mRNAs are retro-transcribed to cDNA. The Illumina Nextera XT kit is used to perform library preparation starting from cDNA. Identification of early responsive promoters via analysis of RNA-seq results leads to burden sensor design and testing. The htpG1 promoter is selected to design and build a burden-based biomolecular feedback system. The feedback adjusts heterologous gene expression levels to mitigate the effect of burden on the cells.
Burden Sensors for Synthetic Biology
2 Materials

2.1 Strains

We used bacterial strains MG1655 (K-12 F- λ- rph-1) and DH10B (K-12 F- λ- araD139 Δ(araA-leu)7697 Δ(lac)X74 galE15 galK16 galU hsdR2 relA rpsL150(StrR) spoT1 deoR ϕ80dlacZΔM15 endA1 nupG recA1 e14- mcrA Δ(mrr hsdRMS mcrBC)), acquired from the National BioResource Project Japan. Users should select the strain of their own interest and apply the materials and methods for the strain(s) of their choice.

2.2 Molecular Cloning

1. Plasmid DNA isolation, extraction of DNA from agarose gels, and PCR purification were done using Qiagen kits.
2. All our PCR reactions were carried out using the NEB Phusion High Fidelity Polymerase and oligonucleotide primers synthesized by IDT.
3. The burden-responsive promoters were synthesized as gBlocks from IDT and inserted into the destination vector below through restriction cloning using SfiI and PacI. All enzymes were ordered from NEB.
4. The structure of the plasmids used to test the burden early responsive promoters on a plasmid is as described in Fig. 1. The gene reporting the activity of the burden-responsive promoter, here sfGFP, can be easily swapped by restriction cloning using the PacI and BsaI restriction sites.
5. The structure of the burden-driven feedback plasmid used to regulate heterologous protein production is as described in Fig. 2. The actuator, here the sgRNA, can be swapped by restriction cloning using the PacI and AscI restriction sites. The target site on the sgRNA can also be swapped using inverse PCR of the feedback plasmid with insertion-encoding 5′-phosphorylated primers, followed by DpnI digestion and religation before transformation into the strain of interest. This allows for replacement of the target domain, which in our case was designed to hit the araBAD promoter.
Fig. 1 Schematic of the sensor plasmid designed to characterize the promoters upregulated by burden out of their genomic context
Fig. 2 Feedback plasmid constitutively expressing dCas9. Expression of the sgRNA is driven by the htpG1 burden promoter sensor, selected from the pool of upregulated promoters identified by RNA-seq
2.3 Medium

The M9 media used in our experiments (Table 1) consisted of M9 minimal salts (5×) supplemented with 0.4% casamino acids, 0.25 mg/mL thiamine hydrochloride, 2 mM MgSO4, 0.1 mM CaCl2, 0.4% fructose, and the appropriate antibiotic (see Note 1).

Table 1 Protocol to make 400 mL of M9 0.4% fructose supplemented with casamino acids (M9 minimal media recipe)

Solution                                 Volume of solution to add to make 400 mL of M9 medium
Autoclaved distilled water               278 mL
M9 minimal salts (5×)                    80 mL
Thiamine hydrochloride (10 mg/mL)        10 mL
Fructose (10%)                           16 mL
Casamino acids (10%)                     8 mL
MgSO4 (0.1 M)                            8 mL
CaCl2 (0.1 M)                            400 μL
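As a quick sanity check on the recipe, the final concentration of each component follows from the dilution rule C_final = C_stock × V_added / V_total. The sketch below applies this rule to the stock concentrations and volumes listed in Table 1; the dictionary layout and the function name are illustrative assumptions, not part of the protocol. Note that 8 mL of a 10% stock in 400 mL works out to 0.2% casamino acids, so it is worth checking the volume you add against the 0.4% stated in the text.

```python
# Stock concentrations and volumes taken from Table 1 (volumes in mL).
# The data layout and function name are illustrative, not from the protocol.
TOTAL_ML = 400.0

RECIPE = {
    # name: (stock concentration, unit, volume added in mL)
    "M9 minimal salts": (5.0, "x", 80.0),
    "thiamine hydrochloride": (10.0, "mg/mL", 10.0),
    "fructose": (10.0, "%", 16.0),
    "casamino acids": (10.0, "%", 8.0),
    "MgSO4": (100.0, "mM", 8.0),   # 0.1 M stock
    "CaCl2": (100.0, "mM", 0.4),   # 0.1 M stock, 400 uL
}


def final_concentration(stock, volume_ml, total_ml=TOTAL_ML):
    """Dilution rule: C_final = C_stock * V_added / V_total."""
    return stock * volume_ml / total_ml


if __name__ == "__main__":
    for name, (stock, unit, vol) in RECIPE.items():
        print(f"{name}: {final_concentration(stock, vol):g} {unit}")
```

The same function can be reused when scaling the recipe to another batch volume by passing a different `total_ml` together with proportionally scaled volumes.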
2.4 RNA-Seq Library Preparation

2.4.1 Consumables

1. Qiagen RNeasy mini kit (Qiagen 74104).
2. DNase I (Qiagen 79254).
3. Agilent RNA 6000 Nano Kit (5067-1511).
4. MicrobExpress rRNA removal kit (Thermo Scientific AM1905) (see Note 2).
5. Tetro cDNA synthesis kit (Bioline BIO-65043).
6. Random hexamers (Bioline BIO-38028).
7. Next second-strand synthesis buffer (NEB B6117S).
8. dNTPs (NEB N0446S).
9. RNase H (NEB M0297L).
10. Polymerase I (Thermo Scientific 18010025).
11. MinElute PCR purification kit (Qiagen 28004).
12. DEPC-treated free water.
13. Nextera XT kit (Illumina FC-131-1096).
14. AMPure beads (Beckman Coulter A63880).
15. Agilent high-sensitivity DNA analysis kit (5067-4626).

2.4.2 Equipment

1. Agilent 2100 Bioanalyzer.
2. Qubit fluorometer (Invitrogen).
3. Multi-well plate reader.
3 Methods
3.1 Identify How the Host Responds to Burden Using RNA-Seq
Here, we describe the workflow we followed to prepare our E. coli strains for measuring the impact of expressing a synthetic construct on the host. We also describe how the samples were prepared for studying the impact of our burden-inducing construct on the host transcriptome via RNA-seq. The workflow is represented in Fig. 3.
3.1.1 Transformation with Synthetic Construct Causing Burden
In our work, we were interested in expression burden triggered by overexpressing a heterologous protein. We specifically looked at the burden caused by the expression of LacZ, the Lux operon, and VioB-mCherry. For the purpose of this chapter, we discuss the case of VioB-mCherry expression. This construct encodes VioB-mCherry, a 3.7 kb fusion protein consisting of VioB and mCherry, controlled by the araBAD promoter (Fig. 4). Expression of VioB-mCherry was induced by adding arabinose to the media.
Fig. 3 Workflow to measure the burden induced by the expression of a heterologous construct and extract the RNA for RNA-seq
Fig. 4 Design of our burden inducing construct
1. Construct the plasmid carrying the gene of interest that you want to test.
2. Transform the construct into your strain and verify that it is being expressed through any method of your choice. (We checked concentrations of VioB-mCherry by tracking red fluorescence in a plate reader.)
3. In parallel, transform the same strain with the corresponding empty plasmid to serve as a negative control. You will compare the gene expression profile of this negative control with that of the strain carrying the burden plasmid to investigate which native genes are up- or downregulated in the presence of burden.

3.1.2 Time-Course Assay

1. Grow overnight cultures of E. coli cells transformed with the construct and control plasmids at 37 °C with aeration in a shaking incubator in 5 mL of M9 medium (see Subheading 2.3).
2. In the morning, dilute 60 μL of each sample into 3 mL of fresh M9 media supplemented with the appropriate antibiotics and grow them at 37 °C with shaking for another hour (outgrowth).
3. Then, transfer 200 μL of each sample into a 96-well plate (we used clear 96-well Costar plates) at approximately 0.1 OD600.
4. Place the samples in a microplate reader (we used a Biotek Synergy HT plate reader) and incubate them at 37 °C with orbital shaking at the medium setting for 1 h. Take measurements of VioB-mCherry (excitation, 590 nm; emission, 645 nm) and OD600 every 15 min (if using GFP, use excitation, 485 nm; emission, 528 nm).
5. Sixty minutes into the incubation, briefly remove the plate to add the inducer to the wells (our final concentrations of inducers were: L-arabinose, 0.2%; L-rhamnose, 2%). Set this time point as your “time 0.”
6. If you are doing a burden assay: grow the cells in the reader for 4.5 h, taking measurements of VioB-mCherry (excitation, 590 nm; emission, 645 nm) and OD600 every 15 min.
7. If you are performing RNA-seq analysis: remove the samples from the wells at 15 and 60 min after induction for processing:
(a) Take 170 μL from each of four wells per time point and dispense it into a fresh tube to which you have already added 1.360 mL of RNA protection buffer.
(b) Leave the samples for 5 min at room temperature and then centrifuge them at 4 °C at maximum speed.
(c) Discard the supernatant and freeze the pellets at −20 °C.
(d) Repeat the experiment for the three replicates on three different days (our three replicates were repeated independently on three different days for a total of 90 samples used to produce the final data set: 7 constructs × 2 strains × 3 replicates × 2 time points = 84 samples, plus control strain DH10B-GFP cells × 3 replicates × 2 time points).

3.1.3 RNA-Seq Sample Preparation

The library preparation uses a custom protocol adapted from previous Nextera kit methods [7].

1. Extract the RNA from the samples collected during the time-course assay (Subheading 3.1.2). Use the Qiagen RNeasy mini kit (Qiagen 74104).
2. Remove possible traces of genomic DNA contamination by treating 2 μg of each sample for a second time with DNase I (Qiagen 79254).
3. Assess the total RNA quality and integrity with an Agilent 2100 Bioanalyzer and Agilent RNA 6000 Nano Kit (5067-1511). The average RNA integrity number should be above 9.
4. Enrich the mRNA with the MicrobExpress rRNA removal kit (Thermo Scientific AM1905).
5. Assess successful rRNA depletion on the Bioanalyzer.
6. Carry out the reverse transcription starting from 50 ng of total enriched mRNA with the Tetro cDNA synthesis kit (Bioline BIO-65043) and 6 μL of random hexamers (Bioline BIO-38028) per reaction.
7. For the second-strand cDNA synthesis, add to the first-strand synthesis mix 5 μL of NEB Next second-strand synthesis buffer (NEB B6117S), 3 μL of dNTPs (NEB N0446S), 2 μL of RNase H (NEB M0297L), 2 μL of polymerase I (Thermo Scientific 18010025), and 18 μL of water per reaction.
8. Incubate the samples at 16 °C for 2.5 h.
9. Purify the cDNA with the MinElute PCR purification kit (Qiagen 28004) and elute in 10 μL of DEPC-treated free water.
10. Quantify the amount of cDNA with a Qubit fluorometer (Invitrogen).
11. For the library preparation, use the Nextera XT kit (Illumina FC-131-1096) starting from 1 ng of total cDNA. Use 3 min of tagmentation and 13 cycles of step-limited PCR.
12. Purify the library using AMPure beads (Beckman Coulter A63880).
13. Assess the quality and quantity of the library with an Agilent 2100 Bioanalyzer and Agilent high-sensitivity DNA analysis kit (5067-4626).
14. Pool together all your samples in the same reaction tube at a final concentration of 1 nM.

3.1.4 RNA-Seq Library Sequencing

We performed the library sequencing at the Imperial College London Genomic Facility. We used two lanes from the HiSeq 2500 sequencer for paired-end sequencing with read length of 100 bp.

3.1.5 Sequencing Quality Control and Alignments

1. Trim and assess the quality of your raw reads for all sequenced samples using Trim Galore v0.4.1 with default settings. Look for potential batch effects by pooling your technical replicates.
2. Obtain the genomic sequences of your organism, for example, using Ensembl Genomes. (In our case, we created a FASTA format sequence file corresponding to our DH10B-GFP and MG1655-GFP strain by merging the composite of strain, plasmid, and integrated GFP for each sample to use as a reference for read alignment.)
3. Align the trimmed reads using the BWA mem algorithm v0.7.12-r1039 with the default settings.
4. Create a sorted BAM file for each sample using SAMtools v1.3.1 on the alignments obtained at the previous step.
5. Check that your biological replicates do not exhibit any batch effects before you generate the raw counts with Bioconductor Rsubread package v1.12.6.
6. Discard all reads identified as unremoved rRNA, and in the one case where reads could align to either the plasmid or the strain genome, assign the raw reads appropriately to match those of flanking sequence.
7. Check the biological replicates to identify any outlier sample.
8. Generate the normalized FPKM counts with the Bioconductor edgeR package version 3.4.2, accounting for gene length and library size (by TMM normalization), which will be used for downstream analysis.
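Step 8 normalizes counts for gene length and library size. As a rough illustration of what an FPKM value represents, the sketch below computes the plain definition (fragments per kilobase of gene per million mapped fragments). It does not reproduce the TMM scaling of library sizes that edgeR applies, and the gene names, counts, and lengths are made-up example values.

```python
def fpkm(counts, lengths_bp):
    """Plain FPKM: count * 1e9 / (gene length in bp * total mapped fragments).

    counts     -- dict of gene -> raw fragment count for one sample
    lengths_bp -- dict of gene -> gene length in base pairs
    Note: edgeR additionally rescales library sizes by TMM factors;
    that step is deliberately left out of this sketch.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped fragments")
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}


if __name__ == "__main__":
    # Hypothetical counts and lengths for three genes in one sample.
    counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
    lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
    for gene, value in fpkm(counts, lengths).items():
        print(gene, round(value, 1))
```

In this toy sample, geneA and geneB receive the same FPKM even though geneB has three times as many fragments, because it is also three times as long; this is the length correction the normalization is meant to provide.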
3.1.6 Transcription Profiles and Promoter Characterization

We adopted the method of Gorochowski et al. (2017) [8] to generate the transcription profiles from RNA-seq data.

1. Map the raw reads from the sequencer, previously saved in FASTQ format, to your appropriate host genome reference sequence (which includes any genomically integrated sequences and/or plasmid sequences) with BWA version 0.7.4 with default settings. You will obtain BAM files for each of your samples.
2. Separately process each of these BAM files with custom Python scripts [8] to extract the position of the mapped reads, count read depths across the reference sequences, and apply corrections to the profiles at the ends of transcription units.
3. Normalize the obtained profiles to be able to compare them between samples.
4. Characterize the promoters with custom Python scripts as in Gorochowski et al. (2017) [8], which take as input a GFF reference of the construct defining the location of all parts.
5. Use DNAplotlib version 1.0 [9] to visualize your transcription profiles together with the associated genetic design information in SBOL Visual format [10]. (All our analyses were carried out with custom scripts run using Python version 2.7.12, NumPy version 1.11.2, and matplotlib version 1.5.3.)
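The core of step 2, counting read depth along the reference, can be illustrated with a difference array. This is a simplified stand-in for the custom scripts of ref. [8], not a copy of them: it assumes that the mapped read positions have already been extracted from the BAM file as 0-based, half-open (start, end) intervals, and it omits the corrections applied at the ends of transcription units. The function name and example reads are hypothetical.

```python
from itertools import accumulate


def read_depth(intervals, ref_length):
    """Per-base read depth from mapped read intervals.

    intervals  -- iterable of (start, end), 0-based, end exclusive
    ref_length -- length of the reference sequence in bp
    """
    diff = [0] * (ref_length + 1)
    for start, end in intervals:
        start = max(start, 0)
        end = min(end, ref_length)
        if start >= end:
            continue  # read lies outside the reference
        diff[start] += 1   # depth rises where the read begins
        diff[end] -= 1     # and falls just past its last base
    # The running sum of the differences gives the depth at each position.
    return list(accumulate(diff[:-1]))


if __name__ == "__main__":
    reads = [(0, 4), (2, 6), (3, 8)]  # hypothetical mapped reads
    print(read_depth(reads, 10))
```

The difference-array approach touches each read only twice regardless of its length, which keeps the computation fast for deep libraries.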
3.1.7 Analyze the Plate-Reader Data to Evaluate Burden

To calculate the burden imposed by the constructs, refer to Ceroni et al. [11]:

$$\text{Growth rate}(t_2) = \frac{\ln(\mathrm{OD}(t_3)) - \ln(\mathrm{OD}(t_1))}{t_3 - t_1}$$

$$\text{GFP capacity}(t_2) = \frac{\text{Total GFP}(t_3) - \text{Total GFP}(t_1)}{\mathrm{OD}(t_2)\,(t_3 - t_1)}$$

$$\text{RFP production rate per cell}(t_2) = \frac{\text{Total RFP}(t_3) - \text{Total RFP}(t_1)}{\mathrm{OD}(t_2)\,(t_3 - t_1)}$$

where $t_1$ = time − 15 min after induction, $t_2$ = time after induction, and $t_3$ = time + 15 min after induction. Mean rates and their standard errors are calculated from three biological replicates. To account for the background red fluorescence of M9, we added 400 to all RFP output rates per cell, as we measured that red fluorescence decreases at a rate of approximately 400 RFP h⁻¹ as it is consumed by cells during growth.

3.2 Select the Best Burden-Responsive Promoter to Build a Burden Biosensor

The next step is to identify which promoters are upregulated in the presence of burden. We identified early responsive promoters using RNA-seq, then isolated them and cloned them upstream of a fluorescent reporter to characterize their response to burden when placed out of their genomic context on a plasmid. This workflow is presented in Fig. 5. This allowed us to select our burden sensor: the promoter exhibiting the best fold activation when it is triggered by burden.
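The rate definitions of Subheading 3.1.7, together with the fold activation used to rank candidate sensors, can be written compactly as below. This is a minimal sketch assuming that measurements are taken at a fixed interval, so that t1, t2, and t3 are three consecutive reads; the function names and the example numbers are hypothetical, and the background correction for RFP is not included.

```python
import math


def growth_rate(od, i, dt_h):
    """(ln OD(t3) - ln OD(t1)) / (t3 - t1), with t1 and t3 the reads i-1 and i+1."""
    return (math.log(od[i + 1]) - math.log(od[i - 1])) / (2 * dt_h)


def production_rate_per_cell(fluor, od, i, dt_h):
    """(F(t3) - F(t1)) / (OD(t2) * (t3 - t1)); applies to GFP capacity and RFP."""
    return (fluor[i + 1] - fluor[i - 1]) / (2 * dt_h * od[i])


def fold_activation(rate_with_burden, rate_without_burden):
    """Ratio of sensor output with burden to sensor output without burden."""
    return rate_with_burden / rate_without_burden


if __name__ == "__main__":
    dt = 0.25                        # 15 min between reads, in hours
    od = [0.20, 0.25, 0.30]          # hypothetical OD600 reads
    gfp = [1000.0, 1300.0, 1600.0]   # hypothetical total GFP reads
    print(growth_rate(od, 1, dt))
    print(production_rate_per_cell(gfp, od, 1, dt))
```

Indexing by read number rather than by time keeps the three time points explicit: index `i` plays the role of t2, and its neighbours supply t1 and t3.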
Fig. 5 Workflow to identify promoters that are upregulated by burden from RNA-seq results and test them out of their genomic context in order to select the best candidate to use as a burden biosensor

3.2.1 Interpret the RNA-Seq Results to Identify Promoters Upregulated by Burden

Here, we describe how to interpret the RNA-seq results to identify promoters with an early response to burden. We used DESeq2 for our differential expression analyses [12].

1. Compare gene expression between cells transformed with synthetic constructs and the analogous cells transformed with the corresponding empty plasmid. (We excluded the reads mapping to ribosomal genes or to the synthetic constructs.)
2. Annotate the differentially expressed genes with data extracted from the EcoCyc database [13] using custom Python code.
3. A volcano plot can help visualize which genes were upregulated or downregulated in the cells experiencing the imposed burden compared to the control cells (Fig. 5). We specifically looked at the differential gene expression at 15 min and 1 h after induction.
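What a volcano plot shows can also be expressed as a simple rule: a gene is called up- or downregulated when its adjusted p-value is small and its log2 fold change is large in magnitude. The sketch below applies such a rule to a table of differential expression results. The thresholds (|log2 fold change| ≥ 1, adjusted p ≤ 0.05), the gene names, and the values are illustrative assumptions, not the cut-offs or data of our analysis; the field names follow the `log2FoldChange` and `padj` columns that DESeq2 reports.

```python
def classify(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Label each gene 'up', 'down', or 'ns' (not significant).

    results -- list of dicts with keys 'gene', 'log2FoldChange', 'padj'
    The cut-offs are illustrative defaults, not values from the protocol.
    """
    labels = {}
    for row in results:
        padj = row["padj"]
        lfc = row["log2FoldChange"]
        # DESeq2 can report a missing padj (NA); treat it as not significant.
        if padj is None or padj > padj_cutoff or abs(lfc) < lfc_cutoff:
            labels[row["gene"]] = "ns"
        else:
            labels[row["gene"]] = "up" if lfc > 0 else "down"
    return labels


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    table = [
        {"gene": "geneW", "log2FoldChange": 2.4, "padj": 1e-6},
        {"gene": "geneX", "log2FoldChange": -1.8, "padj": 0.01},
        {"gene": "geneY", "log2FoldChange": 0.3, "padj": 0.40},
        {"gene": "geneZ", "log2FoldChange": 3.0, "padj": None},
    ]
    print(classify(table))
```

Genes labeled "up" at the early time point are the natural starting list of candidate burden-responsive promoters.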
3.2.2 Test the Burden-Responsive Promoters Out of Their Genomic Context
Once we had identified through RNA-seq analysis which promoters upregulate gene expression, we studied their behavior out of their genomic context on a plasmid.
1. Order a gBlock for each candidate promoter that upregulated the expression of a native gene when exposed to burden. Include an SfiI restriction site upstream of the promoter sequence and a PacI restriction site downstream of the promoter sequence for easy insertion into the sensor plasmid (Fig. 1).
2. Insert each gBlock into the sensor plasmid via restriction cloning.
3. The reporter, currently sfGFP, can be swapped for a different reporter gene by restriction cloning using PacI and BsaI.
3.2.3 Select the Best Promoter to Use as Burden Biosensor
Analyze the plate-reader data and select the sensor plasmid that exhibits the best ON/OFF properties.
1. Analyze the plate-reader data according to Sect. 3.1.7. Plot bar graphs of the GFP production rate per cell at 1 h post-induction of burden.
Fig. 6 Workflow to build a burden-driven feedback for gene expression based on a burden biosensor uncovered with RNA-seq
2. Select the promoter that is the most responsive to burden (the highest GFP production rate per cell when there is burden) but that also has the lowest OFF activity (the lowest GFP production rate per cell when there is no burden). In practice, we looked for the promoter with the best fold change in GFP production rate per cell between the two conditions.
We constructed four sensor plasmids, with the htpG1, htpG2, groSL, and ibpAB promoters driving the expression of sfGFP. Since the htpG regulon is driven by two overlapping promoters, namely htpG1 and htpG2, both promoters were tested separately on the sensor plasmid. We found that htpG1 had the best fold activation of the four constructs.
3.3 Build the Burden-Driven Feedback Loop
Once we had identified our burden sensor, we used it to drive the expression of an actuator able to regulate gene expression in response to burden. Our workflow for building a burden-driven feedback loop is represented in Fig. 6. In the presence of burden, the actuator should be triggered to decrease heterologous gene expression, thus decreasing the burden imposed on the cell and restoring some of its cellular capacity. To measure cellular burden, we used the capacity monitor from Ceroni et al. [1]. This monitor assesses the burden of genetic constructs by calculating the changes in GFP production from a “monitor cassette” constitutively expressing GFP from the bacterial genome. A detailed protocol for integrating the capacity monitor into a strain of interest can be found in Note 3. GFP capacity, that is, the GFP production rate per cell, should be maintained above a specific threshold, which means that burden is kept below an upper bound.
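As a rough illustration of what "GFP capacity above a threshold" means numerically, the sketch below computes a production rate per cell from three consecutive plate-reader reads. It assumes the rate is a central difference of total fluorescence between the reads before and after the time point of interest, divided by the OD600 at that time point; this form, the function names, the threshold, and all numbers are assumptions for illustration, so check them against the formula in Sect. 3.1.7 before reuse.

```python
def production_rate_per_cell(times_h, fluorescence, od600, i):
    """Central-difference production rate per cell at read index i.

    Assumed form: (F[i+1] - F[i-1]) / (t[i+1] - t[i-1]) / OD600[i].
    """
    if i < 1 or i > len(times_h) - 2:
        raise ValueError("need one reading on each side of index i")
    d_fluo = fluorescence[i + 1] - fluorescence[i - 1]
    d_time = times_h[i + 1] - times_h[i - 1]
    return d_fluo / d_time / od600[i]

def capacity_maintained(rate, threshold):
    """True when the GFP capacity stays at or above the chosen threshold."""
    return rate >= threshold

# Hypothetical reads taken 15 min apart around 1 h post-induction.
times_h = [0.75, 1.0, 1.25]
gfp = [1000.0, 1300.0, 1600.0]
od = [0.20, 0.25, 0.30]

rate = production_rate_per_cell(times_h, gfp, od, 1)
ok = capacity_maintained(rate, threshold=4000.0)  # hypothetical threshold
```

The same function can be applied to the RFP channel to obtain the VioB-mCherry production rate, so that capacity and output are compared at the same time point.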
3.3.1 Build the Feedback Plasmid
Build the feedback plasmid (Fig. 2) by restriction cloning: the promoter can be inserted using the previously synthesized gBlocks carrying the SfiI and PacI restriction sites. The actuator can also be synthesized with PacI and AscI restriction sites for insertion into the feedback plasmid via restriction cloning. In our case, the sgRNA was placed under the regulation of the htpG1 promoter to promote fast dynamics of our system and so that the levels of sgRNA in the cell are directly related to the host cell capacity. dCas9 is constitutively expressed and binds to the sgRNA present in the cell to inhibit the production of VioB-mCherry, which slows down cell growth when its expression is triggered (Fig. 7).
Fig. 7 Architecture of the burden-driven feedback implemented with CRISPRi to regulate the production of a heterologous protein
1. Transform the burden plasmid and the feedback plasmid into a strain containing the sfGFP capacity monitor integrated into the genome. Also transform an open-loop version of the feedback, in which the sgRNA does not target anything in the cell.
2. Carry out a time-course assay in the plate-reader: take measurements of VioB-mCherry (excitation, 590 nm; emission, 645 nm), sfGFP (excitation, 485 nm; emission, 528 nm), and OD600 every 15 min.
3. Sixty minutes into the incubation, briefly remove the plate to add the inducer to the wells (0.2% arabinose).
4. Grow the cells for 6 h.
5. Analyze the data by plotting the GFP capacity and the VioB-mCherry production rate at 1 h post-induction.
Repression of VioB-mCherry production is tunable by controlling the intracellular concentration of dCas9 available to form an inhibiting complex with the guide RNA. dCas9 expression sets the steady-state repression level of the heterologous VioB-mCherry protein, but its production rate should be chosen carefully so that it does not itself impose a large burden on the host cell. The capacity monitor can assess the burden of genetic constructs by calculating the changes in GFP production from a “monitor cassette” constitutively expressing GFP from the bacterial genome (see Note 3).
6. Create a library of feedback constructs with promoters of various strengths driving dCas9 expression to check whether increasing dCas9 levels strengthens repression of the feedback. Randomly
mutate the J23100 Anderson promoter at specific positions, chosen by analyzing the variable positions in the constitutive Anderson promoter library. Order primers to insert your random mutations through inverse PCR.
7. After construction of the library, the promoter strength of the constructs can be assessed by monitoring their GFP capacity via a plate-reader characterization assay (Sect. 3.1.7). Higher GFP capacity implies that dCas9 production imposes less burden on the cell; hence, the promoter in front of dCas9 must be a weak constitutive promoter. Conversely, if GFP capacity tends to zero, the constitutive promoter driving dCas9 expression must be strong.
8. Sequence enough library constructs to obtain a diversified range of GFP capacities in the above experiment.
9. Transform the selected sequenced constructs with the burdensome plasmid and select the one for which the GFP capacity is best conserved when burden is induced.
3.3.2 Tune the Burden-Driven Feedback
One of the advantages of using dCas9-gRNA-based regulation is that the sgRNA sequence can be easily and quickly mutated so as to bind the target with different affinity, thus providing a convenient way to tune the gain of the feedback (Fig. 8). The same result is achievable by changing the strength of the promoter driving dCas9 expression. The library of promoters controlling the expression of dCas9 demonstrated the capacity of the feedback system to repress the production of VioB-mCherry and keep the cellular capacity close to that of the wild-type strain. The feedback system should maximize the heterologous protein production rate while keeping cellular capacity high. Introducing a mutation in the sgRNA decreases the binding affinity between the dCas9/sgRNA complex and the araBAD promoter, hence lowering the repression of VioB-mCherry and improving its rate of production. Farasat et al. [14] described how to rationally introduce mismatches in the guide RNA to regulate the activity of the dCas9/sgRNA complex. One mismatch in the 6 bases closest to the PAM site is expected to reduce repression three- to fourfold, while two mismatches lead to a 14-fold decrease in repression. We decided to introduce one mismatch in our sgRNA targeting the araBAD promoter, intuitively predicting that two mismatches would decrease the repression too much for the feedback to have a noticeable effect on keeping the cellular capacity high.
1. Construct a library of randomly point-mutated sgRNAs using inverse PCR: introduce random point mutations in the 6 bases closest to the PAM site.
Fig. 8 Tune the feedback gain by changing the expression level of dCas9 (promoter/RBS) or by varying the affinity of the sgRNA with its target promoter (bp mutation)
2. Evaluate the performance of the point-mutated sgRNA feedback constructs by conducting a batch experiment and comparing the final yield of the different constructs:
(a) Inoculate 3 mL of M9 fructose medium, supplemented with the appropriate antibiotics, in 15 mL culture tubes with the constructs carrying the different mutated sgRNAs.
(b) Grow the cultures in the 37 °C shaking incubator for 5 h, before diluting them to 0.015 OD600.
(c) Use 50 μL of the diluted culture (~150,000 cells) to inoculate batch cultures of 50 mL M9 supplemented with the inducer and the appropriate antibiotics in 500-mL baffled shake-flasks.
(d) Grow the cultures in the 37 °C shaking incubator for 16 h.
(e) Then, every hour from 16 h until 24 h, dispense 200 μL of each culture into individual wells of a 96-well plate (it will be used to read the cell density and bulk fluorescence of each construct in the plate-reader). Also dilute 350 μL of culture into 650 μL of PBS and store in the fridge at 4 °C (it will be used to read the fluorescence of each construct with the flow cytometer).
(f) After sampling at each hour, place the 96-well plate in a preheated plate-reader at 37 °C and start a plate-reader kinetic, performing OD measurements (OD600 and OD700), GFP measurements (excitation, 485 nm; emission, 528 nm), and RFP measurements (excitation, 590 nm; emission, 645 nm) every 2 min for 10 min. Average over the 5 points to obtain the OD, GFP, and RFP values at the specific sampling time point (this is to allow the sample to settle down in the plate-reader).
(g) Measure the GFP and RFP levels of individual cells from the cultures stored in PBS at 4 °C with a flow cytometer (we used the FortessaX20).
(h) Select the sgRNA that produced the highest final yield. (We found that the strain producing the highest yield was the one without any mutation, as it grew faster than the other strains.)
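To get a feel for the size of the point-mutant library built in step 1 of this section, the candidate sequences can be enumerated in Python. This is a minimal sketch under stated assumptions: the spacer is written 5′ to 3′ with the PAM immediately 3′ of the target (the usual convention for S. pyogenes Cas9), so the 6 PAM-proximal bases are the last 6 bases of the spacer. The spacer sequence and the function name are hypothetical, not the chapter's araBAD-targeting guide.

```python
def pam_proximal_point_mutants(spacer, window=6):
    """Return every single-base substitution in the last `window` bases.

    Each entry is (1-based position, original base, new base, mutant spacer).
    Assumes the PAM lies 3' of the spacer, so the PAM-proximal bases are
    at the 3' end of the sequence.
    """
    spacer = spacer.upper()
    if len(spacer) < window or set(spacer) - set("ACGT"):
        raise ValueError("spacer must be an ACGT string at least `window` long")
    mutants = []
    for pos in range(len(spacer) - window, len(spacer)):
        for base in "ACGT":
            if base != spacer[pos]:
                mutant = spacer[:pos] + base + spacer[pos + 1:]
                mutants.append((pos + 1, spacer[pos], base, mutant))
    return mutants

# Hypothetical 20-nt spacer for illustration only.
spacer = "GATTACAGATTACAGATTAC"
library = pam_proximal_point_mutants(spacer)
n_variants = len(library)  # 6 positions x 3 alternative bases
```

With one mismatch allowed in 6 positions there are only 18 possible variants, so sequencing a modest number of colonies from the random library should be enough to recover most of them.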
4 Notes
1. M9 Medium Recipe
(a) M9 Minimum Salts (5×) stock solution: dissolve 56.4 g of M9 Minimum Salts into 1 L of distilled H2O. Stir to suspend and sterilize by autoclaving. Store at room temperature.
(b) Thiamine hydrochloride stock solution: dissolve 10 mg of thiamine hydrochloride into 1 mL of water. Agitate to suspend. Filter-sterilize. Cover the sterile container with aluminum foil to protect it from the light. Store at room temperature. (DH10B cannot produce thiamine hydrochloride.)
(c) Fructose stock solution: dissolve 10 g of fructose into 100 mL of distilled H2O. Filter-sterilize. Store at 4 °C. (We used fructose as the main carbon source to avoid the strong catabolite repression of the AraBAD and RhaBAD promoters known to occur in glucose media.)
(d) Casamino acids stock solution: dissolve 10 g of Casamino Acids into 100 mL of distilled H2O. Stir to suspend and sterilize by autoclaving. Store at room temperature. (We tried various Casamino acids brands and found that Casamino acids from MP Biomedicals gave us consistent growth for our DH10B and MG1655 cells.)
(e) 1 M Magnesium sulfate (MgSO4) stock solution: dissolve 246 g of MgSO4·7H2O into 1 L of distilled H2O. Sterilize by autoclaving. Store at room temperature.
(f) 1 M Calcium chloride (CaCl2) stock solution: dissolve 44 g of CaCl2·6H2O into 200 mL of distilled H2O. Sterilize by autoclaving. Store at room temperature.
2. Ribodepletion
The ribodepletion step was carried out using the MICROBExpress mRNA Enrichment Kit. We selected this kit for its cost effectiveness, though better depletion of the ribosomal RNA can be achieved using Illumina kits, especially the RiboZero Kit [15]. During our analysis, we found that around 60% of the sequences came from ribosomal RNA, but this fraction varied from sample to sample, reflecting differences in depletion efficiency.
3. Integration of the sfGFP capacity monitor
Our CRIM plasmid carrying the sfGFP constitutive cassette is available from Addgene (https://www.addgene.org/66073/) so that it can be inserted into the user's strain(s) of interest using the helper plasmid pINT-ts (https://www.addgene.org/66076/). The sfGFP capacity monitor [1] was integrated into the λ site of E. coli using the CRIM [16] plasmid pAH63. The following protocol for the insertion of the sfGFP capacity monitor is adapted from Dr. Algar's PhD thesis [17].
“To insert the monitor into the genome we used the CRIM system [16]. This system involves two separate plasmids, one of which contains the monitor and will be inserted into the genome, the other being a ‘helper plasmid’ that facilitates the genomic integration. The CRIM system works by placing the circuit you wish to insert into the genome into the CRIM plasmid corresponding to the integration site. CRIM plasmids have the γ replication origin of R6K, which requires the trans-acting Π protein (encoded by pir) for replication. This means that these plasmids can only be maintained in cells which have a pir+ genotype. In order to replicate the CRIM plasmid with monitor we transformed into pir+ cells. For this we used TransforMax™ EC100D™ pir-116 Electrocompetent E. coli cells.”
(a) Construct the CRIM integration plasmid pAH63 (kanamycin resistance) containing the sfGFP monitor in TransforMax™ EC100D™ pir-116 electrocompetent E. coli cells.
(b) In parallel, transform the pINT-ts helper plasmid (ampicillin resistance) into DH10B. Always grow these cells at 30 °C.
(c) Make the DH10B cells transformed with the pINT-ts helper plasmid electrocompetent. Always grow these cells at 30 °C.
(d) Transform the pAH63 plasmid containing the sfGFP monitor into your pINT-ts electrocompetent cells.
(e) Following electroporation, suspend the cells in SOC or SOB. Incubate at 37 °C for 1 h and then at 42 °C for 30 min. (The phage integrase (Int) enzyme is synthesized at elevated temperatures from the CRIM helper plasmid pINT-ts. The helper plasmid has a temperature-sensitive origin of replication such that resulting colonies are nearly always cured of the helper plasmid.)
(f) Spread onto selective agar (kanamycin in our case) and incubate at 37 °C.
References
1. Ceroni F, Algar R, Stan G-B, Ellis T (2015) Quantifying cellular capacity identifies gene expression designs with reduced burden. Nat Methods 12:415–418. https://doi.org/10.1038/nmeth.3339
2. Borkowski O, Ceroni F, Stan GB, Ellis T (2016) Overloaded and stressed: whole-cell considerations for bacterial synthetic biology. Curr Opin Microbiol 33:123–130. https://doi.org/10.1016/j.mib.2016.07.009
3. Ellis T (2018) Predicting how evolution will beat us. Microb Biotechnol 12(1):41–43. https://doi.org/10.1111/1751-7915.13327
4. Martin VJJ, Pitera DJ, Withers ST et al (2003) Engineering a mevalonate pathway in Escherichia coli for production of terpenoids. Nat Biotechnol 21:796–802. https://doi.org/10.1038/nbt833
5. Gyorgy A, Jiménez JI, Yazbek J et al (2015) Isocost lines describe the cellular economy of genetic circuits. Biophys J 109:639–646. https://doi.org/10.1016/j.bpj.2015.06.034
6. Shachrai I, Zaslaver A, Alon U, Dekel E (2010) Cost of unneeded proteins in E. coli is reduced after several generations in exponential growth. Mol Cell 38:758–767. https://doi.org/10.1016/j.molcel.2010.04.015
7. Gertz J, Varley KE, Davis NS et al (2012) Transposase mediated construction of RNA-seq libraries. Genome Res 22:134–141. https://doi.org/10.1101/gr.127373.111
8. Gorochowski TE, Espah Borujeni A, Park Y et al (2017) Genetic circuit characterization and debugging using RNA-seq. Mol Syst Biol 13:952. https://doi.org/10.15252/msb.20167461
9. Der BS, Glassey E, Bartley BA et al (2017) DNAplotlib: programmable visualization of genetic designs and associated data. ACS Synth Biol 6:1115–1119. https://doi.org/10.1021/acssynbio.6b00252
10. Myers CJ, Beal J, Gorochowski TE et al (2017) A standard-enabled workflow for synthetic biology. Biochem Soc Trans 45:793–803. https://doi.org/10.1042/BST20160347
11. Ceroni F, Boo A, Furini S et al (2018) Burden-driven feedback control of gene expression. Nat Methods 15:387–393. https://doi.org/10.1038/nmeth.4635
12. Love MI, Huber W, Anders S (2014) Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15:1–21. https://doi.org/10.1186/s13059-014-0550-8
13. Keseler IM, Mackie A, Santos-Zavaleta A et al (2017) The EcoCyc database: reflecting new knowledge about Escherichia coli K-12. Nucleic Acids Res 45:D543–D550. https://doi.org/10.1093/nar/gkw1003
14. Farasat I, Salis HM (2016) A biophysical model of CRISPR/Cas9 activity for rational design of genome editing and gene regulation. PLoS Comput Biol 12:1–33. https://doi.org/10.1371/journal.pcbi.1004724
15. Petrova OE, Garcia-Alcalde F, Zampaloni C, Sauer K (2017) Comparative evaluation of rRNA depletion procedures for the improved analysis of bacterial biofilm and mixed pathogen culture transcriptomes. Sci Rep 7:1–15. https://doi.org/10.1038/srep41114
16. Haldimann A, Wanner BL (2001) Conditional-replication, integration, excision, and retrieval plasmid-host systems for gene structure-function studies of bacteria. J Bacteriol 183:6384–6393. https://doi.org/10.1128/JB.183.21.6384
17. Algar RJR (2013) Understanding, characterising and modelling the interactions between synthetic genetic circuits and their host chassis. PhD thesis, Imperial College London
Chapter 16
Engineering Protein-Based Parts for Genetic Devices in Mammalian Cells
Giuliano Bonfá, Federica Cella, and Velia Siciliano
Abstract
Synthetic biology has been advancing cellular and molecular biology studies through the design of synthetic circuits capable of examining diverse endogenously or exogenously driven regulatory pathways. While early genetic devices were engineered to be insulated from intracellular crosstalk, more recently the need to achieve dynamic control of cellular behavior has led to the development of smart interfaces that connect signal information (sensor) to desired output activation (actuator). Sensor-actuator circuits can respond to diverse inputs, including small molecules, exogenous and endogenous mRNA, noncoding RNA (i.e., miRNA), and proteins, to regulate downstream events transcriptionally, posttranscriptionally, and translationally. These devices require careful engineering to either create complex chimeric proteins or modify protein structures so that they suit the architecture and/or purpose of the specific circuit. In this chapter, we describe how to implement two different protein-based devices in mammalian cells: (1) a modular platform that senses and responds to disease-associated proteins and (2) a protein-based system that allows simultaneous regulation of RNA translation and protein activity, via RNA–protein and newly engineered protein–protein interactions.
Key words Mammalian synthetic biology, Protein sensor-actuator, Synthetic smart interfaces, Protein–protein regulation, Protein–RNA regulation, RNA-binding protein
1 Introduction
1.1 Synthetic Devices that Sense Intracellular Protein and Regulate Cellular Fate
Programmable and model-aided synthetic circuits hold the potential to improve our understanding of the rules that govern biological processes [1–4] and to create new tools for biomedical purposes [22]. Genetic biosensors with medical applications focus on rewiring cell function by triggering a therapeutic output via transcriptional or translational regulation [5–9]. Most synthetic biosensors have been designed to respond to extracellular stimuli, either by building input-specific devices or by creating a generalizable framework that adapts to different cues [7, 10, 23]. Here, we describe the first modular platform that can be repurposed to sense and respond to several intracellular proteins that function as disease biomarkers. This synthetic platform couples
Filippo Menolascina (ed.), Synthetic Gene Circuits: Methods and Protocols, Methods in Molecular Biology, vol. 2229, https://doi.org/10.1007/978-1-0716-1032-9_16, © Springer Science+Business Media, LLC, part of Springer Nature 2021
[Fig. 1 labels: Cytoplasm, Nucleus, Membrane Tag, mKate, scFv35, scFv162, TCS, TEVp, Gal4-VP16, Actuator]
Fig. 1 Schematics of intracellular protein sensors. The sensing modules are composed of two intrabodies that recognize and interact with an intracellular target protein (green star). One intrabody is anchored to the cell membrane (i.e., for the HCV protein biomarker NS3 we used scFv35) and is fused at the N-terminus to the mKate fluorescent tag and at the C-terminus to the TCS and to a GAL4-VP16 transcriptional activator. The second intrabody (i.e., for the HCV protein biomarker NS3 we used scFv162) is cytosolic and fused to the TEVp. Interaction of the intrabodies with the target protein brings the TEVp close to the TCS, and the cleavage reaction results in the release of GAL4-VP16, which translocates to the nucleus and induces the expression of the output gene (Actuator)
intracellular protein sensing to actuator modules to convert protein detection into programmed gene expression in a modular fashion [6]. The architecture of the device is elaborate: it includes intrabodies that specifically bind the selected proteins, connected to the viral TEV protease (TEVp) system, which releases a membrane-bound transcription factor when the protein is sensed (Fig. 1). Intrabodies guarantee the modularity of the framework, which was built to allow their easy interchange and quick rearrangement toward new proteins of interest. We chose to detect disease proteins primarily expressed in the cytoplasm, and for each target, we selected two intrabodies (or interacting peptide domains) that bind to different epitopes.
Outputs, driven by the GAL4 cognate promoter, included (a) fluorescence for diagnostic purposes, (b) an apoptotic gene for cell killing, and (c) a chemokine for immunomodulatory purposes. We demonstrate the functionality of engineered protein sensors by creating devices that sense three different proteins associated with the following diseases: hepatitis C virus (HCV) infection, human immunodeficiency virus (HIV) infection, and Huntington’s disease. The devices respond with either fluorescent reporter activation or biological activity (cell apoptosis and HLA-I downregulation).
1.2 A Protein-Based Strategy to Regulate RNA Translation and Protein Activity
RNA-encoded genetic circuits have the potential to limit the immunogenicity and mutagenicity issues of DNA-based systems and exhibit faster dynamics. They have thus become an appealing strategy for regulating synthetic systems, with a variety of applications, mostly in the biomedical field [11, 12]. However, the ability to achieve fine control over gene expression at the posttranscriptional and translational levels is limited by the small toolbox of regulatory devices available: ribozymes, aptamers, and riboswitches can modulate the translation of the associated output but cannot be interconnected to create modular and scalable circuits [13, 14]. Recently, RNA-binding proteins (RBPs) have been used to engineer RNA-encoded networks, enhancing the regulatory features of RNA expression [15]. We envisioned a multilayered system that adds further regulatory elements via protein–RNA and protein–protein interactions using a protein engineering strategy. Proteases recognize specific amino acid sequences, leading to proteolytic events. In theory, these protease-responsive sequences could be transferred to other proteins, modifying their structure without impairing their function. This is made possible by the availability of crystal structures for a large number of proteins, as well as of software to study and infer protein structure and sequence [16, 17], after in silico modification or via homology analysis for native protein sequences. Thus, this system is potentially highly modular, and more levels of regulation can be multiplexed. This study uses TEVp and other proteases from the same family, but the framework could be extended to endogenous proteins that are activated in specific cellular states. Here we report how to use protein engineering to create regulatory cascades that connect proteases to the RBPs L7Ae and Ms2-cNOT7, to tune output expression at the posttranscriptional and translational levels. L7Ae binds kink-turn motifs in the 5′ UTR of the target mRNA and blocks output translation [15]. We modified its structure to be TEV protease dependent. Ms2-cNOT7 is a fusion protein that binds Ms2 binding motifs in the 3′ UTR and removes the poly(A) tail of the target mRNA,
which is consequently degraded [15]. In the linker between Ms2 and cNOT7, we inserted the cleavage sites for four different cognate proteases: TEV, TUMV, TVMV, and SuMMV. We chose four viral proteases that are orthogonal to the host organism, so that they interfere neither with endogenous pathways nor with each other. Furthermore, we reengineered the protease structure to create protease–protease interactions. Finally, we created protease-based, multi-stage regulatory cascades, and here we report step by step how this was achieved.
2 Materials
2.1 Intracellular Protein-Sensor Devices to Regulate Cell Fate
2.1.1 DNA Cloning and Plasmid Construction
1. Plasmid backbones: pEntry-promoters, pDonor-gene, pDestination.
2. For pDonor-gene design using Golden Gate: Type IIS restriction enzymes BsaI and BsmBI.
3. Gateway (Life Technologies), In-Fusion (Clontech), and Golden Gate [18] systems.
4. LB medium and a 37 °C shaking incubator for bacterial growth.
5. Kit for DNA extraction and gel purification.
2.1.2 Mammalian Cell Culture and Transfection/Electroporation
1. Required cell lines: HEK293FT (Invitrogen), HeLa-based TZM-bl (NIH AIDS reagent program), and Jurkat (ATCC).
2. To maintain HEK293FT and HeLa-based TZM-bl cells, use Dulbecco’s modified Eagle medium (DMEM, Cellgro) supplemented with 10% FBS (Atlanta BIO), 1% penicillin/streptomycin/L-Glutamine (Sigma-Aldrich), and 1% non-essential amino acids (HyClone). Culture them at 37 °C in a 5% CO2-humidified incubator. To maintain Jurkat cells, use RPMI-1640 (ATCC) supplemented with 10% FBS (Atlanta BIO) and 1% non-essential amino acids (HyClone). Culture them at 37 °C in a 5% CO2-humidified incubator.
3. Doxycycline (Clontech): prepare a stock solution by dissolving 5 g in double distilled H2O to reach 1 mg/mL. Filter, aliquot, and store the solution at −20 °C in the dark.
4. Perform transfection of HEK293FT cells with Attractene (Qiagen) and electroporation of HeLa-based TZM-bl and Jurkat cells using the Neon Transfection System with 10 μL Neon Tip (Life Technologies).
5. For HIV-1 production and infection, use the corresponding HIV-1 infectious molecular clones of strains IIIB, JRCSF, LAI, and NL4.3 (NIH AIDS reagents program) and JetPRIME® reagent (Polyplus transfection #114-07).
6. Ultracentrifuge for virus concentration.
7. HIV-1 p24 ELISA Kit (PerkinElmer NEK050B001KT).
8. Cytofix/cytoperm solution (BD Biosciences #554722).
9. Anti-p24 antibody FITC-conjugated (KC57-FITC from Beckman Coulter #CO6604665).
10. 35 mm glass bottom dishes (Fluorodish), DMEM base medium (Cellgro) without supplements, Opti-MEM I reduced serum medium (Life Technologies), Trypsin (Invitrogen), 1× PBS, humidified incubator at 37 °C with 5% CO2, centrifuge.
11. Leica TCS SP5 II microscope equipped with an incubation chamber, using a 63× objective.
12. Evos Cell Imaging System (Life Technologies).
2.1.3 Flow Cytometry Staining, Acquisition, and Analyses
1. LSR Fortessa flow cytometer, equipped with 405, 488, and 561 nm lasers and LSR-II system (BD Biosciences).
2. SpheroTech RCP-30-5A beads (SpheroTech).
3. To determine surface expression of HLA-I molecules, use AlexaFluor® 647 mouse anti-human HLA A, B, C antibody clone W6/32 (Biolegend® #311414).
4. For apoptosis assays, stain post-transfected and PBS-washed cells with Pacific Blue-conjugated Annexin V (Life Technologies) before flow cytometry analysis.
5. FACSDiva8 software.
2.1.4 RNA Extraction, cDNA Synthesis, and qPCR
1. RNeasy Mini Kit (Qiagen).
2. RNase-free water.
3. QuantiTect Reverse Transcription Kit (Qiagen).
4. Fast SYBR Green Master Mix (ThermoFisher Scientific).
5. MicroAmp™ Fast Optical 96-Well Reaction Plate (0.1 mL) (ThermoFisher Scientific).
6. StepOnePlus™ 7500 Fast Real Time PCR machine (ThermoFisher Scientific).
7. Primers:
GAPDH Forward GAAGATGGTGATGGGATTTC
GAPDH Reverse GAAGTTGAAGGTCGGAGT
XCL-1 Forward CTTGGCATCTGCTCTCTCACT
XCL-1 Reverse AGGCTCACACAGGTCCTCTTA
2.2 Protein-Based Devices to Regulate RNA and Protein Activity
2.2.1 Cell Culture, Transient Transfection, Cell Imaging, and Flow Cytometry
1. Required cell line: HEK293FT (Invitrogen).
2. To maintain the cells, use Dulbecco’s Modified Eagle’s Medium (DMEM) with phenol red (Cellgro) supplemented with 10% fetal bovine serum (FBS, Atlanta Bio), 1% penicillin/streptomycin (Sigma), 1% L-Glutamine (Sigma), and 1% MEM nonessential amino acids (HyClone). Culture them at 37 °C in a 5% CO2-humidified incubator.
3. DMEM phenol red medium (Cellgro).
4. DMEM medium (Cellgro) supplemented with 10% fetal bovine serum (FBS, Atlanta Bio).
5. Trypsin-EDTA 0.25% phenol red (Gibco).
6. DPBS, no calcium, no magnesium (Gibco).
7. Attractene transfection reagent (Qiagen).
8. Lipofectamine 3000 transfection reagent (Invitrogen).
9. Opti-MEM reduced serum medium (Gibco).
10. 24-well plates, flat bottom (Corning).
11. Countess™ II Automated Cell Counter and Countess® Cell Counting Chamber Slides (Invitrogen).
12. Trypan blue.
13. EVOS® Cell Imaging System (Life Technologies) using a 10× objective with EVOS® Light Cubes Texas Red, GFP, and DAPI.
14. LSR Fortessa flow cytometer equipped with 405, 488, and 561 nm lasers (BD Biosciences).
15. FlowJo software version 10.5 to perform data analysis.
2.2.2 PCR and Plasmid Cloning
1. Accuprime PFX Supermix (ThermoFisher Scientific).
2. 5 U/μL of BamHI-HF and PacI restriction enzymes (NEB) and CutSmart® Buffer to a final concentration of 1×.
3. In-Fusion HD cloning kit (Clontech), used at a final concentration of 1×.
4. E. coli Stellar™ Competent Cells (Takara).
2.2.3 In Silico Protein Engineering
1. Personal computer equipped with PyMOL [16].
2. SWISS-MODEL server for protein modeling.
3 Methods
3.1 Intracellular Protein-Sensor Devices to Regulate Cell Fate
The framework combines sensing and actuation modules. The sensing modules are based on intrabodies and the actuation modules are based on the Tango-TEV technology. The build-and-test process comprises the following steps:
1. Genetic circuit construction.
2. Transduction/transfection.
3. Protein expression.
4. Protein binding (detection).
5. Protease cleavage.
6. Protein nuclear translocation.
7. Transcriptional activation.
3.1.1 DNA Cloning and Plasmid Construction
One intrabody is fused at the N-terminus to a membrane-tethered fluorescent tag (mKate) and at the C-terminus to a Tobacco Etch Virus (TEV) cleavage site (TCS) and to a GAL4-VP16 transcriptional activator, forming a chimeric protein sequestered in the cytosol. A second intrabody is fused to the TEV protease (TEVp), which recognizes and cleaves the TCS. The presence of the target protein in the cytosol and its subsequent recognition by the two intrabodies result in TEVp cleavage of the TCS and release of GAL4-VP16, which translocates into the nucleus and converts protein detection into programmed gene expression. Plasmids containing the sensing and actuation modules are built with the Gateway, In-Fusion, or Golden Gate systems described below. The membrane-tethered module is driven by the constitutive promoter hEF1, whereas the TEVp module is driven by either the hEF1 or the TET (doxycycline-responsive) promoter. Below we show examples of module combinations for the NS3, HTT, and HIV devices (see Notes 1–4):
1. NS3 device: FGFR-mKate-scFv35-LD0-TCS(L)-GAL4-VP16 and TEVp-LD0-scFv162 (see Notes 5 and 6).
2. HTT device: Happ1-LD0-TCS-(S) and DD-TEVp-LD0-Vl12.1.
3. Nef device: sdAb19-LD0-TCS(L)-GAL4-VP16 and TEVp-LD0-SH3.
Final plasmids can be obtained by a combined Golden Gate and Gateway strategy. First, chimeric proteins are generated by PCR amplification of each gene, adding restriction sites for the type IIS enzymes BsaI or BsmBI to perform Golden Gate reactions in the donor vector. Next, Gateway recombinations are performed with a plasmid containing the promoter and a destination vector following the manufacturer’s instructions.
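Before ordering primers for the Golden Gate step, it is common practice to check that each part is free of internal recognition sites for the Type IIS enzyme being used, since such sites would be cut during assembly. The sketch below is an illustrative helper, not part of the published protocol: it scans both strands of a sequence for the BsaI (GGTCTC) and BsmBI (CGTCTC) recognition sequences. The function names and the example sequence are hypothetical.

```python
# Recognition sequences (5'->3') of the two Type IIS enzymes named in the text.
TYPE_IIS_SITES = {"BsaI": "GGTCTC", "BsmBI": "CGTCTC"}

def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_internal_sites(seq):
    """List (enzyme, strand, 0-based start) for every recognition site in seq."""
    seq = seq.upper()
    hits = []
    for enzyme, site in TYPE_IIS_SITES.items():
        # A site on the bottom strand appears as its reverse complement on top.
        for motif, strand in ((site, "+"), (reverse_complement(site), "-")):
            start = seq.find(motif)
            while start != -1:
                hits.append((enzyme, strand, start))
                start = seq.find(motif, start + 1)
    return sorted(hits, key=lambda h: h[2])

# Hypothetical part sequence containing one site for each enzyme.
part = "ATGGGTCTCAAACCCGAGACGTTTTAA"
sites = find_internal_sites(part)
```

Any hit inside a coding sequence would typically be removed with a synonymous mutation before the part is cloned; sites deliberately added in the primers flanking the part are of course expected and should be excluded from the check.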
3.1.2 Transfection of HEK 293FT Cells and Fluorescence Imaging
1. Transfect HEK 293FT cells with the protein sensors in 24-well plate format using Attractene.
2. Prepare a mix of 300 ng of total DNA in DMEM base medium without supplements to a final volume of 60 μL.
3. Add 1.5 μL of Attractene to the DNA mix and vortex the samples promptly to mix. Incubate the complexes for 20–25 min at room temperature (see Note 7).
4. During the incubation, harvest the cells by trypsinization and seed 2 × 10^5 cells in 500 μL of complete culture medium per well.
5. Add the transfection complexes dropwise to the freshly seeded cells. Gently mix the plates and incubate at 37 °C in a humidified 5% CO2 incubator.
6. Supplement the cells with 1 mL of fresh growth medium 24 h post-transfection and analyze by flow cytometry after 48 h (see Note 9).
7. Perform confocal imaging with a Leica TCS SP5 II microscope equipped with an incubation chamber, using a 63× objective. Fluorescence and bright-field micrographs can be acquired with an Evos Cell Imaging System using a 10× objective.
3.1.3 Electroporation of TZM-bl and Jurkat Cells
1. Electroporate TZM-bl and Jurkat cells with the Neon Transfection System using a 10 μL Neon Tip.
2. For TZM-bl cells, prepare a total of 2 μg of DNA in a 1.5 mL tube.
3. Harvest 2 × 10^5 cells by trypsinization and centrifuge in PBS at 150 × g for 5 min at room temperature.
4. Remove the supernatant with a pipette, suspend the cells in buffer R, and then add the cells to the DNA tube, mixing gently.
5. Pick up the DNA and cell mix with the appropriate Neon Tip and transfer it to the electroporator. Apply a pulse (pulse voltage: 1005 V, pulse width: 35 ms, pulse number: 2) and transfer all the cells to the well.
6. For Jurkat cells, prepare a total of 4 μg of DNA in a 1.5 mL tube.
7. Harvest 3 × 10^5 cells and centrifuge in PBS at 150 × g for 5 min at room temperature.
8. Remove the supernatant with a pipette, suspend the cells in buffer R, and then add the cells to the DNA tube, mixing gently.
9. Pick up the DNA and cell mix with the appropriate Neon Tip and transfer it to the electroporator. Apply a pulse (pulse voltage: 1325 V, pulse width: 10 ms, pulse number: 3) and transfer all the cells to the well.
10. Infect the TZM-bl and Jurkat cells with HIV strains 6–12 h post-transfection, allowing for recovery after electroporation.
3.1.4 Flow Cytometry and Data Analysis
1. Acquire the cells with an LSR Fortessa flow cytometer equipped with 405, 488, and 561 nm lasers.
2. Collect 30,000–100,000 events per sample and acquire fluorescence data with the following cytometer settings: 488 nm laser and 530/30 nm bandpass filter for EYFP/EGFP, 561 nm laser and 610/20 nm filter for mKate, and 405 nm laser and 450/50 nm filter for EBFP.
3. Convert flow cytometry data from arbitrary units to compensated molecules of equivalent fluorescein (MEFL) using the TASBE characterization method [19, 20]. The TASBE method uses a strong constitutively expressed fluorophore, which serves as both a transfection marker and an indicator of relative circuit copy count.
4. An affine compensation matrix is computed from single-positive and blank controls.
5. FITC measurements are calibrated to MEFL using SpheroTech RCP-30-5-A beads.
6. Mappings from other channels to equivalent FITC are computed from co-transfection of constitutively expressed EBFP, EYFP, and mKate, each controlled by the hEF1a promoter on its own otherwise identical plasmid.
7. MEFL data are segmented by constitutive fluorescent protein expression into logarithmic bins at 10 bins/decade. Because the data are log-normally distributed, the geometric mean and variance are computed for the data points in each bin.
8. Observe the constitutive fluorescence distributions. For each data set, select a threshold below which data are excluded as being too close to the non-transfected population (e.g., 1 × 10^7 MEFL for NS3 and Nef HEK, 3 × 10^7 MEFL for HTT and Tat, 2 × 10^5 MEFL for TZM-bl, and 10^5 MEFL for Jurkat data sets).
9. Remove high outliers by excluding all bins with fewer than 100 data points. Both population and per-bin geometric statistics are computed over this filtered set of data.
10. Include at least three biological replicates for all experiments and indicate error bars using standard deviation. Variance should be broadly similar across groups; any differences should be reflected in the displayed standard deviation.
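The thresholding, binning, and per-bin statistics of steps 7–9 can be sketched as follows. This is an illustrative NumPy sketch, not the TASBE implementation: the function name, argument names, and default threshold are assumptions, and the input arrays are taken to be already compensated and calibrated to MEFL.

```python
import numpy as np


def binned_geometric_stats(constitutive_mefl, output_mefl,
                           threshold=1e7, bins_per_decade=10, min_events=100):
    """Per-bin geometric statistics of an output channel.

    Events are binned by the constitutive (transfection marker) channel on a
    log10 scale. Events below `threshold` and bins with fewer than
    `min_events` events are discarded. All defaults are illustrative.
    """
    c = np.asarray(constitutive_mefl, dtype=float)
    o = np.asarray(output_mefl, dtype=float)

    # Drop events too close to the non-transfected population, and
    # non-positive values that cannot be log-transformed.
    keep = (c >= threshold) & (o > 0)
    c, o = c[keep], o[keep]

    # Integer bin index: 10 bins per decade means a bin width of 0.1 in log10.
    idx = np.floor(np.log10(c) * bins_per_decade).astype(int)

    results = []
    for b in np.unique(idx):
        log_o = np.log10(o[idx == b])
        if log_o.size < min_events:
            continue  # sparsely populated bin, treated as an outlier bin
        results.append({
            "bin_center_mefl": 10 ** ((b + 0.5) / bins_per_decade),
            "n_events": int(log_o.size),
            "geo_mean_mefl": 10 ** log_o.mean(),
            # Geometric standard deviation, expressed as a fold factor.
            "geo_std_fold": 10 ** log_o.std(ddof=1),
        })
    return results
```

Working in log space is what makes the mean and spread "geometric"; the threshold should be chosen per data set, as described in step 8.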
3.1.5 Determination of HLA-I Surface Expression by Flow Cytometry
1. Determine surface expression of HLA-I molecules by staining, before fixation, with AlexaFluor® 647 mouse anti-human HLA A, B, C antibody clone W6/32 (dilution 1:20).
2. Quantify fluorescence signals with a BD LSR-II flow cytometer and FACSDiva8 software using the following settings: 640 nm laser and 670/14 nm filter.
3. Convert the flow cytometry data from arbitrary units to compensated molecules of equivalent soluble fluorochrome (MESF) using Spherotech RCP-30-5A-2 beads.
4. Analyze the data with FlowJo software. Determine the mean fluorescence intensity (MFI) and plot it for each condition. Include at least three biological replicates for all experiments and indicate error bars using standard deviation.
3.1.6 HIV Production and Infection
1. Produce HIV-1 strains by transfecting HEK-293T cells with the corresponding infectious molecular clones (NIH AIDS reagents program) using JetPRIME® reagent.
2. After 40 h, concentrate the virus preparations by ultracentrifugation for 1 h at 64,074 × g and 4 °C on a 20% sucrose cushion to remove proteins that are not associated with viral particles.
3. Titrate viral stocks by HIV-1 p24 ELISA.
4. For infection of TZM-bl cells, use a viral inoculum of 500 ng of p24 for each strain.
5. Forty hours after infection, harvest, fix, and permeabilize the cells with Cytofix/Cytoperm solution for 15 min at room temperature.
6. Determine the percentage of infected cells by intracellular staining of the viral protein p24 with a FITC-conjugated antibody (KC57-FITC, dilution 1:50) followed by flow cytometry.
3.1.7 Apoptosis Assays
1. Co-transfect the sensing-actuation devices with the pCMV-EGFP transfection marker.
2. Harvest the cells 48 h post-transfection, including those in the supernatant. Wash with PBS and stain with 2.5 μL of Pacific Blue-conjugated Annexin V in 50 μL of binding buffer for 10 min at room temperature.
3. Analyze the cells by flow cytometry. Gate on transfected cells (EGFP+) and calculate apoptosis induction (cell death) within this population, defined as the percentage of Pacific Blue-conjugated Annexin V-positive cells.
4. Include at least three biological replicates for all experiments and indicate error bars using standard deviation. Perform data analysis for the apoptosis assays using FlowJo software.
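The gating arithmetic in steps 3 and 4 is normally done in FlowJo, but the calculation itself is simple enough to sketch. The example below is an assumption-laden illustration only: the function names are hypothetical, and the gate values must be set from untransfected and unstained controls rather than taken from this sketch.

```python
import numpy as np


def percent_apoptotic(egfp, annexin_v, egfp_gate, annexin_gate):
    """Percentage of Annexin V-positive events within the EGFP+ population.

    `egfp` and `annexin_v` are per-event fluorescence values from the same
    sample; both gates are user-supplied thresholds (hypothetical here).
    """
    egfp = np.asarray(egfp, dtype=float)
    annexin_v = np.asarray(annexin_v, dtype=float)

    transfected = egfp > egfp_gate          # gate on the transfection marker
    n_transfected = int(transfected.sum())
    if n_transfected == 0:
        raise ValueError("no EGFP+ events; check the EGFP gate")

    n_apoptotic = int((annexin_v[transfected] > annexin_gate).sum())
    return 100.0 * n_apoptotic / n_transfected


def mean_and_sd(replicate_values):
    """Mean and sample standard deviation across biological replicates."""
    values = np.asarray(replicate_values, dtype=float)
    return values.mean(), values.std(ddof=1)
```

With three or more replicates, `mean_and_sd` gives the bar height and the error bar for each condition.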
3.1.8 RNA Extraction, cDNA Synthesis, and qPCR
1. Perform RNA extraction with the RNeasy Mini Kit. Wash the cells in PBS and add buffer RLT directly into the wells.
2. Incubate for 2 min at room temperature and collect the lysate with a sterile scraper. Proceed with the RNA extraction according to the manufacturer's instructions.
3. Elute RNA in 30 μL of RNase-free water to maximize the yield.
4. Store RNA samples at −80 °C.
5. Synthesize cDNA using the QuantiTect Reverse Transcription Kit according to the manufacturer's instructions. Perform this protocol on ice in an RNase-free environment to avoid RNA degradation.
6. Always prepare a negative control without Quantiscript Reverse Transcriptase to assess genomic DNA contamination of the RNA preparation.
7. Dilute cDNA 1:10 and perform qPCR using Fast SYBR Green Master Mix.
8. Load samples in a MicroAmp™ Fast Optical 96-Well Reaction Plate (0.1 mL) so that each well contains a final volume of 20 μL (10 μL SYBR Green Master Mix, 7 μL ddH2O, 1 μL of each primer, and 1 μL of template).
9. Run the experimental plate in a StepOnePlus™ 7500 fast machine.
10. Include a no-template control (blank).
11. Perform analyses by calculating 2^(−ΔΔCt) to measure the fold change of output expression (XCL-1) in the presence or absence of the target protein (Nef), after normalization of Ct values to endogenous housekeeping gene expression (GAPDH). Here ΔCt = Ct(XCL-1) − Ct(GAPDH) for each sample, and ΔΔCt = ΔCt(with Nef) − ΔCt(without Nef).
3.2 Protein-Based Devices to Regulate RNA and Protein Activity
3.2.1 Protein Structural Analysis and Plasmid Cloning
To build protease-responsive RBPs and protease-responsive proteases, insert the cognate protease cleavage sites into their amino acid sequence. The modification at the insertion site has to (a) minimally affect the protein structure and activity and (b) ensure protein disruption upon proteolytic cleavage. To test the efficiency of the devices, clone fluorescent reporters responsive to the RBPs' activity. All plasmids must be confirmed by sequencing.
1. The L7Ae crystal structure is reported under PDB ID 1RLG [21]. Visualize the structure in PyMOL to identify possible insertion loci for the TEV protease cleavage site (TCS): three loci can be identified. Synthesize the three L7Ae-CS variants as gBlocks and insert them into pL-A1 by In-Fusion between the BamHI and PacI restriction sites with a backbone:gBlock ratio of 1:2.
2. A reporter plasmid encoding two kink-turn motifs upstream of an EGFP fluorescent reporter (pBoxCDGC_2xKMet_DD-EGFP) is described in [15].
3. Insert cleavage sites for different proteases in the linker between Ms2 and cNOT7 by PCR: amplify both Ms2 and cNOT7 by PCR with Accuprime Pfx DNA Polymerase from pL-R1 and clone them by In-Fusion into pL-A1 between the BamHI and PacI restriction sites with a backbone:PCR1:PCR2 ratio of 1:2:2.
4. A reporter plasmid encoding eight Ms2-binding motifs downstream of an EGFP fluorescent reporter (pBoxCDGCmut_KMet-EGFP-8xMS2-pA) is described in [15].
5. Visualize the TVMV protease crystal structure in PyMOL: alternative insertion sites can be identified for the TUMVp cleavage site (TUCS): (a) TVMVp-TUCS1 between amino acid residues D26-G27, (b) TVMVp-TUCS2 between amino acid residues Q119-K120, and (c) TVMVp-TUCS3 between amino acid residues T173-N174. Insert the cleavage site by PCR into the three loci. Link the PCR products to the backbone by In-Fusion.
6. Design TVMV-responsive TEV protease variants (TEV-TVCS) by homology between TEV and TVMV, adding the cleavage sites in the same amino acid regions identified for TVMV.
7. As the TUMV crystal structure has not been solved, infer it by homology with TEV using SWISS-MODEL.
3.2.2 Protein–Protein Devices Testing
To test protein–protein interaction devices, measure fluorescent reporter expression by flow cytometry at steady state 48 h post-transfection (see Notes 7–14).
Fig. 2 (a) Crystal structure of L7Ae bound to its RNA target with the possible insertion sites highlighted. (b) Graphical representation of the RNA-encoded circuit regulated by a TEV-responsive L7Ae. State 1: In the absence of TEVp, L7Ae_TCS represses EGFP translation. State 2: When TEVp is present, it cleaves L7Ae, rendering it nonfunctional, and EGFP levels increase
Fig. 3 Graphical representation of protease-based cascades. In all cascade variants, EGFP is expressed at stage 0 and is downregulated by Ms2-cNOT7 at stage 1. (a) At stage 2, Ms2-TVCS-cNOT7 activity is disrupted by TVMV-TUCS and EGFP expression is restored; at stage 3, EGFP expression is knocked down again as TVMV-TUCS is repressed by TUMV. (b) At stage 2, Ms2-TCS-cNOT7 activity is disrupted by TEV-TVCS and EGFP expression is restored; at stage 3, EGFP expression is knocked down again as TEV-TVCS is repressed by TVMV. (c) At stage 2, Ms2-TUCS-cNOT7 activity is impaired by TUMV-TCS and EGFP expression increases; at stage 3, TUMV-TCS is repressed by TEV, and EGFP is downregulated
1. Perform transfections to test L7Ae-CS with Attractene transfection reagent in HEK293FT cells in 24-well plate format. Aliquot 60 μL of uncomplemented DMEM for each transfection mix. Add a total of 400 ng of DNA per reaction mix (50 ng of fluorescent reporter, 150 ng of L7Ae variant, 60 ng of wild-type protease, 50 ng of transfection marker, empty plasmid to 400 ng) followed by 1.5 μL of Attractene transfection reagent. Vortex all the reaction mixes and incubate for 15 min (Fig. 2).
2. Perform transfections to test Ms2-cNOT7 constructs and protease–protease circuits with Lipofectamine 3000 transfection reagent in HEK293FT cells in 24-well plates. Prepare two master mixes: (a) 25 μL of Opti-MEM and 1 μL of P3000 Reagent for each sample; (b) 25 μL of Opti-MEM and 0.75 μL of Lipofectamine 3000 for each sample. Aliquot 26 μL of master mix (a) per sample into separate Eppendorf tubes. Add 400 ng of DNA per reaction mix to master mix (a) (25 ng of fluorescent reporter, 50 ng of Ms2-CS-cNOT7 variant, 30 ng of engineered protease, 50 ng of wild-type protease, 50 ng of transfection marker, empty plasmid to 400 ng). Then add 25.75 μL of master mix (b) to each sample and mix by vortexing. Incubate the reactions for 15 min (Fig. 3).
3. During the 15 min incubation, plate HEK293FT cells in 24-well plates. First, remove the medium from the flask, then gently wash the flask with PBS (10 mL for T75 and 5 mL for T25) and add trypsin (1.5 mL for T75 and 0.5 mL for T25). Keep the flasks in the incubator for 2 min. Add fresh complemented DMEM to the trypsinized cells (3.5 mL for T25 and 5.5 mL for T75). Mix 10 μL of resuspended cells with 10 μL of Trypan Blue, load 10 μL of the mix into a Countess® Cell Counting Chamber Slide, and insert the slide into the Countess® Cell Counter II for cell counting. Seed a total of 140,000 cells/well in a final volume of 500 μL of complete DMEM.
4. After 24 h, add 1 mL of fresh complemented DMEM to each well.
5. After 48 h, observe the cells with the EVOS® Cell Imaging System and acquire images of the transfection in all the fluorescent channels.
6. Then analyze the cells by flow cytometry. First, remove the DMEM from the wells and add 50 μL of trypsin to each well. Keep the plates in the incubator for 2 min. Add 300 μL of DMEM supplemented with 10% FBS to each well. Transfer the cells into FACS tubes and keep them on ice. Vortex each tube for a few seconds before loading it into the flow cytometer. Record 20,000 events in the single-cell population.
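The DNA amounts in steps 1 and 2 are topped up with empty plasmid to 400 ng, and the Lipofectamine 3000 master mixes scale with the number of samples. A small helper can make this bookkeeping explicit. This is a sketch under stated assumptions: the function names are hypothetical, and the 10% pipetting overage is an illustrative choice that the protocol does not specify.

```python
def filler_dna_ng(components_ng, total_ng=400.0):
    """Amount of empty plasmid (ng) needed to reach the total DNA mass."""
    used = sum(components_ng.values())
    if used > total_ng:
        raise ValueError(f"components already exceed {total_ng} ng ({used} ng)")
    return total_ng - used


def scale_master_mix(per_sample_ul, n_samples, overage=0.10):
    """Scale per-sample reagent volumes (uL) to a master mix.

    `overage` is an assumed extra fraction to cover pipetting losses.
    """
    factor = n_samples * (1.0 + overage)
    return {name: round(vol * factor, 2) for name, vol in per_sample_ul.items()}


# DNA amounts quoted in step 1 (L7Ae-CS test) and step 2 (protease cascades).
l7ae_mix = {"fluorescent reporter": 50, "L7Ae variant": 150,
            "wild-type protease": 60, "transfection marker": 50}
cascade_mix = {"fluorescent reporter": 25, "Ms2-CS-cNOT7 variant": 50,
               "engineered protease": 30, "wild-type protease": 50,
               "transfection marker": 50}

print(filler_dna_ng(l7ae_mix))     # 90.0 ng of empty plasmid
print(filler_dna_ng(cascade_mix))  # 195.0 ng of empty plasmid

# Master mix (a) from step 2, scaled for 10 samples with the assumed overage.
print(scale_master_mix({"Opti-MEM": 25.0, "P3000 Reagent": 1.0}, n_samples=10))
```

Keeping the total DNA mass constant across conditions is what the empty plasmid is for, so the filler amount changes whenever a component is added or omitted.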
4 Notes
1. To optimize sensing-actuation performance, we designed variants of the device by tuning several of its features. TEVp was fused to the N-terminus or C-terminus of the intrabody with a flexible glycine-serine linker sequence.
2. To obtain devices with a significant ON/OFF ratio for output expression, we tested flexible linker sequences of different lengths to maximize the likelihood of intrabody-protein and TEVp-TCS interactions, and tested TCS mutant-TEVp complexes with different binding constants.
3. Constitutive TEVp expression induced significant activation of the reporter gene (up to 100-fold ON/OFF induction), indicating that the TEV cleavage site is accessible in the design configuration and suggesting that careful tuning of protease expression is critical to maximize the signal-to-noise ratio.
4. The selection of target proteins for testing our sensing-actuation framework was based on (a) partial or complete cytosolic localization and (b) the existence of intrabodies binding two different epitopes of the antigen. Following these criteria, we engineered genetic devices that recognize NS3, HTT, and Tat/Nef proteins, specific for HCV, Huntington's disease, and HIV, respectively.
5. We confirmed the NS3-intrabody interaction by fusing a BFP (Blue Fluorescent Protein) tag to the N-terminus of nNS3 (BFP-nNS3) in a colocalization assay. Colocalization imaging was performed after transfecting 293FT cells in 35 mm glass-bottom dishes with NS3, BFP-nNS3, and the anti-NS3 intrabody constructs. Cells were transfected with Lipofectamine LTX following the manufacturer's instructions.
6. We found that a low-affinity TEV cleavage site (TCS-L) and low sensor concentration provide the best operating conditions, in agreement with the conclusion from a predictive computational model that we implemented.
7. Transfection efficiency improves if the mix is vortexed for 3–5 s immediately after Attractene/Lipofectamine 3000 addition.
8. Transfection quality improves when plasmid DNA is pipetted directly into mix (a) and not onto the sidewall of the 1.5 mL tube.
9. FACS analysis quality improves when 2 mM EDTA is added to the medium, as it reduces the formation of cell clumps.
10. Washing cells with PBS before detaching them with trypsin reduces clump formation.
11. Pour PBS onto the flask's sidewall and not directly onto the cell layer, because HEK cells detach very easily from the flask surface and could be lost in this step.
12. When aliquoting the cells for transfection, take extra care to resuspend them several times with serological pipettes to disrupt clumps and avoid sedimentation.
13. Filtering the trypan blue increases the accuracy of cell counting.
14. When preparing cells to seed for transfection, prepare a mix of medium and cells at the correct density, then aliquot 500 μL of the mix per well.
References
1. Ausländer D, Eggerschwiler B, Kemmer C, Geering B, Ausländer S, Fussenegger M (2014) A designer cell-based histamine-specific human allergy profiler. Nat Commun 5:4408
2. di Bernardo D, Marucci L, Menolascina F, Siciliano V (2012) Predicting synthetic gene networks. Methods Mol Biol 813:57–81
3. Tigges M, Marquez-Lago TT, Stelling J, Fussenegger M (2009) A tunable synthetic mammalian oscillator. Nature 457:309–312
4. Siciliano V, Garzilli I, Fracassi C, Criscuolo S, Ventre S, di Bernardo D (2013) MiRNAs confer phenotypic robustness to gene networks by suppressing biological noise. Nat Commun 4:2364
5. Kipniss NH, Dingal PCDP, Abbott TR, Gao Y, Wang H, Dominguez AA, Labanieh L, Qi LS (2017) Engineering cell sensing and responses using a GPCR-coupled CRISPR-Cas system. Nat Commun 8:2212
6. Siciliano V, DiAndreth B, Monel B, Beal J, Huh J, Clayton KL, Wroblewska L, McKeon A, Walker BD, Weiss R (2018) Engineering modular intracellular protein sensor-actuator devices. Nat Commun 9:1881
7. Scheller L, Strittmatter T, Fuchs D, Bojar D, Fussenegger M (2018) Generalized extracellular molecule sensor platform for programming cellular behavior. Nat Chem Biol 14:723–729
8. Courbet A, Endy D, Renard E, Molina F, Bonnet J (2015) Detection of pathological biomarkers in human clinical samples via amplifying genetic switches and logic gates. Sci Transl Med 7:289ra83
9. Sedlmayer F, Fussenegger M (2017) Synthetic biology: a probiotic probe for inflammation. Nat Biomed Eng 1:0097
10. Schwarz KA, Daringer NM, Dolberg TB, Leonard JN (2016) Rewiring human cellular input–output using modular extracellular sensors. Nat Chem Biol 13:202
11. McNamara MA, Nair SK, Holl EK (2015) RNA-based vaccines in cancer immunotherapy. J Immunol Res 2015:794528
12. Sahin U, Karikó K, Türeci Ö (2014) mRNA-based therapeutics: developing a new class of drugs. Nat Rev Drug Discov 13:759–780
13. Cella F, Wroblewska L, Weiss R, Siciliano V (2018) Engineering protein-protein devices for multilayered regulation of mRNA translation using orthogonal proteases in mammalian cells. Nat Commun 9:1–9
14. Culler SJ, Hoff KG, Smolke CD (2010) Reprogramming cellular behavior with RNA controllers responsive to endogenous proteins. Science 330:1251–1255
15. Wroblewska L, Kitada T, Endo K, Siciliano V, Stillo B, Saito H, Weiss R (2015) Mammalian synthetic circuits with RNA binding proteins for RNA-only delivery. Nat Biotechnol 33:839–841
16. PyMOL | pymol.org. https://pymol.org/2/. Accessed 30 Oct 2019
17. Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, Heer FT, de Beer TAP, Rempfer C, Bordoli L, Lepore R, Schwede T (2018) SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res 46:W296–W303
18. Engler C, Marillonnet S (2014) Golden Gate cloning. Methods Mol Biol 1116:119–131
19. Beal J, Wagner TE, Kitada T, Azizgolshani O, Parker JM, Densmore D, Weiss R (2015) Model-driven engineering of gene expression from RNA replicons. ACS Synth Biol 4:48–56
20. Beal J, Weiss R, Yaman F, Davidsohn N, Adler A (2012) A method for fast, high-precision characterization of synthetic biology devices. MIT CSAIL Tech Report 2012-008
21. Moore T, Zhang Y, Fenley MO, Li H (2004) Molecular basis of box C/D RNA-protein interactions; cocrystal structure of archaeal L7Ae and a box C/D RNA. Structure 12:807–818
22. Caliendo F, Dukhinova M, Siciliano V (2019) Engineered cell-based therapeutics: synthetic biology meets immunology. Front Bioeng Biotechnol 7:43
23. Cella F, Siciliano V (2019) Protein-based parts and devices that respond to intracellular and extracellular signals in mammalian cells. Curr Opin Chem Biol 52:47–53
DNA assembly automated selection enzyme...................................................... 168–170 primer ....................................................... 170–174 batch part standardization ............................ 158–162 circular plasmids ..................................................... 167 EGF................................................................ 157–158 NGS ........................................................................ 168 and strain development................................. 149–150 synthetic biology projects ...................................... 157 type-2S assembly ........................................... 162–165 verification .............................................................. 168 DNA verification .......................................................... 168 Dynamic models........... 67, 74, 119, 120, 123, 223, 226
E Edinburgh Genome Foundry (EGF)................ 157–158, 168, 341–343
F Feedback burden-based biomolecular ................................... 314 burden-driven................................................ 323–327 control algorithms................................................. 215–218 laws...................................................................... 32 negative loop ........................................................ 5, 34 positive................................................. 4, 9–11, 17–19 stability of oscillations ................................................ 2 three-gene negative.................................................... 4 Feedback control controller PI............................................................... 216–217 relay........................................................... 215–216 implementation ........................................................ 34 law ............................................................................. 32 MPC............................................................... 217–218 Focal point...................................................................... 15 Funding fee-for-service model ............................................. 145 government ................................................... 143–145 project partnerships................................................ 145
G Gene circuits cell-free ................................................................... 190 CLE approach .......................................................... 43 construction ........................................................... 337 design.................................................... 123, 267, 270 heterologous genes ....................................... 277–278 living cells ............................................................... 175 modeling................................................................... 92 QS/Fb circuit..................................................... 43, 44
repressilator .............................................................. 91 RNA-seq (see RNA sequencing (RNA-seq)) simulation of an inducible gene ................... 278–282 stochastic simulations............................................... 42 whole-cell model .................................................... 269 See also Synthetic circuits Gene expression burden burden-driven feedback loop cellular burden.......................................... 325–327 library of promoters ......................................... 325 plasmid...................................................... 323–325 burden-responsive promoter biosensor................................................... 322–323 genomic context............................................... 322 RNA-seq results ............................................... 322 cell engineering ...................................................... 314 host responds ................................................ 317–321 medium................................................................... 316 molecular cloning.......................................... 315–316 RNA-seq library preparation ........................ 316–317 strains ...................................................................... 315 Gene network retroactivity (see Retroactivity) synthetic.................................................................. 110 Gene regulatory networks ................................. 3, 14, 93, 101, 120, 308 Genetic parts descriptions............................................................. 151 DNA constructs ..................................................... 157 receptor vector ....................................................... 171 RNA-seq (see RNA sequencing (RNA-seq)) standardization .............................................. 158–159 Gene transcription networks ............... 32, 294, 299, 304 Gillespie algorithm choosing a reaction ................................................ 
104 iterating................................................................... 105 Markov process....................................................... 101 rate vector function................................................ 114 resampling time trace data..................................... 116 SSA...................................................................... 42, 77 stochastic algorithm ................................................. 29 stoichiometry matrix ..................................... 115, 116 system update ......................................................... 104 time to next reaction..................................... 103–104 time-trace simulating .................................... 102–103 Global optimization ............................................ 120, 226 Growth models ......................................... 269, 274, 276, 280, 282, 288
H Hardware ................. 138, 141, 142, 146, 149, 191, 193 control layer pressure regulation ........................... 198 flow layer pressure .................................................. 198 Hill function ................................ 96, 111, 112, 299, 308
Host–circuit models bacterial growth ............................................ 269–276 gene circuits heterologous genes .................................. 277–278 inducible gene .......................................... 278–282 model complexity................................................... 269 transcriptional logic gates ............................. 281–288 T7 RNA polymerase .............................................. 268
I Intracellular protein-sensor acquisition .............................................................. 335 analyses ................................................................... 335 apoptosis assays ...................................................... 340 cDNA synthesis ..................................... 335, 340–341 data analysis ............................................................ 339 DNA cloning ................................................. 334, 337 electroporation ...................................... 334–335, 338 flow cytometry .............................................. 335, 339 fluorescence imaging..................................... 337–338 of HEK 293FT cells ...................................... 337–338 HIV production and infection .............................. 340 HLA-I surface expression ............................. 339–340 mammalian cells culture ............................... 334–335 plasmid construction..................................... 334, 337 qPCR ..................................................... 335, 340–341 RNA extraction ..................................... 335, 340–341
L Laboratory automation automation field cell-free systems........................................ 148–149 DNA assembly.......................................... 149–150 metrology ................................................. 150–151 microfluidics ............................................. 148–149 ML .................................................................... 148 open science ..................................................... 150 standardization ......................................... 150–151 strain development................................... 149–150 build ............................................................... 139–140 business plan education .................................................. 146–147 funding ..................................................... 143–145 partnerships .............................................. 145–146 system maintenance and personnel ................. 147 design...................................................................... 139 ML .......................................................................... 140 strategy........................................................... 141–144 synthetic biology .................................................... 138 test........................................................................... 140 Liquid handling................................................... 149, 152
M Machine learning algorithms............................................................... 148 automation need .................................................... 140 gene circuit design ................................................. 267 scientific experiments design ................................. 148 test cycle ................................................................. 140 trial-and-error approach ........................................ 148 Mammalian cell culture and transfection ................................ 334–335 electroporation .............................................. 334–335 microfluidics/microscopy (see Microfluidics) segmentation .......................................................... 215 tissue culture........................................................... 211 Mammalian synthetic biology intracellular protein-sensor devices ............. 331–335, 337–341 protein-based devices.............................................. 336, 341–344 strategy...................................................... 333–334 synthetic devices ....................................... 331–333 Mathematical modeling ......................... 2, 109, 110, 267 Metabolic engineering ................................................. 146 Metrology ............................................................ 150–152 Microfluidics and cell-free systems.............................. 148–149, 190 chip fabrication....................................................... 206 chip loading ............................................................ 206 pins preparation and wetting................... 209–210 preculture of cells ..................................... 211–212 shear-free cell loading ...................................... 211 computational algorithms...................................... 215 connectors .............................................................. 193 device fabrication ................................................... 
244 experiments calibration ................................................. 252–254 cells loading ...................................................... 254 connecting syringes to the chip....................... 252 fabrication................................................. 249–250 microfluidic chip wetting................................. 251 microscope setup...................................... 254–255 overnight culture.............................................. 250 setup.......................................................... 244–245 syringe preparation........................................... 251 feedback control algorithms MPC.......................................................... 217–218 PI controller ............................................. 216–217 relay controller ......................................... 215–216 hardware ................................................................. 193 PDMS ..................................................................... 205 time-lapse................................................................ 207
actuation system ............................................... 214 chip positioning....................................... 213, 214 microscope specs .............................................. 214 settings ...................................................... 214–215 tubes.......................................................... 212–213 and turbidostats........................................................ 44 Mixed integer nonlinear programming ...................... 122 Model calibration computational tools ............................................... 244 image processing cell-tracking and extraction ..................... 256–257 fine-tuning ........................................................ 255 segmentation ............................................ 255–256 microfluidic device fabrication ............................................. 244 experimental setup ................................... 244–245 optimal experimental design......................... 258–259 parameter estimation .................................... 257–258 practical identifiability ................................... 247–248 sensitivity analysis .......................................... 246–247 structural identifiability .......................................... 245 test case ................................................................... 242 Model order reduction .......................... 46, 48, 295–296 Model predictive control (MPC) ....................... 217–218 Modularity ........................ 119, 149, 150, 267, 293, 332 Moieties .......................................................................... 75 MPC, see Model predictive control (MPC) Multi-objective optimization.............................. 131–133
N Network control............................................................... 4 Boolean models .................................................. 31–32 strategies ............................................................. 30–31 synthetic circuits................................................. 32–33 Network dynamics analysis attractors and their stability ......................... 20–21 formal verification of network properties ................................................ 23–26 modular analysis ........................................... 26–28 state transition graphs ..................... 21–23, 28–30 control Boolean models ............................................ 31–32 strategies ....................................................... 30–31 synthetic circuits........................................... 32–33 Next generation sequencing (NGS) .................. 168, 176
O ODEs, see Ordinary differential equations (ODEs) Optimal experimental design AMIGO2 ....................................................... 226, 229 candidate models.................................................... 223 code structure................................................ 227–229
dynamics ................................................................. 222 inducible promoter modeling ...................... 229–231 local methods ......................................................... 225 model building............................................................. 221 calibration ................................................. 258–259 selection .................................................... 231–235 parameter estimation ................... 224, 225, 235–238 probability density function................................... 223 stochastic global optimization algorithms ............ 226 toolbox download and license....................................... 227 requirements and installation guide................ 227 Ordinary differential equations (ODEs) cyber-physical platform .......................................... 242 Matlab solvers......................................................... 123 nonlinear deterministic .......................................... 120 solving................................................................. 98–99 writing....................................................................... 98
P Parameter space analysis ....................... 22, 99–100, 106, 121, 260, 261 PDMS, see Polydimethylsiloxane (PDMS) Photolithography consumables ........................................................... 192 control mold fabrication............................... 195–196 flow mold fabrication.................................... 194–195 machines ................................................................. 192 mask fabrication ..................................................... 194 PI controller, see Proportional-Integral (PI) controller Piecewise-linear differential equation (PLDE) models cyclic orbit ................................................................ 11 discontinuities .......................................................... 35 IRMA circuit ............................................................ 19 oscillator with positive feedback ....................... 17–19 toggle switch ...................................................... 15–17 Polydimethylsiloxane (PDMS) bonding .................................................................. 197 casting and curing ......................................... 196–197 degassing................................................................. 249 elastomer................................................................. 193 mixing ..................................................................... 249 replica molding cleaning and bonding ...................................... 209 microfluidic device preparation ............... 207–208 silanization........................................................ 207 Practical identifiability ............... 223, 242, 244, 247–248, 261, 264 Proportional-integral (PI) controller .................... 31, 32, 216–217 Protein-based devices cell culture .............................................................. 336 flow cytometry ....................................................... 336 PCR......................................................................... 336
plasmid cloning ..................................... 336, 341–342 protein–protein devices testing .................... 342–344 protein structural analysis ............................. 341–342 in silico protein engineering .................................. 336 transient transfection cell imaging ........................ 336 Protein-protein regulation......................... 333, 342–344 Protein-RNA regulation .............................................. 333 Protein sensor-actuator...................... 332–334, 337–341
Q QSSA, see Quasi steady-state approximation (QSSA) Qualitative modeling Boolean models .................................................... 6–13 DNA synthesis............................................................ 1 dynamic properties............................................. 34–35 gene expression dynamics .......................................... 2 network dynamics analysis .......................................................... 20–30 control .......................................................... 30–33 PLDE models ................................................ 3, 15–19 reviews ...................................................................... 34 synthetic regulatory circuits................................... 4–6 Quasi steady-state approximation (QSSA) ................... 75
R Relay controller ................................................... 215–216 Reproducibility.......................... 137–139, 142, 149–151 Resource allocation ...................................... 33, 269, 272 Restriction digest analysis DNA assembly verification .................................... 168 web application ............................................. 169–170 Retroactivity biochemical reactions............................................. 295 contraction theory ................................................. 296 error ............................................................... 303–304 external .......................................................... 299–301 intermodular connections............................. 304–307 internal........................................................... 298–299 mathematical model of modules .................. 296–297 model order reduction.................................. 295–296 modularity .............................................................. 293 perturbations .......................................................... 294 scaling and mixing......................................... 301–302 time-scale separation ..................................... 295–296 RNA-binding protein .................................................. 333 RNA sequencing (RNA-seq) burden-driven feedback ......................................... 323 characterize genetic parts....................................... 176 computational tool................................................. 176 host responds library sequencing ............................................ 320 plate-reader data............................................... 321 promoter characterization ............................... 321
quality control and alignments........................ 320 sample preparation ................................... 319–320 time-course assay...................................... 318–319 transcription profiles ........................................ 321 transformation .......................................... 317–318 library preparation consumables ............................................. 316–317 equipment......................................................... 317 materials genetic analyzer installation..................... 178–179 sequencing data ................................................ 179 software dependencies ..................................... 178 methods data preprocessing.................................... 182–183 differential gene expression ..................... 183–184 initial workflow setup............................... 179–181 promoters and terminators .............................. 184 response function ..................................... 184–185 temporary files and logs removing .................. 185 transcription profiles ........................................ 183 in vivo assay ............................................................ 314
S Sanger sequencing necessary data files.................................................. 171 output ............................................................ 173–174 primer-free regions........................................ 171–173 web application ...................................................... 173 Sensor gene expression burden ......................................... 314 genetic logic gates .................................................. 184 intracellular protein (see Intracellular protein-sensor) small-molecule........................................................ 176 Soft lithography clean/semiclean room ........................................... 249 consumables .................................................. 192–193 device fabrication bonding of PDMS ........................................... 197 casting and curing, PDMS....................... 196–197 silanization........................................................ 196 machines ................................................................. 192 Software ........................................................................ 193 automatic syringe movement ................................ 214 code implementation ......................................... 46–49 dependencies .......................................................... 178 DSGRN .................................................................... 22 FACSDiva8............................................................. 335 FlowJo .................................................................... 340 and hardware components..................................... 138 MEIGO .................................................................. 245 Snapgene Viewer .................................................... 160 spectrum companies............................................... 146 SynBioHub ............................................................. 
151 tools ........................................................................ 139
web-based ............................................................... 162 SSA, see Stochastic simulation algorithm (SSA) Standardization .................................. 138, 149–152, 157 genetic part.................................................... 158–159 necessary data files......................................... 159, 160 output ............................................................ 161–162 part regions.................................................... 160–161 web application ...................................................... 161 State transition graphs ............................ 3, 8–11, 13, 16, 18, 20–23, 28–30 Steady-state gene expression batch cell-free reactions ......................................... 190 device operation cell-free expression ................................... 199–201 filling control lines ................................... 198–199 flow lines filling ................................................ 199 experimental reagents ............................................ 194 hardware setup ....................................................... 198 host cell................................................................... 189 microfluidic.................................................... 193, 194 microscope hardware ............................................. 193 photolithography consumables ..................................................... 192 control mold fabrication .......................... 195–196 flow mold fabrication............................... 194–195 machines ........................................................... 192 mask fabrication ............................................... 194 soft lithography consumables ............................................. 192–193 device fabrication ..................................... 196–198 machines ........................................................... 
192 software................................................................... 193 Stochastic modeling ....................................................... 92 CME ......................................................................... 42 continuous deterministic approach ......................... 42 gene expression noise............................................... 41 materials.............................................................. 43–52 methods .............................................................. 52–71 spatial ........................................................................ 42 Stochastic simulation algorithm (SSA) ................. 42, 43, 78, 79, 83 Stochastic simulations characterization of results ............................. 105–106 circuit performance .................................................. 44 dynamic model ......................................................... 67 gene circuits.............................................................. 42 Gillespie algorithm ........................................ 102–105 parameter scan............................................... 106–107 stochastic notation ........................................ 101–102 time-trace....................................................... 102–103 SYNBADm initialization ................................................... 123–124 installation ..................................................... 123–124
modeling framework..................................... 120–121 optimization problem design......................................... 121–122 solvers ....................................................... 122–123 oscillator design library of components.............................. 124–125 objective function..................................... 125–126 problem definition ........................................... 124 simulating the dynamics, circuit...................... 128 single objective optimization problem ... 126–128 Pareto front of solutions........................................ 134 practical examples................................................... 123 switch-like circuit design definition .......................................................... 129 library of components.............................. 129–130 multi-objective optimization problem .... 131–133 objective functions ........................................... 130 SynBioHub ................................................................... 151 Synthetic biology application .................................................................. 2 batch cell-free reactions ......................................... 190 biocircuits ............................................................... 119 cell-free systems...................................................... 189 cyber-physical platform .......................................... 242 DBT ........................................................................ 137 design-build-test-learn cycle........................... 92, 267 federal investments................................................. 143 genetic design......................................................... 177 homeostasis .............................................................. 32 laboratory automation (see Laboratory automation) microfluidics ........................................................... 
190 OED .............................................................. 241, 242 on-line vs. off-line .................................................. 243 photolithographic steps ......................................... 191 promoters and regulators ...................................... 189 sequencing methods .............................................. 176 stochastic perturbations ........................................... 22 toggle switch .............................................................. 4 two-layer microchemostat design ................ 190, 191 Synthetic circuits characterizing promoter and terminator ..... 176, 177 contextual effects.................................................... 175 control ................................................................ 32–33 dynamic properties..................................................... 3 genetic parts and devices ....................................... 176 Hill function .................................................. 111, 112 interactions ............................................................... 21 inverse transform sampling.................................... 113 materials built-in/custom-coded functions ..................... 93 computing long-term statistics.................... 49–50 model in proper form .................................. 44–49 noise .............................................................. 49–50 software......................................................... 50–52
memorylessness property....................................... 113 methods abstracting the circuit .................................. 94–95 compilation......................................................... 67 deterministic solution ................................ 98–100 mass action equations .................................. 95–96 models to redesign ................................... 107–110 OpenFPM client program ........................... 52–67 parameter estimation ................................... 96–97 simulation ..................................................... 67–71 stochastic simulations............................... 100–107 models....................................................................... 91 novel gene circuits.................................................... 92 parameter values..................................................... 110 redesign..................................................................... 92 response function .......................................... 176, 178 structure and behavior ............................................... 2 workflow ........................................................ 176, 177 See also Qualitative modeling Synthetic construct ............................ 314, 317–318, 322 Synthetic gene circuits, see Synthetic circuits
T Throughput ............................... 137–143, 149–152, 167 Trade-offs ........................................ 2, 83, 120, 122, 304 Transcriptional logic gates circuit function nutrient quality......................................... 286–287 RBS ................................................................... 286 host-aware gate AND ................................................................. 284 NAND .............................................................. 285 NOT ................................................................. 283 Type-2S assembly pre-validation necessary data files.................................................. 163 output ..................................................................... 165 restriction sites........................................................ 157 web application ............................................. 163–165
W Whole-cell modeling........................................... 269, 270