Probabilistic Structures in Evolution (EMS Series of Congress Reports, 17) 9783985470051, 3985470057

This volume collects twenty-one survey articles about probabilistic aspects of biological evolution. They cover a large

English Pages 502 [503]

Table of contents :
Contents
Preface
1 Accessibility percolation in random fitness landscapes
1.1 Introduction
1.2 Accessibility percolation in sequence space
1.3 Accessibility percolation on trees
1.4 Correlated fitness landscapes
1.5 Paths with valley crossing
1.6 Summary and conclusions
References
2 Branching random walks in random environment
2.1 Introduction
2.2 Spatial branching random walks in random environment
2.3 Multitype branching random walk in random potential
2.4 The PAM on finite graphs
2.5 Higher moments of the numbers of particles
2.6 Further perspectives
References
3 Microbial populations under selection
3.1 Introduction
3.2 Lenski's long-term evolution experiment
3.3 A host-parasite model with balancing selection and reinfection
References
4 The population genetics of the CRISPR-Cas system in bacteria
4.1 Introduction
4.2 Classification of CRISPR systems and components
4.3 The pattern of spacer arrays within CRISPR
4.4 Outlook
References
5 Evolution of altruistic defence traits in structured populations
5.1 Introduction
5.2 Asymptotic frequencies of altruistic defence traits
5.3 Fixation/extinction of the average altruist frequency
5.4 Convergence to a forest of trees of excursions
5.5 Differentiability of semigroups
References
6 Stochastic processes and host-parasite coevolution: Linking coevolutionary dynamics and DNA polymorphism data
6.1 Introduction
6.2 Intrinsic stochasticity
6.3 Extrinsic stochasticity
6.4 Conclusion
References
7 Stochastic models for adaptive dynamics: Scaling limits and diversity
7.1 Theories of evolution
7.2 The individual based model
7.3 Scaling limits
7.4 The polymorphic evolution sequence
7.5 In one step
7.6 Escape through a fitness well
References
8 Genealogies and inference for populations with highly skewed offspring distributions
8.1 Multiple merger coalescents in population genetics
8.2 Inference based on the site-frequency spectrum
8.3 Multiple loci, diploidy and $\Xi$-coalescents
8.4 Discussion: Are they really out there?
References
9 Multiple-merger genealogies: Models, consequences, inference
9.1 Multiple-merger coalescents
9.2 Modelling multiple mergers for variable population size
9.3 How much genetic information is contained in a subsample?
9.4 Model selection between $n$-coalescents
9.5 Partition blocks and minimal observable clades
References
10 Diploid populations and their genealogies
10.1 Introduction
10.2 Haploid models
10.3 Diploid models
10.4 Various examples of diploid population models
10.5 Discussion and further connections to the literature
References
11 Probabilistic aspects of $\Lambda$-coalescents in equilibrium and in evolution
11.1 Introduction
11.2 Dust-free $\Lambda$-coalescents
11.3 $\Lambda$-coalescents with a dust component
11.4 An asymptotic expansion for Beta-coalescents
11.5 Evolving $n$-coalescents
11.6 Evolving $\Lambda$-coalescents
References
12 Population genetic models of dormancy
12.1 Introduction
12.2 Seed banks with spontaneous switching
12.3 Simultaneous switching
12.4 Open problems and perspectives for future work
References
13 From high to low volatility: Spatial Cannings with block resampling and spatial Fleming–Viot with seed-bank
13.1 Background
13.2 Two models
13.3 Equilibrium
13.4 Random environment
13.5 Extensions
13.6 Perspectives
References
14 Ancestral lineages in spatial population models with local regulation
14.1 Introduction
14.2 Random walk on the oriented percolation cluster
14.3 Ancestral lineages for logistic branching random walks
14.4 Discussion
References
15 The symbiotic branching model: Duality and interfaces
15.1 Introduction
15.2 The discrete-space voter model
15.3 The symbiotic branching model
15.4 Self-duality in the symbiotic branching model
15.5 Moment duality in the symbiotic branching model
15.6 Interface duality in the symbiotic branching model
15.7 Entrance laws for annihilating Brownian motions
15.8 Outlook
References
16 Multitype branching models with state-dependent mutation and competition in the context of phylodynamic patterns
16.1 Introduction
16.2 The individual-based branching model
16.3 Possible scaling regimes
16.4 The measure-valued model
16.5 The evolving phylogenies
16.6 A two-level model
References
17 Ancestral lines under recombination
17.1 Introduction
17.2 Moran model with recombination
17.3 Ancestral recombination graph and deterministic limit
17.4 An explicit solution for single-crossover recombination
17.5 Recombination in discrete time
References
18 Towards more realistic models of genomes in populations: The Markov-modulated sequentially Markov coalescent
18.1 Modelling the evolution of genomes in populations
18.2 Heterogeneity of processes along the genome
18.3 Existing approaches to account for spatial heterogeneity
18.4 The integrative sequentially Markov coalescent
18.5 Conclusions
References
19 Diffusion limits of genealogies under various modes of selection
19.1 Introduction
19.2 Tree-valued Fleming–Viot process with selection and mutation
19.3 Genealogies under low levels of selection
19.4 A result on stochastic averaging
19.5 The tree-valued Fleming–Viot process under fluctuating selection
References
20 Counting, grafting and evolving binary trees
20.1 Introduction
20.2 Counting trees
20.3 Properties of ranked trees
20.4 Induced subtrees
20.5 Recombination
20.6 Evolving trees
References
21 Algebraic measure trees: Statistics based on sample subtree shapes and sample subtree masses
21.1 Introduction
21.2 Algebraic measure trees
21.3 The subspace of binary algebraic measure trees
21.4 (Sub-)triangulations of the circle
21.5 The $\alpha$-Ford chain on $m$-labelled cladograms and its dual
21.6 The $\alpha$-Ford tree in the limit as $N \to \infty$
21.7 The $\alpha$-Ford chain in the diffusion limit
References
List of contributors
Index

EMS SERIES OF CONGRESS REPORTS

Probabilistic Structures in Evolution Edited by Ellen Baake Anton Wakolbinger

EMS Series of Congress Reports

The EMS Series of Congress Reports publishes volumes originating from conferences or seminars focusing on any field of pure or applied mathematics. The individual volumes include an introduction into their subject and review of the contributions in this context. Articles are required to undergo a refereeing process and are accepted only if they contain a survey or significant results not published elsewhere in the literature.

Previously published in this series:
Trends in Representation Theory of Algebras and Related Topics, A. Skowroński (Ed.)
K-Theory and Noncommutative Geometry, G. Cortiñas et al. (Eds.)
Classification of Algebraic Varieties, C. Faber, G. van der Geer, E. Looijenga (Eds.)
Surveys in Stochastic Processes, J. Blath, P. Imkeller, S. Rœlly (Eds.)
Representations of Algebras and Related Topics, A. Skowroński, K. Yamagata (Eds.)
Contributions to Algebraic Geometry. Impanga Lecture Notes, P. Pragacz (Ed.)
Geometry and Arithmetic, C. Faber, G. Farkas, R. de Jong (Eds.)
Derived Categories in Algebraic Geometry. Tokyo 2011, Y. Kawamata (Ed.)
Advances in Representation Theory of Algebras, D. J. Benson, H. Krause, A. Skowroński (Eds.)
Valuation Theory in Interaction, A. Campillo, F.-V. Kuhlmann, B. Teissier (Eds.)
Representation Theory – Current Trends and Perspectives, H. Krause et al. (Eds.)
Functional Analysis and Operator Theory for Quantum Physics. The Pavel Exner Anniversary Volume, J. Dittrich, H. Kovařík, A. Laptev (Eds.)
Schubert Varieties, Equivariant Cohomology and Characteristic Classes, J. Buczyński, M. Michałek, E. Postinghel (Eds.)
Non-Linear Partial Differential Equations, Mathematical Physics, and Stochastic Analysis, F. Gesztesy et al. (Eds.)
Spectral Structures and Topological Methods in Mathematics, M. Baake, F. Götze, W. Hoffmann (Eds.)
t-Motives: Hodge Structures, Transcendence and Other Motivic Aspects, G. Böckle et al. (Eds.)

Editors: Ellen Baake Technische Fakultät Universität Bielefeld Universitätsstr. 25 33615 Bielefeld, Germany E-mail: [email protected]

Anton Wakolbinger Institut für Mathematik Goethe-Universität Frankfurt Robert-Mayer-Str. 10 60325 Frankfurt am Main, Germany E-mail: [email protected]

2020 Mathematics Subject Classification: 92D15, 60-XX

Keywords: stochastic processes, population genetics, population dynamics, coalescent theory, random trees

ISBN 978-3-98547-005-1

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

Published by EMS Press, an imprint of the European Mathematical Society – EMS – Publishing House GmbH
Institut für Mathematik
Technische Universität Berlin
Straße des 17. Juni 136
10623 Berlin, Germany
https://ems.press

© 2021 European Mathematical Society

Typeset using the authors’ LaTeX sources: WisSat Publishing + Consulting GmbH, Fürstenwalde, Germany
Printing and binding: Beltz Bad Langensalza GmbH, Bad Langensalza, Germany
♾ Printed on acid free paper

Preface

When Charles Darwin published The Origin of Species 160 years ago, he initiated evolution as a substantial research direction in the biological sciences. The mathematical side of evolutionary theory was founded in the 1920s by Ronald Fisher, Sewall Wright, and John Haldane, and led to the modern synthesis of Darwinian evolution and Mendelian genetics. Today, understanding the processes of biological evolution in greater depth is a continuing challenge to both biological and mathematical research.

Evolution is a complex phenomenon driven by various processes, such as mutation and recombination of genetic material, reproduction of individuals, and selection of favourable types. The outcome of their interplay is impossible to predict without the substantial use of mathematical models and methods. Over many decades, much of this modelling and analysis took place on a deterministic level, using classical dynamical systems and differential equations, and this has led to an elaborate theory. However, the processes of evolution have intrinsically random elements, which give rise to a wealth of phenomena that cannot be explained by deterministic processes. Examples of such effects are the loss of genetic variability due to randomness in reproduction, and the emergence of random genealogies.

The desire to model such phenomena has inspired the theory of stochastic processes since its very beginning. Wright–Fisher diffusion, Feller’s branching diffusion and Kingman’s coalescent not only had great success as biological models, but also became prototypes of mathematical objects with a rich and fascinating structure. The fact that these concepts and their refinements turned out to be both natural and convincing from the modelling point of view has greatly motivated researchers to analyse them in depth.

Currently, the area of biological evolution attracts increasing attention worldwide, both from an experimental and a theoretical point of view.
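To make the "loss of genetic variability due to randomness in reproduction" concrete, here is a minimal sketch of the neutral two-type Wright–Fisher model in discrete generations. It is an illustration only and is not taken from the volume: the function names `wright_fisher_path` and `fixation_fraction`, the population size, and the generation cap are all assumptions chosen for the example.

```python
import random


def wright_fisher_path(pop_size, initial_count, rng, max_generations=100_000):
    """Simulate type-A counts in a neutral two-type Wright-Fisher population.

    Returns the list of counts, generation by generation, stopping as soon
    as one type is lost (count 0) or fixed (count pop_size).
    The cap max_generations is an illustrative safeguard, not part of the model.
    """
    count = initial_count
    counts = [count]
    for _ in range(max_generations):
        if count == 0 or count == pop_size:
            break  # variability is lost: only one type remains
        freq = count / pop_size
        # Each of the pop_size offspring chooses its parent uniformly at
        # random, so it is of type A with probability freq.
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        counts.append(count)
    return counts


def fixation_fraction(pop_size, initial_count, runs, seed=0):
    """Fraction of independent runs in which type A takes over the population."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(runs):
        if wright_fisher_path(pop_size, initial_count, rng)[-1] == pop_size:
            fixed += 1
    return fixed / runs
```

Calling `wright_fisher_path(50, 25, random.Random(1))` returns one trajectory that ends with a single type remaining, even though neither type has any advantage. In the neutral model the fixation probability of a type equals its initial frequency, so `fixation_fraction(20, 10, 400)` should come out close to one half.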
Keeping up with the increasing complexity of the models provides great challenges for probability theory. It requires going beyond the long-standing prototype models of mathematical population genetics and systematically exploring the various elementary processes (also referred to as evolutionary forces), alone and in combination.

From a mathematical point of view, the objects of evolution are populations of individuals. The type of an individual can be described at varying degrees of complexity. It can be the type at a single gene (or locus in biological language), the genetic sequence at a set of multiple loci or an entire genome, or some aspect of the phenotype (the collection of traits) of an individual, possibly together with the individual’s geographical (or more abstract) location. To a certain degree of abstraction, a population can thus be modelled as (a vector of) type frequencies, or as a measure on type space. Its (random) dynamics is governed by the various evolutionary forces. Individuals reproduce, transmitting their genetic material to their offspring. In sexual reproduction, this goes along with recombination, which combines a maternal and a paternal (geno)type into the “mixed” type of the offspring. In addition, along the individual lines of descent, types may be changed by mutation; and individuals may migrate between locations. The individual net offspring production rate (fitness) can depend on the type as well as on the environment, for instance on the state of the entire population due to competition for resources. If reproduction rates vary across types, fitter individuals flourish at the expense of less fit ones. This is known as selection.

These evolutionary forces form the basis for the two main theories in the area: population genetics (the theory of evolution at the level of individual genotypes within populations) and population dynamics (its counterpart at the level of phenotypes). A conceptual and methodological anchor for both of them is the topic of random genealogies; they are typically related to the corresponding population models by way of duality. The aforementioned prototype models have been extended to include effects such as selection and recombination. Here, one of the central issues has been, and still is, to make random genealogies available in these complex situations.

Besides their benefits for applications, the richness of the various models at the interface of probability and evolution (such as Fleming–Viot and ancestral processes with high offspring variation, coalescents with spatial and genetic structure, and individual-based ecological models) is leading to the exploration of new mathematical structures, and hence gives new impulses to probability theory and stochastic analysis as well. This was the motivation for a large group of researchers from all over Germany (and one from the Netherlands) to join their efforts in the Priority Programme Probabilistic Structures in Evolution (DFG-SPP-1590) over two funding periods from 2011 to 2020. The current volume is a collection of survey articles that grew out of this work. They span the following topics.

Evolution in random fitness landscapes.
The fitness landscape encodes the mapping of genotypes to fitness. In Chapter 1, Joachim Krug explores the structure of various kinds of random fitness landscapes by investigating the existence of mutational paths with non-decreasing fitness connecting distant genotypes; this translates into a problem of accessibility percolation. In Chapter 2, Wolfgang König investigates branching random walks on such landscapes, that is, the movement (via mutation) combined with birth and death events at rates that depend on the current position in the landscape.

Stochastic processes in models of population genetics and population dynamics, forward in time. The individual-based modelling of the evolutionary forces leads to processes that describe the evolution of type frequencies, which may then be analysed via suitable limit theorems for various scales of time and population size. In this vein, Ellen Baake and Anton Wakolbinger describe and analyse two population-genetic models of microorganisms under selection (Chapter 3): Lenski’s long-term evolution experiment, that is, experimental evolution under directional selection, and pathogen evolution under balancing selection. The population genetics of bacteria is also the theme of Rolf Backofen and Peter Pfaffelhuber (Chapter 4); here a mutation model for the CRISPR-Cas (clustered regularly interspaced short palindromic repeats) system is developed, together with a classification of such systems via bioinformatics methods.

In Chapter 5, Martin Hutzenthaler and Dirk Metzler formulate an individual-based, spatially structured predator-prey model with an altruistic defence trait in the prey population, and analyse the conditions under which this trait will persist. Wolfgang Stephan and Aurélien Tellier investigate the impact of the randomness in reproduction (known as genetic drift in biology) and mutation on the genetic structure of populations that experience host-parasite coevolution (Chapter 6). Chapter 7 by Anton Bovier reviews the stochastic individual-based model of adaptive dynamics (a concept for the joint description of population regulation and genetic change) and discusses how different scaling limits can be obtained by taking limits of large populations, small mutation rate, and small effect of single mutations together with appropriate time rescaling.

Stochastic population models and random genealogies. The fundamental construct of a random genealogy is Kingman’s coalescent (1982), which describes the ancestral relationships between the individuals in a random sample from a population with small family sizes in the absence of selection, recombination, and further evolutionary forces. Substantial complications arise when one goes beyond these “standard” assumptions. Matthias Birkner and Jochen Blath (Chapter 8) review recent progress on coalescents with multiple and simultaneous multiple mergers, which emerge in populations with highly skewed offspring distribution, and they discuss questions of inference from genetic data. Inference for multiple-merger coalescents is also the theme of Chapter 9 by Fabian Freund, who additionally considers effects of variable population size. Anja Sturm reviews what changes when the population is diploid rather than haploid, that is, when individuals carry two copies of each gene which they inherit from two distinct parent individuals (Chapter 10).
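As a point of reference for Kingman’s coalescent as described above, the following standard textbook computation may help; it is not taken from this volume. It assumes the usual time scaling in which each pair of ancestral lineages merges at rate 1, and writes $T_k$ (notation introduced here for illustration) for the time during which the sample of size $n$ has exactly $k$ ancestral lineages.

```latex
\[
  T_k \sim \mathrm{Exp}\!\left(\binom{k}{2}\right), \quad k = 2, \dots, n,
  \qquad
  \mathbb{E}\bigl[T_{\mathrm{MRCA}}\bigr]
    = \sum_{k=2}^{n} \binom{k}{2}^{-1}
    = \sum_{k=2}^{n} \frac{2}{k(k-1)}
    = 2\left(1 - \frac{1}{n}\right).
\]
```

The last step is a telescoping sum, since $\frac{2}{k(k-1)} = 2\bigl(\frac{1}{k-1} - \frac{1}{k}\bigr)$. The expected time to the most recent common ancestor thus stays below 2 for every sample size, which is one way to see that only pairwise mergers occur and that most of the tree height is spent with few lineages.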
Götz Kersting and Anton Wakolbinger (Chapter 11) investigate functionals of multiple-merger coalescents, such as the total external branch length and the time to the most recent common ancestor, and discuss evolving coalescents. Multiple-merger coalescents may also appear in populations with dormancy, where individuals switch between an active state and an inactive one in which they do not reproduce, thus generating a seed bank of dormant individuals. Such models, along with the resulting genealogies, are developed and analysed by Jochen Blath and Noemi Kurt in Chapter 12. The effect of a hierarchical spatial structure on the “volatility” (i.e. the strength of fluctuations) of the type frequencies is investigated by Andreas Greven and Frank den Hollander (Chapter 13) and compared with spatial effects in a model with highly skewed offspring distribution, but without dormancy. Lines of descent are further shaped by interaction within a population. Matthias Birkner and Nina Gantert investigate spatial population models with local regulation and explain how an ancestral lineage can be interpreted as a random walk in a dynamic random environment, and how, in particular in the one-dimensional case, the whole collection of ancestral lineages converges to the Brownian web (Chapter 14). In Chapter 15, Jochen Blath and Marcel Ortgiese study the interface (that is, the region of coexistence) in the symbiotic branching model with nearest-neighbour migration, with methods of duality that may in some cases be interpreted as tracing back ancestral lines. Anja Sturm and Anita Winter provide a formal framework for a type-dependent branching model with mutation and competition, and the corresponding genealogies (Chapter 16).

Powerful concepts to deal with recombination or selection are given by ancestral structures that are not trees, but branching-coalescing graphs, namely the ancestral recombination graph and the ancestral selection graph, respectively; the “true” genealogical trees (for each individual genetic locus) are embedded into these graphs. Ellen Baake and Michael Baake obtain a transparent solution of the deterministic recombination equation by means of its dual, namely an ancestral partitioning process derived from the ancestral recombination graph in a law of large numbers regime (Chapter 17). Julien Dutheil (Chapter 18) reviews methods based on the ancestral recombination graph that allow recombination parameters to be inferred from population-genetic data sets, in particular when the recombination rates vary across the genome. Martin Hutzenthaler and Peter Pfaffelhuber study the genealogical trees embedded into the ancestral selection graph, in particular their lengths relative to the case without selection, and their behaviour when selection fluctuates (Chapter 19).

Trees as such. Last but not least, the properties of trees as such are crucial, regardless of their meaning as genealogies in population models. Thomas Wiehe reviews combinatorial and topological properties of binary trees, the probability distributions in tree space under different generation mechanisms, as well as the effects of the pruning and grafting operations that result when the tree evolves (Chapter 20). The volume concludes with Chapter 21, in which Anita Winter considers binary tree topologies as algebraic objects, equips them with a probability measure that allows leaves to be sampled from the tree, and describes the topology and statistics of the resulting subtrees.
Each chapter has been refereed by two experts (anonymous to the chapter’s authors), one from outside and one from inside the Priority Programme. Our thanks go to all of them. The Priority Programme would not have been conceivable without our fellow members on the steering committee, namely Anton Bovier, Andreas Greven, and Frank den Hollander, who went with us through the adventure of designing the concept of, applying for, and guiding the Programme over so many years. It is our pleasure to thank those people who made the Priority Programme run smoothly, including Nicole Althermeler, Mareike Esser, and Sebastian Probst, who took responsibility for organisational matters from running the web page to assisting with meetings; and Jochen Röndigs, who managed the general administration including about two thousand (!) travel reimbursements. Special thanks go to Britta Heitbreder for supporting the coordinator (E. B.) in every possible way, and for being the LaTeX wizard who shouldered the complex endeavour of preparing, correcting, and revising this volume from title page to index. Last but not least, we thank the German Research Foundation (DFG) for having invested in this project.

Ellen Baake

Anton Wakolbinger

Contents

Preface
1 Accessibility percolation in random fitness landscapes
by Joachim Krug
1.1 Introduction
1.2 Accessibility percolation in sequence space
1.3 Accessibility percolation on trees
1.4 Correlated fitness landscapes
1.5 Paths with valley crossing
1.6 Summary and conclusions
References
2 Branching random walks in random environment
by Wolfgang König
2.1 Introduction
2.2 Spatial branching random walks in random environment
2.3 Multitype branching random walk in random potential
2.4 The PAM on finite graphs
2.5 Higher moments of the numbers of particles
2.6 Further perspectives
References
3 Microbial populations under selection
by Ellen Baake and Anton Wakolbinger
3.1 Introduction
3.2 Lenski’s long-term evolution experiment
3.3 A host-parasite model with balancing selection and reinfection
References
4 The population genetics of the CRISPR-Cas system in bacteria
by Rolf Backofen and Peter Pfaffelhuber
4.1 Introduction
4.2 Classification of CRISPR systems and components
4.3 The pattern of spacer arrays within CRISPR
4.4 Outlook
References
5 Evolution of altruistic defence traits in structured populations
by Martin Hutzenthaler and Dirk Metzler
5.1 Introduction
5.2 Asymptotic frequencies of altruistic defence traits
5.3 Fixation/extinction of the average altruist frequency
5.4 Convergence to a forest of trees of excursions
5.5 Differentiability of semigroups
References
6 Stochastic processes and host-parasite coevolution: Linking coevolutionary dynamics and DNA polymorphism data
by Wolfgang Stephan and Aurélien Tellier
6.1 Introduction
6.2 Intrinsic stochasticity
6.3 Extrinsic stochasticity
6.4 Conclusion
References
7 Stochastic models for adaptive dynamics: Scaling limits and diversity
by Anton Bovier
7.1 Theories of evolution
7.2 The individual based model
7.3 Scaling limits
7.4 The polymorphic evolution sequence
7.5 In one step
7.6 Escape through a fitness well
References
8 Genealogies and inference for populations with highly skewed offspring distributions
by Matthias Birkner and Jochen Blath
8.1 Multiple merger coalescents in population genetics
8.2 Inference based on the site-frequency spectrum
8.3 Multiple loci, diploidy and $\Xi$-coalescents
8.4 Discussion: Are they really out there?
References
9 Multiple-merger genealogies: Models, consequences, inference
by Fabian Freund
9.1 Multiple-merger coalescents
9.2 Modelling multiple mergers for variable population size
9.3 How much genetic information is contained in a subsample?
9.4 Model selection between $n$-coalescents
9.5 Partition blocks and minimal observable clades
References
10 Diploid populations and their genealogies
by Anja Sturm
10.1 Introduction
10.2 Haploid models
10.3 Diploid models
10.4 Various examples of diploid population models
10.5 Discussion and further connections to the literature
References
11 Probabilistic aspects of $\Lambda$-coalescents in equilibrium and in evolution
by Götz Kersting and Anton Wakolbinger
11.1 Introduction
11.2 Dust-free $\Lambda$-coalescents
11.3 $\Lambda$-coalescents with a dust component
11.4 An asymptotic expansion for Beta-coalescents
11.5 Evolving $n$-coalescents
11.6 Evolving $\Lambda$-coalescents
References
12 Population genetic models of dormancy
by Jochen Blath and Noemi Kurt
12.1 Introduction
12.2 Seed banks with spontaneous switching
12.3 Simultaneous switching
12.4 Open problems and perspectives for future work
References
13 From high to low volatility: Spatial Cannings with block resampling and spatial Fleming–Viot with seed-bank
by Andreas Greven and Frank den Hollander
13.1 Background
13.2 Two models
13.3 Equilibrium
13.4 Random environment
13.5 Extensions
13.6 Perspectives
References
14 Ancestral lineages in spatial population models with local regulation
by Matthias Birkner and Nina Gantert
14.1 Introduction
14.2 Random walk on the oriented percolation cluster
14.3 Ancestral lineages for logistic branching random walks
14.4 Discussion
References
15 The symbiotic branching model: Duality and interfaces
by Jochen Blath and Marcel Ortgiese
15.1 Introduction
15.2 The discrete-space voter model
15.3 The symbiotic branching model
15.4 Self-duality in the symbiotic branching model
15.5 Moment duality in the symbiotic branching model
15.6 Interface duality in the symbiotic branching model
15.7 Entrance laws for annihilating Brownian motions
15.8 Outlook
References
16 Multitype branching models with state-dependent mutation and competition in the context of phylodynamic patterns
by Anja Sturm and Anita Winter
16.1 Introduction
16.2 The individual-based branching model
16.3 Possible scaling regimes
16.4 The measure-valued model
16.5 The evolving phylogenies
16.6 A two-level model
References

. . . . . . .

17 Ancestral lines under recombination . . . . . . . . . . . . . . . by Ellen Baake and Michael Baake 17.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17.2 Moran model with recombination . . . . . . . . . . . . . . . 17.3 Ancestral recombination graph and deterministic limit . 17.4 An explicit solution for single-crossover recombination 17.5 Recombination in discrete time . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

. . . . . . . . .

311 312 315 320 322 326 329 332 333

. . . . . . . . 337 . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

337 340 343 346 348 354 360

. . . . . . . . . . 365 . . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

365 366 370 377 379 380

xiii

Contents

18 Towards more realistic models of genomes in populations: The Markov-modulated sequentially Markov coalescent . . by Julien Y. Dutheil 18.1 Modelling the evolution of genomes in populations . . . . 18.2 Heterogeneity of processes along the genome . . . . . . . . 18.3 Existing approaches to account for spatial heterogeneity . 18.4 The integrative sequentially Markov coalescent . . . . . . . 18.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . 383 . . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

. . . . . .

19 Diffusion limits of genealogies under various modes of selection . . . by Martin Hutzenthaler and Peter Pfaffelhuber 19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.2 Tree-valued Fleming–Viot process with selection and mutation . . 19.3 Genealogies under low levels of selection . . . . . . . . . . . . . . . . 19.4 A result on stochastic averaging . . . . . . . . . . . . . . . . . . . . . . . 19.5 The tree-valued Fleming–Viot process under fluctuating selection References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 Counting, grafting and evolving binary trees by Thomas Wiehe 20.1 Introduction . . . . . . . . . . . . . . . . . . . . . 20.2 Counting trees . . . . . . . . . . . . . . . . . . . 20.3 Properties of ranked trees . . . . . . . . . . . 20.4 Induced subtrees . . . . . . . . . . . . . . . . . 20.5 Recombination . . . . . . . . . . . . . . . . . . . 20.6 Evolving trees . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . .

. . . . . .

. . . . . .

383 395 396 399 403 403

. . . 409 . . . . . .

. . . . . .

. . . . . .

409 411 415 419 423 425

. . . . . . . . . . . . . . . . . . . 427 . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

21 Algebraic measure trees: Statistics based on sample subtree shapes and sample subtree masses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . by Anita Winter 21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Algebraic measure trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 The subspace of binary algebraic measure trees . . . . . . . . . . . . . 21.4 (Sub-)triangulations of the circle . . . . . . . . . . . . . . . . . . . . . . . 21.5 The ˛-Ford chain on m-labelled cladograms and its dual . . . . . . 21.6 The ˛-Ford tree in the limit as N ! 1 . . . . . . . . . . . . . . . . . . 21.7 The ˛-Ford chain in the diffusion limit . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

427 428 431 437 440 443 448

. . . 451 . . . . . . . .

. . . . . . . .

. . . . . . . .

451 453 457 461 463 465 470 473

List of contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 479

Chapter 1

Accessibility percolation in random fitness landscapes

Joachim Krug

The fitness landscape encodes the mapping of genotypes to fitness and provides a succinct representation of possible trajectories followed by an evolving population. Evolutionary accessibility is quantified by the existence of fitness-monotonic paths connecting far away genotypes. Studies of accessibility percolation use probabilistic fitness landscape models to explore the emergence of such paths as a function of the initial fitness, the parameters of the landscape or the structure of the genotype graph. This chapter reviews these studies and discusses their implications for the predictability of evolutionary processes.

Determining the paths that evolution does not take is as important in evolutionary outcomes as shaping those it may pass through.
T. N. Starr and J. W. Thornton [63]

1.1 Introduction

An evolving population traces out a path in the space of genetic sequences or genotypes. Depending on the level of resolution, the genotype may be described in terms of nucleotide bases, amino acid residues or the alleles of genes. It has been recognised for a long time that sequence spaces are vast, and that only a tiny fraction of all sequences code for viable phenotypes. This raises the question of how evolution nevertheless manages to navigate these spaces across macro-evolutionary distances [3, 37]. An early mathematical formulation of this problem was presented by John Maynard Smith, who estimated the fraction of functional proteins from a percolation argument [43]. Assuming that evolution is restricted to proceed by single amino acid substitutions, he conceptualised the space of all protein sequences as a network where each protein is connected to its $n$ one-step mutant neighbours, and a fraction $p$ of proteins is functional. The expected number of functional neighbours of a given sequence is therefore $pn$. It is then plausible (and can be proved [24, 56, 57]) that a large connected component of functional proteins exists with high probability if $pn > 1$. Since typically $n \sim 10^3$, only a small fraction of proteins has to be functional to ensure evolvability over large genetic distances.

In this chapter we introduce and review a different kind of percolation problem motivated by evolutionary adaptation. In its most general form the problem of accessibility percolation can be formulated as follows [46]. Consider a graph $G = (V, E)$ where each vertex $x \in V$ is labelled by a real-valued random number $f_x$ drawn from a joint continuous distribution. We call a path between two vertices $x, y$ accessible if the random numbers along the path are monotonically increasing¹, and we ask for the probability of existence of such a path when the distance $d(x, y)$ between the two points (and the size of the graph) becomes large. The use of the term percolation in this context is motivated by the fact that one studies a certain connectivity property of a random structure. A more specific relation to conventional forms of percolation will be described below in Section 1.2.1.

The notion of evolutionary accessibility was first introduced by Daniel Weinreich and collaborators [68]. In a seminal empirical study they constructed and characterised all $2^5 = 32$ combinations of 5 point mutations in the bacterial antibiotic resistance gene TEM-1 β-lactamase [67]. TEM-1 confers resistance to ampicillin but has very low activity against the novel antibiotic cefotaxime. In combination, the 5 point mutations in TEM-1 increase the baseline resistance against cefotaxime (measured in terms of the concentration up to which bacterial growth is possible) by a factor of about $10^5$. The aim of the study was to reconstruct the mutational pathways along which the highly resistant mutant could arise, assuming that mutations occur one at a time and that every step has to provide a benefit in terms of increased resistance². Since each path corresponds to a particular order of occurrence of the mutational steps, there are $5! = 120$ distinct paths, of which only 18 were found to be accessible. This observation led the authors to conclude that adaptive evolution is more constrained, and hence more predictable, than previously appreciated. Along with other related empirical studies [38, 55], the work of Weinreich et al. motivated the first theoretical investigations of evolutionary accessibility in random fitness landscapes [10, 21]. Here the term fitness landscape refers to the assignment of fitness values to genotypes that are connected by mutations [15, 16, 20, 62, 65].
In terms of the general definition of accessibility percolation given above, the graph of interest in this case is the space of genetic sequences endowed with the standard Hamming metric, and the random labels encode fitness or some proxy thereof, such as antibiotic resistance. The precise mathematical setting will be introduced in the next section, and the basic phenomenology will be explained with particular emphasis on the occurrence of abrupt, percolation-like transitions in the accessibility properties. Section 1.3 reviews accessibility percolation on trees, which sheds additional light on the role of graph geometry, and some generalisations of the standard models will be discussed in Sections 1.4 and 1.5. Concluding remarks addressing the relation between accessibility and predictability as well as the role of accessible paths in the evolutionary dynamics are presented in Section 1.6. In addition to a survey of the literature, some unpublished new results are reported in Sections 1.2.3 and 1.5. In its focus on the structure of fitness landscapes, the chapter is complementary to the contributions of Anton Bovier [7] and Wolfgang König [35] to this volume, where different types of evolutionary dynamics on fitness landscapes are discussed.

¹ The requirement of strict monotonicity will be relaxed in Section 1.5.
² The conditions for these assumptions to hold are summarised in Section 1.6.2.


1.2 Accessibility percolation in sequence space

Genotypes are encoded by sequences $x$ of length $L$ with entries drawn from an alphabet $\mathcal{A} = \{0, 1, 2, \dots, a-1\}$ of size $|\mathcal{A}| = a$. The elements of $\mathcal{A}$ will be referred to as alleles. The Hamming distance between two genotypes $x, y$ is defined by

$$d(x, y) = \sum_{i=1}^{L} \bigl(1 - \delta_{\sigma_i(x), \sigma_i(y)}\bigr),$$

where $\sigma_i(x) \in \mathcal{A}$ is the $i$-th entry of $x$, that is, the allele at the $i$-th genetic locus. Loci are elements of the locus set $\mathcal{L} = \{1, 2, \dots, L\}$ with $|\mathcal{L}| = L$. The sequence space $\mathcal{A}^L$ endowed with the Hamming metric $d(\cdot, \cdot)$ is the Hamming graph $\mathbb{H}_a^L$ (see [62]). The binary or biallelic Hamming graphs $\mathbb{H}_2^L$ are $L$-dimensional hypercubes, see Figure 1.2.1 for illustration.

Fitness values $f_x$ are drawn from a continuous distribution and assigned independently³ to genotypes [31]. Since accessibility depends only on the rank ordering of genotypes, the distribution of fitness values does not need to be specified. In the following we assume without loss of generality that the $f_x$ are uniformly distributed on $[0, 1]$. The fitness values induce a natural orientation of the Hamming graph where links between neighbouring sequences are oriented in the direction of increasing fitness. The resulting acyclic directed graph is called the fitness graph and provides a convenient visualisation of the accessible paths [12, 17], see Figure 1.2.1 for illustration.

The quantity of main interest in this chapter is the number of accessible evolutionary paths between two genotypes $x, y$ with $f_y > f_x$. Throughout we will denote this integer-valued, non-negative random variable by $X_{x,y}$. While it would be desirable to characterise the full distribution of $X_{x,y}$, we will mostly be restricted to statements about the expectation $E(X_{x,y})$ and the probability of existence of at least one path $P(X_{x,y} \ge 1)$, where both quantities are understood to be conditioned on some positive value of the fitness difference $f_y - f_x$. Colloquially, we will sometimes say that a genotype space is accessible (inaccessible) if $P(X_{x,y} \ge 1) \to 1$ ($P(X_{x,y} \ge 1) \to 0$) when $d(x, y) \to \infty$ (see [21]).

1.2.1 Directed paths

A path of length $\ell$ between two genotypes $x, y$ is a sequence

$$x \to x_1 \to x_2 \to \dots \to x_{\ell-1} \to y$$

of genotypes such that $d(x, x_1) = d(x_i, x_{i+1}) = d(x_{\ell-1}, y) = 1$ for $i = 1, \dots, \ell-2$, and it is accessible if

$$f_x < f_{x_1} < f_{x_2} < \dots < f_{x_{\ell-1}} < f_y. \tag{1.2.1}$$

³ The assumption of independence will be dropped in Section 1.4.

Figure 1.2.1. Fitness landscapes on the binary hypercube of dimension $L = 2$ (left) and $L = 3$ (right). In the left panel the fitness values are plotted on the vertical axis, whereas in the right panel they are represented by circles of different size, and arrows on the edges point in the direction of higher fitness. The fitness landscape on the left is double-peaked, and as a consequence there are no accessible paths from $x = 00$ to $y = 11$. In the right panel 4 out of $6 = 3!$ possible directed paths from 000 to 111 are accessible.
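The definitions of the fitness graph and of accessible directed paths can be made concrete with a small computation. The following Python sketch counts the fitness-increasing directed paths from the all-zeros genotype to the all-ones genotype on a binary hypercube. The bitmask encoding of genotypes and the function names `count_accessible_directed_paths` and `random_landscape` are assumptions made for this illustration and are not taken from the chapter.

```python
import random


def count_accessible_directed_paths(fitness, L):
    """Count fitness-increasing directed paths from 0...0 to 1...1.

    `fitness` is a list of length 2**L indexed by genotypes encoded as
    integer bitmasks. A directed path flips one locus from 0 to 1 per
    step, so every directed path has exactly L steps.
    """
    n_genotypes = 1 << L
    paths = [0] * n_genotypes
    paths[0] = 1  # one (empty) path ends at the initial genotype
    # Increasing integer order visits each genotype after all of its
    # one-step predecessors, because clearing a bit lowers the integer.
    for g in range(1, n_genotypes):
        total = 0
        for i in range(L):
            if (g >> i) & 1:
                prev = g ^ (1 << i)  # predecessor with locus i unmutated
                if fitness[prev] < fitness[g]:
                    total += paths[prev]
        paths[g] = total
    return paths[n_genotypes - 1]


def random_landscape(L, rng):
    """Draw i.i.d. uniform fitness values for all 2**L genotypes."""
    return [rng.random() for _ in range(1 << L)]


if __name__ == "__main__":
    rng = random.Random(2024)
    landscape = random_landscape(3, rng)
    print(count_accessible_directed_paths(landscape, 3))
```

For an additive landscape in which the fitness equals the number of 1-alleles, all $L!$ directed paths are accessible, whereas a landscape with a local fitness peak at the initial genotype, such as the double-peaked example in the left panel of Figure 1.2.1, admits none.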

Along a directed path the distance to the target genotype $y$ decreases by one in each step, which implies that $\ell = d(x, y)$ and $d(x, x_i) = i$. The properties of directed paths do not depend on the number of alleles $a$, since the only mutations that occur along such a path convert an allele of the initial genotype $x$ into the corresponding allele of the final genotype $y$ (see [69]). There are $\ell$ such mutations that can occur in arbitrary order, hence the total number of directed paths is $\ell!$.

We begin by computing the expected number of accessible directed paths conditioned on the fitness difference $\beta = f_y - f_x > 0$. The condition (1.2.1) implies that all intermediate fitness values must lie between $f_x$ and $f_y$, which happens with probability $\beta^{\ell-1}$, and additionally they have to be ordered, which happens with probability $\frac{1}{(\ell-1)!}$. Thus the probability for the path to be accessible is

$$P_{\beta,\ell} = \frac{\beta^{\ell-1}}{(\ell-1)!}. \tag{1.2.2}$$

This expression suggests an alternative interpretation which emphasises the link to percolation. Suppose we fix the initial and final fitness values at $f_x = 0$ and $f_y = 1$, and remove all other genotypes (endowed with i.i.d. fitness values) independently with probability $1 - \beta$. Then the joint probability for a path of length $\ell$ from $x$ to $y$ to exist and to be accessible is precisely given by (1.2.2), and $\beta$ is seen to play the role of the occupation probability in a standard percolation problem. Multiplying (1.2.2) with the total number of directed paths we arrive at (see [26])

$$E_{\ell,\beta}(X_{x,y}) = \ell!\,\frac{\beta^{\ell-1}}{(\ell-1)!} = \ell\,\beta^{\ell-1}. \tag{1.2.3}$$
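The first-moment formula (1.2.3) can be checked against a direct simulation. The sketch below, written in Python under the assumption that the conditioning on the fitness difference is realised by fixing $f_x = 0$ and $f_y = \beta$ while all other genotypes receive i.i.d. uniform values, compares the closed form with a Monte Carlo average on a small hypercube. The function names and the bitmask encoding are choices made for this example only.

```python
import random


def expected_accessible_paths(length, beta):
    """Closed form (1.2.3): E = l * beta**(l - 1)."""
    return length * beta ** (length - 1)


def simulate_mean_paths(length, beta, samples, seed=0):
    """Monte Carlo estimate of the mean number of accessible directed paths.

    Genotypes are integer bitmasks on the hypercube of dimension `length`.
    The start genotype gets fitness 0 and the target gets fitness `beta`.
    """
    rng = random.Random(seed)
    n = 1 << length
    top = n - 1
    total = 0
    for _ in range(samples):
        f = [rng.random() for _ in range(n)]
        f[0], f[top] = 0.0, beta
        paths = [0] * n
        paths[0] = 1
        for g in range(1, n):
            count = 0
            for i in range(length):
                if (g >> i) & 1:
                    prev = g ^ (1 << i)
                    if f[prev] < f[g]:
                        count += paths[prev]
            paths[g] = count
        total += paths[top]
    return total / samples


if __name__ == "__main__":
    print(expected_accessible_paths(4, 0.8))
    print(simulate_mean_paths(4, 0.8, 20000))
```

The two printed values should agree up to sampling noise; the simulation is restricted to small $\ell$ because the number of genotypes grows as $2^\ell$.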


For any fixed $\beta < 1$, the expectation tends to zero for $\ell \to \infty$,⁴ and by Markov's inequality we conclude that $\lim_{\ell \to \infty} P_{\ell,\beta}(X_{x,y} \ge 1) = 0$ as well. If $\beta$ is allowed to vary with $\ell$ as $\beta_\ell = 1 - \varepsilon_\ell$ such that $\varepsilon_\ell \to 0$, the expectation (1.2.3) vanishes in the limit if $\varepsilon_\ell > \frac{\ln \ell}{\ell}$ and diverges if $\varepsilon_\ell < \frac{\ln \ell}{\ell}$. In the latter case no statement about the existence of accessible paths can be inferred from Markov's inequality. Hegarty and Martinsson [26] used the second moment inequality (see [1])

$$P(X_{x,y} \ge 1) \ge \frac{E(X_{x,y})^2}{E(X_{x,y}^2)} \tag{1.2.4}$$

to show that the upper bound provided by the first moment is essentially tight, in the sense that

$$\lim_{\ell \to \infty} P_{\ell,\beta_\ell}(X_{x,y} \ge 1) = \begin{cases} 0, & \varepsilon_\ell = 1 - \beta_\ell > \frac{\ln \ell}{\ell} + \delta_\ell, \\ 1, & \varepsilon_\ell = 1 - \beta_\ell < \frac{\ln \ell}{\ell} - \delta_\ell, \end{cases} \tag{1.2.5}$$

where $\delta_\ell > 0$ with $\lim_{\ell \to \infty} \ell\,\delta_\ell = \infty$. For the estimate of the second moment in (1.2.4) one needs to consider pairs of paths and their intersections. Thus for directed paths a transition from low to high accessibility occurs near $\beta = 1$, and this result can be read off from the behaviour of the expectation $E(X)$. The full distribution of the number of accessible directed paths was obtained by Berestycki, Brunet and Shi [5]. Working in a scaling limit where $\varepsilon_\ell = C/\ell$ for $\ell \to \infty$ with $C > 0$, they show that $X_{x,y}/\ell$ converges in law to $\exp(-C)$ times the product of two standard independent exponential random variables.

The result (1.2.5) can also be applied to the setting originally considered in [21], where the paths were constrained to end at the global fitness maximum (which corresponds to setting $f_y = 1$) but the initial fitness was not specified. This amounts to integrating (1.2.3) with respect to $\beta = 1 - f_x$. Remarkably, the result

$$\int_0^1 d\beta\, E_{\ell,\beta}(X_{x,y}) = 1 \tag{1.2.6}$$

is independent of $\ell$ and thus illustrates the fact that the directed hypercube is "marginally" accessible. Whereas (1.2.6) could naively be interpreted to imply that accessible paths are likely to exist, the simulations reported in [21] showed that most realisations of landscapes had $X_{x,y} = 0$, and the unit mean value was achieved through rare instances with $X_{x,y} \gg 1$. On the basis of (1.2.5) these instances are understood to be those where the initial fitness happens to be below $\frac{\ln \ell}{\ell}$.

⁴ Here and in the following the limit $\ell \to \infty$ is understood to be taken along with the limit $L \to \infty$. In the case of directed paths we can set $\ell = L$ without loss of generality.
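The role of the scale $\ln \ell / \ell$ and the $\ell$-independence of the integral (1.2.6) can both be illustrated numerically. The following Python sketch evaluates the expectation (1.2.3) at $\beta_\ell = 1 - \varepsilon_\ell$ for values of $\varepsilon_\ell$ below and above $\ln \ell / \ell$, and approximates the integral by a midpoint rule. The function names, the prefactors 0.5 and 2, and the number of integration steps are arbitrary choices made for this illustration.

```python
from math import exp, log, log1p


def expected_paths_near_one(length, epsilon):
    """E = l * (1 - epsilon)**(l - 1), computed via logarithms for stability."""
    return length * exp((length - 1) * log1p(-epsilon))


def integrated_expectation(length, steps=10000):
    """Midpoint-rule approximation of the integral of l * beta**(l - 1) on [0, 1]."""
    width = 1.0 / steps
    total = 0.0
    for j in range(steps):
        beta = (j + 0.5) * width
        total += length * beta ** (length - 1) * width
    return total


if __name__ == "__main__":
    for length in (10**2, 10**4, 10**6):
        scale = log(length) / length
        below = expected_paths_near_one(length, 0.5 * scale)
        above = expected_paths_near_one(length, 2.0 * scale)
        print(length, below, above)
    print(integrated_expectation(10), integrated_expectation(50))
```

With $\varepsilon_\ell = \tfrac{1}{2} \ln \ell / \ell$ the expectation grows roughly like $\sqrt{\ell}$, while with $\varepsilon_\ell = 2 \ln \ell / \ell$ it decays roughly like $1/\ell$, in line with the first-moment argument above.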


1.2.2 Paths with back steps

From a biological perspective there is no good reason to exclude the possibility of mutational reversions, where a mutation that occurs at some point along the path is later reverted. Specialising to the biallelic case with $a = 2$, a path between two genotypes at distance $D$ that includes $k$ reversions has total length $\ell = D + 2k$, since each reversion has to be compensated by an additional forward step. When mutations are rare, evolution is expected to proceed preferentially along the shortest paths, and longer paths are obviously also less likely to be accessible. This disadvantage may however be offset by the enormous increase in the number of possible paths, which merely have to be self-avoiding. A re-analysis of the 5-dimensional TEM-1 β-lactamase resistance landscape of Weinreich et al. [67] that included mutational reversions found a moderate increase in the number of accessible paths, from 18 to 27 (see [14]). At the same time the number of possible paths increases from 120 to 18,651,552,840. For hypercubes of dimension $L \ge 6$, the total number $P_L$ of self-avoiding paths connecting two antipodal corners is not explicitly known, but it can be shown to grow double-exponentially as (see [6])

$$\lim_{L \to \infty} \frac{\ln \ln P_L}{L} = \ln 2.$$

This coincides with the behaviour of the naive estimate $P_L \sim L^{2^L}$ obtained by noting that a self-avoiding path takes at most $2^L$ steps, and that each step can proceed in $L$ different directions.

In the following we consider the binary hypercube of dimension $L$ and ask for the number of general (undirected) accessible paths connecting two genotypes $x, y$ at distance $D = d(x, y)$ that differ in fitness by $\beta = f_y - f_x > 0$. An expression for the expected number of paths can be formally written down along the lines of equation (1.2.3) as

$$E_{L,D,\beta}(X_{x,y}) = \sum_{k \ge 0} a_{L,D,k}\, \frac{\beta^{D+2k-1}}{(D+2k-1)!},$$

where $a_{L,D,k}$ is the number of self-avoiding paths with $k$ reversions that connect two genotypes at distance $D$ on a hypercube of dimension $L$. Through a careful analysis of the asymptotics of the $a_{L,D,k}$, Berestycki et al. showed that in the joint limit $L, D \to \infty$ at fixed $\alpha = D/L$ the exponential growth rate of the expected number of accessible paths is given by (see [6])

$$\lim_{L \to \infty} \bigl[E_{L,\alpha L,\beta}(X_{x,y})\bigr]^{1/L} = \sinh(\beta)^{\alpha} \cosh(\beta)^{1-\alpha}. \tag{1.2.7}$$

Similar to the directed path case discussed previously, when the right-hand side of (1.2.7) is less than 1, Markov's inequality implies $\lim_{L \to \infty} P_{L,\alpha L,\beta}(X_{x,y} \ge 1) = 0$. The condition $\sinh(\beta^*)^{\alpha} \cosh(\beta^*)^{1-\alpha} = 1$ defines a function $\beta^*(\alpha)$ that takes on its maximal value $\beta^*(1) = \ln(1 + \sqrt{2}) \approx 0.88137\ldots$ at $\alpha = 1$ and tends to 0 for $\alpha \to 0$. Berestycki et al. conjectured that, similar to the directed case, the expectation (1.2.7) "tells the truth", in the sense that $\lim_{L \to \infty} P_{L,\alpha L,\beta}(X_{x,y} \ge 1) = 1$ when $\beta > \beta^*(\alpha)$. This conjecture was proven independently by Martinsson [40] and Li [36]. Martinsson's proof⁵ makes use of an ingenious mapping to first passage percolation on the hypercube, which allows him to refer to earlier results for the latter problem [41].

Thus we see that the extension to undirected paths fundamentally changes the nature of the problem, in that the transition to high accessibility now occurs at a nontrivial threshold fitness $\beta^* < 1$. Moreover, the fact that $\beta^*$ decreases with decreasing $\alpha$ shows that, in contrast to the directed path case, the genotypes that do not lie "between" the initial and final point of the path⁶ cannot be ignored. Evolutionary accessibility increases when $D$ decreases relative to $L$ because of the contribution from paths that accumulate and later revert mutations that are part of neither the initial nor the final genotype.

1.2.3 Multiple alleles

In sequence spaces with more than two alleles ($a > 2$), mutational paths can include "sideways" steps where the distance to the initial and final point neither decreases nor increases, because a site mutates to an allele that is contained in neither the initial nor the final genotype. Zagorski et al. carried out simulations of mutational paths in multiallelic sequence spaces and found a significant increase of accessibility with increasing $a$ that is caused mainly by sideways steps [69]. In the following we summarise the main results of a recent analytic study of this problem [61]. For this purpose we formalise the mutational structure on the allele set $\mathcal{A}$ through the adjacency matrix $\mathbf{A}$ of the mutation graph, with elements $A_{kk} = 0$ and $A_{kl} = 1$ if and only if mutations can occur from allele $k$ to allele $l$, where $k, l \in \mathcal{A}$.
Then the exponential growth rate of the expected number of accessible paths between two genotypes $x, y$ is given by

$$\lim_{L \to \infty} \bigl[E_{L,\mathbf{A},\beta}(X_{x,y})\bigr]^{1/L} = \prod_{k,l=0}^{a-1} \bigl[(e^{\beta \mathbf{A}})_{kl}\bigr]^{p_{kl}}, \tag{1.2.8}$$

where $p_{kl}$ denotes the fraction of sites at which $\sigma_i(x) = k$ and $\sigma_i(y) = l$ in the joint limit $d(x, y) \to \infty$ and $L \to \infty$. The information about the Hamming distance between $x$ and $y$ is contained in the $p_{kl}$ through the relation

$$\lim_{L \to \infty} \frac{d(x, y)}{L} = 1 - \sum_{k=0}^{a-1} p_{kk},$$

but in general the number of paths depends on the entire allelic composition of the initial and final genotypes.

An important special case is the complete mutation graph, where $A_{kl} = 1 - \delta_{kl}$. In the biallelic case considered previously this implies $\mathbf{A}^2 = \mathbf{1}$, and therefore $e^{\beta \mathbf{A}} = \sinh(\beta)\,\mathbf{A} + \cosh(\beta)\,\mathbf{1}$. Observing further that $p_{01} + p_{10} = \alpha$ and $p_{11} + p_{00} = 1 - \alpha$, the expression (1.2.8) is seen to reduce to (1.2.7).

As before, the expression (1.2.8) can be invoked together with Markov's inequality to derive a lower bound $\beta^*$ on the critical fitness difference below which

$$\lim_{L \to \infty} P_{L,\mathbf{A},\beta}(X_{x,y} \ge 1) = 0.$$

For the complete mutation graph over $a$ alleles and initial and final genotypes at maximal distance $d = L$, the equation for $\beta^*(a)$ reads

$$\frac{1}{a}\bigl(e^{(a-1)\beta^*} - e^{-\beta^*}\bigr) = 1, \tag{1.2.9}$$

which can be solved explicitly for $a \le 4$. In particular, for the case of the 4-letter nucleotide alphabets of RNA and DNA the threshold fitness is

$$\beta^*(4) = \ln\left(\frac{1}{\sqrt{2}} + \sqrt{\sqrt{2} - \frac{1}{2}}\right) \approx 0.5088\ldots.$$

For large $a$ the solution of (1.2.9) can be approximated as

$$\beta^*(a) = \frac{\ln(a)}{a} + \frac{1 + \ln(a)}{a^2} + O\!\left(\frac{\ln(a)}{a^3}\right).$$

For the 20-letter amino acid alphabet this yields the estimate $\beta^* \approx 0.1598\ldots$. Because of the restrictions of the genetic code, the amino acid mutation graph is not complete, but a calculation based on the actual mutation graph shows that this only leads to minor deviations from this estimate. The comparison to numerical simulations [69] and related results for first passage percolation [42] indicates that the bound $\beta^*$ given by setting the right-hand side of (1.2.8) to unity is tight at least for the complete graph, and presumably also for a rather general class of allelic mutation graphs.

An interesting application for which the lower bound provided by (1.2.8) suffices is the linear graph⁷, where allele $k$ is allowed to mutate only to the neighbouring alleles $k \pm 1$ for $1 \le k \le a-2$, and the boundary alleles mutate as $0 \to 1$ and $a-1 \to a-2$. For paths connecting the boundary genotypes $x, y$ with $\sigma_i(x) = 0$ and $\sigma_i(y) = a-1$ or vice versa, one finds that $\beta^* > 1$ for $a > 2$, which implies that accessible paths do not exist for any value of $\beta \in [0, 1]$.

⁵ For technical reasons Martinsson's proof is limited to the range $\alpha > 0.002$.
⁶ In biological terms, these are genotypes that cannot be generated from the initial and final genotype by crossover.
⁷ A possible biological interpretation of the linear mutation graph is that alleles represent copy-number variants of genes [2]. In this case the assignment of random fitness values is however not very plausible.
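The threshold values quoted in Sections 1.2.2 and 1.2.3 can be reproduced by solving the corresponding first-moment conditions numerically. The following Python sketch uses plain bisection, which is an implementation choice for this illustration and not a method described in the chapter, to solve equation (1.2.9) for the complete mutation graph and the condition $\sinh(\beta^*)^{\alpha} \cosh(\beta^*)^{1-\alpha} = 1$ for the biallelic case. All function names and the bracketing interval are assumptions of the example.

```python
from math import cosh, exp, log, sinh, sqrt


def bisect_root(func, lo, hi, tol=1e-12):
    """Root of an increasing function with func(lo) < 0 < func(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if func(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def threshold_complete_graph(a):
    """Solve (1/a) * (exp((a - 1) * b) - exp(-b)) = 1 for b, cf. (1.2.9)."""
    return bisect_root(lambda b: (exp((a - 1) * b) - exp(-b)) / a - 1.0, 0.0, 2.0)


def threshold_biallelic(alpha):
    """Solve sinh(b)**alpha * cosh(b)**(1 - alpha) = 1 for b, 0 < alpha <= 1."""
    def log_growth_rate(b):
        return alpha * log(sinh(b)) + (1.0 - alpha) * log(cosh(b))
    return bisect_root(log_growth_rate, 1e-12, 2.0)


def large_a_approximation(a):
    """Leading terms of the large-a expansion of the threshold."""
    return log(a) / a + (1.0 + log(a)) / a ** 2


if __name__ == "__main__":
    for a in (2, 4, 20):
        print(a, threshold_complete_graph(a), large_a_approximation(a))
    for alpha in (1.0, 0.5, 0.1):
        print(alpha, threshold_biallelic(alpha))
```

For $a = 2$ the complete-graph condition reduces to $\sinh(\beta^*) = 1$, so both solvers should return $\ln(1 + \sqrt{2})$ in the case $a = 2$, $\alpha = 1$, which provides a simple consistency check between (1.2.7) and (1.2.9).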


1.3 Accessibility percolation on trees

Computing higher moments of $X_{x,y}$ on the directed hypercube is difficult because different paths can merge and diverge multiple times [26]. The observation that sequence spaces are nevertheless essentially tree-like for large $L$ has motivated a number of studies of accessibility percolation on trees, where this problem does not arise. We begin with a regular rooted $n$-tree of height $h$ (see [46]). The tree has $n^h$ leaves and equally many paths of length $h + 1$ from the root to one of the leaves. The nodes and leaves are labelled by continuous, independent and identically distributed (i.i.d.) random numbers, and therefore the probability for a given path to be accessible is $1/(h+1)!$. Since the exponential growth of the number of paths cannot compensate the factorial decrease in probability, the usual first moment bound shows that accessible paths do not exist for $h \to \infty$ for any fixed value of $n$. One is thus led to consider trees where the branching number grows with increasing height according to a function $n(h)$. The expected number of accessible paths is then⁸

$$E_{h,n(h)}(X) = \frac{n(h)^h}{(h+1)!} \approx \frac{[e\,n(h)/h]^h}{\sqrt{2\pi h}}$$

for large $h$, which suggests that the transition to high accessibility occurs for linear functions $n(h) = \lambda h$ with $\lambda > \lambda^* \ge 1/e$. In [46] the upper bound $\lambda^* \le 1$ was obtained using the second moment inequality (1.2.4), and a subsequent refined analysis showed that the lower bound is tight and $\lambda^* = 1/e$ exactly [58]. The linear growth $n \sim h$ corresponds to the geometry of the directed hypercube⁹, which can be viewed as a directed graph whose vertex degree and diameter are both equal to $L$. The fact that trees with linear growth are marginally accessible is thus consistent with the results for the directed hypercube described in Section 1.2.1.

A similar conclusion can be drawn from a subtly different analysis carried out by Coletti et al. [11], who consider infinite trees with the branching number at level $l$ given by an increasing function $n(l)$ (the root is located at $l = 0$). For a linear growth function $n(l) = l + 1$ the number of leaves (and hence the number of distinct paths from the root) at height $h$ is $h!$, the same as the number of directed paths on a hypercube of dimension $h$. Without constraints on the fitness of the root, the probability for a path to be accessible is again $1/(h+1)!$. The expected number of paths is

$$E_{h,n(l)=l+1}(X) = \frac{h!}{(h+1)!} = \frac{1}{h+1}$$

and there is no accessibility for $h \to \infty$. The main result established in [11] is that this case is marginal in the sense that the probability for existence of accessible paths is positive for growth functions $n(l) = \lceil (l+1)^{\gamma} \rceil$ with $\gamma > 1$.

⁸ Throughout this section paths are assumed to go from the root to a leaf of the tree. The indices $x, y$ of $X$ indicating the start and end point of the paths are therefore omitted.
⁹ See [5] for a precise way of approximating the directed hypercube by a tree.
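The first-moment expressions for trees are easy to evaluate numerically. The Python sketch below computes the logarithm of $n^h/(h+1)!$ for a branching number proportional to the height, and the exact expectation $1/(h+1)$ for the growing tree with $n(l) = l + 1$. Treating the branching number as a real number and the function names used here are simplifications made for this illustration.

```python
from fractions import Fraction
from math import factorial, lgamma, log


def log_expected_paths_regular_tree(n, h):
    """Logarithm of n**h / (h + 1)!, using lgamma(h + 2) = log((h + 1)!)."""
    return h * log(n) - lgamma(h + 2)


def expected_paths_growing_tree(h):
    """Exact expectation for branching number n(l) = l + 1 at level l."""
    leaves = 1
    for level in range(h):
        leaves *= level + 1  # each node at level l has l + 1 children
    return Fraction(leaves, factorial(h + 1))


if __name__ == "__main__":
    h = 200
    for slope in (0.3, 0.5):
        print(slope, log_expected_paths_regular_tree(slope * h, h))
    print(expected_paths_growing_tree(10))
```

For a slope below $1/e \approx 0.368$ the logarithm is negative and decreases with $h$, while for a slope above $1/e$ it is positive and increases, which is the first-moment signature of the transition discussed above.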


Instead of letting the branching number of the tree grow with its height, accessibility can also be increased by introducing a bias on the random fitness variables [46]. Specifically, we take the fitness of a node x at distance l from the root to be of the form fx D x C cl (1.3.1) where c > 0 and the x are continuous i.i.d. random variables. The linear trend in (1.3.1) increases the likelihood of the variables to be in increasing order in a way that depends on the distribution of the x . For the case of the Gumbel distribution P .x < z/ D expŒ e z , the ordering probability P .fx1 < fx2 <    < fxh /, which is also the probability for a path of length h to be accessible, can be shown to be given by [23] .1 P .fx1 < fx2 <    < fxh / D Qh

e

lD1 .1

c h

/

e

cl /

D

1 : Œhe c Š

(1.3.2)

Here Œkq Š 

k Y j D1

Œj q ;

Œj q 

1 qj ; 1 q

q 2 Œ0; 1;

defines the q-factorial of a q-number10 Œkq (see [34]). Note that for q ! 1 the standard factorial is retrieved. The result (1.3.2) was first derived in the context of record statistics, where it describes the probability for all entries in a sequence of random variables with a linear trend to be records11 [23]. Since the product in the denominator of (1.3.2) converges for h ! 1, for a tree of fixed branching number n, the expected number of accessible paths grows or shrinks exponentially as Œn.1 e c /h for large h. This suggests that the accessibility transition occurs when n.1 e c / D 1 or c D c  with  n  c  D ln : n 1 An analysis using the second moment inequality (1.2.4) confirms this expectation and moreover provides the lower bound r c 2   c lim Ph;n;c .X > 1/ > Œe exp e c c 24 6c h!1 for c > c  (see [46]).

$^{10}$ For a related application of q-factorials see [52].
$^{11}$ A similar relation between record statistics and accessibility was described in [11].
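As a numerical illustration of (1.3.2), the following Python sketch evaluates the q-factorial expression and compares it with a direct Monte Carlo estimate of the ordering probability for Gumbel variables with a linear trend. The function names, the sample sizes and the inverse-transform sampling of the Gumbel distribution are illustrative choices made here; they are not taken from [23] or [46].

```python
import math
import random


def q_number(j, q):
    """q-number [j]_q = (1 - q^j) / (1 - q), evaluated for 0 <= q < 1."""
    return (1.0 - q ** j) / (1.0 - q)


def q_factorial(k, q):
    """q-factorial [k]_q! = product of [j]_q for j = 1..k."""
    result = 1.0
    for j in range(1, k + 1):
        result *= q_number(j, q)
    return result


def ordering_probability(h, c):
    """Right-hand side of (1.3.2): 1 / [h]_q! with q = exp(-c), c > 0."""
    return 1.0 / q_factorial(h, math.exp(-c))


def critical_bias(n):
    """Bias c* = ln(n / (n - 1)) at which n * (1 - exp(-c)) = 1."""
    return math.log(n / (n - 1))


def gumbel_sample(rng):
    """Standard Gumbel variate by inverse transform (assumed sampling scheme)."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def simulate_ordering(h, c, trials, rng):
    """Monte Carlo estimate of P(f_1 < ... < f_h) for f_l = xi_l + c * l."""
    hits = 0
    for _ in range(trials):
        previous = -math.inf
        increasing = True
        for l in range(1, h + 1):
            f = gumbel_sample(rng) + c * l
            if f <= previous:
                increasing = False
                break
            previous = f
        hits += increasing
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    print(ordering_probability(4, 0.5), simulate_ordering(4, 0.5, 20000, rng))
    print(critical_bias(2))
```

For $h = 2$ the formula reduces to $1/(1 + e^{-c})$, the logistic distribution function of the difference of two Gumbel variables, which provides a quick consistency check.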


1.4 Correlated fitness landscapes

Although the assumption of i.i.d. random fitness values is mathematically convenient, it cannot be expected to be biologically realistic. Indeed, analyses of experimental data generally show that real fitness landscapes are less rugged than predicted by the i.i.d. model [4, 16, 21, 65]. For this reason several classes of probabilistic fitness landscapes with tunable fitness correlations have been proposed. In contrast to the i.i.d. case, in these models the rank ordering of genotypes (and hence the accessibility of mutational paths) generally depends on the base distribution of the random variables from which the landscape is constructed. In the following we focus on properties for which this dependence does not matter, and consider biallelic sequence spaces ($a = 2$) throughout.

In the rough Mount Fuji (RMF) model, i.i.d. random fitness values are combined additively with a linear$^{12}$ function of the distance to a reference genotype $x_0$, such that

$$f_x = \xi_x + c\,d(x, x_0) \tag{1.4.1}$$

with $c > 0$ and continuous i.i.d. random variables $\xi_x$ (see [45]). This is obviously very similar to the fitness assignment (1.3.1) in the tree model with bias discussed in Section 1.3. The RMF model on the directed hypercube was analysed by Hegarty and Martinsson [26], who showed that accessible paths of length $L$ starting at $x_0$ exist with probability converging to 1 for any $c > 0$ and $L \to \infty$. The effect of the bias in the undirected case is less clear cut, since mutational reversions are discouraged when $c > 0$. In particular, for $c \to \infty$ all directed paths become accessible and no back steps can occur. Numerical simulations for small systems suggest that the competition between directed and undirected paths leads to a maximum in the number of accessible paths at an intermediate value of $c$ (see [30]).

The NK-models$^{13}$ originally introduced by Kauffman and Weinberger [32, 33] constitute a popular and versatile class of correlated fitness landscapes that continues to attract the attention of diverse research communities (see [27] for a recent review). The basic structural element of the model are the interaction sets $B_i \subseteq \mathcal{L}$ of genetic loci, which are subsets of the locus set $\mathcal{L} = \{1, 2, \ldots, L\}$. In the most commonly used setting there are $L$ interaction sets of equal size $|B_i| = k$, $1 \le k \le L$, and moreover it is assumed that locus $i$ belongs to its own interaction set, $i \in B_i$. Loci within one interaction set affect each other's fitness effects in a random manner, whereas the fitness effects from different interaction sets combine additively. This is implemented by writing the fitness of a genotype $x$ as a sum over contributions from the interaction sets,

$$f_x = \sum_{i=1}^{L} \phi_k^{(i)}(\downarrow_{B_i} x). \tag{1.4.2}$$

$^{12}$ For a generalisation to nonlinear functions see [51, 53].
$^{13}$ The acronym refers to the number of loci $N$ (here denoted by $L$) and the number of interaction partners $K$ of a locus (here denoted by $k - 1$).


Figure 1.4.1. Three examples of NK interaction structures for $L = 15$ loci. Interaction sets are shown as ellipses. Left panel: Block structure [50, 54] with $k = 3$. The interaction sets are disjoint and form five triples of identical sets. Middle panel: Random structure with $k = 2$. Loci are assigned randomly to interaction sets subject to the constraints described in the text. Right panel: Star structure with $k = 3$. Loci 1 and 2 are centre loci that are contained in all interaction sets. The interaction sets associated with the remaining 13 ray loci contain only the locus itself and the centre loci (modified from [27]).

Here the $\phi_k^{(i)}$ are functions on the $k$-dimensional hypercube that assign a continuous i.i.d. random number to each of the $2^k$ sequences in $\mathbb{H}_2^k$, and $\downarrow_{B_i} : \mathbb{H}_2^L \to \mathbb{H}_2^k$ projects the genotype sequence onto the interaction set according to

$$\sigma_i(\downarrow_B x) = \sigma_i(x) \quad \text{for all } i \in B \subseteq \mathcal{L}.$$

The functions $\phi_k^{(i)}$ for different $i$ are independent. The choice of the interaction sets defines the interaction structure of the model [27, 47], see Figure 1.4.1 for illustration. The correlations in the fitness landscape are tuned through the size $k$ of the interaction sets [9]. For $k = L$ every interaction set contains all loci and the model reduces to the uncorrelated landscape. On the other hand, when $k = 1$ the fitness effects of different loci combine additively according to

$$f_x = \sum_{i=1}^{L} \phi_1^{(i)}(0) + \sum_{i=1}^{L} \left[ \phi_1^{(i)}(1) - \phi_1^{(i)}(0) \right] \sigma_i(x),$$

where the $\phi_1^{(i)}(0)$ and $\phi_1^{(i)}(1)$ are continuous i.i.d. random variables. The additive landscape has a unique maximum and all directed paths to the maximum are accessible [68].

For general $k$, the analysis of directed accessible paths is relatively straightforward for the block structure shown in the left panel of Figure 1.4.1 [60]. Note first that in this case the $L$ interaction sets fall into $b = L/k$ groups$^{14}$ within which the sets are identical, and therefore only $b$ interaction sets have to be distinguished. Following the setting originally introduced in [21], we consider directed paths of length $L$ that end at the global maximum $x_{\max}$ of the fitness landscape and start at its antipodal point $\bar{x}_{\max}$ defined by $\sigma_i(\bar{x}_{\max}) = 1 - \sigma_i(x_{\max})$. Along such a path each locus has to be mutated once. Since the interaction sets are disjoint, each mutation event occurs in one of the sets and changes only the fitness contribution corresponding to this set (compare to (1.4.2)). In this way the path can be decomposed into $b$ sub-paths of length $k$ on $\mathbb{H}_2^k$, each of which remains within one interaction set. The global path is accessible if and only if all the sub-paths are. Denoting the number of accessible paths in the interaction set $B_i$ by $X_k^{(i)}$, the number of global paths is thus

$$X_{L,k}^{\mathrm{block}} = \frac{L!}{(k!)^b} \prod_{i=1}^{b} X_k^{(i)}. \tag{1.4.3}$$

The combinatorial prefactor describes the number of ways in which a given set of sub-paths can be combined into a global path. An immediate consequence is that $X_{L,k}^{\mathrm{block}}$ is a non-negative multiple of $L!/(k!)^b$. Since the fitness contributions within each interaction set are continuous i.i.d. random variables, the statistics of the $X_k^{(i)}$ are given by the uncorrelated model discussed in Section 1.2.1. In particular, it follows from (1.2.6) that $E(X_k^{(i)}) = 1$ independently of $k$, and therefore

$$E(X_{L,k}^{\mathrm{block}}) = \frac{L!}{(k!)^b}. \tag{1.4.4}$$

Similarly,

$$P(X_{L,k}^{\mathrm{block}} \ge 1) = [P(X_k \ge 1)]^b,$$

where $X_k$ is the number of accessible directed paths in the uncorrelated model on $\mathbb{H}_2^k$ that end at the global maximum and start at the (unconstrained) antipodal point. The results described in Section 1.2.1 imply that$^{15}$ $P(X_k \ge 1) < 1$ for any $k \ge 2$, and we conclude that

$$\lim_{L \to \infty} P(X_{L,k}^{\mathrm{block}} \ge 1) = 0 \tag{1.4.5}$$

for any fixed $k \ge 2$.

Taken together, the results (1.4.4) and (1.4.5) show that the accessibility properties of the NK model with block interactions are strikingly different from those of the uncorrelated model, in that the expected number of accessible paths grows factorially with $L$, while at the same time the probability that the landscape is accessible vanishes exponentially. It is clear from the second moment inequality (1.2.4) that these two statements are compatible only if the coefficient of variation of $X_{L,k}^{\mathrm{block}}$ diverges with $L$. This is indeed the case and follows from the multiplicative structure of (1.4.3) [60].

$^{14}$ We assume here that $L$ is divisible by $k$.
$^{15}$ Specifically, $P(X_2 \ge 1) = 2/3$ and $P(X_3 \ge 1) = 97/210 \approx 0.462\ldots$ (see [60]).
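The statements about the block model can be checked by brute force on small hypercubes. The Python sketch below is an illustration under simplifying assumptions: each block receives a single table of uniform random contributions (this suffices because a mutation changes the contribution of one block only, so only the rank order within a block matters), and all function names are hypothetical. Directed accessible paths from the antipode to the global maximum are counted by dynamic programming over the genotypes reached after each mutation.

```python
import itertools
import math
import random


def block_landscape(L, k, rng):
    """Fitness table of a block model with b = L // k disjoint blocks.

    Each block contributes one i.i.d. uniform number per block state;
    the fitness of a genotype is the sum of its block contributions.
    """
    assert L % k == 0, "L is assumed to be divisible by k"
    b = L // k
    tables = [[rng.random() for _ in range(2 ** k)] for _ in range(b)]
    fitness = {}
    for x in itertools.product((0, 1), repeat=L):
        total = 0.0
        for i in range(b):
            block = x[i * k:(i + 1) * k]
            index = sum(bit << j for j, bit in enumerate(block))
            total += tables[i][index]
        fitness[x] = total
    return fitness


def count_accessible_paths(L, fitness):
    """Count fitness-monotonic directed paths from the antipode of the
    global maximum to the global maximum."""
    x_max = max(fitness, key=fitness.get)
    start = tuple(1 - s for s in x_max)
    paths = {start: 1}  # genotype -> number of accessible paths reaching it
    for _ in range(L):
        reached = {}
        for x, n_paths in paths.items():
            for i in range(L):
                if x[i] != x_max[i]:  # locus i has not been mutated yet
                    y = x[:i] + (x_max[i],) + x[i + 1:]
                    if fitness[y] > fitness[x]:
                        reached[y] = reached.get(y, 0) + n_paths
        paths = reached
    return paths.get(x_max, 0)


def estimate_accessibility(L, k, trials, seed=0):
    """Monte Carlo estimate of the probability of at least one accessible path."""
    rng = random.Random(seed)
    hits = sum(
        count_accessible_paths(L, block_landscape(L, k, rng)) >= 1
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(1)
    unit = math.factorial(6) // math.factorial(2) ** 3
    print(count_accessible_paths(6, block_landscape(6, 2, rng)), "multiple of", unit)
    print(estimate_accessibility(2, 2, 3000))
```

With $L = k = 2$ the landscape is uncorrelated, so the estimate should be close to the value $2/3$ quoted in footnote 15, and for $k = 1$ every realisation should yield exactly $L!$ accessible paths.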


An early numerical investigation indicated that the accessibility of NK landscapes depends on the interaction structure in a significant and complicated way [22], but subsequent work has shown that the behaviour of the block structure described above is quite representative at least asymptotically for large $L$. In brief, it has been proved that for a large class of interaction structures characterised as locally bounded, the probability $P(X_{L,k} \ge 1)$ vanishes exponentially in $L$ for $L \to \infty$ and any fixed $k \ge 2$ (see [27, 59]). The proof relies on the inevitable existence of a certain local genetic interaction motif known as reciprocal sign epistasis [55] which prevents an accessible path (directed or undirected) from traversing the hypercube and hence decomposes the genotype space into mutually inaccessible domains. Most commonly studied interaction structures, including the random structure depicted in the middle panel of Figure 1.4.1, are locally bounded, but the star structure shown in the right panel of the figure is not.

1.5 Paths with valley crossing

In this section we consider the consequences of relaxing the condition of strict monotonicity on accessible paths. Theoretical studies of populations that cross a fitness valley have shown that different dynamic modes for this process have to be distinguished [66]. In small populations$^{16}$ a deleterious mutation that decreases fitness can fix, which implies that the entire population moves to a state of lower fitness and continues its trajectory from there. By contrast, in larger populations a small subpopulation of valley genotypes is maintained by mutation-selection balance, and if a target genotype that is fitter than both the initial and the valley genotypes exists, it can arise by mutation from the valley population. In the latter case the majority of the population never resides in the valley, and for this reason the process is also known as stochastic tunnelling [28]. Increasing population size even further, eventually double mutations become likely. In the following we focus on the regime where multiple mutations can be neglected and valleys are crossed either by fixation or stochastic tunnelling.

These two modes of valley crossing impose different conditions on the fitness values along the evolutionary trajectory [13]. For the fixation mode the crossing event implies an unconditional decrease of the fitness value, after which the monotonic increase in fitness is resumed until the next valley crossing. By contrast, when valley crossing occurs in the (stochastic) tunnelling mode, the mutational step out of the valley has to overcompensate the fitness decrease of the preceding step. Denoting the valley genotype by $v$, the fitness values along the sub-trajectory $x \to v \to x'$ have to satisfy the condition $f_{x'} > f_x > f_v$, see Figure 1.5.1 for illustration. In the following we examine the effect of allowing for one or several valley crossings on the accessibility of directed paths of length $\ell$ on the hypercube. Fitness values are drawn independently from the uniform distribution on $[0, 1]$, and the final genotype $y$ is assumed to have maximal fitness, $f_y = 1$.

[Figure 1.5.1: fitness (vertical axis, from 0 to 1) plotted against genotype (000, 100, 010, 001, 110, 101, 011, 111) for three paths labelled A, B and C.]

Figure 1.5.1. Three directed paths in a three-dimensional hypercube that illustrate the different modes of valley crossing. Path A (full) is monotonic, path B (dashed) contains a stochastic tunnelling event in the second step, and along path C (dash-dotted) a deleterious mutation fixes in the first step.

We first show that a single valley crossing of the fixation type suffices to induce high accessibility. Starting at an arbitrary initial genotype $x$ with fitness $f_x$, we make use of the crossing event in the first step to move to the genotype $x_1$ that has the lowest fitness among the $\ell$ available neighbours. The probability distribution function of this fitness value is given by

$$P(f_{x_1} < z) = 1 - (1 - z)^{\ell} \to 1 - e^{-z\ell}$$

for large $\ell$ and small $z$, and using (1.2.5) we conclude that an accessible path will be found with certainty from the second step onward. The expected number of accessible paths starting from a random fitness value can be shown to grow exponentially as $2^{\ell} - \ell$ in this setting [13].

The analysis of paths with stochastic tunnelling events is more subtle$^{17}$. Following the setup of Section 1.2.1, we fix the initial and final fitness values at $f_x = 1 - \beta$ and $f_y = 1$, respectively. We consider paths of length $\ell$ with a fixed number $t$ of tunnelling events, and denote by $\pi_t(\ell, \beta)$ the probability that such a path is accessible. We know from (1.2.3) that $\pi_0(\ell, \beta) = \beta^{\ell-1}/(\ell-1)!$. For general $t$ and $\ell \ge 3$, the $\pi_t$ satisfy the recursion relation

$$\pi_t(\ell, \beta) = (1 - \beta) \int_{1-\beta}^{1} du\, \pi_{t-1}(\ell - 2, 1 - u) + \int_{1-\beta}^{1} du\, \pi_t(\ell - 1, 1 - u). \tag{1.5.1}$$

Here the second argument $1 - u$ is the gap between the new starting fitness $u$ and the maximal fitness, in accordance with the convention $f_x = 1 - \beta$.

$^{16}$ The condition on the population size $N$ is $N \Delta f \ll 1$, where $\Delta f = f_x - f_v > 0$ is the fitness difference between the initial genotype $x$ and the valley genotype $v$.
$^{17}$ This part is based on unpublished notes by Éric Brunet [8].


The first term on the right-hand side describes the situation when the first step is a tunnelling event. This implies that the second fitness value $f_{x_1} \in [0, 1 - \beta]$, which happens with probability $1 - \beta$, whereas the subsequent value $u = f_{x_2}$ has to be in $[1 - \beta, 1]$. From this point on one has to cover the remaining $\ell - 2$ steps using $t - 1$ tunnelling events. The second term covers the cases where fitness increases in the first step, such that $f_{x_1} = u \in [1 - \beta, 1]$ and the remaining $\ell - 1$ steps can make use of all $t$ tunnelling events. The solution of (1.5.1) with the appropriate boundary conditions is given by

$$\pi_t(\ell, \beta) = \frac{\beta^{\ell - t} (2 - \beta)^t}{2^t\, t!\, (\ell - 2t)!} \left[ \frac{\ell - t}{\beta} - \frac{t}{2 - \beta} \right] = \frac{1}{2^t\, t!\, (\ell - 2t)!} \frac{d}{d\beta} \left[ \beta^{\ell - t} (2 - \beta)^t \right]. \tag{1.5.2}$$

Integrating (1.5.2) with respect to $\beta$ and multiplying by the total number of directed paths $\ell!$ one obtains the expected number of paths with arbitrary starting fitness as

$$E(X_{\ell,t}) = \frac{\ell!}{2^t\, t!\, (\ell - 2t)!} \propto \ell^{2t}$$

for large $\ell$ and fixed $t$, which generalises (1.2.6) to $t \ge 1$. Although this suggests a significant increase in accessibility, the algebraic growth in $\ell$ is not sufficient to overcome the exponential reduction in probability caused by the factor $\beta^{\ell - t}$ in (1.5.2). Conditioned on the initial fitness, the expected number of paths behaves as

$$E(X_{\ell,t,\beta}) \propto \frac{\ell^{2t+1}}{2^t\, t!}\, \beta^{\ell - t - 1} (2 - \beta)^t \propto \ell^{2t+1} \beta^{\ell}$$

for large $\ell$ and fixed $t$, which converges to zero for any fixed $\beta < 1$. Setting $\beta = \beta_\ell = 1 - \varepsilon_\ell$ with $\lim_{\ell \to \infty} \varepsilon_\ell = 0$, we see that a necessary condition for accessibility is $\varepsilon_\ell < (2t + 1) \frac{\ln \ell}{\ell}$, which is only a small improvement compared to the case without valley crossings in (1.2.5). We conclude that the effects of the two modes of valley crossing on accessibility are very different.

Duque et al. recently studied accessibility percolation with valley crossings on $n$-trees with a height-dependent branching number $n(h)$, compare to Section 1.3. They define a path to be $k$-accessible if any $k$ consecutive fitness values along the path contain at least one element of a monotonically increasing sub-sequence, and show that the critical growth of $n(h)$ required to guarantee accessibility is given by $n(h) \propto [h/(ek)]^{1/k}$ (see [18]).
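The closed form (1.5.2) can be checked numerically against the recursion (1.5.1). In the Python sketch below the integrals are written in terms of the gap $v = 1 - u$ between the intermediate fitness $u$ and the maximal fitness, so that they run over $v \in [0, \beta]$; the Simpson rule, the function names and the restriction to $\ell \ge 2t + 1$ are choices made for this illustration only.

```python
import math


def pi_closed(t, length, beta):
    """Closed form (1.5.2): derivative of beta^(length - t) * (2 - beta)^t
    with respect to beta, divided by 2^t * t! * (length - 2t)!."""
    norm = 2 ** t * math.factorial(t) * math.factorial(length - 2 * t)
    first = (length - t) * beta ** (length - t - 1) * (2.0 - beta) ** t
    second = t * beta ** (length - t) * (2.0 - beta) ** (t - 1) if t > 0 else 0.0
    return (first - second) / norm


def simpson(func, a, b, n=200):
    """Composite Simpson rule with an even number n of subintervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0


def recursion_rhs(t, length, beta):
    """Right-hand side of (1.5.1) for t >= 1 and length >= 2t + 1,
    with the integration variable v = 1 - u running over [0, beta]."""
    tunnel_first = (1.0 - beta) * simpson(
        lambda v: pi_closed(t - 1, length - 2, v), 0.0, beta
    )
    uphill_first = simpson(lambda v: pi_closed(t, length - 1, v), 0.0, beta)
    return tunnel_first + uphill_first


def expected_paths(t, length):
    """Expected number of paths length! / (2^t * t! * (length - 2t)!)."""
    return math.factorial(length) / (
        2 ** t * math.factorial(t) * math.factorial(length - 2 * t)
    )


if __name__ == "__main__":
    for t, length, beta in [(1, 3, 0.5), (2, 7, 0.9)]:
        print(t, length, beta, pi_closed(t, length, beta), recursion_rhs(t, length, beta))
    print(expected_paths(2, 7))
```

For $t = 1$ and $\ell = 3$ both sides equal $2\beta - 3\beta^2/2$, which can also be verified by hand.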

1.6 Summary and conclusions

1.6.1 Accessibility and predictability

The results reviewed in the preceding sections reveal a variety of mathematical mechanisms by which evolutionary accessibility can arise in high-dimensional genotype spaces. The directed hypercube with continuous i.i.d. fitness values [10, 21] turns out to be marginally accessible, in the sense that accessible paths exist only if the fitness difference between the initial and final genotype is as large as possible [26]. Allowing for undirected paths lowers the threshold for accessibility to a nontrivial value $\beta^* \in (0, 1)$ (see [6, 36, 40]). In multiallelic sequence spaces $\beta^*$ depends on the graph of allowed mutational transitions on the set of alleles $\mathcal{A}$. If the mutation graph is complete, $\beta^*$ decreases with increasing number of alleles and tends to zero for $a \to \infty$ (see [61, 69]). When the genotype space is a tree, accessibility percolation occurs as a function of tree geometry [11, 46, 58] or as a function of a bias imposed on the fitness values [46]. Apart from the last example, in all cases the transition is characterised by a discontinuous change of the limiting value of $P(X \ge 1)$ from 0 to 1.

On correlated fitness landscapes of NK-type, accessibility is determined by the graph of interactions among loci, and accessible paths do not exist for interaction structures that are locally bounded [27, 59]. This general result as well as the results for the explicitly solvable block model [60] show that, despite being less rugged (more correlated) than the uncorrelated model, NK-landscapes are much less accessible. Evolutionary accessibility is thus not strictly linked to other measures of fitness landscape ruggedness such as the number of local fitness maxima [65].

The concept of accessible paths was originally introduced in the context of evolutionary predictability. Following the interpretation that Weinreich et al. applied to their seminal experiment on antibiotic resistance evolution [67], the evolutionary trajectories connecting an initial to a final genotype are highly predictable if a small but nonzero number of accessible pathways exist.
Referring to the results described in Section 1.2, we see that this condition is approximately satisfied for the directed hypercube with $\beta = 1$, where $\lim_{L \to \infty} P(X \ge 1) = 1$ and the expected number of paths is equal to $L$ and hence much smaller than the total number of paths $L!$. For the undirected hypercube at $\beta > \beta^*$, the number of accessible paths increases exponentially in $L$ but is still much smaller than the total number of paths. In this sense the uncorrelated model can be said to conform to the scenario of high predictability envisioned in [67]. However, the results for the NK-models discussed in Section 1.4 show that other scenarios are possible as well. According to equations (1.4.4), (1.4.5), in the NK-model with block interactions accessible paths typically do not exist, but if they do their number grows factorially in $L$, leading to low predictability.

1.6.2 Accessible paths and evolutionary dynamics

Formalising evolutionary accessibility through fitness-monotonic mutational paths is conceptually appealing, because it does not require any assumptions about the evolutionary dynamics. On the other hand, this simplification also limits the applicability of the theory to actual evolutionary processes. The paradigm of evolutionary dynamics underlying the notion of accessible paths, known as the strong selection-weak mutation (SSWM) regime, applies in a window of population size $N$ where mutations are rare$^{18}$, $N\mu \ll 1$, and selection is strong, $N|\Delta f| \gg 1$ (see [25, 48]). Under these conditions the population is almost always monomorphic and can evolve only by fixing beneficial mutations, which implies that it is constrained to move along accessible paths. A rigorous derivation of the SSWM limit from microscopic adaptive dynamics is described in the chapter by Anton Bovier [7]. The SSWM framework allows one to assign a weight to each accessible path, which is given by the product of the relative fixation probabilities along the path. Empirical studies indicate that these weights are often strongly concentrated on a small number of paths [38, 67]. As a consequence the evolutionary predictability may be even higher than expected based only on the number of accessible paths.

However, evolving populations are myopic and lack the foresight required to determine which of the many available paths will eventually lead to high fitness. Studies of SSWM dynamics on the $L$-dimensional hypercube with uncorrelated fitness values have shown that adaptive walks governed by local dynamics typically terminate at a local fitness peak after $\sim \ln L$ steps and are thus much shorter than the accessible paths considered in this chapter [19, 39, 44, 48]. Greedy adaptive walks that always fix the most beneficial mutation terminate already after $e - 1 \approx 1.718$ steps [49]. Walks following a (biologically unrealistic) "reluctant" dynamics by always choosing the smallest available positive fitness difference take $O(L)$ steps but do not reach fitness levels comparable to the globally maximal fitness [47]. The behaviour of adaptive walks on correlated fitness landscapes is more complex, but studies of the RMF model have shown that the available, long accessible paths are dynamically relevant only if the bias parameter $c$ in (1.4.1) is sufficiently large and/or the distribution of the random component $\xi_x$ is sufficiently heavy tailed [51–53].
Typical walk lengths in the NK-model at fixed k are of the order of (but smaller than) L (see [47]). When the population size increases beyond the weak mutation regime, additional complications arise. On the one hand, the simultaneous presence of multiple mutation clones in the population implies an advantage for mutations of large beneficial effect, such that the dynamics becomes increasingly greedy and hence deterministic [29, 52]. On the other hand, a higher mutation rate facilitates the crossing of fitness valleys, which strongly increases the number of available paths. A detailed numerical study has shown that the distinguished role of fitness-monotonic paths is largely lost in this regime, and moreover the interplay of the two effects mentioned above leads to a non-monotonic dependence of predictability on population size [64]. Taken together, the considerations sketched in this section make it clear that the investigation of accessible paths only constitutes a first step towards a broader understanding of evolutionary predictability.

$^{18}$ Here $\mu$ denotes the mutation rate per individual and generation, and $\Delta f$ is the typical fitness difference between neighbouring genotypes.
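The short length of greedy adaptive walks on uncorrelated landscapes is easy to reproduce in a toy simulation. The Python sketch below rests on a simplifying assumption that is not part of the text: at every step the walker is taken to see $L$ fresh i.i.d. uniform fitness values, which ignores that on the hypercube one neighbour is the previously visited genotype. Under this assumption the probability that a $k$-th step occurs is close to $1/k!$ for large $L$, so the mean walk length should approach $e - 1$. Function names and parameters are illustrative.

```python
import math
import random


def greedy_walk_length(L, rng):
    """Number of steps of one greedy walk, assuming L fresh neighbours per step."""
    current = rng.random()  # fitness of a randomly chosen starting genotype
    steps = 0
    while True:
        # The maximum of L i.i.d. uniform variables has distribution function z^L,
        # so it can be sampled directly as U^(1/L).
        best = rng.random() ** (1.0 / L)
        if best <= current:
            return steps  # local maximum reached
        current = best
        steps += 1


def mean_greedy_walk_length(L, trials, seed=0):
    """Average walk length over independent realisations."""
    rng = random.Random(seed)
    return sum(greedy_walk_length(L, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print(mean_greedy_walk_length(1000, 40000), math.e - 1.0)
```

The comparison with $e - 1$ is meaningful only for large $L$; for small $L$ the finite-size corrections and the neglected back-neighbour both matter.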


Acknowledgements. The work described here is the joint effort of a large number of collaborators. I am particularly grateful to Éric Brunet, Lucas Deecke, Jasper Franke, Mario Josupeit, Alexander Klözer, Stefan Nowak and Benjamin Schmiegelt for their contributions to this project, and to Benjamin Schmiegelt for a critical reading of the manuscript.

References

[1] N. Alon and J. Spencer, The Probabilistic Method, 2nd ed., Wiley, New York, 2000.
[2] L. Altenberg, Fundamental properties of the evolution of mutational robustness, preprint (2015), https://arxiv.org/abs/1508.07866.
[3] F. H. Arnold, The library of Maynard-Smith: My search for meaning in the protein universe, Microbe 6 (2011), 316–318.
[4] C. Bank, S. Matuszewski, R. T. Hietpas, and J. D. Jensen, On the (un)predictability of a large intragenic fitness landscape, Proc. Natl. Acad. Sci. USA 113 (2016), 14085–14090.
[5] J. Berestycki, É. Brunet, and Z. Shi, The number of accessible paths in the hypercube, Bernoulli 22 (2016), 653–680.
[6] J. Berestycki, É. Brunet, and Z. Shi, Accessibility percolation with backsteps, ALEA Lat. Am. J. Probab. Math. Stat. 14 (2017), 45–62.
[7] A. Bovier, Stochastic models for adaptive dynamics: Scaling limits, diversity, and applications to cancer therapy, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 127–149.
[8] É. Brunet, private communication (2015).
[9] P. R. A. Campos, C. Adami, and C. O. Wilke, Optimal adaptive performance and delocalization in NK fitness landscapes, Phys. A 304 (2002), 495–506; erratum ibid. 318 (2003), 637.
[10] M. Carneiro and D. L. Hartl, Adaptive landscapes and protein evolution, Proc. Natl. Acad. Sci. USA 107 (2010), 1747–1751.
[11] C. F. Coletti, R. J. Gava, and P. M. Rodríguez, On the existence of accessibility in a tree-indexed percolation model, Phys. A 492 (2018), 382–388.
[12] K. Crona, D. Greene, and M. Barlow, The peaks and geometry of fitness landscapes, J. Theoret. Biol. 317 (2013), 1–10.
[13] L. A. C. Deecke, Fitness landscapes and evolutionary accessibility: The effect of downhill steps, Bachelor thesis, University of Cologne, 2015, https://kups.ub.uni-koeln.de/10364/.
[14] M. A. DePristo, D. L. Hartl, and D. M. Weinreich, Mutational reversions during adaptive protein evolution, Mol. Biol. Evol. 24 (2007), 1608–1610.
[15] J. A. G. M. de Visser, S. F. Elena, I. Fragata, and S. Matuszewski, The utility of fitness landscapes and big data for predicting evolution, Heredity 121 (2018), 401–405.
[16] J. A. G. M. de Visser and J. Krug, Empirical fitness landscapes and the predictability of evolution, Nat. Rev. Genet. 15 (2014), 480–490.


[17] J. A. G. M. de Visser, S.-C. Park, and J. Krug, Exploring the effects of sex on empirical fitness landscapes, Am. Nat. 174 (2009), S15–S30.
[18] F. Duque, A. Roldán-Correa, and L. A. Valencia, Accessibility percolation with crossing valleys on n-ary trees, J. Stat. Phys. 174 (2019), 1027–1037.
[19] H. Flyvbjerg and B. Lautrup, Evolution in a rugged fitness landscape, Phys. Rev. A 46 (1992), 6714–6723.
[20] I. Fragata, A. Blanckaert, M. A. Dias Louro, D. A. Liberles, and C. Bank, Evolution in the light of fitness landscape theory, Trends Ecol. Evol. 34 (2019), 69–82.
[21] J. Franke, A. Klözer, J. A. G. M. de Visser, and J. Krug, Evolutionary accessibility of mutational pathways, PLoS Comp. Biol. 7 (2011), Article ID e1002134.
[22] J. Franke and J. Krug, Evolutionary accessibility in tunably rugged fitness landscapes, J. Stat. Phys. 148 (2012), 706–723.
[23] J. Franke, G. Wergen, and J. Krug, Records and sequences of records from random variables with a linear trend, J. Stat. Mech. Theory Exp. 2010 (2010), Article ID P10013.
[24] S. Gavrilets and J. Gravner, Percolation on the hypercube and the evolution of reproductive isolation, J. Theoret. Biol. 184 (1997), 51–65.
[25] J. H. Gillespie, Molecular evolution over the mutational landscape, Evolution 38 (1984), 1116–1129.
[26] P. Hegarty and A. Martinsson, On the existence of accessible paths in various models of fitness landscapes, Ann. Appl. Probab. 24 (2014), 1375–1395.
[27] S. Hwang, B. Schmiegelt, L. Ferretti, and J. Krug, Universality classes of interaction structures for NK fitness landscapes, J. Stat. Phys. 172 (2018), 226–278.
[28] Y. Iwasa, F. Michor, and M. A. Nowak, Stochastic tunnels in evolutionary dynamics, Genetics 166 (2004), 1571–1579.
[29] K. Jain, J. Krug, and S. C. Park, Evolutionary advantage of small populations on complex fitness landscapes, Evolution 65 (2011), 1945–1955.
[30] M. Josupeit, The influence of backwards mutations on the number of accessible paths in a fitness landscape, Bachelor thesis, University of Cologne, 2015, https://kups.ub.uni-koeln.de/10344/.
[31] S. Kauffman and S. Levin, Towards a general theory of adaptive walks on rugged landscapes, J. Theoret. Biol. 128 (1987), 11–45.
[32] S. A. Kauffman, The Origins of Order, Oxford University Press, London, 1993.
[33] S. A. Kauffman and E. D. Weinberger, The NK model of rugged fitness landscapes and its application to maturation of the immune response, J. Theoret. Biol. 141 (1989), 211–245.
[34] R. Koekoek, P. Lesky, and R. Swarttouw, Hypergeometric Orthogonal Polynomials and Their q-Analogues, Springer, Berlin, 2010.
[35] W. König, Branching random walks in random environment, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 23–41.
[36] L. Li, Phase transition for accessibility percolation on hypercubes, J. Theoret. Probab. 31 (2018), 2072–2111.


[37] A. A. Louis, Contingency, convergence and hyper-astronomical numbers in biological evolution, Stud. Hist. Philos. Biol. Biomed. Sci. 58 (2016), 107–116.
[38] E. R. Lozovsky, T. Chookajorn, K. M. Brown, M. Imwong, P. J. Shaw, S. Kamchonwongpaisan, D. E. Neafsey, D. M. Weinreich, and D. L. Hartl, Stepwise acquisition of pyrimethamine resistance in the malaria parasite, Proc. Natl. Acad. Sci. USA 106 (2009), 12025–12030.
[39] C. A. Macken, P. S. Hagan, and A. S. Perelson, Evolutionary walks on rugged landscapes, SIAM J. Appl. Math. 51 (1991), 799–827.
[40] A. Martinsson, Accessibility percolation and first-passage site percolation on the unoriented binary hypercube, preprint (2015), https://arxiv.org/abs/1501.02206.
[41] A. Martinsson, Unoriented first-passage percolation on the n-cube, Ann. Appl. Probab. 26 (2016), 2597–2625.
[42] A. Martinsson, First-passage percolation on Cartesian power graphs, Ann. Probab. 46 (2018), 1004–1041.
[43] J. Maynard Smith, Natural selection and the concept of a protein space, Nature 225 (1970), 563–564.
[44] J. Neidhart and J. Krug, Adaptive walks and extreme value theory, Phys. Rev. Lett. 107 (2011), Article ID 178102.
[45] J. Neidhart, I. G. Szendro, and J. Krug, Adaptation in tunably rugged fitness landscapes: The rough Mount Fuji model, Genetics 198 (2014), 699–721.
[46] S. Nowak and J. Krug, Accessibility percolation on n-trees, EPL 101 (2013), Article ID 66004.
[47] S. Nowak and J. Krug, Analysis of adaptive walks on NK fitness landscapes with different interaction schemes, J. Stat. Mech. Theory Exp. 2015 (2015), Article ID P06014.
[48] H. A. Orr, The population genetics of adaptation: The adaptation of DNA sequences, Evolution 56 (2002), 1317–1330.
[49] H. A. Orr, A minimum on the mean number of steps taken in adaptive walks, J. Theoret. Biol. 220 (2003), 241–247.
[50] H. A. Orr, The population genetics of adaptation on correlated fitness landscapes: The block model, Evolution 60 (2006), 1113–1124.
[51] S.-C. Park and J. Krug, δ-exceedance records and random adaptive walks, J. Phys. A 49 (2016), Article ID 315601.
[52] S.-C. Park, J. Neidhart, and J. Krug, Greedy adaptive walks on a correlated fitness landscape, J. Theoret. Biol. 397 (2016), 89–102.
[53] S.-C. Park, I. G. Szendro, J. Neidhart, and J. Krug, Phase transition in random adaptive walks on correlated fitness landscapes, Phys. Rev. E 91 (2015), Article ID 042707.
[54] A. S. Perelson and C. A. Macken, Protein evolution on partially correlated landscapes, Proc. Natl. Acad. Sci. USA 92 (1995), 9657–9661.
[55] F. J. Poelwijk, D. J. Kiviet, D. M. Weinreich, and S. J. Tans, Empirical fitness landscapes reveal accessible evolutionary paths, Nature 445 (2007), 383–386.


[56] C. M. Reidys, Random induced subgraphs of generalized n-cubes, Adv. Appl. Math. 19 (1997), 360–377.
[57] C. M. Reidys, Large components in random induced subgraphs of n-cubes, Discrete Math. 309 (2009), 3113–3124.
[58] M. Roberts and L. Zhao, Increasing paths in regular trees, Electron. Commun. Probab. 18 (2013), 1–10.
[59] B. Schmiegelt, Sign epistasis networks, Master thesis, University of Cologne, 2016, https://kups.ub.uni-koeln.de/10398/.
[60] B. Schmiegelt and J. Krug, Evolutionary accessibility of modular fitness landscapes, J. Stat. Phys. 154 (2014), 334–355.
[61] B. Schmiegelt and J. Krug, Accessibility percolation on Cartesian power graphs, preprint (2019), https://arxiv.org/abs/1912.07925.
[62] P. F. Stadler, Fitness landscapes, in: Biological Evolution and Statistical Physics (eds. M. Lässig and A. Valleriani), Springer, Berlin (2002), 183–204.
[63] T. N. Starr and J. W. Thornton, Epistasis in protein evolution, Prot. Sci. 25 (2016), 1204–1218.
[64] I. G. Szendro, J. Franke, J. A. G. M. de Visser, and J. Krug, Predictability of evolution depends non-monotonically on population size, Proc. Natl. Acad. Sci. USA 110 (2013), 571–576.
[65] I. G. Szendro, M. F. Schenk, J. Franke, J. Krug, and J. A. G. M. de Visser, Quantitative analyses of empirical fitness landscapes, J. Stat. Mech. Theory Exp. 2013 (2013), Article ID P01005.
[66] D. M. Weinreich and L. Chao, Rapid evolutionary escape of large populations from local fitness peaks is likely in Nature, Evolution 59 (2005), 1175–1182.
[67] D. M. Weinreich, N. F. Delaney, M. A. DePristo, and D. L. Hartl, Darwinian evolution can follow only very few mutational paths to fitter proteins, Science 312 (2006), 111–114.
[68] D. M. Weinreich, R. A. Watson, and L. Chao, Perspective: Sign epistasis and genetic constraint on evolutionary trajectories, Evolution 59 (2005), 1165–1174.
[69] M. Zagorski, Z. Burda, and B. Waclaw, Beyond the hypercube: Evolutionary accessibility of fitness landscapes with realistic mutational networks, PLoS Comp. Biol. 12 (2016), Article ID e1005218.

Chapter 2

Branching random walks in random environment

Wolfgang König

We consider branching particle processes on discrete structures like the hypercube in a random fitness landscape (i.e. random branching/killing rates). The main question is where the main part of the population sits at a late time if the state space is large. To answer this, we take the expectation with respect to the migration (mutation) and the branching/killing (selection) mechanisms, for fixed rates. This is intimately connected with the parabolic Anderson model, the heat equation with random potential, a model that is of interest in mathematical physics because of the prominent effect of intermittency (local concentration of the mass of the solution in small islands). We present several advances in the investigation of this effect, also related to questions inspired by biology.

2.1 Introduction

In this chapter we study a topic that is of interest both in mathematics and in mathematical population biology: branching random walks on a graph that either models the space in which we live or a space of genotypes. That is, the particles are under the influence of three random mechanisms: movement, branching and killing. In the case of genotypes, the movement can be understood as mutation, and the killing as selection; hence such models belong to the cornerstones of the mathematical description of random population processes. We introduce this mathematical model in Section 2.2 and explain its biological interpretation in Section 2.2.2.

The main point that we are interested in here is the situation where both the branching rates and the killing rates may depend on the site that they are attached to, and where these rates are taken as random, typically independent and identically distributed. In this case, the expected number of particles in all the sites at a given time is the solution to the heat equation with potential, which is now random. This model is called the parabolic Anderson model (PAM); it is also significant in physics and in other parts of applied mathematics because of an interesting high-concentration effect called intermittency. All this is explained in Section 2.2.

In Section 2.3 we introduce a time-discrete version of the model that has two state spaces: one describes the location, the other describes the type of the particles. We develop a version of a Feynman–Kac formula in this setting and derive the large-time asymptotics for the expected total mass of the branching process for a particular distribution of the fitness variables. The formulas that describe these asymptotics contain interesting information about the interplay of the two state spaces; mostly


we are interested in the description of the limiting structures in which the main part of the particles sits.

A slightly different question is investigated in Section 2.4: for a particularly simple fitness distribution on a finite but large graph, how much time is needed for the main part of the population to move to the fittest site? In Section 2.5, we extend the setting of the parabolic Anderson model (which describes the first moment of the number of particles) to a Feynman–Kac-type formula for all the moments of the number of particles, taken over migration, branching and killing, valid for all choices of the branching/killing rates. This formula is then employed to derive large-time asymptotics for the high moments, for a particular potential distribution. In Section 2.6 we mention and motivate future work on extensions of the study of branching random walks in random environment, like the study of the high-moment asymptotics, the parabolic Anderson model on random tree-like graphs and the effect of inserting a mutually repellent force between the particles.

2.2 Spatial branching random walks in random environment

We introduce the basic model in Section 2.2.1, explain its interpretation for evolutionary applications in Section 2.2.2 and its relation with the parabolic Anderson model in Section 2.2.3, and give a brief account of the types of limiting statements available in the literature in Section 2.2.4.

2.2.1 The model

Let $\mathcal{X}$ be some graph and $X = (X_t)_{t\in[0,\infty)}$ be a Markov process in continuous time on $\mathcal{X}$, which is most often taken as the simple random walk. The main examples for the state space $\mathcal{X}$ are the infinite space $\mathbb{Z}^d$ or the finite hypercube $\mathcal{X} = \{1,\dots,K\}^n$ for some $K, n \in \mathbb{N}$. Let $\xi_+ = (\xi_+(x))_{x\in\mathcal{X}}$ and $\xi_- = (\xi_-(x))_{x\in\mathcal{X}}$ be two collections of nonnegative numbers, attached to the sites of the state space. Now the spatial branching random walk with potential $(\xi_+, \xi_-)$ is introduced as follows. Starting with a single particle at time zero at the origin $O \in \mathcal{X}$, the particle runs like a copy of $X$ through $\mathcal{X}$. Furthermore, when located at $x$, the particle splits into two particles with rate $\xi_+(x)$ and dies (is removed from the system) with rate $\xi_-(x)$. Hence $\xi_+(x)$ is the branching rate at $x$, and $\xi_-(x)$ is the killing rate. The two newborn copies proceed independently in the same way as the initial particle.

This is one of the basic models for the time-evolution of a population in a space with several random mechanisms: migration, branching and killing. The fields of rates, $\xi_+$ and $\xi_-$, introduce disorder in the latter two mechanisms; indeed we will often consider them here as random, independent and identically distributed. Areas with high values of $\xi_+$ will produce many offspring over a long time, and areas with high values of $\xi_-$ will contain fewer particles over a long time. Since the $\xi_+(x)$ are rates,


the number of particles at $x$ at time $t$ is likely to grow exponentially fast with rate $\xi_+(x)$ (neglecting the killing mechanism), but the migration mechanism diminishes this effect, since the many newborn particles have the tendency to move away from this site (but may return later). See [6] for more explanations about random walks with branching and killing in a random medium, in particular in connection with equations of the type that we will be examining in Sections 2.2.2 and 2.2.3.

The space $\mathcal{X}$ is often taken as (a discrete version of) the real space in which we live, i.e. the Euclidean space $\mathbb{Z}^3$, and many investigations of the model concentrate on this interpretation. However, from the viewpoint of evolution of a population in a biological sense, it is also highly interesting to view $\mathcal{X}$ as a space of genotypes, i.e. as a space of more abstract, biological properties of the particles. One particularly descriptive example is the choice of $\mathcal{X}$ as the set of gene sequences, representing each particle by its genome, which is a sequence composed of four or, for simplicity, two alleles. Here the hypercube $\mathcal{X} = \{-1,1\}^N$ is the most canonical choice. In Section 2.2.2 we explain the interpretation of our model from the viewpoint of biological evolution.

The standard reflex of a probabilist is to choose the migration mechanism as the nearest-neighbour simple random walk, where two gene sequences are considered neighbours if they differ in precisely one allele. We will consider this in Section 2.6.2, but let us note that there are more biologically inspired migration mechanisms that respect the way in which mutation and selection of gene sequences really happen [3]. A mathematical investigation of the above branching model with this type of random motion is widely open and would be rather interesting and well-motivated.
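The dynamics of Section 2.2.1 can be simulated directly. The following is a minimal sketch under stated assumptions, not a reference implementation: it uses the hypercube $\{-1,1\}^N$, lets each particle jump to each of its $N$ neighbours at rate 1 (the simple random walk generated by the Laplace operator), and takes the rate fields as ordinary Python functions. The function name `simulate_brw` and the cap `max_particles` are hypothetical choices made for this illustration.

```python
import random

def simulate_brw(n_loci, xi_plus, xi_minus, t_max, rng, max_particles=5000):
    """Gillespie-type simulation of the branching random walk on {-1,1}^n_loci.

    xi_plus and xi_minus map a site (a tuple of +1/-1 entries) to its
    branching and killing rate. Returns the list of particle positions at
    time t_max (or at the moment the population cap is reached).
    """
    particles = [(1,) * n_loci]              # single particle at the origin
    t = 0.0
    while particles and len(particles) < max_particles:
        # total event rate of each particle: migration + branching + killing
        rates = [n_loci + xi_plus(x) + xi_minus(x) for x in particles]
        t += rng.expovariate(sum(rates))     # waiting time until the next event
        if t > t_max:
            break
        k = rng.choices(range(len(particles)), weights=rates)[0]
        x = particles[k]
        r = rng.random() * rates[k]
        if r < n_loci:                       # migration: flip one uniformly chosen allele
            i = rng.randrange(n_loci)
            particles[k] = x[:i] + (-x[i],) + x[i + 1:]
        elif r < n_loci + xi_plus(x):        # branching: add a second particle at x
            particles.append(x)
        else:                                # killing: remove the particle
            particles[k] = particles[-1]
            particles.pop()
    return particles

if __name__ == "__main__":
    rng = random.Random(2024)
    # hypothetical rate fields: branching favoured at sites with many +1 alleles
    final = simulate_brw(4, lambda x: 0.5 + 0.25 * x.count(1), lambda x: 0.3, 3.0, rng)
    print(len(final), "particles alive at time 3")
```

Averaging the particle counts per site over many independent runs gives a Monte Carlo estimate of the expected number of particles, which is the quantity studied in the following sections.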
A spatial branching process model (called the logistic branching random walk) without randomness in the birth rates, but with a kind of birth control in the case of local overpopulation, is considered in [4, Section 3] in this volume. Here the birth rate at a given site is bounded, and is small (possibly zero) if the current number of individuals in a neighbourhood is large. The main results reported there are criteria for survival and for long-time distributional convergence to equilibrium, conditioned on survival. Proofs are based on a comparison technique using coupling.

2.2.2 The heat equation and a mutation-selection population model

The model of branching random walks introduced in Section 2.2.1 on the hypercube $\mathcal{X} = \{-1,1\}^N$ with random branching rates has a profound interpretation in terms of a mutation-selection population model in a random fitness landscape. (See [14] for theoretical background about the notion of fitness in a biological context and about the numbers and lengths of paths between sites of different fitnesses.) In [2], this is described as follows. The mutation-selection model is given by the solution $v_N(t,\cdot,y)$ of the partial differential equation (PDE)

\[
\partial_t v_N(t,x,y) = \frac{1}{N}\Delta v_N(t,x,y) + \bigl[\xi(x) - \bar\xi(t)\bigr]\, v_N(t,x,y), \qquad t \in [0,\infty),\ x \in \{-1,1\}^N, \tag{2.2.1}
\]


with the localised initial condition $v_N(0,\cdot,y) = \delta_y(\cdot)$. The potential $\xi = \xi_+ - \xi_-$ is interpreted as a fitness landscape, and the mean fitness is given by

\[
\bar\xi(t) = \sum_{x\in\{-1,1\}^N} v_N(t,x,y)\,\xi(x). \tag{2.2.2}
\]

Here $\Delta$ is the Laplace operator defined by

\[
\Delta f(x) = \sum_{y\in\mathcal{X}\colon y\sim x} \bigl[f(y) - f(x)\bigr]. \tag{2.2.3}
\]

Now, (2.2.1) is not a particle model, but a PDE; however, let us already note that such types of PDEs are very suitable for describing the expectation of the number of particles in a spatial branching process for fixed branching rates. This will be explained in a broader perspective in Section 2.2.3.

Let us briefly explain the biological meaning of (2.2.1). Haploid genotypes are identified with a linear arrangement of $N$ sites $x = (x(1),\dots,x(N))$, with each site taking the value $-1$ or $+1$. In the multilocus context, sites correspond to loci and the variables $x(i)$ to alleles. In the context of molecular evolution, $x$ corresponds to a DNA (or RNA) sequence, where the nucleotides are lumped into purines (say, $+1$) and pyrimidines (say, $-1$). In the biology literature, the hypercube is usually called the sequence space. Then the mutation-selection model in (2.2.1) describes the evolution of an infinite population of haploids that experience only mutation and selection. The population evolves in continuous time (non-overlapping generations) with mutation and selection occurring independently (in parallel). The potential value $\xi(x)$ is the Malthusian fitness of type $x$, and the $\xi(x)$ form a fitness landscape, which in our case is random. Site mutations happen with rate $\frac{1}{N}$ (hence, at a total rate of one). Summing (2.2.1) over $x$ and using (2.2.2), it follows that $\sum_x v_N(t,x,y) = 1$, and $v_N(t,x,y)$ corresponds to the frequency of type $x$ under this evolution. Finally, note that the localised initial condition means that initially the population consists of only type $y$.

The competition between diffusion and potential in the PAM translates into a competition between mutation and selection, two driving forces of Darwinian evolution. We refer to the classical book [7] for an introduction to population genetics and to [9] for an excellent survey that involves the statistical physics methods used to solve mutation-selection models for a wide range of landscapes.
In [13], the genealogy of the gene sequence for a large class of models with selection and mutation is constructed and simulated. In the following, we will concentrate on the situation where the fitness landscape is random. The motivation for this is the following. Realistic landscapes are expected to be complex with structures such as valleys and hills. Random fitness landscapes naturally form a class of complex landscapes. The first obvious choice, that is an i.i.d. landscape, is also known as the house of cards model.
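To make the mutation-selection equation (2.2.1) concrete, the following sketch integrates it numerically on a small hypercube and compares the result with the normalised solution of the corresponding linear equation (the one obtained by dropping the mean fitness). Everything specific here is an assumption made for illustration: the choice $N = 4$, a standard normal i.i.d. (house-of-cards) landscape, the encoding of a genotype as an integer whose bits stand for the alleles, and a classical Runge-Kutta scheme.

```python
import random

def laplacian(f, n_loci):
    """Hypercube Laplacian (2.2.3); site x is an integer whose bits encode the alleles."""
    return [sum(f[x ^ (1 << i)] for i in range(n_loci)) - n_loci * f[x]
            for x in range(len(f))]

def rhs_linear(u, xi, n_loci):
    """Right-hand side of (2.2.1) with the mean fitness replaced by zero."""
    lap = laplacian(u, n_loci)
    return [lap[x] / n_loci + xi[x] * u[x] for x in range(len(u))]

def rhs_mutation_selection(v, xi, n_loci):
    """Right-hand side of (2.2.1), with the mean fitness taken from (2.2.2)."""
    mean_fitness = sum(vx * fx for vx, fx in zip(v, xi))
    lap = laplacian(v, n_loci)
    return [lap[x] / n_loci + (xi[x] - mean_fitness) * v[x] for x in range(len(v))]

def rk4(rhs, f, xi, n_loci, t_end, steps):
    """Classical fourth-order Runge-Kutta integration from time 0 to t_end."""
    h = t_end / steps
    for _ in range(steps):
        k1 = rhs(f, xi, n_loci)
        k2 = rhs([a + 0.5 * h * b for a, b in zip(f, k1)], xi, n_loci)
        k3 = rhs([a + 0.5 * h * b for a, b in zip(f, k2)], xi, n_loci)
        k4 = rhs([a + h * b for a, b in zip(f, k3)], xi, n_loci)
        f = [a + h * (b + 2 * c + 2 * d + e) / 6
             for a, b, c, d, e in zip(f, k1, k2, k3, k4)]
    return f

def compare(n_loci=4, t_end=3.0, steps=3000, seed=0):
    """Return the type frequencies v and the normalised linear solution u / sum(u)."""
    rng = random.Random(seed)
    xi = [rng.gauss(0.0, 1.0) for _ in range(2 ** n_loci)]   # assumed i.i.d. landscape
    start = [1.0] + [0.0] * (2 ** n_loci - 1)                # population of a single type
    v = rk4(rhs_mutation_selection, start, xi, n_loci, t_end, steps)
    u = rk4(rhs_linear, start, xi, n_loci, t_end, steps)
    total = sum(u)
    return v, [ux / total for ux in u]

if __name__ == "__main__":
    v, u_normalised = compare()
    print("sum of frequencies:", sum(v))
    print("largest deviation:", max(abs(a - b) for a, b in zip(v, u_normalised)))
```

In this sketch the frequencies keep total mass one, and they agree with the normalised linear solution up to discretisation error, which is the reason why the linear equation is the natural object to study.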


Let us also mention that it is well known [15] that

\[
v_N(t,x,y) = \frac{u_N(t,x,y)}{\sum_{x'\in\{-1,1\}^N} u_N(t,x',y)},
\]

where $u_N$ is the solution to (2.2.1) with $\bar\xi$ replaced by zero; this linear equation is the heat equation with potential that we are going to introduce in Section 2.2.3. In a way, $u_N(t,x,y)$ can be thought of as an absolute frequency.

2.2.3 The parabolic Anderson model

Let us return to the branching random walk model introduced in Section 2.2.1 with fixed branching rates $\xi_+(x)$ and killing rates $\xi_-(x)$. We want to explain this model from the viewpoint of PDEs, like in (2.2.1). For the time being, it is inessential that the rates are random. By $\eta(t,x)$ we denote the number of particles at time $t$ at site $x$, and by $u(t,x)$ its expectation with respect to migration, branching and killing. Note that the expectation of $\eta(t,x)$ does not depend on $\xi_+$ and $\xi_-$ separately, but only on the difference field $\xi = \xi_+ - \xi_-$, since branching and killing balance out in the expectation. We can therefore view $u(t,\cdot)$ as a time-dependent field that depends on $\xi = (\xi(x))_{x\in\mathcal{X}}$. As a matter of fact, the time-dependent field $u(t,\cdot)$ satisfies the heat equation with potential $\xi = \xi_+ - \xi_-$, which is the equation

\[
\begin{aligned}
\partial_t u(t,x) &= \Delta u(t,x) + \xi(x)\,u(t,x), && x \in \mathcal{X},\ t \in (0,\infty),\\
u(0,x) &= \mathbb{1}_O(x), && x \in \mathcal{X}.
\end{aligned}
\tag{2.2.4}
\]

Here $\Delta$ is the generator of the Markov process $X$, i.e. in the case of a simple random walk on the graph $\mathcal{X}$, it is the Laplace operator defined in (2.2.3). The solution theory for the heat equation is rather rich and explicit. Indeed, the solution $u$ to (2.2.4) can be represented in terms of the Feynman–Kac formula

\[
u(t,x) = \mathbb{E}_O\Bigl[ e^{\int_0^t \xi(X_s)\,\mathrm{d}s}\, \mathbb{1}\{X_t = x\} \Bigr], \tag{2.2.5}
\]

and in terms of an eigenvalue expansion for the eigenvalues $\lambda_1 > \lambda_2 > \lambda_3 > \cdots$ of the operator $\Delta + \xi$,

\[
u(t,x) = \sum_k e^{t\lambda_k}\,\varphi_k(O)\,\varphi_k(x), \tag{2.2.6}
\]

where $(\varphi_k)_k$ is an orthonormal sequence of corresponding eigenfunctions.

So far, the field $\xi$ of differences of branching and killing rates collects just the coefficients of the PDE in (2.2.4), and their randomness is inessential for the above. But let us turn now to the case that $\xi$ is a random collection of numbers, which we then call a random potential. In this case, the operator $\Delta + \xi$ and its spectrum are also random; it is called the Anderson operator and plays an important role in mathematical physics. In particular, the famous phenomenon of Anderson localisation has kept the interest in the description of its spectrum alive for decades. The heat equation with random potential in (2.2.4) is then called the parabolic Anderson model.
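The role of the principal eigenvalue in (2.2.6) can be checked numerically on a small example. The sketch below is an illustration under stated assumptions rather than part of the theory: it fixes a hypothetical potential on the three-dimensional hypercube (sites coded as the integers 0 to 7, with 0 playing the role of the origin), integrates (2.2.4) with a classical Runge-Kutta scheme, and compares the late-time exponential growth rate of the total mass with the largest eigenvalue of $\Delta + \xi$ obtained by power iteration.

```python
import math

N_DIM = 3                                        # hypercube {-1,1}^3, sites coded as 0..7
XI = [0.0, 1.0, 0.5, 2.0, 0.3, 1.5, 0.8, 3.0]    # hypothetical fixed potential

def apply_operator(f):
    """Apply Delta + xi, with Delta the hypercube Laplacian from (2.2.3)."""
    return [sum(f[x ^ (1 << i)] - f[x] for i in range(N_DIM)) + XI[x] * f[x]
            for x in range(len(f))]

def evolve(u, duration, h=0.01):
    """Integrate (2.2.4) over the given duration with Runge-Kutta steps of size h."""
    for _ in range(round(duration / h)):
        k1 = apply_operator(u)
        k2 = apply_operator([a + 0.5 * h * b for a, b in zip(u, k1)])
        k3 = apply_operator([a + 0.5 * h * b for a, b in zip(u, k2)])
        k4 = apply_operator([a + h * b for a, b in zip(u, k3)])
        u = [a + h * (b + 2 * c + 2 * d + e) / 6
             for a, b, c, d, e in zip(u, k1, k2, k3, k4)]
    return u

def principal_eigenvalue(iterations=5000):
    """Power iteration on Delta + xi + shift, which has only positive eigenvalues."""
    shift = 2 * N_DIM + max(abs(v) for v in XI) + 1.0
    f = [1.0] * len(XI)
    for _ in range(iterations):
        g = [a + shift * b for a, b in zip(apply_operator(f), f)]
        norm = math.sqrt(sum(a * a for a in g))
        f = [a / norm for a in g]
    # Rayleigh quotient of the normalised vector with the unshifted operator
    return sum(a * b for a, b in zip(f, apply_operator(f)))

def growth_rate(t1=80.0, t2=100.0):
    """Average exponential growth rate of the total mass U between t1 and t2."""
    u1 = evolve([1.0] + [0.0] * (len(XI) - 1), t1)   # start with unit mass at site 0
    u2 = evolve(u1, t2 - t1)
    return (math.log(sum(u2)) - math.log(sum(u1))) / (t2 - t1)

if __name__ == "__main__":
    print("principal eigenvalue:", principal_eigenvalue())
    print("late-time growth rate:", growth_rate())
```

On a finite graph the two numbers agree up to an error that decays exponentially in time, at a rate given by the spectral gap; on large or infinite graphs with random potential, the interesting question is where the corresponding eigenfunction is concentrated.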


As we are interested in the behaviour of $u(t,\cdot)$ for large $t$, we will be concerned only with the boundary part of the spectrum, mostly only with the principal eigenvalue $\lambda_1$ (this is easily seen from (2.2.6)). Anderson localisation is, roughly speaking, the phenomenon that all the eigenfunctions of the Anderson operator $\Delta + \xi$, at least the ones corresponding to eigenvalues close to the boundary of the spectrum, are highly concentrated in small areas that are randomly located. The principal eigenvalue is included in this statement. Again looking at (2.2.6), we can guess that this concentration property is inherited by the random function $u(t,\cdot)$. This has been shown to be true in some generality, and this is the basis of a detailed understanding of the large-time behaviour of the entire branching process in a random potential of branching/killing rates.

The main assumption on the random potential $\xi$ is that it is a collection of independent, identically distributed (i.i.d.) random variables. Many distributions can be considered, and for many applications there is no canonical choice. Not even ubiquitous distributions like the standard normal distribution or the exponential distribution can be considered canonical, neither from the viewpoint of applications nor from the viewpoint of interesting emerging effects in this model. Rather, it turned out in the study of the moments of $u$ that the most natural choice in this respect is perhaps the double-exponential distribution with parameter $\rho \in (0,\infty)$,

\[
\mathrm{Prob}\bigl(\xi(0) > u\bigr) = \exp\bigl\{-e^{u/\rho}\bigr\}, \qquad u \in \mathbb{R}, \tag{2.2.7}
\]

since several investigations have shown that the emerging picture in the long-time behaviour of the process has a non-trivial and not too complicated structure.

Our main interest here is in the description of the long-time (i.e. large-$t$) behaviour of the random field $u(t,\cdot)$, in particular of its total mass

\[
U(t) = \sum_{x\in\mathcal{X}} u(t,x).
\]

Of particular interest is the question whether $u(t,\cdot)$ develops some (randomly located) areas with particularly high values, i.e. particularly high accumulation of particles. Such areas are often called intermittent islands. Further questions concern the size and the location of these islands and characterisations of the potential $\xi(\cdot)$ and the solution $u(t,\cdot)$ inside them.

2.2.4 Asymptotics

Let us explain some of the fundamental results on the large-$t$ behaviour of the parabolic Anderson model introduced in Section 2.2.3. See [12] for a comprehensive survey of the research until 2016 on the PAM, which was almost exclusively done on the Euclidean space $\mathbb{R}^d$ or $\mathbb{Z}^d$. We abstain in this section from giving explicit references to the original literature and refer to [12] throughout.

Here is a fundamental assertion that has been proved about the large-$t$ behaviour of the PAM: the asymptotics of the moments of the total mass of the solution, $\langle U(t)\rangle$


(we write $\langle\,\cdot\,\rangle$ for expectation with respect to the potential $\xi$). We present this here only for the double-exponential distribution, as this is the one that we will consider further. The following is also the historically first result of this type; together with its proof, it gave guidance to the analysis of the PAM for many other potential distributions over the last 20 years. Introducing $H(t) = \log\langle e^{t\xi(0)}\rangle$, the cumulant generating function of one of the potential variables, we have the following (see [12, Theorem 3.13 and Remark 3.17]).

Theorem 2.2.1 (Moment asymptotics of the PAM). We consider the PAM as in (2.2.4) on $\mathcal{X} = \mathbb{Z}^d$ and assume that the random potential $\xi = (\xi(x))_{x\in\mathbb{Z}^d}$ is i.i.d. and doubly-exponentially distributed as in (2.2.7) with parameter $\rho \in (0,\infty)$. Then there is a number $\chi = \chi(\rho) \in [0, 2d]$ such that

\[
\frac{1}{t}\log\langle U(t)\rangle = \frac{H(t)}{t} - \chi + o(1), \qquad t \to \infty. \tag{2.2.8}
\]
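For the double-exponential distribution (2.2.7), the function $H$ is explicit: a variable with this distribution can be written as $\rho\log E$ with $E$ standard exponentially distributed, so that $\langle e^{t\xi(0)}\rangle = \langle E^{\rho t}\rangle = \Gamma(\rho t + 1)$. The following short sketch (the function names are hypothetical) evaluates $H(t) = \log\Gamma(\rho t+1)$ and compares it with the first two terms of its Stirling expansion, $\rho t\log t + (\rho\log\rho - \rho)t$.

```python
import math

def cumulant_generating_function(t, rho):
    """H(t) = log <exp(t * xi(0))> for xi(0) = rho * log(E), E standard exponential."""
    return math.lgamma(rho * t + 1.0)

def stirling_approximation(t, rho):
    """Leading terms rho*t*log(t) + (rho*log(rho) - rho)*t of H(t) as t grows."""
    return rho * t * math.log(t) + (rho * math.log(rho) - rho) * t

if __name__ == "__main__":
    rho = 0.7
    for t in (10.0, 1e2, 1e3, 1e4):
        exact = cumulant_generating_function(t, rho)
        approx = stirling_approximation(t, rho)
        # the difference is of order log(t), hence o(t)
        print(t, exact / t, (exact - approx) / t)
```

The printed ratio $H(t)/t$ grows like $\rho\log t$, which is why the first term in (2.2.8) diverges while the correction $\chi$ stays bounded.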

Note that $H(t) = \rho t\log t + (\rho\log\rho - \rho)\,t + o(t)$ for $t \to \infty$, hence the leading term $H(t)/t$ tends to infinity, while the second term $\chi$ is a constant. Indeed, the characteristic quantity $\chi$ is represented by a certain variational formula that is interpretable and can be investigated more deeply. The interpretation is that the main contribution to the expected total mass comes from a localised peak in a certain island in $\mathbb{Z}^d$, called the intermittent island. More precisely, it comes from a rather small area in which the potential $\xi$ assumes particularly high values. This area is essentially a centred ball with a deterministic, fixed radius (not depending on $t$), which is a special feature that only the double-exponential distribution shows. We can also argue in terms of the Feynman–Kac formula in (2.2.5) that the random walk $(X_s)_{s\in[0,t]}$ spends practically all the time in this island in order to benefit as much as possible from the high potential values. In terms of the eigenvalue expansion (2.2.6), we can argue that the local principal eigenvalue (say, with zero boundary condition) in that island is particularly large. The leading term $H(t)/t$ expresses the high potential value, and the second term $\chi$ expresses the optimal compromise between the long stay of the random walk in the island and the size of the island. (In order to see that the large-$t$ asymptotics may have anything to do with any kind of optimisation, note that both the Feynman–Kac formula and the eigenvalue expansion reveal that $U(t)$ is something like a $t$-th exponential moment, and recall that, for any random variable $Y$, the moment $\mathbb{E}[e^{tY}]$ behaves like $e^{t\,\operatorname{ess\,sup} Y}$.)

Another fundamental result is on the almost sure behaviour of the total mass, which reads as follows in the above special case (see [12, Theorem 5.1 and Remark 3.17]).

Theorem 2.2.2 (Almost-sure asymptotics of the PAM). Under the same assumptions as in Theorem 2.2.1, there is a number $\tilde\chi \in [0, 2d]$ such that

\[
\frac{1}{t}\log U(t) = h_t - \tilde\chi + o(1), \qquad t \to \infty, \text{ almost surely},
\]

where $h_t = \rho\log\log|B_t|$, and $|B_t|$ is the cardinality of the box $B_t = [-t,t]^d \cap \mathbb{Z}^d$.
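The quantity $h_t$ in Theorem 2.2.2 can be compared with a simulated potential. The sketch below is only an illustration with hypothetical function names: it samples an i.i.d. double-exponential field on a box by using the representation $\xi = \rho\log E$ with $E$ standard exponentially distributed (which reproduces the tail in (2.2.7)), and compares the largest sampled value with $\rho\log\log|B_t|$.

```python
import math
import random

def box_size(t, d):
    """Cardinality of B_t = [-t, t]^d intersected with Z^d, for an integer t."""
    return (2 * t + 1) ** d

def h(n_sites, rho):
    """The deterministic scale rho * log(log(n_sites)) from Theorem 2.2.2."""
    return rho * math.log(math.log(n_sites))

def sampled_maximum(n_sites, rho, rng):
    """Maximum of n_sites i.i.d. double-exponential variables rho * log(E).

    The logarithm is increasing, so it suffices to take it of the largest
    exponential variable.
    """
    largest = max(rng.expovariate(1.0) for _ in range(n_sites))
    return rho * math.log(largest)

if __name__ == "__main__":
    rng = random.Random(0)
    rho, d = 0.7, 2
    for t in (10, 100, 300):
        n = box_size(t, d)
        print(t, n, h(n, rho), sampled_maximum(n, rho, rng))
```

The sampled maximum fluctuates around $h_t$ with deviations that vanish as the box grows, in line with the discussion of $h_t$ in the text.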


Indeed, $h_t$ is the asymptotics of the maximum of all the potential values in $B_t$, and $\tilde\chi$ is given by another variational formula that describes the optimal principal eigenvalue of the operator $\Delta + q$ under the assumption that the potential $q\colon \mathbb{Z}^d \to \mathbb{R}$ is not too improbable to be realised under $\xi$. The explanation is that, in a box of some radius $r_t$ around the starting site $0$, one searches for a local region in which the local principal eigenvalue of $\Delta + \xi$ is as large as possible. Then the path in the Feynman–Kac formula runs quickly to that place (which is typically $\approx r_t$ away) and stays in that island for the remaining time until $t$. Here the radius $r_t$ is chosen in such a way that an optimal compromise is reached between the probabilistic cost of reaching that remote place in a short time and having enough time left to benefit from that favourable potential. As a deeper investigation in [5] has shown, the optimal choice is $r_t = t/(\log t\,\log_3 t)$ (with $\log_k t$ the $k$-fold iteration of $\log$), and the time that is used for getting there is of the order $t/(\log t\,\log_2 t\,\log_3 t)$.

As for the moments, it is characteristic of the double-exponential distribution in (2.2.7) that the size of the intermittent island does not depend on $t$. Other potential distributions lead to different sizes of the intermittent islands. As a rule, the heavier the upper tail of the potential variable, the smaller the island. For example, for the standard normal, the Weibull or the Pareto distribution, one obtains single-site islands, while for bounded random variables the islands depend on $t$ and grow unboundedly.

2.3 Multitype branching random walk in random potential

In this section, we review the results of [11] about a version of the spatial branching walk model defined in Section 2.2.1 with the additional feature that we give two categories of properties to each particle: spatial information and information about its type. We now interpret $\mathcal{X}$ as the location space and take some Markov chain on it as a basis, and concerning the type, we introduce an additional discrete set $\mathcal{T}$, the type space. We will let the rates depend on the types of the particle and of the offspring. Mathematically, one could certainly conceive $\mathcal{X}\times\mathcal{T}$ as one state space, but we are interested in the effect coming from the difference between the two components, since there are different modelling ideas behind $\mathcal{X}$ and $\mathcal{T}$.

We consider a natural and flexible discrete-time version. Denote the transition matrix of the Markov chain on $\mathcal{X}$ by $P = (P_{xy})_{x,y\in\mathcal{X}}$. We equip $\mathcal{T}$ with a set $\mathcal{A}$ of directed edges $(i,j) \in \mathcal{T}\times\mathcal{T}$ and obtain a directed finite graph $\mathcal{G} = (\mathcal{T}, \mathcal{A})$. We assume that each directed edge appears at most once in $\mathcal{A}$, and for each $i \in \mathcal{T}$, there is at least one $j \in \mathcal{T}$ such that $(i,j) \in \mathcal{A}$. Self-edges $(i,i)$ may appear in $\mathcal{A}$. To each $y \in \mathcal{X}$ we attach a matrix $F_y = (F_y^{(i,j)})_{(i,j)\in\mathcal{A}}$ of probability distributions on $\mathbb{N}_0$, the environment. Given $F = (F_y)_{y\in\mathcal{X}}$, we define a discrete-time Markov process $(\eta_n)_{n\in\mathbb{N}_0}$ on $\mathbb{N}_0^{\mathcal{T}\times\mathcal{X}}$, where $\eta_n(i,x)$ is the number of particles of type $i$ at site $x$ at time $n$. The environment $F$ does not depend on time and is fixed throughout the evolution of particles. We specify the transition mechanism of $(\eta_n)_{n\in\mathbb{N}_0}$ as follows:


given that the configuration at time $n$ is equal to $\eta$, the following happens during the time interval $(n, n+1)$.

(1) A particle of type $i$ located at site $y \in \mathcal{X}$ produces, independently for each $j \in \mathcal{T}$ such that $(i,j) \in \mathcal{A}$, precisely $k$ particles of type $j$ at the same site $y$ with probability $F_y^{(i,j)}(k)$, for any $k \in \mathbb{N}_0$. This offspring production is independent over all the particles in $\mathcal{X}$ and over the time $n \in \mathbb{N}_0$.

(2) Immediately after creation, each new particle at $x$ independently chooses a site $y$ with probability $P_{xy}$ and moves there.

(3) The resulting particle configuration is $\eta_{n+1}$.

Fix a site $y \in \mathcal{X}$ and a type $j \in \mathcal{T}$ as the starting point. We start the Markov chain $(\eta_n)_n$ with the initial configuration $\eta_0(i,x) = \delta_j(i)\,\delta_y(x)$, and by $\mathbb{P}_{j,y}$ and $\mathbb{E}_{j,y}$ we denote its distribution and expectation, respectively. Note that the Markov chain depends on the realisation of the environment $F$. We are interested in the expectation of the global population size, $|\eta_n| := \sum_{i\in\mathcal{T},\, x\in\mathcal{X}} \eta_n(i,x)$,

\[
u_n(i,x) := \mathbb{E}_{i,x}\bigl[|\eta_n|\bigr], \qquad n \in \mathbb{N}_0,\ x \in \mathcal{X},\ i \in \mathcal{T}.
\]

As the first main result, we develop two formulas for $u_n$: as the $n$-th power of a certain $(\mathcal{T}\times\mathcal{X})$-matrix, and as an expectation of a multiplicative functional over $n$ steps of a particular Markov chain on that set. These two formulas are extensions of well-known representations of the expected number of particles of multitype branching processes without space; we like the fact that one can see these formulas from the viewpoint of a Feynman–Kac formula. To formulate this, we need to introduce a Markov chain $T = (T_n)_{n\in\mathbb{N}_0}$ on the type space $\mathcal{T}$ with transition probabilities

\[
p_{ij} = \frac{\mathbb{1}\{(i,j)\in\mathcal{A}\}}{\deg(i)}, \qquad i, j \in \mathcal{T},
\]

where $\deg(i) = |\{k \in \mathcal{T}\colon (i,k) \in \mathcal{A}\}|$ is the outdegree of $i$. We define $T$ and $X$ independently on a common probability space and write $\mathbb{P}^{(T,X)}_{i,x}$ and $\mathbb{E}^{(T,X)}_{i,x}$ for probability and expectation, respectively, where $T$ starts from $i$ and $X$ from $x$. We denote by $m_{ij}(y) = \sum_{k\in\mathbb{N}_0} k\,F_y^{(i,j)}(k)$ the expectation of $F_y^{(i,j)}$ (the offspring expectation) and collect these random numbers in the matrix $M(y) := (m_{ij}(y))_{i,j\in\mathcal{T}}$, where we put $m_{ij}(y) = 0$ if $(i,j) \notin \mathcal{A}$. The following is [11, Proposition 1].

Proposition 2.3.1 (Representations for $u_n$). For any $i \in \mathcal{T}$, any $x \in \mathcal{X}$ and any $n \in \mathbb{N}_0$,

\[
u_n(i,x) = \mathbb{E}^{(T,X)}_{i,x}\Bigl[\,\prod_{l=1}^{n} m_{T_{l-1}T_l}(X_{l-1})\,\deg(T_{l-1})\Bigr], \tag{2.3.1}
\]

\[
u_n(i,x) = B^n \mathbb{1}(i,x) = \sum_{j\in\mathcal{T}}\sum_{y\in\mathcal{X}} B^n_{(i,x),(j,y)}, \tag{2.3.2}
\]


where $B$ is the $(\mathcal{T}\times\mathcal{X})\times(\mathcal{T}\times\mathcal{X})$ matrix with coefficients $B_{(i,x),(j,y)} = m_{ij}(x)\,P_{xy}\,\mathbb{1}\{(i,j)\in\mathcal{A}\}$.

The equation in (2.3.2) easily follows from (2.3.1) by writing out explicitly the expectation over the Markov chain and the $n$-fold matrix product. The interpretation of (2.3.1) is as follows. Every $n$-step path $((X_0,T_0),\dots,(X_n,T_n))$, together with the product over the $m_{T_{l-1}T_l}(X_{l-1})$, stands for the union of all $n$-step branching subtrees that produce, in the $l$-th step, for any $l \in \{1,\dots,n\}$, from a particle of type $T_{l-1}$ located at $X_{l-1}$ a particle of type $T_l$ that makes a step to $X_l$ immediately after creation. The product over the $P_{X_{l-1}X_l}\,p_{T_{l-1}T_l}$, together with the product over the $\deg(T_{l-1})$ (note the partial cancelling of terms), summarises the probabilities of all the jumps in the state space. In this way, we encounter a discrete-time version of a Feynman–Kac formula for the Markov chain $(T,X)$ on $\mathcal{T}\times\mathcal{X}$, however with an interesting difference: the potential $\log m_{ij}(x)$ depends on the vertices of the space $\mathcal{X}$ and on the edges of the type space $\mathcal{T}$.

So far, the environment $F$ was fixed, but let us now turn to the case of a random environment, and let us describe our assumptions. We assume that the collection of all distributions $F_y^{(i,j)}$ with $y \in \mathcal{X}$ and $(i,j) \in \mathcal{A}$ is independent. Their distribution depends on $(i,j)$, but not on $y$. We call $F = (F_y)_{y\in\mathcal{X}}$ the random environment and denote by $\mathrm{Prob}$ and $\langle\,\cdot\,\rangle$ probability and expectation with respect to $F$, respectively. Since we are only interested in the expectation of the global number of particles here, we will make our assumptions on the environment only in terms of the quantities $m_{ij}(y) = \sum_{k\in\mathbb{N}_0} k\,F_y^{(i,j)}(k)$. In particular, we assume that the collection of the $m_{ij}(y)$ is stochastically independent over $y \in \mathcal{X}$ and $i, j \in \mathcal{T}$ (but of course not identically distributed). We will study the case where the upper tails of $m_{ij}(y)$ lie in the vicinity of the Weibull distribution with parameter $1/\rho_{ij} \in (0,\infty)$, i.e.

\[
\mathrm{Prob}\bigl(m_{ij}(y) > r\bigr) \approx \exp\bigl\{-r^{1/\rho_{ij}}\bigr\}, \qquad r \to \infty.
\]
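Returning for a moment to a fixed environment, the two representations in Proposition 2.3.1 can be compared on a toy example. The sketch below is a hedged illustration: the type graph, the transition matrix and the offspring means are hypothetical values chosen only to exercise the formulas. It evaluates (2.3.2) by applying the matrix $B$ repeatedly to the all-ones vector, and (2.3.1) by summing over all $n$-step paths of the pair of Markov chains.

```python
from itertools import product

TYPES = (0, 1)
SITES = (0, 1, 2)
EDGES = {(0, 0), (0, 1), (1, 0)}        # directed type graph; every type has an out-edge
P = [[0.5, 0.5, 0.0],                   # transition matrix of the chain on SITES
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
MEAN = {(0, 0): [1.2, 0.3, 2.0],        # MEAN[(i, j)][x] = m_ij(x), hypothetical values
        (0, 1): [0.5, 1.5, 0.1],
        (1, 0): [0.8, 0.9, 3.0]}

def m(i, j, x):
    """Offspring mean m_ij(x), set to zero if (i, j) is not an edge."""
    return MEAN[(i, j)][x] if (i, j) in EDGES else 0.0

def deg(i):
    """Outdegree of type i in the type graph."""
    return sum(1 for j in TYPES if (i, j) in EDGES)

def u_matrix_power(n, i, x):
    """(2.3.2): apply B n times to the all-ones vector, B = m_ij(x) * P_xy."""
    states = list(product(TYPES, SITES))
    vec = {s: 1.0 for s in states}
    for _ in range(n):
        vec = {(a, b): sum(m(a, c, b) * P[b][d] * vec[(c, d)] for (c, d) in states)
               for (a, b) in states}
    return vec[(i, x)]

def u_feynman_kac(n, i, x):
    """(2.3.1): sum over all n-step paths of the product chain (T, X)."""
    total = 0.0
    for types in product(TYPES, repeat=n):
        for sites in product(SITES, repeat=n):
            t_path, x_path = (i,) + types, (x,) + sites
            weight = 1.0
            for l in range(1, n + 1):
                a, c = t_path[l - 1], t_path[l]
                b, d = x_path[l - 1], x_path[l]
                p_type = 1.0 / deg(a) if (a, c) in EDGES else 0.0
                # step probability times m * deg; deg cancels against p_type
                weight *= p_type * P[b][d] * m(a, c, b) * deg(a)
            total += weight
    return total

if __name__ == "__main__":
    for n in range(5):
        print(n, u_matrix_power(n, 0, 1), u_feynman_kac(n, 0, 1))
```

The path sum grows exponentially in $n$, so in practice one would use the matrix form; the brute-force version is included only to make the Feynman–Kac interpretation explicit.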

By this tail assumption, $\log m_{ij}(y)$ lies in the vicinity of the double-exponential distribution defined in (2.2.7) with parameter $\rho_{ij}$. Let $H_{ij}(t) := \log\langle m_{ij}(y)^t\rangle$ denote the logarithm of the moment generating function. For $(i,j) \in \mathcal{T}^2\setminus\mathcal{A}$, we put $\rho_{ij} = 0$. Hence, our environment distribution is characterised by the matrix-valued parameter $\rho = (\rho_{ij})_{i,j\in\mathcal{T}}$. The larger $\rho_{ij}$ is, the thicker are the tails of $m_{ij}(y)$, i.e. the easier it is for $m_{ij}(y)$ to achieve extremely high values.

The second main result of [11] is a formula for the large-$n$ asymptotics for the expectation of the Feynman–Kac formula from (2.3.1). This formula is in the spirit of (2.2.8); however, it has some peculiarities and some novelties. Let us first mention that there are basically two possible lines of proof for deriving the result: one via large deviations for the empirical pair measure of the underlying Markov chain, and one using Frobenius eigenvalue theory. We decided to carry out only the former one. For any discrete set $S$, we denote by $\mathcal{M}_1(S)$ the set of probability measures on $S$


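and by $\mathcal{M}_1^{(s)}(S^2)$ the set of probability measures on $S^2$ with equal marginals. The first quantity of interest is

\[
\lambda(\rho) = \sup\bigl\{\langle \pi, \rho\rangle \colon \pi \in \mathcal{M}_1^{(s)}(\mathcal{T}^2)\bigr\}, \qquad\text{where } \langle \pi, \rho\rangle = \sum_{(i,j)\in\mathcal{A}} \pi(i,j)\,\rho_{ij}, \tag{2.3.3}
\]

and the set of the corresponding maximisers:

\[
\Lambda(\rho) := \bigl\{\pi \in \mathcal{M}_1^{(s)}(\mathcal{T}^2) \colon \langle \pi, \rho\rangle = \lambda(\rho)\bigr\}.
\]

We introduce some notation. Each measure $\nu \in \mathcal{M}_1^{(s)}((\mathcal{T}\times\mathcal{X})^2)$ has a number of marginal measures that are defined on different spaces, but in order to keep the notation simple, we denote all these marginals by $\nu$, namely,

\[
\nu(i,j,x) = \sum_{y\in\mathcal{X}} \nu\bigl((i,x),(j,y)\bigr), \qquad
\nu(i,x) = \sum_{j\in\mathcal{T}} \nu(i,j,x), \qquad
\nu(i,j) = \sum_{x\in\mathcal{X}} \nu(i,j,x), \qquad
\nu(i) = \sum_{j\in\mathcal{T}} \nu(i,j).
\]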

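To describe the second term in the asymptotics, we need to introduce two functionals on probability measures $\nu \in \mathcal{M}_1^{(s)}((\mathcal{T}\times\mathcal{X})^2)$, an energy functional $\mathcal{E}$ and an entropy functional $I$. Indeed, define

\[
\mathcal{E}(\nu) := \sum_{(i,j)\in\mathcal{A}} \rho_{ij} \sum_{x\in\mathcal{X}} \nu(i,j,x)\log\nu(i,j,x) + \sum_{(i,j)\in\mathcal{A}} \nu(i,j)\,\rho_{ij}\log\rho_{ij},
\]

\[
I(\nu) := \sum_{(i,j)\in\mathcal{A}} \sum_{x,y\in\mathcal{X}} \nu\bigl((i,x),(j,y)\bigr)\,\log\frac{\nu\bigl((i,x),(j,y)\bigr)}{\nu(i,x)\,P_{xy}}.
\]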

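We set $I(\nu) = \infty$ if $\nu$ is not absolutely continuous with respect to the measure $((i,x),(j,y)) \mapsto \nu(i,x)\,P_{xy}\,\mathbb{1}\{(i,j)\in\mathcal{A}\}$. Then $I(\nu)$ is equal to the relative entropy of $\nu$ with respect to this measure, up to the missing normalisation; note that the reference measure is not normalised, but has mass equal to $\sum_{i\in\mathcal{T}} \nu(i)\deg(i)$. Now we can state our main result [11, Theorem 3].

Theorem 2.3.2. For any $i \in \mathcal{T}$ and $x \in \mathcal{X}$, as $n \to \infty$,

\[
\langle u_n(i,x)\rangle = (n!)^{\lambda(\rho)}\, e^{-n\chi(\rho)}\, e^{o(n)} = \exp\Bigl\{\lambda(\rho)\, n\log\frac{n}{e} - n\chi(\rho) + o(n)\Bigr\}, \tag{2.3.4}
\]

where

\[
\chi(\rho) = \inf\bigl\{ I(\nu) - \mathcal{E}(\nu) \colon \nu \in \mathcal{M}_1^{(s)}((\mathcal{T}\times\mathcal{X})^2),\ \nu \in \Lambda(\rho)\bigr\}.
\]

Here the condition $\nu \in \Lambda(\rho)$ refers to the marginal of $\nu$ on $\mathcal{T}^2$.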

The central object in the proof and in the understanding of this result is the empirical pair measure

\[
\nu_n = \frac{1}{n}\sum_{l=1}^{n} \delta_{((T_{l-1},X_{l-1}),(T_l,X_l))}.
\]


In terms of the space-time random walk .X; T /, the number nn ..i; x/; .j; y// is equal to the number of j -type offspring of any i-type particle located at x by time n that makes a step to y right after creation. Hence, n stands for the union of all n-step paths ..X0 ; T0 /; : : : ; .Xn ; Tn // that make precisely nn ..i; x/; .j; y// steps .i; x/ ! .j; y/ for every i; j 2 T and every x; y 2 X. The term I./ is the negative exponential rate of the probability of this union under the Markov chain X , together with the combinatorial complexity of the trajectories of types, and ./, together with the leading term ./, is the one under the expectation with respect to the random environment. Theorem 2.3.2 in particular shows that the main contribution to the annealed moments of the numbers of particles, ./, comes from those n-step branching process subtrees that produce, for some  2 ƒ./, at approximately n.i; j / of the n steps a number of j -type particles from one or more i-type particles, for any i; j 2 T . Then the value h; i gives the leading contribution on the scale n log ne . It is interesting to note that the optimality of the leading term has nothing to do with the spatial part of the branching process, but only with the creation of particles. The reason is that all the probabilities of spatial actions, i.e. of the random walk X, are on the scale n, but the values of the offspring expectations mij .x/ are typically on the scale eO.log n/ . In this light, let us analyse the leading term ./ a bit more closely. A simple cycle on G is a path D .i1 ; : : : ; il / in T , with steps .im ; imC1 / in A, that begins and ends at the same vertex i1 D imC1 , but otherwise has no repeated vertices or edges. We write .i; j / 2 if the directed edge .i; j / belongs to , that is if .i; j / D .im ; imC1 / for some m 2 ¹1; : : : ; lº. We call j j D l its length. We denote by €l the set of all simple cycles of length l and by € the set of all simple cycles. 
We define the uniform measure on the edges of a simple cycle $\gamma$,
$$\mu_\gamma(i,j)=\begin{cases}1/|\gamma| & \text{if } (i,j)\in\gamma,\\ 0 & \text{otherwise.}\end{cases}$$
It is clear that $\mu_\gamma\in\mathcal M_1^{(s)}(T^2)$ for any $\gamma\in\Gamma$. Simple cycles are important for the asymptotics of the annealed moments because the set of extreme points of $\mathcal M_1^{(s)}(T^2)$ consists exactly of the measures $\mu_\gamma$, with $\gamma$ the simple cycles of the graph $G$. The following is [11, Lemma 1].

Lemma 2.3.3. The set of extreme points of the convex set $\mathcal M_1^{(s)}(T^2)$ is equal to $\{\mu_\gamma : \gamma\in\Gamma\}$.

Since the optimisation problem in (2.3.3) is a linear optimisation problem on the convex, compact set $\mathcal M_1^{(s)}(T^2)$, the Krein–Milman theorem and Lemma 2.3.3 imply the following characterisation of the leading term in (2.3.4). The following is [11, Lemma 2].


Lemma 2.3.4. One has
$$\lambda(\rho)=\max\big\{\langle\mu_\gamma,\rho\rangle : \gamma\in\Gamma\big\}=\max\Big\{\frac{1}{|\gamma|}\sum_{m=1}^{|\gamma|}\rho_{i_m i_{m+1}} : (i_1,\dots,i_{|\gamma|})\in\Gamma\Big\},$$
with the convention $i_{|\gamma|+1}=i_1$.

The interpretation of Lemma 2.3.4 is that the leading contribution to the expected population size comes from optimal cycles $(i_1,\dots,i_{|\gamma|})\in\Gamma_{|\gamma|}$. In terms of branching process trees, they are considered optimal if they produce only $i_{m+1}$-type particles from $i_m$-type particles for any $m\in\{1,\dots,|\gamma|\}$ (with $i_{|\gamma|+1}=i_1$), but no other offspring.
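The cycle characterisation of Lemma 2.3.4 can be illustrated on a toy example. The following Python sketch is not taken from [11]; it assumes that the weights are given as a dictionary on the directed edges of a small type graph, that a self-loop counts as a simple cycle of length one, and that brute-force enumeration is affordable. The function name `max_mean_cycle` and the example weights are hypothetical.

```python
from itertools import permutations


def max_mean_cycle(rho):
    """Brute-force the maximal mean edge weight over all simple cycles.

    rho maps a directed edge (i, j) to its weight. Edges absent from rho
    are treated as not belonging to the graph (an assumption of this sketch).
    """
    types = sorted({v for edge in rho for v in edge})
    best_value, best_cycle = float("-inf"), None
    for length in range(1, len(types) + 1):
        for cycle in permutations(types, length):
            # Start each cycle at its smallest vertex so it is visited once.
            if cycle[0] != min(cycle):
                continue
            edges = [(cycle[m], cycle[(m + 1) % length]) for m in range(length)]
            if not all(e in rho for e in edges):
                continue
            mean = sum(rho[e] for e in edges) / length
            if mean > best_value:
                best_value, best_cycle = mean, cycle
    return best_value, best_cycle


# Hypothetical three-type example; the weights are for illustration only.
rho = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 2.0,
       (1, 2): 1.0, (2, 0): 5.0, (2, 2): 1.5}
value, cycle = max_mean_cycle(rho)
print(value, cycle)
```

In this example the three-cycle through all types beats both the two-cycle and the self-loops, so the optimal subtrees would pass the types around that cycle. For larger type sets, a dedicated maximum-mean-cycle algorithm would be preferable to enumeration.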

2.4 The PAM on finite graphs

Another, more biologically inspired, direction in the investigation of the PAM is its study on some finite graph (like the complete graph or the hypercube), with a focus on the question of how much time one has to give to the branching process such that the overwhelming part of the population has found its way to the “fittest” site. Here we rely on the interpretation that we explained in Section 2.2.2, and the question is investigated in the limit of a large (but finite) graph and late times, and the relation between the growths of time and space is crucial. In order to make a mathematical treatment feasible without too many technicalities, one typically assumes the random potential (the “fitness landscape”) to be i.i.d. exponentially distributed times a growth parameter, such that one is in the regime where the intermittent islands (in the understanding of Section 2.2.4) are singletons. The first work on the PAM on a finite graph $X$ was – to the best of our knowledge – [8], which considered the complete graph with $N$ nodes and an exponentially distributed i.i.d. random potential $\xi$. However, instead of $\Delta$ in (2.2.4), the rescaled Laplace operator $\frac1N\Delta$ is considered. This is mathematically equivalent to (2.2.4) with graph-size dependent potential $N\xi$; the large prefactor $N$ supports a strong concentration of the mass of branching particles in small intermittent islands; actually here we are concerned with single sites. In that work, the initial condition was taken to be localised in the site of the $k$-th largest of the potential values, and the question was raised for which choices of $t=t_N$, in the large-$N$ limit, the main mass of the particle system travels to the site of the largest potential value, and for which choices it stays at the initial site by time $t$.

The authors found the leading scale and the criterion for the answer, and they derived the first term in the asymptotics for the expectation of the total mass with this initial condition. We do not go deeper into these details. In [2], the same question was raised for the hypercube $\{-1,1\}^N$ with an i.i.d. potential, the assumptions on which are formulated by requiring that $2^N$ i.i.d. copies of the potential variables leave gaps of order one between their consecutive leading order statistics, asymptotically as $N\to\infty$. According to standard extreme-value statistics, this includes the case of centred Gaussians with variance $N$. The main result is the following (see [2, Theorem 1.2]).


Theorem 2.4.1. Assume that $U_N(t)$ is the total mass of the solution to the PAM in (2.2.4) on $X=\{-1,1\}^N$ with $\Delta$ replaced by $\frac1N\Delta$ and with an i.i.d. random potential $\xi$ on $X$ such that its leading order statistics $\max_X\xi=\xi_{1,2^N}>\xi_{2,2^N}>\xi_{3,2^N}>\cdots$ leave gaps of order one between consecutive values, asymptotically as $N\to\infty$. We assume that the initial condition in (2.2.4) is the delta-measure in the site of $X$ in which $\xi_{k,2^N}$ sits, with a fixed integer $k>1$. Fix $\varepsilon>0$. If $t_N\ge(1+\varepsilon)\frac{N\log N}{2(\xi_{1,2^N}-\xi_{k,2^N})}$, then, almost surely,
$$U_N(t_N)=\exp\Big\{(\xi_{1,2^N}-1)\,t_N-\tfrac12(1+o(1))\,N\log N\Big\},\qquad N\to\infty.$$
In contrast, if $t_N\le(1-\varepsilon)\frac{N\log N}{2(\xi_{1,2^N}-\xi_{k,2^N})}$, then, almost surely,
$$U_N(t_N)=\exp\big\{(\xi_{k,2^N}-1)\,t_N\big\}\,(1+o(1)),\qquad N\to\infty.$$

The interpretation is that the main mass has, in the first case of a large time horizon, enough time to move to the site of the maximal potential value, $\xi_{1,2^N}$, while in the second case (small time horizon), it stays in the initial site.
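To get a feeling for the threshold in Theorem 2.4.1, one can evaluate the crossover time $N\log N/(2(\xi_1-\xi_k))$, with $\xi_1$ and $\xi_k$ the largest and $k$-th largest potential values, on a sampled landscape. The Python sketch below is only an illustration under stated assumptions: the potential is taken as centred Gaussian with variance $N$ (the case mentioned above), the sites of the hypercube are indexed by a flat list of length $2^N$, and the helper name `crossover_time` is hypothetical.

```python
import math
import random


def crossover_time(potential, k):
    """Return N log N / (2 (xi_1 - xi_k)) for a potential on {-1, 1}^N.

    potential is a list of 2**N values, one per site; k > 1 selects the
    k-th largest value as the starting site of the population.
    """
    n_sites = len(potential)
    N = n_sites.bit_length() - 1
    if n_sites != 2 ** N or k <= 1:
        raise ValueError("need 2**N potential values and k > 1")
    ordered = sorted(potential, reverse=True)
    gap = ordered[0] - ordered[k - 1]  # xi_1 - xi_k, positive for distinct values
    return N * math.log(N) / (2.0 * gap)


# Hypothetical landscape: centred Gaussians with variance N on 2**N sites.
rng = random.Random(0)
N = 12
xi = [rng.gauss(0.0, math.sqrt(N)) for _ in range(2 ** N)]
t_star = crossover_time(xi, k=3)
print(f"crossover time for k=3: {t_star:.3f}")
```

Time horizons well above the printed value correspond to the first regime of the theorem, horizons well below it to the second.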

2.5 Higher moments of the numbers of particles

As we mentioned in Section 2.2.3, the first moment (the expectation) $u(t,x)$ of the number of particles in the branching random walk system at time $t$ in the site $x$ satisfies the heat equation with random potential in (2.2.4), which is rather amenable to a deeper analysis, due to the Feynman–Kac formula in (2.2.5) and the eigenvalue expansion in (2.2.6). However, this concerns only the first moment, and this falls well short of a deeper understanding of the entire particle system. Substantially more information is contained in the $n$-th moments of the particle number at time $t$ at site $x$ for $n\in\mathbb N$, i.e. in the expectation of $\eta(t,x)^n$, again taken only over the migration, branching and killing mechanisms with fixed branching/killing rates. In this section, we are going to report on the work [10] on this aspect. To keep things simple in this survey, we only consider here $U_n(t)$, the expectation of $\big(\sum_{x\in X}\eta(t,x)\big)^n$. Hence, the case $n=1$ is the one that we handled before. Again, like in Section 2.3, we will present two main results: an explicit formula of Feynman–Kac type for $U_n(t)$ for fixed branching/killing rates, and the identification of the large-$t$ asymptotics of the expectation of $U_n(t)$, taken with respect to a particular choice of the distribution of these random rates. One of our motivations to study $U_n(t)$ is that asymptotic knowledge on the behaviour of $\langle U_n(t)\rangle$ might be able to tell something about the distributional behaviour of the total number of particles, as in the general theorem where (under some technical conditions) the convergence of all the $n$-th moments of a random variable implies its weak convergence towards a random variable whose $n$-th moments are equal to the limits.


Meanwhile, deep investigations have also been carried out directly on the branching particle system in the spirit of the PAM, using a great deal of the results and methods specific to the PAM [16], but this concerned only the simplest potential distribution (the Pareto distribution) and was technically enormously cumbersome. The method of choice to analyse $U_n(t)$ is to derive an equation similar to the heat equation with random potential and to try to employ the techniques that were helpful in the analysis of the PAM. In earlier work [1], a recursive equation was derived for $u_n(t,x)$ (defined as the expectation of $\eta(t,x)^n$) in terms of all the functions $u_1,\dots,u_{n-1}$. This formula was explicit enough to derive, for a particular potential distribution that leads to single-site islands, the first term in the large-$t$ asymptotics of the moments of $u_n(t,\cdot)$. However, this is obviously unsatisfactory for several reasons. Instead, in [10], a direct formula for the expectation of $u_n$ (in fact, for its $p$-th moment for any $p\in\mathbb N$) is derived, and this is so explicit that, for the double-exponential distribution in (2.2.7), the two main terms could be derived in terms of a variational formula that admits an interpretation and further analysis. The main tool for deriving the direct formula is the many-to-one formula from the theory of branching processes, derived via spine techniques. Unlike in the case $n=1$ of the first moment, the $n$-th moment is not just a function of the difference $\xi=\xi_+-\xi_-$, but we need to keep track of both the splitting rate $\xi_+=\xi_2$ and the killing rate $\xi_-=\xi_0$. Let us state the formula from [10, Theorem 2.1] in the special case $n=2$.

Theorem 2.5.1 (Feynman–Kac-like formula for $U_2$). Let $U_n(t)$ denote the $n$-th moment of the total number of particles, for any $n\in\mathbb N$. Then, in the case $n=2$, we have $U_2=U_1+\tilde U_2$, where
$$\tilde U_2(t)=\int_0^t \mathbb E_O\Big[e^{\int_0^s\xi(X_r)\,dr}\,e^{\int_s^t\xi(X'_r)\,dr}\,e^{\int_s^t\xi(X''_r)\,dr}\,2\xi_2(X_s)\Big]\,ds, \tag{2.5.1}$$
where $(X_r)_{r\in[0,s]}$, $(X'_r)_{r\in[s,t]}$ and $(X''_r)_{r\in[s,t]}$ are three independent simple random walks, conditioned on $X_0=O$ and $X'_s=X''_s=X_s$.

In other words, the right-hand side of (2.5.1) is the expectation over a branching random walk with precisely one splitting event in the time interval $[0,t]$, namely at time $s$. The first term in the representation $U_2=U_1+\tilde U_2$ corresponds to absence of splitting, the second to exactly one splitting. For general $n$, the formula for $U_n$ is similar (one has to sum over all numbers of splitting events in $\{0,\dots,n-1\}$), but much more cumbersome, since combinatorial prefactors and multiple powers of $\xi_2$-values at the splitting locations are involved. The formula in (2.5.1) is the main result of [10] about a representation of the $n$-th moments of the number of particles, for fixed branching/killing rates, in the special case $n=2$. Now we turn to the second question. For the analysis of the large-$t$ asymptotics of the expectation of $\tilde U_2(t)$ and for its intuitive understanding, the formula in (2.5.1) is well suited, at least for the case that both random fields $\xi_2$ and $\xi_0$ are i.i.d. sequences of double-exponentially distributed variables as in (2.2.7) and are independent. The


first observation is that the term $2\xi_2(X_s)$ should hardly have any influence, and this is indeed true under some technical assumption that forbids too thick upper tails of the random variable $\xi_2(0)$. The second observation is that, as we are considering the case that the potential can assume very large (positive) values, it should be giving particularly large values if the splitting time $s$ is as early as possible, since it is then two independent random walks that can contribute to the random-potential expectation: it is indeed the expectation of a square of the Feynman–Kac formula, since each of the two copies contributes independently like one total mass. This reasoning can be done already when considering the leading terms of the expectation: since we have three random walks with running times $s$, $t-s$ and $t-s$, one should expect that the leading logarithmic term is roughly equal to $H(s+(t-s)+(t-s))=H(2t-s)$, which is maximal if $s$ is minimal, since we assume that $\frac1t H(t)\to\infty$ (as $\xi$ is unbounded). Hence, the first term in the asymptotics for $\frac1t\log\langle\tilde U_2(t)\rangle$ should be $H(2t)$, and the second one should be $-2\chi(\rho)$ in the notation of (2.2.8). The same picture is true for the higher moments, i.e. for the moment asymptotics of $U_n(t)$, with $2$ replaced by $n$ at both occurrences. This has indeed been proved as the second main result in [10, Theorem 1.3].

Theorem 2.5.2 (Moment asymptotics for the branching random walk in random environment). Let the random potentials $\xi_2=(\xi_2(x))_{x\in\mathbb Z^d}$ and $\xi_0=(\xi_0(x))_{x\in\mathbb Z^d}$ be independent and i.i.d. and double-exponentially distributed as in (2.2.7) with parameter $\rho\in(0,\infty)$. Denote by $U_n(t)$ the $n$-th moment of the total number of particles in the branching random walk in the random environment $(\xi_0,\xi_2)$. Then, for any $n,p\in\mathbb N$,
$$\langle U_n(t)^p\rangle=\exp\big\{H(npt)-npt\,\chi+o(t)\big\},\qquad t\to\infty,$$
where $\chi\in[0,2d]$ is the number appearing in Theorem 2.2.1.

The most important conclusion is that, at least for double-exponentially distributed potentials, the main contribution to the expected $p$-th power of the total mass of the $n$-th moment comes from early splittings in the moment formula. This implies that
$$\langle U_n(t)^p\rangle=\langle U(t)^{np}\rangle\,e^{o(t)}=\langle U(tnp)\rangle\,e^{o(t)},\qquad t\to\infty.$$

Recall that we believe that this phenomenon comes from the unboundedness of the branching rates $\xi(0)$, more precisely from the super-linear behaviour of the leading term, the logarithmic moment generating function $H(t)$. If the potential random variable $\xi(0)$ is not positive, but attains only strictly negative values, then we expect that the opposite behaviour is crucial, i.e. a very late splitting, and the result should be that $\langle U_n(t)\rangle=\langle U_1(t)\rangle\,e^{o(t)}$ as $t\to\infty$. If the essential supremum of the potential is zero, then we expect that deeper investigations are necessary and that, in some cases, richer pictures may appear.
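Returning to the representation $U_2=U_1+\tilde U_2$ of Theorem 2.5.1, a simple sanity check is available when the state space is a single point, so that there is no migration and the rates are constants. The Python sketch below is an illustration and not part of [10]. It assumes binary splitting at rate `b` (playing the role of $\xi_2$), killing at rate `d` (playing the role of $\xi_0$), and net rate `b - d` in the exponents, evaluates the time integral by the trapezoidal rule, and compares with the classical second moment of a linear birth-death process. All function names are hypothetical.

```python
import math


def second_moment_via_formula(b, d, t, steps=20000):
    """U_1 + tilde U_2 as in (2.5.1) on a one-point space (no migration)."""
    xi = b - d                      # net growth rate
    u1 = math.exp(xi * t)           # contribution without any splitting event

    def integrand(s):
        # one line on [0, s], two independent lines on [s, t], weight 2 * b
        return math.exp(xi * s) * math.exp(2.0 * xi * (t - s)) * 2.0 * b

    h = t / steps
    total = 0.5 * (integrand(0.0) + integrand(t))
    total += sum(integrand(i * h) for i in range(1, steps))
    return u1 + h * total           # composite trapezoidal rule


def second_moment_closed_form(b, d, t):
    """E[N_t^2] for a linear birth-death process started from one particle."""
    xi = b - d
    m = math.exp(xi * t)
    return m + 2.0 * b * m * (m - 1.0) / xi


b, d, t = 1.0, 0.3, 1.5             # hypothetical rates and time horizon
print(second_moment_via_formula(b, d, t), second_moment_closed_form(b, d, t))
```

The integrand is largest at $s=0$ when the net rate is positive, which is the one-site shadow of the early-splitting picture described above.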


2.6 Further perspectives

2.6.1 High-moment asymptotics

In ongoing work, we are currently deriving the large-$n$ asymptotics of the $n$-th moment (taken over all randomnesses: migration, branching, killing, and all the rates) of the number of particles in the branching random walk model of Section 2.2.1 on a finite time interval in $\mathbb Z^d$, i.e. the asymptotics of $\langle U_n(1)\rangle$. This will say something about the most probable way for the branching particle system to produce as much offspring as possible over a finite time interval, i.e. about the questions of how high the potential values should be, how often and how quickly after each other the splittings of the particles occur, and how many steps all the many paths make. For deriving this, we are exploiting the moment formula that we explained in Section 2.5.

2.6.2 Branching random walks on (random) graphs

As we explained in Section 2.4, an investigation of the PAM on (possibly finite) graphs is biologically sound, but has not been done yet on a broad front; we are actually aware only of the two works [2, 8]. In another ongoing investigation, the state space $X$ is taken as a random graph that is locally tree-like. The main examples are a Galton–Watson tree with bounded degrees and the configuration model. The main goal is – for the random potential double-exponentially distributed as in (2.2.7) – to find the large-$t$ asymptotics for the total mass $U(t)$ of the solution with high probability and to describe the structure of the parts of the random graphs that give the main contribution, i.e. of the intermittent islands. The two main interests here are to understand in general the influence (1) of the exponential structure of the tree, i.e. the fact that the volume of a ball with diameter $r$ grows exponentially fast in $r$, and (2) of the randomness of the graph structure.

In the long term, we hope to answer also the questions that are answered in the case of the state space $X=\{-1,1\}^N$ in Theorem 2.4.1, but the infiniteness of the state space, its randomness and the fact that the intermittent islands are not single sites, but have some structure, make the question about the location and the shape of the intermittent islands a big enterprise.

2.6.3 Self-repellent random walk in random potential

As described in Section 2.2.4, the Feynman–Kac formula that represents the total mass of the solution to the PAM actually displays a random walk in random potential, and the large-time asymptotics are carried by those random walk paths that find their way to an optimal local region in the potential. The random walk in random potential feels an attractive force towards the extremal regions of the potential. In another ongoing work, we are investigating the effect of an additional counter force: some additional self-interaction that suppresses self-intersections of the random walk until time $t$. In other words, we replace the free walk by the well-known weakly


self-avoiding or self-repellent walk, which is given by an exponential weight of the form
$$\frac{1}{Z_{T,\beta}}\exp\Big\{-\beta\int_0^T\!\!\int_0^T \mathbf 1\{X_s=X_t\}\,ds\,dt\Big\},$$
where $\beta\in(0,\infty)$ is a parameter, and $Z_{T,\beta}$ is the normalising constant. In this model, the path in the Feynman–Kac formula can no longer spend too much time in single sites, the intermittent islands. The goal is to describe what it does instead. If the underlying state space is the lattice $\mathbb Z^d$, then the most obvious conjecture is that it will visit not only one of the intermittent islands, but several of them one after another, even though they are far away from each other. This is a random strategy, depending via an optimisation problem on the limiting spatial extreme-value order statistics of the potential. We think that this strategy is indeed optimal for most values of the thickness parameter of the potential distribution, but for some values this does not seem to be true, and we currently have no clue how to describe the typical behaviour of the path properly.
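The double integral in the self-repellent weight equals the sum of squared local times, $\sum_x \ell_T(x)^2$, where $\ell_T(x)$ is the time the path spends in $x$ up to time $T$. The Python sketch below uses this identity to evaluate the unnormalised weight along a simulated path. It is an illustration only: the walk is taken on $\mathbb Z$ with unit jump rate, no random potential is included, and the function names and parameter values are hypothetical.

```python
import math
import random


def self_intersection_local_time(jump_times, sites, T):
    """Return sum_x ell_T(x)^2 for a piecewise-constant path on [0, T].

    sites[i] is the position held from jump_times[i] until the next jump
    time (or until T for the last entry).
    """
    local_time = {}
    for idx, site in enumerate(sites):
        start = jump_times[idx]
        end = jump_times[idx + 1] if idx + 1 < len(sites) else T
        local_time[site] = local_time.get(site, 0.0) + (end - start)
    return sum(v * v for v in local_time.values())


def simulate_walk(T, rng):
    """Continuous-time simple random walk on Z with unit jump rate."""
    t, x = 0.0, 0
    times, sites = [0.0], [0]
    while True:
        t += rng.expovariate(1.0)
        if t >= T:
            return times, sites
        x += rng.choice((-1, 1))
        times.append(t)
        sites.append(x)


T, beta = 5.0, 0.2                       # hypothetical horizon and strength
times, sites = simulate_walk(T, random.Random(1))
silt = self_intersection_local_time(times, sites, T)
weight = math.exp(-beta * silt)          # unnormalised self-repellent weight
print(silt, weight)
```

A path that never moves attains the maximal value $T^2$ and hence the smallest weight, which is why long stays in single sites are penalised.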

References

[1] S. Albeverio, L. V. Bogachev, S. A. Molchanov, and E. B. Yarovaya, Annealed moment Lyapunov exponents for a branching random walk in a homogeneous random branching environment, Markov Process. Related Fields 6 (2000), 473–516.
[2] L. Avena, O. Gün, and M. Hesse, The parabolic Anderson model on the hypercube, Stochastic Process. Appl. 130 (2020), 3369–3393.
[3] E. Baake and H.-O. Georgii, Mutation, selection, and ancestry in branching models: A variational approach, J. Math. Biol. 54 (2007), 257–303.
[4] M. Birkner and N. Gantert, Ancestral lineages in spatial population models with local regulation, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 291–310.
[5] M. Biskup, W. König, and R. Soares dos Santos, Mass concentration and aging in the parabolic Anderson model with doubly-exponential tails, Probab. Theory Related Fields 171 (2018), 251–331.
[6] R. Carmona and S. A. Molchanov, Parabolic Anderson Problem and Intermittency, American Mathematical Society, Providence, RI, 1994.
[7] J. F. Crow and M. Kimura, An Introduction to Population Genetics Theory, Harper & Row, New York, 1970.
[8] K. Fleischmann and S. A. Molchanov, Exact asymptotics in a mean field model with random potential, Probab. Theory Related Fields 86 (1990), 239–251.
[9] W. Gabriel and E. Baake, Biological evolution through mutation, selection, and drift: An introductory review, in: Annual Reviews of Computational Physics. Vol. VII (ed. D. Stauffer), World Scientific, Singapore (2000), 203–264.
[10] O. Gün, W. König, and O. Sekulovic, Moment asymptotics for branching random walks in random environment, Electron. J. Probab. 18 (2013), 1–18.


[11] O. Gün, W. König, and O. Sekulovic, Moment asymptotics for multitype branching random walks in random environment, J. Theoret. Probab. 28 (2015), 1726–1742.
[12] W. König, The Parabolic Anderson Model. Random Walk in Random Potential, Birkhäuser, Berlin, 2016.
[13] S. M. Krone and C. Neuhauser, Ancestral processes with selection, Theor. Popul. Biol. 51 (1997), 210–237.
[14] J. Krug, Accessibility percolation in random fitness landscapes, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 1–22.
[15] P. A. P. Moran, Global stability of genetic systems governed by mutation and selection, Math. Proc. Cambridge Philos. Soc. 80 (1976), 331–336.
[16] M. Ortgiese and M. Roberts, Scaling limit and ageing for branching random walk in Pareto environment, Ann. Inst. Henri Poincaré Probab. Stat. 54 (2018), 1291–1313.

Chapter 3

Microbial populations under selection

Ellen Baake and Anton Wakolbinger

This chapter gives a synopsis of recent approaches to model and analyse the evolution of microbial populations under selection. The first part reviews two population genetic models of Lenski’s long-term evolution experiment with Escherichia coli, where the models aim at explaining the observed curve of the evolution of the mean fitness. The second part describes a model of a host-pathogen system where the population of pathogens experiences balancing selection, migration, and mutation, as motivated by observations of the genetic diversity of HCMV (the human cytomegalovirus) across hosts.

3.1 Introduction

The genetic diversity and evolution of microbial populations is a rich and diverse object of biological research. Among the very first investigations was the Luria–Delbrück experiment [25], which revealed fundamental insight into the spontaneous nature of mutations, even before the discovery of DNA as the carrier of genetic information (see [1] for a historical account). Experimental evolution takes advantage of the short generation time of bacteria and allows one to observe evolution in real time; one of the most famous instances is Lenski’s long-term evolution experiment (LTEE); see [38] and references therein. The diversity and evolution of pathogens has immediate medical relevance, as exemplified by the prediction of the yearly influenza strain; see [28] for a recent review. As also noted in the contribution of Backofen and Pfaffelhuber [5] in this volume, population-genetic methods have so far rarely been applied to microorganisms. That chapter discusses the diversity of the bacterial CRISPR-Cas (clustered regularly interspaced short palindromic repeats–CRISPR associated sequences) system, which plays an important role in the defence of bacteria against bacteriophages. The present chapter reviews two recent approaches to model and analyse, in a population-genetic framework, the evolution of microbial populations under selection in combination with migration and/or mutation. We first consider Lenski’s LTEE with Escherichia coli, that is, experimental evolution under directional selection and mutation, and then turn to pathogen evolution under balancing selection, migration, and mutation, as motivated by observations of the genetic diversity of HCMV (the human cytomegalovirus) across hosts. For models of host-pathogen coevolution, we refer the reader to the contribution of Stephan and Tellier [37] in this volume.

These scenarios and the corresponding models appear to be quite different at first sight, but they share a similar spirit, for various reasons. First, due to the short


generation times of microbes, it makes sense to also consider and observe the dynamics of evolution, whereas, with higher organisms, one is often restricted to equilibrium considerations. Second, due to the large population sizes, laws of large numbers are appropriate in places, so that, under a suitable scaling, the evolution is close to a deterministic one. Third, in both cases, a moderate selection–weak mutation/migration regime is used, which allows for time-scale separation. The parallels between the two parts go even further: while the LTEE model features a process of beneficial mutations only a few of which are successful (in the sense that they lead to fixation of the mutant), the model for HCMV contains a process of reinfections only a few of which are effective (in the sense that they lead to a transition from a pure state to a (quasi-)equilibrium state for the balancing selection); we will refer to such a transition as a balancing event. In both parts, Haldane’s formula for the fixation probability (and its analogue in the case of balancing selection) plays an important role, and jump processes appear in an appropriate scaling limit, with sequential fixations in the first and sequential balancing events in the second part.

3.2 Modelling Lenski’s long-term evolution experiment

R. Lenski started twelve replicates of an experiment in 1988, and since then it has been running without interruption. This has become famous as the E. coli LTEE [22]. Every morning, a sample of $\approx 5\cdot10^6$ Escherichia coli bacteria is inserted into a defined amount of fresh minimal glucose medium; let us call them the founder individuals (at day $i$, say). As soon as the nutrients are consumed, the bacteria stop dividing – this is the case when the population has reached a size of $\approx 5\cdot10^8$ bacteria, with $\approx 5\cdot10^6$ clones each of average size $\approx 100$, see Figure 3.2.1. The next morning, the process is repeated by taking out of the $\approx 5\cdot10^8$ cells a random sample of $\approx 5\cdot10^6$, which then form the founder individuals at day $i+1$. This induces a genealogy in discrete time $i=0,1,\dots$ among the founder individuals. The goal of the experiment is to observe evolution in real time. Indeed, the bacteria evolve via beneficial mutations, which allow them to adapt to the environment and thus to reproduce faster. One special feature of the LTEE is that samples are frozen at regular intervals. They can be brought back to life at any time for the purpose of comparison and thus form a living fossil record. In particular, one can, at any day $i$, compare the current population with the initial (day-0) population via a competition experiment [23, 38], which yields the empirical (Malthusian) fitness at day $i$ relative to that at day 0; see below for a precise definition. Figure 3.2.3 shows the time course over 21 years of the empirical relative fitness averaged over the replicate populations, as reported in [38]. Obviously, the mean relative fitness has a tendency to increase, but the increase levels off, which leads to a conspicuous concave shape. In this section, we report on two closely related models that describe the experiment by means of a Cannings model and explain the mean fitness curve.
The first model was set up by González-Casanova, Kurt, Wakolbinger, and Yuan [19] and will henceforth


Figure 3.2.1. Illustration of some day $i-1$ (and the beginning of day $i$) of Lenski’s LTEE with four founder individuals (bullets), their offspring trees within day $i-1$, and the sampling from day $i-1$ to $i$ (dotted), for an average clone size of 5. The second founder from the left at day $i-1$ (and its offspring) is lost due to the sampling, and the second founder from the right at day $i$ carries a new beneficial mutation (indicated by the square). Reprinted from [3, Figure 1], © 2019, with permission from Elsevier.

be referred to as the GKWY model; it assumes that (a) the fitness increments conveyed by beneficial mutations are deterministic and follow a regime of diminishing returns epistasis (that is, the increments decrease with increasing fitness), (b) a weak mutation–moderate selection regime applies, and (c) the population is so large that a law of large numbers is appropriate. Building on this, the second model by Baake, González-Casanova, Probst, and Wakolbinger ([3], referred to as the BGPW model) introduces additional random elements by (d) allowing for stochastic fitness increments and (e) considering effects of clonal interference, which means that two mutants compete with each other for the success of going to fixation. Both models build on earlier work by Wiser, Ribeck, and Lenski [38], who work close to the data and perform an approximate analysis in the spirit of theoretical biology, while we focus on precise definitions of the models, on population-genetic concepts, and on mathematical rigour where possible. Our presentation will be guided by [3].

3.2.1 Interday and intraday dynamics

Both the GKWY and the BGPW models take two distinct dynamics into account, namely, the dynamics within each individual day, and the dynamics from day to day. We will now explain these building blocks. Here, as well as in Sections 3.2.2 and 3.2.3, we will focus on the case of deterministic fitness increments and the absence of clonal interference (as considered in the GKWY model); the extensions in the BGPW model will be discussed in Section 3.2.4.

Intraday dynamics. Let $T$ be (continuous) physical time within a day, with $T=0$ corresponding to the beginning of the growth phase. Day $i$ starts with $N$ founder individuals ($N\approx5\cdot10^6$ in the experiment). The reproduction rate or Malthusian fitness of founder individual $j$ at day $i$ is $R_{ij}$, where $i\ge0$ and $1\le j\le N$. It is


assumed that at day 0 the population is homogeneous (or monomorphic), that is, $R_{0j}\equiv R_0$. Offspring inherit the reproduction rates from their parents. We denote by
$$t=R_0T\quad\text{and}\quad r_{ij}=\frac{R_{ij}}{R_0}$$
dimensionless time and rates, so that on the time scale $t$ there is, on average, one split per time unit at the beginning of the experiment, so $r_{0j}\equiv1$. We consider the $r_{ij}$ as given (non-random) numbers. We thus have $N$ independent Yule processes at day $i$: all descendants of founder individual $j$ (the members of the $j$-clone) branch at rate $r_{ij}$, independently of each other. They do so until $t=\sigma_i$, where $\sigma_i$ is the duration of the growth phase on day $i$. We define $\sigma_i$ as the value of $t$ that satisfies
$$\mathbb E(\text{population size at time } t)=\sum_{j=1}^N e^{r_{ij}t}=\gamma N, \tag{3.2.1}$$
where $\gamma$ is, equivalently, the multiplication factor of the population within a day, the average clone size, and the dilution factor from day to day in the experiment ($\gamma\approx100$ in the LTEE). Note that, in the definition of $\sigma_i$, we have idealised by replacing the random population size by its expectation. Since $N$ is very large, this is well justified, because the fluctuations of the random time needed to grow from size $N$ to size $100N$ are small relative to that time’s expectation.

Interday dynamics. At the beginning of day $i>0$, one samples $N$ new founder individuals out of the $\gamma N$ cells from the population at the end of day $i-1$. We assume that one of these new founders carries a beneficial mutation with probability $\mu$; otherwise (with probability $1-\mu$), there is no beneficial mutation. We think of $\mu$ as the probability that a beneficial mutation occurs in the course of day $i-1$ and is sampled for day $i$. Assume that the new beneficial mutation at day $i$ appears in individual $m$, and that the reproduction rate of the corresponding founder individual $k$ in the morning of day $i-1$ has been $r_{i-1,k}$. The new mutant’s reproduction rate is then assumed to be
$$r_{im}=r_{i-1,k}+\delta(r_{i-1,k})\quad\text{with}\quad\delta(r):=\frac{\varphi}{r^q}. \tag{3.2.2}$$
Here, $\varphi$ is the beneficial effect due to the first mutation (that is, $\delta(1)$), and $q$ determines the strength of epistasis. In particular, $q=0$ implies constant increments (that is, fitness is additive), whereas $q>0$ means that the increment decreases with $r$, that is, we have diminishing returns epistasis. Note that we only take into account beneficial mutations and adhere to the simplistic assumption that the fitness landscape is permutation invariant, that is, every beneficial mutation on the same background conveys the same deterministic fitness increment, no matter where it appears in the genome; this simplification is already used by Fisher in his staircase model [16] and is still common


in the modern literature [12]. The assumption will be relaxed in Section 3.2.4, where we turn to stochastic increments.

Mean relative fitness. Let us now define the mean relative fitness, depending on the reproduction rates $r_{ij}$ of the $N$ individuals in the sample at the beginning of day $i$, as
$$F_i:=\frac{1}{\sigma_i}\log\Big(\frac1N\sum_{j=1}^N e^{r_{ij}\sigma_i}\Big). \tag{3.2.3}$$
Here, $\sigma_i$ is as defined in (3.2.1). Note that (3.2.3) implies that
$$e^{F_i\sigma_i}=\frac1N\sum_{j=1}^N e^{r_{ij}\sigma_i}.$$
Thus, $F_i$ may be understood as the effective reproduction rate of the population at day $i$, which differs from the mean Malthusian fitness $\frac1N\sum_j r_{ij}$ unless the population is homogeneous.

3.2.2 Heuristics for the power law of the mean fitness curve

In a homogeneous population of relative fitness $F$, the length of the growth period is
$$\sigma(F)=\frac{\log\gamma}{F} \tag{3.2.4}$$
(since this solves $e^{Ft}=\gamma$, cf. (3.2.1)). It is crucial to note that the length $\sigma$ of the growth period decreases with increasing $F$. Assume a new mutation arrives in a homogeneous population of relative fitness $F$. It conveys to the mutant individual a relative fitness increment
$$\delta(F)=\frac{\varphi}{F^q}, \tag{3.2.5}$$
that is, the mutant has relative Malthusian fitness $F+\delta(F)$. We define the selective advantage of the mutant as
$$s(F)=\delta(F)\,\sigma(F). \tag{3.2.6}$$
This is because the selective advantage is the product of the fitness increment and the duration of a generation, which, in our case, is the time required for the population to grow to $\gamma$ times its original size, namely $\sigma(F)$; for details, see [10], [35, p. 1977], and [3]. It is essential to note that $s$ in (3.2.6) inherits the dependence on $F$ from $\sigma$ and thus $s$ decreases with increasing $F$ even for $q=0$. This is what we call the runtime effect: adding a constant to an interest rate $F$ of a savings account becomes less efficient when the runtime decreases.

Ellen Baake and Anton Wakolbinger


Furthermore, it is precisely this notion of selective advantage conveyed by (3.2.6) that governs the fixation probability. Namely, the fixation probability of the mutant turns out to be
\[ \pi(F) \sim C\,s(F). \tag{3.2.7} \]
Here, $\sim$ means asymptotic equality in the limit $N \to \infty$,¹ and $C := \gamma/(\gamma - 1)$. The key to understanding the role of $C$ is to formulate the interday dynamics in terms of a Cannings model. In a neutral setting, this classical model of population genetics works by assigning in each time step to each of $N$ (potential) mothers indexed $j = 1, \ldots, N$ a random number $\nu_j$ of daughters such that the $\nu_j$ add up to $N$ and are exchangeable, that is, they have a joint distribution that is invariant under permutations of the mothers' indices [14, Section 3.3]. In [19], the mothers are identified with the founders in a given day and the daughters with the founders in the next day, thus resulting in an extension of the Cannings model that includes mutation and selection. This extension is obtained by decreeing that a mutant with a selective advantage $s$ compared to the resident type has an expected offspring of size $(1 + s)$ times the expected offspring of a resident. (For the large class of Cannings models that admit a paintbox representation, a graphical construction that includes directional selection is given in [6].) In our situation, the offspring variance $v$ in one Cannings generation satisfies
\[ v = V(\nu_1) \sim \frac{2(\gamma - 1)}{\gamma} = \frac{2}{C} \]
([19], see also [3] for an explanation in terms of pair coalescence probabilities in the Cannings model). Hence (3.2.7) is in line with Haldane's formula
\[ \pi \sim \frac{s}{v/2}, \tag{3.2.8} \]
which relies on a branching process approximation of the initial phase of the mutant growth; see [30] for an account of this method, including a historic overview. Another crucial ingredient of the heuristics is the time window of length
\[ u(F) \sim \frac{\log(N s(F))}{s(F)} \tag{3.2.9} \]
after the appearance of a beneficial mutation that will survive drift (a so-called contending mutation); this results from a branching process approximation of the expected time it takes for the mutation to become dominant in the population, see [3, 12, 26].
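Haldane's formula (3.2.8) and the time window (3.2.9) can be checked numerically in a simple special case. The sketch below assumes a Galton–Watson process with Poisson offspring of mean $1 + s$ (so that $v = 1 + s$); this offspring law is our illustrative choice and is not the Cannings model of the text. The survival probability is obtained by iterating the offspring generating function.

```python
import math

def survival_probability(s, iterations=20000):
    """Survival probability of a Galton-Watson process with Poisson(1 + s) offspring.

    Iterating q -> exp((1 + s) * (q - 1)) from q = 0 converges to the
    extinction probability (the smallest fixed point of the generating function).
    """
    m = 1.0 + s
    q = 0.0
    for _ in range(iterations):
        q = math.exp(m * (q - 1.0))
    return 1.0 - q

def haldane(s, v):
    """Haldane's approximation s / (v / 2) for the survival probability, cf. (3.2.8)."""
    return s / (v / 2.0)

def time_window(N, s):
    """Heuristic time window log(N * s) / s of (3.2.9), in generations."""
    return math.log(N * s) / s

if __name__ == "__main__":
    s = 0.01
    print(survival_probability(s), haldane(s, v=1.0 + s))
    print(time_window(N=5e6, s=0.1))
```

For small $s$ the exact survival probability and Haldane's approximation agree to within a few percent, which is all the heuristics require.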

¹ That is, $\pi(F)/(C\,s(F)) = \pi_N(F)/(C\,s_N(F)) \to 1$ as $N \to \infty$, in the setting of Section 3.2.3.

Figure 3.2.2. Schematic drawing of the relative fitness process (black) and the approximating jump process (grey). [Labels in the drawing: fitness levels $F$ and $F + \delta(F)$; time windows of length $\approx u(F)$.] Reprinted from [3, Figure 3], © 2019, with permission from Elsevier.

Indeed, let $(Z_i)_{i \ge 0}$ be a Galton–Watson process with offspring mean $1 + s$ and $s$ nonnegative and small. Then according to (3.2.8) we have the asymptotics
\[ E[Z_i \mid Z_i > 0] \sim \frac{v}{2}\,\frac{(1 + s)^i}{s}. \tag{3.2.10} \]
Hence, for any $\varepsilon > 0$, the expression in (3.2.9) is asymptotically equal to that generation for which (3.2.10) reaches $\varepsilon N$.

All this now leads us to the dynamics of the relative fitness process. As illustrated in Figure 3.2.2, most mutants only grow to small frequencies and are then lost again (due to the sampling step). But if a mutation does survive the initial fluctuations and gains appreciable frequency, then the dynamics turns into an asymptotically deterministic one and takes the mutation to fixation quickly, cf. [12, 17], or [13, Chapter 6.1.3]. Indeed, within time $u(F)$, the mutation has either disappeared or gone close to fixation. Moreover, in the scaling regime (3.2.16) specified in Section 3.2.3, this time is much shorter than the mean interarrival time $1/\mu$ between successive beneficial mutations. As a consequence, there are, with high probability, at most two types present in the population at any given time (namely, the resident and the mutant), and clonal interference is absent. Therefore, in the scenario considered, survival of drift is equivalent to fixation. The parameter regime $u \ll 1/\mu$ is known as the periodic selection or sequential fixation regime, and the resulting class of origin-fixation models is reviewed in [27].

Next, we consider the expected per-day increase in relative fitness, given the current value $F$. This is
\[ E(\Delta F \mid F) \approx \mu\,\pi(F)\,\delta(F) \sim \frac{\Gamma}{F^{2q+1}}. \tag{3.2.11} \]
Here, the asymptotic equality is due to (3.2.5)–(3.2.7), and the compound parameter
\[ \Gamma := C \mu \varphi^2 \log\gamma \tag{3.2.12} \]
is the rate of fitness increase per day at day 0 (where $r_{0j} \equiv F_0 = 1$). Note that $\varphi/F^q$ appears squared in the asymptotic equality in (3.2.11) since it enters both $\pi$ and $\delta$.

Note also that the additional $+1$ in the exponent of $F$ comes from the factor of $1/F$ in the length of the growth period (3.2.4), and thus reflects the runtime effect. The effect would be absent if, instead of our Cannings model, a discrete-generation scheme were used, as in [38]; or a standard Wright–Fisher model, for which [20] calculated the expected fitness increase and the fitness trajectory for various fitness landscapes, including the one in (3.2.5). Equation (3.2.11) now leads us to define a new time variable $\tau$ via
\[ i = \Bigl\lfloor \frac{\tau}{\Gamma} \Bigr\rfloor \tag{3.2.13} \]
with $\Gamma$ of (3.2.12); this means that one unit of time $\tau$ corresponds to $1/\Gamma$ days. With this rescaling of time, equation (3.2.11) corresponds to the differential equation
\[ \frac{d}{d\tau} f(\tau) = \frac{1}{f^{2q+1}(\tau)}, \qquad f(0) = 1, \tag{3.2.14} \]
with solution
\[ f(\tau) = \bigl(1 + 2(1 + q)\tau\bigr)^{\frac{1}{2(1+q)}}. \tag{3.2.15} \]
That is, the fitness trajectory follows a power law (and is concave). Note that (3.2.14) is just a scaling limit of (3.2.11), where the expectation was omitted due to a dynamical law of large numbers, as will be explained next.

3.2.3 Scaling regime and law of large numbers

We now think of $\mu = \mu_N$ and $\varphi = \varphi_N$ as being indexed with population size because the law of large numbers requires us to consider a sequence of processes indexed with $N$. Thus, other quantities now also depend on $N$ (so $\delta = \delta_N$, $s = s_N$, $\pi = \pi_N$, $\Gamma = \Gamma_N$, etc.), and so does the relative fitness process $(F_i)_{i \ge 0} = (F_i^N)_{i \ge 0}$ with $F_i$ of (3.2.3). More precisely, we will take a weak mutation–moderate selection limit, which requires that $\mu_N$ and $\varphi_N$ become small in some controlled way as $N$ goes to infinity. Specifically, it is assumed in [19] that
\[ \mu_N \sim \frac{1}{N^a}, \qquad \varphi_N \sim \frac{1}{N^b} \qquad \text{as } N \to \infty; \qquad 0 < b < \tfrac{1}{2}, \quad a > 3b. \tag{3.2.16} \]

These assumptions will enter in Theorem 3.2.1 below. Due to the assumption $a > 3b$, $\mu_N$ is of much lower order than $\varphi_N$. This is used in [19] to prove that, as $N \to \infty$, with high probability no more than two fitness classes are simultaneously present in the population over a long time span.

Here is a quick intuitive reason for the bound $a > 3b$ in (3.2.16). Reaching a macroscopic increase of the relative fitness requires asymptotically (as $N \to \infty$) no more than $\varphi_N^{-1-\varepsilon}$ successful mutations (with some $\varepsilon > 0$), and in view of (3.2.8), between two successful mutations there are asymptotically no more than $\varphi_N^{-1-\varepsilon}$ unsuccessful ones. Because of (3.2.9), the time until a new mutation has either disappeared or gone to fixation can be estimated from above in probability by $\varphi_N^{-1-\varepsilon}$. The condition $a > 3b$ ensures that the expected number of mutations that arrive in a total time of $(\varphi_N^{-1-\varepsilon})^3$ is asymptotically negligible. Indeed, as proved in [19], this condition guarantees a strict absence of clonal interference as $N \to \infty$; there it is conjectured that Theorem 3.2.1 holds even under the weaker condition $a > b$. Furthermore, the scaling of $\varphi_N$ implies that selection is stronger than genetic drift as soon as the mutant has reached an appreciable frequency.

The method of proof applied in [19] requires the assumptions (3.2.16) in order to guarantee a coupling between the new mutant's offspring and two nearly-critical Galton–Watson processes, between which the mutant offspring's size is "sandwiched" for sufficiently many days. Specifically, under the assumption $0 < b < \frac{1}{2}$, the coupling works until the mutant offspring in our Cannings model has reached a small (but strictly positive) proportion of the population, or has disappeared. For the case $\frac{1}{2} < b < 1$, Haldane's formula (3.2.8) holds for a very closely related class of Cannings models [6, Theorem 3.5]; the proof relies on duality methods invoking an ancestral selection graph in discrete time. We conjecture that the assertion of Theorem 3.2.1 holds also for $0 < b < 1$.

In the case where selection is much stronger than mutation, the classical models of population genetics, such as the Wright–Fisher or Moran model, display the well-known dynamics of sequential fixation [27], that is, a new beneficial mutation is either lost quickly or goes to fixation. (An analogous regime is considered in adaptive dynamics, see the contribution by Bovier [7] in this volume.) Qualitatively, our Cannings model displays a similar behaviour. Furthermore, as already indicated, with the chosen scaling the population turns out to be homogeneous on generic days $i$ as $N \to \infty$. This has the following practical consequence for the relative fitness process $(F_i^N)_{i \ge 0}$. On a time scale with a unit of $1/(\mu_N \varphi_N)$ days, $(F_i^N)_{i \ge 0}$ turns into a jump process as $N \to \infty$, cf. Figure 3.2.2. The precise formulation of the limit law [19] reads as follows.

Theorem 3.2.1. For $N \to \infty$ and under the scaling (3.2.16), the sequence of processes $(F^N_{\lfloor \tau/\Gamma_N \rfloor})_{\tau \ge 0}$ converges, in distribution and locally uniformly, to the deterministic function $(f(\tau))_{\tau \ge 0}$ in (3.2.15).

The theorem was proved along the heuristics outlined above.² This reasoning allows one to go from (3.2.11) to (3.2.14) (and thus to "sweep the expectation under the carpet"), in the following sense. For large $N$ and under the scaling assumption (3.2.16), fitness is the sum of a large number of small per-day increments accumulated over many days, and may be approximated by its expectation. Since time has been rescaled via (3.2.13), equation (3.2.15) has $q$ as its single parameter. Note that $1/(2(1 + q)) < 1$ (leading to a concave $f$) whenever $q \ge 0$; in particular, the fitness curve is concave even for $q = 0$, that is, in the absence of epistasis.
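The step from the differential equation (3.2.14) to the power law (3.2.15) can be verified numerically. The following minimal sketch (function names are ours) compares the closed form with an explicit Euler integration of (3.2.14).

```python
def power_law(tau, q):
    """Closed-form solution f(tau) = (1 + 2(1+q) tau)^(1 / (2(1+q))), cf. (3.2.15)."""
    return (1.0 + 2.0 * (1.0 + q) * tau) ** (1.0 / (2.0 * (1.0 + q)))

def euler_solution(tau_end, q, steps=100000):
    """Explicit Euler integration of f' = 1 / f^(2q+1), f(0) = 1, cf. (3.2.14)."""
    f = 1.0
    h = tau_end / steps
    for _ in range(steps):
        f += h / f ** (2.0 * q + 1.0)
    return f

if __name__ == "__main__":
    for q in (0.0, 1.0, 4.2):
        print(q, power_law(5.0, q), euler_solution(5.0, q))
```

For $q = 0$ the closed form reduces to $\sqrt{1 + 2\tau}$, which is concave although no epistasis is present; translating back to days amounts to evaluating `power_law(Gamma * i, q)`.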

² Note that [19] partly works with dimensioned variables, which is why the notation and the result look somewhat different.

In contrast, the fitness trajectory obtained in [20] for the Wright–Fisher model under $q = 0$ is linear. The difference is due to the runtime effect, which is present in our Cannings model even for $q = 0$ because of the parametrisation of the intraday dynamics with the individual reproduction rate $r$: if the population as a whole already reproduces faster, then the end of the growth phase is reached sooner and thus leaves less time for a mutant to play out its advantage $\delta(r) = \varphi/r^0 = \varphi$ of (3.2.2). The Wright–Fisher model of [20] does not display the runtime effect because it does not contain the individual (intraday) reproduction rate as a parameter. The second parameter, namely $\Gamma_N$, reappears when $\tau$ is translated back into days; that is, $F_i^N \approx f(\Gamma_N i)$.

3.2.4 Random fitness increments and clonal interference

Let us now turn to random beneficial effects. To this end, we scale the fitness increments with a positive random variable $X$ with density $h$ and expectation $E(X) = 1$. We assume throughout that $0 < E(X^2) < \infty$ (the degenerate case $X \equiv 1$ requires special treatment, as detailed in [3]). Taking into account the dependence on $X$, the quantities in (3.2.5)–(3.2.7) and (3.2.9) turn into
\[ \begin{aligned}
\delta(F, X) &= X\,\frac{\varphi}{F^q}, \qquad \sigma(F) = \frac{\log\gamma}{F} \quad \text{(as before)}, \\
s(F, X) &= \delta(F, X)\,\sigma(F), \qquad \pi(F, X) \approx C\,s(F, X), \\
u(F, X) &\approx \frac{\log(N s(F, X))}{s(F, X)} \approx \frac{\log(N \varphi X)}{s(F, X)}
\end{aligned} \tag{3.2.17} \]
(see [3] for an explanation of the approximation for $u$). Here we use $\approx$ in place of $\sim$ because, in contrast to Section 3.2.3, we have no limit law available so far. Note that large $X$ implies large $s$ and hence small $u$ and vice versa.

The following Poisson picture will be central to our heuristics: The process of beneficial mutations with scaled effect $x$ that arrive at time $\tau$ has intensity $\mu\,d\tau\,h(x)\,dx$ with points $(\tau, x) \in \mathbb{R}_+ \times \mathbb{R}_+$. And in fitness background $\approx F$, we denote by $\Pi$ the Poisson process of contending mutations, which has intensity $\mu\,d\tau\,h(x)\,\pi(F, x)\,dx$ on $\mathbb{R}_+ \times \mathbb{R}_+$. Recall that contending mutations are those that survive drift; but since we now take into account clonal interference, contending mutations do not necessarily go to fixation.

We now describe a refined version of the Gerrish–Lenski heuristics for clonal interference, adapted to the context of our model. Verbally, the heuristics says that "if two contending mutations appear within the time required to become dominant in the population, then the fitter one wins". We therefore posit that if, in the fitness background $\approx F$, two contending mutations $(\tau, x)$ and $(\tau', x')$ appear at $\tau < \tau' < \tau + u(F, x)$, then the first one outcompetes ("kills") the second if $x' \le x$, and the second one kills the first if $x' > x$. Thus, neglecting interactions of higher order, given that a contending mutation arrives at $(\tau, x)$ in the fitness background $\approx F$, the probability that it does not encounter a killer in its past is
\[ \overleftarrow{\psi}(F, x) := \exp\Bigl(-\mu \int_x^\infty \pi(F, y)\,u(F, y)\,h(y)\,dy\Bigr), \]
whereas the probability that it does not encounter a killer in its future is
\[ \overrightarrow{\psi}(F, x) := \exp\Bigl(-\mu\,u(F, x) \int_x^\infty \pi(F, x')\,h(x')\,dx'\Bigr) \]
(note that only the term corresponding to $\overrightarrow{\psi}$ is considered by [17]; the inclusion of $\overleftarrow{\psi}$ makes the heuristics consistent with its verbal description). Using (3.2.17), $\overleftarrow{\psi}(F, x)$ is approximated by
\[ \overleftarrow{\psi}(x) := \exp\Bigl(-\mu C \int_x^\infty \log(N \varphi y)\,h(y)\,dy\Bigr), \]
whereas $\overrightarrow{\psi}(F, x)$ is approximated by
\[ \overrightarrow{\psi}(x) := \exp\Bigl(-\mu C\,\frac{\log(N \varphi x)}{x} \int_x^\infty x'\,h(x')\,dx'\Bigr). \]

Note that neither $\overleftarrow{\psi}$ nor $\overrightarrow{\psi}$ depends on $F$. Thus, setting $\overleftrightarrow{\psi} := \overleftarrow{\psi}\,\overrightarrow{\psi}$ and analogously $\overleftrightarrow{\psi}(F, \cdot) := \overleftarrow{\psi}(F, \cdot)\,\overrightarrow{\psi}(F, \cdot)$, we obtain, as an analogue of (3.2.11), the expected (per-day) increase of $F$, given the current value of $F$, as
\[ E(\Delta F \mid F) \approx \mu \int_0^\infty \delta(F, x)\,\pi(F, x)\,\overleftrightarrow{\psi}(F, x)\,h(x)\,dx \approx \frac{\Gamma}{F^{2q+1}}, \tag{3.2.18} \]
where
\[ \Gamma := C \mu \varphi^2 \log(\gamma)\,I(\mu, \varphi) \tag{3.2.19} \]
and $I(\mu, \varphi) := E(\overleftrightarrow{\psi}(X)\,X^2)$. Similarly to Section 3.2.2, the assumption of a suitable concentration of the random variable $F$ around its conditional expectation allows us to take (3.2.18) into $F_{\lfloor \tau/\Gamma \rfloor} \approx E(F_{\lfloor \tau/\Gamma \rfloor}) \approx f(\tau)$ with $f$ as in (3.2.15). This means that an approximate power law of the mean fitness curve applies under any suitable distribution of fitness effects; in particular, the epistasis parameter $q$ is not affected by the distribution of $X$.

Exponentially distributed beneficial effects. For definiteness, we now specify $X$ as following $\mathrm{Exp}(1)$, the exponential distribution with parameter 1. This was the canonical choice also in previous investigations (cf. [17, 38]) and is in line with experimental evidence (reviewed by [15]) and theoretical predictions [18, 29]. Figure 3.2.3 shows the corresponding simulations (the estimation of the parameters $\mu$ and $\varphi$ is an art of its own, for which we refer to [3]). The simulation mean agrees nearly perfectly with the approximating power law. However, for small $i$, the fluctuations in the simulation are larger than those observed, whereas for large $i$, this is reversed. While we have no convincing explanation for the former, the latter may be due to the constant parameters assumed by the model, whereas parameters do vary across replicate populations in the experiment.

Figure 3.2.3. The fitness curve in model and experiment. [Axes: relative fitness $F_i$ against time $i$ in days.] Bullets: empirical relative fitness averaged over all twelve populations with error bars (95 % confidence limits based on the twelve populations) from [38, Figure 2A and Table S4]; solid line: $F_i \approx f(\Gamma i)$ with $f$ of (3.2.15) and parameter values $q = 4.2$ and $\Gamma = 3.2 \cdot 10^{-3}$ obtained by a least-squares fit to the data; grey dashed lines: twelve individual trajectories $F_i$ obtained via simulations of the Cannings model with $X$ following $\mathrm{Exp}(1)$ and parameters $N = 5 \cdot 10^6$, $\gamma = 100$, $\varphi = 0.0375$, and $\mu = 0.73$ (the values for $\mu$ and $\varphi$ are obtained from the fitted value of $\Gamma$ via (3.2.19) combined with the value of the mean fitness increment of the first fixed beneficial mutation as observed in independent experiments [21]); black dashed line: average over the twelve simulations; inset: zoom on the early phase.
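The factor $I(\mu, \varphi) = E(\overleftrightarrow{\psi}(X) X^2)$ in (3.2.19) can be approximated by simple quadrature when $X$ follows $\mathrm{Exp}(1)$. The sketch below is only an illustration of the heuristic formulas: the function name, the grid, and the truncation at `x_max` are our own choices, and we additionally clamp $\log(N \varphi x)$ at 0 for $x < 1/(N\varphi)$, where the heuristic time window would be negative (an assumption not made in the text).

```python
import math

def clonal_interference_factor(mu, phi, N, gamma=100.0, x_max=40.0, n=40000):
    """Riemann-sum approximation of I(mu, phi) = E(psi(X) X^2) for X ~ Exp(1)."""
    C = gamma / (gamma - 1.0)
    dx = x_max / n
    xs = [(k + 1) * dx for k in range(n)]
    # log(N * phi * x), clamped at 0 (our assumption, see the lead-in).
    logs = [max(math.log(N * phi * x), 0.0) for x in xs]
    dens = [math.exp(-x) for x in xs]  # density h of Exp(1)
    # tail[k] approximates the integral of log(N phi y) h(y) over [x_k, x_max]
    # (trapezoidal rule, accumulated from the right).
    tail = [0.0] * n
    for k in range(n - 2, -1, -1):
        tail[k] = tail[k + 1] + 0.5 * dx * (
            logs[k] * dens[k] + logs[k + 1] * dens[k + 1]
        )
    total = 0.0
    for k, x in enumerate(xs):
        psi_past = math.exp(-mu * C * tail[k])
        # For Exp(1), the integral of x' h(x') over [x, infinity) is (1 + x) exp(-x).
        psi_future = math.exp(-mu * C * logs[k] / x * (1.0 + x) * math.exp(-x))
        total += psi_past * psi_future * x * x * dens[k] * dx
    return total

if __name__ == "__main__":
    # Parameter values as quoted in the caption of Figure 3.2.3.
    I = clonal_interference_factor(mu=0.73, phi=0.0375, N=5e6)
    C = 100.0 / 99.0
    print(I, C * 0.73 * 0.0375 ** 2 * math.log(100.0) * I)
```

Without mutations competing ($\mu \to 0$) the factor tends to $E(X^2) = 2$; clonal interference reduces it, the more so the larger $\mu$ is. The second printed number is the value of (3.2.19) under these assumptions; no claim is made here about how closely it matches the fitted $\Gamma$.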

3.3 A host-parasite model with balancing selection and reinfection

In this section, we review a model motivated by observations concerning the human cytomegalovirus (HCMV), an old herpesvirus, which is carried by a substantial fraction of mankind [8]. In DNA data of HCMV, a high genetic diversity is observed in coding regions, see [34]. This diversity can be helpful to resist the defence of the host. Furthermore, for guaranteeing its long-term survival, HCMV seems to have developed elaborate mechanisms that allow it to persist lifelong in its host and to establish reinfections in already infected hosts. The purpose of the model described in the sequel is to study the effects of these mechanisms on the maintenance of diversity in a parasite population. A central issue hereby is that, due to the reinfections, the diversity of the (surrounding) parasite population can be transmitted to the individual hosts. Our presentation will be guided by [33].

3.3.1 A hierarchical Moran model

Consider a population consisting of $M$ hosts, each of which carries a population of $N$ parasites. For simplicity we assume that $M$ and $N$ remain constant over time, and that there are two types of parasites, A and B. We model the evolution of the parasite population distributed over the hosts by a $\{0, \frac{1}{N}, \ldots, 1\}^M$-valued Markovian jump process $X^{N,M} = (X_1^{N,M}(t), \ldots, X_M^{N,M}(t))_{t \ge 0}$, where $X_i^{N,M}(t)$ and $1 - X_i^{N,M}(t)$, $1 \le i \le M$, represent the relative frequencies of type A- and type B-parasites, respectively, in host $i$ at time $t$. Before stating the jump rates of $X^{N,M}$ in Definition 3.3.1, we describe the dynamics in words.

The host population as well as the parasite population within each host follow dynamics that are modifications of the classical Moran dynamics. We work at the time scale of host replacement, that is, every host dies at rate 1 and is replaced by an offspring of a uniformly chosen member of the host population. (See Remark 3.3.4 (f) for a possible generalisation of the latter assumption.) The $N$ parasites of the new host all carry the type of a parasite chosen randomly from the parasite population of its parent. On this time scale, the reproduction rate of each parasite is assumed to be $g_N$.
The parasite population within a host experiences balancing selection towards an equilibrium frequency $\eta$ for some fixed $\eta \in (0, 1)$. More specifically, in host $i$ parasites of type A, when at relative frequency $x_i$, reproduce at rate $g_N(1 + s_N(\eta - x_i))$ and those of type B at rate $g_N(1 - s_N(\eta - x_i))$, where $s_N$ is a small positive number. At a reproduction event, a parasite splits into two and replaces a randomly chosen parasite from the same host. Reinfection events occur at rate $r_N$ per host; then a single parasite in the reinfecting host (both of which are randomly chosen) is copied and transmitted to the reinfected host. At the same time, a randomly chosen parasite is instantly removed from this host; this way the parasite population size in each of the hosts is kept constant. This hierarchical host-parasite dynamics is summarised in the following definition.

Definition 3.3.1. In the process $X^{N,M}$, jumps from state $x = (x_1, \ldots, x_M) \in \{0, \frac{1}{N}, \ldots, 1\}^M$ occur, for $i = 1, \ldots, M$,
\[ \begin{aligned}
&\text{to } x + \tfrac{1}{N} e_i && \text{at rate } g_N \bigl(1 + s_N(\eta - x_i)\bigr) N x_i (1 - x_i) + r_N \frac{1}{M} \sum_{j=1}^{M} x_j (1 - x_i), \\
&\text{to } x - \tfrac{1}{N} e_i && \text{at rate } g_N \bigl(1 + s_N(x_i - \eta)\bigr) N x_i (1 - x_i) + r_N \frac{1}{M} \sum_{j=1}^{M} (1 - x_j) x_i, \\
&\text{to } x + (1 - x_i) e_i && \text{at rate } \bar{x} := \frac{1}{M} \sum_{j=1}^{M} x_j, \\
&\text{to } x - x_i e_i && \text{at rate } (1 - \bar{x}),
\end{aligned} \]
with $e_i = (0, \ldots, 1, \ldots, 0)$ the $i$-th unit vector of length $M$.

This scenario can be interpreted in classical population genetics terms as a population distributed over $M$ islands and migration between islands. Within each island, reproduction is panmictic and driven by balancing selection. The model is hierarchical in the sense that also the population of hosts is evolving. Related hierarchical models have been studied from a mathematical perspective in [11, 24], for example. In these papers, an emphasis is on models for selection on two scales, and phase transitions (in the mean-field limit) are studied, in which particularly the higher level of selection (namely group selection) can drive the evolution of the population. In our model, balancing selection only acts at the lower level (i.e. the within-host parasite populations); but we focus on parameter regimes in which balancing selection is also lifted to the higher level, such that both parasite types are maintained for a long time in the host population that consists of hosts carrying a single parasite type only, as well as hosts carrying both types of parasites. This corresponds to observations in samples of HCMV hosts, see [31] and the references therein.

3.3.2 Laws of large numbers and propagation of chaos

In what follows, we specify the assumptions on the strength of selection, intensity of reinfection, and parasite reproduction relative to the rate of host replacement. For a moderate strength of selection we show that, in the limit of an infinitely large parasite population per host, only three states of typical hosts exist, namely those infected with only one of the types A or B and those infected with both types, where A is at frequency $\eta$. These three host states will be called the pure states (if the frequency of type A in a host is 0 or 1) and the mixed state (if the frequency of type A in a host is $\eta$).
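As a concrete reading of Definition 3.3.1, the four jump rates affecting a given host can be collected in a small function. This is a sketch under our own naming conventions (the dictionary keys and argument names are not from the text); hosts are indexed from 0.

```python
def jump_rates(x, i, N, g, s, r, eta):
    """Rates of the four jump types of Definition 3.3.1 that affect host i.

    x   : list of type-A frequencies x_1, ..., x_M (one entry per host)
    N   : number of parasites per host
    g   : parasite reproduction rate g_N
    s   : selection strength s_N
    r   : reinfection rate r_N
    eta : equilibrium frequency of balancing selection
    """
    M = len(x)
    xbar = sum(x) / M  # mean type-A frequency over all hosts
    xi = x[i]
    # Type-A parasite reproduces and replaces a type-B parasite, or
    # a reinfecting type-A parasite replaces a type-B parasite.
    up = g * (1 + s * (eta - xi)) * N * xi * (1 - xi) + r * xbar * (1 - xi)
    # The same with the roles of A and B exchanged.
    down = g * (1 + s * (xi - eta)) * N * xi * (1 - xi) + r * (1 - xbar) * xi
    return {
        "plus_1_over_N": up,
        "minus_1_over_N": down,
        "replace_by_pure_A": xbar,      # jump to x + (1 - x_i) e_i
        "replace_by_pure_B": 1 - xbar,  # jump to x - x_i e_i
    }

if __name__ == "__main__":
    print(jump_rates([0.0, 0.5], i=0, N=10, g=100.0, s=0.1, r=5.0, eta=0.5))
```

In a host of pure type B ($x_i = 0$) the reproduction term vanishes, so the only way for type A to enter is reinfection (or host replacement), which is exactly the situation analysed in the limit $N \to \infty$.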
Selection is assumed to be weak enough that many reinfection attempts are required until an effective reinfection occurs, and at the same time to be strong enough that an effective reinfection leads the parasite type frequency in the reinfected host to the equilibrium frequency $\eta$ quickly. Only reinfection events can change a host state from pure to mixed. In most cases reinfection is not effective, in the sense that it only causes a short excursion from the boundary frequencies 0 and 1. We will see that if the selection strength and reinfection rate are appropriately scaled, the effective reinfection acts on the same time scale as host replacement. Furthermore, if selection is of moderate strength and parasite reproduction is fast enough (but not too fast), transitions of the boundary frequencies to the equilibrium frequency $\eta$ will appear as jumps on the host time scale, and transitions between the host states 0, $\eta$ and 1 are only caused by host replacement and effective reinfection events. It turns out (see Section 3.3.3) that, if the effective reinfection rate is larger than a certain bound depending on $\eta$, then, in the limit $N \to \infty$ and $M \to \infty$, there exists a stable equilibrium of the relative frequencies of hosts of types 0, $\eta$ and 1, at which both types of parasites are present in the overall parasite population at non-trivial frequencies.

We now collect the assumptions on the parameter regime, namely moderate selection in (A1), frequent reinfection in (A2), and upper and lower bounds for a fast parasite reproduction in (A3).

Assumptions (A). There exist $b \in (0, 1)$, $r > 0$ and $\epsilon > 0$ such that the parameters $s_N$, $r_N$ and $g_N$ of Definition 3.3.1 obey
(A1) $s_N = \frac{1}{N^b}$,
(A2) $\lim_{N \to \infty} r_N s_N = r$,
(A3) $\frac{1}{g_N} = o\bigl(\frac{1}{N^{3b+\epsilon}}\bigr)$, $g_N = O\bigl(\exp(N^{1 - b(1+\epsilon)})\bigr)$.

Remark 3.3.2. Assumption (A1) implies that
\[ \lim_{N \to \infty} s_N = 0 \quad \text{and} \quad \lim_{N \to \infty} s_N N = \infty, \]
while assumptions (A1), (A2), (A3) together imply that for large $N$,
\[ 1 \ll r_N \ll g_N. \]
The latter says that hosts experience frequent reinfections during their lifetime, and many parasite reproduction events happen between two reinfection events. Such a parameter regime seems realistic; see [31] for additional discussion.

In what follows, we will analyse the cases "$N \to \infty$ with $M$ fixed" (Theorem 3.3.3), "first $N \to \infty$, then $M \to \infty$" (Proposition 3.3.8), and "$N \to \infty$, $M \to \infty$ jointly" (Theorem 3.3.9).

Large parasite population, finite host population. Let the number $M$ of hosts be fixed. We specify the jump rates of the $\{0, \eta, 1\}^M$-valued Markovian jump process $Y^M = (Y_1^M(t), \ldots, Y_M^M(t))_{t \ge 0}$, which will turn out to be the process of type-A

frequencies in hosts $1, \ldots, M$ in the limit $N \to \infty$. From state $y = (y_1, \ldots, y_M)$, the process $Y^M$ jumps by flipping, for $i \in \{1, \ldots, M\}$, the component $y_i$
\[ \begin{aligned}
&\text{from } 0 \text{ or } \eta \text{ to } 1 && \text{at rate } \frac{1}{M} \sum_{j=1}^{M} y_j, \\
&\text{from } 1 \text{ or } \eta \text{ to } 0 && \text{at rate } \frac{1}{M} \sum_{j=1}^{M} (1 - y_j), \\
&\text{from } 0 \text{ to } \eta && \text{at rate } \frac{2 r \eta}{M} \sum_{j=1}^{M} y_j, \\
&\text{from } 1 \text{ to } \eta && \text{at rate } \frac{2 r (1 - \eta)}{M} \sum_{j=1}^{M} (1 - y_j).
\end{aligned} \tag{3.3.1} \]
The first two lines in the display (3.3.1) correspond to host replacement, and the last two lines capture the effective reinfection. That is, the coefficient $2r\eta$ in the third line should, in the light of assumption (A2), be understood as the limit of the product of the reinfection rate $r_N$ and the asymptotic "probability to balance" $2 s_N \eta$, see the explanation in Remark 3.3.4 (c) below.

Theorem 3.3.3 ([33, Theorem 1]). Let $X^{N,M}$ be the $\{0, \frac{1}{N}, \ldots, 1\}^M$-valued process with jump rates given in Definition 3.3.1. Fix $M \in \mathbb{N}$ and assume that the law of $X^{N,M}(0)$ converges weakly, as $N \to \infty$, to a distribution $\rho$ concentrated on $(\{0\} \cup [\alpha, 1 - \alpha] \cup \{1\})^M$ for some $\alpha > 0$. Let $Y^M$ be the process with jump rates (3.3.1), and with the distribution of $Y^M(0)$ being the image of $\rho$ under the mapping $0 \mapsto 0$, $1 \mapsto 1$, and $[\alpha, 1 - \alpha] \ni x \mapsto \eta$. Assume (A) and let $0 < \underline{t} < \overline{t} < \infty$. On the time interval $[\underline{t}, \overline{t}]$ the sequence of processes $X^{N,M}$, $N = 1, 2, \ldots$, then converges, as $N \to \infty$, to the process $Y^M$, in distribution with respect to the Skorokhod $M_1$-topology (see Remark 3.3.4 (e) for an intuitive description of this topology).

Remark 3.3.4.
(a) Theorem 3.3.3 reveals the emergence of processes of jumps from the boundary points 0 and 1 to the equilibrium frequency $\eta$; these jumps capture the outcomes of effective reinfections.
(b) The jump processes $Y_i^M$ that arise as the limits of the sequences $X_i^{N,M}$ as $N \to \infty$ bear similarities to the jump process addressed before Theorem 3.2.1: a jump from type 0 (or type 1) to type $\eta$ in host $i$ is caused by a "successful reinfection" (with a parasite of the complementary type), whereas in the context of the LTEE model, it is the "successful mutations" that count for the jumps of the mean fitness.
(c) Essential quantities required to show the concentration on the two pure frequencies and the mixed equilibrium are the probability to balance, i.e. the probability with which a reinfection event leads to the establishment of the second type in a host so far of pure type; and the time to balance, i.e. the time needed to reach (a small neighbourhood of) the equilibrium frequency $\eta$ after reinfection. These quantities

determine the parameter regimes in which we can observe the described scenario. Similar to the case of directional selection (see e.g. [9, 32] and Section 3.2), branching process approximations as well as approximations by (deterministic) ordinary differential equations can be used to estimate these probabilities and times. A notable difference compared to the situation of directional selection described in Section 3.2 is a change of the (role of the) coefficient of selection $s$ in Haldane's formula (3.2.8) for the fixation probability: according to the jump rates specified in Definition 3.3.1, when starting from frequency $\frac{1}{N}$, the coefficient of selection is $\eta s_N$, while the offspring variance is again asymptotically equal to 1. Thus Haldane's formula predicts that the probability to "take off", and hence the probability to balance (when starting from frequency $\frac{1}{N}$), is asymptotically equal to $2 s_N \eta$ as $N \to \infty$. This is proved in [33, Lemma 3.6]. Furthermore, the time to balance [33, Proposition 3.8] is longer than the fixation time in the corresponding setting of Section 3.2. This is due to the fact that random fluctuations close to the equilibrium are larger than fluctuations close to the boundary.
(d) The reason for considering, in Theorem 3.3.3, time intervals $[\underline{t}, \overline{t}]$ instead of $[0, \overline{t}]$ is to disregard the jumps at time 0 in the limiting process that result from the asymptotically instantaneous (as $N \to \infty$) stabilisation of the type frequencies due to the balancing selection.
(e) After an effective reinfection of host $i$, the $i$-th component of $X^{N,M}$ performs (when $N$ is large) a quick transition with small jumps starting from the boundary and leading close to $\eta$. Thus the Skorokhod $M_1$-topology, which is coarser than the more common $J_1$-topology, adequately describes the mode of convergence of $X^{N,M}$ to the jump process $Y^M$ as $N \to \infty$. For a definition and characterisation of these topologies see [36]. Roughly stated, two paths are close to each other in the $J_1$-topology if they are uniformly close after a small (possibly inhomogeneous) time shift of one of them. For a similar notion in the $M_1$-topology, define the graph of a path by completing the path's jumps through vertical interpolations, and require that there exist parametrisations of the two graphs that are uniformly close to each other in space-time. Beside the convergence (in the sense of finite-dimensional distributions) of the genealogies of $X^{N,M}$ towards that of $Y^M$, which relies on elements of the graphical construction described in the next paragraph, the tightness criterion for the $M_1$-topology provided by [36, Theorem 3.2.1] is used in the proof of Theorem 3.3.3.
(f) Stimulated by a referee's question, we conjecture that the essentials of the results of [33], which are reviewed here, remain valid if the hosts' lifetimes are generalised from i.i.d. standard exponentials to i.i.d. random variables with expectation 1. This would make the flip rates in the last line of Definition 3.3.1 and in the first two lines of (3.3.1) dependent also on the age of the respective host. Furthermore, it would introduce a renewal component into the dynamics of the process $V$ specified in Definition 3.3.5, but otherwise leave the statements of Theorem 3.3.3, Proposition 3.3.8, and Theorem 3.3.9 unchanged.
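For concreteness, the flip rates (3.3.1) of the limiting process $Y^M$ with exponential host lifetimes can be written as a small function. This is a sketch with hypothetical names; states are encoded as the floats `0.0`, `eta` and `1.0`, and the result maps each possible target state of host $i$ to its rate.

```python
def limit_flip_rates(y, i, r, eta):
    """Flip rates (3.3.1) of component i of Y^M, keyed by the target state.

    y   : list of host states, each equal to 0.0, eta or 1.0
    r   : effective reinfection parameter from assumption (A2)
    eta : equilibrium frequency, strictly between 0 and 1
    """
    M = len(y)
    ybar = sum(y) / M  # mean type-A frequency over hosts
    rates = {}
    if y[i] != 1.0:
        rates[1.0] = ybar                # host replacement by a pure type-A host
    if y[i] != 0.0:
        rates[0.0] = 1.0 - ybar          # host replacement by a pure type-B host
    if y[i] == 0.0:
        rates[eta] = 2.0 * r * eta * ybar                    # effective reinfection with A
    if y[i] == 1.0:
        rates[eta] = 2.0 * r * (1.0 - eta) * (1.0 - ybar)    # effective reinfection with B
    return rates

if __name__ == "__main__":
    y = [0.0, 0.5, 1.0]
    for i in range(len(y)):
        print(i, limit_flip_rates(y, i, r=1.0, eta=0.5))
```

A host in the mixed state can only leave it by host replacement, which is why no rate towards `eta` is returned for it.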

Theorem 3.3.3 assumes a finite host population (of constant size), with each host carrying a large number of parasites. However, in view of the discussion in [31], it is realistic to assume that the number of infected hosts is also large. We will consider two cases: In the next paragraph we let first N ! 1 and then M ! 1; thereafter we assume a joint convergence of N and M D MN to 1. Iterative limits: Huge parasite population, large host population. According to Theorem 3.3.3, the process YM arises in the limit of N ! 1 parasites per host, with the number M of hosts fixed. The process YM has a propagation of chaos property as M ! 1 for a sequence of initial states that are exchangeable, see Proposition 3.3.8 below. If the empirical distributions of YM .0/ converge to the distribution with weights v0 D .v00 ; v0 ; v01 / as M ! 1, then, for each t > 0, YM .t / converges to the distribution with weights v t D .v t0 ; v t ; v t1 / given by the solution of the dynamical system vP 0 D .1 vP  D

/v 

2rv 0 .v 1 C v  /;

v  C 2r 2 v 0 v  C .1

vP 1 D v 

2r.1

 /2 v 1 v  C v 0 v 1 ;  /v 1 v 0 C .1 /v 

(3.3.2)

with initial value $v_0$, see [33, Corollary 2.8]. The system leaves the unit simplex $\Delta^3 := \{(z^0, z^\eta, z^1) \in [0,1]^3 : z^0 + z^\eta + z^1 = 1\}$ invariant (as it must). Here is a brief intuitive explanation of (3.3.2), using its first equation as an example. The first term on the right-hand side of that equation arises from host replacement, as a sum of what happens when a host of type $0$, type $\eta$ or type $1$ dies. The corresponding contributions to $\dot v^0$ are $-v^0 (v^1 + \eta v^\eta)$, $v^\eta (v^0 + (1-\eta) v^\eta)$ and $v^1 (v^0 + (1-\eta) v^\eta)$, and these add up to $(1-\eta) v^\eta$. The second term on the right-hand side of the first equation in (3.3.2) arises from reinfection and corresponds to the rate in the third line of (3.3.1), see also the explanation there. The equilibria of (3.3.2) and their stabilities are analysed in [33, Proposition 2.11]. In particular, there is an equilibrium in the interior of $\Delta^3$, which is globally stable if and only if
\[
r > \max\Bigl\{ \frac{2\eta - 1}{2\eta(1-\eta)^2},\ \frac{1 - 2\eta}{2(1-\eta)\eta^2} \Bigr\}.
\tag{3.3.3}
\]
In the limit $M \to \infty$, the process of type-A parasite frequencies in a typical host turns out to be a $\{0, \eta, 1\}$-valued Markov process $V = (V(t))_{t \ge 0}$ defined as follows.

Definition 3.3.5 (Evolution of a typical host in the limit $M \to \infty$). For a given $v_0 \in \Delta^3$, let $v = (v_t)_{t \ge 0}$ be defined by (3.3.2). We specify the jump rates of $V$ as follows: At time $t$, the process $V$ jumps
• from any state to state $0$ at rate $v_t^0 + (1-\eta) v_t^\eta$, and to state $1$ at rate $v_t^1 + \eta v_t^\eta$;
• from state $0$ to state $\eta$ at rate $2r\eta\,(v_t^1 + \eta v_t^\eta)$, and
• from state $1$ to state $\eta$ at rate $2r(1-\eta)\,(v_t^0 + (1-\eta) v_t^\eta)$.
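The behaviour of (3.3.2) can be explored numerically. The following is a minimal sketch, assuming the form of (3.3.2) and (3.3.3) as read from the text; the function names (`rhs`, `rk4_step`, `solve`, `stability_threshold`), the classical Runge–Kutta scheme and the step size are illustrative choices and are not taken from [33].

```python
def rhs(v, r, eta):
    """Right-hand side of (3.3.2) for v = (v0, v_eta, v1)."""
    v0, ve, v1 = v
    d0 = (1 - eta) * ve - 2 * r * eta * v0 * (v1 + eta * ve)
    de = -ve + 2 * r * (eta**2 * v0 * ve + (1 - eta)**2 * v1 * ve + v0 * v1)
    d1 = eta * ve - 2 * r * (1 - eta) * v1 * (v0 + (1 - eta) * ve)
    return (d0, de, d1)


def rk4_step(v, h, r, eta):
    """One classical Runge-Kutta step of size h."""
    def shifted(base, direction, scale):
        return tuple(b + scale * d for b, d in zip(base, direction))

    k1 = rhs(v, r, eta)
    k2 = rhs(shifted(v, k1, h / 2), r, eta)
    k3 = rhs(shifted(v, k2, h / 2), r, eta)
    k4 = rhs(shifted(v, k3, h), r, eta)
    return tuple(
        x + h / 6 * (a + 2 * b + 2 * c + d)
        for x, a, b, c, d in zip(v, k1, k2, k3, k4)
    )


def solve(v_init, r, eta, t_end, h=0.01):
    """Integrate (3.3.2) from v_init up to time t_end."""
    v = tuple(v_init)
    for _ in range(int(round(t_end / h))):
        v = rk4_step(v, h, r, eta)
    return v


def stability_threshold(eta):
    """Right-hand side of condition (3.3.3), as read from the text."""
    return max(
        (2 * eta - 1) / (2 * eta * (1 - eta) ** 2),
        (1 - 2 * eta) / (2 * (1 - eta) * eta ** 2),
    )


if __name__ == "__main__":
    r, eta = 2.0, 0.5
    print("threshold:", stability_threshold(eta))
    print("v(40):", solve((0.6, 0.2, 0.2), r, eta, 40.0))
```

The three components of `rhs` sum to zero, which is the numerical counterpart of the invariance of the simplex. In the symmetric case $\eta = 1/2$ one can check by hand that $(1/(2+r),\ r/(2+r),\ 1/(2+r))$ is an interior equilibrium.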

Figure 3.3.1. Nested trees in the graphical representation of $V$. Left: A realisation of $T_t$. The root $R_t$ is situated at time $t$, the leaves are at time $0$. The distinguished line is drawn bold. HR events along the lines different from the distinguished one are indicated as dots. Branches incoming to the distinguished line at HR events are drawn to the right of the continuing branch, incoming branches at PER events (along any line) are drawn to the left of the continuing branch. Right: The corresponding realisation of $T_{t'}$ for a $t' < t$, nested into $T_t$.

The second part of [33, Corollary 3.4] gives the following.

Proposition 3.3.6. Let $v$ and $V$ be as in Definition 3.3.5. Then
\[
P(V(t) = a) = v_t^a, \qquad a \in \{0, \eta, 1\}.
\]
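As a consistency check (a sketch, not the argument of [33]), one can write down the forward equation for state $0$ from the jump rates of Definition 3.3.5. The notation $p_t^a := P(V(t) = a)$ is introduced here only for this computation.

```latex
\begin{align*}
\frac{d}{dt}\, p_t^0
  &= \bigl(p_t^\eta + p_t^1\bigr)\bigl(v_t^0 + (1-\eta) v_t^\eta\bigr)
     - p_t^0 \bigl(v_t^1 + \eta v_t^\eta\bigr)
     - 2r\eta\, p_t^0 \bigl(v_t^1 + \eta v_t^\eta\bigr). \\
\intertext{Setting $p_t = v_t$ and using $v_t^0 + v_t^\eta + v_t^1 = 1$, the first two terms combine to}
  &\phantom{=}\ (1-\eta)\, v_t^\eta \bigl(v_t^0 + v_t^\eta + v_t^1\bigr) = (1-\eta)\, v_t^\eta ,
\intertext{so that}
\frac{d}{dt}\, v_t^0
  &= (1-\eta)\, v_t^\eta - 2r\eta\, v_t^0 \bigl(v_t^1 + \eta v_t^\eta\bigr),
\end{align*}
```

which is the first equation of (3.3.2). The equations for the states $\eta$ and $1$ follow in the same way.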

This confirms that the dynamics of $V$ is of Vlasov–McKean type: the jump rates of $V$ at time $t$ depend on the distribution of $V(t)$. This goes along with the underlying mean-field situation. Notably, there is a graphical representation of the process $(V(t))_{t \ge 0}$ in terms of nested trees $(T_t)_{t \ge 0}$, which does not require a prior analytic construction of $v$ but rather gives a probabilistic representation of it. This representation is in the spirit of ancestral graphs in the absence of coalescences, see [2, Theorem 2] or [4, Proposition 2]. The leaves of $T_t$ are coloured independently according to the distribution with weights $v_0$, with the colours being transported in an interactive way from the leaves to the root; the random state $V(t)$ is then the colour of the root of $T_t$.

Let us first describe the construction of $T_t$, following [33, Definition 3.1]. A single (distinguished) line starts from the root $R_t$ of $T_t$ backwards in time; this is the downward direction in the (schematic) Figure 3.3.1. The growth of the tree in the downward direction is defined via the splitting rates of its lines. Namely, each line is hit by host replacement (HR) events at rate $1$ and potential effective reinfection (PER) events at rate $2r$. At each such event, the line splits into two branches, the continuing and the incoming one (where “incoming” refers to the direction from the leaves to the root). Whenever the distinguished line is hit by an HR event, we keep both branches in the tree and designate the continuing branch as the continuation of the distinguished line. In order to discriminate between PER and HR events along the distinguished line, we draw the incoming line at PER events to the left and at HR events to the right of the distinguished line. Whenever a line different from the distinguished one is hit by an HR event, we discard (or prune) the continuing branch and keep only the incoming one; but we keep the event in mind by placing a dot on the continuing line. At a PER event (to any line), we keep both the incoming and the continuing branches.

Next we describe the transport of the colours that result from an independent colouring of the leaves of $T_t$ according to $v_0$. Assume an HR event occurs at time $\tau$. If the incoming branch at time $\tau$ is in state $0$ or $1$, then the continuing branch takes the state of the incoming branch. If the incoming branch at time $\tau$ is in state $\eta$, then the state of the continuing branch at time $\tau$ is decided by a coin toss: it takes the state $1$ or $0$ with probability $\eta$ or $1-\eta$, respectively. Intuitively, this coin toss decides which type is transmitted (type A with probability $\eta$ and type B with probability $1-\eta$). At a PER event (occurring at time $\tau$, say), the state of the continuing branch is decided via at most two independent coin tosses, each with success probability $\eta$. If, for example, at time $\tau$ the incoming branch is in state $\eta$ and the continuing branch is in state $0$, then at time $\tau$ the state of the continuing branch changes to $\eta$ with probability $\eta^2$, and remains in $0$ with probability $1-\eta^2$. Intuitively, the first coin toss decides which type is transmitted, and the second coin toss decides whether the reinfection is effective. This second coin toss decreases the rate $2r$ of PER events to the host-state dependent rate of effective reinfection events. For the rules for the other combinations of states at the incoming and continuing branches, we refer to [33, Definition 3.1]. In this way, given the tree $T_t$ and the realisations of the coin tosses indexed by the HR and PER events along its lines, the states of the leaves are propagated in a deterministic way into the state $C_t$ of the root.
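The colour transport at an HR event, and the single PER case spelled out above, can be sketched in a few lines of Python. This is an illustration only: the state encoding (`0`, `1` and the string `"eta"`) and all function names are assumptions made here, and the remaining PER rules from [33, Definition 3.1] are deliberately not implemented.

```python
import random

MIXED = "eta"  # hypothetical label for the polymorphic host state


def hr_update(incoming, eta, rng=random):
    """State taken by the continuing branch at an HR event."""
    if incoming in (0, 1):
        return incoming
    # Incoming branch is in the mixed state: one coin toss decides
    # whether type A (state 1) or type B (state 0) is transmitted.
    return 1 if rng.random() < eta else 0


def hr_outcome_probs(v, eta):
    """Law of the new state when the incoming state has weights v."""
    v0, ve, v1 = v
    return {0: v0 + (1 - eta) * ve, 1: v1 + eta * ve}


def per_update_mixed_into_zero(eta, rng=random):
    """Only the PER case described in the text: incoming branch in the
    mixed state, continuing branch in state 0."""
    transmitted_a = rng.random() < eta  # first toss: type A is transmitted
    effective = rng.random() < eta      # second toss: reinfection is effective
    return MIXED if (transmitted_a and effective) else 0
```

Averaging the HR rule over an incoming state with weights $v_t$ reproduces the rates $v_t^0 + (1-\eta) v_t^\eta$ and $v_t^1 + \eta v_t^\eta$ of Definition 3.3.5, which is what `hr_outcome_probs` computes.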
As indicated in Figure 3.3.1, $(T_t, C_t)$ can be coupled for various $t$ by nesting the trees in an obvious way.

Proposition 3.3.7 (Graphical construction of the state evolution of a typical host [33, Corollary 3.4]). For $v_0 \in \Delta^3$, let $(T_t, C_t)_{t \ge 0}$ be constructed as described above. Then $(C_t)_{t \ge 0}$ has the same distribution as the process $(V(t))_{t \ge 0}$ specified in Definition 3.3.5.

This graphical approach is instrumental for proving the next result as well as Theorems 3.3.3 and 3.3.9.

Proposition 3.3.8 (Propagation of chaos [33, Proposition 2.7]). Assume
\[
\frac{1}{M} \sum_{i=1}^{M} \delta_{Y_i^M(0)} \to v_0^0\, \delta_0 + v_0^\eta\, \delta_\eta + v_0^1\, \delta_1
\]
as $M \to \infty$ for some $v_0 = (v_0^0, v_0^\eta, v_0^1) \in \Delta^3$. Moreover, assume that the initial states $Y_1^M(0), \ldots, Y_M^M(0)$ are exchangeable, i.e. arise through sampling without replacement from their empirical distribution (given the latter). Then, for each $\overline{t} > 0$, the random paths $Y_i^M = (Y_i^M(t))_{0 \le t \le \overline{t}}$, $i = 1, \ldots, M$, are exchangeable. Furthermore, for each $k \in \mathbb{N}$ as $M \to \infty$, one has $(Y_1^M, \ldots, Y_k^M) \to (V_1, \ldots, V_k)$ in distribution with respect to the Skorokhod $J_1$-topology, where $V_1, \ldots, V_k$ are i.i.d. copies of the process $V = (V(t))_{0 \le t \le \overline{t}}$ specified in Definition 3.3.5.

Joint limit: $M = M_N \to \infty$ for $N \to \infty$. In analogy to Proposition 3.3.8, propagation of chaos can also be shown in the case of a joint limit of $N$ and $M$ to $\infty$, i.e. $M = M_N$ and $M_N \to \infty$ for $N \to \infty$. This is the topic of the next theorem.

Theorem 3.3.9 (Propagation of chaos [33, Theorem 2]). Let Assumptions (A) be valid. Assume that, for any $N$, the initial states $X_1^{N,M_N}(0), \ldots, X_{M_N}^{N,M_N}(0)$ are exchangeable (i.e. arise via sampling without replacement from their empirical distribution $\mu_0^N$). For $M = M_N \to \infty$ as $N \to \infty$, assume that $\mu_0^N$ converges weakly as $N \to \infty$ to a distribution $\pi$ on $\{0\} \cup [\alpha, 1-\alpha] \cup \{1\}$ for some $\alpha > 0$. Let $0 < \underline{t} < \overline{t} < \infty$ and $k \in \mathbb{N}$. On the time interval $[\underline{t}, \overline{t}]$, the processes $X_1^{N,M_N}, \ldots, X_k^{N,M_N}$ then converge, as $N \to \infty$, to $k$ i.i.d. copies of the process $V$ specified in Definition 3.3.5. Here the distribution of $V(0)$ has weights $\pi(\{0\})$, $\pi([\alpha, 1-\alpha])$, $\pi(\{1\})$, and convergence is in distribution with respect to the Skorokhod $M_1$-topology.

This allows one to derive the following result on the asymptotics of the empirical distributions $\mu_t^N$ of $(X_i^{N,M_N}(t);\ i = 1, \ldots, M_N)$ as $N \to \infty$.

Corollary 3.3.10 (Law of large numbers [33, Corollary 2.10]). (i) In the situation of Theorem 3.3.9, the sequence of $\mathcal{M}_1(D([\underline{t}, \overline{t}]; [0,1]))$-valued random variables $\mu^N$ converges in distribution (with respect to the weak topology) to the distribution of $V$ as $N \to \infty$, where the space $D([\underline{t}, \overline{t}]; [0,1])$ is equipped with the Skorokhod $M_1$-topology. (ii) For $t > 0$, the sequence of $\mathcal{M}_1([0,1])$-valued random variables $\mu_t^N$ converges in distribution (with respect to the weak topology) to $v_t^0\, \delta_0 + v_t^\eta\, \delta_\eta + v_t^1\, \delta_1$ as $N \to \infty$, where $v = (v^0, v^\eta, v^1)$ is the solution of (3.3.2) with initial condition $(\pi(\{0\}), \pi([\alpha, 1-\alpha]), \pi(\{1\}))$.
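The law-of-large-numbers statement can be illustrated with a toy simulation. The sketch below is an assumption-laden stand-in, not the process $Y^M$ or $X^{N,M_N}$ of the text: it takes $M$ hosts with states in $\{0, \eta, 1\}$ and lets each host jump with the rates of Definition 3.3.5, where the weights $v_t$ are replaced by the current empirical frequencies. Time is discretised with a fixed step `dt`; the function name, the state label `"eta"` and all parameter values are hypothetical.

```python
import random

MIXED = "eta"  # hypothetical label for the polymorphic host state


def simulate_hosts(M, r, eta, t_end, dt=0.02, seed=0, init=(0.6, 0.2, 0.2)):
    """Toy mean-field system: rates of Definition 3.3.5 with v_t replaced
    by the empirical frequencies of the M simulated hosts."""
    rng = random.Random(seed)
    states = rng.choices([0, MIXED, 1], weights=init, k=M)
    for _ in range(int(round(t_end / dt))):
        f0 = states.count(0) / M
        f1 = states.count(1) / M
        fe = 1.0 - f0 - f1
        to0 = f0 + (1 - eta) * fe        # any state -> 0
        to1 = f1 + eta * fe              # any state -> 1
        from0 = 2 * r * eta * to1        # 0 -> eta
        from1 = 2 * r * (1 - eta) * to0  # 1 -> eta
        updated = []
        for s in states:
            u = rng.random()
            if u < to0 * dt:
                s = 0
            elif u < (to0 + to1) * dt:
                s = 1
            elif s == 0 and u < (to0 + to1 + from0) * dt:
                s = MIXED
            elif s == 1 and u < (to0 + to1 + from1) * dt:
                s = MIXED
            updated.append(s)
        states = updated
    return (states.count(0) / M, states.count(MIXED) / M, states.count(1) / M)


if __name__ == "__main__":
    # For eta = 1/2 and r = 2 the interior equilibrium of (3.3.2)
    # is (1/4, 1/2, 1/4); the empirical frequencies should be close to it.
    print(simulate_hosts(M=1000, r=2.0, eta=0.5, t_end=15.0, seed=1))
```

The step size must be small enough that all jump probabilities per step stay below one; with the values above they are at most $0.1$.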
3.3.3 Maintenance of a polymorphic state

For large $N$ and $M$, the weights of the empirical frequencies $\mu_t^N$ are close to the solution of the dynamical system (3.3.2) by Corollary 3.3.10. Hence – once a state close to the stable equilibrium state of (3.3.2) is reached – both parasite types A and B are maintained in the population for a long time. Indeed, an asymptotic lower bound for this time, which is “almost exponential” in $M_N$, is given in [33, Theorem 3 (ii)]. However, with $N$ being finite, eventually one of the types goes to fixation and the population enters a monomorphic state with all hosts infected with either type A or B only.


To formally overcome this problem of ultimate fixation, we now enrich our model by allowing, in addition to the rates specified in Definition 3.3.1, for two-way mutation for the parasites at rate $u_N$ per parasite generation. Then $\theta_N := u_N N M_N g_N$ is the population mutation rate, i.e. the total rate at which parasites mutate in the total host population on the host time scale. If $\theta_N = o(r_N)$, the rate at which a type is transmitted by reinfections is much larger than the mutation rate to this type, even if this type is retained only in a single host (at around the equilibrium frequency). The dynamical system arising as the limiting evolution of $X^{N,M}$ as $N \to \infty$ is then not perturbed by mutations. Even though most mutations away from a monomorphic population will get lost due to fluctuations, the assumed recurrence of the mutations will eventually turn a monomorphic host population into a polymorphic one. The condition
\[
r > \max\Bigl\{ \frac{\eta}{2(1-\eta)^2},\ \frac{1-\eta}{2\eta^2} \Bigr\},
\]
which is stronger than (3.3.3), allows for a coupling with a supercritical branching process that estimates from below the number of hosts infected with the currently rare parasite type. Under this condition together with Assumptions (A), [33, Theorem 3 (i)] gives an asymptotic upper bound (in terms of $N$, $s_N$ and $M_N$) on the time at which, with high probability (i.e. with a probability that tends to $1$ as $N \to \infty$), the empirical distribution of the hosts' states reaches a small neighbourhood of the stable fixed point of the dynamical system (3.3.2), when the particle system is started from a monomorphic state. A comparison of the two bounds in [33, Theorem 3, parts (i) and (ii)] shows that, as long as $\theta_N = o(r_N)$ and $\theta_N$ obeys a mild asymptotic lower bound, the proportion of time the population spends in a monomorphic state is negligible relative to the time it spends in a polymorphic state. In fact, the required lower bound on $\theta_N$ turns out to be subexponentially small in the host population size. From a modelling perspective, it seems important that the lower bound is this small. Indeed, the genetic variability we are modelling is found in coding regions. Types A and B represent different genotypes/alleles of the same gene (e.g. in HCMV there exist two genotypes for the region UL 75; they are separated by one deletion (removing an amino acid) and 8 amino acid substitutions, requiring at least 8 non-synonymous point mutations). Since no “intermediate genotypes” have been found, it is likely that a fitness valley lies between the two genotypes, see [31] for more details on the biological motivation.

Acknowledgements. We thank Cornelia Pokalyuk for comments that helped to improve the presentation, Sebastian Probst for help with figures, and an anonymous referee for thoroughly reading the manuscript and for insightful questions and suggestions.


References

[1] E. Baake, The Luria–Delbrück experiment: Are mutations spontaneous or directed?, Eur. Math. Soc. Newsl. 69 (2008), 17–20.
[2] E. Baake, F. Cordero, and S. Hummel, A probabilistic view on the deterministic mutation-selection equation: Dynamics, equilibria, and ancestry via individual lines of descent, J. Math. Biol. 77 (2018), 795–820.
[3] E. Baake, A. González Casanova, S. Probst, and A. Wakolbinger, Modelling and simulating Lenski’s long-term evolution experiment, Theor. Popul. Biol. 127 (2019), 58–74.
[4] E. Baake and A. Wakolbinger, Lines of descent under selection, J. Stat. Phys. 172 (2018), 156–174.
[5] R. Backofen and P. Pfaffelhuber, The population genetics of the CRISPR-Cas system in bacteria, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 69–83.
[6] F. Boenkost, A. González Casanova, C. Pokalyuk, and A. Wakolbinger, Haldane’s formula in Cannings models: The case of moderately weak selection, Electron. J. Probab. 26 (2021), Paper No. 4.
[7] A. Bovier, Stochastic models for adaptive dynamics: Scaling limits and diversity, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 127–149.
[8] M. J. Cannon, D. S. Schmid, and T. B. Hyde, Review of cytomegalovirus seroprevalence and demographic characteristics associated with infection, Rev. Med. Virol. 20 (2010), 202–213.
[9] N. Champagnat, A microscopic interpretation for adaptive dynamics trait substitution sequence models, Stochastic Process. Appl. 116 (2006), 1127–1160.
[10] L. M. Chevin, On measuring selection in experimental evolution, Biol. Lett. 7 (2011), 210–213.
[11] D. A. Dawson, Multilevel mutation-selection systems and set-valued duals, J. Math. Biol. 76 (2018), 295–378.
[12] M. M. Desai and D. S. Fisher, Beneficial mutation-selection balance and the effect of linkage on positive selection, Genetics 176 (2007), 1759–1798.
[13] R. Durrett, Probability Models for DNA Sequence Evolution, 2nd ed., Springer, New York, 2008.
[14] W. J. Ewens, Mathematical Population Genetics, 2nd ed., Springer, New York, 2004.
[15] A. Eyre-Walker and P. D. Keightley, The distribution of fitness effects of new mutations, Nat. Rev. Gen. 8 (2007), 610–618.
[16] R. Fisher, The correlation between relatives on the supposition of Mendelian inheritance, Phil. Trans. R. Soc. Edinburgh 52 (1918), 399–433.
[17] P. J. Gerrish and R. E. Lenski, The fate of competing beneficial mutations in an asexual population, Genetica 102/103 (1998), 127–144.
[18] J. H. Gillespie, Molecular evolution over the mutational landscape, Evolution 38 (1984), 1116–1129.


[19] A. González Casanova, N. Kurt, A. Wakolbinger, and L. Yuan, An individual-based model for the Lenski experiment, and the deceleration of the relative fitness, Stochastic Process. Appl. 126 (2016), 2211–2252.
[20] S. Kryazhimskiy, G. Tkačik, and J. B. Plotkin, The dynamics of adaptation on correlated fitness landscapes, Proc. Natl. Acad. Sci. USA 106 (2009), 18638–18643.
[21] R. Lenski, M. R. Rose, S. Simpson, and S. C. Tadler, Long term experimental evolution in Escherichia coli. I. Adaptation and divergence during 2000 generations, Amer. Nat. 138 (1991), 1315–1341.
[22] R. E. Lenski, The E. coli long-term experimental evolution project site, http://myxo.css.msu.edu/ecoli, 2021.
[23] R. E. Lenski and M. Travisano, Dynamics of adaptation and diversification: A 10,000-generation experiment with bacterial populations, Proc. Natl. Acad. Sci. USA 91 (1994), 6808–6814.
[24] S. Luo and C. Mattingly, Scaling limits of a model for selection at two scales, Nonlinearity 30 (2017), 1682–1707.
[25] S. Luria and M. Delbrück, Mutations of bacteria from virus sensitivity to virus resistance, Genetics 28 (1943), 491–511.
[26] J. Maynard Smith, What determines the rate of evolution?, Amer. Nat. 110 (1976), 331–338.
[27] D. M. McCandlish and A. Stoltzfus, Modelling evolution using the probability of fixation: History and implications, Q. Rev. Biol. 89 (2014), 225–252.
[28] D. H. Morris, K. Gostic, S. Pompei, T. Bedford, M. Łuksza, R. A. Neher, B. T. Grenfell, M. Lässig, and J. W. McCauley, Predictive modeling of influenza shows the promise of applied evolutionary biology, Trends Microbiol. 26 (2018), 102–118.
[29] H. A. Orr, The distribution of fitness effects among beneficial mutations, Genetics 163 (2003), 1519–1526.
[30] Z. Patwa and L. M. Wahl, The fixation probability of beneficial mutations, J. R. Soc. Interface 5 (2008), 1279–1289.
[31] C. Pokalyuk and I. Görzer, Diversity patterns in parasite populations capable for persistence and reinfection with a view towards the human cytomegalovirus, preprint (2019), https://www.biorxiv.org/content/10.1101/512970v2.
[32] C. Pokalyuk and P. Pfaffelhuber, The ancestral selection graph under strong directional selection, Theor. Popul. Biol. 87 (2013), 25–33.
[33] C. Pokalyuk and A. Wakolbinger, Maintenance of diversity in a parasite population capable of persistence and reinfection, Stochastic Process. Appl. 130 (2020), 1119–1158.
[34] E. Puchhammer-Stöckl and I. Görzer, Human cytomegalovirus: An enormous variety of strains and their possible clinical significance in the human host, Future Med. 6 (2011), 259–271.
[35] R. Sanjuán, Mutational fitness effects in RNA and single-stranded DNA viruses: Common patterns revealed by site-directed mutagenesis studies, Phil. Trans. R. Soc. B 365 (2010), 1975–1982.
[36] A. V. Skorokhod, Limit theorems for stochastic processes, Theory Probab. Appl. 1 (1956), 261–290.


[37] W. Stephan and A. Tellier, Stochastic processes and host-parasite coevolution: Linking coevolutionary dynamics and DNA polymorphism data, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 107–125.
[38] M. J. Wiser, N. Ribeck, and R. E. Lenski, Long-term dynamics of adaptation in asexual populations, Science 342 (2013), 1364–1367.

Chapter 4

The population genetics of the CRISPR-Cas system in bacteria

Rolf Backofen and Peter Pfaffelhuber

The Clustered Regularly Interspaced Short Palindromic Repeats system (or CRISPR-Cas system) is known as the immune system of bacteria against phages. Although the general function is similar across the different CRISPR systems, the known systems are highly diverged, and a classification system is required that identifies the different components and their function. In addition, for any given CRISPR system, a population sample may have different spacers, which calls for a population genetic model for the spacer sequences. In future work, such a model can be used to determine rates for spacer gain and loss depending on the type of the CRISPR system.

4.1 Introduction

Population genetic methods are rarely applied to bacterial species, and Maynard Smith even asked in [35]: Do bacteria have population genetics? The reason is that the biological species concept as a reproducing unit can hardly be applied to bacteria. Nevertheless, bacterial populations undoubtedly evolve, and population genetic models for bacterial species (and archaea) have been introduced [6, 29, 50], see also the contribution of Baake and Wakolbinger [3]. More recently, several fundamental questions on bacterial population genetics have been investigated, such as the abundance of recombination [11], ill-defined bacterial populations and the lack of neutral polymorphism [40].

We focus here on the evolution of a specific part of most bacterial genomes, known as the CRISPR-Cas system (Clustered Regularly Interspaced Short Palindromic Repeats – CRISPR associated sequences). It is contained in most bacteria and archaea [4, 44] and consists of two elements. The first is an array of structured exact repeats that enclose sequence elements called spacers. The second is a set of associated cas genes [21, 25, 38, 51], which encode proteins required for the processing of the information contained in the array. The characteristic repetitive structure of CRISPR arrays has been known since 1987 [23], but it was only in 2000 that Mojica used a specifically designed computer program to extend the repertoire of known CRISPR arrays to 20 examples [37]. From the newly enlarged set of known repeat spacers, subsequent bioinformatics analyses revealed that spacers could match with bacteriophage genomes [36], leading to the later-confirmed hypothesis that the CRISPR-Cas system is an adaptive immune system [12, 39]. In other words, the spacer elements are short pieces of DNA from former invading phages, which were included in a process called adaptation (which is different from the same term used in evolutionary theory). After the transcription of the CRISPR array into RNA and further processing, these spacer sequences and part of the encompassing repeats form the mature crRNA, which is utilised by the proteins encoded in the cas genes to attack invading phages whose DNA contains a similar DNA fragment called protospacer (Figure 4.1.1). Similarity is defined here as a high degree of complementarity. In this process, termed interference, the invading DNA [16] is recognised via base-pair interaction and then destroyed through enzymatic cleavage. Both the adaptation and the interference process often require an additional motif in the vicinity of the protospacer, the PAM element (Protospacer Adjacent Motif). Besides the cas genes, the CRISPR array is also accompanied by the leader sequence, which is required for gaining new spacers (Figure 4.1.2). One interesting aspect is that the CRISPR-Cas system acts not only against viral DNA, but against any invading DNA element. So in theory, a CRISPR system might prevent horizontal gene transfer (HGT) itself. This was hypothesised in [34]; however, no evidence for it was found on an evolutionary timescale [18].

Figure 4.1.1. The three major phases of CRISPR-Cas immune systems. First, in the adaptation phase, Cas proteins excise the protospacer sequence from foreign DNA and insert it into the repeat, adjacent to the leader at the CRISPR locus. Second, CRISPR arrays are transcribed and then processed into multiple crRNAs, each carrying a single spacer sequence and part of the adjoining repeat sequence. Third, at the interference phase, the crRNAs are assembled into different types of protein targeting complexes (Cascades) that anneal to, and cleave, spacer-matching sequences on either invading elements or their transcripts. Figure taken from [1].

Figure 4.1.2. The components of a CRISPR system.


The importance of bioinformatics in CRISPR research has led to the development of many tools for solving standard tasks. One of the first tasks is to detect CRISPR arrays, which is accomplished by tools like CRISPRfinder [19], CRT [10] or PILER-CR [14]. CRISPRfinder, the most highly cited tool, determines possible CRISPR arrays by detecting repetitive elements using VMATCH as the first step. These putative CRISPR locations are then filtered using a built-in scoring function that takes into account the length of the direct repeat and the number of mismatches. The filtered list is then further reduced by using a structural test. CRISPRtarget [9] tackles the problem of finding targets of mature crRNA. It is based on a BLAST screen of spacers against selected sequence databases. BLAST is a fast, heuristic version of the local alignment algorithm and thus allows screening for sequences that are similar across genomes. As BLAST is insensitive with respect to the number and position of mutations accepted by the interference machinery, the authors added a further scoring metric, namely +1 for matches and −1 for mismatches, to evaluate the resulting hits. In the next section, we will address the non-trivial task of classifying CRISPR systems using various aspects such as presence and absence of different cas genes, RNA secondary structure and possible targets of the Cas proteins. This also falls into the realm of bioinformatics, using techniques from machine learning. Mathematical (evolutionary) models of CRISPR systems are still in their infancy. A notable exception is [27]. Here, the sequence of spacers is modelled, which are gained and lost along a phylogenetic tree. Importantly, under the ordered independent loss model (see also Section 4.3), spacers are inserted into the CRISPR cassette only at the leader end, such that older spacers are further apart from the leader.
Future research on this model will help to understand the role of gain and loss of spacers in different CRISPR systems.
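The +1/−1 scoring of hits mentioned above for CRISPRtarget can be illustrated with a minimal sketch. It assumes that spacer and candidate target are given in the same orientation and alphabet and are aligned position by position without gaps; the function name `match_score` is hypothetical and this is not the CRISPRtarget implementation.

```python
def match_score(spacer, target):
    """Score an ungapped alignment: +1 per match, -1 per mismatch."""
    if len(spacer) != len(target):
        raise ValueError("sequences must have the same length")
    return sum(1 if a == b else -1 for a, b in zip(spacer, target))


if __name__ == "__main__":
    # Three matching positions and one mismatch.
    print(match_score("ACGT", "ACGA"))
```

With this metric a perfect hit scores the spacer length, and every additional mismatch lowers the score by two.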

4.2 Classification of CRISPR systems and components

There is quite a diversity of cas genes that are associated with the repeat of a CRISPR array. Although there is a tight coupling between cas genes and the repeat elements of the associated CRISPR arrays, there is no one-to-one relationship between a set of cas genes and the type of repeat element that is recognised by this set. A surprising degree of modularity and interchangeability of cas genes has been observed. For this reason, classification of CRISPR-Cas systems is an important and difficult problem. An initial classification for CRISPR was provided by Haft, who generated 45 hidden Markov models for the main Cas protein families based on the multiple sequence alignment of known cas genes [20]. The next step was the inclusion of multiple criteria, including evolutionary relationship, for the classification provided by Makarova et al. in 2011 [30, 31]. It is now widely accepted that classification of CRISPR-Cas systems should be based on multiple criteria including evolutionary relationship of Cas proteins. In this classification, the CRISPR-Cas systems are divided into three major types (types I–III) based on universally present core proteins. These major types are then further subdivided into subtypes using signature cas genes identified by [30, 31]. Extensive computational screening of prokaryotic genomes led to the detection of many new CRISPR systems and to the definition of two additional types (types IV and V) as well as many new subtypes [32]. Furthermore, the types I–V were grouped into two classes, where class 1 contains all types (i.e. the types I, III and IV) in which the effector complex (i.e. the protein complex destroying the invading DNA) is composed of several proteins. In contrast, CRISPR systems belonging to class 2 (i.e. types II and V) have a single effector protein, making them good candidates for biotechnological application due to their simplicity.

Concerning the distribution of CRISPR-Cas systems, the majority consist of class 1 systems found in both bacteria and archaea. These systems are complex, with approximately six to ten Cas proteins requiring either the large Cascade or Cmr complexes consisting of several proteins for successful interference reactions. In contrast, class 2 systems are lightweight, with only two to four Cas proteins found in bacteria, and only one protein for DNA interference. A prominent example is Cas9, which occurs in subtype II-C [32] and is used for gene editing in CRISPR-Cas9. Class 2 CRISPR-Cas systems have been used for efficient genome engineering in eukaryotes using the crRNA-guided DNA targeting principle. Class 2 CRISPR-Cas systems are optimal for this purpose since they can be used with minimal requirements for genome editing in eukaryotic cells [24]. The above classification was extended to archaeal CRISPR-Cas systems in [46]. Both classification approaches, however, are based on manual curation and thus are not directly applicable to newly detected CRISPR arrays.
Recently, a new classification model was introduced that retained the overall structure, but classified the known CRISPR-Cas systems into two classes, five types and sixteen subtypes. This is currently the most widely used model, with 1000 citations (Google Scholar) [32], and was recently updated [33]. Here, a key step was the introduction of signature cas genes to determine the different subtypes. One reason for the diversity of cas genes and the difficulty in classifying them is that there is solid evidence that the CRISPR-Cas system is subject to HGT, although the exact amount of HGT is hard to quantify. In [17], a phylogenetic analysis is used for known cas genes to collect evidence for HGT. Instead of single cas genes, the authors also used concatenation of cas genes to tackle whole cassettes (i.e. the whole set of cas genes associated with a CRISPR array), the proposed unit of HGT. The analysis was also extended to metagenomics data, finding that CRISPR loci are less conserved than surrounding sequences. The same HGT analysis was also employed in [45].

The CRISPR subtype, as outlined above, is predicted using hidden Markov models representing protein families to examine the composition of the set of cas genes associated with the CRISPR array. This does not, however, determine the evolutionary history of the array itself. Here, the evolutionary history comprises changes in the repeat sequence as well as changes in the composition of used spacers. The latter, i.e. the gain and loss of spacers, will be covered in our evolutionary model as described in the next section. An initial attempt to investigate the evolutionary history of CRISPR arrays was made by Kunin et al. [26], who clustered all repeat sequences known at that time into sequence families in order to determine conserved RNA secondary structures. In order to classify the repeats in CRISPR arrays with a higher precision, we developed the CRISPRmap tool [28]. CRISPRmap determines the evolution of CRISPR repeats by clustering them using a similarity measure that takes both the sequence and RNA secondary structure into account. We need to compare structure as well since the characteristic hairpin loop structure (which is formed by base pairs between As and Us or Gs and Cs in the sequence) is important for the recognition by some Cas proteins. We used more than 3500 consensus repeat sequences out of about 2500 genomes and clustered these repeats using our LocaRNA tool [48, 49] that simultaneously aligns both sequence and structure in the determination of conserved CRISPR hairpins. We additionally cluster the repeats based on their sequence only using MCL (Markov-based Clustering), under the reasoning that the sequence and structure provide independent information about the evolution of the repeat.

The repeat type is only one step in the classification of CRISPR arrays. The full characterisation also requires the determination of the orientation of the array. The reason is simply that the DNA encoding the array is double-stranded, and thus either the positive or the negative strand is used. In the latter case, the array sequence has to be reversed and complemented, meaning that each A has to be replaced by U, G by C, U by A and C by G. This is important as the precise array sequence implies the definition of young and old spacers as used in the evolutionary model described in the next section, as well as the precise characterisation of the leader sequence.
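The reverse-complement operation described above is simple to state in code. The following sketch uses the substitution rule exactly as given in the text (A↔U, G↔C); the function name is a hypothetical choice.

```python
# Translation table for the complement rule stated in the text.
COMPLEMENT = str.maketrans("AUGC", "UACG")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AUGC"))  # complement UACG, reversed GCAU
```

Applying the function twice returns the original sequence, so the orientation of an array can be flipped back and forth without loss of information.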
Concerning the array orientation, Biswas et al. [8] considered six features that were known to be indicative of strand orientation and used a weighted combination of these features to predict the orientation of different CRISPR loci. Here, we developed CRISPRstrand [1], a tool based on graph kernel combined with a support vector machine to predict the orientation of a CRISPR array. A graph kernel encodes a graph as a feature vector, where each feature corresponds to a specific small subgraph and counts the number of occurrences of this subgraph in the original graph encoded by the graph kernel. This results in large but sparse feature vectors, which can be used by support vector machines to discriminate positive from negative data. The core principle of this runs under the notion that older repeats should have acquired more mutations over time. We encode the mutation information together with the repeat consensus information as a graph, where the first layer is a linear graph encoding the consensus sequence. Below each position, we encode information about whether there is a mutation at this position, and its relative location. Since the adaptation process always integrates a new spacer at the end of the array preceded by the leader (see also Section 4.3), and the leader itself indicates the start of transcription, mutation should occur more frequently at the end of transcription. In comparison to the method developed by Biswas et al. [8], we improved the prediction quality significantly (e.g. for similarities of 6 65 % between test and train arrays, from approximately 67 %

Rolf Backofen and Peter Pfaffelhuber


AUC to over 80 % AUC; recall that the AUC is the area under the receiver operating characteristic curve [15]). Another problem lies in the actual prediction of the leader sequence, where comparative methods have so far failed to detect the precise end of these leader sequences. To address this, we developed a tool for determining the leader sequence [2] that relies on our initial CRISPRmap approach, which groups arrays according to the similarity of their repeats. After the prediction of strand orientation as described above, we determine positional sequence similarities between the leader sequences from conserved arrays using a string kernel (see Figure 4.2.1 for details). A logistic function is then fitted to the positional conservation score in order to estimate the boundaries of the leader. We were able to demonstrate high agreement between our predicted leaders and leaders that were determined experimentally. We also applied our prediction model to the task of determining the distribution of leader lengths, where we found that bacterial leaders were significantly shorter than archaeal ones. Furthermore, the data suggest that leaderless arrays (i.e. arrays that were predicted to have a leader length of 0) had a smaller number of spacers, a result that is consistent with the interpretation that these arrays can still interfere with invading DNA but cannot acquire new spacers. A key contribution was the generation of a comprehensive database of spacers and repeats to assess the evolution of CRISPR arrays. For that purpose, we downloaded all complete and non-complete genomes of archaea and bacteria from NCBI (National Center for Biotechnology Information). We detected CRISPR arrays using the tools CRISPRfinder [19], CRT [10], PILER-CR [14] and our newly developed in-house tool named CRISPRidentifier. For all tools, we used parameters requiring at least three repeat sequences within an array, and spacer lengths were set to 18–78 nucleotides.
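The two detection parameters just mentioned (at least three repeats, spacer lengths of 18–78 nucleotides) can be read as a simple filter on candidate arrays. The sketch below is purely illustrative: the `CrisprArray` container and the function `passes_filter` are hypothetical names and do not reflect the interface of CRISPRfinder, CRT, PILER-CR or CRISPRidentifier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CrisprArray:
    # Hypothetical container; field names are illustrative only.
    repeats: List[str]
    spacers: List[str]


def passes_filter(array: CrisprArray,
                  min_repeats: int = 3,
                  min_spacer_len: int = 18,
                  max_spacer_len: int = 78) -> bool:
    """Keep arrays with enough repeats and all spacer lengths in range."""
    if len(array.repeats) < min_repeats:
        return False
    return all(min_spacer_len <= len(s) <= max_spacer_len
               for s in array.spacers)


if __name__ == "__main__":
    candidate = CrisprArray(repeats=["GUUUC"] * 3, spacers=["A" * 20, "C" * 30])
    print(passes_filter(candidate))
```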
For each array, we used the MAFFT alignment program [41] to calculate the consensus sequence of all repeat occurrences. Finally, we merged the results from all tools and created the non-redundant CRISPR-array database. For each array in the database, we extracted the spacer sequences and created a spacer-database that contains the spacer sequences, Array-Id, genomic-ID (accession number), and similarity to other spacers (ids). We used the two tools described before, namely CRISPRstrand [1] and CRISPRleader [2], to correct the array orientation and to determine the leader sequence. Overall, the database now contains around 58,131 arrays consisting of 52,131 consensus repeats and 817,465 spacers. We have found around 200 leader clusters containing fifteen to sixteen leader sequences on average, where 820 leaders came from archaeal genomes and 2315 leaders came from bacterial genomes. This database is the basis of our new (yet unpublished) array detection tool CRISPRidentifier. When trying to understand the evolution of the different CRISPR systems, another important problem is that current sets of core cas genes are often conserved across several subtypes. However, more recent screens revealed a set of accessory proteins that show more variability, but are nevertheless important for determining evolution of CRISPR systems. A substantial fraction of these accessory proteins have been found

The population genetics of the CRISPR-Cas system in bacteria


Figure 4.2.1. We use a string kernel to determine the exact leader sequence. On a high level, a string kernel is a function that measures the similarity of strings, yielding higher values for more similar sequences. A: We select candidates for having a common leader by clustering arrays based on the similarity of the repeats using CRISPRmap. Once we have a cluster of arrays with similar repeat sequences, we investigate the sequence upstream of the first repeat to determine the end of the conserved region, which we interpret as the end of the leader. B: To measure conservation, we use the string kernel in a window-based approach. In more detail, we start with the sequence from position 1 to 40 upstream of each repeat, and compare the similarity (using the string kernel) of these windows across all arrays with the similarity of shuffled window sequences. The latter is required to correct for low-complexity regions. We then slide the window one position to the left (i.e. looking at positions 2 to 41 upstream), and repeat this process. C: Finally, we fit the logistic function $f(x) = \theta_1 \, e^{(x-\theta_3)/\theta_2} / (1 + e^{(x-\theta_3)/\theta_2})$ to the scores for each sequence position. Here, $\theta_1$ indicates the maximal conservation score, and $\theta_3$ the point at which the signal transitions from one saturated region to the other, which we use as the signal indicating the end of the conserved region. Figure taken from [2].
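To make the role of the logistic parameters concrete, the following sketch evaluates a three-parameter logistic curve of the kind described in panel C. It assumes the standard form with a maximal value, a scale and a midpoint; the parameter names `theta1`, `theta2`, `theta3` and the function name are our own labels, and the actual fitting procedure of [2] is not reproduced here.

```python
import math


def conservation_logistic(x: float, theta1: float, theta2: float,
                          theta3: float) -> float:
    """theta1 * e^z / (1 + e^z) with z = (x - theta3) / theta2.

    theta1: maximal conservation score (upper plateau)
    theta2: scale of the transition (its sign sets the direction)
    theta3: midpoint, where the curve equals theta1 / 2
    """
    z = (x - theta3) / theta2
    if z >= 0:
        # Algebraically identical form that avoids overflow for large z.
        return theta1 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return theta1 * ez / (1.0 + ez)


if __name__ == "__main__":
    for pos in (0, 40, 80, 120, 160):
        print(pos, round(conservation_logistic(pos, 1.0, -10.0, 80.0), 3))
```

With a negative scale, as in the usage example, the score is high close to the repeat and drops around the midpoint, which is the position one would read off as the estimated end of the conserved region.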

in type III systems [43]. The lower degree of conservation makes it hard to study these proteins by a traditional approach. For this reason, we used a guilt-by-association approach in [42] to determine new accessory proteins. Guilt-by-association is a very common approach in bioinformatics to estimate whether two entities (such as two genes, or a gene and a pathway) are functionally related. The idea is simply that if a gene A of unknown function is co-occurring more frequently than expected with


another gene B of known function, chances are high that gene A has a function related to that of gene B. The co-occurrence is measured with an association score. In our case, the association score is based on a clustering of genes flanking known type III modules by sequence similarity. The conservation across a wide range of host genomes is then taken as a measure for functionally linked proteins. Out of 4467 investigated genes flanking known type III modules, we determined 231 gene clusters, of which 76 were considered to be highly associated with core cas genes.

4.3 The pattern of spacer arrays within CRISPR

Let us switch gears from the empirical description of CRISPR-Cas systems in the previous section to population genetic modelling of CRISPR evolution. Evolutionary models for bacteria come in two flavours: either the genealogy of the sample is given externally through a phylogeny, or random trees are used to model the genealogy. If data from ribosomal RNA is available, a fixed phylogeny is frequently preferred. If this is not the case, coalescent trees provide a straightforward approach to describe the probability distribution of diversity in a population sample [13, 22, 47]. We take the former approach and rely on an externally given genealogy. In [27], the following ordered independent loss model for the CRISPR array of spacers was presented: Along each lineage of a given phylogeny, spacers can be gained and lost. Each spacer present can get lost from any individual at constant rate $\rho/2$ (per spacer). New spacers enter an individual at rate $\theta/2$ at one end of the CRISPR cassette (the leader end). As [27] finds, this model fits a dataset from Yersinia pestis best, in particular better than the unordered model, where spacers can be introduced anywhere. Gains and losses in this model are reminiscent of the infinitely many genes model, which was introduced for the evolution of bacterial pangenomes in [5]. In that model, genes (rather than spacers) are gained by bacterial cells (e.g. by uptake from the environment) at constant rate, and genes present within a bacterial genome are lost at another constant rate. Since genes can be inserted anywhere in the genome, the resulting model is unordered. Spacers can likewise be gained and lost, but in our ordered independent loss model, gains appear only at the leader end of the CRISPR array. Nevertheless, when considering statistics which do not depend on the order, e.g. the number of spacers appearing only in a single bacterium from a population sample, or the complete frequency spectrum of spacers, results from the infinitely many genes model can also be used in the ordered independent loss model. Here, we describe theoretical predictions for the joint spacer distribution depending on $\theta$ and $\rho$ from [7]. We start with a formal definition of the model. In Figure 4.3.1, the ordered independent loss model along a tree $\mathbb{T}$ is displayed.


Figure 4.3.1. Legend: ▲ gain of spacer $u_i$; ▼ loss of spacer $u_i$. In the ordered loss model, a spacer $u_i$ is gained along a line (indicated by ▲), and it can be lost again (indicated by ▼). Gained spacers are always introduced at the leader end. This means that the leftmost leaf has the spacer array $(u_2)$ (since $u_1, u_3, u_5, u_7$ are lost along the path from root to leaf), and the rightmost leaf has spacer array $(u_9, u_6, u_2, u_1)$ (when starting from the leader end).

Definition 4.3.1 (Ordered independent loss model). Let $\mathcal{S} = (S_t)_{t \ge 0}$ be a Markov jump process with state space $E = [0,1]^{\mathbb{N}}$, starting with an independent family $(U_1, U_2, \dots)$ of $U([0,1])$-distributed random variables, and the following dynamics: If $S_t = s = (s_1, s_2, \dots)$, it jumps to

$(U, s_1, s_2, \dots)$ at rate $\theta/2$, for some independent, uniform $U \in [0,1]$;

$(s_1, \dots, s_{i-1}, s_{i+1}, s_{i+2}, \dots)$ at rate $\rho/2$, for each $i = 1, 2, \dots$.

The first type of event is called a gain-event (of a spacer), while the latter is called a loss-event. We refer to $\mathcal{S}$ as the ordered independent loss model along a single line. If $\mathbb{T}$ is a tree with one infinite branch, denote the set of leaves by $L$ and the endpoint of the infinite branch by $\infty$. Let $\le$ be the usual partial ordering of $\mathbb{T}$ (i.e. $s \le t$ if $s$ is on the path between $t$ and $\infty$). Then we denote by $\mathcal{S} = (S_t)_{t \in \mathbb{T}}$ the Markov jump process with state space $E = [0,1]^{\mathbb{N}}$ and the above dynamics along all paths between root and leaves, where distinct branches on the tree evolve independently, by the ordered independent loss model along $\mathbb{T}$. We will abbreviate $S_{t,i} := (S_t)_i$. Moreover, we will identify the vector $S_t$ with the set of its entries, i.e. $S_t = \{S_{t,i} : i = 1, 2, \dots\}$. Note that for all $t \in \mathbb{T}$, all entries of $S_t$ are different, and $|S_t|$ is finite, almost surely.
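The dynamics along a single line can be simulated with a standard Gillespie-type scheme. The sketch below is an illustration under the assumption that gains occur at total rate θ/2 and each present spacer is lost at rate ρ/2; the function name, its arguments and the choice to start from a finite `initial` array are ours, not part of the definition.

```python
import random


def simulate_single_line(theta, rho, t_max, initial=(), rng=None):
    """Simulate the ordered independent loss model along one line.

    Index 0 of the returned list is the leader end. Spacers are labelled
    by uniform random numbers, as in the definition.
    """
    rng = rng or random.Random()
    spacers = list(initial)
    t = 0.0
    while True:
        gain_rate = theta / 2.0
        loss_rate = rho / 2.0 * len(spacers)   # rate rho/2 per spacer
        total = gain_rate + loss_rate
        if total == 0.0:
            break                              # nothing can happen any more
        t += rng.expovariate(total)            # waiting time to next event
        if t > t_max:
            break
        if rng.random() < gain_rate / total:
            spacers.insert(0, rng.random())    # gain at the leader end
        else:
            del spacers[rng.randrange(len(spacers))]  # uniform loss
    return spacers


if __name__ == "__main__":
    print(simulate_single_line(theta=10.0, rho=1.0, t_max=5.0,
                               rng=random.Random(42)))
```

Because gains only prepend and losses only delete, the relative order of surviving spacers never changes, which is the property exploited by the statistics $V_i$ and $W_i$ introduced next.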


In the special case that $L$ consists of only two elements, we denote these leaves by 1 and 2. In addition, we set the distance of the two leaves $d(1,2) = T$ for some $T > 0$ and define recursively $V_0 = W_0 = 0$ and

$$V_i := \inf\{k > V_{i-1} : S_{1,k} \in S_2\}, \qquad W_i := \inf\{k > W_{i-1} : S_{2,k} \in S_1\}, \tag{4.3.1}$$

for $i = 1, 2, \dots$. So $V_i$ is the position in $S_1$ of the $i$th element that is also contained in $S_2$, and $W_i$ is the position in $S_2$ of the $i$th element that is also contained in $S_1$. We then obtain the following result, which will help us to estimate $\rho$.

Theorem 4.3.2 (Distribution of equal spacer sequence in two leaves [7, Theorem 1]). Let $(A, B), (A_1, B_1), (A_2, B_2), \dots$ be i.i.d. pairs of random variables with joint distribution

$$P(A = a, B = b) = \binom{a+b}{a} \, \frac{e^{-\frac{\rho}{2}T}}{2 - e^{-\frac{\rho}{2}T}} \left( \frac{1 - e^{-\frac{\rho}{2}T}}{2 - e^{-\frac{\rho}{2}T}} \right)^{a+b}. \tag{4.3.2}$$

In addition, let $C_1, C_2$ be i.i.d. Poisson distributed with parameter $\frac{\theta}{\rho}\bigl(1 - e^{-\frac{\rho}{2}T}\bigr)$. Then

$$(V_1, W_1), \; (V_2 - V_1, W_2 - W_1), \; (V_3 - V_2, W_3 - W_2), \; \dots$$

are independent, $(V_1, W_1)$ is distributed as $(C_1 + A_1, C_2 + B_1)$, and $(V_i - V_{i-1}, W_i - W_{i-1})$ is distributed as $(A_i, B_i)$, $i = 2, 3, \dots$.
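The positions defined in (4.3.1) are straightforward to extract from two observed spacer arrays. The following sketch assumes each array is given as a list ordered from the leader end and that spacers are comparable labels; the function name `joint_positions` is hypothetical.

```python
def joint_positions(array1, array2):
    """Return (V, W): 1-based positions of the joint spacers.

    V[i-1] is the position in array1 of its i-th spacer that also
    occurs in array2; W is defined symmetrically for array2.
    """
    in1, in2 = set(array1), set(array2)
    V = [k for k, s in enumerate(array1, start=1) if s in in2]
    W = [k for k, s in enumerate(array2, start=1) if s in in1]
    return V, W


if __name__ == "__main__":
    # The two outer leaves of Figure 4.3.1, read from the leader end.
    rightmost = ["u9", "u6", "u2", "u1"]
    leftmost = ["u2"]
    print(joint_positions(rightmost, leftmost))
```

For the two leaves of Figure 4.3.1 used in the example, the only joint spacer is $u_2$, at position 3 in the first array and position 1 in the second.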

Remark 4.3.3 (Marginal distributions of A and B). Rather than a formal proof of the above theorem, we argue that $\Delta V_i := V_i - V_{i-1} \sim A_i$ for $i = 2, 3, \dots$ must be geometrically distributed with parameter $e^{-\frac{\rho}{2}T}$. (This fact can also be seen from (4.3.2) by summing over $b$.) Since spacer $V_i$ is also present in 2, we can be certain that only losses affect $\Delta V_i$. Consider all spacers present in the MRCA of 1 and 2. Any spacer can either be lost along the path to both 1 and 2 (probability $(1 - e^{-\frac{\rho}{2}T})^2$), it can be kept in both (probability $e^{-\rho T}$) or lost in one and not the other (probability $e^{-\frac{\rho}{2}T}(1 - e^{-\frac{\rho}{2}T})$ each). For $\Delta V_i$, we need to count the number of spacers present only in 1 before we find a spacer that is present in both. This leads to

$$P(A = a) = \left( 1 - \frac{e^{-\rho T}}{e^{-\rho T} + e^{-\frac{\rho}{2}T}\bigl(1 - e^{-\frac{\rho}{2}T}\bigr)} \right)^{a} \frac{e^{-\rho T}}{e^{-\rho T} + e^{-\frac{\rho}{2}T}\bigl(1 - e^{-\frac{\rho}{2}T}\bigr)} = \bigl(1 - e^{-\frac{\rho}{2}T}\bigr)^{a} \, e^{-\frac{\rho}{2}T}.$$

Actually, the argument leading to this geometric distribution can be extended in order to obtain the distribution of spacer sequences if $|L| = n$; see [7, Theorems 2 and 3]. Moreover, it paves the way to estimate $\rho$ based on $V_1, W_1, V_2, W_2, \dots$. Since in real datasets only finitely many spacers exist, we restrict ourselves to the case of $m < \infty$ joint spacers. Again, this result can be extended to more than two samples. All details are found in [7].
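The claim that summing the joint distribution over $b$ yields a geometric marginal can be checked numerically. The sketch below assumes the survival probability $q = e^{-\rho T/2}$ of a spacer along one path, as in Theorem 4.3.2; the function names are ours.

```python
import math


def joint_pmf(a: int, b: int, rho: float, T: float) -> float:
    """P(A = a, B = b) as in (4.3.2), with q = exp(-rho * T / 2)."""
    q = math.exp(-rho * T / 2.0)
    x = (1.0 - q) / (2.0 - q)   # next observed spacer is private to one leaf
    y = q / (2.0 - q)           # next observed spacer is joint
    return math.comb(a + b, a) * x ** (a + b) * y


def marginal_pmf(a: int, rho: float, T: float) -> float:
    """Geometric marginal P(A = a) = (1 - q)^a * q."""
    q = math.exp(-rho * T / 2.0)
    return (1.0 - q) ** a * q


if __name__ == "__main__":
    rho, T = 1.0, 1.0
    for a in range(4):
        summed = sum(joint_pmf(a, b, rho, T) for b in range(200))
        print(a, summed, marginal_pmf(a, rho, T))
```

The truncation at 200 terms is an arbitrary choice; since the private-spacer probability is below 1/2, the series converges geometrically and the two printed columns agree to floating-point precision.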


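Theorem 4.3.4 ([7, Theorem 4]). Let $(S_{1,k})_{k \le V_m}, (S_{2,k})_{k \le W_m}$ be given, i.e. the data at hand has $m$ joint spacers. Set

$$D := \sum_{i=2}^{m} (V_i - V_{i-1}) + (W_i - W_{i-1}) = V_m - V_1 + W_m - W_1$$

with $V_i, W_i$ as in (4.3.1). The joint distribution of $\Delta V_i := V_i - V_{i-1}$ and $\Delta W_i := W_i - W_{i-1}$, $i = 2, \dots, m$, only depends on $\rho$ and is a single-parameter exponential family with sufficient statistic

$$t(\Delta v_2, \dots, \Delta v_m, \Delta w_2, \dots, \Delta w_m) = \sum_{i=2}^{m} (\Delta v_i + \Delta w_i) = (v_m - v_1) + (w_m - w_1).$$

The maximum likelihood estimator of $\rho$ is given by

$$\hat\rho = \arg\max_{\rho} \left[ (m-1) \log \frac{e^{-\frac{\rho}{2}T}}{2 - e^{-\frac{\rho}{2}T}} + D \log \frac{1 - e^{-\frac{\rho}{2}T}}{2 - e^{-\frac{\rho}{2}T}} \right]$$

and can be computed explicitly through

$$\hat\rho = -\frac{2 \log \hat p}{T} \quad \text{with} \quad \hat p = \left( 1 + \frac{D}{2(m-1)} \right)^{-1}.$$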
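As a small worked illustration of the estimator, the sketch below computes $D$, $\hat p$ and $\hat\rho$ from the joint-spacer positions of two arrays. It assumes, consistently with (4.3.2), that a spacer survives along one path with probability $e^{-\rho T/2}$, so that $\hat p$ estimates this probability; under a different time scaling the factor 2 in the last step would change. The function names are hypothetical.

```python
import math


def log_likelihood(rho: float, m: int, D: int, T: float) -> float:
    """Log-likelihood of the increments, assuming q = exp(-rho * T / 2)."""
    q = math.exp(-rho * T / 2.0)
    return ((m - 1) * math.log(q / (2.0 - q))
            + D * math.log((1.0 - q) / (2.0 - q)))


def estimate_rho(V, W, T: float) -> float:
    """Closed-form maximiser of log_likelihood for given positions V, W."""
    m = len(V)
    if m != len(W) or m < 2:
        raise ValueError("need the same number (at least 2) of joint spacers")
    D = (V[-1] - V[0]) + (W[-1] - W[0])
    p_hat = 1.0 / (1.0 + D / (2.0 * (m - 1)))   # estimate of exp(-rho*T/2)
    return -2.0 * math.log(p_hat) / T


if __name__ == "__main__":
    V, W, T = [1, 3, 4], [2, 3, 6], 1.0
    rho_hat = estimate_rho(V, W, T)
    print(rho_hat, log_likelihood(rho_hat, m=3, D=7, T=T))
```

Setting the derivative of the log-likelihood with respect to $q$ to zero gives $2(m-1)(1-q) = Dq$, which is where the expression for `p_hat` comes from.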

Remark 4.3.5 (Estimation of $\rho$ in larger samples). While it is straightforward to give the maximum likelihood estimator for $\rho$ based on data from just a sample of two bacterial genomes, the estimation becomes more elaborate in larger samples. One way out is to restrict to subsamples of small sizes and average the results. In addition, estimation of $\theta$ (which is not treated here) can be made on top of the estimation of $\rho$, e.g. by observing that the expected number of spacers is $\theta/\rho$ for all individuals. As shown in [7, Section 4], the maximum likelihood estimator from Theorem 4.3.4 is nearly unbiased when applied to simulated data. However, the variance can be large, and in particular for a sample of size $n = 2$, it might even be better to estimate $\rho$ (and $\theta$) based on the frequency spectrum of spacers (i.e. on the number of private and joint spacers). Further developments on estimating $\rho$ will extend the above approach to samples of arbitrary size. Then applications to spacer sequences found in bacteria and archaea will possibly reveal connections between the type of a CRISPR system and the corresponding loss parameters.

4.4 Outlook

Both classification (see Section 4.2) and (mathematical) modelling (see Section 4.3) of CRISPR systems are only prerequisites for a deeper understanding of CRISPR evolution. Needless to say, data quality is of highest importance for this task, as well


as predictive mathematical models that are able to process the former. Taking our previous results as a starting point, ongoing work aims at answering the following questions:

(1) Spacer distribution across CRISPR types: It will be interesting to see if spacer gain and loss rates depend on the CRISPR class/type/subtype. In particular, the function of the CRISPR system might determine its speed of evolution.

(2) Spacer distribution within a CRISPR type: For single CRISPR systems, we can ask whether evolution was neutral. In order to answer this, note that, under the neutral hypothesis, the number of spacers private to any of a pair of cells between two joint spacers follows the same distribution, irrespective of the distance to the leader sequence; see Theorem 4.3.2. If deviations exist, we would conclude that gain and loss have not been constant processes along the genealogy, that is, there is a deviation from neutrality. For example, if older spacers (more distal to the leader end) tend to have a shorter distance, we would conclude that selection removes spacers from the genome that no longer have a defence function.

References

[1] O. S. Alkhnbashi, F. Costa, S. A. Shah, R. A. Garrett, S. J. Saunders, and R. Backofen, CRISPRstrand: Predicting repeat orientations to determine the crRNA-encoding strand at CRISPR loci, Bioinform. 30 (2014), i489–i496. [2] O. S. Alkhnbashi, S. A. Shah, R. A. Garrett, S. J. Saunders, F. Costa, and R. Backofen, Characterizing leader sequences of CRISPR loci, Bioinform. 32 (2016), i576–i585. [3] E. Baake and A. Wakolbinger, Microbial populations under selection, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 43–67. [4] R. Barrangou and J. van der Oost, CRISPR-Cas Systems: RNA-Mediated Adaptive Immunity in Bacteria and Archaea, Springer, Heidelberg, 2013. [5] F. Baumdicker, W. Hess, and P. Pfaffelhuber, The diversity of a distributed genome in bacterial populations, Ann. Appl. Probab. 20 (2010), 1567–1606. [6] F. Baumdicker, W. Hess, and P. Pfaffelhuber, The infinitely many genes model for the distributed genome of bacteria, Genome Biol. Evol. 4 (2012), 443–456. [7] F. Baumdicker, A. M. I. Huebner, and P. Pfaffelhuber, The independent loss model with ordered insertions for the evolution of CRISPR spacers, Theor. Popul. Biol. 119 (2018), 72–82. [8] A. Biswas, P. Fineran, and C. Brown, Accurate computational prediction of the transcribed strand of CRISPR noncoding RNAs, Bioinform. 30 (2014), 1805–1813. [9] A. Biswas, J. N. Gagnon, S. J. J. Brouns, P. C. Fineran, and C. M. Brown, CRISPRTarget: Bioinformatic prediction and analysis of crRNA targets, RNA Biol. 10 (2013), 817–827. [10] C. Bland, T. L. Ramsey, F. Sabree, M. Lowe, K. Brown, N. C. Kyrpides, and P. Hugenholtz, CRISPR recognition tool (CRT): A tool for automatic detection of clustered regularly interspaced palindromic repeats, BMC Bioinform. 8 (2007), Paper No. 209.


[11] L. M. Bobay and H. Ochman, Biological species are universal across Life’s domains, Genome Biol. Evol. 9 (2017), 491–501. [12] A. Bolotin, B. Quinquis, A. Sorokin, and S. Dusko Ehrlich, Clustered regularly interspaced short palindrome repeats (CRISPRs) have spacers of extrachromosomal origin, Microbiol. 151 (2005), 2551–2561. [13] R. Durrett, Probability Models for DNA Sequence Evolution, 2nd ed., Springer, New York, 2008. [14] R. C. Edgar, PILER-CR: Fast and accurate identification of CRISPR repeats, BMC Bioinform. 8 (2007), Paper No. 18. [15] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (2006), 861–874. [16] J. E. Garneau, M.-E. Dupuis, M. Villion, D. A. Romero, R. Barrangou, P. Boyaval, C. Fremaux, P. Horvath, A. H. Magadan, and S. Moineau, The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA, Nature 468 (2010), 67–71. [17] J. S. Godde and A. Bickerton, The repetitive DNA elements called CRISPRs and their associated genes: Evidence of horizontal transfer among prokaryotes, J. Mol. Evol. 62 (2006), 718–729. [18] U. Gophna, D. M. Kristensen, Y. I. Wolf, O. Popa, C. Drevet, and E. V. Koonin, No evidence of inhibition of horizontal gene transfer by CRISPR-Cas on evolutionary timescales, ISME J. 9 (2015), 2021–2027. [19] I. Grissa, G. Vergnaud, and C. Pourcel, CRISPRFinder: A web tool to identify clustered regularly interspaced short palindromic repeats, Nucleic Acids Res. 35 (2007), W52–57. [20] D. H. Haft, J. Selengut, E. F. Mongodin, and K. E. Nelson, A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes, PLoS Comput. Biol. 1 (2005), Paper No. e60. [21] C. R. Hale, S. Majumdar, J. Elmore, N. Pfister, M. Compton, S. Olson, A. M. Resch, C. V. C. 3rd Glover, B. R. Graveley, R. M. Terns, and M. P. Terns, Essential features and rational design of CRISPR RNAs that function with the Cas RAMP module complex to cleave RNAs, Mol. Cell 45 (2012), 292–302. 
[22] R. R. Hudson, Gene genealogies and the coalescent process, in: Oxford Surveys in Evolutionary Biology. Vol. 7 (eds. D. Futuyma and J. Antonovics), Oxford University, New York (1992), 1–44. [23] Y. Ishino, H. Shinagawa, K. Makino, M. Amemura, and A. Nakata, Nucleotide sequence of the iap gene, responsible for alkaline phosphatase isozyme conversion in Escherichia coli, and identification of the gene product, J. Bacteriol. 169 (1987), 5429–5433. [24] M. Jinek, K. Chylinski, I. Fonfara, M. Hauer, J. A. Doudna, and E. Charpentier, A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity, Science 337 (2012), 816–821. [25] M. M. Jore, M. Lundgren, E. van Duijn, J. B. Bultema, E. R. Westra, S. P. Waghmare, B. Wiedenheft, U. Pul, R. Wurm, R. Wagner, M. R. Beijer, A. Barendregt, K. Zhou, A. P. L. Snijders, M. J. Dickman, J. A. Doudna, E. J. Boekema, A. J. R. Heck, J. van der Oost, and S. J. J. Brouns, Structural basis for CRISPR RNA-guided DNA recognition by Cascade, Nat. Struct. Mol. Biol. 18 (2011), 529–536.


[26] V. Kunin, R. Sorek, and P. Hugenholtz, Evolutionary conservation of sequence and secondary structures in CRISPR repeats, Genome Biol. 8 (2007), Paper No. R61. [27] A. Kupczok and J. P. Bollback, Probabilistic models for CRISPR spacer content evolution, BMC Evol. Biol. 13 (2013), Paper No. 54. [28] S. J. Lange, O. S. Alkhnbashi, D. Rose, S. Will, and R. Backofen, CRISPRmap: An automated classification of repeat conservation in prokaryotic adaptive immune systems, Nucleic Acids Res. 41 (2013), 8034–8044. [29] A. E. Lobkovsky, Y. I. Wolf, and E. V. Koonin, Gene frequency distributions reject a neutral model of genome evolution, Genome Biol. Evol. 5 (2013), 233–242. [30] K. S. Makarova, L. Aravind, Y. I. Wolf, and E. V. Koonin, Unification of Cas protein families and a simple scenario for the origin and evolution of CRISPR-Cas systems, Biol. Direct 6 (2011), Paper No. 38. [31] K. S. Makarova, D. H. Haft, R. Barrangou, S. J. J. Brouns, E. Charpentier, P. Horvath, S. Moineau, F. J. M. Mojica, Y. I. Wolf, A. F. Yakunin, J. van der Oost, and E. V. Koonin, Evolution and classification of the CRISPR-Cas systems, Nat. Rev. Microbiol. 9 (2011), 467–477. [32] K. S. Makarova, Y. I. Wolf, O. S. Alkhnbashi, F. Costa, S. A. Shah, S. J. Saunders, R. Barrangou, S. J. J. Brouns, E. Charpentier, D. H. Haft, P. Horvath, S. Moineau, F. J. M. Mojica, R. M. Terns, M. P. Terns, M. F. White, A. F. Yakunin, R. A. Garrett, J. van der Oost, R. Backofen, and E. V. Koonin, An updated evolutionary classification of CRISPR-Cas systems, Nat. Rev. Microbiol. 13 (2015), 722–736. [33] K. S. Makarova, Y. I. Wolf, J. Iranzo, S. A. Shmakov, O. S. Alkhnbashi, S. J. Brouns, E. Charpentier, D. Cheng, D. H. Haft, P. Horvath, S. Moineau, F. J. M. Mojica, D. Scott, S. A. Shah, V. Siksnys, M. P. Terns, Č. Venclovas, M. F. White, A. F. Yakunin, W. Yan, F. Zhang, R. A. Garrett, R. Backofen, J. van der Oost, R. Barrangou, and E. V. Koonin, Evolutionary classification of CRISPR-Cas systems: A burst of class 2 and derived variants, Nat. Rev. Microbiol. 18 (2020), 67–83. [34] L. A. Marraffini and E. J. Sontheimer, CRISPR interference limits horizontal gene transfer in staphylococci by targeting DNA, Science 322 (2008), 1843–1845. [35] J. Maynard-Smith, Do bacteria have population genetics? in: Population Genetics of Bacteria (eds. S. Baumberg, J. P. W. Young, E. M. H. Wellington, and J. R. Saunders), Cambridge University, Cambridge (1995), 1–12. [36] F. J. M. Mojica, C. Diez-Villasenor, J. Garcia-Martinez, and E. Soria, Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements, J. Mol. Evol. 60 (2005), 174–182. [37] F. J. Mojica, C. Diez-Villasenor, E. Soria, and G. Juez, Biological significance of a family of regularly spaced repeats in the genomes of Archaea, bacteria and mitochondria, Mol. Microbiol. 36 (2000), 244–246. [38] K. H. Nam, C. Haitjema, X. Liu, F. Ding, H. Wang, M. P. DeLisa, and A. Ke, Cas5d protein processes pre-crRNA and assembles into a cascade-like interference complex in subtype I-C/Dvulg CRISPR-Cas system, Structure 20 (2012), 1574–1584. [39] C. Pourcel, G. Salvignol, and G. Vergnaud, CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies, Microbiol. 151 (2005), 653–663.


[40] E. P. C. Rocha, Neutral theory, microbial practice: Challenges in bacterial population genetics, Mol. Biol. Evol. 35 (2018), 1338–1347. [41] J. Rozewicki, K. D. Yamada, and K. Katoh, MAFFT online service: Multiple sequence alignment, interactive sequence choice and visualization, Brief. Bioinform. 20 (2019), 1160–1166. [42] S. A. Shah, O. S. Alkhnbashi, J. Behler, W. Han, Q. She, W. R. Hess, R. A. Garrett, and R. Backofen, Comprehensive search for accessory proteins encoded with archaeal and bacterial type III CRISPR-cas gene cassettes reveals 39 new cas gene families, RNA Biol. 16 (2019), 530–542. [43] S. A. Shmakov, K. S. Makarova, Y. I. Wolf, K. V. Severinov, and E. V. Koonin, Systematic prediction of genes functionally linked to CRISPR-Cas systems by gene neighborhood analysis, Proc. Natl. Acad. Sci. USA 115 (2018), E5307–E5316. [44] M. P. Terns and R. M. Terns, CRISPR-based adaptive immune systems, Curr. Opin. Microbiol. 14 (2011), 321–327. [45] G. W. Tyson and J. F. Banfield, Rapidly evolving CRISPRs implicated in acquired resistance of microorganisms to viruses, Environ. Microbiol. 10 (2008), 200–207. [46] G. Vestergaard, R. A. Garrett, and S. A. Shah, CRISPR adaptive immune systems of Archaea, RNA Biol. 11 (2014), 157–168. [47] J. Wakeley, Coalescent Theory: An Introduction, Roberts, Greenwood Village, 2008. [48] S. Will, T. Joshi, I. L. Hofacker, P. F. Stadler, and R. Backofen, LocARNA-P: Accurate boundary prediction and improved detection of structural RNAs, RNA 18 (2012), 900–914. [49] S. Will, K. Reiche, I. L. Hofacker, P. F. Stadler, and R. Backofen, Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering, PLoS Comput. Biol. 3 (2007), Paper No. e65. [50] J. Xu, Microbial Population Genetics, Caister Academic, Poole, 2010. [51] J. Zhang, C. Rouillon, M. Kerou, J. Reeks, K. Brugger, S. Graham, J. Reimann, G. Cannone, H. Liu, S.-V. Albers, J. H. Naismith, L. Spagnolo, and M. F. 
White, Structure and mechanism of the CMR complex for CRISPR-mediated antiviral immunity, Mol. Cell 45 (2012), 303–313.

Chapter 5

Evolution of altruistic defence traits in structured populations

Martin Hutzenthaler and Dirk Metzler

Defence traits against predators, as for example alarm calls, can be costly for the acting individual and beneficial for others in the same subpopulation. There is a growing literature trying to explain persistence of such altruistic defence traits. In this review we summarise recent progress on a specific individual-based Lotka–Volterra-type predator-prey model with two types (altruists and cheaters) of prey. This dynamic can also be considered as a model for host-parasite interactions, where parasites correspond to predators and hosts correspond to prey. For our analysis of persistence of altruists, we focus on the special case of uniform migration on finitely many demes. We perform two approximation steps: First we let the number of individuals tend to infinity and then we let the number of demes tend to infinity. The central observation for this McKean–Vlasov-type diffusion limit is then that the altruistic defence trait persists in the population if the cost of defence is smaller than a particular model parameter, which can be interpreted as benefit of defence.

5.1 Introduction

A behaviour that is known from many different animal species is to warn each other when predators arrive. The warning signal may be a call, like in ground squirrels and birds [14, 37, 38, 47], or a chemical secretion, like in social aphids [41, 57]. The warning signal can be costly for the acting individual, not only because producing the signal may require energy, but also because the signal may attract the attention of the predator to the caller or because the warned conspecifics may be competitors for food or other resources. Thus, warning signals may be altruistic, that is, they increase the fitness of surrounding individuals at the cost of a fitness reduction of the actor. Social insects do not only defend their nests against predators but also against parasites. An example of parasitism in social insects is slave-making ant species such as Temnothorax americanus, formerly known as Protomognathus americanus [55]. Workers of such parasite species raid colonies of their host, e.g. the related species Temnothorax longispinosus and Temnothorax curvispinosus, and steal their brood to replenish their labour force [1, 3]. Achenbach and Foitzik [1] described a defence behaviour of T. longispinosus and T. curvispinosus against the slave-makers as “slave rebellion”. It is, however, not obvious how such a trait could evolve as it cannot increase the individual fitness of enslaved ant workers, who have no chance to produce their own offspring under any conditions [3]. Furthermore, carrying a rebellion trait may


potentially be costly as such a trait may also disturb the worker’s behaviour in its own nest. So, at first glance, costly defence traits seem to contradict Darwin’s theory on natural selection. A classical explanation of altruistic behaviour is kin selection [22, 23]. This means that alleles that lead to a behaviour can spread in the population if the behaviour increases the actor’s inclusive fitness, that is the sum of her or his direct fitness and the fitness of her or his relatives weighted with relatedness. To explain the evolution of slave rebellion in T. longispinosus, Pamminger et al. [43] discussed a kin selection model on the level of ant nests. In fact, genetic analyses showed that (non-enslaved) ants of the host species T. longispinosus in nests in the vicinity of the slave-maker nests were often closely related to enslaved ants of the host species. Thus slave rebellion could potentially protect the relatives of the slaves. A challenge for the kin selection explanation of altruistic defence behaviour is that individuals of the same group or deme who profit from altruism may also be “cheaters”, who do not pay the cost of altruism and compete against the altruists. Taking local competition into account, Metzler et al. [40] carried out extensive simulation studies based on a spatial model of the T. longispinosus/T. americanus system and found parameter combinations in which a costly defence trait could spread in the host population. In these scenarios, however, kin selection was not effective among neighbouring ant nests but on the substantially larger scale of demes representing forests in which thousands of ant nests can live. This was only observed with parameter combinations that led to a meta-population dynamic in which the slave-makers would frequently drive the host species to extinction in demes, which were then quickly re-colonised by the offspring of very few immigrants from other demes. 
An alternative approach to explain the evolution of altruism in substructured populations is group selection [58]: Groups of individuals that collaborate (that is show an increased level of altruism) will produce more offspring than groups with fewer altruists, and the many migrants from the more successful groups will have a higher probability of carrying the altruistic allele. This requires that frequencies of altruists differ between the demes. If, as in [40], these differences arise because individuals who live in the same group (or deme) tend to be related, group selection can be understood as kin selection acting on the level of groups. The question whether kin selection and group selection are generally equivalent in biologically relevant scenarios has been debated for a long time [5, 11, 18, 19, 35, 42, 46]. In a reply to Wynne-Edwards [58], Maynard Smith [39] conjectured that a necessary condition for group selection to work would be that groups are founded by very few individuals. Indeed, results for many group selection models involve metapopulation dynamics in which demes go extinct and are quickly re-colonised by the offspring of a small number of immigrants; see e.g. [2,15,34,40,49,51]. Uyenoyama [52], however, presented a mathematical model in which group selection is effective without the extinction of demes if migration rates are high. In mathematical models of group selection it is usually assumed that groups with a large fraction of altruists produce more offspring than other groups, but these

Evolution of defence in structured populations


models do not always explain where this fitness advantage comes from (see e.g. [2, 15, 34, 39, 49, 51–53, 56]). It is not obvious whether such generic models are applicable to the evolution of costly defence traits as they neglect that group selection for defence traits depends on the abundance of predators or parasites, which may result in a negative feed-back mechanism: If group selection for defence leads to a reduction of predators or parasites, the group selection advantage of defence may vanish [4, 6, 10, 13, 48]. We now turn to an informal description of our model assumptions and results from [27], which will be reviewed in more detail in the subsequent sections. The approach in [27] is to consider an individual-based predator-prey model in which a costly defence trait exists in the prey population. Instead of explicitly assuming a certain group selection effect of the defence trait, we assume that the prey population interacts with the predator population in each deme according to a Lotka–Volterra dynamic [36, 54]. In this dynamic, the predation rate per prey individual decreases as the frequency of the defence trait among the prey individuals increases. As a consequence, there are fewer predators on demes with more altruists and this allows for more prey individuals in such demes and more emigrants that migrate to other demes. Thus, in our model, the advantage of groups with more altruists is modulated via the abundances of predators. In the model specifications and mathematical analyses throughout this chapter we refer to predators and prey although the same dynamics can also be considered as a model for host-parasite interactions. A persistence analysis for this individual-based model with general migration rates seems to be out of reach in general. For this reason we first perform two approximation steps, namely we let the number of individuals and the number of demes tend to infinity. 
The main result in [27] derives a diffusion approximation for the frequencies of altruists in the prey population in the limit of many individuals. For this, we assume in [27] that migration rates are low, as is usual in population genetics, for example in the derivation of the Kimura stepping stone model (see e.g. [24]), but in contrast to [52]. That is, the number of immigrants per deme and per generation converges to a constant even if we let the number of individuals per deme go to infinity. Furthermore, we assume weak selection for the cost of the defence trait (cf. [33]), which means that the reduction in fitness due to having the defence trait is on the order of magnitude of the inverse of the typical deme size. Thus, changes in allele frequencies due to selection happen on the same time scale as genetic drift, and allele frequency differences between demes, which are necessary for group-level selection, can arise due to genetic drift. These assumptions lead to a separation of the ecological and the population genetic time scales (see also [7]). This means that the numbers of prey and predators evolve much faster than the frequencies of defenders among prey. As a consequence, the population sizes of prey and predators in each deme are always immediately in the equilibrium $(h_\infty(X_t(i)), p_\infty(X_t(i)))$ (see (5.2.2)) of the Lotka–Volterra dynamics that results from the current frequency $X_t(i)$ of the defence trait in deme $i$ at time $t \in [0,\infty)$. From this it follows intuitively that the frequency of altruists within the prey population is asymptotically well described by a spatial system of Wright–Fisher

Martin Hutzenthaler and Dirk Metzler


diffusions where the population size in deme $i$ at time $t$ is replaced by $h_\infty(X_t(i))$. For a formal statement hereof see Theorem 5.2.1 below. We also note that entire demes do not go extinct since the equilibrium states (5.2.2) are strictly positive. The diffusion limit (5.2.3) derived in [27] is mathematically hard to analyse in the case of general migration rates due to nonlinear interactions and the lack of a suitable dual process. For this reason we perform a second approximation step. More precisely, we additionally assume that the population is distributed over a large but finite number of demes (which is realistic) and that each migrant chooses a deme uniformly among all demes. This latter assumption is unrealistic but we believe that the resulting models represent a certain universality class of high spread migration kernels. Then we let the number $D$ of demes tend to infinity in the case of uniform migration on $D$ demes. We obtain that the global relative frequency of altruists converges as $D \to \infty$ to the McKean–Vlasov stochastic differential equation (SDE) (5.3.2); see Proposition 5.3.1 below. This McKean–Vlasov SDE is now one-dimensional and, therefore, much easier to analyse. After these two approximation steps we now answer the question whether the defence trait goes to fixation, or goes to extinction, or whether altruists and cheaters coexist. Roughly speaking, an altruistic defence trait persists if its cost is sufficiently small. More precisely, we show for the McKean–Vlasov SDE (5.3.2) in the case of nontrivial initial frequency that the defence trait will become fixed in the entire prey population if the so-called “benefit of defence” (a certain model parameter) is greater than the cost of the defence trait and that it will go extinct if the benefit of defence is smaller than its cost; see Theorem 5.3.2 below for details.
We also explored the robustness of this criterion in simulation studies in [31] for next-neighbour migration instead of uniform migration, more precisely for models with around 500 demes that are arranged in a one- or two-dimensional grid with migration occurring only between demes that are direct neighbours on the grid. Thereby we have shown that under the above assumptions, predator-prey dynamics can induce group selection that maintains a defence trait in the structured prey population without the extinction of entire demes. We note that if there were no predators, then the McKean–Vlasov diffusion would converge to 0 when started nontrivially due to the cost of defence. Moreover, we note that in a panmictic population (corresponding to $D = 1$) the expected frequency of altruists decreases in time and, in particular, does not converge to 1 even if the cost of defence is small. In our asymptotic many-demes model, not only selection effects but also migration rates are small enough to allow that genetic drift leads to the among-deme variation necessary for group selection to be effective. In addition, we are interested in whether a newly appearing altruistic defence allele (appearing e.g. through mutation or immigration) could persist/survive in the population. For this, we first show that, in the case of initially summable frequencies of altruists, the diffusion approximation of the altruist frequencies with uniform migration on $D$ demes converges as $D \to \infty$ to a forest of trees of excursions, see Theorem 5.4.1 below. In this forest of trees of excursions, the survival probability is positive if the benefit of defence is greater than the cost of the defence trait and zero otherwise.


Again this confirms the above criterion that an altruistic defence trait can only persist in a population if the benefit of defence is greater than the cost of the defence trait. The structure of the rest of this chapter is as follows. In Section 5.2 we review the diffusion approximation for the frequencies of altruists in the prey population from [27]. In Section 5.3 we review the criterion for extinction/fixation of the average altruist frequency from [27]. In Section 5.4 we review the convergence to a forest of trees of excursions from [29]. Central to the proof of this convergence result are uniform bounds on the derivatives of suitable semigroups, and we review these bounds from [28] in Section 5.5.

5.2 Asymptotic frequencies of altruistic defence traits

In this section we review the asymptotic approximation of frequencies of altruists in the prey population from [27]. First we introduce our model. We assume that predator and prey individuals populate demes given by a countable set $\mathcal{D}$, and the prey population consists of altruistic defenders and cheaters. For all $i \in \mathcal{D}$ let $A_i^N$, $C_i^N$ and $P_i^N$ be the current total deme population sizes of altruistic defenders, cheaters and predators in deme $i$ measured in units of $N \in \mathbb{N}$ individuals. The total population size of prey individuals in deme $i \in \mathcal{D}$ is denoted as $H_i^N = A_i^N + C_i^N$. We assume that the prey and predator populations interact in each deme according to a Lotka–Volterra model with growth rate $\lambda$, carrying capacity $K$, per-predator death rate $\delta$ for the prey, per-prey growth rate $\eta$, competition rate $\gamma$ and death rate $\nu$ for the predator. Furthermore, we assume a per-defender decrease $\rho$ of the growth rate of the predator population size. Regarding the fitness cost of defence, we assume that the (predator-independent) death rate of defenders is increased by $\alpha/N$. We further assume that individuals migrate at rate $\kappa\, m(i,j)/N$ from deme $i$ to deme $j$, where $m \in [0,\infty)^{\mathcal{D}\times\mathcal{D}}$ is a symmetric stochastic matrix. In addition, we model the stochastic component of reproductive success by critical binary branching. Every prey (resp. predator) individual branches with rate $\beta_H$ (resp. $\beta_P$) and is replaced by 0 or 2 offspring individuals with equal probability. Moreover, in order to avoid extinction of the prey populations on the ecological time scale, we assume immigration of cheater (resp. predator) individuals at rate $\iota_C/N$ (resp. $\iota_P/N$) into each deme with $\iota_C >$

4ı 3 C ˇH 3. C / 2

and P > ˇP :

This immigration, however, is only assumed for technical reasons and does not appear in the diffusion approximation of the altruist frequencies. We summarise the above transition rates for the (unscaled) numbers $(a_k, c_k, p_k) = (A_k^N \cdot N,\ C_k^N \cdot N,\ P_k^N \cdot N)$


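of altruist, cheater and predator individuals in each deme $k$:

$$
\begin{aligned}
(a_k, c_k, p_k)_{k\in\mathcal{D}} &\to (a_k + \mathbb{1}_{k=i},\, c_k,\, p_k)_{k\in\mathcal{D}}: && a_i\Big(\lambda + \frac{\beta_H}{2}\Big),\\
(a_k, c_k, p_k)_{k\in\mathcal{D}} &\to (a_k - \mathbb{1}_{k=i},\, c_k,\, p_k)_{k\in\mathcal{D}}: && a_i\Big(\frac{\beta_H}{2} + \lambda\frac{a_i + c_i}{KN} + \delta\frac{p_i}{N} + \frac{\alpha}{N}\Big),\\
(a_k, c_k, p_k)_{k\in\mathcal{D}} &\to (a_k,\, c_k + \mathbb{1}_{k=i},\, p_k)_{k\in\mathcal{D}}: && c_i\Big(\lambda + \frac{\beta_H}{2}\Big) + \iota_C,\\
(a_k, c_k, p_k)_{k\in\mathcal{D}} &\to (a_k,\, c_k - \mathbb{1}_{k=i},\, p_k)_{k\in\mathcal{D}}: && c_i\Big(\frac{\beta_H}{2} + \lambda\frac{a_i + c_i}{KN} + \delta\frac{p_i}{N}\Big),\\
(a_k, c_k, p_k)_{k\in\mathcal{D}} &\to (a_k,\, c_k,\, p_k + \mathbb{1}_{k=i})_{k\in\mathcal{D}}: && p_i\Big(\frac{\beta_P}{2} + \eta\frac{a_i + c_i}{N} - \rho\frac{a_i}{N}\Big) + \iota_P,\\
(a_k, c_k, p_k)_{k\in\mathcal{D}} &\to (a_k,\, c_k,\, p_k - \mathbb{1}_{k=i})_{k\in\mathcal{D}}: && p_i\Big(\frac{\beta_P}{2} + \nu + \gamma\frac{p_i}{N}\Big),\\
(a_k, c_k, p_k)_{k\in\mathcal{D}} &\to (a_k - \mathbb{1}_{k=i} + \mathbb{1}_{k=j},\, c_k,\, p_k)_{k\in\mathcal{D}}: && a_i\,\frac{\kappa}{N}\, m(i,j),\\
(a_k, c_k, p_k)_{k\in\mathcal{D}} &\to (a_k,\, c_k - \mathbb{1}_{k=i} + \mathbb{1}_{k=j},\, p_k)_{k\in\mathcal{D}}: && c_i\,\frac{\kappa}{N}\, m(i,j),\\
(a_k, c_k, p_k)_{k\in\mathcal{D}} &\to (a_k,\, c_k,\, p_k - \mathbb{1}_{k=i} + \mathbb{1}_{k=j})_{k\in\mathcal{D}}: && p_i\,\frac{\kappa}{N}\, m(i,j).
\end{aligned}
$$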

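According to a Lotka–Volterra modelling approach and according to the usual diffusion approximation with SDEs, $A^N$, $C^N$ and $P^N$ approximately satisfy the SDEs

$$
\begin{aligned}
dA_t^N(i) &= A_t^N(i)\Big[\lambda\Big(1 - \frac{A_t^N(i) + C_t^N(i)}{K}\Big) - \delta P_t^N(i) - \frac{\alpha}{N}\Big]\,dt + \frac{\kappa}{N}\sum_{j\in\mathcal{D}} m(i,j)\big(A_t^N(j) - A_t^N(i)\big)\,dt + \sqrt{\frac{\beta_H}{N} A_t^N(i)}\,dW_t^A(i),\\
dC_t^N(i) &= C_t^N(i)\Big[\lambda\Big(1 - \frac{A_t^N(i) + C_t^N(i)}{K}\Big) - \delta P_t^N(i)\Big]\,dt + \frac{\iota_C}{N}\,dt + \frac{\kappa}{N}\sum_{j\in\mathcal{D}} m(i,j)\big(C_t^N(j) - C_t^N(i)\big)\,dt + \sqrt{\frac{\beta_H}{N} C_t^N(i)}\,dW_t^C(i),\\
dP_t^N(i) &= P_t^N(i)\Big[-\nu - \gamma P_t^N(i) + \eta\, C_t^N(i) + (\eta - \rho) A_t^N(i)\Big]\,dt + \frac{\iota_P}{N}\,dt + \frac{\kappa}{N}\sum_{j\in\mathcal{D}} m(i,j)\big(P_t^N(j) - P_t^N(i)\big)\,dt + \sqrt{\frac{\beta_P}{N} P_t^N(i)}\,dW_t^P(i),
\end{aligned}
\tag{5.2.1}
$$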


where $W^A(i), W^C(i), W^P(i)\colon [0,\infty)\times\Omega \to \mathbb{R}$, $i \in \mathcal{D}$, are independent Brownian motions with continuous sample paths. For simplicity we assume existence of $h_0, p_0 \in (0,\infty)$ such that, for all $i \in \mathcal{D}$, $N \in \mathbb{N}$, we have $H_0^N(i) = h_0$, $P_0^N(i) = p_0$. Existence of solutions to (5.2.1), which we assume here, can be established in suitable Liggett–Spitzer spaces if $\mathcal{D}$ is an Abelian group and if $m$ is translation invariant and irreducible; cf. [30, Proposition 2.1]. Next we consider the diffusion approximation of the frequency

$$F_t^N(i) := \frac{A_t^N(i)}{A_t^N(i) + C_t^N(i)}$$

of altruists in the prey population in deme $i$ at time $t \in [0,\infty)$. For $\rho = \alpha = \kappa = \beta_H = \beta_P = 0$ and $K\eta > \nu$, we obtain a classical Lotka–Volterra model in which the frequencies $(H_i(t), P_i(t))$ converge to

$$\Big(\frac{K(\delta\nu + \lambda\gamma)}{\lambda\gamma + \delta K\eta},\ \frac{\lambda(K\eta - \nu)}{\lambda\gamma + \delta K\eta}\Big)$$

as $t \to \infty$ for each $i \in \mathcal{D}$; see [36, 54]. In the presence of defenders, however, this equilibrium changes and we need to replace $\eta$ by $\eta - \rho x$ if the current frequency of altruists is $x \in [0,1]$. Define

$$a := \frac{\lambda\gamma + \delta K\eta}{\delta K\rho}$$

(note that $a > 1$) and

$$b := \frac{\beta_H\,\delta\rho}{\delta\nu + \lambda\gamma},$$

which we interpret as the benefit of defence. While the factor $\delta/(\delta\nu + \lambda\gamma)$ can be interpreted as the severeness of predator abundance, and the role of the defence effect $\rho$ on the predator is obvious, the factor $\beta_H$ can be explained as it leads to the variation in defence trait frequency among the demes that is necessary for group selection to be effective. With $a$ and $b$ we can express the equilibria as

$$
\begin{aligned}
[0,1] \ni x \mapsto h_\infty(x) &:= \frac{K(\delta\nu + \lambda\gamma)}{\lambda\gamma + \delta K(\eta - \rho x)} = \frac{\beta_H}{b(a - x)} \in (0,\infty),\\
[0,1] \ni x \mapsto p_\infty(x) &:= \frac{\lambda\big(K(\eta - \rho x) - \nu\big)}{\lambda\gamma + \delta K(\eta - \rho x)} = \frac{\lambda}{\delta}\Big(1 - \frac{\beta_H}{K b (a - x)}\Big) \in (0,\infty).
\end{aligned}
\tag{5.2.2}
$$

For these functions to be well defined we will assume that $K(\eta - \rho) > \nu$. Theorem 1.3 in [27] then establishes the following diffusion approximation of the frequencies of altruists in the local population on the evolutionary time scale.
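The equilibrium formulas above can be checked numerically. The following is a minimal sketch, not taken from [27]: the class `Params`, the function names and all parameter values are hypothetical choices made only for illustration, and the symbol-to-name mapping (`lam` for the prey growth rate, `eta` for the per-prey predator growth rate, and so on) is an assumption. It evaluates both forms of the equilibria in (5.2.2) and relaxes the deterministic Lotka–Volterra system at a fixed altruist frequency with an explicit Euler scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    lam: float     # prey growth rate
    K: float       # prey carrying capacity
    delta: float   # per-predator death rate of the prey
    eta: float     # per-prey growth rate of the predator
    gamma: float   # predator competition rate
    nu: float      # predator death rate
    rho: float     # per-defender decrease of the predator growth rate
    beta_H: float  # prey branching rate


def a_const(p):
    """The constant a from the text."""
    return (p.lam * p.gamma + p.delta * p.K * p.eta) / (p.delta * p.K * p.rho)


def b_const(p):
    """The benefit of defence b from the text."""
    return p.beta_H * p.delta * p.rho / (p.delta * p.nu + p.lam * p.gamma)


def h_inf(p, x):
    """Equilibrium prey size when the altruist frequency is x."""
    eta_x = p.eta - p.rho * x
    return p.K * (p.delta * p.nu + p.lam * p.gamma) / (p.lam * p.gamma + p.delta * p.K * eta_x)


def p_inf(p, x):
    """Equilibrium predator size when the altruist frequency is x."""
    eta_x = p.eta - p.rho * x
    return p.lam * (p.K * eta_x - p.nu) / (p.lam * p.gamma + p.delta * p.K * eta_x)


def relax(p, x, h0=1.0, p0=1.0, T=100.0, dt=0.01):
    """Explicit Euler for the noise-free, migration-free system at fixed frequency x."""
    h, q = h0, p0
    eta_x = p.eta - p.rho * x
    for _ in range(int(round(T / dt))):
        dh = h * (p.lam * (1.0 - h / p.K) - p.delta * q)
        dq = q * (-p.nu - p.gamma * q + eta_x * h)
        h, q = h + dt * dh, q + dt * dq
    return h, q


# Hypothetical parameter values satisfying K * (eta - rho) > nu.
params = Params(lam=1.0, K=10.0, delta=0.5, eta=0.4, gamma=0.2, nu=1.0, rho=0.1, beta_H=1.0)
```

Calling `relax(params, 0.5)` and comparing with `(h_inf(params, 0.5), p_inf(params, 0.5))` illustrates the separation of time scales used in the text: for a frozen altruist frequency the ecological dynamics settle at the equilibrium that enters the diffusion approximation.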


Theorem 5.2.1 ([27, Theorem 1.3]). Assume the above setting, assume that ; ; ı > 0,   > ,   > K , > 2ı, let .W .i//i 2D be independent Brownian motions, and N assume that F0 )  as N ! 1. Then the SDE dX t .i / D 

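$$
\kappa \sum_{j\in\mathcal{D}} m(i,j)\,\frac{a - X_t(i)}{a - X_t(j)}\,\big(X_t(j) - X_t(i)\big)\,dt \;-\; \alpha X_t(i)\big(1 - X_t(i)\big)\,dt \;+\; \sqrt{b\,\big(a - X_t(i)\big)\,X_t(i)\big(1 - X_t(i)\big)}\;dW_t(i), \qquad t \in (0,\infty),\ i \in \mathcal{D},
\tag{5.2.3}
$$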

starting in $X_0$, the weak limit of $F_0^N$ as $N \to \infty$, has a unique strong solution and it holds that

$$\big(F^N_{tN}\big)_{t\in[0,\infty)} \Rightarrow (X_t)_{t\in[0,\infty)}$$

as $N \to \infty$ in $C([0,\infty), l_1)$, where $l_1$ is a suitable Liggett–Spitzer space. We note that the SDE (5.2.3) is Kimura’s stepping stone model with selection as considered e.g. in [16, Chapter 6] except that the local population size $N_i$ in deme $i \in \mathcal{D}$ is replaced by $h_\infty(X_t(i))$ at time $t \in [0,\infty)$.

5.3 Fixation/extinction of the average altruist frequency

An important problem is to derive conditions under which altruism persists. For the SDE (5.2.3) it is difficult to derive conditions on the parameters which ensure persistence of the solution process. For this reason, we consider the many-demes limit (also denoted as mean-field approximation) of the SDE (5.2.3). More precisely, we replace $\mathcal{D}$ by $\{1,\dots,D\}$ and $m$ by $(\tfrac{1}{D})_{i,j\in\{1,\dots,D\}}$, so that each deme has the same chance to be chosen by a migrant. Thus, for each $D \in \mathbb{N}$, we consider the solution $X^D$ of the SDE

$$
dX_t^D(i) = \frac{\kappa}{D}\sum_{j=1}^{D} \frac{a - X_t^D(i)}{a - X_t^D(j)}\,\big(X_t^D(j) - X_t^D(i)\big)\,dt - \alpha X_t^D(i)\big(1 - X_t^D(i)\big)\,dt + \sqrt{b\,\big(a - X_t^D(i)\big)\,X_t^D(i)\big(1 - X_t^D(i)\big)}\;dW_t(i),
\tag{5.3.1}
$$

where $t \in (0,\infty)$, $i \in \{1,\dots,D\}$. The following proposition shows that if the initial frequencies are exchangeable, then the average altruist frequency is well approximated by the McKean–Vlasov SDE (5.3.2) if the number $D \in \mathbb{N}$ of demes is large.

Proposition 5.3.1 ([27, Proposition 3.1 and Lemma 3.2]). Assume for every $D \in \mathbb{N}$ that $(X_0^D(i))_{i\in\{1,\dots,D\}}$ are exchangeable, let $M_0$ be a $[0,1]$-valued random variable, and assume that $\sup_{D\in\mathbb{N}} \sqrt{D}\,E\big[|X_0^D(1) - M_0|\big] < \infty$. Then there exists a unique


solution $M$ of the McKean–Vlasov SDE

$$
M_t = M_0 + \int_0^t \Big[\kappa\,(a - M_s)\Big((a - M_s)\,E\Big[\frac{1}{a - M_s}\Big] - 1\Big) - \alpha M_s(1 - M_s)\Big]\,ds + \int_0^t \sqrt{b\,(a - M_s)\,M_s(1 - M_s)}\;dW_s, \qquad t \in [0,\infty),
\tag{5.3.2}
$$

and it holds for all $T \in (0,\infty)$ that

$$\sup_{D\in\mathbb{N}} \sqrt{D}\; E\Big[\sup_{t\in[0,T]} \Big|\frac{1}{D}\sum_{i=1}^{D} X_t^D(i) - M_t\Big|\Big] < \infty,$$

where $X^D$ solves (5.3.1). It is now interesting to investigate when $M_t$ converges to 0 or 1 as $t \to \infty$.

Theorem 5.3.2 ([27, Theorem 1.4]). Let $M$ be the solution of (5.3.2) and assume that $E[M_0] \in (0,1)$. Then it holds that
• $\lim_{t\to\infty} E[|M_t - 0|] = 0$ if $\alpha > b$,
• $\lim_{t\to\infty} E[|M_t - 1|] = 0$ if $\alpha < b$,
• $\lim_{t\to\infty} E[M_t] \in (0,1)$ if $\alpha = b$.

Our rough interpretation of Proposition 5.3.1 and Theorem 5.3.2 is that in the limit of many demes and starting from a nontrivial density the defence trait almost surely will become fixed in the entire prey population if the benefit of defence $b$ is greater than the cost $\alpha$ of the defence trait and that it will go extinct if the benefit $b$ is smaller than the cost $\alpha$. In contrast to the mean-field case of Theorem 5.3.2, such a group selection effect is not present in a panmictic population. More precisely, in the case $D = 1$, the solution of (5.3.1) satisfies for all $t \in (0,\infty)$, $x \in (0,1)$ that $E[X_t^1(1) \mid X_0^1(1) = x] < x$ so that the expected frequency of the defence trait cannot converge to 1 for any combination of parameter values $\alpha > 0$, $b > 0$, and $a > 1$ in a panmictic population. For populations that are structured into demes it remains unclear whether $b > \alpha$ ($b < \alpha$) is also the criterion for fixation (extinction) for migration mechanisms more general than uniform. Empirically we confirmed this criterion in simulation studies in [31] for next-neighbour migration on one- and two-dimensional grids with 500 demes.
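To experiment with the fixation/extinction criterion, the uniform-migration system (5.3.1) can be simulated directly. The sketch below is an illustration under stated assumptions, not the simulation code of [31]: it uses a plain Euler–Maruyama discretisation, writes `kappa` for the migration rate, and clips each frequency to $[0,1]$ after every step, which is a numerical device and not part of the model. The function name and all arguments are hypothetical.

```python
import math
import random


def simulate_uniform_migration(D, a, b, alpha, kappa, x0, T, dt, seed=0):
    """Euler-Maruyama sketch for the D-deme system with uniform migration.

    Returns the list of altruist frequencies in the D demes at time T.
    """
    rng = random.Random(seed)
    x = [x0] * D
    sqrt_dt = math.sqrt(dt)
    for _ in range(int(round(T / dt))):
        # Deme averages needed for the migration drift:
        # (1/D) * sum_j X(j) / (a - X(j))  and  (1/D) * sum_j 1 / (a - X(j)).
        s = sum(xj / (a - xj) for xj in x) / D
        r = sum(1.0 / (a - xj) for xj in x) / D
        new_x = []
        for xi in x:
            migration = kappa * (a - xi) * (s - xi * r)
            selection = -alpha * xi * (1.0 - xi)
            variance = max(b * (a - xi) * xi * (1.0 - xi), 0.0)
            step = (migration + selection) * dt + math.sqrt(variance) * sqrt_dt * rng.gauss(0.0, 1.0)
            # Clipping keeps the discretised path inside [0, 1].
            new_x.append(min(1.0, max(0.0, xi + step)))
        x = new_x
    return x
```

Averaging the returned frequencies over the demes for parameter choices with `alpha < b` and with `alpha > b` gives a rough numerical impression of the dichotomy in Theorem 5.3.2; for `D = 1` the migration drift vanishes identically, which corresponds to the panmictic case discussed above.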

5.4 Convergence to a forest of trees of excursions

In this section we consider the time period just after an altruistic defence trait appeared in the population (e.g. through immigration or mutation). More precisely we ask


whether an altruistic defence allele can persist/survive in the population when starting only from a few immigrants/mutants. Since this question is difficult to answer for the general model (5.2.3), we again focus on the many-demes limit. Let $X^D$ be the solution of the SDE (5.3.1). We assume that the total sum of initial altruist frequencies is positive and bounded in $D \in \mathbb{N}$. More precisely, let $X_0(i)$, $i \in \mathbb{N}$, be $[0,1]$-valued random variables that satisfy $\sum_{i\in\mathbb{N}} X_0(i) \in (0,\infty)$ and assume for all $D \in \mathbb{N}$, $i \in \{1,\dots,D\}$ that $X_0^D(i) = X_0(i)$. Thus the initial altruist population is sparse (by which we mean $\sum_{i=1}^{\infty} \lim_{D\to\infty} X_0^D(i) < \infty$ in a suitable sense). Our goal is then to find parameter configurations such that $\lim_{t\to\infty}\lim_{D\to\infty} E\big[\sum_{i=1}^{D} X_t^D(i)\big] = \infty$. First we describe heuristically the many-demes limit of $X^D$ as $D \to \infty$. For this, we assume for simplicity that $X_0(i) = 0$ for all $i \in \mathbb{N} \cap [3,\infty)$. The total mass is almost surely bounded in $D$ for every time point. As a consequence, the summand

$$\frac{\kappa}{D}\sum_{j=1}^{D} \frac{a - X_t^D(i)}{a - X_t^D(j)}\,X_t^D(j)$$

on the right-hand side of (5.3.1) converges to zero as $D \to \infty$ for every $t \in [0,\infty)$, $i \in \mathbb{N}$ and the first deme $X^D(1)$ converges weakly to the solution $Y$ of the SDE

$$
dY_t = \Big[-\frac{\kappa}{a}(a - Y_t)Y_t - \alpha Y_t(1 - Y_t)\Big]\,dt + \sqrt{b\,(a - Y_t)\,Y_t(1 - Y_t)}\;dW_t
\tag{5.4.1}
$$

as $D \to \infty$. Mass emigrates from this first deme. Now the key observation is that no two migrants go to the same deme since this has probability $\frac{1}{D}$ which is negligible as $D \to \infty$. This immediately leads to a tree structure in the set of all populated demes. In addition, since individuals have infinitesimally small mass, frequency paths on initially empty demes start in zero and follow the dynamics of $Y$ once the path is positive. Thus the evolution in initially empty demes is described by the excursion measure associated with $Y$ (this measure is conceptually similar to the excursion measure of Brownian motion). For an illustration of such a tree of excursions see Figure 5.4.1. Analogously, the subpopulations that originate from descendants of migrants from deme 2 constitute a tree of independent subpopulations. In addition, these two trees are disjoint (and thus driven by independent families of Brownian motions) and therefore independent if $X_0(1)$ and $X_0(2)$ are independent random variables. In other words, independence of the family $\{X_0(i) : i \in \mathbb{N}\}$ propagates in the many-demes limit and results in a forest of independent trees of independent subpopulations. A formal statement of this “propagation of chaos” result in the sparse regime is given in Theorem 5.4.1 below. The notion “propagation of chaos” was originally coined by Mark Kac [32] and refers to a relation between microscopic and macroscopic models. Microscopic descriptions are based on molecules (or particles, individuals, subpopulations, etc.) and model their interactions and driving forces. Macroscopic descriptions are based on macroscopic quantities such as the density and model the dynamics of these quantities.
Microscopic and macroscopic descriptions can be connected by the limit of


Figure 5.4.1. While family sizes of defenders make their excursions into the positive range, they can produce migrants to other demes. In the limit of many demes these excursions become stochastically independent and the process becomes a tree of excursions.

the density in the $D$-molecule microscopic model in the limit $D \to \infty$ and this limit should be the density in the macroscopic model. Kac’s idea behind the terminology “propagation of chaos” is that if the initial distribution is “chaotic” (e.g. velocities and positions of molecules are independent and purely random), then the dynamics of the microscopic model destroys this independence, nevertheless finitely many fixed molecules should in the limit as $D \to \infty$ evolve independently (depending on all other molecules only through deterministic macroscopic observables such as the density). In this sense, independence of finitely many fixed molecules “propagates” as the system size increases to infinity. We point out the analogy between molecules in this paragraph and the demes in the setting of Theorem 5.4.1 below. Next we formally construct a forest of trees of excursions. The space of excursions from zero is given by

$$
U := \big\{\chi\colon \mathbb{R} \to [0,1] : \chi \text{ is a càdlàg path and there exists } t_0 \in (0,\infty) \text{ such that for all } t \in (0,t_0) \text{ we have } \chi_t > 0 \text{ and for all } t \in (-\infty,0) \cup [t_0,\infty) \text{ we have } \chi_t = 0\big\}.
$$

The set $U$ is equipped with the supremum norm and with the resulting Borel $\sigma$-algebra. Theorem 1 in [25] states that there exists a unique $\sigma$-finite measure $Q$ on $U$ satisfying the following property: For every bounded continuous function $F\colon U \to \mathbb{R}$ for which there exists a $\delta > 0$ such that for all $\chi \in U$ with $\sup_{t\in[0,\infty)} \chi_t < \delta$ it holds that $F(\chi) = 0$, one has

$$\lim_{\varepsilon\to 0} \frac{1}{\varepsilon}\, E\big[F(Y) \mid Y_0 = \varepsilon\big] = \int_U F(\chi)\,Q(d\chi).$$


The measure $Q$ is called the excursion measure associated with $Y$; see also [45]. Moreover, let $Y(i) = (Y_t(i))_{t\in[0,\infty)}$, $i \in \mathbb{N}$, be solutions of (5.4.1) driven by independent Brownian motions and such that $Y_0(i) = X_0(i)$ for all $i \in \mathbb{N}$. Let $\{\Pi^{(n,s,\chi)} : (n,s,\chi) \in \mathbb{N}_0 \times [0,\infty) \times U\}$ be an independent family of Poisson point processes on $[0,\infty) \times U$ with intensity measures

$$
E\big[\Pi^{(n,s,\chi)}(dt \otimes d\psi)\big] = \kappa\,\frac{a}{a - \chi_{t-s}}\,\chi_{t-s}\;dt \otimes Q(d\psi), \qquad (n,s,\chi) \in \mathbb{N}_0 \times [0,\infty) \times U.
$$

The elements of $\Pi^{(n,s,\chi)}$ describe the demes that descend from a deme with frequency trajectory $(\chi_{t-s})_{t\in[0,\infty)}$ (the trajectory is zero before the time $s$ of its colonisation) and where the ancestral lineages of individuals living on these demes have exactly $n \in \mathbb{N}_0$ migration events. Note that mass of size $\frac{\kappa}{D}\,\frac{a}{a - \chi_{t-s}}\,\chi_{t-s}\,dt$ migrates to each of $D - O(1)$ many essentially empty demes in a time interval $dt$ from a deme with trajectory $(\chi_{t-s})_{t\in[0,\infty)}$ (as $D \to \infty$). This induces a Poisson structure since there are many trials with small success probability. The 0-th generation is the random $\sigma$-finite measure on $[0,\infty) \times U$ defined through $\mathcal{T}^{(0)} := \sum_{i=1}^{\infty} \delta_{(0,Y(i))}$. For every $n \in \mathbb{N}_0$ the $(n+1)$-th generation is the random $\sigma$-finite measure representing all the demes that have been colonised from demes of the $n$-th generation, that is

$$\mathcal{T}^{(n+1)} := \int \Pi^{(n,s,\chi)}\;\mathcal{T}^{(n)}(ds \otimes d\chi).$$

The forest of trees of excursions $\mathcal{T}$ is then the sum of all of these random measures

$$\mathcal{T} := \sum_{n\in\mathbb{N}_0} \mathcal{T}^{(n)}.$$

Theorem 1.4 and Section 1.3 in [29] show that the solutions of the SDE (5.3.1) converge in the many-demes limit in the sparse regime to the forest $\mathcal{T}$ of trees of excursions in a suitable sense.

Theorem 5.4.1 (Convergence to a forest of trees of excursions [29, Theorem 1.4]). Let $(X^D)_{D\in\mathbb{N}}$ be solutions of the SDE (5.3.1) and let $\mathcal{T}$ be the forest of trees of excursions. Then

$$
\Big(\sum_{i=1}^{D} X_t^D(i)\,\delta_{X_t^D(i)}\Big)_{t\in[0,\infty)} \Rightarrow \Big(\int \chi_{t-s}\,\delta_{\chi_{t-s}}\;\mathcal{T}(ds \otimes d\chi)\Big)_{t\in[0,\infty)}
\tag{5.4.2}
$$

as $D \to \infty$ in the sense of convergence in distribution on $D([0,\infty), M_f([0,1]))$.

The weak convergence in (5.4.2) involving size-biasing is somewhat unusual and deserves some explanation. Clearly $\sum_{i=1}^{D} f(X_t^D(i))$ does not converge for constant test functions. Thus $\big(\sum_{i=1}^{D} \delta_{X_t^D(i)}\big)_{t\in[0,\infty)}$ does not converge in distribution on $D([0,\infty), M_f([0,1]))$. The problem is that there are infinitely many demes with very


small excursions. One way to avoid this problem is to consider only test functions that vanish on some neighbourhood of the zero function. The formulation (5.4.2) with size-biased Dirac measures, however, is stronger in the sense that it allows more test functions and e.g. implies for all $t \in [0,\infty)$ that $\sum_{i=1}^{D} X_t^D(i)$ converges in distribution to $\int \chi_{t-s}\,\mathcal{T}(ds \otimes d\chi)$ as $D \to \infty$. This latter integral is indeed finite almost surely since it has finite expectation according to [25, Lemma 5.2, Lemma 9.9, and Lemma 9.10]. We emphasise that the limiting object $\mathcal{T}$ is easier to analyse than the solution of the SDE (5.3.1) because of the tree structure and since general branching processes are very well understood. For example, [25, Theorem 2] (suitably adapted to the interval $[0,1]$) shows that $\int \chi_{t-s}\,\mathcal{T}(ds \otimes d\chi)$ converges in probability to zero as $t \to \infty$ if and only if

$$
\begin{aligned}
1 &\geq \int_0^1 \frac{\kappa\,\frac{a}{a-y}\,y}{0.5\,b\,(a-y)\,y\,(1-y)}\,\exp\Big(\int_0^y \frac{-\frac{\kappa}{a}(a-x)x - \alpha x(1-x)}{0.5\,b\,(a-x)\,x\,(1-x)}\,dx\Big)\,dy\\
&= \int_0^1 \frac{2\kappa}{ab}\Big(\frac{a-y}{a}\Big)^{\frac{2\alpha}{b}-2}(1-y)^{\frac{2\kappa}{ab}-1}\,dy,
\end{aligned}
$$

which holds if and only if $\alpha \geq b$. Thus the survival probability is positive if and only if $\alpha < b$. This shows that an altruistic defence allele may spread in a large population if the cost of defence $\alpha$ is smaller than the benefit of defence $b$. In that case we also get an idea how fast the defence allele spreads. More precisely, if $\alpha < b$, then there exists $\mu \in (0,\infty)$ such that

$$\int_0^\infty e^{-\mu s} \int \kappa\,\frac{a}{a - \chi_s}\,\chi_s\;Q(d\chi)\,ds = 1.$$

Then [25, Theorem 4] shows that there exists a random variable $V$ such that

$$e^{-\mu t}\int \chi_{t-s}\;\mathcal{T}(ds \otimes d\chi) \Rightarrow V$$

as $t \to \infty$ and $P(V > 0)$ is equal to the survival probability. An interesting problem would be to establish this exponential growth rate $\mu$ for the $D$-islands model as $D \to \infty$, that is to prove for suitable $(t_D)_{D\in\mathbb{N}} \subseteq (0,\infty)$ with $\lim_{D\to\infty} t_D = \infty$ that $e^{-\mu t_D}\sum_{i=1}^{D} X_{t_D}^D(i)$ converges as $D \to \infty$ in a suitable sense. We note that if $(X_0(i))_{i\in\mathbb{N}}$ in Theorem 5.4.1 are independent random variables then all trees are independent. In other words, independence of the family $(X_0(i))_{i\in\mathbb{N}}$ propagates in the many-demes limit and results in a forest of independent trees. In contrast to the exchangeable regime, “propagation of chaos” does not mean that fixed demes become independent in the limit as $D \to \infty$ (which is rather trivial) but that the full progenies of individuals starting on different demes do not interfere and evolve independently of each other. We also note that [29, Theorem 1.4] establishes


propagation of chaos in the sparse regime for the more general SDE

$$
dX_t^D(i) = \frac{1}{D}\sum_{j=1}^{D} X_t^D(j)\,f\big(X_t^D(j), X_t^D(i)\big)\,dt + h_D\big(X_t^D(i)\big)\,dt + \sqrt{\sigma^2\big(X_t^D(i)\big)}\;dW_t(i), \qquad t \in (0,\infty),\ i \in \{1,\dots,D\},
$$

where $f\colon [0,1]^2 \to \mathbb{R}$, $\sigma^2\colon [0,1] \to [0,\infty)$, and $h_D\colon [0,1] \to \mathbb{R}$, $D \in \mathbb{N}$, are functions with suitable properties. In the literature, this type of “propagation of chaos” in the sparse regime has already been established in two special cases. Theorem 3.3 in [26] proves the analogue of Theorem 5.4.1 in the special case where the infinitesimal variance $\sigma^2$ is additive (and where $I = [0,\infty)$ and for all $x, y \in [0,\infty)$ it holds that $f(y,x) = 1$) and this additivity of infinitesimal variances is a strong tool for decomposing the total population into “loop-free” processes. In particular, additive $\sigma^2$ includes the case of Feller’s branching diffusions where a decomposition into subfamilies is natural. Moreover, [9, Proposition 2.9] proves an analogue of Theorem 5.4.1 in the special case where for all $x, y \in [0,1]$ and all $D \in \mathbb{N}$ it holds that

$$\sigma^2(x) = d\,x(1-x), \qquad f(y,x) = c, \qquad h_D(x) = -cx + s\,x(1-x) + \frac{m}{D}(1-x),$$

where $c, d, m, s \in (0,\infty)$ are positive constants and where the forest of trees of excursions is replaced by a dynamic description hereof which is a continuous atomic-valued Markov process and where independence of disjoint trees is not obvious. In this special case of Wright–Fisher diffusions with selection and rare mutation, there exists a duality with a particle jump process and this duality is a very strong tool. At this point we illuminate the central step in the proof of Theorem 5.4.1. Roughly speaking, this step consists of sorting the population by the number of migration steps that appear in the ancestral lineages of the individuals. For this, let $\{W^k(i) : (i,k) \in \mathbb{N} \times \mathbb{N}_0\}$ be a set of independent standard Brownian motions and for every $D \in \mathbb{N}$, $k \in \mathbb{N}_0$ consider the SDE

$$
\begin{aligned}
dX_t^{D,k}(i) ={}& \frac{1}{D}\sum_{j=1}^{D} X_t^{D,k-1}(j)\,f\Big(\sum_{m\in\mathbb{N}_0} X_t^{D,m}(j),\ \sum_{m\in\mathbb{N}_0} X_t^{D,m}(i)\Big)\,dt\\
&+ \frac{X_t^{D,k}(i)}{\sum_{m\in\mathbb{N}_0} X_t^{D,m}(i)}\,h_D\Big(\sum_{m\in\mathbb{N}_0} X_t^{D,m}(i)\Big)\,dt\\
&+ \sqrt{\frac{X_t^{D,k}(i)}{\sum_{m\in\mathbb{N}_0} X_t^{D,m}(i)}\,\sigma^2\Big(\sum_{m\in\mathbb{N}_0} X_t^{D,m}(i)\Big)}\;dW_t^k(i), \qquad t \in (0,\infty),\ i \in \{1,\dots,D\},
\end{aligned}
\tag{5.4.3}
$$


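and the SDE

$$
dZ_t^{D,k}(i) = \frac{1}{D}\sum_{j=1}^{D} Z_t^{D,k-1}(j)\,f\big(Z_t^{D,k-1}(j), Z_t^{D,k}(i)\big)\,dt + h_D\big(Z_t^{D,k}(i)\big)\,dt + \sqrt{\sigma^2\big(Z_t^{D,k}(i)\big)}\;dW_t^k(i), \qquad t \in (0,\infty),\ i \in \{1,\dots,D\},
\tag{5.4.4}
$$

where $X^{D,-1} = Z^{D,-1} = 0$. We refer to the solutions of the SDE (5.4.4) as loop-free processes. These processes have the nice property that the SDE for $Z^{D,k}$ only involves the process $Z^{D,k-1}$ and this allows a recursive analysis. Moreover, it is not difficult to see that $X^D$ and $\sum_{k\in\mathbb{N}_0} X^{D,k}$ have the same distribution. The key step is now to prove that the solutions of the SDEs (5.4.3) and (5.4.4) are close to each other in distribution. To see this, fix $K, D \in \mathbb{N}$, $t \in [0,\infty)$, $\psi \in C^2\big([0,1]^{\{1,\dots,D\}\times\{0,\dots,K\}}, \mathbb{R}\big)$ and let the function $u$ satisfy for all $s \in [0,t]$, $x \in [0,1]^{\{1,\dots,D\}\times\{0,\dots,K\}}$ that

$$u(s,x) = E\Big[\psi\Big(\big(Z_{t-s}^{D,k}(i)\big)_{(i,k)\in\{1,\dots,D\}\times\{0,\dots,K\}}\Big) \,\Big|\, Z_0^{D,\cdot} = x\Big].$$

Then Itô’s formula, the fact that $u$ solves the Kolmogorov backward equation and taking expectations yields for all $t \in [0,\infty)$ that

$$
\begin{aligned}
&E\Big[\psi\Big(\big(X_t^{D,k}(i)\big)_{(i,k)\in\{1,\dots,D\}\times\{0,\dots,K\}}\Big)\Big] - E\Big[\psi\Big(\big(Z_t^{D,k}(i)\big)_{(i,k)\in\{1,\dots,D\}\times\{0,\dots,K\}}\Big)\Big]\\
&= E\Big[u\Big(t, \big(X_t^{D,k}(i)\big)_{(i,k)\in\{1,\dots,D\}\times\{0,\dots,K\}}\Big)\Big] - E\Big[u\Big(0, \big(X_0^{D,k}(i)\big)_{(i,k)\in\{1,\dots,D\}\times\{0,\dots,K\}}\Big)\Big]\\
&= \int_0^t E\Bigg[\sum_{i=1}^{D}\sum_{k=0}^{K} \frac{\partial u}{\partial x_{i,k}}\Big(s, \big(X_s^{D,m}(j)\big)_{(j,m)\in\{1,\dots,D\}\times\{0,\dots,K\}}\Big)\\
&\qquad\cdot \Bigg\{\sum_{j=1}^{D} \frac{X_s^{D,k-1}(j)}{D}\Big(f\Big(\sum_{m\in\mathbb{N}_0} X_s^{D,m}(j), \sum_{m\in\mathbb{N}_0} X_s^{D,m}(i)\Big) - f\big(X_s^{D,k-1}(j), X_s^{D,k}(i)\big)\Big)\\
&\qquad\qquad + \frac{X_s^{D,k}(i)}{\sum_{m\in\mathbb{N}_0} X_s^{D,m}(i)}\,h_D\Big(\sum_{m\in\mathbb{N}_0} X_s^{D,m}(i)\Big) - h_D\big(X_s^{D,k}(i)\big)\Bigg\}\\
&\quad + \frac{1}{2}\sum_{i=1}^{D}\sum_{k=0}^{K} \frac{\partial^2 u}{\partial x_{i,k}^2}\Big(s, \big(X_s^{D,m}(j)\big)_{(j,m)\in\{1,\dots,D\}\times\{0,\dots,K\}}\Big)\\
&\qquad\cdot \Bigg\{\frac{X_s^{D,k}(i)}{\sum_{m\in\mathbb{N}_0} X_s^{D,m}(i)}\,\sigma^2\Big(\sum_{m\in\mathbb{N}_0} X_s^{D,m}(i)\Big) - \sigma^2\big(X_s^{D,k}(i)\big)\Bigg\}\Bigg]\,ds.
\end{aligned}
\tag{5.4.5}
$$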

Theorem 5.5.1 below shows that $u$ is indeed differentiable and that the partial derivatives up to order two are bounded uniformly in $D \in \mathbb{N}$. Moreover, we use that every


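Lipschitz function $\varphi\colon [0,\infty) \to \mathbb{R}$ with $\varphi(0) = 0$ satisfies for all $x, y \in [0,\infty)$ that

$$
\Big|\frac{x}{x+y}\,\varphi(x+y) - \varphi(x)\Big| \leq 2 \sup_{\substack{a,b\in[0,\infty)\\ a\neq b}} \frac{|\varphi(a) - \varphi(b)|}{|a - b|}\,\min\{x,y\}.
\tag{5.4.6}
$$

In addition, one can show with moment estimates that

$$
\lim_{D\to\infty}\ \sup_{s\in[0,t]} E\Bigg[\sum_{i=1}^{D}\sum_{k\in\mathbb{N}_0} \min\Big\{X_s^{D,k}(i),\ \sum_{m\in\mathbb{N}_0\setminus\{k\}} X_s^{D,m}(i)\Big\}\Bigg] = 0.
\tag{5.4.7}
$$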

Roughly speaking this means that masses with different migration levels cannot be in the same deme in the many-demes limit. Combining Theorem 5.5.1, (5.4.6) and (5.4.7), one can show that the differences on the right-hand side of (5.4.5) vanish in the limit as $D \to \infty$. This essentially shows that the solutions of the SDEs (5.4.3) and (5.4.4) are close to each other in distribution.
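The elementary inequality (5.4.6) follows from the triangle inequality: writing $L$ for the Lipschitz constant of $\varphi$, the left-hand side is at most $\frac{x}{x+y}|\varphi(x+y) - \varphi(x)| + \frac{y}{x+y}|\varphi(x)| \leq 2L\frac{xy}{x+y} \leq 2L\min\{x,y\}$. The following sketch checks the inequality numerically on a grid; the helper name `lipschitz_gap`, the three test functions and the grid are hypothetical choices made only for illustration.

```python
import math


def lipschitz_gap(phi, lip, x, y):
    """Right-hand side minus left-hand side of the inequality for x, y >= 0 with x + y > 0."""
    lhs = abs(x / (x + y) * phi(x + y) - phi(x))
    rhs = 2.0 * lip * min(x, y)
    return rhs - lhs


# Pairs (phi, Lipschitz constant of phi on [0, infinity)); each phi satisfies phi(0) = 0.
test_functions = [
    (math.sin, 1.0),
    (lambda z: 1.0 - math.exp(-2.0 * z), 2.0),
    (lambda z: z / (1.0 + z), 1.0),
]

grid = [0.05 * k for k in range(101)]  # points in [0, 5]

# Smallest gap over all test functions and grid points; it should not be negative
# beyond floating-point round-off.
worst_gap = min(
    lipschitz_gap(phi, lip, x, y)
    for phi, lip in test_functions
    for x in grid
    for y in grid
    if x + y > 0
)
```

The gap vanishes when $x = 0$ or $y = 0$, which matches the role of the inequality in the proof: the error terms in (5.4.5) are controlled by $\min\{x,y\}$ and therefore by the quantity in (5.4.7).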

5.5 Differentiability of semigroups of stochastic differential equations with Hölder-continuous diffusion coefficients

As described above, dimension-independent upper bounds for partial derivatives of the involved semigroups are an important tool in the proof of Theorem 5.4.1, and they are the subject of the following Theorem 5.5.1. Throughout this section we use the following notation: let $\|\cdot\|$ be the Euclidean norm and let

$$\|\cdot\|_\infty\colon \bigcup_{d,m\in\mathbb{N}} C([0,1]^d, \mathbb{R}^m) \to [0,\infty)$$

be the function that satisfies for all $d, m \in \mathbb{N}$, $\psi \in C([0,1]^d, \mathbb{R}^m)$ that

$$\|\psi\|_\infty = \sup_{x\in[0,1]^d} \|\psi(x)\|.$$

Theorem 5.5.1 (Differentiability of semigroups [28, Theorem 4.1]). Let $d \in \mathbb{N}$, let $(\Omega, \mathcal{F}, P, (\mathcal{F}_t)_{t\in[0,\infty)})$ be a stochastic basis, let $W\colon [0,\infty) \times \Omega \to \mathbb{R}^d$ be a standard Brownian motion, let $g_1, \dots, g_d \in C^3([0,1], \mathbb{R})$ satisfy for all $i \in \{1,\dots,d\}$, $x \in (0,1)$ that $g_i(0) = 0 = g_i(1)$ and $g_i(x) > 0$, let $f_1, \dots, f_d \in C^3([0,1]^d, \mathbb{R})$ satisfy for all $i \in \{1,\dots,d\}$, $x = (x_1,\dots,x_d) \in [0,1]^d$ with $x_i \in \{0,1\}$ that $(-1)^{x_i} f_i(x) \geq 0$, and for every $m \in \{1,2\}$ we define $a_m = 0$
0; x1 ; : : : ; xn 2 X :

i D1

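The population process, $(\nu_t)_{t\ge 0}$, is then defined as an $M(X)$-valued Markov process with generator $L$, defined, for any bounded measurable function $f$ from $M(X)$ to $\mathbb{R}$ and for all $\nu \in M(X)$, by

$$
\begin{aligned}
(Lf)(\nu) ={}& \int_X \big(f(\nu + \delta_x) - f(\nu)\big)\big(1 - m(x)\big)\, b(x)\,\nu(dx) \\
&+ \int_X\int_X \big(f(\nu + \delta_{x+y}) - f(\nu)\big)\, m(x)\, b(x)\, M(x,dy)\,\nu(dx) \\
&+ \int_X \big(f(\nu - \delta_x) - f(\nu)\big)\Big(d(x) + \int_X c(x,y)\,\nu(dy)\Big)\nu(dx).
\end{aligned}
$$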

The first and second terms are linear (in $\nu$) and describe the births (without and with mutation), but the third term is non-linear and describes the deaths due to age or competition. The density-dependent non-linearity of the third term models the competition in the population, and hence drives the selection process.

Assumption 7.2.1. We make the following assumptions on the parameters of the model:
(i) $b$, $d$ and $c$ are measurable functions, and there exist $\bar b, \bar d, \bar c < \infty$ such that $0 \le b(\cdot) \le \bar b$, $0 \le d(\cdot) \le \bar d$ and $0 \le c(\cdot,\cdot) \le \bar c$.
(ii) There exists $\underline{c} > 0$ such that for all $x \in X$, $\underline{c} \le c(x,x)$.
(iii) The support of $M(x,\cdot)$ is uniformly bounded for all $x \in X$.

Stochastic models for adaptive dynamics


Remark 7.2.2. Assumption (i) allows one to deduce the existence and uniqueness in law of a process on $D(\mathbb{R}_+, M(X))$ with infinitesimal generator $L$ (cf. [13]). Assumption (ii) ensures that the population size stays bounded locally. Assumption (iii) is made in view of the convergence to the canonical equation, see below, and can be relaxed.
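To make the three terms of the generator concrete, the following sketch computes the total jump rates of the population process on a finite trait space. The function name `event_rates` and the representation of the population as a list of integer counts are illustrative assumptions; the sketch also treats only the rates, not the mutation kernel $M$ that decides where a mutant lands.

```python
def event_rates(counts, b, d, c, m):
    """Total rates of the three event types, per trait, for a finite trait space.

    counts[i] : number of individuals currently carrying trait i
    b[i], d[i]: birth and natural death rate of trait i
    c[i][j]   : competition pressure exerted by one j-individual on an i-individual
    m[i]      : probability that a birth from trait i is a mutation
    Returns three lists (clonal births, mutant births, deaths).
    """
    n = len(counts)
    clonal, mutant, death = [], [], []
    for i in range(n):
        # Density-dependent part of the death rate: sum_j c(i, j) * counts[j].
        competition = sum(c[i][j] * counts[j] for j in range(n))
        clonal.append((1.0 - m[i]) * b[i] * counts[i])
        mutant.append(m[i] * b[i] * counts[i])
        death.append((d[i] + competition) * counts[i])
    return clonal, mutant, death


# Hypothetical two-trait example with only trait 0 present.
clonal, mutant, death = event_rates(
    counts=[10, 0],
    b=[2.0, 3.0],
    d=[1.0, 1.0],
    c=[[0.1, 0.1], [0.1, 0.1]],
    m=[0.5, 0.5],
)
```

The birth rates are linear in the counts, while the death rate is quadratic through the competition term, which mirrors the linear and non-linear terms of $L$.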

7.3 Scaling limits

There are three natural parameters that can be introduced into the model that give rise to interesting and biologically relevant scaling limits. These are:
(i) The population size, or carrying capacity, $K$. This is achieved by dividing the competition kernel $c$ by $K$, so that it requires of order $K$ individuals to affect the death rate of one individual in a significant way. Obtaining a limit then also requires dividing the measures $\nu$ by $K$.
(ii) The mutation rate, $u$. Multiplying the mutation rate $m(x)$ by $u$ allows one to study limits of small mutation rates.
(iii) The effect of a single mutation step. The mutation step size can be scaled to zero by introducing a parameter $\sigma$ and replacing $\delta_{x+y}$ in the mutation term of the generator by $\delta_{x+\sigma y}$.

The generator with these scaling parameters acting on the space of rescaled measures then reads

$$
\begin{aligned}
(L^K f)(\nu^K) ={}& \int_X \Big(f\big(\nu^K + \tfrac{1}{K}\delta_x\big) - f(\nu^K)\Big)\big(1 - u\,m(x)\big)\, b(x)\, K\nu^K(dx) \\
&+ \int_X\int_X \Big(f\big(\nu^K + \tfrac{1}{K}\delta_{x+\sigma y}\big) - f(\nu^K)\Big)\, u\,m(x)\, b(x)\, M(x,dy)\, K\nu^K(dx) \\
&+ \int_X \Big(f\big(\nu^K - \tfrac{1}{K}\delta_x\big) - f(\nu^K)\Big)\Big(d(x) + \int_X c(x,y)\,\nu^K(dy)\Big) K\nu^K(dx).
\end{aligned}
$$

In general one is interested in taking limits as $K \uparrow \infty$, $u \downarrow 0$, and $\sigma \downarrow 0$. At the same time, we may want to scale time in such a way as to obtain interesting effects. Having large $K$, small $u$, and small $\sigma$ is biologically reasonable in many (but not all) situations.

7.3.1 The law of large numbers

A fundamental result, which also provides a frequently used tool, is a law of large numbers (LLN), which asserts convergence of the process to a deterministic limit over finite time intervals when $K$ tends to infinity. This LLN in fact goes back to Ethier and Kurtz [11] in the case of a finite trait space and was generalised by Fournier and Méléard [13]. See also [3].


Anton Bovier

Theorem 7.3.1. Fix $u$ and $\sigma$. Let Assumption 7.2.1 hold and assume in addition that the initial conditions $\nu_0^K$ converge, as $K \uparrow \infty$, in law and for the weak topology on $M(X)$, to some deterministic finite measure $\xi_0 \in M(X)$, and $\sup_K E[\langle \nu_0^K, 1\rangle^3] < \infty$. Then, for all $T > 0$, the sequence $\nu^K$ converges, as $K \uparrow \infty$, in law, on the Skorokhod space $D([0,T], M(X))$, to a deterministic continuous function $\xi \in C([0,T], M(X))$. This measure-valued function $\xi$ is the unique solution, satisfying $\sup_{t\in[0,T]} \langle \xi_t, 1\rangle < \infty$, of the integro-differential equation reading in its weak form: for all bounded and measurable functions $h\colon X \to \mathbb{R}$,

$$
\begin{aligned}
\int_X h(x)\,\xi_t(dx) - \int_X h(x)\,\xi_0(dx) ={}& \int_0^t ds \int_X u\,m(x)\,b(x) \int_X M(x,dy)\, h(x+\sigma y)\,\xi_s(dx) \\
&+ \int_0^t ds \int_X h(x)\Big(\big(1 - u\,m(x)\big) b(x) - d(x) - \int_X \xi_s(dy)\, c(x,y)\Big)\xi_s(dx).
\end{aligned}
$$

Remark 7.3.2. In all the results mentioned in these notes, the LLN is in fact only used for finite trait spaces.

7.3.2 Scaling $u \downarrow 0$ in the deterministic limit

In the absence of mutations ($u = 0$) and if initially there exists a finite number of phenotypes in the population, one obtains convergence to the competitive system of Lotka–Volterra equations defined below (see [13]).

Corollary 7.3.3 (The special case $u = 0$ and $\xi_0$ is $n$-morphic). If the same assumptions as in the theorem above hold with $u = 0$ and if in addition $\xi_0 = \sum_{i=1}^n z_i(0)\delta_{x_i}$, then $\xi_t$ is given by $\xi_t = \sum_{i=1}^n z_i(t)\delta_{x_i}$, where $z_i$ is the solution of the competitive system of Lotka–Volterra equations defined below.

Definition 7.3.4. For any $(x_1,\dots,x_n) \in X^n$, we denote by $LV(n, (x_1,\dots,x_n))$ the competitive system of Lotka–Volterra equations defined by

$$\frac{dz_i(t)}{dt} = z_i(t)\Big(b(x_i) - d(x_i) - \sum_{j=1}^{n} c(x_i,x_j)\, z_j(t)\Big), \qquad 1 \le i \le n.$$
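The competitive Lotka–Volterra system can be explored numerically. The sketch below uses a plain explicit Euler scheme; the function names `lv_rhs` and `integrate_lv`, the step size and the parameter values are illustrative assumptions, and a production computation would rather use an adaptive ODE solver.

```python
def lv_rhs(z, b, d, c):
    """Right-hand side of LV(n, (x_1, ..., x_n)) for densities z.

    b[i], d[i] stand for b(x_i), d(x_i) and c[i][j] for c(x_i, x_j).
    """
    n = len(z)
    return [
        z[i] * (b[i] - d[i] - sum(c[i][j] * z[j] for j in range(n)))
        for i in range(n)
    ]


def integrate_lv(z0, b, d, c, t_end, dt=1e-3):
    """Explicit Euler integration of the LV system from z0 up to time t_end."""
    z = list(z0)
    for _ in range(int(round(t_end / dt))):
        dz = lv_rhs(z, b, d, c)
        z = [zi + dt * dzi for zi, dzi in zip(z, dz)]
    return z


# Hypothetical monomorphic example: the density approaches (b - d) / c = 1.
mono = integrate_lv([0.1], b=[3.0], d=[1.0], c=[[2.0]], t_end=20.0)

# Hypothetical dimorphic example in which the second trait has the larger
# net growth rate and equal competition, so it replaces the first one.
dim = integrate_lv(
    [1.0, 0.01], b=[3.0, 4.0], d=[1.0, 1.0],
    c=[[2.0, 2.0], [2.0, 2.0]], t_end=60.0,
)
```

In the second example the limit is the monomorphic equilibrium of the second trait, $(4-1)/2 = 1.5$, while the first trait is driven to zero.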

It is also easy to see (using Gronwall's lemma) that on finite time intervals, solutions converge, as $u \downarrow 0$, to those of the system with $u = 0$. The same is not true if time tends to infinity as $u \downarrow 0$. We introduce the notions of coexisting traits and of invasion fitness (see [7]).

Definition 7.3.5. The distinct traits $x$ and $y$ coexist if the system $LV(2, (x,y))$ admits a unique non-trivial equilibrium, $\bar z(x,y) \in (0,\infty)^2$, which is locally strictly stable in the sense that the eigenvalues of the Jacobian matrix of the system $LV(2, (x,y))$ at $\bar z(x,y)$ are all strictly negative.

The invasion of a single mutant trait in a monomorphic population close to its equilibrium is governed by its initial growth rate. Therefore, it is convenient to define the fitness of a mutant trait by its initial growth rate.

Definition 7.3.6. If the resident population has the trait $x \in X$, then we call the following function the invasion fitness of the mutant trait $y$:

$$f(y,x) = b(y) - d(y) - c(y,x)\,\bar z(x).$$

Remark 7.3.7. The unique strictly stable equilibrium of $LV(1, x)$ is

$$\bar z(x) = \frac{b(x) - d(x)}{c(x,x)},$$

and hence $f(x,x) = 0$ for all $x \in X$. Determining polymorphic fixed points with a subset of $\ell \le n$ components (take w.l.o.g. the first $\ell$ components) leads to the linear equations

$$b(x_i) - d(x_i) = \sum_{j=1}^{\ell} c(x_i,x_j)\,\bar z_j, \qquad 1 \le i \le \ell,$$

for the equilibrium values $\bar z_j$, which in addition must all be strictly positive. The existence of such equilibria clearly requires conditions on the parameters that are more difficult to verify.

7.3.3 Small mutation limit at divergent time scales

We have seen that for finite time horizons, the limit of the deterministic equations as $u \downarrow 0$ is a mutation-free ecological equation. The reason for this is that growth of solutions is at most exponential in time, and so anything seeded by a mutation term is proportional to $u$ and will vanish in the limit. This is no longer true if we consider time scales that depend on $u$. To understand this, consider an initial condition that is monomorphic and the simplest case where $X$ is just the set $\{1,2\}$. Then the deterministic system can be reduced to the two-dimensional Lotka–Volterra system with mutation (to lighten the notation we set $m(1) = m(2) = 1$):

$$
\begin{aligned}
\frac{dz_1(t)}{dt} &= z_1(t)\big(r_1 - c(1,1)z_1(t) - c(1,2)z_2(t)\big) - uM(1,2)z_1(t) + uM(2,1)z_2(t), \\
\frac{dz_2(t)}{dt} &= z_2(t)\big(r_2 - c(2,2)z_2(t) - c(2,1)z_1(t)\big) - uM(2,1)z_2(t) + uM(1,2)z_1(t).
\end{aligned}
$$

Assume that $z_1(0) = \bar z_1 \equiv r_1/c(1,1)$ and $z_2(0) = 0$. Assume further that the invasion fitness of type 2 is positive, i.e. $r_2 - c(2,1)\bar z_1 > 0$. Then, at time 0, we have

$$\frac{dz_1(0)}{dt} = \bar z_1\big(r_1 - c(1,1)\bar z_1\big) - uM(1,2)\bar z_1 = -uM(1,2)\bar z_1, \qquad \frac{dz_2(0)}{dt} = +uM(1,2)\bar z_1.$$

For $u$ small, this implies that at time $t = 1$,

$$z_1(1) \approx \bar z_1\big(1 - uM(1,2)\big), \qquad z_2(1) \approx uM(1,2)\bar z_1.$$

Hence, as long as $z_2(t)$ is small compared to $\bar z_1$,

$$\frac{dz_2(t)}{dt} \ge z_2(t)\big(r_2 - c(2,1)\bar z_1\big),$$

and hence exponential growth at rate $(r_2 - c(2,1)\bar z_1) \equiv R > 0$ will set in, i.e. for $t > 1$ and as long as $z_2(t)$ remains small compared to one,

$$z_2(t) \approx uM(1,2)\bar z_1\, e^{(t-1)R},$$

and so by time $t \approx -\frac{1}{R}\ln\big(uM(1,2)\bar z_1\big)$, $z_2$ will have reached a level $O(1)$ that is independent of $u$. Then, for vanishing $u$, the system will evolve over times of order one like the mutation-free Lotka–Volterra system and approach its unique fixed point $(0, \bar z_2)$. Thus, defining

$$Z^u(t) \equiv \big(z_1^u(t|\ln u|),\, z_2^u(t|\ln u|)\big)$$

(where we added superscripts $u$ to make the dependence on $u$ evident), we see that, in the sense of weak convergence of distribution functions,

$$\lim_{u\downarrow 0} Z^u(t) = (\bar z_1, 0)\,\mathbb{1}_{0\le t<1/R} + (0, \bar z_2)\,\mathbb{1}_{t>1/R}.$$
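The logarithmic time scale of the switch can be observed numerically. The sketch below integrates the two-type system with mutation by an explicit Euler scheme and records the first time at which type 2 overtakes type 1. The function name `switch_time`, the parameter values ($r_1 = 1$, $r_2 = 2$, all competition coefficients and mutation weights equal to 1, so that $R = 1$) and the step size are illustrative assumptions.

```python
import math


def switch_time(u, r1=1.0, r2=2.0, c=((1.0, 1.0), (1.0, 1.0)),
                m12=1.0, m21=1.0, dt=1e-3, t_max=200.0):
    """First time at which z2 exceeds z1 in the two-type system with mutation.

    The system starts at the monomorphic equilibrium z1 = r1 / c(1,1), z2 = 0.
    Returns math.inf if no switch occurs before t_max.
    """
    z1, z2, t = r1 / c[0][0], 0.0, 0.0
    while t < t_max:
        dz1 = z1 * (r1 - c[0][0] * z1 - c[0][1] * z2) - u * m12 * z1 + u * m21 * z2
        dz2 = z2 * (r2 - c[1][1] * z2 - c[1][0] * z1) - u * m21 * z2 + u * m12 * z1
        z1 += dt * dz1
        z2 += dt * dz2
        t += dt
        if z2 > z1:
            return t
    return math.inf


# With R = r2 - c(2,1) * r1 / c(1,1) = 1, the rescaled switch time
# T / ln(1/u) should be close to 1/R = 1 for small u.
u = 1e-6
rescaled = switch_time(u) / math.log(1.0 / u)
```

Decreasing `u` lengthens the waiting time before the switch roughly in proportion to $\ln(1/u)$, whereas the duration of the switch itself stays of order one.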

So, interestingly, on the time scale $\ln(1/u)$, the solution of the deterministic Lotka–Volterra system with mutations converges to a deterministic jump process. To my knowledge this scaling was first considered in [6] and fully developed in [20]. A particular situation relating to escape from an evolutionary stable state was analysed in [5]. What we observed in this simple example is generic and gives rise to the first example of a polymorphic evolution sequence (PES), by which we mean a jump process between equilibria of a sequence of competitive Lotka–Volterra systems. This can be described informally as follows. Assume (for simplicity) that $X$ is a countable set. Let $I_0 \subset X$ be a finite set of cardinality $n$ such that $LV(n, I_0)$ has an equilibrium $\bar z$ such that $\bar z_i > 0$, for all $x_i \in I_0$.


Step 1: At time 1, all the (mutant) populations at all points $x \notin I_0$ are either of size zero or of order $u^{\alpha_x}$ with $\alpha_x \in \mathbb{N}$. The populations at $x \in I_0$ remain close to their equilibrium values. This remains true as long as none of the mutant populations has reached a level $\varepsilon > 0$ independent of $u$.

Step 2: The populations of the types $x \notin I_0$ grow exponentially with rate given by their invasion fitness with respect to the resident equilibrium until a time $T_{\varepsilon,1}$, which is the first time that one of the non-resident populations reaches the value $\varepsilon$. Population growth also takes into account mutations. The system is, however, well approximated by a linear system. $T_{\varepsilon,1}$ is of order $\ln(1/u)$.

Step 3: At time $T_{\varepsilon,1}$, assume that the set $J$ of types $y$ for which $\lim_{u\downarrow 0} z_y(T_{\varepsilon,1}) \neq 0$ is finite (typically, this will be $I_0$ plus one new type). Then, in time of order one, the system will approach the equilibrium of $LV(|J|, J)$. Let $I_1 \subset J$ be the subset on which this equilibrium is strictly positive. All types outside $I_1$ have population size of some order $u^{\alpha}$.

Step 4: Restart as in Step 2 and iterate.

The general result obtained in [20] concerns the system of differential equations

$$\frac{dz_x^u(t)}{dt} = \Big[r(x) - \sum_{y\in\mathbb{H}} \alpha(x,y)\, z_y^u(t)\Big] z_x^u(t) + u\sum_{y\sim x} b(y)\, m(y,x)\, z_y^u(t) - u\,b(x)\, z_x^u(t), \qquad (7.3.1)$$

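where $\mathbb{H}$ is the $n$-dimensional hypercube $\{0,1\}^n$, but the same result holds for any locally finite graph. The mutation kernel $m(x,y)$ is positive if and only if $x$ and $y$ are connected by an edge in $\mathbb{H}$.

Theorem 7.3.8 ([20]). Consider the system of differential equations (7.3.1). Assume that the initial conditions $z^u(0)$ are such that for some $\mathbf{x}^0 \subset \mathbb{H}$, it holds that for all $y \in \mathbf{x}^0$, $z_y^u(0) \approx \bar z_y(\mathbf{x}^0)$ and for all $y \notin \mathbf{x}^0$, $z_y^u(0) \approx u^{\lambda_y}$, for some $\lambda_y > 0$. Set

$$\rho_y^0 \equiv \min_{z\in\mathbb{H}} \big(\lambda_z + |z - y|\big),$$

where $|z - y|$ denotes the graph distance between $y$ and $z$, and $T_0 \equiv 0$. Then define, for $i \in \mathbb{N}$,

$$
\begin{aligned}
y^i &\equiv \operatorname*{arg\,min}_{y\in\mathbb{H}:\, f(y,\mathbf{x}^{i-1})>0} \frac{\rho_y^{i-1}}{f(y,\mathbf{x}^{i-1})}, \\
T_i &\equiv T_{i-1} + \min_{y\in\mathbb{H}:\, f(y,\mathbf{x}^{i-1})>0} \frac{\rho_y^{i-1}}{f(y,\mathbf{x}^{i-1})}, \\
\rho_y^i &\equiv \min_{z\in\mathbb{H}} \Big[\rho_z^{i-1} + |z - y| - (T_i - T_{i-1})\, f(z,\mathbf{x}^{i-1})\Big].
\end{aligned}
$$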


Let x i be the support of the equilibrium state of the Lotka–Volterra system involving x i 1 [ yi and set Ti  1, as soon as there exists no y 2 H such that f .y; x i 1 / > 0. Then (under some weak non-degeneracy hypotheses), for every t … ¹Ti ; i > 0º, 1 X  X lim z u tjln uj D 1Ti 6t 0, exp. VK/  uK 

1 ; K ln.K/

as K " 1:

Then the sequence of the rescaled processes . tK=KuK / t >0 with initial state 0K converges in the sense of finite-dimensional Pdistributions to the measure-valued pure jump process ƒ defined as follows: ƒ0 D x2x zNx .x/ıx and the process ƒ jumps for all y 2 x from X X zNx .x/ıx to zNx .x [ y/ıx x2x

x2x[.xCh/

with infinitesimal rate X

m.x/b.x/Nzx .x/

x2x

f .y; x/C M.x; dy/: b.y/

The process $\Lambda$ is called the polymorphic evolution sequence (PES).

Remark 7.4.4. In reference [9], the mutation kernel is assumed absolutely continuous, but this assumption is not necessary, as one can easily check.

A special case of a PES is the trait substitution sequence (TSS), when all equilibria are monomorphic. In some sense the situation that the dimension of the successive equilibria stays constant is generic. Cases when the dimension increases are called evolutionary branching.

7.4.2 The canonical equation

The trait substitution sequence still contains $\sigma$, the scale of a mutation step, as a small parameter. If we denote the corresponding process by $X^\sigma$, one can obtain a further limiting process which describes continuous evolution of the population in phenotypic space. In adaptive dynamics, this equation is called the canonical equation.

Theorem 7.4.5 ([9, Remark 4.2]). If Assumption 7.2.1 is satisfied and the family of initial states of the rescaled TSS, $X_0^\sigma$, is bounded in $L^2$ and converges to a random variable $X_0$ as $\sigma \to 0$, then, for each $T > 0$, the rescaled TSS $X^\sigma_{t/\sigma^2}$ converges, as $\sigma \downarrow 0$, in the Skorokhod topology on $D([0,T], X)$ to the process $(x_t)_{t\le T}$ with initial state $X_0$ and with deterministic sample path, which is the unique solution of an ordinary differential equation, known as the canonical equation of adaptive dynamics (CEAD):

$$\frac{dx_t}{dt} = \int h\,\big[h\, m(x_t)\,\bar z(x_t)\,\partial_1 f(x_t,x_t)\big]_+\, M(x_t, dh),$$


where $\partial_1 f$ denotes the partial derivative of the function $f(x,y)$ with respect to the first variable $x$. Note that the CEAD has fixed points where the derivative of $f(x,x)$ vanishes. Typically, a population will evolve towards such a fixed point and slow down. The further fate of the population cannot be determined on the basis of the CEAD alone. However, in the underlying stochastic model, the population can either stay fixed, if the fixed point is stable and an evolutionary stable situation is reached, or, in the case of an unstable fixed point, evolutionary branching may occur.
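For an atomic mutation kernel the integral in the CEAD becomes a finite sum, which makes the role of the positive part easy to see. The sketch below evaluates the right-hand side; the function name `cead_rhs`, the representation of $M(x,dh)$ as a list of (step, weight) pairs and the constant coefficient functions are illustrative assumptions.

```python
def cead_rhs(x, atoms, m, zbar, d1f):
    """Right-hand side of the CEAD at trait x for an atomic mutation kernel.

    atoms : list of (h, p) pairs, M(x, dh) = sum of p * delta_h (assumed
            independent of x here for simplicity)
    m     : mutation probability as a function of x
    zbar  : monomorphic equilibrium density as a function of x
    d1f   : derivative of the invasion fitness in its first argument at (x, x)
    """
    drift = m(x) * zbar(x) * d1f(x)
    # Only steps h pointing in the direction of increasing fitness contribute,
    # because [h * drift]_+ vanishes otherwise.
    return sum(p * h * max(h * drift, 0.0) for h, p in atoms)


# Hypothetical symmetric kernel with steps -0.1 and +0.1.
atoms = [(-0.1, 0.5), (0.1, 0.5)]
up = cead_rhs(0.0, atoms, m=lambda x: 1.0, zbar=lambda x: 2.0, d1f=lambda x: 3.0)
down = cead_rhs(0.0, atoms, m=lambda x: 1.0, zbar=lambda x: 2.0, d1f=lambda x: -3.0)
```

With a positive fitness gradient only the step $+0.1$ contributes and the trait moves to the right; reversing the sign of the gradient reverses the direction of motion, and a vanishing gradient gives a fixed point.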

7.5 To the CEAD in one step

Deriving the CEAD through the successive limits first $K \uparrow \infty$, $u_K \downarrow 0$ followed by $\sigma \downarrow 0$ is somewhat unsatisfactory. It would be more natural to give conditions under which the limits of large population size, $K \to \infty$, rare mutations, $u_K \to 0$, and small mutation steps, $\sigma_K \to 0$, can be taken simultaneously and lead to the CEAD. Such a result was achieved in a paper with Baar and Champagnat [2]. It turns out that the combination of the three limits simultaneously entails some considerable technical difficulties. The fact that the mutants have only a $K$-dependent small evolutionary advantage decelerates the dynamics of the microscopic process such that the time of any macroscopic change between resident and mutant diverges with $K$. This makes it impossible to use a law of large numbers to approximate the stochastic system with the corresponding deterministic system during the time of invasion. Showing that the stochastic system still follows, in an appropriate sense, the corresponding competitive Lotka–Volterra system (with $K$-dependent coefficients) requires a completely new approach. Developing this approach, which can be seen as a rigorous "stochastic Euler scheme", is the main novelty in the paper [2]. The proof requires methods based on couplings with discrete time Markov chains combined with some standard potential theory arguments for the "exit from a domain problem" in a moderate deviations regime, as well as comparison and convergence results of branching processes.

7.5.1 The main result

In this section, we present the main result of [2], namely the convergence to the canonical equation of adaptive dynamics in one step. The time scale on which we control the population process is $t/(K u_K \sigma_K^2)$. For technical reasons, we make the following simplifying assumption.

Assumption 7.5.1.
(i) The trait space $X$ is a subset of $\mathbb{R}$.
(ii) The mutant distribution $M(x, dh)$ is atomic and the number of atoms is uniformly bounded.
(iii) For all $x \in X$, $\partial_1 f(x,x) \neq 0$.


Assumption 7.5.1 implies that either, for all $x \in X$, $\partial_1 f(x,x) > 0$ or, for all $x \in X$, $\partial_1 f(x,x) < 0$. Therefore coexistence of two traits is not possible. Without loss of generality, we can assume that, for all $x \in X$, $\partial_1 f(x,x) > 0$. In fact, a weaker assumption is sufficient, see Remark 7.5.3 (iii).

Theorem 7.5.2. Assume that Assumptions 7.2.1 and 7.5.1 hold and that there exists a small $\alpha > 0$ such that

$$K^{-\frac{1}{2}+\alpha} \ll \sigma_K \ll 1, \qquad \exp(-K^{\alpha}) \ll u_K \ll \frac{\sigma_K^{1+\alpha}}{K\ln K}, \qquad \text{as } K \to \infty.$$

Fix $x_0 \in X$ and let $(N_0^K)_{K>0}$ be a sequence of $\mathbb{N}$-valued random variables such that $N_0^K K^{-1}$ converges in law, as $K \to \infty$, to the positive constant $\bar z(x_0)$ and is bounded in $L^p$, for some $p > 1$. For each $K > 0$, let $\nu_t^K$ be the process generated by $L^K$ with monomorphic initial state $N_0^K K^{-1}\delta_{\{x_0\}}$. Then, for all $T > 0$, the sequence of rescaled processes,

$$\big(\nu^K_{t/(K u_K \sigma_K^2)}\big)_{0\le t\le T},$$

converges in probability, as $K \to \infty$, with respect to the Skorokhod topology on $D([0,T], M(X))$ to the measure-valued process $\bar z(x_t)\delta_{x_t}$, where $(x_t)_{0\le t\le T}$ is given as a solution of the CEAD,

$$\frac{dx_t}{dt} = \int_{\mathbb{Z}} h\,\big[h\, m(x_t)\,\bar z(x_t)\,\partial_1 f(x_t,x_t)\big]_+\, M(x_t, dh), \qquad (7.5.1)$$

with initial condition $x_0$.

Remark 7.5.3.
(i) If $x_t \in \partial X$ for $t > 0$, then (7.5.1) is $\frac{dx_t}{dt} = 0$, i.e. the process stops.
(ii) The condition $u_K \ll \frac{\sigma_K^{1+\alpha}}{K\ln K}$ allows mutation events during an invasion phase of a mutant trait, see below, but ensures that there is no successful mutational event during this phase.
(iii) The fluctuations of the resident population are of order $K^{-\frac{1}{2}}$, so the condition $K^{-\frac{1}{2}+\alpha} \ll \sigma_K$ ensures that the sign of the initial growth rate is not influenced by these fluctuations.
(iv) $\exp(K^{\alpha})$ is the time the resident population stays with high probability in an $O(\varepsilon\sigma_K)$-neighbourhood of an attractive domain. This is a moderate deviation result. Thus the condition $\exp(-K^{\alpha}) \ll u_K$ ensures that the resident population is still in this neighbourhood when a mutant occurs.
(v) The time scale is $(K u_K \sigma_K^2)^{-1}$ since the expected time for a mutation event is $(K u_K)^{-1}$, the probability that a mutant invades is of order $\sigma_K$ and one needs $O(\sigma_K^{-1})$ mutant invasions to see an $O(1)$ change of the resident trait value.


7.5.2 The structure of the proof of Theorem 7.5.2

Under the conditions of the theorem, the evolution of the population will be described as a succession of mutant invasions.

Analysis of a single invasion step. In order to analyse the invasion of a mutant, we divide the time until a mutant trait has fixed in the population into two phases.

Phase 1. Fix $\varepsilon > 0$ and prove the existence of a constant, $M < \infty$, independent of $\varepsilon$, such that, as long as all mutant densities are smaller than $\varepsilon\sigma_K$, the resident density stays in an $M\varepsilon\sigma_K$-neighbourhood of $\bar z(x)$. Note that, because mutations are rare and the population size is large, the monomorphic initial population has time to stabilise in an $M\varepsilon\sigma_K$-neighbourhood of this equilibrium $\bar z(x)$ before the first mutation occurs. (The time of stabilisation is of order $\ln(K)\sigma_K^{-1}$ and the time when the first mutant occurs is of order $1/(K u_K)$.) This allows one to approximate the number of mutants of trait $y_1$ by a branching process with birth rate $b(y_1)$ and death rate $d(y_1) + c(y_1,x)\bar z(x)$, such that we can compute the probability that the density of the mutant reaches $\varepsilon\sigma_K$, which is of order $\sigma_K$, as well as the time it takes to reach this level or to die out. Therefore, the process needs $O(\sigma_K^{-1})$ mutation events until there appears a mutant subpopulation that reaches a size $\varepsilon\sigma_K$. Such a mutant is called successful and its trait will be the next resident trait.

Phase 2. If a mutant population with trait $y_s$ reaches size $\varepsilon\sigma_K$, it will increase to an $M\varepsilon\sigma_K$-neighbourhood of its equilibrium density $\bar z(y_s)$. Simultaneously, the density of the resident trait decreases to $\varepsilon\sigma_K$ and finally dies out. Since the fitness advantage of the mutant trait is only of order $\sigma_K$, the dynamics of the population process and the corresponding deterministic system are very slow, and require a time of order at least $\sigma_K^{-1}$ to reach an $\varepsilon$-neighbourhood of its equilibrium density. Thus, the law of large numbers, see Theorem 7.3.1, cannot be used to control this phase, as it covers only finite, $K$-independent time intervals. The method we develop to handle this situation can be seen as a rigorous stochastic "Euler scheme". Nevertheless, the proof contains an idea that is strongly connected with the properties of the deterministic dynamical system. Namely, the deterministic system of equations for the case $\sigma_K = 0$ has an invariant manifold of fixed points with a vector field independent of $\sigma_K$ pointing towards this manifold. Turning on a small $\sigma_K$, we therefore expect the stochastic system to stay close to this invariant manifold and to move along it with speed of order $\sigma_K$. With this method one can prove that the mutant density reaches the $M\varepsilon\sigma_K$-neighbourhood of $\bar z(y_s)$ and the resident trait dies out.

Convergence to the CEAD. The proof of convergence to the CEAD uses comparison of the measure-valued process $\nu_t^K$ with two families of control processes, $\mu^{1,K,\varepsilon}$ and $\mu^{2,K,\varepsilon}$, which converge to the CEAD as $K \to \infty$ and then $\varepsilon \to 0$. To make more precise statements, one uses the order relation $\preccurlyeq$ for random variables. Roughly speaking, $X \preccurlyeq Y$ will mean that $Y$ is larger than $X$ in law.

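Given $T > 0$, with the results of the two invasion phases, one defines, for all $\varepsilon > 0$, two measure-valued processes in $D([0,\infty), M(X))$ such that, for all $\varepsilon > 0$,

$$\lim_{K\to\infty} P\Big[\text{for all } t \le \frac{T}{K u_K \sigma_K^2}\colon\ \mu_t^{1,K,\varepsilon} \preccurlyeq \nu_t^K \preccurlyeq \mu_t^{2,K,\varepsilon}\Big] = 1,$$

and, for all $\varepsilon > 0$ and $i \in \{1,2\}$,

$$\lim_{K\to\infty} P\Big[\sup_{0\le t\le T} \big\|\mu^{i,K,\varepsilon}_{t/(K u_K \sigma_K^2)} - \bar z(x_t)\delta_{x_t}\big\|_0 > \delta(\varepsilon)\Big] = 0,$$

for some function $\delta$ such that $\delta(\varepsilon) \to 0$ when $\varepsilon \to 0$. Here $\|\cdot\|_0$ denotes the Kantorovich–Rubinstein norm:

$$\|\mu_t\|_0 \equiv \sup\Big\{\int_X f\, d\mu_t \colon f \in \mathrm{Lip}_1(X) \text{ with } \sup_{x\in X}|f(x)| \le 1\Big\},$$

where $\mathrm{Lip}_1(X)$ is the space of Lipschitz continuous functions from $X$ to $\mathbb{R}$ with Lipschitz norm one (cf. [4, p. 191]). This implies the theorem.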

7.6 Escape through a fitness well

When a population reaches an evolutionary stable condition (ESC), there may still be uninhabited loci where the invasion fitness is positive but that cannot be reached by a single mutation from the current population. This question was already addressed by Maynard Smith [25], and heuristic computations of the crossing time of such fitness valleys were carried out by Gillespie [14]. In [5] we have analysed how such a fitness valley can be crossed in a simple scenario where the trait space is the finite set $\{0,\dots,L\}$, the resident population is monomorphic with trait zero, and the invasion fitness is negative for $1,\dots,L-1$ and positive for $L$. In contrast to the previous sections, we analyse a wider range of dependencies of the mutation rate on the carrying capacity, interpolating between the extreme regime of first $K \uparrow \infty$, then $u \downarrow 0$, and the regime $u \ll 1/(K\ln K)$. As we will show, three essentially different regimes occur. In the first, the mutation rate is so large that many mutants (a number of order $K$) are created in a time of order 1. In this case the fixation time scale is dominated by the time needed for a successful mutant to invade (which is of order $\log K$). The second scenario occurs if the mutation rate is smaller, but large enough so that a fit mutant will appear before the resident population dies out. In this case the fixation time is exponentially distributed and dominated by the time needed for the first successful mutant to be born. The last possible scenario is the extinction of the population before the fixation of the fit mutant, which occurs when the mutation rate is very small (smaller than $e^{-CK}$ for a constant $C$ to be made precise later).


7.6.1 The setting

We analyse the escape problem in a specific simple special case of the general model, which does, however, capture the key mechanism. We choose the trait space $X \equiv \{0,1,\dots,L\}$. For each trait $i$ we denote by $X_i(t)$ the number of individuals of trait $i$ at time $t$. For simplicity, we allow mutations only in the forward direction and to nearest neighbours, that is, we set $m_{ij} = u\,\delta_{i+1,j}$. For $n, m \in \mathbb{N}_0$ such that $n \le m$, we introduce the notation $[[n,m]] \equiv \{n, n+1, \dots, m\}$. We want to consider the situation when an equilibrium population at 0 is an evolutionary stable condition, and when $L$ is the closest trait with a positive invasion fitness.

Assumption 7.6.1.
• (Fitness valley) All traits are unfit with respect to 0 except $L$: $f(i,0) < 0$ for $i \in [[1,L-1]]$ and $f(L,0) > 0$.
• All traits are unfit with respect to $L$: $f(i,L) < 0$ for $i \in [[0,L-1]]$.
• The following fitnesses are different: $f(i,0) \neq f(j,0)$ for all $i \neq j$, and $f(i,L) \neq f(j,L)$ for all $i \neq j$.

Under these assumptions, all mutants created by the initial population initially have a negative growth rate and thus tend to die out. However, if such mutants survive long enough to give rise to further mutants, etc., such that eventually an individual will reach the trait $L$, it will found a population at this trait that, with positive probability, will grow and possibly eliminate the resident population through competition.

7.6.2 Results

The more interesting results in [5] concern the case when $u = u_K \equiv K^{-1/\alpha}$. There are two very different cases to distinguish. First, if $K u^L = K^{1-L/\alpha} \gg 1$ (i.e. $\alpha > L$), there will be essentially immediately a divergent number of mutants at $L$. These then grow exponentially with rate $f(L,0)$ and will therefore reach a macroscopic level at a time of order $\ln K/(\alpha f(L,0))$. Second, if $K u^L = K^{1-L/\alpha} \ll 1$ (i.e. $\alpha < L$), there are typically no mutants at $L$. For $v > 0$ and $0 \le i \le L$, let $T_v^{(K,i)}$ denote the first time the $i$-population reaches the size $\lfloor vK \rfloor$,

$$T_v^{(K,i)} \equiv \inf\big\{t \ge 0 \colon X_i(t) = \lfloor vK \rfloor\big\}.$$

Let us introduce

$$t(L,\alpha) \equiv \frac{L}{\alpha}\,\frac{1}{f(L,0)} + \sup\Big\{\Big(1 - \frac{i}{\alpha}\Big)\frac{1}{|f(i,L)|} \colon 0 \le i \le L-1\Big\},$$

and the time needed for the populations of all types but $L$ to go extinct,

$$T_0^{(K,\Sigma)} \equiv \inf\Big\{t \ge 0 \colon \sum_{0\le i\le L-1} X_i(t) = 0\Big\}.$$
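The constant $t(L,\alpha)$ is straightforward to evaluate once the fitness values are known. The sketch below does this for hypothetical inputs; the function name `fixation_time_constant` and the example fitness values are illustrative assumptions.

```python
def fixation_time_constant(L, alpha, f_L0, f_iL):
    """Evaluate t(L, alpha) = L / (alpha f(L,0)) + sup_i (1 - i/alpha) / |f(i,L)|.

    f_L0 : invasion fitness f(L, 0) of the fit trait L against the resident 0 (> 0)
    f_iL : list with f_iL[i] = f(i, L) for i = 0, ..., L-1 (all negative)
    """
    # Time (in units of log K) for the L-population to grow to macroscopic size.
    growth = L / (alpha * f_L0)
    # Time (in units of log K) for the slowest unfit population to die out.
    decay = max((1.0 - i / alpha) / abs(f_iL[i]) for i in range(L))
    return growth + decay


# Hypothetical valley of width L = 2 with alpha = 4 > L.
t_const = fixation_time_constant(L=2, alpha=4.0, f_L0=0.5, f_iL=[-1.0, -0.25])
```

In this example the growth phase contributes 1 and the extinction phase contributes 3, the latter being dominated by trait 1, whose fitness against $L$ is closest to zero.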

With this notation we have the following asymptotic result.

Theorem 7.6.2. Assume that $L < \alpha < \infty$. Then there exist two positive constants $\varepsilon_0$ and $c$ such that, for every $0 < \varepsilon \le \varepsilon_0$,

$$\liminf_{K\to\infty} P\bigg((1 - c\varepsilon)\frac{L}{\alpha}\frac{1}{f(L,0)} < \frac{T_\varepsilon^{(K,L)}}{\log K} < \frac{T_{\bar x_L - \varepsilon}^{(K,L)}}{\log K} < (1 + c\varepsilon)\frac{L}{\alpha}\frac{1}{f(L,0)}\bigg) \ge 1 - c\varepsilon.$$

Moreover,

$$\frac{T_0^{(K,\Sigma)}}{\log K} \to t(L,\alpha) \quad \text{in probability as } K \to \infty,$$

and there exists a positive constant $V$ such that

$$\limsup_{K\to\infty} P\Big(\sup_{t\le e^{KV}} \big|X_L\big(T_{\bar x_L - \varepsilon}^{(K,L)} + t\big) - \bar x_L K\big| > c\varepsilon K\Big) \le c\varepsilon.$$

In other words, it takes a time of order $t(L,\alpha)\log K$ for the $L$-population to outcompete the other populations and enter a neighbourhood of its monomorphic equilibrium size $\bar x_L K$. Once this has happened, it stays close to this equilibrium for at least a time $e^{KV}$, where $V$ is a positive constant. Note that the constant $t(L,\alpha)$ can be intuitively computed from the deterministic limit. Indeed, for $\alpha > L$, we will prove that the system performs small fluctuations around the deterministic evolution studied above: the $i$-population first stabilises around $O(K u^i)$ in a time of order one, then the $L$-population grows exponentially with rate $f(L,0)$ and needs a time of order $L\log K/(\alpha f(L,0))$ to reach a size of order $K$, while the other types stay stable; the swap between populations 0 and $L$ then takes a time of order one, and finally, for $i \neq L$, the $i$-population decays exponentially from $O(K u^i)$ to extinction with a rate given by the lowest (negative) fitness of its left neighbours (sub-critical branching process, needs a time close to $\big(\sup_{j\in[[0,i]]}(1 - \frac{j}{\alpha})/|f(i,L)|\big)\log K$). Thus the time until extinction of all non-$L$ populations is close to a constant times $\log K$.

Next we consider the case when $\frac{L}{\alpha} > 1$. In this case, there is no $L$-mutant at time one, and the fixation of the trait $L$ happens on a much longer time scale. In fact, there will be some last $j < L$ where a large number ($\gg 1$) of mutants will be present essentially all the time. Already at $j+1$, mutants arrive only sporadically and will typically go extinct quickly. Mutants arrive at $L$ only when the rare event occurs that a sequence of mutants manages to survive the trip from $j$ to $L$. Such an excursion can be described as follows: First, a mutant of type $j+1$ is born from the $j$-population. This generates a subcritical branching process with birth rate $b_{j+1}$ and death rate $d_{j+1} + c_{j+1,0}\bar x_0$. Define the parameter

$$\rho_{j+1} \equiv \frac{b_{j+1}}{b_{j+1} + d_{j+1} + c_{j+1,0}\bar x_0}.$$

The expected number of individuals that are generated by this process before extinction is then

$$\lambda(\rho_{j+1}) = \sum_{k=1}^{\infty} \frac{(2k)!}{(k-1)!\,(k+1)!}\, \rho_{j+1}^{k}\,(1 - \rho_{j+1})^{k+1}.$$
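The series for $\lambda$ can be summed numerically with a term-by-term recursion that avoids large factorials. In the sketch below the function name `lam` and the truncation level are illustrative assumptions. The comparison value $\rho/(1-2\rho)$ is not stated in the text; it is what one obtains by summing the series with the generating function of the Catalan numbers for $\rho < 1/2$, and is used here only as a cross-check.

```python
def lam(rho, n_terms=2000):
    """Truncated series sum_{k>=1} (2k)! / ((k-1)! (k+1)!) rho^k (1-rho)^(k+1).

    Requires 0 <= rho < 1/2 (subcritical case), where the series converges.
    """
    if not 0.0 <= rho < 0.5:
        raise ValueError("rho must lie in [0, 1/2)")
    x = rho * (1.0 - rho)
    term = x      # k = 1: coefficient (2)! / (0! 2!) = 1, times x^1
    total = 0.0
    for k in range(1, n_terms + 1):
        total += term
        # Ratio of consecutive coefficients: (2k+2)(2k+1) / (k (k+2)).
        term *= (2 * k + 2) * (2 * k + 1) / (k * (k + 2)) * x
    # Each summand carries one extra factor (1 - rho).
    return (1.0 - rho) * total


# Cross-check against rho / (1 - 2 rho) for a hypothetical value rho = 0.3.
value = lam(0.3)
```

The value grows without bound as $\rho$ approaches $1/2$, which reflects the fact that excursions of a nearly critical branching process can be very long.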

Thus, on average, the probability that, during the lifetime of the descendants of this mutant, a $(j+2)$-mutant is born is $u\lambda(\rho_{j+1})$. Should that happen, this will create a subcritical process of $j+2$ individuals, which produce a $(j+3)$-mutant with average probability $u\lambda(\rho_{j+2})$, and so forth. Thus we see that the probability that a $(j+1)$-child of the $j$-population has offspring that reaches $L$ is about $u^{L-j-1}\prod_{i=j+1}^{L-1}\lambda(\rho_i)$. This explains the result stated in the theorem below.

Theorem 7.6.3.
• Assume that $\alpha \notin \mathbb{N}$ and $\alpha < L$. Then there exist two positive constants $\varepsilon_0$ and $c$, and two exponential random variables $E_\mp$ with parameters

$$(1 \pm c\varepsilon)\,\frac{\bar x_0\, b_0 \cdots b_{\lfloor\alpha\rfloor - 1}}{|f(1,0)| \cdots |f(\lfloor\alpha\rfloor, 0)|}\,\frac{f(L,0)}{b_L} \prod_{i=\lfloor\alpha\rfloor+1}^{L-1} \lambda(\rho_i)$$

such that, for every $\varepsilon \le \varepsilon_0$,

$$\liminf_{K\to\infty} P\Big(E_- \le \big(T_{\bar x_L - \varepsilon}^{(K,L)} \vee T_0^{(K,\Sigma)}\big)\, K u^L \le E_+\Big) \ge 1 - c\varepsilon.$$

• There exists a positive constant $V$ such that if $u$ satisfies

$$e^{-VK} \ll K u \ll 1,$$

then the same conclusion holds, with the corresponding parameters, for $E_-$ and $E_+$:

$$(1 + c\varepsilon)\,\bar x_0\,\frac{f(L,0)}{b_L}\prod_{i=1}^{L-1}\lambda(\rho_i) \quad \text{and} \quad (1 - c\varepsilon)\,\bar x_0\,\frac{f(L,0)}{b_L}\prod_{i=1}^{L-1}\lambda(\rho_i).$$

Moreover, under both assumptions, there exists a positive constant $V$ such that

$$\limsup_{K\to\infty} P\Big(\sup_{t\le e^{KV}} \big|X_L\big(T_{\bar x_L - \varepsilon}^{(K,L)} + t\big) - \bar x_L K\big| > c\varepsilon K\Big) \le c\varepsilon.$$


In the first case, the typical trajectories of the process are as follows: mutant populations of type $i$, for $1 \le i \le \lfloor\alpha\rfloor$, reach a size of order $K u^i \gg 1$ in a time of order $\log K$ (they are well approximated by birth-death processes with immigration and their behaviour is then close to the deterministic limit), and mutant populations of type $i$, for $\lfloor\alpha\rfloor + 1 \le i \le L$, describe a.s. finite excursions, among which a proportion of order $u$ produces a mutant of type $i+1$. Finally, every $L$-mutant has a probability $f(L,0)/b_L$ to produce a population that outcompetes all other populations. The term $\lambda(\rho_i)$ is the expected number of individuals in an excursion of a subcritical birth and death process of birth rate $b_i$ and death rate $d_i + c_{i0}\bar x_0$, excepting the first individual. Hence $u\lambda(\rho_i)$ is the approximate probability for a type-$i$ population ($\lfloor\alpha\rfloor + 1 \le i \le L-1$) to produce a mutant of type $i+1$, and the overall time scale can be recovered as follows:
(1) The last "large" population is the $\lfloor\alpha\rfloor$-population, which reaches a size of order $K u^{\lfloor\alpha\rfloor}$ after a time that does not go to infinity with $K$.
(2) The $\lfloor\alpha\rfloor$-population produces an excursion of an $(\lfloor\alpha\rfloor + 1)$-population at a rate of order $K u^{\lfloor\alpha\rfloor+1}$, which has a probability of order $u$ to produce an excursion of an $(\lfloor\alpha\rfloor + 2)$-population, and so on, giving the order $K u^L$.

Notice that Theorem 7.6.3 implies that, for any mutation rate that converges to zero more slowly than $e^{-VK}/K$, the population will cross the fitness valley with a probability tending to 1 as $K \to \infty$. Our results thus cover a wide range of biologically relevant cases.

Acknowledgements. I am indebted to all my collaborators on this project, notably Martina Baar, Nicolas Champagnat, Loren Coquille, Anna Kraut, Rebecca Neukirch, and Charline Smadi. Many thanks for insightful discussions on adaptive walks with Joachim Krug.
I am particularly grateful to our colleagues from medicine, Nicole Glodde, Michael Hölzel, Meri Rogova, and Thomas Tüting for fruitful and inspiring collaborations. This work was also funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy – GZ 2047/1, Projekt-ID 390685813 and GZ 2151, Projekt-ID 390873048.

References

[1] E. Baake and A. Wakolbinger, Microbial populations under selection, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 43–67.
[2] M. Baar, A. Bovier, and N. Champagnat, From stochastic, individual-based models to the canonical equation of adaptive dynamics in one step, Ann. Appl. Probab. 27 (2017), 1093–1170.

Anton Bovier


[3] V. Bansaye and S. Méléard, Stochastic Models for Structured Populations, Springer, Cham, 2015.
[4] V. I. Bogachev, Measure Theory. Vol. I, II, Springer, Berlin, 2007.
[5] A. Bovier, L. Coquille, and C. Smadi, Crossing a fitness valley as a metastable transition in a stochastic population model, Ann. Appl. Probab. 29 (2019), 3541–3589.
[6] A. Bovier and S.-D. Wang, Trait substitution trees on two time scales analysis, Markov Process. Related Fields 19 (2013), 607–642.
[7] N. Champagnat, A microscopic interpretation for adaptive dynamics trait substitution sequence models, Stochastic Process. Appl. 116 (2006), 1127–1160.
[8] N. Champagnat, R. Ferrière, and S. Méléard, From individual stochastic processes to macroscopic models in adaptive evolution, Stoch. Models 24 (2008), 2–44.
[9] N. Champagnat and S. Méléard, Polymorphic evolution sequence and evolutionary branching, Probab. Theory Related Fields 151 (2011), 45–94.
[10] C. Darwin, The Origin of Species, Murray, London, 1859.
[11] S. N. Ethier and T. G. Kurtz, Markov Processes: Characterization and Convergence, Wiley, New York, 1986.
[12] W. J. Ewens, Mathematical Population Genetics: I. Theoretical Introduction, 2nd ed., Springer, New York, 2004.
[13] N. Fournier and S. Méléard, A microscopic probabilistic description of a locally regulated population and macroscopic approximations, Ann. Appl. Probab. 14 (2004), 1880–1919.
[14] J. H. Gillespie, Molecular evolution over the mutational landscape, Evolution 38 (1984), 1116–1129.
[15] K. Jain, Evolutionary dynamics of the most populated genotype on rugged fitness landscapes, Phys. Rev. E 76 (2007), Article ID 031922.
[16] K. Jain and J. Krug, Evolutionary trajectories in rugged fitness landscapes, J. Stat. Mech. 2005 (2005), Article ID P04008.
[17] K. Jain and J. Krug, Deterministic and stochastic regimes of asexual evolution on rugged fitness landscapes, Genetics 175 (2007), 1275–1288.
[18] S. Kauffman and S. Levin, Towards a general theory of adaptive walks on rugged landscapes, J. Theor. Biol. 128 (1987), 11–45.
[19] S. A. Kauffman, The origins of order: Self-organization and selection in evolution, in: Spin Glasses and Biology (ed. D. L. Stein), World Scientific, Singapore (1992), 61–100.
[20] A. Kraut and A. Bovier, From adaptive dynamics to adaptive walks, J. Math. Biol. 79 (2019), 1699–1747.
[21] J. Krug, Accessibility percolation in random fitness landscapes, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 1–22.
[22] J. Krug and C. Karl, Punctuated evolution for the quasispecies model, Phys. A 318 (2003), 137–143.
[23] A. J. Lotka, Quantitative studies in epidemiology, Nature 88 (1912), 497–498.


[24] T. Malthus, An Essay on the Principle of Population as it Affects the Future Improvement of Society, with Remarks on the Speculations of Mr. Goodwin, M. Condorcet and Other Writers, J. Johnson, London, 1798.
[25] J. Maynard Smith, Natural selection and the concept of a protein space, Nature 225 (1970), 563–564.
[26] J. Metz, R. Nisbet, and S. Geritz, How should we define "fitness" for general ecological scenarios?, Trends Ecol. Evol. 7 (1992), 198–202.
[27] J. A. Metz, Adaptive dynamics, in: Encyclopedia of Theoretical Ecology, Cambridge University, Cambridge (2012), 7–17.
[28] J. A. J. Metz, S. A. H. Geritz, G. Meszéna, F. J. A. Jacobs, and J. S. van Heerwaarden, Adaptive dynamics: A geometrical study of the consequences of nearly faithful reproduction, in: Stochastic and Spatial Structures of Dynamical Systems (eds. S. J. van Strien and S. M. Verduyn Lunel), North-Holland, Amsterdam (1996), 183–231.
[29] J. Neidhart and J. Krug, Adaptive walks and extreme value theory, Phys. Rev. Lett. 107 (2011), Article ID 178102.
[30] S. Nowak and J. Krug, Analysis of adaptive walks on NK fitness landscapes with different interaction schemes, J. Stat. Mech. 2015 (2015), Article ID P06014.
[31] H. A. Orr, A minimum on the mean number of steps taken in adaptive walks, J. Theor. Biol. 220 (2003), 241–247.
[32] B. Schmiegelt and J. Krug, Evolutionary accessibility of modular fitness landscapes, J. Stat. Phys. 154 (2014), 334–355.
[33] V. Volterra, Variations and fluctuations of the number of individuals in animal species living together, J. Conseil Int. Explor. Mer. 3 (1928), 1–51.

Chapter 8

Genealogies and inference for populations with highly skewed offspring distributions

Matthias Birkner and Jochen Blath

We review recent progress in the understanding of the role of multiple- and simultaneous multiple merger coalescents as models for the genealogy in idealised and real populations with exceptional reproductive behaviour. In particular, we discuss models with "skewed offspring distribution" (or under other non-classical evolutionary forces) which lead to multiple merger coalescents in the single locus haploid case, and to simultaneous multiple merger coalescents in the multi-locus diploid case. Further, we discuss inference methods under the infinitely-many sites model which allow both model selection and estimation of model parameters under these coalescents.

8.1 Multiple merger coalescents in population genetics

8.1.1 Introduction

The "standard" model in mathematical population genetics is Kingman's coalescent [48], which describes, on appropriate time scales, the random genealogies of a large class of population models. A salient feature of models in the domain of attraction of Kingman's coalescent and its ramifications is that, at least in the limit of large population size, only binary mergers of ancestral lineages are visible. This is owed to the fact that the number of offspring of any individual must be negligible in comparison with the total population size. It is an important and very useful universality feature of Kingman's coalescent that as the population size $N \to \infty$, the details of the actual offspring distribution are "washed out" from the limit model; only its variance $\sigma_N^2 \to \sigma^2 \in (0,\infty)$ remains as a time-rescaling compared to the "standard" Kingman coalescent. A crucial assumption here is $\sigma^2 < \infty$. The question "what if $\sigma^2 = \infty$?" is also biologically relevant: While all real populations are finite, coalescent theory is about (tractable) limit results as $N \to \infty$, and $\sigma^2 = \infty$ really means that $\sigma_N^2$ is large when $N$ is large. As we will see below, there is a variety of biological mechanisms that predict a deviation from the Kingman coalescent model.

In this article, we will first describe general coalescent models (where the term "general" means that multiple- and even simultaneous multiple mergers of ancestral lineages will be allowed), and review briefly population models that lead to limiting genealogies described by certain subclasses of these general coalescent processes. We will then investigate how one of the most popular statistics of real DNA sequence


data (under the infinitely-many sites model), namely the site-frequency spectrum, behaves under these coalescent models, and then derive inference methods that allow one to estimate evolutionary parameters within a certain class of coalescent models, or to distinguish between different underlying genealogical models. While this theory is mostly confined to single-locus data of haploid populations, we will finally derive the genealogy in a simple diploid multi-locus model. Interestingly, this will naturally lead to genealogies driven by coalescents with simultaneous multiple mergers. Also, the additional information contained in multi-locus data will, despite dependence between different loci that is inherent in multiple-merger coalescents even in the face of high recombination rates, increase the statistical power of our methods for inference. We conclude this text with an outlook on recent developments in the field and the potential relevance of our results.

To sum up, we aim to take steps towards understanding to what extent the conjecture of Eldon and Wakeley [29, p. 2622] holds: "For many species, the coalescent with multiple mergers might be a better null model than Kingman's coalescent." Note that this article is related to several others in this volume that also touch upon the topic of non-standard genealogies, in particular those by Fabian Freund [31], by Götz Kersting and Anton Wakolbinger [47] and by Anja Sturm [68]. We will highlight concrete links in the sequel.

8.1.2 Multiple and simultaneous multiple merger coalescents

About two decades ago, two natural classes of general coalescent processes, the so-called Λ-coalescents [25, 54, 58] and Ξ-coalescents [52, 61], were introduced in the mathematical literature. All these coalescents have in common that they are (exchangeable) partition-valued continuous-time Markov chains, that is, they take values in the space $\mathcal{P}_n$, the space of finite partitions of $[n] := \{1, \ldots, n\}$, if started from a finite number of blocks.
Both the above classes of coalescent processes allow multiple mergers of ancestral lines, by which we mean a transition that is obtained from the current partition state by merging a certain number of blocks (representing ancestral lines) into one or several new blocks, thus obtaining a "coarser partition". In the case of the classical Kingman coalescent, these transitions are always binary, that is, precisely two blocks merge into one new block. In the case of a Λ-coalescent, however, at transition times, multiple lines necessarily merge into one single new block, while for Ξ-coalescents, subsets of blocks involved in a coalescence event may merge into different "target blocks". The path of an n-coalescent process corresponds in a natural way to a random tree where the leaves correspond to $\{1\}, \{2\}, \ldots, \{n\}$ and internal nodes to larger blocks. In fact, one can interpret a coalescent as a random metric space; see e.g. [35, 38, 39]. In this article, we only consider coalescent processes starting from finitely many blocks (i.e. n-coalescents). The corresponding coalescents with $n = \infty$ can be constructed by employing consistency and using Kolmogorov's extension theorem, or


explicitly via look-down constructions [15, 25]. They have very interesting mathematical properties which are, however, not in the focus of this text. Let us first briefly introduce the pertinent notation.

8.1.2.1 Multiple merger (MMC) coalescents. For $\pi \in \mathcal{P}_n$ let $|\pi|$ denote the number of blocks and for $\pi, \pi' \in \mathcal{P}_n$ we write $\pi' \prec_{m,k} \pi$ if $|\pi| = m$ and $\pi'$ arises from $\pi$ by merging $k$ blocks into a single one (a "k-merger"). For a finite measure Λ on $[0,1]$, define

$$\lambda_{m,k} := \int_0^1 x^{k-2}(1-x)^{m-k}\,\Lambda(dx), \qquad \lambda_m := \sum_{k=2}^{m} \binom{m}{k} \int_0^1 x^{k-2}(1-x)^{m-k}\,\Lambda(dx). \tag{8.1.1}$$

The n-Λ-coalescent is a $\mathcal{P}_n$-valued continuous-time Markov chain $\{\Pi^{(\Lambda)}_t, t \ge 0\}$ with transition rates $q_{\pi,\pi'}$ from $\pi$ to $\pi' \ne \pi$ given by

$$q_{\pi,\pi'} = \begin{cases} \lambda_{m,k} & \text{if } \pi' \prec_{m,k} \pi \text{ for some } k, \\ 0 & \text{otherwise.} \end{cases} \tag{8.1.2}$$

Remark 8.1.1. A natural interpretation of (8.1.1) is to imagine that for $x \in (0,1]$ at rate $x^{-2}\Lambda(dx)$, a "merging event of size $x$" occurs: In such an event, every block independently flips a "coin" with success probability $x$ and all the "successful" blocks are merged. In fact, such constructions are in [25, 54] and this intuition is also corroborated by the duality with the Λ-Fleming–Viot process (see page 157).

Obviously, the class of all Λ-coalescents (corresponding to all the finite measures on $[0,1]$) is quite large and in particular non-parametric. The following important special cases have frequently appeared in the literature:

Example 8.1.2.

(K) The Kingman coalescent $\Pi^{(K)}$ (see [48]) corresponds to the choice $\Lambda(dx) = \delta_0(dx)$, i.e. $\Pi^{(K)} = \Pi^{(\delta_0)}$. Here, the measure Λ is concentrated on the point 0 and no multiple, only binary mergers happen, as is evident from (8.1.1).

(S) The "star-shaped coalescent" $\Pi^{(S)}$ corresponds to the choice $\Lambda(dx) = \delta_1(dx)$. This coalescent exhibits only one single transition, in which all active lines merge into a single line within one step.


(BS) The Bolthausen–Sznitman coalescent $\Pi^{(BS)}$, introduced in [18] as a tool to study certain spin glass models in statistical mechanics, is given by $\Lambda(dx) = \mathbf{1}_{[0,1]}(x)\,dx$, i.e. the measure Λ is the uniform distribution on $[0,1]$.

(B) The Beta$(2-\alpha, \alpha)$-coalescent $\Pi^{(B)}$ is given by

$$\Lambda(dx) = \frac{\Gamma(2)}{\Gamma(2-\alpha)\Gamma(\alpha)}\, x^{1-\alpha}(1-x)^{\alpha-1}\,dx,$$

with $\alpha \in (0,2)$. Here, the measure Λ is associated with the beta distribution with parameters $2-\alpha$ and $\alpha$. The limiting case $\alpha = 2$ (in the sense of weak convergence of measures) corresponds to the Kingman coalescent, while $\alpha = 1$ returns the Bolthausen–Sznitman coalescent $\Pi^{(BS)}$ and (the weak limit) $\alpha \to 0$ gives the star-shaped coalescent $\Pi^{(S)}$. For a visual impression of realisations of Beta-coalescent trees for different values of $\alpha$ we refer to Figure 11.1.1 in the article by Kersting and Wakolbinger [47] in this volume.

(EW) The following class of purely atomic coalescents has been investigated by [29]: Here, one considers the cases

$$\Lambda(dx) = \delta_\psi(dx) \quad \text{and} \quad \Lambda(dx) = \frac{2}{2+\psi^2}\,\delta_0(dx) + \frac{\psi^2}{2+\psi^2}\,\delta_\psi(dx),$$

with $\psi \in [0,1]$, where $\psi = 0$ gives the Kingman coalescent.
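As a small numerical illustration of (8.1.1) for the Beta$(2-\alpha,\alpha)$ case, the following sketch evaluates $\lambda_{m,k}$ through the Beta function, using $\int_0^1 x^{k-2}(1-x)^{m-k}\Lambda(dx) = B(k-\alpha, m-k+\alpha)/B(2-\alpha,\alpha)$. The function names `beta_lambda` and `total_rate` are illustrative choices made here and do not come from the text.

```python
from math import comb, exp, lgamma


def beta_lambda(m, k, alpha):
    """lambda_{m,k} from (8.1.1) when Lambda is the Beta(2 - alpha, alpha) law.

    Hypothetical helper: integrates x^(k-2) (1-x)^(m-k) against the Beta
    density, which gives B(k - alpha, m - k + alpha) / B(2 - alpha, alpha).
    """
    if not (0.0 < alpha < 2.0) or not (2 <= k <= m):
        raise ValueError("need 0 < alpha < 2 and 2 <= k <= m")
    # log B(k - alpha, m - k + alpha); the two arguments sum to m
    log_num = lgamma(k - alpha) + lgamma(m - k + alpha) - lgamma(m)
    # log B(2 - alpha, alpha); the two arguments sum to 2
    log_den = lgamma(2.0 - alpha) + lgamma(alpha) - lgamma(2.0)
    return exp(log_num - log_den)


def total_rate(m, alpha):
    """lambda_m from (8.1.1): total merger rate when m blocks are present."""
    return sum(comb(m, k) * beta_lambda(m, k, alpha) for k in range(2, m + 1))


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 1.5):
        print(alpha, [round(beta_lambda(4, k, alpha), 4) for k in (2, 3, 4)])
```

Since Λ is a probability measure here, $\lambda_{2,2} = 1$ for every $\alpha$, and the rates satisfy the consistency relation $\lambda_{m,k} = \lambda_{m+1,k} + \lambda_{m+1,k+1}$, which is a convenient sanity check for any implementation.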

We refer to [5, 33] for surveys on Λ-coalescents. See also the contribution by Kersting and Wakolbinger [47] in this volume.

8.1.2.2 Simultaneous multiple merger (SMMC) coalescents. Formulating the dynamics of an SMMC requires some notational overhead but we will see that they appear naturally as genealogies in diploid population models with highly skewed offspring distributions. For

$$k = (k_1, k_2, \ldots, k_r) \quad \text{with } r \in \mathbb{N},\ k_1 \ge k_2 \ge \cdots \ge k_r \ge 2 \tag{8.1.3}$$

and $\pi, \pi' \in \mathcal{P}_n$ with $|\pi| = m$ we write $\pi' \prec_{m,k} \pi$ if $\pi'$ arises from $\pi$ by merging $r$ groups of blocks of sizes $k_1, k_2, \ldots, k_r$ (and leaving the other blocks unchanged). We write $|k| = k_1 + \cdots + k_r$. In order to describe the dynamics of an SMMC, we need a bit of notation: Let Δ denote the infinite simplex

$$\Delta := \Big\{ x = (x_1, x_2, \ldots) : x_1 \ge x_2 \ge \cdots \ge 0,\ \sum_i x_i \le 1 \Big\}$$


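and let $\Delta_0 := \Delta \setminus \{(0, 0, \ldots)\} = \Delta \setminus \{0\}$. Let $\Xi_0$ be a finite measure on $\Delta_0$ and $a \ge 0$; then $\Xi := a\delta_0 + \Xi_0$ is a finite measure on Δ. For $k$ as in (8.1.3), with $s = m - |k|$, put

$$\lambda_{m,k} = a\,\mathbf{1}_{(r=1,\,k_1=2)} + \int_{\Delta_0} \frac{\sum_{\ell=0}^{s} \sum_{i_1 \ne \cdots \ne i_{r+\ell}} \binom{s}{\ell}\, x_{i_1}^{k_1} \cdots x_{i_r}^{k_r}\, x_{i_{r+1}} \cdots x_{i_{r+\ell}} \big(1 - \sum_j x_j\big)^{s-\ell}}{\sum_j x_j^2}\; \Xi_0(dx). \tag{8.1.4}$$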

An n-Ξ-coalescent $\{\Pi^{\Xi}_t\}$ is a continuous-time Markov chain on $\mathcal{P}_n$ that jumps from $\pi \in \mathcal{P}_n$ with $|\pi| = m$ to $\pi' \in \mathcal{P}_n$ at rate $q_{\pi,\pi'} = \lambda_{m,k}$ if $\pi' \prec_{m,k} \pi$ with $k$ as in (8.1.3), and $q_{\pi,\pi'} = 0$ if $\pi' \ne \pi$ is not of this form. The form of the jump rates (8.1.4) has a similar interpretation as discussed in Remark 8.1.1 for the case of Λ-coalescents: At rate $a$, pairwise merging occurs. Furthermore, for $x = (x_1, x_2, \ldots) \in \Delta_0$, at rate $\big(\sum_j x_j^2\big)^{-1}\,\Xi_0(dx)$ an "x-merging event" occurs. In such an event, every block independently draws a "colour", where colour $i$ is drawn with probability $x_i$ for $i \ge 1$ and colour 0 with probability $1 - |x|$. Then all blocks with the same colour $i$ for $i \ge 1$ are merged.

Obviously, the class of Ξ-coalescents is even richer than the class of Λ-coalescents. In particular, one recovers a Λ-coalescent by choosing $\Xi := \Lambda \otimes \delta_0 \otimes \delta_0 \otimes \cdots$, i.e. if Ξ is concentrated on the first component of the simplex. However, only a handful of natural examples have been motivated and analysed on the basis of an underlying population model so far. The following important special cases have appeared in the literature.

Example 8.1.3.

(PD) Let $\mathrm{PD}_\theta$ be the Poisson–Dirichlet distribution with $\theta > 0$. The Poisson–Dirichlet coalescent with $\Xi = \big(\sum_i x_i^2\big)^{-1}\,\mathrm{PD}_\theta$ appears in [59] as the genealogy of the "Dirichlet compound Wright–Fisher model".

(SK) Subordinated Kingman coalescents. If one applies a discontinuous time-change to a Kingman coalescent, as soon as more than one binary coalescence event of the original process falls into a jump-interval of the time-change, one obtains a multiple or simultaneous multiple merger event. When the (random) time-change is given by a subordinator $\{S_t\}$, the time-changed process $\{\Pi^{(K)}_{S_t}\}_{t \ge 0}$ is a Ξ-coalescent. The representation of Ξ in terms of $\{S_t\}$ as mixture of Dirichlet distributions is non-trivial and omitted here for brevity, see [15, Proposition 6.3] for a partial answer. See also [34] for the related class of "symmetric coalescents".

(DS) Durrett and Schweinsberg [27] approximate the genealogy in a selective sweep by a Ξ-coalescent, where Ξ is described by a stick-breaking construction, see [27, Section 3].


(xEW), (xB) In diploid bi-parental populations, in which the reproduction events of each parent are governed by a certain Λ-coalescent, one obtains genealogies given by Ξ-coalescents of the form

$$\Xi = \frac{1}{4} \int_{[0,1]} \delta_{\left(\frac{x}{4}, \frac{x}{4}, \frac{x}{4}, \frac{x}{4}, 0, 0, 0, \ldots\right)}\, \Lambda(dx).$$

In particular, the cases $\Lambda = \delta_\psi$ and $\Lambda = \mathrm{Beta}(2-\alpha, \alpha)$ for suitable $\psi$ and $\alpha$ have been considered, see [13]. The reason for the fourfold split is that the ancestral line of a chromosome may merge into any of the four parental chromosomes (two for each parent). Such Ξ-coalescents will play an important role in Section 8.3 below.

8.1.3 Population models

A substantial amount of work has been devoted to understanding conditions under which population models converge to limits whose genealogy can be described by one of the above coalescent processes. Typically, one considers populations of fixed size $N$, whose reproductive events can be described by exchangeable offspring distributions, see [19]. A full classification of offspring distributions and time scalings in Cannings models for convergence to Λ- and Ξ-coalescents has been found in [52]. It is thus possible to provide abstract criteria and descriptions for population models that make their ancestral distributions converge to any prespecified Ξ- or Λ-coalescent. However, the relevance of a particular (SMMC) model clearly depends on its plausibility as limit of an (in some sense) natural population model. We thus now briefly review such population models and their genealogical coalescent limits.

(B)

Beta$(2-\alpha, \alpha)$-coalescents with $\alpha \in (1,2]$ are obtained as limiting genealogies of Schweinsberg's model [62], in which individuals produce, in a first step, potential offspring according to a stable law with index $\alpha$ and mean $m > 1$, and then $N$ out of these are selected for survival. This corresponds to what is known as a "highly skewed offspring distribution" or "sweepstakes reproduction" (cf. [1, 41, 42]). In population biology, it resembles so-called "type-III survivorship", that is, high fertility leading to excessive amounts of offspring, corresponding to the first reproduction step, whereas high mortality early in life is modelled in the second step. Several authors have proposed this class of coalescents to describe the reproductive behaviour of Atlantic cod (see e.g. [2, 66]). One can see heuristically why this particular form of the Λ-measure appears: The probability that a given individual's offspring provides more than a fraction $y$ of the next generation, given that the family is substantial (i.e. given $X_1 > \varepsilon N$, for $y > \varepsilon$), is approximately

$$\mathbb{P}\Big(\frac{X_1}{X_1 + (N-1)m} > y \,\Big|\, X_1 > \varepsilon N\Big) = \mathbb{P}\Big(X_1 > \frac{(N-1)my}{1-y} \,\Big|\, X_1 > \varepsilon N\Big) \approx \text{const.} \cdot \frac{(1-y)^\alpha}{y^\alpha} = \text{const.} \cdot \mathrm{Beta}(2-\alpha, \alpha)([y,1]),$$

where we used the law of large numbers to replace $X_2 + \cdots + X_N$ by $(N-1)m$. The model is also mathematically appealing, since it exhibits a close connection to renormalised $\alpha$-stable branching processes, see [12].

(B′) Huillet's Pareto model: [45] derives Beta$(2-\alpha, \alpha)$-coalescents as limiting genealogies in a population model similar to the one in (B) where the sampling can be interpreted as according to a "random fitness value".

(BS) The Bolthausen–Sznitman coalescent appears for $\alpha = 1$ in the sweepstakes model, but also as limiting genealogy at the "tip of a fitness wave". This was predicted in [21] using non-rigorous arguments (for a related model see also [53]), and partly confirmed (for certain variations of the model) in [7, 63, 64].

(EW) This model corresponds to populations in which in each reproductive step, a fraction $\psi$ of individuals are produced by one single parent. This can be combined with classical Wright–Fisher type reproduction to produce the "Kingman atom" at 0. See [29].

(GM) Generalised Moran models. Independently in each reproduction event, a random number $\Psi^{(N)}$ of offspring are born to a single pair of parents; these offspring replace $\Psi^{(N)}$ randomly chosen individuals from the present population. Here, $\mathbb{P}(\Psi^{(N)} = 1) = 1$ corresponds to the classical Moran model; (EW) is also a special case of this. By suitably choosing $\mathcal{L}(\Psi^{(N)})$ one can in fact approximate any Λ-coalescent, see Section 8.3.1.

(xEW), (xB) Appear as scaling limits of diploid bi-parental models with skewed reproduction. We will present a corresponding model in Section 8.3.1. A complete classification of the corresponding diploid population limits can be found in [16]. See also Tellier and Lemaire [69] for a recent overview from a biological perspective.
There are many further extensions of population and coalescent models in the literature, including spatial models such as Barton, Etheridge and Véber's spatial Λ-Fleming–Viot process [4], or so-called on/off coalescents in situations with seed banks [11]. However, in this article, our focus is on the reproductive mechanism of neutral well-mixed populations, so we refrain from discussing these models further here. All of the above coalescent processes are dual to the corresponding forward-in-time population limit, given as a (generalised) Fleming–Viot process (which is a measure-valued (jump-)diffusion), see [25] and e.g. [9]. Details of this and a representation of the generator of Ξ-coalescents can be found in [15]. There, it is also shown that the above duality can be strengthened to a strong pathwise duality via an extension of Donnelly and Kurtz' celebrated lookdown construction [24, 25].

Figure 8.2.1. Mutations on a coalescent tree and resulting data matrix (in schematic form; the tree has leaves a, b, c, d, e and carries mutations 1 to 7). Implicitly, identical columns are removed from the data matrix. The corresponding SFS is $\xi^{(5)} = (4, 2, 1, 0)$.

8.2 Inference based on the site-frequency spectrum

One of the most important and well-studied statistical quantities derived from DNA sequence data is the site frequency spectrum (SFS).¹ For the theoretical analysis, we assume that all underlying data fits the infinitely-many sites model (IMS) of population genetics (cf. [72] or [70]), that is, we assume that every observed site mutated at most once during the entire history of the sample. This assumption is often at least approximately true since typical per-site mutation rates are very small. Here, "site" refers to a single base pair in the DNA molecule. Furthermore, from a pragmatic point of view, the SFS of a dataset is well-defined even if the assumptions of the IMS model are violated (see e.g. [40] for the combinatorial characterisation of data complying with the IMS model). For the analysis, we also assume that the genealogy of a sample of size $n \in \mathbb{N}$ is described by one of the above coalescent models Π and that mutations occur at some rate $\theta/2 > 0$ on the coalescent branches, see Figure 8.2.1 for an illustration. If we know the ancestral state, then the SFS of an n-sample is defined as

$$\xi^{(n)} := \big(\xi_1^{(n)}, \ldots, \xi_{n-1}^{(n)}\big),$$

where $\xi_i^{(n)}$, $i \in [n-1]$, is the number of sites at which a mutation appears $i$ times in our sample. If the ancestral states are unknown (and thus the data matrix as in Figure 8.2.1 is only defined up to column-flips), one considers instead the folded site frequency spectrum ($\delta_{i,j}$ is the Kronecker delta)

$$\eta^{(n)} := \big(\eta_1^{(n)}, \ldots, \eta_{\lfloor n/2 \rfloor}^{(n)}\big) \quad \text{with} \quad \eta_i^{(n)} = \xi_i^{(n)} + (1 - \delta_{i,n-i})\,\xi_{n-i}^{(n)}, \quad i = 1, \ldots, \lfloor n/2 \rfloor.$$

¹ One can in fact attempt to base statistical inference on the likelihood of the full sequence data, see e.g. [10, 30, 66, 67] and references there. However, this is still computationally prohibitively expensive even for moderate sample sizes.
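The definitions of the unfolded and folded spectra can be sketched in a few lines of Python. The helper names `sfs_from_matrix` and `fold_sfs` and the toy data matrix are assumptions made for this illustration; the matrix is not the one from Figure 8.2.1.

```python
def sfs_from_matrix(rows):
    """Unfolded SFS from a 0/1 data matrix.

    rows: one list per sampled sequence, one entry per segregating site,
    with 1 marking the derived (mutant) state. Assumes known ancestral
    states and data compatible with the infinitely-many sites model.
    """
    n = len(rows)
    xi = [0] * (n - 1)
    for column in zip(*rows):
        count = sum(column)          # number of sequences carrying the mutation
        if 1 <= count <= n - 1:      # ignore non-segregating columns
            xi[count - 1] += 1
    return xi


def fold_sfs(xi):
    """Folded SFS: eta_i = xi_i + (1 - delta_{i, n-i}) * xi_{n-i}."""
    n = len(xi) + 1
    eta = []
    for i in range(1, n // 2 + 1):
        if i == n - i:
            eta.append(xi[i - 1])
        else:
            eta.append(xi[i - 1] + xi[n - i - 1])
    return eta


if __name__ == "__main__":
    toy = [[1, 0, 0], [1, 1, 0], [0, 0, 0]]   # hypothetical 3 x 3 data matrix
    print(sfs_from_matrix(toy))               # unfolded spectrum of the toy data
    print(fold_sfs([4, 2, 1, 0]))             # folding the spectrum (4, 2, 1, 0)
```

Folding only needs the unfolded counts, so the two steps are kept separate; with real data of unknown polarity one would count minor-allele frequencies directly instead.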


8.2.1 The expected site frequency spectrum

For a coalescent process $\Pi = \{\Pi_t\}_{t \ge 0}$ with mutation rate $\theta$ we denote its law by $\mathbb{P}^{\Pi,\theta}$, that is, the law of the coalescent process Π on which mutations appear along its branches at rate $\theta/2$. We denote the expectation corresponding to $\mathbb{P}^{\Pi,\theta}$ by $\mathbb{E}^{\Pi,\theta}$. Recall that the block-counting process $Y = \{Y_t\}_{t \ge 0}$ of the coalescent process Π,

$$Y_t := |\Pi_t|, \quad t \ge 0, \tag{8.2.1}$$

simply counts the number of ancestral lineages present at each time. Then a general representation of $\mathbb{E}^{\Pi,\theta}[\xi_i^{(n)}]$ for any coalescent model Π (see [37]) is

$$\mathbb{E}^{\Pi,\theta}\big[\xi_i^{(n)}\big] = \frac{\theta}{2} \sum_{k=2}^{n-i+1} p^{(n),\Pi}[k,i] \cdot k \cdot g(n,k) = \frac{\theta}{2} \sum_{k=2}^{n-i+1} p^{(n),\Pi}[k,i] \cdot k \cdot \mathbb{E}^{\Pi}\big[T_k^{(n)}\big], \tag{8.2.2}$$

$i \in [n-1]$, where $T_k^{(n)}$ is the random amount of time that $\{Y_t\}_{t \ge 0}$, starting from $Y_0 = n$, spends in state $k$, and $p^{(n),\Pi}[k,i]$ is the probability that, conditional on the event that $Y_t = k$ for some time point $t$, a given one of these $k$ blocks subtends exactly $i \in [n-1]$ leaves. Thus, in (8.2.2) mutations are classified according to the "level" $k$, which is the value of the block-counting process when they appear in the tree.

8.2.1.1 The block-counting process. For brevity, we consider only Λ-coalescents Π in this paragraph. We see from (8.1.2) that $Y$ corresponding to Π from (8.2.1) is itself a continuous-time Markov chain on $\mathbb{N}$ (as $q_{\pi,\pi'}$ depends on $\pi$ and $\pi'$ only through their numbers of blocks) with jump rates

$$q_{ij} = \binom{i}{i-j+1} \lambda_{i,i-j+1}, \quad i > j \ge 1.$$

The total jump rate away from state $i$ is $q_{ii} = \sum_{j=1}^{i-1} q_{ij}$. We will need the Green function of $Y$,

$$g(n,m) := \mathbb{E}_n\Big[\int_0^\infty \mathbf{1}_{(Y_s = m)}\,ds\Big] \quad \text{for } n \ge m \ge 2. \tag{8.2.3}$$

For the Kingman coalescent, we have $g(n,m) = \frac{2}{m(m-1)}$ for $m \le n$, and for the Bolthausen–Sznitman coalescent, explicit expressions can be obtained from [51]. In general, there is no explicit formula for (8.2.3), but decomposing according to the first jump of $Y$ gives a recursion for $g(n,m)$:

$$g(n,m) = \sum_{k=m}^{n-1} p_{nk}\, g(k,m), \quad n > m \ge 2; \qquad g(m,m) = \frac{1}{q_{mm}}, \quad m \ge 2, \tag{8.2.4}$$


where $p_{nk} := q_{nk}/q_{nn}$ are the transition probabilities of the embedded discrete skeleton chain.
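The recursion (8.2.4) is straightforward to evaluate numerically. The sketch below is one possible arrangement, assuming the merger rates are supplied as a function `lam(m, k)` returning $\lambda_{m,k}$; the names `jump_rates`, `green`, `kingman_lam` and `star_lam` are illustrative and not taken from the text.

```python
from math import comb


def jump_rates(n, lam):
    """q[i][j] = C(i, i-j+1) * lam(i, i-j+1) for i > j >= 1; q[i][i] = total rate."""
    q = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(2, n + 1):
        for j in range(1, i):
            q[i][j] = comb(i, i - j + 1) * lam(i, i - j + 1)
        q[i][i] = sum(q[i][1:i])      # total jump rate away from state i
    return q


def green(n, lam):
    """Table g[i][m] of the Green function for n >= i >= m >= 2 via (8.2.4)."""
    q = jump_rates(n, lam)
    g = [[0.0] * (n + 1) for _ in range(n + 1)]
    for m in range(2, n + 1):
        g[m][m] = 1.0 / q[m][m]       # mean holding time in state m
        for i in range(m + 1, n + 1):
            # decompose according to the first jump from i to k, m <= k < i
            g[i][m] = sum(q[i][k] / q[i][i] * g[k][m] for k in range(m, i))
    return g


def kingman_lam(m, k):
    """Lambda = delta_0: only binary mergers, each pair at rate 1."""
    return 1.0 if k == 2 else 0.0


def star_lam(m, k):
    """Lambda = delta_1: all m blocks merge at once at rate 1."""
    return 1.0 if k == m else 0.0


if __name__ == "__main__":
    g = green(6, kingman_lam)
    print([g[6][m] for m in range(2, 7)])
```

For the Kingman rates the table reproduces $g(n,m) = 2/(m(m-1))$, which makes a convenient correctness check before plugging in other choices of `lam`.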

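8.2.1.2 The expected SFS for Λ-coalescents. Decomposing according to the first jump of $Y$ corresponding to a Λ-coalescent Π, starting from $n$, yields a recursion for $p^{(n),\Lambda}[k,b]$:

Proposition 8.2.1 ([13, Propositions 1 and A.1]). For $1 < k \le n$, we have

$$p^{(n),\Lambda}[k,b] = \sum_{n'=k}^{n-1} p_{n,n'}\, \frac{g(n',k)}{g(n,k)} \bigg( \mathbf{1}_{(b > n-n')}\, \frac{b-(n-n')}{n'}\, p^{(n'),\Lambda}[k, b-(n-n')] + \mathbf{1}_{(b < n')}\, \frac{n'-b}{n'}\, p^{(n'),\Lambda}[k,b] \bigg) \tag{8.2.5}$$

with the boundary conditions $p^{(n),\Lambda}[n,b] = \delta_{1,b}$ and $p^{(n),\Lambda}[k,b] = 0$ for $b > n - (k-1)$.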

The terms on the right-hand side of (8.2.5) have a natural interpretation: The probability of seeing a jump from $n$ to $n'$, conditionally on hitting $k$, is $p_{n,n'}\, g(n',k)/g(n,k)$. Namely, by the Markov property of $Y$,

$$\frac{\mathbb{P}_n\{Y \text{ first jumps to } n' \cap Y \text{ hits } k\}}{\mathbb{P}_n\{Y \text{ hits } k\}} = p_{n,n'}\, \frac{\mathbb{P}_{n'}\{Y \text{ hits } k\}}{\mathbb{P}_n\{Y \text{ hits } k\}} = p_{n,n'}\, \frac{g(n',k)}{g(n,k)}.$$

Then, thinking "forwards in time from $n'$ lineages", either the initial $(n-n'+1)$-split occurred to one of the (then necessarily $b-(n-n')$) lineages subtended to the one we are interested in, or it occurs to one of the (then necessarily $n'-b$) others.

Specialising (8.2.2) to the case of a Λ-coalescent Π, combined with $\mathbb{E}^{\Pi}[T_k^{(n)}] = g(n,k)$ (with $g(n,k)$ from (8.2.3), which can be computed recursively via (8.2.4)) gives the following result.

Proposition 8.2.2. We have, for $i = 1, \ldots, n-1$,

$$\mathbb{E}^{\Lambda,\theta}\big[\xi_i^{(n)}\big] = \frac{\theta}{2} \sum_{k=2}^{n-i+1} p^{(n),\Lambda}[k,i] \cdot k \cdot g(n,k). \tag{8.2.6}$$
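To make the two propositions concrete, here is a minimal sketch that combines the Green function recursion (8.2.4), the recursion (8.2.5) and the sum (8.2.6). It assumes the boundary behaviour that a block at level $k = n$ is a single leaf and that a block at level $k$ can subtend at most $n-(k-1)$ leaves; the function name `expected_sfs` and the rate function argument `lam(m, k)` are illustrative choices, and the memoised recursion is written for clarity rather than speed.

```python
from functools import lru_cache
from math import comb


def expected_sfs(n, lam, theta=1.0):
    """Expected unfolded SFS of a Lambda-coalescent with rates lam(m, k)."""
    # jump rates of the block-counting process and total rates out of each state
    q = {(i, j): comb(i, i - j + 1) * lam(i, i - j + 1)
         for i in range(2, n + 1) for j in range(1, i)}
    tot = {i: sum(q[i, j] for j in range(1, i)) for i in range(2, n + 1)}

    @lru_cache(maxsize=None)
    def g(i, m):
        # Green function via (8.2.4)
        if i == m:
            return 1.0 / tot[m]
        return sum(q[i, k] / tot[i] * g(k, m) for k in range(m, i))

    @lru_cache(maxsize=None)
    def p(i, k, b):
        # probability that a given block at level k subtends b of the i leaves
        if b < 1 or b > i - (k - 1):
            return 0.0
        if i == k:
            return 1.0 if b == 1 else 0.0
        if g(i, k) == 0.0:            # level k is never visited from i
            return 0.0
        total = 0.0
        for j in range(k, i):         # first jump of Y goes from i to j
            w = q[i, j] / tot[i] * g(j, k) / g(i, k)
            d = i - j                 # leaves added to the block that split
            if b > d:
                total += w * (b - d) / j * p(j, k, b - d)
            if b < j:
                total += w * (j - b) / j * p(j, k, b)
        return total

    return [theta / 2.0 * sum(p(n, k, i) * k * g(n, k)
                              for k in range(2, n - i + 2))
            for i in range(1, n)]


def kingman_lam(m, k):
    """Lambda = delta_0: only binary mergers."""
    return 1.0 if k == 2 else 0.0


if __name__ == "__main__":
    print(expected_sfs(6, kingman_lam, theta=2.0))
```

With Kingman rates the output can be compared against the classical value $\theta/i$; for large $n$ an iterative table-based version would avoid deep recursion.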

It is interesting to see that the expected site-frequency spectra differ significantly for the various coalescent models. In Figure 8.2.2, we compare the folded expected frequency spectra of a Kingman and a Beta-coalescent. We also include the frequency spectrum of mtDNA data for Atlantic cod from [1] (1278 sequences). The fit of the Beta-coalescent to the real dataset is striking, see [14] for a discussion.

Remark 8.2.3. (1) For a Λ-coalescent Π there are analogous recursions for variances $\mathrm{Var}^{\Pi}[\xi_i^{(n)}]$ and covariances $\mathrm{Cov}^{\Pi}[\xi_i^{(n)}, \xi_j^{(n)}]$, see [14, Theorem 2].

(2) For the Kingman case, we have

$$p^{(n),\delta_0}[k,b] = \frac{\binom{n-b-1}{k-2}}{\binom{n-1}{k-1}} \quad \text{and} \quad \mathbb{E}^{\delta_0,\theta}\big[\xi_i^{(n)}\big] = \frac{\theta}{i},$$


Figure 8.2.2. The folded frequency spectrum (white bars) of the data of [1] along with predictions of the Kingman coalescent (light-grey), and the Beta$(2-\hat\alpha, \hat\alpha)$-coalescent (dark-grey), where $\hat\alpha = 1.5$ is the best fit estimated from the data according to [14]. Vertical lines represent the standard deviation, obtained for the Beta$(2-\hat\alpha, \hat\alpha)$-coalescent from $10^5$ iterations. Class "11" represents the collated tail of the spectrum, from 11 to 1278/2. Horizontal axis: folded site frequency spectrum index. Adapted from [14, Figure 11].

as computed by Fu [32]. For general Λ-coalescents, no closed expressions for (8.2.5), (8.2.6) are known. However, the recursions can easily be solved numerically, even for $n$ in the hundreds.

(3) The computation of the expected SFS through (8.2.6) is natural and conceptually appealing. We note however that there are now numerically more efficient alternatives, either via a spectral decomposition of the jump rate matrix of $Y$ as in Spence et al. [65] or via an interpretation as a multivariate phase-type distribution as in Hobolth et al.'s approach [43].

(4) For Λ-coalescents with "strong $\alpha$-regular variation" near 0 (i.e. $\Lambda(dx) = f(x)\,dx$ with $f(x) \sim A x^{1-\alpha}$ as $x \downarrow 0$ for some $A \in (0,\infty)$, and this includes the Beta$(2-\alpha, \alpha)$-coalescent from Example 8.1.2), [6, Theorem 8] shows

$$\xi_i^{(n)} \sim \frac{\theta}{2}\, n^{2-\alpha}\, C_{\alpha,i} \quad \text{a.s.}$$

with an explicit constant $C_{\alpha,i}$. However, the convergence in $n$ can be quite slow, see [13, Figure 8] and the discussion there.

(5) Using similar arguments, one can derive recursion formulas for the expectation and covariances of the site frequency spectrum under Ξ-coalescents. See [17, 65].


We see from (8.2.8) below and the following discussion that the SFS is closely allied to the distribution of branch lengths in coalescents. Asymptotic results for such lengths are a focus of the contribution of Kersting and Wakolbinger [47] in this volume. For example, see [22, 23] for the asymptotic behaviour of $B^{(n)}$ (the total branch length for sample size $n$) and of $B_1^{(n)}$ (the total branch length of the leaves) for very general coalescents and [20] for the fluctuations of $(B_1^{(n)} - \mathbb{E}[B_1^{(n)}])/n^{1-\alpha+1/\alpha}$ for Beta$(2-\alpha, \alpha)$-coalescents with $1 < \alpha < 2$. For the Bolthausen–Sznitman coalescent and some "relatives", corresponding to $\alpha = 1$, [23] obtain the asymptotic behaviour as $n \to \infty$ of $B_i^{(n)}$ for any $i \in \mathbb{N}$, see Theorem 11.2.10 in the contribution of Kersting and Wakolbinger [47] in this volume.

The question of the theoretical identifiability of coalescent models from the expected site frequency spectrum has been treated in [65]. For example, for Λ-coalescents, the first $n-2$ moments of the measure Λ can be determined from the expected SFS with sample size $n$ and vice versa.

8.2.2 Inference methods based on the site-frequency spectrum

8.2.2.1 Inference of mutation rates and real-time embeddings. When analysing data based on the SFS, one often needs to infer the underlying mutation rate first. Hence we begin this subsection with a brief discussion of this estimation and its consequences for the real-time embedding (assuming a "molecular clock") of our coalescent models. Estimating $\theta$ (or $\theta/2$) is often done via the (analogue of) the Watterson estimator. Here, as pointed out e.g. in [28], it is important to understand that the choice of a multiple merger coalescent model Π strongly affects this estimate. We illustrate this with an example. Assume w.l.o.g. for all multiple merger coalescents in question that the underlying coalescent measure Λ is always a probability measure: This normalisation fixes the coalescent time unit as the expected time to the most recent common ancestor of two individuals sampled uniformly from the population. Given an observed number of segregating sites $S$ in a sample of size $n$, a common (and unbiased) estimate $\hat\theta^{\Pi}$ of the scaled mutation rate $\theta$ in the coalescent scenario Π is the Watterson estimate

$$\hat\theta^{\Pi} := \frac{2S}{\mathbb{E}^{\Pi}[B^{(n)}]}, \tag{8.2.7}$$

where again $\mathbb{E}^{\Pi}[B^{(n)}]$ is the expectation of the total tree length $B^{(n)}$ of an (n-)coalescent model Π. One can compute for example $\mathbb{E}^{\Pi}[B^{(n)}] = \sum_{k=2}^{n} k\, g(n,k)$ with the Green function $g(n,k)$ from (8.2.3). Now with the estimate $\hat\theta^{\Pi}$, given knowledge of the substitution rate $\hat\mu$ per year at the locus under consideration, one can obtain an approximate real-time embedding of the coalescent history via

$$\text{coalescence time unit} \approx \frac{\hat\theta^{\Pi}/2}{\hat\mu} \times \text{year},$$

Genealogies and inference for multiple merger coalescents


cf. [66, Section 4.2], which of course depends on the law $P^\Pi$ of the $\Pi$-coalescent via the expected value $E^\Pi[B^{(n)}]$. See also [71] for a study of the related concept of “effective population size”. Given a Cannings population model of fixed size $N$ as discussed in Section 8.1.3, let $c_N$ be the probability that two gene copies, drawn uniformly at random and without replacement from a population of size $N$, derive from a common parental gene copy in the previous generation. While for the usual haploid Wright–Fisher model $c_N = 1/N$, in the class (B) from Section 8.1.3, $c_N$ is proportional to $1/N^{\alpha-1}$, for $1 < \alpha \le 2$. By the limit theorem for Cannings models of [52], one coalescent time unit corresponds to approximately $1/c_N$ generations in the original model with population size $N$. Thus the mutation rate $\tilde\mu$ at the locus under consideration per individual per generation must be scaled with $1/c_N$, and the relation between $\tilde\mu$, the coalescent mutation rate $\theta^\Pi/2$ and $c_N$ is then given by the (approximate) identity $c_N \approx 2\tilde\mu/\theta^\Pi$. In particular, if a Cannings model class (and thus $c_N$ as a function of $N$) is given, the “effective population size” $N$ can then be estimated.

8.2.2.2 Approximate likelihood functions based on the SFS. Since mutations in our models occur as a Poisson process along the branches of a coalescent tree, for $k = (k_1, k_2, \dots, k_{n-1})$ with $|k| = \sum_{i=1}^{n-1} k_i = s$, the true likelihood function is

$$\begin{aligned}
L((\Pi,\theta), k) &= P^{\Pi,\theta}\big\{\xi_i^{(n)} = k_i,\ i \in [n-1]\big\} \\
&= E^\Pi\bigg[\prod_{i=1}^{n-1} e^{-\frac{\theta}{2} B_i^{(n)}} \frac{\big(\frac{\theta}{2} B_i^{(n)}\big)^{k_i}}{k_i!}\bigg] \\
&= E^\Pi\bigg[e^{-\frac{\theta}{2} B^{(n)}} \frac{\big(\frac{\theta}{2} B^{(n)}\big)^{s}}{s!} \cdot \frac{s!}{k_1! \cdots k_{n-1}!} \prod_{i=1}^{n-1} \bigg(\frac{B_i^{(n)}}{B^{(n)}}\bigg)^{k_i}\bigg],
\end{aligned} \tag{8.2.8}$$

where $B_i^{(n)}$ is the random length of branches subtending $i \in [n-1]$ leaves and $B^{(n)} = B_1^{(n)} + \cdots + B_{n-1}^{(n)}$ is the total branch length of the $n$-coalescent tree $\Pi$. Equation (8.2.8) is in general not expressible as a simple formula involving the coalescent parameters; it is in principle straightforwardly approximable via a “naive” Monte Carlo approach but this is computationally very expensive even for moderate sample sizes. We note that Sainudiin and Véber [60] implement a clever approach to computing the expectation in (8.2.8) via importance sampling in the case of the Kingman coalescent (including variable population size and geographic structure); as far as we know, there is currently no study analogous to [60] that would include multiple merger coalescents. Let us discuss an approximate likelihood function based on the so-called “fixed-$s$-method”. The idea is to treat the observed number of segregating sites as a fixed


parameter $s \in \mathbb{N}$, not as (realisation of) a random variable $S$. This approximation appears quite common in the population genetics literature, see [28] and references there. Consider

$$E^\Pi\bigg[\frac{s!}{k_1^{(n)}! \cdots k_{n-1}^{(n)}!} \prod_{i=1}^{n-1} \bigg(\frac{B_i^{(n)}}{B^{(n)}}\bigg)^{k_i^{(n)}}\bigg] \tag{8.2.9}$$

(i.e. we take only the last term inside the expectation in (8.2.8)); this corresponds to uniformly and independently throwing $s$ mutations on the coalescent tree. An approximation is

$$L(\Pi, k^{(n)}, s) \approx \frac{s!}{k_1^{(n)}! \cdots k_{n-1}^{(n)}!} \prod_{i=1}^{n-1} \big(\varphi_i^{\Pi,(n)}\big)^{k_i^{(n)}}, \tag{8.2.10}$$

where we replaced the random quantities $B_i^{(n)}/B^{(n)}$ in (8.2.9) by the expected normalised branch lengths

$$\varphi_i^{\Pi,(n)} = \frac{E^\Pi[B_i^{(n)}]}{E^\Pi[B^{(n)}]}. \tag{8.2.11}$$

Equation (8.2.10) motivates the following family of “approximate” (in a twofold sense: regarding both fixing $s$ and exchanging expectation of a fraction with a fraction of expectations) likelihood functions

$$\begin{aligned}
\tilde L(\Pi, \xi^{(n)}; s) &= \prod_{i=1}^{n-1} \exp\Big(-\frac{\hat\theta(\Pi,s)}{2} E^\Pi[B^{(n)}]\, \varphi_i^{\Pi,(n)}\Big) \frac{\big(\frac{\hat\theta(\Pi,s)}{2} E^\Pi[B^{(n)}]\, \varphi_i^{\Pi,(n)}\big)^{\xi_i^{(n)}}}{\xi_i^{(n)}!} \\
&= \prod_{i=1}^{n-1} \exp\big(-s \varphi_i^{\Pi,(n)}\big) \frac{\big(s \varphi_i^{\Pi,(n)}\big)^{\xi_i^{(n)}}}{\xi_i^{(n)}!},
\end{aligned} \tag{8.2.12}$$

where $\hat\theta(\Pi, s) = 2s/E^\Pi[B^{(n)}]$ is the Watterson estimator for the mutation rate under a $\Pi$-coalescent with $n$ leaves when $S = s$ segregating sites are observed, recall (8.2.7). In (8.2.12), we view $s$ as a parameter rather than as observed data, noting that $\tilde L$ is well defined even if $|\xi^{(n)}| \ne s$. Note that for a principled approach to remove the dependence on the “nuisance parameter” $\theta$, one could follow [8]. However, this is computationally very costly in the context of MMCs and we do not pursue it here. For further discussion see [28]. Equation (8.2.12) is a practical starting point for testing and parameter inference for multiple merger coalescent models; in particular, it can be evaluated (and optimised) numerically very easily even for large sample sizes $n \gg 1$. Let us also remark that (8.2.11) can also be the starting point for inference based on minimum-distance statistics, see [14].
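As a minimal sketch of how (8.2.7), (8.2.11) and (8.2.12) might be evaluated numerically, the following Python code treats the Kingman coalescent, for which $E[B_i^{(n)}] = 2/i$ in the time normalisation used here, so that $E[B^{(n)}] = 2\sum_{i=1}^{n-1} 1/i$. The function names are hypothetical and are not taken from the software accompanying [28]; for other coalescent models the vector of expected branch lengths would have to be supplied separately (for instance by recursion or simulation).

```python
import math

def kingman_expected_branch_lengths(n):
    """E[B_i^(n)] = 2/i for i = 1..n-1 under the Kingman coalescent
    (pair coalescence rate 1). Other models need their own values."""
    return [2.0 / i for i in range(1, n)]

def normalised_branch_lengths(expected_lengths):
    """phi_i = E[B_i^(n)] / E[B^(n)] as in (8.2.11)."""
    total = sum(expected_lengths)
    return [b / total for b in expected_lengths]

def watterson_estimate(s, expected_total_length):
    """theta-hat = 2 s / E[B^(n)] as in (8.2.7)."""
    return 2.0 * s / expected_total_length

def log_approx_likelihood(xi, phi, s):
    """Logarithm of (8.2.12): independent Poisson(s * phi_i) counts.
    Assumes s > 0 and all phi_i > 0."""
    total = 0.0
    for count, p in zip(xi, phi):
        mean = s * p
        total += -mean + count * math.log(mean) - math.lgamma(count + 1)
    return total

if __name__ == "__main__":
    n = 5
    xi = [6, 2, 1, 1]  # hypothetical observed SFS with n - 1 entries
    s = sum(xi)
    lengths = kingman_expected_branch_lengths(n)
    phi = normalised_branch_lengths(lengths)
    print(watterson_estimate(s, sum(lengths)))
    print(log_approx_likelihood(xi, phi, s))
```

Working on the log scale avoids underflow of the product in (8.2.12) for large $n$; maximising over a model parameter then amounts to recomputing `phi` for each candidate model.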


8.2.3 Can one distinguish population growth from multiple merger coalescents?

We now employ the approximate likelihood functions from the previous section to construct a likelihood-ratio test for model selection. While this method has also been employed to select between various $\Xi$-coalescent models (see [13]), it can also be used to distinguish between different “evolutionary forces” leading to non-Kingman-like variability in the SFS. As an example, we discuss a scenario where the underlying population in question has undergone an exponential population increase as in [28]. Consider a haploid Wright–Fisher model with population size $N$ at generation $r = 0$ and size $N(r) = N(1 + \beta/N)^{-r}$ in generation $r$ before the present. This is in fact a special case of the set-up in [46] and we obtain in the limit, by speeding up time with a factor $N$ as usual, a Kingman-coalescent whose coalescence rates grow exponentially, equal to $e^{\beta s}$ at time $s$. Such a time-changed Kingman coalescent satisfies equation (8.2.2). A population which has undergone a recent rapid increase should produce an excess of singletons in the SFS compared to model (K), which is a pattern also observed for Beta-coalescents. Similarly, Tajima’s $D$ (a classical test statistic in the Kingman context, see [70, Section 4.3]) would tend to be significantly negative under both model classes. Our aim is to construct a statistical test to distinguish between the model classes (E) and (B) (which intersect exactly in (K)). In order to distinguish (E) from (B), based on an observed site-frequency spectrum $\xi^{(n)}$ with sample size $n$ and $S = |\xi^{(n)}|$ segregating sites, a natural approach is to construct a likelihood-ratio test.

Suppose our null-hypothesis $H_0$ is presence of recent exponential population growth (E) with (unknown) parameter $\beta \in [0, \infty)$, and we wish to test it against the alternative hypothesis $H_1$ of a multiple merger coalescent, say, the Beta$(2-\alpha,\alpha)$-coalescent (B) for (unknown) $\alpha \in [1, 2]$, where $\beta = 0$ and $\alpha = 2$ correspond to the Kingman coalescent. The coalescent mutation rate $\theta$ is not directly observable, but plays the role of a nuisance parameter. By fixing $S = s$ and treating it as a parameter of our test, we may consider the pair of hypotheses

$$\begin{aligned}
H_0^s &: \Pi \in \Theta^E_s := \{\text{Kingman coal., growth parameter } \beta : \beta \in [0, \infty)\}, \\
H_1^s &: \Pi \in \Theta^B_s := \{\text{Beta}(2-\alpha,\alpha)\text{-coalescent} : \alpha \in [1, 2]\}.
\end{aligned}$$

We can construct an “approximate likelihood-ratio” test based on the function $L(\Pi, \xi^{(n)}, s)$ introduced in the previous section via

$$\varrho_{(E,B;s)}(\xi^{(n)}) := \frac{\sup\{L(\Pi, \xi^{(n)}, s),\ \Pi \in \Theta^E_s\}}{\sup\{L(\Pi, \xi^{(n)}, s),\ \Pi \in \Theta^B_s\}}. \tag{8.2.13}$$

Given a significance level $a \in (0, 1)$ (say, $a = 0.05$), let $\varrho_{(E,B;s)}(a)$ be the $a$-quantile of $\varrho_{(E,B;s)}(\xi^{(n)})$ under (E), chosen as the largest value so that

$$\sup_{\Pi \in \Theta^E_s} P^{\Pi,s}\big\{\varrho_{(E,B;s)}(\xi^{(n)}) \le \varrho_{(E,B;s)}(a)\big\} \le a. \tag{8.2.14}$$


The decision rule that constitutes the “fixed-$s$-likelihood-ratio test”, given $s$ and sample size $n$, is

$$\text{reject } H_0^s \iff \varrho_{(E,B;s)}(\xi^{(n)}) \le \varrho_{(E,B;s)}(a).$$

The corresponding power function of the test, that is the probability to reject a false null-hypothesis, is given by

$$G_{(E,B;s)}(\Pi) = P^\Pi\big\{\varrho_{(E,B;s)}(\xi^{(n)}) \le \varrho_{(E,B;s)}(a)\big\}, \quad \Pi \in \Theta^B_s.$$

Alternatively, even though $\tilde L(\cdot, \cdot; s)$ from (8.2.12) is not literally a likelihood function of any model from $H_0^s \cup H_1^s$, we can consider the statistic $\tilde\varrho_{(E,B)}(\xi^{(n)})$, where we replace $L(\Pi, \xi^{(n)}, s)$ by $\tilde L(\Pi, \xi^{(n)}, |\xi^{(n)}|)$ in (8.2.13). For a given value of $s$, we can then (by simulations using the fixed-$s$-approach) determine approximate quantiles $\tilde\varrho_{(E,B;s)}(a)$ associated with a significance level $a$ as in (8.2.14), and base our test on the criterion $\tilde\varrho_{(E,B)}(\xi^{(n)}) \le \tilde\varrho_{(E,B;s)}(a)$. Similarly, the (approximate) power function

$$\tilde G_{(E,B;s)} = P^\Pi\big\{\tilde\varrho_{(E,B;s)}(\xi^{(n)}) \le \tilde\varrho_{(E,B;s)}(a)\big\} \tag{8.2.15}$$

for $\Pi \in \Theta^B_s$ can be estimated using simulations. See the discussion in [28] and in particular [28, Figure 2] (a part of which we reproduce in Figure 8.2.3 below). For example, if the “truth” was a Beta$(2-\alpha,\alpha)$-coalescent with $\alpha = 1.5$, the power of a test of this form with significance level 5 % to reject $H_0^s$ (the null hypothesis of a Kingman model with exponential growth) based on a (single-locus) sample of size $n = 500$ would be about 75 %. Note that the power is reasonably high for $\alpha \le 1.5$, say, but decays to the nominal level as $\alpha \to 2$. The boundary case $\alpha = 2$ in the class of Beta$(2-\alpha,\alpha)$-coalescents is the Kingman coalescent, after all.
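The decision rule and the simulation-based quantile can be sketched as follows. This is an illustrative Python outline, not the implementation used in [28]: it assumes that each hypothesis has been discretised to a finite grid and that, for every grid point, the vector of expected normalised branch lengths from (8.2.11) is already available. The names `phis_null`, `phis_alt` and all function names are hypothetical. The code works with the logarithm of the ratio, which leaves the test unchanged because the logarithm is monotone.

```python
import math
from bisect import bisect_right

def log_approx_likelihood(xi, phi, s):
    """Logarithm of (8.2.12) for SFS xi, normalised lengths phi, parameter s.
    Assumes s > 0 and all phi_i > 0."""
    return sum(-s * p + c * math.log(s * p) - math.lgamma(c + 1)
               for c, p in zip(xi, phi))

def log_ratio_statistic(xi, phis_null, phis_alt):
    """log of the ratio in (8.2.13) with L replaced by L-tilde and s = |xi|;
    the suprema run over finite parameter grids."""
    s = sum(xi)
    best_null = max(log_approx_likelihood(xi, phi, s) for phi in phis_null)
    best_alt = max(log_approx_likelihood(xi, phi, s) for phi in phis_alt)
    return best_null - best_alt

def critical_value(null_statistics, a):
    """Largest simulated value q whose empirical probability of
    {statistic <= q} is at most a, mimicking (8.2.14); -inf if none."""
    values = sorted(null_statistics)
    m = len(values)
    best = float("-inf")
    for v in values:
        if bisect_right(values, v) / m <= a:
            best = v
    return best

def reject_null(statistic, crit):
    """Fixed-s decision rule: reject H0 iff statistic <= critical value."""
    return statistic <= crit

def estimated_power(alt_statistics, crit):
    """Fraction of statistics simulated under the alternative that reject."""
    return sum(reject_null(t, crit) for t in alt_statistics) / len(alt_statistics)
```

Here `critical_value` handles statistics simulated under a single null model; to respect the supremum over the null grid in (8.2.14), one would compute it for each grid point separately and keep the smallest of the resulting values.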

8.3 Multiple loci, diploidy and $\Xi$-coalescents

8.3.1 A diploid bi-parental multi-locus model

We model a population of $N$ diploid individuals. Each carries two chromosome copies, and each chromosome consists of $L$ loci. In a reproduction event, two randomly chosen parents produce a random number $\Psi^{(N)}$ of offspring, and these replace as many randomly chosen individuals; $\Psi^{(N)}$ is drawn afresh for each event. Each child inherits one (possibly recombined) chromosome from each parent according to the Mendelian laws; we assume that during meiosis, a crossover recombination between locus $\ell$ and $\ell + 1$ happens with probability $r_\ell^{(N)}$ for $\ell \in [L-1]$. See Figure 8.3.1 for an illustration, and the contribution of Baake and Baake [3] in this volume for the recombination equation and the corresponding ancestral recombination graph in a law of large numbers regime.

Example 8.3.1. For a concrete example, assume that $P(\Psi^{(N)} = \lceil \psi N \rceil) = c/N^2$ and $P(\Psi^{(N)} = 1) = 1 - c/N^2$ with $\psi \in (0, 1)$, $c > 0$. This leads to model (xEW).

Figure 8.2.3. Estimate of $\tilde G_{(E,B;s)}$ from (8.2.15) based on (8.2.12) as a function of $\alpha$ (horizontal axis, from 1.0 to 2.0; vertical axis: power, from 0.0 to 1.0) with $n = 500$ and $s = 50$. The symbols denote the size of the test (levels 0.01, 0.05 and 0.1), cf. legend. The hypotheses are discretised to $\Theta^E_s = \{\beta : \beta \in \{0, 1, 2, \dots, 10, 20, \dots, 1000\}\}$ and $\Theta^B_s = \{\alpha : \alpha \in \{1, 1.025, \dots, 2\}\}$. Here, the Beta$(2-\alpha,\alpha)$-coalescent is the alternative. Image adapted from [28, Figure 2].

Figure 8.3.1. Schematic illustrations of the population model described in Section 8.3.1. Top: $\Psi^{(N)}$ children of a single pair are created. Bottom left: Schematic illustration of crossing over (an important step in the biochemical mechanism of recombination, picture adapted from Thomas Hunt Morgan, A Critique of the Theory of Evolution, Princeton University Press, 1916). Bottom center: A possible recombination event in producing a child. Bottom right: Transmission of genetic information to the $\Psi^{(N)}$ children (which can include recombination).


Let $c_N := E[\Psi^{(N)}(\Psi^{(N)} + 3)]/(N(N-1))$ (this is four times the pair coalescence probability for two randomly chosen chromosomes) and assume that

$$\frac{c_N}{E[\Psi^{(N)}/N]} = \frac{E[\Psi^{(N)}(\Psi^{(N)} + 3)]}{(N-1)\, E[\Psi^{(N)}]} \xrightarrow[N \to \infty]{} 0 \tag{8.3.1}$$

(which implies that also $c_N \to 0$) and that there exists a probability measure $\Lambda$ on $[0, 1]$ such that

$$\frac{1}{c_N} P\big\{\Psi^{(N)} > Nx\big\} \xrightarrow[N \to \infty]{} \int_{(x,1]} \frac{1}{y^2}\, \Lambda(dy) \tag{8.3.2}$$

for all continuity points $x \in (0, 1]$ of $\Lambda$. Furthermore,

$$r_\ell^{(N)} \sim \frac{c_N\, r^{(\ell)}}{4\, E[\Psi^{(N)}/N]} \quad \text{as } N \to \infty \tag{8.3.3}$$

with fixed $r^{(\ell)} \in [0, \infty)$ for $\ell = 1, \dots, L-1$.
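For the two-point offspring law of Example 8.3.1, the quantities entering (8.3.1) can be computed exactly, which gives a quick numerical check of the separation of time scales. The sketch below is illustrative only: the function names and the parameter values `psi = 0.5`, `c = 1.0` are assumptions chosen for the example, with `psi` standing for the jump-size parameter of Example 8.3.1.

```python
import math

def two_point_moments(N, psi, c):
    """Exact E[Psi] and E[Psi (Psi + 3)] when Psi = ceil(psi * N) with
    probability c / N**2 and Psi = 1 otherwise (Example 8.3.1)."""
    p_big = c / N**2
    if not 0.0 <= p_big <= 1.0:
        raise ValueError("c / N**2 must be a probability")
    big = math.ceil(psi * N)
    mean = big * p_big + 1 * (1.0 - p_big)
    second = big * (big + 3) * p_big + 1 * 4 * (1.0 - p_big)
    return mean, second

def time_scales(N, psi, c):
    """Return (c_N, E[Psi]/N, ratio); the ratio is the left side of (8.3.1)."""
    mean, second = two_point_moments(N, psi, c)
    c_N = second / (N * (N - 1))
    child_prob = mean / N
    return c_N, child_prob, c_N / child_prob

if __name__ == "__main__":
    for N in (100, 1000, 10000):
        print(N, time_scales(N, psi=0.5, c=1.0))
```

In this example the ratio behaves like $(\psi^2 c + 4)/N$ for large $N$, so it tends to zero as (8.3.1) requires.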

Remark 8.3.2. Note that $E[\Psi^{(N)}/N]$ is the probability that (after a given reproduction event) a randomly chosen individual from the current population is a child. Then (8.3.1) ensures that “separation of time scales” occurs: The “short” time-scale $1/E[\Psi^{(N)}/N]$ on which sampled chromosomes paired in the same individual disperse into two different individuals carrying only one sampled chromosome each is much smaller than the “long” time-scale $1/c_N$ over which we observe non-trivial ancestral coalescences. This lies “behind” Proposition 8.3.3 below. For the classification of general diploid models (in the single-locus context), we refer to [16], see also the article by Sturm [68] in this volume.

8.3.2 The $\Xi$-ancestral recombination graph

Consider a sample of $n$ chromosomes (which could be taken from $n/2$ sampled individuals, say), each of which carries $L$ loci. We need some notation to describe the ancestral states: A possible configuration has the form $\xi = \{C_1, C_2, \dots, C_b\}$ with $b \in [n]$, where $C_i = (\tilde C_{i,1}, \tilde C_{i,2}, \dots, \tilde C_{i,L})$ with $\tilde C_{i,1}, \dots, \tilde C_{i,L} \subset [n]$ and not all equal to $\emptyset$, such that for $\ell = 1, \dots, L$ we have $\bigcup_{i=1}^{b} \tilde C_{i,\ell} = [n]$ and for $i \ne i'$, $\tilde C_{i,\ell} \cap \tilde C_{i',\ell} = \emptyset$. Here, $\tilde C_{i,\ell}$ contains the indices of those samples for which the chromosome $C_i$ in the current configuration is ancestral at the $\ell$-th locus. Thus, for each locus $\ell$, $\{\tilde C_{1,\ell}, \dots, \tilde C_{b,\ell}\}$ is a partition of $[n]$ (with a grain of salt: it may contain $\emptyset$’s). We write $\mathcal{A}$ for the set of all configurations of this form. We remark that in order to properly describe the dynamics of ancestral configurations for finite population size $N$, $\mathcal{A}$ is in fact not completely sufficient and has to be “enriched” by information about the grouping of ancestral chromosomes into diploid individuals. However, because of the separation of time scales described in Remark 8.3.2, this becomes irrelevant for the limit process.
We will not go into details here and refer to [13], see also [36, 44] in the context of Kingman’s coalescent.


From $\xi \in \mathcal{A}$, possible transitions lead to

$$\mathrm{pairmerge}_{i_1,i_2}(\xi) = \big\{C_1, \dots, C_{i_1-1}, \hat C_{i_1}, C_{i_1+1}, \dots, C_{i_2-1}, C_{i_2+1}, \dots, C_b\big\}$$

with $\hat C_{i_1} = (\tilde C_{i_1,1} \cup \tilde C_{i_2,1}, \dots, \tilde C_{i_1,L} \cup \tilde C_{i_2,L})$, a merger of the pair $C_{i_1}$ and $C_{i_2}$,

$$\mathrm{groupmerge}_J(\xi) = \big\{C^1, C^2, C^3, C^4, C_j,\ j \in [b] \setminus (J_1 \cup J_2 \cup J_3 \cup J_4)\big\}$$

with $J_1, \dots, J_4 \subset [b]$ pairwise disjoint and at least one $|J_i| \ge 3$ or at least two of the $|J_i| \ge 2$. Here, $C^m = \big(\bigcup_{i \in J_m} \tilde C_{i,1}, \bigcup_{i \in J_m} \tilde C_{i,2}, \dots, \bigcup_{i \in J_m} \tilde C_{i,L}\big)$ for $m = 1, 2, 3, 4$, a simultaneous multiple merger in (up to) four groups, and

$$\mathrm{recomb}_{i,\ell}(\xi) = \big\{C_1, \dots, C_{i-1}, C_i', C_i'', C_{i+1}, \dots, C_b\big\}$$

with $C_i' = (\tilde C_{i,1}, \tilde C_{i,2}, \dots, \tilde C_{i,\ell}, \emptyset, \dots, \emptyset)$, $C_i'' = (\emptyset, \dots, \emptyset, \tilde C_{i,\ell+1}, \tilde C_{i,\ell+2}, \dots, \tilde C_{i,L})$, a recombination event splitting the $i$-th chromosome in the configuration between locus $\ell$ and locus $\ell + 1$. Note that as mentioned above, both in the pairmerge and the groupmerge operations, “empty” entries $(\emptyset, \emptyset, \dots, \emptyset)$ may arise, which then need to be removed; see [13] for details. The limiting genealogical process will then be a continuous-time Markov chain $\{\xi(t)\}_{t \ge 0}$ on $\mathcal{A}$ with generator matrix $q$ whose off-diagonal elements are given by

$$q_{\xi,\xi'} = \begin{cases} \dots & \text{if } \xi' = \mathrm{pairmerge}_{j_1,j_2}(\xi), \\ \dots \end{cases} \tag{8.3.4}$$

[…]

Proposition 8.3.3. Let $\{\xi^{n,N}(m)\}_{m \ge 0}$ be the ancestral process of a sample of $n$ chromosomes in a population of size $N$ with offspring laws $\mathcal{L}(\Psi^{(N)})$ satisfying (8.3.1) and (8.3.2), and assume the scaling relation (8.3.3). Then

$$\big\{\xi^{n,N}(\lfloor 4t/c_N \rfloor)\big\} \to \{\xi(t)\} \quad \text{as } N \to \infty, \tag{8.3.5}$$

where the process $\{\xi(t)\}$ is the Markov chain with generator matrix (8.3.4).
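To make the bookkeeping behind the pairmerge and recomb transitions concrete, here is a small Python sketch in which a configuration is a list of chromosomes and each chromosome is a tuple of $L$ frozensets of sample labels. The representation, the 0-based chromosome indices and the function names are choices made for this illustration and are not taken from [13]; group mergers and the transition rates of the limiting chain are not modelled.

```python
def initial_configuration(n, L):
    """n sampled chromosomes, each ancestral to itself at all L loci."""
    return [tuple(frozenset([i]) for _ in range(L)) for i in range(1, n + 1)]

def _drop_empty(config):
    """Remove chromosomes that are ancestral to no sample at any locus."""
    return [c for c in config if any(c)]

def pairmerge(config, i1, i2):
    """Merge chromosomes i1 < i2 (0-based): the locus-wise union replaces
    entry i1 and entry i2 is deleted."""
    if not 0 <= i1 < i2 < len(config):
        raise ValueError("need 0 <= i1 < i2 < len(config)")
    merged = tuple(a | b for a, b in zip(config[i1], config[i2]))
    new = config[:i1] + [merged] + config[i1 + 1:i2] + config[i2 + 1:]
    return _drop_empty(new)

def recomb(config, i, l):
    """Split chromosome i between loci l and l + 1 (loci numbered 1..L)."""
    chrom = config[i]
    L = len(chrom)
    if not 1 <= l <= L - 1:
        raise ValueError("need 1 <= l <= L - 1")
    empty = frozenset()
    left = chrom[:l] + (empty,) * (L - l)
    right = (empty,) * l + chrom[l:]
    return _drop_empty(config[:i] + [left, right] + config[i + 1:])

def is_valid(config, n):
    """At each locus the non-empty entries must partition {1, ..., n}."""
    L = len(config[0])
    for l in range(L):
        blocks = [c[l] for c in config if c[l]]
        if sum(len(b) for b in blocks) != n:
            return False
        if frozenset().union(*blocks) != frozenset(range(1, n + 1)):
            return False
    return True
```

For instance, starting from `initial_configuration(2, 2)`, calling `recomb(cfg, 0, 1)` splits the first chromosome into a part carrying locus 1 and a part carrying locus 2, and a subsequent `pairmerge` of the first and third entries yields a chromosome ancestral to both samples at locus 1 but only to sample 2 at locus 2.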


Figure 8.3.2. An illustration of the $\Xi$-ancestral recombination graph for two loci, with some transitions highlighted (the time axis carries the marked times $t_1$, $t_2$, $t_3$). At time $t_1$, a groupmerge-event occurs. At time $t_2$, there is a recomb-event and at time $t_3$, a pairmerge-event.

We refer to [13] for details, in particular the precise mode of convergence in (8.3.5) depending on whether or not the grouping of ancestral chromosomes into possibly “doubly marked individuals” is taken into account.

8.3.3 Towards a full SMMC multi-locus inference machinery

One can incorporate the (biologically important) effects of recombination, spatial subdivision, variable population size (e.g. growing populations), and/or (directional) selection into stochastic models for populations with highly skewed offspring distributions and derive corresponding (limiting) models for the joint genealogy of an $n$-sample observed at $L$ (possibly recombining) loci. The “full complexity” model is then a “structured $\Xi$-ancestral selection recombination graph”. While in principle highly relevant in view of today’s large scale datasets, an explicit description of the resulting full sampling distributions seems out of reach at present. One can however make progress on statistical questions by employing low-dimensional summary statistics. One approach, inspired by the results from Section 8.2.2, is to use suitable lumpings of the normalised site frequency spectra and average these over the observed loci: Let

$$\zeta_1(\ell) := \frac{\xi_1(\ell)}{|\xi(\ell)|}, \qquad \bar\zeta_k(\ell) := \sum_{j=k}^{n-1} \frac{\xi_j(\ell)}{|\xi(\ell)|}$$

be the proportion of singletons and the proportion of mutations visible in at least $k \ge 2$ copies at the $\ell$-th locus, respectively.

$$(\zeta_1, \bar\zeta_k) := \frac{1}{L} \sum_{\ell=1}^{L} \big(\zeta_1(\ell), \bar\zeta_k(\ell)\big) \tag{8.3.6}$$

is a two-dimensional summary of the data whose distribution under a given coalescent model $\Pi$ with mutation parameter $\theta > 0$

$$L\big(\Pi, \theta; (z_1, \bar z_k)\big) := P^{\Pi,\theta}\big((\zeta_1, \bar\zeta_k) = (z_1, \bar z_k)\big) \tag{8.3.7}$$


is generally not known explicitly, but $(\zeta_1, \bar\zeta_k)$ can be simulated readily under $(\Pi, \theta)$. Then the function $(z_1, \bar z_k) \mapsto L(\Pi, \theta; (z_1, \bar z_k))$ from (8.3.7) can be approximated by a kernel estimator based on $M$ independent replicates:

$$\hat L\big(\Pi, \theta; (z_1, \bar z_k)\big) := \frac{1}{Mh} \sum_{m=1}^{M} K\Big(\frac{1}{h}\big((\zeta_1, \bar\zeta_k) - (\zeta_1, \bar\zeta_k)^{(m)}\big)\Big), \tag{8.3.8}$$

where $(\zeta_1, \bar\zeta_k)^{(m)}$ is the value of (8.3.6) computed from the $m$-th simulation and $K$ the kernel function (e.g. a Gaussian) with bandwidth $h > 0$. Given (8.3.8), testing and model selection analogous to Section 8.2.3 can now be based on the approximate likelihood ratio statistic

$$\frac{\sup_{(\Pi,\theta) \in \Theta_0} \hat L\big(\Pi, \theta; (z_1, \bar z_k)\big)}{\sup_{(\Pi,\theta) \in \Theta_1} \hat L\big(\Pi, \theta; (z_1, \bar z_k)\big)}, \tag{8.3.9}$$

where of course the critical value for a test of given size has to be determined by simulations. In practice, one can alleviate the two-dimensional optimisation problem in (8.3.9) by plugging in the Watterson estimator $\theta = \hat\theta^\Pi$ from (8.2.7) given coalescent model $\Pi$. This approach is pursued in [49], with promising initial results, see the discussion there and also Figure 8.3.3 below. It can also be extended to include the effects of selection, variable population sizes and spatial structure, see [50] for steps in this direction. Note that this is akin to approximate Bayesian computations (ABC), whose rôle in analyses of datasets in multiple merger contexts is described in the article by Freund [31] in this volume. Intuitively, although even unlinked loci are not independent under the skewed offspring distribution models from Section 8.3.2 (as observed in [13]), averaging over many loci does reduce sampling variability and is justified because the multiple merger mechanism affects all loci in the same way. This is in fact a distinguishing feature that explains why multi-locus data is useful to distinguish skewed offspring distributions from selective sweeps: The latter would only affect one locus at a time. The software used for this study is available under https://github.com/JereKoskela/Beta-Xi-Sim. Moreover, software for simulation and analysis of datasets in (S)MMC contexts can be found on Bjarki Eldon’s homepage http://page.math.tu-berlin.de/~eldon/programs.html.
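The summary statistic (8.3.6) and the kernel estimator (8.3.8) are straightforward to compute once simulated replicates are available. The following Python sketch assumes a standard two-dimensional Gaussian kernel and uses the normalisation $1/(Mh)$ as written in (8.3.8); a density-normalised two-dimensional estimator would use $1/(Mh^2)$ instead, which differs only by a constant factor for fixed $h$. The function names are hypothetical, and the simulator producing the replicates is not shown.

```python
import math

def locus_summary(sfs, k):
    """Proportion of singletons and proportion of mutations in at least k
    copies for one locus; sfs[j - 1] is the number of sites in j copies."""
    total = sum(sfs)
    if total == 0:
        raise ValueError("locus without segregating sites")
    return sfs[0] / total, sum(sfs[k - 1:]) / total

def summary_statistic(sfs_per_locus, k):
    """Average of the per-locus summaries over all loci, as in (8.3.6)."""
    pairs = [locus_summary(sfs, k) for sfs in sfs_per_locus]
    L = len(pairs)
    return (sum(p[0] for p in pairs) / L, sum(p[1] for p in pairs) / L)

def gaussian_kernel(u1, u2):
    """Standard bivariate Gaussian density (an assumed choice of K)."""
    return math.exp(-0.5 * (u1 * u1 + u2 * u2)) / (2.0 * math.pi)

def kernel_likelihood(z, replicates, h):
    """Kernel estimate in the form of (8.3.8), evaluated at the point z,
    from simulated summaries `replicates` with bandwidth h > 0."""
    M = len(replicates)
    acc = sum(gaussian_kernel((z[0] - r[0]) / h, (z[1] - r[1]) / h)
              for r in replicates)
    return acc / (M * h)
```

A statistic in the spirit of (8.3.9) would then be the ratio of the maxima of `kernel_likelihood` over the two parameter grids, with a separate set of replicates simulated for each grid point.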

8.4 Discussion: Are they really out there?

In the previous sections, we outlined population models and evolutionary scenarios that invite genealogical modelling via (S)MMC processes. Further, we presented some paradigmatic statistical tools for inference and model selection for (S)MMC processes, and our hope is that this could pave at least some of the way towards an answer to the initial

Figure 8.3.3. The empirical distribution of $(\zeta_1, \bar\zeta_k)$ from formula (8.3.6) is quite different under a Kingman coalescent with exponential growth (solid contours) compared to a 4-fold Beta coalescent (xB) (dashed contours). Horizontal axis: normalised singletons; vertical axis: normalised tail; the contours are labelled $\alpha = 1.9, 1.5, 1.1$ and $\beta = 0.1, 0.5, 1, 2, 3, 5, 10, 30$. Here, the sample size is $n = 100$, each sample considered at $L = 23$ loci, with cutoff parameter $k = 15$. Parameter values ($\alpha$ for the 4-fold Beta$(2-\alpha,\alpha)$ coalescent, $\beta$ for the exponential growth rate) are as shown. The contour lines are based on 5000 simulated replicates for each parameter choice: For this, mutation rates $\theta$ were chosen so that the expected number of segregating sites per locus equalled $s_{\text{expect}}^{(n),\Pi} = 10, 20, 30, 40, 50$ (cf. equation (8.2.7)), with 1000 replicates per value of $s_{\text{expect}}^{(n),\Pi}$. The pictures for a fixed value of $s_{\text{expect}}^{(n),\Pi}$ are almost indistinguishable from the one shown. The contours were computed using R [55] and the function kde from the contributed R-package ks [26], with default values for the bandwidths. They correspond to regions containing respectively 20 %, 40 %, 60 %, 80 % and 95 % of the simulated points.

question [29] whether (S)MMC coalescents are really more adequate null-models for real populations exhibiting highly skewed offspring distributions (or other forces leading to an “effective skew”, such as selective sweeps, severe bottlenecks, etc.). One of our main take-home messages is that the statistical power of such inference methods is usually much higher in (diploid) multi-locus setups than in (haploid) single-locus scenarios. However, it is the latter scenario in which MMC based inference methods have so far been applied in practice. For example, the results in [66] indicate that data generated under a Beta-coalescent can provide a better fit to observed genetic variability in Atlantic cod mitochondrial (thus haploid) DNA sequence data. The cited article also discusses to what extent different underlying coalescent models lead to different real-time estimates of the time to the most recent common ancestor of the sample. To some degree, it also appears possible to distinguish different evolutionary scenarios


such as a recent increase in population size, leading to a time-changed Kingman coalescent, from other coalescent scenarios, as reviewed in Sections 8.2.3 and 8.3.3. A very recent further study involving virus data (influenza) is [57], which employs purely-atomic MMCs (of class (EW)), again in a haploid setup. The authors here come to the conclusion that the (EW) coalescent can provide a “much more accurate neutral null model” in certain types of organisms including viruses and bacteria. However, the study seems to be restricted to a relatively small class of MMCs. We expect that a real test for the above methods will be in the framework of diploid multi-locus setups. A very interesting step in this direction is the recent work of Rice, Novembre and Desai [56] who propose a statistic based on the joint site frequency spectrum at two loci. This approach does not explicitly model multi-locus dynamics including recombination, but it can (quite straightforwardly) be scaled up to analyse genome-wide genetic variability and, as shown in [56], does shed a very interesting light on a Zambian population of fruit flies (Drosophila melanogaster). Furthermore, in this context, it is rather satisfying to see that the funding of the Icelandic Grant of Excellence “Population genomics of highly fecund codfish” has recently been awarded jointly to Árnason, Halldórsdóttir, Etheridge, and Stephan. Our hope is that this project will provide and analyse the necessary data on which the full multi-locus machinery can be tested. We will be curious to observe the outcomes.

Acknowledgements. The authors would like to thank Iulia Dahmer, Frederik Klement and Timo Schlüter for carefully reading the manuscript and for their helpful comments. We also thank Iulia Dahmer for her help in producing Figure 8.3.3 and two anonymous referees for their insightful comments which helped to improve the presentation of this article.

References

[1] E. Árnason, Mitochondrial cytochrome b DNA variation in the high-fecundity Atlantic cod: Trans-Atlantic clines and shallow gene genealogy, Genetics 166 (2004), 1871–1885.
[2] E. Árnason and K. Halldórsdóttir, Nucleotide variation and balancing selection at the Ckma gene in Atlantic cod: Analysis with multiple merger coalescent models, PeerJ 3 (2014), Article ID e786.
[3] E. Baake and M. Baake, Ancestral lines under recombination, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 43–67.
[4] N. Barton, A. Etheridge, and A. Véber, A new model for evolution in a spatial continuum, Electron. J. Probab. 15 (2010), 162–216.
[5] N. Berestycki, Recent Progress in coalescent theory, Ensaios Mat. 16 (2009), 1–193.
[6] J. Berestycki, N. Berestycki, and V. Limic, Asymptotic sampling formulae for Λ-coalescents, Ann. Inst. Henri Poincaré Probab. Stat. 50 (2014), 715–731.
[7] J. Berestycki, N. Berestycki, and J. Schweinsberg, The genealogy of branching Brownian motion with absorption, Ann. Probab. 41 (2013), 527–618.


[8] R. L. Berger and D. D. Boos, P values maximized over a confidence set for the nuisance parameter, J. Amer. Statist. Assoc. 89 (1994), 1012–1016.
[9] J. Bertoin and J.-F. Le Gall, Stochastic flows associated to coalescent processes, Probab. Theory Related Fields 126 (2003), 261–288.
[10] M. Birkner and J. Blath, Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model, J. Math. Biol. 57 (2008), 435–465.
[11] M. Birkner and J. Blath, Genealogies and inference for populations with highly skewed offspring distributions, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 247–265.
[12] M. Birkner, J. Blath, M. Capaldo, A. Etheridge, M. Möhle, J. Schweinsberg, and A. Wakolbinger, Alpha-stable branching and beta-coalescents, Electron. J. Probab. 10 (2005), 303–325.
[13] M. Birkner, J. Blath, and B. Eldon, An ancestral recombination graph for diploid populations with skewed offspring distribution, Genetics 193 (2013), 255–290.
[14] M. Birkner, J. Blath, and B. Eldon, Statistical properties of the site-frequency spectrum associated with Lambda-coalescents, Genetics 195 (2013), 1037–1053.
[15] M. Birkner, J. Blath, M. Möhle, M. Steinrücken, and J. Tams, A modified lookdown construction for the Xi-Fleming-Viot process with mutation and populations with recurrent bottlenecks, ALEA Lat. Am. J. Probab. Math. Stat. 6 (2009), 25–61.
[16] M. Birkner, H. Liu, and A. Sturm, Coalescent results for diploid exchangeable population models, Electron. J. Probab. 23 (2018), 1–44.
[17] J. Blath, M. Cronjaeger, B. Eldon, and M. Hammer, The site-frequency spectrum associated with Xi-coalescents, Theor. Popul. Biol. 10 (2016), 36–50.
[18] E. Bolthausen and A.-S. Sznitman, On Ruelle’s probability cascades and an abstract cavity method, Comm. Math. Phys. 197 (1998), 247–276.
[19] C. Cannings, The latent roots of certain Markov chains arising in genetics: A new approach, I. Haploid models. Adv. Appl. Probab. 6 (1974), 260–290.
[20] I. Dahmer, G. Kersting, and A. Wakolbinger, The total external branch length of Beta-coalescents, Combin. Probab. Comput. 23 (2014), 1010–1027.
[21] M. M. Desai, A. M. Walczak, and D. S. Fisher, Genetic diversity and the structure of genealogies in rapidly adapting populations, Genetics 193 (2013), 565–585.
[22] C. Diehl and G. Kersting, External branch lengths of Λ-coalescents without a dust component, Electron. J. Probab. 24 (2019), 1–36.
[23] C. Diehl and G. Kersting, Tree lengths for general Λ-coalescents and the asymptotic site frequency spectrum around the Bolthausen–Sznitman coalescent, Ann. Appl. Probab. 29 (2019), 2700–2743.
[24] P. Donnelly and T. Kurtz, A countable representation of the Fleming–Viot measure-valued diffusion, Ann. Probab. 24 (1996), 698–742.
[25] P. Donnelly and T. Kurtz, Particle representations for measure-valued population models, Ann. Probab. 27 (1999), 166–205.
[26] T. Duong, ks: Kernel smoothing, R package version 1.11.5 (2019), https://CRAN.R-project.org/package=ks


[27] R. Durrett and J. Schweinsberg, A coalescent model for the effect of advantageous mutations on the genealogy of a population, Stochastic Process. Appl. 115 (2005), 1628–1657.
[28] B. Eldon, M. Birkner, J. Blath, and F. Freund, Can the site-frequency spectrum distinguish exponential population growth from multiple-merger coalescents? Genetics 199 (2015), 841–856.
[29] B. Eldon and J. Wakeley, Coalescent processes when the distribution of offspring number among individuals is highly skewed, Genetics 172 (2006), 2621–2633.
[30] J. Felsenstein, M. K. Kuhne, J. Yamato, and P. Beerli, Likelihoods on coalescents: A Monte Carlo sampling approach to inferring parameters from population samples of molecular data, in: Statistics in Molecular Biology and Genetics (ed. F. Seillier-Moiseiwitsch), Institute of Mathematical Statistics, Hayward (1999), 163–185.
[31] F. Freund, Multiple-merger genealogies: Models, consequences, inference, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 179–202.
[32] Y. X. Fu, Statistical properties of segregating sites, Theor. Popul. Biol. 48 (1995), 172–197.
[33] A. Gnedin, A. Iksanov, and A. Marynych, Λ-coalescents: A survey, J. Appl. Probab. 51A (2014), 23–40.
[34] A. González Casanova, V. Miró Pina, and A. Siri-Jégousse, The symmetric coalescent and Wright–Fisher models with bottlenecks, preprint 2019, https://arxiv.org/abs/1903.05642.
[35] A. Greven, P. Pfaffelhuber, and A. Winter, Convergence in distribution of random metric measure spaces: The Λ-coalescent measure tree, Probab. Theory Related Fields 145 (2009), 285–322.
[36] R. C. Griffiths and P. Marjoram, An ancestral recombination graph, in: Progress in Population Genetics and Human Evolution (eds. P. Donnelly and S. Tavaré), Springer, New York (1997), 257–270.
[37] R. C. Griffiths and S. Tavaré, The age of a mutation in a general coalescent tree, Stoch. Models 14 (1998), 273–295.
[38] S. Gufler, A representation for exchangeable coalescent trees and generalized tree-valued Fleming–Viot processes, Electron. J. Probab. 23 (2018), 1–42.
[39] S. Gufler, Pathwise construction of tree-valued Fleming–Viot processes, Electron. J. Probab. 23 (2018), 1–58.
[40] D. Gusfield, Efficient algorithms for inferring evolutionary trees, Networks 21 (1991), 19–28.
[41] D. Hedgecock, Does variance in reproductive success limit effective population size of marine organisms? in: Genetics and Evolution of Aquatic Organisms (ed. A. R. Beaumont), Chapman & Hall, London (1994), 123–134.
[42] D. Hedgecock and A. I. Pudovkin, Sweepstakes reproductive success in highly fecund marine fish and shellfish: A review and commentary, Bull. Mar. Sci. 87 (2011), 971–1002.
[43] A. Hobolth, A. Siri-Jégousse, and M. Bladt, Phase-type distributions in population genetics, Theor. Popul. Biol. 127 (2019), 16–32.
[44] R. R. Hudson, Properties of a neutral allele model with intragenic recombination, Theor. Popul. Biol. 23 (1983), 183–201.

Matthias Birkner and Jochen Blath

176

[45] T. E. Huillet, Pareto genealogies arising from a Poisson branching evolution model with selection, J. Math. Biol. 68 (2014), 727–761. [46] I. Kaj and S. Krone, The coalescent process in a population with stochastically varying size, J. Appl. Probab. 40 (2003), 33–48. [47] G. Kersting and A. Wakolbinger, Probabilistic aspects of ƒ-coalescents in equilibrium and in evolution, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 223–245. [48] J. F. C. Kingman, The coalescent, Stochastic Process. Appl. 13 (1982), 235–248. [49] J. Koskela, Multi-locus data distinguishes between population growth and multiple merger coalescents, Stat. Appl. Genet. Mol. Biol. 17 (2018), Article ID 20170011. [50] J. Koskela and M. Wilke-Berenguer, Robust model selection between population growth and multiple merger coalescents, Math. Biosci. 311 (2019), 1–12. [51] M. Möhle and H. Pitters, A spectral decomposition for the block counting process of the Bolthausen–Sznitman coalescent, Electron. Commun. Probab. 19 (2014), Paper no. 47. [52] M. Möhle and S. Sagitov, A classification of coalescent processes for haploid exchangeable population models, Ann. Probab. 29 (2001), 1547–1562. [53] R. A. Neher and O. Hallatschek, Genealogies of rapidly adapting populations, Proc. Natl. Acad. Sci. USA 110 (2013), 437–442. [54] J. Pitman, Coalescents with multiple collisions, Ann. Probab. 27 (1999), 1870–1902. [55] R Core Team, A language and environment for statistical computing, R foundation for statistical computing, httpsW//www.R-project.org/. [56] D. P. Rice, J. Novembre, and M. M. Desai, Distinguishing multiple-merger from Kingman coalescence using two-site frequency spectra, preprint 2018, httpsW//www.biorxiv.org/ content/10.1101/461517v1. [57] A. M. Sackman, R. Harris, and J. D. Jensen, Inferring demography and selection in organisms characterized by skewed offspring distributions, Genetics 211 (2019), 1019–1028. [58] S. 
Sagitov, The general coalescent with asynchronous mergers of ancestral lines, J. Appl. Probab. 36 (1999), 1116–1125. [59] S. Sagitov, Convergence to the coalescent with simultaneous multiple mergers, J. Appl. Probab. 40 (2003), 839–854. [60] R. Sainudiin and A. Véber, Full likelihood inference from the site frequency spectrum based on the optimal tree resolution, Theor. Popul. Biol. 124 (2018), 1–15. [61] J. Schweinsberg, Coalescents with simultaneous multiple collisions, Electron. J. Probab. 5 (2000), 1–50. [62] J. Schweinsberg, Coalescent processes obtained from supercritical Galton–Watson processes, Stochastic Process. Appl. 106 (2003), 107–139. [63] J. Schweinsberg, Rigorous results for a population model with selection I: Evolution of the fitness distribution. Electron. J. Probab. 22 (2017), 1–94. [64] J. Schweinsberg, Rigorous results for a population model with selection II: Genealogy of the population, Electron. J. Probab. 22 (2017), 1–54.

Genealogies and inference for multiple merger coalescents

177

[65] J. P. Spence, J. A. Kamm, and Y. S. Song, The site frequency spectrum for general coalescents, Genetics 202 (2016), 1549–1561. [66] M. Steinrücken, M. Birkner, and J. Blath. Analysis of DNA sequence variation within marine species using Beta-coalescents, Theor. Popul. Biol. 87 (2013), 15–24. [67] M. Stephens and P. Donnelly, Inference in molecular population genetics, With discussion and a reply by the authors, J. R. Stat. Soc. Ser. B Stat. Methodol. 62 (2000), 605–655. [68] A. Sturm, Diploid populations and their genealogies, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 203–221. [69] A. Tellier and C. Lemaire, Coalescence 2.0: A multiple branching of recent theoretical developments and their applications, Mol. Ecol. 23 (2014), 2637–2652. [70] J. Wakeley, Coalescent Theory: An Introduction, Roberts, Greenwood Village, 2008. [71] J. Wakeley and O. Sargsyan, Extensions of the coalescent effective population size, Genetics 181 (2009), 341–345. [72] G. A. Watterson, On the number of segregating sites in genetical models without recombination, Theor. Popul. Biol. 7 (1975), 1539–1546.

Chapter 9

Multiple-merger genealogies: Models, consequences, inference

Fabian Freund

Trees corresponding to Λ- and Ξ-n-coalescents can be both quite similar and fundamentally different compared to bifurcating tree models based on Kingman’s n-coalescent. This has consequences for inference of a well-fitting gene genealogy as well as for assessing biological properties of species having such sample genealogies. Here, mathematical properties concerning clade sizes in the tree as well as changes of the tree when the samples are enlarged are highlighted. To be used as realistic genealogy models for real populations, an extension for changing population sizes is discussed.

9.1 Multiple-merger coalescents

Λ-n-coalescents and, shortly afterwards, the larger class of Ξ-n-coalescents were introduced two decades ago and have been an active area of mathematical research ever since. These are Markov processes $\Pi^{(n)} = (\Pi^{(n)}_t)_{t \ge 0}$ on the set of partitions of $\{1,\dots,n\}$. Transitions are mergers of partition blocks. The rates of a Ξ-n-coalescent are characterised by a finite measure Ξ with positive mass on the simplex

$$\Delta = \Big\{ x = (x_i)_{i\in\mathbb{N}} : x_1 \ge x_2 \ge \dots \ge 0,\ \sum_{i\in\mathbb{N}} x_i \le 1 \Big\}.$$

An intuitive way of describing the transitions is via a Poisson construction, see [49, 54]: Transitions are possible at times $t \in [0,\infty)$ that feature a point $(t,x)$ of a Poisson point process on $[0,\infty)\times\Delta$ with intensity $dt \otimes \Xi_0(dx)/(x,x)$, where $\Xi_0$ is the restriction of Ξ to $\Delta\setminus\{(0,0,\dots)\}$ and $(x,x) := \sum_{i\in\mathbb{N}} x_i^2$. At time $t$, consider the “paintbox” $x = (x_1,x_2,\dots)$: Each partition block $i$ present at $t$ independently draws a colour $C_i$, which paints it with colour $j\in\mathbb{N}$ with $P(C_i = j) = x_j$ and leaves it colourless with $P(C_i = 0) = 1 - \sum_{m\in\mathbb{N}} x_m$. Merge each set of blocks of the same colour; colourless blocks are left unchanged. If Ξ has positive mass on the origin, say $a = \Xi(\{(0,0,\dots)\})$, also allow additional transitions for each pair of blocks present with transition rate $a$. Such coalescents allow transitions that merge several sets of blocks, each of any size, into several new blocks simultaneously. If Ξ only has mass on points of Δ whose only non-zero coordinate is the first one, only a single set of blocks can be merged into one new block at a transition. This subclass of Ξ-n-coalescents is called Λ-n-coalescents, Λ being the restriction of Ξ to the first dimension of Δ, a finite measure on $[0,1]$.
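
The colouring step of the Poisson construction can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited sources: the function name `paintbox_merge`, the representation of blocks as frozensets and the restriction to a paintbox with finitely many non-zero coordinates are assumptions made for the example. It applies one given point $x$ to the blocks present and does not simulate the Poisson point process itself.

```python
import random
from collections import defaultdict


def paintbox_merge(blocks, x, rng):
    """Apply one paintbox x = (x_1, ..., x_d) to the blocks present.

    blocks: list of frozensets (hypothetical representation of the partition).
    x:      finite list of colour probabilities with sum(x) <= 1.
    rng:    a random.Random instance.
    """
    if any(xj < 0 for xj in x) or sum(x) > 1 + 1e-12:
        raise ValueError("x must be a point of the simplex")
    colours = list(range(len(x) + 1))          # colour 0 means "colourless"
    weights = [max(0.0, 1.0 - sum(x))] + list(x)

    by_colour = defaultdict(list)
    untouched = []
    for block in blocks:
        colour = rng.choices(colours, weights=weights)[0]
        if colour == 0:
            untouched.append(block)            # colourless blocks do not merge
        else:
            by_colour[colour].append(block)

    # All blocks that drew the same colour are merged into one new block.
    merged = [frozenset().union(*group) for group in by_colour.values()]
    return untouched + merged
```

For example, `paintbox_merge([frozenset({i}) for i in range(1, 6)], [0.5, 0.3], random.Random(1))` returns a coarser (or equal) partition of {1, ..., 5}. With a single positive coordinate, as for Λ-n-coalescents, at most one new block can be formed per event.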


For any Ξ, $\Pi^{(n)}$ almost surely reaches its absorbing state $\{\{1,\dots,n\}\}$ in finite time. For any fixed time $t$, the partition $\Pi^{(n)}_t$ of the Ξ-n-coalescent is exchangeable, i.e. $\sigma(\Pi^{(n)}_t) \overset{d}{=} \Pi^{(n)}_t$ for any permutation $\sigma$ of $\{1,\dots,n\}$. This extends to certain non-deterministic times, for instance the partition before absorption in $\{\{1,\dots,n\}\}$ is exchangeable. Moreover, there exists a Kolmogorov extension $(\Pi_t)_{t\ge 0}$ on the set of partitions of $\mathbb{N}$ so that its restriction to $[n] := \{1,\dots,n\}$ is a Ξ-n-coalescent for any $n\in\mathbb{N}$, see [54, Theorem 2]. The process $\Pi = (\Pi_t)_{t\ge 0}$ is called a Ξ-coalescent. Exchangeability extends to the Ξ-coalescent in the sense that any permutation of a finite set $S\subset\mathbb{N}$ does not change the distribution of Π. A consequence of this is that the limit frequencies $\lim_{n\to\infty} B^{(n)}_i(t)/n$ of the partition block sizes $B^{(n)}_i(t)$ of the restrictions of $\Pi_t$ to $[n]$ (e.g. ordered by sizes) exist almost surely. This is called Kingman’s correspondence (which is essentially de Finetti’s theorem, see e.g. [54, Appendix]). Ξ-n-coalescents and related models also appear as central objects in other chapters, see the contributions of Birkner and Blath [6], of Kersting and Wakolbinger [34] as well as of Sturm [59] in this volume.

The Poisson construction also shows fundamental differences between different choices of Ξ (respectively Λ). While choosing $\Lambda = \delta_0$ leads to Kingman’s n-coalescent, which only allows binary mergers, any Ξ with mass outside of $(0,0,\dots)$ allows for mergers of more than two blocks (the eponymous multiple mergers). If $\Xi_0(dx)/(x,x)$ is a finite measure, the Poisson point process almost surely has only finitely many points on any set $[0,t]\times\Delta$. For such Ξ, the corresponding coalescent processes are called simple Ξ-(n-)coalescents. If $\sum_{i\in\mathbb{N}} x_i\,\Xi_0(dx)/(x,x)$ is a finite measure, at least the number of mergers that the block containing a fixed $i\in\mathbb{N}$ can participate in is almost surely finite. The corresponding Ξ-(n-)coalescents are called coalescents with dust. All simple Ξ-(n-)coalescents are also coalescents with dust.

A further property of Ξ-coalescents is whether they stay infinite, i.e. almost surely have infinitely many blocks at any time, or whether they come down from infinity, i.e. almost surely have finitely many blocks at any time $t > 0$. All Ξ-coalescents with dust that do not allow for Poisson points where all blocks are coloured with finitely many colours, i.e. that have $\Xi(\{x\in\Delta : \sum_{i=1}^{k} x_i = 1 \text{ for a } k\in\mathbb{N}\}) = 0$, stay infinite, see e.g. [20, Section 3]. The case of Ξ-coalescents without dust is more complex, see e.g. [27]. We will need later that the times to absorption of Ξ-n-coalescents that do not stay infinite converge almost surely to a finite limiting variable for $n\to\infty$: this follows from [54, Lemma 31] if the coalescent comes down from infinity. Otherwise, the same lemma shows that Ξ has positive mass on Poisson points that colour all present blocks with finitely many colours. Thus, there is a finite waiting time for such a point, which then forces the number of blocks to be finite, and subsequently these blocks will merge into absorption after a finite time almost surely. One can even state this more strongly for both cases: Since only finitely many blocks merge, exchangeability ensures that, given $n$ large enough, all the blocks of the Ξ-coalescent merging at the last collision have at least one individual in the blocks of the Ξ-n-coalescent. Thus, if the Ξ-coalescent does not stay infinite, the height of the Ξ-n-coalescent is the same as for the Ξ-coalescent above a certain (path-dependent) $n_0$.


Any Ξ-n-coalescent also encodes a random ultrametric (labelled) tree with $n$ leaves, where ultrametric means that the path lengths from leaves to root are identical for all leaves. We build this tree from leaves to root: Start with $n$ edges (initially of length 0) at the leaves, corresponding to the blocks $\{1\},\dots,\{n\}$ at time 0. Elongate all edges by $l_1$, where $l_1$ is the waiting time for the first transition of $\Pi^{(n)}$. Then the edges corresponding to the sets of merged partition blocks of $\Pi^{(n)}$ at this transition are joined in new internal nodes (for Ξ-n-coalescents, a transition may consist of multiple simultaneous mergers). Start a new edge (of length 0) at each newly introduced node and again elongate all edges not yet connected to two nodes by $l_2$, the waiting time until the next transition of the n-coalescent. The new edges represent the blocks formed at the transition at time $l_1$. Then join all edges (not yet connected to two nodes) that correspond to the set(s) of blocks merged at the second transition in a new node each, and again start new edges (of length 0) from these nodes (representing the blocks newly formed at the second transition). Repeat this until all blocks are merged, which corresponds to the root of the tree.

Kingman’s n-coalescent was introduced by Kingman [35, 36] as an approximation of the genealogical tree of a sample of $n$ individuals from a haploid Wright–Fisher model with (large) population size $N$. Here, the discrete genealogy is defined by recording, for all $r\in\mathbb{N}$, the partition of $\{1,\dots,n\}$ into sets of individuals that have the same ancestor $r$ generations backwards from the time of sampling. This partition-valued process $(R^{(N)}_r)_{r\in\mathbb{N}}$ then converges for $N\to\infty$ when time is rescaled, i.e.

$$\big(R^{(N)}_{\lceil t c_N^{-1}\rceil}\big)_{t\ge 0} \xrightarrow{d} \big(\Pi^{(n)}_t\big)_{t\ge 0}, \tag{9.1.1}$$

where $\Pi^{(n)}$ is Kingman’s n-coalescent $\Pi^{(n),\delta_0}$ and $c_N$ is the probability that two individuals drawn at random from the same generation have the same parent in the reproduction model, so $c_N = N^{-1}$ in the Wright–Fisher model. For any Ξ-n-coalescent $\Pi^{(n),\Xi}$, there exists a sequence of Cannings models, i.e. haploid reproduction models of a population of fixed size $N$ in every generation with exchangeable offspring numbers (i.i.d. across generations), so that equation (9.1.1) holds for $\Pi^{(n)} = \Pi^{(n),\Xi}$, see [44]. While the Wright–Fisher model is a staple of population genetics, the standard model of a neutrally evolving fixed-size population, the Cannings models leading to other Ξ-n-coalescents are used more rarely. This is inherited by the coalescent process limits. The prevalence of Kingman’s n-coalescent also stems from its robustness: If a sequence of Cannings models with $c_N\to 0$ and non-extreme variance in offspring numbers across individuals satisfies equation (9.1.1) for some limit process $\Pi^{(n)}$, then $\Pi^{(n)}$ is Kingman’s n-coalescent, see [42]. Such models include the Moran model, where one random individual has two offspring, a different one has none and all others have one offspring.

However, there is growing evidence that for certain populations, Ξ-n-coalescents fit better as genealogy models. Necessarily, as discussed above, these need to be populations where ancestors are shared by many offspring. For instance, a biological mechanism leading to this is sweepstake reproduction, i.e. one “lucky” individual/genotype produces considerably more offspring on the coalescent time scale than the rest of the (potential) parents [25]. An example would be type III survivorship in marine species, where individuals reproduce with large offspring numbers, but high early-life mortality keeps the population size constant. This has been modelled in [55] as sampling $N$ actual offspring from the potential offspring (in reality, offspring dying in an early life stage) of $N$ parents, where the potential offspring numbers $(\eta^{(N)}_1,\dots,\eta^{(N)}_N)$ are i.i.d. with heavy tails, i.e. $P(\eta^{(N)}_1 > k) \sim C k^{-\alpha}$. If $\alpha\in[1,2)$, the limit in equation (9.1.1) from this model is a Beta-n-coalescent, i.e. a Λ-n-coalescent with $\Lambda = \mathrm{Beta}(2-\alpha,\alpha)$ being a Beta distribution. These coalescents have no dust; they come down from infinity for $\alpha > 1$ but stay infinite for $\alpha = 1$. While the Beta-n-coalescents model haploid sweepstake reproduction, diploid reproduction leads to a Ξ-n-coalescent limit, the Beta-Ξ-n-coalescent, see [8, 9].

In [16], a different haploid model of sweepstake reproduction in a population of fixed size $N$ was introduced: In a Moran model, with small probability $N^{-\gamma}$, $\gamma > 0$, the single individual having two offspring instead has $U = \lceil N\Psi\rceil$ offspring (while $U-1$ parents instead of one have no offspring) for $\Psi\in(0,1)$. If $\gamma\le 2$, the limit in equation (9.1.1) is a Λ-n-coalescent with

$$\Lambda = \begin{cases} \dfrac{2}{2+\Psi^2}\,\delta_0 + \dfrac{\Psi^2}{2+\Psi^2}\,\delta_\Psi & \text{if } \gamma = 2,\\[2mm] \delta_\Psi & \text{otherwise.} \end{cases}$$

The latter class is called Psi- or Dirac-n-coalescents.

For the class of Beta-n-coalescents, recent studies showed evidence that for samples from Japanese sardines and Atlantic cod, Beta-n-coalescents (or their diploid Ξ-n-coalescent counterparts) are fitting models, see [1, 9, 46, 58]. This needs a link between the usually unobserved genealogy and the observed genetic data. Most samples only include information about the leaves of the genealogy at time $t = 0$.
The link is usually given by modelling the differences in the DNA sequences of the sample at a region in the genome by tracing back the genealogy and the mutations upon it. Mutations on a branch are inherited by every leaf subtended by the branch and are interpreted under the infinite-sites model, i.e. each mutation hits a new position of the sequence. The mutation process is given by a homogeneous Poisson point process with rate $\theta/2$ on the branches of the coalescent tree, independent of this tree.

Another biological mechanism that may lead to multiple-merger genealogies is selection. In models where the reproductive success of individuals depends on their fitness and where, in order to survive and produce offspring with good survival chances, individuals need to stay close enough to an ever increasing fitness “threshold” (which may depend on the fitness of other individuals), the Bolthausen–Sznitman n-coalescent, a Λ-n-coalescent with Λ being the uniform distribution on $[0,1]$, emerges as a suitable limit genealogy model, see e.g. [4, 12, 13, 45, 56]. The latter two sources consider a model where the population gets fitter by pooling beneficial mutations of equal additive fitness gains; the genealogy limit there is a Bolthausen–Sznitman n-coalescent with its external branches elongated by a deterministic interval. Such types of selection are summarised under the term “rapid selection”: if an individual by chance has a very large fitness advantage over the rest of the population, its number of descendants is boosted long enough so that a multiple merger can appear, while over time this fitness advantage is erased (so the n-coalescent stays exchangeable). The Bolthausen–Sznitman n-coalescent is also the Beta-n-coalescent with $\alpha = 1$. Moreover, it can be constructed from a random recursive tree with $n$ nodes and its properties can thus be linked to the Chinese restaurant process, see [23]. For instance, the partition blocks added to the partition block including 1, starting in $\{1\}$ and eventually reaching $[n]$, can be seen as merging each of the $K$ cycles of a random permutation of $\{2,\dots,n\}$ (a Chinese restaurant process with $n-1$ customers) with the block containing 1 at i.i.d. $\mathrm{Exp}(1)$ times. Some genomic data sets from populations likely under strong selection have been argued to show patterns explained by the Bolthausen–Sznitman n-coalescent [47, 52], although no strict model testing as for the genealogy models for marine species listed above has been performed.

These are the two main scenarios where Ξ-n-coalescents are reasonable genealogy models. Other Ξ-n-coalescents also appear as genealogies in several additional contexts, see e.g. the reviews [30, 61]. This all shows that Λ- and Ξ-n-coalescents are a mathematically diverse class of Markov processes with a range of (potential) applications, but with sparse evidence so far for which range of species/populations they should be used as the standard sample genealogy model. Since Kingman’s n-coalescent has been the standard model for genealogies, with extensions e.g. for including population size changes and population structure, many population genetic predictions about populations are based on properties of Kingman’s n-coalescent. Thus, if the genealogy is given by a different Ξ-n-coalescent, these properties may change profoundly, with consequences for interpreting and handling the genetic diversity of such populations.
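
To make the link between a multiple-merger genealogy and mutation counts concrete, the following Python sketch simulates the block-counting process of a Beta(2−α, α)-n-coalescent and drops mutations on the tree as a Poisson process with rate θ/2 per unit branch length. It is an illustration under stated assumptions rather than a method from the cited sources: the function names are hypothetical, the rate formula $\lambda_{b,k} = B(k-\alpha, b-k+\alpha)/B(2-\alpha,\alpha)$ is the standard evaluation of $\int_0^1 x^{k-2}(1-x)^{b-k}\,\Lambda(dx)$ for this Λ, only $1\le\alpha<2$ is covered (so not Kingman's case), and the code is meant for moderate sample sizes.

```python
import math
import random


def beta_fn(a, b):
    """Euler Beta function B(a, b) for a, b > 0."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def merger_rate(b, k, alpha):
    """Rate at which a fixed set of k out of b blocks merges (1 <= alpha < 2)."""
    return beta_fn(k - alpha, b - k + alpha) / beta_fn(2 - alpha, alpha)


def simulate_segregating_sites(n, alpha, theta, rng):
    """Return (total branch length, number of mutations) for one sampled tree."""
    blocks, total_length = n, 0.0
    while blocks > 1:
        sizes = list(range(2, blocks + 1))
        rates = [math.comb(blocks, k) * merger_rate(blocks, k, alpha)
                 for k in sizes]
        wait = rng.expovariate(sum(rates))   # time until the next merger
        total_length += blocks * wait        # every present branch grows
        k = rng.choices(sizes, weights=rates)[0]
        blocks -= k - 1                      # k blocks merge into one
    # Poisson number of mutations with mean theta/2 * total_length,
    # generated through exponential gaps along the total branch length.
    mutations = 0
    if theta > 0:
        position = rng.expovariate(theta / 2)
        while position < total_length:
            mutations += 1
            position += rng.expovariate(theta / 2)
    return total_length, mutations
```

Calling `simulate_segregating_sites(20, 1.5, 2.0, random.Random(7))` gives one draw of the total tree length and of the number of segregating sites under the infinite-sites model; repeating the call for different α shows how the genealogy model changes the amount of observed variation.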

9.2 Modelling multiple mergers for variable population size

The pre-limiting Cannings models whose genealogies converge to Λ- and Ξ-n-coalescents are models of a sample taken from one generation in one population of fixed size $N$ (or $2N$ in the diploid case) across time. Moreover, the mutational model produces linked mutations, i.e. the genealogy stays the same for any part of the genomic region modelled. Any of these assumptions can (and often will) be violated for real populations. However, a series of extensions both for haploid and diploid models is available, some also developed within this Priority Programme, see the chapters by Sturm [59], Birkner and Blath [6], and Kurt and Blath [10] in this volume. For Λ-n-coalescents, recombination has been added in [7]. An approach for accounting for serial sampling, i.e. sampling in different generations, has been proposed in [28]. A general model for genealogies in diploid populations and any combination of standard positive selection, population structure through demes connected by migration, population size changes and recombination has been introduced in [38]. Their model extends the model for genealogies of diploid individuals when offspring distributions are skewed from [8], which satisfies equation (9.1.1) with $\Pi^{(n)}$ being the Beta-Ξ-n-coalescent.

Our focus here lies on modelling population size changes for Λ-n-coalescents as limits of haploid Cannings models. For Kingman’s n-coalescent $(\Pi^{(n),\delta_0}_t)_{t\ge 0}$, constructed as the limit of genealogies in the Wright–Fisher model, population size changes of order $N$ in the Wright–Fisher model lead to a time-changed coalescent limit, see [24, 32]. In more detail, assume the size $N_r$ of the population $r$ generations before sampling in a Wright–Fisher model ($r\in\mathbb{N}_0$) is deterministic and can be characterised, for $N = N_0\to\infty$, by an existing positive limit

$$\nu(t) = \lim_{N\to\infty} \frac{N_{\lceil t c_N^{-1}\rceil}}{N} > 0 \tag{9.2.1}$$

for all $t \ge 0$. The limit of the discrete coalescents of $[n]$, rescaled by $c_N = N^{-1}$ from the fixed population size model, is then

$$\big(R^{(N)}_{[c_N^{-1} t]}\big)_{t\ge 0} \xrightarrow{d} \big(\Pi_{g(t)}\big)_{t\ge 0} \tag{9.2.2}$$

(in the Skorohod sense) for $g(t) = \int_0^t \nu(s)^{-1}\,ds$ and $(\Pi_t)_{t\ge 0} = (\Pi^{(n),\delta_0}_t)_{t\ge 0}$. For instance, consider the specific case

$$N_r = \lfloor N(1-\rho c_N)^r \rfloor \ \text{ for } r\in\mathbb{N}_0,\ \rho > 0 \implies \nu(t) = e^{-\rho t}, \tag{9.2.3}$$

i.e. on the timescale of the coalescent limit we have exponential growth of the population size with rate $\rho$. For this scenario, we call the limit process Kingman’s n-coalescent with exponential growth. Additionally, for the Wright–Fisher model with non-constant population sizes, the genealogical relationship is clearly established: it is still defined as “offspring chooses parent at random from parent generation”. Similarly time-changed Λ- and Ξ-n-coalescents have been proposed as reasonable genealogy models for populations with skewed offspring distribution and fluctuating population sizes, e.g. in [57]. However, an explicit construction as a limit of Cannings models with fluctuating population sizes has only been shown for the special case of Dirac n-coalescents with exponential growth on the coalescent time scale, i.e. population size changes given by (9.2.3) in the underlying Cannings models. Presenting an explicit construction in general (for more multiple-merger n-coalescents and more general population size changes) is not (always) as straightforward as for Wright–Fisher models. The convergence itself is essentially covered by the machinery from [43], but some care is needed for defining the genealogical relationship. The idea from [40] is to use a specific class of Cannings models, the modified Moran models, where the genealogical relationship can be easily established and the convergence can be shown by verifying the conditions from [43].

The modified Moran model is defined as follows from parent to offspring generation. In the parent generation, one individual is picked at random that has $U \ge 2$ offspring. The standard Moran model fixes $U = 2$, whereas the modified Moran model lets $U$ be a random variable. Then, $U-1$ randomly chosen other individuals from the parent generation have no offspring, while the remaining parent individuals have exactly one offspring. For fixed population size, in order that there can be a continuous-time coalescent limit for the discrete genealogies, one needs $c_N\to 0$ for $N\to\infty$. For modified Moran models, this is equivalent to $N^{-1}U_N \xrightarrow{d} 0$ for $N\to\infty$, see [29]. An example of such a modified Moran model was shown in the first section, the sweepstake reproduction model from [16]. Based on this model, [40] added exponential growth as in equation (9.2.3), by having one individual $r$ generations back from sampling have

$$U_{N,r} = \max\big\{N_{r-1}-N_r,\ 2\cdot 1_{\{R_r > N_r^{-\gamma}\}} + \lceil N_r\Psi\rceil\, 1_{\{R_r \le N_r^{-\gamma}\}}\big\}$$

individuals as offspring for $0 < \gamma < 2$, $\Psi\in(0,1)$, and $(R_r)_{r\in\mathbb{N}}$ i.i.d. uniform on $[0,1]$. In the same parent generation, a random set of $U_{N,r}-1$ other individuals have no offspring, while all others have exactly one offspring each. Then, as proven in [40], [43, Theorem 2.2] ensures that (9.2.2) holds for a time-changed Dirac coalescent $(\Pi^{(n),\delta_\Psi}_{g(t)})_{t\ge 0}$ as limit, but for $g(t) = \int_0^t \exp(\rho\gamma s)\,ds$. This dependence of the timescale on the specific way the Cannings model is defined is noteworthy, since for fixed population size across generations, the coalescent limit of the discrete genealogies is not changed by the specific choice of $\gamma$. This may also prove problematic for parameter inference, since inferring the growth parameter $\rho$ requires knowledge of $\gamma$. This nullifies one strength of many coalescent approaches, namely that the specificities of the pre-limit models do not change the coalescent limit. However, at least calibrating the per-generation mutation rate $\mu_N$ in the discrete Cannings model, which leads to a mutation rate $\theta/2 = \lim_{N\to\infty} c_N^{-1}\mu_N$ in the coalescent limit, does not depend on the fluctuation of the population sizes, but just on the Cannings models.
In [19], I showed that this approach can be generalised to allow much more general modified Moran models and other Cannings models with fluctuating population sizes whose genealogies converge, after rescaling by $c_N = c_{N_0}$, to a time-changed Λ-n-coalescent, as in equation (9.2.2). Essentially, this boils down to verifying that [43, Theorem 2.2] can be applied. For any Λ, one can use the modified Moran models constructed in [29, Proposition 3.4], whose genealogies for constant population size across generations converge to the Λ-n-coalescent with the usual rescaling of time. These are defined either by distributing $U_N$ as the size of the block being produced at the first merger of a Λ-N-coalescent (variant A) or by using $U_N 1_B + 2(1-1_B)$ instead, where $1_B$ is independent of $U_N$ with $P(B) = N^{-\gamma}$ for $\gamma\in(1,2)$ (variant B). For changing population sizes, it now needs to be defined what the genealogical relation between parents and offspring is within this model. Let $U_{N_r}$ be the offspring number in generation $r$ for the modified Moran model with constant size $N_r$. If there is a reduction in size from $N_r$ to $N_{r-1}$ from one generation to the next, one can just sample down from the $N_r$ offspring that the fixed-size model produces, resulting again in a modified Moran model. For increases in population size, the model stays a modified Moran model if additional individuals are added as further offspring of the parent having $U_{N_r}$ offspring or as single offspring of parents not reproducing in the fixed-size model (since $U_N \ge 2$, at least one individual can be added with the second method). Then, for any Λ and for any time change function $\nu$ as in equation (9.2.1) there exist such modified Moran models with population sizes $(N_r)_{r\ge 0}$ satisfying equation (9.2.1) so that the discrete genealogies, properly scaled with a time-inhomogeneous function (related to $c_N$, but not just scaling by $c_N$), converge to a Λ-n-coalescent limit. Under mild additional assumptions this is equivalent to the convergence as described in equation (9.2.2) with a different time change $g$; for a certain class of $\nu$ and Λ one needs no additional assumptions at all. See the following two sample results from [19, Theorem 1 and Corollary 1]. The first one includes the case of modified Moran models converging to a time-changed Dirac coalescent from [40]. In both propositions, additional individuals can be added in any way so that the resulting model is still a modified Moran model.

Proposition 9.2.1. Let $U_N$ be distributed as the first jump of a Λ-N-coalescent. If $(N-1)^{-1}E(U_N(U_N-1)) \not\to 0$ for $N\to\infty$, define the modified Moran model via variant B. Then, for any positive function $\nu$, there exist population sizes satisfying equation (9.2.1) so that the discrete genealogies of the modified Moran model with variable population sizes converge as in equation (9.2.2) with $g(t) = \int_0^t (\nu(s))^{-\gamma}\,ds$ and $(\Pi_t)_{t\ge 0}$ is a Λ-n-coalescent.

Proposition 9.2.2. Let Λ be a $\mathrm{Beta}(a,b)$-distribution with $a\in(0,1)$ and $b > 0$. Let $\nu\colon \mathbb{R}_{\ge 0}\to\mathbb{R}_{>0}$. Then there exist population sizes satisfying equation (9.2.1) for $\nu$ so that the genealogies $(R^{(N)}_r)_{r\in\mathbb{N}_0}$ of the modified Moran model with variable population sizes (variant A) fulfil equation (9.2.2) with $g(t) = \int_0^t (\nu(s))^{a-2}\,ds$ and $(\Pi_t)_{t\ge 0}$ is a $\mathrm{Beta}(a,b)$-n-coalescent.

For the standard Moran model, the second proposition holds for $a = 0$ and the limit coalescent is a time-changed Kingman n-coalescent, see [19, Proposition 1].
Thus, when comparing to equation (9.2.2) for the Wright–Fisher model, the same population size changes $\nu$ on the coalescent time scale lead to different time changes of the limit Kingman n-coalescent. Similarly, for a given Λ and $\nu$, there can be other Cannings models whose genealogies, for constant population size, have the same coalescent limit, but where having the same population size profile $\nu(t)$ on the coalescent time scale still leads to different time scalings $g$. For instance, consider Schweinsberg’s Cannings model from [55], whose genealogies converge to Beta-n-coalescents, as described in the first section. The potential offspring produced from a parent generation in the fixed-size setting are enough, at least with probability tending to 1 for $N\to\infty$, to also sample an increased number of offspring from these, for any population size changes allowed by (9.2.1). For decreasing population size, just sampling fewer offspring of course also works. Again, [43, Theorem 2.2] can be applied. If, for constant population size, the discrete genealogies of $[n]$ converge to the $\mathrm{Beta}(2-\alpha,\alpha)$-n-coalescent for $1\le\alpha<2$, then when adding population size changes described by $\nu$, the time change in equation (9.2.2) is given by $g(t) = \int_0^t (\nu(s))^{1-\alpha}\,ds$, see [19, Theorem 2.8]. Again, this is a different timescale than in the case of the modified Moran models above. Even more disturbingly, for $\alpha = 1$ the distribution of the limit genealogy is invariant under any population size change allowed by equation (9.2.1).
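
The model dependence of the time change can be made tangible numerically. The sketch below is an illustration only: the helper names `time_change` and `exponents` are hypothetical, and the trapezoidal rule is simply a convenient way to approximate $g(t)=\int_0^t \nu(s)^{e}\,ds$. The exponents $e$ are those stated in this section: $-1$ for the Wright–Fisher model, $a-2$ for the modified Moran model (variant A) with $\Lambda=\mathrm{Beta}(a,b)$, which gives $-2$ for the standard Moran model and $-\alpha$ for $a = 2-\alpha$ with $\alpha\in(1,2)$, and $1-\alpha$ for Schweinsberg's model.

```python
import math


def time_change(t, nu, exponent, steps=10_000):
    """Approximate g(t) = int_0^t nu(s)**exponent ds by the trapezoidal rule."""
    h = t / steps
    total = 0.5 * (nu(0.0) ** exponent + nu(t) ** exponent)
    total += sum(nu(i * h) ** exponent for i in range(1, steps))
    return h * total


def exponents(alpha):
    """Exponent of nu in g(t) for the pre-limit models discussed in the text."""
    return {
        "Wright-Fisher (Kingman limit)": -1.0,
        "standard Moran (Kingman limit)": -2.0,
        "modified Moran, variant A, Beta(2-alpha, alpha)": -alpha,
        "Schweinsberg model, Beta(2-alpha, alpha)": 1.0 - alpha,
    }


if __name__ == "__main__":
    rho = 1.0                                  # growth rate, as in (9.2.3)
    nu = lambda s: math.exp(-rho * s)          # exponential growth profile
    for model, e in exponents(alpha=1.5).items():
        print(f"{model}: g(1) = {time_change(1.0, nu, e):.4f}")
```

For the same profile $\nu(t)=e^{-\rho t}$, the four models give four different values of $g(1)$. Setting the exponent to 0, which is the Schweinsberg case with $\alpha=1$, returns $g(t)=t$, matching the invariance noted above.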

9.3 How much genetic information is contained in a subsample?

For managing genetic resources in crops, gene banks hold many accessions of a crop, e.g. as seeds (here an accession can be understood as an individual plant sampled at random, see [18] for a thorough definition). However, resources of gene banks are limited and thus not all individual plants can be stored. Individuals grown from such seeds can be used as crossing partners to introduce genetic variation not yet present into a breeding program. In terms of genealogies, new variation is available if the genealogy of the material already in the breeding program together with the gene bank material has additional branches with mutations as compared to the genealogy of the material used in the breeding program. How much do different genealogy models affect this amount of added genetic variation? Mathematically, this corresponds to assessing the overlap of the nested genealogies of a sample of size $n$ and an arbitrary subsample of size $m$ (and the mutations on the non-overlapping branches).

Assume that the genealogy of the $n$ sampled individuals is given by a Ξ-n-coalescent. From the Poisson construction, it follows that the genealogy of the subsample is a Ξ-m-coalescent with the same measure Ξ, a property called natural coupling. Due to exchangeability, for questions concerning the distribution of shared aspects of the genealogy it does not matter which individuals are in the subsample and in the sample, thus we always assume the subsample of size $m$ is $[m]$. For Kingman’s n-coalescent, questions about how much of the genealogy is covered have been discussed in [53]. For instance, the subsample’s genealogy covers the root of the sample’s genealogy with probability $p^{(\delta_0)}_{n,m} = \frac{m-1}{m+1}\cdot\frac{n+1}{n-1}$ for $m\le n$, $n\ge 2$. If the root is shared, any mutations on the two branches starting in the root are also present in the subsample.
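
For Kingman's n-coalescent, the closed form can be checked by conditioning on the first (binary) merger, using the conventions that a fully contained subsample shares the root and that a subsample reduced to a single lineage before the sample does not. The Python sketch below is such a check in exact rational arithmetic; the function names are hypothetical and the recursion is an assumption of this illustration (the binary-merger special case of first-jump conditioning), not code from the cited works.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def root_shared_kingman(n, m):
    """P(subsample [m] shares the root of the sample [n]) for Kingman's tree."""
    if n == m:
        return Fraction(1)          # subsample equals sample
    if m == 1:
        return Fraction(0)          # one subsample lineage left, but n >= 2
    pairs = comb(n, 2)              # all pairs are equally likely to merge
    within_sub = comb(m, 2)         # both merging blocks are subsample blocks
    other = pairs - within_sub      # at most one of them is a subsample block
    return (Fraction(within_sub, pairs) * root_shared_kingman(n - 1, m - 1)
            + Fraction(other, pairs) * root_shared_kingman(n - 1, m))


def closed_form(n, m):
    """The closed-form probability (m-1)/(m+1) * (n+1)/(n-1), for n >= 2."""
    return Fraction(m - 1, m + 1) * Fraction(n + 1, n - 1)
```

For instance, `root_shared_kingman(3, 2)` equals `Fraction(2, 3)`: of the three possible first mergers, the two that join a subsample lineage with the remaining lineage lead to a shared root.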
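Together with Eldon [15], I showed that this probability can also be computed recursively for any $\Lambda$-$n$-coalescent as
$$p^{(\Lambda)}_{n,m} = \sum_{k=2}^{n} \lambda(n,k) \sum_{\ell=0}^{k\wedge m} \frac{\binom{m}{\ell}\binom{n-m}{k-\ell}}{\binom{n}{k}}\; p^{(\Lambda)}_{n-k+1,\,m'}, \tag{9.3.1}$$
where
$$\lambda(n,k) = \frac{\binom{n}{k}\lambda_{n,k}}{\sum_{j=2}^{n}\binom{n}{j}\lambda_{n,j}}, \qquad \lambda_{n,k} = \int_0^1 x^{k-2}(1-x)^{n-k}\,\Lambda(dx),$$
so that $\lambda(n,k)$ is the probability of transition by merging any $k$ of the $n$ blocks present, and
$$m' = (m-\ell+1)\,1_{\{\ell>1\}} + m\,1_{\{\ell\le 1\}}.$$
As boundary conditions, we have $p^{(\Lambda)}_{k,k} = 1$ for $k \in \mathbb{N}$ and $p^{(\Lambda)}_{k,1} = 0$ for $k \ge 2$.

[Figure 9.3.1: plot not reproduced. Horizontal axis: $\alpha$ from 1.0 to 2.0; vertical axis: $p^{(\mathrm{Beta}(2-\alpha,\alpha))}_{n,m}$; curves for $m=100$, $n=1000$; $m=10$, $n=100$; $m=10$, $n=1000$.]
Figure 9.3.1. The probability $p^{(\Lambda)}_{n,m}$ of sharing the root between sample and subsample for $\mathrm{Beta}(2-\alpha,\alpha)$-$n$-coalescents. Solid lines show the asymptotic probabilities $\lim_{n\to\infty} p^{(\delta_0)}_{n,m}$ for Kingman's $n$-coalescent for $m = 10$ (in grey) and $m = 100$ (in black).

Like many recursions for $\Lambda$-$n$-coalescents, this can be proven by conditioning on the first jump $T_1$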

of the $\Lambda$-$n$-coalescent: the genealogy cut at $T_1$, keeping all branches connected to the root, is again a $\Lambda$-$n'$-coalescent tree, where $n'$ is the number of blocks/branches present at $T_1$ (so $n' = n-k+1$ after a merger of $k$ blocks). Equation (9.3.1) then only sums over all possible mergers of $k$ blocks, where $\ell$ blocks of the subsample and $k-\ell$ blocks of the sample without the subsample are merged. If $\ell = m$, the root of the subtree of $[m]$ is reached, and thus the root is shared if and only if no blocks in $[n]\setminus[m]$ are unmerged, which is encoded by the boundary condition. Using the recursion, we can compare $p^{(\Lambda)}_{n,m}$ between $\Lambda$-$n$-coalescents, see Figure 9.3.1.

Figure 9.3.1 may raise the question whether Kingman's $n$-coalescent is the $\Xi$-$n$-coalescent that has maximum $p^{(\Xi)}_{n,m}$. This is not true: for instance, $\Xi$-$n$-coalescents that are essentially star-shaped, i.e. merge all individuals at the first merger with high probability, have a higher $p^{(\Xi)}_{n,m}$ if the probability of being star-shaped is high enough.

Due to the exchangeability of the $\Xi$-$n$-coalescent, $p^{(\Xi)}_{n,m}$ can alternatively be expressed in terms of the block sizes of the exchangeable partition of the $\Xi$-$n$-coalescent shortly before absorption.

Proposition 9.3.1 ([15, Propositions 2 and 5]). For any $\Xi$-$n$-coalescent,
$$p^{(\Xi)}_{n,m} = 1 - E\Big(\sum_{i\in\mathbb{N}}\prod_{k=0}^{m-1}\frac{B^{(n)}_{[i]}-k}{n-k}\Big) > 0, \tag{9.3.2}$$
where $B^{(n)}_{[1]}, B^{(n)}_{[2]}, \ldots$ are the sizes of the blocks of $\Pi^{(n)}$ merged at absorption, ordered by size from largest to smallest (filled up by empty blocks). The limit
$$p^{(\Xi)}_m := \lim_{n\to\infty} p^{(\Xi)}_{n,m}$$
exists and is greater than 0 if and only if the $\Xi$-coalescent does not stay infinite. If the $\Xi$-coalescent comes down from infinity, we have
$$p^{(\Xi)}_{n,m} \to 1 - E\Big(\sum_{i\in\mathbb{N}} P_{[i]}^m\Big) = 1 - E(X^{m-1}) = 1 - \frac{E(Y^m)}{E(Y)} > 0 \tag{9.3.3}$$
for fixed $m$ and $n\to\infty$, where $P_{[i]} := \lim_{n\to\infty} B^{(n)}_{[i]}/n$ is the asymptotic frequency of the $i$-th largest block merged at absorption, $X$ is the asymptotic frequency of a size-biased pick and $Y$ is the asymptotic frequency of a block picked uniformly at random from the blocks merging at absorption.

Proof. This is a condensed version of the proof, focusing on the ideas and omitting technicalities. To see equation (9.3.2), first observe that $1 - p^{(\Xi)}_{n,m}$, the probability that the root is not shared, means that all individuals of $[m]$ have merged before the last merger. Thus, we need to have $[m] \subseteq B^{(n)}_{[i]}$ for some block of the partition merged at absorption. Given $B^{(n)}_{[1]}, \ldots, B^{(n)}_{[n]}$, the probability of $[m] \subseteq B^{(n)}_{[i]}$ for an $i\in[n]$ is
$$\sum_{i=1}^{n}\prod_{\ell=0}^{m-1}\frac{B^{(n)}_{[i]}-\ell}{n-\ell}$$
due to exchangeability.

This allows us to consider $p^{(\Xi)}_m := \lim_{n\to\infty} p^{(\Xi)}_{n,m}$ for fixed $m$. From the Poisson construction, one sees that adding more individuals can only add branches and nodes to the tree. Thus $p^{(\Xi)}_{n,m}$ is monotonically non-increasing in $n$ for $m$ fixed, and the limit exists. When is this limit $p^{(\Xi)}_m = 0$? If the $\Xi$-coalescent stays infinite, the waiting time for absorption, i.e. the height of the genealogy, diverges almost surely for $n\to\infty$. Thus, the root is almost surely not shared in the limit, so $p^{(\Xi)}_m = 0$. Consider now the case that the coalescent does not stay infinite. As discussed above, this also means that the heights of the $\Xi$-$n$-coalescents equal the height of the $\Xi$-coalescent for $n$ large enough. Exchangeability (and elementary, yet technical arguments) ensures that the asymptotic frequencies of the blocks participating in the last merger exist. There are at least two blocks, no block has frequency 1, and again exchangeability ensures that with positive probability, $[m]$ is not a subset of a single block. This shows $p^{(\Xi)}_m > 0$ if the coalescent does not stay infinite. The convergence in equation (9.3.3) follows directly from the link between block sizes and asymptotic frequencies, while the characterisations using the size-biased and uniform picks are just standard characterisations of asymptotic frequencies in exchangeable partitions on $\mathbb{N}$.

If one knows the moments of the asymptotic frequencies of the blocks of $\Pi^{(\Xi)}$ participating in the last merger, equation (9.3.3) gives an explicit formula for $p^{(\Xi)}_m$. For Beta-coalescents that come down from infinity, i.e. $1 < \alpha < 2$, there is an explicit characterisation of asymptotic frequencies, conditioned on the event that the Beta-coalescent is in a state with $k$ blocks, $k\in\mathbb{N}$. In this case, the asymptotic frequencies can be expressed in terms of Slack's distribution, see [3, Theorem 1.2].
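As a small sanity check of equation (9.3.2), the following sketch evaluates it for Kingman's $n$-coalescent and compares the result with the closed formula $\frac{m-1}{m+1}\frac{n+1}{n-1}$. It relies on an assumption not stated in the text: the standard fact that, for Kingman's $n$-coalescent, the last merger joins two blocks of sizes $(b, n-b)$ with $b$ uniform on $\{1,\ldots,n-1\}$. The function names are illustrative only.

```python
from fractions import Fraction


def prob_subsample_in_one_block(block_sizes, n, m):
    """Sum over blocks of prod_{k=0}^{m-1} (B - k) / (n - k), cf. (9.3.2)."""
    total = Fraction(0)
    for b in block_sizes:
        term = Fraction(1)
        for k in range(m):
            term *= Fraction(b - k, n - k)  # becomes 0 once k reaches b
        total += term
    return total


def root_sharing_kingman(n, m):
    """1 - E[...] in (9.3.2), assuming the final split (b, n - b)
    with b uniform on 1..n-1 (standard fact for Kingman's n-coalescent)."""
    expectation = sum(
        prob_subsample_in_one_block((b, n - b), n, m) for b in range(1, n)
    ) / (n - 1)
    return 1 - expectation


# Compare with the closed formula for a few (n, m)
for n, m in [(5, 2), (10, 3), (20, 10)]:
    closed = Fraction((m - 1) * (n + 1), (m + 1) * (n - 1))
    print(n, m, root_sharing_kingman(n, m), closed)
```

Exact rational arithmetic is used so that agreement with the closed formula can be checked without rounding issues.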
Since the distribution of the number of blocks $K$ participating in the last merger of Beta-coalescents that come down from infinity is also known, see [26], we can represent $p^{(\Xi)}_m$ as
$$p_m^{(\mathrm{Beta}(2-\alpha,\alpha))} = 1 - \sum_{k\in\mathbb{N}} k\,\frac{E\Big(\frac{Y_1^m}{(Y_1+\cdots+Y_k)^{\alpha+m-1}}\Big)}{E\big((Y_1+\cdots+Y_k)^{1-\alpha}\big)}\,P(K=k),$$
where $K$ has generating function
$$E(u^K) = \alpha u\int_0^1 (1-x^{1-\alpha})^{-1}\big((1-ux)^{\alpha-1}-1\big)\,dx$$
for $u\in[0,1]$ and $(Y_i)_{i\in\mathbb{N}}$ are i.i.d. and have Slack's distribution with Laplace transform $E(e^{-\lambda Y_1}) = 1-(1+\lambda^{1-\alpha})^{-1/(\alpha-1)}$.

For $\alpha = 1$, i.e. for the Bolthausen–Sznitman ($n$-)coalescent, a more explicit representation can be derived from the connection to the uniform random permutation, aka the Chinese restaurant process.

Proposition 9.3.2 ([15, Proposition 4]). Let $B_1, \ldots, B_{n-1}$ be independent Bernoulli random variables with $P(B_i = 1) = i^{-1}$. For $2 \le m < n$,
$$p^{(\mathrm{Beta}(1,1))}_{n,m} = E\Big(\frac{B_1+\cdots+B_{m-1}}{B_1+\cdots+B_{n-1}}\Big). \tag{9.3.4}$$
Moreover,
$$\log(n)\,p^{(\mathrm{Beta}(1,1))}_{n,m} \to \sum_{i=1}^{m-1} i^{-1} \quad\text{for } n\to\infty \text{ and } m \text{ fixed}.$$
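Equation (9.3.4) can be evaluated exactly for small $n$ by tracking the joint distribution of the two partial sums. The following is a minimal sketch under the proposition's assumptions (independent $B_i$ with $P(B_i = 1) = 1/i$, so $B_1 = 1$ almost surely and the denominator is never zero); the function name is illustrative.

```python
from fractions import Fraction


def bs_root_sharing(n, m):
    """E[(B_1+...+B_{m-1}) / (B_1+...+B_{n-1})] with P(B_i = 1) = 1/i."""
    # dist maps (sum over i <= m-1, sum over i <= n-1) -> probability
    dist = {(0, 0): Fraction(1)}
    for i in range(1, n):
        p_one = Fraction(1, i)
        new = {}
        for (head, total), prob in dist.items():
            for bit, weight in ((1, p_one), (0, 1 - p_one)):
                if weight == 0:
                    continue
                key = (head + bit if i <= m - 1 else head, total + bit)
                new[key] = new.get(key, Fraction(0)) + prob * weight
        dist = new
    return sum(prob * Fraction(head, total) for (head, total), prob in dist.items())


print(bs_root_sharing(3, 2))  # sample of size 3, subsample of size 2
print(bs_root_sharing(4, 2), bs_root_sharing(4, 3))
```

The dictionary holds at most $m \cdot n$ states, so the computation stays cheap for moderate $n$; the logarithmic limit in the proposition is approached only slowly.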

Proof. The last merger of the Bolthausen–Sznitman $n$-coalescent has to feature one block including the individual 1. As described in the first section, the connection to the random recursive tree allows us to see the individuals subsequently merged to the block including 1 as adding the cycles of a uniform random permutation of $\{2,\ldots,n\}$ in random order. Since $1\in[m]$, the root is shared between subsample and sample if and only if the cycle added at the last merger also includes at least one $j\in[m]$. Using e.g. the Chinese restaurant process construction of the uniform random permutation, one sees that there are $\sum_{i=1}^{n-1} B_i$ cycles in the random permutation, of which $\sum_{i=1}^{m-1} B_i$ contain some $j\in[m]$. This shows equation (9.3.4), and standard arguments establish its limit for $n\to\infty$.

So what do these results imply for real populations? The main message is indeed Figure 9.3.1: multiple-merger genealogies, unless essentially star-shaped, will have an (often much) higher chance that enlarging collections of individuals uncovers new ancestral variation that can then be used for breeding. Adding population size changes of order $N$ only changes the distribution of branch lengths but not the tree topology (see Section 9.2) for all $\Lambda$-$n$-coalescents, so this holds true if we consider e.g. exponentially growing populations.

However, if we are concerned about the actual shared genealogy, the picture changes somewhat. Using simulations, we assessed the fraction of internal branch length of the genealogy of $[n]$ that is already covered by the genealogy of $[m]$, see [15, Figures 3–6]. On average, for Kingman's $n$-coalescent a larger fraction of internal branches is covered by the subsample's genealogy than for Beta-$n$-coalescents, while this is reversed for Kingman's $n$-coalescent with strong exponential growth. Additionally, the coverage is much more variable for Beta-coalescents. Nevertheless, at least for small to medium growth rates, the benefit of larger samples in terms of added ancestral variation is stronger for multiple-merger genealogies. The results highlight that inferring the correct genealogy model can be relevant for real-life decisions concerning the population at hand.

One more technical, yet in my opinion important point should be discussed here. While we can characterise the limit behaviour of $p^{(\Xi)}_{n,m}$ for $n\to\infty$, this may not be of too much relevance for the practical application. This is not so much due to having finite sample sizes, but it is already questionable whether for $n$ large, the coalescent approximation still resembles the genealogy in the Cannings model, see e.g. [5] for addressing this issue for Kingman's $n$-coalescent. However, due to the monotonicity of $p^{(\Xi)}_{n,m}$ for $m$ fixed and $n$ increasing, the limit results always give a lower bound for the probability of sharing the root between sample and subsample.

The results described in this section show that adding individuals for $\Lambda$- and $\Xi$-$n$-coalescents will usually have the effect that the genealogical tree is enlarged by a higher factor (increased height, but also increased internal branch length) than for Kingman's $n$-coalescent. This may point to a conjecture for multiple-merger coalescents about the question posed by Felsenstein in [17]: “Do we need more sites, more sequences, or more loci?” when it comes to population genetic inference. While for Kingman's $n$-coalescent, sample size is less important, our results indicate that this importance is elevated for multiple-merger $n$-coalescents.
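To make recursion (9.3.1) from this section concrete, here is a minimal sketch in exact rational arithmetic. The helper names (`make_root_sharing`, `kingman_rate`, `bolthausen_sznitman_rate`) are illustrative and not taken from [15]; the two rate functions are the values of $\lambda_{n,k}$ for $\Lambda = \delta_0$ and $\Lambda = \mathrm{Beta}(1,1)$, respectively.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, factorial


def make_root_sharing(lam):
    """Return p(n, m) computed via recursion (9.3.1) for rates lam(n, k)."""

    @lru_cache(maxsize=None)
    def p(n, m):
        if m == n:   # boundary condition p_{k,k} = 1
            return Fraction(1)
        if m == 1:   # boundary condition p_{k,1} = 0 for k >= 2
            return Fraction(0)
        weights = {k: comb(n, k) * lam(n, k) for k in range(2, n + 1)}
        total_rate = sum(weights.values())
        result = Fraction(0)
        for k, weight in weights.items():
            if weight == 0:
                continue
            inner = Fraction(0)
            for l in range(0, min(k, m) + 1):
                if k - l > n - m:  # not enough blocks outside the subsample
                    continue
                hyper = Fraction(comb(m, l) * comb(n - m, k - l), comb(n, k))
                m_new = m - l + 1 if l > 1 else m
                inner += hyper * p(n - k + 1, m_new)
            result += weight / total_rate * inner
        return result

    return p


def kingman_rate(n, k):
    # Lambda = delta_0: only binary mergers have positive rate
    return Fraction(1) if k == 2 else Fraction(0)


def bolthausen_sznitman_rate(n, k):
    # Lambda = Beta(1,1): integral of x^(k-2) (1-x)^(n-k) over [0, 1]
    return Fraction(factorial(k - 2) * factorial(n - k), factorial(n - 1))


p_kingman = make_root_sharing(kingman_rate)
p_bs = make_root_sharing(bolthausen_sznitman_rate)
print(p_kingman(100, 10), p_bs(4, 2))
```

For $\Lambda = \delta_0$ the output can be compared with the closed formula for Kingman's $n$-coalescent, and for $\Lambda = \mathrm{Beta}(1,1)$ with equation (9.3.4); other Beta-coalescents would only require a different rate function.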

9.4 Model selection between n-coalescents

During the two phases of the SPP 1590, much progress has been made on inference methods for deciding whether multiple-merger genealogies explain the genetic diversity of a sample better than bifurcating genealogy models based on Kingman's $n$-coalescent, both inside and outside of the SPP. For the former, see also the chapter by Birkner and Blath [6]. More formally, one is interested in performing model selection between two or more sets of $n$-coalescent models, each endowed with a range of mutation rates, based on the values of one or multiple statistics for measuring genetic diversity. Many methods focus on the site frequency spectrum $(\xi^{(n)}_1, \ldots, \xi^{(n)}_{n-1})$ as inference statistics, where $\xi^{(n)}_i$ is the number of mutations that are inherited by exactly $i$ individuals in the sample, i.e. that have mutation allele count $i$, see e.g. [14, 37, 38, 40]. However, genetic data contains more information than the site frequency spectrum (SFS), and this surplus information can also be used to perform model selection, see e.g. [33, 51].

A reliable method to use multiple statistics for model selection is Approximate Bayesian Computation (ABC). ABC uses a computational version of a Bayes approach, see e.g. [60] for an introduction. Suppose we want to perform model selection between two models with equal a priori probability for each model. Data is represented by summary statistics. In the simplest ABC approach (rejection scheme), statistics are simulated $N \gg 1$ times with model parameters drawn from each model's prior distribution (so $2N$ sets of summary statistics are produced). Then the posterior odds ratio between models is approximated by the ratio of the numbers of simulations from each model that are equal or very close to the observed value of the statistics (the quality of approximation depends e.g. on the sufficiency of the statistics used). See [39] for a recent review of more sophisticated approaches.

In [33], a variety of standard population genetic statistics were used in an ABC approach for model selection between Beta-$n$-coalescents and Kingman's $n$-coalescent, both with exponential growth on the coalescent time scale. The SFS was further summarised by using its sum, the total number of observed mutations, as well as the $(0.1, 0.3, 0.5, 0.7, 0.9)$-quantiles of the mutation allele counts (this set of quantiles will be denoted by AF). Additionally, the authors used the same range of quantiles of the Hamming distances between all pairs (set of statistics denoted by Ham), of the squared correlation $r^2$ between the frequencies in the sample for each pair of mutations and, for a reconstruction of the genealogical tree by a standard phylogenetic method (“neighbor-joining”), of the set of all branch lengths in the reconstructed tree (denoted by Phy). For distinguishing Beta-$n$-coalescents from Kingman's $n$-coalescent, both also accounting for exponential growth on the coalescent time scale, they reported considerably lower error rates than earlier methods, e.g. [14], while also considering a slightly different setup.

A crucial difference is that in [33], the range of mutation rates is identical for all $n$-coalescents considered. Since the distributions of height and total branch length differ strongly between different $n$-coalescent models, the number of observed mutations is also different. If statistics are used that are non-robust to the number of mutations on the genealogy, e.g. the number of segregating sites, this approach needs large ranges for the mutation rate to be able to reproduce these statistics as observed, which is not efficient for simulation-based inference approaches such as ABC.
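The rejection scheme described above can be sketched in a few lines. This is only an illustration under loud assumptions: the two "models" below are hypothetical toy simulators returning a single Gaussian summary statistic, not coalescent simulators, and the tolerance, distance and function names are choices made for this sketch.

```python
import random


def abc_rejection(observed, simulators, n_sims, tol, rng):
    """Approximate posterior model probabilities under equal prior weights.

    Each simulator draws its parameters from its own prior and returns a
    tuple of summary statistics; a simulation is accepted if every statistic
    is within `tol` of the observed value.
    """
    accepted = {name: 0 for name in simulators}
    for name, simulate in simulators.items():
        for _ in range(n_sims):
            stats = simulate(rng)
            distance = max(abs(s - o) for s, o in zip(stats, observed))
            if distance <= tol:
                accepted[name] += 1
    total = sum(accepted.values())
    if total == 0:
        raise ValueError("no simulation accepted; increase tol or n_sims")
    return {name: count / total for name, count in accepted.items()}


# Hypothetical toy models (assumption of this sketch, not from the text)
def toy_model_a(rng):
    theta = rng.uniform(0.0, 1.0)       # prior of model A
    return (rng.gauss(theta, 1.0),)


def toy_model_b(rng):
    theta = rng.uniform(10.0, 11.0)     # prior of model B
    return (rng.gauss(theta, 1.0),)


posterior = abc_rejection(
    observed=(0.5,),
    simulators={"A": toy_model_a, "B": toy_model_b},
    n_sims=2000,
    tol=1.0,
    rng=random.Random(1),
)
print(posterior)
```

The ratio `posterior["A"] / posterior["B"]` corresponds to the approximate posterior odds ratio mentioned in the text, whenever both counts are positive.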
If the parameter ranges are not large enough, model classes may produce comparable numbers of mutations only with low probability or not at all, and are thus likely to be discarded as the true model class. Thus, another approach (used in the other inference approaches mentioned) is to use mutation rates $\theta$ that produce a number of mutations in the range of the number $s$ of mutations observed in the data, e.g. by setting $\theta = \frac{2s}{E_{\mathrm{coal}}(L_n)}$ for each specific member “coal” of a model class's $n$-coalescents, which is the generalised form of Watterson's estimator for $\theta$. Here, $L_n$ denotes the sum of the lengths of all branches of the coalescent tree.

With Siri-Jégousse, I analyse which statistics considered by [33] are actually best to distinguish between different model classes featuring Kingman's $n$-coalescent with exponential growth and several classes of $\Xi$-$n$-coalescents, including ones with exponential growth [21]. For this, we use the ABC approach for model selection based on random forests from [50] (random forests are a widely used machine learning approach introduced in [11]). Since it is an approximate Bayesian method, we need to specify prior distributions on both the coalescent classes and their mutation ranges. However, the approach from [50] differs drastically from the ABC approach presented above. In a nutshell, the method simulates a set of summary statistics $S_1, S_2, \ldots$ from each model class under its prior distribution, then takes bootstrap samples of these simulations. From each bootstrap sample, a decision tree is built, whose nodes have the form $S_i > t$ or $S_i < t$ for some $t\in\mathbb{R}$. For each node, the statistic that distinguishes best between model classes (normally measured by the Gini index, we use a slightly different measure), from a randomly drawn subset of statistics, is chosen as decider $S_i$. Nodes are added until the tree divides the bootstrap sample perfectly into sets of simulations from the same model class. The observed data is then sorted into a model class according to the decision tree for each bootstrap sample (so for the complete random forest), and the selected model is the majority class across the forest. In other words, instead of using that the true model should produce summary statistics closer to the observed ones than another model, as e.g. rejection-scheme ABC does, ABC based on random forests trains a forest of decision trees on the simulations from the model priors and lets them decide which model matches the observed statistics.

We chose this method mainly since it does not increase its model selection error when including many and potentially uninformative statistics, and since it comes with a built-in measure for the ability of each statistic to distinguish hypotheses, the variable importance. The variable importance of $S_i$ measures the average decline in misclassification over all nodes of the random forest where $S_i$ was picked as decider. To assess the misclassification errors, the ABC method uses the out-of-bag error. For this, for each simulation sim, one takes, among the trees of the random forest that were built without sim, the fraction that sorted sim into a wrong model class (and then averages over all simulations).

As summary statistics, we use the statistics as in [33], but we add several more statistics, for instance nucleotide diversity $\pi$, the mean of the Hamming distances between all pairs of sampled individuals. We also add a new statistic. For each $i\in[n]$, $O_n(i)$ is the number of individuals including $i$ that share all non-private mutations of $i$ (a mutation is private if it is only found in one sampled individual). This corresponds to the smallest allele count $> 1$ of all mutations that affect individual $i$. See Figure 9.4.1 for an example.
We consider the $(0.1, 0.3, 0.5, 0.7, 0.9)$-quantiles of $(O_n(i))_{i\in[n]}$ as well as the mean, standard deviation and harmonic mean as statistics; this set of statistics is denoted by O. Our first result is that, not very surprisingly, including more statistics decreases the out-of-bag error meaningfully across a range of different sets of $n$-coalescents. See Table 9.4.1 for a comparison between model classes of Beta-100-coalescents (class Beta) for $\alpha\in[1,2)$ and Kingman's $n$-coalescent with exponential growth (class Growth) and growth rate in $[0, 1000]$. We chose a uniform prior on $\alpha$, but (essentially) a log-uniform prior on the growth rate, with an atom at 0, to put a higher prior weight on low, more realistic growth rates (however, some real data sets of fungal or bacterial pathogens do fit relatively well to models with growth rates of 500 or more). The mutation rate $\theta$ was chosen for each model in any model class so that it produces, on average, a number of mutations $s\in\{15, 20, 30, 40, 60, 75\}$, i.e. we draw $s$ from this set according to a (uniform) prior distribution and then set $\theta$ to the generalised Watterson estimate. We performed 175,000 simulations per model class. From these, we conducted the described ABC analysis. We performed the exact same analysis a second time, including new simulations with parameters from a new (and independent) draw from the prior distributions.
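The choice of the mutation rate via the generalised Watterson estimate can be sketched as follows. The expected total branch length must be supplied per model; as an illustration this sketch uses Kingman's $n$-coalescent without growth in the standard time scaling (pair coalescence rate 1), where $E(L_n) = 2\sum_{i=1}^{n-1} 1/i$, which is an assumption about the scaling and not taken from the text. For other coalescents, $E(L_n)$ would have to come from a recursion or from simulation. All function names are illustrative.

```python
import random

# Candidate mean numbers of mutations, as listed in the text
MUTATION_COUNTS = (15, 20, 30, 40, 60, 75)


def kingman_expected_total_length(n):
    """E(L_n) for Kingman's n-coalescent, assuming pair coalescence rate 1."""
    return 2.0 * sum(1.0 / i for i in range(1, n))


def generalised_watterson_theta(s, expected_total_length):
    """theta = 2 s / E(L_n): mutations fall at rate theta / 2 on the branches."""
    return 2.0 * s / expected_total_length


def draw_theta(expected_total_length, rng):
    """Draw s uniformly from MUTATION_COUNTS and return (s, theta)."""
    s = rng.choice(MUTATION_COUNTS)
    return s, generalised_watterson_theta(s, expected_total_length)


expected_length = kingman_expected_total_length(100)
s, theta = draw_theta(expected_length, random.Random(0))
print(s, theta)
```

By construction, `theta / 2 * expected_length` recovers the drawn value of `s`, which is the calibration described in the paragraph above.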


Figure 9.4.1. Genealogical tree, mutations marked by x. Individual 2 is affected by four mutations with allele counts 1, 5, 5, 6. All non-private mutations are shared by individuals $[5]$. The observable minimal clade size of 2 is $O_n(2) = 5$.
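The minimal observable clade sizes can be read off a 0/1 genotype matrix directly from their definition. The sketch below uses a hypothetical toy data set of seven individuals, constructed so that individual 2 carries four mutations with allele counts 1, 5, 5, 6 as in the caption of Figure 9.4.1; the matrix itself is an assumption of this example, not the data behind the figure.

```python
def minimal_observable_clade_sizes(genotypes):
    """O_n(i) for each individual i.

    genotypes[i][j] is 1 if individual i carries the mutation at site j.
    O_n(i) counts the individuals (including i) that carry all non-private
    mutations of i; it equals n if i has no non-private mutation.
    """
    n = len(genotypes)
    n_sites = len(genotypes[0]) if n else 0
    counts = [sum(row[j] for row in genotypes) for j in range(n_sites)]
    sizes = []
    for row in genotypes:
        shared = [j for j in range(n_sites) if row[j] == 1 and counts[j] > 1]
        clade = [other for other in genotypes
                 if all(other[j] == 1 for j in shared)]
        sizes.append(len(clade))
    return sizes


def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical toy sample: rows are individuals 1..7, columns are 4 mutations
toy_genotypes = [
    [0, 1, 1, 1],  # individual 1
    [1, 1, 1, 1],  # individual 2: allele counts 1, 5, 5, 6
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],  # individual 6
    [0, 0, 0, 0],  # individual 7: no mutation at all
]
clade_sizes = minimal_observable_clade_sizes(toy_genotypes)
print(clade_sizes, harmonic_mean(clade_sizes))
```

Quantiles, mean and standard deviation of `clade_sizes` would then complete the set of statistics denoted by O in the text.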

A more interesting take-away from our analysis is that adding statistics based on the minimal observable clade sizes O to the set based on the allele counts AF, S, $\pi$ leads to considerably decreased error rates, which are only marginally reduced by further additions of statistics. The results are essentially unchanged if the full site frequency spectrum is used instead of AF. For the model selection using all considered statistics, the variable importance was highest for the harmonic mean of the minimal observable clade sizes. A potential explanation why the harmonic mean of $(O_n(i))_{i\in[n]}$ is a meaningful statistic for such model selections can be found in [21, pp. 31–32] and has to do with the connection of $O_n(i)$ with the minimal clade size of $i$. These findings remain true for other model selection problems, e.g. if one changes the multiple-merger genealogy model to a Dirac or Beta-$\Xi$-$n$-coalescent.

The results presented here indicate that focusing on the site frequency spectrum for model selection between multiple-merger and binary $n$-coalescents can be a suboptimal choice and that combining the information with further statistics, as also done in [51], is advisable. A similar effect was observed for model selection between different scenarios of population size changes for bifurcating genealogies, see [31]. Due to the recent advance of full-data methods for such related inference problems, notably [48], where past population sizes are inferred based on a Kingman's $n$-coalescent with fluctuating population sizes (essentially inferring the function from (9.2.1)) by an efficient representation of the full data, a revisit of full-likelihood methods may also be a viable alternative.

Set of statistics        Misclassification for Beta    Misclassification for Growth
AF, S, π                 0.333 (0.330)                 0.246 (0.246)
AF, S, π, O              0.246 (0.242)                 0.205 (0.209)
AF, S, π, Ham            0.277 (0.274)                 0.222 (0.227)
AF, S, π, Phy            0.317 (0.320)                 0.247 (0.244)
AF, S, π, r²             0.298 (0.290)                 0.228 (0.235)
AF, S, π, O, Ham         0.241 (0.240)                 0.203 (0.205)
FULL – O                 0.269 (0.271)                 0.217 (0.215)
FULL – (AF, S, π)        0.242 (0.240)                 0.208 (0.210)
FULL                     0.243 (0.240)                 0.200 (0.204)

Table 9.4.1. Misclassification errors (out-of-bag errors) of the random forest ABC model comparison (5000 trees) with different sets of summary statistics. In parentheses: results for rerun of analysis. FULL: AF, S, π, O, Ham, Phy, r².

Where is this surplus of information for model selection really critical? For species with large enough genomes featuring many chromosomes and/or linkage blocks, the two studies [37, 38] suggest that the information within the SFS is already enough to reliably distinguish between model classes over a range of different genealogy models (see the chapter of Birkner and Blath [6] in this volume). However, for e.g. bacteria with low recombination rates, where the entire genome can essentially be seen as one linkage block, this will not work as well. Here, pooling more information than the SFS/AF clearly helps with distinguishing between model classes.

An example of such bacteria is Mycobacterium tuberculosis, the bacterial agent of human tuberculosis, which is haploid and propagates clonally. Genealogies from M. tuberculosis outbreaks are usually modelled as Kingman's $n$-coalescent with exponential growth. However, the bacteria have to evolve quickly due to a high selection pressure, which could lead to a Bolthausen–Sznitman $n$-coalescent as the suitable genealogy model. Moreover, real data shows some patients that are super-spreaders, i.e. infecting very many other patients, which could be modelled as a multiple-merger genealogy. With Menardo and Gagneux, I investigated in [41] whether classes of $\Lambda$-$n$-coalescents indeed fit the data better than Kingman's $n$-coalescent with exponential growth. We considered eleven publicly available data sets and performed model selection using the ABC approach described above. Our results show that eight of eleven data sets actually fit better to multiple-merger genealogies (ten of eleven if one allows for $\Lambda$-$n$-coalescents with exponential growth) and that these produce patterns of genetic diversity compatible with the observed data.


9.5 Partition blocks and minimal observable clades

Seeing that the minimal observable clade sizes are a reasonable addition to the arsenal of population genetic statistics, Siri-Jégousse and I also investigated their mathematical properties [22]. In the following, the key findings are presented.

Looking back at Figure 9.4.1, the minimal observable clade of individual $i$ can be represented as the block including $i$ of the $\Xi$-$n$-coalescent at the time of the first mutation affecting $i$ that is not on the external branch of $i$, i.e. not on the branch connecting leaf $i$ to the rest of the tree. Due to exchangeability, the distribution of the size of the minimal observable clade is identical for all $i\in[n]$, so in the following we fix $i = 1$ and omit the $i$ for ease of notation. Let $O_n$ be the size of the minimal observable clade of 1 and $E_n$ the length of the external branch of 1. Since mutations are modelled by a homogeneous Poisson point process, independent of the $n$-coalescent, on the branches of the coalescent tree with rate $\frac{\theta}{2}$, the waiting time for the first non-private mutation on the path of leaf 1 to the root of the tree is $E_n + T_n$, where $T_n$ is exponentially distributed with parameter $\frac{\theta}{2}$, independent of $E_n$ and of the coalescent. If $E_n + T_n$ exceeds the height of the $n$-coalescent tree, there is no non-private mutation of 1 and thus $O_n = n$. Recall that $B^{(n)}_1(t)$ denotes the size of the block including 1 in the partition $\Pi^{(n)}_t$ induced by the $\Xi$-$n$-coalescent. Then we can express
$$O_n = B^{(n)}_1(E_n + T_n).$$
Based on this equation, we considered $O_n$ for $\Lambda$-$n$-coalescents. For finite $n$, all moments $E(O_n^j)$, $j\in\mathbb{N}$, can again be computed recursively, see [22, Theorem 4.1]. Since the recursion for fixed $n$ is rather involved, yet relatively straightforward, let us focus on the asymptotics of $O_n$ for $n\to\infty$.

Generally, for any $\Xi$-$n$-coalescent, $E_n$ decreases in $n$ and $E_n \xrightarrow{d} \mathrm{Exp}(\mu_{-1})$ for $n\to\infty$, where $\mu_{-1} = \int_\Delta \sum_{i=1}^{\infty} x_i\,\Xi(dx)/(x,x)$. Thus, for $\Lambda$-$n$-coalescents without dust, $\mu_{-1} = \infty$ and $E_n \to 0$ almost surely. In this case, $O_n$ is asymptotically $B^{(n)}_1(T')$, where $T' \overset{d}{=} \mathrm{Exp}(\frac{\theta}{2})$, since this is the time until the first mutation in the coalescent tree on its path from leaf to root (which does not change with $n$, since the $n$-coalescents are nested and branches are only added if $n$ grows). The random variable $T'$ is independent of all $\Lambda$-$n$-coalescents. Kingman's correspondence ensures that the asymptotic frequency $\lim_{n\to\infty} B^{(n)}_1(T')/n = f_1(T')$ exists almost surely, where $f_1(t)$ is the asymptotic frequency of the block including 1 at time $t$. This ensures $n^{-1} O_n \to f_1(T')$ almost surely. When we additionally plug in the representation of the moments of $f_1(t)$ which follows from [49, equation (50) and Proposition 29], we can fully characterise the distribution of $f_1(T')$ by its moments
$$E\big(f_1(T')^k\big) = 1 - \sum_{r=2}^{k+1} a_{k+1,r}\,\frac{\frac{\theta}{2}}{\lambda_r + \frac{\theta}{2}},$$

where $\lambda_r$ is the total rate of the $\Lambda$-coalescent in a state with $r$ blocks and $a_{k,r}$ is a rational function of $\lambda_2, \ldots, \lambda_k$, defined as in [49, Proposition 29].

For the Bolthausen–Sznitman $n$-coalescent ($\Lambda = \mathrm{Beta}(1,1)$), we can show that the distribution of $f_1(T')$ is a Beta distribution. For any $\Lambda$-$n$-coalescent, $f_1$ is a pure jump process, and for $\Lambda = \mathrm{Beta}(1,1)$, the countably many jumps happen at i.i.d. $\mathrm{Exp}(1)$ distributed times and their heights $(J_i)_{i\in\mathbb{N}}$ are governed by a Poisson–Dirichlet distribution with parameters $(0,1)$, see [49, Corollary 16, Proposition 30]. Thus, each jump contributes to $f_1(T')$ if its jump time is smaller than $T'$, which happens with probability $(1+\frac{\theta}{2})^{-1}$. This shows that $f_1(T') \overset{d}{=} \sum_{i\in\mathbb{N}} B_i J_i$, where $(B_i)_{i\in\mathbb{N}}$ are i.i.d. Bernoulli random variables with success probability $(1+\frac{\theta}{2})^{-1}$, and this sum has a $\mathrm{Beta}\big((1+\frac{\theta}{2})^{-1}, \frac{\theta}{2}(1+\frac{\theta}{2})^{-1}\big)$-distribution, which can be seen from the construction of the Poisson–Dirichlet distribution via a Poisson point process, as e.g. in [2, Section 4.11].

If the $\Lambda$-$n$-coalescent has dust, the asymptotic decoupling of $f_1$ and the time until the first mutation on the path from individual 1 to the root that is not on the external branch breaks down. However, if a $\Xi$-coalescent has dust and stays infinite, the block frequency $f_1$ can be characterised from jump to jump, which I showed in joint work with Möhle.

Proposition 9.5.1 ([20, Theorem 1]). In any $\Xi$-coalescent with dust that stays infinite, $(f_1(t))_{t\ge 0}$ is an increasing pure jump process with càdlàg paths, $f_1(0) = 0$ and $\lim_{t\to\infty} f_1(t) = 1$, but $f_1(t) < 1$ for $t \ge 0$ almost surely. The waiting times between the almost surely infinitely many jumps are distributed as independent $\mathrm{Exp}(\mu_{-1})$ random variables. Its jump chain $(f_1[k])_{k\in\mathbb{N}}$ can be expressed via stick-breaking
$$f_1[k] = \sum_{i=1}^{k} X_i \prod_{j=1}^{i-1} (1 - X_j),$$
where the $(X_j)_{j\in\mathbb{N}}$ are pairwise uncorrelated, $X_j > 0$ almost surely and
$$E(X_j) = \gamma := \frac{\Xi(\Delta)}{\mu_{-1}} \quad\text{for all } j\in\mathbb{N}.$$
Moreover, $E(f_1[k]) = 1 - (1-\gamma)^k$. In general, the $(X_j)_{j\in\mathbb{N}}$ are neither independent nor identically distributed.

Proof. Here, only a sketch of the proof is excerpted from [20], focusing on the waiting times between jumps and their expected height. The expected height is given in terms of the mass added to the asymptotic frequency $f_1(t)$ if the $k$th jump is at time $t$, and thus has the form $X_k(1 - f_1(t))$.

First, observe that if the $\Xi$-coalescent has dust, $\Xi(\{(0,0,\ldots)\}) = 0$, which means that any $\Xi$-$n$-coalescent can be constructed by colouring according to a Poisson point process on $[0,\infty)\times\Delta$ with intensity $dt \otimes \Xi(dx)/(x,x)$ without having to add additional mergers of two blocks. Recall the Poisson point construction of a $\Xi$-$n$-coalescent from Section 9.1, especially how mergers are determined by throwing independent balls on a “paintbox” $x\in\Delta$ and merging blocks whose balls have the same colour (landed in the same compartment of $x$). As also mentioned in the first section, if the $\Xi$-coalescent has dust, there are almost surely only finitely many Poisson points $(t,x)$ where the block containing $i$ is coloured. More precisely, each Poisson point affects the block containing 1 with probability $\sum_{i\in\mathbb{N}} x_i$. Dividing the points of the original Poisson point process into those that colour the block including 1 and those that do not leads to a marked Poisson process, so the Poisson points affecting the block including 1 form again a Poisson point process on $[0,\infty)\times\Delta$, with intensity measure $dt \otimes \sum_{i\in\mathbb{N}} x_i\,\Xi(dx)/(x,x)$. Since the coalescent stays infinite, every such point, due to the strong law of large numbers for exchangeable random variables, will colour infinitely many blocks with this colour. For $f_1$, we can thus describe its jumps by just merging the blocks of the $\Xi$-coalescent according to the colouring at each Poisson point affecting the block including 1 (though the Poisson points not affecting 1 do change the frequencies of these blocks). The probability of not having a Poisson point affecting 1 with time component in a given interval $[t', t'+t_0)$ of length $t_0$ is the void probability of this process on $[t', t'+t_0)\times\Delta$, which equals $\exp(-t_0\,\mu_{-1})$; this verifies the distribution of the waiting times.

At such a Poisson point $(t,x)$, any other block merges with the block including 1 if it has the same colour $i$ (which the block including 1 picks with probability $x_i/\sum_{j\in\mathbb{N}} x_j$), so with probability $x_i$, regardless of its frequency. Each $x$ is drawn from the probability distribution $(\mu_{-1})^{-1}\sum_{i\in\mathbb{N}} x_i\,\Xi(dx)/(x,x)$. All blocks not containing 1 have total frequency $1 - f_1(t)$ when potentially merging at $(t,x)$, so the average fraction of mass from $1 - f_1(t)$ added to $f_1(t)$ is
$$E(X_k) = \int_\Delta \sum_{i\in\mathbb{N}} \frac{x_i^2}{\sum_{j\in\mathbb{N}} x_j}\,\frac{1}{\mu_{-1}}\sum_{\ell\in\mathbb{N}} x_\ell\,\frac{\Xi(dx)}{(x,x)} = \frac{\Xi(\Delta)}{\mu_{-1}}.$$
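The stick-breaking form of the jump chain in Proposition 9.5.1 is a deterministic map from the fractions $(X_j)$ to the values $f_1[k]$, and it satisfies $f_1[k] = 1 - \prod_{j\le k}(1 - X_j)$. The sketch below only illustrates this map; it does not simulate the coalescent, and in the proposition the $X_j$ are neither independent nor identically distributed. Function names are illustrative.

```python
def stick_breaking_chain(fractions):
    """Return [f_1[1], ..., f_1[k]] for given fractions X_1, ..., X_k."""
    chain = []
    remaining = 1.0  # mass 1 - f_1[k-1] not yet in the block of 1
    for x in fractions:
        remaining *= 1.0 - x
        chain.append(1.0 - remaining)
    return chain


def stick_breaking_direct(fractions):
    """f_1[k] = sum_i X_i * prod_{j<i} (1 - X_j), evaluated term by term."""
    value, prefix = 0.0, 1.0
    for x in fractions:
        value += x * prefix
        prefix *= 1.0 - x
    return value


# With constant fractions X_j = gamma, the chain equals 1 - (1 - gamma)^k,
# which matches the form of E(f_1[k]) in the proposition.
gamma = 0.5
print(stick_breaking_chain([gamma] * 3))
print(stick_breaking_direct([0.2, 0.5, 0.1]))
```

Tracking the remaining mass instead of the sum keeps the computation to one multiplication per jump.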

This result can now be applied to address the asymptotics of $O_n$ for $\Lambda$-$n$-coalescents with dust if the coalescent stays infinite, i.e. $\Lambda(\{1\}) = 0$. Let $T_1, T_2, \ldots$ be the jump times of $f_1$ (whose waiting times are i.i.d.). Clearly, $E_n = T_1$ for $n$ large enough. Similarly to the case without dust, we can then see that there is a time $T'$, not dependent on $n$, when the first mutation after time $T_1$ appears on the path from leaf 1 to the root of the $\Xi$-$n$-coalescent, and this, for $n$ large enough, will fall in between two jumps of $f_1$, say $K$ and $K+1$. Thus, Proposition 9.5.1 ensures that $\lim_{n\to\infty} n^{-1} O_n = f_1[K]$ exists almost surely. The random variable $K$ is geometrically distributed on $\mathbb{N}$ with success probability $\frac{\theta/2}{\mu_{-1}+\theta/2}$, since the exponential distribution is memoryless and the probability that one exponential random variable is smaller than another one independent of it is given by the exponential rate of the first divided by the sum of the rates. With Proposition 9.5.1, one can compute
$$E(f_1[K]) = 1 - \frac{\frac{\theta}{2}}{\mu_{-1}}\,\frac{a}{1-a} \quad\text{with}\quad a = \Big(1 - \frac{\Lambda([0,1])}{\mu_{-1}}\Big)\frac{\mu_{-1}}{\mu_{-1}+\frac{\theta}{2}}.$$

The process $(f_1(t))_{t\ge 0}$ has some interesting further properties. While it is Markovian when observed in the Bolthausen–Sznitman $n$-coalescent, see [49, Corollary 16], it is not in general. For instance, by further exploiting the Poisson construction, the following properties of $f_1$ for Dirac coalescents can be derived.

Multiple-merger genealogies: Models, consequences, inference


Proposition 9.5.2 ([20, essentially Proposition 2]). Let $\Lambda = \delta_p$, $p \in [\tfrac12, 1)$ or $p \in (0, \tfrac12)$ and transcendental, and $q := 1 - p$. The process $f_1$ takes values in the set
\[ M_p := \Big\{ \sum_{i \in \mathbb{N}} b_i\, p q^{i-1} : b_i \in \{0, 1\},\ \sum_{i \in \mathbb{N}} b_i < \infty \Big\}. \]
For $x = \sum_{i \in \mathbb{N}} b_i\, p q^{i-1} \in M_p$, we have
\[ \mathbb{P}(f_1[1] = x) = p q^{j-1} \prod_{i \in J \setminus \{j\}} \mathbb{P}(Y + i \in J) \prod_{i \in [j-1] \setminus J} \mathbb{P}(Y + i \notin J) > 0, \]
where $Y \overset{d}{=} \mathrm{Geo}(p)$, $J := \{ i \in \mathbb{N} : b_i = 1 \}$ and $j := \max J$. The process $f_1$ is not Markovian whereas its jump chain $(f_1[k])_{k \in \mathbb{N}}$ is Markovian.

Under the condition for $p$ from the proposition, knowing that $f_1[1] = x$ allows one to infer directly information about mergers at or before the time the minimal clade is formed: Each collision of a Dirac coalescent (on $\mathbb{N}$) merges a fraction $p$ of all singleton blocks present (the "dust"), so each block has an asymptotic frequency from $M_p$. The condition on $p$ ensures that $x \in M_p$ has a unique representation in terms of the coefficients $b_i$. Moreover, the coefficients $b_i$ encode at which Poisson point the minimal clade appears and which fractions of singletons merged at earlier Poisson points are merged to form the minimal clade. If one drops the condition on $p$, the distribution of $f_1[1]$ becomes more involved, since one has to trace back which combinations of coefficients $b_i$ lead to the same $x$ and sum the corresponding probabilities.
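The product formula of Proposition 9.5.2 can be evaluated mechanically. The following Python sketch is an illustration only and not part of [20]: it assumes that $\mathrm{Geo}(p)$ is the geometric distribution on $\mathbb{N}$ with $\mathbb{P}(Y = k) = p q^{k-1}$, and the function names (`geo_pmf`, `frequency`, `prob_first_jump`, `all_sets_with_max`) are hypothetical helpers introduced here.

```python
from itertools import combinations


def geo_pmf(k: int, p: float) -> float:
    """P(Y = k) for Y ~ Geo(p) on {1, 2, ...} (assumed convention)."""
    return p * (1.0 - p) ** (k - 1) if k >= 1 else 0.0


def frequency(J, p: float) -> float:
    """The element x = sum_{i in J} p q^(i-1) of M_p encoded by the index set J."""
    q = 1.0 - p
    return sum(p * q ** (i - 1) for i in J)


def prob_first_jump(J, p: float) -> float:
    """P(f_1[1] = x) for the x encoded by the finite, non-empty set J."""
    J = set(J)
    j = max(J)
    q = 1.0 - p

    def hit(i: int) -> float:
        # P(Y + i in J); only elements of J larger than i can be reached.
        return sum(geo_pmf(m - i, p) for m in J if m > i)

    prob = p * q ** (j - 1)
    for i in range(1, j):
        prob *= hit(i) if i in J else 1.0 - hit(i)
    return prob


def all_sets_with_max(j: int):
    """All subsets J of {1, ..., j} with max J = j."""
    for r in range(j):
        for rest in combinations(range(1, j), r):
            yield set(rest) | {j}
```

Summing `prob_first_jump` over all sets with the same maximum $j$ gives $p q^{j-1}$, the probability that the block of 1 first takes part in the $j$-th collision; this is a quick plausibility check of the formula.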

References

[1] E. Árnason and K. Halldórsdóttir, Nucleotide variation and balancing selection at the Ckma gene in Atlantic cod: Analysis with multiple merger coalescent models, PeerJ 3 (2014), Article ID e786.
[2] R. Arratia, A. D. Barbour, and S. Tavaré, Logarithmic Combinatorial Structures: A Probabilistic Approach, European Mathematical Society, Zürich, 2003.
[3] J. Berestycki, N. Berestycki, and J. Schweinsberg, Small-time behavior of beta coalescents, Ann. Inst. Henri Poincaré Probab. Stat. 44 (2008), 214–238.
[4] J. Berestycki, N. Berestycki, and J. Schweinsberg, The genealogy of branching Brownian motion with absorption, Ann. Probab. 41 (2013), 527–618.
[5] A. Bhaskar, A. G. Clark, and Y. S. Song, Distortion of genealogical properties when the sample is very large, Proc. Natl. Acad. Sci. USA 111 (2014), 2385–2390.
[6] M. Birkner and J. Blath, Genealogies and inference for populations with highly skewed offspring distributions, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 151–177.
[7] M. Birkner, J. Blath, and B. Eldon, An ancestral recombination graph for diploid populations with skewed offspring distribution, Genetics 193 (2013), 255–290.

Fabian Freund


[8] M. Birkner, H. Liu, and A. Sturm, Coalescent results for diploid exchangeable population models, Electron. J. Probab. 23 (2018), Paper no. 49.
[9] J. Blath, M. C. Cronjäger, B. Eldon, and M. Hammer, The site-frequency spectrum associated with Ξ-coalescents, Theor. Popul. Biol. 110 (2016), 36–50.
[10] J. Blath and N. Kurt, Population genetic models of dormancy, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 247–265.
[11] L. Breiman, Random forests, Mach. Learn. 45 (2001), 5–32.
[12] É. Brunet and B. Derrida, Genealogies in simple models of evolution, J. Stat. Mech. 2013 (2013), Article ID P01006.
[13] M. M. Desai, A. M. Walczak, and D. S. Fisher, Genetic diversity and the structure of genealogies in rapidly adapting populations, Genetics 193 (2013), 565–585.
[14] B. Eldon, M. Birkner, J. Blath, and F. Freund, Can the site-frequency spectrum distinguish exponential population growth from multiple-merger coalescents?, Genetics 199 (2015), 841–856.
[15] B. Eldon and F. Freund, Genealogical properties of subsamples in highly fecund populations, J. Stat. Phys. 172 (2018), 175–207.
[16] B. Eldon and J. Wakeley, Coalescent processes when the distribution of offspring number among individuals is highly skewed, Genetics 172 (2006), 2621–2633.
[17] J. Felsenstein, Accuracy of coalescent likelihood estimates: Do we need more sites, more sequences, or more loci?, Mol. Biol. Evol. 23 (2006), 691–700.
[18] Food and Agriculture Organization of the United Nations, WIEWS – World information and early warning system on plant genetic resources for food and agriculture, http://www.fao.org/wiews/glossary/en/.
[19] F. Freund, Cannings models, population size changes and multiple-merger coalescents, J. Math. Biol. 80 (2020), 1497–1521.
[20] F. Freund and M. Möhle, On the size of the block of 1 for Ξ-coalescents with dust, Mod. Stoch. Theory Appl. 4 (2017), 407–425.
[21] F. Freund and A. Siri-Jégousse, The impact of genetic diversity statistics on model selection between coalescents, Comput. Statist. Data Anal. 156 (2021), Article ID 107055.
[22] F. Freund and A. Siri-Jégousse, The minimal observable clade size of exchangeable coalescents, preprint 2019, https://arxiv.org/abs/1902.02155; to appear in Braz. J. Probab. Stat.
[23] C. Goldschmidt and J. B. Martin, Random recursive trees and the Bolthausen–Sznitman coalescent, Electron. J. Probab. 10 (2005), 718–745.
[24] R. C. Griffiths and S. Tavaré, Sampling theory for neutral alleles in a varying environment, Philos. Trans. Roy. Soc. B 344 (1994), 403–410.
[25] D. Hedgecock and A. I. Pudovkin, Sweepstakes reproductive success in highly fecund marine fish and shellfish: A review and commentary, Bull. Marine Sci. 87 (2011), 971–1002.
[26] O. Hénard, The fixation line in the Λ-coalescent, Ann. Appl. Prob. 25 (2015), 3007–3032.
[27] P. Herriger and M. Möhle, Conditions for exchangeable coalescents to come down from infinity, ALEA Lat. Am. J. Probab. Math. Stat. 9 (2012), 637–665.


[28] P. Hoscheit and O. Pybus, The multifurcating skyline plot, Virus Evol. 5 (2019), Article ID vez031.
[29] T. Huillet and M. Möhle, On the extended Moran model and its relation to coalescents with multiple collisions, Theor. Popul. Biol. 87 (2013), 5–14.
[30] K. K. Irwin, S. Laurent, S. Matuszewski, S. Vuilleumier, L. Ormond, H. Shim, C. Bank, and J. D. Jensen, On the importance of skewed offspring distributions and background selection in virus population genetics, Heredity 117 (2016), 393–399.
[31] F. Jay, S. Boitard, and F. Austerlitz, An ABC method for whole-genome sequence data: Inferring paleolithic and neolithic human expansions, Mol. Biol. Evol. 36 (2019), 1565–1579.
[32] I. Kaj and S. M. Krone, The coalescent process in a population with stochastically varying size, J. Appl. Probab. 40 (2003), 33–48.
[33] M. Kato, D. A. Vasco, R. Sugino, D. Narushima, and A. Krasnitz, Sweepstake evolution revealed by population-genetic analysis of copy-number alterations in single genomes of breast cancer, Roy. Soc. Open Sci. 4 (2017), Article ID 20170011.
[34] G. Kersting and A. Wakolbinger, Probabilistic aspects of Λ-coalescents in equilibrium and in evolution, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 223–245.
[35] J. F. C. Kingman, On the genealogy of large populations, J. Appl. Probab. 19 (1982), 27–43.
[36] J. F. C. Kingman, The coalescent, Stochastic Process. Appl. 13 (1982), 235–248.
[37] J. Koskela, Multi-locus data distinguishes between population growth and multiple merger coalescents, Stat. Appl. Genet. Mol. Biol. 17 (2018).
[38] J. Koskela and M. Wilke Berenguer, Robust model selection between population growth and multiple merger coalescents, Math. Biosci. 311 (2019), 1–12.
[39] J. Lintusaari, M. U. Gutmann, R. Dutta, S. Kaski, and J. Corander, Fundamentals and recent developments in approximate Bayesian computation, Syst. Biol. 66 (2017), e66–e82.
[40] S. Matuszewski, M. E. Hildebrandt, G. Achaz, and J. D. Jensen, Coalescent processes with skewed offspring distributions and non-equilibrium demography, Genetics 208 (2018), 323–338.
[41] F. Menardo, S. Gagneux, and F. Freund, Multiple merger genealogies in outbreaks of Mycobacterium tuberculosis, Mol. Biol. Evol. 38 (2021), 290–306.
[42] M. Möhle, Total variation distances and rates of convergence for ancestral coalescent processes in exchangeable population models, Adv. Appl. Probab. 32 (2000), 983–993.
[43] M. Möhle, The coalescent in population models with time-inhomogeneous environment, Stochastic Process. Appl. 97 (2002), 199–227.
[44] M. Möhle and S. Sagitov, A classification of coalescent processes for haploid exchangeable population models, Ann. Probab. 29 (2001), 1547–1562.
[45] R. A. Neher and O. Hallatschek, Genealogies of rapidly adapting populations, Proc. Natl. Acad. Sci. USA 110 (2013), 437–442.
[46] H.-S. Niwa, K. Nashida, and T. Yanagimoto, Reproductive skew in Japanese sardine inferred from DNA sequences, ICES J. Marine Sci. 73 (2016), 2181–2189.


[47] A. Nourmohammad, J. Otwinowski, M. Łuksza, T. Mora, and A. M. Walczak, Fierce selection and interference in B-cell repertoire response to chronic HIV-1, Mol. Biol. Evol. 36 (2019), 2184–2194.
[48] J. A. Palacios, A. Veber, L. Cappello, Z. Wang, J. Wakeley, and S. Ramachandran, Bayesian estimation of population size changes by sampling Tajima's trees, Genetics 213 (2019), 967–986.
[49] J. Pitman, Coalescents with multiple collisions, Ann. Probab. 27 (1999), 1870–1902.
[50] P. Pudlo, J.-M. Marin, A. Estoup, J.-M. Cornuet, M. Gautier, and C. P. Robert, Reliable ABC model choice via random forests, Bioinformatics 32 (2015), 859–866.
[51] D. P. Rice, J. Novembre, and M. M. Desai, Distinguishing multiple-merger from Kingman coalescence using two-site frequency spectra, preprint 2018, https://www.biorxiv.org/content/10.1101/461517v1.
[52] C. Rödelsperger, R. A. Neher, A. M. Weller, G. Eberhardt, H. Witte, W. E. Mayer, C. Dieterich, and R. J. Sommer, Characterization of genetic diversity in the nematode Pristionchus pacificus from population-scale resequencing data, Genetics 196 (2014), 1153–1165.
[53] I. W. Saunders, S. Tavaré, and G. A. Watterson, On the genealogy of nested subsamples from a haploid population, Adv. Appl. Probab. 16 (1984), 471–491.
[54] J. Schweinsberg, Coalescents with simultaneous multiple collisions, Electron. J. Probab. 5 (2000), 1–50.
[55] J. Schweinsberg, Coalescent processes obtained from supercritical Galton–Watson processes, Stochastic Process. Appl. 106 (2003), 107–139.
[56] J. Schweinsberg, Rigorous results for a population model with selection II: Genealogy of the population, Electron. J. Probab. 22 (2017), Paper No. 38.
[57] J. P. Spence, J. A. Kamm, and Y. S. Song, The site frequency spectrum for general coalescents, Genetics 202 (2016), 1549–1561.
[58] M. Steinrücken, M. Birkner, and J. Blath, Analysis of DNA sequence variation within marine species using Beta-coalescents, Theor. Popul. Biol. 87 (2013), 15–24.
[59] A. Sturm, Diploid populations and their genealogies, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 203–221.
[60] M. Sunnåker, A. G. Busetto, E. Numminen, J. Corander, M. Foll, and C. Dessimoz, Approximate Bayesian computation, PLoS Comp. Biol. 9 (2013), Article ID e1002803.
[61] A. Tellier and C. Lemaire, Coalescence 2.0: A multiple branching of recent theoretical developments and their applications, Mol. Ecol. 23 (2014), 2637–2652.

Chapter 10

Diploid populations and their genealogies

Anja Sturm

Diploid organisms carry pairs of homologous chromosomes that are inherited from two parents who each contribute exactly one homologous chromosome, while haploid organisms inherit just one copy from a single parent. In this contribution, we summarise classical results on the genealogies of haploid populations with large but fixed total population size, which have also been used as approximations to the diploid case by ignoring the pairing into individuals. We then present recent results on extending the characterisation of the genealogies to analogous, explicitly diploid models. We discuss the implications and illustrate them by means of a number of concrete examples. Lastly, we survey related research works and further questions regarding the modelling of diploid reproduction.

10.1 Introduction

Many organisms, in particular most vertebrates including humans, are diploid and carry pairs of homologous chromosomes that are inherited from two parents who each contribute exactly one of every homologous chromosome. This is in contrast to haploid populations in which the organisms possess only a single copy of each chromosome that is generally inherited from a single parent. If the focus lies on a single gene on one chromosome then there is only a single parent gene in either a diploid or haploid population. This is why the mathematical modelling of populations and their genealogies has primarily focused on haploid population models. These have then also been used to approximately model diploid populations: If the population has $N$ diploid individuals (each represented by a pair of homologous chromosomes/genes) then by ignoring the pairing of the chromosomes in individuals one may simply consider a (haploid) population of $2N$ chromosomes/genes instead. This approximation has been used quite successfully in many cases. However, there are limitations: Since the reproductive success depends on the individuals and not on the individual chromosomes, the additional diploid structure cannot always be ignored, particularly when the individuals' offspring numbers may be highly skewed. Also, recombination is an important evolutionary force, in particular for diploid organisms, which can only be incorporated in haploid approximations in a simplified way.

Here, we consider mathematical population models that are explicitly diploid and quantify when this structure plays a significant role for the genealogy of a sample of individuals/chromosomes and when the previously used approximations are feasible.

We start by describing classical haploid models with fixed total population size in Section 10.2. We review results on their genealogies (when the total population size


is large), see Theorem 10.2.1, which are due to [34]. This is followed by introducing analogous but explicitly diploid models in Section 10.3. Following [9], we present recent results on the structure of their genealogies, see Theorem 10.3.1. As an application we then consider a number of examples for diploid reproduction models and their corresponding genealogies in Section 10.4. Finally, in Section 10.5 we discuss consequences and possible extensions and we also survey other approaches and models for genealogies in diploid populations.

10.2 Haploid models

In haploid populations every individual has one parent whose set of chromosomes it inherits. If the haploid population is of a stable size and has no additional spatial structure, the most commonly used mathematical population model is the Cannings model. Here, it is assumed that the population is of a fixed size $N$ over time, reflecting a population that maximally uses the limited available resources and so stays at its carrying capacity. (Mathematically, this avoids complications of fluctuating population sizes and conditioning on survival without unbounded growth.) The population evolves in discrete generations and the next generation is composed of offspring of the individuals in the parental generation. We here focus on neutral stochastic population models in which the number of offspring of each individual is random and all individuals have the same chances for a particular number of offspring. This means that the number of offspring in the next generation is given by a vector $V = (V_i)_{i \in [N]}$ where $[N] := \{1, \dots, N\}$ and the $\mathbb{N}_0$-valued random variables $V_i$ are exchangeable and describe the number of offspring of individual $i$. Note that due to exchangeability, which means that $(V_{\sigma(i)})_{i \in [N]}$ has the same distribution as $(V_i)_{i \in [N]}$ for any permutation $\sigma$ of $[N]$, we may label the individuals in each generation arbitrarily with $[N]$. We assume that the offspring numbers for different generations are i.i.d. with distribution $V$. Well studied examples of Cannings models are the Moran model, where $V$ is a random permutation of $(2, 0, 1, \dots, 1)$, and the Wright–Fisher model, in which each offspring "chooses" its parent at random, which corresponds to $V$ being a symmetric multinomial distribution.

From a biological point of view these models seem overly simplistic. However, for large $N$ their behaviour represents a large class of (possibly more realistic) Cannings type population models with offspring distributions $V = V^{(N)}$ in the sense that asymptotically as $N \to \infty$ the genealogy of a finite sample of size $n$ of individuals is the same. The genealogy of a sample of size $n$ with $n \le N$ is described by coalescent processes, which take values in the set of partitions $\mathcal{E}_n$ of $[n]$. Any element $\xi$ in $\mathcal{E}_n$ can be expressed as $\xi = \{C_1, C_2, \dots, C_b\}$ where $C_i \cap C_j = \emptyset$ for $i \ne j$ and $\bigcup_{i=1}^{b} C_i = [n]$, with $b = |\xi|$ the number of elements of $\xi$, which we will refer to as blocks in the following. We start at the present with every sampled individual with label in $[n]$ in its own block (i.e. $\xi = \{\{1\}, \dots, \{n\}\}$) and, looking backwards in time, merge (coalesce) blocks when the sampled individuals represented by them find a common ancestor. Thus, blocks


represent the current ancestor/ancestral line of the sampled individuals they contain, and the corresponding $\mathcal{E}_n$-valued process $\xi^{n,N} = (\xi^{n,N}(t))_{t \in \mathbb{N}_0}$ describes the genealogy completely. Asymptotically, the appropriate time rescaling is given by the probability that two individuals find a common ancestor in the previous generation, which is
\[ c_N^h := \frac{\mathbb{E}[(V_1)_2]}{N - 1}, \tag{10.2.1} \]

where $(m)_0 := 1$ and $(m)_k := m (m-1) \cdots (m - k + 1)$ for any $m \in \mathbb{R}$ and $k \in \mathbb{N}$. (Note that we have added a label $h$ in order to distinguish this quantity from the corresponding diploid quantities that we will use later on.) Möhle and Sagitov [34] showed that if $c_N^h \to 0$ as $N \to \infty$ the limiting coalescent process will, in the most general case, be described by a coalescent in continuous time with (simultaneous) multiple mergers or $\Xi$-coalescent, which we briefly describe (see also the contributions by Birkner and Blath [6] and by Freund [23] in this volume, which focus on multiple merger coalescents). Here, $\Xi$ is a finite measure on the infinite dimensional simplex
\[ \Delta = \Big\{ (x_1, x_2, \dots) : x_1 \ge x_2 \ge \cdots \ge 0,\ \sum_{i=1}^{\infty} x_i \le 1 \Big\}. \]
We write $\mathbf{0} := (0, 0, \dots)$ and equip $\Delta$ with the topology of coordinate-wise convergence, metrised e.g. via the metric $d_\Delta(x, y) = \sum_{i=1}^{\infty} 2^{-i} |x_i - y_i|$.

This $\Xi$-coalescent can be fully characterised by the transition rates of its restrictions to $\mathcal{E}_n$ and indeed likewise on the partitions of all of $\mathbb{N}$. If $\xi \in \mathcal{E}_n$ has $b$ blocks and $\xi'$ with $a < b$ blocks arises from $\xi$ by merging $j$ groups of sizes $k_1, \dots, k_j \ge 2$ (in particular, there are $s = b - k_1 - \cdots - k_j$ "singleton" elements in $\xi$ which do not participate in any merger), the transition from $\xi$ to $\xi'$ occurs at the rate
\[ r_{\xi, \xi'} = \lambda_{b; k_1, \dots, k_j; s} = \mathbf{1}_{\{j = 1,\, k_1 = 2\}}\, \Xi(\{\mathbf{0}\}) + \int_{\Delta \setminus \{\mathbf{0}\}} \sum_{l=0}^{s} \sum_{\substack{i_1, \dots, i_{j+l} = 1 \\ \text{distinct}}}^{\infty} \binom{s}{l}\, x_{i_1}^{k_1} \cdots x_{i_j}^{k_j}\, x_{i_{j+1}} \cdots x_{i_{j+l}}\, (1 - |x|)^{s-l}\, \frac{\Xi(dx)}{(x, x)}, \tag{10.2.2} \]

where $|x| = \sum_{i=1}^{\infty} x_i$ and $(x, x) = \sum_{i=1}^{\infty} x_i^2$. The mass of $\Xi$ at $\mathbf{0}$ determines the rate of binary mergers that applies to any pair of ancestral lines. If $\Xi$ only has mass at $\mathbf{0}$ then we have Kingman's coalescent with exclusively binary mergers. Another special case with just one multiple merger at a time, which already comprises large classes of coalescent processes, are $\Lambda$-coalescents. Here, $\Xi$ only gives mass to $\{(x_1, x_2, \dots) : 0 \le x_1 \le 1,\ x_2 = x_3 = \cdots = 0\} \subset \Delta$ and we define a measure $\Lambda$ on $[0, 1]$ as the pushforward measure under the projection on $x_1$. Then we necessarily have $j = 1$ in (10.2.2), and writing $\lambda_{b, k_1} := \lambda_{b; k_1; b - k_1}$, (10.2.2) simplifies to
\[ \lambda_{b, k_1} = \mathbf{1}_{\{k_1 = 2\}}\, \Lambda(\{0\}) + \int_0^1 x^{k_1 - 2} (1 - x)^{b - k_1}\, \Lambda(dx). \]
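To make the simplified rates concrete, the following Python sketch evaluates $\lambda_{b,k}$ for two measures for which the integral is explicit: Kingman's case $\Lambda = \delta_0$ and a Dirac measure $\Lambda = \delta_p$ with $0 < p \le 1$. It is an illustration under these assumptions only; the function names are hypothetical. For the Dirac case one can also check the well-known consistency relation $\lambda_{b,k} = \lambda_{b+1,k} + \lambda_{b+1,k+1}$ numerically.

```python
from math import comb


def lambda_dirac(b: int, k: int, p: float) -> float:
    """lambda_{b,k} for Lambda = delta_p, 0 < p <= 1: the integrand evaluated at x = p."""
    return p ** (k - 2) * (1.0 - p) ** (b - k)


def lambda_kingman(b: int, k: int) -> float:
    """lambda_{b,k} for Lambda = delta_0: only binary mergers, each pair at rate 1."""
    return 1.0 if k == 2 else 0.0


def total_rate(b: int, rate) -> float:
    """Total merger rate out of a state with b blocks: sum_k C(b, k) * lambda_{b,k}."""
    return sum(comb(b, k) * rate(b, k) for k in range(2, b + 1))
```

For example, `total_rate(5, lambda_kingman)` counts the ten pairs among five blocks, while `total_rate(5, lambda b, k: lambda_dirac(b, k, 0.3))` gives the total rate for the Dirac coalescent with $p = 0.3$.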


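We now write $V_{(1)} \ge V_{(2)} \ge \cdots \ge V_{(N)}$ for the ranked version of $(V_1, \dots, V_N)$ and
\[ \Phi_N^h := \mathcal{L}\Big( \frac{V_{(1)}}{N}, \frac{V_{(2)}}{N}, \dots, \frac{V_{(N)}}{N}, 0, 0, \dots \Big) \tag{10.2.3} \]
for the law of its rescaling, the ranked offspring frequencies, viewed as a probability measure on $\Delta$. We assume that
\[ \frac{1}{c_N^h}\, \Phi_N^h(dx) \xrightarrow[N \to \infty]{} \frac{1}{(x, x)}\, \Xi(dx) \quad \text{vaguely on } \Delta \setminus \{\mathbf{0}\}, \tag{10.2.4} \]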

where $\Xi$ is a sub-probability measure on $\Delta \setminus \{\mathbf{0}\}$. We extend this sub-probability measure to a probability measure $\Xi$ on $\Delta$, which means that we put $\Xi(\{\mathbf{0}\}) = 1 - \Xi(\Delta \setminus \{\mathbf{0}\})$. The reason that we here obtain a probability measure $\Xi$ on $\Delta$ instead of an arbitrary finite measure lies in the fact that we have rescaled time by the probability $c_N^h$ that two lines merge in one generation. As a consequence, the rate at which two lines merge in the limit process, which is given by $\lambda_{2;2;0}$ and equals the total mass of $\Xi$ on $\Delta$, must be 1, see (10.2.2). It is known, see [34], that (10.2.4) is equivalent to the existence of the limit
\[ \phi_j^h(k_1, \dots, k_j) := \lim_{N \to \infty} \frac{1}{c_N^h}\, \frac{\mathbb{E}[(V_1)_{k_1} \cdots (V_j)_{k_j}]}{N^{k_1 + \cdots + k_j - j}} \tag{10.2.5} \]

for all $j \in \mathbb{N}$ and $k_1, \dots, k_j \ge 2$. If either of the conditions (10.2.4) and (10.2.5) holds then we have $\phi_j^h(k_1, \dots, k_j) = \lambda_{b; k_1, \dots, k_j; 0}$ with $b = k_1 + \cdots + k_j$, since the right-hand side of (10.2.5) is the (time rescaled) probability that in the previous generation all $b$ lines participate in mergers with specific groups of sizes $k_1, \dots, k_j$ (analogously to (10.2.1)). More precisely, we have the connection
\[ \phi_j^h(k_1, \dots, k_j) = \mathbf{1}_{\{j = 1,\, k_1 = 2\}}\, \Xi(\{\mathbf{0}\}) + \int_{\Delta \setminus \{\mathbf{0}\}} \sum_{\substack{i_1, \dots, i_j = 1 \\ \text{distinct}}}^{\infty} x_{i_1}^{k_1} \cdots x_{i_j}^{k_j}\, \frac{\Xi(dx)}{(x, x)}. \tag{10.2.6} \]

In the case of haploid Cannings models the convergence result of Möhle and Sagitov [34] is then the following:

Theorem 10.2.1 ([34, Theorem 2.1]). Assume that $c_N^h \to 0$ and that the laws of $(V_1, \dots, V_N)$ satisfy (10.2.4). Assume also that $\xi^{n,N}(0) \to \xi_0$ weakly as $N \to \infty$ on $\mathcal{E}_n$. Then
\[ \big( \xi^{n,N}(\lfloor t / c_N^h \rfloor) \big)_{t \ge 0} \xrightarrow[N \to \infty]{} \big( \xi^{n}(t) \big)_{t \ge 0} \quad \text{weakly on } D([0, \infty), \mathcal{E}_n), \]
where $D([0, \infty), \mathcal{E}_n)$ denotes the space of $\mathcal{E}_n$-valued càdlàg paths equipped with Skorohod's $J_1$-topology (see e.g. [22, Chapter 3, Section 4]). The limiting process $\xi^n$ is a $\Xi$-coalescent on $\mathcal{E}_n$ with transition rates (10.2.2), starting from $\xi_0$ and with $\Xi$ given by (10.2.4).
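The time change $\lfloor t / c_N^h \rfloor$ in Theorem 10.2.1 can be made explicit for the two classical examples. The following Python sketch computes $c_N^h = \mathbb{E}[(V_1)_2]/(N-1)$ exactly from the marginal law of $V_1$; it is an illustration, and the helper names are hypothetical. It uses that $V_1$ is Binomial$(N, 1/N)$ in the Wright–Fisher model and a uniformly chosen entry of $(2, 0, 1, \dots, 1)$ in the Moran model.

```python
from fractions import Fraction
from math import comb


def c_N_from_pmf(N: int, pmf: dict) -> Fraction:
    """c_N^h = E[V_1 (V_1 - 1)] / (N - 1) for a marginal law given as {v: P(V_1 = v)}."""
    second_factorial_moment = sum(Fraction(v * (v - 1)) * prob for v, prob in pmf.items())
    return second_factorial_moment / (N - 1)


def wright_fisher_pmf(N: int) -> dict:
    """V_1 ~ Binomial(N, 1/N), the marginal of the symmetric multinomial."""
    p = Fraction(1, N)
    return {v: comb(N, v) * p ** v * (1 - p) ** (N - v) for v in range(N + 1)}


def moran_pmf(N: int) -> dict:
    """V_1 is a uniformly chosen entry of (2, 0, 1, ..., 1)."""
    return {2: Fraction(1, N), 0: Fraction(1, N), 1: Fraction(N - 2, N)}
```

One obtains $c_N^h = 1/N$ for the Wright–Fisher model and $c_N^h = 2/(N(N-1))$ for the Moran model, so that one unit of coalescent time corresponds to $N$ and to $N(N-1)/2$ generations, respectively.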


This convergence result, which is derived from convergence of the finite-dimensional distributions and relative compactness, has been used widely in order to describe the ancestral structure of a sample of $n$ genes when the total population size $N$ is sufficiently large. Not only is the limiting coalescent process for the most part easier to analyse than the coalescent process of the underlying population models, but it also shows that the genealogical structures of large classes of population models are essentially the same. As an example, all population models whose offspring numbers do not vary widely will have Kingman's coalescent as the limit. More precisely, Kingman's coalescent is obtained if and only if $\phi_1^h(3) = 0$, see [34]. A sufficient condition is $\lim_{N \to \infty} \mathrm{Var}[V_1^{(N)}] = \sigma^2 \in (0, \infty)$ and $\sup_{N \in \mathbb{N}} \mathbb{E}[(V_1^{(N)})^m] \le M_m$ for all $m \in \mathbb{N}$. This observation was already made in Kingman's seminal work [28, 29].

Subsequently and more recently, $\Lambda$-coalescents with multiple mergers were introduced and studied by Pitman [37], Sagitov [38] and Donnelly and Kurtz [17], and then $\Xi$-coalescents with simultaneous multiple mergers by Schweinsberg [40] and Sagitov [39] in addition to Möhle and Sagitov [34]. Studying properties of these coalescent processes has since been a very active area of research; we refer here in particular to the contributions of Birkner and Blath [6], Kersting and Wakolbinger [27] and Freund [23] in this volume for an overview of their analysis and relevance in population genetics.

In order to get a sense of population models that do not have a Kingman coalescent limit we describe a class of models that were proposed and analysed by Schweinsberg [41]. These lead to beta coalescents, which are $\Lambda$-coalescents with $\Lambda$ given by a $\mathrm{Beta}(2 - \alpha, \alpha)$ distribution with $1 < \alpha < 2$ having density
\[ \frac{1}{B(2 - \alpha, \alpha)}\, x^{1 - \alpha} (1 - x)^{\alpha - 1}, \qquad 0 < x < 1, \]

where, for $a, b > 0$, $B(a, b) = \Gamma(a) \Gamma(b) / \Gamma(a + b)$ denotes the beta function. These beta coalescents have in recent years been studied intensely for $0 < \alpha < 2$. In particular, there is a rich mathematical structure linking these particular $\Lambda$-coalescents to stable branching processes, see e.g. Berestycki [5] as well as Birkner et al. [7]. In Schweinsberg's model the offspring vector $V$ is constructed in two steps. First, each individual $i \in [N]$ produces $X_i$ juveniles, where $X_1, \dots, X_N$ are i.i.d. copies of $X$ with $\mathbb{E}[X] > 1$ and $X$ has a (strictly) regularly varying tail,
\[ \mathbb{P}(X > x) \sim c\, x^{-\alpha} \quad \text{as } x \to \infty \tag{10.2.7} \]

with $c \in (0, \infty)$. Then $N$ of the $S_N = X_1 + \cdots + X_N$ ($> N$ typically) juveniles are drawn at random without replacement to form the next (adult) generation. This model captures situations where occasionally some individuals can, for example due to environmental fluctuations, produce many more offspring than others (note that (10.2.7) with $\alpha < 2$ implies $\mathrm{Var}[X_i] = \infty$). It is thus a possible mathematical formalisation of the concept of "sweepstakes reproduction" that appears in the biological literature, see e.g. Eldon and Wakeley [20] and the discussion and references therein, as well as the contributions of Birkner and Blath [6] and Freund [23] in this volume.
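The two-step construction is easy to simulate. The following Python sketch draws one generation of offspring numbers; it is a minimal illustration, not Schweinsberg's construction in full generality. As an assumption made here, the juvenile numbers are taken to be $X = \lceil U^{-1/\alpha} \rceil$ with $U$ uniform on $(0, 1]$, which gives $\mathbb{P}(X > x) = x^{-\alpha}$ for integers $x \ge 1$ and hence (10.2.7) with $c = 1$. The function name `sample_offspring` is hypothetical.

```python
import bisect
import itertools
import math
import random


def sample_offspring(N: int, alpha: float, rng: random.Random) -> list:
    """One generation of the two-step juvenile model; returns (V_1, ..., V_N)."""
    # Step 1: juvenile numbers with a heavy tail, X = ceil(U^(-1/alpha)) >= 1.
    juveniles = [math.ceil((1.0 - rng.random()) ** (-1.0 / alpha)) for _ in range(N)]
    cumulative = list(itertools.accumulate(juveniles))
    total = cumulative[-1]  # S_N >= N because every X_i >= 1
    # Step 2: draw N juveniles without replacement and count them per parent.
    offspring = [0] * N
    for index in rng.sample(range(total), N):
        parent = bisect.bisect_right(cumulative, index)
        offspring[parent] += 1
    return offspring
```

Calling `sample_offspring(1000, 1.5, random.Random(1))` repeatedly and inspecting the largest entry shows the occasional very large families that are responsible for multiple mergers in the genealogy.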

10.3 Diploid models

We now extend the coalescent theory to a wide class of diploid population models, in which individuals carry two copies of each gene which they inherit from two distinct parent individuals. We will thus derive the diploid analogue of the Möhle and Sagitov classification in Theorem 10.2.1 of the ancestral processes in exchangeable haploid population models. This summarises recent work of Birkner, Liu, and Sturm [9] and gives a unified picture of studying genealogies in an exchangeable diploid setting, which had previously only been available in special cases, see Section 10.4 for examples and details.

We consider diploid biparental analogues of Cannings models, namely a general diploid exchangeable population model with fixed population size $N$. (For an overview of some other population models that incorporate aspects of diploidy we refer to Section 10.5.) In our model, every individual possesses two chromosome copies, each inherited from one of its two parents. Which parental chromosome is inherited is a uniform random pick, independently for each child. Let $V_{i,j}$ be the number of children in the next generation of individuals $i$ and $j$ (for $i < j$). We call these quantities pairwise offspring numbers and implicitly define $V_{i,j} = V_{j,i}$ for $i > j$ when notationally necessary. We exclude the possibility of self-fertilisation (selfing), i.e. $V_{i,i} = 0$. Since the population size is fixed we have $\sum_{1 \le i < j \le N} V_{i,j} = N$.

This formula reflects the fact that the coalescence of $k$ blocks results in a decrease of $k - 1$ in the number of blocks. Some of the most fundamental properties of $\Lambda$-coalescents can be characterised in terms of the $\mu(b)$, $b \ge 2$. A $\Lambda$-coalescent is said to lack a dust component (or simply to be dust-free) if the probability that there are external branches within the n-coalescents reaching right down to the tree's root is asymptotically vanishing in the limit $n \to \infty$. This feature takes place if and only if the rate of decrease per capita is divergent, i.e.
\[ \frac{\mu(b)}{b} \to \infty \]

Λ-coalescents in equilibrium and in evolution


as $b \to \infty$. Originally this kind of behaviour was characterised by the condition $\int p^{-1}\, \Lambda(dp) = \infty$ (see [30]); its equivalence to the above requirement is shown in [13, Lemma 1 (iii)]. Another property of importance is the stronger notion that a $\Lambda$-coalescent comes down from infinity, which means that for any $t > 0$ the number of lineages at time $t$ in the $\Lambda$-coalescent is finite a.s. Expressed in terms of n-coalescents this means that for any $t > 0$ the numbers of their lineages at time $t$ are tight as $n \to \infty$. This feature arises [31] if and only if
\[ \sum_{b=2}^{\infty} \frac{1}{\mu(b)} < \infty. \]
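Both criteria can be explored numerically. The following Python sketch is an illustration under an explicit assumption: it takes the rate of decrease to be $\mu(b) = \sum_{k=2}^{b} (k-1) \binom{b}{k} \lambda_{b,k}$, which is the reading suggested by the remark that a merger of $k$ blocks removes $k - 1$ blocks. The two rate families (Kingman's coalescent and the Dirac coalescent $\Lambda = \delta_p$) and all function names are chosen here for illustration.

```python
from math import comb


def mu(b: int, rate) -> float:
    """Assumed rate of decrease: sum over k of (k - 1) * C(b, k) * lambda_{b,k}."""
    return sum((k - 1) * comb(b, k) * rate(b, k) for k in range(2, b + 1))


def kingman(b: int, k: int) -> float:
    """lambda_{b,k} for Kingman's coalescent: binary mergers only."""
    return 1.0 if k == 2 else 0.0


def dirac(p: float):
    """Rates lambda_{b,k} = p^(k-2) (1-p)^(b-k) of the Dirac coalescent Lambda = delta_p."""
    return lambda b, k: p ** (k - 2) * (1.0 - p) ** (b - k)


def partial_sum_inverse_mu(n: int, rate) -> float:
    """sum_{b=2}^{n} 1 / mu(b), a partial sum of the coming-down-from-infinity series."""
    return sum(1.0 / mu(b, rate) for b in range(2, n + 1))
```

For Kingman's coalescent $\mu(b) = b(b-1)/2$, so $\mu(b)/b$ diverges and the partial sums equal $2(1 - 1/n)$, which stay bounded. For the Dirac coalescent a short computation gives $\mu(b) = (bp - 1 + (1-p)^b)/p^2$, so $\mu(b)/b$ stays bounded, in line with the presence of dust.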

The chapter is subdivided into two parts. The first part, consisting of Sections 11.2, 11.3 and 11.4, focusses on aspects of the genealogy of $n$ individuals that live at some fixed time, and on their description by coalescent functionals. Sections 11.2 and 11.3 treat general $\Lambda$-coalescents. Each of them presents a different approximation method, one for a class of functionals of the $\Lambda$-coalescent and the other for the block-counting processes. In Section 11.4 we turn to Beta-coalescents, and present an asymptotic expansion for a further class of functionals.

In the second part we are going to describe and analyse coalescent-valued processes that are embedded in evolving populations. Here we distinguish two different perspectives. In Section 11.5 we think of $n$ as a population size that is constant over time. Then each of the functionals studied in the first part gives rise to a real-valued process, and we are going to obtain limit laws for sequences of such processes as $n \to \infty$. There, the factor that translates the generation time scale $s$ into the evolutionary time scale $t$ will vary with $n$, cf. Remark 11.2.8. In Section 11.6 we will consider genealogies of infinite populations in their evolutionary time scale. As we shall see, these genealogies can be viewed as evolving $\infty$-coalescents. Here $n$ loses its meaning as a total population size but still can be seen as the size of a sample taken from the population at some fixed time.

11.2 An approximation method for dust-free Λ-coalescents

11.2.1 A class of functionals of Λ-coalescents

We start by introducing some notation. For fixed $n \in \mathbb{N}$, let $N_n = (N_n(u))_{u \ge 0}$ denote the block-counting process. Thus $N_n(u)$ gives the number of blocks within the n-coalescent at time $u$, in particular $N_n(0) = n$. The process $N_n$ is a Markov pure-death process with state space $[n]$, with an absorbing state 1 and with a jump rate in state $b$ given by
\[ \lambda(b) := \sum_{k=2}^{b} \binom{b}{k} \lambda_{b,k}, \qquad b \ge 2. \]

Götz Kersting and Anton Wakolbinger


The Markov chain $X = X^{(n)}$ is the decreasing path $n = X_0^{(n)} > X_1^{(n)} > \cdots > X_{\tau_n}^{(n)} = 1$ embedded into the Markov process $N_n$, with $\tau_n$ being the total number of merging events. The waiting time in the state $X_i = X_i^{(n)}$, $i = 0, 1, \dots$, is denoted by $W_i = W_i^{(n)}$. For the sake of readability we will suppress the superscript $n$ in $X_i$ and $W_i$. A number of functionals of n-coalescents can be expressed or closely approximated by quantities of the form
\[ F_n(X) := \sum_{i=0}^{\rho_n - 1} f(X_i) \tag{11.2.1} \]

with some function $f : [2, \infty) \to \mathbb{R}$ and some random variable $0 \le \rho_n \le \tau_n$. Here are a few examples.

Example 11.2.1 (Number of coalescences). If we set $f \equiv 1$ and $\rho_n = \tau_n$, then $F_n(X)$ equals the number $\tau_n$ of merging events.

Example 11.2.2 (Absorption time). For the time to the most recent common ancestor, $\tilde{\tau}_n := W_0 + \cdots + W_{\tau_n - 1}$, we have
\[ \mathbb{E}[\tilde{\tau}_n \mid X] = \sum_{i=0}^{\tau_n - 1} \frac{1}{\lambda(X_i)}. \]
Here the choice $f(x) := \lambda(x)^{-1}$ in (11.2.1) leads to an effective approximation of $\tilde{\tau}_n$.

Example 11.2.3 (Tree length). Similarly the total length
\[ \ell_n = X_0 W_0 + X_1 W_1 + \cdots + X_{\tau_n - 1} W_{\tau_n - 1} \]
of all branches can be well approximated by
\[ \mathbb{E}[\ell_n \mid X] = \sum_{i=0}^{\tau_n - 1} \frac{X_i}{\lambda(X_i)}. \]
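The conditional expectations of Examples 11.2.2 and 11.2.3 are plain sums along the embedded chain and can be computed from a simulated path. The following Python sketch is an illustration only; it assumes that the rates $\lambda_{b,k}$ are supplied as a function `rate(b, k)` and that, from a state with $b$ blocks, a merger of $k$ blocks is chosen with probability proportional to $\binom{b}{k} \lambda_{b,k}$. All names are hypothetical.

```python
import random
from math import comb


def kingman_rate(b: int, k: int) -> float:
    """lambda_{b,k} for Kingman's coalescent: binary mergers only."""
    return 1.0 if k == 2 else 0.0


def total_rate(b: int, rate) -> float:
    """lambda(b) = sum_k C(b, k) * lambda_{b,k}."""
    return sum(comb(b, k) * rate(b, k) for k in range(2, b + 1))


def sample_path(n: int, rate, rng: random.Random) -> list:
    """Embedded chain n = X_0 > X_1 > ... > 1 of the block-counting process."""
    path = [n]
    while path[-1] > 1:
        b = path[-1]
        sizes = list(range(2, b + 1))
        weights = [comb(b, k) * rate(b, k) for k in sizes]
        k = rng.choices(sizes, weights=weights)[0]
        path.append(b - k + 1)  # a merger of k blocks removes k - 1 blocks
    return path


def conditional_means(path: list, rate) -> tuple:
    """(E[absorption time | X], E[tree length | X]) as sums over the visited states."""
    visited = path[:-1]  # the absorbing state 1 contributes no waiting time
    absorption = sum(1.0 / total_rate(b, rate) for b in visited)
    length = sum(b / total_rate(b, rate) for b in visited)
    return absorption, length
```

For Kingman's coalescent the path is deterministic, and the two sums reduce to the familiar values $2(1 - 1/n)$ and $2 \sum_{j=1}^{n-1} 1/j$.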

Example 11.2.4 (External branches). The choice $f(x) := \frac{1}{x}$, $x \ge 2$, deserves particular interest. Let $\zeta_n$ be the number of coalescing events before an external branch, chosen at random out of the $n$ possible ones, merges with some other branch(es) within an n-coalescent. We have the fundamental formula
\[ \mathbb{P}(\zeta_n \ge k \mid X) = \frac{X_k - 1}{n - 1} \prod_{i=0}^{k-1} \Big( 1 - \frac{1}{X_i} \Big) \quad \text{a.s.} \]
for any $k \in \mathbb{N}$ (see [13, Lemma 4]). This implies
\[ \mathbb{P}(\zeta_n \ge k \mid X) = \frac{X_k - 1}{n - 1} \exp\Big( -\sum_{i=0}^{k-1} \frac{1}{X_i} + O(X_k^{-1}) \Big) \quad \text{a.s.} \tag{11.2.2} \]
Here the functional $F_n(X)$ with $f(x) = \frac{1}{x}$ is located in the exponent. With this approximation one gains access to the lengths of external branches, see [12, 13].
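The product formula of Example 11.2.4 can be checked against a merger-by-merger computation. The following Python sketch is an illustration, with hypothetical function names: by exchangeability, at a merger that takes $X_i$ blocks to $X_{i+1}$ blocks, $X_i - X_{i+1} + 1$ of the $X_i$ blocks take part, so a fixed external branch is spared with probability $(X_{i+1} - 1)/X_i$. The third function evaluates the exponential form with the $O(X_k^{-1})$ correction dropped.

```python
import math


def survival_stepwise(path: list, k: int) -> float:
    """Probability that a fixed external branch is spared by the first k mergers."""
    prob = 1.0
    for i in range(k):
        prob *= (path[i + 1] - 1) / path[i]
    return prob


def survival_formula(path: list, k: int) -> float:
    """The closed product (X_k - 1) / (n - 1) * prod_{i<k} (1 - 1 / X_i)."""
    n = path[0]
    prod = math.prod(1.0 - 1.0 / path[i] for i in range(k))
    return (path[k] - 1) / (n - 1) * prod


def survival_exponential(path: list, k: int) -> float:
    """Exponential form with the O(1 / X_k) term in the exponent omitted."""
    n = path[0]
    return (path[k] - 1) / (n - 1) * math.exp(-sum(1.0 / path[i] for i in range(k)))
```

For a path such as `[100, 90, 70, 65, 40, 12, 3, 1]` the first two functions agree up to rounding, and the logarithm of the ratio of the third to the second is bounded by $1/X_k$, in line with the error term.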


11.2.2 An approximation method for dust-free Λ-coalescents

Following [13] we present an approach to obtain laws of large numbers for the random variables $F_n(X)$ from (11.2.1). This approach relies on the following intuition. We note that the sequences $\lambda(b)$ and $\mu(b)$, $b \ge 2$, can be naturally extended to smooth functions $\lambda, \mu\colon [2,\infty) \to \mathbb{R}$, see [13, equations (4) and (5)]. Let
\[ \Delta_i := X_{i-1} - X_i, \qquad \kappa(x) := \frac{\mu(x)}{\lambda(x)}. \]
Then we have the following approximation in two steps:
\[ F_n(X) \approx \sum_{i=0}^{\rho_n - 1} f(X_i)\,\frac{\Delta_{i+1}}{\kappa(X_i)} \approx \int_{r_n}^{n} f(x)\,\frac{dx}{\kappa(x)}, \tag{11.2.3} \]

where $(r_n)_{n \ge 1}$ is a sequence of real numbers, $\rho_n := \min\{k \ge 0 : X_k < r_n\}$ (and the symbol $\approx$ is understood in a heuristic manner). The right-hand part of (11.2.3) can be seen as a Riemann approximation of the integral. To provide a sufficiently good fit, a natural requirement is that $\sup_{i < \rho_n} \Delta_{i+1}/X_i$ becomes small in probability as $n \to \infty$. Therefore, in order to avoid very large jumps, we confine ourselves to a small-time regime, meaning that the time $\tilde\rho_n := \inf\{u \ge 0 : N_n(u) < r_n\}$ converges to 0 in probability as $n \to \infty$. For coalescents coming down from infinity this simply means that $(r_n)$ diverges. Otherwise this is a stronger restriction: then $(r_n)$ has to diverge sufficiently fast.

The rationale for the left-hand approximation in (11.2.3) rests on the observation that $E[\Delta_{i+1} \mid X_i] = \kappa(X_i)$ a.s., and consequently the difference of both left-hand terms may be embedded in the martingale $M = (M_k)_{k \ge 0}$ given by
\[ M_k := \sum_{i=0}^{k \wedge \rho_n - 1} f(X_i) - \sum_{i=0}^{k \wedge \rho_n - 1} f(X_i)\,\frac{\Delta_{i+1}}{\kappa(X_i)}, \qquad k \ge 0. \]
Its quadratic variation may be bounded in the regime $\sup_{i < \rho_n} \Delta_{i+1}/X_i = o_P(1)$ by means of the estimate
\[ \sum_{i=0}^{\rho_n - 1} f(X_i)^2\,\frac{\Delta_{i+1}^2}{\kappa(X_i)^2} = o_P\Big( \sum_{i=0}^{\rho_n - 1} X_i\, f(X_i)^2\,\frac{\Delta_{i+1}}{\kappa(X_i)^2} \Big) = o_P\Big( \max_{r_n \le x \le n} \frac{x f(x)}{\kappa(x)} \sum_{i=0}^{\rho_n - 1} f(X_i)\,\frac{\Delta_{i+1}}{\kappa(X_i)} \Big). \]

Götz Kersting and Anton Wakolbinger

If $(r_n)$ increases sufficiently slowly, then one can show [13, Proposition 1] that under suitable conditions on $f$ the maximum on the right-hand side of the previous display, $\max_{r_n \le x \le n} x f(x)/\kappa(x)$, is $O\big(\int_{r_n}^{n} f(x)\,\frac{dx}{\kappa(x)}\big)$, and we end up with the estimate
\[ \sum_{i=0}^{\rho_n - 1} f(X_i)^2\,\frac{\Delta_{i+1}^2}{\kappa(X_i)^2} = o_P\Big( \Big( \int_{r_n}^{n} f(x)\,\frac{dx}{\kappa(x)} \Big)^2 \Big), \]
which allows corresponding second moment estimates of the martingale $M$.

We note that martingale techniques have been applied earlier in [3] to Λ-coalescents coming down from infinity. There the speed of coming down from infinity was obtained by comparing the block-counting process (in continuous time) with the solution of an explosive ODE run backwards in time. The method described above does not cover this case; it follows a different strategy.

This kind of approach turns out to be applicable in great generality and appears to be a promising tool for other decreasing Markov chains, too. For Λ-coalescents it leads to a law of large numbers for the tree length and the total external branch length. To state this compactly, we use the following notation: for two sequences $(A_n)_{n \ge 1}$ and $(B_n)_{n \ge 1}$ of positive random variables we write $A_n \overset{P}{\sim} B_n$ and $A_n \overset{1}{\sim} B_n$ if the quotients $A_n/B_n$ converge to 1 in probability and in the $L^1$-norm, respectively.

Theorem 11.2.5 ([13, Theorems 1 and 2]). Let $\ell_n$ be the total length of an n-coalescent and let $\bar\ell_n$ be the total length of its external branches. Then for a dust-free Λ-coalescent we have, as $n \to \infty$,
\[ \ell_n \overset{P}{\sim} \int_2^n \frac{x}{\mu(x)}\,dx \qquad \text{and} \qquad \bar\ell_n \overset{1}{\sim} \frac{n^2}{\mu(n)}. \]

The first part of the theorem was proven in [4] for coalescents coming down from infinity, and was conjectured to hold for the larger class of dust-free Λ-coalescents. It is natural to expect that an analogous theorem holds for the total internal branch lengths $\hat\ell_n := \ell_n - \bar\ell_n$ of n-coalescents. This is proven for a class of coalescents containing the Bolthausen–Sznitman coalescent, see [13, Theorem 3]. Since there is further evidence that such a result holds in large generality, we formulate the following conjecture.

Conjecture 11.2.6. For a dust-free coalescent we have, as $n \to \infty$,
\[ \hat\ell_n \overset{P}{\sim} \int_2^n \frac{x}{\mu(x)}\,dx - \frac{n^2}{\mu(n)}. \]

As indicated in Example 11.2.4, our methodology is also useful for analysing single external branches in an n-coalescent. Here we have the following result.
Theorem 11.2.7 ([12, Theorems 1.1 and 1.3]). Let $T_n$ be the length of an external branch chosen at random from an n-coalescent. Then for a Λ-coalescent without a dust component we have, for all $u \ge 0$, as $n \to \infty$,
\[ e^{-2u} + o(1) \le P\Big( \frac{\mu(n)}{n}\,T_n > u \Big) \le \frac{1}{1+u} + o(1). \]
Moreover, $\frac{\mu(n)}{n} T_n$ converges in distribution to a probability measure $\nu \ne \delta_0$ as $n \to \infty$ if and only if $\mu$ is a function varying regularly at infinity. Then its exponent $\alpha$ of regular variation fulfills $1 \le \alpha \le 2$ and we have, for $\alpha = 1$,
\[ \nu(du) = e^{-u}\,du \]
and, for $1 < \alpha \le 2$,
\[ \nu(du) = \frac{\alpha}{(1 + (\alpha - 1)u)^{1 + \frac{\alpha}{\alpha - 1}}}\,du. \]

The proof is based on formula (11.2.2).

Remark 11.2.8. The second part of this theorem includes results for the Kingman and the Bolthausen–Sznitman coalescent that can be found in the literature [15, 35]. It suggests that $n/\mu(n)$ can be interpreted as the appropriate scaling of a generation's duration, i.e. the time at which a specific lineage out of the $n$ present ones takes part in a merging event, see [12, 24].

In addition to the previous theorem we point out that the lengths of the different external branches behave asymptotically like i.i.d. random variables, which for coalescents coming down from infinity [12] also includes the external branches of maximal length. For the Bolthausen–Sznitman coalescent the picture changes:

Theorem 11.2.9 ([12, Theorem 1.6]). Let $M_n$ denote the length of the longest external branch in an n-coalescent. Then in the case of a Bolthausen–Sznitman coalescent the random variables $\log\log(n)\,(M_n - t_n)$ converge in distribution as $n \to \infty$, with
\[ t_n := \log\log n - \log\log\log n + \frac{\log\log\log n}{\log\log n}. \]
The limit has the density $\big((1 + e^{u})(1 + e^{-u})\big)^{-1}\,du$, $u \in \mathbb{R}$, and is thus the standard logistic distribution.

The latter arises as the distribution of the difference of two independent standard Gumbel random variables. Notably, if one proceeds to a point process description of the extremal lengths, then it turns out that the limiting point process is a Poisson point process shifted by an independent standard Gumbel random variable. This random shift builds up away from the small-time regime; consequently the analysis requires techniques different from the scheme (11.2.3). For details we refer to [12].

As a final application of the above approximation methodology we present the following result on the Bolthausen–Sznitman coalescent.

Theorem 11.2.10 ([13, Theorem 4]). Let $\bar\ell_{n,b}$ be the total length of all branches of order $b \ge 1$, meaning the branches subtending $b$ leaves (in particular $\bar\ell_{n,1} = \bar\ell_n$). Then we have for the Bolthausen–Sznitman coalescent, for $b \ge 2$, as $n \to \infty$,
\[ \bar\ell_{n,1} \overset{1}{\sim} \frac{n}{\log n} \qquad \text{and} \qquad \bar\ell_{n,b} \overset{1}{\sim} \frac{1}{b(b-1)}\,\frac{n}{\log^2 n}. \]


In the proof, formulas similar to (11.2.3) come into play. The result transfers immediately to the site frequency spectrum of the Bolthausen–Sznitman coalescent and is a counterpart to the result in [2] on the corresponding allele frequency spectrum. For analogous results for Beta-coalescents coming down from infinity we refer to [5].

11.2.3 Fluctuations in Λ-coalescents: a conjecture

The martingale approximation (11.2.3) appears to be fruitful also for the treatment of asymptotic fluctuations of the random variable $F_n(X)$ from (11.2.1), in the following way:
\[ F_n(X) \approx \int_{r_n}^{n} f(x)\,\frac{dx}{\kappa(x)} + \sum_{i=0}^{\rho_n - 1} f(X_i)\,\frac{\kappa(X_i) - \Delta_{i+1}}{\kappa(X_i)}. \tag{11.2.4} \]
This promises to improve results on the fluctuation behaviour, e.g. for the total length $\ell_n$ of n-coalescents, as follows. Suppose that the Λ-coalescent is regularly varying with exponent $1 < \alpha < g$, with $g$ equal to the golden ratio $\frac{1}{2}(\sqrt{5} + 1)$. This means [12] that
\[ \int_y^1 \frac{\Lambda(dp)}{p^2} = y^{-\alpha} L(y^{-1}), \qquad 0 < y < 1, \tag{11.2.5} \]
where the function $L$ is slowly varying at infinity. Then, using techniques from [13], one obtains that for any sequence $(a_n)_{n \ge 1}$ with $a_n \to \infty$ and $a_n = o(n)$ we have
\[ P(\Delta_1 > a_n \mid X_0 = n) = \frac{L(n/a_n)\,a_n^{-\alpha}}{\Gamma(2 - \alpha)\,L(n)}\,(1 + o(1)) \]
as $n \to \infty$, so we are in the domain of stable laws. Choosing $(a_n)_{n \ge 1}$ such that
\[ \frac{L(n/a_n)\,a_n^{-\alpha}}{\Gamma(2 - \alpha)\,L(n)} = \frac{1 + o(1)}{n}, \tag{11.2.6} \]
the approximation (11.2.4) leads to the following conjecture.

Conjecture 11.2.11 (Total length). Assume (11.2.5) and (11.2.6), with $1 < \alpha < g$. Then
\[ \frac{\ell_n - \int_2^n \frac{x}{\mu(x)}\,dx}{a_n\,n^{1-\alpha}/L(n)} \xrightarrow{d} c\,\varsigma \]
as $n \to \infty$, with
\[ c := \frac{(\alpha - 1)^{1 + 1/\alpha}}{(1 + \alpha - \alpha^2)^{1/\alpha}\,\Gamma(2 - \alpha)} \]
and a stable random variable $\varsigma$ with index $\alpha$, which is characterised by the properties $E[\varsigma] = 0$, $P(\varsigma > z) \sim z^{-\alpha}$ and $P(\varsigma < -z) = o(z^{-\alpha})$ as $z \to \infty$.

This conjecture is supported by corresponding results from [23] on Beta-coalescents, where $L(n) \sim (\alpha\,\Gamma(\alpha)\,\Gamma(2 - \alpha))^{-1}$ as $n \to \infty$.


11.3 Approximating Λ-coalescents with a dust component

Following [25] we now come to a second approximation scheme, which is applicable beyond the small-time restriction addressed below formula (11.2.3) and thus extends to regions in the coalescent that are closer to the root. It is tailored to Λ-coalescents with a dust component, but also has consequences in the dust-free case.

As is well known [30] for Λ-coalescents with a dust component, the logarithm of the block-counting process $N_n$ may be approximated by a subordinator $S$ in the sense that $\log N_n(t)$ and $\log n - S(t)$ are close to each other. Here the Lévy measure of the subordinator is equal to the image of the measure $p^{-2}\Lambda(dp)$ under the mapping $p \mapsto -\log(1 - p)$. The approximation relies on the fact that for this class of coalescents the very large mergers become dominant and smaller mergers may be neglected in the first instance.

However, for various purposes this approximation is not good enough. The reason is that, if at a merging event $k$ out of $b$ blocks fuse, then the block-counting process does not decrease by the value $k$ but by $k - 1$. Thus the block-counting process decreases slightly more slowly than the accompanying subordinator. For the process $\log N_n$ this discrepancy has approximately size $1/N_n(u)$ at time $u > 0$. Therefore we may state that, more accurately,
\[ \log N_n(u) \approx \log n - S(u) + \frac{\eta(N_n(u))\,u}{N_n(u)}, \]
where $\eta(b)$ denotes the subordinator's jump rate of those jumps that at state $N_n(u) = b$ find expression in the aforementioned discrepancy. This rate is given by
\[ \int \big(1 - (1 - p)^b\big)\,p^{-2}\,\Lambda(dp). \]
Thus, to make our intuition precise we introduce the process $Y_n$ as the solution of the stochastic integral equation
\[ Y_n(u) = \log n - S(u) + \int_0^u g\big(e^{Y_n(r)}\big)\,dr \]
with
\[ g(b) := \frac{1}{b} \int_{[0,1]} \big(1 - (1 - p)^b\big)\,\frac{\Lambda(dp)}{p^2}. \]
The integral takes finite values just for processes with a dust component. Then we have the following theorem.

Theorem 11.3.1 ([25, Theorem 10]). If the Λ-coalescent has a dust component, then for all $\varepsilon > 0$ there is an integer $k \ge 2$ such that for all $n$,
\[ P\Big( \sup_{u \ge 0} |\log N_n(u) - Y_n(u)|\, I_{\{N_n(u) > k\}} > \varepsilon \Big) \le \varepsilon. \]


11.3.1 Time to absorption and size of the last merger

Theorem 11.3.1 is the essential tool for the proof of the next fundamental result, which holds for any Λ-coalescent.

Theorem 11.3.2 ([26, Theorem 1]). The absorption times $\tilde\tau_n$ defined in Example 11.2.2 satisfy
\[ \tilde\tau_n \overset{P}{\sim} \frac{\log n}{\mu} \]
as $n \to \infty$, with
\[ \mu := \int_{[0,1]} |\log(1 - p)|\,\frac{\Lambda(dp)}{p^2}. \]

In the proof the case with dust is treated first, using Theorem 11.3.1. Afterwards, dust-free coalescents (where $\mu = \infty$) are approximated by coalescents with dust. Along similar lines one also obtains a central limit theorem for $\tilde\tau_n$, see [26].

Another application of this methodology concerns the size of the last merger of an n-coalescent. This is given by
\[ L_n := X_{\tau_n - 1} - 1. \]

For coalescents coming down from infinity it is easy to see that this quantity converges in distribution to a limiting distribution on $\mathbb{N}$. The next theorems deal with the general case.

Theorem 11.3.3 ([25, Theorems 1, 2 and 3]). For any Λ-coalescent fulfilling
\[ \int_{[0,1]} |\log(1 - p)|\,\Lambda(dp) < \infty \tag{11.3.1} \]
the sequence $(L_n)_{n \ge 1}$ is tight. Moreover, in Λ-coalescents with a dust component, condition (11.3.1) is necessary for tightness. If in addition to (11.3.1) we have for all $d > 0$,
\[ \Lambda\Big( \bigcup_{z=1}^{\infty} \{1 - e^{-zd}\} \Big) < \Lambda((0,1]), \tag{11.3.2} \]
then the sequence $(L_n)_{n \ge 1}$ is convergent in distribution.

For Beta-coalescents this result was obtained in [21, 27]. The counterexample in [25, Section 5: Non-convergence for Eldon–Wakeley coalescents] shows the relevance of condition (11.3.2).

In the case of convergence of the sequence $(L_n)_{n \ge 1}$ the coalescent may be time-reversed in the following way. Let
\[ \hat N_n(u) := \begin{cases} \lim_{\varepsilon \downarrow 0} N_n\big(\tilde\tau_n - (u + \varepsilon)\big) & \text{for } u < \tilde\tau_n, \\ n & \text{for } u \ge \tilde\tau_n; \end{cases} \]
in particular $\hat N_n(0) = L_n$. Again the next theorem also covers dust-free Λ-coalescents.


Theorem 11.3.4 ([25, Theorem 5]). If the sequence $(L_n)_{n \ge 1}$ converges in distribution, then the sequence of processes $(\hat N_n)_{n \ge 1}$ also converges in distribution in Skorohod space. The limiting process $\hat N_\infty$ is a Markov jump process with state space $\{2, 3, \ldots\}$.

For formulas determining the jump rates of $\hat N_\infty$ we refer to [25]. The special case of a Bolthausen–Sznitman coalescent has been treated in [18], and that of Beta-coalescents in [21].

11.4 An asymptotic expansion for Beta-coalescents We now come to a different approximation method. It provides not only an approach to asymptotic distributions, but also an asymptotic expansion in probability, which further allows for the treatment of evolving coalescents, see the next section. We specialise to the case in which ƒ is the Beta.2 ˛; ˛/-distribution for 1 < ˛ < 2. The following theorem gives an asymptotic expansion for a class of functionals, where the fluctuation term converges to a stochastic integral with respect to a compensated Lévy process that has an ˛-stable distribution. We will see that, for the evolving Beta-coalescent, the same Poisson construction gives a representation of the fluctuation term as an ˛-stable moving average process. Let F be the set of all differentiable functions f which for some c > 0 and 0 <  < ˛1 obey jf 0 .x/j 6 cx  1 for all x 2 .0; 1. For n 2 N and f 2 F let Jn .f / WD n

1=˛



1 ˛

X 1

k 0, and Ln D .Ln;s /s2R is a compensated Lévy 1 process with Lévy measure ˇ˛ p 1 ˛ dp, ˇ˛ WD €.˛/€.2 . ˛/ The statement of Theorem 11.4.1 is an asymptotic version of the equation  X   X  Xk  kC1  X  Xk  kC1  Xk n 1=˛  f n f D f ; n n n n n1=˛ k g. In this latter parameter regime the total length’s fluctuations originate mainly from randomness arising close to the root, which do not show up in the total external length. For details see [23]. In the case ˛ < g the fluctuations of the total length are captured by putting f .x/ WD ˛.˛ 1/€.˛/x 1 ˛ in Theorem 11.4.1, and Corollary 11.4.2 then describes the joint fluctuations of the total length and the total external length. Figure 11.4.3 illustrates this by simulation results. So far we discussed fluctuation results on Beta-coalescents under circumstances where randomness arises mainly from the Markov chain X embedded in the blockcounting process Nn . However, this scenario is inapplicable in important cases, most notably for the Kingman coalescent, where X is a deterministic process. This requires specifically tailored methods. In the following instance this amounts to couple the quantities of interest to certain Markov chains.


Theorem 11.4.4 ([7, Main Theorem]). Let $\bar\ell_{n,k}$ be the total length of all branches of order $k \ge 1$, that is those branches subtending $k$ leaves. Then we have for the Kingman coalescent, for each $k \ge 1$, as $n \to \infty$,
\[ \sqrt{\frac{n}{4 \log n}}\,\Big( \bar\ell_{n,k} - \frac{2}{k} \Big) \xrightarrow{d} N(0,1), \]
and the branch lengths of different order are asymptotically independent.

A corollary asserting asymptotic Poissonian fluctuations for the site frequency spectrum of the Kingman coalescent is immediate, see [7].

11.5 Evolving n-coalescents and their limits

11.5.1 A Poisson construction of the evolving n-coalescent

As in the previous section we think of $dt\,p^{-2}\Lambda(dp)$ as the intensity measure of p-mergers, where $\Lambda$ is a finite measure on $(0,1]$. Let the population size $n \in \mathbb{N}$ be fixed; individuals will be labelled $1, \ldots, n$. The following construction, which describes the evolving genealogy of the population driven by $\Lambda$, appears in [24]. It combines elements of the Poisson process constructions of the Λ-coalescent given in [30] and of the evolving Bolthausen–Sznitman coalescent in [32].

Let $\Upsilon = \Upsilon_n$ be a Poisson point process on $\mathbb{R} \times (0,1] \times [0,1]^n$ with intensity measure
\[ dt\,p^{-2}\Lambda(dp)\,dv_1 \ldots dv_n. \]
Suppose $(t, p, v_1, \ldots, v_n)$ is a point of $\Upsilon$. If zero or one of the points $v_1, \ldots, v_n$ is less than $p$, then no reproductive event occurs at time $t$. However, if $k \ge 2$ of these points are less than $p$, so that $v_{i_1} < \cdots < v_{i_k} < p$, then at time $t$ the individuals labelled $i_2, \ldots, i_k$ all die, and the individual labelled $i_1$ gives birth to $k - 1$ new individuals who are assigned the labels $i_2, \ldots, i_k$. Seen backwards in time, this amounts to a coalescence of the lineages labelled $i_1, \ldots, i_k$ at time $t$, and the rate of events that cause the lineages $i_1, \ldots, i_k$ to coalesce is $\lambda_{n,k}$.

For each $t \in \mathbb{R}$ one has a realisation of the n-coalescent which describes the genealogy of the $n$ individuals at time $t$. We denote the corresponding coalescent tree (read off from the genealogy backwards from time $t$) by $\mathcal{T}_n(t)$ and call the process $(\mathcal{T}_n(t),\ t \in \mathbb{R})$ the evolving n-coalescent. Figure 11.5.1, which is taken from [24], gives an illustration. Let us note that, unlike the lookdown construction (which will be discussed in Section 11.6), this construction does not exhibit a pathwise consistency between different $n$.
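The rule that turns a single point of the Poisson point process into a reproductive event can be written down in a few lines. The following is a minimal sketch of that rule only (it does not sample the point process itself); the function names `reproduction_event` and `inherit` are illustrative and not taken from [24].

```python
def reproduction_event(p, v):
    """Resolve one point (t, p, v_1, ..., v_n); individuals are labelled 1..n.

    Returns None if fewer than two marks v_i lie below p. Otherwise returns
    (parent, children): the parent is the label with the smallest mark below p,
    the children are the remaining labels with marks below p, in increasing
    order of their marks.
    """
    hit = sorted((vi, label) for label, vi in enumerate(v, start=1) if vi < p)
    if len(hit) < 2:
        return None
    parent = hit[0][1]
    children = [label for _, label in hit[1:]]
    return parent, children


def inherit(types, event):
    """Forward in time: the children are replaced by copies of the parent."""
    if event is None:
        return list(types)
    parent, children = event
    new_types = list(types)
    for child in children:
        new_types[child - 1] = types[parent - 1]
    return new_types


# Marks ordered as in the merger shown in Figure 11.5.1:
# v_4 < min(v_1, v_2) < max(v_1, v_2) < p < min(v_3, v_5).
event = reproduction_event(0.5, [0.2, 0.3, 0.9, 0.1, 0.8])
```

Read backwards in time, the returned pair says that the lineages of the parent and of all children coalesce at the time coordinate of the point.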


Figure 11.5.1. The picture shows realisations of $\mathcal{T}_5(t_1)$ (dashed lines) and $\mathcal{T}_5(t_2)$ (solid lines), with the root of $\mathcal{T}_5(t_2)$ being at time $t'$. Each black dot marks a point $(t, i)$ with $(t, p, v_1, \ldots, v_5) \in \Upsilon$ such that $v_i < p$. Only those coalescences are marked (by dotted lines) that affect $\mathcal{T}_5(t_1)$. The coordinates of the point $(t', p', v_1', \ldots, v_5') \in \Upsilon$, which leads to a merger both for $\mathcal{T}_5(t_1)$ and for $\mathcal{T}_5(t_2)$, obey $v_4' < \min(v_1', v_2') < \max(v_1', v_2') < p' < \min(v_3', v_5')$.

11.5.2 Fluctuations in evolving Beta-coalescents Let us now specialise to the case ƒ D Beta.2 ˛; ˛/, 1 < ˛ < 2. For each s 2 R and n 2 N, we denote the Markov chain embedded into the block-counting process of the s n . By shifting the origin of the scaled time to the coalescent tree Tn .n1 ˛ s/ by .Xks /kD0 time point s and re-centring the process Ln at this new time origin (which does not affect its increments), we can apply Theorem 11.4.1 and conclude that Z 1 Z 1  1 X  Xks  Dn f f .x/ dx C n1=˛ f m.r/ m.r/ dLn;s r ˛ 1 n 0 0 s k0 with X.t/ D .xi .t//i 2Zd ;

Y.t/ D .yi .t //i2Zd ;

Figure 13.2.2. The square lattice Z2 .


Andreas Greven and Frank den Hollander

and obtain the following system of coupled SDEs:
\[ dx_i(t) = \sum_{j \in \mathbb{Z}^d} a(i,j)\,[x_j(t) - x_i(t)]\,dt + \sqrt{d_0\,x_i(t)(1 - x_i(t))}\,dw_i(t) + K e\,[y_i(t) - x_i(t)]\,dt, \]
\[ dy_i(t) = e\,[x_i(t) - y_i(t)]\,dt, \qquad i \in \mathbb{Z}^d, \]

where $a(i,j)$ is the rate to migrate from $i$ to $j$. It turns out that this system exhibits the same qualitative behaviour as the system without seed-bank. The reason is that a single seed-bank can slow down the evolution, but cannot alter it. In particular, the dichotomy of clustering versus coexistence holds under the same condition of recurrent versus transient migration.

A qualitative change does occur when the seed-bank has a richer internal structure. Already in the non-spatial model it has been proposed to introduce wake-up times (from dormant to active) with fat tails (see [8, 9]). Unfortunately, this necessarily leads to non-Markovian models, which are difficult to analyse. We therefore introduce what we call a sequence of coloured seed-banks in every colony, with wake-up rates that allow for the wake-up time of a typical individual to have a fat tail, in such a way that when the colours are included in the state space the process is still Markov. In other words, we consider the process
\[ (X(t), Y(t))_{t \ge 0} \tag{13.2.6} \]
with
\[ X(t) = (x_i(t))_{i \in \mathbb{Z}^d}, \qquad Y(t) = \big( (y_{i,m}(t))_{m \in \mathbb{N}_0} \big)_{i \in \mathbb{Z}^d}, \tag{13.2.7} \]
and the following system of coupled SDEs:
\[ dx_i(t) = \sum_{j \in \mathbb{Z}^d} a(i,j)\,[x_j(t) - x_i(t)]\,dt + \sqrt{d_0\,x_i(t)(1 - x_i(t))}\,dw_i(t) + \sum_{m \in \mathbb{N}_0} K_m e_m\,[y_{i,m}(t) - x_i(t)]\,dt, \]
\[ dy_{i,m}(t) = e_m\,[x_i(t) - y_{i,m}(t)]\,dt, \qquad m \in \mathbb{N}_0,\ i \in \mathbb{Z}^d. \tag{13.2.8} \]

Here, $K_m$ is the ratio of the sizes of the m-dormant population and the active population, and $e_m$ is the rate of exchange between active and m-dormant individuals. We require that
\[ \chi = \sum_{m \in \mathbb{N}_0} K_m e_m < \infty \tag{13.2.9} \]
and
\[ a(i,j) = a(j,i) \quad \text{for all } i, j \in \mathbb{Z}^d. \tag{13.2.10} \]
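To get a feeling for the exchange terms in (13.2.8), one can discretise a single colony in time. The following is a minimal Euler–Maruyama sketch under simplifying assumptions that are not part of the model above: there is no migration, only finitely many colours are kept, and the active frequency is clipped to $[0,1]$ after each step. All names are illustrative.

```python
import math
import random


def euler_step(x, y, K, e, d0, dt, rng):
    """One Euler-Maruyama step for a single colony without migration.

    x is the active frequency, y[m] the dormant frequency of colour m.
    """
    exchange = sum(Km * em * (ym - x) for Km, em, ym in zip(K, e, y))
    noise = math.sqrt(max(d0 * x * (1.0 - x), 0.0) * dt) * rng.gauss(0.0, 1.0)
    # Clipping keeps the discretised frequency in [0, 1]; this is a numerical
    # convenience of the sketch, not a feature of the SDE.
    x_new = min(1.0, max(0.0, x + exchange * dt + noise))
    y_new = [ym + em * (x - ym) * dt for em, ym in zip(e, y)]
    return x_new, y_new


def weighted_mass(x, y, K):
    """x + sum_m K_m y_m, the quantity left unchanged by the exchange terms."""
    return x + sum(Km * ym for Km, ym in zip(K, y))


def simulate(x, y, K, e, d0, dt, steps, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        x, y = euler_step(x, y, K, e, d0, dt, rng)
    return x, y


K = [1.0, 2.0, 4.0]    # hypothetical relative seed-bank sizes
e = [1.0, 0.5, 0.25]   # hypothetical exchange rates
x0, y0 = 0.9, [0.1, 0.2, 0.3]

x_det, y_det = simulate(x0, y0, K, e, d0=0.0, dt=0.01, steps=1000)
x_rnd, y_rnd = simulate(x0, y0, K, e, d0=1.0, dt=0.01, steps=1000)
```

With the resampling switched off ($d_0 = 0$) the weighted sum $x + \sum_m K_m y_m$ is conserved by each step, because the drift of $x$ is exactly balanced by the $K_m$-weighted drifts of the $y_m$. With resampling it is only conserved on average.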

From high to low volatility


Assumption (13.2.9) guarantees that the total flow from the active to the dormant population is finite in finite time, while assumption (13.2.10) is necessary for our claims on the longtime behaviour stated below ([32] provides counterexamples when (13.2.10) fails). The system of coupled SDEs in (13.2.8) determines the generator
\[ L = \sum_{i \in G} \bigg[ \sum_{j \in G} a(i,j)\,(x_j - x_i)\,\frac{\partial}{\partial x_i} + d_0\,x_i(1 - x_i)\,\frac{\partial^2}{\partial x_i^2} + \sum_{m \in \mathbb{N}_0} \Big( K_m e_m\,(y_{i,m} - x_i)\,\frac{\partial}{\partial x_i} + e_m\,(x_i - y_{i,m})\,\frac{\partial}{\partial y_{i,m}} \Big) \bigg], \]

which acts on F , the algebra of twice continuously differentiable functions depending on finitely many components. We proved the following in [32, Theorem 2.4]. Theorem 13.2.2 (The spatial Fleming–Viot process with seed-bank is well defined). d For every initial law  2 P .Œ0; 1  Œ0; 1N0 /Z , the .L; F ; /-martingale problem is well-posed and defines a strong Markov and Feller process, denoted by (13.2.6) and (13.2.7). 13.2.2.2 Looking backward: Duality. The dual process is again a spatial coalescent. However, now the coalescence and the migration occur in the active state only. The migration is no longer a random walk. In particular, for two partition elements to be able to coalesce they need to be at the same site and both be active. This already indicates that coalescence is less likely to happen. The duality function is again given by mixed spatial moments:  H .xi /i2Zd ; ..yi;m /m2N0 /i 2Zd ; ni ; ..nQ i;m /m2N0 /i 2Zd D xini .yi;m /nQ i;m : Here, ni is the number of active partition elements at colony i and nQ i;m is the number of dormant partition elements of colour m at colony i. The partition elements independently follow a Markov process on Zd  ¹A; .Dm /m2N0 º (A stands for active, Dm stands for dormant with colour m), until they meet on Zd and coalesce at rate d0 . In other words, coalescence works as before, but only in the active state, while in the dormant state nothing happens. The Markov process has transition rates b.  ;  / given by 8 d ˆ ˆ ˆa.i; j /; i; j 2 Z ; k D l D A; ˆ 3.

x2Zd


13.3.1 (I) Spatial Cannings with block resampling

In the case of the Cannings process with block resampling, for the coalescent starting from two partition elements there is a strictly positive probability to coalesce from any distance. This makes it possible for them to coalesce with probability 1 even when $q_1 < 1$. In other words, coexistence may change to clustering due to block resampling. Indeed, set
\[ \lambda_k = \Lambda_k([0,1]), \qquad k \in \mathbb{N}_0, \]
and let $a_t(\cdot\,,\cdot)$ denote the time-$t$ transition kernel of the random walk on $\Omega_N$ with migration coefficients
\[ c_k + N^{-1}\lambda_{k+1}, \qquad k \in \mathbb{N}_0, \]
starting at 0. Here, the extra term $N^{-1}\lambda_{k+1}$ comes from the reshuffling that takes place before the resampling, which induces additional migration. Define the hazard to coalesce
\[ H_N = \sum_{k \in \mathbb{N}_0} \lambda_k N^{-k} \int_0^\infty ds\; a_{2s}(0, B_k(0)), \]
where $B_k(0)$ is the k-block in $\Omega_N$ around 0. Let
\[ \mathrm{Var}_x(\psi) = \int_K \int_K Q_x(du, dv)\,\psi(u)\,\psi(v), \qquad \psi \in C_b(K, \mathbb{R}), \]
with $Q_x(du, dv)$ defined in (13.2.3), and
\[ E_\mu[\mathrm{Var}_{\cdot}(\psi)] = \int_{P(K)} \mu(dx)\,\mathrm{Var}_x(\psi). \]
(The lower index $\cdot$ indicates that the variance is taken with respect to a random type distribution and is averaged afterwards.) We proved the following in [30, Theorem 1.8].

Theorem 13.3.1 (Dichotomy for Cannings with block resampling).
(a) [Coexistence] If $H_N < \infty$, then
\[ \liminf_{t \to \infty}\ \sup_{\psi \in C_b(K, \mathbb{R})} E_{x_i(t)}[\mathrm{Var}_{\cdot}(\psi)] > 0 \quad \text{for all } i \in \Omega_N. \]
(b) [Clustering] If $H_N = \infty$, then
\[ \lim_{t \to \infty}\ \sup_{\psi \in C_b(K, \mathbb{R})} E_{x_i(t)}[\mathrm{Var}_{\cdot}(\psi)] = 0 \quad \text{for all } i \in \Omega_N. \]

This in turn leads to the following picture for the process in (13.2.5).

Theorem 13.3.2 (Ergodic behaviour for finite N). Suppose that the law of $X(0)$ is stationary and ergodic with respect to translations in $\Omega_N$ with single-site mean $\theta \in P(K)$. Then the following dichotomy holds.


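(a) [Coexistence] If $H_N < \infty$, then
\[ \mathcal{L}[X(t)] \underset{t \to \infty}{\Longrightarrow} \nu_\theta^{c,\lambda} \in P(P(K)^{\Omega_N}) \]
for some unique law $\nu_\theta^{c,\lambda}$ that is stationary and ergodic with respect to translations in $\Omega_N$ with single-site mean $\theta$.
(b) [Clustering] If $H_N = \infty$, then
\[ \mathcal{L}[X(t)] \underset{t \to \infty}{\Longrightarrow} \int_K \theta(du)\,\delta_{(\delta_u)^{\Omega_N}} \in P(P(K)^{\Omega_N}). \]

We can now analyse the effect of the block resampling. We focus on the volatility of the k-block averages
\[ Y_k^{(N)}(t) = \frac{1}{N^k} \sum_{j \in B_k(i)} x_j(t N^k), \qquad i \in \Omega_N,\ t \ge 0, \]
in the hierarchical mean-field limit $N \to \infty$. Let $d = (d_k)_{k \in \mathbb{N}_0}$ be the sequence of volatilities of k-block averages in the spatial Cannings process with block resampling. These satisfy the recursion relation [30, equation (1.45)]
\[ d_0 = 0, \qquad \frac{1}{d_{k+1}} = \frac{1}{c_k} + \frac{1}{\frac{1}{2}\lambda_k + d_k}, \qquad k \in \mathbb{N}_0. \]
Let $d^* = (d_k^*)_{k \in \mathbb{N}_0}$ be the sequence of volatilities of k-block averages in the spatial Cannings process without block resampling, i.e. there is resampling in single colonies ($\lambda_0 > 0$) but not in blocks of colonies ($\lambda_k = 0$ for all $k \in \mathbb{N}$). In this case the recursion can be solved to give
\[ d_0^* = 0, \qquad d_k^* = \frac{\frac{1}{2}\lambda_0}{1 + \frac{1}{2}\lambda_0\,\sigma_k}, \quad k \in \mathbb{N}, \qquad \sigma_k = \sum_{l=0}^{k-1} \frac{1}{c_l}. \]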

We proved the following in [30, Theorem 1.11].

Theorem 13.3.3 (Comparison of hierarchical Fleming–Viot and hierarchical Cannings). The following hold for $d = (d_k)_{k \in \mathbb{N}_0}$:
(a) The maps $c \mapsto d$ and $\lambda \mapsto d$ are component-wise non-decreasing.
(b) $d \ge d^*$ component-wise.
(c) Clustering occurs if and only if $\sum_{k \in \mathbb{N}_0} \frac{1}{c_k} \sum_{l=0}^{k} \lambda_l = \infty$.
(d) If $\lim_{k \to \infty} \sigma_k = \infty$ and $\sum_{k \in \mathbb{N}} \sigma_k \lambda_k < \infty$, then $\lim_{k \to \infty} \sigma_k d_k = 1$.
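The volatility recursion is easy to iterate numerically, which gives a quick check of parts (b) and (d) of the theorem in simple cases. The sketch below assumes the recursion reads $d_0 = 0$, $1/d_{k+1} = 1/c_k + 1/(\tfrac{1}{2}\lambda_k + d_k)$, and that without block resampling the solution is $d_k^* = \tfrac{1}{2}\lambda_0 / (1 + \tfrac{1}{2}\lambda_0 \sigma_k)$ with $\sigma_k = \sum_{l<k} 1/c_l$. The sequences chosen for $c$ and $\lambda$ are hypothetical examples.

```python
def volatilities(c, lam, kmax):
    """Iterate d_0 = 0, 1/d_{k+1} = 1/c_k + 1/(lam_k/2 + d_k) up to k = kmax."""
    d = [0.0]
    for k in range(kmax):
        inner = 0.5 * lam(k) + d[k]
        d.append(0.0 if inner == 0.0 else 1.0 / (1.0 / c(k) + 1.0 / inner))
    return d


def closed_form_without_blocks(c, lam0, kmax):
    """d*_k = (lam0/2) / (1 + (lam0/2) * sigma_k), sigma_k = sum_{l<k} 1/c_l."""
    d = [0.0]
    sigma = 0.0
    for k in range(1, kmax + 1):
        sigma += 1.0 / c(k - 1)
        d.append(0.5 * lam0 / (1.0 + 0.5 * lam0 * sigma))
    return d


kmax = 2000
c = lambda k: 1.0                              # hypothetical migration coefficients
only_local = lambda k: 1.0 if k == 0 else 0.0  # resampling in single colonies only
with_blocks = lambda k: 2.0 ** (-k)            # hypothetical block resampling rates

d_star = volatilities(c, only_local, kmax)
d_closed = closed_form_without_blocks(c, 1.0, kmax)
d_blocks = volatilities(c, with_blocks, kmax)
sigma_kmax = float(kmax)                       # sigma_k = k when c_k = 1
```

In this example the recursion reproduces the closed form, block resampling never lowers the volatility, and $\sigma_k d_k^*$ is already close to 1 at $k = 2000$.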


In particular, (a), (b) say that both migration and reshuffling-resampling increase the volatility, (c) says that the dichotomy due to migration is affected by the reshuffling-resampling only when the latter is strong enough, i.e. when $\sum_{k \in \mathbb{N}_0} \lambda_k = \infty$, while (d) says that the scaling behaviour of $d_k$ in the clustering regime is unaffected by the reshuffling-resampling when the latter is weak enough, i.e. when $\sum_{k \in \mathbb{N}} \sigma_k \lambda_k < \infty$. Note that, due to block coalescence, clustering may occur for migrations that, in the absence of block coalescence, give coexistence.

13.3.2 (II) Spatial Fleming–Viot with seed-bank

Recall the definition of $K_m, e_m$ below (13.2.8) and $\chi$ in (13.2.9). For the system with seed-bank we obtain the following dichotomy for the longtime behaviour. Define
\[ \varrho = \sum_{m \in \mathbb{N}_0} K_m. \]
The wake-up time of a typical individual, denoted by $\tau$, has distribution
\[ P(\tau > t) = \sum_{m \in \mathbb{N}_0} \frac{K_m e_m}{\chi}\,e^{-e_m t}. \]
Suppose that
\[ K_m \sim A\,m^{-\alpha}, \qquad e_m \sim B\,m^{-\beta}, \qquad m \to \infty, \tag{13.3.1} \]
where $A, B \in (0,\infty)$ and $\alpha, \beta \in \mathbb{R}$ with $\alpha < 1 < \alpha + \beta$. Subject to (13.3.1), we have
\[ P(\tau > t) \sim C\,t^{-\gamma}, \qquad t \to \infty, \]
where
\[ \gamma = \frac{\alpha + \beta - 1}{\beta} \in (0,1), \qquad C = \frac{A}{\beta \chi}\,B^{1-\gamma}\,\Gamma(\gamma) \in (0,\infty), \]
with $\Gamma$ the Gamma-function. There is a principal difference between the cases $\varrho < \infty$ and $\varrho = \infty$. We proved the following in [32, Theorem 3.3]. Let $a_t$ denote the time-$t$ transition kernel. Define two hazards to coalesce
\[ H_a = \int_0^\infty a_t(0,0)\,dt, \qquad H_{a,\gamma} = \int_0^\infty a_t(0,0)\,t^{-(1-\gamma)/\gamma}\,dt. \]

Theorem 13.3.4 (Dichotomy and longtime behaviour). Suppose that the law of .X.0/; Y .0// is stationary and ergodic with respect to translations in Zd with singlesite mean  2 Œ0; 1. (a) Suppose that % < 1. If Ha < 1, then K;e

LŒ.X.t/; Y.t/ HHH)  t!1

d 2 P .Œ0; 1  Œ0; 1N0 /Z ;


K;e

where  is an equilibrium measure that is stationary and ergodic with respect to translations in Zd , and satisfies E Œ.xi ; yi / D .;  N0 /, where P   x0 C M mD0 Km y0;m  D lim : P M !1 1C M mD0 Km If Ha D 1, then LŒ.X.t/; Y.t// HHH) .1 t!1

/ı.0;0N0 /Zd C ı.1;1N0 /Zd :

(b) Suppose that $\varrho = \infty$. Then the above dichotomy holds with $H_a$ replaced by $H_{a,\gamma}$.

Thus, we see that if $\varrho = \infty$, then the dichotomy is shifted from clustering to coexistence. For $\gamma \in [\frac{1}{2}, 1)$ there is an interesting competition between the migration and the seed-bank, while for $\gamma \in (0, \frac{1}{2})$ the seed-bank completely dominates and coexistence occurs no matter what the migration is. The exponent $\gamma$ determines the tail of the wake-up time. The smaller $\gamma$ is, the longer two lineages in the dual tend to be dormant, and so the harder it is for them to meet at the same colony while being active and coalesce. For symmetric migration with finite second moments, dimension $d = 2$ is critical: $a_t(0,0) \asymp t^{-1}$ as $t \to \infty$, and so there is clustering for finite seed-bank but coexistence for infinite seed-bank (for all $\gamma \in (0,1)$). For biology this is an important observation.

13.3.3 The two model classes in comparison

The key quantity for the dichotomy between clustering and coexistence is the hazard to coalesce of two dual partition elements. In the classical model the hazard is $H_a$, in the Cannings model $H_N$, and in the Fleming–Viot model with seed-bank $H_a$ or $H_{a,\gamma}$, depending on whether $\varrho < \infty$ or $\varrho = \infty$. The Cannings model leads to a decrease of the critical dimension, while the Fleming–Viot model with seed-bank leads to an increase of the critical dimension.
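The shift of the critical dimension can be illustrated by a simple power-counting check. The sketch below rests on two assumptions that go beyond the text: the migration kernel decays like $a_t(0,0) \asymp t^{-d/2}$ in dimension $d$ (as for symmetric migration with finite second moments), and only the behaviour of the hazard integrals for large $t$ is examined. Here `gamma` stands for the tail exponent of the wake-up time, computed from the exponents `alpha` and `beta` of the seed-bank sizes and exchange rates; the function names are illustrative.

```python
def tail_exponent(alpha, beta):
    """Tail exponent of the wake-up time: (alpha + beta - 1) / beta."""
    if not (alpha < 1 < alpha + beta):
        raise ValueError("need alpha < 1 < alpha + beta")
    return (alpha + beta - 1) / beta


def hazard_finite_at_infinity(d, gamma=None):
    """Power counting for the hazard integral as t tends to infinity.

    Assumes a_t(0,0) decays like t^(-d/2). With gamma=None the integrand is
    a_t(0,0) itself (finite seed-bank); otherwise it carries the extra factor
    t^(-(1-gamma)/gamma) (infinite seed-bank). The integral converges at
    infinity exactly when the total decay exponent exceeds 1.
    """
    decay = d / 2.0
    if gamma is not None:
        decay += (1.0 - gamma) / gamma
    return decay > 1.0


gamma_example = tail_exponent(0.5, 1.0)
```

A finite hazard corresponds to coexistence and an infinite one to clustering. Under the stated assumptions, `hazard_finite_at_infinity(2)` is false while `hazard_finite_at_infinity(2, gamma)` is true for every `gamma` in $(0,1)$, in line with the remark on dimension $d = 2$. For `gamma` below $\frac{1}{2}$ the extra factor alone already makes the integrand decay faster than $t^{-1}$.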

13.4 Random environment Often it is more realistic to assume that the mechanism of evolution is spatially inhomogeneous. The important question is whether or not this change affects the longtime behaviour qualitatively. We will explore this issue for the Cannings process with block resampling. In particular, we will look at the situation where the resampling rates are spatially inhomogeneous. To define the modification of the model introduced in Section 13.2.1.3, we need to consider the full tree (see Figure 13.4.1), i.e. [ .k/ T N ; .k/ N D N D N =Bk .0/; k2N0


Figure 13.4.1. The hierarchical tree T 3 .

where N =Bk .0/ denotes the quotient group of N modulo Bk .0/ (recall (13.2.4)). Note that the leaves of the tree form the set N . For  2 T N , write jj D the height of  (counting from the leaves); i.e. jj D k when  2 .k/ N for k 2 N0 , and define Bjj ./; to be the set of sites in N that lie below . We want to make the reshuffling-resampling spatially random. To that end, let ® ¯ ƒ.!/ D ƒ .!/ W  2 T N be a random field of Mfin .Œ0; 1/-valued resampling measures indexed by the tree. Throughout the paper, we use the symbol ! to denote the random environment and the symbol P to denote the law of !. We assume that ƒ .!/ is of the form ƒ .!/ D jj  .!/; where  D .k /k2N0 is a deterministic sequence in .0; 1/ and ®  ¯  .!/ W  2 T N

(13.4.1)

(13.4.2)

is a random field of Mfin .Œ0; 1/-valued resampling measures that is stationary under translations in T N. Abbreviate  .!/ D  .!/..0; 1/; (13.4.3) which is the total mass of  .!/. Clearly, ¯ ®   .!/ W  2 T N

(13.4.4)

is a random field of .0; 1/-valued total masses that is also stationary under translations in T N . We assume that EŒ .!/ D 1;

EŒ. .!//2  D C 2 .0; 1/;

(13.4.5)


and that the -algebra at infinity associated with (13.4.3), defined by \  T T D FL ; FL D   .  / with  2 T N such that dN .0; / > L ; (13.4.6) L2N0

is trivial, where dT denotes the hierarchical distance on T N. N The analogue of (13.2.5) is the random process X.!/ D .X.!I t// t >0 : We proved the following in [31, Theorem 3.2]. Theorem 13.4.1 (Longtime behaviour for Cannings in random environment). Assume (13.4.1)–(13.4.6). Suppose that, under the law P , the law of the initial state X.!I 0/ is stationary and ergodic with respect to translations in N , with single-site mean  D EŒX0 .!I 0/ 2 P .K/: Then, for P -a.e. !, there exists an equilibrium measure  .!/ 2 P .P .K/N /, arising as lim LŒX.!I t/ D  .!/; t!1

satisfying Z P .K/N

x0  .!/.dx/ D :

Moreover, under the law $\mathbb{P}$, the random variable $\nu_\theta(\omega)$ is stationary and ergodic with respect to translations in $\Omega_N$.

The proof of Theorem 13.4.1 is based on a computation with the dual process, which allows us to control second moments. In random environment this computation involves two random walks in the same environment, and the difference of the two random walks is not a random walk itself. We identify the parameter regime for which $\nu_\theta(\omega)$ is a multitype equilibrium (which means coexistence given $\omega$),

$$\sup_{\psi \in C_b(K,\mathbb{R})} E_{\nu_\theta(\omega)}\bigl[\mathrm{Var}_{\cdot}(\psi)\bigr] > 0,$$

respectively, a monotype equilibrium (that is, clustering given $\omega$), i.e.

$$\nu_\theta(\omega) = \int_K \delta_{(\delta_u)^{\Omega_N}}\, \theta(\mathrm{d}u).$$

We proved the following in [31, Theorem 3.3].

Theorem 13.4.2 (Dichotomy for Cannings in random environment). Assume equations (13.4.1)–(13.4.6).
(a) Let $\mathcal{C} = \{\omega\colon \text{in } \omega \text{ coexistence occurs}\}$. Then $\mathbb{P}(\mathcal{C}) \in \{0,1\}$.

From high to low volatility


(b) $\mathbb{P}(\mathcal{C}) = 1$ if and only if

$$\sum_{k \in \mathbb{N}_0} \frac{1}{c_k + N^{-1}\lambda_{k+1}} \sum_{l=0}^{k} \lambda_l < \infty.$$

Thus, we see that the dichotomy between coexistence and clustering is preserved: the effect of the random environment is not qualitative; only quantitative changes occur. For example, the random environment lowers the volatility $d_k$ on every hierarchical scale $k$ compared to the average environment. The intuition behind this is that the random environment causes fluctuations in the resampling, which in turn reduce the clustering. For some choices of $c$ and $\lambda$ the random environment slows down the growth of the monotype clusters, i.e. enhances the diversity of types.
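As a rough numerical companion to the criterion in Theorem 13.4.2(b), the following Python sketch evaluates partial sums of the series, read here as $\sum_{k} (c_k + N^{-1}\lambda_{k+1})^{-1} \sum_{l \le k} \lambda_l$. The function name, the migration coefficients $c_k = c^k$ and the constant sequence $\lambda_k \equiv 1$ are illustrative assumptions, not choices made in [31]. Partial sums can only suggest convergence or divergence; they do not prove it.

```python
from typing import Callable


def coexistence_partial_sum(c: Callable[[int], float],
                            lam: Callable[[int], float],
                            N: int, K: int) -> float:
    """Sum over k = 0..K-1 of (sum_{l=0}^{k} lam_l) / (c_k + lam_{k+1} / N)."""
    total = 0.0
    inner = 0.0  # running value of sum_{l=0}^{k} lam_l
    for k in range(K):
        inner += lam(k)
        total += inner / (c(k) + lam(k + 1) / N)
    return total


# Hypothetical parameter choices, for illustration only.
N = 2
for c_base in (1.0, 2.0):
    sums = [coexistence_partial_sum(lambda k: c_base ** k, lambda k: 1.0, N, K)
            for K in (10, 20, 40)]
    print(c_base, sums)
```

With these assumed inputs the partial sums for $c = 2$ stabilise quickly, while those for $c = 1$ keep growing quadratically in the cut-off, which is the qualitative split between the two sides of the dichotomy.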

13.5 Extensions

Having completed our brief description of the two model classes (I) and (II), we discuss some interesting questions that are treated in our papers but that we cannot discuss in detail here.

13.5.1 On some generalisations of the model classes

Our models allow for more general geographic spaces replacing $\Omega_N$ or $\mathbb{Z}^d$. In fact, our results on the longtime behaviour on $\Omega_N$ (for Cannings with block resampling) and on $\mathbb{Z}^d$ (for Fleming–Viot with seed-bank) hold more generally for countable Abelian groups. Furthermore, concerning resampling, for the case $K = \{0,1\}$ we can replace the volatility function $g_{\mathrm{FW}}(x) = d_0\, x(1-x)$, $x \in [0,1]$, in the classical Fisher–Wright diffusion by volatility functions $g(x)$ with $g(0) = g(1) = 0$, $g(x) > 0$ for $x \in (0,1)$, and $g$ locally Lipschitz on $[0,1]$. This requires new methods of proof, including coupling, because duality can no longer be used. Still, the same phenomena prevail. In the case of more than two types, i.e. when we pass from the Fisher–Wright model with $K = \{0,1\}$ to the Fleming–Viot model with $K = [0,1]$, the basic questions about uniqueness of the martingale problem remain open for state-dependent volatility, and only small sets of possible volatility functions can be handled (see [17, 19]). Still, the same phenomena persist once we assume well-posedness. In principle our methods for including seed-banks should carry over, but this has to be worked out in more detail.

13.5.2 Justification of the model: Limits of individual-based models

Our models arise from individual-based models with $M$ individuals (respectively, $M$ individuals per site) in the limit as $M \to \infty$ when the individuals carry mass $M^{-1}$. This justifies the form of the generators as they appeared above. We can show tightness in path space and obtain convergence to the solution of the martingale problem. We


can show that many individual-based models converge to the same diffusion model. For example, we can relax the assumption that active and dormant states follow a strict exchange as used above, and allow for active individuals to become dormant at a prescribed rate and dormant individuals to become active at another prescribed rate independently.

13.5.3 Analysis close to the critical dimension

On $\mathbb{Z}^d$ the critical dimension is $d = 2$, which separates recurrence from transience and which itself is recurrent. Thus, for $d \le 2$ we have clustering and for $d \ge 3$ we have coexistence. For the behaviour of spatial populations in general we should find that it is the interplay of the dimension and the structure of the migration kernel that determines clustering or coexistence. In [14, 15] it is shown that the proper concept is the degree of the underlying random walk, which is a real number $\ge -1$. Here, $-1$ corresponds to positive recurrence (in the classical sense of finite mean return time), the interval $(-1,0)$ to strong recurrence (for instance, simple random walk on $\mathbb{Z}$), $0$ to critical recurrence, $(0,1]$ to transience, and $(1,\infty)$ to strong transience. The concept of the degree, and the formulas that have been derived for it, allow us to interpret the shift towards clustering in model (I) and towards coexistence in model (II) as a shift of the critical degree downwards, respectively, upwards. The random walk on $\Omega_N$ with $c_k = c^k$, $k \in \mathbb{N}_0$, and $c \in (0,N)$ has degree $\log c/(\log N - \log c)$. Hence the degree is $0$ for $c = 1$ (the critical case), and for $c \ne 1$ it converges to $0$ as $N \to \infty$, either from above or from below depending on whether $c > 1$ or $c < 1$. This means that we may approach criticality by taking the hierarchical mean-field limit, and may tackle the problem of finding all the universality classes for the large space-time scaling behaviour. Concrete answers can be given in terms of a so-called renormalisation analysis.
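The degree formula and the regime boundaries quoted above can be tabulated directly. The following Python sketch is only an illustration: the function names and the sample values of $c$ and $N$ are assumptions made here, and the classification simply transcribes the intervals listed in the text.

```python
import math


def degree(c: float, N: int) -> float:
    """Degree log c / (log N - log c) of the hierarchical walk with c_k = c^k, 0 < c < N."""
    if not 0 < c < N:
        raise ValueError("need 0 < c < N")
    return math.log(c) / (math.log(N) - math.log(c))


def regime(deg: float) -> str:
    """Name of the recurrence/transience regime for a degree >= -1."""
    if deg < -1:
        raise ValueError("the degree is a real number >= -1")
    if deg == -1:
        return "positive recurrence"
    if deg < 0:
        return "strong recurrence"
    if deg == 0:
        return "critical recurrence"
    if deg <= 1:
        return "transience"
    return "strong transience"


# Hypothetical sample values: the degree approaches 0 as N grows, from either side.
for c in (0.5, 1.0, 2.0):
    print(c, [round(degree(c, N), 4) for N in (10, 10**3, 10**6)])
```

The printed rows show the sign of the degree being fixed by whether $c$ is below or above $1$, with the magnitude shrinking as $N$ increases, in line with the hierarchical mean-field limit mentioned above.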

13.6 Perspectives

We close with a brief look at the future. The two model classes we have discussed in the present contribution pose many challenges and should be investigated further from at least two perspectives.

• Define and analyse the process of evolving genealogies of populations, using the approach outlined in [21–23, 34]. Exploit this process to investigate the effect on the genealogy of an increase or a decrease of the resampling volatility.

• Identify the universality classes for the large space-time scaling. This can be done by taking the spatial continuum limit, in which case the possible limit dynamics represent the possible universality classes.

The genealogies are described by equivalence classes of ultrametric measure spaces, an approach developed in [22, 33], and the evolution of the process is given via


a martingale problem, an approach developed for non-spatial models in [23,34] and for spatial models in [35, 36]. For models with seed-bank, because of the presence of long inactive pieces in the ancestral path that is used to describe the genealogy, the latter approaches require a technique developed in [37,38], which codes the ancestral path in order to treat singular settings where dust is formed after we pass to the approximation by Moran models. The continuum space limit for genealogies must be based on ideas going back to [36], where the Fleming–Viot model on the geographic space R was treated. We need to extend this analysis to the continuum hierarchical group, and we need to cope with the Cannings mechanism. This involves replacing the Brownian motion and the Brownian web used for the description of genealogies by Lévy-processes and Lévy-webs. Acknowledgements. The research described above was carried out by the authors together with S. Kliem, A. Klimovsky and M. Oomen. Details can be found in [30–32].

References

[1] N. H. Barton, A. M. Etheridge, and A. Véber, A new model for evolution in a spatial continuum, Electron. J. Probab. 15 (2010), 162–216.
[2] N. Berestycki, Recent progress in coalescent theory, Ensaios Mat. 16 (2009), 1–193.
[3] N. Berestycki, A. M. Etheridge, and A. Véber, Large scale behaviour of the spatial Λ-Fleming–Viot process, Ann. Inst. Henri Poincaré Probab. Stat. 49 (2013), 374–401.
[4] M. Birkner and J. Blath, Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model, J. Math. Biol. 57 (2008), 435–465.
[5] M. Birkner and J. Blath, Measure-valued diffusions, general coalescents and population genetic inference, in: Trends in Stochastic Analysis (eds. J. Blath, P. Mörters, and M. Scheutzow), Cambridge University, Cambridge (2009), 329–363.
[6] M. Birkner and J. Blath, Genealogies and inference for populations with highly skewed offspring distribution, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 151–177.
[7] J. Blath, A. M. Etheridge, and M. Meredith, Coexistence in locally regulated competing populations and survival of branching annihilating random walk, Ann. Appl. Probab. 17 (2007), 1474–1507.
[8] J. Blath, A. Gonzáles Casanova, N. Kurt, and D. Spanò, The ancestral process of long-range seed bank models, J. Appl. Probab. 50 (2013), 741–759.
[9] J. Blath, A. Gonzáles Casanova, N. Kurt, and M. Wilke-Berenguer, A new coalescent for seed-bank models, Ann. Appl. Probab. 26 (2016), 857–891.
[10] J. Blath and N. Kurt, Population genetic models of dormancy, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 247–265.
[11] C. Cannings, The latent roots of certain Markov chains arising in genetics: A new approach, I. Haploid models, Adv. Appl. Probab. 6 (1974), 260–290.


[12] C. Cannings, The latent roots of certain Markov chains arising in genetics: A new approach, II. Further haploid models, Adv. Appl. Probab. 7 (1975), 264–282.
[13] J. T. Cox and A. Greven, Ergodic theorems for infinite systems of locally interacting diffusions, Ann. Probab. 22 (1994), 833–853.
[14] D. A. Dawson, L. G. Gorostiza, and A. Wakolbinger, Hierarchical random walks, in: Asymptotic Methods in Stochastics, Festschrift for Miklós Csörgö (eds. L. Horváth and B. Szyszkowicz), American Mathematical Society, Providence (2004), 173–193.
[15] D. A. Dawson, L. G. Gorostiza, and A. Wakolbinger, Degrees of transience and recurrence and hierarchical random walks, Potential Anal. 22 (2005), 305–350.
[16] D. A. Dawson and A. Greven, Spatial Fleming–Viot Models with Selection and Mutation, Springer, Cham, 2014.
[17] D. A. Dawson, A. Greven, F. den Hollander, R. Sun, and J. Swart, The renormalization transformation for two-type branching models, Ann. Inst. Henri Poincaré Probab. Stat. 44 (2008), 1038–1077.
[18] D. A. Dawson, A. Greven, and J. Vaillancourt, Equilibria and quasi-equilibria for infinite systems of Fleming–Viot processes, Trans. Amer. Math. Soc. 347 (1995), 2277–2360.
[19] D. A. Dawson and P. March, Resolvent estimates for Fleming–Viot operators and uniqueness of solutions to related martingale problems, J. Funct. Anal. 132 (1995), 417–472.
[20] F. den Hollander and G. Pederzani, Multi-colony Wright–Fisher with seed-bank, Indag. Math. 28 (2017), 637–669.
[21] A. Depperschmidt and A. Greven, Tree-valued Feller diffusion, preprint 2019, https://arxiv.org/abs/1904.02044.
[22] A. Depperschmidt, A. Greven, and P. Pfaffelhuber, Marked metric measure spaces, Electron. Commun. Probab. 16 (2011), 174–188.
[23] A. Depperschmidt, A. Greven, and P. Pfaffelhuber, Path-properties of the tree-valued Fleming–Viot dynamics, Electron. J. Probab. 18 (2013), 1–47.
[24] R. Der, C. L. Epstein, and J. B. Plotkin, Generalized population models and the nature of genetic drift, Theor. Popul. Biol. 80 (2011), 80–99.
[25] B. Eldon and J. Wakeley, Coalescent processes when the distribution of offspring number among individuals is highly skewed, Genetics 172 (2006), 2621–2633.
[26] A. M. Etheridge, Some Mathematical Models from Population Genetics, École d’Été de Probabilités de Saint-Flour XXXIX-2009, Springer, Berlin, 2011.
[27] S. N. Ethier and T. G. Kurtz, Markov Processes. Characterization and Convergence, Wiley, New York, 1986.
[28] F. Freund, Multiple-merger genealogies: Models, consequences, inference, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 179–202.
[29] A. Gonzáles Casanova, E. Aguirre-von Wobeser, G. Espín, L. Servín-González, N. Kurt, D. Spanò, J. Blath, and G. Soberón-Chavez, Strong seed-bank effects in bacterial evolution, J. Theor. Biol. 356 (2014), 62–70.
[30] A. Greven, F. den Hollander, S. Kliem, and A. Klimovsky, Renormalisation of hierarchically interacting Cannings processes, ALEA Lat. Am. J. Probab. Math. Stat. 11 (2014), 43–140.


[31] A. Greven, F. den Hollander, and A. Klimovsky, The hierarchical Cannings process in random environment, ALEA Lat. Am. J. Probab. Math. Stat. 15 (2018), 295–351.
[32] A. Greven, F. den Hollander, and M. Oomen, Spatial populations with seed-bank: well-posedness, duality and equilibrium, preprint 2020, https://arxiv.org/abs/2004.14137.
[33] A. Greven, P. Pfaffelhuber, and A. Winter, Convergence in distribution of random metric measure spaces (Λ-coalescent measure trees), Probab. Theory Related Fields 145 (2009), 285–322.
[34] A. Greven, P. Pfaffelhuber, and A. Winter, Tree-valued resampling dynamics martingale problems and applications, Probab. Theory Related Fields 155 (2013), 789–838.
[35] A. Greven, T. Rippl, and P. Glöde, Branching processes: A general concept, preprint 2018, https://arxiv.org/abs/1807.01921.
[36] A. Greven, R. Sun, and A. Winter, Continuum space limit of the genealogies of interacting Fleming–Viot processes on $\mathbb{Z}^1$, Electron. J. Probab. 21 (2016), Paper No. 58.
[37] S. Gufler, A representation for exchangeable coalescent trees and generalized tree-valued Fleming–Viot processes, Electron. J. Probab. 23 (2018), Paper No. 41.
[38] S. Gufler, Pathwise construction of tree-valued Fleming–Viot processes, Electron. J. Probab. 23 (2018), Paper No. 42.
[39] I. Kaj, S. M. Krone, and M. Lascoux, Coalescent theory for seed bank models, J. Appl. Prob. 38 (2001), 285–300.
[40] G. Kersting and A. Wakolbinger, Probabilistic aspects of Λ-coalescents in equilibrium and in evolution, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 223–245.
[41] V. Limic and A. Sturm, The spatial Λ-coalescent, Electron. J. Probab. 11 (2006), Paper No. 15.
[42] J. Pitman, Coalescents with multiple collisions, Ann. Probab. 27 (1999), 1870–1902.
[43] J. Pitman, Combinatorial Stochastic Processes, École d’Été de Probabilités de Saint-Flour XXXII-2002, Springer, Berlin, 2006.
[44] S. Sawyer and J. Felsenstein, Isolation by distance in a hierarchically clustered population, J. Appl. Probab. 20 (1983), 1–10.
[45] T. Shiga, An interacting system in population genetics, J. Math. Kyoto Univ. 20 (1980), 213–242.
[46] T. Shiga and A. Shimizu, Infinite-dimensional stochastic differential equations and their applications, J. Math. Kyoto Univ. 20 (1980), 395–416.

Chapter 14

Ancestral lineages in spatial population models with local regulation

Matthias Birkner and Nina Gantert

We give a short overview of our work on ancestral lineages in spatial population models with local regulation. We explain how an ancestral lineage can be interpreted as a random walk in a dynamic random environment. Defining regeneration times allows us to prove central limit theorems for such walks. We also consider several ancestral lineages in the same population and show for one prototypical example that in one dimension the corresponding system of coalescing walks converges to the Brownian web.

14.1 Introduction

Many natural populations live in a spatially extended – and often essentially two-dimensional – habitat, with a range that is much larger than the typical distance that any individual may travel during its lifetime. When different genetic types are considered, this can lead to a local differentiation of types that violates the assumptions of panmixia. Furthermore, as a result of the interaction of individuals with their environment – which may be influenced by the population itself and additionally by other, competing species or by external events – local population sizes often fluctuate in time, and these fluctuations may be described using random fields. Understanding the evolution of populations with spatial structure is an interesting problem, and mathematical, individual-based models can help to understand how spatial structure modifies the action of other evolutionary forces such as genetic drift or selection. It is natural to translate the question of the spatial distribution of types into one about the spatial embedding of genealogies by analysing the space-time history of sampled individuals and their ancestral lines. In order to make the latter mathematically tractable, a customary approach, especially in mathematical population genetics, is to impose a discrete grid of “demes” and assume that local population sizes are constant in time, as in Kimura’s stepping stone model and its relatives [40, 44]. Then ancestral lines of sampled individuals are coalescing random walks (with a delay depending on the local population size), and detailed formulas for quantities of interest like the decay of the probability of identity by descent or the correlation of type frequencies with spatial separation are available [45, 46]. Arguably, the built-in assumption of fixed local population sizes in stepping stone models, though allowing the use of powerful mathematical tools in the analysis, appears somewhat artificial from the modelling perspective. We remark here that the


most “obvious” attempt at removing this assumption would be to consider populations that move and reproduce freely in space without interaction among them, i.e. systems of critical branching random walks. The assumption of criticality, i.e. on average one offspring per individual, is a necessary, though not sufficient, condition for such systems to possess non-trivial equilibria. Unfortunately, this attempt is bound to fail, at least in spatial dimensions $d = 1$ and $d = 2$, although the latter is possibly the most interesting case from a biological point of view: It is well known that in dimensions 1 and 2, critical branching random walks “generically” exhibit local extinction and, if one conditions on non-extinction, the configuration forms arbitrarily dense clumps ([29], see e.g. [20, Chapter 6.4] for a discussion). This effect cannot be eliminated by density-dependent down-regulation of the branching rate, see [9]. Another line of thought, more in the vein of mathematical ecology, aims at remedying the artificial and in principle undesirable assumption of fixed local population sizes, and some formulations also remove the discretisation of space in the models discussed above. Here, one models explicitly the stochastic evolution of the local population size forward in time in a way that takes “feedback” into account, typically in the sense that an individual in a crowded region tends to leave on average fewer offspring than an individual that happens to be in a sparsely populated region. Such models were introduced in the biology literature (and analysed with non-rigorous methods) in [15, 16, 32, 39]. Several investigations in the mathematical literature were inspired by these models and some modifications thereof, see for instance [3, 7, 11, 19, 21, 28, 37] for models and results in this direction (some with discrete, some with continuous space and “masses”).
Models from this class can possess non-trivial equilibria in any spatial dimension and they can be “enriched” to also include ancestral information (this is straightforward for discrete-mass models as in [7, 21, 37]; for continuous mass models, one could approximate with particle systems or use “lookdown” constructions as e.g. in [33, 43]). Thus the problem of describing the space-time embedding of ancestral lineages of one or several individuals sampled from certain locations in an equilibrium population is mathematically well defined. It turns out that a single ancestral lineage, corresponding to a sample of size one, then forms a random walk in a dynamic random environment which is generated by the backward in time history of the entire population. Similarly, the ancestral information for a larger sample corresponds to a system of several random walks in the same environment which can additionally coalesce when they are in the same location. In this article, we discuss the behaviour of ancestral lineages in two prototypical examples, namely the discrete time contact process in Section 14.2 and the logistic branching random walk in Section 14.3. A key idea in both Sections 14.2 and 14.3 will be to construct regenerations. It turns out that in both cases, ancestral lineages behave similarly to random walks on large space-time scales in the sense that they satisfy the law of large numbers and a central limit theorem. Thus, broadly speaking, the effect of the fluctuating local population sizes manifests itself on large scales only in the variance parameter of the “random walks”. This validates the pragmatical approach mentioned above, where one simply replaces the true demographic history of the population by one with locally fixed


“effective sizes” (and the migration by an “effective migration”). In Section 14.4, we discuss the relation to other projects within SPP 1590 and to the (considerable) literature of random walks in random environments.

14.2 The contact process and random walk on the backbone of the oriented percolation cluster

We start with a more detailed description of the model forwards in time and then discuss its ancestral lineages.

14.2.1 The discrete time contact process

Let $\omega := \{\omega(x,n)\colon (x,n) \in \mathbb{Z}^d \times \mathbb{Z}\}$ be a family of independent Bernoulli random variables (representing the carrying capacities) with parameter $p \in (0,1]$. We call a site $(x,n)$ inhabitable (or open) if $\omega(x,n) = 1$ and uninhabitable (or closed) if $\omega(x,n) = 0$. We say that there is an open path from $(y,m)$ to $(x,n)$ for $m \le n$ if there is a sequence $x_m, \dots, x_n$ such that $x_m = y$, $x_n = x$, $\|x_k - x_{k-1}\| \le 1$ for $k = m+1, \dots, n$ and $\omega(x_k,k) = 1$ for all $k = m, \dots, n$. In this case we write $(y,m) \to (x,n)$. Here $\|\cdot\|$ denotes the sup-norm. The terms open/closed are standard in percolation theory; we use inhabitable/uninhabitable here to emphasise the population interpretation. Given a set $A \subset \mathbb{Z}^d$ we define the discrete time contact process $(\eta^A_n)_{n \ge m}$ starting at time $m \in \mathbb{Z}$ from the set $A$ as

$$\eta^A_m(y) = \mathbf{1}_A(y), \qquad y \in \mathbb{Z}^d,$$

and for $n \ge m$,

$$\eta^A_{n+1}(x) = \begin{cases} 1 & \text{if } \omega(x,n+1) = 1 \text{ and } \eta^A_n(y) = 1 \text{ for some } y \in \mathbb{Z}^d \text{ with } \|x-y\| \le 1, \\ 0 & \text{otherwise} \end{cases}$$

(where for $k > m$ the $\omega(x,k)$ are i.i.d. Bernoulli as above). Taking $m = 0$, we set

$$\tau^A := \inf\{n \ge 0\colon \eta^A_n \equiv 0\}. \tag{14.2.1}$$

We interpret the process $\eta$ as a population process, where $\eta_n(x) = 1$ means that the position $x$ is occupied by an individual in generation $n$. Space-time sites can be inhabitable (if $\omega(x,n) = 1$) or uninhabitable (if $\omega(x,n) = 0$). The population dynamics is then the following: For each $x \in \mathbb{Z}^d$ independently, if $\omega(x,n) = 1$ and there was at least one individual in the neighbourhood of $x$ in the previous generation,


Figure 14.2.1. Interpretation of $(\eta_n)$ as a locally regulated population model (note that $p > p_c > \tfrac13$ in this case). Panel annotations: expected no. of dotted offspring $3p > 1$; expected no. of dotted offspring $3 \cdot \tfrac13\, p = p < 1$. The panels show generations $n-1$ and $n$.

i.e. if

$$A_{x,n} := \{y \in \mathbb{Z}^d\colon \|x-y\| \le 1 \text{ and } \eta_{n-1}(y) = 1\} \ne \emptyset,$$

then $y$ is picked uniformly from $A_{x,n}$ and an offspring of the individual at $y$ in generation $n-1$ is placed at space-time site $(x,n)$. (14.2.2)

In this case $\eta_n(x) = 1$ and (14.2.2) defines the ancestral structure of the population. In the other cases, namely if $\omega(x,n) = 0$ (site uninhabitable) or if $A_{x,n} = \emptyset$ (no inhabited neighbours in the previous generation), we have $\eta_n(x) = 0$, i.e. the site stays vacant. With this interpretation, (14.2.1) is the extinction time of a population that starts with all $x \in A$ inhabited. Note that the dynamics (14.2.2) implicitly contain a local population regulation: Neighbours compete for inhabitable sites, so individuals in sparsely populated regions have, on average, higher reproductive success. We can visualise this by considering a neutral multi-type version, where offspring simply inherit their parent’s type (discussed in more detail in Remark 14.2.3 below). See the example in Figure 14.2.1.

Note that the ancestor is random since we are not given the whole evolution of the system but only its state at time $n$. Compare with the more familiar case of a continuous-time contact process and its graphical representation. Define the ancestor at time $0$ of an infected site at time $t$ to be the site where the infection came from, following back the graphical representation. Then the ancestor at time $0$ of an infected site at time $t$ is determined if we know the graphical representation up to time $t$, but it is random if we only see the configuration of all infected sites at time $t$.

It is well known, see e.g. [23, Theorem 1], that there is a critical value $p_c \in (0,1)$ such that $\mathbb{P}(\tau^{\{0\}} = \infty) = 0$ for $p \le p_c$ and $\mathbb{P}(\tau^{\{0\}} = \infty) > 0$ for $p > p_c$. Here and in the following, we write $0 = (0,0,\dots,0) \in \mathbb{Z}^d$ for the origin in $d$-dimensional space. We will only consider the supercritical case $p > p_c$. In this case the law of $\eta^{\mathbb{Z}^d}_n$ converges weakly to the so-called upper invariant measure, which is the unique non-trivial extremal invariant measure of the discrete-time contact process. By taking $m \to -\infty$ while keeping $A = \mathbb{Z}^d$ one obtains the stationary process

$$\eta := (\eta_n)_{n \in \mathbb{Z}} := (\eta^{\mathbb{Z}^d}_n)_{n \in \mathbb{Z}}. \tag{14.2.3}$$
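The forward dynamics (14.2.2) can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions that are not part of the model above: dimension $d = 1$, a finite ring $\mathbb{Z}/L\mathbb{Z}$ with $L \ge 3$ in place of $\mathbb{Z}$, and a hypothetical function name `contact_step`. The returned `parent` list records the uniformly chosen ancestor of each occupied site.

```python
import random


def contact_step(eta, p, rng):
    """One generation of the discrete time contact process on a ring of len(eta) >= 3 sites.

    eta[x] is 1 if site x is occupied in the previous generation, else 0.
    Returns (new_eta, parent), where parent[x] is the chosen ancestor site or None.
    """
    L = len(eta)
    new_eta = [0] * L
    parent = [None] * L
    for x in range(L):
        if rng.random() < p:  # omega(x, n) = 1: the site is inhabitable
            # A_{x,n}: occupied sites within sup-norm distance 1 in the previous generation
            candidates = [y % L for y in (x - 1, x, x + 1) if eta[y % L] == 1]
            if candidates:
                parent[x] = rng.choice(candidates)  # ancestor picked uniformly from A_{x,n}
                new_eta[x] = 1
    return new_eta, parent


# Usage: a few generations started from the fully occupied ring (p = 0.68 as in Figure 14.2.2).
rng = random.Random(2021)
eta = [1] * 30
for _ in range(5):
    eta, parent = contact_step(eta, 0.68, rng)
```

Following the `parent` pointers backwards through successive generations traces an ancestral lineage; on a finite ring the population eventually dies out, so this only mimics the stationary process (14.2.3) over moderate time spans.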


14.2.2 Ancestral lineages

We are interested in the behaviour of the “ancestral lineages” of individuals in the stationary process $\eta$ from (14.2.3), where the behaviour of such a lineage is described by iterating (14.2.2). Due to time stationarity, we can focus on ancestral lines of individuals living at time $0$. It will be notationally convenient to time-reverse the stationary process $\eta$ and consider the process $\xi := (\xi_n)_{n \in \mathbb{Z}}$ defined by $\xi_n(x) = 1$ if $(x,n) \to \infty$ (i.e. there is an infinite directed open path starting at $(x,n)$) and $\xi_n(x) = 0$ otherwise. Note that indeed $\mathcal{L}((\xi_n)_{n \in \mathbb{Z}}) = \mathcal{L}((\eta_{-n})_{n \in \mathbb{Z}})$. More precisely, due to (14.2.3), $\eta_{-n}(x) = 1$ if and only if there is an infinite directed open backwards path starting at $(x,-n)$, i.e. a connection from $-\infty$ to $(x,-n)$. This is the case if and only if in the time-reversed picture, there is a connection from $(x,n)$ to $\infty$, i.e. there is an infinite directed open path starting at $(x,n)$, and this is the case if and only if $\xi_n(x) = 1$. Hence there is a one-to-one correspondence of $(\xi_n)_{n \in \mathbb{Z}}$ and $(\eta_{-n})_{n \in \mathbb{Z}}$ and in particular the two processes have the same law. We will from now on in this section consider the forwards evolution of $\xi$ as the “positive” time direction.

On the event $B_0 := \{\xi_0(0) = 1\}$ there is an infinite path starting at $(0,0)$. We define the oriented cluster by

$$\mathcal{C} := \{(x,n) \in \mathbb{Z}^d \times \mathbb{Z}\colon \xi_n(x) = 1\}$$

(in percolation jargon, this is strictly speaking the “backbone” of the oriented cluster) and let

$$U(x,n) := \{(y,n+1)\colon \|x-y\| \le 1\} \tag{14.2.4}$$

be the neighbourhood of the site $(x,n)$ in the next generation. One can allow more general finite neighbourhoods in (14.2.4) with mostly only notational changes in the proofs, see [6, Remark 1.4]. Note however that if $U(x,n)$ is not symmetric around $x$, the walk will generically have a non-trivial speed. On the event $B_0$ we may define a $\mathbb{Z}^d$-valued random walk $X := (X_n)_{n \ge 0}$ starting from $X_0 = 0$ with transition probabilities

$$\mathbb{P}(X_{n+1} = y \mid X_n = x, \xi) = \begin{cases} |U(x,n) \cap \mathcal{C}|^{-1} & \text{when } (y,n+1) \in U(x,n) \cap \mathcal{C}, \\ 0 & \text{otherwise.} \end{cases} \tag{14.2.5}$$
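For a concrete reading of (14.2.5), the following Python sketch computes the one-step transition law of the walk when the cluster is supplied as a finite set of space-time sites. It assumes $d = 1$; the function name `transition_probs` and the toy cluster are hypothetical. Determining the true backbone requires the whole future of the environment, so in practice any such set is only an approximation.

```python
from fractions import Fraction


def transition_probs(x, n, cluster):
    """Law of X_{n+1} given X_n = x: uniform on the sites of U(x, n) that lie in the cluster.

    cluster is a set of (position, time) pairs; d = 1, so U(x, n) has three sites.
    Returns a dict mapping each reachable position y to its probability.
    """
    targets = [y for y in (x - 1, x, x + 1) if (y, n + 1) in cluster]
    if not targets:
        return {}  # cannot occur when (x, n) itself lies on the backbone
    weight = Fraction(1, len(targets))
    return {y: weight for y in targets}


# Toy cluster, for illustration only: two of the three neighbours of (0, 0) are on it.
toy_cluster = {(0, 0), (0, 1), (1, 1)}
law = transition_probs(0, 0, toy_cluster)
```

Exact fractions are used so that the weights visibly equal $|U(x,n) \cap \mathcal{C}|^{-1}$; a simulation would instead draw the next position with `random.choice` over the same target list.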

This corresponds to “going backwards” in (14.2.2) and we interpret $X_n$ as the spatial position of the ancestor $n$ generations ago of the individual at the origin today, see also Figure 14.2.2. Note that $(X_n, n)_{n \ge 0}$ is a directed random walk on the percolation cluster $\mathcal{C}$, and $X$ can also be viewed as a random walk in a (dynamical) random environment, where the environment is given by the process $\xi$. We write $P_\omega$ and $E_\omega$ to denote probabilities and expectations when the environment (which is a function of the $\omega$’s) is fixed, and

Figure 14.2.2. Left: A simulation of the space-time configuration of the stationary contact process $\eta$ from (14.2.3) with $p = 0.68$. Dark sites have $\eta_n(x) = 1$. Right: The same configuration with only those sites highlighted in dark that are potential ancestors of the individual at the origin $(0,0)$, i.e. those sites which the walk $X$ with dynamics (14.2.5) can reach. (Axis ticks in both panels run from $-40$ to $40$ and from $-60$ to $0$.)

write $\mathbb{P}$ and $\mathbb{E}$ for the situation when we average with respect to both the walk and the environment. In the jargon of random walks in random environments, this refers to the “quenched” and the “averaged” or “annealed” case, respectively.

The main result from [6] is the following theorem on the position $X_n$ of the random walk on the backbone of the oriented percolation cluster at time $n$. The theorem can be interpreted by saying that $X_n$ behaves similarly to a simple random walk: it satisfies a law of large numbers and a central limit theorem. (The case of simple random walk corresponds to $p = 1$ in our notation.) In other words, the percolation cluster behaves, on large scales, similarly to the full lattice: the effect of the “holes” in the cluster – which are clearly visible in the simulation in Figure 14.2.2 – vanishes on large scales.

Theorem 14.2.1 (Law of large numbers, averaged and quenched central limit theorem [6, Theorems 1.1 and 1.3]). For any $d \ge 1$ we have

$$P_\omega\Bigl(\tfrac{1}{n} X_n \to 0\Bigr) = 1 \quad \text{for } \mathbb{P}(\,\cdot \mid B_0)\text{-a.a. } \omega, \tag{14.2.6}$$

and for any $f \in C_b(\mathbb{R}^d)$,

$$\mathbb{E}\Bigl[f\bigl(X_n/\sqrt{n}\bigr) \Bigm| B_0\Bigr] \xrightarrow{n \to \infty} \Phi(f), \tag{14.2.7}$$

$$E_\omega\Bigl[f\bigl(X_n/\sqrt{n}\bigr)\Bigr] \xrightarrow{n \to \infty} \Phi(f) \quad \text{for } \mathbb{P}(\,\cdot \mid B_0)\text{-a.a. } \omega, \tag{14.2.8}$$

where $\Phi(f) := \int f(x)\, \Phi(\mathrm{d}x)$ with $\Phi$ a non-trivial centred isotropic $d$-dimensional normal law. Functional versions of (14.2.7) and (14.2.8) hold as well.

A proof sketch is given in Section 14.2.3 below.

Remark 14.2.2. The covariance matrix of $\Phi$ in (14.2.7) is $\sigma^2$ times the $d$-dimensional identity matrix. It follows from the regeneration construction (see Section 14.2.3 below)


that

$$\sigma^2 = \sigma^2(p) = \frac{\mathbb{E}[Y_{1,1}^2]}{\mathbb{E}[\tau_1]} \in (0,\infty),$$

where $\tau_1$ is the first regeneration time (see (14.2.13) below) of the random walk $X$ and $Y_{1,1}$ is the first coordinate of $X_{\tau_1}$, the position of the random walk at this regeneration time. The behaviour of $\sigma^2(p)$ as $p \downarrow p_c$ is an interesting open problem that merits further research.

Remark 14.2.3 (Consequences for the long-time behaviour of the multi-type process). Let us enrich the contact process $(\eta_n)_n$ from Section 14.2.1 by including (so-called neutral) types: Say, at time $n = 0$, every $\eta_0(x)$ is independently assigned a uniformly chosen value from $(0,1)$ and we augment the rule (14.2.2) by setting $\eta_n(x) = \eta_{n-1}(y) > 0$ if $y \in A_{x,n}$ was chosen as the ancestor of the individual at site $(x,n)$. Thus, children inherit their parent’s type (which is $> 0$) and we still interpret $\eta_n(x) = 0$ as a vacant site. As $n \to \infty$, $\eta_n$ will converge in distribution to an equilibrium $\tilde\eta$ of the multi-type dynamics. It follows from Theorem 14.2.1 and its proof in [6] that any two ancestral lineages will eventually meet in $d \le 2$, but not in $d \ge 3$. By “looking backwards in time”, this has consequences for $\tilde\eta$: For any $x, y \in \mathbb{Z}^d$,

$$\mathbb{P}\bigl(\tilde\eta(x) = \tilde\eta(y) \bigm| \tilde\eta(x) > 0,\ \tilde\eta(y) > 0\bigr) = 1 \tag{14.2.9}$$

in $d = 1, 2$, and this probability is $< 1$ in $d \ge 3$. In fact, for $d \ge 3$ there is $C_d \in (0,\infty)$ such that

$$\mathbb{P}\bigl(\tilde\eta(x) = \tilde\eta(y) \bigm| \tilde\eta(x) > 0,\ \tilde\eta(y) > 0\bigr) \sim C_d\, \|x-y\|_2^{2-d} \quad \text{as } \|x-y\| \to \infty.$$

These properties are analogous to those of the multi-type stepping stone model.

14.2.3 Proof ideas: Local construction and regeneration

A main difficulty in the proof of Theorem 14.2.1 lies in the fact that in order to determine $\xi_n(x)$, one has to know the “whole future” of the environment $\omega$. To overcome this, we build a trajectory of $X$ using rules that are “local”, i.e. that use only local information about the environment $\omega$ (and some additional local randomness), but not the process $\xi$. We then read off regeneration times from this construction: These are exactly the times when the locally constructed trajectory coincides with the true trajectory of $X$, see (14.2.12) below. This approach is inspired by [31, 36].

The construction employs some additional randomness: For every $(x,n) \in \mathbb{Z}^d \times \mathbb{Z}$ let $\tilde\omega(x,n)$ be a uniformly chosen permutation of $U(x,n)$ ($U(x,n)$ may be written as a vector by ordering the elements according to the lexicographical ordering of the space coordinate $x$), independently distributed for all space-time sites $(x,n)$ and independent from $\omega$. We denote the whole family of these permutations by $\tilde\omega$.


For every $(x,n) \in \mathbb{Z}^d \times \mathbb{Z}$ let $\ell(x,n) = \ell_\infty(x,n)$ be the length of the longest directed open path starting at $(x,n)$; we set $\ell(x,n) = -1$ when $(x,n)$ is closed. (Recall that a path $(x_0,n), (x_1,n+1), \dots, (x_k,n+k)$ of length $k$ with $\|x_i - x_{i-1}\| \le 1$ is open if $\omega(x_0,n) = \omega(x_1,n+1) = \dots = \omega(x_k,n+k) = 1$. We define $\ell(x,n) = \infty$ for $(x,n) \in \mathcal{C}$.) For every $k \in \mathbb{N}_0$ let $\ell_k(x,n) := \ell(x,n) \wedge k$ be the length of the longest directed open path of length at most $k$ starting from $(x,n)$. Observe that $\ell_k(x,n)$ is measurable with respect to the $\sigma$-algebra $\mathcal{G}_n^{n+k+1}$, where

$$\mathcal{G}_n^m := \sigma\bigl(\omega(y,i), \tilde\omega(y,i)\colon y \in \mathbb{Z}^d,\ n \le i < m\bigr), \qquad n < m. \tag{14.2.10}$$

For $k \in \{0, \dots, \infty\}$, we define $M_k(x,n) \subset U(x,n)$ to be the set of sites that maximise $\ell_k$ over $U(x,n)$, i.e.

$$M_k(x,n) := \Bigl\{y \in U(x,n)\colon \ell_k(y) = \max_{z \in U(x,n)} \ell_k(z)\Bigr\},$$

and for convenience we set $M_{-1}(x,n) = U(x,n)$. Observe that we have

$$M_0(x,n) = \{y \in U(x,n)\colon y \text{ is open}\}, \qquad M_\infty(x,n) = U(x,n) \cap \mathcal{C}, \qquad M_k(x,n) \supset M_{k+1}(x,n), \quad k \ge -1.$$

Let $m_k(x,n) \in M_k(x,n)$ be the element of $M_k(x,n)$ that appears first in the permutation $\tilde\omega(x,n)$. Given $(x,n)$, $k$, $\omega$ and $\tilde\omega$, we define a path $\gamma_k = \gamma_k^{(x,n)}$ of length $k$ via

$$\gamma_k(0) = (x,n), \qquad \gamma_k(j+1) = m_{k-j-2}(\gamma_k(j)) \quad \text{for } j = 0, \dots, k-1. \tag{14.2.11}$$

In words, at every step, k checks the neighbours of its present position and picks randomly (using the random permutation !) Q one of those where it can proceed on open sites, but inspecting only the state of sites in the time-layers ¹n; : : : ; n C k 1º. Consequently, the construction of k.x;n/ is measurable with respect to the -algebra GnnCk from (14.2.10). See Figure 14.2.3 for an illustration. Intuitively, k.x;n/ would be the trajectory of X starting from the space-time point .x; n/ if we replaced the condition in (14.2.5) that X can only walk on C by the requirement that the first k steps must begin on open sites. It is not hard to check that these paths k.x;n/ have the following properties (see [6, Lemma 2.1 and Remark 2.2] for details): Given !, .x; n/ 2 C and !, Q (a) (steps begin on open sites) !. k .m// D 1 for all 0 6 m < k. (b) (stability in k) If the end point of k is open, i.e. !. k .k// D 1, then the path kC1 restricted to the first k steps equals k . (c) (fixation on C ) Assume that k .j / 2 C for some k > 0, j 6 k. Then m .j / D

k .j / for all m > k. (d) (exploration of finite branches) If k .k 1/ 2 C and k .k/ … C for some k, then

j .k/ D k .k/ for all k 6 j 6 k C `. k .k// C 1 and kC`. k .k//C2 .k/ ¤ k .k/.
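The local construction (14.2.11) can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the text: space is a one-dimensional ring of finite width, the neighbourhood is taken to be U(x, n) = {(x + e, n + 1) : e in {-1, 0, 1}}, and column 0 is forced open so that the starting point (0, 0) has an arbitrarily long open path inside the finite box (a stand-in for the condition that the starting point lies in the cluster). The function names `make_environment`, `make_ell` and `gamma` are hypothetical.

```python
import random
from functools import lru_cache

# Assumed neighbourhood: U(x, n) = {(x + e, n + 1) : e in OFFSETS} on a ring.
OFFSETS = (-1, 0, 1)


def make_environment(width, height, p_open, seed=0):
    """Sample omega (open/closed sites) and omega-tilde (one permutation per site)."""
    rng = random.Random(seed)
    omega = [[rng.random() < p_open for _ in range(width)] for _ in range(height)]
    for row in omega:
        row[0] = True  # assumption: keep column 0 open so (0, 0) starts a long open path
    perms = [[rng.sample(OFFSETS, len(OFFSETS)) for _ in range(width)]
             for _ in range(height)]
    return omega, perms


def make_ell(omega):
    """Return ell(x, n, k) = length of the longest open path from (x, n), capped at k."""
    width = len(omega[0])

    @lru_cache(maxsize=None)
    def ell(x, n, k):
        if not omega[n][x]:
            return -1          # closed sites get length -1
        if k == 0:
            return 0
        return 1 + max(ell((x + e) % width, n + 1, k - 1) for e in OFFSETS)

    return ell


def gamma(omega, perms, ell, x, n, k):
    """Path gamma_k started at (x, n): step j uses m_{k-j-2}, as in (14.2.11)."""
    width = len(omega[0])
    path = [(x, n)]
    for j in range(k):
        cx, cn = path[-1]
        # Neighbours listed in the order given by the permutation at the current site.
        candidates = [((cx + e) % width, cn + 1) for e in perms[cn][cx]]
        depth = k - j - 2
        if depth >= 0:         # depth == -1 means M_{-1} = U: take any neighbour
            scores = [ell(y, t, depth) for (y, t) in candidates]
            best = max(scores)
            candidates = [c for c, s in zip(candidates, scores) if s == best]
        path.append(candidates[0])
    return path
```

With `omega, perms = make_environment(20, 12, 0.7)` and `ell = make_ell(omega)`, calling `gamma(omega, perms, ell, 0, 0, k)` for increasing `k` lets one observe properties (a) and (b) numerically: all sites except possibly the last are open, and the path is unchanged when `k` increases as long as its end point is open.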


Ancestral lineages in spatial population models

Figure 14.2.3. The paths $\gamma_k^{(x,n)}$ from (14.2.11) based on $\omega$'s and $\tilde\omega$'s. Black and white circles represent open sites, i.e. $\omega(\text{site}) = 1$, and closed sites, i.e. $\omega(\text{site}) = 0$, respectively. Solid arrows from a site point to $\tilde\omega_{(\text{site})}(1)$ and dotted ones to $\tilde\omega_{(\text{site})}(2)$. On the right the sequence of paths $\gamma_k^{(x,n)}(\cdot)$ for $k = 1, 2, 3, 4$ is shown. For the sake of pictorial clarity, we used here $U(x,n) = \{(x+1, n+1), (x-1, n+1)\}$ instead of (14.2.4).

By (c), $\gamma_\infty^{(x,n)}(j) = \lim_{k\to\infty} \gamma_k^{(x,n)}(j)$ exists a.s. (since holes in the cluster are a.s. finite). Furthermore, for fixed $\omega$ and $(x,n) \in \mathcal{C}$ (but thinking of $\tilde\omega$ as random), the law of $(\gamma_\infty^{(x,n)}(j))_{j \ge 0}$ is the same as the law of the random walk $(X_j, n+j)_{j \ge 0}$ on $\mathcal{C}$ started from $(x,n)$. Thus we can and shall couple the random walk $(X_k, k)$ started from $(0,0)$ with the random variables $\omega, \tilde\omega$ by setting

$$(X_k, k) = \gamma_\infty^{(0,0)}(k) = \lim_{j\to\infty} \gamma_j^{(0,0)}(k).$$

With these ingredients, we can define regeneration times as follows: let

$$T_0 := 0 \quad \text{and} \quad T_j := \inf\big\{ k > T_{j-1} : \xi(\gamma_k^{(0,0)}(k)) = 1 \big\}, \quad j \ge 1. \tag{14.2.12}$$

(Here and later we use the notation $\xi(y) := \xi_n(x)$ when $y = (x,n) \in \mathbb{Z}^d \times \mathbb{Z}$.) At times $T_j$ the local construction of the path finds a "real ancestor" of $(0,0)$ in the sense that for any $m \ge T_j$, $\gamma_m^{(0,0)}(T_j) = \gamma_{T_j}^{(0,0)}(T_j)$, by property (c). The increments between regeneration times are

$$\tau_i := T_i - T_{i-1} \quad \text{and} \quad Y_i := X_{T_i} - X_{T_{i-1}}, \tag{14.2.13}$$

and we then indeed have that

the sequence $((Y_i, \tau_i))_{i \ge 1}$ is i.i.d. and $Y_1$ is symmetrically distributed; (14.2.14)

both $Y_1$ and $\tau_1$ have exponential tails. (14.2.15)

The intuition behind the regeneration property (14.2.14) is the following: Assume that for some $k$, we have constructed the path $\gamma_k^{(0,0)}$ and observe that $\xi(\gamma_k^{(0,0)}(k)) = 1$. Then we have obtained information about some $\omega(y,j)$ and $\tilde\omega(y,j)$ for $j < k$, $y \in \mathbb{Z}^d$ and we know that the site $\gamma_k^{(0,0)}(k)$ in time-slice $k$ is connected to $+\infty$. The latter property depends only on $\omega(y,j)$ with $j \ge k$, $y \in \mathbb{Z}^d$ and the coordinates of $\omega$ in


Figure 14.2.4. "Discovering" of the trajectory of $X$ between the regeneration times $T_0$ and $T_6$ in case $U = \{-1, 1\}$ is shown on the left-hand side of the figure. On the right-hand side we zoom into the evolution between $T_0$ and $T_1$. On the two "relevant sites" we show the realisation of the values of $\tilde\omega$ using the same conventions as in Figure 14.2.3 (in particular, again $U(x,n) = \{(x+1, n+1), (x-1, n+1)\}$).

different time-slices are independent. By property (c), we have $(X_k, k) = \gamma_k^{(0,0)}(k)$. Thus, concerning the future behaviour of $X$, we are then in the same situation at time $k$ as at time 0: all we know (and need to know) is that $X$ sits on some site in $\mathcal{C}$, and we can start afresh. However, if we observe that $\xi(\gamma_k^{(0,0)}(k)) = 0$, we are in a different situation: we then know that $\gamma_k^{(0,0)}(k)$ is the starting point of a finite (possibly empty) oriented percolation cluster. Then we must continue the local construction until it has explored the "reason why $\xi(\gamma_k^{(0,0)}(k)) = 0$", which depends on finitely many sites (cf. property (d) above). See Figure 14.2.4 for an illustration: in this example, the local construction enters a finite cluster at time $k = 1$ and explores it; regeneration then occurs at time $T_1 = 2$ when the exploration is completed. The full details are in [6, Lemma 2.5]. To obtain (14.2.15), one uses the fact that the height of a finite cluster in supercritical oriented percolation has exponential tails, see [18] and [6, Lemma A.1]. The distributional symmetry of $Y_1$ follows from the symmetry of $U(x,n)$ in (14.2.4). Given (14.2.14) and (14.2.15), the law of large numbers (14.2.6) and the annealed CLT (14.2.7) follow straightforwardly by re-writing $X_n$ as a sum along regeneration times plus an asymptotically negligible remainder. The quenched CLT (14.2.8) requires some additional effort: here, we used two copies $X$ and $X'$ of the walk on the same cluster $\mathcal{C}$ to control the variance of $E_\omega[f(X_n/\sqrt{n})]$, an approach inspired by [17]. This, in turn, requires enlarging the regeneration construction to incorporate simultaneous regenerations for both $X$ and $X'$. Studying two (or more) copies of the walk on $\mathcal{C}$, especially when one stipulates that they coalesce as soon as they meet, is


also very natural from the point of view of larger samples. In fact, this is exactly the device that is used in [8] and it also plays a key role in the proof of (14.2.17) below. We will however not spell out the details here and instead refer to [6, 8].

14.2.4 Extensions

14.2.4.1 Contact process with fluctuating population sizes. Let $K(x,n)$, $(x,n) \in \mathbb{Z}^d \times \mathbb{Z}$, be possibly correlated $\mathbb{N}$-valued random variables, independent of the $\omega$'s. We define the discrete time contact process with fluctuating population size, $\hat\eta := (\hat\eta_n)_{n \in \mathbb{Z}}$, by $\hat\eta_n(x) := \eta_n(x) K(x,n)$, with $\eta_n(x)$ from (14.2.3), and its time reversal $\hat\xi := (\hat\xi_n)_{n \in \mathbb{Z}}$ by $\hat\xi_n(x) := \xi_n(x) K(x,-n)$. One can interpret $K(x,n)$ as a random "carrying capacity" of the site $(x,n)$: when $\eta_n(x) = 1$, $K(x,n)$ individuals live at position $x$ in generation $n$, and each of them is independently assigned an ancestor from $A_{x,n}$ as in (14.2.2). Now conditioned on $\hat\xi_0(0) \ge 1$ the ancestral random walk is defined by $X_0 = 0$ and (14.2.5) is generalised to $\mathbb{P}(X_{n+1} = y \mid X_n = x, \hat\xi) = \dots$

0, an annealed CLT analogous to (14.2.7) holds if n 2 O.n 2 ı /; a quenched CLT analogous to (14.2.7) holds if $K$ is exponentially mixing in space and time. Note that in general, (14.2.16) describes a non-elliptic random walk in a non-Markovian (but mixing) environment. The key idea is again a "regeneration construction" where the i.i.d. property in (14.2.14) is now replaced by a sufficiently strong mixing property. We refer to [34, 35] for details.

14.2.4.2 Brownian web limit in spatial dimension one. One can consider the ancestral lineages of all individuals in the stationary $\eta$ from (14.2.3) simultaneously. This gives rise to an infinite system of random walks $X^{(x,n)} = (X_m^{(x,n)})_{m \ge n}$ on the time-reversal $\xi$ of $\eta$, where for each $(x,n) \in \mathcal{C}$, the walk $X^{(x,n)}$ starts at time $n$ at position $x$, follows the analogue of (14.2.5), and different walkers coalesce whenever they meet in the same space-time site. By Theorem 14.2.1 and space-time stationarity, any $X^{(x,n)}$ converges to a Brownian motion under diffusive rescaling. As shown in [8],


in spatial dimension $d = 1$, the collection of all these paths converges after diffusive rescaling as in Theorem 14.2.1 to the Brownian web in distribution. Informally, this limit object describes an infinite system of coalescing Brownian motions starting from all space-time points in $\mathbb{R} \times \mathbb{R}$. One may then apply our convergence result to investigate the behaviour of interfaces in the discrete time contact process analogously to [38, Theorem 7.6 and Remark 7.7], as observed in [8, p. 1051]. We refer also to [12, 26, 27] and the article by Blath and Ortgiese [13] in this volume, which study spatial population models (in continuous space) in $d = 1$, with a particular focus on interfaces. These models are "continuum analogues" of the voter model, and the interfaces are stochastic processes in dynamic environments. Dualities and their genealogical interpretations play an important role there as well. An important ingredient in the proof is a quantitative strengthening of (14.2.9) from Remark 14.2.3:

$$\mathbb{P}\big(T_{\mathrm{meet}}^{(z_1,z_2)} > n \,\big|\, \xi_0(z_1) = \xi_0(z_2) = 1\big) \le \mathrm{const.} \times \frac{|z_1 - z_2|}{\sqrt{n}} \quad \text{for } z_1, z_2 \in \mathbb{Z},\ n \in \mathbb{N}, \tag{14.2.17}$$

where $T_{\mathrm{meet}}^{(z_1,z_2)}$ is the number of steps until two walks on the same realisation of $\xi$ that start at time 0 from $z_1$ and $z_2$, respectively, meet for the first time. Note that (14.2.17) is the asymptotically correct form of the decay for simple random walks in $d = 1$. For more information, we refer to [42]. The results in [8] can again be interpreted as an averaging statement about the percolation cluster: apart from a change of variance, it behaves as the full lattice (for which convergence to the Brownian web was proved in [38]), i.e. the effect of the "holes" in the cluster vanishes on a large scale. For a thorough discussion of the Brownian web, including historical comments and references, see the overview article [41]. Note that there is no analogous object in spatial dimension $d \ge 2$ because there, independent Brownian motions never meet.
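The remark that $|z_1 - z_2|/\sqrt{n}$ is the correct order of decay for simple random walks can be checked numerically. The following sketch covers only the comparison case of two independent simple random walks on the full lattice $\mathbb{Z}$ (not the walks on the percolation cluster treated in [8]); the function name `prob_not_met` is hypothetical, and the starting points are assumed to have even distance, since otherwise two simple random walks never occupy the same site at the same time.

```python
import math


def prob_not_met(z1, z2, n):
    """Exact P(T_meet > n) for two independent simple random walks on Z.

    The half-distance between the walkers is a lazy walk: it stays put with
    probability 1/2 and moves by +1 or -1 with probability 1/4 each. Meeting
    corresponds to absorption at 0.
    """
    dist = abs(z1 - z2)
    if dist % 2:
        raise ValueError("assumption: the walkers start at even distance")
    half = dist // 2
    if half == 0:
        return 0.0
    size = half + n + 2
    mass = [0.0] * size
    mass[half] = 1.0
    for _ in range(n):
        new = [0.0] * size
        for k in range(1, size - 1):
            p = mass[k]
            if p == 0.0:
                continue
            new[k] += 0.5 * p
            new[k + 1] += 0.25 * p
            if k > 1:          # a step from 1 to 0 means the walkers have met
                new[k - 1] += 0.25 * p
        mass = new
    return sum(mass)


if __name__ == "__main__":
    for n in (100, 400, 1600):
        ratio = prob_not_met(0, 2, n) * math.sqrt(n) / 2
        print(n, ratio)        # the ratio stabilises as n grows
```

In this full-lattice case the ratio $\mathbb{P}(T_{\mathrm{meet}} > n)\sqrt{n}/|z_1 - z_2|$ approaches $1/\sqrt{\pi}$ for large $n$ by the reflection principle; on the cluster only the upper bound (14.2.17) is claimed in the text.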

14.3 Ancestral lineages for logistic branching random walks

We consider a system of discrete-time branching random walks with logistic regulation: let $\eta_n(x)$ be the number of individuals at position $x \in \mathbb{Z}^d$ in generation $n \in \mathbb{Z}$. Given the configuration $\eta_n$ at time $n$, for $x \in \mathbb{Z}^d$, each individual at $x$ has a Poisson-distributed number of offspring with mean

$$\Big(m - \sum_z \lambda_{z-x}\, \eta_n(z)\Big)^+ \tag{14.3.1}$$

and each child moves to $y$ with probability $p_{y-x}$, independently for different parental individuals and for different children. Here, $p_{xy} = p_{y-x}$ is a symmetric, aperiodic finite range random walk kernel on $\mathbb{Z}^d$, $m > 1$, $\lambda_z \ge 0$, $z \in \mathbb{Z}^d$, is symmetric with finite range and $\lambda_0 > 0$. These children then form the next generation, $\eta_{n+1}$. Formula (14.3.1) has a natural interpretation as local competition: each individual at $z$ reduces the average reproductive success of a focal individual at $x$ by $\lambda_{z-x}$. In particular, this introduces local density-dependent feedback in the model: the offspring distribution is supercritical when there are few neighbours and subcritical when there are many neighbours. Note that by properties of the Poisson distribution $(\eta_n)$ is in fact a probabilistic cellular automaton: given $\eta_n$,

$$\eta_{n+1}(y) \sim \mathrm{Poisson}\Big(\sum_{x \in \mathbb{Z}^d} \eta_n(x) \Big(m - \sum_{z \in \mathbb{Z}^d} \lambda_{z-x}\, \eta_n(z)\Big)^+ p_{y-x}\Big), \tag{14.3.2}$$

independently for different $y \in \mathbb{Z}^d$.

Remark 14.3.1. (1) For the choice $\lambda \equiv 0$, the system $(\eta_n)_n$ is a "classical" branching random walk, in which different individuals behave completely independently. This is a classical topic with a lot of recent progress, see in particular the article by König [30] in this volume. In [24, 25], moment asymptotics for the number of particles in a branching random walk in random environment are derived. Note that the first moments correspond to the well-investigated solutions of the parabolic Anderson model.

(2) Conditioning on $\eta_n(\cdot) \equiv N$ in (14.3.2) for some $N \in \mathbb{N}$ and considering types and/or ancestral relationships, as we will do below, yields a version of the stepping stone model.

(3) The form of the competition kernel and the Poisson offspring law in (14.3.1) and (14.3.2) are prototypical (and convenient for the proofs) but can be replaced by more general choices, see the discussion in [7, Remark 5 (ii)] and [5, Section 5].

Theorem 14.3.2 (Survival and complete convergence [7, Theorem 1 and Corollary 4]). Assume $m \in (1,3)$. There exist $\varepsilon_0, \varepsilon_1 > 0$ such that for all choices $0 < \lambda_0 \le \varepsilon_0$ and $0 \le \lambda_z \le \varepsilon_1 \lambda_0$ for $z \ne 0$, the system $(\eta_n)$ survives for all time locally (and hence also globally) with positive probability for any non-trivial initial condition $\eta_0$. Given survival (either local or global), $\eta_n$ converges in distribution as $n \to \infty$ to its unique non-trivial equilibrium.

We will not prove Theorem 14.3.2 here but point out that a crucial ingredient in the proof is a strong coupling property of the system $(\eta_n)$: starting from any two initial conditions $\eta_0$, $\eta'_0$, copies $(\eta_n)$, $(\eta'_n)$ can be coupled such that

if both survive, $\eta_n(x) = \eta'_n(x)$ in a space-time cone. (14.3.3)
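The update rule (14.3.2) is straightforward to simulate. The sketch below is illustrative only and rests on assumptions beyond the text: space is a finite ring with periodic boundary, the kernels `lam` and `p` are given as tuples indexed by offsets $-r, \dots, r$, and Poisson variables are drawn with Knuth's multiplication method from the standard library generator (adequate for the moderate means that occur here). The function names are hypothetical.

```python
import math
import random


def offspring_means(eta, m, lam, p):
    """Poisson means of eta_{n+1}(y) in (14.3.2) on a ring, given eta_n = eta."""
    size = len(eta)
    r_lam, r_p = len(lam) // 2, len(p) // 2
    # f(x) = eta(x) * (m - sum_z lam_{z-x} eta(z))^+ : expected offspring born at x
    f = []
    for x in range(size):
        competition = sum(lam[d + r_lam] * eta[(x + d) % size]
                          for d in range(-r_lam, r_lam + 1))
        f.append(eta[x] * max(m - competition, 0.0))
    # A child born at x moves to y with probability p_{y-x}.
    return [sum(p[d + r_p] * f[(y - d) % size] for d in range(-r_p, r_p + 1))
            for y in range(size)]


def poisson(mean, rng):
    """Knuth's method; fine for moderate means, not for very large ones."""
    if mean <= 0.0:
        return 0
    threshold = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def step(eta, m, lam, p, rng):
    """One generation of the probabilistic cellular automaton."""
    return [poisson(mu, rng) for mu in offspring_means(eta, m, lam, p)]


if __name__ == "__main__":
    rng = random.Random(1)
    eta = [0] * 200
    eta[60] = 1                      # a single founder, as in one panel of Figure 14.3.1
    for _ in range(30):
        eta = step(eta, 1.5, (0.01, 0.02, 0.01), (1 / 3, 1 / 3, 1 / 3), rng)
    print(sum(eta))                  # total population after 30 generations
```

The parameter values in the usage example are those quoted in the caption of Figure 14.3.1; a run started from a single individual may of course die out, in line with survival holding only with positive probability.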

This allows one to compare the system to supercritical oriented percolation on suitably coarse-grained space-time scales, see [7, Section 5] for details and Figure 14.3.1 for a simulation.

Remark 14.3.3. (1) The restriction to $m < 3$ in Theorem 14.3.2 is "inherited" from the logistic iteration $w_{n+1} = m w_n (1 - w_n)$ because in this parameter regime, it has

[Figure 14.3.1: three panels titled "Population 1", "Population 2" and "Modulus of difference"; horizontal axis: position (0 to 200), vertical axis: generation (0 to 200).]

Figure 14.3.1. Starting from any two initial conditions $\eta_0$, $\eta'_0$, copies $(\eta_n)$, $(\eta'_n)$ can be coupled such that if both survive (here, $m = 1.5$, $p = (\tfrac13, \tfrac13, \tfrac13)$, $\lambda = (0.01, 0.02, 0.01)$, $\eta_0 = \delta_{60}$, $\eta'_0 = \delta_{120}$ and space is $\{1, 2, \dots, 200\}$ with periodic boundary conditions). The picture at the bottom shows $|\eta_n(x) - \eta'_n(x)|$; note the growing region in the middle where $\eta_n(x) = \eta'_n(x) > 0$.

a unique attracting fixed point. Note that literally, a "deterministic space-less" analogue of (14.3.2) would read $\tilde w_{n+1} = \tilde w_n (m - \tilde\lambda \tilde w_n)$ with $\tilde\lambda = \sum_z \lambda_z$; the rescaling $\tilde w_n = (m/\tilde\lambda) w_n$ brings this to the "standard form" just mentioned. Survival can be proved also for $m \in [3,4)$ with similar arguments, but convergence cannot. For $m < 1$ (and for $m = 1$ in $d \le 2$) one can easily see, using domination by subcritical branching random walks, that $(\eta_n)_n$ will die out locally when starting from any initial condition $\eta_0$ with $\sup_{x \in \mathbb{Z}^d} E[\eta_0(x)] < \infty$.

(2) In [22], multi-type continuous mass branching populations with competitive interactions are studied; the logistic branching random walks we described in Section 14.3 are a close relative of such systems in the single-type case (or in the multitype case with completely symmetric parameters). Furthermore, by using space $\mathbb{R}^d$ as a "trait space", the measure-valued processes studied in [4] can be seen as a suitable


scaling limit of (relatives of) logistic branching random walks, see [4, Remark 5]. Many challenging questions about the long-time behaviour of such continuous-mass interacting multi-type systems remain open. It is conceivable that the regeneration constructions for ancestral lineages we investigated in [5, 6] might be adaptable to this context, and that this could enrich the pertinent "tool box".

14.3.1 Dynamics of an ancestral lineage

By Theorem 14.3.2, for suitable choices of the parameters, the system (14.3.2) has a unique non-trivial equilibrium. We denote by $\eta^{\mathrm{stat}} = (\eta^{\mathrm{stat}}_n(x))_{n \in \mathbb{Z}, x \in \mathbb{Z}^d}$ the corresponding stationary process and (implicitly in our notation) "enrich" it suitably to allow bookkeeping of genealogical relationships, as described at the beginning of Section 14.3. Consider the stationary $\eta^{\mathrm{stat}}$ conditional on $\eta^{\mathrm{stat}}_0(0) > 0$ and sample an individual (uniformly) from the space-time origin $(0,0)$; let $X_n$ be the spatial position of her ancestor $n$ generations ago. Then

$$\mathbb{P}\big(X_{n+1} = y \,\big|\, X_n = x, \eta^{\mathrm{stat}}\big) = \frac{p_{x-y}\, \eta^{\mathrm{stat}}_{-n-1}(y) \big(m - \sum_z \lambda_{z-y}\, \eta^{\mathrm{stat}}_{-n-1}(z)\big)^+}{\sum_{y'} p_{x-y'}\, \eta^{\mathrm{stat}}_{-n-1}(y') \big(m - \sum_z \lambda_{z-y'}\, \eta^{\mathrm{stat}}_{-n-1}(z)\big)^+}, \tag{14.3.4}$$

see [5, (4.10)–(4.11)]. Thus $(X_n)_n$ is a random walk in a (relatively complicated) random environment. Note that the forwards in time direction for the walk corresponds to backwards in time for $\eta^{\mathrm{stat}}$. Again it turns out that $X$ behaves like an ordinary random walk when viewed over large enough space-time scales, as the following theorem shows.

Theorem 14.3.4 (Law of large numbers and (averaged) central limit theorem, see e.g. [5, Theorem 4.3]). Assume $m \in (1,3)$. There exist $\varepsilon_0, \varepsilon_1 > 0$ such that for all choices $0 < \lambda_0 \le \varepsilon_0$ and $0 \le \lambda_z \le \varepsilon_1 \lambda_0$ for $z \ne 0$, we have

$$\mathbb{P}\Big(\frac{1}{n} X_n \to 0 \,\Big|\, \eta^{\mathrm{stat}}_0(0) \ne 0\Big) = 1 \tag{14.3.5}$$

and

$$E\Big[f\Big(\frac{1}{\sqrt{n}} X_n\Big) \,\Big|\, \eta^{\mathrm{stat}}_0(0) \ne 0\Big] \xrightarrow[n \to \infty]{} E[f(Z)] \tag{14.3.6}$$

for $f \in C_b(\mathbb{R}^d)$, where $Z$ is a (non-degenerate) $d$-dimensional normal random variable. A functional version of (14.3.6) holds as well.

Note that (14.3.6) is an averaged limit result. In ongoing work with Andrej Depperschmidt and Timo Schlüter, we are proving the corresponding "quenched" limit theorem. The proof of Theorem 14.3.4 again employs a regeneration construction and a decomposition as in (14.2.13). We will only sketch the main ideas below, referring the reader to [5] for details.
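To see what the transition law (14.3.4) looks like for a concrete configuration, the following sketch computes one backward step of the ancestral lineage. It assumes a one-dimensional ring with periodic boundary and kernels given as tuples over offsets $-r, \dots, r$; the function name `ancestor_step_probs` and the argument name `eta_prev` (standing for the parental generation $\eta^{\mathrm{stat}}_{-n-1}$) are hypothetical.

```python
def ancestor_step_probs(eta_prev, x, m, lam, p):
    """One-step law of the ancestral walk from site x, following (14.3.4).

    The ancestor sits at y with probability proportional to
    p_{x-y} * eta_prev(y) * (m - sum_z lam_{z-y} eta_prev(z))^+ ,
    i.e. to the expected number of children that site y sends to x.
    """
    size = len(eta_prev)
    r_lam, r_p = len(lam) // 2, len(p) // 2
    weights = {}
    for d in range(-r_p, r_p + 1):
        y = (x + d) % size
        competition = sum(lam[e + r_lam] * eta_prev[(y + e) % size]
                          for e in range(-r_lam, r_lam + 1))
        w = p[r_p - d] * eta_prev[y] * max(m - competition, 0.0)  # p_{x-y}, x - y = -d
        weights[y] = weights.get(y, 0.0) + w
    total = sum(weights.values())
    if total <= 0.0:
        raise ValueError("no possible ancestor: site x cannot be occupied")
    return {y: w / total for y, w in weights.items()}


if __name__ == "__main__":
    flat = [5] * 20                  # a perfectly flat parental generation
    print(ancestor_step_probs(flat, 7, 1.5, (0.01, 0.02, 0.01), (1 / 3, 1 / 3, 1 / 3)))
```

For a flat configuration the competition factor cancels and the step law reduces to the reference kernel $p$, which is the observation behind comparing $X$ with an ordinary random walk wherever the relative variation of the environment is small.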


Given $\eta^{\mathrm{stat}}$, $X$ is a time-inhomogeneous Markov chain; given also $X_n = x$, its transition probabilities in the $(n+1)$-th step depend only on $\eta^{\mathrm{stat}}_{-n-1}(\cdot)$ in some finite window around $x$. We see from (14.3.4) that these transition probabilities are close to the fixed reference law $(p_x)_x$ if $X_n$ is in a region where the relative variation of $\eta^{\mathrm{stat}}_{-n-1}$ near $X_n$ is small. Thus, we define "good" space-time blocks in $\eta^{\mathrm{stat}}$ on suitable length scales $L_{\mathrm{space}} \gg 1$ and $L_{\mathrm{time}} \gg 1$, that is, a finite set $G$ of local configurations on $\{1, 2, \dots, L_{\mathrm{space}}\}^d \times \{1, 2, \dots, L_{\mathrm{time}}\}$ with the properties that

(a) $\eta^{\mathrm{stat}}$ has small relative variations inside a good block,

(b) if the block with (coarse grained) index $(\tilde x, \tilde n) \in \mathbb{Z}^d \times \mathbb{Z}$ is good, this will with high probability also be the case for its "temporal successors" with indices $(\tilde x - 1, \tilde n + 1)$, $(\tilde x, \tilde n + 1)$, $(\tilde x + 1, \tilde n + 1)$,

(c) if we consider two copies $\eta$ and $\eta'$ of the system (14.3.2) with the property that in both the block with (coarse grained) index $(\tilde x, \tilde n)$ is good, then with high probability the coupling discussed in (14.3.3) will make $\eta$ and $\eta'$ identical on the block with index $(\tilde x, \tilde n + 1)$.

Property (a) allows us to control the walk $X$ whenever it moves through good blocks; (b) allows us to compare the good blocks to supercritical oriented percolation (on the coarse-grained scale); (c) allows us to "localise" information about the space-time configuration $\eta^{\mathrm{stat}}$ around good blocks, which is akin to the local construction from Section 14.2.3. With these ingredients, we can discuss the regeneration construction: assume that we find a space-time "cone" $C$ (with fixed suitable base diameter and slope) that is centred at the current space-time position $(X_n, n)$ of the walk such that

(i) $C$ covers the path and everything it has "explored" until the $n$-th step (since the last regeneration),

(ii) the configuration in $\eta^{\mathrm{stat}}$ at the base of the cone $C$ is "good" and

(iii) "strong" coupling for $\eta^{\mathrm{stat}}$ as defined in (c) above occurs inside the cone $C$.

Then the conditional law of the future path increments is completely determined by the configuration $\eta^{\mathrm{stat}}$ at the base of the cone and we can "start afresh". It may happen that in order to find a cone with properties (i)–(iii), several attempts are needed, see Figure 14.3.2 for an illustration. This construction expresses the path increments between the regeneration times as a functional of a well-behaved Markov chain (which keeps track of the local configuration at the base of the corresponding cones at the regeneration times). Given this, (14.3.5) and (14.3.6) are fairly standard.

Figure 14.3.2. A schematic illustration of the regeneration construction for Theorem 14.3.4: the walk passes through a sequence of cones (at times $t_0, t_1, t_2, t_3$) in an attempt to regenerate. Here, regeneration at time $t_1$ fails because the path up to that time is not covered by the corresponding cone and regeneration at time $t_2$ fails because the corresponding cone does not cover the previous cone; successful regeneration then occurs in the third attempt at time $t_3$.

In ongoing work with Andrej Depperschmidt and Timo Schlüter we consider the joint dynamics of several ancestral lineages in the logistic branching random walk and establish properties analogous to those in Section 14.2.2 for walks on the oriented percolation cluster.

14.4 Discussion

Our ancestral walks with dynamics as in (14.2.5), (14.2.16), (14.3.4) are, generally speaking, random walks in dynamical random environments (RWDRE). This is currently a very active field of research and we do not attempt to give an overview here, but refer to [1] for a good overview of the area up to 2010. There are recent papers on random walks in dynamical random conductances, random walks on dynamical percolation, and random walks in dynamical random environments given by interacting particle systems such as exclusion processes. The general results often have strong assumptions on the environment (mixing conditions, spectral gap assumptions, uniform lower bounds for the transition probabilities of the walk). On the other hand, the "case studies" often refer to specific models and do not provide a general approach. Hence, this is an area where there is still a lot to understand. See e.g. the recent works [2, 14] and the discussion and references there. Let us point out that our walks (14.2.5), (14.2.16), (14.3.4) are somewhat special inside the general class of RWDRE in that the natural "forwards" time direction for the walk is "backwards" in time for the environment, whereas researchers in RWDRE often study walks on certain interacting


particle systems where the walk and the underlying system have the same forwards time direction. Also, let us mention that while in recent work, see [10], the assumption of ellipticity of the environment, i.e. of uniform lower bounds for the transition probabilities of the walk, is not present anymore, our model still does not fit in, since our environment is not stationary.

Acknowledgements. The authors thank Jiří Černý, Andrej Depperschmidt, Katja Miller and Sebastian Steiber for the many enlightening discussions we had in the course of this project. We would also like to thank Iulia Dahmer, Frederik Klement and Timo Schlüter and an anonymous referee for carefully reading the manuscript and for their helpful comments.

References

[1] L. Avena, Random walks in dynamic random environments, Proefschrift Universiteit Leiden (PhD dissertation), 2010, http://hdl.handle.net/1887/16072.
[2] L. Avena, O. Blondel, and A. Faggionato, Analysis of random walks in dynamic random environments via L2-perturbations, Stochastic Process. Appl. 128 (2018), 3490–3530.
[3] N. H. Barton, F. Depaulis, and A. M. Etheridge, Neutral evolution in spatially continuous populations, Theor. Popul. Biol. 61 (2002), 31–48.
[4] G. Berzunza, A. Sturm, and A. Winter, Trait-dependent branching particle systems with competition and multiple offspring, preprint 2018, https://arxiv.org/abs/1808.09345.
[5] M. Birkner, J. Černý, and A. Depperschmidt, Random walks in dynamic random environments and ancestry under local population regulation, Electron. J. Probab. 21 (2016), 1–43.
[6] M. Birkner, J. Černý, A. Depperschmidt, and N. Gantert, Directed random walk on the backbone of an oriented percolation cluster, Electron. J. Probab. 18 (2013), Paper No. 80.
[7] M. Birkner and A. Depperschmidt, Survival and complete convergence for a spatial branching system with local regulation, Ann. Appl. Probab. 17 (2007), 1777–1807.
[8] M. Birkner, N. Gantert, and S. Steiber, Coalescing directed random walks on the backbone of a 1+1-dimensional oriented percolation cluster converge to the Brownian web, ALEA Lat. Am. J. Probab. Math. Stat. 16 (2019), 1029–1054.
[9] M. Birkner and R. Sun, Low-dimensional lonely branching random walks die out, Ann. Probab. 47 (2019), 774–803.
[10] M. Biskup and P.-F. Rodriguez, Limit theory for random walks in degenerate time-dependent random environments, J. Funct. Anal. 274 (2018), 985–1046.
[11] J. Blath, A. M. Etheridge, and M. Meredith, Coexistence in locally regulated competing populations and survival of branching annihilating random walk, Ann. Appl. Probab. 17 (2007), 1474–1507.
[12] J. Blath, M. Hammer, and M. Ortgiese, The scaling limit of the interface of the continuous-space symbiotic branching model, Ann. Probab. 44 (2016), 807–866.
[13] J. Blath and M. Ortgiese, The symbiotic branching model: Duality and interfaces, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 311–336.
[14] O. Blondel, M. R. Hilário, and A. Teixeira, Random walks on dynamical random environments with non-uniform mixing, Ann. Probab. 48 (2020), 2014–2051.
[15] B. M. Bolker and S. W. Pacala, Using moment equations to understand stochastically driven spatial pattern formation in ecological systems, Theor. Popul. Biol. 52 (1997), 179–197.
[16] B. M. Bolker and S. W. Pacala, Spatial moment equations for plant competition: Understanding spatial strategies and the advantages of short dispersal, Am. Nat. 153 (1999), 575–602.
[17] E. Bolthausen and A. S. Sznitman, On the static and dynamic points of view for certain random walks in random environment, Methods Appl. Anal. 9 (2002), 345–375.
[18] R. Durrett, Oriented percolation in two dimensions, Ann. Probab. 12 (1984), 999–1040.
[19] A. M. Etheridge, Survival and extinction in a locally regulated population, Ann. Appl. Probab. 14 (2004), 188–214.
[20] A. M. Etheridge, Some Mathematical Models from Population Genetics, École d'Été de Probabilités de Saint-Flour XXXIX–2009, Springer, Berlin, 2011.
[21] N. Fournier and S. Méléard, A microscopic probabilistic description of a locally regulated population and macroscopic approximations, Ann. Appl. Probab. 14 (2004), 1880–1919.
[22] A. Greven, A. Sturm, A. Winter, and I. Zähle, Multi-type spatial branching models for local self-regulation I: Construction and an exponential duality, preprint 2015, https://arxiv.org/abs/1509.04023.
[23] G. Grimmett and P. Hiemer, Directed percolation and random walk, in: In and Out of Equilibrium (ed. V. Sidoravicius), Birkhäuser, Boston (2002), 273–297.
[24] O. Gün, W. König, and O. Sekulović, Moment asymptotics for branching random walks in random environment, Electron. J. Probab. 18 (2013), 1–18.
[25] O. Gün, W. König, and O. Sekulović, Moment asymptotics for multitype branching random walks in random environment, J. Theor. Probab. 28 (2015), 1726–1742.
[26] M. Hammer, M. Ortgiese, and F. Völlering, A new look at duality for the symbiotic branching model, Ann. Probab. 46 (2018), 2800–2862.
[27] M. Hammer, M. Ortgiese, and F. Völlering, Entrance laws for annihilating Brownian motions and the continuous-space voter model, Stochastic Process. Appl. 134 (2021), 240–264.
[28] M. Hutzenthaler and A. Wakolbinger, Ergodic behaviour of locally regulated branching populations, Ann. Appl. Probab. 17 (2007), 474–501.
[29] O. Kallenberg, Stability of critical cluster fields, Math. Nachr. 77 (1977), 7–43.
[30] W. König, Branching random walks in random environment, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 23–41.
[31] T. Kuczek, The central limit theorem for the right edge of supercritical oriented percolation, Ann. Probab. 17 (1989), 1322–1332.
[32] R. Law and U. Dieckmann, Moment approximations of individual-based models, in: The Geometry of Ecological Interactions (eds. U. Dieckmann, R. Law, and J. A. Metz), Cambridge University, Cambridge (2002), 252–270.
[33] V. Le, E. Pardoux, and A. Wakolbinger, "Trees under attack": A Ray–Knight representation of Feller's branching diffusion with logistic growth, Probab. Theory Related Fields 155 (2013), 583–619.
[34] K. Miller, Random walks on weighted, oriented percolation clusters, ALEA Lat. Am. J. Probab. Math. Stat. 13 (2016), 53–77; and erratum: ALEA Lat. Am. J. Probab. Math. Stat. 14 (2017), 173–175.
[35] K. Miller, Random walks on oriented percolation and in recurrent environments, Dissertation, Technische Universität München, 2017, http://mediatum.ub.tum.de/670560?show_id=1366085.
[36] C. Neuhauser, Ergodic theorems for the multitype contact process, Probab. Theory Related Fields 91 (1992), 467–506.
[37] C. Neuhauser and S. W. Pacala, An explicitly spatial version of the Lotka–Volterra model with interspecific competition, Ann. Appl. Probab. 9 (1999), 1226–1259.
[38] C. M. Newman, K. Ravishankar, and R. Sun, Convergence of coalescing nonsimple random walks to the Brownian web, Electron. J. Probab. 10 (2005), 21–60.
[39] M. Raghib, N. A. Hill, and U. Dieckmann, A multiscale maximum entropy moment closure for locally regulated space time point process models of population dynamics, J. Math. Biol. 62 (2011), 605–653.
[40] S. Sawyer, Results for the stepping stone model for migration in population genetics, Ann. Probab. 4 (1976), 699–728.
[41] E. Schertzer, R. Sun, and J. M. Swart, The Brownian web, the Brownian net, and their universality, in: Advances in Disordered Systems, Random Processes and Some Applications (eds. P. Contucci and C. Giardinà), Cambridge University, Cambridge (2017), 270–368.
[42] S. Steiber, Ancestral lineages in the contact process: scaling and hitting properties, Dissertation, Johannes Gutenberg-Universität Mainz, 2017, urn:nbn:de:hebis:77-diss1000010573.
[43] A. Véber and A. Wakolbinger, The spatial Lambda-Fleming–Viot process: An event-based construction and a lookdown representation, Ann. Inst. Henri Poincaré Probab. Stat. 51 (2015), 570–598.
[44] G. H. Weiss and M. Kimura, A mathematical analysis of the stepping stone model of genetic correlation, J. Appl. Probab. 2 (1965), 129–149.
[45] H. M. Wilkinson-Herbots, Coalescence times and FST values in subdivided populations with symmetric structure, Adv. Appl. Probab. 35 (2003), 665–690.
[46] I. Zähle, J. T. Cox, and R. Durrett, The stepping stone model, II. Genealogies and the infinite sites model, Ann. Appl. Probab. 15 (2005), 671–699.

Chapter 15

The symbiotic branching model: Duality and interfaces

Jochen Blath and Marcel Ortgiese

The symbiotic branching model describes the dynamics of a spatial two-type population, where locally particles branch at a rate given by the frequency of the other type combined with nearest-neighbour migration. This model generalises various classic models in population dynamics, such as the stepping stone model and the mutually catalytic branching model. We are particularly interested in understanding the region of coexistence, i.e. the interface between the two types. In this chapter, we give an overview of our results that describe the dynamics of these interfaces at large scales. One of the reasons that this system is tractable is that it exhibits a rich duality theory. So at the same time, we take the opportunity to provide an introduction to the strength of duality methods in the context of spatial population models.

15.1 Introduction

Over recent years spatial stochastic models have become increasingly important in population dynamics. Of particular interest are the spatial patterns that emerge through the interaction of different types via competition, spatial colonisation, predation and (symbiotic) branching. The classic model in this field is the stepping stone model of Kimura [33]. More recent developments include [4,9,13,54], see also the contributions by Birkner and Gantert [7] and Greven and den Hollander [25] in this volume.

A particularly useful technique in this context is duality. This technique allows one to relate two (typically Markov) processes in such a way that information, e.g. about the long-term behaviour of one process, can be translated to the other one. The most basic form of duality can be described as follows: we say that two stochastic processes $(X_t)_{t \ge 0}$ and $(Y_t)_{t \ge 0}$ with state spaces $E_1$ and $E_2$ are dual with respect to a (measurable) duality function $F \colon E_1 \times E_2 \to \mathbb{R}$ if for any $x \in E_1$, $y \in E_2$,

\[ \mathbb{E}_x[F(X_t, y)] = \mathbb{E}_y[F(x, Y_t)]. \tag{15.1.1} \]

The particular case when $F(x, y) = x^y$ is known as moment duality and holds e.g. for a Wright–Fisher diffusion with dual given by the block-counting process of the Kingman coalescent. There are also other variations of duality such as pathwise duality where both original process and dual process can be constructed on the same probability space. Pathwise duality often arises when tracing back genealogies in population dynamics, see also the contributions by Birkner and Blath [6], Blath and Kurt [12], as well as Kersting and Wakolbinger [32] in this volume. To date, there is no general theory that characterises all possible duals or even just guarantees existence. However, if a dual process exists, exploiting this duality


can often be a powerful way of analysing a model. See [31] for a survey on duality, [48] for results on pathwise duality in a general setting, and [49] for a survey of recent developments regarding a systematic approach to duality based on [14, 24].

In this chapter we will mostly focus on a class of processes known as the symbiotic branching model introduced in [21]. These models describe the dynamics of a spatial two-type population whose types interact by mutually modifying their respective branching rates. However, before we look at the symbiotic branching model, we set the scene in Section 15.2 by considering the discrete-space voter model, one of the classic spatial population models, which also has a close connection to the symbiotic branching model. We will show how a basic duality arises in this context and indicate how it can be used to determine the long-term behaviour as well as to describe the interfaces between different types. In Section 15.3, we then introduce the symbiotic branching model and in particular describe our results regarding the interfaces between different types that we characterise via a scaling limit. Note that the symbiotic branching model is particularly interesting as it exhibits several natural yet different kinds of dualities: a self-duality that we describe in Section 15.4, and a moment duality considered in Section 15.5. In Section 15.6, we look at how we can use the moment duality in a special case to gain insight into the scaling limit of the system. It turns out that this scaling limit is closely related to a continuous-space version of the voter model and that its interfaces are described by annihilating Brownian motions, giving rise to an interface duality. We exploit this connection between the spatial population model and its interface in Section 15.7 to characterise the entrance laws of annihilating Brownian motions. Finally, in Section 15.8 we briefly discuss open problems in this area.
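The moment duality mentioned in the introduction can be checked at the level of generators. The following is a minimal sketch, assuming the normalisation in which the Wright–Fisher diffusion has generator $\frac{1}{2}x(1-x)\frac{d^2}{dx^2}$ and the block-counting process of the Kingman coalescent jumps from $n$ to $n-1$ at rate $\binom{n}{2}$; the function names are illustrative only. It verifies with exact rational arithmetic that both generators act in the same way on the duality function $F(x, n) = x^n$, which is the standard starting point for establishing a duality of the form (15.1.1).

```python
from fractions import Fraction
from math import comb


def wf_generator_on_monomial(x, n):
    """Apply (1/2) x (1 - x) d^2/dx^2 to the function x -> x**n."""
    if n < 2:
        return Fraction(0)
    second_derivative = n * (n - 1) * x ** (n - 2)
    return Fraction(1, 2) * x * (1 - x) * second_derivative


def kingman_generator_on_power(x, n):
    """Apply the block-counting generator to the function n -> x**n.

    The block-counting process jumps from n to n - 1 at rate C(n, 2).
    """
    if n < 2:
        return Fraction(0)
    return comb(n, 2) * (x ** (n - 1) - x ** n)


def generators_agree(x, n):
    """Check the generator identity for one pair (x, n)."""
    return wf_generator_on_monomial(x, n) == kingman_generator_on_power(x, n)


if __name__ == "__main__":
    grid = [Fraction(k, 10) for k in range(11)]
    print(all(generators_agree(x, n) for x in grid for n in range(8)))
```

The generator identity alone does not prove the duality; one still needs the usual integrability and domain conditions, which are unproblematic here because $F$ is bounded on $[0,1] \times \mathbb{N}$.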

15.2 The discrete-space voter model

Our first (well-known) example of duality in a spatial population model arises in the context of the classic voter model. Informally, the voter model represents a population indexed by $x \in \mathbb{Z}^d$, where each individual has an opinion 0 or 1. At rate 1 each individual uniformly picks a neighbour and then copies the opinion of the chosen neighbour. An alternative interpretation is that of a biological population of two different types such that at rate 1 an individual dies and is replaced by the type of a parent uniformly chosen from the neighbours. If the underlying graph is the complete graph, the voter model is a version of the Moran model, see e.g. [20, Section 1.5]. We will also see that variations of the voter model arise as a limit when looking at more complicated population models indexed by $\mathbb{Z}^d$. The classic reference for the voter model is [40], see [51] for a more recent exposition. A formal definition of the system is the following.


Figure 15.2.1. The graphical construction of the voter model. The two different types are indicated in black and white and the interface between the two types in grey.

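Definition 15.2.1. The voter model is a Markov process $(\eta_t)_{t \ge 0}$ taking values in $\{0,1\}^{\mathbb{Z}^d}$ such that if the current state is $\eta = (\eta(x))_{x \in \mathbb{Z}^d}$, then $\eta(x)$ flips to $1 - \eta(x)$ at rate

\[ \frac{1}{2d} \sum_{y \colon |y - x| = 1} \mathbb{1}_{\{\eta(y) \ne \eta(x)\}}, \qquad x \in \mathbb{Z}^d. \]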

The voter model is famously characterised by the following duality: for all $\eta \in \{0,1\}^{\mathbb{Z}^d}$ and finite subsets $A \subset \mathbb{Z}^d$, we have

\[ \mathbb{E}_\eta\Big[\prod_{x \in A} \eta_t(x)\Big] = \mathbb{E}_A\Big[\prod_{x \in Y_t} \eta(x)\Big], \qquad t \ge 0, \tag{15.2.1} \]

where $(Y_t)_{t \ge 0}$ denotes a (set-valued) system of (instantaneously) coalescing nearest-neighbour random walks starting from $A$.

One particularly nice way to analyse the voter model, which also gives the duality (15.2.1), is via a graphical construction due to [30]: We write $x \sim y$ if $x$ and $y$ are neighbours in $\mathbb{Z}^d$ and for $x \sim y$, we denote by $(x, y)$ the directed edge from $x$ to $y$. Then let $(N_{(x,y)};\ x, y \in \mathbb{Z}^d,\ x \sim y)$ be a collection of independent Poisson point processes on $\mathbb{R}_+$ with rate $\frac{1}{2d}$ each. At an event of $N_{(x,y)}$ at time $s$ we draw a directed edge from $(s, x)$ to $(s, y)$, so that together with the lines $\mathbb{Z}^d \times [0, \infty)$ we obtain a directed graph as in Figure 15.2.1.

We can define the voter model started in an initial condition $\eta \in \{0,1\}^{\mathbb{Z}^d}$ as follows: the initial opinions are propagated by letting them flow upwards in the graphical construction and, if they encounter an arrow, by letting them flow along the direction of the arrow (and replacing the opinion at that site if it is different). For each site $(x, t) \in \mathbb{Z}^d \times [0, \infty)$ and $s \in [0, t]$, we set $\xi^{t,x}_s = y \in \mathbb{Z}^d$ if $y$ is the unique point at time $t - s$ that is reached by starting at $(t, x)$ and following vertical lines downwards and, when encountering the tip of an arrow, following the arrow horizontally in reverse direction. An equivalent way to describe the above flow construction is to set $\eta_t(x) := \eta(\xi^{t,x}_t)$.
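The graphical construction lends itself to a direct simulation. The sketch below is an illustration under simplifying assumptions that are not part of the text: space is the finite ring $\mathbb{Z}/n\mathbb{Z}$ with $n \ge 3$ instead of $\mathbb{Z}^d$, so every directed nearest-neighbour edge carries arrows at rate $1/2$, and all function names are hypothetical. It samples the arrows once, runs the opinions forwards, traces the ancestral walk backwards, and compares the two descriptions of the configuration at the final time.

```python
import random


def sample_arrows(n_sites, horizon, rng):
    """Sample arrow events (time, source, target) on the ring Z/nZ, n >= 3.

    Each directed nearest-neighbour edge carries an independent Poisson
    process of rate 1/2, i.e. rate 1/(2d) with d = 1.
    """
    arrows = []
    for x in range(n_sites):
        for y in ((x - 1) % n_sites, (x + 1) % n_sites):
            t = rng.expovariate(0.5)
            while t < horizon:
                arrows.append((t, x, y))
                t += rng.expovariate(0.5)
    arrows.sort()
    return arrows


def run_forward(eta0, arrows):
    """Let opinions flow upwards; an arrow x -> y overwrites the opinion at y."""
    eta = list(eta0)
    for _, source, target in arrows:
        eta[target] = eta[source]
    return eta


def trace_ancestor(x, arrows):
    """Follow the dual walk from (horizon, x) downwards to time 0."""
    site = x
    for _, source, target in reversed(arrows):
        if target == site:  # tip of an arrow: follow it in reverse direction
            site = source
    return site


if __name__ == "__main__":
    rng = random.Random(1)
    n_sites, horizon = 12, 4.0
    arrows = sample_arrows(n_sites, horizon, rng)
    eta0 = [rng.randint(0, 1) for _ in range(n_sites)]
    forward = run_forward(eta0, arrows)
    backward = [eta0[trace_ancestor(x, arrows)] for x in range(n_sites)]
    print(forward == backward)
```

Because both descriptions use the same arrows, they agree for every realisation and not only in distribution; this is the pathwise form of the duality.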


We note that by the Poisson construction $\xi^{t,x} = (\xi^{t,x}_s)_{s \in [0,t]}$ has the law of a simple random walk (where jumps occur at rate 1). Moreover, if we consider the system $\{\xi^{t,x},\ x \in A\}$ for a finite set $A \subset \mathbb{Z}^d$, then this collection has the same law as a system of coalescing random walks: each particle moves as an independent random walk until two particles meet. After meeting, the two particles involved in the collision follow the same random walk trajectory. For more details see [40, Section III.6]. From this construction, we have immediately that for $x_1, \dots, x_n \in \mathbb{Z}^d$,

\[ \mathbb{E}_\eta\Big[\prod_{i=1}^n \eta_t(x_i)\Big] = \mathbb{E}\Big[\prod_{i=1}^n \eta\big(\xi^{t,x_i}_t\big)\Big] = \mathbb{E}_{x_1,\dots,x_n}\Big[\prod_{y \in Y_t} \eta(y)\Big], \]

where $Y_t = \{\xi^{t,x_i}_t,\ i = 1, \dots, n\}$ is the (set-valued) system of coalescing random walks started in $Y_0 = \{x_1, \dots, x_n\}$. Therefore, we have shown (15.2.1).

Remark 15.2.2. As mentioned before, the voter model on the complete graph with $n$ vertices is a variant of the Moran model. See the contribution of Baake and Baake [3] in this volume for graphical constructions with extensions to more general models in this context.

An immediate consequence of the duality with coalescing random walks is that the system in lower dimensions $d = 1, 2$ experiences clustering.

Proposition 15.2.3. Let $(\eta_t)_{t \ge 0}$ be the voter model started in $\eta \in \{0,1\}^{\mathbb{Z}^d}$ and assume $d \in \{1, 2\}$. Then, for any $x, y \in \mathbb{Z}^d$,

\[ \mathbb{P}_\eta\big(\eta_t(x) = \eta_t(y)\big) \to 1 \qquad \text{as } t \to \infty. \]

Proof. Note that by the graphical construction

\[ \mathbb{P}_\eta\big(\eta_t(x) = \eta_t(y)\big) = \mathbb{P}\big(\eta(\xi^{t,x}_t) = \eta(\xi^{t,y}_t)\big) \ge \mathbb{P}\big(\xi^{t,x}_t = \xi^{t,y}_t\big) = \mathbb{P}_{x,y}(\tau \le t), \]

where $\tau$ is the first meeting time of two independent random walks started in $x$ and $y$. Since the difference of two random walks is again a random walk that is recurrent in $d = 1, 2$, the latter probability tends to 1 as $t \to \infty$.

In particular, any invariant measure is concentrated on configurations consisting of all 0s or all 1s. Similarly, one can show that in dimensions $d \ge 3$, due to the transience of the random walk, there are invariant measures that are not concentrated on constant configurations. See e.g. [40, Corollary V.1.13].

We will now concentrate on the case $d = 1$. A question that we will come back to frequently is whether we can describe the dynamics of the "interfaces" between the two different types. More formally, define the interface of a configuration $\eta \in \{0,1\}^{\mathbb{Z}}$ as

\[ I(\eta) := \{x \in \mathbb{Z} : \eta(x) \ne \eta(x+1)\}. \]

Then we can explicitly describe the law of this process as first observed in [44].


Proposition 15.2.4. Let $\eta \in \{0,1\}^{\mathbb{Z}}$. The interface of the voter model $I(\eta_t)$ with $\eta_0 = \eta$ follows a system of annihilating random walks started in $I(\eta)$.

Recall that a system of (instantaneously) annihilating random walks is a system of random walks on $\mathbb{Z}$ that move independently until the first collision time of a pair of particles, at which point the two particles involved annihilate each other.

Proof. The statement can either be checked by calculating generators, see [44], or it follows from the graphical construction, see also Figure 15.2.1: Note that if an interface particle is at site $x$ and encounters an arrow from $x$ to $x + 1$, then it jumps to the right. Conversely, if it is at $x$ and encounters an arrow from $x + 1$ to $x$, then it jumps to the left. Since arrows appear at rate $\frac{1}{2d}$ (with $d = 1$), each particle performs a simple random walk and different particles are independent since they use a disjoint set of arrows. Finally, if a particle jumps on top of another, then the type to the right of the left particle dies out in the voter model and so the interface particles annihilate.

This relation between the annihilating random walks and the voter model leads to the following "interface duality".

Corollary 15.2.5. For any $x, y \in \mathbb{Z}$ with $x < y$ and denoting by $X = (X_t)_{t \ge 0}$ a system of annihilating random walks, we have for any $\eta \in \{0,1\}^{\mathbb{Z}}$ and for any $t \ge 0$,

\[ \mathbb{P}_{I(\eta)}\big(|X_t \cap [x, y-1]| \text{ is even}\big) = \mathbb{P}_\eta\big(\eta_t(x) = \eta_t(y)\big). \]

Proof. This follows from Proposition 15.2.4 together with the observation that $\eta_t(x) = \eta_t(y)$ if and only if $|I(\eta_t) \cap [x, y-1]|$ is even.

In fact, this relationship also means that given an initial condition $\eta \in \{0,1\}^{\mathbb{Z}}$, one can construct a voter model by first sampling a system of annihilating random walks started in $I(\eta)$ and then uniquely colouring the remaining sites so that the annihilating walks correspond to the interfaces.
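The parity observation behind the interface duality is easy to make concrete. The following sketch works on a finite window of sites $0, \dots, n-1$, which is an assumption made for illustration (the text works on all of $\mathbb{Z}$), and the function names are hypothetical. It computes the interface of a configuration, decides from the interface alone whether two sites carry the same type, and rebuilds the configuration from its interface once the type of the leftmost site is fixed.

```python
def interface(eta):
    """Return I(eta) = {x : eta(x) != eta(x + 1)} on the window 0..n-1."""
    return {x for x in range(len(eta) - 1) if eta[x] != eta[x + 1]}


def same_type_by_parity(eta, x, y):
    """Decide whether eta(x) == eta(y) for x < y from the interface alone."""
    crossings = sum(1 for z in interface(eta) if x <= z <= y - 1)
    return crossings % 2 == 0


def colour_from_interface(n_sites, boundaries, left_type):
    """Rebuild a configuration from its interface and the type at site 0."""
    eta = [left_type]
    for x in range(n_sites - 1):
        eta.append(1 - eta[-1] if x in boundaries else eta[-1])
    return eta


if __name__ == "__main__":
    eta = [0, 0, 1, 1, 1, 0, 1]
    print(sorted(interface(eta)))
    print(colour_from_interface(len(eta), interface(eta), eta[0]) == eta)
```

On a finite window the type at one reference site has to be supplied separately, since the interface only records where the type changes.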
A similar interface duality is known for a one-dimensional voter model with swapping, where the interfaces follow a symmetric double-branching annihilating random walk, see [11, 47, 50]. We will come back to this duality in a continuous-space setting, see Sections 15.5 and 15.7 below.

15.3 The symbiotic branching model

Our main object of study will be the symbiotic branching model introduced by Etheridge and Fleischmann in [21]. The model describes the dynamics of a spatial population consisting of two types. In the corresponding infinitesimal particle model, locally the population of each type follows a critical branching process, where the branching rate is given by $\gamma$ times the frequency of particles of the other type, where $\gamma > 0$ is a parameter of the model. Moreover, each particle migrates according to an independent Brownian motion. Finally, the branching mechanisms are correlated


with a correlation parameter denoted by $\varrho \in [-1, 1]$. For a more precise description of the particle system, see [21]. In one spatial dimension and in continuous space, the model is described by two interacting stochastic partial differential equations (SPDEs). Here, $u_t(x)$ and $v_t(x)$ describe the densities of each type at time $t \ge 0$ and site $x \in \mathbb{R}$. The evolution of these (non-negative) densities is given by

\[
\begin{aligned}
\frac{\partial}{\partial t} u_t(x) &= \frac{\Delta}{2} u_t(x) + \sqrt{\gamma\, u_t(x) v_t(x)}\; \dot W^{(1)}_t(x), \\
\frac{\partial}{\partial t} v_t(x) &= \frac{\Delta}{2} v_t(x) + \sqrt{\gamma\, u_t(x) v_t(x)}\; \dot W^{(2)}_t(x),
\end{aligned}
\tag{15.3.1}
\]

with suitable nonnegative initial conditions $u_0(x) = u(x) \ge 0$ and $v_0(x) = v(x) \ge 0$, $x \in \mathbb{R}$. Here, $\gamma > 0$ is the branching rate, $\Delta$ is the Laplacian and $(\dot W^{(1)}, \dot W^{(2)})$ is a pair of correlated standard Gaussian white noises on $\mathbb{R}_+ \times \mathbb{R}$ with correlation parameter $\varrho \in [-1, 1]$. We refer to these SPDEs as $\mathrm{cSBM}(\varrho, \gamma)_{u,v}$. Existence and uniqueness for these equations are covered in [21] (where uniqueness in general is still open for $\varrho = 1$). There is also a discrete space version of the model (e.g. indexed by $\mathbb{Z}^d$), but we will focus on the spatial continuum.

A main motivation for this model stems from the fact that it generalises several well-known examples of spatial population models: For $\varrho = -1$ and for initial conditions $u \equiv 1 - v$ one recovers a continuous-space version of the stepping stone model of Kimura, see also [52]. For $\varrho = 0$, the model is known as the mutually catalytic branching model due to Dawson and Perkins [16]. For $\varrho = 1$ and if $u \equiv v$, then the system reduces to the parabolic Anderson model, compare the contribution of König [38] in this volume. In this case, uniqueness of the system is covered by standard SPDE techniques, see e.g. [41].

In order to investigate the dynamics of the model, one has to understand the balance between the critical local branching mechanism, which pushes one type towards extinction, and the Laplacian, which smoothes out solutions and in particular pushes mass back into regions where one type has died out. A particularly interesting consequence of this competition of forces is the observation of [21] that for any $\varrho \in [-1, 1]$, if we start with initial conditions where both types are initially separated, as for example the complementary Heaviside conditions, i.e.

\[ u = \mathbb{1}_{(-\infty, 0]} \quad \text{and} \quad v = \mathbb{1}_{[0, \infty)}, \tag{15.3.2} \]

then the region where both types coexist remains finite, despite the efforts of the Laplacian to spread mass everywhere instantaneously. More formally, define the region of coexistence or the interface at time $t$ as

\[ I_t = I(u_t, v_t) = \operatorname{supp}(u_t) \cap \operatorname{supp}(v_t). \]

Then [21] shows that $I_t$ is a compact set and the width of the interface grows at most linearly in $t$.
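To get a feeling for the competition between branching noise and the Laplacian, one can experiment with a naive grid approximation of (15.3.1). The sketch below is only an illustration and not a scheme taken from the cited literature: it assumes an explicit Euler step on a finite grid, reflecting boundaries, truncation of negative values at zero, and white noise approximated by independent Gaussians of variance $dt/dx$ per cell; the names `neumann_laplacian` and `simulate_csbm` are hypothetical. The second noise is built as $\varrho Z_1 + \sqrt{1-\varrho^2}\, Z'$, which gives the required correlation $\varrho$.

```python
import numpy as np


def neumann_laplacian(f, dx):
    """Second difference with reflecting (zero-flux) boundary values."""
    padded = np.pad(f, 1, mode="edge")
    return (padded[:-2] - 2.0 * f + padded[2:]) / dx ** 2


def simulate_csbm(u0, v0, rho, gamma, dx, dt, n_steps, seed=0):
    """Naive explicit Euler scheme for a grid approximation of (15.3.1).

    Assumption: dt <= dx**2 so that the heat part is stable; negative
    values produced by the noise are truncated at zero.
    """
    rng = np.random.default_rng(seed)
    u = np.asarray(u0, dtype=float).copy()
    v = np.asarray(v0, dtype=float).copy()
    for _ in range(n_steps):
        z1 = rng.standard_normal(u.size)
        z_indep = rng.standard_normal(u.size)
        z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * z_indep
        # Noise amplitude sqrt(gamma * u * v) times sqrt(dt / dx) per cell.
        amplitude = np.sqrt(gamma * u * v * dt / dx)
        u_next = u + 0.5 * neumann_laplacian(u, dx) * dt + amplitude * z1
        v_next = v + 0.5 * neumann_laplacian(v, dx) * dt + amplitude * z2
        u = np.maximum(u_next, 0.0)
        v = np.maximum(v_next, 0.0)
    return u, v


if __name__ == "__main__":
    x = np.linspace(-5.0, 5.0, 101)
    u0 = (x <= 0).astype(float)  # complementary Heaviside data (15.3.2)
    v0 = (x >= 0).astype(float)
    u, v = simulate_csbm(u0, v0, rho=-0.5, gamma=1.0, dx=0.1, dt=0.004, n_steps=200)
    print("grid points where both types are present:", int(((u > 0) & (v > 0)).sum()))
```

The truncation at zero is a crude device and biases the dynamics near the boundary of the supports, so the output should be read qualitatively only.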


One of our main goals is to understand the evolution of the interface in more detail. In the case $\varrho = -1$, for the stepping stone model with Heaviside initial conditions, a result by Tribe [52] shows that after diffusive rescaling the interface shrinks to a single point that moves like a Brownian motion. One of our earlier works, [8, Theorem 2.11], showed that for all $\varrho$ close to $-1$ there is a constant $C > 0$ such that almost surely, for all $t$ large enough, the interface is contained in the set $\big[-C\sqrt{t \log(t)},\ C\sqrt{t \log(t)}\big]$. This shows sub-linear speed for the interface and is consistent with the conjecture that the diffusive behaviour might also be correct for other $\varrho > -1$.

This conjecture is also supported by the following scaling property: [21, Lemma 8] shows that for any $\gamma > 0$, $K > 0$, if $(u_t, v_t)_{t \ge 0}$ is a solution of $\mathrm{cSBM}(\varrho, \gamma)_{u,v}$, then

\[ \big(u_{K^2 t}(Kx),\ v_{K^2 t}(Kx)\big)_{x \in \mathbb{R},\, t \ge 0} \quad \text{solves} \quad \mathrm{cSBM}(\varrho, K\gamma)_{u^{(K)}, v^{(K)}}, \tag{15.3.3} \]

where $u^{(K)}(x) = u(Kx)$ and $v^{(K)}(x) = v(Kx)$ for all $x \in \mathbb{R}$ are suitably rescaled initial conditions. In particular, if the initial conditions are invariant under the rescaling (as e.g. the ones in (15.3.2)), then a diffusive space-time rescaling is equivalent in law to rescaling the branching parameter. Accordingly, in the following we discuss a scaling limit as $\gamma \to \infty$, which also allows us to consider more general initial conditions.

Our first main result shows that at least for negative $\varrho$, the diffusive rescaling indeed captures the non-trivial behaviour of the interface. To formulate the convergence, we move from densities to measure-valued processes by defining

\[ \mu^{[\gamma]}_t(dx) := u^{[\gamma]}_t(x)\, dx, \qquad \nu^{[\gamma]}_t(dx) := v^{[\gamma]}_t(x)\, dx, \tag{15.3.4} \]

where we now write $(u^{[\gamma]}_t, v^{[\gamma]}_t)_{t \ge 0}$ for the solution of $\mathrm{cSBM}(\varrho, \gamma)$ to emphasise the dependence on $\gamma$. Also, we denote by $\mathcal{M}_{\mathrm{tem}}$ the space of tempered measures on $\mathbb{R}$, and by $\mathcal{M}_{\mathrm{rap}}$ the space of rapidly decreasing measures. Informally, a measure $\mu$ is in $\mathcal{M}_{\mathrm{tem}}$ (resp. $\mathcal{M}_{\mathrm{rap}}$) if $\mu(f) := \langle \mu, f \rangle := \int_{\mathbb{R}} f(x)\, \mu(dx)$ is finite for every non-negative function $f$ that is decreasing exponentially fast (resp. for any $f$ that is growing slower than exponentially). Similarly, $\mathcal{B}^+_{\mathrm{tem}}$ (resp. $\mathcal{B}^+_{\mathrm{rap}}$) denotes the space of nonnegative, tempered (resp. rapidly decreasing) measurable functions, i.e. functions that grow slower than any exponentially growing function (resp. decay faster than any exponentially decaying function). For formal definitions, see [10, Appendix 1].

Theorem 15.3.1 ([10, Theorem 1.5], [27, Theorem 2.2], [28, Theorem 2.8]). Let $\varrho \in [-1, 0)$. If $\varrho \in (-1, 0)$, suppose the initial conditions satisfy $(u, v) \in (\mathcal{B}^+_{\mathrm{tem}})^2$ or $(u, v) \in (\mathcal{B}^+_{\mathrm{rap}})^2$, and if $\varrho = -1$, suppose the initial conditions are bounded. Then as $\gamma \to \infty$, the measure-valued process $(\mu^{[\gamma]}_t, \nu^{[\gamma]}_t)_{t \ge 0}$ defined by (15.3.4) converges in law with respect to the Meyer–Zheng "pseudo-path" topology to a measure-valued process $(\mu_t, \nu_t)_{t \ge 0}$. Moreover, for any $t > 0$, almost surely the limiting measures $\mu_t$ and $\nu_t$ are absolutely continuous with respect to the Lebesgue measure and their densities $u_t, v_t$ satisfy the separation-of-types property $u_t(x) v_t(x) = 0$ for almost all $x \in \mathbb{R}$.


In the following we will refer to the limit $(\mu_t, \nu_t)_{t \ge 0}$ (or its density $(u_t, v_t)_{t \ge 0}$) as the continuous-space infinite rate symbiotic branching model $\mathrm{cSBM}(\varrho, \infty)_{u,v}$. The theorem is proved by showing tightness and uniqueness of limit points using two different types of duality that are known for the symbiotic branching model. For tightness, we make use of the duality to Brownian motions with dynamically changing colours (see Section 15.5) and for uniqueness, we use the self-duality first applied in this context by Mytnik [42], see Section 15.4.

Remark 15.3.2.
(a) The Meyer–Zheng "pseudo-path" topology is a fairly weak topology on the space of càdlàg measure-valued processes and essentially requires convergence at all times apart from those in a Lebesgue-null set. For a formal definition we refer to [10, Appendix A.1]. Under the more restrictive assumption that $(u, v)$ is the complementary Heaviside initial condition in (15.3.2), we show in [10, Theorem 1.12] that for $\varrho \in (-1, -1/\sqrt{2})$, and for $\varrho = -1$ in [28, Theorem 2.8], tightness also holds in the stronger (standard) Skorokhod topology, implying convergence in $\mathcal{C}_{[0,\infty)}(\mathcal{M}^2_{\mathrm{tem}})$.
(b) The fact that the limiting measures are absolutely continuous is first shown in [10] for $\varrho \in (-1, -1/\sqrt{2})$ and complementary Heaviside initial conditions, and the statement is extended to all $\varrho \in (-1, 0)$ in [27] and to $\varrho = -1$ in [28].
(c) The reason for the different assumptions on the initial conditions depending on $\varrho$ is mainly technical: for $\varrho = -1$ the result is proved using a moment duality, which will be discussed in Section 15.5, and for $\varrho \in (-1, 0)$ we rely on a self-duality, see Section 15.4.

As mentioned before, for $\varrho = -1$ and complementary Heaviside initial conditions, the result of Theorem 15.3.1 is proved by Tribe [52, Theorem 4.2] (in the stronger Skorokhod topology) and it is shown that the limiting process in this case is

\[ \big(\mathbb{1}_{\{x \le B_t\}}\, dx,\ \mathbb{1}_{\{x \ge B_t\}}\, dx\big)_{t \ge 0}, \tag{15.3.5} \]

for $(B_t)_{t \ge 0}$ a standard Brownian motion. See also Theorem 15.6.2 below for an extension of this result to more general initial conditions (but still $\varrho = -1$). Unfortunately, for $\varrho > -1$, we do not currently have a description of the limit that is as explicit. However, we can characterise the limit via a martingale problem, see also Section 15.4 below. In particular, in this way we can exclude that the limit is of the simple form (15.3.5). The difference compared to $\varrho = -1$ is that for $\varrho > -1$, the sum $u^{[\gamma]}_t + v^{[\gamma]}_t$ is no longer deterministic and the random height fluctuations influence the dynamics of the interface.

Even though we do not have an explicit description of the limiting object, we can say more about the interface process. For any Radon measure $\mu$ denote by

\[ \operatorname{supp}(\mu) := \{x \in \mathbb{R} : \mu(B_\varepsilon(x)) > 0 \text{ for all } \varepsilon > 0\} \]


the measure-theoretic support of $\mu$, where $B_\varepsilon(x)$ denotes a ball of radius $\varepsilon$. Then define

\[ L(\mu) := \inf \operatorname{supp}(\mu) \quad \text{and} \quad R(\mu) := \sup \operatorname{supp}(\mu). \]

Theorem 15.3.3 ([27]). Suppose $\varrho \in (-1, 0)$. Let $(\mu, \nu)$ be initial conditions in $\mathcal{M}^2_{\mathrm{tem}}$ or $\mathcal{M}^2_{\mathrm{rap}}$ that are mutually singular and such that $R(\mu) \le L(\nu)$ with $\mu + \nu \ne 0$. Let $(\mu_t, \nu_t)_{t \ge 0}$ be a solution of $\mathrm{cSBM}(\varrho, \infty)_{\mu,\nu}$. Then, almost surely,

\[ R(\mu_t) \le L(\nu_t) \qquad \text{for all } t > 0. \]

Moreover, for all fixed $t > 0$, almost surely, $(\mu_t, \nu_t)$ has a single-point interface in the sense that $R(\mu_t) = L(\nu_t)$.

Remark 15.3.4. Note the difference in the order of the quantifiers for $t$ and $\omega$: For the first statement the set of exceptional $\omega$'s does not depend on $t$, while in the second statement it does. Note also that the condition $R(\mu_t) \le L(\nu_t)$ means that the support of $\mu_t$ is to the left of the support of $\nu_t$, but there might possibly be a gap. In particular, our theorem does not guarantee the existence of an interface process $I_t := R(\mu_t) = L(\nu_t)$. In general, it is a non-trivial task to obtain results that are uniform in time, see e.g. the discussion in [15, Section 7] for the two-dimensional mutually catalytic model.

Unlike for the first result, Theorem 15.3.1, the proof of Theorem 15.3.3 does not directly rely on the technique of duality. Instead, we deduce the fact that the interface is a single point by establishing a connection between the continuum space model and the model defined on $\mathbb{Z}$. The proof heavily relies on prior work for the discrete model. Inspired by the scaling property (15.3.3), Klenke and Mytnik [34–36] consider the mutually catalytic branching model, i.e. the symbiotic branching model with $\varrho = 0$, on a discrete space and show that without a spatial rescaling but taking $\gamma \to \infty$, the model converges to an infinite-rate limiting process. In contrast to our result, they have an explicit description of the limit in terms of an interacting system of jump-type SDEs and they also study long-term properties of the system. Moreover, [37] gives a Trotter-type approximation and [18,19] extend the analysis for the discrete model to all $\varrho \in (-1, 1)$ and obtain results comparable to the $\varrho = 0$ case. We show in [27] that if one starts with the infinite-rate symbiotic branching model defined on a lattice and takes a diffusive time and space rescaling, then one ends up with the continuous-space infinite-rate symbiotic branching model.
Our strategy in showing Theorem 15.3.3 is then to first use the explicit description of the limit to show that the discrete model started in complementary Heaviside initial conditions has a single-point interface (interpreted suitably) and then to show that this property is preserved in the space-time limit.


15.4 Self-duality in the symbiotic branching model

The symbiotic branching model has a very rich duality structure. In this section we explain its self-duality, which is particularly useful for showing uniqueness of solutions, e.g. for the SPDE (15.3.1). This duality is essentially based on an idea of Mytnik [42], who used it to show uniqueness for the mutually catalytic branching model. The reason that self-duality is needed is that as soon as $\varrho > -1$, the densities $(u_t, v_t)$ can have random heights. Therefore, if one wants to show uniqueness of solutions by showing that moments converge, one is additionally faced with the highly non-trivial task of controlling the growth of these moments.

Self-duality is a classical duality in the sense of (15.1.1), where however the dual follows the same dynamics as the original process, but starts with different initial conditions. We start by defining the corresponding duality function. Let $\varrho \in (-1, 1)$. If either $(\mu, \nu, \phi, \psi) \in \mathcal{M}^2_{\mathrm{tem}} \times (\mathcal{B}^+_{\mathrm{rap}})^2$ or $(\mu, \nu, \phi, \psi) \in \mathcal{M}^2_{\mathrm{rap}} \times (\mathcal{B}^+_{\mathrm{tem}})^2$, define

\[ \langle\langle \mu, \nu, \phi, \psi \rangle\rangle_\varrho := -\sqrt{1 - \varrho}\, \langle \mu + \nu, \phi + \psi \rangle + i \sqrt{1 + \varrho}\, \langle \mu - \nu, \phi - \psi \rangle, \]

where $\langle \mu, \phi \rangle := \int_{\mathbb{R}} \phi(x)\, \mu(dx)$. We then define the self-duality function $F$ as

\[ F(\mu, \nu, \phi, \psi) := \exp \langle\langle \mu, \nu, \phi, \psi \rangle\rangle_\varrho. \]

The duality function $F$ also plays an important role in the limiting martingale problem, as the following lemma motivates.

Lemma 15.4.1. Let $\varrho \in (-1, 1)$ and suppose the initial conditions satisfy $(u, v) \in (\mathcal{B}^+_{\mathrm{tem}})^2$ (resp. $(\mathcal{B}^+_{\mathrm{rap}})^2$). Denote by $(u_t, v_t)_{t \ge 0}$ the solution of $\mathrm{cSBM}(\varrho, \gamma)_{u,v}$ with $\gamma < \infty$ and set $\mu_t(dx) = u_t(x)\, dx$ and $\nu_t(dx) = v_t(x)\, dx$. Then there exists an increasing càdlàg $\mathcal{M}_{\mathrm{tem}}$-valued (resp. $\mathcal{M}_{\mathrm{rap}}$-valued) process $(\Lambda_t)_{t \ge 0}$ with $\Lambda_0 = 0$ and such that for all twice continuously differentiable test functions $\phi, \psi \in \mathcal{B}^+_{\mathrm{rap}}$ (resp. $\phi, \psi \in \mathcal{B}^+_{\mathrm{tem}}$) the process

\[
\begin{aligned}
F(\mu_t, \nu_t, \phi, \psi) &- F(\mu_0, \nu_0, \phi, \psi) - \frac{1}{2} \int_0^t F(\mu_s, \nu_s, \phi, \psi)\, \langle\langle \mu_s, \nu_s, \Delta\phi, \Delta\psi \rangle\rangle_\varrho \, ds \\
&- 4(1 - \varrho^2) \int_{[0,t] \times \mathbb{R}} F(\mu_s, \nu_s, \phi, \psi)\, \phi(x) \psi(x)\, \Lambda(ds, dx)
\end{aligned}
\tag{15.4.1}
\]

is a martingale. Here,

\[ \Lambda(dt, dx) := \gamma\, u_t(x) v_t(x)\, dt\, dx, \tag{15.4.2} \]

and $\Lambda_t(dx) = \Lambda([0, t] \times dx)$.

Proof. This lemma can be proved by first writing the solution $(u_t, v_t)_{t \ge 0}$ in the weak formulation and then applying Itô's lemma. See the proof of [21, Proposition 5] for details.
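The constant $4(1 - \varrho^2)$ in front of the $\Lambda$-term can be traced back to a short algebraic identity. As a sketch of the Itô computation (our own bookkeeping, not quoted from [21]): at a point $x$, the exponent of $F$ depends on $u$ and $v$ through the coefficients $a = -\sqrt{1-\varrho}(\phi+\psi) + i\sqrt{1+\varrho}(\phi-\psi)$ and $b = -\sqrt{1-\varrho}(\phi+\psi) - i\sqrt{1+\varrho}(\phi-\psi)$, and with noise correlation $\varrho$ the second-order correction is $\frac{1}{2}(a^2 + b^2 + 2\varrho ab)$ times $\gamma u v$. The code below, with hypothetical function names and densities sampled on a grid, evaluates the bracket and the duality function and checks numerically that this correction equals $4(1-\varrho^2)\phi\psi$.

```python
import numpy as np


def bracket(mu, nu, phi, psi, rho, dx):
    """<<mu, nu, phi, psi>>_rho for densities sampled on a grid of spacing dx."""
    def pair(f, g):
        return np.sum(f * g) * dx

    return (-np.sqrt(1.0 - rho) * pair(mu + nu, phi + psi)
            + 1j * np.sqrt(1.0 + rho) * pair(mu - nu, phi - psi))


def self_duality_function(mu, nu, phi, psi, rho, dx):
    """F(mu, nu, phi, psi) = exp(<<mu, nu, phi, psi>>_rho)."""
    return np.exp(bracket(mu, nu, phi, psi, rho, dx))


def ito_correction(phi, psi, rho):
    """Return (a^2 + b^2 + 2 rho a b) / 2 for the coefficients a, b above."""
    real_part = -np.sqrt(1.0 - rho) * (phi + psi)
    imag_part = np.sqrt(1.0 + rho) * (phi - psi)
    a = real_part + 1j * imag_part
    b = real_part - 1j * imag_part
    return 0.5 * (a * a + b * b + 2.0 * rho * a * b)


if __name__ == "__main__":
    rho, phi, psi = -0.4, 0.7, 1.3
    print(ito_correction(phi, psi, rho), 4.0 * (1.0 - rho ** 2) * phi * psi)
```

For nonnegative arguments the modulus of $F$ is at most one, which is one reason why this duality function is convenient to work with.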


We say that a process $(\mu_t, \nu_t)_{t \ge 0}$ taking values in $\mathcal{M}^2_{\mathrm{tem}}$, resp. $\mathcal{M}^2_{\mathrm{rap}}$, is a solution of the martingale problem $\mathrm{MP}_F(\varrho)$ if there exists an increasing, càdlàg process $(\Lambda_t)_{t \ge 0}$ with $\Lambda_0 = 0$ taking values in $\mathcal{M}_{\mathrm{tem}}$, resp. $\mathcal{M}_{\mathrm{rap}}$, such that the expression in (15.4.1) is a martingale. Here, we interpret $(\Lambda_t)_{t \ge 0}$ as a measure on $[0, \infty) \times \mathbb{R}$ by setting $\Lambda([0, t] \times B) = \Lambda_t(B)$ for any Borel set $B$ and $t \ge 0$.

Solutions of the martingale problem $\mathrm{MP}_F(\varrho)$ are not unique, as for any $\gamma$, a solution of $\mathrm{cSBM}(\varrho, \gamma)$ gives a solution of $\mathrm{MP}_F(\varrho)$. However, specifying the correlation via (15.4.2) fixes solutions. This can be shown via the following self-duality based on an idea of Mytnik [42].

Lemma 15.4.2. Fix $\gamma \in (0, \infty)$. Let $(\mu_t, \nu_t)_{t \ge 0}$ be a process taking values in $\mathcal{M}^2_{\mathrm{tem}}$ with densities $(u_t, v_t)_{t \ge 0}$ that satisfies the martingale problem $\mathrm{MP}_F(\varrho)$ together with (15.4.2). Then, for any test functions $\phi, \psi \in \mathcal{B}^+_{\mathrm{rap}}$,

\[ \mathbb{E}_{u_0, v_0}\big[F(\mu_t, \nu_t, \phi, \psi)\big] = \mathbb{E}_{\phi, \psi}\big[F(\tilde\mu_t, \tilde\nu_t, u_0, v_0)\big], \tag{15.4.3} \]

where $(\tilde\mu_t, \tilde\nu_t)_{t \ge 0}$ is any solution of the martingale problem $\mathrm{MP}_F(\varrho)$ with densities $(\tilde u_t, \tilde v_t)_{t \ge 0}$ satisfying (15.4.2) (with $(\tilde u_t, \tilde v_t)$ replacing $(u_t, v_t)$) and taking values in $\mathcal{M}^2_{\mathrm{rap}}$ with initial conditions $(\tilde u_0, \tilde v_0) = (\phi, \psi)$.

Clearly the collection of functions $F(\,\cdot\,, \,\cdot\,, \phi, \psi)$ for $\phi, \psi$ as above is measure-determining since it is a mixed Laplace–Fourier transform, and so the self-duality uniquely determines solutions of the martingale problem if (15.4.2) is also specified. The main idea in [10] is to characterise the $\gamma = \infty$ limit via the martingale problem, but instead of explicitly prescribing the correlation as in (15.4.2) we replace this condition by a separation-of-types condition.

Theorem 15.4.3 ([10, Theorem 1.10]). Given initial conditions $(u, v) \in (\mathcal{B}^+_{\mathrm{tem}})^2$ (resp. $(u, v) \in (\mathcal{B}^+_{\mathrm{rap}})^2$), the limiting process $\mathrm{cSBM}(\varrho, \infty)$ in Theorem 15.3.1 can be characterised as the unique solution $(\mu_t, \nu_t)_{t \ge 0}$ of $\mathrm{MP}_F(\varrho)$ with $\mu_0(dx) = u(x)\, dx$ and $\nu_0(dy) = v(y)\, dy$ for which the increasing, càdlàg process $(\Lambda_t)_{t \ge 0}$ satisfies

\[ \mathbb{E}_{\mu_0, \nu_0}[\Lambda_t(dx)] \in \mathcal{M}_{\mathrm{tem}} \qquad (\text{resp. } \mathbb{E}_{\mu_0, \nu_0}[\Lambda_t(dx)] \in \mathcal{M}_{\mathrm{rap}}) \]

and, for all $t > 0$ and $x \in \mathbb{R}$,

\[ \mathbb{E}_{u,v}\big[S_\varepsilon \mu_t(x)\, S_\varepsilon \nu_t(x)\big] \to 0 \qquad \text{as } \varepsilon \to 0, \tag{15.4.4} \]

where $(S_t)_{t \ge 0}$ denotes the heat semigroup.

The proof of the uniqueness shows that under the separation-of-types condition (15.4.4) any solution of $\mathrm{MP}_F(\varrho)$ satisfies a self-duality analogous to (15.4.3). A similar martingale problem was also used in earlier work for the infinite-rate model on a discrete space, see [34, 35]. The difference is that in this context one can work with functions $F$, where $\phi$ and $\psi$ are such that $\phi\psi = 0$, so that the term involving $\Lambda$ in (15.4.1) vanishes. Also, the proof of the self-duality relation becomes easier as one does not have to worry about spatial regularity.


15.5 Moment duality in the symbiotic branching model

The finite-rate symbiotic branching model also satisfies a moment duality as shown in [21]. We will first recall this standard construction and then discuss the extension of the moment duality to the infinite-rate model due to [28]. This construction works in $\mathbb{R}$ as well as in $\mathbb{Z}^d$, but we will concentrate on the continuous-space case here.

In this section we denote by $(u_t, v_t)_{t \ge 0}$ (the density of) a solution of $\mathrm{cSBM}(\varrho, \gamma)$ for $\gamma \in (0, \infty]$. For any functions $u, v \colon \mathbb{R} \to [0, \infty)$ and a colour $c \in \{1, 2\}$, we introduce the notation

\[ (u, v)^{(c)}(x) := \begin{cases} u(x) & \text{if } c = 1, \\ v(x) & \text{if } c = 2. \end{cases} \]

Then, with this notation and for $n \in \mathbb{N}$, $x = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$ and a colouring $c = (c_1, \dots, c_n) \in \{1, 2\}^n$, the duality gives us an expression for the moment

\[ \mathbb{E}\Big[\prod_{i=1}^n (u_t, v_t)^{(c_i)}(x_i)\Big]. \]

The dual process is given by $n$ independent Brownian motions $X = (X^1_t, \dots, X^n_t)_{t \ge 0}$ started in $x$. Moreover, to each Brownian motion we associate a dynamically changing colour process $(C_t)_{t \ge 0}$ with $C_t = (C^1_t, \dots, C^n_t) \in \{1, 2\}^n$, where $C^i_t$ describes the colour of the $i$th motion and is such that $C^i_0 = c_i$ for each $i = 1, \dots, n$. The dynamics of $C_t$ are as follows: when a pair of Brownian motions of the same colour meets frequently enough such that their collision local time exceeds an exponential time with rate $\gamma$, then one of the particles (chosen randomly) changes colour. In order to describe the duality, we also introduce $L^{=}_t$, respectively $L^{\ne}_t$, as the total collision time collected up to time $t$ by all pairs of equal colour, respectively different colours. Using the duality function

\[ (u, v)^{(c)}(x) := \prod_{i=1}^n (u, v)^{(c_i)}(x_i), \]

we can write the moment duality for the symbiotic branching model with finite $\gamma$ for $(u, v) \in (\mathcal{B}^+_{\mathrm{tem}})^2$ and $c \in \{1, 2\}^n$ as

\[ \mathbb{E}_{u,v}\big[(u_t, v_t)^{(c)}(x)\big] = \mathbb{E}_{x,c}\Big[(u, v)^{(C_t)}(X_t)\, e^{\gamma (L^{=}_t + \varrho L^{\ne}_t)}\Big], \tag{15.5.1} \]

see [21, Proposition 12].

Remark 15.5.1. Critical curve. The duality holds for all $\varrho \in [-1, 1]$; however, it has been used to ensure uniqueness of solutions only for $\varrho = -1$, see also the discussion at the end of [21, Section 1.1]. Moreover, by combining it with the self-duality, [8, Theorem 2.5] shows that there is a critical curve (as a function of $\varrho$) that determines


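which moments of the solutions remain bounded in time. More precisely, it is shown that

\[ p < p(\varrho) \implies \mathbb{E}_{\mathbf{1},\mathbf{1}}\big[u_t(x)^p\big] \text{ is bounded uniformly in all } t \ge 0,\ x \in \mathbb{R}, \]

where

\[ p(\varrho) = \frac{\pi}{\arccos(-\varrho)} \]

and $\mathbf{1}$ is the uniform initial condition.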
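The critical curve is simple to evaluate. The following sketch (function names are illustrative) computes $p(\varrho) = \pi / \arccos(-\varrho)$, with the convention $p(-1) = \infty$, and the value $-\cos(\pi/n)$. Since $p$ is decreasing in $\varrho$, an integer $n$ satisfies $n < p(\varrho)$ exactly when $\varrho < -\cos(\pi/n)$; for $n = 4$ this threshold is $-1/\sqrt{2}$.

```python
import math


def critical_exponent(rho):
    """Return p(rho) = pi / arccos(-rho), with p(-1) = infinity."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    angle = math.acos(-rho)
    return math.inf if angle == 0.0 else math.pi / angle


def moment_threshold(n):
    """Return -cos(pi / n); n < p(rho) holds exactly when rho < -cos(pi / n)."""
    return -math.cos(math.pi / n)


if __name__ == "__main__":
    for rho in (-1.0, -0.8, -0.5, 0.0, 0.5, 1.0):
        print(rho, critical_exponent(rho))
    print(moment_threshold(4), -1.0 / math.sqrt(2.0))
```

In particular $p(0) = 2$, so in the mutually catalytic case $\varrho = 0$ the second moment sits exactly on the critical curve.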

In order to derive a moment duality in the $\gamma = \infty$ case, the main idea of [28] is to decouple the evolution of the Brownian motions and the colourings in the following sense: we first sample the Brownian motions and then, conditionally on the Brownian motions, we treat the remainder of the right-hand side of (15.5.1) as a measure on colourings. More precisely, given the Brownian motions $X$ and an initial condition $c \in \{1, 2\}^n$, we define a (random) measure on colourings by setting

\[ M^{[\gamma]}_t(b) := \mathbb{E}_c\Big[e^{\gamma (L^{=}_t + \varrho L^{\ne}_t)}\, \mathbb{1}_{\{C_t = b\}} \,\Big|\, X_{[0,t]}\Big], \tag{15.5.2} \]

for any $b \in \{1, 2\}^n$. With this notation the duality (15.5.1) can be written as

\[ \mathbb{E}_{u_0, v_0}\big[(u_t, v_t)^{(c)}(x)\big] = \mathbb{E}_{x,c}\Big[\sum_{b \in \{1,2\}^n} (u_0, v_0)^{(b)}(X_t)\, M^{[\gamma]}_t(b)\Big]. \]

This rewriting of the duality is helpful because we can write down the evolution of the measure $M^{[\gamma]}$ explicitly.

Lemma 15.5.2. With $L^{i,j}_t$ denoting the collision local time between $X^i$ and $X^j$, we have that $M^{[\gamma]}_0 = \delta_c$ and

\[ dM^{[\gamma]}_t(b) = \frac{\varrho\gamma}{2} \sum_{i,j=1}^n \mathbb{1}_{\{b_i \ne b_j\}}\, M^{[\gamma]}_t(b)\, dL^{i,j}_t + \frac{\gamma}{2} \sum_{i,j=1}^n \mathbb{1}_{\{b_i \ne b_j\}}\, M^{[\gamma]}_t(\hat b^i)\, dL^{i,j}_t, \tag{15.5.3} \]

where $\hat b^i$ is the colouring $b$ flipped at $i$.

The lemma follows by looking at a small time increment and considering the possible changes in $M^{[\gamma]}$ induced by the dynamics of the colouring process. The details can be found in [28, Lemma 3.4]. Notice that, conditional on $X$, (15.5.3) is a system of linear ODEs driven by the local times; note also that increasing the parameter $\gamma$ corresponds to speeding up the time evolution. Moreover, we have a lot of explicit control over the evolution of these ODEs in terms of eigenvalues and eigenvectors.

To be more explicit, suppose that the Brownian motions $X$ start in distinct starting positions. We set $\tau_0 = 0$ and for $k \ge 0$ define

\[ \tau_{k+1} := \inf\big\{t > \tau_k : \text{there exist } i \ne j \text{ such that } X^i_t = X^j_t \text{ but } X^i_{\tau_k} \ne X^j_{\tau_k}\big\}, \]


as the consecutive times when a new pair of Brownian motions meet. Then we have $\tau_{k+1} > \tau_k$ almost surely and we can show that in the limit $\gamma \to \infty$, in between times $\tau_k$ and $\tau_{k+1}$, the measures $M^{[\gamma]}_t$ immediately settle into their equilibrium measure $M^{[\infty]}_t$ (so are constant in the limit). We refer to [28, Theorems 2.3 and 2.5], where we show the following theorem.

Theorem 15.5.3. Suppose that $\varrho \in [-1, -\cos(\frac{\pi}{n}) \wedge 0)$ and that the starting point $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ satisfies $x_i \ne x_j$ for $i \ne j$. Then as $\gamma \to \infty$, the process $(M^{[\gamma]}_t)_{t \ge 0}$ defined in (15.5.2) converges almost surely pointwise to a càglàd$^1$ limiting process $(M^{[\infty]}_t)_{t \ge 0}$. Moreover, denote by $(u_t, v_t)_{t \ge 0}$ the (density of the) infinite rate limit $\mathrm{cSBM}(\varrho, \infty)_{u,v}$ started in $(u, v) \in (\mathcal{B}^+_{\mathrm{tem}})^2$. Then, for any colouring $c = (c_1, \dots, c_n) \in \{1, 2\}^n$,

\[ \mathbb{E}_{u,v}\big[(u_t, v_t)^{(c)}(x)\big] = \mathbb{E}_x\Big[\sum_{b \in \{1,2\}^n} M^{[\infty]}_t(b)\, (u, v)^{(b)}(X_t)\Big], \]

where $M^{[\infty]}_0 = \delta_c$ and both sides are finite.

The limiting process $M^{[\infty]}$ is explicit, but fairly complicated to write down. Therefore, we refer to [28, Propositions 4.2 and 4.6] for the details. Note that, for $\varrho = -1$, we have that $u_t + v_t$ satisfies the heat equation, i.e. $w_t := u_t + v_t = S_t(u_0 + v_0)$, where $(S_t)_{t \ge 0}$ denotes the heat semigroup. In this case, the duality also simplifies and we have the following result.

Proposition 15.5.4. Let $(u_t, v_t)_{t \ge 0}$ be the densities of $\mathrm{cSBM}(-1, \infty)_{u,v}$ for initial conditions $(u, v) \in (\mathcal{B}^+_{\mathrm{tem}})^2$.
(i) For $x \in \mathbb{R}^n$ with $x_1 < x_2 < \dots < x_n$ and an alternating colouring $c \in \{1, 2\}^n$, we have

\[ \mathbb{E}_{u,v}\big[(u_t, v_t)^{(c)}(x)\big] = \mathbb{E}_x\big[(u, v)^{(c)}(X_t)\, \mathbb{1}_{\{t \le \tau\}}\big], \]

where $\tau := \inf\{t > 0 : \text{there exist } i \ne j \text{ such that } X^i_t = X^j_t\}$ is the first collision time.
(ii) Suppose additionally that $u + v \equiv 1$. For any $x = (x_1, \dots, x_n) \in \mathbb{R}^n$,

\[ \mathbb{E}_{u,v}\Big[\prod_{i=1}^n u_t(x_i)\Big] = \mathbb{E}_x\Big[\prod_{y \in Y^x_t} u(y)\Big], \tag{15.5.4} \]

where $Y^x_t = \{Y^{(x_i)}_t : 1 \le i \le n\}$, $t \ge 0$, is a system of coalescing Brownian motions started from $x$.

Proof sketch. In the case $\varrho = -1$, the dynamics of $M^{[\infty]}$ can be described as follows, see [28, Propositions 4.2 and 4.6]: Let $\tau_0 = 0$ and $(\tau_k)_{k \ge 1}$ be, as before, the consecutive times when a new pair of Brownian motions meets. Then the process $M^{[\infty]}_t$ is constant on each of the intervals $[0, \tau_1], (\tau_1, \tau_2], (\tau_2, \tau_3], \dots$. Moreover, $M^{[\infty]}_t = M^{[\infty]}_0 = \delta_c$ for $t \le \tau_1$.

$^1$ That is, left-continuous with right limits.


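To describe the dynamics of $M_t^{[\infty]}$ for $t > \tau_1$, define for $\ell_1, \ell_2 \in \{1, \dots, n\}$ and any $b \in \{1,2\}^n$,

\[
K_\infty^{\ell_1,\ell_2}(\delta_b) =
\begin{cases}
0 & \text{if } b_{\ell_1} \neq b_{\ell_2}, \\
\delta_b + \tfrac12\, \delta_{\hat b^{\ell_1}} + \tfrac12\, \delta_{\hat b^{\ell_2}} & \text{if } b_{\ell_1} = b_{\ell_2},
\end{cases}
\]

where $\hat b^{\ell}$ denotes the colouring flipped at position $\ell$. Then define $K_\infty^{\ell_1,\ell_2}(M)$ for any measure $M$ on $\{1,2\}^n$ by setting

\[
K_\infty^{\ell_1,\ell_2}(M) = \sum_{b \in \{1,2\}^n} M(b)\, K_\infty^{\ell_1,\ell_2}(\delta_b).
\]

Now, we can define $M_t^{[\infty]}$ inductively. Assume that at time $\tau_k$ with $k \ge 1$ the pair of Brownian motions with indices $\ell_1$ and $\ell_2$ meets, then set

\[
M_t^{[\infty]} = K_\infty^{\ell_1,\ell_2}\bigl(M_{\tau_k}^{[\infty]}\bigr) \quad \text{for } t \in (\tau_k, \tau_{k+1}].
\]

(i) In the first case, we see that $M_t^{[\infty]} = M_0^{[\infty]} = \delta_c$ for $t \le \tau = \tau_1$. In particular, if at time $\tau$ the Brownian motions indexed by $\ell_1, \ell_2$ meet, we have by assumption on $c$ that $c_{\ell_1} \neq c_{\ell_2}$. Hence, $K_\infty^{\ell_1,\ell_2}(\delta_c) = 0$ and thus $M_t^{[\infty]} = 0$ for all $t > \tau = \tau_1$. The statement of (i) then follows from Theorem 15.5.3, see also [28, Lemma 5.8] for a more detailed argument.

(ii) We only prove the case $n = 2$. The general case follows along similar lines, but the combinatorics is more involved (also see Remark 15.5.5 for an alternative approach). We have by Theorem 15.5.3 that for $c = (1,1)$,

\[
\mathbb{E}_{u,v}\bigl[u_t(x_1)\, u_t(x_2)\bigr] = \mathbb{E}_x\Bigl[\sum_{b \in \{1,2\}^2} M_t^{[\infty]}(b)\, (u,v)^{(b)}(X_t)\Bigr],
\]

where $M_0^{[\infty]} = \delta_c$. As before, we have $M_t^{[\infty]} = \delta_c = \delta_{(1,1)}$ for $t \in [0, \tau_1]$. Then, for $t > \tau_1$, we have that

\[
M_t^{[\infty]} = K_\infty^{1,2}(\delta_{(1,1)}) = \delta_{(1,1)} + \tfrac12\, \delta_{(1,2)} + \tfrac12\, \delta_{(2,1)}.
\]

Thus, using that $u + v = 1$, we get that

\[
\begin{aligned}
\mathbb{E}_{u,v}\bigl[u_t(x_1)\, u_t(x_2)\bigr]
&= \mathbb{E}_x\bigl[\mathbf{1}_{\{t \le \tau_1\}}\, u(X_t^1)\, u(X_t^2)\bigr] + \mathbb{E}_x\bigl[\mathbf{1}_{\{t > \tau_1\}}\, u(X_t^1)\, u(X_t^2)\bigr] \\
&\quad + \tfrac12\, \mathbb{E}_x\bigl[\mathbf{1}_{\{t > \tau_1\}}\, u(X_t^1)\,(1-u)(X_t^2)\bigr] + \tfrac12\, \mathbb{E}_x\bigl[\mathbf{1}_{\{t > \tau_1\}}\, (1-u)(X_t^1)\, u(X_t^2)\bigr] \\
&= \mathbb{E}_x\bigl[\mathbf{1}_{\{t \le \tau_1\}}\, u(X_t^1)\, u(X_t^2)\bigr] + \tfrac12\, \mathbb{E}_x\bigl[\mathbf{1}_{\{t > \tau_1\}}\, u(X_t^1)\bigr] + \tfrac12\, \mathbb{E}_x\bigl[\mathbf{1}_{\{t > \tau_1\}}\, u(X_t^2)\bigr].
\end{aligned}
\]

This expression can be represented in terms of coalescing Brownian motions as in (ii), if we notice that we can obtain a system of coalescing Brownian motions from $X$ by deciding at the collision time to follow exactly one of the two Brownian motions, chosen with probability $\tfrac12$ each.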
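The algebra in the case $n = 2$ can be checked mechanically. The following is a minimal sketch, not taken from [28]: it stores a measure on colourings as a dictionary, applies the map $K_\infty^{\ell_1,\ell_2}$ defined above, and evaluates the mixed moment under the assumption $u + v = 1$. The function names and the sample values of $u$ are hypothetical, and positions are 0-indexed in the code.

```python
from fractions import Fraction


def flip(b, pos):
    """Return the colouring b with the colour at index pos switched (1 <-> 2)."""
    return b[:pos] + (3 - b[pos],) + b[pos + 1:]


def k_infinity(measure, l1, l2):
    """Apply K_infinity^{l1,l2} to a measure stored as {colouring: weight}."""
    result = {}
    for b, weight in measure.items():
        if b[l1] != b[l2]:
            continue  # delta_b with different colours at l1 and l2 is sent to 0
        for target, factor in ((b, Fraction(1)),
                               (flip(b, l1), Fraction(1, 2)),
                               (flip(b, l2), Fraction(1, 2))):
            result[target] = result.get(target, Fraction(0)) + weight * factor
    return result


def mixed_moment(measure, u_values):
    """Sum over b of M(b) * prod_i (u_i if b_i == 1 else 1 - u_i), assuming u + v = 1."""
    total = Fraction(0)
    for b, weight in measure.items():
        term = weight
        for colour, u in zip(b, u_values):
            term *= u if colour == 1 else 1 - u
        total += term
    return total


# n = 2 and c = (1, 1): the measure on colourings after the first collision time
after_collision = k_infinity({(1, 1): Fraction(1)}, 0, 1)

# hypothetical values of u at the two particle positions
u1, u2 = Fraction(1, 3), Fraction(3, 4)
print(mixed_moment(after_collision, (u1, u2)), (u1 + u2) / 2)
```

Both printed values agree, which mirrors the last step of the displayed computation: after the collision the mixed moment collapses to the average of $u$ at the two positions, i.e. to following one of the two motions chosen with probability $\tfrac12$.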


Remark 15.5.5. An alternative derivation of the second part of the proposition would be to recall that the case $\varrho = -1$ and $u + v \equiv 1$ corresponds to the infinite rate limit of the continuous-space stepping stone model. Theorem 4.1 in [46] showed that in the finite-$\gamma$ case there is a similar moment duality, where however the dual is a system of delayed coalescing Brownian motions, in which two motions only coalesce at rate $\gamma$ times their collision local time. Taking $\gamma \to \infty$ in the dual gives the instantaneously coalescing Brownian motions as in part (ii).

15.6 Interface duality in the symbiotic branching model

In the case $\varrho = -1$, duality allows us to explicitly characterise the interface process in the symbiotic branching model, see also Section 15.2 for a similar result in the discrete voter model. The following results generalise [52] to general initial conditions that have infinitely many interface points, and we can also remove the restriction that the initial conditions satisfy $u + v \equiv 1$. Define $\mathcal{U}$ as the space of absolutely continuous measures $(u, v)$ with bounded densities, also denoted by $(u, v)$, such that $u(x)\, v(x) = 0$ and $u(x) + v(x) > 0$ for almost all $x \in \mathbb{R}$. For $(u, v) \in \mathcal{U}$, we define

\[
I(u, v) := \operatorname{supp}(u) \cap \operatorname{supp}(v), \tag{15.6.1}
\]

where $\operatorname{supp}(u)$ denotes the measure-theoretic support. Our next result deals with initial conditions of “single interface point” type.

Theorem 15.6.1. Assume $(u, v) \in \mathcal{U}$ such that $|I(u, v)| = 1$. Let $(u_t, v_t)_{t \ge 0}$ denote the solution of $\mathrm{cSBM}(-1, \infty)_{u,v}$. Then we have, almost surely, $|I(u_t, v_t)| = 1$ for all $t > 0$, and if we denote by $I_t$ the single interface point, then $(I_t)_{t \ge 0}$ is continuous almost surely and there exists a standard Brownian motion $(B_t)_{t \ge 0}$ such that $I_t$ is the unique (in law) weak solution (see footnote 2) of

\[
I_t = I_0 - \int_0^t \frac{w_s'(I_s)}{w_s(I_s)}\, ds + B_t, \qquad t \ge 0, \tag{15.6.2}
\]

where $w_t = S_t(u + v)$, for $(S_t)_{t \ge 0}$ the heat semigroup. Moreover, if $\mathbf{1}_{\{u(x) > 0\}} \to 1$ as $x \to -\infty$, then

\[
u_t(x) = \mathbf{1}_{\{x \le I_t\}}\, w_t(x) \quad \text{and} \quad v_t(x) = \mathbf{1}_{\{x \ge I_t\}}\, w_t(x),
\]

and otherwise the roles of $u_t$ and $v_t$ have to be interchanged.

Footnote 2. Under the assumption $(u, v) \in \mathcal{U}_1$, the integrand on the right-hand side of (15.6.2) is not guaranteed to be Lebesgue-integrable at 0. However, in any case the integral exists as an improper integral $\lim_{\varepsilon \downarrow 0} \int_\varepsilon^t \frac{w_s'(I_s)}{w_s(I_s)}\, ds$.
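To get a feeling for the drift in (15.6.2), one can evaluate it in closed form for a step-like initial condition. The sketch below is an illustration under stated assumptions, not part of the cited results: it takes $u = a\,\mathbf{1}_{\{x \le 0\}}$ and $v = b\,\mathbf{1}_{\{x > 0\}}$ with hypothetical constants $a, b > 0$, and it assumes that the heat semigroup is that of a standard Brownian motion (generator $\tfrac12 \Delta$), so that $w_t(x) = a\,\Phi(-x/\sqrt{t}) + b\,\Phi(x/\sqrt{t})$. It also uses the first-moment identity $\mathbb{E}[u_t(x)] = S_t u(x)$ together with $u_t(x) \in \{0, w_t(x)\}$ to express $\mathbb{P}(I_t > x)$.

```python
import math


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def total_mass(t, x, a, b):
    """w_t(x) = S_t(u + v)(x) for u = a*1{x<=0}, v = b*1{x>0} (assumed initial condition)."""
    s = math.sqrt(t)
    return a * Phi(-x / s) + b * Phi(x / s)


def drift(t, x, a, b):
    """Drift -w_t'(x) / w_t(x) of the interface equation at the point (t, x)."""
    s = math.sqrt(t)
    derivative = (b - a) * phi(x / s) / s  # derivative of w_t in x
    return -derivative / total_mass(t, x, a, b)


def prob_interface_right_of(t, x, a, b):
    """P(I_t > x) = S_t u(x) / w_t(x), using E[u_t(x)] = S_t u(x)."""
    s = math.sqrt(t)
    return a * Phi(-x / s) / total_mass(t, x, a, b)


print(drift(1.0, 0.0, 1.0, 1.0))                    # no drift when u + v is constant
print(drift(1.0, 0.0, 1.0, 3.0))                    # negative: pushed towards the lighter side
print(prob_interface_right_of(1.0, 0.0, 1.0, 3.0))
```

In this example the drift vanishes when $a = b$, which is the voter-model case $u + v \equiv 1$, and it points away from the side carrying more mass when $a \neq b$.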


Figure 15.6.1. An illustration of the colouring $\hat m$ of $[0, \infty) \times \mathbb{R}$ induced by an initial configuration $(u_0, v_0)$ with five interfaces. Type 1 is drawn in white and type 2 is shaded grey. Figure taken from [29].

Proof sketch. Using the duality relation for the mixed moments in part (i) of Proposition 15.5.4, we can show that the number of interfaces cannot increase. The one-dimensional distributions for a single $I_t$ follow from a first moment calculation: since $u_t(x) \in \{0, w_t(x)\}$, we have that $\mathbb{P}(I_t > x) = \mathbb{E}\bigl[\frac{u_t(x)}{w_t(x)}\bigr]$, and it remains to apply duality.

Suppose now that the initial condition $(u, v) \in \mathcal{U}$ is such that $I(u, v)$ has no accumulation points. Define $m(u, v, x)$ to be 1 if $x \in \operatorname{supp}(u)$ and set it to be 2 otherwise. Suppose $(Y_t^{x})_{t \ge 0}$, $x \in I(u, v)$, is a system of stochastic processes with $Y_0^{x} = x$ that move independently according to the stochastic differential equation (15.6.2) until two motions collide, at which point the pair annihilates. Then the paths of the annihilating motions induce a partition of the set $[0, \infty) \times \mathbb{R}$. We can define a “colouring” of the half-plane by defining $\hat m(t, x)$ such that $\hat m(0, x) = m(u, v, x)$ and such that each component of the partition has the same colour in $\{1, 2\}$ (and such that the boundaries are of colour 1), see Figure 15.6.1 for an illustration. Then we can define, with $w_t = S_t(u + v)$,

\[
\hat u_t(x) = w_t(x)\, \mathbf{1}_{\{\hat m(t,x) = 1\}} \quad \text{and} \quad \hat v_t(x) = w_t(x)\, \mathbf{1}_{\{\hat m(t,x) = 2\}}.
\]

Theorem 15.6.2 ([29, Theorem 2.12]). Assume that $(u, v) \in \mathcal{U}$ and $I(u, v)$ has no accumulation points. Let $(u_t, v_t)_{t \ge 0}$ denote the infinite rate limit $\mathrm{cSBM}(-1, \infty)_{u,v}$. Then

\[
(u_t, v_t)_{t \ge 0} \overset{d}{=} (\hat u_t, \hat v_t)_{t \ge 0},
\]

where $(\hat u_t, \hat v_t)_{t \ge 0}$ is defined in terms of annihilating motions with dynamics given by (15.6.2) as above.


Remark 15.6.3. (a) We also have a version of this theorem for arbitrary initial conditions, see [28, Theorem 2.14]. There we show that there is a “coming down from infinity” effect: for any $t > 0$, the set $I(u_t, v_t)$ does not have any accumulation points and the evolution of $(u_s, v_s)_{s \ge t}$ is as above, but started from $(u_t, v_t)$. The notion of “coming down from infinity” originates in coalescent theory, see also the contributions of Blath and Kurt [12], Kersting and Wakolbinger [32], Birkner and Blath [6], as well as Freund [23] in this volume.

(b) In the case $u + v = 1$, there is a further interesting duality to annihilating Brownian motions. For the finite stepping stone model $\mathrm{cSBM}(-1, \gamma)$, [9, Lemma 2.1] implies (in the special case $s = 0$ and there formulated for discrete space) that

\[
\mathbb{E}_u\Bigl[\prod_{i=1}^{n} \bigl(1 - 2u_t(x_i)\bigr)\Bigr] = \mathbb{E}_x\Bigl[\prod_{y \in \mathbf{X}_t} \bigl(1 - 2u(y)\bigr)\Bigr]
\]

for $x = (x_1, \dots, x_n)$, where $(\mathbf{X}_t)_{t \ge 0}$ is a system of delayed annihilating Brownian motions started in $x$. Taking the limit $\gamma \to \infty$ on the left will then lead to instantaneously annihilating Brownian motions on the right.

Note that in the case that $\varrho = -1$ and the initial conditions are such that $u + v \equiv 1$, the duality in Proposition 15.5.4 (ii) is the continuous-space analogue of the discrete-space voter model duality in (15.2.1). Indeed, the result shows that $\mathrm{cSBM}(-1, \infty)$ corresponds to the continuous-space voter model first introduced by [22] and further discussed in [17, 55], where it is referred to as the continuum-sites stepping-stone model.

In analogy with the discrete model from Section 15.2, the continuous-space voter model can also be obtained via a graphical construction. We will give an informal description; the idea goes back to [2], see also [26] for further details in a more general situation. Let $\mathcal{W} = (W^{(t,x)})_{(t,x) \in \mathbb{R}^2}$ be the (time-reversed) Brownian web; informally this is a system of coalescing Brownian motions $W^{(t,x)}$ started at every point $(t, x) \in \mathbb{R}^2$ with $W_0^{(t,x)} = x$, see [43] for technical details. Recall that since $u_t = 1 - v_t$, we only need to construct $u_t$. Fix initial conditions $u$ such that $u(x) \in [0, 1]$ for every $x \in \mathbb{R}$. In order to determine the state of the system $u_t(x)$ at some time $t > 0$ and location $x \in \mathbb{R}$, one traces back the genealogy by following $W^{(t,x)}$ for $t$ time units, then samples a type $\xi_{W_t^{(t,x)}}$ in $\{0, 1\}$ according to a Bernoulli variable with success probability $u(W_t^{(t,x)})$. Finally, set $u_t(x) = \xi_{W_t^{(t,x)}}$. Note that this gives a pathwise construction of the continuous-space voter model that is perfectly analogous to the discrete construction in Section 15.2. For a more careful construction that takes care of the non-trivial difficulties we refer to [26]. In particular, one can easily deduce the moment duality (15.5.4), since for $x = (x_1, \dots, x_n)$,

\[
\prod_{i=1}^{n} u_t(x_i) = \prod_{i=1}^{n} \xi_{W_t^{(t,x_i)}}.
\]


Then taking expectations and carrying out the expectation over the Bernoulli variables gives

\[
\mathbb{E}_u\Bigl[\prod_{i=1}^{n} u_t(x_i)\Bigr] = \mathbb{E}\Bigl[\prod_{y \in \{W_t^{(t,x_i)} :\, 1 \le i \le n\}} u(y)\Bigr],
\]

and the system $W_t^{(t,x_i)}$, $i = 1, \dots, n$, is in law equivalent to the system of coalescing Brownian motions started in $x$. As in the discrete case, the same construction can also be used to see that the interfaces in the continuous-space voter model are given by a system of annihilating Brownian motions, and thus yields an alternative proof of Theorem 15.6.2, see [29] for details.
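The graphical construction can be imitated in a discrete toy setting. The following sketch is only an analogue under simplifying assumptions, not the Brownian web itself: the backward lineages are coalescing simple random walks on $\mathbb{Z}$ in discrete time, the starting sites are assumed to have the same parity so that lineages meet instead of jumping over each other, and all function names are hypothetical. Lineages that have merged share one Bernoulli type, and the two sides of the moment duality are estimated from the same runs.

```python
import random


def coalescing_walk_ends(starts, steps, rng):
    """Follow coalescing simple random walks; walkers on the same site move together."""
    positions = list(starts)
    for _ in range(steps):
        step = {site: rng.choice((-1, 1)) for site in sorted(set(positions))}
        positions = [p + step[p] for p in positions]
    return positions


def sample_types(starts, steps, u, rng):
    """Trace lineages back, then sample one Bernoulli(u(site)) type per ancestral site."""
    ends = coalescing_walk_ends(starts, steps, rng)
    type_at = {site: int(rng.random() < u(site)) for site in sorted(set(ends))}
    return [type_at[e] for e in ends], ends


def estimate_duality(starts, steps, u, runs, seed=0):
    """Monte Carlo estimates of E[prod_i type_i] and E[prod over distinct ancestors of u]."""
    rng = random.Random(seed)
    lhs = rhs = 0.0
    for _ in range(runs):
        types, ends = sample_types(starts, steps, u, rng)
        product_of_types = 1
        for xi in types:
            product_of_types *= xi
        product_of_u = 1.0
        for site in set(ends):
            product_of_u *= u(site)
        lhs += product_of_types
        rhs += product_of_u
    return lhs / runs, rhs / runs


lhs, rhs = estimate_duality((0, 2), 50, lambda site: 0.5, 5000)
print(lhs, rhs)
```

The two estimates differ only by Monte Carlo error, since conditionally on the lineages the product of the sampled types has expectation equal to the product of $u$ over the distinct ancestral sites.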

15.7 Entrance laws for annihilating Brownian motions

In the last section we have seen that we can use the moment duality for the symbiotic branching model with $\varrho = -1$ to show that the interfaces between different types behave like (instantaneously) annihilating Brownian motions (aBMs). It turns out that this observation can be used to gain insight into the behaviour of aBMs themselves. This section is based on [29].

The following is a well-known issue that arises when trying to construct aBMs: As long as the set of starting points is finite, it is straightforward to construct a system of aBMs. Even if the initial points form a locally finite (equivalently, discrete and closed) subset of the real line, it is possible to construct the corresponding system, even though a little care is needed, see [53, Section 4.1] or [29, Appendix A.1]. However, if one considers a sequence of locally finite sets of starting points that become dense in the real line, it is not at all clear whether the corresponding systems of aBMs converge in some suitable sense.

Mathematically, this question is closely related to characterising entrance laws for aBMs. Recall that a family $\mu = (\mu_t)_{t > 0}$ of probability measures on (the Borel $\sigma$-algebra of) a suitable state space $\mathcal{D}$ is called a probability entrance law for a semigroup $(P_t)_{t \ge 0}$ if

\[
\mu_s P_{t-s} = \mu_t \quad \text{for all } 0 < s < t.
\]

See e.g. [39, Appendix A.5] or [45] for the general theory of entrance laws. Roughly speaking, an entrance law corresponds to a Markov process $(X_t)_{t > 0}$ with time-parameter set $(0, \infty)$, whose one-dimensional distributions are given by $\mu_t$, but where we do not specify the initial condition.

One approach to finding entrance laws for aBMs, carried out in [53], is to use a thinning relation between annihilating and coalescing Brownian motions (cBMs). Unlike for aBMs, one can always add coalescing Brownian motions to an existing system in a consistent way. In particular, if the starting points become dense everywhere on the real line, this leads to a unique entrance law for cBMs, also known as Arratia’s flow [1].


Using the thinning relation, see [53, Section 2.1], this leads also to an entrance law for aBMs (termed “maximal” in [53]). However, it is also clear that the consistency of the cBMs described above does not hold for aBMs, and different ways of taking initial conditions that become denser may lead to different systems of aBMs. Our main result in [29] is to classify all possible entrance laws for systems of aBMs using the connection to $\mathrm{cSBM}(-1, \infty)$ described in Section 15.6.

Before we can state this correspondence, we need to set up a bit of notation. We expect that even if we start with a dense set of points on the line, then at any time $t > 0$ many nearby Brownian motions would have annihilated, and so the positions of the remaining Brownian motions would form a discrete (and closed) subset of $\mathbb{R}$. Hence, a suitable state space for the evolution of aBMs is given by

\[
\mathcal{D} := \bigl\{ \mathbf{x} \subset \mathbb{R} : \mathbf{x} \text{ is discrete and closed} \bigr\}.
\]

For each $\mathbf{x} \in \mathcal{D}$ a system of aBMs starting from $\mathbf{x}$ can be constructed as a (strong) Markov process $\mathbf{X}^{\mathbf{x}} = (\mathbf{X}_t^{\mathbf{x}})_{t \ge 0}$ taking values in $\mathcal{D}$.

The main idea is to use the fact that any system of aBMs started in $\mathcal{D}$ corresponds to the interfaces in the symbiotic branching model with $\varrho = -1$ and the right initial conditions. So let

\[
\mathcal{M}_1 := \bigl\{ u(x)\, dx \;\big|\; u \colon \mathbb{R} \to [0, 1] \text{ measurable} \bigr\}
\]

denote the space of all absolutely continuous measures on $\mathbb{R}$ with densities taking values in $[0, 1]$. Here, we recall that for $\varrho = -1$ and initial conditions $u + v \equiv 1$, we only need to describe the evolution of one type in order to specify $\mathrm{cSBM}(-1, \infty)_{u,v}$, and so choosing $u \in \mathcal{M}_1$ fixes the initial conditions.

To go from the measure-valued process to the aBMs, we need to define a mapping that associates to each $u \in \mathcal{M}_1$ its interface. So we think of $u(x)$ as representing the proportion of type 1 particles at $x$ and correspondingly of $1 - u(x)$ as the proportion of type 2 particles. Thus, in agreement with (15.6.1) we define for $u \in \mathcal{M}_1$,

\[
I(u) := I(u, 1 - u) = \operatorname{supp}(u) \cap \operatorname{supp}(1 - u), \tag{15.7.1}
\]

as the interface points between the two types, where we recall that $\operatorname{supp}(u)$ denotes the measure-theoretic support of $u$. By Theorem 15.6.2 any initial condition $u$ with interface $I(u) \in \mathcal{D}$ corresponds to a system of aBMs started in the points $I(u)$. Conversely, the interfaces only describe the population model up to interchanging the types. Therefore, we define an equivalence relation $\sim$ on $\mathcal{M}_1$ by identifying two measures with densities $u$ and $v$ if and only if either almost everywhere $u = v$ or almost everywhere $u = 1 - v$. Then we work with the quotient space $\mathcal{V} := \mathcal{M}_1 / {\sim}$. We write $\mathbf{v} = [u] = \{u, 1 - u\}$ for elements of $\mathcal{V}$. Now, using that on the level of measure-valued processes it is possible to start in any measure $u \in \mathcal{M}_1$, we have the following result.


Theorem 15.7.1 ([29]). There is a suitable topology on $\mathcal{D}$ such that there is a bijective correspondence between probability entrance laws $\mu = (\mu_t)_{t > 0}$ for the semigroup $(P_t)_{t \ge 0}$ of aBMs on $\mathcal{D}$ and probability measures $\nu$ on $\mathcal{V}$.

Informally, the measure $\nu$ (up to fixing types) will correspond to the law of the initial condition of the measure-valued process, whose interfaces then correspond to the system of aBMs.

The right choice of topology is essential for the above theorem to work. The obvious choice of identifying points in $\mathcal{D}$ with locally finite point measures and then relying on vague convergence is not a good choice. One reason is that in this topology we only get càdlàg, but not continuous, paths for the process $\mathbf{X}$. Instead, we construct a weaker topology on $\mathcal{D}$, which is derived from the topology on the level of the measure-valued process. We start by equipping $\mathcal{M}_1$ with the vague topology; this turns $\mathcal{M}_1$ into a compact space, see e.g. [29, Appendix A.2]. An important role is played by the subspace of all $u \in \mathcal{M}_1$ with discrete interface,

\[
\mathcal{M}_1^{d} := \bigl\{ u \in \mathcal{M}_1 : I(u) \in \mathcal{D} \bigr\},
\]

which is dense in $\mathcal{M}_1$, see [29, Appendix A.2]. Note that for each $u \in \mathcal{M}_1^{d}$, we may choose a version of its density that is locally constant on the complement of $I(u)$. Again we need to take into consideration that $u$ and $1 - u$ give rise to the same interface and so set $\mathcal{V}^{d} := \mathcal{M}_1^{d} / {\sim}$, where $\sim$ is as before. On this quotient space the mapping $I$ defined in (15.7.1) induces a mapping $I \colon \mathcal{V}^{d} \to \mathcal{D}$, which is well-defined and bijective. In particular, it induces a topology on $\mathcal{D}$, generated by the system $\{ I(U) : U \subset \mathcal{V}^{d} \text{ open} \}$, which is by definition the coarsest topology on $\mathcal{D}$ with respect to which $I^{-1} \colon \mathcal{D} \to \mathcal{V}^{d}$ is continuous. This topology is the one referred to in Theorem 15.7.1.

Our next theorem makes the connection to the infinite-rate symbiotic branching model more explicit and also clarifies what happens when taking a sequence of initial conditions for aBMs that become dense on the real line.

Theorem 15.7.2 ([29]). Let $\mathcal{D}$ be endowed with the above topology and let $(\mu^{(n)})_{n \in \mathbb{N}}$ be a sequence of probability measures on $\mathcal{D}$. Consider the corresponding sequence of aBM processes $(\mathbf{X}_t^{(n)})_{t \ge 0}$ started according to the (random) initial condition $\mu^{(n)}$. Then $(\mathbf{X}_t^{(n)})_{t > 0}$ converges in distribution in $C_{(0,\infty)}(\mathcal{D})$ if and only if the sequence $(\mu^{(n)} \circ I)_{n \in \mathbb{N}}$ of probability measures on $\mathcal{V}^{d}$ converges weakly to some probability measure $\nu$ on $\mathcal{V}$, in which case, as $n \to \infty$,

\[
(\mathbf{X}_t^{(n)})_{t > 0} \xrightarrow{\;d\;} \bigl(I(V_t)\bigr)_{t > 0} \quad \text{on } C_{(0,\infty)}(\mathcal{D}).
\]

Here, $V_t = [u_t] \in \mathcal{V}$, where $(u_t, 1 - u_t)_{t \ge 0}$ is a solution of $\mathrm{cSBM}(-1, \infty)_{u, 1-u}$ and the (random) initial conditions $u$ are chosen by first choosing $V \in \mathcal{V}$ according to $\nu$ and then taking $u$ such that $[u] = V$. Moreover, $I(V_t) = I(u_t)$ as defined in (15.7.1).

We will illustrate the power of the correspondence between aBMs and the population model by looking at the following examples.


Example 15.7.3.
(i) If we take $\mathbf{x}_n = \frac{1}{n}\mathbb{Z}$, then $I^{-1}(\mathbf{x}_n)$ converges to $[\frac12]$ in $\mathcal{V}$. Hence by Theorem 15.7.2 the system of aBMs starting from $\mathbf{x}_n$ converges.
(ii) Suppose that $\mathbf{x}_n$ is distributed according to a Poisson point process on $\mathbb{R}$ with intensity $n$; then $I^{-1}(\mathbf{x}_n)$ also converges to $[\frac12]$.
(iii) Now, consider $\mathbf{x}_n = \frac{1}{n}\mathbb{Z} + \{0, \frac{1}{n^2}\}$. Then $I^{-1}(\mathbf{x}_n)$ still converges in $\mathcal{V}$, however now the limit is $[0]$, so the limiting system of aBMs corresponds to the empty system. Intuitively, this corresponds to the case where each pair of nearby aBMs starts so close to each other that cancellations arise so early that they do not survive in the limit.
(iv) Finally, we consider $\mathbf{x}_n = \frac{1}{n}\mathbb{Z} + \{0, \frac{1}{4n}\}$, where now $I^{-1}(\mathbf{x}_n)$ converges to $[\frac14]$ in $\mathcal{V}$, which is different from $[\frac12]$. So the examples in (i), (ii) and (iv) converge, but in the latter example the aBMs “come down from infinity” in a different way, thus leading to a different entrance law.
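The limits in the deterministic examples (i), (iii) and (iv) can be made plausible with a small exact computation. The sketch below is a crude proxy for vague convergence rather than a proof: it colours the window $[0, 1)$ by letting the type flip at every interface point and returns the fraction of the window carrying type 1. The helper names and the choice of window are assumptions made for illustration; recall that $[u]$ identifies $u$ with $1 - u$, so a fraction of $\tfrac14$ and of $\tfrac34$ describe the same class.

```python
from fractions import Fraction


def type1_fraction(points, left=Fraction(0), right=Fraction(1)):
    """Fraction of [left, right) coloured type 1 when the colour flips at every point.

    The colour is taken to be type 2 just before `left` (an arbitrary convention).
    """
    colour, position, mass = 2, left, Fraction(0)
    for p in sorted(points):
        if not (left <= p < right):
            continue
        if colour == 1:
            mass += p - position  # close the type-1 interval ending at p
        colour, position = 3 - colour, p
    if colour == 1:
        mass += right - position
    return mass / (right - left)


def lattice(n):
    """Points of (1/n)Z inside [0, 1)."""
    return [Fraction(k, n) for k in range(n)]


def lattice_pairs(n, offset):
    """Points of (1/n)Z + {0, offset} inside [0, 1), for 0 < offset < 1/n."""
    return sorted(lattice(n) + [Fraction(k, n) + offset for k in range(n)])


n = 100
print(type1_fraction(lattice(n)))                              # example (i)
print(type1_fraction(lattice_pairs(n, Fraction(1, n * n))))    # example (iii)
print(type1_fraction(lattice_pairs(n, Fraction(1, 4 * n))))    # example (iv)
```

For even $n$ the three window averages are exactly $\tfrac12$, $\tfrac1n$ and $\tfrac14$, in line with the limits $[\tfrac12]$, $[0]$ and $[\tfrac14]$ stated in the example.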

Note that the approximations in (i) and (ii) give the “maximal” entrance law considered in [53]. See Figure 15.7.1 for illustrations of the four examples. We also observe that it is possible to construct a sequence of initial conditions $\mathbf{x}_n$ in $\mathcal{D}$ such that $I^{-1}(\mathbf{x}_n)$ does not converge to any measure, e.g. by starting with finitely many points and then adding one point at a time such that an accumulation point arises. The corresponding system of aBMs then starts alternately with an odd and an even number of motions, so it cannot converge: in one case the system will die out, while in the other case one motion survives. See [29] for more details.

Finally, the construction also gives insight into $n$-point densities for aBMs, which for an entrance law $\mu = (\mu_t)_{t > 0}$ are defined as

\[
p_\mu(t, x) := \lim_{\varepsilon \to 0} \frac{1}{(2\varepsilon)^n}\, \mathbb{P}_\mu\Bigl( \bigcap_{i=1}^{n} \bigl\{ \mathbf{X}_t \cap [x_i - \varepsilon, x_i + \varepsilon] \neq \emptyset \bigr\} \Bigr),
\]

for $x = (x_1, \dots, x_n)$ with $x_1 < \dots < x_n$ and $t > 0$. The relation to $\mathrm{cSBM}(-1, \infty)$ (or the continuous-space voter model) and the construction in terms of the Brownian web discussed above give us a way to calculate the one-point densities explicitly as well as to simplify the expression for general $n$-point densities. See [29] for details.

15.8 Outlook

We have seen that the symbiotic branching model has a very rich duality theory which allows us to carry out a fairly detailed analysis of this system of SPDEs. One of the main open problems is to characterise the $\gamma = \infty$ limit more explicitly for $\varrho \in (-1, 0)$ and possibly to extend the results to non-negative $\varrho$. Another challenging direction would be to consider models with large range migration, where the Laplace term in (15.3.1) is replaced by e.g. a fractional Laplacian.

Figure 15.7.1. Simulations of aBMs as interface process with discrete starting configurations on a torus $T$, taken from [29]. Panels (time $t$ against position on $T$): (a) $\mathbf{x}_n^{(1)} = \frac{1}{n}\mathbb{Z}$; (b) $\mathbf{x}_n^{(2)} \sim \mathrm{PPP}(n)$; (c) $\mathbf{x}_n^{(3)} = \frac{1}{n}\mathbb{Z} + \{0, \frac{1}{n^2}\}$; (d) $\mathbf{x}_n^{(4)} = \frac{1}{n}\mathbb{Z} + \{0, \frac{1}{4n}\}$.

While some of the duality theory still works, it is unclear what the scaling limit would be even in the stepping stone case $\varrho = -1$. Finally, it would be very interesting to see how far our techniques of characterising the limiting process without specifying the correlation have applications in other spatial systems where types naturally separate, such as some of the limiting objects found in [5] or the two-dimensional version of the mutually catalytic branching model [15].

Acknowledgements. We are indebted to our collaborators Matthias Hammer and Florian Völlering. We would also like to thank the two referees and Matthias Hammer for carefully reading our manuscript and for their helpful comments.

References

[1] R. Arratia, Coalescing Brownian Motions on the Line, Ph.D. Thesis, University of Wisconsin-Madison, Madison, 1979.
[2] S. Athreya and R. Sun, One-dimensional voter model interface revisited, Electron. Commun. Probab. 16 (2011), 792–800.
[3] E. Baake and M. Baake, Ancestral lines under recombination, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 365–382.


[4] N. H. Barton, A. M. Etheridge, and A. Véber, A new model for evolution in a spatial continuum, Electron. J. Probab. 15 (2010), 162–216.
[5] N. Berestycki, A. M. Etheridge, and A. Véber, Large scale behaviour of the spatial Λ-Fleming–Viot process, Ann. Inst. Henri Poincaré Probab. Stat. 49 (2013), 374–401.
[6] M. Birkner and J. Blath, Genealogies and inference for populations with highly skewed offspring distributions, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 151–177.
[7] M. Birkner and N. Gantert, Ancestral lineages in spatial population models with local regulation, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 291–310.
[8] J. Blath, L. Döring, and A. Etheridge, On the moments and the interface of the symbiotic branching model, Ann. Probab. 39 (2011), 252–290.
[9] J. Blath, A. Etheridge, and M. Meredith, Coexistence in locally regulated competing populations and survival of branching annihilating random walk, Ann. Appl. Probab. 17 (2007), 1474–1507.
[10] J. Blath, M. Hammer, and M. Ortgiese, The scaling limit of the interface of the continuous-space symbiotic branching model, Ann. Probab. 44 (2016), 807–866.
[11] J. Blath and N. Kurt, Survival and extinction of caring double-branching annihilating random walk, Electron. Commun. Probab. 16 (2011), 271–282.
[12] J. Blath and N. Kurt, Population genetic models of dormancy, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 247–265.
[13] B. M. Bolker and S. W. Pacala, Spatial moment equations for plant competition: Understanding spatial strategies and the advantages of short dispersal, Am. Nat. 153 (1999), 575–602.
[14] G. Carinci, C. Giardinà, C. Giberti, and F. Redig, Dualities in population genetics: A fresh look with new dualities, Stochastic Process. Appl. 125 (2015), 941–969.
[15] D. A. Dawson, A. M. Etheridge, K. Fleischmann, L. Mytnik, E. A. Perkins, and J. Xiong, Mutually catalytic branching in the plane: Finite measure states, Ann. Probab. 30 (2002), 1681–1762.
[16] D. A. Dawson and E. A. Perkins, Long-time behavior and coexistence in a mutually catalytic branching model, Ann. Probab. 26 (1998), 1088–1138.
[17] P. Donnelly, S. N. Evans, K. Fleischmann, T. G. Kurtz, and X. Zhou, Continuum-sites stepping stone models, coalescing exchangeable partitions and random trees, Ann. Probab. 28 (2000), 1063–1110.
[18] L. Döring and L. Mytnik, Mutually catalytic branching processes and voter processes with strength of opinion, ALEA Lat. Am. J. Probab. Math. Stat. 9 (2012), 1–51.
[19] L. Döring and L. Mytnik, Longtime behavior for mutually catalytic branching with negative correlations, in: Advances in Superprocesses and Nonlinear PDEs (eds. J. Englander and B. Rider), Springer, Boston (2013), 93–111.
[20] R. Durrett, Probability Models for DNA Sequence Evolution, 2nd ed., Springer, New York, 2008.


[21] A. M. Etheridge and K. Fleischmann, Compact interface property for symbiotic branching, Stochastic Process. Appl. 114 (2004), 127–160.
[22] S. N. Evans, Coalescing Markov labelled partitions and a continuous sites genetics model with infinitely many types, Ann. Inst. Henri Poincaré Probab. Stat. 33 (1997), 339–358.
[23] F. Freund, Multiple-merger genealogies: Models, consequences, inference, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 179–202.
[24] C. Giardinà, J. Kurchan, F. Redig, and K. Vafayi, Duality and hidden symmetries in interacting particle systems, J. Stat. Phys. 135 (2009), 25–55.
[25] A. Greven and F. den Hollander, From high to low volatility: spatial Cannings with block resampling and spatial Fleming–Viot with seed-bank, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 267–289.
[26] A. Greven, R. Sun, and A. Winter, Continuum space limit of the genealogies of interacting Fleming–Viot processes on Z, Electron. J. Probab. 21 (2016), 1–64.
[27] M. Hammer and M. Ortgiese, The infinite rate symbiotic branching model: From discrete to continuous space, preprint 2015, https://arxiv.org/abs/1508.07826.
[28] M. Hammer, M. Ortgiese, and F. Völlering, A new look at duality for the symbiotic branching model, Ann. Probab. 46 (2018), 2800–2862.
[29] M. Hammer, M. Ortgiese, and F. Völlering, Entrance laws for annihilating Brownian motions and the continuous-space voter model, Stochastic Process. Appl. 134 (2021), 240–264.
[30] T. E. Harris, Additive set-valued Markov processes and graphical methods, Ann. Probab. 6 (1978), 355–378.
[31] S. Jansen and N. Kurt, On the notion(s) of duality for Markov processes, Probab. Surv. 11 (2014), 59–120.
[32] G. Kersting and A. Wakolbinger, Probabilistic aspects of Λ-coalescents in equilibrium and in evolution, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 223–245.
[33] M. Kimura, “Stepping stone” model of population, Ann. Rept. Nat. Inst. Genetics Japan 3 (1953), 62–63.
[34] A. Klenke and L. Mytnik, Infinite rate mutually catalytic branching, Ann. Probab. 38 (2010), 1690–1716.
[35] A. Klenke and L. Mytnik, Infinite rate mutually catalytic branching in infinitely many colonies: Construction, characterization and convergence, Probab. Theory Related Fields 154 (2012), 533–584.
[36] A. Klenke and L. Mytnik, Infinite rate mutually catalytic branching in infinitely many colonies: The longtime behavior, Ann. Probab. 40 (2012), 103–129.
[37] A. Klenke and M. Oeler, A Trotter-type approach to infinite rate mutually catalytic branching, Ann. Probab. 38 (2010), 479–497.
[38] W. König, Branching random walks in random environment, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 23–41.
[39] Z. Li, Measure-valued Branching Markov Processes, Springer, Berlin, 2011.


[40] T. M. Liggett, Interacting Particle Systems, Springer, New York, 1985.
[41] C. Mueller, On the support of solutions to the heat equation with noise, Stochastics Stochastics Rep. 37 (1991), 225–245.
[42] L. Mytnik, Uniqueness for a mutually catalytic branching model, Probab. Theory Related Fields 112 (1998), 245–253.
[43] E. Schertzer, R. Sun, and J. M. Swart, The Brownian web, the Brownian net, and their universality, in: Advances in Disordered Systems, Random Processes and Some Applications (eds. P. Contucci and C. Giardinà), Cambridge University Press, Cambridge (2017), 270–368.
[44] D. Schwartz, On hitting probabilities for an annihilating particle model, Ann. Probab. 6 (1978), 398–403.
[45] M. Sharpe, General Theory of Markov Processes, Academic Press, San Diego, 1988.
[46] T. Shiga, Stepping stone models in population genetics and population dynamics, in: Stochastic Processes in Physics and Engineering (Bielefeld 1986; eds. S. Albeverio, P. Blanchard, M. Hazewinkel, and W. Streit), Reidel, Dordrecht (1988), 345–355.
[47] A. Sturm and J. M. Swart, Voter models with heterozygosity selection, Ann. Appl. Probab. 18 (2008), 59–99.
[48] A. Sturm and J. M. Swart, Pathwise duals of monotone and additive Markov processes, J. Theoret. Probab. 31 (2018), 932–983.
[49] A. Sturm, J. M. Swart, and F. Völlering, The algebraic approach to duality: An introduction, in: Genealogies of Interacting Particle Systems (eds. M. Birkner, R. Sun, and J. M. Swart), Lect. Notes Ser. 38, World Scientific, Singapore (2020), 81–150.
[50] A. Sudbury, The branching annihilating process: An interacting particle system, Ann. Probab. 18 (1990), 581–601.
[51] J. M. Swart, A course in interacting particle systems, preprint 2017, https://arxiv.org/abs/1703.10007.
[52] R. Tribe, Large time behavior of interface solutions to the heat equation with Fisher–Wright white noise, Probab. Theory Related Fields 102 (1995), 289–311.
[53] R. Tribe and O. Zaboronski, Pfaffian formulae for one dimensional coalescing and annihilating systems, Electron. J. Probab. 16 (2011), 2080–2103.
[54] I. Zähle, J. T. Cox, and R. Durrett, The stepping stone model. II. Genealogies and the infinite sites model, Ann. Appl. Probab. 15 (2005), 671–699.
[55] X. Zhou, Clustering behavior of a continuous-sites stepping-stone model with Brownian migration, Electron. J. Probab. 8 (2003), 1–15.

Chapter 16

Multitype branching models with state-dependent mutation and competition in the context of phylodynamic patterns

Anja Sturm and Anita Winter

In this article we propose a type-dependent branching model with mutation and competition, which can be used to model phylodynamic patterns of a virus population. For any two virus particles, the competition kernel depends on the particles’ types and the total mass of the population. We introduce our individual-based model as a measure-valued process and discuss possible scaling regimes. We then model the evolving phylogenies as stochastic processes with values in the space of marked metric measure spaces. For that we rely on the genetic information given through the number of nucleotide substitutions separating the virus particles. Finally, we construct a two-level branching model that describes branching with competition within splitting cells. In all cases the large population limit of these models solves a martingale problem. Uniqueness of solutions is verified for the first and third model. The techniques for showing uniqueness include novel formulations of Girsanov’s theorem in the presence of jumps as well as a novel duality relation in the two-level model. In the second model uniqueness is open but conjectured to hold under restricted conditions on the branching and competition rates. For this model the proof of existence of the limit relies on a new method of showing the compact containment condition for stochastic processes with values in metric measure spaces that are a priori not ultra-metric.

16.1 Introduction and motivation

In this paper we model the dynamics of a population that evolves as a continuous time branching process with a trait structure and interactions in form of mutations and competition between individuals. We generalise existing microscopic models in several directions that allow us to capture evolutionary and phylogenetic patterns in rapidly evolving microorganisms such as viral and bacterial populations. We further investigate tractable large population approximations.

For example, many RNA viruses replicate very fast. At the same time there is a lack of a proofreading mechanism in the virus’ RNA polymerase. As a consequence mutations arise frequently. The high mutation and replication rates cause viral variability that gives rise to different evolutionary and phylodynamic patterns of the viruses (or bacteria) within one host and within the host population.

Figure 16.1.1. The range of phylogeny patterns (panels (a), (b) and (c); horizontal axis: time).

Such patterns range, for example, at any given time, (a) from just a few traits and one dominating trait persisting a long time (for instance, in influenza on the population level or HIV on the intra-host level), (b) via a bounded number of (sero-)types (for instance, in dengue on the population level) or a significant number of traits (for instance, in measles on the population level), (c) to many traits but none of them lasting very long (for instance, HIV and HCV on the population level). Compare Figure 16.1.1 or the excellent survey papers [26, 46]. We are interested in capturing those patterns that are primarily shaped by natural selection arising from cross-immunity, that is the differential effect of immune responses on genetically variable strains. For that purpose we consider an individual based multi-type branching model, in which virus particles branch with a natural branch rate depending on their trait. Note that viruses can only be reproduced by infected cells. A branching event for the virus particles thus corresponds to the release of the newborn virus particles from an infected cell. In our model, large numbers of newborn virus particles may be produced at branching events. These newborn virus particles are clones of their parent unless mutation occurs at birth. Moreover, the evolution of the virus population is affected by the immune system, which reacts to the presence of particular virus types that it has already recognised. This leads to an increased death rate of a prevalent subpopulation, which we model by a competition. The competition kernel may depend on the current population size and on the traits of the competing virus particles. Motivated by [8, 18] such continuous time pure birth-death processes with mutation have been rigorously developed as mathematical models, see for example [10, 24, 38, 49, 50] and in particular the review article [9] in this volume. We here present the following generalisations. 
First we allow individuals to have multiple offspring at a reproduction event. Such population models with larger offspring numbers (highly skewed) are also the subject of [7, 22, 27, 39] in this volume.

Multitype branching models for phylodynamics


In our model the number of children produced by each individual depends on its trait and may also depend on the new trait that appears in case a mutation occurs. Such a dependence may occur in a variety of settings. In the context of viral populations or of other micro-parasites with fast adaptation (compare Figure 16.1.1 (a)) the interpretation is that a mutant type – sufficiently different from the parent type say – may quickly (on a much shorter time scale) establish a sizeable subpopulation, whose size could also depend on the intrinsic fitness of their trait type, before they are targeted by the immune system (“immune escape”). This is in particular realistic if a mutation to an epitope site results in a completely new phenotype of the virus’ antibody-binding sites, see [56] for a discussion. Another scenario is modelling so-called “jackpot” events, introduced in a seminal paper of [48], in which particular mutants rapidly create a sizeable mutant subpopulation – in the famous original Luria and Delbrück experiment because they are more resistant to detrimental effects of the environment. In our model, this mutant subpopulation is created instantaneously and we refer to them simply as mutant offspring. Second, as the virus patterns shape their phylogenies we propose a parametric model for the evolving phylogenies of individual based multi-type branching models. A phylogenetic tree, or short phylogeny, describes the history of the evolution of a species in terms of lines of descent. As virus populations evolve very fast, phylogenies are in fact genealogical trees. In accordance with the biological literature, we here use the term phylogenies rather than genealogies. In practice phylogenies are reconstructed from the number of nucleotide substitutions (compare [2]). 
Our model allows for competition kernels that may depend on the current population size, on the types of the competing virus particles as well as their mutual distance (given by counting the number of substitutions). In this manner the response of the immune system such as cross-immunity is implicitly taken into account. When mutation occurs, the parent particle gives birth to a mutant whose distances to all other virus particles (including its own parent) is set to be “one unit” longer than the corresponding distances from its parent. A similar multi-type branching model describing the evolving empirical trait distributions was introduced in [14]. In that paper the authors even allow for natural branching rates that may also depend on the population size. However, their model is not suitable for modelling cross-immunity as their competition kernel does not take into account any information on the traits’ history. Including historical information without losing the Markov property requires to leave the measure-valued set-up and to work with more enriched state spaces. This issue is resolved in [49] by working with historical processes. Historical processes have been established earlier in [15, 28] for structured neutral populations. As Markov processes with values in measures on path-space are notationally far too involved to allow for explicit calculations, we rather encode our multi-type phylogenies as marked metric measure spaces, and restrict ourselves to a dependence of the traits’ history only through their phylogenetic distances. This is also closer to applications where the raw data consists of samples of gene sequences whose mutational differences provide a measure of distance.


The space of marked metric measure spaces equipped with the marked Gromov weak topology has been introduced in [16,54]. It relies on the idea of encoding “spatial trees” as “tree-like” metric spaces that are equipped with a sampling measure, while in addition each point in the tree is assigned a distribution on type-space. It extends the space of (non-spatial) metric measure spaces and the Gromov-weak topology developed in [29] (compare also [25, 47]). So far only a few examples of dynamics with values in metric measure spaces can be found in the literature. The first paper that considers dynamics with values in metric (probability) measure spaces is [30], which constructs the evolving genealogies for models with (neutral) resampling dynamics. These were extended to type and state dependent resampling dynamics (including the Ohta–Kimura model) in [54] and to resampling dynamics with selection in [17]. Note that the dynamics introduced in the latter paper agrees with our dynamics in the case of constant natural branching rates and conditioned on the total mass to be constant. Evolving genealogies of spatially structured resampling populations and their continuum space limits are constructed in [32]. Third, we propose a two-level branching model with competition for virus populations under cell division following an approach suggested in [40]. This is an example for a host-parasite system, which are also the subject of [55] included in this volume. Also, a host-parasite system is discussed in Section 1.3 of [3] in this volume, where the focus is on the diversity of virus populations within human hosts. In the setting considered here the virus evolves within the host cells. These infected cells either die or split into two cells while dividing the mother cell’s virus population randomly. By this splitting a new infected cell is born. 
In Kimmel’s model introduced in [40] all infected cells and all virus particles are indistinguishable and splitting happens at a constant rate. It has been verified in experiments that a very contaminated cell often gives birth to a very contaminated cell which dies fast and a much less contaminated cell which might survive. This observation was further studied in [4] who considered a discrete time version of Kimmel’s model in the case of imbalanced splitting and derived criteria for recovery of the organism from infected cells in terms of the means of offspring in the two newborn cells. Taking into account that life spans of infected cells are much longer than that of their virus populations, [5] considers a continuous time model, in which the virus populations evolve according to a supercritical continuous state branching diffusion. We extend their model and describe cells carrying virus populations that evolve as branching models with competition, while the cells split according to a Yule process.

16.2 The individual-based branching model

In this section we introduce the individual-based model in detail. Let $\mathcal{X}$ be a locally compact and Polish trait space. We consider a parameter $K \in \mathbb{N}$ that scales the resources or area available. It is thus directly connected to the carrying capacity and is often also referred to as the “system size” [51] as it is linked to the size of the population.

As usual, given a topological space $\mathcal{X}$ we denote by $\mathcal{M}_1(\mathcal{X})$ and $\mathcal{M}_f(\mathcal{X})$ the spaces of probability measures and finite measures on $\mathcal{X}$, respectively, defined on the Borel $\sigma$-algebra $\mathcal{B}(\mathcal{X})$ of $\mathcal{X}$, and by $\Rightarrow$ weak convergence in $\mathcal{M}_1(\mathcal{X})$ and $\mathcal{M}_f(\mathcal{X})$. Moreover, for any two topological spaces $V$ and $V'$ we denote by $M(V,V')$, $M_b(V,V')$ and $C_b(V,V')$ the measurable functions from $V$ to $V'$, those that are in addition bounded, and those that are bounded and continuous, respectively. If $V' = \mathbb{R}$, we drop the range and simply write $M(V)$, $M_b(V)$ and $C_b(V)$. We also write $M^+(V)$, $M_b^+(V)$ and $C_b^+(V)$ for subsets of non-negative $\mathbb{R}$-valued functions that are uniformly bounded away from zero.

The state-dependent multi-trait branching particle model with mutation and competition takes values in the space
\[
\tfrac{1}{K}\mathcal{N}_f(\mathcal{X}) := \bigl\{ \nu \in \mathcal{M}_f(\mathcal{X}) : K\nu(A) \in \mathbb{N}_0 \text{ for all } A \in \mathcal{B}(\mathcal{X}) \bigr\}
\]
of finite measures that take values in non-negative multiples of $\frac{1}{K}$, equipped with the weak topology ($\mathcal{N}_f$ itself denotes the atomic measures). Given a state of the form
\[
\nu = \frac{1}{K}\sum_{i=1}^{N} \delta_{\kappa_i} \in \tfrac{1}{K}\mathcal{N}_f(\mathcal{X}), \tag{16.2.1}
\]
we have that $N$ is the number of individuals in the population and that $\frac{1}{K}\delta_{\kappa_i}$ is the contribution of an individual of type $\kappa_i \in \mathcal{X}$ to the population with individual mass $\frac{1}{K}$. Thus, we have the total mass given by $m_\nu = \frac{N}{K}$. We fix the following.

• $p$ in $M_b(\mathcal{X},[0,1])$, which we refer to as the mutation probability.

• A stochastic kernel $m_K$ from $\mathcal{X}$ to $\mathcal{X}$, which we refer to as the mutation kernel. We exclude the possibility of trivial mutation by assuming that for every $\kappa \in \mathcal{X}$,
\[
m_K(\kappa,\{\kappa\}) = 0. \tag{16.2.2}
\]

• A Markov kernel $\pi_K$ from $\mathcal{X}\times\mathcal{X}$ to $\mathbb{N}$ with uniformly finite mean, i.e.
\[
\mu^{\ast} := \sup_{\kappa,\kappa' \in \mathcal{X}} \mu_K(\kappa,\kappa') < \infty,
\qquad\text{where}\qquad
\mu_K(\kappa,\kappa') := \sum_{k=1}^{\infty} k\,\pi_K((\kappa,\kappa'),k). \tag{16.2.3}
\]
It describes the offspring distribution of a particle of trait $\kappa \in \mathcal{X}$ when the (mutant) offspring have trait $\kappa' \in \mathcal{X}$.

• $\beta_K^{\mathrm{birth}}$ and $\beta_K^{\mathrm{death}}$ in $M_b^+(\mathcal{X},\mathbb{R}_+)$ referred to as the natural birth and death rates.

• $\gamma^{\mathrm{death}}$ in $M(\mathbb{R}_+\times\mathcal{X}\times\mathcal{X},\mathbb{R}_+)$ referred to as the competition rate. We assume that there is a finite constant $\bar{d}$ such that for all $m > 0$,
\[
\sup_{\kappa,\kappa' \in \mathcal{X}} \gamma^{\mathrm{death}}(m,\kappa,\kappa') \le \bar{d}\,\max\{1,m\}.
\]

The state-dependent multi-trait branching particle model with mutation and competition is a $\frac{1}{K}\mathcal{N}_f(\mathcal{X})$-valued Markov process with the following dynamics: Given that its current state is $\nu \in \frac{1}{K}\mathcal{N}_f(\mathcal{X})$ as in (16.2.1) such that the total mass equals $m_\nu = \frac{N}{K}$ we have the following.

• Death: an individual of type $\kappa \in \mathcal{X}$ dies at rate
\[
d_K(\nu,\kappa) := \beta_K^{\mathrm{death}}(\kappa) + \int \gamma^{\mathrm{death}}(m_\nu,\kappa,\kappa')\,\nu(d\kappa'). \tag{16.2.4}
\]

• Birth: an individual of type $\kappa \in \mathcal{X}$ gives birth at rate
\[
b_K(\nu,\kappa) := \beta_K^{\mathrm{birth}}(\kappa). \tag{16.2.5}
\]
With probability $(1-p(\kappa))\,\pi_K((\kappa,\kappa),k)$ it gives birth to $k$ clonal children, i.e. to $k$ children of the original type $\kappa \in \mathcal{X}$, while with probability $p(\kappa)\,\pi_K((\kappa,\kappa'),k)\,m_K(\kappa,d\kappa')$ it has $k$ mutant children which are all of the same type $\kappa' \in \mathcal{X}$.

Note that in the later sections we will allow for the birth rate to depend on the measure $\nu$, for example through a birth enhancement term of the same form as the competition term in (16.2.4). In order to unify notation we will thus use $b_K(\nu,\kappa)$ for the birth rate even though it is for the moment constant in $\nu$. In [6, 43] constructions as strong solutions of stochastic differential equations with Poisson noises were used, and by combining the results one can rigorously deduce the existence of the above Markov process.

Theorem 16.2.1 ([6, Theorem 1, Proposition 1], [43, Proposition 2.7]). Consider the above setting with an initial number of particles with finite mean. Then there exists a well-defined (non-explosive) Markov process with the dynamics specified in (16.2.4) and (16.2.5).

The construction was inspired by [10, 24] who considered the restricted case of binary reproduction and an offspring distribution that only depends on the parent’s type. In [6, Proposition 1] it is also shown that the infinitesimal generator $L^K$ of the process acts on measurable functions $f\colon \frac{1}{K}\mathcal{N}_f(\mathcal{X}) \to \mathbb{R}$ such that for all $u \in (0,1)$ and $\nu \in \frac{1}{K}\mathcal{N}_f(\mathcal{X})$,
\[
\int_{\mathcal{X}} \bigl| f(\nu - u\delta_\kappa) - f(\nu) \bigr|\,\nu(d\kappa) < Cu. \tag{16.2.6}
\]

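(We will later on consider functions that satisfy this condition for all $K$. For a fixed $K$ it would suffice to have (16.2.6) be satisfied for $u = \frac{1}{K}$.) The generator is then given by
\[
\begin{aligned}
L^K f(\nu) ={}& K \int_{\mathcal{X}} (1-p(\kappa))\, b_K(\nu,\kappa) \sum_{k=1}^{\infty} \pi_K((\kappa,\kappa),k) \Bigl( f\bigl(\nu + \tfrac{k}{K}\delta_\kappa\bigr) - f(\nu) \Bigr)\,\nu(d\kappa) \\
&+ K \int_{\mathcal{X}} p(\kappa)\, b_K(\nu,\kappa) \int_{\mathcal{X}} \sum_{k=1}^{\infty} \pi_K((\kappa,\kappa'),k) \Bigl( f\bigl(\nu + \tfrac{k}{K}\delta_{\kappa'}\bigr) - f(\nu) \Bigr)\, m_K(\kappa,d\kappa')\,\nu(d\kappa) \\
&+ K \int_{\mathcal{X}} d_K(\nu,\kappa) \Bigl( f\bigl(\nu - \tfrac{1}{K}\delta_\kappa\bigr) - f(\nu) \Bigr)\,\nu(d\kappa).
\end{aligned}
\]

We point out that the class of functions that satisfy (16.2.6) contains the class $\mathcal{F}$ of all functions $f\colon \mathcal{M}_f(\mathcal{X}) \to \mathbb{R}$ of the form
\[
f(\nu) = e^{-\lambda\nu(\mathcal{X})}\,\langle \nu^{\otimes n}, \phi \rangle,
\]
with $\lambda \in \mathbb{R}_+$, $n \in \mathbb{N}_0$, $\phi \in C_b(\mathcal{X}^n)$, and $\langle \nu^{\otimes n}, \phi \rangle$ denoting the integral of $\phi$ with respect to the $n$-fold product measure of $\nu$. While these functions clearly separate points in $\mathcal{M}_f(\mathcal{X})$, they are also convergence determining (apply, for example, [47, Theorem 2.7]).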
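The dynamics (16.2.4) and (16.2.5) can be illustrated by a small Gillespie-type simulation. The following is only a minimal sketch under simplifying assumptions that are not part of the text: the trait space is a finite set, so that the state is stored as a dictionary of counts, and the function name `simulate` together with all parameter choices in the usage example are hypothetical.

```python
import random


def simulate(K, counts, beta_birth, beta_death, gamma, p, mutate, offspring,
             t_max, rng):
    """Gillespie-type simulation of the particle model on a discrete trait set.

    counts maps each trait to its number of individuals (mass 1/K each).
    beta_birth, beta_death and p are functions of the trait;
    gamma(m, kappa, kappa2) is the competition rate at total mass m;
    mutate(kappa, rng) samples a mutant trait different from kappa;
    offspring(kappa, kappa2, rng) samples the number k >= 1 of children.
    """
    counts = dict(counts)
    t = 0.0
    while sum(counts.values()) > 0:
        m = sum(counts.values()) / K  # total mass N / K
        events = []
        for kappa, n in counts.items():
            if n == 0:
                continue
            # Integral of gamma against the state: each individual has mass 1/K.
            comp = sum(gamma(m, kappa, k2) * n2 / K
                       for k2, n2 in counts.items())
            events.append((n * (beta_death(kappa) + comp), "death", kappa))
            events.append((n * beta_birth(kappa), "birth", kappa))
        total = sum(rate for rate, _, _ in events)
        if total <= 0.0:
            break
        t += rng.expovariate(total)
        if t > t_max:
            break
        # Pick one event with probability proportional to its rate.
        u = rng.random() * total
        for rate, kind, kappa in events:
            if u < rate:
                break
            u -= rate
        if kind == "death":
            counts[kappa] -= 1
        else:
            mutant = rng.random() < p(kappa)
            child = mutate(kappa, rng) if mutant else kappa
            counts[child] = counts.get(child, 0) + offspring(kappa, child, rng)
    return counts


# Hypothetical two-trait example with one child per birth event.
final = simulate(
    K=50,
    counts={0: 50, 1: 0},
    beta_birth=lambda kappa: 2.0,
    beta_death=lambda kappa: 1.0,
    gamma=lambda m, kappa, kappa2: 1.0,
    p=lambda kappa: 0.1,
    mutate=lambda kappa, rng: 1 - kappa,
    offspring=lambda kappa, kappa2, rng: 1,
    t_max=1.0,
    rng=random.Random(0),
)
print(final)
```

Per-individual rates are multiplied by the number of individuals of each trait, which mirrors the factor $K$ in front of the integrals against the state in the generator.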

16.3 Possible scaling regimes

We next investigate the large population limit as $K \to \infty$. We are mostly interested in a random measure-valued limit process. Such a limit is obtained when the natural birth and death rates are accelerated appropriately (depending on the offspring distribution) in such a way that the selection due to the imbalance between these rates is weak. Nonetheless, the assumptions we detail below, under which the convergence results of Section 16.4 hold, also include scalings where the births and deaths are not accelerated (or are accelerated at a lesser rate). In this case, this kind of selection can be strong and the expected limit process is a deterministic measure-valued process (see [10, 24] for a comparison of the scaling regimes, respectively the examples at the end of this section).

For technical reasons we consider the one-point compactification $\hat{\mathcal{X}} := \mathcal{X} \cup \{\partial\}$ of $\mathcal{X}$, and denote by $C_\partial \subseteq C_b$ those functions on $\mathcal{X}$ that can be extended continuously to $\hat{\mathcal{X}}$. The previous interactive particle system can be regarded as a process with state space $\frac{1}{K}\mathcal{N}_f(\hat{\mathcal{X}}) \subset \mathcal{M}_f(\hat{\mathcal{X}})$, where we suppose that the functions defining the process on $\mathcal{X}$ can be extended to $\hat{\mathcal{X}}$ such that the following assumptions are met on $\hat{\mathcal{X}}$.

Assumption 16.3.1 (Sufficient for existence of large population limits). As for the branching mechanism and the drift due to mutation, we assume the following.

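(1) Gap between natural birth and death rate. We assume that
\[
\sup_{K>0}\,\sup_{\kappa \in \hat{\mathcal{X}}} \bigl| \mu_K(\kappa,\kappa)\,\beta_K^{\mathrm{birth}}(\kappa) - \beta_K^{\mathrm{death}}(\kappa) \bigr| < \infty.
\]

(2) Uniform bounds on mean offspring number. We assume that the bound on the mean offspring numbers in (16.2.3) holds uniformly in $K$. In addition, let $\gamma^{\mathrm{death}} \in C_b(\mathbb{R}_+\times\hat{\mathcal{X}}\times\hat{\mathcal{X}})$ be constant in the first coordinate for the moment. Moreover, we assume for the probability generating function $g^K(\kappa,\kappa',\cdot)$ of the offspring distribution that for all $a > 0$, as $K \to \infty$,
\[
\sup_{z \in [0,a]}\,\sup_{\kappa,\kappa' \in \hat{\mathcal{X}}} K\, b_K(\kappa)\,\Bigl| g^K\bigl(\kappa,\kappa',e^{-z/K}\bigr) - 1 + \frac{z}{K}\,\mu_K(\kappa,\kappa') \Bigr| \to 0.
\]

(3) Domain of attraction of CSBP. Put for $z \ge 0$,
\[
\psi^K(\kappa,z) := b_K(\kappa)\bigl( g^K(\kappa,\kappa,e^{-z}) - 1 \bigr) + d_K(\kappa)\bigl( e^{z} - 1 \bigr),
\]
and assume that the sequence $(K\psi^K(\cdot,\cdot/K))_K$ converges uniformly on $\hat{\mathcal{X}}\times[0,a]$ for each $a > 0$ to
\[
\psi(\kappa,z) := b(\kappa)z + \sigma(\kappa)z^2 + \int_0^\infty \bigl( e^{-zu} - 1 + zu \bigr)\,\Pi(\kappa,du),
\]
where $b,\sigma \in C_b(\hat{\mathcal{X}})$ with $\sigma$ non-negative and $\Pi(\kappa,\cdot)$ is a kernel from $\hat{\mathcal{X}}$ to $(0,\infty)$ such that for all $B \in \mathcal{B}((0,\infty))$,
\[
\sup_{\kappa \in \hat{\mathcal{X}}} \int_0^\infty (u \wedge u^2)\,\Pi(\kappa,du) < \infty,
\qquad
\int_B (u \wedge u^2)\,\Pi(\cdot,du) \in C_\partial(\mathcal{X}). \tag{16.3.1}
\]

(4) Drift due to mutation. We assume that there are generators $(A_1,D(A_1))$ and $(A_2,D(A_2))$ of Feller semi-groups on $C_b(\hat{\mathcal{X}})$ with domain dense in $C_b(\hat{\mathcal{X}})$ such that for all $\phi_1 \in D(A_1)$,
\[
\lim_{K\to\infty}\,\sup_{\kappa \in \hat{\mathcal{X}}} \Bigl| b_K(\kappa)\,\mu_K(\kappa,\kappa) \int_{\hat{\mathcal{X}}} \bigl( \phi_1(\kappa') - \phi_1(\kappa) \bigr)\, m_K(\kappa,d\kappa') - A_1\phi_1(\kappa) \Bigr| = 0,
\]
and for all $\phi_2 \in D(A_2)$,
\[
\lim_{K\to\infty}\,\sup_{\kappa \in \hat{\mathcal{X}}} \Bigl| b_K(\kappa) \int_{\hat{\mathcal{X}}} \bigl( \phi_2(\kappa') - \phi_2(\kappa) \bigr)\bigl( \mu_K(\kappa,\kappa') - \mu_K(\kappa,\kappa) \bigr)\, m_K(\kappa,d\kappa') - A_2\phi_2(\kappa) \Bigr| = 0.
\]
Moreover, we assume that for some $r \in C_b(\hat{\mathcal{X}})$,
\[
\lim_{K\to\infty}\,\sup_{\kappa \in \hat{\mathcal{X}}} \Bigl| b_K(\kappa) \int_{\hat{\mathcal{X}}} \bigl( \mu_K(\kappa,\kappa') - \mu_K(\kappa,\kappa) \bigr)\, m_K(\kappa,d\kappa') - r(\kappa) \Bigr| = 0.
\]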

These assumptions are motivated by those made in the theory of superprocesses; see for example [12, Section 4.4] or [45, Proposition 4.3]. Before we proceed to various descriptions of the limiting object in the following sections we provide several examples.

Example 16.3.2 (Possible offspring distributions and scalings). The first one captures the mean-field scaling which leads to a deterministic measure-valued limit. The remaining examples are such that the measure-valued limit may be random.

(1) Mean-field scaling. Let $b_K, d_K \in C_\partial^+(\mathcal{X})$ be such that $b_K \to b$ and $d_K \to d$ as $K \to \infty$ uniformly on $\hat{\mathcal{X}}$, where $b, d \in C_\partial^+(\mathcal{X})$. If we further assume that $\mu_K \to \mu \in C_\partial(\mathcal{X})$ uniformly on $\hat{\mathcal{X}}$, then we obtain a deterministic limit that depends on the offspring distribution only through its mean offspring function $\mu$.

(2) Binary branching. If $\pi_K((\kappa,\kappa'),\cdot) = \delta_1$, $\beta_K = K^\eta\beta$ for some $\eta \in (0,1]$ and $\beta \in C_\partial^+(\mathcal{X})$, $\beta_K^{\mathrm{birth}} = \beta_K + b$ for $b \in C_\partial(\mathcal{X})$ as well as $\beta_K^{\mathrm{death}} = \beta_K$, then the following holds: If $\eta = 1$, we obtain a random measure-valued limit and $\psi(\kappa,z) = b(\kappa)z + \beta(\kappa)z^2$. This case has been studied in [38, 49, 50] and will be the focus in Section 16.5. If $\eta < 1$, we obtain a deterministic limiting process, which has been studied in more detail in [10].

(3) Trait dependent offspring. Consider the generating function
\[
g^K(\kappa,\kappa',z) := z\exp\bigl( -\Lambda(\kappa,\kappa')(1-z) \bigr), \qquad |z| \le 1,
\]
with $\Lambda \in M_b^+(\hat{\mathcal{X}}\times\hat{\mathcal{X}})$, which corresponds to an offspring number $1 + X_{\kappa,\kappa'}$ where $X_{\kappa,\kappa'}$ is Poisson distributed with trait dependent parameter $\Lambda(\kappa,\kappa')$. In particular, we have $\mu_K = 1 + \Lambda$. Since this is an offspring distribution with a finite variance, close natural birth and death rates accelerated proportional to $K$ yield a random limit while a lesser acceleration would lead to a deterministic limit.

(4) $\alpha$-stable offspring distribution. Let the reproduction law satisfy
\[
g^K(\kappa,\kappa',z) = \frac{1}{\alpha}(1-z)^{1+\alpha} + \frac{1+\alpha}{\alpha}\,z - \frac{1}{\alpha}, \qquad |z| \le 1, \tag{16.3.2}
\]
with $\alpha \in (0,1]$. Choose $\beta_K^{\mathrm{birth}} = \beta_K^{\mathrm{death}} = K^\eta\beta$ for some $\eta \in (0,\alpha]$ and $\beta \in C_\partial^+(\mathcal{X})$. Then we obtain a random measure valued limit if $\eta = \alpha$. Clearly, for $\alpha < 1$ the variance of the offspring distribution is infinite and therefore the limiting process can no longer have finite second moments (compare with [12, Section 4.5] in the superprocess setting without a competition term). If $\eta < \alpha$, then we get a deterministic limiting process.
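The generating functions in items (3) and (4) can be checked numerically. The sketch below is an illustration only, with hypothetical function names and parameter values: it evaluates both generating functions, approximates the mean offspring number by a one-sided difference quotient at $z = 1$, and uses the closed form $g''(z) = (1+\alpha)(1-z)^{\alpha-1}$, obtained by differentiating (16.3.2) twice, to show how the second derivative blows up near $z = 1$ when $\alpha < 1$.

```python
import math


def pgf_poisson_shift(z, lam):
    """g(z) = z * exp(-lam * (1 - z)): offspring number 1 + Poisson(lam)."""
    return z * math.exp(-lam * (1.0 - z))


def pgf_stable(z, alpha):
    """g(z) = ((1 - z)^(1 + alpha) + (1 + alpha) z - 1) / alpha, cf. (16.3.2)."""
    return ((1.0 - z) ** (1.0 + alpha) + (1.0 + alpha) * z - 1.0) / alpha


def left_derivative(g, z, h=1e-6):
    """One-sided difference quotient approximating g'(z) from the left."""
    return (g(z) - g(z - h)) / h


def second_derivative_stable(z, alpha):
    """Closed form g''(z) = (1 + alpha) * (1 - z)^(alpha - 1) for z < 1."""
    return (1.0 + alpha) * (1.0 - z) ** (alpha - 1.0)


lam, alpha = 2.0, 0.5  # hypothetical parameter values
mean_poisson = left_derivative(lambda z: pgf_poisson_shift(z, lam), 1.0)
mean_stable = left_derivative(lambda z: pgf_stable(z, alpha), 1.0)
print("mean, item (3):", mean_poisson, "expected 1 + lam =", 1.0 + lam)
print("mean, item (4):", mean_stable, "expected (1 + alpha) / alpha =",
      (1.0 + alpha) / alpha)
for z in (0.9, 0.999, 0.99999):
    print("g''(%s) =" % z, second_derivative_stable(z, alpha))
```

For $\alpha = 1$ the function in (16.3.2) reduces to $z^2$, which has a finite second derivative, in line with the variance being infinite only for $\alpha < 1$.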

In all of the above examples one still needs to choose the mutation kernels $m_K$ appropriately in order to assure convergence as $K \to \infty$. We next provide some examples that show that large classes of mutation dynamics can be included. Notice that if the mean offspring number $\mu_K(\kappa,\kappa')$ only depends on the parent individual’s trait $\kappa$, which will be the case in later sections and in our first example, then there is only one condition to check for $A_1$ and we naturally obtain $A_2 \equiv 0$ and $r \equiv 0$.

Example 16.3.3 (Mutation kernels). We discuss examples where the mutation kernel only depends on the parent or on the offspring trait but not on both.

(1) Offspring numbers dependent on the parent trait. If $\mu_K(\kappa,\kappa') = \mu(\kappa)$, then the usual random walk transition kernels lead to a diffusion operator $A_1$ or even a more general jump diffusion operator. For example, if $\mathcal{X} = \mathbb{R}^d$ and the mutation kernel $m_K(\kappa,\cdot)$ has mean $\kappa$ and covariance matrix $\Sigma(\kappa)/\varepsilon_K$ such that $\varepsilon_K > 0$, $\varepsilon_K \to \infty$ and $\beta_K^{\mathrm{birth}}/\varepsilon_K$ converges (uniformly on $\hat{\mathcal{X}}$), then under some additional regularity assumptions the generator $A_1$ is given by
\[
A_1\phi(\kappa) = \frac{1}{2}\sum_{i,j=1}^{d} \Sigma_{ij}(\kappa)\,\frac{\partial^2\phi(\kappa)}{\partial\kappa_i\,\partial\kappa_j}.
\]

(2) Offspring numbers dependent on the offspring trait. We take $\mu_K(\kappa,\kappa') = \bar{\mu}(\kappa')$. Let $\mathcal{X} = \mathbb{R}$ and the mutation kernel $m_K$ as in the previous example. Then we obtain $A_1\phi(\kappa) = \frac{1}{2}\Sigma(\kappa)\phi''(\kappa)$, $A_2\phi(\kappa) = \frac{1}{2}\Sigma(\kappa)\bar{\mu}'(\kappa)\phi'(\kappa)$ and $r(\kappa) = \frac{1}{2}\Sigma(\kappa)\bar{\mu}''(\kappa)$. For more details we refer to [6, Remarks 5–7].

16.4 The measure-valued model

In this section we consider the limiting measure-valued process of the individual-based model presented in Section 16.2 under the scaling assumptions made in Section 16.3. We thereby summarise results from [6].

Theorem 16.4.1 ([6, Theorem 3]). Let $\{\nu^K : K > 0\}$ be a family of individual-based models as defined in Section 16.2 that satisfy Assumption 16.3.1. Also assume that the initial states satisfy $\sup_{K>0} \mathbb{E}[\nu_0^K(\mathcal{X})] < \infty$ as well as $\nu_0^K \Rightarrow \nu$ as $K \to \infty$ for some $\nu \in \mathcal{M}_f(\mathcal{X})$. Then the family $\{\nu^K : K > 0\}$ is tight in $D([0,T],\mathcal{M}_f(\hat{\mathcal{X}}))$ (càdlàg paths from $[0,T]$ to $\mathcal{M}_f(\hat{\mathcal{X}})$ furnished with the Skorohod topology). Any limit point, whose distribution we denote by $Q_\nu$, solves the following martingale problem: For any $\phi \in D(M) := C_\partial^+(\mathcal{X}) \cap D(A_1) \cap D(A_2)$ the process $M(\phi) = (M_t(\phi), t \in [0,T])$ given by
\[
\begin{aligned}
M_t(\phi) ={}& \langle \nu_t,\phi\rangle - \langle \nu_0,\phi\rangle - \int_0^t\!\!\int_{\hat{\mathcal{X}}} p(\kappa)\bigl( A_1\phi(\kappa) + A_2\phi(\kappa) \bigr)\,\nu_s(d\kappa)\,ds \\
&+ \int_0^t\!\!\int_{\hat{\mathcal{X}}} \phi(\kappa)\Bigl( b(\kappa) + \int_{\hat{\mathcal{X}}} \gamma^{\mathrm{death}}(\kappa,\kappa')\,\nu_s(d\kappa') - p(\kappa)r(\kappa) \Bigr)\,\nu_s(d\kappa)\,ds
\end{aligned}
\tag{M}
\]
is a $Q_\nu$-martingale and $Q_\nu(\nu_0 = \nu) = 1$. Moreover, $M(\phi)$ admits the decomposition $M(\phi) = M^c(\phi) + M^d(\phi)$, where $M^c(\phi)$ is a continuous martingale with increasing process
\[
2\int_0^t\!\!\int_{\hat{\mathcal{X}}} \sigma(\kappa)\phi^2(\kappa)\,\nu_s(d\kappa)\,ds,
\]
and $M^d(\phi)$ is a purely discontinuous martingale, see [37, p. 85]. More precisely,
\[
M_t^d(\phi) = \int_0^t\!\!\int_{\mathcal{M}_f(\hat{\mathcal{X}})} \langle \nu',\phi\rangle\,\tilde{N}(ds,d\nu'),
\]
where $\tilde{N}$ is the compensated random measure of the optional random measure
\[
N(ds,d\nu') = \sum_{s>0} \mathbf{1}_{\{\nu_s - \nu_{s-} \neq 0\}}\,\delta_{(s,\,\nu_s - \nu_{s-})}(ds,d\nu')
\]
on $[0,\infty)\times\mathcal{M}_f(\hat{\mathcal{X}})$, given by $\tilde{N} = N - \hat{N}$ with the compensator $\hat{N}(ds,d\nu') = ds\,\hat{n}(\nu_s,d\nu')$ characterised via
\[
\int_{\mathcal{M}_f(\hat{\mathcal{X}})} f(\nu')\,\hat{n}(\nu_s,d\nu') = \int_{\hat{\mathcal{X}}}\int_0^\infty f(u\delta_\kappa)\,\Pi(\kappa,du)\,\nu_s(d\kappa).
\]

Under some mild additional assumptions on the generators $A_1$ and $A_2$, we also have that the limiting process takes values in $\mathcal{M}_f(\mathcal{X})$ instead of $\mathcal{M}_f(\hat{\mathcal{X}})$, the measures on the one-point compactification, meaning that no mass escapes to infinity, see [6, Theorem 5]. Under additional assumptions we can ensure by means of a generalised version of Girsanov’s theorem that the limit points are unique.

Theorem 16.4.2 ([6, Theorem 4]). Let $\nu \in \mathcal{M}_f(\mathcal{X})$ be non-random and suppose that the coefficients in the martingale problem (M) satisfy Assumption 16.3.1 as well as in addition $\sigma \in C_\partial(\mathcal{X},\mathbb{R}_+)^+$. Then uniqueness holds for the martingale problem (M) and it is thus well posed. In particular, $\nu^K$ as in Theorem 16.4.1 converge weakly on $D([0,T],\mathcal{M}_f(\hat{\mathcal{X}}))$ to the unique solution of this martingale problem as $K \to \infty$.

The proof of the uniqueness of solutions to the martingale problem (M) via Girsanov’s transform is based on a localisation procedure introduced by Stroock [57] and generalised by He [34] to the measure-valued context. First, a “killed” martingale problem is considered for which large jumps are eliminated and replaced by a corresponding immigration term in the drift. The resulting process has bounded moments of any order and a Girsanov type theorem is used to remove the non-linearities (caused by the competition) and thus to deduce uniqueness for the “killed” martingale problem. Here, $\sigma$ needs to be bounded away from zero in order to define the probability measure transformation. (Note that for measure-valued processes without jumps the use of Girsanov’s theorem to show uniqueness goes back to Dawson [11, Section 5].) Finally, via a localisation argument it is shown that the uniqueness of the “killed” martingale problem implies uniqueness for the martingale problem (M).
Let us point out that the nonlinear superprocess with interaction obtained as the solution to the limiting martingale problem (M) generalises earlier work on superprocesses, see [53] for a general overview and for instance [13, 19, 21, 53] for superprocess models with a general branching mechanism that includes jumps in the limit. On the other hand, the model and limit studied here extend work by Méléard [24], Champagnat et al. [10], Jourdain et al. [38] and Etheridge [20] on superprocesses with nonlinear interaction terms such as competition by considering a general reproduction law of the approximating population system. This yields a limiting process with a general branching mechanism, namely a superprocess with nonlinear interactions that possesses a jump structure.

16.5 The evolving phylogenies

In this section we model the evolution of the phylogenetic trees (or for short, phylogenies) of our multi-trait branching model introduced in Section 16.2. The main results have been obtained in [41, 43]. For that purpose we capture a phylogeny of a multi-trait population in such a way that it allows an explicit description of the population size as well as of the ancestral relationships and traits in any finite sample. That is, we will represent a phylogeny by a marked metric measure space $(X,r,\mu)$ with marks in the trait space $\mathcal{X}$ as follows:

• $(X,r)$ is the set of all individuals (which are currently alive or were alive at some point) together with the genetic distance $r(x,y)$, which counts the number of nucleotide substitutions that separate $x \in X$ from $y \in X$ (compare [2]). Notice that this distance is only a pseudo-metric as $r(x,y) = 0$ is possible whenever $x \in X$ and $y \in X$ are clones of each other. Therefore, more precisely, $X$ collects all clans of clones (rather than individuals).

• $\mu$ is a measure on $X\times\mathcal{X}$ that allows to sample individuals together with their traits.

In what follows, we therefore rely on the space of metric measure spaces equipped with the Gromov-weak topology introduced in the context of mono-trait populations in [29] and its extension to marked metric measure spaces equipped with the marked Gromov-weak topology introduced for multi-trait populations in [16] (compare also [54]).

Definition 16.5.1 (mmm-spaces). An $\mathcal{X}$-marked metric measure space (mmm-space, for short) is a triple $(X,r,\mu)$ such that $(X,r)$ is a complete and separable metric space and $\mu \in \mathcal{M}_f(X\times\mathcal{X})$, where $X\times\mathcal{X}$ is equipped with the product topology.

Recall that for a topological space $X$ the support $\mathrm{supp}(\mu)$ of $\mu \in \mathcal{M}_1(X)$ or $\mu \in \mathcal{M}_f(X)$ is the smallest closed set $X_0 \subseteq X$ such that $\mu(X \setminus X_0) = 0$. The push forward of $\mu$ under a measurable map $\varphi$ from $X$ into another topological space $Z$ is the probability measure $\varphi_\ast\mu \in \mathcal{M}_1(Z)$ defined by
\[
\varphi_\ast\mu(A) := \mu\bigl( \varphi^{-1}(A) \bigr)
\]
for all $A \in \mathcal{B}(Z)$.

Two mmm-spaces $(X,r_X,\mu_X)$ and $(Y,r_Y,\mu_Y)$ are equivalent if they are measure- and mark-preserving isometric, i.e. there is a measurable $\varphi\colon \mathrm{supp}(\mu_X(\,\cdot\times\mathcal{X})) \to \mathrm{supp}(\mu_Y(\,\cdot\times\mathcal{X}))$ such that
\[
r_X(x,x') = r_Y(\varphi(x),\varphi(x')) \qquad\text{for all } x,x' \in \mathrm{supp}(\mu_X(\,\cdot\times\mathcal{X}))
\]
and
\[
\tilde{\varphi}_\ast\mu_X = \mu_Y \qquad\text{for } \tilde{\varphi}(x,u) = (\varphi(x),u).
\]

We write .X; r; / for the equivalence class of an mmm-space .X; r; /. Define the set of (equivalence classes of) X-marked metric measure spaces ® ¯ MX WD t D .X; r; / W .X; r; / is an X-marked metric measure space : To distinguish between different (equivalence classes of) phylogenies, we will sample from the population and evaluate the mutual distances among the sampled individuals, while simultaneously keeping track of their traits. That is for a given metric space .X; r/ we introduce the marked distance matrix map ´ .N / .X  X/N ! RC2  X N ; .X;r/;X WD R  ..xi ; i /i >1 / 7! .r.xi ; xj //16i 1 and consider the distribution of this marked distance matrix under picking the sequence of individuals and their traits in an i.i.d. fashion with the sampling measure 8 0 , is the MX;K -valued Markov process that, given it starts in t0 2 MX;K , has the following (modified) dynamics when the current state is t.  Death. The total death rate for a particle of clan x2 and trait 2 D .x Q 2 / is X  Kˇ.2 / C .¹x N 1 º  X/ death mt ; r.x1 ; x2 /; 2 ; .x Q 1/ : x1 2X

Such a death event yields the following transition:   1 t 7! X; r;   ı.x2 ;2 / : K  Birth. The total birth rate of a particle from clan x2 of trait 2 is X  Kˇ.2 / C .¹x N 1 º  X/ birth mt ; r.x1 ; x2 /; 2 ; .x Q 1/ : x1 2X

With probability $p \in [0,1]$, a new clan $z \notin X$ gets added to the current set $X$ of clans of clones. The newborn $z$ is a mutant of its parent (clan) $x_2$ whose new trait $\tilde{\kappa}(z)$ is chosen with respect to $m_K(\kappa_2,\cdot)$ and its distance to all other particles is given by
\[
r^{(x_2,z),\frac{1}{K}}(z,x) :=
\begin{cases}
r(x_2,x) + \frac{1}{K}, & x \in X,\\
0, & x = z,
\end{cases}
\]
while with probability $1-p$ the newborn is just a clone of its parent and we formally set $z = x_2$. Under our assumption (16.2.2) on non-trivial mutation, this yields the following transition:
\[
\mathfrak{t} \mapsto \overline{\Bigl( X \cup \{z\},\; r^{(x_2,z),\frac{1}{K}},\; \mu + \tfrac{1}{K}\delta_{(z,\tilde{\kappa}(z))}\cdot\mathbf{1}\{\tilde{\kappa}(z) \neq \kappa_2\} + \tfrac{1}{K}\delta_{(x_2,\kappa_2)}\cdot\mathbf{1}\{\tilde{\kappa}(z) = \kappa_2\} \Bigr)},
\]
where $\tilde{\kappa}(z)$ is chosen with respect to
\[
\hat{m}_K(\kappa_2,\cdot) := p\cdot m_K(\kappa_2,\cdot) + (1-p)\,\delta_{\kappa_2}(\,\cdot\,).
\]
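The distance update at a mutation event can be sketched in a few lines. This is an illustration under assumptions not made in the text: the set of clans is finite and indexed by integers, the genetic distance is stored as a symmetric matrix, and the function name `add_mutant` is hypothetical.

```python
def add_mutant(dist, parent, K):
    """Return the distance matrix after a mutant clan is born to clan `parent`.

    dist is a symmetric list of lists with zero diagonal. The new clan gets
    index len(dist) and lies 1/K further from every existing clan (including
    its own parent) than the parent does; its distance to itself is zero.
    """
    n = len(dist)
    step = 1.0 / K
    new_row = [dist[parent][x] + step for x in range(n)]
    extended = [row[:] + [new_row[i]] for i, row in enumerate(dist)]
    extended.append(new_row + [0.0])
    return extended


# Two clans at distance 0.2; clan 0 produces a mutant, with K = 10.
d = add_mutant([[0.0, 0.2], [0.2, 0.0]], parent=0, K=10)
for row in d:
    print(row)
```

The input matrix is copied rather than modified, so the state before the event stays available, for instance for computing the rates of the next event.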

We are interested in the existence of subsequential limits of $\{Y^K : K > 0\}$ in $\mathbb{M}^{\mathcal{X}}$. These we can only expect to exist when the mutation observed along a single line of descent has limit points under suitable rescaling. We therefore assume the following.

Assumption 16.5.4 (On the mutation). There is a linear operator $(A,D(A))$ on $C_b(\mathcal{X})$ such that $D(A)$ is an algebra that is dense in $C_b(\mathcal{X})$. Moreover, the $(A,D(A))$-martingale problem has a unique solution, and for all $h \in D(A)$ and $\kappa \in \mathcal{X}$, as $K \to \infty$,
\[
K \int_{\mathcal{X}} \bigl( m_K(\kappa,d\tilde{\kappa}) - \delta_\kappa(d\tilde{\kappa}) \bigr)\, h(\tilde{\kappa}) \to Ah(\kappa),
\]

uniformly in  2 X. (In fact, A corresponds to A1 from Sections 16.3 and 16.4.) Then the following result was obtained in [43] by novel methods relying on the criterion of compact containment proven in [41]. Theorem 16.5.5 ([43, Theorem 1]). Under Assumption 16.5.4 the family ¹Y K W K > 0º is tight provided that the family of initial states ¹Y0K W K > 0º is tight and that K lim supK!1 EŒmY0  < 1. We next want to describe possible limits analytically. For that purpose we abbreviate r WD .ri;j /16i1

where A.l/ acts on h as the mutation operator A on the function of the l-th trait-coordinate of h (assuming that all other variables are kept constant). Step 3 (Growth of genetic distances). Put Z X h .ˇ.l1 / C ˇ.l2 //  p;ˇ F .t/ D p   t .d.r; // growth 16l1 1

 h.m; 1 .r; //;

where for ` > 1, ` denotes an index shift by ` 2 N, that is  ` .r; / WD .ri;j /lC16i 0. For the second level dynamics, recall ˇK , birth , death , K and pK from Sections 16.2 and 16.5. We here assume that  ˇK D Kˇ for some constant rate ˇ > 0,  birth  ,  death .m/  m for a positive competition rate  > 0, P  K ..;  0 /;  /  Q K .  / such that K. k>0 k Q K .k/ 1/ ! as K ! 1 for a capacity > 0, and  pK  0. Then the virus population model with cell division, Z K D .Z tK / t>0 ; is an Nf . K1 N0 /-valued Given its P Markov chain that has the following dynamics: 1 N current state  WD N ı , for N 2 N and z ; : : : ; z 2 we denote by 0 1 N i D1 zi K 0 ni WD Kzi the number of particles in the i-th cell and have:  Cell division. For all i D 1; : : : ; N and for each k 2 ¹1; : : : ; ni º at rate   ni ni k r  .1  /k k we have the following jump:  7! 

ızi C ı k C ı ni K

k

K

:

 Branching with competition within a cell. – Natural birth. For all i D 1; : : : ; N and for each of the ni particles in the i-th cell at rate Kˇ C there is a new particle born of mass K1 in the i-th cell, i.e. we have the following jump:  7! 

ızi C ızi C K1 :

– Death with competition. For all i D 1; 2; : : : ; N and for each of the ni particles in the i-th cell at rate Kˇ C .zi K1 / the particle dies, i.e. we have the following jump:  7!  ızi C ızi K1 : Following [24], we can once more describe a Markov process with these transition rates as the unique strong solution of stochastic differential equations with Poisson


Anja Sturm and Anita Winter

noise. Its generator K acts on F 2 Cb .Nf . K1 N0 // as follows: K F ./ D r

Kz 1X

Z 0

BKz; .k/ F .

ız C ız

k K

C ı Kk /

 F ./ .dz/

kD0

1  C .Kˇ C / Kz F . ız C ızC K1 / F ./ .dz/ 0 Z 1  1  C Kˇ C  z K 0   Kz f . ız C ız K1 / f ./ .dz/;

Z

where Bn; .k/ D p > 1,

n k

 k  .1

/n

 sup E K>0

k

. We show in [52, Proposition 3.1] that for all

Z sup t 2Œ0;T  RC

 .1 C x p /Z tK .dx/ < 1;

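whenever $\sup_{K>0} \mathbb{E}\big[\int_{\mathbb{R}_+} (1 + x^{2p})\, Z_0^K(\mathrm{d}x)\big] < \infty$. Following the lines of proof in [5, Proposition 2.2], this implies that the family $\{Z^K : K > 0\}$ satisfies the compact containment condition, i.e. for all $T > 0$ and $\varepsilon > 0$, there exists a compact subset $\Gamma_{\varepsilon,T}$ of $\mathcal{M}_f(\mathbb{R}_+)$ such that
$$\inf_{K>0} \mathbb{P}\big\{ Z_t^K \in \Gamma_{\varepsilon,T} \text{ for all } t \in [0,T] \big\} \ge 1 - \varepsilon.$$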

Consider test functions F g;f;m W Nf .RC / ! R of the form Z g;f;m F ./ D g.h1; i/ f .z/˝m;# .dz/; .RC /m

where m 2 N, g 2 Cb .RC /, f W Rm C ! R, and with ˝m;# .dz/ D .dz1 /..dz2 /

 ız1 .dz2 // : : : .dzm /

m X1

 ızj .dzm / :

j D1

This function describes the procedure of sampling m different cells (if there are at least m cells) and then evaluating the number of cells via the test function gW RC ! R together with the mass in each of these cells via the test function f W Rm C ! R. We refer to these functions as polynomials. Denote by D the algebra generated by these polynomials, and notice that D is convergence determining (compare [1, (1.1)]). It was shown in [52, Proposition 4.3] that for K acting on these functions the following holds: As K ! 1, K F g;f;m ./

! cell F g;f;m ./

K!1

Multitype branching models for phylodynamics


with cell F g;f;m ./ Z Dr g.1 C h1; i/F 1;f;m .

ız0 C ı z0 C ı.1

 /z0 /

 g.h1; i/F 1;f;m ./ .dz0 / C g.h1; i/

Z X m  zj .K j D1

@ f .z/ @zj  @2 C  zj f .z/ ˝m;# .dz/: @zj 2

zj /

Formally, we then have the following result. Theorem 16.6.1 ([52, Theorem 1]). The .cell ; D/-martingale problem is well-posed. Moreover, if ¹Z K W K > 0º is a family of virus population models with cell division and Z the unique solution of the .cell ; D/-martingale problem with Z0K ) Z0 for some Z0 2 Nf .RC /, then Z K ! Z weakly in the Skorohod space D.Œ0; 1/I Nf .RC //. Definition 16.6.2 (2-level branching model). We refer to the solution of the above martingale problem as the 2-level branching model with cell division and logistic branching diffusions. While existence follows from the compact containment condition together with the uniform convergence of generators, for showing uniqueness, we will rely on the fact that the martingale problem for the projection on the total number of cells is well-posed and on a Feynman–Kac duality describing the distribution of the masses in sampled cells. It combines ideas from [31, 35, 36] obtained for samples of logistic branching with and without disasters. The dual process is a Markov process K WD .q t ; M t ; X t / t >0 with state space .0; 1  K, where [ K WD ¹mº  Rm C m2N

and with the following dynamics:
• $q$ follows the ordinary differential equation
$$\frac{\mathrm{d}q_t}{\mathrm{d}t} = -r\, q_t (1 - q_t).$$

 For all k D 1; : : : ; m the following jump happens at rate qr:   q; m; .x1 ; : : : ; xm / 7! q; m; .x1 ; : : : ; xk 1 ;  xk ; xkC1 ; : : : ; xm / :  For all k D 1; : : : ; m the following jump happens at rate qr:  q; m; .x1 ; : : : ; xm / 7! q; m; .x1 ; : : : ; xk

1 ; .1

  /xk ; xkC1 ; : : : ; xm / :


 For all 1 6 k1 6D k2 6 m the following jump happens at rate qr:  q; m; .x1 ; : : : ; xm / 7! q; m

1; .x1 ; : : : ; xk1 1 ;  xk1 C .1  /xk2 ;  xk1 C1 ; : : : ; xk2 1 ; xk2 C1 ; : : : ; xm / :

 For all 1 6 k1 6D k2 6 m the following jump happens at rate qr:  q; m; .x1 ; : : : ; xm / 7! q; m

1; .x1 ; : : : ; xk1 1 ; .1  /xk1 C  xk2 ;  xk1 C1 ; : : : ; xk2 1 ; xk2 C1 ; : : : ; xm / :

 In between two jumps of M , the coordinate processes Xk , k D 1; : : : ; m, perform independent logistic Feller branching diffusion p dXk .t / D Xk .t/. ˇXk .t// dt C 2Xk .t / dBk .t /; t > 0; (16.6.1) where the Bk for k D 1; : : : ; m denote independent Brownian motions, that is the strong Markov process whose generator .log-Feller ; D.log-Feller // acts on f 2 C 2 .RC /  D.log-Feller / is ‰.x/f 0 .x/ C xf 00 .x/;

log-Feller f .x/ D

with the branching mechanism ‰.x/ WD

x 2 RC ;

x C ˇx 2 .

Consider the dual function H W Nf .RC /  K ! R defined by  H ; .q; m; .x1 ; : : : ; xm // D

Z

 X  m q h1;i exp xk zk ˝m;# .dz/: kD1

We have the following duality relation. Theorem 16.6.3 ([52, Theorem 2]). Let Z be a 2-level branching model with celldivision and logistic virus branching diffusion, and K D .q; M; .X1 ; : : : ; XM // the dual process defined above. Then, for all  2 Nf .RC / and x 2 RN C,   X   Z m E q h1;Z t i exp zk xk Z t˝m;# .dz/ kD1

  Z t  h1;i 2 D E.m;.x1 ;x2 ;::: // q t exp r qs Ms ds 0

Z 

 X   Mt exp zk Xk .t / ˝M t ;# .dz/ : (16.6.2) kD1


Remark 16.6.4 (Longterm behaviour). We can apply this duality relation to describe the basic long term behaviour. For t > 0, put W t WD Z t .RC /. Then W D .W t / t >0 is the continuous time pure binary branching process, which is also known as the Yule process. If we apply our duality relation with m D 0, we obtain an explicit expression for the generating function of the Yule process, from which we can recover the known fact that t !1 e rt W t HHH) €.W0 ; 1/; where €.k; 1/ denotes a Gamma distribution with parameters k and 1 (see [52, Corollary 5.2]). If Pwe apply our duality relation from Theorem 16.6.3 with m D 1, q D 1 and Z0 D klD1 ızl , then it implies that for all x > 0, lim e

t !1

rt

EPklD1 ızl

hZ

Z t .dz/ e

x1 z

i

D

k X

 E1 lim e

lD1

and therefore that e

rt

t!1

X1 .t /z1



D k;

t!1

Z t HHH) €.k; 1/  ı0 :

This means that at a late time t there are ert  €.k; 1/ many cells and that a typical cell is free of virus particles. Remark 16.6.5 (Current work in progress). The above result on the basic long-term behaviour leads to the question which stochastic forces result in or prevent the survival of virus populations in typical cells. To give an answer to this question we propose to extend the model by the possibility of huge offspring numbers in the virus reproduction. In particular, assume that after suitable rescaling the virus population within a cell performs a Markov process with generator  WD cell C … huge ; where the additional part acts on sufficiently smooth F 2 D as Z  Z 1 … huge F ./ D z F . ız C ızCu / F ./   RC 0 @ u1Œ0;1 .u/ F ./ ….du/ .dz/; @z for a Borel measure … on RC satisfying (16.3.1). We verified that the dual relation (16.6.2) stated in Theorem 16.6.3 stays the same if we replace the logistic Feller diffusion in (16.6.1) by the solution of p dU t D ‰.U t / dt C 2U t dB t : R 1 zu Here ‰.z/ WD z C ˇz 2 1 zu1Œ0;1 .u//….du/ is the new continuous 0 .e state branching mechanism. The longterm behaviour of U D .U t / t >0 with a stable


branching mechanisms Z ‰.z/ WD

z C

1

.1

e

uz

/….dz/

0

(compare with (16.3.2)) was studied in [23]. It was shown in [23, Theorem 3.7] (compare also the criterion given in [44, Theorem 3.4] in the case that D 0) that here U has a non-trivial stationary distribution if and only if one of the following three assumptions hold:

> 0;

or

…..0; 1// D 1;

or

….Œ0; 1// > :

In current work in progress we are investigating whether these assumptions also result in survival of the virus population in the two-level model. In particular, we conjecture that it can be shown that these phase transitions are visible in tree length statistics.
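The statements of Remark 16.6.4 about the number of cells can be checked numerically. The following is a minimal sketch, not taken from [52]: it simulates the Yule process $W$ with splitting rate $r$ started from $k$ cells, compares a Monte Carlo estimate of the generating function $\mathbb{E}[q^{W_t}]$ with $q_t^{\,k}$, where $q_t$ solves $\mathrm{d}q_t/\mathrm{d}t = -r q_t(1 - q_t)$ (the case $m = 0$ of the duality), and estimates the mean of $e^{-rt} W_t$, which should be close to $k$, the mean of a $\Gamma(k,1)$ distribution. The function names and all parameter values are our own illustrative assumptions.

```python
import math
import random


def yule_cell_count(k, r, t, rng):
    """Number of cells at time t when every cell splits into two at rate r,
    starting from k cells (event-driven simulation)."""
    w, clock = k, 0.0
    while True:
        clock += rng.expovariate(r * w)  # waiting time until the next split
        if clock > t:
            return w
        w += 1


def dual_q(q0, r, t):
    """Closed-form solution of dq/dt = -r q (1 - q) with q(0) = q0."""
    decay = math.exp(-r * t)
    return q0 * decay / (1.0 - q0 + q0 * decay)


def estimate(k=3, r=1.0, t=3.0, q0=0.9, runs=4000, seed=1):
    """Monte Carlo estimates of E[q0^{W_t}] and of E[e^{-rt} W_t]."""
    rng = random.Random(seed)
    counts = [yule_cell_count(k, r, t, rng) for _ in range(runs)]
    gen_fn = sum(q0 ** w for w in counts) / runs
    scaled_mean = math.exp(-r * t) * sum(counts) / runs
    return gen_fn, scaled_mean


if __name__ == "__main__":
    gen_fn, scaled_mean = estimate()
    print("E[q^W_t] (simulated):", gen_fn)
    print("q_t^k (dual ODE):    ", dual_q(0.9, 1.0, 3.0) ** 3)
    print("E[e^{-rt} W_t]:      ", scaled_mean)
```

The agreement of the first two printed numbers illustrates why the dual coordinate $q$ decreases along a logistic curve: the generating function of a Yule process started in one cell is exactly $q_t$.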

References

[1] S. Athreya, W. Löhr, and A. Winter, The gap between Gromov-vague and Gromov–Hausdorff-vague topology, Stochastic Process. Appl. 126 (2016), 2527–2553.
[2] E. Baake and A. von Haeseler, Distance measures in terms of substitution processes, Theor. Popul. Biol. 55 (1999), 166–175.
[3] E. Baake and A. Wakolbinger, Microbial populations under selection, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 43–67.
[4] V. Bansaye, Proliferating parasites in dividing cells, Ann. Appl. Probab. 18 (2008), 967–996.
[5] V. Bansaye and C. Tran, Branching Feller diffusion for cell division with parasite infection, ALEA Lat. Am. J. Probab. Math. Stat. 8 (2011), 95–127.
[6] G. Berzunza, A. Sturm, and A. Winter, Trait-dependent branching particle systems with competition and multiple offspring, preprint 2018, https://arxiv.org/abs/1808.09345.
[7] M. Birkner and J. Blath, Genealogies and inference for populations with highly skewed offspring distributions, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 151–177.
[8] B. Bolker and S. W. Pacala, Using moment equations to understand stochastically driven spatial pattern formation in ecological systems, Theor. Popul. Biol. 52 (1997), 179–197.
[9] A. Bovier, Stochastic models for adaptive dynamics: Scaling limits and diversity, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 127–149.
[10] N. Champagnat, R. Ferrière, and S. Méléard, From individual stochastic processes to macroscopic models in adaptive evolution, Stoch. Models 24 (2008), 2–44.
[11] D. A. Dawson, Geostochastic calculus, Canad. J. Statist. 6 (1978), 143–168.


[12] D. A. Dawson, Measure-valued Markov processes, in: Ecole d’Eté de Probabilités de Saint-Flour XXI–1991 (eds. P. L. Hennequin), Springer, Berlin (1993), 1–260.
[13] D. A. Dawson, L. G. Gorostiza, and Z. Li, Nonlocal branching superprocesses and some related models, Acta Appl. Math. 74 (2002), 93–112.
[14] D. Dawson and A. Greven, State dependent multitype spatial branching processes and their longtime behaviour, Electron. J. Probab. 8 (2003), 1–93.
[15] D. Dawson and E. Perkins, Historical Processes, American Mathematical Society, Providence, 1991.
[16] A. Depperschmidt, A. Greven, and P. Pfaffelhuber, Marked metric measure spaces, Electron. Commun. Probab. 16 (2011), 174–188.
[17] A. Depperschmidt, A. Greven, and P. Pfaffelhuber, Tree-valued Fleming–Viot dynamics with mutation and selection, Ann. Appl. Probab. 22 (2012), 2560–2615.
[18] U. Dieckmann and R. Law, Relaxation projections and the method of moments, in: The Geometry of Ecological Interactions: Simplifying Spatial Complexity (eds. U. Dieckmann, R. Law, and J. A. J. Metz), Cambridge University, Cambridge (2000), 412–455.
[19] N. El Karoui and S. Roelly, Propriétés de martingales, explosion et représentation de Lévy–Khintchine d’une classe de processus de branchement à valeurs mesures, Stochastic Process. Appl. 38 (1991), 239–266.
[20] A. M. Etheridge, Survival and extinction in a locally regulated population, Ann. Appl. Probab. 14 (2004), 188–214.
[21] P. J. Fitzsimmons, On the martingale problem for measure-valued Markov branching processes, in: Seminar on Stochastic Processes 1991 (eds. E. Çinlar, K. L. Chung, and M. J. Sharp), Birkhäuser, Boston (1992), 39–51.
[22] F. Freund, Multiple-merger genealogies: Models, consequences, inference, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 179–202.
[23] C. Foucart, Continuous state branching process with competition: duality and reflection to infinity, Electron. J. Probab. 24 (2019), 1–38.
[24] N. Fournier and S. Méléard, A microscopic probabilistic description of a locally regulated population and macroscopic approximations, Ann. Appl. Probab. 14 (2004), 1880–1919.
[25] P. K. Glöde, Dynamics of genealogical trees for autocatalytic branching processes, PhD thesis, FAU Erlangen, 2012, http://www.min.math.fau.de/fileadmin/min/users/gloede/Dissertation/PatricKarlGloedeDissertation.pdf.
[26] B. Grenfell, T. Bryan, O. Pybus, J. Gog, J. Wood, J. Daly, J. Mumford, and E. Holmes, Unifying the epidemiological and evolutionary dynamics of pathogens, Science 303 (2004), 327–332.
[27] A. Greven and F. den Hollander, From high to low volatility: Spatial Cannings with block resampling and spatial Fleming–Viot with seed bank, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 267–289.
[28] A. Greven, V. Limic, and A. Winter, Representation theorems for interacting Moran models, interacting Fisher–Wright diffusions and applications, Electron. J. Probab. 10 (2005), 1286–1358.


[29] A. Greven, P. Pfaffelhuber, and A. Winter, Convergence in distribution of random metric measure spaces (Λ-coalescent measure trees), Probab. Theory Related Fields 145 (2009), 285–322.
[30] A. Greven, P. Pfaffelhuber, and A. Winter, Tree-valued resampling dynamics: Martingale problems and applications, Probab. Theory Related Fields 155 (2013), 789–838.
[31] A. Greven, A. Sturm, A. Winter, and I. Zähle, Multi-type spatial branching models for local self-regulation, I: Construction and an exponential duality, preprint 2015, https://arxiv.org/abs/1509.00402.
[32] A. Greven, R. Sun, and A. Winter, Continuum space limit of the genealogies of interacting Fleming–Viot processes on Z, Electron. J. Probab. 21 (2016), 1–64.
[33] M. Gromov, Metric Structures for Riemannian and non-Riemannian Spaces, Birkhäuser, Boston, 2001.
[34] H. He, Discontinuous superprocesses with dependent spatial motion, Stochastic Process. Appl. 119 (2009), 130–166.
[35] F. Hermann and P. Pfaffelhuber, Markov branching processes with disasters: extinction, survival and duality to p-jump processes, Stochastic Process. Appl. 130 (2020), 2488–2518.
[36] M. Hutzenthaler and A. Wakolbinger, Ergodic behaviour of locally regulated branching populations, Ann. Appl. Probab. 17 (2007), 474–501.
[37] J. Jacod and A. N. Shiryaev, Limit Theorems for Stochastic Processes, Springer, Berlin, 1987.
[38] B. Jourdain, S. Méléard, and W. A. Woyczynski, Lévy flights in evolutionary ecology, J. Math. Biol. 65 (2012), 677–707.
[39] G. Kersting and A. Wakolbinger, Probabilistic aspects of Λ-coalescents in equilibrium and in evolution, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 223–245.
[40] M. Kimmel, Quasistationarity in a branching model of division within division, in: Classical and Modern Branching Processes (eds. K. B. Athreya and P. Jagers), Springer, New York (1997), 157–164.
[41] S. Kliem, A compact containment result for nonlinear historical superprocess approximations for population models with trait-dependence, Electron. J. Probab. 19 (2014), 1–13.
[42] S. Kliem and W. Löhr, Existence of mark functions in marked metric measure spaces, Electron. J. Probab. 20 (2015), 1–24.
[43] S. Kliem and A. Winter, Evolving phylogenies of trait-dependent branching with mutation and competition Part I: Existence, Stochastic Process. Appl. 129 (2019), 4837–4877.
[44] A. Lambert, The branching process with logistic growth, Ann. Appl. Probab. 15 (2005), 1506–1535.
[45] Z. Li, Measure-Valued Branching Markov Processes, Springer, Berlin, 2011.
[46] M. Lipsitch and J. O’Hagan, Patterns of antigenic diversity and the mechanisms that maintain them, J. R. Soc. Interface 4 (2007), 787–802.
[47] W. Löhr, Equivalence of Gromov–Prohorov- and Gromov’s $\underline{\Box}_\lambda$-metric on the space of metric measure spaces, Electron. Commun. Probab. 18 (2013), 1–10.


[48] S. E. Luria and M. Delbrück, Mutations of bacteria from virus sensitivity to virus resistance, Genetics 28 (1943), 491–511.
[49] S. Méléard and V. C. Tran, Nonlinear historical superprocess approximations for population models with past dependence, Electron. J. Probab. 17 (2012), Paper No. 47.
[50] S. Méléard and V. C. Tran, Slow and fast scales for superprocess limits of age-structured populations, Stochastic Process. Appl. 122 (2012), 250–276.
[51] J. A. J. Metz, S. A. H. Geritz, G. Meszéna, F. J. A. Jacobs, and J. S. van Heerwaarden, Adaptive dynamics, a geometrical study of the consequences of nearly faithful reproduction, in: Stochastic and Spatial Structures of Dynamical Systems (eds. S. J. Van Strien and S. M. Verduyn Lunel), North Holland, Amsterdam (1996), 183–231.
[52] L. Osorio and A. Winter, Two level branching model for virus population under cell division, preprint 2020, https://arxiv.org/abs/2004.14352.
[53] E. Perkins, Dawson–Watanabe superprocesses and measure-valued diffusions, in: Lectures on Probability Theory and Statistics. Ecole d’eté de Probabilités de Saint-Flour XXIX–1999 (ed. P. Bernard), Springer, Berlin (2002), 125–324.
[54] S. Piotrowiak, Dynamics of genealogical trees for type- and state-dependent resampling models, PhD thesis, FAU Erlangen, 2010, https://opus4.kobv.de/opus4-fau/files/1533/SvenPiotrowiakDissertation.pdf.
[55] W. Stephan and A. Tellier, Stochastic processes and host-parasite coevolution: Linking coevolutionary dynamics and DNA polymorphism data, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 107–125.
[56] N. Strelkowa and M. Lässig, Clonal interference in the evolution of influenza, Genetics 192 (2012), 671–682.
[57] D. W. Stroock, Diffusion processes associated with Lévy generators, Probab. Theory Related Fields 32 (1975), 209–244.

Chapter 17

Ancestral lines under recombination

Ellen Baake and Michael Baake

Solving the recombination equation has been a long-standing challenge of deterministic population genetics. We review recent progress obtained by introducing ancestral processes, as traditionally used in the context of stochastic models of population genetics, into the deterministic setting. With the help of an ancestral partitioning process, which is obtained by letting population size tend to infinity (without rescaling parameters or time) in an ancestral recombination graph, we obtain the solution to the recombination equation in a transparent form.

17.1 Introduction Recombination is a genetic mechanism that “mixes” or “reshuffles” the genetic material of different individuals from generation to generation; it takes place in the course of sexual reproduction. Models that describe the evolution of populations under recombination (in isolation or in combination with other processes) are among the major challenges in population genetics. Besides being of theoretical and mathematical interest, they play a major role in inference from population sequence data; compare the contributions of Birkner and Blath [15] and of Dutheil [22] in this volume. In line with the general situation in population genetics, models of recombination come in two categories, deterministic and stochastic. In addition, there are versions in discrete and in continuous time, both of which will be considered below. In particular, our approach will result in a unified treatment of both. Deterministic approaches assume that the population is so large that a law of large numbers applies and random fluctuations may be neglected. The resulting models are (systems of) ordinary differential equations or (discrete-time) dynamical systems, which describe the evolution of the genetic composition of a population under recombination in the usual forward direction of time; for review, see [17, 18, 35]. The genetic composition is described via a probability distribution (or measure) on a space of sequences of finite length. The equations are nonlinear and notoriously difficult to solve. Elucidating the underlying structure and finding solutions was a challenge to theoretical population geneticists for nearly a century. Indeed, the first studies go back to Jennings [33] in 1917 and Robbins [39] in 1918. 
Geiringer [28] in 1944 and Bennett [13] in 1954 were the first to state the generic general form of the solution in terms of a convex combination of certain basis functions, and evaluated the corresponding coefficients recursively for sequences with a small number of sites. The approach was later continued within the systematic framework of genetic algebras; compare [35, 37]. The recursions for the coefficients were worked out in fairly general


form by Dawson [20]. In any case, however, the work is technically cumbersome and yields limited insight into the underlying mathematical structure. Stochastic approaches, on the other hand, take into account the fluctuations due to finite population size. The evolution of the composition of a population over time is described via a Moran or a Wright–Fisher model with recombination. The first study goes back to Ohta and Kimura [38] in 1969. Over the decades, two major lines of research have emerged. There has been continuous interest in how the correlations between sites (known as linkage disequilibria) will develop; see [38, 40] and the overviews in [31, Section 5.4], [21, Sections 3.3 and 8.2] or [42, Section 7.2.4]. The explicit time course of the genetic composition of the population is even more challenging, due to an intricate interplay of resampling and recombination; compare [5, 16, 38, 40] as well as [21, Section 8.2]. These questions are usually approached forward in time. The second line of research is concerned with genealogical aspects. Here, one starts with a sample taken from the present population and traces back the ancestry of the various sequence segments the individuals are composed of. The standard tool for this purpose is the ancestral recombination graph (ARG), first formulated by Hudson [32] in 1983. His original version was for two sites, but generalisations to an arbitrary number of sites [14, 30] and continuous versions ([21, Section 3.4] or [34]) are immediate. The stochastic models of recombination are related to their deterministic counterparts via a dynamical law of large numbers as population size tends to infinity. Nevertheless, deterministic and stochastic approaches have largely led separate lives for decades. It is the goal of this article to review recent progress to build bridges between them by introducing the genealogical picture into the deterministic equations. 
The corresponding ancestral processes remain random even in the deterministic limit, since they describe the history of single individuals (or a finite sample thereof). They lead to stochastic representations of the solutions of the deterministic equations and shed new light both on their dynamics and their asymptotic behaviour. A similar programme has been carried out for mutation-selection models; see [3, 19] as well as the review [7].

17.2 Moran model with recombination

Let us start from the Moran model with recombination (in continuous time), which we recapitulate here from [16, 24, 25]. We consider a linear arrangement (or sequence) of $n$ discrete positions called sites, which are collected in the set $S = \{1, \dots, n\}$. A site may be understood as a nucleotide site or a gene locus. Throughout, we consider sequences as (haploid) individuals, that is, we think at the level of gametes (rather than that of diploid individuals that carry two copies of the genetic information; see the contributions by Sturm [41] and by Birkner and Blath [15] in this volume for explicitly diploid population models and the corresponding genealogies). Site $i$ is occupied by

Figure 17.2.1. Result of a double crossover between sites $i$ and $i+1$ and between $j$ and $j+1$ ($1 \le i < j < n$): with probability $r(\mathcal{A})$, the parents $x_1, \dots, x_n$ and $y_1, \dots, y_n$ produce the offspring $x_1, \dots, x_i, y_{i+1}, \dots, y_j, x_{j+1}, \dots, x_n$. Top: full details of parental sequences; bottom: a version that marginalises over the letters that do not end up in the offspring.

a letter $x_i \in X_i$, where $X_i$ is a finite set, $1 \le i \le n$. If sites are nucleotide sites, a natural choice for each $X_i$ is the nucleotide alphabet $\{\mathrm{A}, \mathrm{G}, \mathrm{C}, \mathrm{T}\}$; if sites are gene loci, $X_i$ is the set of alleles that can occur at locus $i$. The genetic type of each individual is thus described by the sequence $x = (x_1, x_2, \dots, x_n) \in X_1 \times \dots \times X_n =: X$, where $X$ is the type space¹.

In this setting, recombination means that a new individual is formed as a “mixture” of an (ordered) pair² of parents, say $x$ and $y$. We describe this mixture with the help of a partition $\mathcal{A}$ of $S$ into two parts. Namely, $\mathcal{A} = \{A_1, A_2\}$ means that the new individual copies the letters at all sites in $A_1$ from the first individual and the letters at all sites in $A_2$ from the second; this happens via a number of crossovers between the sequences, as illustrated in Figure 17.2.1. For reasons of symmetry, we need not keep track of which part (or block) was “maternal” and which was “paternal”. Altogether, whenever an offspring is created, its sites are partitioned between parents according to $\mathcal{A}$ with probability $r(\mathcal{A})$, where $r(\mathcal{A}) \ge 0$, $\sum_{\mathcal{A} \in \mathbb{P}_2(S)} r(\mathcal{A}) \le 1$, and $\mathbb{P}_2(S)$ is the set of all partitions of $S$ into two parts. The sum $\sum_{\mathcal{A} \in \mathbb{P}_2(S)} r(\mathcal{A})$ is the probability that some recombination event takes place during reproduction. With probability $r(\mathbf{1}) = 1 - \sum_{\mathcal{A} \in \mathbb{P}_2(S)} r(\mathcal{A})$, there is no recombination, in which case the offspring is the full copy of a single parent; here $\mathbf{1} := \{S\}$, the coarsest partition. We write $\mathbb{P}_{\le 2}(S) := \mathbb{P}_2(S) \cup \{\mathbf{1}\}$ for the set of partitions into at most two parts, and $\mathbb{P}(S)$ for the set of all partitions of $S$. The collection $\{r(\mathcal{A})\}_{\mathcal{A} \in \mathbb{P}_{\le 2}(S)}$ is known as the recombination distribution [17, p. 55].

Consider now a population of a constant number $N$ of haploid individuals (that is gametes) that evolves in continuous time as described next (see Figure 17.2.2).

¹ We restrict ourselves to a finite type space here for ease of presentation; but the results generalise to more general type spaces where the $X_i$ may be locally compact [1].
² We formulate the model and the results throughout for the (biologically realistic) case of two parents here. Everything generalises easily to the situation with an arbitrary number of parents, which is mathematically interesting as well. Indeed, most of the results are available in the general setting in the original articles.

Figure 17.2.2. A realisation of the Moran model with recombination forward in time, with $N = 5$. For example, in the second event, individual 3 is replaced by a recombined copy of individuals 2 and 3.

Each individual dies at a fixed rate, that is, it has an exponential lifespan whose parameter is this rate (the parameter simply sets the time scale). When an individual dies, it is replaced by a new one as follows. First, draw a partition $\mathcal{A}$ according to the recombination distribution. Then draw $|\mathcal{A}|$ parents from the population (the parents may include the individual that is about to die), uniformly and with replacement, where $|\mathcal{A}|$ is the number of parts in $\mathcal{A}$. If $|\mathcal{A}| = 2$, then $\mathcal{A}$ is of the form $\{A_1, A_2\}$, and the offspring inherits the sites in $A_1$ from the first parent and the sites in $A_2$ from the second, as described above. If $|\mathcal{A}| = 1$ (and thus $\mathcal{A} = \mathbf{1}$), the offspring is a full copy of a single parent (again chosen uniformly from among all individuals); this is called a (pure) resampling event. All events are independent of each other.

Note that the model may equivalently be formulated in terms of reproducing rather than dying individuals, in the following way. Every individual reproduces at the same rate at which individuals die in the formulation above, draws a partition $\mathcal{A}$ according to the recombination distribution, and picks $|\mathcal{A}| - 1$ partners from the population; the offspring individual is pieced together according to $\mathcal{A}$ from the “active” individual and the partners, and replaces a uniformly chosen individual. Note also that there is substantial interest in more general reproduction models, where a positive proportion of the population is replaced by the recombined offspring of a single pair of parents; see the contributions by Birkner and Blath [15], Freund [27], as well as Greven and den Hollander [29] in this volume.

We identify the population at time $t$ with a (random) counting measure $Z_t^{(N)}$ on $X$, where the upper index indicates the dependence on population size. Namely, $Z_t^{(N)}(\{x\}) \ge 0$ denotes the number of individuals of type $x \in X$ at time $t$, and
$$Z_t^{(N)}(A) = \sum_{x \in A} Z_t^{(N)}(\{x\}) \quad \text{for } A \subseteq X.$$
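The replacement mechanism described above translates directly into an event-driven simulation. The following is a minimal sketch under our own conventions, not code from the cited articles: types are tuples of letters, a two-block partition is written as a pair of tuples of 0-based site indices, and the names `moran_recombination`, `rec_dist` and `death_rate` are hypothetical. The function returns the population as a counting measure on the type space, stored as a `Counter`.

```python
import random
from collections import Counter


def moran_recombination(population, rec_dist, death_rate, t_end, rng):
    """Simulate the Moran model with recombination up to time t_end.

    population: list of types, each a tuple of letters (one letter per site)
    rec_dist:   dict mapping a two-block partition (block_1, block_2), given
                as tuples of 0-based site indices, to its probability r(A);
                the remaining probability mass is pure resampling
    """
    pop = list(population)
    size = len(pop)
    partitions = list(rec_dist)
    weights = [rec_dist[a] for a in partitions]
    no_recombination = max(0.0, 1.0 - sum(weights))
    clock = 0.0
    while True:
        # Every individual dies at rate death_rate, so deaths occur at
        # total rate size * death_rate.
        clock += rng.expovariate(death_rate * size)
        if clock > t_end:
            break
        dying = rng.randrange(size)
        choice = rng.choices(partitions + [None], weights + [no_recombination])[0]
        if choice is None:
            # Pure resampling: full copy of one uniformly chosen parent.
            child = pop[rng.randrange(size)]
        else:
            _, block_2 = choice
            first = pop[rng.randrange(size)]
            second = pop[rng.randrange(size)]
            letters = list(first)           # sites in block_1 come from the first parent
            for site in block_2:
                letters[site] = second[site]  # sites in block_2 come from the second
            child = tuple(letters)
        pop[dying] = child
    return Counter(pop)


if __name__ == "__main__":
    start = [("a", "b")] * 5 + [("A", "B")] * 5
    z = moran_recombination(start, {((0,), (1,)): 0.5}, 1.0, 5.0, random.Random(0))
    print(dict(z))
```

Parents are drawn with replacement and may include the dying individual, as in the description above; the population size therefore stays constant throughout.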

We can also write
$$Z_t^{(N)} = \sum_{x \in X} Z_t^{(N)}(\{x\})\, \delta_x$$
in terms of point measures on $X$. Since our Moran population has constant size $N$, we have $\|Z_t^{(N)}\| = N$ for all times, where
$$\|Z_t^{(N)}\| := Z_t^{(N)}(X) = \sum_{x \in X} Z_t^{(N)}(\{x\})$$
is the norm (or total variation) of $Z_t^{(N)}$. This way, $(Z_t^{(N)})_{t \ge 0}$ is a Markov chain in continuous time with values in
$$E := \big\{ z \in \{0, \dots, N\}^{|X|} : \|z\| = N \big\},$$
where $|X|$ is the number of elements in $X$.

We will describe the action of recombination on (positive) measures with the help of so-called recombinators as introduced in [9]. Let $\mathcal{M}_+(X)$ be the set of all positive, finite measures on $X$, where we understand $\mathcal{M}_+(X)$ to include the zero measure. Define the canonical projection $\pi_I : X \to \mathop{\times}_{i \in I} X_i =: X_I$, for $\varnothing \ne I \subseteq S = \{1, \dots, n\}$, by $\pi_I(x) = (x_i)_{i \in I} =: x_I$ as usual. For $\omega \in \mathcal{M}_+(X)$, the shorthand $\omega^I := \pi_I.\omega = \omega \circ \pi_I^{-1}$ indicates the marginal measure with respect to the sites in $I$, where $\pi_I^{-1}$ denotes the preimage under $\pi_I$, and the operation $.$ (where the dot is on the line and should not be confused with a multiplication sign) denotes the pushforward of $\omega$ with respect to $\pi_I$. In terms of coordinates, the definition may be spelled out as
$$\omega^I(x_I) = (\omega \circ \pi_I^{-1})(x_I) = \omega\big(\{x \in X : \pi_I(x) = x_I\}\big), \quad x_I \in X_I.$$
Note that $\omega^S = \omega$. Consider now $\mathcal{A} = \{A_1, \dots, A_m\} \in \mathbb{P}(S)$ and $\omega \in \mathcal{M}_+(X)$, and define the recombinator as
$$R_{\mathcal{A}}(\omega) := \frac{1}{\|\omega\|^{m-1}} \bigotimes_{A \in \mathcal{A}} \omega^A,$$
where $\bigotimes$ indicates the product measure and the definition extends consistently to $R_{\mathcal{A}}(0) = 0$. Note that $R_{\mathbf{1}}(\omega) = \omega$. Clearly, $\|R_{\mathcal{A}}(\omega)\| = \|\omega\|$ for all $\omega \in \mathcal{M}_+(X)$. In particular, $R_{\mathcal{A}}$ turns $\omega \ne 0$ into the (normalised) product measure of its marginals with respect to the blocks in $\mathcal{A}$. If $Z_t = z$ is the current population, we see that $\frac{1}{\|z\|} R_{\mathcal{A}}(z) = \frac{1}{N} R_{\mathcal{A}}(z)$ is the type distribution that results when a hypothetical individual is created by drawing marginal types (as encoded by $\mathcal{A} \in \mathbb{P}(S)$) from the current population, uniformly and with replacement.

We now use the recombinators to reformulate the Moran model with recombination in a compact way. Namely, since all individuals die at the same per-capita rate, the population loses type-$y$ individuals at this rate multiplied by $Z_t^{(N)}(\{y\})$.
Each loss is replaced by a new individual, which is sampled uniformly from N1 RA .Z t.N / / with probability r.A/ for A 2 P62 .S /. Therefore, when Z t.N / D z, the transition to z C ıx ıy occurs at rate .N / .zI y; x/ D

X A2P62 .S/

1 %.A/.RA .z//.¹xº/z.¹yº/; N

Ellen Baake and Michael Baake

370

where %.A/ D r.A/ is a recombination rate (in line with the continuous-time model)3 . The summand for A D 1 corresponds to pure resampling, whereas all other summands include recombination. Note that .N / includes “silent transitions” (x D y).  Law of large numbers. Consider now the family of processes Z t.N / t >0 with  N 2 N. Also, consider the normalised version N1 Z t.N / t >0 ; note that N1 Z t.N / is a random probability measure on X. For  N ! 1 and without any rescaling of the %.A/ or of time, the sequence N1 Z t.N / t >0 converges to the solution of the deterministic recombination equation4 X !P t D %.A/.RA .! t / ! t / (17.2.1) A2P2 .S/

with initial value $\omega_0 \in \mathcal{P}(X)$ (the set of probability measures on $X$), where we assume that
\[ \lim_{N \to \infty} \frac{Z_0^{(N)}}{N} = \omega_0. \]
The convergence to the differential equation (17.2.1) is a dynamical law of large numbers and due to [26, Theorem 11.2.1]. The precise statement as well as the proof are perfectly analogous to [5, Proposition 6], which assumes a slightly different recombination and sampling scheme, without consequence for the convergence claim.
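The recombinator and the right-hand side of (17.2.1) lend themselves to a small numerical experiment. The following is a minimal sketch rather than the authors' code: it assumes that a measure on $X$ is stored as an $n$-dimensional NumPy array (one axis per site, sites numbered from 0) and that a partition is a tuple of tuples of site indices. The function names and the rates are hypothetical.

```python
import numpy as np

def marginal(omega, block):
    """Marginal of omega (an n-dimensional array) on the sites listed in block."""
    drop = tuple(i for i in range(omega.ndim) if i not in block)
    return omega.sum(axis=drop) if drop else omega.copy()

def recombinator(omega, partition):
    """R_A(omega): product of the block marginals, rescaled to total mass ||omega||."""
    total = omega.sum()
    if total == 0:
        return np.zeros(omega.shape)
    result = np.ones(omega.shape)
    for block in partition:
        # Keep the axes of this block, set all other axes to length 1 for broadcasting.
        shape = [omega.shape[i] if i in block else 1 for i in range(omega.ndim)]
        result = result * marginal(omega, block).reshape(shape)
    return result / total ** (len(partition) - 1)

def rhs(omega, rates):
    """Right-hand side of (17.2.1): sum over A of rho(A) * (R_A(omega) - omega)."""
    return sum(rho * (recombinator(omega, A) - omega) for A, rho in rates.items())

rng = np.random.default_rng(0)
omega0 = rng.random((2, 3, 2))
omega0 /= omega0.sum()                      # a probability measure on three sites

A = ((0, 2), (1,))                          # sites 0 and 2 from one parent, site 1 from another
rates = {A: 0.7, ((0,), (1, 2)): 0.3}       # hypothetical recombination rates
recombined = recombinator(omega0, A)
```

In this sketch, `recombined` has the same total mass and the same marginals on the blocks of `A` as `omega0`, and `rhs(omega0, rates)` sums to zero, which reflects that the recombination equation preserves the total mass.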

17.3 Ancestral recombination graph and deterministic limit

Let us return to the finite-$N$ model and construct the type of an individual sampled randomly from the population at time $t$ (the “present”) by genealogical means. We do so by adapting the ARG (see [14] and, for overviews, [31, Section 5.4], [21, Sections 3.3, 8.4] or [42, Section 7.2.4]) to our model and a sample of size 1. The type of an individual at present, together with its ancestry, can thus be constructed by a three-step procedure as illustrated in Figure 17.3.1. First, we run a partitioning process $(\Sigma_\tau^{(N)})_{0 \leq \tau \leq t}$. Here, $(\Sigma_\tau^{(N)})_{\tau \geq 0}$ is a Markov chain in continuous time on $\mathbb{P}(S)$, whose time axis is directed into the past; we use the variables $t$ and $\tau$ throughout for forward and backward time, respectively, so $\tau = t$ in backward time corresponds to $t = 0$ in forward time. The process starts with the coarsest partition $\Sigma_0^{(N)} = \mathbf{1}$,

$^3$ Note that the meaning of $\varrho(\mathcal{A})$ as a recombination rate is best understood by recalling the equivalent formulation of the model where every individual reproduces at a fixed rate and then picks partition $\mathcal{A}$ with probability $r(\mathcal{A})$.
$^4$ The generalisation to an arbitrary number of parents, that is $\mathcal{A} \in \mathbb{P}(S)$, is treated in [1]. The special case $\mathcal{A} \in \mathbb{P}_{\leq 2}(S)$ is then obtained by setting $\varrho(\mathcal{A}) = 0$ for all $\mathcal{A} \notin \mathbb{P}_{\leq 2}(S)$. In any case, note that the summand for $\mathcal{A} = \mathbf{1}$ may or may not be included in the right-hand side of the equation, since it contributes nothing due to $R_{\mathbf{1}}(\omega) = \omega$.


Ancestral lines under recombination

[Figure 17.3.1, three panels: in each panel, $\Sigma_t^{(N)} = \{\{1, 3, 5\}, \{2, 4\}\}$ and $\Sigma_0^{(N)} = \{\{1, 2, 3, 4, 5\}\}$; the two blocks carry the letters $(x_1, *, x_3, *, x_5)$ and $(*, x_2, *, x_4, *)$, which are combined into $(x_1, x_2, x_3, x_4, x_5)$.]

Figure 17.3.1. Example realisation of the partitioning process (top), assigning letters to the parts (middle), and propagating them downwards (bottom).


that is, we consider the (intact) sequence of one individual at time $t$. Then $\Sigma_\tau^{(N)}$, $\tau \geq 0$, describes the partitioning of sites into parental individuals at time $\tau$ before the present; sites in the same block (in different blocks) belong to the same (to different) individuals. Clearly, $|\Sigma_\tau^{(N)}|$ is the number of ancestral individuals at time $\tau$. The process $(\Sigma_\tau^{(N)})_{\tau \geq 0}$ consists of splitting and coalescence events (and combinations thereof), is independent of the types, and will be described in detail below.

In the second step, a letter is assigned to each site of $S$ at $\tau = t$ (that is at forward time 0) as follows. For every part of $\Sigma_t^{(N)}$, pick an individual from the initial population $Z_0^{(N)}$ (without replacement) and copy its letters to the sites in the block considered. For illustration, also assign a colour to each block, thus indicating different parental individuals. In the last step, the letters and colours are propagated downwards (that is forward in time) according to the realisation of $(\Sigma_\tau^{(N)})_{0 \leq \tau \leq t}$ laid down in the first step.

Let us now describe the partitioning process more precisely, following [24, 25] but specialising to $\Sigma_0^{(N)} = \mathbf{1}$. Since we also trace back sites in subsets $U \subseteq S$ (rather than complete sequences), we need the corresponding marginal recombination rates
\[ \varrho^U(\mathcal{B}) := \sum_{\substack{\mathcal{A} \in \mathbb{P}_{\leq 2}(S) \\ \mathcal{A}|_U = \mathcal{B}}} \varrho(\mathcal{A}) \tag{17.3.1} \]

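for any $\mathcal{B} \in \mathbb{P}_{\leq 2}(U)$, where $\mathcal{A}|_U$ is the partition of $U$ induced by $\mathcal{A}$; namely, $\mathcal{A}|_U = \{ A \cap U : A \in \mathcal{A},\ A \cap U \neq \varnothing \}$. Clearly, $\varrho^S(\mathcal{B}) = \varrho(\mathcal{B})$ and $\varrho^U(\mathcal{B})$ is the sum of the rates of all recombination events that lead to partition $\mathcal{B}$ under the projection to $U$, as illustrated in Figure 17.3.2. Note that, for $|U| = 1$, the only recombination parameter is $\varrho^U(\mathbf{1}) = 1$ (note that we use $\mathbf{1}$ to indicate the coarsest partition throughout, where the meaning is always clear from the upper index, so here $\mathbf{1} = \{U\}$).

[Figure 17.3.2: the set $\{\mathcal{A} \in \mathbb{P}_{\leq 2}(S) : \mathcal{A}|_U = \mathcal{B}\}$ of partitions of $S$ and its projection $\mathcal{B} \in \mathbb{P}_{\leq 2}(U)$.]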

Figure 17.3.2. The marginal recombination rate for a partition B of a subset U is the sum of all recombination rates for partitions A of the original set S that lead to B under projection to U .
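The marginalisation in (17.3.1) amounts to grouping the partitions of $S$ by their restriction to $U$ and adding up the rates within each group. A minimal sketch, assuming partitions are represented as frozensets of frozensets and using hypothetical rates on $S = \{1, 2, 3, 4\}$ (the helper names `restrict`, `marginal_rates` and `part` are illustrative only):

```python
from collections import defaultdict

def restrict(partition, U):
    """Partition of U induced by `partition`: the nonempty intersections of its blocks with U."""
    U = frozenset(U)
    return frozenset(block & U for block in partition if block & U)

def marginal_rates(rates, U):
    """rho^U(B) as in (17.3.1): sum of rho(A) over all A whose restriction to U equals B."""
    out = defaultdict(float)
    for A, rho in rates.items():
        out[restrict(A, U)] += rho
    return dict(out)

def part(*blocks):
    """Convenience constructor for a partition given as a list of blocks."""
    return frozenset(frozenset(b) for b in blocks)

# Hypothetical recombination rates on S = {1, 2, 3, 4}.
rates = {
    part({1, 2, 3, 4}): 0.4,
    part({1}, {2, 3, 4}): 0.1,
    part({1, 2}, {3, 4}): 0.2,
    part({1, 3}, {2, 4}): 0.3,
}

m = marginal_rates(rates, {2, 3})
```

Here the first two partitions leave sites 2 and 3 in one block, while the last two separate them, so `m` assigns 0.4 + 0.1 to $\{\{2, 3\}\}$ and 0.2 + 0.3 to $\{\{2\}, \{3\}\}$.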


Suppose now that the current state is $\Sigma_\tau^{(N)} = \mathcal{A} = \{A_1, \dots, A_m\}$. The waiting time to the next event is exponentially distributed with a parameter proportional to $m$, since each block corresponds to an individual, and each individual is independently affected at the same rate. When the event happens, choose one of the $m$ blocks, each with probability $\frac{1}{m}$. If $A_j$ is picked, the next state is obtained via a two-step procedure, namely a splitting step followed by a sampling step:

(1) In the splitting step, block $A_j$ turns into an intermediate state $a$ with probability $r^{A_j}(a)$, $a \in \mathbb{P}_{\leq 2}(A_j)$, where the marginal probabilities $r^U(\mathcal{B})$ are defined as the marginal recombination rates in equation (17.3.1) with $\varrho$ replaced by $r$. In detail:
    • With probability $r^{A_j}(\mathbf{1})$, the block $A_j$ remains unchanged. The resulting intermediate state (of this block) is $a = \mathbf{1}|_{A_j} = \{A_j\}$.
    • With probability $r^{A_j}(a)$, $a \in \mathbb{P}_2(A_j)$, block $A_j$ splits into two parts, $a = \{A_{j1}, A_{j2}\}$.
(2) In the following sampling step, each block of $a$ chooses out of $N$ parents, uniformly and with replacement. Among these, there are $m - 1$ parents that carry one block of $\mathcal{A} \setminus \{A_j\}$ each; the remaining $N - (m - 1)$ parents are empty, that is they do not carry ancestral material available for coalescence. Coalescence happens if the choosing block picks a parent that carries ancestral material; otherwise, the choosing block becomes an ancestral block of its own, which is available for coalescence from then onwards. The possible outcomes are certain coarsenings of $(\mathcal{A} \setminus \{A_j\}) \cup a$.

The long list of outcomes is provided explicitly in [25] for the special case of single crossovers and in [24] for general partitions into two parts, where the formal duality between the Moran model and the partitioning process is also established. Here, we only aim at the law of large numbers, which is again obtained as $N \to \infty$ without rescaling of parameters or time. In this limit, each of the blocks of the intermediate state $a$ ends up in a different individual, so there are no coalescence events and $a$ is the final state. As a consequence, the blocks of the partition are conditionally independent. This leads to the following result.

Proposition 17.3.1 (Law of large numbers for the ARG [24, 25]). The sequence of partitioning processes $(\Sigma_\tau^{(N)})_{\tau \geq 0}$, with $N \in \mathbb{N}$ and initial state $\Sigma_0^{(N)} \equiv \mathbf{1}$, converges in distribution, as $N \to \infty$, to the process $(\Sigma_\tau)_{\tau \geq 0}$ with initial state $\Sigma_0 = \mathbf{1}$ and generator matrix $Q$ defined by the nondiagonal elements
\[ Q_{\mathcal{A}\mathcal{B}} = \begin{cases} \varrho^A(a), & \text{if } \mathcal{B} = (\mathcal{A} \setminus \{A\}) \cup a \text{ for some } A \in \mathcal{A} \text{ and } a \in \mathbb{P}_2(A), \\ 0, & \text{for all other } \mathcal{B} \neq \mathcal{A}. \end{cases} \]

The limiting process may therefore be described as follows. If the current state is $\Sigma_\tau = \mathcal{A}$, each part $A$ of $\mathcal{A}$ is replaced by $a \in \mathbb{P}(A) \setminus \{A\}$ at rate $\varrho^A(a)$, independently of all other parts. Hence, $(\Sigma_\tau)_{\tau \geq 0}$ is a process of progressive refinements, that is $\Sigma_T \preccurlyeq \Sigma_\tau$ for all $T \geq \tau$.

[Figure 17.3.3: $\Sigma_t = \{\{1, 4\}, \{2, 3\}, \{5\}\}$, with blocks carrying the letters $(x_1, *, *, x_4, *)$, $(*, x_2, x_3, *, *)$ and $(*, *, *, *, x_5)$, leads to $\Sigma_0 = \{\{1, 2, 3, 4, 5\}\}$ with type $(x_1, x_2, x_3, x_4, x_5)$.]

Figure 17.3.3. Determining the type of an individual at time $t$ via the partitioning process $(\Sigma_\tau)_{0 \leq \tau \leq t}$.

The process $(\Sigma_\tau)_{\tau \geq 0}$, which is illustrated in Figure 17.3.3, may be understood as the $N \to \infty$ limit of the ARG started with a single individual. Note that, due to the continuous-time setting, at most one block may be refined at any given time (with probability one), but it may be any of the blocks. Since $Q$ is the Markov generator of $(\Sigma_\tau)_{\tau \geq 0}$, we can further conclude that $(e^{\tau Q})_{\mathcal{B}\mathcal{C}} = \mathbf{P}(\Sigma_\tau = \mathcal{C} \mid \Sigma_0 = \mathcal{B})$ (where $\mathbf{P}$ denotes probability), which is the transition probability from $\mathcal{B}$ to $\mathcal{C}$ during a time interval of length $\tau$. This leads us to the solution of the deterministic recombination equation.

Theorem 17.3.2 (Solution of the recombination equation [1]$^5$). The solution of the recombination equation (17.2.1) reads
\[ \omega_t = \sum_{\mathcal{A} \in \mathbb{P}(S)} a_t(\mathcal{A})\, R_{\mathcal{A}}(\omega_0) = \mathbf{E}\bigl( R_{\Sigma_t}(\omega_0) \mid \Sigma_0 = \mathbf{1} \bigr), \]
where $a_t(\mathcal{A}) = \mathbf{P}(\Sigma_t = \mathcal{A} \mid \Sigma_0 = \mathbf{1}) = (e^{tQ})_{\mathbf{1}\mathcal{A}}$ and $\mathbf{E}$ denotes expectation.

Remark 17.3.3. With Theorem 17.3.2, we have found a stochastic representation of the solution of the (deterministic) differential equation (17.2.1). This reflects the fact that, while the time evolution of the composition of the infinite population follows a (dynamical) law of large numbers and is hence deterministic, the fate and ancestry of a single individual retains some stochasticity. While ancestral processes are common tools when working with the stochastic processes that describe finite populations, they are not within the usual scope of deterministic population genetics.

$^5$ In fact, [1] treats the general case of an arbitrary number of parents, which corresponds to allowing for multiple (rather than binary) splits in the partitioning process; compare footnotes 2 and 4.
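For two sites, the stochastic representation can be checked by hand and numerically. The only nontrivial partition is $\{\{1\}, \{2\}\}$, so the limiting partitioning process jumps once, at rate $\varrho$, from the coarsest partition to the finest one, which gives $a_t(\mathbf{1}) = e^{-\varrho t}$. The following minimal sketch compares this with a numerical integration of (17.2.1); the Runge-Kutta integrator, the initial measure and the value of $\varrho$ are illustrative assumptions and not part of the source.

```python
import numpy as np

def recombine(omega):
    """R_A(omega) for A = {{1},{2}}: product of the two one-site marginals."""
    return np.outer(omega.sum(axis=1), omega.sum(axis=0)) / omega.sum()

def integrate(omega0, rho, t, steps=2000):
    """Classical fourth-order Runge-Kutta scheme for d(omega)/dt = rho * (R_A(omega) - omega)."""
    def f(w):
        return rho * (recombine(w) - w)
    h, w = t / steps, omega0.copy()
    for _ in range(steps):
        k1 = f(w)
        k2 = f(w + h / 2 * k1)
        k3 = f(w + h / 2 * k2)
        k4 = f(w + h * k3)
        w = w + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return w

def stochastic_representation(omega0, rho, t):
    """Sum over A of a_t(A) R_A(omega0), with a_t(1) = exp(-rho * t) for two sites."""
    a_coarse = np.exp(-rho * t)           # probability that no split has happened by time t
    return a_coarse * omega0 + (1 - a_coarse) * recombine(omega0)

omega0 = np.array([[0.5, 0.1],
                   [0.0, 0.4]])            # hypothetical initial type distribution
rho, t = 0.8, 1.5                          # hypothetical rate and time
numeric = integrate(omega0, rho, t)
exact = stochastic_representation(omega0, rho, t)
```

In this sketch the two results agree up to the discretisation error of the integrator, as expected from the theorem in the two-site case.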


Remark 17.3.4. In [1], the route of thought was, in fact, different from the one presented here. While we start from the ancestral process in this review, [1] works forward in time by means of classical methods from the theory of differential equations. The key was to establish the system of differential equations for the quantities $R_{\mathcal{A}}(\omega_t)$, $\mathcal{A} \in \mathbb{P}(S)$, by exploiting the properties of the recombinators. This procedure mimics the algebraic technique of Haldane linearisation and leads to the generator $Q$ in a purely analytic way. The partitioning process then emerged as an interpretation of the result.

Remark 17.3.5. It is easy to see that Theorem 17.3.2 extends to the duality relation
\[ \mathbf{E}\bigl( R_{\mathcal{B}}(\omega_t) \mid \omega_0 = \nu \bigr) = \mathbf{E}\bigl( R_{\Sigma_t}(\nu) \mid \Sigma_0 = \mathcal{B} \bigr) \]
for any $\nu \in \mathcal{P}(X)$ and $\mathcal{B} \in \mathbb{P}(S)$. Hence, since the left-hand side is deterministic, $R_{\mathcal{B}}(\omega_t) = \mathbf{E}( R_{\Sigma_t}(\omega_0) \mid \Sigma_0 = \mathcal{B})$ for any initial condition $\omega_0 \in \mathcal{P}(X)$.

Let us now turn to the evaluation of the $a_t$ of Theorem 17.3.2. It has been shown$^6$ in [2] that, in the generic case that the rates appearing in the exponents are all distinct, $a_t(\mathcal{A})$ can be given as a linear combination of exponentials in $t$, one for each partition $\mathcal{B}$ in the summation range, with coefficients that depend on $\mathcal{A}$ and $\mathcal{B}$ (17.3.2). […]

More generally, it is also of interest to consider the inhomogeneous counterpart, where $Q = Q(t)$ is time dependent; see [8, Addendum] for an example in the case of single crossovers. Let $M(t)$ again denote the solution of the Cauchy problem, which is unique under mild assumptions on $Q$ by general principles [23]. Clearly, $M(t)$ is still the matrix of transition probabilities until time $t$ and the underlying process satisfies the Markov property, while the semigroup property is lost. There are now two scenarios to be distinguished as follows. When the generator family $(Q(t))_{t \geq 0}$ is commuting, so $Q(t)Q(s) = Q(s)Q(t)$ for all $t, s \geq 0$, one gets
\[ M(t) = \exp\Bigl( \int_0^t Q(\tau)\, \mathrm{d}\tau \Bigr) \tag{17.3.4} \]
or, more generally, $M(t, s) = \exp\bigl( \int_t^s Q(\tau)\, \mathrm{d}\tau \bigr)$, with $M(t, s) M(s, r) = M(t, r)$ for $r \geq s \geq t \geq 0$, also known as the flow property. In general, however, the generators $Q(t)$ need not commute, and equation (17.3.4) has to be replaced by the more general Peano–Baker formula; see [10] for details. It can still be evaluated in some simple cases, and the flow property remains valid.

Let us finally turn to the asymptotic behaviour of the solution of the recombination equation. It can, of course, be read off equation (17.3.2), but it is more instructive to argue directly on the grounds of $(\Sigma_t)_{t \geq 0}$. The following consequence of Proposition 17.3.1 and Theorem 17.3.2 is then immediate.

Corollary 17.3.7 (Asymptotic behaviour of recombination equation). Assume without loss of generality that $\varrho^{\{i, i+1\}}(\{\{i\}, \{i+1\}\}) > 0$ for all $i \in S \setminus \{n\}$ (if this is not the case, glue sites $i$ and $i + 1$ together so that they form a single site). The partitioning process is then absorbing, with
\[ \lim_{t \to \infty} \Sigma_t = \bigl\{ \{1\}, \{2\}, \dots, \{n\} \bigr\} \]
almost surely and independently of $\Sigma_0$, and
\[ \lim_{t \to \infty} \omega_t = \bigotimes_{i=1}^{n} (\pi_i.\omega_0). \]
The convergence to the limit is exponentially fast.


17.4 An explicit solution for single-crossover recombination

There is an important special case that allows for a closed solution of the Markov semigroup, beyond the somewhat deceptive notation $e^{tQ}$ for the Markov semigroup generated by $Q$. This is the case of single crossovers of two parental gametes, which is also highly relevant biologically: Since crossovers are rare, it is unlikely that two or more of them happen in a given reproduction event, in any sequence of moderate length. The assumption of single crossovers is also inherent in [15].

We speak of single-crossover recombination if $\varrho(\mathcal{A}) > 0$ implies $\mathcal{A} \in \mathbb{I}_{\leq 2}(S)$. Here, $\mathbb{I}(S)$ is the set of interval partitions of $S$, $\mathbb{I}_{\leq 2}(S)$ is the set of interval partitions of $S$ into at most two parts, and $\mathbb{I}_2(S)$ is the set of interval partitions of $S$ into exactly two parts.$^7$ Clearly,
\[ \mathbb{I}_2(S) = \{ \mathcal{A}_k : 1 \leq k \leq n - 1 \}, \]
where $\mathcal{A}_k := \{\{1, 2, \dots, k\}, \{k + 1, \dots, n\}\}$. The partition $\mathcal{A}_k$ is the result of a single-crossover event after site $k$. Obviously, there is a one-to-one correspondence between the elements of $\mathbb{I}_2(S)$ and those of $S \setminus \{n\}$. Likewise, there is a one-to-one correspondence between $\mathbb{I}(S)$ and the set of subsets of $S \setminus \{n\}$. Let $G = \{j_1, \dots, j_{|G|}\} \subseteq S \setminus \{n\}$, with $j_1 < j_2 < \dots < j_{|G|}$. Let then $\kappa(\varnothing) = \mathbf{1}$, and, for $G \neq \varnothing$, let $\kappa(G)$ denote the interval partition
\[ \kappa(G) := \bigl\{ \{1, \dots, j_1\}, \{j_1 + 1, \dots, j_2\}, \dots, \{j_{|G|} + 1, \dots, n\} \bigr\}. \]
In particular, $\kappa(S \setminus \{n\}) = \{\{1\}, \dots, \{n\}\}$. It is clear that $\kappa(H) \preccurlyeq \kappa(G)$ if and only if $G \subseteq H$. It is also obvious that $\kappa$ defines a bijection; its inverse,
\[ \varphi = \kappa^{-1}, \tag{17.4.1} \]
associates with every interval partition of $S$ the corresponding subset of $S \setminus \{n\}$, so that $\varphi(\kappa(G)) = G$ for all $G \subseteq S \setminus \{n\}$. It is clear that $\kappa(G)$ may alternatively be represented as
\[ \kappa(G) = \mathbf{1} \wedge \mathcal{A}_{j_1} \wedge \mathcal{A}_{j_2} \wedge \dots \wedge \mathcal{A}_{j_{|G|}}, \tag{17.4.2} \]
where $\wedge$ denotes the coarsest common refinement; note that the action of $\wedge$ is commutative. In particular, one has $\kappa(G \cup \{k\}) = \kappa(G) \wedge \mathcal{A}_k$. More precisely, let $\kappa(G) = \mathcal{B} = \{B_1, \dots, B_m\}$ and $k \in S \setminus \{n\}$. Then
\[ \mathcal{B} \wedge \mathcal{A}_k = \begin{cases} \mathcal{B}, & k \in G, \\ (\mathcal{B} \setminus \{B_i\}) \cup \mathcal{A}_k|_{B_i}, & k \in S \setminus (G \cup \{n\}), \end{cases} \]
where, in the latter case, $B_i$ is the unique block that contains $k$; the other blocks are not affected.

$^7$ The case of interval partitions with an arbitrary number of parts is analysed in [11].


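Let us now connect this to the partitioning process. Assume that we have $\Sigma_\tau = \mathcal{B} = \{B_1, \dots, B_m\} = \kappa(G)$ for some $G \subseteq S \setminus \{n\}$ and fix one index $1 \leq i \leq m$. Evaluating the rates in Proposition 17.3.1 with the help of the marginal recombination rates (17.3.1) then reveals that, in the single-crossover case, the only (non-silent) transitions involving block $B_i$ are
\[ \mathcal{B} \mapsto (\mathcal{B} \setminus \{B_i\}) \cup \mathcal{A}_k|_{B_i} = \mathcal{B} \wedge \mathcal{A}_k, \quad \text{at rate } \varrho(\mathcal{A}_k) \text{ for all } k \in B_i \setminus (G \cup \{n\}). \]
If all blocks are taken into account, we therefore get the transitions
\[ \kappa(G) = \mathcal{B} \mapsto \mathcal{B} \wedge \mathcal{A}_k = \kappa(G \cup \{k\}), \quad \text{at rate } \varrho(\mathcal{A}_k) \text{ for all } k \in S \setminus (G \cup \{n\}). \tag{17.4.3} \]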

Since $\kappa(G \cup \{k\})$ is again an interval partition, it is clear that $(\Sigma_\tau)_{\tau \geq 0}$, when started in $\mathbb{I}(S)$, will never leave $\mathbb{I}(S)$.

Remark 17.4.1. The property that recombination according to $\mathcal{A}$ induces the transition from $\mathcal{B}$ to $\mathcal{B} \wedge \mathcal{A}$ is a special (and decisive) property of the single-crossover setting, where $\mathcal{A} \in \mathbb{I}_{\leq 2}(S)$ and $\mathcal{B} \in \mathbb{I}(S)$, which implies that $\mathcal{A}$ refines at most one block of $\mathcal{B}$. This is not true in the general case, where the possible refinements are considerably more complex.

We are now well prepared to calculate $a_t$. We could work via the matrix exponential of $Q$ and use its special structure resulting from the restriction to $\mathbb{I}(S)$; however, we pursue a more elegant approach based on equations (17.4.2) and (17.4.3). To this end, let $\Sigma_0 = \mathbf{1}$ and conclude from equation (17.4.3) that $(\Sigma_\tau)_{\tau \geq 0}$ is governed by the arrival of $\mathcal{A}_k$-events that happen independently of each other at rate $\varrho(\mathcal{A}_k)$. The waiting times $T_k$ until $\mathcal{A}_k$ appears are therefore independent and exponentially distributed with parameters $\varrho(\mathcal{A}_k)$. Let now $\mathcal{C}$ be an interval partition as in equation (17.4.2), that is $\mathcal{C} = \kappa(G)$ for some $G \subseteq S \setminus \{n\}$ (so $G = \varphi(\mathcal{C})$). Taking equations (17.4.2) and (17.4.3) together, we see that $\Sigma_t = \mathcal{C}$ if and only if all $\mathcal{A}_k$-events with $k \in G$ have occurred, while all $\mathcal{A}_j$-events with $j \in S \setminus (G \cup \{n\})$ have not. We therefore get
\[ a_t(\mathcal{C}) = \mathbf{P}(\Sigma_t = \mathcal{C} \mid \Sigma_0 = \mathbf{1}) = \prod_{k \in G} \mathbf{P}(T_k < t) \prod_{\ell \in S \setminus (G \cup \{n\})} \mathbf{P}(T_\ell > t) = \prod_{k \in G} \bigl( 1 - e^{-t \varrho(\mathcal{A}_k)} \bigr) \prod_{\ell \in S \setminus (G \cup \{n\})} e^{-t \varrho(\mathcal{A}_\ell)}. \]
With these coefficients, Theorem 17.3.2 indeed turns into an explicit and simple solution of the recombination equation. Let us summarise our result as follows.


Corollary 17.4.2 (Single-crossover recombination). Assume single-crossover recombination, that is $\varrho(\mathcal{A}) > 0$ implies $\mathcal{A} \in \mathbb{I}_{\leq 2}(S)$. The probability vector $a_t$ from Theorem 17.3.2 is then given by $a_t(\mathcal{C}) = 0$ if $\mathcal{C} \notin \mathbb{I}(S)$ and, for $\mathcal{C} \in \mathbb{I}(S)$, by
\[ a_t(\mathcal{C}) = \prod_{k \in G} \bigl( 1 - e^{-t \varrho(\mathcal{A}_k)} \bigr) \prod_{\ell \in S \setminus (G \cup \{n\})} e^{-t \varrho(\mathcal{A}_\ell)}, \]
where $G = \varphi(\mathcal{C})$ of equation (17.4.1).

In fact, the content of Corollary 17.4.2 was originally obtained by analytic means in [9]; we have recovered it here in genealogical terms. Note that the exponential convergence to the product measure of Corollary 17.3.7 is obvious here, from the explicit formula for the $a_t(\mathcal{C})$.
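The product formula is straightforward to evaluate. The following minimal sketch assumes that an interval partition is encoded by its cut set $G$ (the sites after which a crossover has occurred) and that the crossover rates are given as a dictionary mapping $k$ to $\varrho(\mathcal{A}_k)$; the rates, the time and the function names are hypothetical.

```python
import itertools
import math

def a_t(G, t, rho):
    """a_t(C) for the interval partition C with cut set G; rho maps site k to rho(A_k)."""
    prob = 1.0
    for k, rate in rho.items():
        p_occurred = 1.0 - math.exp(-t * rate)   # P(T_k < t)
        prob *= p_occurred if k in G else 1.0 - p_occurred
    return prob

def cuts_to_partition(G, n):
    """Interval partition of {1, ..., n} with a cut after each site in G."""
    bounds = [0] + sorted(G) + [n]
    return [list(range(lo + 1, hi + 1)) for lo, hi in zip(bounds, bounds[1:])]

n = 4
rho = {1: 0.5, 2: 0.2, 3: 1.0}   # hypothetical crossover rates after sites 1, 2, 3
t = 0.7

# Enumerate all subsets G of {1, ..., n-1}, that is all interval partitions of S.
subsets = [set(c) for r in range(n) for c in itertools.combinations(range(1, n), r)]
dist = {frozenset(G): a_t(G, t, rho) for G in subsets}
```

Since the events occur independently, the values in `dist` form a probability vector over the $2^{n-1}$ interval partitions, and the entry for the empty cut set equals $e^{-t \sum_k \varrho(\mathcal{A}_k)}$, the probability that no crossover has occurred by time $t$.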

17.5 Recombination in discrete time

Let us finally turn our attention to the discrete-time analogue of equation (17.2.1), namely the discrete-time dynamical system
\[ \omega_{t+1} = \sum_{\mathcal{A} \in \mathbb{P}_{\leq 2}(S)} r(\mathcal{A})\, R_{\mathcal{A}}(\omega_t), \tag{17.5.1} \]
which is often considered in population genetics [17, 20, 35, 36]. Here, $t \in \mathbb{N}_0$ now denotes discrete time (counting generations); the initial condition is again $\omega_0 \in \mathcal{P}(X)$. The iteration describes the synchronous formation of a new population from the parental one. The parameters are now the recombination probabilities $r(\mathcal{A})$ for $\mathcal{A} \in \mathbb{P}_{\leq 2}(S)$. Obviously, $\omega_{t+1}$ is a convex combination of $\omega_t$ recombined in all possible ways, so $\mathcal{P}(X)$ is preserved under the iteration.

In analogy with the derivation of the continuous-time recombination equation as the limit of a finite-$N$ Moran model, the discrete-time recombination equation may be obtained as the law of large numbers of an underlying Wright–Fisher model with recombination; see [6] for the special case of single crossovers. Rather than working this out explicitly, we simply state the plausible fact that there is again an underlying partitioning process, $(\Sigma_\tau)_{\tau \in \mathbb{N}_0}$. This is now a Markov chain in discrete time, again with values in $\mathbb{P}(S)$ and starting at $\Sigma_0 = \mathbf{1}$. When $\Sigma_\tau = \mathcal{A}$, in the time step from $\tau$ to $\tau + 1$, part $A$ of $\mathcal{A}$ is replaced by $a \in \mathbb{P}(A)$ with probability $r^A(a)$, independently for each $A \in \mathcal{A}$. Note that, in contrast to the continuous-time case, several parts can be refined at the same time, which makes the discrete-time case actually more complicated. Of course, $a = \{A\}$, which means no action on this part, is also possible. Put together, it is not difficult to verify that one ends up with the Markov transition matrix $M$ with elements
\[ M_{\mathcal{A}\mathcal{B}} = \begin{cases} \prod_{A \in \mathcal{A}} r^A(\mathcal{B}|_A), & \text{if } \mathcal{B} \preccurlyeq \mathcal{A}, \\ 0, & \text{otherwise.} \end{cases} \]
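For two sites, the matrix $M$ can be written down by hand: from the coarsest partition, the chain moves to $\{\{1\}, \{2\}\}$ with probability $r := r(\{\{1\}, \{2\}\})$ and stays put otherwise, while the finest partition is absorbing. The following minimal sketch compares the direct iteration of (17.5.1) with the mixture of recombined initial measures weighted by the row of $M^t$ that belongs to the coarsest partition; the value of $r$, the initial measure and the ordering of the states are assumptions made for illustration.

```python
import numpy as np

def recombine(omega):
    """R_A(omega) for A = {{1},{2}}: product of the two one-site marginals."""
    return np.outer(omega.sum(axis=1), omega.sum(axis=0)) / omega.sum()

r = 0.3   # hypothetical probability of the partition {{1},{2}}

# States ordered as (finest, coarsest), which makes M lower triangular.
M = np.array([[1.0, 0.0],
              [r, 1.0 - r]])

omega0 = np.array([[0.5, 0.1],
                   [0.0, 0.4]])   # hypothetical initial type distribution
t = 6

# Route 1: iterate the discrete-time recombination equation (17.5.1).
omega = omega0.copy()
for _ in range(t):
    omega = (1.0 - r) * omega + r * recombine(omega)

# Route 2: distribution of the partitioning process after t steps, started in the coarsest partition.
a = np.linalg.matrix_power(M, t)[1]
via_partitions = a[0] * recombine(omega0) + a[1] * omega0
```

Both routes give the same measure in this sketch, and `a[1]` equals $(1 - r)^t$, the probability that the two sites have never been separated in $t$ generations.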


In particular, $M = (M_{\mathcal{A}\mathcal{B}})_{\mathcal{A}, \mathcal{B} \in \mathbb{P}(S)}$ is a lower-triangular Markov matrix. (Let us note in passing that the triangular form, which also appears in the continuous-time case, motivated us to revisit the Markov embedding problem [12].) The analogue of Theorem 17.3.2 reads as follows.

Theorem 17.5.1 (Solution of the discrete-time recombination equation [1]). The solution of the recombination equation (17.5.1) is given by
\[ \omega_t = \sum_{\mathcal{A} \in \mathbb{P}(S)} a_t(\mathcal{A})\, R_{\mathcal{A}}(\omega_0) = \mathbf{E}\bigl( R_{\Sigma_t}(\omega_0) \mid \Sigma_0 = \mathbf{1} \bigr), \]
where $a_t(\mathcal{A}) = \mathbf{P}(\Sigma_t = \mathcal{A} \mid \Sigma_0 = \mathbf{1}) = (M^t)_{\mathbf{1}\mathcal{A}}$.

It is tempting to assume that, again in analogy with continuous time, the case with single crossovers might be amenable to a simple solution. This is, however, not true. The reason is that, in continuous time, the single-crossover events appear independently by the very nature of the continuous-time process, where the probability of two events occurring simultaneously is zero. In contrast, the single-crossover assumption in discrete time induces dependence. Namely, a crossover between a given pair of neighbouring sites precludes a crossover between another pair of neighbouring sites in the same block. With the help of Möbius inversion on a suitable poset of rooted forests, a solution was obtained nevertheless, but it is of surprising complexity [4]. However, the long-term behaviour is, once more, simple: Corollary 17.3.7 carries over, with $\varrho$ replaced by $r$.

Acknowledgements. It is our pleasure to thank Frederic Alberti for critically reading the manuscript, and two referees for helpful comments.

References

[1] E. Baake and M. Baake, Haldane linearisation done right: Solving the nonlinear recombination equation the easy way, Discrete Contin. Dyn. Syst. Ser. A 36 (2016), 6645–6656.
[2] E. Baake, M. Baake, and M. Salamat, The general recombination equation in continuous time and its solution, Discrete Contin. Dyn. Syst. Ser. A 36 (2016), 63–95; and 36 (2016), 2365–2366 (erratum and addendum).
[3] E. Baake, F. Cordero, and S. Hummel, A probabilistic view on the deterministic mutation–selection equation: Dynamics, equilibria, and ancestry via individual lines of descent, J. Math. Biol. 77 (2018), 795–820.
[4] E. Baake and M. Esser, Fragmentation process, pruning poset for rooted forests, and Möbius inversion, Markov Process. Related Fields 24 (2018), 57–84.
[5] E. Baake and I. Herms, Single-crossover dynamics: Finite versus infinite populations, Bull. Math. Biol. 70 (2008), 603–624.
[6] E. Baake and U. von Wangenheim, Single-crossover recombination and ancestral recombination trees, J. Math. Biol. 68 (2014), 1371–1402.
[7] E. Baake and A. Wakolbinger, Lines of descent under selection, J. Stat. Phys. 172 (2018), 156–174.
[8] M. Baake, Recombination semigroups on measure spaces, Monatsh. Math. 146 (2005), 267–278; and 150 (2007), 83–84 (addendum).
[9] M. Baake and E. Baake, An exactly solved model for mutation, recombination and selection, Can. J. Math. 55 (2003), 3–41; and 60 (2008), 264–265 (erratum).
[10] M. Baake and U. Schlägel, The Peano–Baker series, Proc. Steklov Inst. Math. 275 (2011), 167–171.
[11] M. Baake and E. Shamsara, The recombination equation for interval partitions, Monatsh. Math. 182 (2017), 243–269.
[12] M. Baake and J. Sumner, Notes on Markov embedding, Linear Algebra Appl. 594 (2020), 262–299.
[13] J. H. Bennett, On the theory of random mating, Ann. Human Genetics 18 (1954), 311–317.
[14] A. Bhaskar and Y. S. Song, Closed-form asymptotic sampling distributions under the coalescent with recombination for an arbitrary number of loci, Adv. Appl. Prob. 44 (2012), 391–407.
[15] M. Birkner and J. Blath, Genealogies and inference for populations with highly skewed offspring distributions, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 151–177.
[16] A. Bobrowski, T. Wojdyła, and M. Kimmel, Asymptotic behavior of a Moran model with mutations, drift and recombination among multiple loci, J. Math. Biol. 61 (2010), 455–473.
[17] R. Bürger, The Mathematical Theory of Selection, Recombination, and Mutation, Wiley, Chichester, 2000.
[18] F. B. Christiansen, Population Genetics of Multiple Loci, Wiley, Chichester, 1999.
[19] F. Cordero, Common ancestor type distribution: A Moran model and its deterministic limit, Stochastic Process. Appl. 127 (2017), 590–621.
[20] K. J. Dawson, The evolution of a population under recombination: How to linearise the dynamics, Linear Algebra Appl. 348 (2002), 115–137.
[21] R. Durrett, Probability Models for DNA Sequence Evolution, 2nd ed., Springer, New York, 2008.
[22] J. Y. Dutheil, Towards more realistic models of genomes in populations: The Markov-modulated sequentially Markov coalescent, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 383–408.
[23] K.-J. Engel and R. Nagel, One-Parameter Semigroups for Linear Evolution Equations, Springer, New York, 2000.
[24] M. Esser, Recombination models forward and backward in time, PhD thesis, Bielefeld University, 2017, urn:nbn:de:0070-pub-29102790.
[25] M. Esser, S. Probst, and E. Baake, Partitioning, duality, and linkage disequilibria in the Moran model with recombination, J. Math. Biol. 73 (2016), 161–197.
[26] S. N. Ethier and T. G. Kurtz, Markov Processes: Characterization and Convergence, Wiley, New York, 1986.
[27] F. Freund, Multiple-merger genealogies: Models, consequences, inference, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 179–202.
[28] H. Geiringer, On the probability theory of linkage in Mendelian heredity, Ann. Math. Stat. 15 (1944), 25–57.
[29] A. Greven and F. den Hollander, From high to low volatility: Spatial Cannings with block resampling and spatial Fleming–Viot with seed-bank, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 267–289.
[30] R. C. Griffiths and R. Marjoram, Ancestral inference from samples of DNA sequences with recombination, J. Comput. Biol. 3 (1996), 479–502.
[31] J. Hein, M. H. Schierup, and C. Wiuf, Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory, Oxford University, Oxford, 2005.
[32] R. R. Hudson, Properties of a neutral allele model with intragenic recombination, Theor. Popul. Biol. 23 (1983), 183–201.
[33] H. S. Jennings, The numerical results of diverse systems of breeding, with respect to two pairs of characters, linked or independent, with special relation to the effects of linkage, Genetics 2 (1917), 97–154.
[34] A. Lambert, V. Miró Pina, and E. Schertzer, Chromosome painting: How recombination mixes ancestral colors, preprint 2018, https://arxiv.org/abs/1807.09116; to appear in Ann. Appl. Probab.
[35] Y. I. Lyubich, Mathematical Structures in Population Genetics, Springer, Berlin, 1992.
[36] S. Martínez, A probabilistic analysis of a discrete-time evolution in recombination, Adv. Appl. Math. 91 (2017), 115–136.
[37] D. McHale and G. A. Ringwood, Haldane linearisation of baric algebras, J. Lond. Math. Soc. (2) 28 (1983), 17–26.
[38] T. Ohta and M. Kimura, Linkage disequilibrium due to random genetic drift, Genet. Res. 13 (1969), 47–55.
[39] R. B. Robbins, Some applications of mathematics to breeding problems III, Genetics 3 (1918), 375–389.
[40] Y. S. Song and J. S. Song, Analytic computation of the expectation of the linkage disequilibrium coefficient $r^2$, Theor. Popul. Biol. 71 (2007), 49–60.
[41] A. Sturm, Diploid populations and their genealogies, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 203–221.
[42] J. Wakeley, Coalescent Theory: An Introduction, Roberts, Greenwood Village, 2009.

Chapter 18

Towards more realistic models of genomes in populations: The Markov-modulated sequentially Markov coalescent

Julien Y. Dutheil

The development of coalescent theory paved the way to statistical inference from population genetic data. In the genomic era, however, coalescent models are limited due to the complexity of the underlying ancestral recombination graph. The sequentially Markov coalescent (SMC) is a heuristic that enables the modelling of complete genomes under the coalescent framework. While it empowers the inference of detailed demographic history of a population from as few as one diploid genome, current implementations of the SMC make unrealistic assumptions about the homogeneity of the coalescent process along the genome, ignoring the intrinsic spatial variability of parameters such as the recombination rate. Here, I review the historical developments of SMC models and discuss the evidence for parameter heterogeneity. I then survey approaches to handle this heterogeneity, focusing on a recently developed extension of the SMC.

18.1 Modelling the evolution of genomes in populations

When modelling the evolution of large genomic sequences at the population level, in particular for sexually reproducing species, a key biological mechanism to account for is meiotic recombination, which shuffles genetic material at each generation. We first introduce the concept of the ancestral recombination graph, needed to represent the complete genealogy of a sample undergoing recombination. We then review the statistical approaches used to fit models accounting for recombination to population genomics data.

18.1.1 The ancestral recombination graph

The evolution of the set of sequences carried by all individuals forming a population, generation after generation, can be modelled by a stochastic process, where each individual leaves a variable number of descendants in the next generation. As a result, at any position of the sequence, the genealogy of a sample of n individuals can be described by a tree (Figure 18.1.1 A) [35]. The tips of the tree represent the sampled individuals and the inner nodes their common ancestors. In the case of sexually reproducing organisms, which will be the focus of this chapter, the genealogy is not identical for every position in the sequence. During sexual reproduction, two individuals contribute part of their sequence to their descendant(s) in the next generation. The mechanism of recombination is responsible for randomly sampling the new sequence from the two parental ones (Figure 18.1.1 B). How often and where the recombination points occur will be discussed in Section 18.2. The consequences of the recombination process can be stated as follows: (1) the genealogy of the sequence on the left of a recombination breakpoint potentially differs from the genealogy of the sequence on the right, (2) the genealogies at two positions in the sequence are more likely to differ the larger the distance between the two positions, and (3) the genealogy of the complete sequence can no longer be described by a single tree, but by a collection of such trees and associated breakpoints. This collection of trees and breakpoints can be represented as a single graph, called the ancestral recombination graph (ARG) (Figure 18.1.1 C) [25]. The complexity of the ARG grows with the number of individuals (which dictates the size of the underlying trees) and the number of recombination events (which determines how many trees are needed to represent the history of the full sequences). The ARG represents the complete history of the sampled individuals, where the trees at each position (referred to as the “marginal genealogies”) are embedded [49]. It describes the history of each segment of the sampled sequences, tracing back their ancestors in potentially distinct individuals. Such segments, which have left descendants in the sample, are termed ancestral. Contained in the ARG is also the history of some non-ancestral segments, which did not leave a descendant in the sample, but were once part of a sequence that contained both ancestral and non-ancestral segments (Figure 18.1.1 D).
The characteristics of the ARG are determined by the demography of the population (the history of population size changes), the recombination landscape (where the recombination events occur), and also the selective forces acting on the sequence, as natural selection influences the distribution of the number of descendants for each individual, based on the nature of the sequences themselves. While the ARG contains the signature of the biological processes that shaped the genome sequences, it is unfortunately not directly accessible. In order to access the embedded information, it is necessary to model the evolution of sequences in populations.

18.1.2 The coalescent with recombination as a chronological process

When modelling evolution, the most intuitive approach is to consider the process chronologically, that is to model the state of the system generation after generation. One of the simplest models, the so-called Wright–Fisher process, considers that the gametes forming one generation are a random sample of the gametes produced at the previous generation, that is reproduction is a purely random process where each individual has the same a priori chance to contribute to the next generation. In addition, the population has a finite, constant size. A similar model, termed the Moran process, considers a slightly different set-up with overlapping generations [50]. The Wright–Fisher and Moran processes can both accommodate recombination, modelled by randomly choosing a breakpoint along the genome and exchanging the parental genetic segments (see the contribution of Baake and Baake [6] in this volume). In such

The Markov-modulated sequentially Markov coalescent




Figure 18.1.1. Genealogies and recombination: relationships between individuals and along sequences. A: example genealogy of a sample of five individuals at a given position in the sequences, under a Kingman coalescent. Tip nodes represent the samples (1–5) and inner nodes their common ancestors (6–9). B: illustration of the recombination tree resulting from a process without coalescence (large number approximation): two chromosomes (solid black and dotted grey) are paired during sexual reproduction and exchange segments at a breakpoint $x_1$. At the next generation, the descendant sequence recombines with another sequence (solid grey) at another breakpoint $x_2$, etc. The sequence of any sampled individual is therefore a mosaic of segments with distinct ancestors separated by a series of breakpoints ($x_1$, $x_2$, $x_3$). C: a simple ancestral recombination graph (ARG) representing the genealogy of a sample of size three with one recombination event. The ARG is a combination of a coalescence tree (as in A) and a recombination tree (as in B). For ease of interpretation, two mutation events, a and b, have been added. The relative coordinate of the recombination event is also indicated: $x = 0.4$, assuming a total sequence length of 1. D: partial graph showing the different classes of recombination events. Ancestral segments are depicted as filled rectangles, while non-ancestral segments are shown as simple lines (subfigure created after [47, Figure 1]).

Julien Y. Dutheil


processes, the fate of a genetic variant is purely stochastic and governed only by the population size. When conditioning on a sample of the result of the evolutionary process, a backwards-in-time modelling is used. Each sequence in the sample represents a lineage, and the aim of the model is to determine which lineages find a common ancestor in the past and when. Every time two lineages coalesce into a common ancestor, the number of lineages to model is reduced by one, until the last common ancestor of the sample is reached. This process of lineages merging backwards in time is termed the coalescent [35]. The probability that two lineages merge at a given generation back in time depends on the population size. When the population size is constant in time, the number of generations until coalescence follows a geometric distribution with parameter $\frac{1}{2 N_e}$, where $N_e$ is the effective population size. Assuming a large $N_e$, this discrete-time coalescent is well approximated by a continuous-time coalescent, where the divergence time between two sequences follows an exponential distribution with average $2 \cdot N_e$ generations. For convenience, time is, therefore, measured in “coalescence” units equal to $2 \cdot N_e$ generations, so that the mean divergence time between two sequences in a sample equals 1. In the coalescent with recombination process, recombination events are modelled in addition to coalescence events [30]. Backwards in time, a lineage undergoing a recombination event splits in two, the left and right sequences having distinct ancestors. Since the rates of coalescence and recombination events, at any time point, depend only on the current lineages, the process is Markovian in time [63]. This property enabled the development of simulation procedures and inference methods, allowing the estimation of various parameters by integrating over the unknown genealogy of a sample (e.g. [2, 14]). 
Such methods, however, do not scale well with the length of the modelled sequences, as the number of events in the underlying ARG grows with the sequence length [20], preventing efficient integration even with Markov chain Monte Carlo [75]. These methods are, therefore, restricted to small samples with relatively few loci.

18.1.3 The coalescent with recombination as a sequential process

Following the initial work by Simonsen and Churchill [63], Wiuf and Hein extended the two-locus model of coalescence with recombination to multiple loci [82]. The resulting process models the ARG sequentially along the genome rather than chronologically. This sequential coalescent with recombination aims at modelling the genealogy of the sample at position $i$ given the genealogies at previous positions. Indeed, genealogies at two distinct positions in the sequence are not independent: they are identical if no recombination event occurred between the two positions since the last common ancestor of the sample, and can only differ if at least one recombination event occurred. Despite this intuitive correlation structure, the coalescent with recombination is not Markovian along the sequence. Computing the probability distribution of the marginal genealogy at a given position proved to be quite challenging because


of long-range dependencies, the genealogy at position $i$ depending not only on the genealogy at position $i-1$, but on the genealogy at all positions 1 to $i-1$. With the goal of simplifying the likelihood calculation under the coalescent with recombination, McVean and Cardin proposed an approximation where certain types of coalescence events are ignored [49]. An intuitive description of the simplified process was provided by Marjoram and Wall [47], who recognised five types of recombination events on the ARG, based on the type of segments in the parental sequences on both sides of the recombination event (Figure 18.1.1 D): type 1 events occur in ancestral segments (events at $x_3$ and $x_4$ in Figure 18.1.1 D) while types 2–5 occur in non-ancestral segments. Type 2 events occur in so-called trapped genetic material [57], that is, non-ancestral segments flanked on both sides by ancestral segments (event at $x_1$ in Figure 18.1.1 D). Events of types 3, 4 and 5 occur in non-ancestral segments only flanked by non-ancestral segments on one or both sides (e.g. event at $x_2$ in Figure 18.1.1 D). Such events (types 3, 4 and 5) do not affect the sample generated by the corresponding ARG, and therefore do not impact the likelihood of the sample given the ARG. They can therefore be ignored without introducing any additional hypotheses, see [82] (types 4 and 5) and [31] (types 3, 4 and 5). The process of McVean and Cardin, which was further improved by Marjoram and Wall [47] and Hobolth and Jensen [29], also ignores type 2 recombination events, that is, recombination events occurring in trapped genetic material [57]. Doing so implies ignoring potential long-range dependencies between loci, and the distribution of samples generated by this approximated process is, therefore, different from that of the standard coalescent with recombination. 
The approximated process, however, has the additional property that the distribution of genealogies at position $i$ depends only on the genealogy at position $i-1$, and is, therefore, Markovian along the sequence. Such a process is referred to as the sequentially Markov coalescent (SMC) [47, 49]. Importantly, the SMC process generates samples with patterns of genetic diversity that are very similar to the ones generated by the full coalescent process [49]. The SMC, in particular, can be seen as a first-order Markov approximation of the true coalescent with recombination process [81], and higher-order extensions have been introduced [66]. Furthermore, the Markov property enables very efficient likelihood calculation using dynamic programming algorithms to integrate over all ARGs. Such methodology comes from the field of hidden Markov models (HMM), which we introduce in the next section.

18.1.4 Coalescent hidden Markov models

Because of the SMC approximation, likelihood calculation under a coalescent with recombination process represents a classical bioinformatic problem where the probability of an observed state in the sequence depends on an unobserved state, which is then said to be hidden. In the case studied here, the observed states are sequence polymorphisms (between two or more individual sequences) and the hidden states are the underlying marginal genealogies. HMMs have been broadly used in sequence analysis [15]. Coalescent hidden Markov models (CoalHMM) refer to HMMs where


Figure 18.1.2. Chronology of sequentially Markov coalescent (SMC) and coalescent hidden Markov models (CoalHMM). PSMC: pairwise SMC. MSMC: multiple SMC. ILS: incomplete lineage sorting. I: isolation model. IM: isolation with migration model.

the hidden states are genealogies. The term was introduced by Hobolth et al. [27] as the name of the first model developed, which we introduce later in this section, but was then extended to encompass a full class of models [65] (Figure 18.1.2). We denote by $\{A_i\}_{1 \le i \le L}$ the site-specific random variable of observed states in a sample of $M$ sequences of length $L$. Such states (denoted $\{\mathcal{A}_g\}_{1 \le g \le S}$) are, in the general case, a combination of the four nucleotides A, C, G and T (one per modelled sequence), with the possibility to additionally account for missing data (coded as N), so that $\{\mathcal{A}_g\}_{1 \le g \le S} \in \{A, C, G, T, N\}^M$ and $S < 5^M$ because of symmetry relationships between trees making some of them unidentifiable. We denote by $\{x_i\}_{1 \le i \le L}$ a particular realisation of $\{A_i\}_{1 \le i \le L}$, that is, the sequence data. Furthermore, we denote by $\{H_i\}_{1 \le i \le L}$ the site-specific random variable describing the marginal genealogies at each position in the sequences. In the general case, such genealogies are rooted trees with $M$ leaves. In HMM terminology, the probabilities of observing the sequence data $x_i$ at a given position $i$ given a realisation of $H_i$, $\Pr(A_i \mid H_i)$, are called the emission probabilities. $H_i$ is a random variable that has a continuous distribution. To make likelihood calculations tractable, this distribution is discretised, so that $H_i$ can take a finite number $n$ of hidden states, $\{\mathcal{H}_j\}_{1 \le j \le n}$. Under a discretised distribution of


hidden states, the emission probabilities for each position $i$ and hidden state $k$ can be more explicitly written as
$$e_{i,k}(x) = \Pr(A_i = x \mid H_i = \mathcal{H}_k).$$
We further introduce the so-called transition probability of a genealogy $\mathcal{H}_j$ at position $i-1$ to a genealogy $\mathcal{H}_k$ at position $i$ as
$$q_{i,j,k} = \Pr(H_i = \mathcal{H}_k \mid H_{i-1} = \mathcal{H}_j).$$

Given the set of emission and transition probabilities, we can then write the likelihood of the data by recursion. We define $F_{i,k}$ as the joint probability of the data (observed states) $x_1, \dots, x_i$ at positions 1 to $i$ and the ancestral genealogy $\mathcal{H}_k$ at position $i$:
$$F_{i,k} = \Pr(x_1, \dots, x_i, \mathcal{H}_k) = \begin{cases} f_k & \text{if } i = 0, \\ e_{i,k}(x_i) \cdot \sum_j q_{i,j,k} \cdot F_{i-1,j} & \text{if } i > 0, \end{cases} \tag{18.1.1}$$
where $\{f_k\}_{1 \le k \le n}$ denotes some initial conditions. These conditions may be set to $\frac{1}{n}$, or to the stationary distribution of the Markov chain (provided it exists). Equation (18.1.1) is called the forward algorithm and allows for the computation of the likelihood of the sequences as
$$\mathcal{L} = \sum_k F_{L,k}.$$
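The forward recursion can be sketched in a few lines of Python. This is a minimal illustration for a homogeneous model (emission and transition probabilities independent of the position $i$), not the implementation of any of the cited software; the function name `forward_likelihood`, the argument names and the two-state toy numbers are assumptions made for the example. Probabilities are multiplied directly, so for long sequences a real implementation would need rescaling or log-space arithmetic to avoid numerical underflow.

```python
def forward_likelihood(obs, initial, transition, emission):
    """Likelihood of `obs` under a homogeneous HMM, via the forward recursion.

    initial[k]       : f_k, initial condition for hidden state k
    transition[j][k] : q_{j,k}, probability of moving from state j to state k
    emission[k][x]   : e_k(x), probability of observing x given hidden state k
    """
    n = len(initial)
    forward = list(initial)                      # F_{0,k} = f_k
    for x in obs:                                # positions i = 1, ..., L
        forward = [
            emission[k][x] * sum(transition[j][k] * forward[j] for j in range(n))
            for k in range(n)
        ]
    return sum(forward)                          # sum over k of F_{L,k}


# Toy example (arbitrary numbers): two hidden states, two observed states
# (0 = sequences identical at the site, 1 = sequences differ at the site).
initial = [0.5, 0.5]
transition = [[0.9, 0.1],
              [0.2, 0.8]]
emission = [[0.99, 0.01],
            [0.90, 0.10]]
obs = [0, 0, 1, 0]

likelihood = forward_likelihood(obs, initial, transition, emission)
print(likelihood)
```

Each position costs $n^2$ multiplications (one sum over $j$ for every $k$), which is where the quadratic dependence on the number of hidden states comes from.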

This recursion is an example of dynamic programming, allowing for the integration over all possible ARGs very efficiently, as it scales in $O(L \cdot n^2)$. The symmetry relationships in the transition matrix and the relatively low frequency of the observed heterozygous states, however, allow for further improvement of the run time of the forward algorithm [26, 59, 76]. The likelihood function depends on a set of parameters $\Theta$, which includes the demographic model (effective population size and its variation) and the recombination rate. More complex models can be implemented, for instance allowing for population structure. These parameters can affect either the emission probabilities $e_{i,k}(x)$, the transition probabilities $q_{i,j,k}$, or both. The emission and transition probabilities define the type of model used. Importantly, most models assume that the process is homogeneous along the sequence, so that both $e_{i,k}(x)$ and $q_{i,j,k}$ are actually independent of $i$. In the following, we will review several examples of coalescent HMM models.

18.1.5 The two-genome case

When the genome sample consists only of two genomes, the marginal genealogies have a simpler encoding consisting of a single (continuous) number representing the time to the most recent common ancestor (TMRCA) of the two sequences. The


Figure 18.1.3. Demographic models and hidden states for CoalHMM with two and three sequences. A: Pairwise SMC. Hidden states correspond to discretised divergence times between two sequences. Parameters of the model potentially contain a species divergence time $\tau$ and several epochs of constant effective population sizes $\theta_k$. The CoalHMM model of Mailund et al. [42] uses only one epoch and $\tau$, the corresponding states $t_j$ being, therefore, drawn from the exponential distribution with mean $2 \cdot \theta$, shifted by $\tau$. The PSMC model of Li and Durbin [39] assumes a skyline model of multiple epochs, yet with individuals from the same population ($\tau = 0$). B: CoalHMM with three sequences and ILS. The hidden states correspond to four genealogies that differ both in time and order of the coalescence events. The model assumes constant but distinct ancestral effective population sizes $\theta_{12}$ and $\theta_{123}$, as well as the two species divergence times $\tau_{12}$ and $\tau_{123}$.

TMRCA can be further discretised into $n$ hidden states, each represented by a mean value $(t_j)_{1 \le j \le n}$. The transition probabilities between the hidden states can be calculated under the SMC. Variants of this model were developed independently by Li and Durbin [39] and Mailund et al. [42]. In the latter, the two genomes come from two distinct populations that diverged at a time $\tau$ ago (Figure 18.1.3 A). The common ancestral population is assumed to have a constant effective population size $\theta_{\mathrm{anc}}$. The TMRCA $(t_j)_{1 \le j \le n}$ follows an exponential distribution shifted by an amount of $\tau$. Mailund and collaborators applied this model to the newly sequenced genomes of two Orangutan subspecies in order to estimate their ancestral effective population size and the time of their last genetic exchange. They further extended this model to allow for a period of gene flow after the initial separation of the two populations [43]. In the Li and Durbin model, named pairwise sequentially Markov coalescent (PSMC), the two genomes come from a single population. This approach was further improved by Schiffels and Durbin [61], who used a more accurate recombination


[Figure 18.1.4 plot: effective population size (Ne) against time (y), both on logarithmic axes, for the CEU and LWK populations.]

Figure 18.1.4. Demographic inference under the PSMC. The MSMC2 software was used independently on 20 diploid individuals from the 1000 Genomes Project [1], 10 from the CEU population (Utah residents with European ancestry) and 10 from the LWK population (Luhya in Webuye, Kenya). MSMC2 was run on data from chromosome 9 only, with default parameters. The results show that individuals with European ancestry underwent a stronger bottleneck between 50 and 100 ky ago, corresponding to the out-of-Africa event.

model. The authors consider a demographic model where the effective population size is piecewise constant over a given number of epochs (Figure 18.1.3 A). The parameters of the model consist of the set of ancestral sizes, as well as the recombination rate, presumed to be constant along the sequences. While the epochs of the demographic model and the discretisation scheme used for the divergence time are distinct aspects, it is convenient to have some overlap between the two, provided that there are at least as many hidden states as epochs (otherwise some parameters would become unidentifiable). Li and Durbin proposed to consider one hidden state per epoch, so that each segment is represented by one value of $t_j$ and one value of $\theta_j$ (Figure 18.1.3 A). By estimating one $\theta$ per epoch, the PSMC model allows the reconstruction of a “skyline” plot where population size varies in time, from present to the distant past (Figure 18.1.4). This method was applied to data from the 1000 Genomes Project [1] in order to infer the demographic history of distinct populations, which show the signature of the out-of-Africa bottleneck. The two approaches of Li and Durbin and Mailund et al. further differ in their calculation of the emission probabilities. Focusing on the population level and a relatively short time span of $\sim 1$ My, Li and Durbin consider an infinite sites model where only one mutation per site can happen. They further consider all types of mutations as equally probable and ignore the biochemical nature of the underlying nucleotides. This reduces the number of observed states to three types: homozygous (the two sequences are identical at a given position), heterozygous (the two sequences differ at a given position), and unknown (at least one sequence has an unresolved state at that position).
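As an illustration of this three-state encoding, the sketch below maps a pair of aligned sequences onto homozygous, heterozygous and unknown observations. The function name `encode_pair` and the integer codes are hypothetical choices made for this example; they are not taken from the PSMC software, which works on its own input format.

```python
HOMOZYGOUS, HETEROZYGOUS, UNKNOWN = 0, 1, 2
RESOLVED = set("ACGT")


def encode_pair(seq1, seq2):
    """Encode two aligned sequences as a list of three-state observations."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned and of equal length")
    states = []
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in RESOLVED or b not in RESOLVED:
            states.append(UNKNOWN)        # at least one unresolved nucleotide
        elif a == b:
            states.append(HOMOZYGOUS)     # identical nucleotides
        else:
            states.append(HETEROZYGOUS)   # different nucleotides
    return states


observations = encode_pair("ACGTN", "ACTTA")
print(observations)
```

The nature of the substitution (for example A to G versus A to T) is deliberately discarded, in line with the assumption that all mutation types are equally probable.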


The emission probabilities then take the simple form
$$e_{i,j,\mathrm{homozygous}} = \exp(-\theta \cdot t_j), \qquad e_{i,j,\mathrm{heterozygous}} = 1 - \exp(-\theta \cdot t_j), \qquad e_{i,j,\mathrm{unknown}} = 1,$$
where $\theta = 4 \cdot N_e \cdot u$ denotes the population mutation rate, and $u$ the per-nucleotide, per-generation molecular mutation rate. Comparing genomes from two distinct (sub)species, therefore potentially encompassing larger time scales, Mailund et al. used a fully parametrised substitution model as used in parametric phylogenetic reconstruction methods [22]. The mutation process is then a continuous-time, discrete-state Markov model with generator $U$, and the emission probabilities are given by $\exp(U \cdot t_j)$ for each hidden state $j$. In both models, the emission probabilities only depend on the observed states and are independent of the actual position in the sequence, assuming a homogeneous mutation process along the genome.

18.1.6 The three-genome case

In 2007, Hobolth et al. introduced the first CoalHMM model [18, 27], by modelling the possible genealogical relationships between three species: “(A, B), C”, “A, (B, C)” and “(A, C), B” (Figure 18.1.3 B). Considering the two speciation events that separate first the ancestor of A and B from the ancestor of C, and then the ancestor of A from the ancestor of B, the probability that an individual sequence from A coalesces with a sequence from B within the AB ancestral species depends on the ancestral population size and the time between the two speciation events [19]. Backwards in time, if the two corresponding lineages do not coalesce until the most ancient speciation time, any of them can coalesce with a sequence from species C before coalescing with each other. This phenomenon, which results in the marginal genealogy being incongruent with the phylogeny, is termed incomplete lineage sorting (ILS). Using coalescent theory, Hobolth et al. derived relationships between the transition probabilities and used them to infer ancestral effective population sizes as well as the dates of species divergence, the so-called speciation times. 
In this first model the hidden states differ in tree topology and divergence times. The first hidden state corresponds to the case where the two lineages A and B coalesce within the ancestral population of A and B, leading to a genealogy congruent with the phylogeny. The three other topologies denote cases where the A and B lineages did not coalesce within the AB ancestor, so that the three lineages A, B and C were already present within the ancestral population of the three species. These three states correspond to the cases where A and B, A and C, or B and C coalesce first, respectively (Figure 18.1.3 B). The model assumes constant ancestral population sizes and the divergence times for each topology are reduced to the averages of the corresponding exponential distributions. The proportion of ILS topologies directly depends on the time separating the speciation


events T and the effective size of the ancestral population anc (see [19]):  anc  2 ; Pr.ILS/ D exp 2  3 T allowing the estimations of these parameters from the patterns of topology variation. The three-species CoalHMM model was applied to genome sequences of Great Apes: Orangutan [28], Gorilla [60], Bonobo [56], Baboons [58], in order to infer the patterns of ILS and the ancestral effective population sizes in this group of species (reviewed in [44]). It was also applied to species of fungal pathogens where it was used to infer recombination rates [73]. 18.1.7 The multiple-genome cases With the development of sequencing technologies and the increasing sample size of population genomic datasets, models able to extract the genealogical information contained in multiple genomes are needed. Building CoalHMM models with more than two or three sequences poses, however, a computational challenge because of the underlying combinatorics of marginal genealogies. The number of possible topologies increases hyper-exponentially with the number M of sampled sequences, and there is an infinite number of possible genealogies with a given topology due to the continuous nature of branch lengths (divergence times). Further approximations are, therefore, required to scale CoalHMM approaches with larger datasets. 18.1.7.1 Using conditional sampling distributions. In a series of articles [54,62,69, 70], Song and collaborators developed an approach based on the so-called conditional sampling distribution (CSD) introduced by Li and Stephens [40]. This approach stems from the chain rule of conditional probabilities, allowing the expression of the likelihood of a sample of M sequences S1 ; : : : ; SM as a product of conditional likelihoods: Pr.S1 ; S2 ; : : : ; SM j ‚/ D Pr.S1 j S2 ; : : : ; SM ; ‚/  Pr.S2 ; : : : ; SM j ‚/ D Pr.S1 j S2 ; : : : ; SM ; ‚/  Pr.S2 j S3 ; : : : ; SM j ‚/  Pr.S3 ; : : : ; SM j ‚/ D Pr.SM j ‚/

M Y1

Pr.Sk j SkC1 ; : : : ; SM ; ‚/;

kD1

where $\Theta$ denotes the parameter vector. The conditional likelihoods, however, are approximated, so that the resulting likelihood is a product of approximate conditionals (PAC) [40], which depends on the order in which the sequences are treated in the product chain. This is usually accommodated by permutations and averaging [40], or by a composite likelihood approach such as the “leave-one-out” strategy [62]:
$$\Pr(S_1, S_2, \dots, S_M \mid \Theta) \simeq \prod_{i=1}^{M} \Pr(S_i \mid \{S_j\}_{j \ne i}, \Theta).$$
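As a worked illustration (added here for clarity, with $M = 3$ chosen arbitrarily), the exact chain rule can be written for two different orderings of the sequences, next to the leave-one-out composite likelihood:

```latex
\begin{align*}
\Pr(S_1, S_2, S_3 \mid \Theta)
  &= \Pr(S_1 \mid S_2, S_3, \Theta) \cdot \Pr(S_2 \mid S_3, \Theta) \cdot \Pr(S_3 \mid \Theta)
  && \text{(order 1, 2, 3)} \\
  &= \Pr(S_3 \mid S_1, S_2, \Theta) \cdot \Pr(S_1 \mid S_2, \Theta) \cdot \Pr(S_2 \mid \Theta)
  && \text{(order 3, 1, 2)} \\
\Pr(S_1, S_2, S_3 \mid \Theta)
  &\simeq \Pr(S_1 \mid S_2, S_3, \Theta) \cdot \Pr(S_2 \mid S_1, S_3, \Theta) \cdot \Pr(S_3 \mid S_1, S_2, \Theta)
  && \text{(leave-one-out)}
\end{align*}
```

With exact conditional probabilities the first two lines are equal. Once each factor is replaced by an approximate conditional, the two orderings generally give different values, which is why averaging over permutations or a symmetric composite form such as the third line is used.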


The CSD are computed under an SMC model, given a piecewise constant demographic model, as in the PSMC. The model was further extended to allow more complex demographic scenarios with population structure and migration [69].

18.1.7.2 Modelling the most recent coalescence events. Schiffels and Durbin [61] developed the multiple sequentially Markov coalescent (MSMC), which models only the most recent coalescence event in the sample. The underlying rationale was that the PSMC lacks resolution in the more recent past, due to the very small number of mutations and coalescences happening in the most recent epochs. Combining multiple samples, therefore, has the potential to compensate for the lack of information in a single pair of genomes. The MSMC approach is elegant as it reduces the combinatorics of the hidden states to one continuous variable (which is discretised, as in the PSMC), the time of coalescence, together with the index of the two genomes in the sample that are coalescing, bringing the number of hidden states to $n = \binom{M}{2} \cdot K$, where $K$ is the number of discrete classes used for the distribution of divergence times. The efficiency of the MSMC approach is, however, paradoxical: by gaining resolution in the present as the sample size increases, the method progressively loses power in the more distant past as the number of modelled sequences becomes larger (see [17] for an illustration). In practice, the authors showed that for a human dataset, the maximum resolution is obtained for eight haploid genomes [61].

18.1.7.3 Using a composite likelihood. In the MSMC2 approach [45, 78], Schiffels and collaborators proposed to approximate the likelihood of a sample of $M$ genomes by independently considering all pairs of genomes. The likelihood of the sample is then approximated by the product of all pairwise likelihoods, each computed under the PSMC model. 
While the pairwise likelihoods are exact under the SMC, the likelihood of the sample is an example of composite likelihood [36]:
$$\Pr(S_1, S_2, \dots, S_M \mid \Theta) \simeq \prod_{i=1}^{M-1} \prod_{j=i+1}^{M} \Pr(S_i, S_j \mid \Theta).$$
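The structure of this composite likelihood can be sketched as a sum of pairwise log-likelihoods over all unordered pairs of genomes. The code below is only an illustration of the double product: `composite_loglik` is a hypothetical name, and `toy_pair_loglik` is a deliberately crude stand-in (independent sites, a single fixed divergence time) that takes the place of a real PSMC likelihood, which would be computed with the forward algorithm.

```python
from itertools import combinations
from math import exp, log


def composite_loglik(genomes, pair_loglik):
    """Sum pair_loglik(S_i, S_j) over all unordered pairs i < j."""
    return sum(pair_loglik(a, b) for a, b in combinations(genomes, 2))


def toy_pair_loglik(a, b, theta=0.1, t=1.0):
    """Stand-in for a pairwise likelihood (assumption, NOT the PSMC):
    sites are independent and differ with probability 1 - exp(-theta * t)."""
    p_het = 1.0 - exp(-theta * t)
    n_het = sum(x != y for x, y in zip(a, b))
    n_hom = len(a) - n_het
    return n_het * log(p_het) + n_hom * log(1.0 - p_het)


genomes = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "ACGTTCGA"]
n_pairs = len(list(combinations(genomes, 2)))   # M * (M - 1) / 2 pairs
value = composite_loglik(genomes, toy_pair_loglik)
print(n_pairs, value)
```

Working with log-likelihoods turns the double product into a double sum. The number of terms grows quadratically with the number of genomes.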

The likelihood here is an approximation since the divergence times between pairs of sequences in a genealogy are not independent. The MSMC2 approach is therefore better described as a “multiple pairwise SMC”. It was shown to display good resolution in both the past and present time, efficiently making use of the increasing quantity of signal as the sample size increases.

18.1.7.4 Augmenting the PSMC with site frequency spectra. While approaches like diCal and MSMC2 allow for the efficient modelling of the evolution of multiple genomes, they are intrinsically limited as the computational cost becomes prohibitive for samples of more than one or two dozen genomes (at least for genomes with a size of the order of that of humans). Terhorst and colleagues introduced a hybrid approach combining the power of the SMC, which makes efficient use of linkage patterns, with that of classical site frequency spectrum (SFS) based approaches [76]. This


modelling framework, termed SMC++, considers a “focus” diploid individual that is modelled with a PSMC approach. The observed states are then augmented by taking into account additional genomes to compute an SFS. The emission probabilities are calculated as the probability of observing the local SFS given the genealogy at the focus individual, and the authors proposed an approach to compute such a conditional site frequency spectrum (CSFS). The resulting SMC++ model can accommodate hundreds of individual genomes. Another innovation introduced in this approach is the abandonment of the “skyline” model of piecewise constant effective population size in favour of a spline model. While divergence times are still discretised, the corresponding times for each category are derived from a spline curve whose parameters are estimated. This reduces the number of parameters to estimate and ensures smoother inferred demographies. The SMC++ approach has been applied to human data as well as other species, including Drosophila and Zebra finch [76].

18.2 Heterogeneity of processes along the genome

In all the models discussed so far, evolutionary processes have been considered to be homogeneous along sequences. In this section, I review evidence that these assumptions are at odds with our current knowledge of the biology of genomes.

18.2.1 Variation of the recombination rate

Recombination rates can vary extensively between species [68], between sexes [37] and within genomes. At the molecular scale, multiple levels can be distinguished: recombination rate correlates negatively with chromosome size, a pattern attributed to the mechanism of meiosis and crossing-over interference [32]. Given that the rate of crossing-over events is low, this leads to a higher recombination rate in small chromosomes. Recombination rates also vary within chromosomes: in Primates, the rate is generally higher at the start and end of the chromosomes (the so-called telomeric regions) [72], while in Drosophila the opposite pattern is observed [10, 13]. In many species recombination events have an increased chance to occur in particular regions, called hotspots [52, 55, 74] (but see [77] for a counterexample). The variation of recombination rate has two types of consequences on the patterns of sequence diversity. Because the molecular mechanisms of recombination are tightly linked to DNA repair, recombination itself can be mutagenic and locally increase sequence variability [4, 38]. Furthermore, in many species, the repair mechanisms involve gene conversion between homologous sequences, as one chromosome is used as a template to repair the other one. However, this mechanism is biased in many species: in the case of heterozygous positions, the “C” or “G” nucleotides are preferred over “A” and “T” nucleotides, potentially resulting in large-scale variations of GC content [16] mirroring the variations in recombination rate. The recombination rate also has indirect effects on genetic diversity: because it breaks down genetic linkage,


recombination counteracts the reduction of diversity at sites linked to loci under selection, both negative (background selection [11]) and positive (genetic hitch-hiking [9]). By modulating the local effective population size, variation of recombination rate along the genome has a strong impact on the underlying genealogy.

18.2.2 Variation of the mutation rate

Finally, the rate at which mutations occur can vary extensively along the genome [7]. Mutations can occur via direct modification of the DNA or indirectly, via errors in the replication or repair mechanisms. Particular sequence motifs, such as CpG dinucleotides, are known to undergo comparatively higher mutation rates, via the methylation of the cytosine, which is then deaminated into a thymine, leading to a TpG dinucleotide. In addition to the potentially mutagenic effect of recombination, which also plays a role in the repair of DNA damage, the replication machinery itself is error-prone. This error rate is position dependent: it increases with the replication time, being lower close to the origins of replication [3, 67, 79]. Under a neutral scenario, the mutation process is independent of the coalescent process [30], and, therefore, has no impact on the underlying genealogies. Yet, for inference models, mutation rate variation acts as a confounding factor, as a high sequence divergence can be explained either by an ancient coalescence time or by a high mutation rate. In CoalHMM models, the mutation rate will have an impact on the emission probabilities, that is, the probability of observing the observed sequence diversity given a genealogy.
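This confounding can be made concrete with the two-genome emission probabilities of Section 18.1.5, where the probability that two sequences differ at a site is $1 - \exp(-\theta \cdot t)$. Only the product $\theta \cdot t$ enters the formula, so a site-wise observation cannot separate an old coalescence from a locally elevated mutation rate. The sketch below uses arbitrary illustrative numbers, and the function name `prob_heterozygous` is an assumption made for the example.

```python
from math import exp, isclose


def prob_heterozygous(theta, t):
    """Probability that two sequences differ at a site: 1 - exp(-theta * t)."""
    return 1.0 - exp(-theta * t)


# Ancient coalescence time with a baseline mutation rate ...
ancient = prob_heterozygous(theta=0.001, t=4.0)
# ... versus a four times more recent coalescence with a four times higher rate.
mutable = prob_heterozygous(theta=0.004, t=1.0)

print(ancient, mutable, isclose(ancient, mutable))
```

The two scenarios yield the same emission probability. In an HMM, information to tell them apart can only come from the transition structure along the sequence or from additional assumptions about how the mutation rate varies.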

18.3 Existing approaches to account for spatial heterogeneity

In this section, I review the approaches that have been developed to cope with the heterogeneity of evolutionary processes acting along the genome.

18.3.1 Inferring sequential heterogeneity alone

A large body of work is built on the idea that, if a parameter affects certain patterns of genetic diversity, it should be possible to use these patterns to recover the underlying variation of the parameter. The most studied case in this respect is the recombination rate, via its impact on linkage disequilibrium. Given an a priori known demographic scenario, it is possible to compute the likelihood of the data for any given recombination rate, and use it to estimate the most likely recombination rate value. Due to the complexity of the underlying model, however, approximations are required to apply these methods to large genomic datasets. McVean, Awadalla and Fearnhead [48] introduced the use of a composite likelihood, approximating the full likelihood by the product of the likelihoods of all pairs of positions within a minimum distance of each other. This approach is the basis of several popular methods for recombination rate inference such as LDhat [5] and LDhelmet [10]. Further developments of


these models allowed for the incorporation of variable population sizes [33, 64]. The underlying demography, however, has to be estimated independently from the data. Li and Stephens [40], on the other hand, used the conditional sampling distribution and the product of approximate conditionals (PAC) to approximate the likelihood. Applications of this method also include the reconstruction of haplotypes from genotypic data, a problem known as phasing [71].

18.3.2 Inference using piecewise-homogeneous processes

The simplest approach to infer heterogeneous processes along the genome while jointly accounting for demography is a window-based approach, which consists of dividing the genome into segments of fixed size and estimating model parameters independently in each resulting window. This strategy was used by Stukenbrock et al. [73], who combined the patterns of ILS and a CoalHMM model to estimate the recombination rate in 100 kb windows along the genome of the fungal pathogen Zymoseptoria tritici. In most cases, however, SMC models require long genome sequences to be able to confidently estimate parameters, and cannot be run in windows of small sizes, in particular for models at the population level. Furthermore, window-based approaches raise the issue of the window size and boundaries to use.

18.3.3 Using sequentially heterogeneous simulation procedures

While the computation of the likelihood of the data under a sequentially heterogeneous process is notoriously difficult, simulating under the corresponding model can be comparatively easy. Software like the Markovian coalescent simulator (MaCS) [12], the sequential coalescent with recombination model (SCRM) [66], fastsimcoal [21] and msprime [34] allow the simulation of population genomic data sets under models with variable recombination rate. 
Owing to their high computational efficiency, they can be used within an approximate Bayesian computation (ABC) framework in order to estimate demographic parameters under realistic recombination models [80]. This possibility, however, has to date been underexploited, as demographic inference has so far been conducted with data simulated under a homogeneous recombination landscape (see for instance [41]). While no ABC method has been developed with the goal of inferring the variation of population genomic parameters along the genome, Gao et al. [24] introduced a machine learning approach to infer recombination rates. The underlying simulations, however, are conducted under a model with constant recombination rate.

18.3.4 A posteriori inference of heterogeneous processes

The HMM methodology makes it possible, via the forward algorithm, to compute the likelihood of the data given a specified demographic model by efficiently integrating over the unknown underlying ARG. The HMM toolbox further allows for the computation of
the a posteriori probability of each marginal genealogy for each position [17]:

\[ \Pr(H_i = H_j \mid x_1, \dots, x_L) = \frac{\Pr(x_1, \dots, x_L, H_i = H_j)}{\Pr(x_1, \dots, x_L)}. \]

The denominator of the ratio is the likelihood of the data, $\mathcal{L}$. In order to compute the numerator, we need to introduce the so-called backward algorithm [15]:

\[ B_{i,j} = \Pr(x_{i+1}, \dots, x_L \mid H_i = H_j) = \begin{cases} 1 & \text{if } i = L, \\ \sum_k e_{i+1,k}(x_{i+1}) \cdot q_{i+1,j,k} \cdot B_{i+1,k} & \text{if } i < L. \end{cases} \]

The posterior probability of hidden state $H_j$ can therefore be computed as

\[ \Pr(H_i = H_j \mid x_1, \dots, x_L) = \frac{F_{i,j} \cdot B_{i,j}}{\mathcal{L}}. \]

This formula allows for the reconstruction of the most probable marginal genealogy at each position $i$ by taking the maximum posterior probability

\[ \hat{H}_i = \arg\max_j \Pr(H_i = H_j \mid x_1, \dots, x_L), \]

a procedure called posterior decoding. The posterior decoding is performed after fitting the model parameters by maximising the likelihood. It is therefore an example of empirical Bayesian inference [46]. Posterior probabilities of marginal genealogies can also be used to obtain posterior estimates of biological quantities of interest, accounting for the uncertainty in the underlying genealogy. The posterior mean estimate $\hat{\Lambda}_i$ at position $i$ of a property $\Lambda(H_j)$ can be obtained by

\[ \hat{\Lambda}_i = \sum_j \Pr(H_i = H_j \mid x_1, \dots, x_L) \cdot \Lambda(H_j). \tag{18.3.1} \]

If $\Lambda$ is the coalescence time between two sequences, this formula can be used to get posterior estimates of sequence divergence along the genome [53]. More complex examples of functions include, for instance, whether $H_i$ is distinct from $H_{i-1}$, or, in other words, whether a recombination event occurred between positions $i-1$ and $i$. This allows for the reconstruction of a recombination map, integrating over the ARG. Such an approach was notably used by Munch et al. [51] to reconstruct the recombination map of the human-chimpanzee ancestor. Posterior estimates are rather robust to the specified input model and can, therefore, offer a powerful approach to infer aspects of the process that are not directly accounted for by the model. However, because some model properties are intrinsically confounded, such as local divergence and mutation rate, ignoring spatial heterogeneity might result in biased inference [8].
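The forward and backward recursions, posterior decoding and the posterior mean of equation (18.3.1) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the hidden states, transition and emission probabilities are homogeneous along the sequence, no numerical rescaling is applied (so it is only suitable for short sequences), and all names and numerical values (`forward_backward`, `posterior_decoding`, `posterior_mean`, the two-state toy model) are hypothetical and not taken from any CoalHMM implementation.

```python
def forward_backward(init, trans, emit, obs):
    """Posterior probabilities Pr(H_i = H_j | x_1..x_L) for a discrete HMM.

    init[j]     : a priori probability f_j of hidden state j
    trans[j][k] : transition probability from state j to state k
    emit[j][x]  : probability of emitting symbol x in state j
    obs         : observed symbols x_1, ..., x_L (0-based list)
    """
    n, L = len(init), len(obs)
    # Forward pass, with F[0][j] = f_j as in the forward recursion of the text.
    F = [list(init)]
    for i in range(1, L + 1):
        x = obs[i - 1]
        F.append([emit[k][x] * sum(trans[j][k] * F[i - 1][j] for j in range(n))
                  for k in range(n)])
    likelihood = sum(F[L])
    # Backward pass, with B[L][j] = 1.
    B = [[1.0] * n for _ in range(L + 1)]
    for i in range(L - 1, -1, -1):
        x = obs[i]  # this is x_{i+1} in the 1-based notation of the text
        B[i] = [sum(emit[k][x] * trans[j][k] * B[i + 1][k] for k in range(n))
                for j in range(n)]
    posterior = [[F[i][j] * B[i][j] / likelihood for j in range(n)]
                 for i in range(1, L + 1)]
    return posterior, likelihood


def posterior_decoding(posterior):
    """Index of the most probable hidden state at each position."""
    return [max(range(len(row)), key=row.__getitem__) for row in posterior]


def posterior_mean(posterior, values):
    """Posterior mean of a property Lambda(H_j), as in equation (18.3.1)."""
    return [sum(p * v for p, v in zip(row, values)) for row in posterior]


# Toy model (assumed values): two coalescence-time classes, and symbols
# 0 = homozygous site, 1 = heterozygous site.
times = [0.5, 2.0]
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.95, 0.05], [0.8, 0.2]]
obs = [0, 0, 0, 1, 1, 0, 1, 0]

posterior, likelihood = forward_backward(init, trans, emit, obs)
decoded = posterior_decoding(posterior)
mean_times = posterior_mean(posterior, times)
```

Here `mean_times[i]` plays the role of the posterior estimate of the coalescence time at position $i+1$; practical implementations typically rescale the forward and backward variables to avoid numerical underflow on long sequences.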


18.4 The integrative sequentially Markov coalescent

In order to account for heterogeneous processes along the genome, we recently developed the integrative sequentially Markov coalescent (iSMC) [8], an extension of the SMC. In this framework, parameters of the original SMC vary along the genome in a Markovian manner, allowing for the modelling of genome heterogeneity in addition to demographic processes. I illustrate this approach with results from a recent application of this framework to infer recombination landscapes and further discuss possible extensions.

18.4.1 The Markov-modulated sequentially Markov coalescent

Whilst the framework can be applied to cases where more than one parameter varies along the genome, for simplicity, we here consider the case where only one parameter varies, which we label $R$, with $R_i$ denoting the value of $R$ at position $i$ in the sequences. We assume that $R$ follows an a priori known discrete distribution with $n_R$ categories, each with mean value $R_k$, with $1 \le k \le n_R$. In the iSMC framework, the emission and/or transition probabilities are functions of $R$ and are, therefore, denoted by $e^{\mathrm{SMC}}_{i,k}(x, R)$ and $q^{\mathrm{SMC}}_{i,j,k}(R)$, respectively. The key assumption is then that the variation of $R$ along the genome can be modelled as a Markov model, that is, there is a matrix of probabilities $q^R$ defined as

\[ q^R_{i,j,k} = \Pr(R_i = R_k \mid R_{i-1} = R_j). \]

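The forward recursion of the CoalHMM can then be written as

\[ F^{\mathrm{SMC}}_{i,j,k} = \Pr(x_1, \dots, x_i, H_j, R_k) = \begin{cases} f_j \cdot f^R_k & \text{if } i = 0, \\ e^{\mathrm{SMC}}_{i,j}(x_i, R_k) \cdot \sum_u \sum_v q^{\mathrm{SMC}}_{i,u,j}(R_k) \cdot q^R_{i,v,k} \cdot F^{\mathrm{SMC}}_{i-1,u,v} & \text{if } i > 0, \end{cases} \]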

where $f^R_k$ denotes the a priori probability $\Pr(R = R_k)$. Because the SMC now depends on a parameter that itself follows a Markov process, the resulting process can be described as a Markov-modulated Markov chain (MMMC). As an MMMC is itself a Markov process [23], we can rewrite the forward recursion as

\[ F^{\mathrm{iSMC}}_{i,k} = \Pr(x_1, \dots, x_i, H^{\mathrm{iSMC}}_k) = \begin{cases} f^{\mathrm{iSMC}}_k & \text{if } i = 0, \\ e^{\mathrm{iSMC}}_{i,k}(x_i) \cdot \sum_j q^{\mathrm{iSMC}}_{i,j,k} \cdot F^{\mathrm{iSMC}}_{i-1,j} & \text{if } i > 0. \end{cases} \tag{18.4.1} \]

In the iSMC hidden Markov model, the hidden states $H^{\mathrm{iSMC}}$ consist of all possible pairs of genealogies and $R$: $(R_a, H_b)$, with $1 \le a \le n_R$ and $1 \le b \le n$. The emission probabilities $e^{\mathrm{iSMC}}_{i,j}(x)$ now depend on the pair $(R, H)_j$, and the initial probabilities of the hidden states are $f^{\mathrm{iSMC}} = f^R \otimes f$. Similarly, the transition probabilities of
the Markov-modulated SMC (MMSMC) can be written as a function of the transition probabilities of the two Markov chains:

\[ q^{\mathrm{iSMC}}_i = \begin{pmatrix} q^R_{i,1,1} \cdot q^{\mathrm{SMC}}_i(R_1) & \cdots & q^R_{i,1,n_R} \cdot q^{\mathrm{SMC}}_i(R_1) \\ \vdots & \ddots & \vdots \\ q^R_{i,n_R,1} \cdot q^{\mathrm{SMC}}_i(R_{n_R}) & \cdots & q^R_{i,n_R,n_R} \cdot q^{\mathrm{SMC}}_i(R_{n_R}) \end{pmatrix}. \]

(As we consider the process modelling the variation of $R$ to be itself homogeneous along the genome, we have $q^{\mathrm{iSMC}}_{i_1,j,k} = q^{\mathrm{iSMC}}_{i_2,j,k} = q^{\mathrm{iSMC}}_{j,k}$ for all $i_1, i_2$.) The iSMC model can therefore be analysed with standard HMM methodology, just like the homogeneous SMC. The number of hidden states, however, is now $n_R \cdot n$, meaning that the complexity of the likelihood calculation becomes $O(L \cdot (n \cdot n_R)^2)$. The iSMC model adds relatively few extra parameters to the SMC: the transition probabilities $q^R$, which can be reduced to one parameter (see below), and parameters of the a priori distribution of $R$. While parameters of the distribution of $R$ are generally not of direct biological interest, the posterior decoding of the HMM allows for the inference of the underlying landscape of the heterogeneous parameter. Distinct decoding procedures can be performed in the case of Markov-modulated HMMs:

(1) a full decoding, where the most likely pair $(R, H)$ at each position is reconstructed:

\[ \widehat{(R, H)}_i = \arg\max_{j,k} \Pr(H_i = H_j, R_i = R_k \mid x_1, \dots, x_L); \]

(2) a partial decoding of genealogies, where the most likely genealogy is inferred, summing over all heterogeneous parameters:

\[ \hat{H}_i = \arg\max_j \sum_k \Pr(H_i = H_j, R_i = R_k \mid x_1, \dots, x_L); \]

(3) a posterior mean estimation of the heterogeneous variable. Setting $\Lambda(H_j, R_k) = R_k$ and applying equation (18.3.1), we get

\[ \hat{R}_i = \sum_j \sum_k \Pr(H_i = H_j, R_i = R_k \mid x_1, \dots, x_L) \cdot R_k = \sum_k R_k \cdot \sum_j \Pr(H_i = H_j, R_i = R_k \mid x_1, \dots, x_L). \tag{18.4.2} \]

The partial decoding of genealogies enables the reconstruction of the ARG while accounting for the heterogeneity of the SMC along the genome. The posterior estimates of the heterogeneous variable allow the reconstruction of its variation along the genome while accounting for the genealogy and its uncertainty. In the next section, we apply this framework to model the variation of the recombination rate along the genome.
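As an illustration of how the hidden states of the Markov-modulated model combine the two chains, the following Python sketch assembles the joint transition matrix, the joint initial probabilities $f^R \otimes f$, and the posterior mean of $R$ of equation (18.4.2). It follows the block layout of the transition matrix given above, in which block $(a, b)$ multiplies $q^R_{a,b}$ by the SMC transition matrix of class $R_a$. The function names, the ordering of the joint states (class-major) and all numerical values are assumptions made for this example only.

```python
def joint_transition(q_R, q_smc_by_class):
    """Joint transition matrix over pairs (R_a, H_u).

    q_R[a][b]               : Pr(R_i = R_b | R_{i-1} = R_a)
    q_smc_by_class[a][u][v] : SMC transition u -> v computed with R = R_a
    Joint state (a, u) is stored at index a * n + u.
    """
    n_R = len(q_R)
    n = len(q_smc_by_class[0])
    size = n_R * n
    Q = [[0.0] * size for _ in range(size)]
    for a in range(n_R):
        for b in range(n_R):
            for u in range(n):
                for v in range(n):
                    Q[a * n + u][b * n + v] = q_R[a][b] * q_smc_by_class[a][u][v]
    return Q


def joint_initial(f_R, f):
    """Kronecker product f_R (x) f, in the same class-major ordering."""
    return [p_r * p_h for p_r in f_R for p_h in f]


def marginal_R(joint_posterior_row, n_R, n):
    """Sum the joint posterior over genealogies: Pr(R_i = R_k | data)."""
    return [sum(joint_posterior_row[k * n:(k + 1) * n]) for k in range(n_R)]


def posterior_mean_R(joint_posterior_row, R_values, n):
    """Posterior mean of R at one position, as in equation (18.4.2)."""
    marginal = marginal_R(joint_posterior_row, len(R_values), n)
    return sum(r * p for r, p in zip(R_values, marginal))


# Toy example (assumed values): n = 2 genealogies, n_R = 2 classes of R.
R_values = [0.5, 2.0]
q_R = [[0.99, 0.01], [0.01, 0.99]]
q_smc_by_class = [
    [[0.95, 0.05], [0.05, 0.95]],  # SMC transitions for R = 0.5
    [[0.80, 0.20], [0.20, 0.80]],  # SMC transitions for R = 2.0
]
Q = joint_transition(q_R, q_smc_by_class)
f_joint = joint_initial([0.5, 0.5], [0.3, 0.7])

# A made-up joint posterior at one position, ordered as the joint states.
example_row = [0.1, 0.2, 0.3, 0.4]
R_hat = posterior_mean_R(example_row, R_values, n=2)
```

Once `Q` and `f_joint` are built, the model can be passed to any standard forward-backward routine; the quadratic growth of `Q` with $n \cdot n_R$ is what drives the $O(L \cdot (n \cdot n_R)^2)$ cost mentioned in the text.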


18.4.2 A case study: Inference of recombination rate variation

Recombination is the best documented heterogeneous process along the genome. It can be measured experimentally or indirectly using genomic approaches (reviewed in [5]). In the context of the sequential coalescent and the SMC approximation, the local recombination rate affects the probability of transition from one genealogy to another, which increases with higher recombination rates. To model variable recombination rates in iSMC, we considered that the local population recombination rate $\rho = 4 \cdot N_e \cdot r$, where $r$ is the molecular recombination rate in cM/bp per generation, is the product of a genome average $\rho_G$ and a local modifier $r_\rho$. This modifier follows a prior discrete distribution of mean 1 and with $n_\rho$ categories, for instance a discretised Gamma distribution with shape parameter $\alpha$. We further considered a simple model for the transition probabilities between the $r_\rho$ classes, assuming equal probabilities of change, $\delta$. The transition probability matrix $q^\rho$ takes the form

\[ q^\rho = \begin{pmatrix} 1 - \delta & \delta/(n_\rho - 1) & \cdots & \delta/(n_\rho - 1) \\ \delta/(n_\rho - 1) & 1 - \delta & \ddots & \vdots \\ \vdots & \ddots & \ddots & \delta/(n_\rho - 1) \\ \delta/(n_\rho - 1) & \cdots & \delta/(n_\rho - 1) & 1 - \delta \end{pmatrix}. \]

Using the forward equation (18.4.1), we can compute the likelihood of the parameters of the iSMC model. Using optimisation procedures, estimates of the $\alpha$ and $\delta$ parameters together with the average $\rho_G$ and the demography parameters can be obtained by maximising the likelihood function. The likelihood calculation can also be used to perform model comparisons and test for the heterogeneity of the coalescent process along the genome, for instance using Akaike's information criterion (AIC). A posterior decoding approach (equation (18.4.2)) can then be used to obtain estimates of site-specific recombination rates. Simulations under controlled recombination landscape and demography can be used to assess the accuracy of the iSMC inference [8]. Figure 18.4.1 shows that iSMC recovers the underlying recombination landscape with good accuracy, despite generally underestimating high recombination rates. A possible explanation for this is the discretisation procedure, as the posterior mean estimate is bounded by the class with the highest mean recombination value. Allowing for more recombination classes allows for a wider range of values and can potentially reduce this bias, at the cost of increasing the running time and memory usage. Because it can recover the recombination landscape from a single diploid only, the iSMC model was used to generate recombination maps of extinct hominids from their ancient genome sequences [8].

[Figure 18.4.1: simulated and inferred recombination rates ρ (vertical axis, 0.000 to 0.004) plotted against position in kb (horizontal axis, 0 to 30000).]

Figure 18.4.1. Posterior estimates of recombination rates from a single diploid genome using iSMC. A 30 Mb region was simulated using the SCRM program [66] and a variable recombination rate, and then inferred with iSMC, as described in [8]. Recombination rates were averaged in windows of 200 kb.
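The one-parameter transition matrix between recombination classes described in Section 18.4.2, and the AIC-based comparison between a homogeneous and a heterogeneous model, can be sketched as follows. The function names are hypothetical, the symbol `delta` stands for the common probability of leaving a class, and the log-likelihood values and parameter counts are invented for illustration; they are not results from [8].

```python
def rho_transition_matrix(n_rho, delta):
    """Transition matrix with 1 - delta on the diagonal and
    delta / (n_rho - 1) everywhere else."""
    if n_rho < 2:
        raise ValueError("at least two classes are needed")
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must be a probability")
    off_diagonal = delta / (n_rho - 1)
    return [[1.0 - delta if j == k else off_diagonal for k in range(n_rho)]
            for j in range(n_rho)]


def aic(log_likelihood, n_parameters):
    """Akaike's information criterion: 2k - 2 ln(L); lower is better."""
    return 2.0 * n_parameters - 2.0 * log_likelihood


q_rho = rho_transition_matrix(n_rho=4, delta=0.03)

# Hypothetical fits: the heterogeneous model is assumed to add two
# parameters (the Gamma shape and the transition probability).
aic_homogeneous = aic(log_likelihood=-1000.0, n_parameters=5)
aic_heterogeneous = aic(log_likelihood=-990.0, n_parameters=7)
prefer_heterogeneous = aic_heterogeneous < aic_homogeneous
```

With this parametrisation every row of `q_rho` sums to one whatever the number of classes, so increasing `n_rho` to widen the range of recoverable rates does not add any transition parameter.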

18.4.3 Extension of iSMC: Multiple genomes and multiple heterogeneous parameters

The Markov-modulated Markov model framework underlying the iSMC approach can be applied to other SMC models, such as the MSMC [61]. Hidden states of the resulting CoalHMM are combinations of TMRCA, pairs of genome indices undergoing the most recent coalescent event, and classes of heterogeneous parameters such as the recombination rate. Extension to multiple genomes can also be achieved using a composite likelihood approach, as implemented in MSMC2 (see Section 18.1.7.3). The likelihood of the dataset is then approximated by the product of the likelihoods of each pair of genomes, which are modelled separately with their own process. In the case of the iSMC approach, this implies considering that the heterogeneous parameters vary independently along each pair of genomes; for a model with variable recombination rate, this is equivalent to estimating a distinct recombination map for each pair of genomes. This assumption is clearly incorrect for the vast majority of positions within genomes from the same population. Extensions enforcing a common map while allowing the coalescent processes to be independent will, therefore, be instrumental in efficiently scaling the iSMC approach to larger sample sizes.

The iSMC framework further allows the joint modelling of the variation of multiple parameters along the genome, such as the mutation and recombination rates. In such an approach, the hidden states are a combination of TMRCA and classes for each discretised parameter. Each additional heterogeneous parameter multiplies the complexity of the likelihood calculation by $n^2$, where $n$ is the number of discrete classes considered for the parameter distribution. Besides the increased
complexity, identifiability issues may also arise, since local patterns of diversity may be equally explained by variation of demography, recombination rate or mutation rate. However, when these parameters vary at a scale larger than the variation of the TMRCA along the genome and when very large genome sequences are analysed, increasingly complex models may be successfully fitted.
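The growth of the computational cost with the number of heterogeneous parameters can be made concrete with a short calculation. The sketch below counts hidden states as the product of the number of TMRCA classes and the number of classes of each heterogeneous parameter, and takes the cost of the forward algorithm to be proportional to the sequence length times the squared number of hidden states. The function names and the example figures (40 TMRCA classes, 5 classes per parameter) are assumptions chosen for illustration.

```python
def hidden_state_count(n_tmrca, class_counts):
    """Number of hidden states: TMRCA classes times the classes of each
    heterogeneous parameter."""
    total = n_tmrca
    for n_classes in class_counts:
        total *= n_classes
    return total


def forward_cost(seq_length, n_tmrca, class_counts):
    """Cost of the forward algorithm up to a constant factor:
    sequence length times the squared number of hidden states."""
    return seq_length * hidden_state_count(n_tmrca, class_counts) ** 2


L = 1_000_000
one_parameter = forward_cost(L, n_tmrca=40, class_counts=[5])
two_parameters = forward_cost(L, n_tmrca=40, class_counts=[5, 5])
cost_ratio = two_parameters // one_parameter  # equals 5 ** 2
```

In this example, adding a second parameter with five classes raises the number of hidden states from 200 to 1000 and the cost by a factor of 25.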

18.5 Conclusions

The availability of complete genome data opened the floodgates for the detailed inference of the demographic history of species. A new generation of coalescent-based models permits the extraction of demographic signal from the patterns of genetic linkage along sequences. Such models, however, largely ignore fundamental aspects of genome biology, namely that processes such as recombination and mutation are highly heterogeneous along genomes. Extending these approaches to account for such heterogeneity not only potentially improves demographic inference, but also makes it possible to reconstruct the underlying genomic landscape.

Acknowledgements. I would like to thank Gustavo V. Barroso for critical reading of this manuscript and for providing the simulation data plotted in Figure 18.4.1. I am deeply indebted to Ellen Baake, as well as Jeffrey P. Spence and an anonymous reviewer for their careful reading of the manuscript, for finding several mistakes and typos, and for their suggestions on how to improve its clarity.

References [1] 1000 Genomes Project Consortium (and 8 coauthors), A map of human genome variation from population-scale sequencing, Nature 467 (2010), 1061–1073. [2] A. M. Adams and R. R. Hudson, Maximum-likelihood estimation of demographic parameters using the frequency spectrum of unlinked single-nucleotide polymorphisms, Genetics 168 (2004), 1699–1712. [3] N. Agier and G. Fischer, The mutational profile of the yeast genome is shaped by replication, Mol. Biol. Evol. 29 (2012), 905–913. [4] I. Alves, A. A. Houle, J. G. Hussin, and P. Awadalla, The impact of recombination on human mutation load and disease, Philos. Trans. R. Soc. Lond. B Biol. Sci. 372 (2017), Article ID 20160465. [5] A. Auton and G. McVean, Estimating recombination rates from genetic variation in humans, Methods Mol. Biol. 856 (2012), 217–237. [6] E. Baake and M. Baake, Ancestral lines under recombination, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 365–382. [7] C. F. Baer, M. M. Miyamoto, and D. R. Denver, Mutation rate variation in multicellular eukaryotes: Causes and consequences, Nature Rev. Gen. 8 (2007), 619–631.


[8] G. V. Barroso, N. Puzovic, and J. Dutheil, Inference of recombination maps from a single pair of genomes and its application to archaic samples, PLoS Genet. 15 (2019), Article ID e1008449. [9] N. H. Barton, Genetic hitchhiking, Philos. Trans. R. Soc. Lond. B Biol. Sci. 355 (2000), 1553–1562. [10] A. H. Chan, P. A. Jenkins, and Y. S. Song, Genome-wide fine-scale recombination rate variation in Drosophila melanogaster, PLoS Genet. 8 (2012), Article ID e1003090. [11] B. Charlesworth, M. T. Morgan, and D. Charlesworth. The effect of deleterious mutations on neutral molecular variation, Genetics 134 (1993), 1289–1303. [12] G. K. Chen, P. Marjoram, and J. D. Wall, Fast and flexible simulation of DNA sequence data, Genome Res. 19 (2009), 136–142. [13] J. M. Comeron, R. Ratnappan, and S. Bailin, The many landscapes of recombination in Drosophila melanogaster, PLoS Genet. 8 (2012), Article ID e1002905. [14] A. J. Drummond, A. Rambaut, B. Shapiro, and O. G. Pybus, Bayesian coalescent inference of past population dynamics from molecular sequences, Mol. Biol. Evol. 22 (2005), 1185– 1192. [15] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University, Cambridge, 1998. [16] L. Duret and N. Galtier, Biased gene conversion and the evolution of mammalian genomic landscapes, Annu. Rev. Genomics Hum. Genet. 10 (2009), 285–311. [17] J. Y. Dutheil, Hidden Markov models in population genomics, Methods Mol. Biol. 1552 (2017), 149–164. [18] J. Y. Dutheil, G. Ganapathy, A. Hobolth, T. Mailund, M. K. Uyenoyama, and M. H. Schierup, Ancestral population genomics: the coalescent hidden Markov model approach, Genetics 183 (2009), 259–274. [19] J. Y. Dutheil and A. Hobolth, Ancestral population genomics, Methods Mol. Biol. 856 (2012), 293–313. [20] S. N. Ethier and R. C. Griffiths, On the two-locus sampling distribution, J. Math. Biol. 29 (1990), 131–159. [21] L. Excoffier and M. 
Foll, Fastsimcoal: A continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios, Bioinformatics 27 (2011), 1332–1334. [22] J. Felsenstein, Inferring Phylogenies, 2nd ed., Sinauer, Sunderland 2003. [23] N. Galtier and A. Jean-Marie, Markov-modulated Markov chains and the covarion process of molecular evolution, J. Comput. Biol. 11 (2004), 727–733. [24] F. Gao, C. Ming, W. Hu, and H. Li, New software for the fast estimation of population recombination rates (FastEPRR) in the genomic era, G3 6 (2016), 1563–1571. [25] R. C. Griffiths and P. Marjoram, Ancestral inference from samples of DNA sequences with recombination, J. Comput. Biol. 3 (1996), 479–502. [26] K. Harris, S. Sheehan, J. A. Kamm, and Y. S. Song, Decoding coalescent hidden Markov models in linear time. in: Research in Computational Molecular Biology. RECOMB2014 (ed. R. Sharan), Springer, Cham (2014), 100–114.


[27] A. Hobolth, O. F. Christensen, T. Mailund, and M. H. Schierup, Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model, PLoS Genet. 3 (2007), Article ID e7. [28] A. Hobolth, J. Y. Dutheil, J. Hawks, M. H. Schierup, and T. Mailund, Incomplete lineage sorting patterns among human, chimpanzee, and orangutan suggest recent orangutan speciation and widespread selection, Genome Res. 21 (2011), 349–356. [29] A. Hobolth and J. L. Jensen, Markovian approximation to the finite loci coalescent with recombination along multiple sequences, Theor. Popul. Biol. 98 (2014), 48–58. [30] R. R. Hudson, Properties of a neutral allele model with intragenic recombination, Theor. Popul. Biol. 23 (1983), 183–201. [31] R. R. Hudson, Generating samples under a Wright–Fisher neutral model of genetic variation, Bioinformatics 18 (2002), 337–338. [32] D. B. Kaback, Chromosome-size dependent control of meiotic recombination in humans, Nat. Genet. 13 (1996), 20–21. [33] J. A. Kamm, J. P. Spence, J. Chan, and Y. S. Song, Two-locus likelihoods under variable population size and fine-scale recombination rate estimation, Genetics 203 (2016), 1381– 1399. [34] J. Kelleher, A. M. Etheridge, and G. McVean, Efficient coalescent simulation and genealogical analysis for large sample sizes, PLoS Comput. Biol. 12 (2016), Article ID e1004842. [35] J. F. C. Kingman, The coalescent, Stochastic Process. Appl. 13 (1982), 235–248. [36] F. Larribe and P. Fearnhead, On composite likelihoods in statistical genetics, Statist. Sinica 21 (2011), 43–69. [37] T. Lenormand and J. Dutheil, Recombination difference between sexes: A role for haploid selection, PLoS Biol. 3 (2005), Article ID e63. [38] M. J. Lercher and L. D. Hurst, Human SNP variability and mutation rate are higher in regions of high recombination, Trends Genet. 18 (2002), 337–340. [39] H. Li and R. 
Durbin, Inference of human population history from individual whole-genome sequences, Nature 475 (2011), 493–496. [40] N. Li and M. Stephens, Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data, Genetics 165 (2003), 2213–2233. [41] S. Li and M. Jakobsson, Estimating demographic parameters from large-scale population genomic data using approximate Bayesian computation, BMC Genet. 13 (2012), 22–37. [42] T. Mailund, J. Y. Dutheil, A. Hobolth, G. Lunter, and M. H. Schierup, Estimating divergence time and ancestral effective population size of Bornean and Sumatran orangutan subspecies using a coalescent hidden Markov model, PLoS Genet. 7 (2011), Article ID e1001319. [43] T. Mailund, A. E. Halager, M. Westergaard, J. Y. Dutheil, K. Munch, L. N. Andersen, G. Lunter, K. Prüfer, A. Scally, A. Hobolth, and M. H. Schierup, A new isolation with migration model along complete genomes infers very different divergence processes among closely related great ape species, PLoS Genet. 8 (2012), Article ID e1003125. [44] T. Mailund, K. Munch, and M. H. Schierup, Lineage sorting in apes, Annu. Rev. Genet. 48 (2014), 519–535.


[45] A.-S. Malaspinas (and 74 coauthors), A genomic history of Aboriginal Australia, Nature 538 (2016), 207–214. [46] J. S. Maritz and T. Lwin, Empirical Bayes Methods, Routledge, London, 2018. [47] P. Marjoram and J. D. Wall, Fast “coalescent” simulation, BMC Genet. 7 (2006), Paper No. 16. [48] G. McVean, P. Awadalla, and P. Fearnhead, A coalescent-based method for detecting and estimating recombination from gene sequences, Genetics 160 (2002), 1231–1241. [49] G. A. T. McVean and N. J. Cardin, Approximating the coalescent with recombination, Philos. Trans. R. Soc. Lond. B Biol. Sci. 360 (2005), 1387–1393. [50] P. A. P. Moran, Random processes in genetics, Math. Proc. Camb. Phil. Soc. 54 (1958), 60–71. [51] K. Munch, T. Mailund, J. Y. Dutheil, and M. H. Schierup, A fine-scale recombination map of the human-chimpanzee ancestor reveals faster change in humans than in chimpanzees and a strong impact of GC-biased gene conversion, Genome Res. 24 (2014), 467–474. [52] S. Myers, L. Bottolo, C. Freeman, G. McVean, and P. Donnelly, A fine-scale map of recombination rates and hotspots across the human genome, Science 310 (2005), 321–324. [53] P. F. Palamara, J. Terhorst, Y. S. Song, and A. L. Price, High-throughput inference of pairwise coalescence times identifies signals of selection and enriched disease heritability, Nat. Genet. 50 (2018), 1311–1317. [54] J. S. Paul, M. Steinrücken, and Y. S. Song, An accurate sequentially Markov conditional sampling distribution for the coalescent with recombination, Genetics 187 (2011), 1115– 1128. [55] T. D. Petes, Meiotic recombination hot spots and cold spots, Nature Rev. Gen. 2 (2001), 360–369. [56] K. Prüfer (and 40 coauthors), The bonobo genome compared with the chimpanzee and human genomes, Nature 486 (2012), 527–531. [57] M. D. Rasmussen, M. J. Hubisz, I. Gronau, and A. Siepel, Genome-wide inference of ancestral recombination graphs, PLoS Genet. 10 (2014), Article ID e1004342. [58] J. 
Rogers (and 41 coauthors), The comparative genomics and complex population history of Papio baboons, Sci. Adv. 5 (2019), Article ID eaau6947. [59] A. Sand, M. Kristiansen, C. N. S. Pedersen, and T. Mailund, zipHMMlib: A highly optimised HMM library exploiting repetitions in the input to speed up the forward algorithm, BMC Bioinformatics 14 (2013), 339–348. [60] A. Scally (and 70 coauthors), Insights into hominid evolution from the gorilla genome sequence, Nature 483 (2012), 169–175. [61] S. Schiffels and R. Durbin, Inferring human population size and separation history from multiple genome sequences, Nat. Genet. 46 (2014), 919–925. [62] S. Sheehan, K. Harris, and Y. S. Song, Estimating variable effective population sizes from multiple genomes: Aa sequentially markov conditional sampling distribution approach, Genetics 194 (2013), 647–662. [63] K. L. Simonsen and G. A. Churchill, A Markov chain model of coalescence with recombination, Theor. Popul. Biol. 52 (1997), 43–59.


[64] J. P. Spence and Y. S. Song, Inference and analysis of population-specific fine-scale recombination maps across 26 diverse human populations, Sci. Adv. 5 (2019), Article ID eaaw9206. [65] J. P. Spence, M. Steinrücken, J. Terhorst, and Y. S. Song, Inference of population history using coalescent HMMs: Review and outlook, Curr. Opin. Genet. Dev. 53 (2018), 70–76. [66] P. R. Staab, S. Zhu, D. Metzler, and G. Lunter, Scrm: Efficiently simulating long sequences using the approximated coalescent with recombination, Bioinformatics 31 (2015), 1680– 1682. [67] J. A. Stamatoyannopoulos, I. Adzhubei, R. E. Thurman, G. V. Kryukov, S. M. Mirkin, and S. R. Sunyaev, Human mutation rate associated with DNA replication timing, Nat. Genet. 41 (2009), 393–395. [68] J. Stapley, P. G. D. Feulner, S. E. Johnston, A. W. Santure, and C. M. Smadja, Variation in recombination frequency and distribution across eukaryotes: Patterns and processes, Philos. Trans. R. Soc. Lond. B, Biol. Sci. 372 (2017), Article ID 20160455. [69] M. Steinrücken, J. Kamm, J. P. Spence, and Y. S. Song, Inference of complex population histories using whole-genome sequences from multiple populations, Proc. Natl. Acad. Sci. USA 116 (2019), 17115–17120. [70] M. Steinrücken, J. S. Paul, and Y. S. Song, A sequentially Markov conditional sampling distribution for structured populations with migration and recombination, Theor. Popul. Biol.. 87 (2013), 51–61. [71] M. Stephens and P. Scheet, Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation, Am. J. Hum. Genet. 76 (2005), 449–462. [72] L. S. Stevison (and 9 coauthors) Great Ape Genome Project, The time scale of recombination rate evolution in great apes, Mol. Biol. Evol. 33 (2016), 928–945. [73] E. H. Stukenbrock, T. Bataillon, J. Y. Dutheil, T. T. Hansen, R. Li, M. Zala, B. A. McDonald, J. Wang, and M. H. 
Schierup, The making of a new pathogen: insights from comparative population genomics of the domesticated wheat pathogen Mycosphaerella graminicola and its wild sister species, Genome Res. 21 (2011), 2157–2166. [74] E. H. Stukenbrock and J. Y. Dutheil, Fine-scale recombination maps of fungal plant pathogens reveal dynamic recombination landscapes and intragenic hotspots, Genetics 208 (2018), 1209–1229. [75] M. P. H. Stumpf and G. A. T. McVean, Estimating recombination rates from populationgenetic data, Nat. Rev. Genet. 4 (2003), 959–968. [76] J. Terhorst, J. A. Kamm, and Y. S. Song, Robust and scalable inference of population history from hundreds of unphased whole genomes, Nat. Genet. 49 (2017), 303–309. [77] A. Wallberg, S. Glémin, and M. T. Webster, Extreme recombination frequencies shape genome variation and evolution in the honeybee, Apis mellifera, PLoS Genet. 11 (2015), Article ID e1005189. [78] K. Wang, I. Mathieson, J. O’Connell, and S. Schiffels, Tracking human population structure through time from whole genome sequences, PLoS Genetics 16 (2020), 1–24 [79] C. C. Weber, C. J. Pink, and L. D. Hurst, Late-replicating domains have higher divergence and diversity in Drosophila melanogaster, Mol. Biol. Evol. 29 (2012), 873–882.


[80] D. Wegmann, C. Leuenberger, S. Neuenschwander, and L. Excoffier, ABCtoolbox: A versatile toolkit for approximate Bayesian computations, BMC Bioinformatics 11 (2010), 116–123. [81] P. R. Wilton, S. Carmi, and A. Hobolth, The SMC’ is a highly accurate approximation to the ancestral recombination graph, Genetics 200 (2015), 343–355. [82] C. Wiuf and J. Hein, Recombination as a point process along sequences, Theor. Popul. Biol. 55 (1999), 248–259.

Chapter 19

Diffusion limits of genealogies under various modes of selection

Martin Hutzenthaler and Peter Pfaffelhuber

We study genealogies in population genetic models including selection. Our main tool is the tree-valued Fleming–Viot process as introduced in [2]. We review approximations of the change in tree length relative to neutrality, as well as tree-valued processes in models with fluctuating selection. The latter is treated by using an approach based on stochastic averaging, which works for both time-discrete and time-continuous stochastic processes.

19.1 Introduction

The study of genealogies within population genetic models such as the Wright–Fisher model, the Moran model or their diffusion limits has become the basis for a deeper understanding of model features. The reason is that most statistics, such as allelic frequencies, are encoded within genealogies, if they are treated the right way. Another important aspect is that patterns within DNA data are much easier to understand once the underlying genealogy is known. Genealogies have been a main object of study at least since the fundamental work of Kingman [14] and Hudson [8], who treat neutral evolution. For models including selection, the ancestral selection graph (ASG), introduced by Neuhauser and Krone in [15, 17], was a fundamental contribution.

Let us briefly describe how the ASG can be understood when starting from the time-continuous two-allele Moran model; see also Figure 19.1.1. One allele is assumed to be fit (let us call it $\bullet$), the other unfit ($\circ$). Within a population of constant size $N$, neutral resampling events occur (at rate 1 for any unordered pair of lines). Upon such a resampling event, one individual dies and is replaced by the offspring of the other individual. Selective events occur at rate $\alpha$ per line, but only take effect if the line carries the fit type $\bullet$. In this case, another line is chosen at random and the line carrying the fit type reproduces to the randomly chosen line. The ASG arises when studying genealogies, i.e. when tracing the history of several lines backwards in time. Upon neutral resampling events, lines find joint ancestry, i.e. coalesce, resulting in a tree-like structure. However, as we just saw, selective events are more difficult, since they depend on the types of the involved lines. As a result, the genealogy cannot be determined without knowing types. As a way out, the ASG follows both options upon such a selective event, and such splitting events indicate only possible ancestry.
Which of the split lines is a true ancestor is decided upon knowing ancestral types. These are only known when going through the ASG forwards in time in a second stage. A clear

Martin Hutzenthaler and Peter Pfaffelhuber

410

time

t

genealogy if

genealogy if

Figure 19.1.1. On the left, the graphical representation of a time-continuous two-allele Moran model is given. The fit type is , the unfit type is ı. Grey arrows indicate neutral resampling events, where the individual at the tip of the arrow is replaced by a offspring of the individual at the line initiating the arrow. Such arrows are used by all types. For the selective black arrows, the rule is that they are only used if the line initiating the arrow is fit, i.e. carries type . Accordingly, the genealogy on the right depends on the fitness of the individual indicated with the arrow at the bottom. In the ancestral selection graph (ASG), both possible genealogies are encoded, resulting in a splitting event in the graph at the time of the selective event.

disadvantage of these splitting events is that the genealogy is much more complicated than the Kingman coalescent, which describes the genealogy under neutral evolution. Recently, genealogies under selection have been studied in [2] using Markov processes taking values in the space of ultra-metric trees. In this approach, the genealogical tree is modelled as a stochastic process that changes as the population evolves. Here, we describe recent results on the shape of genealogies under selection which were obtained by using this approach.

Let us give an outline of this chapter: In Section 19.2, we recall the tree-valued Fleming–Viot process and its state space, using results mainly from [1, 2]. In Section 19.3, we study genealogies for small $\alpha$, summarising our findings from [10]. The rest of the chapter is devoted to genealogies under fast fluctuating selection. For this reason, we give a general result on limits of Markov processes, Theorem 19.4.3, in Section 19.4. In the same section, we give a result originating in [11] on the diffusion limit of a discrete-time model. Finally, in Section 19.5, we extend the tree-valued Fleming–Viot process to a fast fluctuating environment.

Remark 19.1.1 (Notation). We will use the following notation. For some complete and separable metric space $(E, r)$, let $C_b(E)$ be the set of real-valued, continuous

Genealogies under various modes of selection


and bounded functions, $\mathcal B(E)$ the set of real-valued, bounded and measurable functions, and $\mathcal M_1(E)$ the set of probability measures on the Borel $\sigma$-field of $E$. For $\mu \in \mathcal M_1(E)$ and $\varphi\colon E \to F$ measurable, $\varphi_*\mu \in \mathcal M_1(F)$ is the image measure of $\mu$ under $\varphi$. For $n = 1, 2, \dots$, we will write $\underline x = (x_1, \dots, x_n) \in E^n$; for $n = 2, 3, \dots$, we write $\underline{\underline r} := (r_{ij})_{1 \le i < j \le n}$ [...]

[...] $Z = (Z_t)_{t \ge 0}$ solves the $(G, \mathcal D(G), \nu_0)$-martingale problem if $Z_0 \sim \nu_0$ and for all $f \in \mathcal D(G)$, the process

$$\Big( f(Z_t) - \int_0^t Gf(Z_s)\,ds \Big)_{t \ge 0}$$

is a martingale. We say that the martingale problem is well-posed if there exists a solution, and the solution is unique in law.

Next, we introduce the state space of our tree-valued processes. Note that similar state spaces appear in the contributions by Kersting and Wakolbinger [13] and Winter [18] in this volume. Recall that $r$ is an ultra-metric on some space $U$ if $r(x, z) \le r(x, y) \vee r(y, z)$ for all $x, y, z \in U$. Such spaces can always be embedded in trees, and ultra-metric trees appear frequently in phylogenetics. With this history of encoding genealogies, the state space introduced below combines trees with the classical (measure-valued) Fleming–Viot process.

Definition 19.2.2 (Marked metric measure spaces). Let $I$ be compact. An $I$-marked ultra-metric measure space is a triple $(X, r, \mu)$ such that $(X, r)$ is an ultra-metric space and $\mu \in \mathcal M_1(X \times I)$. We denote by $\pi_X\colon X \times I \to X$ and $\pi_I\colon X \times I \to I$ the projection operators.
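The ultra-metric condition recalled above is easy to test on a finite distance matrix. The following is a minimal sketch in Python; the function name `is_ultrametric` and both example matrices are illustrative assumptions and do not come from the chapter. The first matrix encodes a three-leaf tree in which distances are twice the time to the most recent common ancestor.

```python
from itertools import product


def is_ultrametric(r):
    """Return True if r[x][z] <= max(r[x][y], r[y][z]) for all x, y, z."""
    n = len(r)
    return all(
        r[x][z] <= max(r[x][y], r[y][z])
        for x, y, z in product(range(n), repeat=3)
    )


# Leaves 0 and 1 coalesce at time 1; both coalesce with leaf 2 at time 2.
# Distances are twice the time to the most recent common ancestor.
tree_distances = [
    [0, 2, 4],
    [2, 0, 4],
    [4, 4, 0],
]

# The points 0, 1, 3 on the real line: a metric, but not an ultra-metric.
line_distances = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
```

Here `is_ultrametric(tree_distances)` holds, while `line_distances` fails the condition because the distance 3 exceeds the maximum of 1 and 2.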


We will assume without loss of generality that $\operatorname{supp}((\pi_X)_*\mu) = X$. We say that $(X, r_X, \mu_X)$ and $(Y, r_Y, \mu_Y)$ are measure-preserving isometric if there is $\varphi\colon X \to Y$ such that $r_X(x_1, x_2) = r_Y(\varphi(x_1), \varphi(x_2))$ and $\varphi_*\mu_X = \mu_Y$, i.e. $\varphi$ is measure-preserving isometric. We denote by $[X, r, \mu]$ the equivalence class of $(X, r, \mu)$ and set

$$\mathbb U_1^I := \big\{ [X, r, \mu] : (X, r, \mu) \text{ is an } I\text{-marked ultra-metric measure space} \big\}.$$

If the tree-valued Fleming–Viot process, which will be given in Theorem 19.2.8, is in state $[X, r, \mu] \in \mathbb U_1^I$, we will be able to pick a sample of size $n$ from the population (indexed by $X$) by a random pick according to $\mu^{\otimes n}$. When this pick results in $(x_1, u_1), \dots, (x_n, u_n) \in (X \times I)^n$ (recall that $\mu \in \mathcal M_1(X \times I)$), the allelic types are given by $u_1, \dots, u_n$ and the genealogical distances within the sample (i.e. twice the time to the most recent common ancestor of two individuals) are given by $(r(x_i, x_j))_{1 \le i, j \le n}$. The initial state of the process will be any element of $\mathbb U_1^I$, i.e. we assume that initially all pairs of individuals have a common ancestor. We now collect some results on $\mathbb U_1^I$ which are important in order to obtain stochastic processes with this state space.

Theorem 19.2.3 ([1, Theorem 2]). Define the set of polynomials on $\mathbb U_1^I$ by

$$\Pi := \bigcup_{n=0}^{\infty} \Pi_n, \qquad \Pi_n := \big\{ \Phi^{n,\phi} : \phi \in C_b\big(\mathbb R_+^{\binom n2} \times I^n\big) \big\}, \tag{19.2.1}$$

where (note that $\mu^{\otimes} := \mu^{\otimes n}$ on the right-hand side since $\phi \in C_b(\mathbb R_+^{\binom n2} \times I^n)$, which is clear from the context)

$$\Phi^{n,\phi}([X, r, \mu]) = \int \phi(r(\underline x, \underline x), \underline u)\,\mu^{\otimes}(d\underline x, d\underline u),$$

which we abbreviate by

$$\Phi^{n,\phi}([X, r, \mu]) =: \int \phi\,d\mu^{\otimes},$$

and note that $\Phi^{n,\phi}([X, r, \mu])$ in fact does not depend on the representative $(X, r, \mu)$, since it does not change under measure-preserving isometric maps. Then the initial topology on $\mathbb U_1^I$ with respect to $\Pi$, i.e. the coarsest topology on $\mathbb U_1^I$ such that all $\Phi \in \Pi$ are continuous, is separable and metrisable by a complete metric.

Remark 19.2.4. In the above theorem, much more could be said: The given topology is frequently called the Gromov-weak topology and the complete metric is usually referred to as the Gromov–Prohorov metric. It is given by

$$d_{\mathrm{GPr}}([X, r_X, \mu_X], [Y, r_Y, \mu_Y]) = \inf_{Z, \varphi_X, \varphi_Y} d_{\mathrm{Pr}}\big((\widetilde\varphi_X)_*\mu_X, (\widetilde\varphi_Y)_*\mu_Y\big),$$


where $(Z, r_Z)$ is any complete and separable metric space, $d_{\mathrm{Pr}}$ is the usual Prohorov metric on $Z \times I$, $\varphi_X\colon X \to Z$ and $\varphi_Y\colon Y \to Z$ are isometric embeddings, and $\widetilde\varphi_X(x, u) = (\varphi_X(x), u)$ (and $\widetilde\varphi_Y(y, u) = (\varphi_Y(y), u)$) leaves the second variable unchanged. Moreover, the compact sets in the Gromov-weak topology can be characterised (see e.g. [1, Theorem 3]), which gives a handle on showing tightness of sequences of random elements in $\mathbb U_1^I$.

We now come to the definition of the generator for the tree-valued Fleming–Viot process with mutation and selection. Note that all genealogical distances of pairs of distinct points in the ultra-metric tree grow at speed 2, since the distance is twice the time to the most recent common ancestor; see $G^{\mathrm{gro}}$ below. For resampling, as in the Moran model from Figure 19.1.1, one individual is replaced by the offspring of another individual; see $G^{\mathrm{res}}$ below. Mutation only affects allelic types; see $G^{\mathrm{mut}}$ below. For selection, see $G^{\mathrm{sel}}$, and also consult Remark 19.2.6 for some more explanations. Some mechanisms reappear in the contribution by Greven and den Hollander [6] in this volume.

Definition 19.2.5 (Generator of the tree-valued Fleming–Viot process with mutation and selection). Let $\gamma, \vartheta, \alpha > 0$. We define $G\colon \Pi \to \Pi$ by

$$G = G^{\mathrm{gro}} + \gamma\,G^{\mathrm{res}} + \vartheta\,G^{\mathrm{mut}} + \alpha\,G^{\mathrm{sel}}.$$

Here, the growth operator (describing the fact that genealogical distances grow when time evolves) is, for $\Phi \in \Pi$ and $[X, r, \mu] \in \mathbb U^I$ (note that for $\Phi \in \Pi_n$, we have $\mu^{\otimes} = \mu^{\otimes n}$ and the summation is over $1 \le k < \ell \le n$),

$$G^{\mathrm{gro}}\Phi([X, r, \mu]) := 2 \int \sum_{k < \ell} \frac{\partial \phi}{\partial r_{k\ell}}(r(\underline x, \underline x), \underline u)\,\mu^{\otimes}(d\underline x, d\underline u). \tag{19.2.2}$$

[...]

[...] $(\mathcal X_t)_{t \ge 0}$ the tree-valued Fleming–Viot process. If $\beta$ is parent-independent, there is a $\mathbb U_1^I$-valued random variable $\mathcal X_\infty$ such that $\mathcal X_t \Rightarrow \mathcal X_\infty$ as $t \to \infty$.
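To make the polynomials of (19.2.1) concrete, here is a minimal sketch in Python that evaluates one exactly on a finite marked ultra-metric space. It rests on simplifying assumptions that are not part of the chapter: the sampling measure is uniform on finitely many points and every point carries one deterministic mark. The names `polynomial`, `relabel`, `pair_distance` and `both_fit` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def polynomial(r, marks, n, phi):
    """Average phi(sample distance matrix, sample marks) over all n-tuples,
    i.e. integrate phi against the n-fold product of the uniform measure."""
    points = range(len(r))
    total = Fraction(0)
    for sample in product(points, repeat=n):
        dist = [[r[a][b] for b in sample] for a in sample]
        types = [marks[a] for a in sample]
        total += Fraction(phi(dist, types))
    return total / len(r) ** n


def relabel(r, marks, perm):
    """Apply a measure-preserving isometry: new point i is old point perm[i]."""
    new_r = [[r[a][b] for b in perm] for a in perm]
    new_marks = [marks[a] for a in perm]
    return new_r, new_marks


def pair_distance(dist, types):
    """phi for n = 2: the genealogical distance of the two sampled points."""
    return dist[0][1]


def both_fit(dist, types):
    """phi for n = 2: indicator that both sampled points carry the fit type."""
    return int(types[0] == "fit" and types[1] == "fit")


r = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]
marks = ["fit", "fit", "unfit"]
mean_distance = polynomial(r, marks, 2, pair_distance)
```

In this example `mean_distance` equals 20/9, and the value is unchanged after `relabel`, which illustrates on a toy case why the polynomial does not depend on the chosen representative.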

19.3 Genealogies under low levels of selection

In this section, we will be dealing with different values of $\alpha$, and will therefore write $\mathbf P^\alpha(\cdot)$ for the corresponding probability measure and $\mathbf E^\alpha[\cdot]$ for its expectation operator. For simplicity, we set $\gamma = 1$ in the sequel. Theorem 19.2.8 makes it possible to study the effect of weak selection (i.e. $\alpha \to 0$) on equilibrium genealogies by using that $\mathbf E^\alpha[G\Phi(\mathcal X_\infty)] = 0$. We take here a simple mutational scheme by assuming a two-alleles model,

$$I = \{\circ, \bullet\}, \quad\text{and}\quad \chi = 1_{\bullet}, \tag{19.3.1}$$

i.e. $\bullet$ is the beneficial and $\circ$ is the deleterious type. For the mutation, we take

$$\beta(\circ, \{\bullet\}) = \vartheta_\circ, \qquad \beta(\bullet, \{\circ\}) = \vartheta_\bullet. \tag{19.3.2}$$


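(Actually, this is parent-independent mutation, since we can as well say that mutation happens at rate $\bar\vartheta := \vartheta_\bullet + \vartheta_\circ$ and gives $\bullet$ with probability $\Theta := \vartheta_\circ/\bar\vartheta$ and $\circ$ with probability $1 - \Theta = \vartheta_\bullet/\bar\vartheta$.) Within this model, note that the frequency of $\bullet$, denoted by $Y = (Y_t)_{t \ge 0}$, solves the SDE

$$dY = \alpha Y(1 - Y)\,dt + \bar\vartheta(\Theta - Y)\,dt + \sqrt{Y(1 - Y)}\,dW$$

for some Brownian motion $W$; see e.g. [4, (5.6)]. In addition, [2, Lemma 8.1] ensures that $\mathbf E^\alpha[\Phi(\mathcal X_\infty)] = \mathbf E^0[\Phi(\mathcal X_\infty)] + O(\alpha)$ as $\alpha \to 0$. We now explain how genealogies under low levels of selection can be described. As an example, we consider

$$\ell_n(\underline{\underline r}) := \inf\Big\{ \tfrac12 \sum_{i=1}^n r_{i,\sigma(i)} : \sigma \in \Sigma_n^1 \Big\}, \tag{19.3.3}$$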

where †1n WD ¹permutations of ¹1; : : : ; nº with one cycleº. Then, for ŒX; r;  2 U1I and x 2 X n , n .r.x; x// is the total length of the tree spanned by x; see [7, Lemma 3.1]. Hence take Z n ˆ .ŒX; r; / WD e n .r.x;x// ˝ .dx; du/; the Laplace transform of the length of the tree spanned by n randomly chosen individuals. For this, we will need ˆnij .ŒX; r; / Z WD e n .r.x;x//  1.u1 D    D ui D unC1 D    D unCj D /˝ .dx; du/: (19.3.4) In words, the quantity ˆnij .ŒX; r; / is the Laplace transform of the length of the genealogical tree of a sample of size n, on the event that the first i individuals within the sample, as well as j additionally picked individuals (outside of the sample) carry the beneficial type . Crucially, we obtain the following recursion. Lemma 19.3.1 ([10, Lemma 10]). For n > 2, ˆnij as in (19.3.4), with the convention that ˆn 1;j D ˆni; 1 D 0 and #N D #ı C # ,       i n i 1 Gˆnij D  ˆni 1;j C  ˆni;j 1 2 2 2   j n n C .n i/j  ˆi C1;j 1 C ij  ˆi;j 1 C  ˆni;j 1 2 C i#ı  ˆni 1;j C j#ı  ˆni;j 1  C ˛  .n i/ˆniC1;j .n C j /ˆni;j C1     nCj N n C C .i C j /# ˛.i C j /  ˆni;j : 2
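The tree-length functional in (19.3.3) can be checked by brute force on small examples. The sketch below, in Python, minimises half the sum of distances along a cycle over all permutations with one cycle; the function name `tree_length` and the two example trees are illustrative assumptions. It is exponential in the sample size and only meant for very small inputs.

```python
from itertools import permutations


def tree_length(r):
    """Half the minimal total distance along a cyclic tour of all points.

    r is a symmetric matrix of genealogical distances (twice the time to
    the most recent common ancestor). Every permutation with a single
    cycle corresponds to a cyclic ordering starting at point 0.
    """
    n = len(r)
    if n < 2:
        return 0.0
    best = None
    for rest in permutations(range(1, n)):
        cycle = (0,) + rest
        total = sum(r[cycle[i]][cycle[(i + 1) % n]] for i in range(n))
        if best is None or total < best:
            best = total
    return best / 2


# Leaves 0, 1 coalesce at time 1 and join leaf 2 at time 2.
# Branch lengths: 1 + 1 + 1 (internal) + 2 = 5.
three_leaves = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]

# Cherries (0, 1) at time 1 and (2, 3) at time 2, root at time 3.
# Branch lengths: 1 + 1 + 2 + 2 + 2 + 1 = 9.
four_leaves = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
```

For both examples the brute-force minimum agrees with the branch lengths added up by hand, in line with the statement that the functional gives the total length of the tree spanned by the sample.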


This recursion can be used as follows: Since the goal is to compute E˛ Œˆn .X1 / D E Œˆn00 .X1 /, we use the ansatz ˛

0 D E˛ ŒGˆn00 .X1 /   n n 1 ˛ DE ˆ00 .X1 / C ˛ ˆn10 .X1 / 2     n n C n ˆ00 .X1 / ; 2

ˆn01 .X1 /



which gives the equality      n n ˛ n C n E Œˆ00 .X1 / D E˛ Œˆn00 1 .X1 / 2 2 C ˛ E˛ Œˆn10 .X1 / ˆn01 .X1 /: Note that we have reduced the task to compute E˛ Œˆn00 .X1 / up to second order in ˛ into the task to compute E˛ Œˆn10 .X1 / ˆn01 .X1 / up to first order. Moreover, a similar equality exists for E˛ Œˆn10 .X1 / ˆn01 .X1 /, precisely,    nC1 C #N C n ˛  E˛ Œˆn10 .X1 / ˆn01 .X1 / 2   n n 1 D  E˛ Œˆ10 .X1 / ˆn01 1 .X1 / 2 C ˛  E˛ Œ.n 1/ˆn20 .X1 / 2nˆn11 .X1 / C .n C 1/ˆn02 .X1 /: Now, the task to compute E˛ Œˆn10 .X1 / ˆn01 .X1 / up to first order is transformed into the task to compute E˛ Œ.n 1/ˆn20 .X1 / 2nˆn11 .X1 / C .n C 1/ˆn02 .X1 / only for ˛ D 0. So, dropping terms of order ˛ 2 , we can replace the last E˛ Œ  -term by E0 Œ  . For the latter, we can explicitly compute various statistics using the Kingman coalescent. Finally, we obtain the following result. Theorem 19.3.2 ([10, Theorem 1]). Let zn WD E˛ Œˆn00 .X1 /

E0 Œˆn00 .X1 /:

Then z1 ; z2 ; : : : satisfy the recursion z1 D 0 and      n n C n  zn D  zn 2 2

1

C ˛ 2 n  an ;

N where a1 ; a2 ; : : : satisfy the recursion a1 D 0 and (with ‚ D #ı =#)      nC1 n N C # C n  an D  an 1 C ‚.1 ‚/  bn C O.˛/; 2 2


where b1 ; b2 ; : : : satisfy the recursion b1 D 0 and        nC2 n n N C 2# C n  bn D  bn 1 C  cn 2 2 2 where c1 ; c2 ; : : : satisfy the recursion c1 D 0 and      nC2 n C 2#N C n  cn D  cn 2 2

1

1

C .n

1/  dn

C 2  e n C dn ;

where e1 ; e2 ; : : : satisfy a recursion e1 D 0 and      n nC1 N  en C 2# C n  en D 2 2

1

C dn

and finally dn D fn

1

fn

gn

1

C gn

N and with f1 D 1, g1 D 1=.1 C 2#/ fn D

n Y kD2

k k

1 ; 1 C 2

n X nC1 gn D n 1 bD2

bY1

1 bC1 2

 kD2

k 2 k 2





n Y

C k kDb

k 2 k 2





C k C 2#N

:

Remark 19.3.3. (1) Actually, an early result on $z_2$ using the ancestral selection graph appears in [15, Theorem 4.26], where it is stated that $z_2 = 1/(1 + \lambda) + O(\alpha^2)$. Using Theorem 19.3.2, it is possible to explicitly compute the $O(\alpha^2)$-term. See also [2, Theorem 5].

(2) Since it can be shown that $d_n$ is positive for all $n$, we can see directly that $e_n, c_n, b_n, a_n$ and $z_n$ are all positive. This implies that, for small $\alpha$, one has $\mathbf E^\alpha[\Phi^n_{00}(\mathcal X_\infty)] > \mathbf E^0[\Phi^n_{00}(\mathcal X_\infty)]$, i.e. trees are shorter (in Laplace transform order) under selection.

(3) A similar recursion is also found for expected tree lengths. This is the basis for Figure 19.3.1. Notably, Theorem 19.3.2 shows that $\mathbf E^\alpha[\Phi^n_{00}(\mathcal X_\infty)] = \mathbf E^0[\Phi^n_{00}(\mathcal X_\infty)] + O(\alpha^2)$, i.e. the length of genealogical trees under selection differs from that of the Kingman coalescent only in second order in $\alpha$. For selection acting on diploids, i.e. a selection operator of the form (19.2.5), this difference is already of first order (see [10]).

Figure 19.3.1. [Plot of $(\mathbf E^0[L_n] - \mathbf E^\alpha[L_n])/\alpha^2$ (vertical axis) against the sample size $n$ from 2 to 100 (horizontal axis).] Recursions similar to those given in Theorem 19.3.2 are used to compute the expected tree length for a sample of size $n$, given by $L_n := \int \ell_n(r(\underline x, \underline x))\,\mu^{\otimes}(d\underline x, d\underline u)$ (with $\ell_n$ as in (19.3.3)). The parameters used are $\bar\vartheta = 1$, $\Theta = \tfrac12$ and $\alpha = 0.00001$ (mimicking the limit for small $\alpha$).

19.4 A result on stochastic averaging, and applications to selection in fast fluctuating environment

Gillespie writes in the preface of his book [5]: “It is my conviction that the only viable model of selection is one based on temporal and spatial fluctuations in the environment.” In this section, we describe how to obtain the correct large population limits for models with temporally fluctuating selection. In the next section this will be applied to the corresponding genealogies. Recall the notion of a generator and a martingale problem from Definition 19.2.1. Assume that the generator for some sequence of pairs of stochastic processes $(X^N, Z^N)$ has the form

$$Gf = G_0 f + N G_1 f + N^2 G_2 f. \tag{19.4.1}$$

Recall that pre-factors of the generator terms $G_0$, $G_1$ and $G_2$ indicate the speed of movement of the process for the respective terms, since $G$ is defined as a time-derivative. We assume that $X^N$ evolves much slower than $Z^N$ in the following sense: If $f$ only depends on $x$, then $G_2 f = 0$, i.e. $G_2$ describes the fast movement of $Z$, and, typically, $G_0 f$ only depends on $x$ (i.e. $G_0$ describes the slow movement of $X$ independent of $Z$). Our goal is to obtain a limit result for $X^N$, i.e. some Markov process $X$ with $X^N \Rightarrow X$ as $N \to \infty$. Since $Z^N$ evolves much faster than $X^N$, the sequence $Z^N$ does not converge on the same time-scale. It is now crucial to find $h$ with

$$G_2 h = -G_1 f + \frac{g}{N} \tag{19.4.2}$$


for some $g$. Then, if $f$ only depends on $x$,

$$\Big(f + \tfrac1N h\Big)(X_t^N, Z_t^N) - \int_0^t G\Big(f + \tfrac1N h\Big)(X_s^N, Z_s^N)\,ds$$
$$\approx f(X_t^N) - \int_0^t \Big[ G_0 f(X_s^N) + N G_1 f(X_s^N, Z_s^N) + G_1 h(X_s^N, Z_s^N) + N\Big({-G_1 f(X_s^N, Z_s^N)} + \tfrac1N g(X_s^N, Z_s^N)\Big) \Big]\,ds$$
$$= f(X_t^N) - \int_0^t \Big[ G_0 f(X_s^N) + G_1 h(X_s^N, Z_s^N) + g(X_s^N, Z_s^N) \Big]\,ds$$

is approximately a martingale. Now, assuming that $X^N \Rightarrow X$, and that $Z^N$ evolves much faster than $X^N$, such that $\pi_x(dz)$ is approximately the equilibrium for $Z^N$ given $X = x$, we find that, in the limit $N \to \infty$, the process $X$ has the generator

$$Gf(x) = G_0 f(x) + \int \big( G_1 h(x, z) + g(x, z) \big)\,\pi_x(dz).$$

Solving (19.4.2) is particularly easy if $G_2$ is of the form, for some distribution $\pi$,

$$G_2 f(x, z) = \int f(x, z')\,\pi(dz') - f(x, z),$$

i.e. $Z^N$ jumps from $z$ to some independent draw from $\pi$ (at rate $N^2$ under the scaling in (19.4.1)). If, moreover, $\int G_1 f(x, z')\,\pi(dz') = 0$, i.e. averaging over $z$ does not contribute to the dynamics generated by $G_1$, then $h = G_1 f$ gives

$$G_2 h = G_2 G_1 f = -G_1 f(x, z),$$

and we have solved (19.4.2) with $g = 0$. As a result, the generator for the limiting process $X$ reads

$$Gf(x) = G_0 f(x) + \int G_1 G_1 f(x, z)\,\pi_x(dz). \tag{19.4.3}$$

Example 19.4.1 (Wright–Fisher diffusion with fluctuating selection). The above approach leads to a limit result, describing the evolution of allele frequencies under fast fluctuating selection,

$$dX = X(1 - X)(1 - 2X)\,dt + \sqrt{X(1 - X)}\,dW + \sqrt2\,X(1 - X)\,dW' \tag{19.4.4}$$

for independent Brownian motions $W$ and $W'$; see [3, Theorem 7.12]. Indeed, consider the Markov process $(X^N, Z^N)$, where $Z^N$ has state space $\{-1, 1\}$ and changes between the states at rate $N^2$. Furthermore, $X^N$ solves

$$dX^N = N Z^N X^N(1 - X^N)\,dt + \sqrt{X^N(1 - X^N)}\,dW$$


for some Brownian motion $W$. The generator of $(X^N, Z^N)$ is then

$$G^N f(x, z) = \underbrace{\tfrac12 x(1 - x)\frac{\partial^2 f(x, z)}{\partial x^2}}_{=:\,G_0 f(x, z)} + N \underbrace{z\,x(1 - x)\frac{\partial f(x, z)}{\partial x}}_{=:\,G_1 f(x, z)} + N^2 \underbrace{\Big( \int f(x, z')\,\pi(dz') - f(x, z) \Big)}_{=:\,G_2 f(x, z)},$$

where $\pi = \tfrac12(\delta_{-1} + \delta_1)$ is the equilibrium for $Z^N$. Now, let $f$ only depend on $x$ and take $h = G_1 f$. Then, since $\int G_1 f(x, z)\,\pi(dz) = 0$, we have that $G_2 h(x, z) = -G_1 f(x, z)$ and the limiting process must have the generator

$$G_0 f(x) + \int G_1 G_1 f(x, z)\,\pi(dz) = G_0 f(x) + x(1 - x)\frac{\partial}{\partial x}\Big( x(1 - x)\frac{\partial f(x)}{\partial x} \Big) = G_0 f(x) + x(1 - x)(1 - 2x) f'(x) + x^2(1 - x)^2 f''(x).$$

In other words, $X$ solves (19.4.4).

The above approach also works in discrete time, which we now formulate. For all details, see [11]. We start with an abstract result on stochastic averaging, which is based on an application of [16].

Remark 19.4.2 (Setting for Theorem 19.4.3).

(1) For every $N \in \mathbb N$, let $G^N\colon \mathcal D \subseteq C(S \times E) \to B(S \times E)$ be linear, and let $G_0^N, G_1^N, G_2^N\colon \mathcal D \to B(S \times E)$ be such that $G^N = G_0^N + N G_1^N + N^2 G_2^N$.

(2) For every $N \in \mathbb N$, let $(X^N, Z^N) = (X_t^N, Z_t^N)_{t \in [0, \infty)}$ be an $S \times E$-valued process such that, for all $f \in \mathcal D$,

$$\Big( f(X_t^N, Z_t^N) - \frac1N \sum_{s \in \mathbb Z_+/N,\ s \le t} G^N f(X_s^N, Z_s^N) \Big)_{t \in \mathbb Z_+/N}$$

is a martingale.

(3) The sequence $(X^N)_{N = 1, 2, \dots}$ of $S$-valued stochastic processes satisfies the compact containment condition.

(4) The family of occupation measures $(\Gamma_{Z^N})_{N = 1, 2, \dots}$ given through

$$\Gamma_{Z^N}(A \times [0, t]) = \int_0^t 1_{Z_s^N \in A}\,ds$$

is tight.
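The expansion of $G_1 G_1 f$ used in Example 19.4.1 can be verified exactly on polynomial test functions. The sketch below, in Python, represents polynomials as coefficient lists (index equals power) and compares $x(1-x)\,\partial_x\big(x(1-x) f'\big)$ with $x(1-x)(1-2x) f' + x^2(1-x)^2 f''$. The factor $z^2 = 1$ for $z \in \{-1, 1\}$ is dropped. All helper names are illustrative assumptions and are not taken from [11].

```python
def trim(p):
    """Drop trailing zero coefficients, keeping at least one entry."""
    p = list(p)
    while len(p) > 1 and p[-1] == 0:
        p.pop()
    return p


def add(p, q):
    """Sum of two polynomials given as coefficient lists."""
    n = max(len(p), len(q))
    p = list(p) + [0] * (n - len(p))
    q = list(q) + [0] * (n - len(q))
    return trim([a + b for a, b in zip(p, q)])


def mul(p, q):
    """Product of two polynomials (discrete convolution of coefficients)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return trim(out)


def deriv(p):
    """Derivative of a polynomial."""
    return trim([k * c for k, c in enumerate(p)][1:] or [0])


W = [0, 1, -1]       # x(1 - x)
W_PRIME = [1, -2]    # 1 - 2x


def g1_g1(f):
    """x(1-x) d/dx [ x(1-x) f'(x) ], i.e. G1 G1 f with z^2 = 1."""
    return mul(W, deriv(mul(W, deriv(f))))


def expanded(f):
    """x(1-x)(1-2x) f'(x) + x^2 (1-x)^2 f''(x)."""
    first_order = mul(mul(W, W_PRIME), deriv(f))
    second_order = mul(mul(W, W), deriv(deriv(f)))
    return add(first_order, second_order)
```

For $f(x) = x^3$, both sides evaluate to $9x^3 - 21x^4 + 12x^5$. The check only confirms the product-rule step, not the convergence statement itself.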


Theorem 19.4.3 ([11, Theorem 2.3]). Let the setting from Remark 19.4.2 be given, let $\mathcal D_0 \subseteq C_b(S)$ be a dense set in the topology of uniform convergence on compact sets and let $A_1\colon \mathcal D_0 \to C_b(S)$ and $A_2\colon \mathcal D_0 \to C(S \times H)$. Suppose for every $f \in \mathcal D_0$ that there exist $f_N, h_N \in \mathcal D$, $N = 1, 2, \dots$, such that for all $N = 1, 2, \dots$ it holds that $G_2^N f_N = 0$, such that for all $t \in [0, \infty)$ it holds that

$$\lim_{N \to \infty} \mathbf E\Big[ \sup_{s \in [0, t]} \Big| f(X_s^N) - \Big( f_N + \tfrac1N h_N \Big)(X_s^N, Z_s^N) \Big| \Big] = 0,$$

such that for all $t \in [0, \infty)$ there exists $p \in (1, \infty)$ with

$$\sup_{N \in \mathbb N} \mathbf E\Big[ \int_0^t |(A_2 f)(X_s^N, Z_s^N)|^p\,ds \Big] < \infty,$$

such that all integrals in (19.4.5) and (19.4.6) are well-defined and such that for all $t \in [0, \infty)$ it holds that

$$\lim_{N \to \infty} \mathbf E\Big[ \sup_{s \in [0, t]} \Big| \int_0^s \Big[ (A_1 f)(X_r^N) - \big( N G_2^N h_N + N G_1^N f_N + G_0^N f_N \big)(X_r^N, Z_r^N) \Big]\,dr \Big| \Big] = 0, \tag{19.4.5}$$

$$\lim_{N \to \infty} \mathbf E\Big[ \sup_{s \in [0, t]} \Big| \int_0^s \Big[ (A_2 f)(X_r^N, Z_r^N) - \Big( G_1^N h_N + \tfrac1N G_0^N h_N \Big)(X_r^N, Z_r^N) \Big]\,dr \Big| \Big] = 0. \tag{19.4.6}$$

Then $(X^N, \Gamma_{Z^N})_{N = 1, 2, \dots}$ is relatively compact in $D([0, \infty); S) \times \mathcal M(E \times [0, \infty))$, and for every limit point $((X_t)_{t \ge 0}, \Gamma)$ and for every $f \in \mathcal D_0$ it holds that all integrals in (19.4.7) are well-defined and

$$\Big( f(X_t) - \int_0^t (A_1 f)(X_s)\,ds - \int_0^t \int_H (A_2 f)(X_s, y)\,\Gamma(ds, dy) \Big)_{t \in [0, \infty)} \tag{19.4.7}$$

is a martingale.

We are now going to apply Theorem 19.4.3 to a model with fluctuating selection. The corresponding Wright–Fisher model is known as the Karlin–Levikson model in the special case $p = 1$ (see [12]). We present here a simplified version of [11, Theorem 3.3]. Differing from the standard setting, we assume that the environment has a chance $p$ to change in each generation. Most interestingly, the parameter $p$ also appears in the diffusion limit; see Theorem 19.4.5 below. Note that in [11], more general selection schemes are considered.

Definition 19.4.4 (Selection in fluctuating environment). Consider the time-discrete Wright–Fisher model of size $N$ with alleles $\bullet$ and $\circ$. Let $p \in (0, 1]$, and let $\pi_N = \tfrac12\big(\delta_{-\sigma/\sqrt{2N}} + \delta_{\sigma/\sqrt{2N}}\big)$ for some $\sigma > 0$ be the distribution of the fitness of $\bullet$. (If $\Sigma_n^N \sim \pi_N$, then $-\Sigma_n^N$ is the fitness of $\circ$.) The environment changes independently


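with probability $p$ each generation, i.e. we let $\Sigma_0^N \sim \pi_N$, and recursively $\Sigma_{n+1}^N = \Sigma_n^N$ with probability $1 - p$ and $\Sigma_{n+1}^N \sim \pi_N$ independently from $\Sigma_1^N, \dots, \Sigma_n^N$ otherwise. Moreover, the allele frequency for $\bullet$ evolves such that, given $X_n^N$ and $\Sigma_n^N$,

$$N X_{n+1}^N \sim B\Big( N, \frac{(1 + \Sigma_n^N) X_n^N}{(1 + \Sigma_n^N) X_n^N + (1 - \Sigma_n^N)(1 - X_n^N)} \Big).$$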

Here, we have independent selection regimes that last for a geometrically distributed number of generations. In the i.i.d. case, $p = 1$, Karlin and Levikson [12] derive the coefficients of the SDE (19.4.8) without giving a formal proof; cf. also [3, Theorem 7.12]. In the non-i.i.d. case (that is, if the sequence of selection coefficients is auto-correlated), the authors of [9] conjecture in [9, equation (8)] a diffusion limit which differs from (19.4.8) by a factor of $2 - p$. We note that the derivation in [9] is based on an analogy to the time-continuous case, which does not seem to apply in this case of a fast fluctuating environment.

Theorem 19.4.5 ([11, Theorem 3.3]). For $N = 1, 2, \dots$, let $(X_n^N, \Sigma_n^N)_{n = 0, 1, 2, \dots}$ be as above and assume that $X_0^N \Rightarrow X_0$ as $N \to \infty$. Then $X^N \Rightarrow X$ as $N \to \infty$ (in the space of càdlàg paths on $[0, 1]$), where $X$ has initial condition $X_0$ and follows the SDE

$$dX = X(1 - X)(2 - p)\beta\Big(\tfrac12 - X\Big)\,dt + \sqrt{X_t(1 - X_t)}\,dW + \sqrt{(2 - p)\beta}\,X(1 - X)\,dW', \tag{19.4.8}$$

with $\beta = \frac{2\sigma^2}{p}$, where $W, W'$ are independent standard Brownian motions.
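The discrete model of Definition 19.4.4 is straightforward to simulate. Below is a minimal sketch in Python under stated assumptions: the fitness values are taken to be $\pm\sigma/\sqrt{2N}$ (one reading of the definition), the binomial draw is implemented as a sum of Bernoulli trials, and the function name `simulate_fluctuating_wf` with its parameters is hypothetical. It does not rescale time and is not the construction used in [11].

```python
import math
import random


def simulate_fluctuating_wf(N, sigma, p, x0, generations, seed=0):
    """Simulate the frequency path of the fit allele over discrete generations.

    N: population size, sigma: selection scale (assumed < sqrt(2N)),
    p: probability that the environment is redrawn in a generation,
    x0: initial frequency of the fit allele.
    """
    rng = random.Random(seed)
    s = sigma / math.sqrt(2 * N)
    env = rng.choice((-s, s))          # current fitness of the fit allele
    x = x0
    path = [x]
    for _ in range(generations):
        weight_fit = (1 + env) * x
        weight_unfit = (1 - env) * (1 - x)
        prob = weight_fit / (weight_fit + weight_unfit)
        # Binomial(N, prob) draw as a sum of N Bernoulli trials.
        offspring = sum(rng.random() < prob for _ in range(N))
        x = offspring / N
        path.append(x)
        # With probability p the environment is redrawn independently.
        if rng.random() < p:
            env = rng.choice((-s, s))
    return path


path = simulate_fluctuating_wf(N=200, sigma=1.0, p=0.5, x0=0.5, generations=50)
```

To compare with the diffusion limit one would look at the path on the time scale of $N$ generations per unit of time, which this sketch leaves to the caller.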

19.5 The tree-valued Fleming–Viot process under fluctuating selection

Actually, the calculation that leads to the limit (19.4.4) can also be performed on the level of trees. We briefly report on this approach, which is currently work in progress. Using the setting of Section 19.2, in particular the space $\mathbb U_1^I$ with $I = \{\bullet, \circ\}$ (as in Section 19.3), we will first define a sequence of Markov processes with state space $\mathbb U_1^I \times \{-1, 1\}$, given through a sequence of generators $G^N$ with domain

$$\widetilde\Pi := \big\{ (x, z) \mapsto \Phi\Psi(x, z) := \Phi(x)\Psi(z) : \Phi \in \Pi,\ \Psi \in C(\{-1, 1\}) \big\},$$

where $\Pi$ is as in (19.2.1). The operator $G^N\colon \widetilde\Pi \to \widetilde\Pi$ is given through

$$G^N \Phi\Psi(x, z) = \Psi(z)\Big( G^{\mathrm{gro}}\Phi(x) + G^{\mathrm{res}}\Phi(x) + \vartheta\,G^{\mathrm{mut}}\Phi(x) + \sqrt N\,\sigma z\,G^{\mathrm{sel}}\Phi(x) \Big) + N \int \big( \Psi(z') - \Psi(z) \big)\,\pi(dz'),$$


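for some $\sigma > 0$ and some $\pi \in \mathcal M_1(\{-1, 1\})$, where $G^{\mathrm{gro}}$ is given by (19.2.2), $G^{\mathrm{res}}$ is given by (19.2.3), $G^{\mathrm{mut}}$ is as in (19.2.3) with $\beta$ as in (19.3.2), and $G^{\mathrm{sel}}$ is as in (19.2.4) with $\chi(u) = 1_{u = \bullet}$ as in (19.3.1). For any initial distribution $\nu_0$, it is a simple extension of Theorem 19.2.8 to see that the $(G^N, \widetilde\Pi, \nu_0)$-martingale problem is well-posed. Note that the operator $G^N$ has exactly the form (19.4.1). Therefore, recalling Section 19.4, we can consider the limit $N \to \infty$ and apply (19.4.3) for computing the generator of the limiting process. From Theorem 19.4.3, necessarily, if $\mathcal X_0^N \Rightarrow \mathcal X_0 \sim \nu_0$ and if $(\mathcal X^N)_{N = 1, 2, \dots}$ satisfies the compact containment condition, we have convergence $\mathcal X^N \Rightarrow \mathcal X$ for some $\mathbb U_1^I$-valued process $\mathcal X$, which solves the $(G, \Pi, \nu_0)$-martingale problem for

$$G = G^{\mathrm{gro}} + G^{\mathrm{res}} + \vartheta\,G^{\mathrm{mut}} + \sigma^2 G^{\mathrm{fsel}} \quad\text{with}\quad G^{\mathrm{fsel}}\Phi = G^{\mathrm{sel}} G^{\mathrm{sel}} \Phi;$$

see (19.4.3). In order to compute $G^{\mathrm{fsel}}$ more explicitly, we write

$$G^{\mathrm{fsel}}\Phi(x) = G^{\mathrm{sel}} \int \sum_k \big( 1_{u_k = \bullet} - 1_{u_{n+1} = \bullet} \big)\,\phi(r(\underline x, \underline x))\,\mu^{\otimes}(d\underline x, d\underline u)$$
$$= \sum_{k, \ell} \int \big( 1_{u_k = \bullet} - 1_{u_{n+1} = \bullet} \big)\big( 1_{u_\ell = \bullet} - 1_{u_{n+2} = \bullet} \big)\,\phi(r(\underline x, \underline x))\,\mu^{\otimes}(d\underline x, d\underline u)$$
$$\quad + \sum_k \int \big( 1_{u_k = \bullet} - 1_{u_{n+1} = \bullet} \big)\big( 1_{u_{n+1} = \bullet} - 1_{u_{n+2} = \bullet} \big)\,\phi(r(\underline x, \underline x))\,\mu^{\otimes}(d\underline x, d\underline u).$$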

After some rearrangements, this gives Z 1X fsel 1unC1 ¤unC2 G ˆ.x/ D 2 k  . ı nC1;` /.r.x; x/; u/˝ .dx; du/ Z X 1 C 1u Du 2 nC1 nC2 k¤`  . ı nC1;k ı nC2;` C /.r.x; x/; u/ C 1unC1 ¤unC2 . ı nC1;`

/.r.x; x/; u/˝ .dx; du/:

(19.5.1)

Let us briefly discuss well-posedness of the $(G, \Pi, \nu_0)$-martingale problem. Both existence and uniqueness require a useful bound on the number of ancestral lines influencing a finite sample of individuals from the population. If there are $n$ such ancestral lines, one can see that the number of ancestors decreases at rate $\binom n2$ due to resampling, and (closely inspecting (19.5.1)) increases by two lines at rate $\sigma^2 n^2$ due to $G^{\mathrm{fsel}}$. Hence, we can bound the number of ancestors only if $\sigma^2 < \tfrac14$, and we only obtain well-posedness of the martingale problem in this case.
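One way to read the threshold, offered as a back-of-the-envelope sketch rather than the argument of a cited proof, is to compare the two rates for $n$ ancestral lines. Writing $\sigma$ for the selection strength in $G^N$ (an assumed symbol), coalescence removes one line at rate $\binom n2$, while fluctuating selection adds two lines at rate $\sigma^2 n^2$, so the net drift of the number of lines is

$$2\sigma^2 n^2 - \binom n2 = \Big( 2\sigma^2 - \tfrac12 \Big) n^2 + \tfrac n2 .$$

For large $n$ this is negative exactly when $\sigma^2 < \tfrac14$, which matches the stated condition for keeping the number of ancestors under control.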


References

[1] A. Depperschmidt, A. Greven, and P. Pfaffelhuber, Marked metric measure spaces, Electron. Commun. Probab. 16 (2011), 174–188.
[2] A. Depperschmidt, A. Greven, and P. Pfaffelhuber, Tree-valued Fleming–Viot dynamics with mutation and selection, Ann. Appl. Probab. 22 (2012), 2560–2615.
[3] R. Durrett, Probability Models for DNA Sequence Evolution, 2nd ed., Springer, New York, 2008.
[4] W. Ewens, Mathematical Population Genetics. I. Theoretical Introduction, 2nd ed., Springer, New York, 2004.
[5] J. H. Gillespie, Substitution processes in molecular evolution. I. Uniform and clustered substitutions in a haploid model, Genetics 134 (1993), 971–981.
[6] A. Greven and F. den Hollander, From high to low volatility: Spatial Cannings with block resampling and spatial Fleming–Viot with seed-bank, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 267–289.
[7] A. Greven, P. Pfaffelhuber, and A. Winter, Tree-valued resampling dynamics, Martingale problems and applications, Probab. Theory Relat. Fields 155 (2013), 789–838.
[8] R. Hudson, Properties of a neutral allele model with intragenic recombination, Theor. Popul. Biol. 23 (1983), 183–201.
[9] E. Huerta-Sanchez, R. Durrett, and C. D. Bustamante, Population genetics of polymorphism and divergence under fluctuating selection, Genetics 178 (2008), 325–337.
[10] E. Huss and P. Pfaffelhuber, Genealogical distances under low levels of selection, Theor. Popul. Biol. 131 (2020), 2–11.
[11] M. Hutzenthaler, P. Pfaffelhuber, and C. Printz, Stochastic averaging for multiscale Markov processes with an application to a Wright–Fisher model with fluctuating selection, preprint (2018), https://arxiv.org/abs/1504.01508v2.
[12] S. Karlin and B. Levikson, Temporal fluctuations in selection intensities: Case of small population size, Theor. Popul. Biol. 6 (1974), 383–412.
[13] G. Kersting and A. Wakolbinger, Probabilistic aspects of Λ-coalescents in equilibrium and in evolution, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 223–245.
[14] J. F. C. Kingman, The coalescent, Stochastic Process. Appl. 13 (1982), 235–248.
[15] S. Krone and C. Neuhauser, Ancestral processes with selection, Theor. Popul. Biol. 51 (1997), 210–237.
[16] T. G. Kurtz, Averaging for martingale problems and stochastic approximation, in: Applied Stochastic Analysis (eds. I. Karatzas and D. Ocone), Springer, Berlin (1992), 186–209.
[17] C. Neuhauser and S. Krone, The genealogy of samples in models with selection, Genetics 145 (1997), 519–534.
[18] A. Winter, Algebraic measure trees: Statistics based on sample subtree shapes and sample subtree masses, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 451–475.

Chapter 20

Counting, grafting and evolving binary trees

Thomas Wiehe

Binary trees are fundamental objects in models of evolutionary biology and population genetics. Here, we discuss some of their combinatorial and structural properties as they depend on the tree class considered. Furthermore, the process by which trees are generated determines the probability distribution in tree space. Yule trees, for instance, are generated by a pure birth process. When considered as unordered, they have neither a closed-form enumeration nor a simple probability distribution. But their ordered siblings have both. They present the object of choice when studying tree structure in the framework of evolving genealogies.

20.1 Introduction Trees appear in different contexts and with different properties. In graph theory, they are defined as connected, acyclic graphs: any pair of vertices (nodes) is connected by exactly one concatenated sequence of edges (branches). Tagging one node, called root of the tree, implicitly establishes a directionality of the graph. In theoretical biology, trees are used to describe genealogies of cells, genes, individuals or species. Depending on the biological context, planarity of the tree, degree and labelling of nodes, directionality and length of branches may or may not be of interest. Cardinality and probability distribution depend strongly on these properties. The study of trees as mathematical objects reaches back at least to the 1850s, when Cayley [9] derived recursion formulae for the enumeration of trees with a finite number of nodes, and also recognised the link to isomer chemistry. As an alternative to recursions, bijections between trees and permutations can help to solve certain counting problems [2, 16]. More generally, and yielding insight into asymptotic behaviour for large trees, the tools of analytic combinatorics are particularly powerful [2, 38]. Comprehensive treatments are found in the classical textbook by Flajolet and Sedgwick [23] and, focusing on random trees only, in the textbook by Drmota [17]. With a view from computer science, where they appear primarily as data structures, trees are covered in the epitomic opus by Knuth [30, Volume 1]. The link of “tree theory” with biology has been established by Yule’s seminal paper of 1925 [53], when seeking to explain the distribution of the number of species within genera. It initiated a long tradition of research in phylogenetics and macro-evolution on enumeration, topology and distribution of trees generated by random processes [5, 6, 8, 20, 29, 31, 34, 36, 44, 46]. 
The border between macro- and micro-evolution is fuzzy, but intensely investigated in the context of gene tree embeddings in species trees [13, 14, 31, 41]. Perhaps the most genuine application of Yule’s original model,


and with most ramifications, lies in population genetics as a model of individual gene genealogies and their statistical properties. Kingman’s [28] coalescent is its backward-in-time analogue and – in the guise of its evolved descendants – features in several chapters of this volume. The genetic operation of recombination translates into subtree-prune and -regraft operations, opening a field of active theoretical research on tree transformations [45], in part also covered in this volume. Standard references on the coalescent are the textbooks by Wakeley [49] and Durrett [18]. Aldous [1] offers a view on Yule’s paper from a modern perspective. Given that trees are treated in different disciplines, and with different degrees of mathematical rigour, it is not surprising to find oneself confronted with a non-unified, sometimes even inconsistent, terminology and nomenclature, which alone can make it difficult to identify the relevant theoretical features of some tree class for a specific biological application. Without claiming to authoritatively clarify this problem, we start the section below with an (incomplete) catalogue of tree classes and their enumerations (Section 20.2). We will then devote special attention to Yule trees and explore some of their structural properties (Sections 20.3 and 20.4). Since they represent the scaffold of the widely used coalescent model in population genetics, we will consider two such applications (Sections 20.5 and 20.6).

20.2 Counting trees

20.2.1 Preliminaries

We consider rooted, binary, finite trees: there is a unique node, the root, defining a directionality for all branches. Each branch is delimited by a parent and a child node. The root is ancestor of all other nodes. The nodes are subdivided into $n < \infty$ external and $m = n - 1$ internal nodes, including the root. All internal nodes have exactly two children. External nodes have no descendants and are also called leaves. The size of a tree is the number of its leaves. A subtree is a tree that is rooted at some node of the original tree. Subtrees of size 2 are also called cherries, subtrees of size 3 pitchforks. A caterpillar is a tree for which at least one of the subtrees at each of its internal nodes has size 1. Slightly more generally, a $c$-caterpillar is a (sub-)tree of size $c$ that is a caterpillar. Thus, a cherry is a 2-caterpillar, and a pitchfork is a 3-caterpillar. Since trees here are binary, all internal nodes have a left and a right subtree, which are rooted at the left and right child. Trees are ordered (plane) if left and right can be distinguished, otherwise they are un-ordered (non-plane).

20.2.2 Classification of binary trees

Tree enumerations depend crucially on the presence and the kind of node labels. Among the many possibilities, we restrict ourselves to the following cases: presence or absence of alphanumeric labels at external nodes, and presence or absence of totally ordered numeric labels at internal nodes. Trees without any node labels are called

Counting, grafting and evolving binary trees

Figure 20.2.1. The six ordered ranked trees of size $n = 4$ and the corresponding six permutations of {1, 2, 3} obtained by reading out internal labels during in-order tree traversal [30]: {1,2,3}, {1,3,2}, {2,1,3}, {2,3,1}, {3,1,2}, {3,2,1}. Note, for example, the difference between {2, 1, 3} and {3, 1, 2}.

shape trees or topologies [8, 40]. We call a tree ranked or a history [13, 25, 47], if the internal nodes are labelled with integers $1, \dots, n-1$ such that (i) the root has label 1, (ii) distinct nodes have distinct labels and (iii) every child has a larger label than its parent. We call a tree labelled, if the leaves carry labels. Labelled trees can be thought of as phylogenies with species names as leaf labels. Without internal labels, they are also called cladograms, with internal labels they are ranked phylogenies or labelled histories [47]. Their cardinality follows, for instance, from a coalescent-like construction: randomly selecting two out of $k$ labelled lineages to coalesce, there are $\binom{k}{2}$ possibilities [31]. The product is $\prod_{k=2}^{n} \binom{k}{2} = n!\,(n-1)!/2^{n-1}$. When shape trees have a left/right orientation, they are called Catalan trees, because they are enumerated by the Catalan numbers $C_m = \binom{2m}{m}/(m+1)$, see [43, A000108], where $m = n - 1$ is the number of internal nodes of such trees. Finally, ordered histories are ordered ranked trees. Since they map bijectively to permutations of $m = n - 1$ integers, we also call them permutation trees. They are enumerated by the factorials $m!$. To see this, one can read the labels of all ordered ranked trees of a given size in an in-order [30] tree traversal, observing that all subtrees, except cherries, have a distinguishable left-right order (Figure 20.2.1).

We denote ordered trees of size $n$ by $\check{\Lambda}_n^{\cdot\cdot}$ and un-ordered trees by $\Lambda_n^{\cdot\cdot}$. The exponent is a placeholder to indicate presence or absence of internal or external labels. The tree classes mentioned above are summarised in Table 20.2.1. Note that these classes represent only a subset of the possibilities. For instance, Felsenstein [20] discusses phylogenies with non-numeric labels at internal nodes. This constitutes a class that is different from $\Lambda_n^{++}$ and that has a different cardinality: it leads to Cayley’s formula [10], enumerating non-binary trees (cf.
[43, A000169] and [25]). Not all tree classes have closed form enumerations. Often, ordered trees do, while un-ordered trees do not [23, p. 87]. In our list (Table 20.2.1), the cardinalities of unordered shape and ranked trees are given only implicitly via generating functions, but their ordered versions have closed formulae. The (ordinary) generating function and the exponential generating function of an integer sequence $(a_n)_n$ are given by the formal power series

$$f(x) = \sum_{n \ge 0} a_n x^n \quad\text{and}\quad F(x) = \sum_{n \ge 0} a_n \frac{x^n}{n!},$$

Thomas Wiehe

| name | alias | ext. lab. | int. lab. | symbol | cardinality | OEIS¹ ID |
|---|---|---|---|---|---|---|
| unordered trees | | | | | | |
| shape trees | topologies² | − | − | $\Lambda_n^{--}$ | eq. (20.2.1) | A001190 |
| ranked trees | histories³ | − | + | $\Lambda_n^{-+}$ | eq. (20.2.2) | A000111 |
| labelled trees | phylogenies⁴ | + | − | $\Lambda_n^{+-}$ | $\frac{(2n-3)!}{2^{n-2}(n-2)!}$ | A001147 |
| labelled ranked trees⁵ | ranked phylogenies | + | + | $\Lambda_n^{++}$ | $\frac{n!\,(n-1)!}{2^{n-1}}$ | A006472 |
| ordered trees | | | | | | |
| Catalan trees⁶ | ordered topologies | − | − | $\check{\Lambda}_n^{--}$ | $\frac{1}{n}\binom{2(n-1)}{n-1}$ | A000108 |
| permutation trees | ordered histories⁷ | − | + | $\check{\Lambda}_n^{-+}$ | $(n-1)!$ | A000142 |

¹ www.oeis.org, [43]
² [8, 40]; called topological types in [44]
³ [13, 25, 47]
⁴ [1, 44]; called rooted phylogeny in [20] or tree form in [8]
⁵ cf. [31], there in the context of Kingman’s coalescent
⁶ [17, p. 5]
⁷ called shapes in [25]

Table 20.2.1. Classes of un-ordered ($\Lambda$) and ordered ($\check{\Lambda}$) trees of size $n$. Presence (+) or absence (−) of internal or external labels is indicated by superscripts. Cardinalities are $|\Lambda_n|$ and $|\check{\Lambda}_n|$.

respectively. If $f$ or $F$ are holomorphic functions defined in a neighbourhood around $x = 0$, the series can be interpreted as their Taylor expansions and, for instance, their asymptotic properties can be studied by analytic means. In 1922, Wedderburn [50] showed that the cardinalities of shape trees can be implicitly represented via a functional equation of a generating function. De Bruijn and Klarner derived the somewhat simpler representation

$$f(x) = x + \frac{1}{2}\left(f^2(x) + f(x^2)\right) \tag{20.2.1}$$

and showed [7] that its solution $f$ generates the cardinalities of shape trees of size $n$, via

$$f(x) = \sum_n |\Lambda_n^{--}|\, x^n.$$

For $1 \le n \le 10$, the coefficients are 1, 1, 1, 2, 3, 6, 11, 23, 46, 98.
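Comparing coefficients of $x^n$ on both sides of equation (20.2.1) gives a recursion for the number of shape trees. The following is a minimal sketch of that recursion; the function name `shape_tree_counts` is chosen here for illustration and does not appear in the cited literature.

```python
def shape_tree_counts(n_max):
    """Number of unlabelled, unordered binary trees of size 1..n_max.

    Coefficient comparison in f(x) = x + (f(x)^2 + f(x^2)) / 2 gives
    a_1 = 1 and, for n >= 2,
    a_n = (sum_{i=1}^{n-1} a_i a_{n-i} + [n even] a_{n/2}) / 2.
    """
    a = [0] * (n_max + 1)
    if n_max >= 1:
        a[1] = 1
    for n in range(2, n_max + 1):
        convolution = sum(a[i] * a[n - i] for i in range(1, n))  # from f(x)^2
        diagonal = a[n // 2] if n % 2 == 0 else 0                # from f(x^2)
        a[n] = (convolution + diagonal) // 2
    return a[1:]


print(shape_tree_counts(10))
```

For `n_max = 10` the output reproduces the ten coefficients listed above.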

For unordered ranked trees (histories), the cardinalities are identical with the Euler numbers and are given by the coefficients of the exponential generating function

$$F(x) = \sec(x) + \tan(x) = \sum_n |\Lambda_{n+1}^{-+}| \frac{x^n}{n!}; \tag{20.2.2}$$

for $1 \le n \le 10$, they are 1, 1, 1, 2, 5, 16, 61, 272, 1385, 7936. A natural way to construct unordered ranked trees of any finite size is by recursion: given a ranked tree of size $m = n - 1$, construct a tree of size $n$ by randomly choosing one of the $m$ leaves to give rise to two children and label the chosen leaf, which thereby becomes the internal node with the largest label, with the integer $m$. Following other authors [11, 47], we call trees generated in this way Yule trees and the underlying model (process) the Yule model (Yule process). In the equivalent backward process, one starts from $n$ leaves and their $n$ parental branches. One randomly, and iteratively, selects two branches to coalesce into a single one until all are coalesced. When, in addition, a time axis for the coalescent times is introduced, and when these times are exponentially distributed with a parameter proportional to $\binom{k}{2}$, where $k$ is the current number of branches, Yule trees are called coalescent trees, generated by the (Kingman-)coalescent process [28]. They are the basis of a plethora of genealogical models in population genetics.
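The Euler numbers in (20.2.2) can be generated without symbolic algebra. Since $F = \sec + \tan$ satisfies $2F' = F^2 + 1$, its Taylor coefficients $E_n$ obey $2E_{n+1} = \sum_k \binom{n}{k} E_k E_{n-k}$ for $n \ge 1$. The sketch below uses this standard identity; the function name `ranked_tree_counts` is an illustrative choice.

```python
from math import comb


def ranked_tree_counts(n_max):
    """Number of unordered ranked trees of size 1..n_max (n_max >= 2).

    Uses E_0 = E_1 = 1 and 2 E_{n+1} = sum_k C(n, k) E_k E_{n-k} (n >= 1),
    which follows from 2 F' = F^2 + 1 for F = sec + tan.
    A tree of size n corresponds to E_{n-1}.
    """
    euler = [1, 1]
    for n in range(1, n_max - 1):
        total = sum(comb(n, k) * euler[k] * euler[n - k] for k in range(n + 1))
        euler.append(total // 2)
    return euler[:n_max]


print(ranked_tree_counts(10))
```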

20.3 Properties of ranked trees

Note that the Yule process does not generate uniformly distributed trees in $\Lambda_n^{-+}$. For instance, in Figure 20.2.1 the 4-caterpillar is generated with probability $\tfrac{2}{3}$ and the balanced tree, corresponding to the permutations {2, 1, 3} and {3, 1, 2}, with probability $\tfrac{1}{3}$. Only when considered as trees in $\check{\Lambda}_n^{-+}$, they become uniformly distributed under the Yule process, each with probability $1/(n-1)!$. Other tree generating processes may lead to still other probability distributions [34]. Since ordered and un-ordered trees are identical up to left/right order of subtrees that are not cherries, there are exactly $2^{n-1-o}$ different ordered trees for each unordered one with $o$ cherries. Thus, given a ranked tree, one also knows the probability with which it is generated, by simply counting its cherries (cf. [48]). With $O$ denoting the random variable for the number of cherries, we have

$$\mathrm{Prob}(\text{given ranked tree of size } n \text{ with } O = o \text{ cherries}) = \frac{2^{n-1-o}}{(n-1)!}.$$
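The cherry-counting formula can be checked by brute force for small $n$. The sketch below grows all unordered ranked trees by the Yule process and tracks their exact probabilities. The tree encoding (a leaf is `()`, an internal node is `(label, child, child)` with children sorted to forget left/right) and all function names are assumptions made for this illustration.

```python
from fractions import Fraction
from math import factorial

LEAF = ()


def split_leaves(tree, label):
    """Yield, once per leaf, the tree in which that leaf becomes a cherry."""
    if tree == LEAF:
        yield (label, LEAF, LEAF)
        return
    lab, a, b = tree
    for new_a in split_leaves(a, label):
        yield (lab,) + tuple(sorted((new_a, b)))   # sorting removes left/right
    for new_b in split_leaves(b, label):
        yield (lab,) + tuple(sorted((a, new_b)))


def count_cherries(tree):
    if tree == LEAF:
        return 0
    _, a, b = tree
    if a == LEAF and b == LEAF:
        return 1
    return count_cherries(a) + count_cherries(b)


def yule_distribution(n):
    """Exact Yule probabilities of all unordered ranked trees of size n >= 2."""
    dist = {(1, LEAF, LEAF): Fraction(1)}          # the only tree of size 2
    for size in range(2, n):
        grown = {}
        for tree, p in dist.items():
            for child in split_leaves(tree, size):  # each leaf has prob 1/size
                grown[child] = grown.get(child, 0) + p / size
        dist = grown
    return dist


n = 6
for tree, p in yule_distribution(n).items():
    assert p == Fraction(2 ** (n - 1 - count_cherries(tree)), factorial(n - 1))
```

The enumeration is exponential in $n$ and is meant only as a sanity check for small trees.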

To explore the unconditional distribution of Yule trees, we remark that all external nodes and branches (shown in grey in Figure 20.2.2) may be stripped from a ranked tree of size $n$ without loss of information. Such stripping leads to a reduced tree with $m = n - 1$ nodes with ordered labels, all of out-degree 0, 1 or 2 (see [15]). Nodes of out-degree 0 represent cherries in the original tree. Sometimes, reduced trees are called pruned trees [23], a term which we avoid, to not confuse it with “tree pruning”

Figure 20.2.2. The sixteen possible un-ordered ranked trees of size $n = 6$, classified by shape. Within each class, all admissible orderings of the internal nodes are displayed. Number of cherries ($|C_2|$) and pitchforks ($|C_3|$) are indicated. The number of all ordered ranked trees, classified by shape, is obtained by multiplying with the factor $2^{m - |C_2|}$. The total number is $5! = 120$. Branch lengths are without meaning; position of an internal node in a tree is given by the node label, not by the actual drawing of its position. External nodes and branches are shown in grey. Removing them leads to the reduced trees of size 5. They can be uniquely identified with the original trees of size 6.

discussed later. Reduced trees with $m$ nodes can be constructed recursively, starting from a reduced tree with one node, according to the following production rule:

$$(o, m) \to (o, m+1)^{o}\,(o+1, m+1)^{m-2o+1},$$

where $o$ is the number of cherries and $m$ the total number of nodes in the current tree. The exponent counts how many new trees with $o$ (or $o + 1$) cherries and $m + 1$ nodes are produced. Note that in each step $m$ is increased by one and the number of cherries may either remain unchanged or also increase by one. The former happens when the new branch and node are appended at a node of out-degree 0, the latter, when appended at a node of out-degree 1. At nodes of out-degree 2 (true internal nodes) nothing can be appended. For instance, starting with $(1, 1)$, the production rule generates the sequence $(1,2)^1 (2,2)^0$; $(1,3)^1 (2,3)^1$; $(1,4)^1 (2,4)^2$ and $(2,4)^2 (3,4)^0$; …. Consider now the bivariate exponential generating function

$$F(x, z) = \sum_{\substack{\text{reduced trees with}\\ o \text{ cherries and } m \text{ nodes}}} x^{o}\, \frac{z^{m}}{m!}.$$
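The production rule can be iterated directly to count reduced trees by number of nodes and cherries. The following sketch does this with plain dictionaries; the function name `cherry_partition` and the data layout are illustrative assumptions.

```python
def cherry_partition(m_max):
    """counts[m][o] = number of reduced trees with m nodes and o cherries.

    Applies (o, m) -> (o, m+1)^o (o+1, m+1)^(m-2o+1), starting from (1, 1).
    """
    counts = {1: {1: 1}}
    for m in range(1, m_max):
        nxt = {}
        for o, c in counts[m].items():
            # append at one of the o nodes of out-degree 0: cherries unchanged
            nxt[o] = nxt.get(o, 0) + c * o
            # append at one of the m - 2o + 1 nodes of out-degree 1: one more cherry
            mult = m - 2 * o + 1
            if mult > 0:
                nxt[o + 1] = nxt.get(o + 1, 0) + c * mult
        counts[m + 1] = nxt
    return counts


counts = cherry_partition(10)
print(counts[5])   # reduced trees with 5 nodes, i.e. ranked trees of size 6
```

Summing `counts[m]` over all cherry numbers gives the number of ranked trees of size $m + 1$.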

The production rule can then be translated into algebraic terms as

$$\begin{aligned}
F(x, z) &= xz + \sum \frac{o\, x^{o} z^{m+1}}{(m+1)!} + \sum \frac{(m - 2o + 1)\, x^{o+1} z^{m+1}}{(m+1)!} \\
&= xz + (1 - 2x) \sum \frac{o\, x^{o} z^{m+1}}{(m+1)!} + xz \sum \frac{x^{o} z^{m}}{m!},
\end{aligned}$$

where the summations are over all reduced trees with $o$ cherries and $m$ nodes and the first summand represents a tree of size $m = 1$. Differentiating both sides with respect to the variable $z$, one obtains a partial differential equation for $F$,

$$x(1 - 2x)\,\frac{\partial F(x, z)}{\partial x} + (xz - 1)\,\frac{\partial F(x, z)}{\partial z} = -xF(x, z) - x,$$

which admits a solution in closed form [15] as

$$F(x, z) = \frac{2\left(x \exp\!\left(z\sqrt{-2x+1}\right) - x\right)}{\left(\sqrt{-2x+1} - 1\right)\exp\!\left(z\sqrt{-2x+1}\right) + \sqrt{-2x+1} + 1}.$$

One direct application of $F$ is to determine the probability that two randomly generated Yule trees are identical ([15, Theorem 1], with $F$ replaced by $Y$). Furthermore, $F$ can be used to find a partition of the Euler numbers $e_m$ in such a way that $e_{m,o}$ represents the number of (unreduced) ranked trees of size $n = m + 1$ with $o$ cherries. As shown in [15],

$$e_{m,o} = m! \cdot [x^{o} z^{m}]\,F,$$

where the brackets $[\,\cdot\,]$ denote coefficient extraction. The partitions of $e_m$ for $m = 1, \dots, 10$ and $o = 1, \dots, 5$ are shown in Table 20.3.1. Other applications involve simple transformations of $F$. For instance, with

$$\tilde{F}(x, z) = z\,F\!\left(\frac{x}{2},\, 2z\right),$$

one obtains the weighted (ordinary) generating function

$$\tilde{F}(x, z) = \frac{zx \exp\!\left(2z\sqrt{-x+1}\right) - zx}{\left(\sqrt{-x+1} - 1\right)\exp\!\left(2z\sqrt{-x+1}\right) + 1 + \sqrt{-x+1}},$$

| O \ m | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 1 | 4 | 11 | 26 | 57 | 120 | 247 | 502 |
| 3 | 0 | 0 | 0 | 0 | 4 | 34 | 180 | 768 | 2,904 | 10,194 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 34 | 496 | 4,288 | 28,768 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 496 | 11,056 |
| Σ | 1 | 1 | 2 | 5 | 16 | 61 | 272 | 1,385 | 7,936 | 50,521 |

Table 20.3.1. Partitions $e_{m,o}$ of Euler numbers [43, A000111]. O: number of cherries. Column sums $\sum_o e_{m,o} = e_m$. For instance, for $m = 5$ (i.e. $n = 6$) there is one ranked tree with one cherry (the caterpillar), 11 trees with two cherries and 4 trees with three cherries.

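for the coefficients of $x^{o} z^{n}$, such that

$$\tilde{F}(x, z) = \sum_{\substack{\text{ranked trees}\\ \text{of size } n}} \frac{2^{n-1-o}}{(n-1)!}\, x^{o} z^{n},$$

leading to the following result [15].

Result 20.3.1. The probability that a Yule tree of size $n$ has $o$ cherries is given by the coefficient of $x^{o} z^{n}$ in the Taylor expansion of $\tilde{F}$ around $z = 0$, i.e.

$$P_n(O = o) = [x^{o} z^{n}]\,\tilde{F}(x, z).$$

By differentiating $\tilde{F}$, one can easily derive the moments of $O$. For instance, the mean number of cherries in ranked trees of size $n$ is

$$E(O) = [z^{n}]\,\frac{\partial \tilde{F}}{\partial x}(x, z)\bigg|_{x=1} = [z^{n}]\,\frac{z^{4} - 3z^{3} + 3z^{2}}{3(z-1)^{2}}.$$

If $n > 2$, this simplifies to

$$E(O) = \frac{n}{3}.$$

The second moment is

$$\begin{aligned}
E(O^{2}) &= [z^{n}]\,\frac{\partial\!\left(x\,\frac{\partial \tilde{F}(x,z)}{\partial x}\right)}{\partial x}\bigg|_{x=1}
= [z^{n}]\,\frac{\partial^{2} \tilde{F}(x, z)}{\partial x^{2}}\bigg|_{x=1} + [z^{n}]\,\frac{\partial \tilde{F}(x, z)}{\partial x}\bigg|_{x=1} \\
&= [z^{n}]\,\frac{2}{(z-1)^{3}}\left(\frac{z^{7}}{45} - \frac{2z^{6}}{15} + \frac{z^{5}}{3} - \frac{z^{4}}{3}\right) + E(O).
\end{aligned}$$

If $n > 6$, and using $V(O) = E(O^{2}) - E^{2}(O)$, one obtains

$$V(O) = \frac{2n}{45}.$$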
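Mean and variance of the number of cherries can also be checked numerically without generating functions. The sketch below uses the forward Yule recursion instead (a swapped-in technique, not the derivation above): splitting one of the $2o$ cherry leaves leaves $O$ unchanged, while splitting any other leaf creates a new cherry. Function names are illustrative.

```python
from fractions import Fraction


def cherry_distribution(n):
    """Exact P(O = o) for a Yule tree of size n >= 2, as {o: probability}."""
    dist = {1: Fraction(1)}                      # size 2: a single cherry
    for size in range(2, n):
        grown = {}
        for o, p in dist.items():
            stay = Fraction(2 * o, size)         # chosen leaf belongs to a cherry
            grown[o] = grown.get(o, 0) + p * stay
            if stay < 1:                         # chosen leaf creates a new cherry
                grown[o + 1] = grown.get(o + 1, 0) + p * (1 - stay)
        dist = grown
    return dist


def mean_and_variance(dist):
    mean = sum(o * p for o, p in dist.items())
    second = sum(o * o * p for o, p in dist.items())
    return mean, second - mean ** 2


for n in range(7, 41):
    mean, var = mean_and_variance(cherry_distribution(n))
    assert mean == Fraction(n, 3) and var == Fraction(2 * n, 45)
```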

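The distribution of $O$ (see [35]), and mean and variance of c-caterpillars [40], have been derived before, however with different methods not employing generating functions. The latter represent a powerful tool to handle the recursive production rules of binary trees, and readily offer a somewhat deeper look into tree structure. Focusing on general c-caterpillars, let

$$F(x_2, x_3, x_4, \dots, x_k, z) = \sum_{\text{trees of size } n > 1} x_2^{o}\, x_3^{c_3} x_4^{c_4} \cdots x_k^{c_k}\, \frac{z^{n-1}}{(n-1)!}$$

be a multi-variate exponential generating function, where $c_i$ is the number of caterpillars of size $i > 2$, and $o$ the number of cherries. This function satisfies the partial differential equation

$$\begin{aligned}
\frac{\partial F}{\partial z} ={}& x_2 + x_2 F + x_2 z\,\frac{\partial F}{\partial z} + \left(x_2 x_3 - 2x_2^{2}\right)\frac{\partial F}{\partial x_2} \\
&+ \sum_{i=3}^{k-1}\left(x_i x_{i+1} - x_i^{2} + x_2 (1 - x_i)\Big(1 + \sum_{j=1}^{i-3} \frac{1}{x_{i-1} x_{i-2} \cdots x_{i-j}}\Big)\right)\frac{\partial F}{\partial x_i} \\
&+ \left(x_k - x_k^{2} + x_2 (1 - x_k)\Big(1 + \sum_{j=1}^{k-3} \frac{1}{x_{k-1} x_{k-2} \cdots x_{k-j}}\Big)\right)\frac{\partial F}{\partial x_k},
\end{aligned}$$

which leads to a recursively determined family of polynomials $(F_m)_{m \ge 1}$ with

$$F_m = \sum_{\text{trees } t \text{ of size } n = m+1} x_2^{o(t)}\, x_3^{c_3(t)} x_4^{c_4(t)} \cdots x_k^{c_k(t)}\, \frac{z^{n-1}}{(n-1)!}.$$

Defining the operator

$$G(F) = \frac{\partial F}{\partial z} - x_2,$$

where $\partial F / \partial z$ stands for the right-hand side of the differential equation above, the recursion for $(F_m)_{m \ge 1}$ is given by

$$F_1 = x_2 z, \qquad F_{m+1} = \int G(F_m)\, dz. \tag{20.3.1}$$

As an example, fix $k = 5$. Then, for $m = 1, 2, 3, 4, 5$, one has

$$\begin{aligned}
F_1 &= x_2 z, \\
F_2 &= \tfrac{1}{2}\, x_2 x_3 z^{2}, \\
F_3 &= \tfrac{1}{6}\, x_2 x_3 x_4 z^{3} + \tfrac{1}{6}\, x_2^{2} z^{3}, \\
F_4 &= \tfrac{1}{24}\, x_2 x_3 x_4 x_5 z^{4} + \tfrac{1}{24}\, x_2^{2} z^{4} + \tfrac{1}{8}\, x_2^{2} x_3 z^{4}, \\
F_5 &= \tfrac{1}{120}\, x_2 x_3 x_4 x_5 z^{5} + \tfrac{1}{120}\, x_2^{2} z^{5} + \tfrac{1}{40}\, x_2^{2} x_3 z^{5} + \tfrac{1}{40}\, x_2^{2} x_3^{2} z^{5} + \tfrac{1}{30}\, x_2^{2} x_3 x_4 z^{5} + \tfrac{1}{30}\, x_2^{3} z^{5}.
\end{aligned}$$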
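The polynomials $F_4$ and $F_5$ can be cross-checked by brute force. The sketch below does not implement recursion (20.3.1); it enumerates all unordered ranked trees of a given size and records, for each tree, the exponent vector $(o, c_3, \dots, c_k)$. The coefficient of a monomial in $F_m$ is then the number of trees with that vector divided by $m!$. The tree encoding and all names are assumptions made for this illustration.

```python
from collections import Counter

LEAF = ()


def split_leaves(tree, label):
    """Yield every tree obtained by turning one leaf into a cherry."""
    if tree == LEAF:
        yield (label, LEAF, LEAF)
        return
    lab, a, b = tree
    for new_a in split_leaves(a, label):
        yield (lab,) + tuple(sorted((new_a, b)))
    for new_b in split_leaves(b, label):
        yield (lab,) + tuple(sorted((a, new_b)))


def size(tree):
    return 1 if tree == LEAF else size(tree[1]) + size(tree[2])


def is_caterpillar(tree):
    if tree == LEAF:
        return True                      # trivial case, used only in the recursion
    _, a, b = tree
    return (a == LEAF and is_caterpillar(b)) or (b == LEAF and is_caterpillar(a))


def subtrees(tree):
    if tree != LEAF:
        yield tree
        yield from subtrees(tree[1])
        yield from subtrees(tree[2])


def exponents(tree, k):
    """(o, c_3, ..., c_k): numbers of caterpillar subtrees of sizes 2..k."""
    counts = [0] * (k - 1)
    for sub in subtrees(tree):
        c = size(sub)
        if 2 <= c <= k and is_caterpillar(sub):
            counts[c - 2] += 1
    return tuple(counts)


def monomial_counts(n, k):
    """Counter over exponent vectors of all unordered ranked trees of size n."""
    trees = {(1, LEAF, LEAF)}
    for current in range(2, n):
        trees = {child for t in trees for child in split_leaves(t, current)}
    return Counter(exponents(t, k) for t in trees)


print(monomial_counts(6, 5))   # counts / 5! give the coefficients of F_5
```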

Recursion (20.3.1) yields both the joint distribution of cherries and caterpillars of different sizes and the conditional distribution of caterpillars, conditioned on the number of cherries. Summarising, one can state the following result (cf. [15]).

Result 20.3.2. Let $T$ be an (unordered) ranked tree of size $n = m + 1$, generated under the Yule process. Then:

(i) the probability that $T$ contains $c$ caterpillars of size $k$ is

$$P_m(C_k = c) = [x_k^{c}]\,F_m\!\left(\tfrac{1}{2}, 1, 1, \dots, x_k, 2\right);$$

(ii) the joint probability that $T$ contains $o$ cherries and $c$ caterpillars of size $k$ is

$$P_m(O = o, C_k = c) = [x_2^{o} x_k^{c}]\,F_m\!\left(\tfrac{x_2}{2}, 1, 1, \dots, x_k, 2\right);$$

(iii) the conditional probability that $T$ contains $c$ caterpillars of size $k$, given it has $o$ cherries, is

$$P_m(C_k = c \mid O = o) = \frac{P_m(O = o, C_k = c)}{P_m(O = o)} = \frac{[x_2^{o} x_k^{c}]\,F_m\!\left(\tfrac{x_2}{2}, 1, 1, \dots, x_k, 2\right)}{[x_2^{o}]\,F_m\!\left(\tfrac{x_2}{2}, 1, 1, \dots, 1, 2\right)};$$

(iv) the probability that $T$ contains $c'$ caterpillars of size $i$, with $3 \le i < k$, and $c$ caterpillars of size $k$ is

$$P_m(C_i = c', C_k = c) = [x_i^{c'} x_k^{c}]\,F_m\!\left(\tfrac{1}{2}, 1, \dots, 1, x_i, 1, \dots, x_k, 2\right);$$

(v) the conditional probability that $T$ contains $c$ caterpillars of size $k$, given it has $c'$ caterpillars of size $i$, with $3 \le i < k$, is

$$P_m(C_k = c \mid C_i = c') = \frac{P_m(C_i = c', C_k = c)}{P_m(C_i = c')} = \frac{[x_i^{c'} x_k^{c}]\,F_m\!\left(\tfrac{1}{2}, 1, \dots, 1, x_i, 1, \dots, x_k, 2\right)}{[x_i^{c'}]\,F_m\!\left(\tfrac{1}{2}, 1, \dots, 1, x_i, 1, \dots, 1, 2\right)}.$$

The distribution of $O$, both under the Yule process and when trees are generated uniformly, as well as the conditional expectations for some c-caterpillars, are shown in Figure 20.3.1 for the example of size $n = 54$.

Figure 20.3.1. Ranked trees of size $n = 54$. Conditional expectation of the number of c-caterpillars (left y-axis, $c = 3, 4, 5, 6$), given the number of cherries (curves with triangles, diamonds and squares). Vertical black line at $x = 18$: expected number of cherries in unconstrained trees; horizontal black bars: unconditional expected number of c-caterpillars. Curves with filled circles: fraction of trees (right y-axis) with given number of cherries generated under the Yule process (black) and in uniformly generated trees (grey). Equivalently, this is the distribution of cherries ($O$) in ranked trees. Index of dispersion $V(O)/E(O) \approx 0.13$. Dotted line: diagonal $x = y$.

20.4 Induced subtrees

Induced subtrees occur as embedded genealogies of a subset of the leaves of a tree [42]. Let $T_n$ be a ranked, labelled tree of size $n$ with leaf labels $L = \{l_1, l_2, \dots, l_n\}$. Choose $n' \le n$, and select labels $L' = \{l'_1, l'_2, \dots, l'_{n'}\}$, such that for each $1 \le i \le n'$ there is exactly one $j$ with $l'_i = l_j$. Then the induced subtree $T'$ is the tree that is obtained from $T$ by maintaining only the branches connecting a leaf $l'_i$ with the most recent common ancestor of all leaves $L'$. We write $T' \lhd T$ for short. Note that the root of $T'$ is not necessarily identical with the root of $T$ and that the topologies of different induced subtrees of the same supertree $T$ may be different. There are $\binom{n}{n'}$ possible subsets of size $n'$. When conditioned on a fixed tree $T$, number and distribution of induced subtrees are obviously different from independently generated trees. There is no general enumeration formula for induced subtrees since the number depends on the topology of $T$. For instance, take a caterpillar of size $n$. Then all induced subtrees are caterpillars. Only when averaging over all Yule super-trees of size $n$, induced subtrees and independently generated trees are identical in number and distribution. We introduce now the notion of node balance.

Definition 20.4.1. For an internal node $i$ of a binary rooted tree $T$ let $T_i(L)$ and $T_i(R)$ be the left and right subtrees at node $i$. We call the minimum

$$\omega_i = \min\{|T_i(L)|, |T_i(R)|\}$$

node balance at node $i$. In particular, $\omega_1$ is the root balance.

It is a standard exercise to calculate the probability that $T$ and $T'$ have the same root ($\nu_1$). Given $T$ and fixing $\omega_1$, one has

$$\mathrm{Prob}(\nu'_1 = \nu_1 \mid T, \omega_1) = \sum_{i=1}^{n'-1} \frac{\binom{\omega_1}{i}\binom{n-\omega_1}{n'-i}}{\binom{n}{n'}} = 1 - \frac{\binom{\omega_1}{n'} + \binom{n-\omega_1}{n'}}{\binom{n}{n'}}.$$

When $n$ is large, one may replace the hypergeometric terms by binomials and get

$$\mathrm{Prob}(\nu'_1 = \nu_1 \mid T, \omega_1) \approx \sum_{i=1}^{n'-1} \binom{n'}{i} p^{i} (1-p)^{n'-i} = 1 - (1-p)^{n'} - p^{n'}, \tag{20.4.1}$$

where $p = \omega_1/n$, $0 < p \le \tfrac{1}{2}$. For trees generated by the Yule process, node balance is (nearly) uniformly distributed on $1, \dots, \lfloor n/2 \rfloor$, hence $p$ is uniform on $]0, \tfrac{1}{2}[$. Integrating equation (20.4.1) with respect to $p$ and multiplying with uniform weights, one obtains the well-known result (cf. [42])

$$\mathrm{Prob}(\nu'_1 = \nu_1) \approx 2\int_0^{1/2} \left(1 - (1-p)^{n'} - p^{n'}\right) dp = \frac{n'-1}{n'+1}.$$

We now consider node balance in induced subtrees. Let the random variable $\Omega_1$ be root balance in a Yule tree of size $n$. One has

$$\mathrm{Prob}(\Omega_1 = \omega_1) = \frac{2 - \delta_{\omega_1, n/2}}{n-1}.$$

Fixing $T$ and selecting an arbitrary induced subtree $T' \lhd T$, consider the random variable $\Omega'_1 \mid \Omega_1$. To calculate the conditional distribution, one may use the auxiliary terms

$$p(\omega'_1 \mid \omega_1) \approx \mathrm{Prob}(\nu'_1 = \nu_1) \cdot \frac{\binom{\omega_1}{\omega'_1}\binom{n-\omega_1}{n'-\omega'_1} + \binom{\omega_1}{n'-\omega'_1}\binom{n-\omega_1}{\omega'_1}}{\left(1 + \delta_{\omega'_1, n'/2}\right)\left(\binom{n}{n'} - \binom{\omega_1}{n'} - \binom{n-\omega_1}{n'}\right)} + \mathrm{Prob}(\nu_1 \ne \nu'_1) \cdot \frac{2 - \delta_{\omega'_1, n'/2}}{n'-1},$$

assuming that the induced subtree $T'$ is a random tree of size $n'$ when roots of $T$ and $T'$ are different. Normalising, one obtains

$$\mathrm{Prob}(\omega'_1 \mid \omega_1) = \left(\sum_{\omega'_1 = 1}^{\lfloor n'/2 \rfloor} p(\omega'_1 \mid \omega_1)\right)^{-1} \cdot\, p(\omega'_1 \mid \omega_1). \tag{20.4.2}$$
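The same-root probability can be evaluated exactly for finite $n$ and compared with the large-$n$ value $(n'-1)/(n'+1)$. The sketch below combines the hypergeometric expression for a fixed root balance with the Yule distribution of root balance given above; the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def same_root_given_balance(n, n_sub, omega):
    """Prob(T' and T share the root | root balance omega), exact.

    The roots coincide unless all n_sub sampled leaves fall into one
    of the two root subtrees (sizes omega and n - omega).
    """
    missed = comb(omega, n_sub) + comb(n - omega, n_sub)
    return 1 - Fraction(missed, comb(n, n_sub))


def root_balance_prob(n, omega):
    """Prob(Omega_1 = omega) in a Yule tree of size n, 1 <= omega <= n // 2."""
    return Fraction(2 - (2 * omega == n), n - 1)


def same_root(n, n_sub):
    """Same-root probability averaged over Yule trees of size n."""
    return sum(root_balance_prob(n, w) * same_root_given_balance(n, n_sub, w)
               for w in range(1, n // 2 + 1))


n_sub = 10
print(float(same_root(2000, n_sub)), (n_sub - 1) / (n_sub + 1))
```

For $n = 2000$ and $n' = 10$ the two printed values should agree to about three decimal places.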

Figure 20.4.1. Standardised (i.e. scaled to $[0, 1]$) values of $E(\omega'_1 \mid \omega_1)$ (black) and $E(\omega'_2 \mid \omega_2)$ (grey) for $n = 200$ and $n' = 50$; x-axis: $\omega_i$, y-axis: $\omega'_i$. Theoretical results (solid lines) according to equation (20.4.2) and simulation results (dots), obtained with ms [26].

Different roots, and the ensuing “approximation”, are likely to occur when $\omega_1$ is small. Analytical, however lengthy, expressions of the conditional expectation $E(\Omega'_1 \mid \Omega_1)$ are then easily derived with software for symbolic algebra. This computation can be extended to the balance $\Omega_2$ of the root of the largest root subtree, to obtain the conditional expectation of $\Omega'_2 \mid (\Omega_1, \Omega_2)$ and of $\Omega'_2 \mid \Omega_2$ (Disanto and Wiehe, unpublished results). In Figure 20.4.1, we show $E(\Omega'_1 \mid \omega_1)$ and $E(\Omega'_2 \mid \omega_2)$ as functions of $\omega_1$ and $\omega_2$ and compare them to simulated values. Shown are averages across arbitrary trees of fixed size $n$ and arbitrary induced subtrees of fixed size $n'$. Note that induced subtrees, when conditioned on a fixed super-tree, reflect node balance of the supertree only when the latter is not extremal. In principle, these calculations could be continued to further internal nodes. However, a full probabilistic treatment and the involved expressions become very clumsy.

Application: Neutrality test using node balance

Tree balance statistics [5, 12, 29] have traditionally been used to investigate evolutionary hypotheses in the context of phylogenetic species trees. However, they can also be defined and examined for gene genealogies modelled by the coalescent process and be integrated into powerful tests of the neutral evolution hypothesis [22, 32, 33]. Published versions of such tests, however, are typically a mixture of tree shape and branch length statistics. Relying, in contrast, only on node balance, one may define the statistic

(cf. [33])

$$T_3 = 2 \sum_{i=1}^{3} \left(\frac{2\,\Omega_i}{n_i} - \frac{1}{2}\right),$$

where $n_1 = n$, $n_2 = n - \Omega_1$ and $n_3 = n - \Omega_1 - \Omega_2$. Since $2\Omega_i/n_i$ is approximately uniform on the interval $[2/n_i, 1]$, it follows that $T_3$ is close to standard normal [33]. Small values of $T_3$ are obtained for highly unbalanced trees, i.e. when $\omega_i$ are small, produced for instance by caterpillars, and large values for highly balanced trees. In the context of population genetics, a locally unbalanced genealogy of a sample of $n$ genes can be produced by the rapid fixation of a favourable allele. Hence, an estimate of $T_3$, based on observed genetic variability, provides a statistic with which the hypothesis of neutral evolution can be tested. The results on induced subtrees can be integrated into a nested test-strategy where samples and sub-samples are tested jointly. More details are described in [39].
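As a small illustration, $T_3$ can be computed from three observed node balances. In this sketch the function name `t3` is hypothetical, and $\Omega_3$ is read, by analogy with $\Omega_2$, as the balance at the root of the largest subtree one level further down; this reading is an assumption, since the text only defines $n_3$.

```python
def t3(n, omega1, omega2, omega3):
    """T_3 = 2 * sum_i (2 * omega_i / n_i - 1/2) for i = 1, 2, 3.

    n_1 = n, n_2 = n - omega1, n_3 = n - omega1 - omega2 are the sizes of
    the subtrees at whose roots the balances omega_i are measured.
    """
    sizes = (n, n - omega1, n - omega1 - omega2)
    balances = (omega1, omega2, omega3)
    return 2 * sum(2 * w / s - 0.5 for w, s in zip(balances, sizes))


print(t3(8, 4, 2, 1))   # perfectly balanced top of the tree: largest possible value
print(t3(8, 1, 1, 1))   # caterpillar-like top of the tree: negative value
```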

20.5 Transformations I: Pruning, grafting and recombination

Let $T \in \Lambda_n^{-+}$ be a ranked tree. The layer $l_j$ ($1 \le j \le n$) of $T$ is the “interval” in which $T$ has $j$ branches. Layer $l_1$ can be imagined as the infinitely long layer above the root, which makes $T$ a planted tree [17, p. 6]. An internal node $j$ ($1 \le j < n$) marks the border between layers $l_j$ and $l_{j+1}$ and layers subdivide any branch $b$ between two nodes into branch segments $s(b)_1, \dots, s(b)_k$, where $k$ depends on $b$. The size of a branch is the number of leaves below the branch. By extension, the size of a segment is the size of the branch to which the segment belongs. A tree $T$ may be transformed into another tree $\tilde{T}$ by a prune and re-graft operation: (i) randomly select branch segments $s_p$ in layer $l_p$ for pruning and $s_g$ in layer $l_g$ for re-grafting, such that $l_g \le l_p$; (ii) prune the subtree spanned by $s_p$ and re-graft it to segment $s_g$. This prune and re-graft operation is a model of genetic recombination. Recombination can also be thought of as a segmentation process, which subdivides a linear chromosome into (genomic) segments, such that all sites within one segment have the same genealogical history, or ranked tree; see the contributions of Baake and Baake [3], Birkner and Blath [4] and Dutheil [19] in this volume. Here, we ask two questions: (i) what is the probability that recombination changes the root of the tree and (ii) how is root balance affected by recombination? First note that only some recombination events affect tree topology. One way to change the root is by a re-graft operation to a segment in layer $l_1$ above the root. Such events may also change root balance $\omega_1$. Re-grafting below the root may change root height or balance only if $s_p$ and $s_g$ belong to different root subtrees. On average, this happens with probability one third (see below).

So far, we ignored branch lengths, but for applications in population genetics it is of interest to assign branch lengths according to the coalescent process: the length of each layer $l_j$ ($j > 1$) is scaled by a factor proportional to $1/\binom{j}{2}$. Let $\tilde{P}_{\uparrow}(i)$ be the probability that a pruned branch in such a coalescent tree has size $i$ and that re-grafting

is above the current root, i.e. tree height increases. Averaging over coalescent trees of size $n$, this probability is (see [21])

$$\tilde{P}_{\uparrow}(i) = \frac{2}{a_n} \sum_{k=2}^{n} P_{n,k}(i)\, \frac{1}{k(k-1)(k+1)},$$

where $a_n$ is the $n$-th harmonic number and

$$P_{n,k}(i) = \frac{\binom{n-i-1}{k-2}}{\binom{n-1}{k-1}}$$

is the probability that a branch of layer $k$ has size $i$. Since re-grafting is above the root, one of the root-subtrees will have size $i$ after re-grafting and $\Omega_1$ will take the value $\omega_1 = \min(i, n - i)$ with probability

$$P_{\uparrow}(\omega_1) = \frac{\tilde{P}_{\uparrow}(\omega_1) + \tilde{P}_{\uparrow}(n - \omega_1)}{1 + \delta_{2\omega_1, n}}.$$

Similarly, one can also obtain the transition probabilities from $\omega'_1$ before recombination to $\omega_1$ after recombination when tree height is increasing. Let $P_{n,j}(i \mid \omega'_1)$ be the probability that a branch at level $j$ has size $i$ in a tree of total size $n$, given that the sizes of the root branches are $\omega'_1$ and $n - \omega'_1$. Then, by [21],

$$\tilde{P}_{\uparrow}(i \mid \omega'_1) = \frac{2}{a_n} \sum_{j=2}^{n} P_{n,j}(i \mid \omega'_1)\, \frac{1}{j(j-1)(j+1)}$$

and

$$P_{\uparrow}(\omega_1 \mid \omega'_1) = \frac{\tilde{P}_{\uparrow}(\omega_1 \mid \omega'_1) + \tilde{P}_{\uparrow}(n - \omega_1 \mid \omega'_1)}{1 + \delta_{2\omega_1, n}}.$$
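The unconditional quantities $P_{n,k}(i)$, $\tilde{P}_{\uparrow}(i)$ and $P_{\uparrow}(\omega_1)$ are straightforward to evaluate with exact rational arithmetic. In the sketch below the function names are illustrative, and $a_n$ is taken literally as the harmonic number $\sum_{j=1}^{n} 1/j$; if [21] uses a different convention for $a_n$, only the overall scaling changes.

```python
from fractions import Fraction
from math import comb


def branch_size_prob(n, k, i):
    """P_{n,k}(i): probability that a branch of layer k has size i."""
    if not (2 <= k <= n and 1 <= i <= n - k + 1):
        return Fraction(0)
    return Fraction(comb(n - i - 1, k - 2), comb(n - 1, k - 1))


def harmonic(n):
    # assumption: a_n read as sum_{j=1}^{n} 1/j
    return sum(Fraction(1, j) for j in range(1, n + 1))


def prune_above_root(n, i):
    """tilde P_up(i): pruned branch has size i and is re-grafted above the root."""
    total = sum(branch_size_prob(n, k, i) / (k * (k - 1) * (k + 1))
                for k in range(2, n + 1))
    return 2 * total / harmonic(n)


def new_root_balance_prob(n, omega):
    """P_up(omega): root balance omega after re-grafting above the root."""
    both = prune_above_root(n, omega) + prune_above_root(n, n - omega)
    return both / (1 + (2 * omega == n))


n = 10
# each layer carries a proper distribution over branch sizes
assert all(sum(branch_size_prob(n, k, i) for i in range(1, n)) == 1
           for k in range(2, n + 1))
print([float(new_root_balance_prob(n, w)) for w in range(1, n // 2 + 1)])
```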

Similar calculations lead also to the transition probabilities of root balance under recombination events that do not change tree height, and to estimates of the “correlation length” of root balance under multiple recombination events. These help to explore the speed with which genealogical trees and shapes change along a recombining chromosome. Considering only recombination events that change root height, we estimated the physical distance between such recombination events as (see [21, equation (51)])

$$\frac{1}{2\,(10 - \pi^{2})\,\rho} \approx \frac{3.83}{\rho},$$

where $\rho$ is the scaled recombination rate per nucleotide site. In other words, about every 4th recombination event affects tree height. For example, if $\rho = 10^{-3}$, the genomic distance between such events is about 4000 nucleotides. Recombination events that affect root balance are slightly more common, since more branches are available for re-grafting. The distance between such events can be estimated by the average run-length $(1 - P(\omega_1 \mid \omega_1))^{-1}$, i.e. the average size of a genomic fragment within which root balance $\omega_1$ does not change. The run-length depends on $n$, is longer for more unbalanced trees (small $\omega_1$) and is on the order of a few recombination events (about 2 to 6, for a typical sample size of $n = 100$) [21].

Linkage disequilibrium

Change in tree topology along a recombining chromosome can also be interpreted as a reduction of linkage disequilibrium. Two-loci linkage disequilibrium, LD, is the non-random association of two alleles (genetic variants) from two linked genetic loci or sites (alleles A, a at the first locus and alleles B, b at the second, say). Let $X_A$ ($X_B$) be the indicator variable of allele A (B). Then one standard way to express LD is by the squared Pearson correlation coefficient (e.g. [54]) of the indicator variables

$$r^{2} = \frac{\mathrm{Cov}^{2}(X_A, X_B)}{V(X_A)\,V(X_B)}.$$

Alleles A and B are often interpreted as being derived from their ancestral forms a and b, respectively, by two independent mutation events that occurred some time ago in the genealogical history of each locus, i.e. by events that “fall on” some branches of their genealogical trees. As such, a mutation event can be thought of as a “subtree marker”, marking the subtree below the branch on which it occurred. Thus, the frequency of the new mutation in the current population(-sample) is identical to the size of the marked subtree. Focusing on this property, one arrives at a slightly modified concept of linkage disequilibrium [51]: considering two, not necessarily adjacent, genomic segments, $S$ and $U$, with labelled ranked trees $T(S)$ and $T(U)$, and left root subtrees $T(S)_L$ and $T(U)_L$, the leaf labels can be partitioned into four sets: (i) labels that belong to both left subtrees, (ii) both right subtrees, (iii) to either the left subtree in segment $S$ and right subtree in segment $U$, or (iv) vice versa. With the indicator variables $X_{T(S)_L}$ and $X_{T(U)_L}$ one can calculate $r^{2}$ in exactly the same way as before and formulate the following definition.

Definition 20.5.1. The quantity

$$r^{2}_{S,U} = \frac{\mathrm{Cov}^{2}\!\left(X_{T(S)_L}, X_{T(U)_L}\right)}{V\!\left(X_{T(S)_L}\right) V\!\left(X_{T(U)_L}\right)}$$

is called topological linkage disequilibrium (tLD) of the segments $S$ and $U$.

Here, a segment takes the role of a gene locus, and left/right take the roles of two alleles. The assignment of left and right is arbitrary, as much as the naming of two alleles in the context of conventional LD, and does not affect the value $r^{2}_{S,U}$. Let $S_L$, $S_R$, $U_L$ and $U_R$ denote the leaf labels in the left and right root subtrees at segments $S$ and $U$. Note that $r^{2}_{S,U} = 1$, if and only if $S_L = U_L$ or $S_L = U_R$. This implies that subtrees are not only identical in size but also contain identically labelled leaves at both segments.

In contrast to conventional LD, a configuration of complete topological linkage, $r^{2}_{S,U} = 1$, can be broken only by recombination events that do change tree topology. Since only about every third recombination event changes tree topology, average decay of tLD with distance between segments is slower than decay of conventional LD [51]. A simple argument is the following: consider a pruning and a re-grafting event and the relative size $p$ of the left root subtree. The probability that both events take place on opposite sides of the tree, i.e. on different root subtrees, is $2p(1-p)$. Integrating with uniform density over all left subtree sizes yields $\int_{p=0}^{1} 2p(1-p)\, dp = \tfrac{1}{3}$. Furthermore, tLD has an about three times higher signal-to-noise ratio (the inverse of the coefficient of variation) than conventional LD [51]. The limit of expected tLD at large distances between segments is

$$\lim_{\rho \to \infty} E\!\left(r^{2}_{S,U}(\rho)\right) = \frac{1}{n-1},$$

which is in agreement with a classical result by Haldane [24]. Generally, compared to conventional LD, tLD shows a sharper contrast among genomic regions that are in low versus high linkage disequilibrium. This is a welcome property when searching in whole genome scans for signatures of potential gene-gene interactions using patterns of linkage disequilibrium [51].
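Definition 20.5.1 translates into a few lines of code once the leaf labels of the two left root subtrees are known. The sketch below represents them as Python sets; the function name `topological_ld` and the label sets are made up for illustration.

```python
from fractions import Fraction


def topological_ld(left_s, left_u, n):
    """r^2 between the left-root-subtree indicators of two segments.

    left_s, left_u: sets of leaf labels in the left root subtrees T(S)_L, T(U)_L
    n: total number of leaves (each root subtree must be non-empty and proper)
    """
    p_s = Fraction(len(left_s), n)
    p_u = Fraction(len(left_u), n)
    p_both = Fraction(len(left_s & left_u), n)
    cov = p_both - p_s * p_u
    return cov * cov / (p_s * (1 - p_s) * p_u * (1 - p_u))


leaves = set(range(6))
print(topological_ld({0, 1}, {0, 1}, 6))            # S_L = U_L
print(topological_ld({0, 1}, leaves - {0, 1}, 6))   # S_L = U_R
print(topological_ld({0, 1, 2}, {0, 1, 3}, 6))      # partially overlapping subtrees
```

The first two calls illustrate that swapping left and right at one segment does not change the value.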

20.6 Transformations II: Pruning, grafting and evolving trees

20.6.1 The evolving Moran genealogy

The Yule process is a pure birth process. Augmented by a death process, such that each split of a leaf is compensated by removal of a uniformly chosen leaf and its parental branch, size $n < \infty$ remains constant in time and the Yule process becomes a Moran process. Following the Moran process over time $\tau$ naturally leads to the evolving Moran genealogy $(\mathrm{EMG}_{\tau})_{\tau \ge 0}$ (see the contribution of Kersting and Wakolbinger [27] for a related class of evolving genealogies). Conversely, for any time $\tau = \tau^{*}$, a tree $T(\tau^{*})$ of size $n$ can be extracted from the sequence $(\mathrm{EMG}_{\tau})_{\tau}$. In the following, we consider ordered, rather than un-ordered, trees and keep track of left/right when choosing a leaf for splitting. The evolving Moran genealogy, EMG for short, induces a discrete Markov process on the set $\check{\Lambda}_n^{-+}$. This process is recurrent and aperiodic [52] and therefore has a stationary distribution $P^{*}$ on $\check{\Lambda}_n^{-+}$. Since we may interpret the genealogy $T(\tau)$ for any given $\tau$ as a result of a Yule process, and since all $T$ are uniformly distributed, $P^{*}$ must be the uniform distribution as well, i.e. $P^{*}(T) = 1/(n-1)!$ (see Table 20.2.1). Following the process of tree balance in an EMG, let $|T(\tau)_L|$ be the size of the left root subtree of $T(\tau)$ extracted from $(\mathrm{EMG}_{\tau})_{\tau}$. The sequence $(|T(\tau)_L|)_{\tau}$ is subject to the same transition law as the frequency of a newly arising allele in a Moran model.

A new allele arising at time $\tau^{*}$ can be imagined as “marking” an external branch of $T(\tau^{*})$ and the evolving subtree under this branch in $(\mathrm{EMG}_{\tau})_{\tau > \tau^{*}}$. Only at the boundary, there is an exception: whenever the left (or right) root subtree is of size 1, this remaining branch may be killed with positive probability. This leads to loss or fixation of the allele and consequently to a root jump with a uniform “entrance” law. After a root jump the new left root subtree has uniformly distributed size, and not necessarily size 1. We call the time interval between successive root jumps an episode of the evolving Moran process.

Result 20.6.1. For $2 \le |T(\tau)_L| \le n - 2$, the transition probability of the tree balance process $(|T(\tau)_L|)_{\tau}$ is given by (see [52])

$$\mathrm{Prob}\big(|T(\tau+1)_L| = \omega \;\big|\; |T(\tau)_L|\big) =
\begin{cases}
\dfrac{|T(\tau)_L|\,(n - |T(\tau)_L|)}{n^{2}}, & \omega = |T(\tau)_L| + 1, \\[2ex]
\dfrac{|T(\tau)_L|^{2} + (n - |T(\tau)_L|)^{2}}{n^{2}}, & \omega = |T(\tau)_L|, \\[2ex]
\dfrac{|T(\tau)_L|\,(n - |T(\tau)_L|)}{n^{2}}, & \omega = |T(\tau)_L| - 1.
\end{cases}$$

At the boundary $|T(\tau)_L| = 1$, one has

$$\mathrm{Prob}\big(|T(\tau+1)_L| = \omega\big) = \;[\dots]$$

[…] $n \ge 2$, one expects $2\left(1 - \frac{1}{n}\right)$ root jumps during the time interval $[\tau_0, \tau_1]$.

The proof goes by considering events in the backward process, where one finds that the expected total number of root jumps along the EMG$^\flat$-path is
$$\sum_{k=2}^{n-1} \frac{2}{k(k+1)} = \frac{n-2}{n}.$$
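The closed form of this sum follows from a partial-fraction (telescoping) argument. The short worked version below is added for convenience and is not part of the original proof:

```latex
\sum_{k=2}^{n-1} \frac{2}{k(k+1)}
  = 2 \sum_{k=2}^{n-1} \Bigl( \frac{1}{k} - \frac{1}{k+1} \Bigr)
  = 2 \Bigl( \frac{1}{2} - \frac{1}{n} \Bigr)
  = \frac{n-2}{n},
\qquad
\frac{n-2}{n} + 1 = 2 \Bigl( 1 - \frac{1}{n} \Bigr).
```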

Adding one additional jump, which necessarily happens at the moment of fixation, one obtains the stated expectation. Hence, in an infinitely large sample ($n = \infty$) one expects two root jumps per fixation, a result obtained by different means before [37]. In the framework of the EMG$^\flat$ one can calculate the exact distribution of root jumps during a fixation recursively for any $n$, and show that these distributions quickly converge as $n \to \infty$. For $n \ge 2$, let $\mathrm{Prob}_n(k)$ denote the probability of observing $k$ root jumps during fixation of a new allele in an EMG of size $n$, and $\mathrm{Prob}_\infty(k)$ the same probability in the infinite-population limit. Then
$$\mathrm{Prob}_n(k) = \sum_{2 \le i_1 < \dots < i_{k-1} \le n-1} \; \prod_{l=1}^{k-1} \frac{2}{i_l(i_l+1)} \prod_{j \ne i_1,\dots,i_{k-1}} \Bigl(1 - \frac{2}{j(j+1)}\Bigr).$$

Figure 20.6.2. Distributions of $\mathrm{Prob}_n(k)$ for $k = 1, \dots, 8$; $n = 2$ (dotted grey), $n = 5$ (dashed), $n = 10$ (dot-dashed), $n = 25$ (solid) and $n = \infty$ (solid black).

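The sum-product formula for $\mathrm{Prob}_n(k)$ can be evaluated exactly without symbolic algebra. The following is a minimal sketch, assuming (as the formula suggests) that root jumps at the levels $j = 2, \dots, n-1$ occur independently with probability $2/(j(j+1))$, plus one certain jump at fixation. The function name `root_jump_distribution` is hypothetical and chosen for illustration only.

```python
from fractions import Fraction


def root_jump_distribution(n):
    """Return [Prob_n(1), ..., Prob_n(n-1)] as exact fractions.

    Assumption: levels j = 2, ..., n-1 each contribute a root jump
    independently with probability 2 / (j (j + 1)); one further jump
    happens with certainty at the moment of fixation.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    # coeffs[r] = probability of exactly r jumps among the levels seen so far
    coeffs = [Fraction(1)]
    for j in range(2, n):
        p = Fraction(2, j * (j + 1))
        nxt = [Fraction(0)] * (len(coeffs) + 1)
        for r, c in enumerate(coeffs):
            nxt[r] += c * (1 - p)      # no jump at level j
            nxt[r + 1] += c * p        # jump at level j
        coeffs = nxt
    # shifting by the certain jump at fixation: coeffs[k-1] = Prob_n(k)
    return coeffs


def expected_root_jumps(n):
    """Mean number of root jumps during fixation, computed from the distribution."""
    dist = root_jump_distribution(n)
    return sum(k * p for k, p in enumerate(dist, start=1))
```

Building the distribution level by level is equivalent to expanding the product over $j$ and collecting the terms with exactly $k-1$ jump factors, which is what the sum over index sets $i_1 < \dots < i_{k-1}$ expresses.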
For small $k$, using software for symbolic algebra, one can easily write down closed-form expressions for $\mathrm{Prob}_n(k)$. For $n = 2$, $\mathrm{Prob}_2(1) = 1$. Furthermore, $\mathrm{Prob}_n(1)$ decreases monotonically in $n$, with $\lim_{n\to\infty} \mathrm{Prob}_n(1) = \frac{1}{3}$. In Figure 20.6.2 root jump distributions are shown for some small values of $n$ and for $n = \infty$, illustrating the fast convergence for $n \to \infty$.

Any root jump is tantamount to loss of some “genetic memory”. In the future, it will be interesting to explore the root jump process in more detail, in particular under non-equilibrium and non-neutral population genetic scenarios, and with regard to the speed of loss of genetic memory.

Acknowledgements. I would like to express my gratitude to Filippo Disanto and Johannes Wirtz for their intellectual input to the projects pursued as part of SPP 1590. I am very grateful also to Luca Ferretti for his enthusiastic discussions, sharing of ideas and his contributions to tree transformations under recombination. Finally, I would like to thank two reviewers for their critical and constructive comments on an earlier version of this chapter.

References

[1] D. J. Aldous, Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today, Stat. Sci. 16 (2001), 23–34.
[2] D. André, Mémoire sur les permutations alternées, J. Math. Pures Appl. (3) 7 (1881), 167–184.
[3] E. Baake and M. Baake, Ancestral lines under recombination, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 365–382.
[4] M. Birkner and J. Blath, Genealogies and inference for populations with highly skewed offspring distributions, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 151–177.
[5] M. G. B. Blum and O. François, On statistical tests of phylogenetic tree imbalance: The Sackin and other indices revisited, Math. Biosci. 195 (2005), 141–153.
[6] N. Bortolussi, E. Durand, M. G. B. Blum, and O. François, apTreeshape: Statistical analysis of phylogenetic tree shape, Bioinform. 22 (2006), 363–364.
[7] N. G. de Bruijn and D. A. Klarner, Multisets of aperiodic cycles, SIAM J. Discr. Math. 3 (1982), 359–368.
[8] L. L. Cavalli-Sforza and A. W. Edwards, Phylogenetic analysis. Models and estimation procedures, Amer. J. Hum. Genet. 19 (1967), 233–257.
[9] A. Cayley, XXVIII. On the theory of the analytical forms called trees, Lond. Edinb. Dubl. Phil. Mag. 13 (1857), 172–176.
[10] A. Cayley, A theorem on trees, Quart. J. Math. 23 (1889), 376–378.
[11] H. Chang and M. Fuchs, Limit theorems for patterns in phylogenetic trees, J. Math. Biol. 60 (2010), 481–512.
[12] D. H. Colless, Review of ‘Phylogenetics: The Theory and Practice of Phylogenetic Systematics’, by E. O. Wiley, Syst. Zool. 31 (1982), 100–104.
[13] J. H. Degnan, N. A. Rosenberg, and T. Stadler, The probability distribution of ranked gene trees on a species tree, Math. Biosci. 235 (2012), 45–55.
[14] F. Disanto and N. A. Rosenberg, Enumeration of ancestral configurations for matching gene trees and species trees, J. Comput. Biol. 24 (2017), 831–850.
[15] F. Disanto and T. Wiehe, Exact enumeration of cherries and pitchforks in ranked trees under the coalescent model, Math. Biosci. 242 (2013), 195–200.
[16] R. Donaghey, Alternating permutations and binary increasing trees, J. Combin. Theory A 18 (1975), 141–148.
[17] M. Drmota, Random Trees: An Interplay Between Combinatorics and Probability, Springer, Wien, 2009.
[18] R. Durrett, Probability Models for DNA Sequence Evolution, 2nd ed., Springer, New York, 2008.
[19] J. Y. Dutheil, Towards more realistic models of genomes in populations: The Markov-modulated sequentially Markov coalescent, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 383–408.
[20] J. Felsenstein, The number of evolutionary trees, Syst. Zool. 27 (1978), 27–33.
[21] L. Ferretti, F. Disanto, and T. Wiehe, The effect of single recombination events on coalescent tree height and shape, PLoS One 8 (2013), Article ID e60123.
[22] L. Ferretti, A. Ledda, T. Wiehe, G. Achaz, and S. E. Ramos-Onsins, Decomposing the site frequency spectrum: The impact of tree topology on neutrality tests, Genetics 207 (2017), 229–240.
[23] P. Flajolet and R. Sedgewick, Analytic Combinatorics, Cambridge University Press, Cambridge, 2009.
[24] J. B. S. Haldane, The mean and variance of $\chi^2$, when used as a test of homogeneity, when expectations are small, Biometrika 31 (1940), 346–355.
[25] E. F. Harding, The probabilities of rooted tree-shapes generated by random bifurcation, Adv. Appl. Prob. 3 (1971), 44–77.
[26] R. R. Hudson, Generating samples under a Wright–Fisher neutral model of genetic variation, Bioinform. 18 (2002), 337–338.
[27] G. Kersting and A. Wakolbinger, Probabilistic aspects of $\Lambda$-coalescents in equilibrium and in evolution, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 223–245.
[28] J. F. C. Kingman, The coalescent, Stochastic Process. Appl. 13 (1982), 235–248.
[29] M. Kirkpatrick and M. Slatkin, Searching for evolutionary patterns in the shape of a phylogenetic tree, Evolution 47 (1993), 1171–1181.
[30] D. Knuth, The Art of Computer Programming. Vol. 1: Fundamental Algorithms, 3rd ed., Addison-Wesley, Boston, 2004.
[31] A. Lambert and T. Stadler, Birth-death models and coalescent point processes: The shape and probability of reconstructed phylogenies, Theor. Popul. Biol. 90 (2013), 113–128.
[32] H. Li, A new test for detecting recent positive selection that is free from the confounding impacts of demography, Mol. Biol. Evol. 28 (2011), 365–375.
[33] H. Li and T. Wiehe, Coalescent tree imbalance and a simple test for selective sweeps based on microsatellite variation, PLoS Comp. Biol. 9 (2013), Article ID e1003060.
[34] W. P. Maddison and M. Slatkin, Null models for the number of evolutionary steps in a character on a phylogenetic tree, Evolution 45 (1991), 1184–1197.
[35] A. McKenzie and M. Steel, Distributions of cherries for two models of trees, Math. Biosci. 164 (2000), 81–92.
[36] A. Ø. Mooers and S. B. Heard, Inferring evolutionary process from phylogenetic tree shape, Q. Rev. Biol. 72 (1997), 31–54.
[37] P. Pfaffelhuber and A. Wakolbinger, The process of most recent common ancestors in an evolving coalescent, Stochastic Process. Appl. 116 (2006), 1836–1859.
[38] G. Pólya, Kombinatorische Anzahlbestimmungen für Gruppen, Graphen und chemische Verbindungen, Acta Math. 68 (1937), 145–254.
[39] M. Rauscher, Topology of genealogical trees – Theory and applications, Doctoral thesis, University of Cologne, 2018, https://kups.ub.uni-koeln.de/9030/.
[40] N. A. Rosenberg, The mean and variance of the numbers of r-pronged nodes and r-caterpillars in Yule-generated genealogical trees, Ann. Comb. 10 (2006), 129–146.
[41] N. A. Rosenberg, Counting coalescent histories, J. Comput. Biol. 14 (2007), 360–377.
[42] I. W. Saunders, S. Tavaré, and G. A. Watterson, On the genealogy of nested subsamples from a haploid population, Adv. Appl. Prob. 16 (1984), 471–491.
[43] N. J. A. Sloane, The Online Encyclopedia of Integer Sequences, available online at https://oeis.org.
[44] J. B. Slowinski and C. Guyer, Testing the stochasticity of patterns of organismal diversity: an improved null model, Amer. Nat. 134 (1989), 907–921.
[45] Y. S. Song, Properties of subtree-prune-and-regraft operations on totally-ordered phylogenetic trees, Ann. Comb. 10 (2006), 147–163.
[46] E. Stam, Does imbalance in phylogenetics reflect only bias? Evolution 56 (2002), 1292–1295.
[47] M. Steel and A. McKenzie, Properties of phylogenetic trees generated by Yule-type speciation models, Math. Biosci. 170 (2001), 91–112.
[48] F. Tajima, Evolutionary relationship of DNA sequences in finite populations, Genetics 105 (1983), 437–460.
[49] J. Wakeley, Coalescent Theory – An Introduction, Roberts, Greenwood Village, 2009.
[50] J. H. M. Wedderburn, The functional equation $g(x^2) = 2\alpha x + [g(x)]^2$, Ann. Math. 24 (1922), 121–140.
[51] J. Wirtz, M. Rauscher, and T. Wiehe, Topological linkage disequilibrium calculated from coalescent genealogies, Theor. Popul. Biol. 124 (2018), 41–50.
[52] J. Wirtz and T. Wiehe, The evolving Moran genealogy, Theor. Popul. Biol. 130 (2019), 94–105.
[53] G. U. Yule, A mathematical theory of evolution based on the conclusions of Dr. J. C. Willis, F. R. S., Phil. Trans. R. Soc. 213 (1925), 21–87.
[54] D. V. Zaykin, A. Pudovkin, and B. S. Weir, Correlation-based inference for linkage disequilibrium with multiple alleles, Genetics 180 (2008), 533–545.

Chapter 21

Algebraic measure trees: Statistics based on sample subtree shapes and sample subtree masses

Anita Winter

Null models of binary genealogical or phylogenetic trees are useful for testing hypotheses. In this chapter we describe the space of algebraic measure trees whose elements represent phylogenies and genealogies as binary trees without edge lengths endowed with a sampling measure. With the aim of describing the degree of similarity between actual and simulated phylogenies or genealogies, we focus on the sample shape of subtrees and related statistics. We describe certain statistics of the branching and coalescent tree in more detail. Finally, we use the martingale problem method to characterise evolving trees analytically.

21.1 Introduction

An $N$-cladogram is an unrooted, binary tree with $N \ge 2$ leaves labelled $\{1, 2, \dots, N\}$ and with $N-2$ unlabelled internal nodes. Cladograms are particular phylogenetic trees or genealogies for which no information on the edge lengths is available, and which therefore only capture the tree structure. As null models for testing real-world phylogenies and genealogies, parametric families of random cladograms have been studied (compare [4, 12]). One such family, introduced in [12], is today referred to as the $\alpha$-Ford model. Fix $\alpha \in [0, 1)$ and $n \in \mathbb{N}$. The $\alpha$-Ford tree is an $n$-cladogram constructed recursively as follows (compare Figure 21.1.1):

(1) Start with one external edge (yielding two leaves).
(2) Given the $\alpha$-Ford tree with $k$ leaves, assign weight $1-\alpha$ to each external and weight $\alpha$ to each internal edge.
(3) Pick an edge according to its weight, and glue here a new edge (and thereby insert a new leaf).
(4) Stop when the current binary combinatorial tree has $n$ leaves.
(5) Permute the leaf labels.

The case $\alpha = 1$ is singular, as one needs at least one internal edge to get the procedure started. We can include this case if we start here with a quartet (consisting of 4 external edges and 1 internal edge).
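Steps (1) to (5) can be sketched in a few lines of code. The following is an illustrative sketch only: the representation (integer node ids, an edge list, a dictionary of leaf labels) and the function name `alpha_ford_tree` are assumptions made for this example, and the singular case $\alpha = 1$ is deliberately excluded.

```python
import random


def alpha_ford_tree(n, alpha, rng=None):
    """Sample an alpha-Ford n-cladogram (hypothetical helper, 0 <= alpha < 1).

    Returns (edges, labels): edges is a list of node pairs of an unrooted
    binary tree, labels maps each leaf node to a label in {1, ..., n}.
    """
    if n < 2:
        raise ValueError("need at least two leaves")
    if not 0 <= alpha < 1:
        raise ValueError("this sketch excludes the singular case alpha = 1")
    rng = rng or random.Random()

    # (1) start with one external edge, yielding two leaves
    edges = [(0, 1)]
    leaves = {0, 1}
    next_node = 2

    # (4) stop when the tree has n leaves
    while len(leaves) < n:
        # (2) weight 1 - alpha for external edges, alpha for internal edges
        weights = [
            (1 - alpha) if (u in leaves or v in leaves) else alpha
            for u, v in edges
        ]
        # (3) pick an edge by weight and glue a new leaf edge into it
        idx = rng.choices(range(len(edges)), weights=weights)[0]
        u, v = edges.pop(idx)
        branch, leaf = next_node, next_node + 1
        next_node += 2
        edges += [(u, branch), (branch, v), (branch, leaf)]
        leaves.add(leaf)

    # (5) permute the leaf labels
    shuffled = list(range(1, n + 1))
    rng.shuffle(shuffled)
    labels = dict(zip(sorted(leaves), shuffled))
    return edges, labels
```

For example, `alpha_ford_tree(10, 0.5, random.Random(1))` returns a tree with 10 leaves, 8 internal nodes of degree 3 and 17 edges; with $\alpha = 0$ only external edges are ever picked.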

Figure 21.1.1. Construction of the $\alpha$-Ford tree with four leaves.

The $\alpha$-Ford model interpolates continuously between three popular models: from the coalescent tree (also known as Yule tree) for $\alpha = 0$, via the branching tree (also known as uniform tree) for $\alpha = \frac{1}{2}$, to the totally unbalanced tree (known as comb tree) for $\alpha = 1$.

In this paper we are interested in limit cladograms as the number of leaves goes to infinity, as well as in analytic characterisations of limit diffusions on continuum cladograms. For that we present a novel notion of continuum trees. In order to provide a unified set-up that includes graph-theoretical trees on the one hand and continuum trees on the other, it has by now become a classic approach to encode trees as metric (measure) spaces or bi-measure $\mathbb{R}$-trees (i.e. real trees equipped with the sampling measure and a further measure on the skeleton, which might e.g. play the role of the intensity measure for rare mutation) and to equip the space of all metric (measure) trees with the Gromov–Hausdorff [18], Gromov-weak [15, 16, 22] or Gromov–Hausdorff-weak [1, 6] topology and the space of all bi-measure $\mathbb{R}$-trees with the leaf-sampling weak-vague topology [24]. All these approaches have in common that they rely on encoding trees as metric spaces.

We refer to a metric space $(T, r)$ as a real tree (or $\mathbb{R}$-tree) if and only if it is path-connected and tree-like, where the latter means that it satisfies the so-called 4-point condition: for all $x_1, \dots, x_4 \in T$,
$$r(x_1, x_2) + r(x_3, x_4) \le \max\bigl\{ r(x_1, x_3) + r(x_2, x_4),\; r(x_1, x_4) + r(x_2, x_3) \bigr\}.$$
It follows that under this assumption, for all $x_1, x_2, x_3 \in T$, there exists a unique $c(\{x_1, x_2, x_3\}) \in T$ with
$$[x_1, x_2] \cap [x_1, x_3] \cap [x_2, x_3] = \bigl\{ c(\{x_1, x_2, x_3\}) \bigr\}, \tag{21.1.1}$$
where $[x, y] := \{ z \in T : r(x, z) + r(z, y) = r(x, y) \}$ (see, e.g. [9, Lemma 3.20]). We refer to $c(\{x_1, x_2, x_3\})$ as branch point and to the map $c\colon T^3 \to T$ as branch point map.

To allow for a unified set-up including discrete and continuum trees, we may also skip the requirement of path-connectedness and refer to a metric space as a metric tree if it satisfies the 4-point condition and can be embedded isometrically into an $\mathbb{R}$-tree such that all branch points of the embedding $\mathbb{R}$-tree belong to the metric tree. Throughout the paper we will aim to turn away from the metric structure and rather focus on the tree structure by axiomatising the properties of the branch point map.
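For a finite metric tree, both the 4-point condition and the branch point in (21.1.1) can be checked by brute force. The sketch below is for illustration only and assumes a tree given by a dictionary of weighted edges with integer lengths; the helper names `path_metric`, `four_point`, `interval` and `branch_point` are hypothetical.

```python
from itertools import product


def path_metric(edges):
    """All-pairs path lengths r[x][y] of a tree given as {(u, v): length}."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    r = {}
    for src in adj:
        dist = {src: 0}
        stack = [src]
        while stack:  # depth-first search; paths in a tree are unique
            x = stack.pop()
            for y, w in adj[x].items():
                if y not in dist:
                    dist[y] = dist[x] + w
                    stack.append(y)
        r[src] = dist
    return r


def four_point(r):
    """Check the 4-point condition for every quadruple of points."""
    return all(
        r[a][b] + r[c][d] <= max(r[a][c] + r[b][d], r[a][d] + r[b][c])
        for a, b, c, d in product(list(r), repeat=4)
    )


def interval(r, x, y):
    """[x, y] = {z : r(x, z) + r(z, y) = r(x, y)}."""
    return {z for z in r if r[x][z] + r[z][y] == r[x][y]}


def branch_point(r, x1, x2, x3):
    """The unique point in the intersection of the three intervals."""
    common = interval(r, x1, x2) & interval(r, x1, x3) & interval(r, x2, x3)
    (c,) = common  # raises ValueError if the intersection is not a singleton
    return c
```

In the tree with edges `a-u`, `b-u`, `u-v`, `v-c`, `v-d`, the branch point of `a, b, c` is `u`, while the shortest-path metric on a 4-cycle violates the 4-point condition and is therefore not tree-like.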


21.2 Algebraic measure trees

In this section we introduce the state space. The goal is to overcome the metric issue raised in the introduction by focusing on the algebraic tree structure only. We define algebraic measure trees in Section 21.2.1, and introduce a notion of global convergence in Section 21.2.2.

21.2.1 Algebraic measure trees

The following definition is based on the intuition that in a tree one can assign to any three points a unique branch point that lies on the “path connecting any two of them”.

Definition 21.2.1 (Algebraic tree). An algebraic tree is a non-empty set $T$ together with a symmetric map $c\colon T^3 \to T$ satisfying the following:
(M1) For all $x_1, x_2 \in T$, $c(x_1, x_2, x_2) = x_2$.
(M2) For all $x_1, x_2, x_3 \in T$, $c(x_1, x_2, c(x_1, x_2, x_3)) = c(x_1, x_2, x_3)$.
(M3) For all $x_1, x_2, x_3, x_4 \in T$, $c(x_1, x_2, x_3) \in \bigl\{ c(x_1, x_2, x_4),\, c(x_1, x_3, x_4),\, c(x_2, x_3, x_4) \bigr\}$.
We refer to the map $c$ as branch point map. A tree isomorphism between two algebraic trees $(T_i, c_i)$, $i = 1, 2$, is a bijective map $\phi\colon T_1 \to T_2$ with $\phi(c_1(x_1, x_2, x_3)) = c_2(\phi(x_1), \phi(x_2), \phi(x_3))$ for all $x_1, x_2, x_3 \in T_1$.

It is worth noticing that in algebraic trees we have most of the common notions concerning the tree structure available, e.g. if $(T, c)$ is an algebraic tree, then:
• we extend the notion $[x, y]$, $x, y \in T$, for the interval (or path) from metric trees to algebraic trees by writing now $[x, y] := \{ z \in T : c(x, y, z) = z \}$,
• we say that $\{x, y\}$ is an edge if and only if $x \ne y$ and $[x, y] = \{x, y\}$, and write $\mathrm{edge}(T)$ for the set of edges of $T$,
• we say that $S \subseteq T$ is a subtree of $T$ if and only if $c(S^3) = S$,
• we refer for each pair of points $(x, y)$ with $x \in T$ and $y \in T \setminus \{x\}$ to
$$\mathcal{S}_x(y) := \bigl\{ z \in T : x \notin [y, z] \bigr\} \tag{21.2.1}$$
as the subtree component with respect to $x$ that contains $y$,
• for each point $x \in T$ we call the number of components of $T \setminus \{x\}$ the degree of $x \in T$ and write $\deg(x) := \#\{ \mathcal{S}_x(y) : y \in T \setminus \{x\} \}$,
• we say that $u \in T$ is a leaf if and only if $\deg(u) = 1$, and write $\mathrm{lf}(T)$ for the set of leaves,

• we say that $v \in T$ is a branch point if and only if $\deg(v) \ge 3$, or equivalently, $v = c(x_1, x_2, x_3)$ for some $x_1, x_2, x_3 \in T \setminus \{v\}$, and write $\mathrm{br}(T)$ for the set of branch points,
• we say that $(T, c)$ is a discrete tree if all paths $[x, y]$ are finite,
• we refer to a leaf $u \in T$ as a cherry leaf if there is another leaf $u' \in \mathrm{lf}(T)$ such that $\#[u, u'] = 3$, and write $\mathrm{ch}(T)$ for the set of cherry leaves in $T$,
• we refer to an edge $e \in \mathrm{edge}(T)$ as an external edge if there is a leaf $u \in \mathrm{lf}(T)$ such that $e = \{u, x\}$ for some $x \in T$, and write $\text{ext-edge}(T)$ for the set of external edges in $T$,
• if an edge $e \in \mathrm{edge}(T)$ is not an external edge, then we refer to $e$ as internal edge, and write $\text{int-edge}(T)$ for the set of internal edges in $T$, etc.

Notice that the branch point map in an algebraic tree shares the typical feature of branch points in real trees. Namely, condition (21.1.1) holds for all $x_1, x_2, x_3 \in T$ (see [25, Lemma 2.3]).

Remark 21.2.2 (Rooted algebraic trees). In many applications rooted tree models are considered. Even though we do not use them in this paper, we want to briefly explain how algebraic trees can be extended to rooted algebraic trees. We say that $(T, c_\rho, \rho)$ is a rooted algebraic tree if $T \ne \emptyset$, $\rho \in T$ and $c_\rho\colon T \times T \to T$ is a symmetric map that satisfies the following:
(rM1) For all $x \in T$, $c_\rho(x, x) = x$.
(rM2) For all $x_1, x_2, x_3 \in T$, $c_\rho(x_1, c_\rho(x_2, x_3)) = c_\rho(c_\rho(x_1, x_2), x_3)$.
(rM3) For all $x_1, x_2, x_3 \in T$, $\#\bigl\{ c_\rho(x_1, x_2),\, c_\rho(x_1, x_3),\, c_\rho(x_2, x_3) \bigr\} \le 2$, and if $c_\rho(x_1, x_2) = c_\rho(x_1, x_3)$, then $c_\rho(x_1, x_2) = c_\rho\bigl( c_\rho(x_1, x_2),\, c_\rho(x_2, x_3) \bigr)$.
We refer to $c_\rho$ as the minimum map. Obviously, given an algebraic tree $(T, c)$ and a distinguished point $\rho \in T$, the triple $(T, c_\rho, \rho)$ is a rooted tree if $c_\rho$ is the branch point map restricted to triples $(x_1, x_2, x_3) \in T^3$ with $\rho \in \{x_1, x_2, x_3\}$. On the other hand, given a rooted algebraic tree $(T, c_\rho, \rho)$ we can define a symmetric map $c\colon T^3 \to T$ by letting
$$c(x_1, x_2, x_3) = c_\rho(x_2, x_3) \tag{21.2.2}$$
whenever $c_\rho(x_1, x_2) = c_\rho(x_1, x_3)$ (compare [25, Lemma 2.2]). It is not hard to check that this map is a branch point map. Indeed, if $x_1, x_2, x_3 \in T$ are such that $x_2 = x_3$, then (21.2.2) implies that $c(x_1, x_2, x_3) = c_\rho(x_2, x_3) = x_2$, which proves (M1). To see (M2), we distinguish three cases.


(a) If $c_\rho(x_1, x_3) = c_\rho(x_2, x_3)$, then
$$c\bigl(x_1, x_2, c(x_1, x_2, x_3)\bigr) = c\bigl(x_1, x_2, c_\rho(x_1, x_2)\bigr) = c_\rho(x_1, x_2) = c(x_1, x_2, x_3).$$
(b) If $c_\rho(x_1, x_2) = c_\rho(x_2, x_3)$, then $c(x_1, x_2, c(x_1, x_2, x_3)) = c(x_1, x_2, c_\rho(x_1, x_3))$. Thus by (rM3),
$$c\bigl(x_1, x_2, c(x_1, x_2, x_3)\bigr) \in \bigl\{ c_\rho(x_1, x_2),\; c_\rho(x_1, x_3) = c_\rho\bigl(x_1, c_\rho(x_1, x_3)\bigr),\; c_\rho(x_1, x_2) = c_\rho\bigl(x_2, c_\rho(x_1, x_3)\bigr) \bigr\},$$
where for the last equality we applied (rM2) together with (rM3) to conclude that $c_\rho(x_2, c_\rho(x_1, x_3)) = c_\rho(c_\rho(x_1, x_2), x_3) = c_\rho(c_\rho(x_2, x_3), x_3) = c_\rho(x_2, x_3) = c_\rho(x_1, x_2)$. This implies that $c(x_1, x_2, c(x_1, x_2, x_3)) = c_\rho(x_1, x_3) = c(x_1, x_2, x_3)$.
(c) If $c_\rho(x_1, x_2) = c_\rho(x_1, x_3)$, then we argue by the same line of arguments as in case (b).

To see (M3), let $x_1, \dots, x_4 \in T$ and assume w.l.o.g. that $c_\rho(x_1, x_2) = c_\rho(x_1, x_3)$ and that thus $c(x_1, x_2, x_3) = c_\rho(x_2, x_3)$ and $c_\rho(c_\rho(x_1, x_2), c_\rho(x_2, x_3)) = c_\rho(x_1, x_2)$. We need to show that $c_\rho(x_2, x_3) \in \{ c(x_1, x_2, x_4),\, c(x_1, x_3, x_4),\, c(x_2, x_3, x_4) \}$. If $c(x_2, x_3, x_4) = c_\rho(x_2, x_3)$, then we are done. Assume therefore that $c(x_2, x_3, x_4) \ne c_\rho(x_2, x_3)$. Then we can assume w.l.o.g. that $c_\rho(x_2, x_3) = c_\rho(x_2, x_4)$ and thus that $c_\rho(c_\rho(x_2, x_4), c_\rho(x_3, x_4)) = c_\rho(x_2, x_4) = c_\rho(x_2, x_3)$. If $c_\rho(x_1, x_2) = c_\rho(x_2, x_4)$, then $c(x_1, x_2, x_4) = c_\rho(x_2, x_3)$ and we are done. If on the contrary $c_\rho(x_1, x_2) \ne c_\rho(x_2, x_4)$, then $c(x_1, x_2, x_4) = c_\rho(x_2, x_4) = c_\rho(x_2, x_3)$, and we are done as well.

Remark 21.2.3 (Rooted trees versus (di-)dendritic systems). Independently of our work, a close relative of algebraic trees, the so-called didendritic systems, has been developed in [10, 11]. We want to explain the connection here. For this we will skip the prefix di- as it refers to binary trees; also, here we do not need the left and right relation, which the authors introduced for the sake of their applications. Given a minimum map $c_\rho$ on $(T, \rho)$, we can define an equivalence relation $\cong$ on $T \times T$ by declaring for any four points $x, y, x', y' \in T$, $(x, y) \cong (x', y')$ if and only if $c_\rho(x, y) = c_\rho(x', y')$. Moreover, we can define a partial order $\le$ on $T$ by letting $x \le y$ if and only if $c_\rho(x, y) = x$. Equivalently, writing $[(x, x')]$ for the equivalence class of $(x, x')$, we can define the partial order $\le_\cong$ on the equivalence classes on $T \times T$ by letting $[(x, x')] \le_\cong [(y, y')]$ if and only if $c_\rho(c_\rho(x, x'), c_\rho(y, y')) = c_\rho(x, x')$. By our axioms on the minimum map, the triple $(T, \cong, \le_\cong)$ satisfies the so-called triplet condition (compare [11, Definition 6.2, condition (C)]): for $x_1, x_2, x_3 \in T$, one of the following conditions holds:
$$[(x_1, x_2)] = [(x_1, x_3)] \le_\cong [(x_2, x_3)], \quad\text{or}\quad [(x_1, x_2)] = [(x_2, x_3)] \le_\cong [(x_1, x_3)], \quad\text{or}\quad [(x_1, x_3)] = [(x_2, x_3)] \le_\cong [(x_1, x_2)].$$


On the other hand, given a set $N \ne \emptyset$ together with an equivalence relation $\cong$ on $N \times N$ and a partial order on the set of all equivalence classes, we can assign the rooted algebraic tree $(T, c_\rho, \rho)$ as follows. Let $\tilde{T}$ be the set of equivalence classes. If $\inf_{\le_\cong} \tilde{T}$ is attained in $\tilde{T}$, put $T := \tilde{T}$ and $\rho := \min_{\le_\cong} \tilde{T}$. Otherwise, for $\rho \notin \tilde{T}$ put $T := \tilde{T} \sqcup \{\rho\}$, where $\sqcup$ denotes the disjoint union, and extend the partial order $\le_\cong$ by letting $\rho \le_\cong [x]$ for all $[x] \in \tilde{T}$. We define a minimum map $c_\rho\colon T \times T \to T$ by
$$c_\rho\bigl( [(x, x')], [(y, y')] \bigr) := [(x, y)].$$
It is easy to check that $c_\rho$ satisfies (rM1) to (rM3). In fact, (rM1) and (rM2) are the properties of the minimum of a lattice coming here from the partial order $\le_\cong$, while (rM3) is the analogue of the triplet condition.

There is a natural topology on a given algebraic tree, namely the topology $\tau$ generated by the set of all components $\mathcal{S}_x(y)$ as defined in (21.2.1) with $x \ne y$, $x, y \in T$. This topology is a Hausdorff topology [25, Lemma 2.17]. If $(T, c)$ is an algebraic tree, and $x, y \in T$, then
$$T \setminus [x, y] = \bigcup \bigl\{ \mathcal{S}_u(v) : u \in [x, y],\ v \in T,\ \mathcal{S}_u(v) \cap [x, y] = \emptyset \bigr\} \in \tau.$$
This means that $[x, y]$ is closed in the component topology $\tau$. One can also check easily that the branch point map $c\colon T^3 \to T$ is continuous [25, Lemma 2.16].

In what follows we refer to an algebraic tree $(T, c)$ as order separable if it is separable with respect to this topology and in addition has at most countably many edges. We further equip order separable algebraic trees with a probability measure on the Borel $\sigma$-algebra $\mathcal{B}(T, c)$. This so-called sampling measure allows us to sample leaves from the tree.

Definition 21.2.4 (Algebraic measure trees). A (separable) algebraic measure tree $(T, c, \mu)$ is an order separable algebraic tree $(T, c)$ together with a probability measure $\mu$ on $\mathcal{B}(T, c)$.
In what follows we call two algebraic measure trees $(T_i, c_i, \mu_i)$, $i = 1, 2$, equivalent if there exist subtrees $S_i \subseteq T_i$ with $\mu_i(S_i) = 1$, $i = 1, 2$, and a measure-preserving tree isomorphism $\phi$ from $S_1$ onto $S_2$, i.e. $c_2(\phi(x), \phi(y), \phi(z)) = \phi(c_1(x, y, z))$ for all $x, y, z \in S_1$, and $\mu_1 \circ \phi^{-1} = \mu_2$. We define
$$\mathbb{T} := \text{set of equivalence classes of algebraic measure trees}.$$
With a slight abuse of notation, we will write $\chi = (T, c, \mu)$ for the algebraic measure tree as well as for the equivalence class.

21.2.2 A notion of global convergence

We next equip the space $\mathbb{T}$ of algebraic measure trees with a notion of global convergence. Such a global perspective is usually obtained by considering scaling limits of trees. But how to rescale an algebraic tree, which does not come with a metric structure? As we will see, we can always read off from a given algebraic measure tree $\chi = (T, c, \mu)$ an intrinsic metric $r_\nu$ on the tree that generates the same branch point map. For this purpose we consider the branch point distribution on $T$, i.e.
$$\nu_{(T, c, \mu)} := \mu^{\otimes 3} \circ c^{-1},$$
and then associate the algebraic measure tree $\chi = (T, c, \mu) \in \mathbb{T}$ with the metric measure tree $(T, r_\nu, \mu) \in \mathbb{M}$, where we put for $x, y \in T$,
$$r_\nu(x, y) := \nu_\chi([x, y]) - \tfrac{1}{2} \nu_\chi(\{x\}) - \tfrac{1}{2} \nu_\chi(\{y\}).$$

Then indeed the quotient space $(T, r_\nu)$ is a metric tree, and the canonical projection is a tree homomorphism ([25, Lemma 2.28]). Notice that if $(T, c, \mu)$ and $(T', c', \mu')$ are equivalent algebraic measure trees with branch point distributions $\nu$ and $\nu'$, respectively, then the isomorphism is also an isometry with respect to $r_\nu$ and $r_{\nu'}$. We define convergence of the algebraic measure trees in $\mathbb{T}$ as Gromov-weak convergence of these associated metric measure trees as follows:

Definition 21.2.5 (bpdd-Gromov-weak topology). We say that a sequence $(\chi_n)_{n \in \mathbb{N}}$ of (equivalence classes of) algebraic measure trees $\chi_n = (T_n, c_n, \mu_n) \in \mathbb{T}$ converges Gromov-weakly with respect to the branch point distribution distance (bpdd-Gromov-weakly) to the algebraic measure tree $(T, c, \mu) \in \mathbb{T}$ if and only if the sequence $(\tilde{\chi}_n)_{n \in \mathbb{N}}$ of (equivalence classes of) metric measure trees $\tilde{\chi}_n := (T_n, r_{\nu_n}, \mu_n) \in \mathbb{M}$ converges to the metric measure tree $(T, r_\nu, \mu) \in \mathbb{M}$ Gromov-weakly. This is equivalent to requiring that for $U_1^n, U_2^n, \dots$ independent and $\mu_n$-distributed, and $U_1, U_2, \dots$ independent and $\mu$-distributed, for all $m \in \mathbb{N}$, $\bigl( r_{\nu_n}(U_i^n, U_j^n) \bigr)_{1 \le i}$ … 0 the dual chain with generator  t 2 Cm ,  Z t  m;˛ m;˛ ˛ Q m;˛ Q Ps Œ¹X t D tº D Es 1s .X / exp ˇm .Xs / ds : 0

Proof. We have the following generator relation: X  ˛ ˛m H.x; t/ WD qm .x; x0 / H.x0 ; t/ H.x; t/ x0 2Cm

D

X

˛ qQ m .t; x/H.x0 ; t/

x0 2Cm

D

X

˛ qm .t; x0 /

x0 2Cm ˛ qQ m .t; x/H.x0 ; t/

Q ˛m H.x; t/ 

X

H.x; t/

x0 2Cm

D

X

H.x; t/

˛ ˛ qQ m .x0 ; t/ C ˇm .t/H.x; t/

x0 2Cm

C

˛ ˇm .t/H.x; t/:

(21.5.3)

As the potential is a bounded function, the claim follows.

21.6 The $\alpha$-Ford tree in the limit as $N \to \infty$

In this section we construct the limit of the $\alpha$-Ford tree as the number $N$ of leaves goes to infinity. For this we identify an $N$-cladogram with a binary algebraic measure tree with $N$ leaves and the uniform distribution on the leaf set. That is, consider for $N \in \mathbb{N}$,
$$\mathbb{T}_2^N := \Bigl\{ (T, c, \mu) \in \mathbb{T}_2 : \#\mathrm{lf}(T) = N,\ \mu = \frac{1}{N} \sum_{x \in \mathrm{lf}(T)} \delta_x \Bigr\}. \tag{21.6.1}$$

Remark 21.6.1 (Scaling limit of $\alpha$-Ford trees). We want to point out that for $\alpha \in (0, 1)$ the (rooted) $\alpha$-Ford tree with $N$ leaves encoded as a metric measure tree, where edge lengths are taken to be $N^{-\alpha}$, converges Gromov-weakly to a random metric measure space that arises in the context of fragmentation [7, 20]. Obviously, the existence of a scaling limit is true also for $\alpha = 1$, as the comb tree with edge lengths $\frac{1}{N}$ converges Gromov-weakly to an interval of length $\frac{1}{2}$ with the uniform distribution. For $\alpha = 0$ the situation is different. The height of a typical leaf is of order $\ln N$, but no scaling limit exists if we choose all edge lengths to be $\frac{1}{\ln N}$.

Further, put
$$\mathbb{T}_2^{\mathrm{cont}} := \bigl\{ (T, c, \mu) \in \mathbb{T}_2 : \mathrm{lfatoms}(\chi) \ne \emptyset \bigr\}. \tag{21.6.2}$$

The following approximation result holds [23, Proposition 2.9].

Proposition 21.6.2 (Approximations with $\mathbb{T}_2^N$). Let $\chi \in \mathbb{T}_2$. Then $\chi \in \mathbb{T}_2^{\mathrm{cont}}$ if and only if there exists for each $N \in \mathbb{N}$ a $\chi_N \in \mathbb{T}_2^N$ such that $\chi_N \to \chi$ in one of the notions of convergence on $\mathbb{T}_2$ given above.


We will argue that the $\alpha$-Ford tree with $N$ leaves converges in sample shape for all $\alpha \in [0, 1]$ to a random element in $\mathbb{T}_2^{\mathrm{cont}}$. We will then discuss the statistics of sample subtree masses branching off a typical branch point in our prototype situations $\alpha \in \{0, \frac{1}{2}, 1\}$. We make use of the fact that the family of $\alpha$-Ford trees with $N$ leaves, $N \in \mathbb{N}$, is sampling consistent.

Definition 21.6.3 (Sampling consistency). Consider a family $(T_n, c_n)_{n \in \mathbb{N}}$ of random, finite binary algebraic trees, where $(T_n, c_n)$ has $n$ leaves. Let $K_n$ be the Markov kernel that takes such a tree and removes a leaf chosen uniformly at random, together with the branch point it is attached to, thus obtaining a binary tree with $n-1$ leaves. We say that the family is sampling consistent if
$$\sum_{t} K_n(t, \cdot)\, P(\{T_n = t\}) = \mathcal{L}(T_{n-1}).$$

Example 21.6.4 ($\alpha$-Ford tree). It was shown in [12, Proposition 35] that the family of $\alpha$-Ford trees with $N$ leaves is sampling consistent.

It follows immediately from our notion of convergence that a sampling consistent family of random binary algebraic trees together with the uniform distribution $\mu_N$ on $\mathrm{lf}(T_N)$ converges weakly to a binary algebraic measure tree [25, Lemma 6.2].

21.6.1 The uniform tree

In this subsection we identify the $\frac{1}{2}$-Ford tree as the Brownian algebraic measure tree, or for short, the uniform tree (compare Figure 21.4.2).

Definition 21.6.5 (Algebraic measure Brownian continuum random tree (CRT)). The algebraic measure Brownian CRT is the unique (in distribution) random binary algebraic measure tree $X_{\mathrm{CRT}} = (T, c, \mu)$ with uniform annealed sample shape distribution, i.e. for all $m \in \mathbb{N}$, for all $t \in \mathcal{C}_m$,
$$\mathbb{E}_{\mathrm{CRT}} \Bigl[ \mu^{\otimes m} \bigl\{ (u_1, \dots, u_m) : \mathfrak{s}_{(T, c)}(u_1, \dots, u_m) = t \bigr\} \Bigr] = \frac{1}{\#\mathcal{C}_m}. \tag{21.6.3}$$

Note that there is a unique law on $\mathbb{T}_2$ satisfying (21.6.3) because the sample shape distribution separates probability measures on $\mathbb{T}_2$, and it is realised through the well-known Brownian CRT once we ignore the distances (compare [2]). A simple induction over the number of leaves shows that the $\frac{1}{2}$-Ford tree is the uniform binary tree with $N$ leaves, as the next leaf is inserted at a uniformly chosen edge. We therefore refer to the $\frac{1}{2}$-Ford tree also as the uniform tree.

Now we provide the analogue of [2, Theorem 23] by considering, together with the sample shape, the vector of masses of the subtrees branching off the edges of the shape cladogram. As expected, under the annealed law of the Brownian CRT, we obtain that this vector is Dirichlet distributed and independent of the shape. To state the result more precisely, for $u = (u_1, \dots, u_m) \in T^m$ let $T(u) = c(\{u_1, \dots, u_m\}^3)$ be

the generated subtree, and for $e = \{e_x, e_y\} \in \mathrm{edge}(T(u))$ let
$$\eta_{(T, c, \mu)}(u, e) := \mu\Bigl( \bigl\{ v \in T : c(v, e_x, e_y) \in (e_x, e_y) \bigr\} \cup \bigl( \{e_x, e_y\} \cap \mathrm{lf}(T(u)) \bigr) \Bigr) \tag{21.6.4}$$
be the mass of all subtrees branching off the interior of edge $e$ and of those boundary points of the edge $e$ that are in the leaf set of the subtree spanned by the sample $u$. Let further $\eta_{(T, c, \mu)}(u) = (\eta_{(T, c, \mu)}(u, e))_{e \in \mathrm{edge}(T(u))}$ be the vector of these $2m-3$ masses (assuming $u_1, \dots, u_m$ are distinct). We obtain in [23, Proposition 5.2] the following, which is proven in the special case $m = 3$ in [3, Theorem 2].

Proposition 21.6.6 (Brownian CRT and $\mathrm{Dir}(\frac{1}{2}, \dots, \frac{1}{2})$). Let $X_{\mathrm{CRT}}$ be the Brownian CRT, $m \in \mathbb{N}$, $t \in \mathcal{C}_m$ and $f\colon \Delta_{2m-3} \to \mathbb{R}$ bounded measurable, where $\Delta_k$ is the $k$-simplex for $k \in \mathbb{N}$, i.e.
$$\Delta_k := \bigl\{ x \in [0, 1]^k : x_1 + \dots + x_k = 1 \bigr\}.$$
Then the following holds:
$$\mathbb{E}_{\mathrm{CRT}} \Bigl[ \int \mu^{\otimes m}(\mathrm{d}u)\, \mathbf{1}_t(\mathfrak{s}_T(u))\, f\bigl(\eta_{(T, c, \mu)}(u)\bigr) \Bigr] = \frac{1}{\#\mathcal{C}_m} \int_{\Delta_{2m-3}} f(x)\, \mathrm{Dir}\bigl(\tfrac{1}{2}, \dots, \tfrac{1}{2}\bigr)(\mathrm{d}x) = \frac{\Gamma(m - \frac{3}{2})}{\#\mathcal{C}_m\, \Gamma(\frac{1}{2})^{2m-3}} \int_{\Delta_{2m-3}} f(x)\, (x_1 \cdots x_{2m-3})^{-\frac{1}{2}}\, \mathrm{d}x,$$
where $\mathrm{Dir}(\frac{1}{2}, \dots, \frac{1}{2})$ is the Dirichlet distribution.

Proof. We follow Aldous’ proof of [3, Theorem 2] and study the asymptotic behaviour of the subtree mass distribution for the approximating $N$-cladograms as $N \to \infty$. Fix $m \in \mathbb{N}$ and $t \in \mathcal{C}_m$. Let $N \ge m$. Denote by $\pi_{N,m}\colon \mathcal{C}_N \to \mathcal{C}_m$ the projection map that sends an $N$-cladogram $(T, c, \ell)$ to the $m$-labelled cladogram spanned by the first $m$ leaves, i.e. for $T_m := c(\ell(\{1, \dots, m\})^3)$,
$$\pi_{N,m}(T, c, \ell) := \bigl( T_m,\, c|_{T_m},\, \ell|_{\{1, \dots, m\}} \bigr).$$
For $n = (n_e)_{e \in \mathrm{edge}(t)}$ with $\sum_e n_e = N$, let $q_N(n)$ be the probability that the first $m$ leaves of a uniform $N$-cladogram span the $m$-labelled cladogram $t$, and the numbers of leaves of the $N$-cladogram in the subtrees corresponding to the edges of $t$ are given by the vector $n$, i.e.
$$q_N(n) := \frac{1}{\#\mathcal{C}_N}\, \# \bigl\{ (T, c, \ell) \in \pi_{N,m}^{-1}(t) : N \cdot \eta_{(T, c, \mu)}(\ell(1), \dots, \ell(m)) = n \bigr\} = \frac{1}{\#\mathcal{C}_N} \cdot \frac{(N-m)!\, \prod_{e \in \text{int-edge}(t)} \#\mathcal{C}_{n_e+2}\, \prod_{e \in \text{ext-edge}(t)} \#\mathcal{C}_{n_e+1}}{\prod_{e \in \text{int-edge}(t)} (n_e)!\, \prod_{e \in \text{ext-edge}(t)} (n_e - 1)!}.$$


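The first factor in the numerator together with the denominator counts the number of possibilities to distribute the $N-m$ remaining leaves to the edges of the $m$-labelled cladogram $\mathfrak{t}$ (with quantities specified by $n$), and the products in the numerator count the possibilities to give cladogram structure to the leaves associated with every edge of $\mathfrak{t}$. For an external edge ($e\in\text{ext-edge}(\mathfrak{t})$), we have the number of $(n_e+1)$-cladograms and identify the additional leaf with the branch point of $\mathfrak{t}$ it is attached to. For an internal edge ($e\in\text{int-edge}(\mathfrak{t})$), we need two additional leaves to identify with the two adjacent branch points.

Fix $\eta = (\eta_1,\dots,\eta_{2m-3})\in\Delta_{2m-3}$. We are interested in the asymptotic behaviour of $q_N(n(N))$ as $N^{-1}n_i(N)\to\eta_i$, $i=1,\dots,2m-3$, where we enumerate the edges such that the first $m$ edges are the external, and the remaining $m-3$ the internal edges. Recall that
\[
  \# C_N = (2N-5)!! = \frac{(2N-4)!}{2^{N-2}\,(N-2)!} = 2^{N-2}\,\frac{\Gamma\bigl(\frac{2N-3}{2}\bigr)}{\Gamma\bigl(\frac12\bigr)} \approx (N-2)!\cdot 2^{N-2}\cdot\bigl(\pi(N-2)\bigr)^{-\frac12},
\]
where $\approx$ means that the ratio of the two sides tends to $1$ as $N\to\infty$, and we have applied the Stirling formula in the last step. Thus, using $\sum_{i=1}^m (n_i-1) + \sum_{i=m+1}^{2m-3} n_i = N-m$ to collect the powers of $2$,
\begin{align*}
  q_N(n) &= \frac{1}{\# C_N}\cdot(N-m)!\cdot\prod_{i=1}^{m}\frac{\# C_{n_i+1}}{(n_i-1)!}\,\prod_{i=m+1}^{2m-3}\frac{\# C_{n_i+2}}{n_i!} \\
  &\approx \frac{\sqrt{\pi(N-2)}\cdot(N-m)!}{(N-2)!\cdot 2^{N-2}}\,\prod_{i=1}^{m} 2^{n_i-1}\bigl(\pi(n_i-1)\bigr)^{-\frac12}\,\prod_{i=m+1}^{2m-3} 2^{n_i}(\pi n_i)^{-\frac12} \\
  &= \sqrt{N-2}\;\frac{(N-m)!}{(N-2)!}\;2^{-(m-2)}\,\pi^{-(m-2)}\,\prod_{i=1}^{m}(n_i-1)^{-\frac12}\,\prod_{i=m+1}^{2m-3}(n_i)^{-\frac12} \\
  &\approx N^{-(m-\frac52)}\,2^{-(m-2)}\,\pi^{-(m-2)}\,\prod_{i=1}^{m}(n_i-1)^{-\frac12}\,\prod_{i=m+1}^{2m-3}(n_i)^{-\frac12} \\
  &\approx N^{-(2m-4)}\,2^{-(m-2)}\,\pi^{-(m-2)}\,(\eta_1\eta_2\cdots\eta_{2m-3})^{-\frac12} \\
  &= N^{-(2m-4)}\,\frac{1}{\# C_m}\,\frac{\Gamma\bigl(\frac{2m-3}{2}\bigr)}{\Gamma\bigl(\frac12\bigr)^{2m-3}}\,(\eta_1\eta_2\cdots\eta_{2m-3})^{-\frac12},
\end{align*}
which gives the claimed density on the $(2m-3)$-simplex in the limit.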
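The counting argument in this proof can be checked on small cases. The following Python sketch is an illustration only and not part of the cited proofs: the function names are hypothetical, and it assumes the conventions $\# C_2 = \# C_3 = 1$ for the double factorial $(2k-5)!!$. It sums the closed formula for $q_N(n)$ over all admissible vectors $n$ (external entries at least 1, internal entries at least 0), which should give the probability $1/\# C_m$ that the first $m$ leaves span a fixed $m$-cladogram, and it compares $\# C_N$ and the limiting constant with their Gamma-function forms.

```python
from fractions import Fraction
from math import factorial, gamma, pi, sqrt


def num_cladograms(k):
    """#C_k = (2k-5)!! for k >= 2 (empty product, i.e. 1, for k = 2 and k = 3)."""
    result = 1
    for odd in range(1, 2 * k - 4, 2):  # odd numbers 1, 3, ..., 2k-5
        result *= odd
    return result


def compositions(total, minima):
    """Yield integer tuples n with n[i] >= minima[i] and sum(n) == total."""
    if len(minima) == 1:
        if total >= minima[0]:
            yield (total,)
        return
    for first in range(minima[0], total - sum(minima[1:]) + 1):
        for rest in compositions(total - first, minima[1:]):
            yield (first,) + rest


def q(N, m, n):
    """Closed formula for q_N(n); n lists the m external edges first."""
    value = Fraction(factorial(N - m), num_cladograms(N))
    for k in n[:m]:   # external edges: (k+1)-cladograms, (k-1)! in the denominator
        value *= Fraction(num_cladograms(k + 1), factorial(k - 1))
    for k in n[m:]:   # internal edges: (k+2)-cladograms, k! in the denominator
        value *= Fraction(num_cladograms(k + 2), factorial(k))
    return value


def total_probability(N, m):
    """Sum of q_N(n) over all admissible n; expected to equal 1/#C_m."""
    minima = [1] * m + [0] * (m - 3)
    return sum(q(N, m, n) for n in compositions(N, minima))


def stirling_ratio(N):
    """Ratio of #C_N to (N-2)! * 2^(N-2) * (pi (N-2))^(-1/2)."""
    exact_part = Fraction(num_cladograms(N), factorial(N - 2) * 2 ** (N - 2))
    return float(exact_part) * sqrt(pi * (N - 2))


def limit_constant(m):
    """Gamma(m - 3/2) / (#C_m * Gamma(1/2)^(2m-3)), the Dirichlet prefactor."""
    return gamma(m - 1.5) / (num_cladograms(m) * gamma(0.5) ** (2 * m - 3))


if __name__ == "__main__":
    print(total_probability(6, 4), Fraction(1, num_cladograms(4)))
    print(stirling_ratio(200))
    print(limit_constant(4), (2 * pi) ** -2)
```

Exact rational arithmetic is used for $q_N(n)$ so that the normalisation can be compared with $1/\# C_m$ without rounding; only the Stirling and Gamma comparisons use floating point.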

21.6.2 The Kingman coalescent tree

In this subsection we identify the 0-Ford tree as the Yule tree, or neutral evolutionary tree (compare Figure 21.4.2). It is well known that the 0-Ford tree equals, in distribution, the algebraic Kingman coalescent tree, which is the random algebraic measure tree such that for all $m\in\mathbb{N}$ the tree shape of $m$ sampled leaves is that of an $m$-Kingman coalescent (compare [16, 31]).

As before we would like to say something about the distribution of the subtree mass vector branching off a typical branch point. Recall $\eta_x(u,e)$ from (21.6.4). Apparently, for the Kingman coalescent tree there is no such nice characterisation as for the uniform tree. However, at least for $m=3$, we can find a representation as the stationary distribution of a system of interacting jump diffusions. Consider the system $\xi := (\xi_1(t),\xi_2(t),\xi_3(t))_{t\ge 0}$ of jump diffusions with the following independently combined dynamics:

• (Catastrophes) For all $i=1,2,3$, at rate 2, the current state $\xi$ jumps to the $i$-th unit vector $e_i$.
• (Wright–Fisher with mutation) In between two jumps, $\xi$ solves the system of stochastic differential equations
\[
  d\xi_i(t) = 2\bigl(1-3\xi_i(t)\bigr)\,dt + \sqrt{2\,\xi_i(t)\bigl(1-\xi_i(t)\bigr)}\;dB_i(t),
\]
where $B_1,B_2,B_3$ are standard Brownian motions coupled such that
\[
  \sum_{i=1}^{3}\sqrt{\xi_i(t)\bigl(1-\xi_i(t)\bigr)}\;dB_i(t) \equiv 0.
\]

It is easy to argue that this system has a unique invariant distribution, $\mathrm{WFmc}$. Existence follows simply by compactness of $\Delta_3$. Uniqueness follows by a coupling of two such systems such that they use the same exponential clock for the jumps into each corner.

Proposition 21.6.7 (Kingman coalescent). Let $X_{\mathrm{Kingman}}$ be the Kingman coalescent algebraic tree and $U_1,U_2,U_3$ be an i.i.d. sample. Then $\mathrm{WFmc}$ is identical with the law of $\eta_{X_{\mathrm{Kingman}}}(U_1,U_2,U_3)$.

The proof is an immediate consequence of Theorem 21.7.5, which we present later in this paper. It follows that, in contrast to the uniform tree, the law of $\eta_{X_{\mathrm{Kingman}}}(U_1,U_2,U_3)$ is not a Dirichlet distribution. We conjecture that it can be constructed from a two-parameter Poisson–Dirichlet distribution introduced in [27] and discussed in [8].

21.6.3 The comb tree

In this subsection we exploit the fact that the 1-Ford tree equals the comb tree (compare Figure 21.3.2). To see this, we have to be a bit more precise about the construction of the $\alpha$-Ford tree presented in the introduction. The case $\alpha=1$ is a bit singular. As we are allowed to insert new edges only at internal edges, we need to start in this case with a binary tree of 4 leaves. Such a tree has exactly one internal edge, and inserting a new edge at this internal edge produces a binary tree with 5 leaves and two adjacent internal edges. If we keep inserting edges at internal edges, we produce binary trees in which the internal edges are lined up, and therefore the binary tree forms a comb tree. We have already seen in Example 21.3.7 that the comb tree with $N$ leaves converges in sample shape to an equivalence class of the compact interval $[0,1]$ (seen as a linearly ordered tree) equipped with the uniform distribution.
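The growth rule behind the three special cases ($\alpha=\tfrac12$ uniform, $\alpha=0$ Yule, $\alpha=1$ comb) can be sketched in a few lines of Python. This is a hedged illustration, not the construction from the introduction verbatim: it assumes the usual $\alpha$-Ford rule in which an external edge carries weight $1-\alpha$ and an internal edge weight $\alpha$, it starts from the 3-leaf star, and for $\alpha=1$ it falls back to a uniform edge while no internal edge exists (which yields the unique 4-leaf shape). The names `grow_ford_tree` and `count_cherries` are hypothetical.

```python
import random


def grow_ford_tree(num_leaves, alpha, rng):
    """Grow an unrooted binary tree leaf by leaf (assumed alpha-Ford weights).

    Returns (edges, leaves): edges is a list of node pairs, leaves a set of
    node ids. Weight 1 - alpha on external edges and alpha on internal edges
    is an assumption of this sketch.
    """
    edges = [(0, 3), (1, 3), (2, 3)]  # 3-leaf star: leaves 0, 1, 2, branch point 3
    leaves = {0, 1, 2}
    next_node = 4
    while len(leaves) < num_leaves:
        weights = [
            (1 - alpha) if (a in leaves or b in leaves) else alpha
            for a, b in edges
        ]
        if sum(weights) == 0:  # alpha = 1 and no internal edge yet
            weights = [1.0] * len(edges)
        index = rng.choices(range(len(edges)), weights=weights)[0]
        a, b = edges.pop(index)
        branch, leaf = next_node, next_node + 1
        next_node += 2
        leaves.add(leaf)
        # Subdivide the chosen edge and attach the new leaf to the new branch point.
        edges += [(a, branch), (branch, b), (branch, leaf)]
    return edges, leaves


def count_cherries(edges, leaves):
    """Number of branch points adjacent to at least two leaves."""
    leaf_neighbours = {}
    for a, b in edges:
        for node, other in ((a, b), (b, a)):
            if node not in leaves and other in leaves:
                leaf_neighbours[node] = leaf_neighbours.get(node, 0) + 1
    return sum(1 for count in leaf_neighbours.values() if count >= 2)


if __name__ == "__main__":
    edges, leaves = grow_ford_tree(12, 1.0, random.Random(0))
    # A comb (caterpillar) has exactly two cherries, one at each end of the spine.
    print(len(edges), len(leaves), count_cherries(edges, leaves))
```

For $\alpha=1$ every insertion after the first subdivides an edge of the internal spine, so the spine stays a path and the number of cherries stays at two, which is the comb shape described above.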


In the following we denote, for $m\in\mathbb{N}$ and $\alpha_1,\dots,\alpha_m>0$, by
\[
  \widehat{\mathrm{Dir}}(\alpha_1,\dots,\alpha_m) := \frac{1}{m!}\sum_{\sigma\in S_m}\mathrm{Dir}\bigl(\alpha_{\sigma(1)},\dots,\alpha_{\sigma(m)}\bigr)
\]
the symmetrised Dirichlet distribution with the parameters $\alpha_1,\dots,\alpha_m$. Here $S_m$ denotes the set of all permutations of $\{1,\dots,m\}$. We note that for all $J\subseteq\{1,\dots,m\}$, the family $\{\mathrm{Dir}(\alpha_1,\dots,\alpha_m) : \alpha_j>0,\ j\in J\}$ is tight, and that we can therefore extend the definition to parameters equal to 0 as the weak limit as $\alpha\downarrow 0$.

As the 1-Ford tree is deterministic, we can easily derive the distribution of the vector of subtree masses branching off the edges of the $m$-labelled cladogram shaped by a sample of size $m$. For $u=(u_1,\dots,u_m)\in T^m$ with $u_1,\dots,u_m$ pairwise distinct, recall from (21.6.4) the definition of the vector $\eta_{(T,c,\mu)}(u)$ of the $2m-3$ masses branching off. The following is [26, Proposition 3.4].

Proposition 21.6.8 (Comb tree). Let $x_{\mathrm{comb}}$ be the comb tree. Then the law of $\eta_{x_{\mathrm{comb}}}$ equals the symmetrised Dirichlet distribution with parameters $\alpha_1=\alpha_2=2$, $\alpha_3=\dots=\alpha_m=0$ and $\alpha_{m+1}=\dots=\alpha_{2m-3}=1$.

Proof. Let $U_1,\dots,U_m$ be independent random variables, each uniformly distributed on $[0,1]$. Denote by $U^{(1)},\dots,U^{(m)}$ the increasingly ordered random variables. It is well known that then $(U^{(1)},U^{(2)}-U^{(1)},\dots,U^{(m)}-U^{(m-1)},1-U^{(m)})$ is $\mathrm{Dir}(1,\dots,1)$-distributed. Moreover,
\[
  \eta_{x_{\mathrm{comb}}}\bigl((U_1,\dots,U_m)\bigr) = \bigl(U^{(2)},\,0,\,U^{(3)}-U^{(2)},\,0,\,U^{(4)}-U^{(3)},\,0,\,\dots,\,0,\,U^{(m-2)}-U^{(m-3)},\,0,\,U^{(m-1)}-U^{(m-2)},\,0,\,1-U^{(m-1)}\bigr).
\]
To see that this is true, notice that the shape of a sample of size $m$ from a (horizontal) line segment is an $m$-cladogram whose branch points correspond to $U^{(2)},\dots,U^{(m-1)}$, with masses branching off horizontal edges being $U^{(2)}-U^{(1)},\dots,U^{(m-1)}-U^{(m-2)}$ and masses branching off vertical edges being zero. Thus, by exchangeability, the claim follows.
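The mass vector in this proof is simple enough to compute directly. The sketch below is an illustration under stated assumptions: the comb tree is taken to be the interval $[0,1]$ with Lebesgue measure, the entries are listed in the left-to-right order of the displayed formula (so before the symmetrisation over edge labels), and the helper name `comb_mass_vector` is hypothetical.

```python
from fractions import Fraction


def comb_mass_vector(sample):
    """Masses branching off the 2m-3 edges for m >= 3 points sampled from [0, 1].

    Output order follows the display above:
    (U(2), 0, U(3)-U(2), 0, ..., 0, U(m-1)-U(m-2), 0, 1-U(m-1)).
    """
    order = sorted(sample)
    m = len(order)
    if m < 3:
        raise ValueError("need at least three sample points")
    # Branch points sit at U(2), ..., U(m-1); add the interval end points.
    cuts = [0] + order[1:m - 1] + [1]
    gaps = [cuts[k + 1] - cuts[k] for k in range(len(cuts) - 1)]
    vector = []
    for k, gap in enumerate(gaps):
        if k > 0:
            vector.append(0)  # a vertical edge carries no mass
        vector.append(gap)
    return vector


if __name__ == "__main__":
    sample = [Fraction(1, 2), Fraction(1, 10), Fraction(9, 10), Fraction(3, 10)]
    print(comb_mass_vector(sample))
```

The two end entries each merge two uniform spacings (hence parameter 2), the $m-3$ middle gaps are single spacings (parameter 1), and the $m-2$ zeros correspond to the parameters equal to 0 in Proposition 21.6.8.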

21.7 The $\alpha$-Ford chain in the diffusion limit

In what follows we are interested in the diffusion limit of the $\alpha$-Ford chain as $N\to\infty$. Recall $\mathbb{T}_2^N$ from (21.6.1) and $\mathbb{T}_2^{\mathrm{cont}}$ from (21.6.2). We will define the diffusion limits on the space $\mathbb{T}_2^{\mathrm{cont}}$. Notice that [25, Corollary 4.9] implies the following.

Proposition 21.7.1 (Compactness). The space $\mathbb{T}_2^{\mathrm{cont}}$ is a closed subspace of $\mathbb{T}_2$ in the bpdd-Gromov-weak topology, and thus compact as well.

Consider for each $\alpha\in[0,1]$ the operator $\Omega^{\alpha}_{\mathrm{Ford}}$ acting on shape polynomials:
\[
  \Omega^{\alpha}_{\mathrm{Ford}}\Phi^{m,\mathfrak{t}}\bigl((T,c,\mu)\bigr) := \int\Omega^{\alpha}_{m}1_{\mathfrak{t}}\bigl(s_T(u)\bigr)\,\mu^{\otimes m}(du). \tag{21.7.1}
\]
The following is proven in [23, Proposition 3.2] for $\alpha=\tfrac12$ and generalised to all $\alpha\in[0,1]$ in [26]. Recall $\Omega^{\alpha}_{N}$ from (21.5.1).

Proposition 21.7.2 (Convergence of generators). For all $\Phi\in\Pi_{\mathrm{sh}}$ and $\alpha\in[0,1]$, we have
\[
  \lim_{N\to\infty}\ \sup_{x\in\mathbb{T}_2^N}\bigl|\Omega^{\alpha}_{N}\Phi(x)-\Omega^{\alpha}_{\mathrm{Ford}}\Phi(x)\bigr| = 0.
\]

Consequently, as $\mathbb{T}_2^{\mathrm{cont}}$ is compact, the family $\{X^{N,\alpha} : N\in\mathbb{N}\}$ is tight. Moreover, we can use the following duality to characterise the limit process uniquely.

Proposition 21.7.3 (Feynman–Kac duality). If X Nk ;˛ converges to X weakly in path space with respect to sample shape convergence on T2 and if for arbitrary m 2 N and t 2 Cm , Y m WD .Y tm / t>0 is the Cm -valued Ford chain with Y0m D t independent of X , then  ˝m ® ¯ EX u 2 T tm W s.T t ;c t / .u/ D t .T;c;/  t i hZ Rt m .1 2˛/m.m 1/t Y De Et ˝m .du/ 1sT .u/ .Y tm / e.1 2˛/.2m 5/ 0 ch.Ys / ds : Tm

Proof. The proof follows by inserting relation (21.5.3) into (21.7.1).

Our main result is the following (compare [23, Theorem 1] and [26, Theorem 1]).

Theorem 21.7.4 (Well-posed martingale problem). For each $N\in\mathbb{N}$, let $X_0^N\in\mathbb{T}_2^N$. Assume that $X_0^N\to x\in\mathbb{T}_2^{\mathrm{cont}}$. Then the $\alpha$-Ford chain $X^{N,\alpha}$ starting in $X_0^N$ converges weakly in Skorokhod path space with respect to the bpdd-Gromov-weak topology to a $\mathbb{T}_2^{\mathrm{cont}}$-valued Feller process $X^{\alpha}$ with continuous paths. The process $X^{\alpha}$ is the unique Feller process such that for all $\Phi\in\Pi_{\mathrm{sh}}$, the process $M := (M_t)_{t\ge 0}$ given by
\[
  M_t := \Phi(X_t)-\Phi(X_0)-\int_0^t\Omega^{\alpha}_{\mathrm{Ford}}\Phi(X_s)\,ds
\]
is a martingale.

Proof. Tightness of the family $\{X^{N,\alpha} : N\in\mathbb{N}\}$ and existence of a solution of the martingale problem stated in Theorem 21.7.4 follow directly from Proposition 21.7.2 due to compactness of $\mathbb{T}_2^{\mathrm{cont}}$. Uniqueness of a solution of the martingale problem, and therefore also uniqueness of the limit, follows from the Feynman–Kac duality relation given in Proposition 21.7.3.


Let $(S_t)_{t\ge 0}$ be the semigroup of this (Markov) solution on the set of bounded measurable functions on $\mathbb{T}_2^{\mathrm{cont}}$. Our duality relation implies that
\[
  S_t\Phi^{m,\mathfrak{t}}(x) = E_{\mathfrak{t}}\Bigl[\Phi^{m,Y_t}(x)\,e^{\int_0^t\beta^{\alpha}_{m}(Y_s)\,ds}\Bigr]
\]
for all shape polynomials. As $\mathbb{T}_2^{\mathrm{cont}}$ is compact and the set $\Pi_{\mathrm{sh}}$ of shape polynomials is uniformly dense in $C(\mathbb{T}_2^{\mathrm{cont}})$, it follows that $S_t\Phi^{m,\mathfrak{t}}\in C(\mathbb{T}_2^{\mathrm{cont}})$. Strong continuity of the semigroup $(S_t)_{t\ge 0}$ on $C(\mathbb{T}_2^{\mathrm{cont}})$ follows from weak continuity, e.g. [21, Theorem 19.6]. The weak continuity follows directly from right-continuity of the sample paths.

As an application we study the evolution of the quenched law of the statistics that samples three points and evaluates the masses of the subtrees branching off the branch point under the dynamics of the $\alpha$-Ford diffusion. Recall from (21.2.1) the definition of the components $\mathcal{S}_v(u)$, $u,v\in T$, and let $\eta(u)$ with $u=(u_1,u_2,u_3)\in T^3$ denote the vector of the three masses of the components connected to $c(u)$, i.e.
\[
  \eta(u) := \bigl(\eta_i(u)\bigr)_{i=1,2,3} = \bigl(\mu(\mathcal{S}_{c(u)}(u_i))\bigr)_{i=1,2,3}.
\]
Consider subtree mass polynomials
\[
  \Phi^{f}(x) = \int f\bigl(\eta(u)\bigr)\,\mu^{\otimes 3}(du)
\]
with $x=(T,c,\mu)$, and put
\[
  \Pi_{\mathrm{mass}} := \bigl\{\Phi^{f} : f\in C^2([0,1]^3)\bigr\}.
\]
Define, writing $\eta=\eta(u)$ inside the integral,
\begin{align*}
  \Omega^{\alpha}_{\mathrm{Ford}}\Phi^{f}(x) = \int_{T^3}\mu^{\otimes 3}(du)\,\Bigl(
  &\sum_{i,j=1}^{3}\eta_i(\delta_{ij}-\eta_j)\,\partial^2_{ij}f(\eta)
   + (2-\alpha)\sum_{i=1}^{3}(1-3\eta_i)\,\partial_i f(\eta) \\
  &+ \frac{\alpha}{2}\sum_{i\neq j=1}^{3}\frac{1}{\eta_i}\bigl(f\circ\theta_{i,j}(\eta)-f(\eta)\bigr)\,1_{(0,1)}(\eta_i) \\
  &+ \frac{\alpha}{2}\sum_{i\neq j=1}^{3}\bigl(1_{\{0\}}(\eta_j)-1_{\{0\}}(\eta_i)\bigr)\,\partial_i f(\eta) \\
  &+ (2-3\alpha)\sum_{i=1}^{3}\bigl(f(e_i)-f(\eta)\bigr)\Bigr),
\end{align*}
with $e_i$ denoting the $i$-th unit vector (i.e. the $i$-th corner of the simplex $\Delta_3$) and where $\theta_{i,j}\colon\Delta_3\to\Delta_3$ denotes the migration operator on the three-simplex $\Delta_3$ that sends the vector $\eta$ to a vector where we subtract $\eta_i$ from the $i$-th entry (resulting in the entry zero) and add it to the $j$-th entry (resulting in $\eta_i+\eta_j$).

The following is [23, Theorem 3] in the case $\alpha=\tfrac12$ and [26, Theorem 2] for the case $\alpha=0$, and thus for general $\alpha\in[0,1]$.

Theorem 21.7.5 (Extended martingale problem for subtree masses). Let $X=(X_t)_{t\ge 0}$ be the $\alpha$-Ford diffusion on $\mathbb{T}_2^{\mathrm{cont}}$. Then for all $\Phi^{f}\in\Pi_{\mathrm{mass}}$, the process $M^{f} := (M^{f}_t)_{t\ge 0}$ given by
\[
  M^{f}_t := \Phi^{f}(X_t)-\Phi^{f}(X_0)-\int_0^t\Omega^{\alpha}_{\mathrm{Ford}}\Phi^{f}(X_s)\,ds
\]
is a martingale.

The proof of both results concerning martingale problems relies on the interpolation relation (21.5.2), which allows us to restrict to the study of the Ford chains for two different values of $\alpha$. The existence of a continuum limit of the $\tfrac12$-Ford chain was proposed by Aldous in 1999 at the Fields Institute and has been listed on Aldous' open problem page since. It has been constructed and further studied in [23]. The case $\alpha=0$ ("resampling") is investigated in [26].

Acknowledgements. I express my gratitude to my coauthors Wolfgang Löhr, Leonid Mytnik and Josué Nussbaumer, whose joint work with me appears in this article. Many thanks to the anonymous referees whose comments and suggestions helped to improve the presentation. Finally, I would like to thank Wolfgang Löhr for discussions on the axioms characterising the minimal map in rooted algebraic trees, as well as Steve Evans and Anton Wakolbinger for discussions on the link between their didendritic systems presented in [10, 11] and our rooted algebraic trees.

References

[1] R. Abraham, J. F. Delmas, and P. Hoscheit, A note on the Gromov–Hausdorff–Prokhorov distance between (locally) compact metric measure spaces, Electron. J. Probab. 14 (2013), 1–21.
[2] D. Aldous, The continuum random tree III, Ann. Probab. 21 (1993), 248–289.
[3] D. Aldous, Recursive self-similarity for random trees, random triangulations and Brownian excursion, Ann. Probab. 22 (1994), 527–545.
[4] D. Aldous, Probability distributions on cladograms, in: Random Discrete Structures (Minneapolis 1993; eds. D. Aldous and R. Pemantle), Springer, New York (1996), 1–18.
[5] D. Aldous, Mixing time for a Markov chain on cladograms, Comb. Probab. Comput. 9 (2000), 191–204.
[6] S. Athreya, W. Löhr, and A. Winter, The gap between Gromov-vague and Gromov–Hausdorff-vague topology, Stochastic Process. Appl. 126 (2016), 2527–2553.
[7] B. Chen and M. Winkel, Restricted exchangeable partitions and embedding of associated hierarchies in continuum random trees, Ann. Inst. Henri Poincaré Probab. Stat. 49 (2013), 839–872.
[8] C. Costantini, P. De Blasi, S. Ethier, M. Ruggiero, and D. Spanò, Wright–Fisher construction of the two-parameter Poisson–Dirichlet diffusion, Ann. Appl. Probab. 27 (2017), 1923–1950.
[9] S. N. Evans, Probability and Real Trees: École d'Été de Probabilités de Saint-Flour XXXV–2005, Springer, Berlin, 2008.
[10] S. Evans, R. Grübel, and A. Wakolbinger, Doob–Martin boundary of Rémy's tree growth chain, Ann. Probab. 45 (2017), 225–277.
[11] S. N. Evans and A. Wakolbinger, PATRICIA bridges, in: Genealogies of Interacting Particle Systems (eds. M. Birkner, R. Sun, and J. Swart), World Scientific, Singapore (2020), 233–267.
[12] D. J. Ford, Probabilities on cladograms: Introduction to the $\alpha$-model, preprint 2005, https://arxiv.org/abs/math/0511246.
[13] N. Forman, Exchangeable hierarchies and mass-structure of weighted real trees, Electron. J. Probab. 25 (2020), 1–28.
[14] N. Forman, C. Haulk, and J. Pitman, A representation of exchangeable hierarchies by sampling from real trees, Probab. Theory Related Fields 172 (2018), 1–29.
[15] K. Fukaya, Collapsing of Riemannian manifolds and eigenvalues of Laplace operators, Invent. Math. 87 (1987), 517–547.
[16] A. Greven, P. Pfaffelhuber, and A. Winter, Convergence in distribution of random metric measure spaces ($\Lambda$-coalescent measure trees), Probab. Theory Related Fields 145 (2009), 285–322.
[17] A. Greven, P. Pfaffelhuber, and A. Winter, Tree valued resampling dynamics: Martingale problems and applications, Probab. Theory Related Fields 155 (2013), 789–838.
[18] M. Gromov, Metric Structures for Riemannian and non-Riemannian Spaces, Birkhäuser, Boston, 2001.
[19] S. Gufler, A representation for exchangeable coalescent trees and generalized tree-valued Fleming–Viot processes, Electron. J. Probab. 41 (2018), 1–42.
[20] B. Haas, G. Miermont, J. Pitman, and M. Winkel, Continuum tree asymptotics of discrete fragmentations and applications to phylogenetic models, Ann. Probab. 36 (2008), 1790–1837.
[21] O. Kallenberg, Foundations of Modern Probability, Springer, New York, 2002.
[22] W. Löhr, Equivalence of Gromov–Prohorov- and Gromov's $\underline{\Box}_\lambda$-metric on the space of metric measure spaces, Electron. Commun. Probab. 18 (2013), 1–10.
[23] W. Löhr, L. Mytnik, and A. Winter, The Aldous chain on cladograms in the diffusion limit, Ann. Probab. 48 (2020), 2565–2590.
[24] W. Löhr, G. Voisin, and A. Winter, Convergence of bi-measure R-trees and the pruning process, Ann. Inst. Henri Poincaré Probab. Stat. 51 (2015), 1342–1368.
[25] W. Löhr and A. Winter, Spaces of algebraic measure trees and triangulations of the circle, Bull. Soc. Math. France 149 (2021), 1–63.
[26] J. Nussbaumer and A. Winter, The algebraic $\alpha$-Ford tree under evolution, preprint 2020, https://arxiv.org/abs/2006.09316.
[27] J. Pitman and M. Yor, The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator, Ann. Probab. 25 (1997), 855–900.
[28] J. Schweinsberg, An $O(N^2)$ bound for the relaxation time of a Markov chain on cladograms, Random Structures Algorithms 20 (2001), 59–70.
[29] P. Seidel, The historical Moran model, preprint 2015, https://arxiv.org/abs/1511.05781.
[30] T. Wiehe, Counting, grafting and evolving binary trees, in: Probabilistic Structures in Evolution (eds. E. Baake and A. Wakolbinger), EMS Press, Berlin (2021), 427–450.
[31] J. Wirtz and T. Wiehe, The evolving Moran genealogy, Theor. Popul. Biol. 130 (2019), 94–105.

List of contributors

Ellen Baake, Technische Fakultät, Universität Bielefeld, Universitätsstr. 25, 33615 Bielefeld
Michael Baake, Fakultät für Mathematik, Universität Bielefeld, Universitätsstr. 25, 33615 Bielefeld
Rolf Backofen, Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Georges-Koehler-Allee, Geb. 106, 79110 Freiburg
Matthias Birkner, Institut für Mathematik, Johannes-Gutenberg-Universität Mainz, Staudingerweg 9, 55099 Mainz
Jochen Blath, Institut für Mathematik, TU Berlin, Straße des 17. Juni 136, 10623 Berlin
Anton Bovier, Institut für Angewandte Mathematik, Rheinische-Friedrich-Wilhelms-Universität Bonn, Endenicher Allee 60, 53115 Bonn
Frank den Hollander, Mathematisch Instituut, Universität Leiden, Niels Bohrweg 1, NL–2333 CA Leiden, The Netherlands
Julien Y. Dutheil, MPI für Evolutionsbiologie, August-Thienemann-Str. 2, 24306 Plön
Fabian Freund, Institut für Pflanzenzüchtung, Saatgutwissenschaft und Populationsgenetik, Universität Hohenheim, Fruwirthstr. 21, 70599 Stuttgart
Nina Gantert, Fakultät für Mathematik, TU München, Boltzmannstr. 3, 85748 Garching b. München
Andreas Greven, Department Mathematik, Friedrich-Alexander-Universität Erlangen-Nürnberg, Cauerstr. 11, 91058 Erlangen
Martin Hutzenthaler, Fakultät für Mathematik, Universität Duisburg-Essen, Thea-Leymann-Str. 9, 45127 Essen
Götz Kersting, Institut für Mathematik, Goethe-Universität Frankfurt, Robert-Mayer-Str. 10, 60325 Frankfurt am Main
Wolfgang König, Weierstraß-Institut Berlin für Angewandte Analysis und Stochastik, Mohrenstr. 39, 10117 Berlin, and Institut für Mathematik, TU Berlin, Straße des 17. Juni 136, 10623 Berlin
Joachim Krug, Institut für Biologische Physik, Universität zu Köln, Zülpicher Str. 77, 50937 Köln
Noemi Kurt, Institut für Mathematik, TU Berlin, Straße des 17. Juni 136, 10623 Berlin
Dirk Metzler, Fakultät für Biologie, Ludwig-Maximilians-Universität München, Großhaderner Str. 2, 82152 Martinsried
Marcel Ortgiese, Department of Mathematical Sciences, University of Bath, Claverton Down, Bath BA2 7AY, UK
Peter Pfaffelhuber, Abteilung für Mathematische Stochastik, Albert-Ludwigs-Universität Freiburg, Ernst-Zermelo-Str. 1, 79104 Freiburg
Wolfgang Stephan, Museum für Naturkunde, Invalidenstr. 43, 10115 Berlin
Anja Sturm, Institut für Mathematische Stochastik, Georg-August-Universität Göttingen, Goldschmidtstr. 7, 37077 Göttingen
Aurélien Tellier, School of Life Sciences, TU München, 85354 Freising
Anton Wakolbinger, Institut für Mathematik, Goethe-Universität Frankfurt, Robert-Mayer-Str. 10, 60325 Frankfurt am Main
Thomas Wiehe, Institut für Genetik, Universität zu Köln, Zülpicher Str. 47a, 50674 Köln
Anita Winter, Fakultät für Mathematik, Universität Duisburg-Essen, Thea-Leymann-Str. 9, 45127 Essen

Index ABC, random-forest based 192 absorption time 226, 232 accessibility transition 10 active individual 275 adaptation 69 – fast 339 adaptive – dynamics 121, 128 – flight 136 – speciation 129 – walk 18, 129, 136 Aldous chain on cladograms 464 ˛-Ford chain 463 altruism 85, 86 ancestral – graph 61 – lineage 96, 239 – random walk 295 – recombination graph 168, 217, 366, 370, 373, 384 – „ 168 – selection graph 51, 409, 418 – structure 294 Anderson – localisation 27 – operator 27 annealed 296 annihilating – Brownian motions 329 – random walks 315 antibiotic resistance 2 approximate Bayesian computation 191 backward – algorithm 398 – equation 99 bacteria 44, 69 balance – node 437, 439 – root 438, 439, 441 – tree 443 behaviour – altruistic 86 – defence 86

benefit of defence 88, 91, 93 biparental 208 birth enhancement rate 350 block-counting process 159, 225, 231, 251, 261 – jump rate 225 – time reversal 232 branch – point – distribution 454 – distribution distance 457 – map 453 – set 454 – segment 440 – size 440 branch length – branches of order b 229 – coalescent tree 162 – longest external 229 – total 163, 224, 226, 228, 230, 236, 238 – external 224, 226, 228, 234, 238 – internal 224, 228 branching – diffusion 358 – mechanism 358 – model 340 – competition 337, 340 – multi-type 337, 340 – mutation 337, 340 – mutually catalytic 316 – symbiotic 312, 315 – two-level 337, 340 – process 23, 64 – ˛-stable 157 – approximation 48, 59 – higher moment 36 – multitype 30 – spatial 23 – total mass 28 – rates – random 23 – state-dependent 23

Index branching random walk 23, 292, 303 – logistic 302 – ancestral lineage in 305 – spatial 24 Brownian web 302, 328 Cannings – model 44, 48, 50, 51, 156, 181, 204, 269 – diploid 208, 217 – resampling 270 canonical equation 129, 139 carrying capacity 131, 216, 341 Catalan number 429 caterpillar 428, 435, 440 CEAD 129 cell division 340 central limit theorem 305 – for random walk on percolation cluster 296 cheater 86 cherry 428, 431 – leaf 454 Chinese restaurant process 183, 190 clade size 196 cladogram 458 – labelled 458 – shape 459 clonal interference 45, 49, 52 clone 44, 338 coalescence 372 – hazard 279 – number of events 226 coalescent 108, 247, 386 – absorption time 226, 232 – beta 154, 165, 207, 224, 230, 232, 233, 235, 237, 238 – n- 182, 189 – ˇ 120 – Bolthausen–Sznitman 154, 224, 228, 229, 233, 236 – n- 182, 190, 197 – Dirac 182, 198 – Eldon–Wakeley 154, 232 – evolving 225, 233, 234, 236, 238, 240 – n- 236, 237 – hidden Markov model (CoalHMM) 387 – 1- 225


– Kingman 115, 151, 153, 205, 224, 229, 236, 238, 247, 417, 431 – robustness 211, 218 – subordinated 155 – tree 469 – with exponential growth 165 – ƒ 120, 153, 205, 223, 239, 241, 242 – ƒ-n- 179 – with time change 185 – last merger 224 – multiple merger 153, 204, 223 – n- 223, 228, 234 – on/off 262 – peripatric 262 – Poisson–Dirichlet 155 – ‰ 115, 116 – seed bank 111, 250 – simultaneous multiple merger 154, 205 – spatial 277 – star-shaped 153 – structured 115 – tree 237, 238, 431 – -valued process 225 – with recombination 386 – „ 154, 205, 223, 239 – asymptotic frequencies 197, 198 – „-n- 179 – clade size 193, 196 – model selection 191 – subsampling 187 coalescing – Brownian motions 326, 329 – random walks 313 coding map 462 coevolution 107 coexistence 88, 269, 316 – region of 316 coming down from infinity 180, 225, 227, 228, 230, 232, 241, 253, 256, 328, 332 commutator 376 competition – kernel 129 – local 303 competition experiment 44 contact process – in discrete time 293 – multi-type 297

Index continuum random tree 466 convergence – complete 303 – rate of 376 coupling 51, 64, 303 CRISPR – Cas 43, 69 – Cas9 72 – classification 71 critical – dimension 282 – value for oriented percolation 294 crossover 367 – single 377 defence 86 – against parasites 85 – against predators 85 – altruistic 86 – benefit of 97 – cost of 88, 89, 93, 97 – costly 86 demography, host-parasite 118, 119 differentiability of semigroups 101 diploid 156, 166, 203 – population model 154, 208, 212, 214 – competition 216 – mutation 216 – selection 216 – two-sex 212 – varying population size 216 Dirichlet distribution 470 distance matrix 349 distribution – beta 154 – double-exponential 28, 37 – Gamma 359 – geometric 78 – Gumbel 10, 229 – Pareto 30 – phase-type 161 – Poisson 303 – Slack’s 189 – symmetrised Dirichlet 470 – Weibull 30, 32 dormancy 247 dormant individual 275

duality 51, 255, 274, 311, 337, 373 – Feynman–Kac 464 – interface 315 – moment 255, 261, 322 – self 321 dust 180, 231 – component 224, 228, 231, 232, 242 – lack of 224 – -free 224, 225, 228, 239–241, 271 dynamical system 60, 107, 365, 379 – continuous time 113 – discrete time 109, 111 – Jacobian matrix 111 edge 453 – external 454 – internal 454 effective – population size 163, 211, 386 – reproduction rate 47 eigenvalue expansion 27 emission probability 388 empirical Bayesian 398 entrance law 329 environment, random 32 episode 444 epistasis – diminishing returns 45, 46 – sign 14 equilibrium 87, 88, 91, 376 – monomorphic 269 – polymorphic 269 Escherichia coli 43, 44 Euler number 433 evolutionary – accessibility 2 – branching 138, 139 – path 3 – predictability 17 – singularity 138 – stable condition 129, 138 evolving – genealogy 286 – Moran genealogy 443 – backward in time 445 excursion measure 94, 96


Index excursions – forest of trees of 96, 98 – tree of 94–96 Feller – branching diffusion 358 – process 274 Feynman–Kac – duality 357, 465, 471 – formula 27 Fisher–Wright diffusion 88, 98, 268 fitness 128, 214, 216, 414 – additive 46 – correlation 11 – direct 86 – effect 11 – graph 3 – heterozygous 216 – homozygous 216 – inclusive 86 – increment 45–47, 51, 52 – invasion 128, 132 – landscape 2, 35, 46, 50, 129 – additive 12 – correlated 11 – NK model 11 – random 25, 26 – rough Mount Fuji model 11 – valley crossing 14 – Malthusian 44, 45, 47 – mean 26 – relative 47 – random 214 – relative 44 – trajectory 52, 53 – valley 143 fixation 45, 49, 52, 63, 89, 107, 116 – probability 44, 48, 59, 261 – sequential 49, 51 – time 59 Fleming–Viot – model 267 – process 153, 260 – tree-valued 409, 411, 423 flow 376


fluctuating – environment 419, 422 – population size 301 forward algorithm 389 four-point condition 452 Galton–Watson – process 51 – reproduction, diploid 214 Gamma distribution 359 genealogy 203, 286, 339, 383 generating function 429, 435 genetic drift 51, 87, 114 – Cannings model 115, 116, 120 – Wright–Fisher model 114, 116, 118, 119 geographic space 267 Gerrish–Lenski heuristics 52 Girsanov’s – theorem 337, 347 – transform 347 grafting 444–446 – subtree 440 graphical – construction 313, 315, 328 – representation 61, 62 Gromov–Prohorov topology 412 Gromov-weak – convergence 457 – topology 340, 412 – marked 340 guilt by association 75 Gumbel distribution 10, 229 Haldane linearisation 375 Haldane’s formula 44, 48, 51, 59 Hamming – distance 7 – graph 3 – metric 2, 3 haploid 203 HCMV 43, 54 heat equation with potential 27 heterozygous 216 hidden Markov model 72, 387 hierarchical – group 269 – model 56

Index highly skewed offspring numbers 215, 338 history 431 – labelled 429 hitchhiking effect 217 Hölder-continuous diffusion coefficient 100 homologous chromosomes 217 homozygous 216 horizontal gene transfer 70 host – -parasite – interactions 85 – model 54, 55 – population 337, 340 – replacement 55, 57, 61 house of cards model 26 huge offspring 358 human cytomegalovirus 43, 54 hypercube 3 – as sequence space 26 immune system 69 – cross-immunity 339 – escape 339 incomplete lineage sorting 392 individual-based model 129, 337, 340 – competition 337, 340 – diploid 216 – multi-type 337, 340 – mutation 337, 340 – two-level 337, 340 infectious disease model – population genetics 109 – epidemiology 113 infinite-sites 254 – model 158, 182, 391 inhabitable 293 interface 302, 316 interference 70 intermittency 23, 28, 29 island model 56, 248 jackpot event 339 jump process 57, 59, 238 Karlin–Levikson model 422 Kimmel’s model 340


Laplace – operator 26 – transform 416 large population approximation 337 latency 248 law of large numbers 45, 50, 51, 56, 58, 63, 131, 305, 373 – dynamical 366, 370 – for random walk on percolation cluster 296 leader sequence 70, 74, 76 Lenski’s long-term evolution experiment 43, 44 Lévy process 233, 234 likelihood – composite 394 – function 163 – approximate 163 – ratio test – approximate 165 line of descent 241 linkage disequilibrium 366, 442 – topological 442 lookdown – construction 157, 236, 239 – space 239 – with competition and selection 242 loop-free process 99 loss of spacer 76 Lotka–Volterra – equation 128, 132 – model 85, 87, 89–91 LTEE 44 many-demes limit 92, 94, 96, 97, 100 many-to-one formula 37 Markov chain – discrete time 120 – tree-indexed 77 martingale problem 267, 274, 320, 337, 347, 411, 424 McKean–Vlasov SDE 88, 92, 93 mean-field 61 – approximation 92 – equation 216 – limit 280 – scaling 345 measure-valued process 337, 346

Index Mendelian rules 216 merge-graft operation 445 merger 223 – simultaneous multiple 239 – size of last 224, 232 meta-population 86 metric measure space 240, 337, 339 – Gromov–Hausdorff–Prohorov 240, 242 – Gromov–Prohorov 240, 241 – isomorphy class 240 – marked 242, 337, 339, 411 – Gromov–Prohorov 242 migration 43, 56, 86, 248, 267 mmm-space 337, 339 – marked 337, 339 Möbius inversion 380 model selection 191 monomorphic 46, 63, 64 – equilibrium 269 Moran – model 51, 55, 115, 182, 204, 409, 410 – diploid 213 – modified 184 – with recombination 366, 368 – process 384 most recent common ancestor 218, 224, 239, 241 – time to 253 multi-locus model 166 multiple loci 217 – homologous chromosomes 217 – non-homologous chromosomes 217 mutant invasion 142 mutation 257, 411 – beneficial 46, 52 – contending 48, 52 – graph 7 – infinitely many genes 76 – kernel 135, 341 – law 130 – parent-independent 414 – rate 254 – -selection – balance 14 – equation, diploid 216 – model 23, 25 – two-way 64


mutational – pathways 2 – reversion 6 non-homologous chromosomes 217 offspring – distribution, highly skewed 156 – frequencies 206 – variance 48 one-point compactification 347 ordinary differential equation 50, 60, 357, 365, 370 orientation 444, 445 oriented percolation 295 – cluster 295, 302 origin-fixation model 49 paintbox construction 179, 197 pair coalescence probability 48, 163, 168 parabolic Anderson model 23, 27, 316 – biological interpretation 25 – on Galton–Watson tree 39 – on hypercube 35 – on random graph 39 parasites 85 parasitism 85 partition 367, 368, 433 – interval 377, 378 – marked 250 partitioning process 370–376, 378, 379 Peano–Baker 376 pedigree 217 – cyclical model 217 percolation 1 – accessibility 1 – first passage 7 permutation 183, 429 phase-type distribution 161 phasing 397 phylogenetic distance 339 phylogeny 337, 339, 348, 429 – evolving 337, 339 – ranked 429 pitchfork 428

Index Poisson – point process 96, 179, 182, 229, 260, 272, 313 – process 52 Poisson–Dirichlet coalescent 155 polymorphic 63, 64 – equilibrium 133, 269 – evolution sequence 134, 136 population – dynamics 127 – model 156 – process 23 – regulation, local 294 – size – fluctuating 301 – varying 216 posterior decoding 398 power law 50, 53 predator-prey system 85 probabilistic – cellular automaton 303 – representation 61 probability to balance 58 process – block-counting 159, 251, 261 – Chinese restaurant 183, 190 – coalescent 431, 440 – coalescent-valued 225 – Feller 274 – Fleming–Viot 153, 260, 409, 411, 423 – Galton–Watson 51 – jump 57, 59, 134, 238 – Lévy 233, 234 – Moran 443, 445 – partitioning 370–376, 378, 379 – Poisson 52 – point 96, 179, 182, 229, 260, 272, 313 – relative fitness 50 – root jump 446 – subordinator 231 – tree-valued 240, 242 – Wright–Fisher 384 – Yule 359, 431, 443 projection 372 propagation of chaos 56, 60, 62, 63, 94 protein space 1

pruning 62 pseudometric 348 q– factorial 10 – number 10 quenched 296 random – environment 267, 282 – forest 192 – walk – ancestral 301 – degree of 286 – in dynamic random environment 308 – on oriented percolation cluster 295 – self-avoiding 40 – walks, coalescing 291 recombination 217, 365, 384, 440 – distribution 367 – equation 370 – rates, marginal 372 – single-crossover 377 recombinator 369 reconstruction theorem 459 record statistics 10 refinement 373 regeneration construction 299, 306 reinfection 55, 57–59, 61, 62, 64 relatedness 86 relative fitness process 50 resident 138 responsive switching 249 root jump 444 runtime effect 47, 50, 52 sample shape – convergence 460 – distribution 459 – polynomial 459 sampling – consistency 466 – distribution, conditional 393 – measure 340 scaling limit 50, 131, 465 Schweinsberg’s model 157


seed bank 111, 247, 275 – coalescent 250 – diffusion 250, 254, 255 – model 251 segment – genomic 440 – size 440 segregating sites 162 selection 130, 215, 409, 411 – additive 415 – balancing 43, 44, 54–57, 59, 108, 116 – diploid 415 – directional 43, 59 – fluctuating 409, 420, 423 – frequency-dependent 107 – group 86–88 – kin 86 – moderate 57 – periodic 49 – positive 108, 116 – rapid 183 – weak 87, 415 selective – advantage 47 – sweep 108, 116, 155 self-avoiding path 6 selfing 208, 211, 212, 215 separation of types 317 sequence 366 – data 158 – space 3 sequential – coalescent with recombination 386 – fixation 136 – Markov coalescent 119, 387 – multiple 394 sexual reproduction 383 shape function 459 simultaneous switching 249, 258 single-nucleotide polymorphism 108 site-frequency spectrum 108, 158, 230, 248, 394 – expected 160 – folded 158 Skorokhod topology 58, 59, 63, 206


Slack’s distribution 189 slave rebellion 86 social insects 85 spacer 69 – gain of 76 – loss of 76 sparse regime 94, 96, 98 spatial subdivision 215 spine technique 37 spontaneous switching 249 square-root diffusion coefficient 101 SSWM regime 17 staircase model 46 stepping-stone model 92, 291, 316 stochastic – averaging 409, 419 – tunnelling 14 sub-triangulation of the circle 461 subordinator 231 subtree 428 – grafting 440 – induced 437 – mass polynomial 472 – pruning 440 – root 442 superprocess 347 survival 303 sweepstake reproduction 156, 181, 207 switching – responsive 249 – simultaneous 249 – spontaneous 249 system size 341 tightness 59 time – scale – evolutionary 225, 238 – generation 225, 238 – separation 44, 87, 128, 133, 136, 168, 210, 211 – to balance 58, 59 trait – space 128, 129 – substitution sequence 138, 139, 216 trapped genetic material 387

tree – α-Ford 451, 452, 459 – algebraic 453 – branch point 454 – degree 453 – discrete 454 – subtree 453 – subtree component 453 – algebraic measure 456 – binary 457, 465 – branch point distribution 457 – branch point distribution distance 457 – Brownian 466 – binary 428 – Catalan 429 – coalescent 431, 440 – comb 452, 459 – genealogical 339 – labelled 429 – length 224, 226, 228, 230, 236, 238, 416 – n- 9 – neutral evolutionary 468 – of excursions 94–96 – ordered 428 – permutation 429 – phylogenetic 76 – plane 428 – planted 440, 445 – random 76 – ranked 429, 431, 432, 437 – recursive 183 – reduced 431, 432 – rooted 428 – size 428 – space 340 – structure 94, 97 – traversal 429 – ultrametric 181 – uniform 452 – unordered 428 – -valued process 240, 242 – Yule 431, 438, 444, 452 trees, nested 61 triangulation of the circle 461 triplet condition 455 type space 267


ultrametric 181, 411, 412 uninhabitable 293 virus population 337, 340 Vlasov–McKean dynamics 61 volatility 268 – of block averages 280 voter model 313 – continuous-space 328 warning signals 85 Watterson estimate 162 Watterson’s estimator 192 weak – migration-moderate selection 44, 57 – mutation-moderate selection 44, 45, 50, 51 – selection 87 – topology 63 Weibull distribution 32 Wright–Fisher – diffusion 88, 98, 420, 469 – model 50–52, 114, 165, 181, 204, 248 – biparental 218 – diploid 212, 213 – Dirichlet compound 155 – two-sex diploid 212, 217 – with recombination 366, 379 – process 384 Yule process 46, 359, 431

EMS SERIES OF CONGRESS REPORTS

Probabilistic Structures in Evolution

The present volume collects twenty-one survey articles about probabilistic aspects of biological evolution. They cover a large variety of topics from the research done within the German Priority Programme SPP 1590. Evolution is a complex phenomenon driven by various processes, such as mutation and recombination of genetic material, reproduction of individuals, and selection of favourable types. These processes all have intrinsically random elements, which give rise to a wealth of phenomena that cannot be explained by deterministic models. Examples of such effects are the loss of genetic variability due to random reproduction and the emergence of random genealogies.

The collection is centred around the stochastic processes in population genetics and population dynamics. On the one hand, these are individual-based models of predator-prey and of coevolution type, of adaptive dynamics, or of experimental evolution, considered in the usual forward direction of time. They lead to processes describing the evolution of type frequencies, which may then be analysed via suitable limit theorems. On the other hand, one traces the ancestral lines of individuals back into the past; this leads to random genealogies. Beyond the classical concept of Kingman’s coalescent, emphasis is on genealogies with multiple mergers and on ancestral structures that take into account selection, recombination, or migration.

The contributions in this volume will be valuable to researchers interested in stochastic processes and their biological applications, or in mathematical population biology.

https://ems.press ISBN 978-3-98547-005-1