Crump Mode Jagers processes with neutral Poissonian mutations

Crump Mode Jagers processes with neutral Poissonian mutations Nicolas Champagnat 1 Amaury Lambert 2 1 INRIA Nancy, équipe TOSCA 2 UPMC Univ Paris 06, Laboratoire de Probabilités et Modèles Aléatoires Paris, le 15 septembre 2011

Branching processes with mutations Yule (1924) = pure-birth process, species and genera Griffiths & Pakes (1988) = Galton Watson tree and independent mutations with fixed probability Taïb (1992) = general branching process, mutation at birth Abraham & Delmas (2007) = continuous-state branching processes, all mutants have the same type Bertoin (2009, 2010, 2011) = Galton Watson, allelic partition of total descendance Sagitov & Serra (2009, 2011) = waiting time to n-th mutation

Splitting tree in forward time (Geiger & Kersting 97) We consider an asexual population where t individuals reproduce independently have i.i.d. lifetime durations during which they give birth at constant rate b The population size process (N t ;t 0) is a non-markovian branching process called (homogeneous, binary) Crump Mode Jagers process.

Representation in backward time (1) Starting from one single individual, the subtree spanned by the individuals alive at time t can be represented as follows... t 0 1 2 3 H 1 H 3 H 4 4 H 2...where the times H 1,H 2,H 3... are called coalescence times.

Representation in backward time (2) One can again represent this subtree... like this......or like that t H 1 H 3 H 4 H 1 H 3 H 4 H 2 H 2...And in each case, the coalescence times characterize the subtree.

Contour of a splitting tree A splitting tree and the (jumping) contour process of its truncation below time t. a) t b) t

First result Theorem (L. (2010)) The (jumping) contour of a (truncated) splitting tree is a strong Markov process. As a consequence, the coalescence times H 1,H 2,H 3... of the splitting tree form a sequence of i.i.d. positive random variables killed at its first value larger than t. There is a positive increasing function W such that W(0) = 1 and P(H > x) = 1 W(x). The sequence (H 1,H 2,...) is called a coalescent point-process.

Coalescent point process Generally, a coalescent point process is the genealogy generated by a sequence of arbitrary i.i.d. positive r.v. as below. In that case, we will define W(x) as 1/P(H > x). 0 1 2 3 4 5 6 7 8 9 10 12 13 14 15

Precision Two objects : splitting trees and coalescent point processes ; Results at fixed times are valid for coalescent point processes (more general than splitting trees) ; Asymptotic (t ) results are valid for supercritical splitting trees.

Assumptions on the mutation scheme Now conditional on the genealogy, point mutations occur randomly. 1 mutations occur at constant rate θ during lifetimes, or equivalently, on branch lengths of the coalescent point process 2 mutations are neutral : they have no effect on the genealogy (birth rate, lifetimes...) 3 each mutation yields a new type, called allele, to its carrier (infinitely-many alleles model) 4 types are inherited by the offspring born after this mutation and before the next one.

Mutation at rate θ N = 9 alive individuals shared out in 6 types : 4 types of abundance 1, 2 types of abundance 2, and 1 type of abundance 3. a a c f c c d e b f e a c d b

Frequency spectrum For a total population of N individuals, we adopt the following condensed notation : A = # distinct types in the population A(k) = # types represented by k individuals so that... A(k) = A and ka(k) = N k 1 k 1 (A(k);k 1) = «frequency spectrum».

Clonal splitting trees (1) the genalogy of clonal individuals is a splitting tree with (birth rate b and) lifetime duration distributed as V θ := min(v,e), where E is an exponential variable with parameter θ independent of V. to a clonal splitting tree is associated a clonal coalescent point process with i.i.d. branch lengths H1 θ,hθ 2,... whose inverse of the tail distribution is denoted by W θ P(H θ > s) =: 1 W θ (s).

Clonal splitting trees (2) Proposition (L. (2009)) In the case of a coalescent point process with branch lengths H 1,H 2,..., we can define H θ as max(h 1,...,H B θ ), where B θ is the index of first virgin lineage (i.e., carrying no mutation since it has split from ancestral lineage 0). In both cases, the scale function W θ associated with clonal trees is related to W via with W θ (0) = 1. W θ (x) = e θx W (x) x 0,

Virgin lineage Below, the index of the first virgin lineage is 8 a 0 c 1 c 2 c 3 b 4 b 5 d 6 d 7 a 8 e a 9 10 c b d e B θ H θ a

Clonal coalescent point process (1) lineage last first virgin lineage i a a a not carrying a y + dy a Goal. Compute the number of alleles of age in (y,y + dy) and carried by k alive individuals at time t, jointly with N t.

Clonal coalescent point process (2) lineage i H θ 1 B θ 1 last first virgin lineage a a a not carrying a H θ 2 B θ 2 B θ 3 H θ 4 y + dy a Goal. Compute the number of alleles of age in (y,y + dy) and carried by k alive individuals at time t, jointly with N t.

Finer result on clonal coalescent point process (1) B θ i = distances between consecutive virgin lineages Hi θ = max of branch lengths between consecutive virgin lineages = (B θ i,hθ i ) are i.i.d. Bθ 1 B θ 2 Bθ 3 Bθ 4 Bθ 5 B θ 6 Bθ 7 a a a a H3 θ a a a H θ 5 a a

Finer result on clonal coalescent point process (2) We are interested in the joint law of H θ and B θ. Set W θ (x,s) := 1 1 E(s Bθ,H θ x) x 0,s [0,1]. In particular, W θ (x,1) = W θ (x). Theorem (Champagnat & L. 2010) We have x W θ (x,s) = e θx W(x,s) x 0, x with W θ (0,γ) = 1, where W(x,s) := In particular, W(x, 1) = W(x). 1 1 sp(h x).

Expected frequency spectrum Recall N t is the population size at time t. Theorem (Champagnat & L. 2010) If A(k,t,dy) denotes the number of alleles of age in (y,y + dy) and carried by k alive individuals at time t, then E ( s N t 1 A(k,t,dy) N t 0 ) = θ dy W(t;s)2 W(t) e θy ( ) 1 k 1 W θ (y;s) 2 1 W θ (y;s)

Random characteristics for any individual i in the population, let σ i be her birth time, and let χ i (t σ i ) be a random characteristic of i examples : number of descendants of i, indicator of alive descendants of i, of alive clonal descendants of i... t units of time after her birth (χ i (t σ i ) = 0 if t < σ i ) the process Zt χ := i χ i (t σ i ) is a branching process counted by random characteristic, as defined by Jagers (74) and Jagers-Nerman (84a, 84b) under suitable conditions, in the supercritical case, ratios of the branching process counted by different random characteristics converge a.s. on the survival event to a deterministic limit.

Almost sure convergence, small families For any k 1 and 0 a < b, b U k (a,b) := θ dy a e θy ( W θ (y) 2 1 1 ) k 1 W θ (y) Thanks to the theory of random characteristics, we get Theorem (Champagnat & L. 2010) If A(k,t;a,b) denotes the number of alleles of age in (a,b) carried by k alive individuals at time t, then on the survival event, A(k,t;a,b) lim = U k (a,b) a.s. t N t

Preliminary remark We consider a supercritical splitting tree with Malthusian parameter α, so that N t increases like e αt. Since θ is an additional death rate for clonal families, Clonal families are supercritical if α > θ critical if α = θ subcritical if α < θ.

We define Notation M t (x;a,b)= number of families of size x and of age in [a,b] b M t (x;a,b) := A(k,t,dy) k x a L t (x)= number of families of size x L t (x) := M t (x;0, ) O t (a)= number of families of age a O t (a) := M t (0;a, ). Goal. Find x t such that E L t (x t ) = O(1) and a t such that E O t (a t ) = O(1), as t.

Case α > θ Assume α > θ. Proposition (Champagnat & L. 2011) For any c > 0 and a < b, ) E M t (ce (α θ)t ;t b,t a = O(1), so that largest families have sizes cn 1 θ/α and are also the oldest ones (born at times O(1)).

Case α < θ : largest families Assume α < θ and set β := θ/(θ α). Proposition (Champagnat & L. 2011) For some other explicit constant b, set x t := b(αt β log(t)). Then for any c ( E L t (x t + c) E M t x t + c;(1 ε) log(t) ) log(t),(1 + ε) = O(1), θ α θ α so that largest families have sizes b(log(n) β log(logn)) + c and they all have age log(t) θ α.

Case α < θ : oldest families Assume (again) α < θ and set γ := α/θ < 1. Proposition (Champagnat & L. 2011) For any a, E O t (γt + b) = O(1), and for any y t, E M t (y t ;γt + a, ) = 0 so that oldest families have ages γt + a and tight sizes.

Convergence in distribution (1) Take the coalescent point process at time t, choose s t such that s t, and set N t s t := number of subtrees (T i ) grafted on branch lengths s t s t T T 2 1 T N t s t N t s t t s t

Set X (k) t Convergence in distribution (2) := size of the k-th largest family in the whole population Y i := size of the largest family in subtree T i. When α θ, we can define { log(t) / (θ α) if α < θ s t := t / 2 if α = θ satisfying N t s t (X (1) t,...,x (k) )= first k order statistics of {Y 1,...,Y N } W.H.P. t s t t P(Y x t + c) = P(L st (x t + c) 1) E(L st (x t + c)).

Convergence in distribution (3) The same results hold with A (k) t := age of the k-th oldest family in the whole population Y i := age of the oldest family in subtree T i, and { s t := αt / θ if α < θ t log(t) / α if α = θ.

Assume α < θ. Convergence in distribution (4) Theorem (Champagnat & L. 2011) There are some explicit constants u,c, such that 1 lim t P(X(1) t < b(αt β log(t)) + k) = 1 + u.c k. More specifically, along some subsequence, (X (k) t b(αt β log(t));k 1) converge (fdd) to the (ranked) atoms of a mixed Poisson point measure with intensity where E is some exponential r.v. E c j δ j, j Z

Convergence in distribution (5) Assume again α < θ. Theorem (Champagnat & L. 2011) There is some explicit constant v > 0 such that 1 lim t P(A(1) t < (αt / θ) + a) = 1 + v.e θa. More specifically, (A (k) t (αt / θ);k 1) converge (fdd) to the (ranked) atoms of a mixed Poisson point measure with intensity where E is some exponential r.v. E e θa da,

Acknowledgements Laboratoire de Probabilités et Modèles Aléatoires UPMC Univ Paris 06 ANR MANEGE (Modèles Aléatoires en Écologie, Génétique, Évolution)...et merci de votre patience.