Lecture 19: Chinese restaurant process


MATH285K - Spring 2010
Lecturer: Sebastien Roch
References: [Dur08, Chapter 1.3]

Previous class

Recall the Ewens sampling formula (ESF).

THM 19.1 (Ewens sampling formula) Letting $|a| = \sum_{i=1}^n i a_i$, where $a_i$ denotes the number of alleles appearing exactly $i$ times, in a sample of size $n$ we have
\[
q(a) = \mathbf{1}\{|a| = n\}\, \frac{n!}{\theta_{(n)}} \prod_{i=1}^n \left(\frac{\theta}{i}\right)^{a_i} \frac{1}{a_i!},
\]
where $\theta_{(n)} = \theta(\theta+1)\cdots(\theta+n-1)$.

1 Proof of ESF

We begin with a proof of the ESF. This is essentially Kingman's proof (understanding the ESF was the main motivation for his introduction of the coalescent).

Proof: Let $\Pi$ be the partition generated by the infinite-alleles model on an $n$-sample. We call each set in $\Pi$ a cluster. Assume there are $k$ clusters, of sizes $\lambda_1, \ldots, \lambda_k$. Looking backwards in time, to obtain $\Pi$ it must be that each cluster undergoes a sequence of coalescences followed by a single mutation: more precisely, a cluster of size $\lambda_i$ goes through $\lambda_i - 1$ coalescences, then a single mutation. Once this happens, we say that the cluster has merged. Moreover, no coalescence event can occur between two unmerged clusters; we call such an event invalid.

We now compute the probability that these events occur. The probability that a coalescence occurs within a cluster with $j$ lineages remaining, when the total number of lineages remaining is $i$, is given by
\[
\frac{\binom{j}{2}}{\binom{i}{2} + \frac{i\theta}{2}} = \frac{j(j-1)}{i(\theta + i - 1)}.
\]

Similarly, the probability that a mutation occurs in a cluster with one lineage remaining, when the total number of lineages remaining is $i$, is given by
\[
\frac{\theta/2}{\binom{i}{2} + \frac{i\theta}{2}} = \frac{\theta}{i(\theta + i - 1)}.
\]
Note that in both expressions the denominator depends only on $i$ and the numerator depends only on $j$. Each cluster of size $\lambda_i$ contributes the numerators $\lambda_i!(\lambda_i - 1)!$ (from its $\lambda_i - 1$ coalescences) and $\theta$ (from its mutation), while the denominators multiply to $\prod_{i=1}^n i(\theta + i - 1) = n!\,\theta_{(n)}$, since each event reduces the total number of lineages by one. Hence the probability that a particular sequence of valid events occurs must be
\[
\frac{\theta^k \prod_{i=1}^k \lambda_i! (\lambda_i - 1)!}{n!\, \theta_{(n)}}.
\]
The particular sequence of valid events is not important to us, so we multiply by the number of ways of choosing $\lambda_1, \ldots, \lambda_k$ positions among the $n$ time slots:
\[
\frac{\theta^k \prod_{i=1}^k \lambda_i! (\lambda_i - 1)!}{n!\, \theta_{(n)}} \cdot \frac{n!}{\lambda_1! \cdots \lambda_k!} = \frac{\theta^k}{\theta_{(n)}} \prod_{i=1}^k (\lambda_i - 1)!.
\]
(Note that we ignore those events that are valid but irrelevant, like a coalescence between a merged cluster and an unmerged cluster. The memorylessness of exponentials allows us to do so.)

Since we are looking for the probability of the allele frequencies, we multiply by the number of ways of choosing the elements in each cluster, not distinguishing between clusters of the same size, to get
\[
q(a) = \frac{\theta^k}{\theta_{(n)}} \prod_{i=1}^k (\lambda_i - 1)! \cdot \frac{n!}{\lambda_1! \cdots \lambda_k!} \cdot \frac{1}{a_1! \cdots a_n!} = \frac{n!}{\theta_{(n)}} \prod_{i=1}^n \left(\frac{\theta}{i}\right)^{a_i} \frac{1}{a_i!}.
\]
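To make the formula concrete, here is a minimal Python sketch (not part of the original notes; the function names are mine) that evaluates $q(a)$ and checks numerically that it sums to 1 over all allele configurations of a small sample:

    # Sketch: evaluate the Ewens sampling formula and verify that it sums
    # to 1 over all allele configurations (integer partitions) of size n.
    from math import factorial, prod

    def rising(theta, n):
        # theta_(n) = theta * (theta + 1) * ... * (theta + n - 1)
        return prod(theta + j for j in range(n))

    def esf(a, theta):
        # a[i-1] = number of alleles appearing exactly i times; sum_i i*a_i = n
        n = sum(i * ai for i, ai in enumerate(a, start=1))
        return (factorial(n) / rising(theta, n)) * prod(
            (theta / i) ** ai / factorial(ai) for i, ai in enumerate(a, start=1))

    def partitions(n, largest=None):
        # integer partitions of n, as non-increasing lists of parts
        if n == 0:
            yield []
            return
        largest = n if largest is None else largest
        for part in range(min(n, largest), 0, -1):
            for rest in partitions(n - part, part):
                yield [part] + rest

    n, theta = 6, 2.0
    print(sum(esf([lam.count(i) for i in range(1, n + 1)], theta)
              for lam in partitions(n)))  # should print (close to) 1.0

For $n = 3$, for instance, the three configurations (three singletons, a pair plus a singleton, one triple) have probabilities $\theta^2$, $3\theta$ and $2$, each divided by $(\theta+1)(\theta+2)$, and these indeed sum to 1.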

2 A better estimator for θ

Implicit in Kingman's proof of the ESF is the following urn process, due to Hoppe.

DEF 19.2 (Hoppe's urn) An urn contains a black ball and a certain number of balls of other colors. At each time step, a ball is picked at random, the black ball having weight $\theta$ and all other balls having weight $1$. If the black ball is picked, a ball of a new color is added to the urn (and the black ball is replaced). If a ball of another color is chosen, a new ball of the same color is added to the urn (and the ball picked is returned to the urn). The process starts with a single black ball. We stop when we have $n$ non-black balls.

From the proof of the ESF given above, the distribution of the frequencies of the non-black balls is given by Ewens sampling formula. (Just look at the process going backwards in time.)

To illustrate the use of Hoppe's urn, we consider a different estimator of $\theta$. A natural quantity which is influenced by the mutation rate is the total number of alleles $K_n$ in the sample.

THM 19.3 (Watterson's estimator) We have
\[
\mathbb{E}[K_n] \sim \theta \log n, \qquad \mathrm{Var}[K_n] \sim \theta \log n.
\]

Proof: By Hoppe's urn, $K_n$, the number of non-black colors in the urn, is a sum of independent Bernoulli variables with means $\frac{\theta}{\theta + i - 1}$, $i = 1, \ldots, n$. Therefore
\[
\mathbb{E}[K_n] = \sum_{i=1}^n \frac{\theta}{\theta + i - 1}
\]
and
\[
\mathrm{Var}[K_n] = \sum_{i=1}^n \frac{\theta}{\theta + i - 1}\left(1 - \frac{\theta}{\theta + i - 1}\right) = \sum_{i=1}^n \frac{\theta(i-1)}{(\theta + i - 1)^2}.
\]
Using the fact that $\frac{\theta}{\theta + i - 1} \leq 1$ and the approximation of $\sum_i \frac{\theta}{\theta + i - 1}$ by an integral, which scales as $\theta(\log(n + \theta) - \log \theta)$, the result follows.

Note that the variance of $K_n / \log n$ is roughly $\theta / \log n$ and therefore converges to $0$ as $n \to \infty$. In fact, an appropriate version of the Central Limit Theorem shows that $K_n$ (standardized) converges in law to a Gaussian.
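A short simulation makes Watterson's estimator tangible. The following sketch (again not from the notes) runs Hoppe's urn forward and compares the empirical mean of $K_n$ with the exact sum from the proof and with the $\theta \log n$ asymptotics:

    # Sketch: simulate Hoppe's urn; K_n = number of colors created.
    import math
    import random

    def hoppe_Kn(n, theta, rng):
        balls = []  # colors of the non-black balls, one entry per ball
        k = 0       # number of colors created so far (= K_n at the end)
        while len(balls) < n:
            if rng.random() < theta / (theta + len(balls)):
                k += 1
                balls.append(k)                  # black ball: new color
            else:
                balls.append(rng.choice(balls))  # copy a uniform ball's color
        return k

    n, theta, reps = 500, 2.0, 2000
    rng = random.Random(0)
    emp = sum(hoppe_Kn(n, theta, rng) for _ in range(reps)) / reps
    exact = sum(theta / (theta + i - 1) for i in range(1, n + 1))
    print(emp, exact, theta * math.log(n))  # all three should be comparable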

3 Chinese restaurant process

Although $K_n$ may appear to be a rather naive estimator of $\theta$ (not to mention that it converges painfully slowly), we will show here that one cannot do much better.

We expand Hoppe's urn by tracking the time at which each ball enters the urn. It turns out to be useful to represent the state of the system by the cycle decomposition of a permutation. We start with the empty permutation. At time $i$:

- If the black ball is chosen (including at the very first step), a new cycle is created, including only $i$.
- If a ball of another color is chosen, say with index $j$, we include $i$ in the cycle of $j$, to the left of $j$.

This is called the Chinese restaurant process.

THM 19.4 If the permutation $\pi$ generated by the Chinese restaurant process has $k$ cycles, then its probability is
\[
\frac{\theta^k}{\theta_{(n)}}.
\]

Proof: Noting that the process can be reversed in a unique way and using an argument similar to the proof of the ESF, the result follows.

When $\theta = 1$, we get a uniformly random permutation, and Ewens sampling formula gives a formula for the number of permutations with $k$ cycles. Recall:

DEF 19.5 (Sufficient statistic) If a statistic $X$ is such that the conditional distribution of the data given $X$ does not depend on the parameter $\theta$, we say that $X$ is sufficient.

THM 19.6 $K_n$ is a sufficient statistic for $\theta$.

Proof: Since the probability in Theorem 19.4 depends only on $k$, we have that
\[
\mathbb{P}[K_n = k] = \frac{\theta^k}{\theta_{(n)}}\, S_n^k,
\]
where $S_n^k$ is the number of permutations of $n$ elements with $k$ cycles. Using the ESF, we get
\[
\mathbb{P}[a \mid K_n = k] = \mathbf{1}\Big\{\sum_j a_j = k\Big\}\, \frac{n!}{S_n^k} \prod_{j=1}^n \left(\frac{1}{j}\right)^{a_j} \frac{1}{a_j!},
\]
which does not depend on $\theta$.
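As a sanity check on Theorems 19.4 and 19.6, the sketch below (mine, under the same caveats) builds the permutation of the Chinese restaurant process step by step and compares the empirical distribution of $K_n$ with $\theta^k S_n^k / \theta_{(n)}$, computing $S_n^k$ by the standard recursion $S_n^k = S_{n-1}^{k-1} + (n-1) S_{n-1}^k$:

    # Sketch: generate the CRP permutation; check P[K_n = k] = theta^k S_n^k / theta_(n).
    import random
    from math import prod

    def crp_cycles(n, theta, rng):
        # At time i: new cycle w.p. theta/(theta+i-1); otherwise insert i
        # to the left of a uniformly chosen earlier element.
        cycles = []
        for i in range(1, n + 1):
            if rng.random() < theta / (theta + i - 1):
                cycles.append([i])
            else:
                j = rng.randrange(i - 1)  # index of a uniform earlier element
                for cyc in cycles:
                    if j < len(cyc):
                        cyc.insert(j, i)  # i goes to the left of that element
                        break
                    j -= len(cyc)
        return cycles

    def stirling_first(n):
        # S[k] = number of permutations of n elements with k cycles
        S = [1]  # n = 0
        for m in range(1, n + 1):
            S = [(S[k - 1] if k >= 1 else 0)
                 + (m - 1) * (S[k] if k < len(S) else 0) for k in range(m + 1)]
        return S

    n, theta, reps = 8, 1.5, 20000
    rng = random.Random(1)
    counts = [0] * (n + 1)
    for _ in range(reps):
        counts[len(crp_cycles(n, theta, rng))] += 1
    S, theta_n = stirling_first(n), prod(theta + j for j in range(n))
    for k in range(1, n + 1):
        print(k, counts[k] / reps, theta ** k * S[k] / theta_n)

With $\theta = 1$ the predicted column reduces to $S_n^k / n!$, the classical distribution of the number of cycles of a uniform random permutation, in line with the remark after THM 19.4.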

Further reading

The material in this section was taken from Section 1.3 of the excellent monograph [Dur08].

References

[Dur08] Richard Durrett. Probability models for DNA sequence evolution. Probability and its Applications (New York). Springer, New York, second edition, 2008.