
Outline
A short review on Bayesian analysis:
Binomial, Multinomial, Normal, Beta, Dirichlet
Posterior mean, MAP, credible interval, posterior distribution
Gibbs sampling
Revisit the Gaussian mixture model
EM for MAP
Collapsed Gibbs sampling
Chinese restaurant process, nonparametric clustering

Frequentist vs Bayesian. Statistical methods such as hypothesis testing (p-values), maximum likelihood estimates (MLE), and confidence intervals (CI) are known as frequentist methods. In the frequentist framework, probabilities refer to long-run frequencies and are objective quantities; parameters are fixed but unknown constants; and statistical procedures should have well-defined long-run frequency properties (e.g., a 95% CI).

There is another approach to inference known as Bayesian inference. In the Bayesian framework, probabilities reflect (subjective) personal belief; unknown parameters are treated as random variables; and we make inferences about a parameter θ by producing a probability distribution for θ.

Bayesian Analysis. Bayesian inference is carried out in the following way:
1. Choose a statistical model p(x | θ), i.e., the likelihood, same as in the frequentist approach;
2. Choose a prior distribution π(θ);
3. Calculate the posterior distribution π(θ | x),
π(θ | x) = p(x | θ) π(θ) / ∫ p(x | θ) π(θ) dθ.

Alternatively, we can write the posterior as
π(θ | x) = p(x | θ) π(θ) / ∫ p(x | θ) π(θ) dθ ∝ p(x | θ) π(θ),
where we drop the scale factor [∫ p(x | θ) π(θ) dθ]^{-1} since it is a constant not depending on θ. For example, if π(θ | x) ∝ e^{-θ² + 2θ}, then we know that θ | x ~ N(1, 1/2).

Now any inference on the parameter θ can be obtained from the posterior distribution π(θ | x). For example, if one wants a point estimate of θ, we can report the mean E(θ | x), the median median(θ | x), or the mode of the posterior distribution argmax_θ π(θ | x); for an interval estimate of θ, we can report the 95% credible interval, which is a region with 0.95 posterior probability. So a 95% credible interval (1.2, 3.5) means that P(θ ∈ (1.2, 3.5)) = 0.95, where P corresponds to the posterior distribution over θ. For a 95% confidence interval (1.2, 3.5), we CANNOT say that (1.2, 3.5) covers the true θ with probability 95%.
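
To make these summaries concrete, here is a minimal sketch (my own illustration, not from the slides): it computes the posterior mean, median, mode (MAP), and a 95% equal-tailed credible interval for a hypothetical Beta(3, 9) posterior; the distribution and its parameters are arbitrary choices.

```python
# Minimal sketch: posterior summaries for a hypothetical Beta(3, 9) posterior
# over a parameter theta in [0, 1]; the parameters are chosen for illustration.
import numpy as np
from scipy import optimize, stats

posterior = stats.beta(3, 9)          # hypothetical posterior pi(theta | x)

post_mean = posterior.mean()          # E(theta | x)
post_median = posterior.median()      # median(theta | x)
# MAP: the mode of the posterior density, found numerically here
post_map = optimize.minimize_scalar(lambda t: -posterior.pdf(t),
                                    bounds=(0, 1), method="bounded").x
# 95% equal-tailed credible interval: the 2.5% and 97.5% posterior quantiles
ci_low, ci_high = posterior.ppf([0.025, 0.975])

print(f"mean={post_mean:.3f}, median={post_median:.3f}, MAP={post_map:.3f}")
print(f"95% credible interval: ({ci_low:.3f}, {ci_high:.3f})")
```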

The mode estimate is often referred to as the MAP (maximum a posteriori) estimate, which is the solution of
max_θ log π(θ | x) = max_θ [ log p(x | θ) + log π(θ) ].
Note that MAP is also the solution of
min_θ [ −log p(x | θ) − log π(θ) ].
So MAP is related to the regularization approach, with the penalty term being (−log) of the prior.

Your first resistance to Bayesian inference may be the priors. Where does one find priors? The prior, like the likelihood, is part of your assumptions: it is one's initial guess of the parameter; after observing the data, which carry information about the parameter, one updates his/her prior to the posterior. Priors matter and do not matter. Next I'll introduce some default prior choices. Of course, the sensitivity of prior choices (how different priors affect the final result) should always be examined in practice.

A Bernoulli Example. Suppose X = (X_1, ..., X_n) denotes the outcomes from n coin tosses, where 1 means a head and 0 means a tail. They are iid samples from a Bernoulli distribution with parameter θ, where θ is the probability of getting a head. Without any information about the coin, we can put a uniform prior on θ, that is, π(θ) = 1, 0 ≤ θ ≤ 1.

Next we calculate the posterior distribution of θ given X:
π(θ | X) ∝ Π_{i=1}^n p(x_i | θ) ∝ θ^s (1−θ)^{n−s}, s = Σ_i X_i,
which implies that θ | X ~ Beta(s+1, n+1−s). The corresponding posterior mean can be used as a point estimate for θ,
θ̂ = (s+1)/(n+2).  (1)

Post-mean = (s+1)/(n+2), MLE = s/n. The MLE is equal to the observed frequency of heads among n experiments; the Bayes estimator is the frequency of heads among (n+2) experiments, in which there are two prior experiments, one a head and the other a tail. Without the data, one just looks at the prior experiments, and a reasonable guess for θ is 1/2. After observing the data, the final estimate (1) is some number between 1/2 and s/n, a compromise between the prior information and the MLE. Note that when n gets large, the prior gets washed out.

Post-mean = (s+1)/(n+2), MLE = s/n. The extra counts, one for head and one for tail, are often called pseudo-counts. Having pseudo-counts is appealing in cases where θ is likely to take extreme values close to 1 or 0. For example, to estimate θ for a rare event, it is likely to observe X_i = 0 for all i = 1, ..., n, but it may be dangerous to conclude θ̂ = 0.

Beta distributions are often used as a prior on θ; in fact, the uniform is a special case of the Beta. Suppose the prior on θ is Beta(α, β); then
π(θ) = θ^{α−1} (1−θ)^{β−1} / B(α, β),    θ | X ~ Beta(s+α, n−s+β).  (2)
We call Beta distributions the conjugate family for Bernoulli models since both the prior and the posterior distributions belong to the same family. The posterior mean of (2) is equal to (α+s)/(n+α+β). So Beta priors can be viewed as having (α+β) prior experiments, in which we have α heads and β tails.
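
A small numerical sketch of this conjugate update (my own code, not from the lecture): simulated coin flips, a Beta(α, β) prior, the Beta(s+α, n−s+β) posterior, and a comparison of the posterior mean with the MLE. The true θ, sample size, and hyperparameters are arbitrary.

```python
# Illustration of the Beta-Bernoulli conjugate update; the prior
# hyperparameters and the simulated data are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.7
n = 20
x = rng.binomial(1, theta_true, size=n)   # n coin flips
s = x.sum()                               # number of heads

alpha, beta = 1.0, 1.0                    # Beta(1, 1) = uniform prior
post_a, post_b = s + alpha, n - s + beta  # posterior is Beta(s+alpha, n-s+beta)

post_mean = post_a / (post_a + post_b)    # (alpha + s) / (n + alpha + beta)
mle = s / n
print(f"s={s}, MLE={mle:.3f}, posterior mean={post_mean:.3f}")
```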

A Multinomial Example. Suppose you randomly draw a card from an ordinary deck of playing cards and then put it back in the deck. Repeat this exercise five times (i.e., sampling with replacement). Let (N_1, N_2, N_3, N_4) denote the numbers of spades, hearts, diamonds, and clubs among the five cards. We say (N_1, N_2, N_3, N_4) follows a multinomial distribution, with a probability distribution function given by
P(N_1 = n_1, ..., N_4 = n_4 | n, θ_1, ..., θ_4) = n! / [ n_1! n_2! n_3! n_4! ] · θ_1^{n_1} ··· θ_4^{n_4},
where n_1 + ··· + n_4 = n = 5 and θ_1 = ··· = θ_4 = 1/4.

In the Bernoulli example, we conduct n independent trials and each trial results in one of two possible outcomes, e.g., head or tail, with probabilities θ and (1−θ), respectively. In the multinomial example, we conduct n independent trials and each trial results in one of k possible outcomes, e.g., k = 4 in the aforementioned card example, with probabilities θ_1, ..., θ_k, respectively.

For multinomial distributions, the parameter of interest is θ = (θ_1, ..., θ_k), which lies in a simplex of R^k,
S = { θ = (θ_1, ..., θ_k) : Σ_i θ_i = 1, θ_i ≥ 0 }.
A Dirichlet distribution on S, Dir(α_1, ..., α_k), is an extension of the Beta distribution, with a density function given by
p(θ) = Γ(α_1 + ··· + α_k) / [ Γ(α_1) ··· Γ(α_k) ] · Π_{i=1}^k θ_i^{α_i − 1}.
The Dirichlet distributions are conjugate priors for multinomial models.
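
A sketch of how this conjugacy is used (my own example): with multinomial counts (n_1, ..., n_k) and a Dir(α_1, ..., α_k) prior, the posterior is Dir(α_1 + n_1, ..., α_k + n_k); the counts and the symmetric prior below are arbitrary choices.

```python
# Dirichlet-multinomial conjugacy sketch: card-suit counts from five draws
# with replacement (hypothetical data) and a symmetric Dirichlet prior.
import numpy as np

counts = np.array([2, 1, 1, 1])        # spades, hearts, diamonds, clubs (n = 5)
alpha = np.ones(4)                     # Dir(1, 1, 1, 1) prior

post_alpha = alpha + counts            # posterior is Dir(alpha + counts)
post_mean = post_alpha / post_alpha.sum()

rng = np.random.default_rng(1)
post_draws = rng.dirichlet(post_alpha, size=1000)   # samples of theta | data

print("posterior mean:", np.round(post_mean, 3))
print("MC estimate   :", np.round(post_draws.mean(axis=0), 3))
```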

Dirichlet Distributions. Examples of Dirichlet distributions over p = (p_1, p_2, p_3), which can be plotted in 2D since p_3 = 1 − p_1 − p_2. [Figure: density plots over the 2-simplex for several Dirichlet parameter settings.]

A Normal Example. Assume X_1, ..., X_n ~ N(µ, σ²). The parameters here are (µ, σ²) and we would like to get the posterior distribution π(µ, σ² | X). If using a Gibbs sampler, which will be introduced later, we only need to know the conditional posterior distributions π(µ | σ², X) and π(σ² | µ, X).

For the location parameter µ, the conjugate prior is normal. If X̄ | µ, σ² ~ N(µ, σ²/n) and µ ~ N(µ_0, σ_0²), then µ | σ², X ~ N(µ̂, σ̂²), where
µ̂ = w X̄ + (1−w) µ_0,    w = (σ²/n)^{-1} / [ (σ²/n)^{-1} + σ_0^{-2} ],    σ̂² = [ (σ²/n)^{-1} + σ_0^{-2} ]^{-1}.

For the scale parameter σ², the conjugate prior is Inverse Gamma; that is, the prior on 1/σ² is Gamma. Suppose π(σ²) = InvGa(a, b); then
π(σ² | µ, X) ∝ (σ²)^{-n/2} exp{ −Σ_i (X_i − µ)² / (2σ²) } · (σ²)^{-(a+1)} e^{-b/σ²} ∝ InvGa( a + n/2, b + Σ_i (X_i − µ)²/2 ).
In practice, we often specify π(σ²) using scaled inverse chi-square distributions, which are special cases of InvGa: Inv-χ²(ν_0, s_0²) = InvGa(ν_0/2, ν_0 s_0²/2). With prior Inv-χ²(ν_0, s_0²), the posterior distribution is also Inv-χ²(ν_n, s_n²), where
ν_n = ν_0 + n,    ν_n s_n² = ν_0 s_0² + Σ_i (X_i − µ)²:
ν_0 pseudo-samples, each contributing s_0² to the residual sum of squares.
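
Both conditionals translate directly into code. Below is a sketch (my own, with arbitrary data and hyperparameters): it computes the Normal conditional for µ given σ² and draws σ² from the InvGa(a + n/2, b + Σ_i (X_i − µ)²/2) conditional, as one would inside a Gibbs sampler.

```python
# Sketch of the two conjugate conditionals for the Normal model:
# mu | sigma2, X is Normal; sigma2 | mu, X is Inverse-Gamma.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.5, size=50)   # hypothetical data
n, xbar = x.size, x.mean()

# --- mu | sigma2, X  with prior mu ~ N(mu0, sigma0sq) ---
sigma2 = 1.5 ** 2            # pretend sigma2 is currently fixed (e.g. within Gibbs)
mu0, sigma0sq = 0.0, 10.0    # prior hyperparameters (arbitrary)
prec = n / sigma2 + 1.0 / sigma0sq
w = (n / sigma2) / prec                    # weight on the sample mean
mu_post_mean = w * xbar + (1 - w) * mu0
mu_post_var = 1.0 / prec
mu_draw = rng.normal(mu_post_mean, np.sqrt(mu_post_var))

# --- sigma2 | mu, X  with prior sigma2 ~ InvGa(a, b) ---
a, b = 2.0, 2.0                            # prior hyperparameters (arbitrary)
a_post = a + n / 2.0
b_post = b + 0.5 * np.sum((x - mu_draw) ** 2)
sigma2_draw = stats.invgamma(a_post, scale=b_post).rvs(random_state=rng)

print(f"mu | sigma2, X ~ N({mu_post_mean:.3f}, {mu_post_var:.4f}); draw = {mu_draw:.3f}")
print(f"sigma2 draw = {sigma2_draw:.3f}")
```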

Gibbs Sampling for Posterior Inference. Suppose the random variables X and Y have a joint probability density function p(x, y). Sometimes it is not easy to simulate directly from the joint distribution. Instead, suppose it is possible to simulate from the individual conditional distributions p_{X|Y}(x | y) and p_{Y|X}(y | x).

Then a Gibbs sampler draws (X_1, Y_1), ..., (X_T, Y_T) as follows:
1. Initialization: let (X_0, Y_0) be some starting values; set n = 0.
2. Draw X_{n+1} ~ p_{X|Y}(x | Y_n).
3. Draw Y_{n+1} ~ p_{Y|X}(y | X_{n+1}).
4. Go to step 2 and repeat.

Gibbs samplers are MCMC algorithms, and they produce samples from the desired distributions after a so-called burn-in period. So in practice, we always drop some samples from the initial steps (say, for example, 1000 or 5000 steps) and start saving samples after that.
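
A toy sketch of this scheme (my own example, not from the slides): Gibbs sampling from a bivariate normal with correlation ρ, where both conditionals are univariate normals, discarding a burn-in period before saving samples.

```python
# Toy Gibbs sampler for a standard bivariate normal with correlation rho:
# X | Y = y ~ N(rho * y, 1 - rho^2) and Y | X = x ~ N(rho * x, 1 - rho^2).
import numpy as np

rng = np.random.default_rng(3)
rho = 0.8
T, burn_in = 20_000, 1_000
sd = np.sqrt(1 - rho ** 2)

x, y = 0.0, 0.0                      # step 1: starting values
samples = []
for t in range(T):
    x = rng.normal(rho * y, sd)      # step 2: draw X_{n+1} ~ p(x | Y_n)
    y = rng.normal(rho * x, sd)      # step 3: draw Y_{n+1} ~ p(y | X_{n+1})
    if t >= burn_in:                 # drop the burn-in samples
        samples.append((x, y))

samples = np.array(samples)
print("estimated correlation:", np.corrcoef(samples.T)[0, 1])  # close to rho
```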

In Bayesian analysis, suppose we have multiple parameters θ = (θ_1, ..., θ_K). Then we can draw posterior samples of θ using a multi-stage Gibbs sampler: at each stage, we draw θ_i from the conditional distribution π(θ_i | θ_{[−i]}, Data), where θ_{[−i]} denotes the (K−1) parameters other than θ_i. The reason for Gibbs samplers is that in many cases the joint posterior π(θ | Data) is not of closed form, while all those conditionals are.

A Gaussian Mixture Model. Suppose the data x_1, x_2, ..., x_n are iid from
w · N(µ_1, σ_1²) + (1−w) · N(µ_2, σ_2²).
For each x_i, we introduce a latent variable Z_i indicating which component x_i is generated from, with P(Z_i = 1) = w and P(Z_i = 2) = 1−w. The parameters of interest are θ = (w, µ_1, µ_2, σ_1², σ_2²), and their prior distributions are specified as follows:
w ~ Beta(1, 1),    µ_1, µ_2 ~ N(0, τ²),    σ_1², σ_2² ~ InvGa(a, b).

EM for MAP. If we want to obtain a MAP estimate of the parameters, we can use the EM algorithm. Recall that
θ̂ = argmax_θ P(x | θ) π(θ) = argmax_θ [ Σ_z P(x, z | θ) ] π(θ).
Writing φ_{µ,σ}(·) for the N(µ, σ²) density, the complete-data objective is
log p(x, Z | θ) π(θ) = Σ_i 1(Z_i = 1) [ log φ_{µ_1,σ_1}(x_i) + log w ] + Σ_i 1(Z_i = 2) [ log φ_{µ_2,σ_2}(x_i) + log(1−w) ]
+ log π(w) + log π(µ_1) + log π(µ_2) + log π(σ_1²) + log π(σ_2²).

Recall the EM algorithm for MLE: at the E-step, we replace 1(Z_i = 1) and 1(Z_i = 2) by their expectations, i.e., the probability of Z_i = 1 or 2 conditioning on the data x and the current estimate of the parameter θ⁰,
γ_i = P(Z_i = 1 | x_i, θ⁰) = w φ_{µ_1,σ_1}(x_i) / [ w φ_{µ_1,σ_1}(x_i) + (1−w) φ_{µ_2,σ_2}(x_i) ];
at the M-step, we update θ; and we iterate between the E and M steps until convergence.

For MAP, the E-step is the same; the M-step is slightly different. Without the Beta prior, we would update w by (Σ_i γ_i)/n; but with the Beta prior on w, we need to add the pseudo-counts. Similarly, without the prior, we would update µ_1 by (Σ_i γ_i x_i)/(Σ_i γ_i), but with the prior, we would update µ_1 by a weighted average of the value above and the prior mean for µ_1.
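
A compact sketch of EM for a MAP estimate in this two-component mixture (my own code; it assumes a Beta(a, b) prior on w and N(0, τ²) priors on the means, and it keeps the component variances fixed to stay short, so it is not the full model above):

```python
# Sketch of EM for a MAP estimate in a two-component Gaussian mixture.
# Assumptions made for this illustration: Beta(a, b) prior on w, N(0, tau2)
# priors on the means, and component variances held fixed at known values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 50)])  # toy data
n = x.size

a, b, tau2 = 2.0, 2.0, 100.0          # prior hyperparameters (arbitrary)
sig1, sig2 = 1.0, 1.0                 # fixed component std devs (assumption)
w, mu1, mu2 = 0.5, -1.0, 1.0          # initial values

for _ in range(200):
    # E-step: responsibilities gamma_i = P(Z_i = 1 | x_i, current parameters)
    p1 = w * norm.pdf(x, mu1, sig1)
    p2 = (1 - w) * norm.pdf(x, mu2, sig2)
    gamma = p1 / (p1 + p2)

    # M-step (MAP): pseudo-counts from the Beta prior enter the update of w,
    # and each mean is shrunk toward the prior mean 0.
    w = (gamma.sum() + a - 1) / (n + a + b - 2)
    mu1 = (gamma * x).sum() / sig1**2 / (gamma.sum() / sig1**2 + 1 / tau2)
    mu2 = ((1 - gamma) * x).sum() / sig2**2 / ((1 - gamma).sum() / sig2**2 + 1 / tau2)

print(f"w={w:.3f}, mu1={mu1:.3f}, mu2={mu2:.3f}")
```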

Gibbs Sampling from the Posterior Distribution.
π(θ, Z | x) ∝ Π_{i=1}^n p(z_i | θ) p(x_i | Z_i, θ) · π(w) π(µ_1) π(µ_2) π(σ_1²) π(σ_2²).
In a Gibbs sampler, we iteratively sample each element of (w, µ_1, µ_2, σ_1², σ_2², Z_1, ..., Z_n) conditioning on the other elements being fixed.

For example, how do we sample Z_i? From a Bernoulli:
P(Z_i = 1 | x, Z_{[−i]}, others) ∝ P(Z_i = 1 | θ) P(x_i | Z_i = 1, θ) = w p(x_i | µ_1, σ_1²).
How do we sample µ_1? From a Normal:
π(µ_1 | x, Z, others) ∝ [ Π_{i: Z_i = 1} P(x_i | µ_1, σ_1²) ] π(µ_1).

For a general Gaussian mixture model with K components, we put the prior on the mixing weights as w = (w_1, ..., w_K) ~ Dir(α/K, ..., α/K), and again normal priors on the µ_k's and InvGa priors on the σ_k²'s. The Gibbs sampler iterates the following steps:
1. Draw Z_i from a multinomial distribution, for i = 1, ..., n;
2. Draw w from a Dirichlet;
3. Draw µ_1, ..., µ_K from normals;
4. Draw σ_1², ..., σ_K² from InvGa.
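
A sketch of one sweep of this sampler (my own code; the priors follow the slide's description, but the specific N(0, τ²)/InvGa(a, b) hyperparameters and the function name gibbs_sweep are my choices):

```python
# One sweep of a Gibbs sampler for a K-component Gaussian mixture.
# Priors assumed here: w ~ Dir(alpha/K, ..., alpha/K), mu_k ~ N(0, tau2),
# sigma2_k ~ InvGa(a, b).  w, mu, sig2 are float arrays of length K.
import numpy as np
from scipy import stats

def gibbs_sweep(x, w, mu, sig2, alpha=1.0, tau2=100.0, a=2.0, b=2.0, rng=None):
    rng = rng or np.random.default_rng()
    n, K = x.size, w.size
    # 1. Draw each Z_i from a categorical with probs prop. to w_k * N(x_i; mu_k, sig2_k)
    logp = np.log(w) - 0.5 * np.log(2 * np.pi * sig2) \
           - 0.5 * (x[:, None] - mu) ** 2 / sig2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=p[i]) for i in range(n)])
    counts = np.bincount(z, minlength=K)
    # 2. Draw w from Dir(alpha/K + n_1, ..., alpha/K + n_K)
    w = rng.dirichlet(alpha / K + counts)
    # 3. and 4. Draw mu_k from its Normal conditional and sigma2_k from its
    # InvGa conditional; empty clusters fall back to their priors.
    for k in range(K):
        xk = x[z == k]
        prec = xk.size / sig2[k] + 1.0 / tau2
        mu[k] = rng.normal((xk.sum() / sig2[k]) / prec, np.sqrt(1.0 / prec))
        sig2[k] = stats.invgamma(a + xk.size / 2.0,
                                 scale=b + 0.5 * ((xk - mu[k]) ** 2).sum()
                                 ).rvs(random_state=rng)
    return z, w, mu, sig2
```

Each call performs steps 1-4 once; in practice one would run many sweeps and discard a burn-in period, as discussed earlier.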

Collapsed Gibbs Sampling. The mixing weight vector w is used in sampling the Z_i's:
P(Z_i = k | x, z_{[−i]}, θ) ∝ P(x_i | Z_i = k, θ) P(Z_i = k | z_{[−i]}, θ),
where the 2nd term is equal to P(Z_i = k | z_{[−i]}, w) = P(Z_i = k | w), i.e., when conditioning on w, the Z_i's are independent.

Let's eliminate w from the parameter list, i.e., integrate over w. So we need to compute
P(Z_i = k | z_{[−i]}) = P(z_1, ..., z_{i−1}, Z_i = k, z_{i+1}, ..., z_n) / P(z_1, ..., z_{i−1}, z_{i+1}, ..., z_n).
Note that the Z_i's are exchangeable (its meaning will be made clear in class). So it suffices to compute the sampling distribution for the last observation.

P(Z_n = k | z_1, ..., z_{n−1}) = P(Z_n = k, z_1, ..., z_{n−1}) / P(z_1, ..., z_{n−1})
= [ ∫ w_1^{n_1 + α/K − 1} ··· w_k^{n_k + 1 + α/K − 1} ··· w_K^{n_K + α/K − 1} dw ] / [ ∫ w_1^{n_1 + α/K − 1} ··· w_k^{n_k + α/K − 1} ··· w_K^{n_K + α/K − 1} dw ]
= [ Γ(n−1+α) / Γ(n−1+α+1) ] · [ Γ(n_k + α/K + 1) / Γ(n_k + α/K) ]
= (n_k + α/K) / (n − 1 + α),
where n_k = #{ j : z_j = k, j = 1:(n−1) }.

The collapsed Gibbs sampler iterates the following steps:
1. Draw Z_i from a Multinomial(γ_{i1}, ..., γ_{iK}), for i = 1, ..., n, where
γ_{ik} ∝ [ (n_k^{(−i)} + α/K) / (n − 1 + α) ] · P(x_i | µ_k, σ_k²);
2. No need to sample w from the Dirichlet;
3. Draw µ_1, ..., µ_K from normals;
4. Draw σ_1², ..., σ_K² from InvGa.
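
A short sketch of step 1 (my own code): the collapsed sampling probabilities for a single Z_i combine the count-based prior term with the Gaussian likelihood.

```python
# Sketch of the collapsed-Gibbs update for a single Z_i: the mixing weights w
# are integrated out, so the prior term uses the counts n_k^(-i) directly.
import numpy as np
from scipy.stats import norm

def sample_zi(i, x, z, mu, sig2, alpha, K, rng):
    n = x.size
    counts = np.bincount(np.delete(z, i), minlength=K)     # n_k^(-i)
    prior = (counts + alpha / K) / (n - 1 + alpha)          # collapsed prior term
    lik = norm.pdf(x[i], mu, np.sqrt(sig2))                 # P(x_i | mu_k, sig2_k)
    p = prior * lik
    return rng.choice(K, p=p / p.sum())
```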

Infinitely Many Clusters: K = a Large Value (> n)? At the t-th iteration of the algorithm, there must be some empty clusters. Let's re-label the clusters: 1 to K* are the non-empty clusters (of course, the value of K* changes from iteration to iteration), and the remaining are empty ones.
Update {µ_k, σ_k²}_{k=1}^{K*} from the posterior (Normal/InvGa);
update {µ_k, σ_k²}_{k=K*+1}^{K} from the prior (Normal/InvGa).

Update Z_i from a Multinomial(γ_{i1}, ..., γ_{iK}), where
γ_{ik} ∝ [ (n_k^{(−i)} + α/K) / (n − 1 + α) ] P(x_i | µ_k, σ_k²).
Z_i may start a new cluster, i.e., Z_i > K*. It doesn't matter which value Z_i takes from (K*+1, ..., K), since we can always label this new cluster as the (K*+1)-th cluster, and then immediately update (µ_{K*+1}, σ_{K*+1}²) based on the corresponding posterior (conditioning on x_i).

What's the chance that Z_i starts a new cluster?
Σ_{k=K*+1}^{K} γ_{ik} ∝ [ (α/K) / (n − 1 + α) ] Σ_{k=K*+1}^{K} P(x_i | µ_k, σ_k²)
= [ α / (n − 1 + α) ] · [ (K − K*) / K ] · [ 1 / (K − K*) ] Σ_{k=K*+1}^{K} P(x_i | µ_k, σ_k²)
→ [ α / (n − 1 + α) ] ∫∫ P(x_i | µ, σ²) π(µ, σ²) dµ dσ²,    as K → ∞,
since the parameters of the empty clusters are draws from the prior. Here
∫∫ P(x_i | µ, σ²) π(µ, σ²) dµ dσ² = m(x_i)
is the integrated (with respect to our prior π) likelihood for a sample x_i.

Clustering with the Chinese Restaurant Process (CRP). A mixture model:
X_i | Z_i = k ~ P(· | µ_k, σ_k²), i = 1:n.
Prior on Z_1, ..., Z_n and the (µ_k, σ_k²)'s (known as the Chinese Restaurant Process):
Z_1 = 1 and (µ_1, σ_1²) ~ π(·);
for i ≥ 1, suppose Z_1, ..., Z_i form m clusters with sizes n_1, ..., n_m; then
P(Z_{i+1} = k | Z_1, ..., Z_i) = n_k / (i + α), k = 1, ..., m;
P(Z_{i+1} = m+1 | Z_1, ..., Z_i) = α / (i + α), with (µ_{m+1}, σ_{m+1}²) ~ π(·).

Alternatively, you can describe the prior first and then the likelihood (which gives you a clear idea of how data are generated):
Set Z_1 = 1, generate (µ_1, σ_1²) ~ π(·) and X_1 ~ P(· | µ_1, σ_1²).
Loop over i = 1, ..., n−1: suppose the previous i samples form m clusters with cluster-specific parameters (µ_k, σ_k²)_{k=1}^m; then
P(Z_{i+1} = k | Z_1, ..., Z_i) = n_k / (i + α), k = 1, ..., m;
P(Z_{i+1} = m+1 | Z_1, ..., Z_i) = α / (i + α).
If Z_{i+1} = m+1, generate (µ_{m+1}, σ_{m+1}²) ~ π(·). Then generate X_{i+1} ~ P(· | µ_{Z_{i+1}}, σ_{Z_{i+1}}²).
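
A sketch of this generative process (my own code; it uses a one-dimensional Normal likelihood with unit variance and a N(0, 3²) base distribution for the cluster means, which are arbitrary choices):

```python
# Simulate cluster labels and data from the CRP generative process described
# above, with Normal cluster means drawn from the base distribution.
import numpy as np

def crp_generate(n, alpha, rng=None):
    rng = rng or np.random.default_rng()
    z, mus, x = [1], [rng.normal(0, 3)], []          # Z_1 = 1, mu_1 ~ prior
    x.append(rng.normal(mus[0], 1))                   # X_1 ~ P(. | mu_1)
    for i in range(1, n):
        sizes = np.bincount(z)[1:]                    # current cluster sizes n_1..n_m
        m = sizes.size
        probs = np.append(sizes, alpha) / (i + alpha) # join cluster k, or start m+1
        k = rng.choice(m + 1, p=probs) + 1
        if k == m + 1:
            mus.append(rng.normal(0, 3))              # new cluster: draw from prior
        z.append(k)
        x.append(rng.normal(mus[k - 1], 1))           # X_{i+1} ~ P(. | mu_{Z_{i+1}})
    return np.array(z), np.array(x)

z, x = crp_generate(n=100, alpha=2.0, rng=np.random.default_rng(5))
print("number of clusters:", z.max(), "sizes:", np.bincount(z)[1:])
```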

Advantages. We do not need to specify K. K is treated as a random variable, and its (posterior) distribution is learned from the data. Can model unseen data: for any new sample x, there is always a positive chance that it can start a new cluster.

Exchangeability of the Z_i's. In the CRP, the labels Z_i are generated sequentially, but in fact they are exchangeable (up to a permutation of the cluster labels; labels should start from 1, 2, ...):
P(11122) = P(11212) = P(11222) = α (2!)(1!) / [ (1+α)(2+α)(3+α)(4+α) ].
In general, suppose z_1, ..., z_n form m clusters with sizes n_1, ..., n_m; then
P(z_1, ..., z_n) = α^{m−1} (n_1 − 1)! ··· (n_m − 1)! / Π_{i=2}^n (i − 1 + α).
So the order of the z_i's doesn't matter; what matters is the partition of the n samples implied by the z_i's.
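
A small check of this property (my own code): it computes the sequential CRP probability of a label sequence directly from the conditional probabilities above and verifies that different orderings of the same cluster sizes get the same value; the α value is arbitrary.

```python
# Check of exchangeability: compute the sequential CRP probability of a
# cluster-label sequence and verify that different orderings of the same
# partition sizes get the same probability (alpha value is arbitrary).
from math import isclose

def crp_prob(z, alpha):
    prob, sizes = 1.0, {}
    for i, k in enumerate(z):
        if i > 0:                                   # customer 1 sits at table 1
            new = k not in sizes
            prob *= (alpha if new else sizes[k]) / (i + alpha)
        sizes[k] = sizes.get(k, 0) + 1
    return prob

alpha = 1.5
p = [crp_prob(z, alpha) for z in ([1,1,1,2,2], [1,1,2,1,2], [1,1,2,2,2])]
print(p, all(isclose(p[0], q) for q in p[1:]))      # all equal
```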

Posterior Sampling for Clustering with CRP. Same as the Gibbs sampler we have derived for K → ∞. At the t-th iteration, repeat the following:
Suppose Z_{[−i]} forms K clusters, labeled from 1 to K, of sizes n_k^{(−i)}. Sample Z_i from a multinomial with
P(Z_i = k) ∝ [ n_k^{(−i)} / (n − 1 + α) ] P(x_i | µ_k, σ_k²), k = 1, ..., K;
P(Z_i = K+1) ∝ [ α / (n − 1 + α) ] m(x_i),
where m(x_i) = ∫∫ P(x_i | µ, σ²) π(µ, σ²) dµ dσ².
Update {µ_k, σ_k²} from the posterior (Normal/InvGa).

The exchangeability of the Z_i's plays an important role in the algorithm. Where do we use this property? The marginal likelihood m(·) is easy to compute if the prior π(µ, σ²) is conjugate; otherwise, we need to figure out a way to compute m(·). For other MCMC algorithms, check Neal (2000); for variational Bayes, check Blei and Jordan (2004). The ugly side: the labeling issue.
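
As an illustration of a closed-form m(·) (my own example, in a simplified conjugate setting with known variance σ² and a Normal prior on µ, rather than the full Normal/InvGa prior above): the marginal is m(x) = N(x; µ_0, σ² + τ_0²), which the sketch below checks by Monte Carlo.

```python
# Closed-form marginal likelihood m(x) in a simplified conjugate setting
# (known variance sigma2, Normal prior on mu), checked by Monte Carlo.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
mu0, tau0sq, sigma2 = 0.0, 4.0, 1.0      # prior and likelihood parameters (arbitrary)
x = 1.3                                   # a single observation

# Closed form: m(x) = N(x; mu0, sigma2 + tau0sq)
m_closed = norm.pdf(x, mu0, np.sqrt(sigma2 + tau0sq))

# Monte Carlo check: average P(x | mu) over prior draws of mu
mus = rng.normal(mu0, np.sqrt(tau0sq), size=200_000)
m_mc = norm.pdf(x, mus, np.sqrt(sigma2)).mean()

print(f"closed form m(x) = {m_closed:.4f}, Monte Carlo = {m_mc:.4f}")
```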

Non-parametric Bayesian (NB) Models. The finite mixture model:
X_i | Z_i = k ~ P_{θ_k}, P(Z_i = k) = w_k, θ_k iid ~ G_0, w ~ Dir(α/K, ..., α/K).
Alternatively,
X_i | θ_i ~ P_{θ_i}, θ_i | G ~ G, G(·) = Σ_{k=1}^K w_k δ_{θ_k}(·), w ~ Dir(α/K, ..., α/K), θ_k ~ G_0.
The prior on G is a K-element discrete distribution. In the NB approach, we'll drop this restriction.

A NB approach for clustering:
X_i | θ_i ~ P_{θ_i}, θ_i | G ~ G, G ~ DP(α, G_0),
where DP(α, G_0) denotes a Dirichlet process with a scale (precision) parameter α and a base measure G_0.

Dirichlet Process (DP). θ_i | G iid ~ G, G ~ DP(α, G_0).
Define the DP as a distribution over distributions (Ferguson, 1973).
Describe the DP as a stick-breaking process (Sethuraman, 1994).
If we integrate over G (with respect to the DP), the resulting prior on (θ_1, ..., θ_n),
π(θ_1, ..., θ_n) = ∫ Π_{i=1}^n G(θ_i) dπ(G),
is the Chinese restaurant process (CRP).

Dirichlet Process. G ~ DP(G_0, α). OK, but what does it look like? Samples from a DP are discrete with probability one:
G(θ) = Σ_{k=1}^∞ π_k δ_{θ_k}(θ),
where δ_{θ_k}(·) is a Dirac delta at θ_k, and θ_k ~ G_0(·). Note: E(G) = G_0. As α → ∞, G looks more like G_0.

Dirichlet Processes: Stick-Breaking Representation. G ~ DP(G_0, α). Samples G from a DP can be represented as follows:
G(θ) = Σ_{k=1}^∞ π_k δ_{θ_k}(θ),
where θ_k ~ G_0(·), Σ_{k=1}^∞ π_k = 1,
π_k = β_k Π_{j=1}^{k−1} (1 − β_j), and β_k ~ Beta(1, α) (Sethuraman, 1994).
[Figure: densities of Beta(1,1), Beta(1,10), and Beta(1,0.5) for the stick-breaking proportions β_k.]
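
A sketch of a (truncated) stick-breaking draw from a DP (my own code; G_0 = N(0, 1) and the truncation level are arbitrary choices):

```python
# Truncated stick-breaking draw G = sum_k pi_k * delta_{theta_k},
# with G_0 = N(0, 1); the truncation level is chosen for illustration.
import numpy as np

def stick_breaking_draw(alpha, n_atoms=1000, rng=None):
    rng = rng or np.random.default_rng()
    betas = rng.beta(1.0, alpha, size=n_atoms)          # beta_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1 - betas)[:-1]])
    weights = betas * remaining                          # pi_k = beta_k * prod_{j<k}(1 - beta_j)
    atoms = rng.normal(0.0, 1.0, size=n_atoms)           # theta_k ~ G_0 = N(0, 1)
    return weights, atoms

w, th = stick_breaking_draw(alpha=5.0, rng=np.random.default_rng(7))
print("total weight kept by truncation:", w.sum())       # close to 1 for large n_atoms
print("weight of the 5 largest atoms  :", np.sort(w)[-5:][::-1])
```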

Chinese Restaurant Process. [Figure: customers θ_i seated at tables, each table serving a dish φ_k drawn from G_0.]
Generating from a CRP:
customer 1 enters the restaurant and sits at table 1;
θ_1 = φ_1 where φ_1 ~ G_0; K = 1, n = 1, n_1 = 1.
for n = 2, ...,
  customer n sits at table k with prob n_k / (n − 1 + α), for k = 1, ..., K,
  or at table K+1 with prob α / (n − 1 + α) (new table);
  if a new table was chosen, then K ← K + 1, φ_K ~ G_0 endif;
  set θ_n to the φ_k of the table k that customer n sat at; set n_k ← n_k + 1.
endfor
The resulting conditional distribution over θ_n:
θ_n | θ_1, ..., θ_{n−1}, G_0, α ~ [ α / (n − 1 + α) ] G_0(·) + Σ_{k=1}^K [ n_k / (n − 1 + α) ] δ_{φ_k}(·).