Outline
- A short review of Bayesian analysis: Binomial, Multinomial, Normal, Beta, Dirichlet; posterior mean, MAP, credible interval, posterior distribution
- Gibbs sampling
- Revisit the Gaussian mixture model: EM for MAP, collapsed Gibbs sampling
- Chinese restaurant process, nonparametric clustering
Frequentist vs Bayesian
Statistical methods such as hypothesis testing (p-values), maximum likelihood estimates (MLE), and confidence intervals (CI) are known as frequentist methods. In the frequentist framework: probabilities refer to long-run frequencies and are objective quantities; parameters are fixed but unknown constants; statistical procedures should have well-defined long-run frequency properties (e.g., a 95% CI).
There is another approach to inference known as Bayesian inference. In the Bayesian framework: probabilities reflect (subjective) personal belief; unknown parameters are treated as random variables; we make inferences about a parameter by producing a probability distribution for it.
Bayesian Analysis
Bayesian inference is carried out in the following way:
1. Choose a statistical model $p(x \mid \theta)$, i.e., the likelihood, the same as in the frequentist approach;
2. Choose a prior distribution $\pi(\theta)$;
3. Calculate the posterior distribution $\pi(\theta \mid x)$:
\[
\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int p(x \mid \theta)\,\pi(\theta)\,d\theta}.
\]
Alternatively, we can write the posterior as
\[
\pi(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int p(x \mid \theta)\,\pi(\theta)\,d\theta} \propto p(x \mid \theta)\,\pi(\theta),
\]
where we drop the scale factor $\big[\int p(x \mid \theta)\,\pi(\theta)\,d\theta\big]^{-1}$ since it is a constant not depending on $\theta$. For example, if $\pi(\theta \mid x) \propto e^{-\theta^2 + 2\theta}$, then we know that $\theta \mid x \sim N(1, 1/2)$.
Now any inference on the parameter $\theta$ can be obtained from the posterior distribution $\pi(\theta \mid x)$. For example, if one wants
- a point estimate of $\theta$, we can report the mean $E(\theta \mid x)$, the median, or the mode of the posterior distribution $\arg\max_\theta \pi(\theta \mid x)$;
- an interval estimate of $\theta$, we can report the 95% credible interval, which is a region with 0.95 posterior probability.
So a 95% credible interval $(1.2, 3.5)$ means that $P(\theta \in (1.2, 3.5)) = 0.95$, where $P$ corresponds to the posterior distribution of $\theta$. For a 95% confidence interval $(1.2, 3.5)$, we CANNOT say that $(1.2, 3.5)$ covers the true $\theta$ with probability 95%.
The mode estimate is often referred to as the MAP (maximum a posteriori) estimate, which is the solution of
\[
\max_\theta \log \pi(\theta \mid x) = \max_\theta \big[ \log p(x \mid \theta) + \log \pi(\theta) \big].
\]
Note that MAP is also the solution of
\[
\min_\theta \big[ -\log p(x \mid \theta) - \log \pi(\theta) \big].
\]
So MAP is related to the regularization approach, with the penalty term being $(-\log)$ of the prior.
Your first resistance to Bayesian inference may be the priors. Where does one find priors? Priors, like the likelihood, are part of your assumptions: a prior is one's initial guess of the parameter; after observing the data, which carry information about the parameter, one updates his/her prior to the posterior. Priors matter and do not matter. Next I'll introduce some default prior choices. Of course, the sensitivity to prior choices, i.e., how different priors affect the final result, should always be examined in practice.
A Bernoulli Example
Suppose $X = (X_1, \ldots, X_n)$ denotes the outcomes from $n$ coin tosses, where 1 means a head and 0 means a tail. They are iid samples from a Bernoulli distribution with parameter $\theta$, where $\theta$ is the probability of getting a head. Without any information about the coin, we can put a uniform prior on $\theta$, that is, $\pi(\theta) = 1$, $0 \le \theta \le 1$.
Next we calculate the posterior distribution of $\theta$ given $X$:
\[
\pi(\theta \mid X) \propto \prod_{i=1}^n p(X_i \mid \theta) \propto \theta^s (1-\theta)^{n-s}, \qquad s = \sum_i X_i,
\]
which implies that $\theta \mid X \sim \text{Beta}(s+1,\ n-s+1)$. The corresponding posterior mean can be used as a point estimate for $\theta$,
\[
\hat\theta = \frac{s+1}{n+2}. \tag{1}
\]
\[
\text{Post-mean} = \frac{s+1}{n+2}, \qquad \text{MLE} = \frac{s}{n}.
\]
The MLE is the observed frequency of heads among the $n$ experiments; the Bayes estimator is the frequency of heads among $(n+2)$ experiments, in which there are two prior experiments, one a head and the other a tail. Without the data, one just looks at the prior experiments, and a reasonable guess for $\theta$ is $1/2$. After observing the data, the final estimate (1) is some number between $1/2$ and $s/n$, a compromise between the prior information and the MLE. Note that as $n$ gets large, the prior gets washed out.
The extra counts, one for head and one for tail, are often called pseudo-counts. Having pseudo-counts is appealing in cases where $\theta$ is likely to take extreme values close to 1 or 0. For example, to estimate $\theta$ for a rare event, it is likely to observe $X_i = 0$ for all $i = 1, \ldots, n$, but it may be dangerous to conclude $\hat\theta = 0$.
Beta distributions are often used as a prior on $\theta$; in fact, the uniform is a special case of Beta. Suppose the prior on $\theta$ is $\text{Beta}(\alpha, \beta)$; then
\[
\pi(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)}, \qquad
\theta \mid X \sim \text{Beta}(s+\alpha,\ n-s+\beta). \tag{2}
\]
We call the Beta distributions the conjugate family for Bernoulli models, since both the prior and the posterior distributions belong to the same family. The posterior mean of (2) is equal to $(\alpha+s)/(n+\alpha+\beta)$. So Beta priors can be viewed as having $(\alpha+\beta)$ prior experiments, in which we have $\alpha$ heads and $\beta$ tails.
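The conjugate update above can be sketched in a few lines of code. This is my own illustration (the function name is mine, not from the slides): prior $\text{Beta}(\alpha,\beta)$ plus $s$ heads in $n$ tosses gives posterior $\text{Beta}(s+\alpha,\ n-s+\beta)$.

```python
# Beta-Bernoulli conjugate update: prior Beta(a, b), s heads out of n tosses
# -> posterior Beta(s + a, n - s + b), with posterior mean (a + s) / (n + a + b).

def beta_bernoulli_update(a, b, s, n):
    """Return the posterior Beta parameters and the posterior mean."""
    a_post = s + a          # prior "head" pseudo-counts plus observed heads
    b_post = n - s + b      # prior "tail" pseudo-counts plus observed tails
    post_mean = a_post / (a_post + b_post)
    return a_post, b_post, post_mean

# Uniform prior Beta(1, 1): 7 heads in 10 tosses gives posterior Beta(8, 4);
# the posterior mean 8/12 sits between the prior guess 1/2 and the MLE 7/10.
a_post, b_post, mean = beta_bernoulli_update(1, 1, s=7, n=10)
```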
A Multinomial Example
Suppose you randomly draw a card from an ordinary deck of playing cards, and then put it back in the deck. Repeat this exercise five times (i.e., sampling with replacement). Let $(N_1, N_2, N_3, N_4)$ denote the numbers of spades, hearts, diamonds, and clubs among the five cards. We say $(N_1, N_2, N_3, N_4)$ follows a multinomial distribution, with probability mass function
\[
P(N_1 = n_1, \ldots, N_4 = n_4 \mid n, \theta_1, \ldots, \theta_4)
= \frac{n!}{n_1!\, n_2!\, n_3!\, n_4!}\ \theta_1^{n_1} \cdots \theta_4^{n_4},
\]
where $n_1 + \cdots + n_4 = n = 5$ and $\theta_1 = \cdots = \theta_4 = 1/4$.
In the Bernoulli example, we conduct $n$ independent trials and each trial results in one of two possible outcomes, e.g., head or tail, with probabilities $\theta$ and $(1-\theta)$, respectively. In the multinomial example, we conduct $n$ independent trials and each trial results in one of $k$ possible outcomes, e.g., $k = 4$ in the aforementioned card example, with probabilities $\theta_1, \ldots, \theta_k$, respectively.
For multinomial distributions, the parameter of interest is $\theta = (\theta_1, \ldots, \theta_k)$, which lies in a simplex of $\mathbb{R}^k$,
\[
S = \Big\{\theta = (\theta_1, \ldots, \theta_k):\ \sum_i \theta_i = 1,\ \theta_i \ge 0\Big\}.
\]
A Dirichlet distribution on $S$, $\text{Dir}(\alpha_1, \ldots, \alpha_k)$, is an extension of the Beta distribution, with density function
\[
p(\theta) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)} \prod_{i=1}^k \theta_i^{\alpha_i - 1}.
\]
The Dirichlet distributions are conjugate priors for multinomial models.
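The Dirichlet-multinomial update mirrors the Beta-Bernoulli one: add the observed counts to the prior parameters. A minimal sketch (function and variable names are mine), applied to the card example:

```python
# Dirichlet-multinomial conjugate update: prior Dir(a_1,...,a_k) plus counts
# (n_1,...,n_k) gives posterior Dir(a_1 + n_1, ..., a_k + n_k).

def dirichlet_update(alphas, counts):
    post = [a + n for a, n in zip(alphas, counts)]
    total = sum(post)
    post_mean = [p / total for p in post]   # E(theta_i | data)
    return post, post_mean

# Card example: symmetric prior Dir(1,1,1,1); five draws with suit counts (2,1,1,1).
post, mean = dirichlet_update([1, 1, 1, 1], [2, 1, 1, 1])
# posterior is Dir(3, 2, 2, 2), so the posterior mean for spades is 3/9
```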
Dirichlet Distributions
[Figure: examples of Dirichlet densities over $p = (p_1, p_2, p_3)$, which can be plotted in 2D since $p_3 = 1 - p_1 - p_2$.]
A Normal Example
Assume $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. The parameters here are $(\mu, \sigma^2)$ and we would like to get the posterior distribution $\pi(\mu, \sigma^2 \mid X)$. If using a Gibbs sampler, which will be introduced later, we only need to know the conditional posterior distributions $\pi(\mu \mid \sigma^2, X)$ and $\pi(\sigma^2 \mid \mu, X)$.
For the location parameter $\mu$, the conjugate prior is normal. If $\bar X \mid \mu, \sigma^2 \sim N(\mu, \sigma^2/n)$ and $\mu \sim N(\mu_0, \tau_0^2)$, then $\mu \mid \sigma^2, X \sim N(\mu_n, \tau^2)$, where
\[
\mu_n = w \bar X + (1-w)\mu_0, \qquad
w = \frac{1/(\sigma^2/n)}{1/(\sigma^2/n) + 1/\tau_0^2}, \qquad
\tau^2 = \Big(\frac{1}{\sigma^2/n} + \frac{1}{\tau_0^2}\Big)^{-1}.
\]
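The precision-weighted average above is easy to verify numerically. A small sketch with hypothetical numbers of my own choosing ($\sigma^2 = 4$, $n = 4$, so $\sigma^2/n = 1$; prior $N(0, 1)$; sample mean $2$):

```python
# Conjugate posterior for a normal mean: precisions add, and the posterior mean
# is a precision-weighted average of the sample mean and the prior mean.

def normal_mean_posterior(x_bar, sigma2, n, mu0, tau0_sq):
    prec_data = n / sigma2           # precision of X_bar given mu
    prec_prior = 1.0 / tau0_sq       # prior precision
    w = prec_data / (prec_data + prec_prior)
    mu_n = w * x_bar + (1 - w) * mu0
    tau_sq = 1.0 / (prec_data + prec_prior)
    return mu_n, tau_sq

# sigma2/n = 1 and tau0_sq = 1 give w = 0.5: the posterior mean 1.0 sits
# exactly halfway between the prior mean 0 and the sample mean 2.
mu_n, tau_sq = normal_mean_posterior(x_bar=2.0, sigma2=4.0, n=4, mu0=0.0, tau0_sq=1.0)
```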
For the scale parameter $\sigma^2$, the conjugate prior is Inverse Gamma, that is, the prior on $1/\sigma^2$ is Gamma. Suppose $\pi(\sigma^2) = \text{InvGa}(\alpha, \beta)$; then
\[
\pi(\sigma^2 \mid \mu, X) \propto \Big(\frac{1}{\sigma^2}\Big)^{n/2} \exp\Big\{-\frac{\sum_i (X_i - \mu)^2}{2\sigma^2}\Big\} \cdot \Big(\frac{1}{\sigma^2}\Big)^{\alpha+1} e^{-\beta/\sigma^2}
\;\sim\; \text{InvGa}\Big(\alpha + \frac{n}{2},\ \beta + \frac{\sum_i (X_i - \mu)^2}{2}\Big).
\]
In practice, we often specify $\pi(\sigma^2)$ using scaled inverse chi-square distributions, which are special cases of InvGa:
\[
\text{Inv-}\chi^2(\nu_0, s_0^2) = \text{InvGa}\Big(\frac{\nu_0}{2},\ \frac{\nu_0 s_0^2}{2}\Big).
\]
With prior $\text{Inv-}\chi^2(\nu_0, s_0^2)$, the posterior distribution is also $\text{Inv-}\chi^2(\nu_n, s_n^2)$, where
\[
\nu_n = \nu_0 + n, \qquad \nu_n s_n^2 = \nu_0 s_0^2 + \sum_i (X_i - \mu)^2.
\]
That is, $\nu_0$ pseudo samples, each contributing $s_0^2$ to the RSS.
Gibbs Sampling for Posterior Inference
Suppose the random variables $X$ and $Y$ have a joint probability density function $p(x, y)$. Sometimes it is not easy to simulate directly from the joint distribution. Instead, suppose it is possible to simulate from the individual conditional distributions $p_{X \mid Y}(x \mid y)$ and $p_{Y \mid X}(y \mid x)$.
Then a Gibbs sampler draws $(X_1, Y_1), \ldots, (X_T, Y_T)$ as follows:
1. Initialization: let $(X_0, Y_0)$ be some starting values; set $n = 0$.
2. Draw $X_{n+1} \sim p_{X \mid Y}(x \mid Y_n)$.
3. Draw $Y_{n+1} \sim p_{Y \mid X}(y \mid X_{n+1})$.
4. Go to step 2 and repeat.
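The steps above can be sketched on a toy target of my own choosing (not from the slides): a bivariate normal with correlation $\rho$, for which both conditionals are normal, $X \mid Y \sim N(\rho Y,\ 1-\rho^2)$ and symmetrically for $Y$.

```python
import random

# Toy Gibbs sampler for a standard bivariate normal with correlation rho.
# Steps 1-4 of the slide: initialize, then alternate the two conditional draws.

def gibbs_bivariate_normal(rho, n_iter, burn_in, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0                      # step 1: starting values
    sd = (1 - rho ** 2) ** 0.5           # conditional standard deviation
    samples = []
    for t in range(n_iter):
        x = rng.gauss(rho * y, sd)       # step 2: draw X | Y
        y = rng.gauss(rho * x, sd)       # step 3: draw Y | X
        if t >= burn_in:                 # keep only post-burn-in draws
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.5, n_iter=6000, burn_in=1000)
mean_x = sum(x for x, _ in samples) / len(samples)
```

The marginal of $X$ is $N(0, 1)$, so `mean_x` should be close to 0 after the burn-in draws are discarded.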
Gibbs samplers are MCMC algorithms, and they produce samples from the desired distribution after a so-called burn-in period. So in practice, we always drop the samples from the initial steps (say, the first 1000 or 5000 steps) and start saving samples after that.
In Bayesian analysis, suppose we have multiple parameters $\theta = (\theta_1, \ldots, \theta_K)$. Then we can draw the posterior samples of $\theta$ using a multi-stage Gibbs sampler: at each stage, we draw $\theta_i$ from the conditional distribution $\pi(\theta_i \mid \theta_{[-i]}, \text{Data})$, where $\theta_{[-i]}$ denotes the $(K-1)$ parameters except $\theta_i$. The reason for Gibbs samplers is that in many cases the joint posterior $\pi(\theta \mid \text{Data})$ is not of closed form, while all those conditionals are.
A Gaussian Mixture Model
Suppose the data $x_1, x_2, \ldots, x_n$ are iid from
\[
w \cdot N(\mu_1, \sigma_1^2) + (1-w) \cdot N(\mu_2, \sigma_2^2).
\]
For each $x_i$, we introduce a latent variable $Z_i$ indicating which component $x_i$ is generated from, with $P(Z_i = 1) = w$ and $P(Z_i = 2) = 1 - w$. The parameters of interest are $\theta = (w, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$, and their prior distributions are specified as follows:
\[
w \sim \text{Beta}(1, 1), \qquad \mu_1, \mu_2 \sim N(0, \tau^2), \qquad \sigma_1^2, \sigma_2^2 \sim \text{InvGa}(\alpha, \beta).
\]
EM for MAP
If we want to obtain a MAP estimate of the parameters, we can use the EM algorithm. Recall that
\[
\hat\theta = \arg\max_\theta P(x \mid \theta)\,\pi(\theta) = \arg\max_\theta \Big[\sum_z P(x, z \mid \theta)\Big]\pi(\theta),
\]
and the complete-data log-posterior is
\[
\log p(x, Z \mid \theta)\,\pi(\theta)
= \sum_i 1(Z_i = 1)\big[\log \phi_{\mu_1, \sigma_1}(x_i) + \log w\big]
+ \sum_i 1(Z_i = 2)\big[\log \phi_{\mu_2, \sigma_2}(x_i) + \log(1-w)\big]
+ \log \pi(w) + \log \pi(\mu_1) + \log \pi(\mu_2) + \log \pi(\sigma_1^2) + \log \pi(\sigma_2^2).
\]
Recall the EM algorithm for MLE: at the E-step, we replace $1(Z_i = 1)$ and $1(Z_i = 2)$ by their expectations, i.e., the probability of $Z_i = 1$ or $2$ conditioning on the data $x$ and the current estimate of the parameter $\theta^0$:
\[
\gamma_i = P(Z_i = 1 \mid x_i, \theta^0) = \frac{w\, \phi_{\mu_1, \sigma_1}(x_i)}{w\, \phi_{\mu_1, \sigma_1}(x_i) + (1-w)\, \phi_{\mu_2, \sigma_2}(x_i)};
\]
at the M-step, we update $\theta$; and we iterate between the E and M steps until convergence.
For MAP, the E-step is the same; the M-step is slightly different. Without the Beta prior, we would update $w$ by $\sum_i \gamma_i / n$; with the Beta prior on $w$, we need to add the pseudo-counts. Similarly, without the prior, we would update $\mu_1$ by
\[
\frac{\sum_i \gamma_i x_i}{\sum_i \gamma_i},
\]
but with the prior, we would update $\mu_1$ by a weighted average of the value above and the prior mean for $\mu_1$.
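The E- and M-steps above can be sketched in code. This is a deliberate simplification of mine, not the slides' full model: I fix both variances at 1 and use a $\text{Beta}(a, a)$ prior on $w$ and $N(0, \tau^2)$ priors on the means, so the MAP updates stay in closed form.

```python
import math
import random

def phi(x, mu):
    """Standard-deviation-1 normal density at x with mean mu."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def em_map(xs, a=2.0, tau2=100.0, n_iter=50):
    """EM for MAP in a two-component mixture with unit variances (a sketch)."""
    w, mu1, mu2 = 0.5, min(xs), max(xs)
    for _ in range(n_iter):
        # E-step: responsibilities gamma_i = P(Z_i = 1 | x_i, current params)
        g = [w * phi(x, mu1) / (w * phi(x, mu1) + (1 - w) * phi(x, mu2)) for x in xs]
        n1 = sum(g)
        # M-step with priors: Beta(a, a) adds (a - 1) pseudo-counts to each side,
        # and the prior precision 1/tau2 shrinks each mean toward the prior mean 0.
        w = (n1 + a - 1) / (len(xs) + 2 * a - 2)
        mu1 = sum(gi * x for gi, x in zip(g, xs)) / (n1 + 1.0 / tau2)
        mu2 = sum((1 - gi) * x for gi, x in zip(g, xs)) / (len(xs) - n1 + 1.0 / tau2)
    return w, mu1, mu2

# Synthetic data: equal-sized components around -2 and +2.
rng = random.Random(1)
xs = [rng.gauss(-2, 1) for _ in range(100)] + [rng.gauss(2, 1) for _ in range(100)]
w, mu1, mu2 = em_map(xs)
```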
Gibbs Sampling from the Posterior Distribution
\[
\pi(\theta, Z \mid x) \propto \prod_{i=1}^n p(z_i \mid \theta)\, p(x_i \mid Z_i, \theta) \cdot \pi(w)\,\pi(\mu_1)\,\pi(\mu_2)\,\pi(\sigma_1^2)\,\pi(\sigma_2^2).
\]
In a Gibbs sampler, we iteratively sample each element of $(w, \mu_1, \mu_2, \sigma_1^2, \sigma_2^2, Z_1, \ldots, Z_n)$, conditioning on the other elements being fixed.
For example, how do we sample $Z_i$? From a Bernoulli:
\[
P(Z_i = 1 \mid x, Z_{[-i]}, \text{others}) \propto P(Z_i = 1 \mid \theta)\, P(x_i \mid Z_i = 1, \theta) = w\, p(x_i \mid \mu_1, \sigma_1^2).
\]
How do we sample $\mu_1$? From a normal:
\[
\pi(\mu_1 \mid x, Z, \text{others}) \propto \Big[\prod_{i: Z_i = 1} P(x_i \mid \mu_1, \sigma_1^2)\Big]\, \pi(\mu_1).
\]
For a general Gaussian mixture model with $K$ components, we put the prior on the mixing weights
\[
w = (w_1, \ldots, w_K) \sim \text{Dir}\Big(\frac{\alpha}{K}, \ldots, \frac{\alpha}{K}\Big),
\]
and again normal priors on the $\mu_k$'s and InvGa priors on the $\sigma_k^2$'s. The Gibbs sampler iterates the following steps:
1. Draw $Z_i$ from a multinomial distribution, for $i = 1, \ldots, n$;
2. Draw $w$ from a Dirichlet;
3. Draw $\mu_1, \ldots, \mu_K$ from normals;
4. Draw $\sigma_1^2, \ldots, \sigma_K^2$ from InvGa.
Collapsed Gibbs Sampling
The mixing weight vector $w$ is used in sampling the $Z_i$'s:
\[
P(Z_i = k \mid x, z_{[-i]}, \theta) \propto P(x_i \mid Z_i = k, \theta)\, P(Z_i = k \mid z_{[-i]}, \theta),
\]
where the 2nd term is equal to
\[
P(Z_i = k \mid z_{[-i]}, w) = P(Z_i = k \mid w) = w_k,
\]
i.e., when conditioning on $w$, the $Z_i$'s are independent.
Let's eliminate $w$ from the parameter list, i.e., integrate it out. So we need to compute
\[
P(Z_i = k \mid z_{[-i]}) = \frac{P(z_1, \ldots, z_{i-1}, Z_i = k, z_{i+1}, \ldots, z_n)}{P(z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_n)}.
\]
Note that the $Z_i$'s are exchangeable (its meaning will be made clear in class). So it suffices to compute the sampling distribution for the last observation.
\[
P(Z_n = k \mid z_1, \ldots, z_{n-1})
= \frac{P(Z_n = k,\, z_1, \ldots, z_{n-1})}{P(z_1, \ldots, z_{n-1})}
= \frac{\int w_1^{n_1 + \alpha/K - 1} \cdots w_k^{n_k + 1 + \alpha/K - 1} \cdots w_K^{n_K + \alpha/K - 1}\, dw}{\int w_1^{n_1 + \alpha/K - 1} \cdots w_k^{n_k + \alpha/K - 1} \cdots w_K^{n_K + \alpha/K - 1}\, dw}
= \frac{\Gamma(n - 1 + \alpha)}{\Gamma(n + \alpha)} \cdot \frac{\Gamma(n_k + \alpha/K + 1)}{\Gamma(n_k + \alpha/K)}
= \frac{n_k + \alpha/K}{n - 1 + \alpha},
\]
where $n_k = \#\{j: z_j = k,\ j = 1, \ldots, n-1\}$.
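The Gamma-ratio step above can be checked numerically: since $\Gamma(x+1)/\Gamma(x) = x$, the ratio of Dirichlet normalizing constants collapses to the simple fraction. A quick sketch (parameter values are arbitrary choices of mine):

```python
import math

# Check that Gamma(n-1+alpha)/Gamma(n+alpha) * Gamma(n_k+alpha/K+1)/Gamma(n_k+alpha/K)
# equals the closed-form predictive (n_k + alpha/K) / (n - 1 + alpha).

def predictive_via_gammas(n, n_k, alpha, K):
    a = alpha / K
    return (math.gamma(n - 1 + alpha) / math.gamma(n + alpha)
            * math.gamma(n_k + a + 1) / math.gamma(n_k + a))

def predictive_closed_form(n, n_k, alpha, K):
    return (n_k + alpha / K) / (n - 1 + alpha)

p1 = predictive_via_gammas(n=10, n_k=4, alpha=1.0, K=3)
p2 = predictive_closed_form(n=10, n_k=4, alpha=1.0, K=3)
```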
The collapsed Gibbs sampler iterates the following steps:
1. Draw $Z_i$ from a $\text{Multinomial}(\gamma_{i1}, \ldots, \gamma_{iK})$, for $i = 1, \ldots, n$, where
\[
\gamma_{ik} \propto \frac{n_k^{(-i)} + \alpha/K}{n - 1 + \alpha}\, P(x_i \mid \mu_k, \sigma_k^2);
\]
2. No need to sample $w$ from a Dirichlet;
3. Draw $\mu_1, \ldots, \mu_K$ from normals;
4. Draw $\sigma_1^2, \ldots, \sigma_K^2$ from InvGa.
Infinitely Many Clusters
What if $K$ is a large value ($> n$)? At the $t$-th iteration of the algorithm, there must be some empty clusters. Let's re-label the clusters: $1$ to $K^*$ are the non-empty clusters (of course, the value of $K^*$ changes from iteration to iteration), and the remaining are empty ones.
- Update $\{\mu_k, \sigma_k^2\}_{k=1}^{K^*}$ from the posterior (Normal/InvGa).
- Update $\{\mu_k, \sigma_k^2\}_{k=K^*+1}^{K}$ from the prior (Normal/InvGa).
- Update $Z_i$ from a $\text{Multinomial}(\gamma_{i1}, \ldots, \gamma_{iK})$, where
\[
\gamma_{ik} \propto \frac{n_k^{(-i)} + \alpha/K}{n - 1 + \alpha}\, P(x_i \mid \mu_k, \sigma_k^2).
\]
$Z_i$ may start a new cluster, i.e., $Z_i > K^*$. It doesn't matter which value $Z_i$ takes from $(K^*+1, \ldots, K)$, since we can always label this new cluster as the $(K^*+1)$th cluster, and then immediately update $(\mu_{K^*+1}, \sigma_{K^*+1}^2)$ based on the corresponding posterior (conditioning on $x_i$).
What's the chance that $Z_i$ starts a new cluster?
\[
\sum_{k=K^*+1}^{K} \gamma_{ik}
= \frac{\alpha/K}{n - 1 + \alpha} \sum_{k=K^*+1}^{K} P(x_i \mid \mu_k, \sigma_k^2)
= \frac{\alpha}{n - 1 + \alpha} \cdot \frac{K - K^*}{K} \Big[\frac{1}{K - K^*} \sum_{k=K^*+1}^{K} P(x_i \mid \mu_k, \sigma_k^2)\Big]
\longrightarrow \frac{\alpha}{n - 1 + \alpha} \iint P(x_i \mid \mu, \sigma^2)\, \pi(\mu, \sigma^2)\, d\mu\, d\sigma^2,
\]
as $K \to \infty$, where
\[
m(x_i) = \iint P(x_i \mid \mu, \sigma^2)\, \pi(\mu, \sigma^2)\, d\mu\, d\sigma^2
\]
is the integrated (wrt our prior $\pi$) likelihood for a sample $x_i$.
Clustering with the Chinese Restaurant Process (CRP)
A mixture model:
\[
X_i \mid Z_i = k \sim P(\cdot \mid \mu_k, \sigma_k^2), \quad i = 1, \ldots, n.
\]
Prior on $Z_1, \ldots, Z_n$ and $(\mu_k, \sigma_k^2)_{k=1}^K$ (known as the Chinese restaurant process):
- $Z_1 = 1$ and $(\mu_1, \sigma_1^2) \sim \pi(\cdot)$;
- for $i \ge 1$, suppose $Z_1, \ldots, Z_i$ form $m$ clusters with sizes $n_1, \ldots, n_m$; then
\[
P(Z_{i+1} = k \mid Z_1, \ldots, Z_i) = \frac{n_k}{i + \alpha}, \quad k = 1, \ldots, m; \qquad
P(Z_{i+1} = m+1 \mid Z_1, \ldots, Z_i) = \frac{\alpha}{i + \alpha},
\]
and in the latter case $(\mu_{m+1}, \sigma_{m+1}^2) \sim \pi(\cdot)$.
Alternatively, you can describe the prior first and then the likelihood (which gives you a clear idea of how the data are generated):
- Set $Z_1 = 1$, generate $(\mu_1, \sigma_1^2) \sim \pi(\cdot)$ and $X_1 \sim P(\cdot \mid \mu_1, \sigma_1^2)$.
- Loop over $i = 1, \ldots, n-1$: suppose the previous $i$ samples form $m$ clusters with cluster-specific parameters $(\mu_k, \sigma_k^2)_{k=1}^m$; then
\[
P(Z_{i+1} = k \mid Z_1, \ldots, Z_i) = \frac{n_k}{i + \alpha}, \quad k = 1, \ldots, m; \qquad
P(Z_{i+1} = m+1 \mid Z_1, \ldots, Z_i) = \frac{\alpha}{i + \alpha}.
\]
If $Z_{i+1} = m+1$, generate $(\mu_{m+1}, \sigma_{m+1}^2) \sim \pi(\cdot)$. Then generate $X_{i+1} \sim P(\cdot \mid \mu_k, \sigma_k^2)$ with $k = Z_{i+1}$.
Advantages
- We do not need to specify $K$. $K$ is treated as a random variable, and its (posterior) distribution is learned from the data.
- Can model unseen data: for any new sample $x$, there is always a positive chance that it can start a new cluster.
Exchangeability of the $Z_i$'s
In the CRP, the labels $Z_i$ are generated sequentially, but in fact they are exchangeable (up to a permutation of the cluster labels; labels should start from $1, 2, \ldots$):
\[
P(11122) = P(11212) = P(11221) = \frac{\alpha^2\, (2!)(1!)}{\alpha(1+\alpha)(2+\alpha)(3+\alpha)(4+\alpha)}.
\]
In general, suppose $z_1, \ldots, z_n$ form $m$ clusters with sizes $n_1, \ldots, n_m$; then
\[
P(z_1, \ldots, z_n) = \frac{\alpha^m\, (n_1 - 1)! \cdots (n_m - 1)!}{\prod_{i=1}^n (i - 1 + \alpha)}.
\]
So the order of the $z_i$'s doesn't matter; what matters is the partition of the $n$ samples implied by the $z_i$'s.
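Exchangeability can be verified numerically: multiply the sequential CRP conditionals for a given label sequence and compare with the closed form $P(z_1,\ldots,z_n) = \alpha^m \prod_k (n_k - 1)! \,/\, \prod_{i=1}^n (i - 1 + \alpha)$. The helper names below are mine:

```python
from math import factorial

def crp_prob_sequential(z, alpha):
    """Product of the sequential CRP conditionals for the label sequence z."""
    p, counts = 1.0, {}
    for i, zi in enumerate(z):          # i = 0 is Z_1, which has probability 1
        if i > 0:
            denom = i + alpha           # i customers already seated
            p *= counts[zi] / denom if zi in counts else alpha / denom
        counts[zi] = counts.get(zi, 0) + 1
    return p

def crp_prob_closed(z, alpha):
    """Closed-form exchangeable probability of the partition implied by z."""
    counts = {}
    for zi in z:
        counts[zi] = counts.get(zi, 0) + 1
    num = alpha ** len(counts)
    for nk in counts.values():
        num *= factorial(nk - 1)
    denom = 1.0
    for i in range(1, len(z) + 1):
        denom *= i - 1 + alpha
    return num / denom

alpha = 1.5
p_a = crp_prob_sequential([1, 1, 1, 2, 2], alpha)   # "11122"
p_b = crp_prob_sequential([1, 1, 2, 1, 2], alpha)   # "11212", same partition sizes
p_c = crp_prob_closed([1, 1, 1, 2, 2], alpha)
```

All three numbers agree: the order of the labels does not matter, only the cluster sizes.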
Posterior Sampling for Clustering with CRP
Same as the Gibbs sampler we derived for $K \to \infty$. At the $t$-th iteration, repeat the following:
- Suppose $Z_{[-i]}$ form $K^*$ clusters, labeled from $1$ to $K^*$, of sizes $n_k^{(-i)}$. Sample $Z_i$ from a multinomial with
\[
P(Z_i = k) \propto \frac{n_k^{(-i)}}{n - 1 + \alpha}\, P(x_i \mid \mu_k, \sigma_k^2), \quad k = 1, \ldots, K^*; \qquad
P(Z_i = K^* + 1) \propto \frac{\alpha}{n - 1 + \alpha}\, m(x_i),
\]
where $m(x_i) = \iint P(x_i \mid \mu, \sigma^2)\, \pi(\mu, \sigma^2)\, d\mu\, d\sigma^2$.
- Update $\{\mu_k, \sigma_k^2\}$ from the posterior (Normal/InvGa).
- The exchangeability of the $Z_i$'s plays an important role in the algorithm. Where do we use this property?
- The marginal likelihood $m(\cdot)$ is easy to compute if the prior $\pi(\mu, \sigma^2)$ is conjugate; otherwise, we need to figure out a way to compute $m(\cdot)$.
- For other MCMC algorithms, check Neal (2000); for Variational Bayes, check Blei and Jordan (2004).
- The ugly side: the labeling issue.
Non-parametric Bayesian (NB) Models
The finite mixture model:
\[
X_i \mid Z_i = k \sim P(\cdot \mid \theta_k), \qquad P(Z_i = k) = w_k, \qquad
\theta_k \overset{iid}{\sim} G_0, \qquad w \sim \text{Dir}\Big(\frac{\alpha}{K}, \ldots, \frac{\alpha}{K}\Big).
\]
Alternatively,
\[
X_i \mid \theta_i \sim P(\cdot \mid \theta_i), \qquad \theta_i \mid G \sim G, \qquad
G(\cdot) = \sum_{k=1}^K w_k\, \delta_{\theta_k}(\cdot), \qquad
w \sim \text{Dir}\Big(\frac{\alpha}{K}, \ldots, \frac{\alpha}{K}\Big), \qquad \theta_k \sim G_0.
\]
The prior on $G$ is a $K$-element discrete distribution. In the NB approach, we'll drop this restriction.
A NB approach for clustering:
\[
X_i \mid \theta_i \sim P(\cdot \mid \theta_i), \qquad \theta_i \mid G \sim G, \qquad G \sim \text{DP}(\alpha, G_0),
\]
where $\text{DP}(\alpha, G_0)$ denotes a Dirichlet process with scale (precision) parameter $\alpha$ and base measure $G_0$.
Dirichlet Process (DP)
\[
\theta_i \mid G \overset{iid}{\sim} G, \qquad G \sim \text{DP}(\alpha, G_0).
\]
- Define the DP as a distribution over distributions (Ferguson, 1973).
- Describe the DP as a stick-breaking process (Sethuraman, 1994).
- If we integrate over $G$ (wrt the DP), the resulting prior on $(\theta_1, \ldots, \theta_n)$,
\[
\pi(\theta_1, \ldots, \theta_n) = \int \prod_{i=1}^n G(\theta_i)\, d\pi(G),
\]
is the Chinese restaurant process (CRP).
Dirichlet Process
$G \sim \text{DP}(\cdot \mid G_0, \alpha)$. OK, but what does it look like? Samples from a DP are discrete with probability one:
\[
G(\cdot) = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}(\cdot),
\]
where $\delta_{\theta_k}(\cdot)$ is a Dirac delta at $\theta_k$, and $\theta_k \sim G_0(\cdot)$. Note: $E(G) = G_0$. As $\alpha \to \infty$, $G$ looks more like $G_0$.
Dirichlet Processes: Stick-Breaking Representation
$G \sim \text{DP}(\cdot \mid G_0, \alpha)$. Samples $G$ from a DP can be represented as follows:
\[
G(\cdot) = \sum_{k=1}^{\infty} \pi_k\, \delta_{\theta_k}(\cdot),
\]
where $\theta_k \sim G_0(\cdot)$, $\sum_{k=1}^\infty \pi_k = 1$, and
\[
\pi_k = \beta_k \prod_{j=1}^{k-1} (1 - \beta_j), \qquad \beta_k \sim \text{Beta}(1, \alpha)
\]
(Sethuraman, 1994).
[Figure: densities of Beta(1,1), Beta(1,10), and Beta(1,0.5) for the stick-breaking proportions $\beta_k$.]
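The stick-breaking construction is easy to simulate: break off a $\text{Beta}(1,\alpha)$ fraction of the remaining stick at each step. A small sketch of mine using the standard library:

```python
import random

# Stick-breaking: beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{j<k} (1 - beta_j).
# After many breaks the remaining stick is negligible, so the pi_k sum to ~1.

def stick_breaking(alpha, n_sticks, seed=0):
    rng = random.Random(seed)
    pis, remaining = [], 1.0
    for _ in range(n_sticks):
        b = rng.betavariate(1, alpha)   # stick proportion beta_k ~ Beta(1, alpha)
        pis.append(remaining * b)       # break off a fraction b of what is left
        remaining *= 1 - b
    return pis

pis = stick_breaking(alpha=1.0, n_sticks=200)
```

With $\alpha = 1$ each $\beta_k$ is uniform, so after 200 breaks the unassigned mass is astronomically small and the weights sum to 1 up to floating-point error. Larger $\alpha$ spreads the mass over more atoms.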
Chinese Restaurant Process
[Figure: customers $\theta_1, \theta_2, \ldots$ seated at tables serving dishes $\phi_1, \phi_2, \ldots$]
Generating from a CRP:
- Customer 1 enters the restaurant and sits at table 1: $\theta_1 = \phi_1$ where $\phi_1 \sim G_0$; $K = 1$, $n = 1$, $n_1 = 1$.
- For $n = 2, \ldots$: customer $n$ sits at table
  - $k$ with probability $\dfrac{n_k}{n - 1 + \alpha}$, for $k = 1, \ldots, K$;
  - $K+1$ with probability $\dfrac{\alpha}{n - 1 + \alpha}$ (new table).
  If a new table was chosen, then $K \leftarrow K + 1$ and $\phi_{K+1} \sim G_0$. Set $\theta_n$ to the $\phi_k$ of the table $k$ that customer $n$ sat at; set $n_k \leftarrow n_k + 1$.
The resulting conditional distribution over $\theta_n$:
\[
\theta_n \mid \theta_1, \ldots, \theta_{n-1}, G_0, \alpha \;\sim\; \frac{\alpha}{n - 1 + \alpha}\, G_0(\cdot) + \sum_{k=1}^{K} \frac{n_k}{n - 1 + \alpha}\, \delta_{\phi_k}(\cdot).
\]
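The seating process above can be simulated directly. A sketch of my own (table-assignment bookkeeping only, without the dish draws $\phi_k \sim G_0$):

```python
import random

# Simulate CRP seating: customer n joins table k with prob n_k / (n - 1 + alpha),
# or opens a new table with prob alpha / (n - 1 + alpha).

def simulate_crp(n_customers, alpha, seed=42):
    rng = random.Random(seed)
    table_sizes = [1]                        # customer 1 sits at table 1
    assignments = [0]
    for n in range(2, n_customers + 1):
        denom = n - 1 + alpha
        r = rng.random() * denom             # point in [0, n - 1 + alpha)
        acc, choice = 0.0, len(table_sizes)  # default: new table
        for k, size in enumerate(table_sizes):
            acc += size                      # existing tables cover [0, n - 1)
            if r < acc:
                choice = k
                break
        if choice == len(table_sizes):
            table_sizes.append(1)            # open a new table
        else:
            table_sizes[choice] += 1
        assignments.append(choice)
    return table_sizes, assignments

sizes, z = simulate_crp(100, alpha=2.0)
```

The number of occupied tables grows only logarithmically in the number of customers (roughly $\alpha \log n$), which is why the CRP yields a small number of large clusters plus a tail of small ones.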