Stat 535C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Email: arnaud@cs.ubc.ca
Suggested projects: www.cs.ubc.ca/~arnaud/projects.html. First assignment on the web this afternoon: capture/recapture. Additional articles have been posted.
Outline: Bayesian model selection; the Bayesian linear model and variable selection; extensions. Overview of the Lecture Page 3
3.1 Summary of Last Lecture
One wants to compare two hypotheses, H_0: θ ~ π_0 versus H_1: θ ~ π_1; the prior is then
  π(θ) = π(H_0) π_0(θ) + π(H_1) π_1(θ), where π(H_0) + π(H_1) = 1.
In a coin example one can have, e.g., π_0(θ) = U[0, 1/2] and π_1(θ) = U[1/2, 1]; or π_0(θ) = δ_{θ_0}(θ) and π_1(θ) = U[0, 1]; or π_0(θ) = Be(α_0, β_0) and π_1(θ) = Be(α_1, β_1).
To compare H_0 versus H_1, we typically compute the Bayes factor, which partially eliminates the influence of the prior modelling (i.e. of π(H_i)):
  B_10^π(x) = π(x|H_1) / π(x|H_0) = ∫ f(x|θ) π_1(θ) dθ / ∫ f(x|θ) π_0(θ) dθ.
Bayesian Model Selection Page 4
3.1 Summary of Last Lecture
One can also compute the posterior probabilities of H_0 and H_1:
  π(H_0|x) = π(x|H_0) π(H_0) / π(x) = π(x|H_0) π(H_0) / [π(x|H_0) π(H_0) + π(x|H_1) π(H_1)].
The posterior probabilities satisfy
  π(H_1|x) / π(H_0|x) = [π(x|H_1) / π(x|H_0)] · [π(H_1) / π(H_0)] = B_10^π · π(H_1) / π(H_0).
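As a quick numerical illustration (not from the slides), the posterior-odds identity above can be checked in code; the Bayes factor value and the prior probability below are arbitrary:

```python
import numpy as np

def posterior_probs(bayes_factor_10, prior_h1=0.5):
    """Turn a Bayes factor B_10 and prior model probabilities into
    posterior probabilities of H0 and H1 via the posterior odds."""
    prior_h0 = 1.0 - prior_h1
    odds_10 = bayes_factor_10 * prior_h1 / prior_h0   # posterior odds of H1 vs H0
    p_h1 = odds_10 / (1.0 + odds_10)
    return 1.0 - p_h1, p_h1

p0, p1 = posterior_probs(bayes_factor_10=3.0, prior_h1=0.5)
print(p0, p1)  # 0.25 0.75
```

With equal prior probabilities the posterior odds reduce to the Bayes factor itself, which is why B_10 = 3 gives H_1 three times the posterior probability of H_0.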
3.1 Summary of Last Lecture
Testing hypotheses in a Bayesian way is attractive... but be careful with vague priors!
Assume X | (μ, σ²) ~ N(μ, σ²), where σ² is known but μ (the parameter θ) is unknown. We want to test H_0: μ = 0 versus H_1: μ ~ N(ξ, τ²). Then
  B_10^π(x) = π(x|H_1) / π(x|H_0) = ∫ N(x; μ, σ²) N(μ; ξ, τ²) dμ / f(x|0),
which, for ξ = 0, equals
  B_10^π(x) = [σ / √(σ² + τ²)] exp( τ² x² / (2σ²(σ² + τ²)) ).
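A minimal sketch, assuming ξ = 0 and illustrative values of x, σ, τ, that checks the closed-form Bayes factor against a direct numerical integration of the H_1 marginal:

```python
import numpy as np

def npdf(x, m, s):
    """Normal density N(x; m, s^2)."""
    return np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

def bayes_factor_10(x, sigma, tau):
    """Closed-form B_10 for H0: mu = 0 vs H1: mu ~ N(0, tau^2)."""
    s2 = sigma**2 + tau**2
    return sigma / np.sqrt(s2) * np.exp(tau**2 * x**2 / (2 * sigma**2 * s2))

x, sigma, tau = 1.5, 1.0, 2.0
mus = np.linspace(-40.0, 40.0, 400001)
h = mus[1] - mus[0]
# marginal likelihood under H1: integral of N(x; mu, sigma^2) N(mu; 0, tau^2) dmu
marginal_h1 = np.sum(npdf(x, mus, sigma) * npdf(mus, 0.0, tau)) * h
print(bayes_factor_10(x, sigma, tau), marginal_h1 / npdf(x, 0.0, sigma))
```

The two printed numbers agree, confirming that the H_1 marginal of x is N(x; 0, σ² + τ²).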
3.2 Bayesian Polynomial Regression Example
In practice, you might have more than two potential models/hypotheses for your data. Consider the following polynomial regression problem with data D = {x_i, y_i}_{i=1}^n, where (x_i, y_i) ∈ R × R:
  Y = Σ_{i=0}^M β_i X^i + σV, where V ~ N(0, 1),
    = β_{0:M}^T f_M(X) + σV, with f_M(x) = (1, x, ..., x^M)^T.
Here the problem is that if M is too large, then there will be overfitting.
3.2 Bayesian Polynomial Regression Example
[Figure: polynomial fits to the data for M = 0, 1, ..., 7. As M increases, the model overfits.]
3.2 Bayesian Polynomial Regression Example
Candidate Bayesian models H_M for M ∈ {0, ..., M_max}. For model H_M, we take the prior
  π_M(β_{0:M}, σ²) = π_M(β_{0:M} | σ²) π_M(σ²) = N(β_{0:M}; 0, δ²σ² I_{M+1}) IG(σ²; ν_0/2, γ_0/2).
We have the following Gaussian likelihood:
  f(D | β_{0:M}, σ²) = Π_{i=1}^n N(y_i; β_{0:M}^T f_M(x_i), σ²).
3.2 Bayesian Polynomial Regression Example
Standard calculations yield
  π_M(β_{0:M}, σ² | D) = N(β_{0:M}; μ_M, σ²Σ_M) · IG(σ²; (ν_0 + n)/2, (γ_0 + Σ_{i=1}^n y_i² − μ_M^T Σ_M^{-1} μ_M)/2),
where
  μ_M = Σ_M (Σ_{i=1}^n y_i f_M(x_i)),   Σ_M^{-1} = δ^{-2} I_{M+1} + Σ_{i=1}^n f_M(x_i) f_M^T(x_i),
knowing that IG(σ²; α, β) = (β^α / Γ(α)) (1/σ²)^{α+1} exp(−β/σ²).
3.2 Bayesian Polynomial Regression Example
The marginal likelihood/evidence is given by
  π(D | H_M) = ∫ f(D | β_{0:M}, σ²) π_M(β_{0:M}, σ²) dβ_{0:M} dσ²
             ∝ δ^{−(M+1)} |Σ_M|^{1/2} (γ_0 + Σ_{i=1}^n y_i² − μ_M^T Σ_M^{-1} μ_M)^{−(ν_0+n)/2},
where the proportionality constant does not depend on M. We can also compute
  π(H_M | D) = π(D | H_M) π(H_M) / Σ_{i=0}^{M_max} π(D | H_i) π(H_i).
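The evidence formula above can be sketched numerically. The code below computes π(H_M | D) up to normalization for synthetic quadratic data; the hyperparameter values (ν_0, γ_0, δ) and the data-generating coefficients are illustrative choices of mine, not from the slides:

```python
import numpy as np

def log_evidence(xs, ys, M, delta=10.0, nu0=0.1, gamma0=0.1):
    """log pi(D | H_M) up to a constant that does not depend on M."""
    n = len(ys)
    F = np.vander(xs, M + 1, increasing=True)        # rows are f_M(x_i)^T
    Sigma_inv = np.eye(M + 1) / delta**2 + F.T @ F
    mu = np.linalg.solve(Sigma_inv, F.T @ ys)
    resid = gamma0 + ys @ ys - mu @ Sigma_inv @ mu
    # log |Sigma_M|^{1/2} = -0.5 * log |Sigma_M^{-1}|
    return (-(M + 1) * np.log(delta)
            - 0.5 * np.linalg.slogdet(Sigma_inv)[1]
            - 0.5 * (nu0 + n) * np.log(resid))

rng = np.random.default_rng(0)
xs = np.linspace(-1.0, 1.0, 30)
ys = 1.0 - 2.0 * xs + 3.0 * xs**2 + 0.1 * rng.standard_normal(30)  # true order 2
log_ev = np.array([log_evidence(xs, ys, M) for M in range(8)])
probs = np.exp(log_ev - log_ev.max())
probs /= probs.sum()                 # pi(H_M | D) under a uniform model prior
print(np.argmax(probs))              # should concentrate on M = 2
```

The evidence automatically penalizes the larger models, so the posterior mass concentrates on the true order rather than on the best-fitting (largest) polynomial.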
3.2 Bayesian Polynomial Regression Example
[Figure: polynomial fits for M = 0, ..., 7 together with the model evidence P(Y|M) plotted as a function of M.]
3.2 Bayesian Polynomial Regression Example
We have assumed here that δ was fixed to a given value. As δ → ∞, the prior on β_{0:M} becomes vague, but then lim_{δ→∞} π(H_0 | D) = 1, since for any M ≥ 1
  π(D | H_0) / π(D | H_M) = [δ^{−1} |Σ_0|^{1/2} (γ_0 + Σ_{i=1}^n y_i² − μ_0^T Σ_0^{-1} μ_0)^{−(ν_0+n)/2}] / [δ^{−(M+1)} |Σ_M|^{1/2} (γ_0 + Σ_{i=1}^n y_i² − μ_M^T Σ_M^{-1} μ_M)^{−(ν_0+n)/2}] → ∞ as δ → ∞.
Do not use vague priors for model selection! For a more robust model, treat δ as random and estimate it from the data. However, numerical methods are then necessary.
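This sensitivity (a Lindley/Bartlett-type effect) is easy to exhibit numerically. The sketch below, with illustrative hyperparameters of my own choosing, compares a linear model (M = 1) against a constant model (M = 0) and shows log B_10 drifting downward as δ grows, even though the data really do contain a linear signal:

```python
import numpy as np

def log_evidence(xs, ys, M, delta, nu0=0.1, gamma0=0.1):
    """log pi(D | H_M) up to an M-independent constant (illustrative prior)."""
    n = len(ys)
    F = np.vander(xs, M + 1, increasing=True)
    Sigma_inv = np.eye(M + 1) / delta**2 + F.T @ F
    mu = np.linalg.solve(Sigma_inv, F.T @ ys)
    resid = gamma0 + ys @ ys - mu @ Sigma_inv @ mu
    return (-(M + 1) * np.log(delta)
            - 0.5 * np.linalg.slogdet(Sigma_inv)[1]
            - 0.5 * (nu0 + n) * np.log(resid))

rng = np.random.default_rng(1)
xs = np.linspace(-1.0, 1.0, 30)
ys = 0.2 * xs + rng.standard_normal(30)      # weak but real linear signal
log_b10 = [log_evidence(xs, ys, 1, d) - log_evidence(xs, ys, 0, d)
           for d in (1.0, 1e2, 1e4, 1e6)]
print(log_b10)   # decreases by roughly log(delta) per step: H0 eventually wins
```

The leading δ^{−(M+1)} factor means each extra parameter costs a factor of order δ, so a sufficiently vague prior always drives the posterior onto the simplest model.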
3.3 Bayesian Model Choice: Example
In practice, you might have models of completely different natures for your data x = (x_1, ..., x_T):
  M_1: Gaussian white noise, X_n ~ iid N(0, σ²_WN).
  M_2: an AR process of order k_AR (k_AR fixed), excited by Gaussian white noise V_n ~ iid N(0, σ²_AR):
    X_n = Σ_{i=1}^{k_AR} a_i X_{n−i} + V_n.
  M_3: k_sin sinusoids (k_sin fixed), embedded in a Gaussian white noise sequence V_n ~ iid N(0, σ²_sin):
    X_n = Σ_{j=1}^{k_sin} ( a_{c,j} cos(ω_j n) + a_{s,j} sin(ω_j n) ) + V_n.
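To make the three model classes concrete, here is a sketch simulating one realization from each; the orders, coefficients, frequencies, and variances are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200

# M1: Gaussian white noise, X_n ~ iid N(0, sigma_WN^2)
x1 = rng.normal(0.0, 1.0, T)

# M2: AR(2) process driven by Gaussian white noise
a = np.array([0.6, -0.3])                    # stable AR coefficients
x2 = np.zeros(T)
for t in range(2, T):
    x2[t] = a[0] * x2[t - 1] + a[1] * x2[t - 2] + rng.normal(0.0, 1.0)

# M3: two sinusoids embedded in Gaussian white noise
ns = np.arange(T)
omega, a_c, a_s = [0.3, 0.8], [1.0, 0.5], [0.5, 1.0]
x3 = sum(c * np.cos(w * ns) + s * np.sin(w * ns)
         for w, c, s in zip(omega, a_c, a_s)) + rng.normal(0.0, 1.0, T)
```

All three series have the same length and a noise variance, yet live in parameter spaces of different dimensions, which is exactly the setting of the next slides.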
3.4 Bayesian Model for Model Choice
Generally speaking, you have a countable collection of models {M_i}. For each model M_i, you have a prior π_i(θ_i) on θ_i and a likelihood function f_i(x | θ_i). You attribute a prior probability π(i) to each model M_i.
The parameter space is Θ = ∪_i {i} × Θ_i and the prior on Θ is π(i, θ_i) = π(i) π_i(θ_i).
3.4 Bayesian Model for Model Choice
In the polynomial regression example,
  Θ = ∪_{i=0}^{M_max} {i} × R^{i+1} × R^+,
where {i} carries the model indicator, R^{i+1} the regression parameters β_{0:i}, and R^+ the noise variance.
Remark: in all models you have a noise variance to estimate. This parameter has a different interpretation for each model.
3.4 Bayesian Model for Model Choice
In the non-nested example, Θ = {1} × Θ_1 ∪ {2} × Θ_2 ∪ {3} × Θ_3, where
  θ_1 = σ²_WN and Θ_1 = R^+,
  θ_2 = (a_1, ..., a_{k_AR}, σ²_AR) and Θ_2 = R^{k_AR} × R^+,
  θ_3 = (a_{c,1}, a_{s,1}, ω_1, ..., a_{c,k_sin}, a_{s,k_sin}, ω_{k_sin}, σ²_sin) and Θ_3 = R^{2 k_sin} × [0, π]^{k_sin} × R^+.
Be careful: we do not select here Θ = {1, 2, 3} × Θ_1 × Θ_2 × Θ_3.
3.4 Bayesian Model for Model Choice
The posterior is given by Bayes' rule:
  π(k, θ_k | x) = π(k) π_k(θ_k) f_k(x | θ_k) / Σ_{k'} π(k') ∫_{Θ_{k'}} π_{k'}(θ_{k'}) f_{k'}(x | θ_{k'}) dθ_{k'}.
We can obtain the posterior model probabilities through
  π(k | x) = ∫_{Θ_k} π(k, θ_k | x) dθ_k.
Once more, this is conceptually simple but it requires the calculation of many (possibly infinitely many) integrals.
3.5 Bayesian Model Averaging
Assume you are predicting some Y ~ g(y | θ). Then, in light of x, we have
  g(y | x) = ∫ g(y | θ) π(θ | x) dθ = Σ_k ∫_{Θ_k} g_k(y | θ_k) π(k, θ_k | x) dθ_k
           = Σ_k π(k | x) · ∫_{Θ_k} g_k(y | θ_k) π(θ_k | x, k) dθ_k,
i.e. the posterior probability of model k times the prediction from model k, summed over k. This is called Bayesian model averaging: all the models are taken into account to perform the prediction.
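In code, model averaging is just a posterior-weighted sum of per-model predictions. The posterior probabilities and predictive means below are purely illustrative numbers:

```python
import numpy as np

post_model = np.array([0.02, 0.12, 0.86])     # pi(k | x) for k = 1, 2, 3
pred_mean = np.array([0.0, 1.4, 2.1])         # E[Y | x, k] under each model

bma = post_model @ pred_mean                  # Bayesian model averaging
best_only = pred_mean[np.argmax(post_model)]  # prediction from the 'best' model
print(bma, best_only)  # bma blends all models; best_only ignores two of them
```

When one model dominates the posterior the two predictions nearly coincide, but when the posterior mass is spread out they can differ substantially, which is the point of the next slide.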
3.5 Bayesian Model Averaging
An alternative way to make predictions consists of selecting the "best" model, say the model with the highest posterior probability k_best. The prediction is then performed according to
  ∫_{Θ_{k_best}} g_{k_best}(y | θ_{k_best}) π(θ_{k_best} | x, k_best) dθ_{k_best}.
This is computationally much simpler and cheaper. It can also be very misleading.
3.6 Example
Consider the previous example: simulated data from a sum of three sinusoids with a very large additive noise. Priors were selected for the three models: inverse-Gamma for σ², normal/inverse-Gamma for the AR model, and normal/inverse-Gamma plus uniform (on the frequencies) for the sinusoids. We set π(H_1) = π(H_2) = π(H_3) = 1/3.
We obtain π(H_1 | x) ≈ 0, π(H_2 | x) ≈ 0.1 and π(H_3 | x) = 0.86. If we start using very vague priors... π(H_1 | x) → 1.
3.7 Bayesian Variable Selection Example
Consider the standard linear regression problem
  Y = Σ_{i=1}^p β_i X_i + σV, where V ~ N(0, 1).
Often you might have too many predictors, so this model will be inefficient. A standard Bayesian treatment of this problem consists of selecting only a subset of the explanatory variables. This is nothing but a model selection problem with 2^p possible models.
3.8 Bayesian Variable Selection Example
A standard way to write the model is
  Y = Σ_{i=1}^p γ_i β_i X_i + σV, where V ~ N(0, 1),
where γ_i = 1 if X_i is included and γ_i = 0 otherwise. However, this suggests that β_i is defined even when γ_i = 0. A neater way to write such models is
  Y = Σ_{i: γ_i=1} β_i X_i + σV = β_γ^T X_γ + σV,
where, for a vector γ = (γ_1, ..., γ_p), β_γ = {β_i : γ_i = 1}, X_γ = {X_i : γ_i = 1} and n_γ = Σ_{i=1}^p γ_i.
Prior distributions:
  π(β_γ, σ² | γ) = N(β_γ; 0, δ²σ² I_{n_γ}) IG(σ²; ν_0/2, γ_0/2) and π(γ) = Π_{i=1}^p π(γ_i) = 2^{−p}.
3.8 Bayesian Variable Selection Example
An alternative way to think of it is to write Y = β^T X + σV with the prior π(β_1, ..., β_p) = Π_{i=1}^p π(β_i), where
  β_i | σ² ~ (1/2) δ_0 + (1/2) N(0, δ²σ²).
The regression coefficients follow a mixture model with a degenerate component at zero.
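A draw from this spike-and-slab prior can be sketched as follows; the dimension and scale values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
p, delta, sigma = 6, 10.0, 1.0

gamma = rng.random(p) < 0.5                 # gamma_i = 1 with probability 1/2
# spike at exactly 0 where gamma_i = 0, slab N(0, delta^2 sigma^2) elsewhere
beta = np.where(gamma, rng.normal(0.0, delta * sigma, p), 0.0)
print(gamma.astype(int))
print(beta)
```

The printed coefficient vector has exact zeros wherever the indicator is off, which is precisely the degenerate mixture component.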
3.9 Bayesian Variable Selection Example
For a fixed model γ and n observations D = {x_i, y_i}_{i=1}^n, we can determine the marginal likelihood and the posterior analytically:
  π(D | γ) ∝ δ^{−n_γ} |Σ_γ|^{1/2} (γ_0 + Σ_{i=1}^n y_i² − μ_γ^T Σ_γ^{-1} μ_γ)^{−(ν_0+n)/2}
and
  π(β_γ, σ² | D, γ) = N(β_γ; μ_γ, σ²Σ_γ) IG(σ²; (ν_0 + n)/2, (γ_0 + Σ_{i=1}^n y_i² − μ_γ^T Σ_γ^{-1} μ_γ)/2),
where
  μ_γ = Σ_γ (Σ_{i=1}^n y_i x_{γ,i}),   Σ_γ^{-1} = δ^{-2} I_{n_γ} + Σ_{i=1}^n x_{γ,i} x_{γ,i}^T.
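Since the evidence is available in closed form, all 2^p models can be scored by exhaustive enumeration when p is small. A sketch with synthetic data; the hyperparameters and the data-generating coefficients are illustrative assumptions:

```python
import numpy as np
from itertools import product

def log_ev(X, ys, gamma, delta=10.0, nu0=0.1, gam0=0.1):
    """log pi(D | gamma) up to a gamma-independent constant."""
    n = len(ys)
    idx = np.flatnonzero(gamma)
    if idx.size == 0:                       # empty model: no regressors
        return -0.5 * (nu0 + n) * np.log(gam0 + ys @ ys)
    Xg = X[:, idx]
    Sigma_inv = np.eye(idx.size) / delta**2 + Xg.T @ Xg
    mu = np.linalg.solve(Sigma_inv, Xg.T @ ys)
    r = gam0 + ys @ ys - mu @ Sigma_inv @ mu
    return (-idx.size * np.log(delta)
            - 0.5 * np.linalg.slogdet(Sigma_inv)[1]
            - 0.5 * (nu0 + n) * np.log(r))

rng = np.random.default_rng(2)
n, p = 50, 4
X = rng.standard_normal((n, p))
ys = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.3 * rng.standard_normal(n)  # true (1,0,1,0)

gammas = list(product([0, 1], repeat=p))
le = np.array([log_ev(X, ys, np.array(g)) for g in gammas])
probs = np.exp(le - le.max())
probs /= probs.sum()                        # posterior under uniform pi(gamma)
print(gammas[int(np.argmax(probs))])        # should recover (1, 0, 1, 0)
```

For large p this enumeration becomes infeasible, which is why the numerical methods mentioned in the conclusion (e.g. MCMC over γ) are needed.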
3.10 Conclusion
Bayesian model selection is a simple and principled way to do model selection, and it appears in numerous applications. Vague/improper priors have to be banned in the model selection context!
Bayesian model selection only allows us to compare models; it does not tell you whether any of the candidate models makes sense. Except for simple problems, it is impossible to perform the calculations in closed form.