Introduction to Bayesian learning Lecture 2: Bayesian methods for (un)supervised problems


1 Introduction to Bayesian learning Lecture 2: Bayesian methods for (un)supervised problems Anne Sabourin, Ass. Prof., Telecom ParisTech September /78

2 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 1/78

3 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 2/78

4 Reminder: conjugate priors (last lecture, see lecture notes). Definition: hyper-parameter, conjugate family. Examples: normal model, known variance - normal prior; normal model, known mean - gamma prior on the inverse variance; normal model, unknown mean and variance - normal-gamma prior on $(\mu, \lambda = 1/\sigma^2)$. 3/78

5 Conjugate priors for the multivariate normal: $X \sim \mathcal{N}(\mu, \Lambda^{-1})$, $\mu \in \mathbb{R}^d$, $\Lambda \in \mathbb{R}^{d\times d}$ positive definite (precision matrix: inverse of the covariance matrix). 1. unknown mean: the conjugate prior family on $\mu$ is a family of multivariate Gaussian distributions. 2. unknown precision: conjugate priors on $\Lambda$ are the Wishart distributions $\mathcal{W}(\nu, W)$ with $\nu$ degrees of freedom ($\nu \in \mathbb{N}$) and $W \in \mathbb{R}^{d\times d}$. 4/78

6 Wishart distribution: defined on the cone of positive definite matrices. The Wishart distribution $\mathcal{W}(\nu, W)$ has density $f_W(\Lambda \mid \nu, W) = \frac{1}{B}(\det\Lambda)^{(\nu-d-1)/2}\exp\{-\frac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\}$ w.r.t. the Lebesgue measure $\prod_{i\le j}\mathrm{d}\lambda_{(i,j)}$ on $\mathbb{R}^{d(d+1)/2}$, restricted to the set of positive definite matrices; $B$: a normalizing constant. Probabilistic representation: let $M$ be a random $\nu\times d$ matrix with i.i.d. rows $M_{(i,\cdot)} \sim \mathcal{N}(0, W)$. Then $\Lambda \sim \mathcal{W}(\nu, W) \iff \Lambda \stackrel{d}{=} M^\top M = \sum_{i=1}^{\nu} M_{(i,\cdot)} M_{(i,\cdot)}^\top$. More details: see e.g. Eaton, Multivariate Statistics: A Vector Space Approach, 2007 (Chapter 8). 5/78

7 Conjugate priors for the multivariate normal, cont'd. 3. Unknown mean and precision: the conjugate prior family on $(\mu, \Lambda)$ is the Gaussian-Wishart distribution with hyper-parameters $(W, \nu, m, \beta)$: $\pi(\mu, \Lambda) = \pi_1(\Lambda)\,\pi_2(\mu \mid \Lambda)$ with $\pi_1 = \mathcal{W}(W, \nu)$, $\nu \in \mathbb{N}$, $W$ positive definite, and $\pi_2(\cdot \mid \Lambda) = \mathcal{N}(m, (\beta\Lambda)^{-1})$, $m \in \mathbb{R}^d$, $\beta > 0$. 6/78

8 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 7/78

9 Definition: exponential family. A dominated parametric model $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ is an exponential family if the densities write $p_\theta(x) = C(\theta)\,h(x)\,\exp\{\langle T(x), R(\theta)\rangle\}$ for some functions $R : \Theta \to \mathbb{R}^k$, $T : \mathcal{X} \to \mathbb{R}^k$, $C : \Theta \to \mathbb{R}_+$, $h : \mathcal{X} \to \mathbb{R}_+$. $C(\theta)$: a normalizing constant. $R(\theta)$: the natural parameter ($R$: the good re-parametrization). If $R(\theta) = \theta$, the family is natural. Most textbook distributions belong to the exponential family! 8/78

10 Example I: Bernoulli model. $\theta \in \Theta = ]0,1[$, $\mathcal{X} = \{0,1\}$. The model is dominated by $\lambda = \delta_0 + \delta_1$: $p_\theta(x) = \theta^x(1-\theta)^{1-x} = \exp\{x\log\theta + (1-x)\log(1-\theta)\} = (1-\theta)\exp\{\underbrace{x}_{T(x)}\underbrace{\log\tfrac{\theta}{1-\theta}}_{R(\theta)}\}$. The model is an exponential family with $T(x) = x$, natural parameter $\rho = R(\theta) = \log\frac{\theta}{1-\theta}$, normalizing constant $C(\theta) = 1-\theta$. 9/78

11 Example II: Gaussian model. $\theta = (\mu, \sigma^2) \in \Theta = \mathbb{R}\times\mathbb{R}_+$; the model is dominated by the Lebesgue measure on $\mathcal{X} = \mathbb{R}$: $p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\{-\frac{x^2 - 2\mu x + \mu^2}{2\sigma^2}\} = \underbrace{\tfrac{1}{\sqrt{2\pi\sigma^2}}\exp\{-\tfrac{\mu^2}{2\sigma^2}\}}_{C(\theta)}\exp\langle\underbrace{(x, x^2)}_{T(x)}, \underbrace{(\mu/\sigma^2,\ -1/(2\sigma^2))}_{R(\theta)}\rangle$. The model is an exponential family with $T(x) = (x, x^2)$, natural parameter $\rho = R(\theta) = (\mu/\sigma^2,\ -1/(2\sigma^2))$, normalizing constant $C(\theta) = (2\pi\sigma^2)^{-1/2}\exp\{-\mu^2/(2\sigma^2)\}$. 10/78

12 Likelihood for i.i.d. samples in exponential families: $p_\theta(x_1) = C(\theta)h(x_1)\exp\{\langle R(\theta), T(x_1)\rangle\}$, and $p_\theta^n(x_{1:n}) = C(\theta)^n\underbrace{\prod_{i=1}^n h(x_i)}_{h_n(x_{1:n})}\exp\{\langle\underbrace{\sum_{i=1}^n T(x_i)}_{T_n(x_{1:n})}, R(\theta)\rangle\}$. 11/78

13 Natural parameter space. Natural parametrization: $\rho = R(\theta)$. The density $p_\rho(x) = C(\rho)h(x)\exp\langle T(x), \rho\rangle$ integrates to 1 $\iff \rho \in E$, the natural parameter space, i.e. $E = \{\rho : \int_{\mathcal{X}} h(x)\exp\langle T(x), \rho\rangle\,\mathrm{d}\lambda(x) < \infty\}$. If $E$ is open: the family is regular. 12/78

14 Maximum likelihood in regular exponential families. Natural parametrization: $\rho = R(\theta)$; $\lambda$: reference measure. Lemma (expression for $E_\rho[T(X)]$): $E_\rho[T(X)] = -\nabla_\rho \ln C(\rho)$. Proof: $1 = C(\rho)\int_{\mathcal{X}} h(x)\exp\langle T(x),\rho\rangle\,\mathrm{d}\lambda(x)$; differentiating in $\rho$ (with enough regularity to exchange $\nabla$ and $\int$), $0 = \nabla_\rho C(\rho)\underbrace{\int_{\mathcal{X}} h(x)\exp\langle T(x),\rho\rangle\,\mathrm{d}\lambda(x)}_{C(\rho)^{-1}} + C(\rho)\int_{\mathcal{X}} h(x)T(x)\exp\langle T(x),\rho\rangle\,\mathrm{d}\lambda(x)$, i.e. $0 = \frac{1}{C(\rho)}\nabla_\rho C(\rho) + E_\rho(T(X))$. 13/78

15 Maximum likelihood in regular exponential families, cont'd. Proposition: the MLE $\hat\rho$ in a regular exponential family satisfies $E_{\hat\rho}[T(X)] = \frac{1}{n}\sum_i T(x_i)$. Proof: $\nabla_\rho\log p_\rho^n(x_{1:n}) = 0 \iff \nabla_\rho\{n\log C(\rho) + \langle\sum_i T(x_i), \rho\rangle\} = 0 \iff -\nabla_\rho\log C(\hat\rho) = \frac{1}{n}\sum_i T(x_i)$; then use the lemma. 14/78
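
As a quick illustration (not from the slides; a minimal sketch assuming NumPy is available), the moment-matching equation above can be checked numerically in the Gaussian example of slide 11, where $T(x) = (x, x^2)$: solving $E_{\hat\rho}[T(X)] = \frac{1}{n}\sum_i T(x_i)$ gives $\hat\mu = \bar x$ and $\hat\sigma^2 = \overline{x^2} - \bar x^2$.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # i.i.d. sample from N(mu=2, sigma=1.5)

    # Empirical means of the sufficient statistics T(x) = (x, x^2)
    t1_bar, t2_bar = x.mean(), (x ** 2).mean()

    # Moment matching: E[X] = mu, E[X^2] = mu^2 + sigma^2
    mu_hat = t1_bar
    sigma2_hat = t2_bar - t1_bar ** 2

    print(mu_hat, sigma2_hat)   # close to (2.0, 2.25) for a large sample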

16 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 15/78

17 Conjugate priors in exponential families. Proposition: a natural exponential family with densities $p_\theta(x) = C(\theta)h(x)\exp\langle\theta, T(x)\rangle$ admits the conjugate prior family $\mathcal{F} = \{\pi_{\lambda,\mu},\ \lambda > 0,\ \mu \in M_\lambda \subset \mathbb{R}^k\}$, with $\pi_{\lambda,\mu}(\theta) = K(\mu,\lambda)\,C(\theta)^\lambda\exp\{\langle\theta, \mu\rangle\}$ and $M_\lambda = \{\mu : \int_\Theta \pi_{\lambda,\mu}(\theta)\,\mathrm{d}\theta < \infty\}$. The posterior for $n$ observations is $\pi_{\lambda,\mu}(\theta \mid x_{1:n}) \propto C(\theta)^{\lambda+n}\exp\{\langle\theta, \mu + \sum_i T(x_i)\rangle\}$, so that $\pi_{\lambda,\mu}(\cdot \mid x_{1:n}) = \pi_{\lambda_n,\mu_n}(\cdot)$, with $\lambda_n = \lambda + n$ and $\mu_n = \mu + \sum_i T(x_i)$. Proof: exercise. 16/78

18 Example: Poisson model. $p_\theta(x) = e^{-\theta}\theta^x/x!$, $\mathcal{X} = \mathbb{N}$, $\theta > 0$; $p_\theta(x) = \frac{1}{x!}e^{-\theta}e^{x\log\theta}$: an exponential family with $T(x) = x$, $\rho = R(\theta) = \log\theta \in \mathbb{R}$, $C(\rho) = \exp\{-e^\rho\}$. Conjugate prior for $\rho$: $\pi_{a,b}(\rho) \propto \exp\{-b e^\rho\}\exp\{a\rho\}$. Back to $\theta$: $\pi(\theta) = \pi_{a,b}(\rho)\,\big|\frac{\mathrm{d}\rho}{\mathrm{d}\theta}\big| \propto \theta^{a-1}\exp\{-b\theta\}$ (Gamma density). The Gamma family is a conjugate prior family for $\theta$. 17/78
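
A minimal sketch of the resulting conjugate update (my own illustration, assuming SciPy is available): with a Gamma(a, b) prior on $\theta$ and Poisson observations $x_1, \dots, x_n$, the posterior is Gamma$(a + \sum_i x_i,\ b + n)$.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    theta_true = 3.0
    x = rng.poisson(theta_true, size=50)          # Poisson observations

    a, b = 2.0, 1.0                                # Gamma(a, b) prior (shape, rate)
    a_post, b_post = a + x.sum(), b + len(x)       # conjugate update

    # Posterior summaries (scipy parametrizes the Gamma with scale = 1/rate)
    post = stats.gamma(a=a_post, scale=1.0 / b_post)
    print(post.mean(), post.interval(0.95))        # posterior mean and 95% credible interval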

19 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 18/78

20 About the choice of a conjugate prior. A convenient choice only. One must still choose the hyper-parameters $(\lambda, \mu)$: this is an issue of model choice. It is possible to do so via empirical Bayes methods; see lecture 2 and the lab session. 19/78

21 Other prior choices: non-informative priors. Goal: minimize the bias induced by the prior. If $\Theta$ is compact: one can choose $\pi(\theta) = $ constant. If $\Theta$ is non-compact, $\int_\Theta \pi(\theta)\,\mathrm{d}\theta = \int_\Theta C\,\mathrm{d}\theta = +\infty$. It is OK to do so as long as the posterior is well defined, i.e. when $\int_\Theta p_\theta(x)\,\mathrm{d}\pi(\theta) < \infty$. Uniform only w.r.t. the reference measure: not invariant under re-parametrization. E.g. a flat prior on $]0,1[$ in a $\mathrm{Ber}(\theta)$ model is non-flat over $\rho = \log[\theta/(1-\theta)]$. 20/78

22 Other prior choices: Jeffreys prior. For $\Theta$ open in $\mathbb{R}^d$; reasonable with $d = 1$. Recall the Fisher information (in a regular model): $I(\theta) = E_\theta\big[\big(\frac{\partial\log p_\theta(X)}{\partial\theta}\big)^2\big] = -E_\theta\big[\frac{\partial^2\log p_\theta(X)}{\partial\theta^2}\big]$. $I(\theta)$ is the expected curvature of the likelihood around $\theta$; interpretation as an average information carried by $X$ about $\theta$. Idea: grant more prior mass to highly informative $\theta$'s. Definition (Jeffreys prior): in a dominated model with densities $p_\theta$, $\theta \in \Theta$, the Jeffreys prior has density w.r.t. the Lebesgue measure on $\Theta$: $\pi(\theta) \propto \sqrt{I(\theta)}$. Exercise: compute the Jeffreys prior in the Bernoulli model, in the location model $\mathcal{N}(\theta, \sigma^2)$ with $\sigma^2$ known, and in the scale model $\mathcal{N}(\mu, \theta^2)$ with $\mu$ known. 21/78

23 Invariance of the Jeffreys prior. Change of variable: $\eta = h(\theta)$; then $\tilde p_\eta = p_{h^{-1}(\eta)}$. Let $\theta \sim \pi_{J,\theta}$, the Jeffreys prior. Then $\eta \sim \pi_{J,\theta}\circ h^{-1}$, with density $\pi(\eta) = \pi_{J,\theta}(\theta)\,\big|\frac{\mathrm{d}\theta}{\mathrm{d}\eta}\big| = \frac{\sqrt{I(\theta)}}{|h'(\theta)|}$ for $\theta = h^{-1}(\eta)$. On the other hand, computing the Jeffreys prior directly on $\eta$: $\pi_{J,\eta}(\eta) = \sqrt{I_\eta(\eta)} = E_\eta\big[\big(\frac{\partial\log\tilde p_\eta(X)}{\partial\eta}\big)^2\big]^{1/2} = E_\theta\big[\big(\frac{\partial\log p_\theta(X)}{\partial\theta}\,\frac{\mathrm{d}\theta}{\mathrm{d}\eta}\big)^2\big]^{1/2} = \frac{\sqrt{I(\theta)}}{|h'(\theta)|}$ at $\theta = h^{-1}(\eta)$. Same result: the Jeffreys prior in the $\eta$ parametrization is the image measure of the Jeffreys prior in the $\theta$ parametrization. In other words, the Jeffreys prior is parametrization-invariant. 22/78
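
A small numerical sanity check of this invariance (my own illustration, not part of the slides), for the Bernoulli model where $I(\theta) = 1/(\theta(1-\theta))$ and $\eta = h(\theta) = \log(\theta/(1-\theta))$: the push-forward of the Jeffreys prior on $\theta$ coincides with the Jeffreys prior computed directly in the $\eta$ parametrization (both are proportional to $\sqrt{\theta(1-\theta)}$).

    import numpy as np

    def jeffreys_theta(theta):
        # Jeffreys density (unnormalized) on theta: sqrt(I(theta)), I(theta) = 1/(theta(1-theta))
        return np.sqrt(1.0 / (theta * (1.0 - theta)))

    def jeffreys_eta_direct(eta):
        # Fisher information in the log-odds parametrization: I_eta = theta (1 - theta)
        theta = 1.0 / (1.0 + np.exp(-eta))
        return np.sqrt(theta * (1.0 - theta))

    def jeffreys_eta_pushforward(eta):
        # Change of variables: pi(eta) = pi_J(theta) |d theta / d eta|, d theta/d eta = theta(1-theta)
        theta = 1.0 / (1.0 + np.exp(-eta))
        return jeffreys_theta(theta) * theta * (1.0 - theta)

    eta = np.linspace(-4, 4, 9)
    print(np.allclose(jeffreys_eta_direct(eta), jeffreys_eta_pushforward(eta)))   # True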

24 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 22/78

25 Rough overview: as the sample size $n \to \infty$, the influence of the prior choice vanishes; the posterior distribution concentrates around the true value $\theta_0$ (almost surely); the posterior distribution is asymptotically normal with mean $\hat\theta$ (the maximum likelihood estimator) and variance $n^{-1}I(\hat\theta)^{-1}$ (the same as the MLE's). 23/78

26 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 24/78

27 Reminder: Beta-Binomial model. Bayesian model: $\theta \sim \pi = \mathrm{Beta}(a, b)$ and $X \mid \theta \sim \mathrm{Ber}(\theta)$. $P_\theta$ also denotes the distribution over $\mathcal{X}^{\mathbb{N}}$ of the random sequence $(X_n)_{n\ge 1}$, i.i.d. $\sim \mathrm{Ber}(\theta)$. Posterior distribution (conjugate prior): $\pi(\cdot \mid x_{1:n}) = \mathrm{Beta}(a + s,\ b + n - s)$, $s = \sum_{i=1}^n x_i$. 25/78

28 Posterior expectation and variance.
$E^\pi(\theta \mid X_{1:n}) = \frac{a + \sum_1^n X_i}{a + b + n} = \frac{a/n + \frac{1}{n}\sum_1^n X_i}{(a+b)/n + 1} \longrightarrow \theta_0$, $P_{\theta_0}$-a.s.
$\mathrm{Var}^\pi(\theta \mid X_{1:n}) = \frac{\big(a + \sum_1^n X_i\big)\big(b + n - \sum_1^n X_i\big)}{(a+b+n)^2(a+b+n+1)} = \frac{1}{n}\,\frac{\big(a/n + \frac{1}{n}\sum_1^n X_i\big)\big(b/n + 1 - \frac{1}{n}\sum_1^n X_i\big)}{\big((a+b)/n + 1\big)^2\big((a+b+1)/n + 1\big)} \sim \frac{\theta_0(1-\theta_0)}{n} = (n\,I(\theta_0))^{-1}$ (exercise) $\longrightarrow 0$, $P_{\theta_0}$-a.s. 26/78

29 Concentration of the posterior distribution. Write $\bar\theta_n = \bar\theta_n(X_{1:n}) = E^\pi(\theta \mid X_{1:n})$. Chebyshev's inequality: for all $\delta > 0$, with $U = (\bar\theta_n - \delta, \bar\theta_n + \delta)$,
$P^\pi(\theta \notin U \mid X_{1:n}) = P^\pi\big((\theta - \bar\theta_n)^2 > \delta^2 \mid X_{1:n}\big) \le \frac{\mathrm{Var}^\pi(\theta \mid X_{1:n})}{\delta^2} \sim \frac{\theta_0(1-\theta_0)}{n\delta^2} \longrightarrow 0$, $P_{\theta_0}$-a.s.
Summary: $P_{\theta_0}$-a.s., the posterior distribution concentrates around the posterior expectation $\bar\theta_n$, and $\bar\theta_n \to \theta_0$ as $n \to \infty$. 27/78
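
A quick simulation of this concentration (my own sketch, assuming NumPy/SciPy): for data generated under $\theta_0 = 0.3$ with a Beta(1, 1) prior, the posterior mass of the fixed neighborhood $(\theta_0 - 0.05, \theta_0 + 0.05)$ approaches 1 as $n$ grows.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    theta0, a, b = 0.3, 1.0, 1.0                       # true parameter, Beta(a, b) prior
    x = rng.binomial(1, theta0, size=10_000)           # Bernoulli observations

    for n in (10, 100, 1000, 10_000):
        s = x[:n].sum()
        post = stats.beta(a + s, b + n - s)            # Beta posterior after n observations
        mass = post.cdf(theta0 + 0.05) - post.cdf(theta0 - 0.05)
        print(n, round(post.mean(), 3), round(mass, 3))   # posterior mean and mass of the neighborhood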

30 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 28/78

31 Posterior consistency. Definition: let $(\{P_\theta, \theta \in \Theta\}, \pi)$ be a Bayesian model and let $\theta_0 \in \Theta$. The posterior is consistent at $\theta_0$ if, for every neighborhood $U$ of $\theta_0$, $\pi(U \mid X_{1:n}) \to 1$ as $n \to \infty$, $P_{\theta_0}$-a.s. In general, consistency holds when $\Theta$ is finite dimensional, provided $\pi$ assigns positive mass to the neighborhoods of $\theta_0$. See e.g. [Ghosh and Ramamoorthi, 2003], Sections 1.3 and 1.4, for details. 29/78

32 Doob's theorem. Theorem: if $\Theta$ and $\mathcal{X}$ are complete, separable metric spaces endowed with their Borel $\sigma$-fields, and if $\theta \mapsto P_\theta$ is one-to-one, then for any prior $\pi$ on $\Theta$ there exists $\Theta_0 \subset \Theta$ with $\pi(\Theta_0) = 1$ such that for all $\theta_0 \in \Theta_0$, the posterior is consistent at $\theta_0$. Issue: the $\pi$-negligible set where consistency does not hold may be large. Under additional regularity conditions, consistency holds at a given $\theta_0$. 30/78

33 Consistency at a given $\theta_0$. Theorem ([Ghosh and Ramamoorthi, 2003], Th. ): let $\Theta$ be compact metric and $\theta_0 \in \Theta$. Let $T(x, \theta) = \log\frac{p_\theta(x)}{p_{\theta_0}(x)}$. Assume: 1. for all $x \in \mathcal{X}$, $\theta \mapsto T(x, \theta)$ is continuous; 2. for all $\theta \in \Theta$, $x \mapsto T(x, \theta)$ is measurable; 3. $E\big(\sup_{\theta\in\Theta}|T(X_1, \theta)|\big) < \infty$. Then: 1. the maximum likelihood estimator is consistent at $\theta_0$ (convergence in probability); 2. if $\theta_0 \in \mathrm{Supp}(\pi)$, then the posterior is consistent at $\theta_0$. 31/78

34 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 32/78

35 Bayesian asymptotic normality: overview. Tells us about the rate of convergence of $\pi(\cdot \mid X_{1:n})$ towards $\delta_{\theta_0}$. With a $\sqrt{n}$ re-scaling, a Gaussian limit centered at the MLE (under appropriate regularity conditions). Good references: [Van der Vaart, 1998], [Ghosh and Ramamoorthi, 2003], [Schervish, 2012]. 33/78

36 Bernstein-von Mises theorem (stated for $\Theta \subset \mathbb{R}$; similar statements hold for $\Theta \subset \mathbb{R}^d$). Theorem: under appropriate regularity conditions (detailed in [Ghosh and Ramamoorthi, 2003], Th. ), let $s = \sqrt{n}\,(\theta - \hat\theta_n(X_{1:n}))$, with $\hat\theta_n(X_{1:n})$ the MLE, and let $\pi^*(s \mid X_{1:n})$ be the posterior density of $s$. Then
$\int_{\mathbb{R}}\Big|\pi^*(s \mid X_{1:n}) - \sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-\frac{s^2 I(\theta_0)}{2}}\Big|\,\mathrm{d}s \longrightarrow 0$ a.s. under $P_{\theta_0}$.
Interpretation: as $n \to \infty$, $\sqrt{n}\,(\theta - \hat\theta_n(X_{1:n})) \stackrel{d}{\approx} \mathcal{N}(0, I(\theta_0)^{-1})$, i.e. $\theta \mid X_{1:n} \stackrel{d}{\approx} \mathcal{N}\big(\hat\theta_n, \tfrac{I(\theta_0)^{-1}}{n}\big)$. Multivariate case: similar result with a multivariate Gaussian and the Fisher information matrix. 34/78
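
A numerical illustration (my own sketch, assuming NumPy/SciPy) in the Beta-Bernoulli model: the $L_1$ distance between the posterior density of $s = \sqrt{n}(\theta - \hat\theta_n)$ and the limiting Gaussian $\mathcal{N}(0, I(\theta_0)^{-1})$ shrinks as $n$ grows.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    theta0 = 0.3
    x = rng.binomial(1, theta0, size=50_000)
    I0 = 1.0 / (theta0 * (1.0 - theta0))               # Fisher information of Ber(theta0)

    s_grid = np.linspace(-6, 6, 2001)
    ds = s_grid[1] - s_grid[0]
    gauss = stats.norm(0.0, np.sqrt(1.0 / I0)).pdf(s_grid)

    for n in (50, 500, 5000, 50_000):
        k = x[:n].sum()
        theta_hat = k / n                              # MLE
        post = stats.beta(1 + k, 1 + n - k)            # posterior under a flat Beta(1, 1) prior
        post_s = post.pdf(theta_hat + s_grid / np.sqrt(n)) / np.sqrt(n)   # density of s = sqrt(n)(theta - theta_hat)
        print(n, round(np.sum(np.abs(post_s - gauss)) * ds, 3))           # L1 distance on the grid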

37 Asymptotic normality of the posterior mean. $\bar\theta_n = E^\pi[\theta \mid X_{1:n}]$; $\hat\theta_n$: maximum likelihood estimator. Theorem: in addition to the assumptions of the BvM theorem, assume $\int_{\mathbb{R}}|\theta|\,\pi(\theta)\,\mathrm{d}\theta < \infty$. Then, under $P_{\theta_0}$: 1. $\sqrt{n}\,(\bar\theta_n - \hat\theta_n) \to 0$ in probability; 2. $\sqrt{n}\,(\bar\theta_n - \theta_0) \stackrel{d}{\longrightarrow} \mathcal{N}(0, I(\theta_0)^{-1})$. 35/78

38 Regularity conditions for the BvM theorem. 1. $\{x \in \mathcal{X} : p_\theta(x) > 0\}$ does not depend on $\theta$. 2. $L(\theta, x) = \log p_\theta(x)$ is three times differentiable w.r.t. $\theta$ in a neighborhood of $\theta_0$. 3. $E_{\theta_0}\big|\frac{\partial L}{\partial\theta}(\theta_0, X)\big| < \infty$, $E_{\theta_0}\big|\frac{\partial^2 L}{\partial\theta^2}(\theta_0, X)\big| < \infty$ and $E_{\theta_0}\sup_{\theta\in(\theta_0-\delta,\theta_0+\delta)}\big|\frac{\partial^3 L}{\partial\theta^3}(\theta, X)\big| < \infty$. 4. Integration w.r.t. $x$ and differentiation w.r.t. $\theta$ may be interchanged. 5. $I(\theta_0) > 0$. Remark: under these conditions the MLE is asymptotically normal as well, $\sqrt{n}\,(\hat\theta_n - \theta_0) \stackrel{d}{\longrightarrow} \mathcal{N}(0, I(\theta_0)^{-1})$. 36/78

39 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 36/78

40 Setting. Not a purely Bayesian framework: the training step is not necessarily Bayesian, only the prediction step is. Sample space $\mathcal{X} = \mathcal{X}_1\times\dots\times\mathcal{X}_d$ ($d$ features); some features may be categorical, some discrete, some continuous... Data: $X_i = (X_{i,1}, \dots, X_{i,d})$, $i = 1, \dots, n$. Classification problem: $X_i$ may come from any one of $K$ classes $(C_1, \dots, C_K)$. Example: $X_{i,1} \in \mathbb{R}^{p\times p}$: X-ray image from patient $i$; $X_{i,2} \in \{0,1\}$: result of a blood test from patient $i$; classes: {ill, healthy, healthy carrier}. Goal: predict the class $c \in \{1, \dots, K\}$ of a new patient. 37/78

41 Naive Bayes assumption. Conditionally on the class $c(i) \in \{1, \dots, K\}$ of observation $i$, the features $(X_{i,1}, \dots, X_{i,d})$ are independent. This looks like a strong (and erroneous) assumption! In practice it produces reasonable predictions (even though the posterior probabilities of each class are not to be taken too seriously). 38/78

42 1. Training step. Training set $\{(x_{i,j}, c(i)),\ i \in \{1,\dots,n\},\ j \in \{1,\dots,d\}\}$, $c(i) \in \{1,\dots,K\}$. For $k \in \{1,\dots,K\}$: retain the observations of class $k$: $i \in I_k$. For $j \in \{1,\dots,d\}$, estimate the class-conditional distribution with density $p_{j,k}(x_j) = p(x_{i,j} \mid c(i) = k)$, using the data $(x_{i,j})_{i\in I_k}$, usually in a parametric model with parameter $\theta_{j,k}$: estimated density $p_{j,k,\hat\theta_{j,k}}(\cdot)$. Output: the conditional distribution of $X$ given $C = k$, $\hat p_k(x) = \prod_{j=1}^d p_{j,k,\hat\theta_{j,k}}(x_j)$. 39/78

43 2. Computing the predictive class probabilities. Input: a new data point $x = (x_1, \dots, x_d)$. From step 1: conditional distributions of $X$ given $C = k$: $\hat p_k(\cdot) = \prod_j p_{j,k,\hat\theta_{j,k}}(\cdot)$ (plug-in method: neglect the estimation error of $\hat\theta_{j,k}$). (a) Assign a prior probability to each class: $\pi = (\pi_1, \dots, \pi_K)$, $\pi_k = P^\pi(C = k)$; together with step 1 this gives the joint density of $(X, C)$: $q(x, k) = \pi_k\,\hat p_k(x)$. (b) Apply the discrete Bayes formula:
$\pi(k \mid x) = \frac{\pi_k\,\hat p_k(x)}{\sum_{c=1}^K \pi_c\,\hat p_c(x)} = \frac{\pi_k\prod_{j=1}^d p_{j,k,\hat\theta_{j,k}}(x_j)}{\sum_{c=1}^K \pi_c\prod_{j=1}^d p_{j,c,\hat\theta_{j,c}}(x_j)}$.
Easy to implement! $O(KdN)$ for $N$ testing data points. 40/78

44 3. Final step: class prediction. Classification task: output = a predicted class for $x$. Naive Bayes prediction for a new point $x$ (a maximum a posteriori): $\hat c = \mathrm{argmax}_{k\in\{1,\dots,K\}}\ \pi(k \mid x)$. 41/78
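
A minimal from-scratch sketch of the three steps above (my own illustration, assuming NumPy, with continuous features modelled by per-class univariate Gaussians; the function names and the synthetic data are made up for the example, and any other parametric class-conditional model fits the same template):

    import numpy as np

    def fit_gaussian_nb(X, y, K):
        # Training step: for each class k and feature j, fit a univariate Gaussian by MLE
        params, priors = {}, np.zeros(K)
        for k in range(K):
            Xk = X[y == k]
            params[k] = (Xk.mean(axis=0), Xk.var(axis=0) + 1e-9)   # means, variances
            priors[k] = len(Xk) / len(X)                           # empirical class prior pi_k
        return params, priors

    def predict_gaussian_nb(x, params, priors):
        # Prediction step: log pi_k + sum_j log p_{j,k}(x_j), then argmax (MAP)
        log_post = np.empty(len(priors))
        for k, (mu, var) in params.items():
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
            log_post[k] = np.log(priors[k]) + log_lik
        return np.argmax(log_post)

    # Tiny usage example on synthetic data with 2 classes and 2 features
    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.repeat([0, 1], 50)
    params, priors = fit_gaussian_nb(X, y, K=2)
    print(predict_gaussian_nb(np.array([2.8, 3.1]), params, priors))   # expected: class 1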

45 Example: text document classification. 2 classes: {1 = spam, 2 = non-spam}; vocabulary $V = \{w_1, \dots, w_V\}$. Dataset: documents $T_i = (t_{i,j},\ j = 1, \dots, N_i)$, $i \le n$, with $N_i$ the number of words in $T_i$ and $t_{i,j} \in V$ the $j$-th word of $T_i$. 42/78

46 Conditional model (text documents). Naive Bayes assumption: in document $T_i$, conditionally on the class, the words are drawn independently from each other in the vocabulary $V$; $T_i$ can then be summarized by a bag of words $X_i = (X_{i,1}, \dots, X_{i,V})$, where $X_{i,j}$ is the number of occurrences of word $j$ in $T_i$. Conditional model for $X_i$ given its class $k \in \{1,2\}$: $\mathcal{L}(X_i \mid C = k) = \mathrm{Multi}\big(\theta_k = (\theta_{1,k}, \dots, \theta_{V,k}),\ N_i\big)$, i.e. $p_{k,\theta_k}(x) = \frac{N_i!}{\prod_{j=1}^V x_{i,j}!}\prod_{j=1}^V\theta_{j,k}^{x_{i,j}}$. 43/78

47 1. Training step (text documents). Fit separately 2 multinomial models, on spams and on non-spams. Here the Dirichlet prior $\mathrm{Diri}(a_1, \dots, a_V)$, $a_j > 0$, is conjugate for the multinomial model, with density $\mathrm{diri}(\theta \mid a_1, \dots, a_V) = \frac{\Gamma(\sum_{j=1}^V a_j)}{\prod_{j=1}^V\Gamma(a_j)}\prod_{j=1}^V\theta_j^{a_j-1}$ on $S_V = \{\theta \in \mathbb{R}_+^V : \sum_{j=1}^V\theta_j = 1\}$, the $(V-1)$-simplex. Mean of $\theta$ under $\pi = \mathrm{Diri}(a_1, \dots, a_V)$: $E^\pi(\theta) = \big(\frac{a_1}{\sum_j a_j}, \dots, \frac{a_V}{\sum_j a_j}\big)$. The posterior for $x_{1:n} = (x_{i,1}, \dots, x_{i,V})_{i\in\{1,\dots,n\}}$ is $\mathrm{Diri}\big(a_1 + \sum_{i=1}^n x_{i,1}, \dots, a_V + \sum_{i=1}^n x_{i,V}\big)$. 44/78

48 1. Training step (text documents), cont'd. Concatenate the documents of each class separately: $x^{(k)} = (x^{(k)}_j)_{j=1,\dots,V}$, $k = 1, 2$, with $x^{(k)}_j$ = total number of occurrences of word $j$ in the documents of class $k$. $\theta_k = (\theta_{k,1}, \dots, \theta_{k,V})$: multinomial parameter of class $k$. Flat priors on $\theta_k$: $\pi_1 = \pi_2 = \mathrm{Diri}(1, \dots, 1)$. Posterior mean estimates:
$\hat\theta_k = E^{\pi_k}[\theta \mid x^{(k)}] = \Big(\frac{x^{(k)}_1 + 1}{V + \sum_{j=1}^V x^{(k)}_j}, \dots, \frac{x^{(k)}_V + 1}{V + \sum_{j=1}^V x^{(k)}_j}\Big)$
(the prior acts as a regularizer: the +1 term avoids zero probabilities). 45/78

49 2. Prediction step. For a new document $x^{\mathrm{new}}$, the predictive probabilities of the classes are
$\pi(C = k \mid x^{\mathrm{new}}) = \frac{p(x^{\mathrm{new}} \mid C = k)\,\pi_k}{p(x^{\mathrm{new}} \mid C = 1)\,\pi_1 + p(x^{\mathrm{new}} \mid C = 2)\,\pi_2}$, with $p(x^{\mathrm{new}} \mid C = k) \propto \prod_{j=1}^V\hat\theta_{k,j}^{x^{\mathrm{new}}_j}$.
The class prediction is $\hat k(x^{\mathrm{new}}) = \mathrm{argmax}_{k=1,2}\ \pi(C = k \mid x^{\mathrm{new}})$. 46/78
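
A compact sketch of this spam filter (my own illustration, assuming NumPy; documents are assumed already encoded as bag-of-words count vectors, and the flat Dirichlet(1, ..., 1) prior yields the +1-smoothed estimates of slide 48):

    import numpy as np

    def train_multinomial_nb(X, y):
        # X: (n_docs, V) word-count matrix, y in {0, 1}; flat Diri(1, ..., 1) prior per class
        theta_hat = np.empty((2, X.shape[1]))
        log_prior = np.empty(2)
        for k in (0, 1):
            counts = X[y == k].sum(axis=0)                            # x^(k): total word counts of class k
            theta_hat[k] = (counts + 1) / (counts.sum() + X.shape[1])  # posterior mean estimate
            log_prior[k] = np.log((y == k).mean())
        return np.log(theta_hat), log_prior

    def predict_multinomial_nb(x_new, log_theta, log_prior):
        # log p(x_new | C = k) + log pi_k, up to a class-independent constant
        scores = log_prior + log_theta @ x_new
        return int(np.argmax(scores)), np.exp(scores - np.logaddexp(*scores))

    # Toy usage: vocabulary of 3 words, 4 training documents
    X = np.array([[5, 0, 1], [4, 1, 0], [0, 3, 2], [1, 4, 3]])
    y = np.array([0, 0, 1, 1])
    log_theta, log_prior = train_multinomial_nb(X, y)
    print(predict_multinomial_nb(np.array([3, 0, 1]), log_theta, log_prior))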

50 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 46/78

51 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice 47/78

52 The regression problem. Supervised learning: training dataset $(x_i, Y_i)$, $i \le n$, with $x_i \in \mathcal{X}$ the features of observation $i$ (considered non-random) and $Y_i \in \mathbb{R}$ the label (a random variable). Goal: for a new observation with features $x^{\mathrm{new}}$, predict $Y^{\mathrm{new}}$, i.e. construct a regression function $h \in \mathcal{H}$ so that $h(x)$ is our best prediction of $Y$ at point $x$. $h$ should be simple (avoid over-fitting): simple class $\mathcal{H}$; and fit the data well: measured through a loss function $L(x, y, h)$, e.g. the squared error loss $L(x, y, h) = (y - h(x))^2$. 48/78

53 Multiple classical strategies. Statistical learning approach: empirical risk minimization, $R_n(x_{1:n}, y_{1:n}, h) = \frac{1}{n}\sum_i L(x_i, y_i, h)$; minimize $R_n(x_{1:n}, y_{1:n}, h)$ over $h \in \mathcal{H}$. Probabilistic modelling approach (likelihood based): assume e.g. $Y_i = h_0(x_i) + \epsilon_i$, with $\epsilon_i \sim P_\epsilon$ independent noise variables, e.g. $P_\epsilon = \mathcal{N}(0, \sigma^2)$, $\sigma^2$ known or not; likelihood of $h$: $p_h(x_{1:n}, y_{1:n}) = \prod_{i=1}^n p_\epsilon(y_i - h(x_i))$; minimize $-\sum_{i=1}^n\log p_\epsilon(y_i - h(x_i))$ over $h \in \mathcal{H}$. With Gaussian noise, both strategies coincide. 49/78

54 Linear regression. $h$: a linear combination of basis functions (feature maps) $\phi_j : \mathcal{X} \to \mathbb{R}$, $j \in \{1,\dots,p\}$: $h(x) = \sum_{j=1}^p\theta_j\phi_j(x)$, with $\theta_j$ unknown and $\phi_j$ known, i.e. $\mathcal{H} = \{\sum_{j=1}^p\theta_j\phi_j : \theta = (\theta_1, \dots, \theta_p) \in \mathbb{R}^p\}$. Examples: $\mathcal{X} = \mathbb{R}^p$, $\phi_j(x) = x_j$: canonical feature map; $\mathcal{X} = \mathbb{R}$, $\phi_j(x) = x^{j-1}$: polynomial basis functions; $\mathcal{X} = \mathbb{R}^d$, $\phi_j(x) = \frac{1}{(2\pi)^{d/2}\sqrt{\det\Sigma_j}}\exp\{-\frac{1}{2}(x-\mu_j)^\top\Sigma_j^{-1}(x-\mu_j)\}$: Gaussian basis functions. 50/78

55 Empirical risk minimization for linear regression. Empirical risk: $R_n(x_{1:n}, y_{1:n}, \theta) = \frac{1}{2}\sum_{i=1}^n\big(y_i - \langle\theta, \phi(x_i)\rangle\big)^2 = \frac{1}{2}\|y_{1:n} - \Phi\theta\|^2$, with $\Phi \in \mathbb{R}^{n\times p}$ the design matrix, $\Phi_{i,j} = \phi_j(x_i)$. Minimizer of $R_n$: the least squares estimator; explicit solution when $\Phi^\top\Phi$ is of rank $p$ (invertible): $\hat\theta = (\Phi^\top\Phi)^{-1}\Phi^\top y_{1:n}$. 51/78

56 Regularization. Goals: prevent over-fitting and numerical instabilities (inversion of $\Phi^\top\Phi$). Add a complexity penalty (a function of $\theta$) to the empirical risk: penalty $\lambda\|\theta\|_2^2$: ridge regression; penalty $\lambda\|\theta\|_1$: Lasso regression. E.g. with the $L_2$ penalty, the optimization problem becomes $\hat\theta = \mathrm{argmin}_\theta\ \|y_{1:n} - \Phi\theta\|^2 + \lambda\|\theta\|_2^2$ for some $\lambda > 0$, with solution $\hat\theta = \big[\Phi^\top\Phi + \lambda I_p\big]^{-1}\Phi^\top y_{1:n}$. 52/78
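
A small numerical sketch of the ridge solution (my own illustration, assuming NumPy), on a polynomial design matrix; np.linalg.solve is used instead of an explicit matrix inverse:

    import numpy as np

    rng = np.random.default_rng(5)
    x = np.linspace(0, 2 * np.pi, 30)
    y = np.sin(x) + rng.normal(0, 0.2, size=x.size)      # noisy observations of sin(x)

    p = 5
    Phi = np.vander(x, N=p, increasing=True)             # design matrix, Phi_ij = x_i^(j-1)

    lam = 0.1
    theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
    theta_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]    # unpenalized least squares, for comparison

    print(np.round(theta_ridge, 3))
    print(np.round(theta_ls, 3))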

57 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice 53/78

58 Bayesian linear model. Again, $Y_i = \langle\theta, \phi(x_i)\rangle + \epsilon_i$. Assume $\epsilon_i \sim \mathcal{N}(0, \beta^{-1})$, with $\beta > 0$ the noise precision, viewed as a constant (known or not). Prior distribution on $\theta \in \mathbb{R}^p$: $\pi = \mathcal{N}(m_0, S_0)$. Independence assumption: $\epsilon_{1:n} \perp \theta$. $Y = Y_{1:n} = \Phi\theta + \epsilon_{1:n}$, with $\Phi \in \mathbb{R}^{n\times p}$, $\Phi_{i,j} = \phi_j(x_i)$. Bayesian model: $\theta \sim \pi = \mathcal{N}(m_0, S_0)$ and $\mathcal{L}[Y \mid \theta] = \mathcal{N}(\Phi\theta, \frac{1}{\beta}I_n)$. Natural Bayesian estimator: $\hat\theta = E^\pi(\theta \mid Y_{1:n})$. Posterior distribution? 54/78

59 Conditioning and augmenting Gaussian vectors. Lemma: let $W \sim \mathcal{N}(\mu, \Lambda^{-1})$ and $\mathcal{L}[Y \mid w] = \mathcal{N}(Aw + b, L^{-1})$, i.e. $Y = AW + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, L^{-1})$, $\epsilon \perp W$. Then $\mathcal{L}[W \mid y] = \mathcal{N}(m_y, S)$ with $S = (\Lambda + A^\top L A)^{-1}$ and $m_y = S\big[A^\top L(y - b) + \Lambda\mu\big]$. Proof: homework (see the exercise sheet online). 55/78

60 Application to posterior computation. Using the lemma with $A = \Phi$, $b = 0$, $W = \theta$, $\Lambda = S_0^{-1}$, $\mu = m_0$, $L = \beta I_n$, we obtain immediately the posterior distribution $\pi(\cdot \mid Y_{1:n}) = \mathcal{L}[\theta \mid y_{1:n}] = \mathcal{N}(m_n, S_n)$ with
$S_n = \big(S_0^{-1} + \beta\Phi^\top\Phi\big)^{-1}$, $m_n = S_n\big(\beta\Phi^\top y_{1:n} + S_0^{-1}m_0\big)$.   (1)
Posterior mean estimate: $\hat\theta = E^\pi[\theta \mid y_{1:n}] = m_n$. 56/78

61 Special case: diagonal, centered prior. Choose $m_0 = 0$, $S_0 = \alpha^{-1}I_p$, with $\alpha$ the prior precision (it makes sense!). Then (1) becomes
$S_n = \big(\alpha I_p + \beta\Phi^\top\Phi\big)^{-1}$, $m_n = S_n\,\beta\Phi^\top y_{1:n} = \big(\tfrac{\alpha}{\beta}I_p + \Phi^\top\Phi\big)^{-1}\Phi^\top y_{1:n}$: the penalized least squares solution.   (2)
Adding a prior $\mathcal{N}(0, \alpha^{-1}I_p)$ $\Leftrightarrow$ adding an $L_2$ regularization with parameter $\lambda = \alpha/\beta$. Remark: a narrow prior $\Leftrightarrow$ large $\alpha$ $\Leftrightarrow$ large penalty. 57/78

62 Predictive distribution. New data point $(x^{\mathrm{new}}, Y^{\mathrm{new}})$, with $Y^{\mathrm{new}}$ not observed and $x^{\mathrm{new}}$ known. Goal: obtain the posterior distribution of $Y^{\mathrm{new}}$ (mean, variance, credible intervals). We still have $Y^{\mathrm{new}} = \langle\theta, \phi(x^{\mathrm{new}})\rangle + \epsilon$, $\epsilon \sim \mathcal{N}(0, \beta^{-1})$ and $\epsilon \perp \theta$. Now (after the training step) $\theta \sim \pi(\cdot \mid y_{1:n}) = \mathcal{N}(m_n, S_n)$. Thus $Y^{\mathrm{new}}$ is a linear transform of the Gaussian vector $(\epsilon, \theta)$, and
$\mathcal{L}[Y^{\mathrm{new}} \mid y_{1:n}] = \mathcal{N}\big(\phi(x^{\mathrm{new}})^\top m_n,\ \phi(x^{\mathrm{new}})^\top S_n\phi(x^{\mathrm{new}}) + \beta^{-1}\big)$. 58/78
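
The whole pipeline fits in a few lines. The sketch below (my own illustration, assuming NumPy, with a polynomial basis of degree 4 and arbitrarily fixed $\alpha, \beta$) computes the posterior $(m_n, S_n)$ of equations (1)-(2) and the predictive mean and variance at new inputs:

    import numpy as np

    rng = np.random.default_rng(6)
    x = np.linspace(0, 2 * np.pi, 25)
    y = np.sin(x) + rng.normal(0, 0.2, size=x.size)          # h_0(x) = sin(x) plus noise

    alpha, beta = 0.01, 25.0                                  # prior precision, noise precision
    p = 5
    Phi = np.vander(x, N=p, increasing=True)                  # phi(x) = (1, x, x^2, x^3, x^4)

    # Posterior over theta: N(m_n, S_n), with S_0 = alpha^{-1} I_p, m_0 = 0
    S_n = np.linalg.inv(alpha * np.eye(p) + beta * Phi.T @ Phi)
    m_n = beta * S_n @ Phi.T @ y

    # Predictive distribution at new inputs
    x_new = np.linspace(0, 2 * np.pi, 5)
    Phi_new = np.vander(x_new, N=p, increasing=True)
    pred_mean = Phi_new @ m_n
    pred_var = np.einsum("ij,jk,ik->i", Phi_new, S_n, Phi_new) + 1.0 / beta

    for xn, m, v in zip(x_new, pred_mean, pred_var):
        lo, hi = m - 1.96 * np.sqrt(v), m + 1.96 * np.sqrt(v)
        print(f"x={xn:4.2f}  mean={m:6.3f}  95% interval=({lo:6.3f}, {hi:6.3f})")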

63 Example: polynomial basis functions. True regression function: $h_0(x) = \sin(x)$. Polynomial basis functions: $\phi(x) = (1, x, x^2, x^3, x^4)$ ($p = 5$). [Figure: noisy observations $y_i = h_0(x_i) + \epsilon_i$ around the true curve $h_0(x) = \sin(x)$.] 59/78

64 Estimated regression function: $\hat h(x) = \langle\hat\theta, \phi(x)\rangle = \hat\theta_1 + \sum_{j=2}^5\hat\theta_j x^{j-1}$, with the previous dataset. [Figure: estimated regression function vs. the true $h_0(x) = \sin(x)$ and the observations, for $\alpha = 0.01$ (left) and $\alpha = 100$ (right).] 60/78

65 Predictive distribution. $\hat h(x)$: the mean of $\mathcal{L}(Y^{\mathrm{new}} \mid y_{1:n})$ for $x^{\mathrm{new}} = x$. Recall $\mathcal{L}(Y^{\mathrm{new}} \mid y_{1:n}) = \mathcal{N}(\hat h(x), \sigma^2_{\mathrm{new}})$ with $\sigma^2_{\mathrm{new}} = \phi(x)^\top S_n\phi(x) + \beta^{-1}$. Posterior credible interval for $Y$: $I_x = \big[\hat h(x) - 1.96\sqrt{\sigma^2_{\mathrm{new}}},\ \hat h(x) + 1.96\sqrt{\sigma^2_{\mathrm{new}}}\big]$. [Figure: posterior mean and credible intervals vs. the true $h_0(x) = \sin(x)$ and the observations, for $\alpha = 0.01$ (left) and $\alpha = 100$ (right).] 61/78

66 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 61/78

67 Model choice problem. What if several models are in competition: $\{M_k, k \in \{1,\dots,K\}\}$, with $M_k = \{\Theta_k, \pi_k\}$? Continuous case: a family of models $\{M_\alpha, \alpha \in A\}$. How to choose $k$ or $\alpha$? Examples: $M_1 = \{\Theta, \pi_1\}$, $M_2 = \{\Theta, \pi_2\}$ with $\pi_1$ a flat prior and $\pi_2$ the Jeffreys prior; $M_\alpha$: the linear model with Gaussian noise $\mathcal{N}(0, \alpha^{-1})$. 62/78

68 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 63/78

69 Hierarchical models. Bayesian view: put a prior on unknown quantities, then condition upon the data. Model choice problem: put a hyper-prior on $\alpha \in A$ (or on $k \in \{1,\dots,K\}$): a hierarchical Bayesian model. Convenient when dealing with parallel experiments. 64/78

70 Example of a hierarchical model. Example: 2 rivers with fish. $X_i \in \{0,1\}$: whether a caught fish is ill or healthy. $X_i \sim \mathrm{Ber}(\theta)$, with $\theta = \theta_1$ in river 1 and $\theta = \theta_2$ in river 2; $\theta_1$ and $\theta_2$ are 2 realizations of $\theta \sim \mathrm{Beta}(a, b)$. $\alpha = (a, b)$: hyper-parameter of the prior. Hierarchical Bayes: put a prior on $\alpha$ (e.g. a product of 2 independent Gamma distributions). 65/78

71 Posterior mean estimates in a BMA framework. Denote by $\pi^h$ the hyper-prior on $k$ (or $\alpha$). Let us stick to the discrete case, $k \in \{1,\dots,K\}$. The prior is a mixture distribution $\pi = \sum_{k=1}^K\pi^h(k)\,\pi_k(\cdot)$, i.e. for every $\pi$-integrable function $g(\theta)$, $E^\pi[g(\theta)] = E^{\pi^h}\big[E^\pi[g(\theta) \mid k]\big] = \sum_{k=1}^K\pi^h(k)\int_{\Theta_k}g(\theta)\,\mathrm{d}\pi_k(\theta)$. So is the posterior distribution; thus the posterior mean is a weighted average:
$\hat g = E^\pi[g(\theta) \mid X_{1:n}] = E^{\pi^h}\big[E^\pi[g(\theta) \mid k, X_{1:n}]\ \big|\ X_{1:n}\big] = \sum_{k=1}^K\pi^h(k \mid X_{1:n})\underbrace{\int_{\Theta_k}g(\theta)\,\mathrm{d}\pi_k(\theta \mid X_{1:n})}_{\hat g_k:\ \text{posterior mean in model }k}$. 66/78

72 Model evidence. Computing the posterior mean in the BMA framework requires: computing the posterior means in each individual model $k$ (moderate tasks), and averaging them with the weights $\pi^h(k \mid X_{1:n})$, the posterior weight of model $k$, given by the Bayes formula
$\pi^h(k \mid X_{1:n}) = \frac{\pi^h(k)\,p(X_{1:n} \mid k)}{\sum_{j=1}^K\pi^h(j)\,p(X_{1:n} \mid j)}$, with $p(X_{1:n} \mid k) = \int_{\Theta_k}p(X_{1:n} \mid \theta)\,\mathrm{d}\pi_k(\theta) = m_k(X_{1:n})$:
the evidence of model $k$, i.e. the marginal likelihood of $X_{1:n}$ in model $k$, hard to compute (an integral). 67/78

73 Shortcomings of BMA. Inference has to be done in each individual model. Usually one weight (say $\pi(k^* \mid X_{1:n})$) dominates all the others (reason: concentration of the posterior around the true $\theta_0 \in \Theta_{k_0}$, with $k^* = k_0$), so the final estimate is $\hat g \approx \hat g_{k_0}$; the other $\hat g_k$'s are almost useless. Bottleneck: compute $k^*$ — a model choice problem. 68/78

74 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 69/78

75 Posterior weights, model evidence and Bayes factor. Recall $\hat k = \mathrm{argmax}_k\ \pi(k \mid X_{1:n}) = \mathrm{argmax}_k\ p(X_{1:n} \mid k)\,\pi^h(k)$, where $p(X_{1:n} \mid k)$ is the evidence of model $k$. With a uniform prior on $k$, only the evidence $p(X_{1:n} \mid k)$ matters; in any case, the prior influence vanishes with $n$. Relevant quantity to compare models $k$ and $j$: the Bayes factor (Jeffreys, 1961), $B_{kj} = \frac{p(X_{1:n} \mid k)}{p(X_{1:n} \mid j)}$. Suggested scale for decision making:
log10 B_kj    B_kj        evidence against model j
0 to 1/2      1 to 3.2    not significant
1/2 to 1      3.2 to 10   substantial
1 to 2        10 to 100   strong
> 2           > 100       decisive
70/78
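
A worked example (my own sketch, assuming SciPy) for coin-flip data: model 1 fixes $\theta = 1/2$, model 2 puts a flat Beta(1, 1) prior on $\theta$; both evidences are available in closed form (the Beta-Binomial marginal likelihood), and the resulting Bayes factor can be read against the scale above.

    import numpy as np
    from scipy.special import betaln

    def log_evidence_fixed(x, theta=0.5):
        # Model 1: theta fixed; evidence = product of Bernoulli probabilities
        s, n = x.sum(), len(x)
        return s * np.log(theta) + (n - s) * np.log(1 - theta)

    def log_evidence_beta(x, a=1.0, b=1.0):
        # Model 2: theta ~ Beta(a, b); marginal likelihood of x_{1:n} in closed form
        s, n = x.sum(), len(x)
        return betaln(a + s, b + n - s) - betaln(a, b)

    rng = np.random.default_rng(7)
    x = rng.binomial(1, 0.7, size=100)                      # data generated with theta_0 = 0.7

    log_B21 = log_evidence_beta(x) - log_evidence_fixed(x)
    print("log10 B_21 =", round(log_B21 / np.log(10), 2))   # compare with the scale above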

76 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 71/78

77 Occam s razor principle Avoid over-fitting Between 2 models explaining the data equally well, one ought to choose the simplest one. Better generalization properties. 72/78

78 Occam's razor and model evidence. When selecting $k$ according to the model evidences $p(X_{1:n} \mid k)$, Occam's razor is automatically implemented. Reason: the prior plays the role of a regularizer. 73/78

79 Automatic complexity penalty: intuition 1. Complex model $\Rightarrow$ large $\Theta_k$ $\Rightarrow$ small $\pi_k(\theta)$ (if uniform over $\Theta_k$) $\Rightarrow$ $\int_{\Theta_k}p_\theta(x_{1:n})\,\pi_k(\theta)\,\mathrm{d}\theta$ is small (average over large regions where $p_\theta(x_{1:n})$ is small). 74/78

80 Automatic complexity penalty: intuition 2. If $\Theta_k \subset \mathbb{R}$: assume $\pi_k$ flat over an interval of length $\Delta^{\mathrm{prior}}_k$, and $\theta \mapsto p_\theta(X_{1:n})$ peaked around $p_{\theta_{\mathrm{MAP}},k}(X_{1:n})$ with width $\Delta^{\mathrm{posterior}}_k$. Then $\pi_k(\theta) \approx 1/\Delta^{\mathrm{prior}}_k$ and $p(X_{1:n} \mid k) = \int_{\Theta_k}p_\theta(x)\,\pi_k(\theta)\,\mathrm{d}\theta \approx p_{\theta_{\mathrm{MAP}},k}(X_{1:n})\,\frac{\Delta^{\mathrm{posterior}}_k}{\Delta^{\mathrm{prior}}_k}$, so
$\log p(X_{1:n} \mid k) \approx \log p_{\theta_{\mathrm{MAP}},k}(X_{1:n}) + \underbrace{\log\frac{\Delta^{\mathrm{posterior}}_k}{\Delta^{\mathrm{prior}}_k}}_{\text{complexity penalty}}$.
If $\Theta_k \subset \mathbb{R}^d$ and the same approximation holds in each dimension:
$\log p(X_{1:n} \mid k) \approx \log p_{\theta_{\mathrm{MAP}},k}(X_{1:n}) + \underbrace{d\,\log\frac{\Delta^{\mathrm{posterior}}_k}{\Delta^{\mathrm{prior}}_k}}_{\text{dimension}\ \times\ \text{complexity penalty}}$. 75/78
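
To see this penalty in action (my own sketch, assuming NumPy, and reusing the Bayesian linear model of section 4 with prior $\mathcal{N}(0, \alpha^{-1}I_p)$ and noise precision $\beta$): marginally $y \sim \mathcal{N}(0, \alpha^{-1}\Phi\Phi^\top + \beta^{-1}I_n)$, so the log-evidence of each polynomial degree is available exactly. On data simulated from a sine curve, the evidence typically stops improving, and eventually degrades, once the degree grows beyond what the data support.

    import numpy as np

    def log_evidence(Phi, y, alpha, beta):
        # Marginal likelihood of the Bayesian linear model: y ~ N(0, C),
        # with C = alpha^{-1} Phi Phi^T + beta^{-1} I_n
        n = len(y)
        C = Phi @ Phi.T / alpha + np.eye(n) / beta
        sign, logdet = np.linalg.slogdet(C)
        return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

    rng = np.random.default_rng(8)
    x = np.linspace(0, 1, 40)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)

    alpha, beta = 1.0, 25.0
    for degree in range(10):
        Phi = np.vander(x, N=degree + 1, increasing=True)   # polynomial model of this degree
        print(degree, round(log_evidence(Phi, y, alpha, beta), 1))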

81 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 76/78

82 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 77/78

83 Bibliography. [Ghosh and Ramamoorthi, 2003] Ghosh, J. and Ramamoorthi, R. (2003). Bayesian Nonparametrics. Springer. [Schervish, 2012] Schervish, M. J. (2012). Theory of Statistics. Springer Science & Business Media. [Van der Vaart, 1998] Van der Vaart, A. W. (1998). Asymptotic Statistics, volume 3. Cambridge University Press. 78/78


More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

Maximum likelihood estimation

Maximum likelihood estimation Maximum likelihood estimation Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Maximum likelihood estimation 1/26 Outline 1 Statistical concepts 2 A short review of convex analysis and optimization

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

Linear Models A linear model is defined by the expression

Linear Models A linear model is defined by the expression Linear Models A linear model is defined by the expression x = F β + ɛ. where x = (x 1, x 2,..., x n ) is vector of size n usually known as the response vector. β = (β 1, β 2,..., β p ) is the transpose

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Machine Learning - MT Classification: Generative Models

Machine Learning - MT Classification: Generative Models Machine Learning - MT 2016 7. Classification: Generative Models Varun Kanade University of Oxford October 31, 2016 Announcements Practical 1 Submission Try to get signed off during session itself Otherwise,

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

2.3 Exponential Families 61

2.3 Exponential Families 61 2.3 Exponential Families 61 Fig. 2.9. Watson Nadaraya estimate. Left: a binary classifier. The optimal solution would be a straight line since both classes were drawn from a normal distribution with the

More information

Probability and Estimation. Alan Moses

Probability and Estimation. Alan Moses Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.

More information

Kernel methods, kernel SVM and ridge regression

Kernel methods, kernel SVM and ridge regression Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;

More information

Nonparameteric Regression:

Nonparameteric Regression: Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,

More information

Bayesian methods in economics and finance

Bayesian methods in economics and finance 1/26 Bayesian methods in economics and finance Linear regression: Bayesian model selection and sparsity priors Linear Regression 2/26 Linear regression Model for relationship between (several) independent

More information

GAUSSIAN PROCESS REGRESSION

GAUSSIAN PROCESS REGRESSION GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The

More information

ST5215: Advanced Statistical Theory

ST5215: Advanced Statistical Theory Department of Statistics & Applied Probability Monday, September 26, 2011 Lecture 10: Exponential families and Sufficient statistics Exponential Families Exponential families are important parametric families

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information