Introduction to Bayesian learning Lecture 2: Bayesian methods for (un)supervised problems


1 Introduction to Bayesian learning Lecture 2: Bayesian methods for (un)supervised problems Anne Sabourin, Ass. Prof., Telecom ParisTech September /78

2 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 1/78

3 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 2/78

4 Reminder: conjugate priors (last lecture, see lecture notes). Definition: hyper-parameter, conjugate family. Examples: normal model, known variance - normal prior; normal model, known mean - gamma prior on the inverse variance; normal model, unknown mean and variance - normal-gamma prior on $(\mu, \lambda = 1/\sigma^2)$. 3/78

5 Conjugate priors for the multivariate normal: $X \sim \mathcal{N}(\mu, \Lambda^{-1})$, $\mu \in \mathbb{R}^d$, $\Lambda \in \mathbb{R}^{d\times d}$ positive definite (precision matrix: inverse of the covariance matrix). 1. unknown mean: the conjugate prior family on $\mu$ is a family of multivariate Gaussian distributions. 2. unknown precision: conjugate priors on $\Lambda$ are the Wishart distributions $\mathcal{W}(\nu, W)$ with $\nu$ degrees of freedom ($\nu \in \mathbb{N}$) and $W \in \mathbb{R}^{d\times d}$. 4/78

6 Wishart distribution: defined on the cone of positive definite matrices. The Wishart distribution $\mathcal{W}(\nu, W)$ has density $f_W(\Lambda \mid \nu, W) = \frac{1}{B}(\det\Lambda)^{(\nu-d-1)/2}\exp\{-\frac{1}{2}\mathrm{Tr}(W^{-1}\Lambda)\}$ w.r.t. the Lebesgue measure $\prod_{i\le j}\mathrm{d}\lambda_{(i,j)}$ on $\mathbb{R}^{d(d+1)/2}$, restricted to the set of positive definite matrices; $B$: a normalizing constant. Probabilistic representation: let $M$ be a random $\nu\times d$ matrix with i.i.d. rows $M_{(i,\cdot)} \sim \mathcal{N}(0, W)$. Then $\Lambda \sim \mathcal{W}(\nu, W) \iff \Lambda \stackrel{d}{=} M^\top M = \sum_{i=1}^{\nu} M_{(i,\cdot)} M_{(i,\cdot)}^\top$. More details: see e.g. Eaton, Multivariate Statistics: A Vector Space Approach, 2007 (Chapter 8). 5/78

7 Conjugate priors for the multivariate normal, cont'd. 3. Unknown mean and precision: the conjugate prior family on $(\mu, \Lambda)$ is the Gaussian-Wishart distribution with hyper-parameters $(W, \nu, m, \beta)$: $\pi(\mu, \Lambda) = \pi_1(\Lambda)\,\pi_2(\mu \mid \Lambda)$ with $\pi_1 = \mathcal{W}(W, \nu)$, $\nu \in \mathbb{N}$, $W$ positive definite, and $\pi_2(\cdot \mid \Lambda) = \mathcal{N}(m, (\beta\Lambda)^{-1})$, $m \in \mathbb{R}^d$, $\beta > 0$. 6/78

8 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 7/78

9 Definition: exponential family. A dominated parametric model $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ is an exponential family if the densities write $p_\theta(x) = C(\theta)\,h(x)\,\exp\{\langle T(x), R(\theta)\rangle\}$ for some functions $R : \Theta \to \mathbb{R}^k$, $T : \mathcal{X} \to \mathbb{R}^k$, $C : \Theta \to \mathbb{R}_+$, $h : \mathcal{X} \to \mathbb{R}_+$. $C(\theta)$: a normalizing constant. $R(\theta)$: the natural parameter ($R$: the good re-parametrization). If $R(\theta) = \theta$, the family is natural. Most textbook distributions belong to the exponential family! 8/78

10 Example I: Bernoulli model. $\theta \in \Theta = ]0,1[$, $\mathcal{X} = \{0,1\}$. The model is dominated by $\lambda = \delta_0 + \delta_1$: $p_\theta(x) = \theta^x(1-\theta)^{1-x} = \exp\{x\log\theta + (1-x)\log(1-\theta)\} = (1-\theta)\exp\{\underbrace{x}_{T(x)}\underbrace{\log\tfrac{\theta}{1-\theta}}_{R(\theta)}\}$. The model is an exponential family with $T(x) = x$, natural parameter $\rho = R(\theta) = \log\frac{\theta}{1-\theta}$, normalizing constant $C(\theta) = 1-\theta$. 9/78

11 Example II: Gaussian model. $\theta = (\mu, \sigma^2) \in \Theta = \mathbb{R}\times\mathbb{R}_+$; the model is dominated by the Lebesgue measure on $\mathcal{X} = \mathbb{R}$: $p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\{-\frac{x^2 - 2\mu x + \mu^2}{2\sigma^2}\} = \underbrace{\tfrac{1}{\sqrt{2\pi\sigma^2}}\exp\{-\tfrac{\mu^2}{2\sigma^2}\}}_{C(\theta)}\exp\langle\underbrace{(x, x^2)}_{T(x)}, \underbrace{(\mu/\sigma^2,\ -1/(2\sigma^2))}_{R(\theta)}\rangle$. The model is an exponential family with $T(x) = (x, x^2)$, natural parameter $\rho = R(\theta) = (\mu/\sigma^2,\ -1/(2\sigma^2))$, normalizing constant $C(\theta) = (2\pi\sigma^2)^{-1/2}\exp\{-\mu^2/(2\sigma^2)\}$. 10/78

12 Likelihood for i.i.d. samples in exponential families: $p_\theta(x_1) = C(\theta)h(x_1)\exp\{\langle R(\theta), T(x_1)\rangle\}$, and $p_\theta^n(x_{1:n}) = C(\theta)^n\underbrace{\prod_{i=1}^n h(x_i)}_{h_n(x_{1:n})}\exp\{\langle\underbrace{\sum_{i=1}^n T(x_i)}_{T_n(x_{1:n})}, R(\theta)\rangle\}$. 11/78

13 Natural parameter space. Natural parametrization: $\rho = R(\theta)$. The density $p_\rho(x) = C(\rho)h(x)\exp\langle T(x), \rho\rangle$ integrates to 1 $\iff \rho \in E$, the natural parameter space, i.e. $E = \{\rho : \int_{\mathcal{X}} h(x)\exp\langle T(x), \rho\rangle\,\mathrm{d}\lambda(x) < \infty\}$. If $E$ is open: the family is regular. 12/78

14 Maximum likelihood in regular exponential families. Natural parametrization: $\rho = R(\theta)$; $\lambda$: reference measure. Lemma (expression for $E_\rho[T(X)]$): $E_\rho[T(X)] = -\nabla_\rho \ln C(\rho)$. Proof: $1 = C(\rho)\int_{\mathcal{X}} h(x)\exp\langle T(x),\rho\rangle\,\mathrm{d}\lambda(x)$; differentiating in $\rho$ (with enough regularity to exchange $\nabla$ and $\int$), $0 = \nabla_\rho C(\rho)\underbrace{\int_{\mathcal{X}} h(x)\exp\langle T(x),\rho\rangle\,\mathrm{d}\lambda(x)}_{C(\rho)^{-1}} + C(\rho)\int_{\mathcal{X}} h(x)T(x)\exp\langle T(x),\rho\rangle\,\mathrm{d}\lambda(x)$, i.e. $0 = \frac{1}{C(\rho)}\nabla_\rho C(\rho) + E_\rho(T(X))$. 13/78

15 Maximum likelihood in regular exponential families, cont'd. Proposition: the MLE $\hat\rho$ in a regular exponential family satisfies $E_{\hat\rho}[T(X)] = \frac{1}{n}\sum_i T(x_i)$. Proof: $\nabla_\rho\log p_\rho^n(x_{1:n}) = 0 \iff \nabla_\rho\{n\log C(\rho) + \langle\sum_i T(x_i), \rho\rangle\} = 0 \iff -\nabla_\rho\log C(\hat\rho) = \frac{1}{n}\sum_i T(x_i)$; then use the lemma. 14/78
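
As a quick illustration (not from the slides; a minimal sketch assuming NumPy is available), the moment-matching equation above can be checked numerically in the Gaussian example of slide 11, where $T(x) = (x, x^2)$: solving $E_{\hat\rho}[T(X)] = \frac{1}{n}\sum_i T(x_i)$ gives $\hat\mu = \bar x$ and $\hat\sigma^2 = \overline{x^2} - \bar x^2$.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # i.i.d. sample from N(mu=2, sigma=1.5)

    # Empirical means of the sufficient statistics T(x) = (x, x^2)
    t1_bar, t2_bar = x.mean(), (x ** 2).mean()

    # Moment matching: E[X] = mu, E[X^2] = mu^2 + sigma^2
    mu_hat = t1_bar
    sigma2_hat = t2_bar - t1_bar ** 2

    print(mu_hat, sigma2_hat)   # close to (2.0, 2.25) for a large sample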

16 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 15/78

17 Conjugate priors in exponential families. Proposition: a natural exponential family with densities $p_\theta(x) = C(\theta)h(x)\exp\langle\theta, T(x)\rangle$ admits the conjugate prior family $\mathcal{F} = \{\pi_{\lambda,\mu},\ \lambda > 0,\ \mu \in M_\lambda \subset \mathbb{R}^k\}$, with $\pi_{\lambda,\mu}(\theta) = K(\mu,\lambda)\,C(\theta)^\lambda\exp\{\langle\theta, \mu\rangle\}$ and $M_\lambda = \{\mu : \int_\Theta \pi_{\lambda,\mu}(\theta)\,\mathrm{d}\theta < \infty\}$. The posterior for $n$ observations is $\pi_{\lambda,\mu}(\theta \mid x_{1:n}) \propto C(\theta)^{\lambda+n}\exp\{\langle\theta, \mu + \sum_i T(x_i)\rangle\}$, so that $\pi_{\lambda,\mu}(\cdot \mid x_{1:n}) = \pi_{\lambda_n,\mu_n}(\cdot)$, with $\lambda_n = \lambda + n$ and $\mu_n = \mu + \sum_i T(x_i)$. Proof: exercise. 16/78

18 Example: Poisson model. $p_\theta(x) = e^{-\theta}\theta^x/x!$, $\mathcal{X} = \mathbb{N}$, $\theta > 0$; $p_\theta(x) = \frac{1}{x!}e^{-\theta}e^{x\log\theta}$: an exponential family with $T(x) = x$, $\rho = R(\theta) = \log\theta \in \mathbb{R}$, $C(\rho) = \exp\{-e^\rho\}$. Conjugate prior for $\rho$: $\pi_{a,b}(\rho) \propto \exp\{-b e^\rho\}\exp\{a\rho\}$. Back to $\theta$: $\pi(\theta) = \pi_{a,b}(\rho)\,\big|\frac{\mathrm{d}\rho}{\mathrm{d}\theta}\big| \propto \theta^{a-1}\exp\{-b\theta\}$ (Gamma density). The Gamma family is a conjugate prior family for $\theta$. 17/78
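
A minimal sketch of the resulting conjugate update (my own illustration, assuming SciPy is available): with a Gamma(a, b) prior on $\theta$ and Poisson observations $x_1, \dots, x_n$, the posterior is Gamma$(a + \sum_i x_i,\ b + n)$.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    theta_true = 3.0
    x = rng.poisson(theta_true, size=50)          # Poisson observations

    a, b = 2.0, 1.0                                # Gamma(a, b) prior (shape, rate)
    a_post, b_post = a + x.sum(), b + len(x)       # conjugate update

    # Posterior summaries (scipy parametrizes the Gamma with scale = 1/rate)
    post = stats.gamma(a=a_post, scale=1.0 / b_post)
    print(post.mean(), post.interval(0.95))        # posterior mean and 95% credible interval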

19 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 18/78

20 About the choice of a conjugate prior. A convenient choice only. One must still choose the hyper-parameters $(\lambda, \mu)$: this is an issue of model choice. It is possible to do so via empirical Bayes methods; see lecture 2 and the lab session. 19/78

21 Other prior choices: non-informative priors. Goal: minimize the bias induced by the prior. If $\Theta$ is compact: one can choose $\pi(\theta) = $ constant. If $\Theta$ is non-compact, $\int_\Theta \pi(\theta)\,\mathrm{d}\theta = \int_\Theta C\,\mathrm{d}\theta = +\infty$. It is OK to do so as long as the posterior is well defined, i.e. when $\int_\Theta p_\theta(x)\,\mathrm{d}\pi(\theta) < \infty$. Uniform only w.r.t. the reference measure: not invariant under re-parametrization. E.g. a flat prior on $]0,1[$ in a $\mathrm{Ber}(\theta)$ model is non-flat over $\rho = \log[\theta/(1-\theta)]$. 20/78

22 Other prior choices: Jeffreys prior. For $\Theta$ open in $\mathbb{R}^d$; reasonable with $d = 1$. Recall the Fisher information (in a regular model): $I(\theta) = E_\theta\big[\big(\frac{\partial\log p_\theta(X)}{\partial\theta}\big)^2\big] = -E_\theta\big[\frac{\partial^2\log p_\theta(X)}{\partial\theta^2}\big]$. $I(\theta)$ is the expected curvature of the likelihood around $\theta$; interpretation as an average information carried by $X$ about $\theta$. Idea: grant more prior mass to highly informative $\theta$'s. Definition (Jeffreys prior): in a dominated model with densities $p_\theta$, $\theta \in \Theta$, the Jeffreys prior has density w.r.t. the Lebesgue measure on $\Theta$: $\pi(\theta) \propto \sqrt{I(\theta)}$. Exercise: compute the Jeffreys prior in the Bernoulli model, in the location model $\mathcal{N}(\theta, \sigma^2)$ with $\sigma^2$ known, and in the scale model $\mathcal{N}(\mu, \theta^2)$ with $\mu$ known. 21/78

23 Invariance of the Jeffreys prior. Change of variable: $\eta = h(\theta)$; then $\tilde p_\eta = p_{h^{-1}(\eta)}$. Let $\theta \sim \pi_{J,\theta}$, the Jeffreys prior. Then $\eta \sim \pi_{J,\theta}\circ h^{-1}$, with density $\pi(\eta) = \pi_{J,\theta}(\theta)\,\big|\frac{\mathrm{d}\theta}{\mathrm{d}\eta}\big| = \frac{\sqrt{I(\theta)}}{|h'(\theta)|}$ for $\theta = h^{-1}(\eta)$. On the other hand, computing the Jeffreys prior directly on $\eta$: $\pi_{J,\eta}(\eta) = \sqrt{I_\eta(\eta)} = E_\eta\big[\big(\frac{\partial\log\tilde p_\eta(X)}{\partial\eta}\big)^2\big]^{1/2} = E_\theta\big[\big(\frac{\partial\log p_\theta(X)}{\partial\theta}\,\frac{\mathrm{d}\theta}{\mathrm{d}\eta}\big)^2\big]^{1/2} = \frac{\sqrt{I(\theta)}}{|h'(\theta)|}$ at $\theta = h^{-1}(\eta)$. Same result: the Jeffreys prior in the $\eta$ parametrization is the image measure of the Jeffreys prior in the $\theta$ parametrization. In other words, the Jeffreys prior is parametrization-invariant. 22/78
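
A small numerical sanity check of this invariance (my own illustration, not part of the slides), for the Bernoulli model where $I(\theta) = 1/(\theta(1-\theta))$ and $\eta = h(\theta) = \log(\theta/(1-\theta))$: the push-forward of the Jeffreys prior on $\theta$ coincides with the Jeffreys prior computed directly in the $\eta$ parametrization (both are proportional to $\sqrt{\theta(1-\theta)}$).

    import numpy as np

    def jeffreys_theta(theta):
        # Jeffreys density (unnormalized) on theta: sqrt(I(theta)), I(theta) = 1/(theta(1-theta))
        return np.sqrt(1.0 / (theta * (1.0 - theta)))

    def jeffreys_eta_direct(eta):
        # Fisher information in the log-odds parametrization: I_eta = theta (1 - theta)
        theta = 1.0 / (1.0 + np.exp(-eta))
        return np.sqrt(theta * (1.0 - theta))

    def jeffreys_eta_pushforward(eta):
        # Change of variables: pi(eta) = pi_J(theta) |d theta / d eta|, d theta/d eta = theta(1-theta)
        theta = 1.0 / (1.0 + np.exp(-eta))
        return jeffreys_theta(theta) * theta * (1.0 - theta)

    eta = np.linspace(-4, 4, 9)
    print(np.allclose(jeffreys_eta_direct(eta), jeffreys_eta_pushforward(eta)))   # True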

24 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 22/78

25 Rough overview: as the sample size $n \to \infty$, the influence of the prior choice vanishes; the posterior distribution concentrates around the true value $\theta_0$ (almost surely); the posterior distribution is asymptotically normal with mean $\hat\theta$ (the maximum likelihood estimator) and variance $n^{-1}I(\hat\theta)^{-1}$ (the same as the MLE's). 23/78

26 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 24/78

27 Reminder: Beta-Binomial model. Bayesian model: $\theta \sim \pi = \mathrm{Beta}(a, b)$ and $X \mid \theta \sim \mathrm{Ber}(\theta)$. $P_\theta$ also denotes the distribution over $\mathcal{X}^{\mathbb{N}}$ of the random sequence $(X_n)_{n\ge 1}$, i.i.d. $\sim \mathrm{Ber}(\theta)$. Posterior distribution (conjugate prior): $\pi(\cdot \mid x_{1:n}) = \mathrm{Beta}(a + s,\ b + n - s)$, $s = \sum_{i=1}^n x_i$. 25/78

28 Posterior expectation and variance.
$E^\pi(\theta \mid X_{1:n}) = \frac{a + \sum_1^n X_i}{a + b + n} = \frac{a/n + \frac{1}{n}\sum_1^n X_i}{(a+b)/n + 1} \longrightarrow \theta_0$, $P_{\theta_0}$-a.s.
$\mathrm{Var}^\pi(\theta \mid X_{1:n}) = \frac{\big(a + \sum_1^n X_i\big)\big(b + n - \sum_1^n X_i\big)}{(a+b+n)^2(a+b+n+1)} = \frac{1}{n}\,\frac{\big(a/n + \frac{1}{n}\sum_1^n X_i\big)\big(b/n + 1 - \frac{1}{n}\sum_1^n X_i\big)}{\big((a+b)/n + 1\big)^2\big((a+b+1)/n + 1\big)} \sim \frac{\theta_0(1-\theta_0)}{n} = (n\,I(\theta_0))^{-1}$ (exercise) $\longrightarrow 0$, $P_{\theta_0}$-a.s. 26/78

29 Concentration of the posterior distribution. Write $\bar\theta_n = \bar\theta_n(X_{1:n}) = E^\pi(\theta \mid X_{1:n})$. Chebyshev's inequality: for all $\delta > 0$, with $U = (\bar\theta_n - \delta, \bar\theta_n + \delta)$,
$P^\pi(\theta \notin U \mid X_{1:n}) = P^\pi\big((\theta - \bar\theta_n)^2 > \delta^2 \mid X_{1:n}\big) \le \frac{\mathrm{Var}^\pi(\theta \mid X_{1:n})}{\delta^2} \sim \frac{\theta_0(1-\theta_0)}{n\delta^2} \longrightarrow 0$, $P_{\theta_0}$-a.s.
Summary: $P_{\theta_0}$-a.s., the posterior distribution concentrates around the posterior expectation $\bar\theta_n$, and $\bar\theta_n \to \theta_0$ as $n \to \infty$. 27/78
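
A quick simulation of this concentration (my own sketch, assuming NumPy/SciPy): for data generated under $\theta_0 = 0.3$ with a Beta(1, 1) prior, the posterior mass of the fixed neighborhood $(\theta_0 - 0.05, \theta_0 + 0.05)$ approaches 1 as $n$ grows.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    theta0, a, b = 0.3, 1.0, 1.0                       # true parameter, Beta(a, b) prior
    x = rng.binomial(1, theta0, size=10_000)           # Bernoulli observations

    for n in (10, 100, 1000, 10_000):
        s = x[:n].sum()
        post = stats.beta(a + s, b + n - s)            # Beta posterior after n observations
        mass = post.cdf(theta0 + 0.05) - post.cdf(theta0 - 0.05)
        print(n, round(post.mean(), 3), round(mass, 3))   # posterior mean and mass of the neighborhood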

30 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 28/78

31 Posterior consistency. Definition: let $(\{P_\theta, \theta \in \Theta\}, \pi)$ be a Bayesian model and let $\theta_0 \in \Theta$. The posterior is consistent at $\theta_0$ if, for every neighborhood $U$ of $\theta_0$, $\pi(U \mid X_{1:n}) \to 1$ as $n \to \infty$, $P_{\theta_0}$-a.s. In general, consistency holds when $\Theta$ is finite dimensional, provided $\pi$ assigns positive mass to the neighborhoods of $\theta_0$. See e.g. [Ghosh and Ramamoorthi, 2003], Sections 1.3 and 1.4, for details. 29/78

32 Doob's theorem. Theorem: if $\Theta$ and $\mathcal{X}$ are complete, separable metric spaces endowed with their Borel $\sigma$-fields, and if $\theta \mapsto P_\theta$ is one-to-one, then for any prior $\pi$ on $\Theta$ there exists $\Theta_0 \subset \Theta$ with $\pi(\Theta_0) = 1$ such that for all $\theta_0 \in \Theta_0$, the posterior is consistent at $\theta_0$. Issue: the $\pi$-negligible set where consistency does not hold may be large. Under additional regularity conditions, consistency holds at a given $\theta_0$. 30/78

33 Consistency at a given $\theta_0$. Theorem ([Ghosh and Ramamoorthi, 2003], Th. ): let $\Theta$ be compact metric and $\theta_0 \in \Theta$. Let $T(x, \theta) = \log\frac{p_\theta(x)}{p_{\theta_0}(x)}$. Assume: 1. for all $x \in \mathcal{X}$, $\theta \mapsto T(x, \theta)$ is continuous; 2. for all $\theta \in \Theta$, $x \mapsto T(x, \theta)$ is measurable; 3. $E\big(\sup_{\theta\in\Theta}|T(X_1, \theta)|\big) < \infty$. Then: 1. the maximum likelihood estimator is consistent at $\theta_0$ (convergence in probability); 2. if $\theta_0 \in \mathrm{Supp}(\pi)$, then the posterior is consistent at $\theta_0$. 31/78

34 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 32/78

35 Bayesian asymptotic normality: overview. Tells us about the rate of convergence of $\pi(\cdot \mid X_{1:n})$ towards $\delta_{\theta_0}$. With a $\sqrt{n}$ re-scaling, a Gaussian limit centered at the MLE (under appropriate regularity conditions). Good references: [Van der Vaart, 1998], [Ghosh and Ramamoorthi, 2003], [Schervish, 2012]. 33/78

36 Bernstein-von Mises theorem (stated for $\Theta \subset \mathbb{R}$; similar statements hold for $\Theta \subset \mathbb{R}^d$). Theorem: under appropriate regularity conditions (detailed in [Ghosh and Ramamoorthi, 2003], Th. ), let $s = \sqrt{n}\,(\theta - \hat\theta_n(X_{1:n}))$, with $\hat\theta_n(X_{1:n})$ the MLE, and let $\pi^*(s \mid X_{1:n})$ be the posterior density of $s$. Then
$\int_{\mathbb{R}}\Big|\pi^*(s \mid X_{1:n}) - \sqrt{\tfrac{I(\theta_0)}{2\pi}}\,e^{-\frac{s^2 I(\theta_0)}{2}}\Big|\,\mathrm{d}s \longrightarrow 0$ a.s. under $P_{\theta_0}$.
Interpretation: as $n \to \infty$, $\sqrt{n}\,(\theta - \hat\theta_n(X_{1:n})) \stackrel{d}{\approx} \mathcal{N}(0, I(\theta_0)^{-1})$, i.e. $\theta \mid X_{1:n} \stackrel{d}{\approx} \mathcal{N}\big(\hat\theta_n, \tfrac{I(\theta_0)^{-1}}{n}\big)$. Multivariate case: similar result with a multivariate Gaussian and the Fisher information matrix. 34/78
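
A numerical illustration (my own sketch, assuming NumPy/SciPy) in the Beta-Bernoulli model: the $L_1$ distance between the posterior density of $s = \sqrt{n}(\theta - \hat\theta_n)$ and the limiting Gaussian $\mathcal{N}(0, I(\theta_0)^{-1})$ shrinks as $n$ grows.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    theta0 = 0.3
    x = rng.binomial(1, theta0, size=50_000)
    I0 = 1.0 / (theta0 * (1.0 - theta0))               # Fisher information of Ber(theta0)

    s_grid = np.linspace(-6, 6, 2001)
    ds = s_grid[1] - s_grid[0]
    gauss = stats.norm(0.0, np.sqrt(1.0 / I0)).pdf(s_grid)

    for n in (50, 500, 5000, 50_000):
        k = x[:n].sum()
        theta_hat = k / n                              # MLE
        post = stats.beta(1 + k, 1 + n - k)            # posterior under a flat Beta(1, 1) prior
        post_s = post.pdf(theta_hat + s_grid / np.sqrt(n)) / np.sqrt(n)   # density of s = sqrt(n)(theta - theta_hat)
        print(n, round(np.sum(np.abs(post_s - gauss)) * ds, 3))           # L1 distance on the grid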

37 Asymptotic normality of the posterior mean. $\bar\theta_n = E^\pi[\theta \mid X_{1:n}]$; $\hat\theta_n$: maximum likelihood estimator. Theorem: in addition to the assumptions of the BvM theorem, assume $\int_{\mathbb{R}}|\theta|\,\pi(\theta)\,\mathrm{d}\theta < \infty$. Then, under $P_{\theta_0}$: 1. $\sqrt{n}\,(\bar\theta_n - \hat\theta_n) \to 0$ in probability; 2. $\sqrt{n}\,(\bar\theta_n - \theta_0) \stackrel{d}{\longrightarrow} \mathcal{N}(0, I(\theta_0)^{-1})$. 35/78

38 Regularity conditions for the BvM theorem. 1. $\{x \in \mathcal{X} : p_\theta(x) > 0\}$ does not depend on $\theta$. 2. $L(\theta, x) = \log p_\theta(x)$ is three times differentiable w.r.t. $\theta$ in a neighborhood of $\theta_0$. 3. $E_{\theta_0}\big|\frac{\partial L}{\partial\theta}(\theta_0, X)\big| < \infty$, $E_{\theta_0}\big|\frac{\partial^2 L}{\partial\theta^2}(\theta_0, X)\big| < \infty$ and $E_{\theta_0}\sup_{\theta\in(\theta_0-\delta,\theta_0+\delta)}\big|\frac{\partial^3 L}{\partial\theta^3}(\theta, X)\big| < \infty$. 4. Integration w.r.t. $x$ and differentiation w.r.t. $\theta$ may be interchanged. 5. $I(\theta_0) > 0$. Remark: under these conditions the MLE is asymptotically normal as well, $\sqrt{n}\,(\hat\theta_n - \theta_0) \stackrel{d}{\longrightarrow} \mathcal{N}(0, I(\theta_0)^{-1})$. 36/78

39 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 36/78

40 Setting. Not a purely Bayesian framework: the training step is not necessarily Bayesian, only the prediction step is. Sample space $\mathcal{X} = \mathcal{X}_1\times\dots\times\mathcal{X}_d$ ($d$ features); some features may be categorical, some discrete, some continuous... Data: $X_i = (X_{i,1}, \dots, X_{i,d})$, $i = 1, \dots, n$. Classification problem: $X_i$ may come from any one of $K$ classes $(C_1, \dots, C_K)$. Example: $X_{i,1} \in \mathbb{R}^{p\times p}$: X-ray image from patient $i$; $X_{i,2} \in \{0,1\}$: result of a blood test from patient $i$; classes: {ill, healthy, healthy carrier}. Goal: predict the class $c \in \{1, \dots, K\}$ of a new patient. 37/78

41 Naive Bayes assumption. Conditionally on the class $c(i) \in \{1, \dots, K\}$ of observation $i$, the features $(X_{i,1}, \dots, X_{i,d})$ are independent. This looks like a strong (and erroneous) assumption! In practice it produces reasonable predictions (even though the posterior probabilities of each class are not to be taken too seriously). 38/78

42 1. Training step. Training set $\{(x_{i,j}, c(i)),\ i \in \{1,\dots,n\},\ j \in \{1,\dots,d\}\}$, $c(i) \in \{1,\dots,K\}$. For $k \in \{1,\dots,K\}$: retain the observations of class $k$: $i \in I_k$. For $j \in \{1,\dots,d\}$, estimate the class-conditional distribution with density $p_{j,k}(x_j) = p(x_{i,j} \mid c(i) = k)$, using the data $(x_{i,j})_{i\in I_k}$, usually in a parametric model with parameter $\theta_{j,k}$: estimated density $p_{j,k,\hat\theta_{j,k}}(\cdot)$. Output: the conditional distribution of $X$ given $C = k$, $\hat p_k(x) = \prod_{j=1}^d p_{j,k,\hat\theta_{j,k}}(x_j)$. 39/78

43 2. Computing the predictive class probabilities. Input: a new data point $x = (x_1, \dots, x_d)$. From step 1: conditional distributions of $X$ given $C = k$: $\hat p_k(\cdot) = \prod_j p_{j,k,\hat\theta_{j,k}}(\cdot)$ (plug-in method: neglect the estimation error of $\hat\theta_{j,k}$). (a) Assign a prior probability to each class: $\pi = (\pi_1, \dots, \pi_K)$, $\pi_k = P^\pi(C = k)$; together with step 1 this gives the joint density of $(X, C)$: $q(x, k) = \pi_k\,\hat p_k(x)$. (b) Apply the discrete Bayes formula:
$\pi(k \mid x) = \frac{\pi_k\,\hat p_k(x)}{\sum_{c=1}^K \pi_c\,\hat p_c(x)} = \frac{\pi_k\prod_{j=1}^d p_{j,k,\hat\theta_{j,k}}(x_j)}{\sum_{c=1}^K \pi_c\prod_{j=1}^d p_{j,c,\hat\theta_{j,c}}(x_j)}$.
Easy to implement! $O(KdN)$ for $N$ testing data points. 40/78

44 3. Final step: class prediction. Classification task: output = a predicted class for $x$. Naive Bayes prediction for a new point $x$ (a maximum a posteriori): $\hat c = \mathrm{argmax}_{k\in\{1,\dots,K\}}\ \pi(k \mid x)$. 41/78
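
A minimal from-scratch sketch of the three steps above (my own illustration, assuming NumPy, with continuous features modelled by per-class univariate Gaussians; the function names and the synthetic data are made up for the example, and any other parametric class-conditional model fits the same template):

    import numpy as np

    def fit_gaussian_nb(X, y, K):
        # Training step: for each class k and feature j, fit a univariate Gaussian by MLE
        params, priors = {}, np.zeros(K)
        for k in range(K):
            Xk = X[y == k]
            params[k] = (Xk.mean(axis=0), Xk.var(axis=0) + 1e-9)   # means, variances
            priors[k] = len(Xk) / len(X)                           # empirical class prior pi_k
        return params, priors

    def predict_gaussian_nb(x, params, priors):
        # Prediction step: log pi_k + sum_j log p_{j,k}(x_j), then argmax (MAP)
        log_post = np.empty(len(priors))
        for k, (mu, var) in params.items():
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
            log_post[k] = np.log(priors[k]) + log_lik
        return np.argmax(log_post)

    # Tiny usage example on synthetic data with 2 classes and 2 features
    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.repeat([0, 1], 50)
    params, priors = fit_gaussian_nb(X, y, K=2)
    print(predict_gaussian_nb(np.array([2.8, 3.1]), params, priors))   # expected: class 1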

45 Example: text document classification. 2 classes: {1 = spam, 2 = non-spam}; vocabulary $V = \{w_1, \dots, w_V\}$. Dataset: documents $T_i = (t_{i,j},\ j = 1, \dots, N_i)$, $i \le n$, with $N_i$ the number of words in $T_i$ and $t_{i,j} \in V$ the $j$-th word of $T_i$. 42/78

46 Conditional model (text documents). Naive Bayes assumption: in document $T_i$, conditionally on the class, the words are drawn independently from each other in the vocabulary $V$; $T_i$ can then be summarized by a bag of words $X_i = (X_{i,1}, \dots, X_{i,V})$, where $X_{i,j}$ is the number of occurrences of word $j$ in $T_i$. Conditional model for $X_i$ given its class $k \in \{1,2\}$: $\mathcal{L}(X_i \mid C = k) = \mathrm{Multi}\big(\theta_k = (\theta_{1,k}, \dots, \theta_{V,k}),\ N_i\big)$, i.e. $p_{k,\theta_k}(x) = \frac{N_i!}{\prod_{j=1}^V x_{i,j}!}\prod_{j=1}^V\theta_{j,k}^{x_{i,j}}$. 43/78

47 1. Training step (text documents). Fit separately 2 multinomial models, on spams and on non-spams. Here the Dirichlet prior $\mathrm{Diri}(a_1, \dots, a_V)$, $a_j > 0$, is conjugate for the multinomial model, with density $\mathrm{diri}(\theta \mid a_1, \dots, a_V) = \frac{\Gamma(\sum_{j=1}^V a_j)}{\prod_{j=1}^V\Gamma(a_j)}\prod_{j=1}^V\theta_j^{a_j-1}$ on $S_V = \{\theta \in \mathbb{R}_+^V : \sum_{j=1}^V\theta_j = 1\}$, the $(V-1)$-simplex. Mean of $\theta$ under $\pi = \mathrm{Diri}(a_1, \dots, a_V)$: $E^\pi(\theta) = \big(\frac{a_1}{\sum_j a_j}, \dots, \frac{a_V}{\sum_j a_j}\big)$. The posterior for $x_{1:n} = (x_{i,1}, \dots, x_{i,V})_{i\in\{1,\dots,n\}}$ is $\mathrm{Diri}\big(a_1 + \sum_{i=1}^n x_{i,1}, \dots, a_V + \sum_{i=1}^n x_{i,V}\big)$. 44/78

48 1. Training step (text documents), cont'd. Concatenate the documents of each class separately: $x^{(k)} = (x^{(k)}_j)_{j=1,\dots,V}$, $k = 1, 2$, with $x^{(k)}_j$ = total number of occurrences of word $j$ in the documents of class $k$. $\theta_k = (\theta_{k,1}, \dots, \theta_{k,V})$: multinomial parameter of class $k$. Flat priors on $\theta_k$: $\pi_1 = \pi_2 = \mathrm{Diri}(1, \dots, 1)$. Posterior mean estimates:
$\hat\theta_k = E^{\pi_k}[\theta \mid x^{(k)}] = \Big(\frac{x^{(k)}_1 + 1}{V + \sum_{j=1}^V x^{(k)}_j}, \dots, \frac{x^{(k)}_V + 1}{V + \sum_{j=1}^V x^{(k)}_j}\Big)$
(the prior acts as a regularizer: the +1 term avoids zero probabilities). 45/78

49 2. Prediction step. For a new document $x^{\mathrm{new}}$, the predictive probabilities of the classes are
$\pi(C = k \mid x^{\mathrm{new}}) = \frac{p(x^{\mathrm{new}} \mid C = k)\,\pi_k}{p(x^{\mathrm{new}} \mid C = 1)\,\pi_1 + p(x^{\mathrm{new}} \mid C = 2)\,\pi_2}$, with $p(x^{\mathrm{new}} \mid C = k) \propto \prod_{j=1}^V\hat\theta_{k,j}^{x^{\mathrm{new}}_j}$.
The class prediction is $\hat k(x^{\mathrm{new}}) = \mathrm{argmax}_{k=1,2}\ \pi(C = k \mid x^{\mathrm{new}})$. 46/78
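
A compact sketch of this spam filter (my own illustration, assuming NumPy; documents are assumed already encoded as bag-of-words count vectors, and the flat Dirichlet(1, ..., 1) prior yields the +1-smoothed estimates of slide 48):

    import numpy as np

    def train_multinomial_nb(X, y):
        # X: (n_docs, V) word-count matrix, y in {0, 1}; flat Diri(1, ..., 1) prior per class
        theta_hat = np.empty((2, X.shape[1]))
        log_prior = np.empty(2)
        for k in (0, 1):
            counts = X[y == k].sum(axis=0)                            # x^(k): total word counts of class k
            theta_hat[k] = (counts + 1) / (counts.sum() + X.shape[1])  # posterior mean estimate
            log_prior[k] = np.log((y == k).mean())
        return np.log(theta_hat), log_prior

    def predict_multinomial_nb(x_new, log_theta, log_prior):
        # log p(x_new | C = k) + log pi_k, up to a class-independent constant
        scores = log_prior + log_theta @ x_new
        return int(np.argmax(scores)), np.exp(scores - np.logaddexp(*scores))

    # Toy usage: vocabulary of 3 words, 4 training documents
    X = np.array([[5, 0, 1], [4, 1, 0], [0, 3, 2], [1, 4, 3]])
    y = np.array([0, 0, 1, 1])
    log_theta, log_prior = train_multinomial_nb(X, y)
    print(predict_multinomial_nb(np.array([3, 0, 1]), log_theta, log_prior))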

50 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 46/78

51 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice 47/78

52 The regression problem. Supervised learning: training dataset $(x_i, Y_i)$, $i \le n$, with $x_i \in \mathcal{X}$ the features of observation $i$ (considered non-random) and $Y_i \in \mathbb{R}$ the label (a random variable). Goal: for a new observation with features $x^{\mathrm{new}}$, predict $Y^{\mathrm{new}}$, i.e. construct a regression function $h \in \mathcal{H}$ so that $h(x)$ is our best prediction of $Y$ at point $x$. $h$ should be simple (avoid over-fitting): simple class $\mathcal{H}$; and fit the data well: measured through a loss function $L(x, y, h)$, e.g. the squared error loss $L(x, y, h) = (y - h(x))^2$. 48/78

53 Multiple classical strategies. Statistical learning approach: empirical risk minimization, $R_n(x_{1:n}, y_{1:n}, h) = \frac{1}{n}\sum_i L(x_i, y_i, h)$; minimize $R_n(x_{1:n}, y_{1:n}, h)$ over $h \in \mathcal{H}$. Probabilistic modelling approach (likelihood based): assume e.g. $Y_i = h_0(x_i) + \epsilon_i$, with $\epsilon_i \sim P_\epsilon$ independent noise variables, e.g. $P_\epsilon = \mathcal{N}(0, \sigma^2)$, $\sigma^2$ known or not; likelihood of $h$: $p_h(x_{1:n}, y_{1:n}) = \prod_{i=1}^n p_\epsilon(y_i - h(x_i))$; minimize $-\sum_{i=1}^n\log p_\epsilon(y_i - h(x_i))$ over $h \in \mathcal{H}$. With Gaussian noise, both strategies coincide. 49/78

54 Linear regression. $h$: a linear combination of basis functions (feature maps) $\phi_j : \mathcal{X} \to \mathbb{R}$, $j \in \{1,\dots,p\}$: $h(x) = \sum_{j=1}^p\theta_j\phi_j(x)$, with $\theta_j$ unknown and $\phi_j$ known, i.e. $\mathcal{H} = \{\sum_{j=1}^p\theta_j\phi_j : \theta = (\theta_1, \dots, \theta_p) \in \mathbb{R}^p\}$. Examples: $\mathcal{X} = \mathbb{R}^p$, $\phi_j(x) = x_j$: canonical feature map; $\mathcal{X} = \mathbb{R}$, $\phi_j(x) = x^{j-1}$: polynomial basis functions; $\mathcal{X} = \mathbb{R}^d$, $\phi_j(x) = \frac{1}{(2\pi)^{d/2}\sqrt{\det\Sigma_j}}\exp\{-\frac{1}{2}(x-\mu_j)^\top\Sigma_j^{-1}(x-\mu_j)\}$: Gaussian basis functions. 50/78

55 Empirical risk minimization for linear regression. Empirical risk: $R_n(x_{1:n}, y_{1:n}, \theta) = \frac{1}{2}\sum_{i=1}^n\big(y_i - \langle\theta, \phi(x_i)\rangle\big)^2 = \frac{1}{2}\|y_{1:n} - \Phi\theta\|^2$, with $\Phi \in \mathbb{R}^{n\times p}$ the design matrix, $\Phi_{i,j} = \phi_j(x_i)$. Minimizer of $R_n$: the least squares estimator; explicit solution when $\Phi^\top\Phi$ is of rank $p$ (invertible): $\hat\theta = (\Phi^\top\Phi)^{-1}\Phi^\top y_{1:n}$. 51/78

56 Regularization. Goals: prevent over-fitting and numerical instabilities (inversion of $\Phi^\top\Phi$). Add a complexity penalty (a function of $\theta$) to the empirical risk: penalty $\lambda\|\theta\|_2^2$: ridge regression; penalty $\lambda\|\theta\|_1$: Lasso regression. E.g. with the $L_2$ penalty, the optimization problem becomes $\hat\theta = \mathrm{argmin}_\theta\ \|y_{1:n} - \Phi\theta\|^2 + \lambda\|\theta\|_2^2$ for some $\lambda > 0$, with solution $\hat\theta = \big[\Phi^\top\Phi + \lambda I_p\big]^{-1}\Phi^\top y_{1:n}$. 52/78
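
A small numerical sketch of the ridge solution (my own illustration, assuming NumPy), on a polynomial design matrix; np.linalg.solve is used instead of an explicit matrix inverse:

    import numpy as np

    rng = np.random.default_rng(5)
    x = np.linspace(0, 2 * np.pi, 30)
    y = np.sin(x) + rng.normal(0, 0.2, size=x.size)      # noisy observations of sin(x)

    p = 5
    Phi = np.vander(x, N=p, increasing=True)             # design matrix, Phi_ij = x_i^(j-1)

    lam = 0.1
    theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
    theta_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]    # unpenalized least squares, for comparison

    print(np.round(theta_ridge, 3))
    print(np.round(theta_ls, 3))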

57 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice 53/78

58 Bayesian linear model. Again, $Y_i = \langle\theta, \phi(x_i)\rangle + \epsilon_i$. Assume $\epsilon_i \sim \mathcal{N}(0, \beta^{-1})$, with $\beta > 0$ the noise precision, viewed as a constant (known or not). Prior distribution on $\theta \in \mathbb{R}^p$: $\pi = \mathcal{N}(m_0, S_0)$. Independence assumption: $\epsilon_{1:n} \perp \theta$. $Y = Y_{1:n} = \Phi\theta + \epsilon_{1:n}$, with $\Phi \in \mathbb{R}^{n\times p}$, $\Phi_{i,j} = \phi_j(x_i)$. Bayesian model: $\theta \sim \pi = \mathcal{N}(m_0, S_0)$ and $\mathcal{L}[Y \mid \theta] = \mathcal{N}(\Phi\theta, \frac{1}{\beta}I_n)$. Natural Bayesian estimator: $\hat\theta = E^\pi(\theta \mid Y_{1:n})$. Posterior distribution? 54/78

59 Conditioning and augmenting Gaussian vectors. Lemma: let $W \sim \mathcal{N}(\mu, \Lambda^{-1})$ and $\mathcal{L}[Y \mid w] = \mathcal{N}(Aw + b, L^{-1})$, i.e. $Y = AW + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, L^{-1})$, $\epsilon \perp W$. Then $\mathcal{L}[W \mid y] = \mathcal{N}(m_y, S)$ with $S = (\Lambda + A^\top L A)^{-1}$ and $m_y = S\big[A^\top L(y - b) + \Lambda\mu\big]$. Proof: homework (see the exercise sheet online). 55/78

60 Application to posterior computation. Using the lemma with $A = \Phi$, $b = 0$, $W = \theta$, $\Lambda = S_0^{-1}$, $\mu = m_0$, $L = \beta I_n$, we obtain immediately the posterior distribution $\pi(\cdot \mid Y_{1:n}) = \mathcal{L}[\theta \mid y_{1:n}] = \mathcal{N}(m_n, S_n)$ with
$S_n = \big(S_0^{-1} + \beta\Phi^\top\Phi\big)^{-1}$, $m_n = S_n\big(\beta\Phi^\top y_{1:n} + S_0^{-1}m_0\big)$.   (1)
Posterior mean estimate: $\hat\theta = E^\pi[\theta \mid y_{1:n}] = m_n$. 56/78

61 Special case: diagonal, centered prior. Choose $m_0 = 0$, $S_0 = \alpha^{-1}I_p$, with $\alpha$ the prior precision (it makes sense!). Then (1) becomes
$S_n = \big(\alpha I_p + \beta\Phi^\top\Phi\big)^{-1}$, $m_n = S_n\,\beta\Phi^\top y_{1:n} = \big(\tfrac{\alpha}{\beta}I_p + \Phi^\top\Phi\big)^{-1}\Phi^\top y_{1:n}$: the penalized least squares solution.   (2)
Adding a prior $\mathcal{N}(0, \alpha^{-1}I_p)$ $\Leftrightarrow$ adding an $L_2$ regularization with parameter $\lambda = \alpha/\beta$. Remark: a narrow prior $\Leftrightarrow$ large $\alpha$ $\Leftrightarrow$ large penalty. 57/78

62 Predictive distribution. New data point $(x^{\mathrm{new}}, Y^{\mathrm{new}})$, with $Y^{\mathrm{new}}$ not observed and $x^{\mathrm{new}}$ known. Goal: obtain the posterior distribution of $Y^{\mathrm{new}}$ (mean, variance, credible intervals). We still have $Y^{\mathrm{new}} = \langle\theta, \phi(x^{\mathrm{new}})\rangle + \epsilon$, $\epsilon \sim \mathcal{N}(0, \beta^{-1})$ and $\epsilon \perp \theta$. Now (after the training step) $\theta \sim \pi(\cdot \mid y_{1:n}) = \mathcal{N}(m_n, S_n)$. Thus $Y^{\mathrm{new}}$ is a linear transform of the Gaussian vector $(\epsilon, \theta)$, and
$\mathcal{L}[Y^{\mathrm{new}} \mid y_{1:n}] = \mathcal{N}\big(\phi(x^{\mathrm{new}})^\top m_n,\ \phi(x^{\mathrm{new}})^\top S_n\phi(x^{\mathrm{new}}) + \beta^{-1}\big)$. 58/78
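
The whole pipeline fits in a few lines. The sketch below (my own illustration, assuming NumPy, with a polynomial basis of degree 4 and arbitrarily fixed $\alpha, \beta$) computes the posterior $(m_n, S_n)$ of equations (1)-(2) and the predictive mean and variance at new inputs:

    import numpy as np

    rng = np.random.default_rng(6)
    x = np.linspace(0, 2 * np.pi, 25)
    y = np.sin(x) + rng.normal(0, 0.2, size=x.size)          # h_0(x) = sin(x) plus noise

    alpha, beta = 0.01, 25.0                                  # prior precision, noise precision
    p = 5
    Phi = np.vander(x, N=p, increasing=True)                  # phi(x) = (1, x, x^2, x^3, x^4)

    # Posterior over theta: N(m_n, S_n), with S_0 = alpha^{-1} I_p, m_0 = 0
    S_n = np.linalg.inv(alpha * np.eye(p) + beta * Phi.T @ Phi)
    m_n = beta * S_n @ Phi.T @ y

    # Predictive distribution at new inputs
    x_new = np.linspace(0, 2 * np.pi, 5)
    Phi_new = np.vander(x_new, N=p, increasing=True)
    pred_mean = Phi_new @ m_n
    pred_var = np.einsum("ij,jk,ik->i", Phi_new, S_n, Phi_new) + 1.0 / beta

    for xn, m, v in zip(x_new, pred_mean, pred_var):
        lo, hi = m - 1.96 * np.sqrt(v), m + 1.96 * np.sqrt(v)
        print(f"x={xn:4.2f}  mean={m:6.3f}  95% interval=({lo:6.3f}, {hi:6.3f})")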

63 Example: polynomial basis functions. True regression function: $h_0(x) = \sin(x)$. Polynomial basis functions: $\phi(x) = (1, x, x^2, x^3, x^4)$ ($p = 5$). [Figure: noisy observations $y_i = h_0(x_i) + \epsilon_i$ around the true curve $h_0(x) = \sin(x)$.] 59/78

64 Estimated regression function: $\hat h(x) = \langle\hat\theta, \phi(x)\rangle = \hat\theta_1 + \sum_{j=2}^5\hat\theta_j x^{j-1}$, with the previous dataset. [Figure: estimated regression function vs. the true $h_0(x) = \sin(x)$ and the observations, for $\alpha = 0.01$ (left) and $\alpha = 100$ (right).] 60/78

65 Predictive distribution. $\hat h(x)$: the mean of $\mathcal{L}(Y^{\mathrm{new}} \mid y_{1:n})$ for $x^{\mathrm{new}} = x$. Recall $\mathcal{L}(Y^{\mathrm{new}} \mid y_{1:n}) = \mathcal{N}(\hat h(x), \sigma^2_{\mathrm{new}})$ with $\sigma^2_{\mathrm{new}} = \phi(x)^\top S_n\phi(x) + \beta^{-1}$. Posterior credible interval for $Y$: $I_x = \big[\hat h(x) - 1.96\sqrt{\sigma^2_{\mathrm{new}}},\ \hat h(x) + 1.96\sqrt{\sigma^2_{\mathrm{new}}}\big]$. [Figure: posterior mean and credible intervals vs. the true $h_0(x) = \sin(x)$ and the observations, for $\alpha = 0.01$ (left) and $\alpha = 100$ (right).] 61/78

66 1. Lecture 1 Cont d : Conjugate priors and exponential family. Introduction, examples Exponential family conjugate priors in exponential families Prior choice 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics Example : Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression Regression : reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 61/78

67 Model choice problem. What if several models are in competition: $\{M_k, k \in \{1,\dots,K\}\}$, with $M_k = \{\Theta_k, \pi_k\}$? Continuous case: a family of models $\{M_\alpha, \alpha \in A\}$. How to choose $k$ or $\alpha$? Examples: $M_1 = \{\Theta, \pi_1\}$, $M_2 = \{\Theta, \pi_2\}$ with $\pi_1$ a flat prior and $\pi_2$ the Jeffreys prior; $M_\alpha$: the linear model with Gaussian noise $\mathcal{N}(0, \alpha^{-1})$. 62/78

68 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 63/78

69 Hierarchical models. Bayesian view: put a prior on unknown quantities, then condition upon the data. Model choice problem: put a hyper-prior on $\alpha \in A$ (or on $k \in \{1,\dots,K\}$): a hierarchical Bayesian model. Convenient when dealing with parallel experiments. 64/78

70 Example of a hierarchical model. Example: 2 rivers with fish. $X_i \in \{0,1\}$: whether a caught fish is ill or healthy. $X_i \sim \mathrm{Ber}(\theta)$, with $\theta = \theta_1$ in river 1 and $\theta = \theta_2$ in river 2; $\theta_1$ and $\theta_2$ are 2 realizations of $\theta \sim \mathrm{Beta}(a, b)$. $\alpha = (a, b)$: hyper-parameter of the prior. Hierarchical Bayes: put a prior on $\alpha$ (e.g. a product of 2 independent Gamma distributions). 65/78

71 Posterior mean estimates in a BMA framework. Denote by $\pi^h$ the hyper-prior on $k$ (or $\alpha$). Let us stick to the discrete case, $k \in \{1,\dots,K\}$. The prior is a mixture distribution $\pi = \sum_{k=1}^K\pi^h(k)\,\pi_k(\cdot)$, i.e. for every $\pi$-integrable function $g(\theta)$, $E^\pi[g(\theta)] = E^{\pi^h}\big[E^\pi[g(\theta) \mid k]\big] = \sum_{k=1}^K\pi^h(k)\int_{\Theta_k}g(\theta)\,\mathrm{d}\pi_k(\theta)$. So is the posterior distribution; thus the posterior mean is a weighted average:
$\hat g = E^\pi[g(\theta) \mid X_{1:n}] = E^{\pi^h}\big[E^\pi[g(\theta) \mid k, X_{1:n}]\ \big|\ X_{1:n}\big] = \sum_{k=1}^K\pi^h(k \mid X_{1:n})\underbrace{\int_{\Theta_k}g(\theta)\,\mathrm{d}\pi_k(\theta \mid X_{1:n})}_{\hat g_k:\ \text{posterior mean in model }k}$. 66/78

72 Model evidence. Computing the posterior mean in the BMA framework requires: computing the posterior means in each individual model $k$ (moderate tasks), and averaging them with the weights $\pi^h(k \mid X_{1:n})$, the posterior weight of model $k$, given by the Bayes formula
$\pi^h(k \mid X_{1:n}) = \frac{\pi^h(k)\,p(X_{1:n} \mid k)}{\sum_{j=1}^K\pi^h(j)\,p(X_{1:n} \mid j)}$, with $p(X_{1:n} \mid k) = \int_{\Theta_k}p(X_{1:n} \mid \theta)\,\mathrm{d}\pi_k(\theta) = m_k(X_{1:n})$:
the evidence of model $k$, i.e. the marginal likelihood of $X_{1:n}$ in model $k$, hard to compute (an integral). 67/78

73 Shortcomings of BMA. Inference has to be done in each individual model. Usually one weight (say $\pi(k^* \mid X_{1:n})$) dominates all the others (reason: concentration of the posterior around the true $\theta_0 \in \Theta_{k_0}$, with $k^* = k_0$), so the final estimate is $\hat g \approx \hat g_{k_0}$; the other $\hat g_k$'s are almost useless. Bottleneck: compute $k^*$ — a model choice problem. 68/78

74 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 69/78

75 Posterior weights, model evidence and Bayes factor. Recall $\hat k = \mathrm{argmax}_k\ \pi(k \mid X_{1:n}) = \mathrm{argmax}_k\ p(X_{1:n} \mid k)\,\pi^h(k)$, where $p(X_{1:n} \mid k)$ is the evidence of model $k$. With a uniform prior on $k$, only the evidence $p(X_{1:n} \mid k)$ matters; in any case, the prior influence vanishes with $n$. Relevant quantity to compare models $k$ and $j$: the Bayes factor (Jeffreys, 1961), $B_{kj} = \frac{p(X_{1:n} \mid k)}{p(X_{1:n} \mid j)}$. Suggested scale for decision making:
log10 B_kj    B_kj        evidence against model j
0 to 1/2      1 to 3.2    not significant
1/2 to 1      3.2 to 10   substantial
1 to 2        10 to 100   strong
> 2           > 100       decisive
70/78
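
A worked example (my own sketch, assuming SciPy) for coin-flip data: model 1 fixes $\theta = 1/2$, model 2 puts a flat Beta(1, 1) prior on $\theta$; both evidences are available in closed form (the Beta-Binomial marginal likelihood), and the resulting Bayes factor can be read against the scale above.

    import numpy as np
    from scipy.special import betaln

    def log_evidence_fixed(x, theta=0.5):
        # Model 1: theta fixed; evidence = product of Bernoulli probabilities
        s, n = x.sum(), len(x)
        return s * np.log(theta) + (n - s) * np.log(1 - theta)

    def log_evidence_beta(x, a=1.0, b=1.0):
        # Model 2: theta ~ Beta(a, b); marginal likelihood of x_{1:n} in closed form
        s, n = x.sum(), len(x)
        return betaln(a + s, b + n - s) - betaln(a, b)

    rng = np.random.default_rng(7)
    x = rng.binomial(1, 0.7, size=100)                      # data generated with theta_0 = 0.7

    log_B21 = log_evidence_beta(x) - log_evidence_fixed(x)
    print("log10 B_21 =", round(log_B21 / np.log(10), 2))   # compare with the scale above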

76 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 71/78

77 Occam s razor principle Avoid over-fitting Between 2 models explaining the data equally well, one ought to choose the simplest one. Better generalization properties. 72/78

78 Occam's razor and model evidence. When selecting $k$ according to the model evidences $p(X_{1:n} \mid k)$, Occam's razor is automatically implemented. Reason: the prior plays the role of a regularizer. 73/78

79 Automatic complexity penalty: intuition 1. Complex model $\Rightarrow$ large $\Theta_k$ $\Rightarrow$ small $\pi_k(\theta)$ (if uniform over $\Theta_k$) $\Rightarrow$ $\int_{\Theta_k}p_\theta(x_{1:n})\,\pi_k(\theta)\,\mathrm{d}\theta$ is small (average over large regions where $p_\theta(x_{1:n})$ is small). 74/78

80 Automatic complexity penalty: intuition 2. If $\Theta_k \subset \mathbb{R}$: assume $\pi_k$ flat over an interval of length $\Delta^{\mathrm{prior}}_k$, and $\theta \mapsto p_\theta(X_{1:n})$ peaked around $p_{\theta_{\mathrm{MAP}},k}(X_{1:n})$ with width $\Delta^{\mathrm{posterior}}_k$. Then $\pi_k(\theta) \approx 1/\Delta^{\mathrm{prior}}_k$ and $p(X_{1:n} \mid k) = \int_{\Theta_k}p_\theta(x)\,\pi_k(\theta)\,\mathrm{d}\theta \approx p_{\theta_{\mathrm{MAP}},k}(X_{1:n})\,\frac{\Delta^{\mathrm{posterior}}_k}{\Delta^{\mathrm{prior}}_k}$, so
$\log p(X_{1:n} \mid k) \approx \log p_{\theta_{\mathrm{MAP}},k}(X_{1:n}) + \underbrace{\log\frac{\Delta^{\mathrm{posterior}}_k}{\Delta^{\mathrm{prior}}_k}}_{\text{complexity penalty}}$.
If $\Theta_k \subset \mathbb{R}^d$ and the same approximation holds in each dimension:
$\log p(X_{1:n} \mid k) \approx \log p_{\theta_{\mathrm{MAP}},k}(X_{1:n}) + \underbrace{d\,\log\frac{\Delta^{\mathrm{posterior}}_k}{\Delta^{\mathrm{prior}}_k}}_{\text{dimension}\ \times\ \text{complexity penalty}}$. 75/78
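
To see this penalty in action (my own sketch, assuming NumPy, and reusing the Bayesian linear model of section 4 with prior $\mathcal{N}(0, \alpha^{-1}I_p)$ and noise precision $\beta$): marginally $y \sim \mathcal{N}(0, \alpha^{-1}\Phi\Phi^\top + \beta^{-1}I_n)$, so the log-evidence of each polynomial degree is available exactly. On data simulated from a sine curve, the evidence typically stops improving, and eventually degrades, once the degree grows beyond what the data support.

    import numpy as np

    def log_evidence(Phi, y, alpha, beta):
        # Marginal likelihood of the Bayesian linear model: y ~ N(0, C),
        # with C = alpha^{-1} Phi Phi^T + beta^{-1} I_n
        n = len(y)
        C = Phi @ Phi.T / alpha + np.eye(n) / beta
        sign, logdet = np.linalg.slogdet(C)
        return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

    rng = np.random.default_rng(8)
    x = np.linspace(0, 1, 40)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.size)

    alpha, beta = 1.0, 25.0
    for degree in range(10):
        Phi = np.vander(x, N=degree + 1, increasing=True)   # polynomial model of this degree
        print(degree, round(log_evidence(Phi, y, alpha, beta), 1))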

81 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 76/78

82 1. Lecture 1 Cont d : Conjugate priors and exponential family. 2. Lecture 1 Cont d : A glimpse at Bayesian asymptotics 3. Supervised learning example : Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 77/78

83 Bibliography. [Ghosh and Ramamoorthi, 2003] Ghosh, J. and Ramamoorthi, R. (2003). Bayesian Nonparametrics. Springer. [Schervish, 2012] Schervish, M. J. (2012). Theory of Statistics. Springer Science & Business Media. [Van der Vaart, 1998] Van der Vaart, A. W. (1998). Asymptotic Statistics, volume 3. Cambridge University Press. 78/78


More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

Maximum likelihood estimation

Maximum likelihood estimation Maximum likelihood estimation Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Maximum likelihood estimation 1/26 Outline 1 Statistical concepts 2 A short review of convex analysis and optimization

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

Linear Models A linear model is defined by the expression

Linear Models A linear model is defined by the expression Linear Models A linear model is defined by the expression x = F β + ɛ. where x = (x 1, x 2,..., x n ) is vector of size n usually known as the response vector. β = (β 1, β 2,..., β p ) is the transpose

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Machine Learning - MT Classification: Generative Models

Machine Learning - MT Classification: Generative Models Machine Learning - MT 2016 7. Classification: Generative Models Varun Kanade University of Oxford October 31, 2016 Announcements Practical 1 Submission Try to get signed off during session itself Otherwise,

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

2.3 Exponential Families 61

2.3 Exponential Families 61 2.3 Exponential Families 61 Fig. 2.9. Watson Nadaraya estimate. Left: a binary classifier. The optimal solution would be a straight line since both classes were drawn from a normal distribution with the

More information

Probability and Estimation. Alan Moses

Probability and Estimation. Alan Moses Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.

More information

Kernel methods, kernel SVM and ridge regression

Kernel methods, kernel SVM and ridge regression Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;

More information

Nonparameteric Regression:

Nonparameteric Regression: Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,

More information

Bayesian methods in economics and finance

Bayesian methods in economics and finance 1/26 Bayesian methods in economics and finance Linear regression: Bayesian model selection and sparsity priors Linear Regression 2/26 Linear regression Model for relationship between (several) independent

More information

GAUSSIAN PROCESS REGRESSION

GAUSSIAN PROCESS REGRESSION GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The

More information

ST5215: Advanced Statistical Theory

ST5215: Advanced Statistical Theory Department of Statistics & Applied Probability Monday, September 26, 2011 Lecture 10: Exponential families and Sufficient statistics Exponential Families Exponential families are important parametric families

More information

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes

CSci 8980: Advanced Topics in Graphical Models Gaussian Processes CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian

More information