Introduction to Bayesian learning Lecture 2: Bayesian methods for (un)supervised problems
1 Introduction to Bayesian learning. Lecture 2: Bayesian methods for (un)supervised problems. Anne Sabourin, Assoc. Prof., Telecom ParisTech. September. /78
2 1. Lecture 1 Cont'd: Conjugate priors and exponential family. Introduction, examples Exponential family Conjugate priors in exponential families Prior choice 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics Example: Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression Regression: reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 1/78
3 1. Lecture 1 Cont'd: Conjugate priors and exponential family. Introduction, examples Exponential family Conjugate priors in exponential families Prior choice 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 2/78
4 Reminder: conjugate priors (last lecture, see lecture notes). Definition: hyper-parameter, conjugate family. Examples: normal model with known variance: normal prior; normal model with known mean: Gamma prior on the inverse variance; normal model with unknown mean and variance: Normal-Gamma prior on (µ, λ = 1/σ²). 3/78
5 Conjugate priors for the multivariate normal. X ∼ N(µ, Λ^{-1}), µ ∈ R^d, Λ ∈ R^{d×d} positive definite (precision matrix: the inverse of the covariance matrix). 1. Unknown mean: the conjugate prior family on µ is the family of multivariate Gaussian distributions. 2. Unknown precision: the conjugate priors on Λ are the Wishart distributions W(ν, W) with ν degrees of freedom (ν ∈ N) and scale matrix W ∈ R^{d×d}. 4/78
6 Wishart distribution: defined on the cone of positive definite matrices. The Wishart distribution W(ν, W) has density f_W(Λ | ν, W) = B det(Λ)^{(ν-d-1)/2} exp{-(1/2) Tr(W^{-1} Λ)} w.r.t. the Lebesgue measure ∏_{i≤j} dλ_{(i,j)} on R^{d(d+1)/2}, restricted to the set of positive definite matrices; B is a normalizing constant. Probabilistic representation: let M be a random ν×d matrix with i.i.d. rows M_{(i,·)} ∼ N(0, W). Then Λ = M^T M = ∑_{i=1}^ν M_{(i,·)} M_{(i,·)}^T satisfies Λ ∼ W(ν, W). More details: see e.g. Eaton, Multivariate Statistics: A Vector Space Approach, 2007 (Chapter 8). 5/78
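The probabilistic representation on this slide gives a direct way to sample from W(ν, W): draw ν i.i.d. Gaussian rows and form M^T M. The sketch below (my own illustration, not from the slides) also checks the standard identity E[Λ] = νW by Monte Carlo; the helper name `sample_wishart` and the particular ν, W values are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, d = 50, 2                      # degrees of freedom and dimension
W = np.array([[2.0, 0.5],
              [0.5, 1.0]])         # scale matrix (positive definite)

def sample_wishart(nu, W, rng):
    """Draw Lambda ~ W(nu, W) via the representation Lambda = M^T M,
    where M has nu i.i.d. rows distributed as N(0, W)."""
    L = np.linalg.cholesky(W)
    M = rng.standard_normal((nu, W.shape[0])) @ L.T   # rows ~ N(0, W)
    return M.T @ M

# Monte Carlo check of the mean identity E[Lambda] = nu * W
mean_est = np.mean([sample_wishart(nu, W, rng) for _ in range(2000)], axis=0)
print(np.round(mean_est / nu, 2))  # should be close to W
```

Since ν ≥ d, each draw M^T M is almost surely positive definite, which matches the support of the density above.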
7 Conjugate priors for the multivariate normal, Cont'd. 3. Unknown mean and precision: the conjugate prior family on (µ, Λ) is the Gaussian-Wishart distribution with hyper-parameters (W, ν, m, β): π(µ, Λ) = π₁(Λ) π₂(µ | Λ), with π₁ = W(W, ν), ν ∈ N, W positive definite, and π₂(· | Λ) = N(m, (βΛ)^{-1}), m ∈ R^d, β > 0. 6/78
8 1. Lecture 1 Cont'd: Conjugate priors and exponential family. Introduction, examples Exponential family Conjugate priors in exponential families Prior choice 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 7/78
9 Definition: exponential family. A dominated parametric model P = {P_θ, θ ∈ Θ} is an exponential family if the densities write p_θ(x) = C(θ) h(x) exp{⟨R(θ), T(x)⟩} for some functions R: Θ → R^k, T: X → R^k, C: Θ → R₊, h: X → R₊. C(θ): a normalizing constant. R(θ): the natural parameter (R: the good re-parametrization). If R(θ) = θ, the family is natural. Most textbook distributions belong to the exponential family! 8/78
10 Example I: Bernoulli model. θ ∈ Θ = ]0, 1[, X = {0, 1}. The model is dominated by λ = δ₀ + δ₁: p_θ(x) = θ^x (1-θ)^{1-x} = exp{x log θ + (1-x) log(1-θ)} = (1-θ) exp{x log[θ/(1-θ)]}. The model is an exponential family with T(x) = x, natural parameter ρ = R(θ) = log[θ/(1-θ)], and normalizing constant C(θ) = 1-θ. 9/78
11 Example II: Gaussian model. θ = (µ, σ²) ∈ Θ = R × R₊*; the model is dominated by the Lebesgue measure on X = R: p_θ(x) = (2πσ²)^{-1/2} exp{-(x² - 2µx + µ²)/(2σ²)} = (2πσ²)^{-1/2} exp{-µ²/(2σ²)} exp{⟨(x, x²), (µ/σ², -1/(2σ²))⟩}. The model is an exponential family with T(x) = (x, x²), natural parameter ρ = R(θ) = (µ/σ², -1/(2σ²)), and normalizing constant C(θ) = (2πσ²)^{-1/2} exp{-µ²/(2σ²)}. 10/78
12 Likelihood of i.i.d. samples in exponential families: p_θ(x₁) = C(θ) h(x₁) exp{⟨R(θ), T(x₁)⟩} and p_θ^n(x_{1:n}) = C(θ)^n [∏_{i=1}^n h(x_i)] exp{⟨∑_{i=1}^n T(x_i), R(θ)⟩}, with h_n(x_{1:n}) = ∏_{i=1}^n h(x_i) and T_n(x_{1:n}) = ∑_{i=1}^n T(x_i). 11/78
13 Natural parameter space. Natural parametrization: ρ = R(θ). The density p_ρ(x) = C(ρ) h(x) exp{⟨T(x), ρ⟩} integrates to 1 iff ρ ∈ E, the natural parameter space, i.e. E = {ρ : ∫_X h(x) exp{⟨T(x), ρ⟩} dλ(x) < ∞}. If E is open, the family is regular. 12/78
14 Maximum likelihood in regular exponential families. Natural parametrization: ρ = R(θ); λ: reference measure. Lemma (expression for E_ρ[T(X)]): E_ρ[T(X)] = -∇_ρ ln C(ρ). Proof: 1 = C(ρ) ∫_X h(x) exp{⟨T(x), ρ⟩} dλ(x); differentiating in ρ (with enough regularity to exchange ∇ and ∫), 0 = ∇_ρ C(ρ) ∫_X h(x) exp{⟨T(x), ρ⟩} dλ(x) + C(ρ) ∫_X h(x) T(x) exp{⟨T(x), ρ⟩} dλ(x) = (1/C(ρ)) ∇_ρ C(ρ) + E_ρ[T(X)], since the first integral equals C(ρ)^{-1}. 13/78
15 Maximum likelihood in regular exponential families, Cont'd. Proposition: the MLE ρ̂ in a regular exponential family satisfies E_ρ̂[T(X)] = (1/n) ∑_i T(x_i). Proof: ∇_ρ log p_ρ^n(x_{1:n}) = 0 ⟺ ∇_ρ{n log C(ρ) + ⟨∑_i T(x_i), ρ⟩} = 0 ⟺ -∇_ρ log C(ρ̂) = (1/n) ∑_i T(x_i); then use the lemma. 14/78
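The moment condition above can be made concrete on the Poisson family (treated on a later slide), where T(x) = x, ρ = log θ and C(ρ) = exp{-e^ρ}, so -d/dρ log C(ρ) = e^ρ. The sketch below (my own illustration; the helper names are mine, and the crude cdf-inversion sampler is just for self-containedness) solves E_ρ[T(X)] = x̄ numerically by bisection and recovers θ̂ = x̄.

```python
import math
import random

# For the Poisson family, C(rho) = exp(-e^rho), so the lemma gives
# E_rho[T(X)] = -d/d rho log C(rho) = e^rho; the MLE solves e^rho = xbar.

def mean_T(rho):
    return math.exp(rho)              # -grad log C for the Poisson family

def mle_rho(xbar, lo=-10.0, hi=10.0):
    """Solve mean_T(rho) = xbar by bisection (mean_T is increasing)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_T(mid) < xbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rpois(lam):
    """Crude Poisson sampler by inversion of the cdf (illustration only)."""
    x, p = 0, math.exp(-lam)
    c, u = p, random.random()
    while u > c and x < 1000:         # cap guards against float round-off
        x += 1
        p *= lam / x
        c += p
    return x

random.seed(1)
theta0 = 3.5
xs = [rpois(theta0) for _ in range(20000)]
xbar = sum(xs) / len(xs)
rho_hat = mle_rho(xbar)
print(round(math.exp(rho_hat), 3))    # theta_hat = e^{rho_hat} = xbar
```

The bisection is overkill here (the condition inverts in closed form), but the same one-dimensional search works for any regular natural family once -∇ log C is available.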
16 1. Lecture 1 Cont'd: Conjugate priors and exponential family. Introduction, examples Exponential family Conjugate priors in exponential families Prior choice 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 15/78
17 Conjugate priors in exponential families. Proposition: a natural exponential family with densities p_θ(x) = C(θ) h(x) exp{⟨θ, T(x)⟩} admits a conjugate prior family F = {π_{λ,µ}, λ > 0, µ ∈ M_λ ⊂ R^k}, with π_{λ,µ}(θ) = K(µ, λ) C(θ)^λ exp{⟨θ, µ⟩} and M_λ = {µ : ∫_Θ π_{λ,µ}(θ) dθ < ∞}. The posterior for n observations is π_{λ,µ}(θ | x_{1:n}) ∝ C(θ)^{λ+n} exp{⟨θ, µ + ∑_i T(x_i)⟩}, so that π_{λ,µ}(· | x_{1:n}) = π_{λ_n,µ_n}(·), with λ_n = λ + n and µ_n = µ + ∑_i T(x_i). Proof: exercise. 16/78
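As a sanity check of the generic update (λ_n, µ_n) = (λ + n, µ + ∑_i T(x_i)), the sketch below specializes it to the Bernoulli family, where T(x) = x. One consistent mapping back to the usual Beta parametrization, worked out by a change of variables from the natural parameter ρ to θ (an assumption of this sketch, spelled out in the comments), is Beta(a, b) ↔ (µ = a, λ = a + b).

```python
import random

random.seed(2)
theta0, n = 0.3, 500
xs = [1 if random.random() < theta0 else 0 for _ in range(n)]
s = sum(xs)

# generic conjugate update in a natural exponential family:
#   lambda_n = lambda + n,   mu_n = mu + sum_i T(x_i)     (here T(x) = x)
lam, mu = 3.0, 1.0        # hyper-parameters; taken to correspond to Beta(a=mu, b=lam-mu)
lam_n = lam + n
mu_n = mu + s

# mapping back to theta: prior and posterior are Beta(mu, lam - mu)
a_n, b_n = mu_n, lam_n - mu_n
post_mean = a_n / (a_n + b_n)
print(round(post_mean, 3))            # close to theta0 for large n

# cross-check against the direct Beta-Bernoulli update Beta(a + s, b + n - s)
a, b = mu, lam - mu
assert (a_n, b_n) == (a + s, b + n - s)
```

The point of the check is that the abstract (λ, µ) bookkeeping and the familiar Beta(a + s, b + n - s) update are the same computation in two parametrizations.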
18 Example: Poisson model. p_θ(x) = e^{-θ} θ^x / x!, X = N, θ > 0; p_θ(x) = (1/x!) e^{-θ} e^{x log θ}: an exponential family with T(x) = x, ρ = R(θ) = log θ ∈ R, C(ρ) = exp{-e^ρ}. Conjugate prior for ρ: π_{a,b}(ρ) ∝ exp{-b e^ρ} exp{aρ}. Back to θ: π(θ) = π(ρ) |dρ/dθ| ∝ θ^{a-1} exp{-bθ} (a Gamma density). The Gamma family is a conjugate prior for θ. 17/78
19 1. Lecture 1 Cont'd: Conjugate priors and exponential family. Introduction, examples Exponential family Conjugate priors in exponential families Prior choice 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 18/78
20 About the choice of a conjugate prior. A convenient choice only. One must still choose the hyper-parameters (λ, µ): this is an issue of model choice. It is possible to do so via empirical Bayes methods; see lecture 2 and the lab session. 19/78
21 Other prior choices: non-informative priors. Goal: minimize the bias induced by the prior. If Θ is compact, one can choose π(θ) = constant. If Θ is non-compact, ∫_Θ π(θ) dθ = ∫_Θ C dθ = +∞ (an improper prior). It is OK to do so as long as the posterior is well defined, i.e. when ∫_Θ p_θ(x) dπ(θ) < ∞. Such a prior is uniform only w.r.t. the reference measure: it is not invariant under re-parametrization. E.g. a flat prior on ]0, 1[ in a Ber(θ) model is non-flat over ρ = log[θ/(1-θ)]. 20/78
22 Other prior choices: Jeffreys prior. For Θ open in R^d; reasonable with d = 1. Recall the Fisher information (in a regular model): I(θ) = E_θ[(∂ log p_θ(X)/∂θ)²] = -E_θ[∂² log p_θ(X)/∂θ²]. I(θ) is the expected curvature of the likelihood around θ; interpretation as an average information carried by X about θ. Idea: grant more prior mass to highly informative θ's. Definition (Jeffreys prior): in a dominated model with densities p_θ, θ ∈ Θ, the Jeffreys prior has density w.r.t. the Lebesgue measure on Θ: π(θ) ∝ √I(θ). Exercise: compute the Jeffreys prior in the Bernoulli model, in the location model N(θ, σ²) with σ² known, and in the scale model N(µ, θ²) with µ known. 21/78
23 Invariance of the Jeffreys prior. Change of variable: η = h(θ), with p̃_η = p_{h^{-1}(η)}. Let θ ∼ π_{J,θ}, the Jeffreys prior. Then η = h(θ) ∼ π_{J,θ} ∘ h^{-1}, with density π(η) = π_{J,θ}(θ) |dθ/dη| = √I(θ) / |h'(θ)| at θ = h^{-1}(η). On the other hand, computing the Jeffreys prior on η directly: π_{J,η}(η) = √I_η(η) = E_η[(∂ log p̃_η(X)/∂η)²]^{1/2} = E_θ[(∂ log p_θ(X)/∂θ · dθ/dη)²]^{1/2} = √I(θ) / |h'(θ)| at θ = h^{-1}(η). Same result: the Jeffreys prior in the η parametrization is the image measure of the Jeffreys prior in the θ parametrization. In other words, the Jeffreys prior is parametrization-invariant. 22/78
24 1. Lecture 1 Cont'd: Conjugate priors and exponential family. Introduction, examples Exponential family Conjugate priors in exponential families Prior choice 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics Example: Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression Regression: reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 22/78
25 Rough overview: as the sample size n → ∞, the influence of the prior choice vanishes; the posterior distribution concentrates around the true value θ₀ (almost surely); the posterior distribution is asymptotically normal with mean θ̂ = the maximum likelihood estimate and variance n^{-1} I(θ̂)^{-1} (same as the MLE's). 23/78
26 1. Lecture 1 Cont'd: Conjugate priors and exponential family. 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics Example: Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 24/78
27 Reminder: Beta-Binomial model. Bayesian model: θ ∼ π = Beta(a, b); X | θ ∼ Ber(θ). P_θ: distribution of the random sequence (X_n)_{n≥1}, i.i.d. ∼ P_θ. Posterior distribution (conjugate prior): π(· | x_{1:n}) = Beta(a + s, b + n - s), with s = ∑_{i=1}^n x_i. 25/78
28 Posterior expectation and variance. E_π(θ | X_{1:n}) = (a + ∑_{i=1}^n X_i)/(a + b + n) = (a/n + (1/n)∑_{i=1}^n X_i)/((a + b)/n + 1) → θ₀, P_{θ₀}-a.s. Var_π(θ | X_{1:n}) = (a + ∑_{i=1}^n X_i)(b + n - ∑_{i=1}^n X_i) / [(a + b + n)²(a + b + n + 1)] = (1/n) (a/n + (1/n)∑X_i)(b/n + 1 - (1/n)∑X_i) / [((a + b)/n + 1)²((a + b + 1)/n + 1)] ≈ θ₀(1 - θ₀)/n → 0, P_{θ₀}-a.s.; and θ₀(1 - θ₀)/n = (n I(θ₀))^{-1} (exercise). 26/78
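The two limits on this slide are easy to observe numerically. The sketch below (my own illustration; the helper name and the particular a, b, θ₀ are mine) simulates Bernoulli data, computes the exact Beta(a + s, b + n - s) posterior moments, and checks that the posterior mean approaches θ₀ while n · Var approaches θ₀(1 - θ₀) = (I(θ₀))^{-1}.

```python
import random

random.seed(3)
a, b = 2.0, 2.0
theta0 = 0.6

def posterior_moments(n):
    """Exact posterior mean/variance of Beta(a + s, b + n - s)
    for n simulated Bernoulli(theta0) observations."""
    s = sum(1 if random.random() < theta0 else 0 for _ in range(n))
    an, bn = a + s, b + n - s
    mean = an / (an + bn)
    var = an * bn / ((an + bn) ** 2 * (an + bn + 1))
    return mean, var

n = 50_000
mean, var = posterior_moments(n)
print(round(mean, 3))        # posterior mean -> theta0 = 0.6
print(round(n * var, 3))     # n * Var -> theta0 * (1 - theta0) = 0.24
```

With the prior contributions a/n and b/n vanishing, the posterior behaves like a point mass spreading at the parametric rate 1/√n, which is exactly the concentration statement on the next slide.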
29 Concentration of the posterior distribution. Write θ̄_n = θ̄_n(X_{1:n}) = E_π(θ | X_{1:n}). Chebyshev's inequality: ∀δ > 0, with U = (θ̄_n - δ, θ̄_n + δ), P_π(θ ∉ U | X_{1:n}) = P_π((θ - θ̄_n)² > δ² | X_{1:n}) ≤ Var_π(θ | X_{1:n})/δ² ≈ θ₀(1 - θ₀)/(nδ²) → 0, P_{θ₀}-a.s. Summary: P_{θ₀}-a.s., the posterior distribution concentrates around the posterior expectation θ̄_n, and θ̄_n → θ₀. 27/78
30 1. Lecture 1 Cont'd: Conjugate priors and exponential family. 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics Example: Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 28/78
31 Posterior consistency. Definition: let ({P_θ, θ ∈ Θ}, π) be a Bayesian model and let θ₀ ∈ Θ. The posterior is consistent at θ₀ if, for every neighborhood U of θ₀, π(U | X_{1:n}) → 1, P_{θ₀}-a.s. In general, consistency holds when Θ is finite-dimensional if π assigns positive mass to the neighborhoods of θ₀. See e.g. [Ghosh and Ramamoorthi, 2003], Chapters 1.3, 1.4 for details. 29/78
32 Doob's theorem. Theorem: if Θ and X are complete, separable metric spaces endowed with their Borel σ-fields, and if θ ↦ P_θ is one-to-one, then for any prior π on Θ there exists Θ₀ ⊂ Θ with π(Θ₀) = 1 such that for all θ₀ ∈ Θ₀, the posterior is consistent at θ₀. Issue: the π-negligible set where consistency does not hold may be large. Under additional regularity conditions, consistency holds at a given θ₀. 30/78
33 Consistency at a given θ₀. Theorem ([Ghosh and Ramamoorthi, 2003], Th.): let Θ be compact metric and θ₀ ∈ Θ. Let T(x, θ) = log[p_θ(x)/p_{θ₀}(x)]. Assume: 1. ∀x ∈ X, θ ↦ T(x, θ) is continuous; 2. ∀θ ∈ Θ, x ↦ T(x, θ) is measurable; 3. E(sup_{θ∈Θ} |T(θ, X₁)|) < ∞. Then: 1. the maximum likelihood estimator is consistent at θ₀ (convergence in probability); 2. if θ₀ ∈ Supp(π), then the posterior is consistent at θ₀. 31/78
34 1. Lecture 1 Cont'd: Conjugate priors and exponential family. 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics Example: Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice 32/78
35 Bayesian asymptotic normality: overview. Tells us about the rate of convergence of π(· | X_{1:n}) towards δ_{θ₀}. With a √n re-scaling, a Gaussian limit centered at the MLE (under appropriate regularity conditions). Good references: [Van der Vaart, 1998], [Ghosh and Ramamoorthi, 2003], [Schervish, 2012]. 33/78
36 Bernstein-von Mises theorem (stated for Θ ⊂ R; similar statements hold for Θ ⊂ R^d). Theorem: under appropriate regularity conditions (detailed in [Ghosh and Ramamoorthi, 2003], Th.), let s = √n(θ - θ̂_n(X_{1:n})), with θ̂_n(X_{1:n}) the MLE, and let π*(s | X_{1:n}) be the posterior density of s. Then ∫_R |π*(s | X_{1:n}) - √(I(θ₀)/(2π)) e^{-s² I(θ₀)/2}| ds → 0 a.s. under P_{θ₀}. Interpretation: as n → ∞, √n(θ - θ̂_n(X_{1:n})) →_d N(0, I(θ₀)^{-1}), i.e. θ ≈_d N(θ̂_n, I(θ₀)^{-1}/n). Multivariate case: similar result with a multivariate Gaussian and the Fisher information matrix. 34/78
37 Asymptotic normality of the posterior mean. θ̄_n = E_π[θ | X_{1:n}]; θ̂_n: maximum likelihood. Theorem: in addition to the assumptions of the BvM theorem, assume ∫_R |θ| π(θ) dθ < ∞. Then, under P_{θ₀}: 1. √n(θ̄_n - θ̂_n) → 0 in probability; 2. √n(θ̄_n - θ₀) →_d N(0, I(θ₀)^{-1}). 35/78
38 Regularity conditions for the BvM theorem. 1. {x ∈ X : p_θ(x) > 0} does not depend on θ. 2. L(θ, x) = log p_θ(x) is three times differentiable w.r.t. θ in a neighborhood of θ₀. 3. E_{θ₀}|∂L(θ₀, X)/∂θ| < ∞, E_{θ₀}|∂²L(θ₀, X)/∂θ²| < ∞ and E_{θ₀} sup_{θ∈(θ₀-δ,θ₀+δ)} |∂³L(θ, X)/∂θ³| < ∞. 4. Integration w.r.t. x and differentiation w.r.t. θ may be interchanged. 5. I(θ₀) > 0. Remark: under these conditions the MLE is asymptotically normal as well, √n(θ̂_n - θ₀) →_d N(0, I(θ₀)^{-1}). 36/78
39 1. Lecture 1 Cont'd: Conjugate priors and exponential family. Introduction, examples Exponential family Conjugate priors in exponential families Prior choice 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics Example: Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression Regression: reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 36/78
40 Setting. Not a purely Bayesian framework: the training step is not necessarily Bayesian; only the prediction step is. Sample space X = X₁ × ... × X_d (d features); some features may be categorical, some discrete, some continuous... Data X_i = (X_{i,1},..., X_{i,d}), i = 1,..., n. Classification problem: X_i may come from any one of K classes (C₁,..., C_K). Example: X_{i,1} ∈ R^{p×p}: X-ray image from patient i; X_{i,2} ∈ {0, 1}: result of a blood test from patient i; classes: {ill, healthy, healthy carrier}. Goal: predict the class c ∈ {1,..., K} of a new patient. 37/78
41 Naive Bayes assumption. Conditionally on the class c(i) ∈ {1,..., K} of observation i, the features (X_{i,1},..., X_{i,d}) are independent. This looks like a strong (and erroneous) assumption! In practice, it produces reasonable predictions (even though the posterior probabilities of each class are not to be taken too seriously). 38/78
42 1. Training step. Training set {(x_{i,j}, c(i)), i ∈ {1,..., n}, j ∈ {1,..., d}}, c(i) ∈ {1,..., K}. For k ∈ {1,..., K}: retain the observations of class k, i ∈ I_k. For j ∈ {1,..., d}, estimate the class-conditional distribution with density p_{j,k}(x_j) = p(x_{i,j} | c(i) = k) using the data (x_{i,j})_{i∈I_k}, usually in a parametric model with parameter θ_{j,k}: estimated density p_{j,k,θ̂_{j,k}}(·). Output: the conditional distribution of X given C = k, p_k(x) = ∏_{j=1}^d p_{j,k,θ̂_{j,k}}(x_j). 39/78
43 2. Computing the predictive class probabilities. Input: a new data point x = (x₁,..., x_d). From step 1: conditional distributions of X given C = k, p_k(·) = ∏_j p_{j,k,θ̂_{j,k}}(·) (plug-in method: neglect the estimation error of θ̂_{j,k}). (a) Assign a prior probability to each class: π = (π₁,..., π_K), π_k = P_π(C = k). With step 1, the joint density of (X, C) is q(x, k) = π_k p_k(x). (b) Apply the discrete Bayes formula: π(k | x) = π_k p_k(x) / ∑_{c=1}^K π_c p_c(x) = π_k ∏_{j=1}^d p_{j,k,θ̂_{j,k}}(x_j) / ∑_{c=1}^K π_c ∏_{j=1}^d p_{j,c,θ̂_{j,c}}(x_j). Easy to implement! O(KdN) for N test data points. 40/78
44 3. Final step: class prediction. Classification task: output a predicted class for x. Naive Bayes prediction for a new point x (a maximum a posteriori): ĉ = argmax_{k∈{1,...,K}} π(k | x). 41/78
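The three steps (fit per-feature class-conditionals, apply the discrete Bayes formula, take the MAP class) can be sketched end to end. This is my own minimal illustration, with Gaussian class-conditionals as the assumed parametric choice and a tiny made-up training set; the function names `fit`, `log_gauss`, `predict` are mine. Working in log space avoids underflow when the product over features is long.

```python
import math
from collections import defaultdict

# toy training data: (features, class); each feature's class-conditional
# density is modeled as an independent Gaussian (an assumed parametric choice)
train = [((1.0, 2.1), 0), ((1.2, 1.9), 0), ((0.8, 2.0), 0),
         ((3.0, 0.9), 1), ((3.2, 1.1), 1), ((2.9, 1.0), 1)]

def fit(train):
    """Per class: empirical class prior, and per-feature mean/variance."""
    by_class = defaultdict(list)
    for x, c in train:
        by_class[c].append(x)
    params, priors = {}, {}
    n = len(train)
    for c, xs in by_class.items():
        d = len(xs[0])
        mu = [sum(x[j] for x in xs) / len(xs) for j in range(d)]
        var = [max(sum((x[j] - mu[j]) ** 2 for x in xs) / len(xs), 1e-6)
               for j in range(d)]
        params[c] = (mu, var)
        priors[c] = len(xs) / n
    return params, priors

def log_gauss(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def predict(x, params, priors):
    # naive Bayes score: log pi_k + sum_j log p_{j,k}(x_j); return MAP class
    scores = {c: math.log(priors[c]) +
                 sum(log_gauss(x[j], mu[j], var[j]) for j in range(len(x)))
              for c, (mu, var) in params.items()}
    return max(scores, key=scores.get)

params, priors = fit(train)
print(predict((1.1, 2.0), params, priors))  # -> 0
print(predict((3.1, 1.0), params, priors))  # -> 1
```

The normalizing denominator of the Bayes formula cancels in the argmax, which is why `predict` compares unnormalized log joint scores.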
45 Example: text document classification. 2 classes: {1 = spam, 2 = non-spam}. Vocabulary V = {w₁,..., w_V}. Dataset: documents T_i = (t_{i,j}, j = 1,..., N_i), i ≤ n, with N_i the number of words in T_i and t_{i,j} ∈ V the j-th word of T_i. 42/78
46 Conditional model (text documents). Naive Bayes assumption: in document T_i, conditionally on the class, the words are drawn independently from each other from the vocabulary V. T_i can be summarized by a bag of words X_i = (X_{i,1},..., X_{i,V}), where X_{i,j} is the number of occurrences of word j in T_i. Conditional model for X_i given its class k ∈ {1, 2}: L(X_i | C = k) = Multi(θ_k = (θ_{1,k},..., θ_{V,k}), N_i), i.e. p_{k,θ_k}(x_i) = [N_i! / ∏_{j=1}^V x_{i,j}!] ∏_{j=1}^V θ_{j,k}^{x_{i,j}}. 43/78
47 1. Training step (text documents). Fit separately two multinomial models, on spam and on non-spam. Here the Dirichlet prior Diri(a₁,..., a_V), a_j > 0, is conjugate for the multinomial model, with density diri(θ | a₁,..., a_V) = [Γ(∑_{j=1}^V a_j) / ∏_{j=1}^V Γ(a_j)] ∏_{j=1}^V θ_j^{a_j - 1} on S_V = {θ ∈ R₊^V : ∑_{j=1}^V θ_j = 1}, the (V-1)-simplex. Mean of θ under π = Diri(a₁,..., a_V): E_π(θ) = (a₁/∑_j a_j,..., a_V/∑_j a_j). The posterior for x_{1:n} = (x_{i,1},..., x_{i,V})_{i∈{1,...,n}} is Diri(a₁ + ∑_{i=1}^n x_{i,1},..., a_V + ∑_{i=1}^n x_{i,V}). 44/78
48 1. Training step (text documents), Cont'd. Concatenate the documents of each class separately: x^{(k)} = (x_j^{(k)})_{j=1,...,V}, k = 1, 2, with x_j^{(k)} = total number of occurrences of word j in the documents of class k. θ_k = (θ_{k,1},..., θ_{k,V}): multinomial parameter for class k. Flat priors on θ_k: π₁ = π₂ = Diri(1,..., 1). Posterior mean estimates: θ̂_k = E_{π_k}[θ | x^{(k)}] = ((x_1^{(k)} + 1)/(V + ∑_{j=1}^V x_j^{(k)}),..., (x_V^{(k)} + 1)/(V + ∑_{j=1}^V x_j^{(k)})). (The prior acts as a regularizer: the +1 term avoids zero probabilities.) 45/78
49 2. Prediction step. For a new document x_new, the predictive probabilities of the classes are π(C = k | x_new) = p(x_new | C = k) π_k / [p(x_new | C = 1) π₁ + p(x_new | C = 2) π₂], with p(x_new | C = k) ∝ ∏_{j=1}^V θ̂_{k,j}^{x_new,j}. The class prediction is k*(x_new) = argmax_{k=1,2} π(C = k | x_new). 46/78
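The spam pipeline (concatenate counts per class, flat-Dirichlet posterior mean with the +1 smoothing term, then compare log scores) can be sketched as below. The five-word vocabulary and the document counts are made up for illustration, and the helper names are mine; class priors are taken uniform so that only the likelihood terms matter.

```python
import math

V = ["money", "free", "meeting", "project", "offer"]

# bag-of-words counts per document, with class 1 = spam, 2 = non-spam
docs = [([3, 2, 0, 0, 2], 1), ([2, 3, 0, 1, 1], 1),
        ([0, 0, 3, 2, 0], 2), ([0, 1, 2, 3, 0], 2)]

def fit(docs, V_size):
    """Concatenate counts per class and take the flat-Dirichlet posterior
    mean: theta_hat_{k,j} = (x_j^{(k)} + 1) / (V + sum_j x_j^{(k)})."""
    counts = {1: [0] * V_size, 2: [0] * V_size}
    for x, k in docs:
        for j in range(V_size):
            counts[k][j] += x[j]
    theta = {}
    for k, c in counts.items():
        total = sum(c) + V_size
        theta[k] = [(c[j] + 1) / total for j in range(V_size)]
    return theta

def log_score(x, theta_k, prior_k):
    # log pi_k + sum_j x_j log theta_{k,j}; the multinomial coefficient
    # is the same for both classes and cancels in the comparison
    return math.log(prior_k) + sum(xj * math.log(t) for xj, t in zip(x, theta_k))

theta = fit(docs, len(V))
x_new = [2, 1, 0, 0, 1]   # heavy on "money" / "free" / "offer"
scores = {k: log_score(x_new, theta[k], 0.5) for k in (1, 2)}
print(max(scores, key=scores.get))   # -> 1 (spam)
```

Without the +1 term, any word unseen in one class would force that class's likelihood to exactly zero for every document containing it, which is precisely what the regularizing prior prevents.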
50 1. Lecture 1 Cont'd: Conjugate priors and exponential family. Introduction, examples Exponential family Conjugate priors in exponential families Prior choice 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics Example: Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression Regression: reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 46/78
51 1. Lecture 1 Cont'd: Conjugate priors and exponential family. 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression Regression: reminders Bayesian linear regression 5. Bayesian model choice 47/78
52 The regression problem. Supervised learning: training dataset (x_i, Y_i), i ≤ n, with x_i ∈ X the features of observation i (considered non-random) and Y_i ∈ R the label (a random variable). Goal: for a new observation with features x_new, predict Y_new, i.e. construct a regression function h ∈ H so that h(x) is our best prediction of Y at point x. h should be simple (to avoid over-fitting): choose a simple class H. h should fit the data well, as measured through a loss function L(x, y, h); example: the squared error loss L(x, y, h) = (y - h(x))². 48/78
53 Multiple classical strategies. Statistical learning approach (empirical risk minimization): R_n(x_{1:n}, y_{1:n}, h) = (1/n) ∑ L(x_i, y_i, h); minimize R_n(x_{1:n}, y_{1:n}, h) over h ∈ H. Probabilistic modeling approach (likelihood-based): assume e.g. Y_i = h₀(x_i) + ε_i, with ε_i ∼ P_ε independent noises, e.g. P_ε = N(0, σ²), σ² known or not; likelihood of h: p_h(x_{1:n}, y_{1:n}) = ∏_{i=1}^n p_ε(y_i - h(x_i)); minimize -∑_{i=1}^n log p_ε(y_i - h(x_i)) over h ∈ H. With Gaussian noise, both strategies coincide. 49/78
54 Linear regression. h: a linear combination of basis functions φ_j: X → R (feature maps), j ∈ {1,..., p}: h(x) = ∑_{j=1}^p θ_j φ_j(x), θ_j unknown, φ_j known, i.e. H = {∑_{j=1}^p θ_j φ_j : θ = (θ₁,..., θ_p) ∈ R^p}. Examples: X = R^p, φ_j(x) = x_j (canonical feature map); X = R, φ_j(x) = x^{j-1} (polynomial basis functions); X = R^d, φ_j(x) = (2π)^{-d/2} det(Σ_j)^{-1/2} exp{-(1/2)(x - µ_j)^T Σ_j^{-1} (x - µ_j)} (Gaussian basis functions). 50/78
55 Empirical risk minimization for linear regression. Empirical risk: R_n(x_{1:n}, y_{1:n}, θ) = (1/2) ∑_{i=1}^n (y_i - ⟨θ, φ(x_i)⟩)² = (1/2) ‖y_{1:n} - Φθ‖², with Φ ∈ R^{n×p} the design matrix, Φ_{i,j} = φ_j(x_i). Minimizer of R_n: the least squares estimator, with explicit solution when Φ^T Φ is of rank p (invertible): θ̂ = (Φ^T Φ)^{-1} Φ^T y_{1:n}. 51/78
56 Regularization. Goals: prevent over-fitting and numerical instabilities (inversion of Φ^T Φ). Add a complexity penalty (a function of θ) to the empirical risk: penalty λ‖θ‖₂²: ridge regression; penalty λ‖θ‖₁: Lasso regression. E.g. with the L₂ penalty, the optimization problem becomes θ̂ = argmin_θ ‖y_{1:n} - Φθ‖² + λ‖θ‖₂² for some λ > 0, with solution θ̂ = (Φ^T Φ + λI_p)^{-1} Φ^T y_{1:n}. 52/78
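The closed-form ridge solution is a one-liner once the design matrix is built. A minimal sketch (my own synthetic data and variable names) that also verifies the first-order optimality condition of the penalized objective:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 200, 5, 2.0
Phi = rng.standard_normal((n, p))            # design matrix Phi_{ij} = phi_j(x_i)
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = Phi @ theta_true + 0.1 * rng.standard_normal(n)

# ridge estimator: theta_hat = (Phi^T Phi + lam I_p)^{-1} Phi^T y
theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

# check stationarity of the penalized objective ||y - Phi theta||^2 + lam ||theta||^2:
# its gradient is 2 (Phi^T Phi + lam I) theta - 2 Phi^T y, zero at the solution
grad = (Phi.T @ Phi + lam * np.eye(p)) @ theta_ridge - Phi.T @ y
print(float(np.max(np.abs(grad))))           # numerically ~ 0
```

Using `np.linalg.solve` on the regularized normal equations avoids forming an explicit inverse, and the λI_p term keeps the system well-conditioned even when Φ^T Φ is nearly singular.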
57 1. Lecture 1 Cont'd: Conjugate priors and exponential family. 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression Regression: reminders Bayesian linear regression 5. Bayesian model choice 53/78
58 Bayesian linear model. Again, Y_i = ⟨θ, φ(x_i)⟩ + ε_i. Assume ε_i ∼ N(0, β^{-1}), with β > 0 the noise precision, viewed as a constant (known or not). Prior distribution on θ ∈ R^p: π = N(m₀, S₀). Independence assumption: ε_{1:n} ⊥ θ. Y = Y_{1:n} = Φθ + ε_{1:n}, with Φ ∈ R^{n×p}, Φ_{i,j} = φ_j(x_i). Bayesian model: θ ∼ π = N(m₀, S₀); L[Y | θ] = N(Φθ, (1/β) I_n). Natural Bayesian estimator: θ̂ = E_π(θ | Y_{1:n}). Posterior distribution? 54/78
59 Conditioning and augmenting Gaussian vectors. Lemma: let W ∼ N(µ, Λ^{-1}) and L[Y | w] = N(Aw + b, L^{-1}), i.e. Y = AW + b + ε with ε ∼ N(0, L^{-1}), ε ⊥ W. Then L[W | y] = N(m_y, S) with S = (Λ + A^T L A)^{-1} and m_y = S[A^T L(y - b) + Λµ]. Proof: homework (see the exercise sheet online). 55/78
60 Application to posterior computation. Using the lemma with A = Φ, b = 0, W = θ, Λ = S₀^{-1}, µ = m₀, L = βI_n, we obtain immediately the posterior distribution π(· | Y_{1:n}) = L[θ | y_{1:n}] = N(m_n, S_n), with S_n = (S₀^{-1} + βΦ^T Φ)^{-1} and m_n = S_n(βΦ^T y_{1:n} + S₀^{-1} m₀). (1) Posterior mean estimate: θ̂ = E_π[θ | y_{1:n}] = m_n. 56/78
61 Special case: diagonal, centered prior. Choose m₀ = 0, S₀ = α^{-1} I_p, with α the prior precision (it makes sense!). Then (1) becomes S_n = (αI_p + βΦ^T Φ)^{-1}, m_n = S_n βΦ^T y_{1:n} = ((α/β) I_p + Φ^T Φ)^{-1} Φ^T y_{1:n}: the penalized least squares solution. (2) Adding a prior N(0, α^{-1} I_p) = adding an L₂ regularization with parameter λ = α/β. Remark: narrow prior ⟺ large α ⟺ large penalty. 57/78
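The equivalence stated on this slide, that the posterior mean under the N(0, α^{-1} I_p) prior is exactly the ridge solution with λ = α/β, is easy to verify numerically. A minimal sketch with my own synthetic data and variable names:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 100, 3
alpha, beta = 2.0, 25.0                  # prior and noise precisions
Phi = rng.standard_normal((n, p))
theta_true = np.array([0.5, -1.0, 2.0])
y = Phi @ theta_true + rng.standard_normal(n) / np.sqrt(beta)

# posterior N(m_n, S_n) with m_0 = 0, S_0 = alpha^{-1} I_p:
#   S_n = (alpha I + beta Phi^T Phi)^{-1},  m_n = S_n (beta Phi^T y)
S_n = np.linalg.inv(alpha * np.eye(p) + beta * Phi.T @ Phi)
m_n = S_n @ (beta * Phi.T @ y)

# equivalence with ridge regression at lambda = alpha / beta
lam = alpha / beta
theta_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
print(bool(np.allclose(m_n, theta_ridge)))   # True
```

The Bayesian route buys something the ridge route does not: S_n, i.e. posterior uncertainty around m_n, which the predictive-distribution slide puts to use.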
62 Predictive distribution. New data point (x_new, Y_new), with Y_new not observed and x_new known. Goal: obtain the posterior distribution of Y_new (mean, variance, credible intervals). We still have Y_new = ⟨θ, φ(x_new)⟩ + ε, with ε ∼ N(0, β^{-1}) and ε ⊥ θ. Now (after the training step) θ ∼ π(· | y_{1:n}) = N(m_n, S_n). Thus Y_new is a linear transform of the Gaussian vector (ε, θ), and L[Y_new | y_{1:n}] = N(φ(x_new)^T m_n, φ(x_new)^T S_n φ(x_new) + β^{-1}). 58/78
63 Example: polynomial basis functions. True regression function: h₀(x) = sin(x). Polynomial basis functions: φ(x) = (1, x, x², x³, x⁴) (p = 5). [Figure: observations y_i scattered around the curve h₀(x) = sin(x); legend: h₀(x) = sin(x), observations, noise ε_i.] 59/78
64 Estimated regression function: ĥ(x) = ⟨θ̂, φ(x)⟩ = ∑_{j=1}^5 θ̂_j x^{j-1}, with the previous dataset. [Figure, two panels: observations, true curve h₀(x) = sin(x) and the estimated regression function, for α = 0.01 (left) and α = 100 (right).] 60/78
65 Predictive distribution. ĥ(x): the mean of L(Y_new | y_{1:n}) for x_new = x. Recall L(Y_new | y_{1:n}) = N(ĥ(x), σ²_new), with σ²_new = φ(x)^T S_n φ(x) + β^{-1}. Posterior credible interval for Y: I_x = [ĥ(x) - 1.96 σ_new, ĥ(x) + 1.96 σ_new]. [Figure, two panels: observations, true curve h₀(x) = sin(x), posterior mean and credible intervals, for α = 0.01 (left) and α = 100 (right).] 61/78
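The sin(x) example of the last three slides can be reproduced in a few lines. This is my own reconstruction of the setup (the exact n, α, β and noise level of the figures are not given, so the values below are assumptions); it computes the posterior, then the predictive mean and a 95% credible interval at a single query point.

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta = 0.5, 25.0               # assumed prior and noise precisions
n = 40
x = rng.uniform(0, 2 * np.pi, n)
y = np.sin(x) + rng.standard_normal(n) / np.sqrt(beta)

def phi(x):
    """Polynomial feature map (1, x, x^2, x^3, x^4), p = 5."""
    return np.vander(np.atleast_1d(x), 5, increasing=True)

Phi = phi(x)
S_n = np.linalg.inv(alpha * np.eye(5) + beta * Phi.T @ Phi)
m_n = S_n @ (beta * Phi.T @ y)

# predictive distribution at a new point:
#   mean     h_hat(x) = phi(x)^T m_n
#   variance sigma^2  = phi(x)^T S_n phi(x) + 1/beta
f = phi(np.array([1.0]))
mean = (f @ m_n)[0]
var = (f @ S_n @ f.T)[0, 0] + 1.0 / beta
lo, hi = mean - 1.96 * np.sqrt(var), mean + 1.96 * np.sqrt(var)
print(round(mean, 2), round(lo, 2), round(hi, 2))   # interval around sin(1)
```

The β^{-1} term keeps the predictive variance bounded away from zero even where the data are dense: no amount of data removes the observation noise from a single future Y.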
66 1. Lecture 1 Cont'd: Conjugate priors and exponential family. Introduction, examples Exponential family Conjugate priors in exponential families Prior choice 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics Example: Beta-Binomial model Posterior consistency Asymptotic normality 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression Regression: reminders Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 61/78
67 Model choice problem. What if several models are in competition, {M_k, k ∈ {1,..., K}}, with M_k = {Θ_k, π_k}? Continuous case: a family of models {M_α, α ∈ A}. How to choose k or α? Examples: M₁ = {Θ, π₁}, M₂ = {Θ, π₂} with π₁ a flat prior and π₂ the Jeffreys prior; M_α: the linear model with Gaussian prior N(0, α^{-1} I_p) on θ. 62/78
68 1. Lecture 1 Cont'd: Conjugate priors and exponential family. 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 63/78
69 Hierarchical models. Bayesian view: put a prior on unknown quantities, then condition upon the data. Model choice problem: put a hyper-prior on α ∈ A (or on k ∈ {1,..., K}): a hierarchical Bayesian model. Convenient when dealing with parallel experiments. 64/78
70 Example of a hierarchical model. Example: 2 rivers with fish. X_i ∈ {0, 1}: a caught fish is ill or sound. X_i ∼ Ber(θ), with θ = θ₁ in river 1 and θ = θ₂ in river 2. θ₁ and θ₂ are 2 realizations of θ ∼ Beta(a, b). α = (a, b): hyper-parameter for the prior. Hierarchical Bayes: put a prior on α (e.g. a product of 2 independent Gamma distributions). 65/78
71 Posterior mean estimates in a BMA framework. Denote by π^h the hyper-prior on k (or α). Let us stick to the discrete case, k ∈ {1,..., K}. The prior is a mixture distribution π = ∑_{k=1}^K π^h(k) π_k(·), i.e. for every π-integrable function g(θ), E_π[g(θ)] = E_{π^h}[E_π[g(θ) | k]] = ∑_{k=1}^K π^h(k) ∫_{Θ_k} g(θ) dπ_k(θ). So is the posterior distribution; thus the posterior mean is a weighted average: ĝ = E_π[g(θ) | X_{1:n}] = E_{π^h}[E_π[g(θ) | k, X_{1:n}] | X_{1:n}] = ∑_{k=1}^K π^h(k | X_{1:n}) ∫_{Θ_k} g(θ) dπ_k(θ | X_{1:n}), where the inner integral is ĝ_k, the posterior mean in model k. 66/78
72 Model evidence. Computing the posterior mean in the BMA framework requires: computing the posterior means ĝ_k in each individual model k (moderate tasks); averaging them with the weights π^h(k | X_{1:n}), the posterior weight of model k. Bayes formula: π^h(k | X_{1:n}) = π^h(k) p(X_{1:n} | k) / ∑_{j=1}^K π^h(j) p(X_{1:n} | j), with p(X_{1:n} | k) = evidence of model k = ∫_{Θ_k} p(X_{1:n} | θ) dπ_k(θ) = m_k(X_{1:n}), the marginal likelihood of X_{1:n} in model k: hard to compute (an integral). 67/78
73 Shortcomings of BMA. Inference has to be done in each individual model. Usually one weight (say π^h(k* | X_{1:n})) ≈ 1 and all the others ≈ 0 (reason: concentration of the posterior around the true θ₀ ∈ Θ_{k₀}, with k* = k₀). The final estimate is then ĝ ≈ ĝ_{k₀}; the other ĝ_k's are almost useless. Bottleneck: computing k*: a model choice problem. 68/78
74 1. Lecture 1 Cont'd: Conjugate priors and exponential family. 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 69/78
75 Posterior weights, model evidence and Bayes factor. Recall k* = argmax_k π(k | X_{1:n}) = argmax_k p(X_{1:n} | k) π^h(k), where p(X_{1:n} | k) is the evidence of model k. With a uniform prior on k, only the evidence p(X_{1:n} | k) matters; in any case the prior influence vanishes with n. Relevant quantity for comparing models k and j: the Bayes factor B_kj = p(X_{1:n} | k)/p(X_{1:n} | j) (Jeffreys, 1961). Suggested scale for decision making: log₁₀ B_kj in [0, 1/2] (B_kj in [1, 3.2]): evidence against M_j not significant; log₁₀ B_kj in [1/2, 1] (B_kj in [3.2, 10]): substantial; log₁₀ B_kj in [1, 2] (B_kj in [10, 100]): strong; log₁₀ B_kj > 2 (B_kj > 100): decisive. 70/78
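For conjugate models the evidence integral on the previous slides is available in closed form, so Bayes factors can be computed exactly. A minimal sketch (my own example: i.i.d. Bernoulli data, a flat Beta(1,1) prior against a Beta(10,10) prior concentrated near 1/2; the helper names are mine) using the identity p(x_{1:n}) = B(a+s, b+n-s)/B(a, b):

```python
import math
import random

def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence(s, n, a, b):
    """log p(x_{1:n} | model) for i.i.d. Bernoulli data under a Beta(a, b)
    prior: the evidence integral is B(a + s, b + n - s) / B(a, b)."""
    return log_beta(a + s, b + n - s) - log_beta(a, b)

random.seed(7)
theta0, n = 0.9, 100
s = sum(1 if random.random() < theta0 else 0 for _ in range(n))

# model 1: flat Beta(1, 1); model 2: Beta(10, 10), concentrated near 1/2
log_B12 = log_evidence(s, n, 1, 1) - log_evidence(s, n, 10, 10)
print(round(log_B12 / math.log(10), 1))   # log10 Bayes factor for M1 over M2
```

With data generated at θ₀ = 0.9 the flat-prior model wins clearly on Jeffreys' scale: the Beta(10,10) prior wastes its mass near 1/2, which is exactly the automatic complexity penalty discussed on the following slides.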
76 1. Lecture 1 Cont'd: Conjugate priors and exponential family. 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 71/78
77 Occam's razor principle. Avoid over-fitting: between 2 models explaining the data equally well, one ought to choose the simplest one. Better generalization properties. 72/78
78 Occam's razor and model evidence. When selecting k according to the model evidences p(X_{1:n} | k), Occam's razor is automatically implemented. Reason: the prior plays the role of a regularizer. 73/78
79 Automatic complexity penalty: intuition 1. Complex model ⟹ large Θ_k ⟹ small π_k(θ) (if uniform over Θ_k) ⟹ ∫_{Θ_k} p_θ(x_{1:n}) π_k(θ) dθ small (an average over large regions where p_θ(x_{1:n}) is small). 74/78
80 Automatic complexity penalty: intuition 2. If Θ_k ⊂ R: assume π_k is flat over an interval of length Δ_k^{prior}, and that θ ↦ p_θ(X_{1:n}) is peaked around p_{θ_{MAP,k}}(X_{1:n}) with width Δ_k^{posterior}. Then π_k(θ) ≈ 1/Δ_k^{prior} and p(X_{1:n} | k) = ∫_{Θ_k} p_θ(X_{1:n}) π_k(θ) dθ ≈ p_{θ_{MAP,k}}(X_{1:n}) Δ_k^{posterior}/Δ_k^{prior}, so that log p(X_{1:n} | k) ≈ log p_{θ_{MAP,k}}(X_{1:n}) + log(Δ_k^{posterior}/Δ_k^{prior}), the last term being a complexity penalty (it is negative since Δ_k^{posterior} < Δ_k^{prior}). If Θ_k ⊂ R^d, with the same approximation in each dimension: log p(X_{1:n} | k) ≈ log p_{θ_{MAP,k}}(X_{1:n}) + d log(Δ_k^{posterior}/Δ_k^{prior}): dimension × complexity penalty. 75/78
81 1. Lecture 1 Cont'd: Conjugate priors and exponential family. 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 76/78
82 1. Lecture 1 Cont'd: Conjugate priors and exponential family. 2. Lecture 1 Cont'd: A glimpse at Bayesian asymptotics 3. Supervised learning example: Naive Bayes Classification 4. Bayesian linear regression 5. Bayesian model choice Bayesian model averaging Bayesian model selection Automatic complexity penalty Laplace approximation and BIC criterion Empirical Bayes 77/78
83 Bibliography. [Ghosh and Ramamoorthi, 2003] Ghosh, J. and Ramamoorthi, R. (2003). Bayesian Nonparametrics. Springer. [Schervish, 2012] Schervish, M. J. (2012). Theory of Statistics. Springer Science & Business Media. [Van der Vaart, 1998] Van der Vaart, A. W. (1998). Asymptotic Statistics, volume 3. Cambridge University Press. 78/78
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationParametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012
Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood
More informationBayesian Regularization
Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors
More informationIntroduction to Bayesian Methods
Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative
More informationNaïve Bayes classification
Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss
More informationA Very Brief Summary of Statistical Inference, and Examples
A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming
More informationLecture 4. Generative Models for Discrete Data - Part 3. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza.
Lecture 4 Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza October 6, 2017 Luigi Freda ( La Sapienza University) Lecture 4 October 6, 2017 1 / 46 Outline
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationA Very Brief Summary of Bayesian Inference, and Examples
A Very Brief Summary of Bayesian Inference, and Examples Trinity Term 009 Prof Gesine Reinert Our starting point are data x = x 1, x,, x n, which we view as realisations of random variables X 1, X,, X
More informationSemiparametric posterior limits
Statistics Department, Seoul National University, Korea, 2012 Semiparametric posterior limits for regular and some irregular problems Bas Kleijn, KdV Institute, University of Amsterdam Based on collaborations
More informationInvariant HPD credible sets and MAP estimators
Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide
More informationLecture 1 October 9, 2013
Probabilistic Graphical Models Fall 2013 Lecture 1 October 9, 2013 Lecturer: Guillaume Obozinski Scribe: Huu Dien Khue Le, Robin Bénesse The web page of the course: http://www.di.ens.fr/~fbach/courses/fall2013/
More informationIntroduction to Machine Learning
Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},
More informationCOS513 LECTURE 8 STATISTICAL CONCEPTS
COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions
More informationIntroduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models
Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models Matthew S. Johnson New York ASA Chapter Workshop CUNY Graduate Center New York, NY hspace1in December 17, 2009 December
More informationNon-Parametric Bayes
Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationLECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)
LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 2: Bayesian Basics https://people.orie.cornell.edu/andrew/orie6741 Cornell University August 25, 2016 1 / 17 Canonical Machine Learning
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationLecture 1: Introduction
Principles of Statistics Part II - Michaelmas 208 Lecturer: Quentin Berthet Lecture : Introduction This course is concerned with presenting some of the mathematical principles of statistical theory. One
More informationParametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory
Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007
More informationLecture 5: GPs and Streaming regression
Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X
More informationFoundations of Statistical Inference
Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 32 Lecture 14 : Variational Bayes
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationMotivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University
Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationExponential Families
Exponential Families David M. Blei 1 Introduction We discuss the exponential family, a very flexible family of distributions. Most distributions that you have heard of are in the exponential family. Bernoulli,
More informationLecture 8: Information Theory and Statistics
Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang
More informationStatistical Approaches to Learning and Discovery. Week 4: Decision Theory and Risk Minimization. February 3, 2003
Statistical Approaches to Learning and Discovery Week 4: Decision Theory and Risk Minimization February 3, 2003 Recall From Last Time Bayesian expected loss is ρ(π, a) = E π [L(θ, a)] = L(θ, a) df π (θ)
More informationSome Curiosities Arising in Objective Bayesian Analysis
. Some Curiosities Arising in Objective Bayesian Analysis Jim Berger Duke University Statistical and Applied Mathematical Institute Yale University May 15, 2009 1 Three vignettes related to John s work
More informationParametric Techniques Lecture 3
Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to
More informationGeneral Bayesian Inference I
General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for
More informationFoundations of Statistical Inference
Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 2013 Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 27 Course arrangements Lectures M.2
More informationNonparametric Bayesian Methods - Lecture I
Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationConstruction of an Informative Hierarchical Prior Distribution: Application to Electricity Load Forecasting
Construction of an Informative Hierarchical Prior Distribution: Application to Electricity Load Forecasting Anne Philippe Laboratoire de Mathématiques Jean Leray Université de Nantes Workshop EDF-INRIA,
More informationGaussian processes and bayesian optimization Stanisław Jastrzębski. kudkudak.github.io kudkudak
Gaussian processes and bayesian optimization Stanisław Jastrzębski kudkudak.github.io kudkudak Plan Goal: talk about modern hyperparameter optimization algorithms Bayes reminder: equivalent linear regression
More informationBayesian statistics: Inference and decision theory
Bayesian statistics: Inference and decision theory Patric Müller und Francesco Antognini Seminar über Statistik FS 28 3.3.28 Contents 1 Introduction and basic definitions 2 2 Bayes Method 4 3 Two optimalities:
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationVariable selection for model-based clustering
Variable selection for model-based clustering Matthieu Marbac (Ensai - Crest) Joint works with: M. Sedki (Univ. Paris-sud) and V. Vandewalle (Univ. Lille 2) The problem Objective: Estimation of a partition
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationStatistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks
Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered
More informationCovariance function estimation in Gaussian process regression
Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian
More informationLECTURE 11: EXPONENTIAL FAMILY AND GENERALIZED LINEAR MODELS
LECTURE : EXPONENTIAL FAMILY AND GENERALIZED LINEAR MODELS HANI GOODARZI AND SINA JAFARPOUR. EXPONENTIAL FAMILY. Exponential family comprises a set of flexible distribution ranging both continuous and
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationCS-E3210 Machine Learning: Basic Principles
CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction
More informationOutline. Supervised Learning. Hong Chang. Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012)
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Linear Models for Regression Linear Regression Probabilistic Interpretation
More informationNaive Bayes and Gaussian Bayes Classifier
Naive Bayes and Gaussian Bayes Classifier Mengye Ren mren@cs.toronto.edu October 18, 2015 Mengye Ren Naive Bayes and Gaussian Bayes Classifier October 18, 2015 1 / 21 Naive Bayes Bayes Rules: Naive Bayes
More informationPart III. A Decision-Theoretic Approach and Bayesian testing
Part III A Decision-Theoretic Approach and Bayesian testing 1 Chapter 10 Bayesian Inference as a Decision Problem The decision-theoretic framework starts with the following situation. We would like to
More informationBayesian linear regression
Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding
More informationOutline Lecture 2 2(32)
Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 3 Stochastic Gradients, Bayesian Inference, and Occam s Razor https://people.orie.cornell.edu/andrew/orie6741 Cornell University August
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample
More informationLecture 2. (See Exercise 7.22, 7.23, 7.24 in Casella & Berger)
8 HENRIK HULT Lecture 2 3. Some common distributions in classical and Bayesian statistics 3.1. Conjugate prior distributions. In the Bayesian setting it is important to compute posterior distributions.
More information1. Fisher Information
1. Fisher Information Let f(x θ) be a density function with the property that log f(x θ) is differentiable in θ throughout the open p-dimensional parameter set Θ R p ; then the score statistic (or score
More informationMachine learning - HT Maximum Likelihood
Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce
More informationNaive Bayes and Gaussian Bayes Classifier
Naive Bayes and Gaussian Bayes Classifier Ladislav Rampasek slides by Mengye Ren and others February 22, 2016 Naive Bayes and Gaussian Bayes Classifier February 22, 2016 1 / 21 Naive Bayes Bayes Rule:
More informationCS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas
CS839: Probabilistic Graphical Models Lecture 7: Learning Fully Observed BNs Theo Rekatsinas 1 Exponential family: a basic building block For a numeric random variable X p(x ) =h(x)exp T T (x) A( ) = 1
More informationLogistic Regression. Jia-Bin Huang. Virginia Tech Spring 2019 ECE-5424G / CS-5824
Logistic Regression Jia-Bin Huang ECE-5424G / CS-5824 Virginia Tech Spring 2019 Administrative Please start HW 1 early! Questions are welcome! Two principles for estimating parameters Maximum Likelihood
More informationThe Bayesian approach to inverse problems
The Bayesian approach to inverse problems Youssef Marzouk Department of Aeronautics and Astronautics Center for Computational Engineering Massachusetts Institute of Technology ymarz@mit.edu, http://uqgroup.mit.edu
More informationIntroduction to Machine Learning
Introduction to Machine Learning Bayesian Classification Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationSTA414/2104. Lecture 11: Gaussian Processes. Department of Statistics
STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations
More informationHypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33
Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett
More informationMaximum likelihood estimation
Maximum likelihood estimation Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Maximum likelihood estimation 1/26 Outline 1 Statistical concepts 2 A short review of convex analysis and optimization
More informationBayesian estimation of the discrepancy with misspecified parametric models
Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012
More informationLinear Models A linear model is defined by the expression
Linear Models A linear model is defined by the expression x = F β + ɛ. where x = (x 1, x 2,..., x n ) is vector of size n usually known as the response vector. β = (β 1, β 2,..., β p ) is the transpose
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationMachine Learning - MT Classification: Generative Models
Machine Learning - MT 2016 7. Classification: Generative Models Varun Kanade University of Oxford October 31, 2016 Announcements Practical 1 Submission Try to get signed off during session itself Otherwise,
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More information2.3 Exponential Families 61
2.3 Exponential Families 61 Fig. 2.9. Watson Nadaraya estimate. Left: a binary classifier. The optimal solution would be a straight line since both classes were drawn from a normal distribution with the
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More informationKernel methods, kernel SVM and ridge regression
Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;
More informationNonparameteric Regression:
Nonparameteric Regression: Nadaraya-Watson Kernel Regression & Gaussian Process Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro,
More informationBayesian methods in economics and finance
1/26 Bayesian methods in economics and finance Linear regression: Bayesian model selection and sparsity priors Linear Regression 2/26 Linear regression Model for relationship between (several) independent
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationST5215: Advanced Statistical Theory
Department of Statistics & Applied Probability Monday, September 26, 2011 Lecture 10: Exponential families and Sufficient statistics Exponential Families Exponential families are important parametric families
More informationCSci 8980: Advanced Topics in Graphical Models Gaussian Processes
CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian
More information