The Bayesian Learning Framework. Back to Maximum Likelihood. Naïve Bayes. Simple Example: Coin Tosses. Gaussian Processes.


Back to Maximum Likelihood

Given a generative model
$$f(x, y = k) = \pi_k f_k(x).$$
Using a generative modelling approach, we assume a parametric form for $f_k(x) = f(x; \theta_k)$ and compute the MLE $\hat\theta$ of $\theta = (\pi_k, \theta_k)_{k=1}^K$ based on the training data $\{x_i, y_i\}_{i=1}^n$. We then use a plug-in approach to perform classification:
$$p(Y = k \mid X = x, \hat\theta) = \frac{\hat\pi_k f(x; \hat\theta_k)}{\sum_{j=1}^K \hat\pi_j f(x; \hat\theta_j)}.$$
Even for simple models, this can prove difficult; e.g. for LDA, $f(x; \theta_k) = \mathcal{N}(x; \mu_k, \Sigma)$, and the MLE estimate of $\Sigma$ is not full rank for $p > n$. One answer: simplify even further, e.g. using axis-aligned covariances, but this is usually too crude. Another answer: regularization.

The Bayesian Learning Framework

Bayes' Theorem: given two random variables $X$ and $\Theta$,
$$p(\Theta \mid X) = \frac{p(X \mid \Theta)\, p(\Theta)}{p(X)}.$$
Likelihood: $p(X \mid \Theta)$. Posterior: $p(\Theta \mid X)$. Prior: $p(\Theta)$. Marginal likelihood: $p(X) = \int p(X \mid \Theta)\, p(\Theta)\, d\Theta$.

Treat parameters as random variables; the process of learning is then just the computation of the posterior $p(\Theta \mid X)$.

Summarizing the posterior:
Posterior mode: $\theta^{MAP} = \operatorname{argmax}_\theta p(\theta \mid X)$ (maximum a posteriori).
Posterior mean: $\theta^{mean} = \mathbb{E}[\Theta \mid X]$.
Posterior variance: $\operatorname{Var}[\Theta \mid X]$.

How to make decisions and predictions? Decision theory. How to compute the posterior?

Naïve Bayes

Return to the spam classification example with two-class naïve Bayes:
$$f(x_i; \phi_k) = \prod_{j=1}^p \phi_{kj}^{x_{ij}} (1 - \phi_{kj})^{1 - x_{ij}}.$$
The MLE estimates are given by
$$\hat\phi_{kj} = \frac{\sum_{i=1}^n \mathbb{1}(x_{ij} = 1, y_i = k)}{n_k}, \qquad \hat\pi_k = \frac{n_k}{n},$$
where $n_k = \sum_{i=1}^n \mathbb{1}(y_i = k)$. If a word $j$ does not appear in class $k$ by chance, but it does appear in a document $x^\star$, then $p(x^\star \mid y = k) = 0$ and so the posterior $p(y = k \mid x^\star) = 0$. Worse things can happen: e.g., the probability of a document under all classes can be $0$, so the posterior is ill-defined.

Simple Example: Coin Tosses

A very simple example: we have a coin with probability $\phi$ of coming up heads. Model coin tosses as iid Bernoullis, $1$ = head, $0$ = tail. Learn about $\phi$ given a dataset $D = (x_i)_{i=1}^n$ of tosses.

Maximum likelihood:
$$f(D \mid \phi) = \phi^{n_1} (1 - \phi)^{n_0}, \qquad \hat\phi^{ML} = \frac{n_1}{n},$$
with $n_j = \sum_{i=1}^n \mathbb{1}(x_i = j)$.

Bayesian approach: treat the unknown parameter as a random variable $\Phi$. Simple prior: $\Phi \sim U[0, 1]$. Posterior distribution:
$$p(\phi \mid D) = \frac{1}{Z} \phi^{n_1} (1 - \phi)^{n_0}, \qquad Z = \int_0^1 \phi^{n_1} (1 - \phi)^{n_0}\, d\phi = \frac{n_1!\, n_0!}{(n + 1)!}.$$
The posterior is a Beta$(n_1 + 1, n_0 + 1)$ distribution.
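As a quick sanity check on the Beta posterior just derived, the following minimal Python sketch (not from the slides; the sample size is an arbitrary choice, and the bias 0.7 matches the true value used in the slides' figures) simulates coin tosses and forms the Beta$(n_1 + 1, n_0 + 1)$ posterior with SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate n tosses of a coin with true P(heads) = 0.7.
true_phi, n = 0.7, 100
x = rng.binomial(1, true_phi, size=n)
n1 = int(x.sum())
n0 = n - n1

# Uniform prior Phi ~ U[0, 1]  =>  posterior is Beta(n1 + 1, n0 + 1).
posterior = stats.beta(n1 + 1, n0 + 1)

print("ML estimate     :", n1 / n)            # also the posterior mode
print("posterior mean  :", posterior.mean())  # (n1 + 1) / (n + 2)
print("posterior stdev :", posterior.std())   # shrinks roughly like 1/sqrt(n)
```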

Simple Example: Coin Tosses

[Figure: posterior densities for growing datasets; each panel is labelled with the sample size $n$ and the head count $n_1$.]

The posterior becomes peaked at the true value $\phi = 0.7$ as the dataset grows.

What about test data? The posterior predictive distribution is the conditional distribution of $x_{n+1}$ given $(x_i)_{i=1}^n$:
$$p(x_{n+1} \mid (x_i)_{i=1}^n) = \int p(x_{n+1}, \phi \mid (x_i)_{i=1}^n)\, d\phi = \int p(x_{n+1} \mid \phi)\, p(\phi \mid (x_i)_{i=1}^n)\, d\phi = (\phi^{mean})^{x_{n+1}} (1 - \phi^{mean})^{1 - x_{n+1}}.$$
We predict on new data by averaging the predictive distribution over the posterior. This accounts for uncertainty about $\phi$.

Simple Example: Coin Tosses

The posterior distribution captures all learnt information.
Posterior mode: $\phi^{MAP} = \frac{n_1}{n}$.
Posterior mean: $\phi^{mean} = \frac{n_1 + 1}{n + 2}$.
Posterior variance: $\frac{\phi^{mean}(1 - \phi^{mean})}{n + 3}$.
Asymptotically, for large $n$, the variance decreases as $1/n$ and is given by the inverse of Fisher's information. The posterior distribution converges to the true parameter as $n \to \infty$.

Simple Example: Coin Tosses

The posterior distribution has a known analytic form. In fact, the posterior distribution is in the same beta family as the prior: an example of a conjugate prior.

A beta distribution Beta$(a, b)$ with parameters $a, b > 0$ is an exponential family distribution with density
$$p(\phi \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \phi^{a - 1} (1 - \phi)^{b - 1},$$
where $\Gamma(t) = \int_0^\infty u^{t-1} e^{-u}\, du$ is the gamma function. If the prior is Beta$(a, b)$, then the posterior distribution is
$$p(\phi \mid D, a, b) \propto \phi^{a + n_1 - 1} (1 - \phi)^{b + n_0 - 1},$$
so is Beta$(a + n_1, b + n_0)$. The hyperparameters $a$ and $b$ are pseudo-counts, an imaginary initial sample that reflects our prior beliefs about $\phi$.
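A small sketch of the conjugate update and the posterior predictive; the helper name `coin_posterior` and the toy data are illustrative assumptions, not from the slides:

```python
from scipy import stats

def coin_posterior(tosses, a=1.0, b=1.0):
    """Beta(a, b) prior + Bernoulli likelihood -> Beta(a + n1, b + n0) posterior."""
    n1 = sum(tosses)
    n0 = len(tosses) - n1
    return stats.beta(a + n1, b + n0)

tosses = [1, 1, 0, 1, 0, 1, 1]               # 1 = head, 0 = tail
post = coin_posterior(tosses, a=2.0, b=2.0)  # a, b act as pseudo-counts

# Posterior predictive for the next toss: p(x_{n+1} = 1 | D) = E[Phi | D].
print("p(next toss = head | D) =", post.mean())
print("95% credible interval   =", post.interval(0.95))
```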

Beta Distributions

[Figure: Beta$(a, b)$ densities for a range of parameter settings, from U-shaped (both parameters below 1) through uniform and symmetric unimodal to skewed shapes.]

Dirichlet Distributions

[Figure: (A) support of the Dirichlet density (the probability simplex); (B), (C) Dirichlet densities for two settings of $\alpha_k$.]

Bayesian Inference for Multinomials

Suppose $x_i \in \{1, \dots, K\}$ instead, and we model $(x_i)_{i=1}^n$ as iid multinomials:
$$p(D \mid \pi) = \prod_{i=1}^n \pi_{x_i} = \prod_{k=1}^K \pi_k^{n_k},$$
with $n_k = \sum_{i=1}^n \mathbb{1}(x_i = k)$ and $\pi_k > 0$, $\sum_{k=1}^K \pi_k = 1$.

The conjugate prior is the Dirichlet distribution. Dir$(\alpha_1, \dots, \alpha_K)$ has parameters $\alpha_k > 0$ and density
$$p(\pi) = \frac{\Gamma(\sum_{k=1}^K \alpha_k)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \pi_k^{\alpha_k - 1}$$
on the probability simplex $\{\pi : \pi_k > 0, \sum_{k=1}^K \pi_k = 1\}$.

The posterior is also Dirichlet, with parameters $(\alpha_k + n_k)_{k=1}^K$. The posterior mean is
$$\pi_k^{mean} = \frac{\alpha_k + n_k}{\sum_{j=1}^K (\alpha_j + n_j)}.$$

Text Classification with (Less) Naïve Bayes

Under the Naïve Bayes model, the joint distribution of labels $y_i \in \{1, \dots, K\}$ and data vectors $x_i \in \{0, 1\}^p$ is
$$\prod_{i=1}^n p(x_i, y_i) = \prod_{i=1}^n \prod_{k=1}^K \left( \pi_k \prod_{j=1}^p \phi_{kj}^{x_{ij}} (1 - \phi_{kj})^{1 - x_{ij}} \right)^{\mathbb{1}(y_i = k)} = \prod_{k=1}^K \pi_k^{n_k} \prod_{j=1}^p \phi_{kj}^{n_{kj}} (1 - \phi_{kj})^{n_k - n_{kj}},$$
where $n_k = \sum_{i=1}^n \mathbb{1}(y_i = k)$ and $n_{kj} = \sum_{i=1}^n \mathbb{1}(y_i = k, x_{ij} = 1)$.

For a conjugate prior, we can use Dir$((\alpha_k)_{k=1}^K)$ for $\pi$, and Beta$(a, b)$ for each $\phi_{kj}$ independently. Because the likelihood factorizes, the posterior distribution over $\pi$ and $(\phi_{kj})$ also factorizes: the posterior for $\pi$ is Dir$((\alpha_k + n_k)_{k=1}^K)$, and the posterior for $\phi_{kj}$ is Beta$(a + n_{kj}, b + n_k - n_{kj})$.
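The Dirichlet-multinomial conjugacy above is easy to check numerically. A hedged sketch, where the counts and the choice of a flat Dir$(1, \dots, 1)$ prior are illustrative assumptions of mine:

```python
import numpy as np
from scipy import stats

x = np.array([0, 2, 1, 0, 0, 2, 2, 2, 1, 0])  # categories in {0, ..., K-1}
K = 3
counts = np.bincount(x, minlength=K)           # n_k

alpha = np.ones(K)                             # Dir(1, ..., 1): uniform on the simplex
posterior = stats.dirichlet(alpha + counts)    # conjugacy: Dir(alpha_k + n_k)

print("posterior mean :", posterior.mean())    # (alpha_k + n_k) / sum_j (alpha_j + n_j)
print("one draw of pi :", posterior.rvs(random_state=0)[0])
```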

Text Classification with (Less) Naïve Bayes

For prediction given $D = (x_i, y_i)_{i=1}^n$ we can calculate
$$p(\tilde x, \tilde y = k \mid D) = p(\tilde y = k \mid D)\, p(\tilde x \mid \tilde y = k, D),$$
with
$$p(\tilde y = k \mid D) = \frac{\alpha_k + n_k}{n + \sum_{l=1}^K \alpha_l}, \qquad p(\tilde x_j = 1 \mid \tilde y = k, D) = \frac{a + n_{kj}}{a + b + n_k}.$$
The predicted class is given by
$$p(\tilde y = k \mid \tilde x, D) = \frac{p(\tilde y = k \mid D)\, p(\tilde x \mid \tilde y = k, D)}{p(\tilde x \mid D)}.$$
Compared to the ML plug-in estimator, pseudo-counts help to regularize probabilities away from extreme values; a sketch of this classifier follows the readings below.

Bayesian Learning Discussion

Clear separation between models, which frame learning problems and encapsulate prior information, and algorithms, which compute posteriors and predictions.

Bayesian computations: most posteriors are intractable, and algorithms are needed to efficiently approximate the posterior: Monte Carlo methods (Markov chain and sequential varieties); variational methods (variational Bayes, belief propagation etc).

No optimization, no overfitting (!), but there can still be model misfit.

Tuning parameters $\Psi$ can be optimized (without need for cross-validation):
$$p(X \mid \Psi) = \int p(X \mid \theta)\, p(\theta \mid \Psi)\, d\theta, \qquad p(\Psi \mid X) = \frac{p(X \mid \Psi)\, p(\Psi)}{p(X)}.$$
Be Bayesian about $\Psi$: compute its posterior. Type II maximum likelihood: find $\Psi$ maximizing $p(X \mid \Psi)$.

Bayesian Learning and Regularization

Consider a Bayesian approach to logistic regression: introduce a multivariate normal prior for $b$, and a uniform (improper) prior for $a$. The prior density is
$$p(a, b) = (2\pi\sigma^2)^{-p/2} e^{-\frac{1}{2\sigma^2} \|b\|_2^2}.$$
The posterior is
$$p(a, b \mid D) \propto \exp\left( -\frac{1}{2\sigma^2} \|b\|_2^2 - \sum_{i=1}^n \log\left( 1 + \exp\left( -y_i (a + b^\top x_i) \right) \right) \right).$$
The posterior mode is the set of parameters maximizing the above, equivalent to minimizing the L2-regularized empirical risk. Regularized empirical risk minimization is (often) equivalent to having a prior and finding the maximum a posteriori (MAP) parameters:
L2 regularization: multivariate normal prior.
L1 regularization: multivariate Laplace prior.
From a Bayesian perspective, the MAP parameters are just one way to summarize the posterior distribution.

Bayesian Learning Further Readings

Zoubin Ghahramani. Bayesian Learning. Graphical models. Videolectures.
Gelman et al. Bayesian Data Analysis.
Kevin Murphy. Machine Learning: a Probabilistic Perspective.
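Putting the prediction formulas from this section together, here is a hedged sketch of the conjugate ("less naive") Bayes classifier. The function names and toy documents are illustrative, not from the slides; working in log space is a standard numerical choice, safe here because pseudo-counts keep every probability strictly inside $(0, 1)$:

```python
import numpy as np

def fit_less_naive_bayes(X, y, K, alpha=1.0, a=1.0, b=1.0):
    """Posterior predictive parameters for the conjugate naive Bayes model.
    X: (n, p) binary array; y: (n,) labels in {0, ..., K-1}."""
    n, p = X.shape
    n_k = np.bincount(y, minlength=K)                           # class counts
    n_kj = np.array([X[y == k].sum(axis=0) for k in range(K)])  # word counts per class
    class_prob = (alpha + n_k) / (n + K * alpha)                # p(y = k | D)
    word_prob = (a + n_kj) / (a + b + n_k[:, None])             # p(x_j = 1 | y = k, D)
    return class_prob, word_prob

def predict_proba(x_new, class_prob, word_prob):
    # Log space for stability; pseudo-counts rule out zero probabilities.
    log_lik = (x_new * np.log(word_prob)
               + (1 - x_new) * np.log1p(-word_prob)).sum(axis=1)
    log_post = np.log(class_prob) + log_lik
    post = np.exp(log_post - log_post.max())                    # avoid underflow
    return post / post.sum()

# Tiny illustration: 4 documents, 3 words, 2 classes.
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([0, 0, 1, 1])
cp, wp = fit_less_naive_bayes(X, y, K=2)
print(predict_proba(np.array([1, 0, 0]), cp, wp))
```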

Gaussian Processes

Suppose we are given a dataset consisting of inputs $x = (x_i)_{i=1}^n$ and outputs $y = (y_i)_{i=1}^n$. Regression: learn the underlying function $f(x)$.

We can model the response as a noisy version of an underlying function $f(x)$:
$$y_i \mid f(x_i) \sim \mathcal{N}(f(x_i), \sigma^2).$$
Typical approach: parametrize $f(x; \beta)$ and learn $\beta$, e.g.,
$$f(x) = \sum_{j=1}^d \beta_j \phi_j(x).$$
More direct approach: since $f(x)$ is unknown, we take a Bayesian approach, introduce a prior over functions, and compute a posterior over functions. Instead of trying to work with the whole function, we just work with the function values at the inputs, $f = (f(x_1), \dots, f(x_n))$.

Gaussian Processes

The prior $p(f)$ encodes our prior knowledge about the function. What properties of the function can we incorporate? We expect regression functions to be smooth: if $x$ and $x'$ are close by, then $f(x)$ and $f(x')$ have similar values, i.e. are strongly correlated:
$$\begin{pmatrix} f(x) \\ f(x') \end{pmatrix} \sim \mathcal{N}\left( 0, \begin{pmatrix} \kappa(x, x) & \kappa(x, x') \\ \kappa(x', x) & \kappa(x', x') \end{pmatrix} \right).$$
In particular, for nearby $x$ and $x'$ we want $\kappa(x, x') \approx \kappa(x, x) = \kappa(x', x')$.

Multivariate normal assumption: $f \sim \mathcal{N}(0, G)$, using a kernel function $\kappa$ to define $G$:
$$G_{ij} = \kappa(x_i, x_j).$$
Model:
$$f \sim \mathcal{N}(0, G), \qquad y_i \mid f_i \sim \mathcal{N}(f_i, \sigma^2).$$

Gaussian Processes

What does a multivariate normal prior mean? Imagine $x$ forms a very dense grid of the data space. Simulate prior draws $f \sim \mathcal{N}(0, G)$ and plot $f_i$ vs $x_i$ for $i = 1, \dots, n$. The prior over functions is called a Gaussian process (GP).
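To see what this multivariate normal prior over function values looks like, one can simulate prior draws on a dense grid exactly as the slides suggest. A minimal sketch assuming a squared-exponential kernel; the lengthscale, grid, and diagonal jitter are arbitrary choices of mine:

```python
import numpy as np

def rbf(x, xp, lengthscale=0.5):
    """Squared-exponential kernel kappa(x, x') on scalar inputs."""
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / lengthscale ** 2)

# Dense grid of inputs; G_ij = kappa(x_i, x_j).
x = np.linspace(0, 5, 200)
G = rbf(x, x)

# Draw f ~ N(0, G) via a Cholesky factor; the jitter keeps G numerically
# positive definite.
rng = np.random.default_rng(1)
L = np.linalg.cholesky(G + 1e-8 * np.eye(len(x)))
f_draws = L @ rng.standard_normal((len(x), 3))  # three prior functions, one per column

# Plotting each column of f_draws against x visualizes the sampled functions.
```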

Gaussian Processes

Different kernels lead to different function characteristics.

[Figure: prior draws $f \sim \mathcal{N}(0, G)$ under different kernels, from Carl Rasmussen's tutorial on Gaussian Processes at NIPS.]

Gaussian Processes

$$f \mid x \sim \mathcal{N}(0, G), \qquad y \mid f \sim \mathcal{N}(f, \sigma^2 I).$$
Posterior distribution:
$$f \mid y \sim \mathcal{N}\left( G (G + \sigma^2 I)^{-1} y,\; G - G (G + \sigma^2 I)^{-1} G \right).$$
Posterior predictive distribution: suppose $x^\star$ is a test set. We can extend our model to include the function values $f^\star$ at the test set:
$$\begin{pmatrix} f \\ f^\star \end{pmatrix} \,\Big|\, x, x^\star \sim \mathcal{N}\left( 0, \begin{pmatrix} K_{xx} & K_{x x^\star} \\ K_{x^\star x} & K_{x^\star x^\star} \end{pmatrix} \right), \qquad y \mid f \sim \mathcal{N}(f, \sigma^2 I),$$
where $K_{zz'}$ is the matrix with $ij$-th entry $\kappa(z_i, z'_j)$; in particular, $K_{xx} = G$. Some manipulation of multivariate normals gives:
$$f^\star \mid y \sim \mathcal{N}\left( K_{x^\star x} (K_{xx} + \sigma^2 I)^{-1} y,\; K_{x^\star x^\star} - K_{x^\star x} (K_{xx} + \sigma^2 I)^{-1} K_{x x^\star} \right).$$
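The posterior predictive formula above translates directly into code. A sketch under the same squared-exponential-kernel assumption as before (the noise level and toy data are mine), solving linear systems rather than forming the inverse explicitly:

```python
import numpy as np

def rbf(x, xp, lengthscale=0.5):
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2 / lengthscale ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.1):
    """Posterior predictive N(mean, cov) of f* given y, per the formulas above."""
    K_xx = rbf(x_train, x_train)                     # = G
    K_sx = rbf(x_test, x_train)                      # K_{x* x}
    K_ss = rbf(x_test, x_test)                       # K_{x* x*}
    A = K_xx + noise_var * np.eye(len(x_train))      # K_xx + sigma^2 I
    mean = K_sx @ np.linalg.solve(A, y_train)        # K_{x* x} A^{-1} y
    cov = K_ss - K_sx @ np.linalg.solve(A, K_sx.T)   # K_{x* x*} - K_{x* x} A^{-1} K_{x x*}
    return mean, cov

rng = np.random.default_rng(2)
x_train = np.linspace(0, 5, 20)
y_train = np.sin(x_train) + 0.3 * rng.standard_normal(20)
mean, cov = gp_posterior(x_train, y_train, np.linspace(0, 5, 100))
# mean is the posterior mean curve; np.sqrt(np.diag(cov)) gives pointwise uncertainty.
```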
