Naïve Bayes Lecture 17

1 Naïve Bayes Lecture 17 David Sontag New York University Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Mehryar Mohri

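2 Bayesian Learning
Use Bayes rule: $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$, where $P(\theta \mid D)$ is the posterior, $P(D \mid \theta)$ the data likelihood, $P(\theta)$ the prior, and $P(D)$ the normalization.
Or equivalently: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$
For uniform priors, $P(\theta) \propto 1$, this reduces to maximum likelihood estimation: $P(\theta \mid D) \propto P(D \mid \theta)$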

3 Prior + Data -> Posterior
Likelihood function: $P(D \mid \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}$
Prior: $P(\theta) \propto \theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1}$, that is, $\mathrm{Beta}(\beta_H, \beta_T)$
Posterior: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta) \propto \theta^{\alpha_H} (1-\theta)^{\alpha_T}\, \theta^{\beta_H - 1} (1-\theta)^{\beta_T - 1} = \theta^{\alpha_H + \beta_H - 1} (1-\theta)^{\alpha_T + \beta_T - 1}$, that is, $\mathrm{Beta}(\alpha_H + \beta_H, \alpha_T + \beta_T)$
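The conjugate update on this slide amounts to adding the observed counts to the prior parameters. A minimal Python sketch, assuming $\alpha_H, \alpha_T$ are observed head and tail counts; the function names and the example numbers are hypothetical, chosen only for illustration:

```python
def beta_posterior(alpha_h, alpha_t, beta_h, beta_t):
    """Return the parameters of the Beta posterior.

    alpha_h, alpha_t: observed counts of heads and tails (the data).
    beta_h, beta_t: parameters of the Beta prior.
    """
    return alpha_h + beta_h, alpha_t + beta_t


def beta_mode(a, b):
    """Mode of Beta(a, b); this formula is valid only for a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)


# Hypothetical data: 3 heads, 2 tails.
print(beta_posterior(3, 2, 2, 2))              # Beta(2, 2) prior -> (5, 4)
print(beta_mode(*beta_posterior(3, 2, 2, 2)))  # MAP estimate of theta

# With a uniform prior, Beta(1, 1), the posterior mode equals the MLE 3/5.
print(beta_mode(*beta_posterior(3, 2, 1, 1)))
```

The last line illustrates the earlier remark that a uniform prior reduces MAP estimation to maximum likelihood.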

4 What about continuous variables?
Billionaire says: If I am measuring a continuous variable, what can you do for me?
You say: Let me tell you about Gaussians

5 Some properties of Gaussians
An affine transformation (multiplying by a scalar and adding a constant) of a Gaussian is Gaussian: if $X \sim N(\mu, \sigma^2)$ and $Y = aX + b$, then $Y \sim N(a\mu + b, a^2 \sigma^2)$
The sum of independent Gaussians is Gaussian: if $X \sim N(\mu_X, \sigma_X^2)$, $Y \sim N(\mu_Y, \sigma_Y^2)$ and $Z = X + Y$, then $Z \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$
Easy to differentiate, as we will see soon!
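The two closure properties can be written as small parameter-mapping functions and sanity-checked by simulation. This is only a sketch: the function names, the seed, and the example values ($X \sim N(3, 4)$, $a = 2$, $b = 1$) are assumptions made for illustration.

```python
import random
import statistics


def affine_gaussian(mu, var, a, b):
    """(mean, variance) of Y = a*X + b when X ~ N(mu, var)."""
    return a * mu + b, a * a * var


def sum_of_gaussians(mu_x, var_x, mu_y, var_y):
    """(mean, variance) of Z = X + Y for independent Gaussians X and Y."""
    return mu_x + mu_y, var_x + var_y


# Monte Carlo check: X ~ N(3, 4), Y = 2X + 1.
rng = random.Random(0)
xs = [rng.gauss(3.0, 2.0) for _ in range(20_000)]  # gauss takes the standard deviation
ys = [2.0 * x + 1.0 for x in xs]

print(affine_gaussian(3.0, 4.0, 2.0, 1.0))  # (7.0, 16.0)
# The empirical mean and variance should land near 7 and 16.
print(statistics.fmean(ys), statistics.pvariance(ys))
```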

6 Learning a Gaussian
Collect a bunch of data: hopefully i.i.d. samples, e.g., exam scores
Learn parameters: mean $\mu$ and variance $\sigma^2$
[Table: sample index $i$ against $x_i$ = exam score]

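7 MLE for Gaussian
Prob. of i.i.d. samples $D = \{x_1, \dots, x_N\}$: $P(D \mid \mu, \sigma) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^N \prod_{i=1}^N \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$
$\mu_{MLE}, \sigma_{MLE} = \arg\max_{\mu, \sigma} P(D \mid \mu, \sigma)$
Log-likelihood of data: $\ln P(D \mid \mu, \sigma) = -N \ln\left(\sigma\sqrt{2\pi}\right) - \sum_{i=1}^N \frac{(x_i - \mu)^2}{2\sigma^2}$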
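To make the objective concrete, the Gaussian log-likelihood of an i.i.d. sample can be evaluated directly. A small sketch; the function name and the three data points are hypothetical:

```python
import math


def gaussian_log_likelihood(xs, mu, sigma):
    """Log-likelihood of i.i.d. samples xs under N(mu, sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - squared_error / (2 * sigma ** 2)


data = [4.0, 5.0, 6.0]  # hypothetical sample
for mu in (4.0, 5.0, 6.0):
    # For fixed sigma, the value is largest at mu = 5.0, the sample mean.
    print(mu, gaussian_log_likelihood(data, mu, 1.0))
```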

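8 Your second learning algorithm: MLE for mean of a Gaussian
What's the MLE for the mean? Set the derivative of the log-likelihood with respect to $\mu$ to zero:
$\frac{d}{d\mu} \ln P(D \mid \mu, \sigma) = \sum_{i=1}^N \frac{x_i - \mu}{\sigma^2} = 0$
$-\sum_{i=1}^N x_i + N\mu = 0$, so $\mu_{MLE} = \frac{1}{N} \sum_{i=1}^N x_i$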

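9 MLE for variance
Again, set derivative to zero:
$\frac{d}{d\sigma} \ln P(D \mid \mu, \sigma) = -\frac{N}{\sigma} + \sum_{i=1}^N \frac{(x_i - \mu)^2}{\sigma^3} = 0$
Solving with $\mu = \mu_{MLE}$ gives $\sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^N (x_i - \mu_{MLE})^2$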

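10 MLE: Learning Gaussian parameters
$\mu_{MLE} = \frac{1}{N} \sum_{i=1}^N x_i$, $\sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^N (x_i - \mu_{MLE})^2$
BTW, the MLE for the variance of a Gaussian is biased: the expected result of estimation is not the true parameter!
Unbiased variance estimator: $\hat{\sigma}^2_{unbiased} = \frac{1}{N-1} \sum_{i=1}^N (x_i - \mu_{MLE})^2$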
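The difference between the two variance estimators is only the divisor, $N$ versus $N - 1$. A minimal sketch, with hypothetical function names and a made-up sample:

```python
def gaussian_mle(xs):
    """Return (mu_MLE, sigma^2_MLE); the variance divides by N."""
    n = len(xs)
    mu = sum(xs) / n
    var_mle = sum((x - mu) ** 2 for x in xs) / n
    return mu, var_mle


def unbiased_variance(xs):
    """Variance estimate that divides by N - 1 instead of N."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two samples")
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / (n - 1)


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
print(gaussian_mle(sample))        # (5.0, 4.0)
print(unbiased_variance(sample))   # 32/7, slightly larger than the MLE
```

For large $N$ the two estimates nearly coincide; the gap matters mostly for small samples.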

11 Bayesian learning of Gaussian parameters
Conjugate priors:
Mean: Gaussian prior
Variance: Wishart distribution
Prior for mean: a Gaussian density over $\mu$

12 Bayesian Prediction
Definition: the expected conditional loss of predicting $\hat{y} \in Y$ is $L[\hat{y} \mid x] = \sum_{y \in Y} L(\hat{y}, y) \Pr[y \mid x]$.
Bayesian decision: predict the class minimizing the expected conditional loss, that is $\hat{y}^* = \arg\min_{\hat{y}} L[\hat{y} \mid x] = \arg\min_{\hat{y}} \sum_{y \in Y} L(\hat{y}, y) \Pr[y \mid x]$.
Zero-one loss: $\hat{y}^* = \arg\max_{\hat{y}} \Pr[\hat{y} \mid x]$. Maximum a Posteriori (MAP) principle.
Mehryar Mohri - Introduction to Machine Learning page 6
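The decision rule above can be sketched for a finite label set. The posterior values, label names, and the asymmetric loss below are hypothetical; they are there only to show that zero-one loss recovers the most probable class while another loss can change the decision.

```python
def expected_conditional_loss(y_hat, posterior, loss):
    """L[y_hat | x] = sum over y of loss(y_hat, y) * Pr[y | x]."""
    return sum(loss(y_hat, y) * p for y, p in posterior.items())


def bayes_decision(posterior, loss):
    """Predict the label with the smallest expected conditional loss."""
    return min(posterior, key=lambda y_hat: expected_conditional_loss(y_hat, posterior, loss))


def zero_one_loss(y_hat, y):
    return 0.0 if y_hat == y else 1.0


def costly_false_spam(y_hat, y):
    """Hypothetical loss: calling a ham message spam costs 5, other errors cost 1."""
    if y_hat == y:
        return 0.0
    return 5.0 if (y_hat == "spam" and y == "ham") else 1.0


posterior = {"spam": 0.7, "ham": 0.3}  # assumed Pr[y | x] for one input x
print(bayes_decision(posterior, zero_one_loss))      # spam (the MAP class)
print(bayes_decision(posterior, costly_false_spam))  # ham (1.5 vs 0.7 expected loss)
```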

13 Binary Classification - Illustration
[Figure: $\Pr[y_1 \mid x]$ and $\Pr[y_2 \mid x]$ plotted against $x$, vertical axis from 0 to 1]
[Slide from Mehryar Mohri]

14 Maximum a Posteriori (MAP)
Definition: the MAP principle consists of predicting according to the rule $\hat{y} = \arg\max_{y \in Y} \Pr[y \mid x]$.
Equivalently, by the Bayes formula: $\hat{y} = \arg\max_{y \in Y} \frac{\Pr[x \mid y] \Pr[y]}{\Pr[x]} = \arg\max_{y \in Y} \Pr[x \mid y] \Pr[y]$.
How do we determine $\Pr[x \mid y]$ and $\Pr[y]$? Density estimation problem.
[Slide from Mehryar Mohri]
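Because $\Pr[x]$ does not depend on $y$, the argmax can be taken over the unnormalized products. A short sketch with hypothetical class names and made-up probabilities:

```python
def map_scores(likelihood, prior):
    """Unnormalized scores Pr[x | y] * Pr[y] for each class y."""
    return {y: likelihood[y] * prior[y] for y in prior}


def map_predict(likelihood, prior):
    """MAP prediction: the class with the largest unnormalized score."""
    scores = map_scores(likelihood, prior)
    return max(scores, key=scores.get)


def normalize(scores):
    """Divide by Pr[x] = sum of the scores to recover Pr[y | x]."""
    evidence = sum(scores.values())
    return {y: s / evidence for y, s in scores.items()}


prior = {"y1": 0.9, "y2": 0.1}       # assumed Pr[y]
likelihood = {"y1": 0.2, "y2": 0.9}  # assumed Pr[x | y] for one observed x
print(map_predict(likelihood, prior))            # y1
print(normalize(map_scores(likelihood, prior)))  # roughly {'y1': 0.667, 'y2': 0.333}
```

Here the strong prior outweighs the larger likelihood of `y2`, and normalizing changes the scores but not their ordering.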

15 Density Estimation
Data: sample $x_1, \dots, x_m \in X$ drawn i.i.d. from set $X$ according to some distribution $D$.
Problem: find distribution $p$ out of a set $P$ that best estimates $D$.
[Slide from Mehryar Mohri]

16 Density estimation
Can make a parametric assumption, e.g. that $\Pr(x \mid y)$ is a multivariate Gaussian distribution
When the dimension of $x$ is small enough, can use a non-parametric approach (e.g., kernel density estimation)
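As an illustration of the non-parametric option, here is a minimal one-dimensional kernel density estimate. The Gaussian kernel, the fixed bandwidth, and the sample values are assumptions of this sketch; the slide does not specify them.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function: an average of Gaussian bumps centered on the samples."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(samples)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


# Hypothetical 1-D samples of a feature x for one class y.
p_x_given_y = gaussian_kde([1.0, 1.2, 0.8, 3.0], bandwidth=0.5)
print(p_x_given_y(1.0), p_x_given_y(5.0))  # high near the data, near zero far away
```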

17 Difficulty of (naively) estimating high-dimensional distributions
Can we directly estimate the data distribution $P(X, Y)$? How do we represent these? How many parameters?
Prior, $P(Y)$: suppose $Y$ is composed of $k$ classes
Likelihood, $P(X \mid Y)$: suppose $X$ is composed of $n$ binary features
Complex model! High variance with limited data!!!

18 Conditional Independence
$X$ is conditionally independent of $Y$ given $Z$ if the probability distribution for $X$ is independent of the value of $Y$, given the value of $Z$: $P(X \mid Y, Z) = P(X \mid Z)$
Equivalent to: $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$

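19 Naïve Bayes
Naïve Bayes assumption: features are independent given class: $P(X_1, X_2 \mid Y) = P(X_1 \mid Y)\, P(X_2 \mid Y)$
More generally: $P(X_1, \dots, X_n \mid Y) = \prod_{i=1}^n P(X_i \mid Y)$
How many parameters now? Suppose $X$ is composed of $n$ binary features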
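The parameter-counting questions have a short closed-form answer for $n$ binary features and $k$ classes: the prior needs $k - 1$ free parameters, the full class-conditional table needs $k(2^n - 1)$, and the Naïve Bayes factorization needs only $kn$. A sketch with hypothetical function names:

```python
def full_model_params(n_features, n_classes):
    """Free parameters for P(Y) plus an unrestricted P(X | Y) over binary features."""
    prior = n_classes - 1
    likelihood = n_classes * (2 ** n_features - 1)
    return prior + likelihood


def naive_bayes_params(n_features, n_classes):
    """Free parameters for P(Y) plus one Bernoulli P(X_i | Y) per feature and class."""
    prior = n_classes - 1
    likelihood = n_classes * n_features
    return prior + likelihood


print(full_model_params(30, 2))   # 2147483647
print(naive_bayes_params(30, 2))  # 61
```

The exponential-versus-linear gap is what makes the unrestricted model impractical to estimate from limited data.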

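20 The Naïve Bayes Classifier
Given: prior $P(Y)$; $n$ conditionally independent features $X$ given the class $Y$; for each $X_i$, the likelihood $P(X_i \mid Y)$
[Graphical model: class $Y$ is the parent of features $X_1, X_2, \dots, X_n$]
Decision rule: $y^* = \arg\max_y P(y) \prod_{i=1}^n P(x_i \mid y)$
If certain assumptions hold, NB is the optimal classifier! (They typically don't.)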
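A compact sketch of this classifier for binary features follows. Two choices here are assumptions of the sketch rather than part of the slides: add-one smoothing of the per-feature estimates (so no probability is exactly 0 or 1) and scoring in log space to avoid underflow. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def train_naive_bayes(X, y, smoothing=1.0):
    """Estimate P(Y) and P(X_i = 1 | Y) from binary feature tuples X and labels y."""
    n_features = len(X[0])
    class_counts = Counter(y)
    # ones[c][i] counts class-c examples whose feature i equals 1.
    ones = {c: [0] * n_features for c in class_counts}
    for features, label in zip(X, y):
        for i, value in enumerate(features):
            ones[label][i] += value
    prior = {c: class_counts[c] / len(y) for c in class_counts}
    likelihood = {
        c: [(ones[c][i] + smoothing) / (class_counts[c] + 2 * smoothing)
            for i in range(n_features)]
        for c in class_counts
    }
    return prior, likelihood


def predict(prior, likelihood, features):
    """Decision rule: argmax over y of log P(y) + sum_i log P(x_i | y)."""
    def log_score(c):
        score = math.log(prior[c])
        for i, value in enumerate(features):
            p_one = likelihood[c][i]
            score += math.log(p_one if value else 1.0 - p_one)
        return score
    return max(prior, key=log_score)


X = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 0)]  # toy binary features
y = ["pos", "pos", "neg", "neg", "pos", "neg"]
prior, likelihood = train_naive_bayes(X, y)
print(predict(prior, likelihood, (1, 1)))  # pos
print(predict(prior, likelihood, (0, 0)))  # neg
```

With `smoothing=0` the estimates become plain maximum likelihood counts, but a feature value never seen with a class would then produce `log(0)`.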
