Bayesian Methods: Naïve Bayes

Size: px

Start display at page:

Download "Bayesian Methods: Naïve Bayes"

Mabel Carr
6 years ago
Views:

1 Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate

2 Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior distributions Posterior distributions Today: more parameter learning and naïve Bayes 2

3 Maximum Likelihood Estimation (MLE) Data: Observed set of α H heads and α T tails Hypothesis: Coin flips follow a binomial distribution Learning: Find the best θ MLE: Choose to maximize the likelihood (probability of D given θ) θ MLE = arg max θ p(d θ) 3

4 MAP Estimation Choosing θ to maximize the posterior distribution is called maximum a posteriori (MAP) estimation θ MAP = arg max θ p(θ D) The only difference between θ MLE and θ MAP is that one assumes a uniform prior (MLE) and the other allows an arbitrary prior 4

5 MLE for Gaussian Distributions Two parameter distribution characterized by a mean and a variance 5

6 Some properties of Gaussians Affine transformation (multiplying by scalar and adding a constant) are Gaussian X ~ (μ, σ 2 ) Y = ax + b Y ~ (aμ + b, a 2 σ 2 ) Sum of Gaussians is Gaussian X ~ (μ X, σ X 2), Y ~ (μ Y, σ Y 2) Z = X + Y Z ~ (μ X + μ Y, σ X 2 + σ Y 2) Easy to differentiate, as we will see soon! 6

7 Learning a Gaussian Collect data Hopefully, i.i.d. samples e.g., exam scores Learn parameters Mean: μ i Exam Score Variance: σ 7

8 MLE for Gaussian: Probability of i.i.d. samples D = x (1),, x () p D μ, σ = 1 2πσ 2 e x i μ 2σ 2 2 Log-likelihood of the data ln p(d μ, σ) = 2 ln 2πσ2 x i μ 2 2σ 2 8

9 MLE for the Mean of a Gaussian μ ln p(d μ, σ) = μ 2 ln 2πσ2 x i μ 2 x i μ 2 2σ 2 = μ 2σ 2 = x i μ σ 2 x i = μ σ 2 = 0 μ MLE = 1 x i 9

10 MLE for Variance σ ln p(d μ, σ) = σ 2 ln 2πσ2 x i μ 2 = σ + x i σ μ 2 2σ 2 x i μ 2 = σ + σ 3 = 0 2σ 2 2 σ MLE = 1 x i μ MLE 2 10

11 Learning Gaussian parameters μ MLE = 1 x i 2 σ MLE = 1 x i μ MLE 2 MLE for the variance of a Gaussian is biased Expected result of estimation is not true parameter! Unbiased variance estimator 2 σ unbiased = 1 1 x i μ MLE 2 11

12 Bayesian Categorization/Classification Given features x = (x 1,, x m ) predict a label y If we had a joint distribution over x and y, given x we could the label using MAP inference arg max y p(y x 1,, x m ) Can compute this in exactly the same way that we did before: using Bayes rule p y x 1,, x m = p x 1,, x m y p y p x 1,, x m 12

13 Article Classification Given a collection of news articles labeled by topic, goal is, given a new news article to predict its topic One possible feature vector One feature for each word in the document, in order x i corresponds to the i th word x i can take a different value for each word in the dictionary 13

14 Text Classification 14

15 Text Classification x 1, x 2, is sequence of words in document The set of all possible features (and hence p(y x)) is huge Article at least 1000 words, x = (x 1,, x 1000 ) x i represents i th word in document Can be any word in the dictionary at least 10,000 words 10, = possible values Atoms in Universe:

16 Bag of Words Model Typically assume position in document doesn t matter p X i = x i Y = y = p(x k = x i Y = y) All positions have the same distribution Ignores the order of words Sounds like a bad assumption, but often works well! Features Set of all possible words and their corresponding frequencies (number of times it occurs in the document) 16

17 Bag of Words aardvark 0 about2 all 2 Africa 1 apple0 anxious 0... gas 1... oil 1 Zaire 0 17

18 eed to Simplify Somehow Even with the bag of words assumption, there are too many possible outcomes Too many probabilities p(x 1,, x m y) Can we assume some are the same? p(x 1, x 2 y ) = p(x 1 y) p(x 2 y) This is a conditional independence assumption 18

19 Conditional Independence X is conditionally independent of Y given Z, if the probability distribution for X is independent of the value of Y, given the value of Z Equivalent to p X Y, Z = P(X Z) p X, Y Z = p X Z P(Y Z) 19

20 aïve Bayes aïve Bayes assumption Features are independent given class label p(x 1, x 2 y ) = p(x 1 y) p(x 2 y) More generally p x 1,, x m y = How many parameters now? m p(x i y) Suppose x is composed of m binary features 20

21 The aïve Bayes Classifier Given Prior p(y) Y m conditionally independent features X given the class Y X 1 X 2 X n For each X i, we have likelihood P(X i Y) Classify via y = h B x = arg max y = arg max y p y p(x 1,, x m y) m p(y) i p(x i y) 21

22 MLE for the Parameters of B Given dataset, count occurrences for all pairs Count(X i = x i, Y = y) is the number of samples in which X i = x i and Y = y MLE for discrete B p Y = y = p X i = x i Y = y = Count Y = y y Count(Y = y ) Count X i = x i, Y = y x i Count (X i = x i, Y = y) 22

23 aïve Bayes Calculations 23

24 Subtleties of B Classifier: #1 Usually, features are not conditionally independent: p x 1,, x m y m p(x i y) The naïve Bayes assumption is often violated, yet it performs surprisingly well in many cases Plausible reason: Only need the probability of the correct class to be the largest! Example: binary classification; just need to figure out the correct side of 0.5 and not the actual probability (0.51 is the same as 0.99). 24

25 Subtleties of B Classifier: #2 What if you never see a training instance (X 1 = a, Y = b) Example: you did not see the word Enlargement in spam! Then p X 1 = a Y = b = 0 Thus no matter what values X 2,, X m take Why? P X 1 = a, X 2 = x 2,, X m = x m Y = b = 0 25

26 Subtleties of B Classifier: #2 To fix this, use a prior! Already saw how to do this in the coin-flipping example using the Beta distribution For B over discrete spaces, can use the Dirichlet prior The Dirichlet distribution is a distribution over z 1,, z k (0,1) such that z z k = 1 characterized by k parameters α 1,, α k f z 1,, z k ; α 1,, α k k z i α i 1 Called smoothing, what are the MLE estimates under these kinds of priors? 26

Naïve Bayes. Vibhav Gogate The University of Texas at Dallas

Naïve Bayes. Vibhav Gogate The University of Texas at Dallas Naïve Bayes Vibhav Gogate The University of Texas at Dallas Supervised Learning of Classifiers Find f Given: Training set {(x i, y i ) i = 1 n} Find: A good approximation to f : X Y Examples: what are