Bayesian Methods: Naïve Bayes
Nicholas Ruozzi
University of Texas at Dallas
Based on the slides of Vibhav Gogate
Last Time
- Parameter learning
  - Learning the parameter of a simple coin-flipping model
  - Prior distributions
  - Posterior distributions
- Today: more parameter learning and naïve Bayes
Maximum Likelihood Estimation (MLE)
- Data: observed set of $\alpha_H$ heads and $\alpha_T$ tails
- Hypothesis: coin flips follow a binomial distribution
- Learning: find the best $\theta$
- MLE: choose $\theta$ to maximize the likelihood (the probability of $D$ given $\theta$):
$$\theta_{MLE} = \arg\max_\theta \; p(D \mid \theta)$$
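As a concrete illustration (a minimal sketch, not part of the slides; the helper name coin_mle is made up), the MLE for the coin model is simply the empirical fraction of heads, $\alpha_H / (\alpha_H + \alpha_T)$:

```python
import numpy as np

def coin_mle(flips):
    """MLE of the heads probability theta from a sequence of 0/1 flips."""
    flips = np.asarray(flips)
    alpha_H = flips.sum()              # number of observed heads
    alpha_T = len(flips) - alpha_H     # number of observed tails
    return alpha_H / (alpha_H + alpha_T)

print(coin_mle([1, 0, 1, 1, 0]))  # 3 heads, 2 tails -> 0.6
```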
MAP Estimation
- Choosing $\theta$ to maximize the posterior distribution is called maximum a posteriori (MAP) estimation:
$$\theta_{MAP} = \arg\max_\theta \; p(\theta \mid D)$$
- The only difference between $\theta_{MLE}$ and $\theta_{MAP}$ is that one assumes a uniform prior (MLE) and the other allows an arbitrary prior
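Continuing the coin example (a sketch assuming a Beta(a, b) prior as in the previous lecture; the parameter names a and b are my own), the MAP estimate is the mode of the posterior, and it reduces to the MLE when the prior is uniform:

```python
def coin_map(alpha_H, alpha_T, a=2.0, b=2.0):
    """MAP estimate of the heads probability under a Beta(a, b) prior:
    the mode of the posterior Beta(a + alpha_H, b + alpha_T)."""
    return (alpha_H + a - 1) / (alpha_H + alpha_T + a + b - 2)

print(coin_map(3, 2))        # Beta(2, 2) prior pulls the estimate toward 0.5 -> 4/7
print(coin_map(3, 2, 1, 1))  # Beta(1, 1) = uniform prior -> same as the MLE, 0.6
```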
MLE for Gaussian Distributions
- A two-parameter distribution characterized by a mean and a variance
Some Properties of Gaussians
- Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian:
  if $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Y = aX + b$, then $Y \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$
- The sum of independent Gaussians is Gaussian:
  if $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$, $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$, and $Z = X + Y$, then $Z \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$
- Easy to differentiate, as we will see soon!
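A quick numerical sanity check of these two properties (my own sketch, not from the slides), using simulated samples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)   # X ~ N(2, 9)
y = rng.normal(loc=-1.0, scale=2.0, size=1_000_000)  # Y ~ N(-1, 4), independent of X

# Affine transform: 5X + 1 should be ~ N(5*2 + 1, 5^2 * 9) = N(11, 225)
z1 = 5 * x + 1
print(z1.mean(), z1.var())   # approximately 11 and 225

# Sum of independent Gaussians: X + Y ~ N(2 + (-1), 9 + 4) = N(1, 13)
z2 = x + y
print(z2.mean(), z2.var())   # approximately 1 and 13
```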
Learning a Gaussian
- Collect data
  - Hopefully, i.i.d. samples
  - e.g., exam scores
- Learn parameters
  - Mean: $\mu$
  - Variance: $\sigma^2$

  i   | Exam Score
  ----|-----------
  0   | 85
  1   | 95
  2   | 100
  3   | 12
  ... | ...
  99  | 89
MLE for Gaussian
- Probability of i.i.d. samples $D = x^{(1)}, \ldots, x^{(N)}$:
$$p(D \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x^{(i)}-\mu)^2}{2\sigma^2}}$$
- Log-likelihood of the data:
$$\ln p(D \mid \mu, \sigma) = -\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x^{(i)}-\mu)^2}{2\sigma^2}$$
MLE for the Mean of a Gaussian
$$\frac{\partial}{\partial \mu} \ln p(D \mid \mu, \sigma) = \frac{\partial}{\partial \mu}\left[-\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N}\frac{(x^{(i)}-\mu)^2}{2\sigma^2}\right] = -\sum_{i=1}^{N}\frac{\partial}{\partial \mu}\frac{(x^{(i)}-\mu)^2}{2\sigma^2} = \sum_{i=1}^{N}\frac{x^{(i)}-\mu}{\sigma^2} = \frac{\sum_{i=1}^{N} x^{(i)} - N\mu}{\sigma^2} = 0$$
$$\Rightarrow \quad \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$$
MLE for Variance
$$\frac{\partial}{\partial \sigma} \ln p(D \mid \mu, \sigma) = \frac{\partial}{\partial \sigma}\left[-\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N}\frac{(x^{(i)}-\mu)^2}{2\sigma^2}\right] = -\frac{N}{\sigma} + \sum_{i=1}^{N}\frac{(x^{(i)}-\mu)^2}{\sigma^3} = 0$$
$$\Rightarrow \quad \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N}\left(x^{(i)} - \mu_{MLE}\right)^2$$
Learning Gaussian Parameters
$$\mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)} \qquad\qquad \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N}\left(x^{(i)} - \mu_{MLE}\right)^2$$
- The MLE for the variance of a Gaussian is biased: the expected value of the estimator is not the true parameter!
- Unbiased variance estimator:
$$\sigma^2_{unbiased} = \frac{1}{N-1}\sum_{i=1}^{N}\left(x^{(i)} - \mu_{MLE}\right)^2$$
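A minimal numpy sketch of the three estimators (the toy scores are made up); note that np.var divides by N by default, i.e., it is the biased MLE, while ddof=1 gives the unbiased version:

```python
import numpy as np

scores = np.array([85, 95, 100, 12, 99, 89], dtype=float)  # toy exam scores

mu_mle = scores.mean()                                             # (1/N) * sum of x_i
var_mle = ((scores - mu_mle) ** 2).mean()                          # divides by N (biased)
var_unbiased = ((scores - mu_mle) ** 2).sum() / (len(scores) - 1)  # divides by N - 1

assert np.isclose(var_mle, np.var(scores))               # np.var default: ddof=0
assert np.isclose(var_unbiased, np.var(scores, ddof=1))
print(mu_mle, var_mle, var_unbiased)
```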
Bayesian Categorization/Classification
- Given features $x = (x_1, \ldots, x_m)$, predict a label $y$
- If we had a joint distribution over $x$ and $y$, then given $x$ we could predict the label using MAP inference:
$$\arg\max_y \; p(y \mid x_1, \ldots, x_m)$$
- Can compute this in exactly the same way that we did before, using Bayes' rule:
$$p(y \mid x_1, \ldots, x_m) = \frac{p(x_1, \ldots, x_m \mid y)\, p(y)}{p(x_1, \ldots, x_m)}$$
Article Classification
- Given a collection of news articles labeled by topic, the goal is to predict the topic of a new article
- One possible feature vector: one feature for each word in the document, in order
  - $x_i$ corresponds to the $i$th word
  - $x_i$ can take a different value for each word in the dictionary
Text Classification
Text Classification
- $x_1, x_2, \ldots$ is the sequence of words in the document
- The set of all possible feature vectors (and hence $p(y \mid x)$) is huge
  - An article has at least 1,000 words: $x = (x_1, \ldots, x_{1000})$
  - $x_i$ represents the $i$th word in the document and can be any word in the dictionary, which has at least 10,000 words
  - $10{,}000^{1000} = 10^{4000}$ possible values (compare: roughly $10^{80}$ atoms in the universe)
Bag of Words Model
- Typically assume the position in the document doesn't matter:
$$p(X_i = x_i \mid Y = y) = p(X_k = x_i \mid Y = y)$$
  - All positions have the same distribution
  - Ignores the order of words
  - Sounds like a bad assumption, but often works well!
- Features: the set of all possible words and their corresponding frequencies (the number of times each occurs in the document)
Bag of Words (example word counts)

  aardvark   0
  about      2
  all        2
  Africa     1
  apple      0
  anxious    0
  ...
  gas        1
  ...
  oil        1
  Zaire      0
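A minimal sketch of how a document could be mapped to bag-of-words counts (the vocabulary and document here are made up for illustration):

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Map a document to word-frequency features, ignoring word order."""
    counts = Counter(document.lower().split())
    return {word: counts.get(word, 0) for word in vocabulary}

vocab = ["aardvark", "about", "africa", "oil", "zaire"]
doc = "Oil prices in Africa rose again as oil demand grew"
print(bag_of_words(doc, vocab))
# {'aardvark': 0, 'about': 0, 'africa': 1, 'oil': 2, 'zaire': 0}
```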
Need to Simplify Somehow
- Even with the bag-of-words assumption, there are too many possible outcomes
  - Too many probabilities $p(x_1, \ldots, x_m \mid y)$
- Can we assume some of them are the same?
$$p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$$
- This is a conditional independence assumption
Conditional Independence
- $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution of $X$ is independent of the value of $Y$, given the value of $Z$:
$$p(X \mid Y, Z) = p(X \mid Z)$$
- Equivalent to
$$p(X, Y \mid Z) = p(X \mid Z)\, p(Y \mid Z)$$
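To make the definition concrete, here is a small numerical check (my own sketch): a joint distribution built as $p(z)\,p(x \mid z)\,p(y \mid z)$ satisfies the equivalent condition $p(X \mid Y, Z) = p(X \mid Z)$:

```python
import numpy as np

# Joint p(z, x, y) constructed so that X and Y are conditionally independent given Z
p_z = np.array([0.3, 0.7])
p_x_given_z = np.array([[0.9, 0.1],   # row z: distribution over x
                        [0.2, 0.8]])
p_y_given_z = np.array([[0.5, 0.5],   # row z: distribution over y
                        [0.1, 0.9]])
joint = p_z[:, None, None] * p_x_given_z[:, :, None] * p_y_given_z[:, None, :]

# p(x | y, z) should equal p(x | z) for every value of y and z
p_x_given_yz = joint / joint.sum(axis=1, keepdims=True)
print(np.allclose(p_x_given_yz, p_x_given_z[:, :, None]))  # True
```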
Naïve Bayes
- Naïve Bayes assumption: features are independent given the class label
$$p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$$
- More generally,
$$p(x_1, \ldots, x_m \mid y) = \prod_{i=1}^{m} p(x_i \mid y)$$
- How many parameters now? Suppose $x$ is composed of $m$ binary features
The Naïve Bayes Classifier
- Given:
  - Prior $p(y)$
  - $m$ conditionally independent features $X_1, \ldots, X_m$ given the class $Y$
  - For each $X_i$, the likelihood $p(X_i \mid Y)$
- Classify via
$$y^* = h_{NB}(x) = \arg\max_y \; p(y)\, p(x_1, \ldots, x_m \mid y) = \arg\max_y \; p(y) \prod_{i=1}^{m} p(x_i \mid y)$$
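A minimal sketch of the decision rule (the dictionary layout prior[y] and likelihood[y][i][x_i] is just one possible representation, not something fixed by the slides). It computes the argmax in log space, which gives the same answer because log is monotonic, and avoids underflow when m is large:

```python
import math

def nb_predict(x, prior, likelihood):
    """Naive Bayes MAP prediction.
    prior[y] = p(y); likelihood[y][i][x_i] = p(X_i = x_i | Y = y)."""
    best_y, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for i, x_i in enumerate(x):
            score += math.log(likelihood[y][i][x_i])
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```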
MLE for the Parameters of NB
- Given a dataset, count occurrences for all pairs
  - $\mathrm{Count}(X_i = x_i, Y = y)$ is the number of samples in which $X_i = x_i$ and $Y = y$
- MLE for discrete NB:
$$p(Y = y) = \frac{\mathrm{Count}(Y = y)}{\sum_{y'} \mathrm{Count}(Y = y')}$$
$$p(X_i = x_i \mid Y = y) = \frac{\mathrm{Count}(X_i = x_i, Y = y)}{\sum_{x_i'} \mathrm{Count}(X_i = x_i', Y = y)}$$
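Continuing the sketch above (same made-up data layout), the MLE parameters are just normalized counts:

```python
from collections import Counter, defaultdict

def nb_fit_mle(X, Y):
    """MLE estimates for a discrete naive Bayes model.
    X: list of feature tuples, Y: matching list of class labels.
    Returns prior[y] and likelihood[y][i][x_i] as used by nb_predict above."""
    prior = {y: c / len(Y) for y, c in Counter(Y).items()}

    counts = defaultdict(lambda: defaultdict(Counter))  # counts[y][i][x_i]
    for x, y in zip(X, Y):
        for i, x_i in enumerate(x):
            counts[y][i][x_i] += 1

    likelihood = {y: {i: {x_i: c / sum(cnt.values()) for x_i, c in cnt.items()}
                      for i, cnt in feats.items()}
                  for y, feats in counts.items()}
    return prior, likelihood
```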
Naïve Bayes Calculations
Subtleties of NB Classifier: #1
- Usually, features are not conditionally independent:
$$p(x_1, \ldots, x_m \mid y) \neq \prod_{i=1}^{m} p(x_i \mid y)$$
- The naïve Bayes assumption is often violated, yet it performs surprisingly well in many cases
- Plausible reason: we only need the probability of the correct class to be the largest!
  - Example: in binary classification, we just need to land on the correct side of 0.5; the actual probability doesn't matter (0.51 is as good as 0.99)
Subtleties of NB Classifier: #2
- What if you never see a training instance with $X_1 = a$ and $Y = b$?
  - Example: you never saw the word "Enlargement" in spam!
  - Then $p(X_1 = a \mid Y = b) = 0$
- Thus, no matter what values $X_2, \ldots, X_m$ take,
$$p(X_1 = a, X_2 = x_2, \ldots, X_m = x_m \mid Y = b) = 0$$
- Why?
Subtleties of NB Classifier: #2
- To fix this, use a prior!
  - We already saw how to do this in the coin-flipping example using the Beta distribution
  - For NB over discrete spaces, we can use the Dirichlet prior
- The Dirichlet distribution is a distribution over $z_1, \ldots, z_k \in (0, 1)$ with $z_1 + \cdots + z_k = 1$, characterized by $k$ parameters $\alpha_1, \ldots, \alpha_k$:
$$f(z_1, \ldots, z_k; \alpha_1, \ldots, \alpha_k) \propto \prod_{i=1}^{k} z_i^{\alpha_i - 1}$$
- This is called smoothing; what do the MAP estimates look like under these kinds of priors?
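A minimal sketch of the resulting smoothed estimate (the helper name and the symmetric pseudo-count alpha are my own choices): adding alpha pseudo-counts to every outcome keeps unseen events, such as a word never observed in spam, from getting probability exactly zero.

```python
from collections import Counter

def smoothed_estimates(counts, num_values, alpha=1.0):
    """Smoothed probability estimates with Dirichlet-style pseudo-counts.
    counts: observed counts; num_values: size of the value space;
    alpha: pseudo-count added to every value (alpha = 1 is Laplace smoothing)."""
    total = sum(counts.values()) + alpha * num_values
    return {v: (counts.get(v, 0) + alpha) / total for v in range(num_values)}

# Word id 2 was never seen in spam, but no longer gets probability 0:
word_counts_in_spam = Counter({0: 7, 1: 3})
print(smoothed_estimates(word_counts_in_spam, num_values=3))
# {0: 0.615..., 1: 0.307..., 2: 0.076...}
```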