Bayesian Learning CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in
Outline
Bayes theorem; MAP learners; Bayes optimal classifier; Naïve Bayes classifier; example: text classification; Bayesian networks.
Features of Bayesian Learning
Each training example can incrementally increase or decrease the estimated probability that a hypothesis is correct.
Allows for probabilistic predictions.
Practical learning algorithms: naïve Bayes learning, Bayesian network learning.
Combines prior knowledge with observations, but requires prior probabilities.
Useful conceptual framework: a gold standard for evaluating other classifiers, and a tool for analysis.
Bayes Theorem
If A and B are two random variables: P(A | B) = P(B | A) P(A) / P(B).
In the context of a classifier, with hypothesis h and training data I: P(h | I) = P(I | h) P(h) / P(I).
P(h): prior probability of hypothesis h.
P(I): prior probability of the training data I.
P(h | I): posterior probability of h given I.
P(I | h): probability of I given h (the likelihood).
Choosing the Hypotheses
Given the training data, we are interested in the most probable hypothesis.
Maximum a posteriori (MAP) hypothesis h_MAP:
h_MAP = argmax_{h ∈ H} P(h | I) = argmax_{h ∈ H} P(I | h) P(h) / P(I) = argmax_{h ∈ H} P(I | h) P(h).
If every hypothesis is equally probable a priori, P(h_i) = P(h_j) for all h_i, h_j ∈ H, this simplifies to the maximum likelihood (ML) hypothesis h_ML:
h_ML = argmax_{h_i ∈ H} P(I | h_i).
Example
Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population has this cancer.
From the problem statement:
P(cancer) = 0.008, P(¬cancer) = 0.992
P(+ | cancer) = 0.98, P(− | cancer) = 0.02
P(+ | ¬cancer) = 0.03, P(− | ¬cancer) = 0.97
For a positive test result: P(+ | cancer) P(cancer) = 0.98 × 0.008 ≈ 0.0078 and P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 ≈ 0.0298, so h_MAP = ¬cancer, and P(cancer | +) ≈ 0.0078 / (0.0078 + 0.0298) ≈ 0.21.
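As a minimal sketch in Python, the same MAP computation (all numbers are taken directly from the problem statement above):

```python
# Minimal sketch of the MAP computation for the cancer example.
# All probabilities below come from the problem statement.
p_cancer = 0.008                        # P(cancer)
p_not_cancer = 1.0 - p_cancer           # P(~cancer) = 0.992
p_pos_given_cancer = 0.98               # P(+ | cancer)
p_pos_given_not_cancer = 1.0 - 0.97     # P(+ | ~cancer) = 0.03

# Unnormalized posteriors: P(h | +) is proportional to P(+ | h) P(h).
score_cancer = p_pos_given_cancer * p_cancer              # ~0.0078
score_not_cancer = p_pos_given_not_cancer * p_not_cancer  # ~0.0298

# h_MAP is ~cancer; normalizing gives the (still small) posterior P(cancer | +).
p_cancer_given_pos = score_cancer / (score_cancer + score_not_cancer)
print(round(p_cancer_given_pos, 2))  # ~0.21
```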
Brute-Force MAP Hypothesis Learner (1)
We are given I = ⟨x_1, y_1⟩, …, ⟨x_N, y_N⟩, the examples and their class labels.
For each hypothesis h ∈ H, calculate the posterior probability P(h | I) = P(I | h) P(h) / P(I).
Output the hypothesis h_MAP with the highest posterior probability: h_MAP = argmax_{h ∈ H} P(h | I).
Brute-Force MAP Hypothesis Learner (2)
We are given I = ⟨x_1, y_1⟩, …, ⟨x_N, y_N⟩, the examples and their class labels.
Choose P(I | h) = 1 if h is consistent with I, and P(I | h) = 0 otherwise.
Choose P(h) to be the uniform distribution: P(h) = 1 / |H| for all h ∈ H.
Then P(h | I) = P(I | h) P(h) / P(I).
Brute-Force MAP Hypothesis Learner (3)
We are given I = ⟨x_1, y_1⟩, …, ⟨x_N, y_N⟩, the examples and their class labels.
Choose P(I | h) = 1 if h is consistent with I, and P(I | h) = 0 otherwise.
Choose P(h) to be the uniform distribution: P(h) = 1 / |H|.
Then P(h | I) = 1 / |VS_{H,I}| if h is consistent with I, and 0 otherwise, where VS_{H,I} is the version space: the subset of H consistent with I.
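A hedged sketch of this brute-force MAP learner for a small, finite hypothesis space (the hypothesis and data representations below are illustrative, not from the lecture):

```python
# Brute-force MAP learner sketch: uniform prior over H, and P(I | h) = 1 exactly
# when h labels every training example correctly (h is consistent with I).
def consistent(h, data):
    return all(h(x) == y for x, y in data)

def brute_force_map(hypotheses, data):
    prior = 1.0 / len(hypotheses)                      # P(h) = 1/|H|
    unnorm = [prior if consistent(h, data) else 0.0 for h in hypotheses]
    evidence = sum(unnorm)                             # P(I) = |VS_{H,I}| / |H|
    posteriors = [u / evidence for u in unnorm]        # 1/|VS_{H,I}| on consistent h
    best = max(range(len(hypotheses)), key=lambda i: posteriors[i])
    return hypotheses[best], posteriors

# Toy usage: threshold hypotheses on a single real-valued feature.
hypotheses = [lambda x, t=t: x >= t for t in (0.2, 0.5, 0.8)]
data = [(0.9, True), (0.1, False), (0.6, True)]
h_map, posteriors = brute_force_map(hypotheses, data)
print(posteriors)  # [0.5, 0.5, 0.0] -- two consistent hypotheses share the mass
```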
Evolution of Posterior Probabilities
(Figure: the distribution over hypotheses evolves from the prior P(h), to P(h | I_1), to P(h | I_1, I_2) as training examples arrive; the posterior mass concentrates on the hypotheses that remain consistent with the data.)
Classifying New Instances
Given a new instance x, what is the most probable classification?
One solution: h_MAP(x). But can we do better?
Consider the following example containing three hypotheses: P(h_1 | I) = 0.4, P(h_2 | I) = 0.3, P(h_3 | I) = 0.3.
Given a new instance x, h_1(x) = +, h_2(x) = −, h_3(x) = −.
What is the most probable classification for x?
Bayes Optimal Classifier (1)
Combine the predictions of all hypotheses, weighted by their posterior probabilities.
Bayes optimal classification: argmax_{y ∈ Y} Σ_{h_i ∈ H} P(y | h_i) P(h_i | I).
Example:
P(h_1 | I) = 0.4, P(− | h_1) = 0, P(+ | h_1) = 1
P(h_2 | I) = 0.3, P(− | h_2) = 1, P(+ | h_2) = 0
P(h_3 | I) = 0.3, P(− | h_3) = 1, P(+ | h_3) = 0
Σ_{h_i ∈ H} P(+ | h_i) P(h_i | I) = 0.4 and Σ_{h_i ∈ H} P(− | h_i) P(h_i | I) = 0.6, so the Bayes optimal classification is −.
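A small sketch of the Bayes optimal rule applied to the three-hypothesis example above (the hypotheses are deterministic, so P(y | h_i) is 1 for the label h_i predicts and 0 otherwise):

```python
# Bayes optimal classification: weight each hypothesis's vote by its posterior.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # P(h_i | I)
votes = {"h1": "+", "h2": "-", "h3": "-"}        # h_i(x) for the new instance x

def bayes_optimal(posteriors, votes, labels=("+", "-")):
    # For each label y, compute sum_i P(y | h_i) P(h_i | I) and pick the largest.
    scores = {y: sum(p for h, p in posteriors.items() if votes[h] == y)
              for y in labels}
    return max(scores, key=scores.get), scores

label, scores = bayes_optimal(posteriors, votes)
print(label, scores)  # '-' {'+': 0.4, '-': 0.6}
```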
Bayes Optimal Classifier (2)
Optimal in the sense that no other classification method using the same hypothesis space and the same prior knowledge can outperform it on average.
The method maximizes the probability that the new instance is classified correctly, given the available data, the hypothesis space, and the prior probabilities over the hypotheses.
But it is inefficient: it computes the posterior probability for every hypothesis and combines the predictions of all of them.
Gibbs Classifier
Gibbs algorithm: choose a hypothesis h ∈ H at random, according to the posterior probability distribution over H, and use h to classify the new instance x.
Observation: assume the target concepts are drawn at random from H according to the priors on H. Then E[error_Gibbs] ≤ 2 E[error_BayesOptimal] (Haussler et al., Machine Learning 1994).
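A minimal sketch of the Gibbs algorithm, assuming the posterior P(h | I) has already been computed (names are illustrative):

```python
import random

# Gibbs classifier: sample one hypothesis according to the posterior P(h | I)
# and let that single hypothesis classify the new instance.
def gibbs_classify(hypotheses, posteriors, x):
    h = random.choices(hypotheses, weights=posteriors, k=1)[0]
    return h(x)

# e.g. gibbs_classify(hypotheses, posteriors, 0.7) with the lists computed
# by the brute-force MAP sketch above.
```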
Naïve Bayes Classifier (1)
Bayes rule, in a slightly different application. Let Y = {c_1, c_2, …, c_K} be the different class labels; the label for the i-th instance is y_i ∈ Y.
P(c_j | x_i) = P(x_i | c_j) P(c_j) / P(x_i)
P(c_j | x_i): posterior probability that instance x_i belongs to class c_j.
P(x_i | c_j): probability that an instance drawn from class c_j would be x_i (likelihood).
P(c_j): probability of class c_j (prior).
P(x_i): probability of instance x_i (evidence).
Naïve Bayes Classifier (2)
Classify instance x as the class y with maximum posterior probability: y = argmax_{c_j} P(c_j | x).
Ignore the denominator (since we are only interested in the maximum): y = argmax_{c_j} P(x | c_j) P(c_j).
If the prior is uniform: y = argmax_{c_j} P(x | c_j).
Naïve Bayes Classifier (3)
Look at the classifier y = argmax_{c_j} P(x | c_j). What is each instance x? A D-dimensional tuple (x_1, …, x_D).
So we need to estimate the joint probability distribution P(x_1, …, x_D | c_j).
Practical issue: we need to know the probability of every possible instance given every possible class. With D Boolean features and K classes, that is K · 2^D probability values!
Naïve Bayes Classifier (4)
Make the naïve Bayes assumption: the features/attributes are conditionally independent given the target attribute (class label):
P(x_1, …, x_D | c_j) = ∏_{d=1}^{D} P(x_d | c_j)
This results in the naïve Bayes classifier (NBC):
y = argmax_{c_j} P(c_j) ∏_{d=1}^{D} P(x_d | c_j)
NBC Practical Issues (1)
Estimating the probabilities from I.
Prior probabilities: P(c_j) = |{(x_i, y_i) : y_i = c_j}| / |I|.
If the features are discrete: P(x_d = v | c_j) = |{(x_i, y_i) : x_{id} = v ∧ y_i = c_j}| / |{(x_i, y_i) : y_i = c_j}|.
NBC Practical Issues (2)
What if the features are continuous? Assume some parameterized distribution for x_d, e.g., a Normal distribution, and learn the parameters of the distribution from the data, e.g., the mean and variance of the x_d values, choosing the parameters that maximize the likelihood.
P(x_d | c_j) ~ N(μ, σ²), with μ and σ² unknown.
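A sketch of this Gaussian option, fitting a per-class, per-feature mean and variance by maximum likelihood (assumes NumPy; the small variance-smoothing constant is an illustrative addition to avoid division by zero):

```python
import numpy as np

# Gaussian naive Bayes sketch: for each class c and feature d, fit a normal
# distribution to the training values by maximum likelihood (sample mean and
# variance), then score a test point with log P(c) + sum_d log N(x_d; mu, var).
def fit_gaussian_nb(X, y):
    # X: (n_samples, n_features) array; y: (n_samples,) array of class labels.
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0),
                     Xc.var(axis=0) + 1e-9,   # small constant avoids zero variance
                     len(Xc) / len(X))        # class prior P(c)
    return params

def log_gaussian(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict(params, x):
    scores = {c: np.log(prior) + log_gaussian(x, mu, var).sum()
              for c, (mu, var, prior) in params.items()}
    return max(scores, key=scores.get)
```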
NBC Practical Issues (3)
Alternatively, when a feature is continuous, discretize it: e.g., map price ∈ ℝ to price ∈ {low, medium, high}.
NBC Practical Issues (4)
What if none of the training instances of class c_j have x_d = v? Then P(x_d = v | c_j) = 0, and so P(c_j) ∏_{d=1}^{D} P(x_d | c_j) = 0 for any instance with x_d = v.
Use the m-estimate, defined as follows:
P(x_d = v | c_j) = (|{(x_i, y_i) : x_{id} = v ∧ y_i = c_j}| + m p) / (|{(x_i, y_i) : y_i = c_j}| + m)
where p is a prior estimate of the probability and m is the equivalent sample size (how heavily to weight p relative to the observed data).
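A sketch of the count-based estimates with m-estimate smoothing described above; here p is taken to be the uniform prior 1/(number of values of the feature), an assumption made for illustration:

```python
from collections import Counter, defaultdict

# Discrete naive Bayes with m-estimate smoothing:
# P(x_d = v | c) = (count(x_d = v, y = c) + m*p) / (count(y = c) + m).
def fit_discrete_nb(X, y, m=1.0):
    classes = Counter(y)                       # class counts for P(c)
    counts = defaultdict(Counter)              # counts[(d, c)][v] = #{x_d = v, y = c}
    for xi, yi in zip(X, y):
        for d, v in enumerate(xi):
            counts[(d, yi)][v] += 1
    n_values = [len({xi[d] for xi in X}) for d in range(len(X[0]))]

    def likelihood(d, v, c):
        p = 1.0 / n_values[d]                  # uniform prior estimate p
        return (counts[(d, c)][v] + m * p) / (classes[c] + m)

    def predict(x):
        scores = {c: classes[c] / len(y) for c in classes}   # start from P(c)
        for c in classes:
            for d, v in enumerate(x):
                scores[c] *= likelihood(d, v, c)
        return max(scores, key=scores.get)

    return predict

# Toy usage on two binary features.
predict = fit_discrete_nb([(1, 0), (1, 1), (0, 1), (0, 0)], ["a", "a", "b", "b"])
print(predict((1, 1)))  # 'a'
```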
Example – Learn to Classify Text
Problem definition: given a set of news articles that are of interest, we would like to learn to classify the articles by topic. Naïve Bayes is among the most effective algorithms for this task.
What attributes will represent the documents? A vector of words: one attribute per word position in the document.
What is the target concept? Is the document interesting? Or: the topic of the document.
Algorithm – Learn Naïve Bayes
Collect all words and tokens that occur in the examples I: Vocabulary ← all distinct words and tokens in I.
Compute the probabilities P(c_j) and P(x_k | c_j):
I_j ← the subset of examples in I whose target label is c_j; P(c_j) = |I_j| / |I|.
n ← total number of words in the documents of I_j (counting duplicates multiple times).
For each word x_k in Vocabulary: n_k ← number of times the word x_k occurs in the documents of I_j, and P(x_k | c_j) = (n_k + 1) / (n + |Vocabulary|).
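A sketch of this procedure (the word-position model with the (n_k + 1) / (n + |Vocabulary|) smoothing above); documents are represented simply as lists of tokens:

```python
import math
from collections import Counter

# Learn_Naive_Bayes_Text sketch: estimate P(c) from document counts and
# P(w | c) = (n_k + 1) / (n + |Vocabulary|) from per-class word counts.
def learn_naive_bayes_text(docs, labels):
    vocab = {w for doc in docs for w in doc}
    priors, cond = {}, {}
    for c in set(labels):
        docs_c = [doc for doc, l in zip(docs, labels) if l == c]
        priors[c] = len(docs_c) / len(docs)
        word_counts = Counter(w for doc in docs_c for w in doc)   # n_k per word
        n = sum(word_counts.values())                             # total words in class c
        cond[c] = {w: (word_counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, priors, cond

def classify_naive_bayes_text(doc, vocab, priors, cond):
    # Sum log-probabilities over the word positions of the test document,
    # ignoring words that never appeared in the training vocabulary.
    scores = {c: math.log(priors[c]) +
                 sum(math.log(cond[c][w]) for w in doc if w in vocab)
              for c in priors}
    return max(scores, key=scores.get)

# Toy usage: documents are lists of tokens.
docs = [["ball", "game", "score"], ["election", "vote"], ["vote", "ballot", "score"]]
labels = ["sports", "politics", "politics"]
model = learn_naive_bayes_text(docs, labels)
print(classify_naive_bayes_text(["vote", "ballot"], *model))  # 'politics'
```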
Algorithm – Classify Naïve Bayes
Given a test instance, compute the frequency of occurrence in the test instance of each word in the vocabulary, and apply the naïve Bayes classification rule.
Example: 20 Newsgroups
Given 1000 training documents from each group, learn to classify new documents according to the newsgroup they came from. NBC achieves about 89% classification accuracy.
Bayesian Network (1)
The naïve Bayes assumption of conditional independence is too restrictive, but the problem is intractable without some conditional independence assumption.
Bayesian networks describe conditional independence among subsets of variables, which allows combining prior knowledge about (in)dependencies among variables with training data.
Recall conditional independence: X is conditionally independent of Y given Z if P(X | Y, Z) = P(X | Z).
Bayesian Network – Example
(Figure: a network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire; Campfire has parents Storm (S) and BusTourGroup (B).)
Conditional probability table for Campfire (C):
P(C | S, B) = 0.4, P(C | S, ¬B) = 0.1, P(C | ¬S, B) = 0.8, P(C | ¬S, ¬B) = 0.2
(and P(¬C | ·) is the complement: 0.6, 0.9, 0.2, 0.8).
Bayesian Network (2)
The network represents the joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, …, ForestFire).
In general, P(x_1, x_2, …, x_D) = ∏_{d=1}^{D} P(x_d | Parents(x_d)), where Parents(x_d) denotes the immediate predecessors of x_d in the graph.
What is the Bayesian network corresponding to the naïve Bayes classifier?
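A small sketch of reading a joint probability off this factorization, using the Campfire CPT from the example slide; the priors for Storm and BusTourGroup are assumed values for illustration, not from the lecture:

```python
# P(Storm, BusTourGroup, Campfire) = P(Storm) P(BusTourGroup) P(Campfire | Storm, BusTourGroup):
# each node is conditioned only on its parents in the graph.
p_storm = 0.2                 # assumed P(Storm) -- illustrative only
p_bus = 0.5                   # assumed P(BusTourGroup) -- illustrative only
p_campfire_given = {          # P(Campfire | Storm, BusTourGroup) from the example slide
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def p_joint(storm, bus, campfire):
    ps = p_storm if storm else 1 - p_storm
    pb = p_bus if bus else 1 - p_bus
    pc = p_campfire_given[(storm, bus)]
    return ps * pb * (pc if campfire else 1 - pc)

print(p_joint(True, True, True))   # 0.2 * 0.5 * 0.4 = 0.04
```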
Bayesian Network (3)
Inference: the Bayesian network encodes all the information required for inference. Exact inference methods work well for some network structures; Monte Carlo methods simulate the network randomly to compute approximate solutions.
Learning: if the structure is known and there are no missing values, it is easy to learn a Bayesian network. If the network structure is known and there are some missing values, use the expectation maximization (EM) algorithm. If the structure is unknown, the problem is very difficult.
Summary
Bayes rule; Bayes optimal classifier; practical naïve Bayes classifier; example text classification task; maximum likelihood estimates; Bayesian networks.