Topics. Bayesian Learning. What is Bayesian Learning? Objectives for Bayesian Learning



Bayesian Learning
Sattiraju Prabhakar
CS898O: ML, Wichita State University
4/20/2006, ML_BayesianLearning

Topics
- Objectives for Bayesian Learning
- Bayes Theorem and MAP
- Bayes Optimal Classifier
- Naïve Bayes Classifier
- An Example
- Classifying Text

Objectives for Bayesian Learning

What is Bayesian Learning?
- Classes can be represented using a set of variables.
- The values of these variables are governed by probability distributions.
- Decisions about which classes best explain the observations are based on reasoning about these probabilities.
- Reasoning about evidence means weighing the evidence and combining it to support alternative hypotheses.

Why Bayesian Learning?
- Bayesian learning methods calculate explicit probabilities for hypotheses.
- They provide a framework for many learning algorithms.

Features of Bayesian Learning Methods
- Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
- Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.
- Hypotheses make probabilistic predictions.
- Hypotheses can be combined to classify new instances.

Bayesian Learning
[Diagram: (observed instances, classifications) and hypotheses feed into the Bayesian Learner, which outputs hypotheses with their posterior probabilities.]

Bayes Theorem and MAP

Bayes Theorem

Terms:
- h = hypothesis being evaluated
- D = observed data
- $P(h)$ = prior probability that h holds
- $P(D)$ = probability of the observed training data D
- $P(D \mid h)$ = probability of observing D given that h holds (the likelihood)
- $P(h \mid D)$ = posterior probability that h holds after D has been observed

Bayes Theorem:
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

A Learning Scenario
- The learner considers a set of hypotheses, H.
- It needs to decide which hypothesis $h \in H$ is the most probable having observed the data D, that is, which h has the largest $P(h \mid D)$.
- A maximally probable hypothesis is called a Maximum A Posteriori (MAP) hypothesis, $h_{MAP}$.

MAP hypothesis from Bayes Theorem
$$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$
$P(D)$ can be dropped in the last step because it does not depend on h.

Example: Medical Diagnosis (1)
- H = {patient has cancer, patient does not have cancer}
- D = {test is positive ($\oplus$), test is negative ($\ominus$)}

Known probabilities:
- $P(cancer)$ = [value missing]
- $P(\neg cancer)$ = [value missing]
- $P(\oplus \mid cancer) = 0.98$
- $P(\oplus \mid \neg cancer) = 0.03$
- $P(\ominus \mid cancer) = 0.02$
- $P(\ominus \mid \neg cancer)$ = [value missing; by complement, 1 − 0.03 = 0.97]
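The MAP rule for the diagnosis example can be sketched in a few lines of Python. The two priors did not survive in the text above, so this sketch assumes $P(cancer) = 0.008$ and $P(\neg cancer) = 0.992$, the values used in the textbook version of this example (Mitchell, Machine Learning, ch. 6); treat them as an assumption, not as the slide's own numbers. The function name `map_hypothesis` and the labels `"+"`/`"-"` are illustrative.

```python
def map_hypothesis(priors, likelihoods, observation):
    """Return (h_MAP, unnormalized scores, normalized posteriors).

    priors:      {hypothesis: P(h)}
    likelihoods: {hypothesis: {observation: P(D | h)}}
    """
    # Unnormalized score P(D | h) * P(h) for every hypothesis.
    scores = {h: likelihoods[h][observation] * priors[h] for h in priors}
    # The scores sum to P(D), so dividing by the sum yields P(h | D).
    total = sum(scores.values())
    posteriors = {h: s / total for h, s in scores.items()}
    h_map = max(scores, key=scores.get)
    return h_map, scores, posteriors


# ASSUMPTION: priors taken from the textbook version of the example;
# they are missing from the slide text.
priors = {"cancer": 0.008, "not_cancer": 0.992}
likelihoods = {
    "cancer": {"+": 0.98, "-": 0.02},
    "not_cancer": {"+": 0.03, "-": 0.97},
}

h_map, scores, posteriors = map_hypothesis(priors, likelihoods, "+")
print(h_map, scores, posteriors)
```

Under these assumed priors a positive test still leaves "no cancer" as the MAP hypothesis, because the prior probability of cancer is so small.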

Example: Medical Diagnosis (2)
Given that a positive test result ($\oplus$) is observed, we compute $P(D \mid h)\,P(h)$ for each hypothesis:
- $P(\oplus \mid cancer)\,P(cancer) = 0.98 \times$ [value missing]
- $P(\oplus \mid \neg cancer)\,P(\neg cancer) = 0.03 \times$ [value missing]

Thus $h_{MAP} = \neg cancer$.

The actual posterior probability of the hypothesis cancer is obtained by dividing its product by the sum of both products:
$$P(cancer \mid \oplus) = \frac{P(\oplus \mid cancer)\,P(cancer)}{P(\oplus \mid cancer)\,P(cancer) + P(\oplus \mid \neg cancer)\,P(\neg cancer)} = \text{[value missing]}$$

Bayes Optimal Classifier

Formulation
- Previous formulation: What is the most probable hypothesis given the training data?
- New formulation: What is the most probable classification of the new instance given the training data?

Example:
- Classes: + and −
- Classification of the new instance predicted by each hypothesis, with its posterior probability:
  - h1 predicts +, $P(h_1 \mid D) = 0.4$
  - h2 predicts −, $P(h_2 \mid D) = 0.3$
  - h3 predicts −, $P(h_3 \mid D) = 0.3$
- Conclusion: for the new instance x, $P(x = +) = 0.4$ and $P(x = -) = 0.6$, so the most probable classification is −, even though the MAP hypothesis h1 predicts +.

Optimal Classification
- The most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.
- In simple terms, the probability that an instance belongs to a class C is computed by:
  - taking the prediction of C made by each hypothesis,
  - multiplying it by the posterior probability of that hypothesis,
  - doing this for all hypotheses,
  - and summing the results.
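The weighted vote described above can be sketched as follows, using the three hypotheses and posteriors from the example. The function name `bayes_optimal_classify` and the dictionary layout are illustrative choices, not part of the slides.

```python
from collections import defaultdict


def bayes_optimal_classify(posteriors, predictions):
    """Combine the predictions of all hypotheses, weighted by P(h | D).

    posteriors:  {hypothesis: P(h | D)}
    predictions: {hypothesis: {class: P(class | h)}}
    """
    class_prob = defaultdict(float)
    for h, p_h in posteriors.items():
        for v, p_v_given_h in predictions[h].items():
            # Each hypothesis contributes P(v | h) * P(h | D) to class v.
            class_prob[v] += p_v_given_h * p_h
    best = max(class_prob, key=class_prob.get)
    return best, dict(class_prob)


posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

label, class_prob = bayes_optimal_classify(posteriors, predictions)
print(label, class_prob)
```

The single most probable hypothesis (h1) votes for +, yet the combined vote favors −, which is the point of distinguishing the most probable classification from the most probable hypothesis.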

Formal Specification of Bayes Optimal Classification
$$P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$
where $v_j \in V$ is any possible classification.

The optimal classification of the new instance is the value $v_j$ for which $P(v_j \mid D)$ is maximum:
$$\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$$

Example
Let the possible classifications be $V = \{\oplus, \ominus\}$.
- $P(h_1 \mid D) = 0.4$, $P(\ominus \mid h_1) = 0$, $P(\oplus \mid h_1) = 1$
- $P(h_2 \mid D) = 0.3$, $P(\ominus \mid h_2) = 1$, $P(\oplus \mid h_2) = 0$
- $P(h_3 \mid D) = 0.3$, $P(\ominus \mid h_3) = 1$, $P(\oplus \mid h_3) = 0$

Therefore
$$\sum_{h_i \in H} P(\oplus \mid h_i)\,P(h_i \mid D) = 0.4$$
$$\sum_{h_i \in H} P(\ominus \mid h_i)\,P(h_i \mid D) = 0.6$$
$$\arg\max_{v_j \in \{\oplus, \ominus\}} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) = \ominus$$

Bayes Optimal Classifier
- A system that classifies new instances according to Bayes optimal classification is called a Bayes Optimal Classifier.
- On average, no other classifier that uses the same hypothesis space and the same prior knowledge can outperform the Bayes Optimal Classifier.
- Limitations:
  - It is costly to apply.
  - It needs to compute the posterior probability of every hypothesis and then combine the predictions of all hypotheses to classify each new instance.

Naïve Bayes Classifier

Features
- A highly practical method.
- In some domains, its performance is comparable to decision tree learning and neural networks.
- Applicable to learning tasks where:
  - each instance x is described by a conjunction of attribute values, and
  - the target function f(x) can take on any value from some finite set V.

Characterization
Task:
- Input: training examples given as attribute values that describe each instance, $\langle a_1, a_2, \ldots, a_n \rangle$, and a new instance to be classified.
- Output: the predicted target value (classification) for the new instance.

Approach: assign the most probable target value, $v_{MAP}$, to the instance:
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$$

Applying Bayes Theorem
$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)$$

Assumption
Attribute values are conditionally independent given the target value:
$$P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$$
Substituting this into the previous equation gives the target value output by the naïve Bayes classifier:
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$$

Method
Learning step:
- The various $P(v_j)$ and $P(a_i \mid v_j)$ terms are estimated from their frequencies in the training examples.
- The set of these estimates is the learned hypothesis.
- This hypothesis is used to classify each new instance by applying the naïve Bayes classification rule.

Example: PlayTennis
- Training examples: PlayTennis (14 examples)
- Attributes: ⟨Outlook, Temperature, Humidity, Wind⟩
- Target concept: PlayTennis
- Problem: classify the instance ⟨Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong⟩, that is, predict the target value (yes or no) of PlayTennis.

Example (contd)
$$v_{NB} = \arg\max_{v_j \in \{yes,\, no\}} P(v_j) \prod_i P(a_i \mid v_j)$$
$$= \arg\max_{v_j \in \{yes,\, no\}} P(v_j)\,P(Outlook = sunny \mid v_j)\,P(Temperature = cool \mid v_j)\,P(Humidity = high \mid v_j)\,P(Wind = strong \mid v_j)$$

Example (contd): Solving the example
- $P(PlayTennis = yes) = 9/14 = 0.64$
- $P(PlayTennis = no) = 5/14 = 0.36$
- $P(Wind = strong \mid PlayTennis = yes) = 3/9 = 0.33$
- $P(Wind = strong \mid PlayTennis = no) = 3/5 = 0.60$

$v_{NB}$ is computed using:
- $P(yes)\,P(sunny \mid yes)\,P(cool \mid yes)\,P(high \mid yes)\,P(strong \mid yes) = 0.0053$
- $P(no)\,P(sunny \mid no)\,P(cool \mid no)\,P(high \mid no)\,P(strong \mid no) =$ [value missing]

The target value of PlayTennis is no. The normalized value, obtained by dividing the product for no by the sum of both products, is [value missing].

Naïve Bayes Algorithm: Method
Let $v_j$ stand for a class and $a_i$ for an attribute value. Example class: PlayTennis = no.

Learning step:
- The various $P(v_j)$ and $P(a_i \mid v_j)$ terms are estimated from their frequencies in the training examples.
- The set of these estimates is the learned hypothesis; the hypothesis involves more than one attribute.
- This hypothesis is used to classify each new instance by applying the naïve Bayes classification rule.
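The learning and classification steps can be sketched in Python as below. The training table is not reproduced in the slides; this sketch assumes the 14-example PlayTennis table from Mitchell's textbook, which is consistent with the counts quoted above (9 yes, 5 no, 3/9 and 3/5 for strong wind). The function names `train` and `classify` are illustrative, and the sketch uses raw frequency estimates without smoothing.

```python
from collections import Counter, defaultdict

# ASSUMPTION: the standard PlayTennis table (outlook, temperature,
# humidity, wind, play); it is not part of the slide text.
TRAINING = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
    ("sunny", "cool", "normal", "weak", "yes"),
    ("rain", "mild", "normal", "weak", "yes"),
    ("sunny", "mild", "normal", "strong", "yes"),
    ("overcast", "mild", "high", "strong", "yes"),
    ("overcast", "hot", "normal", "weak", "yes"),
    ("rain", "mild", "high", "strong", "no"),
]


def train(examples):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(row[-1] for row in examples)
    # (attribute index, class) -> Counter of attribute values
    value_counts = defaultdict(Counter)
    for *values, label in examples:
        for i, value in enumerate(values):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def classify(instance, class_counts, value_counts):
    """Return (v_NB, scores) where scores[v] = P(v) * prod_i P(a_i | v)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # P(v_j)
        for i, value in enumerate(instance):
            score *= value_counts[(i, label)][value] / count  # P(a_i | v_j)
        scores[label] = score
    return max(scores, key=scores.get), scores


class_counts, value_counts = train(TRAINING)
label, scores = classify(("sunny", "cool", "high", "strong"),
                         class_counts, value_counts)
normalized_no = scores["no"] / sum(scores.values())
print(label, scores, normalized_no)
```

With the assumed table, the score for yes matches the 0.0053 stated above and the classifier outputs no.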

Exercise
For the restaurant example:
- Given the table as the training set of examples,
- find the classification for the new example: ⟨Alt = yes, Bar = no, Fri = yes, Hun = yes, Pat = some, Price = $$, Rain = no, Res = no, Type = Thai, Est = 0-10⟩

Learning to Classify Text

Examples:
- Learn from articles (text with several thousand words) which articles are interesting, that is, learn to classify the articles as belonging to like or dislike.
- Learn to classify web pages as belonging to different topics {Restaurants, Hotels, Movies, Shopping, Tourism, ...}.

20 Newsgroups

[Figure: Article from rec.sport.hockey]

Issues for applying Naïve Bayes Classifier
- What are the attributes? How are they related to words?
- How can we represent the text as a single example?
- How can we compute the probabilities $P(v_j)$ and $P(a_i \mid v_j)$?

Representation of text documents as Examples
Given a text document:
- Identify each word position in the document as an attribute.
- Define the word in that position as the value of the attribute.

Example: "If this sentence is the text, then it has 16 words, and it has 16 attributes." The value of attribute 3 is "sentence".

Completing the Learning Example
Set up:
- Available training examples (documents) = 1000
- Categories = {like, dislike}
  - like: 700
  - dislike: 300
- A new document is given and needs to be classified as like or dislike.
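A minimal sketch of this representation, together with the class priors implied by the 700/300 split. The tokenization rule (runs of letters and digits, punctuation dropped) and the name `document_to_attributes` are assumptions made for illustration; the slides do not specify how words are delimited.

```python
import re


def document_to_attributes(text):
    """Map each 1-based word position to the word found there."""
    # ASSUMPTION: a word is a run of letters or digits.
    words = re.findall(r"[A-Za-z0-9]+", text)
    return {position: word for position, word in enumerate(words, start=1)}


text = ("If this sentence is the text, then it has 16 words, "
        "and it has 16 attributes.")
attributes = document_to_attributes(text)

# Class priors estimated from the document counts in the set-up.
class_counts = {"like": 700, "dislike": 300}
total_docs = sum(class_counts.values())
priors = {label: count / total_docs for label, count in class_counts.items()}

print(len(attributes), attributes[3], priors)
```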

Example (contd)
Naïve Bayes classification for the text "If this sentence is the text, then it has 16 words, and it has 16 attributes.":
$$v_{NB} = \arg\max_{v_j \in \{like,\, dislike\}} P(v_j) \prod_{i=1}^{16} P(a_i \mid v_j)$$
$$= \arg\max_{v_j \in \{like,\, dislike\}} P(v_j)\,P(a_1 = \text{"If"} \mid v_j)\,P(a_2 = \text{"this"} \mid v_j) \cdots P(a_{16} = \text{"attributes"} \mid v_j)$$

Computational Complexity
Even with the independence assumption, treating each word position as a separate attribute means that the number of conditional probabilities to be estimated is extremely large.

Example: if we assume 50,000 possible values (distinct words) for each of 15 attributes and two possible target values, we need 2 × 15 × 50,000 = 1.5 million probability terms.

Assumption: attributes are independent and identically distributed, given the target classification. The probability of encountering a specific word $w_k$ is independent of the position being considered.

Example: attributes $a_1$ and $a_{13}$ have the same probability for a given word.

Refined Solution
Formal statement of the assumption:
$$P(a_i = w_k \mid v_j) = P(a_m = w_k \mid v_j) \quad \text{for all } i, j, k, m$$

How does this affect the algorithm? For the example, we now only need to estimate 2 × 50,000 = 100,000 terms of the form $P(w_k \mid v_j)$.

We estimate $P(w_k \mid v_j)$ as:
$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |Vocabulary|}$$
where
- n = total number of word positions in all training examples whose target value is $v_j$
- $n_k$ = number of times word $w_k$ is found among these n word positions
- |Vocabulary| = total number of distinct words

[Figure: Learning Curve for 20 Newsgroups]
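The position-independent estimate and the resulting classifier can be sketched as follows. The three-document corpus is hypothetical and exists only to make the sketch runnable. Two choices here are assumptions not stated in the slides: scores are summed as logarithms to avoid numeric underflow on long documents, and words absent from the vocabulary are skipped at classification time.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # ASSUMPTION: lowercase runs of letters or digits count as words.
    return re.findall(r"[a-z0-9]+", text.lower())


def learn(documents):
    """documents: list of (text, label). Returns vocabulary, priors, counts."""
    vocabulary = set()
    doc_counts = Counter()
    word_counts = {}  # label -> Counter of words over all its documents
    for text, label in documents:
        words = tokenize(text)
        vocabulary.update(words)
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
    priors = {v: c / len(documents) for v, c in doc_counts.items()}
    return vocabulary, priors, word_counts


def word_prob(word, label, vocabulary, word_counts):
    """Estimate P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)."""
    n = sum(word_counts[label].values())  # word positions with class v_j
    n_k = word_counts[label][word]        # occurrences of w_k among them
    return (n_k + 1) / (n + len(vocabulary))


def classify(text, vocabulary, priors, word_counts):
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in tokenize(text):
            if word in vocabulary:  # unseen words are skipped
                score += math.log(
                    word_prob(word, label, vocabulary, word_counts))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy corpus, for illustration only.
documents = [
    ("great hockey game tonight", "like"),
    ("hockey playoffs were great", "like"),
    ("boring stock market report", "dislike"),
]
vocabulary, priors, word_counts = learn(documents)
p_hockey_like = word_prob("hockey", "like", vocabulary, word_counts)
p_hockey_dislike = word_prob("hockey", "dislike", vocabulary, word_counts)
prediction = classify("great hockey", vocabulary, priors, word_counts)
print(p_hockey_like, p_hockey_dislike, prediction)
```

Adding 1 to every count keeps a word that never occurs in one class from forcing that class's whole product to zero.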
