Bayesian Learning. Artificial Intelligence Programming. Chris Brooks, Department of Computer Science, University of San Francisco.

15-0: Learning vs. Deduction

So far, we've seen two types of reasoning:

Deductive reasoning: we start with a model of the world and use that model to derive new facts.
- Logic: modus ponens, resolution, forward/backward chaining
- Probability: Bayes' rule, belief networks

Inductive reasoning: we start with data and try to derive a model of the world.
- Logic: decision trees
- Probability: ?

15-1: Probabilistic Learning

We would like to be able to:
- Estimate probabilities of events (or combinations of events) from data.
- Update these estimates as new data becomes available.
- Predict the most likely hypothesis for a set of data.

15-2: Learning and Classification

An important sort of learning problem is the classification problem: placing examples into one of two or more classes.
- Should/shouldn't get credit
- Categories of documents
- Golf-playing/non-golf-playing days

This requires access to a set of labeled training examples, which allow us to induce a hypothesis that describes how to decide what class an example should be in.

15-3: Bayes' Theorem

Recall the definition of Bayes' theorem:

P(b | a) = P(a | b) P(b) / P(a)

Let's rewrite this a bit. Let D be the data we've seen so far, and let h be a possible hypothesis:

P(h | D) = P(D | h) P(h) / P(D)

15-4: Concept Learning

We can apply Bayes' rule to learning problems if we assume that we have a finite set of hypotheses.
- A concept is a function that maps from a set of attributes to a classification.
- A hypothesis is a possible concept; we want to find the one that best explains the training data.
- Use our evidence to compute the posterior probability of each hypothesis given the data.
- Find the hypothesis that best fits the data, or is the most likely.
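To make this concrete, here is a minimal Python sketch of applying Bayes' theorem over a finite hypothesis set. The two hypotheses and their numbers are invented purely for illustration; they are not from the slides.

```python
# Minimal sketch: posterior over a finite hypothesis set via Bayes' theorem.
# The hypotheses and their probabilities are made up for illustration.

priors = {"h1": 0.3, "h2": 0.7}          # P(h)
likelihoods = {"h1": 0.8, "h2": 0.1}     # P(D | h) for some observed data D

# P(D) = sum over h of P(D | h) P(h), marginalizing over the hypothesis set.
p_data = sum(likelihoods[h] * priors[h] for h in priors)

posteriors = {h: likelihoods[h] * priors[h] / p_data for h in priors}
print(posteriors)   # {'h1': 0.774..., 'h2': 0.225...}
```

Even though h2 has the larger prior, the data favor h1 strongly enough that h1 ends up with the larger posterior.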

15-5: MAP Hypothesis

Usually, we won't be so interested in the particular probabilities for each hypothesis. Instead, we want to know: which hypothesis is most likely, given the data?

We call this the maximum a posteriori (MAP) hypothesis. In this case, we can ignore the denominator in Bayes' theorem, since it is the same for all h:

h_MAP = argmax_{h in H} P(D | h) P(h)

Advantages:
- Simpler calculation
- No need to estimate P(D)

15-6: ML Hypothesis

In some cases, we can simplify things even further. What are the priors P(h) for each hypothesis? Without any other information, we'll often assume they're equally probable: each has probability 1/|H|.

In this case, we can just consider the conditional probability P(D | h). We call the hypothesis that maximizes this conditional probability the maximum likelihood (ML) hypothesis:

h_ML = argmax_{h in H} P(D | h)

15-7: Example

Imagine that we have a large bag of candy. We want to know the ratio of cherry to lime in the bag. We start with 5 hypotheses:

1. h1: 100% cherry
2. h2: 75% cherry, 25% lime
3. h3: 50% cherry, 50% lime
4. h4: 25% cherry, 75% lime
5. h5: 100% lime

Our agent repeatedly draws pieces of candy. We want it to correctly predict the type of the next piece of candy.

15-8: Example

Let's assume our priors for the five hypotheses are (0.1, 0.2, 0.4, 0.2, 0.1).

Also, we assume that the observations are i.i.d.: each draw is independent of the others, and order doesn't matter. In that case, we can multiply probabilities:

P(D | h_i) = Π_j P(d_j | h_i)

Suppose we draw 10 limes in a row. P(D | h3) is (1/2)^10, since the probability of drawing a lime under h3 is 1/2.

15-9: Example

How do the hypotheses change as data is observed? Initially, we start with the priors: (0.1, 0.2, 0.4, 0.2, 0.1). Then we draw a lime:

P(h1 | lime) = α P(lime | h1) P(h1) = 0
P(h2 | lime) = α P(lime | h2) P(h2) = α × 1/4 × 0.2 = α × 0.05
P(h3 | lime) = α P(lime | h3) P(h3) = α × 1/2 × 0.4 = α × 0.2
P(h4 | lime) = α P(lime | h4) P(h4) = α × 3/4 × 0.2 = α × 0.15
P(h5 | lime) = α P(lime | h5) P(h5) = α × 1 × 0.1 = α × 0.1

Here α = 2.

15-10: Example

Then we draw a second lime:

P(h1 | lime, lime) = α P(lime, lime | h1) P(h1) = 0
P(h2 | lime, lime) = α P(lime, lime | h2) P(h2) = α × 1/4 × 1/4 × 0.2 = α × 0.0125
P(h3 | lime, lime) = α P(lime, lime | h3) P(h3) = α × 1/2 × 1/2 × 0.4 = α × 0.1
P(h4 | lime, lime) = α P(lime, lime | h4) P(h4) = α × 3/4 × 3/4 × 0.2 = α × 0.1125
P(h5 | lime, lime) = α P(lime, lime | h5) P(h5) = α × 1 × 1 × 0.1 = α × 0.1

Here α ≈ 3.08.

Strictly speaking, we don't really care what α is. We can just select the MAP hypothesis, since we only want to know the most likely hypothesis.
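Here is a short Python sketch of the candy example, using the same five hypotheses, priors, and lime draws as above. It renormalizes after each draw rather than re-weighting the original priors with the joint likelihood, which gives the same posteriors.

```python
# Candy example (slides 15-7 through 15-10): posteriors over the five bag
# hypotheses after observing limes, renormalizing after each draw.

p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h1..h5)
priors = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h1..h5)

def update(posterior, likelihood):
    """One Bayesian update: multiply by the likelihood, then renormalize."""
    unnorm = [p * l for p, l in zip(posterior, likelihood)]
    alpha = 1.0 / sum(unnorm)           # alpha = 2 after the first lime
    return [alpha * u for u in unnorm]

posterior = priors
for _ in range(2):                      # draw two limes in a row
    posterior = update(posterior, p_lime)

print(posterior)   # approximately [0, 0.038, 0.308, 0.346, 0.308]
```

Note how h1 (100% cherry) is ruled out by the first lime, and probability mass shifts steadily toward the lime-heavy hypotheses.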

15-11: Bayesian Learning

Eventually, the true hypothesis will dominate all others.
- Caveat: this assumes the data is noise-free (or the noise is normally distributed).

What sort of bias does Bayesian learning use?
- Typically, simpler hypotheses will have larger priors.
- More complex hypotheses will fit the data more exactly (but there are many more of them).
- Under these assumptions, h_MAP will be the simplest hypothesis that fits the data. This is Occam's razor, again.
- Think about the deterministic case, where P(h_i | D) is either 1 or 0.

15-12: Bayesian Concept Learning

Bayesian learning involves estimating the likelihood of each hypothesis. In a more complex world where observations are not independent, this could be difficult.

Our first cut at doing this might be a brute-force approach:
1. For each h in H, calculate P(h | D) = P(D | h) P(h) / P(D).
2. Output the hypothesis h_MAP with the highest posterior probability.

This is what we did in the example.

Challenge: Bayes' theorem can be difficult to use when observations are not i.i.d.:

P(h | o1, o2) = P(o1 | h, o2) P(h | o2) / P(o1 | o2)

15-13: Bayesian Optimal Classifiers

There's one other problem with the formulation as we have it. Usually, we're not so interested in the hypothesis that fits the data. Instead, we want to classify some unseen data, given the data we've seen so far.
- It's sunny and windy. Should we play tennis?

One approach would be to just return the MAP hypothesis. We can do better, though.

15-14: Bayesian Optimal Classifiers

Suppose we have three hypotheses with posteriors h1 = 0.4, h2 = 0.3, h3 = 0.3. We get a new piece of data: h1 says it's positive, while h2 and h3 say it's negative.

h1 is the MAP hypothesis, yet there's a 0.6 chance that the data is negative. By combining weighted hypotheses, we improve our performance.

15-15: Bayesian Optimal Classifiers

By combining the predictions of each hypothesis, we get a Bayesian optimal classifier. More formally, let's say our unseen data belongs to one of a set V of classes. The probability P(v_j | D) that our new instance belongs to class v_j is:

P(v_j | D) = Σ_{h_i in H} P(v_j | h_i) P(h_i | D)

Intuitively, each hypothesis gives its prediction, weighted by the likelihood that that hypothesis is the correct one. This classification method is provably optimal: on average, no other algorithm can perform better.

15-16: Problems with the Bayes Optimal Classifier

However, the Bayes optimal classifier is mostly interesting as a theoretical benchmark. In practice, computing the posterior probabilities is exponentially hard.
- Problem: the attributes of the data are dependent upon each other.

Can we get around this?
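Before moving on, here is a minimal Python sketch of the weighted vote from slide 15-14. The 0.4/0.3/0.3 posteriors and the positive/negative predictions come from the slide; P(v | h_i) is taken to be 1 for the class each hypothesis predicts and 0 otherwise.

```python
# Bayes optimal classification for the three-hypothesis example of slide 15-14.

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # each hypothesis's prediction

# P(v | D) = sum_i P(v | h_i) * P(h_i | D), with P(v | h_i) = 1 if h_i predicts v.
class_prob = {}
for h, p in posteriors.items():
    v = predictions[h]
    class_prob[v] = class_prob.get(v, 0.0) + p

print(class_prob)                            # {'+': 0.4, '-': 0.6}
print(max(class_prob, key=class_prob.get))   # '-', disagreeing with the MAP hypothesis h1
```

The weighted vote picks the negative class even though the single most probable hypothesis predicts positive.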

15-17: Naive Bayes Classifier

The Naive Bayes classifier makes a strong assumption that makes the algorithm practical: each attribute of an example is independent of the others, i.e. P(a ∧ b) = P(a) P(b) for all attributes a and b. This makes it straightforward to compute posteriors.

15-18: The Bayesian Learning Problem

Given: a set of labeled, multivalued examples. Find a function F(x) that correctly classifies an unseen example with attributes (a1, a2, ..., an).

Call the most probable category v_MAP:

v_MAP = argmax_{v_i in V} P(v_i | a1, a2, ..., an)

We rewrite this with Bayes' theorem as:

v_MAP = argmax_{v_i in V} P(a1, a2, ..., an | v_i) P(v_i)

Estimating P(v_i) is straightforward with a large training set: count the fraction of the set that are of class v_i. However, estimating P(a1, a2, ..., an | v_i) is difficult unless our training set is very large: we would need to see every possible attribute combination many times.

15-19: Naive Bayes Assumption

Naive Bayes assumes that all attributes are conditionally independent of each other, given the class. In this case:

P(a1, a2, ..., an | v_i) = Π_j P(a_j | v_i)

This can be estimated from the training data. The classifier then picks the class with the highest probability according to this equation. Interestingly, Naive Bayes performs well even in cases where the conditional independence assumption fails.

15-20: Example

Recall the tennis-playing problem from the decision tree homework. We want to use the training data and a Naive Bayes classifier to classify the following instance: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong.

15-21: Example

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

15-22: Example

Our priors are:

P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36

We can estimate:

P(wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(wind = strong | PlayTennis = no) = 3/5 = 0.6
P(humidity = high | PlayTennis = yes) = 3/9 = 0.33
P(humidity = high | PlayTennis = no) = 4/5 = 0.8
P(outlook = sunny | PlayTennis = yes) = 2/9 = 0.22
P(outlook = sunny | PlayTennis = no) = 3/5 = 0.6
P(temp = cool | PlayTennis = yes) = 3/9 = 0.33
P(temp = cool | PlayTennis = no) = 1/5 = 0.2
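Here is a small Python sketch that derives these estimates by counting over the 14-row table above and then scores the query instance; the helper names (prior, cond) are mine, but every number it prints matches slides 15-22 and 15-23.

```python
# Naive Bayes estimates for the PlayTennis table.
# Attribute order per row: Outlook, Temperature, Humidity, Wind, PlayTennis.

data = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def prior(label):
    """P(PlayTennis = label), estimated as a fraction of the training set."""
    return sum(1 for row in data if row[-1] == label) / len(data)

def cond(attr_index, value, label):
    """P(attribute = value | PlayTennis = label), estimated by counting."""
    rows = [row for row in data if row[-1] == label]
    return sum(1 for row in rows if row[attr_index] == value) / len(rows)

print(prior("Yes"), prior("No"))    # 0.64..., 0.36...
print(cond(3, "Strong", "Yes"))     # 3/9 = 0.33...
print(cond(2, "High", "No"))        # 4/5 = 0.8

# Score the query instance (Sunny, Cool, High, Strong) for each class:
query = ("Sunny", "Cool", "High", "Strong")
for label in ("Yes", "No"):
    v = prior(label)
    for i, value in enumerate(query):
        v *= cond(i, value, label)
    print(label, round(v, 4))       # Yes ~0.0053, No ~0.0206
```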

15-23: Example

v_yes = P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.005
v_no = P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206

So not playing tennis is the most probable classification. Further, by normalizing, we see that the classifier predicts a 0.0206 / (0.005 + 0.0206) = 0.80 probability of not playing tennis.

15-24: Estimating Probabilities

As we can see from this example, estimating probabilities through frequency is risky when our data set is small. We only have 5 negative examples, so we may not have an accurate estimate.

A better approach is to use the following formula, called an m-estimate:

(n_c + m p) / (n + m)

where n_c is the number of examples with the characteristic of interest (say Wind = Strong), n is the total number of positive/negative examples, p is our prior estimate, and m is a constant called the equivalent sample size.
- m lets us control how much influence the estimated prior and the observed frequency each have; it determines how heavily to weight p.
- p is assumed to be uniform.

15-25: Estimating Probabilities

So, in the tennis example:

P(wind = strong | PlayTennis = no) = (3 + 0.2m) / (5 + m)

We'll determine an m based on sample size.

15-26: Using Naive Bayes to Classify Text

One area where Naive Bayes has been very successful is text classification, despite the violation of the independence assumptions.

Problem: given some text documents labeled as to category, determine the category of new and unseen documents. Our features will be the words that appear in a document. Based on this, we'll predict a category.

15-27: Using Naive Bayes to Classify Text

The first thing we need is some prior probabilities for words. We get this by scanning the entire training set and counting the frequency of each word. Common, content-free words such as "a", "an", "the", and "and" are often left out. This word frequency count gives us an estimate of the priors for each word.

We then need priors for each category. We get this by counting the fraction of the training set that belongs to a given category.

Finally, we need to determine the conditional probability P(w_k | v_j). We do this by concatenating all documents in class v_j; call this the Text.

15-28: Classifying Text

Once we've computed the conditional probabilities, we can classify text. For a new document, find all distinct words in the document. For each class of document, compute the posterior probability of that class, given the words present in the document.

Seems too simple, but it actually works (up to 90% accurate in some cases).
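To tie the recipe together, here is a small Python sketch of a word-count Naive Bayes text classifier with m-estimate smoothing. The four-document training corpus, the class names, and the helper functions are invented for illustration; they are not from the slides.

```python
# Word-count Naive Bayes text classifier with m-estimate smoothing (a sketch).
from collections import Counter

train = [
    ("buy cheap pills now", "spam"),
    ("cheap pills cheap", "spam"),
    ("meeting agenda for tomorrow", "ham"),
    ("tomorrow we review the agenda", "ham"),
]

classes = {"spam", "ham"}
docs_per_class = Counter(label for _, label in train)

# Concatenate every document of a class into one "Text" and count its words.
words_per_class = {c: Counter() for c in classes}
for text, label in train:
    words_per_class[label].update(text.split())

vocab = {w for c in classes for w in words_per_class[c]}

def m_estimate(n_c, n, p, m):
    """(n_c + m*p) / (n + m): blend observed frequency with the prior estimate p."""
    return (n_c + m * p) / (n + m)

def score(doc, c, m=1.0):
    """Unnormalized posterior: P(c) * product over words of P(w | c), smoothed."""
    n = sum(words_per_class[c].values())   # total words in class c's Text
    p = 1.0 / len(vocab)                   # uniform prior over the vocabulary
    s = docs_per_class[c] / len(train)     # class prior
    for w in doc.split():
        s *= m_estimate(words_per_class[c][w], n, p, m)
    return s

doc = "cheap pills tomorrow"
print(max(classes, key=lambda c: score(doc, c)))   # most probable class: 'spam'
```

In practice you would also strip the common content-free words mentioned above and work with log probabilities to avoid underflow on long documents.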

15-29: Summary

- We can use Bayes' rule to estimate the probabilities of different hypotheses from data.
- The MAP hypothesis is useful if we just need to know which hypothesis is most likely.
- If priors are uniform, we can compute the ML hypothesis instead.
- This lets us build a Bayes optimal classifier: the prediction of each hypothesis, weighted by its likelihood.
- The Bayes optimal classifier is great, but intractable in practice.

15-30: Summary

- If we assume all attributes in our data set are independent, we have a Naive Bayes classifier.
- Computation is linear in the number of attributes.
- Even on data that's not independent, Naive Bayes performs well in practice.
- As you'll discover in Project 3...