Bayesian Learning CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in

Outline
Bayes theorem, MAP learners, Bayes optimal classifier, naïve Bayes classifier, example: text classification, Bayesian networks.

Features of Bayesian Learning
Each training example can incrementally increase or decrease the estimated probability that a hypothesis is correct. Bayesian methods allow for probabilistic predictions. They yield practical learning algorithms (naïve Bayes learning, Bayesian network learning) that combine prior knowledge with observed data, though they require prior probabilities. They also provide a useful conceptual framework: a gold standard for evaluating other classifiers and tools for analysis.

Bayes Theorem
If A and B are two random variables: P(A|B) = P(B|A) P(A) / P(B).
In the context of a classifier, with hypothesis h and training data I:
P(h|I) = P(I|h) P(h) / P(I)
P(h): prior probability of hypothesis h
P(I): prior probability of the training data I
P(h|I): probability of h given I (posterior)
P(I|h): probability of I given h (likelihood)

Choosing the Hypotheses
Given the training data, we are interested in the most probable hypothesis.
Maximum a posteriori (MAP) hypothesis h_MAP:
h_MAP = argmax_{h in H} P(h|I) = argmax_{h in H} P(I|h) P(h) / P(I) = argmax_{h in H} P(I|h) P(h)
If every hypothesis is a priori equally probable, P(h_i) = P(h_j) for all h_i, h_j in H, this simplifies to the maximum likelihood (ML) hypothesis h_ML:
h_ML = argmax_{h_i in H} P(I|h_i)

Example
Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, only 0.008 of the entire population has this cancer.

P(cancer) = 0.008, P(¬cancer) = 0.992
P(+|cancer) = 0.98, P(−|cancer) = 0.02
P(+|¬cancer) = 0.03, P(−|¬cancer) = 0.97
P(cancer|+) = ?
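
A minimal sketch, using only the numbers from the previous slide, of how the posterior P(cancer|+) works out via Bayes theorem:

```python
# Bayes theorem on the cancer test example above.
p_cancer = 0.008                 # prior P(cancer)
p_not_cancer = 1 - p_cancer      # P(not cancer) = 0.992
p_pos_given_cancer = 0.98        # P(+ | cancer)
p_pos_given_not = 1 - 0.97       # P(+ | not cancer) = 0.03

joint_cancer = p_pos_given_cancer * p_cancer   # P(+|cancer) P(cancer) = 0.0078
joint_not = p_pos_given_not * p_not_cancer     # P(+|not cancer) P(not cancer) ~ 0.0298

p_cancer_given_pos = joint_cancer / (joint_cancer + joint_not)
print(p_cancer_given_pos)   # ~0.21, so h_MAP is "no cancer" despite the positive test
```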

Brute-Force MAP Hypothesis Learner (1)
Given I = <(x_1, y_1), ..., (x_N, y_N)>, the examples and their class labels:
For each hypothesis h in H, calculate the posterior probability P(h|I) = P(I|h) P(h) / P(I).
Output the hypothesis h_MAP with the highest posterior probability: h_MAP = argmax_{h in H} P(h|I).

Brute-Force MAP Hypothesis Learner (2)
Given I = <(x_1, y_1), ..., (x_N, y_N)>, choose P(I|h):
P(I|h) = 1 if h is consistent with I, P(I|h) = 0 otherwise.
Choose P(h) to be the uniform distribution: P(h) = 1/|H| for all h in H.
Then P(h|I) = P(I|h) P(h) / P(I).


Brute-Force MAP Hypothesis Learner (3)
Given I = <(x_1, y_1), ..., (x_N, y_N)>, choose P(I|h):
P(I|h) = 1 if h is consistent with I, P(I|h) = 0 otherwise.
Choose P(h) to be the uniform distribution: P(h) = 1/|H|.
Then P(h|I) = 1/|VS_{H,I}| if h is consistent with I, and 0 otherwise, where VS_{H,I} is the version space of H with respect to I.
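
A minimal sketch of this brute-force learner under the same choices (0/1 likelihood from consistency, uniform prior). The threshold hypotheses and the toy data are purely illustrative, not from the slides:

```python
# Brute-force MAP learner sketch: 0/1 likelihood (consistency with I), uniform prior.
# The hypotheses are simple threshold rules on one feature, purely for illustration.
def make_threshold_hypothesis(t):
    return lambda x: 1 if x >= t else 0

H = [make_threshold_hypothesis(t) for t in (0.2, 0.5, 0.8)]
I = [(0.1, 0), (0.4, 0), (0.9, 1)]               # training pairs (x, y)

prior = 1.0 / len(H)                             # P(h), uniform over H
likelihood = [1.0 if all(h(x) == y for x, y in I) else 0.0 for h in H]   # P(I|h)
unnormalized = [lik * prior for lik in likelihood]
evidence = sum(unnormalized)                     # P(I)
posterior = [u / evidence for u in unnormalized] # P(h|I): uniform over the version space

h_map = H[max(range(len(H)), key=lambda i: posterior[i])]
```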

Evolution of Posterior Probabilities
(figure) The prior P(h), the posterior P(h|I_1) after one training example, and the posterior P(h|I_1, I_2) after two, each plotted over the hypotheses h.

Classifying new instances
Given a new instance x, what is the most probable classification? One solution: h_MAP(x). But can we do better?
Consider the following example containing three hypotheses: P(h_1|I) = 0.4, P(h_2|I) = 0.3, P(h_3|I) = 0.3.
Given a new instance x: h_1(x) = +, h_2(x) = −, h_3(x) = −.
What is the most probable classification for x?

Bayes Optimal Classifier (1)
Combine the predictions of all hypotheses, weighted by their posterior probabilities.
Bayes optimal classification: argmax_{y in Y} sum_{h_i in H} P(y|h_i) P(h_i|I)
Example:
P(h_1|I) = 0.4, P(−|h_1) = 0, P(+|h_1) = 1
P(h_2|I) = 0.3, P(−|h_2) = 1, P(+|h_2) = 0
P(h_3|I) = 0.3, P(−|h_3) = 1, P(+|h_3) = 0
sum_{h_i in H} P(+|h_i) P(h_i|I) = 0.4
sum_{h_i in H} P(−|h_i) P(h_i|I) = 0.6
so the Bayes optimal classification is −.
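
A minimal sketch of this weighted vote, using the numbers on this slide:

```python
# Bayes optimal classification: weight each hypothesis's prediction by its posterior.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | I)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # each h_i's label for the new x

votes = {}
for h, post in posteriors.items():
    label = predictions[h]
    votes[label] = votes.get(label, 0.0) + post   # sum_{h_i} P(y|h_i) P(h_i|I)

y_star = max(votes, key=votes.get)                # "-" wins with weight 0.6 vs 0.4 for "+"
print(y_star)
```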

Bayes Optimal Classifier (2)
Optimal in the sense that no other classification method using the same hypothesis space and the same prior knowledge can outperform it on average: the method maximizes the probability that the new instance is classified correctly, given the available data, the hypothesis space, and the prior probabilities over the hypotheses.
But it is inefficient: it must compute the posterior probability of every hypothesis and combine the predictions of all hypotheses.

Gibbs Classifier
Gibbs algorithm: choose a hypothesis h in H at random, according to the posterior probability distribution over H; use h to classify the new instance x.
Observation: assume the target concepts are drawn at random from H according to the priors on H. Then
E[error_Gibbs] <= 2 E[error_BayesOptimal]
(Haussler et al., Machine Learning, 1994)
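
A minimal sketch of the Gibbs step, assuming a posterior over a small list of hypotheses; the hypothesis names and their predictions are illustrative:

```python
import random

# Gibbs classifier sketch: sample one hypothesis h ~ P(h|I), then classify with it.
hypotheses = ["h1", "h2", "h3"]                               # illustrative hypothesis ids
posteriors = [0.4, 0.3, 0.3]                                  # P(h|I)
predict = {"h1": lambda x: "+", "h2": lambda x: "-", "h3": lambda x: "-"}

def gibbs_classify(x):
    h = random.choices(hypotheses, weights=posteriors, k=1)[0]  # draw h by posterior weight
    return predict[h](x)

print(gibbs_classify("some new instance"))
```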

Naïve Bayes Classifier (1)
Bayes rule, in a slightly different application. Let Y = {c_1, c_2, ..., c_K} be the different class labels; the label of the i-th instance is y_i in Y.
P(c_k|x_i) = P(x_i|c_k) P(c_k) / P(x_i)
P(c_k|x_i): posterior probability that instance x_i belongs to class c_k
P(x_i|c_k): probability that an instance drawn from class c_k would be x_i (likelihood)
P(c_k): probability of class c_k (prior)
P(x_i): probability of instance x_i (evidence)

Naïve Bayes Classifier (2)
Classify instance x as the class y with maximum posterior probability: y = argmax_{c_k} P(c_k|x).
Ignoring the denominator (since we are only interested in the maximum): y = argmax_{c_k} P(x|c_k) P(c_k).
If the prior is uniform: y = argmax_{c_k} P(x|c_k).

Naïve Bayes Classifier (3)
Look at the classifier y = argmax_{c_k} P(x|c_k) P(c_k). What is each instance x? A D-dimensional tuple (x_1, ..., x_D), so we must estimate the joint probability distribution P(x_1, ..., x_D | c_k).
Practical issue: we would need to know the probability of every possible instance given every possible class. With D Boolean features and K classes, that is K * 2^D probability values!

Naïve Bayes Classifier (4)
Make the naïve Bayes assumption: features/attributes are conditionally independent given the target attribute (class label):
P(x_1, ..., x_D | c_k) = prod_{d=1}^{D} P(x_d | c_k)
This results in the naïve Bayes classifier (NBC):
y = argmax_{c_k} P(c_k) prod_{d=1}^{D} P(x_d | c_k)

NBC Practical Issues (1)
Estimating the probabilities from I.
Prior probabilities: P(c_k) = |{(x_i, y_i) : y_i = c_k}| / |I|
If the features are discrete: P(x_d = v | c_k) = |{(x_i, y_i) : x_id = v and y_i = c_k}| / |{(x_i, y_i) : y_i = c_k}|
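
A minimal sketch of these count-based estimates and the resulting classifier for discrete features; the toy data and feature values are illustrative:

```python
from collections import Counter, defaultdict

# Toy discrete dataset: each instance is a tuple of feature values, with a class label.
data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "yes"),
        (("rain", "mild"), "yes"), (("rain", "hot"), "no")]

# Prior: P(c) = |{i : y_i = c}| / |I|
class_counts = Counter(y for _, y in data)
priors = {c: n / len(data) for c, n in class_counts.items()}

# Conditionals: P(x_d = v | c) = count(x_d = v and y = c) / count(y = c)
cond_counts = defaultdict(Counter)          # (feature index d, class c) -> counts of values v
for x, y in data:
    for d, v in enumerate(x):
        cond_counts[(d, y)][v] += 1

def predict(x):
    scores = {}
    for c in priors:
        p = priors[c]
        for d, v in enumerate(x):
            p *= cond_counts[(d, c)][v] / class_counts[c]   # P(x_d = v | c)
        scores[c] = p
    return max(scores, key=scores.get)

print(predict(("sunny", "mild")))   # "yes" on this toy data
```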

NBC Practical Issues (2)
What if the features are continuous? Assume some parameterized distribution for x_d, e.g., Normal; learn the parameters of the distribution from the data, e.g., the mean and variance of the x_d values, choosing the parameters that maximize the likelihood:
P(x_d | c_k) ~ N(μ, σ^2), with μ and σ^2 unknown.
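
A minimal sketch of the Gaussian version for one continuous feature: estimate the per-class mean and variance by maximum likelihood, then evaluate the Normal density; the numbers are illustrative:

```python
import math

# Per-class values of one continuous feature (illustrative numbers).
values_by_class = {"yes": [1.9, 2.1, 2.4], "no": [4.8, 5.2, 5.0]}

# Maximum-likelihood estimates of mean and variance per class.
params = {}
for c, vals in values_by_class.items():
    mu = sum(vals) / len(vals)
    var = sum((v - mu) ** 2 for v in vals) / len(vals)
    params[c] = (mu, var)

def gaussian_likelihood(v, c):
    """P(x_d = v | c) under the fitted Normal for class c."""
    mu, var = params[c]
    return math.exp(-(v - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# P(x_d = 2.0 | yes) is much larger than P(x_d = 2.0 | no)
print(gaussian_likelihood(2.0, "yes"), gaussian_likelihood(2.0, "no"))
```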


NBC Practical Issues (3)
What if the features are continuous? Besides assuming a parameterized distribution and fitting it by maximum likelihood (previous slide), we can discretize the feature, e.g., map price in R to price in {low, medium, high}, and then treat it as discrete.

NBC Practical Issues (4)
If there are no examples in class c_k for which x_d = v, then P(x_d = v | c_k) = 0, and so prod_d P(x_d | c_k) P(c_k) = 0.
Use the m-estimate, defined as:
P(x_d = v | c_k) = (|{(x_i, y_i) : x_id = v and y_i = c_k}| + m p) / (|{(x_i, y_i) : y_i = c_k}| + m)
p: prior estimate of the probability
m: equivalent sample size (how heavily to weight p relative to the observed data)
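
A minimal sketch of the m-estimate for one discrete feature value, assuming a uniform prior p over the feature's values; the counts are illustrative:

```python
# m-estimate: smooth the conditional probability toward a prior estimate p.
def m_estimate(count_v_and_c, count_c, p, m):
    """P(x_d = v | c) = (n_vc + m*p) / (n_c + m)."""
    return (count_v_and_c + m * p) / (count_c + m)

# e.g. value v never seen with class c (n_vc = 0), 10 examples of class c,
# uniform prior p = 1/3 over three possible feature values, m = 3:
print(m_estimate(0, 10, 1 / 3, 3))   # 1/13 ~ 0.077 instead of 0
```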

Example: Learn to Classify Text
Problem definition: given a set of news articles that are of interest, we would like to learn to classify the articles by topic. Naïve Bayes is among the most effective algorithms for this task.
What attributes will represent the documents? A vector of words: one attribute per word position in the document.
What is the target concept? Is the document interesting? The topic of the document.

Algorithm: Learn Naïve Bayes (text)
Collect all words and tokens that occur in the examples I; Vocabulary = all distinct words and tokens in I.
Compute the probabilities P(c_k) and P(w|c_k):
I_k = examples for which the target label is c_k; P(c_k) = |I_k| / |I|
n = total number of word positions in the examples of class c_k (counting duplicates multiple times)
For each word w in Vocabulary: n_w = number of times word w occurs in the examples of class c_k
P(w|c_k) = (n_w + 1) / (n + |Vocabulary|)
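
A minimal sketch of this training step, assuming documents are given as lists of tokens with a topic label; the documents and topics are illustrative:

```python
from collections import Counter, defaultdict

# Illustrative training documents: (list of tokens, topic label).
docs = [("the game went to overtime".split(), "sports"),
        ("the election results are in".split(), "politics"),
        ("the team won the game".split(), "sports")]

vocabulary = {w for tokens, _ in docs for w in tokens}

# P(c) and the smoothed word probabilities P(w|c) = (n_w + 1) / (n + |Vocabulary|).
priors, word_probs = {}, defaultdict(dict)
for c in {label for _, label in docs}:
    class_docs = [tokens for tokens, label in docs if label == c]
    priors[c] = len(class_docs) / len(docs)
    counts = Counter(w for tokens in class_docs for w in tokens)
    n = sum(counts.values())                      # word positions in class c
    for w in vocabulary:
        word_probs[c][w] = (counts[w] + 1) / (n + len(vocabulary))
```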

Algorithm: Classify Naïve Bayes (text)
Given a test instance, compute the frequency of occurrence of each vocabulary term in the test instance, and apply the naïve Bayes classification rule.
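
Continuing the training sketch above (it reuses vocabulary, priors, and word_probs from that block), a minimal classification step; log probabilities are used only to avoid numerical underflow, and the test sentence is illustrative:

```python
import math
from collections import Counter

def classify(tokens):
    """argmax_c log P(c) + sum_w count(w) * log P(w|c); words outside the vocabulary are ignored."""
    counts = Counter(w for w in tokens if w in vocabulary)
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for w, n_w in counts.items():
            score += n_w * math.log(word_probs[c][w])
        scores[c] = score
    return max(scores, key=scores.get)

print(classify("the game results are in".split()))
```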

Example: 20 Newsgroups
Given 1000 training documents from each group, learn to classify new documents according to the newsgroup they came from. NBC achieves 89% accuracy.

Bayesian Network (1)
The naïve Bayes assumption of conditional independence is too restrictive, but the problem is intractable without some conditional independence assumption. Bayesian networks describe conditional independence among subsets of variables, and allow prior knowledge about (in)dependencies among variables to be combined with training data. (Recollect the definition of conditional independence.)

Bayesian Network - Example
(figure) A network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire, with the conditional probability table for Campfire given its parents Storm (S) and BusTourGroup (B):

            S,B    S,¬B   ¬S,B   ¬S,¬B
  Campfire  0.4    0.1    0.8    0.2
  ¬Campfire 0.6    0.9    0.2    0.8

Bayes Network (2)
The network represents the joint probability distribution over all variables, e.g., P(Storm, BusTourGroup, ..., ForestFire). In general,
P(x_1, x_2, ..., x_D) = prod_{d=1}^{D} P(x_d | Parents(x_d))
where Parents(x_d) denotes the immediate predecessors of x_d in the graph.
What is the Bayes network corresponding to the naïve Bayes classifier?
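
A minimal sketch of this factorization for the Campfire node and its two parents from the example network, restricted to just those three variables; the Campfire table is from the example slide, while the priors for Storm and BusTourGroup are illustrative assumptions:

```python
# Joint probability via the network factorization:
# P(S, B, C) = P(S) * P(B) * P(C | S, B), using the Campfire CPT from the example.
p_storm = {True: 0.3, False: 0.7}          # illustrative prior P(Storm)
p_bus = {True: 0.5, False: 0.5}            # illustrative prior P(BusTourGroup)
p_campfire = {(True, True): 0.4, (True, False): 0.1,
              (False, True): 0.8, (False, False): 0.2}   # P(Campfire=true | S, B)

def joint(s, b, c):
    pc = p_campfire[(s, b)]
    return p_storm[s] * p_bus[b] * (pc if c else 1 - pc)

print(joint(True, True, True))    # P(Storm, BusTourGroup, Campfire) = 0.3 * 0.5 * 0.4
```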

Bayes Network (3)
Inference: the Bayes network encodes all the information required for inference. Exact inference methods work well for some structures; Monte Carlo methods simulate the network randomly to compute approximate solutions.
Learning: if the structure is known and there are no missing values, it is easy to learn a Bayes network. If the network structure is known but some values are missing, use the expectation maximization algorithm. If the structure is unknown, the problem is very difficult.

Summary
Bayes rule, Bayes optimal classifier, practical naïve Bayes classifier, example text classification task, maximum-likelihood estimates, Bayesian networks.