Bayesian Learning. Chapter 6: Bayesian Learning. CS 536: Machine Learning, Littman (Wu, TA)


Chapter 6: Bayesian Learning
CS 536: Machine Learning, Littman (Wu, TA)
[Read Ch. 6, except 6.3] [Suggested exercises: 6.1, 6.2, 6.6]

Topics:
- Bayes theorem
- MAP, ML hypotheses
- MAP learners
- Minimum description length principle
- Bayes optimal classifier
- Naive Bayes learner
- (if time) Example: learning over text data
- Bayesian belief networks
- Expectation Maximization algorithm

Roles for Bayesian Methods
Provides practical learning algorithms:
- Naive Bayes learning
- Bayesian belief network learning
- Combines prior knowledge (prior probabilities) with observed data
- Requires prior probabilities
Provides a useful conceptual framework:
- Provides a "gold standard" for evaluating other learning algorithms
- Additional insight into Occam's razor

Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
- P(h) = prior probability of hypothesis h
- P(D) = prior probability of training data D
- P(h|D) = probability of h given D
- P(D|h) = probability of D given h
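For concreteness, here is a minimal sketch (not from the slides) of Bayes theorem over a small discrete hypothesis space; the hypotheses and numbers are made up, and P(D) is obtained by summing P(D|h) P(h) over H:

```python
# Minimal sketch of Bayes theorem over a small, hypothetical discrete
# hypothesis space: posterior(h) = likelihood(h) * prior(h) / evidence.

def posteriors(priors, likelihoods):
    """priors[h] = P(h), likelihoods[h] = P(D|h); returns P(h|D) for each h."""
    evidence = sum(likelihoods[h] * priors[h] for h in priors)  # P(D)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}

# Example with made-up numbers: two hypotheses with equal priors.
print(posteriors({"h1": 0.5, "h2": 0.5}, {"h1": 0.8, "h2": 0.2}))
# -> {'h1': 0.8, 'h2': 0.2}
```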

Choosing Hypotheses
The natural choice is the most probable hypothesis given the training data, the maximum a posteriori hypothesis h_MAP:
h_MAP = argmax_{h in H} P(h|D)
      = argmax_{h in H} P(D|h) P(h) / P(D)
      = argmax_{h in H} P(D|h) P(h)
If we assume P(h_i) = P(h_j) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:
h_ML = argmax_{h_i in H} P(D|h_i)

Bayes Theorem: Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in 98% of the cases in which the disease is actually present, and a correct negative result in 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.
P(cancer) = .008        P(not cancer) = .992
P(+|cancer) = .98       P(-|cancer) = .02
P(+|not cancer) = .03   P(-|not cancer) = .97

Basic Formulas for Probabilities
Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
Sum rule: probability of a disjunction of two events A and B:
P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Theorem of total probability: if the events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then
P(B) = Σ_{i=1}^{n} P(B|A_i) P(A_i)

Brute Force MAP Learner
1. For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis h_MAP with the highest posterior probability: h_MAP = argmax_{h in H} P(h|D)
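The cancer example can be checked with a brute-force MAP computation; this short sketch (an illustration, not part of the slides) uses the probabilities listed above and compares P(+|h) P(h) for the two hypotheses:

```python
# Brute-force MAP learner applied to the cancer example:
# compute P(D|h) P(h) for each hypothesis and pick the largest.

prior = {"cancer": 0.008, "not cancer": 0.992}    # P(h)
lik_pos = {"cancer": 0.98, "not cancer": 0.03}    # P(+|h)

unnormalized = {h: lik_pos[h] * prior[h] for h in prior}
h_map = max(unnormalized, key=unnormalized.get)

print(unnormalized)   # {'cancer': 0.00784, 'not cancer': 0.02976}
print(h_map)          # 'not cancer' -- the MAP hypothesis despite the positive test
```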

Evolution of Posterior Probabilities
As data is added, certainty about the hypotheses increases. What is the effect on entropy?

Learning a Real-Valued Function
Consider any real-valued target function f.
Training examples <x_i, d_i>, where d_i is a noisy training value:
d_i = f(x_i) + e_i
e_i is a random variable (noise) drawn independently for each x_i according to some Gaussian distribution with mean 0.
Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:
h_ML = argmin_{h in H} Σ_{i=1}^{m} (d_i - h(x_i))^2

MAP and Least Squares: Proof
h_MAP = argmax_{h in H} P(h|D)
      = argmax_{h in H} P(D|h)   (assuming equal priors P(h))
      = argmax_{h in H} Π_{i=1}^{m} 1/sqrt(2πσ^2) exp(-1/2 ((d_i - h(x_i))/σ)^2)
      = argmax_{h in H} Σ_{i=1}^{m} [ln 1/sqrt(2πσ^2) - 1/2 ((d_i - h(x_i))/σ)^2]
      = argmax_{h in H} Σ_{i=1}^{m} -1/2 ((d_i - h(x_i))/σ)^2
      = argmax_{h in H} Σ_{i=1}^{m} -(d_i - h(x_i))^2
      = argmin_{h in H} Σ_{i=1}^{m} (d_i - h(x_i))^2
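The equivalence can also be seen numerically; the sketch below (an illustration with a made-up data set and candidate hypothesis set) confirms that the hypothesis with the highest Gaussian log-likelihood is the one with the smallest sum of squared errors:

```python
# Sketch: under zero-mean Gaussian noise, the hypothesis maximizing the
# log-likelihood of <x_i, d_i> is the one minimizing the sum of squared errors.
# The candidate hypotheses and data below are made up for illustration.
import math

xs = [0.0, 1.0, 2.0, 3.0]
ds = [0.1, 1.9, 4.2, 5.8]                      # noisy values of f(x) ~ 2x
hypotheses = {"h(x)=x": lambda x: x,
              "h(x)=2x": lambda x: 2 * x,
              "h(x)=3x": lambda x: 3 * x}
sigma = 1.0

def log_likelihood(h):
    return sum(math.log(1 / math.sqrt(2 * math.pi * sigma**2))
               - 0.5 * ((d - h(x)) / sigma) ** 2 for x, d in zip(xs, ds))

def sse(h):
    return sum((d - h(x)) ** 2 for x, d in zip(xs, ds))

best_ml = max(hypotheses, key=lambda name: log_likelihood(hypotheses[name]))
best_sse = min(hypotheses, key=lambda name: sse(hypotheses[name]))
print(best_ml, best_sse)   # both select "h(x)=2x"
```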

Predicting Probabilities
Consider predicting survival probability from patient data.
Training examples <x_i, d_i>, where d_i is 1 or 0.
Want to train a neural network to output a probability given x_i (not a 0 or 1).
In this case, one can show
h_ML = argmax_{h in H} Σ_{i=1}^{m} [d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))]
Weight update rule for a sigmoid unit:
w_jk ← w_jk + Δw_jk, where Δw_jk = η Σ_{i=1}^{m} (d_i - h(x_i)) x_ijk

Minimum Description Length Principle
Occam's razor: prefer the shortest hypothesis.
MDL: prefer the hypothesis h that minimizes
h_MDL = argmin_{h in H} (L_C1(h) + L_C2(D|h))
where L_C(x) is the description length of x under encoding C.

MDL Example
Example: H = decision trees, D = training data labels
L_C1(h) is the number of bits to describe tree h
L_C2(D|h) is the number of bits to describe D given h
Note that L_C2(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions.
Hence, h_MDL trades off tree size for training errors.
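The weight update rule above can be sketched as plain batch gradient ascent on the cross-entropy objective; the data, learning rate, and iteration count below are made-up illustration values:

```python
# Sketch of the gradient-ascent weight update for a single sigmoid unit that
# maximizes sum_i [d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))].
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(w, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

# Training examples <x_i, d_i> with d_i in {0, 1}; the first input is a bias term.
data = [([1.0, 0.2], 0), ([1.0, 0.9], 1), ([1.0, 0.7], 1), ([1.0, 0.1], 0)]
w = [0.0, 0.0]
eta = 0.5

for _ in range(1000):
    # Delta w_j = eta * sum_i (d_i - h(x_i)) * x_ij   (batch update)
    grads = [sum((d - h(w, x)) * x[j] for x, d in data) for j in range(len(w))]
    w = [wj + eta * g for wj, g in zip(w, grads)]

print([round(h(w, x), 2) for x, _ in data])  # outputs approach the 0/1 labels
```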

MDL Justification
h_MAP = argmax_{h in H} P(D|h) P(h)
      = argmax_{h in H} [log_2 P(D|h) + log_2 P(h)]
      = argmin_{h in H} [-log_2 P(D|h) - log_2 P(h)]
From information theory: the optimal (shortest expected coding length) code for an event with probability p uses -log_2 p bits.
So, prefer the hypothesis that minimizes length(h) + length(misclassifications).

Classifying New Instances
So far we've sought the most probable hypothesis given the data D (i.e., h_MAP).
Given a new instance x, what is its most probable classification?
h_MAP(x) is not the most probable classification!

Classification Example
Consider three possible hypotheses:
P(h_1|D) = .4, P(h_2|D) = .3, P(h_3|D) = .3
Given a new instance x: h_1(x) = +, h_2(x) = -, h_3(x) = -
What is h_MAP(x)? What is the most probable classification of x?

Bayes Optimal Classifier
Bayes optimal classification:
argmax_{v_j in V} Σ_{h_i in H} P(v_j|h_i) P(h_i|D)
Example:
P(h_1|D) = .4, P(-|h_1) = 0, P(+|h_1) = 1
P(h_2|D) = .3, P(-|h_2) = 1, P(+|h_2) = 0
P(h_3|D) = .3, P(-|h_3) = 1, P(+|h_3) = 0
Therefore:
Σ_{h_i in H} P(+|h_i) P(h_i|D) = .4
Σ_{h_i in H} P(-|h_i) P(h_i|D) = .6
so the Bayes optimal classification is -, even though the MAP hypothesis h_1 predicts +.
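The example can be reproduced directly from the definition of the Bayes optimal classifier; a minimal sketch using the numbers above:

```python
# Bayes optimal classification for the three-hypothesis example:
# pick argmax_v sum_h P(v|h) P(h|D).

post = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h|D)
pred = {"h1": {"+": 1, "-": 0},              # P(v|h)
        "h2": {"+": 0, "-": 1},
        "h3": {"+": 0, "-": 1}}

scores = {v: sum(pred[h][v] * post[h] for h in post) for v in ("+", "-")}
print(scores)                       # {'+': 0.4, '-': 0.6}
print(max(scores, key=scores.get))  # '-' even though h_MAP = h1 predicts '+'
```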

Gibbs Classifier
The Bayes optimal classifier provides the best result, but can be expensive if there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this one to classify the new instance

Error of Gibbs
(Not so) surprising fact: assume the target concepts are drawn at random from H according to the priors on H. Then:
E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
Suppose a correct, uniform prior distribution over H. Then:
- Pick any hypothesis consistent with the data, with uniform probability
- Its expected error is no worse than twice that of the Bayes optimal classifier

Naive Bayes Classifier
Along with decision trees, neural networks, and kNN, one of the most practical and most used learning methods.
When to use:
- A moderate or large training set is available
- The attributes that describe instances are conditionally independent given the classification
Successful applications:
- Diagnosis
- Classifying text documents

Naive Bayes Classifier
Assume target function f : X → V, where each instance x is described by attributes <a_1, a_2, ..., a_n>.
The most probable value of f(x) is:
v_MAP = argmax_{v_j in V} P(v_j | a_1, a_2, ..., a_n)
      = argmax_{v_j in V} P(a_1, a_2, ..., a_n | v_j) P(v_j) / P(a_1, a_2, ..., a_n)
      = argmax_{v_j in V} P(a_1, a_2, ..., a_n | v_j) P(v_j)
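A minimal sketch of the Gibbs classifier on the same toy example (the sampling loop at the end is only there to show the long-run behavior):

```python
# Sketch of the Gibbs classifier: sample one hypothesis according to P(h|D)
# and use it alone to classify x.
import random

post = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h|D)
pred = {"h1": "+", "h2": "-", "h3": "-"}     # h_i(x)

def gibbs_classify():
    h = random.choices(list(post), weights=post.values(), k=1)[0]
    return pred[h]

# Over many draws the classification is '+' about 40% of the time.
print(sum(gibbs_classify() == "+" for _ in range(10_000)) / 10_000)
```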

Naïve Bayes Assumption
P(a_1, a_2, ..., a_n | v_j) = Π_i P(a_i | v_j), which gives the Naive Bayes classifier:
v_NB = argmax_{v_j in V} P(v_j) Π_i P(a_i | v_j)
Note: no search in training!

Naïve Bayes Algorithm
Naive_Bayes_Learn(examples):
  For each target value v_j:
    P̂(v_j) ← estimate P(v_j)
    For each attribute value a_i of each attribute a:
      P̂(a_i | v_j) ← estimate P(a_i | v_j)
Classify_New_Instance(x):
  v_NB = argmax_{v_j in V} P̂(v_j) Π_i P̂(a_i | v_j)

Naïve Bayes: Example
Consider PlayTennis again, and the new instance <Outlook = sunny, Temp = cool, Humidity = high, Wind = strong>.
Want to compute: v_NB = argmax_{v_j in V} P(v_j) Π_i P(a_i | v_j)
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = .005
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = .021
So, v_NB = no

Naïve Bayes: Subtleties
1. The conditional independence assumption P(a_1, a_2, ..., a_n | v_j) = Π_i P(a_i | v_j) is often violated...
...but it works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(v_j|x) to be correct; we need only that
argmax_{v_j in V} P̂(v_j | a_1, a_2, ..., a_n) = argmax_{v_j in V} P(v_j) Π_i P(a_i | v_j)
See Domingos & Pazzani [1996] for analysis.
Naïve Bayes posteriors are often unrealistically close to 1 or 0.
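A compact sketch of Naive_Bayes_Learn and Classify_New_Instance; the tiny data set below is made up (not the full PlayTennis table) and the estimates are plain relative frequencies without smoothing:

```python
# Minimal Naive Bayes sketch: estimate P(v) and P(a|v) by relative frequencies,
# then classify by argmax_v P(v) * prod_i P(a_i|v).
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_dict, label)."""
    label_counts = Counter(label for _, label in examples)
    p_v = {v: n / len(examples) for v, n in label_counts.items()}
    counts = defaultdict(Counter)                 # counts[(attr, v)][value]
    for attrs, v in examples:
        for a, val in attrs.items():
            counts[(a, v)][val] += 1
    def p_a_given_v(a, val, v):
        return counts[(a, v)][val] / label_counts[v]
    return p_v, p_a_given_v

def classify(p_v, p_a_given_v, x):
    def score(v):
        prod = p_v[v]
        for a, val in x.items():
            prod *= p_a_given_v(a, val, v)
        return prod
    return max(p_v, key=score)

examples = [({"Outlook": "sunny", "Wind": "strong"}, "no"),
            ({"Outlook": "sunny", "Wind": "weak"}, "no"),
            ({"Outlook": "overcast", "Wind": "weak"}, "yes"),
            ({"Outlook": "rain", "Wind": "weak"}, "yes")]
p_v, p_a_given_v = naive_bayes_learn(examples)
print(classify(p_v, p_a_given_v, {"Outlook": "sunny", "Wind": "weak"}))  # 'no'
```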

Naïve Bayes: Subtleties
2. What if none of the training instances with target value v_j have attribute value a_i? Then
P̂(a_i | v_j) = 0, and P̂(v_j) Π_i P̂(a_i | v_j) = 0
The solution is a Bayesian (m-)estimate:
P̂(a_i | v_j) = (n_c + m p) / (n + m)
where
- n is the number of training examples for which v = v_j
- n_c is the number of examples for which v = v_j and a = a_i
- p is the prior estimate for P(a_i | v_j)
- m is the weight given to the prior (i.e., the number of "virtual" examples)
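A one-function sketch of the m-estimate; the counts, prior p, and weight m below are made-up illustration values:

```python
# Sketch of the m-estimate: P(a_i|v_j) = (n_c + m*p) / (n + m).

def m_estimate(n_c, n, p, m):
    """n_c: examples with v=v_j and a=a_i; n: examples with v=v_j;
    p: prior estimate for P(a_i|v_j); m: equivalent sample size."""
    return (n_c + m * p) / (n + m)

# No observed matches (n_c = 0): the estimate stays above zero instead of
# zeroing out the whole product P(v_j) * prod_i P(a_i|v_j).
print(m_estimate(n_c=0, n=9, p=1/3, m=3))   # 0.0833... rather than 0.0
```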