Bayesian Learning. Bayesian Learning Criteria


Bayesian Learning

In Bayesian learning, we are interested in the probability of a hypothesis h given the dataset D. By Bayes theorem:

P(h|D) = P(D|h) P(h) / P(D)

Other useful formulas to remember are:

Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Theorem of total probability: if A_1, ..., A_n are mutually exclusive with Σ_{i=1..n} P(A_i) = 1, then P(B) = Σ_{i=1..n} P(B|A_i) P(A_i)

Bayesian Learning Criteria

Different criteria for specifying the optimal hypothesis use Bayes theorem in different ways. Choose h to:

MAP (maximum a posteriori): maximize P(h|D)
ML (maximum likelihood): maximize P(D|h)
MDL (minimum description length): minimize the coding length of h plus the coding length of D assuming h

Bayes optimal classification of x ∈ X: choose the y that maximizes P(x, y).

Parameter estimation: P(x, y) has a known type of distribution, but its parameters are unknown.
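
To make the criteria concrete, here is a minimal Python sketch (not from the notes) over a small hypothetical hypothesis space with assumed priors P(h) and likelihoods P(D|h): ML picks the hypothesis with the largest likelihood, while MAP weights the likelihood by the prior.

    # Minimal sketch: MAP vs. ML over a hypothetical discrete hypothesis space.
    priors = {"h1": 0.7, "h2": 0.2, "h3": 0.1}          # P(h), assumed values
    likelihoods = {"h1": 0.10, "h2": 0.30, "h3": 0.40}  # P(D|h), assumed values

    # ML hypothesis: maximize P(D|h).
    h_ml = max(likelihoods, key=likelihoods.get)

    # MAP hypothesis: maximize P(h|D), which is proportional to P(D|h) * P(h).
    h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

    print("ML: ", h_ml)    # h3: largest likelihood
    print("MAP:", h_map)   # h1: its prior outweighs the smaller likelihood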

Maximum Likelihood and Least Squared Error

Suppose the outcomes are normally distributed around f(x):

p(y|x) = (1 / (√(2π) σ)) e^(−(f(x) − y)² / (2σ²))

h_ML = argmax_h p(D|h) = argmax_h Π_{i=1..m} p((x_i, y_i)|h)

Assume x is independent of h. Then (skipping steps):

= argmax_h Π_{i=1..m} p(y_i|h, x_i)
= argmax_h Π_{i=1..m} (1 / (√(2π) σ)) e^(−(h(x_i) − y_i)² / (2σ²))
= argmin_h Σ_{i=1..m} (h(x_i) − y_i)²

Maximum Likelihood and Probability Prediction

Assume Y = {0, 1} and f(x) = P(y = 1|x).

h_ML = argmax_h p(D|h) = argmax_h Π_{i=1..m} p((x_i, y_i)|h)

Assume x is independent of h. Then (skipping steps):

= argmax_h Π_{i=1..m} P(y_i|h, x_i)

P(y|h, x) is h(x) if y = 1, or 1 − h(x) if y = 0, so this becomes

= argmax_h Π_{i=1..m} h(x_i)^(y_i) (1 − h(x_i))^(1 − y_i)
= argmin_h −Σ_{i=1..m} [y_i ln h(x_i) + (1 − y_i) ln(1 − h(x_i))]

The last expression is called cross entropy. By a trick, we can convert this to least squares error again.
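
A small illustration with assumed predictions and targets (not from the notes): the two maximum-likelihood objectives above can be evaluated directly, the sum of squared errors for the Gaussian-noise case and the cross entropy for the probability-prediction case.

    import math

    # Regression case: under Gaussian noise, ML = minimize the sum of squared errors.
    preds_reg = [1.9, 3.1, 4.2]
    targets_reg = [2.0, 3.0, 4.0]
    sse = sum((h - y) ** 2 for h, y in zip(preds_reg, targets_reg))

    # Probability prediction: when h(x) estimates P(y=1|x) and y is 0/1,
    # ML = minimize the cross entropy.
    preds_prob = [0.9, 0.2, 0.7]
    targets_bin = [1, 0, 1]
    cross_entropy = -sum(y * math.log(h) + (1 - y) * math.log(1 - h)
                         for h, y in zip(preds_prob, targets_bin))

    print(sse, cross_entropy)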

Minimum Description Length Principle

h_MAP = argmax_h P(h|D)
      = argmax_h P(D|h) P(h) / P(D)
      = argmax_h P(D|h) P(h)
      = argmax_h [log2 P(D|h) + log2 P(h)]
      = argmin_h [−log2 P(D|h) − log2 P(h)]

MDL makes the following assumptions:

−log2 P(h) = description length of h in bits
−log2 P(D|h) = length of D assuming h = length of the outcomes y_1, ..., y_m assuming h

Bayes Optimal Classification

The previous criteria focus on the most probable hypothesis. BOC focuses on the most probable classification, where each hypothesis is weighted by its posterior probability:

argmax_y P((x, y)|D)

By the theorem of total probability:

= argmax_y Σ_h P((x, y)|h, D) P(h|D)

After skipping a few steps:

= argmax_y Σ_h P(y|h, x) P(h|D)
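
A minimal sketch of Bayes optimal classification, assuming a hypothetical three-hypothesis space where the posterior P(h|D) and the per-hypothesis predictions P(y|h, x) are given as tables; the posterior-weighted vote can disagree with the single MAP hypothesis.

    # Sketch of Bayes optimal classification (hypothetical numbers).
    # posterior[h] = P(h|D); vote[h][y] = P(y|h, x) for a fixed query x.
    posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    vote = {"h1": {"+": 1.0, "-": 0.0},
            "h2": {"+": 0.0, "-": 1.0},
            "h3": {"+": 0.0, "-": 1.0}}

    def bayes_optimal(posterior, vote):
        # Sum each class's probability over hypotheses, weighted by the posterior.
        scores = {y: sum(posterior[h] * vote[h][y] for h in posterior)
                  for y in ("+", "-")}
        return max(scores, key=scores.get), scores

    label, scores = bayes_optimal(posterior, vote)
    print(label, scores)  # "-" wins 0.6 to 0.4, although the MAP hypothesis h1 says "+"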

Gibbs Algorithm

BOC is too expensive because it requires a summation over the whole hypothesis space. A simpler alternative is the Gibbs algorithm: for each instance x, select a hypothesis h randomly according to P(h|D) and use that h to classify x. The Gibbs algorithm has, in expectation, at most twice the error rate of BOC. However, it is still a difficult algorithm to apply because it requires random sampling based on P(h|D).

Parameter Estimation

If the type of probability distribution is known or assumed, use D to estimate its parameters. Example: each class y follows a multivariate normal distribution,

p(x|y) = (1 / √((2π)^n |Σ|)) exp{−(x − m)^T Σ^(−1) (x − m) / 2}

where x is an n × 1 vector, m is the vector of means, Σ is an n × n covariance matrix, |Σ| is its determinant, and Σ^(−1) is its inverse. Use D to estimate m and Σ for each class.
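
A toy sketch of the Gibbs algorithm, assuming a hypothetical finite hypothesis space whose posterior P(h|D) is available as a table; in practice, sampling from P(h|D) is exactly the hard part.

    import random

    posterior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}         # P(h|D), assumed values
    hypotheses = {"h1": lambda x: "+" if x > 0 else "-",  # toy classifiers
                  "h2": lambda x: "+" if x > 1 else "-",
                  "h3": lambda x: "+"}

    def gibbs_classify(x):
        # Draw one hypothesis according to its posterior and let it classify x.
        names = list(posterior)
        h = random.choices(names, weights=[posterior[n] for n in names], k=1)[0]
        return hypotheses[h](x)

    print([gibbs_classify(0.5) for _ in range(5)])  # the answer varies with the draw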

Example Parameter Estimation: Iris2

Assume each class is generated by a normal distribution. The sample covariance is

σ²_{x_j x_k} = Σ_{x ∈ D} (x_j − µ_{x_j})(x_k − µ_{x_k}) / (|D| − 1)

Estimated parameters per class:

µ1 = [1.464 0.244]    Σ1 = [0.0301 0.0057
                            0.0057 0.0115]
µ2 = [4.260 1.326]    Σ2 = [0.2208 0.0731
                            0.0731 0.0391]
µ3 = [5.552 2.026]    Σ3 = [0.3046 0.0488
                            0.0488 0.0754]

[Figure: Iris2 Distributions]
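
A short numpy sketch of how such per-class estimates could be computed; the arrays below are placeholder measurements, not the actual Iris2 data.

    import numpy as np

    # Placeholder data: rows are [petal length, petal width], one array per class.
    class_data = {
        1: np.array([[1.4, 0.2], [1.5, 0.3], [1.3, 0.2], [1.6, 0.3]]),
        2: np.array([[4.0, 1.2], [4.5, 1.4], [4.3, 1.3], [4.2, 1.4]]),
    }

    for label, X in class_data.items():
        mu = X.mean(axis=0)              # sample mean vector
        sigma = np.cov(X, rowvar=False)  # sample covariance, divides by |D| - 1
        print(label, mu, sigma)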

Discriminant Functions

Discriminant functions d_i(x) for the classes in Y are designed so that the decision for x is the class argmax_i d_i(x), agreeing with argmax_y P(y|x). The decision boundary between two classes y_1 and y_2 is where d_1(x) = d_2(x). Typically,

d_i(x) = ln P(x|y_i) + ln P(y_i)

For normal distributions, d_i(x) is equal to:

−(1/2) ln |Σ_i| − (1/2) (x − µ_i)^T Σ_i^(−1) (x − µ_i) + ln P(y_i)

[Figure: Iris2 Decision Boundaries]
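
A numpy sketch of the quadratic discriminant above, plugging in the class-1 estimates from the Iris2 example; the uniform prior of 1/3 is an assumption for illustration.

    import numpy as np

    def discriminant(x, mu, sigma, prior):
        # d_i(x) = -1/2 ln|Sigma_i| - 1/2 (x - mu_i)^T Sigma_i^{-1} (x - mu_i) + ln P(y_i)
        diff = x - mu
        return (-0.5 * np.log(np.linalg.det(sigma))
                - 0.5 * diff @ np.linalg.inv(sigma) @ diff
                + np.log(prior))

    mu1 = np.array([1.464, 0.244])
    sigma1 = np.array([[0.0301, 0.0057], [0.0057, 0.0115]])
    print(discriminant(np.array([1.5, 0.25]), mu1, sigma1, 1 / 3))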

Naive Bayes Classifier

Naive Bayes classification is derived by:

y_MAP = argmax_y P(y|x) = argmax_y P(x|y) P(y) / P(x) = argmax_y P(x|y) P(y)

Now naively assume that the attributes are conditionally independent given the outcome. Let x = (x_1, ..., x_n); then

= argmax_y P(y) Π_{j=1..n} P(x_j|y) = argmax_y [log P(y) + Σ_{j=1..n} log P(x_j|y)]

Note that the result, in log form, is a linear function.

Estimating Probabilities

Let N = |D| and let N(condition) be the number of examples in D for which the condition is true.

P̂(y) = N(y) / N

P̂(x_j|y) = N(x_j ∧ y) / N(y) is problematic when N(x_j ∧ y) = 0. It is better to use Laplace's law of succession:

P̂(x_j|y) = (N(x_j ∧ y) + 1) / (N(y) + k), where x_j has k possible values.

Or a more sophisticated m-estimate:

P̂(x_j|y) = (N(x_j ∧ y) + m/k) / (N(y) + m)
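
A small sketch of the two smoothed estimators; the counts used (2 of the 9 "good" examples have Outlook = sunny, and outlook has 3 possible values) are inferred so that the result matches the 3/12 entry in the example that follows.

    def laplace(n_xy, n_y, k):
        # Laplace's law of succession: (N(x_j ^ y) + 1) / (N(y) + k)
        return (n_xy + 1) / (n_y + k)

    def m_estimate(n_xy, n_y, m, k):
        # m-estimate with a uniform prior 1/k over the k attribute values
        return (n_xy + m / k) / (n_y + m)

    print(laplace(2, 9, 3))        # 3/12 = 0.25
    print(m_estimate(2, 9, 3, 3))  # same value: with m = k this reduces to Laplace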

Naive Bayes Example: Learning

P(good) = 9/14    P(bad) = 5/14

Conditional probabilities (the denominators reflect Laplace's law of succession from the previous slide):

Attribute             P(x|good)   P(x|bad)
Outlook = sunny       3/12        4/8
Outlook = overcast    5/12        1/8
Outlook = rain        4/12        3/8
Temp = hot            3/12        3/8
Temp = mild           4/12        3/8
Temp = cool           5/12        2/8
Humidity = high       4/11        5/7
Humidity = normal     7/11        2/7
Windy = true          4/11        4/7
Windy = false         7/11        3/7

Naive Bayes Example: Classification

P(good|(rain, cool, normal, true)) ∝ (9/14)(4/12)(5/12)(7/11)(4/11) ≈ 0.0207
P(bad|(rain, cool, normal, true))  ∝ (5/14)(3/8)(2/8)(2/7)(4/7)     ≈ 0.0055
P(good|(rain, mild, high, false))  ∝ (9/14)(4/12)(4/12)(4/11)(7/11) ≈ 0.0165
P(bad|(rain, mild, high, false))   ∝ (5/14)(3/8)(3/8)(5/7)(3/7)     ≈ 0.0154
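
The first classification above can be reproduced directly from the tables; this sketch uses Python fractions to avoid rounding until the end.

    from fractions import Fraction as F

    prior = {"good": F(9, 14), "bad": F(5, 14)}
    cond = {
        "good": {"Outlook=rain": F(4, 12), "Temp=cool": F(5, 12),
                 "Humidity=normal": F(7, 11), "Windy=true": F(4, 11)},
        "bad":  {"Outlook=rain": F(3, 8), "Temp=cool": F(2, 8),
                 "Humidity=normal": F(2, 7), "Windy=true": F(4, 7)},
    }

    instance = ["Outlook=rain", "Temp=cool", "Humidity=normal", "Windy=true"]
    for y in ("good", "bad"):
        score = prior[y]
        for value in instance:
            score *= cond[y][value]
        print(y, float(score))  # good ~ 0.0207, bad ~ 0.0055, so predict "good"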

Comments on Multivariate Normal and Naive Bayes

Both are very efficient: they only need one pass through the examples.
Surprisingly, NB is often more effective; often several attributes give strong evidence for the class.
Numeric attributes should be discretized for NB.
MV can perform badly when its assumptions are wrong.
NB is easy to interpret: one can identify attributes that support and oppose a classification.
NB cannot model combinations of attributes properly.
NB has a linear decision boundary for nominal attributes; MV has a hyperquadric decision boundary.

Bayesian Networks

In Bayesian networks, the probability distribution is a product of conditional probabilities:

P(x_1, ..., x_n, y) = P(y|parents(y)) Π_{j=1..n} P(x_j|parents(x_j))

where parents(y) and parents(x_j) are subsets of the variables, and the parent-child relation forms a directed acyclic graph. The values of the parents should cause the value of the child.

Example Bayesian Network

Variables: Q = earthquake, B = burglary, A = house alarm, R = radio announcement of earthquake, C = phone call from neighbor. The network structure (read off the conditional tables below) is Q → R, Q → A, B → A, and A → C.

P(Q):  True 0.0001   False 0.9999
P(B):  True 0.001    False 0.999

P(A|Q, B):
  Q=True,  B=True:   True 0.999   False 0.001
  Q=True,  B=False:  True 0.8     False 0.2
  Q=False, B=True:   True 0.95    False 0.05
  Q=False, B=False:  True 0.001   False 0.999

P(R|Q):
  Q=True:   True 0.999    False 0.001
  Q=False:  True 0.0005   False 0.9995

P(C|A):
  A=True:   True 0.9    False 0.1
  A=False:  True 0.01   False 0.99

P(¬Q ∧ B ∧ R ∧ ¬A ∧ C) = P(¬Q) P(B) P(R|¬Q) P(¬A|¬Q ∧ B) P(C|¬A)
                       = (0.9999)(0.001)(0.0005)(0.05)(0.01)
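
A minimal sketch that evaluates this joint probability from the conditional tables, following the factorization of the previous slide.

    # Prior tables for Q and B; P(X=True | parents) for A, R, C, with
    # complements giving the False case.
    p_q = {True: 0.0001, False: 0.9999}                     # P(Q)
    p_b = {True: 0.001, False: 0.999}                       # P(B)
    p_a = {(True, True): 0.999, (True, False): 0.8,
           (False, True): 0.95, (False, False): 0.001}      # P(A=True | Q, B)
    p_r = {True: 0.999, False: 0.0005}                      # P(R=True | Q)
    p_c = {True: 0.9, False: 0.01}                          # P(C=True | A)

    def joint(q, b, a, r, c):
        pa, pr, pc = p_a[(q, b)], p_r[q], p_c[a]
        return (p_q[q] * p_b[b]
                * (pa if a else 1 - pa)
                * (pr if r else 1 - pr)
                * (pc if c else 1 - pc))

    # P(not Q, B, R, not A, C) = 0.9999 * 0.001 * 0.0005 * 0.05 * 0.01
    print(joint(q=False, b=True, a=False, r=True, c=True))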

Comments on Bayesian Networks

Variables can be nominal or numeric; numeric variables need a probability density function (such as a normal distribution). Inference algorithms can compute the probabilities of unknown variables from the known variables. In practice, this probability computation is usually efficient, but in theory it is not guaranteed to be (exact inference is NP-hard in general). Learning a Bayesian network depends on two conditions: is the network structure known or unknown, and are all the variables observable or are some hidden?

Bayes Risk Criterion

Choosing the most probable class might not be the best decision. Consider deciding whether to turn onto an access road based on the probability of an accident. Let L_ij be the loss of deciding class i when class j is true; L_ij is the extra cost or risk of being wrong. The Bayes risk criterion is to choose the class that minimizes the expected loss:

argmin_i Σ_j L_ij P(y = j|x)

Assuming two classes and L_ii = 0, choose class 1 if:

L_21 P(y = 1|x) > L_12 P(y = 2|x)

This is equivalent to:

L_21 P(x|y = 1) P(y = 1) > L_12 P(x|y = 2) P(y = 2)

0-1 loss is defined as L_ii = 0 and L_ij = 1 otherwise. Note that the Bayes risk criterion with 0-1 loss is equivalent to MAP.
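
A minimal sketch of the two-class risk decision with hypothetical posteriors and losses, showing how an asymmetric loss can overturn the most-probable-class choice.

    # Hypothetical posteriors and losses.
    p1, p2 = 0.7, 0.3      # P(y=1|x) and P(y=2|x)
    L12, L21 = 10.0, 1.0   # L12: loss of deciding 1 when 2 is true; L21: the reverse

    # Bayes risk rule: choose class 1 iff L21 * P(y=1|x) > L12 * P(y=2|x).
    decision = 1 if L21 * p1 > L12 * p2 else 2
    print(decision)  # 2: the large loss L12 overrules the higher posterior of class 1

    # With 0-1 loss the rule reduces to picking the most probable class (MAP).
    print(1 if p1 > p2 else 2)  # 1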