Bayesian Decision and Bayesian Learning

Bayesian Decision and Bayesian Learning
Ying Wu
Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu

Bayes Rule

Likelihood: $p(x \mid \omega_i)$
Prior: $p(\omega_i)$
Posterior: $p(\omega_i \mid x)$

Bayes rule:

$$p(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, p(\omega_i)}{p(x)} = \frac{p(x \mid \omega_i)\, p(\omega_i)}{\sum_i p(x \mid \omega_i)\, p(\omega_i)}$$

In other words: posterior $\propto$ likelihood $\times$ prior.
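To make the rule concrete, here is a minimal NumPy sketch; the class-conditional likelihoods and priors are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical class-conditional likelihoods p(x | omega_i) evaluated at one x,
# and class priors p(omega_i); the values are illustrative.
likelihood = np.array([0.30, 0.05, 0.10])   # p(x | omega_1..3)
prior      = np.array([0.50, 0.30, 0.20])   # p(omega_1..3)

evidence  = np.sum(likelihood * prior)      # p(x) = sum_i p(x | omega_i) p(omega_i)
posterior = likelihood * prior / evidence   # p(omega_i | x)

print(posterior, posterior.sum())           # posteriors sum to 1
```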

Outline

- Bayesian Decision Theory
- Bayesian Classification
- Maximum Likelihood Estimation and Learning
- Bayesian Estimation and Learning

Action and Risk

Classes: $\{\omega_1, \omega_2, \ldots, \omega_c\}$
Actions: $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$
Loss: $\lambda(\alpha_k \mid \omega_i)$

Conditional risk:

$$R(\alpha_k \mid x) = \sum_{i=1}^{c} \lambda(\alpha_k \mid \omega_i)\, p(\omega_i \mid x)$$

A decision function $\alpha(x)$ specifies a decision rule. The overall risk

$$R = \int_x R(\alpha(x) \mid x)\, p(x)\, dx$$

is the expected loss associated with a given decision rule.

Bayesian Decision and Bayesian Risk

Bayesian decision:

$$\alpha^{*} = \arg\min_{k} R(\alpha_k \mid x)$$

This leads to the minimum overall risk. (Why?)

Bayesian risk: the minimum overall risk,

$$R^{*} = \int_x R(\alpha^{*} \mid x)\, p(x)\, dx$$

The Bayesian risk is the best one can achieve.
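A minimal NumPy sketch of the minimum-conditional-risk decision; the loss matrix and posterior values below are made up for illustration.

```python
import numpy as np

# Hypothetical loss matrix lambda(alpha_k | omega_i): rows = actions, cols = classes.
loss = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [0.5, 0.5, 0.0]])

posterior = np.array([0.6, 0.3, 0.1])   # p(omega_i | x) for some observed x

cond_risk   = loss @ posterior          # R(alpha_k | x) = sum_i lambda(alpha_k | omega_i) p(omega_i | x)
best_action = np.argmin(cond_risk)      # Bayesian decision: minimize the conditional risk
print(cond_risk, best_action)
```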

Example: Minimum-Error-Rate Classification

Let's look at a specific example of a Bayesian decision. In classification problems, action $\alpha_k$ corresponds to deciding class $\omega_k$. Define the zero-one loss function

$$\lambda(\alpha_k \mid \omega_i) = \begin{cases} 0 & i = k \\ 1 & i \neq k \end{cases} \qquad i, k = 1, \ldots, c$$

This means no loss for correct decisions, and all errors are equally costly. It is easy to see that the conditional risk equals the error rate:

$$R(\alpha_k \mid x) = \sum_{i \neq k} P(\omega_i \mid x) = 1 - P(\omega_k \mid x)$$

So the Bayesian decision rule becomes minimum-error-rate classification:

Decide $\omega_k$ if $P(\omega_k \mid x) > P(\omega_i \mid x)$ for all $i \neq k$.
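A small sketch (illustrative posterior values) verifying that under zero-one loss the minimum-risk action is the maximum-posterior class.

```python
import numpy as np

posterior = np.array([0.6, 0.3, 0.1])     # p(omega_i | x), illustrative numbers

# Zero-one loss: 0 on the diagonal, 1 elsewhere.
c = len(posterior)
loss = 1.0 - np.eye(c)

cond_risk = loss @ posterior              # R(alpha_k | x) = 1 - P(omega_k | x)
assert np.allclose(cond_risk, 1.0 - posterior)

# Minimizing the risk is the same as maximizing the posterior.
assert np.argmin(cond_risk) == np.argmax(posterior)
```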

Outline

- Bayesian Decision Theory
- Bayesian Classification
- Maximum Likelihood Estimation and Learning
- Bayesian Estimation and Learning

Classifier and Discriminant Functions

Discriminant functions: $g_i(x)$, $i = 1, \ldots, c$; the classifier assigns $\omega_i$ to $x$ if

$$g_i(x) > g_j(x) \quad \forall j \neq i$$

Examples:
- $g_i(x) = P(\omega_i \mid x)$
- $g_i(x) = p(x \mid \omega_i)\, P(\omega_i)$
- $g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$

Note: the choice of discriminant function is not unique, but different choices may give equivalent classification results.

Decision region: the partition of the feature space, $x \in \mathcal{R}_i$ if $g_i(x) > g_j(x)$ for all $j \neq i$.

Decision boundary: the surface separating the decision regions, where the largest discriminant functions tie, i.e., $g_i(x) = g_j(x)$.
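In code, such a classifier is simply an argmax over discriminant functions. A minimal sketch, where the two univariate Gaussian classes, their priors, and the helper `log_gauss` are made up for illustration:

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the class with the largest discriminant value g_i(x)."""
    scores = np.array([g(x) for g in discriminants])
    return int(np.argmax(scores))

# Tiny usage with the log form g_i(x) = ln p(x | omega_i) + ln P(omega_i),
# using two made-up univariate Gaussian classes (means 0 and 3, unit variance).
def log_gauss(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

gs = [lambda x: log_gauss(x, 0.0, 1.0) + np.log(0.7),
      lambda x: log_gauss(x, 3.0, 1.0) + np.log(0.3)]
print(classify(1.0, gs))   # prints 0: the first class wins at x = 1.0
```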

Multivariate Gaussian Distribution

$$p(x) = N(\mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

[Figure: contours of a 2-D Gaussian in the $(x_1, x_2)$ plane with principal axes $p_1$ and $p_2$.]

The principal axes (the directions) are given by the eigenvectors of the covariance matrix $\Sigma$; the lengths of the axes (the uncertainty) are given by its eigenvalues.
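A minimal sketch evaluating this density and the eigendecomposition of $\Sigma$; the mean and covariance are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density N(x; mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Illustrative 2-D example: principal axes = eigenvectors, axis lengths ~ eigenvalues.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(Sigma)                # columns of eigvecs are the principal axes
print(gaussian_pdf(np.array([1.0, 0.5]), mu, Sigma), eigvals)
```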

Mahalanobis Distance

The Mahalanobis distance is a normalized distance:

$$\|x - \mu\|_m = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

[Figure: two points $p_1$ and $p_2$ on the same density contour; their Euclidean distances to $\mu$ differ, but their Mahalanobis distances are equal, $\|p_1 - \mu\|_m = \|p_2 - \mu\|_m$.]
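A small sketch (covariance and points chosen for illustration) showing two points with different Euclidean but equal Mahalanobis distances from the mean.

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance sqrt((x - mu)^T Sigma^{-1} (x - mu))."""
    diff = x - mu
    return np.sqrt(diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

p1 = np.array([np.sqrt(3.0), np.sqrt(3.0)])   # along the long principal axis
p2 = np.array([1.0, -1.0])                    # along the short principal axis

# Different Euclidean distances, equal Mahalanobis distances (same density contour).
print(np.linalg.norm(p1 - mu), np.linalg.norm(p2 - mu))
print(mahalanobis(p1, mu, Sigma), mahalanobis(p2, mu, Sigma))
```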

Whitening

To refresh your memory: a linear transformation of a Gaussian is still a Gaussian. If $p(x) = N(\mu, \Sigma)$ and $y = A^T x$, then $p(y) = N(A^T \mu, A^T \Sigma A)$.

Question: find a transform $A_w$ such that the covariance of $y = A_w^T x$ becomes the identity matrix (i.e., each component has equal uncertainty).

[Figure: the whitening transform maps the elliptical contours of $x$ to circular contours of $y$.]

Whitening is a transform that decouples the correlation:

$$A_w = U^T \Lambda^{-\frac{1}{2}}, \quad \text{where } \Sigma = U^T \Lambda U$$

Prove it: $A_w^T \Sigma A_w = I$.
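A minimal sketch of the whitening transform under the slide's convention $\Sigma = U^T \Lambda U$; the covariance and sample size are illustrative.

```python
import numpy as np

Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

# Eigendecomposition written as Sigma = U^T Lambda U (rows of U are eigenvectors).
eigvals, eigvecs = np.linalg.eigh(Sigma)
U = eigvecs.T                                   # so that Sigma = U.T @ diag(eigvals) @ U
A_w = U.T @ np.diag(eigvals ** -0.5)            # whitening transform A_w = U^T Lambda^{-1/2}

# The whitened covariance is the identity: A_w^T Sigma A_w = I.
print(np.round(A_w.T @ Sigma @ A_w, 6))

# Empirical check: whiten correlated samples and re-estimate the covariance.
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=5000)   # rows are samples
y = x @ A_w                                     # y = A_w^T x, in row convention
print(np.round(np.cov(y.T), 2))                 # approximately the identity
```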

Discriminant Functions for Gaussian Densities

Minimum-error-rate classifier:

$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i)$$

When the class-conditional densities are Gaussian, $p(x \mid \omega_i) = N(\mu_i, \Sigma_i)$, it is easy to see that

$$g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$$

Case I: $\Sigma_i = \sigma^2 I$

$$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P(\omega_i) = -\frac{1}{2\sigma^2}\left[ x^T x - 2\mu_i^T x + \mu_i^T \mu_i \right] + \ln P(\omega_i)$$

Notice that $x^T x$ is the same for every class, so it can be dropped. Equivalently,

$$g_i(x) = \left[ \frac{1}{\sigma^2} \mu_i \right]^T x + \left[ -\frac{1}{2\sigma^2} \mu_i^T \mu_i + \ln P(\omega_i) \right]$$

This is a linear discriminant function:

$$g_i(x) = W_i^T x + W_{i0}$$

The decision boundary $g_i(x) = g_j(x)$ is therefore linear:

$$W^T (x - x_0) = 0, \quad \text{where } W = \mu_i - \mu_j$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \ln\frac{P(\omega_i)}{P(\omega_j)}\, (\mu_i - \mu_j)$$
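A minimal sketch of Case I for two classes; the means, variance, and priors are made-up numbers. It checks that a point on the hyperplane $W^T(x - x_0) = 0$ has equal discriminant values.

```python
import numpy as np

# Case I: shared spherical covariance sigma^2 I; illustrative parameters.
mu_i, mu_j = np.array([1.0, 1.0]), np.array([-1.0, 0.0])
sigma2 = 0.5
P_i, P_j = 0.6, 0.4

def g(x, mu, prior):
    # Linear discriminant g(x) = (mu / sigma^2)^T x - mu^T mu / (2 sigma^2) + ln P
    return (mu / sigma2) @ x - mu @ mu / (2 * sigma2) + np.log(prior)

# Decision boundary parameters: W^T (x - x0) = 0
W = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / np.dot(W, W) * np.log(P_i / P_j) * W

# A point on the boundary (x0 plus a vector orthogonal to W) has equal discriminants.
x_on_boundary = x0 + np.array([-W[1], W[0]])
print(np.isclose(g(x_on_boundary, mu_i, P_i), g(x_on_boundary, mu_j, P_j)))   # True
```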

To See It Clearly...

Consider the specific case where $P(\omega_i) = P(\omega_j)$. The decision boundary becomes

$$(\mu_i - \mu_j)^T \left( x - \frac{\mu_i + \mu_j}{2} \right) = 0$$

What does it mean? The boundary is the perpendicular bisector of the line segment joining the two Gaussian means!

What if $P(\omega_i) \neq P(\omega_j)$?

Case II: $\Sigma_i = \Sigma$

$$g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i)$$

Similarly, we have an equivalent linear form:

$$g_i(x) = (\Sigma^{-1} \mu_i)^T x + \left( -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i) \right)$$

The discriminant function and decision boundary are still linear:

$$W^T (x - x_0) = 0, \quad \text{where } W = \Sigma^{-1} (\mu_i - \mu_j)$$

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln P(\omega_i) - \ln P(\omega_j)}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)}\, (\mu_i - \mu_j)$$

Note: compared with Case I, the Euclidean distance is replaced by the Mahalanobis distance. The boundary is still linear, but the hyperplane is no longer orthogonal to $\mu_i - \mu_j$.

Case III: $\Sigma_i$ arbitrary

$$g_i(x) = x^T A_i x + W_i^T x + W_{i0}$$

where

$$A_i = -\frac{1}{2} \Sigma_i^{-1}, \qquad W_i = \Sigma_i^{-1} \mu_i, \qquad W_{i0} = -\frac{1}{2} \mu_i^T \Sigma_i^{-1} \mu_i - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$$

Note: the decision boundary is no longer linear; it is a hyperquadric.
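A minimal sketch of the general quadratic discriminant; the two classes' means, covariances, and priors are illustrative.

```python
import numpy as np

def quadratic_discriminant(mu, Sigma, prior):
    """Return g_i(x) = x^T A x + W^T x + W0 for one Gaussian class (Case III)."""
    Sigma_inv = np.linalg.inv(Sigma)
    A = -0.5 * Sigma_inv
    W = Sigma_inv @ mu
    W0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return lambda x: x @ A @ x + W @ x + W0

# Illustrative two-class problem with different covariances (numbers made up).
g1 = quadratic_discriminant(np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5)
g2 = quadratic_discriminant(np.array([2.0, 0.0]), np.array([[3.0, 0.5], [0.5, 1.0]]), 0.5)

x = np.array([1.0, 1.0])
print("decide class", 1 if g1(x) > g2(x) else 2)
```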

Outline

- Bayesian Decision Theory
- Bayesian Classification
- Maximum Likelihood Estimation and Learning
- Bayesian Estimation and Learning

Learning

Learning means training, i.e., estimating some unknowns from training samples.

Why? Because it is often very difficult to specify these unknowns by hand; hopefully, they can be recovered from the examples we are given.

Maximum Likelihood (ML) Estimation

Collected samples: $D = \{x_1, x_2, \ldots, x_n\}$

Estimate the unknown parameters $\Theta$ so that the data likelihood is maximized.

Data likelihood:

$$p(D \mid \Theta) = \prod_{k=1}^{n} p(x_k \mid \Theta)$$

Log likelihood:

$$L(\Theta) = \ln p(D \mid \Theta) = \sum_{k=1}^{n} \ln p(x_k \mid \Theta)$$

ML estimation:

$$\hat{\Theta} = \arg\max_{\Theta} p(D \mid \Theta) = \arg\max_{\Theta} L(\Theta)$$

Example I: Gaussian Densities (unknown $\mu$)

$$\ln p(x_k \mid \mu) = -\frac{1}{2} \ln\left( (2\pi)^d |\Sigma| \right) - \frac{1}{2} (x_k - \mu)^T \Sigma^{-1} (x_k - \mu)$$

Its partial derivative is

$$\frac{\partial \ln p(x_k \mid \mu)}{\partial \mu} = \Sigma^{-1} (x_k - \mu)$$

Setting the gradient of the log likelihood to zero gives

$$\sum_{k=1}^{n} \Sigma^{-1} (x_k - \hat{\mu}) = 0$$

so the ML estimate of $\mu$ is

$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k$$

This is exactly what we do in practice.

Example II: Gaussian Densities (unknown $\mu$ and $\Sigma$)

Let's do the univariate case first, writing the variance as $\sigma^2$:

$$\ln p(x_k \mid \mu, \sigma^2) = -\frac{1}{2} \ln(2\pi\sigma^2) - \frac{(x_k - \mu)^2}{2\sigma^2}$$

The partial derivatives are

$$\frac{\partial \ln p(x_k \mid \mu, \sigma^2)}{\partial \mu} = \frac{x_k - \mu}{\sigma^2}, \qquad \frac{\partial \ln p(x_k \mid \mu, \sigma^2)}{\partial \sigma^2} = -\frac{1}{2\sigma^2} + \frac{(x_k - \mu)^2}{2\sigma^4}$$

Setting the summed derivatives to zero,

$$\sum_{k=1}^{n} \frac{x_k - \hat{\mu}}{\hat{\sigma}^2} = 0, \qquad \sum_{k=1}^{n} \left\{ -\frac{1}{\hat{\sigma}^2} + \frac{(x_k - \hat{\mu})^2}{\hat{\sigma}^4} \right\} = 0$$

we obtain

$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2$$

which generalizes to the multivariate case as

$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^T$$
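A minimal sketch of the ML estimates from synthetic samples; the true parameters are made up for illustration.

```python
import numpy as np

# ML estimates of a Gaussian's mean and covariance from i.i.d. samples.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=2000)   # rows are samples x_k

mu_hat = X.mean(axis=0)                      # (1/n) sum_k x_k
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)           # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^T
                                             # note the 1/n of the ML estimate, not 1/(n-1)
print(mu_hat, Sigma_hat, sep="\n")           # close to the true parameters
```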

Outline

- Bayesian Decision Theory
- Bayesian Classification
- Maximum Likelihood Estimation and Learning
- Bayesian Estimation and Learning

Bayesian Estimation

Collect samples $D = \{x_1, x_2, \ldots, x_n\}$, drawn independently from a fixed but unknown distribution $p(x)$.

Bayesian estimation uses $D$ to determine $p(x \mid D)$, i.e., to learn a p.d.f.

The unknown $\Theta$ is a random variable (or random vector) drawn from a prior $p(\Theta)$; $p(x)$ is unknown, but it has a known parametric form with parameter $\Theta$, i.e., $p(x \mid \Theta)$. We hope $p(\Theta)$ is sharply peaked at the true value.

Differences from ML:
- In Bayesian estimation, $\Theta$ is not a single value but a random vector, and we need to recover the distribution of $\Theta$ rather than a point estimate.
- $p(x \mid D)$ also needs to be estimated.

Two Problems in Bayesian Learning

It is clear from the total probability rule that

$$p(x \mid D) = \int p(x, \Theta \mid D)\, d\Theta = \int p(x \mid \Theta)\, p(\Theta \mid D)\, d\Theta$$

so $p(x \mid D)$ is a weighted average of $p(x \mid \Theta)$ over all $\Theta$. If $p(\Theta \mid D)$ peaks very sharply about some value $\hat{\Theta}$, then $p(x \mid D)$ can be approximated by $p(x \mid \hat{\Theta})$.

[Figure: graphical model illustrating how the observations $D$ are generated.]

The two problems are:
- estimating $p(\Theta \mid D)$
- estimating $p(x \mid D)$

Example: The Univariate Gaussian Case, $p(\mu \mid D)$

Assume $\mu$ is the only unknown, and that it has a known Gaussian prior

$$p(\mu) = N(\mu_0, \sigma_0^2)$$

i.e., $\mu_0$ is our best prior guess of $\mu$, and $\sigma_0$ is its uncertainty. Assume a Gaussian likelihood, $p(x \mid \mu) = N(\mu, \sigma^2)$.

It is clear that

$$p(\mu \mid D) \propto p(D \mid \mu)\, p(\mu) = \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)$$

where $p(x_k \mid \mu) = N(\mu, \sigma^2)$ and $p(\mu) = N(\mu_0, \sigma_0^2)$.

Let's prove that $p(\mu \mid D)$ is still a Gaussian density. (Why?) Hint:

$$p(\mu \mid D) \propto \exp\left\{ -\frac{1}{2}\left[ \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \right)\mu^2 - 2\left( \frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2} \right)\mu \right] \right\}$$

What is Going On?

Since $D$ is a collection of $n$ samples, write the posterior as $p(\mu \mid D) = N(\mu_n, \sigma_n^2)$, and denote the sample mean by $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$.

$\mu_n$ represents our best guess for $\mu$ after observing $n$ samples, and $\sigma_n^2$ measures the uncertainty of this guess.

So what is really going on here? We can obtain the following (prove it!):

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\, \hat{\mu}_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\, \mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2}$$

Here $\hat{\mu}_n$ is the data-fidelity term, $\mu_0$ is the prior, and $\mu_n$ is a tradeoff (a weighted average) between them.

Example: The Univariate Gaussian Case, $p(x \mid D)$

After obtaining the posterior $p(\mu \mid D)$, we can estimate $p(x \mid D)$:

$$p(x \mid D) = \int p(x \mid \mu)\, p(\mu \mid D)\, d\mu$$

This is the convolution of two Gaussian distributions, so you can easily prove (do it!) that

$$p(x \mid D) = N(\mu_n, \sigma^2 + \sigma_n^2)$$
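A minimal sketch of these two results, the posterior $N(\mu_n, \sigma_n^2)$ and the predictive $N(\mu_n, \sigma^2 + \sigma_n^2)$; the true mean, known variance, and prior are illustrative.

```python
import numpy as np

# Bayesian learning of a Gaussian mean with known variance sigma^2.
rng = np.random.default_rng(0)
true_mu, sigma2 = 3.0, 1.0
mu0, sigma0_2 = 0.0, 4.0                         # prior p(mu) = N(mu0, sigma0^2)

x = rng.normal(true_mu, np.sqrt(sigma2), size=50)
n, xbar = len(x), x.mean()                       # sample mean mu_hat_n

# Posterior p(mu | D) = N(mu_n, sigma_n^2)
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * xbar \
     + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)

# Predictive density p(x | D) = N(mu_n, sigma^2 + sigma_n^2)
pred_var = sigma2 + sigma_n2
print(mu_n, sigma_n2, pred_var)
```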

What is Really Going On?

Let's study $\sigma_n$: $\sigma_n^2$ decreases monotonically with $n$, approaching $\sigma^2 / n \to 0$ as $n \to \infty$. Each additional observation decreases the uncertainty of the estimate.

Let's study $p(x \mid D)$: since $p(x \mid D) = N(\mu_n, \sigma^2 + \sigma_n^2)$, the posterior $p(\mu \mid D)$ becomes more and more sharply peaked as $n$ grows.

Let's discuss:
- If $\sigma_0 = 0$, then what?
- If $\sigma_0 \gg \sigma$, then what?

Example: The Multivariate Case

We can generalize the univariate case to the multivariate Gaussian, with $p(x \mid \mu) = N(\mu, \Sigma)$ and prior $p(\mu) = N(\mu_0, \Sigma_0)$:

$$\mu_n = \Sigma_0 \left( \Sigma_0 + \tfrac{1}{n}\Sigma \right)^{-1} \hat{\mu}_n + \tfrac{1}{n}\Sigma \left( \Sigma_0 + \tfrac{1}{n}\Sigma \right)^{-1} \mu_0$$

$$\Sigma_n = \tfrac{1}{n}\Sigma_0 \left( \Sigma_0 + \tfrac{1}{n}\Sigma \right)^{-1} \Sigma$$

where $\hat{\mu}_n$ is the sample mean. Actually, this is the best linear unbiased estimate (BLUE). In addition,

$$p(x \mid D) = N(\mu_n, \Sigma + \Sigma_n)$$
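A minimal sketch of the multivariate update with a known $\Sigma$; the prior, covariance, and data-generating mean are illustrative.

```python
import numpy as np

# Multivariate Bayesian update of a Gaussian mean (known Sigma).
rng = np.random.default_rng(1)
Sigma  = np.array([[1.0, 0.3], [0.3, 2.0]])      # known class covariance
mu0    = np.zeros(2)                             # prior mean
Sigma0 = np.eye(2) * 5.0                         # prior covariance

X = rng.multivariate_normal([2.0, -1.0], Sigma, size=100)
n, mu_hat = len(X), X.mean(axis=0)               # sample mean mu_hat_n

M = np.linalg.inv(Sigma0 + Sigma / n)            # (Sigma0 + Sigma/n)^{-1}
mu_n    = Sigma0 @ M @ mu_hat + (Sigma / n) @ M @ mu0
Sigma_n = (Sigma0 @ M @ Sigma) / n

# Predictive density p(x | D) = N(mu_n, Sigma + Sigma_n)
print(mu_n, Sigma_n, sep="\n")
```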

Recursive Learning

Bayesian estimation can be done recursively, i.e., by updating the previous estimate with new data. Denote $D^n = \{x_1, x_2, \ldots, x_n\}$.

It is easy to see that $p(D^n \mid \Theta) = p(x_n \mid \Theta)\, p(D^{n-1} \mid \Theta)$, and $p(\Theta \mid D^0) = p(\Theta)$.

The recursion is:

$$p(\Theta \mid D^n) = \frac{p(x_n \mid \Theta)\, p(\Theta \mid D^{n-1})}{\int p(x_n \mid \Theta)\, p(\Theta \mid D^{n-1})\, d\Theta}$$

So we start from $p(\Theta)$, then move on to $p(\Theta \mid x_1)$, $p(\Theta \mid x_1, x_2)$, and so on.
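A minimal sketch of this recursion for the Gaussian-mean example above, where each conjugate update turns the previous posterior into the new prior; the numbers are illustrative.

```python
import numpy as np

# Recursive Bayesian learning of a Gaussian mean with known variance sigma^2.
rng = np.random.default_rng(2)
true_mu, sigma2 = 3.0, 1.0
mu_post, var_post = 0.0, 4.0                 # start from the prior p(mu) = N(0, 4)

for x_n in rng.normal(true_mu, np.sqrt(sigma2), size=50):
    # Conjugate update: combine the current N(mu_post, var_post) with the
    # likelihood N(x_n; mu, sigma^2); the posterior becomes the next prior.
    var_new = 1.0 / (1.0 / var_post + 1.0 / sigma2)
    mu_post = var_new * (mu_post / var_post + x_n / sigma2)
    var_post = var_new

print(mu_post, var_post)                     # posterior mean approaches true_mu, variance shrinks
```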