Why Bayesian? 3. Bayes and Normal Models

Transcription:

Why Bayesian? 3. Bayes and Normal Models. Alex M. Martinez, alex@ece.osu.edu. Handouts for ECE 874, Sp 2007.

Why Bayesian? If all our research in PR was to disappear and you could only save one theory, which one would you save? Bayesian theory is probably the most important one you should keep. It is simple, intuitive and optimal. Reverend Bayes (1763) and Laplace (1812) set the foundations of what we now know as Bayes theory.

State of nature: class. A sample usually corresponds to a state of nature; e.g. salmon and sea bass. The state of nature usually corresponds to a set of discrete categories (classes). Note that the continuous case also exists. Priors: some classes might occur more often or might be more important, P(w_i).

Decision rule. We need a decision rule to help us determine to which class a testing vector belongs. Simplest (useless): C = argmax_i P(w_i). Posterior probability: P(w_i | x), where x is the (observed) data. Obviously, we do not have P(w_i | x), but we can estimate p(x | w_i) and P(w_i).

Bayes Theorem (yes, the famous one): P(w_i | x) = p(x | w_i) P(w_i) / p(x), with p(x) = sum_{j=1}^{c} p(x | w_j) P(w_j). Bayes decision rule: decide w_i with i = argmax_i P(w_i | x); the normalization by p(x) guarantees 0 <= P(w_i | x) <= 1.

Rev. Thomas Bayes (1702-1761). During his lifetime, Bayes was a defender of Isaac Newton's calculus, and developed several important results of which the Bayes Theorem is his most known and, arguably, most elegant. This theorem and the subsequent development of Bayesian theory are among the most relevant topics in pattern recognition and have found applications in almost every corner of the scientific world. Bayes himself did not, however, provide the derivation of the Bayes Theorem as it is now known to us; Bayes developed the method for uniform priors. This result was later extended by Laplace and contemporaries. Nonetheless, Bayes is generally acknowledged as the first to have established a mathematical basis for probability inference.
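A minimal sketch of the posterior computation and decision rule above, assuming NumPy and two illustrative one-dimensional Gaussian class-conditional densities (the means, variances and priors are made up for the example):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density used as the class-conditional p(x | w_i)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def bayes_decide(x, mus, sigmas, priors):
    """Bayes' theorem followed by the Bayes decision rule (argmax of the posterior)."""
    likelihoods = np.array([gaussian_pdf(x, m, s) for m, s in zip(mus, sigmas)])
    evidence = np.sum(likelihoods * priors)        # p(x) = sum_j p(x | w_j) P(w_j)
    posteriors = likelihoods * priors / evidence   # P(w_i | x)
    return int(np.argmax(posteriors)), posteriors

# Two classes (think salmon vs. sea bass) described by a single feature.
label, post = bayes_decide(1.2, mus=[0.0, 2.0], sigmas=[1.0, 1.0],
                           priors=np.array([0.7, 0.3]))
print(label, post)   # the posteriors sum to 1, as guaranteed by the denominator p(x)
```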

Multiple random variables. To be mathematically precise, one should write p_x(x | w_i) instead of p(x | w_i), because this probability density function depends on a single random variable. In general there is no need for the distinction (e.g., p_x and p_y). Should this arise, we will use the above notation.

Loss function and decision risk. The loss function states exactly how costly each action is, and is used to convert a probability determination into a decision. Classes: {w_1, ..., w_c}; actions: {a_1, ..., a_a}; loss function: λ(a_i | w_j), the cost (risk) of taking action a_i when the true class is w_j (see Appendix A.4). Conditional risk: R(a_i | x) = sum_{j=1}^{c} λ(a_i | w_j) P(w_j | x).

Bayes decision rule. Given the conditional risk R(a_i | x), the Bayes decision rule is: decide a = argmin_i R(a_i | x). The resulting minimum overall risk is called the Bayes risk.

A simple example. Two-class classifier: R(a_1 | x) = λ_11 P(w_1 | x) + λ_12 P(w_2 | x) and R(a_2 | x) = λ_21 P(w_1 | x) + λ_22 P(w_2 | x). Decision rule: decide w_1 if R(a_1 | x) < R(a_2 | x), or equivalently if (λ_21 - λ_11) P(w_1 | x) > (λ_12 - λ_22) P(w_2 | x). Applying Bayes: (λ_21 - λ_11) p(x | w_1) P(w_1) > (λ_12 - λ_22) p(x | w_2) P(w_2). Notation: p(x | w_1) / p(x | w_2) > [(λ_12 - λ_22) / (λ_21 - λ_11)] P(w_2) / P(w_1) = threshold (a likelihood-ratio test).

Feature space: geometry. When x is in R^d, we have our d-dimensional feature space. Sometimes this feature space is considered to be a Euclidean space, but as we will see many other alternatives exist. This allows for the study of PR problems from a geometric point of view. This is key to many algorithms.
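A sketch of the conditional risk and the two-class likelihood-ratio test above; the loss matrix values are invented for illustration, and lam[i, j] stands for λ(a_i | w_j):

```python
import numpy as np

# lam[i, j] = lambda(a_i | w_j): cost of choosing class w_{i+1} when the truth is w_{j+1}
lam = np.array([[0.0, 5.0],
                [1.0, 0.0]])

def min_risk_decision(posteriors, lam):
    """Conditional risks R(a_i | x) = sum_j lambda(a_i | w_j) P(w_j | x); pick the minimum."""
    risks = lam @ posteriors
    return int(np.argmin(risks)), risks

def likelihood_ratio_decision(p1, p2, P1, P2, lam):
    """Two-class shortcut: decide w_1 when p(x|w_1)/p(x|w_2) exceeds the threshold."""
    threshold = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * (P2 / P1)
    return 0 if p1 / p2 > threshold else 1

print(min_risk_decision(np.array([0.3, 0.7]), lam))
print(likelihood_ratio_decision(p1=0.2, p2=0.1, P1=0.5, P2=0.5, lam=lam))
```

Both routines give the same decision; the likelihood-ratio form simply folds the priors and losses into a single threshold.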

Discriminant functions. We can construct a set of discriminant functions g_i(x), i = 1, ..., c. We classify a feature vector as w_i if g_i(x) > g_j(x) for all j != i. The Bayes classifier is g_i(x) = -R(a_i | x). If errors are to be minimized, one needs to minimize the probability of error: the zero-one loss, λ(a_i | w_j) = 0 if i = j and 1 otherwise, gives minimum-error-rate classification. If we use Bayes and minimum-error-rate classification, we get g_i(x) = P(w_i | x) = p(x | w_i) P(w_i) / sum_{j=1}^{c} p(x | w_j) P(w_j). Since the denominator is the same for every class (a constant), it can be dropped, and sometimes we find it more convenient to write this equation as g_i(x) = ln p(x | w_i) + ln P(w_i).

Geometry (key point): the goal is to divide the feature space into c decision regions, R_1, ..., R_c. Classification is also known as hypothesis testing. Key: the effect of any decision rule is to divide the feature space into decision regions. (Slide figure: symbolic representation of the decision regions.)

Other criteria. In some applications the priors are not known. In this case, we usually attempt to minimize the worst overall risk. Two approaches for that are the minimax and the Neyman-Pearson criteria.
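A short check, with illustrative posterior values, that the zero-one loss turns risk minimization into posterior maximization, as stated above:

```python
import numpy as np

c = 3
zero_one = 1.0 - np.eye(c)              # lambda(a_i | w_j) = 0 if i == j, 1 otherwise
posteriors = np.array([0.2, 0.5, 0.3])  # P(w_j | x), assumed already computed

risks = zero_one @ posteriors           # R(a_i | x) = sum_{j != i} P(w_j | x) = 1 - P(w_i | x)
assert np.allclose(risks, 1.0 - posteriors)
assert np.argmin(risks) == np.argmax(posteriors)   # min risk <=> max posterior
```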

Normal distributions and Bayes. So far we have used p(x | w_i) and P(w_i) to specify the decision boundaries of a Bayes classifier. The Normal distribution is the most typical PDF for p(x | w_i); recall the central limit theorem.

Central limit theorem (simplified). Assume that the random variables X_1, ..., X_n are iid, each with finite mean and variance. When n goes to infinity, the standardized sum converges to a normal distribution (see Stark & Woods).

Univariate case. The Gaussian distribution is p(x) = 1/(sqrt(2π) σ) exp(-(x - μ)² / (2σ²)), with x ~ N(μ, σ²).

Multivariate case (d > 1). p(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-½ (x - μ)ᵀ Σ⁻¹ (x - μ)), with x ~ N(μ, Σ), where μ = E(x) = ∫ x p(x) dx and Σ = E((x - μ)(x - μ)ᵀ) = ∫ (x - μ)(x - μ)ᵀ p(x) dx.
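A sketch of evaluating the multivariate density above with NumPy; the test point, mean and covariance are made up:

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """p(x) = (2 pi)^(-d/2) |Sigma|^(-1/2) exp(-1/2 (x - mu)^T Sigma^{-1} (x - mu))."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                    # quadratic form
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(mvn_pdf(np.array([0.5, 0.5]), mu, Sigma))
```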

Distances. The general distance in a space is given by d²(x, μ) = (x - μ)ᵀ Σ⁻¹ (x - μ), where Σ is the covariance matrix of the distribution (or data). If Σ = I then the above equation becomes the Euclidean (norm) distance. If the distribution is Normal, this distance is called the Mahalanobis distance.

Example (2-D Normals). (Slide figure: heteroscedastic vs. homoscedastic case.)

Moments of the estimates. In statistics the estimates are generally known as the moments of the data. The first moment is the sample mean. The second is the sample autocorrelation matrix: S = (1/n) sum_{i=1}^{n} x_i x_iᵀ.

Central moments. The variance and the covariance matrix are special cases, because they depend on the mean of the data, which is unknown. Usually we solve that by using the sample mean: Σ̂ = (1/n) sum_{i=1}^{n} (x_i - μ̂)(x_i - μ̂)ᵀ. This is the sample covariance matrix.

Whitening transformation. Recall, it is sometimes convenient to represent the data in a space where its sample covariance matrix equals the identity matrix I: AᵀΣA = I.

Linear transformations. An n-dimensional vector X can be transformed linearly to another, Y, as Y = AᵀX. The mean is then M_Y = E(Y) = AᵀM_X, and the covariance Σ_Y = AᵀΣ_X A. The order of the distances in the transformed space is identical to the one in the original space.
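A sketch of the sample moments and the Mahalanobis distance just defined; the synthetic data and the mixing matrix are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0],
                                          [0.8, 0.5]])   # 500 correlated 2-D samples

mu_hat = X.mean(axis=0)                                   # first moment: sample mean
S = X.T @ X / len(X)                                      # sample autocorrelation matrix
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)        # sample covariance matrix

def mahalanobis2(x, mu, Sigma):
    """d^2(x, mu) = (x - mu)^T Sigma^{-1} (x - mu); reduces to Euclidean when Sigma = I."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

print(mahalanobis2(np.array([1.0, 1.0]), mu_hat, Sigma_hat))
```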

Orthonormal transformation. Eigenanalysis: ΣΦ = ΦΛ, with eigenvectors Φ = [φ_1, ..., φ_p] and eigenvalues Λ = diag(λ_1, ..., λ_p). The transformation is then Y = ΦᵀX, and Σ_Y = ΦᵀΣΦ = Λ (recall that Φ is orthonormal, ΦᵀΦ = ΦΦᵀ = I).

Whitening. To obtain a covariance matrix equal to the identity matrix we can apply the orthogonal transformation first and then normalize the result with Λ^(-1/2): Y = Λ^(-1/2) ΦᵀX, so that Σ_Y = Λ^(-1/2) ΦᵀΣΦ Λ^(-1/2) = Λ^(-1/2) Λ Λ^(-1/2) = I.

Properties. Whitening transformations are not orthogonal transformations; therefore, Euclidean distances are not preserved. After whitening, the covariance matrix is invariant to any orthogonal transformation: ΨᵀIΨ = I.

Simultaneous diagonalization. It is often the case that two or more covariance matrices need to be diagonalized. Assume Σ_1 and Σ_2 are two covariance matrices. Our goal is to have AᵀΣ_1A = I and AᵀΣ_2A = Λ. Homework: find the algorithm.
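A sketch of whitening through the eigen-decomposition described above; the synthetic data are arbitrary, and the final print should show (approximately) the identity matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3)) @ np.array([[1.0, 0.4, 0.0],
                                           [0.0, 0.7, 0.2],
                                           [0.0, 0.0, 0.3]])
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / len(Xc)                       # sample covariance

evals, Phi = np.linalg.eigh(Sigma)                # Sigma Phi = Phi Lambda
A = Phi @ np.diag(evals ** -0.5)                  # whitening matrix: A^T Sigma A = I
Y = Xc @ A                                        # rows of Y are the whitened samples

print(np.round(Y.T @ Y / len(Y), 3))              # ~ identity matrix
```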

Some advantages. Algorithms usually become much simpler after diagonalization or whitening: the general distance becomes a simple Euclidean distance, and whitened data is invariant to other orthogonal transformations. Some algorithms require whitening to have certain properties (we will see this later in the course).

Discriminant functions for Normal PDFs. The discriminant function g_i(x) = ln p(x | w_i) + ln P(w_i), for the Normal density N(μ_i, Σ_i), is:
g_i(x) = -½ (x - μ_i)ᵀ Σ_i⁻¹ (x - μ_i) - (d/2) ln 2π - ½ ln |Σ_i| + ln P(w_i).

Possible scenarios (or assumptions): sometimes we might be able to assume Σ_i = σ²I. A more general case is when all covariance matrices are identical, i.e. homoscedastic. The most complex case is when Σ_i is arbitrary, that is, heteroscedastic.

Case Σ_i = σ²I: with equal priors, the Bayes boundary is a (d-1)-dimensional hyperplane perpendicular to the line that passes through the two means.

Homoscedastic case (Σ_i = Σ): g_i(x) = -½ (x - μ_i)ᵀ Σ⁻¹ (x - μ_i) + ln P(w_i), i.e. the (squared) Mahalanobis distance to each mean plus the log prior.
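A sketch of the Normal discriminant functions above in the general (heteroscedastic) case; the example means, covariances and priors are invented:

```python
import numpy as np

def g(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - d/2 ln(2 pi) - 1/2 ln|Sigma| + ln P(w_i)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

def classify(x, mus, Sigmas, priors):
    """Assign x to the class with the largest discriminant value."""
    return int(np.argmax([g(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]))

mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[1.5, 0.4],
                               [0.4, 0.7]])]       # different covariances: heteroscedastic
print(classify(np.array([1.0, 1.5]), mus, Sigmas, priors=[0.5, 0.5]))
```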

Heteroscedastic case (Σ_i arbitrary). In the two-class case, the decision surface is a hyperquadric (e.g. hyperplanes, hyperspheres, hyperhyperboloids, etc.). These decision boundaries may not be connected. Any hyperquadric can be represented by two Gaussian distributions.

Project #1.
1. Implement these three cases using Matlab (see pp. 36-41 for details); 2-D and/or 3-D plots.
2. Generalize the algorithm to more than two classes/Gaussians.
3. Simulate different Gaussians and distinct priors.

Bayes is optimal. P(error) = P(x in R_2, w_1) + P(x in R_1, w_2) = ∫_{R_2} p(x | w_1) P(w_1) dx + ∫_{R_1} p(x | w_2) P(w_2) dx. If our goal is to minimize the classification error, then Bayes is optimal (you cannot do better than Bayes, ever). In general, if p(x | w_1) P(w_1) > p(x | w_2) P(w_2), it is preferable to classify x in w_1 so that the smaller integral contributes to the error (see next slide). This is what Bayes does; there is no possible smaller error.
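An illustrative check (written in Python rather than the Matlab requested by the project) that the Bayes rule attains the error given by the overlap integral above; all parameters are made up:

```python
import numpy as np

P1, P2 = 0.6, 0.4
mu1, mu2, sigma = 0.0, 2.0, 1.0
pdf = lambda x, mu: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Monte Carlo: sample labels from the priors, features from p(x | w_i), apply Bayes.
rng = np.random.default_rng(2)
n = 200_000
is_w2 = rng.random(n) < P2
x = rng.normal(np.where(is_w2, mu2, mu1), sigma)
decide_w2 = pdf(x, mu2) * P2 > pdf(x, mu1) * P1
print("simulated error:  ", np.mean(decide_w2 != is_w2))

# Numerical integration of min[p(x|w_1)P(w_1), p(x|w_2)P(w_2)] over a fine grid.
grid = np.linspace(-8.0, 10.0, 20001)
overlap = np.minimum(pdf(grid, mu1) * P1, pdf(grid, mu2) * P2)
print("integrated error: ", np.sum(overlap) * (grid[1] - grid[0]))
```

The two printed values should agree closely, illustrating that no decision rule can undercut the overlap of the weighted class-conditional densities.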

The multiclass case: how to calculate the error? P(correct) = sum_{i=1}^{C} P(x in R_i, w_i) = sum_{i=1}^{C} ∫_{R_i} p(x | w_i) P(w_i) dx. Bayes yields the smallest error, but what is the actual error? The above equation cannot be readily computed, because the regions R_i may be very complex.

Error bounds. Several approximations are easier to compute (usually upper bounds): the Chernoff bound and the Bhattacharyya bound (which assumes the pdfs are homoscedastic). These bounds can be applied to the two-class case only. For this, we need an integral equation that we can solve. For example, P(error) = ∫ P(error | x) p(x) dx, where P(error | x) = P(w_1 | x) if we decide w_2, and P(w_2 | x) if we decide w_1. Or, we can also write P(error) = ∫ min[p(x | w_1) P(w_1), p(x | w_2) P(w_2)] dx. Since it is known that min(a, b) <= a^s b^(1-s) for 0 <= s <= 1, we can now write:
P(error) <= P(w_1)^s P(w_2)^(1-s) ∫ p(x | w_1)^s p(x | w_2)^(1-s) dx.

Chernoff bound. If the conditional probabilities are normal, we can solve this analytically:
∫ p(x | w_1)^s p(x | w_2)^(1-s) dx = e^(-k(s)), where
k(s) = [s(1-s)/2] (μ_2 - μ_1)ᵀ [sΣ_1 + (1-s)Σ_2]⁻¹ (μ_2 - μ_1) + ½ ln( |sΣ_1 + (1-s)Σ_2| / (|Σ_1|^s |Σ_2|^(1-s)) ).

Bhattacharyya bound. When the data is homoscedastic, Σ_1 = Σ_2, the optimal solution is s = 1/2; this is the Bhattacharyya bound. A tighter bound is the asymptotic nearest-neighbor error, which is derived from
P(error) = ∫ [p(x | w_1) P(w_1) p(x | w_2) P(w_2)] / [p(x | w_1) P(w_1) + p(x | w_2) P(w_2)] dx.
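A sketch of the Chernoff and Bhattacharyya bounds for two Gaussian classes, using the expression for k(s) above; the means, covariances and priors are invented, and the Chernoff bound is found here by a simple grid search over s:

```python
import numpy as np

def k(s, mu1, mu2, S1, S2):
    """Chernoff exponent k(s) for two Gaussian class-conditional densities."""
    dm = mu2 - mu1
    Ss = s * S1 + (1 - s) * S2
    quad = 0.5 * s * (1 - s) * dm @ np.linalg.solve(Ss, dm)
    logdet = 0.5 * np.log(np.linalg.det(Ss)
                          / (np.linalg.det(S1) ** s * np.linalg.det(S2) ** (1 - s)))
    return quad + logdet

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S1 = np.eye(2)
S2 = np.array([[1.5, 0.3],
               [0.3, 0.8]])
P1 = P2 = 0.5

s_grid = np.linspace(0.01, 0.99, 99)
chernoff = min(P1 ** s * P2 ** (1 - s) * np.exp(-k(s, mu1, mu2, S1, S2)) for s in s_grid)
bhatta = np.sqrt(P1 * P2) * np.exp(-k(0.5, mu1, mu2, S1, S2))
print("Chernoff bound:     ", chernoff)   # tightest s on the grid
print("Bhattacharyya bound:", bhatta)     # the s = 1/2 special case
```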

Closing notes. Bayes is important because it minimizes the probability of error; in that sense we say it is optimal. Unfortunately, Bayes assumes that the conditional densities and priors are known (or can be estimated), which is not necessarily true. In general, not even the form of these probabilities is known. Most PR approaches attempt to solve these shortcomings. This is, in fact, what most of PR is all about.

On the plus side: a simple example. We want to predict whether a student will pass a test or not. Y = 1 denotes pass, Y = 0 failure. The observation is a single random variable X which specifies the hours of study. Let P(Y = 1 | X = x) be an increasing function of x governed by a parameter c. Then the Bayes classifier is g(x) = 1 if P(Y = 1 | X = x) >= 1/2, i.e. x >= c, and g(x) = 0 otherwise.

Optional homework. Using Matlab, generate n observations of P(Y = 1 | X = x) and P(Y = 0 | X = x). Approximate each using a Gaussian distribution. Calculate the Bayes decision boundary and the classification error. Select several arbitrary values for c and see how well you can approximate them.

Hints. The pointwise error is min[P(Y = 1 | X = x), P(Y = 0 | X = x)]. Plot the original distributions to help you.
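A sketch of the pass/fail example, assuming for concreteness the posterior P(Y = 1 | X = x) = x/(x + c), which crosses 1/2 exactly at x = c; this particular functional form is an assumption, not taken from the slides:

```python
import numpy as np

c = 3.0                                   # hours of study at which pass/fail is 50/50
x = np.linspace(0.0, 12.0, 601)           # hours of study
p_pass = x / (x + c)                      # assumed posterior P(Y = 1 | X = x)
g = (p_pass >= 0.5).astype(int)           # Bayes prediction: pass iff x >= c

pointwise_error = np.minimum(p_pass, 1.0 - p_pass)    # min[P(Y=1|x), P(Y=0|x)]
print("decision boundary near x =", x[np.argmax(g)])  # approximately c
print("largest pointwise error:  ", pointwise_error.max())
```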