Machine Learning for Signal Processing Bayes Classification

Machine Learning for Signal Processing: Bayes Classification. Class 16, 24 Oct 2017. Instructors: Bhiksha Raj, Abelino Jimenez

Recap: KNN. A very effective and simple way of performing classification. Simple model: for any instance, select the class from the instances close to it in feature space.

Multi-class Image Classification

k-nearest Neighbors. Given a query item: find the k closest matches in a labeled dataset.

k-nearest Neighbors. Given a query item: find the k closest matches and return the most frequent label.

k-nearest Neighbors: with k = 3, the votes are for cat.

k-nearest Neighbors: 2 votes for cat, 1 each for buffalo, deer, and lion; cat wins.

Nearest neighbor method. Majority vote within the k nearest neighbors: $\hat{Y}(x) = \frac{1}{k}\sum_{x_i \in N_k(x)} y_i$. [Figure: for the new point, k = 1 gives blue, k = 3 gives green.]

But what happens if... Majority vote on the nearest neighbors: $\hat{y} = \arg\max_y \mathrm{count}(y \mid x) = \arg\max_y N_y(x)/N(x)$, which is an estimate of $\arg\max_y P(y \mid x)$.

But what happens if... $\hat{y} = \arg\max_y P(y \mid x)$: the Bayes Classification Rule.

Bayes Classification Rule. For any observed feature X, select the class value Y that is most frequent. This also applies to continuous-valued predicted variables, i.e. regression. Select Y to maximize the a posteriori probability P(Y|X): Bayes classification is an instance of maximum a posteriori estimation.

Bayes classification. What happens if there are no exact neighbors, i.e. no training instances with exactly the same X value?

Bayes Classification Rule. Given a set of classes $\mathcal{C} = \{C_1, C_2, \ldots, C_N\}$ and conditional probability distributions $P(C \mid X)$, classification is performed as $\hat{C} = \arg\max_{C \in \mathcal{C}} P(C \mid X)$. Problem: it is generally almost impossible to define $P(C \mid X)$ for every value of X, and this form cannot account for even known changes in the relative proportions of the classes. Alternate formulation: $\hat{C} = \arg\max_{C \in \mathcal{C}} P(C)\,P(X \mid C)$. With appropriate care, $P(X \mid C)$ can be estimated even for never-seen values of X. Challenge: how do we represent $P(X \mid C)$?
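
As an illustration (not from the slides), a minimal Python sketch of the alternate formulation, assuming the priors and the class-conditional likelihood functions have already been estimated; all names are hypothetical:

```python
from scipy.stats import norm

def bayes_classify(x, priors, likelihoods):
    """Return the class maximizing P(C) * P(x | C).

    priors: dict mapping class -> P(C);
    likelihoods: dict mapping class -> callable returning P(x | C).
    """
    scores = {c: priors[c] * likelihoods[c](x) for c in priors}
    return max(scores, key=scores.get)

# Hypothetical two-class, one-dimensional example with Gaussian class-conditionals
priors = {"C1": 0.7, "C2": 0.3}
likelihoods = {"C1": lambda x: norm.pdf(x, loc=0.0, scale=1.0),
               "C2": lambda x: norm.pdf(x, loc=2.0, scale=1.0)}
print(bayes_classify(1.0, priors, likelihoods))  # "C1": the likelihoods tie at x = 1, so the prior decides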

The Bayesian Classifier. $\hat{C} = \arg\max_{C \in \mathcal{C}} P(C \mid X)$: choose the class that is most frequent for the given X. Since $P(C \mid X) = \frac{P(C)\,P(X \mid C)}{P(X)}$, we have $\arg\max_{C \in \mathcal{C}} P(C \mid X) = \arg\max_{C \in \mathcal{C}} P(C)\,P(X \mid C)$: choose the class that is most likely to have produced X, while accounting for the relative frequency of C.

Bayes Classification Rule. Given a set of classes $\mathcal{C} = \{C_1, C_2, \ldots, C_N\}$: $\hat{C} = \arg\max_{C \in \mathcal{C}} P(C)\,P(X \mid C)$. $P(X \mid C_i)$ measures the probability that a random instance of class $C_i$ will take the value X. [Figure: the class-conditional densities $P(X \mid C_1)$ and $P(X \mid C_2)$ plotted against X.]

Bayes Classification Rule. $P(C_i)$ scales the class-conditional densities up to match the expected relative proportions of the classes. [Figure: the scaled curves $P(C_1)P(X \mid C_1)$ and $P(C_2)P(X \mid C_2)$, with the decision boundary at their crossing point.]

Bayes Classification Rule. The area of $P(C_1)P(X \mid C_1)$ on the wrong side of the decision boundary is the fraction of all instances that belong to $C_1$ and are misclassified; likewise for $C_2$. The sum of the two areas is the total misclassification probability.

Bayes Classification Rule. The Bayes classification rule is the statistically optimal classification rule: moving the boundary in either direction will always increase the classification error. [Figure: a shifted boundary adds an excess classification error region.]

Modeling $P(X \mid C)$. Challenge: how to learn $P(X \mid C)$? It will not be known beforehand and must be learned from examples of X that belong to class C. It will generally have an unknown, and unknowable, shape, and we only observe samples of X, so we must make some assumptions about the form of $P(X \mid C)$.

The problem of dependent variables. $P(X \mid C) = P(X_1, X_2, \ldots, X_D \mid C)$ must be defined for every combination of $X_1, X_2, \ldots, X_D$: too many parameters, and most combinations are unseen in the training data. $P(X \mid C)$ may have an arbitrary scatter/shape that is hard to characterize mathematically.

The Naïve Bayes assumption. [Figure: graphical models relating class C to features $X_1, \ldots, X_5$, with and without the independence assumption.] Assume all the components are independent of one another given the class; the joint probability is then the product of the marginals: $P(X \mid C) = P(X_1, X_2, \ldots, X_D \mid C) = \prod_i P(X_i \mid C)$. It is sufficient to learn the marginal distributions $P(X_i \mid C)$; the problem of having to observe all combinations of $X_1, X_2, \ldots, X_D$ never arises.

Naïve Bayes: estimating $P(X_i \mid C)$. $P(X_i \mid C)$ may be estimated using conventional maximum likelihood estimation: given a number of training instances belonging to class C, select the i-th component of all instances and estimate $P(X_i \mid C)$ from them. For discrete-valued $X_i$ this will be a multinomial distribution; for continuous-valued $X_i$ a form must be assumed, e.g. Gaussian, Laplacian, etc.
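
A hedged sketch of this per-feature estimation for the continuous case, assuming each $P(X_i \mid C)$ is modeled as a Gaussian; the function and variable names are illustrative:

```python
import numpy as np

def fit_gaussian_naive_bayes(X, y):
    """ML estimates of P(C) and of each per-class, per-feature Gaussian P(X_i | C).

    X: (N, D) real-valued features, y: (N,) class labels.
    Returns {class: (prior, means, variances)}, with means/variances of shape (D,).
    """
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return model

def log_class_conditional(x, means, variances):
    """sum_i log N(x_i; mean_i, var_i): the Naive Bayes product in log form.
    Classification adds log P(C) and takes the argmax over classes."""
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + (x - means) ** 2 / variances)
```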

Naïve Bayes: the binary case.
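
The equations for the binary case are not shown in the transcript; a minimal sketch under the usual Bernoulli model, where $P(X_i = 1 \mid C)$ is estimated as the (smoothed) fraction of class-C training instances with $X_i = 1$:

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """X: (N, D) array of 0/1 features, y: (N,) labels, alpha: Laplace smoothing."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        p = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)   # P(X_i = 1 | C)
        model[c] = (len(Xc) / len(X), p)                        # (P(C), per-feature p)
    return model

def predict_bernoulli_nb(model, x):
    """Pick argmax_C  log P(C) + sum_i log P(x_i | C)."""
    def score(c):
        prior, p = model[c]
        return np.log(prior) + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return max(model, key=score)
```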

The problem of dependent variables (recap). $P(X \mid C) = P(X_1, X_2, \ldots, X_D \mid C)$ must be defined for every combination of $X_1, X_2, \ldots, X_D$: too many parameters, and most combinations are unseen in the training data. $P(X \mid C)$ may have an arbitrary scatter/shape that is hard to characterize mathematically.

Gaussian Distribution. Parameters: a mean vector and a covariance matrix (symmetric, positive definite).
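
For reference, the density of a multivariate Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$ (presumably the formula on this slide) is:

$$
P(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\exp\!\left(-\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)
$$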

Gaussian Distribution.

Parameter Estimation.
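
The estimates themselves are not shown in the transcript; the standard maximum likelihood estimates from training samples $x_1, \ldots, x_N$ of a class are:

$$
\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
\hat{\Sigma} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \hat{\mu})(x_n - \hat{\mu})^{T}
$$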

Gaussian classifier: different classes, different Gaussians.

Gaussian Classifier. For each class we need a mean vector and a covariance matrix. Training: fit a Gaussian to each class. Classification: pick the class maximizing $P(C)\,P(X \mid C)$. Problem: many parameters to train!
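
A minimal sketch of such a classifier (illustrative names; scipy is assumed for the Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_classifier(X, y):
    """Fit a prior, mean vector, and covariance matrix for each class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # np.cov uses the unbiased (N-1) normalization rather than the ML (N) one
        model[c] = (len(Xc) / len(X), Xc.mean(axis=0), np.cov(Xc, rowvar=False))
    return model

def predict_gaussian(model, x):
    """argmax_C  log P(C) + log N(x; mu_C, Sigma_C)."""
    def score(c):
        prior, mu, cov = model[c]
        return np.log(prior) + multivariate_normal.logpdf(x, mean=mu, cov=cov)
    return max(model, key=score)
```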

Homo-skedastic Gaussians. Decision boundary: the values of x where we cannot decide between the classes.

Homo-skedastic Gaussians: a linear boundary!

Homo-skedastic Gaussians.

Homo-skedastic Gaussians. Classification is given by the rule on the next slide.

Homo-skedastic Gaussians. The discriminant reduces to a Mahalanobis distance; therefore, a Gaussian classifier with a common covariance matrix is similar to a nearest neighbor classifier: classification corresponds to choosing the nearest mean vector.
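
A hedged sketch of the discriminant this reduces to, assuming equal class priors (with unequal priors a $\log P(C)$ term is added):

$$
\hat{C} = \arg\min_{C}\; (x-\mu_C)^{T}\Sigma^{-1}(x-\mu_C)
        = \arg\max_{C}\;\left(\mu_C^{T}\Sigma^{-1}x - \tfrac{1}{2}\mu_C^{T}\Sigma^{-1}\mu_C\right)
$$

The right-hand side is linear in x, which is why the boundary between any two classes is linear.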

How to estimate the covariance matrix?
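
The estimator on the slide is not shown in the transcript; a common choice, presumably what is intended here, is the pooled within-class covariance:

$$
\hat{\Sigma} = \frac{1}{N}\sum_{C}\;\sum_{n:\,y_n = C} (x_n - \hat{\mu}_C)(x_n - \hat{\mu}_C)^{T}
$$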

Hetero-skedastic Gaussians: different covariance matrices. 1-D case, K = 2: the decision boundary is quadratic!
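
For reference, with class-specific covariances the log of the scoring function becomes (a standard expansion, not copied from the slide):

$$
\log\big(P(C)\,P(x \mid C)\big) = \log P(C) - \tfrac{1}{2}\log|\Sigma_C|
 - \tfrac{1}{2}(x-\mu_C)^{T}\Sigma_C^{-1}(x-\mu_C) - \tfrac{D}{2}\log 2\pi
$$

The term quadratic in x no longer cancels between classes when the $\Sigma_C$ differ, hence the quadratic decision boundary.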

Hetero-skedastic Gaussians.

GMM classifier: for each class, train a GMM (with EM), and classify according to $\hat{C} = \arg\max_C P(C)\,P(X \mid C)$, with the GMM providing $P(X \mid C)$.
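
A sketch using scikit-learn's EM-based GaussianMixture, one mixture per class; the number of components and all names are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_classifier(X, y, n_components=4):
    """Train one GMM per class with EM, and record the class priors."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X), GaussianMixture(n_components=n_components).fit(Xc))
    return model

def predict_gmm(model, x):
    """argmax_C  log P(C) + log P(x | C), with P(x | C) given by the class GMM."""
    x = np.atleast_2d(x)
    def score(c):
        prior, gmm = model[c]
        return np.log(prior) + gmm.score_samples(x)[0]
    return max(model, key=score)
```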

Estimating P(C): count class frequencies, possibly on held-out data; the proportions may change with the problem or test setup. For binary problems: detection error tradeoff (DET) curves, equal error rate (EER), and choosing an operating point.

MAP regression: MAP estimation of continuous variables. MAP for Gaussians is linear regression; MAP estimation of more complex problems is hard or impossible.

MAP Estimation. A maximum likelihood estimator maximizes the likelihood $P(X \mid \theta)$; a maximum a posteriori estimator maximizes the posterior $P(\theta \mid X) \propto P(X \mid \theta)\,P(\theta)$.

MAP estimation of continuous variables: an example. We want to estimate the mean, but we have prior knowledge about it, given as a prior distribution.

MAP estimation of continuous variables: an example. MAP maximizes the posterior, equivalently minimizing the negative log-posterior.

MAP estimation of continuous variables. What is a good prior? Analyze the support of the distribution.

MAP estimation of continuous variables. We can consider a Beta prior: the Beta distribution is the conjugate prior of the Bernoulli distribution.
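
Presumably the missing expressions are the standard Beta-Bernoulli conjugacy relations; with k ones observed in N Bernoulli draws of parameter p:

$$
P(p) = \mathrm{Beta}(p;\,\alpha,\beta) \propto p^{\alpha-1}(1-p)^{\beta-1}
\quad\Longrightarrow\quad
P(p \mid \text{data}) = \mathrm{Beta}(p;\,\alpha + k,\;\beta + N - k)
$$

The MAP estimate of p is then the posterior mode, $(\alpha + k - 1)/(\alpha + \beta + N - 2)$.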

Probabilistic Linear Regression.
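
The slide's model is not shown in the transcript; a common formulation of probabilistic linear regression, presumably what is intended here, is:

$$
y = w^{T}x + \epsilon,\quad \epsilon \sim \mathcal{N}(0,\sigma^{2})
\quad\Longrightarrow\quad P(y \mid x, w) = \mathcal{N}(y;\, w^{T}x,\, \sigma^{2})
$$

Under this model, maximum likelihood estimation of w reduces to least squares, and placing a prior on w gives the MAP estimates discussed next.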

MAP estimate: priors. [Figure: left, a Gaussian prior on W; right, a Laplacian prior.]

MAP estimate of weights: loss + regularization (the Gaussian-prior case).

MAP estimate of weights. Equivalent to diagonal loading of the correlation matrix: this improves the condition number of the correlation matrix, so it can be inverted with greater stability, and it will not affect the estimation from well-conditioned data. Also called Tikhonov regularization; its dual form is ridge regression. Note this is a MAP estimate of the weights, not to be confused with a MAP estimate of Y.
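
A minimal sketch of the diagonally loaded solution (ridge/Tikhonov); the regularization weight and names are illustrative:

```python
import numpy as np

def ridge_weights(X, y, lam=1e-2):
    """MAP/ridge estimate of regression weights under a Gaussian prior.

    X: (N, D) design matrix, y: (N,) targets, lam: regularization strength.
    Solves (X^T X + lam * I) w = X^T y; the added diagonal (the "loading")
    improves the condition number of X^T X before inversion.
    """
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```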

MAP estimate of weights: loss + regularization (the Laplacian-prior case).

MAP estimation of weights with a Laplacian prior. Assume the weights are drawn from a Laplacian, $P(a) = \lambda^{-1}\exp(-\lambda^{-1}\|a\|_1)$. The maximum a posteriori estimate is $\hat{a} = \arg\max_a \left[ C' - (y - a^{T}X)^{T}(y - a^{T}X) - \lambda^{-1}\|a\|_1 \right]$. There is no closed-form solution; a quadratic programming solution is required, which is non-trivial.

MAP estimation of weights with a Laplacian prior. The MAP estimate $\hat{a} = \arg\max_a \left[ C' - (y - a^{T}X)^{T}(y - a^{T}X) - \lambda^{-1}\|a\|_1 \right]$ is identical to $L_1$-regularized least-squares estimation.

$L_1$-regularized LSE. $\hat{a} = \arg\max_a \left[ C' - (y - a^{T}X)^{T}(y - a^{T}X) - \lambda^{-1}\|a\|_1 \right]$: no closed-form solution; quadratic programming solutions are required. Dual formulation: $\hat{a} = \arg\max_a \left[ C' - (y - a^{T}X)^{T}(y - a^{T}X) \right]$ subject to $\|a\|_1 \le t$. This is the LASSO: least absolute shrinkage and selection operator.
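
A sketch of the $L_1$-regularized estimate using scikit-learn's coordinate-descent Lasso solver rather than the quadratic-programming route mentioned above; the data here are synthetic and illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data: 100 instances, 20 features, only 3 informative weights
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.05).fit(X, y)   # alpha plays the role of the L1 penalty weight
print(np.round(lasso.coef_, 2))       # most coefficients are driven exactly to zero
```

Most of the estimated coefficients come out exactly zero, the selection behavior the next slides illustrate.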

LASSO algorithms: various convex optimization algorithms, e.g. LARS (least angle regression) and pathwise coordinate descent. Matlab code is available on the web.

Regularized least squares. [Image credit: Tibshirani.] Regularization results in the selection of a suboptimal (in the least-squares sense) solution, one of the loci outside the center. Tikhonov regularization selects the shortest solution; $L_1$ regularization selects the sparsest solution.

Predicting time series requires time-series models: HMMs, later in the course.