Machine Learning for Signal Processing
Bayes Classification
Class 16, 24 Oct 2017
Instructors: Bhiksha Raj, Abelino Jimenez
11755/18797
Recap: KNN
A simple and very effective way of performing classification.
Simple model: for any test instance, select the class from the labeled instances close to it in feature space.
Multi-class Image Classification

k-nearest Neighbors
Given a query item:
- Find the k closest matches in a labeled dataset
- Return the most frequent label among them
Figures: a query image and its k nearest labeled neighbors. In one example there are 3 votes for cat; in another, 2 votes for cat and 1 each for buffalo, deer, and lion. Cat wins in both.
Nearest neighbor method
Majority vote within the k nearest neighbors:
$$y(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
Figure: the label assigned to the new point depends on k (k = 1: blue; k = 3: green).
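As a concrete sketch of this voting rule, a minimal NumPy implementation (the function and variable names are illustrative, not from the course materials):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=3):
    """X_train: (n, D) array; y_train: (n,) array of labels.
    Return the majority label among the k nearest training points."""
    # Euclidean distance from the query to every training instance
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest matches
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```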
But what happens if..
Majority vote on the nearest neighbors of a test point amounts to
$$\hat{y} = \arg\max_y \; \mathrm{count}(y \mid x)$$
which, normalized by the neighborhood size, is
$$\hat{y} = \arg\max_y \; \frac{N_y(x)}{N(x)}$$
and, in the limit of many training instances, becomes
$$\hat{y} = \arg\max_y \; P(y \mid x)$$
This is the Bayes Classification Rule.
Bayes Classification Rule
For any observed feature X, select the class value Y that is most frequent given that X.
- Also applies to continuous-valued predicted variables, i.e. regression.
- Select Y to maximize the a posteriori probability $P(Y \mid X)$.
- Bayes classification is an instance of maximum a posteriori estimation.
Bayes classification
But what happens if there are no exact neighbors, i.e. no training instances with exactly the same X value as the new point?
Bayes Classification Rule
Given a set of classes $\mathcal{C} = \{C_1, C_2, \ldots, C_N\}$ and conditional probability distributions $P(C \mid X)$, classification is performed as
$$\hat{C} = \arg\max_{C \in \mathcal{C}} P(C \mid X)$$
Problem: it is generally almost impossible to define $P(C \mid X)$ for every value of X, and such a model cannot account for even known changes in the relative proportions of the classes.
Alternate formulation:
$$\hat{C} = \arg\max_{C \in \mathcal{C}} P(C)\, P(X \mid C)$$
With appropriate care, $P(X \mid C)$ can be estimated even for never-seen values of X.
Challenge: how do we represent $P(X \mid C)$?
The Bayesian Classifier..
$$\hat{C} = \arg\max_{C \in \mathcal{C}} P(C \mid X)$$
Choose the class that is most frequent for the given X. By Bayes' rule,
$$P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)}$$
and since $P(X)$ does not depend on the class,
$$\arg\max_{C \in \mathcal{C}} P(C \mid X) = \arg\max_{C \in \mathcal{C}} P(C)\, P(X \mid C)$$
Choose the class that is most likely to have produced X, while accounting for the relative frequency of C.
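A quick numeric illustration with invented numbers (not from the slides): suppose $P(C_1) = 0.7$, $P(C_2) = 0.3$, and for the observed X, $P(X \mid C_1) = 0.2$, $P(X \mid C_2) = 0.6$. Then $P(C_1)P(X \mid C_1) = 0.14 < P(C_2)P(X \mid C_2) = 0.18$, so $\hat{C} = C_2$: the rarer class wins because it is far more likely to have produced this X.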
Bayes Classification Rule
Given a set of classes $\mathcal{C} = \{C_1, C_2, \ldots, C_N\}$:
$$\hat{C} = \arg\max_{C \in \mathcal{C}} P(C)\, P(X \mid C)$$
$P(X \mid C_i)$ measures the probability that a random instance of class $C_i$ will take the value X.
Figure: the class-conditional densities $P(X \mid C_1)$ and $P(X \mid C_2)$ plotted against X.
Bayes Classification Rule
$$\hat{C} = \arg\max_{C \in \mathcal{C}} P(C)\, P(X \mid C)$$
$P(C_i)$ scales the class-conditional densities up to match the expected relative proportions of the classes.
Figure: the scaled curves $P(C_1)P(X \mid C_1)$ and $P(C_2)P(X \mid C_2)$ plotted against X; the decision boundary is where they cross.
- The area under $P(C_1)P(X \mid C_1)$ on the wrong side of the boundary is the fraction of all instances that belong to $C_1$ and are misclassified; likewise for $C_2$.
- The sum of the two regions is the total misclassification probability.
Bayes Classification Rule
The Bayes classification rule is the statistically optimal classification rule: moving the boundary in either direction will always increase the classification error.
Figure: the excess classification error incurred by a shifted boundary.
Modeling P(X | C)
Challenge: how to learn $P(X \mid C)$?
- It will not be known beforehand and must be learned from examples of X that belong to class C.
- It will generally have an unknown (and unknowable) shape; we only observe samples of X.
- We must make some assumptions about the form of $P(X \mid C)$.
The problem of dependent variables
$P(X \mid C) = P(X_1, X_2, \ldots, X_D \mid C)$ must be defined for every combination of $X_1, X_2, \ldots, X_D$:
- Too many parameters; most combinations are unseen in the training data.
- $P(X \mid C)$ may have an arbitrary scatter/shape that is hard to characterize mathematically.
The Naïve Bayes assumption
Figure: two graphical models over a class node C and feature nodes $X_1, \ldots, X_5$; in the Naïve Bayes model each $X_i$ depends only on C.
Assume all the components are independent of one another given the class, so the joint probability is the product of the marginals:
$$P(X \mid C) = P(X_1, X_2, \ldots, X_D \mid C) = \prod_i P(X_i \mid C)$$
It is sufficient to learn the marginal distributions $P(X_i \mid C)$; the problem of having to observe all combinations of $X_1, X_2, \ldots, X_D$ never arises.
Naïve Bayes: estimating $P(X_i \mid C)$
$P(X_i \mid C)$ may be estimated using conventional maximum likelihood estimation: given a number of training instances belonging to class C, select the i-th component of all instances and estimate $P(X_i \mid C)$ from them.
- For discrete-valued $X_i$ this will be a multinomial distribution.
- For continuous-valued $X_i$ a form must be assumed, e.g. Gaussian, Laplacian, etc.
Naïve Bayes: binary case
For binary-valued $X_i$, each marginal is a Bernoulli: $P(X_i = 1 \mid C) = p_{i,C}$, estimated as the fraction of class-C training instances in which $X_i = 1$.
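A minimal sketch of this binary case (a Bernoulli Naïve Bayes). The names and the Laplace smoothing constant are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def train_bernoulli_nb(X, y):
    """X: (n, D) binary matrix; y: (n,) integer class labels.
    Returns the classes, log-priors, and per-class Bernoulli parameters."""
    classes = np.unique(y)
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    # Laplace-smoothed estimate of P(X_i = 1 | C)
    theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                      for c in classes])
    return classes, log_prior, theta

def classify_bernoulli_nb(x, classes, log_prior, theta):
    """Pick argmax_C of log P(C) + sum_i log P(x_i | C)."""
    log_like = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return classes[np.argmax(log_prior + log_like)]
```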
The problem of dependent variables, revisited: even with the Naïve Bayes trick available, $P(X \mid C)$ may have an arbitrary scatter/shape that is hard to characterize mathematically. A practical alternative is to assume a parametric form for it.
Gaussian Distribution
$$P(x) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
- Mean vector $\mu$
- Covariance matrix $\Sigma$: symmetric, positive definite
Parameter Estimation
Given training samples $x_1, \ldots, x_N$, the maximum likelihood estimates are
$$\hat{\mu} = \frac{1}{N}\sum_{n} x_n, \qquad \hat{\Sigma} = \frac{1}{N}\sum_{n} (x_n - \hat{\mu})(x_n - \hat{\mu})^\top$$
Gaussian Classifier
Different classes, different Gaussians: model each class-conditional density as $P(X \mid C_i) = \mathcal{N}(X;\, \mu_i, \Sigma_i)$.
Gaussian Classifier
For each class we need a mean vector $\mu_i$ and a covariance matrix $\Sigma_i$.
- Training: fit a Gaussian to the training data of each class.
- Classification: $\hat{C} = \arg\max_{C_i} P(C_i)\, \mathcal{N}(X;\, \mu_i, \Sigma_i)$
- Problem: many parameters to train! Each covariance matrix alone has $D(D+1)/2$ free parameters.
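A compact sketch of this train/classify recipe, assuming NumPy and SciPy; the function names are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

def train_gaussian_classifier(X, y):
    """Fit one Gaussian per class, plus the class prior P(C)."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        models[c] = (np.mean(y == c),            # prior P(C)
                     Xc.mean(axis=0),            # mean vector
                     np.cov(Xc, rowvar=False))   # covariance matrix
    return models

def classify(x, models):
    """argmax over classes of log P(C) + log N(x; mu, Sigma)."""
    return max(models, key=lambda c: np.log(models[c][0]) +
               multivariate_normal.logpdf(x, models[c][1], models[c][2]))
```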
Homoskedastic Gaussians
All classes share a common covariance matrix $\Sigma$; only the means differ.
Figure: the decision boundary is the set of values of x where we cannot decide, i.e. where the scaled class densities are equal.
Homoskedastic Gaussians
Linear boundary! With a shared $\Sigma$, the quadratic term $x^\top \Sigma^{-1} x$ is the same for every class and cancels from the comparison.
Classification is given by (assuming equal priors)
$$\hat{C} = \arg\min_i \; (x - \mu_i)^\top \Sigma^{-1} (x - \mu_i)$$
i.e. by the Mahalanobis distance to each class mean. Therefore, a Gaussian classifier with a common covariance matrix is similar to a nearest neighbor classifier: classification corresponds to the nearest mean vector.
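For completeness, the standard two-class derivation (consistent with the shared-covariance assumption above, though not reproduced from the slides): the log posterior ratio
$$\log \frac{P(C_1)\, P(x \mid C_1)}{P(C_2)\, P(x \mid C_2)} = (\mu_1 - \mu_2)^\top \Sigma^{-1} x \;-\; \tfrac{1}{2}\!\left(\mu_1^\top \Sigma^{-1} \mu_1 - \mu_2^\top \Sigma^{-1} \mu_2\right) \;+\; \log\frac{P(C_1)}{P(C_2)}$$
is affine in x, so the decision boundary (where it equals zero) is a hyperplane; unequal priors merely shift it.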
How to estimate the covariance matrix?
A standard choice is to pool the within-class scatter across all classes:
$$\hat{\Sigma} = \frac{1}{N} \sum_{i} \sum_{x_n \in C_i} (x_n - \hat{\mu}_i)(x_n - \hat{\mu}_i)^\top$$
Heteroskedastic Gaussians
Different covariance matrices for different classes.
Figure: the 1-D case with K = 2 classes; the decision boundary is no longer a single linear threshold. Quadratic boundary!
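Why quadratic (a standard result, stated here for completeness): each class's log score is
$$g_i(x) = \log P(C_i) - \tfrac{1}{2}\log |\Sigma_i| - \tfrac{1}{2}(x - \mu_i)^\top \Sigma_i^{-1} (x - \mu_i)$$
and with different $\Sigma_i$ the quadratic terms $x^\top \Sigma_i^{-1} x$ no longer cancel in $g_1(x) - g_2(x)$, leaving a quadratic decision surface.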
GMM classifier
- For each class, train a GMM (with EM).
- Classify according to
$$\hat{C} = \arg\max_{C} \; P(C) \sum_{k} w_{C,k}\, \mathcal{N}(x;\, \mu_{C,k}, \Sigma_{C,k})$$
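A sketch of the same recipe with scikit-learn's GaussianMixture (which runs EM internally); using this library is an assumption of convenience, not something prescribed by the slides:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_classifier(X, y, n_components=4):
    """One GMM per class (EM under the hood), plus class priors."""
    return {c: (np.mean(y == c),
                GaussianMixture(n_components).fit(X[y == c]))
            for c in np.unique(y)}

def classify(x, models):
    """argmax_C of log P(C) + log P(x | C) under each class GMM."""
    x = np.atleast_2d(x)
    return max(models, key=lambda c: np.log(models[c][0]) +
               models[c][1].score_samples(x)[0])
```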
Estimating P(C)
- Count class frequencies, typically on held-out data.
- Priors can change with the problem/test setup.
- For binary problems: the detection error tradeoff (DET) curve, the equal error rate (EER), and choosing an operating point.
MAP regression
- MAP estimation of continuous variables.
- MAP estimation for Gaussians is linear regression.
- MAP estimation of more complex problems is hard or impossible.
MAP Estimation
A maximum likelihood estimator maximizes
$$\hat{\theta}_{ML} = \arg\max_{\theta} P(X \mid \theta)$$
A maximum a posteriori estimator maximizes
$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} P(X \mid \theta)\, P(\theta)$$
MAP estimation of continuous variables: an example
We want to estimate the mean, but we have prior knowledge. In the standard formulation, the data are $x_1, \ldots, x_N \sim \mathcal{N}(\mu, \sigma^2)$ and the prior on the mean is $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$.
MAP then minimizes the negative log posterior
$$\sum_{n=1}^{N} \frac{(x_n - \mu)^2}{2\sigma^2} + \frac{(\mu - \mu_0)^2}{2\sigma_0^2}$$
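Setting the derivative with respect to $\mu$ to zero gives the closed-form MAP estimate (a standard result under the Gaussian prior assumed above):
$$\hat{\mu}_{MAP} = \frac{\sigma_0^2 \sum_n x_n + \sigma^2 \mu_0}{N\sigma_0^2 + \sigma^2}$$
a weighted average that shrinks the sample mean toward the prior mean $\mu_0$; as $N \to \infty$ it approaches the ML estimate.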
MAP estimation of continuous variables
What is a good prior?
- Analyze the support of the distribution.
MAP estimation of continuous variables
For a Bernoulli parameter $p \in [0, 1]$ we can consider a Beta prior,
$$P(p) = \frac{p^{\alpha - 1}(1 - p)^{\beta - 1}}{B(\alpha, \beta)}$$
We say that the Beta distribution is the conjugate prior of the Bernoulli distribution: the posterior is again a Beta.
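To make the conjugacy concrete (a standard result, not from the slides): with $k$ successes in $N$ Bernoulli trials and a $\mathrm{Beta}(\alpha, \beta)$ prior, the posterior is $\mathrm{Beta}(\alpha + k,\; \beta + N - k)$, and the MAP estimate is
$$\hat{p}_{MAP} = \frac{k + \alpha - 1}{N + \alpha + \beta - 2}$$
which reduces to the ML estimate $k/N$ under the flat prior $\alpha = \beta = 1$.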
Probabilistic Linear Regression
Model the output as linear in the input plus Gaussian noise: $y = w^\top x + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, so that $P(y \mid x, w) = \mathcal{N}(y;\, w^\top x, \sigma^2)$.
MAP estimate: priors
Figure: left, a Gaussian prior on W; right, a Laplacian prior.
MAP estimate of weights
With a Gaussian prior on the weights, the MAP estimate takes the loss + regularization form:
$$\hat{a} = \arg\min_a \; (y - a^\top X)^\top (y - a^\top X) + \lambda\, a^\top a$$
MAP estimate of weights
- Equivalent to diagonal loading of the correlation matrix: it improves the condition number, so the matrix can be inverted with greater stability, and it will not affect the estimate from well-conditioned data.
- Also called Tikhonov regularization; dual form: ridge regression.
- This is a MAP estimate of the weights, not to be confused with a MAP estimate of Y.
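A closed-form sketch of this diagonal loading, using the conventional rows-as-samples layout rather than the slides' $a^\top X$ notation; lam is an illustrative choice of regularization weight:

```python
import numpy as np

def ridge_weights(X, y, lam=0.1):
    """MAP / Tikhonov estimate: a = (X^T X + lam*I)^{-1} X^T y.
    X: (n, D) design matrix, y: (n,) targets."""
    D = X.shape[1]
    # Diagonal loading improves the condition number of X^T X
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```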
MAP estimate of weights
With a Laplacian prior instead, MAP estimation is again loss + regularization, now with an $L_1$ penalty (next slide).
MAP estimation of weights with Laplacian prior
Assume the weights are drawn from a Laplacian: $P(a) = \lambda^{-1} \exp(-\lambda^{-1} \|a\|_1)$.
Maximum a posteriori estimate:
$$\hat{a} = \arg\max_a \; C' - (y - a^\top X)^\top (y - a^\top X) - \lambda^{-1} \|a\|_1$$
- No closed-form solution; a quadratic programming solution is required. Non-trivial.
- Identical to $L_1$-regularized least-squares estimation.
$L_1$-regularized LSE
$$\hat{a} = \arg\max_a \; C' - (y - a^\top X)^\top (y - a^\top X) - \lambda^{-1} \|a\|_1$$
No closed-form solution; quadratic programming solutions are required.
Dual formulation:
$$\hat{a} = \arg\max_a \; C - (y - a^\top X)^\top (y - a^\top X) \quad \text{subject to } \|a\|_1 \le t$$
LASSO: least absolute shrinkage and selection operator.
LASSO Algorithms
Various convex optimization algorithms:
- LARS: least angle regression
- Pathwise coordinate descent
- ..
Matlab code is available on the web.
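For a quick experiment, scikit-learn's Lasso solves the same $L_1$-regularized problem by coordinate descent; the slides point to Matlab code instead, so this Python snippet with invented toy data is only a convenience:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy problem with a sparse ground-truth weight vector (invented data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.05).fit(X, y)   # alpha plays the role of the L1 weight
print(np.flatnonzero(model.coef_))    # expect (mostly) the first three indices
```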
Regularized least squares
Figure (image credit: Tibshirani): the $L_1$ ball and the $L_2$ ball against the least-squares loss contours.
- Regularization results in selection of a suboptimal (in the least-squares sense) solution: one on a locus outside the center.
- Tikhonov regularization selects the shortest solution; $L_1$ regularization selects the sparsest solution.
Predicting Time Series
- Need time-series models.
- HMMs: later in the course.