Machine Learning, Lecture 2: Probability Density Estimation
16.10.2017
Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de
leibe@vision.rwth-aachen.de

Announcements
- Exceptional number of lecture participants this year. Current count: 449 participants. This is very nice, but it stretches our resources to their limits.
- Monday lecture slot: shifted to 8:30-10:00 in AH IV (276 seats). We will monitor the situation and take further action if the space is not sufficient.
- Thursday lecture slot: will stay at 4:15-5:45 in H02 (C.A.R.L, 786 seats).
- Exercises (non-mandatory): we will try to offer corrections, but we will have to see how to handle those numbers.

Announcements
- L2P electronic repository: slides, exercises, and supplementary material will be made available here. Lecture recordings will be uploaded 2-3 days after the lecture.
- Course webpage: http://www.vision.rwth-aachen.de/courses/ (slides will also be made available on the webpage).
- Please subscribe to the lecture on the Campus system! Important to get email announcements and L2P access!

Course Outline
- Fundamentals
  - Probability Density Estimation
- Classification Approaches
  - Linear Discriminants
  - Support Vector Machines
  - Ensemble Methods & Boosting
  - Randomized Trees, Forests & Ferns
- Deep Learning
  - Foundations
  - Convolutional Neural Networks
  - Recurrent Neural Networks

Topics of This Lecture
- Basic concepts
  - Minimizing the misclassification rate
  - Minimizing the expected loss
  - Discriminant functions
- Probability Density Estimation
  - General concepts
  - Gaussian distribution
- Parametric Methods
  - Maximum Likelihood approach
  - Bayesian vs. Frequentist views on probability

Thomas Bayes, 1701-1761 (image source: Wikipedia)
"The theory of inverse probability is founded upon an error, and must be wholly rejected." (R.A. Fisher, 1925)
Bayesian Decision Theory
- Example: handwritten character recognition.
- Goal: classify a new letter x such that the probability of misclassification is minimized.

Concept 1: Priors (a priori probabilities) p(C_k)
- What we can tell about the probability before seeing the data.
- Example: two classes C_1 = a and C_2 = b with
  p(C_1) = 0.75,  p(C_2) = 0.25.
- In general, the priors sum to one: \sum_k p(C_k) = 1.

Concept 2: Conditional probabilities p(x | C_k)
- Let x be a feature vector. x measures/describes certain properties of the input, e.g. number of black pixels, aspect ratio, ...
- p(x | C_k) describes the likelihood of x for class C_k.
[Figure: the class-conditional densities p(x|a) and p(x|b) plotted over x.]

Example: which class at x = 15?
- Since p(x|b) is much smaller than p(x|a) there, the decision should be a here.

Example: which class at x = 25?
- Since p(x|a) is much smaller than p(x|b) there, the decision should be b here.

Example: which class at x = 20?
- Remember that p(a) = 0.75 and p(b) = 0.25, i.e., the decision should again be a.
- How can we formalize this?
Concept 3: Posterior probabilities
- We are typically interested in the a posteriori probability, i.e. the probability of class C_k given the measurement vector x.
- Bayes' theorem:
  p(C_k | x) = p(x | C_k) p(C_k) / p(x) = p(x | C_k) p(C_k) / \sum_i p(x | C_i) p(C_i)
- Interpretation:
  Posterior = (Likelihood x Prior) / Normalization Factor
[Figure: likelihoods p(x|a), p(x|b); weighted likelihoods p(x|a)p(a), p(x|b)p(b); posteriors p(a|x), p(b|x) with the decision boundary.]

Bayesian Decision Theory
- Goal: minimize the probability of a misclassification.
- With decision regions R_1 and R_2, the probability of error is
  p(error) = \int_{R_1} p(C_2|x) p(x) dx + \int_{R_2} p(C_1|x) p(x) dx
- In the figure, the green and blue regions stay constant; only the size of the red region varies with the decision boundary.
- Optimal decision rule: decide for C_1 if
  p(C_1|x) > p(C_2|x).
- This is equivalent to
  p(x|C_1) p(C_1) > p(x|C_2) p(C_2),
  which is again equivalent to the likelihood-ratio test
  p(x|C_1) / p(x|C_2) > p(C_2) / p(C_1),
  where the right-hand side acts as the decision threshold.

Generalization to More Than 2 Classes
- Decide for class k whenever it has the greatest posterior probability of all classes:
  p(C_k|x) > p(C_j|x)  for all j != k
  p(x|C_k) p(C_k) > p(x|C_j) p(C_j)  for all j != k
- Likelihood-ratio test:
  p(x|C_k) / p(x|C_j) > p(C_j) / p(C_k)  for all j != k

Classifying with Loss Functions
- Generalization to decisions with a loss function: differentiate between the possible decisions and the possible true classes.
- Example: medical diagnosis.
  - Decisions: sick or healthy (or: further examination necessary).
  - Classes: patient is sick or healthy.
- The cost may be asymmetric:
  loss(decision = healthy | patient = sick) >> loss(decision = sick | patient = healthy)
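The decision rule above can be sketched numerically for the two-letter example. In this Python sketch, the 1-D Gaussian class-conditionals (means 15 and 25, shared spread 3) are hypothetical stand-ins for the densities that are only drawn on the slides; the priors p(a) = 0.75 and p(b) = 0.25 are the ones from the slides.

```python
import numpy as np

# Hypothetical class-conditional densities p(x|a), p(x|b): assumed 1-D
# Gaussians over a feature x, standing in for the curves on the slides.
def p_x_given_a(x):
    return np.exp(-0.5 * ((x - 15.0) / 3.0) ** 2) / (np.sqrt(2 * np.pi) * 3.0)

def p_x_given_b(x):
    return np.exp(-0.5 * ((x - 25.0) / 3.0) ** 2) / (np.sqrt(2 * np.pi) * 3.0)

p_a, p_b = 0.75, 0.25          # priors from the slides

def posterior_a(x):
    """Bayes' theorem: p(a|x) = p(x|a) p(a) / sum_i p(x|C_i) p(C_i)."""
    num = p_x_given_a(x) * p_a
    return num / (num + p_x_given_b(x) * p_b)

def decide(x):
    """Likelihood-ratio test: choose a iff p(x|a)/p(x|b) > p(b)/p(a)."""
    return "a" if p_x_given_a(x) / p_x_given_b(x) > p_b / p_a else "b"

print(decide(15), decide(25), decide(20))   # a b a
```

At x = 20 the two assumed likelihoods are equal, so the prior tips the decision to a, matching the argument on the slides.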
Classifying with Loss Functions
- In general, we can formalize this by introducing a loss matrix L_kj with
  L_kj = loss for decision C_j if the truth is C_k
  (rows: true class, columns: decision).
- Example: cancer diagnosis.
[Figure: the loss matrix L_cancer-diagnosis shown on the slide.]

Classifying with Loss Functions
- Loss functions may be different for different actors. Example (decisions: invest / don't invest in a subprime mortgage):
  L_stocktrader(subprime) = ( -1/2 c_gain  0 ; 0  0 )
  L_bank(subprime): the same gain entry, but with a different loss for a failed investment (one entry is not recoverable from the extraction).
- Different loss functions may lead to different Bayes optimal strategies.

Minimizing the Expected Loss
- The optimal solution is the one that minimizes the loss. But: the loss function depends on the true class, which is unknown.
- Solution: minimize the expected loss.
- This can be done by choosing the regions R_j such that each x is assigned to the decision j that minimizes \sum_k L_kj p(C_k|x), which is easy to do once we know the posterior class probabilities p(C_k|x).

Minimizing the Expected Loss
- Example: 2 classes C_1, C_2; 2 decisions alpha_1, alpha_2; loss function L(alpha_j | C_k) = L_kj.
- Expected loss (= risk R) for the two decisions:
  R(alpha_1|x) = L_11 p(C_1|x) + L_21 p(C_2|x)
  R(alpha_2|x) = L_12 p(C_1|x) + L_22 p(C_2|x)
- Goal: decide such that the expected loss is minimized, i.e. decide alpha_1 if R(alpha_2|x) > R(alpha_1|x).

Minimizing the Expected Loss
- Expanding the condition R(alpha_2|x) > R(alpha_1|x):
  L_12 p(C_1|x) + L_22 p(C_2|x) > L_11 p(C_1|x) + L_21 p(C_2|x)
  (L_12 - L_11) p(C_1|x) > (L_21 - L_22) p(C_2|x)
  (L_12 - L_11) / (L_21 - L_22) > p(C_2|x) / p(C_1|x) = p(x|C_2) p(C_2) / (p(x|C_1) p(C_1))
  p(x|C_1) / p(x|C_2) > ((L_21 - L_22) p(C_2)) / ((L_12 - L_11) p(C_1))
- This is the decision rule adapted to take the loss into account.

The Reject Option
- Classification errors arise from regions where the largest posterior probability p(C_k|x) is significantly less than 1. These are the regions where we are relatively uncertain about class membership.
- For some applications, it may be better to reject the automatic decision entirely in such a case and e.g. consult a human expert.
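Minimizing the expected loss takes only a few lines to sketch in Python. The 2x2 loss matrix below is an illustrative stand-in for the asymmetric medical-diagnosis case, not the matrix from the slides; class 0 is "sick", class 1 is "healthy", and decision j has risk R(alpha_j|x) = sum_k L_kj p(C_k|x).

```python
import numpy as np

# Assumed loss matrix: L[k][j] = loss for decision j when the truth is k.
L = np.array([[0.0, 100.0],    # truth = sick:    deciding "healthy" is costly
              [1.0,   0.0]])   # truth = healthy: deciding "sick" is cheap

def expected_losses(posteriors):
    """R(alpha_j|x) = sum_k L_kj p(C_k|x), for every decision j at once."""
    return posteriors @ L      # row of posteriors times the loss matrix

def decide(posteriors):
    """Bayes optimal decision: pick the decision with minimal expected loss."""
    return int(np.argmin(expected_losses(posteriors)))

# Even with only a 5% posterior probability of "sick", the asymmetric loss
# makes decision 0 ("treat as sick") the choice with lower expected loss:
post = np.array([0.05, 0.95])
print(expected_losses(post), decide(post))
```

With a symmetric 0-1 loss, the same code reduces to picking the class with the largest posterior, i.e. the misclassification-rate rule from before.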
Discriminant Functions
- Formulate classification in terms of comparisons.
- Discriminant functions y_1(x), ..., y_K(x): classify x as class C_k if
  y_k(x) > y_j(x)  for all j != k
- Examples:
  y_k(x) = p(C_k|x)
  y_k(x) = p(x|C_k) p(C_k)
  y_k(x) = log p(x|C_k) + log p(C_k)

Different Views on the Decision Problem
- Generative methods: y_k(x) proportional to p(x|C_k) p(C_k).
  First determine the class-conditional densities for each class individually and separately infer the prior class probabilities. Then use Bayes' theorem to determine class membership.
- Discriminative methods: y_k(x) = p(C_k|x).
  First solve the inference problem of determining the posterior class probabilities. Then use decision theory to assign each new x to its class.
- Alternative: directly find a discriminant function y_k(x) which maps each input directly onto a class label.

Topics of This Lecture
- Basic concepts
  - Minimizing the misclassification rate
  - Minimizing the expected loss
  - Discriminant functions
- Probability Density Estimation
  - General concepts
  - Gaussian distribution
- Parametric Methods
  - Maximum Likelihood approach
  - Bayesian vs. Frequentist views on probability
  - Bayesian Learning

Probability Density Estimation
- Up to now: Bayes optimal classification, based on the probabilities p(x|C_k) p(C_k).
- How can we estimate (= learn) those probability densities?
- Supervised training case: data and class labels are known. Estimate the probability density for each class C_k separately: p(x|C_k).
- (For simplicity of notation, we will drop the class label C_k in the following.)

Probability Density Estimation
- Data: x_1, x_2, x_3, x_4, ...
- Estimate: p(x)
- Methods:
  - Parametric representations (today)
  - Non-parametric representations (lecture 3)
  - Mixture models (lecture 4)

The Gaussian (or Normal) Distribution
- One-dimensional case: mean \mu, variance \sigma^2:
  N(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
- Multi-dimensional case: mean \mu, covariance \Sigma:
  N(x | \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right)
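The two density formulas can be checked directly with NumPy. A minimal sketch (library routines such as scipy.stats.norm and scipy.stats.multivariate_normal compute the same values); the test point and parameters are made up for illustration.

```python
import numpy as np

def gauss_1d(x, mu, sigma2):
    """N(x | mu, sigma^2), the one-dimensional Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gauss_nd(x, mu, Sigma):
    """N(x | mu, Sigma), the D-dimensional Gaussian density."""
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    # The exponent is the quadratic form -1/2 (x-mu)^T Sigma^{-1} (x-mu);
    # solve() avoids explicitly inverting Sigma.
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# The 1-D formula is the D = 1 special case of the multivariate one:
print(gauss_1d(0.3, 0.0, 2.0),
      gauss_nd(np.array([0.3]), np.array([0.0]), np.array([[2.0]])))
```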
Gaussian Distribution Properties
- Central Limit Theorem: the distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows.
- In practice, the convergence to a Gaussian can be very rapid. This makes the Gaussian interesting for many applications.
- Example: the sum of N uniform [0,1] random variables.

Gaussian Distribution Properties
- Quadratic form: N(x | \mu, \Sigma) depends on x through the exponent
  \Delta^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)
  Here, \Delta is often called the Mahalanobis distance from x to \mu.
- Shape of the Gaussian: \Sigma is a real, symmetric matrix. We can therefore decompose it into its eigenvectors,
  \Sigma = \sum_{i=1}^{D} \lambda_i u_i u_i^T,
  and thus obtain constant density on ellipsoids with main directions along the eigenvectors u_i and scaling factors \sqrt{\lambda_i}.

Gaussian Distribution Properties
- Special cases:
  - Full covariance matrix \Sigma = [\sigma_{ij}]: general ellipsoid shape.
  - Diagonal covariance matrix \Sigma = diag{\sigma_i^2}: axis-aligned ellipsoid.
  - Uniform variance \Sigma = \sigma^2 I: hypersphere.
- The marginals of a Gaussian are again Gaussians.

Topics of This Lecture
- Basic concepts
  - Minimizing the misclassification rate
  - Minimizing the expected loss
  - Discriminant functions
- Probability Density Estimation
  - General concepts
  - Gaussian distribution
- Parametric Methods
  - Maximum Likelihood approach
  - Bayesian vs. Frequentist views on probability
  - Bayesian Learning

Parametric Methods
- Given: data X = {x_1, x_2, ..., x_N} and a parametric form of the distribution with parameters \theta. E.g. for a Gaussian distribution: \theta = (\mu, \sigma).
- Learning: estimation of the parameters \theta.
- Likelihood of \theta: the probability that the data X have indeed been generated from a probability density with parameters \theta:
  L(\theta) = p(X | \theta)
(Slide adapted from Bernt Schiele.)
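The likelihood L(theta) = p(X|theta) just defined can be evaluated directly. A sketch for a 1-D Gaussian with theta = (mu, sigma), on a small made-up data set; working with the log-likelihood avoids numerical underflow of the product.

```python
import numpy as np

X = np.array([1.8, 2.1, 2.4, 1.9])   # made-up observations

def log_likelihood(X, mu, sigma):
    """ln L(theta) = sum_n ln p(x_n | theta), assuming independent points."""
    return np.sum(-0.5 * ((X - mu) / sigma) ** 2
                  - np.log(np.sqrt(2 * np.pi) * sigma))

# Parameters close to the data explain it better (higher log-likelihood):
print(log_likelihood(X, 2.05, 0.3), log_likelihood(X, 0.0, 0.3))
```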
Maximum Likelihood Approach
- Computation of the likelihood:
  - Single data point: p(x_n | \theta)
  - Assumption: all data points are independent:
    L(\theta) = p(X | \theta) = \prod_{n=1}^{N} p(x_n | \theta)
  - Negative log-likelihood:
    E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n | \theta)
- Estimation of the parameters \theta (learning): we want to obtain \hat\theta such that L(\hat\theta) is maximized, i.e. maximize the likelihood, or equivalently minimize the negative log-likelihood.

Maximum Likelihood Approach
- How do we minimize a function? Take the derivative and set it to zero:
  \frac{\partial}{\partial\theta} E(\theta) = -\sum_{n=1}^{N} \frac{\partial p(x_n|\theta) / \partial\theta}{p(x_n|\theta)} = 0
- Log-likelihood for the Normal distribution (1D case):
  E(\theta) = -\sum_{n=1}^{N} \ln p(x_n | \mu, \sigma)
  with
  p(x_n | \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x_n - \mu)^2}{2\sigma^2} \right)
- Minimizing with respect to \mu:
  \frac{\partial}{\partial\mu} E(\mu, \sigma) = -\sum_{n=1}^{N} \frac{\partial p(x_n|\mu,\sigma) / \partial\mu}{p(x_n|\mu,\sigma)} = -\sum_{n=1}^{N} \frac{x_n - \mu}{\sigma^2} = -\frac{1}{\sigma^2} \left( \sum_{n=1}^{N} x_n - N\mu \right)

Maximum Likelihood Approach
- Setting the derivative to zero, we obtain the sample mean
  \hat\mu = \frac{1}{N} \sum_{n=1}^{N} x_n
- In a similar fashion, we get the sample variance
  \hat\sigma^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat\mu)^2
- \hat\theta = (\hat\mu, \hat\sigma) is the Maximum Likelihood estimate for the parameters of a Gaussian distribution.
- This is a very important result. Unfortunately, it is wrong...

Maximum Likelihood Approach
- Or not wrong, but rather biased: assume the samples x_1, x_2, ..., x_N come from a true Gaussian distribution with mean \mu and variance \sigma^2.
- We can now compute the expectations of the ML estimates with respect to the data set values. It can be shown that
  E(\mu_{ML}) = \mu
  E(\sigma^2_{ML}) = \frac{N-1}{N} \sigma^2
- The ML estimate will underestimate the true variance.
- Corrected estimate:
  \tilde\sigma^2 = \frac{N}{N-1} \sigma^2_{ML} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \hat\mu)^2
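The closed-form ML estimates and the bias of the variance estimate are easy to see empirically. A sketch that fits a 1-D Gaussian and then averages the biased estimate over many small samples drawn from N(0, 1); sample size and seed are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def ml_fit(x):
    """ML fit of a 1-D Gaussian: sample mean, biased 1/N variance,
    and the bias-corrected 1/(N-1) variance."""
    N = len(x)
    mu_hat = x.sum() / N
    sigma2_ml = ((x - mu_hat) ** 2).sum() / N      # biased ML estimate
    sigma2_corr = N / (N - 1) * sigma2_ml          # corrected estimate
    return mu_hat, sigma2_ml, sigma2_corr

# E[sigma2_ML] = (N-1)/N * sigma^2, so for N = 2 samples from N(0, 1)
# the average ML variance estimate is only about 0.5, not 1.0:
est = np.array([ml_fit(rng.normal(0.0, 1.0, size=2))[1] for _ in range(20000)])
print(est.mean())
```

Averaging the corrected estimate instead would come out close to the true variance 1.0, which is exactly the point of the N/(N-1) factor.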
Maximum Likelihood: Limitations
- Maximum Likelihood has several significant limitations. It systematically underestimates the variance of the distribution!
- E.g. consider the case N = 1, X = {x_1}: the maximum-likelihood estimate is \hat\mu = x_1 and \hat\sigma = 0!
- We say ML overfits to the observed data.
- We will still often use ML, but it is important to know about this effect.
(Slide adapted from Bernt Schiele.)

Deeper Reason
- Maximum Likelihood is a Frequentist concept: in the Frequentist view, probabilities are the frequencies of random, repeatable events. These frequencies are fixed, but can be estimated more precisely when more data is available.
- This is in contrast to the Bayesian interpretation: in the Bayesian view, probabilities quantify the uncertainty about certain states or events. This uncertainty can be revised in the light of new evidence.
- Bayesians and Frequentists do not like each other too well...

Bayesian vs. Frequentist View
- To see the difference, suppose we want to estimate the uncertainty whether the Arctic ice cap will have disappeared by the end of the century.
- This question makes no sense in a Frequentist view, since the event cannot be repeated numerous times.
- In the Bayesian view, we generally have a prior, e.g. from calculations of how fast the polar ice is melting. If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior:
  Posterior \propto Likelihood \times Prior
- This generally allows us to get better uncertainty estimates for many situations.
- Main Frequentist criticism: the prior has to come from somewhere, and if it is wrong, the result will be worse.

Bayesian Approach to Parameter Learning
- Conceptual shift: Maximum Likelihood views the true parameter vector \theta as unknown, but fixed. In Bayesian learning, we consider \theta to be a random variable.
- This allows us to use knowledge about the parameters \theta, i.e. to use a prior for \theta.
- Training data then converts this prior distribution on \theta into a posterior probability density.
- The prior thus encodes the knowledge we have about the type of distribution we expect to see for \theta.
(Slide adapted from Bernt Schiele.)

Bayesian Learning
- Bayesian Learning is an important concept. However, it would lead too far here. I will introduce it in more detail in the Advanced ML lecture.

References and Further Reading
- More information in Bishop's book:
  - Gaussian distribution and ML: Ch. 1.2.4 and 2.3.1-2.3.4
  - Bayesian Learning: Ch. 1.2.3 and 2.3.6
  - Nonparametric methods: Ch. 2.5
- Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.