Machine Learning Lecture 2
Announcements
- Course webpage: slides will be made available on the webpage.
- L2P electronic repository: exercises and supplementary materials will be posted on the L2P.
- Please subscribe to the lecture on the Campus system! Important to get announcements and L2P access!

Machine Learning Lecture 2: Probability Density Estimation
Bastian Leibe, RWTH Aachen (leibe@vision.rwth-aachen.de)
Many slides adapted from B. Schiele

Course Outline
- Fundamentals (2 weeks): Bayes Decision Theory; Probability Density Estimation
- Discriminative Approaches (5 weeks): Linear Discriminant Functions; Support Vector Machines; Ensemble Methods & Boosting; Randomized Trees, Forests & Ferns
- Generative Models (4 weeks): Bayesian Networks; Markov Random Fields

Topics of This Lecture
- Recap: Bayes Decision Theory: basic concepts, minimizing the misclassification rate, minimizing the expected loss, discriminant functions
- Probability Density Estimation: general concepts, the Gaussian distribution
- Parametric Methods: Maximum Likelihood approach, Bayesian vs. Frequentist views on probability, Bayesian Learning

Recap: Bayes Decision Theory - Concepts
Concept 1: Priors (a priori probabilities) p(C_k)
What we can tell about the probability before seeing the data. In general, the priors sum to one:

  Σ_k p(C_k) = 1

Concept 2: Conditional probabilities p(x | C_k)
Let x be a feature vector; x measures/describes certain properties of the input, e.g. the number of black pixels or the aspect ratio. p(x | C_k) describes the likelihood of observing x for class C_k. (Figure: example class-conditional densities for two classes.)
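To make concepts 1 and 2 concrete, here is a minimal sketch (not part of the original slides) for a two-class, one-dimensional toy problem. The priors and the Gaussian class models in `priors` and `params` are made-up illustrative values, not numbers from the lecture.

```python
# Minimal sketch: priors p(C_k) and class-conditional densities p(x|C_k)
# for a 1-D, two-class toy problem. All parameter values are assumptions.
import numpy as np

priors = {"C1": 0.7, "C2": 0.3}                  # Concept 1: p(C_k), sums to 1
params = {"C1": (0.0, 1.0), "C2": (2.0, 0.5)}    # (mean, std) per class, made up

def likelihood(x, c):
    """Concept 2: class-conditional density p(x | C_k), here a 1-D Gaussian."""
    mu, sigma = params[c]
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

x = 1.2
for c in priors:
    print(c, "prior =", priors[c], " p(x|C) =", likelihood(x, c))
```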
Recap: Bayes Decision Theory - Concepts
Concept 3: Posterior probabilities
We are typically interested in the a posteriori probability, i.e. the probability of class C_k given the measurement vector x. Bayes' theorem:

  p(C_k | x) = p(x | C_k) p(C_k) / p(x) = p(x | C_k) p(C_k) / Σ_i p(x | C_i) p(C_i)

Interpretation: Posterior = (Likelihood × Prior) / Normalization Factor

(Figure: likelihoods p(x|a) and p(x|b), likelihood × prior, and the resulting posteriors p(a|x) and p(b|x) with the decision boundary.)

Optimal decision rule
Decide for C_1 if p(C_1 | x) > p(C_2 | x). This is equivalent to

  p(x | C_1) p(C_1) > p(x | C_2) p(C_2),

which is again equivalent to the likelihood-ratio test

  p(x | C_1) / p(x | C_2) > p(C_2) / p(C_1),

where the right-hand side acts as the decision threshold. The rule partitions the input space into decision regions R_1, R_2, R_3, ...

Recap: Minimizing the Expected Loss
Example: 2 classes C_1, C_2; 2 decisions α_1, α_2; loss function L(α_j | C_k) = L_kj. The expected loss (= risk R) for the two decisions is

  R(α_j | x) = Σ_k L_kj p(C_k | x).

Goal: decide such that the expected loss is minimized. Decide α_1 if R(α_2 | x) > R(α_1 | x):

  L_12 p(C_1 | x) + L_22 p(C_2 | x) > L_11 p(C_1 | x) + L_21 p(C_2 | x)
  (L_12 − L_11) p(C_1 | x) > (L_21 − L_22) p(C_2 | x)

Using Bayes' theorem, this means: decide α_1 if

  p(x | C_1) / p(x | C_2) > (L_21 − L_22) p(C_2) / ((L_12 − L_11) p(C_1)).

This is the decision rule adapted to take the loss into account.
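The following sketch (not from the slides) runs the likelihood-ratio test with and without a loss matrix. The Gaussian class models and the loss values are illustrative assumptions; the example is chosen so that the loss matrix flips the decision.

```python
# Sketch: likelihood-ratio test for minimum misclassification rate vs.
# minimum expected loss. All model parameters are illustrative assumptions.
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

p_C1, p_C2 = 0.7, 0.3                  # priors (assumed)
lik1 = lambda x: gauss(x, 0.0, 1.0)    # p(x|C1) (assumed model)
lik2 = lambda x: gauss(x, 2.0, 0.5)    # p(x|C2) (assumed model)

# L[k, j] = L_kj: loss of decision alpha_j when the true class is C_k.
# Here, misclassifying C2 as C1 is made 5x worse than the opposite error.
L = np.array([[0.0, 1.0],
              [5.0, 0.0]])

x = 1.2
ratio = lik1(x) / lik2(x)                                      # p(x|C1)/p(x|C2)
t_error = p_C2 / p_C1                                          # 0/1-loss threshold
t_risk = (L[1, 0] - L[1, 1]) / (L[0, 1] - L[0, 0]) * t_error   # loss-adjusted
print("decide C1 (min. error):", ratio > t_error)   # True
print("decide C1 (min. risk): ", ratio > t_risk)    # False: loss shifts the boundary
```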
The Reject Option
Classification errors arise from regions where the largest posterior probability p(C_k | x) is significantly less than 1. These are the regions where we are relatively uncertain about class membership. For some applications, it may be better to reject the automatic decision entirely in such a case and e.g. consult a human expert.

Discriminant Functions
Formulate classification in terms of comparisons between discriminant functions y_1(x), ..., y_K(x): classify x as class C_k if

  y_k(x) > y_j(x) for all j ≠ k.

Examples (from Bayes decision theory):

  y_k(x) = p(C_k | x)
  y_k(x) = p(x | C_k) p(C_k)
  y_k(x) = log p(x | C_k) + log p(C_k)

(A sketch combining these discriminant functions with the reject option follows at the end of this section.)

Different Views on the Decision Problem
- Generative methods: y_k(x) ∝ p(x | C_k) p(C_k). First determine the class-conditional densities for each class individually and separately infer the prior class probabilities; then use Bayes' theorem to determine class membership.
- Discriminative methods: y_k(x) = p(C_k | x). First solve the inference problem of determining the posterior class probabilities; then use decision theory to assign each new x to its class.
- Alternative: directly find a discriminant function y_k(x) which maps each input directly onto a class label.

Probability Density Estimation
Up to now: Bayes optimal classification, based on the probabilities p(x | C_k) p(C_k). How can we estimate (= learn) those probability densities? In the supervised training case, data and class labels are known, so we estimate the probability density for each class C_k separately: p(x | C_k). (For simplicity of notation, we will drop the class label C_k in the following.)

Given data x_1, x_2, x_3, x_4, ..., estimate p(x). Methods:
- Parametric representations (today)
- Non-parametric representations (lecture 3)
- Mixture models (lecture 4)
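Referring back to the reject option and the discriminant-function examples above, here is a sketch (not from the slides) that picks the class with the largest posterior y_k(x) = p(C_k | x) but rejects when that posterior falls below a threshold. The class models and the threshold value are illustrative assumptions.

```python
# Sketch: maximum-posterior classification with a reject option.
# Class models and the 0.8 threshold are illustrative assumptions.
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

classes = {"C1": (0.7, 0.0, 1.0), "C2": (0.3, 2.0, 0.5)}   # prior, mean, std

def classify(x, reject_threshold=0.8):
    joint = {c: p * gauss(x, mu, s) for c, (p, mu, s) in classes.items()}
    evidence = sum(joint.values())                          # p(x)
    post = {c: j / evidence for c, j in joint.items()}      # p(C_k | x)
    best = max(post, key=post.get)
    return best if post[best] >= reject_threshold else "reject"

print(classify(-1.0))  # clearly C1
print(classify(1.4))   # near the decision boundary -> "reject"
```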
The Gaussian (or Normal) Distribution
One-dimensional case, with mean μ and variance σ²:

  N(x | μ, σ²) = (1 / √(2πσ²)) exp{−(x − μ)² / (2σ²)}

Multi-dimensional case, with mean μ and covariance Σ:

  N(x | μ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp{−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ)}

Gaussian Distribution - Properties
Central Limit Theorem: the distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows. In practice, the convergence to a Gaussian can be very rapid. This makes the Gaussian interesting for many applications. (Example: the sum of N uniform [0,1] random variables.)

Quadratic form: N(x | μ, Σ) depends on x only through the exponent

  Δ² = (x − μ)ᵀ Σ⁻¹ (x − μ),

where Δ is often called the Mahalanobis distance from x to μ.

Shape of the Gaussian: Σ is a real, symmetric matrix. We can therefore decompose it into its eigenvectors,

  Σ = Σ_{i=1}^{D} λ_i u_i u_iᵀ,

and thus obtain constant density on ellipsoids with main directions along the eigenvectors u_i and scaling factors √λ_i.

Special cases:
- Full covariance matrix Σ = [σ_ij]: general ellipsoid shape
- Diagonal covariance matrix Σ = diag{σ_i²}: axis-aligned ellipsoid
- Uniform variance Σ = σ²I: hypersphere

The marginals of a Gaussian are again Gaussians.
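The multivariate density above can be evaluated directly via the Mahalanobis distance. The sketch below (not from the slides) does this for an assumed 2-D example and also shows the eigendecomposition of Σ that determines the ellipsoid shape.

```python
# Sketch: evaluating the multivariate Gaussian via the Mahalanobis distance.
# The example mean and covariance are made up for illustration.
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    D = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(Sigma, diff)            # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha2) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                             # full covariance: ellipsoid
print(gaussian_pdf(np.array([1.0, -0.5]), mu, Sigma))

lam, U = np.linalg.eigh(Sigma)   # Sigma = sum_i lam_i u_i u_i^T; columns of U are u_i
print("axis directions:\n", U, "\naxis scales:", np.sqrt(lam))
```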
Parametric Methods
Given: data X = {x_1, x_2, ..., x_N} and a parametric form of the distribution with parameters θ; e.g. for a Gaussian distribution, θ = (μ, σ). Learning then means estimating the parameters θ.

Likelihood of θ: the probability that the data X have indeed been generated from a probability density with parameters θ:

  L(θ) = p(X | θ)

Maximum Likelihood Approach
Computation of the likelihood: a single data point contributes p(x_n | θ). Under the assumption that all data points are independent,

  L(θ) = p(X | θ) = Π_{n=1}^{N} p(x_n | θ)

Negative log-likelihood:

  E(θ) = −ln L(θ) = −Σ_{n=1}^{N} ln p(x_n | θ)

Estimation of the parameters θ (learning): maximize the likelihood, or equivalently minimize the negative log-likelihood. We want to obtain θ̂ such that L(θ̂) is maximized.

How do we minimize a function? Take the derivative and set it to zero:

  ∂E/∂θ = −Σ_n (∂p(x_n | θ)/∂θ) / p(x_n | θ) = 0

Log-likelihood for the normal distribution (1-D case), with

  p(x_n | μ, σ) = (1/√(2πσ²)) exp{−‖x_n − μ‖² / (2σ²)}:

  E(μ, σ) = −Σ_n ln p(x_n | μ, σ)

Minimizing with respect to μ, we use

  (∂p(x_n | μ, σ)/∂μ) / p(x_n | μ, σ) = (x_n − μ) / σ²

so that

  ∂E/∂μ = −Σ_n (x_n − μ)/σ² = 0  ⟺  μ̂ = (1/N) Σ_n x_n  (sample mean)

In a similar fashion, we get

  σ̂² = (1/N) Σ_n (x_n − μ̂)²  (sample variance)

θ̂ = (μ̂, σ̂) is the Maximum Likelihood estimate for the parameters of a Gaussian distribution. This is a very important result. Unfortunately, it is wrong...
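The closed-form ML estimates derived above are one line each in code. A small sketch (not from the slides), on synthetic data with assumed true parameters:

```python
# Sketch: closed-form ML estimates for a 1-D Gaussian on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.5, scale=2.0, size=1000)   # assumed true mu=1.5, sigma=2.0

mu_ml = X.mean()                                # mu_hat = (1/N) sum_n x_n
var_ml = ((X - mu_ml) ** 2).mean()              # sigma_hat^2 = (1/N) sum_n (x_n - mu_hat)^2
print("mu_ml =", mu_ml, " sigma_ml =", np.sqrt(var_ml))
```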
Maximum Likelihood Approach - Or not wrong, but rather biased
Assume the samples x_1, x_2, ..., x_N come from a true Gaussian distribution with mean μ and variance σ². We can now compute the expectations of the ML estimates with respect to the data set values. It can be shown that

  E[μ_ML] = μ
  E[σ²_ML] = ((N − 1)/N) σ²

The ML estimate will underestimate the true variance. Corrected estimate:

  σ̃² = (N/(N − 1)) σ²_ML = (1/(N − 1)) Σ_n (x_n − μ̂)²

(A small simulation of this bias appears at the end of this section.)

Maximum Likelihood - Limitations
Maximum Likelihood has several significant limitations. It systematically underestimates the variance of the distribution. E.g. consider the case N = 1, X = {x_1}: the maximum-likelihood estimate is μ̂ = x_1 and σ̂ = 0! We say ML overfits to the observed data. We will still often use ML, but it is important to know about this effect.

Deeper Reason: Bayesian vs. Frequentist View
Maximum Likelihood is a Frequentist concept. In the Frequentist view, probabilities are the frequencies of random, repeatable events. These frequencies are fixed, but can be estimated more precisely when more data is available. This is in contrast to the Bayesian interpretation, in which probabilities quantify the uncertainty about certain states or events. This uncertainty can be revised in the light of new evidence. (Bayesians and Frequentists do not like each other too well...)

To see the difference, suppose we want to estimate the uncertainty whether the Arctic ice cap will have disappeared by the end of the century. This question makes no sense in a Frequentist view, since the event cannot be repeated numerous times. In the Bayesian view, we generally have a prior, e.g. from calculations of how fast the polar ice is melting. If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior:

  Posterior ∝ Likelihood × Prior

This generally allows us to get better uncertainty estimates in many situations. Main Frequentist criticism: the prior has to come from somewhere, and if it is wrong, the result will be worse.

Bayesian Approach to Parameter Learning
Conceptual shift: Maximum Likelihood views the true parameter vector θ as unknown but fixed. In Bayesian learning, we consider θ to be a random variable. This allows us to use knowledge about the parameters θ, i.e. to use a prior for θ. Training data then converts this prior distribution on θ into a posterior probability density. The prior thus encodes knowledge we have about the type of distribution we expect to see for θ.

Bayesian Learning Approach
Bayesian view: consider the parameter vector θ as a random variable. When estimating the parameters, what we compute is

  p(x | X) = ∫ p(x, θ | X) dθ = ∫ p(x | θ, X) p(θ | X) dθ = ∫ p(x | θ) p(θ | X) dθ

Assumption: given θ, x does not depend on X anymore; p(x | θ) is entirely determined by the parameter θ (i.e. by the parametric form of the pdf).
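As promised above, a small simulation sketch (not from the slides) of the variance bias: with N = 5 samples per trial, the ML variance estimate should average roughly (N−1)/N · σ² = 0.8 · σ², while the corrected estimate recovers σ².

```python
# Sketch: simulating the bias of the ML variance estimate, E[var_ml] = (N-1)/N * sigma^2.
import numpy as np

rng = np.random.default_rng(1)
N, trials, true_var = 5, 100_000, 1.0
X = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))
var_ml = X.var(axis=1, ddof=0)      # divides by N: the biased ML estimate
var_corr = X.var(axis=1, ddof=1)    # divides by N-1: the corrected estimate
print(var_ml.mean())                # approx (N-1)/N * sigma^2 = 0.8
print(var_corr.mean())              # approx sigma^2 = 1.0
```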
Bayesian Learning Approach (cont'd)

  p(x | X) = ∫ p(x | θ) p(θ | X) dθ

with

  p(θ | X) = p(X | θ) p(θ) / p(X) = L(θ) p(θ) / p(X)
  p(X) = ∫ p(X | θ) p(θ) dθ = ∫ L(θ) p(θ) dθ

Inserting this above, we obtain

  p(x | X) = ∫ p(x | θ) L(θ) p(θ) / p(X) dθ = ∫ p(x | θ) L(θ) p(θ) dθ / ∫ L(θ) p(θ) dθ

Discussion: L(θ) is the likelihood of the parametric form θ given the data set X; p(θ) is the prior for the parameters θ; p(x | θ) is the estimate for x based on the parametric form θ; and the normalization integrates over all possible values of θ. If we now plug in a (suitable) prior p(θ), we can estimate p(x | X) from the data set X.

Bayesian Density Estimation - Discussion
The probability p(θ | X) makes the dependency of the estimate on the data explicit. If p(θ | X) is very small everywhere but large for one θ̂, then p(x | X) ≈ p(x | θ̂). The more uncertain we are about θ, the more we average over all parameter values.
Problem: in the general case, the integration over θ is not possible (or only possible stochastically).

Example where an analytical solution is possible: a normal distribution for the data, with σ² assumed known and fixed. Estimate the distribution of the mean:

  p(μ | X) = p(X | μ) p(μ) / p(X)

Prior: we assume a Gaussian prior over μ with mean μ_0 and variance σ_0². With the sample mean x̄ = (1/N) Σ_n x_n, the Bayes estimate is

  μ_N = (σ² μ_0 + N σ_0² x̄) / (σ² + N σ_0²)
  1/σ_N² = 1/σ_0² + N/σ²

(Figure: the posterior p(μ | X) concentrating around the sample mean as N grows; in the plot, μ_0 = 0. A sketch of this example follows after the summary below.)

Summary: ML vs. Bayesian Learning
- Maximum Likelihood: simple approach, often analytically possible. Problem: the estimation is biased and tends to overfit to the data, so it often needs some correction or regularization. But: the approximation becomes accurate for N → ∞.
- Bayesian learning: general approach that avoids the estimation bias through a prior. Problems: one needs to choose a suitable prior (not always obvious), and the integral over θ is often not analytically feasible anymore. But: efficient stochastic sampling techniques are available (see Adv. ML).
(In this lecture, we'll use both concepts wherever appropriate.)
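Referring back to the analytical example above, this sketch (not from the slides) computes the posterior over the mean μ for Gaussian data with known σ² and a Gaussian prior N(μ | μ_0, σ_0²). The data and prior settings are illustrative assumptions.

```python
# Sketch: analytical posterior over the Gaussian mean (sigma^2 known),
# using the mu_N and sigma_N^2 formulas from the slide above.
import numpy as np

def posterior_over_mean(X, sigma2, mu0, sigma0_2):
    N, xbar = len(X), np.mean(X)
    mu_N = (sigma2 * mu0 + N * sigma0_2 * xbar) / (sigma2 + N * sigma0_2)
    sigma_N2 = 1.0 / (1.0 / sigma0_2 + N / sigma2)   # 1/sigma_N^2 = 1/sigma_0^2 + N/sigma^2
    return mu_N, sigma_N2

rng = np.random.default_rng(2)
X = rng.normal(1.0, 1.0, size=20)                    # sigma^2 = 1 assumed known
print(posterior_over_mean(X, sigma2=1.0, mu0=0.0, sigma0_2=10.0))
# With more data, mu_N approaches the sample mean and sigma_N^2 shrinks toward 0.
```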
References and Further Reading
More information can be found in Bishop's book:
- Gaussian distribution and ML: Ch. 1.2.4 and 2.3
- Bayesian learning: Ch. 1.2.3 and 2.3
- Nonparametric methods: Ch. 2.5

Additional information can be found in Duda & Hart:
- ML estimation: Ch. 3.2
- Bayesian learning: Ch. 3
- Nonparametric methods: Ch. 4

Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2000.