Machine Learning Lecture 3


Announcements
- Exam dates: we are in the process of fixing the first exam date.
- Exercises: the first exercise sheet is available on L2P now. First exercise lecture on … Please submit your results by the evening of … via L2P (detailed instructions can be found on the exercise sheet).

Machine Learning Lecture 3: Probability Density Estimation II
Bastian Leibe, RWTH Aachen, leibe@vision.rwth-aachen.de

Course Outline
- Fundamentals: Bayes Decision Theory, Probability Density Estimation
- Classification Approaches: Linear Discriminants, Support Vector Machines, Ensemble Methods & Boosting, Randomized Trees, Forests & Ferns
- Deep Learning: Foundations, Convolutional Neural Networks, Recurrent Neural Networks

Topics of This Lecture
- Recap: Parametric Methods: Gaussian distribution, Maximum Likelihood approach
- Non-parametric Methods: Histograms, Kernel density estimation, K-Nearest Neighbors, k-NN for Classification
- Mixture distributions: Mixture of Gaussians (MoG), Maximum Likelihood estimation attempt

Recap: Gaussian (or Normal) Distribution
- One-dimensional case (mean \mu, variance \sigma^2):
  \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}
- Multi-dimensional case (mean \mu, covariance \Sigma):
  \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right\}

Recap: Maximum Likelihood Approach
- Computation of the likelihood
  - Single data point: p(x_n \mid \theta)
  - Assumption: all data points X = \{x_1, \dots, x_N\} are independent:
    L(\theta) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)
  - Log-likelihood: E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta)
- Estimation of the parameters \theta (learning)
  - Maximize the likelihood (= minimize the negative log-likelihood)
  - Take the derivative and set it to zero:
    \frac{\partial E(\theta)}{\partial \theta} = -\sum_{n=1}^{N} \frac{\partial p(x_n \mid \theta) / \partial \theta}{p(x_n \mid \theta)} \overset{!}{=} 0
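To make the negative log-likelihood concrete, here is a minimal numpy sketch (not part of the original slides; the function names, data, and parameter values are made up for illustration) that evaluates the 1-D Gaussian density and E(\theta) on a toy data set:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """1-D Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def neg_log_likelihood(data, mu, sigma2):
    """Negative log-likelihood E(theta) = -sum_n ln p(x_n | theta) for i.i.d. data."""
    return -np.sum(np.log(gaussian_pdf(data, mu, sigma2)))

# Toy data: the negative log-likelihood is smallest near the generating parameters.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)
print(neg_log_likelihood(data, mu=2.0, sigma2=1.5 ** 2))  # near the minimum
print(neg_log_likelihood(data, mu=0.0, sigma2=1.0))       # clearly larger
```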

Maximum Likelihood Approach
- When applying ML to the Gaussian distribution, we obtain
  \hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n   (sample mean)
- In a similar fashion, we get
  \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2   (sample variance)
- \hat{\theta} = (\hat{\mu}, \hat{\sigma}) is the Maximum Likelihood estimate for the parameters of a Gaussian distribution.
- This is a very important result. Unfortunately, it is wrong... or not wrong, but rather biased.

Maximum Likelihood Approach (continued)
- Assume the samples x_1, x_2, \dots, x_N come from a true Gaussian distribution with mean \mu and variance \sigma^2.
- We can now compute the expectations of the ML estimates with respect to the data set values. It can be shown that
  E(\mu_{ML}) = \mu
  E(\sigma^2_{ML}) = \frac{N-1}{N} \sigma^2
- The ML estimate will underestimate the true variance.
- Corrected estimate:
  \tilde{\sigma}^2 = \frac{N}{N-1} \sigma^2_{ML} = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \hat{\mu})^2

Maximum Likelihood: Limitations
- Maximum Likelihood has several significant limitations.
- It systematically underestimates the variance of the distribution! E.g. consider the case N = 1, X = \{x_1\}: the maximum-likelihood estimate is \hat{\mu} = x_1 and \hat{\sigma}^2 = \frac{1}{N} \sum_n (x_n - \hat{\mu})^2 = 0.
- We say ML overfits to the observed data.
- We will still often use ML, but it is important to know about this effect.

Deeper Reason
- Maximum Likelihood is a Frequentist concept. In the Frequentist view, probabilities are the frequencies of random, repeatable events. These frequencies are fixed, but can be estimated more precisely when more data is available.
- This is in contrast to the Bayesian interpretation. In the Bayesian view, probabilities quantify the uncertainty about certain states or events. This uncertainty can be revised in the light of new evidence.
- Bayesians and Frequentists do not like each other too well...

Bayesian vs. Frequentist View
- To see the difference: suppose we want to estimate the uncertainty whether the Arctic ice cap will have disappeared by the end of the century.
- This question makes no sense in a Frequentist view, since the event cannot be repeated numerous times.
- In the Bayesian view, we generally have a prior, e.g. from calculations of how fast the polar ice is melting. If we now get fresh evidence, e.g. from a new satellite, we may revise our opinion and update the uncertainty from the prior:
  Posterior \propto Likelihood \times Prior
- This generally allows us to get better uncertainty estimates in many situations.
- Main Frequentist criticism: the prior has to come from somewhere, and if it is wrong, the result will be worse.

Bayesian Approach to Parameter Learning
- Conceptual shift: Maximum Likelihood views the true parameter vector \theta as unknown but fixed. In Bayesian learning, we consider \theta to be a random variable.
- This allows us to use knowledge about the parameters \theta, i.e. to use a prior for \theta.
- Training data then converts this prior distribution on \theta into a posterior probability density.
- The prior thus encodes knowledge we have about the type of distribution we expect to see for \theta.
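The bias of the ML variance estimate is easy to verify empirically. The following sketch (not from the slides; the sample size and true variance are arbitrary choices) averages the ML and corrected estimates over many small data sets:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2_true = 5, 4.0

ml_estimates, corrected_estimates = [], []
for _ in range(20000):
    x = rng.normal(0.0, np.sqrt(sigma2_true), size=N)
    mu_hat = x.mean()
    ml_estimates.append(np.sum((x - mu_hat) ** 2) / N)              # sigma^2_ML
    corrected_estimates.append(np.sum((x - mu_hat) ** 2) / (N - 1))  # tilde sigma^2

# The ML estimate is biased by a factor (N-1)/N; the corrected estimate is not.
print(np.mean(ml_estimates), (N - 1) / N * sigma2_true)   # both close to 3.2
print(np.mean(corrected_estimates), sigma2_true)           # both close to 4.0
```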

Bayesian Learning
- Bayesian Learning is an important concept. However, it would lead too far here; I will introduce it in more detail in the Advanced ML lecture.

Topics of This Lecture
- Recap: Parametric Methods: Gaussian distribution, Maximum Likelihood approach
- Non-parametric Methods: Histograms, Kernel density estimation, K-Nearest Neighbors, k-NN for Classification
- Mixture distributions: Mixture of Gaussians (MoG), Maximum Likelihood estimation attempt

Non-parametric Methods
- Non-parametric representations: often the functional form of the distribution is unknown.
- Estimate the probability density directly from the data: Histograms, Kernel density estimation (Parzen window / Gaussian kernels), k-Nearest-Neighbor.

Histograms
- Basic idea: partition the data space into distinct bins with widths \Delta_i and count the number of observations n_i in each bin:
  p_i = \frac{n_i}{N \Delta_i}
- Often, the same width is used for all bins, \Delta_i = \Delta.
- This can be done, in principle, for any dimensionality D, but the required number of bins grows exponentially with D!
- The bin width \Delta acts as a smoothing factor (too small: not smooth enough; too large: too smooth).

Summary: Histograms
- Properties
  - Very general. In the limit (N \to \infty), every probability density can be represented.
  - No need to store the data points once the histogram is computed.
  - Rather brute-force.
- Problems
  - High-dimensional feature spaces: a D-dimensional space with M bins per dimension will require M^D bins! This requires an exponentially growing number of data points ("curse of dimensionality").
  - Discontinuities at bin edges.
  - Bin size? Too large: too much smoothing. Too small: too much noise.
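As a small illustration of the histogram estimator p_i = n_i / (N \Delta_i), here is a numpy sketch (not part of the slides; the bimodal toy data and bin count are made up):

```python
import numpy as np

def histogram_density(data, num_bins):
    """Histogram density estimate: p_i = n_i / (N * Delta_i) for each bin."""
    counts, edges = np.histogram(data, bins=num_bins)
    widths = np.diff(edges)
    return counts / (len(data) * widths), edges

rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(1, 1.0, 500)])
density, edges = histogram_density(data, num_bins=30)
print(np.sum(density * np.diff(edges)))  # ~1.0: the estimate integrates to one
```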

Statistically Better-Founded Approach
- A data point x comes from a pdf p(x). The probability that x falls into a small region R is
  P = \int_R p(y) \, dy
- If R is sufficiently small, p(x) is roughly constant. Let V be the volume of R:
  P = \int_R p(y) \, dy \approx p(x) V
- If the number N of samples is sufficiently large, we can estimate P as
  P = \frac{K}{N} \quad \Rightarrow \quad p(x) \approx \frac{K}{NV}
- Two strategies:
  - Kernel methods: fix V, determine K. Example: determine the number K of data points inside a fixed hypercube.
  - K-Nearest Neighbor: fix K, determine V.

Kernel Methods: Parzen Window
- Hypercube of dimension D with edge length h:
  k(u) = 1 \text{ if } |u_i| \le \frac{h}{2}, \; i = 1, \dots, D; \quad 0 \text{ otherwise}
- Kernel function: K = \sum_{n=1}^{N} k(x - x_n), volume V = \int k(u) \, du = h^D
- Probability density estimate:
  p(x) \approx \frac{K}{NV} = \frac{1}{N h^D} \sum_{n=1}^{N} k(x - x_n)
- Interpretations:
  1. We place a kernel window k at location x and count how many data points fall inside it.
  2. We place a kernel window k around each data point x_n and sum up their influences at location x. Direct visualization of the density.
- Still, we have artificial discontinuities at the cube boundaries. We can obtain a smoother density model if we choose a smoother kernel function, e.g. a Gaussian.

Kernel Methods: Gaussian Kernel
- Gaussian kernel function:
  k(u) = \frac{1}{(2\pi h^2)^{1/2}} \exp\left\{ -\frac{u^2}{2h^2} \right\}
- K = \sum_{n=1}^{N} k(x - x_n), \quad V = \int k(u) \, du = 1
- Probability density estimate:
  p(x) \approx \frac{K}{NV} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi)^{D/2} h^D} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}

Gauss Kernel: Examples
- h acts as a smoother (too small: not smooth enough; too large: too smooth).
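Both kernel estimators above fit in a few lines of numpy. This sketch (not from the slides; function names, data, and bandwidths are illustrative choices) implements the Parzen hypercube and the Gaussian-kernel estimate for D-dimensional data:

```python
import numpy as np

def parzen_hypercube(x, data, h):
    """Hypercube (Parzen window) estimate: K counts the points x_n lying within
    h/2 of x in every dimension; the cube volume is V = h^D."""
    K = np.sum(np.all(np.abs(data - x) <= h / 2, axis=1))
    return K / (len(data) * h ** data.shape[1])

def gaussian_kde(x, data, h):
    """Gaussian-kernel estimate: p(x) ~ (1/N) sum_n N(x | x_n, h^2 I)."""
    D = data.shape[1]
    sq_dist = np.sum((data - x) ** 2, axis=1)
    return np.mean(np.exp(-sq_dist / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (D / 2))

rng = np.random.default_rng(3)
data = rng.normal(0.0, 1.0, size=(1000, 1))   # N = 1000 samples, D = 1
x = np.array([0.5])
print(parzen_hypercube(x, data, h=0.5))   # both should be close to N(0.5 | 0, 1) ~ 0.35
print(gaussian_kde(x, data, h=0.3))
```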

Kernel Methods: General
- In general, any kernel function k(u) with k(u) \ge 0 and \int k(u) \, du = 1 can be used. Then
  K = \sum_{n=1}^{N} k(x - x_n)
  and we get the probability density estimate
  p(x) \approx \frac{K}{NV} = \frac{1}{N} \sum_{n=1}^{N} k(x - x_n)

Statistically Better-Founded Approach (recap)
- Kernel methods: fix V, determine K.
- K-Nearest Neighbor: fix K, determine V. Increase the volume V until the K next data points are found.

K-Nearest Neighbor
- Nearest-neighbor density estimation: fix K, estimate V from the data. Consider a hypersphere centred on x and let it grow to a volume V^* that includes K of the given data points. Then
  p(x) \approx \frac{K}{N V^*}
- Side note: strictly speaking, the model produced by K-NN is not a true density model, because the integral over all space diverges. E.g. consider K = 1 and a sample exactly on a data point, x = x_j.

k-Nearest Neighbor: Examples
- K acts as a smoother (too small: not smooth enough; too large: too smooth).

Summary: Kernel and k-NN Density Estimation
- Properties
  - Very general. In the limit (N \to \infty), every probability density can be represented.
  - No computation involved in the training phase: simply storage of the training set.
- Problems
  - Requires storing and computing with the entire dataset. Computational cost linear in the number of data points. This can be improved, at the expense of some computation during training, by constructing efficient tree-based search structures.
  - Kernel size / K in K-NN? Too large: too much smoothing. Too small: too much noise.

K-Nearest Neighbor Classification
- Bayesian classification:
  p(C_j \mid x) = \frac{p(x \mid C_j) \, p(C_j)}{p(x)}
- Here we have
  p(x) \approx \frac{K}{NV}, \quad p(x \mid C_j) \approx \frac{K_j}{N_j V}, \quad p(C_j) \approx \frac{N_j}{N}
- k-Nearest Neighbor classification:
  p(C_j \mid x) \approx \frac{K_j}{N_j V} \cdot \frac{N_j}{N} \cdot \frac{NV}{K} = \frac{K_j}{K}
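The following sketch (not from the slides; data, labels, and parameter values are invented for illustration) implements the k-NN density estimate p(x) ≈ K/(N V*) using the hypersphere volume, and k-NN classification as a majority vote, i.e. p(C_j | x) ≈ K_j / K:

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, K):
    """k-NN density estimate p(x) ~ K / (N * V*), where V* is the volume of the
    smallest hypersphere around x that contains K data points."""
    N, D = data.shape
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    R = dists[K - 1]                                      # radius to the K-th neighbour
    V = (pi ** (D / 2) / gamma(D / 2 + 1)) * R ** D       # hypersphere volume
    return K / (N * V)

def knn_classify(x, data, labels, K):
    """k-NN classification: p(C_j | x) ~ K_j / K, i.e. majority vote among the
    K nearest neighbours of x."""
    nearest = np.argsort(np.linalg.norm(data - x, axis=1))[:K]
    classes, counts = np.unique(labels[nearest], return_counts=True)
    return classes[np.argmax(counts)]

rng = np.random.default_rng(4)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(knn_density(np.zeros(2), data, K=5))
print(knn_classify(np.array([2.5, 2.5]), data, labels, K=3))
```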

K-Nearest Neighbors for Classification
- Results on an example data set (K = 3): K acts as a smoothing parameter.
- Theoretical guarantee: for N \to \infty, the error rate of the 1-NN classifier is never more than twice the optimal error (obtained from the true conditional class distributions).

Bias-Variance Tradeoff
- Probability density estimation:
  - Histograms: bin size? Too large: too smooth. Too small: not smooth enough.
  - Kernel methods: kernel size h? Too large: too smooth. Too small: not smooth enough.
  - K-Nearest Neighbor: K? Too large: too smooth. Too small: not smooth enough.
- This is a general problem of many probability density estimation methods, including parametric methods and mixture models: too much bias vs. too much variance (illustrated in the sketch below for the kernel bandwidth h).

Discussion
- The methods discussed so far are all simple and easy to apply. They are used in many practical applications.
- However:
  - Histograms scale poorly with increasing dimensionality: only suitable for relatively low-dimensional data.
  - Both k-NN and kernel density estimation require the entire data set to be stored: too expensive if the data set is large.
  - Simple parametric models are very restricted in what forms of distributions they can represent: only suitable if the data has the same general form.
- We need density models that are efficient and flexible! (Next topic.)

Topics of This Lecture
- Recap: Parametric Methods: Gaussian distribution, Maximum Likelihood approach
- Non-parametric Methods: Histograms, Kernel density estimation, K-Nearest Neighbors, k-NN for Classification
- Mixture distributions: Mixture of Gaussians (MoG), Maximum Likelihood estimation attempt

Mixture Distributions
- A single parametric distribution is often not sufficient, e.g. for multimodal data (compare a single Gaussian with a mixture of two Gaussians).
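Returning to the bias-variance tradeoff above, one common way to see the effect of the smoothing parameter is to score the density estimate on held-out data. This sketch (not from the slides; the bandwidth grid and data are arbitrary) evaluates a 1-D Gaussian kernel density estimate for several bandwidths h:

```python
import numpy as np

rng = np.random.default_rng(5)
train = rng.normal(0.0, 1.0, size=200)
val = rng.normal(0.0, 1.0, size=200)    # held-out data from the same distribution

def kde_loglik(x, data, h):
    """Held-out log-likelihood of a 1-D Gaussian-kernel density estimate."""
    k = np.exp(-(x[:, None] - data[None, :]) ** 2 / (2 * h ** 2))
    p = k.mean(axis=1) / np.sqrt(2 * np.pi * h ** 2)
    return np.sum(np.log(p))

# Too small h overfits the training points (too much variance), too large h
# oversmooths (too much bias); the held-out score peaks at an intermediate h.
for h in [0.1, 0.3, 1.0, 3.0, 10.0]:
    print(h, kde_loglik(val, train, h))
```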

Mixture of Gaussians (MoG)
- Sum of M individual Normal distributions:
  p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j) \, p(j)
- In the limit, every smooth distribution can be approximated this way (if M is large enough).

Mixture of Gaussians: Notes
- Likelihood of a measurement x given mixture component j:
  p(x \mid \theta_j) = \mathcal{N}(x \mid \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left\{ -\frac{(x-\mu_j)^2}{2\sigma_j^2} \right\}
- Prior of component j:
  p(j) = \pi_j, \quad \text{with } 0 \le \pi_j \le 1 \text{ and } \sum_{j=1}^{M} \pi_j = 1
- The mixture density integrates to 1:
  \int p(x) \, dx = 1
- The mixture parameters are
  \theta = (\pi_1, \mu_1, \sigma_1, \dots, \pi_M, \mu_M, \sigma_M)

Mixture of Gaussians (MoG): Generative Model
- Generative model: first pick a mixture component j with weight p(j) = \pi_j, then draw x from that component p(x \mid \theta_j). The resulting mixture density is
  p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j) \, p(j)

Mixture of Multivariate Gaussians
- Multivariate Gaussian components:
  p(x \mid \theta_j) = \frac{1}{(2\pi)^{D/2} |\Sigma_j|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right\}
  p(x \mid \theta) = \sum_{j=1}^{M} \pi_j \, p(x \mid \theta_j)
- Mixture weights / mixture coefficients:
  p(j) = \pi_j, \quad \text{with } 0 \le \pi_j \le 1 \text{ and } \sum_{j=1}^{M} \pi_j = 1
- Parameters: \theta = (\pi_1, \mu_1, \Sigma_1, \dots, \pi_M, \mu_M, \Sigma_M)
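A short numpy sketch of the 1-D mixture density and its generative model (not from the slides; the two-component weights, means, and variances are made-up example values):

```python
import numpy as np

def mog_pdf(x, weights, means, sigmas):
    """Mixture density p(x) = sum_j pi_j * N(x | mu_j, sigma_j^2)."""
    x = np.atleast_1d(x)[:, None]
    comps = np.exp(-(x - means) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    return comps @ weights

def mog_sample(n, weights, means, sigmas, rng):
    """Generative model: pick component j with probability pi_j, then x ~ N(mu_j, sigma_j^2)."""
    j = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[j], sigmas[j])

weights = np.array([0.4, 0.6])           # pi_j, must sum to one
means = np.array([-2.0, 1.5])
sigmas = np.array([0.5, 1.0])
rng = np.random.default_rng(6)
samples = mog_sample(1000, weights, means, sigmas, rng)
print(mog_pdf(np.array([-2.0, 0.0, 1.5]), weights, means, sigmas))  # bimodal density values
print(samples.mean(), weights @ means)    # empirical mean vs. mixture mean
```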

Mixture of Gaussians: 1st Estimation Attempt
- Maximum Likelihood: minimize
  E = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta)
- We can already see that this will be difficult, since the logarithm contains a sum over components:
  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}
- Let's first look at \mu_j. Setting the derivative to zero,
  \frac{\partial E}{\partial \mu_j} = -\sum_{n=1}^{N} \frac{\pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)} \, (x_n - \mu_j) \overset{!}{=} 0
- We thus obtain
  \mu_j = \frac{\sum_{n=1}^{N} \gamma_j(x_n) \, x_n}{\sum_{n=1}^{N} \gamma_j(x_n)}, \quad \text{with } \gamma_j(x_n) = \frac{\pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
  the responsibility of component j for x_n. This will cause problems!
- But: \gamma_j(x_n) itself depends on all of the parameters (\pi_1, \mu_1, \Sigma_1, \dots, \pi_M, \mu_M, \Sigma_M), i.e. there is no direct analytical solution.
- Complex gradient function (non-linear mutual dependencies): the optimization of one Gaussian depends on all other Gaussians! It is possible to apply iterative numerical optimization here, but in the following, we will see a simpler method.

Mixture of Gaussians: Other Strategy
- Observed data: the samples x_n. Unobserved data: the mixture component that generated each sample. Unobserved = hidden variable j, with assignment indicators h(j = 1 \mid x_n) and h(j = 2 \mid x_n) (in the two-component example).
- Assuming we knew the values of the hidden variable, the problem would decompose into separate ML estimates for each Gaussian:
  \mu_1 = \frac{\sum_{n=1}^{N} h(j = 1 \mid x_n) \, x_n}{\sum_{n=1}^{N} h(j = 1 \mid x_n)}, \quad
  \mu_2 = \frac{\sum_{n=1}^{N} h(j = 2 \mid x_n) \, x_n}{\sum_{n=1}^{N} h(j = 2 \mid x_n)}
- Conversely, assuming we knew the mixture components (assumed known: p(j = 1 \mid x) and p(j = 2 \mid x)), we could assign each sample by the Bayes decision rule: decide j = 1 if p(j = 1 \mid x_n) > p(j = 2 \mid x_n).
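To make the circular dependency tangible, here is an illustrative sketch (not the method introduced in the lecture, which follows later; function names and toy data are mine, and the weights and variances are kept fixed for simplicity) that computes the responsibilities \gamma_j(x_n) and then iterates the fixed-point equation for the means:

```python
import numpy as np

def responsibilities(x, weights, means, sigmas):
    """gamma_j(x_n) = pi_j N(x_n | mu_j, sigma_j^2) / sum_k pi_k N(x_n | mu_k, sigma_k^2)."""
    comps = weights * np.exp(-(x[:, None] - means) ** 2 / (2 * sigmas ** 2)) \
            / np.sqrt(2 * np.pi * sigmas ** 2)
    return comps / comps.sum(axis=1, keepdims=True)

def update_means(x, gamma):
    """Fixed-point equation mu_j = sum_n gamma_j(x_n) x_n / sum_n gamma_j(x_n)."""
    return (gamma * x[:, None]).sum(axis=0) / gamma.sum(axis=0)

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.5, 1.0, 700)])

weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])             # rough initial guess
sigmas = np.array([1.0, 1.0])
for _ in range(10):                       # responsibilities and means depend on each other
    gamma = responsibilities(x, weights, means, sigmas)
    means = update_means(x, gamma)
print(means)                              # moves toward the true component means (-2.0, 1.5)
```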

Mixture of Gaussians: Other Strategy (continued)
- Chicken-and-egg problem: what comes first? We know neither the hidden assignments nor the component parameters; we don't know any of those!
- In order to break the loop, we need an initial estimate for the assignments j, e.g. by clustering. (Next lecture.)

References and Further Reading
- More information in Bishop's book:
  - Gaussian distribution and ML: Ch. 1.2.4 and 2.3
  - Bayesian Learning: Ch. 1.2.3 and 2.3.6
  - Nonparametric methods: Ch. 2.5
- Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
