MACHINE LEARNING / ADVANCED MACHINE LEARNING
Recap of Important Notions on Estimation of Probability Density Functions

Overview
- Definition of a pdf
- Definitions of joint, conditional, and marginal distributions, and of the likelihood
- Uni-dimensional and multi-dimensional Gauss function
- Decomposition in eigenvalues
- Estimation of a density with expectation-maximization
- Use of Bayes' rule for classification

Discrete Probabilities
Consider two variables x and y taking discrete values over the intervals [1...N_x] and [1...N_y] respectively.
P(x = i): the probability that the variable x takes value i.
0 \le P(x = i) \le 1, \; i = 1, ..., N_x, and \sum_{i=1}^{N_x} P(x = i) = 1.
Idem for P(y = j), j = 1, ..., N_y.

Probability Distributions, Density Functions
p(x), a continuous function, is the probability density function or probability distribution function (PDF) (sometimes also called probability distribution or simply density) of the variable x.
p(x) \ge 0, \quad \int p(x)\, dx = 1

PDF: Example
The uni-dimensional Gaussian or Normal distribution is a pdf given by:
p(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \mu: mean, \sigma^2: variance
The Gaussian function is entirely determined by its mean and variance. For this reason, it is referred to as a parametric distribution.
Illustrations from Wikipedia

Expectation
The expectation of the random variable x with probability P(x) (in the discrete case) or pdf p(x) (in the continuous case), also called the expected value or mean, is the mean of the observed values of x weighted by p(x). If X is the set of observations of x, then:
When x takes discrete values: \mu = E\{x\} = \sum_{x \in X} x\, P(x)
For continuous distributions: \mu = E\{x\} = \int_X x\, p(x)\, dx

Variance
\sigma^2, the variance of a distribution, measures the amount of spread of the distribution around its mean:
\sigma^2 = Var(x) = E\{(x - \mu)^2\} = E\{x^2\} - (E\{x\})^2
\sigma is the standard deviation of x.
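
As a quick numerical check of these definitions, the sketch below (not from the slides; the sample size and distribution parameters are assumptions) estimates E{x} and Var(x) = E{x²} - (E{x})² from samples drawn from a Gaussian:

```python
import numpy as np

# Draw samples from a Gaussian with known mean and variance (assumed values).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)   # mu = 2.0, sigma = 1.5

mu_hat = x.mean()                                  # empirical E{x}
var_hat = (x ** 2).mean() - mu_hat ** 2            # E{x^2} - (E{x})^2
print(mu_hat, var_hat)                             # ~2.0 and ~2.25 (= sigma^2)
```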

Mean and Variance in the Gauss PDF
~68% of the data are comprised between +/- 1 sigma
~96% of the data are comprised between +/- 2 sigmas
~99% of the data are comprised between +/- 3 sigmas
This is no longer true for arbitrary pdfs!
Illustrations from Wikipedia

Mean and Variance in Mixture of Gauss Functions
[Figure: three Gaussian distributions and the resulting distribution f = 1/3 (f1 + f2 + f3) obtained when superposing them; the expectation and sigma = 0.68 are marked.]
For pdfs other than the Gaussian distribution, the variance represents a notion of dispersion around the expected value.

Multi-dimensional Gaussian Function
The uni-dimensional Gaussian or Normal distribution is a distribution with pdf given by:
p(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \mu: mean, \sigma: variance
The multi-dimensional Gaussian or Normal distribution has a pdf given by:
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}
If x is N-dimensional, then \mu is an N-dimensional mean vector and \Sigma is an N x N covariance matrix.

2-dimensional Gaussian Pdf
p(x_1, x_2):
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}
Isolines: p(x) = cst
If x is N-dimensional, then \mu is an N-dimensional mean vector and \Sigma is an N x N covariance matrix.

Modeling Joint Pdf with a Gaussian Function
Construct the covariance matrix from the (centered) set of M datapoints X = \{x^i\}_{i=1...M}:
\Sigma = \frac{1}{M} X X^T
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}
If x is N-dimensional, then \mu is an N-dimensional mean vector and \Sigma is an N x N covariance matrix.
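
A minimal Python sketch of this construction (the data, seed, and helper name gauss_pdf are our own, not from the slides): centre a set of datapoints stored as the columns of X, form \Sigma = \frac{1}{M} X X^T, and evaluate the multivariate Gaussian density:

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x; mu, Sigma) as defined above."""
    N = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# X holds M datapoints as columns (N x M), matching the X X^T formula above.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 1.0]], size=500).T

mu = X.mean(axis=1, keepdims=True)
Xc = X - mu                                   # centre the data
Sigma = (Xc @ Xc.T) / X.shape[1]              # Sigma = (1/M) X X^T

print(gauss_pdf(np.array([0.5, -0.2]), mu.ravel(), Sigma))
```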

Modeling Joint Pdf with a Gaussian Function
Construct the covariance matrix from the (centered) set of M datapoints X = \{x^i\}_{i=1...M}: \Sigma = \frac{1}{M} X X^T
\Sigma is square and symmetric. It can be decomposed using the eigenvalue decomposition:
\Sigma = V \Lambda V^T, \quad \Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_N \end{pmatrix}
V: matrix of eigenvectors, \Lambda: diagonal matrix composed of the eigenvalues.
For the 1-std ellipse, the axes' lengths along the 1st and 2nd eigenvectors are \sqrt{\lambda_1} and \sqrt{\lambda_2}, with \Sigma = V \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} V^T.
Each isoline corresponds to a scaling of the 1-std ellipse.
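
The decomposition can be checked numerically. The sketch below (with an assumed example covariance) uses numpy.linalg.eigh, which applies to symmetric matrices, and reads the semi-axis lengths of the 1-std ellipse off the eigenvalues:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])               # example covariance (an assumption)

# Sigma is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
eigvals, V = np.linalg.eigh(Sigma)           # Sigma = V @ diag(eigvals) @ V.T
Lambda = np.diag(eigvals)

print(np.allclose(Sigma, V @ Lambda @ V.T))  # True: the decomposition is exact
print(np.sqrt(eigvals))                      # semi-axis lengths of the 1-std ellipse
```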

Fitting a Single Gauss Function and PCA
Principal Component Analysis (PCA) identifies a suitable representation of a multivariate data set by de-correlating the dataset.
When projected onto the eigenvectors e^1 and e^2, the set of datapoints appears to follow two uncorrelated Normal distributions:
p((e^1)^T X) \sim N(x; \mu_1, \lambda_1), \quad p((e^2)^T X) \sim N(x; \mu_2, \lambda_2)
[Figure: the dataset with the 1st and 2nd eigenvectors e^1, e^2 overlaid.]
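
The de-correlation property can also be illustrated numerically; in the sketch below (data and parameters are assumptions), projecting centred data onto the eigenvectors of its covariance yields approximately uncorrelated components whose variances are \lambda_1 and \lambda_2:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -1.0], [[3.0, 1.2], [1.2, 0.5]], size=2000)

Xc = X - X.mean(axis=0)                       # centre the data
Sigma = Xc.T @ Xc / len(Xc)
eigvals, V = np.linalg.eigh(Sigma)            # columns of V are the eigenvectors e1, e2

Y = Xc @ V                                    # project onto the eigenvectors
print(np.cov(Y.T, bias=True).round(3))        # ~diagonal: the projections are uncorrelated,
print(eigvals.round(3))                       # with variances ~lambda_1, lambda_2
```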

Marginal Pdf
Consider two random variables x and y with joint distribution p(x, y); then the marginal probability of x is given by: p(x) = \int p(x, y)\, dy
(Lower indices are dropped for clarity of notation.)
p(x) is a Gaussian when p(x, y) is also a Gaussian.
To compute the marginal, one needs the joint distribution p(x, y). Often, one does not know it and one can only estimate it.
If x is a multi-dimensional variable, the marginal is itself a joint distribution!

Joint vs Marginal and the Curse of Dimensionality
The joint distribution p(x, y) is richer than the marginals p(x), p(y).
In the discrete case: estimating the marginals of N variables taking K values requires estimating N \cdot K probabilities; estimating the joint distribution corresponds to computing ~K^N probabilities.

Marginal & Conditional Pdf
Consider two random variables x and y with joint distribution p(x, y); then the marginal probability of x is given by: p(x) = \int p(x, y)\, dy
The conditional probability of y given x is given by: p(y | x) = \frac{p(x, y)}{p(x)}
Computing the conditional from the joint distribution requires estimating the marginal of one of the two variables. If you need only the conditional, estimate it directly!
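
For a small discrete example (the joint table below is an assumption, purely for illustration), the marginal and the conditional follow directly from the two formulas above:

```python
import numpy as np

# Joint distribution p(x, y) over two discrete variables.
# Rows index x in {0, 1}, columns index y in {0, 1, 2}.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])

p_x = p_xy.sum(axis=1)             # marginal p(x): sum (integrate) over y
p_y_given_x = p_xy / p_x[:, None]  # conditional p(y | x) = p(x, y) / p(x)

print(p_x)                         # [0.4, 0.6]
print(p_y_given_x.sum(axis=1))     # each row sums to 1, as a conditional must
```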

Joint vs Conditional: Example
Here we care about predicting y = hitting angle given x = position, but not the converse.
One can estimate p(y | x) directly and use this to predict \hat{y} \sim E\{p(y | x)\}.

Joint Distribution Estimation and Curse of Dimensionality
Pros of computing the joint distribution: it provides the statistical dependencies across all variables and the marginal distributions.
Cons: the computational costs grow exponentially with the number of dimensions (statistical power: ~10 samples to estimate each parameter of a model).
Compute solely the conditional if you care only about dependencies across variables.

The conditional and marginal pdfs of a multi-dimensional Gauss function are all Gauss functions!
[Figure: a 2-D Gaussian with the marginal density of x and the marginal density of y shown along the axes. Illustrations from Wikipedia.]

PDF equivalency with Discrete Probability
The cumulative distribution function (or simply distribution function) of x is:
D(x^*) = P(x \le x^*) = \int_{-\infty}^{x^*} p(x)\, dx
p(x)\, dx ~ probability of x to fall within an infinitesimal interval [x, x + dx]

PDF equivalency with Discrete Probability
[Figure: a uniform distribution p(x) and its cumulative distribution D(x^*).]
The probability that x takes a value in the subinterval [a, b] is given by:
P(x \le b) := D(b) = \int_{-\infty}^{b} p(x)\, dx
P(a \le x \le b) = D(b) - D(a) = \int_{a}^{b} p(x)\, dx
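
A short sketch of P(a \le x \le b) = D(b) - D(a) for a Gaussian, using the error function from Python's standard library (the interval [a, b] and the Gaussian parameters are assumptions):

```python
import math

def gauss_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution D(x) of a Gaussian, written via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

a, b = -1.0, 1.0
print(gauss_cdf(b) - gauss_cdf(a))   # ~0.6827: P(a <= x <= b) = D(b) - D(a)
```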

PDF equivalency with Discrete Probability
[Figure: a Gaussian pdf (top) and its cumulative distribution (bottom).]

Conditional and Bayes's Rule
Consider the joint probability of two variables x and y: p(x, y).
The conditional probability is computed as follows: p(y | x) = \frac{p(x, y)}{p(x)}
Bayes' theorem: p(y | x) = \frac{p(x | y)\, p(y)}{p(x)}

Example of use of Bayes rule in ML
Naïve Bayes classification can be used to determine the likelihood of the class label. Bayes' rule for binary classification: x has class label +1 if p(y = +1 | x) > \delta, \delta \ge 0; otherwise x takes class label -1.
[Figure: the Gaussian P(y = +1 | x) with the threshold \delta and the corresponding interval [x_min, x_max].]

The probability that x takes a value in the subinterval [x_min, x_max] is given by:
P(x_min \le x \le x_max) = D(x_max) - D(x_min)
[Figure: the cumulative distribution D(x) and the Gaussian P(y = +1 | x) with threshold \delta, with x_min and x_max marked.]

Example of use of Bayes rule in ML
[Figure: a Gaussian pdf and its cumulative distribution; the probability of x within the interval is 0.90309 with cutoff 0.1.]

Example of use of Bayes rule in ML
[Figure: a Gaussian pdf and its cumulative distribution; the probability of x within the interval is 0.38995 with cutoff 0.35, versus 0.90309 with cutoff 0.1.]

Example of use of Bayes rule in ML
[Figure: a Gaussian pdf and its cumulative distribution over [-5, 5]; the probability of x within the interval is 0.99994 with cutoff 0.1.]
The choice of cut-off in Bayes' rule corresponds to some probability of seeing an event, e.g. of belonging to class y = +1. This is true only when the class follows a Gauss distribution!

Example of use of Bayes rule in ML
Naïve Bayes classification can be used to determine the likelihood of the class label. Here, an example for binary classification. The decision rule is: x has class label y = +1 if:
p(x | y = +1) \ge p(x | y = -1)
The boundary is found at the intersection between the isolines. Each class is modeled with one Gauss distribution: p(x | y) \sim N(x; \mu, \sigma)
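
A minimal 1-D sketch of this decision rule (the class-conditional parameters and the helper names gauss_pdf and classify are assumptions for illustration):

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Class-conditional Gaussians p(x | y = +1) and p(x | y = -1) (assumed parameters).
mu_pos, sigma_pos = 1.0, 1.0
mu_neg, sigma_neg = -1.5, 0.8

def classify(x):
    """Label +1 if p(x | y = +1) >= p(x | y = -1), else -1."""
    return +1 if gauss_pdf(x, mu_pos, sigma_pos) >= gauss_pdf(x, mu_neg, sigma_neg) else -1

print([classify(x) for x in (-2.0, -0.5, 0.0, 2.0)])   # boundary lies between the two means
```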

Example of use of Bayes rule in ML
One can also add a prior p(y = +1) and p(y = -1) on either of the marginals to change the boundary:
p(y = +1 | x) = \frac{p(x | y = +1)\, p(y = +1)}{p(x)}, \qquad p(y = -1 | x) = \frac{p(x | y = -1)\, p(y = -1)}{p(x)}
p(y = +1) and p(y = -1) are usually estimated from data. If not estimated, one can set a prior on either marginal, e.g. give more importance to one class by decreasing its p(y). Changing the ratio of the priors shifts the boundary.
The ratio \delta = p(y = +1 | x) / p(y = -1 | x) determines the boundary.

The ROC (Receiver Operating Characteristic) curve determines the influence of the threshold. The ratio \delta = p(y = +1 | x) / p(y = -1 | x) determines the boundary; compute the false and true positives for each value of \delta in this interval.
True Positives (TP): number of datapoints of class 1 that are correctly classified.
False Positives (FP): number of datapoints of class 2 that are incorrectly classified.
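
One way such a curve could be traced is sketched below (the synthetic 1-D data, the class parameters, and all names are assumptions): sweep \delta over a range of values and, for each value, count the fraction of class-1 points labelled +1 (true positive rate) and the fraction of class-2 points labelled +1 (false positive rate):

```python
import numpy as np

rng = np.random.default_rng(3)
x_pos = rng.normal(1.0, 1.0, size=1000)    # class y = +1
x_neg = rng.normal(-1.5, 0.8, size=1000)   # class y = -1

def gauss_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def rates(delta):
    """TP and FP rates when predicting +1 iff p(x | +1) / p(x | -1) >= delta."""
    ratio_pos = gauss_pdf(x_pos, 1.0, 1.0) / gauss_pdf(x_pos, -1.5, 0.8)
    ratio_neg = gauss_pdf(x_neg, 1.0, 1.0) / gauss_pdf(x_neg, -1.5, 0.8)
    tp = np.mean(ratio_pos >= delta)       # class-1 points correctly classified
    fp = np.mean(ratio_neg >= delta)       # class-2 points incorrectly classified
    return tp, fp

for delta in (0.1, 0.5, 1.0, 2.0, 10.0):   # sweeping delta traces out the ROC curve
    print(delta, rates(delta))
```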

Likelihood Function
The likelihood function (or likelihood for short) determines the probability density of observing the set X = \{x^i\}_{i=1}^{M} of M datapoints, if each datapoint has been generated by the pdf p(x):
L(X) = p(x^1, x^2, ..., x^M)
If the data are independently and identically distributed (i.i.d.), then:
L(X) = \prod_{i=1}^{M} p(x^i)
The likelihood can be used to determine how well a particular choice of density p(x) models the data at hand.

Likelihood of Gaussian Pdf Parametrization
Consider that the pdf of the dataset X is parametrized with parameters \mu, \Sigma. One writes: p(X; \mu, \Sigma) or p(X | \mu, \Sigma).
The likelihood function of the model parameters is given by:
L(\mu, \Sigma | X) := p(X; \mu, \Sigma)
It measures the probability of observing X if the distribution of X is parametrized with \mu, \Sigma.
If all datapoints are identically and independently distributed (i.i.d.):
L(\mu, \Sigma | X) = \prod_{i=1}^{M} p(x^i; \mu, \Sigma)
To determine the best fit, search for the parameters that maximize the likelihood.
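
Under the i.i.d. assumption the likelihood is a product of densities, so in practice one evaluates the sum of log-densities. A sketch comparing a few candidate parametrizations on an assumed dataset:

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(4)
X = rng.normal(2.0, 1.0, size=200)          # observed dataset (an assumption)

# log L(mu, sigma | X) = sum_i log p(x^i; mu, sigma) for i.i.d. data
for mu, sigma in [(0.0, 1.0), (2.0, 1.0), (2.0, 3.0)]:
    print(mu, sigma, gauss_logpdf(X, mu, sigma).sum())   # (2.0, 1.0) scores highest
```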

Likelihood Function: Exercise
[Figure: the same dataset ('Data', shown as the fraction of observations of x) with four candidate Gaussian models ('Model') overlaid; the reported likelihoods are 5.0936e-, 9.5536e-05, 4.4994e-06 and 4.8759e-07.]
Estimate the likelihood of each of the 4 models. Determine which one is optimal.

Maximum Likelihood Optimization
The principle of maximum likelihood consists of finding the optimal parameters of a given distribution by maximizing the likelihood function of these parameters, or equivalently by maximizing the probability of the data given the model and its parameters.
For a multi-variate Gaussian pdf, one can determine the mean and covariance matrix by solving:
\max_{\mu, \Sigma} L(\mu, \Sigma | X) = \max_{\mu, \Sigma} p(X | \mu, \Sigma)
\frac{\partial}{\partial \mu} p(X | \mu, \Sigma) = 0 \quad and \quad \frac{\partial}{\partial \Sigma} p(X | \mu, \Sigma) = 0
Exercise: If p is the Gaussian pdf, then the above has an analytical solution.
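
For the Gaussian, the analytical maximizers are the sample mean and the sample covariance with 1/M normalisation; the sketch below checks this on synthetic data (all values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.3], [0.3, 0.5]], size=5000)

# Analytical maximum-likelihood solution for a multivariate Gaussian:
mu_ml = X.mean(axis=0)                       # sample mean
diff = X - mu_ml
Sigma_ml = diff.T @ diff / len(X)            # sample covariance (1/M normalisation)

print(mu_ml.round(2))                        # close to the true mean
print(Sigma_ml.round(2))                     # close to the true covariance
```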

Expectation-Maximization (E-M)
EM is used when no closed-form solution exists for the maximum likelihood.
Example: fit a Mixture of Gaussian Functions:
p(x; \Theta) = \sum_{k=1}^{K} \alpha_k\, p(x; \mu^k, \Sigma^k), \quad with \Theta = \{\mu^1, \Sigma^1, ..., \mu^K, \Sigma^K\}, \; \alpha_k \in [0, 1]
A linear weighted combination of K Gaussian functions.
[Figure: example of a uni-dimensional mixture of 3 Gaussian functions, f = 1/3 (f1 + f2 + f3).]

Expectation-Maximization (E-M)
There is no closed-form solution to:
\max_{\Theta} L(X | \Theta) = \max_{\Theta} p(X | \Theta) = \max_{\Theta} \prod_{i=1}^{M} \sum_{k=1}^{K} \alpha_k\, p(x^i; \mu^k, \Sigma^k)
E-M is an iterative procedure to estimate the best set of parameters. It converges to a local optimum and is sensitive to initialization!

Expectation-Maximization (E-M)
EM is an iterative procedure:
0) Make a guess: pick a set of \Theta = \{\mu^1, \Sigma^1, ..., \mu^K, \Sigma^K\} (initialization)
1) Compute the likelihood of the initial model L(\hat{\Theta} | X, Z) (E-Step)
2) Update \Theta by gradient ascent on L(\Theta | X, Z)
3) Iterate between steps 1 and 2 until reaching a plateau (no improvement on the likelihood)
E-M converges to a local optimum only!

Expectation-Maximization (E-M)
Estimate the joint density p(x, y) across pairs of datapoints using a mixture of two Gauss functions:
p(x, y) = \sum_{i=1}^{K} \alpha_i\, p(x, y; \mu^i, \Sigma^i), \quad with \; p(x, y; \mu^i, \Sigma^i) = N(\mu^i, \Sigma^i)
\mu^i, \Sigma^i: mean and covariance matrix of Gaussian i.
The parameters are learned through the iterative procedure E-M, starting with a random initialization.
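
A compact sketch of the procedure for a 1-D mixture of K Gaussians follows (all names, data, and the seed are our own). For simplicity the M-step uses the standard closed-form re-estimation formulas rather than the gradient ascent mentioned on the previous slide:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, K=2, n_iter=100, seed=0):
    """Minimal EM for a 1-D mixture of K Gaussians."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=K, replace=False)        # random initialisation
    sigma = np.full(K, x.std())
    alpha = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] ~ alpha_k p(x_i; mu_k, sigma_k)
        r = alpha * gauss_pdf(x[:, None], mu, sigma)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and standard deviations
        Nk = r.sum(axis=0)
        alpha = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return alpha, mu, sigma

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.5, 1.0, 700)])
print(em_gmm_1d(x, K=2))       # different seeds may converge to different local optima
```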

Expectation-Maximization (E-M): Exercise
Demonstrate the non-convexity of the likelihood with an example.
E-M converges to a local optimum only!

Expectation-Maximization (E-M): Solution
The assignment of Gaussian 1 or 2 to one of the two clusters yields the same likelihood. The likelihood function therefore has two maxima. E-M converges to one of the two solutions depending on the initialization.

Determining the number of components to estimate a pdf with a mixture of Gaussians through Maximum Likelihood.
Matlab and mldemos examples.

Determining Number of Components: Bayesian Information Criterion
BIC = -2 \ln L + B \ln(M)
L: maximum likelihood of the model given B parameters
M: number of datapoints
A lower BIC implies either fewer explanatory variables, a better fit, or both. The BIC penalizes too small an increment in likelihood that leads to too large an increase in the number of parameters (a case of overfitting).
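
A sketch of how the criterion can be applied to choose the number of components K (the log-likelihood values below are hypothetical, purely to illustrate the formula; B counts the free parameters of a 1-D mixture: K-1 weights, K means, K variances):

```python
import numpy as np

def bic(log_lik, B, M):
    """Bayesian Information Criterion: BIC = -2 ln L + B ln(M)."""
    return -2.0 * log_lik + B * np.log(M)

# Hypothetical maximum log-likelihoods of 1-D mixtures with K = 1..4 components,
# fitted to the same M = 1000 datapoints (illustrative values, not real results).
M = 1000
for K, log_lik in [(1, -2150.0), (2, -1880.0), (3, -1876.0), (4, -1874.5)]:
    B = (K - 1) + K + K
    print(K, round(bic(log_lik, B, M), 1))   # the lowest BIC is reached at K = 2 here
```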
