MACHINE LEARNING / ADVANCED MACHINE LEARNING
Recap of Important Notions on Estimation of Probability Density Functions
Discrete Probabilities

Consider two variables x and y taking discrete values over the intervals [1...N_x] and [1...N_y] respectively.

P(x = i): the probability that the variable x takes value i.

0 <= P(x = i) <= 1 for all i = 1,..., N_x, and Σ_{i=1}^{N_x} P(x = i) = 1.

Idem for P(y = j), j = 1,..., N_y.
Probability Distributions, Density Functions

p(x), a continuous function, is the probability density function or probability distribution function (PDF) (sometimes also called probability distribution or simply density) of variable x.
PDF: Example

The uni-dimensional Gaussian or Normal distribution is a pdf given by:

p(x) = 1/(σ√(2π)) · exp(−(x − µ)² / (2σ²)),  µ: mean, σ²: variance

The Gaussian function is entirely determined by its mean and variance. For this reason, it is referred to as a parametric distribution.

Illustrations from Wikipedia
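The formula above translates directly into code; a minimal Python sketch (the function name `gaussian_pdf` is ours, not from the slides):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Uni-dimensional Gaussian pdf: p(x) = exp(-(x-mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi))."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
```

At x = µ the density peaks at 1/(σ√(2π)), the normalization constant of the exponential.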
Expectation

The expectation of the random variable x with probability P(x) (in the discrete case) or pdf p(x) (in the continuous case), also called the expected value or mean, is the mean of the observed values of x weighted by p(x). If X is the set of observations of x, then:

When x takes discrete values: µ = E{x} = Σ_{x∈X} x P(x)

For continuous distributions: µ = E{x} = ∫_X x p(x) dx
Variance

σ², the variance of a distribution, measures the amount of spread of the distribution around its mean:

σ² = Var(x) = E{(x − µ)²} = E{x²} − [E{x}]²

σ is the standard deviation of x.
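The empirical counterparts of µ and σ² over a finite sample can be sketched as follows (helper name is ours):

```python
def mean_and_variance(samples):
    """Empirical estimates of mu = E{x} and sigma^2 = E{(x - mu)^2}."""
    m = len(samples)
    mu = sum(samples) / m
    var = sum((x - mu) ** 2 for x in samples) / m  # spread around the mean
    return mu, var
```

Note that this matches the identity σ² = E{x²} − [E{x}]² up to floating-point rounding.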
Mean and Variance in the Gauss PDF

~68% of the data are comprised between +/− 1 sigma.
~95% of the data are comprised between +/− 2 sigmas.
~99.7% of the data are comprised between +/− 3 sigmas.

This is no longer true for arbitrary pdfs!

Illustrations from Wikipedia
Mean and Variance in Mixture of Gauss Functions

[Figure: 3 Gaussian distributions and the resulting distribution f = 1/3 (f1 + f2 + f3) when superposing them; the expectation and the standard deviation sigma ≈ 1.68 of the mixture are marked]

For pdfs other than the Gaussian distribution, the variance represents a notion of dispersion around the expected value.
Multi-dimensional Gaussian Function

The uni-dimensional Gaussian or Normal distribution is a distribution with pdf given by:

p(x; µ, σ) = 1/(σ√(2π)) · exp(−(x − µ)² / (2σ²)),  µ: mean, σ²: variance

The multi-dimensional Gaussian or Normal distribution has a pdf given by:

p(x; µ, Σ) = (2π)^(−N/2) |Σ|^(−1/2) · exp(−(1/2) (x − µ)^T Σ^(−1) (x − µ))

If x is N-dimensional, then µ is an N-dimensional mean vector and Σ is an N×N covariance matrix.
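The multi-dimensional density above can be evaluated with standard linear algebra; a minimal NumPy sketch (function name is ours):

```python
import numpy as np

def multivariate_gaussian_pdf(x, mu, sigma):
    """p(x; mu, Sigma) = (2 pi)^(-N/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^-1 (x-mu))."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n = x.shape[0]
    diff = x - mu
    norm = (2.0 * np.pi) ** (-n / 2.0) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)
```

With Σ = I and x = µ, the density reduces to (2π)^(−N/2), the N-dimensional normalization constant.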
2-dimensional Gaussian Pdf

[Figure: surface plot of p(x1, x2) with isolines p(x) = cst]

p(x; µ, Σ) = (2π)^(−N/2) |Σ|^(−1/2) · exp(−(1/2) (x − µ)^T Σ^(−1) (x − µ))

If x is N-dimensional, then µ is an N-dimensional mean vector and Σ is an N×N covariance matrix.
Modeling Data with a Gaussian Function

Construct the covariance matrix from the (centered) set of M datapoints X = {x^i}_{i=1...M}:

Σ = (1/M) X X^T

p(x; µ, Σ) = (2π)^(−N/2) |Σ|^(−1/2) · exp(−(1/2) (x − µ)^T Σ^(−1) (x − µ))

If x is N-dimensional, then µ is an N-dimensional mean vector and Σ is an N×N covariance matrix.
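The construction Σ = (1/M) X Xᵀ can be sketched as follows, with one datapoint per column (the centering step is made explicit; helper name is ours):

```python
import numpy as np

def covariance_from_data(X):
    """Sigma = (1/M) X X^T for an N x M data matrix X (one datapoint per column)."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=1, keepdims=True)  # center the data first
    M = X.shape[1]
    return (X @ X.T) / M
```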
Modeling Data with a Gaussian Function

Construct the covariance matrix from the (centered) set of M datapoints X = {x^i}_{i=1...M}: Σ = (1/M) X X^T.

Σ is square and symmetric. It can be decomposed using the eigenvalue decomposition:

Σ = V Λ V^T,  Λ = diag(λ1, ..., λN)

V: matrix of eigenvectors, Λ: diagonal matrix composed of the eigenvalues.

For the 1-std ellipse, the axes' lengths along the 1st and 2nd eigenvectors are equal to √λ1 and √λ2, with Σ = V diag(λ1, λ2) V^T. Each isoline corresponds to a scaling of the 1-std ellipse.
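The eigenvalue decomposition of a symmetric Σ, and the 1-std ellipse axes derived from it, can be sketched for an example covariance matrix (the matrix values are our own illustration):

```python
import numpy as np

sigma = np.array([[2.0, 1.0], [1.0, 2.0]])  # example symmetric covariance matrix
lam, V = np.linalg.eigh(sigma)              # Sigma = V Lambda V^T (eigh: symmetric case)
axis_lengths = np.sqrt(lam)                 # 1-std ellipse axes along the eigenvectors
sigma_rec = V @ np.diag(lam) @ V.T          # reconstruct Sigma from its decomposition
```

`numpy.linalg.eigh` returns the eigenvalues in ascending order, which is convenient for ranking the principal directions.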
Fitting a Single Gauss Function and PCA

PCA identifies a suitable representation of a multivariate data set by decorrelating the dataset.

When projected onto the 1st and 2nd eigenvectors e^1 and e^2, the set of datapoints appears to follow two uncorrelated Normal distributions:

p((e^1)^T x | X) ~ N((e^1)^T µ, λ1)

p((e^2)^T x | X) ~ N((e^2)^T µ, λ2)
Marginal Pdf

Consider two random variables x and y with joint distribution p(x, y); then the marginal probability of x is given by:

p(x) = ∫ p(x, y) dy

(Lower indices are dropped for clarity of notation.) p(x) is a Gaussian when p(x, y) is also a Gaussian.

To compute the marginal, one needs the joint distribution p(x, y). Often, one does not know it and one can only estimate it.

If x is a multidimensional variable → the marginal is a joint distribution!
Joint vs Marginal and the Curse of Dimensionality

The joint distribution p(x, y) is richer than the marginals p(x), p(y).

In the discrete case: estimating the marginals of N variables, each taking K values, requires estimating N(K − 1) probabilities. Estimating the joint distribution corresponds to computing ~K^N probabilities.
Marginal & Conditional Pdf

Consider two random variables x and y with joint distribution p(x, y); then the marginal probability of x is given by:

p(x) = ∫ p(x, y) dy

The conditional probability of y given x is given by:

p(y | x) = p(x, y) / p(x)

Computing the conditional from the joint distribution requires estimating the marginal of one of the two variables. If you need only the conditional, estimate it directly!
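In the discrete case, marginalization and conditioning reduce to sums and ratios over a joint table; a small sketch with a hypothetical 2×2 joint distribution (all names and values are ours):

```python
# Hypothetical joint table P(x, y) over x in {"a", "b"} and y in {0, 1}.
joint = {("a", 0): 0.1, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.2}

def marginal_x(joint):
    """p(x) = sum_y p(x, y): marginalize y out of the joint table."""
    p = {}
    for (x, _y), v in joint.items():
        p[x] = p.get(x, 0.0) + v
    return p

def conditional_y_given_x(joint, x):
    """p(y | x) = p(x, y) / p(x): condition the joint table on a value of x."""
    px = marginal_x(joint)[x]
    return {y: v / px for (xx, y), v in joint.items() if xx == x}
```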
Joint vs Conditional: Example

Here we care about predicting y = hitting angle given x = position, but not the converse. One can estimate p(y | x) directly and use it to predict ŷ ~ E{p(y | x)}.
Joint Distribution Estimation and Curse of Dimensionality

Pros of computing the joint distribution: it provides the statistical dependencies across all variables and the marginal distributions.

Cons: computational costs grow exponentially with the number of dimensions (statistical power: samples needed to estimate each parameter of a model).

→ Compute solely the conditional if you care only about dependencies across variables.
The conditional and marginal pdfs of a multi-dimensional Gauss function are all Gauss functions!

[Figure: 2-D Gaussian with the marginal density of x and the marginal density of y. Illustrations from Wikipedia]
PDF equivalency with Discrete Probability

The cumulative distribution function (or simply distribution function) of x is:

D(x) = ∫_{−∞}^{x} p(x*) dx*

p(x) dx ~ probability of x to fall within an infinitesimal interval [x, x + dx]
PDF equivalency with Discrete Probability

[Figure: uniform distribution p(x)]

The probability that x takes a value in the subinterval [a, b] is given by:

P(x ≤ b) := D(b) = ∫_{−∞}^{b} p(x*) dx*

P(a ≤ x ≤ b) = ∫_{a}^{b} p(x) dx = D(b) − D(a)
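For the Gaussian, D(x) has a closed form in terms of the error function, so interval probabilities D(b) − D(a) need no numerical integration; a minimal sketch (function names are ours):

```python
import math

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """D(x) for the Gaussian: 0.5 * (1 + erf((x - mu) / (sigma sqrt(2))))."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_in_interval(a, b, mu=0.0, sigma=1.0):
    """P(a <= x <= b) = D(b) - D(a)."""
    return gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma)
```

This recovers the earlier rule of thumb: about 68% of the mass lies within one sigma of the mean.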
PDF equivalency with Discrete Probability

[Figure: Gaussian pdf (top) and its cumulative distribution (bottom)]
Joint and conditional probabilities

The joint probability of two variables x and y is written: p(x, y)

The conditional probability is computed as follows:

p(x | y) = p(x, y) / p(y)

p(x | y) = p(y | x) p(x) / p(y)   (Bayes' theorem)
Example of use of Bayes rule in ML

Naïve Bayes classification can be used to determine the likelihood of the class label.

Bayes rule for binary classification: x has class label +1 if p(y = +1 | x) > δ, δ ≥ 0; otherwise x takes class label −1.

[Figure: Gaussian P(y = +1 | x) with threshold δ marking the interval [x_min, x_max]]
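The rule above can be sketched end-to-end for one feature: compute the posterior p(y = +1 | x) with Bayes' theorem from two class-conditional Gaussians, then threshold it with δ. The class means ±1, the unit variances, and the equal priors are our illustrative assumptions, not values from the slides:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_pos(x, prior_pos=0.5, mu_pos=1.0, mu_neg=-1.0, sigma=1.0):
    """p(y=+1|x) = p(x|y=+1) P(y=+1) / [p(x|y=+1) P(y=+1) + p(x|y=-1) P(y=-1)]."""
    num = gaussian_pdf(x, mu_pos, sigma) * prior_pos
    den = num + gaussian_pdf(x, mu_neg, sigma) * (1.0 - prior_pos)
    return num / den

def classify(x, delta=0.5):
    """x gets label +1 if p(y=+1|x) > delta, else -1."""
    return +1 if posterior_pos(x) > delta else -1
```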
The probability that x takes a value in the subinterval [x_min, x_max] is given by:

P(x_min ≤ x ≤ x_max) = D(x_max) − D(x_min)

[Figure: cumulative distribution D(x) and Gaussian P(y = +1 | x) with cutoff δ over the interval [x_min, x_max]]
Example of use of Bayes rule in ML

[Figure: Gaussian pdf and its cumulative distribution; probability of x within the interval = 0.939 with the chosen cutoff]
Example of use of Bayes rule in ML

[Figure: Gaussian pdf and its cumulative distribution; probability of x within the interval = 0.38995 with cutoff 0.35]
Example of use of Bayes rule in ML

[Figure: Gaussian pdf and its cumulative distribution over a wider interval; probability of x within the interval = 0.99994 with the chosen cutoff]
Example of use of Bayes rule in ML

The choice of cut-off δ in Bayes rule corresponds to some probability of seeing an event, e.g. for x to belong to class y = +1.

True only when the class follows a Gauss distribution!
Example of use of Bayes rule in ML

Naïve Bayes classification for binary classification. Each class is modeled with a two-dimensional Gauss function.

Bayes rule for binary classification: x has class label +1 if p(y = +1 | x) > δ, δ ≥ 0; otherwise x takes class label y = −1.

The threshold δ determines the boundary.
Example of use of Bayes rule in ML

The ROC (Receiver Operating Characteristic) curve shows the influence of the threshold δ: compute the false and true positives for each value of δ in this interval.
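Tracing the ROC curve amounts to sweeping δ and counting true/false positive rates at each value; a minimal sketch (function name and score values are ours):

```python
def roc_points(scores_pos, scores_neg, thresholds):
    """For each threshold delta, return (false positive rate, true positive rate).

    scores_pos: classifier scores of the truly positive samples
    scores_neg: classifier scores of the truly negative samples
    """
    points = []
    for d in thresholds:
        tpr = sum(s > d for s in scores_pos) / len(scores_pos)  # true positives
        fpr = sum(s > d for s in scores_neg) / len(scores_neg)  # false positives
        points.append((fpr, tpr))
    return points
```

Plotting these (fpr, tpr) pairs over a fine grid of δ values yields the ROC curve.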
Likelihood Function

The likelihood function (or likelihood for short) determines the probability density of observing the set X = {x^i}_{i=1}^{M} of M datapoints, if each datapoint has been generated by the pdf p(x):

L(X) = p(x^1, x^2, ..., x^M)

If the data are independently and identically distributed (i.i.d.), then:

L(X) = Π_{i=1}^{M} p(x^i)

The likelihood can be used to determine how well a particular choice of density p(x) models the data at hand.
Likelihood of Gaussian Pdf Parameterization

Consider that the pdf of the dataset X is parametrized with parameters µ, Σ. One writes: p(X; µ, Σ) or p(X | µ, Σ).

The likelihood function (short: likelihood) of the model parameters is given by:

L(µ, Σ | X) := p(X; µ, Σ)

It measures the probability of observing X if the distribution of X is parametrized with µ, Σ.

If all datapoints are identically and independently distributed (i.i.d.):

L(µ, Σ | X) = Π_{i=1}^{M} p(x^i; µ, Σ)

To determine the best fit, search for the parameters that maximize the likelihood.
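In practice one works with the log of the likelihood, since the product Π_i p(x^i) of many small densities underflows; a sketch for the uni-dimensional Gaussian (function name is ours):

```python
import math

def gaussian_log_likelihood(data, mu, sigma):
    """log L(mu, sigma | X) = sum_i log p(x_i; mu, sigma); the log turns the product into a sum."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - (x - mu) ** 2 / (2.0 * sigma ** 2) for x in data)
```

Maximizing the log-likelihood is equivalent to maximizing the likelihood, as log is monotonic.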
Likelihood Function: Example

[Figure: four candidate Gaussian models fitted to the same dataset (fraction of observations of x), each annotated with its likelihood value]

Which one is the most likely?
Likelihood Function

[Figure: examples of the likelihood for a series of Gauss functions applied to non-Gauss datasets]
Maximum Likelihood Optimization

The principle of maximum likelihood consists of finding the optimal parameters of a given distribution by maximizing the likelihood function of these parameters, equivalently by maximizing the probability of the data given the model and its parameters.

For a multi-variate Gaussian pdf, one can determine the mean and covariance matrix by solving:

max_{µ,Σ} L(µ, Σ | X) = max_{µ,Σ} p(X | µ, Σ)

∂p(X | µ, Σ) / ∂µ = 0  and  ∂p(X | µ, Σ) / ∂Σ = 0

Exercise: If p is the Gaussian pdf, then the above has an analytical solution.
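The analytical solution referred to in the exercise is the empirical mean and covariance; a sketch of these closed-form maximum-likelihood estimates (function name is ours):

```python
import numpy as np

def gaussian_mle(X):
    """Closed-form ML estimates for a multivariate Gaussian.

    mu_hat    = (1/M) sum_i x_i
    sigma_hat = (1/M) sum_i (x_i - mu_hat)(x_i - mu_hat)^T
    X: array of shape (M, N), one datapoint per row.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    D = X - mu
    sigma = D.T @ D / X.shape[0]
    return mu, sigma
```

Note the 1/M factor (not 1/(M−1)): the ML covariance estimate is biased but is what maximizes the likelihood.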
Expectation-Maximization (E-M)

E-M is used when no closed-form solution exists for the maximum likelihood.

Example: fit a Mixture of Gaussian Functions:

p(x; Θ) = Σ_{k=1}^{K} α_k p_k(x; µ^k, Σ^k), with Θ = {µ^1, Σ^1, ..., µ^K, Σ^K}, α_k ∈ [0, 1]

A linear weighted combination of K Gaussian functions.

[Figure: example of a uni-dimensional mixture of 3 Gaussian functions, f = 1/3 (f1 + f2 + f3)]
Expectation-Maximization (E-M)

E-M is used when no closed-form solution exists for the maximum likelihood.

Example: fit a Mixture of Gaussian Functions, a linear weighted combination of K Gaussian functions. There is no closed-form solution to:

max_Θ L(Θ | X) = max_Θ p(X | Θ)

max_Θ L(Θ | X) = max_Θ Π_{i=1}^{M} Σ_{k=1}^{K} α_k p_k(x^i; µ^k, Σ^k)

E-M is an iterative procedure to estimate the best set of parameters. It converges to a local optimum → sensitive to initialization!
Expectation-Maximization (E-M)

E-M is an iterative procedure:

0) Make a guess: pick a set Θ = {µ^1, Σ^1, ..., µ^K, Σ^K} (initialization)
1) Compute the likelihood of the current model L(Θ̂ | X, Z) (E-step)
2) Update Θ by gradient ascent on L(Θ | X, Z)
3) Iterate between steps 1 and 2 until reaching a plateau (no improvement on the likelihood)

E-M converges to a local optimum only!
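The iterative procedure above can be sketched for a uni-dimensional mixture of K Gaussians. This is our own minimal illustration, not the lecture's implementation: the initialization spreads the means over the data range, the E-step computes responsibilities, and the M-step re-estimates α, µ, σ² from them:

```python
import math

def em_gmm_1d(data, k=2, iters=50):
    """E-M sketch for a 1-D mixture of k Gaussians; converges to a local optimum only."""
    lo, hi = min(data), max(data)
    mu = [lo + (hi - lo) * j / (k - 1) for j in range(k)]  # spread initial means
    var = [1.0] * k
    alpha = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities r_ij proportional to alpha_j N(x_i; mu_j, var_j)
        resp = []
        for x in data:
            w = [alpha[j] / math.sqrt(2.0 * math.pi * var[j])
                 * math.exp(-(x - mu[j]) ** 2 / (2.0 * var[j])) for j in range(k)]
            s = sum(w)
            resp.append([wj / s for wj in w])
        # M-step: re-estimate alpha_j, mu_j, var_j from the responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            alpha[j] = nj / len(data)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj + 1e-9
    return alpha, mu, var
```

A different initialization of the means can drive the same data to a different (locally optimal) solution, which is exactly the sensitivity discussed above.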
Expectation-Maximization (E-M): Solution

The assignment of Gaussian 1 or 2 to one of the two clusters yields the same likelihood → the likelihood function has two equivalent optima. E-M converges to one of the two solutions depending on the initialization.
Expectation-Maximization (E-M): Exercise

Demonstrate the non-convexity of the likelihood with an example.

E-M converges to a local optimum only!
Determining the number of components to estimate a pdf with a mixture of Gaussians through Maximum Likelihood

Matlab and mldemos examples
APPLIED MACHINE LEARNING

Determining Number of Components: Bayesian Information Criterion

BIC = −2 ln L + B ln(M)

L: maximum likelihood of the model given its B parameters
M: number of datapoints

A lower BIC implies either fewer explanatory variables, a better fit, or both.
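The criterion is a one-line computation once the maximum log-likelihood is known; a minimal sketch (function and argument names are ours):

```python
import math

def bic(log_likelihood, n_params, n_points):
    """BIC = -2 ln L + B ln(M); lower is better.

    log_likelihood: ln L at the maximum-likelihood parameters
    n_params:       B, number of free parameters of the model
    n_points:       M, number of datapoints
    """
    return -2.0 * log_likelihood + n_params * math.log(n_points)
```

To pick the number of mixture components, fit models with K = 1, 2, 3, ... and keep the K with the lowest BIC: the ln(M) penalty on B counterbalances the likelihood gain of adding components.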
Expectation-Maximization (E-M)

Two-dimensional view of a Gaussian Mixture Model. Conditional probability of a 2-dim. mixture:
→ Interpretation of modes
→ Climbing the gradient
→ Show applications
→ Variability (multiple paths)
→ Two different paths (multiple modes)
→ Choose one by climbing the gradient
Maximum Likelihood

Example: We have the realization of N Boolean random variables x_1, ..., x_N, independent, of same parameter q, and want to estimate q. Let s denote the sum of the observed values. We introduce:

p(q) = P(x_1, ..., x_N | q) = Π_{n: x_n = 1} q · Π_{n: x_n = 0} (1 − q) = q^s (1 − q)^(N − s)

The estimate of q following that principle is the solution of p'(q) = 0, hence q = s/N.
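The resulting estimator is just the empirical frequency of ones; a minimal sketch (function name is ours), whose optimality can be checked numerically against the likelihood q^s (1 − q)^(N − s):

```python
def bernoulli_mle(observations):
    """q_hat = s / N: the ML estimate for N i.i.d. Boolean observations."""
    return sum(observations) / len(observations)
```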
ML in Practice: Caveats on Statistical Measures

A large number of the algorithms we will see in class require knowing the mean and covariance of the probability distribution function of the data. In practice, the class means and covariances are not known. They can, however, be estimated from the training set. Either the maximum likelihood estimate or the maximum a posteriori estimate may be used in place of the exact value.

Several of the algorithms used to estimate these assume that the underlying distribution follows a normal distribution. This is usually not true. Thus, one should keep in mind that, although the estimates of the covariance may be considered optimal, this does not mean that the resulting computation obtained by substituting these values is optimal, even if the assumption of normally distributed classes is correct.
ML in Practice: Caveats on Statistical Measures

Another complication that you will often encounter when dealing with algorithms that require computing the inverse of the covariance matrix of the data is that, with real data, the number of dimensions of each sample may exceed the number of samples. In this case, the covariance estimates do not have full rank, i.e. they cannot be inverted.

There are a number of ways to deal with this. One is to use the pseudoinverse of the covariance matrix. Another is to proceed through singular value decomposition (SVD).
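The rank-deficient case and the pseudoinverse workaround can be sketched with a toy dataset of our own: 2 datapoints in 3 dimensions, so the estimated Σ cannot have full rank. NumPy's `pinv` computes the Moore-Penrose pseudoinverse via SVD:

```python
import numpy as np

# Fewer samples (2) than dimensions (3): Sigma is rank-deficient, not invertible.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])          # one datapoint per row
Xc = X - X.mean(axis=0)                  # center the data
sigma = Xc.T @ Xc / X.shape[0]           # 3x3 covariance estimate, rank < 3
sigma_pinv = np.linalg.pinv(sigma)       # Moore-Penrose pseudoinverse (via SVD)
```

The pseudoinverse satisfies Σ Σ⁺ Σ = Σ even though Σ⁻¹ does not exist here.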
Marginal, Conditional in Pdf

Consider two random variables x1 and x2 with joint distribution p(x1, x2); then the marginal probability of x1 is:

p(x1) = ∫ p(x1, x2) dx2

The conditional probability is given by:

p(x1 | x2) = p(x1, x2) / p(x2)  ⟺  p(x1, x2) = p(x1 | x2) p(x2)
Conditional Pdf and Statistical Independence

x1 and x2 are statistically independent if:

p(x1 | x2) = p(x1) and p(x2 | x1) = p(x2)
⟹ p(x1, x2) = p(x1) p(x2)

x1 and x2 are uncorrelated if cov(x1, x2) = 0:

cov(x1, x2) = E{x1 x2} − E{x1} E{x2}

cov(x1, x2) = 0 ⟹ E{x1 x2} = E{x1} E{x2}

The expectation of the joint distribution is equal to the product of the expectations of each distribution separately.
Statistical independence and uncorrelatedness

Independent ⟹ uncorrelated: p(x1, x2) = p(x1) p(x2) ⟹ E{x1 x2} = E{x1} E{x2}

Uncorrelated ⇏ independent: p(x1, x2) = p(x1) p(x2) ⇍ E{x1 x2} = E{x1} E{x2}

Statistical independence ensures uncorrelatedness. The converse is not true.
Statistical independence and uncorrelatedness

         X = -1   X = 0   X = 1   Total
Y = -1   3/12     0       3/12    1/2
Y = 1    1/12     4/12    1/12    1/2
Total    1/3      1/3     1/3     1

X and Y are uncorrelated but not statistically independent.