MACHINE LEARNING ADVANCED MACHINE LEARNING


MACHINE LEARNING / ADVANCED MACHINE LEARNING: Recap of Important Notions on Estimation of Probability Density Functions

MACHINE LEARNING: Discrete Probabilities
Consider two variables x and y taking discrete values over the intervals [1...N_x] and [1...N_y] respectively. $P(x = i)$ is the probability that the variable x takes value i, with $P(x = i) \ge 0$ for all $i = 1, \dots, N_x$ and $\sum_{i=1}^{N_x} P(x = i) = 1$. Idem for $P(y = j)$, $j = 1, \dots, N_y$.

MACHINE LEARNING: Probability Distributions, Density Functions
p(x), a continuous function, is the probability density function or probability distribution function (PDF) (sometimes also called probability distribution or simply density) of variable x.

MACHINE LEARNING: PDF Example
The uni-dimensional Gaussian or Normal distribution is a pdf given by:
$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, with $\mu$: mean, $\sigma^2$: variance.
The Gaussian function is entirely determined by its mean and variance. For this reason, it is referred to as a parametric distribution. (Illustrations from Wikipedia)
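As a quick numerical illustration (not part of the original slides), the sketch below evaluates this density with NumPy/SciPy; the values of mu and sigma are arbitrary placeholders.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.5          # hypothetical mean and standard deviation
x = np.linspace(-6.0, 6.0, 201)

# Closed-form Gaussian density, matching the formula above
p_manual = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Same density via SciPy, as a cross-check
p_scipy = norm.pdf(x, loc=mu, scale=sigma)
assert np.allclose(p_manual, p_scipy)
```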

MACHINE LEARNING: Expectation
The expectation of the random variable x with probability P(x) (in the discrete case) and pdf p(x) (in the continuous case), also called the expected value or mean, is the mean of the observed values of x weighted by p(x). If X is the set of observations of x, then:
When x takes discrete values: $\mu = E\{x\} = \sum_{x \in X} x\, P(x)$
For continuous distributions: $\mu = E\{x\} = \int_X x\, p(x)\, dx$

MACHINE LEARNING: Variance
$\sigma^2$, the variance of a distribution, measures the amount of spread of the distribution around its mean:
$\sigma^2 = \mathrm{Var}(x) = E\{(x - \mu)^2\} = E\{x^2\} - \left[E\{x\}\right]^2$
$\sigma$ is the standard deviation of x.

MACHINE LEARNING: Mean and Variance in the Gaussian PDF
~68% of the data lie within +/- 1 sigma of the mean, ~96% within +/- 2 sigmas, and ~99% within +/- 3 sigmas. This is no longer true for arbitrary pdfs! (Illustrations from Wikipedia)

MACHINE LEARNING: Mean and Variance in a Mixture of Gaussian Functions
[Figure: 3 Gaussian distributions and the resulting distribution when superposing them, f = (f1 + f2 + f3)/3.]
For pdfs other than the Gaussian distribution, the variance represents a notion of dispersion around the expected value.

MACHINE LEARNING: Multi-dimensional Gaussian Function
The uni-dimensional Gaussian or Normal distribution is a distribution with pdf given by:
$p(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, with $\mu$: mean, $\sigma^2$: variance.
The multi-dimensional Gaussian or Normal distribution has a pdf given by:
$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$
If x is N-dimensional, then $\mu$ is an N-dimensional mean vector and $\Sigma$ is an $N \times N$ covariance matrix.
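A minimal sketch (assuming NumPy and SciPy; the 2-D parameters below are made up for illustration) comparing the closed-form multivariate density with scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-D parameters
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, 0.0])

N = len(mu)
diff = x - mu
# Density from the closed-form expression above
p_manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) \
           / np.sqrt((2 * np.pi) ** N * np.linalg.det(Sigma))

# Cross-check with SciPy
p_scipy = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
assert np.isclose(p_manual, p_scipy)
```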

MACHINE LEARNING: 2-dimensional Gaussian Pdf
$p(x_1, x_2)$, with
$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$
Isolines: $p(x) = \text{cst}$.
If x is N-dimensional, then $\mu$ is an N-dimensional mean vector and $\Sigma$ is an $N \times N$ covariance matrix.

MACHINE LEARNING: Modeling Data with a Gaussian Function
Construct the covariance matrix from the (centered) set of datapoints $X = \{x^i\}_{i=1,\dots,M}$:
$\Sigma = \frac{1}{M} X X^T$
$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2}\, |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$
If x is N-dimensional, then $\mu$ is an N-dimensional mean vector and $\Sigma$ is an $N \times N$ covariance matrix.

MACHINE LEARNING: Modeling Data with a Gaussian Function
$\Sigma$ is square and symmetric. It can be decomposed using the eigenvalue decomposition:
$\Sigma = V \Lambda V^T$, with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_N)$
V: matrix of eigenvectors; $\Lambda$: diagonal matrix composed of the eigenvalues.
Construct the covariance matrix from the (centered) set of datapoints $X = \{x^i\}_{i=1,\dots,M}$: $\Sigma = \frac{1}{M} X X^T$.
For the 1-std ellipse, the axes (along the 1st and 2nd eigenvectors) have lengths $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$, with $\Sigma = V \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} V^T$. Each isoline corresponds to a scaling of the std ellipse.
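The construction above can be sketched in a few lines of NumPy (synthetic data chosen only for illustration); note that here the data matrix stores one datapoint per row, so the covariance is computed as Xc^T Xc / M:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=500)  # M x N data

# Center the data and build the covariance matrix
Xc = X - X.mean(axis=0)
M = Xc.shape[0]
Sigma = Xc.T @ Xc / M

# Eigenvalue decomposition Sigma = V Lambda V^T (eigh is meant for symmetric matrices)
eigvals, V = np.linalg.eigh(Sigma)

# Semi-axes of the 1-std ellipse lie along the eigenvectors, with lengths sqrt(lambda_i)
axis_lengths = np.sqrt(eigvals)
```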

MACHINE LEARNING: Fitting a Single Gaussian Function and PCA
PCA identifies a suitable representation of a multivariate data set by decorrelating the dataset.
When projected onto the eigenvectors $e^1$ and $e^2$, the set of datapoints appears to follow two uncorrelated Normal distributions: the projection onto eigenvector $e^i$ follows a Normal distribution with mean $(e^i)^T \mu$ and variance given by the corresponding eigenvalue $\lambda_i$.
[Figure: datapoints with the 1st and 2nd eigenvectors overlaid.]

MACHINE LEARNING: Marginal Pdf
Consider two random variables x and y with joint distribution p(x, y); then the marginal probability of x is given by:
$p(x) = \int p(x, y)\, dy$
(Indices are dropped for clarity of notation.) p(x) is a Gaussian when p(x, y) is also a Gaussian.
To compute the marginal, one needs the joint distribution p(x, y). Often one does not know it and can only estimate it.
If x is a multidimensional variable, the marginal is itself a joint distribution!

MACHINE LEARNING: Joint vs. Marginal and the Curse of Dimensionality
The joint distribution p(x, y) is richer than the marginals p(x), p(y).
In the discrete case: estimating the marginals of N variables, each taking K values, requires estimating $N(K-1)$ probabilities. Estimating the joint distribution corresponds to computing $\sim K^N$ probabilities.
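To make the gap concrete, here is a small worked instance with numbers chosen purely for illustration (N = 10 variables, each taking K = 10 values):

```latex
\underbrace{N(K-1)}_{\text{marginals}} = 10 \times 9 = 90
\qquad \text{vs.} \qquad
\underbrace{K^{N}}_{\text{joint}} = 10^{10} \ \text{probabilities.}
```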

MACHINE LEARNING: Marginal & Conditional Pdf
Consider two random variables x and y with joint distribution p(x, y); then the marginal probability of x is given by:
$p(x) = \int p(x, y)\, dy$
The conditional probability of y given x is given by:
$p(y \mid x) = \frac{p(x, y)}{p(x)}$
Computing the conditional from the joint distribution requires estimating the marginal of one of the two variables. If you need only the conditional, estimate it directly!

MACHINE LEARNING: Joint vs. Conditional: Example
Here we care about predicting y = hitting angle given x = position, but not the converse. One can estimate $p(y \mid x)$ directly and use it to predict $\hat{y} \approx E\{y \mid x\}$.

MACHINE LEARNING: Joint Distribution Estimation and the Curse of Dimensionality
Pros of computing the joint distribution: it provides the statistical dependencies across all variables as well as the marginal distributions.
Cons: the computational cost grows exponentially with the number of dimensions (statistical power: samples are needed to estimate each parameter of the model).
Therefore: compute solely the conditional if you care only about dependencies across variables.

MACHINE LEARNING
The conditional and marginal pdfs of a multi-dimensional Gaussian function are all Gaussian functions!
[Figure: a 2-D Gaussian with the marginal density of x and the marginal density of y. Illustrations from Wikipedia.]
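For reference, the standard closed forms behind this statement (a well-known property of the multivariate Gaussian, stated here for completeness rather than taken from the slides): partition $x = (x_a, x_b)$ with

```latex
\mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix};
\quad \text{then} \quad
p(x_a) = \mathcal{N}\!\left(x_a;\ \mu_a,\ \Sigma_{aa}\right),
\qquad
p(x_a \mid x_b) = \mathcal{N}\!\left(x_a;\ \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\ \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\right).
```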

MACHINE LEARNING: PDF Equivalency with Discrete Probability
The cumulative distribution function (or simply distribution function) of x is $D(x) = \int_{-\infty}^{x} p(x')\, dx'$.
$p(x)\, dx$ ~ probability of x falling within an infinitesimal interval $[x, x + dx]$.

MACHINE LEARNING: PDF Equivalency with Discrete Probability
[Figure: uniform distribution p(x).]
The probability that x takes a value in the subinterval [a, b] is given by:
$P(x \le b) := D(b) = \int_{-\infty}^{b} p(x)\, dx$
$P(a \le x \le b) = D(b) - D(a) = \int_{a}^{b} p(x)\, dx$
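A minimal sketch of this interval computation, assuming SciPy and a standard normal chosen only as an example:

```python
from scipy.stats import norm

a, b = -1.0, 2.0           # hypothetical interval [a, b]
mu, sigma = 0.0, 1.0       # standard normal, for illustration

# P(a <= x <= b) = D(b) - D(a), with D the cumulative distribution function
prob = norm.cdf(b, loc=mu, scale=sigma) - norm.cdf(a, loc=mu, scale=sigma)
print(prob)                # ~0.82 for the standard normal on [-1, 2]
```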

MACHINE LEARNING: PDF Equivalency with Discrete Probability
[Figure: a Gaussian pdf and its cumulative distribution.]

MACHINE LEARNING: Joint and Conditional Probabilities
The joint probability of two variables x and y is written p(x, y). The conditional probability is computed as follows:
$p(x \mid y) = \frac{p(x, y)}{p(y)}$
$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$ (Bayes' theorem)

MACHINE LEARNING: Example of Use of Bayes' Rule in ML
Naïve Bayes classification can be used to determine the likelihood of the class label. Bayes' rule for binary classification: x has class label +1 if $p(y = +1 \mid x) > d$, $d \ge 0$; otherwise x takes class label -1.
[Figure: Gaussian pdf of $P(y = +1 \mid x)$ with threshold d and the interval $[x_{min}, x_{max}]$ it induces.]

MACHINE LEARNING
The probability that x takes a value in the subinterval $[x_{min}, x_{max}]$ is given by:
$P(x_{min} \le x \le x_{max}) = D(x_{max}) - D(x_{min})$
[Figure: the cumulative distribution D(x) and the Gaussian pdf of $P(y = +1 \mid x)$ with threshold d and interval $[x_{min}, x_{max}]$.]

MACHINE LEARNING: Example of Use of Bayes' Rule in ML
[Figure: Gaussian pdf and its cumulative distribution; probability of x within the interval = 0.939 for the chosen cutoff.]

MACHINE LEARNING: Example of Use of Bayes' Rule in ML
[Figure: same example with cutoff 0.35; probability of x within the interval drops from 0.939 to 0.38995.]

MACHINE LEARNING: Example of Use of Bayes' Rule in ML
[Figure: same example over a wider interval (axis from -5 to 5); probability of x within the interval rises from 0.939 to 0.99994.]

MACHINE LEARNING: Example of Use of Bayes' Rule in ML
The choice of the cutoff in Bayes' rule corresponds to some probability of seeing an event, e.g. of x belonging to class y = +1. This is true only when the class follows a Gaussian distribution!
[Figure: Gaussian pdf and cumulative distribution; probability of x within the interval = 0.99994 for the chosen cutoff.]

MACHINE LEARNING: Example of Use of Bayes' Rule in ML
Naïve Bayes classification for binary classification: each class is modeled with a two-dimensional Gaussian function. Bayes' rule for binary classification: x has class label +1 if $p(y = +1 \mid x) > d$, $d \ge 0$; otherwise x takes class label y = -1. The threshold d determines the boundary.

MACHINE LEARNING: Example of Use of Bayes' Rule in ML
The ROC (Receiver Operating Characteristic) curve determines the influence of the threshold d: compute the false and true positive rates for each value of d in this interval.

MACHINE LEARNING: Likelihood Function
The likelihood function (or likelihood for short) determines the probability density of observing the set $X = \{x^i\}_{i=1,\dots,M}$ of M datapoints, if each datapoint has been generated by the pdf p(x):
$L(X) = p(x^1, \dots, x^M)$
If the data are independently and identically distributed (i.i.d.), then:
$L(X) = \prod_{i=1}^{M} p(x^i)$
The likelihood can be used to determine how well a particular choice of density p(x) models the data at hand.

MACHINE LEARNING: Likelihood of a Gaussian Pdf Parameterization
Consider that the pdf of the dataset X is parametrized with parameters $\mu, \Sigma$. One writes $p(X; \mu, \Sigma)$ or $p(X \mid \mu, \Sigma)$.
The likelihood function (short: likelihood) of the model parameters is given by:
$L(\mu, \Sigma \mid X) := p(X; \mu, \Sigma)$
It measures the probability of observing X if the distribution of X is parametrized with $\mu, \Sigma$.
If all datapoints are identically and independently distributed (i.i.d.):
$L(\mu, \Sigma \mid X) = \prod_{i=1}^{M} p(x^i; \mu, \Sigma)$
To determine the best fit, search for the parameters that maximize the likelihood.
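In practice one evaluates the likelihood in log space, since the product of many densities underflows numerically. A small sketch (synthetic data and sample estimates used purely as an example), assuming NumPy and SciPy:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))               # hypothetical i.i.d. dataset, M x N

mu = X.mean(axis=0)                         # candidate parameters (here: sample estimates)
Sigma = np.cov(X, rowvar=False, bias=True)

# L(mu, Sigma | X) = prod_i p(x^i; mu, Sigma)  ->  sum of log densities
log_likelihood = multivariate_normal.logpdf(X, mean=mu, cov=Sigma).sum()
print(log_likelihood)
```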

MACHINE LEARNING: Likelihood Function: Example
[Figure: four Gaussian models fitted to the same data (histograms of the fraction of observations of x), each with its likelihood value. Which one is the most likely?]

MACHINE LEARNING: Likelihood Function
[Figure: examples of the likelihood for a series of Gaussian functions applied to non-Gaussian datasets.]

MACHINE LEARNING: Maximum Likelihood Optimization
The principle of maximum likelihood consists of finding the optimal parameters of a given distribution by maximizing the likelihood function of these parameters, equivalently by maximizing the probability of the data given the model and its parameters.
For a multivariate Gaussian pdf, one can determine the mean and covariance matrix by solving:
$\max_{\mu, \Sigma} L(\mu, \Sigma \mid X) = \max_{\mu, \Sigma} p(X \mid \mu, \Sigma)$
$\frac{\partial p(X \mid \mu, \Sigma)}{\partial \mu} = 0$ and $\frac{\partial p(X \mid \mu, \Sigma)}{\partial \Sigma} = 0$
Exercise: if p is the Gaussian pdf, then the above has an analytical solution.
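For completeness, the analytical solution to this exercise is the standard result (sample mean and sample covariance):

```latex
\hat{\mu} = \frac{1}{M} \sum_{i=1}^{M} x^{i},
\qquad
\hat{\Sigma} = \frac{1}{M} \sum_{i=1}^{M} \left(x^{i} - \hat{\mu}\right)\left(x^{i} - \hat{\mu}\right)^{T}.
```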

MACHINE LEARNING: Expectation-Maximization (E-M)
EM is used when no closed-form solution exists for the maximum likelihood.
Example: fit a Mixture of Gaussian Functions, a linear weighted combination of K Gaussian functions:
$p(x; \Theta) = \sum_{k=1}^{K} \alpha_k\, p(x; \mu^k, \Sigma^k)$, with $\Theta = \{\mu^1, \Sigma^1, \dots, \mu^K, \Sigma^K\}$ and $\alpha_k \in [0, 1]$.
[Figure: example of a uni-dimensional mixture of 3 Gaussian functions, f = (f1 + f2 + f3)/3.]

MACHINE LEARNING: Expectation-Maximization (E-M)
EM is used when no closed-form solution exists for the maximum likelihood.
Example: fit a Mixture of Gaussian Functions, a linear weighted combination of K Gaussian functions:
$p(x; \Theta) = \sum_{k=1}^{K} \alpha_k\, p(x; \mu^k, \Sigma^k)$, with $\Theta = \{\mu^1, \Sigma^1, \dots, \mu^K, \Sigma^K\}$ and $\alpha_k \in [0, 1]$.
There is no closed-form solution to:
$\max_{\Theta} L(\Theta \mid X) = \max_{\Theta} p(X \mid \Theta) = \max_{\Theta} \prod_{i=1}^{M} \sum_{k=1}^{K} \alpha_k\, p(x^i; \mu^k, \Sigma^k)$
E-M is an iterative procedure to estimate the best set of parameters. It converges to a local optimum and is therefore sensitive to initialization!

MACHINE LEARNING: Expectation-Maximization (E-M)
EM is an iterative procedure:
0) Make a guess: pick an initial set $\Theta = \{\mu^1, \Sigma^1, \dots, \mu^K, \Sigma^K\}$ (initialization).
1) Compute the likelihood of the current model $L(\hat{\Theta} \mid X, Z)$ (E-Step).
2) Update $\Theta$ by gradient ascent on $L(\Theta \mid X, Z)$.
3) Iterate between steps 1 and 2 until reaching a plateau (no improvement in the likelihood).
E-M converges to a local optimum only!
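A minimal sketch of such a fit, using scikit-learn's GaussianMixture (which runs E-M internally); the data are synthetic and the choice of K = 3 is an assumption made only for this example. Multiple restarts (n_init) mitigate the sensitivity to initialization noted above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Hypothetical 1-D data drawn from three Gaussian components
X = np.concatenate([rng.normal(-2.0, 0.5, 300),
                    rng.normal(0.0, 0.5, 300),
                    rng.normal(2.0, 0.5, 300)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, n_init=10, random_state=0).fit(X)

print(gmm.weights_)        # mixing coefficients alpha_k
print(gmm.means_.ravel())  # component means mu_k
print(gmm.converged_)      # True once the log-likelihood has plateaued
```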

MACHINE LEARNING: Expectation-Maximization (E-M) Solution
The assignment of Gaussian 1 or 2 to one of the two clusters yields the same likelihood, so the likelihood function has two equivalent optima. E-M converges to one of the two solutions depending on the initialization.

MACHINE LEARNING: Expectation-Maximization (E-M) Exercise
Demonstrate the non-convexity of the likelihood with an example. E-M converges to a local optimum only!

MACHINE LEARNING
Determining the number of components when estimating a pdf with a mixture of Gaussians through Maximum Likelihood (Matlab and mldemos examples).

APPLIED MACHINE LEARNING: Determining the Number of Components: Bayesian Information Criterion
$\mathrm{BIC} = -2 \ln L + B \ln(M)$
L: maximum likelihood of the model with B parameters; M: number of datapoints.
A lower BIC implies either fewer explanatory variables, a better fit, or both.
[Figure: BIC as a function of the number of parameters B.]
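A sketch of BIC-based model selection with scikit-learn (synthetic two-cluster data, range of K chosen arbitrarily); GaussianMixture exposes the criterion directly through its bic() method:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(-3.0, 1.0, 200),
                    rng.normal(3.0, 1.0, 200)]).reshape(-1, 1)   # hypothetical data

# Fit mixtures with K = 1..6 components and keep the K with the lowest BIC
bics = {K: GaussianMixture(n_components=K, n_init=5, random_state=0).fit(X).bic(X)
        for K in range(1, 7)}
best_K = min(bics, key=bics.get)
print(bics)
print(best_K)   # typically 2 for this data
```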

MACHINE LEARNING: Expectation-Maximization (E-M)
Two-dimensional view of a Gaussian Mixture Model. Conditional probability of a 2-dim. mixture → interpretation of the modes → climbing the gradient → applications → variability (multiple paths) → two different paths (multiple modes) → choose one by climbing the gradient.

MACHINE LEARNING: Maximum Likelihood
Example: we have the realization of N Boolean random variables, independent, with the same parameter q, and want to estimate q. Let s denote the sum of the observed values. If we introduce
$p(q) = P(x_1, \dots, x_N) = \prod_{n:\, x_n = 1} q \prod_{n:\, x_n = 0} (1 - q) = q^s (1 - q)^{N - s}$
the estimate of q following that principle is the solution of $p'(q) = 0$, hence $q = s / N$.
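Filling in the intermediate step: taking the logarithm and setting the derivative to zero,

```latex
\ln p(q) = s \ln q + (N - s)\ln(1 - q),
\qquad
\frac{d}{dq}\ln p(q) = \frac{s}{q} - \frac{N - s}{1 - q} = 0
\;\Longrightarrow\;
\hat{q} = \frac{s}{N}.
```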

MACHINE LEARNING: ML in Practice: Caveats on Statistical Measures
A large number of the algorithms we will see in class require knowing the mean and covariance of the probability distribution function of the data. In practice, the class means and covariances are not known. They can, however, be estimated from the training set. Either the maximum likelihood estimate or the maximum a posteriori estimate may be used in place of the exact value.
Several of the algorithms used to estimate these assume that the underlying distribution follows a normal distribution. This is usually not true. Thus, one should keep in mind that, although the estimates of the covariance may be considered optimal, this does not mean that the resulting computation obtained by substituting these values is optimal, even if the assumption of normally distributed classes is correct.

MACHINE LEARNING: ML in Practice: Caveats on Statistical Measures
Another complication that you will often encounter with algorithms that require computing the inverse of the covariance matrix of the data is that, with real data, the dimensionality of each sample (the number of observations per sample) often exceeds the number of samples. In this case, the covariance estimates do not have full rank, i.e. they cannot be inverted. There are a number of ways to deal with this. One is to use the pseudo-inverse of the covariance matrix. Another is to use the singular value decomposition (SVD).
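A small NumPy sketch of the rank-deficient case and the pseudo-inverse workaround (the dimensions are made up; np.linalg.pinv computes the Moore-Penrose pseudo-inverse via SVD internally):

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical rank-deficient case: 20 dimensions but only 10 samples
X = rng.normal(size=(10, 20))
Xc = X - X.mean(axis=0)
Sigma = Xc.T @ Xc / X.shape[0]

print(np.linalg.matrix_rank(Sigma))   # < 20, so Sigma cannot be inverted directly

# Moore-Penrose pseudo-inverse as a workaround
Sigma_pinv = np.linalg.pinv(Sigma)
```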

MACHINE LEARNING: Marginal and Conditional Pdf
Consider two random variables $x_1$ and $x_2$ with joint distribution $p(x_1, x_2)$; then the marginal probability of $x_1$ is:
$p(x_1) = \int p(x_1, x_2)\, dx_2$
The conditional probability is given by:
$p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)} \;\Leftrightarrow\; p(x_2 \mid x_1) = \frac{p(x_1, x_2)}{p(x_1)}$

MACHINE LEARNING: Conditional Pdf and Statistical Independence
$x_1$ and $x_2$ are statistically independent if:
$p(x_1 \mid x_2) = p(x_1)$ and $p(x_2 \mid x_1) = p(x_2)$ $\Rightarrow$ $p(x_1, x_2) = p(x_1)\, p(x_2)$
$x_1$ and $x_2$ are uncorrelated if $\mathrm{cov}(x_1, x_2) = 0$:
$\mathrm{cov}(x_1, x_2) = E\{x_1 x_2\} - E\{x_1\} E\{x_2\}$, so $\mathrm{cov}(x_1, x_2) = 0 \Rightarrow E\{x_1 x_2\} = E\{x_1\} E\{x_2\}$
The expectation of the joint distribution is then equal to the product of the expectations of each distribution taken separately.

MACHINE LEARNING: Statistical Independence and Uncorrelatedness
Independent $\Rightarrow$ uncorrelated: $p(x_1, x_2) = p(x_1)\, p(x_2) \Rightarrow E\{x_1 x_2\} = E\{x_1\} E\{x_2\}$
Uncorrelated does not imply independent: $E\{x_1 x_2\} = E\{x_1\} E\{x_2\}$ does not imply $p(x_1, x_2) = p(x_1)\, p(x_2)$
Statistical independence ensures uncorrelatedness. The converse is not true.

MACHINE LEARNING: Statistical Independence and Uncorrelatedness

           X=-1    X=0     X=1     Total
  Y=-1     3/12    0       3/12    1/2
  Y=+1     1/12    4/12    1/12    1/2
  Total    1/3     1/3     1/3     1

X and Y are uncorrelated but not statistically independent.
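The joint table above (as reconstructed here) can be checked numerically; the sketch below assumes NumPy and verifies that the covariance vanishes while the joint does not factorize into its marginals:

```python
import numpy as np

# Joint probabilities: rows Y in {-1, +1}, columns X in {-1, 0, +1}
P = np.array([[3/12, 0.0,  3/12],
              [1/12, 4/12, 1/12]])
x_vals = np.array([-1.0, 0.0, 1.0])
y_vals = np.array([-1.0, 1.0])

E_x  = (P.sum(axis=0) * x_vals).sum()
E_y  = (P.sum(axis=1) * y_vals).sum()
E_xy = (P * np.outer(y_vals, x_vals)).sum()

print(np.isclose(E_xy, E_x * E_y))                               # True: uncorrelated
print(np.allclose(P, np.outer(P.sum(axis=1), P.sum(axis=0))))    # False: not independent
```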

MACHINE LEARNING: Example of Use of Bayes' Rule in ML
Naïve Bayes classification can be used to determine the likelihood of the class label. Bayes' rule for binary classification: x has class label +1 if $p(y = +1 \mid x) > d$, $d \ge 0$; otherwise x takes class label -1.
[Figure: Gaussian pdf of $P(y = +1 \mid x)$ with threshold d, together with the corresponding cumulative distribution.]

MACHINE LEARNING: Example of Use of Bayes' Rule in ML
Naïve Bayes classification can be used to determine the likelihood of the class label. Bayes' rule for binary classification: x has class label +1 if $p(y = +1 \mid x) > d$, $d \ge 0$; otherwise x takes class label -1.
[Figure: Gaussian pdf of $P(y = +1 \mid x)$ with threshold d.]