MACHINE LEARNING / ADVANCED MACHINE LEARNING
Recap of Important Notions on Estimation of Probability Density Functions

Overview
- Definition of a pdf
- Definitions of joint, conditional, and marginal distributions, and of the likelihood
- Uni-dimensional and multi-dimensional Gauss function
- Decomposition in eigenvalues
- Estimation of a density with expectation-maximization
- Use of Bayes' rule for classification

Discrete Probabilities
Consider two variables x and y taking discrete values over the intervals [1...N_x] and [1...N_y] respectively.
P(x = i): the probability that the variable x takes value i.
0 \le P(x = i) \le 1, \; i = 1, ..., N_x, and \sum_{i=1}^{N_x} P(x = i) = 1.
Idem for P(y = j), j = 1, ..., N_y.

Probability Distributions, Density Functions
p(x), a continuous function, is the probability density function or probability distribution function (PDF) (sometimes also called probability distribution or simply density) of the variable x.
p(x) \ge 0, \quad \int p(x)\, dx = 1

PDF: Example
The uni-dimensional Gaussian or Normal distribution is a pdf given by:
p(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \mu: mean, \sigma^2: variance
The Gaussian function is entirely determined by its mean and variance. For this reason, it is referred to as a parametric distribution.
Illustrations from Wikipedia

Expectation
The expectation of the random variable x with probability P(x) (in the discrete case) or pdf p(x) (in the continuous case), also called the expected value or mean, is the mean of the observed values of x weighted by p(x). If X is the set of observations of x, then:
When x takes discrete values: \mu = E\{x\} = \sum_{x \in X} x\, P(x)
For continuous distributions: \mu = E\{x\} = \int_X x\, p(x)\, dx

Variance
\sigma^2, the variance of a distribution, measures the amount of spread of the distribution around its mean:
\sigma^2 = Var(x) = E\{(x - \mu)^2\} = E\{x^2\} - (E\{x\})^2
\sigma is the standard deviation of x.
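
As a quick numerical check of these definitions, the sketch below (not from the slides; the sample size and distribution parameters are assumptions) estimates E{x} and Var(x) = E{x²} - (E{x})² from samples drawn from a Gaussian:

```python
import numpy as np

# Draw samples from a Gaussian with known mean and variance (assumed values).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)   # mu = 2.0, sigma = 1.5

mu_hat = x.mean()                                  # empirical E{x}
var_hat = (x ** 2).mean() - mu_hat ** 2            # E{x^2} - (E{x})^2
print(mu_hat, var_hat)                             # ~2.0 and ~2.25 (= sigma^2)
```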

Mean and Variance in the Gauss PDF
~68% of the data are comprised between +/- 1 sigma
~96% of the data are comprised between +/- 2 sigmas
~99% of the data are comprised between +/- 3 sigmas
This is no longer true for arbitrary pdfs!
Illustrations from Wikipedia

Mean and Variance in Mixture of Gauss Functions
[Figure: three Gaussian distributions and the resulting distribution f = 1/3 (f1 + f2 + f3) obtained when superposing them; the expectation and sigma = 0.68 are marked.]
For pdfs other than the Gaussian distribution, the variance represents a notion of dispersion around the expected value.

Multi-dimensional Gaussian Function
The uni-dimensional Gaussian or Normal distribution is a distribution with pdf given by:
p(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad \mu: mean, \sigma: variance
The multi-dimensional Gaussian or Normal distribution has a pdf given by:
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}
If x is N-dimensional, then \mu is an N-dimensional mean vector and \Sigma is an N x N covariance matrix.

2-dimensional Gaussian Pdf
p(x_1, x_2):
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}
Isolines: p(x) = cst
If x is N-dimensional, then \mu is an N-dimensional mean vector and \Sigma is an N x N covariance matrix.

Modeling Joint Pdf with a Gaussian Function
Construct the covariance matrix from the (centered) set of M datapoints X = \{x^i\}_{i=1...M}:
\Sigma = \frac{1}{M} X X^T
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)}
If x is N-dimensional, then \mu is an N-dimensional mean vector and \Sigma is an N x N covariance matrix.
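
A minimal Python sketch of this construction (the data, seed, and helper name gauss_pdf are our own, not from the slides): centre a set of datapoints stored as the columns of X, form \Sigma = \frac{1}{M} X X^T, and evaluate the multivariate Gaussian density:

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    """Multivariate Gaussian density p(x; mu, Sigma) as defined above."""
    N = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# X holds M datapoints as columns (N x M), matching the X X^T formula above.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 1.0]], size=500).T

mu = X.mean(axis=1, keepdims=True)
Xc = X - mu                                   # centre the data
Sigma = (Xc @ Xc.T) / X.shape[1]              # Sigma = (1/M) X X^T

print(gauss_pdf(np.array([0.5, -0.2]), mu.ravel(), Sigma))
```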

Modeling Joint Pdf with a Gaussian Function
Construct the covariance matrix from the (centered) set of M datapoints X = \{x^i\}_{i=1...M}: \Sigma = \frac{1}{M} X X^T
\Sigma is square and symmetric. It can be decomposed using the eigenvalue decomposition:
\Sigma = V \Lambda V^T, \quad \Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_N \end{pmatrix}
V: matrix of eigenvectors, \Lambda: diagonal matrix composed of the eigenvalues.
For the 1-std ellipse, the axes' lengths along the 1st and 2nd eigenvectors are \sqrt{\lambda_1} and \sqrt{\lambda_2}, with \Sigma = V \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} V^T.
Each isoline corresponds to a scaling of the 1-std ellipse.
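
The decomposition can be checked numerically. The sketch below (with an assumed example covariance) uses numpy.linalg.eigh, which applies to symmetric matrices, and reads the semi-axis lengths of the 1-std ellipse off the eigenvalues:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])               # example covariance (an assumption)

# Sigma is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
eigvals, V = np.linalg.eigh(Sigma)           # Sigma = V @ diag(eigvals) @ V.T
Lambda = np.diag(eigvals)

print(np.allclose(Sigma, V @ Lambda @ V.T))  # True: the decomposition is exact
print(np.sqrt(eigvals))                      # semi-axis lengths of the 1-std ellipse
```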

Fitting a Single Gauss Function and PCA
Principal Component Analysis (PCA) identifies a suitable representation of a multivariate data set by de-correlating the dataset.
When projected onto the eigenvectors e^1 and e^2, the set of datapoints appears to follow two uncorrelated Normal distributions:
p((e^1)^T X) \sim N(x; \mu_1, \lambda_1), \quad p((e^2)^T X) \sim N(x; \mu_2, \lambda_2)
[Figure: the dataset with the 1st and 2nd eigenvectors e^1, e^2 overlaid.]
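
The de-correlation property can also be illustrated numerically; in the sketch below (data and parameters are assumptions), projecting centred data onto the eigenvectors of its covariance yields approximately uncorrelated components whose variances are \lambda_1 and \lambda_2:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -1.0], [[3.0, 1.2], [1.2, 0.5]], size=2000)

Xc = X - X.mean(axis=0)                       # centre the data
Sigma = Xc.T @ Xc / len(Xc)
eigvals, V = np.linalg.eigh(Sigma)            # columns of V are the eigenvectors e1, e2

Y = Xc @ V                                    # project onto the eigenvectors
print(np.cov(Y.T, bias=True).round(3))        # ~diagonal: the projections are uncorrelated,
print(eigvals.round(3))                       # with variances ~lambda_1, lambda_2
```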

Marginal Pdf
Consider two random variables x and y with joint distribution p(x, y); then the marginal probability of x is given by: p(x) = \int p(x, y)\, dy
(Lower indices are dropped for clarity of notation.)
p(x) is a Gaussian when p(x, y) is also a Gaussian.
To compute the marginal, one needs the joint distribution p(x, y). Often, one does not know it and one can only estimate it.
If x is a multi-dimensional variable, the marginal is itself a joint distribution!

Joint vs Marginal and the Curse of Dimensionality
The joint distribution p(x, y) is richer than the marginals p(x), p(y).
In the discrete case: estimating the marginals of N variables taking K values requires estimating N \cdot K probabilities; estimating the joint distribution corresponds to computing ~K^N probabilities.

Marginal & Conditional Pdf
Consider two random variables x and y with joint distribution p(x, y); then the marginal probability of x is given by: p(x) = \int p(x, y)\, dy
The conditional probability of y given x is given by: p(y | x) = \frac{p(x, y)}{p(x)}
Computing the conditional from the joint distribution requires estimating the marginal of one of the two variables. If you need only the conditional, estimate it directly!
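
For a small discrete example (the joint table below is an assumption, purely for illustration), the marginal and the conditional follow directly from the two formulas above:

```python
import numpy as np

# Joint distribution p(x, y) over two discrete variables.
# Rows index x in {0, 1}, columns index y in {0, 1, 2}.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.25, 0.20]])

p_x = p_xy.sum(axis=1)             # marginal p(x): sum (integrate) over y
p_y_given_x = p_xy / p_x[:, None]  # conditional p(y | x) = p(x, y) / p(x)

print(p_x)                         # [0.4, 0.6]
print(p_y_given_x.sum(axis=1))     # each row sums to 1, as a conditional must
```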

Joint vs Conditional: Example
Here we care about predicting y = hitting angle given x = position, but not the converse.
One can estimate p(y | x) directly and use this to predict \hat{y} \sim E\{p(y | x)\}.

Joint Distribution Estimation and Curse of Dimensionality
Pros of computing the joint distribution: it provides the statistical dependencies across all variables and the marginal distributions.
Cons: the computational costs grow exponentially with the number of dimensions (statistical power: ~10 samples to estimate each parameter of a model).
Compute solely the conditional if you care only about dependencies across variables.

The conditional and marginal pdfs of a multi-dimensional Gauss function are all Gauss functions!
[Figure: a 2-D Gaussian with the marginal density of x and the marginal density of y shown along the axes. Illustrations from Wikipedia.]

PDF equivalency with Discrete Probability
The cumulative distribution function (or simply distribution function) of x is:
D(x^*) = P(x \le x^*) = \int_{-\infty}^{x^*} p(x)\, dx
p(x)\, dx ~ probability of x to fall within an infinitesimal interval [x, x + dx]

PDF equivalency with Discrete Probability
[Figure: a uniform distribution p(x) and its cumulative distribution D(x^*).]
The probability that x takes a value in the subinterval [a, b] is given by:
P(x \le b) := D(b) = \int_{-\infty}^{b} p(x)\, dx
P(a \le x \le b) = D(b) - D(a) = \int_{a}^{b} p(x)\, dx
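
A short sketch of P(a \le x \le b) = D(b) - D(a) for a Gaussian, using the error function from Python's standard library (the interval [a, b] and the Gaussian parameters are assumptions):

```python
import math

def gauss_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution D(x) of a Gaussian, written via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

a, b = -1.0, 1.0
print(gauss_cdf(b) - gauss_cdf(a))   # ~0.6827: P(a <= x <= b) = D(b) - D(a)
```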

PDF equivalency with Discrete Probability
[Figure: a Gaussian pdf (top) and its cumulative distribution (bottom).]

Conditional and Bayes's Rule
Consider the joint probability of two variables x and y: p(x, y).
The conditional probability is computed as follows: p(y | x) = \frac{p(x, y)}{p(x)}
Bayes' theorem: p(y | x) = \frac{p(x | y)\, p(y)}{p(x)}

Example of use of Bayes rule in ML
Naïve Bayes classification can be used to determine the likelihood of the class label. Bayes' rule for binary classification: x has class label +1 if p(y = +1 | x) > \delta, \delta \ge 0; otherwise x takes class label -1.
[Figure: the Gaussian P(y = +1 | x) with the threshold \delta and the corresponding interval [x_min, x_max].]

The probability that x takes a value in the subinterval [x_min, x_max] is given by:
P(x_min \le x \le x_max) = D(x_max) - D(x_min)
[Figure: the cumulative distribution D(x) and the Gaussian P(y = +1 | x) with threshold \delta, with x_min and x_max marked.]

Example of use of Bayes rule in ML
[Figure: a Gaussian pdf and its cumulative distribution; the probability of x within the interval is 0.90309 with cutoff 0.1.]

Example of use of Bayes rule in ML
[Figure: a Gaussian pdf and its cumulative distribution; the probability of x within the interval is 0.38995 with cutoff 0.35, versus 0.90309 with cutoff 0.1.]

Example of use of Bayes rule in ML
[Figure: a Gaussian pdf and its cumulative distribution over [-5, 5]; the probability of x within the interval is 0.99994 with cutoff 0.1.]
The choice of cut-off in Bayes' rule corresponds to some probability of seeing an event, e.g. of belonging to class y = +1. This is true only when the class follows a Gauss distribution!

Example of use of Bayes rule in ML
Naïve Bayes classification can be used to determine the likelihood of the class label. Here, an example for binary classification. The decision rule is: x has class label y = +1 if:
p(x | y = +1) \ge p(x | y = -1)
The boundary is found at the intersection between the isolines. Each class is modeled with one Gauss distribution: p(x | y) \sim N(x; \mu, \sigma)
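
A minimal 1-D sketch of this decision rule (the class-conditional parameters and the helper names gauss_pdf and classify are assumptions for illustration):

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Class-conditional Gaussians p(x | y = +1) and p(x | y = -1) (assumed parameters).
mu_pos, sigma_pos = 1.0, 1.0
mu_neg, sigma_neg = -1.5, 0.8

def classify(x):
    """Label +1 if p(x | y = +1) >= p(x | y = -1), else -1."""
    return +1 if gauss_pdf(x, mu_pos, sigma_pos) >= gauss_pdf(x, mu_neg, sigma_neg) else -1

print([classify(x) for x in (-2.0, -0.5, 0.0, 2.0)])   # boundary lies between the two means
```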

Example of use of Bayes rule in ML
One can also add a prior p(y = +1) and p(y = -1) on either of the marginals to change the boundary:
p(y = +1 | x) = \frac{p(x | y = +1)\, p(y = +1)}{p(x)}, \qquad p(y = -1 | x) = \frac{p(x | y = -1)\, p(y = -1)}{p(x)}
p(y = +1) and p(y = -1) are usually estimated from data. If not estimated, one can set a prior on either marginal, e.g. give more importance to one class by decreasing its p(y). Changing the ratio of the priors shifts the boundary.
The ratio \delta = p(y = +1 | x) / p(y = -1 | x) determines the boundary.

The ROC (Receiver Operating Characteristic) curve determines the influence of the threshold. The ratio \delta = p(y = +1 | x) / p(y = -1 | x) determines the boundary; compute the false and true positives for each value of \delta in this interval.
True Positives (TP): number of datapoints of class 1 that are correctly classified.
False Positives (FP): number of datapoints of class 2 that are incorrectly classified.
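
One way such a curve could be traced is sketched below (the synthetic 1-D data, the class parameters, and all names are assumptions): sweep \delta over a range of values and, for each value, count the fraction of class-1 points labelled +1 (true positive rate) and the fraction of class-2 points labelled +1 (false positive rate):

```python
import numpy as np

rng = np.random.default_rng(3)
x_pos = rng.normal(1.0, 1.0, size=1000)    # class y = +1
x_neg = rng.normal(-1.5, 0.8, size=1000)   # class y = -1

def gauss_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def rates(delta):
    """TP and FP rates when predicting +1 iff p(x | +1) / p(x | -1) >= delta."""
    ratio_pos = gauss_pdf(x_pos, 1.0, 1.0) / gauss_pdf(x_pos, -1.5, 0.8)
    ratio_neg = gauss_pdf(x_neg, 1.0, 1.0) / gauss_pdf(x_neg, -1.5, 0.8)
    tp = np.mean(ratio_pos >= delta)       # class-1 points correctly classified
    fp = np.mean(ratio_neg >= delta)       # class-2 points incorrectly classified
    return tp, fp

for delta in (0.1, 0.5, 1.0, 2.0, 10.0):   # sweeping delta traces out the ROC curve
    print(delta, rates(delta))
```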

Likelihood Function
The likelihood function (or likelihood for short) determines the probability density of observing the set X = \{x^i\}_{i=1}^{M} of M datapoints, if each datapoint has been generated by the pdf p(x):
L(X) = p(x^1, x^2, ..., x^M)
If the data are independently and identically distributed (i.i.d.), then:
L(X) = \prod_{i=1}^{M} p(x^i)
The likelihood can be used to determine how well a particular choice of density p(x) models the data at hand.

Likelihood of Gaussian Pdf Parametrization
Consider that the pdf of the dataset X is parametrized with parameters \mu, \Sigma. One writes: p(X; \mu, \Sigma) or p(X | \mu, \Sigma).
The likelihood function of the model parameters is given by:
L(\mu, \Sigma | X) := p(X; \mu, \Sigma)
It measures the probability of observing X if the distribution of X is parametrized with \mu, \Sigma.
If all datapoints are identically and independently distributed (i.i.d.):
L(\mu, \Sigma | X) = \prod_{i=1}^{M} p(x^i; \mu, \Sigma)
To determine the best fit, search for the parameters that maximize the likelihood.
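
Under the i.i.d. assumption the likelihood is a product of densities, so in practice one evaluates the sum of log-densities. A sketch comparing a few candidate parametrizations on an assumed dataset:

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(4)
X = rng.normal(2.0, 1.0, size=200)          # observed dataset (an assumption)

# log L(mu, sigma | X) = sum_i log p(x^i; mu, sigma) for i.i.d. data
for mu, sigma in [(0.0, 1.0), (2.0, 1.0), (2.0, 3.0)]:
    print(mu, sigma, gauss_logpdf(X, mu, sigma).sum())   # (2.0, 1.0) scores highest
```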

Likelihood Function: Exercise
[Figure: the same dataset ('Data', shown as the fraction of observations of x) with four candidate Gaussian models ('Model') overlaid; the reported likelihoods are 5.0936e-, 9.5536e-05, 4.4994e-06 and 4.8759e-07.]
Estimate the likelihood of each of the 4 models. Determine which one is optimal.

Maximum Likelihood Optimization
The principle of maximum likelihood consists of finding the optimal parameters of a given distribution by maximizing the likelihood function of these parameters, or equivalently by maximizing the probability of the data given the model and its parameters.
For a multi-variate Gaussian pdf, one can determine the mean and covariance matrix by solving:
\max_{\mu, \Sigma} L(\mu, \Sigma | X) = \max_{\mu, \Sigma} p(X | \mu, \Sigma)
\frac{\partial}{\partial \mu} p(X | \mu, \Sigma) = 0 \quad and \quad \frac{\partial}{\partial \Sigma} p(X | \mu, \Sigma) = 0
Exercise: If p is the Gaussian pdf, then the above has an analytical solution.
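
For the Gaussian, the analytical maximizers are the sample mean and the sample covariance with 1/M normalisation; the sketch below checks this on synthetic data (all values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.3], [0.3, 0.5]], size=5000)

# Analytical maximum-likelihood solution for a multivariate Gaussian:
mu_ml = X.mean(axis=0)                       # sample mean
diff = X - mu_ml
Sigma_ml = diff.T @ diff / len(X)            # sample covariance (1/M normalisation)

print(mu_ml.round(2))                        # close to the true mean
print(Sigma_ml.round(2))                     # close to the true covariance
```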

Expectation-Maximization (E-M)
EM is used when no closed-form solution exists for the maximum likelihood.
Example: fit a Mixture of Gaussian Functions:
p(x; \Theta) = \sum_{k=1}^{K} \alpha_k\, p(x; \mu^k, \Sigma^k), \quad with \Theta = \{\mu^1, \Sigma^1, ..., \mu^K, \Sigma^K\}, \; \alpha_k \in [0, 1]
A linear weighted combination of K Gaussian functions.
[Figure: example of a uni-dimensional mixture of 3 Gaussian functions, f = 1/3 (f1 + f2 + f3).]

Expectation-Maximization (E-M)
There is no closed-form solution to:
\max_{\Theta} L(X | \Theta) = \max_{\Theta} p(X | \Theta) = \max_{\Theta} \prod_{i=1}^{M} \sum_{k=1}^{K} \alpha_k\, p(x^i; \mu^k, \Sigma^k)
E-M is an iterative procedure to estimate the best set of parameters. It converges to a local optimum and is sensitive to initialization!

Expectation-Maximization (E-M)
EM is an iterative procedure:
0) Make a guess: pick a set of \Theta = \{\mu^1, \Sigma^1, ..., \mu^K, \Sigma^K\} (initialization)
1) Compute the likelihood of the initial model L(\hat{\Theta} | X, Z) (E-Step)
2) Update \Theta by gradient ascent on L(\Theta | X, Z)
3) Iterate between steps 1 and 2 until reaching a plateau (no improvement on the likelihood)
E-M converges to a local optimum only!

Expectation-Maximization (E-M)
Estimate the joint density p(x, y) across pairs of datapoints using a mixture of two Gauss functions:
p(x, y) = \sum_{i=1}^{K} \alpha_i\, p(x, y; \mu^i, \Sigma^i), \quad with \; p(x, y; \mu^i, \Sigma^i) = N(\mu^i, \Sigma^i)
\mu^i, \Sigma^i: mean and covariance matrix of Gaussian i.
The parameters are learned through the iterative procedure E-M, starting with a random initialization.
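
A compact sketch of the procedure for a 1-D mixture of K Gaussians follows (all names, data, and the seed are our own). For simplicity the M-step uses the standard closed-form re-estimation formulas rather than the gradient ascent mentioned on the previous slide:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, K=2, n_iter=100, seed=0):
    """Minimal EM for a 1-D mixture of K Gaussians."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=K, replace=False)        # random initialisation
    sigma = np.full(K, x.std())
    alpha = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] ~ alpha_k p(x_i; mu_k, sigma_k)
        r = alpha * gauss_pdf(x[:, None], mu, sigma)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and standard deviations
        Nk = r.sum(axis=0)
        alpha = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return alpha, mu, sigma

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(1.5, 1.0, 700)])
print(em_gmm_1d(x, K=2))       # different seeds may converge to different local optima
```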

Expectation-Maximization (E-M): Exercise
Demonstrate the non-convexity of the likelihood with an example.
E-M converges to a local optimum only!

Expectation-Maximization (E-M): Solution
The assignment of Gaussian 1 or 2 to one of the two clusters yields the same likelihood. The likelihood function therefore has two maxima. E-M converges to one of the two solutions depending on the initialization.

Determining the number of components to estimate a pdf with a mixture of Gaussians through Maximum Likelihood.
Matlab and mldemos examples.

Determining Number of Components: Bayesian Information Criterion
BIC = -2 \ln L + B \ln(M)
L: maximum likelihood of the model given B parameters
M: number of datapoints
A lower BIC implies either fewer explanatory variables, a better fit, or both. The BIC penalizes too small an increment in likelihood that leads to too large an increase in the number of parameters (a case of overfitting).
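
A sketch of how the criterion can be applied to choose the number of components K (the log-likelihood values below are hypothetical, purely to illustrate the formula; B counts the free parameters of a 1-D mixture: K-1 weights, K means, K variances):

```python
import numpy as np

def bic(log_lik, B, M):
    """Bayesian Information Criterion: BIC = -2 ln L + B ln(M)."""
    return -2.0 * log_lik + B * np.log(M)

# Hypothetical maximum log-likelihoods of 1-D mixtures with K = 1..4 components,
# fitted to the same M = 1000 datapoints (illustrative values, not real results).
M = 1000
for K, log_lik in [(1, -2150.0), (2, -1880.0), (3, -1876.0), (4, -1874.5)]:
    B = (K - 1) + K + K
    print(K, round(bic(log_lik, B, M), 1))   # the lowest BIC is reached at K = 2 here
```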
