MACHINE LEARNING ADVANCED MACHINE LEARNING


1 MACHINE LEARNING ADVANCED MACHINE LEARNING Recap of Important Notions on Estimation of Probability Density Functions

2 MACHINE LEARNING Discrete Probabilities Consider two variables $x$ and $y$ taking discrete values over the intervals $[1 \dots N_x]$ and $[1 \dots N_y]$ respectively. $P(x = i)$: the probability that the variable $x$ takes value $i$, with $0 \le P(x = i) \le 1,\ \forall i = 1, \dots, N_x$, and $\sum_{i=1}^{N_x} P(x = i) = 1$. Idem for $P(y = j)$, $j = 1, \dots, N_y$.

3 MACHINE LEARNING Probability Distributions, Density Functions $p(x)$, a continuous function, is the probability density function or probability distribution function (PDF) (sometimes also called probability distribution or simply density) of the variable $x$.

4 MACHINE LEARNING PDF: Example The uni-dimensional Gaussian or Normal distribution is a pdf given by: $p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, with $\mu$: mean, $\sigma^2$: variance. The Gaussian function is entirely determined by its mean and variance. For this reason, it is referred to as a parametric distribution. Illustrations from Wikipedia.
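To make the formula concrete, here is a minimal sketch (Python/NumPy is assumed throughout these examples; the values of mu and sigma are arbitrary illustrations) that evaluates the uni-dimensional Gaussian pdf:

```python
import numpy as np

def gauss_pdf(x, mu=0.0, sigma=1.0):
    """Uni-dimensional Gaussian pdf: 1/(sigma*sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

x = np.linspace(-4.0, 4.0, 9)
print(gauss_pdf(x))        # peaks at x = mu with value 1/(sigma*sqrt(2*pi)) ~ 0.3989
```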

5 MACHINE LEARNING Expectation The expectation of the random variable $x$ with probability $P(x)$ (in the discrete case) and pdf $p(x)$ (in the continuous case), also called the expected value or mean, is the mean of the observed values of $x$ weighted by $p(x)$. If $X$ is the set of observations of $x$, then: when $x$ takes discrete values, $\mu = E\{x\} = \sum_{x \in X} x\, P(x)$; for continuous distributions, $\mu = E\{x\} = \int_X x\, p(x)\, dx$.

6 MACHINE LEARNING Variance $\sigma^2$, the variance of a distribution, measures the amount of spread of the distribution around its mean: $\sigma^2 = \mathrm{Var}(x) = E\{(x - \mu)^2\} = E\{x^2\} - \left[E\{x\}\right]^2$. $\sigma$ is the standard deviation of $x$.

7 MACHINE LEARNING Mean and Variance in the Gauss PDF ~68% of the data are comprised between +/- 1 sigma, ~95% between +/- 2 sigmas, and ~99.7% between +/- 3 sigmas. This is no longer true for arbitrary pdfs! Illustrations from Wikipedia.

8 MACHINE LEARNING Mean and Variance in a Mixture of Gauss Functions Figure: three Gaussian distributions and the resulting distribution $f = \frac{1}{3}(f_1 + f_2 + f_3)$ obtained when superposing them. For pdfs other than the Gaussian distribution, the variance still represents a notion of dispersion around the expected value.

9 MACHINE LEARNING Multi-dimensional Gaussian Function The uni-dimensional Gaussian or Normal distribution is a distribution with pdf given by: $p(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $\mu$: mean, $\sigma^2$: variance. The multi-dimensional Gaussian or Normal distribution has a pdf given by: $p(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$. If $x$ is $N$-dimensional, then $\mu$ is an $N$-dimensional mean vector and $\Sigma$ is an $N \times N$ covariance matrix.
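A possible sketch of the multi-dimensional density above (the mean, covariance and test point are made-up illustration values; it assumes $\Sigma$ is invertible):

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multi-dimensional Gaussian density p(x; mu, Sigma) at a single N-dimensional point x."""
    N = mu.shape[0]
    diff = x - mu
    norm = np.sqrt((2.0 * np.pi) ** N * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
print(mvn_pdf(np.array([0.2, -0.3]), mu, Sigma))
```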

10 MACHINE LEARNING 2-dimensional Gaussian Pdf $p(x_1, x_2)$: $p(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$. Isolines: $p(x) = \text{cst}$. If $x$ is $N$-dimensional, then $\mu$ is an $N$-dimensional mean vector and $\Sigma$ is an $N \times N$ covariance matrix.

11 MACHINE LEARNING Modeling Data with a Gaussian Function Construct the covariance matrix from the (centered) set of datapoints $X = \{x^i\}_{i=1 \dots M}$: $\Sigma = \frac{1}{M} X X^T$. $p(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^N |\Sigma|}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$. If $x$ is $N$-dimensional, then $\mu$ is an $N$-dimensional mean vector and $\Sigma$ is an $N \times N$ covariance matrix.

12 MACHINE LEARNING Modeling Data with a Gaussian Function $\Sigma$ is square and symmetric. It can be decomposed using the eigenvalue decomposition: $\Sigma = V \Lambda V^T$, with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_N)$; $V$: matrix of eigenvectors, $\Lambda$: diagonal matrix composed of the eigenvalues. Construct the covariance matrix from the (centered) set of datapoints $X = \{x^i\}_{i=1 \dots M}$: $\Sigma = \frac{1}{M} X X^T$. For the 1-std ellipse, the axes' lengths along the 1st and 2nd eigenvectors are equal to $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$, with $\Sigma = V \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} V^T$. Each isoline corresponds to a scaling of the 1-std ellipse.
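A short sketch of the two steps above, computing $\Sigma = \frac{1}{M} X X^T$ from centered data and its eigenvalue decomposition (the synthetic dataset is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 1.0]], size=500)   # M x N, rows = datapoints

Xc = X - X.mean(axis=0)                    # center the datapoints
Sigma = (Xc.T @ Xc) / Xc.shape[0]          # Sigma = (1/M) X X^T (X stored with datapoints as rows)

eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh: eigendecomposition of a symmetric matrix
print("eigenvalues:", eigvals)
print("1-std ellipse axis lengths:", np.sqrt(eigvals))   # sqrt(lambda_1), sqrt(lambda_2)
```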

13 MACHINE LEARNING Fitting a Single Gauss Function and PCA PCA identifies a suitable representation of a multivariate data set by decorrelating the dataset. When projected onto the eigenvectors $e^1$ and $e^2$, the set of datapoints appears to follow two uncorrelated Normal distributions: $p\big((e^1)^T x \mid X\big) \sim \mathcal{N}\big((e^1)^T \mu,\ \lambda_1\big)$ and $p\big((e^2)^T x \mid X\big) \sim \mathcal{N}\big((e^2)^T \mu,\ \lambda_2\big)$, i.e. the variance along each eigenvector is given by the corresponding eigenvalue.

14 MACHINE LEARNING Marginal Pdf Consider two random variables $x$ and $y$ with joint distribution $p(x, y)$; then the marginal probability of $x$ is given by: $p(x) = \int p(x, y)\, dy$ (lower indices are dropped for clarity of notation). $p(x)$ is a Gaussian when $p(x, y)$ is also a Gaussian. To compute the marginal, one needs the joint distribution $p(x, y)$. Often, one does not know it and one can only estimate it. If $x$ is a multidimensional variable, the marginal is itself a joint distribution!

15 MACHINE LEARNING Joint vs Marginal and the Curse of Dimensionality The joint distribution $p(x, y)$ is richer than the marginals $p(x)$, $p(y)$. In the discrete case: estimating the marginals of $N$ variables each taking $K$ values requires estimating $N(K - 1)$ probabilities, whereas estimating the joint distribution corresponds to computing $\sim K^N$ probabilities.

16 MACHINE LEARNING Marginal & Conditional Pdf Consider two random variables $x$ and $y$ with joint distribution $p(x, y)$; then the marginal probability of $x$ is given by: $p(x) = \int p(x, y)\, dy$. The conditional probability of $y$ given $x$ is given by: $p(y \mid x) = \frac{p(x, y)}{p(x)}$. Computing the conditional from the joint distribution hence requires estimating the marginal of one of the two variables. If you need only the conditional, estimate it directly!

17 MACHINE LEARNING Joint vs Conditional: Example Here we care about predicting $y$ = hitting angle given $x$ = position, but not the converse. One can estimate $p(y \mid x)$ directly and use it to predict $\hat{y} \sim E\{p(y \mid x)\}$.

18 MACHINE LEARNING Joint Distribution Estimation and the Curse of Dimensionality Pros of computing the joint distribution: it provides the statistical dependencies across all variables and the marginal distributions. Cons: computational costs grow exponentially with the number of dimensions (statistical power: samples needed to estimate each parameter of the model). → Compute solely the conditional if you care only about dependencies across variables.

19 MACHINE LEARNING The conditional and marginal pdfs of a multi-dimensional Gauss function are all Gauss functions! Figure: a two-dimensional Gaussian with mean $\mu$ and its marginal densities of $x$ and of $y$. Illustrations from Wikipedia.

20 MACHINE LEARNING PDF Equivalency with Discrete Probability The cumulative distribution function (or simply distribution function) of $x$ is $D(x) = P(x' \le x) = \int_{-\infty}^{x} p(x')\, dx'$. $p(x)\, dx$ ~ probability of $x$ falling within an infinitesimal interval $[x, x + dx]$.

21 MACHINE LEARNING PDF Equivalency with Discrete Probability Example: uniform distribution $p(x)$. The probability that $x$ takes a value in the subinterval $[a, b]$ is given by: $P(x \le b) := D(b) = \int_{-\infty}^{b} p(x)\, dx$ and $P(a \le x \le b) = \int_{a}^{b} p(x)\, dx = D(b) - D(a)$.
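The interval probability $P(a \le x \le b) = D(b) - D(a)$ can be checked numerically; a minimal sketch using a standard Gaussian instead of the uniform distribution of the slide (the interval $[-1, 1]$ is an arbitrary choice):

```python
from scipy.stats import norm

D = norm(loc=0.0, scale=1.0).cdf      # cumulative distribution D(x) of a standard Gaussian
a, b = -1.0, 1.0
print(D(b) - D(a))                    # P(a <= x <= b) = D(b) - D(a) ~ 0.683
```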

22 MACHINE LEARNING PDF Equivalency with Discrete Probability Figure: a Gaussian pdf (Gaussian fct) and its cumulative distribution.

23 MACHINE LEARNING Joint and Conditional Probabilities The joint probability of two variables $x$ and $y$ is written $p(x, y)$. The conditional probability is computed as follows: $p(x \mid y) = \frac{p(x, y)}{p(y)}$ and $p(y \mid x) = \frac{p(x, y)}{p(x)}$, hence $p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$ (Bayes' theorem).

24 MACHINE LEARNING Example of Use of Bayes Rule in ML Naïve Bayes classification can be used to determine the likelihood of the class label. Bayes rule for binary classification: $x$ has class label $+1$ if $p(y = +1 \mid x) > \delta$, $\delta \ge 0$; otherwise $x$ takes class label $-1$. Figure: $P(y = +1 \mid x)$ modeled as a Gaussian fct, with the threshold $\delta$ and the interval $[x_{\min}, x_{\max}]$.

25 MACHINE LEARNING The probability that $x$ takes a value in the subinterval $[x_{\min}, x_{\max}]$ is given by: $P(x_{\min} \le x \le x_{\max}) = D(x_{\max}) - D(x_{\min})$, with $D(x)$ the cumulative distribution. Figure: $P(y = +1 \mid x)$ (Gaussian fct), its cumulative distribution $D(x)$, the threshold $\delta$ and the interval $[x_{\min}, x_{\max}]$.

26 MACHINE LEARNING Example of Use of Bayes Rule in ML Figure: Gaussian fct and its cumulative distribution; probability of $x$ within the interval = 0.939 for the chosen cutoff.

27 MACHINE LEARNING Example of Use of Bayes Rule in ML Figure: Gaussian fct and its cumulative distribution; probability of $x$ within the interval for a cutoff of 0.35.

28 MACHINE LEARNING Example of Use of Bayes Rule in ML Figure: Gaussian fct and its cumulative distribution; probability of $x$ within the interval for the chosen cutoff.

29 MACHINE LEARNING Example of Use of Bayes Rule in ML The choice of cut-off in Bayes rule corresponds to some probability of seeing an event, e.g. of belonging to class $y = +1$. This is true only when the class follows a Gauss distribution! Figure: Gaussian fct and its cumulative distribution; probability of $x$ within the interval for the chosen cutoff.

30 MACHINE LEARNING Example of Use of Bayes Rule in ML Naïve Bayes classification for binary classification. Each class is modeled with a two-dimensional Gauss function. Bayes rule for binary classification: $x$ has class label $+1$ if $p(y = +1 \mid x) > \delta$, $\delta \ge 0$; otherwise $x$ takes class label $y = -1$. The threshold $\delta$ determines the boundary.
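A hedged sketch of this decision rule, where each class is modeled by a two-dimensional Gaussian fitted to its training samples (the helper names, the equal priors and the synthetic data are assumptions for illustration, not the course's implementation):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(X):
    """Model one class with a multivariate Gaussian (mean and covariance from the data)."""
    return multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))

def posterior_pos(x, g_pos, g_neg, prior_pos=0.5):
    """p(y=+1 | x) via Bayes rule, assuming two classes and the given prior."""
    num = prior_pos * g_pos.pdf(x)
    return num / (num + (1.0 - prior_pos) * g_neg.pdf(x))

def classify(x, g_pos, g_neg, delta=0.5):
    """Label +1 if p(y=+1 | x) > delta, otherwise -1; delta determines the boundary."""
    return np.where(posterior_pos(x, g_pos, g_neg) > delta, 1, -1)

rng = np.random.default_rng(1)
g_pos = fit_gaussian(rng.multivariate_normal([2.0, 2.0], np.eye(2), size=200))
g_neg = fit_gaussian(rng.multivariate_normal([-2.0, -2.0], np.eye(2), size=200))
print(classify(np.array([[1.5, 1.0], [-2.5, -1.0]]), g_pos, g_neg, delta=0.5))
```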

31 MACHINE LEARNING Example of Use of Bayes Rule in ML The ROC (Receiver Operating Characteristic) curve shows the influence of the threshold $\delta$: compute the false and true positive rates for each value of $\delta$ in this interval.
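One way to obtain the ROC points is to sweep the threshold and recompute the true/false positive rates each time; a minimal sketch with made-up posterior scores and labels:

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """TPR/FPR of the rule 'predict +1 if p(y=+1|x) > delta' for each threshold delta."""
    pos, neg = labels == 1, labels == -1
    tpr = np.array([(scores[pos] > d).mean() for d in thresholds])
    fpr = np.array([(scores[neg] > d).mean() for d in thresholds])
    return fpr, tpr

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])   # toy p(y=+1|x) values
labels = np.array([1, 1, -1, 1, -1, 1, -1, -1])                # corresponding true labels
fpr, tpr = roc_points(scores, labels, thresholds=np.linspace(0.0, 1.0, 11))
print(np.c_[fpr, tpr])    # one (FPR, TPR) point per value of delta
```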

32 MACHINE LEARNING Likelihood Function The likelihood function (or likelihood for short) determines the probability density of observing the set $X = \{x^i\}_{i=1}^{M}$ of $M$ datapoints, if each datapoint has been generated by the pdf $p(x)$: $L(X) = p(X)$. If the data are independently and identically distributed (i.i.d.), then: $L(X) = \prod_{i=1}^{M} p(x^i)$. The likelihood can be used to determine how well a particular choice of density $p(x)$ models the data at hand.

33 MACHINE LEARNING Likelihood of a Gaussian Pdf Parameterization Consider that the pdf of the dataset $X$ is parametrized with parameters $\mu, \Sigma$. One writes $p(X; \mu, \Sigma)$ or $p(X \mid \mu, \Sigma)$. The likelihood function (short: likelihood) of the model parameters is given by: $L(\mu, \Sigma \mid X) := p(X; \mu, \Sigma)$. It measures the probability of observing $X$ if the distribution of $X$ is parametrized with $\mu, \Sigma$. If all datapoints are identically and independently distributed (i.i.d.): $L(\mu, \Sigma \mid X) = \prod_{i=1}^{M} p(x^i; \mu, \Sigma)$. To determine the best fit, search for the parameters that maximize the likelihood.
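In practice one compares log-likelihoods rather than products of densities (the product of many small numbers underflows); a small sketch of evaluating $\ln L(\mu, \Sigma \mid X)$ for two candidate parameterizations (the dataset and the candidates are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(X, mu, Sigma):
    """log L(mu, Sigma | X) = sum_i log p(x^i; mu, Sigma) for i.i.d. datapoints."""
    return multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum()

rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.3], [0.3, 1.0]], size=300)

# the parameterization with the higher (log-)likelihood models the data better
print(log_likelihood(X, np.array([1.0, -1.0]), np.array([[2.0, 0.3], [0.3, 1.0]])))
print(log_likelihood(X, np.zeros(2), np.eye(2)))
```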

34 MACHINE LEARNING Likelihood Function: Example Figure: four plots of data (fraction of observations of $x$) overlaid with a Gaussian model, each annotated with its likelihood value (legend: Data, Gaussian model). Which one is the most likely?

35 MACHINE LEARNING Likelihood Function Figure: the same four data/model plots with their likelihood values. Plot examples of the likelihood for a series of Gauss functions applied to non-Gaussian datasets.

36 MACHINE LEARNING Maximum Likelihood Optimization The principle of maximum likelihood consists of finding the optimal parameters of a given distribution by maximizing the likelihood function of these parameters, or equivalently by maximizing the probability of the data given the model and its parameters. For a multi-variate Gaussian pdf, one can determine the mean and covariance matrix by solving: $\max_{\mu, \Sigma} L(\mu, \Sigma \mid X) = \max_{\mu, \Sigma} p(X \mid \mu, \Sigma)$, i.e. $\frac{\partial}{\partial \mu} p(X \mid \mu, \Sigma) = 0$ and $\frac{\partial}{\partial \Sigma} p(X \mid \mu, \Sigma) = 0$. Exercise: if $p$ is the Gaussian pdf, then the above has an analytical solution.
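For the Gaussian case the analytical solution is the sample mean and the ($\frac{1}{M}$-normalized) sample covariance; a minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.5, -0.5], [[1.0, 0.4], [0.4, 0.8]], size=1000)

mu_ml = X.mean(axis=0)                                   # mu* = (1/M) sum_i x^i
Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / X.shape[0]      # Sigma* = (1/M) sum_i (x^i - mu*)(x^i - mu*)^T
print(mu_ml)
print(Sigma_ml)
```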

37 MACHINE LEARNING Expectation-Maximization (E-M) EM is used when no closed-form solution exists for the maximum likelihood. Example: fit a Mixture of Gaussian Functions, $p(x; \Theta) = \sum_{k=1}^{K} \alpha_k\, p(x; \mu^k, \Sigma^k)$, with $\Theta = \{\mu^1, \Sigma^1, \dots, \mu^K, \Sigma^K\}$ and $\alpha_k \in [0, 1]$: a linear weighted combination of $K$ Gaussian functions. Figure: example of a uni-dimensional mixture of 3 Gaussian functions, $f = \frac{1}{3}(f_1 + f_2 + f_3)$.

38 MACHINE LEARNING Expectation-Maximization (E-M) EM is used when no closed-form solution exists for the maximum likelihood. Example: fit a Mixture of Gaussian Functions, $p(x; \Theta) = \sum_{k=1}^{K} \alpha_k\, p(x; \mu^k, \Sigma^k)$, with $\Theta = \{\mu^1, \Sigma^1, \dots, \mu^K, \Sigma^K\}$ and $\alpha_k \in [0, 1]$: a linear weighted combination of $K$ Gaussian functions. There is no closed-form solution to: $\max_{\Theta} L(\Theta \mid X) = \max_{\Theta} p(X \mid \Theta)$, i.e. $\max_{\Theta} L(\Theta \mid X) = \max_{\Theta} \prod_{i=1}^{M} \sum_{k=1}^{K} \alpha_k\, p(x^i; \mu^k, \Sigma^k)$. E-M is an iterative procedure to estimate the best set of parameters. It converges to a local optimum → sensitive to initialization!

39 MACHINE LEARNING Expectation-Maximization (E-M) EM is an iterative procedure: 0) Make a guess: pick a set of parameters $\Theta = \{\mu^1, \Sigma^1, \dots, \mu^K, \Sigma^K\}$ (initialization). 1) Compute the likelihood of the current model, $L(\hat{\Theta} \mid X, Z)$ (E-Step). 2) Update $\Theta$ by gradient ascent on $L(\Theta \mid X, Z)$ (M-Step). 3) Iterate between steps 1 and 2 until reaching a plateau (no improvement of the likelihood). E-M converges to a local optimum only!
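A minimal sketch of the E-M loop for a uni-dimensional mixture of $K$ Gaussians, under simplifying assumptions: the M-step uses the standard closed-form updates rather than gradient ascent, convergence is approximated by a fixed number of iterations instead of monitoring a likelihood plateau, and the random initialization makes the result depend on the seed:

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, K=2, n_iter=100, seed=0):
    """Minimal E-M for a 1-D mixture of K Gaussians; returns weights, means, variances."""
    rng = np.random.default_rng(seed)
    alpha = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)        # initialization: K random datapoints as means
    var = np.full(K, x.var())
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] proportional to alpha_k * p(x_i; mu_k, var_k)
        r = alpha * gauss(x[:, None], mu, var)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances from the responsibilities
        Nk = r.sum(axis=0)
        alpha = Nk / x.size
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return alpha, mu, var

rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
print(em_gmm_1d(x, K=2))
```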

40 MACHINE LEARNING Expectation-Maximization (E-M) Solution The assignment of Gaussian 1 or 2 to either of the two clusters yields the same likelihood → the likelihood function has two equivalent optima. E-M converges to one of the two solutions depending on the initialization.

41 MACHINE LEARNING Expectation-Maximization (E-M) Exercise: demonstrate the non-convexity of the likelihood with an example. E-M converges to a local optimum only!

42 MACHINE LEARNING Determining the number of components to estimate a pdf with a mixture of Gaussians through Maximum Likelihood. Matlab and mldemos examples.

43 APPLIED MACHINE LEARNING Determining the Number of Components: Bayesian Information Criterion: $\mathrm{BIC} = -2 \ln L + B \ln(M)$. $L$: maximum likelihood of the model with $B$ parameters; $M$: number of datapoints. A lower BIC implies either fewer explanatory variables, a better fit, or both. Figure: BIC as a function of $B$.
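A possible way to apply this criterion in practice, here with scikit-learn's GaussianMixture (the synthetic 1-D dataset and the range of $K$ are arbitrary):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.concatenate([rng.normal(-3.0, 1.0, (300, 1)),
                    rng.normal(2.0, 0.7, (300, 1)),
                    rng.normal(6.0, 1.2, (300, 1))])

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, gmm.bic(X))    # choose the number of components with the lowest BIC
```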

44 MACHINE LEARNING Expectation-Maximization (E-M) Two-dimensional view of a Gaussian Mixture Model. Conditional probability of a 2-dim. mixture → interpretation of modes → climbing the gradient → show applications → variability (multiple paths) → two different paths (multiple modes) → choose one by climbing the gradient.

45 MACHINE LEARNING Maximum Likelihood Example: we have the realization of $N$ Boolean random variables, independent, with the same parameter $\theta$, and we want to estimate $\theta$. Let $s$ denote the sum of the observed values. We introduce $p_\theta(x_1, \dots, x_N) = \prod_{n:\, x_n = 1} \theta \prod_{n:\, x_n = 0} (1 - \theta) = \theta^{s} (1 - \theta)^{N - s}$. The estimate of $\theta$ following that principle is the solution of $p_\theta'(\theta) = 0$, hence $\theta = s/N$.
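The differentiation step, written out for the log-likelihood (which has the same maximizer):

```latex
\ln L(\theta) = \ln\!\big(\theta^{s}(1-\theta)^{N-s}\big) = s\ln\theta + (N-s)\ln(1-\theta),
\qquad
\frac{d\ln L}{d\theta} = \frac{s}{\theta} - \frac{N-s}{1-\theta} = 0
\;\Longrightarrow\;
s(1-\theta) = (N-s)\,\theta
\;\Longrightarrow\;
\theta = \frac{s}{N}.
```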

46 MACHINE LEARNING ML in Practice: Caveats on Statistical Measures A large number of algorithms we will see in class require knowing the mean and covariance of the probability distribution function of the data. In practice, the class means and covariances are not known. They can, however, be estimated from the training set. Either the maximum likelihood estimate or the maximum a posteriori estimate may be used in place of the exact value. Several of the algorithms used to estimate these assume that the underlying distribution follows a normal distribution. This is usually not true. Thus, one should keep in mind that, although the estimates of the covariance may be considered optimal, this does not mean that the resulting computation obtained by substituting these values is optimal, even if the assumption of normally distributed classes is correct.

47 MACHINE LEARNING ML in Practice: Caveats on Statistical Measures Another complication that you will often encounter when dealing with algorithms that require computing the inverse of the covariance matrix of the data is that, with real data, the number of dimensions of each sample may exceed the number of samples. In this case, the covariance estimates do not have full rank, i.e. they cannot be inverted. There are a number of ways to deal with this. One is to use the pseudoinverse of the covariance matrix. Another is to proceed with a singular value decomposition (SVD).
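A small sketch of the rank problem and the pseudoinverse workaround (the dimensions are chosen only to force the rank deficiency; np.linalg.pinv computes the Moore-Penrose pseudoinverse via an SVD):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(5, 20))            # only 5 samples in 20 dimensions -> rank-deficient covariance
Sigma = np.cov(X, rowvar=False)         # 20 x 20 covariance estimate

print(np.linalg.matrix_rank(Sigma))     # < 20, so a plain inverse is not meaningful
Sigma_pinv = np.linalg.pinv(Sigma)      # Moore-Penrose pseudoinverse, computed via SVD
print(Sigma_pinv.shape)
```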

48 MACHINE LEARNING Marginal, Conditional Pdf Consider two random variables $x_1$ and $x_2$ with joint distribution $p(x_1, x_2)$; then the marginal probability of $x_1$ is: $p(x_1) = \int p(x_1, x_2)\, dx_2$. The conditional probability is given by: $p(x_1 \mid x_2) = \frac{p(x_1, x_2)}{p(x_2)} \;\Leftrightarrow\; p(x_2 \mid x_1) = \frac{p(x_1, x_2)}{p(x_1)}$.

49 MACHINE LEARNING Conditional Pdf and Statistical Independence $x_1$ and $x_2$ are statistically independent if: $p(x_1 \mid x_2) = p(x_1)$ and $p(x_2 \mid x_1) = p(x_2)$ $\Rightarrow$ $p(x_1, x_2) = p(x_1)\, p(x_2)$. $x_1$ and $x_2$ are uncorrelated if $\mathrm{cov}(x_1, x_2) = 0 = E\{x_1 x_2\} - E\{x_1\}\, E\{x_2\}$, i.e. $\mathrm{cov}(x_1, x_2) = 0 \Rightarrow E\{x_1 x_2\} = E\{x_1\}\, E\{x_2\}$: the expectation of the joint distribution is equal to the product of the expectations of each distribution separately.

50 MACHINE LEARNING Statistical Independence and Uncorrelatedness Independent vs. uncorrelated: $p(x_1, x_2) = p(x_1)\, p(x_2) \Rightarrow E\{x_1 x_2\} = E\{x_1\}\, E\{x_2\}$, but $E\{x_1 x_2\} = E\{x_1\}\, E\{x_2\}$ does not imply $p(x_1, x_2) = p(x_1)\, p(x_2)$. Statistical independence ensures uncorrelatedness; the converse is not true.

51 MACHINE LEARNING Statistical Independence and Uncorrelatedness Example (joint probabilities P(X, Y)):

         X=-1   X=0    X=1    Total
Y=-1     3/12   0      3/12   1/2
Y=1      1/12   4/12   1/12   1/2
Total    1/3    1/3    1/3    1

X and Y are uncorrelated but not statistically independent.
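A quick numerical check of the table as reconstructed above (should the original entries differ, the conclusion to verify is the same: zero covariance without independence):

```python
import numpy as np

# joint probabilities P(X, Y): rows Y = -1, +1; columns X = -1, 0, +1
P = np.array([[3/12, 0.0,  3/12],
              [1/12, 4/12, 1/12]])
xs = np.array([-1.0, 0.0, 1.0])
ys = np.array([-1.0, 1.0])

Px, Py = P.sum(axis=0), P.sum(axis=1)     # marginals of X and Y
E_x, E_y = xs @ Px, ys @ Py
E_xy = ys @ P @ xs
print("cov(X, Y) =", E_xy - E_x * E_y)    # 0 -> uncorrelated
print("P(X=0, Y=-1) =", P[0, 1], "vs P(X=0) P(Y=-1) =", Px[1] * Py[0])   # unequal -> not independent
```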

52 MACHINE LEARNING Example of Use of Bayes Rule in ML Naïve Bayes classification can be used to determine the likelihood of the class label. Bayes rule for binary classification: $x$ has class label $+1$ if $p(y = +1 \mid x) > \delta$, $\delta \ge 0$; otherwise $x$ takes class label $-1$. Figure: two Gaussian fcts, $P(y = +1 \mid x)$ with the threshold $\delta$, and the cumulative distribution.

53 MACHINE LEARNING Example of Use of Bayes Rule in ML Naïve Bayes classification can be used to determine the likelihood of the class label. Bayes rule for binary classification: $x$ has class label $+1$ if $p(y = +1 \mid x) > \delta$, $\delta \ge 0$; otherwise $x$ takes class label $-1$. Figure: $P(y = +1 \mid x)$ (Gaussian fct) with the threshold $\delta$.
