Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation (MLE)
Ken Kreutz-Delgado (Nuno Vasconcelos)
ECE 175A, Winter 2012, UCSD

Statistical Learning
Goal: given a relationship between a feature vector x and a vector y, and data samples (x_i, y_i), find an approximating function f(x) ≈ y:
$$x \;\rightarrow\; f(\cdot) \;\rightarrow\; \hat{y} = f(x) \approx y$$
This is called training or learning. There are two major types of learning:
Unsupervised classification (aka clustering) or regression ("blind curve fitting"): only X is known.
Supervised classification or regression: both X and the target value Y are known during training; only X is known at test time.

Optimal Classifiers
Performance depends on the data/feature-space metric. Some metrics are better than others, and the meaning of "better" is connected to how well adapted the metric is to the properties of the data. But can we be more rigorous? What do we mean by optimal?
To talk about optimality we need to talk about a cost or loss $L(y, \hat{y})$, where $\hat{y} = f(x)$. The average loss (risk) is the function that we want to minimize. The risk depends on the true $y$ and the prediction $\hat{y}$, and tells us how good our predictor/estimator is.

Data-Conditional Risk, R(x, i), for 0/1 Loss
An important special case of interest: zero loss for no error and equal loss for the two error types. This is equivalent to the zero/one loss:
$$L[i, j] = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases}$$
e.g., for the frog-classification example (dart frog vs. regular frog):

prediction \ true class   dart frog   regular frog
regular                       1            0
dart                          0            1

Under this loss
$$i^*(x) = \arg\min_i \sum_j L[i, j]\, P_{Y|X}(j \mid x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x)$$

Data-Conditional Risk, R(x, i), for 0/1 Loss
Note, then, that in the 0/1 loss case,
$$R(x, i) = E\left[L[Y, i] \mid x\right] = \sum_{j \neq i} P_{Y|X}(j \mid x) = 1 - P_{Y|X}(i \mid x).$$
I.e., the data-conditional risk under the 0/1 loss is equal to the data-conditional probability of error. Thus the optimal Bayesian decision rule (BDR) under 0/1 loss minimizes the conditional probability of error. This is given by the MAP BDR:
$$i^*(x) = \arg\max_i P_{Y|X}(i \mid x).$$

Data-Conditional Risk, R(x, i), for 0/1 Loss
Summarizing:
$$i^*(x) = \arg\min_i \sum_{j \neq i} P_{Y|X}(j \mid x) = \arg\min_i \left[1 - P_{Y|X}(i \mid x)\right] = \arg\max_i P_{Y|X}(i \mid x)$$
The optimal decision rule is the MAP rule: pick the class with the largest probability given the observation x. This is the Bayes decision rule (BDR) for the 0/1 loss. We will often simplify our discussion by assuming this loss, but you should always be aware that other losses may be used.

BDR (under 0/1 Loss)
For the zero/one loss, the following three decision rules are optimal and equivalent:
1) $i^*(x) = \arg\max_i P_{Y|X}(i \mid x)$
2) $i^*(x) = \arg\max_i P_{X|Y}(x \mid i)\, P_Y(i)$
3) $i^*(x) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right]$
Form 1) is usually hard to use; 3) is frequently easier than 2).
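Form 3) is the one most often implemented. Here is a minimal sketch in Python/NumPy: the two-class 2-D Gaussian model, its parameter values, and the helper name `bdr` are illustrative assumptions, not part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-class, 2-D Gaussian model (made-up means, covariances, priors).
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]
priors = [0.7, 0.3]

def bdr(x):
    """Form 3 of the BDR: argmax_i [log P_{X|Y}(x|i) + log P_Y(i)]."""
    scores = [multivariate_normal.logpdf(x, mean=m, cov=S) + np.log(p)
              for m, S, p in zip(means, covs, priors)]
    return int(np.argmax(scores))

print(bdr(np.array([0.3, -0.1])))  # expected: class 0
print(bdr(np.array([2.5, 1.8])))   # expected: class 1
```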

Gaussian BDR Classifier (0/1 Loss)
A very important case is that of Gaussian classes. The pdf of each class is a Gaussian of mean $\mu_i$ and covariance $\Sigma_i$:
$$P_{X|Y}(x \mid i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right\}$$
The Gaussian BDR under 0/1 loss is
$$i^*(x) = \arg\max_i \left\{ -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2}\left[d\log(2\pi) + \log|\Sigma_i|\right] + \log P_Y(i) \right\}$$

Gaussian Classifier (0/1 Loss)
This can be written as
$$i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]$$
with
$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y), \qquad \alpha_i = d\log(2\pi) + \log|\Sigma_i| - 2\log P_Y(i),$$
and can be interpreted as a nearest class-neighbor classifier which uses a "funny" metric. Note that each class has its own distance function, which is related to the square of the Mahalanobis distance for that class plus the $\alpha_i$ term for that class: we effectively use different metrics in different regions of the space.
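A minimal sketch of this "nearest class with a funny metric" view, again in Python/NumPy; the per-class parameter values are made-up numbers and the helper names are my own, only the $d_i$ and $\alpha_i$ formulas come from the slide.

```python
import numpy as np

# Hypothetical per-class parameters (illustrative, not from the lecture).
mus    = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigmas = [np.array([[1.0, 0.2], [0.2, 1.0]]), np.array([[2.0, 0.0], [0.0, 0.5]])]
priors = [0.5, 0.5]
d = 2

def alpha(S, prior):
    # alpha_i = d*log(2*pi) + log|Sigma_i| - 2*log P_Y(i)
    return d * np.log(2 * np.pi) + np.log(np.linalg.det(S)) - 2 * np.log(prior)

def mahalanobis_sq(x, mu, S):
    diff = x - mu
    return diff @ np.linalg.solve(S, diff)

def classify(x):
    scores = [mahalanobis_sq(x, mu, S) + alpha(S, p)
              for mu, S, p in zip(mus, Sigmas, priors)]
    return int(np.argmin(scores))

print(classify(np.array([0.2, 0.1])))  # expected: class 0
```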

Gaussian Classifier (0/1 Loss)
A special case of interest is when all classes have the same covariance, $\Sigma_i = \Sigma$:
$$i^*(x) = \arg\min_i \left[ d(x, \mu_i) + \alpha_i \right]$$
with
$$d(x, y) = (x - y)^T \Sigma^{-1} (x - y), \qquad \alpha_i = -2\log P_Y(i).$$
Note: $\alpha_i$ can be dropped when all classes have equal probability (the case shown in the figure on the slide). In this case the classifier is close in form to a NN classifier with Mahalanobis distance, but instead of finding the nearest training data point, it looks for the nearest class prototype $\mu_i$ using the Mahalanobis distance.

Gaussian Classifier (0/1 Loss)
Binary classification with $\Sigma_i = \Sigma$: one important property of this case is that the decision boundary is a hyperplane (homework). This can be shown by computing the set of points x such that
$$d(x, \mu_0) + \alpha_0 = d(x, \mu_1) + \alpha_1$$
and showing that they satisfy
$$w^T (x - x_0) = 0.$$
This is the equation of a hyperplane with normal $w$. The point $x_0$ can be any fixed point on the hyperplane, but it is standard to choose it to have minimum norm, in which case $w$ and $x_0$ are parallel. [Figure: a separating hyperplane with normal $w$ and offset point $x_0$ between sample points.]

Gaussian Classifier (0/1 Loss)
Furthermore, if all the covariances are the identity, $\Sigma_i = I$:
$$i^*(x) = \arg\min_i \left[ d(x, \mu_i) + \alpha_i \right]$$
with
$$d(x, y) = \|x - y\|^2, \qquad \alpha_i = -2\log P_Y(i).$$
This is just Euclidean-distance template matching with the class means as templates, e.g. for digit classification. Compare the complexity to nearest neighbors!

The Sigmoid in 0/1 Loss Detection
We have derived all of this from the log-based 0/1 BDR:
$$i^*(x) = \arg\max_i \left[\log P_{X|Y}(x \mid i) + \log P_Y(i)\right]$$
When there are only two classes, it is also interesting to look at the original definition in an alternative form:
$$i^*(x) = \arg\max_i g_i(x)$$
with
$$g_i(x) = P_{Y|X}(i \mid x) = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_X(x)} = \frac{P_{X|Y}(x \mid i)\, P_Y(i)}{P_{X|Y}(x \mid 0) P_Y(0) + P_{X|Y}(x \mid 1) P_Y(1)}$$

The Sigmoid in MAP Detection
Note that this can be written as
$$i^*(x) = \arg\max_i g_i(x), \qquad g_1(x) = 1 - g_0(x), \qquad g_0(x) = \frac{1}{1 + \dfrac{P_{X|Y}(x \mid 1)\, P_Y(1)}{P_{X|Y}(x \mid 0)\, P_Y(0)}}$$
For Gaussian classes, the posterior probability for class 0 is
$$g_0(x) = \frac{1}{1 + \exp\left\{\tfrac{1}{2}\left[d_0(x, \mu_0) + \alpha_0 - d_1(x, \mu_1) - \alpha_1\right]\right\}}$$
where, as before,
$$d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y), \qquad \alpha_i = d\log(2\pi) + \log|\Sigma_i| - 2\log P_Y(i).$$
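A small sketch of this sigmoid form for a hypothetical 1-D two-class Gaussian problem (the numbers and helper names are illustrative); it also cross-checks the sigmoid against Bayes' rule computed directly from the densities.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D two-class Gaussian problem (illustrative numbers).
mu, var, prior = [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]

def d(x, m, v):
    return (x - m) ** 2 / v                       # squared Mahalanobis distance (1-D)

def alpha(v, p):
    return np.log(2 * np.pi * v) - 2 * np.log(p)  # alpha_i for d = 1

def g0(x):
    # Posterior P(Y=0 | x) written as a sigmoid of the discriminant difference.
    z = 0.5 * (d(x, mu[0], var[0]) + alpha(var[0], prior[0])
               - d(x, mu[1], var[1]) - alpha(var[1], prior[1]))
    return 1.0 / (1.0 + np.exp(z))

# Sanity check against Bayes' rule applied directly to the class densities.
x = 0.7
num = norm.pdf(x, mu[0], np.sqrt(var[0])) * prior[0]
den = num + norm.pdf(x, mu[1], np.sqrt(var[1])) * prior[1]
print(g0(x), num / den)  # the two numbers should agree (~0.646)
```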

The Sigmoid in MAP Detection
The posterior probability for class 0,
$$g_0(x) = \frac{1}{1 + \exp\left\{\tfrac{1}{2}\left[d_0(x, \mu_0) + \alpha_0 - d_1(x, \mu_1) - \alpha_1\right]\right\}},$$
is a sigmoid. [Figure: the sigmoid-shaped posterior, with the decision boundary at $g_0(x) = 0.5$.]

The Sigmoid in Neural Networks
The sigmoid function also appears in neural networks. There, it can be interpreted as a posterior probability for a Gaussian problem where the covariances are the same.

The Sigmoid in Neural Networks
But not necessarily when the covariances are different.

Implementation
All of this is appealing, but in practice one doesn't know the values of the parameters $\mu_i$, $\Sigma_i$, $P_Y(i)$. In the homework we use an intuitive solution to design a Gaussian classifier:
Start from a collection of datasets: $D^{(i)} = \{x_1^{(i)}, \ldots, x_{n_i}^{(i)}\}$ = set of examples from class $i$.
For each class $i$, estimate the Gaussian BDR parameters using
$$\hat{\mu}_i = \frac{1}{n_i}\sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i}\sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{N},$$
where $N$ is the total number of examples (over all classes).
E.g., [figure: sample means computed for digit classification].
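A minimal sketch of these per-class estimates in Python/NumPy; the function name is my own, only the formulas come from the slide.

```python
import numpy as np

def estimate_class_parameters(X, y):
    """Per-class sample mean, covariance, and prior (the estimates on this slide).
    X: (N, d) array of feature vectors; y: (N,) array of integer class labels."""
    N = len(y)
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        diff = Xc - mu
        Sigma = diff.T @ diff / len(Xc)        # divide by n_i, not n_i - 1
        params[c] = (mu, Sigma, len(Xc) / N)   # (mean, covariance, prior)
    return params
```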

A Practical Gaussian MAP Classifier
Instead of the ideal BDR
$$i^*(x) = \arg\max_i \left\{ -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \tfrac{1}{2}\left[d\log(2\pi) + \log|\Sigma_i|\right] + \log P_Y(i) \right\},$$
use the estimate of the BDR found from the data:
$$\hat{i}^*(x) = \arg\max_i \left\{ -\tfrac{1}{2}(x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x - \hat{\mu}_i) - \tfrac{1}{2}\left[d\log(2\pi) + \log|\hat{\Sigma}_i|\right] + \log \hat{P}_Y(i) \right\}$$
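A short sketch of the plug-in classifier, reusing the `estimate_class_parameters` helper from the previous sketch; the synthetic data and the helper name `plug_in_bdr` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def plug_in_bdr(x, params):
    """Plug-in Gaussian MAP classifier: the ideal BDR with estimated parameters."""
    scores = {c: multivariate_normal.logpdf(x, mean=mu, cov=Sigma) + np.log(prior)
              for c, (mu, Sigma, prior) in params.items()}
    return max(scores, key=scores.get)

# Illustrative usage with synthetic two-class data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([3, 3], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
params = estimate_class_parameters(X, y)
print(plug_in_bdr(np.array([2.8, 3.1]), params))  # expected: 1
```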

Important Warning
At this point all optimality claims for the BDR cease to be valid!! The BDR is guaranteed to achieve the minimum loss only when we use the true probabilities. When we plug in probability estimates, we could be implementing a classifier that is quite distant from the optimal. E.g., if $P_{X|Y}(x \mid i)$ looks like the example shown on the slide, one could never approximate it well by using simple parametric models (e.g. a single Gaussian).

Maximum Likelihood Estimation (MLE)
Given a parameterized pdf, how should one estimate the parameters which define the pdf? There are many techniques of parameter estimation; we shall utilize the maximum likelihood (ML) principle. This has three steps:
1) We choose a parametric model for all probabilities. To make this clear we denote the vector of parameters by $\Theta$ and the class-conditional distributions by
$$P_{X|Y}(x \mid i\,; \Theta), \qquad \Theta \in \mathbb{R}^p.$$
Note: this is a classical statistics approach, which means that $\Theta$ is NOT a random variable. It is a deterministic but unknown parameter, and the probabilities are a function of this unknown parameter.

Maximum Likelihood Estimation (MLE)
The three steps, continued:
2) Assemble a collection of datasets: $D^{(i)} = \{x_1^{(i)}, \ldots, x_{n_i}^{(i)}\}$ = set of examples from each class $i$.
3) Select the values of the parameters of class $i$ to be the ones that maximize the probability of the data from that class:
$$\hat{\Theta}_i = \arg\max_{\Theta} P_{X|Y}\!\left(D^{(i)} \mid i\,; \Theta\right) = \arg\max_{\Theta} \log P_{X|Y}\!\left(D^{(i)} \mid i\,; \Theta\right)$$
Note that it does not make any difference whether we maximize probabilities or their logs.

Maximum Likelihood Estimation (MLE)
Since each sample $D^{(i)}$ is considered independently, and each parameter vector $\Theta_i$ is estimated only from sample $D^{(i)}$, we simply have to repeat the procedure for all classes. So, from now on we omit the class variable:
$$\hat{\Theta}_{ML} = \arg\max_{\Theta} P_X(D; \Theta) = \arg\max_{\Theta} \log P_X(D; \Theta)$$
The function $\mathcal{L}(\Theta; D) = P_X(D; \Theta)$ is the likelihood of the parameter $\Theta$ given the data $D$, or simply the likelihood function.
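In practice this maximization can always be attempted numerically. A minimal sketch, assuming a 1-D Gaussian model with parameters $(\mu, \sigma)$ and reusing the small sample that appears in the numerical example later in the lecture; the optimizer choice and starting point are my own.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Numerical MLE: maximize log P_X(D; theta) by minimizing its negative.
D = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

def neg_log_likelihood(theta):
    mu, sigma = theta
    if sigma <= 0:
        return np.inf
    return -np.sum(norm.logpdf(D, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=[0.0, 10.0], method="Nelder-Mead")
print(result.x)  # should approach the closed form derived later: mu = 30, sigma = sqrt(200)
```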

The Likelihood Function
Given a parameterized family of pdfs (also known as a statistical model) for the data $D$, we define a likelihood of the parameter vector $\Theta$ given $D$:
$$\mathcal{L}(\Theta) = \mathcal{L}(\Theta; D) = a(D)\, P_D(D; \Theta),$$
where $a(D) > 0$ for all $D$, and $a(D)$ is independent of the parameter $\Theta$. The choice $a(D) = 1$ yields the standard likelihood,
$$\mathcal{L}(\Theta; D) = P_D(D; \Theta),$$
which was shown on the previous slide.

Maximum Likelihood Principle
[Figure: two candidate densities $P_X(x; \Theta_1)$ and $P_X(x; \Theta_2)$ evaluated at the observed sample $x$; the likelihood $\mathcal{L}_x(\Theta) = P_X(x; \Theta)$ is larger for the parameter value whose density better explains the observation.]

The Likelihood Function
Note that the likelihood function is a function of the parameters. It does not have the same shape as the density itself; e.g., the likelihood function of a Gaussian is not bell-shaped. The likelihood is defined only after we have a data sample:
$$P_X(d; \Theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(d - \mu)^2}{2\sigma^2}\right\}, \qquad \Theta = (\mu, \sigma).$$

Maximum Likelihood Estimation (MLE)
Given a sample, to obtain the ML estimate we need to solve
$$\hat{\Theta}_{ML} = \arg\max_{\Theta} P_X(D; \Theta).$$
When $\Theta$ is a scalar, this is high-school calculus: we have a local maximum of $f(x)$ at a point $x$ when the first derivative at $x$ is zero ($x$ is a stationary point) and the second derivative is negative at $x$.

MLE Example
Gaussian with unknown mean and standard deviation: given a data sample $D = \{T_1, \ldots, T_N\}$ of independent and identically distributed (iid) measurements, the (standard) likelihood is
$$\mathcal{L}(\mu, \sigma; T_1, \ldots, T_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(T_i - \mu)^2}{2\sigma^2}\right\}$$

MLE Example
The log-likelihood is
$$\log \mathcal{L}(\mu, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(T_i - \mu)^2.$$
The derivative with respect to the mean is zero when
$$\frac{\partial \log \mathcal{L}}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(T_i - \mu) = 0,$$
yielding
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} T_i.$$
Note that this is just the sample mean.

MLE Example
The log-likelihood is
$$\log \mathcal{L}(\mu, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(T_i - \mu)^2.$$
The derivative with respect to the standard deviation is zero when
$$\frac{\partial \log \mathcal{L}}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{N}(T_i - \mu)^2 = 0,$$
or
$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(T_i - \hat{\mu})^2.$$
Note that this is just the sample variance.

MLE Example
Numerical example: if the sample is {10, 20, 30, 40, 50}, then
$$\hat{\mu} = \frac{10 + 20 + 30 + 40 + 50}{5} = 30, \qquad \hat{\sigma}^2 = \frac{(-20)^2 + (-10)^2 + 0^2 + 10^2 + 20^2}{5} = 200, \qquad \hat{\sigma} = \sqrt{200} \approx 14.1.$$
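A two-line check of the closed-form MLE on this sample (plain NumPy; nothing assumed beyond the slide's formulas):

```python
import numpy as np

sample = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
mu_hat = sample.mean()                                  # sample mean
sigma_hat = np.sqrt(np.mean((sample - mu_hat) ** 2))    # MLE std: divide by N, not N-1
print(mu_hat, sigma_hat)  # 30.0, ~14.14
```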

The Gradient
In higher dimensions, the generalization of the derivative is the gradient. The (Cartesian) gradient of a function $f(w)$ at $z$ is
$$\nabla f(z) = \left[\frac{\partial f}{\partial w_0}(z), \ldots, \frac{\partial f}{\partial w_{n-1}}(z)\right]^T.$$
The gradient has a nice geometric interpretation: it points in the direction of maximum growth of the function (the steepest-ascent direction), which makes it perpendicular to the contours where the function is constant. The above is the gradient for the simple (unweighted) Euclidean norm (aka the Cartesian gradient). [Figure: gradient vectors perpendicular to the level contours of $f(x, y)$.]

The Gradient
Note that if $\nabla f(x) = 0$, there is no direction of growth at $x$; also $-\nabla f(x) = 0$, so there is no direction of decrease at $x$. We are either at a local minimum, a local maximum, or a saddle point at $x$. Conversely, if there is a local min, max, or saddle point at $x$, there is no direction of growth or decrease at $x$, and $\nabla f(x) = 0$. This shows that we have a stationary point at $x$ if and only if $\nabla f(x) = 0$. To determine which type holds we need second-order conditions. [Figure: surfaces with a maximum, a minimum, and a saddle point.]

The Hessian
The extension of the scalar second-order derivative is the Hessian matrix of second partial derivatives:
$$\nabla^2 f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}$$
Note that the Hessian is symmetric. The Hessian gives us the quadratic function
$$f(x_0) + \tfrac{1}{2}(x - x_0)^T \nabla^2 f(x_0)(x - x_0)$$
that best approximates $f(x)$ at a stationary point $x_0$.

Hessian as a Quadratic Approximation
E.g., this means that, if the gradient is zero at $x_0$, we have:
a maximum when the function $f(x)$ can be locally approximated by a downward-opening (concave) quadratic bowl, i.e. $H(x_0)$ is negative definite;
a minimum when the function can be locally approximated by an upward-opening (convex) quadratic bowl, i.e. $H(x_0)$ is positive definite;
a saddle point otherwise ($H(x_0)$ is indefinite).
[Figure: maximum, minimum, and saddle-point surfaces.]

Hessian Gives Local Behavior
This is something that we already saw: for any symmetric matrix $M$, the quadratic function $x^T M x$
has a maximum (a downward-opening quadratic) at $x = 0$ when $M$ is negative definite;
has a minimum (an upward-opening quadratic bowl) at $x = 0$ when $M$ is positive definite;
has a saddle point at $x = 0$ otherwise.
Hence, similarly, what matters is the definiteness of the Hessian at a stationary point $x_0$. E.g., we have a maximum at a stationary point $x_0$ when the Hessian is negative definite at $x_0$.

Optimality Conditions
In summary: $w_0$ is a local minimum of $f(w)$ if and only if $f$ has zero gradient at $w_0$,
$$\nabla f(w_0) = 0,$$
and the Hessian of $f$ at $w_0$ is positive definite,
$$d^T \nabla^2 f(w_0)\, d \geq 0, \quad \forall d \in \mathbb{R}^n,$$
where
$$\nabla^2 f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}$$
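These conditions are easy to check numerically. A quick sketch that classifies a stationary point from the eigenvalues of a finite-difference Hessian; the helper `classify_stationary_point` and the example function are my own illustrations, not from the lecture.

```python
import numpy as np

def classify_stationary_point(f, x0, h=1e-5):
    """Second-order test at a stationary point x0 of f, via a finite-difference Hessian."""
    n = len(x0)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(x0 + ei + ej) - f(x0 + ei - ej)
                       - f(x0 - ei + ej) + f(x0 - ei - ej)) / (4 * h * h)
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "minimum"        # Hessian positive definite
    if np.all(eig < 0):
        return "maximum"        # Hessian negative definite
    return "saddle point (or degenerate)"

# Example: f(x, y) = x^2 - y^2 has a saddle point at the origin.
print(classify_stationary_point(lambda x: x[0] ** 2 - x[1] ** 2, np.array([0.0, 0.0])))
```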

Maximum Likelihood Estimation (MLE)
Given a sample, to obtain an MLE we want to solve
$$\hat{\Theta}_{ML} = \arg\max_{\Theta} P_X(D; \Theta).$$
Candidate solutions are the parameter values $\hat{\Theta}$ such that
$$\nabla_{\Theta} P_X(D; \hat{\Theta}) = 0.$$
Note that you always have to check the second-order (Hessian) condition.

MLE Example
Back to our Gaussian example. Given iid samples $\{T_1, \ldots, T_N\}$, the likelihood is
$$\mathcal{L}(\mu, \sigma; T_1, \ldots, T_N) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(T_i - \mu)^2}{2\sigma^2}\right\}$$

MLE Example
The log-likelihood is
$$\log \mathcal{L}(\mu, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(T_i - \mu)^2.$$
The derivative with respect to the mean is
$$\frac{\partial \log \mathcal{L}}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(T_i - \mu),$$
from which we compute the second-order derivatives
$$\frac{\partial^2 \log \mathcal{L}}{\partial \mu^2} = -\frac{N}{\sigma^2}, \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \sigma\, \partial \mu} = -\frac{2}{\sigma^3}\sum_{i=1}^{N}(T_i - \mu).$$

MLE Example
The derivative with respect to the standard deviation is
$$\frac{\partial \log \mathcal{L}}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{N}(T_i - \mu)^2,$$
which yields the second-order derivatives
$$\frac{\partial^2 \log \mathcal{L}}{\partial \sigma^2} = \frac{N}{\sigma^2} - \frac{3}{\sigma^4}\sum_{i=1}^{N}(T_i - \mu)^2, \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \mu\, \partial \sigma} = -\frac{2}{\sigma^3}\sum_{i=1}^{N}(T_i - \mu).$$
The stationary parameter values are
$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} T_i, \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(T_i - \hat{\mu})^2.$$

MLE Example
The elements of the Hessian, evaluated at the stationary point (where $\sum_i (T_i - \hat{\mu}) = 0$ and $\sum_i (T_i - \hat{\mu})^2 = N\hat{\sigma}^2$), are
$$\frac{\partial^2 \log \mathcal{L}}{\partial \mu^2} = -\frac{N}{\hat{\sigma}^2}, \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \mu\, \partial \sigma} = 0, \qquad \frac{\partial^2 \log \mathcal{L}}{\partial \sigma^2} = \frac{N}{\hat{\sigma}^2} - \frac{3N\hat{\sigma}^2}{\hat{\sigma}^4} = -\frac{2N}{\hat{\sigma}^2}.$$
Thus the Hessian is
$$\nabla^2 \log \mathcal{L}(\hat{\mu}, \hat{\sigma}) = \begin{bmatrix} -\dfrac{N}{\hat{\sigma}^2} & 0 \\ 0 & -\dfrac{2N}{\hat{\sigma}^2} \end{bmatrix},$$
which is clearly negative definite at the stationary point. Thus we have determined the MLE of the parameters.
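A quick numeric confirmation on the sample {10, 20, 30, 40, 50} from the earlier numerical example: build the analytic Hessian at the MLE and check that its eigenvalues are negative.

```python
import numpy as np

T = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
N = len(T)
mu_hat = T.mean()
sigma_hat = np.sqrt(np.mean((T - mu_hat) ** 2))

# Analytic Hessian at the stationary point: diag(-N/sigma^2, -2N/sigma^2).
H = np.array([[-N / sigma_hat ** 2, 0.0],
              [0.0, -2 * N / sigma_hat ** 2]])
print(np.linalg.eigvalsh(H))  # both eigenvalues negative -> negative definite -> a maximum
```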

2nd MLE Example
To find the MLE of the two prior class probabilities $P_Y(i)$, note that
$$P_Y(1) = \pi, \qquad P_Y(0) = 1 - \pi, \qquad \pi \in [0, 1],$$
can be written as
$$P_Y(x) = \pi^x (1 - \pi)^{1 - x}, \qquad x \in \{0, 1\},$$
where $x$ is the so-called indicator (or 0/1) variable. Given iid indicator samples $D = \{x_1, \ldots, x_N\}$, we have
$$\mathcal{L}(\pi; D) = P_Y(D; \pi) = \prod_{i=1}^{N} \pi^{x_i} (1 - \pi)^{1 - x_i}$$

2nd MLE Example
Therefore
$$\log P_Y(D; \pi) = \sum_{i=1}^{N}\left[x_i \log \pi + (1 - x_i)\log(1 - \pi)\right].$$
Setting the derivative of the log-likelihood with respect to $\pi$ equal to zero,
$$\frac{\partial}{\partial \pi}\log P_Y(D; \pi) = \sum_{i=1}^{N}\left[\frac{x_i}{\pi} - \frac{1 - x_i}{1 - \pi}\right] = \frac{1}{\pi(1 - \pi)}\left[\sum_{i=1}^{N} x_i - N\pi\right] = 0,$$

2nd MLE Example
yields the MLE estimate
$$\hat{\pi}_{ML} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{n_1}{N}, \qquad \text{where } n_1 = \sum_{i=1}^{N} x_i.$$
Note that this is just the relative frequency of occurrence of the value 1 in the sample. I.e., the MLE is just the count of the number of 1's over the total number of points! Again we see that the MLE yields an intuitively pleasing estimate of the unknown parameter.
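A small sketch that checks the closed-form result (relative frequency of 1's) against a direct numerical maximization of the log-likelihood; the sample is made up.

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 0, 1, 1, 0, 1, 1])

pi_closed_form = x.mean()  # n_1 / N

def neg_log_likelihood(pi):
    return -np.sum(x * np.log(pi) + (1 - x) * np.log(1 - pi))

pi_numeric = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6),
                             method="bounded").x
print(pi_closed_form, pi_numeric)  # both should be 0.625
```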

2nd MLE Example
Check that the second derivative is negative:
$$\frac{\partial^2}{\partial \pi^2}\log P_Y(D; \pi) = -\sum_{i=1}^{N}\left[\frac{x_i}{\pi^2} + \frac{1 - x_i}{(1 - \pi)^2}\right] < 0$$
for $0 < \pi < 1$.

Combining the MLE Examples
For Gaussian classes, all of the above formulas can be generalized to the random-vector case as follows. Let $D^{(i)} = \{x_1^{(i)}, \ldots, x_{n_i}^{(i)}\}$ be the set of iid $d$-dimensional vector examples from each class $i$. The MLE estimates in the vector random-data case are
$$\hat{\mu}_i = \frac{1}{n_i}\sum_j x_j^{(i)}, \qquad \hat{\Sigma}_i = \frac{1}{n_i}\sum_j (x_j^{(i)} - \hat{\mu}_i)(x_j^{(i)} - \hat{\mu}_i)^T, \qquad \hat{P}_Y(i) = \frac{n_i}{N}.$$
These are the sample estimates given earlier with no justification. The ML solutions are intuitive, which is usually the case.

END