ECE 901 Lecture 13: Maximum Likelihood Estimation

R. Nowak, 5/7/2009

The focus of this lecture is to consider another approach to learning based on maximum likelihood estimation. Unlike earlier approaches considered here, we are willing to make somewhat stronger assumptions about the relation between features and labels. These are quite reasonable in many settings, in particular in many imaging applications. Consider the classical signal plus noise model

    Y_i = f(i/n) + W_i,  i = 1, \dots, n,

where the W_i are i.i.d. zero-mean noises. Furthermore, assume that the W_i have a distribution characterized by a probability density function (p.d.f.) p(w) for some known density p(w). Then

    Y_i \sim p(y - f(i/n)) \equiv p_{f_i}(y),

since Y_i - f(i/n) = W_i. In a setting like this it is quite common to consider the maximum likelihood approach: seek a most probable explanation for the observations. Define the likelihood of the data to be the p.d.f. of the observations (Y_1, \dots, Y_n),

    \prod_{i=1}^n p_{f_i}(Y_i).

The maximum likelihood estimator seeks the model that maximizes the likelihood, or equivalently minimizes the negative log-likelihood

    - \sum_{i=1}^n \log p_{f_i}(Y_i).    (1)

We immediately notice the similarity between the empirical risk we have seen before and the negative log-likelihood. We will see that we can regard maximum likelihood estimation as our familiar minimization of empirical risk when the loss function is chosen appropriately. In the meantime, note that minimizing (1) yields our familiar squared-error loss if the W_i are Gaussian. If the W_i are Laplacian (p_W(w) \propto e^{-c|w|}) we get the sum of absolute errors. We can also consider non-additive models like the Poisson model (used often in medical imaging applications, such as PET imaging)

    Y_i \sim p(y \mid f(i/n)) = \frac{e^{-f(i/n)} f(i/n)^y}{y!},

which gives rise to the negative log-likelihood

    - \sum_i \log p(Y_i \mid f(i/n)) = \sum_i \left[ f(i/n) - Y_i \log f(i/n) \right] + \text{constant},

a very different loss function, but one quite appropriate for many imaging problems.
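To make the correspondence between noise models and loss functions concrete, here is a minimal numerical sketch (my own illustration, not part of the original notes; it assumes numpy, and the grid i/n, the example function f, and the noise level are arbitrary choices). Up to constants, the negative log-likelihood (1) is the squared-error loss under Gaussian noise, proportional to the absolute-error loss under Laplacian noise, and the loss f - y log f under the Poisson model.

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma = 200, 0.5
    x = np.arange(1, n + 1) / n
    f_true = np.sin(2 * np.pi * x) + 2.0   # kept positive so the Poisson model makes sense

    def neg_log_lik(f, y, model):
        """Negative log-likelihood of a candidate f, dropping terms that do not depend on f."""
        if model == "gaussian":    # -log p = (y - f)^2 / (2 sigma^2) + const
            return np.sum((y - f) ** 2) / (2 * sigma**2)
        if model == "laplacian":   # -log p = c|y - f| + const (scale c dropped here)
            return np.sum(np.abs(y - f))
        if model == "poisson":     # -log p = f - y log f + const
            return np.sum(f - y * np.log(f))
        raise ValueError(model)

    # Gaussian noise: minimizing the negative log-likelihood is least squares,
    # so the true f scores better than a shifted candidate.
    y = f_true + sigma * rng.standard_normal(n)
    print(neg_log_lik(f_true, y, "gaussian"), neg_log_lik(f_true + 1.0, y, "gaussian"))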

1 Maximum Likelihood Estimation

Before we investigate maximum likelihood estimation for model selection, let's review some of the basic concepts. Let Θ denote a parameter space (e.g., Θ = \mathbb{R}, or Θ = {smooth functions}). Assume we have observations

    Y_i \overset{iid}{\sim} p_{\theta^*}(y),  i = 1, \dots, n,

where θ* ∈ Θ is a parameter determining the density of the {Y_i}. The ML estimator of θ* is

    \hat{\theta}_n = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(Y_i)
                   = \arg\max_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i)
                   = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n -\log p_\theta(Y_i).

Note that by the strong law of large numbers

    \frac{1}{n} \sum_{i=1}^n -\log p_\theta(Y_i) \overset{a.s.}{\longrightarrow} E[-\log p_\theta(Y)].

So we can use the negative log-likelihood as a proxy for E[-log p_θ(Y)]. Let's see why this is the thing to do:

    E[\log p_{\theta^*}(Y)] - E[\log p_\theta(Y)]
        = E\left[\log \frac{p_{\theta^*}(Y)}{p_\theta(Y)}\right]
        = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)} \, p_{\theta^*}(y)\, dy
        \equiv K(p_{\theta^*}, p_\theta), \text{ the KL divergence,}
        \ge 0 \text{ with equality iff } p_\theta = p_{\theta^*},

where K is the Kullback-Leibler divergence between two densities. This is a measure of the distinguishability between two different random variables. It is not a symmetric function, so the order of the arguments is important. Furthermore it is always nonnegative, and zero only if the two densities are identical:

    -E\left[\log \frac{p_\theta(Y)}{p_{\theta^*}(Y)}\right]
        \ge -\log E\left[\frac{p_\theta(Y)}{p_{\theta^*}(Y)}\right]
        = -\log \int \frac{p_\theta(y)}{p_{\theta^*}(y)} \, p_{\theta^*}(y)\, dy
        = -\log \int p_\theta(y)\, dy = 0
        \;\Longrightarrow\; K(p_{\theta^*}, p_\theta) \ge 0,

where the inequality is Jensen's, applied to the convex function -log. By showing that E[log p_{θ*}(Y)] - E[log p_θ(Y)] ≥ 0 we immediately see that minimizing E[-log p_θ(Y)] with respect to θ gets us close to the true model θ*, exactly what we want to do.
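As a quick numerical companion to this review (a sketch of my own under the assumption of a Gaussian model N(θ, σ²) with known variance; numpy assumed, parameter values arbitrary), the empirical negative log-likelihood, viewed as a function of θ, settles down near its expectation E[-log p_θ(Y)] and is minimized near the true parameter θ*:

    import numpy as np

    rng = np.random.default_rng(1)
    theta_star, sigma, n = 1.5, 1.0, 5000
    Y = theta_star + sigma * rng.standard_normal(n)

    def avg_neg_log_lik(theta, y):
        # (1/n) sum_i -log p_theta(Y_i) for the N(theta, sigma^2) model
        return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (y - theta) ** 2 / (2 * sigma**2))

    thetas = np.linspace(0.0, 3.0, 301)
    risks = np.array([avg_neg_log_lik(t, Y) for t in thetas])
    print("empirical minimizer:", thetas[risks.argmin()])              # close to theta* = 1.5
    print("sample mean (the exact MLE here):", Y.mean())
    print("limiting risk at theta*:", 0.5 * np.log(2 * np.pi * sigma**2) + 0.5)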

1.1 Likelihood as a Loss Function

We can restate the maximum likelihood estimator in the general terms we are using in this course. We have i.i.d. observations drawn from an unknown distribution,

    Y_i \overset{i.i.d.}{\sim} p_{\theta^*},  i ∈ \{1, \dots, n\},

where θ* ∈ Θ. We can view p_{θ*} as a member of a parametric class of distributions, \mathcal{P} = \{p_\theta\}_{\theta \in \Theta}. Our goal is to use the observations {Y_i} to select an appropriate distribution (e.g., model) from \mathcal{P}. We would like the selected distribution to be close to p_{θ*} in some sense.

We use the negative log-likelihood loss function, defined as

    l(\theta, Y_i) = -\log p_\theta(Y_i).

The empirical risk is

    \hat{R}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i).

We select the distribution that minimizes the empirical risk,

    \min_{p \in \mathcal{P}} \; -\frac{1}{n} \sum_{i=1}^n \log p(Y_i)
        = \min_{\theta \in \Theta} \; -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i).

In other words, the distribution we select is \hat{p}_n := p_{\hat{\theta}_n}, where

    \hat{\theta}_n = \arg\min_{\theta \in \Theta} \; -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i).

The risk is defined as

    R(\theta) = E[l(\theta, Y)] = E[-\log p_\theta(Y)].

And the excess risk of θ is defined as

    R(\theta) - R(\theta^*) = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)} \, p_{\theta^*}(y)\, dy = K(p_{\theta^*}, p_\theta).

We recognize that the excess risk corresponding to this loss function is simply the Kullback-Leibler (KL) divergence, or relative entropy, denoted by K(p_{θ*}, p_θ). It is easy to see that K(p_{θ*}, p_θ) is always nonnegative and is zero if and only if p_θ = p_{θ*}. This shows that θ* minimizes the risk. The KL divergence measures how different two probability distributions are, and it is therefore natural to use it to measure the convergence of maximum likelihood procedures.
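The identity "excess risk = KL divergence" can be checked numerically. The following sketch (my own check, not from the notes; Gaussian model with known variance and numpy assumed) estimates R(θ) - R(θ*) by Monte Carlo and compares it with K(p_{θ*}, p_θ) computed by grid integration:

    import numpy as np

    rng = np.random.default_rng(0)
    theta_star, theta, sigma = 0.0, 1.0, 1.0

    def neg_log_p(y, t):
        # l(t, y) = -log p_t(y) for the N(t, sigma^2) model
        return 0.5 * np.log(2 * np.pi * sigma**2) + (y - t) ** 2 / (2 * sigma**2)

    # Monte Carlo estimate of the excess risk E[l(theta, Y)] - E[l(theta*, Y)]
    Y = theta_star + sigma * rng.standard_normal(1_000_000)
    excess_risk = np.mean(neg_log_p(Y, theta)) - np.mean(neg_log_p(Y, theta_star))

    # KL divergence K(p_theta*, p_theta) by grid integration of log(p_theta*/p_theta) p_theta*
    y = np.linspace(-10, 10, 200_001)
    dy = y[1] - y[0]
    p_star = np.exp(-neg_log_p(y, theta_star))
    K = np.sum((neg_log_p(y, theta) - neg_log_p(y, theta_star)) * p_star) * dy

    print(excess_risk, K)   # both close to (theta - theta_star)^2 / (2 sigma^2) = 0.5 here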

1.2 Convergence of Log-Likelihood to KL Divergence

Since \hat{\theta}_n maximizes the likelihood over θ ∈ Θ, we have

    \prod_{i=1}^n p_{\hat{\theta}_n}(Y_i) \ge \prod_{i=1}^n p_{\theta^*}(Y_i),

and therefore

    \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)} \le 0,

or, adding and subtracting K(p_{\theta^*}, p_{\hat{\theta}_n}) and re-arranging,

    K(p_{\theta^*}, p_{\hat{\theta}_n}) \le K(p_{\theta^*}, p_{\hat{\theta}_n}) - \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)}.

Notice that the quantity

    \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)}

is an empirical average whose mean is K(p_{\theta^*}, p_{\hat{\theta}_n}). By the law of large numbers, for each fixed θ ∈ Θ,

    \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)} - K(p_{\theta^*}, p_\theta) \overset{a.s.}{\longrightarrow} 0.

If this also holds for the sequence {\hat{\theta}_n}, then we have

    \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)} - K(p_{\theta^*}, p_{\hat{\theta}_n}) \to 0 \quad \text{as } n \to \infty,

which, combined with the bound above, implies that K(p_{\theta^*}, p_{\hat{\theta}_n}) \to 0, i.e.,

    p_{\hat{\theta}_n} \to p_{\theta^*},

which often implies that

    \hat{\theta}_n \to \theta^*

in some appropriate sense (e.g., point-wise or in norm).

Example 1. Gaussian Distributions. Let

    p_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta)^2}{2\sigma^2}}, \quad \theta \in \Theta = \mathbb{R}, \quad \{Y_i\}_{i=1}^n \overset{iid}{\sim} p_{\theta^*}(y).

Then

    K(p_{\theta^*}, p_\theta) = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)} \, p_{\theta^*}(y)\, dy
        = \frac{1}{2\sigma^2} \int \left[ (y-\theta)^2 - (y-\theta^*)^2 \right] p_{\theta^*}(y)\, dy
        = \frac{1}{2\sigma^2} \left[ E_{\theta^*}(Y-\theta)^2 - E_{\theta^*}(Y-\theta^*)^2 \right]
        = \frac{1}{2\sigma^2} \left[ E_{\theta^*}Y^2 - 2\theta E_{\theta^*}Y + \theta^2 - \left( E_{\theta^*}Y^2 - 2\theta^* E_{\theta^*}Y + (\theta^*)^2 \right) \right]
        = \frac{1}{2\sigma^2} \left[ -2(\theta - \theta^*)\theta^* + \theta^2 - (\theta^*)^2 \right]
        = \frac{(\theta - \theta^*)^2}{2\sigma^2}.

The ML estimator is

    \hat{\theta}_n = \arg\max_{\theta} \left\{ -\sum_{i=1}^n (Y_i - \theta)^2 \right\}
                   = \arg\min_{\theta} \left\{ \sum_{i=1}^n (Y_i - \theta)^2 \right\}
                   = \frac{1}{n} \sum_{i=1}^n Y_i,

the sample mean, and θ* maximizes E[log p_θ(Y)] with respect to θ ∈ Θ, as shown above.
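The next sketch (again my own illustration for the Gaussian model of Example 1; numpy assumed, parameter values arbitrary) ties the example to the convergence statement above: the empirical average (1/n) Σ_i log(p_{θ*}(Y_i)/p_θ(Y_i)) approaches K(p_{θ*}, p_θ) = (θ - θ*)²/(2σ²) as n grows, and the MLE is the sample mean.

    import numpy as np

    rng = np.random.default_rng(1)
    theta_star, theta, sigma = 0.0, 1.0, 1.0
    K = (theta - theta_star) ** 2 / (2 * sigma**2)   # closed form from Example 1

    def log_ratio(y):
        # log p_theta*(y) - log p_theta(y); the normalizing constants cancel
        return ((y - theta) ** 2 - (y - theta_star) ** 2) / (2 * sigma**2)

    for n in (10, 1_000, 100_000):
        Y = theta_star + sigma * rng.standard_normal(n)
        print(n, "MLE =", Y.mean(),
              " empirical KL estimate =", np.mean(log_ratio(Y)),
              " K =", K)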

1.3 Hellinger Distance

The KL divergence is not a distance function: K(p_{θ_1}, p_{θ_2}) ≠ K(p_{θ_2}, p_{θ_1}) in general. Therefore, it is often more convenient to work with the Hellinger metric,

    H(p_{\theta_1}, p_{\theta_2}) = \left( \int \left( \sqrt{p_{\theta_1}(y)} - \sqrt{p_{\theta_2}(y)} \right)^2 dy \right)^{1/2}.

The Hellinger metric is non-negative and symmetric, H(p_{θ_1}, p_{θ_2}) = H(p_{θ_2}, p_{θ_1}), and therefore it is a distance measure. Furthermore, the squared Hellinger distance lower bounds the KL divergence, so convergence in KL divergence implies convergence in Hellinger distance.

Proposition 1. H^2(p_{\theta_1}, p_{\theta_2}) \le K(p_{\theta_1}, p_{\theta_2}).

Proof:

    H^2(p_{\theta_1}, p_{\theta_2}) = \int \left( \sqrt{p_{\theta_1}(y)} - \sqrt{p_{\theta_2}(y)} \right)^2 dy
        = \int p_{\theta_1}(y)\, dy + \int p_{\theta_2}(y)\, dy - 2 \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy
        = 2 - 2 \int \sqrt{\frac{p_{\theta_2}(y)}{p_{\theta_1}(y)}} \, p_{\theta_1}(y)\, dy, \quad \text{since } \int p_\theta(y)\, dy = 1
        = 2 \left( 1 - E_{\theta_1}\!\left[ \sqrt{p_{\theta_2}(Y)/p_{\theta_1}(Y)} \right] \right)
        \le -2 \log E_{\theta_1}\!\left[ \sqrt{p_{\theta_2}(Y)/p_{\theta_1}(Y)} \right], \quad \text{since } 1 - x \le -\log x
        \le -2 \, E_{\theta_1}\!\left[ \log \sqrt{p_{\theta_2}(Y)/p_{\theta_1}(Y)} \right], \quad \text{by Jensen's inequality}
        = E_{\theta_1}\!\left[ \log \frac{p_{\theta_1}(Y)}{p_{\theta_2}(Y)} \right]
        = K(p_{\theta_1}, p_{\theta_2}).

Note that in the proof we also showed that

    H^2(p_{\theta_1}, p_{\theta_2}) = 2 \left( 1 - \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy \right),

and using the fact 1 - x \le -\log x again, we have

    H^2(p_{\theta_1}, p_{\theta_2}) \le -2 \log \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy.

The quantity inside the log is called the affinity between p_{θ_1} and p_{θ_2}:

    A(p_{\theta_1}, p_{\theta_2}) = \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy.

This is another measure of closeness between p_{θ_1} and p_{θ_2}.
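Proposition 1 and the affinity bound can be verified numerically for two Gaussians. In this sketch (my own check, not part of the notes; numpy assumed, means and variance arbitrary) all three quantities are computed by grid integration, and the chain H² ≤ -2 log A ≤ K shows up in the output:

    import numpy as np

    theta1, theta2, sigma = 0.0, 1.5, 1.0
    y = np.linspace(-12, 12, 200_001)
    dy = y[1] - y[0]

    def gauss(y, mu):
        return np.exp(-(y - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

    p1, p2 = gauss(y, theta1), gauss(y, theta2)
    H2 = np.sum((np.sqrt(p1) - np.sqrt(p2)) ** 2) * dy   # squared Hellinger distance
    A  = np.sum(np.sqrt(p1 * p2)) * dy                   # affinity
    K  = np.sum(np.log(p1 / p2) * p1) * dy               # KL divergence K(p1, p2)
    print(H2, -2 * np.log(A), K)                         # H2 <= -2 log A <= K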

Example 2. Gaussian Distributions. Again take

    p_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta)^2}{2\sigma^2}}.

Then

    -2 \log \int \sqrt{p_{\theta_1}(y)\, p_{\theta_2}(y)}\, dy
        = -2 \log \int \sqrt{ \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta_1)^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta_2)^2}{2\sigma^2}} }\, dy
        = -2 \log \int \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta_1)^2}{4\sigma^2} - \frac{(y-\theta_2)^2}{4\sigma^2}}\, dy
        = -2 \log \int \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{\left(y - \frac{\theta_1+\theta_2}{2}\right)^2}{2\sigma^2}} \, e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}\, dy
        = -2 \log e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}}
        = \frac{(\theta_1-\theta_2)^2}{4\sigma^2},

since the remaining Gaussian integral equals 1. Thus, for Gaussian distributions,

    -2 \log A(p_{\theta_1}, p_{\theta_2}) = \frac{(\theta_1-\theta_2)^2}{4\sigma^2}
    \quad \text{and hence} \quad
    H^2(p_{\theta_1}, p_{\theta_2}) \le \frac{(\theta_1-\theta_2)^2}{4\sigma^2}.

Summary

1. For Y_i \overset{iid}{\sim} p_{\theta^*}, the maximum likelihood estimator maximizes the empirical average \frac{1}{n}\sum_{i=1}^n \log p_\theta(Y_i) (our empirical risk is the negative log-likelihood).

2. θ* maximizes the expectation E[\log p_\theta(Y)] (the risk is the expected negative log-likelihood).

3. \frac{1}{n}\sum_{i=1}^n \log p_\theta(Y_i) \overset{a.s.}{\longrightarrow} E[\log p_\theta(Y)], so we expect some sort of concentration of measure.

4. In particular, since \frac{1}{n}\sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)} \overset{a.s.}{\longrightarrow} K(p_{\theta^*}, p_\theta), we might expect that K(p_{\theta^*}, p_{\hat{\theta}_n}) \to 0 for the sequence of estimates \{p_{\hat{\theta}_n}\}.

So, the point is that the maximum likelihood estimator is just a special case of a loss function in learning. Due to its special structure, we are naturally led to consider KL divergences, Hellinger distances, and affinities.
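Finally, a small sketch of item 4 of the summary in the Gaussian case (my own illustration; numpy assumed): with θ̂_n the sample mean, K(p_{θ*}, p_{θ̂_n}) = (θ̂_n - θ*)²/(2σ²) shrinks toward zero as n grows.

    import numpy as np

    rng = np.random.default_rng(3)
    theta_star, sigma = 2.0, 1.0
    for n in (10, 100, 1_000, 10_000, 100_000):
        theta_hat = np.mean(theta_star + sigma * rng.standard_normal(n))   # the MLE (sample mean)
        print(n, (theta_hat - theta_star) ** 2 / (2 * sigma**2))           # K(p_theta*, p_theta_hat)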