Lecture 13: Maximum Likelihood Estimation

ECE 901 Spring 2007 Statistical Learning Theory
Instructor: R. Nowak

1 Summary of Lecture 12

In the last lecture we derived a risk (MSE) bound for regression problems; i.e., select an $f \in \mathcal{F}$ so that $E[(f(X) - Y)^2] - E[(f^*(X) - Y)^2]$ is small, where $f^*(x) = E[Y \mid X = x]$. The result is summarized below.

Theorem 1 (Complexity Regularization with Squared Error Loss) Let $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = [-b/2, b/2]$, $\{X_i, Y_i\}_{i=1}^n$ iid, $P_{XY}$ unknown, $\mathcal{F} = \{\text{collection of candidate functions}\}$, $f : \mathbb{R}^d \to \mathcal{Y}$, $R(f) = E[(f(X) - Y)^2]$. Let $c(f)$, $f \in \mathcal{F}$, be positive numbers satisfying $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$, and select a function from $\mathcal{F}$ according to
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \hat{R}_n(f) + \frac{c(f) \log 2}{n\epsilon} \right\},$$
with $0 < \epsilon \le \frac{3}{5b^2}$ and $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2$. Then,
$$E[R(\hat{f}_n)] - R(f^*) \le \left( \frac{1+\alpha}{1-\alpha} \right) \min_{f \in \mathcal{F}} \left\{ R(f) - R(f^*) + \frac{c(f) \log 2}{n\epsilon} \right\},$$
where $\alpha = \frac{\epsilon b^2}{1 - b^2 \epsilon / 3}$.

2 Maximum Likelihood Estimation

The focus of this lecture is to consider another approach to learning based on maximum likelihood estimation. Consider the classical signal plus noise model:
$$Y_i = f\!\left( \frac{i}{n} \right) + W_i, \quad i = 1, \ldots, n,$$
where the $W_i$ are iid zero-mean noises. Furthermore, assume that $W_i \sim p(w)$ for some known density $p(w)$. Then
$$Y_i \sim p\!\left( y - f\!\left( \tfrac{i}{n} \right) \right) \equiv p_{f_i}(y),$$
since $Y_i - f(i/n) = W_i$. A very common and useful loss function to consider is
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \left( -\log p_{f_i}(Y_i) \right).$$
Minimizing $\hat{R}_n(f)$ with respect to $f$ is equivalent to maximizing $\sum_{i=1}^n \log p_{f_i}(Y_i)$, or $\prod_{i=1}^n p_{f_i}(Y_i)$. Thus, using the negative log-likelihood as a loss function leads to maximum likelihood estimation.

If the $W_i$ are iid zero-mean Gaussian r.v.s, then this is just the squared error loss we considered last time. If the $W_i$ are Laplacian distributed, e.g., $p(w) \propto e^{-|w|}$, then we obtain the absolute error, or $L_1$, loss function.
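As a quick numerical sketch of this correspondence, take the signal to be a constant (an illustrative simplification) and minimize the negative log-likelihood over a grid of constant candidates: under Gaussian noise this is least squares and the minimizer tracks the sample mean, while under Laplacian noise it is the absolute error loss and the minimizer tracks the sample median. The true value, noise scales, and grid below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
theta_true = 1.0                               # constant signal f(i/n) = theta_true (illustrative)

y_gauss = theta_true + rng.normal(size=n)      # W_i ~ N(0, 1)
y_laplace = theta_true + rng.laplace(size=n)   # p(w) = (1/2) exp(-|w|)

# Negative log-likelihood empirical risks for a constant candidate c, up to additive constants:
# Gaussian noise  -> (1/n) sum_i (Y_i - c)^2 / 2   (squared error)
# Laplacian noise -> (1/n) sum_i |Y_i - c|         (absolute error)
grid = np.linspace(-2.0, 4.0, 4001)
nll_gauss = np.mean(0.5 * (y_gauss[:, None] - grid[None, :]) ** 2, axis=0)
nll_laplace = np.mean(np.abs(y_laplace[:, None] - grid[None, :]), axis=0)

print(grid[np.argmin(nll_gauss)], y_gauss.mean())          # least squares ~ sample mean
print(grid[np.argmin(nll_laplace)], np.median(y_laplace))  # L1 loss ~ sample median
```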

We can also handle non-additive models such as the Poisson model
$$Y_i \sim P(y \mid f(i/n)) = \frac{e^{-f(i/n)} \, [f(i/n)]^{y}}{y!}.$$
In this case
$$-\log P(Y_i \mid f(i/n)) = f(i/n) - Y_i \log\big(f(i/n)\big) + \text{constant},$$
which is a very different loss function, but quite appropriate for many imaging problems.

Before we investigate maximum likelihood estimation for model selection, let's review some of the basic concepts. Let $\Theta$ denote a parameter space (e.g., $\Theta = \mathbb{R}$), and assume we have observations
$$Y_i \overset{iid}{\sim} p_{\theta^*}(y), \quad i = 1, \ldots, n,$$
where $\theta^* \in \Theta$ is a parameter determining the density of the $\{Y_i\}$. The ML estimator of $\theta^*$ is
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(Y_i) = \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log p_\theta(Y_i) = \arg\min_{\theta \in \Theta} \sum_{i=1}^n -\log p_\theta(Y_i).$$

The parameter $\theta^*$ maximizes the expected log-likelihood. To see this, let's compare the expected log-likelihood of $\theta^*$ with that of any other $\theta \in \Theta$:
$$E[\log p_{\theta^*}(Y) - \log p_\theta(Y)] = E\left[ \log \frac{p_{\theta^*}(Y)}{p_\theta(Y)} \right] = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)} \, p_{\theta^*}(y) \, dy = K(p_{\theta^*}, p_\theta),$$
the KL divergence, which is $\ge 0$ with equality iff $p_\theta = p_{\theta^*}$. Why?
$$-E\left[ \log \frac{p_{\theta^*}(Y)}{p_\theta(Y)} \right] = E\left[ \log \frac{p_\theta(Y)}{p_{\theta^*}(Y)} \right] \le \log E\left[ \frac{p_\theta(Y)}{p_{\theta^*}(Y)} \right] = \log \int p_\theta(y) \, dy = 0$$
$$\Longrightarrow K(p_{\theta^*}, p_\theta) \ge 0,$$
where the inequality follows from Jensen's inequality (the logarithm is concave).
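The "Why?" step rests on two facts: $E_{\theta^*}[p_\theta(Y)/p_{\theta^*}(Y)] = \int p_\theta(y)\, dy = 1$ exactly, while Jensen's inequality pushes the expectation of the log below the log of the expectation. A minimal Monte Carlo sketch for unit-variance Gaussians (the parameter values and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star, theta = 0.0, 1.5          # true and competing parameters (arbitrary)
n = 1_000_000

y = rng.normal(loc=theta_star, scale=1.0, size=n)

def log_p(y, th):
    """Log-density of N(th, 1)."""
    return -0.5 * (y - th) ** 2 - 0.5 * np.log(2 * np.pi)

ratio = np.exp(log_p(y, theta) - log_p(y, theta_star))   # p_theta(Y) / p_theta*(Y)

print(np.mean(ratio))           # ~= 1, so log E[ratio] ~= 0
print(np.mean(np.log(ratio)))   # ~= -K(p_theta*, p_theta) = -(theta - theta_star)^2 / 2 = -1.125
```

The difference $\log E[\text{ratio}] - E[\log \text{ratio}]$ is the Monte Carlo estimate of $K(p_{\theta^*}, p_\theta)$, which is nonnegative as claimed.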

2.1 Likelihood as a Loss Function

We can restate the maximum likelihood estimator in the general terms we are using in this course. We have i.i.d. observations drawn from an unknown distribution:
$$Y_i \overset{i.i.d.}{\sim} p_{\theta^*}, \quad i \in \{1, \ldots, n\},$$
where $\theta^* \in \Theta$. We can view $p_{\theta^*}$ as a member of a parametric class of distributions, $\mathcal{P} = \{p_\theta\}_{\theta \in \Theta}$. Our goal is to use the observations $\{Y_i\}$ to select an appropriate distribution (e.g., model) from $\mathcal{P}$. We would like the selected distribution to be close to $p_{\theta^*}$ in some sense.

We use the negative log-likelihood loss function, defined as
$$\ell(\theta, Y_i) = -\log p_\theta(Y_i).$$
The empirical risk is
$$\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^n -\log p_\theta(Y_i).$$
We select the distribution that minimizes the empirical risk,
$$\min_{p \in \mathcal{P}} \frac{1}{n} \sum_{i=1}^n -\log p(Y_i) = \min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n -\log p_\theta(Y_i).$$
In other words, the distribution we select is $\hat{p}_n := p_{\hat{\theta}_n}$, where
$$\hat{\theta}_n = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n -\log p_\theta(Y_i).$$
The risk is defined as
$$R(\theta) = E[\ell(\theta, Y)] = E[-\log p_\theta(Y)],$$
and the excess risk of $\theta$ is defined as
$$R(\theta) - R(\theta^*) = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)} \, p_{\theta^*}(y) \, dy \equiv K(p_{\theta^*}, p_\theta).$$
We recognize that the excess risk corresponding to this loss function is simply the Kullback-Leibler (KL) divergence, or relative entropy, denoted by $K(p_{\theta^*}, p_\theta)$. It is easy to see that $K(p_{\theta^*}, p_\theta)$ is always nonnegative and is zero if and only if $p_\theta = p_{\theta^*}$. This shows that $\theta^*$ minimizes the risk. The KL divergence measures how different two probability distributions are, and it is therefore natural to measure the convergence of maximum likelihood procedures in terms of it.
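Viewed this way, maximum likelihood is empirical risk minimization with the negative log-likelihood loss, and it can be implemented directly, e.g., by a grid search over $\Theta$. The sketch below does this for the Poisson model mentioned earlier (the true rate and the candidate grid are illustrative assumptions); the minimizer recovers the familiar Poisson MLE, the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
lam_star = 4.0                          # true Poisson rate (illustrative)
n = 500
y = rng.poisson(lam_star, size=n)

# Candidate parameter grid (assumed search space)
grid = np.linspace(0.5, 10.0, 2000)

# Empirical risk (1/n) sum_i -log p_lambda(Y_i), dropping the log(Y_i!) term
# since it does not depend on lambda:  lambda - Y_i * log(lambda)
emp_risk = np.mean(grid[None, :] - y[:, None] * np.log(grid[None, :]), axis=0)

lam_hat = grid[np.argmin(emp_risk)]
print(lam_hat, y.mean())                # the grid minimizer matches the sample mean
```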

2.2 Convergence of Log-Likelihood to KL Divergence

Since $\hat{\theta}_n$ maximizes the likelihood over $\theta \in \Theta$, we have
$$\frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)} = \frac{1}{n} \sum_{i=1}^n \log p_{\theta^*}(Y_i) - \frac{1}{n} \sum_{i=1}^n \log p_{\hat{\theta}_n}(Y_i) \le 0.$$
Therefore,
$$\frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)} - K(p_{\theta^*}, p_{\hat{\theta}_n}) + K(p_{\theta^*}, p_{\hat{\theta}_n}) \le 0,$$
or, re-arranging,
$$K(p_{\theta^*}, p_{\hat{\theta}_n}) \le K(p_{\theta^*}, p_{\hat{\theta}_n}) - \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)}.$$
Notice that the quantity $\frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)}$ is an empirical average whose mean is $K(p_{\theta^*}, p_\theta)$. By the law of large numbers, for each $\theta \in \Theta$,
$$\frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)} - K(p_{\theta^*}, p_\theta) \overset{a.s.}{\longrightarrow} 0.$$
If this also holds for the sequence $\{\hat{\theta}_n\}$, then we have
$$K(p_{\theta^*}, p_{\hat{\theta}_n}) \le K(p_{\theta^*}, p_{\hat{\theta}_n}) - \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_{\hat{\theta}_n}(Y_i)} \longrightarrow 0 \quad \text{as } n \to \infty,$$
which implies that
$$p_{\hat{\theta}_n} \longrightarrow p_{\theta^*},$$
which often implies that
$$\hat{\theta}_n \longrightarrow \theta^*$$
in some appropriate sense (e.g., point-wise or in norm).

Example 1 (Gaussian Distributions)
$$p_\theta(y) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{(y-\theta)^2}{2}}, \qquad \Theta = \mathbb{R}, \qquad \{Y_i\} \overset{iid}{\sim} p_{\theta^*}(y).$$
$$K(p_{\theta^*}, p_\theta) = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)} \, p_{\theta^*}(y) \, dy = \frac{1}{2} \int \left[ (y-\theta)^2 - (y-\theta^*)^2 \right] p_{\theta^*}(y) \, dy$$
$$= \frac{1}{2} E_{\theta^*}\!\left[ (Y-\theta)^2 \right] - \frac{1}{2} E_{\theta^*}\!\left[ (Y-\theta^*)^2 \right] = \frac{1}{2} E_{\theta^*}\!\left[ Y^2 - 2Y\theta + \theta^2 \right] - \frac{1}{2}$$
$$= \frac{1}{2} \left( (\theta^*)^2 + 1 - 2\theta^*\theta + \theta^2 \right) - \frac{1}{2} = \frac{(\theta^* - \theta)^2}{2}.$$
$\theta^*$ maximizes $E[\log p_\theta(Y)]$ with respect to $\theta \in \Theta$, and
$$\hat{\theta}_n = \arg\max_\theta \left\{ -\frac{1}{2} \sum_{i=1}^n (Y_i - \theta)^2 \right\} = \arg\min_\theta \left\{ \sum_{i=1}^n (Y_i - \theta)^2 \right\} = \frac{1}{n} \sum_{i=1}^n Y_i.$$
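A small simulation sketch of the law-of-large-numbers statement above, using the unit-variance Gaussian model of Example 1 with a fixed competing parameter $\theta$ (parameter values and sample sizes are arbitrary): the empirical average $\frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)}$ settles down to $K(p_{\theta^*}, p_\theta) = (\theta - \theta^*)^2 / 2$.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_star, theta = 0.0, 1.0                 # true and fixed competing parameters (arbitrary)
kl = (theta - theta_star) ** 2 / 2           # K(p_theta*, p_theta) from Example 1

def log_ratio(y):
    """log[p_theta*(y) / p_theta(y)] for unit-variance Gaussians."""
    return 0.5 * (y - theta) ** 2 - 0.5 * (y - theta_star) ** 2

for n in [10, 100, 1_000, 10_000, 100_000]:
    y = rng.normal(loc=theta_star, scale=1.0, size=n)
    emp = np.mean(log_ratio(y))              # empirical average of the log-likelihood ratio
    print(f"n = {n:6d}   empirical = {emp:.4f}   KL = {kl:.4f}")
```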

2.3 Hellinger Distance

The KL divergence is not a distance function:
$$K(p_{\theta^*}, p_\theta) \ne K(p_\theta, p_{\theta^*}).$$
Therefore, it is often more convenient to work with the Hellinger metric,
$$H(p_{\theta^*}, p_\theta) = \left( \int \left( \sqrt{p_{\theta^*}(y)} - \sqrt{p_\theta(y)} \right)^2 dy \right)^{1/2}.$$
The Hellinger metric is symmetric and non-negative, with
$$H(p_{\theta^*}, p_\theta) = H(p_\theta, p_{\theta^*}),$$
and therefore it is a distance measure. Furthermore, the squared Hellinger distance lower bounds the KL divergence, so convergence in KL divergence implies convergence in Hellinger distance.

Proposition 1
$$H^2(p_{\theta^*}, p_\theta) \le K(p_{\theta^*}, p_\theta).$$
Proof:
$$H^2(p_{\theta^*}, p_\theta) = \int \left( \sqrt{p_{\theta^*}(y)} - \sqrt{p_\theta(y)} \right)^2 dy$$
$$= \int p_{\theta^*}(y) \, dy + \int p_\theta(y) \, dy - 2 \int \sqrt{p_{\theta^*}(y) \, p_\theta(y)} \, dy$$
$$= 2 - 2 \int \sqrt{p_{\theta^*}(y) \, p_\theta(y)} \, dy, \quad \text{since } \int p_\theta(y) \, dy = 1$$
$$= 2 \left( 1 - E_{\theta^*}\!\left[ \sqrt{p_\theta(Y)/p_{\theta^*}(Y)} \right] \right)$$
$$\le -2 \log E_{\theta^*}\!\left[ \sqrt{p_\theta(Y)/p_{\theta^*}(Y)} \right], \quad \text{since } 1 - x \le -\log x$$
$$\le -2 \, E_{\theta^*}\!\left[ \log \sqrt{p_\theta(Y)/p_{\theta^*}(Y)} \right], \quad \text{by Jensen's inequality}$$
$$= E_{\theta^*}\!\left[ \log \big( p_{\theta^*}(Y)/p_\theta(Y) \big) \right] = K(p_{\theta^*}, p_\theta).$$

Note that in the proof we also showed that
$$H^2(p_{\theta^*}, p_\theta) = 2 \left( 1 - \int \sqrt{p_{\theta^*}(y) \, p_\theta(y)} \, dy \right),$$
and using the fact $1 - x \le -\log x$ again, we have
$$H^2(p_{\theta^*}, p_\theta) \le -2 \log \int \sqrt{p_{\theta^*}(y) \, p_\theta(y)} \, dy.$$
The quantity inside the log is called the affinity between $p_{\theta^*}$ and $p_\theta$:
$$A(p_{\theta^*}, p_\theta) = \int \sqrt{p_{\theta^*}(y) \, p_\theta(y)} \, dy.$$
This is another measure of closeness between $p_{\theta^*}$ and $p_\theta$.
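The chain of inequalities $H^2 \le -2 \log A \le K$ established above can be checked numerically for any pair of densities. The sketch below uses a standard Gaussian for $p_{\theta^*}$ and a shifted Laplacian for $p_\theta$ (an arbitrary pairing), with a simple Riemann sum standing in for the integrals:

```python
import numpy as np

# Densities on a fine grid; Riemann sums stand in for the integrals over y
y = np.linspace(-20.0, 20.0, 200_001)
dy = y[1] - y[0]

p_star = np.exp(-0.5 * y ** 2) / np.sqrt(2 * np.pi)   # p_theta*: standard Gaussian
p = 0.5 * np.exp(-np.abs(y - 1.0))                    # p_theta: Laplacian, location 1, scale 1

affinity = np.sum(np.sqrt(p_star * p)) * dy                       # A(p_theta*, p_theta)
hellinger_sq = np.sum((np.sqrt(p_star) - np.sqrt(p)) ** 2) * dy   # H^2 = 2(1 - A)
kl = np.sum(p_star * np.log(p_star / p)) * dy                     # K(p_theta*, p_theta)

print(hellinger_sq, -2 * np.log(affinity), kl)   # H^2 <= -2 log A <= K
```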

Example 2 (Gaussian Distributions)
$$p_\theta(y) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{(y-\theta)^2}{2}}$$
$$\log \int \sqrt{p_{\theta^*}(y) \, p_\theta(y)} \, dy = \log \int \left( \frac{1}{\sqrt{2\pi}} \, e^{-\frac{(y-\theta^*)^2}{2}} \right)^{1/2} \left( \frac{1}{\sqrt{2\pi}} \, e^{-\frac{(y-\theta)^2}{2}} \right)^{1/2} dy$$
$$= \log \int \frac{1}{\sqrt{2\pi}} \, e^{-\frac{(y-\theta^*)^2 + (y-\theta)^2}{4}} \, dy$$
$$= \log \int \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}\left[ \left( y - \frac{\theta^* + \theta}{2} \right)^2 + \frac{(\theta^* - \theta)^2}{4} \right]} dy$$
$$= \log e^{-\frac{(\theta^* - \theta)^2}{8}} = -\frac{(\theta^* - \theta)^2}{8}.$$
So
$$\log A(p_{\theta^*}, p_\theta) = -\frac{(\theta^* - \theta)^2}{8} \quad \text{for Gaussian distributions,}$$
$$H^2(p_{\theta^*}, p_\theta) \le \frac{(\theta^* - \theta)^2}{4} \quad \text{for Gaussians.}$$

2.4 Summary

1. $Y_i \overset{iid}{\sim} p_{\theta^*}$. The maximum likelihood estimator maximizes the empirical average $\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i)$ (our empirical risk is the negative log-likelihood).

2. $\theta^*$ maximizes the expectation $E[\log p_\theta(Y)]$ (the risk is the expected negative log-likelihood).

3. $\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i) \overset{a.s.}{\longrightarrow} E[\log p_\theta(Y)]$, so we expect some sort of concentration of measure.

4. In particular, since $\frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)} \overset{a.s.}{\longrightarrow} K(p_{\theta^*}, p_\theta)$, we might expect that $K(p_{\theta^*}, p_{\hat{\theta}_n}) \to 0$ for the sequence of estimates $\{p_{\hat{\theta}_n}\}_{n=1}^{\infty}$.

So, the point is that the maximum likelihood estimator is just a special case of learning with a particular loss function, the negative log-likelihood. Due to its special structure, we are naturally led to consider KL divergences, Hellinger distances, and affinities.
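To tie the summary together, here is a small end-to-end sketch: simulate $Y_i \overset{iid}{\sim} p_{\theta^*}$ in the unit-variance Gaussian model, fit $\hat{\theta}_n$ by maximum likelihood, and report the closeness measures discussed above using the Gaussian closed forms from Examples 1 and 2 (the true parameter and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
theta_star = 1.0                                  # true parameter (illustrative)

for n in [10, 100, 1_000, 10_000]:
    y = rng.normal(loc=theta_star, scale=1.0, size=n)
    theta_hat = y.mean()                          # Gaussian MLE = sample mean (Example 1)
    d2 = (theta_hat - theta_star) ** 2
    kl = d2 / 2                                   # K(p_theta*, p_theta_hat)
    affinity = np.exp(-d2 / 8)                    # A(p_theta*, p_theta_hat) (Example 2)
    hellinger_sq = 2 * (1 - affinity)             # H^2 = 2(1 - A) <= d2 / 4
    print(f"n = {n:5d}   KL = {kl:.2e}   H^2 = {hellinger_sq:.2e}   A = {affinity:.6f}")
```

All three measures tend to their "identical distributions" values (0, 0, and 1) as $n$ grows, which is the concentration behavior anticipated in the summary.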