ECE 901 Lecture 14: Maximum Likelihood Estimation and Complexity Regularization

R. Nowak, 2009

1 Review: Maximum Likelihood Estimation

We have $n$ iid observations drawn from an unknown distribution,
$$ Y_i \stackrel{iid}{\sim} p_{\theta^*}, \quad i = 1, \dots, n, $$
where $\theta^* \in \Theta$. We can view $p_{\theta^*}$ as a member of a parametric class of distributions, $\mathcal{P} = \{ p_\theta \}_{\theta \in \Theta}$. Our goal is to use the observations $\{Y_i\}$ to select an appropriate distribution (e.g., model) from $\mathcal{P}$. We would like the selected distribution to be close to $p_{\theta^*}$ in some sense.

We use the negative log-likelihood loss function, defined as
$$ \ell(\theta, Y_i) = -\log p_\theta(Y_i). $$
The empirical risk is
$$ \hat{R}_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i). $$
We select the distribution that minimizes the empirical risk:
$$ \min_{p \in \mathcal{P}} \; -\frac{1}{n} \sum_{i=1}^n \log p(Y_i) \;=\; \min_{\theta \in \Theta} \; -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i). $$
In other words, the distribution we select is $\hat{p}_n := p_{\hat{\theta}_n}$, where
$$ \hat{\theta}_n = \arg\min_{\theta \in \Theta} \; -\frac{1}{n} \sum_{i=1}^n \log p_\theta(Y_i). $$
The risk is defined as
$$ R(\theta) = E[\ell(\theta, Y)] = E[-\log p_\theta(Y)]. $$
As shown before, $\theta^*$ minimizes $R(\theta)$ over $\Theta$:
$$ \theta^* = \arg\min_{\theta \in \Theta} E[-\log p_\theta(Y)] = \arg\min_{\theta \in \Theta} \int -\log p_\theta(y) \, p_{\theta^*}(y) \, dy. $$
Finally, the excess risk of $\theta$ is defined as
$$ R(\theta) - R(\theta^*) = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)} \, p_{\theta^*}(y) \, dy \;\equiv\; K(p_{\theta^*}, p_\theta). $$
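To make the recipe above concrete, here is a minimal numerical sketch (not part of the original notes) of maximum likelihood as empirical risk minimization with the negative log-likelihood loss, using the Gaussian location family $\{N(\theta,1)\}$; the family, sample size, and all names are illustrative assumptions.

```python
# A minimal sketch (assumed setup, not from the notes): MLE as empirical risk
# minimization with the negative log-likelihood loss, for the Gaussian location
# family P = { N(theta, 1) : theta in Theta }.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_star = 2.0
Y = rng.normal(theta_star, 1.0, size=1000)      # Y_i ~ iid p_{theta*}

def empirical_risk(theta, Y):
    """R_hat_n(theta) = -(1/n) sum_i log p_theta(Y_i) for the N(theta, 1) model."""
    return 0.5 * np.log(2 * np.pi) + 0.5 * np.mean((Y - theta) ** 2)

# theta_hat_n = arg min_theta R_hat_n(theta); for this family it is the sample mean.
res = minimize_scalar(empirical_risk, args=(Y,), bounds=(-10.0, 10.0), method="bounded")
print(res.x, Y.mean())                           # the two values agree up to tolerance
```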

We see that the excess risk corresponding to this loss function is simply the Kullback-Leibler (KL) divergence, or relative entropy, denoted by $K(p_{\theta^*}, p_\theta)$. It is easy to see that $K(p_{\theta^*}, p_\theta)$ is always non-negative and is zero if and only if $p_\theta = p_{\theta^*}$. The KL divergence measures how different two probability distributions are, and it is therefore natural for measuring the convergence of maximum likelihood procedures. However, $K(p_{\theta^*}, p_\theta)$ is not a distance metric, because it is not symmetric and does not satisfy the triangle inequality. For this reason, two other quantities play a key role in the analysis of maximum likelihood estimation, namely the Hellinger distance and the affinity.

The Hellinger distance is defined as
$$ H(p_{\theta_1}, p_{\theta_2}) = \left( \int \left( \sqrt{p_{\theta_1}(y)} - \sqrt{p_{\theta_2}(y)} \right)^2 dy \right)^{1/2}. $$
We proved that the squared Hellinger distance lower bounds the KL divergence:
$$ H^2(p_{\theta_1}, p_{\theta_2}) \le K(p_{\theta_1}, p_{\theta_2}), \qquad H^2(p_{\theta_1}, p_{\theta_2}) \le K(p_{\theta_2}, p_{\theta_1}). $$
The affinity is defined as
$$ A(p_{\theta_1}, p_{\theta_2}) = \int \sqrt{p_{\theta_1}(y) \, p_{\theta_2}(y)} \, dy, $$
and we also proved that
$$ H^2(p_{\theta_1}, p_{\theta_2}) \le -2 \log A(p_{\theta_1}, p_{\theta_2}). $$

Example 1 (Gaussian distribution). Suppose $Y$ is Gaussian with mean $\theta$ and variance $\sigma^2$, so that
$$ p_\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\theta)^2}{2\sigma^2}}. $$
First, look at the KL divergence:
$$ K(p_{\theta_1}, p_{\theta_2}) = E_{\theta_1}\!\left[ \log \frac{p_{\theta_1}(Y)}{p_{\theta_2}(Y)} \right]. $$
Since
$$ \log \frac{p_{\theta_1}(y)}{p_{\theta_2}(y)} = \frac{1}{2\sigma^2}\left[ (\theta_2^2 - \theta_1^2) + 2(\theta_1 - \theta_2)\, y \right], $$
we have
$$ K(p_{\theta_1}, p_{\theta_2}) = \int \left[ \frac{\theta_2^2 - \theta_1^2}{2\sigma^2} + \frac{(\theta_1 - \theta_2)\, y}{\sigma^2} \right] p_{\theta_1}(y) \, dy \;\overset{E[Y]=\theta_1}{=}\; \frac{1}{2\sigma^2}\left( \theta_2^2 + \theta_1^2 - 2\theta_1\theta_2 \right) = \frac{(\theta_1 - \theta_2)^2}{2\sigma^2}. $$
Next, look at the affinity:
$$ -2\log A(p_{\theta_1}, p_{\theta_2}) = -2 \log \int \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta_1)^2}{2\sigma^2}} \right)^{1/2} \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y-\theta_2)^2}{2\sigma^2}} \right)^{1/2} dy $$
$$ = -2 \log \int \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\theta_1)^2}{4\sigma^2} - \frac{(y-\theta_2)^2}{4\sigma^2}} \, dy = -2 \log \left( e^{-\frac{(\theta_1-\theta_2)^2}{8\sigma^2}} \int \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2}\left( y - \frac{\theta_1+\theta_2}{2} \right)^2} dy \right) = \frac{(\theta_1-\theta_2)^2}{4\sigma^2}. $$
Thus
$$ H^2(p_{\theta_1}, p_{\theta_2}) \le -2\log A(p_{\theta_1}, p_{\theta_2}) = \frac{(\theta_1-\theta_2)^2}{4\sigma^2} \le \frac{(\theta_1-\theta_2)^2}{2\sigma^2} = K(p_{\theta_1}, p_{\theta_2}). $$
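As a sanity check (an illustrative addition, not from the notes), the closed forms of Example 1 and the chain $H^2 \le -2\log A \le K$ can be verified numerically by quadrature:

```python
# Numerical sanity check (illustrative) of Example 1 and of H^2 <= -2 log A <= K
# for two Gaussians N(t1, s^2) and N(t2, s^2).
import numpy as np
from scipy.integrate import quad

t1, t2, s = 0.0, 1.5, 1.0

def p(y, t):
    """Gaussian density with mean t and variance s^2."""
    return np.exp(-(y - t) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)

A, _  = quad(lambda y: np.sqrt(p(y, t1) * p(y, t2)), -np.inf, np.inf)                  # affinity
H2, _ = quad(lambda y: (np.sqrt(p(y, t1)) - np.sqrt(p(y, t2))) ** 2, -np.inf, np.inf)  # squared Hellinger
# KL divergence, with the log-ratio written in closed form to avoid 0/0 in the tails.
K, _  = quad(lambda y: ((y - t2) ** 2 - (y - t1) ** 2) / (2 * s ** 2) * p(y, t1), -np.inf, np.inf)

print(K, (t1 - t2) ** 2 / (2 * s ** 2))               # K(p1, p2)  = (t1 - t2)^2 / (2 s^2)
print(-2 * np.log(A), (t1 - t2) ** 2 / (4 * s ** 2))  # -2 log A   = (t1 - t2)^2 / (4 s^2)
print(H2 <= -2 * np.log(A) <= K)                      # the chain of inequalities holds
```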

2 Maximum Likelihood Estimation and Complexity Regularization

Suppose that we have $n$ iid training samples, $\{X_i, Y_i\}_{i=1}^n \stackrel{iid}{\sim} p_{XY}$. Using conditional probability, $p_{XY}$ can be written as
$$ p_{XY}(x, y) = p_X(x) \, p_{Y|X=x}(y). $$
Let's assume for the moment that $p_X$ is completely unknown, but that $p_{Y|X=x}(y)$ has a special form:
$$ p_{Y|X=x}(y) = p_{f^*(x)}(y), $$
where $p_{f^*(x)}(y)$ is a known parametric density function with parameter $f^*(x)$.

Example 2 (Signal-plus-noise observation model). Let
$$ Y_i = f^*(X_i) + W_i, \quad i = 1, \dots, n, $$
where $W_i \stackrel{iid}{\sim} N(0, \sigma^2)$ and $X_i \stackrel{iid}{\sim} p_X$. Then
$$ p_{f^*(x)}(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y - f^*(x))^2}{2\sigma^2}}. $$
Another instance of the same setup is the Poisson model $Y \mid X = x \sim \mathrm{Poisson}(f^*(x))$, for which
$$ p_{f^*(x)}(y) = \frac{e^{-f^*(x)} \, [f^*(x)]^y}{y!}. $$

The likelihood loss function for a candidate $f$ is
$$ \ell(f(X), Y) = -\log\big[ p_X(X) \, p_{f(X)}(Y) \big] = -\log p_X(X) - \log p_{f(X)}(Y). $$
The expected loss is
$$ E[\ell(f(X), Y)] = E\big[ E[\ell(f(X), Y) \mid X] \big] = E[-\log p_X(X)] + E\big[ E[-\log p_{f(X)}(Y) \mid X] \big] = E[-\log p_X(X)] + E[-\log p_{f(X)}(Y)]. $$
Notice that the first term is a constant with respect to $f$. With that in mind, we define our risk to be
$$ R(f) = E[-\log p_{f(X)}(Y)] = E\big[ E[-\log p_{f(X)}(Y) \mid X] \big] = \int \left( \int -\log p_{f(x)}(y) \, p_{f^*(x)}(y) \, dy \right) p_X(x) \, dx. $$
The function $f^*$ minimizes this risk, since $f(x) = f^*(x)$ minimizes the inner integral for each $x$.

Our empirical risk is the negative log-likelihood of the training samples:
$$ \hat{R}_n(f) = -\frac{1}{n} \sum_{i=1}^n \log p_{f(X_i)}(Y_i). $$
We can regard the value $1/n$ as the empirical probability of observing $X = X_i$ (since we are making no assumptions about $P_X$).
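The sketch below (illustrative; the intensity function, design, and sample size are made-up assumptions) evaluates the empirical risk $\hat{R}_n(f)$ for the Poisson model mentioned above, showing how a candidate $f$ plugs into the known parametric density.

```python
# Illustrative sketch (made-up intensity f*, design, and sample size) of the
# empirical risk R_hat_n(f) = -(1/n) sum_i log p_{f(X_i)}(Y_i) for the Poisson
# model Y | X = x ~ Poisson(f*(x)) mentioned above.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)
f_star = lambda x: 5.0 + 3.0 * x            # hypothetical true intensity function
X = rng.uniform(0.0, 1.0, size=500)         # X_i ~ iid p_X (unknown to the learner)
Y = rng.poisson(f_star(X))                  # Y_i | X_i ~ Poisson(f*(X_i))

def empirical_risk(f, X, Y):
    """-(1/n) sum_i log p_{f(X_i)}(Y_i), with log p_lam(y) = -lam + y log(lam) - log(y!)."""
    lam = f(X)
    return -np.mean(-lam + Y * np.log(lam) - gammaln(Y + 1))

print(empirical_risk(f_star, X, Y))                   # risk of the true regression function
print(empirical_risk(lambda x: 5.0 + 0.0 * x, X, Y))  # a misspecified constant candidate (typically worse)
```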

3 Deterministic Designs

Often in function estimation we have control over where we sample. For illustration, let's assume that $\mathcal{X} = [0, 1]$ and $\mathcal{Y} = \mathbb{R}$, and suppose we have the $n$ samples deterministically distributed somewhat uniformly over the domain $\mathcal{X}$. Let $x_i$, $i = 1, \dots, n$, denote these sample points, and assume that $Y_i \sim p_{f^*(x_i)}(y)$. Then our empirical risk is
$$ \hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), Y_i) = -\frac{1}{n} \sum_{i=1}^n \log p_{f(x_i)}(Y_i). $$
Note that $x_i$ is now a deterministic quantity (hence the name deterministic design). Our risk is
$$ R(f) = E\left[ -\frac{1}{n} \sum_{i=1}^n \log p_{f(x_i)}(Y_i) \right] = \frac{1}{n} \sum_{i=1}^n \int -\log p_{f(x_i)}(y_i) \, p_{f^*(x_i)}(y_i) \, dy_i. $$
The risk is minimized by $f^*$. However, $f^*$ is not a unique minimizer: any $f$ that agrees with $f^*$ at the points $\{x_i\}$ also minimizes this risk.

Now we will make use of the following vector and shorthand notation. Uppercase $Y$ denotes a random vector, while lowercase $y$ and $x$ denote deterministic quantities:
$$ Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}. $$
Then
$$ p_f(Y) = \prod_{i=1}^n p_{f(x_i)}(Y_i) \quad \text{(random)}, \qquad p_f(y) = \prod_{i=1}^n p_{f(x_i)}(y_i) \quad \text{(deterministic)}. $$
With this notation, the empirical risk and the true risk can be written as
$$ \hat{R}_n(f) = -\frac{1}{n} \log p_f(Y), \qquad R(f) = E\left[ -\frac{1}{n} \log p_f(Y) \right] = -\frac{1}{n} \int \log p_f(y) \, p_{f^*}(y) \, dy. $$

4 Constructing Error Bounds

Suppose that we have a pool of candidate functions $\mathcal{F}$, and we want to select a function $f$ from $\mathcal{F}$ using the training data. Our usual approach is to show that the distribution of $\hat{R}_n(f)$ concentrates about its mean as $n$ grows. First, we assign a complexity $c(f) > 0$ to each $f \in \mathcal{F}$ so that $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. Then we apply the union bound to get a uniform concentration inequality holding for all models in $\mathcal{F}$. Finally, we use this concentration inequality to bound the expected risk of our selected model. We will essentially accomplish the same result here, but avoid the need for explicit concentration inequalities and instead make use of information-theoretic bounds.
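Here is a tiny illustration (a hypothetical candidate class, not one from the notes) of how complexities satisfying $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$ can be assigned: take $c(f)$ to be the length in bits of a prefix code for $f$, so that the condition is exactly the Kraft inequality.

```python
# A tiny illustration (hypothetical class, not from the notes) of complexities
# c(f) satisfying sum_f 2^{-c(f)} <= 1: take c(f) to be the length in bits of a
# prefix code for f, so the condition is exactly the Kraft inequality.

# Hypothetical class: piecewise-constant functions on 2^k equal bins, k = 0..5,
# with each bin value quantized to one of 2^B levels.  Encode f with k+1 bits for
# the resolution k plus B bits per bin, i.e. c(f) = (k + 1) + B * 2^k.
B = 8
levels_per_bin = 2 ** B
c = {k: (k + 1) + B * 2 ** k for k in range(6)}

# Kraft check: at resolution k there are (2^B)^(2^k) distinct functions.
kraft_sum = sum(levels_per_bin ** (2 ** k) * 2.0 ** (-c[k]) for k in range(6))
print(kraft_sum, kraft_sum <= 1.0)   # 0.984375 True
```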

We would like to select an $f \in \mathcal{F}$ so that the excess risk
$$ 0 \le R(f) - R(f^*) = \frac{1}{n} E\big[ \log p_{f^*}(Y) - \log p_f(Y) \big] = \frac{1}{n} E\left[ \log \frac{p_{f^*}(Y)}{p_f(Y)} \right] = \frac{1}{n} K(p_{f^*}, p_f) $$
is small, where
$$ K(p_{f^*}, p_f) = \sum_{i=1}^n \int \log \frac{p_{f^*(x_i)}(y_i)}{p_{f(x_i)}(y_i)} \, p_{f^*(x_i)}(y_i) \, dy_i = \sum_{i=1}^n K(p_{f^*(x_i)}, p_{f(x_i)}) $$
is again the KL divergence. Unfortunately, as mentioned before, $K(p_{f^*}, p_f)$ is not a true distance. So instead we will focus on the expected squared Hellinger distance as our measure of performance:
$$ H^2(p_f, p_{f^*}) = \sum_{i=1}^n \int \left( \sqrt{p_{f(x_i)}(y_i)} - \sqrt{p_{f^*(x_i)}(y_i)} \right)^2 dy_i. $$

5 Maximum Complexity-Regularized Likelihood Estimation

Theorem 1 (Li-Barron 2000, Kolaczyk-Nowak). Let $\{x_i, Y_i\}_{i=1}^n$ be a sample of training data, where the $x_i$ are deterministic and the $Y_i$ are independent random variables distributed as
$$ Y_i \sim p_{f^*(x_i)}(y_i), \quad i = 1, \dots, n, $$
for some unknown function $f^*$. Suppose we have a collection of candidate functions $\mathcal{F}$ (measurable functions $f: \mathcal{X} \to \mathcal{Y}$) and complexities $c(f) > 0$, $f \in \mathcal{F}$, satisfying
$$ \sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1. $$
Define the complexity-regularized estimator
$$ \hat{f}_n \equiv \arg\min_{f \in \mathcal{F}} \left\{ -\frac{1}{n} \sum_{i=1}^n \log p_{f(x_i)}(Y_i) + \frac{2\, c(f) \log 2}{n} \right\}. $$
Then
$$ E\left[ \frac{1}{n} H^2(p_{\hat{f}_n}, p_{f^*}) \right] \le -\frac{2}{n} E\left[ \log A(p_{\hat{f}_n}, p_{f^*}) \right] \le \min_{f \in \mathcal{F}} \left\{ \frac{1}{n} K(p_{f^*}, p_f) + \frac{2\, c(f) \log 2}{n} \right\}. $$

Before proving the theorem, let's look at a very special and important case; we will use this result quite a lot in the following lectures.
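Below is a minimal implementation sketch of the estimator $\hat{f}_n$ defined in the theorem, for Gaussian noise and the finite class of quantized piecewise-constant functions used in the earlier Kraft-inequality sketch; the class, the quantization grid, the noise level, and the complexity assignment are all assumptions made for illustration, not part of the notes.

```python
# A minimal implementation sketch of the complexity-regularized estimator f_hat_n
# from the theorem, for Gaussian noise and a finite class of quantized
# piecewise-constant functions.  The class, grid, noise level and c(f) are
# illustrative assumptions, not part of the notes.
import numpy as np

rng = np.random.default_rng(3)
n, sigma, B = 256, 0.5, 8
x = (np.arange(n) + 0.5) / n                                   # deterministic design on [0, 1]
f_star = np.where(x < 0.3, 1.0, np.where(x < 0.7, -0.5, 0.8))  # hypothetical piecewise-constant truth
Y = f_star + rng.normal(0.0, sigma, size=n)

levels = np.linspace(-2.0, 2.0, 2 ** B)                        # admissible (quantized) bin values

def fit_in_class(k):
    """Exact minimizer of -(1/n) sum_i log p_{f(x_i)}(Y_i) over F_k (2^k bins, quantized levels)."""
    fx = np.empty(n)
    for idx in np.array_split(np.arange(n), 2 ** k):
        m = Y[idx].mean()                                      # unconstrained per-bin optimum ...
        fx[idx] = levels[np.argmin(np.abs(levels - m))]        # ... snapped to the nearest grid level
    return fx

def objective(fx, c_f):
    """-(1/n) sum_i log p_{f(x_i)}(Y_i) + 2 c(f) log(2) / n, dropping the f-independent constant."""
    return np.mean((Y - fx) ** 2) / (2 * sigma ** 2) + 2 * c_f * np.log(2) / n

fits = {k: fit_in_class(k) for k in range(7)}
cost = {k: objective(fits[k], (k + 1) + B * 2 ** k) for k in fits}
k_hat = min(cost, key=cost.get)
print(k_hat, np.mean((fits[k_hat] - f_star) ** 2))             # selected resolution and its squared error
```

Because $c(f)$ here depends only on the resolution $k$, the penalty is constant within each $\mathcal{F}_k$, so the search splits into a per-bin fit followed by a penalized comparison across resolutions.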

Example 3 (Gaussian noise). Suppose
$$ Y_i = f^*(x_i) + W_i, \quad W_i \stackrel{iid}{\sim} N(0, \sigma^2), \quad i = 1, \dots, n, $$
so that
$$ p_{f(x_i)}(y_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y_i - f(x_i))^2}{2\sigma^2}}. $$
Then
$$ -\log p_{f(x_i)}(Y_i) = \frac{(Y_i - f(x_i))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2). $$
Using the results from Example 1,
$$ -2 \log A\big( p_{\hat{f}(x_i)}, p_{f^*(x_i)} \big) = -2 \log \int \sqrt{ p_{\hat{f}(x_i)}(y_i) \, p_{f^*(x_i)}(y_i) } \, dy_i = \frac{(\hat{f}(x_i) - f^*(x_i))^2}{4\sigma^2}, $$
so that
$$ -2\, E\big[ \log A(p_{\hat{f}_n}, p_{f^*}) \big] = \frac{1}{4\sigma^2} \sum_{i=1}^n E\big[ (\hat{f}_n(x_i) - f^*(x_i))^2 \big]. $$
We also have
$$ K(p_{f^*}, p_f) = \sum_{i=1}^n \frac{(f(x_i) - f^*(x_i))^2}{2\sigma^2}, $$
and the estimator becomes a complexity-penalized least squares estimator:
$$ \hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^n \frac{(Y_i - f(x_i))^2}{2\sigma^2} + \frac{2\, c(f) \log 2}{n} \right\} = \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} \sum_{i=1}^n (Y_i - f(x_i))^2 + \frac{4 \sigma^2 c(f) \log 2}{n} \right\}. $$
The theorem tells us that
$$ E\left[ \frac{1}{4\sigma^2} \cdot \frac{1}{n} \sum_{i=1}^n (\hat{f}_n(x_i) - f^*(x_i))^2 \right] \le \min_{f \in \mathcal{F}} \left\{ \frac{1}{2\sigma^2} \cdot \frac{1}{n} \sum_{i=1}^n (f(x_i) - f^*(x_i))^2 + \frac{2\, c(f) \log 2}{n} \right\}, $$
or, combining everything together,
$$ E\left[ \frac{1}{n} \sum_{i=1}^n (\hat{f}_n(x_i) - f^*(x_i))^2 \right] \le \min_{f \in \mathcal{F}} \left\{ \frac{2}{n} \sum_{i=1}^n (f(x_i) - f^*(x_i))^2 + \frac{8 \sigma^2 c(f) \log 2}{n} \right\}. $$
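To see the approximation/complexity trade-off that the final bound expresses, the sketch below (with a hypothetical smooth $f^*$ and the same illustrative piecewise-constant class and complexities as above, level quantization ignored for simplicity) tabulates the two terms on the right-hand side as the resolution $k$ varies.

```python
# Illustrative tabulation (hypothetical smooth f*, same class and complexities as
# above, level quantization ignored) of the two terms on the right-hand side of
# the final bound:
#   (2/n) sum_i (f(x_i) - f*(x_i))^2  +  8 sigma^2 c(f) log(2) / n,
# showing how the bound trades approximation error against complexity.
import numpy as np

n, sigma, B = 1024, 0.5, 8
x = (np.arange(n) + 0.5) / n
f_star = np.sin(2 * np.pi * x)                     # hypothetical truth, not in any F_k

def best_in_class(k):
    """Best piecewise-constant approximation of f* on 2^k equal bins (bin-wise means)."""
    fx = np.empty(n)
    for idx in np.array_split(np.arange(n), 2 ** k):
        fx[idx] = f_star[idx].mean()
    return fx

for k in range(9):
    c_f = (k + 1) + B * 2 ** k
    approx = 2.0 * np.mean((best_in_class(k) - f_star) ** 2)
    penalty = 8.0 * sigma ** 2 * c_f * np.log(2) / n
    print(k, round(approx, 4), round(penalty, 4), round(approx + penalty, 4))
```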

We will now prove Theorem 1.

Proof: First relate the squared Hellinger distance to the affinity:
$$ H^2(p_{\hat{f}}, p_{f^*}) = \sum_{i=1}^n \int \left( \sqrt{p_{\hat{f}(x_i)}(y_i)} - \sqrt{p_{f^*(x_i)}(y_i)} \right)^2 dy_i \;\le\; -2 \log \underbrace{\int \sqrt{ p_{\hat{f}}(y) \, p_{f^*}(y) } \, dy}_{\text{affinity } A(p_{\hat{f}}, p_{f^*})}, $$
using $1 - a \le -\log a$ for each term of the sum, together with the fact that the affinity of product densities is the product of the coordinatewise affinities. Notice that $A(p_{\hat{f}}, p_{f^*})$ is a random quantity (it depends on the training set through $\hat{f}_n$). Keeping that in mind, the above result tells us that
$$ E\left[ \frac{1}{n} H^2(p_{\hat{f}_n}, p_{f^*}) \right] \le -\frac{2}{n} E\left[ \log A(p_{\hat{f}_n}, p_{f^*}) \right]. $$
Now, define the theoretical analog of $\hat{f}_n$:
$$ \bar{f}_n \equiv \arg\min_{f \in \mathcal{F}} \left\{ \frac{1}{n} K(p_{f^*}, p_f) + \frac{2\, c(f) \log 2}{n} \right\}, $$
and rewrite the definition of $\hat{f}_n$:
$$ \hat{f}_n = \arg\min_{f \in \mathcal{F}} \left\{ -\frac{1}{n} \log p_f(Y) + \frac{2\, c(f) \log 2}{n} \right\} = \arg\max_{f \in \mathcal{F}} \left\{ \frac{1}{n} \log p_f(Y) - \frac{2\, c(f) \log 2}{n} \right\} = \arg\max_{f \in \mathcal{F}} \left\{ \frac{1}{2} \log p_f(Y) - c(f) \log 2 \right\} $$
$$ = \arg\max_{f \in \mathcal{F}} \; \log \left( 2^{-c(f)} \, p_f^{1/2}(Y) \right) = \arg\max_{f \in \mathcal{F}} \; 2^{-c(f)} \, p_f^{1/2}(Y). $$
Since $\hat{f}_n$ is the function $f \in \mathcal{F}$ that maximizes $2^{-c(f)} p_f^{1/2}(Y)$, we conclude that
$$ 2^{-c(\hat{f}_n)} \, p_{\hat{f}_n}^{1/2}(Y) \ge 2^{-c(\bar{f}_n)} \, p_{\bar{f}_n}^{1/2}(Y). $$
Then we can write
$$ E\left[ \frac{1}{n} H^2(p_{\hat{f}_n}, p_{f^*}) \right] \le -\frac{2}{n} E\left[ \log A(p_{\hat{f}_n}, p_{f^*}) \right] \le \frac{2}{n} E\left[ \log \frac{ 2^{-c(\hat{f}_n)} \, p_{\hat{f}_n}^{1/2}(Y) }{ 2^{-c(\bar{f}_n)} \, p_{\bar{f}_n}^{1/2}(Y) \, A(p_{\hat{f}_n}, p_{f^*}) } \right]. $$
Now, simply multiply the argument inside the log by $p_{f^*}^{1/2}(Y) / p_{f^*}^{1/2}(Y)$ to get
$$ E\left[ \frac{1}{n} H^2(p_{\hat{f}_n}, p_{f^*}) \right] \le \frac{2}{n} E\left[ \log \frac{ p_{f^*}^{1/2}(Y) }{ 2^{-c(\bar{f}_n)} \, p_{\bar{f}_n}^{1/2}(Y) } \right] + \frac{2}{n} E\left[ \log \frac{ 2^{-c(\hat{f}_n)} \, p_{\hat{f}_n}^{1/2}(Y) }{ p_{f^*}^{1/2}(Y) \, A(p_{\hat{f}_n}, p_{f^*}) } \right] $$
$$ = \frac{1}{n} K(p_{f^*}, p_{\bar{f}_n}) + \frac{2\, c(\bar{f}_n) \log 2}{n} + \frac{2}{n} E\left[ \log \frac{ 2^{-c(\hat{f}_n)} \, p_{\hat{f}_n}^{1/2}(Y) }{ p_{f^*}^{1/2}(Y) \, A(p_{\hat{f}_n}, p_{f^*}) } \right]. $$
The terms $\frac{1}{n} K(p_{f^*}, p_{\bar{f}_n}) + \frac{2\, c(\bar{f}_n) \log 2}{n}$ are precisely what we wanted for the upper bound of the theorem (by the definition of $\bar{f}_n$, they equal the minimum over $\mathcal{F}$). So, to finish the proof, we only need to show that the last term is non-positive.

Applying Jensen's inequality ($E[\log Z] \le \log E[Z]$), we get
$$ \frac{2}{n} E\left[ \log \frac{ 2^{-c(\hat{f}_n)} \, p_{\hat{f}_n}^{1/2}(Y) }{ p_{f^*}^{1/2}(Y) \, A(p_{\hat{f}_n}, p_{f^*}) } \right] \le \frac{2}{n} \log E\left[ \frac{ p_{\hat{f}_n}^{1/2}(Y) }{ p_{f^*}^{1/2}(Y) } \cdot \frac{ 2^{-c(\hat{f}_n)} }{ A(p_{\hat{f}_n}, p_{f^*}) } \right]. $$
Note that in the above expression the random quantities are $Y$ and $\hat{f}_n$. Because $\hat{f}_n \in \mathcal{F}$, and we know something about this space, we can use a union bound to get the randomness of $\hat{f}_n$ out of the way. In this case the union bound simply says that any individual term in a summation of non-negative terms is smaller than the summation(*):
$$ E\left[ \frac{ p_{\hat{f}_n}^{1/2}(Y) }{ p_{f^*}^{1/2}(Y) } \cdot \frac{ 2^{-c(\hat{f}_n)} }{ A(p_{\hat{f}_n}, p_{f^*}) } \right] \le E\left[ \sum_{f \in \mathcal{F}} \frac{ p_f^{1/2}(Y) }{ p_{f^*}^{1/2}(Y) } \cdot \frac{ 2^{-c(f)} }{ A(p_f, p_{f^*}) } \right] = \sum_{f \in \mathcal{F}} \frac{ 2^{-c(f)} }{ A(p_f, p_{f^*}) } \, E\left[ \frac{ p_f^{1/2}(Y) }{ p_{f^*}^{1/2}(Y) } \right] = \sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1, $$
and therefore the last term is at most $\frac{2}{n} \log 1 = 0$. The last steps of the proof follow from the facts that
$$ E\left[ \frac{ p_f^{1/2}(Y) }{ p_{f^*}^{1/2}(Y) } \right] = \int \frac{ p_f^{1/2}(y) }{ p_{f^*}^{1/2}(y) } \, p_{f^*}(y) \, dy = \int \sqrt{ p_f(y) \, p_{f^*}(y) } \, dy = A(p_f, p_{f^*}) $$
and $\sum_{f \in \mathcal{F}} 2^{-c(f)} \le 1$. $\blacksquare$

(*) Let $z_1, z_2, \dots$ be non-negative. Then for all $i \in \mathbb{N}$, $z_i \le \sum_j z_j$.
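The identity $E[p_f^{1/2}(Y)/p_{f^*}^{1/2}(Y)] = A(p_f, p_{f^*})$ that powers the last step is easy to check by Monte Carlo; the sketch below (illustrative parameters) does so for a single Gaussian coordinate.

```python
# Monte Carlo check (illustrative parameters) of the identity used in the last
# step: E[ sqrt(p_f(Y) / p_{f*}(Y)) ] = A(p_f, p_{f*}) when Y ~ p_{f*}, here for a
# single Gaussian coordinate with means mu_f, mu_star and common variance sigma^2.
import numpy as np

rng = np.random.default_rng(4)
mu_star, mu_f, sigma = 0.0, 1.0, 1.0
Y = rng.normal(mu_star, sigma, size=200_000)

log_ratio = ((Y - mu_star) ** 2 - (Y - mu_f) ** 2) / (2 * sigma ** 2)   # log p_f(Y) - log p_{f*}(Y)
mc_estimate = np.mean(np.exp(0.5 * log_ratio))                          # E[ sqrt(p_f / p_{f*}) ]
closed_form = np.exp(-(mu_f - mu_star) ** 2 / (8 * sigma ** 2))         # A = exp(-(mu_f - mu_star)^2 / (8 sigma^2))
print(mc_estimate, closed_form)                                         # should nearly agree
```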