Agnostic Learning and Concentration Inequalities

ECE901 Spring 2004 Statistical Regularization and Learning Theory    Lecture: 7
Agnostic Learning and Concentration Inequalities
Lecturer: Rob Nowak    Scribe: Aravind Kailas

1 Introduction

1.1 Motivation

In the last lecture we considered a learning problem in which the optimal function belonged to a finite class of functions. Specifically, for some collection of functions $\mathcal{F}$ with finite cardinality $|\mathcal{F}|$, we had
$$\min_{f \in \mathcal{F}} R(f) = 0.$$
This is almost always not the situation in real-world learning problems. Let us again suppose we have a finite collection of candidate functions $\mathcal{F}$. Furthermore, we do not assume that the optimal function $f^*$, which satisfies
$$R(f^*) = \inf_f R(f),$$
where the infimum is taken over all measurable functions, is a member of $\mathcal{F}$. That is, we make few, if any, assumptions about $f^*$. This situation is sometimes termed Agnostic Learning. The root of the word agnostic literally means "not known." The term agnostic learning is used to emphasize the fact that often, perhaps usually, we have no prior knowledge about $f^*$. The question then arises: how can we reasonably select an $f \in \mathcal{F}$ in this setting?

1.2 The Problem

The PAC-style bounds discussed in the previous lecture offer some help. Since we are selecting a function based on the empirical risk, the question is how close $\hat{R}_n(f)$ is to $R(f)$ for every $f \in \mathcal{F}$. In other words, we would like the empirical risk to be a good indicator of the true risk for every function in $\mathcal{F}$. If this is the case, then the function that minimizes the empirical risk,
$$\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f),$$
should also yield a small true risk; that is, $R(\hat{f}_n)$ should be close to $\min_{f \in \mathcal{F}} R(f)$. Finally, we can state our desired situation as
$$P\big(|\hat{R}_n(f) - R(f)| > \epsilon\big) < \delta, \quad \forall f \in \mathcal{F}.$$
In other words, for every $f \in \mathcal{F}$, with probability at least $1 - \delta$, $|\hat{R}_n(f) - R(f)| \le \epsilon$. In this lecture we will start to develop bounds of this form. First we will focus on bounding $P(|\hat{R}_n(f) - R(f)| > \epsilon)$ for one fixed $f \in \mathcal{F}$.
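To make the selection rule concrete, here is a minimal sketch of empirical risk minimization over a finite class with the 0-1 loss. The candidate class (simple threshold classifiers) and the synthetic data are hypothetical choices made purely for illustration; the lecture does not prescribe any particular class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (hypothetical example, not from the lecture).
n = 200
X = rng.uniform(0, 1, size=n)
Y = (X > 0.35).astype(int) ^ (rng.uniform(size=n) < 0.1)  # noisy labels

# A finite class F of threshold classifiers f_t(x) = 1{x > t}.
thresholds = np.linspace(0, 1, 21)

def empirical_risk(t, X, Y):
    """Empirical 0-1 risk: (1/n) sum_i 1{f_t(X_i) != Y_i}."""
    return ((X > t).astype(int) != Y).mean()

# Empirical risk minimization: pick the candidate with the smallest empirical risk.
risks = np.array([empirical_risk(t, X, Y) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(f"selected threshold {t_hat:.2f} with empirical risk {risks.min():.3f}")
```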

2 Developing Initial Bounds

To begin, let us recall the definition of the empirical risk. Let $\{X_i, Y_i\}_{i=1}^n$ be a collection of training data. Then the empirical risk is defined as
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(X_i), Y_i).$$
Note that since the training pairs $\{X_i, Y_i\}$ are assumed to be i.i.d., each term in the sum is an i.i.d. random variable. Let
$$L_i = \ell(f(X_i), Y_i).$$
The collection of losses $\{L_i\}$ is i.i.d. according to some unknown distribution (depending on the unknown joint distribution of $(X, Y)$ and the loss function). The expectation of $L_i$ is $E[\ell(f(X_i), Y_i)] = E[\ell(f(X), Y)] = R(f)$, the true risk of $f$. For now, let us assume that $f$ is fixed. Then
$$E[\hat{R}_n(f)] = \frac{1}{n} \sum_{i=1}^n E[\ell(f(X_i), Y_i)] = \frac{1}{n} \sum_{i=1}^n E[L_i] = R(f).$$
We know from the strong law of large numbers that the average (or empirical mean) $\hat{R}_n(f)$ converges almost surely to the true mean $R(f)$. That is, $\hat{R}_n(f) \to R(f)$ almost surely as $n \to \infty$. The question is how fast.

3 Concentration of Measure Inequalities

Concentration inequalities are upper bounds on how fast empirical means converge to their ensemble counterparts, in probability. The area of the shaded tail regions in Figure 1 is $P(|\hat{R}_n(f) - R(f)| > \epsilon)$, and we are interested in how fast this probability tends to zero as $n \to \infty$. At this stage, we recall Markov's inequality. Let $Z$ be a nonnegative random variable with density $p(z)$. Then for any $t > 0$,
$$E[Z] = \int_0^t z\, p(z)\, dz + \int_t^\infty z\, p(z)\, dz \;\ge\; \int_t^\infty z\, p(z)\, dz \;\ge\; t \int_t^\infty p(z)\, dz = t\, P(Z \ge t),$$
so that
$$P(Z \ge t) \le \frac{E[Z]}{t}, \qquad \text{and likewise} \qquad P(Z^2 \ge t^2) \le \frac{E[Z^2]}{t^2}.$$
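As a quick sanity check on Markov's inequality, the following minimal sketch compares the exact tail probability to the bound $E[Z]/t$; the exponential distribution is just a convenient nonnegative example, not something used in the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Z ~ Exponential(1), so E[Z] = 1 and P(Z >= t) = exp(-t).
samples = rng.exponential(scale=1.0, size=1_000_000)

for t in [1.0, 2.0, 4.0]:
    empirical_tail = (samples >= t).mean()
    markov_bound = samples.mean() / t          # E[Z]/t, estimated from the same samples
    print(f"t={t}: P(Z>=t) ~ {empirical_tail:.4f}  <=  Markov bound {markov_bound:.4f}")
```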

Figure 1: Distribution of $\hat{R}_n(f)$ (the shaded tails have total area $P(|\hat{R}_n(f) - R(f)| > \epsilon)$).

Taking $Z = |\hat{R}_n(f) - R(f)|$ and $t = \epsilon$ in the second form of the inequality gives
$$P\big(|\hat{R}_n(f) - R(f)| \ge \epsilon\big) \;\le\; \frac{E\big[|\hat{R}_n(f) - R(f)|^2\big]}{\epsilon^2} \;=\; \frac{\mathrm{var}(\hat{R}_n(f))}{\epsilon^2} \;=\; \frac{\mathrm{var}\big(\frac{1}{n}\sum_i L_i\big)}{\epsilon^2} \;=\; \frac{\mathrm{var}(\ell(f(X), Y))}{n \epsilon^2} \;=\; \frac{\sigma_L^2}{n \epsilon^2}.$$
So the probability goes to zero at a rate of at least $1/n$. However, it turns out that this is an extremely loose bound. According to the Central Limit Theorem,
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n L_i \;\approx\; \mathcal{N}\Big(R(f), \frac{\sigma_L^2}{n}\Big) \quad \text{as } n \to \infty, \text{ in distribution.}$$
This suggests that for large values of $n$,
$$P\big(|\hat{R}_n(f) - R(f)| \ge \epsilon\big) \approx O\big(e^{-n\epsilon^2 / 2\sigma_L^2}\big).$$
That is, the Gaussian tail probability tends to zero exponentially fast in $n$.
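The gap between the $1/n$ Chebyshev-type rate and the exponential rate suggested by the CLT can be seen numerically. The following is a minimal sketch, assuming (for illustration only) 0-1 losses that are Bernoulli with $R(f) = 0.3$; it estimates the tail probability by Monte Carlo and compares it to both quantities.

```python
import numpy as np

rng = np.random.default_rng(2)

p = 0.3          # assumed true risk R(f) for an illustrative 0-1 loss
eps = 0.05       # deviation epsilon
trials = 20000   # Monte Carlo repetitions

for n in [100, 400, 1600]:
    # Each row of losses is one draw of {L_1, ..., L_n}; row means are empirical risks.
    losses = rng.binomial(1, p, size=(trials, n))
    emp_risk = losses.mean(axis=1)
    tail = np.mean(np.abs(emp_risk - p) >= eps)

    sigma2 = p * (1 - p)                                # var(L_i) for Bernoulli(p)
    chebyshev = sigma2 / (n * eps**2)                   # Markov/Chebyshev bound: sigma_L^2/(n eps^2)
    gaussian = 2 * np.exp(-n * eps**2 / (2 * sigma2))   # CLT-style tail approximation

    print(f"n={n:5d}  tail~{tail:.4f}  Chebyshev<= {chebyshev:.4f}  Gaussian~ {gaussian:.4f}")
```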

4 A Dichotomy

Obviously, the bound based on Markov's inequality is extremely loose for large $n$. Tighter concentration inequalities can be derived using more sophisticated techniques. There is an important dichotomy at this point between the class of bounded loss functions (leading to bounded random variables $L_i$) and unbounded loss functions (leading to unbounded random variables $L_i$).

Example 1 (Bounded Loss Functions) By this we mean any loss function mapping into a bounded set, for example
$$\ell : \mathcal{Y} \times \mathcal{Y} \to [0, 1].$$
So here, for the 0-1 loss, $L_i = 0$ or $1$ and $R(f) = E[\mathbf{1}_{\{f(X) \ne Y\}}] = P(f(X) \ne Y)$.

Example 2 (Unbounded Loss Functions) Any loss function mapping into an unbounded set, for example the squared error, with $R(f) = E[(f(X) - Y)^2]$.

The case of bounded losses is simpler, since we can exploit the boundedness in a key way. Therefore, we will concentrate on bounded loss functions and classification problems first, and later we will look at unbounded losses and estimation problems.

5 Bounded Loss Functions and Chernoff's Bound

Note that for any random variable $Z$ and $t > 0$,
$$P(Z \ge t) = P(e^{sZ} \ge e^{st}) \le \frac{E[e^{sZ}]}{e^{st}}, \quad \forall s > 0,$$
by Markov's inequality applied to the nonnegative random variable $e^{sZ}$. Chernoff's bound is based on finding the value of $s$ that minimizes this upper bound. Suppose $Z$ is a sum of independent random variables; for example, say
$$Z = \sum_{i=1}^n \big(\ell(f(X_i), Y_i) - R(f)\big) = n\big(\hat{R}_n(f) - R(f)\big).$$
Then the bound becomes
$$P\Big(\sum_{i=1}^n (L_i - E[L_i]) \ge t\Big) \;\le\; e^{-st}\, E\big[e^{s \sum_i (L_i - E[L_i])}\big] \;=\; e^{-st} \prod_{i=1}^n E\big[e^{s(L_i - E[L_i])}\big],$$
where the last equality follows from independence. Thus, the problem of finding a tight bound boils down to finding a good bound for $E[e^{s(L_i - E[L_i])}]$. Chernoff ('52) first studied this situation for binary random variables. Then Hoeffding ('63) derived a more general result for arbitrary bounded random variables.

Theorem 1 (Hoeffding's Inequality) Let $Z_1, Z_2, \ldots, Z_n$ be independent bounded random variables such that $Z_i \in [a_i, b_i]$ with probability $1$. Let $S_n = \sum_{i=1}^n Z_i$. Then for any $t > 0$, we have
$$P\big(|S_n - E[S_n]| \ge t\big) \;\le\; 2\, e^{-2t^2 / \sum_{i=1}^n (b_i - a_i)^2}.$$
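As an illustration of Theorem 1, the following minimal sketch checks the bound by Monte Carlo for one assumed choice of bounded variables (uniform on $[0,1]$, so $a_i = 0$, $b_i = 1$ and $\sum_i (b_i - a_i)^2 = n$); the numbers are arbitrary and only meant to show that the bound holds.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative assumption: Z_i ~ Uniform[0, 1], so a_i = 0, b_i = 1.
n = 200
t = 10.0
trials = 20000

Z = rng.uniform(0.0, 1.0, size=(trials, n))
S = Z.sum(axis=1)
deviation = np.mean(np.abs(S - n * 0.5) >= t)   # E[S_n] = n/2

hoeffding = 2 * np.exp(-2 * t**2 / n)           # Theorem 1 bound with sum_i (b_i - a_i)^2 = n
print(f"P(|S_n - E[S_n]| >= {t}) ~ {deviation:.4f}   Hoeffding bound: {hoeffding:.4f}")
```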

Application: Let $Z_i = \mathbf{1}_{\{f(X_i) \ne Y_i\}}$, as in the classification problem, so that $Z_i \in [0, 1]$ and $E[Z_i] = R(f)$. Then for a fixed $f$, it follows from Hoeffding's inequality (i.e., Chernoff's bound in this special case) that
$$P\big(|\hat{R}_n(f) - R(f)| \ge \epsilon\big) = P\Big(\frac{1}{n}\big|S_n - E[S_n]\big| \ge \epsilon\Big) = P\big(|S_n - E[S_n]| \ge n\epsilon\big) \le 2\, e^{-2(n\epsilon)^2 / n} = 2\, e^{-2 n \epsilon^2}.$$

Proof: The key to proving Hoeffding's inequality is the following upper bound: if $Z$ is a random variable with $E[Z] = 0$ and $a \le Z \le b$, then
$$E[e^{sZ}] \le e^{s^2 (b - a)^2 / 8}.$$
This upper bound is derived as follows. By the convexity of the exponential function,
$$e^{sz} \le \frac{z - a}{b - a}\, e^{sb} + \frac{b - z}{b - a}\, e^{sa}, \quad \text{for } a \le z \le b.$$

Figure 2: Convexity of the exponential function.

Thus,
$$E[e^{sZ}] \;\le\; E\Big[\frac{Z - a}{b - a}\Big] e^{sb} + E\Big[\frac{b - Z}{b - a}\Big] e^{sa} \;=\; \frac{b}{b - a}\, e^{sa} - \frac{a}{b - a}\, e^{sb}, \quad \text{since } E[Z] = 0,$$
$$=\; \big(1 - \theta + \theta e^{s(b - a)}\big) e^{-\theta s (b - a)}, \quad \text{where } \theta = \frac{-a}{b - a}.$$

Now let $u = s(b - a)$ and define
$$\phi(u) = -\theta u + \log\big(1 - \theta + \theta e^{u}\big).$$
Then we have
$$E[e^{sZ}] \le \big(1 - \theta + \theta e^{s(b - a)}\big) e^{-\theta s (b - a)} = e^{\phi(u)}.$$
To bound the right-hand side, let us express $\phi(u)$ in a Taylor series with remainder:
$$\phi(u) = \phi(0) + u\, \phi'(0) + \frac{u^2}{2}\, \phi''(v) \quad \text{for some } v \in [0, u].$$
We have
$$\phi'(u) = -\theta + \frac{\theta e^{u}}{1 - \theta + \theta e^{u}}, \quad \text{so } \phi(0) = 0 \text{ and } \phi'(0) = 0,$$
and
$$\phi''(u) = \frac{\theta e^{u}}{1 - \theta + \theta e^{u}} - \frac{\theta^2 e^{2u}}{(1 - \theta + \theta e^{u})^2} = \frac{\theta e^{u}}{1 - \theta + \theta e^{u}} \Big(1 - \frac{\theta e^{u}}{1 - \theta + \theta e^{u}}\Big) = \rho(1 - \rho),$$
where $\rho = \frac{\theta e^{u}}{1 - \theta + \theta e^{u}}$. Now, $\rho(1 - \rho)$ is maximized at $\rho = \frac{1}{2}$, so
$$\phi''(u) \le \frac{1}{4}, \qquad \phi(u) \le \frac{u^2}{8} = \frac{s^2 (b - a)^2}{8}, \qquad \text{and hence} \qquad E[e^{sZ}] \le e^{s^2 (b - a)^2 / 8}.$$
Now we can apply this upper bound to derive Hoeffding's inequality:
$$P\big(S_n - E[S_n] \ge t\big) \;\le\; e^{-st} \prod_{i=1}^n E\big[e^{s(Z_i - E[Z_i])}\big] \;\le\; e^{-st} \prod_{i=1}^n e^{s^2 (b_i - a_i)^2 / 8} \;=\; e^{-st}\, e^{s^2 \sum_i (b_i - a_i)^2 / 8} \;=\; e^{-2t^2 / \sum_i (b_i - a_i)^2},$$
by choosing $s = \frac{4t}{\sum_i (b_i - a_i)^2}$. Similarly, $P\big(E[S_n] - S_n \ge t\big) \le e^{-2t^2 / \sum_i (b_i - a_i)^2}$. Combining the two deviations completes the proof of Hoeffding's theorem.
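As a quick numerical check of the key lemma $E[e^{sZ}] \le e^{s^2(b-a)^2/8}$, here is a minimal sketch; the centered Bernoulli variable used below is an arbitrary illustrative choice, not part of the lecture.

```python
import numpy as np

# Check E[e^{sZ}] <= e^{s^2 (b-a)^2 / 8} for a centered bounded variable.
# Illustrative assumption: Z = B - p with B ~ Bernoulli(p), so E[Z] = 0 and
# Z is supported on [a, b] = [-p, 1-p], giving b - a = 1.
p = 0.3
a, b = -p, 1.0 - p

for s in [0.5, 1.0, 2.0, 4.0]:
    mgf = p * np.exp(s * (1 - p)) + (1 - p) * np.exp(s * (-p))   # exact E[e^{sZ}]
    bound = np.exp(s**2 * (b - a)**2 / 8)
    print(f"s={s}: E[e^(sZ)] = {mgf:.4f}  <=  bound {bound:.4f}")
```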

Now, we want a bound like this to hold simultaneously for all $f \in \mathcal{F}$. Let us enumerate the functions in $\mathcal{F}$ as $f_1, f_2, \ldots, f_{|\mathcal{F}|}$, where $|\mathcal{F}|$ denotes the cardinality of $\mathcal{F}$. We would like to bound the probability that $|\hat{R}_n(f) - R(f)| \ge \epsilon$ for any $f \in \mathcal{F}$. This probability is
$$P\Big(\big|\hat{R}_n(f_1) - R(f_1)\big| \ge \epsilon \ \text{ or } \ \cdots \ \text{ or } \ \big|\hat{R}_n(f_{|\mathcal{F}|}) - R(f_{|\mathcal{F}|})\big| \ge \epsilon\Big) = P\Big(\bigcup_{f \in \mathcal{F}} \big\{|\hat{R}_n(f) - R(f)| \ge \epsilon\big\}\Big)$$
$$\le \sum_{f \in \mathcal{F}} P\big(|\hat{R}_n(f) - R(f)| \ge \epsilon\big), \quad \text{by the union of events bound,}$$
$$\le 2\, |\mathcal{F}|\, e^{-2 n \epsilon^2}, \quad \text{by Hoeffding's inequality.}$$
Thus, we have shown that for every $f \in \mathcal{F}$, with probability at least $1 - 2|\mathcal{F}|\, e^{-2 n \epsilon^2}$, $|\hat{R}_n(f) - R(f)| < \epsilon$. Accordingly, we can be reasonably confident in selecting $f$ from $\mathcal{F}$ based on the empirical risk function $\hat{R}_n$.
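The uniform bound also tells us how much data is needed: setting $2|\mathcal{F}|\, e^{-2n\epsilon^2} \le \delta$ and solving for $n$ gives $n \ge \log(2|\mathcal{F}|/\delta) / (2\epsilon^2)$. A minimal sketch of this calculation follows; the particular values of $|\mathcal{F}|$, $\epsilon$, and $\delta$ are arbitrary.

```python
import math

def sample_size(card_F, eps, delta):
    """Smallest n with 2*|F|*exp(-2*n*eps^2) <= delta, from the union bound plus Hoeffding."""
    return math.ceil(math.log(2 * card_F / delta) / (2 * eps**2))

# Arbitrary illustrative values, not from the lecture.
for card_F in [10, 1000, 10**6]:
    n = sample_size(card_F, eps=0.05, delta=0.01)
    print(f"|F| = {card_F:>7}: need n >= {n} samples for eps=0.05, delta=0.01")
```

Note that the required sample size grows only logarithmically in $|\mathcal{F}|$, which is why the union bound remains useful even for fairly large finite classes.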