E0 370 Statistical Learning Theory          Lecture 10          Sep 15, 2011

Excess Error, Approximation Error, and Estimation Error

Lecturer: Shivani Agarwal          Scribe: Shivani Agarwal

1 Introduction

So far, we have considered the finite sample setting: given a finite sample $S \in (X \times Y)^m$ drawn according to $D^m$, we have seen how to obtain high confidence bounds on the generalization error of a function learned from $S$, usually in terms of some empirical quantity that measures the performance of the function on $S$. Another question of interest concerns the behaviour of a learning algorithm in the infinite sample limit: as it receives more and more data, does the algorithm converge to an optimal prediction rule, i.e. does the generalization error of the learned function approach the optimal error?

Recall that for a distribution $D$ on $X \times Y$ and a loss $\ell : Y \times Y \to [0, \infty)$, the optimal error w.r.t. $D$ and $\ell$ is the lowest possible error achievable by any function $h : X \to Y$:

    $\mathrm{er}^{*}_{\ell,D} = \inf_{h : X \to Y} \mathrm{er}^{\ell}_{D}[h].$   (1)

For the 0-1 loss, the optimal error is known as the Bayes error. To formalize the above, for any function $h : X \to Y$, define its excess error w.r.t. $D$ and $\ell$ as

    $\mathrm{er}^{\ell}_{D}[h] - \mathrm{er}^{*}_{\ell,D}.$   (2)

We would like to study the behaviour of the excess error of the function learned by an algorithm from a training sample $S$ as $m \to \infty$. As we have seen, since minimizing the error over all possible functions in $Y^X$ can be difficult, most learning algorithms select a function from some fixed function class $H \subseteq Y^X$. In such cases, we can only hope to achieve generalization error close to the lowest possible within the class; we refer to this as the optimal error within $H$ w.r.t. $D$ and $\ell$:

    $\mathrm{er}^{\ell}_{D}[H] = \inf_{h \in H} \mathrm{er}^{\ell}_{D}[h].$   (3)

It is then useful to view the excess error of functions $h \in H$ as a sum of the following two terms:

    $\mathrm{er}^{\ell}_{D}[h] - \mathrm{er}^{*}_{\ell,D} = \big( \mathrm{er}^{\ell}_{D}[h] - \mathrm{er}^{\ell}_{D}[H] \big) + \big( \mathrm{er}^{\ell}_{D}[H] - \mathrm{er}^{*}_{\ell,D} \big).$   (4)

The first term is called the estimation error, and measures how far $h$ is from the optimal within $H$. The second term, called the approximation error, measures how close one can get to the optimal error using functions in $H$; this is an inherent property of the function class, and forms a lower bound on the excess error of any function learned from $H$. In the following we will focus on the estimation error, which is what a learning algorithm learning from a function class $H$ can hope to minimize.
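As a purely illustrative aside, the following minimal sketch puts concrete numbers on the decomposition in Eq. (4). It assumes a toy one-dimensional distribution and the class of threshold classifiers under the 0-1 loss (the distribution, the class, and all names in the code are assumptions made only for illustration), and computes the Bayes error, the best error within the class, and the excess error of one particular $h \in H$, split into its estimation and approximation parts.

```python
# Numerical illustration of the decomposition in Eq. (4); toy setup assumed here:
#   X ~ Uniform[0, 1], eta(x) = P(y = +1 | x) = 0.9 on [0.25, 0.75] and 0.1 elsewhere,
#   loss = 0-1 loss, H = threshold classifiers h_t(x) = +1 iff x >= t.
import numpy as np

def eta(x):
    """Conditional probability P(y = +1 | x) for the assumed toy distribution."""
    return np.where((x >= 0.25) & (x <= 0.75), 0.9, 0.1)

xs = np.linspace(0.0, 1.0, 10_001)      # grid for numerical integration over X

def err(pred):
    """Generalization error er_D[h] for predictions pred in {-1, +1} on the grid."""
    p_plus = eta(xs)
    # 0-1 loss: pay P(y = -1 | x) where h predicts +1, and P(y = +1 | x) where it predicts -1
    return float(np.mean(np.where(pred == 1, 1.0 - p_plus, p_plus)))

bayes_err = err(np.where(eta(xs) > 0.5, 1, -1))                        # er*_{0-1,D}, Eq. (1)

thresholds = np.linspace(0.0, 1.0, 501)
best_in_H = min(err(np.where(xs >= t, 1, -1)) for t in thresholds)     # er_D[H], Eq. (3)

h_err = err(np.where(xs >= 0.5, 1, -1))                                # a particular h in H (t = 0.5)

print("excess error        :", h_err - bayes_err)         # left-hand side of Eq. (4)
print("estimation error    :", h_err - best_in_H)         # first term on the right
print("approximation error :", best_in_H - bayes_err)     # second term on the right
```

For this toy distribution the threshold class cannot represent the Bayes classifier (which predicts $+1$ only on the middle interval), so the approximation error is strictly positive no matter how much data one has; only the estimation error can be driven to zero.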

We first give a couple of definitions.

2 Statistical Consistency

Definition. Let $H \subseteq Y^X$. Let $A : \cup_{m=1}^{\infty} (X \times Y)^m \to H$ be a learning algorithm that, given a training sample $S \in \cup_{m=1}^{\infty} (X \times Y)^m$, returns a function $h_S \in H$. Let $D$ be a probability distribution on $X \times Y$ and $\ell : Y \times Y \to [0, \infty)$. We say $A$ is statistically consistent in $H$ w.r.t. $D$ and $\ell$ if the estimation error of the function learned by $A$ from $S$ converges in probability to zero, i.e. if for all $\epsilon > 0$,

    $\mathbf{P}_{S \sim D^m}\big( \mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{D}[H] \geq \epsilon \big) \to 0$   as $m \to \infty$.

If $A$ is consistent in $H$ w.r.t. $D$ and $\ell$ for all distributions $D$ on $X \times Y$, we say $A$ is universally consistent in $H$ w.r.t. $\ell$.

(Note that one could also define a notion of consistency in terms of convergence in expectation, which would require that $\mathbf{E}_{S \sim D^m}\big[ \mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{D}[H] \big] \to 0$ as $m \to \infty$. It is easy to show that a sequence of bounded, non-negative random variables converges in probability to zero if and only if it converges in expectation to zero (show this!), and therefore when the loss function $\ell$ is bounded, consistency in terms of convergence in probability is equivalent to consistency in terms of convergence in expectation.)

Definition. Let $A : \cup_{m=1}^{\infty} (X \times Y)^m \to Y^X$ be a learning algorithm that, given a training sample $S \in \cup_{m=1}^{\infty} (X \times Y)^m$, returns a function $h_S : X \to Y$. Let $D$ be a probability distribution on $X \times Y$ and $\ell : Y \times Y \to [0, \infty)$. We say $A$ is Bayes consistent w.r.t. $D$ and $\ell$ if the excess error of the function learned by $A$ from $S$ converges in probability to zero, i.e. if for all $\epsilon > 0$,

    $\mathbf{P}_{S \sim D^m}\big( \mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{*}_{\ell,D} \geq \epsilon \big) \to 0$   as $m \to \infty$.

If $A$ is Bayes consistent w.r.t. $D$ and $\ell$ for all distributions $D$ on $X \times Y$, we say $A$ is universally Bayes consistent w.r.t. $\ell$.

(Note that the term Bayes consistency is usually used to refer to convergence to the optimal error for binary classification with the 0-1 loss; we will use the term for any learning problem/loss function, to distinguish it from consistency within $H$.)

One can also define analogous notions of strong consistency, which require almost sure convergence instead of convergence in probability.

3 Consistency of Empirical Risk Minimization in H

Let $H \subseteq Y^X$ and $\ell : Y \times Y \to [0, \infty)$. Consider the empirical risk minimization (ERM) algorithm in $H$, which given a training sample $S \in \cup_{m=1}^{\infty} (X \times Y)^m$ returns

    $h_S \in \arg\min_{h \in H} \mathrm{er}^{\ell}_{S}[h].$   (5)

(We assume for simplicity that the minimum is achieved in $H$; the results we discuss continue to hold if $h_S$ is selected to be any function in $H$ whose empirical error is within an appropriately small precision of $\inf_{h \in H} \mathrm{er}^{\ell}_{S}[h]$.)

Then for any distribution $D$ on $X \times Y$, we can write the estimation error of $h_S$ as

    $\mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{D}[H] = \big( \mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{S}[h_S] \big) + \big( \mathrm{er}^{\ell}_{S}[h_S] - \mathrm{er}^{\ell}_{D}[H] \big)$   (6)
    $\leq \big( \mathrm{er}^{\ell}_{D}[h_S] - \mathrm{er}^{\ell}_{S}[h_S] \big) + \sup_{h \in H} \big( \mathrm{er}^{\ell}_{S}[h] - \mathrm{er}^{\ell}_{D}[h] \big)$   (7)
    $\leq 2 \sup_{h \in H} \big| \mathrm{er}^{\ell}_{S}[h] - \mathrm{er}^{\ell}_{D}[h] \big|.$   (8)

Therefore, uniform convergence of empirical errors in $H$ implies consistency of ERM in $H$! In particular, for binary classification, we immediately have the following:

Theorem 3.1. Let $H \subseteq \{\pm 1\}^X$ and $\ell = \ell_{0\text{-}1}$. If $\mathrm{VCdim}(H) = d < \infty$, then ERM in $H$ is universally consistent in $H$ w.r.t. $\ell_{0\text{-}1}$.

Proof. Let $D$ be any probability distribution on $X \times \{\pm 1\}$. Let $\epsilon > 0$. Then

    $\mathbf{P}_{S \sim D^m}\big( \mathrm{er}^{0\text{-}1}_{D}[h_S] - \mathrm{er}^{0\text{-}1}_{D}[H] \geq \epsilon \big) \leq \mathbf{P}_{S \sim D^m}\Big( \sup_{h \in H} \big| \mathrm{er}^{0\text{-}1}_{S}[h] - \mathrm{er}^{0\text{-}1}_{D}[h] \big| \geq \epsilon/2 \Big)$   (by Eq. (8))   (9)
    $\leq 4 \, (2em/d)^d \, e^{-m\epsilon^2/32}$   (by previous results)   (10)
    $\to 0$ as $m \to \infty$.   (11)
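As a quick empirical sanity check on the statement of Theorem 3.1, the sketch below estimates $\mathbf{P}_{S \sim D^m}\big( \mathrm{er}^{0\text{-}1}_{D}[h_S] - \mathrm{er}^{0\text{-}1}_{D}[H] \geq \epsilon \big)$ by repeated sampling, for the class of threshold classifiers (VC dimension 1). The toy distribution, the restriction of ERM to a finite grid of candidate thresholds, and the choice of $\epsilon$ are all assumptions made only for illustration.

```python
# Simulation sketch for Theorem 3.1 (illustrative assumptions throughout):
# ERM over threshold classifiers h_t(x) = +1 iff x >= t, on a toy distribution;
# we estimate P( er_D[h_S] - er_D[H] >= eps ) over repeated samples of size m.
import numpy as np

rng = np.random.default_rng(0)
eta = lambda x: np.where((x >= 0.25) & (x <= 0.75), 0.9, 0.1)    # P(y = +1 | x)

grid = np.linspace(0.0, 1.0, 10_001)      # grid for (numerical) true errors
cand = np.linspace(0.0, 1.0, 201)         # candidate thresholds for (approximate) ERM

def true_err(t):
    """Generalization error of the threshold classifier at t, by numerical integration."""
    pred = np.where(grid >= t, 1, -1)
    return float(np.mean(np.where(pred == 1, 1.0 - eta(grid), eta(grid))))

er_H = min(true_err(t) for t in cand)     # er_D[H] (approximately, over the grid)

eps, trials = 0.05, 200
for m in (50, 200, 800, 3200):
    bad = 0
    for _ in range(trials):
        x = rng.uniform(size=m)
        y = np.where(rng.uniform(size=m) < eta(x), 1, -1)
        emp_errs = [np.mean(np.where(x >= t, 1, -1) != y) for t in cand]   # Eq. (5) over the grid
        t_hat = cand[int(np.argmin(emp_errs))]
        bad += int(true_err(t_hat) - er_H >= eps)
    print(f"m = {m:5d}   estimated P(estimation error >= {eps}) = {bad / trials:.3f}")
```

As $m$ grows, the estimated probability should drop towards zero for any fixed $\epsilon > 0$, which is exactly what consistency of ERM in $H$ asserts.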

Several remarks are in order:

1. As we have noted before, for binary classification, ERM is typically not computationally efficient, except for some simple classes $H$. We will later discuss consistency of algorithms that minimize a convex upper bound on $\ell_{0\text{-}1}$.

2. Note that for any $0 < \delta \leq 1$, we have with probability at least $1 - \delta$ over $S \sim D^m$,

    $\mathrm{er}^{0\text{-}1}_{D}[h_S] - \mathrm{er}^{0\text{-}1}_{D}[H] \leq c \sqrt{\frac{d \ln m + \ln\frac{1}{\delta}}{m}}.$

As a function of the sample size $m$, this gives a rate of convergence of $O\big(\sqrt{\ln m / m}\big)$ for the estimation error. For distributions for which $\mathrm{er}^{0\text{-}1}_{D}[H] = 0$ (so that there is a target function $t \in H$ such that with probability 1, the true label $y$ of any instance $x$ under $D$ is given by $t(x)$, i.e. $\mathbf{P}_{(x,y) \sim D}(y = t(x)) = 1$), one can actually show a faster rate of convergence of $O(\ln m / m)$. This follows from a better uniform convergence bound for such distributions, with an $e^{-cm\epsilon}$ term in the bound rather than $e^{-cm\epsilon^2}$; we probably will not show this for the general case, but will show it for finite $H$ in a later lecture. A derivation for the general case can be found for example in [1].

3. It is important to note that the above result applies only to classes of finite VC-dimension. Since no such class can have zero approximation error for all distributions, ERM in such a class cannot achieve universal Bayes consistency.

4. For classes $H$ of finite VC-dimension, the above result actually establishes that ERM in $H$ is strongly universally consistent in $H$, by virtue of the Borel-Cantelli lemma (see [1]).

4 Consistency of Structural Risk Minimization in $H = \cup_i H_i$

Let $H_1 \subseteq H_2 \subseteq \cdots$, where $H_i \subseteq Y^X$. Let $\ell : Y \times Y \to [0, \infty)$. Given a training sample $S \in \cup_{m=1}^{\infty} (X \times Y)^m$, the structural risk minimization (SRM) algorithm in $\{H_i\}_{i=1}^{\infty}$ returns

    $h_S = h^{\hat{i}}_S, \quad \text{where } \hat{i} \in \arg\min_{i} \big( \mathrm{er}^{\ell}_{S}[h^i_S] + \mathrm{penalty}(i, m) \big),$   (12)

where $h^i_S \in H_i$ is the function returned by ERM in $H_i$, and $\mathrm{penalty}(i, m)$ is a penalty term that increases with the complexity of $H_i$. Under certain conditions, one can show that SRM in $\{H_i\}_{i=1}^{\infty}$ is consistent in $H = \cup_{i=1}^{\infty} H_i$; if in addition the sequence $\{H_i\}_{i=1}^{\infty}$ is such that $H = \cup_{i=1}^{\infty} H_i$ has zero approximation error, then SRM in $\{H_i\}_{i=1}^{\infty}$ can also be Bayes consistent. For example, for binary classification, we have the following result:

Theorem 4.1 (Lugosi and Zeger, 1996). Let $H_1 \subseteq H_2 \subseteq \cdots$, where $H_i \subseteq \{\pm 1\}^X$, $\mathrm{VCdim}(H_i) = d_i < \infty$, and $d_i < d_{i+1}$. Let $\ell = \ell_{0\text{-}1}$. Then SRM with penalties given by

    $\mathrm{penalty}(i, m) = \sqrt{\frac{8 \, d_i \ln(2em) + i}{m}}$

is universally consistent in $H = \cup_{i=1}^{\infty} H_i$ w.r.t. $\ell_{0\text{-}1}$.

Proof. Let $D$ be any probability distribution on $X \times \{\pm 1\}$. Let $\epsilon > 0$. We can write the estimation error of $h_S$ as

    $\mathrm{er}_{D}[h_S] - \mathrm{er}_{D}[H] = \Big( \mathrm{er}_{D}[h_S] - \inf_i \big( \mathrm{er}_{S}[h^i_S] + \mathrm{penalty}(i,m) \big) \Big) + \Big( \inf_i \big( \mathrm{er}_{S}[h^i_S] + \mathrm{penalty}(i,m) \big) - \mathrm{er}_{D}[H] \Big).$   (13)

Therefore we have

    $\mathbf{P}\big( \mathrm{er}_{D}[h_S] - \mathrm{er}_{D}[H] \geq \epsilon \big) \leq \mathbf{P}\Big( \mathrm{er}_{D}[h_S] - \inf_i \big( \mathrm{er}_{S}[h^i_S] + \mathrm{penalty}(i,m) \big) \geq \epsilon/2 \Big) + \mathbf{P}\Big( \inf_i \big( \mathrm{er}_{S}[h^i_S] + \mathrm{penalty}(i,m) \big) - \mathrm{er}_{D}[H] \geq \epsilon/2 \Big).$   (14)

We will bound each probability in turn. For the first probability, we have

    $\mathbf{P}\Big( \mathrm{er}_{D}[h_S] - \inf_i \big( \mathrm{er}_{S}[h^i_S] + \mathrm{penalty}(i,m) \big) \geq \epsilon/2 \Big)$   (15)
    $\leq \mathbf{P}\Big( \sup_i \big( \mathrm{er}_{D}[h^i_S] - \mathrm{er}_{S}[h^i_S] - \mathrm{penalty}(i,m) \big) \geq \epsilon/2 \Big)$   (16)
    $\leq \sum_{i=1}^{\infty} \mathbf{P}\Big( \mathrm{er}_{D}[h^i_S] - \mathrm{er}_{S}[h^i_S] \geq \epsilon/2 + \mathrm{penalty}(i,m) \Big)$   (by union bound)   (17)
    $\leq \sum_{i=1}^{\infty} 4 \, (2em/d_i)^{d_i} \, e^{-m(\epsilon/2 + \mathrm{penalty}(i,m))^2/8}$   (18)
    $\leq \sum_{i=1}^{\infty} 4 \, (2em/d_i)^{d_i} \, e^{-m\epsilon^2/32} \, e^{-m \, \mathrm{penalty}(i,m)^2/8}$   (19)
    $= 4 \, e^{-m\epsilon^2/32} \sum_{i=1}^{\infty} (2em/d_i)^{d_i} \, (2em)^{-d_i} \, e^{-i/8}$   (20)
    $\leq 4 \, e^{-m\epsilon^2/32} \sum_{i=1}^{\infty} e^{-i/8}$   (21)
    $\leq \frac{4 \, e^{-m\epsilon^2/32}}{1 - e^{-1/8}}.$   (22)

(Here (16) uses the fact that $h_S = h^{\hat{i}}_S$ for the minimizing index $\hat{i}$ in Eq. (12), and (19) uses $(a+b)^2 \geq a^2 + b^2$ for $a, b \geq 0$.)

For the second probability, let $i_0$ be such that

    $\mathrm{er}_{D}[H_{i_0}] \leq \mathrm{er}_{D}[H] + \epsilon/4,$   (23)

and let $m_0$ be such that for all $m \geq m_0$,

    $\mathrm{penalty}(i_0, m) \leq \epsilon/8.$   (24)

Then we have

    $\mathbf{P}\Big( \inf_i \big( \mathrm{er}_{S}[h^i_S] + \mathrm{penalty}(i,m) \big) - \mathrm{er}_{D}[H] \geq \epsilon/2 \Big)$   (25)
    $\leq \mathbf{P}\Big( \inf_i \big( \mathrm{er}_{S}[h^i_S] + \mathrm{penalty}(i,m) \big) - \mathrm{er}_{D}[H_{i_0}] \geq \epsilon/4 \Big)$   (26)
    $\leq \mathbf{P}\Big( \mathrm{er}_{S}[h^{i_0}_S] + \mathrm{penalty}(i_0,m) - \mathrm{er}_{D}[H_{i_0}] \geq \epsilon/4 \Big)$   (27)
    $\leq \mathbf{P}\Big( \mathrm{er}_{S}[h^{i_0}_S] - \mathrm{er}_{D}[H_{i_0}] \geq \epsilon/8 \Big)$, for $m \geq m_0$   (28)
    $\leq \mathbf{P}\Big( \sup_{h \in H_{i_0}} \big( \mathrm{er}_{S}[h] - \mathrm{er}_{D}[h] \big) \geq \epsilon/8 \Big)$   (29)
    $\leq 4 \, (2em/d_{i_0})^{d_{i_0}} \, e^{-m\epsilon^2/512}.$   (30)

(Here (29) uses the fact that $h^{i_0}_S$ minimizes the empirical error over $H_{i_0}$.)

Thus we have

    $\mathbf{P}\big( \mathrm{er}_{D}[h_S] - \mathrm{er}_{D}[H] \geq \epsilon \big) \leq \frac{4 \, e^{-m\epsilon^2/32}}{1 - e^{-1/8}} + 4 \, (2em/d_{i_0})^{d_{i_0}} \, e^{-m\epsilon^2/512}$, for $m \geq m_0$   (31)
    $\to 0$ as $m \to \infty$.   (32)
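To get a feel for how the SRM rule in Eq. (12) trades off empirical fit against the penalty, here is a small illustrative sketch. It assumes a toy distribution and a nested sequence of classes $H_i$ consisting of $\pm 1$ labelings of $i$ equal-width bins of $[0,1]$ (so $d_i = i$, and ERM in $H_i$ is a per-bin majority vote), with the penalty of Theorem 4.1; the distribution, the classes, and all names in the code are assumptions made only for illustration.

```python
# Sketch of the SRM rule in Eq. (12) with the penalty of Theorem 4.1 (illustrative
# assumptions throughout): H_i = classifiers constant on each of i equal-width bins
# of [0, 1] (VC dimension d_i = i); ERM in H_i is a majority vote within each bin.
import numpy as np

rng = np.random.default_rng(1)
eta = lambda x: np.where((x >= 0.25) & (x <= 0.75), 0.9, 0.1)    # P(y = +1 | x)

def erm_empirical_error(x, y, i):
    """Empirical 0-1 error of ERM in H_i (per-bin majority vote) on the sample (x, y)."""
    bins = np.minimum((x * i).astype(int), i - 1)
    labels = np.array([1 if y[bins == b].sum() >= 0 else -1 for b in range(i)])
    return float(np.mean(labels[bins] != y))

def penalty(i, m):
    """Penalty term from Theorem 4.1, with d_i = i for this nested sequence."""
    return float(np.sqrt((8 * i * np.log(2 * np.e * m) + i) / m))

for m in (100, 1000, 10000):
    x = rng.uniform(size=m)
    y = np.where(rng.uniform(size=m) < eta(x), 1, -1)
    scores = {i: erm_empirical_error(x, y, i) + penalty(i, m) for i in range(1, 21)}
    i_hat = min(scores, key=scores.get)          # the index selected by Eq. (12)
    print(f"m = {m:6d}   SRM selects H_{i_hat}   (empirical error "
          f"{erm_empirical_error(x, y, i_hat):.3f}, penalty {penalty(i_hat, m):.3f})")
```

For small $m$ the penalty dominates and SRM settles for a very simple class; as $m$ grows it can afford richer classes, which is exactly the trade-off that the proof above balances between the two probabilities.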

A couple of remarks:

1. As noted above, if the sequence $\{H_i\}_{i=1}^{\infty}$ is such that

    $\inf_i \inf_{h \in H_i} \mathrm{er}^{0\text{-}1}_{D}[h] = \mathrm{er}^{*}_{0\text{-}1,D}$   for all distributions $D$ on $X \times \{\pm 1\}$

(i.e. if the approximation error of $H = \cup_{i=1}^{\infty} H_i$ is zero for all $D$), then SRM in $\{H_i\}_{i=1}^{\infty}$ as above is universally Bayes consistent w.r.t. $\ell_{0\text{-}1}$.

2. Again, except for the simplest problems, SRM (particularly for binary classification) is often not computationally feasible; however it is useful as a theoretical tool for understanding model selection techniques and Bayes consistency, and can also serve as a guide for the development of approximate algorithms.

5 Consistency and Learnability: Two Sides of the Same Coin

In the next few lectures we will turn to learnability, and then return to a more detailed discussion of statistical consistency. As we will see, the two notions are closely related, although they arose in different communities and tend to emphasize somewhat different aspects:

Statistical Consistency:
- Origins in statistics.
- Starts with a learning algorithm; asks if it is statistically consistent.
- Both consistency within $H$ and Bayes consistency of interest.
- Mostly distribution-free; also interested in low-noise settings.
- Focus on convergence rates (in terms of $\epsilon$, $\delta$).

Learnability:
- Origins in theoretical computer science.
- Starts with a function class $H$; asks if there is a learning algorithm that is statistically consistent in $H$ (with an additional requirement we will see next time).
- By definition, interest is in consistency w.r.t. $H$.
- Often assumes $\mathrm{er}^{\ell}_{D}[H] = 0$ (target function setting); mostly distribution-free otherwise, but sometimes interested in specific distributions such as the uniform distribution over the Boolean cube $X = \{0,1\}^n$.
- Focus on sample complexity $m(\epsilon, \delta)$ and computational complexity.

6 Next Lecture

In the next lecture we will introduce the notion of learnability, and will give a few basic results and examples to illustrate the concept. The next few lectures after that will discuss more results and examples related to learnability, before we return to talk more about statistical consistency.

References

[1] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.