Fast Rates for Estimation Error and Oracle Inequalities for Model Selection


Peter L. Bartlett
Computer Science Division and Department of Statistics
University of California, Berkeley
bartlett@cs.berkeley.edu

March 9, 2007

Abstract

We consider complexity penalization methods for model selection. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. It is well known that if we use a bound on the maximal deviation between empirical and true risks as a complexity penalty, then the risk of our choice is no more than the approximation error plus twice the complexity penalty. There are many cases, however, where complexity penalties like this give loose upper bounds on the estimation error. In particular, if we choose a function from a suitably simple convex function class with a strictly convex loss function, then the estimation error (the difference between the risk of the empirical risk minimizer and the minimal risk in the class) approaches zero at a faster rate than the maximal deviation between empirical and true risks. In this note, we address the question of whether it is possible to design a complexity penalized model selection method for these situations. We show that, provided the sequence of models is ordered by inclusion, in these cases we can use tight upper bounds on estimation error as a complexity penalty. Surprisingly, this is the case even in situations when the difference between the empirical risk and true risk (and indeed the error of any estimate of the approximation error) decreases much more slowly than the complexity penalty. We

give an oracle inequality showing that the resulting model selection method chooses a function with risk no more than the approximation error plus a constant times the complexity penalty.

1 Introduction

Consider the following prediction problem. We have independent and identically distributed random pairs (X, Y), (X_1, Y_1), ..., (X_n, Y_n) from X × Y, where the label space Y is a subset of R. The aim is to use the data (X_1, Y_1), ..., (X_n, Y_n) to choose a function f_n : X → Y with small risk,

    R(f_n) = E l(Y, f_n(X)),

where l is a non-negative loss function. Ideally, the risk should be close to the minimal, or Bayes, risk, R* = inf_f R(f), where the infimum is over all measurable functions. A natural approach is to minimize the empirical risk,

    ˆR(f) = (1/n) Σ_{i=1}^n l(Y_i, f(X_i)),

over some class F of functions. Clearly, there is a trade-off in the choice of F: we can decompose the excess risk of f_n as

    R(f_n) − R* = ( R(f_n) − inf_{f∈F} R(f) ) + ( inf_{f∈F} R(f) − R* ),

where the first term represents the estimation error and the second the approximation error. We would like both terms in the decomposition to be small. For the first term to be small, F should be sufficiently small that it is possible to use the data to choose a near-optimal function from F. For the second term to be small, F should be sufficiently large that it contains a good approximation to the optimal prediction. One approach to this problem is to choose a large class F and decompose it into F = ∪_{j≥1} F_j, then choose ˆf_j ∈ F_j to minimize the empirical risk, and finally choose the model index j so that ˆf_j best balances these conflicting requirements. In this note, we consider complexity penalization approaches, in which we choose f_n = ˆf_{ˆm} with ˆm chosen to minimize a combination of the empirical risk of ˆf_j and a penalty term:

    ˆm = arg min_j ( ˆR(ˆf_j) + p_j ),

where the penalty p_j depends on the sample size n. We are interested in oracle inequalities of the form

    R(f_n) − R* ≤ inf_j ( inf_{f∈F_j} R(f) − R* + c p_j ),

where c is a positive constant. In such an inequality, if the term c p_j decreases with n at the same rate as the estimation error R(ˆf_j) − inf_{f∈F_j} R(f), then the inequality shows that our choice f_n has excess risk that decreases at the same rate as if we had the advice of an oracle who tells us the best complexity class F_j to choose. The utility of an oracle inequality of this kind depends on the accuracy with which the penalty term approximates the estimation error. It is well known that such inequalities follow easily from uniform convergence results, as illustrated by the following theorem. The proof is immediate; see, for example, [1].

Theorem 1. If the data is such that

    sup_j sup_{f∈F_j} ( |R(f) − ˆR(f)| − p_j ) ≤ 0,

then

    R(f_n) − R* ≤ inf_j ( inf_{f∈F_j} R(f) − R* + 2 p_j ).

Notice that the first inequality of the theorem involves random variables. Thus, when it holds with high probability, the second inequality of the theorem also holds with high probability. The theorem shows that it suffices to choose the penalty p_j as a high-probability upper bound on the maximal deviation between empirical risks and risks. But it is sometimes possible to obtain faster rates of convergence (smaller values of p_j as a function of n) than could be implied by the uniform convergence results. Indeed, the following theorem shows that for a nontrivial function class and loss function, the maximal deviation between ˆR(f) and R(f) can approach zero at a rate no faster than n^{−1/2} (see, for example, Theorem 2.3 in [4]).
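As the text notes, the proof of Theorem 1 is immediate; for completeness, here is a sketch of the standard chain of inequalities, written in the notation above and assuming the uniform deviation condition of the theorem:

```latex
% Assuming |R(f) - \hat R(f)| \le p_j for all j and all f \in F_j:
% for any j and any f \in F_j,
\begin{align*}
R(f_n) &\le \hat R(f_n) + p_{\hat m}
  && \text{(deviation bound in } F_{\hat m}\text{)} \\
&\le \hat R(\hat f_j) + p_j
  && \text{(definition of } \hat m\text{)} \\
&\le \hat R(f) + p_j
  && \text{(definition of } \hat f_j\text{)} \\
&\le R(f) + 2 p_j
  && \text{(deviation bound in } F_j\text{)}.
\end{align*}
% Taking the infimum over f \in F_j and over j, and subtracting R^*,
% gives the oracle inequality.
```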

Theorem. There is a constant c such that, for any loss function l : Y [0, 1] and any function class F for which σ = sup f F varly, fx > 0, we have E sup ˆRf Rf cσ. f F n On the other hand, for the quadratic loss ly, ŷ = y ŷ or the exponential loss ly, ŷ = exp yŷ used by AdaBoost, for example, and for F j a convex bounded subset of a finite-dimensional linear class, the empirical minimizer ˆf j can approach inf f Fj Rf at the faster rate of On 1 log n, as the following theorem shows. The theorem is an easy consequence of Theorem 3.3 in [], Lemmas 14 and 15 in [3] and an application of Dudley s [6] entropy integral see, for example the proof of Corollary 3.7 in []. Theorem 3. Suppose l : [0, 1] [0, 1] satisfies the Lipschitz condition ly, ŷ 1 ly, ŷ sup y,ŷ 1 ŷ ŷ 1 ŷ < and the uniform convexity condition 1 ly, ŷ1 + ly, ŷ inf inf l y, ŷ1 + ŷ > 0. ɛ>0 ɛ y, ŷ 1 ŷ ɛ Then there is a constant c for which the following holds. Suppose that F [0, 1] X is convex and a subset of a d-dimensional linear space. Let the distribution of the random pairs X, Y, X 1, Y 1,..., X n, Y n be such that the minimizer f F of Rf = ElY, fx exists. Then for all x > 0, with probability at least 1 e x, sup f F sup f F Rf Rf ˆRf ˆRf ˆRf ˆRf Rf Rf c x + d logn/d c n x + d logn/d Clearly, for the empirical minimizer ˆf = arg min f F ˆRf, the term ˆR ˆf ˆRf in the first inequality of the theorem is non-positive, and so with probability at least 1 e x, R ˆf inf f F Rf + c 4 x + d logn/d n. n,.

Results of this kind were first shown for the quadratic loss and function classes with small covering numbers [8], later generalized to strictly convex normed spaces and classes with small covering numbers [10], and to strictly convex losses and classes with small local Rademacher averages [2, 3]; see also [7]. There are many examples for which these bounds on estimation error decrease at essentially the best possible rate (see, for example, [9]). While these results show that the risk of the empirical minimizer converges to the minimal risk in the function class surprisingly quickly, it is not clear that the results are useful for complexity penalization. Is it possible to perform model selection by choosing the class with the smallest penalized empirical risk, and will this lead to an oracle inequality with the corresponding fast rate? It has been suggested that this is not possible: see, for example, the comments in Section 2.1 of [5]. Indeed, Theorem 2 shows that, although the empirical minimizer's risk approaches the minimal risk roughly as n^{−1} in the example of Theorem 3, we cannot hope to estimate the minimal risk (or the risk of any function in the class, for a non-constant loss) to better accuracy than n^{−1/2}. In this note, we give a simple proof that, for nested hierarchies, the fast rates in inequalities of the kind that appear in Theorem 3 do imply oracle inequalities for complexity penalization methods. Thus, although the empirical risks that are used to compare classes in the hierarchy fluctuate on a scale that can be large compared to the complexity penalties, for nested hierarchies the fluctuations are sufficiently correlated that they do not excessively affect our choice of class.

2 Oracle inequalities

As above, consider the following complexity penalization approach. Decompose a class F of real-valued functions defined on an input space X into a sequence of subsets F_1, F_2, ....
For each j, define the empirical risk minimizer and the true risk minimizer in the class F_j as

    ˆf_j = arg min_{f∈F_j} ˆR(f),    f*_j = arg min_{f∈F_j} R(f).

Assign to each class F_j a complexity penalty p_j. Then choose f_n = ˆf_{ˆm}, where

    ˆm = arg min_j ( ˆR(ˆf_j) + p_j ).

Theorem 4. Suppose that the classes are ordered by inclusion, and choose positive numbers ε_k that are similarly ordered,

    F_1 ⊆ F_2 ⊆ F_3 ⊆ ...,    ε_1 ≤ ε_2 ≤ ε_3 ≤ ...,

and choose the penalty p_k = 7ε_k/2. Then if the data is such that

    sup_k sup_{f∈F_k} ( R(f) − R(f*_k) − 2( ˆR(f) − ˆR(f*_k) ) − ε_k ) ≤ 0,    (1)

    sup_k sup_{f∈F_k} ( ˆR(f) − ˆR(f*_k) − 2( R(f) − R(f*_k) ) − ε_k ) ≤ 0,    (2)

this implies that

    R(f_n) ≤ inf_k ( R(f*_k) + 9ε_k ).

Notice as before that the inequalities (1) and (2) involve random variables. Thus, when they hold with high probability (such as under the conditions of Theorem 3), the final inequality of the theorem also holds with high probability. If we apply (1) to the empirical minimizer ˆf_k in a single class F_k, the difference ˆR(ˆf_k) − ˆR(f*_k) is non-positive, so we obtain

    R(ˆf_k) ≤ inf_{f∈F_k} R(f) + ε_k.

The theorem shows that adding an O(ε_k) penalty to the empirical risks allows us to choose the class k that satisfies the best of these inequalities. The intuition behind condition (1) is that for functions with small regret (that is, functions for which the difference between their risk and the minimal risk over the class is small), the empirical regret should be a more accurate upper bound on the regret, whereas it need not be so accurate for functions with large regret. Condition (2) provides corresponding lower bounds. This second condition is necessary to ensure that the penalized estimate does not choose the class ˆm too large.
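The procedure that Theorem 4 analyzes can be sketched in a few lines. In the sketch below, every specific choice (the nested polynomial classes, the noise level, and setting ε_k proportional to d_k log(n/d_k)/n with the constant taken to be 1) is an illustrative assumption, not something prescribed by the paper; only the selection rule ˆm = arg min_k ( ˆR(ˆf_k) + 7ε_k/2 ) comes from the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with quadratic loss: smooth target plus noise.
n = 500
X = rng.uniform(-1.0, 1.0, size=n)
Y = np.sin(np.pi * X) + 0.2 * rng.normal(size=n)

def erm(degree):
    """Empirical risk minimizer over F_k = polynomials of degree <= k,
    a convex subset of a (k+1)-dimensional linear class."""
    return np.poly1d(np.polyfit(X, Y, degree))

def emp_risk(f):
    """Empirical risk R_hat(f) under the quadratic loss."""
    return np.mean((Y - f(X)) ** 2)

degrees = list(range(1, 9))  # nested hierarchy: F_1 <= F_2 <= ...

def eps(k):
    d = k + 1                       # dimension of F_k
    return d * np.log(n / d) / n    # rate from Theorem 3; constant set to 1
                                    # here purely for illustration

# Penalized selection with p_k = 7 * eps_k / 2, as in Theorem 4.
m_hat = min(degrees, key=lambda k: emp_risk(erm(k)) + 3.5 * eps(k))
f_n = erm(m_hat)
```

Note that eps(k) is increasing in k here, so the ε_k are ordered the same way as the classes, as the theorem requires.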

3 Proof

We set p_j = cε_j for some constant c, and we shall see later that c = 7/2 suffices. First we show that f_n = ˆf_{ˆm} ∈ F_{ˆm} will have risk that is not much worse than that of any ˆf_j from a larger class, F_j ⊇ F_{ˆm}.

Lemma 5. In the event that j ≥ ˆm,

    R(f_n) ≤ R(f*_j) + max{2c + 2, 3} ε_j.

Proof. From condition (1), applied to ˆf_j (and since R(ˆf_j) − R(f*_j) ≥ 0), we see that for any j,

    ˆR(f*_j) ≤ ˆR(ˆf_j) + ε_j / 2.    (3)

In addition, the definition of f_n = ˆf_{ˆm} implies

    ˆR(ˆf_{ˆm}) + p_{ˆm} ≤ ˆR(ˆf_j) + p_j.

Thus, we have

    ˆR(f*_{ˆm}) ≤ ˆR(ˆf_{ˆm}) + ε_{ˆm}/2
               ≤ ˆR(ˆf_j) + p_j − p_{ˆm} + ε_{ˆm}/2
               = ˆR(ˆf_j) + cε_j + (1/2 − c) ε_{ˆm}
               ≤ ˆR(ˆf_j) + cε_j + max{1/2 − c, 0} ε_j
               = ˆR(ˆf_j) + max{c, 1/2} ε_j,

since ε_{ˆm} ≤ ε_j. Applying (1) for the class F_j (note that f*_{ˆm} ∈ F_{ˆm} ⊆ F_j), we have

    R(f*_{ˆm}) ≤ R(f*_j) + 2( ˆR(f*_{ˆm}) − ˆR(f*_j) ) + ε_j
              ≤ R(f*_j) + 2( ˆR(ˆf_j) − ˆR(f*_j) ) + max{2c + 1, 2} ε_j
              ≤ R(f*_j) + max{2c + 1, 2} ε_j,

by definition of ˆf_j. Thus, applying (1) again, this time for the class F_{ˆm}, we have

    R(ˆf_{ˆm}) ≤ R(f*_{ˆm}) + 2( ˆR(ˆf_{ˆm}) − ˆR(f*_{ˆm}) ) + ε_{ˆm}
              ≤ R(f*_{ˆm}) + ε_{ˆm}
              ≤ R(f*_j) + max{2c + 2, 3} ε_j.

The next step is to show that the risk of f_n = ˆf_{ˆm} cannot be much worse than that of any ˆf_j from a smaller class F_j ⊆ F_{ˆm}. First, we show that the risk of the optimal function from F_{ˆm} is not much bigger than the risk of the optimal function from any smaller class.

Lemma 6. In the event that j ≤ ˆm,

    R(f*_{ˆm}) ≤ R(f*_j) + ((3 − 2c)/4) ε_{ˆm} + (c/2) ε_j.

Although it is immediate that R(f*_{ˆm}) ≤ R(f*_j) in this case, we shall see that we need to ensure that, for a suitably large penalty, the gap decreases with ε_{ˆm}. This is where condition (2) is important: it allows us to show that ˆm is not chosen too large.

Proof. We have

    R(f*_j) − R(f*_{ˆm}) ≥ (1/2)( ˆR(f*_j) − ˆR(f*_{ˆm}) − ε_{ˆm} )                      by (2), for F_{ˆm}
                         ≥ (1/2)( ˆR(ˆf_j) − ˆR(f*_{ˆm}) − ε_{ˆm} )                       by definition of ˆf_j
                         ≥ (1/2)( ˆR(ˆf_{ˆm}) + p_{ˆm} − p_j − ˆR(f*_{ˆm}) − ε_{ˆm} )     by definition of ˆm
                         ≥ (1/2)( −ε_{ˆm}/2 + p_{ˆm} − p_j − ε_{ˆm} )                     by (3)
                         = ((2c − 3)/4) ε_{ˆm} − (c/2) ε_j.

The result follows.

Next we show that this implies the risk of f_n = ˆf_{ˆm} is not much larger than the risk of any ˆf_j from a smaller class.

Lemma 7. In the event that j ≤ ˆm,

    R(f_n) ≤ R(f*_j) + ((7 − 2c)/4) ε_{ˆm} + (c/2) ε_j.

Proof. From (1) and the previous lemma,

    R(f_n) = R(ˆf_{ˆm}) ≤ R(f*_{ˆm}) + 2( ˆR(ˆf_{ˆm}) − ˆR(f*_{ˆm}) ) + ε_{ˆm}
                        ≤ R(f*_{ˆm}) + ε_{ˆm}
                        ≤ R(f*_j) + ((7 − 2c)/4) ε_{ˆm} + (c/2) ε_j.

Choosing c = 7/2 gives the result.

Acknowledgements

We gratefully acknowledge the support of the NSF under award DMS-0434383. Thanks also to three anonymous reviewers for useful comments that improved the presentation.

References

[1] P. L. Bartlett, S. Boucheron, and G. Lugosi. Model selection and error estimation. Machine Learning, 48:85-113, 2002.

[2] Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005.

[3] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.

[4] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311-334, 2006.

[5] G. Blanchard, G. Lugosi, and N. Vayatis. On the rate of convergence of regularized boosting classifiers. Journal of Machine Learning Research, 4:861-894, 2003.

[6] R. M. Dudley. Uniform Central Limit Theorems. Cambridge University Press, 1999.

[7] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Technical report, 2003.

[8] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6):2118-2132, 1996.

[9] P. Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX:245-303, 2000.

[10] Shahar Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977-1991, 2002.