Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013



Basics / Informal Introduction — Informal Description of Supervised Learning

- $X$: space of input samples.
- $Y$: space of labels, usually $Y \subset \mathbb{R}$.
- Already observed samples: $D = \big((x_1, y_1), \dots, (x_n, y_n)\big) \in (X \times Y)^n$.

Goal: With the help of $D$, find a function $f_D : X \to \mathbb{R}$ such that $f_D(x)$ is a good prediction of the label $y$ for new, unseen $x$.

Learning method: assigns to every training set $D$ a predictor $f_D : X \to \mathbb{R}$.

Basics / Informal Introduction — Illustration: Binary Classification

- Problem: The labels are $\pm 1$.
- Goal: Make few mistakes on future data.
- Example: (figure)

Basics / Informal Introduction — Illustration: Regression

- Problem: The labels are $\mathbb{R}$-valued.
- Goal: Estimate the label $y$ for new data $x$ as accurately as possible.
- Example: (two regression plots; axis ticks omitted)

Basics / Formalized Description — Data Generation

Assumptions:
- $P$ is an unknown probability measure on $X \times Y$.
- $D = \big((x_1, y_1), \dots, (x_n, y_n)\big) \in (X \times Y)^n$ is sampled from $P^n$.
- Future samples $(x, y)$ will also be sampled from $P$.

Consequences:
- The label $y$ for a given $x$ is, in general, not deterministic.
- The past and the future look the same.
- We seek algorithms that work well for many (or even all) $P$.

Basics / Formalized Description — Performance Evaluation I

Loss Function: $L : X \times Y \times \mathbb{R} \to [0, \infty)$ measures the cost or loss $L(x, y, t)$ of predicting the label $y$ by the value $t$ at the point $x$.

Interpretation:
- As the name suggests, we prefer predictions with small loss.
- $L$ is chosen by us.
- Since future $(x, y)$ are random, it makes sense to consider the average loss of a predictor.

Basics / Formalized Description — Performance Evaluation II

Risk: The risk of a predictor $f : X \to \mathbb{R}$ is the average loss
  $\mathcal{R}_{L,P}(f) := \int_{X \times Y} L\big(x, y, f(x)\big) \, dP(x, y).$
For $D = \big((x_1, y_1), \dots, (x_n, y_n)\big)$ the empirical risk is
  $\mathcal{R}_{L,D}(f) := \frac{1}{n} \sum_{i=1}^n L\big(x_i, y_i, f(x_i)\big).$

Interpretation: By the law of large numbers, we have $P$-almost surely
  $\mathcal{R}_{L,P}(f) = \lim_{|D| \to \infty} \mathcal{R}_{L,D}(f).$
Thus $\mathcal{R}_{L,P}(f)$ is the long-term average future loss when using $f$.
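The empirical risk is just the sample average of the losses. A minimal sketch in plain Python that mirrors the definition; the loss, predictor, and data below are invented for illustration:

```python
# Sketch: empirical risk R_{L,D}(f) as the average loss over the sample D.
# square_loss, f, and D are illustrative names, not from the lecture.

def square_loss(x, y, t):
    """Least squares loss L(x, y, t) = (y - t)^2; it ignores x."""
    return (y - t) ** 2

def empirical_risk(loss, f, D):
    """R_{L,D}(f) = (1/n) * sum_i L(x_i, y_i, f(x_i))."""
    return sum(loss(x, y, f(x)) for x, y in D) / len(D)

D = [(0.0, 0.1), (0.5, 0.4), (1.0, 1.2)]      # toy sample from (X x Y)^n
print(empirical_risk(square_loss, lambda x: x, D))
```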

Basics / Formalized Description — Performance Evaluation III

Bayes Risk and Bayes Predictor:
- The Bayes risk is the smallest possible risk
  $\mathcal{R}^*_{L,P} := \inf\{\, \mathcal{R}_{L,P}(f) \mid f : X \to \mathbb{R} \text{ (measurable)} \,\}.$
- A Bayes predictor is any function $f^*_{L,P} : X \to \mathbb{R}$ that satisfies $\mathcal{R}_{L,P}(f^*_{L,P}) = \mathcal{R}^*_{L,P}$.

Interpretation:
- We will never find a predictor whose risk is smaller than $\mathcal{R}^*_{L,P}$.
- We seek a predictor $f : X \to \mathbb{R}$ whose excess risk $\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}$ is close to $0$.

Basics / Formalized Description — Performance Evaluation IV

Best Naïve Risk: The best naïve risk is the smallest risk one obtains by ignoring $X$:
  $\inf\{\, \mathcal{R}_{L,P}(c \mathbf{1}_X) \mid c \in \mathbb{R} \,\}.$

Remarks:
- The best naïve risk (and its minimizer) is usually easy to estimate.
- Using fancy learning algorithms only makes sense if $\mathcal{R}^*_{L,P}$ is smaller than the best naïve risk.

Equality:
- Typically: the Bayes risk equals the best naïve risk iff there is a constant Bayes predictor.
- If $P = P_X \otimes P_Y$, then the two risks coincide, but the converse is false.

Basics / Formalized Description — Learning Goals I

Binary Classification: $Y = \{-1, 1\}$
- $L(y, t) := \mathbf{1}_{(-\infty, 0]}(y \,\mathrm{sign}\, t)$ penalizes predictions $t$ with $\mathrm{sign}\, t \ne y$.
- $\mathcal{R}_{L,P}(f) = P\big(\{(x, y) : \mathrm{sign}\, f(x) \ne y\}\big).$

Optimal Risk: Let $\eta(x) := P(Y = 1 \mid x)$ be the probability of a positive label at $x \in X$.
- Bayes risk: $\mathcal{R}^*_{L,P} = \mathbb{E}_{P_X} \min\{\eta, 1 - \eta\}$.
- $f$ is a Bayes predictor iff $(2\eta - 1)\,\mathrm{sign}\, f \ge 0$.

Naïve Risk:
- Best naïve risk: $\min\{P(Y = 1),\, 1 - P(Y = 1)\}$.
- It equals the Bayes risk iff $\eta \ge 1/2$ or $\eta \le 1/2$.
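For a toy distribution with finitely many inputs, the Bayes risk $\mathbb{E}_{P_X}\min\{\eta, 1-\eta\}$ and the best naïve risk $\min\{P(Y=1), 1-P(Y=1)\}$ can be evaluated directly. A small sketch; the probability values are made up:

```python
# Toy check of the classification formulas on a finite X.
# p_x[i] = P_X(x_i), eta[i] = P(Y = 1 | x_i); the numbers are invented.
p_x = [0.2, 0.3, 0.5]
eta = [0.9, 0.4, 0.1]

bayes_risk = sum(p * min(e, 1 - e) for p, e in zip(p_x, eta))
p_y1 = sum(p * e for p, e in zip(p_x, eta))      # marginal P(Y = 1)
naive_risk = min(p_y1, 1 - p_y1)                 # best constant prediction

print(bayes_risk, naive_risk)   # the Bayes risk is never larger than the naive risk
```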

Basics / Formalized Description — Learning Goals II

Least Squares Regression: $Y \subset \mathbb{R}$
- $L(y, t) := (y - t)^2$
- Conditional expectation: $\mu_P(x) := \mathbb{E}_P(Y \mid x)$.
- Conditional variance: $\sigma_P^2(x) := \mathbb{E}_P(Y^2 \mid x) - \mu_P^2(x)$.

Optimal Risk:
- $\mu_P$ is the only Bayes predictor and $\mathcal{R}^*_{L,P} = \mathbb{E}_{P_X} \sigma_P^2$.
- Excess risk: $\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} = \|f - \mu_P\|^2_{L_2(P_X)}$.
- Least squares regression aims at estimating the conditional mean.

Naïve Risk: The best naïve risk is $\mathrm{Var}_P\, Y$.

Basics / Formalized Description — Learning Goals III

Absolute Value Regression: $Y \subset \mathbb{R}$
- $L(y, t) := |y - t|$
- Conditional medians: $m_P(x) := \mathrm{median}_P(Y \mid x)$.

Optimal Risk:
- The medians $m_P$ are the only Bayes predictors.
- Excess risk: $\mathcal{R}_{L,P}(f_n) - \mathcal{R}^*_{L,P} \to 0$ implies $f_n \to m_P$ in probability $P_X$.
- Absolute value regression aims at estimating the conditional median.

Naïve Risk: The best naïve risk is achieved by the median of $P_Y$.

Basics / What is learning theory about? — Questions in Statistical Learning I

Asymptotic Learning: A learning method is called universally consistent if
  $\lim_{n \to \infty} \mathcal{R}_{L,P}(f_D) = \mathcal{R}^*_{L,P}$ in probability  (1)
for every probability measure $P$ on $X \times Y$.

Good News: Many learning methods are universally consistent. First result: Stone (1977), Annals of Statistics.

Basics / What is learning theory about? — Questions in Statistical Learning II

Learning Rates: A learning method learns for a distribution $P$ with rate $a_n \to 0$ if
  $\mathbb{E}_{D \sim P^n} \mathcal{R}_{L,P}(f_D) \le \mathcal{R}^*_{L,P} + C_P a_n, \quad n \ge 1.$
Similarly one defines learning rates in probability.

Bad News (Devroye, 1982, IEEE TPAMI): If $|X| = \infty$, $|Y| \ge 2$, and $L$ is non-trivial, then it is impossible to obtain a learning rate that is independent of $P$.

Remark: If $|X| < \infty$, then it is usually easy to obtain a uniform learning rate for which $C_P$ depends on $|X|$.

Basics / What is learning theory about? — Questions in Statistical Learning III

Relative Learning Rates: Let $\mathcal{P}$ be a set of distributions on $X \times Y$. A learning method learns $\mathcal{P}$ with rate $a_n \to 0$ if, for all $P \in \mathcal{P}$,
  $\mathbb{E}_{D \sim P^n} \mathcal{R}_{L,P}(f_D) \le \mathcal{R}^*_{L,P} + C_P a_n, \quad n \ge 1.$
The rate $(a_n)$ is minimax optimal if, in addition, there is no learning method that learns $\mathcal{P}$ with a rate $(b_n)$ such that $b_n / a_n \to 0$.

Tasks:
- Identify interesting ("realistic") classes $\mathcal{P}$ with good optimal rates.
- Find learning algorithms that achieve these rates.

Basics / What is learning theory about? — Example of Optimal Rates

Classical Least Squares Example:
- $X = [0, 1]^d$, $Y = [-1, 1]$, $L$ is the least squares loss.
- $W^m$: Sobolev space on $X$ with order of smoothness $m > d/2$.
- $\mathcal{P}$: the set of $P$ such that $f^*_{L,P} \in W^m$ with norm bounded by $K$.
- Optimal rate: $n^{-\frac{2m}{2m+d}}$.

Remarks:
- The smoother the target $\mu_P = f^*_{L,P}$ is, the better it can be learned.
- The larger the input dimension is, the harder learning becomes.
- Various learning algorithms achieve the optimal rate.
- They usually require us to know $m$ in advance.

Basics / What is learning theory about? — Questions in Statistical Learning IV

Assumptions for Adaptivity:
- Usually one has a family $(\mathcal{P}_\theta)_{\theta \in \Theta}$ of large sets $\mathcal{P}_\theta$ of distributions.
- Each set $\mathcal{P}_\theta$ has its own optimal rate.
- We don't know whether $P \in \mathcal{P}_\theta$ for some $\theta$, but we hope so.
- If $P \in \mathcal{P}_\theta$, we don't know $\theta$ and we have no means to estimate it.

Task: We seek learning algorithms that
- are universally consistent,
- learn all $\mathcal{P}_\theta$ with the optimal rate without knowing $\theta$.
Such learning algorithms are adaptive to the unknown $\theta$.

Basics / What is learning theory about? — Questions in Statistical Learning V

Finite Sample Estimates: Assume that our algorithm has some hyper-parameters $\lambda \in \Lambda$. For each $P$, $\lambda$, $\delta \in (0, 1)$, and $n \ge 1$ we seek an $\varepsilon(P, \lambda, \delta, n)$ such that
  $\mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} \le \varepsilon(P, \lambda, \delta, n)$
with probability $P^n$ not smaller than $1 - \delta$.

Remarks:
- If there exists a sequence $(\lambda_n)$ with $\lim_{n \to \infty} \varepsilon(P, \lambda_n, \delta, n) = 0$ for all $P$ and $\delta$, then the algorithm can be made universally consistent.
- We automatically obtain learning rates for such sequences.
- If $|X| = \infty$ and ..., then such $\varepsilon(P, \lambda, \delta, n)$ must depend on $P$.

Basics / What is learning theory about? — Questions in Statistical Learning VI

Generalization Error Bounds:
- Goal: Estimate the risk $\mathcal{R}_{L,P}(f_{D,\lambda})$ by the performance of $f_{D,\lambda}$ on $D$.
- Find $\varepsilon(\lambda, \delta, n)$ such that, with probability $P^n$ not smaller than $1 - \delta$,
  $\mathcal{R}_{L,P}(f_{D,\lambda}) \le \mathcal{R}_{L,D}(f_{D,\lambda}) + \varepsilon(\lambda, \delta, n).$

Remarks:
- $\varepsilon(\lambda, \delta, n)$ must not depend on $P$, since we do not know $P$.
- $\varepsilon(\lambda, \delta, n)$ can be used to derive parameter selection strategies such as structural risk minimization.
- Alternative: use a second data set $D'$ and $\mathcal{R}_{L,D'}(f_{D,\lambda})$ as an estimate.

Basics / What is learning theory about? — Summary

A good learning algorithm:
- Is universally consistent.
- Is adaptive for realistic classes of distributions.
- Can be modified for new problems that have a different loss.
- Has a good record on real-world problems.
- Runs efficiently on a computer.
- ...

Definition and Examples — Empirical Risk Minimization

Definition: Let $\mathcal{F}$ be a set of functions $X \to \mathbb{R}$. A learning method whose predictors satisfy $f_D \in \mathcal{F}$ and
  $\mathcal{R}_{L,D}(f_D) = \min_{f \in \mathcal{F}} \mathcal{R}_{L,D}(f)$
is called empirical risk minimization (ERM).

Remarks:
- Not every $\mathcal{F}$ makes ERM possible.
- ERM is, in general, not unique.
- ERM may not be computationally feasible.
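Over a finite candidate set, ERM is literally a minimum of empirical risks. A self-contained sketch; the loss and the candidate predictors are invented for illustration:

```python
# ERM over a finite set F: pick any minimizer of the empirical risk R_{L,D}(f).
# Illustrative sketch; loss, candidates, and data are made up.
def empirical_risk(loss, f, D):
    return sum(loss(x, y, f(x)) for x, y in D) / len(D)

D = [(0.0, 0.1), (0.5, 0.4), (1.0, 1.2)]
F = [lambda x: 0.0, lambda x: x, lambda x: 0.5 + 0.0 * x]
loss = lambda x, y, t: (y - t) ** 2

f_D = min(F, key=lambda f: empirical_risk(loss, f, D))   # an ERM predictor
print(empirical_risk(loss, f_D, D))
```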

Definition and Examples — Empirical Risk Minimization: Danger of Underfitting

ERM can never produce predictors with risk better than
  $\mathcal{R}^*_{L,P,\mathcal{F}} := \inf\{\, \mathcal{R}_{L,P}(f) : f \in \mathcal{F} \,\}.$
Example: $L$ least squares, $X = [0, 1]$, $P_X$ the uniform distribution, $f^*_{L,P}$ not linear, and $\mathcal{F}$ the set of linear functions. Then $\mathcal{R}^*_{L,P,\mathcal{F}} > \mathcal{R}^*_{L,P}$, and thus ERM cannot be consistent.

Definition and Examples — Empirical Risk Minimization: Danger of Overfitting

If $\mathcal{F}$ is too large, ERM may overfit.
Example: $L$ least squares, $X = [0, 1]$, $P_X$ the uniform distribution, $f^*_{L,P} = \mathbf{1}_X$, $\mathcal{R}^*_{L,P} = 0$, and $\mathcal{F}$ the set of all functions. Then
  $f_D(x) = y_i$ if $x = x_i$ for some $i$, and $f_D(x) = 0$ otherwise,
satisfies $\mathcal{R}_{L,D}(f_D) = 0$ but $\mathcal{R}_{L,P}(f_D) = 1$.

Definition and Examples — Summary of Last Session

- The risk of a predictor $f : X \to \mathbb{R}$ is $\mathcal{R}_{L,P}(f) := \int_{X \times Y} L\big(x, y, f(x)\big)\, dP(x, y)$.
- The Bayes risk $\mathcal{R}^*_{L,P}$ is the smallest possible risk. A Bayes predictor $f^*_{L,P}$ achieves this minimal risk.
- Learning means $\mathcal{R}_{L,P}(f_D) \to \mathcal{R}^*_{L,P}$.
- Asymptotically, this is possible, but no uniform rates are possible.
- We seek adaptive learning algorithms. Ideally, these are fully automated.

Definition and Examples — Regularized ERM

Definition: Let $\mathcal{F}$ be a non-empty set of functions $X \to \mathbb{R}$ and $\Upsilon : \mathcal{F} \to [0, \infty)$ be a map. A learning method whose predictors satisfy $f_D \in \mathcal{F}$ and
  $\Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) = \inf_{f \in \mathcal{F}} \big( \Upsilon(f) + \mathcal{R}_{L,D}(f) \big)$
is called regularized empirical risk minimization (RERM).

Remarks:
- $\Upsilon = 0$ yields ERM.
- All remarks about ERM apply to RERM, too.

Definition and Examples — Examples of Regularized ERM I

General Dictionary Methods: For bounded $h_1, \dots, h_m : X \to \mathbb{R}$ consider
  $\mathcal{F} := \{\, f_c := \textstyle\sum_{i=1}^m c_i h_i : (c_1, \dots, c_m) \in \mathbb{R}^m \,\}.$

Examples of Regularizers:
- $\ell_1$-regularization: $\Upsilon(f_c) = \lambda \|c\|_1 = \lambda \sum_{i=1}^m |c_i|$,
- $\ell_2$-regularization: $\Upsilon(f_c) = \lambda \|c\|_2^2 = \lambda \sum_{i=1}^m |c_i|^2$,
- $\ell_\infty$-regularization: $\Upsilon(f_c) = \lambda \|c\|_\infty = \lambda \max_i |c_i|$,
or, in the case of dependent $h_i$, we take the infimum over all representations.
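For the least squares loss with the squared $\ell_2$ regularizer, the dictionary RERM problem has a closed-form (ridge-regression) solution. A numpy sketch under that assumption; the dictionary and the data below are invented:

```python
# RERM for the least squares loss with an l2^2 regularizer over a dictionary
# h_1, ..., h_m: minimize  lam * ||c||_2^2 + (1/n) * sum_j (y_j - f_c(x_j))^2.
# This specific combination has the ridge closed form; data are invented.
import numpy as np

def rerm_l2(H, y, lam):
    """H[j, i] = h_i(x_j). Returns the coefficient vector c of f_D."""
    n, m = H.shape
    return np.linalg.solve(H.T @ H / n + lam * np.eye(m), H.T @ y / n)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)
H = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)   # dictionary h_1..h_4
c = rerm_l2(H, y, lam=1e-2)
```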

Definition and Examples — Examples of Regularized ERM II

Further Examples:
- Support Vector Machines
- Regularized Decision Trees
- ...

Norm Regularizers — Regularized ERM: Norm Regularizers

Conventions: Whenever we consider regularizers, they will be of the form
  $\Upsilon(f) = \lambda \|f\|_E^\alpha, \quad f \in \mathcal{F},$
where $\alpha \ge 1$ and $E := \mathcal{F}$ is a vector space of functions $X \to \mathbb{R}$. In this case, we additionally assume that
  $\|f\|_\infty \le \|f\|_E, \quad f \in E.$
In the following, we assume that the optimization problem also has a solution $f_P$ when we replace $D$ by $P$:
  $f_P \in \arg\min_{f \in \mathcal{F}} \big( \Upsilon(f) + \mathcal{R}_{L,P}(f) \big).$

Towards the Statistical Analysis — The Classical Argument I

Ansatz: Assume that we have a data set $D$ and an $\varepsilon > 0$ such that
  $\sup_{f \in \mathcal{F}} \big| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \big| \le \varepsilon.$
Then we obtain
  $\Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) = \Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,D}(f_D) + \mathcal{R}_{L,D}(f_D)$
  $\le \Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) + \varepsilon$
  $\le \Upsilon(f_P) + \mathcal{R}_{L,D}(f_P) + \varepsilon$
  $\le \Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) + 2\varepsilon.$

Towards the Statistical Analysis — The Classical Argument II

Discussion: The uniform bound
  $\sup_{f \in \mathcal{F}} \big| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \big| \le \varepsilon$  (2)
led to the inequality
  $\Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) - \mathcal{R}^*_{L,P} \le \Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) - \mathcal{R}^*_{L,P} + 2\varepsilon.$
Since $\Upsilon(f_D) \ge 0$, all that remains to be done is to estimate
- the probability of (2),
- the regularization error $\Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) - \mathcal{R}^*_{L,P}$.

Towards the Statistical Analysis — The Classical Argument III

Union Bound: Assume that $\mathcal{F}$ is finite. The union bound gives
  $P^n\big(D : \sup_{f \in \mathcal{F}} |\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| \le \varepsilon\big)$
  $= 1 - P^n\big(D : \sup_{f \in \mathcal{F}} |\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| > \varepsilon\big)$
  $\ge 1 - \sum_{f \in \mathcal{F}} P^n\big(D : |\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| > \varepsilon\big).$

Consequences:
- It suffices to bound $P^n\big(D : |\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| > \varepsilon\big)$ for all $f$.
- No assumptions on $P$ have been made so far. In particular, so far the data $D$ does not need to be i.i.d., nor even random.

Towards the Statistical Analysis — The Classical Argument IV

Hoeffding's Inequality: Let $(\Omega, \mathcal{A}, Q)$ be a probability space and $\xi_1, \dots, \xi_n : \Omega \to [a, b]$ be independent random variables. Then, for all $\tau > 0$ and $n \ge 1$, we have
  $Q\Big( \Big|\frac{1}{n} \sum_{i=1}^n \big(\xi_i - \mathbb{E}_Q \xi_i\big)\Big| \ge (b - a)\sqrt{\frac{\tau}{2n}} \Big) \le 2 e^{-\tau}.$

Application: Consider $\Omega := (X \times Y)^n$ and $Q := P^n$. For $\xi_i(D) := L(x_i, y_i, f(x_i))$ we have $a = 0$ and
  $\frac{1}{n} \sum_{i=1}^n \big(\xi_i - \mathbb{E}_{P^n} \xi_i\big) = \mathcal{R}_{L,D}(f) - \mathcal{R}_{L,P}(f).$
Assuming $L(x, y, f(x)) \le B$ makes the application of Hoeffding's inequality possible.

Towards the Statistical Analysis — The Classical Argument V

Theorem for ERM: Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss, $\mathcal{F}$ a non-empty finite set of functions $f : X \to \mathbb{R}$, and $B > 0$ a constant such that
  $L(x, y, f(x)) \le B, \quad (x, y) \in X \times Y,\ f \in \mathcal{F}.$
Then we have
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|\mathcal{F}|)}{n}} \Big) \ge 1 - e^{-\tau}.$

Remarks:
- The theorem does not specify the approximation error $\mathcal{R}^*_{L,P,\mathcal{F}} - \mathcal{R}^*_{L,P}$.
- If $|\mathcal{F}| = \infty$, the bound becomes meaningless.
- What happens if we consider RERM with a non-trivial regularizer?
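The deviation term in this theorem is easy to evaluate numerically. A small sketch that computes $B\sqrt{(2\tau + 2\ln(2|\mathcal{F}|))/n}$ for a few sample sizes; the values of $B$, $\tau$, and $|\mathcal{F}|$ are invented:

```python
# Evaluate the finite-F ERM bound  B * sqrt((2*tau + 2*ln(2*|F|)) / n).
# B, tau, and |F| below are invented for illustration.
import math

def erm_bound(B, tau, card_F, n):
    return B * math.sqrt((2 * tau + 2 * math.log(2 * card_F)) / n)

B, tau, card_F = 1.0, 3.0, 1000        # confidence 1 - e^{-tau} ~ 0.95
for n in (100, 1000, 10000):
    print(n, erm_bound(B, tau, card_F, n))   # decays like n^{-1/2}
```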

Towards the Statistical Analysis — ERM for Infinite $\mathcal{F}$: The General Approach

So far: The union bound was the trick that turned an estimate of $|\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| \le \varepsilon$ for a single $f$ into one for all $f \in \mathcal{F}$. For infinite $\mathcal{F}$, this does not work!

General Approach: Given some $\delta > 0$, find a finite set $N_\delta$ of functions such that
  $\sup_{f \in \mathcal{F}} \big|\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)\big| \le \sup_{f \in N_\delta} \big|\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)\big| + \delta.$
Then apply the union bound to $N_\delta$. The rest remains unchanged.

The old inequality:
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|\mathcal{F}|)}{n}} \Big) \ge 1 - e^{-\tau}.$
The new inequality:
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|N_\delta|)}{n}} + \delta \Big) \ge 1 - e^{-\tau}.$

Tasks:
- For each $\delta > 0$, find a small set $N_\delta$.
- Optimize the right-hand side with respect to $\delta$.

Towards the Statistical Analysis — Covering Numbers

Definition: Let $(M, d)$ be a metric space, $A \subset M$, and $\varepsilon > 0$. The $\varepsilon$-covering number of $A$ is defined by
  $\mathcal{N}(A, d, \varepsilon) := \inf\Big\{ n \ge 1 : \exists\, x_1, \dots, x_n \in M \text{ such that } A \subset \bigcup_{i=1}^n B_d(x_i, \varepsilon) \Big\},$
where $\inf \emptyset := \infty$, and $B_d(x_i, \varepsilon)$ is the ball with radius $\varepsilon$ and center $x_i$.
- $x_1, \dots, x_n$ is called an $\varepsilon$-net.
- $\mathcal{N}(A, d, \varepsilon)$ is the size of the smallest $\varepsilon$-net.
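A greedy construction produces an $\varepsilon$-net of any finite point set and hence an upper bound on its covering number. A purely illustrative sketch on a grid in $[0, 1]$:

```python
# Greedy epsilon-net for a finite subset A of a metric space: every point that
# is not within eps of an existing center becomes a new center. The centers
# form an eps-net, so their number upper-bounds N(A, d, eps). Illustrative only.
def greedy_net(A, dist, eps):
    centers = []
    for a in A:
        if all(dist(a, c) > eps for c in centers):
            centers.append(a)
    return centers

A = [i / 100 for i in range(101)]                         # grid in [0, 1]
net = greedy_net(A, lambda s, t: abs(s - t), eps=0.1)
print(len(net))                                           # on the order of 1/eps centers
```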

Towards the Statistical Analysis — Covering Numbers II

- Every bounded $A \subset \mathbb{R}^d$ satisfies
  $\mathcal{N}(A, \|\cdot\|, \varepsilon) \le c\, \varepsilon^{-d}, \quad \varepsilon > 0,$
  where $c > 0$ is a constant and the choice of norm only influences $c$.
- For sets $\mathcal{F}$ of functions $f : X \to \mathbb{R}$, the behavior of $\mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$ may be very different! The literature is full of estimates of $\ln \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$. A typical estimate looks like
  $\ln \mathcal{N}(B_E, \|\cdot\|_\infty, \varepsilon) \le c\, \varepsilon^{-2p}, \quad \varepsilon > 0.$
  Here $p$ may depend on the input dimension and the smoothness of the functions in $E$.

Towards the Statistical Analysis — ERM with Infinite Sets

Theorem: Let $L$ be Lipschitz continuous in its third argument with Lipschitz constant $1$. Assume that $\|L \circ f\|_\infty \le B$ for all $f \in \mathcal{F}$. Let $N_\varepsilon$ be a minimal $\varepsilon$-net of $\mathcal{F}$, i.e. $|N_\varepsilon| = \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$. Then we have
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|N_\varepsilon|)}{n}} + 2\varepsilon \Big) \ge 1 - e^{-\tau}.$

Towards the Statistical Analysis — Using Covering Numbers VII

Example: Let $L$ satisfy the assumptions of the previous theorem, and let $\mathcal{F}$ be a set of functions with $\ln \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le c\, \varepsilon^{-2p}$. Then we have
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 4c\,\varepsilon^{-2p}}{n}} + 2\varepsilon \Big) \ge 1 - e^{-\tau}.$
Optimizing with respect to $\varepsilon$ gives a constant $K_p$ such that
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + K_p\, c^{\frac{1}{2+2p}} B \sqrt{\tau}\, n^{-\frac{1}{2+2p}} \Big) \ge 1 - e^{-\tau}.$
For ERM over finite $\mathcal{F}$, we had $p = 0$.
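The optimization over $\varepsilon$ can also be done numerically instead of in closed form. A sketch using a simple grid search for invented constants $B$, $\tau$, $c$, $p$, $n$:

```python
# Numerically optimize eps in  B * sqrt((2*tau + 4*c*eps**(-2*p)) / n) + 2*eps,
# the right-hand side of the covering-number bound; all constants are invented.
import math

def rhs(eps, B=1.0, tau=3.0, c=1.0, p=0.5, n=10_000):
    return B * math.sqrt((2 * tau + 4 * c * eps ** (-2 * p)) / n) + 2 * eps

grid = [10 ** (-k / 10) for k in range(1, 60)]     # eps from ~0.8 down to ~1e-6
best_eps = min(grid, key=rhs)
print(best_eps, rhs(best_eps))
```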

Towards the Statistical Analysis — Standard Analysis for RERM

Difficulties when analyzing RERM:
- We are interested in RERMs where $\mathcal{F}$ is a vector space $E$.
- Vector spaces $E$ are never compact, thus $\ln \mathcal{N}(E, \|\cdot\|_\infty, \varepsilon) = \infty$.
- It seems that our approach does not work in this case.

Solution: RERM actually solves its optimization problem
  $\Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) = \inf_{f \in E} \big( \Upsilon(f) + \mathcal{R}_{L,D}(f) \big)$
over a set which is significantly smaller than $E$.

Towards the Statistical Analysis — Norm Bound for RERM

Lemma: Assume that $L(x, y, 0) \le 1$. Then, for any RERM predictor $f_{D,\lambda} \in E$, we have
  $\|f_{D,\lambda}\|_E \le \lambda^{-1/\alpha}.$
Consequence: The RERM optimization problem is actually solved over the ball with radius $\lambda^{-1/\alpha}$.

Towards the Statistical Analysis — Norm Bound for RERM II

Proof: Our assumptions $L(x, y, t) \ge 0$ and $L(x, y, 0) \le 1$ yield
  $\lambda \|f_{D,\lambda}\|_E^\alpha \le \lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,D}(f_{D,\lambda}) = \inf_{f \in E} \big( \lambda \|f\|_E^\alpha + \mathcal{R}_{L,D}(f) \big) \le \lambda \|0\|_E^\alpha + \mathcal{R}_{L,D}(0) \le 1.$

Statistical Analysis — An Oracle Inequality

Theorem (Example): Assume that
- $L$ is Lipschitz continuous with $|L|_1 \le 1$ and $L(x, y, 0) \le 1$,
- $E$ is a vector space with norm $\|\cdot\|_E$ satisfying $\|\cdot\|_\infty \le \|\cdot\|_E$,
- $\Upsilon(f) = \lambda \|f\|_E^\alpha$,
- $\ln \mathcal{N}(B_E, \|\cdot\|_\infty, \varepsilon) \le c\, \varepsilon^{-2p}$.
Then, for all $n \ge 1$, $\lambda \in (0, 1]$, $\tau \ge 1$, we have
  $\lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) < \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) + K_p\, c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-1/\alpha} n^{-\frac{1}{2+2p}}$
with probability $P^n$ not less than $1 - e^{-\tau}$.

Statistical Analysis — Consequences of the Oracle Inequality

Subtracting $\mathcal{R}^*_{L,P}$ on both sides and splitting the right-hand side gives
  $\lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} < \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P,E} + \mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} + K_p\, c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-1/\alpha} n^{-\frac{1}{2+2p}}.$
- Regularization error: $A(\lambda) := \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P,E}$.
- Approximation error: $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P}$.
- Statistical error: $K_p\, c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-1/\alpha} n^{-\frac{1}{2+2p}}$.

Statistical Analysis — Bounding the Remaining Errors

Lemma 1: If $E$ is dense in $L_1(P_X)$, then $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} = 0$.

Lemma 2: We have $\lim_{\lambda \to 0} A(\lambda) = 0$, and if there is an $f \in E$ with $\mathcal{R}_{L,P}(f) = \mathcal{R}^*_{L,P,E}$, then $A(\lambda) \le \lambda \|f\|_E^\alpha$.

Remarks:
- A linear behaviour of $A$ often requires such an $f$.
- A typical behaviour is, for some $\beta \in (0, 1]$, of the form $A(\lambda) \le c\,\lambda^\beta$.
- A sufficient condition for such behaviour can be described with the help of so-called interpolation spaces of the real method.

Statistical Analysis — Main Results for RERM

Oracle inequality (assuming $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} = 0$):
  $\lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} < A(\lambda) + K_p\, c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-1/\alpha} n^{-\frac{1}{2+2p}}.$

Consequences:
- The method is consistent if $\lambda_n \to 0$ with $\lambda_n n^{\frac{\alpha}{2+2p}} \to \infty$.
- If $A(\lambda) \le c\,\lambda^\beta$, then $\lambda_n \sim n^{-\frac{\alpha}{(\alpha\beta+1)(2+2p)}}$ achieves the best rate $n^{-\frac{\alpha\beta}{(\alpha\beta+1)(2+2p)}}$.
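The resulting exponents are explicit functions of $\alpha$, $\beta$, $p$. A sketch that tabulates the rate exponent $\frac{\alpha\beta}{(\alpha\beta+1)(2+2p)}$ and the corresponding choice of $\lambda_n$ for a few invented parameter values:

```python
# Rate exponents from the RERM oracle inequality: with A(lambda) <= c*lambda^beta,
# lambda_n ~ n^{-alpha/((alpha*beta+1)*(2+2p))} yields the excess-risk rate
# n^{-alpha*beta/((alpha*beta+1)*(2+2p))}. Parameter values below are invented.
def rerm_rate(alpha, beta, p):
    denom = (alpha * beta + 1) * (2 + 2 * p)
    return alpha / denom, alpha * beta / denom   # (lambda exponent, risk exponent)

for alpha, beta, p in [(2, 1.0, 0.5), (2, 0.5, 0.5), (1, 1.0, 1.0)]:
    lam_exp, risk_exp = rerm_rate(alpha, beta, p)
    print(f"alpha={alpha}, beta={beta}, p={p}: "
          f"lambda_n ~ n^-{lam_exp:.3f}, rate n^-{risk_exp:.3f}")
```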

Statistical Analysis — Main Results for RERM II

Discussion:
- The assumptions on $E$ needed for consistency are minimal.
- More sophisticated algorithms can be devised from the oracle inequality. For example, $E$ could change with the sample size, too.
- To achieve the best learning rates, we need to know $\beta$.

Statistical Analysis — Learning Rates: Hyper-Parameters III

Training-Validation Approach:
- Assume that $L$ is clippable.
- Split the data into equally sized parts $D_1$ and $D_2$. We write $m := n/2$.
- Fix a finite set $\Lambda \subset (0, 1]$ of candidate values for $\lambda$.
- For each $\lambda \in \Lambda$ compute $f_{D_1,\lambda}$.
- Pick the $\lambda_{D_2} \in \Lambda$ such that $f_{D_1,\lambda_{D_2}}$ minimizes the empirical risk $\mathcal{R}_{L,D_2}$.

Observation: The approach performs RERM on $D_1$ and ERM over $\mathcal{F} := \{\, f_{D_1,\lambda} : \lambda \in \Lambda \,\}$ on $D_2$ (a minimal code sketch of this selection rule follows below).
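A minimal numpy sketch of the training/validation selection of $\lambda$, using a ridge-type RERM as an illustrative stand-in; all names and the data are invented:

```python
# Training/validation selection of lambda: fit RERM on D1 for each candidate
# lambda, then pick the lambda whose predictor has the smallest empirical risk
# on D2 (i.e. ERM over the finite set {f_{D1,lambda}}). Illustrative only.
import numpy as np

def fit_ridge(H, y, lam):
    n, m = H.shape
    return np.linalg.solve(H.T @ H / n + lam * np.eye(m), H.T @ y / n)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(200)
H = np.stack([x ** k for k in range(6)], axis=1)           # polynomial dictionary
H1, y1, H2, y2 = H[:100], y[:100], H[100:], y[100:]        # split into D1, D2

Lambda = [10.0 ** (-k) for k in range(6)]                  # finite candidate set
risks = {lam: np.mean((y2 - H2 @ fit_ridge(H1, y1, lam)) ** 2) for lam in Lambda}
lam_D2 = min(risks, key=risks.get)
print(lam_D2)
```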

Statistical Analysis — Learning Rates: Hyper-Parameters VI

Theorem: If $\Lambda_n$ is a polynomially growing $n^{-\alpha/2}$-net of $(0, 1]$, our TV-RERM is consistent and enjoys the same best rates as RERM, without knowing $\beta$.

Statistical Analysis — Summary

Positive Aspects:
- Finite sample estimates in the form of oracle inequalities.
- Consistency and learning rates.
- Adaptivity to the best learning rates the analysis can provide.
- The framework applies to a variety of algorithms, e.g. SVMs with Gaussian kernels.
- The analysis is very robust to changes in the scenario.

Negative Aspect: For RERM, the rates are never optimal! This analysis is out-dated.

Statistical Analysis — Learning Rates: Non-Optimality I

For RERM, with probability $P^n$ not less than $1 - e^{-\tau}$, we have
  $\lambda_n \|f_{D,\lambda_n}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda_n}) - \mathcal{R}^*_{L,P} \le C\,\tau\, n^{-\frac{\alpha\beta}{2(\alpha\beta+1)(1+p)}}.$  (3)
- In the proof of this result we used $\lambda_n \|f_{D,\lambda_n}\|_E^\alpha \le 1$, but (3) shows
  $\lambda_n \|f_{D,\lambda_n}\|_E^\alpha \le C\,\tau\, n^{-\frac{\alpha\beta}{2(\alpha\beta+1)(1+p)}}.$
  For large $n$ this estimate is sharper!
- Using the sharper estimate in the proof, we obtain a better learning rate.
- The argument can be iterated...

Statistical Analysis — Learning Rates: Non-Optimality II

Bernstein's Inequality: Let $(\Omega, \mathcal{A}, Q)$ be a probability space and $\xi_1, \dots, \xi_n : \Omega \to [-B, B]$ be independent random variables satisfying $\mathbb{E}_Q \xi_i = 0$ and $\mathbb{E}_Q \xi_i^2 \le \sigma^2$. Then, for all $\tau > 0$ and $n \ge 1$, we have
  $Q\Big( \Big|\frac{1}{n} \sum_{i=1}^n \xi_i\Big| \ge \sqrt{\frac{2\sigma^2 \tau}{n}} + \frac{2B\tau}{3n} \Big) \le 2 e^{-\tau}.$
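When the variance is small, Bernstein's deviation term is much smaller than Hoeffding's. A sketch comparing the two right-hand sides for invented values of $B$, $\sigma^2$, $\tau$:

```python
# Compare the Hoeffding and Bernstein deviation terms for centered variables
# bounded by B with variance at most sigma2; the values below are invented.
import math

def hoeffding_dev(B, tau, n):
    return 2 * B * math.sqrt(tau / (2 * n))          # range b - a = 2B

def bernstein_dev(B, sigma2, tau, n):
    return math.sqrt(2 * sigma2 * tau / n) + 2 * B * tau / (3 * n)

B, sigma2, tau = 1.0, 0.01, 3.0
for n in (100, 1000, 10000):
    print(n, hoeffding_dev(B, tau, n), bernstein_dev(B, sigma2, tau, n))
```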

Statistical Analysis — Learning Rates: Non-Optimality III

Some loss functions or distributions allow a variance bound
  $\mathbb{E}_P \big(L \circ f - L \circ f^*_{L,P}\big)^2 \le V \big( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \big)^\vartheta.$
Using Bernstein's inequality rather than Hoeffding's inequality then leads to an oracle inequality with
- a variance term, which is $O(n^{-1/2})$,
- a supremum term, which is $O(n^{-1})$.

Iteration in the proof:
- The initial analysis provides a small excess risk with high probability.
- The variance bound converts a small excess risk into a small variance.
- The variance term in the oracle inequality becomes smaller, leading to a faster rate...
Rates up to $O(n^{-1})$ become possible. The iteration can be avoided!

Statistical Analysis — Learning Rates: Non-Optimality IV

Further Reasons:
- The fact that $L$ is clippable should be used to obtain a smaller supremum term.
- $\|\cdot\|_\infty$-covering numbers provide a worst-case tool.

Statistical Analysis — Adaptivity of Standard SVMs

Theorem (Eberts & Steinwart, 2011): Consider an SVM with the least squares loss and Gaussian kernel $k_\sigma$. Pick $\lambda$ and $\sigma$ by a suitable training/validation approach. Then, for $m \in (d/2, \infty)$, the SVM learns every $f^*_{L,P} \in W^m(X)$ with the (essentially) optimal rate $n^{-\frac{2m}{2m+d} + \varepsilon}$ without knowing $m$.
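In practice, this kind of training/validation tuning of $\lambda$ and $\sigma$ is what cross-validated kernel methods do. A hedged sketch, assuming scikit-learn is available and using kernel ridge regression with a Gaussian (RBF) kernel as a stand-in for the least squares SVM; the data are invented:

```python
# Sketch: tune the regularization parameter and the Gaussian kernel width by
# validation, in the spirit of the adaptivity result above. Kernel ridge
# regression with an RBF kernel is a stand-in for the least squares SVM.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(300)

param_grid = {"alpha": [10.0 ** -k for k in range(1, 6)],   # plays the role of lambda
              "gamma": [10.0 ** k for k in range(-2, 3)]}    # ~ 1 / (2 * sigma^2)
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```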

Advanced Statistical Analysis: ERM — Towards a Better Analysis for ERM I

Basic Setup:
- We consider ERM over a finite $\mathcal{F}$.
- We assume that a Bayes predictor $f^*_{L,P}$ exists.
- We consider the excess losses $h_f := L \circ f - L \circ f^*_{L,P}$. Thus $\mathbb{E}_P h_f = \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}$.
- Variance bound: $\mathbb{E}_P h_f^2 \le V (\mathbb{E}_P h_f)^\vartheta$.
- Supremum bound: $\|h_f\|_\infty \le B$.

Advanced Statistical Analysis: ERM — Towards a Better Analysis for ERM II

Decomposition: Let $f_P \in \mathcal{F}$ satisfy $\mathcal{R}_{L,P}(f_P) = \mathcal{R}^*_{L,P,\mathcal{F}}$. Then $\mathcal{R}_{L,D}(f_D) \le \mathcal{R}_{L,D}(f_P)$ implies $\mathbb{E}_D h_{f_D} \le \mathbb{E}_D h_{f_P}$. This yields
  $\mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,P}(f_P) = \mathbb{E}_P h_{f_D} - \mathbb{E}_P h_{f_P} \le \mathbb{E}_P h_{f_D} - \mathbb{E}_D h_{f_D} + \mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P}.$
We will estimate the two differences separately.

Advanced Statistical Analysis: ERM — Towards a Better Analysis for ERM III

Second Difference: We have $\mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P} = \mathbb{E}_D (h_{f_P} - \mathbb{E}_P h_{f_P})$.
- Centered: $\mathbb{E}_P (h_{f_P} - \mathbb{E}_P h_{f_P}) = 0$.
- Variance bound: $\mathbb{E}_P (h_{f_P} - \mathbb{E}_P h_{f_P})^2 \le \mathbb{E}_P h_{f_P}^2 \le V (\mathbb{E}_P h_{f_P})^\vartheta$.
- Supremum bound: $\|h_{f_P} - \mathbb{E}_P h_{f_P}\|_\infty \le 2B$.
Bernstein's inequality yields
  $\mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P} \le \sqrt{\frac{2\tau V (\mathbb{E}_P h_{f_P})^\vartheta}{n}} + \frac{4B\tau}{3n} \le \mathbb{E}_P h_{f_P} + \Big(\frac{2V\tau}{n}\Big)^{\frac{1}{2-\vartheta}} + \frac{4B\tau}{3n}.$

Advanced Statistical Analysis: ERM — Towards a Better Analysis for ERM IV

First Difference: To estimate the remaining term $\mathbb{E}_P h_{f_D} - \mathbb{E}_D h_{f_D}$, we define the functions
  $g_{f,r} := \frac{\mathbb{E}_P h_f - h_f}{\mathbb{E}_P h_f + r}, \quad f \in \mathcal{F},\ r > 0.$

Bernstein Conditions:
- Centered: $\mathbb{E}_P g_{f,r} = 0$.
- Variance bound: $\mathbb{E}_P g_{f,r}^2 \le \dfrac{\mathbb{E}_P h_f^2}{(\mathbb{E}_P h_f + r)^2} \le \dfrac{\mathbb{E}_P h_f^2}{r^{2-\vartheta} (\mathbb{E}_P h_f)^\vartheta} \le V r^{\vartheta - 2}$.
- Supremum bound: $\|g_{f,r}\|_\infty \le \|\mathbb{E}_P h_f - h_f\|_\infty\, r^{-1} \le 2B r^{-1}$.

Advanced Statistical Analysis: ERM — Towards a Better Analysis for ERM V

Application of Bernstein: With probability $P^n$ not smaller than $1 - |\mathcal{F}|\, e^{-\tau}$, we have
  $\sup_{f \in \mathcal{F}} \mathbb{E}_D g_{f,r} < \sqrt{\frac{2V\tau}{n\, r^{2-\vartheta}}} + \frac{4B\tau}{3nr}.$