Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013



Basics / Informal Introduction — Informal Description of Supervised Learning

- $X$: space of input samples.
- $Y$: space of labels, usually $Y \subset \mathbb{R}$.
- Already observed samples: $D = \big((x_1, y_1), \dots, (x_n, y_n)\big) \in (X \times Y)^n$.

Goal: With the help of $D$, find a function $f_D : X \to \mathbb{R}$ such that $f_D(x)$ is a good prediction of the label $y$ for new, unseen $x$.

Learning method: assigns to every training set $D$ a predictor $f_D : X \to \mathbb{R}$.

Basics / Informal Introduction — Illustration: Binary Classification

- Problem: The labels are $\pm 1$.
- Goal: Make few mistakes on future data.
- Example: (figure)

Basics / Informal Introduction — Illustration: Regression

- Problem: The labels are $\mathbb{R}$-valued.
- Goal: Estimate the label $y$ for new data $x$ as accurately as possible.
- Example: (two regression plots; axis ticks omitted)

Basics / Formalized Description — Data Generation

Assumptions:
- $P$ is an unknown probability measure on $X \times Y$.
- $D = \big((x_1, y_1), \dots, (x_n, y_n)\big) \in (X \times Y)^n$ is sampled from $P^n$.
- Future samples $(x, y)$ will also be sampled from $P$.

Consequences:
- The label $y$ for a given $x$ is, in general, not deterministic.
- The past and the future look the same.
- We seek algorithms that work well for many (or even all) $P$.

Basics / Formalized Description — Performance Evaluation I

Loss Function: $L : X \times Y \times \mathbb{R} \to [0, \infty)$ measures the cost or loss $L(x, y, t)$ of predicting the label $y$ by the value $t$ at the point $x$.

Interpretation:
- As the name suggests, we prefer predictions with small loss.
- $L$ is chosen by us.
- Since future $(x, y)$ are random, it makes sense to consider the average loss of a predictor.

Basics / Formalized Description — Performance Evaluation II

Risk: The risk of a predictor $f : X \to \mathbb{R}$ is the average loss
  $\mathcal{R}_{L,P}(f) := \int_{X \times Y} L\big(x, y, f(x)\big) \, dP(x, y).$
For $D = \big((x_1, y_1), \dots, (x_n, y_n)\big)$ the empirical risk is
  $\mathcal{R}_{L,D}(f) := \frac{1}{n} \sum_{i=1}^n L\big(x_i, y_i, f(x_i)\big).$

Interpretation: By the law of large numbers, we have $P$-almost surely
  $\mathcal{R}_{L,P}(f) = \lim_{|D| \to \infty} \mathcal{R}_{L,D}(f).$
Thus $\mathcal{R}_{L,P}(f)$ is the long-term average future loss when using $f$.
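The empirical risk is just the sample average of the losses. A minimal sketch in plain Python that mirrors the definition; the loss, predictor, and data below are invented for illustration:

```python
# Sketch: empirical risk R_{L,D}(f) as the average loss over the sample D.
# square_loss, f, and D are illustrative names, not from the lecture.

def square_loss(x, y, t):
    """Least squares loss L(x, y, t) = (y - t)^2; it ignores x."""
    return (y - t) ** 2

def empirical_risk(loss, f, D):
    """R_{L,D}(f) = (1/n) * sum_i L(x_i, y_i, f(x_i))."""
    return sum(loss(x, y, f(x)) for x, y in D) / len(D)

D = [(0.0, 0.1), (0.5, 0.4), (1.0, 1.2)]      # toy sample from (X x Y)^n
print(empirical_risk(square_loss, lambda x: x, D))
```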

Basics / Formalized Description — Performance Evaluation III

Bayes Risk and Bayes Predictor:
- The Bayes risk is the smallest possible risk
  $\mathcal{R}^*_{L,P} := \inf\{\, \mathcal{R}_{L,P}(f) \mid f : X \to \mathbb{R} \text{ (measurable)} \,\}.$
- A Bayes predictor is any function $f^*_{L,P} : X \to \mathbb{R}$ that satisfies $\mathcal{R}_{L,P}(f^*_{L,P}) = \mathcal{R}^*_{L,P}$.

Interpretation:
- We will never find a predictor whose risk is smaller than $\mathcal{R}^*_{L,P}$.
- We seek a predictor $f : X \to \mathbb{R}$ whose excess risk $\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}$ is close to $0$.

Basics / Formalized Description — Performance Evaluation IV

Best Naïve Risk: The best naïve risk is the smallest risk one obtains by ignoring $X$:
  $\inf\{\, \mathcal{R}_{L,P}(c \mathbf{1}_X) \mid c \in \mathbb{R} \,\}.$

Remarks:
- The best naïve risk (and its minimizer) is usually easy to estimate.
- Using fancy learning algorithms only makes sense if $\mathcal{R}^*_{L,P}$ is smaller than the best naïve risk.

Equality:
- Typically: the Bayes risk equals the best naïve risk iff there is a constant Bayes predictor.
- If $P = P_X \otimes P_Y$, then the two risks coincide, but the converse is false.

Basics / Formalized Description — Learning Goals I

Binary Classification: $Y = \{-1, 1\}$
- $L(y, t) := \mathbf{1}_{(-\infty, 0]}(y \,\mathrm{sign}\, t)$ penalizes predictions $t$ with $\mathrm{sign}\, t \ne y$.
- $\mathcal{R}_{L,P}(f) = P\big(\{(x, y) : \mathrm{sign}\, f(x) \ne y\}\big).$

Optimal Risk: Let $\eta(x) := P(Y = 1 \mid x)$ be the probability of a positive label at $x \in X$.
- Bayes risk: $\mathcal{R}^*_{L,P} = \mathbb{E}_{P_X} \min\{\eta, 1 - \eta\}$.
- $f$ is a Bayes predictor iff $(2\eta - 1)\,\mathrm{sign}\, f \ge 0$.

Naïve Risk:
- Best naïve risk: $\min\{P(Y = 1),\, 1 - P(Y = 1)\}$.
- It equals the Bayes risk iff $\eta \ge 1/2$ or $\eta \le 1/2$.
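For a toy distribution with finitely many inputs, the Bayes risk $\mathbb{E}_{P_X}\min\{\eta, 1-\eta\}$ and the best naïve risk $\min\{P(Y=1), 1-P(Y=1)\}$ can be evaluated directly. A small sketch; the probability values are made up:

```python
# Toy check of the classification formulas on a finite X.
# p_x[i] = P_X(x_i), eta[i] = P(Y = 1 | x_i); the numbers are invented.
p_x = [0.2, 0.3, 0.5]
eta = [0.9, 0.4, 0.1]

bayes_risk = sum(p * min(e, 1 - e) for p, e in zip(p_x, eta))
p_y1 = sum(p * e for p, e in zip(p_x, eta))      # marginal P(Y = 1)
naive_risk = min(p_y1, 1 - p_y1)                 # best constant prediction

print(bayes_risk, naive_risk)   # the Bayes risk is never larger than the naive risk
```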

Basics / Formalized Description — Learning Goals II

Least Squares Regression: $Y \subset \mathbb{R}$
- $L(y, t) := (y - t)^2$
- Conditional expectation: $\mu_P(x) := \mathbb{E}_P(Y \mid x)$.
- Conditional variance: $\sigma_P^2(x) := \mathbb{E}_P(Y^2 \mid x) - \mu_P^2(x)$.

Optimal Risk:
- $\mu_P$ is the only Bayes predictor and $\mathcal{R}^*_{L,P} = \mathbb{E}_{P_X} \sigma_P^2$.
- Excess risk: $\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} = \|f - \mu_P\|^2_{L_2(P_X)}$.
- Least squares regression aims at estimating the conditional mean.

Naïve Risk: The best naïve risk is $\mathrm{Var}_P\, Y$.

Basics / Formalized Description — Learning Goals III

Absolute Value Regression: $Y \subset \mathbb{R}$
- $L(y, t) := |y - t|$
- Conditional medians: $m_P(x) := \mathrm{median}_P(Y \mid x)$.

Optimal Risk:
- The medians $m_P$ are the only Bayes predictors.
- Excess risk: $\mathcal{R}_{L,P}(f_n) - \mathcal{R}^*_{L,P} \to 0$ implies $f_n \to m_P$ in probability $P_X$.
- Absolute value regression aims at estimating the conditional median.

Naïve Risk: The best naïve risk is achieved by the median of $P_Y$.

Basics / What is learning theory about? — Questions in Statistical Learning I

Asymptotic Learning: A learning method is called universally consistent if
  $\lim_{n \to \infty} \mathcal{R}_{L,P}(f_D) = \mathcal{R}^*_{L,P}$ in probability  (1)
for every probability measure $P$ on $X \times Y$.

Good News: Many learning methods are universally consistent. First result: Stone (1977), Annals of Statistics.

Basics / What is learning theory about? — Questions in Statistical Learning II

Learning Rates: A learning method learns for a distribution $P$ with rate $a_n \to 0$ if
  $\mathbb{E}_{D \sim P^n} \mathcal{R}_{L,P}(f_D) \le \mathcal{R}^*_{L,P} + C_P a_n, \quad n \ge 1.$
Similarly one defines learning rates in probability.

Bad News (Devroye, 1982, IEEE TPAMI): If $|X| = \infty$, $|Y| \ge 2$, and $L$ is non-trivial, then it is impossible to obtain a learning rate that is independent of $P$.

Remark: If $|X| < \infty$, then it is usually easy to obtain a uniform learning rate for which $C_P$ depends on $|X|$.

Basics / What is learning theory about? — Questions in Statistical Learning III

Relative Learning Rates: Let $\mathcal{P}$ be a set of distributions on $X \times Y$. A learning method learns $\mathcal{P}$ with rate $a_n \to 0$ if, for all $P \in \mathcal{P}$,
  $\mathbb{E}_{D \sim P^n} \mathcal{R}_{L,P}(f_D) \le \mathcal{R}^*_{L,P} + C_P a_n, \quad n \ge 1.$
The rate $(a_n)$ is minimax optimal if, in addition, there is no learning method that learns $\mathcal{P}$ with a rate $(b_n)$ such that $b_n / a_n \to 0$.

Tasks:
- Identify interesting ("realistic") classes $\mathcal{P}$ with good optimal rates.
- Find learning algorithms that achieve these rates.

Basics / What is learning theory about? — Example of Optimal Rates

Classical Least Squares Example:
- $X = [0, 1]^d$, $Y = [-1, 1]$, $L$ is the least squares loss.
- $W^m$: Sobolev space on $X$ with order of smoothness $m > d/2$.
- $\mathcal{P}$: the set of $P$ such that $f^*_{L,P} \in W^m$ with norm bounded by $K$.
- Optimal rate: $n^{-\frac{2m}{2m+d}}$.

Remarks:
- The smoother the target $\mu_P = f^*_{L,P}$ is, the better it can be learned.
- The larger the input dimension is, the harder learning becomes.
- Various learning algorithms achieve the optimal rate.
- They usually require us to know $m$ in advance.

Basics / What is learning theory about? — Questions in Statistical Learning IV

Assumptions for Adaptivity:
- Usually one has a family $(\mathcal{P}_\theta)_{\theta \in \Theta}$ of large sets $\mathcal{P}_\theta$ of distributions.
- Each set $\mathcal{P}_\theta$ has its own optimal rate.
- We don't know whether $P \in \mathcal{P}_\theta$ for some $\theta$, but we hope so.
- If $P \in \mathcal{P}_\theta$, we don't know $\theta$ and we have no means to estimate it.

Task: We seek learning algorithms that
- are universally consistent,
- learn all $\mathcal{P}_\theta$ with the optimal rate without knowing $\theta$.
Such learning algorithms are adaptive to the unknown $\theta$.

Basics / What is learning theory about? — Questions in Statistical Learning V

Finite Sample Estimates: Assume that our algorithm has some hyper-parameters $\lambda \in \Lambda$. For each $P$, $\lambda$, $\delta \in (0, 1)$, and $n \ge 1$ we seek an $\varepsilon(P, \lambda, \delta, n)$ such that
  $\mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} \le \varepsilon(P, \lambda, \delta, n)$
with probability $P^n$ not smaller than $1 - \delta$.

Remarks:
- If there exists a sequence $(\lambda_n)$ with $\lim_{n \to \infty} \varepsilon(P, \lambda_n, \delta, n) = 0$ for all $P$ and $\delta$, then the algorithm can be made universally consistent.
- We automatically obtain learning rates for such sequences.
- If $|X| = \infty$ and ..., then such $\varepsilon(P, \lambda, \delta, n)$ must depend on $P$.

Basics / What is learning theory about? — Questions in Statistical Learning VI

Generalization Error Bounds:
- Goal: Estimate the risk $\mathcal{R}_{L,P}(f_{D,\lambda})$ by the performance of $f_{D,\lambda}$ on $D$.
- Find $\varepsilon(\lambda, \delta, n)$ such that, with probability $P^n$ not smaller than $1 - \delta$,
  $\mathcal{R}_{L,P}(f_{D,\lambda}) \le \mathcal{R}_{L,D}(f_{D,\lambda}) + \varepsilon(\lambda, \delta, n).$

Remarks:
- $\varepsilon(\lambda, \delta, n)$ must not depend on $P$, since we do not know $P$.
- $\varepsilon(\lambda, \delta, n)$ can be used to derive parameter selection strategies such as structural risk minimization.
- Alternative: use a second data set $D'$ and $\mathcal{R}_{L,D'}(f_{D,\lambda})$ as an estimate.

Basics / What is learning theory about? — Summary

A good learning algorithm:
- Is universally consistent.
- Is adaptive for realistic classes of distributions.
- Can be modified for new problems that have a different loss.
- Has a good record on real-world problems.
- Runs efficiently on a computer.
- ...

Definition and Examples — Empirical Risk Minimization

Definition: Let $\mathcal{F}$ be a set of functions $X \to \mathbb{R}$. A learning method whose predictors satisfy $f_D \in \mathcal{F}$ and
  $\mathcal{R}_{L,D}(f_D) = \min_{f \in \mathcal{F}} \mathcal{R}_{L,D}(f)$
is called empirical risk minimization (ERM).

Remarks:
- Not every $\mathcal{F}$ makes ERM possible.
- ERM is, in general, not unique.
- ERM may not be computationally feasible.
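Over a finite candidate set, ERM is literally a minimum of empirical risks. A self-contained sketch; the loss and the candidate predictors are invented for illustration:

```python
# ERM over a finite set F: pick any minimizer of the empirical risk R_{L,D}(f).
# Illustrative sketch; loss, candidates, and data are made up.
def empirical_risk(loss, f, D):
    return sum(loss(x, y, f(x)) for x, y in D) / len(D)

D = [(0.0, 0.1), (0.5, 0.4), (1.0, 1.2)]
F = [lambda x: 0.0, lambda x: x, lambda x: 0.5 + 0.0 * x]
loss = lambda x, y, t: (y - t) ** 2

f_D = min(F, key=lambda f: empirical_risk(loss, f, D))   # an ERM predictor
print(empirical_risk(loss, f_D, D))
```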

Definition and Examples — Empirical Risk Minimization: Danger of Underfitting

ERM can never produce predictors with risk better than
  $\mathcal{R}^*_{L,P,\mathcal{F}} := \inf\{\, \mathcal{R}_{L,P}(f) : f \in \mathcal{F} \,\}.$
Example: $L$ least squares, $X = [0, 1]$, $P_X$ the uniform distribution, $f^*_{L,P}$ not linear, and $\mathcal{F}$ the set of linear functions. Then $\mathcal{R}^*_{L,P,\mathcal{F}} > \mathcal{R}^*_{L,P}$, and thus ERM cannot be consistent.

Definition and Examples — Empirical Risk Minimization: Danger of Overfitting

If $\mathcal{F}$ is too large, ERM may overfit.
Example: $L$ least squares, $X = [0, 1]$, $P_X$ the uniform distribution, $f^*_{L,P} = \mathbf{1}_X$, $\mathcal{R}^*_{L,P} = 0$, and $\mathcal{F}$ the set of all functions. Then
  $f_D(x) = y_i$ if $x = x_i$ for some $i$, and $f_D(x) = 0$ otherwise,
satisfies $\mathcal{R}_{L,D}(f_D) = 0$ but $\mathcal{R}_{L,P}(f_D) = 1$.

Definition and Examples — Summary of Last Session

- The risk of a predictor $f : X \to \mathbb{R}$ is $\mathcal{R}_{L,P}(f) := \int_{X \times Y} L\big(x, y, f(x)\big)\, dP(x, y)$.
- The Bayes risk $\mathcal{R}^*_{L,P}$ is the smallest possible risk. A Bayes predictor $f^*_{L,P}$ achieves this minimal risk.
- Learning means $\mathcal{R}_{L,P}(f_D) \to \mathcal{R}^*_{L,P}$.
- Asymptotically, this is possible, but no uniform rates are possible.
- We seek adaptive learning algorithms. Ideally, these are fully automated.

Definition and Examples — Regularized ERM

Definition: Let $\mathcal{F}$ be a non-empty set of functions $X \to \mathbb{R}$ and $\Upsilon : \mathcal{F} \to [0, \infty)$ be a map. A learning method whose predictors satisfy $f_D \in \mathcal{F}$ and
  $\Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) = \inf_{f \in \mathcal{F}} \big( \Upsilon(f) + \mathcal{R}_{L,D}(f) \big)$
is called regularized empirical risk minimization (RERM).

Remarks:
- $\Upsilon = 0$ yields ERM.
- All remarks about ERM apply to RERM, too.

Definition and Examples — Examples of Regularized ERM I

General Dictionary Methods: For bounded $h_1, \dots, h_m : X \to \mathbb{R}$ consider
  $\mathcal{F} := \{\, f_c := \textstyle\sum_{i=1}^m c_i h_i : (c_1, \dots, c_m) \in \mathbb{R}^m \,\}.$

Examples of Regularizers:
- $\ell_1$-regularization: $\Upsilon(f_c) = \lambda \|c\|_1 = \lambda \sum_{i=1}^m |c_i|$,
- $\ell_2$-regularization: $\Upsilon(f_c) = \lambda \|c\|_2^2 = \lambda \sum_{i=1}^m |c_i|^2$,
- $\ell_\infty$-regularization: $\Upsilon(f_c) = \lambda \|c\|_\infty = \lambda \max_i |c_i|$,
or, in the case of dependent $h_i$, we take the infimum over all representations.
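For the least squares loss with the squared $\ell_2$ regularizer, the dictionary RERM problem has a closed-form (ridge-regression) solution. A numpy sketch under that assumption; the dictionary and the data below are invented:

```python
# RERM for the least squares loss with an l2^2 regularizer over a dictionary
# h_1, ..., h_m: minimize  lam * ||c||_2^2 + (1/n) * sum_j (y_j - f_c(x_j))^2.
# This specific combination has the ridge closed form; data are invented.
import numpy as np

def rerm_l2(H, y, lam):
    """H[j, i] = h_i(x_j). Returns the coefficient vector c of f_D."""
    n, m = H.shape
    return np.linalg.solve(H.T @ H / n + lam * np.eye(m), H.T @ y / n)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)
H = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)   # dictionary h_1..h_4
c = rerm_l2(H, y, lam=1e-2)
```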

Definition and Examples — Examples of Regularized ERM II

Further Examples:
- Support Vector Machines
- Regularized Decision Trees
- ...

Norm Regularizers — Regularized ERM: Norm Regularizers

Conventions: Whenever we consider regularizers, they will be of the form
  $\Upsilon(f) = \lambda \|f\|_E^\alpha, \quad f \in \mathcal{F},$
where $\alpha \ge 1$ and $E := \mathcal{F}$ is a vector space of functions $X \to \mathbb{R}$. In this case, we additionally assume that
  $\|f\|_\infty \le \|f\|_E, \quad f \in E.$
In the following, we assume that the optimization problem also has a solution $f_P$ when we replace $D$ by $P$:
  $f_P \in \arg\min_{f \in \mathcal{F}} \big( \Upsilon(f) + \mathcal{R}_{L,P}(f) \big).$

Towards the Statistical Analysis — The Classical Argument I

Ansatz: Assume that we have a data set $D$ and an $\varepsilon > 0$ such that
  $\sup_{f \in \mathcal{F}} \big| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \big| \le \varepsilon.$
Then we obtain
  $\Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) = \Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,D}(f_D) + \mathcal{R}_{L,D}(f_D)$
  $\le \Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) + \varepsilon$
  $\le \Upsilon(f_P) + \mathcal{R}_{L,D}(f_P) + \varepsilon$
  $\le \Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) + 2\varepsilon.$

Towards the Statistical Analysis — The Classical Argument II

Discussion: The uniform bound
  $\sup_{f \in \mathcal{F}} \big| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \big| \le \varepsilon$  (2)
led to the inequality
  $\Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) - \mathcal{R}^*_{L,P} \le \Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) - \mathcal{R}^*_{L,P} + 2\varepsilon.$
Since $\Upsilon(f_D) \ge 0$, all that remains to be done is to estimate
- the probability of (2),
- the regularization error $\Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) - \mathcal{R}^*_{L,P}$.

Towards the Statistical Analysis — The Classical Argument III

Union Bound: Assume that $\mathcal{F}$ is finite. The union bound gives
  $P^n\big(D : \sup_{f \in \mathcal{F}} |\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| \le \varepsilon\big)$
  $= 1 - P^n\big(D : \sup_{f \in \mathcal{F}} |\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| > \varepsilon\big)$
  $\ge 1 - \sum_{f \in \mathcal{F}} P^n\big(D : |\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| > \varepsilon\big).$

Consequences:
- It suffices to bound $P^n\big(D : |\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| > \varepsilon\big)$ for all $f$.
- No assumptions on $P$ have been made so far. In particular, so far the data $D$ does not need to be i.i.d., nor even random.

Towards the Statistical Analysis — The Classical Argument IV

Hoeffding's Inequality: Let $(\Omega, \mathcal{A}, Q)$ be a probability space and $\xi_1, \dots, \xi_n : \Omega \to [a, b]$ be independent random variables. Then, for all $\tau > 0$ and $n \ge 1$, we have
  $Q\Big( \Big|\frac{1}{n} \sum_{i=1}^n \big(\xi_i - \mathbb{E}_Q \xi_i\big)\Big| \ge (b - a)\sqrt{\frac{\tau}{2n}} \Big) \le 2 e^{-\tau}.$

Application: Consider $\Omega := (X \times Y)^n$ and $Q := P^n$. For $\xi_i(D) := L(x_i, y_i, f(x_i))$ we have $a = 0$ and
  $\frac{1}{n} \sum_{i=1}^n \big(\xi_i - \mathbb{E}_{P^n} \xi_i\big) = \mathcal{R}_{L,D}(f) - \mathcal{R}_{L,P}(f).$
Assuming $L(x, y, f(x)) \le B$ makes the application of Hoeffding's inequality possible.

Towards the Statistical Analysis — The Classical Argument V

Theorem for ERM: Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss, $\mathcal{F}$ a non-empty finite set of functions $f : X \to \mathbb{R}$, and $B > 0$ a constant such that
  $L(x, y, f(x)) \le B, \quad (x, y) \in X \times Y,\ f \in \mathcal{F}.$
Then we have
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|\mathcal{F}|)}{n}} \Big) \ge 1 - e^{-\tau}.$

Remarks:
- The theorem does not specify the approximation error $\mathcal{R}^*_{L,P,\mathcal{F}} - \mathcal{R}^*_{L,P}$.
- If $|\mathcal{F}| = \infty$, the bound becomes meaningless.
- What happens if we consider RERM with a non-trivial regularizer?
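The deviation term in this theorem is easy to evaluate numerically. A small sketch that computes $B\sqrt{(2\tau + 2\ln(2|\mathcal{F}|))/n}$ for a few sample sizes; the values of $B$, $\tau$, and $|\mathcal{F}|$ are invented:

```python
# Evaluate the finite-F ERM bound  B * sqrt((2*tau + 2*ln(2*|F|)) / n).
# B, tau, and |F| below are invented for illustration.
import math

def erm_bound(B, tau, card_F, n):
    return B * math.sqrt((2 * tau + 2 * math.log(2 * card_F)) / n)

B, tau, card_F = 1.0, 3.0, 1000        # confidence 1 - e^{-tau} ~ 0.95
for n in (100, 1000, 10000):
    print(n, erm_bound(B, tau, card_F, n))   # decays like n^{-1/2}
```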

Towards the Statistical Analysis — ERM for Infinite $\mathcal{F}$: The General Approach

So far: The union bound was the trick that turned an estimate of $|\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| \le \varepsilon$ for a single $f$ into one for all $f \in \mathcal{F}$. For infinite $\mathcal{F}$, this does not work!

General Approach: Given some $\delta > 0$, find a finite set $N_\delta$ of functions such that
  $\sup_{f \in \mathcal{F}} \big|\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)\big| \le \sup_{f \in N_\delta} \big|\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)\big| + \delta.$
Then apply the union bound to $N_\delta$. The rest remains unchanged.

The old inequality:
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|\mathcal{F}|)}{n}} \Big) \ge 1 - e^{-\tau}.$
The new inequality:
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|N_\delta|)}{n}} + \delta \Big) \ge 1 - e^{-\tau}.$

Tasks:
- For each $\delta > 0$, find a small set $N_\delta$.
- Optimize the right-hand side with respect to $\delta$.

Towards the Statistical Analysis — Covering Numbers

Definition: Let $(M, d)$ be a metric space, $A \subset M$, and $\varepsilon > 0$. The $\varepsilon$-covering number of $A$ is defined by
  $\mathcal{N}(A, d, \varepsilon) := \inf\Big\{ n \ge 1 : \exists\, x_1, \dots, x_n \in M \text{ such that } A \subset \bigcup_{i=1}^n B_d(x_i, \varepsilon) \Big\},$
where $\inf \emptyset := \infty$, and $B_d(x_i, \varepsilon)$ is the ball with radius $\varepsilon$ and center $x_i$.
- $x_1, \dots, x_n$ is called an $\varepsilon$-net.
- $\mathcal{N}(A, d, \varepsilon)$ is the size of the smallest $\varepsilon$-net.
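A greedy construction produces an $\varepsilon$-net of any finite point set and hence an upper bound on its covering number. A purely illustrative sketch on a grid in $[0, 1]$:

```python
# Greedy epsilon-net for a finite subset A of a metric space: every point that
# is not within eps of an existing center becomes a new center. The centers
# form an eps-net, so their number upper-bounds N(A, d, eps). Illustrative only.
def greedy_net(A, dist, eps):
    centers = []
    for a in A:
        if all(dist(a, c) > eps for c in centers):
            centers.append(a)
    return centers

A = [i / 100 for i in range(101)]                         # grid in [0, 1]
net = greedy_net(A, lambda s, t: abs(s - t), eps=0.1)
print(len(net))                                           # on the order of 1/eps centers
```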

Towards the Statistical Analysis — Covering Numbers II

- Every bounded $A \subset \mathbb{R}^d$ satisfies
  $\mathcal{N}(A, \|\cdot\|, \varepsilon) \le c\, \varepsilon^{-d}, \quad \varepsilon > 0,$
  where $c > 0$ is a constant and the choice of norm only influences $c$.
- For sets $\mathcal{F}$ of functions $f : X \to \mathbb{R}$, the behavior of $\mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$ may be very different! The literature is full of estimates of $\ln \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$. A typical estimate looks like
  $\ln \mathcal{N}(B_E, \|\cdot\|_\infty, \varepsilon) \le c\, \varepsilon^{-2p}, \quad \varepsilon > 0.$
  Here $p$ may depend on the input dimension and the smoothness of the functions in $E$.

Towards the Statistical Analysis — ERM with Infinite Sets

Theorem: Let $L$ be Lipschitz continuous in its third argument with Lipschitz constant $1$. Assume that $\|L \circ f\|_\infty \le B$ for all $f \in \mathcal{F}$. Let $N_\varepsilon$ be a minimal $\varepsilon$-net of $\mathcal{F}$, i.e. $|N_\varepsilon| = \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$. Then we have
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|N_\varepsilon|)}{n}} + 2\varepsilon \Big) \ge 1 - e^{-\tau}.$

Towards the Statistical Analysis — Using Covering Numbers VII

Example: Let $L$ satisfy the assumptions of the previous theorem, and let $\mathcal{F}$ be a set of functions with $\ln \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le c\, \varepsilon^{-2p}$. Then we have
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 4c\,\varepsilon^{-2p}}{n}} + 2\varepsilon \Big) \ge 1 - e^{-\tau}.$
Optimizing with respect to $\varepsilon$ gives a constant $K_p$ such that
  $P^n\Big( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + K_p\, c^{\frac{1}{2+2p}} B \sqrt{\tau}\, n^{-\frac{1}{2+2p}} \Big) \ge 1 - e^{-\tau}.$
For ERM over finite $\mathcal{F}$, we had $p = 0$.
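The optimization over $\varepsilon$ can also be done numerically instead of in closed form. A sketch using a simple grid search for invented constants $B$, $\tau$, $c$, $p$, $n$:

```python
# Numerically optimize eps in  B * sqrt((2*tau + 4*c*eps**(-2*p)) / n) + 2*eps,
# the right-hand side of the covering-number bound; all constants are invented.
import math

def rhs(eps, B=1.0, tau=3.0, c=1.0, p=0.5, n=10_000):
    return B * math.sqrt((2 * tau + 4 * c * eps ** (-2 * p)) / n) + 2 * eps

grid = [10 ** (-k / 10) for k in range(1, 60)]     # eps from ~0.8 down to ~1e-6
best_eps = min(grid, key=rhs)
print(best_eps, rhs(best_eps))
```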

Towards the Statistical Analysis — Standard Analysis for RERM

Difficulties when analyzing RERM:
- We are interested in RERMs where $\mathcal{F}$ is a vector space $E$.
- Vector spaces $E$ are never compact, thus $\ln \mathcal{N}(E, \|\cdot\|_\infty, \varepsilon) = \infty$.
- It seems that our approach does not work in this case.

Solution: RERM actually solves its optimization problem
  $\Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) = \inf_{f \in E} \big( \Upsilon(f) + \mathcal{R}_{L,D}(f) \big)$
over a set which is significantly smaller than $E$.

Towards the Statistical Analysis — Norm Bound for RERM

Lemma: Assume that $L(x, y, 0) \le 1$. Then, for any RERM predictor $f_{D,\lambda} \in E$, we have
  $\|f_{D,\lambda}\|_E \le \lambda^{-1/\alpha}.$
Consequence: The RERM optimization problem is actually solved over the ball with radius $\lambda^{-1/\alpha}$.

Towards the Statistical Analysis — Norm Bound for RERM II

Proof: Our assumptions $L(x, y, t) \ge 0$ and $L(x, y, 0) \le 1$ yield
  $\lambda \|f_{D,\lambda}\|_E^\alpha \le \lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,D}(f_{D,\lambda}) = \inf_{f \in E} \big( \lambda \|f\|_E^\alpha + \mathcal{R}_{L,D}(f) \big) \le \lambda \|0\|_E^\alpha + \mathcal{R}_{L,D}(0) \le 1.$

Statistical Analysis — An Oracle Inequality

Theorem (Example): Assume that
- $L$ is Lipschitz continuous with $|L|_1 \le 1$ and $L(x, y, 0) \le 1$,
- $E$ is a vector space with norm $\|\cdot\|_E$ satisfying $\|\cdot\|_\infty \le \|\cdot\|_E$,
- $\Upsilon(f) = \lambda \|f\|_E^\alpha$,
- $\ln \mathcal{N}(B_E, \|\cdot\|_\infty, \varepsilon) \le c\, \varepsilon^{-2p}$.
Then, for all $n \ge 1$, $\lambda \in (0, 1]$, $\tau \ge 1$, we have
  $\lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) < \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) + K_p\, c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-1/\alpha} n^{-\frac{1}{2+2p}}$
with probability $P^n$ not less than $1 - e^{-\tau}$.

Statistical Analysis — Consequences of the Oracle Inequality

Subtracting $\mathcal{R}^*_{L,P}$ on both sides and splitting the right-hand side gives
  $\lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} < \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P,E} + \mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} + K_p\, c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-1/\alpha} n^{-\frac{1}{2+2p}}.$
- Regularization error: $A(\lambda) := \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P,E}$.
- Approximation error: $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P}$.
- Statistical error: $K_p\, c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-1/\alpha} n^{-\frac{1}{2+2p}}$.

Statistical Analysis — Bounding the Remaining Errors

Lemma 1: If $E$ is dense in $L_1(P_X)$, then $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} = 0$.

Lemma 2: We have $\lim_{\lambda \to 0} A(\lambda) = 0$, and if there is an $f \in E$ with $\mathcal{R}_{L,P}(f) = \mathcal{R}^*_{L,P,E}$, then $A(\lambda) \le \lambda \|f\|_E^\alpha$.

Remarks:
- A linear behaviour of $A$ often requires such an $f$.
- A typical behaviour is, for some $\beta \in (0, 1]$, of the form $A(\lambda) \le c\,\lambda^\beta$.
- A sufficient condition for such behaviour can be described with the help of so-called interpolation spaces of the real method.

Statistical Analysis — Main Results for RERM

Oracle inequality (assuming $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} = 0$):
  $\lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} < A(\lambda) + K_p\, c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-1/\alpha} n^{-\frac{1}{2+2p}}.$

Consequences:
- The method is consistent if $\lambda_n \to 0$ with $\lambda_n n^{\frac{\alpha}{2+2p}} \to \infty$.
- If $A(\lambda) \le c\,\lambda^\beta$, then $\lambda_n \sim n^{-\frac{\alpha}{(\alpha\beta+1)(2+2p)}}$ achieves the best rate $n^{-\frac{\alpha\beta}{(\alpha\beta+1)(2+2p)}}$.
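The resulting exponents are explicit functions of $\alpha$, $\beta$, $p$. A sketch that tabulates the rate exponent $\frac{\alpha\beta}{(\alpha\beta+1)(2+2p)}$ and the corresponding choice of $\lambda_n$ for a few invented parameter values:

```python
# Rate exponents from the RERM oracle inequality: with A(lambda) <= c*lambda^beta,
# lambda_n ~ n^{-alpha/((alpha*beta+1)*(2+2p))} yields the excess-risk rate
# n^{-alpha*beta/((alpha*beta+1)*(2+2p))}. Parameter values below are invented.
def rerm_rate(alpha, beta, p):
    denom = (alpha * beta + 1) * (2 + 2 * p)
    return alpha / denom, alpha * beta / denom   # (lambda exponent, risk exponent)

for alpha, beta, p in [(2, 1.0, 0.5), (2, 0.5, 0.5), (1, 1.0, 1.0)]:
    lam_exp, risk_exp = rerm_rate(alpha, beta, p)
    print(f"alpha={alpha}, beta={beta}, p={p}: "
          f"lambda_n ~ n^-{lam_exp:.3f}, rate n^-{risk_exp:.3f}")
```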

Statistical Analysis — Main Results for RERM II

Discussion:
- The assumptions on $E$ needed for consistency are minimal.
- More sophisticated algorithms can be devised from the oracle inequality. For example, $E$ could change with the sample size, too.
- To achieve the best learning rates, we need to know $\beta$.

Statistical Analysis — Learning Rates: Hyper-Parameters III

Training-Validation Approach:
- Assume that $L$ is clippable.
- Split the data into equally sized parts $D_1$ and $D_2$. We write $m := n/2$.
- Fix a finite set $\Lambda \subset (0, 1]$ of candidate values for $\lambda$.
- For each $\lambda \in \Lambda$ compute $f_{D_1,\lambda}$.
- Pick the $\lambda_{D_2} \in \Lambda$ such that $f_{D_1,\lambda_{D_2}}$ minimizes the empirical risk $\mathcal{R}_{L,D_2}$.

Observation: The approach performs RERM on $D_1$ and ERM over $\mathcal{F} := \{\, f_{D_1,\lambda} : \lambda \in \Lambda \,\}$ on $D_2$ (a minimal code sketch of this selection rule follows below).
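A minimal numpy sketch of the training/validation selection of $\lambda$, using a ridge-type RERM as an illustrative stand-in; all names and the data are invented:

```python
# Training/validation selection of lambda: fit RERM on D1 for each candidate
# lambda, then pick the lambda whose predictor has the smallest empirical risk
# on D2 (i.e. ERM over the finite set {f_{D1,lambda}}). Illustrative only.
import numpy as np

def fit_ridge(H, y, lam):
    n, m = H.shape
    return np.linalg.solve(H.T @ H / n + lam * np.eye(m), H.T @ y / n)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(200)
H = np.stack([x ** k for k in range(6)], axis=1)           # polynomial dictionary
H1, y1, H2, y2 = H[:100], y[:100], H[100:], y[100:]        # split into D1, D2

Lambda = [10.0 ** (-k) for k in range(6)]                  # finite candidate set
risks = {lam: np.mean((y2 - H2 @ fit_ridge(H1, y1, lam)) ** 2) for lam in Lambda}
lam_D2 = min(risks, key=risks.get)
print(lam_D2)
```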

Statistical Analysis — Learning Rates: Hyper-Parameters VI

Theorem: If $\Lambda_n$ is a polynomially growing $n^{-\alpha/2}$-net of $(0, 1]$, our TV-RERM is consistent and enjoys the same best rates as RERM, without knowing $\beta$.

Statistical Analysis — Summary

Positive Aspects:
- Finite sample estimates in the form of oracle inequalities.
- Consistency and learning rates.
- Adaptivity to the best learning rates the analysis can provide.
- The framework applies to a variety of algorithms, e.g. SVMs with Gaussian kernels.
- The analysis is very robust to changes in the scenario.

Negative Aspect: For RERM, the rates are never optimal! This analysis is out-dated.

Statistical Analysis — Learning Rates: Non-Optimality I

For RERM, with probability $P^n$ not less than $1 - e^{-\tau}$, we have
  $\lambda_n \|f_{D,\lambda_n}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda_n}) - \mathcal{R}^*_{L,P} \le C\,\tau\, n^{-\frac{\alpha\beta}{2(\alpha\beta+1)(1+p)}}.$  (3)
- In the proof of this result we used $\lambda_n \|f_{D,\lambda_n}\|_E^\alpha \le 1$, but (3) shows
  $\lambda_n \|f_{D,\lambda_n}\|_E^\alpha \le C\,\tau\, n^{-\frac{\alpha\beta}{2(\alpha\beta+1)(1+p)}}.$
  For large $n$ this estimate is sharper!
- Using the sharper estimate in the proof, we obtain a better learning rate.
- The argument can be iterated...

Statistical Analysis — Learning Rates: Non-Optimality II

Bernstein's Inequality: Let $(\Omega, \mathcal{A}, Q)$ be a probability space and $\xi_1, \dots, \xi_n : \Omega \to [-B, B]$ be independent random variables satisfying $\mathbb{E}_Q \xi_i = 0$ and $\mathbb{E}_Q \xi_i^2 \le \sigma^2$. Then, for all $\tau > 0$ and $n \ge 1$, we have
  $Q\Big( \Big|\frac{1}{n} \sum_{i=1}^n \xi_i\Big| \ge \sqrt{\frac{2\sigma^2 \tau}{n}} + \frac{2B\tau}{3n} \Big) \le 2 e^{-\tau}.$
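When the variance is small, Bernstein's deviation term is much smaller than Hoeffding's. A sketch comparing the two right-hand sides for invented values of $B$, $\sigma^2$, $\tau$:

```python
# Compare the Hoeffding and Bernstein deviation terms for centered variables
# bounded by B with variance at most sigma2; the values below are invented.
import math

def hoeffding_dev(B, tau, n):
    return 2 * B * math.sqrt(tau / (2 * n))          # range b - a = 2B

def bernstein_dev(B, sigma2, tau, n):
    return math.sqrt(2 * sigma2 * tau / n) + 2 * B * tau / (3 * n)

B, sigma2, tau = 1.0, 0.01, 3.0
for n in (100, 1000, 10000):
    print(n, hoeffding_dev(B, tau, n), bernstein_dev(B, sigma2, tau, n))
```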

Statistical Analysis — Learning Rates: Non-Optimality III

Some loss functions or distributions allow a variance bound
  $\mathbb{E}_P \big(L \circ f - L \circ f^*_{L,P}\big)^2 \le V \big( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \big)^\vartheta.$
Using Bernstein's inequality rather than Hoeffding's inequality then leads to an oracle inequality with
- a variance term, which is $O(n^{-1/2})$,
- a supremum term, which is $O(n^{-1})$.

Iteration in the proof:
- The initial analysis provides a small excess risk with high probability.
- The variance bound converts a small excess risk into a small variance.
- The variance term in the oracle inequality becomes smaller, leading to a faster rate...
Rates up to $O(n^{-1})$ become possible. The iteration can be avoided!

Statistical Analysis — Learning Rates: Non-Optimality IV

Further Reasons:
- The fact that $L$ is clippable should be used to obtain a smaller supremum term.
- $\|\cdot\|_\infty$-covering numbers provide a worst-case tool.

Statistical Analysis — Adaptivity of Standard SVMs

Theorem (Eberts & Steinwart, 2011): Consider an SVM with the least squares loss and Gaussian kernel $k_\sigma$. Pick $\lambda$ and $\sigma$ by a suitable training/validation approach. Then, for $m \in (d/2, \infty)$, the SVM learns every $f^*_{L,P} \in W^m(X)$ with the (essentially) optimal rate $n^{-\frac{2m}{2m+d} + \varepsilon}$ without knowing $m$.
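In practice, this kind of training/validation tuning of $\lambda$ and $\sigma$ is what cross-validated kernel methods do. A hedged sketch, assuming scikit-learn is available and using kernel ridge regression with a Gaussian (RBF) kernel as a stand-in for the least squares SVM; the data are invented:

```python
# Sketch: tune the regularization parameter and the Gaussian kernel width by
# validation, in the spirit of the adaptivity result above. Kernel ridge
# regression with an RBF kernel is a stand-in for the least squares SVM.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 1))
y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.standard_normal(300)

param_grid = {"alpha": [10.0 ** -k for k in range(1, 6)],   # plays the role of lambda
              "gamma": [10.0 ** k for k in range(-2, 3)]}    # ~ 1 / (2 * sigma^2)
search = GridSearchCV(KernelRidge(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```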

Advanced Statistical Analysis: ERM — Towards a Better Analysis for ERM I

Basic Setup:
- We consider ERM over a finite $\mathcal{F}$.
- We assume that a Bayes predictor $f^*_{L,P}$ exists.
- We consider the excess losses $h_f := L \circ f - L \circ f^*_{L,P}$. Thus $\mathbb{E}_P h_f = \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}$.
- Variance bound: $\mathbb{E}_P h_f^2 \le V (\mathbb{E}_P h_f)^\vartheta$.
- Supremum bound: $\|h_f\|_\infty \le B$.

Advanced Statistical Analysis: ERM — Towards a Better Analysis for ERM II

Decomposition: Let $f_P \in \mathcal{F}$ satisfy $\mathcal{R}_{L,P}(f_P) = \mathcal{R}^*_{L,P,\mathcal{F}}$. Then $\mathcal{R}_{L,D}(f_D) \le \mathcal{R}_{L,D}(f_P)$ implies $\mathbb{E}_D h_{f_D} \le \mathbb{E}_D h_{f_P}$. This yields
  $\mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,P}(f_P) = \mathbb{E}_P h_{f_D} - \mathbb{E}_P h_{f_P} \le \mathbb{E}_P h_{f_D} - \mathbb{E}_D h_{f_D} + \mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P}.$
We will estimate the two differences separately.

Advanced Statistical Analysis: ERM — Towards a Better Analysis for ERM III

Second Difference: We have $\mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P} = \mathbb{E}_D (h_{f_P} - \mathbb{E}_P h_{f_P})$.
- Centered: $\mathbb{E}_P (h_{f_P} - \mathbb{E}_P h_{f_P}) = 0$.
- Variance bound: $\mathbb{E}_P (h_{f_P} - \mathbb{E}_P h_{f_P})^2 \le \mathbb{E}_P h_{f_P}^2 \le V (\mathbb{E}_P h_{f_P})^\vartheta$.
- Supremum bound: $\|h_{f_P} - \mathbb{E}_P h_{f_P}\|_\infty \le 2B$.
Bernstein's inequality yields
  $\mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P} \le \sqrt{\frac{2\tau V (\mathbb{E}_P h_{f_P})^\vartheta}{n}} + \frac{4B\tau}{3n} \le \mathbb{E}_P h_{f_P} + \Big(\frac{2V\tau}{n}\Big)^{\frac{1}{2-\vartheta}} + \frac{4B\tau}{3n}.$

Advanced Statistical Analysis: ERM — Towards a Better Analysis for ERM IV

First Difference: To estimate the remaining term $\mathbb{E}_P h_{f_D} - \mathbb{E}_D h_{f_D}$, we define the functions
  $g_{f,r} := \frac{\mathbb{E}_P h_f - h_f}{\mathbb{E}_P h_f + r}, \quad f \in \mathcal{F},\ r > 0.$

Bernstein Conditions:
- Centered: $\mathbb{E}_P g_{f,r} = 0$.
- Variance bound: $\mathbb{E}_P g_{f,r}^2 \le \dfrac{\mathbb{E}_P h_f^2}{(\mathbb{E}_P h_f + r)^2} \le \dfrac{\mathbb{E}_P h_f^2}{r^{2-\vartheta} (\mathbb{E}_P h_f)^\vartheta} \le V r^{\vartheta - 2}$.
- Supremum bound: $\|g_{f,r}\|_\infty \le \|\mathbb{E}_P h_f - h_f\|_\infty\, r^{-1} \le 2B r^{-1}$.

Advanced Statistical Analysis: ERM — Towards a Better Analysis for ERM V

Application of Bernstein: With probability $P^n$ not smaller than $1 - |\mathcal{F}|\, e^{-\tau}$, we have
  $\sup_{f \in \mathcal{F}} \mathbb{E}_D g_{f,r} < \sqrt{\frac{2V\tau}{n\, r^{2-\vartheta}}} + \frac{4B\tau}{3nr}.$