Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013


1 Learning Theory. Ingo Steinwart, University of Stuttgart. September 4, 2013

2–3 Basics / Informal Introduction: Informal Description of Supervised Learning

$X$: space of input samples; $Y$: space of labels, usually $Y \subset \mathbb{R}$.
Already observed samples: $D = \bigl((x_1, y_1), \dots, (x_n, y_n)\bigr) \in (X \times Y)^n$.
Goal: with the help of $D$, find a function $f_D : X \to \mathbb{R}$ such that $f_D(x)$ is a good prediction of the label $y$ for new, unseen $x$.
Learning method: assigns to every training set $D$ a predictor $f_D : X \to \mathbb{R}$.

4–6 Basics / Informal Introduction: Illustration – Binary Classification

Problem: the labels are $\pm 1$. Goal: make few mistakes on future data.
Example: (figure not reproduced in the transcription)

7–8 Basics / Informal Introduction: Illustration – Regression

Problem: the labels are $\mathbb{R}$-valued. Goal: estimate the label $y$ for new data $x$ as accurately as possible.
Example: (figure not reproduced in the transcription)

9–10 Basics / Formalized Description: Data Generation

Assumptions
$P$ is an unknown probability measure on $X \times Y$.
$D = \bigl((x_1, y_1), \dots, (x_n, y_n)\bigr) \in (X \times Y)^n$ is sampled from $P^n$.
Future samples $(x, y)$ will also be sampled from $P$.

Consequences
The label $y$ for a given $x$ is, in general, not deterministic.
The past and the future look the same.
We seek algorithms that work well for many (or even all) $P$.

11 Basics / Formalized Description: Performance Evaluation I

Loss Function
$L : X \times Y \times \mathbb{R} \to [0, \infty)$ measures the cost (loss) $L(x, y, t)$ of predicting the label $y$ by the value $t$ at the point $x$.

Interpretation
As the name suggests, we prefer predictions with small loss.
$L$ is chosen by us.
Since future $(x, y)$ are random, it makes sense to consider the average loss of a predictor.

12 Basics / Formalized Description: Performance Evaluation II

Risk
The risk of a predictor $f : X \to \mathbb{R}$ is the average loss
\[ \mathcal{R}_{L,P}(f) := \int_{X \times Y} L\bigl(x, y, f(x)\bigr) \, dP(x, y). \]
For $D = \bigl((x_1, y_1), \dots, (x_n, y_n)\bigr)$ the empirical risk is
\[ \mathcal{R}_{L,D}(f) := \frac{1}{n} \sum_{i=1}^n L\bigl(x_i, y_i, f(x_i)\bigr). \]

Interpretation
By the law of large numbers, we have $P$-almost surely
\[ \mathcal{R}_{L,P}(f) = \lim_{|D| \to \infty} \mathcal{R}_{L,D}(f). \]
Thus $\mathcal{R}_{L,P}(f)$ is the long-term average future loss when using $f$.
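
A minimal sketch (my own illustration, not from the slides) of these two quantities in code: the empirical risk is just the average loss over the sample, and on a large sample it approximates the risk, as the law of large numbers above states. The loss, predictor, and data below are assumed toy choices.

```python
import numpy as np

def empirical_risk(loss, f, X, y):
    """Empirical risk R_{L,D}(f): average loss of the predictor f on the sample D = (X, y)."""
    return np.mean([loss(x_i, y_i, f(x_i)) for x_i, y_i in zip(X, y)])

# Least squares loss L(x, y, t) = (y - t)^2 and a toy predictor.
squared_loss = lambda x, y, t: (y - t) ** 2
f = lambda x: 2.0 * x

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=1000)
y = 2.0 * X + rng.normal(scale=0.1, size=1000)
print(empirical_risk(squared_loss, f, X, y))  # close to the noise variance 0.01
```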

13–14 Basics / Formalized Description: Performance Evaluation III

Bayes Risk and Bayes Predictor
The Bayes risk is the smallest possible risk
\[ \mathcal{R}^*_{L,P} := \inf\{\, \mathcal{R}_{L,P}(f) \mid f : X \to \mathbb{R} \text{ (measurable)} \,\}. \]
A Bayes predictor is any function $f^*_{L,P} : X \to \mathbb{R}$ that satisfies $\mathcal{R}_{L,P}(f^*_{L,P}) = \mathcal{R}^*_{L,P}$.

Interpretation
We will never find a predictor whose risk is smaller than $\mathcal{R}^*_{L,P}$.
We seek a predictor $f : X \to \mathbb{R}$ whose excess risk $\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}$ is close to $0$.

15–16 Basics / Formalized Description: Performance Evaluation IV

Best Naïve Risk
The best naïve risk is the smallest risk one obtains by ignoring $X$:
\[ \mathcal{R}^{\mathrm{naive}}_{L,P} := \inf\{\, \mathcal{R}_{L,P}(c \mathbf{1}_X) \mid c \in \mathbb{R} \,\}. \]

Remarks
The best naïve risk (and its minimizer) is usually easy to estimate.
Using fancy learning algorithms only makes sense if $\mathcal{R}^*_{L,P} < \mathcal{R}^{\mathrm{naive}}_{L,P}$.

Equality
Typically: $\mathcal{R}^*_{L,P} = \mathcal{R}^{\mathrm{naive}}_{L,P}$ iff there is a constant Bayes predictor.
If $P = P_X \otimes P_Y$, then $\mathcal{R}^*_{L,P} = \mathcal{R}^{\mathrm{naive}}_{L,P}$, but the converse is false.

17–19 Basics / Formalized Description: Learning Goals I

Binary Classification: $Y = \{-1, 1\}$
$L(y, t) := \mathbf{1}_{(-\infty, 0]}(y \,\mathrm{sign}\, t)$ penalizes predictions $t$ with $\mathrm{sign}\, t \neq y$.
$\mathcal{R}_{L,P}(f) = P\bigl(\{(x, y) : \mathrm{sign} f(x) \neq y\}\bigr)$.

Optimal Risk
Let $\eta(x) := P(Y = 1 \mid x)$ be the probability of a positive label at $x \in X$.
Bayes risk: $\mathcal{R}^*_{L,P} = \mathbb{E}_{P_X} \min\{\eta, 1 - \eta\}$.
$f$ is a Bayes predictor iff $(2\eta - 1) \,\mathrm{sign} f \geq 0$.

Naïve Risk
Naïve risk: $\mathcal{R}^{\mathrm{naive}}_{L,P} = \min\{P(Y = 1),\, 1 - P(Y = 1)\}$.
$\mathcal{R}^*_{L,P} = \mathcal{R}^{\mathrm{naive}}_{L,P}$ iff $\eta \geq 1/2$ or $\eta \leq 1/2$.
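
The following toy simulation (assumed posterior $\eta$ and marginal $P_X$, not from the slides) illustrates the two formulas above: the plug-in Bayes predictor $\mathrm{sign}(2\eta - 1)$ and the Bayes risk $\mathbb{E}_{P_X}\min\{\eta, 1-\eta\}$ agree with the error rate observed on labels drawn from $\eta$.

```python
import numpy as np

# Toy illustration: eta(x) = P(Y = 1 | x), Bayes predictor sign(2*eta - 1),
# Bayes risk E_{P_X} min(eta, 1 - eta).
rng = np.random.default_rng(1)
eta = lambda x: 1.0 / (1.0 + np.exp(-4.0 * (x - 0.5)))   # assumed toy posterior

x = rng.uniform(0, 1, size=200_000)                       # P_X = uniform on [0, 1]
y = np.where(rng.uniform(size=x.size) < eta(x), 1, -1)    # labels drawn from eta

f_bayes = np.sign(2.0 * eta(x) - 1.0)                     # Bayes predictor
empirical_error = np.mean(f_bayes != y)                   # empirical classification risk
bayes_risk = np.mean(np.minimum(eta(x), 1.0 - eta(x)))    # Monte Carlo estimate of R*_{L,P}
print(empirical_error, bayes_risk)                        # the two numbers agree closely
```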

20–22 Basics / Formalized Description: Learning Goals II

Least Squares Regression: $Y \subset \mathbb{R}$
$L(y, t) := (y - t)^2$.
Conditional expectation: $\mu_P(x) := \mathbb{E}_P(Y \mid x)$.
Conditional variance: $\sigma_P^2(x) := \mathbb{E}_P(Y^2 \mid x) - \mu_P^2(x)$.

Optimal Risk
$\mu_P$ is the only Bayes predictor and $\mathcal{R}^*_{L,P} = \mathbb{E}_{P_X} \sigma_P^2$.
Excess risk: $\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} = \| f - \mu_P \|^2_{L_2(P_X)}$.
Least squares regression aims at estimating the conditional mean.

Naïve Risk
Naïve risk: $\mathcal{R}^{\mathrm{naive}}_{L,P} = \mathrm{var}_P\, Y$.
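
A quick Monte Carlo check (assumed toy distribution, not from the slides) of the excess-risk identity for least squares: for an arbitrary predictor $f$, the excess risk matches $\|f - \mu_P\|^2_{L_2(P_X)}$ up to sampling error.

```python
import numpy as np

# Toy check of R_{L,P}(f) - R*_{L,P} = ||f - mu_P||^2_{L_2(P_X)} with mu_P(x) = E(Y | x).
rng = np.random.default_rng(2)
n = 500_000
x = rng.uniform(0, 1, size=n)            # P_X = uniform on [0, 1]
mu = np.sin(2 * np.pi * x)               # conditional mean mu_P(x)
y = mu + rng.normal(scale=0.3, size=n)   # homoscedastic noise, sigma_P^2 = 0.09

f = lambda t: 0.5 * t                    # an arbitrary predictor
excess_risk = np.mean((y - f(x)) ** 2) - np.mean((y - mu) ** 2)
l2_distance = np.mean((f(x) - mu) ** 2)  # ||f - mu_P||^2_{L_2(P_X)}
print(excess_risk, l2_distance)          # nearly identical up to Monte Carlo error
```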

23–25 Basics / Formalized Description: Learning Goals III

Absolute Value Regression: $Y \subset \mathbb{R}$
$L(y, t) := |y - t|$.
Conditional medians: $m_P(x) := \mathrm{median}_P(Y \mid x)$.

Optimal Risk
The medians $m_P$ are the only Bayes predictors.
Excess risk: $\mathcal{R}_{L,P}(f_n) - \mathcal{R}^*_{L,P} \to 0$ implies $f_n \to m_P$ in probability $P_X$.
Absolute value regression aims at estimating the conditional median.

Naïve Risk
Naïve risk: attained by the constant predictor $\mathrm{median}_P\, Y$, i.e. $\mathcal{R}^{\mathrm{naive}}_{L,P} = \mathbb{E}_P |Y - \mathrm{median}_P\, Y|$.
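
A small numerical illustration (my own, not from the slides) that the median, not the mean, minimizes the expected absolute value loss among constant predictions, here for a skewed distribution.

```python
import numpy as np

# For a skewed Y, the constant t minimizing E|Y - t| is the median of Y.
rng = np.random.default_rng(10)
y = rng.exponential(scale=1.0, size=500_000)      # skewed: mean 1, median ln 2

ts = np.linspace(0, 2, 401)
risks = [np.mean(np.abs(y - t)) for t in ts]      # empirical risk of each constant
t_best = ts[int(np.argmin(risks))]
print(t_best, np.median(y), np.mean(y))           # t_best is close to the median
```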

26–27 Basics / What is learning theory about? Questions in Statistical Learning I

Asymptotic Learning
A learning method is called universally consistent if
\[ \lim_{n \to \infty} \mathcal{R}_{L,P}(f_D) = \mathcal{R}^*_{L,P} \quad \text{in probability} \tag{1} \]
for every probability measure $P$ on $X \times Y$.

Good News
Many learning methods are universally consistent.
First result: Stone (1977), Annals of Statistics.

28–30 Basics / What is learning theory about? Questions in Statistical Learning II

Learning Rates
A learning method learns for a distribution $P$ with rate $a_n \to 0$ if
\[ \mathbb{E}_{D \sim P^n} \mathcal{R}_{L,P}(f_D) \leq \mathcal{R}^*_{L,P} + C_P a_n, \qquad n \geq 1. \]
Similar: learning rates in probability.

Bad News (Devroye, 1982, IEEE TPAMI)
If $|X| = \infty$, $|Y| \geq 2$, and $L$ is non-trivial, then it is impossible to obtain a learning rate that is independent of $P$.

Remark
If $|X| < \infty$, then it is usually easy to obtain a uniform learning rate for which $C_P$ only depends on $|X|$.

31–32 Basics / What is learning theory about? Questions in Statistical Learning III

Relative Learning Rates
Let $\mathcal{P}$ be a set of distributions on $X \times Y$. A learning method learns $\mathcal{P}$ with rate $a_n \to 0$ if, for all $P \in \mathcal{P}$,
\[ \mathbb{E}_{D \sim P^n} \mathcal{R}_{L,P}(f_D) \leq \mathcal{R}^*_{L,P} + C_P a_n, \qquad n \geq 1. \]
The rate $(a_n)$ is minmax optimal if, in addition, there is no learning method that learns $\mathcal{P}$ with a rate $(b_n)$ such that $b_n / a_n \to 0$.

Tasks
Identify interesting ("realistic") classes $\mathcal{P}$ with good optimal rates.
Find learning algorithms that achieve these rates.

33–34 Basics / What is learning theory about? Example of Optimal Rates

Classical Least Squares Example
$X = [0, 1]^d$, $Y = [-1, 1]$, $L$ is the least squares loss.
$W^m$ is the Sobolev space on $X$ with order of smoothness $m > d/2$.
$\mathcal{P}$ is the set of $P$ such that $f^*_{L,P} \in W^m$ with norm bounded by $K$.
Optimal rate is $n^{-\frac{2m}{2m+d}}$.

Remarks
The smoother the target $\mu_P = f^*_{L,P}$ is, the better it can be learned.
The larger the input dimension is, the harder learning becomes.
There exist various learning algorithms achieving the optimal rate.
They usually require us to know $m$ in advance.

35–36 Basics / What is learning theory about? Questions in Statistical Learning IV

Assumptions for Adaptivity
Usually one has a family $(\mathcal{P}_\theta)_{\theta \in \Theta}$ of large sets $\mathcal{P}_\theta$ of distributions.
Each set $\mathcal{P}_\theta$ has its own optimal rate.
We don't know whether $P \in \mathcal{P}_\theta$ for some $\theta$, but we hope so.
If $P \in \mathcal{P}_\theta$, we don't know $\theta$ and we have no means to estimate it.

Task
We seek learning algorithms that
are universally consistent, and
learn all $\mathcal{P}_\theta$ with the optimal rate without knowing $\theta$.
Such learning algorithms are adaptive to the unknown $\theta$.

37–38 Basics / What is learning theory about? Questions in Statistical Learning V

Finite Sample Estimates
Assume that our algorithm has some hyper-parameters $\lambda \in \Lambda$.
For each $P$, $\lambda$, $\delta \in (0, 1)$ and $n \geq 1$ we seek an $\varepsilon(P, \lambda, \delta, n)$ such that
\[ \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} \leq \varepsilon(P, \lambda, \delta, n) \]
with probability $P^n$ not smaller than $1 - \delta$.

Remarks
If there exists a sequence $(\lambda_n)$ with $\lim_{n \to \infty} \varepsilon(P, \lambda_n, \delta, n) = 0$ for all $P$ and $\delta$, then the algorithm can be made universally consistent.
We automatically obtain learning rates for such sequences.
If $|X| = \infty$ and ..., then such $\varepsilon(P, \lambda, \delta, n)$ must depend on $P$.

39–40 Basics / What is learning theory about? Questions in Statistical Learning VI

Generalization Error Bounds
Goal: estimate the risk $\mathcal{R}_{L,P}(f_{D,\lambda})$ by the performance of $f_{D,\lambda}$ on $D$.
Find $\varepsilon(\lambda, \delta, n)$ such that with probability $P^n$ not smaller than $1 - \delta$:
\[ \mathcal{R}_{L,P}(f_{D,\lambda}) \leq \mathcal{R}_{L,D}(f_{D,\lambda}) + \varepsilon(\lambda, \delta, n). \]

Remarks
$\varepsilon(\lambda, \delta, n)$ must not depend on $P$, since we do not know $P$.
$\varepsilon(\lambda, \delta, n)$ can be used to derive parameter selection strategies such as structural risk minimization.
Alternative: use a second data set $D'$ and $\mathcal{R}_{L,D'}(f_{D,\lambda})$ as an estimate.

41–42 Basics / What is learning theory about? Summary

A good learning algorithm:
Is universally consistent.
Is adaptive for realistic classes of distributions.
Can be modified to new problems that have a different loss.
Has a good record on real-world problems.
Runs efficiently on a computer.
...

43–44 Definition and Examples: Empirical Risk Minimization

Definition
Let $\mathcal{F}$ be a set of functions $X \to \mathbb{R}$. A learning method whose predictors satisfy $f_D \in \mathcal{F}$ and
\[ \mathcal{R}_{L,D}(f_D) = \min_{f \in \mathcal{F}} \mathcal{R}_{L,D}(f) \]
is called empirical risk minimization (ERM).

Remarks
Not every $\mathcal{F}$ makes ERM possible.
ERM is, in general, not unique.
ERM may not be computationally feasible.
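
A minimal ERM sketch under assumed toy data (not from the slides): the class $\mathcal{F}$ is a finite grid of threshold classifiers and the empirical 0-1 risk is minimized by brute force, which is feasible here precisely because $\mathcal{F}$ is small.

```python
import numpy as np

# ERM over a finite class F of threshold classifiers f_s(x) = sign(x - s), 0-1 loss.
rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 1, size=n)
y = np.where(x > 0.3, 1, -1)
y[rng.uniform(size=n) < 0.1] *= -1          # 10% label noise

thresholds = np.linspace(0, 1, 101)          # the finite class F
def emp_risk(s):                             # empirical 0-1 risk of f_s on D
    return np.mean(np.where(x > s, 1, -1) != y)

s_erm = min(thresholds, key=emp_risk)        # ERM: minimize the empirical risk over F
print(s_erm, emp_risk(s_erm))                # close to the true threshold 0.3
```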

45 Definition and Examples: Empirical Risk Minimization – Danger of Underfitting

ERM can never produce predictors with risk better than
\[ \mathcal{R}^*_{L,P,\mathcal{F}} := \inf\{\, \mathcal{R}_{L,P}(f) : f \in \mathcal{F} \,\}. \]
Example: if $L$ is the least squares loss, $X = [0, 1]$, $P_X$ the uniform distribution, $f^*_{L,P}$ not linear, and $\mathcal{F}$ the set of linear functions, then $\mathcal{R}^*_{L,P,\mathcal{F}} > \mathcal{R}^*_{L,P}$, and thus ERM cannot be consistent.

46 Definition and Examples: Empirical Risk Minimization – Danger of Overfitting

If $\mathcal{F}$ is too large, ERM may overfit.
Example: $L$ least squares, $X = [0, 1]$, $P_X$ the uniform distribution, $f^*_{L,P} = \mathbf{1}_X$, $\mathcal{R}^*_{L,P} = 0$, and $\mathcal{F}$ the set of all functions. Then
\[ f_D(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i, \\ 0 & \text{otherwise} \end{cases} \]
satisfies $\mathcal{R}_{L,D}(f_D) = 0$ but $\mathcal{R}_{L,P}(f_D) = 1$.
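
The overfitting example can be reproduced directly; the sample sizes below are my own choice, the predictor is the memorizing $f_D$ from the slide.

```python
import numpy as np

# Memorizing predictor: zero empirical risk, but risk 1 under least squares,
# because new points almost surely differ from the training points.
rng = np.random.default_rng(4)
x_train = rng.uniform(0, 1, size=50)
y_train = np.ones_like(x_train)                   # f*_{L,P} = 1_X, no noise

def f_D(x):                                       # memorize D, predict 0 elsewhere
    seen = np.isin(x, x_train)                    # exact match with a training point
    return np.where(seen, 1.0, 0.0)

x_test = rng.uniform(0, 1, size=10_000)
print(np.mean((f_D(x_train) - y_train) ** 2))             # empirical risk: 0.0
print(np.mean((f_D(x_test) - np.ones_like(x_test)) ** 2)) # risk estimate: 1.0
```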

47–48 Definition and Examples: Summary of Last Session

The risk of a predictor $f : X \to \mathbb{R}$ is
\[ \mathcal{R}_{L,P}(f) := \int_{X \times Y} L\bigl(x, y, f(x)\bigr) \, dP(x, y). \]
The Bayes risk $\mathcal{R}^*_{L,P}$ is the smallest possible risk.
A Bayes predictor $f^*_{L,P}$ achieves this minimal risk.

Learning means $\mathcal{R}_{L,P}(f_D) \to \mathcal{R}^*_{L,P}$.
Asymptotically, this is possible, but no uniform rates are possible.
We seek adaptive learning algorithms. Ideally, these are fully automated.

49–50 Definition and Examples: Regularized ERM

Definition
Let $\mathcal{F}$ be a non-empty set of functions $X \to \mathbb{R}$ and $\Upsilon : \mathcal{F} \to [0, \infty)$ be a map. A learning method whose predictors satisfy $f_D \in \mathcal{F}$ and
\[ \Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) = \inf_{f \in \mathcal{F}} \bigl( \Upsilon(f) + \mathcal{R}_{L,D}(f) \bigr) \]
is called regularized empirical risk minimization (RERM).

Remarks
$\Upsilon = 0$ yields ERM.
All remarks about ERM apply to RERM, too.

51–52 Definition and Examples: Examples of Regularized ERM I

General Dictionary Methods
For bounded $h_1, \dots, h_m : X \to \mathbb{R}$ consider
\[ \mathcal{F} := \Bigl\{\, f_c := \sum_{i=1}^m c_i h_i : (c_1, \dots, c_m) \in \mathbb{R}^m \,\Bigr\}. \]

Examples of Regularizers
$\ell_1$-regularization: $\Upsilon(f_c) = \lambda \|c\|_1 = \lambda \sum_{i=1}^m |c_i|$,
$\ell_2$-regularization: $\Upsilon(f_c) = \lambda \|c\|_2^2 = \lambda \sum_{i=1}^m |c_i|^2$,
$\ell_\infty$-regularization: $\Upsilon(f_c) = \lambda \|c\|_\infty = \lambda \max_i |c_i|$,
or, in the case of linearly dependent $h_i$, we take the infimum over all representations.
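
For the least squares loss, RERM with the $\ell_2$ regularizer over a dictionary has a closed form (ridge regression on the design matrix built from $h_1, \dots, h_m$). The sketch below uses an assumed cosine dictionary and toy data, not an example from the slides.

```python
import numpy as np

# RERM: minimize lam*||c||_2^2 + (1/n) sum_i (y_i - f_c(x_i))^2 over c in R^m.
rng = np.random.default_rng(5)
n, lam = 100, 0.1
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

dictionary = [lambda t, k=k: np.cos(np.pi * k * t) for k in range(8)]  # h_1,...,h_m
H = np.column_stack([h(x) for h in dictionary])                        # n x m design matrix

# Closed-form solution: (H^T H / n + lam * I) c = H^T y / n.
m = H.shape[1]
c = np.linalg.solve(H.T @ H / n + lam * np.eye(m), H.T @ y / n)
f_c = lambda t: np.column_stack([h(t) for h in dictionary]) @ c        # RERM predictor
print(np.round(c, 3), f_c(np.array([0.25, 0.5])))
```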

53 Definition and Examples: Examples of Regularized ERM II

Further Examples
Support Vector Machines
Regularized Decision Trees
...

54–56 Norm Regularizers / Regularized ERM: Norm Regularizers

Conventions
Whenever we consider regularizers, they will be of the form
\[ \Upsilon(f) = \lambda \|f\|_E^\alpha, \qquad f \in \mathcal{F}, \]
where $\alpha \geq 1$ and $E := \mathcal{F}$ is a vector space of functions $X \to \mathbb{R}$.
In this case, we additionally assume that $\|f\|_\infty \leq \|f\|_E$ for all $f \in E$.
In the following, we assume that the optimization problem also has a solution $f_P$ when we replace $D$ by $P$:
\[ f_P \in \arg\min_{f \in \mathcal{F}} \Upsilon(f) + \mathcal{R}_{L,P}(f). \]

57–61 Towards the Statistical Analysis: The Classical Argument I

Ansatz
Assume that we have a data set $D$ and an $\varepsilon > 0$ such that
\[ \sup_{f \in \mathcal{F}} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| \leq \varepsilon. \]
Then we obtain
\begin{align*}
\Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) &= \Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,D}(f_D) + \mathcal{R}_{L,D}(f_D) \\
&\leq \Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) + \varepsilon \\
&\leq \Upsilon(f_P) + \mathcal{R}_{L,D}(f_P) + \varepsilon \\
&\leq \Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) + 2\varepsilon.
\end{align*}
Here the first and last inequalities use the uniform bound, and the middle one uses that $f_D$ minimizes the regularized empirical risk.

62–63 Towards the Statistical Analysis: The Classical Argument II

Discussion
The uniform bound
\[ \sup_{f \in \mathcal{F}} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| \leq \varepsilon \tag{2} \]
led to the inequality
\[ \Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) - \mathcal{R}^*_{L,P} \leq \Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) - \mathcal{R}^*_{L,P} + 2\varepsilon. \]
Since $\Upsilon(f_D) \geq 0$, all that remains to be done is to estimate
the probability of (2), and
the regularization error $\Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) - \mathcal{R}^*_{L,P}$.

64–65 Towards the Statistical Analysis: The Classical Argument III

Union Bound
Assume that $\mathcal{F}$ is finite. The union bound gives
\begin{align*}
P\Bigl(D : \sup_{f \in \mathcal{F}} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| \leq \varepsilon \Bigr)
&= 1 - P\Bigl(D : \sup_{f \in \mathcal{F}} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| > \varepsilon \Bigr) \\
&\geq 1 - \sum_{f \in \mathcal{F}} P\bigl(D : \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| > \varepsilon \bigr).
\end{align*}

Consequences
It suffices to bound $P\bigl(D : | \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) | > \varepsilon \bigr)$ for all $f$.
No assumptions on $P$ are made so far. In particular, so far the data $D$ does not need to be i.i.d. nor even random.

66–67 Towards the Statistical Analysis: The Classical Argument IV

Hoeffding's Inequality
Let $(\Omega, \mathcal{A}, Q)$ be a probability space and $\xi_1, \dots, \xi_n : \Omega \to [a, b]$ be independent random variables. Then, for all $\tau > 0$, $n \geq 1$, we have
\[ Q\Bigl( \Bigl| \frac{1}{n} \sum_{i=1}^n \bigl( \xi_i - \mathbb{E}_Q \xi_i \bigr) \Bigr| \geq (b - a) \sqrt{\frac{\tau}{2n}} \Bigr) \leq 2 e^{-\tau}. \]

Application
Consider $\Omega := (X \times Y)^n$ and $Q := P^n$. For $\xi_i(D) := L(x_i, y_i, f(x_i))$ we have $a = 0$ and
\[ \frac{1}{n} \sum_{i=1}^n \bigl( \xi_i - \mathbb{E}_{P^n} \xi_i \bigr) = \mathcal{R}_{L,D}(f) - \mathcal{R}_{L,P}(f). \]
Assuming $L(x, y, f(x)) \leq B$ makes the application of Hoeffding's inequality possible.
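
A quick Monte Carlo sanity check (assumed bounded variables on $[0, 1]$, my own setup) of Hoeffding's inequality in the form stated above: the observed deviation probability stays below $2e^{-\tau}$.

```python
import numpy as np

# Deviation probability of the mean of n i.i.d. uniforms on [a, b] = [0, 1].
rng = np.random.default_rng(6)
n, tau, trials = 200, 2.0, 50_000
threshold = (1.0 - 0.0) * np.sqrt(tau / (2 * n))        # (b - a) * sqrt(tau / (2n))

samples = rng.uniform(0, 1, size=(trials, n))            # true mean 0.5
deviations = np.abs(samples.mean(axis=1) - 0.5)
print(np.mean(deviations >= threshold), 2 * np.exp(-tau))  # empirical prob <= bound
```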

68–69 Towards the Statistical Analysis: The Classical Argument V

Theorem for ERM
Let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss, $\mathcal{F}$ be a non-empty finite set of functions $f : X \to \mathbb{R}$, and $B > 0$ be a constant such that
\[ L(x, y, f(x)) \leq B, \qquad (x, y) \in X \times Y,\ f \in \mathcal{F}. \]
Then we have
\[ P^n\Bigl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\tfrac{2\tau + 2\ln(2|\mathcal{F}|)}{n}} \Bigr) \geq 1 - e^{-\tau}. \]

Remarks
The theorem does not specify the approximation error $\mathcal{R}^*_{L,P,\mathcal{F}} - \mathcal{R}^*_{L,P}$.
If $|\mathcal{F}| = \infty$, the bound becomes meaningless.
What happens if we consider RERM with a non-trivial regularizer?
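
The theorem can be checked empirically for a small finite class; the sketch below uses an assumed threshold class under the 0-1 loss (so $B = 1$) and compares the excess risk of ERM with the additive term of the bound.

```python
import numpy as np

# Finite-F ERM: check R_{L,P}(f_D) < R*_{L,P,F} + B*sqrt((2*tau + 2*ln(2|F|))/n).
rng = np.random.default_rng(9)
n, tau, B = 500, 3.0, 1.0
thresholds = np.linspace(0, 1, 51)                    # finite class F of f_s(x) = sign(x - s)

def risks(x, y):                                       # 0-1 risks of all f_s on a sample
    return np.array([np.mean(np.where(x > s, 1, -1) != y) for s in thresholds])

x_pop = rng.uniform(0, 1, size=200_000)                # large sample as a proxy for P
clean = np.where(x_pop > 0.4, 1, -1)
y_pop = np.where(rng.uniform(size=x_pop.size) < 0.9, clean, -clean)   # 10% label noise
x, y = x_pop[:n], y_pop[:n]                            # training sample D

f_D = int(np.argmin(risks(x, y)))                      # ERM over F
pop_risks = risks(x_pop, y_pop)
excess = pop_risks[f_D] - pop_risks.min()              # approx. R_P(f_D) - R*_{L,P,F}
bound = B * np.sqrt((2 * tau + 2 * np.log(2 * len(thresholds))) / n)
print(excess, bound)                                   # excess risk is below the bound
```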

70–71 Towards the Statistical Analysis: ERM for Infinite F – The General Approach

So far...
The union bound was the trick to pass from an estimate of $|\mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f)| \leq \varepsilon$ for a single $f$ to all $f \in \mathcal{F}$.
For infinite $\mathcal{F}$, this does not work!

General Approach
Given some $\delta > 0$, find a finite set $\mathcal{N}_\delta$ of functions such that
\[ \sup_{f \in \mathcal{F}} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| \leq \sup_{f \in \mathcal{N}_\delta} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| + \delta. \]
Then apply the union bound for $\mathcal{N}_\delta$. The rest remains unchanged.

72–73 Towards the Statistical Analysis: ERM for Infinite F – The General Approach

The old inequality
\[ P^n\Bigl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\tfrac{2\tau + 2\ln(2|\mathcal{F}|)}{n}} \Bigr) \geq 1 - e^{-\tau}. \]
The new inequality
\[ P^n\Bigl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\tfrac{2\tau + 2\ln(2|\mathcal{N}_\delta|)}{n}} + \delta \Bigr) \geq 1 - e^{-\tau}. \]

Tasks
For each $\delta > 0$, find a small set $\mathcal{N}_\delta$.
Optimize the right-hand side with respect to $\delta$.

74–75 Towards the Statistical Analysis: Covering Numbers

Definition
Let $(M, d)$ be a metric space, $A \subset M$, and $\varepsilon > 0$. The $\varepsilon$-covering number of $A$ is defined by
\[ \mathcal{N}(A, d, \varepsilon) := \inf\Bigl\{ n \geq 1 : \exists\, x_1, \dots, x_n \in M \text{ such that } A \subset \bigcup_{i=1}^n B_d(x_i, \varepsilon) \Bigr\}, \]
where $\inf \emptyset := \infty$ and $B_d(x_i, \varepsilon)$ is the ball with radius $\varepsilon$ and center $x_i$.
$x_1, \dots, x_n$ is called an $\varepsilon$-net.
$\mathcal{N}(A, d, \varepsilon)$ is the size of the smallest $\varepsilon$-net.
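
Covering numbers are rarely computed exactly, but a greedy construction gives an $\varepsilon$-net whose size upper-bounds $\mathcal{N}(A, d, \varepsilon)$. The sketch below (my own illustration) does this for a finite point cloud in $\mathbb{R}^2$ with the Euclidean metric.

```python
import numpy as np

def greedy_eps_net(points, eps):
    """Greedy epsilon-net: every point ends up within eps of some chosen center."""
    net = []
    for p in points:
        # add p as a new center if it is not yet covered by an existing one
        if all(np.linalg.norm(p - c) > eps for c in net):
            net.append(p)
    return np.array(net)

rng = np.random.default_rng(7)
A = rng.uniform(0, 1, size=(2000, 2))           # bounded subset of R^2
for eps in (0.2, 0.1, 0.05):
    print(eps, len(greedy_eps_net(A, eps)))      # grows roughly like eps**(-2)
```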

76–77 Towards the Statistical Analysis: Covering Numbers II

Every bounded $A \subset \mathbb{R}^d$ satisfies
\[ \mathcal{N}(A, \|\cdot\|, \varepsilon) \leq c\, \varepsilon^{-d}, \qquad \varepsilon > 0, \]
where $c > 0$ is a constant and the choice of norm only influences $c$.

For sets $\mathcal{F}$ of functions $f : X \to \mathbb{R}$, the behaviour of $\mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$ may be very different!
The literature is full of estimates of $\ln \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$. A typical estimate looks like
\[ \ln \mathcal{N}(B_E, \|\cdot\|_\infty, \varepsilon) \leq c\, \varepsilon^{-2p}, \qquad \varepsilon > 0. \]
Here $p$ may depend on the input dimension and the smoothness of the functions in $E$.

78–79 Towards the Statistical Analysis: ERM with Infinite Sets

Theorem
Let $L$ be Lipschitz continuous in its third argument with Lipschitz constant $1$.
Assume that $\|L \circ f\|_\infty \leq B$ for all $f \in \mathcal{F}$.
Let $\mathcal{N}_\varepsilon$ be a minimal $\varepsilon$-net of $\mathcal{F}$, i.e. $|\mathcal{N}_\varepsilon| = \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$.
Then we have
\[ P^n\Bigl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\tfrac{2\tau + 2\ln(2|\mathcal{N}_\varepsilon|)}{n}} + 2\varepsilon \Bigr) \geq 1 - e^{-\tau}. \]

80–82 Towards the Statistical Analysis: Using Covering Numbers VII

Example
Let $L$ satisfy the assumptions of the previous theorem, and let $\mathcal{F}$ be a set of functions with $\ln \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \leq c\, \varepsilon^{-2p}$.
Then we have
\[ P^n\Bigl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\tfrac{2\tau + 4 c \varepsilon^{-2p}}{n}} + 2\varepsilon \Bigr) \geq 1 - e^{-\tau}. \]
Optimizing with respect to $\varepsilon$ gives a constant $K_p$ such that
\[ P^n\Bigl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + K_p c^{\frac{1}{2+2p}} B\, \tau\, n^{-\frac{1}{2+2p}} \Bigr) \geq 1 - e^{-\tau}. \]
For ERM over finite $\mathcal{F}$, we had $p = 0$.

83–84 Towards the Statistical Analysis: Standard Analysis for RERM

Difficulties when Analyzing RERM
We are interested in RERMs where $\mathcal{F}$ is a vector space $E$.
Vector spaces $E$ are never compact, thus $\ln \mathcal{N}(E, \|\cdot\|_\infty, \varepsilon) = \infty$.
It seems that our approach does not work in this case.

Solution
RERM actually solves its optimization problem
\[ \Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) = \inf_{f \in E} \bigl( \Upsilon(f) + \mathcal{R}_{L,D}(f) \bigr) \]
over a set which is significantly smaller than $E$.

85–86 Towards the Statistical Analysis: Norm Bound for RERM

Lemma
Assume that $L(x, y, 0) \leq 1$. Then, for any RERM predictor $f_{D,\lambda} \in E$ we have
\[ \|f_{D,\lambda}\|_E \leq \lambda^{-1/\alpha}. \]

Consequence
The RERM optimization problem is actually solved over the ball with radius $\lambda^{-1/\alpha}$.

87–91 Towards the Statistical Analysis: Norm Bound for RERM II

Proof
Our assumptions $L(x, y, t) \geq 0$ and $L(x, y, 0) \leq 1$ yield
\begin{align*}
\lambda \|f_{D,\lambda}\|_E^\alpha &\leq \lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,D}(f_{D,\lambda}) \\
&= \inf_{f \in E} \bigl( \lambda \|f\|_E^\alpha + \mathcal{R}_{L,D}(f) \bigr) \\
&\leq \lambda \|0\|_E^\alpha + \mathcal{R}_{L,D}(0) \\
&\leq 1.
\end{align*}

92–93 Statistical Analysis: An Oracle Inequality

Theorem (Example)
$L$ Lipschitz continuous with $|L|_1 \leq 1$ and $L(x, y, 0) \leq 1$.
$E$ a vector space with norm $\|\cdot\|_E$ satisfying $\|\cdot\|_\infty \leq \|\cdot\|_E$.
$\Upsilon(f) = \lambda \|f\|_E^\alpha$.
We have $\ln \mathcal{N}(B_E, \|\cdot\|_\infty, \varepsilon) \leq c\, \varepsilon^{-2p}$.
Then, for all $n \geq 1$, $\lambda \in (0, 1]$, $\tau \geq 1$, we have
\[ \lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) < \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) + K_p c^{\frac{1}{2+2p}}\, \tau\, \lambda^{-\frac{1}{\alpha}}\, n^{-\frac{1}{2+2p}} \]
with probability $P^n$ not less than $1 - e^{-\tau}$.

94–97 Statistical Analysis: Consequences of the Oracle Inequality

Oracle inequality
\[ \lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} < \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P,E} + \mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} + K_p c^{\frac{1}{2+2p}}\, \tau\, \lambda^{-\frac{1}{\alpha}}\, n^{-\frac{1}{2+2p}}. \]

Regularization error: $A(\lambda) := \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P,E}$.
Approximation error: $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P}$.
Statistical error: $K_p c^{\frac{1}{2+2p}}\, \tau\, \lambda^{-\frac{1}{\alpha}}\, n^{-\frac{1}{2+2p}}$.

98–99 Statistical Analysis: Bounding the Remaining Errors

Lemma 1
If $E$ is dense in $L_1(P_X)$, then $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} = 0$.

Lemma 2
We have $\lim_{\lambda \to 0} A(\lambda) = 0$, and if there is an $f^* \in E$ with $\mathcal{R}_{L,P}(f^*) = \mathcal{R}^*_{L,P,E}$, then $A(\lambda) \leq \lambda \|f^*\|_E^\alpha$.

Remarks
A linear behaviour of $A$ often requires such an $f^*$.
A typical behaviour is, for some $\beta \in (0, 1]$, of the form $A(\lambda) \leq c\, \lambda^\beta$.
A sufficient condition for such a behaviour can be described with the help of so-called interpolation spaces of the real method.

100–102 Statistical Analysis: Main Results for RERM

Oracle inequality
We assume $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} = 0$. Then
\[ \lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} < A(\lambda) + K_p c^{\frac{1}{2+2p}}\, \tau\, \lambda^{-\frac{1}{\alpha}}\, n^{-\frac{1}{2+2p}}. \]

Consequences
Consistent, if $\lambda_n \to 0$ with $\lambda_n n^{\frac{\alpha}{2+2p}} \to \infty$.
If $A(\lambda) \leq c\, \lambda^\beta$, then $\lambda_n \sim n^{-\frac{\alpha}{(\alpha\beta+1)(2+2p)}}$ achieves the best rate
\[ n^{-\frac{\alpha\beta}{(\alpha\beta+1)(2+2p)}}. \]

103 Statistical Analysis: Main Results for RERM II

Discussion
The assumptions on $E$ needed for consistency are minimal.
More sophisticated algorithms can be devised from the oracle inequality. For example, $E$ could change with the sample size, too.
To achieve the best learning rates, we need to know $\beta$.

104–105 Statistical Analysis: Learning Rates – Hyper-Parameters III

Training-Validation Approach
Assume that $L$ is clippable.
Split the data into equally sized parts $D_1$ and $D_2$. We write $m := n/2$.
Fix a finite set $\Lambda \subset (0, 1]$ of candidate values for $\lambda$.
For each $\lambda \in \Lambda$ compute $f_{D_1, \lambda}$.
Pick the $\lambda_{D_2} \in \Lambda$ such that $f_{D_1, \lambda_{D_2}}$ minimizes the empirical risk $\mathcal{R}_{L, D_2}$.

Observation
The approach performs RERM on $D_1$ and ERM over $\mathcal{F} := \{ f_{D_1, \lambda} : \lambda \in \Lambda \}$ on $D_2$.
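
A sketch of this training/validation split for the $\ell_2$-regularized dictionary estimator used earlier (assumed toy data and candidate grid $\Lambda$, my own choices): RERM is run on $D_1$ for each $\lambda \in \Lambda$, and the empirical risk on $D_2$ selects $\lambda$.

```python
import numpy as np

# Training/validation selection of lambda for ridge regression on a cosine dictionary.
rng = np.random.default_rng(8)
n = 400
x = rng.uniform(0, 1, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

H = lambda t: np.column_stack([np.cos(np.pi * k * t) for k in range(10)])
x1, y1, x2, y2 = x[: n // 2], y[: n // 2], x[n // 2 :], y[n // 2 :]    # D1, D2

def fit(lam):                                              # RERM on D1
    H1 = H(x1)
    return np.linalg.solve(H1.T @ H1 / len(y1) + lam * np.eye(H1.shape[1]),
                           H1.T @ y1 / len(y1))

candidates = [10.0 ** (-k) for k in range(7)]              # finite Lambda
val_risk = lambda lam: np.mean((H(x2) @ fit(lam) - y2) ** 2)   # ERM on D2
lam_star = min(candidates, key=val_risk)
print(lam_star, val_risk(lam_star))
```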

106 Statistical Analysis: Learning Rates – Hyper-Parameters VI

Theorem
If $\Lambda_n$ is a polynomially growing $n^{-\alpha/2}$-net of $(0, 1]$, our TV-RERM is consistent and enjoys the same best rates as RERM, without knowing $\beta$.

107–108 Statistical Analysis: Summary

Positive Aspects
Finite sample estimates in the form of oracle inequalities.
Consistency and learning rates.
Adaptivity to the best learning rates the analysis can provide.
The framework applies to a variety of algorithms, e.g. SVMs with Gaussian kernels.
The analysis is very robust to changes in the scenario.

Negative Aspect
For RERM, the rates are never optimal! This analysis is out-dated.

109–111 Statistical Analysis: Learning Rates – Non-Optimality I

For RERM, with probability $P^n$ not less than $1 - e^{-\tau}$, we have
\[ \lambda_n \|f_{D,\lambda_n}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda_n}) - \mathcal{R}^*_{L,P} \leq C\, \tau\, n^{-\frac{\alpha\beta}{(\alpha\beta+1)(2+2p)}}. \tag{3} \]
In the proof of this result we used $\lambda_n \|f_{D,\lambda_n}\|_E^\alpha \leq 1$, but (3) shows
\[ \lambda_n \|f_{D,\lambda_n}\|_E^\alpha \leq C\, \tau\, n^{-\frac{\alpha\beta}{(\alpha\beta+1)(2+2p)}}. \]
For large $n$ this estimate is sharper!
Using the sharper estimate in the proof, we obtain a better learning rate.
The argument can be iterated...

112 Statistical Analysis: Learning Rates – Non-Optimality II

Bernstein's Inequality
Let $(\Omega, \mathcal{A}, Q)$ be a probability space and $\xi_1, \dots, \xi_n : \Omega \to [-B, B]$ be independent random variables satisfying $\mathbb{E}_Q \xi_i = 0$ and $\mathbb{E}_Q \xi_i^2 \leq \sigma^2$. Then, for all $\tau > 0$, $n \geq 1$, we have
\[ Q\Bigl( \Bigl| \frac{1}{n} \sum_{i=1}^n \xi_i \Bigr| \geq \sqrt{\frac{2\sigma^2 \tau}{n}} + \frac{2B\tau}{3n} \Bigr) \leq 2 e^{-\tau}. \]

113–116 Statistical Analysis: Learning Rates – Non-Optimality III

Some loss functions or distributions allow a variance bound
\[ \mathbb{E}_P \bigl(L \circ f - L \circ f^*_{L,P}\bigr)^2 \leq V \bigl( \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} \bigr)^\vartheta. \]
Using Bernstein's inequality rather than Hoeffding's inequality leads to an oracle inequality with
a variance term, which is $O(n^{-1/2})$, and
a supremum term, which is $O(n^{-1})$.

Iteration in the proof:
The initial analysis provides a small excess risk with high probability.
The variance bound converts a small excess risk into a small variance.
The variance term in the oracle inequality becomes smaller, leading to a faster rate...

Rates up to $O(n^{-1})$ become possible. The iteration can be avoided!

117 Statistical Analysis: Learning Rates – Non-Optimality IV

Further Reasons
The fact that $L$ is clippable should be used to obtain a smaller supremum term.
$\|\cdot\|_\infty$-covering numbers provide a worst-case tool.

118 Statistical Analysis: Adaptivity of Standard SVMs

Theorem (Eberts & S. 2011)
Consider an SVM with the least squares loss and a Gaussian kernel $k_\sigma$. Pick $\lambda$ and $\sigma$ by a suitable training/validation approach. Then, for $m \in (d/2, \infty)$, the SVM learns every $f^*_{L,P} \in W^m(X)$ with the (essentially) optimal rate
\[ n^{-\frac{2m}{2m+d} + \varepsilon} \]
without knowing $m$.

119–120 Advanced Statistical Analysis / ERM: Towards a Better Analysis for ERM I

Basic Setup
We consider ERM over a finite $\mathcal{F}$.
We assume that a Bayes predictor $f^*_{L,P}$ exists.
We consider the excess losses
\[ h_f := L \circ f - L \circ f^*_{L,P}. \]
Thus $\mathbb{E}_P h_f = \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}$.
Variance bound: $\mathbb{E}_P h_f^2 \leq V (\mathbb{E}_P h_f)^\vartheta$.
Supremum bound: $\|h_f\|_\infty \leq B$.

121–122 Advanced Statistical Analysis / ERM: Towards a Better Analysis for ERM II

Decomposition
Let $f_P \in \mathcal{F}$ satisfy $\mathcal{R}_{L,P}(f_P) = \mathcal{R}^*_{L,P,\mathcal{F}}$.
$\mathcal{R}_{L,D}(f_D) \leq \mathcal{R}_{L,D}(f_P)$ implies $\mathbb{E}_D h_{f_D} \leq \mathbb{E}_D h_{f_P}$.
This yields
\[ \mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,P}(f_P) = \mathbb{E}_P h_{f_D} - \mathbb{E}_P h_{f_P} \leq \mathbb{E}_P h_{f_D} - \mathbb{E}_D h_{f_D} + \mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P}. \]
We will estimate the two differences separately.

123–126 Advanced Statistical Analysis / ERM: Towards a Better Analysis for ERM III

Second Difference
We have $\mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P} = \mathbb{E}_D (h_{f_P} - \mathbb{E}_P h_{f_P})$.
Centered: $\mathbb{E}_P (h_{f_P} - \mathbb{E}_P h_{f_P}) = 0$.
Variance bound: $\mathbb{E}_P (h_{f_P} - \mathbb{E}_P h_{f_P})^2 \leq \mathbb{E}_P h_{f_P}^2 \leq V (\mathbb{E}_P h_{f_P})^\vartheta$.
Supremum bound: $\|h_{f_P} - \mathbb{E}_P h_{f_P}\|_\infty \leq 2B$.
Bernstein's inequality yields
\begin{align*}
\mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P} &\leq \sqrt{\frac{2\tau V (\mathbb{E}_P h_{f_P})^\vartheta}{n}} + \frac{4B\tau}{3n} \\
&\leq \mathbb{E}_P h_{f_P} + \Bigl( \frac{2V\tau}{n} \Bigr)^{\frac{1}{2-\vartheta}} + \frac{4B\tau}{3n}.
\end{align*}

127–130 Advanced Statistical Analysis / ERM: Towards a Better Analysis for ERM IV

First Difference
To estimate the remaining term $\mathbb{E}_P h_{f_D} - \mathbb{E}_D h_{f_D}$, we define the functions
\[ g_{f,r} := \frac{\mathbb{E}_P h_f - h_f}{\mathbb{E}_P h_f + r}, \qquad f \in \mathcal{F},\ r > 0. \]

Bernstein Conditions
Centered: $\mathbb{E}_P g_{f,r} = 0$.
Variance bound:
\[ \mathbb{E}_P g_{f,r}^2 \leq \frac{\mathbb{E}_P h_f^2}{(\mathbb{E}_P h_f + r)^2} \leq \frac{\mathbb{E}_P h_f^2}{r^{2-\vartheta} (\mathbb{E}_P h_f)^\vartheta} \leq V r^{\vartheta - 2}. \]
Supremum bound: $\|g_{f,r}\|_\infty \leq \|\mathbb{E}_P h_f - h_f\|_\infty\, r^{-1} \leq 2B r^{-1}$.

131 Advanced Statistical Analysis / ERM: Towards a Better Analysis for ERM V

Application of Bernstein
With probability $P^n$ not smaller than $1 - |\mathcal{F}| e^{-\tau}$ we have
\[ \sup_{f \in \mathcal{F}} \mathbb{E}_D g_{f,r} < \sqrt{\frac{2V\tau}{n r^{2-\vartheta}}} + \frac{4B\tau}{3nr}. \]


Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

Generalization, Overfitting, and Model Selection

Generalization, Overfitting, and Model Selection Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How

More information

Logistic Regression Logistic

Logistic Regression Logistic Case Study 1: Estimating Click Probabilities L2 Regularization for Logistic Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 10 th,

More information

University of North Carolina at Chapel Hill

University of North Carolina at Chapel Hill University of North Carolina at Chapel Hill The University of North Carolina at Chapel Hill Department of Biostatistics Technical Report Series Year 2015 Paper 44 Feature Elimination in Support Vector

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

Logistic regression and linear classifiers COMS 4771

Logistic regression and linear classifiers COMS 4771 Logistic regression and linear classifiers COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

A talk on Oracle inequalities and regularization. by Sara van de Geer

A talk on Oracle inequalities and regularization. by Sara van de Geer A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other

More information

ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple

More information

Learning Objectives. c D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 1

Learning Objectives. c D. Poole and A. Mackworth 2010 Artificial Intelligence, Lecture 7.2, Page 1 Learning Objectives At the end of the class you should be able to: identify a supervised learning problem characterize how the prediction is a function of the error measure avoid mixing the training and

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Decision trees COMS 4771

Decision trees COMS 4771 Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Computational Learning Theory. CS534 - Machine Learning

Computational Learning Theory. CS534 - Machine Learning Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Generalization and Overfitting

Generalization and Overfitting Generalization and Overfitting Model Selection Maria-Florina (Nina) Balcan February 24th, 2016 PAC/SLT models for Supervised Learning Data Source Distribution D on X Learning Algorithm Expert / Oracle

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

Distribution-specific analysis of nearest neighbor search and classification

Distribution-specific analysis of nearest neighbor search and classification Distribution-specific analysis of nearest neighbor search and classification Sanjoy Dasgupta University of California, San Diego Nearest neighbor The primeval approach to information retrieval and classification.

More information

The sample complexity of agnostic learning with deterministic labels

The sample complexity of agnostic learning with deterministic labels The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College

More information

PAC Model and Generalization Bounds

PAC Model and Generalization Bounds PAC Model and Generalization Bounds Overview Probably Approximately Correct (PAC) model Basic generalization bounds finite hypothesis class infinite hypothesis class Simple case More next week 2 Motivating

More information

VBM683 Machine Learning

VBM683 Machine Learning VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data

More information

Stability of optimization problems with stochastic dominance constraints

Stability of optimization problems with stochastic dominance constraints Stability of optimization problems with stochastic dominance constraints D. Dentcheva and W. Römisch Stevens Institute of Technology, Hoboken Humboldt-University Berlin www.math.hu-berlin.de/~romisch SIAM

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

Case study: stochastic simulation via Rademacher bootstrap

Case study: stochastic simulation via Rademacher bootstrap Case study: stochastic simulation via Rademacher bootstrap Maxim Raginsky December 4, 2013 In this lecture, we will look at an application of statistical learning theory to the problem of efficient stochastic

More information

Machine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012

Machine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012 Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Computational Learning Theory Le Song Lecture 11, September 20, 2012 Based on Slides from Eric Xing, CMU Reading: Chap. 7 T.M book 1 Complexity of Learning

More information

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

1. Kernel ridge regression In contrast to ordinary least squares which has a cost function. m (θ T x (i) y (i) ) 2, J(θ) = 1 2.

1. Kernel ridge regression In contrast to ordinary least squares which has a cost function. m (θ T x (i) y (i) ) 2, J(θ) = 1 2. CS229 Problem Set #2 Solutions 1 CS 229, Public Course Problem Set #2 Solutions: Theory Kernels, SVMs, and 1. Kernel ridge regression In contrast to ordinary least squares which has a cost function J(θ)

More information

Foundations of Machine Learning

Foundations of Machine Learning Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning 3. Instance Based Learning Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701 Outline Parzen Windows Kernels, algorithm Model selection

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Today. Calculus. Linear Regression. Lagrange Multipliers

Today. Calculus. Linear Regression. Lagrange Multipliers Today Calculus Lagrange Multipliers Linear Regression 1 Optimization with constraints What if I want to constrain the parameters of the model. The mean is less than 10 Find the best likelihood, subject

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

Minimax Estimation of Kernel Mean Embeddings

Minimax Estimation of Kernel Mean Embeddings Minimax Estimation of Kernel Mean Embeddings Bharath K. Sriperumbudur Department of Statistics Pennsylvania State University Gatsby Computational Neuroscience Unit May 4, 2016 Collaborators Dr. Ilya Tolstikhin

More information

Supremum of simple stochastic processes

Supremum of simple stochastic processes Subspace embeddings Daniel Hsu COMS 4772 1 Supremum of simple stochastic processes 2 Recap: JL lemma JL lemma. For any ε (0, 1/2), point set S R d of cardinality 16 ln n S = n, and k N such that k, there

More information

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions 58669 Supervised Machine Learning (Spring 014) Homework, sample solutions Credit for the solutions goes to mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen

More information

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht

PAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable

More information

Unlabeled Data: Now It Helps, Now It Doesn t

Unlabeled Data: Now It Helps, Now It Doesn t institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic

More information

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)

More information

Nearest neighbor classification in metric spaces: universal consistency and rates of convergence

Nearest neighbor classification in metric spaces: universal consistency and rates of convergence Nearest neighbor classification in metric spaces: universal consistency and rates of convergence Sanjoy Dasgupta University of California, San Diego Nearest neighbor The primeval approach to classification.

More information

Sparseness of Support Vector Machines

Sparseness of Support Vector Machines Journal of Machine Learning Research 4 2003) 07-05 Submitted 0/03; Published /03 Sparseness of Support Vector Machines Ingo Steinwart Modeling, Algorithms, and Informatics Group, CCS-3 Mail Stop B256 Los

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of

More information

Classification: The rest of the story

Classification: The rest of the story U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

Weak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij

Weak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij Weak convergence and Brownian Motion (telegram style notes) P.J.C. Spreij this version: December 8, 2006 1 The space C[0, ) In this section we summarize some facts concerning the space C[0, ) of real

More information

BINARY CLASSIFICATION

BINARY CLASSIFICATION BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the

More information

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Generative v. Discriminative classifiers Intuition

Generative v. Discriminative classifiers Intuition Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 1 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data

More information

Classification objectives COMS 4771

Classification objectives COMS 4771 Classification objectives COMS 4771 1. Recap: binary classification Scoring functions Consider binary classification problems with Y = { 1, +1}. 1 / 22 Scoring functions Consider binary classification

More information