Learning Theory
Ingo Steinwart, University of Stuttgart
September 4, 2013
Basics: Informal Introduction

Informal Description of Supervised Learning
- $X$: space of input samples.
- $Y$: space of labels, usually $Y \subseteq \mathbb{R}$.
- Already observed samples $D = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$.

Goal: With the help of $D$, find a function $f_D : X \to \mathbb{R}$ such that $f_D(x)$ is a good prediction of the label $y$ for new, unseen $x$.

Learning method: assigns to every training set $D$ a predictor $f_D : X \to \mathbb{R}$.
Illustration: Binary Classification
- Problem: the labels are $\pm 1$.
- Goal: make few mistakes on future data.
- Example: (figure omitted)
Illustration: Regression
- Problem: the labels are $\mathbb{R}$-valued.
- Goal: estimate the label $y$ for new data $x$ as accurately as possible.
- Example: (figure omitted)
Basics: Formalized Description

Data Generation
Assumptions:
- $P$ is an unknown probability measure on $X \times Y$.
- $D = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$ is sampled from $P^n$.
- Future samples $(x, y)$ will also be sampled from $P$.

Consequences:
- The label $y$ for a given $x$ is, in general, not deterministic.
- The past and the future look the same.
- We seek algorithms that work well for many (or even all) $P$.
Performance Evaluation I
Loss function: a map $L : X \times Y \times \mathbb{R} \to [0, \infty)$ measures the cost, or loss, $L(x, y, t)$ of predicting the label $y$ by the value $t$ at the point $x$.

Interpretation:
- As the name suggests, we prefer predictions with small loss.
- $L$ is chosen by us.
- Since future $(x, y)$ are random, it makes sense to consider the average loss of a predictor.
Performance Evaluation II
Risk: the risk of a predictor $f : X \to \mathbb{R}$ is the average loss
$$\mathcal{R}_{L,P}(f) := \int_{X \times Y} L(x, y, f(x)) \, dP(x, y).$$
For $D = ((x_1, y_1), \dots, (x_n, y_n))$ the empirical risk is
$$\mathcal{R}_{L,D}(f) := \frac{1}{n} \sum_{i=1}^n L(x_i, y_i, f(x_i)).$$

Interpretation: by the law of large numbers we have, $P$-almost surely,
$$\mathcal{R}_{L,P}(f) = \lim_{n \to \infty} \mathcal{R}_{L,D}(f).$$
Thus $\mathcal{R}_{L,P}(f)$ is the long-term average future loss when using $f$.
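The two risks above are straightforward to compute numerically. A minimal sketch (not from the slides; the data model and predictor are made up for illustration) using the least squares loss, with a law-of-large-numbers check that the empirical risk approaches the true risk as $n$ grows:

```python
import numpy as np

# Sketch (not from the slides): the empirical risk R_{L,D}(f) for the least
# squares loss L(x, y, t) = (y - t)^2, plus a law-of-large-numbers check
# that it approaches the true risk R_{L,P}(f) as n grows. The data model
# (y = 2x + Gaussian noise) and the predictor f are made up for illustration.

def empirical_risk(loss, f, X, y):
    """R_{L,D}(f) = (1/n) * sum_i L(x_i, y_i, f(x_i))."""
    return float(np.mean([loss(xi, yi, f(xi)) for xi, yi in zip(X, y)]))

def squared_loss(x, y, t):
    return (y - t) ** 2

rng = np.random.default_rng(0)

def sample(n):
    X = rng.uniform(0, 1, n)
    y = 2 * X + rng.normal(0, 0.1, n)    # conditional mean mu_P(x) = 2x
    return X, y

f = lambda x: 2 * x                      # equals mu_P, so R_{L,P}(f) = 0.01

risks = [empirical_risk(squared_loss, f, *sample(n)) for n in (100, 100_000)]
print(risks)  # the larger sample gives an estimate much closer to 0.01
```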
Performance Evaluation III
Bayes risk and Bayes predictor: the Bayes risk is the smallest possible risk
$$\mathcal{R}^*_{L,P} := \inf\{\, \mathcal{R}_{L,P}(f) \mid f : X \to \mathbb{R} \text{ (measurable)} \,\}.$$
A Bayes predictor is any function $f^*_{L,P} : X \to \mathbb{R}$ that satisfies
$$\mathcal{R}_{L,P}(f^*_{L,P}) = \mathcal{R}^*_{L,P}.$$

Interpretation:
- We will never find a predictor whose risk is smaller than $\mathcal{R}^*_{L,P}$.
- We seek a predictor $f : X \to \mathbb{R}$ whose excess risk $\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}$ is close to $0$.
Performance Evaluation IV
Best naïve risk: the best naïve risk (written $\mathcal{R}^{\mathrm{naive}}_{L,P}$ here) is the smallest risk one obtains by ignoring $X$:
$$\mathcal{R}^{\mathrm{naive}}_{L,P} := \inf\{\, \mathcal{R}_{L,P}(c \mathbf{1}_X) \mid c \in \mathbb{R} \,\}.$$

Remarks:
- The best naïve risk (and its minimizer) is usually easy to estimate.
- Using fancy learning algorithms only makes sense if $\mathcal{R}^*_{L,P} < \mathcal{R}^{\mathrm{naive}}_{L,P}$.

Equality:
- Typically: $\mathcal{R}^{\mathrm{naive}}_{L,P} = \mathcal{R}^*_{L,P}$ iff there is a constant Bayes predictor.
- If $P = P_X \otimes P_Y$, then $\mathcal{R}^{\mathrm{naive}}_{L,P} = \mathcal{R}^*_{L,P}$, but the converse is false.
Learning Goals I
Binary classification: $Y = \{-1, 1\}$.
- $L(y, t) := \mathbf{1}_{(-\infty, 0]}(y \,\mathrm{sign}\, t)$ penalizes predictions $t$ with $\mathrm{sign}\, t \neq y$.
- $\mathcal{R}_{L,P}(f) = P(\{(x, y) : \mathrm{sign}\, f(x) \neq y\})$.

Optimal risk: let $\eta(x) := P(Y = 1 \mid x)$ be the probability of a positive label at $x \in X$.
- Bayes risk: $\mathcal{R}^*_{L,P} = \mathbb{E}_{P_X} \min\{\eta, 1 - \eta\}$.
- $f$ is a Bayes predictor iff $(2\eta - 1) \,\mathrm{sign}\, f \ge 0$.

Naïve risk:
- $\mathcal{R}^{\mathrm{naive}}_{L,P} = \min\{P(Y = 1), 1 - P(Y = 1)\}$.
- $\mathcal{R}^{\mathrm{naive}}_{L,P} = \mathcal{R}^*_{L,P}$ iff $\eta \ge 1/2$ $P_X$-a.s. or $\eta \le 1/2$ $P_X$-a.s.
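For a finite input space these quantities can be computed exactly. A small sketch with made-up numbers (the marginal $P_X$ and the function $\eta$ are arbitrary choices for illustration):

```python
import numpy as np

# Sketch: for a finite input space X, the classification Bayes risk
# E_{P_X} min{eta, 1 - eta} and the naive risk min{P(Y=1), 1 - P(Y=1)} can
# be computed exactly. The marginal p_x and the function eta are made up.

p_x = np.array([0.25, 0.25, 0.5])   # P_X on X = {0, 1, 2}
eta = np.array([0.9, 0.8, 0.1])     # eta(x) = P(Y = 1 | x)

bayes_risk = float(np.sum(p_x * np.minimum(eta, 1 - eta)))
p_y1 = float(np.sum(p_x * eta))     # P(Y = 1)
naive_risk = min(p_y1, 1 - p_y1)

# Here eta crosses 1/2, so no constant predictor is Bayes: the Bayes risk
# (0.125) is strictly smaller than the naive risk (0.475).
print(bayes_risk, naive_risk)
```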
Learning Goals II
Least squares regression: $Y \subseteq \mathbb{R}$.
- $L(y, t) := (y - t)^2$.
- Conditional expectation: $\mu_P(x) := \mathbb{E}_P(Y \mid x)$.
- Conditional variance: $\sigma_P^2(x) := \mathbb{E}_P(Y^2 \mid x) - \mu_P^2(x)$.

Optimal risk:
- $\mu_P$ is the only Bayes predictor and $\mathcal{R}^*_{L,P} = \mathbb{E}_{P_X} \sigma_P^2$.
- Excess risk: $\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P} = \|f - \mu_P\|^2_{L_2(P_X)}$.
- Least squares regression aims at estimating the conditional mean.

Naïve risk: $\mathcal{R}^{\mathrm{naive}}_{L,P} = \mathrm{Var}_P Y$.
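The excess risk identity above can be checked by Monte Carlo. A sketch under made-up choices of $\mu_P$, the noise level, and $f$:

```python
import numpy as np

# Monte-Carlo sketch of the excess risk identity
#   R_{L,P}(f) - R*_{L,P} = ||f - mu_P||^2_{L2(P_X)}
# for the least squares loss. mu_P, the noise level and f are made up.

rng = np.random.default_rng(1)
n = 1_000_000

X = rng.uniform(0, 1, n)
mu = np.sin(2 * np.pi * X)              # conditional mean mu_P(x)
y = mu + rng.normal(0, 0.2, n)          # sigma_P^2(x) = 0.04 everywhere

f = lambda x: 0.5 * x                   # an arbitrary fixed predictor

risk_f = np.mean((y - f(X)) ** 2)       # R_{L,P}(f), Monte-Carlo estimate
bayes = np.mean((y - mu) ** 2)          # R*_{L,P} = E sigma_P^2 = 0.04
excess = np.mean((f(X) - mu) ** 2)      # ||f - mu_P||^2_{L2(P_X)}

print(risk_f - bayes, excess)           # agree up to Monte-Carlo error
```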
Learning Goals III
Absolute value regression: $Y \subseteq \mathbb{R}$.
- $L(y, t) := |y - t|$.
- Conditional medians: $m_P(x) := \mathrm{median}_P(Y \mid x)$.

Optimal risk:
- The medians $m_P$ are the only Bayes predictors.
- Excess risk: $\mathcal{R}_{L,P}(f_n) - \mathcal{R}^*_{L,P} \to 0$ implies $f_n \to m_P$ in probability $P_X$.
- Absolute value regression aims at estimating the conditional median.

Naïve risk: $\mathcal{R}^{\mathrm{naive}}_{L,P}$ is attained at the constant $c = \mathrm{median}_P Y$.
Basics: What is learning theory about?

Questions in Statistical Learning I
Asymptotic learning: a learning method is called universally consistent if
$$\lim_{n \to \infty} \mathcal{R}_{L,P}(f_D) = \mathcal{R}^*_{L,P} \quad \text{in probability} \qquad (1)$$
for every probability measure $P$ on $X \times Y$.

Good news: many learning methods are universally consistent. First result: Stone (1977), Annals of Statistics.
Questions in Statistical Learning II
Learning rates: a learning method learns for a distribution $P$ with rate $(a_n) \to 0$ if
$$\mathbb{E}_{D \sim P^n} \mathcal{R}_{L,P}(f_D) \le \mathcal{R}^*_{L,P} + C_P a_n, \qquad n \ge 1.$$
Similar: learning rates in probability.

Bad news (Devroye, 1982, IEEE TPAMI): if $|X| = \infty$, $|Y| \ge 2$, and $L$ is non-trivial, then it is impossible to obtain a learning rate that is independent of $P$.

Remark: if $|X| < \infty$, then it is usually easy to obtain a uniform learning rate for which $C_P$ depends on $|X|$.
Questions in Statistical Learning III
Relative learning rates: let $\mathcal{P}$ be a set of distributions on $X \times Y$. A learning method learns $\mathcal{P}$ with rate $(a_n) \to 0$ if, for all $P \in \mathcal{P}$,
$$\mathbb{E}_{D \sim P^n} \mathcal{R}_{L,P}(f_D) \le \mathcal{R}^*_{L,P} + C_P a_n, \qquad n \ge 1.$$
The rate $(a_n)$ is minmax optimal if, in addition, there is no learning method that learns $\mathcal{P}$ with a rate $(b_n)$ such that $b_n / a_n \to 0$.

Tasks:
- Identify interesting ("realistic") classes $\mathcal{P}$ with good optimal rates.
- Find learning algorithms that achieve these rates.
Example of Optimal Rates
Classical least squares example:
- $X = [0, 1]^d$, $Y = [-1, 1]$, $L$ is the least squares loss.
- $W^m$ is the Sobolev space on $X$ with order of smoothness $m > d/2$.
- $\mathcal{P}$ is the set of $P$ such that $f^*_{L,P} \in W^m$ with norm bounded by $K$.
- The optimal rate is $n^{-\frac{2m}{2m + d}}$.

Remarks:
- The smoother the target $\mu = f^*_{L,P}$ is, the better it can be learned.
- The larger the input dimension is, the harder learning becomes.
- There exist various learning algorithms achieving the optimal rate.
- They usually require us to know $m$ in advance.
Questions in Statistical Learning IV
Assumptions for adaptivity:
- Usually one has a family $(\mathcal{P}_\theta)_{\theta \in \Theta}$ of large sets $\mathcal{P}_\theta$ of distributions.
- Each set $\mathcal{P}_\theta$ has its own optimal rate.
- We don't know whether $P \in \mathcal{P}_\theta$ for some $\theta$, but we hope so.
- If $P \in \mathcal{P}_\theta$, we don't know $\theta$ and we have no means to estimate it.

Task: we seek learning algorithms that
- are universally consistent,
- learn all $\mathcal{P}_\theta$ with the optimal rate without knowing $\theta$.
Such learning algorithms are adaptive to the unknown $\theta$.
Questions in Statistical Learning V
Finite sample estimates: assume that our algorithm has some hyper-parameters $\lambda \in \Lambda$. For each $P$, $\lambda$, $\delta \in (0, 1)$ and $n \ge 1$ we seek an $\varepsilon(P, \lambda, \delta, n)$ such that
$$\mathcal{R}_{L,P}(f_{D,\lambda}) \le \mathcal{R}^*_{L,P} + \varepsilon(P, \lambda, \delta, n)$$
with probability $P^n$ not smaller than $1 - \delta$.

Remarks:
- If there exists a sequence $(\lambda_n)$ with $\lim_{n \to \infty} \varepsilon(P, \lambda_n, \delta, n) = 0$ for all $P$ and $\delta$, then the algorithm can be made universally consistent.
- We automatically obtain learning rates for such sequences.
- If $|X| = \infty$ and ..., then such $\varepsilon(P, \lambda, \delta, n)$ must depend on $P$.
Questions in Statistical Learning VI
Generalization error bounds: the goal is to estimate the risk $\mathcal{R}_{L,P}(f_{D,\lambda})$ by the performance of $f_{D,\lambda}$ on $D$. Find $\varepsilon(\lambda, \delta, n)$ such that, with probability $P^n$ not smaller than $1 - \delta$:
$$\mathcal{R}_{L,P}(f_{D,\lambda}) \le \mathcal{R}_{L,D}(f_{D,\lambda}) + \varepsilon(\lambda, \delta, n).$$

Remarks:
- $\varepsilon(\lambda, \delta, n)$ must not depend on $P$ since we do not know $P$.
- $\varepsilon(\lambda, \delta, n)$ can be used to derive parameter selection strategies such as structural risk minimization.
- Alternative: use a second data set $D'$ and $\mathcal{R}_{L,D'}(f_{D,\lambda})$ as an estimate.
Summary
A good learning algorithm:
- is universally consistent,
- is adaptive for realistic classes of distributions,
- can be modified for new problems that have a different loss,
- has a good record on real-world problems,
- runs efficiently on a computer,
- ...
Definition and Examples

Empirical Risk Minimization
Definition: let $\mathcal{F}$ be a set of functions $X \to \mathbb{R}$. A learning method whose predictors satisfy $f_D \in \mathcal{F}$ and
$$\mathcal{R}_{L,D}(f_D) = \min_{f \in \mathcal{F}} \mathcal{R}_{L,D}(f)$$
is called empirical risk minimization (ERM).

Remarks:
- Not every $\mathcal{F}$ makes ERM possible.
- ERM is, in general, not unique.
- ERM may not be computationally feasible.
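For a small finite class, ERM is just a search over $\mathcal{F}$. A minimal sketch (the class of threshold classifiers and the data are invented for illustration):

```python
import numpy as np

# Minimal ERM sketch over a finite class: threshold classifiers
# x -> sign(x - s) under the zero-one loss. Class and data are invented.

def erm(F, loss, X, y):
    """Return (f_D, R_{L,D}(f_D)), an empirical risk minimizer over F."""
    risks = [np.mean([loss(xi, yi, f(xi)) for xi, yi in zip(X, y)]) for f in F]
    i = int(np.argmin(risks))
    return F[i], float(risks[i])

F = [lambda x, s=s: 1.0 if x >= s else -1.0 for s in np.linspace(0, 1, 11)]
zero_one = lambda x, y, t: float(np.sign(t) != y)

X = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([-1, -1, -1, 1, 1, 1])

f_D, emp_risk = erm(F, zero_one, X, y)
print(emp_risk)  # 0.0: some threshold separates this D perfectly
```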
Empirical Risk Minimization: Danger of Underfitting
ERM can never produce predictors with risk better than
$$\mathcal{R}^*_{L,P,\mathcal{F}} := \inf\{\, \mathcal{R}_{L,P}(f) : f \in \mathcal{F} \,\}.$$
Example: $L$ least squares, $X = [0, 1]$, $P_X$ the uniform distribution, $f^*_{L,P}$ not linear, and $\mathcal{F}$ the set of linear functions. Then $\mathcal{R}^*_{L,P,\mathcal{F}} > \mathcal{R}^*_{L,P}$, and thus ERM cannot be consistent.
Empirical Risk Minimization: Danger of Overfitting
If $\mathcal{F}$ is too large, ERM may overfit.
Example: $L$ least squares, $X = [0, 1]$, $P_X$ the uniform distribution, $f^*_{L,P} = \mathbf{1}_X$, $\mathcal{R}^*_{L,P} = 0$, and $\mathcal{F}$ the set of all functions. Then
$$f_D(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i, \\ 0 & \text{otherwise} \end{cases}$$
satisfies $\mathcal{R}_{L,D}(f_D) = 0$ but $\mathcal{R}_{L,P}(f_D) = 1$.
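The memorizing predictor in this example can be simulated directly; a sketch (noise-free labels, as in the example):

```python
import numpy as np

# Sketch of the overfitting example: a predictor that memorizes D has zero
# empirical risk, but its true least squares risk against the target 1_X
# is 1, since fresh x from a continuous P_X almost surely misses D.

rng = np.random.default_rng(5)
X_train = rng.uniform(0, 1, 20)
y_train = np.ones(20)                     # f*_{L,P} = 1_X, noise-free labels

lookup = dict(zip(X_train.tolist(), y_train.tolist()))
f_D = lambda x: lookup.get(x, 0.0)        # y_i if x = x_i for some i, else 0

emp_risk = float(np.mean([(yi - f_D(xi)) ** 2
                          for xi, yi in zip(X_train, y_train)]))

X_new = rng.uniform(0, 1, 100_000)        # fresh samples, a.s. not in D
true_risk = float(np.mean([(1.0 - f_D(xi)) ** 2 for xi in X_new]))
print(emp_risk, true_risk)
```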
Summary of Last Session
- The risk of a predictor $f : X \to \mathbb{R}$ is
$$\mathcal{R}_{L,P}(f) := \int_{X \times Y} L(x, y, f(x)) \, dP(x, y).$$
- The Bayes risk $\mathcal{R}^*_{L,P}$ is the smallest possible risk. A Bayes predictor $f^*_{L,P}$ achieves this minimal risk.
- Learning means $\mathcal{R}_{L,P}(f_D) \to \mathcal{R}^*_{L,P}$. Asymptotically, this is possible, but no uniform rates are possible.
- We seek adaptive learning algorithms. Ideally, these are fully automated.
Regularized ERM
Definition: let $\mathcal{F}$ be a non-empty set of functions $X \to \mathbb{R}$ and $\Upsilon : \mathcal{F} \to [0, \infty)$ be a map. A learning method whose predictors satisfy $f_D \in \mathcal{F}$ and
$$\Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) = \inf_{f \in \mathcal{F}} \bigl( \Upsilon(f) + \mathcal{R}_{L,D}(f) \bigr)$$
is called regularized empirical risk minimization (RERM).

Remarks:
- $\Upsilon = 0$ yields ERM.
- All remarks about ERM apply to RERM, too.
Examples of Regularized ERM I
General dictionary methods: for bounded $h_1, \dots, h_m : X \to \mathbb{R}$ consider
$$\mathcal{F} := \Bigl\{\, f_c := \sum_{i=1}^m c_i h_i : (c_1, \dots, c_m) \in \mathbb{R}^m \,\Bigr\}.$$

Examples of regularizers:
- $\ell_1$-regularization: $\Upsilon(f_c) = \lambda \|c\|_1 = \lambda \sum_{i=1}^m |c_i|$,
- $\ell_2$-regularization: $\Upsilon(f_c) = \lambda \|c\|_2^2 = \lambda \sum_{i=1}^m |c_i|^2$,
- $\ell_\infty$-regularization: $\Upsilon(f_c) = \lambda \|c\|_\infty = \lambda \max_i |c_i|$,
or, in case of dependent $h_i$, we take the infimum over all representations.
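For the least squares loss, $\ell_2$-regularization over a dictionary has a closed-form solution (ridge regression). A sketch with an invented dictionary and data; the closed form below follows from setting the gradient of $\lambda \|c\|_2^2 + \mathcal{R}_{L,D}(f_c)$ to zero:

```python
import numpy as np

# Sketch: RERM with least squares loss and Upsilon(f_c) = lambda * ||c||_2^2
# over a dictionary h_1(x) = 1, h_2(x) = x, h_3(x) = x^2 reduces to ridge
# regression. Setting the gradient of lam*||c||^2 + (1/n)*||y - Hc||^2 to
# zero gives c = (H^T H / n + lam*I)^{-1} H^T y / n. Data are made up.

rng = np.random.default_rng(2)
n, lam = 200, 0.1

X = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * X + rng.normal(0, 0.1, n)

H = np.column_stack([np.ones(n), X, X ** 2])    # dictionary evaluations
c = np.linalg.solve(H.T @ H / n + lam * np.eye(3), H.T @ y / n)

emp_risk = float(np.mean((y - H @ c) ** 2))
print(c, emp_risk)  # a shrunk version of (1, 2, 0), small empirical risk
```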
Examples of Regularized ERM II
Further examples:
- Support vector machines
- Regularized decision trees
- ...
Norm Regularizers

Regularized ERM: Norm Regularizers
Conventions: whenever we consider regularizers, they will be of the form
$$\Upsilon(f) = \lambda \|f\|_E^\alpha, \qquad f \in \mathcal{F},$$
where $\alpha \ge 1$ and $E := \mathcal{F}$ is a vector space of functions $X \to \mathbb{R}$. In this case, we additionally assume that
$$\|f\|_\infty \le \|f\|_E, \qquad f \in E.$$
In the following, we assume that the optimization problem also has a solution $f_P$ when we replace $D$ by $P$:
$$f_P \in \arg\min_{f \in \mathcal{F}} \bigl( \Upsilon(f) + \mathcal{R}_{L,P}(f) \bigr).$$
Towards the Statistical Analysis

The Classical Argument I
Ansatz: assume that we have a data set $D$ and an $\varepsilon > 0$ such that
$$\sup_{f \in \mathcal{F}} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| \le \varepsilon.$$
Then we obtain
$$\begin{aligned}
\Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) &= \Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,D}(f_D) + \mathcal{R}_{L,D}(f_D) \\
&\le \Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) + \varepsilon \\
&\le \Upsilon(f_P) + \mathcal{R}_{L,D}(f_P) + \varepsilon \\
&\le \Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) + 2\varepsilon.
\end{aligned}$$
The Classical Argument II
Discussion: the uniform bound
$$\sup_{f \in \mathcal{F}} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| \le \varepsilon \qquad (2)$$
led to the inequality
$$\Upsilon(f_D) + \mathcal{R}_{L,P}(f_D) - \mathcal{R}^*_{L,P} \le \Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) - \mathcal{R}^*_{L,P} + 2\varepsilon.$$
Since $\Upsilon(f_D) \ge 0$, all that remains to be done is to estimate
- the probability of (2),
- the regularization error $\Upsilon(f_P) + \mathcal{R}_{L,P}(f_P) - \mathcal{R}^*_{L,P}$.
The Classical Argument III
Union bound: assume that $\mathcal{F}$ is finite. The union bound gives
$$\begin{aligned}
P^n\Bigl( D : \sup_{f \in \mathcal{F}} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| \le \varepsilon \Bigr)
&= 1 - P^n\Bigl( D : \sup_{f \in \mathcal{F}} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| > \varepsilon \Bigr) \\
&\ge 1 - \sum_{f \in \mathcal{F}} P^n\Bigl( D : \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| > \varepsilon \Bigr).
\end{aligned}$$

Consequences:
- It suffices to bound $P^n\bigl( D : | \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) | > \varepsilon \bigr)$ for each single $f$.
- No assumptions on $P$ have been made so far. In particular, so far the data $D$ need not be i.i.d., nor even random.
The Classical Argument IV
Hoeffding's inequality: let $(\Omega, \mathcal{A}, Q)$ be a probability space and $\xi_1, \dots, \xi_n : \Omega \to [a, b]$ be independent random variables. Then, for all $\tau > 0$ and $n \ge 1$, we have
$$Q\Biggl( \biggl| \frac{1}{n} \sum_{i=1}^n \bigl( \xi_i - \mathbb{E}_Q \xi_i \bigr) \biggr| \ge (b - a) \sqrt{\frac{\tau}{2n}} \Biggr) \le 2 e^{-\tau}.$$

Application: consider $\Omega := (X \times Y)^n$ and $Q := P^n$. For $\xi_i(D) := L(x_i, y_i, f(x_i))$ we have $a = 0$ and
$$\frac{1}{n} \sum_{i=1}^n \bigl( \xi_i - \mathbb{E}_{P^n} \xi_i \bigr) = \mathcal{R}_{L,D}(f) - \mathcal{R}_{L,P}(f).$$
Assuming $L(x, y, f(x)) \le B$ makes the application of Hoeffding's inequality possible.
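The inequality can be sanity-checked empirically. A sketch with uniform random variables on $[0, 1]$ (so $b - a = 1$); the choice of distribution, $n$, and $\tau$ is arbitrary:

```python
import numpy as np

# Empirical sanity check of Hoeffding's inequality: for i.i.d. xi_i in
# [0, 1] (so b - a = 1), the deviation |mean - E xi| reaches
# (b - a) * sqrt(tau / (2n)) with frequency at most 2 * exp(-tau).

rng = np.random.default_rng(3)
n, tau, trials = 50, 2.0, 20_000

xi = rng.uniform(0, 1, size=(trials, n))     # E xi_i = 1/2
dev = np.abs(xi.mean(axis=1) - 0.5)
threshold = np.sqrt(tau / (2 * n))

freq = float(np.mean(dev >= threshold))
bound = 2 * np.exp(-tau)
print(freq, bound)  # the observed frequency stays below the bound
```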
The Classical Argument V
Theorem for ERM: let $L : X \times Y \times \mathbb{R} \to [0, \infty)$ be a loss, $\mathcal{F}$ be a non-empty finite set of functions $f : X \to \mathbb{R}$, and $B > 0$ be a constant such that
$$L(x, y, f(x)) \le B, \qquad (x, y) \in X \times Y, \ f \in \mathcal{F}.$$
Then we have
$$P^n\Biggl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|\mathcal{F}|)}{n}} \Biggr) \ge 1 - e^{-\tau}.$$

Remarks:
- The theorem does not specify the approximation error $\mathcal{R}^*_{L,P,\mathcal{F}} - \mathcal{R}^*_{L,P}$.
- If $|\mathcal{F}| = \infty$, the bound becomes meaningless.
- What happens if we consider RERM with a non-trivial regularizer?
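The theorem can be checked by Monte Carlo on a toy problem where the true risks are computable in closed form. A sketch; the distribution, the finite class, and all constants are made up for illustration:

```python
import numpy as np

# Monte-Carlo check of the finite-class ERM bound: with probability at
# least 1 - exp(-tau), R_{L,P}(f_D) < R*_{L,P,F} + B*sqrt((2tau+2ln(2|F|))/n).
# Setup (made up): zero-one loss (B = 1), F = 21 threshold classifiers,
# P given by x ~ U[0,1] with deterministic labels y = sign(x - 0.3).

rng = np.random.default_rng(6)
n, tau, trials = 100, 3.0, 2000
thresholds = np.linspace(0, 1, 21)

def true_risk(s):
    # sign(x - s) disagrees with sign(x - 0.3) exactly on the interval
    # between 0.3 and s, which has P_X-probability |s - 0.3|.
    return abs(s - 0.3)

best = min(true_risk(s) for s in thresholds)            # R*_{L,P,F} = 0
eps = np.sqrt((2 * tau + 2 * np.log(2 * len(thresholds))) / n)

hits = 0
for _ in range(trials):
    X = rng.uniform(0, 1, n)
    y = np.where(X >= 0.3, 1, -1)
    emp = [(np.mean(np.where(X >= s, 1, -1) != y), s) for s in thresholds]
    s_D = min(emp)[1]                                   # an ERM threshold
    hits += true_risk(s_D) < best + eps

print(hits / trials, 1 - np.exp(-tau))
```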
ERM for Infinite $\mathcal{F}$: The General Approach
So far: the union bound was the trick to pass from an estimate of $| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) | \le \varepsilon$ for a single $f$ to all $f \in \mathcal{F}$. For infinite $\mathcal{F}$, this does not work!

General approach: given some $\delta > 0$, find a finite set $N_\delta$ of functions such that
$$\sup_{f \in \mathcal{F}} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| \le \sup_{f \in N_\delta} \bigl| \mathcal{R}_{L,P}(f) - \mathcal{R}_{L,D}(f) \bigr| + \delta.$$
Then apply the union bound to $N_\delta$. The rest remains unchanged.
ERM for Infinite $\mathcal{F}$: Old vs. New Inequality
The old inequality:
$$P^n\Biggl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|\mathcal{F}|)}{n}} \Biggr) \ge 1 - e^{-\tau}.$$
The new inequality:
$$P^n\Biggl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|N_\delta|)}{n}} + \delta \Biggr) \ge 1 - e^{-\tau}.$$

Tasks:
- For each $\delta > 0$, find a small set $N_\delta$.
- Optimize the right-hand side with respect to $\delta$.
Covering Numbers
Definition: let $(M, d)$ be a metric space, $A \subseteq M$, and $\varepsilon > 0$. The $\varepsilon$-covering number of $A$ is defined by
$$\mathcal{N}(A, d, \varepsilon) := \inf\Bigl\{ n \ge 1 : \exists\, x_1, \dots, x_n \in M \text{ such that } A \subseteq \bigcup_{i=1}^n B_d(x_i, \varepsilon) \Bigr\},$$
where $\inf \emptyset := \infty$ and $B_d(x_i, \varepsilon)$ is the ball with radius $\varepsilon$ and center $x_i$.
- $x_1, \dots, x_n$ is called an $\varepsilon$-net.
- $\mathcal{N}(A, d, \varepsilon)$ is the size of the smallest $\varepsilon$-net.
Covering Numbers II
Every bounded $A \subseteq \mathbb{R}^d$ satisfies
$$\mathcal{N}(A, \|\cdot\|, \varepsilon) \le c \, \varepsilon^{-d}, \qquad \varepsilon > 0,$$
where $c > 0$ is a constant and the choice of norm only influences $c$.

For sets $\mathcal{F}$ of functions $f : X \to \mathbb{R}$, the behavior of $\mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$ may be very different! The literature is full of estimates of $\ln \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$. A typical estimate looks like
$$\ln \mathcal{N}(B_E, \|\cdot\|_\infty, \varepsilon) \le c \, \varepsilon^{-2p}, \qquad \varepsilon > 0.$$
Here $p$ may depend on the input dimension and the smoothness of the functions in $E$.
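The one-dimensional case of the first bound is easy to make concrete. A sketch for $A = [0, 1]$ under the metric $|\cdot|$, where equally spaced centers give a minimal $\varepsilon$-net:

```python
import math

# Sketch of the one-dimensional case of N(A, ||.||, eps) <= c * eps^(-d):
# for A = [0, 1] with the metric |.|, the centers eps, 3*eps, 5*eps, ...
# form an eps-net of closed balls, and ceil(1 / (2*eps)) balls both
# suffice and are necessary (each ball covers length at most 2*eps).

def covering_number_interval(eps):
    """Minimal eps-net size of [0, 1] under |.| with closed balls."""
    return math.ceil(1 / (2 * eps))

for eps in (0.5, 0.25, 0.1, 0.01):
    print(eps, covering_number_interval(eps))
```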
ERM with Infinite Sets
Theorem: let $L$ be Lipschitz in its third argument with Lipschitz constant $1$. Assume that $\|L \circ f\|_\infty \le B$ for all $f \in \mathcal{F}$. Let $N_\varepsilon$ be a minimal $\varepsilon$-net of $\mathcal{F}$, i.e. $|N_\varepsilon| = \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$. Then we have
$$P^n\Biggl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 2\ln(2|N_\varepsilon|)}{n}} + 2\varepsilon \Biggr) \ge 1 - e^{-\tau}.$$
Using Covering Numbers
Example: let $L$ satisfy the assumptions of the previous theorem, and let $\mathcal{F}$ be a set of functions with $\ln \mathcal{N}(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le c \, \varepsilon^{-2p}$. Then we have
$$P^n\Biggl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + B \sqrt{\frac{2\tau + 4 c \varepsilon^{-2p}}{n}} + 2\varepsilon \Biggr) \ge 1 - e^{-\tau}.$$
Optimizing with respect to $\varepsilon$ gives a constant $K_p$ such that
$$P^n\Bigl( D : \mathcal{R}_{L,P}(f_D) < \mathcal{R}^*_{L,P,\mathcal{F}} + K_p \, c^{\frac{1}{2 + 2p}} B \sqrt{\tau} \, n^{-\frac{1}{2 + 2p}} \Bigr) \ge 1 - e^{-\tau}.$$
For ERM over finite $\mathcal{F}$, we had $p = 0$.
Towards the Statistical Analysis: Standard Analysis for RERM

Difficulties when Analyzing RERM
- We are interested in RERMs, where $\mathcal{F}$ is a vector space $E$.
- Non-trivial vector spaces $E$ are never compact, thus $\ln \mathcal{N}(E, \|\cdot\|_\infty, \varepsilon) = \infty$.
- It seems that our approach does not work in this case.

Solution. RERM actually solves its optimization problem
$$\Upsilon(f_D) + \mathcal{R}_{L,D}(f_D) = \inf_{f \in E} \big(\Upsilon(f) + \mathcal{R}_{L,D}(f)\big)$$
over a set which is significantly smaller than $E$.
Towards the Statistical Analysis: Norm Bound for RERM

Lemma. Assume that $L(x, y, 0) \le 1$. Then, for any RERM predictor $f_{D,\lambda} \in E$, we have $\|f_{D,\lambda}\|_E \le \lambda^{-1/\alpha}$.

Consequence. The RERM optimization problem is actually solved over the ball with radius $\lambda^{-1/\alpha}$.
Towards the Statistical Analysis: Norm Bound for RERM II

Proof. Our assumptions $L(x, y, t) \ge 0$ and $L(x, y, 0) \le 1$ yield
$$\lambda \|f_{D,\lambda}\|_E^\alpha \le \lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,D}(f_{D,\lambda}) = \inf_{f \in E}\big(\lambda \|f\|_E^\alpha + \mathcal{R}_{L,D}(f)\big) \le \lambda \|0\|_E^\alpha + \mathcal{R}_{L,D}(0) \le 1.$$
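The norm bound is easy to check numerically. A toy sketch with E taken as the space of constant functions $f_w(x) = w$ (so $\|f_w\|_E = \|f_w\|_\infty = |w|$) and the absolute loss with labels in [−1, 1], which guarantees L(x, y, 0) ≤ 1; the concrete values of λ, α and the sample are illustrative:

```python
import random
random.seed(0)

alpha, lam = 2.0, 0.01
ys = [random.uniform(-1.0, 1.0) for _ in range(200)]

def objective(w):
    """Regularized empirical risk lam*|w|**alpha + (1/n) * sum |y_i - w|."""
    return lam * abs(w) ** alpha + sum(abs(y - w) for y in ys) / len(ys)

# Brute-force minimization over a grid far wider than the guaranteed ball.
grid = [k / 100 for k in range(-1500, 1501)]
w_hat = min(grid, key=objective)

norm_bound = lam ** (-1.0 / alpha)  # lemma: ||f_{D,lambda}||_E <= lambda**(-1/alpha)
```

Here the lemma promises $|\hat w| \le 0.01^{-1/2} = 10$, and the minimizer indeed stays well inside that ball.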
Statistical Analysis: An Oracle Inequality

Theorem (Example). Assume:
- L is Lipschitz continuous with $|L|_1 \le 1$ and $L(x, y, 0) \le 1$.
- E is a vector space with norm $\|\cdot\|_E$ satisfying $\|\cdot\|_\infty \le \|\cdot\|_E$.
- $\Upsilon(f) = \lambda \|f\|_E^\alpha$.
- We have $\ln \mathcal{N}(B_E, \|\cdot\|_\infty, \varepsilon) \le c\varepsilon^{-2p}$.

Then, for all $n \ge 1$, $\lambda \in (0, 1]$, $\tau \ge 1$, we have
$$\lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) < \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) + K_p c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-\frac{1}{\alpha}} n^{-\frac{1}{2+2p}}$$
with probability $P^n$ not less than $1 - e^{-\tau}$.
Statistical Analysis: Consequences of the Oracle Inequality

Subtracting the Bayes risk $\mathcal{R}^*_{L,P}$ on both sides and inserting $\mathcal{R}^*_{L,P,E}$ gives
$$\lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} < \big(\lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P,E}\big) + \big(\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P}\big) + K_p c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-\frac{1}{\alpha}} n^{-\frac{1}{2+2p}}.$$
- Regularization error: $A(\lambda) := \lambda \|f_{P,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{P,\lambda}) - \mathcal{R}^*_{L,P,E}$
- Approximation error: $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P}$
- Statistical error: $K_p c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-\frac{1}{\alpha}} n^{-\frac{1}{2+2p}}$
Statistical Analysis: Bounding the Remaining Errors

Lemma 1. If E is dense in $L_1(P_X)$, then $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} = 0$.

Lemma 2. We have $\lim_{\lambda \to 0} A(\lambda) = 0$, and if there is an $f^* \in E$ with $\mathcal{R}_{L,P}(f^*) = \mathcal{R}^*_{L,P,E}$, then $A(\lambda) \le \lambda \|f^*\|_E^\alpha$.

Remarks
- A linear behaviour of A often requires such an $f^*$.
- A typical behaviour, for some $\beta \in (0, 1]$, is of the form $A(\lambda) \le c\lambda^\beta$.
- A sufficient condition for such a behaviour can be described with the help of so-called interpolation spaces of the real method.
Statistical Analysis: Main Results for RERM

Oracle inequality. We assume $\mathcal{R}^*_{L,P,E} - \mathcal{R}^*_{L,P} = 0$. Then
$$\lambda \|f_{D,\lambda}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} < A(\lambda) + K_p c^{\frac{1}{2+2p}} \sqrt{\tau}\, \lambda^{-\frac{1}{\alpha}} n^{-\frac{1}{2+2p}}.$$

Consequences
- Consistent, if $\lambda_n \to 0$ with $\lambda_n n^{\frac{\alpha}{2+2p}} \to \infty$.
- If $A(\lambda) \le c\lambda^\beta$, then $\lambda_n \sim n^{-\frac{\alpha}{(\alpha\beta+1)(2+2p)}}$ achieves the best rate $n^{-\frac{\alpha\beta}{(\alpha\beta+1)(2+2p)}}$.
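The stated choice of $\lambda_n$ follows from balancing the two terms on the right-hand side, treating $\tau$, $c$ and $K_p$ as constants; a short calculation:

```latex
% Balance the regularization bound against the statistical error:
\lambda^{\beta} = \lambda^{-\frac{1}{\alpha}}\, n^{-\frac{1}{2+2p}}
\;\Longleftrightarrow\;
\lambda^{\frac{\alpha\beta+1}{\alpha}} = n^{-\frac{1}{2+2p}}
\;\Longleftrightarrow\;
\lambda_n = n^{-\frac{\alpha}{(\alpha\beta+1)(2+2p)}},
\qquad\text{so that}\qquad
\lambda_n^{\beta} = n^{-\frac{\alpha\beta}{(\alpha\beta+1)(2+2p)}}.
```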
Statistical Analysis: Main Results for RERM II

Discussion
- The assumptions on E needed for consistency are minimal.
- More sophisticated algorithms can be devised from the oracle inequality. For example, E could change with the sample size, too.
- To achieve the best learning rates, we need to know $\beta$.
Statistical Analysis: Learning Rates: Hyper-Parameters III

Training-Validation Approach
- Assume that L is clippable.
- Split the data into equally sized parts $D_1$ and $D_2$. We write $m := n/2$.
- Fix a finite set $\Lambda \subset (0, 1]$ of candidate values for $\lambda$.
- For each $\lambda \in \Lambda$ compute $f_{D_1,\lambda}$.
- Pick a $\lambda_{D_2} \in \Lambda$ such that $f_{D_1,\lambda_{D_2}}$ minimizes the empirical risk $\mathcal{R}_{L,D_2}$.

Observation. This approach performs RERM on $D_1$ and ERM over $\mathcal{F} := \{f_{D_1,\lambda} : \lambda \in \Lambda\}$ on $D_2$.
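The steps above can be sketched in a few lines. This uses a hypothetical 1-D model $f_w(x) = wx$, the absolute loss, a grid-search RERM solver, and a made-up data-generating model; everything concrete here is illustrative, not part of the slides:

```python
import random
random.seed(1)

# Synthetic data: y = 0.7*x + bounded noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 0.7 * x + random.uniform(-0.1, 0.1)) for x in xs]
d1, d2 = data[:100], data[100:]          # training part D1, validation part D2

def risk(w, d):                          # empirical risk with absolute loss
    return sum(abs(y - w * x) for x, y in d) / len(d)

def rerm(lam, d, alpha=2.0):             # RERM on D1 via brute-force grid search
    grid = [k / 200 for k in range(-400, 401)]
    return min(grid, key=lambda w: lam * abs(w) ** alpha + risk(w, d))

candidates = [2.0 ** (-k) for k in range(1, 11)]       # finite Lambda
models = {lam: rerm(lam, d1) for lam in candidates}    # fit f_{D1,lambda}
lam_d2 = min(candidates, key=lambda lam: risk(models[lam], d2))  # ERM on D2
```

The final line is exactly the "ERM over $\{f_{D_1,\lambda}\}$ on $D_2$" observation: the validation set only ever compares finitely many fitted predictors.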
Statistical Analysis: Learning Rates: Hyper-Parameters VI

Theorem. If $\Lambda_n$ is a polynomially growing $n^{-\alpha/2}$-net of $(0, 1]$, our TV-RERM is consistent and enjoys the same best rates as RERM, without knowing $\beta$.
Statistical Analysis: Summary

Positive Aspects
- Finite sample estimates in the form of oracle inequalities.
- Consistency and learning rates.
- Adaptivity to the best learning rates the analysis can provide.
- The framework applies to a variety of algorithms, e.g. SVMs with Gaussian kernels.
- The analysis is very robust to changes in the scenario.

Negative Aspect
- For RERM, the rates are never optimal! This analysis is out-dated.
Statistical Analysis: Learning Rates: Non-Optimality I

For RERM, with probability $P^n$ not less than $1 - e^{-\tau}$ we have
$$\lambda_n \|f_{D,\lambda_n}\|_E^\alpha + \mathcal{R}_{L,P}(f_{D,\lambda_n}) - \mathcal{R}^*_{L,P} \le C\sqrt{\tau}\, n^{-\frac{\alpha\beta}{2(\alpha\beta+1)(1+p)}}. \qquad (3)$$
In the proof of this result we used $\lambda_n \|f_{D,\lambda_n}\|_E^\alpha \le 1$, but (3) shows
$$\lambda_n \|f_{D,\lambda_n}\|_E^\alpha \le C\sqrt{\tau}\, n^{-\frac{\alpha\beta}{2(\alpha\beta+1)(1+p)}}.$$
For large n this estimate is sharper! Using the sharper estimate in the proof, we obtain a better learning rate. The argument can be iterated...
Statistical Analysis: Learning Rates: Non-Optimality II

Bernstein's Inequality. Let $(\Omega, \mathcal{A}, Q)$ be a probability space and $\xi_1, \ldots, \xi_n : \Omega \to [-B, B]$ be independent random variables satisfying
- $\mathbb{E}_Q \xi_i = 0$
- $\mathbb{E}_Q \xi_i^2 \le \sigma^2$

Then, for all $\tau > 0$, $n \ge 1$, we have
$$Q\Big(\Big|\frac{1}{n}\sum_{i=1}^n \xi_i\Big| \ge \sqrt{\frac{2\sigma^2\tau}{n}} + \frac{2B\tau}{3n}\Big) \le 2e^{-\tau}.$$
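Bernstein's inequality is straightforward to sanity-check by Monte Carlo. A sketch with $\xi_i$ uniform on $[-1, 1]$ (so B = 1 and $\sigma^2 = 1/3$); the sample size and number of trials are arbitrary illustrative choices:

```python
import math
import random
random.seed(2)

B, tau, n, trials = 1.0, 3.0, 50, 2000
sigma2 = 1.0 / 3.0   # variance of a uniform[-1, 1] random variable
threshold = math.sqrt(2 * sigma2 * tau / n) + 2 * B * tau / (3 * n)

exceed = sum(
    1
    for _ in range(trials)
    if abs(sum(random.uniform(-1, 1) for _ in range(n)) / n) >= threshold
)
empirical = exceed / trials   # Bernstein guarantees at most 2*exp(-tau)
```

The empirical exceedance frequency should stay (comfortably) below $2e^{-\tau} \approx 0.0996$.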
Statistical Analysis: Learning Rates: Non-Optimality III

Some loss functions or distributions allow a variance bound
$$\mathbb{E}_P (L \circ f - L \circ f^*_{L,P})^2 \le V \big(\mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}\big)^\vartheta.$$
Using Bernstein's inequality rather than Hoeffding's inequality leads to an oracle inequality with
- a variance term, which is $O(n^{-1/2})$
- a supremum term, which is $O(n^{-1})$

Iteration in the proof:
- The initial analysis provides a small excess risk with high probability.
- The variance bound converts a small excess risk into a small variance.
- The variance term in the oracle inequality becomes smaller, leading to a faster rate...

Rates up to $O(n^{-1})$ become possible. The iteration can be avoided!
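For the least squares loss with $|y| \le M$ and predictions in $[-M, M]$, such a variance bound holds with $\vartheta = 1$ and $V = 16M^2$: indeed $L \circ f - L \circ f^* = (f - f^*)(f + f^* - 2y)$, so the square is at most $16M^2 (f - f^*)^2$, while $\mathbb{E}_P(L \circ f - L \circ f^*) = \mathbb{E}_P (f - f^*)^2$ when $f^*$ is the regression function. A Monte Carlo sanity check of this special case (the distribution below is an arbitrary illustration):

```python
import random
random.seed(3)

M = 1.0
f_star, f = 0.0, 0.3   # regression function f*(x) = 0, competitor f(x) = 0.3
n = 100_000
ys = [random.uniform(-0.5, 0.5) for _ in range(n)]   # centered noise, |y| <= M

h = [(f - y) ** 2 - (f_star - y) ** 2 for y in ys]   # excess loss L(f) - L(f*)
lhs = sum(v * v for v in h) / n          # E_P (L(f) - L(f*))**2
rhs = 16 * M * M * (sum(h) / n)          # V * (R(f) - R*)**theta with theta = 1
```

Here the excess risk is $(0.3)^2 = 0.09$, and the squared excess loss stays far below $16M^2 \cdot 0.09$.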
Statistical Analysis: Learning Rates: Non-Optimality IV

Further Reasons
- The fact that L is clippable should be used to obtain a smaller supremum term.
- $\|\cdot\|_\infty$-covering numbers only provide a worst-case tool.
Statistical Analysis: Adaptivity of Standard SVMs

Theorem (Eberts & S. 2011). Consider an SVM with the least squares loss and a Gaussian kernel $k_\sigma$. Pick $\lambda$ and $\sigma$ by a suitable training/validation approach. Then, for $m \in (d/2, \infty)$, the SVM learns every $f^*_{L,P} \in W^m(X)$ with the (essentially) optimal rate
$$n^{-\frac{2m}{2m+d} + \varepsilon}$$
without knowing m.
Advanced Statistical Analysis (ERM): Towards a Better Analysis for ERM I

Basic Setup
- We consider ERM over a finite $\mathcal{F}$.
- We assume that a Bayes predictor $f^*_{L,P}$ exists.
- We consider the excess losses $h_f := L \circ f - L \circ f^*_{L,P}$, thus $\mathbb{E}_P h_f = \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}$.
- Variance bound: $\mathbb{E}_P h_f^2 \le V (\mathbb{E}_P h_f)^\vartheta$
- Supremum bound: $\|h_f\|_\infty \le B$
Advanced Statistical Analysis (ERM): Towards a Better Analysis for ERM II

Decomposition
- Let $f_P \in \mathcal{F}$ satisfy $\mathcal{R}_{L,P}(f_P) = \mathcal{R}^*_{L,P,\mathcal{F}}$.
- $\mathcal{R}_{L,D}(f_D) \le \mathcal{R}_{L,D}(f_P)$ implies $\mathbb{E}_D h_{f_D} \le \mathbb{E}_D h_{f_P}$.

This yields
$$\mathcal{R}_{L,P}(f_D) - \mathcal{R}_{L,P}(f_P) = \mathbb{E}_P h_{f_D} - \mathbb{E}_P h_{f_P} \le \mathbb{E}_P h_{f_D} - \mathbb{E}_D h_{f_D} + \mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P}.$$
We will estimate the two differences separately.
Advanced Statistical Analysis (ERM): Towards a Better Analysis for ERM III

Second Difference. We have $\mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P} = \mathbb{E}_D (h_{f_P} - \mathbb{E}_P h_{f_P})$.
- Centered: $\mathbb{E}_P (h_{f_P} - \mathbb{E}_P h_{f_P}) = 0$.
- Variance bound: $\mathbb{E}_P (h_{f_P} - \mathbb{E}_P h_{f_P})^2 \le \mathbb{E}_P h_{f_P}^2 \le V (\mathbb{E}_P h_{f_P})^\vartheta$
- Supremum bound: $\|h_{f_P} - \mathbb{E}_P h_{f_P}\|_\infty \le 2B$

Bernstein yields
$$\mathbb{E}_D h_{f_P} - \mathbb{E}_P h_{f_P} \le \sqrt{\frac{2\tau V (\mathbb{E}_P h_{f_P})^\vartheta}{n}} + \frac{4B\tau}{3n} \le \mathbb{E}_P h_{f_P} + \Big(\frac{2V\tau}{n}\Big)^{\frac{1}{2-\vartheta}} + \frac{4B\tau}{3n}.$$
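The second estimate uses the elementary peeling inequality $\sqrt{c\,a^\vartheta} \le a + c^{1/(2-\vartheta)}$ for $a, c \ge 0$ and $\vartheta \in (0, 1]$, applied with $a = \mathbb{E}_P h_{f_P}$ and $c = 2\tau V / n$; a short proof by case distinction:

```latex
% Peeling inequality: for a, c >= 0 and \vartheta in (0, 1],
%   sqrt(c a^\vartheta) <= a + c^{1/(2-\vartheta)}.
% Proof: if \sqrt{c a^\vartheta} <= a there is nothing to show. Otherwise
%   c a^\vartheta > a^2  =>  a^{2-\vartheta} < c  =>  a < c^{1/(2-\vartheta)},
% and therefore
%   \sqrt{c a^\vartheta} < \sqrt{c \cdot c^{\vartheta/(2-\vartheta)}}
%                        = c^{1/(2-\vartheta)}.
\sqrt{c\,a^{\vartheta}} \;\le\; a + c^{\frac{1}{2-\vartheta}},
\qquad a, c \ge 0,\ \vartheta \in (0,1].
```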
Advanced Statistical Analysis (ERM): Towards a Better Analysis for ERM IV

First Difference. To estimate the remaining term $\mathbb{E}_P h_{f_D} - \mathbb{E}_D h_{f_D}$, we define the functions
$$g_{f,r} := \frac{\mathbb{E}_P h_f - h_f}{\mathbb{E}_P h_f + r}, \qquad f \in \mathcal{F},\ r > 0.$$

Bernstein Conditions
- Centered: $\mathbb{E}_P g_{f,r} = 0$.
- Variance bound: $\mathbb{E}_P g_{f,r}^2 \le \dfrac{\mathbb{E}_P h_f^2}{(\mathbb{E}_P h_f + r)^2} \le \dfrac{\mathbb{E}_P h_f^2}{r^{2-\vartheta} (\mathbb{E}_P h_f)^\vartheta} \le V r^{\vartheta - 2}$.
- Supremum bound: $\|g_{f,r}\|_\infty \le \|\mathbb{E}_P h_f - h_f\|_\infty\, r^{-1} \le 2B r^{-1}$.
Advanced Statistical Analysis (ERM): Towards a Better Analysis for ERM V

Application of Bernstein. With probability $P^n$ not smaller than $1 - |\mathcal{F}|\, e^{-\tau}$ we have
$$\sup_{f \in \mathcal{F}} \mathbb{E}_D g_{f,r} < \sqrt{\frac{2V\tau}{n r^{2-\vartheta}}} + \frac{4B\tau}{3nr}.$$
More informationCase study: stochastic simulation via Rademacher bootstrap
Case study: stochastic simulation via Rademacher bootstrap Maxim Raginsky December 4, 2013 In this lecture, we will look at an application of statistical learning theory to the problem of efficient stochastic
More informationMachine Learning. Computational Learning Theory. Le Song. CSE6740/CS7641/ISYE6740, Fall 2012
Machine Learning CSE6740/CS7641/ISYE6740, Fall 2012 Computational Learning Theory Le Song Lecture 11, September 20, 2012 Based on Slides from Eric Xing, CMU Reading: Chap. 7 T.M book 1 Complexity of Learning
More informationSTATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More information1. Kernel ridge regression In contrast to ordinary least squares which has a cost function. m (θ T x (i) y (i) ) 2, J(θ) = 1 2.
CS229 Problem Set #2 Solutions 1 CS 229, Public Course Problem Set #2 Solutions: Theory Kernels, SVMs, and 1. Kernel ridge regression In contrast to ordinary least squares which has a cost function J(θ)
More informationFoundations of Machine Learning
Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about
More informationIntroduction to Machine Learning
Introduction to Machine Learning 3. Instance Based Learning Alex Smola Carnegie Mellon University http://alex.smola.org/teaching/cmu2013-10-701 10-701 Outline Parzen Windows Kernels, algorithm Model selection
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationToday. Calculus. Linear Regression. Lagrange Multipliers
Today Calculus Lagrange Multipliers Linear Regression 1 Optimization with constraints What if I want to constrain the parameters of the model. The mean is less than 10 Find the best likelihood, subject
More informationMachine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015
Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and
More informationMinimax Estimation of Kernel Mean Embeddings
Minimax Estimation of Kernel Mean Embeddings Bharath K. Sriperumbudur Department of Statistics Pennsylvania State University Gatsby Computational Neuroscience Unit May 4, 2016 Collaborators Dr. Ilya Tolstikhin
More informationSupremum of simple stochastic processes
Subspace embeddings Daniel Hsu COMS 4772 1 Supremum of simple stochastic processes 2 Recap: JL lemma JL lemma. For any ε (0, 1/2), point set S R d of cardinality 16 ln n S = n, and k N such that k, there
More informationSupervised Machine Learning (Spring 2014) Homework 2, sample solutions
58669 Supervised Machine Learning (Spring 014) Homework, sample solutions Credit for the solutions goes to mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen
More informationPAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht
PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable
More informationUnlabeled Data: Now It Helps, Now It Doesn t
institution-logo-filena A. Singh, R. D. Nowak, and X. Zhu. In NIPS, 2008. 1 Courant Institute, NYU April 21, 2015 Outline institution-logo-filena 1 Conflicting Views in Semi-supervised Learning The Cluster
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationFinal Overview. Introduction to ML. Marek Petrik 4/25/2017
Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS
Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)
More informationNearest neighbor classification in metric spaces: universal consistency and rates of convergence
Nearest neighbor classification in metric spaces: universal consistency and rates of convergence Sanjoy Dasgupta University of California, San Diego Nearest neighbor The primeval approach to classification.
More informationSparseness of Support Vector Machines
Journal of Machine Learning Research 4 2003) 07-05 Submitted 0/03; Published /03 Sparseness of Support Vector Machines Ingo Steinwart Modeling, Algorithms, and Informatics Group, CCS-3 Mail Stop B256 Los
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of
More informationClassification: The rest of the story
U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher
More informationModel Selection and Geometry
Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model
More informationWeak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij
Weak convergence and Brownian Motion (telegram style notes) P.J.C. Spreij this version: December 8, 2006 1 The space C[0, ) In this section we summarize some facts concerning the space C[0, ) of real
More informationBINARY CLASSIFICATION
BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the
More informationRegularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline
Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric
More informationLearning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin
Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationChap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University
Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 1 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features
More informationIntroduction. Chapter 1
Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationClassification objectives COMS 4771
Classification objectives COMS 4771 1. Recap: binary classification Scoring functions Consider binary classification problems with Y = { 1, +1}. 1 / 22 Scoring functions Consider binary classification
More information