Generalization theory
Daniel Hsu, Columbia TRIPODS Bootcamp
Motivation
Support vector machines

$\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, +1\}$. Return the solution $\hat{w} \in \mathbb{R}^d$ to the following optimization problem:

$$\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2} \|w\|_2^2 + \frac{1}{n} \sum_{i=1}^n [1 - y_i w^T x_i]_+.$$

The loss function is the hinge loss

$$\ell(\hat{y}, y) = [1 - y\hat{y}]_+ = \max\{1 - y\hat{y},\, 0\}.$$

(Here, we are okay with a real-valued prediction.) The $\frac{\lambda}{2}\|w\|_2^2$ term is called Tikhonov regularization, which we'll discuss later.
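As a concrete reference point, the hinge loss and the regularized SVM objective above can be sketched in plain Python. This is a minimal illustration of the formulas, not code from the slides; the function and variable names are mine.

```python
def hinge_loss(yhat, y):
    # l(yhat, y) = max{1 - y*yhat, 0}; a real-valued prediction yhat is fine
    return max(1.0 - y * yhat, 0.0)

def svm_objective(w, xs, ys, lam):
    # (lambda/2) * ||w||_2^2 + (1/n) * sum_i [1 - y_i w^T x_i]_+
    reg = 0.5 * lam * sum(wj * wj for wj in w)
    n = len(xs)
    avg = sum(hinge_loss(sum(wj * xj for wj, xj in zip(w, x)), y)
              for x, y in zip(xs, ys)) / n
    return reg + avg

# Tiny example: two points in R^2
xs = [[1.0, 0.0], [-1.0, 0.0]]
ys = [+1, -1]
obj_at_zero = svm_objective([0.0, 0.0], xs, ys, lam=0.1)
```

Note that at $w = 0$ every hinge term equals 1, so the objective at zero is exactly 1 regardless of the data; this fact is used later to bound $\|\hat{w}\|_2$.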
Basic statistical model for data

IID model of data: training data and test example are independent and identically distributed $(\mathcal{X} \times \mathcal{Y})$-valued random variables:

$$(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y) \overset{\text{iid}}{\sim} P.$$

SVM in the iid model: return the solution $\hat{w}$ to the following optimization problem:

$$\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n [1 - Y_i w^T X_i]_+.$$

Therefore, $\hat{w}$ is a random variable, depending on $(X_1, Y_1), \ldots, (X_n, Y_n)$.
Convergence of empirical risk

For $w$ that does not depend on the training data, the empirical risk

$$\hat{R}_n(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^T X_i, Y_i)$$

is a sum of iid random variables. The Law of Large Numbers gives an asymptotic result:

$$\hat{R}_n(w) = \frac{1}{n}\sum_{i=1}^n \ell(w^T X_i, Y_i) \overset{p}{\to} E[\ell(w^T X, Y)] = R(w).$$

(This can be made non-asymptotic.)
Uniform convergence of empirical risk

However, $\hat{w}$ does depend on the training data. The empirical risk of $\hat{w}$ is not a sum of iid random variables:

$$\hat{R}_n(\hat{w}) = \frac{1}{n}\sum_{i=1}^n \ell(\hat{w}^T X_i, Y_i).$$

Idea: $\hat{w}$ could conceivably take any value $w$, but if

$$\sup_{w} \left|\hat{R}_n(w) - R(w)\right| \overset{p}{\to} 0, \qquad (1)$$

then $\hat{R}_n(\hat{w}) \overset{p}{\to} R(\hat{w})$ as well. (1) is called uniform convergence.
Detour: Concentration inequalities
Symmetric random walk

Rademacher random variables $\varepsilon_1, \ldots, \varepsilon_n$ iid with $P(\varepsilon_i = +1) = P(\varepsilon_i = -1) = 1/2$. Symmetric random walk: position after $n$ steps is

$$S_n = \sum_{i=1}^n \varepsilon_i.$$

How far from the origin? By independence, $\mathrm{var}(S_n) = \sum_{i=1}^n \mathrm{var}(\varepsilon_i) = n$. So the expected distance from the origin is

$$E|S_n| \le \sqrt{\mathrm{var}(S_n)} = \sqrt{n}.$$

How many realizations are $\gg \sqrt{n}$ from the origin?
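A quick simulation (my own stdlib-only sketch, with arbitrary choices of $n$ and trial count) illustrates that the typical distance from the origin is on the order of $\sqrt{n}$:

```python
import random, math

random.seed(0)
n, trials = 400, 2000
dists = []
for _ in range(trials):
    # one realization of the walk: sum of n random signs
    s = sum(random.choice((-1, 1)) for _ in range(n))
    dists.append(abs(s))
avg_dist = sum(dists) / trials  # Monte Carlo estimate of E|S_n|
```

For large $n$, $E|S_n| \approx \sqrt{2n/\pi}$, which is indeed below the $\sqrt{n}$ bound from the variance calculation.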
Markov's inequality

For any random variable $X$ and any $t > 0$,

$$P(|X| \ge t) \le \frac{E|X|}{t}.$$

Proof: $t \cdot 1\{|X| \ge t\} \le |X|$; take expectations of both sides.

Application to the symmetric random walk:

$$P(|S_n| \ge c\sqrt{n}) \le \frac{E|S_n|}{c\sqrt{n}} \le \frac{1}{c}.$$
Hoeffding's inequality

If $X_1, \ldots, X_n$ are independent random variables, with $X_i$ taking values in $[a_i, b_i]$, then for any $t \ge 0$,

$$P\left(\sum_{i=1}^n (X_i - E[X_i]) \ge t\right) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

E.g., Rademacher random variables have $[a_i, b_i] = [-1, +1]$, so $P(S_n \ge t) \le \exp(-2t^2/(4n)) = \exp(-t^2/(2n))$.
Applying Hoeffding's inequality to the symmetric random walk

Union bound: for any events $A$ and $B$, $P(A \cup B) \le P(A) + P(B)$.

1. Apply Hoeffding to $\varepsilon_1, \ldots, \varepsilon_n$: $P(S_n \ge c\sqrt{n}) \le \exp(-c^2/2)$.
2. Apply Hoeffding to $-\varepsilon_1, \ldots, -\varepsilon_n$: $P(-S_n \ge c\sqrt{n}) \le \exp(-c^2/2)$.
3. Therefore, by the union bound, $P(|S_n| \ge c\sqrt{n}) \le 2\exp(-c^2/2)$.

(Compare to the bound from Markov's inequality: $1/c$.)
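The gap between the two tail bounds can be seen numerically. The sketch below (my own, with arbitrary parameter choices) estimates $P(|S_n| \ge c\sqrt{n})$ by simulation and compares it to both bounds:

```python
import random, math

random.seed(1)
n, trials, c = 100, 5000, 3.0
thresh = c * math.sqrt(n)
hits = sum(1 for _ in range(trials)
           if abs(sum(random.choice((-1, 1)) for _ in range(n))) >= thresh)
empirical = hits / trials                   # simulated P(|S_n| >= c sqrt(n))
markov_bound = 1.0 / c                      # from Markov's inequality
hoeffding_bound = 2 * math.exp(-c * c / 2)  # from Hoeffding + union bound
```

At $c = 3$, Markov gives $1/3$ while Hoeffding gives $2e^{-4.5} \approx 0.022$; the true probability is smaller still.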
Equivalent form of Hoeffding's inequality

Let $X_1, \ldots, X_n$ be independent random variables, with $X_i$ taking values in $[a_i, b_i]$, and let $S_n = \sum_{i=1}^n X_i$. For any $\delta \in (0,1)$,

$$P\left(S_n - E[S_n] < \sqrt{\frac{1}{2}\sum_{i=1}^n (b_i - a_i)^2 \ln(1/\delta)}\right) \ge 1 - \delta.$$

This is a high-probability upper bound on $S_n - E[S_n]$.
Uniform convergence: Finite classes
Back to statistical learning

Cast of characters:

- feature and outcome spaces: $\mathcal{X}$, $\mathcal{Y}$
- function class: $\mathcal{F} \subseteq \mathcal{Y}^{\mathcal{X}}$
- loss function: $\ell \colon \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ (assume bounded by 1)
- training and test data: $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y) \overset{\text{iid}}{\sim} P$

We let $\hat{f} \in \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$ be a minimizer of the empirical risk

$$\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i).$$

Our worry: over-fitting, i.e., $R(\hat{f}) \gg \hat{R}_n(\hat{f})$.
Convergence of empirical risk for a fixed function

For any fixed function $f \in \mathcal{F}$,

$$E\big[\hat{R}_n(f)\big] = E\left[\frac{1}{n}\sum_{i=1}^n \ell(f(X_i), Y_i)\right] = \frac{1}{n}\sum_{i=1}^n E\big[\ell(f(X_i), Y_i)\big] = R(f).$$

Since $\hat{R}_n(f)$ is a sum of $n$ independent $[0, \frac{1}{n}]$-valued random variables,

$$P\big(|\hat{R}_n(f) - R(f)| \ge t\big) \le 2\exp\left(-\frac{2t^2}{n(\frac{1}{n})^2}\right) = 2\exp(-2nt^2)$$

for any $t > 0$, by Hoeffding's inequality and the union bound.

This argument does not apply to $\hat{f}$, because $\hat{f}$ depends on $(X_1, Y_1), \ldots, (X_n, Y_n)$.
Uniform convergence

We cannot directly apply Hoeffding's inequality to $\hat{f}$, since its empirical risk $\hat{R}_n(\hat{f})$ is not an average of iid random variables.

One possible solution: ensure the empirical risk of every $f \in \mathcal{F}$ is close to its expected value. This is called uniform convergence.

How much data is needed to ensure this?
Uniform convergence for all functions in a finite class

If $|\mathcal{F}| < \infty$, then by Hoeffding's inequality and the union bound,

$$P\big(\exists f \in \mathcal{F} \text{ s.t. } |\hat{R}_n(f) - R(f)| \ge t\big) = P\Big(\bigcup_{f \in \mathcal{F}} \big\{|\hat{R}_n(f) - R(f)| \ge t\big\}\Big) \le \sum_{f \in \mathcal{F}} P\big(|\hat{R}_n(f) - R(f)| \ge t\big) \le |\mathcal{F}| \cdot 2\exp(-2nt^2).$$

Choose $t$ so that the RHS is $\delta$, and invert.

Theorem. For any $\delta \in (0,1)$,

$$P\left(\forall f \in \mathcal{F}: |\hat{R}_n(f) - R(f)| < \sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2n}}\right) \ge 1 - \delta.$$
What we get from uniform convergence

If $n \gg \log|\mathcal{F}|$, then with high probability, no function $f \in \mathcal{F}$ will over-fit the training data.

Also: an empirical risk minimizer (ERM), like $\hat{f}$, is near optimal!

Theorem. With probability at least $1 - \delta$,

$$R(\hat{f}) - R(f^\star) = \underbrace{R(\hat{f}) - \hat{R}_n(\hat{f})}_{\le\,\epsilon} + \underbrace{\hat{R}_n(\hat{f}) - \hat{R}_n(f^\star)}_{\le\,0} + \underbrace{\hat{R}_n(f^\star) - R(f^\star)}_{\le\,\epsilon} \le 2\epsilon,$$

where $f^\star \in \arg\min_{f \in \mathcal{F}} R(f)$ and $\epsilon = \sqrt{\frac{\ln(2|\mathcal{F}|/\delta)}{2n}}$.
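The finite-class bound can be turned into a sample-size calculator by solving for $n$. A small sketch (my own helper names; the bound itself is from the theorem above):

```python
import math

def finite_class_epsilon(n, num_functions, delta):
    # With prob >= 1 - delta: |R_n(f) - R(f)| < sqrt(ln(2|F|/delta) / (2n)) for all f
    return math.sqrt(math.log(2 * num_functions / delta) / (2 * n))

def sample_size_needed(num_functions, delta, eps):
    # Smallest n making the deviation bound above at most eps
    return math.ceil(math.log(2 * num_functions / delta) / (2 * eps ** 2))

# e.g., a class of a million functions, 95% confidence, deviation 0.05
n = sample_size_needed(num_functions=10**6, delta=0.05, eps=0.05)
```

The logarithmic dependence on $|\mathcal{F}|$ is the point: a million functions costs only about as much data as twenty functions would at ten times the accuracy.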
Uniform convergence: General case
Uniform convergence: General case

Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ be a class of real-valued functions, and let $P$ be a probability distribution on $\mathcal{X}$.

Notation: let $Pf = E[f(X)]$ for $X \sim P$. Let $P_n$ be the empirical distribution on $X_1, \ldots, X_n \overset{\text{iid}}{\sim} P$, which assigns probability mass $1/n$ to each $X_i$, so $P_n f = \frac{1}{n}\sum_{i=1}^n f(X_i)$. We are interested in the maximum (or supremum) deviation:

$$\sup_{f \in \mathcal{F}} |P_n f - Pf|.$$

The arguments from before show that for any finite class of bounded functions $\mathcal{F}$,

$$\sup_{f \in \mathcal{F}} |P_n f - Pf| \overset{p}{\to} 0,$$

and also give a non-asymptotic rate of convergence.
Infinite classes

For which classes $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ does uniform convergence hold?

Example: $\mathcal{F} = \{f_S(x) = 1\{x \in S\} : S \subset \mathbb{R}, |S| < \infty\}$, i.e., $\{0,1\}$-valued functions that take value 1 on a finite set. If $P$ is continuous, then $Pf = 0$ for all $f \in \mathcal{F}$. But $\sup_{f \in \mathcal{F}} P_n f = 1$ for all $n$. So $\sup_{f \in \mathcal{F}} |P_n f - Pf| = 1$ for all $n$.

What is the appropriate complexity measure of a function class?
Rademacher complexity

Let $\varepsilon_1, \ldots, \varepsilon_n$ be independent Rademacher random variables. Uniform convergence for $\mathcal{F}$ holds iff

$$\lim_{n \to \infty} \underbrace{E\, E_\varepsilon \sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right|}_{\mathrm{Rad}_n(\mathcal{F})} = 0$$

(where $E_\varepsilon$ is expectation with respect to $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$).

$\mathrm{Rad}_n(\mathcal{F})$ is the Rademacher complexity of $\mathcal{F}$, which measures how well vectors in the (random) set $\mathcal{F}(X_{1:n}) = \{(f(X_1), \ldots, f(X_n)) : f \in \mathcal{F}\}$ can correlate with uniformly random signs $\varepsilon_1, \ldots, \varepsilon_n$.
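For a small finite class, the (empirical, conditional-on-$X_{1:n}$) Rademacher complexity can be estimated by Monte Carlo over the signs. A sketch under my own arbitrary choices of class and sample size:

```python
import random

random.seed(2)

def empirical_rademacher(vectors, trials=2000):
    # vectors: one list (f(X_1), ..., f(X_n)) per f in F, with X_1:n held fixed;
    # Monte Carlo estimate of E_eps sup_f |(1/n) sum_i eps_i f(X_i)|
    n = len(vectors[0])
    total = 0.0
    for _ in range(trials):
        eps = [random.choice((-1, 1)) for _ in range(n)]
        total += max(abs(sum(e * v for e, v in zip(eps, vec))) / n
                     for vec in vectors)
    return total / trials

n = 50
rad_single = empirical_rademacher([[1.0] * n])   # class with one function
many = [[random.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(8)]
rad_many = empirical_rademacher(many)            # richer class of 8 sign vectors
```

A single function gives complexity on the order of $1/\sqrt{n}$; taking the supremum over more vectors can only increase the estimate.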
Extreme cases of Rademacher complexity

For simplicity, assume $X_1, \ldots, X_n$ are distinct (e.g., $P$ continuous).

$\mathcal{F}$ contains a single function $f_0 \colon \mathcal{X} \to \{-1,+1\}$: since each $\varepsilon_i f_0(X_i)$ is again a Rademacher variable,

$$\mathrm{Rad}_n(\mathcal{F}) = E\, E_\varepsilon \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f_0(X_i)\right| \le \frac{1}{\sqrt{n}}.$$

$\mathcal{F}$ contains all functions $\mathcal{X} \to \{-1,+1\}$:

$$\mathrm{Rad}_n(\mathcal{F}) = E\, E_\varepsilon \sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right| = 1.$$
Uniform convergence via Rademacher complexity

Theorem.

1. Uniform convergence in expectation: for any $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$,

$$E\left[\sup_{f \in \mathcal{F}} |P_n f - Pf|\right] \le 2\,\mathrm{Rad}_n(\mathcal{F}).$$

2. Uniform convergence with high probability: for any $\mathcal{F} \subseteq [-1,+1]^{\mathcal{X}}$ and $\delta \in (0,1)$, with probability $\ge 1 - \delta$,

$$\sup_{f \in \mathcal{F}} |P_n f - Pf| \le 2\,\mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{2\ln(1/\delta)}{n}}.$$
Step 1: Symmetrization by ghost sample

Let $P_n'$ be the empirical distribution on independent copies $X_1', \ldots, X_n'$ of $X_1, \ldots, X_n$. Write $E'$ for expectation with respect to $X_{1:n}'$. Then

$$E\left[\sup_{f \in \mathcal{F}} |P_n f - Pf|\right] = E\left[\sup_{f \in \mathcal{F}} \left|E'\,\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)\right|\right] \le E\,E' \sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)\right| = E\,E' \sup_{f \in \mathcal{F}} |P_n f - P_n' f|.$$

The random variable $P_n f - P_n' f$ is arguably nicer than $P_n f - Pf$ because it is symmetric.
Step 2: Symmetrization by random signs

Consider any fixed $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n) \in \{-1,+1\}^n$. The distribution of

$$P_n f - P_n' f = \frac{1}{n}\sum_{i=1}^n \big(f(X_i) - f(X_i')\big)$$

is the same as the distribution of

$$\frac{1}{n}\sum_{i=1}^n \varepsilon_i \big(f(X_i) - f(X_i')\big),$$

since swapping $X_i$ and $X_i'$ does not change the joint distribution. Thus, this is also true for the uniform average over all $\varepsilon \in \{-1,+1\}^n$ (i.e., expectation over Rademacher $\varepsilon$):

$$E\,E' \sup_{f \in \mathcal{F}} |P_n f - P_n' f| = E\,E'\,E_\varepsilon \sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \big(f(X_i) - f(X_i')\big)\right|.$$
Step 3: Back to a single sample

By the triangle inequality,

$$\sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \big(f(X_i) - f(X_i')\big)\right| \le \sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right| + \sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i')\right|.$$

The two terms on the RHS have the same distribution. So

$$E\,E'\,E_\varepsilon \sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i \big(f(X_i) - f(X_i')\big)\right| \le 2\,E\,E_\varepsilon \sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i f(X_i)\right| = 2\,\mathrm{Rad}_n(\mathcal{F}).$$
Recap

For any $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$,

$$E\left[\sup_{f \in \mathcal{F}} |P_n f - Pf|\right] \le 2\,\mathrm{Rad}_n(\mathcal{F}).$$

For any $\mathcal{F} \subseteq [-1,+1]^{\mathcal{X}}$ and $\delta \in (0,1)$, with probability $\ge 1 - \delta$,

$$\sup_{f \in \mathcal{F}} |P_n f - Pf| \le 2\,\mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{2\ln(1/\delta)}{n}}.$$

Conclusion: if $\mathrm{Rad}_n(\mathcal{F}) \to 0$, then uniform convergence holds. (Can also show: if uniform convergence holds, then $\mathrm{Rad}_n(\mathcal{F}) \to 0$.)
Analysis of SVM
Loss class

Back to classes of prediction functions $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$.

Consider a loss function $\ell \colon \mathbb{R} \times \mathcal{Y} \to \mathbb{R}_+$ that satisfies $\ell(0, y) \le 1$ for all $y \in \mathcal{Y}$, and is 1-Lipschitz in its first argument: for all $\hat{y}, \hat{y}' \in \mathbb{R}$,

$$|\ell(\hat{y}, y) - \ell(\hat{y}', y)| \le |\hat{y} - \hat{y}'|.$$

(Example: hinge loss $\ell(\hat{y}, y) = [1 - \hat{y}y]_+$.)

Define the associated loss class by

$$\ell \circ \mathcal{F} = \{(x, y) \mapsto \ell(f(x), y) : f \in \mathcal{F}\}.$$

Then

$$\mathrm{Rad}_n(\ell \circ \mathcal{F}) \le 2\,\mathrm{Rad}_n(\mathcal{F}) + \sqrt{\frac{2\ln 2}{n}}.$$

So uniform convergence holds for $\ell \circ \mathcal{F}$ if it holds for $\mathcal{F}$.
Rademacher complexity of linear predictors

Linear functions $\mathcal{F}_{\mathrm{lin}} = \{x \mapsto w^T x : w \in \mathbb{R}^d\}$. What is the Rademacher complexity of $\mathcal{F}_{\mathrm{lin}}$?

$$\mathrm{Rad}_n(\mathcal{F}_{\mathrm{lin}}) = E\,E_\varepsilon \sup_{w \in \mathbb{R}^d} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i w^T X_i\right|.$$

Inside the $E\,E_\varepsilon$:

$$\sup_{w \in \mathbb{R}^d} \left|w^T\left(\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right)\right| = \sup_{w \in \mathbb{R}^d} \|w\|_2 \left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_2.$$

As long as $\sum_{i=1}^n \varepsilon_i X_i \ne 0$, this is unbounded! :-(
Regularization

Recall the SVM optimization problem:

$$\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n [1 - y_i w^T x_i]_+.$$

The objective value at $w = 0$ is 1, so the objective value at the minimizer $\hat{w}$ is no worse than this:

$$\frac{\lambda}{2}\|\hat{w}\|_2^2 + \frac{1}{n}\sum_{i=1}^n [1 - y_i \hat{w}^T x_i]_+ \le 1.$$

Therefore $\|\hat{w}\|_2 \le \sqrt{2/\lambda}$.
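The norm bound above can be checked numerically. The sketch below (my own toy setup: synthetic data, plain subgradient descent with the standard $1/(\lambda t)$ step size for a $\lambda$-strongly-convex objective; not code from the slides) minimizes the SVM objective and verifies $\|\hat{w}\|_2 \le \sqrt{2/\lambda}$:

```python
import random, math

random.seed(3)
lam = 0.5
# toy data in R^2: label = sign of the first coordinate
xs = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
ys = [1 if x[0] >= 0 else -1 for x in xs]

def objective(w):
    reg = 0.5 * lam * (w[0] ** 2 + w[1] ** 2)
    hinge = sum(max(0.0, 1 - y * (w[0] * x[0] + w[1] * x[1]))
                for x, y in zip(xs, ys)) / len(xs)
    return reg + hinge

w = [0.0, 0.0]
for t in range(1, 2001):
    eta = 1.0 / (lam * t)              # step size for strongly convex objective
    g = [lam * w[0], lam * w[1]]       # gradient of the regularizer
    for x, y in zip(xs, ys):
        if y * (w[0] * x[0] + w[1] * x[1]) < 1:   # hinge subgradient is active
            g[0] -= y * x[0] / len(xs)
            g[1] -= y * x[1] / len(xs)
    w = [w[0] - eta * g[0], w[1] - eta * g[1]]

norm_w = math.sqrt(w[0] ** 2 + w[1] ** 2)
bound = math.sqrt(2 / lam)             # ||w_hat||_2 <= sqrt(2/lambda)
```

Since the objective at $w = 0$ is exactly 1, any $w$ with objective value at most 1 satisfies $\frac{\lambda}{2}\|w\|_2^2 \le 1$, which is what the assertion exploits.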
Rademacher complexity of bounded linear predictors

Bounded linear functions $\mathcal{F}_{\ell_2,B} = \{x \mapsto w^T x : \|w\|_2 \le B\}$. What is the Rademacher complexity of $\mathcal{F}_{\ell_2,B}$?

$$\mathrm{Rad}_n(\mathcal{F}_{\ell_2,B}) = E\,E_\varepsilon \sup_{\|w\|_2 \le B} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i w^T X_i\right| = B\,E\,E_\varepsilon \sup_{\|u\|_2 \le 1} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i u^T X_i\right| = B\,E\,E_\varepsilon \left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_2 \le B\sqrt{E\,E_\varepsilon \left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_2^2}.$$

This is a $d$-dimensional random walk, where the $i$-th step is $\pm X_i$.
Rademacher complexity of bounded linear predictors (2)

$$E\,E_\varepsilon \left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_2^2 = \frac{1}{n^2}\,E\,E_\varepsilon \left[\sum_{i=1}^n \|X_i\|_2^2 + \sum_{i \ne j} \varepsilon_i \varepsilon_j X_i^T X_j\right] = \frac{1}{n^2}\,E \sum_{i=1}^n \|X_i\|_2^2 = \frac{1}{n}\,E\|X\|_2^2.$$

Conclusion: the Rademacher complexity of $\mathcal{F}_{\ell_2,B} = \{w \in \mathbb{R}^d : \|w\|_2 \le B\}$ satisfies

$$\mathrm{Rad}_n(\mathcal{F}_{\ell_2,B}) \le B\sqrt{\frac{E\|X\|_2^2}{n}}.$$
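The key quantity $E_\varepsilon \|\frac{1}{n}\sum_i \varepsilon_i X_i\|_2$ and its bound $\sqrt{\overline{\|X\|^2}/n}$ can be compared by simulation. A sketch with my own arbitrary toy distribution (here $B = 1$, and the expectation over $X$ is replaced by the fixed sample):

```python
import random, math

random.seed(4)
n, d, trials = 100, 5, 2000
# fixed sample X_1..X_n from a toy distribution: uniform on [-1,1]^d
X = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

est = 0.0
for _ in range(trials):
    eps = [random.choice((-1, 1)) for _ in range(n)]
    v = [sum(e * x[j] for e, x in zip(eps, X)) / n for j in range(d)]
    est += math.sqrt(sum(vj * vj for vj in v))
est /= trials  # Monte Carlo estimate of E_eps || (1/n) sum_i eps_i X_i ||_2

mean_sq_norm = sum(sum(xj * xj for xj in x) for x in X) / n
bound = math.sqrt(mean_sq_norm / n)   # sqrt( (1/n^2) sum_i ||X_i||^2 )
```

The estimate sits just below the bound; the gap is the Jensen gap from bounding $E\|\cdot\|$ by $\sqrt{E\|\cdot\|^2}$.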
Risk bound for SVM

$$E\big[R(\hat{w}) - R(w^\star)\big] = \underbrace{E\big[R(\hat{w}) - \hat{R}_n(\hat{w})\big]}_{\le\,\epsilon} + \underbrace{E\Big[\tfrac{\lambda}{2}\|\hat{w}\|_2^2 + \hat{R}_n(\hat{w}) - \tfrac{\lambda}{2}\|w^\star\|_2^2 - \hat{R}_n(w^\star)\Big]}_{\le\,0} + \underbrace{E\big[\hat{R}_n(w^\star) - R(w^\star)\big]}_{=\,0} + E\Big[\tfrac{\lambda}{2}\|w^\star\|_2^2 - \tfrac{\lambda}{2}\|\hat{w}\|_2^2\Big] \le \epsilon + \frac{\lambda}{2}\|w^\star\|_2^2,$$

where

$$w^\star \in \arg\min_{w \in \mathbb{R}^d} \ \frac{\lambda}{2}\|w\|_2^2 + R(w), \qquad \epsilon = O\left(\sqrt{\frac{E\|X\|_2^2}{\lambda n}} + \sqrt{\frac{1}{n}}\right).$$

This suggests we should use $\lambda \to 0$ such that $\lambda n \to \infty$ as $n \to \infty$.
Kernels

The excess risk bound has no explicit dependence on the dimension $d$. In particular, it holds in infinite-dimensional inner product spaces. SVM can be applied in such spaces as long as there is an algorithm for computing inner products. This is the kernel trick, and the corresponding spaces are called Reproducing Kernel Hilbert Spaces (RKHS).

Universal approximation: with some RKHS $\mathcal{F}$, we can approximate any function arbitrarily well:

$$\lim_{\lambda \to 0} \inf_{w \in \mathcal{F}} \left\{\frac{\lambda}{2}\|w\|^2 + R(w)\right\} = \inf_{g \colon \mathcal{X} \to \mathbb{R}} R(g).$$
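As a small illustration of "computing inner products without the feature map": the Gaussian (RBF) kernel evaluates an inner product in an infinite-dimensional RKHS using only the original inputs. A sketch (my own function names; the RBF kernel is a standard example, not specifically the one from the slides):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2): an inner product <phi(x), phi(z)>
    # in an infinite-dimensional RKHS, computed without forming phi explicitly
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq)

xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
gram = [[rbf_kernel(x, z) for z in xs] for x in xs]  # kernel (Gram) matrix
```

A kernelized SVM only ever touches the data through such a Gram matrix, which is symmetric with unit diagonal for the RBF kernel.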
Other regularizers

Instead of SVM, suppose $\hat{w}$ is the solution to

$$\min_{w \in \mathbb{R}^d} \ \lambda\|w\|_1 + \hat{R}_n(w).$$

So $\hat{w} \in \mathcal{F}_{\ell_1,B} = \{w \in \mathbb{R}^d : \|w\|_1 \le B\}$ for $B = 1/\lambda$.

What is the Rademacher complexity of $\mathcal{F}_{\ell_1,B}$?

$$\mathrm{Rad}_n(\mathcal{F}_{\ell_1,B}) = E\,E_\varepsilon \sup_{\|w\|_1 \le B} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i w^T X_i\right| = B\,E\,E_\varepsilon \sup_{\|u\|_1 \le 1} \left|\frac{1}{n}\sum_{i=1}^n \varepsilon_i u^T X_i\right| = B\,E\,E_\varepsilon \left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_\infty.$$
Rademacher complexity of $\ell_1$-bounded linear predictors

Can show, using a martingale argument,

$$E\,E_\varepsilon \left\|\frac{1}{n}\sum_{i=1}^n \varepsilon_i X_i\right\|_\infty \le \sqrt{\frac{O(\log d)\,E\|X\|_\infty^2}{n}}.$$

So the Rademacher complexity of $\mathcal{F}_{\ell_1,B} = \{w \in \mathbb{R}^d : \|w\|_1 \le B\}$ satisfies

$$\mathrm{Rad}_n(\mathcal{F}_{\ell_1,B}) \le B\sqrt{\frac{O(\log d)\,E\|X\|_\infty^2}{n}}.$$

Let $\mathcal{X} = \{-1,+1\}^d$. Then $\|x\|_2^2 = d$ but $\|x\|_\infty^2 = 1$ for all $x \in \mathcal{X}$. The dependence on $d$ is much better than using the bound for $\ell_2$-bounded linear predictors, which would have looked like $B\sqrt{d/n}$.

This kind of bound is used to study generalization of AdaBoost.
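The $\sqrt{\log d}$ versus $\sqrt{d}$ contrast on the hypercube can be seen directly: for $X_i \in \{-1,+1\}^d$, the $\infty$-norm of the Rademacher average stays near $\sqrt{\log d}/\sqrt{n}$ while its 2-norm grows like $\sqrt{d}/\sqrt{n}$. A simulation sketch (my own arbitrary $n$, $d$, and trial count):

```python
import random, math

random.seed(5)
n, d, trials = 200, 64, 100
inf_est = l2_est = 0.0
for _ in range(trials):
    X = [[random.choice((-1, 1)) for _ in range(d)] for _ in range(n)]
    eps = [random.choice((-1, 1)) for _ in range(n)]
    # v = (1/n) sum_i eps_i X_i, a d-dimensional Rademacher average
    v = [sum(e * x[j] for e, x in zip(eps, X)) / n for j in range(d)]
    inf_est += max(abs(vj) for vj in v)
    l2_est += math.sqrt(sum(vj * vj for vj in v))
inf_est /= trials   # ~ sqrt(log d / n) scaling
l2_est /= trials    # ~ sqrt(d / n) scaling
```

With $d = 64$ and $n = 200$, the 2-norm average is several times the $\infty$-norm average, matching the $\sqrt{d}$ versus $\sqrt{\log d}$ gap.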
Other examples of Rademacher complexity

- $\mathcal{F}$ = any class of $\{0,1\}$-valued functions with VC dimension $V$: $\mathrm{Rad}_n(\mathcal{F}) = O\left(\sqrt{\frac{V}{n}}\right)$.
- $\mathcal{F}$ = ReLU networks of depth $D$ with parameter matrices of Frobenius norm $\le 1$: $\mathrm{Rad}_n(\mathcal{F}) = O\left(\sqrt{\frac{D\,E\|X\|_2^2}{n}}\right)$.
- $\mathcal{F}$ = Lipschitz functions from $[0,1]^d$ to $\mathbb{R}$: $\mathrm{Rad}_n(\mathcal{F}) = O\left(n^{-1/(2+d)}\right)$.
- $\mathcal{F}$ = functions from $[0,1]^d$ to $\mathbb{R}$ with Lipschitz $k$-th derivatives: $\mathrm{Rad}_n(\mathcal{F}) = O\left(n^{-(k+1)/(2(k+1)+d)}\right)$.
Questions

Are these the right notions of complexity?

For SVM, the complexity of $\ell_2$-bounded linear predictors is relevant because $\ell_2$-regularization explicitly ensures the solution to the SVM problem is $\ell_2$-bounded.

Do training algorithms for neural nets lead to Frobenius norm-bounded parameter matrices?

Do complexity bounds suggest different algorithms?
Beyond uniform convergence
Deficiencies of uniform convergence analysis

- For certain loss functions, if $R(f)$ is small, then the variance of $\hat{R}_n(f)$ is also small, and the bound should reflect this. Instead of Hoeffding's inequality, use a concentration inequality that involves variance information (e.g., Bernstein's inequality).
- It is overkill to require all functions in $\mathcal{F}$ to not over-fit. We just need to worry about the $f$'s with, e.g., small empirical risk. Solution: local Rademacher complexity.
Example: Occam's razor bound

Suppose $\mathcal{F}$ is countable and we fix (a priori) a probability distribution $\pi = (\pi_f : f \in \mathcal{F})$ on $\mathcal{F}$. Think of $\pi$ as placing bets on which functions are likely to be the one picked by your learning algorithm.

For any fixed $f \in \mathcal{F}$,

$$P\big(|\hat{R}_n(f) - R(f)| \ge t_f\big) \le 2\exp(-2nt_f^2)$$

for any $t_f > 0$, by Hoeffding's inequality and the union bound. Note: we can choose the $t_f$'s non-uniformly.
Occam's razor bound (continued)

Let $t_f = \sqrt{\frac{\ln(1/\pi_f) + \ln(2/\delta)}{2n}}$. By the union bound,

$$P\big(\exists f \in \mathcal{F} \text{ s.t. } |\hat{R}_n(f) - R(f)| \ge t_f\big) \le \sum_{f \in \mathcal{F}} P\big(|\hat{R}_n(f) - R(f)| \ge t_f\big) \le \sum_{f \in \mathcal{F}} 2\exp(-2nt_f^2) = \sum_{f \in \mathcal{F}} \pi_f \delta = \delta.$$

Theorem. For any $\delta \in (0,1)$,

$$P\left(\forall f \in \mathcal{F}: |\hat{R}_n(f) - R(f)| < \sqrt{\frac{\ln(1/\pi_f) + \ln(2/\delta)}{2n}}\right) \ge 1 - \delta.$$

Better bound for functions $f$ with higher prior probability $\pi_f$!
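The non-uniform widths $t_f$ are easy to compute. The sketch below (my own toy prior: $\pi_f = 2^{-k}$ for a hypothetical "complexity" $k$, a standard choice for illustration) shows how the confidence width grows as the prior probability shrinks:

```python
import math

def occam_width(pi_f, n, delta):
    # half-width from the theorem: sqrt((ln(1/pi_f) + ln(2/delta)) / (2n))
    return math.sqrt((math.log(1 / pi_f) + math.log(2 / delta)) / (2 * n))

n, delta = 1000, 0.05
# toy prior favoring simpler functions: pi_f = 2^{-k} for complexity level k
widths = [occam_width(2.0 ** -k, n, delta) for k in (1, 5, 20)]
```

Functions bet on heavily by the prior pay only the $\ln(2/\delta)$ term; a function with prior mass $2^{-20}$ pays roughly $20\ln 2$ extra inside the square root.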
Other forms of generalization analysis

- Stability: if a learning algorithm's output does not change much when a single data point is changed, then its output will generalize. Connections to differential privacy and regularization.
- Compression bounds: if a learning algorithm's output is invariant to all but a small number $k \ll n$ of training data points (e.g., the number of support vectors in SVM), then we get a bound of the form $k/(n-k)$.
- Direct analyses: some well-known learning algorithms do not fit the mold of the typical (regularized) ERM algorithm, and seem to require a direct analysis, e.g., the nearest neighbor rule.
- Many others.
Many active areas of research in learning theory

- Implicit bias of optimization algorithms: e.g., gradient descent for least squares linear regression converges to the solution of smallest norm. What about for other problems?
- Efficient algorithms for non-linear models: e.g., polynomials, neural networks, kernel machines. Understand if/why existing algorithms work!
- Learning algorithms with robustness guarantees: noisy labels, missing/malformed data, heavy-tailed distributions, adversarial corruptions, etc.
- Interactive learning: learning algorithms that interact with an external environment (e.g., bandits, active learning, reinforcement learning).

More: see the proceedings of the Conference on Learning Theory!
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More informationSCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.
SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016 Outline Quick Recap of Least Square Ridge Regression
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationTopics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families
Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines
More informationStatistical Properties of Large Margin Classifiers
Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationDivide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates
: A Distributed Algorithm with Minimax Optimal Rates Yuchen Zhang, John C. Duchi, Martin Wainwright (UC Berkeley;http://arxiv.org/pdf/1305.509; Apr 9, 014) Gatsby Unit, Tea Talk June 10, 014 Outline Motivation.
More informationBINARY CLASSIFICATION
BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the
More informationIFT Lecture 7 Elements of statistical learning theory
IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationKernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444
Kernel Methods Jean-Philippe Vert Jean-Philippe.Vert@mines.org Last update: Jan 2015 Jean-Philippe Vert (Mines ParisTech) 1 / 444 What we know how to solve Jean-Philippe Vert (Mines ParisTech) 2 / 444
More informationSTATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationMachine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution
More informationMath 328 Course Notes
Math 328 Course Notes Ian Robertson March 3, 2006 3 Properties of C[0, 1]: Sup-norm and Completeness In this chapter we are going to examine the vector space of all continuous functions defined on the
More informationChapter 9. Support Vector Machine. Yongdai Kim Seoul National University
Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved
More information9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee
9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationLecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;
CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationFast Rates for Estimation Error and Oracle Inequalities for Model Selection
Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu
More informationSupport Vector Machine
Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary
More informationSupervised Machine Learning (Spring 2014) Homework 2, sample solutions
58669 Supervised Machine Learning (Spring 014) Homework, sample solutions Credit for the solutions goes to mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen
More informationKernel Learning via Random Fourier Representations
Kernel Learning via Random Fourier Representations L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Module 5: Machine Learning L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Kernel Learning via Random
More informationReproducing Kernel Hilbert Spaces
Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationMythBusters: A Deep Learning Edition
1 / 8 MythBusters: A Deep Learning Edition Sasha Rakhlin MIT Jan 18-19, 2018 2 / 8 Outline A Few Remarks on Generalization Myths 3 / 8 Myth #1: Current theory is lacking because deep neural networks have
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More information10.1 The Formal Model
67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate
More informationTutorial on Machine Learning for Advanced Electronics
Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationConvex optimization COMS 4771
Convex optimization COMS 4771 1. Recap: learning via optimization Soft-margin SVMs Soft-margin SVM optimization problem defined by training data: w R d λ 2 w 2 2 + 1 n n [ ] 1 y ix T i w. + 1 / 15 Soft-margin
More informationSTAT 200C: High-dimensional Statistics
STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d
More informationThe Representor Theorem, Kernels, and Hilbert Spaces
The Representor Theorem, Kernels, and Hilbert Spaces We will now work with infinite dimensional feature vectors and parameter vectors. The space l is defined to be the set of sequences f 1, f, f 3,...
More informationComputational and Statistical Learning theory
Computational and Statistical Learning theory Problem set 2 Due: January 31st Email solutions to : karthik at ttic dot edu Notation : Input space : X Label space : Y = {±1} Sample : (x 1, y 1,..., (x n,
More informationAn Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI
An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process
More informationLecture 2 Machine Learning Review
Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things
More informationMachine Learning. Model Selection and Validation. Fabio Vandin November 7, 2017
Machine Learning Model Selection and Validation Fabio Vandin November 7, 2017 1 Model Selection When we have to solve a machine learning task: there are different algorithms/classes algorithms have parameters
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationComputational Learning Theory. CS534 - Machine Learning
Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows
More informationThe PAC Learning Framework -II
The PAC Learning Framework -II Prof. Dan A. Simovici UMB 1 / 1 Outline 1 Finite Hypothesis Space - The Inconsistent Case 2 Deterministic versus stochastic scenario 3 Bayes Error and Noise 2 / 1 Outline
More informationReproducing Kernel Hilbert Spaces
9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we
More informationMachine Learning Theory (CS 6783)
Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Kimball, B-11 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationUniform concentration inequalities, martingales, Rademacher complexity and symmetrization
Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization John Duchi Outline I Motivation 1 Uniform laws of large numbers 2 Loss minimization and data dependence II Uniform
More informationCase study: stochastic simulation via Rademacher bootstrap
Case study: stochastic simulation via Rademacher bootstrap Maxim Raginsky December 4, 2013 In this lecture, we will look at an application of statistical learning theory to the problem of efficient stochastic
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationCS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims
CS340 Machine learning Lecture 5 Learning theory cont'd Some slides are borrowed from Stuart Russell and Thorsten Joachims Inductive learning Simplest form: learn a function from examples f is the target
More informationIndirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection
More informationApproximate Dynamic Programming
Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.
More informationA Magiv CV Theory for Large-Margin Classifiers
A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector
More informationAdaBoost and other Large Margin Classifiers: Convexity in Classification
AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationReferences for online kernel methods
References for online kernel methods W. Liu, J. Principe, S. Haykin Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley, 2010. W. Liu, P. Pokharel, J. Principe. The kernel least mean square
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationLecture 21: Minimax Theory
Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways
More informationCIS519: Applied Machine Learning Fall Homework 5. Due: December 10 th, 2018, 11:59 PM
CIS59: Applied Machine Learning Fall 208 Homework 5 Handed Out: December 5 th, 208 Due: December 0 th, 208, :59 PM Feel free to talk to other members of the class in doing the homework. I am more concerned
More information