Generalization theory

1 Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1

2 Motivation 2

3 Support vector machines. X = R^d, Y = {−1, +1}. Return solution ŵ ∈ R^d to the following optimization problem: min_{w ∈ R^d} (λ/2)‖w‖₂² + (1/n) ∑_{i=1}^n [1 − y_i wᵀx_i]₊. The loss function is the hinge loss ℓ(ŷ, y) = [1 − yŷ]₊ = max{1 − yŷ, 0}. (Here, we are okay with a real-valued prediction.) The (λ/2)‖w‖₂² term is called Tikhonov regularization, which we'll discuss later. 3
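
To make the objective concrete, here is a minimal numerical sketch (mine, not from the lecture) of the regularized hinge-loss objective and a plain subgradient-descent solver in Python/numpy; the function names, the synthetic data, the 1/(λt) step size, and all parameter values are illustrative assumptions.

import numpy as np

def svm_objective(w, X, y, lam):
    # (lam/2) * ||w||_2^2 + average hinge loss [1 - y_i w^T x_i]_+
    margins = y * (X @ w)
    return 0.5 * lam * np.dot(w, w) + np.maximum(1.0 - margins, 0.0).mean()

def svm_subgradient(w, X, y, lam):
    # A subgradient of the (non-smooth) objective at w.
    margins = y * (X @ w)
    active = (margins < 1.0).astype(float)      # examples with positive hinge loss
    return lam * w - (X * (active * y)[:, None]).mean(axis=0)

# Tiny usage example on synthetic data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5) + 0.1 * rng.normal(size=200))
w, lam = np.zeros(5), 0.1
for t in range(1, 501):                         # plain subgradient descent, step 1/(lam*t)
    w -= (1.0 / (lam * t)) * svm_subgradient(w, X, y, lam)
print("objective at w-hat:", svm_objective(w, X, y, lam))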

4 Basic statistical model for data. IID model of data: training data and the test example are independent and identically distributed (X × Y)-valued random variables: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ∼ P (iid). 4

5 Basic statistical model for data. IID model of data: training data and the test example are independent and identically distributed (X × Y)-valued random variables: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ∼ P (iid). SVM in the iid model: return solution ŵ to the following optimization problem: min_{w ∈ R^d} (λ/2)‖w‖₂² + (1/n) ∑_{i=1}^n [1 − Y_i wᵀX_i]₊. Therefore, ŵ is a random variable, depending on (X_1, Y_1), ..., (X_n, Y_n). 4

6 Convergence of empirical risk. For w that does not depend on the training data: the empirical risk R_n(w) = (1/n) ∑_{i=1}^n ℓ(wᵀX_i, Y_i) is a sum of iid random variables. 5

7 Convergence of empirical risk. For w that does not depend on the training data: the empirical risk R_n(w) = (1/n) ∑_{i=1}^n ℓ(wᵀX_i, Y_i) is a sum of iid random variables. The Law of Large Numbers gives an asymptotic result: R_n(w) = (1/n) ∑_{i=1}^n ℓ(wᵀX_i, Y_i) →_p E[ℓ(wᵀX, Y)] = R(w). (This can be made non-asymptotic.) 5

8 Uniform convergence of empirical risk. However, ŵ does depend on the training data. The empirical risk of ŵ is not a sum of iid random variables: R_n(ŵ) = (1/n) ∑_{i=1}^n ℓ(ŵᵀX_i, Y_i). 6

9 Uniform convergence of empirical risk. However, ŵ does depend on the training data. The empirical risk of ŵ is not a sum of iid random variables: R_n(ŵ) = (1/n) ∑_{i=1}^n ℓ(ŵᵀX_i, Y_i). Idea: ŵ could conceivably take any value w, but if sup_w |R_n(w) − R(w)| →_p 0, (1) then R_n(ŵ) →_p R(ŵ) as well. (1) is called uniform convergence. 6

10 Detour: Concentration inequalities 7

11 Symmetric random walk. Rademacher random variables ε_1, ..., ε_n iid with P(ε_i = +1) = P(ε_i = −1) = 1/2. Symmetric random walk: position after n steps is S_n = ∑_{i=1}^n ε_i. 8

12 Symmetric random walk. Rademacher random variables ε_1, ..., ε_n iid with P(ε_i = +1) = P(ε_i = −1) = 1/2. Symmetric random walk: position after n steps is S_n = ∑_{i=1}^n ε_i. How far from the origin? By independence, var(S_n) = ∑_{i=1}^n var(ε_i) = n. So the expected distance from the origin is E|S_n| ≤ √var(S_n) = √n. 8

13 Symmetric random walk. Rademacher random variables ε_1, ..., ε_n iid with P(ε_i = +1) = P(ε_i = −1) = 1/2. Symmetric random walk: position after n steps is S_n = ∑_{i=1}^n ε_i. How far from the origin? By independence, var(S_n) = ∑_{i=1}^n var(ε_i) = n. So the expected distance from the origin is E|S_n| ≤ √var(S_n) = √n. How likely is S_n to be many multiples of √n from the origin? 8
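
A quick Monte Carlo sketch of the √n scaling (my own check, not part of the slides): simulate many independent walks, then look at E|S_n|/√n and at the fraction of walks ending at least c√n from the origin. The sample sizes and the constant c are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, trials, c = 1000, 10_000, 2.0
eps = rng.choice([-1, 1], size=(trials, n))     # Rademacher signs for each walk
S = eps.sum(axis=1)                             # position after n steps, one per trial
print("E|S_n| / sqrt(n):", np.abs(S).mean() / np.sqrt(n))       # roughly sqrt(2/pi) ~ 0.8
print("P(|S_n| >= c*sqrt(n)):", (np.abs(S) >= c * np.sqrt(n)).mean())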

14 Markov's inequality. For any random variable X and any t > 0, P(|X| ≥ t) ≤ E|X| / t. Proof: t · 1{|X| ≥ t} ≤ |X|. 9

15 Markov's inequality. For any random variable X and any t > 0, P(|X| ≥ t) ≤ E|X| / t. Proof: t · 1{|X| ≥ t} ≤ |X|. Application to the symmetric random walk: P(|S_n| ≥ c√n) ≤ E|S_n| / (c√n) ≤ 1/c. 9

16 Hoeffding's inequality. If X_1, ..., X_n are independent random variables, with X_i taking values in [a_i, b_i], then for any t ≥ 0, P(∑_{i=1}^n (X_i − E[X_i]) ≥ t) ≤ exp(−2t² / ∑_{i=1}^n (b_i − a_i)²). 10

17 Hoeffding's inequality. If X_1, ..., X_n are independent random variables, with X_i taking values in [a_i, b_i], then for any t ≥ 0, P(∑_{i=1}^n (X_i − E[X_i]) ≥ t) ≤ exp(−2t² / ∑_{i=1}^n (b_i − a_i)²). E.g., Rademacher random variables have [a_i, b_i] = [−1, +1], so P(S_n ≥ t) ≤ exp(−2t²/(4n)) = exp(−t²/(2n)). 10

18 Applying Hoeffding's inequality to the symmetric random walk. Union bound: for any events A and B, P(A ∪ B) ≤ P(A) + P(B). 11

19 Applying Hoeffding's inequality to the symmetric random walk. Union bound: for any events A and B, P(A ∪ B) ≤ P(A) + P(B). 1. Apply Hoeffding to ε_1, ..., ε_n: P(S_n ≥ c√n) ≤ exp(−c²/2). 11

20 Applying Hoeffding's inequality to the symmetric random walk. Union bound: for any events A and B, P(A ∪ B) ≤ P(A) + P(B). 1. Apply Hoeffding to ε_1, ..., ε_n: P(S_n ≥ c√n) ≤ exp(−c²/2). 2. Apply Hoeffding to −ε_1, ..., −ε_n: P(−S_n ≥ c√n) ≤ exp(−c²/2). 11

21 Applying Hoeffding's inequality to the symmetric random walk. Union bound: for any events A and B, P(A ∪ B) ≤ P(A) + P(B). 1. Apply Hoeffding to ε_1, ..., ε_n: P(S_n ≥ c√n) ≤ exp(−c²/2). 2. Apply Hoeffding to −ε_1, ..., −ε_n: P(−S_n ≥ c√n) ≤ exp(−c²/2). 3. Therefore, by the union bound, P(|S_n| ≥ c√n) ≤ 2 exp(−c²/2). (Compare to the bound from Markov's inequality: 1/c.) 11
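
For a sense of how much tighter Hoeffding is than Markov here, a tiny numerical comparison (my own sketch) of the two tail bounds at a few values of c:

import math

for c in (2.0, 3.0, 4.0):
    markov = 1.0 / c                            # bound from Markov's inequality
    hoeffding = 2.0 * math.exp(-c * c / 2.0)    # bound from Hoeffding + union bound
    print(f"c = {c}: Markov {markov:.3f}, Hoeffding {hoeffding:.5f}")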

22 Equivalent form of Hoeffding's inequality. Let X_1, ..., X_n be independent random variables, with X_i taking values in [a_i, b_i], and let S_n = ∑_{i=1}^n X_i. For any δ ∈ (0, 1), P( S_n − E[S_n] < √( (1/2) ∑_{i=1}^n (b_i − a_i)² ln(1/δ) ) ) ≥ 1 − δ. This is a high-probability upper bound on S_n − E[S_n]. 12
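
A small helper (my own naming, a sketch of the statement above) that evaluates the high-probability deviation term for given ranges [a_i, b_i] and confidence δ:

import numpy as np

def hoeffding_deviation(widths, delta):
    # sqrt( (1/2) * sum_i (b_i - a_i)^2 * ln(1/delta) ), with widths[i] = b_i - a_i
    widths = np.asarray(widths, dtype=float)
    return np.sqrt(0.5 * np.sum(widths**2) * np.log(1.0 / delta))

# e.g. n = 1000 Rademacher variables, each with width b_i - a_i = 2:
print(hoeffding_deviation([2.0] * 1000, delta=0.01))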

23 Uniform convergence: Finite classes 13

24 Back to statistical learning. Cast of characters: feature and outcome spaces X, Y; function class F ⊆ Y^X; loss function ℓ: Y × Y → R₊ (assume bounded by 1); training and test data (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ∼ P (iid). 14

25 Back to statistical learning. Cast of characters: feature and outcome spaces X, Y; function class F ⊆ Y^X; loss function ℓ: Y × Y → R₊ (assume bounded by 1); training and test data (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ∼ P (iid). We let f̂ ∈ arg min_{f ∈ F} R_n(f) be a minimizer of the empirical risk R_n(f) = (1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i). 14

26 Back to statistical learning. Cast of characters: feature and outcome spaces X, Y; function class F ⊆ Y^X; loss function ℓ: Y × Y → R₊ (assume bounded by 1); training and test data (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ∼ P (iid). We let f̂ ∈ arg min_{f ∈ F} R_n(f) be a minimizer of the empirical risk R_n(f) = (1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i). Our worry: over-fitting, i.e., R(f̂) ≫ R_n(f̂). 14

27 Convergence of empirical risk for a fixed function. For any fixed function f ∈ F, E[R_n(f)] = E[(1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i)] = (1/n) ∑_{i=1}^n E[ℓ(f(X_i), Y_i)] = R(f). 15

28 Convergence of empirical risk for a fixed function. For any fixed function f ∈ F, E[R_n(f)] = E[(1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i)] = (1/n) ∑_{i=1}^n E[ℓ(f(X_i), Y_i)] = R(f). Since R_n(f) is a sum of n independent [0, 1/n]-valued random variables, P(|R_n(f) − R(f)| ≥ t) ≤ 2 exp(−2t² / (n · (1/n)²)) = 2 exp(−2nt²) for any t > 0, by Hoeffding's inequality and the union bound. 15

29 Convergence of empirical risk for a fixed function. For any fixed function f ∈ F, E[R_n(f)] = E[(1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i)] = (1/n) ∑_{i=1}^n E[ℓ(f(X_i), Y_i)] = R(f). Since R_n(f) is a sum of n independent [0, 1/n]-valued random variables, P(|R_n(f) − R(f)| ≥ t) ≤ 2 exp(−2t² / (n · (1/n)²)) = 2 exp(−2nt²) for any t > 0, by Hoeffding's inequality and the union bound. This argument does not apply to f̂, because f̂ depends on (X_1, Y_1), ..., (X_n, Y_n). 15
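
The fixed-function case is easy to check by simulation. The sketch below (mine; it uses uniform [0, 1] draws as a stand-in for the bounded losses ℓ(f(X_i), Y_i)) estimates P(|R_n(f) − R(f)| ≥ t) and compares it with the 2 exp(−2nt²) bound.

import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 200, 20_000, 0.05
losses = rng.uniform(size=(trials, n))          # stand-in for l(f(X_i), Y_i) in [0, 1]
Rn = losses.mean(axis=1)                        # empirical risk in each trial
R = 0.5                                         # true risk of this stand-in loss
print("empirical P(|R_n - R| >= t):", (np.abs(Rn - R) >= t).mean())
print("Hoeffding bound 2*exp(-2*n*t^2):", 2 * np.exp(-2 * n * t**2))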

30 Uniform convergence. We cannot directly apply Hoeffding's inequality to f̂, since its empirical risk R_n(f̂) is not an average of iid random variables. 16

31 Uniform convergence. We cannot directly apply Hoeffding's inequality to f̂, since its empirical risk R_n(f̂) is not an average of iid random variables. One possible solution: ensure the empirical risk of every f ∈ F is close to its expected value. This is called uniform convergence. 16

32 Uniform convergence. We cannot directly apply Hoeffding's inequality to f̂, since its empirical risk R_n(f̂) is not an average of iid random variables. One possible solution: ensure the empirical risk of every f ∈ F is close to its expected value. This is called uniform convergence. How much data is needed to ensure this? 16

33 Uniform convergence for all functions in a finite class. If |F| < ∞, then by Hoeffding's inequality and the union bound, P(∃ f ∈ F s.t. |R_n(f) − R(f)| ≥ t) = P(⋃_{f ∈ F} {|R_n(f) − R(f)| ≥ t}) ≤ ∑_{f ∈ F} P(|R_n(f) − R(f)| ≥ t) ≤ |F| · 2 exp(−2nt²). 17

34 Uniform convergence for all functions in a finite class. If |F| < ∞, then by Hoeffding's inequality and the union bound, P(∃ f ∈ F s.t. |R_n(f) − R(f)| ≥ t) = P(⋃_{f ∈ F} {|R_n(f) − R(f)| ≥ t}) ≤ ∑_{f ∈ F} P(|R_n(f) − R(f)| ≥ t) ≤ |F| · 2 exp(−2nt²). Choose t so that the RHS is δ, and invert. Theorem. For any δ ∈ (0, 1), P(∀ f ∈ F: |R_n(f) − R(f)| < √(ln(2|F|/δ) / (2n))) ≥ 1 − δ. 17
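
In code, the theorem's deviation term and the sample size it implies look like this (my own helper functions; the numbers in the example are arbitrary):

import numpy as np

def uniform_deviation(n, size_F, delta):
    # sqrt( ln(2|F|/delta) / (2n) ): the uniform deviation bound from the theorem
    return np.sqrt(np.log(2.0 * size_F / delta) / (2.0 * n))

def samples_needed(size_F, delta, eps):
    # smallest n making the deviation bound at most eps
    return int(np.ceil(np.log(2.0 * size_F / delta) / (2.0 * eps**2)))

print(uniform_deviation(n=10_000, size_F=10**6, delta=0.01))
print(samples_needed(size_F=10**6, delta=0.01, eps=0.05))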

35 What we get from uniform convergence. If n ≫ log|F|, then with high probability, no function f ∈ F will over-fit the training data. 18

36 What we get from uniform convergence. If n ≫ log|F|, then with high probability, no function f ∈ F will over-fit the training data. Also: an empirical risk minimizer (ERM), like f̂, is near optimal! Theorem. With probability at least 1 − δ, R(f̂) − R(f*) = [R(f̂) − R_n(f̂)] (≤ ε) + [R_n(f̂) − R_n(f*)] (≤ 0) + [R_n(f*) − R(f*)] (≤ ε) ≤ 2ε, where f* ∈ arg min_{f ∈ F} R(f) and ε = √(ln(2|F|/δ) / (2n)). 18

37 Uniform convergence: General case 19

38 Uniform convergence: General case. Let F ⊆ R^X be a class of real-valued functions, and let P be a probability distribution on X. 20

39 Uniform convergence: General case. Let F ⊆ R^X be a class of real-valued functions, and let P be a probability distribution on X. Notation: let Pf = E[f(X)] for X ∼ P, and let P_n be the empirical distribution on X_1, ..., X_n ∼ P (iid), which assigns probability mass 1/n to each X_i, so P_n f = (1/n) ∑_{i=1}^n f(X_i). We are interested in the maximum (or supremum) deviation: sup_{f ∈ F} |P_n f − Pf|. 20

40 Uniform convergence: General case. Let F ⊆ R^X be a class of real-valued functions, and let P be a probability distribution on X. Notation: let Pf = E[f(X)] for X ∼ P, and let P_n be the empirical distribution on X_1, ..., X_n ∼ P (iid), which assigns probability mass 1/n to each X_i, so P_n f = (1/n) ∑_{i=1}^n f(X_i). We are interested in the maximum (or supremum) deviation: sup_{f ∈ F} |P_n f − Pf|. The arguments from before show that for any finite class of bounded functions F, sup_{f ∈ F} |P_n f − Pf| →_p 0, and they also give a non-asymptotic rate of convergence. 20

41 Infinite classes. For which classes F ⊆ R^X does uniform convergence hold? 21

42 Infinite classes. For which classes F ⊆ R^X does uniform convergence hold? Example: F = {f_S(x) = 1{x ∈ S} : S ⊆ R, |S| < ∞}, i.e., {0, 1}-valued functions that take the value 1 on a finite set. If P is continuous, then Pf = 0 for all f ∈ F. But sup_{f ∈ F} P_n f = 1 for all n. So sup_{f ∈ F} |P_n f − Pf| = 1 for all n. 21

43 Infinite classes. For which classes F ⊆ R^X does uniform convergence hold? Example: F = {f_S(x) = 1{x ∈ S} : S ⊆ R, |S| < ∞}, i.e., {0, 1}-valued functions that take the value 1 on a finite set. If P is continuous, then Pf = 0 for all f ∈ F. But sup_{f ∈ F} P_n f = 1 for all n. So sup_{f ∈ F} |P_n f − Pf| = 1 for all n. What is the appropriate complexity measure of a function class? 21

44 Rademacher complexity. Let ε_1, ..., ε_n be independent Rademacher random variables. Uniform convergence with F holds iff lim_{n→∞} E E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| ] = 0, where E_ε is expectation with respect to ε = (ε_1, ..., ε_n); the quantity under the limit is denoted Rad_n(F). 22

45 Rademacher complexity. Let ε_1, ..., ε_n be independent Rademacher random variables. Uniform convergence with F holds iff lim_{n→∞} E E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| ] = 0, where E_ε is expectation with respect to ε = (ε_1, ..., ε_n); the quantity under the limit is denoted Rad_n(F). Rad_n(F) is the Rademacher complexity of F, which measures how well vectors in the (random) set F(X_{1:n}) = {(f(X_1), ..., f(X_n)) : f ∈ F} can correlate with uniformly random signs ε_1, ..., ε_n. 22
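
Rad_n(F) can be estimated by Monte Carlo once the value matrix F(X_{1:n}) is in hand. The sketch below (my own; the random ±1-valued "functions" are purely illustrative) averages, over fresh random sign vectors, the largest absolute correlation achieved by any function in the class.

import numpy as np

def rademacher_complexity_mc(values, trials=10_000, rng=None):
    # values: |F| x n matrix with values[f, i] = f(X_i); returns a Monte Carlo
    # estimate of E_eps sup_f | (1/n) sum_i eps_i f(X_i) | for this fixed sample.
    if rng is None:
        rng = np.random.default_rng(0)
    n = values.shape[1]
    eps = rng.choice([-1, 1], size=(trials, n))
    correlations = eps @ values.T / n           # shape (trials, |F|)
    return np.abs(correlations).max(axis=1).mean()

# Example: 50 random {-1,+1}-valued functions evaluated on n = 200 points.
vals = np.random.default_rng(1).choice([-1, 1], size=(50, 200))
print(rademacher_complexity_mc(vals))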

46 Extreme cases of Rademacher complexity. For simplicity, assume X_1, ..., X_n are distinct (e.g., P continuous). 23

47 Extreme cases of Rademacher complexity. For simplicity, assume X_1, ..., X_n are distinct (e.g., P continuous). F contains a single function f_0: X → {−1, +1}: Rad_n(F) = E E_ε[ |(1/n) ∑_{i=1}^n ε_i f_0(X_i)| ] ≤ 1/√n. 23

48 Extreme cases of Rademacher complexity. For simplicity, assume X_1, ..., X_n are distinct (e.g., P continuous). F contains a single function f_0: X → {−1, +1}: Rad_n(F) = E E_ε[ |(1/n) ∑_{i=1}^n ε_i f_0(X_i)| ] ≤ 1/√n. F contains all functions X → {−1, +1}: Rad_n(F) = E E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| ] = 1. 23

49 Uniform convergence via Rademacher complexity. Theorem. 1. Uniform convergence in expectation: for any F ⊆ R^X, E[ sup_{f ∈ F} |P_n f − Pf| ] ≤ 2 Rad_n(F). 2. Uniform convergence with high probability: for any F ⊆ [−1, +1]^X and δ ∈ (0, 1), with probability ≥ 1 − δ, sup_{f ∈ F} |P_n f − Pf| ≤ 2 Rad_n(F) + √(2 ln(1/δ) / n). 24

50 Step 1: Symmetrization by ghost sample. Let P′_n be the empirical distribution on independent copies X′_1, ..., X′_n of X_1, ..., X_n. Write E′ for expectation with respect to X′_{1:n}. 25

51 Step 1: Symmetrization by ghost sample. Let P′_n be the empirical distribution on independent copies X′_1, ..., X′_n of X_1, ..., X_n. Write E′ for expectation with respect to X′_{1:n}. Then E[ sup_{f ∈ F} |P_n f − Pf| ] = E[ sup_{f ∈ F} |E′[(1/n) ∑_{i=1}^n (f(X_i) − f(X′_i))]| ] ≤ E E′[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n (f(X_i) − f(X′_i))| ] = E E′[ sup_{f ∈ F} |P_n f − P′_n f| ]. 25

52 Step 1: Symmetrization by ghost sample. Let P′_n be the empirical distribution on independent copies X′_1, ..., X′_n of X_1, ..., X_n. Write E′ for expectation with respect to X′_{1:n}. Then E[ sup_{f ∈ F} |P_n f − Pf| ] = E[ sup_{f ∈ F} |E′[(1/n) ∑_{i=1}^n (f(X_i) − f(X′_i))]| ] ≤ E E′[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n (f(X_i) − f(X′_i))| ] = E E′[ sup_{f ∈ F} |P_n f − P′_n f| ]. The random variable P_n f − P′_n f is arguably nicer than P_n f − Pf because it is symmetric. 25

53 Step 2: Symmetrization by random signs. Consider any ε = (ε_1, ..., ε_n) ∈ {−1, +1}^n. The distribution of P_n f − P′_n f = (1/n) ∑_{i=1}^n (f(X_i) − f(X′_i)) is the same as the distribution of (1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i)). 26

54 Step 2: Symmetrization by random signs. Consider any ε = (ε_1, ..., ε_n) ∈ {−1, +1}^n. The distribution of P_n f − P′_n f = (1/n) ∑_{i=1}^n (f(X_i) − f(X′_i)) is the same as the distribution of (1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i)). Thus, this is also true for the uniform average over all ε ∈ {−1, +1}^n (i.e., expectation over Rademacher ε): E E′[ sup_{f ∈ F} |P_n f − P′_n f| ] = E E′ E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i))| ]. 26

55 Step 3: Back to a single sample. By the triangle inequality, sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i))| ≤ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| + sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X′_i)|. The two terms on the RHS have the same distribution. 27

56 Step 3: Back to a single sample. By the triangle inequality, sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i))| ≤ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| + sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X′_i)|. The two terms on the RHS have the same distribution. So E E′ E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i))| ] ≤ 2 E E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| ] = 2 Rad_n(F). 27

57 Recap. For any F ⊆ R^X, E[ sup_{f ∈ F} |P_n f − Pf| ] ≤ 2 Rad_n(F). For any F ⊆ [−1, +1]^X and δ ∈ (0, 1), with probability ≥ 1 − δ, sup_{f ∈ F} |P_n f − Pf| ≤ 2 Rad_n(F) + √(2 ln(1/δ) / n). 28

58 Recap. For any F ⊆ R^X, E[ sup_{f ∈ F} |P_n f − Pf| ] ≤ 2 Rad_n(F). For any F ⊆ [−1, +1]^X and δ ∈ (0, 1), with probability ≥ 1 − δ, sup_{f ∈ F} |P_n f − Pf| ≤ 2 Rad_n(F) + √(2 ln(1/δ) / n). Conclusion: if Rad_n(F) → 0, then uniform convergence holds. 28

59 Recap. For any F ⊆ R^X, E[ sup_{f ∈ F} |P_n f − Pf| ] ≤ 2 Rad_n(F). For any F ⊆ [−1, +1]^X and δ ∈ (0, 1), with probability ≥ 1 − δ, sup_{f ∈ F} |P_n f − Pf| ≤ 2 Rad_n(F) + √(2 ln(1/δ) / n). Conclusion: if Rad_n(F) → 0, then uniform convergence holds. (Can also show: if uniform convergence holds, then Rad_n(F) → 0.) 28

60 Analysis of SVM 29

61 Loss class. Back to classes of prediction functions F ⊆ R^X. 30

62 Loss class. Back to classes of prediction functions F ⊆ R^X. Consider a loss function ℓ: R × Y → R₊ that satisfies ℓ(0, y) ≤ 1 for all y ∈ Y, and is 1-Lipschitz in its first argument: for all ŷ, ŷ′ ∈ R, |ℓ(ŷ, y) − ℓ(ŷ′, y)| ≤ |ŷ − ŷ′|. (Example: hinge loss ℓ(ŷ, y) = [1 − ŷy]₊.) 30

63 Loss class. Back to classes of prediction functions F ⊆ R^X. Consider a loss function ℓ: R × Y → R₊ that satisfies ℓ(0, y) ≤ 1 for all y ∈ Y, and is 1-Lipschitz in its first argument: for all ŷ, ŷ′ ∈ R, |ℓ(ŷ, y) − ℓ(ŷ′, y)| ≤ |ŷ − ŷ′|. (Example: hinge loss ℓ(ŷ, y) = [1 − ŷy]₊.) Define the associated loss class by ℓ ∘ F = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}. Then Rad_n(ℓ ∘ F) ≤ 2 Rad_n(F) + √(2 ln 2 / n). So uniform convergence holds for ℓ ∘ F if it holds for F. 30

64 Rademacher complexity of linear predictors. Linear functions F_lin = {x ↦ wᵀx : w ∈ R^d}. What is the Rademacher complexity of F_lin? Rad_n(F_lin) = E E_ε[ sup_{w ∈ R^d} |(1/n) ∑_{i=1}^n ε_i wᵀX_i| ]. 31

65 Rademacher complexity of linear predictors. Linear functions F_lin = {x ↦ wᵀx : w ∈ R^d}. What is the Rademacher complexity of F_lin? Rad_n(F_lin) = E E_ε[ sup_{w ∈ R^d} |(1/n) ∑_{i=1}^n ε_i wᵀX_i| ]. Inside the E E_ε: sup_{w ∈ R^d} |(1/n) wᵀ ∑_{i=1}^n ε_i X_i| = sup_{w ∈ R^d} ‖w‖₂ · ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂. As long as ∑_{i=1}^n ε_i X_i ≠ 0, this is unbounded! :-( 31

66 Regularization. Recall the SVM optimization problem: min_{w ∈ R^d} (λ/2)‖w‖₂² + (1/n) ∑_{i=1}^n [1 − y_i wᵀx_i]₊. 32

67 Regularization. Recall the SVM optimization problem: min_{w ∈ R^d} (λ/2)‖w‖₂² + (1/n) ∑_{i=1}^n [1 − y_i wᵀx_i]₊. The objective value at w = 0 is 1, so the objective value at the minimizer ŵ is no worse than this: (λ/2)‖ŵ‖₂² + (1/n) ∑_{i=1}^n [1 − y_i ŵᵀx_i]₊ ≤ 1.

68 Regularization. Recall the SVM optimization problem: min_{w ∈ R^d} (λ/2)‖w‖₂² + (1/n) ∑_{i=1}^n [1 − y_i wᵀx_i]₊. The objective value at w = 0 is 1, so the objective value at the minimizer ŵ is no worse than this: (λ/2)‖ŵ‖₂² + (1/n) ∑_{i=1}^n [1 − y_i ŵᵀx_i]₊ ≤ 1. Therefore (λ/2)‖ŵ‖₂² ≤ 1, i.e., ‖ŵ‖₂ ≤ √(2/λ). 32

69 Rademacher complexity of bounded linear predictors. Bounded linear functions F_{ℓ2,B} = {x ↦ wᵀx : w ∈ R^d, ‖w‖₂ ≤ B}. 33

70 Rademacher complexity of bounded linear predictors. Bounded linear functions F_{ℓ2,B} = {x ↦ wᵀx : w ∈ R^d, ‖w‖₂ ≤ B}. What is the Rademacher complexity of F_{ℓ2,B}? Rad_n(F_{ℓ2,B}) = E E_ε[ sup_{‖w‖₂ ≤ B} |(1/n) ∑_{i=1}^n ε_i wᵀX_i| ] = B · E E_ε[ sup_{‖u‖₂ ≤ 1} |(1/n) ∑_{i=1}^n ε_i uᵀX_i| ] = B · E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂ ] ≤ B · √( E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂² ] ).

71 Rademacher complexity of bounded linear predictors. Bounded linear functions F_{ℓ2,B} = {x ↦ wᵀx : w ∈ R^d, ‖w‖₂ ≤ B}. What is the Rademacher complexity of F_{ℓ2,B}? Rad_n(F_{ℓ2,B}) = E E_ε[ sup_{‖w‖₂ ≤ B} |(1/n) ∑_{i=1}^n ε_i wᵀX_i| ] = B · E E_ε[ sup_{‖u‖₂ ≤ 1} |(1/n) ∑_{i=1}^n ε_i uᵀX_i| ] = B · E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂ ] ≤ B · √( E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂² ] ). This is a d-dimensional random walk, where the i-th step is ±X_i. 33

72 Rademacher complexity of bounded linear predictors (2). E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂² ] = (1/n²) E E_ε[ ∑_{i=1}^n ‖X_i‖₂² + ∑_{i ≠ j} ε_i ε_j X_iᵀX_j ] = (1/n²) E[ ∑_{i=1}^n ‖X_i‖₂² ] = (1/n) E[ ‖X‖₂² ].

73 Rademacher complexity of bounded linear predictors (2). E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂² ] = (1/n²) E E_ε[ ∑_{i=1}^n ‖X_i‖₂² + ∑_{i ≠ j} ε_i ε_j X_iᵀX_j ] = (1/n²) E[ ∑_{i=1}^n ‖X_i‖₂² ] = (1/n) E[ ‖X‖₂² ]. Conclusion. Rademacher complexity of F_{ℓ2,B} = {w ∈ R^d : ‖w‖₂ ≤ B}: Rad_n(F_{ℓ2,B}) ≤ B √( E[‖X‖₂²] / n ). 34
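
A quick numerical sanity check (my own sketch, with B = 1, conditioning on a single Gaussian sample X_1, ..., X_n): estimate E_ε‖(1/n) ∑_i ε_i X_i‖₂ by simulation and compare it with √(E‖X‖₂²/n), here computed from the empirical second moment.

import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 500, 20, 2000
X = rng.normal(size=(n, d))                     # one sample X_1, ..., X_n
eps = rng.choice([-1, 1], size=(trials, n))     # many draws of the sign vector
walk_norms = np.linalg.norm(eps @ X / n, axis=1)
print("Monte Carlo E_eps ||(1/n) sum eps_i X_i||_2:", walk_norms.mean())
print("bound sqrt(mean ||X_i||_2^2 / n):          ", np.sqrt((X**2).sum(axis=1).mean() / n))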

74 Risk bound for SVM. E[R(ŵ) − R(w*)] = E[R(ŵ) − R_n(ŵ)] (≤ ε) + E[ ((λ/2)‖ŵ‖₂² + R_n(ŵ)) − ((λ/2)‖w*‖₂² + R_n(w*)) ] (≤ 0) + E[R_n(w*) − R(w*)] (= 0) + E[ (λ/2)‖w*‖₂² − (λ/2)‖ŵ‖₂² ] ≤ ε + (λ/2)‖w*‖₂², where w* ∈ arg min_{w ∈ R^d} (λ/2)‖w‖₂² + R(w) and ε = O( √( E[‖X‖₂²] / (λn) ) + √(1/n) ). 35

75 Risk bound for SVM. E[R(ŵ) − R(w*)] = E[R(ŵ) − R_n(ŵ)] (≤ ε) + E[ ((λ/2)‖ŵ‖₂² + R_n(ŵ)) − ((λ/2)‖w*‖₂² + R_n(w*)) ] (≤ 0) + E[R_n(w*) − R(w*)] (= 0) + E[ (λ/2)‖w*‖₂² − (λ/2)‖ŵ‖₂² ] ≤ ε + (λ/2)‖w*‖₂², where w* ∈ arg min_{w ∈ R^d} (λ/2)‖w‖₂² + R(w) and ε = O( √( E[‖X‖₂²] / (λn) ) + √(1/n) ). This suggests we should use λ → 0 such that λn → ∞ as n → ∞. 35

76 Kernels. The excess risk bound has no explicit dependence on the dimension d. In particular, it holds in infinite-dimensional inner product spaces. The SVM can be applied in such spaces as long as there is an algorithm for computing inner products. This is the kernel trick, and the corresponding spaces are called Reproducing Kernel Hilbert Spaces (RKHS). 36

77 Kernels. The excess risk bound has no explicit dependence on the dimension d. In particular, it holds in infinite-dimensional inner product spaces. The SVM can be applied in such spaces as long as there is an algorithm for computing inner products. This is the kernel trick, and the corresponding spaces are called Reproducing Kernel Hilbert Spaces (RKHS). Universal approximation. With some RKHSs, we can approximate any function arbitrarily well: lim_{λ → 0} inf_{w ∈ F} { (λ/2)‖w‖² + R(w) } = inf_{g: X → R} R(g). 36

78 Other regularizers. Instead of the SVM, suppose ŵ is the solution to min_{w ∈ R^d} λ‖w‖₁ + R_n(w). So ŵ ∈ F_{ℓ1,B} = {w ∈ R^d : ‖w‖₁ ≤ B} for B = 1/λ. 37

79 Other regularizers. Instead of the SVM, suppose ŵ is the solution to min_{w ∈ R^d} λ‖w‖₁ + R_n(w). So ŵ ∈ F_{ℓ1,B} = {w ∈ R^d : ‖w‖₁ ≤ B} for B = 1/λ. What is the Rademacher complexity of F_{ℓ1,B}? Rad_n(F_{ℓ1,B}) = E E_ε[ sup_{‖w‖₁ ≤ B} |(1/n) ∑_{i=1}^n ε_i wᵀX_i| ] = B · E E_ε[ sup_{‖u‖₁ ≤ 1} |(1/n) ∑_{i=1}^n ε_i uᵀX_i| ] = B · E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖_∞ ]. 37

80 Rademacher complexity of ℓ1-bounded linear predictors. Can show, using a martingale argument, E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖_∞ ] ≤ √( O(log d) · E[‖X‖_∞²] / n ). 38

81 Rademacher complexity of ℓ1-bounded linear predictors. Can show, using a martingale argument, E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖_∞ ] ≤ √( O(log d) · E[‖X‖_∞²] / n ). Rademacher complexity of F_{ℓ1,B} = {w ∈ R^d : ‖w‖₁ ≤ B}: Rad_n(F_{ℓ1,B}) ≤ B √( O(log d) · E[‖X‖_∞²] / n ). 38

82 Rademacher complexity of ℓ1-bounded linear predictors. Can show, using a martingale argument, E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖_∞ ] ≤ √( O(log d) · E[‖X‖_∞²] / n ). Rademacher complexity of F_{ℓ1,B} = {w ∈ R^d : ‖w‖₁ ≤ B}: Rad_n(F_{ℓ1,B}) ≤ B √( O(log d) · E[‖X‖_∞²] / n ). Let X = {−1, +1}^d. Then ‖x‖₂² = d but ‖x‖_∞² = 1 for all x ∈ X. The dependence on d is much better than using the bound for ℓ2-bounded linear predictors, which would have looked like B √(d/n). 38
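
The gap between the two norms is easy to see numerically. The sketch below (mine; the dimensions and sample sizes are arbitrary) compares E_ε‖(1/n) ∑_i ε_i X_i‖_∞ and E_ε‖(1/n) ∑_i ε_i X_i‖₂ for X uniform on {−1, +1}^d, with √(2 log d / n) and √(d/n) as rough reference scales.

import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 500, 1000, 500
X = rng.choice([-1, 1], size=(n, d))            # X uniform on {-1,+1}^d
eps = rng.choice([-1, 1], size=(trials, n))
avg = eps @ X / n                               # each row is (1/n) sum_i eps_i X_i
print("E ||.||_inf:", np.abs(avg).max(axis=1).mean(),
      " reference sqrt(2 log d / n):", np.sqrt(2 * np.log(d) / n))
print("E ||.||_2:  ", np.linalg.norm(avg, axis=1).mean(),
      " reference sqrt(d / n):", np.sqrt(d / n))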

83 Rademacher complexity of ℓ1-bounded linear predictors. Can show, using a martingale argument, E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖_∞ ] ≤ √( O(log d) · E[‖X‖_∞²] / n ). Rademacher complexity of F_{ℓ1,B} = {w ∈ R^d : ‖w‖₁ ≤ B}: Rad_n(F_{ℓ1,B}) ≤ B √( O(log d) · E[‖X‖_∞²] / n ). Let X = {−1, +1}^d. Then ‖x‖₂² = d but ‖x‖_∞² = 1 for all x ∈ X. The dependence on d is much better than using the bound for ℓ2-bounded linear predictors, which would have looked like B √(d/n). This kind of bound is used to study the generalization of AdaBoost. 38

84 Other examples of Rademacher complexity. F = any class of {0, 1}-valued functions with VC dimension V: Rad_n(F) = O( √(V/n) ). F = ReLU networks of depth D with parameter matrices of Frobenius norm 1: Rad_n(F) = O( √( D · E[‖X‖₂²] / n ) ). F = Lipschitz functions from [0, 1]^d to R: Rad_n(F) = O( n^{−1/(2+d)} ). F = functions from [0, 1]^d to R with Lipschitz k-th derivatives: Rad_n(F) = O( n^{−(k+1)/(2(k+1)+d)} ). 39

85 Questions Are these the right notions of complexity? 40

86 Questions. Are these the right notions of complexity? For the SVM, the complexity of ℓ2-bounded linear predictors is relevant because ℓ2-regularization explicitly ensures the solution to the SVM problem is ℓ2-bounded. 40

87 Questions. Are these the right notions of complexity? For the SVM, the complexity of ℓ2-bounded linear predictors is relevant because ℓ2-regularization explicitly ensures the solution to the SVM problem is ℓ2-bounded. Do training algorithms for neural nets lead to Frobenius norm-bounded parameter matrices? 40

88 Questions. Are these the right notions of complexity? For the SVM, the complexity of ℓ2-bounded linear predictors is relevant because ℓ2-regularization explicitly ensures the solution to the SVM problem is ℓ2-bounded. Do training algorithms for neural nets lead to Frobenius norm-bounded parameter matrices? Do complexity bounds suggest different algorithms? 40

89 Beyond uniform convergence 41

90 Deficiencies of uniform convergence analysis. For certain loss functions, if R(f) is small, then the variance of R_n(f) is also small, and the bound should reflect this; instead of Hoeffding's inequality, use a concentration inequality that involves variance information (e.g., Bernstein's inequality). It is overkill to require all functions in F to not over-fit; we just need to worry about the functions f with, e.g., small empirical risk. Solution: local Rademacher complexity. 42

91 Example: Occam's razor bound. Suppose F is countable and we fix (a priori) a probability distribution π = (π_f : f ∈ F) on F. Think of π as placing bets on which functions are likely to be the one picked by your learning algorithm. 43

92 Example: Occam's razor bound. Suppose F is countable and we fix (a priori) a probability distribution π = (π_f : f ∈ F) on F. Think of π as placing bets on which functions are likely to be the one picked by your learning algorithm. For any fixed f ∈ F, P(|R_n(f) − R(f)| ≥ t_f) ≤ 2 exp(−2n t_f²) for any t_f > 0, by Hoeffding's inequality and the union bound. Note: we can choose the t_f's non-uniformly. 43

93 Occam's razor bound (continued). Let t_f = √( (ln(1/π_f) + ln(2/δ)) / (2n) ). By the union bound, P(∃ f ∈ F s.t. |R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} P(|R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} 2 exp(−2n t_f²) = ∑_{f ∈ F} π_f δ = δ. 44

94 Occam's razor bound (continued). Let t_f = √( (ln(1/π_f) + ln(2/δ)) / (2n) ). By the union bound, P(∃ f ∈ F s.t. |R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} P(|R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} 2 exp(−2n t_f²) = ∑_{f ∈ F} π_f δ = δ. Theorem. For any δ ∈ (0, 1), P(∀ f ∈ F: |R_n(f) − R(f)| < √( (ln(1/π_f) + ln(2/δ)) / (2n) )) ≥ 1 − δ. 44

95 Occam's razor bound (continued). Let t_f = √( (ln(1/π_f) + ln(2/δ)) / (2n) ). By the union bound, P(∃ f ∈ F s.t. |R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} P(|R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} 2 exp(−2n t_f²) = ∑_{f ∈ F} π_f δ = δ. Theorem. For any δ ∈ (0, 1), P(∀ f ∈ F: |R_n(f) − R(f)| < √( (ln(1/π_f) + ln(2/δ)) / (2n) )) ≥ 1 − δ. A better bound for functions f with higher prior probability π_f! 44
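
For a concrete prior, the per-function widths t_f are straightforward to compute. A small sketch (my own helper; the geometric prior is just an example):

import numpy as np

def occam_widths(prior, n, delta):
    # t_f = sqrt( (ln(1/pi_f) + ln(2/delta)) / (2n) ) for each prior weight pi_f
    prior = np.asarray(prior, dtype=float)
    return np.sqrt((np.log(1.0 / prior) + np.log(2.0 / delta)) / (2.0 * n))

pi = 0.5 ** np.arange(1, 11)                    # pi_f = 2^{-f}, f = 1..10 (sums to just under 1)
print(occam_widths(pi, n=10_000, delta=0.01))   # widths grow as the prior weight shrinks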

96 Other forms of generalization analysis. Stability: if a learning algorithm's output does not change much when a single data point is changed, then its output will generalize; connections to differential privacy and regularization. Compression bounds: if a learning algorithm's output is invariant to all but a small number k ≪ n of the training data (e.g., the number of support vectors in an SVM), then we get a bound of the form k/(n − k). Direct analyses: some well-known learning algorithms do not fit the mold of the typical (regularized) ERM algorithm, and seem to require a direct analysis, e.g., the nearest neighbor rule. Many others. 45

97 Many active areas of research in learning theory Implicit bias of optimization algorithms E.g., gradient descent for least squares linear regression converges to solution of smallest norm. What about for other problems? Efficient algorithms for non-linear models E.g., polynomials, neural networks, kernel machines. Understand if/why existing algorithms work! Learning algorithms with robustness guarantees Noisy labels, missing / malformed data, heavy-tail distributions, adversarial corruptions, etc. Interactive learning Learning algorithms that interact with external environment (e.g., bandits, active learning, reinforcement learning). More: see proceedings of Conference on Learning Theory! 46

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016 12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds

More information

Foundations of Machine Learning

Foundations of Machine Learning Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

RegML 2018 Class 2 Tikhonov regularization and kernels

RegML 2018 Class 2 Tikhonov regularization and kernels RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 12: Weak Learnability and the l 1 margin Converse to Scale-Sensitive Learning Stability Convex-Lipschitz-Bounded Problems

More information

The Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee

The Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee The Learning Problem and Regularization 9.520 Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing

More information

The Learning Problem and Regularization

The Learning Problem and Regularization 9.520 Class 02 February 2011 Computational Learning Statistical Learning Theory Learning is viewed as a generalization/inference problem from usually small sets of high dimensional, noisy data. Learning

More information

Computational Learning Theory - Hilary Term : Learning Real-valued Functions

Computational Learning Theory - Hilary Term : Learning Real-valued Functions Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling

More information

Fast learning rates for plug-in classifiers under the margin condition

Fast learning rates for plug-in classifiers under the margin condition Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Hilbert Space Methods in Learning

Hilbert Space Methods in Learning Hilbert Space Methods in Learning guest lecturer: Risi Kondor 6772 Advanced Machine Learning and Perception (Jebara), Columbia University, October 15, 2003. 1 1. A general formulation of the learning problem

More information

Oslo Class 2 Tikhonov regularization and kernels

Oslo Class 2 Tikhonov regularization and kernels RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Symmetrization and Rademacher Averages

Symmetrization and Rademacher Averages Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and

More information

Generalization Bounds and Stability

Generalization Bounds and Stability Generalization Bounds and Stability Lorenzo Rosasco Tomaso Poggio 9.520 Class 9 2009 About this class Goal To recall the notion of generalization bounds and show how they can be derived from a stability

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Generalization, Overfitting, and Model Selection

Generalization, Overfitting, and Model Selection Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,

More information

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,

More information

1 Review of The Learning Setting

1 Review of The Learning Setting COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University. SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016 Outline Quick Recap of Least Square Ridge Regression

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines

More information

Statistical Properties of Large Margin Classifiers

Statistical Properties of Large Margin Classifiers Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Introduction to Machine Learning (67577) Lecture 3

Introduction to Machine Learning (67577) Lecture 3 Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz

More information

Divide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates

Divide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates : A Distributed Algorithm with Minimax Optimal Rates Yuchen Zhang, John C. Duchi, Martin Wainwright (UC Berkeley;http://arxiv.org/pdf/1305.509; Apr 9, 014) Gatsby Unit, Tea Talk June 10, 014 Outline Motivation.

More information

BINARY CLASSIFICATION

BINARY CLASSIFICATION BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the

More information

IFT Lecture 7 Elements of statistical learning theory

IFT Lecture 7 Elements of statistical learning theory IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

Kernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444

Kernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444 Kernel Methods Jean-Philippe Vert Jean-Philippe.Vert@mines.org Last update: Jan 2015 Jean-Philippe Vert (Mines ParisTech) 1 / 444 What we know how to solve Jean-Philippe Vert (Mines ParisTech) 2 / 444

More information

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization

More information

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution

More information

Math 328 Course Notes

Math 328 Course Notes Math 328 Course Notes Ian Robertson March 3, 2006 3 Properties of C[0, 1]: Sup-norm and Completeness In this chapter we are going to examine the vector space of all continuous functions defined on the

More information

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved

More information

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee 9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic

More information

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity; CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear

More information

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions 58669 Supervised Machine Learning (Spring 014) Homework, sample solutions Credit for the solutions goes to mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen

More information

Kernel Learning via Random Fourier Representations

Kernel Learning via Random Fourier Representations Kernel Learning via Random Fourier Representations L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Module 5: Machine Learning L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Kernel Learning via Random

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

MythBusters: A Deep Learning Edition

MythBusters: A Deep Learning Edition 1 / 8 MythBusters: A Deep Learning Edition Sasha Rakhlin MIT Jan 18-19, 2018 2 / 8 Outline A Few Remarks on Generalization Myths 3 / 8 Myth #1: Current theory is lacking because deep neural networks have

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

10.1 The Formal Model

10.1 The Formal Model 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Convex optimization COMS 4771

Convex optimization COMS 4771 Convex optimization COMS 4771 1. Recap: learning via optimization Soft-margin SVMs Soft-margin SVM optimization problem defined by training data: w R d λ 2 w 2 2 + 1 n n [ ] 1 y ix T i w. + 1 / 15 Soft-margin

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

The Representor Theorem, Kernels, and Hilbert Spaces

The Representor Theorem, Kernels, and Hilbert Spaces The Representor Theorem, Kernels, and Hilbert Spaces We will now work with infinite dimensional feature vectors and parameter vectors. The space l is defined to be the set of sequences f 1, f, f 3,...

More information

Computational and Statistical Learning theory

Computational and Statistical Learning theory Computational and Statistical Learning theory Problem set 2 Due: January 31st Email solutions to : karthik at ttic dot edu Notation : Input space : X Label space : Y = {±1} Sample : (x 1, y 1,..., (x n,

More information

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Machine Learning. Model Selection and Validation. Fabio Vandin November 7, 2017

Machine Learning. Model Selection and Validation. Fabio Vandin November 7, 2017 Machine Learning Model Selection and Validation Fabio Vandin November 7, 2017 1 Model Selection When we have to solve a machine learning task: there are different algorithms/classes algorithms have parameters

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Computational Learning Theory. CS534 - Machine Learning

Computational Learning Theory. CS534 - Machine Learning Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows

More information

The PAC Learning Framework -II

The PAC Learning Framework -II The PAC Learning Framework -II Prof. Dan A. Simovici UMB 1 / 1 Outline 1 Finite Hypothesis Space - The Inconsistent Case 2 Deterministic versus stochastic scenario 3 Bayes Error and Noise 2 / 1 Outline

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Kimball, B-11 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)

More information

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017 Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization

Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization John Duchi Outline I Motivation 1 Uniform laws of large numbers 2 Loss minimization and data dependence II Uniform

More information

Case study: stochastic simulation via Rademacher bootstrap

Case study: stochastic simulation via Rademacher bootstrap Case study: stochastic simulation via Rademacher bootstrap Maxim Raginsky December 4, 2013 In this lecture, we will look at an application of statistical learning theory to the problem of efficient stochastic

More information

Statistical Machine Learning Hilary Term 2018

Statistical Machine Learning Hilary Term 2018 Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

CS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims

CS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims CS340 Machine learning Lecture 5 Learning theory cont'd Some slides are borrowed from Stuart Russell and Thorsten Joachims Inductive learning Simplest form: learn a function from examples f is the target

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

AdaBoost and other Large Margin Classifiers: Convexity in Classification

AdaBoost and other Large Margin Classifiers: Convexity in Classification AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

References for online kernel methods

References for online kernel methods References for online kernel methods W. Liu, J. Principe, S. Haykin Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley, 2010. W. Liu, P. Pokharel, J. Principe. The kernel least mean square

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

CIS519: Applied Machine Learning Fall Homework 5. Due: December 10 th, 2018, 11:59 PM

CIS519: Applied Machine Learning Fall Homework 5. Due: December 10 th, 2018, 11:59 PM CIS59: Applied Machine Learning Fall 208 Homework 5 Handed Out: December 5 th, 208 Due: December 0 th, 208, :59 PM Feel free to talk to other members of the class in doing the homework. I am more concerned

More information