Generalization theory

1 Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1

2 Motivation 2

3 Support vector machines. X = R^d, Y = {−1, +1}. Return solution ŵ ∈ R^d to the following optimization problem: min_{w ∈ R^d} (λ/2)‖w‖₂² + (1/n) ∑_{i=1}^n [1 − y_i wᵀx_i]₊. The loss function is the hinge loss ℓ(ŷ, y) = [1 − yŷ]₊ = max{1 − yŷ, 0}. (Here, we are okay with a real-valued prediction.) The (λ/2)‖w‖₂² term is called Tikhonov regularization, which we'll discuss later. 3
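
To make the objective concrete, here is a minimal numerical sketch (mine, not from the lecture) of the regularized hinge-loss objective and a plain subgradient-descent solver in Python/numpy; the function names, the synthetic data, the 1/(λt) step size, and all parameter values are illustrative assumptions.

import numpy as np

def svm_objective(w, X, y, lam):
    # (lam/2) * ||w||_2^2 + average hinge loss [1 - y_i w^T x_i]_+
    margins = y * (X @ w)
    return 0.5 * lam * np.dot(w, w) + np.maximum(1.0 - margins, 0.0).mean()

def svm_subgradient(w, X, y, lam):
    # A subgradient of the (non-smooth) objective at w.
    margins = y * (X @ w)
    active = (margins < 1.0).astype(float)      # examples with positive hinge loss
    return lam * w - (X * (active * y)[:, None]).mean(axis=0)

# Tiny usage example on synthetic data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5) + 0.1 * rng.normal(size=200))
w, lam = np.zeros(5), 0.1
for t in range(1, 501):                         # plain subgradient descent, step 1/(lam*t)
    w -= (1.0 / (lam * t)) * svm_subgradient(w, X, y, lam)
print("objective at w-hat:", svm_objective(w, X, y, lam))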

4 Basic statistical model for data. IID model of data: training data and the test example are independent and identically distributed (X × Y)-valued random variables: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ∼ P (iid). 4

5 Basic statistical model for data. IID model of data: training data and the test example are independent and identically distributed (X × Y)-valued random variables: (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ∼ P (iid). SVM in the iid model: return solution ŵ to the following optimization problem: min_{w ∈ R^d} (λ/2)‖w‖₂² + (1/n) ∑_{i=1}^n [1 − Y_i wᵀX_i]₊. Therefore, ŵ is a random variable, depending on (X_1, Y_1), ..., (X_n, Y_n). 4

6 Convergence of empirical risk. For w that does not depend on the training data: the empirical risk R_n(w) = (1/n) ∑_{i=1}^n ℓ(wᵀX_i, Y_i) is a sum of iid random variables. 5

7 Convergence of empirical risk. For w that does not depend on the training data: the empirical risk R_n(w) = (1/n) ∑_{i=1}^n ℓ(wᵀX_i, Y_i) is a sum of iid random variables. The Law of Large Numbers gives an asymptotic result: R_n(w) = (1/n) ∑_{i=1}^n ℓ(wᵀX_i, Y_i) →_p E[ℓ(wᵀX, Y)] = R(w). (This can be made non-asymptotic.) 5

8 Uniform convergence of empirical risk. However, ŵ does depend on the training data. The empirical risk of ŵ is not a sum of iid random variables: R_n(ŵ) = (1/n) ∑_{i=1}^n ℓ(ŵᵀX_i, Y_i). 6

9 Uniform convergence of empirical risk. However, ŵ does depend on the training data. The empirical risk of ŵ is not a sum of iid random variables: R_n(ŵ) = (1/n) ∑_{i=1}^n ℓ(ŵᵀX_i, Y_i). Idea: ŵ could conceivably take any value w, but if sup_w |R_n(w) − R(w)| →_p 0, (1) then R_n(ŵ) →_p R(ŵ) as well. (1) is called uniform convergence. 6

10 Detour: Concentration inequalities 7

11 Symmetric random walk. Rademacher random variables ε_1, ..., ε_n iid with P(ε_i = +1) = P(ε_i = −1) = 1/2. Symmetric random walk: position after n steps is S_n = ∑_{i=1}^n ε_i. 8

12 Symmetric random walk. Rademacher random variables ε_1, ..., ε_n iid with P(ε_i = +1) = P(ε_i = −1) = 1/2. Symmetric random walk: position after n steps is S_n = ∑_{i=1}^n ε_i. How far from the origin? By independence, var(S_n) = ∑_{i=1}^n var(ε_i) = n. So the expected distance from the origin is E|S_n| ≤ √var(S_n) = √n. 8

13 Symmetric random walk. Rademacher random variables ε_1, ..., ε_n iid with P(ε_i = +1) = P(ε_i = −1) = 1/2. Symmetric random walk: position after n steps is S_n = ∑_{i=1}^n ε_i. How far from the origin? By independence, var(S_n) = ∑_{i=1}^n var(ε_i) = n. So the expected distance from the origin is E|S_n| ≤ √var(S_n) = √n. How likely is S_n to be many multiples of √n from the origin? 8
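
A quick Monte Carlo sketch of the √n scaling (my own check, not part of the slides): simulate many independent walks, then look at E|S_n|/√n and at the fraction of walks ending at least c√n from the origin. The sample sizes and the constant c are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, trials, c = 1000, 10_000, 2.0
eps = rng.choice([-1, 1], size=(trials, n))     # Rademacher signs for each walk
S = eps.sum(axis=1)                             # position after n steps, one per trial
print("E|S_n| / sqrt(n):", np.abs(S).mean() / np.sqrt(n))       # roughly sqrt(2/pi) ~ 0.8
print("P(|S_n| >= c*sqrt(n)):", (np.abs(S) >= c * np.sqrt(n)).mean())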

14 Markov's inequality. For any random variable X and any t > 0, P(|X| ≥ t) ≤ E|X| / t. Proof: t · 1{|X| ≥ t} ≤ |X|. 9

15 Markov's inequality. For any random variable X and any t > 0, P(|X| ≥ t) ≤ E|X| / t. Proof: t · 1{|X| ≥ t} ≤ |X|. Application to the symmetric random walk: P(|S_n| ≥ c√n) ≤ E|S_n| / (c√n) ≤ 1/c. 9

16 Hoeffding's inequality. If X_1, ..., X_n are independent random variables, with X_i taking values in [a_i, b_i], then for any t ≥ 0, P(∑_{i=1}^n (X_i − E[X_i]) ≥ t) ≤ exp(−2t² / ∑_{i=1}^n (b_i − a_i)²). 10

17 Hoeffding's inequality. If X_1, ..., X_n are independent random variables, with X_i taking values in [a_i, b_i], then for any t ≥ 0, P(∑_{i=1}^n (X_i − E[X_i]) ≥ t) ≤ exp(−2t² / ∑_{i=1}^n (b_i − a_i)²). E.g., Rademacher random variables have [a_i, b_i] = [−1, +1], so P(S_n ≥ t) ≤ exp(−2t²/(4n)) = exp(−t²/(2n)). 10

18 Applying Hoeffding's inequality to the symmetric random walk. Union bound: for any events A and B, P(A ∪ B) ≤ P(A) + P(B). 11

19 Applying Hoeffding's inequality to the symmetric random walk. Union bound: for any events A and B, P(A ∪ B) ≤ P(A) + P(B). 1. Apply Hoeffding to ε_1, ..., ε_n: P(S_n ≥ c√n) ≤ exp(−c²/2). 11

20 Applying Hoeffding's inequality to the symmetric random walk. Union bound: for any events A and B, P(A ∪ B) ≤ P(A) + P(B). 1. Apply Hoeffding to ε_1, ..., ε_n: P(S_n ≥ c√n) ≤ exp(−c²/2). 2. Apply Hoeffding to −ε_1, ..., −ε_n: P(−S_n ≥ c√n) ≤ exp(−c²/2). 11

21 Applying Hoeffding's inequality to the symmetric random walk. Union bound: for any events A and B, P(A ∪ B) ≤ P(A) + P(B). 1. Apply Hoeffding to ε_1, ..., ε_n: P(S_n ≥ c√n) ≤ exp(−c²/2). 2. Apply Hoeffding to −ε_1, ..., −ε_n: P(−S_n ≥ c√n) ≤ exp(−c²/2). 3. Therefore, by the union bound, P(|S_n| ≥ c√n) ≤ 2 exp(−c²/2). (Compare to the bound from Markov's inequality: 1/c.) 11
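
For a sense of how much tighter Hoeffding is than Markov here, a tiny numerical comparison (my own sketch) of the two tail bounds at a few values of c:

import math

for c in (2.0, 3.0, 4.0):
    markov = 1.0 / c                            # bound from Markov's inequality
    hoeffding = 2.0 * math.exp(-c * c / 2.0)    # bound from Hoeffding + union bound
    print(f"c = {c}: Markov {markov:.3f}, Hoeffding {hoeffding:.5f}")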

22 Equivalent form of Hoeffding's inequality. Let X_1, ..., X_n be independent random variables, with X_i taking values in [a_i, b_i], and let S_n = ∑_{i=1}^n X_i. For any δ ∈ (0, 1), P( S_n − E[S_n] < √( (1/2) ∑_{i=1}^n (b_i − a_i)² ln(1/δ) ) ) ≥ 1 − δ. This is a high-probability upper bound on S_n − E[S_n]. 12
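
A small helper (my own naming, a sketch of the statement above) that evaluates the high-probability deviation term for given ranges [a_i, b_i] and confidence δ:

import numpy as np

def hoeffding_deviation(widths, delta):
    # sqrt( (1/2) * sum_i (b_i - a_i)^2 * ln(1/delta) ), with widths[i] = b_i - a_i
    widths = np.asarray(widths, dtype=float)
    return np.sqrt(0.5 * np.sum(widths**2) * np.log(1.0 / delta))

# e.g. n = 1000 Rademacher variables, each with width b_i - a_i = 2:
print(hoeffding_deviation([2.0] * 1000, delta=0.01))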

23 Uniform convergence: Finite classes 13

24 Back to statistical learning. Cast of characters: feature and outcome spaces X, Y; function class F ⊆ Y^X; loss function ℓ: Y × Y → R₊ (assume bounded by 1); training and test data (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ∼ P (iid). 14

25 Back to statistical learning. Cast of characters: feature and outcome spaces X, Y; function class F ⊆ Y^X; loss function ℓ: Y × Y → R₊ (assume bounded by 1); training and test data (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ∼ P (iid). We let f̂ ∈ arg min_{f ∈ F} R_n(f) be a minimizer of the empirical risk R_n(f) = (1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i). 14

26 Back to statistical learning. Cast of characters: feature and outcome spaces X, Y; function class F ⊆ Y^X; loss function ℓ: Y × Y → R₊ (assume bounded by 1); training and test data (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ∼ P (iid). We let f̂ ∈ arg min_{f ∈ F} R_n(f) be a minimizer of the empirical risk R_n(f) = (1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i). Our worry: over-fitting, i.e., R(f̂) ≫ R_n(f̂). 14

27 Convergence of empirical risk for a fixed function. For any fixed function f ∈ F, E[R_n(f)] = E[(1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i)] = (1/n) ∑_{i=1}^n E[ℓ(f(X_i), Y_i)] = R(f). 15

28 Convergence of empirical risk for a fixed function. For any fixed function f ∈ F, E[R_n(f)] = E[(1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i)] = (1/n) ∑_{i=1}^n E[ℓ(f(X_i), Y_i)] = R(f). Since R_n(f) is a sum of n independent [0, 1/n]-valued random variables, P(|R_n(f) − R(f)| ≥ t) ≤ 2 exp(−2t² / (n · (1/n)²)) = 2 exp(−2nt²) for any t > 0, by Hoeffding's inequality and the union bound. 15

29 Convergence of empirical risk for a fixed function. For any fixed function f ∈ F, E[R_n(f)] = E[(1/n) ∑_{i=1}^n ℓ(f(X_i), Y_i)] = (1/n) ∑_{i=1}^n E[ℓ(f(X_i), Y_i)] = R(f). Since R_n(f) is a sum of n independent [0, 1/n]-valued random variables, P(|R_n(f) − R(f)| ≥ t) ≤ 2 exp(−2t² / (n · (1/n)²)) = 2 exp(−2nt²) for any t > 0, by Hoeffding's inequality and the union bound. This argument does not apply to f̂, because f̂ depends on (X_1, Y_1), ..., (X_n, Y_n). 15
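
The fixed-function case is easy to check by simulation. The sketch below (mine; it uses uniform [0, 1] draws as a stand-in for the bounded losses ℓ(f(X_i), Y_i)) estimates P(|R_n(f) − R(f)| ≥ t) and compares it with the 2 exp(−2nt²) bound.

import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 200, 20_000, 0.05
losses = rng.uniform(size=(trials, n))          # stand-in for l(f(X_i), Y_i) in [0, 1]
Rn = losses.mean(axis=1)                        # empirical risk in each trial
R = 0.5                                         # true risk of this stand-in loss
print("empirical P(|R_n - R| >= t):", (np.abs(Rn - R) >= t).mean())
print("Hoeffding bound 2*exp(-2*n*t^2):", 2 * np.exp(-2 * n * t**2))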

30 Uniform convergence. We cannot directly apply Hoeffding's inequality to f̂, since its empirical risk R_n(f̂) is not an average of iid random variables. 16

31 Uniform convergence. We cannot directly apply Hoeffding's inequality to f̂, since its empirical risk R_n(f̂) is not an average of iid random variables. One possible solution: ensure the empirical risk of every f ∈ F is close to its expected value. This is called uniform convergence. 16

32 Uniform convergence. We cannot directly apply Hoeffding's inequality to f̂, since its empirical risk R_n(f̂) is not an average of iid random variables. One possible solution: ensure the empirical risk of every f ∈ F is close to its expected value. This is called uniform convergence. How much data is needed to ensure this? 16

33 Uniform convergence for all functions in a finite class. If |F| < ∞, then by Hoeffding's inequality and the union bound, P(∃ f ∈ F s.t. |R_n(f) − R(f)| ≥ t) = P(⋃_{f ∈ F} {|R_n(f) − R(f)| ≥ t}) ≤ ∑_{f ∈ F} P(|R_n(f) − R(f)| ≥ t) ≤ |F| · 2 exp(−2nt²). 17

34 Uniform convergence for all functions in a finite class. If |F| < ∞, then by Hoeffding's inequality and the union bound, P(∃ f ∈ F s.t. |R_n(f) − R(f)| ≥ t) = P(⋃_{f ∈ F} {|R_n(f) − R(f)| ≥ t}) ≤ ∑_{f ∈ F} P(|R_n(f) − R(f)| ≥ t) ≤ |F| · 2 exp(−2nt²). Choose t so that the RHS is δ, and invert. Theorem. For any δ ∈ (0, 1), P(∀ f ∈ F: |R_n(f) − R(f)| < √(ln(2|F|/δ) / (2n))) ≥ 1 − δ. 17
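
In code, the theorem's deviation term and the sample size it implies look like this (my own helper functions; the numbers in the example are arbitrary):

import numpy as np

def uniform_deviation(n, size_F, delta):
    # sqrt( ln(2|F|/delta) / (2n) ): the uniform deviation bound from the theorem
    return np.sqrt(np.log(2.0 * size_F / delta) / (2.0 * n))

def samples_needed(size_F, delta, eps):
    # smallest n making the deviation bound at most eps
    return int(np.ceil(np.log(2.0 * size_F / delta) / (2.0 * eps**2)))

print(uniform_deviation(n=10_000, size_F=10**6, delta=0.01))
print(samples_needed(size_F=10**6, delta=0.01, eps=0.05))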

35 What we get from uniform convergence. If n ≫ log|F|, then with high probability, no function f ∈ F will over-fit the training data. 18

36 What we get from uniform convergence. If n ≫ log|F|, then with high probability, no function f ∈ F will over-fit the training data. Also: an empirical risk minimizer (ERM), like f̂, is near optimal! Theorem. With probability at least 1 − δ, R(f̂) − R(f*) = [R(f̂) − R_n(f̂)] (≤ ε) + [R_n(f̂) − R_n(f*)] (≤ 0) + [R_n(f*) − R(f*)] (≤ ε) ≤ 2ε, where f* ∈ arg min_{f ∈ F} R(f) and ε = √(ln(2|F|/δ) / (2n)). 18

37 Uniform convergence: General case 19

38 Uniform convergence: General case. Let F ⊆ R^X be a class of real-valued functions, and let P be a probability distribution on X. 20

39 Uniform convergence: General case. Let F ⊆ R^X be a class of real-valued functions, and let P be a probability distribution on X. Notation: let Pf = E[f(X)] for X ∼ P, and let P_n be the empirical distribution on X_1, ..., X_n ∼ P (iid), which assigns probability mass 1/n to each X_i, so P_n f = (1/n) ∑_{i=1}^n f(X_i). We are interested in the maximum (or supremum) deviation: sup_{f ∈ F} |P_n f − Pf|. 20

40 Uniform convergence: General case. Let F ⊆ R^X be a class of real-valued functions, and let P be a probability distribution on X. Notation: let Pf = E[f(X)] for X ∼ P, and let P_n be the empirical distribution on X_1, ..., X_n ∼ P (iid), which assigns probability mass 1/n to each X_i, so P_n f = (1/n) ∑_{i=1}^n f(X_i). We are interested in the maximum (or supremum) deviation: sup_{f ∈ F} |P_n f − Pf|. The arguments from before show that for any finite class of bounded functions F, sup_{f ∈ F} |P_n f − Pf| →_p 0, and they also give a non-asymptotic rate of convergence. 20

41 Infinite classes. For which classes F ⊆ R^X does uniform convergence hold? 21

42 Infinite classes. For which classes F ⊆ R^X does uniform convergence hold? Example: F = {f_S(x) = 1{x ∈ S} : S ⊆ R, |S| < ∞}, i.e., {0, 1}-valued functions that take the value 1 on a finite set. If P is continuous, then Pf = 0 for all f ∈ F. But sup_{f ∈ F} P_n f = 1 for all n. So sup_{f ∈ F} |P_n f − Pf| = 1 for all n. 21

43 Infinite classes. For which classes F ⊆ R^X does uniform convergence hold? Example: F = {f_S(x) = 1{x ∈ S} : S ⊆ R, |S| < ∞}, i.e., {0, 1}-valued functions that take the value 1 on a finite set. If P is continuous, then Pf = 0 for all f ∈ F. But sup_{f ∈ F} P_n f = 1 for all n. So sup_{f ∈ F} |P_n f − Pf| = 1 for all n. What is the appropriate complexity measure of a function class? 21

44 Rademacher complexity. Let ε_1, ..., ε_n be independent Rademacher random variables. Uniform convergence with F holds iff lim_{n→∞} E E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| ] = 0, where E_ε is expectation with respect to ε = (ε_1, ..., ε_n); the quantity under the limit is denoted Rad_n(F). 22

45 Rademacher complexity. Let ε_1, ..., ε_n be independent Rademacher random variables. Uniform convergence with F holds iff lim_{n→∞} E E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| ] = 0, where E_ε is expectation with respect to ε = (ε_1, ..., ε_n); the quantity under the limit is denoted Rad_n(F). Rad_n(F) is the Rademacher complexity of F, which measures how well vectors in the (random) set F(X_{1:n}) = {(f(X_1), ..., f(X_n)) : f ∈ F} can correlate with uniformly random signs ε_1, ..., ε_n. 22
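
Rad_n(F) can be estimated by Monte Carlo once the value matrix F(X_{1:n}) is in hand. The sketch below (my own; the random ±1-valued "functions" are purely illustrative) averages, over fresh random sign vectors, the largest absolute correlation achieved by any function in the class.

import numpy as np

def rademacher_complexity_mc(values, trials=10_000, rng=None):
    # values: |F| x n matrix with values[f, i] = f(X_i); returns a Monte Carlo
    # estimate of E_eps sup_f | (1/n) sum_i eps_i f(X_i) | for this fixed sample.
    if rng is None:
        rng = np.random.default_rng(0)
    n = values.shape[1]
    eps = rng.choice([-1, 1], size=(trials, n))
    correlations = eps @ values.T / n           # shape (trials, |F|)
    return np.abs(correlations).max(axis=1).mean()

# Example: 50 random {-1,+1}-valued functions evaluated on n = 200 points.
vals = np.random.default_rng(1).choice([-1, 1], size=(50, 200))
print(rademacher_complexity_mc(vals))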

46 Extreme cases of Rademacher complexity. For simplicity, assume X_1, ..., X_n are distinct (e.g., P continuous). 23

47 Extreme cases of Rademacher complexity. For simplicity, assume X_1, ..., X_n are distinct (e.g., P continuous). F contains a single function f_0: X → {−1, +1}: Rad_n(F) = E E_ε[ |(1/n) ∑_{i=1}^n ε_i f_0(X_i)| ] ≤ 1/√n. 23

48 Extreme cases of Rademacher complexity. For simplicity, assume X_1, ..., X_n are distinct (e.g., P continuous). F contains a single function f_0: X → {−1, +1}: Rad_n(F) = E E_ε[ |(1/n) ∑_{i=1}^n ε_i f_0(X_i)| ] ≤ 1/√n. F contains all functions X → {−1, +1}: Rad_n(F) = E E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| ] = 1. 23

49 Uniform convergence via Rademacher complexity. Theorem. 1. Uniform convergence in expectation: for any F ⊆ R^X, E[ sup_{f ∈ F} |P_n f − Pf| ] ≤ 2 Rad_n(F). 2. Uniform convergence with high probability: for any F ⊆ [−1, +1]^X and δ ∈ (0, 1), with probability ≥ 1 − δ, sup_{f ∈ F} |P_n f − Pf| ≤ 2 Rad_n(F) + √(2 ln(1/δ) / n). 24

50 Step 1: Symmetrization by ghost sample. Let P′_n be the empirical distribution on independent copies X′_1, ..., X′_n of X_1, ..., X_n. Write E′ for expectation with respect to X′_{1:n}. 25

51 Step 1: Symmetrization by ghost sample. Let P′_n be the empirical distribution on independent copies X′_1, ..., X′_n of X_1, ..., X_n. Write E′ for expectation with respect to X′_{1:n}. Then E[ sup_{f ∈ F} |P_n f − Pf| ] = E[ sup_{f ∈ F} |E′[(1/n) ∑_{i=1}^n (f(X_i) − f(X′_i))]| ] ≤ E E′[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n (f(X_i) − f(X′_i))| ] = E E′[ sup_{f ∈ F} |P_n f − P′_n f| ]. 25

52 Step 1: Symmetrization by ghost sample. Let P′_n be the empirical distribution on independent copies X′_1, ..., X′_n of X_1, ..., X_n. Write E′ for expectation with respect to X′_{1:n}. Then E[ sup_{f ∈ F} |P_n f − Pf| ] = E[ sup_{f ∈ F} |E′[(1/n) ∑_{i=1}^n (f(X_i) − f(X′_i))]| ] ≤ E E′[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n (f(X_i) − f(X′_i))| ] = E E′[ sup_{f ∈ F} |P_n f − P′_n f| ]. The random variable P_n f − P′_n f is arguably nicer than P_n f − Pf because it is symmetric. 25

53 Step 2: Symmetrization by random signs. Consider any ε = (ε_1, ..., ε_n) ∈ {−1, +1}^n. The distribution of P_n f − P′_n f = (1/n) ∑_{i=1}^n (f(X_i) − f(X′_i)) is the same as the distribution of (1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i)). 26

54 Step 2: Symmetrization by random signs. Consider any ε = (ε_1, ..., ε_n) ∈ {−1, +1}^n. The distribution of P_n f − P′_n f = (1/n) ∑_{i=1}^n (f(X_i) − f(X′_i)) is the same as the distribution of (1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i)). Thus, this is also true for the uniform average over all ε ∈ {−1, +1}^n (i.e., expectation over Rademacher ε): E E′[ sup_{f ∈ F} |P_n f − P′_n f| ] = E E′ E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i))| ]. 26

55 Step 3: Back to a single sample. By the triangle inequality, sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i))| ≤ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| + sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X′_i)|. The two terms on the RHS have the same distribution. 27

56 Step 3: Back to a single sample. By the triangle inequality, sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i))| ≤ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| + sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X′_i)|. The two terms on the RHS have the same distribution. So E E′ E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i (f(X_i) − f(X′_i))| ] ≤ 2 E E_ε[ sup_{f ∈ F} |(1/n) ∑_{i=1}^n ε_i f(X_i)| ] = 2 Rad_n(F). 27

57 Recap. For any F ⊆ R^X, E[ sup_{f ∈ F} |P_n f − Pf| ] ≤ 2 Rad_n(F). For any F ⊆ [−1, +1]^X and δ ∈ (0, 1), with probability ≥ 1 − δ, sup_{f ∈ F} |P_n f − Pf| ≤ 2 Rad_n(F) + √(2 ln(1/δ) / n). 28

58 Recap. For any F ⊆ R^X, E[ sup_{f ∈ F} |P_n f − Pf| ] ≤ 2 Rad_n(F). For any F ⊆ [−1, +1]^X and δ ∈ (0, 1), with probability ≥ 1 − δ, sup_{f ∈ F} |P_n f − Pf| ≤ 2 Rad_n(F) + √(2 ln(1/δ) / n). Conclusion: if Rad_n(F) → 0, then uniform convergence holds. 28

59 Recap. For any F ⊆ R^X, E[ sup_{f ∈ F} |P_n f − Pf| ] ≤ 2 Rad_n(F). For any F ⊆ [−1, +1]^X and δ ∈ (0, 1), with probability ≥ 1 − δ, sup_{f ∈ F} |P_n f − Pf| ≤ 2 Rad_n(F) + √(2 ln(1/δ) / n). Conclusion: if Rad_n(F) → 0, then uniform convergence holds. (Can also show: if uniform convergence holds, then Rad_n(F) → 0.) 28

60 Analysis of SVM 29

61 Loss class. Back to classes of prediction functions F ⊆ R^X. 30

62 Loss class. Back to classes of prediction functions F ⊆ R^X. Consider a loss function ℓ: R × Y → R₊ that satisfies ℓ(0, y) ≤ 1 for all y ∈ Y, and is 1-Lipschitz in its first argument: for all ŷ, ŷ′ ∈ R, |ℓ(ŷ, y) − ℓ(ŷ′, y)| ≤ |ŷ − ŷ′|. (Example: hinge loss ℓ(ŷ, y) = [1 − ŷy]₊.) 30

63 Loss class. Back to classes of prediction functions F ⊆ R^X. Consider a loss function ℓ: R × Y → R₊ that satisfies ℓ(0, y) ≤ 1 for all y ∈ Y, and is 1-Lipschitz in its first argument: for all ŷ, ŷ′ ∈ R, |ℓ(ŷ, y) − ℓ(ŷ′, y)| ≤ |ŷ − ŷ′|. (Example: hinge loss ℓ(ŷ, y) = [1 − ŷy]₊.) Define the associated loss class by ℓ ∘ F = {(x, y) ↦ ℓ(f(x), y) : f ∈ F}. Then Rad_n(ℓ ∘ F) ≤ 2 Rad_n(F) + √(2 ln 2 / n). So uniform convergence holds for ℓ ∘ F if it holds for F. 30

64 Rademacher complexity of linear predictors. Linear functions F_lin = {x ↦ wᵀx : w ∈ R^d}. What is the Rademacher complexity of F_lin? Rad_n(F_lin) = E E_ε[ sup_{w ∈ R^d} |(1/n) ∑_{i=1}^n ε_i wᵀX_i| ]. 31

65 Rademacher complexity of linear predictors. Linear functions F_lin = {x ↦ wᵀx : w ∈ R^d}. What is the Rademacher complexity of F_lin? Rad_n(F_lin) = E E_ε[ sup_{w ∈ R^d} |(1/n) ∑_{i=1}^n ε_i wᵀX_i| ]. Inside the E E_ε: sup_{w ∈ R^d} |(1/n) wᵀ ∑_{i=1}^n ε_i X_i| = sup_{w ∈ R^d} ‖w‖₂ · ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂. As long as ∑_{i=1}^n ε_i X_i ≠ 0, this is unbounded! :-( 31

66 Regularization. Recall the SVM optimization problem: min_{w ∈ R^d} (λ/2)‖w‖₂² + (1/n) ∑_{i=1}^n [1 − y_i wᵀx_i]₊. 32

67 Regularization. Recall the SVM optimization problem: min_{w ∈ R^d} (λ/2)‖w‖₂² + (1/n) ∑_{i=1}^n [1 − y_i wᵀx_i]₊. The objective value at w = 0 is 1, so the objective value at the minimizer ŵ is no worse than this: (λ/2)‖ŵ‖₂² + (1/n) ∑_{i=1}^n [1 − y_i ŵᵀx_i]₊ ≤ 1.

68 Regularization. Recall the SVM optimization problem: min_{w ∈ R^d} (λ/2)‖w‖₂² + (1/n) ∑_{i=1}^n [1 − y_i wᵀx_i]₊. The objective value at w = 0 is 1, so the objective value at the minimizer ŵ is no worse than this: (λ/2)‖ŵ‖₂² + (1/n) ∑_{i=1}^n [1 − y_i ŵᵀx_i]₊ ≤ 1. Therefore (λ/2)‖ŵ‖₂² ≤ 1, i.e., ‖ŵ‖₂ ≤ √(2/λ). 32

69 Rademacher complexity of bounded linear predictors. Bounded linear functions F_{ℓ2,B} = {x ↦ wᵀx : w ∈ R^d, ‖w‖₂ ≤ B}. 33

70 Rademacher complexity of bounded linear predictors. Bounded linear functions F_{ℓ2,B} = {x ↦ wᵀx : w ∈ R^d, ‖w‖₂ ≤ B}. What is the Rademacher complexity of F_{ℓ2,B}? Rad_n(F_{ℓ2,B}) = E E_ε[ sup_{‖w‖₂ ≤ B} |(1/n) ∑_{i=1}^n ε_i wᵀX_i| ] = B · E E_ε[ sup_{‖u‖₂ ≤ 1} |(1/n) ∑_{i=1}^n ε_i uᵀX_i| ] = B · E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂ ] ≤ B · √( E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂² ] ).

71 Rademacher complexity of bounded linear predictors. Bounded linear functions F_{ℓ2,B} = {x ↦ wᵀx : w ∈ R^d, ‖w‖₂ ≤ B}. What is the Rademacher complexity of F_{ℓ2,B}? Rad_n(F_{ℓ2,B}) = E E_ε[ sup_{‖w‖₂ ≤ B} |(1/n) ∑_{i=1}^n ε_i wᵀX_i| ] = B · E E_ε[ sup_{‖u‖₂ ≤ 1} |(1/n) ∑_{i=1}^n ε_i uᵀX_i| ] = B · E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂ ] ≤ B · √( E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂² ] ). This is a d-dimensional random walk, where the i-th step is ±X_i. 33

72 Rademacher complexity of bounded linear predictors (2). E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂² ] = (1/n²) E E_ε[ ∑_{i=1}^n ‖X_i‖₂² + ∑_{i ≠ j} ε_i ε_j X_iᵀX_j ] = (1/n²) E[ ∑_{i=1}^n ‖X_i‖₂² ] = (1/n) E[ ‖X‖₂² ].

73 Rademacher complexity of bounded linear predictors (2). E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖₂² ] = (1/n²) E E_ε[ ∑_{i=1}^n ‖X_i‖₂² + ∑_{i ≠ j} ε_i ε_j X_iᵀX_j ] = (1/n²) E[ ∑_{i=1}^n ‖X_i‖₂² ] = (1/n) E[ ‖X‖₂² ]. Conclusion. Rademacher complexity of F_{ℓ2,B} = {w ∈ R^d : ‖w‖₂ ≤ B}: Rad_n(F_{ℓ2,B}) ≤ B √( E[‖X‖₂²] / n ). 34
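
A quick numerical sanity check (my own sketch, with B = 1, conditioning on a single Gaussian sample X_1, ..., X_n): estimate E_ε‖(1/n) ∑_i ε_i X_i‖₂ by simulation and compare it with √(E‖X‖₂²/n), here computed from the empirical second moment.

import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 500, 20, 2000
X = rng.normal(size=(n, d))                     # one sample X_1, ..., X_n
eps = rng.choice([-1, 1], size=(trials, n))     # many draws of the sign vector
walk_norms = np.linalg.norm(eps @ X / n, axis=1)
print("Monte Carlo E_eps ||(1/n) sum eps_i X_i||_2:", walk_norms.mean())
print("bound sqrt(mean ||X_i||_2^2 / n):          ", np.sqrt((X**2).sum(axis=1).mean() / n))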

74 Risk bound for SVM. E[R(ŵ) − R(w*)] = E[R(ŵ) − R_n(ŵ)] (≤ ε) + E[ ((λ/2)‖ŵ‖₂² + R_n(ŵ)) − ((λ/2)‖w*‖₂² + R_n(w*)) ] (≤ 0) + E[R_n(w*) − R(w*)] (= 0) + E[ (λ/2)‖w*‖₂² − (λ/2)‖ŵ‖₂² ] ≤ ε + (λ/2)‖w*‖₂², where w* ∈ arg min_{w ∈ R^d} (λ/2)‖w‖₂² + R(w) and ε = O( √( E[‖X‖₂²] / (λn) ) + √(1/n) ). 35

75 Risk bound for SVM. E[R(ŵ) − R(w*)] = E[R(ŵ) − R_n(ŵ)] (≤ ε) + E[ ((λ/2)‖ŵ‖₂² + R_n(ŵ)) − ((λ/2)‖w*‖₂² + R_n(w*)) ] (≤ 0) + E[R_n(w*) − R(w*)] (= 0) + E[ (λ/2)‖w*‖₂² − (λ/2)‖ŵ‖₂² ] ≤ ε + (λ/2)‖w*‖₂², where w* ∈ arg min_{w ∈ R^d} (λ/2)‖w‖₂² + R(w) and ε = O( √( E[‖X‖₂²] / (λn) ) + √(1/n) ). This suggests we should use λ → 0 such that λn → ∞ as n → ∞. 35

76 Kernels. The excess risk bound has no explicit dependence on the dimension d. In particular, it holds in infinite-dimensional inner product spaces. The SVM can be applied in such spaces as long as there is an algorithm for computing inner products. This is the kernel trick, and the corresponding spaces are called Reproducing Kernel Hilbert Spaces (RKHS). 36

77 Kernels. The excess risk bound has no explicit dependence on the dimension d. In particular, it holds in infinite-dimensional inner product spaces. The SVM can be applied in such spaces as long as there is an algorithm for computing inner products. This is the kernel trick, and the corresponding spaces are called Reproducing Kernel Hilbert Spaces (RKHS). Universal approximation. With some RKHSs, we can approximate any function arbitrarily well: lim_{λ → 0} inf_{w ∈ F} { (λ/2)‖w‖² + R(w) } = inf_{g: X → R} R(g). 36

78 Other regularizers. Instead of the SVM, suppose ŵ is the solution to min_{w ∈ R^d} λ‖w‖₁ + R_n(w). So ŵ ∈ F_{ℓ1,B} = {w ∈ R^d : ‖w‖₁ ≤ B} for B = 1/λ. 37

79 Other regularizers. Instead of the SVM, suppose ŵ is the solution to min_{w ∈ R^d} λ‖w‖₁ + R_n(w). So ŵ ∈ F_{ℓ1,B} = {w ∈ R^d : ‖w‖₁ ≤ B} for B = 1/λ. What is the Rademacher complexity of F_{ℓ1,B}? Rad_n(F_{ℓ1,B}) = E E_ε[ sup_{‖w‖₁ ≤ B} |(1/n) ∑_{i=1}^n ε_i wᵀX_i| ] = B · E E_ε[ sup_{‖u‖₁ ≤ 1} |(1/n) ∑_{i=1}^n ε_i uᵀX_i| ] = B · E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖_∞ ]. 37

80 Rademacher complexity of ℓ1-bounded linear predictors. Can show, using a martingale argument, E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖_∞ ] ≤ √( O(log d) · E[‖X‖_∞²] / n ). 38

81 Rademacher complexity of ℓ1-bounded linear predictors. Can show, using a martingale argument, E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖_∞ ] ≤ √( O(log d) · E[‖X‖_∞²] / n ). Rademacher complexity of F_{ℓ1,B} = {w ∈ R^d : ‖w‖₁ ≤ B}: Rad_n(F_{ℓ1,B}) ≤ B √( O(log d) · E[‖X‖_∞²] / n ). 38

82 Rademacher complexity of ℓ1-bounded linear predictors. Can show, using a martingale argument, E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖_∞ ] ≤ √( O(log d) · E[‖X‖_∞²] / n ). Rademacher complexity of F_{ℓ1,B} = {w ∈ R^d : ‖w‖₁ ≤ B}: Rad_n(F_{ℓ1,B}) ≤ B √( O(log d) · E[‖X‖_∞²] / n ). Let X = {−1, +1}^d. Then ‖x‖₂² = d but ‖x‖_∞² = 1 for all x ∈ X. The dependence on d is much better than using the bound for ℓ2-bounded linear predictors, which would have looked like B √(d/n). 38
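
The gap between the two norms is easy to see numerically. The sketch below (mine; the dimensions and sample sizes are arbitrary) compares E_ε‖(1/n) ∑_i ε_i X_i‖_∞ and E_ε‖(1/n) ∑_i ε_i X_i‖₂ for X uniform on {−1, +1}^d, with √(2 log d / n) and √(d/n) as rough reference scales.

import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 500, 1000, 500
X = rng.choice([-1, 1], size=(n, d))            # X uniform on {-1,+1}^d
eps = rng.choice([-1, 1], size=(trials, n))
avg = eps @ X / n                               # each row is (1/n) sum_i eps_i X_i
print("E ||.||_inf:", np.abs(avg).max(axis=1).mean(),
      " reference sqrt(2 log d / n):", np.sqrt(2 * np.log(d) / n))
print("E ||.||_2:  ", np.linalg.norm(avg, axis=1).mean(),
      " reference sqrt(d / n):", np.sqrt(d / n))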

83 Rademacher complexity of ℓ1-bounded linear predictors. Can show, using a martingale argument, E E_ε[ ‖(1/n) ∑_{i=1}^n ε_i X_i‖_∞ ] ≤ √( O(log d) · E[‖X‖_∞²] / n ). Rademacher complexity of F_{ℓ1,B} = {w ∈ R^d : ‖w‖₁ ≤ B}: Rad_n(F_{ℓ1,B}) ≤ B √( O(log d) · E[‖X‖_∞²] / n ). Let X = {−1, +1}^d. Then ‖x‖₂² = d but ‖x‖_∞² = 1 for all x ∈ X. The dependence on d is much better than using the bound for ℓ2-bounded linear predictors, which would have looked like B √(d/n). This kind of bound is used to study the generalization of AdaBoost. 38

84 Other examples of Rademacher complexity. F = any class of {0, 1}-valued functions with VC dimension V: Rad_n(F) = O( √(V/n) ). F = ReLU networks of depth D with parameter matrices of Frobenius norm 1: Rad_n(F) = O( √( D · E[‖X‖₂²] / n ) ). F = Lipschitz functions from [0, 1]^d to R: Rad_n(F) = O( n^{−1/(2+d)} ). F = functions from [0, 1]^d to R with Lipschitz k-th derivatives: Rad_n(F) = O( n^{−(k+1)/(2(k+1)+d)} ). 39

85 Questions Are these the right notions of complexity? 40

86 Questions. Are these the right notions of complexity? For the SVM, the complexity of ℓ2-bounded linear predictors is relevant because ℓ2-regularization explicitly ensures the solution to the SVM problem is ℓ2-bounded. 40

87 Questions. Are these the right notions of complexity? For the SVM, the complexity of ℓ2-bounded linear predictors is relevant because ℓ2-regularization explicitly ensures the solution to the SVM problem is ℓ2-bounded. Do training algorithms for neural nets lead to Frobenius norm-bounded parameter matrices? 40

88 Questions. Are these the right notions of complexity? For the SVM, the complexity of ℓ2-bounded linear predictors is relevant because ℓ2-regularization explicitly ensures the solution to the SVM problem is ℓ2-bounded. Do training algorithms for neural nets lead to Frobenius norm-bounded parameter matrices? Do complexity bounds suggest different algorithms? 40

89 Beyond uniform convergence 41

90 Deficiencies of uniform convergence analysis. For certain loss functions, if R(f) is small, then the variance of R_n(f) is also small, and the bound should reflect this; instead of Hoeffding's inequality, use a concentration inequality that involves variance information (e.g., Bernstein's inequality). It is overkill to require all functions in F to not over-fit; we just need to worry about the functions f with, e.g., small empirical risk. Solution: local Rademacher complexity. 42

91 Example: Occam's razor bound. Suppose F is countable and we fix (a priori) a probability distribution π = (π_f : f ∈ F) on F. Think of π as placing bets on which functions are likely to be the one picked by your learning algorithm. 43

92 Example: Occam's razor bound. Suppose F is countable and we fix (a priori) a probability distribution π = (π_f : f ∈ F) on F. Think of π as placing bets on which functions are likely to be the one picked by your learning algorithm. For any fixed f ∈ F, P(|R_n(f) − R(f)| ≥ t_f) ≤ 2 exp(−2n t_f²) for any t_f > 0, by Hoeffding's inequality and the union bound. Note: we can choose the t_f's non-uniformly. 43

93 Occam's razor bound (continued). Let t_f = √( (ln(1/π_f) + ln(2/δ)) / (2n) ). By the union bound, P(∃ f ∈ F s.t. |R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} P(|R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} 2 exp(−2n t_f²) = ∑_{f ∈ F} π_f δ = δ. 44

94 Occam's razor bound (continued). Let t_f = √( (ln(1/π_f) + ln(2/δ)) / (2n) ). By the union bound, P(∃ f ∈ F s.t. |R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} P(|R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} 2 exp(−2n t_f²) = ∑_{f ∈ F} π_f δ = δ. Theorem. For any δ ∈ (0, 1), P(∀ f ∈ F: |R_n(f) − R(f)| < √( (ln(1/π_f) + ln(2/δ)) / (2n) )) ≥ 1 − δ. 44

95 Occam's razor bound (continued). Let t_f = √( (ln(1/π_f) + ln(2/δ)) / (2n) ). By the union bound, P(∃ f ∈ F s.t. |R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} P(|R_n(f) − R(f)| ≥ t_f) ≤ ∑_{f ∈ F} 2 exp(−2n t_f²) = ∑_{f ∈ F} π_f δ = δ. Theorem. For any δ ∈ (0, 1), P(∀ f ∈ F: |R_n(f) − R(f)| < √( (ln(1/π_f) + ln(2/δ)) / (2n) )) ≥ 1 − δ. A better bound for functions f with higher prior probability π_f! 44
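
For a concrete prior, the per-function widths t_f are straightforward to compute. A small sketch (my own helper; the geometric prior is just an example):

import numpy as np

def occam_widths(prior, n, delta):
    # t_f = sqrt( (ln(1/pi_f) + ln(2/delta)) / (2n) ) for each prior weight pi_f
    prior = np.asarray(prior, dtype=float)
    return np.sqrt((np.log(1.0 / prior) + np.log(2.0 / delta)) / (2.0 * n))

pi = 0.5 ** np.arange(1, 11)                    # pi_f = 2^{-f}, f = 1..10 (sums to just under 1)
print(occam_widths(pi, n=10_000, delta=0.01))   # widths grow as the prior weight shrinks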

96 Other forms of generalization analysis. Stability: if a learning algorithm's output does not change much when a single data point is changed, then its output will generalize; connections to differential privacy and regularization. Compression bounds: if a learning algorithm's output is invariant to all but a small number k ≪ n of the training data (e.g., the number of support vectors in an SVM), then we get a bound of the form k/(n − k). Direct analyses: some well-known learning algorithms do not fit the mold of the typical (regularized) ERM algorithm, and seem to require a direct analysis, e.g., the nearest neighbor rule. Many others. 45

97 Many active areas of research in learning theory Implicit bias of optimization algorithms E.g., gradient descent for least squares linear regression converges to solution of smallest norm. What about for other problems? Efficient algorithms for non-linear models E.g., polynomials, neural networks, kernel machines. Understand if/why existing algorithms work! Learning algorithms with robustness guarantees Noisy labels, missing / malformed data, heavy-tail distributions, adversarial corruptions, etc. Interactive learning Learning algorithms that interact with external environment (e.g., bandits, active learning, reinforcement learning). More: see proceedings of Conference on Learning Theory! 46

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016 12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh

Generalization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds

More information

Foundations of Machine Learning

Foundations of Machine Learning Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

RegML 2018 Class 2 Tikhonov regularization and kernels

RegML 2018 Class 2 Tikhonov regularization and kernels RegML 2018 Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT June 17, 2018 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n = (x i,

More information

Bits of Machine Learning Part 1: Supervised Learning

Bits of Machine Learning Part 1: Supervised Learning Bits of Machine Learning Part 1: Supervised Learning Alexandre Proutiere and Vahan Petrosyan KTH (The Royal Institute of Technology) Outline of the Course 1. Supervised Learning Regression and Classification

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 12: Weak Learnability and the l 1 margin Converse to Scale-Sensitive Learning Stability Convex-Lipschitz-Bounded Problems

More information

The Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee

The Learning Problem and Regularization Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee The Learning Problem and Regularization 9.520 Class 03, 11 February 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing

More information

The Learning Problem and Regularization

The Learning Problem and Regularization 9.520 Class 02 February 2011 Computational Learning Statistical Learning Theory Learning is viewed as a generalization/inference problem from usually small sets of high dimensional, noisy data. Learning

More information

Computational Learning Theory - Hilary Term : Learning Real-valued Functions

Computational Learning Theory - Hilary Term : Learning Real-valued Functions Computational Learning Theory - Hilary Term 08 8 : Learning Real-valued Functions Lecturer: Varun Kanade So far our focus has been on learning boolean functions. Boolean functions are suitable for modelling

More information

Fast learning rates for plug-in classifiers under the margin condition

Fast learning rates for plug-in classifiers under the margin condition Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Hilbert Space Methods in Learning

Hilbert Space Methods in Learning Hilbert Space Methods in Learning guest lecturer: Risi Kondor 6772 Advanced Machine Learning and Perception (Jebara), Columbia University, October 15, 2003. 1 1. A general formulation of the learning problem

More information

Oslo Class 2 Tikhonov regularization and kernels

Oslo Class 2 Tikhonov regularization and kernels RegML2017@SIMULA Oslo Class 2 Tikhonov regularization and kernels Lorenzo Rosasco UNIGE-MIT-IIT May 3, 2017 Learning problem Problem For H {f f : X Y }, solve min E(f), f H dρ(x, y)l(f(x), y) given S n

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design

MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation

More information

Symmetrization and Rademacher Averages

Symmetrization and Rademacher Averages Stat 928: Statistical Learning Theory Lecture: Syetrization and Radeacher Averages Instructor: Sha Kakade Radeacher Averages Recall that we are interested in bounding the difference between epirical and

More information

Generalization Bounds and Stability

Generalization Bounds and Stability Generalization Bounds and Stability Lorenzo Rosasco Tomaso Poggio 9.520 Class 9 2009 About this class Goal To recall the notion of generalization bounds and show how they can be derived from a stability

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Generalization, Overfitting, and Model Selection

Generalization, Overfitting, and Model Selection Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,

More information

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression

Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Machine Learning and Computational Statistics, Spring 2017 Homework 2: Lasso Regression Due: Monday, February 13, 2017, at 10pm (Submit via Gradescope) Instructions: Your answers to the questions below,

More information

1 Review of The Learning Setting

1 Review of The Learning Setting COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University.

SCMA292 Mathematical Modeling : Machine Learning. Krikamol Muandet. Department of Mathematics Faculty of Science, Mahidol University. SCMA292 Mathematical Modeling : Machine Learning Krikamol Muandet Department of Mathematics Faculty of Science, Mahidol University February 9, 2016 Outline Quick Recap of Least Square Ridge Regression

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families

Topics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines

More information

Statistical Properties of Large Margin Classifiers

Statistical Properties of Large Margin Classifiers Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides

More information

Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 12, 2007 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Introduction to Machine Learning (67577) Lecture 3

Introduction to Machine Learning (67577) Lecture 3 Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz

More information

Divide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates

Divide and Conquer Kernel Ridge Regression. A Distributed Algorithm with Minimax Optimal Rates : A Distributed Algorithm with Minimax Optimal Rates Yuchen Zhang, John C. Duchi, Martin Wainwright (UC Berkeley;http://arxiv.org/pdf/1305.509; Apr 9, 014) Gatsby Unit, Tea Talk June 10, 014 Outline Motivation.

More information

BINARY CLASSIFICATION

BINARY CLASSIFICATION BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the

More information

IFT Lecture 7 Elements of statistical learning theory

IFT Lecture 7 Elements of statistical learning theory IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

Kernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444

Kernel Methods. Jean-Philippe Vert Last update: Jan Jean-Philippe Vert (Mines ParisTech) 1 / 444 Kernel Methods Jean-Philippe Vert Jean-Philippe.Vert@mines.org Last update: Jan 2015 Jean-Philippe Vert (Mines ParisTech) 1 / 444 What we know how to solve Jean-Philippe Vert (Mines ParisTech) 2 / 444

More information

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization

More information

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto

Reproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution

More information

Math 328 Course Notes

Math 328 Course Notes Math 328 Course Notes Ian Robertson March 3, 2006 3 Properties of C[0, 1]: Sup-norm and Completeness In this chapter we are going to examine the vector space of all continuous functions defined on the

More information

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved

More information

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee

9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee 9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic

More information

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity;

Lecture Learning infinite hypothesis class via VC-dimension and Rademacher complexity; CSCI699: Topics in Learning and Game Theory Lecture 2 Lecturer: Ilias Diakonikolas Scribes: Li Han Today we will cover the following 2 topics: 1. Learning infinite hypothesis class via VC-dimension and

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Advanced Introduction to Machine Learning CMU-10715

Advanced Introduction to Machine Learning CMU-10715 Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear

More information

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions 58669 Supervised Machine Learning (Spring 014) Homework, sample solutions Credit for the solutions goes to mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen

More information

Kernel Learning via Random Fourier Representations

Kernel Learning via Random Fourier Representations Kernel Learning via Random Fourier Representations L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Module 5: Machine Learning L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Kernel Learning via Random

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

MythBusters: A Deep Learning Edition

MythBusters: A Deep Learning Edition 1 / 8 MythBusters: A Deep Learning Edition Sasha Rakhlin MIT Jan 18-19, 2018 2 / 8 Outline A Few Remarks on Generalization Myths 3 / 8 Myth #1: Current theory is lacking because deep neural networks have

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

10.1 The Formal Model

10.1 The Formal Model 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Convex optimization COMS 4771

Convex optimization COMS 4771 Convex optimization COMS 4771 1. Recap: learning via optimization Soft-margin SVMs Soft-margin SVM optimization problem defined by training data: w R d λ 2 w 2 2 + 1 n n [ ] 1 y ix T i w. + 1 / 15 Soft-margin

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

The Representor Theorem, Kernels, and Hilbert Spaces

The Representor Theorem, Kernels, and Hilbert Spaces The Representor Theorem, Kernels, and Hilbert Spaces We will now work with infinite dimensional feature vectors and parameter vectors. The space l is defined to be the set of sequences f 1, f, f 3,...

More information

Computational and Statistical Learning theory

Computational and Statistical Learning theory Computational and Statistical Learning theory Problem set 2 Due: January 31st Email solutions to : karthik at ttic dot edu Notation : Input space : X Label space : Y = {±1} Sample : (x 1, y 1,..., (x n,

More information

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Machine Learning. Model Selection and Validation. Fabio Vandin November 7, 2017

Machine Learning. Model Selection and Validation. Fabio Vandin November 7, 2017 Machine Learning Model Selection and Validation Fabio Vandin November 7, 2017 1 Model Selection When we have to solve a machine learning task: there are different algorithms/classes algorithms have parameters

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Computational Learning Theory. CS534 - Machine Learning

Computational Learning Theory. CS534 - Machine Learning Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows

More information

The PAC Learning Framework -II

The PAC Learning Framework -II The PAC Learning Framework -II Prof. Dan A. Simovici UMB 1 / 1 Outline 1 Finite Hypothesis Space - The Inconsistent Case 2 Deterministic versus stochastic scenario 3 Bayes Error and Noise 2 / 1 Outline

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Kimball, B-11 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)

More information

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017

Machine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017 Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization

Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization Uniform concentration inequalities, martingales, Rademacher complexity and symmetrization John Duchi Outline I Motivation 1 Uniform laws of large numbers 2 Loss minimization and data dependence II Uniform

More information

Case study: stochastic simulation via Rademacher bootstrap

Case study: stochastic simulation via Rademacher bootstrap Case study: stochastic simulation via Rademacher bootstrap Maxim Raginsky December 4, 2013 In this lecture, we will look at an application of statistical learning theory to the problem of efficient stochastic

More information

Statistical Machine Learning Hilary Term 2018

Statistical Machine Learning Hilary Term 2018 Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

CS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims

CS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims CS340 Machine learning Lecture 5 Learning theory cont'd Some slides are borrowed from Stuart Russell and Thorsten Joachims Inductive learning Simplest form: learn a function from examples f is the target

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Approximate Dynamic Programming

Approximate Dynamic Programming Master MVA: Reinforcement Learning Lecture: 5 Approximate Dynamic Programming Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/ lazaric/webpage/teaching.html Objectives of the lecture 1.

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

AdaBoost and other Large Margin Classifiers: Convexity in Classification

AdaBoost and other Large Margin Classifiers: Convexity in Classification AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

References for online kernel methods

References for online kernel methods References for online kernel methods W. Liu, J. Principe, S. Haykin Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley, 2010. W. Liu, P. Pokharel, J. Principe. The kernel least mean square

More information

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18

CSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18 CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

CIS519: Applied Machine Learning Fall Homework 5. Due: December 10 th, 2018, 11:59 PM

CIS519: Applied Machine Learning Fall Homework 5. Due: December 10 th, 2018, 11:59 PM CIS59: Applied Machine Learning Fall 208 Homework 5 Handed Out: December 5 th, 208 Due: December 0 th, 208, :59 PM Feel free to talk to other members of the class in doing the homework. I am more concerned

More information