Algorithmic Stability and Generalization Christoph Lampert

1 Algorithmic Stability and Generalization Christoph Lampert November 28 1 / 32

2 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located on the outskirts of Vienna Basic Research at IST Austria curiosity-driven, interdisciplinarity currently 50 research groups: Computer Science, Mathematics, Physics, Chemistry, Biology, Neuroscience contact: Open Positions Faculty (Asst.Prof. and Prof.) Postdocs PhD (US-style graduate school): Jan 8! Internships 2 / 32

3 Research Group: Computer Vision and Machine Learning Theory (Statistical Machine Learning) Multi-task learning Domain adaptation Lifelong learning/learning to learn Learning with dependent data Models/Algorithms Zero-shot learning Incremental learning Weakly-supervised learning Non-standard forms of annotation Applications (in Computer Vision) Object recognition Image generation Image segmentation Semantic image representations 3 / 32

4 Overview A crash course in statistical machine learning Algorithmic Stability Stochastic Gradient Descent Ilja Kuzborskij, CHL: "Data-Dependent Stability of Stochastic Gradient Descent", International Conference on Machine Learning (ICML), 2018. 4 / 32

5 Statistical Machine Learning ( Artificial Intelligence) 5 / 32

10 What is Machine Learning Definition (Mitchell, 1997) A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. 10 / 32

11 What is Machine Learning Definition (Mitchell, 1997) A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Example: Backgammon [Tesauro, ACM 1995] T)ask: Play backgammon. E)xperience: Games played against itself P)erformance Measure: Games won against human players. 10 / 32

12 What is Machine Learning Definition (Mitchell, 1997) A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Example: Spam classification T)ask: determine if e-mails are spam or non-spam. E)xperience: Incoming e-mails with human classification P)erformance Measure: percentage of correct decisions 10 / 32

13 What is Machine Learning Definition (Mitchell, 1997) A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Example: Stock market predictions T)ask: predict the price of some shares E)xperience: past prices P)erformance Measure: money you win or lose 10 / 32

14 Notation: Supervised Machine Learning Task: input set X, output set Y, prediction function f : X → Y e.g. X = { all possible e-mails }, Y = {spam, ham} f = spam filter: for new x ∈ X : f(x) = spam or f(x) = ham. 11 / 32

15 Notation: Supervised Machine Learning Task: input set X, output set Y, prediction function f : X → Y e.g. X = { all possible e-mails }, Y = {spam, ham} f = spam filter: for new x ∈ X : f(x) = spam or f(x) = ham. Experience: different scenarios. For us: supervised learning assume: probability distribution, P, over X × Y (unknown to algorithm) observe a training set D_m = {(x_1, y_1), ..., (x_m, y_m)} drawn i.i.d. from P e.g. 100 of your e-mails, each annotated whether it is spam or not. e.g. 1 million Go positions, each annotated with the move a grandmaster made next 11 / 32

16 Notation: Supervised Machine Learning Task: input set X, output set Y, prediction function f : X → Y e.g. X = { all possible e-mails }, Y = {spam, ham} f = spam filter: for new x ∈ X : f(x) = spam or f(x) = ham. Experience: different scenarios. For us: supervised learning assume: probability distribution, P, over X × Y (unknown to algorithm) observe a training set D_m = {(x_1, y_1), ..., (x_m, y_m)} drawn i.i.d. from P e.g. 100 of your e-mails, each annotated whether it is spam or not. e.g. 1 million Go positions, each annotated with the move a grandmaster made next Performance: loss function, ℓ : Y × Y → R. ℓ(y, y′) is the cost of predicting y′ if y is correct e.g. ℓ(y, y′) = [y ≠ y′] for y ∈ {spam, ham}, or ℓ(y, y′) = (y − y′)² for y ∈ R risk (= expected loss) of a prediction function: R(f) = E_{(x,y)∼P} ℓ(y, f(x)) 11 / 32

17 Definition (Learning algorithm) A learning algorithm, A, is a function that takes as input any finite set of training examples, D_m ⊂ X × Y, and outputs a function A[D_m] : X → Y. 12 / 32

18 Definition (Learning algorithm) A learning algorithm, A, is a function that takes as input any finite set of training examples, D_m ⊂ X × Y, and outputs a function A[D_m] : X → Y. Machine Learning research: design and study learning algorithms 12 / 32

19 Definition (Learning algorithm) A learning algorithm, A, is a function that takes as input any finite set of training examples, D_m ⊂ X × Y, and outputs a function A[D_m] : X → Y. Machine Learning research: design and study learning algorithms Statistical learning theory: treat A as a mathematical object answer statistical questions, e.g. "How large does D_m have to be for property X?" 12 / 32

20 Definition (Learning algorithm) A learning algorithm, A, is a function that takes as input any finite set of training examples, D_m ⊂ X × Y, and outputs a function A[D_m] : X → Y. Machine Learning research: design and study learning algorithms Statistical learning theory: treat A as a mathematical object answer statistical questions, e.g. "How large does D_m have to be for property X?" Computational learning theory: A should be computable in polynomial time. 12 / 32

21 Definition (Empirical Risk Minimization) Given a training set D_m = { (x_1, y_1), ..., (x_m, y_m) }, the empirical risk minimization algorithm identifies a prediction function by minimizing the empirical loss over a set of functions, H ⊆ {h : X → Y}, called the hypothesis set. A[D_m] ∈ argmin_{h ∈ H} R̂_m(h) for R̂_m(h) = (1/m) Σ_{i=1}^m ℓ(y_i, h(x_i)) 13 / 32

22 Definition (Empirical Risk Minimization) Given a training set D_m = { (x_1, y_1), ..., (x_m, y_m) }, the empirical risk minimization algorithm identifies a prediction function by minimizing the empirical loss over a set of functions, H ⊆ {h : X → Y}, called the hypothesis set. A[D_m] ∈ argmin_{h ∈ H} R̂_m(h) for R̂_m(h) = (1/m) Σ_{i=1}^m ℓ(y_i, h(x_i)) Examples: Least-Squares Regression X = R^d, Y = R, H = { f : f(x) = w⊤x for w ∈ R^d }, ℓ(y, y′) = (y − y′)² ŵ ∈ argmin_{w ∈ R^d} (1/m) Σ_{i=1}^m (w⊤x_i − y_i)² output: A[D_m] = f̂ with f̂(x) = ŵ⊤x 13 / 32

23 Empirical Risk Minimization Examples: Least-Squares Regression X = R^d, Y = R, H = { f : f(x) = w⊤x for w ∈ R^d }, ℓ(y, y′) = (y − y′)² ŵ ∈ argmin_{w ∈ R^d} (1/m) Σ_{i=1}^m (w⊤x_i − y_i)² output: A[D_m] = f̂ with f̂(x) = ŵ⊤x 14 / 32
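
To make the least-squares example concrete, here is a minimal NumPy sketch of empirical risk minimization (not from the slides; the synthetic data, dimensions, and noise level are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

# synthetic data (illustrative only): y = <w_true, x> + noise
d, m = 5, 200
w_true = rng.normal(size=d)
X = rng.normal(size=(m, d))          # rows are the inputs x_i
y = X @ w_true + 0.1 * rng.normal(size=m)

# ERM for the squared loss over H = {x -> <w, x>}:
# w_hat in argmin_w (1/m) * sum_i (<w, x_i> - y_i)^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def empirical_risk(w, X, y):
    """R_hat_m(w) = (1/m) * sum_i (<w, x_i> - y_i)^2"""
    return np.mean((X @ w - y) ** 2)

print("empirical risk of w_hat:", empirical_risk(w_hat, X, y))

For linear least squares the argmin has a closed form, which np.linalg.lstsq computes directly; for general hypothesis sets one needs an iterative optimizer, which is where the gradient methods later in the talk come in.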

24 Empirical Risk Minimization Is the empirical risk, R̂_m(f), a good proxy for the true risk, R(f)? 15 / 32

25 Empirical Risk Minimization Is the empirical risk, R̂_m(f), a good proxy for the true risk, R(f)? 1) if f : X → Y is a fixed function, then for any D_m = {(x_1, y_1), ..., (x_m, y_m)}: R̂_m(f) = (1/m) Σ_{i=1}^m ℓ(y_i, f(x_i)) is an unbiased, consistent estimator of R(f), the ℓ(y_i, f(x_i)) are independent, identically distributed random variables 15 / 32

26 Empirical Risk Minimization Is the empirical risk, R̂_m(f), a good proxy for the true risk, R(f)? 1) if f : X → Y is a fixed function, then for any D_m = {(x_1, y_1), ..., (x_m, y_m)}: R̂_m(f) = (1/m) Σ_{i=1}^m ℓ(y_i, f(x_i)) is an unbiased, consistent estimator of R(f), the ℓ(y_i, f(x_i)) are independent, identically distributed random variables 2) observe D_m = {(x_1, y_1), ..., (x_m, y_m)}, then choose f based on D_m: R̂_m(f) = (1/m) Σ_{i=1}^m ℓ(y_i, f(x_i)), but is E[R̂_m(f)] = R(f)? is R̂_m(f) ≈ R(f)? the ℓ(y_i, f(x_i)) are not independent, law of large numbers not applicable 15 / 32

27 Empirical Risk Minimization Is the empirical risk, R̂_m(f), a good proxy for the true risk, R(f)? 1) if f : X → Y is a fixed function, then for any D_m = {(x_1, y_1), ..., (x_m, y_m)}: R̂_m(f) = (1/m) Σ_{i=1}^m ℓ(y_i, f(x_i)) is an unbiased, consistent estimator of R(f), the ℓ(y_i, f(x_i)) are independent, identically distributed random variables 2) observe D_m = {(x_1, y_1), ..., (x_m, y_m)}, then choose f based on D_m: R̂_m(f) = (1/m) Σ_{i=1}^m ℓ(y_i, f(x_i)), but is E[R̂_m(f)] = R(f)? is R̂_m(f) ≈ R(f)? the ℓ(y_i, f(x_i)) are not independent, law of large numbers not applicable In machine learning, we do the latter. So what guarantees do we have? 15 / 32

28 Probably Approximately Correct (PAC) Learning Theorem (Generalization bound for finite hypothesis classes) Let H be a finite hypothesis class, i.e. |H| < ∞, and assume ℓ(·) ∈ [0, 1]. Then, with probability at least 1 − δ (over D_m drawn i.i.d. from P), the following inequality holds for all h ∈ H: R(h) − R̂_m(h) ≤ √( (log|H| + log(1/δ)) / (2m) ) Statements of this form are examples of Probably Approximately Correct (PAC) learning 16 / 32
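
As a quick numeric illustration of the right-hand side of this bound (the values of |H|, m, and δ below are arbitrary choices, not from the talk):

import math

def finite_class_bound(H_size, m, delta):
    """sqrt((log|H| + log(1/delta)) / (2m)) from the theorem above."""
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2 * m))

# a hypothesis class with one million functions, confidence 95%
for m in (100, 10_000, 1_000_000):
    print(m, round(finite_class_bound(H_size=10**6, m=m, delta=0.05), 4))

The gap shrinks at rate 1/√m, and the size of the hypothesis class enters only logarithmically.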

29 Algorithm-dependent Learning Guarantees...many more similar bounds in the literature, including for |H| = ∞,... PAC-style generalization bound With probability at least 1 − δ, ∀ h ∈ H : R(h) − R̂_m(h) ≤ something(m, H, δ) Observation: if it holds, it holds uniformly over H, the algorithm can pick any h ∈ H it wants but: overly conservative, we would need the inequality only for A[D_m] ∈ H, not uniformly for all h ∈ H 17 / 32

30 Algorithm-dependent Learning Guarantees...many more similar bounds in the literature, including for |H| = ∞,... PAC-style generalization bound With probability at least 1 − δ, ∀ h ∈ H : R(h) − R̂_m(h) ≤ something(m, H, δ) Observation: if it holds, it holds uniformly over H, the algorithm can pick any h ∈ H it wants but: overly conservative, we would need the inequality only for A[D_m] ∈ H, not uniformly for all h ∈ H Goal: algorithm-dependent learning guarantees "Which learning algorithms do not overfit?" 17 / 32

31 Algorithmic Stability 18 / 32

32 Adjust notation: Z: input set (typically Z = X × Y) L(h, z): loss function of the form L(h, z) = ℓ(y, h(x)) Reminder (Learning algorithm) A learning algorithm, A, is a function that takes as input a finite subset, D_m ⊂ Z, and outputs a hypothesis A[D_m] ∈ H. 19 / 32

33 Adjust notation: Z: input set (typically Z = X × Y) L(h, z): loss function of the form L(h, z) = ℓ(y, h(x)) Reminder (Learning algorithm) A learning algorithm, A, is a function that takes as input a finite subset, D_m ⊂ Z, and outputs a hypothesis A[D_m] ∈ H. Definition (Uniform stability) For a training set, D_m = {z_1, ..., z_m}, we call the training set with the i-th element removed D_m^{\i} = {z_1, ..., z_{i−1}, z_{i+1}, ..., z_m}. A learning algorithm, A, has uniform stability β_m with respect to the loss L if the following inequality holds: ∀ D_m ⊂ Z, ∀ i ∈ {1, 2, ..., m}, ∀ z ∈ Z : |L(A[D_m], z) − L(A[D_m^{\i}], z)| ≤ β_m 19 / 32
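
Uniform stability can also be probed empirically: retrain with each training point removed and record the largest change in loss on some probe points. The sketch below does this for ridge regression, a standard example of a uniformly stable algorithm; the data, regularization strength, and probe points are illustrative, and the result only lower-bounds β_m because the definition takes a supremum over all training sets and all z:

import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 50
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def ridge(X, y, lam=1.0):
    """A[D] for regularized least squares (uniformly stable for lam > 0)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X / len(y) + lam * np.eye(k), X.T @ y / len(y))

def sq_loss(w, x, t):
    return (x @ w - t) ** 2

w_full = ridge(X, y)                                   # trained on D_m
probes = [(rng.normal(size=d), rng.normal()) for _ in range(20)]   # test points z

beta_hat = 0.0
for i in range(m):
    X_i = np.delete(X, i, axis=0)                      # D_m with element i removed
    y_i = np.delete(y, i)
    w_i = ridge(X_i, y_i)
    for x, t in probes:
        beta_hat = max(beta_hat, abs(sq_loss(w_full, x, t) - sq_loss(w_i, x, t)))

print("empirical estimate (a lower bound on beta_m):", beta_hat)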

34 A stable learning algorithm 20 / 32

37 A stable learning algorithm Removing any point from the training set has little (≤ β_m) effect on the decision function. 20 / 32

38 A stable learning algorithm Removing any point from the training set has little (≤ β_m) effect on the decision function. (consequence): replacing any point from the training set by another point also has little (≤ 2β_m) effect on the decision function. 20 / 32

43 Theorem (Stable algorithms generalize well [Bousquet et al., 2002]) Let A be a β_m-uniformly stable learning algorithm. For a training set D_m that consists of m i.i.d. samples, let f = A[D_m] be the output of A on D_m. Let ℓ(·) ∈ [0, M]. Then, for any δ > 0, with probability at least 1 − δ, R(f) − R̂(f) ≤ 2β_m + (4mβ_m + M) √( log(1/δ) / (2m) ) Proof: concentration argument using McDiarmid's inequality. bound is useful (decreases with m) if β_m behaves like o(1/√m) for β_m behaving like 1/m, we recover classical rates of R(f) − R̂(f) ≤ O(1/√m) 21 / 32

44 Theorem (Stable algorithms generalize well [Bousquet et al., 2002]) Let A be a β_m-uniformly stable learning algorithm. For a training set D_m that consists of m i.i.d. samples, let f = A[D_m] be the output of A on D_m. Let ℓ(·) ∈ [0, M]. Then, for any δ > 0, with probability at least 1 − δ, R(f) − R̂(f) ≤ 2β_m + (4mβ_m + M) √( log(1/δ) / (2m) ) Proof: concentration argument using McDiarmid's inequality. bound is useful (decreases with m) if β_m behaves like o(1/√m) for β_m behaving like 1/m, we recover classical rates of R(f) − R̂(f) ≤ O(1/√m) Are there learning algorithms that have this property? 21 / 32
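
To see how the bound behaves, a small sketch that plugs a hypothetical stability rate β_m = c/m into the expression above (the constants c, M, and δ are made up for illustration):

import math

def stability_bound(m, beta_m, M=1.0, delta=0.05):
    """2*beta_m + (4*m*beta_m + M) * sqrt(log(1/delta) / (2m)), as in the theorem."""
    return 2 * beta_m + (4 * m * beta_m + M) * math.sqrt(math.log(1 / delta) / (2 * m))

c = 2.0
for m in (100, 10_000, 1_000_000):
    print(m, round(stability_bound(m, beta_m=c / m), 4))   # decays like O(1/sqrt(m))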

45 Stochastic Gradient Descent (for Empirical Risk Minimization) 22 / 32

46 Reminder (Empirical Risk Minimization) Given a training set D_m = { (x_1, y_1), ..., (x_m, y_m) }, the empirical risk minimization algorithm identifies a prediction function by minimizing the empirical loss over a set of functions, H ⊆ {h : X → Y}, called the hypothesis set. A[D_m] ∈ argmin_{h ∈ H} R̂(h) for R̂(h) = (1/m) Σ_{i=1}^m ℓ(y_i, h(x_i)) How do we solve the argmin in practice? 23 / 32

47 (Steepest) Gradient Descent Minimization min_{θ ∈ Θ} F(θ) 1: θ_1 ← 0 2: repeat 3: θ_{t+1} ← θ_t − η_t ∇_θ F(θ_t) for some step size η_t > 0 4: until convergence output θ_final used in many variants in ML, often also with momentum (but not 2nd order) Advantage: simple, works well for many ML problems Disadvantage: finds only local minimum (or saddle point) 24 / 32

48 (Steepest) Gradient Descent Minimization min_{θ ∈ Θ} F(θ) for F(θ) = (1/m) Σ_{i=1}^m ℓ(y_i, h_θ(x_i)) 1: θ_1 ← 0 2: repeat 3: θ_{t+1} ← θ_t − η_t ∇_θ F(θ_t) for some step size η_t > 0 4: until convergence output θ_final used in many variants in ML, often also with momentum (but not 2nd order) Advantage: simple, works well for many ML problems Disadvantage: finds only local minimum (or saddle point) Bigger disadvantage: computing ∇_θ F is O(m), slow if m is large (e.g. millions) 24 / 32
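
A minimal NumPy version of the gradient descent loop above, applied to the least-squares objective F(θ) = (1/m) Σ_i (θ⊤x_i − y_i)²; the step size, stopping rule, and synthetic data are illustrative choices:

import numpy as np

rng = np.random.default_rng(2)
d, m = 5, 500
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def grad_F(theta):
    """Full gradient of F(theta) = (1/m) * sum_i (theta.x_i - y_i)^2 -- one O(m) pass."""
    return 2.0 / m * X.T @ (X @ theta - y)

theta = np.zeros(d)                  # theta_1 = 0
eta = 0.1                            # constant step size (illustrative)
for t in range(1000):
    g = grad_F(theta)
    if np.linalg.norm(g) < 1e-8:     # "until convergence"
        break
    theta = theta - eta * g          # theta_{t+1} = theta_t - eta_t * grad F(theta_t)

print("F(theta_final) =", np.mean((X @ theta - y) ** 2))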

49 Stochastic Gradient Descent min_{θ ∈ Θ} F(θ) for F(θ) = (1/m) Σ_{i=1}^m F_i(θ) with F_i(θ) = ℓ(y_i, h_θ(x_i)) 1: θ_1 ← 0 2: for t = 1, ..., T do 3: i ← random index in {1, 2, ..., m} 4: θ_{t+1} ← θ_t − η_t ∇_θ F_i(θ_t) for some step size η_t > 0 5: end for output θ_{T+1} each iteration is faster: O(1) instead of O(m), "wrong" updates, but correct in expectation: E_i[∇_θ F_i(θ_t)] = ∇_θ F(θ_t) convergence is still guaranteed for suitable choice of η_t (e.g. η_t ∝ 1/t) in practice, don't sample i, but shuffle D_m and go sequentially, i = 1, 2, ..., m 25 / 32
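
And a corresponding stochastic version: each update uses the gradient of a single F_i, with step size η_t = c/t (again a sketch with illustrative constants, not the exact setup of the talk):

import numpy as np

rng = np.random.default_rng(3)
d, m = 5, 500
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def grad_F_i(theta, i):
    """Gradient of the single-example objective F_i(theta) = (theta.x_i - y_i)^2 -- O(1) in m."""
    return 2.0 * (X[i] @ theta - y[i]) * X[i]

theta = np.zeros(d)                      # theta_1 = 0
c, T = 0.5, 20_000
for t in range(1, T + 1):
    i = rng.integers(m)                  # random index in {0, ..., m-1}
    theta = theta - (c / t) * grad_F_i(theta, i)   # eta_t = c / t

print("F(theta_{T+1}) =", np.mean((X @ theta - y) ** 2))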

50 Theorem (Stability of Stochastic Gradient Descent [Hardt et al., 2016]) Let L(·, z) be γ-smooth, convex and M-Lipschitz for every z. Suppose that we run SGD with step sizes η_t ≤ 2/γ for T steps. Then, SGD satisfies uniform stability with β_m ≤ (2M²/m) Σ_{t=1}^T η_t. Let L(·, z) be γ-smooth and M-Lipschitz, but not necessarily convex. Assume we run SGD with monotonically non-increasing step sizes η_t ≤ c/t for some c. Then, SGD satisfies uniform stability with β_m ≤ (1 + 1/(γc)) / (m − 1) · (2cM²)^{1/(γc+1)} · T^{γc/(γc+1)}. Proof and details: [M. Hardt, B. Recht, Y. Singer, "Train faster, generalize better: Stability of stochastic gradient descent", ICML 2016] 26 / 32

51 Theorem (Stability of Stochastic Gradient Descent [Hardt et al., 2016]) Let L(·, z) be γ-smooth, convex and M-Lipschitz for every z. Suppose that we run SGD with step sizes η_t ≤ 2/γ for T steps. Then, SGD satisfies uniform stability with β_m ≤ (2M²/m) Σ_{t=1}^T η_t. For η_t ≤ c/t and T = O(m), one has β_m = O(log m / m) Let L(·, z) be γ-smooth and M-Lipschitz, but not necessarily convex. Assume we run SGD with monotonically non-increasing step sizes η_t ≤ c/t for some c. Then, SGD satisfies uniform stability with β_m ≤ (1 + 1/(γc)) / (m − 1) · (2cM²)^{1/(γc+1)} · T^{γc/(γc+1)}. For T = O(m), one has β_m = O(m^{−1/(γc+1)}) Proof and details: [M. Hardt, B. Recht, Y. Singer, "Train faster, generalize better: Stability of stochastic gradient descent", ICML 2016] 26 / 32
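
A small numeric check of the convex-case rate: with η_t = c/t and T = m, the bound (2M²/m) Σ_t η_t indeed behaves like O(log m / m); the values of M and c below are arbitrary:

import math

def sgd_stability_bound(m, T, c=1.0, M=1.0):
    """Convex case of Hardt et al.: beta_m <= (2 M^2 / m) * sum_{t=1}^T eta_t, eta_t = c/t."""
    return 2 * M**2 / m * sum(c / t for t in range(1, T + 1))

for m in (100, 10_000, 1_000_000):
    bound = sgd_stability_bound(m, T=m)
    print(m, bound, "~", 2 * math.log(m) / m)   # roughly 2*c*M^2*log(m)/m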

52 Proof idea Compare SGD running on D_m and on D_m^i = {z_1, ..., z_{i−1}, z′, z_{i+1}, ..., z_m} for some z′ ∈ Z In any iteration, t, call δ_t = ‖θ_t − θ′_t‖ the difference in parameters. Case 1) SGD operates on the same example z_j for j ≠ i: happens with probability 1 − 1/m; for convex and β-smooth F_i and η_t ≤ 2/β, one can show δ_{t+1} ≤ δ_t Case 2) SGD operates on different examples, z_i vs. z′: happens with probability 1/m; δ_{t+1} ≤ δ_t + 2η_t M, where M is the Lipschitz constant (because ‖∇F_i‖ ≤ M) Together, E[δ_{t+1}] ≤ (1 − 1/m) E[δ_t] + (1/m) E[δ_t + 2η_t M] = E[δ_t] + 2Mη_t/m Unraveling the recursion, E[δ_{T+1}] ≤ (2M/m) Σ_{t=1}^T η_t Use that L(·, z) is M-Lipschitz: |L(θ_{T+1}, z) − L(θ′_{T+1}, z)| ≤ M δ_{T+1} 27 / 32
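
The coupling used in the proof can be simulated directly: run SGD twice with the same random index sequence, once on D_m and once on a copy with one example replaced, and track δ_t = ‖θ_t − θ′_t‖. Everything below (data, loss, step sizes) is an illustrative toy setup, not the construction from the paper:

import numpy as np

rng = np.random.default_rng(4)
d, m = 5, 200
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

# neighbouring dataset: replace example i0 by a fresh point z'
i0 = 0
X2, y2 = X.copy(), y.copy()
X2[i0], y2[i0] = rng.normal(size=d), rng.normal()

def grad(theta, X, y, i):
    """Gradient of the squared loss on example i."""
    return 2.0 * (X[i] @ theta - y[i]) * X[i]

theta, theta2 = np.zeros(d), np.zeros(d)
T, c = 5000, 0.5
deltas = []
for t in range(1, T + 1):
    i = rng.integers(m)                        # same index for both runs (the coupling)
    eta = c / t
    theta  = theta  - eta * grad(theta,  X,  y,  i)
    theta2 = theta2 - eta * grad(theta2, X2, y2, i)
    deltas.append(np.linalg.norm(theta - theta2))   # delta_t

print("final parameter distance delta_{T+1}:", deltas[-1])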

53 "Data-Dependent Stability of Stochastic Gradient Descent" ICML 2018 İlja Kuzborskij now University of Milan Ilja Kuzborskij, CHL, "Data-Dependent Stability of Stochastic Gradient Descent", International Conference on Machine Learning (ICML), / 32

54 Data-Dependent Stability of SGD Shortcomings of the existing result: function properties are measured globally (worst case) in practice, we only need them on the optimization path Idea: local stability analysis e.g., different starting points lead to differently stable algorithms [Figure: risk landscape R(θ) with regions of huge vs. small curvature and Lipschitz constant M] 29 / 32

55 Theorem (Theorem 3 in [Kuzborskij, CHL, 2018]) Assume that L(·, z) is β-smooth, M-Lipschitz and convex for every z ∈ Z. Then, SGD with step sizes η_t ≤ 2/β satisfies hypothesis stability with β_m(θ_1) ≤ (2M √(2βR(θ_1)) / m) Σ_{t=1}^T η_t. (hypothesis stability is very similar to uniform stability, but slightly relaxed) Comparison: Hardt et al.: β_m ≤ (2M²/m) Σ_{t=1}^T η_t Our analysis replaces one global Lipschitz constant M by the data-dependent quantity O(√R(θ_1)) starting at a "good" location makes the algorithm more stable 30 / 32
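
The gain over the global bound can be illustrated numerically by comparing the two prefactors, 2M² (Hardt et al.) and 2M√(2βR(θ_1)) (data-dependent, as reconstructed above), for a few hypothetical values of the risk at initialization; all constants below are made up for illustration:

import math

def global_prefactor(M):
    """Prefactor in the convex bound of Hardt et al.: 2 * M^2."""
    return 2 * M**2

def data_dependent_prefactor(M, beta, risk_at_init):
    """Prefactor in the data-dependent bound: 2 * M * sqrt(2 * beta * R(theta_1))."""
    return 2 * M * math.sqrt(2 * beta * risk_at_init)

M, beta = 10.0, 1.0                          # hypothetical Lipschitz / smoothness constants
for risk_at_init in (50.0, 1.0, 0.01):       # "bad" vs. "good" starting points
    print(risk_at_init,
          global_prefactor(M),
          round(data_dependent_prefactor(M, beta, risk_at_init), 3))

The smaller the risk at the starting point, the smaller the data-dependent prefactor, matching the statement that a good initialization makes the algorithm more stable.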

56 Theorem (Theorem 4 in [Kuzborskij, CHL, 2018]) Assume that L(·, z) is β-smooth, M-Lipschitz, with ρ-Lipschitz Hessian, and upper bounded by 1 for every z ∈ Z, but not necessarily convex. Then, SGD with step sizes η_t = c/t for c ≤ min{ 1/β, 1/(4(2β(log T))²) } satisfies hypothesis stability with β_m(θ_1) ≤ (1 + 1/(cγ)) / m · (2cM)^{1/(1+cγ)} · T^{cγ/(1+cγ)} for γ = min{ β, E_z[‖∇²L(θ_1, z)‖_2] + cρ(1 + log T) √(2βR(θ_1)) }. Comparison: Hardt et al.: same form, but with γ = β (and weaker assumptions) new bound shows higher robustness if γ < β, in particular: R(θ_1) small ⇒ initialized in a low-risk region E_z[‖∇²L(θ_1, z)‖_2] small ⇒ initialized in a low-curvature region 31 / 32

57 Summary Machine Learning crucially relies on optimization methods, e.g. for risk minimization. Stable optimization algorithms guarantee good generalization. Stochastic Gradient Descent (with suitable step sizes) is stable for smooth convex, but also for non-convex loss functions. Data-dependent stability of SGD: refined analysis, shows how stability depends on additional properties, in particular initialization. 32 / 32

58 Summary Machine Learning crucially relies on optimization methods, e.g. for risk minimization. Stable optimization algorithms guarantee good generalization. Stochastic Gradient Descent (with suitable step sizes) is stable for smooth convex, but also for non-convex loss functions. Data-dependent stability of SGD: refined analysis, shows how stability depends on additional properties, in particular initialization. Thank you! 32 / 32
