Algorithmic Stability and Generalization Christoph Lampert

1 Algorithmic Stability and Generalization Christoph Lampert November 28 1 / 32

2 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located on the outskirts of Vienna Basic Research at IST Austria curiosity-driven, interdisciplinarity currently 50 research groups: Computer Science, Mathematics, Physics, Chemistry, Biology, Neuroscience contact: Open Positions Faculty (Asst.Prof. and Prof.) Postdocs PhD (US-style graduate school): Jan 8! Internships 2 / 32

3 Research Group: Computer Vision and Machine Learning Theory (Statistical Machine Learning) Multi-task learning Domain adaptation Lifelong learning/learning to learn Learning with dependent data Models/Algorithms Zero-shot learning Incremental learning Weakly-supervised learning Non-standard forms of annotation Applications (in Computer Vision) Object recognition Image generation Image segmentation Semantic image representations 3 / 32

4 Overview A crash course in statistical machine learning Algorithmic Stability Stochastic Gradient Descent Ilja Kuzborskij, CHL: "Data-Dependent Stability of Stochastic Gradient Descent", International Conference on Machine Learning (ICML), 2018. 4 / 32

5 Statistical Machine Learning ( Artificial Intelligence) 5 / 32

10 What is Machine Learning Definition (Mitchell, 1997) A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. 10 / 32

11 What is Machine Learning Definition (Mitchell, 1997) A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Example: Backgammon [Tesauro, ACM 1995] T)ask: Play backgammon. E)xperience: Games played against itself P)erformance Measure: Games won against human players. 10 / 32

12 What is Machine Learning Definition (Mitchell, 1997) A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Example: Spam classification T)ask: determine if e-mails are spam or non-spam. E)xperience: Incoming e-mails with human classification P)erformance Measure: percentage of correct decisions 10 / 32

13 What is Machine Learning Definition (Mitchell, 1997) A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. Example: Stock market predictions T)ask: predict the price of some shares E)xperience: past prices P)erformance Measure: money you win or lose 10 / 32

14 Notation: Supervised Machine Learning Task: input set X, output set Y, prediction function f : X → Y e.g. X = { all possible e-mails }, Y = {spam, ham} f = spam filter: for new x ∈ X : f(x) = spam or f(x) = ham. 11 / 32

15 Notation: Supervised Machine Learning Task: input set X, output set Y, prediction function f : X → Y e.g. X = { all possible e-mails }, Y = {spam, ham} f = spam filter: for new x ∈ X : f(x) = spam or f(x) = ham. Experience: different scenarios. For us: supervised learning assume: probability distribution, P, over X × Y (unknown to algorithm) observe a training set D_m = {(x_1, y_1), ..., (x_m, y_m)} drawn i.i.d. from P e.g. 100 of your e-mails, each annotated whether it is spam or not. e.g. 1 million Go positions, each annotated with the move a grandmaster made next 11 / 32

16 Notation: Supervised Machine Learning Task: input set X, output set Y, prediction function f : X → Y e.g. X = { all possible e-mails }, Y = {spam, ham} f = spam filter: for new x ∈ X : f(x) = spam or f(x) = ham. Experience: different scenarios. For us: supervised learning assume: probability distribution, P, over X × Y (unknown to algorithm) observe a training set D_m = {(x_1, y_1), ..., (x_m, y_m)} drawn i.i.d. from P e.g. 100 of your e-mails, each annotated whether it is spam or not. e.g. 1 million Go positions, each annotated with the move a grandmaster made next Performance: loss function, ℓ : Y × Y → R. ℓ(y, y′) is the cost of predicting y′ if y is correct e.g. ℓ(y, y′) = [y ≠ y′] for y ∈ {spam, ham}, or ℓ(y, y′) = (y − y′)² for y ∈ R risk (= expected loss) of a prediction function: R(f) = E_{(x,y)∼P} ℓ(y, f(x)) 11 / 32

17 Definition (Learning algorithm) A learning algorithm, A, is a function that takes as input any finite set of training examples, D_m ⊂ X × Y, and outputs a function A[D_m] : X → Y. 12 / 32

18 Definition (Learning algorithm) A learning algorithm, A, is a function that takes as input any finite set of training examples, D_m ⊂ X × Y, and outputs a function A[D_m] : X → Y. Machine Learning research: design and study learning algorithms 12 / 32

19 Definition (Learning algorithm) A learning algorithm, A, is a function that takes as input any finite set of training examples, D_m ⊂ X × Y, and outputs a function A[D_m] : X → Y. Machine Learning research: design and study learning algorithms Statistical learning theory: treat A as a mathematical object answer statistical questions, e.g. "How large does D_m have to be for property X?" 12 / 32

20 Definition (Learning algorithm) A learning algorithm, A, is a function that takes as input any finite set of training examples, D_m ⊂ X × Y, and outputs a function A[D_m] : X → Y. Machine Learning research: design and study learning algorithms Statistical learning theory: treat A as a mathematical object answer statistical questions, e.g. "How large does D_m have to be for property X?" Computational learning theory: A should be computable in polynomial time. 12 / 32

21 Definition (Empirical Risk Minimization) Given a training set D_m = { (x_1, y_1), ..., (x_m, y_m) }, the empirical risk minimization algorithm identifies a prediction function by minimizing the empirical loss over a set of functions, H ⊆ {h : X → Y}, called the hypothesis set. A[D_m] ∈ argmin_{h ∈ H} R̂_m(h) for R̂_m(h) = (1/m) Σ_{i=1}^m ℓ(y_i, h(x_i)) 13 / 32

22 Definition (Empirical Risk Minimization) Given a training set D_m = { (x_1, y_1), ..., (x_m, y_m) }, the empirical risk minimization algorithm identifies a prediction function by minimizing the empirical loss over a set of functions, H ⊆ {h : X → Y}, called the hypothesis set. A[D_m] ∈ argmin_{h ∈ H} R̂_m(h) for R̂_m(h) = (1/m) Σ_{i=1}^m ℓ(y_i, h(x_i)) Examples: Least-Squares Regression X = R^d, Y = R, H = { f : f(x) = w⊤x for w ∈ R^d }, ℓ(y, y′) = (y − y′)² ŵ ∈ argmin_{w ∈ R^d} (1/m) Σ_{i=1}^m (w⊤x_i − y_i)² output: A[D_m] = f̂ with f̂(x) = ŵ⊤x 13 / 32

23 Empirical Risk Minimization Examples: Least-Squares Regression X = R^d, Y = R, H = { f : f(x) = w⊤x for w ∈ R^d }, ℓ(y, y′) = (y − y′)² ŵ ∈ argmin_{w ∈ R^d} (1/m) Σ_{i=1}^m (w⊤x_i − y_i)² output: A[D_m] = f̂ with f̂(x) = ŵ⊤x 14 / 32
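
To make the least-squares example concrete, here is a minimal NumPy sketch of empirical risk minimization (not from the slides; the synthetic data, dimensions, and noise level are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

# synthetic data (illustrative only): y = <w_true, x> + noise
d, m = 5, 200
w_true = rng.normal(size=d)
X = rng.normal(size=(m, d))          # rows are the inputs x_i
y = X @ w_true + 0.1 * rng.normal(size=m)

# ERM for the squared loss over H = {x -> <w, x>}:
# w_hat in argmin_w (1/m) * sum_i (<w, x_i> - y_i)^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

def empirical_risk(w, X, y):
    """R_hat_m(w) = (1/m) * sum_i (<w, x_i> - y_i)^2"""
    return np.mean((X @ w - y) ** 2)

print("empirical risk of w_hat:", empirical_risk(w_hat, X, y))

For linear least squares the argmin has a closed form, which np.linalg.lstsq computes directly; for general hypothesis sets one needs an iterative optimizer, which is where the gradient methods later in the talk come in.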

24 Empirical Risk Minimization Is the empirical risk, R̂_m(f), a good proxy for the true risk, R(f)? 15 / 32

25 Empirical Risk Minimization Is the empirical risk, R̂_m(f), a good proxy for the true risk, R(f)? 1) if f : X → Y is a fixed function, then for any D_m = {(x_1, y_1), ..., (x_m, y_m)}: R̂_m(f) = (1/m) Σ_{i=1}^m ℓ(y_i, f(x_i)) is an unbiased, consistent estimator of R(f), the ℓ(y_i, f(x_i)) are independent, identically distributed random variables 15 / 32

26 Empirical Risk Minimization Is the empirical risk, R̂_m(f), a good proxy for the true risk, R(f)? 1) if f : X → Y is a fixed function, then for any D_m = {(x_1, y_1), ..., (x_m, y_m)}: R̂_m(f) = (1/m) Σ_{i=1}^m ℓ(y_i, f(x_i)) is an unbiased, consistent estimator of R(f), the ℓ(y_i, f(x_i)) are independent, identically distributed random variables 2) observe D_m = {(x_1, y_1), ..., (x_m, y_m)}, then choose f based on D_m: R̂_m(f) = (1/m) Σ_{i=1}^m ℓ(y_i, f(x_i)), but is E[R̂_m(f)] = R(f)? is R̂_m(f) ≈ R(f)? the ℓ(y_i, f(x_i)) are not independent, law of large numbers not applicable 15 / 32

27 Empirical Risk Minimization Is the empirical risk, R̂_m(f), a good proxy for the true risk, R(f)? 1) if f : X → Y is a fixed function, then for any D_m = {(x_1, y_1), ..., (x_m, y_m)}: R̂_m(f) = (1/m) Σ_{i=1}^m ℓ(y_i, f(x_i)) is an unbiased, consistent estimator of R(f), the ℓ(y_i, f(x_i)) are independent, identically distributed random variables 2) observe D_m = {(x_1, y_1), ..., (x_m, y_m)}, then choose f based on D_m: R̂_m(f) = (1/m) Σ_{i=1}^m ℓ(y_i, f(x_i)), but is E[R̂_m(f)] = R(f)? is R̂_m(f) ≈ R(f)? the ℓ(y_i, f(x_i)) are not independent, law of large numbers not applicable In machine learning, we do the latter. So what guarantees do we have? 15 / 32

28 Probably Approximately Correct (PAC) Learning Theorem (Generalization bound for finite hypothesis classes) Let H be a finite hypothesis class, i.e. |H| < ∞, and assume ℓ(·) ∈ [0, 1]. Then, with probability at least 1 − δ (over D_m drawn i.i.d. from P), the following inequality holds for all h ∈ H: R(h) − R̂_m(h) ≤ √( (log|H| + log(1/δ)) / (2m) ) Statements of this form are examples of Probably Approximately Correct (PAC) learning 16 / 32
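
As a quick numeric illustration of the right-hand side of this bound (the values of |H|, m, and δ below are arbitrary choices, not from the talk):

import math

def finite_class_bound(H_size, m, delta):
    """sqrt((log|H| + log(1/delta)) / (2m)) from the theorem above."""
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2 * m))

# a hypothesis class with one million functions, confidence 95%
for m in (100, 10_000, 1_000_000):
    print(m, round(finite_class_bound(H_size=10**6, m=m, delta=0.05), 4))

The gap shrinks at rate 1/√m, and the size of the hypothesis class enters only logarithmically.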

29 Algorithm-dependent Learning Guarantees...many more similar bounds in the literature, including for |H| = ∞,... PAC-style generalization bound With probability at least 1 − δ, ∀ h ∈ H : R(h) − R̂_m(h) ≤ something(m, H, δ) Observation: if it holds, it holds uniformly over H, the algorithm can pick any h ∈ H it wants but: overly conservative, we would need the inequality only for A[D_m] ∈ H, not uniformly for all h ∈ H 17 / 32

30 Algorithm-dependent Learning Guarantees...many more similar bounds in the literature, including for |H| = ∞,... PAC-style generalization bound With probability at least 1 − δ, ∀ h ∈ H : R(h) − R̂_m(h) ≤ something(m, H, δ) Observation: if it holds, it holds uniformly over H, the algorithm can pick any h ∈ H it wants but: overly conservative, we would need the inequality only for A[D_m] ∈ H, not uniformly for all h ∈ H Goal: algorithm-dependent learning guarantees "Which learning algorithms do not overfit?" 17 / 32

31 Algorithmic Stability 18 / 32

32 Adjust notation: Z: input set (typically Z = X × Y) L(h, z): loss function of the form L(h, z) = ℓ(y, h(x)) Reminder (Learning algorithm) A learning algorithm, A, is a function that takes as input a finite subset, D_m ⊂ Z, and outputs a hypothesis A[D_m] ∈ H. 19 / 32

33 Adjust notation: Z: input set (typically Z = X × Y) L(h, z): loss function of the form L(h, z) = ℓ(y, h(x)) Reminder (Learning algorithm) A learning algorithm, A, is a function that takes as input a finite subset, D_m ⊂ Z, and outputs a hypothesis A[D_m] ∈ H. Definition (Uniform stability) For a training set, D_m = {z_1, ..., z_m}, we call the training set with the i-th element removed D_m^{\i} = {z_1, ..., z_{i−1}, z_{i+1}, ..., z_m}. A learning algorithm, A, has uniform stability β_m with respect to the loss L if the following inequality holds: ∀ D_m ⊂ Z, ∀ i ∈ {1, 2, ..., m}, ∀ z ∈ Z : |L(A[D_m], z) − L(A[D_m^{\i}], z)| ≤ β_m 19 / 32
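
Uniform stability can also be probed empirically: retrain with each training point removed and record the largest change in loss on some probe points. The sketch below does this for ridge regression, a standard example of a uniformly stable algorithm; the data, regularization strength, and probe points are illustrative, and the result only lower-bounds β_m because the definition takes a supremum over all training sets and all z:

import numpy as np

rng = np.random.default_rng(1)
d, m = 3, 50
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def ridge(X, y, lam=1.0):
    """A[D] for regularized least squares (uniformly stable for lam > 0)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X / len(y) + lam * np.eye(k), X.T @ y / len(y))

def sq_loss(w, x, t):
    return (x @ w - t) ** 2

w_full = ridge(X, y)                                   # trained on D_m
probes = [(rng.normal(size=d), rng.normal()) for _ in range(20)]   # test points z

beta_hat = 0.0
for i in range(m):
    X_i = np.delete(X, i, axis=0)                      # D_m with element i removed
    y_i = np.delete(y, i)
    w_i = ridge(X_i, y_i)
    for x, t in probes:
        beta_hat = max(beta_hat, abs(sq_loss(w_full, x, t) - sq_loss(w_i, x, t)))

print("empirical estimate (a lower bound on beta_m):", beta_hat)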

34 A stable learning algorithm 20 / 32

37 A stable learning algorithm Removing any point from the training set has little (≤ β_m) effect on the decision function. 20 / 32

38 A stable learning algorithm Removing any point from the training set has little (≤ β_m) effect on the decision function. (consequence): replacing any point from the training set by another point also has little (≤ 2β_m) effect on the decision function. 20 / 32

43 Theorem (Stable algorithms generalize well [Bousquet et al., 2002]) Let A be a β_m-uniformly stable learning algorithm. For a training set D_m that consists of m i.i.d. samples, let f = A[D_m] be the output of A on D_m. Let ℓ(·) ∈ [0, M]. Then, for any δ > 0, with probability at least 1 − δ, R(f) − R̂(f) ≤ 2β_m + (4mβ_m + M) √( log(1/δ) / (2m) ) Proof: concentration argument using McDiarmid's inequality. bound is useful (decreases with m) if β_m behaves like o(1/√m) for β_m behaving like 1/m, we recover classical rates of R(f) − R̂(f) ≤ O(1/√m) 21 / 32

44 Theorem (Stable algorithms generalize well [Bousquet et al., 2002]) Let A be a β_m-uniformly stable learning algorithm. For a training set D_m that consists of m i.i.d. samples, let f = A[D_m] be the output of A on D_m. Let ℓ(·) ∈ [0, M]. Then, for any δ > 0, with probability at least 1 − δ, R(f) − R̂(f) ≤ 2β_m + (4mβ_m + M) √( log(1/δ) / (2m) ) Proof: concentration argument using McDiarmid's inequality. bound is useful (decreases with m) if β_m behaves like o(1/√m) for β_m behaving like 1/m, we recover classical rates of R(f) − R̂(f) ≤ O(1/√m) Are there learning algorithms that have this property? 21 / 32
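
To see how the bound behaves, a small sketch that plugs a hypothetical stability rate β_m = c/m into the expression above (the constants c, M, and δ are made up for illustration):

import math

def stability_bound(m, beta_m, M=1.0, delta=0.05):
    """2*beta_m + (4*m*beta_m + M) * sqrt(log(1/delta) / (2m)), as in the theorem."""
    return 2 * beta_m + (4 * m * beta_m + M) * math.sqrt(math.log(1 / delta) / (2 * m))

c = 2.0
for m in (100, 10_000, 1_000_000):
    print(m, round(stability_bound(m, beta_m=c / m), 4))   # decays like O(1/sqrt(m))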

45 Stochastic Gradient Descent (for Empirical Risk Minimization) 22 / 32

46 Reminder (Empirical Risk Minimization) Given a training set D_m = { (x_1, y_1), ..., (x_m, y_m) }, the empirical risk minimization algorithm identifies a prediction function by minimizing the empirical loss over a set of functions, H ⊆ {h : X → Y}, called the hypothesis set. A[D_m] ∈ argmin_{h ∈ H} R̂(h) for R̂(h) = (1/m) Σ_{i=1}^m ℓ(y_i, h(x_i)) How do we solve the argmin in practice? 23 / 32

47 (Steepest) Gradient Descent Minimization min_{θ ∈ Θ} F(θ) 1: θ_1 ← 0 2: repeat 3: θ_{t+1} ← θ_t − η_t ∇_θ F(θ_t) for some step size η_t > 0 4: until convergence output θ_final used in many variants in ML, often also with momentum (but not 2nd order) Advantage: simple, works well for many ML problems Disadvantage: finds only local minimum (or saddle point) 24 / 32

48 (Steepest) Gradient Descent Minimization min_{θ ∈ Θ} F(θ) for F(θ) = (1/m) Σ_{i=1}^m ℓ(y_i, h_θ(x_i)) 1: θ_1 ← 0 2: repeat 3: θ_{t+1} ← θ_t − η_t ∇_θ F(θ_t) for some step size η_t > 0 4: until convergence output θ_final used in many variants in ML, often also with momentum (but not 2nd order) Advantage: simple, works well for many ML problems Disadvantage: finds only local minimum (or saddle point) Bigger disadvantage: computing ∇_θ F is O(m), slow if m is large (e.g. millions) 24 / 32
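
A minimal NumPy version of the gradient descent loop above, applied to the least-squares objective F(θ) = (1/m) Σ_i (θ⊤x_i − y_i)²; the step size, stopping rule, and synthetic data are illustrative choices:

import numpy as np

rng = np.random.default_rng(2)
d, m = 5, 500
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def grad_F(theta):
    """Full gradient of F(theta) = (1/m) * sum_i (theta.x_i - y_i)^2 -- one O(m) pass."""
    return 2.0 / m * X.T @ (X @ theta - y)

theta = np.zeros(d)                  # theta_1 = 0
eta = 0.1                            # constant step size (illustrative)
for t in range(1000):
    g = grad_F(theta)
    if np.linalg.norm(g) < 1e-8:     # "until convergence"
        break
    theta = theta - eta * g          # theta_{t+1} = theta_t - eta_t * grad F(theta_t)

print("F(theta_final) =", np.mean((X @ theta - y) ** 2))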

49 Stochastic Gradient Descent min_{θ ∈ Θ} F(θ) for F(θ) = (1/m) Σ_{i=1}^m F_i(θ) with F_i(θ) = ℓ(y_i, h_θ(x_i)) 1: θ_1 ← 0 2: for t = 1, ..., T do 3: i ← random index in {1, 2, ..., m} 4: θ_{t+1} ← θ_t − η_t ∇_θ F_i(θ_t) for some step size η_t > 0 5: end for output θ_{T+1} each iteration is faster: O(1) instead of O(m), "wrong" updates, but correct in expectation: E_i[∇_θ F_i(θ_t)] = ∇_θ F(θ_t) convergence is still guaranteed for suitable choice of η_t (e.g. η_t ∝ 1/t) in practice, don't sample i, but shuffle D_m and go sequentially, i = 1, 2, ..., m 25 / 32
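
And a corresponding stochastic version: each update uses the gradient of a single F_i, with step size η_t = c/t (again a sketch with illustrative constants, not the exact setup of the talk):

import numpy as np

rng = np.random.default_rng(3)
d, m = 5, 500
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

def grad_F_i(theta, i):
    """Gradient of the single-example objective F_i(theta) = (theta.x_i - y_i)^2 -- O(1) in m."""
    return 2.0 * (X[i] @ theta - y[i]) * X[i]

theta = np.zeros(d)                      # theta_1 = 0
c, T = 0.5, 20_000
for t in range(1, T + 1):
    i = rng.integers(m)                  # random index in {0, ..., m-1}
    theta = theta - (c / t) * grad_F_i(theta, i)   # eta_t = c / t

print("F(theta_{T+1}) =", np.mean((X @ theta - y) ** 2))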

50 Theorem (Stability of Stochastic Gradient Descent [Hardt et al., 2016]) Let L(·, z) be γ-smooth, convex and M-Lipschitz for every z. Suppose that we run SGD with step sizes η_t ≤ 2/γ for T steps. Then, SGD satisfies uniform stability with β_m ≤ (2M²/m) Σ_{t=1}^T η_t. Let L(·, z) be γ-smooth and M-Lipschitz, but not necessarily convex. Assume we run SGD with monotonically non-increasing step sizes η_t ≤ c/t for some c. Then, SGD satisfies uniform stability with β_m ≤ (1 + 1/(γc)) / (m − 1) · (2cM²)^{1/(γc+1)} · T^{γc/(γc+1)}. Proof and details: [M. Hardt, B. Recht, Y. Singer, "Train faster, generalize better: Stability of stochastic gradient descent", ICML 2016] 26 / 32

51 Theorem (Stability of Stochastic Gradient Descent [Hardt et al., 2016]) Let L(·, z) be γ-smooth, convex and M-Lipschitz for every z. Suppose that we run SGD with step sizes η_t ≤ 2/γ for T steps. Then, SGD satisfies uniform stability with β_m ≤ (2M²/m) Σ_{t=1}^T η_t. For η_t ≤ c/t and T = O(m), one has β_m = O(log m / m) Let L(·, z) be γ-smooth and M-Lipschitz, but not necessarily convex. Assume we run SGD with monotonically non-increasing step sizes η_t ≤ c/t for some c. Then, SGD satisfies uniform stability with β_m ≤ (1 + 1/(γc)) / (m − 1) · (2cM²)^{1/(γc+1)} · T^{γc/(γc+1)}. For T = O(m), one has β_m = O(m^{−1/(γc+1)}) Proof and details: [M. Hardt, B. Recht, Y. Singer, "Train faster, generalize better: Stability of stochastic gradient descent", ICML 2016] 26 / 32
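
A small numeric check of the convex-case rate: with η_t = c/t and T = m, the bound (2M²/m) Σ_t η_t indeed behaves like O(log m / m); the values of M and c below are arbitrary:

import math

def sgd_stability_bound(m, T, c=1.0, M=1.0):
    """Convex case of Hardt et al.: beta_m <= (2 M^2 / m) * sum_{t=1}^T eta_t, eta_t = c/t."""
    return 2 * M**2 / m * sum(c / t for t in range(1, T + 1))

for m in (100, 10_000, 1_000_000):
    bound = sgd_stability_bound(m, T=m)
    print(m, bound, "~", 2 * math.log(m) / m)   # roughly 2*c*M^2*log(m)/m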

52 Proof idea Compare SGD running on D_m and on D_m^i = {z_1, ..., z_{i−1}, z′, z_{i+1}, ..., z_m} for some z′ ∈ Z In any iteration, t, call δ_t = ‖θ_t − θ′_t‖ the difference in parameters. Case 1) SGD operates on the same example z_j for j ≠ i: happens with probability 1 − 1/m; for convex and β-smooth F_i and η_t ≤ 2/β, one can show δ_{t+1} ≤ δ_t Case 2) SGD operates on different examples, z_i vs. z′: happens with probability 1/m; δ_{t+1} ≤ δ_t + 2η_t M, where M is the Lipschitz constant (because ‖∇F_i‖ ≤ M) Together, E[δ_{t+1}] ≤ (1 − 1/m) E[δ_t] + (1/m) E[δ_t + 2η_t M] = E[δ_t] + 2Mη_t/m Unraveling the recursion, E[δ_{T+1}] ≤ (2M/m) Σ_{t=1}^T η_t Use that L(·, z) is M-Lipschitz: |L(θ_{T+1}, z) − L(θ′_{T+1}, z)| ≤ M δ_{T+1} 27 / 32
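
The coupling used in the proof can be simulated directly: run SGD twice with the same random index sequence, once on D_m and once on a copy with one example replaced, and track δ_t = ‖θ_t − θ′_t‖. Everything below (data, loss, step sizes) is an illustrative toy setup, not the construction from the paper:

import numpy as np

rng = np.random.default_rng(4)
d, m = 5, 200
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

# neighbouring dataset: replace example i0 by a fresh point z'
i0 = 0
X2, y2 = X.copy(), y.copy()
X2[i0], y2[i0] = rng.normal(size=d), rng.normal()

def grad(theta, X, y, i):
    """Gradient of the squared loss on example i."""
    return 2.0 * (X[i] @ theta - y[i]) * X[i]

theta, theta2 = np.zeros(d), np.zeros(d)
T, c = 5000, 0.5
deltas = []
for t in range(1, T + 1):
    i = rng.integers(m)                        # same index for both runs (the coupling)
    eta = c / t
    theta  = theta  - eta * grad(theta,  X,  y,  i)
    theta2 = theta2 - eta * grad(theta2, X2, y2, i)
    deltas.append(np.linalg.norm(theta - theta2))   # delta_t

print("final parameter distance delta_{T+1}:", deltas[-1])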

53 "Data-Dependent Stability of Stochastic Gradient Descent" ICML 2018 İlja Kuzborskij now University of Milan Ilja Kuzborskij, CHL, "Data-Dependent Stability of Stochastic Gradient Descent", International Conference on Machine Learning (ICML), / 32

54 Data-Dependent Stability of SGD Shortcomings of the existing result: function properties are measured globally (worst case) in practice, we only need them on the optimization path Idea: local stability analysis e.g., different starting points lead to differently stable algorithms [Figure: risk landscape R(θ) with regions of huge vs. small curvature and Lipschitz constant M] 29 / 32

55 Theorem (Theorem 3 in [Kuzborskij, CHL, 2018]) Assume that L(·, z) is β-smooth, M-Lipschitz and convex for every z ∈ Z. Then, SGD with step sizes η_t ≤ 2/β satisfies hypothesis stability with β_m(θ_1) ≤ (2M √(2βR(θ_1)) / m) Σ_{t=1}^T η_t. (hypothesis stability is very similar to uniform stability, but slightly relaxed) Comparison: Hardt et al.: β_m ≤ (2M²/m) Σ_{t=1}^T η_t Our analysis replaces one global Lipschitz constant M by the data-dependent quantity O(√R(θ_1)) starting at a "good" location makes the algorithm more stable 30 / 32
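
The gain over the global bound can be illustrated numerically by comparing the two prefactors, 2M² (Hardt et al.) and 2M√(2βR(θ_1)) (data-dependent, as reconstructed above), for a few hypothetical values of the risk at initialization; all constants below are made up for illustration:

import math

def global_prefactor(M):
    """Prefactor in the convex bound of Hardt et al.: 2 * M^2."""
    return 2 * M**2

def data_dependent_prefactor(M, beta, risk_at_init):
    """Prefactor in the data-dependent bound: 2 * M * sqrt(2 * beta * R(theta_1))."""
    return 2 * M * math.sqrt(2 * beta * risk_at_init)

M, beta = 10.0, 1.0                          # hypothetical Lipschitz / smoothness constants
for risk_at_init in (50.0, 1.0, 0.01):       # "bad" vs. "good" starting points
    print(risk_at_init,
          global_prefactor(M),
          round(data_dependent_prefactor(M, beta, risk_at_init), 3))

The smaller the risk at the starting point, the smaller the data-dependent prefactor, matching the statement that a good initialization makes the algorithm more stable.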

56 Theorem (Theorem 4 in [Kuzborskij, CHL, 2018]) Assume that L(·, z) is β-smooth, M-Lipschitz, with ρ-Lipschitz Hessian, and upper bounded by 1 for every z ∈ Z, but not necessarily convex. Then, SGD with step sizes η_t = c/t for c ≤ min{ 1/β, 1/(4(2β(log T))²) } satisfies hypothesis stability with β_m(θ_1) ≤ (1 + 1/(cγ)) / m · (2cM)^{1/(1+cγ)} · T^{cγ/(1+cγ)} for γ = min{ β, E_z[‖∇²L(θ_1, z)‖_2] + cρ(1 + log T) √(2βR(θ_1)) }. Comparison: Hardt et al.: same form, but with γ = β (and weaker assumptions) new bound shows higher robustness if γ < β, in particular: R(θ_1) small ⇒ initialized in a low-risk region E_z[‖∇²L(θ_1, z)‖_2] small ⇒ initialized in a low-curvature region 31 / 32

57 Summary Machine Learning crucially relies on optimization methods, e.g. for risk minimization. Stable optimization algorithms guarantee good generalization. Stochastic Gradient Descent (with suitable step sizes) is stable for smooth convex, but also for non-convex loss functions. Data-dependent stability of SGD: refined analysis, shows how stability depends on additional properties, in particular initialization. 32 / 32

58 Summary Machine Learning crucially relies on optimization methods, e.g. for risk minimization. Stable optimization algorithms guarantee good generalization. Stochastic Gradient Descent (with suitable step sizes) is stable for smooth convex, but also for non-convex loss functions. Data-dependent stability of SGD: refined analysis, shows how stability depends on additional properties, in particular initialization. Thank you! 32 / 32
