Learnability, Stability, Regularization and Strong Convexity


1 Learnability, Stability, Regularization and Strong Convexity
Nati Srebro (Toyota Technological Institute at Chicago), Shai Shalev-Shwartz (HUJI), Ohad Shamir (Weizmann), Karthik Sridharan (Cornell), Ambuj Tewari (Michigan) (2008)

2 What is this tutorial about?
- Characterizing learnability: what makes a problem learnable? Necessary and sufficient conditions; the role of stability.
- Universal learning rules: anything learnable is learnable using XXX; only need to consider rules of the form XXX.
- The relationship between online and statistical learning.

3 Plan
- Statistical learning and its relationship to online learning
- Characterizing (statistical) learnability: stability as the master property
- Stability in online learning
- Convex problems: strong convexity as the master property

4 The General Learning Setting, aka Stochastic Optimization [Vapnik 95]
min_{w ∈ W} F(w) = E_{z~D}[f(w, z)]
- Known objective function f: W × Z → R
- Unknown distribution D over Z; equivalently, an unknown distribution over random functions w ↦ f(w, z)
- Given an i.i.d. sample z_1, z_2, … ~ D, i.e. i.i.d. samples of the random functions f(·, z_1), f(·, z_2), …

5 General Learning: Examples
Minimize F(w) = E_z[f(w; z)] based on a sample z_1, z_2, …, z_n.
- Supervised learning: z = (x, y); w specifies a predictor h_w: X → Y; f(w; (x,y)) = loss(h_w(x), y). E.g. linear prediction: w, φ(x) ∈ R^d, f(w; (x,y)) = loss(⟨w, φ(x)⟩, y).
- Unsupervised learning, e.g. k-means clustering: z = x ∈ R^d; w = (μ[1], …, μ[k]) ∈ R^{d×k} specifies k cluster centers; f((μ[1], …, μ[k]); x) = min_j ‖μ[j] − x‖².
- Density estimation: w specifies a probability density p_w(x); f(w; x) = −log p_w(x).
- Optimization in an uncertain environment, e.g.: z = traffic delays on each road segment; w = route chosen (indicator over road segments in the route); f(w; z) = ⟨w, z⟩ = total delay along the route.
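To make the template concrete, here is a small Python sketch (the toy data, function names, and the particular losses are illustrative choices, not from the slides) instantiating f(w; z) for two of the examples above: k-means clustering and linear prediction with the hinge loss.

```python
import numpy as np

# Each example below is an instance of the template F(w) = E_z[f(w; z)],
# estimated here by an empirical average over a sample z_1, ..., z_n.

def f_kmeans(centers, x):
    """k-means objective: squared distance from point x to its nearest center.
    centers: (k, d) array, x: (d,) array."""
    return np.min(np.sum((centers - x) ** 2, axis=1))

def f_linear_hinge(w, z):
    """Linear prediction with hinge loss: z = (x, y), y in {-1, +1}."""
    x, y = z
    return max(0.0, 1.0 - y * np.dot(w, x))

def empirical_objective(f, w, sample):
    """F_hat(w) = (1/n) * sum_i f(w; z_i)."""
    return np.mean([f(w, z) for z in sample])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # toy k-means sample: 100 points in R^2, 3 candidate centers
    xs = rng.normal(size=(100, 2))
    centers = rng.normal(size=(3, 2))
    print("k-means F_hat:", empirical_objective(f_kmeans, centers, xs))
    # toy supervised sample: (x, y) pairs with random labels
    pairs = [(rng.normal(size=5), rng.choice([-1, 1])) for _ in range(100)]
    w = np.zeros(5)
    print("hinge F_hat:", empirical_objective(f_linear_hinge, w, pairs))
```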

6 General Learning Setting: Learnability
Minimize F(w) = E_z[f(w, z)] based on a sample z_1, …, z_m. The problem is specified by f and the domain W.
The problem is learnable if there is a learning rule w(z_1, …, z_m) for choosing w based on z_1, …, z_m such that for every ε > 0 and large enough sample size m, for any distribution D:
E_{z_1,…,z_m ~ D}[F(w(z_1, …, z_m))] ≤ inf_{w ∈ W} F(w) + ε = F(w*) + ε
Supervised learning: when f(w;(x,y)) = loss(h_w(x), y), learnable ⟺ agnostic PAC learnable.

7 Online Learning (Optimization)
Adversary plays f(·; z_1), f(·; z_2), f(·; z_3), …; learner plays w_1, w_2, w_3, …
- Known function f, unknown sequence z_1, z_2, …
- Online learning rule: w_i(z_1, …, z_{i−1})
- Goal: minimize Σ_i f(w_i, z_i)
- Differences vs the stochastic setting: any sequence, not necessarily i.i.d.; no distinction between train and test.

8 Online and Stochastic Regret
Problem specified by f(·,·) and domain W.
- Online regret: for any sequence, (1/m) Σ_{i=1}^m f(w_i(z_1, …, z_{i−1}), z_i) ≤ inf_{w ∈ W} (1/m) Σ_{i=1}^m f(w, z_i) + Reg(m)
- Statistical regret: for any distribution D, E_{z_1,…,z_m ~ D}[F_D(w(z_1, …, z_m))] ≤ inf_{w ∈ W} F_D(w) + ε(m)
- Online-to-batch: set w(z_1, …, z_m) = w_i with probability 1/m; then E[F(w)] ≤ F(w*) + Reg(m)
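A minimal sketch of the online-to-batch conversion just described, assuming some online rule is given as a callable; the mean-estimation losses and the running-mean rule below are placeholder choices for illustration, not from the slides.

```python
import numpy as np

def online_to_batch(online_rule, sample, rng):
    """Run the online rule over the sample, collect the iterates w_1..w_m,
    and return one of them chosen uniformly at random.
    By the online-to-batch argument, E[F(w)] <= F(w*) + Reg(m)."""
    iterates = []
    for i in range(len(sample)):
        # w_i may depend only on the prefix z_1, ..., z_{i-1}
        iterates.append(online_rule(sample[:i]))
    return iterates[rng.integers(len(iterates))]

# --- placeholder instantiation: squared loss for mean estimation, f(w, z) = (w - z)^2 ---
def running_mean_rule(prefix):
    """A simple online rule: play the mean of the observed prefix (0 on the empty prefix)."""
    return float(np.mean(prefix)) if len(prefix) > 0 else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(loc=1.0, scale=1.0, size=200)
    w = online_to_batch(running_mean_rule, sample, rng)
    print("selected iterate:", w)  # should be close to the population minimizer 1.0
```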

9 Online vs Statistical
Online learnable vs statistically learnable: the thresholds example. θ ∈ [0,1], h_θ(x) = [x ≥ θ]; z = (x, y) with x ∈ [0,1], y ∈ ±1; f(θ; (x, y)) = [y ≠ h_θ(x)] = [y ≠ sign(x − θ)].
- Statistically: learnable with excess error ε using m = O(1/ε²) samples
- Online: an adversary can force the learner to always make a mistake on an adversarial sequence (Reg(m) = 1)

10 What is Learnable? Is a problem, specified by f and W, learnable? I.e., can we drive the regret / excess error ε to zero? Using what learning rule?

11 Supervised Classification
f(w;(x,y)) = loss(h_w(x), y), binary hypothesis classes: h_w(x) ∈ ±1. Diagram relating the following properties:
- { h_w | w ∈ W } has finite VC dimension
- Uniform convergence: sup_{w ∈ W} |F(w) − F̂(w)| → 0 as n → ∞
- Online learnable
- Learnable using ERM: ŵ = arg min F̂(w), with F(ŵ) → F(w*) as n → ∞
- Learnable (using some rule): F(w̃) → F(w*) as n → ∞

12 Supervised Classification
f(w;(x,y)) = loss(h_w(x), y), real-valued h_w (with an appropriate loss function). Same diagram [Alon Ben-David Cesa-Bianchi Haussler 93]:
- { h_w | w ∈ W } has finite fat-shattering dimension
- Uniform convergence: sup_{w ∈ W} |F(w) − F̂(w)| → 0 as n → ∞
- Online learnable
- Learnable using ERM: ŵ = arg min F̂(w), with F(ŵ) → F(w*) as n → ∞
- Learnable (using some rule): F(w̃) → F(w*) as n → ∞

13 Beyond Supervised Learning
- Supervised learning, f(w; (x,y)) = loss(h_w(x), y): a combinatorial condition (VC / fat-shattering dimension) is necessary and sufficient for learnability, and ERM is universal (if learnable at all, learnable with ERM).
- General learning / stochastic optimization, arbitrary f(w, z): ????

14 Convex Lipschitz Problems
- W a convex bounded subset of a Hilbert space (e.g. R^d): for all w ∈ W, ‖w‖₂ ≤ B
- For each z, f(w, z) is convex and L-Lipschitz w.r.t. w: |f(w, z) − f(w′, z)| ≤ L‖w − w′‖₂
- Online Gradient Descent: Reg(m) ≤ √(B²L²/m)
- Stochastic setting?
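A sketch of projected online gradient descent under the assumptions on this slide (‖w‖₂ ≤ B, L-Lipschitz losses) with the usual step size η = B/(L√m); the linear losses in the demo are an illustrative instance, not part of the slides.

```python
import numpy as np

def project_ball(w, B):
    """Euclidean projection onto {w : ||w||_2 <= B}."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def ogd(grad_f, zs, dim, B, L):
    """Projected online gradient descent on the sequence z_1, ..., z_m.
    grad_f(w, z) returns a (sub)gradient of f(., z) at w, with norm <= L.
    Step size eta = B / (L * sqrt(m)) gives averaged regret O(sqrt(B^2 L^2 / m))."""
    m = len(zs)
    eta = B / (L * np.sqrt(m))
    w = np.zeros(dim)
    iterates = []
    for z in zs:
        iterates.append(w.copy())            # w_i is chosen before z_i is revealed
        w = project_ball(w - eta * grad_f(w, z), B)
    return iterates

if __name__ == "__main__":
    # toy instance: linear losses f(w, z) = <w, z>, so grad_f(w, z) = z
    rng = np.random.default_rng(0)
    zs = rng.uniform(-1, 1, size=(500, 3))
    ws = ogd(lambda w, z: z, zs, dim=3, B=1.0, L=np.sqrt(3))
    avg_loss = np.mean([np.dot(w, z) for w, z in zip(ws, zs)])
    print("average online loss:", avg_loss)
```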

15 Stochastic Convex Optimization
Generalized linear objectives: f(w, z) = g(⟨w, φ(z)⟩; z) with g being l_g-Lipschitz and ‖φ‖ ≤ R, so f is L-Lipschitz in w with L = l_g·R.
- Uniform convergence: sup_{w ∈ W} |F(w) − F̂(w)| ≤ O(√(B²L² log(1/δ)/m)) w.p. 1 − δ, hence E[F(ŵ)] ≤ F(w*) + O(√(B²L²/m))
General Lipschitz-bounded problems:
- Learnable in the statistical setting? Of course, via online-to-batch!
- Using ERM? Is there uniform convergence? A finite VC-type dimension?

16 Empirical Minimizer Not Consistent!
f(w; α) = ‖α ⊙ w‖ (⊙ = elementwise product), W = {w ∈ R^d : ‖w‖ ≤ 1}, α ∈ {0,1}^d with α ~ Uniform({0,1}^d), i.e. α[i] i.i.d. uniform 0/1.
If d ≫ 2^m (think of d = ∞), then with high probability there exists a coordinate j that is never on in the sample, i.e. α_i[j] = 0 for all i = 1..m. Then F̂(e_j) = 0 while F(e_j) = ½, so sup_{w ∈ W} |F(w) − F̂(w)| ≥ ½: no uniform convergence! Moreover e_j is an empirical minimizer with F(e_j) = ½, far from F(w*) = F(0) = 0.

17 Empirical Minimizer Not Consistent!
Same construction with an offset: f(w; (α, x)) = ‖α ⊙ (w − x)‖², with α ∈ {0,1}^d, x ∈ R^d, ‖x‖ ≤ 1, and α ~ Uniform({0,1}^d), i.e. α[i] i.i.d. uniform 0/1.
As before, if d ≫ 2^m then with high probability some coordinate j is never on in the sample (α_i[j] = 0 for all i = 1..m), so F̂(e_j) = 0 while F(e_j) = ½, and sup_{w ∈ W} |F(w) − F̂(w)| ≥ ½: no uniform convergence, and e_j is an empirical minimizer with F(e_j) = ½, far from F(w*) = F(0) = 0.
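The construction is easy to simulate. A small sketch, assuming the simpler x = 0 version of the previous slide (f(w; α) = ‖α ⊙ w‖), that finds a coordinate e_j whose empirical risk is zero while its population risk is ½:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 1000                              # d >> 2^m, so some coordinate is off in every sample point
alphas = rng.integers(0, 2, size=(m, d))    # alpha[i] ~ iid Uniform{0,1}

def f(w, alpha):
    """f(w; alpha) = || alpha (elementwise) w ||_2, the x = 0 version of the example."""
    return np.linalg.norm(alpha * w)

# find a coordinate j that is never "on" in the sample
never_on = np.flatnonzero(alphas.sum(axis=0) == 0)
assert never_on.size > 0, "increase d (need d >> 2^m)"
j = never_on[0]
e_j = np.zeros(d)
e_j[j] = 1.0

emp_risk = np.mean([f(e_j, a) for a in alphas])   # = 0: e_j is an empirical minimizer
pop_risk = 0.5                                    # E[f(e_j; alpha)] = E[alpha[j]] = 1/2
print("empirical risk of e_j:", emp_risk, " population risk:", pop_risk)
# F(w*) = F(0) = 0, so this empirical minimizer is far from optimal: no uniform convergence
```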

18 The same diagram, now in the general setting
The boxes from slides 11 and 12, for a general f(w, z); the arrows marked "supervised learning" hold only in the supervised setting, while the remaining implications hold in the general setting:
- { z ↦ f(w;z) | w ∈ W } has finite fat-shattering dimension
- Uniform convergence: sup_{w ∈ W} |F(w) − F̂(w)| → 0 as n → ∞
- Learnable with ERM: F(ŵ) → F(w*) as n → ∞
- Online learnable
- Learnable (using some rule): F(w̃) → F(w*) as n → ∞

19 Stochastic Convex Optimization and Supervised Learning
For supervised learning objectives, e.g. f(w; (x, y)) = [1 − y·h_w(x)]₊ (y ∈ ±1) or f(w; (x, y)) = (h_w(x) − y)², learnability is equivalent to uniform convergence.
What types of predictors h_w yield convex problems?
- When y = +1, h_w(x) should be convex in w; when y = −1, h_w(x) should be concave in w ⇒ h_w(x) should be linear in w.
Convex supervised learning ⇒ generalized linear objective (y ∈ {−1, +1}).

20 Stochastic Convex Optimization
- Empirical minimization might not be consistent
- Learnable using a specific procedural rule (online-to-batch conversion of online gradient descent)
- ??????????

21 Strongly Convex Objective
f(w, z) is λ-strongly convex in w iff f(αw + (1−α)w′, z) ≤ α f(w, z) + (1−α) f(w′, z) − (λ/2)·α(1−α)·‖w − w′‖₂² (equivalent to ∇²_w f(w, z) ⪰ λ when twice differentiable).
If f(w, z) is λ-strongly convex and L-Lipschitz w.r.t. w:
- Online Gradient Descent [Hazan Kalai Kale Agarwal 2006]: Reg(m) ≤ O(L² log(m) / (λm))
- Stochastic setting? ERM?

22 Strongly Convex Objective
(Same setting as the previous slide: f(w, z) is λ-strongly convex and L-Lipschitz w.r.t. w.)
- Online Gradient Descent [Hazan Kalai Kale Agarwal 2006]: Reg(m) ≤ O(L² log(m) / (λm))
- Empirical Risk Minimization: E[F(ŵ)] ≤ F(w*) + O(L² / (λm))
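A sketch of online gradient descent for λ-strongly convex losses with the step size η_t = 1/(λt) used in the logarithmic-regret analysis; the regularized hinge losses are just an illustrative instance, and projection onto W is omitted for brevity.

```python
import numpy as np

def ogd_strongly_convex(grad_f, zs, dim, lam):
    """OGD with step eta_t = 1/(lam * t), appropriate for lam-strongly convex,
    L-Lipschitz losses; regret O(L^2 log(m) / (lam * m)) [Hazan-Kalai-Kale-Agarwal].
    (Projection onto W is omitted here for brevity.)"""
    w = np.zeros(dim)
    iterates = []
    for t, z in enumerate(zs, start=1):
        iterates.append(w.copy())
        w = w - (1.0 / (lam * t)) * grad_f(w, z)
    return iterates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lam = 0.1
    # illustrative strongly convex losses: f(w, (x, y)) = lam/2 ||w||^2 + max(0, 1 - y <w, x>)
    def grad_f(w, z):
        x, y = z
        g = lam * w
        if 1.0 - y * np.dot(w, x) > 0:
            g = g - y * x
        return g
    zs = [(rng.normal(size=5), rng.choice([-1.0, 1.0])) for _ in range(1000)]
    ws = ogd_strongly_convex(grad_f, zs, dim=5, lam=lam)
    print("final iterate norm:", np.linalg.norm(ws[-1]))
```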

23 Strong Convexity and Stability
Definition: a rule w(z_1, …, z_m) is β(m)-stable if |f(w(z_1, …, z_{m−1}), z_m) − f(w(z_1, …, z_m), z_m)| ≤ β(m).
- If f is λ-strongly convex and L-Lipschitz, then for the ERM: |f(w(z_1, …, z_{m−1}), z_m) − f(w(z_1, …, z_m), z_m)| ≤ β = 4L²/(λm)
- If a symmetric w is β-stable: E[F(w_{m−1})] ≤ E[F̂(w_m)] + β
- Conclusion for the ERM: E[F̂(ŵ)] ≤ E[F̂(w*)] = F(w*), hence E[F(ŵ)] ≤ F(w*) + β(m)

24 The same diagram, revisited
- { z ↦ f(w;z) | w ∈ W } has finite fat-shattering dimension
- Uniform convergence: sup_{w ∈ W} |F(w) − F̂(w)| → 0 as n → ∞ (here: not even locally)
- Empirical minimizer is consistent: F(ŵ) → F(w*) as n → ∞
- Online learnable
- Solvable (using some algorithm): F(w̃) → F(w*) as n → ∞
(Arrows marked "supervised learning" hold only in the supervised setting.)

25 Back to Weak Convexity
f(w, z) is L-Lipschitz (and convex), ‖w‖₂ ≤ B. Use regularized ERM:
w_λ = arg min_{w ∈ W} F̂(w) + (λ/2)‖w‖₂²
Setting λ = √(L²/(B²m)): E[F(w_λ)] ≤ F(w*) + O(√(L²B²/m))
Key: the strongly convex regularizer ensures stability.
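A sketch of the regularized ERM rule w_λ with λ set as on this slide; the toy absolute-loss data, the rough Lipschitz constant, and the solver choice (subgradient descent on the strongly convex regularized empirical objective) are assumptions for illustration, not part of the slides.

```python
import numpy as np

def regularized_erm(sample, f_grad, dim, B, L, steps=2000):
    """w_lambda = argmin_w  F_hat(w) + (lam/2) ||w||^2,  with lam = sqrt(L^2 / (B^2 * m)).
    Solved here by subgradient descent on the (lam-strongly convex) regularized objective."""
    m = len(sample)
    lam = np.sqrt(L ** 2 / (B ** 2 * m))
    w = np.zeros(dim)
    for t in range(1, steps + 1):
        # subgradient of F_hat(w) + (lam/2)||w||^2 at the current point
        g = np.mean([f_grad(w, z) for z in sample], axis=0) + lam * w
        w = w - (1.0 / (lam * t)) * g       # step size suited to a lam-strongly convex objective
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # toy absolute-loss problem: f(w; z) = |<w, x> - y|, z = (x, y)
    sample = [(rng.normal(size=3), rng.normal()) for _ in range(200)]
    def f_grad(w, z):
        x, y = z
        return np.sign(np.dot(w, x) - y) * x
    w_lam = regularized_erm(sample, f_grad, dim=3, B=1.0, L=np.sqrt(3))
    print("w_lambda:", w_lam)
```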

26 The Role of Regularization
Typical view of regularization: adding a regularization term effectively constrains the domain to a lower-complexity domain W_r = {w : ‖w‖ ≤ r}; learning guarantees (e.g. for SVMs, LASSO) are actually for empirical minimization inside W_r, and are based on uniform convergence in W_r.
In our case:
- No uniform convergence in W_r, for any r > 0
- No uniform convergence even of the regularized loss
- Cannot solve the stochastic optimization problem by restricting to W_r, for any r
What regularization buys here is stability.

27 Stability Characterizes Learnability
Theorem: Learnable with a (symmetric) ERM w iff, for every D, E[|f(w(z_1, …, z_{m−1}), z_m) − f(w(z_1, …, z_m), z_m)|] ≤ β(m) for some β(m) → 0.
Theorem: Learnable iff there is a symmetric w such that, for every D,
- w is an almost ERM (AERM): E[F̂(w) − inf_{w′ ∈ W} F̂(w′)] ≤ ε(m), and
- w is stable: E[|f(w(z_1, …, z_{m−1}), z_m) − f(w(z_1, …, z_m), z_m)|] ≤ β(m)
for some ε(m) → 0, β(m) → 0.
Summary diagram: finite fat-shattering dimension, uniform convergence, stable ERM, learnable with ERM (tied together in supervised learning); stable AERM ⟺ learnable.

28 Stability in Online Learning
Reminder: a rule w(z_1, …, z_m) is β(m)-stable if |f(w(z_1, …, z_{m−1}), z_m) − f(w(z_1, …, z_m), z_m)| ≤ β(m).
- Follow The Leader (FTL): w_m(z_1, …, z_{m−1}) = arg min_w Σ_{i=1}^{m−1} f(w, z_i)
- Be The Leader (BTL): w_m = arg min_w Σ_{i=1}^{m} f(w, z_i) (uses z_m too), with Reg_BTL(m) ≤ 0
- If the ERM is β(m)-stable: Reg_FTL(m) ≤ Reg_BTL(m) + (1/m) Σ_i β(i) ≤ (1/m) Σ_i β(i)
- E.g. for λ-strongly convex and L-Lipschitz f: Reg_FTL(m) ≤ O(L² log(m) / (λm))
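A sketch of Follow The Leader on a strongly convex instance where the leader has a closed form (squared losses, so the leader on a prefix is its running mean); in general the arg min would have to be computed numerically. The instance and names are illustrative assumptions.

```python
import numpy as np

def follow_the_leader(zs):
    """FTL for f(w, z) = (w - z)^2: the leader w_m = argmin_w sum_{i<m} f(w, z_i)
    is the mean of z_1..z_{m-1}. For strongly convex losses FTL is stable and
    enjoys O(log m / m) averaged regret."""
    plays, losses = [], []
    for i, z in enumerate(zs):
        w = np.mean(zs[:i]) if i > 0 else 0.0   # leader on the prefix (0 on the empty prefix)
        plays.append(w)
        losses.append((w - z) ** 2)
    return plays, losses

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    zs = rng.normal(loc=0.5, size=1000)
    plays, losses = follow_the_leader(zs)
    best_fixed = np.mean((np.mean(zs) - zs) ** 2)   # loss of the best fixed w in hindsight
    print("averaged regret:", np.mean(losses) - best_fixed)
```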

29 Regularization and Stability
How do we deal with a non-strongly-convex objective?
- Follow The Regularized Leader (FTRL): w_m(z_1, …, z_{m−1}) = arg min_w Σ_{i=1}^{m−1} f(w, z_i) + λ·r(w)
- Be The Regularized Leader (BTRL): w_m = arg min_w Σ_{i=1}^{m} f(w, z_i) + λ·r(w), with Reg_BTRL(m) ≤ λ·r(w*)
- If the regularized ERM is β(λ, m)-stable: Reg_FTRL(m) ≤ Reg_BTRL(m) + (1/m) Σ_i β(λ, i) ≤ λ·r(w*) + (1/m) Σ_i β(λ, i)
- Follow The Almost Leader (FTAL): play w_m such that (1/(m−1)) Σ_{i=1}^{m−1} f(w_m, z_i) ≤ inf_w (1/(m−1)) Σ_{i=1}^{m−1} f(w, z_i) + ε(m); then Reg_FTAL(m) ≤ Reg_BTAL(m) + (1/m) Σ_i β(i) ≤ (1/m) Σ_i ε(i) + (1/m) Σ_i β(i)
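A sketch of FTRL for one convenient special case, linear losses f(w, z) = ⟨w, z⟩ with regularizer (λ/2)‖w‖₂² over a Euclidean ball, where the regularized leader has a closed form; this is an illustrative instance and parameter choice, not the general algorithm from the slide.

```python
import numpy as np

def project_ball(w, B):
    """Euclidean projection onto {w : ||w||_2 <= B}."""
    n = np.linalg.norm(w)
    return w if n <= B else w * (B / n)

def ftrl_linear(zs, lam, B):
    """FTRL with regularizer (lam/2)||w||^2 on linear losses f(w, z) = <w, z>:
    w_m = argmin_{||w|| <= B}  sum_{i<m} <w, z_i> + (lam/2)||w||^2
        = Proj_B( -(1/lam) * sum_{i<m} z_i ).
    Stability of the regularized leader is what controls the regret."""
    plays = []
    cum = np.zeros(zs.shape[1])
    for z in zs:
        plays.append(project_ball(-cum / lam, B))
        cum = cum + z
    return plays

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    zs = rng.uniform(-1, 1, size=(500, 3))
    plays = ftrl_linear(zs, lam=np.sqrt(500), B=1.0)
    avg_loss = np.mean([np.dot(w, z) for w, z in zip(plays, zs)])
    print("average online loss:", avg_loss)
```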

30 Regularization in Online Gradient Descent
w_{i+1} ← arg min_w F(w_i) + ⟨∇_i, w − w_i⟩ + (1/(2η))‖w − w_i‖₂² = w_i − η∇_i, where ∇_i = ∇f(w_i, z_i).
The first two terms are a first-order model of F(w) around w_i (the tangent F(w_i) + ⟨∇F(w_i), w − w_i⟩ in the figure); the model is only valid near w_i, so the quadratic term says "don't go too far".

31 Regularization in Online Gradient Descent
w_{i+1} ← arg min_w F(w_i) + ⟨∇_i, w − w_i⟩ + (1/(2η))‖w − w_i‖₂² = arg min_w η⟨∇_i, w⟩ + ½‖w − w_i‖₂² = w_i − η∇_i

32 Regularization in Online Gradient Descent
Same update: w_{i+1} ← arg min_w η⟨∇_i, w⟩ + ½‖w − w_i‖₂² = w_i − η∇_i
Replace ½‖·‖₂² with another regularizer? Effect of this regularization: performance depends on ‖w‖₂.

33 Online Mirror Descent
Main ingredient: R(w) strongly convex w.r.t. some norm ‖w‖:
R(αw + (1−α)w′) ≤ αR(w) + (1−α)R(w′) − (λ/2)·α(1−α)·‖w − w′‖²
Mirror Descent update: w_{i+1} ← arg min_{w ∈ W} η⟨∇f(w_i, z_i), w⟩ + D_R(w‖w_i) = Π^R_W(∇R⁻¹(∇R(w_i) − η∇_i)), where D_R is the Bregman divergence of R.
If R(w) ≤ B² on W and f(w, z) is L-Lipschitz w.r.t. ‖·‖, i.e. |f(w, z) − f(w′, z)| ≤ L‖w − w′‖, i.e. ‖∇f‖* ≤ L (for generalized linear objectives: ‖φ‖* ≤ L), then Reg(m) ≤ √(B²L²/m).
Example: with R(w) = entropy(w) = KL(w‖unif) and ‖w‖₁ ≤ 1, MD = Multiplicative Weights / Exponentiated Gradient.
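A sketch of the entropic instantiation mentioned in the example: mirror descent with the entropy regularizer on the probability simplex, i.e. multiplicative weights / exponentiated gradient, assuming linear losses; the renormalization step plays the role of the Bregman projection onto W, and the step size choice is a standard one, not from the slides.

```python
import numpy as np

def exponentiated_gradient(grad_f, zs, dim, eta):
    """Mirror descent with R(w) = (negative) entropy on the probability simplex:
    w_{i+1}[k]  ∝  w_i[k] * exp(-eta * grad_i[k])   (multiplicative weights / EG).
    R is strongly convex w.r.t. ||.||_1, so the regret scales with the sup-norm
    of the gradients and only logarithmically with the dimension."""
    w = np.full(dim, 1.0 / dim)        # uniform start, the minimizer of R on the simplex
    iterates = []
    for z in zs:
        iterates.append(w.copy())
        w = w * np.exp(-eta * grad_f(w, z))
        w = w / w.sum()                # Bregman (KL) projection back onto the simplex
    return iterates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, m = 50, 1000
    zs = rng.uniform(0, 1, size=(m, dim))      # linear losses f(w, z) = <w, z>
    eta = np.sqrt(2 * np.log(dim) / m)
    ws = exponentiated_gradient(lambda w, z: z, zs, dim, eta)
    print("final weights entropy:", -np.sum(ws[-1] * np.log(ws[-1] + 1e-12)))
```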

34 Strong Convexity as the Master Property
R(w) strongly convex w.r.t. ‖w‖ gives:
- Uniform convergence of {x ↦ ⟨w, x⟩ : ‖w‖ ≤ B} ⇒ statistical learnability of generalized linear objectives (including supervised)
- arg min F̂(w) + λR(w) is stable (RERM) ⇒ statistical learnability of {f : f is L-Lipschitz} over {‖w‖ ≤ B}
- Online Mirror Descent / FTRL ⇒ online learnability of {f : f is L-Lipschitz} over {‖w‖ ≤ B}

35 Answers and Questions
Statistical setting:
- Stable AERM (almost ERM) is necessary and sufficient
- Is stable RERM (regularized ERM) necessary? I.e., do we always have w_λ = arg min F̂(w) + λ·r(w)?
Online general setting:
- Stable FTRL (Follow The Regularized Leader) is sufficient
- Is stable FTAL (Follow The Almost Leader) sufficient? Is stability/regularization necessary and sufficient?
Convex problems:
- Strong convexity is necessary and sufficient for online learning; strong convexity implies stability; Online Mirror Descent (and FTRL) is universal
- Strong convexity is sufficient and almost necessary for stochastic learning
- Direct analysis of Mirror Descent in terms of stability?
