Introduction to Machine Learning (67577) Lecture 7

Introduction to Machine Learning (67577) Lecture 7
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
Solving Convex Problems using SGD and RLM

Outline
1. Reminder: Convex learning problems
2. Learning Using Stochastic Gradient Descent
3. Learning Using Regularized Loss Minimization
4. Dimension vs. Norm bounds
   Example application: Text categorization

Convex-Lipschitz-Bounded learning problem

Definition (Convex-Lipschitz-Bounded Learning Problem). A learning problem $(\mathcal{H}, Z, \ell)$ is called Convex-Lipschitz-Bounded with parameters $\rho, B$ if the following holds:
- The hypothesis class $\mathcal{H}$ is a convex set and for all $w \in \mathcal{H}$ we have $\|w\| \le B$.
- For all $z \in Z$, the loss function $\ell(\cdot, z)$ is a convex and $\rho$-Lipschitz function.

Example: $\mathcal{H} = \{w \in \mathbb{R}^d : \|w\| \le B\}$, $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le \rho\}$, $\mathcal{Y} = \mathbb{R}$, and $\ell(w, (x, y)) = |\langle w, x \rangle - y|$.
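To see why the example fits the definition, here is a short check (my own reconstruction of the standard argument, not shown on the slide): convexity holds because $w \mapsto |\langle w, x\rangle - y|$ is a convex function composed with an affine map of $w$, and for any $w, u$,
\[
\big|\,\ell(w,(x,y)) - \ell(u,(x,y))\,\big|
= \big|\,|\langle w, x\rangle - y| - |\langle u, x\rangle - y|\,\big|
\le |\langle w - u, x\rangle|
\le \|x\|\,\|w - u\|
\le \rho\,\|w - u\|,
\]
so $\ell(\cdot,(x,y))$ is $\rho$-Lipschitz.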

Convex-Smooth-Bounded learning problem

Definition (Convex-Smooth-Bounded Learning Problem). A learning problem $(\mathcal{H}, Z, \ell)$ is called Convex-Smooth-Bounded with parameters $\beta, B$ if the following holds:
- The hypothesis class $\mathcal{H}$ is a convex set and for all $w \in \mathcal{H}$ we have $\|w\| \le B$.
- For all $z \in Z$, the loss function $\ell(\cdot, z)$ is a convex, non-negative, and $\beta$-smooth function.

Example: $\mathcal{H} = \{w \in \mathbb{R}^d : \|w\| \le B\}$, $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\| \le \sqrt{\beta/2}\}$, $\mathcal{Y} = \mathbb{R}$, and $\ell(w, (x, y)) = (\langle w, x \rangle - y)^2$.
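Again a quick check, reconstructed rather than taken from the slide: the gradient of the squared loss is
\[
\nabla \ell(w,(x,y)) = 2(\langle w, x\rangle - y)\,x,
\qquad
\|\nabla \ell(w,(x,y)) - \nabla \ell(u,(x,y))\| = 2\,|\langle w - u, x\rangle|\,\|x\| \le 2\|x\|^2\,\|w - u\| \le \beta\,\|w - u\|,
\]
so the squared loss is $2\|x\|^2$-smooth, hence $\beta$-smooth whenever $\|x\| \le \sqrt{\beta/2}$; it is clearly convex and non-negative.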

Learning Using Stochastic Gradient Descent

Consider a learning problem. Recall: our goal is to (probably approximately) solve
\[
\min_{w \in \mathcal{H}} L_D(w) \quad \text{where} \quad L_D(w) = \mathbb{E}_{z \sim D}[\ell(w, z)].
\]
- So far, learning was based on the empirical risk $L_S(w)$.
- We now consider directly minimizing $L_D(w)$.

Stochastic Gradient Descent

\[
\min_{w \in \mathcal{H}} L_D(w) \quad \text{where} \quad L_D(w) = \mathbb{E}_{z \sim D}[\ell(w, z)]
\]
- Recall the gradient descent method, in which we initialize $w^{(1)} = 0$ and update $w^{(t+1)} = w^{(t)} - \eta \nabla L_D(w^{(t)})$.
- Observe: $\nabla L_D(w) = \mathbb{E}_{z \sim D}[\nabla \ell(w, z)]$.
- We can't calculate $\nabla L_D(w)$ because we don't know $D$.
- But we can estimate it by $\nabla \ell(w, z)$ for $z \sim D$.
- If we take a step based on the direction $v = \nabla \ell(w, z)$, then in expectation we're moving in the right direction.
- In other words, $v$ is an unbiased estimate of the gradient.
- We'll show that this is good enough.

Stochastic Gradient Descent

initialize: $w^{(1)} = 0$
for $t = 1, 2, \ldots, T$:
  choose $z_t \sim D$
  let $v_t = \nabla \ell(w^{(t)}, z_t)$
  update $w^{(t+1)} = w^{(t)} - \eta v_t$
output $\bar{w} = \frac{1}{T} \sum_{t=1}^{T} w^{(t)}$
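A minimal NumPy sketch of this procedure, instantiated for the absolute-loss example from the beginning of the lecture. The data distribution and the names sgd, sample_z, abs_loss_grad are illustrative assumptions, not part of the slides.

import numpy as np

def sgd(sample_z, grad_loss, d, eta, T):
    """Plain SGD as on the slide: returns the averaged iterate (1/T) * sum_t w^(t)."""
    w = np.zeros(d)              # w^(1) = 0
    w_sum = np.zeros(d)
    for _ in range(T):
        z = sample_z()           # z_t ~ D (a fresh example)
        v = grad_loss(w, z)      # v_t: (sub)gradient of the loss at w^(t)
        w_sum += w               # accumulate w^(1), ..., w^(T)
        w = w - eta * v          # w^(t+1) = w^(t) - eta * v_t
    return w_sum / T

# Absolute loss l(w,(x,y)) = |<w,x> - y|; a subgradient is sign(<w,x> - y) * x.
def abs_loss_grad(w, z):
    x, y = z
    return np.sign(np.dot(w, x) - y) * x

# Illustrative data distribution (an assumption for the demo, not from the slides).
rng = np.random.default_rng(0)
d = 5
w_true = rng.normal(size=d)

def sample_z():
    x = rng.normal(size=d)
    return x, float(np.dot(w_true, x) + 0.1 * rng.normal())

w_bar = sgd(sample_z, abs_loss_grad, d=d, eta=0.01, T=10_000)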

Analyzing SGD for convex-Lipschitz-bounded problems

By algebraic manipulations, for any sequence $v_1, \ldots, v_T$ and any $w^\star$,
\[
\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle
= \frac{\|w^{(1)} - w^\star\|^2 - \|w^{(T+1)} - w^\star\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|^2 .
\]
Assuming that $\|v_t\| \le \rho$ for all $t$ and that $\|w^\star\| \le B$, we obtain
\[
\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \le \frac{B^2}{2\eta} + \frac{\eta \rho^2 T}{2}.
\]
In particular, for $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$ we get
\[
\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle \le B \rho \sqrt{T}.
\]
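The "algebraic manipulations" boil down to the following per-step identity (a reconstruction of the standard argument, not spelled out on the slide), whose sum over $t$ telescopes:
\[
\langle w^{(t)} - w^\star, v_t \rangle
= \frac{\|w^{(t)} - w^\star\|^2 - \|w^{(t+1)} - w^\star\|^2}{2\eta} + \frac{\eta}{2}\|v_t\|^2,
\]
which follows by expanding $\|w^{(t+1)} - w^\star\|^2 = \|w^{(t)} - \eta v_t - w^\star\|^2 = \|w^{(t)} - w^\star\|^2 - 2\eta \langle w^{(t)} - w^\star, v_t \rangle + \eta^2 \|v_t\|^2$. Summing over $t$ telescopes the squared distances; since $w^{(1)} = 0$ and $\|w^\star\| \le B$, the numerator is at most $B^2$, and the term $-\|w^{(T+1)} - w^\star\|^2$ can be dropped.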

Analyzing SGD for convex-Lipschitz-bounded problems

Taking expectation of both sides w.r.t. the randomness of choosing $z_1, \ldots, z_T$, we obtain
\[
\mathbb{E}_{z_1, \ldots, z_T}\left[\sum_{t=1}^{T} \langle w^{(t)} - w^\star, v_t \rangle\right] \le B \rho \sqrt{T}.
\]
The law of total expectation: for every two random variables $\alpha, \beta$ and a function $g$, $\mathbb{E}_\alpha[g(\alpha)] = \mathbb{E}_\beta\, \mathbb{E}_\alpha[g(\alpha) \mid \beta]$. Therefore
\[
\mathbb{E}_{z_1, \ldots, z_T}[\langle w^{(t)} - w^\star, v_t \rangle]
= \mathbb{E}_{z_1, \ldots, z_{t-1}}\, \mathbb{E}_{z_1, \ldots, z_T}[\langle w^{(t)} - w^\star, v_t \rangle \mid z_1, \ldots, z_{t-1}].
\]
Once we know $z_1, \ldots, z_{t-1}$, the value of $w^{(t)}$ is not random; hence
\[
\mathbb{E}_{z_1, \ldots, z_T}[\langle w^{(t)} - w^\star, v_t \rangle \mid z_1, \ldots, z_{t-1}]
= \langle w^{(t)} - w^\star, \mathbb{E}_{z_t}[\nabla \ell(w^{(t)}, z_t)] \rangle
= \langle w^{(t)} - w^\star, \nabla L_D(w^{(t)}) \rangle.
\]

Analyzing SGD for convex-Lipschitz-bounded problems

We got:
\[
\mathbb{E}_{z_1, \ldots, z_T}\left[\sum_{t=1}^{T} \langle w^{(t)} - w^\star, \nabla L_D(w^{(t)}) \rangle\right] \le B \rho \sqrt{T}.
\]
By convexity, this means
\[
\mathbb{E}_{z_1, \ldots, z_T}\left[\sum_{t=1}^{T} \left(L_D(w^{(t)}) - L_D(w^\star)\right)\right] \le B \rho \sqrt{T}.
\]
Dividing by $T$ and using convexity again,
\[
\mathbb{E}_{z_1, \ldots, z_T}\left[L_D\!\left(\frac{1}{T}\sum_{t=1}^{T} w^{(t)}\right)\right] \le L_D(w^\star) + \frac{B \rho}{\sqrt{T}}.
\]

Learning convex-Lipschitz-bounded problems using SGD

Corollary. Consider a convex-Lipschitz-bounded learning problem with parameters $\rho, B$. Then, for every $\epsilon > 0$, if we run the SGD method for minimizing $L_D(w)$ with a number of iterations (i.e., number of examples) $T \ge \frac{B^2 \rho^2}{\epsilon^2}$ and with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then the output of SGD satisfies
\[
\mathbb{E}[L_D(\bar{w})] \le \min_{w \in \mathcal{H}} L_D(w) + \epsilon.
\]
Remark: a high-probability bound can be obtained using "boosting the confidence" (Lecture 4).
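In code, the corollary is just a recipe for choosing $T$ and $\eta$ from $(\rho, B, \epsilon)$; a small helper (the function name is mine):

import math

def sgd_parameters(rho, B, eps):
    """Number of iterations/examples and step size that guarantee
    E[L_D(w_bar)] <= min_w L_D(w) + eps for a convex-Lipschitz-bounded
    problem with parameters (rho, B)."""
    T = math.ceil((B * rho / eps) ** 2)       # T >= B^2 * rho^2 / eps^2
    eta = math.sqrt(B ** 2 / (rho ** 2 * T))  # eta = B / (rho * sqrt(T))
    return T, eta

T, eta = sgd_parameters(rho=1.0, B=10.0, eps=0.1)  # T = 10000, eta = 0.1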

Convex-smooth-bounded problems

A similar result holds for smooth problems:

Corollary. Consider a convex-smooth-bounded learning problem with parameters $\beta, B$. Assume in addition that $\ell(0, z) \le 1$ for all $z \in Z$. For every $\epsilon > 0$, set $\eta = \frac{1}{\beta(1 + 3/\epsilon)}$. Then, running SGD with $T \ge \frac{12 B^2 \beta}{\epsilon^2}$ yields
\[
\mathbb{E}[L_D(\bar{w})] \le \min_{w \in \mathcal{H}} L_D(w) + \epsilon.
\]

Regularized Loss Minimization (RLM)

Given a regularization function $R : \mathbb{R}^d \to \mathbb{R}$, the RLM rule is:
\[
A(S) = \operatorname*{argmin}_{w} \left( L_S(w) + R(w) \right).
\]
We will focus on Tikhonov regularization:
\[
A(S) = \operatorname*{argmin}_{w} \left( L_S(w) + \lambda \|w\|^2 \right).
\]
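For the squared loss, this RLM rule has a closed form (ridge regression). A NumPy sketch, assuming $L_S(w) = \frac{1}{m}\sum_{i=1}^m (\langle w, x_i\rangle - y_i)^2$ and using an illustrative synthetic dataset:

import numpy as np

def rlm_ridge(X, y, lam):
    """RLM with Tikhonov regularization for the squared loss:
    argmin_w (1/m) * ||X w - y||^2 + lam * ||w||^2  (closed-form solution)."""
    m, d = X.shape
    A = X.T @ X / m + lam * np.eye(d)
    b = X.T @ y / m
    return np.linalg.solve(A, b)

# Illustrative usage on synthetic data (purely an assumption for the demo):
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
w_hat = rlm_ridge(X, y, lam=0.1)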

Why regularize?

- Similar to MDL: specify a prior belief over hypotheses. We bias ourselves toward short vectors.
- Stabilizer: we'll show that Tikhonov regularization makes the learner stable w.r.t. small perturbations of the training set, which in turn leads to generalization.

Stability

- Informally: an algorithm $A$ is stable if a small change of its input $S$ leads to only a small change of its output hypothesis.
- We need to specify what counts as a "small change of the input" and a "small change of the output".

Stability

Replace one sample: given $S = (z_1, \ldots, z_m)$ and an additional example $z'$, let
\[
S^{(i)} = (z_1, \ldots, z_{i-1}, z', z_{i+1}, \ldots, z_m).
\]
Definition (on-average-replace-one-stable). Let $\epsilon : \mathbb{N} \to \mathbb{R}$ be a monotonically decreasing function. We say that a learning algorithm $A$ is on-average-replace-one-stable with rate $\epsilon(m)$ if for every distribution $D$,
\[
\mathbb{E}_{(S, z') \sim D^{m+1},\, i \sim U(m)}\left[\ell(A(S^{(i)}), z_i) - \ell(A(S), z_i)\right] \le \epsilon(m).
\]

Stable rules do not overfit

Theorem. If $A$ is on-average-replace-one-stable with rate $\epsilon(m)$, then
\[
\mathbb{E}_{S \sim D^m}[L_D(A(S)) - L_S(A(S))] \le \epsilon(m).
\]
Proof. Since $S$ and $z'$ are both drawn i.i.d. from $D$, we have that for every $i$,
\[
\mathbb{E}_S[L_D(A(S))] = \mathbb{E}_{S, z'}[\ell(A(S), z')] = \mathbb{E}_{S, z'}[\ell(A(S^{(i)}), z_i)].
\]
On the other hand, we can write
\[
\mathbb{E}_S[L_S(A(S))] = \mathbb{E}_{S, i}[\ell(A(S), z_i)].
\]
The claim follows from the definition of stability.

Tikhonov Regularization as Stabilizer

Theorem. Assume that the loss function is convex and $\rho$-Lipschitz. Then the RLM rule with the regularizer $\lambda \|w\|^2$ is on-average-replace-one-stable with rate $\frac{2\rho^2}{\lambda m}$. It follows that
\[
\mathbb{E}_{S \sim D^m}[L_D(A(S)) - L_S(A(S))] \le \frac{2\rho^2}{\lambda m}.
\]
Similarly, for a convex, $\beta$-smooth, and non-negative loss, the rate is $\frac{48 \beta C}{\lambda m}$, where $C$ is an upper bound on $\max_z \ell(0, z)$.

The proof relies on the notion of strong convexity and can be found in the book.

The Fitting-Stability Tradeoff

Observe:
\[
\mathbb{E}_S[L_D(A(S))] = \mathbb{E}_S[L_S(A(S))] + \mathbb{E}_S[L_D(A(S)) - L_S(A(S))].
\]
- The first term measures how well $A$ fits the training set.
- The second term is the overfitting, and it is bounded by the stability of $A$.
- $\lambda$ controls the tradeoff between the two terms.

The Fitting-Stability Tradeoff

Let $A$ be the RLM rule. We saw (for convex-Lipschitz losses) that
\[
\mathbb{E}_S[L_D(A(S)) - L_S(A(S))] \le \frac{2\rho^2}{\lambda m}.
\]
Fix some arbitrary vector $w^\star$; then
\[
L_S(A(S)) \le L_S(A(S)) + \lambda \|A(S)\|^2 \le L_S(w^\star) + \lambda \|w^\star\|^2.
\]
Taking expectation of both sides with respect to $S$ and noting that $\mathbb{E}_S[L_S(w^\star)] = L_D(w^\star)$, we obtain
\[
\mathbb{E}_S[L_S(A(S))] \le L_D(w^\star) + \lambda \|w^\star\|^2.
\]
Therefore:
\[
\mathbb{E}_S[L_D(A(S))] \le L_D(w^\star) + \lambda \|w^\star\|^2 + \frac{2\rho^2}{\lambda m}.
\]
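Minimizing the right-hand side over $\lambda$ (a standard calculation, not carried out on the slide) makes the tradeoff concrete:
\[
\lambda \|w^\star\|^2 + \frac{2\rho^2}{\lambda m}
\ \text{is minimized at}\ 
\lambda = \sqrt{\frac{2\rho^2}{m \|w^\star\|^2}},
\ \text{yielding}\ 
\mathbb{E}_S[L_D(A(S))] \le L_D(w^\star) + 2\rho \|w^\star\| \sqrt{\frac{2}{m}}.
\]
So if we only assume $\|w^\star\| \le B$, choosing $\lambda = \sqrt{2\rho^2/(B^2 m)}$ gives an excess risk of order $\rho B / \sqrt{m}$, matching the SGD guarantee up to constants.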

The Regularization Path

The RLM rule as a function of $\lambda$ is
\[
w(\lambda) = \operatorname*{argmin}_{w} \, L_S(w) + \lambda \|w\|^2.
\]
It can be seen as a pareto objective: minimize both $L_S(w)$ and $\|w\|^2$.

[Figure: the pareto curve in the $(\|w\|^2, L_S(w))$ plane, running from $w(\infty)$ to $w(0)$; points on the curve are pareto optimal, points above it are feasible.]

How to choose λ?

- Bound minimization: choose $\lambda$ according to the bound on $L_D(w)$. Usually far from optimal, as the bound is worst-case.
- Validation: calculate several pareto-optimal points on the regularization path (by varying $\lambda$) and use a validation set to choose the best one.
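A minimal sketch of the validation option, reusing the rlm_ridge helper from the RLM sketch above (the grid of λ values and the squared validation loss are illustrative choices, not prescribed by the slides):

import numpy as np

def choose_lambda(X_train, y_train, X_val, y_val, lambdas):
    """Fit the RLM rule for each lambda on the training set and keep the
    predictor with the smallest validation loss (squared loss here)."""
    best_lam, best_loss, best_w = None, np.inf, None
    for lam in lambdas:
        w = rlm_ridge(X_train, y_train, lam)
        val_loss = np.mean((X_val @ w - y_val) ** 2)
        if val_loss < best_loss:
            best_lam, best_loss, best_w = lam, val_loss, w
    return best_lam, best_w

lambdas = np.logspace(-4, 1, 10)   # a few points along the regularization path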

Dimension vs. Norm Bounds

- Previously in the course, when we learnt $d$ parameters, the sample complexity grew with $d$.
- Here, we learn $d$ parameters, but the sample complexity depends on the norm of $w^\star$ and on the Lipschitzness/smoothness, rather than on $d$.
- Which approach is better depends on the properties of the distribution.

Example: document categorization

"Signs all encouraging for Phelps in comeback. He did not win any gold medals or set any world records but Michael Phelps ticked all the boxes he needed in his comeback to competitive swimming."

About sport?

Bag-of-words representation

"Signs all encouraging for Phelps in comeback. He did not win any gold medals or set any world records but Michael Phelps ticked all the boxes he needed in his comeback to competitive swimming."

[Figure: the document mapped to a binary vector over the dictionary; coordinates of words that appear in it (e.g. "swimming", "world") are 1, others (e.g. "elephant") are 0.]
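A bag-of-words encoding in a few lines of Python (the tiny dictionary is illustrative; the trailing bias coordinate follows the representation defined on the next slide):

import re

def bag_of_words(text, dictionary):
    """Map a document to a binary vector over `dictionary`,
    with a last coordinate fixed to 1 (the bias term)."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if w in words else 0 for w in dictionary] + [1]

doc = ("Signs all encouraging for Phelps in comeback. He did not win any gold "
       "medals or set any world records but Michael Phelps ticked all the boxes "
       "he needed in his comeback to competitive swimming.")
dictionary = ["swimming", "world", "elephant"]   # toy dictionary, as in the slide's figure
print(bag_of_words(doc, dictionary))             # [1, 1, 0, 1]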

Document categorization

- Let $\mathcal{X} = \{x \in \{0,1\}^d : \|x\|^2 \le R^2,\ x_d = 1\}$.
- Think of $x \in \mathcal{X}$ as a text document represented as a bag of words:
  At most $R^2 - 1$ words in each document; $d - 1$ is the size of the dictionary; the last coordinate is the bias.
- Let $\mathcal{Y} = \{\pm 1\}$ (e.g., the document is about sport or not).
- Linear classifiers: $x \mapsto \operatorname{sign}(\langle w, x \rangle)$.
- Intuitively: $w_i$ is large (positive) for words indicative of sport, while $w_i$ is small (negative) for words indicative of non-sport.
- Hinge loss: $\ell(w, (x, y)) = [1 - y \langle w, x \rangle]_+$.
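Putting the pieces together, here is a hinge-loss subgradient and a single SGD step for this linear classifier (a sketch combining the SGD procedure from earlier in the lecture with the loss defined here; the function names are mine):

import numpy as np

def hinge_loss(w, x, y):
    """l(w,(x,y)) = [1 - y*<w,x>]_+ = max(0, 1 - y*<w,x>)."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def hinge_subgradient(w, x, y):
    """A subgradient of the hinge loss w.r.t. w:
    -y*x when the margin constraint is violated, 0 otherwise."""
    x = np.asarray(x, dtype=float)
    if y * np.dot(w, x) < 1.0:
        return -y * x
    return np.zeros_like(w)

def sgd_step(w, x, y, eta):
    """One stochastic (sub)gradient step: w <- w - eta * v_t."""
    return w - eta * hinge_subgradient(w, x, y)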

Dimension vs. Norm

- The VC dimension is $d$, but $d$ can be extremely large (the number of words in English).
- The loss function is convex and $R$-Lipschitz.
- Assume that the number of relevant words is small and that their weights are not too large; then there is a $w^\star$ with small norm and small $L_D(w^\star)$.
- Then we can learn it with sample complexity that depends on $R^2 \|w^\star\|^2$, and does not depend on $d$ at all!
- But there are of course opposite cases, in which $d$ is much smaller than $R^2 \|w^\star\|^2$.

Summary

- Learning convex learning problems using SGD
- Learning convex learning problems using RLM
- The regularization path
- Dimension vs. Norm