On the tradeoff between computational complexity and sample complexity in learning


On the tradeoff between computational complexity and sample complexity in learning
Shai Shalev-Shwartz
School of Computer Science and Engineering, The Hebrew University of Jerusalem
Joint work with Sham Kakade, Ambuj Tewari, Ohad Shamir, Karthik Sridharan, Nati Srebro
CS Seminar, Hebrew University, December 2009

PAC Learning

X - domain set
Y = {±1} - target set
Predictor: h : X → Y
Training set: S = (x_1, h*(x_1)), ..., (x_m, h*(x_m))
Learning (informally): use S to find some h ≈ h*
Learning (formally):
  D - a distribution over X
  Assumption: the instances in S are drawn i.i.d. from D
  Error: err(h) = P_{x∼D}[h(x) ≠ h*(x)]
  Goal: use S to find h such that, with probability at least 1 − δ, err(h) ≤ ε
  Prior knowledge: h* ∈ H

Complexity of Learning

Sample complexity: how many examples are needed?
  Vapnik: exactly Θ((VC(H) + log(1/δ)) / ε), achieved by the ERM (empirical risk minimization) rule.
Computational complexity: how much time is needed?
  Naively, it takes Ω(|H|) time to implement the ERM rule.
  Exponential gap between time and sample complexity (?)

This talk: a joint time-sample dependency
  err(m, τ) := min_{m' ≤ m} min_{A : time(A) ≤ τ} E[err(A(S))], where S is a sample of size m'
  Sample complexity: min{m : err(m, ∞) ≤ ε}
  Data-laden regime: err(∞, τ)
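To make the naive Ω(|H|) cost concrete, here is a minimal sketch (not from the talk) of brute-force ERM over an explicitly enumerated finite class; the threshold class and all names are illustrative.

```python
import numpy as np

def erm_brute_force(hypotheses, S):
    """Return the hypothesis with the smallest training error.

    hypotheses : list of callables h(x) -> {-1, +1}
    S          : list of (x, y) pairs
    Runtime is Theta(|H| * m): every hypothesis is evaluated on every example.
    """
    best_h, best_err = None, float("inf")
    for h in hypotheses:
        train_err = np.mean([h(x) != y for (x, y) in S])
        if train_err < best_err:
            best_h, best_err = h, train_err
    return best_h, best_err

# Toy usage: thresholds on the real line as a finite class.
hypotheses = [lambda x, t=t: 1 if x >= t else -1 for t in np.linspace(-1, 1, 201)]
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=50)
S = [(x, 1 if x >= 0.3 else -1) for x in xs]
h_hat, err_hat = erm_brute_force(hypotheses, S)
print("training error of ERM:", err_hat)
```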

Main Conjecture

Main question: how much time τ is needed to achieve error ε, as a function of the sample size m?
[Figure: conjectured curve of required runtime τ versus sample size m]

Warmup example

X = {0, 1}^d
H is the class of 3-term DNF formulae: h(x) = T_1(x) ∨ T_2(x) ∨ T_3(x), where each T_i is a conjunction of literals.
E.g. h(x) = (x_1 ∧ x_3 ∧ x_7) ∨ (x_4 ∧ x_2) ∨ (x_5 ∧ x_9)
|H| ≤ 3^{3d}, therefore the sample complexity is of order d/ε.
Kearns & Vazirani: if RP ≠ NP, it is not possible to efficiently find h ∈ H with err(h) ≤ ε.
Claim: if m ≥ d³/ε, it is possible to find a predictor with error ≤ ε in polynomial time.
[Figure: the runtime (τ) versus sample size (m) curve with the two regimes marked]

How can more data reduce runtime?

Observation: T_1 ∨ T_2 ∨ T_3 = ⋀_{u∈T_1, v∈T_2, w∈T_3} (u ∨ v ∨ w)
Define ψ : X → {0,1}^{2(2d)³} such that, for each triple of literals (u, v, w), there are two coordinates indicating whether the clause u ∨ v ∨ w is true or false.
Observation: there exists a halfspace such that h*(x) = sgn(⟨w, ψ(x)⟩ + b).
Therefore, we can solve the ERM problem with respect to halfspaces (linear programming).
The VC dimension of halfspaces is the dimension, so the sample complexity is of order d³/ε.
[Diagram: x ↦ sgn(⟨w, ψ(x)⟩ + b) representing x ↦ h*(x)]
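The reduction can be spelled out in a few lines. The following sketch (my own illustration, not code from the talk) builds the lifted features ψ(x) over all triples of literals and finds a consistent halfspace by the LP feasibility problem that ERM reduces to; the target formula and problem sizes are toy choices.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def psi(x):
    # Lift x in {0,1}^d to 2*(2d)^3 boolean features: for every triple of
    # literals (u, v, w), record the value of the clause (u or v or w) and its negation.
    lit = np.concatenate([x, 1 - x])                 # the 2d literals of x
    feats = []
    for u, v, w in itertools.product(range(lit.size), repeat=3):
        clause = max(lit[u], lit[v], lit[w])
        feats.extend([clause, 1 - clause])
    return np.array(feats, dtype=float)

def target_3dnf(x):
    # A fixed, purely illustrative 3-term DNF used to generate labels.
    return 1 if (x[0] and x[2]) or (x[1] and x[3]) or (x[0] and x[3]) else -1

d, m = 4, 60
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(m, d))
y = np.array([target_3dnf(x) for x in X])
Phi = np.array([psi(x) for x in X])

# ERM over halfspaces in the lifted space as an LP feasibility problem:
# find (w, b) such that y_i * (<w, psi(x_i)> + b) >= 1 for every i.
n_feat = Phi.shape[1]
A_ub = -y[:, None] * np.hstack([Phi, np.ones((m, 1))])
b_ub = -np.ones(m)
res = linprog(c=np.zeros(n_feat + 1), A_ub=A_ub, b_ub=b_ub,
              bounds=(None, None), method="highs")
w, b = res.x[:n_feat], res.x[n_feat]
train_err = np.mean(np.sign(Phi @ w + b) != y)
print("LP feasible:", res.success, "| training error:", train_err)
```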

Trading samples for runtime

Algorithm    samples    runtime
3-DNF        d/ε        2^d
Halfspace    d³/ε       poly(d)

[Figure: 3-DNF and Halfspace plotted on the runtime (τ) versus sample size (m) curve]

But,

The lower bound on the computational complexity holds only for proper learning: there is no lower bound on the computational complexity of improper learning with d/ε examples.
The lower bound on the sample complexity of halfspaces is for the general case; here we have a specific structure.
The interesting questions:
  Is the curve really true? Can one prove corresponding lower bounds?
  If the curve is true, one should be able to construct more algorithms on the curve. How?

Second example: Online Ads Placement

For t = 1, 2, ..., m:
  Learner receives side information x_t ∈ R^d
  Learner predicts ŷ_t ∈ [k]
  Learner pays cost 1[ŷ_t ≠ h*(x_t)]
Bandit setting: the learner does not observe h*(x_t).
Goal: minimize the error rate err = (1/m) Σ_{t=1}^m 1[ŷ_t ≠ h*(x_t)].

Linear Hypotheses

H = {x ↦ argmax_r (W x)_r : W ∈ R^{k×d}, ‖W‖_F ≤ 1}

[Figure: the k linear decision regions in R^d]

Large margin assumption

Assumption: the data is separable with margin µ:
  ∀t, ∀r ≠ y_t:  (W x_t)_{y_t} − (W x_t)_r ≥ µ

[Figure: separable data in R^d with margin µ]
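Concretely, the assumption says that for the matrix W realizing h*, the correct row wins by at least µ on every example. A small check of this condition (names illustrative, labels encoded as 0, ..., k−1):

```python
import numpy as np

def satisfies_margin(W, X, y, mu):
    # W: (k, d) weight matrix, X: (m, d) examples, y: labels in {0, ..., k-1}.
    scores = X @ W.T                                   # (m, k) scores (W x_t)_r
    correct = scores[np.arange(len(y)), y]             # score of the true label
    masked = scores.copy()
    masked[np.arange(len(y)), y] = -np.inf             # exclude the true label
    return bool(np.all(correct - masked.max(axis=1) >= mu))
```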

First approach: Halving

Halving for bandit multiclass categorization
  Initialize: V_1 = H
  For t = 1, 2, ...
    Receive x_t
    For all r ∈ [k] let V_t(r) = {h ∈ V_t : h(x_t) = r}
    Predict ŷ_t ∈ arg max_r |V_t(r)|
    If ŷ_t ≠ y_t, set V_{t+1} = V_t \ V_t(ŷ_t)

Analysis:
  Whenever we err, |V_{t+1}| ≤ (1 − 1/k) |V_t| ≤ exp(−1/k) |V_t|
  Therefore err ≤ k log(|H|) / m
  Equivalently, the sample complexity is k log(|H|) / ε
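A minimal sketch of the Halving rule above, assuming the hypotheses are given explicitly as callables (this explicit enumeration is exactly what makes the approach expensive):

```python
def bandit_halving(hypotheses, stream):
    # hypotheses: list of callables h(x) -> label in {0, ..., k-1}
    # stream:     iterable of (x, y); only the bit 1[prediction != y] is used.
    V = list(hypotheses)                          # version space V_1 = H
    mistakes = 0
    for x, y in stream:
        # Partition the version space by predicted label, take the plurality label.
        votes = {}
        for h in V:
            votes.setdefault(h(x), []).append(h)
        y_hat = max(votes, key=lambda r: len(votes[r]))
        if y_hat != y:                            # bandit feedback
            mistakes += 1
            # Drop the wrong (and largest) sub-class: at least a 1/k fraction of V.
            V = [h for h in V if h(x) != y_hat]
    return mistakes
```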

Using Halving

Step 1: dimensionality reduction to d' = O(ln(m + k) / µ²)
Step 2: discretize H to (1/µ)^{k d'} hypotheses
Apply Halving to the resulting finite set of hypotheses

Analysis:
  Sample complexity is of order k² / (µ² ε)
  But the runtime grows like (1/µ)^{k d'} = (m + k)^{Õ(k/µ²)}

How can we improve the runtime?

Halving is not efficient because it does not exploit the structure of H.
In the full-information case, Halving can be made efficient because each version space V_t can be kept convex.
The Perceptron is a related approach that exploits convexity and works in the full-information case.
Next approach: let us try to build on the Perceptron.

The Multiclass Perceptron

For t = 1, 2, ..., m:
  Receive x_t ∈ R^d
  Predict ŷ_t = arg max_r (W_t x_t)_r
  Receive y_t = h*(x_t)
  If ŷ_t ≠ y_t, update: W_{t+1} = W_t + U_t

where U_t ∈ R^{k×d} has x_t in row y_t, −x_t in row ŷ_t, and zeros elsewhere.

Problem: in the bandit case we are blind to the value of y_t.
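A minimal sketch of this update (labels encoded as 0, ..., k−1 for convenience):

```python
import numpy as np

def multiclass_perceptron(stream, k, d):
    # stream: iterable of (x, y) with x in R^d and y in {0, ..., k-1}.
    # Full-information setting: the true label y_t is revealed every round.
    W = np.zeros((k, d))
    mistakes = 0
    for x, y in stream:
        y_hat = int(np.argmax(W @ x))
        if y_hat != y:
            mistakes += 1
            U = np.zeros((k, d))
            U[y] = x              # +x_t in row y_t
            U[y_hat] = -x         # -x_t in row y_hat_t
            W += U
    return W, mistakes
```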

The Banditron (Kakade, Shalev-Shwartz, Tewari 2008)

Explore: from time to time, instead of predicting ŷ_t, guess some ỹ_t.
Suppose we guess correctly, i.e. ỹ_t = y_t.
Then we have a full-information example for the Perceptron's update: (x_t, ŷ_t, ỹ_t = y_t).

Exploration-exploitation tradeoff:
  When exploring we may have ỹ_t = y_t ≠ ŷ_t, and then we can learn from the round.
  When exploring we may have ỹ_t ≠ y_t = ŷ_t, and then we had the right answer in our hands but did not exploit it.

The Banditron (Kakade, Shalev-Shwartz, Tewari 2008)

For t = 1, 2, ..., m:
  Receive x_t ∈ R^d
  Set ŷ_t = arg max_r (W_t x_t)_r
  Define P(r) = (1 − γ) 1[r = ŷ_t] + γ/k
  Randomly sample ỹ_t according to P and predict ỹ_t
  Receive feedback 1[ỹ_t = y_t]
  Update: W_{t+1} = W_t + Ũ_t

where Ũ_t ∈ R^{k×d} has (1[y_t = ỹ_t] / P(ỹ_t)) x_t in row ỹ_t, −x_t in row ŷ_t, and zeros elsewhere.
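A minimal sketch of the Banditron loop above (again with labels 0, ..., k−1; the exploration rate γ and the seed are free choices):

```python
import numpy as np

def banditron(stream, k, d, gamma=0.05, seed=0):
    # stream: iterable of (x, y); the learner only sees the bit 1[y_tilde == y].
    rng = np.random.default_rng(seed)
    W = np.zeros((k, d))
    mistakes = 0
    for x, y in stream:
        y_hat = int(np.argmax(W @ x))
        P = np.full(k, gamma / k)                 # explore uniformly w.p. gamma
        P[y_hat] += 1.0 - gamma                   # otherwise exploit y_hat
        y_tilde = int(rng.choice(k, p=P))
        feedback = (y_tilde == y)                 # the only information received
        mistakes += int(y_tilde != y)
        # Unbiased surrogate for the full-information Perceptron update.
        U = np.zeros((k, d))
        U[y_tilde] += (float(feedback) / P[y_tilde]) * x
        U[y_hat] -= x
        W += U
    return W, mistakes
```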

The Banditron (Kakade, Shalev-Shwartz, Tewari 2008)

Theorem
  The Banditron's sample complexity is of order (k/µ²) / ε².
  The Banditron's runtime is O(k/µ²).

The crux of the difference between Halving and the Banditron:
  Without full information, the version space is non-convex, and therefore it is hard to exploit the structure of H.
  Because we relied on the Perceptron, we did exploit the structure of H and obtained an efficient algorithm.
  We managed to obtain full-information examples by using exploration.
  The price of exploration is a higher regret.

Trading samples for runtime

Algorithm    samples         runtime
Halving      k² / (µ² ε)     (m + k)^{Õ(k/µ²)}
Banditron    k / (µ² ε²)     k / µ²

[Figure: Halving and the Banditron plotted on the runtime (τ) versus sample size (m) curve]

Next example: Agnostic PAC learning of fuzzy halfspaces

Agnostic PAC:
  D - an arbitrary distribution over X × Y
  Training set: S = (x_1, y_1), ..., (x_m, y_m)
  Goal: use S to find h_S such that, with probability at least 1 − δ, err(h_S) ≤ min_{h∈H} err(h) + ε

Hypothesis class

H = {x ↦ φ(⟨w, x⟩) : ‖w‖₂ ≤ 1},  φ(z) = 1 / (1 + exp(−z/µ))

[Figure: the sigmoid transfer function φ]

Probabilistic classifier: P[h_w(x) = 1] = φ(⟨w, x⟩)
Loss function: err(w; (x, y)) = P[h_w(x) ≠ y] = |φ(⟨w, x⟩) − (y + 1)/2|
Remark: the dimension can be infinite (kernel methods)
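In code, the transfer function and the resulting pointwise loss read as follows (a short sketch; the absolute-value form follows from P[h_w(x) = 1] = φ(⟨w, x⟩) and y ∈ {±1}):

```python
import numpy as np

def phi(z, mu):
    # Sigmoid transfer function phi(z) = 1 / (1 + exp(-z / mu)).
    return 1.0 / (1.0 + np.exp(-z / mu))

def zero_one_loss(w, x, y, mu):
    # err(w; (x, y)) = P[h_w(x) != y] = |phi(<w, x>) - (y + 1) / 2| for y in {-1, +1},
    # since the classifier outputs +1 with probability phi(<w, x>).
    return abs(phi(np.dot(w, x), mu) - (y + 1) / 2)
```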

First approach: sub-sample covering

Claim: there exist 1/(εµ²) examples from which we can efficiently learn w up to error ε.
Proof idea:
  S' = {(x_i, y'_i)}, where y'_i = y_i if y_i ⟨w, x_i⟩ < µ and y'_i = −y_i otherwise
  Use the surrogate convex loss ½ max{0, 1 − y ⟨w, x⟩ / γ}
  Minimizing the surrogate loss on S' approximately minimizes the original loss on S
  The sample complexity w.r.t. the surrogate loss is 1/(εµ²)
Analysis:
  Sample complexity: 1/(εµ)²
  Time complexity: m^{1/(εµ²)} = (1/(εµ))^{1/(εµ²)}

Second approach: IDPK (Shalev-Shwartz, Shamir, Sridharan)

Learning fuzzy halfspaces using the Infinite-Dimensional Polynomial Kernel
Original class: H = {x ↦ φ(⟨w, x⟩)}
Problem: the loss is non-convex with respect to w.
Main idea: work with a larger hypothesis class for which the loss becomes convex:
  x ↦ ⟨v, ψ(x)⟩  approximating  x ↦ φ(⟨w, x⟩)

Step 2: Learning fuzzy halfspaces with the IDPK

Original class: H = {x ↦ φ(⟨w, x⟩) : ‖w‖ ≤ 1}
New class: H' = {x ↦ ⟨v, ψ(x)⟩ : ‖v‖ ≤ B}, where ψ : X → R^N satisfies
  ∀j, ∀(i_1, ..., i_j):  ψ(x)_{(i_1,...,i_j)} = 2^{−j/2} x_{i_1} ··· x_{i_j}

Lemma (Shalev-Shwartz, Shamir, Sridharan 2009)
If B = exp(Õ(1/µ)), then for every h ∈ H there exists h' ∈ H' such that h'(x) ≈ h(x) for all x.

Remark: the above is a pessimistic choice of B; in practice a smaller B suffices. Is it tight? Even if it is, are there natural assumptions under which a better bound holds? (e.g. Kalai, Klivans, Mansour, Servedio 2005)

Proof idea

Polynomial approximation: φ(z) ≈ Σ_{j≥0} β_j z^j
Therefore:
  φ(⟨w, x⟩) ≈ Σ_{j≥0} β_j (⟨w, x⟩)^j = Σ_{j≥0} Σ_{k_1,...,k_j} (2^{j/2} β_j w_{k_1} ··· w_{k_j}) (2^{−j/2} x_{k_1} ··· x_{k_j}) = ⟨v_w, ψ(x)⟩
To obtain a concrete bound we use the Chebyshev approximation technique: a family of polynomials orthogonal with respect to the inner product
  ⟨f, g⟩ = ∫_{−1}^{1} f(x) g(x) / √(1 − x²) dx
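As an illustration of the Chebyshev step (my own sketch, with arbitrary µ and degree), one can fit the sigmoid on [−1, 1] at Chebyshev nodes and read off the polynomial coefficients:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

mu, deg = 0.25, 30
phi = lambda z: 1.0 / (1.0 + np.exp(-z / mu))

# Interpolate phi at Chebyshev nodes on [-1, 1].
nodes = np.cos((2 * np.arange(deg + 1) + 1) * np.pi / (2 * (deg + 1)))
cheb_coeffs = C.chebfit(nodes, phi(nodes), deg)

# Coefficients beta_j of the power-series form phi(z) ~ sum_j beta_j z^j
# (conversion to the monomial basis can be ill-conditioned for large degrees).
beta = C.cheb2poly(cheb_coeffs)

zs = np.linspace(-1, 1, 1001)
print("max approximation error on [-1, 1]:",
      np.max(np.abs(phi(zs) - C.chebval(zs, cheb_coeffs))))
```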

Infinite-Dimensional Polynomial Kernel

Although the dimension is infinite, the problem can be solved using the kernel trick.
The corresponding kernel (a.k.a. Vovk's infinite polynomial):
  ⟨ψ(x), ψ(x')⟩ = K(x, x') = 1 / (1 − ½ ⟨x, x'⟩)
The algorithm boils down to linear regression with the above kernel.
Convex! Can be solved efficiently.
Sample complexity: (B/ε)² = 2^{Õ(1/µ)} / ε²
Time complexity: m²
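A minimal sketch of the resulting procedure, assuming unit-norm inputs so the kernel series converges; the small ridge term is my own addition for numerical stability and is not part of the slide:

```python
import numpy as np

def vovk_kernel(X1, X2):
    # K(x, x') = 1 / (1 - 0.5 * <x, x'>), well defined when ||x||, ||x'|| <= 1.
    return 1.0 / (1.0 - 0.5 * (X1 @ X2.T))

def fit(X, y, lam=1e-3):
    # Kernel regression: forming the m x m Gram matrix already costs m^2;
    # the direct solve used here for simplicity is cubic in m.
    K = vovk_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(alpha, X_train, X_test):
    return vovk_kernel(X_test, X_train) @ alpha

# Toy usage with unit-norm inputs and +-1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)
w_star = rng.normal(size=5)
y = np.sign(X @ w_star)
alpha = fit(X, y)
print("training accuracy:", np.mean(np.sign(predict(alpha, X, X)) == y))
```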

Trading samples for time

Algorithm    samples                     time
Covering     1/(ε² µ²)                   (1/(εµ))^{1/(εµ²)}
IDPK         (1/(εµ))^{1/µ} · 1/ε²       (1/(εµ))^{2/µ} · 1/ε⁴

Summary

Trading data for runtime (?)
There are more examples of the phenomenon...
Open questions:
  More points on the curve (new algorithms)
  Lower bounds???
Can you help?