Computational and Statistical Learning Theory

Computational and Statistical Learning Theory
TTIC 31120, Prof. Nati Srebro
Lecture 12: Weak Learnability and the $\ell_1$ Margin; Converse to Scale-Sensitive Learning; Stability; Convex-Lipschitz-Bounded Problems

Prediction Margin

For a predictor $h: \mathcal{X} \to \mathbb{R}$ and binary labels $y = \pm 1$:
Margin on a single example: $y\,h(x)$
Margin on a training set: $\mathrm{margin}_S(h) = \min_{(x_i, y_i) \in S} y_i h(x_i)$

Most classification loss functions are functions of the margin:
$\mathrm{loss}_{\mathrm{mrg}}(h(x); y) = \mathbb{1}[\mathrm{margin} < 1]$
$\mathrm{loss}_{\mathrm{hinge}}(h(x); y) = [1 - \mathrm{margin}]_+$
$\mathrm{loss}_{\mathrm{logistic}}(h(x); y) = \log(1 + e^{-\mathrm{margin}})$
$\mathrm{loss}_{\exp}(h(x); y) = e^{-\mathrm{margin}}$
$\mathrm{loss}_{\mathrm{sq}}(h(x); y) = (y - h(x))^2 = (1 - \mathrm{margin})^2$
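
As a quick numeric illustration of these margin losses, here is a minimal NumPy sketch (the helper names and the toy values are mine, not from the lecture):

```python
import numpy as np

def margins(h_vals, y):
    """Per-example margins y_i * h(x_i) for real-valued predictions and labels y in {+1, -1}."""
    return y * h_vals

# The margin losses from the slide, written as functions of the margin m = y * h(x):
loss_mrg      = lambda m: (m < 1).astype(float)        # indicator of margin < 1
loss_hinge    = lambda m: np.maximum(0.0, 1.0 - m)     # [1 - margin]_+
loss_logistic = lambda m: np.log1p(np.exp(-m))         # log(1 + e^{-margin})
loss_exp      = lambda m: np.exp(-m)                   # e^{-margin}
loss_sq       = lambda m: (1.0 - m) ** 2               # (y - h(x))^2 = (1 - margin)^2 for y = +/-1

h_vals = np.array([2.0, 0.3, -0.5])
y = np.array([+1.0, -1.0, -1.0])
m = margins(h_vals, y)
print("margins:", m, " training-set margin:", m.min())
for name, loss in [("mrg", loss_mrg), ("hinge", loss_hinge),
                   ("logistic", loss_logistic), ("exp", loss_exp), ("sq", loss_sq)]:
    print(name, loss(m))
```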

Complexity and Margin

Recall: with probability at least $1-\delta$ over $S \sim D^m$, for all $h \in \mathcal{H}$,
$L_D^{01}(h) \le L_S^{\mathrm{mrg}}(h) + \mathcal{R}_S(\mathcal{H}) + \sqrt{\tfrac{\log(1/\delta)}{m}}$.

If $\mathrm{margin}_S(h) = \gamma$, then $L_S^{\mathrm{mrg}}(h/\gamma) = 0$, and so
$L_D^{01}(h) \le L_S^{\mathrm{mrg}}(h/\gamma) + \mathcal{R}_S(\tfrac{1}{\gamma}\mathcal{H}) + \sqrt{\tfrac{\log(1/\delta)}{m}} = \tfrac{1}{\gamma}\mathcal{R}_m(\mathcal{H}) + \sqrt{\tfrac{\log(1/\delta)}{m}} \le \sqrt{\tfrac{B^2 \sup_x \|\phi(x)\|_2^2}{\gamma^2 m}} + \sqrt{\tfrac{\log(1/\delta)}{m}}$
for $\mathcal{H} = \{h_w(x) = \langle w, \phi(x)\rangle : \|w\|_2 \le B\}$.

$\ell_2$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_2} = \sup_{\|w\|_2 \le 1} \mathrm{margin}(w)$.
Even better to consider the relative $\ell_2$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_2 \sup_x \|\phi(x)\|_2}$.
$\ell_1$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_1}$.
Relative $\ell_1$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_1 \sup_x \|\phi(x)\|_\infty}$.
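
A small sketch (my own, following the definitions above) computing the normalized and relative $\ell_2$/$\ell_1$ margins of a fixed linear predictor on a sample:

```python
import numpy as np

def relative_margins(w, Phi, y):
    """Normalized and relative margins of the linear predictor x -> <w, phi(x)>.

    Phi: (m, d) matrix whose rows are phi(x_i); y: labels in {+1, -1}.
    """
    raw_margin = np.min(y * (Phi @ w))                    # margin_S(w) = min_i y_i <w, phi(x_i)>
    l2 = raw_margin / np.linalg.norm(w, 2)                # l2-margin: margin(w) / ||w||_2
    l1 = raw_margin / np.linalg.norm(w, 1)                # l1-margin: margin(w) / ||w||_1
    rel_l2 = l2 / np.max(np.linalg.norm(Phi, 2, axis=1))  # also divide by sup ||phi(x)||_2
    rel_l1 = l1 / np.max(np.abs(Phi))                     # also divide by sup ||phi(x)||_inf
    return l2, l1, rel_l2, rel_l1

rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 5))
w_star = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = np.sign(Phi @ w_star)
print(relative_margins(w_star, Phi, y))
```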

Boosting and the $\ell_1$ Margin

Weak learning: can always find $f$ with $L^{01}(f) \le \frac{1}{2} - \frac{\gamma}{2}$.
After $T = \frac{48 \log(2m)}{\gamma^4}$ iterations, AdaBoost finds a predictor with $\frac{\mathrm{margin}_S(w)}{\|w\|_1} \ge \frac{\gamma}{2}$ over $\phi(x)[f] = f(x) \in \pm 1$, i.e. $\|\phi(x)\|_\infty = 1$ (and as $T \to \infty$, $\lim \frac{\mathrm{margin}_S(w)}{\|w\|_1} \ge \gamma$).
With $B = \{\text{weak predictors } f\}$, need $m = O\!\left(\frac{\mathrm{VCdim}(B)}{(\gamma/2)^2 \varepsilon^2}\right)$ samples to ensure $L^{01}(w) \le \varepsilon$.

Can we understand AdaBoost purely in terms of the $\ell_1$-margin? Can we get a guarantee for AdaBoost that relies on the existence of a large $\ell_1$-margin predictor, instead of on weak learnability?
The AdaBoost analysis shows weak learning $\Rightarrow$ $\ell_1$-margin. Converse?
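
To make the normalized $\ell_1$-margin concrete, here is a toy AdaBoost sketch over decision stumps that reports $\mathrm{margin}_S(w)/\|w\|_1$ for several values of $T$ (the dataset, names, and code are my own illustration of the trend, not the course's code or its exact guarantee):

```python
import numpy as np

def stump_predict(X, feat, thresh, sign):
    """A +/-1 decision stump: f(x) = sign if x[feat] > thresh else -sign."""
    return sign * np.where(X[:, feat] > thresh, 1.0, -1.0)

def best_stump(X, y, p):
    """Weak learner: the stump maximizing the weighted edge sum_i p_i y_i f(x_i)."""
    best_edge, best_params = -np.inf, None
    for feat in range(X.shape[1]):
        for thresh in np.unique(X[:, feat]):
            for sign in (+1.0, -1.0):
                edge = np.sum(p * y * stump_predict(X, feat, thresh, sign))
                if edge > best_edge:
                    best_edge, best_params = edge, (feat, thresh, sign)
    return best_params

def adaboost_l1_margin(X, y, T):
    """Run T rounds of AdaBoost and return the normalized l1-margin margin_S(w) / ||w||_1."""
    m = len(y)
    p = np.ones(m) / m               # distribution over training examples
    agg = np.zeros(m)                # aggregated predictions sum_t alpha_t f_t(x_i)
    l1_norm = 0.0                    # ||w||_1 = sum_t |alpha_t|
    for _ in range(T):
        feat, thresh, sign = best_stump(X, y, p)
        f = stump_predict(X, feat, thresh, sign)
        err = np.clip(np.sum(p * (f != y)), 1e-8, 1 - 1e-8)
        alpha = 0.5 * np.log((1 - err) / err)
        agg += alpha * f
        l1_norm += abs(alpha)
        p *= np.exp(-alpha * y * f)  # AdaBoost reweighting
        p /= p.sum()
    return np.min(y * agg) / l1_norm

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 2))
y = np.where(X[:, 0] + 0.3 * X[:, 1] > 0, 1.0, -1.0)  # a toy labeling (stumps usually boost this well)
for T in (5, 50, 500):
    print(T, adaboost_l1_margin(X, y, T))             # the normalized l1-margin typically increases with T
```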

Weak Learning and the $\ell_1$ Margin

Consider a base class $B \subseteq \{f: \mathcal{X} \to \pm 1\}$ and the corresponding feature map $\phi: \mathcal{X} \to \mathbb{R}^B$ defined by $\phi(x)[f] = f(x)$.
Goal: relate weak learnability using predictors in $B$ to the $\ell_1$-margin using $\phi(x)$.

Weak learnability: $h: \mathcal{X} \to \pm 1$ is $\gamma$-weakly learnable using $B$ if for any distribution $D(\mathcal{X})$ there exists $f \in B$ s.t. $\Pr_{x \sim D}[f(x) = h(x)] \ge \frac{1}{2} + \frac{\gamma}{2}$.

Assume that $B$ is symmetric, i.e. for any $f \in B$, also $-f \in B$. This allows us to consider only $w \ge 0$, and so $\|w\|_1 = \sum_f w[f]$: if $w[f] < 0$, instead use $w[-f] = -w[f] > 0$. (Without assuming $B$ is symmetric, we would need to talk about the margin attainable only with $w \ge 0$.)

Weak Learning and the $\ell_1$ Margin

Best possible $\ell_1$-margin for a labeling $h$:
$\gamma_1 = \sup_{\|w\|_1 \le 1} \min_x h(x) \langle \phi(x), w \rangle$.

For a finite domain $\mathcal{X} = \{x_1, \dots, x_n\}$ and a finite base class $B$ (i.e. $\phi(x) \in \mathbb{R}^d$ is finite dimensional), consider the matrix $A \in \{\pm 1\}^{n \times d}$ with rows $A_i = h(x_i)\,\phi(x_i)$.

We can write the $\ell_1$-margin as
$\gamma_1 = \max_{\|w\|_1 \le 1} \min_i A_i w = \max_{\|w\|_1 \le 1} \; \min_{p \in \mathbb{R}_+^n,\, \|p\|_1 = 1} p^\top A w$,
and since $B$ is symmetric:
$\gamma_1 = \max_{w \in \mathbb{R}_+^d,\, \|w\|_1 = 1} \; \min_{p \in \mathbb{R}_+^n,\, \|p\|_1 = 1} p^\top A w$.

Weak Learning and the $\ell_1$ Margin

Best possible weak-learnability edge for $h: \mathcal{X} \to \pm 1$:
$\gamma^* = \min_D \max_{f \in B} \left( 2 \Pr_{x \sim D}[h(x) = f(x)] - 1 \right) = \min_D \max_{f \in B} \sum_x D(x)\, h(x)\, f(x)$.

For a finite domain, in terms of the matrix $A$ with rows $A_i = h(x_i)\,\phi(x_i)$ and columns $A^j$:
$\gamma^* = \min_{p \in \mathbb{R}_+^n,\, \|p\|_1 = 1} \max_{j \in 1..d} \sum_i p_i\, h(x_i)\, \phi(x_i)[f_j] = \min_{p \in \mathbb{R}_+^n,\, \|p\|_1 = 1} \max_{j \in 1..d} p^\top A^j = \min_{p \in \mathbb{R}_+^n,\, \|p\|_1 = 1} \; \max_{w \in \mathbb{R}_+^d,\, \|w\|_1 = 1} p^\top A w$,
and by strong duality (the minimax theorem),
$\gamma^* = \max_{w \in \mathbb{R}_+^d,\, \|w\|_1 = 1} \; \min_{p \in \mathbb{R}_+^n,\, \|p\|_1 = 1} p^\top A w = \gamma_1$.
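
Both quantities are linear programs over the simplex, so the identity $\gamma_1 = \gamma^*$ can be checked numerically. A sketch using scipy.optimize.linprog (my own illustration, for an arbitrary finite $\pm 1$ matrix $A$ with rows $h(x_i)\phi(x_i)$):

```python
import numpy as np
from scipy.optimize import linprog

def l1_margin(A):
    """gamma_1 = max_{w >= 0, ||w||_1 = 1} min_i (A w)_i, written as an LP in (w, gamma)."""
    n, d = A.shape
    c = np.concatenate([np.zeros(d), [-1.0]])            # maximize gamma <=> minimize -gamma
    A_ub = np.hstack([-A, np.ones((n, 1))])              # gamma - (A w)_i <= 0 for every row i
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(d), [0.0]]).reshape(1, -1)   # sum_j w_j = 1
    b_eq = [1.0]
    bounds = [(0, None)] * d + [(None, None)]            # w >= 0, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -res.fun

def weak_edge(A):
    """gamma_* = min_{p >= 0, ||p||_1 = 1} max_j (p^T A)_j, the dual LP in (p, gamma)."""
    n, d = A.shape
    c = np.concatenate([np.zeros(n), [1.0]])             # minimize gamma
    A_ub = np.hstack([A.T, -np.ones((d, 1))])            # (A^T p)_j - gamma <= 0 for every column j
    b_ub = np.zeros(d)
    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)   # sum_i p_i = 1
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]            # p >= 0, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun

rng = np.random.default_rng(0)
A = np.sign(rng.standard_normal((8, 5)))   # toy +/-1 matrix playing the role of A_i = h(x_i) phi(x_i)
print(l1_margin(A), weak_edge(A))          # the two values agree, up to solver tolerance
```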

Weak Learning and the $\ell_1$ Margin

Conclusion: $h$ is $\gamma$-weakly learnable using predictors from the base class $B$ (i.e. for any distribution, can get error $\le \frac{1}{2} - \frac{\gamma}{2}$ using a predictor from $B$) if and only if $h$ is realizable with $\ell_1$-margin $\gamma$ using $\phi(x)$ (i.e. there exists $w$ with $\|w\|_1 \le \frac{1}{\gamma}$ and $L^{\mathrm{mrg}}(x \mapsto \langle w, \phi(x) \rangle) = 0$).

AdaBoost can be viewed as an algorithm for maximizing the $\ell_1$-margin $\frac{\mathrm{margin}_S(w)}{\|w\|_1}$: if some predictor has $\frac{\mathrm{margin}_S(w)}{\|w\|_1} \ge \gamma$, AdaBoost finds $\frac{\mathrm{margin}_S(w)}{\|w\|_1} \ge \frac{\gamma}{2}$ in $O\!\left(\frac{\log m}{\gamma^4}\right)$ steps, and eventually converges to the maximal $\ell_1$-margin solution.

Loss, Regularizer and Efficient Representation

SVM: $\ell_2$ regularization $\Rightarrow$ dimension-independent generalization; hinge loss; represent an infinite-dimensional feature space via kernels.

Boosting: $\ell_1$ regularization $\Rightarrow$ sample complexity depends on $\log(d)$ or $\mathrm{VCdim}(\text{features})$; exp-loss / hard margin; represent an infinite-dimensional feature space via a weak-learning oracle, i.e. an oracle for finding a high-derivative feature.

Hypothesis class $\mathcal{H} = \{h: \mathcal{X} \to \mathcal{Y}\}$, loss function $\mathrm{loss}(\hat{y}, y)$, loss class $\mathcal{F} = \{f_h(z) = \ell(h, z) : h \in \mathcal{H}\}$.
For a monotone or unimodal loss: $\mathrm{VCdim}(\mathcal{H})$ controls $\mathrm{VCdim}(\mathcal{F})$.
For a Lipschitz loss, with $|h(x)| \le a$ and hence $|\ell(h, z)| \le a_{\mathrm{loss}}$: $\dim_\alpha(\mathcal{H})$ controls $\dim_\alpha(\mathcal{F})$, $N(\mathcal{H}, \alpha, m)$ controls $N(\mathcal{F}, \alpha, m)$, and $\mathcal{R}_m(\mathcal{H})$ controls $\mathcal{R}_m(\mathcal{F})$.
Together with the bound on the loss, these give uniform convergence, $\forall_S^\delta\; \forall_h\; |L_D(h) - L_S(h)| \le \varepsilon$, and hence $\forall_S^\delta\; L_D(\mathrm{ERM}_{\mathcal{H}}(S)) \le \inf_h L_D(h) + \varepsilon$.

Converse: ULLN

For a bounded loss, the following are equivalent:
Finite fat-shattering dimension at every scale $\alpha > 0$
Finite covering numbers at every scale $\alpha > 0$
Rademacher complexity $\mathcal{R}_m \to 0$ as $m \to \infty$
$\sup_f |\mathbb{E}[f] - \mathbb{E}_S[f]| \to 0$ as $m \to \infty$
(and they are equivalent quantitatively, up to log factors)

Fundamental Theorem of (Real-Valued) Learning

Finite fat-shattering dimension $\dim_\alpha(\mathcal{H})$ $\Rightarrow$ learnable, with sample complexity $\propto \dim_\alpha(\mathcal{H})$.

Can't expect a converse for an arbitrary loss function:
E.g. the trivial loss $\mathrm{loss}(\hat{y}, y) = 0$.
Or a partially trivial one: the ramp loss with $\dim_\alpha(\mathcal{H}) = \infty$ but $\forall_{h \in \mathcal{H}, x \in \mathcal{X}}\; h(x) > 5$ (so the loss no longer distinguishes hypotheses).
Focus on $\mathrm{loss}(\hat{y}, y) = |\hat{y} - y|$.

Theorem: With $\mathrm{loss}(\hat{y}, y) = |\hat{y} - y|$, for any $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{X}}$, any learning rule $A$ and any $\alpha > 0$, there exist $D$ and $h \in \mathcal{H}$ with $L_D(h) = 0$, but such that with $m < \frac{1}{4}\dim_\alpha(\mathcal{H})$ samples, $\mathbb{E}[L_D(A(S))] > \frac{\alpha}{4}$; i.e. the sample complexity to get error $\frac{\alpha}{4}$ is at least $\frac{1}{4}\dim_\alpha(\mathcal{H})$.

Conclusion: the fat-shattering dimension tightly characterizes learnability. If learnable, learnable using ERM with near-optimal sample complexity.

General Learning Setting

$\min_{h \in \mathcal{H}} L(h) = \mathbb{E}_{z \sim D}[\ell(h, z)]$

Is learnability equivalent to finite fat-shattering dimension?
Consider $\mathcal{Z} = \mathbb{R}$, $\ell(h, z) = |h(z)|$, and $\mathcal{H} = \{h \equiv 0\} \cup \{h: \mathbb{R} \to \mathbb{R} \mid 1 \le h \le 2\}$.
Then $\dim_\alpha(\mathcal{H}) = \infty$ for $\alpha < \frac{1}{2}$.
But: the ERM $S \mapsto (z \mapsto 0)$ learns with excess error 0!

If learnable, can we always learn with ERM?

A Different Approach: Stability

Definition: A learning rule $A: S \mapsto h$ is (uniformly, under replacement) stable with rate $\beta(m)$ if for all $z_1, \dots, z_m$ and $z_i'$:
$\big| \ell(A(z_1, \dots, z_i', \dots, z_m), z_i) - \ell(A(z_1, \dots, z_m), z_i) \big| \le \beta(m)$

Theorem: If $A$ is stable with rate $\beta(m)$ then for every $D$:
$\mathbb{E}_{S \sim D^m}[L_D(A(S))] \le \mathbb{E}_{S \sim D^m}[L_S(A(S))] + \beta(m)$

Proof: Draw $z_1, \dots, z_m, z_1', \dots, z_m'$ i.i.d. from $D$. Since $z_i$ is independent of $(z_1, \dots, z_i', \dots, z_m)$, renaming gives
$\mathbb{E}_{S \sim D^m}[L_D(A(S))] = \frac{1}{m} \sum_{i=1}^m \mathbb{E}\big[ \ell(A(z_1, \dots, z_i', \dots, z_m), z_i) \big] \le \frac{1}{m} \sum_{i=1}^m \mathbb{E}\big[ \ell(A(z_1, \dots, z_m), z_i) \big] + \beta(m) = \mathbb{E}\Big[ \frac{1}{m} \sum_{i=1}^m \ell(A(S), z_i) \Big] + \beta(m) = \mathbb{E}[L_S(A(S))] + \beta(m)$. ∎
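
As a sanity-check tool, here is a hedged sketch of how one could probe the replace-one quantity in this definition empirically for a given learning rule (the example rule, loss, and all names below are my own, not the lecture's):

```python
import numpy as np

def replace_one_gap(A, S, z_new, i, loss):
    """|loss(A(S with S[i] replaced by z_new), S[i]) - loss(A(S), S[i])|: one instance of the
    quantity that beta(m) must bound in the stability definition above."""
    S_replaced = list(S)
    S_replaced[i] = z_new
    return abs(loss(A(S_replaced), S[i]) - loss(A(S), S[i]))

# Example learning rule: ridge regression (closed form), with squared loss.
def ridge(S, lam=1.0):
    X = np.array([x for x, _ in S])
    y = np.array([t for _, t in S])
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return lambda x: x @ w

sq_loss = lambda h, z: (h(z[0]) - z[1]) ** 2

rng = np.random.default_rng(0)
S = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(50)]
z_new = (rng.standard_normal(3), rng.standard_normal())
print(replace_one_gap(ridge, S, z_new, i=0, loss=sq_loss))
```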

Stability of Linear Predictors?

Supervised learning: $z = (x, y)$, $\ell(h, (x, y)) = \mathrm{loss}(h(x), y)$, with $\mathcal{X} = \{x \in \mathbb{R}^2 : \|x\|_2 \le 1\}$, $\mathcal{Y} = [-1, 1]$, $\mathrm{loss}(\hat{y}, y) = |\hat{y} - y|$, and $\mathcal{H} = \{x \mapsto \langle w, x \rangle : \|w\|_2 \le 2\}$.

Is $A(S) = \mathrm{ERM}_{\mathcal{H}}(S)$ stable?

For any $m$, consider: $x_1 = x_2 = \dots = x_{m-1} = (1, 0)$, $y_1 = y_2 = \dots = y_{m-1} = 1$, and $x_m = (0, 1)$, $y_m = 1$, which is replaced with $x_m' = (0, 1)$, $y_m' = -1$.

Then $A(S) = (1, 1)$ with $\ell(A(S), z_m) = 0$, but $A(S^{(m)}) = (1, -1)$ with $\ell(A(S^{(m)}), z_m) = 2$.

$\mathrm{ERM}_{\mathcal{H}}$ does not have stability better than 2 (the worst possible), even as $m \to \infty$.

Stability and Regularization

Consider instead $\mathrm{RERM}_\lambda(S) = \arg\min_w L_S(w) + \lambda \|w\|_2^2$, over $\mathcal{X} = \{x \in \mathbb{R}^d : \|x\|_2 \le R\}$ with $\mathrm{loss}(\hat{y}, y) = |\hat{y} - y|$.

Claim: $\mathrm{RERM}_\lambda$ is $\beta(m) = \frac{2R^2}{\lambda m}$ stable.

How can we use this to learn $\mathcal{H}_B = \{w : \|w\|_2 \le B\}$? For any $w^*$ with $\|w^*\|_2 \le B$:
$\mathbb{E}[L_D(\mathrm{RERM}_\lambda(S))] \le \mathbb{E}[L_S(\mathrm{RERM}_\lambda(S))] + \beta(m) \le \mathbb{E}[L_S(\mathrm{RERM}_\lambda(S)) + \lambda \|\mathrm{RERM}_\lambda(S)\|_2^2] + \beta(m) \le \mathbb{E}[L_S(w^*) + \lambda \|w^*\|_2^2] + \beta(m) = L_D(w^*) + \lambda \|w^*\|_2^2 + \frac{2R^2}{\lambda m}$,
and hence
$\mathbb{E}[L_D(\mathrm{RERM}_\lambda(S))] \le \inf_{\|w\|_2 \le B} L_D(w) + \lambda B^2 + \frac{2R^2}{\lambda m} = \inf_{\|w\|_2 \le B} L_D(w) + \sqrt{\frac{8 B^2 R^2}{m}}$ with $\lambda = \sqrt{\frac{2R^2}{B^2 m}}$.

Two Views of Regularization

Uniform convergence: limiting to $\|w\| \le B$ ensures uniform convergence of $L_S(w)$ to $L_D(w)$. Motivates $\mathrm{ERM}_B(S) = \arg\min_{\|w\| \le B} L_S(w)$; an SRM variant, balancing complexity and approximation, is $\mathrm{RERM}_\lambda(S)$.

Stability: adding a regularizer ensures stability, and thus generalization. Motivates $\mathrm{RERM}_\lambda(S) = \arg\min_w L_S(w) + \lambda \|w\|^2$; to learn $\|w\| \le B$, use $\lambda \propto \frac{1}{B\sqrt{m}}$.

We still need to prove stability! We will consider the broader class of generalized learning problems with a Lipschitz objective.

Convex-Bounded-Lipschitz Problems

For a generalized learning problem $\min_{w \in \mathbb{R}^d} \mathbb{E}_{z \sim D}[\ell(w, z)]$ with domain $z \in \mathcal{Z}$ and a hypothesis class $\mathcal{H} \subseteq \mathbb{R}^d$, we say:
The problem is convex if for every $z$, $\ell(w, z)$ is convex in $w$.
The problem is $G$-Lipschitz if for every $z$, $\ell(w, z)$ is $G$-Lipschitz in $w$:
$\forall_{z \in \mathcal{Z}}\; \forall_{w, w' \in \mathcal{H}}\; |\ell(w, z) - \ell(w', z)| \le G \|w - w'\|_2$,
or $G$-Lipschitz with respect to a norm $\|w\|$:
$\forall_{z \in \mathcal{Z}}\; \forall_{w, w' \in \mathcal{H}}\; |\ell(w, z) - \ell(w', z)| \le G \|w - w'\|$.
The problem is $B$-bounded w.r.t. the norm $\|w\|$ if $\forall_{w \in \mathcal{H}}\; \|w\| \le B$.

For simplicity we write $w \in \mathbb{R}^d$. Actually, we can consider $w \in \mathcal{W}$ for some Banach space (normed vector space) $\mathcal{W}$ with norm $\|w\|$.

Linear Prediction as a Generalized Lipschitz Problem

$z = (x, y) \in \mathcal{X} \times \mathcal{Y}$, $\phi: \mathcal{X} \to \mathbb{R}^d$, $\mathrm{loss}: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$, and $\ell(w, (x, y)) = \mathrm{loss}(\langle w, \phi(x) \rangle; y)$.

If $\mathrm{loss}(\hat{y}; y)$ is convex in $\hat{y}$, the problem is convex.
If $\mathrm{loss}(\hat{y}; y)$ is $g$-Lipschitz in $\hat{y}$ (as a scalar function):
$|\ell(w, (x, y)) - \ell(w', (x, y))| = |\mathrm{loss}(\langle w, \phi(x) \rangle; y) - \mathrm{loss}(\langle w', \phi(x) \rangle; y)| \le g\, |\langle w, \phi(x) \rangle - \langle w', \phi(x) \rangle| = g\, |\langle w - w', \phi(x) \rangle| \le g \|\phi(x)\|_2 \|w - w'\|_2$.
If $\|\phi(x)\|_2 \le R$, then the problem is $G = gR$ Lipschitz (w.r.t. $\|\cdot\|_2$).
For any norm $\|w\|$: $|\ell(w, (x, y)) - \ell(w', (x, y))| \le g \|\phi(x)\|_* \|w - w'\|$. If $\|\phi(x)\|_* \le R$ for the dual norm, then the problem is $G = gR$ Lipschitz w.r.t. $\|w\|$.
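
A quick numeric spot-check of this Lipschitz bound, using the 1-Lipschitz hinge loss as the scalar loss (my own check, not part of the lecture):

```python
import numpy as np

# Check |loss(<w,phi(x)>; y) - loss(<w',phi(x)>; y)| <= g * ||phi(x)||_2 * ||w - w'||_2
hinge = lambda yhat, y: max(0.0, 1.0 - y * yhat)   # g = 1: the hinge loss is 1-Lipschitz in yhat

rng = np.random.default_rng(0)
for _ in range(1000):
    phi, w, w2 = rng.standard_normal((3, 10))      # random phi(x), w, w'
    y = rng.choice([-1.0, 1.0])
    lhs = abs(hinge(phi @ w, y) - hinge(phi @ w2, y))
    rhs = np.linalg.norm(phi) * np.linalg.norm(w - w2)
    assert lhs <= rhs + 1e-9
print("Lipschitz bound held on all random draws")
```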

Stability for Convex Lipschitz Problems

For a convex $G$-Lipschitz (w.r.t. $\|w\|_2$) generalized learning problem, consider $\mathrm{RERM}_\lambda(S) = \arg\min_w L_S(w) + \lambda \|w\|_2^2$.

Claim: $\mathrm{RERM}_\lambda$ is $\beta(m) = \frac{2G^2}{\lambda m}$ stable. Proof: homework.

Conclusion: using $\lambda = \sqrt{\frac{2G^2}{B^2 m}}$, we can learn any convex $G$-Lipschitz, $B$-bounded generalized learning problem (w.r.t. $\|w\|_2$) with sample complexity $O\!\left(\frac{B^2 G^2}{\varepsilon^2}\right)$.
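
A minimal sketch of $\mathrm{RERM}_\lambda$ for one such problem (absolute loss over linear predictors, solved by subgradient descent), using the $\lambda$ suggested above; the optimizer, data, and names are my own choices rather than the course's:

```python
import numpy as np

def rerm(Phi, y, lam, steps=5000, lr=0.01):
    """Subgradient descent on L_S(w) + lam * ||w||_2^2 with absolute loss |<w, phi(x)> - y|."""
    m, d = Phi.shape
    w = np.zeros(d)
    for t in range(steps):
        residual = Phi @ w - y
        subgrad = Phi.T @ np.sign(residual) / m + 2 * lam * w
        w -= lr / np.sqrt(t + 1) * subgrad
    return w

rng = np.random.default_rng(0)
m, d, B, R = 200, 5, 1.0, 1.0
Phi = rng.uniform(-1, 1, (m, d)) / np.sqrt(d)      # rows satisfy ||phi(x)||_2 <= R = 1
w_true = np.array([0.6, -0.3, 0.0, 0.4, 0.0])      # ||w_true||_2 <= B = 1
y = Phi @ w_true + 0.1 * rng.standard_normal(m)

G = 1.0 * R                                        # absolute loss is 1-Lipschitz, so G = g * R
lam = np.sqrt(2 * G**2 / (B**2 * m))               # the lambda suggested by the stability bound
w_hat = rerm(Phi, y, lam)
print("train abs loss:", np.mean(np.abs(Phi @ w_hat - y)))
```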

Back to the Converse of the Fundamental Theorem of Learning

For (bounded) supervised learning problems (with absolute loss):
Learnable if and only if the fat-shattering dimension at every scale is finite.
The fat-shattering dimension exactly characterizes the sample complexity.
If learnable, we always have a ULLN, and are always learnable with ERM, with optimal sample complexity.

For generalized learning problems:
Finite fat-shattering dimension $\Rightarrow$ learnable with ERM.
No strict converse, because of "silly" problems with complex irrelevant parts.
Converse for non-trivial problems? If learnable, always learnable with ERM?

Center of Mass with Missing Data

Center of mass (mean estimation) problem: $\mathcal{Z} = \{z : \|z\|_2 \le 1\}$, $\mathcal{H} = \{w : \|w\|_2 \le 1\}$, $\ell(h, z) = \|h - z\|^2 = \sum_i (h_i - z_i)^2$.

Center of mass with missing data: $\mathcal{Z} = \{(I, z_I) : z_I = (z_i)_{i \in I},\ \|z\|_2 \le 1,\ I \subseteq \text{coordinates}\}$, with $\ell(h, (I, z_I)) = \sum_{i \in I} (h_i - z_i)^2$.

This is a 4-Lipschitz and 1-bounded convex problem w.r.t. $\|w\|_2$, hence learnable with $\mathrm{RERM}_\lambda$.

But: consider the distribution $(I, z_I) \sim D$ with $\Pr[i \in I] = \frac{1}{2}$ independently for each of the (infinitely many) coordinates $i$, and $z_i = 0$ almost surely. Then $L_D(0) = 0$. For any finite training set, there is (with probability one) some never-observed coordinate $j$. Consider the standard basis vector $e_j$: $L_S(e_j) = 0$, hence it is an ERM, but $L_D(e_j) = \frac{1}{2}$.

No ULLN, and not learnable with ERM.
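
Here is a small simulation of this construction in a large but finite dimension (with $d \gg 2^m$, an unobserved coordinate almost certainly exists); the code is my own sketch of the argument, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 200, 5                                 # many coordinates, few samples
masks = rng.random((m, d)) < 0.5              # observed set I for each sample: Pr[i in I] = 1/2
# z_i = 0 almost surely, so all observed values are zeros.

def loss(w, mask):                            # l(w, (I, z_I)) = sum_{i in I} (w_i - z_i)^2 with z = 0
    return np.sum(w[mask] ** 2)

never_observed = np.where(~masks.any(axis=0))[0]
assert never_observed.size > 0                # holds with high probability when d >> 2^m
j = never_observed[0]
e_j = np.zeros(d); e_j[j] = 1.0

train = np.mean([loss(e_j, mk) for mk in masks])
pop = 0.5                                     # E[loss(e_j, (I, 0))] = Pr[j in I] * 1 = 1/2
print("e_j: empirical loss", train, "population loss", pop)   # an ERM, yet far from optimal
print("w=0: empirical loss", np.mean([loss(np.zeros(d), mk) for mk in masks]), "population loss 0.0")
```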