Computational and Statistical Learning Theory
TTIC 31120, Prof. Nati Srebro

Lecture 12: Weak Learnability and the $\ell_1$ Margin; Converse to Scale-Sensitive Learning; Stability; Convex-Lipschitz-Bounded Problems
Prediction Margin

For a predictor $h: X \to \mathbb{R}$ and binary labels $y = \pm 1$:
Margin on a single example: $y\,h(x)$
Margin on a training set: $\mathrm{margin}_S(h) = \min_{(x_i, y_i) \in S} y_i h(x_i)$

Most classification loss functions are a function of the margin:
- $\mathrm{loss}_{\mathrm{mrg}}(h(x); y) = \mathbf{1}[\text{margin} < 1]$
- $\mathrm{loss}_{\mathrm{hinge}}(h(x); y) = [1 - \text{margin}]_+$
- $\mathrm{loss}_{\mathrm{logistic}}(h(x); y) = \log(1 + e^{-\text{margin}})$
- $\mathrm{loss}_{\exp}(h(x); y) = e^{-\text{margin}}$
- $\mathrm{loss}_{\mathrm{sq}}(h(x); y) = (y - h(x))^2 = (1 - \text{margin})^2$
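As a small illustration (my own sketch, not part of the slides), here are these margin losses computed from a single example's margin $m = y\,h(x)$ in Python:

import numpy as np

def margin_losses(h_x: float, y: int) -> dict:
    m = y * h_x  # prediction margin on one example
    return {
        "margin": float(m < 1),        # indicator loss at margin 1
        "hinge": max(0.0, 1.0 - m),
        "logistic": np.log1p(np.exp(-m)),
        "exp": np.exp(-m),
        "squared": (1.0 - m) ** 2,     # equals (y - h(x))^2 since y = +-1
    }

print(margin_losses(h_x=0.3, y=1))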
Complexity and Margin

Recall: for $H = \{h_w(x) = \langle w, \varphi(x)\rangle : \|w\|_2 \le B\}$, with probability $\ge 1-\delta$ over $S \sim D^m$, for all $h \in H$:
$L^{01}_D(h) \le L^{\mathrm{mrg}}_S(h) + R_S(H) + \sqrt{\log(1/\delta)/m}$

If $\mathrm{margin}_S(h) = \gamma$ then $L^{\mathrm{mrg}}_S(h/\gamma) = 0$, so
$L^{01}_D(h) \le L^{\mathrm{mrg}}_S(h/\gamma) + R_S(\tfrac{1}{\gamma} H) + \sqrt{\log(1/\delta)/m} \le \tfrac{1}{\gamma} R_m(H) + \sqrt{\log(1/\delta)/m} \le \sqrt{\frac{B^2 \sup_x \|\varphi(x)\|_2^2}{\gamma^2 m}} + \sqrt{\frac{\log(1/\delta)}{m}}$

$\ell_2$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_2} = \sup_{\|w\|_2 \le 1} \mathrm{margin}(w)$
Even better to consider the relative $\ell_2$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_2 \, \sup_x \|\varphi(x)\|_2}$
$\ell_1$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_1}$; relative $\ell_1$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_1 \, \sup_x \|\varphi(x)\|_\infty}$
Boosting and the $\ell_1$ Margin

Weak learning: can always find $f$ with $L^{01}(f) \le \frac{1}{2} - \frac{\gamma}{2}$, over features $\varphi(x)[f] = f(x) \in \{\pm 1\}$ (so $\|\varphi(x)\|_\infty = 1$), with $B = \{\text{weak predictors } f\}$.

After $T = \frac{48 \log(2m)}{\gamma^4}$ iterations, AdaBoost finds a predictor with $\frac{\mathrm{margin}_S(w)}{\|w\|_1} \ge \frac{\gamma}{2}$ (and as $T \to \infty$, $\lim \frac{\mathrm{margin}_S(w)}{\|w\|_1} \ge \gamma$).

Need $m = O\!\left(\frac{\mathrm{VCdim}(B)}{\gamma^2 \varepsilon^2}\right)$ samples to ensure $L^{01}_D(w) \le \varepsilon$.

Can we understand AdaBoost purely in terms of the $\ell_1$-margin? Can we get a guarantee for AdaBoost that relies on the existence of a large $\ell_1$-margin predictor, instead of on weak learnability?
The AdaBoost analysis shows weak learning $\Rightarrow$ $\ell_1$-margin. Converse?
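For concreteness, here is a minimal AdaBoost sketch over a finite base class, in my own notation (a matrix F of base-predictor outputs) rather than the lecture's; it only shows how the weight vector $w$ is built up coordinate-wise by repeatedly picking the base predictor with the largest edge under the current example weights:

import numpy as np

def adaboost(F: np.ndarray, y: np.ndarray, T: int) -> np.ndarray:
    """F: m x d matrix with F[i, j] = f_j(x_i) in {-1, +1}; y in {-1, +1}^m."""
    m, d = F.shape
    p = np.full(m, 1.0 / m)                 # distribution over training examples
    w = np.zeros(d)
    for _ in range(T):
        edges = (p * y) @ F                 # edge of each base predictor under p
        j = int(np.argmax(np.abs(edges)))
        gamma = float(np.clip(edges[j], -1 + 1e-12, 1 - 1e-12))   # = 2 Pr_p[f_j = y] - 1, clipped to avoid division by zero
        alpha = 0.5 * np.sign(gamma) * np.log((1 + abs(gamma)) / (1 - abs(gamma)))
        w[j] += alpha
        p *= np.exp(-alpha * y * F[:, j])   # exponential reweighting of examples
        p /= p.sum()
    return w

# l1-normalized margin of the combination: (y * (F @ w)).min() / np.abs(w).sum()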
Weak Learning and the $\ell_1$ Margin

Consider a base class $B \subseteq \{f: X \to \pm 1\}$ and the corresponding feature map $\varphi: X \to \mathbb{R}^B$ defined by $\varphi(x)[f] = f(x)$.
Goal: relate weak learnability using predictors in $B$ to the $\ell_1$-margin using $\varphi(x)$.

Weak learnability: $h: X \to \pm 1$ is $\gamma$-weakly learnable using $B$ if for any distribution $D$ over $X$ there exists $f \in B$ s.t. $\Pr_{x \sim D}[f(x) = h(x)] \ge \frac{1}{2} + \frac{\gamma}{2}$.

Assume that $B$ is symmetric, i.e. for any $f \in B$, also $-f \in B$.
This allows us to consider only $w \ge 0$, and so $\|w\|_1 = \sum_f w[f]$: if $w[f] < 0$, instead use $w[-f] = -w[f] > 0$.
(Without assuming $B$ is symmetric, we would need to talk about the margin attainable only with $w \ge 0$.)
Weak Learning and the $\ell_1$ Margin

Best possible $\ell_1$ margin for a labeling $h$:
$\gamma_1 = \sup_{\|w\|_1 \le 1} \min_x h(x)\,\langle \varphi(x), w \rangle$

For a finite domain $X = \{x_1, \ldots, x_n\}$ and a finite base class $B$ (i.e. $\varphi(x) \in \mathbb{R}^d$ is finite dimensional), consider the matrix $A \in \{\pm 1\}^{n \times d}$ with rows $A_i = h(x_i)\varphi(x_i)$.

Can write the $\ell_1$ margin as
$\gamma_1 = \max_{\|w\|_1 \le 1} \min_i A_i w$
and, since $B$ is symmetric,
$\gamma_1 = \max_{w \in \mathbb{R}^d_+,\ \|w\|_1 = 1} \ \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} p^\top A w$
Weak Learning and the $\ell_1$ Margin

Best possible weak-learnability edge for $h: X \to \pm 1$:
$\gamma = \min_D \max_{f \in B} \left( 2 \Pr_{x \sim D}[h(x) = f(x)] - 1 \right) = \min_D \max_{f \in B} \sum_x D(x)\, h(x) f(x)$

For a finite domain, in terms of the matrix with rows $A_i = h(x_i)\varphi(x_i)$ and columns $A^j$:
$\gamma = \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} \ \max_{j \in 1..d} \sum_i p_i h(x_i) \varphi(x_i)[f_j] = \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} \ \max_{j \in 1..d} p^\top A^j = \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} \ \max_{w \in \mathbb{R}^d_+,\ \|w\|_1 = 1} p^\top A w$

and, by strong (minimax) duality,
$\min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} \ \max_{w \in \mathbb{R}^d_+,\ \|w\|_1 = 1} p^\top A w \;=\; \max_{w \in \mathbb{R}^d_+,\ \|w\|_1 = 1} \ \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} p^\top A w \;=\; \gamma_1$
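The duality can be checked numerically; the following sketch (mine, not from the lecture) solves both linear programs on a small random instance with scipy and confirms that the edge and the best $\ell_1$ margin coincide:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 6, 4
A = rng.choice([-1, 1], size=(n, d))   # rows A_i = h(x_i) * phi(x_i)
A = np.hstack([A, -A])                 # symmetrize the base class: include -f for every f
d = A.shape[1]

# gamma_1 = max_{w >= 0, sum w = 1} min_i (A w)_i   (variables: w and the margin t)
res_w = linprog(c=np.r_[np.zeros(d), -1.0],
                A_ub=np.hstack([-A, np.ones((n, 1))]), b_ub=np.zeros(n),
                A_eq=np.r_[np.ones(d), 0.0].reshape(1, -1), b_eq=[1.0],
                bounds=[(0, None)] * d + [(None, None)])

# gamma = min_{p >= 0, sum p = 1} max_j (A^T p)_j   (variables: p and the edge s)
res_p = linprog(c=np.r_[np.zeros(n), 1.0],
                A_ub=np.hstack([A.T, -np.ones((d, 1))]), b_ub=np.zeros(d),
                A_eq=np.r_[np.ones(n), 0.0].reshape(1, -1), b_eq=[1.0],
                bounds=[(0, None)] * n + [(None, None)])

print(-res_w.fun, res_p.fun)   # equal up to solver tolerance: gamma_1 = gamma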
Weak Learning and the $\ell_1$ Margin

Conclusion: $h$ is $\gamma$-weakly learnable using predictors from the base class $B$ (i.e. for any distribution, can get error $\le \frac{1}{2} - \frac{\gamma}{2}$ using a predictor from $B$)
if and only if
$h$ is realizable with $\ell_1$ margin $\gamma$ using $\varphi(x)[f] = f(x)$ (i.e. there exists $w$ with $\|w\|_1 \le \frac{1}{\gamma}$ and $L^{\mathrm{mrg}}(x \mapsto \langle w, \varphi(x)\rangle) = 0$ w.r.t. the labels $h(x)$).

AdaBoost can be viewed as an algorithm for maximizing the $\ell_1$ margin: if some $w^*$ has $\frac{\mathrm{margin}_S(w^*)}{\|w^*\|_1} \ge \gamma$, AdaBoost finds $w$ with $\frac{\mathrm{margin}_S(w)}{\|w\|_1} \ge \frac{\gamma}{2}$ in $O\!\left(\frac{\log m}{\gamma^4}\right)$ steps, and eventually converges to the maximal $\ell_1$ margin solution.
Loss, Regularizer and Efficient Representation

SVM: $\ell_2$ regularization, giving dimension-independent generalization; hinge loss; represent an infinite-dimensional feature space via kernels.

Boosting: $\ell_1$ regularization, giving sample complexity that depends on $\log(d)$ or on $\mathrm{VCdim}(\text{features})$; exp-loss / hard margin; represent an infinite-dimensional feature space via a weak-learning oracle, i.e. an oracle for finding a feature with large derivative.
Hypothesis class $H = \{h: X \to Y\}$, loss function $\mathrm{loss}(\hat{y}, y)$, loss class $F = \{f_h(z) = l(h, z) : h \in H\}$:
- $\mathrm{VCdim}(H) \Rightarrow \mathrm{VCdim}(F)$ (for monotone or unimodal loss)
- $\dim_\alpha(H) \Rightarrow \dim_\alpha(F)$, $N(H, \alpha, m) \Rightarrow N(F, \alpha, m)$, $R_m(H) \Rightarrow R_m(F)$ (for $|h(x)| \le a$ and Lipschitz, bounded loss $|l(h, z)| \le a$)
- $\Rightarrow$ uniform convergence: $\forall \delta$, w.p. $\ge 1-\delta$ over $S$, $\forall h$: $|L_D(h) - L_S(h)| \le \varepsilon$
- $\Rightarrow$ w.p. $\ge 1-\delta$: $L_D(\mathrm{ERM}_H(S)) \le \inf_{h \in H} L_D(h) + \varepsilon$
Converse: ULLN

For bounded loss, the following are equivalent:
- Finite fat-shattering dimension at every scale $\alpha > 0$
- Finite covering numbers at every scale $\alpha > 0$
- Rademacher complexity $R_m \to 0$ as $m \to \infty$
- $\sup_f |\mathbb{E}[f] - \mathbb{E}_S[f]| \to 0$ as $m \to \infty$ (uniform law of large numbers)
(and they are equivalent quantitatively, up to log factors)
Hypothesis class $H = \{h: X \to Y\}$, loss function $\mathrm{loss}(\hat{y}, y)$, loss class $F = \{f_h(z) = l(h, z) : h \in H\}$:
- $\mathrm{VCdim}(H) \Rightarrow \mathrm{VCdim}(F)$ (for monotone or unimodal loss)
- $\dim_\alpha(H) \Rightarrow \dim_\alpha(F)$, $N(H, \alpha, m) \Rightarrow N(F, \alpha, m)$, $R_m(H) \Rightarrow R_m(F)$ (for $|h(x)| \le a$ and Lipschitz, bounded loss $|l(h, z)| \le a$)
- $\Rightarrow$ uniform convergence: $\forall \delta$, w.p. $\ge 1-\delta$ over $S$, $\forall h$: $|L_D(h) - L_S(h)| \le \varepsilon$
- $\Rightarrow$ w.p. $\ge 1-\delta$: $L_D(\mathrm{ERM}_H(S)) \le \inf_{h \in H} L_D(h) + \varepsilon$
Fundamental Theorem of (Real Valued) Learning

Finite fat-shattering dimension $\dim_\alpha(H)$ $\Rightarrow$ learnable, with sample complexity $\propto \dim_\alpha(H)$.

Can't expect a converse for an arbitrary loss function:
- E.g. the trivial loss $\mathrm{loss}(\hat{y}; y) = 0$
- Or a partially trivial setting: ramp loss with $\dim_\alpha(H) = \infty$ but $\forall h \in H, \forall x \in X:\ h(x) > 5$ (the ramp loss is constant on such predictions, so the problem is trivially learnable despite infinite dimension)

Focus on $\mathrm{loss}(\hat{y}, y) = |\hat{y} - y|$.

Theorem: With $\mathrm{loss}(\hat{y}, y) = |\hat{y} - y|$, for any $H \subseteq \mathbb{R}^X$, any learning rule $A$ and any $\alpha > 0$, there exist $D$ and $h \in H$ with $L_D(h) = 0$, but such that with $m < \frac{1}{4}\dim_\alpha(H)$ samples, $\mathbb{E}[L_D(A(S))] > \frac{\alpha}{4}$.
I.e. the sample complexity of getting error $\frac{\alpha}{4}$ is at least $\frac{1}{4}\dim_\alpha(H)$.

Conclusion: The fat-shattering dimension tightly characterizes learnability.
If learnable, then learnable using ERM with near-optimal sample complexity.
Hypothesis class $H = \{h: X \to Y\}$, loss function $\mathrm{loss}(\hat{y}, y)$, loss class $F = \{f_h(z) = l(h, z) : h \in H\}$:
- $\mathrm{VCdim}(H) \Rightarrow \mathrm{VCdim}(F)$ (for monotone or unimodal loss)
- $\dim_\alpha(H) \Rightarrow \dim_\alpha(F)$, $N(H, \alpha, m) \Rightarrow N(F, \alpha, m)$, $R_m(H) \Rightarrow R_m(F)$ (for $|h(x)| \le a$ and Lipschitz, bounded loss $|l(h, z)| \le a$)
- $\Rightarrow$ uniform convergence: $\forall \delta$, w.p. $\ge 1-\delta$ over $S$, $\forall h$: $|L_D(h) - L_S(h)| \le \varepsilon$
- $\Rightarrow$ w.p. $\ge 1-\delta$: $L_D(\mathrm{ERM}_H(S)) \le \inf_{h \in H} L_D(h) + \varepsilon$
General Learning Setting

$\min_{h \in H} L(h) = \mathbb{E}_{z \sim D}[l(h, z)]$

Is learnability equivalent to finite fat-shattering dimension?

Consider $Z = \mathbb{R}$, $l(h, z) = |h(z)|$, and $H = \{h \equiv 0\} \cup \{h: \mathbb{R} \to \mathbb{R} \mid 1 \le h \le 2\}$.
$\dim_\alpha(H) = \infty$ for $\alpha < \frac{1}{2}$.
But: $\mathrm{ERM}(S) \equiv 0$ (the zero function is always the empirical minimizer), and it learns with excess error 0!

If learnable, can we always learn with ERM?
A Different Approach: Stability

Definition: A learning rule $A: S \mapsto h$ is (uniformly, replace-one) stable with rate $\beta(m)$ if, for all $z_1, \ldots, z_m$ and $z_i'$:
$|l(A(z_1, \ldots, z_m), z_i') - l(A(z_1, \ldots, z_i', \ldots, z_m), z_i')| \le \beta(m)$

Theorem: If $A$ is stable with rate $\beta(m)$ then for every $D$:
$\mathbb{E}_{S \sim D^m}[L_D(A(S))] \le \mathbb{E}_{S \sim D^m}[L_S(A(S))] + \beta(m)$

Proof: Let $S^{(i)}$ denote $S$ with $z_i$ replaced by an independent $z_i' \sim D$. Since $z_i$ and $z_i'$ are exchangeable,
$\mathbb{E}_{S \sim D^m}[L_D(A(S))] = \frac{1}{m}\sum_{i=1}^m \mathbb{E}[l(A(S), z_i')] = \frac{1}{m}\sum_{i=1}^m \mathbb{E}[l(A(z_1, \ldots, z_i', \ldots, z_m), z_i)]$
$\le \frac{1}{m}\sum_{i=1}^m \mathbb{E}[l(A(z_1, \ldots, z_m), z_i)] + \beta(m) = \mathbb{E}\!\left[\frac{1}{m}\sum_{i=1}^m l(A(S), z_i)\right] + \beta(m) = \mathbb{E}[L_S(A(S))] + \beta(m)$
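As a quick sanity check of the definition (my example, not on the slide): the empirical-mean rule $A(S) = \frac{1}{m}\sum_i z_i$ with squared loss $l(h, z) = (h - z)^2$ on $z \in [0, 1]$ is stable with rate $O(1/m)$, since replacing one example moves the output by at most
$|A(S) - A(S^{(i)})| = \frac{|z_i - z_i'|}{m} \le \frac{1}{m}$,
and $h \mapsto (h - z)^2$ is $2$-Lipschitz on $[0, 1]$, so
$|l(A(S), z_i') - l(A(S^{(i)}), z_i')| \le \frac{2}{m}$, i.e. $\beta(m) = \frac{2}{m}$.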
Stability of Linear Predictors?

Supervised learning: $z = (x, y)$, $l(h, (x, y)) = \mathrm{loss}(h(x), y)$
$X = \{x \in \mathbb{R}^2 : \|x\|_2 \le 1\}$, $Y = [-1, 1]$, $\mathrm{loss}(\hat{y}, y) = |\hat{y} - y|$
$H = \{x \mapsto \langle w, x \rangle : \|w\|_2 \le \sqrt{2}\}$

Is $A(S) = \mathrm{ERM}_H(S)$ stable? For any $m$, consider:
$x_1 = x_2 = \cdots = x_{m-1} = (1, 0)$, $y_1 = y_2 = \cdots = y_{m-1} = 1$
$x_m = (0, 1)$, $y_m = 1$, which is replaced with $x_m' = (0, 1)$, $y_m' = -1$

$A(S) = (1, 1)$ with $l(A(S), z_m') = 2$, but $A(S^{(m)}) = (1, -1)$ with $l(A(S^{(m)}), z_m') = 0$.

$\mathrm{ERM}_H$ does not have stability better than 2 (worst possible), even as $m \to \infty$.
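A small numeric version of this construction (my sketch; the constrained ERM is written in closed form, since $w = (1, y_m)$ attains zero training loss and satisfies the norm constraint):

import numpy as np

abs_loss = lambda w, x, y: abs(np.dot(w, x) - y)

x_last = np.array([0.0, 1.0])
w_S  = np.array([1.0,  1.0])   # ERM on S:      last example ((0,1), +1)
w_Si = np.array([1.0, -1.0])   # ERM on S^(m):  last example replaced by ((0,1), -1)

# Loss of the two learned predictors on the replaced example z_m' = ((0,1), -1):
print(abs_loss(w_S, x_last, -1.0), abs_loss(w_Si, x_last, -1.0))   # 2.0 vs 0.0

# The replace-one change in loss is 2 no matter how many (1,0) examples there are,
# so ERM over this class is not stable even as m grows.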
Stability and Regularization

Consider instead $\mathrm{RERM}_\lambda(S) = \arg\min_w L_S(w) + \lambda \|w\|_2^2$
over $X = \{x \in \mathbb{R}^d : \|x\|_2 \le R\}$ with $\mathrm{loss}(\hat{y}, y) = |\hat{y} - y|$.

Claim: $\mathrm{RERM}_\lambda$ is $\beta(m) = \frac{2R^2}{\lambda m}$ stable.

How can we use this to learn $H_B = \{w : \|w\|_2 \le B\}$? For any $w^*$ with $\|w^*\|_2 \le B$:
$\mathbb{E}[L_D(\mathrm{RERM}_\lambda(S))] \le \mathbb{E}[L_S(\mathrm{RERM}_\lambda(S))] + \beta(m)$
$\le \mathbb{E}[L_S(\mathrm{RERM}_\lambda(S)) + \lambda \|\mathrm{RERM}_\lambda(S)\|_2^2] + \beta(m)$
$\le \mathbb{E}[L_S(w^*) + \lambda \|w^*\|_2^2] + \beta(m) = L_D(w^*) + \lambda \|w^*\|_2^2 + \frac{2R^2}{\lambda m}$
so
$\mathbb{E}[L_D(\mathrm{RERM}_\lambda(S))] \le \inf_{\|w\|_2 \le B} L_D(w) + \lambda B^2 + \frac{2R^2}{\lambda m} = \inf_{\|w\|_2 \le B} L_D(w) + \sqrt{\frac{8 B^2 R^2}{m}}$ at $\lambda = \sqrt{\frac{2R^2}{B^2 m}}$
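A runnable sketch of this regularized ERM (my own, using scipy's generic derivative-free optimizer as the solver; the lecture does not prescribe an algorithm): absolute loss over $\|x\|_2 \le R$, with $\lambda = \sqrt{2R^2/(B^2 m)}$ chosen as above to compete with $\{w : \|w\|_2 \le B\}$:

import numpy as np
from scipy.optimize import minimize

def rerm(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    objective = lambda w: np.mean(np.abs(X @ w - y)) + lam * np.dot(w, w)
    return minimize(objective, x0=np.zeros(X.shape[1]), method="Powell").x

rng = np.random.default_rng(1)
m, d, B, R = 200, 5, 1.0, 1.0
X = rng.normal(size=(m, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # enforce ||x||_2 <= R = 1
w_star = np.ones(d) / np.sqrt(d)
y = np.clip(X @ w_star + 0.01 * rng.normal(size=m), -1, 1)       # labels in Y = [-1, 1]

lam = np.sqrt(2 * R**2 / (B**2 * m))
w_hat = rerm(X, y, lam)
print(np.mean(np.abs(X @ w_hat - y)), np.linalg.norm(w_hat))     # training loss and norm of the stable RERM rule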
Two Views of Regularization

Uniform convergence: limiting to $\|w\| \le B$ ensures uniform convergence of $L_S(w)$ to $L_D(w)$.
Motivates $\mathrm{ERM}_B(S) = \arg\min_{\|w\| \le B} L_S(w)$; an SRM variant, balancing complexity and approximation, is $\mathrm{RERM}_\lambda(S)$.

Stability: adding a regularizer ensures stability, and thus generalization.
Motivates $\mathrm{RERM}_\lambda(S) = \arg\min_w L_S(w) + \lambda \|w\|^2$; to learn $\|w\| \le B$, use $\lambda \approx \frac{1}{B\sqrt{m}}$.
We still need to prove stability!
We will consider the broader class of generalized learning problems with a Lipschitz objective.
Convex-Bounded-Lipschitz Problems

For a generalized learning problem $\min_{w \in \mathbb{R}^d} \mathbb{E}_{z \sim D}[l(w, z)]$ with domain $z \in Z$ and a hypothesis class $H \subseteq \mathbb{R}^d$, we say:
- The problem is convex if for every $z$, $l(w, z)$ is convex in $w$
- The problem is $G$-Lipschitz if for every $z$, $l(w, z)$ is $G$-Lipschitz in $w$:
  $\forall z \in Z\ \forall w, w' \in H:\ |l(w, z) - l(w', z)| \le G \|w - w'\|_2$
  Or $G$-Lipschitz with respect to a norm $\|w\|$:
  $\forall z \in Z\ \forall w, w' \in H:\ |l(w, z) - l(w', z)| \le G \|w - w'\|$
- The problem is $B$-bounded w.r.t. a norm $\|w\|$ if $\forall w \in H:\ \|w\| \le B$

For simplicity we write $w \in \mathbb{R}^d$. Actually, we can consider $w \in W$ for some Banach space (normed vector space) $W$ with norm $\|w\|$.
Linear Prediction as a Generalized Lipschitz Problem

$z = (x, y) \in X \times Y$, $\varphi: X \to \mathbb{R}^d$, $\mathrm{loss}: \mathbb{R} \times Y \to \mathbb{R}$
$l(w, (x, y)) = \mathrm{loss}(\langle w, \varphi(x) \rangle; y)$

If $\mathrm{loss}(\hat{y}; y)$ is convex in $\hat{y}$, the problem is convex.

If $\mathrm{loss}(\hat{y}; y)$ is $g$-Lipschitz in $\hat{y}$ (as a scalar function):
$|l(w, (x, y)) - l(w', (x, y))| = |\mathrm{loss}(\langle w, \varphi(x)\rangle; y) - \mathrm{loss}(\langle w', \varphi(x)\rangle; y)| \le g\,|\langle w, \varphi(x)\rangle - \langle w', \varphi(x)\rangle| = g\,|\langle w - w', \varphi(x)\rangle| \le g \|\varphi(x)\|_2 \|w - w'\|_2$
If $\|\varphi\|_2 \le R$, then the problem is $G = gR$ Lipschitz (w.r.t. $\|\cdot\|_2$).

For any norm $\|w\|$: $|l(w, (x, y)) - l(w', (x, y))| \le g \|\varphi(x)\|_* \|w - w'\|$.
If $\|\varphi\|_* \le R$ for the dual norm, then the problem is $G = gR$ Lipschitz.
Stability for Convex Lipschitz Problems

For a convex $G$-Lipschitz (w.r.t. $\|w\|_2$) generalized learning problem, consider
$\mathrm{RERM}_\lambda(S) = \arg\min_w L_S(w) + \lambda \|w\|_2^2$

Claim: $\mathrm{RERM}_\lambda$ is $\beta(m) = \frac{2G^2}{\lambda m}$ stable.
Proof: homework.

Conclusion: using $\lambda = \sqrt{\frac{2G^2}{B^2 m}}$, we can learn any convex, $G$-Lipschitz, $B$-bounded (w.r.t. $\|w\|_2$) generalized learning problem with sample complexity $O\!\left(\frac{B^2 G^2}{\varepsilon^2}\right)$.
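Filling in the arithmetic behind this choice of $\lambda$ (the same chain as in the $\ell_2$-bounded linear case above, with $R$ replaced by $G$):
$\mathbb{E}[L_D(\mathrm{RERM}_\lambda(S))] \le \inf_{\|w\|_2 \le B} L_D(w) + \lambda B^2 + \frac{2G^2}{\lambda m}$,
and minimizing $\lambda B^2 + \frac{2G^2}{\lambda m}$ over $\lambda$ gives $\lambda = \sqrt{\frac{2G^2}{B^2 m}}$, at which the excess term equals $\sqrt{\frac{8 B^2 G^2}{m}}$; requiring this to be at most $\varepsilon$ gives $m \ge \frac{8 B^2 G^2}{\varepsilon^2} = O\!\left(\frac{B^2 G^2}{\varepsilon^2}\right)$.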
Back to the Converse of the Fundamental Theorem of Learning

For (bounded) supervised learning problems (with absolute loss):
- Learnable if and only if the fat-shattering dimension is finite at every scale
- The fat-shattering dimension exactly characterizes the sample complexity
- If learnable, we always have a ULLN, and are always learnable with ERM, with optimal sample complexity

For generalized learning problems:
- Finite fat-shattering dimension $\Rightarrow$ learnable with ERM
- No strict converse, because of "silly" problems with complex irrelevant parts
- Converse for non-trivial problems? If learnable, can we always learn with ERM?
Center of Mass with Missing Data

Center of mass (mean estimation) problem: $Z = \{z : \|z\|_2 \le 1\}$, $H = \{w : \|w\|_2 \le 1\}$ (infinitely many coordinates), $l(h, z) = \|h - z\|_2^2 = \sum_i (h[i] - z[i])^2$

Center of mass with missing data: $Z = \{(I, z_I) : I \subseteq \text{coordinates},\ \|z\|_2 \le 1\}$, $l(h, (I, z_I)) = \sum_{i \in I} (h[i] - z[i])^2$

This is a 4-Lipschitz and 1-bounded convex problem w.r.t. $\|w\|_2$, hence learnable with $\mathrm{RERM}_\lambda$.

But: consider the distribution $(I, z_I) \sim D$ with $\Pr[i \in I] = \frac{1}{2}$ independently for all $i$, and $z_i = 0$ almost surely. Then $L_D(0) = 0$.
For any finite training set, there is (with probability one) some never-observed coordinate $j$. Consider the standard basis vector $e_j$:
$L_S(e_j) = 0$, hence it is an ERM, but $L_D(e_j) = \frac{1}{2}$.
No ULLN, and not learnable with ERM.
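A finite-dimensional simulation of this construction (my sketch; the slide's problem has infinitely many coordinates, so here $d \gg 2^m$ plays that role):

import numpy as np

rng = np.random.default_rng(0)
m, d = 10, 100_000
masks = rng.random((m, d)) < 0.5       # I for each example: each coordinate observed w.p. 1/2
# z = 0 almost surely, so l(h, (I, z_I)) = sum_{i in I} h[i]^2

unobserved = np.flatnonzero(~masks.any(axis=0))
j = unobserved[0]                      # a coordinate never observed in the training set
e_j = np.zeros(d); e_j[j] = 1.0

L_S = np.mean([(e_j[mask] ** 2).sum() for mask in masks])   # empirical loss of e_j
L_D = 0.5 * (e_j ** 2).sum()                                # population loss: Pr[j in I] = 1/2
print(L_S, L_D)   # 0.0 vs 0.5: e_j is an ERM but fails on the distribution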