Computational and Statistical Learning Theory
1 Computational and Statistical Learning Theory
TTIC 31120, Prof. Nati Srebro
Lecture 12: Weak Learnability and the $\ell_1$ Margin; Converse to Scale-Sensitive Learning; Stability; Convex-Lipschitz-Bounded Problems
2 Prediction Margin
For a predictor $h: \mathcal{X} \to \mathbb{R}$ and binary labels $y = \pm 1$:
- Margin on a single example: $y\,h(x)$
- Margin on a training set: $\mathrm{margin}(h) = \min_{(x_i, y_i) \in S} y_i h(x_i)$
Most classification loss functions are a function of the margin:
- $\mathrm{loss}_{\mathrm{mrg}}(h(x); y) = \mathbf{1}[\mathrm{margin} < 1]$
- $\mathrm{loss}_{\mathrm{hinge}}(h(x); y) = [1 - \mathrm{margin}]_+$
- $\mathrm{loss}_{\mathrm{logistic}}(h(x); y) = \log\left(1 + e^{-\mathrm{margin}}\right)$
- $\mathrm{loss}_{\exp}(h(x); y) = e^{-\mathrm{margin}}$
- $\mathrm{loss}_{\mathrm{sq}}(h(x); y) = (y - h(x))^2 = (1 - \mathrm{margin})^2$
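As a concrete companion to these definitions, here is a small illustrative sketch (Python, not part of the lecture slides) of the margin and the margin-based losses:

```python
# Illustrative sketch (not from the lecture): margin-based losses as functions
# of a real-valued prediction h(x) and a label y in {-1, +1}.
import numpy as np

def margin(h_x, y):
    return y * h_x

def loss_mrg(h_x, y):       # margin loss: 1[y h(x) < 1]
    return float(margin(h_x, y) < 1)

def loss_hinge(h_x, y):     # [1 - y h(x)]_+
    return max(0.0, 1.0 - margin(h_x, y))

def loss_logistic(h_x, y):  # log(1 + exp(-y h(x)))
    return np.log1p(np.exp(-margin(h_x, y)))

def loss_exp(h_x, y):       # exp(-y h(x))
    return np.exp(-margin(h_x, y))

def loss_sq(h_x, y):        # (y - h(x))^2 = (1 - y h(x))^2 since y^2 = 1
    return (y - h_x) ** 2

def margin_on_set(h, S):    # margin of a predictor over a training set
    return min(y * h(x) for x, y in S)
```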
3 Complexity and Margin
Recall: with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, for all $h \in \mathcal{H}$,
$$L^{01}_{\mathcal{D}}(h) \le L^{\mathrm{mrg}}_S(h) + \mathcal{R}_S(\mathcal{H}) + \sqrt{\tfrac{\log(1/\delta)}{m}}.$$
If $\mathrm{margin}_S(h) = \gamma$, then $L^{\mathrm{mrg}}_S(h/\gamma) = 0$, so
$$L^{01}_{\mathcal{D}}(h) \le L^{\mathrm{mrg}}_S(h/\gamma) + \mathcal{R}_S\left(\tfrac{1}{\gamma}\mathcal{H}\right) + \sqrt{\tfrac{\log(1/\delta)}{m}} \le \tfrac{1}{\gamma}\mathcal{R}_m(\mathcal{H}) + \sqrt{\tfrac{\log(1/\delta)}{m}}.$$
For $\mathcal{H} = \{ h_w(x) = \langle w, \phi(x) \rangle : \|w\|_2 \le B \}$ this gives
$$L^{01}_{\mathcal{D}}(h) \le \sqrt{\tfrac{B^2 \sup_x \|\phi(x)\|_2^2}{\gamma^2\, m}} + \sqrt{\tfrac{\log(1/\delta)}{m}}.$$
- $\ell_2$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_2} = \sup_{\|w\|_2 \le 1} \mathrm{margin}(w)$. Even better to consider the relative $\ell_2$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_2\, \sup_x \|\phi(x)\|_2}$.
- $\ell_1$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_1}$. Relative $\ell_1$-margin: $\sup_w \frac{\mathrm{margin}(w)}{\|w\|_1\, \sup_x \|\phi(x)\|_\infty}$.
4 Boosting and the $\ell_1$ Margin
- Weak learning: can always find $f$ with $L^{01}(f) \le \frac{1}{2} - \frac{\gamma}{2}$, where $B = \{\text{weak predictors } f\}$ and $\phi(x)[f] = f(x) \in \pm 1$, i.e. $\|\phi(x)\|_\infty = 1$.
- After $T = \frac{48 \log(2m)}{\gamma^4}$ iterations, AdaBoost finds a predictor with $\frac{\mathrm{margin}(w)}{\|w\|_1} \ge \frac{\gamma}{2}$ (and as $T \to \infty$, $\lim \frac{\mathrm{margin}(w)}{\|w\|_1} \ge \gamma$).
- Need $m = O\!\left(\frac{\mathrm{VCdim}(B)}{\gamma^2 \varepsilon^2}\right)$ samples to ensure $L^{01}(w) \le \varepsilon$.
- Can we understand AdaBoost purely in terms of the $\ell_1$-margin?
- Can we get a guarantee for AdaBoost that relies on the existence of a large $\ell_1$-margin predictor, instead of on weak learnability?
- The AdaBoost analysis shows weak learning $\Rightarrow$ $\ell_1$-margin. Converse?
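The following is a minimal AdaBoost sketch, assuming a finite base class given explicitly as a matrix of predictions; the weak-learning oracle is simulated by brute-force search over that class (an illustration-only assumption, not the lecture's abstract-oracle setting):

```python
# Illustrative AdaBoost sketch.  Assumptions: a finite base class given as a
# matrix F with F[j, i] = f_j(x_i) in {-1, +1}; the weak learner is brute-force
# search over this class.  Not the lecture's exact pseudocode.
import numpy as np

def adaboost(F, y, T):
    """F: (d, n) base-predictor outputs on n training points,
       y: (n,) labels in {-1, +1}, T: number of rounds.
       Returns the (unnormalized) weight vector w over the d base predictors."""
    d, n = F.shape
    w = np.zeros(d)
    D = np.ones(n) / n                       # distribution over examples
    for _ in range(T):
        edges = F @ (D * y)                  # edge_j = sum_i D_i y_i f_j(x_i)
        j = int(np.argmax(edges))            # weak learner: largest edge
        eps = 0.5 * (1.0 - edges[j])         # weighted error of f_j
        eps = min(max(eps, 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        w[j] += alpha
        D *= np.exp(-alpha * y * F[j])       # up-weight mistakes of f_j
        D /= D.sum()
    return w

def l1_margin(F, y, w):
    """min_i y_i <w, phi(x_i)> / ||w||_1, the normalized l1 margin."""
    return np.min(y * (w @ F)) / np.sum(np.abs(w))
```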
5 Weak Learning and the $\ell_1$ Margin
Consider a base class $B = \{ f : \mathcal{X} \to \pm 1 \}$ and the corresponding feature map $\phi: \mathcal{X} \to \mathbb{R}^B$ defined as $\phi(x)[f] = f(x)$.
Goal: relate weak learnability using predictors in $B$ to the $\ell_1$-margin using $\phi(x)$.
- Weak learnability: $h: \mathcal{X} \to \pm 1$ is $\gamma$-weakly learnable using $B$ if for any distribution $\mathcal{D}(\mathcal{X})$ there exists $f \in B$ such that $\Pr_{x \sim \mathcal{D}}[f(x) = h(x)] \ge \frac{1}{2} + \frac{\gamma}{2}$.
- Assume that $B$ is symmetric, i.e. for any $f \in B$, also $-f \in B$.
  - This allows us to consider only $w \ge 0$, and so $\|w\|_1 = \sum_f w[f]$.
  - If $w[f] < 0$, instead use $w[-f] = -w[f] > 0$.
  - (Without assuming $B$ is symmetric, we would need to talk about the margin attainable only with $w \ge 0$.)
6 Weak Learning and the $\ell_1$ Margin
Best possible $\ell_1$ margin for a labeling $h$:
$$\gamma_1 = \sup_{\|w\|_1 \le 1} \min_x \; h(x)\,\langle \phi(x), w \rangle$$
For a finite domain $\mathcal{X} = \{x_1, \dots, x_n\}$ and a finite base class $B$ (i.e. $\phi(x) \in \mathbb{R}^d$ is finite dimensional), consider the matrix $A \in \{\pm 1\}^{n \times d}$ with rows $A_i = h(x_i)\,\phi(x_i)$.
Can write the $\ell_1$ margin as
$$\gamma_1 = \max_{\|w\|_1 \le 1} \min_i \; A_i w = \max_{\|w\|_1 \le 1} \; \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} p^\top A w,$$
and since $B$ is symmetric,
$$\gamma_1 = \max_{w \in \mathbb{R}^d_+,\ \|w\|_1 = 1} \; \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} p^\top A w.$$
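For a small finite domain and base class, this max-min is a linear program; here is an illustrative sketch using scipy (an added dependency, not mentioned in the lecture):

```python
# Sketch: the best l1 margin  gamma_1 = max_{w >= 0, ||w||_1 = 1} min_i (A w)_i
# as a linear program, for a small finite domain and base class (assumes scipy).
import numpy as np
from scipy.optimize import linprog

def best_l1_margin(A):
    """A: (n, d) matrix with rows A_i = h(x_i) * phi(x_i), entries in {-1, +1}.
       Variables: w in R^d_+ and gamma in R.  Maximize gamma subject to
       A w >= gamma * 1 and sum(w) = 1."""
    n, d = A.shape
    c = np.zeros(d + 1); c[-1] = -1.0                    # minimize -gamma
    A_ub = np.hstack([-A, np.ones((n, 1))])              # gamma - (A w)_i <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, d)), np.zeros((1, 1))])  # sum_j w_j = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * d + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]                         # (gamma_1, maximizing w)
```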
7-8 Weak Learning and the $\ell_1$ Margin
Best possible weak-learnability edge for $h: \mathcal{X} \to \pm 1$:
$$\gamma = \min_{\mathcal{D}} \max_{f \in B} \Big( 2 \Pr_{x \sim \mathcal{D}}[h(x) = f(x)] - 1 \Big) = \min_{\mathcal{D}} \max_{f \in B} \sum_x \mathcal{D}(x)\, h(x)\, f(x)$$
For a finite domain, and in terms of the matrix with rows $A_i = h(x_i)\,\phi(x_i)$ and columns $A^j$:
$$\gamma = \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} \; \max_{j \in 1..d} \sum_i p_i\, h(x_i)\,\phi(x_i)[j] = \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} \; \max_{j \in 1..d} (p^\top A)_j$$
and
$$\gamma = \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} \; \max_{w \in \mathbb{R}^d_+,\ \|w\|_1 = 1} p^\top A w = \max_{w \in \mathbb{R}^d_+,\ \|w\|_1 = 1} \; \min_{p \in \mathbb{R}^n_+,\ \|p\|_1 = 1} p^\top A w = \gamma_1,$$
where the exchange of $\min$ and $\max$ is justified by strong duality (the minimax theorem for bilinear games over simplices).
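The edge $\gamma$ is the dual linear program; a sketch follows, and together with the `best_l1_margin` sketch above one can check numerically that the two optimal values coincide, as strong duality promises (again assuming scipy):

```python
# Sketch: the weak-learnability edge  gamma = min_{p in simplex} max_j (p^T A)_j
# as the dual LP; its value should match best_l1_margin(A)[0] up to solver
# tolerance (strong duality).  Assumes scipy.
import numpy as np
from scipy.optimize import linprog

def weak_learning_edge(A):
    """A: (n, d) matrix with rows A_i = h(x_i) * phi(x_i).
       Variables: p in the simplex over examples and t in R.
       Minimize t subject to (p^T A)_j <= t for all j and sum(p) = 1."""
    n, d = A.shape
    c = np.zeros(n + 1); c[-1] = 1.0                       # minimize t
    A_ub = np.hstack([A.T, -np.ones((d, 1))])              # (p^T A)_j - t <= 0
    b_ub = np.zeros(d)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])  # sum_i p_i = 1
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

# Example usage on a random sign matrix:
# A = np.sign(np.random.randn(20, 8))
# print(weak_learning_edge(A), best_l1_margin(A)[0])   # the two values agree
```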
9 Weak Learning and the $\ell_1$ Margin
Conclusion: $h$ is $\gamma$-weakly learnable using predictors from the base class $B$ (i.e. for any distribution, can get error $\le \frac{1}{2} - \frac{\gamma}{2}$ using a predictor from $B$) if and only if $h$ is realizable with $\ell_1$ margin $\gamma$ using $\phi(x)[f] = f(x)$ (i.e. there exists $w$ with $\|w\|_1 \le \frac{1}{\gamma}$ and $L^{\mathrm{mrg}}\big(x \mapsto \langle w, \phi(x) \rangle\big) = 0$).
AdaBoost can be viewed as an algorithm for maximizing the $\ell_1$ margin: if some $w^*$ has $\frac{\mathrm{margin}_S(w^*)}{\|w^*\|_1} \ge \gamma$, AdaBoost finds $w$ with $\frac{\mathrm{margin}_S(w)}{\|w\|_1} \ge \frac{\gamma}{2}$ in $O\!\left(\frac{\log m}{\gamma^4}\right)$ steps, and eventually converges to the maximal $\ell_1$ margin solution.
10 Loss, Regularizer and Efficient Representation
SVM:
- $\ell_2$ regularization: dimension-independent generalization
- Hinge loss
- Represent an infinite-dimensional feature space via kernels
Boosting:
- $\ell_1$ regularization: sample complexity depends on $\log(d)$ or $\mathrm{VCdim}(\text{features})$
- Exp-loss / hard margin
- Represent an infinite-dimensional feature space via a weak-learning oracle, i.e. an oracle for finding a high-derivative feature
11 Hypothesis Class, Loss Class, and Learnability (roadmap diagram)
Hypothesis class $\mathcal{H} = \{ h: \mathcal{X} \to \mathcal{Y} \}$, loss function $\mathrm{loss}(\hat y, y)$, loss class $\mathcal{F} = \{ f_h(z) = \ell(h, z) : h \in \mathcal{H} \}$.
- $\mathrm{VCdim}(\mathcal{H}) \;\longrightarrow\; \mathrm{VCdim}(\mathcal{F})$ (for monotone or unimodal loss)
- $\dim_\alpha(\mathcal{H}) \;\longrightarrow\; \dim_\alpha(\mathcal{F})$ (for Lipschitz loss)
- $N(\mathcal{H}, \alpha, m),\ \mathcal{R}_m(\mathcal{H}) \;\longrightarrow\; N(\mathcal{F}, \alpha, m),\ \mathcal{R}_m(\mathcal{F})$ (using $|h(x)| \le a$ and $|\ell(h, z)| \le |\mathrm{loss}| \le a$)
- $\Longrightarrow$ uniform convergence: $\forall^\delta_S\ \forall_h\ |L_{\mathcal{D}}(h) - L_S(h)| \le \varepsilon$
- $\Longrightarrow$ learnability with ERM: $\forall^\delta_S\ L(\mathrm{ERM}_{\mathcal{H}}(S)) \le \inf_h L(h) + \varepsilon$
12 Converse: ULLN
For bounded loss, the following are equivalent:
- Finite fat-shattering dimension at every scale $\alpha > 0$
- Finite covering numbers at every scale $\alpha > 0$
- Rademacher complexity $\mathcal{R}_m \to 0$ as $m \to \infty$
- $\sup_f \big| \mathbb{E}[f] - \mathbb{E}_S[f] \big| \to 0$ as $m \to \infty$ (uniform law of large numbers)
(and these are equivalent quantitatively, up to log-factors)
13 (Recap: the roadmap diagram from slide 11, relating the hypothesis class $\mathcal{H}$, the loss class $\mathcal{F}$, their complexity measures, uniform convergence, and learnability with ERM.)
14 Fundamental Theorem of (Real-Valued) Learning
- Finite fat-shattering dimension $\dim_\alpha(\mathcal{H})$ at every scale $\Rightarrow$ learnable, with sample complexity controlled by $\dim_\alpha(\mathcal{H})$.
- Can't expect a converse for an arbitrary loss function:
  - e.g. the trivial loss $\mathrm{loss}(\hat y; y) = 0$,
  - or a partially trivial case: ramp loss with $\dim_\alpha(\mathcal{H}) = \infty$ but $\forall_{h \in \mathcal{H}, x \in \mathcal{X}}\ h(x) > 5$ (so every prediction lands on the flat part of the loss).
- Focus on $\mathrm{loss}(\hat y, y) = |\hat y - y|$.
Theorem: With $\mathrm{loss}(\hat y, y) = |\hat y - y|$, for any $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{X}}$, any learning rule $A$, and any $\alpha > 0$, there exist $\mathcal{D}$ and $h \in \mathcal{H}$ with $L_{\mathcal{D}}(h) = 0$, but with $m < \frac{1}{4} \dim_\alpha(\mathcal{H})$ samples, $\mathbb{E}\big[L(A(S))\big] > \frac{\alpha}{4}$; i.e. the sample complexity to get error $\le \frac{\alpha}{4}$ is at least $\frac{1}{4}\dim_\alpha(\mathcal{H})$.
Conclusion: the fat-shattering dimension tightly characterizes learnability. If learnable, then learnable using ERM with near-optimal sample complexity.
15 (Recap: the roadmap diagram from slide 11 again, now with the converse direction in mind.)
16 General Learning Setting
$$\min_{h \in \mathcal{H}} L(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)]$$
- Is learnability equivalent to finite fat-shattering dimension?
- Consider $\mathcal{Z} = \mathbb{R}$, $\ell(h, z) = |h(z)|$, and $\mathcal{H} = \{ h \equiv 0 \} \cup \{ h: \mathbb{R} \to \mathbb{R} \mid 1 \le h \le 2 \}$.
- $\dim_\alpha(\mathcal{H}) = \infty$ for $\alpha < \frac{1}{2}$.
- But the rule $A(S) = (z \mapsto 0)$, which is always an ERM, learns with excess error 0!
- If learnable, can we always learn with ERM?
17 A Different Approach: Stability
Definition: A learning rule $A: S \mapsto h$ is (uniformly, replace-one) stable with rate $\beta(m)$ if, for all $z_1, \dots, z_m$ and $z_i'$:
$$\big| \ell\big(A(z_1, \dots, z_i, \dots, z_m),\, z_i\big) - \ell\big(A(z_1, \dots, z_i', \dots, z_m),\, z_i\big) \big| \le \beta(m)$$
Theorem: If $A$ is stable with rate $\beta(m)$, then for every $\mathcal{D}$:
$$\mathbb{E}_{S \sim \mathcal{D}^m}\big[L_{\mathcal{D}}(A(S))\big] \le \mathbb{E}_{S \sim \mathcal{D}^m}\big[L_S(A(S))\big] + \beta(m)$$
Proof: draw an extra independent example $z_i' \sim \mathcal{D}$ for each $i$; by symmetry (swapping the names of $z_i$ and $z_i'$),
$$\mathbb{E}_S\big[L_{\mathcal{D}}(A(S))\big] = \frac{1}{m} \sum_{i=1}^m \mathbb{E}\big[\ell\big(A(z_1, \dots, z_i, \dots, z_m),\, z_i'\big)\big] = \frac{1}{m} \sum_{i=1}^m \mathbb{E}\big[\ell\big(A(z_1, \dots, z_i', \dots, z_m),\, z_i\big)\big]$$
$$\le \frac{1}{m} \sum_{i=1}^m \mathbb{E}\big[\ell\big(A(z_1, \dots, z_m),\, z_i\big)\big] + \beta(m) = \mathbb{E}\Big[\frac{1}{m} \sum_{i=1}^m \ell(A(S), z_i)\Big] + \beta(m) = \mathbb{E}\big[L_S(A(S))\big] + \beta(m)$$
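As an illustration of the definition (not from the lecture), one can probe the replace-one stability of any learning rule empirically:

```python
# Sketch: empirically probing replace-one stability of a learning rule A on a
# fixed sample, by swapping each z_i for a fresh z_i' and measuring the change
# in loss at z_i.  Illustrative helper, not a proof of stability.
import numpy as np

def replace_one_gaps(A, loss, S, fresh_examples):
    """A: learning rule mapping a list of examples to a predictor h;
       loss: loss(h, z); S: list of m examples; fresh_examples: list of m
       replacement examples z_i'.  Returns the m gaps
       |loss(A(S), z_i) - loss(A(S with z_i -> z_i'), z_i)|."""
    gaps = []
    h = A(S)
    for i, z_prime in enumerate(fresh_examples):
        S_i = list(S)
        S_i[i] = z_prime
        gaps.append(abs(loss(h, S[i]) - loss(A(S_i), S[i])))
    return np.array(gaps)
```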
18 Stability of Linear Predictors?
Supervised learning: $z = (x, y)$, $\ell(h, (x, y)) = \mathrm{loss}(h(x), y)$, with $\mathcal{X} = \{ x \in \mathbb{R}^2 : \|x\|_2 \le 1 \}$, $\mathcal{Y} = [-1, 1]$, $\mathrm{loss}(\hat y, y) = |\hat y - y|$, and $\mathcal{H} = \{ x \mapsto \langle w, x \rangle : \|w\|_2 \le \sqrt{2} \}$.
Is $A(S) = \mathrm{ERM}_{\mathcal{H}}(S)$ stable?
For any $m$, consider: $x_1 = x_2 = \dots = x_{m-1} = (1, 0)$ with $y_1 = \dots = y_{m-1} = 1$, and $x_m = (0, 1)$ with $y_m = 1$, which is replaced with $x_m' = (0, 1)$, $y_m' = -1$.
Then $A(S) = (1, 1)$ and $\ell(A(S), z_m) = 0$, but $A(S^{(m)}) = (1, -1)$ and $\ell(A(S^{(m)}), z_m) = 2$.
So $\mathrm{ERM}_{\mathcal{H}}$ does not have stability better than 2 (the worst possible), even as $m \to \infty$.
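Here is a rough numerical version of this example; the ERM is computed by a crude grid search over the ball, which is enough in two dimensions (an implementation choice for illustration only):

```python
# Sketch of the instability example: ERM with absolute loss over the l2 ball of
# radius sqrt(2) in R^2, computed by a crude grid search (illustrative only).
import numpy as np

def erm_grid(X, y, radius=np.sqrt(2), steps=201):
    """Brute-force ERM: minimize (1/m) sum_i |<w, x_i> - y_i| over ||w||_2 <= radius."""
    grid = np.linspace(-radius, radius, steps)
    W = np.array([[a, b] for a in grid for b in grid if a * a + b * b <= radius ** 2])
    emp = np.mean(np.abs(W @ X.T - y), axis=1)
    return W[np.argmin(emp)]

m = 20
X = np.vstack([np.tile([1.0, 0.0], (m - 1, 1)), [[0.0, 1.0]]])
y = np.ones(m)
w_S = erm_grid(X, y)                         # approximately (1, 1)

y_swapped = y.copy(); y_swapped[-1] = -1.0   # replace z_m = ((0,1), 1) by ((0,1), -1)
w_Sm = erm_grid(X, y_swapped)                # approximately (1, -1)

z_m_loss = lambda w: abs(w @ X[-1] - y[-1])
print(z_m_loss(w_S), z_m_loss(w_Sm))         # ~0 vs ~2: a replace-one gap of ~2
```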
19 Stability and Regularization
Consider instead $\mathrm{RERM}_\lambda(S) = \arg\min_w L_S(w) + \lambda \|w\|_2^2$, over $\mathcal{X} = \{ x \in \mathbb{R}^d : \|x\|_2 \le R \}$ with $\mathrm{loss}(\hat y, y) = |\hat y - y|$.
Claim: $\mathrm{RERM}_\lambda$ is $\beta(m) = \frac{2 R^2}{\lambda m}$ stable.
How can we use this to learn $\mathcal{H}_B = \{ w : \|w\|_2 \le B \}$? For any reference $w^*$ with $\|w^*\|_2 \le B$:
$$\mathbb{E}\big[L_{\mathcal{D}}(\mathrm{RERM}_\lambda(S))\big] \le \mathbb{E}\big[L_S(\mathrm{RERM}_\lambda(S))\big] + \beta(m) \le \mathbb{E}\big[L_S(\mathrm{RERM}_\lambda(S)) + \lambda \|\mathrm{RERM}_\lambda(S)\|_2^2\big] + \beta(m)$$
$$\le \mathbb{E}\big[L_S(w^*) + \lambda \|w^*\|_2^2\big] + \beta(m) = L_{\mathcal{D}}(w^*) + \lambda \|w^*\|_2^2 + \frac{2 R^2}{\lambda m}$$
$$\le \inf_{\|w\|_2 \le B} L_{\mathcal{D}}(w) + \lambda B^2 + \frac{2 R^2}{\lambda m} = \inf_{\|w\|_2 \le B} L_{\mathcal{D}}(w) + \sqrt{\frac{8 B^2 R^2}{m}} \quad \text{with } \lambda = \sqrt{\frac{2 R^2}{B^2 m}}.$$
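Rerunning the previous two-dimensional example with the regularized objective illustrates the claim: the replace-one gap shrinks to the order of $\frac{2R^2}{\lambda m}$ (grid-search sketch, illustrative values only):

```python
# Sketch: regularized ERM on the same 2-D example, again via grid search; with
# lambda > 0 the replace-one gap is small, in line with the 2R^2/(lambda m)
# stability claim (here R = 1 since all examples have ||x||_2 = 1).
import numpy as np

def rerm_grid(X, y, lam, radius=2.0, steps=201):
    grid = np.linspace(-radius, radius, steps)
    W = np.array([[a, b] for a in grid for b in grid])
    obj = np.mean(np.abs(W @ X.T - y), axis=1) + lam * np.sum(W ** 2, axis=1)
    return W[np.argmin(obj)]

m, lam = 20, 0.3
X = np.vstack([np.tile([1.0, 0.0], (m - 1, 1)), [[0.0, 1.0]]])
y = np.ones(m)
y_swapped = y.copy(); y_swapped[-1] = -1.0

w_S, w_Sm = rerm_grid(X, y, lam), rerm_grid(X, y_swapped, lam)
gap = abs(abs(w_S @ X[-1] - y[-1]) - abs(w_Sm @ X[-1] - y[-1]))
print(w_S, w_Sm, gap)   # gap ~ 1/6, below the bound 2R^2/(lam*m) = 1/3
```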
20 Two Views of Regularization
Uniform convergence:
- Limiting to $\|w\| \le B$ ensures uniform convergence of $L_S(w)$ to $L(w)$.
- Motivates $\mathrm{ERM}_B(S) = \arg\min_{\|w\| \le B} L_S(w)$.
- Its SRM variant, balancing complexity and approximation, is $\mathrm{RERM}_\lambda(S)$.
Stability:
- Adding a regularizer ensures stability, and thus generalization.
- Motivates $\mathrm{RERM}_\lambda(S) = \arg\min_w L_S(w) + \lambda \|w\|^2$.
- To learn $\|w\| \le B$, use $\lambda \propto \frac{1}{B \sqrt{m}}$.
21 We still need to prove stability!
We will consider the broader class of generalized learning problems with a Lipschitz objective.
22 Convex-Bounded-Lipschitz Problems
For a generalized learning problem $\min_{w \in \mathbb{R}^d} \mathbb{E}_{z \sim \mathcal{D}}[\ell(w, z)]$ with domain $z \in \mathcal{Z}$ and a hypothesis class $\mathcal{H} \subseteq \mathbb{R}^d$, we say:
- The problem is convex if for every $z$, $\ell(w, z)$ is convex in $w$.
- The problem is $G$-Lipschitz if for every $z$, $\ell(w, z)$ is $G$-Lipschitz in $w$:
$$\forall_{z \in \mathcal{Z}}\ \forall_{w, w' \in \mathcal{H}}\quad |\ell(w, z) - \ell(w', z)| \le G \|w - w'\|_2,$$
or $G$-Lipschitz with respect to a norm $\|w\|$:
$$\forall_{z \in \mathcal{Z}}\ \forall_{w, w' \in \mathcal{H}}\quad |\ell(w, z) - \ell(w', z)| \le G \|w - w'\|.$$
- The problem is $B$-bounded w.r.t. the norm $\|w\|$ if $\forall_{w \in \mathcal{H}}\ \|w\| \le B$.
For simplicity we write $w \in \mathbb{R}^d$. Actually, we can consider $w \in \mathcal{W}$ for some Banach space (normed vector space) $\mathcal{W}$ with norm $\|w\|$.
23 Linear Prediction as a Generalized Lipschitz Problem
$z = (x, y) \in \mathcal{X} \times \mathcal{Y}$, $\phi: \mathcal{X} \to \mathbb{R}^d$, $\mathrm{loss}: \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$, and $\ell(w, (x, y)) = \mathrm{loss}(\langle w, \phi(x) \rangle; y)$.
- If $\mathrm{loss}(\hat y; y)$ is convex in $\hat y$, the problem is convex.
- If $\mathrm{loss}(\hat y; y)$ is $g$-Lipschitz in $\hat y$ (as a scalar function):
$$|\ell(w, (x, y)) - \ell(w', (x, y))| = |\mathrm{loss}(\langle w, \phi(x) \rangle; y) - \mathrm{loss}(\langle w', \phi(x) \rangle; y)| \le g\,|\langle w - w', \phi(x) \rangle| \le g\,\|\phi(x)\|_2\,\|w - w'\|_2.$$
- If $\|\phi\|_2 \le R$, then the problem is $G = gR$ Lipschitz (w.r.t. $\|\cdot\|_2$).
- For any norm $\|w\|$: $|\ell(w, (x, y)) - \ell(w', (x, y))| \le g\,\|\phi(x)\|_*\,\|w - w'\|$; if $\|\phi\|_* \le R$ for the dual norm, then the problem is $G = gR$ Lipschitz.
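A quick numerical sanity check of this Lipschitz bound for the hinge loss (where $g = 1$), on random instances (illustrative, not from the lecture):

```python
# Numeric check of |l(w,z) - l(w',z)| <= ||phi(x)||_2 * ||w - w'||_2 for the
# hinge loss (1-Lipschitz in its first argument), on random instances.
import numpy as np

rng = np.random.default_rng(1)

def hinge(yhat, y):
    return max(0.0, 1.0 - y * yhat)

d = 5
for _ in range(1000):
    w, w2, phi = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
    y = rng.choice([-1.0, 1.0])
    lhs = abs(hinge(w @ phi, y) - hinge(w2 @ phi, y))
    rhs = np.linalg.norm(phi) * np.linalg.norm(w - w2)
    assert lhs <= rhs + 1e-9
print("Lipschitz bound held on all random trials")
```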
24 Stability for Convex Lipschitz Problems
For a convex $G$-Lipschitz (w.r.t. $\|w\|_2$) generalized learning problem, consider $\mathrm{RERM}_\lambda(S) = \arg\min_w L_S(w) + \lambda \|w\|_2^2$.
Claim: $\mathrm{RERM}_\lambda$ is $\beta(m) = \frac{2 G^2}{\lambda m}$ stable. (Proof: homework.)
Conclusion: using $\lambda = \sqrt{\frac{2 G^2}{B^2 m}}$, we can learn any convex, $G$-Lipschitz, $B$-bounded generalized learning problem (w.r.t. $\|w\|_2$) with sample complexity $O\!\left(\frac{B^2 G^2}{\varepsilon^2}\right)$.
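Plugging assumed values of $G$, $B$, and $m$ into the choice of $\lambda$ and the resulting excess-error bound (pure arithmetic, for illustration):

```python
# Sketch: the lambda choice and excess-error bound above, evaluated for a few
# assumed, illustrative values of G, B, and m.
import numpy as np

def lam_and_bound(G, B, m):
    lam = np.sqrt(2 * G**2 / (B**2 * m))        # lambda = sqrt(2 G^2 / (B^2 m))
    excess = np.sqrt(8 * B**2 * G**2 / m)       # lam*B^2 + 2G^2/(lam*m)
    return lam, excess

for m in (100, 10_000, 1_000_000):
    print(m, lam_and_bound(G=1.0, B=10.0, m=m))
# Excess error scales as sqrt(B^2 G^2 / m), i.e. m = O(B^2 G^2 / eps^2) samples
# suffice for excess error eps.
```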
25 Back to the Converse of the Fundamental Theorem of Learning
For (bounded) supervised learning problems (with absolute loss):
- Learnable if and only if the fat-shattering dimension at every scale is finite.
- The fat-shattering dimension exactly characterizes the sample complexity.
- If learnable, we always have a ULLN, and the problem is always learnable with ERM, with optimal sample complexity.
For generalized learning problems:
- Finite fat-shattering dimension $\Rightarrow$ learnable with ERM.
- No strict converse, because of silly problems with complex irrelevant parts.
- Converse for non-trivial problems? If learnable, always learnable with ERM?
26 Center of Mass with Missing Data
Center of mass (mean estimation) problem: $\mathcal{Z} = \{ z : \|z\|_2 \le 1 \}$, $\mathcal{H} = \{ w : \|w\|_2 \le 1 \}$, $\ell(h, z) = \|h - z\|_2^2 = \sum_i (h[i] - z[i])^2$.
Center of mass with missing data: $\mathcal{Z} = \{ (I, z_I) = (I, (z_i)_{i \in I}) : \|z\|_2 \le 1,\ I \subseteq \text{coordinates} \}$, with $\ell(h, (I, z_I)) = \sum_{i \in I} (h[i] - z[i])^2$.
This is a 4-Lipschitz and 1-bounded convex problem w.r.t. $\|w\|_2$, hence learnable with $\mathrm{RERM}_\lambda$.
But: consider the distribution $(I, z_I) \sim \mathcal{D}$ with $\Pr[i \in I] = \frac{1}{2}$ independently for all coordinates $i$, and $z_i = 0$ almost surely. Then $L_{\mathcal{D}}(0) = 0$.
For any finite training set, there is (with probability one) some never-observed coordinate $j$. Consider the standard basis vector $e_j$: $L_S(e_j) = 0$, hence it is an ERM, but $L_{\mathcal{D}}(e_j) = \frac{1}{2}$.
No ULLN, and not learnable with ERM.
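A finite-dimensional proxy for this construction (the lecture's version is infinite-dimensional; taking $d \gg 2^m$ makes a never-observed coordinate overwhelmingly likely) can be sketched as follows:

```python
# Sketch of the failure mode above, in finite but high dimension d >> 2^m:
# z = 0 almost surely, each coordinate observed independently with prob. 1/2.
# e_j for a never-observed coordinate j is an ERM with excess error 1/2.
import numpy as np

rng = np.random.default_rng(0)
d, m = 100_000, 10
masks = rng.random((m, d)) < 0.5            # observed coordinate sets I; z_I = 0

def loss(w, mask):                          # sum over observed coords of (w_i - 0)^2
    return np.sum((w * mask) ** 2)

never_observed = np.where(~masks.any(axis=0))[0]
j = never_observed[0]                       # exists with overwhelming probability
e_j = np.zeros(d); e_j[j] = 1.0

emp = np.mean([loss(e_j, mk) for mk in masks])  # empirical loss of e_j: 0, so it is an ERM
pop = 0.5                                       # population loss: Pr[j in I] * 1 = 1/2
print(j, emp, pop)                              # ERM fails; RERM_lambda would not
```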