PAC-learning, VC Dimension and Margin-based Bounds

Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, March 6th, 2006. More details: general background at http://www.learning-with-kernels.org/ and an example of more complex bounds at http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz

Announcements 1: Midterm on Wednesday. Open book, texts, and notes; no laptops. Bring a calculator.

Announcements 2: Final project details are out!!! http://www.cs.cmu.edu/~guestrin/class/10701/projects.html A great opportunity to apply ideas from class and learn more. Example project: take a dataset, define a learning task, apply learning algorithms, design your own extension, and evaluate your ideas. There are many suggestions on the webpage, but you can also do your own. Boring stuff: work individually or in groups of two students; the project is worth 20% of your final grade. You need to submit a one-page proposal on Wed. 3/22 (just after the break). A 5-page initial write-up (milestone) is due on 4/12 (20% of the project grade). An 8-page final write-up is due 5/8 (60% of the grade). A poster session for all students will be held on Friday 5/5, 2-5pm in the NSH atrium (20% of the grade). You can use late days on write-ups, but each student in a team will be charged a late day per day. MOST IMPORTANT:

What now? We have explored many ways of learning from data. But: how good is our classifier, really? How much data do I need to make it good enough?

How likely is the learner to pick a bad hypothesis? Consider the probability that an h with error_true(h) ≥ ε gets m data points right. There are k hypotheses consistent with the data. How likely is the learner to pick a bad one?

Union bound: P(A or B or C or D or ...) ≤ P(A) + P(B) + P(C) + P(D) + ...

How likely is the learner to pick a bad hypothesis? The probability that a single h with error_true(h) ≥ ε gets all m data points right is at most (1-ε)^m. By the union bound over the k hypotheses consistent with the data, the probability that the learner picks a bad one is at most k(1-ε)^m ≤ k e^{-εm}.

Review: Generalization error in finite hypothesis spaces [Haussler '88]. Theorem: for a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h that is consistent on the training data: P(error_true(h) > ε) ≤ |H| e^{-mε}.

Using a PAC bound. Typically, there are 2 use cases. Case 1: pick ε and δ, and the bound gives you m. Case 2: pick m and δ, and the bound gives you ε.
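
A minimal sketch of both use cases, assuming we set |H| e^{-mε} ≤ δ from the Haussler bound above and solve for m or for ε (the numbers and helper names are illustrative, not from the original slides):

```python
import math

def sample_complexity(h_size, eps, delta):
    """Case 1: number of samples m such that |H| * exp(-m*eps) <= delta."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

def error_bound(h_size, m, delta):
    """Case 2: error eps guaranteed (w.p. >= 1-delta) for a consistent h after m samples."""
    return (math.log(h_size) + math.log(1.0 / delta)) / m

# Example: 10,000 hypotheses, want error <= 0.05 with probability >= 0.95
print(sample_complexity(1e4, eps=0.05, delta=0.05))   # m ~ 245
print(error_bound(1e4, m=1000, delta=0.05))           # eps ~ 0.012
```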

Review: Generalization error in finite hypothesis spaces [Haussler '88]. Theorem: for a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h that is consistent on the training data: P(error_true(h) > ε) ≤ |H| e^{-mε}. Even if h makes zero errors on the training data, it may make errors at test time.

Limitations of the Haussler '88 bound: it assumes a consistent classifier (zero training error), and it depends on the size of the hypothesis space, which can be huge or even infinite.

Simpler question: what's the expected error of a hypothesis? The error of a hypothesis is like estimating the parameter of a coin! Chernoff bound: for m i.i.d. coin flips x_1, ..., x_m, where x_i ∈ {0,1} and E[x_i] = θ, for 0 < ε < 1: P(θ - (1/m) Σ_i x_i > ε) ≤ e^{-2mε²}.
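
Inverting this bound (a standard step, not spelled out on the slide): setting the right-hand side to δ and solving for ε gives

\[ e^{-2m\varepsilon^2} = \delta \;\Longrightarrow\; \varepsilon = \sqrt{\frac{\ln(1/\delta)}{2m}}, \]

so with probability at least 1-δ, the true error of a single fixed hypothesis exceeds its training error by at most sqrt(ln(1/δ)/(2m)).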

But we are comparing many hypotheses: union bound. For each hypothesis h_i: P(error_true(h_i) - error_train(h_i) > ε) ≤ e^{-2mε²}. What if I am comparing two hypotheses, h_1 and h_2?

Generalization bound for |H| hypotheses. Theorem: for a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h: P(error_true(h) - error_train(h) > ε) ≤ |H| e^{-2mε²}.

PAC bound and the bias-variance tradeoff: or, after moving some terms around, with probability at least 1-δ: error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2m) ). Important: the PAC bound holds for all h, but it doesn't guarantee that the algorithm finds the best h!!!
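
A small numerical sketch of the tradeoff in this bound (the hypothesis-space and sample sizes are made-up illustrations): the first term (training error, bias) tends to shrink as |H| grows, while the second term (the complexity or variance penalty) grows with ln|H| and shrinks like 1/sqrt(m).

```python
import math

def pac_bound(train_error, log_h_size, m, delta=0.05):
    """Upper bound on true error: train_error + sqrt((ln|H| + ln(1/delta)) / (2m))."""
    return train_error + math.sqrt((log_h_size + math.log(1.0 / delta)) / (2.0 * m))

m = 10_000
# Small hypothesis space: higher training error (bias), small complexity penalty.
print(pac_bound(train_error=0.15, log_h_size=math.log(1e3), m=m))
# Huge hypothesis space: lower training error, but a larger complexity penalty.
print(pac_bound(train_error=0.02, log_h_size=math.log(1e12), m=m))
```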

What about the size of the hypothesis space? How large is the hypothesis space?

Boolean formulas with n binary features: there are 2^(2^n) distinct Boolean functions of n binary features, so ln|H| = 2^n ln 2 and the number of samples required by the bound grows exponentially in n.

Number of decision trees of depth k. Recursive solution: given n attributes, let H_k = number of decision trees of depth k. H_0 = 2. H_{k+1} = (# choices of root attribute) * (# possible left subtrees) * (# possible right subtrees) = n * H_k * H_k. Write L_k = log2 H_k. Then L_0 = 1 and L_{k+1} = log2 n + 2 L_k, so L_k = (2^k - 1)(1 + log2 n) + 1.
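
A quick sketch that checks the recursion against the closed form above (purely illustrative; the variable names are mine):

```python
from math import log2

def recursion_Lk(n, k):
    """L_k = log2(H_k) via the recursion L_0 = 1, L_{k+1} = log2(n) + 2*L_k."""
    L = 1.0
    for _ in range(k):
        L = log2(n) + 2 * L
    return L

def closed_form_Lk(n, k):
    """Closed form from the slide: L_k = (2^k - 1)(1 + log2 n) + 1."""
    return (2**k - 1) * (1 + log2(n)) + 1

for n, k in [(4, 3), (10, 5), (20, 8)]:
    assert abs(recursion_Lk(n, k) - closed_form_Lk(n, k)) < 1e-6
print("recursion matches closed form")
```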

PAC bound for decision trees of depth k: bad!!! The number of points required is exponential in the depth! But for m data points, the decision tree can't get too big: the number of leaves is never more than the number of data points.
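
To see the exponential blow-up concretely, here is a sketch that plugs ln|H_k| = L_k ln 2 into the sample-complexity calculation from earlier (illustrative numbers only):

```python
import math
from math import log2

def m_needed_for_depth(n, k, eps=0.1, delta=0.05):
    """m >= (ln|H_k| + ln(1/delta)) / eps, with ln|H_k| = L_k * ln 2."""
    L_k = (2**k - 1) * (1 + log2(n)) + 1
    return math.ceil((L_k * math.log(2) + math.log(1.0 / delta)) / eps)

for k in [2, 4, 8, 16]:
    print(k, m_needed_for_depth(n=10, k=k))  # grows roughly like 2^k
```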

Number of decision trees with k leaves. H_k = number of decision trees with k leaves (a tree that is a single leaf gives 2 choices). Loose bound (one standard way to count, as a reminder): a tree with k leaves has k-1 internal nodes, so H_k ≤ 4^{k-1} · n^{k-1} · 2^k, and therefore log2 H_k is only linear in k (roughly k(3 + log2 n)).

PAC bound for decision trees with k leaves: the bias-variance tradeoff revisited.

What did we learn from decision trees? The bias-variance tradeoff, formalized. Moral of the story: the complexity of learning is not measured in terms of the size of the hypothesis space, but in the maximum number of points that allows consistent classification. Complexity m: no bias, lots of variance. Complexity lower than m: some bias, less variance.

What about continuous hypothesis spaces? A continuous hypothesis space has |H| = ∞. Infinite variance??? As with decision trees, we only care about the maximum number of points that can be classified exactly!

How many points can a linear boundary classify exactly? (1-D) Two: a threshold can realize every labeling of 2 points, but no threshold can realize the +/-/+ labeling of 3 points on a line.

How many points can a linear boundary classify exactly? (2-D) Three: any 3 points in general position can be shattered, but no 4 points can (the XOR labeling fails).

How many points can a linear boundary classify exactly? (d-D) d+1 points, which matches VC(H) = d+1 for linear classifiers with a bias term.

Shattering a set of points: a hypothesis space H shatters a set of points if, for every possible labeling of those points, there is some h in H that classifies them all correctly.
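
A small sketch of what shattering means computationally, assuming linear classifiers in 2-D (the feasibility check via scipy.optimize.linprog is just one convenient way to test linear separability; all names here are illustrative):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Check if some w, b satisfies y_i (w . x_i + b) >= 1 for all i (an LP feasibility test)."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Variables: [w_1, w_2, b]; constraints: -y_i * (w . x_i + b) <= -1
    A = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A, b_ub=b_ub, bounds=[(None, None)] * 3)
    return res.status == 0  # 0 means a feasible (optimal) point was found

def shattered(points):
    """True if linear classifiers realize every +/-1 labeling of the points."""
    return all(separable(points, labs)
               for labs in itertools.product([-1, 1], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))          # 3 points in general position: True
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))  # 4 points: False (the XOR labeling fails)
```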

VC dimension: the VC dimension of a hypothesis space H is the size of the largest set of points that H can shatter.

PAC bound using the VC dimension. The number of training points that can be classified exactly is the VC dimension!!! It measures the relevant size of the hypothesis space, as with decision trees with k leaves. Bound for infinite hypothesis spaces (in its standard form): with probability at least 1-δ, error_true(h) ≤ error_train(h) + sqrt( (VC(H) (ln(2m/VC(H)) + 1) + ln(4/δ)) / m ).

Examples of VC dimension. Linear classifiers: VC(H) = d+1, for d features plus a constant term b. Neural networks: VC(H) = # parameters; local minima mean NNs will probably not find the best parameters. 1-Nearest neighbor?

Another VC dimension example: what's the VC dimension of decision stumps in 2-D? (It is 3: three points in general position can be shattered by axis-aligned thresholds, but no set of four points can.)

PAC bound for SVMs: SVMs use a linear classifier, so for d features VC(H) = d+1, which can be plugged into the bound above.

VC dimension and SVMs: problems!!! The bound doesn't take the margin into account. What about kernels? Polynomials: the number of features grows really fast (roughly n^p for n input features and polynomial degree p), which gives a bad bound. Gaussian kernels can classify any set of points exactly, so the VC dimension is unbounded.

Margin-based VC dimension. H: the class of linear classifiers w·Φ(x) (with b = 0). Canonical form: min_j |w·Φ(x_j)| = 1. Then VC(H) ≤ R² (w·w), which doesn't depend on the number of features!!! Here R² = max_j Φ(x_j)·Φ(x_j) is the magnitude of the data. R² is bounded even for Gaussian kernels, so the VC dimension is bounded. Large margin means low w·w, which means low VC dimension. Very cool!
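
To connect this to the usual margin picture (a standard identity, not written on the slide): in canonical form the geometric margin is γ = 1/||w||, so

\[ \mathrm{VC}(H) \;\le\; R^2 \, \|w\|^2 \;=\; \frac{R^2}{\gamma^2}, \]

i.e., a larger margin γ means a smaller effective VC dimension, independent of the (possibly infinite) dimension of Φ(x).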

Applying margin VC to SVMs? VC(H) ≤ R² (w·w), where R² = max_j Φ(x_j)·Φ(x_j) is the magnitude of the data and doesn't depend on the choice of w. SVMs minimize w·w, so do SVMs minimize the VC dimension to get the best bound? Not quite right: the bound assumes the VC dimension is chosen before looking at the data, which would require a union bound over an infinite number of possible VC dimensions. But it can be fixed!

Structural risk minimization theorem: stated for a family of hyperplanes with margin γ > 0 and w·w ≤ 1. SVMs maximize the margin γ while minimizing the hinge loss, optimizing the tradeoff between training error (bias) and margin γ (variance).

Reality check: bounds are loose. [Plot: the bound on ε versus m (in units of 10^5) for VC dimensions d = 2, 20, 200, and 2000.] The bound can be very loose, so why should you care? There are tighter, albeit more complicated, bounds. Bounds give us formal guarantees that empirical studies can't provide, and they give us intuition about the complexity of problems and the convergence rate of algorithms.
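
A sketch that reproduces curves like the ones in that plot, using the standard VC bound form quoted earlier; the d and m values mirror the figure, everything else is illustrative:

```python
import math

def vc_bound_term(d, m, delta=0.05):
    """Complexity term of the VC bound: sqrt((d*(ln(2m/d) + 1) + ln(4/delta)) / m)."""
    return math.sqrt((d * (math.log(2 * m / d) + 1) + math.log(4 / delta)) / m)

for d in [2, 20, 200, 2000]:
    terms = [round(vc_bound_term(d, m), 3) for m in (10**5, 5 * 10**5, 10**6)]
    print(f"d={d:4d}  bound term at m=1e5, 5e5, 1e6: {terms}")
# For d=2000 the term is about 0.34 even with 100,000 samples and about 0.13 with a million,
# which is much looser than the test errors such classifiers typically achieve in practice.
```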

What you need to know. Finite hypothesis spaces: derive the results, counting the number of hypotheses, and handling mistakes on the training data. The complexity of the classifier depends on the number of points that can be classified exactly: the finite case (decision trees) and the infinite case (VC dimension). The bias-variance tradeoff in learning theory. The margin-based bound for SVMs. Remember: will your algorithm find the best classifier?

Big Picture. Machine Learning 10701/15781, Carlos Guestrin, Carnegie Mellon University, March 6th, 2006.

What you have learned thus far: learning is function approximation; point estimation; regression; Naïve Bayes; logistic regression; the bias-variance tradeoff; neural nets; decision trees; cross validation; boosting; instance-based learning; SVMs; the kernel trick; PAC learning; VC dimension; margin bounds; mistake bounds.

Review the material in terms of: types of learning problems, hypothesis spaces, loss functions, and optimization algorithms.

Text classification: company home page vs. personal home page vs. university home page vs. ...

Function fitting: temperature data from a sensor network. [Figure: building floor plan (offices, lab, kitchen, server room) with numbered sensor locations.]

Monitoring a complex system: the reverse water gas shift system (RWGS). Learn a model of the system from data; use the model to predict behavior and detect faults.

Types of learning problems: classification (input: features; output: ?), regression, and density estimation. [Figure: example regression surface.]

The learning problem: from data ⟨x_1, ..., x_n, y⟩, a choice of features / function approximator, a loss function, a learning task, and an optimization algorithm, we obtain the learned function.

Comparing learning algorithms along three axes: hypothesis space, loss function, and optimization algorithm.

Naïve Bayes versus logistic regression.

Naïve Bayes versus logistic regression: classification as density estimation; choose the class with the highest probability; in addition to the class, we get a certainty measure.

Logistic regression versus boosting: both learn a classifier; logistic regression uses the log-loss, boosting uses the exponential loss.

Linear classifiers: logistic regression versus SVMs, both with a decision boundary of the form w·x + b = 0.

What's the difference between SVMs and logistic regression? (Revisited again.) Loss function: SVMs use the hinge loss, logistic regression uses the log-loss. High-dimensional features with kernels: SVMs, yes!; logistic regression, often yes! Solution sparse: SVMs, yes!; logistic regression, almost always no! Type of learning:
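
A small sketch comparing the three loss functions mentioned in the last two comparisons, as functions of the margin y·f(x) (standard textbook definitions; the printed values are just for illustration):

```python
import math

def hinge_loss(margin):        # SVMs
    return max(0.0, 1.0 - margin)

def log_loss(margin):          # logistic regression (natural log)
    return math.log(1.0 + math.exp(-margin))

def exp_loss(margin):          # boosting
    return math.exp(-margin)

for m in [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  "
          f"log={log_loss(m):.3f}  exp={exp_loss(m):.3f}")
# All three penalize negative margins (misclassifications); the hinge loss is exactly zero
# once the margin exceeds 1, the log-loss decays smoothly, and the exponential loss
# penalizes badly misclassified points the most heavily.
```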

SVMs and instance-based learning. SVMs classify as the sign of a kernel-weighted sum over the support vectors, sign(Σ_i α_i y_i K(x_i, x) + b); instance-based learning keeps the data ⟨x_1, ..., x_n, y⟩ and classifies a new point from its nearest neighbors.

Instance-based learning (1-nearest neighbor) versus decision trees.

Logistic regression versus neural nets.

Linear regression versus kernel regression versus kernel-weighted linear regression.
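
A minimal numpy sketch contrasting the first two, assuming a Gaussian kernel and a 1-D toy dataset (Nadaraya-Watson style kernel regression; the names and bandwidth are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 10, 40))
y_train = np.sin(x_train) + 0.2 * rng.normal(size=40)

def linear_regression_predict(x):
    # Least-squares fit of y = a*x + b (one global line).
    a, b = np.polyfit(x_train, y_train, deg=1)
    return a * x + b

def kernel_regression_predict(x, bandwidth=0.5):
    # Nadaraya-Watson: a locally weighted average of the training targets.
    w = np.exp(-0.5 * ((x[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

x_test = np.linspace(0, 10, 5)
print(linear_regression_predict(x_test))   # one straight line, misses the curvature
print(kernel_regression_predict(x_test))   # tracks the local shape of sin(x)
```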

Kernel-weighted linear regression: local basis functions for each region, with kernels averaging between regions. [Figure: building floor plan with numbered sensor locations, partitioned into regions.]

SVM regression: fit w·x + b, penalizing only points that fall outside the tube between w·x + b - ε and w·x + b + ε.
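
The corresponding loss is the ε-insensitive loss (a standard definition, not written out on the slide):

\[ L_\varepsilon\big(y, f(x)\big) = \max\big(0, \; |y - f(x)| - \varepsilon\big), \qquad f(x) = w \cdot x + b, \]

so predictions within ε of the target incur zero loss, and the penalty grows linearly outside the tube.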

BIG PICTURE (a few points of comparison); this is a very incomplete view!!! Legend (learning task / loss function): DE = density estimation, Cl = classification, Reg = regression, LL = log-loss/MLE, Mrg = margin-based, RMS = squared error. Naïve Bayes: DE, LL. Logistic regression: DE, LL. Neural nets: DE, Cl, Reg, RMS. Boosting: Cl, exp-loss. SVMs: Cl, Mrg. Instance-based learning: DE, Cl, Reg. Decision trees: DE, Cl, Reg. Linear regression: Reg, RMS. Kernel regression: Reg, RMS. SVM regression: Reg, Mrg.