Introduction to Machine Learning CMU-10701


Introduction to Machine Learning, CMU-10701. Lecture 11: Learning Theory. Barnabás Póczos

Learning Theory. We have explored many ways of learning from data. But how good is our classifier, really? How much data do we need to make it good enough?

Please ask questions and give us feedback!

Review of what we have learned so far.

Notation. We observe $n$ i.i.d. samples $(X_1,Y_1),\dots,(X_n,Y_n)$ from an unknown distribution. For a classifier $h$ in a hypothesis class $\mathcal F$, the true risk is $R(h) = P(h(X) \neq Y)$ and the empirical risk is $\hat R_n(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{h(X_i) \neq Y_i\}$. The Bayes risk $R^*$ is the smallest possible risk over all classifiers, and $\hat h_n$ denotes the classifier the learning algorithm produces (for empirical risk minimization, $\hat h_n = \arg\min_{h\in\mathcal F} \hat R_n(h)$). We will need these definitions throughout.

Big Picture. Ultimate goal: make the excess risk $R(\hat h_n) - R^*$ small. It decomposes as
$R(\hat h_n) - R^* = \big(R(\hat h_n) - \inf_{h\in\mathcal F} R(h)\big) + \big(\inf_{h\in\mathcal F} R(h) - R^*\big)$ = estimation error + approximation error,
where $R^*$ is the Bayes risk. The estimation error measures how far the learned classifier is from the best classifier in the class $\mathcal F$; the approximation error measures how far the best classifier in $\mathcal F$ is from the Bayes classifier.

Big Picture: Illustration of Risks. (Figure: the Bayes risk, the best risk in the class, the risk of the learned classifier, and an upper bound on their gap.) Goal of learning: drive $R(\hat h_n)$ down toward the Bayes risk $R^*$.

11. Learning Theory

Outline. From Hoeffding's inequality and a union bound, we have seen the following theorem for a finite class $\mathcal F$ of $N$ classifiers: $P\big(\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)| > \varepsilon\big) \le 2N e^{-2n\varepsilon^2}$. These results are useless if $N$ is big or infinite (e.g. the class of all possible hyperplanes). Today we will see how to fix this with the shattering coefficient and the VC dimension.
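To see why the finite-class result breaks down, here is a minimal Python sketch (not from the slides) that inverts the bound above into a deviation of $\sqrt{\log(2N/\delta)/(2n)}$ and evaluates it for growing $N$; the function name and the chosen values are illustrative.

```python
import math

def finite_class_bound(n, N, delta=0.05):
    """Hoeffding + union bound: with probability >= 1 - delta,
    sup_h |R_hat(h) - R(h)| <= sqrt(log(2N/delta) / (2n))."""
    return math.sqrt(math.log(2 * N / delta) / (2 * n))

n = 1000
for N in [10, 10**3, 10**6, 10**12]:
    print(f"N = {N:>14}: bound = {finite_class_bound(n, N):.3f}")
# The bound grows like sqrt(log N), so it becomes vacuous for infinite classes.
```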

Outline, continued. After this fix, we can also say something meaningful about $R(\hat h_n)$, the true risk of the classifier that the learning algorithm produces, even when the hypothesis class is infinite.

Hoeffding's inequality. Theorem: if $Z_1,\dots,Z_n$ are independent random variables with $Z_i \in [a_i,b_i]$, then for all $\varepsilon > 0$,
$P\big(\big|\tfrac{1}{n}\sum_i Z_i - \tfrac{1}{n}\sum_i \mathbb E Z_i\big| > \varepsilon\big) \le 2\exp\big(-2n^2\varepsilon^2 / \sum_i (b_i-a_i)^2\big)$.
Observation: applied to the 0/1 losses of a single fixed classifier $h$, this gives $P(|\hat R_n(h) - R(h)| > \varepsilon) \le 2 e^{-2n\varepsilon^2}$; it does not apply directly to the data-dependent classifier $\hat h_n$.
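As a sanity check, the following Python sketch (added here, not part of the slides) simulates the empirical risk of a single fixed classifier with true risk 0.3 and compares the observed deviation probability with the Hoeffding bound; all constants are illustrative choices.

```python
import numpy as np

# Monte-Carlo check of Hoeffding's inequality for a single fixed classifier:
# the empirical risk R_hat is a mean of n i.i.d. {0,1} losses with mean R.
rng = np.random.default_rng(0)
R, n, eps, trials = 0.3, 200, 0.05, 100_000

losses = rng.binomial(1, R, size=(trials, n))     # 0/1 losses for each trial
deviations = np.abs(losses.mean(axis=1) - R)
empirical = (deviations > eps).mean()             # estimate of P(|R_hat - R| > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)           # Hoeffding upper bound

print(f"empirical: {empirical:.4f}   Hoeffding bound: {hoeffding:.4f}")
```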

McDiarmid's bounded difference inequality. If $g(x_1,\dots,x_n)$ satisfies the bounded difference condition, i.e. changing any single argument $x_i$ changes the value of $g$ by at most $c_i$, then for independent $X_1,\dots,X_n$,
$P\big(|g(X_1,\dots,X_n) - \mathbb E\, g(X_1,\dots,X_n)| > \varepsilon\big) \le 2\exp\big(-2\varepsilon^2 / \sum_i c_i^2\big)$.
It follows that Hoeffding's inequality is the special case $g(x_1,\dots,x_n) = \frac{1}{n}\sum_i x_i$ with $c_i = (b_i-a_i)/n$.

Bounded difference condition. Our main goal is to bound $\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)|$. Lemma: the function $g(x_1,\dots,x_n) = \sup_{h\in\mathcal F} |\hat R_n(h) - R(h)|$, where $\hat R_n$ is computed on the sample $x_1,\dots,x_n$, satisfies the bounded difference condition with $c_i = 1/n$. Proof: changing one sample point changes each $\hat R_n(h)$ by at most $1/n$, hence it changes the supremum by at most $1/n$. Observation: therefore McDiarmid can be applied to $g$!

Bounded difference condition, continued. Corollary: with probability at least $1-\delta$,
$\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)| \le \mathbb E\big[\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)|\big] + \sqrt{\log(2/\delta)/(2n)}$.
So it remains to bound the expectation of the supremum; the Vapnik-Chervonenkis inequality does that with the shatter coefficient (and VC dimension)!

Concentration and Expected Value

Vapnik-Chervonenkis inequality. Our main goal is to bound $\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)|$. We already know from McDiarmid that it concentrates around its expectation. Vapnik-Chervonenkis inequality: $\mathbb E\big[\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)|\big] \le 2\sqrt{\log(2\, S_{\mathcal F}(n))/n}$, where $S_{\mathcal F}(n)$ is the shatter coefficient of the class. Corollary: combining the two gives a high-probability bound on the supremum. Vapnik-Chervonenkis theorem: $P\big(\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)| > \varepsilon\big) \le 8\, S_{\mathcal F}(n)\, e^{-n\varepsilon^2/32}$.

Shattering

How many points can a linear boundary classify exactly in 1D? With 2 points, there exists a placement such that all $2^2 = 4$ labelings can be classified. With 3 points, no placement allows all $2^3 = 8$ labelings: the alternating labeling +, -, + cannot be realized by a single threshold. The answer is 2.
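A brute-force check of this claim, assuming the 1D "linear boundary" class consists of threshold classifiers $x \mapsto [x > t]$ and $x \mapsto [x < t]$ (a sketch added here, not from the slides):

```python
def behaviors_1d_thresholds(points):
    """All labelings of `points` realizable by h(x) = [x > t] or h(x) = [x < t]."""
    xs = sorted(points)
    # Thresholds between consecutive points (and outside the range) suffice.
    thresholds = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    labelings = set()
    for t in thresholds:
        labelings.add(tuple(int(x > t) for x in points))
        labelings.add(tuple(int(x < t) for x in points))
    return labelings

for pts in ([0.0, 1.0], [0.0, 1.0, 2.0]):
    got = behaviors_1d_thresholds(pts)
    print(len(pts), "points:", len(got), "of", 2 ** len(pts), "labelings realizable")
# 2 points: 4 of 4 -> shattered;  3 points: 6 of 8 -> not shattered (VC dim = 2)
```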

How many points can a linear boundary classify exactly in 2D? With 3 points (not all on a line), there exists a placement such that all $2^3 = 8$ labelings can be classified. With 4 points, no placement allows all $2^4 = 16$ labelings (e.g. the XOR-type labeling of four points cannot be separated by a line). The answer is 3.

How many points can a linear boundary classify exactly in 3D? The answer is 4 (e.g. the vertices of a tetrahedron can be shattered). How many points can a linear boundary classify exactly in d dimensions? The answer is d+1.

Growth function, shatter coefficient. Definition: for a sample $x_1,\dots,x_n$, the behaviors of the class $\mathcal F$ are the distinct labeling vectors $\{(h(x_1),\dots,h(x_n)) : h \in \mathcal F\}$, written as the rows of a 0/1 matrix (= 5 in the slide's example). The shatter coefficient (growth function) $S_{\mathcal F}(n)$ is the maximum number of behaviors on $n$ points, where the maximum is taken over all placements of the points.
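The definition translates directly into code. Below is a small Python sketch (added here) that counts the behaviors of a finite set of hypotheses on given points; the threshold example used to exercise it is hypothetical, chosen only to illustrate the count.

```python
def shatter_coefficient(hypotheses, points):
    """Number of distinct behaviors (labeling vectors) that the given finite set
    of hypotheses induces on the points: |{(h(x_1), ..., h(x_n)) : h in H}|."""
    return len({tuple(int(h(x)) for x in points) for h in hypotheses})

# Hypothetical example: 1D threshold classifiers h_t(x) = [x > t] on three points.
thresholds = [-0.5, 0.5, 1.5, 2.5]
H = [lambda x, t=t: x > t for t in thresholds]
print(shatter_coefficient(H, [0.0, 1.0, 2.0]))   # 4 distinct behaviors here
```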

Growth function, shatter coefficient, continued. The shatter coefficient $S_{\mathcal F}(n)$ is the maximum number of behaviors on $n$ points. Example: half spaces in 2D (the figure on the slide shows the distinct labelings that half planes induce on a small point configuration).

VC dimension. Recall that the shatter coefficient $S_{\mathcal F}(n)$ is the maximum number of behaviors of $\mathcal F$ on $n$ points. Definition (shattering): $\mathcal F$ shatters the points $x_1,\dots,x_n$ if all $2^n$ labelings are realized, i.e. the number of behaviors on these points equals $2^n$. Definition (VC dimension): the VC dimension of $\mathcal F$ is the largest $n$ for which $S_{\mathcal F}(n) = 2^n$, i.e. the size of the largest point set that $\mathcal F$ can shatter. Note: to show VC dimension $\ge d$ it suffices to exhibit one set of $d$ points that is shattered; to show VC dimension $< d$ one must show that no set of $d$ points is shattered.


Examples

VC dimension of decision stumps (axis-aligned linear separators) in 2D. What is the VC dimension of decision stumps in 2D? There is a placement of 3 points that can be shattered, so the VC dimension is at least 3.

VC dimension of decision stumps in 2D, continued. To conclude that the VC dimension equals 3, we must show that for every placement of 4 points there exists a labeling that cannot be realized. The cases to check: 3 collinear points plus 1 more; 1 point in the convex hull of the other 3; the 4 points forming a quadrilateral (convex position).

VC dimension of axis-parallel rectangles in 2D. What is the VC dimension of axis-parallel rectangles in 2D? There is a placement of 3 points that can be shattered, so the VC dimension is at least 3.

VC dimension of axis-parallel rectangles in 2D, continued. There is also a placement of 4 points that can be shattered (e.g. four points in a diamond configuration), so the VC dimension is at least 4.
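The following Python sketch (added here) verifies this by brute force for one standard diamond placement, assuming the hypotheses are closed axis-aligned rectangles that label interior points positive. The key observation in the code: the bounding box of the positively labeled points is the smallest candidate rectangle, so it suffices to test that one.

```python
from itertools import product

points = [(0, 1), (0, -1), (1, 0), (-1, 0)]      # diamond placement of 4 points

def realizable(labeling, points):
    """Is there an axis-aligned rectangle containing exactly the points labeled 1?"""
    pos = [p for p, y in zip(points, labeling) if y == 1]
    if not pos:                                   # an empty rectangle works
        return True
    x0, x1 = min(x for x, _ in pos), max(x for x, _ in pos)
    y0, y1 = min(y for _, y in pos), max(y for _, y in pos)
    inside = [x0 <= x <= x1 and y0 <= y <= y1 for x, y in points]
    return all(inside[i] == y for i, y in enumerate(labeling))

count = sum(realizable(lab, points) for lab in product([0, 1], repeat=4))
print(f"{count} of 16 labelings realizable")      # 16 -> these 4 points are shattered
```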

VC dimension of axis-parallel rectangles in 2D, continued. To conclude that the VC dimension equals 4, we must show that for every placement of 5 points there exists a labeling that cannot be realized. The cases to check: 4 collinear points; the 5 points forming a pentagon (convex position); 2 points in the convex hull of the others; 1 point in the convex hull of the others. In each case one can label the points so that any rectangle containing all the + points necessarily also contains a - point.

Sauer's lemma. We already know that $S_{\mathcal F}(n) \le 2^n$ [exponential in $n$]. Sauer's lemma: the VC dimension can be used to upper bound the shattering coefficient. If $\mathcal F$ has VC dimension $d$, then $S_{\mathcal F}(n) \le \sum_{i=0}^{d} \binom{n}{i}$. Corollary: for $n \ge d$, $S_{\mathcal F}(n) \le (en/d)^d$ [polynomial in $n$].
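To see how much Sauer's lemma buys, the short Python sketch below (added here) compares $2^n$ with the binomial-sum bound and with the $(en/d)^d$ corollary for a class of VC dimension 3; the chosen sample sizes are illustrative.

```python
from math import comb, exp, log

def sauer_bound(n, d):
    """Sauer's lemma: S_F(n) <= sum_{i=0}^{d} C(n, i) when VCdim(F) = d."""
    return sum(comb(n, i) for i in range(min(n, d) + 1))

d = 3
for n in [3, 5, 10, 100, 1000]:
    poly = (exp(1) * n / d) ** d if n >= d else 2 ** n    # (en/d)^d corollary
    print(f"n={n:>5}  2^n={2**n:.3e}  Sauer={sauer_bound(n, d):>12}  (en/d)^d={poly:.3e}")
```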

Proof of Sauer's lemma. Write all different behaviors of $\mathcal F$ on a sample $(x_1, x_2, \dots, x_n)$ as the rows of a binary matrix $A$: one column per sample point, one row per distinct behavior. For example, with $n = 3$ points and five behaviors:
0 0 0
0 1 0
1 1 1
1 0 0
0 1 1

Proof of Sauer's lemma, continued. For this matrix $A$, consider the shattered subsets of columns: a subset of columns is shattered if every 0/1 pattern on those columns appears in some row (the empty subset is always shattered). We will prove that
(number of rows of $A$) $\le$ (number of shattered column subsets) $\le \sum_{i=0}^{d} \binom{n}{i}$.
Therefore $S_{\mathcal F}(n) \le \sum_{i=0}^{d} \binom{n}{i}$.

Proof of Sauer's lemma, continued. Lemma 1: the number of shattered column subsets is at most $\sum_{i=0}^{d} \binom{n}{i}$. In this example: 6 shattered subsets $\le 1 + 3 + 3 = 7$. Lemma 2: for any binary matrix with no repeated rows, the number of rows is at most the number of shattered column subsets. In this example: $5 \le 6$.

Proof of Lemma 1. Every shattered subset of columns has size at most $d$: if $k$ columns are shattered, then the corresponding $k$ sample points are shattered by $\mathcal F$, so $k \le d$ by the definition of the VC dimension. The number of column subsets of size at most $d$ is $\sum_{i=0}^{d} \binom{n}{i}$, which proves Lemma 1. In the example the shattered subsets are the empty set, the three singletons, and the pairs {1,2} and {1,3}: $6 \le 1 + 3 + 3 = 7$.

Proof of Lemma 2. Lemma 2: for any binary matrix $A$ with no repeated rows, (number of rows) $\le$ (number of shattered column subsets). Proof by induction on the number of columns. Base case: $A$ has one column, so there are three cases. If $A = (0)$ or $A = (1)$, there is 1 row and 1 shattered subset (the empty set), so $1 \le 1$; if $A$ contains both rows 0 and 1, there are 2 rows and 2 shattered subsets (the empty set and the column itself), so $2 \le 2$.

Proof of Lemma 2, continued. Inductive case: $A$ has at least two columns. Split the rows of $A$ according to the first column: let $A'$ be the matrix obtained by deleting the first column and removing repeated rows, and let $A''$ consist of those rows of $A'$ that appeared in $A$ with both a 0 and a 1 in the first column. Then (rows of $A$) = (rows of $A'$) + (rows of $A''$). By induction (fewer columns), each of $A'$ and $A''$ has at most as many rows as it has shattered column subsets.

Proof of Lemma 2, continued. This suffices because every subset of columns shattered by $A'$ is also shattered by $A$ (these subsets do not involve the first column), and for every subset $S$ shattered by $A''$, the subset $S \cup \{\text{first column}\}$ is shattered by $A$ (each pattern on $S$ appears with both values in the first column). These two collections of shattered subsets of $A$ are disjoint, so (rows of $A$) = (rows of $A'$) + (rows of $A''$) $\le$ (number of shattered column subsets of $A$), completing the induction.

Vapnik-Chervonenkis inequality. Vapnik-Chervonenkis inequality: $\mathbb E\big[\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)|\big] \le 2\sqrt{\log(2\, S_{\mathcal F}(n))/n}$ [we don't prove this]. From Sauer's lemma, $S_{\mathcal F}(n) \le (en/d)^d$ for $n \ge d$, where $d$ is the VC dimension. Since $\log S_{\mathcal F}(n) \le d\log(en/d)$ grows only logarithmically in $n$, the bound tends to 0 as $n \to \infty$ whenever $d < \infty$. Therefore the estimation error $R(\hat h_n) - \inf_{h\in\mathcal F} R(h) \le 2\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)|$ also tends to 0.

Linear (hyperplane) classifiers. We already know that the VC dimension of hyperplane classifiers in $\mathbb R^d$ is $d+1$. Plugging this into the bound above, the estimation error of empirical risk minimization over hyperplanes is $O\big(\sqrt{(d+1)\log n / n}\big)$, which tends to 0 as the sample size $n$ grows.

Vapnik-Chervonenkis theorem. We already know from McDiarmid that $\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)|$ concentrates around its expectation, and the Vapnik-Chervonenkis inequality bounds that expectation via the shatter coefficient. Corollary: combining the two gives a high-probability uniform deviation bound. Vapnik-Chervonenkis theorem: $P\big(\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)| > \varepsilon\big) \le 8\, S_{\mathcal F}(n)\, e^{-n\varepsilon^2/32}$ [we don't prove these]. Compare with Hoeffding + union bound for a finite function class: $P\big(\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)| > \varepsilon\big) \le 2 N e^{-2n\varepsilon^2}$ with $N = |\mathcal F|$; the VC theorem replaces $N$ by the shatter coefficient.

PAC bound for the estimation error. VC theorem: $P\big(\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)| > \varepsilon\big) \le 8\, S_{\mathcal F}(n)\, e^{-n\varepsilon^2/32}$. Inversion: setting the right-hand side to $\delta$ and solving for $\varepsilon$, with probability at least $1-\delta$,
$\sup_{h\in\mathcal F} |\hat R_n(h) - R(h)| \le \sqrt{32\,(\log S_{\mathcal F}(n) + \log(8/\delta))/n}$,
and hence the estimation error satisfies $R(\hat h_n) - \inf_{h\in\mathcal F} R(h) \le 2\sqrt{32\,(\log S_{\mathcal F}(n) + \log(8/\delta))/n}$.
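A minimal sketch in Python (added here) evaluating this PAC bound, using the constants as stated above together with Sauer's corollary $S_{\mathcal F}(n) \le (en/d)^d$; the constants vary between textbook versions of the VC theorem, so treat the numbers as illustrative of the $\sqrt{d\log n / n}$ rate rather than as sharp values.

```python
from math import exp, log, sqrt

def vc_pac_bound(n, d, delta=0.05):
    """With probability >= 1 - delta (classical constants, illustrative):
    sup_h |R_hat(h) - R(h)| <= sqrt(32 * (d*log(e*n/d) + log(8/delta)) / n)."""
    log_growth = d * log(exp(1) * n / d)      # log S_F(n) via Sauer's corollary
    return sqrt(32 * (log_growth + log(8 / delta)) / n)

d = 3                                         # e.g. hyperplanes in R^2 have VC dim 3
for n in [10**3, 10**4, 10**5, 10**6]:
    print(f"n = {n:>8}: uniform deviation <= {vc_pac_bound(n, d):.3f}")
# The estimation error R(h_hat) - inf_h R(h) is at most twice this uniform deviation.
```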

Structural Risk Minimization. Ultimate goal: make both terms of $R(\hat h_n) - R^*$ = estimation error + approximation error small. So far we studied when the estimation error tends to 0, but we also want the approximation error to tend to 0, which requires letting the class $\mathcal F$ grow with the sample size. Structural risk minimization has many different variants; they all penalize too-complex models to avoid overfitting, trading estimation error against approximation error.
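To make the idea concrete, here is a small Python sketch (added here, not the slides' formulation) of SRM-style model selection: among candidate classes, pick the one minimizing training error plus a complexity penalty. The penalty form $\sqrt{(d\log n + \log(1/\delta))/n}$ and the candidate models with their VC dimensions and training errors are hypothetical choices for illustration.

```python
from math import log, sqrt

def srm_select(candidates, n, delta=0.05):
    """Structural Risk Minimization sketch: pick the model class minimizing
    empirical risk + complexity penalty(VC dimension).
    `candidates` is a list of (name, vc_dim, empirical_risk) tuples."""
    def penalty(d):
        return sqrt((d * log(n) + log(1 / delta)) / n)   # one common penalty form
    return min(candidates, key=lambda c: c[2] + penalty(c[1]))

models = [                      # hypothetical training results on n = 2000 samples
    ("linear",          3, 0.18),
    ("quadratic",       6, 0.12),
    ("degree-10 poly", 66, 0.10),
]
best = srm_select(models, n=2000)
print("SRM picks:", best[0])    # the richest class does not win once penalized
```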

What you need to know. The complexity of a classifier class depends on the number of points it can classify exactly (shatter). Finite case: the number of hypotheses. Infinite case: the shattering coefficient and the VC dimension. PAC bounds on the true error in terms of the empirical (training) error and the complexity of the hypothesis space. Empirical and structural risk minimization.

Thanks for your attention.