ECS171: Machine Learning


ECS171: Machine Learning
Lecture 6: Training versus Testing (LFD 2.1)
Cho-Jui Hsieh, UC Davis, Jan 29, 2018

Preamble to the theory

Training versus testing
Out-of-sample error (generalization error): what we want is a small $E_{\text{out}}$, where $E_{\text{out}} = \mathbb{E}_x[e(h(x), f(x))]$.
In-sample error (training error): $E_{\text{in}} = \frac{1}{N}\sum_{n=1}^{N} e(h(x_n), f(x_n))$. This is what we can actually minimize.
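As a concrete (hypothetical) illustration of these two error measures, the short Python sketch below computes $E_{\text{in}}$ for a made-up threshold hypothesis on a toy 1-D sample, using 0-1 loss as the pointwise error $e(\cdot,\cdot)$; none of the data or the hypothesis comes from the lecture.

```python
import numpy as np

# Toy 1-D sample: inputs x_n and target labels f(x_n) in {-1, +1}.
# (Hypothetical data, for illustration only.)
x = np.array([-2.0, -0.5, 0.3, 1.2, 2.5])
y = np.array([-1, -1, -1, +1, +1])

def h(x, threshold=0.0):
    """A hypothetical threshold hypothesis: +1 strictly to the right of `threshold`."""
    return np.where(x > threshold, 1, -1)

# In-sample error with 0-1 loss: E_in = (1/N) * sum_n [ h(x_n) != f(x_n) ]
E_in = np.mean(h(x) != y)
print(f"E_in = {E_in:.2f}")  # fraction of misclassified training points (0.20 here)
```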

The 2 questions of learning
$E_{\text{out}}(g) \approx 0$ is achieved through $E_{\text{out}}(g) \approx E_{\text{in}}(g)$ and $E_{\text{in}}(g) \approx 0$.
Learning is thus split into 2 questions:
1. Can we make sure that $E_{\text{out}}(g) \approx E_{\text{in}}(g)$? Hoeffding's inequality (?)
2. Can we make $E_{\text{in}}(g)$ small? Optimization (done)

What the theory will achieve
Currently we only know that $P[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,] \le 2Me^{-2\epsilon^2 N}$.
What if $M = \infty$? (e.g., the perceptron)
To do: establish a finite quantity that replaces $M$,
$P[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,] \overset{?}{\le} 2\,m_H(N)\,e^{-2\epsilon^2 N}$,
and study $m_H(N)$ to understand the trade-off in model complexity.
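To see why a finite M matters, the following sketch (my own, not from the slides) simply evaluates the right-hand side $2Me^{-2\epsilon^2 N}$ for a few values of M and N; the bound is only meaningful when it falls below 1, and it clearly cannot say anything when M is infinite.

```python
import numpy as np

def hoeffding_union_bound(M, N, eps):
    """Upper bound on P[|E_in - E_out| > eps] for M hypotheses and N samples."""
    return 2 * M * np.exp(-2 * eps**2 * N)

eps = 0.1
for M in (1, 100, 10_000):
    for N in (100, 1_000, 10_000):
        bound = hoeffding_union_bound(M, N, eps)
        print(f"M={M:>6}, N={N:>6}: 2*M*exp(-2*eps^2*N) = {bound:.4g}")
# For M = infinity (e.g., the perceptron) the bound is vacuous,
# which is why we look for a finite quantity to replace M.
```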

Reducing M to a finite number

Where did the M come from?
The bad events $B_m$: $|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon$, each occurring with probability at most $2e^{-2\epsilon^2 N}$.
The union bound: $P[B_1 \text{ or } B_2 \text{ or } \dots \text{ or } B_M] \le P[B_1] + P[B_2] + \dots + P[B_M] \le 2Me^{-2\epsilon^2 N}$, which treats the worst case: no overlaps among the events.
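Written out, the union-bound step is the following standard chain of inequalities (restated here only for completeness):

```latex
P\big[\,|E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)| > \epsilon\,\big]
  \le P[B_1 \ \text{or}\ B_2 \ \text{or}\ \cdots \ \text{or}\ B_M]
  \le \sum_{m=1}^{M} P[B_m]
  \le \sum_{m=1}^{M} 2e^{-2\epsilon^2 N}
  = 2Me^{-2\epsilon^2 N}.
```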

Can we improve on M?
Consider two similar hypotheses $h_1$ and $h_2$ (e.g., two nearby perceptrons). Their out-of-sample errors differ only through the small area where their +1 and -1 regions differ, and their in-sample errors differ only if some data point changes label. So $|E_{\text{in}}(h_1) - E_{\text{out}}(h_1)| \approx |E_{\text{in}}(h_2) - E_{\text{out}}(h_2)|$: the bad events are heavily overlapped, and the union bound badly over-counts.

What can we replace M with?
Instead of the whole input space, consider only a finite set of input points. How many patterns of red and blue (that is, ±1 labelings) can we get on those points?

Dichotomies: mini-hypotheses
A hypothesis is a function $h : X \to \{-1, +1\}$.
A dichotomy is its restriction to the data: $h : \{x_1, x_2, \dots, x_N\} \to \{-1, +1\}$.
The number of hypotheses $|H|$ can be infinite, but the number of dichotomies $|H(x_1, x_2, \dots, x_N)|$ is at most $2^N$. This makes it a candidate for replacing M.

The growth function
The growth function counts the most dichotomies that H can produce on any N points:
$m_H(N) = \max_{x_1, \dots, x_N \in X} |H(x_1, \dots, x_N)|$
It satisfies $m_H(N) \le 2^N$.

Growth function for perceptrons
Compute $m_H(3)$ in 2-D space when H is the perceptron (linear separators): what is $|H(x_1, x_2, x_3)|$? For 3 points in general position, every one of the $2^3$ labelings can be realized by a line, so $m_H(3) = 8$. Three collinear points yield fewer dichotomies, but that doesn't matter, because the growth function only counts the most dichotomies over all placements of the points.
What is $m_H(4)$? For any placement of 4 points, at least two dichotomies are missing (the two "XOR" labelings in which diagonally opposite points share a label), so $m_H(4) = 14 < 2^4$.
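These counts can be checked empirically. The sketch below (my own construction, not from the lecture; the point sets and the random-sampling scheme are arbitrary choices) draws many random linear separators and records the distinct dichotomies they induce on a point set. Random sampling only ever gives a lower bound on the number of achievable dichotomies, but for these tiny examples it saturates quickly: it should report 8 for three points in general position and 14 for the four corners of a square.

```python
import numpy as np

def sampled_dichotomies(points, n_samples=100_000, seed=0):
    """Lower-bound the number of dichotomies a 2-D perceptron realizes on `points`
    by sampling random separators h(x) = sign(w . x + b)."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    seen = set()
    for _ in range(n_samples):
        w = rng.normal(size=2)
        b = rng.normal()
        labels = np.sign(pts @ w + b)
        labels[labels == 0] = 1  # break measure-zero ties arbitrarily
        seen.add(tuple(int(v) for v in labels))
    return len(seen)

three_points = [(0, 0), (1, 0), (0, 1)]      # general position
square = [(0, 0), (1, 0), (0, 1), (1, 1)]    # 4 corners of a square

print(sampled_dichotomies(three_points))  # expected: 8  (= m_H(3))
print(sampled_dichotomies(square))        # expected: 14 (the 2 "XOR" labelings are missing)
```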

Example I: positive rays. H contains hypotheses of the form $h(x) = \operatorname{sign}(x - a)$ on the real line: h is +1 to the right of the threshold a and -1 to its left.

Example II: positive intervals. Each h is +1 inside some interval and -1 everywhere else.

Example III: convex sets
H is the set of $h : \mathbb{R}^2 \to \{-1, +1\}$ whose +1 region is a convex set. How many dichotomies can we generate? If the N points are placed on a circle, then for any subset of them the convex hull of that subset contains no other point, so every labeling is realizable. Hence $m_H(N) = 2^N$ for any N, and we say the N points are shattered by convex sets.

The 3 growth functions
H is positive rays: $m_H(N) = N + 1$
H is positive intervals: $m_H(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1$
H is convex sets: $m_H(N) = 2^N$
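For completeness, here is a sketch of the standard counting argument behind the first two formulas (following LFD 2.1): sort the N points on the line; a positive ray is determined by which of the N+1 gaps contains its threshold, and a positive interval is determined by the two gaps containing its endpoints, plus one extra dichotomy for the empty interval.

```latex
% Positive rays: the threshold falls into one of the N+1 gaps between sorted points.
m_H(N) = N + 1

% Positive intervals: choose 2 of the N+1 gaps for the interval's endpoints,
% plus 1 dichotomy for the empty interval (every point labeled -1).
m_H(N) = \binom{N+1}{2} + 1 = \frac{1}{2}N^2 + \frac{1}{2}N + 1
```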

What's next?
Remember the inequality $P[\,|E_{\text{in}} - E_{\text{out}}| > \epsilon\,] \le 2Me^{-2\epsilon^2 N}$. What happens if we replace M by $m_H(N)$? If $m_H(N)$ is polynomial in N, the exponential factor wins and the bound goes to zero as N grows: good! The remaining question is how to show that $m_H(N)$ is polynomial.

Conclusions
Next class: LFD 2.1, 2.2
Questions?