Learning From Data Lecture 5 Training Versus Testing


Learning From Data, Lecture 5: Training Versus Testing
Topics: The Two Questions of Learning; Theory of Generalization ($E_{\text{in}} \approx E_{\text{out}}$); An Effective Number of Hypotheses; A Combinatorial Puzzle.
M. Magdon-Ismail, CSCI 4100/6100

recap: The Two Questions of Learning
1. Can we make sure that $E_{\text{out}}(g)$ is close enough to $E_{\text{in}}(g)$?
2. Can we make $E_{\text{in}}(g)$ small enough?
The Hoeffding generalization bound: $E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\tfrac{1}{2N}\ln\tfrac{2|\mathcal{H}|}{\delta}}$, where the square-root term is the generalization error bar.
[Figure: out-of-sample error, in-sample error, and model complexity plotted against $|\mathcal{H}|$.]
$E_{\text{in}}$: training (e.g. the practice exam). $E_{\text{out}}$: testing (e.g. the real exam). There is a tradeoff when picking $\mathcal{H}$.
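To make the error bar concrete, here is a minimal sketch (not from the slides) that evaluates the Hoeffding error bar numerically; the values of $N$, $|\mathcal{H}|$ (called `M` below), and $\delta$ are illustrative choices only.

```python
# Minimal sketch: the Hoeffding error bar for a finite hypothesis set.
# N, M, and delta below are made-up illustrative values.
import math

def hoeffding_error_bar(N, M, delta=0.05):
    """sqrt( ln(2M/delta) / (2N) ): holds with probability at least 1 - delta."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

print(hoeffding_error_bar(N=1000, M=100))     # ~0.064
print(hoeffding_error_bar(N=1000, M=10**6))   # ~0.094: bigger H, wider error bar
```

The second call illustrates the tradeoff on the slide: a larger hypothesis set inflates the error bar, even though it may allow a smaller $E_{\text{in}}$.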

What Will The Theory of Generalization Achieve?
Current bound (finite $\mathcal{H}$): $E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\tfrac{1}{2N}\ln\tfrac{2|\mathcal{H}|}{\delta}}$.
New bound: $E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\tfrac{8}{N}\ln\tfrac{4\,m_{\mathcal{H}}(2N)}{\delta}}$.
[Figure: the same error-versus-model-complexity picture drawn for each bound.]
The new bound will be applicable to infinite $\mathcal{H}$.

Why is $|\mathcal{H}|$ an Overkill?
How did $|\mathcal{H}|$ come in? Bad events: $\mathcal{B}_g = \{|E_{\text{out}}(g) - E_{\text{in}}(g)| > \epsilon\}$ and, for each hypothesis, $\mathcal{B}_m = \{|E_{\text{out}}(h_m) - E_{\text{in}}(h_m)| > \epsilon\}$.
We do not know which $g$, so we use a worst-case union bound: $\mathbb{P}[\mathcal{B}_g] \le \mathbb{P}[\text{any } \mathcal{B}_m] \le \sum_{m=1}^{|\mathcal{H}|} \mathbb{P}[\mathcal{B}_m]$.
[Figure: overlapping events $\mathcal{B}_1, \mathcal{B}_2, \mathcal{B}_3$ drawn as a Venn diagram.]
The $\mathcal{B}_m$ are events (sets of outcomes); they can overlap. If the $\mathcal{B}_m$ overlap, the union bound is loose. If many $h_m$ are similar, the $\mathcal{B}_m$ overlap, so there are effectively fewer than $|\mathcal{H}|$ hypotheses, and we can replace $|\mathcal{H}|$ by something smaller. $|\mathcal{H}|$ fails to account for similarity between hypotheses.
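The looseness of the union bound for overlapping events can be seen in a small simulation. This is an illustrative sketch, not part of the lecture: it invents $M$ nearly identical hypotheses whose out-of-sample error is $0.5$ and compares $\mathbb{P}[\text{any } \mathcal{B}_m]$ with the union-bound sum $\sum_m \mathbb{P}[\mathcal{B}_m]$.

```python
# Sketch (illustrative simulation, not from the slides): when the bad events
# B_m overlap, P[any B_m] can be far below the union-bound sum over m.
import random

random.seed(0)
N, M, eps, trials = 100, 50, 0.1, 2000
p_out = 0.5   # E_out(h_m) = 0.5 for every h_m (hypothetical target)

any_bad = 0
sum_individual = 0
for _ in range(trials):
    # M "similar" hypotheses: each differs from a common base hypothesis on
    # only 3 of the N points, so their in-sample errors are highly correlated.
    base = [random.random() < p_out for _ in range(N)]   # base pattern of errors
    bad_flags = []
    for m in range(M):
        errors = list(base)
        for i in random.sample(range(N), 3):             # resample 3 points
            errors[i] = random.random() < p_out
        E_in = sum(errors) / N
        bad_flags.append(abs(E_in - p_out) > eps)
    any_bad += any(bad_flags)
    sum_individual += sum(bad_flags)

print("P[any B_m]      ~", any_bad / trials)
print("sum_m P[B_m]    ~", sum_individual / trials)   # not a probability; can exceed 1
```

Because the hypotheses are almost the same, the bad events nearly coincide, and the union-bound sum far exceeds the probability that any bad event actually occurs.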

Measuring the Diversity (Size) of $\mathcal{H}$
We need a way to measure the diversity of $\mathcal{H}$. A simple idea: fix any set of $N$ data points. If $\mathcal{H}$ is diverse, it should be able to implement all functions on these $N$ points.

A Data Set Reveals the True Colors of an $\mathcal{H}$
[Figure: a hypothesis set $\mathcal{H}$ containing many different hypotheses.]

A Data Set Reveals the True Colors of an $\mathcal{H}$
[Figure: the same $\mathcal{H}$ seen through the eyes of the data set $\mathcal{D}$.]

A Data Set Reveals the True Colors of an $\mathcal{H}$
From the point of view of $\mathcal{D}$, the entire $\mathcal{H}$ is just one dichotomy.

An Effective Number of Hypotheses
If $\mathcal{H}$ is diverse, it should be able to implement many dichotomies; $|\mathcal{H}|$ only captures the maximum possible diversity of $\mathcal{H}$.
Consider an $h \in \mathcal{H}$ and a data set $\mathbf{x}_1, \dots, \mathbf{x}_N$. Then $h$ gives us an $N$-tuple of $\pm 1$'s, $(h(\mathbf{x}_1), \dots, h(\mathbf{x}_N))$: a dichotomy of the inputs.
If $\mathcal{H}$ is diverse, we get many different dichotomies. If $\mathcal{H}$ contains similar functions, we only get a few dichotomies. The growth function quantifies this.
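The effect of similar versus diverse hypotheses on the number of dichotomies can be seen directly by enumeration. The sketch below is illustrative only: the threshold hypotheses and the three data points are made-up examples, not from the lecture.

```python
# Sketch: the dichotomies a hypothesis set induces on a fixed data set.
def dichotomies(H, points):
    """Distinct tuples (h(x_1), ..., h(x_N)) as h ranges over H."""
    return {tuple(h(x) for x in points) for h in H}

points = [0.5, 1.5, 2.5]

# "Similar" hypotheses: thresholds packed into a narrow range -> few dichotomies.
similar = [lambda x, t=t: +1 if x > t else -1 for t in [1.0, 1.1, 1.2]]

# More spread-out thresholds -> more dichotomies.
diverse = [lambda x, t=t: +1 if x > t else -1 for t in [0.0, 1.0, 2.0, 3.0]]

print(len(dichotomies(similar, points)))   # 1 dichotomy: (-1, +1, +1)
print(len(dichotomies(diverse, points)))   # 4 dichotomies
```

Three nearly identical thresholds collapse to a single dichotomy of the three points, while well-spread thresholds induce four: the data set only "sees" the dichotomies, not the individual hypotheses.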

The Growth Function $m_{\mathcal{H}}(N)$
Define the restriction of $\mathcal{H}$ to the inputs $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$:
$\mathcal{H}(\mathbf{x}_1, \dots, \mathbf{x}_N) = \{(h(\mathbf{x}_1), \dots, h(\mathbf{x}_N)) : h \in \mathcal{H}\}$ (the set of dichotomies induced by $\mathcal{H}$).
The growth function $m_{\mathcal{H}}(N)$ is the maximum number of dichotomies $\mathcal{H}$ can induce on any $N$ points:
$m_{\mathcal{H}}(N) = \max_{\mathbf{x}_1, \dots, \mathbf{x}_N} |\mathcal{H}(\mathbf{x}_1, \dots, \mathbf{x}_N)|$, so $m_{\mathcal{H}}(N) \le 2^N$.
Can we replace $|\mathcal{H}|$ by $m_{\mathcal{H}}$, an effective number of hypotheses? (Recall the error bar is $\sqrt{\tfrac{1}{2N}\ln\tfrac{2|\mathcal{H}|}{\delta}}$.) Replacing $|\mathcal{H}|$ with $2^N$ is no help in the bound. (Why?) We want $m_{\mathcal{H}}(N) = \text{poly}(N)$ to get a useful error bar.
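To see why a polynomial growth function would give a useful error bar while $2^N$ does not, here is an arithmetic sketch. It simply plugs a hypothetical polynomial growth function (and the worst case $2^N$) into the error-bar formula above; the substitution of $m_{\mathcal{H}}(N)$ for $|\mathcal{H}|$ is only justified by the theory developed in the coming lectures.

```python
# Sketch of the arithmetic only: plug a hypothetical m_H(N) in place of |H|
# in the error bar sqrt( ln(2*m/delta) / (2N) ). The polynomial N^3 + 1 is an
# arbitrary illustration; 2^N is the no-hope worst case.
import math

def error_bar_from_log(N, log_m, delta=0.05):
    # sqrt( (ln 2 + ln m - ln delta) / (2N) ), using ln m to avoid overflow
    return math.sqrt((math.log(2) + log_m - math.log(delta)) / (2 * N))

for N in [100, 1000, 10000, 100000]:
    log_poly = math.log(N**3 + 1)     # m_H(N) polynomial in N
    log_expo = N * math.log(2)        # m_H(N) = 2^N
    print(N, round(error_bar_from_log(N, log_poly), 3),
             round(error_bar_from_log(N, log_expo), 3))
```

With a polynomial growth function the error bar shrinks to zero as $N$ grows; with $m_{\mathcal{H}}(N) = 2^N$ it never drops below $\sqrt{\ln 2 / 2} \approx 0.59$, which is why replacing $|\mathcal{H}|$ with $2^N$ is no help.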

Example: 2-D Perceptron Model
[Figure: example point configurations and labelings: some labelings the perceptron cannot implement; for 3 points it can implement all 8; for 4 points it can implement at most 14.]
$m_{\mathcal{H}}(3) = 8 = 2^3$ and $m_{\mathcal{H}}(4) = 14 < 2^4$. What is $m_{\mathcal{H}}(5)$?
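The values $m_{\mathcal{H}}(3) = 8$ and $m_{\mathcal{H}}(4) = 14$ can be checked numerically. The sketch below searches over random perceptron weights for a fixed point set; the particular points and the number of samples are arbitrary choices, and random search only gives a lower bound on the number of dichotomies (for these points it finds all of them).

```python
# Sketch: counting 2-D perceptron dichotomies, h(x) = sign(w0 + w1*x1 + w2*x2),
# by random search over weights. Point coordinates and sample count are
# arbitrary; the search yields a lower bound on the number of dichotomies.
import random

random.seed(0)

def perceptron_dichotomies(points, samples=200000):
    seen = set()
    for _ in range(samples):
        w0, w1, w2 = (random.uniform(-1, 1) for _ in range(3))
        seen.add(tuple(1 if w0 + w1*x1 + w2*x2 > 0 else -1 for (x1, x2) in points))
    return seen

three = [(0.0, 0.0), (1.0, 0.2), (0.3, 1.0)]   # 3 points, not collinear
four = three + [(0.9, 0.9)]                    # 4 points in general position

print(len(perceptron_dichotomies(three)))   # 8 = 2^3
print(len(perceptron_dichotomies(four)))    # 14 < 2^4
```

Adding a fifth point in general position lets you explore $m_{\mathcal{H}}(5)$ for yourself.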

Example: 1-D Positive Ray Model
$h(x) = \text{sign}(x - w_0)$
[Figure: a number line with points $x_1, x_2, \dots, x_N$ and the threshold $w_0$; the region to the right of $w_0$ is classified $+1$.]
Consider $N$ points. There are $N + 1$ dichotomies, depending on where you put $w_0$: $m_{\mathcal{H}}(N) = N + 1$.
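The count $m_{\mathcal{H}}(N) = N + 1$ can be verified by trying one threshold in each gap between the sorted points. A small sketch with illustrative point values, not from the lecture:

```python
# Sketch: counting the dichotomies of the 1-D positive ray h(x) = sign(x - w0)
# on N points by trying one threshold below, between, and above the points.
def positive_ray_dichotomies(xs):
    xs = sorted(xs)
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(+1 if x > w0 else -1 for x in xs) for w0 in thresholds}

xs = [0.3, 1.1, 2.4, 3.7, 5.0]
print(len(positive_ray_dichotomies(xs)))   # N + 1 = 6
```

Any $N$ distinct points give the same count, since only their ordering matters.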

Example: Positive Rectangles in 2-D
[Figure: left, $N = 4$ points arranged so that $\mathcal{H}$ implements all dichotomies; right, $N = 5$ points, where some point will be inside a rectangle defined by the others.]
$m_{\mathcal{H}}(4) = 2^4$, while $m_{\mathcal{H}}(5) < 2^5$. We have not computed $m_{\mathcal{H}}(5)$; not impossible, but tricky.
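For positive rectangles, the dichotomy count on a specific point set can be computed exactly: a labeling is achievable precisely when no negatively labeled point lies inside the bounding box of the positively labeled ones. The point sets below are made-up examples (a "diamond" of 4 points, then a fifth point added inside); they are not the configurations drawn on the slide.

```python
# Sketch: exact dichotomy count for positive (axis-aligned) rectangles, where
# points inside the rectangle get +1.
from itertools import combinations

def rectangle_dichotomies(points):
    count = 1                              # the all-minus dichotomy (empty rectangle)
    for k in range(1, len(points) + 1):
        for pos in combinations(points, k):
            xlo, xhi = min(p[0] for p in pos), max(p[0] for p in pos)
            ylo, yhi = min(p[1] for p in pos), max(p[1] for p in pos)
            others = [p for p in points if p not in pos]
            # achievable iff the bounding box of the +1 points contains no -1 point
            if all(not (xlo <= p[0] <= xhi and ylo <= p[1] <= yhi) for p in others):
                count += 1
    return count

four = [(0.0, 1.0), (1.0, 3.0), (2.0, 0.5), (3.0, 2.0)]   # each point extreme in x or y
five = four + [(1.5, 1.5)]
print(rectangle_dichotomies(four))   # 16 = 2^4
print(rectangle_dichotomies(five))   # strictly fewer than 2^5
```

The four-point diamond achieves all $2^4$ dichotomies; for the five-point set at least one labeling fails, in line with the slide's observation that some point will be inside a rectangle defined by the others.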

Example Growth Functions

    N                     1   2   3   4    5
    2-D perceptron        2   4   8   14   ?
    1-D pos. ray          2   3   4   5    6
    2-D pos. rectangles   2   4   8   16   < 2^5

$m_{\mathcal{H}}(N)$ drops below $2^N$ $\Rightarrow$ there is hope for the generalization bound. A break point is any $n$ for which $m_{\mathcal{H}}(n) < 2^n$.

A Combinatorial Puzzle
[Figure: a set of dichotomies on the points $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$.]

A Combinatorial Puzzle
[Figure: a set of dichotomies on $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ in which two of the points are shattered: all four $\pm 1$ patterns appear on that pair.]

A Combinatorial Puzzle
[Figure: a set of dichotomies on $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ in which no pair of points is shattered.]

A Combinatorial Puzzle
[Figure: for $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$, a maximal set of dichotomies with no pair of points shattered; 4 dichotomies is the maximum. An empty table for $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{x}_4$.]
If $N = 4$, how many dichotomies are possible with no 2 points shattered?
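Since $N = 4$ gives only $2^{16}$ candidate sets of dichotomies, the puzzle can be settled by brute force. The sketch below is not from the lecture; it simply searches, from large sets downward, for the biggest collection of dichotomies in which no pair of points shows all four $\pm 1$ patterns.

```python
# Sketch: brute-force search for the puzzle. A pair of points is "shattered"
# by a set of dichotomies if that pair exhibits all four patterns ++, +-, -+, --.
from itertools import product, combinations

def pair_shattered(dichos, i, j):
    return len({(d[i], d[j]) for d in dichos}) == 4

def max_no_pair_shattered(N):
    all_dichos = list(product([-1, +1], repeat=N))
    for size in range(2 ** N, 0, -1):              # try the largest sets first
        for subset in combinations(all_dichos, size):
            if not any(pair_shattered(subset, i, j)
                       for i, j in combinations(range(N), 2)):
                return size
    return 0

print(max_no_pair_shattered(3))   # 4, as on the slide
print(max_no_pair_shattered(4))   # the answer to the puzzle
```

Run for $N = 3$ it reproduces the 4 on the slide; for $N = 4$ it prints the answer to the puzzle (the exhaustive search takes a few seconds).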