PAC Learning. Introduction to Machine Learning. Matt Gormley, Lecture 14, March 5, 2018.

10-601 Introduction to Machine Learning. Machine Learning Department, School of Computer Science, Carnegie Mellon University. PAC Learning. Matt Gormley, Lecture 14, March 5, 2018.

ML Big Picture

Learning Paradigms (what data is available and when? what form of prediction?): supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, active learning, imitation learning, domain adaptation, online learning, density estimation, recommender systems, feature learning, manifold learning, dimensionality reduction, ensemble learning, distant supervision, hyperparameter optimization

Theoretical Foundations (what principles guide learning?): probabilistic, information theoretic, evolutionary search, ML as optimization

Problem Formulation (what is the structure of our output prediction?): boolean = binary classification; categorical = multiclass classification; ordinal = ordinal classification; real = regression; ordering = ranking; multiple discrete = structured prediction; multiple continuous = e.g. dynamical systems; both discrete and continuous = e.g. mixed graphical models

Facets of Building ML Systems (how to build systems that are robust, efficient, adaptive, effective?): 1. data prep, 2. model selection, 3. training (optimization / search), 4. hyperparameter tuning on validation data, 5. (blind) assessment on test data

Application Areas (key challenges?): NLP, speech, computer vision, robotics, medicine, search

Big Ideas in ML (which ideas are driving development of the field?): inductive bias, generalization / overfitting, bias-variance decomposition, generative vs. discriminative, deep nets, graphical models, PAC learning, distant rewards

LEARNING THEORY

Questions For Today
1. Given a classifier with zero training error, what can we say about generalization error? (Sample Complexity, Realizable Case)
2. Given a classifier with low training error, what can we say about generalization error? (Sample Complexity, Agnostic Case)
3. Is there a theoretical justification for regularization to avoid overfitting? (Structural Risk Minimization)

PAC/SLT Models for Supervised Learning. A data source provides instances x drawn from a distribution D on X; an expert / oracle labels them with the target function c* : X → Y; the learning algorithm receives the labeled examples (x_1, c*(x_1)), ..., (x_m, c*(x_m)) and outputs a hypothesis h : X → Y (the slide's example h is a small decision tree testing x_1 > 5 and x_6 > 2 to predict +1 or -1). Slide from Nina Balcan.

Two Types of Error: true error (aka expected risk) and train error (aka empirical risk).
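
For concreteness, here is one standard way to write the two quantities, assuming a data distribution D over X, a target function c* : X → Y, and a training sample of m labeled examples (the notation is assumed, not quoted from the slide):

```latex
% True error (expected risk): probability that h disagrees with the target
% c* on a fresh example drawn from the data distribution D.
\[ R(h) \;=\; \Pr_{x \sim \mathcal{D}}\!\big[\, h(x) \neq c^*(x) \,\big] \]

% Train error (empirical risk): fraction of the m training examples that h
% gets wrong.
\[ \hat{R}(h) \;=\; \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\!\big[\, h(x^{(i)}) \neq c^*(x^{(i)}) \,\big] \]
```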

PAC / SLT Model

Three Hypotheses of Interest
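
A common way to set up the three hypotheses, using the notation above (these specific definitions are an assumption, not text from the slide):

```latex
% The true function (target concept) that generated the labels:
\[ c^* : \mathcal{X} \to \mathcal{Y} \]

% The expected risk minimizer: the best hypothesis in H w.r.t. true error.
\[ h^* \;=\; \arg\min_{h \in \mathcal{H}} R(h) \]

% The empirical risk minimizer: the best hypothesis in H on the training data.
\[ \hat{h} \;=\; \arg\min_{h \in \mathcal{H}} \hat{R}(h) \]
```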

PAC LEARNING

Probably Approximately Correct (PAC) Learning. Whiteboard: PAC criterion; meaning of "probably approximately correct"; PAC learnable; consistent learner; sample complexity.
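
As a reference for the whiteboard portion, the classical PAC criterion and related definitions can be stated as follows (standard textbook forms; the lecture's exact phrasing may differ):

```latex
% PAC criterion: with probability at least 1 - \delta over the draw of the
% training sample, the learned hypothesis h is approximately correct,
% i.e. its true error is at most \epsilon:
\[ \Pr\big[\, R(h) \le \epsilon \,\big] \;\ge\; 1 - \delta \]

% "Approximately correct" is the R(h) <= \epsilon part; "probably" is the
% 1 - \delta part. A concept class is PAC learnable if some algorithm
% achieves this for every \epsilon, \delta > 0 using a number of examples
% m(\epsilon, \delta) polynomial in 1/\epsilon and 1/\delta (the sample
% complexity). A consistent learner outputs an h with \hat{R}(h) = 0
% whenever such an h exists in \mathcal{H}.
```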

Generalization and Overfitting. Whiteboard: realizable vs. agnostic cases; finite vs. infinite hypothesis spaces.

PAC Learning

SAMPLE COMPLEXITY RESULTS

Sample Complexity Results. Four cases we care about: the realizable case and the agnostic case, each for finite and infinite hypothesis spaces. We'll start with the finite case.
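
For reference, the standard finite-|H| bounds behind this table are as follows (classical statements; the constants on the slide may differ slightly):

```latex
% Realizable case (c* in H): with probability at least 1 - \delta, every
% h in H that is consistent with the training data (\hat{R}(h) = 0)
% satisfies R(h) \le \epsilon, provided
\[ m \;\ge\; \frac{1}{\epsilon}\left[ \ln|\mathcal{H}| + \ln\tfrac{1}{\delta} \right] \]

% Agnostic case: with probability at least 1 - \delta, every h in H
% satisfies |R(h) - \hat{R}(h)| \le \epsilon, provided
\[ m \;\ge\; \frac{1}{2\epsilon^2}\left[ \ln|\mathcal{H}| + \ln\tfrac{2}{\delta} \right] \]
```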

Example: Conjunctions. In-class quiz: suppose H = the class of conjunctions over x in {0,1}^M. If M = 10, ε = 0.1, and δ = 0.01, how many examples suffice in the realizable case? In the agnostic case?
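
A worked version of the quiz, under the common assumption that a conjunction may use each of the M variables positively, negatively, or not at all, giving |H| = 3^M (some treatments add one extra hypothesis for the always-false conjunction, which barely changes the numbers):

```latex
% Realizable case, M = 10, \epsilon = 0.1, \delta = 0.01:
\[ m \;\ge\; \frac{1}{0.1}\left[ \ln 3^{10} + \ln\tfrac{1}{0.01} \right]
        \;=\; 10\,\big( 10\ln 3 + \ln 100 \big)
        \;\approx\; 10\,(10.99 + 4.61) \;\approx\; 156 \]

% Agnostic case, same parameters:
\[ m \;\ge\; \frac{1}{2(0.1)^2}\left[ \ln 3^{10} + \ln\tfrac{2}{0.01} \right]
        \;=\; 50\,\big( 10\ln 3 + \ln 200 \big)
        \;\approx\; 50\,(10.99 + 5.30) \;\approx\; 815 \]
```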

Sample Complexity Results. Four cases we care about.
Realizable case: 1. the bound is inversely linear in epsilon (e.g., halving the error requires double the examples); 2. the bound is only logarithmic in |H| (e.g., squaring the hypothesis space only requires double the examples).
Agnostic case: 1. the bound is inversely quadratic in epsilon (e.g., halving the error requires 4x the examples); 2. the bound is only logarithmic in |H| (i.e., same as the realizable case).

Generalization and Overfitting. Whiteboard: sample complexity bounds (agnostic case); corollary (agnostic case); empirical risk minimization; structural risk minimization; motivation for regularization.
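
For reference, the agnostic-case corollary is usually written as a uniform bound on true error, and structural risk minimization turns that bound into a training objective; this is the standard shape of the argument and is what motivates a regularization penalty (a sketch, not a quote of the whiteboard derivation):

```latex
% Corollary (agnostic, finite H): with probability at least 1 - \delta,
% for every h in H,
\[ R(h) \;\le\; \hat{R}(h) \;+\; \sqrt{\frac{\ln|\mathcal{H}| + \ln\tfrac{2}{\delta}}{2m}} \]

% Empirical risk minimization ignores the second term; structural risk
% minimization instead picks the hypothesis (and hypothesis class) that
% minimizes training error plus a complexity penalty, e.g.
\[ \hat{h}_{\mathrm{SRM}} \;=\; \arg\min_{h}\; \hat{R}(h) + \lambda\,\Omega(h) \]
% where \Omega(h) measures complexity, which is the role a regularizer
% plays in a regularized training objective.
```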

Sample Complexity Results. Four cases we care about: realizable and agnostic. For the infinite-|H| results we need a new definition of the complexity of a hypothesis space (see VC dimension).

VC DIMENSION

What if H is infinite? E.g., linear separators in R^d; thresholds w on the real line; intervals [a, b] on the real line. (The slide shows example figures of labeled points for each.)

Definition: Shattering, VC-dimension. H[S] is the set of splittings of dataset S using concepts from H. H shatters S if |H[S]| = 2^|S|. A set of points S is shattered by H if there are hypotheses in H that split S in all of the 2^|S| possible ways; i.e., all possible ways of classifying the points in S are achievable using concepts in H.
Definition: VC-dimension (Vapnik-Chervonenkis dimension). The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞.

Shattering, VC-dimension. Definition: VC-dimension (Vapnik-Chervonenkis dimension). The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H. If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞. To show that the VC-dimension is d: (1) there exists a set of d points that can be shattered, and (2) there is no set of d+1 points that can be shattered. Fact: if H is finite, then VCdim(H) ≤ log_2(|H|), since shattering d points requires at least 2^d distinct hypotheses.

Shattering, VC-dimension. If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered. E.g., H = thresholds on the real line: VCdim(H) = 1 (a single point can be labeled either way by moving the threshold w, but two points cannot receive all four labelings). E.g., H = intervals on the real line: VCdim(H) = 2 (two points can be split in all four ways, but no interval can label three points +, −, + from left to right).

Shattering, VC-dimension. If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered. E.g., H = union of k intervals on the real line: VCdim(H) = 2k. VCdim(H) ≥ 2k: a sample of size 2k can be shattered (treat each consecutive pair of points as a separate interval). VCdim(H) < 2k + 1: labeling 2k + 1 points alternately +, −, +, −, ..., + would require k + 1 disjoint intervals.

Shattering, VC-dimension. E.g., H = linear separators in R^2: VCdim(H) ≥ 3, since three non-collinear points can be labeled in all 2^3 = 8 ways by some line.
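
A quick numerical sanity check of this lower bound (not from the slides): the sketch below runs the perceptron algorithm on every one of the 8 labelings of three non-collinear points and confirms that each labeling is linearly separable. The specific points and the perceptron implementation are illustrative choices.

```python
import itertools
import numpy as np

# Three non-collinear points in R^2 (any points in general position work).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def perceptron_separates(X, y, max_epochs=1000):
    """Run the perceptron algorithm (with a bias term) and report whether it
    finds a hyperplane classifying every point correctly."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:  # misclassified (or on the boundary)
                w += y_i * x_i
                b += y_i
                mistakes += 1
        if mistakes == 0:                  # converged: all points separated
            return True
    return False

# Try all 2^3 = 8 labelings; the points are shattered iff every one is separable.
shattered = all(
    perceptron_separates(X, np.array(labels))
    for labels in itertools.product([-1, +1], repeat=3)
)
print("All 8 labelings linearly separable:", shattered)  # expected: True
```

Because the perceptron only stops updating when every point is classified correctly, a True result certifies that all 8 labelings are achievable, i.e., the three points are shattered.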

Shattering, VC-dimension. E.g., H = linear separators in R^2: VCdim(H) < 4. For any four points, Case 1: one point lies inside the triangle formed by the others; we cannot label the inside point as positive and the outside points as negative. Case 2: all four points lie on the boundary of their convex hull; we cannot label two diagonally opposite points as positive and the other two as negative. Fact: the VC-dimension of linear separators in R^d is d + 1.

Sample Complexity Results. Four cases we care about: realizable and agnostic, now for both finite and infinite hypothesis spaces.
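
With VC dimension in hand, the infinite-|H| entries of the table take the following standard form, stated up to constants (the slide's exact constants may differ):

```latex
% Realizable case, infinite H: with probability at least 1 - \delta, every
% consistent h in H has R(h) \le \epsilon, provided
\[ m \;=\; O\!\left( \frac{1}{\epsilon}\left[ \mathrm{VCdim}(\mathcal{H})\,\log\tfrac{1}{\epsilon} + \log\tfrac{1}{\delta} \right] \right) \]

% Agnostic case, infinite H: with probability at least 1 - \delta, every
% h in H satisfies |R(h) - \hat{R}(h)| \le \epsilon, provided
\[ m \;=\; O\!\left( \frac{1}{\epsilon^2}\left[ \mathrm{VCdim}(\mathcal{H}) + \log\tfrac{1}{\delta} \right] \right) \]
```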

Questions For Today
1. Given a classifier with zero training error, what can we say about generalization error? (Sample Complexity, Realizable Case)
2. Given a classifier with low training error, what can we say about generalization error? (Sample Complexity, Agnostic Case)
3. Is there a theoretical justification for regularization to avoid overfitting? (Structural Risk Minimization)

Learning Theory Objectives. You should be able to:
Identify the properties of a learning setting and the assumptions required to ensure low generalization error
Distinguish true error, train error, and test error
Define PAC and explain what it means to be approximately correct and what occurs with high probability
Apply sample complexity bounds to real-world learning examples
Distinguish between a large sample and a finite sample analysis
Theoretically motivate regularization