An Introduction to Statistical Theory of Learning. Nakul Verma, Janelia, HHMI

Towards formalizing learning
What does it mean to learn a concept? To gain knowledge or experience of the concept.
The basic process of learning:
- Observe a phenomenon
- Construct a model from the observations
- Use that model to make decisions / predictions
How can we make this more precise?

A statistical machinery for learning
Phenomenon of interest:
- Input space $\mathcal{X}$, output space $\mathcal{Y}$.
- There is an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$.
- The learner observes $m$ examples $(x_1, y_1), \ldots, (x_m, y_m)$ drawn from $\mathcal{D}$.
Construct a model (machine learning):
- Let $\mathcal{F}$ be a collection of models, where each $f \in \mathcal{F}$ predicts $y$ given $x$.
- From the observations, select a model $f_m \in \mathcal{F}$ which predicts well, that is, which has small generalization error $\mathrm{err}(f_m) := \Pr_{(x,y)\sim\mathcal{D}}[f_m(x) \neq y]$.
We can say that we have learned the phenomenon if $\mathrm{err}(f_m) \leq \epsilon$ for any tolerance level $\epsilon > 0$ of our choice.
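
To make these quantities concrete, here is a small Python sketch (my own illustration, not part of the slides) contrasting the sample error of a fixed model with a Monte Carlo estimate of its generalization error; the threshold model and the synthetic distribution are assumptions chosen only for the example.

import numpy as np

rng = np.random.default_rng(0)

def draw(n):
    """Draw n examples (x, y) from a synthetic distribution D over X x Y."""
    x = rng.uniform(-1, 1, size=n)
    y = (x + 0.1 * rng.standard_normal(n) > 0).astype(int)   # noisy labels
    return x, y

def f(x):
    """A fixed model: threshold classifier at 0."""
    return (x > 0).astype(int)

x_obs, y_obs = draw(50)                        # the m observed examples
sample_error = np.mean(f(x_obs) != y_obs)

x_big, y_big = draw(1_000_000)                 # Monte Carlo estimate of err(f)
generalization_error = np.mean(f(x_big) != y_big)

print(f"sample error         = {sample_error:.3f}")
print(f"generalization error = {generalization_error:.3f}")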

PAC Learning
For all tolerance levels $\epsilon > 0$ and all confidence levels $\delta > 0$, suppose there exists some model selection algorithm $\mathcal{A}$ that selects a model $f_m \in \mathcal{F}$ from $m = m(\epsilon, \delta)$ observations, i.e. $f_m := \mathcal{A}(x_1, y_1, \ldots, x_m, y_m)$, and has the property: with probability at least $1-\delta$ over the draw of the sample,
$\mathrm{err}(f_m) \leq \inf_{f \in \mathcal{F}} \mathrm{err}(f) + \epsilon$.
Then we call the model class $\mathcal{F}$ PAC-learnable. If the sample size $m(\epsilon, \delta)$ is polynomial in $1/\epsilon$ and $1/\delta$, then $\mathcal{F}$ is efficiently PAC-learnable.
A popular algorithm: the empirical risk minimization (ERM) algorithm, $f_m^{\mathrm{ERM}} := \arg\min_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^m \mathbf{1}[f(x_i) \neq y_i]$.
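
A minimal sketch of the ERM algorithm in Python (assuming, purely for illustration, the finite class of threshold classifiers on a fixed grid; none of these specifics appear on the slide):

import numpy as np

rng = np.random.default_rng(1)

# A finite model class F: threshold classifiers x -> 1[x > t], one per grid value t
thresholds = np.linspace(-1, 1, 101)

def erm(x, y):
    """Empirical risk minimizer: the threshold with the smallest sample error."""
    sample_errors = [np.mean((x > t).astype(int) != y) for t in thresholds]
    return thresholds[int(np.argmin(sample_errors))]

# A sample from a distribution that is unknown to the learner
x = rng.uniform(-1, 1, 200)
y = (x > 0.25).astype(int)
y[rng.random(200) < 0.05] ^= 1                 # 5% label noise

print(f"ERM selects threshold t = {erm(x, y):.2f}")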

PAC learning simple model classes
Theorem (finite size $\mathcal{F}$): Pick any tolerance level $\epsilon > 0$ and any confidence level $\delta > 0$, and let $(x_1,y_1),\ldots,(x_m,y_m)$ be examples drawn from an unknown distribution $\mathcal{D}$. If $m \geq \frac{2}{\epsilon^2}\ln\frac{2|\mathcal{F}|}{\delta}$, then with probability at least $1-\delta$, $\mathrm{err}(f_m^{\mathrm{ERM}}) \leq \inf_{f\in\mathcal{F}}\mathrm{err}(f) + \epsilon$. Hence a finite model class $\mathcal{F}$ is efficiently PAC learnable.
Occam's Razor Principle: all things being equal, the simplest explanation of a phenomenon is usually a good hypothesis. Simplicity = representational succinctness.
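
A quick numerical reading of the bound (assuming the form stated above, $m \geq \frac{2}{\epsilon^2}\ln\frac{2|\mathcal{F}|}{\delta}$): the required sample size grows only logarithmically with the size of the class.

import math

def occam_sample_size(size_F, eps, delta):
    """Sample size sufficient under the finite-class bound m >= (2/eps^2) * ln(2|F|/delta)."""
    return math.ceil((2 / eps**2) * math.log(2 * size_F / delta))

for size_F in (10, 10**3, 10**6):
    print(size_F, occam_sample_size(size_F, eps=0.1, delta=0.05))
# |F| = 10 needs about 1199 examples; |F| = 10^6 still only about 3501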

Proof sketch
Define: $\mathrm{err}(f) := \Pr_{(x,y)\sim\mathcal{D}}[f(x) \neq y]$ (generalization error of $f$) and $\widehat{\mathrm{err}}(f) := \frac{1}{m}\sum_{i=1}^m \mathbf{1}[f(x_i) \neq y_i]$ (sample error of $f$).
Since $\mathrm{err}(f_m^{\mathrm{ERM}}) - \inf_{f\in\mathcal{F}}\mathrm{err}(f) \leq 2\,\sup_{f\in\mathcal{F}} |\mathrm{err}(f) - \widehat{\mathrm{err}}(f)|$, we need to analyze $\sup_{f\in\mathcal{F}} |\mathrm{err}(f) - \widehat{\mathrm{err}}(f)| \to 0$: the uniform deviation of the expectation of a random variable from its sample average.

Proof sketch
Fix any $f \in \mathcal{F}$ and a sample $(x_1,y_1),\ldots,(x_m,y_m)$, and define the random variables $Z_i := \mathbf{1}[f(x_i) \neq y_i]$, so that $\mathbb{E}[Z_i] = \mathrm{err}(f)$ (generalization error of $f$) and $\frac{1}{m}\sum_{i=1}^m Z_i = \widehat{\mathrm{err}}(f)$ (sample error of $f$).
Lemma (Chernoff-Hoeffding bound '63): Let $Z_1,\ldots,Z_m$ be Bernoulli random variables drawn independently from $B(p)$. Then for any tolerance level $\epsilon > 0$,
$\Pr\Big[\big|\tfrac{1}{m}\sum_{i=1}^m Z_i - p\big| > \epsilon\Big] \leq 2 e^{-2m\epsilon^2}$.
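
A small simulation (an illustration added here, with arbitrary settings p = 0.3, m = 100, eps = 0.1) comparing the empirical deviation probability of a Bernoulli sample mean with the 2 exp(-2 m eps^2) bound:

import numpy as np

rng = np.random.default_rng(2)
p, m, eps, trials = 0.3, 100, 0.1, 100_000

Z = rng.random((trials, m)) < p                # trials x m independent Bernoulli(p) draws
empirical = np.mean(np.abs(Z.mean(axis=1) - p) > eps)
hoeffding = 2 * np.exp(-2 * m * eps**2)

print(f"empirical P[|mean - p| > eps] = {empirical:.4f}")   # roughly 0.02 here
print(f"Hoeffding bound               = {hoeffding:.4f}")   # about 0.27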

Proof sketch
We need to analyze $\Pr\big[\sup_{f\in\mathcal{F}} |\mathrm{err}(f) - \widehat{\mathrm{err}}(f)| > \epsilon\big]$. By the union bound over the finitely many models in $\mathcal{F}$, together with the Chernoff-Hoeffding lemma, this probability is at most $2|\mathcal{F}|e^{-2m\epsilon^2}$. Equivalently, by choosing $m \geq \frac{1}{2\epsilon^2}\ln\frac{2|\mathcal{F}|}{\delta}$, with probability at least $1-\delta$, for all $f \in \mathcal{F}$: $|\mathrm{err}(f) - \widehat{\mathrm{err}}(f)| \leq \epsilon$. Applying this with $\epsilon/2$ in place of $\epsilon$ and combining with the inequality from the previous slide yields the theorem.

PAC learning simple model classes
Theorem (Occam's Razor): Pick any tolerance level $\epsilon > 0$ and any confidence level $\delta > 0$, and let $(x_1,y_1),\ldots,(x_m,y_m)$ be examples drawn from an unknown distribution $\mathcal{D}$. If $m \geq \frac{2}{\epsilon^2}\ln\frac{2|\mathcal{F}|}{\delta}$, then with probability at least $1-\delta$, $\mathrm{err}(f_m^{\mathrm{ERM}}) \leq \inf_{f\in\mathcal{F}}\mathrm{err}(f) + \epsilon$. Hence a finite model class $\mathcal{F}$ is efficiently PAC learnable.

Learning general concepts
Consider linear classification. The class of linear classifiers is infinite, so the Occam's Razor bound, which scales with $\ln|\mathcal{F}|$, is ineffective.

VC Theory
We need to capture the true richness of $\mathcal{F}$.
Definition (Vapnik-Chervonenkis or VC dimension): We say that a model class $\mathcal{F}$ has VC dimension $d$ if $d$ is the size of the largest set of points $x_1,\ldots,x_d$ such that for every possible labelling of $x_1,\ldots,x_d$ there exists some $f \in \mathcal{F}$ that achieves that labelling.
Example: $\mathcal{F}$ = linear classifiers in $\mathbb{R}^2$. Linear classifiers can realize all possible labellings of some set of 3 points, but CANNOT realize all labellings of any 4 points, so VC($\mathcal{F}$) = 3.
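
The 3-point claim can be checked mechanically. The brute-force search below (an illustrative check over many candidate directions, not an exact algorithm) verifies that every labelling of three points in general position is realizable by a linear classifier:

import itertools
import numpy as np

P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # 3 points in general position

# Candidate directions w on the unit circle; thresholds b between sorted projections
angles = np.linspace(0, 2 * np.pi, 720, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

def realizable(labels):
    """Is there a linear classifier sign(w.x - b) producing these +/-1 labels?"""
    for w in directions:
        proj = np.sort(P @ w)
        cuts = np.concatenate(([proj[0] - 1], (proj[:-1] + proj[1:]) / 2, [proj[-1] + 1]))
        for b in cuts:
            if np.all(np.sign(P @ w - b) == labels):
                return True
    return False

all_labellings = itertools.product([-1, 1], repeat=3)
print(all(realizable(np.array(l)) for l in all_labellings))   # True: the 3 points are shattered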

VC Theorem
Theorem (Vapnik-Chervonenkis '71): Pick any tolerance level $\epsilon > 0$ and any confidence level $\delta > 0$, let $(x_1,y_1),\ldots,(x_m,y_m)$ be examples drawn from an unknown distribution $\mathcal{D}$, and let $d = \mathrm{VC}(\mathcal{F})$. If $m \geq C\,\frac{d\ln(1/\epsilon) + \ln(1/\delta)}{\epsilon^2}$ for an absolute constant $C$, then with probability at least $1-\delta$, $\mathrm{err}(f_m^{\mathrm{ERM}}) \leq \inf_{f\in\mathcal{F}}\mathrm{err}(f) + \epsilon$. Hence a model class of finite VC dimension is efficiently PAC learnable.
The VC Theorem is the analogue of the Occam's Razor Theorem, with $\ln|\mathcal{F}|$ replaced by the VC dimension $d$.
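
In the same spirit as the finite-class calculation earlier, a rough sample-size estimate from the VC bound (the constant C = 8 is an arbitrary illustrative choice; the theorem only guarantees some absolute constant):

import math

def vc_sample_size(d, eps, delta, C=8):
    """Sample size suggested by a VC bound of the form C*(d*ln(1/eps) + ln(1/delta))/eps^2."""
    return math.ceil(C * (d * math.log(1 / eps) + math.log(1 / delta)) / eps**2)

# Linear classifiers in R^2 have VC dimension d = 3
print(vc_sample_size(d=3, eps=0.1, delta=0.05))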

Tightness of VC bound
Theorem (VC lower bound): Let $\mathcal{A}$ be any model selection algorithm that, given $m$ samples, returns a model from $\mathcal{F}$, that is, $f_m := \mathcal{A}(x_1,y_1,\ldots,x_m,y_m)$. For all tolerance levels $\epsilon > 0$ and all confidence levels $\delta > 0$, there exists a distribution $\mathcal{D}$ such that if $m \leq c\,\frac{d + \ln(1/\delta)}{\epsilon^2}$ (for an absolute constant $c$ and $d = \mathrm{VC}(\mathcal{F})$), then with probability greater than $\delta$, $\mathrm{err}(f_m) > \inf_{f\in\mathcal{F}}\mathrm{err}(f) + \epsilon$.

Some implications
The VC dimension of a model class fully characterizes its learning ability!
The results are agnostic to the underlying distribution.

One algorithm to rule them all?
From our discussion it may seem that the ERM algorithm is universally consistent. This is not the case!
Theorem (no free lunch, Devroye '82): Pick any sample size $m$, any algorithm $\mathcal{A}$ and any $\epsilon > 0$. There exists a distribution $\mathcal{D}$ such that $\mathrm{err}(f_m) > \tfrac{1}{2} - \epsilon$, while the Bayes optimal error, $\inf_{g\ \text{measurable}} \mathrm{err}(g)$, is $0$.

Further refinements and extensions
How to do model class selection? Structural risk minimization results.
Dealing with kernels? Fat margin theory.
Incorporating priors over the models? PAC-Bayes theory.
Is it possible to get distribution-dependent bounds? Rademacher complexity.
How about regression? One can derive similar results for nonparametric regression.

Questions / Discussion

Thank You!
References:
[1] Kearns and Vazirani. An Introduction to Computational Learning Theory.
[2] Devroye, Györfi and Lugosi. A Probabilistic Theory of Pattern Recognition.
[3] Anthony and Bartlett. Neural Network Learning: Theoretical Foundations.