Statistical Learning: Learning From Examples


Statistical Learning: Learning From Examples. We want to estimate the working temperature range of an iPhone. We could study the physics and chemistry that affect the performance of the phone (too hard). We could sample temperatures in [-100C, +100C] and check whether the iPhone works at each of these temperatures. We could sample users' iPhones for failures versus temperature. How many samples do we need? How good is the result? [Figure: an interval [a, b] inside the range -100C to +100C.]

Sample Complexity. Sample complexity answers the fundamental questions in machine learning / statistical learning / data mining / data analysis: Does the data (training set) contain sufficient information to make a valid prediction (or fix a model)? Is the sample sufficiently large? How accurate is a prediction (model) inferred from a sample of a given size? Standard statistical/probabilistic techniques do not give adequate solutions.

Outline: Example: learning binary classification; detection vs. estimation; uniform convergence; VC-dimension; the ɛ-net and ɛ-sample theorems; applications in learning and data analysis; Rademacher complexity; applications of Rademacher complexity.

Example. An alien arrives in Providence. He has a perfect infrared sensor that detects the temperature. He wants to know when the locals say that it's warm (in contrast to cold or hot), so he can speak like a local. He asks everyone he meets and gets a collection of answers: (90F, hot), (40F, cold), (60F, warm), (85F, hot), (75F, warm), (30F, cold), (55F, warm)... He decides that the locals use warm for temperatures between 47.5F and 80F. How wrong can he be? How do we measure wrong? How about inconsistent training examples?...
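To make the example concrete, here is a minimal Python sketch. The labeled temperatures are the ones from the slide; the "midpoint" rule below is one plausible way to reproduce the 47.5F-80F interval, not necessarily the exact rule the lecturer has in mind.

```python
# A hedged sketch of the alien's rule. Data points are from the slide; the
# midpoint rule is an illustrative assumption that yields [47.5F, 80.0F].
answers = [(90, "hot"), (40, "cold"), (60, "warm"), (85, "hot"),
           (75, "warm"), (30, "cold"), (55, "warm")]

warm = sorted(t for t, label in answers if label == "warm")
cooler = [t for t, label in answers if label == "cold" and t < warm[0]]
hotter = [t for t, label in answers if label == "hot" and t > warm[-1]]

# Place each boundary halfway between the closest conflicting examples.
low = (max(cooler) + warm[0]) / 2 if cooler else warm[0]
high = (min(hotter) + warm[-1]) / 2 if hotter else warm[-1]
print(f"learned rule: warm means [{low}F, {high}F]")  # [47.5F, 80.0F]
```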

What's Learning? Two types of learning. What's a rectangle? (1) A rectangle is any quadrilateral with four right angles. (2) Here are many random examples of rectangles, and here are many random examples of shapes that are not rectangles; make your own rule that best conforms with the examples - Statistical Learning.

Learning From Examples. The alien had n random training examples from distribution D. A rule [a, b] conforms with the examples. The alien uses this rule to decide on the next example. If the next example is drawn from D, what is the probability that he is wrong? Let [c, d] be the correct rule. Let Δ = ([a, b] \ [c, d]) ∪ ([c, d] \ [a, b]). The alien is wrong only on examples in Δ.

What's the probability that the alien is wrong? The alien is wrong only on examples in Δ. The probability that the alien is wrong is the probability of getting a query from Δ. If Prob(sample from Δ) ≤ ɛ we don't care. If Prob(sample from Δ) ≥ ɛ, then the probability that all n training samples missed Δ is bounded by (1 − ɛ)^n ≤ δ for n ≥ (1/ɛ) ln(1/δ). Thus, with n ≥ (1/ɛ) ln(1/δ) training samples, with probability 1 − δ we chose a rule (interval) that gives the correct answer for queries from D with probability 1 − ɛ.
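As a quick numeric illustration of the bound n ≥ (1/ɛ) ln(1/δ) (the values ɛ = 0.05 and δ = 0.01 are arbitrary examples, not from the slides):

```python
import math

def samples_needed(eps, delta):
    # n >= (1/eps) * ln(1/delta) guarantees (1 - eps)^n <= delta.
    return math.ceil((1 / eps) * math.log(1 / delta))

print(samples_needed(0.05, 0.01))  # 93 samples suffice for eps=0.05, delta=0.01
```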

Learning a Binary Classifier. An unknown probability distribution D on a domain U. An unknown correct classification: a partition c of U into In and Out sets. Input: a concept class C, a collection of possible classification rules (partitions of U), and a training set {(x_i, c(x_i)) | i = 1, ..., m}, where x_1, ..., x_m are sampled from D. Goal: with probability 1 − δ the algorithm generates a classification that is correct (on items generated from D) with probability ≥ opt(C) − ɛ, where opt(C) is the probability of correctness of the best classification rule in C.

Learning a Binary Classifier. [Figure: Out and In items, and a concept class C of possible classification rules.]

When does the sample identify the correct rule? - The realizable case. The realizable case: the correct classification c ∈ C. Algorithm: choose h* ∈ C that agrees with all of the training set (there must be at least one). For any h ∈ C let Δ(c, h) be the set of items on which the two classifiers differ: Δ(c, h) = {x ∈ U | h(x) ≠ c(x)}. If the sample (training set) intersects every set in {Δ(c, h) | Pr(Δ(c, h)) ≥ ɛ}, then Pr(Δ(c, h*)) ≤ ɛ.
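A minimal sketch of the realizable-case algorithm, assuming the concept class is finite and given explicitly as a list of Python predicates. The grid of interval hypotheses and the training points are illustrative assumptions, not from the lecture.

```python
def consistent_learner(hypotheses, training_set):
    # Realizable case: return any h in C that agrees with every labeled example.
    for h in hypotheses:
        if all(h(x) == label for x, label in training_set):
            return h
    raise ValueError("no consistent hypothesis (class is not realizable)")

# Example: intervals [c, d] on a small grid as the concept class.
hypotheses = [lambda x, c=c, d=d: c <= x <= d
              for c in range(0, 101, 5) for d in range(0, 101, 5) if c < d]
training = [(10, False), (40, True), (55, True), (80, False)]
h_star = consistent_learner(hypotheses, training)
print(h_star(50), h_star(90))  # True False
```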

Learning a Binary Classifier. [Figure: Red and blue items, possible classification rules, and the sample items.]

When does the sample identify the correct rule? The unrealizable (agnostic) case. The unrealizable case: c may not be in C. For any h ∈ C, let Δ(c, h) be the set of items on which the two classifiers differ: Δ(c, h) = {x ∈ U | h(x) ≠ c(x)}. For the training set {(x_i, c(x_i)) | i = 1, ..., m}, let the empirical probability be P̂r(Δ(c, h)) = (1/m) Σ_{i=1}^m 1[h(x_i) ≠ c(x_i)]. Algorithm: choose h* = arg min_{h ∈ C} P̂r(Δ(c, h)). If for every set Δ(c, h), |P̂r(Δ(c, h)) − Pr(Δ(c, h))| ≤ ɛ, then Pr(Δ(c, h*)) ≤ opt(C) + 2ɛ, where opt(C) = min_{h ∈ C} Pr(Δ(c, h)).
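A corresponding sketch for the agnostic case: empirical risk minimization over the same illustrative finite class of intervals. The noisy training set (the point 55 appears with both labels) is an assumption chosen to make no hypothesis consistent.

```python
def empirical_error(h, training_set):
    # Fraction of training examples on which h disagrees with the label,
    # i.e. the empirical estimate of Pr(Delta(c, h)).
    return sum(h(x) != label for x, label in training_set) / len(training_set)

def erm(hypotheses, training_set):
    # Agnostic case: return the hypothesis with the smallest empirical error.
    return min(hypotheses, key=lambda h: empirical_error(h, training_set))

hypotheses = [lambda x, c=c, d=d: c <= x <= d
              for c in range(0, 101, 5) for d in range(0, 101, 5) if c < d]
training = [(10, False), (40, True), (55, True), (55, False), (80, False)]
best = erm(hypotheses, training)
print(empirical_error(best, training))  # 0.2 -- one mistake is unavoidable
```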

Detection vs. Estimation. Input: a concept class C, a collection of possible classification rules (partitions of U), and a training set {(x_i, c(x_i)) | i = 1, ..., m}, where x_1, ..., x_m are sampled from D. For any h ∈ C, let Δ(c, h) be the set of items on which the two classifiers differ: Δ(c, h) = {x ∈ U | h(x) ≠ c(x)}. For the realizable case we need a training set (sample) that with probability 1 − δ intersects every set in {Δ(c, h) | Pr(Δ(c, h)) ≥ ɛ} (an ɛ-net). For the unrealizable case we need a training set that with probability 1 − δ estimates, within additive error ɛ, the probability of every set Δ(c, h) = {x ∈ U | h(x) ≠ c(x)} (an ɛ-sample).

Learnability - Uniform Convergence. Theorem: In the realizable case, any finite concept class C can be learned with m = (1/ɛ)(ln |C| + ln(1/δ)) samples. Proof. We need a sample that intersects every set in the family of sets {Δ(c, c') | Pr(Δ(c, c')) ≥ ɛ}. There are at most |C| such sets, and the probability that a random sample point falls inside any one of these sets is at least ɛ. The probability that m random samples did not intersect at least one of the sets is bounded by |C|(1 − ɛ)^m ≤ |C| e^{−ɛm} ≤ |C| e^{−(ln |C| + ln(1/δ))} = δ.
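A hedged helper that evaluates the bound m = (1/ɛ)(ln |C| + ln(1/δ)); the example numbers are arbitrary, not from the slides:

```python
import math

def realizable_sample_bound(class_size, eps, delta):
    # m = (1/eps) * (ln|C| + ln(1/delta)) samples suffice in the realizable case.
    return math.ceil((math.log(class_size) + math.log(1 / delta)) / eps)

# e.g. |C| = 10^6 hypotheses, eps = 0.01, delta = 0.001
print(realizable_sample_bound(10**6, 0.01, 0.001))  # 2073 samples
```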

How Good is this Bound? Assume that we want to estimate the working temperature range of an iPhone. We sample temperatures in [-100C, +100C] and check if the iPhone works at each of these temperatures. [Figure: an interval [a, b] inside the range -100C to +100C.]

Learning an Interval. A distribution D is defined on a universe that is an interval [A, B]. The true classification rule is defined by a sub-interval [a, b] ⊆ [A, B]. The concept class C is the collection of all intervals, C = {[c, d] | [c, d] ⊆ [A, B]}. Theorem: There is a learning algorithm that, given a sample from D of size m = (2/ɛ) ln(2/δ), with probability 1 − δ returns a classification rule (interval) [x, y] that is correct with probability ≥ 1 − ɛ. Note that the sample size is independent of the size of the concept class C, which is infinite.

Learning an Interval. If the classification error is ≥ ɛ, then the sample missed at least one of the intervals [a, a'] or [b', b], each of probability ɛ/2. [Figure: A ≤ a ≤ a' ≤ x ≤ y ≤ b' ≤ b ≤ B, with the two ɛ/2-probability strips [a, a'] and [b', b] marked.] Each sample excludes many possible intervals. The union bound sums over overlapping hypotheses. Need a better characterization of the concept's complexity!

Proof. Algorithm: Choose the smallest interval [x, y] that includes all the In sample points. Clearly a ≤ x < y ≤ b, and the algorithm can only err in classifying In points as Out points. Fix a' > a and b' < b such that Pr([a, a']) = ɛ/2 and Pr([b', b]) = ɛ/2. If the probability of error when using the classification [x, y] is ≥ ɛ, then either x ≥ a' or y ≤ b' (or both), i.e., the sample missed the interval [a, a'] or the interval [b', b]. The probability that the sample of size m = (2/ɛ) ln(2/δ) did not intersect one of these intervals is bounded by 2(1 − ɛ/2)^m ≤ e^{−ɛm/2 + ln 2} ≤ δ.
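The theorem can also be checked empirically. In the sketch below, D is assumed to be uniform on [0, 1] and the true interval is [0.2, 0.7]; these choices, the parameters, and the trial count are mine, not from the lecture.

```python
import math, random

def trial(a, b, eps, m):
    # Sample m points uniformly from [0, 1]; learn the tightest interval
    # around the In points; report whether its error exceeds eps.
    sample = [random.random() for _ in range(m)]
    ins = [x for x in sample if a <= x <= b]
    if not ins:
        return True  # empty rule: its error is Pr([a, b]), which is > eps here
    x, y = min(ins), max(ins)
    error = (x - a) + (b - y)  # mass of [a, b] misclassified as Out
    return error > eps

eps, delta = 0.1, 0.1
m = math.ceil((2 / eps) * math.log(2 / delta))  # the bound from the theorem
failures = sum(trial(0.2, 0.7, eps, m) for _ in range(10000))
print(m, failures / 10000)  # failure rate should be well below delta = 0.1
```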

The union bound is far too loose for our applications. It sums over overlapping hypotheses. Each sample excludes many possible intervals. Need a better characterization of the concept's complexity!

Probably Approximately Correct Learning (PAC Learning). The goal is to learn a concept (hypothesis) from a pre-defined concept class (an interval, a rectangle, a k-CNF Boolean formula, etc.). There is an unknown distribution D on input instances. Correctness of the algorithm is measured with respect to the distribution D. The goal: a polynomial-time (and polynomial number of samples) algorithm that with probability 1 − δ computes a hypothesis of the target concept that is correct (on a random instance) with probability ≥ 1 − ɛ.

Formal Definition. We have a unit-cost function Oracle(c, D) that produces a pair (x, c(x)), where x is distributed according to D and c(x) is the value of the concept c at x. Successive calls are independent. A concept class C over input set X is PAC learnable if there is an algorithm L with the following properties: for every concept c ∈ C, every distribution D on X, and every 0 < ɛ, δ < 1/2, given access to Oracle(c, D), ɛ, and δ, with probability 1 − δ the algorithm outputs a hypothesis h ∈ C such that Pr_D(h(x) ≠ c(x)) ≤ ɛ. The concept class C is efficiently PAC learnable if the algorithm runs in time polynomial in the size of the problem, 1/ɛ, and 1/δ. So far we showed that the concept class of intervals on the line is efficiently PAC learnable.
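The oracle interface can be mocked as follows; the uniform distribution and the interval concept in the example are illustrative assumptions only.

```python
import random

def make_oracle(c, sample_from_D):
    # Each call draws x ~ D (independently) and returns the labeled pair (x, c(x)).
    def oracle():
        x = sample_from_D()
        return x, c(x)
    return oracle

# Example: D is uniform on [0, 100], the target concept is the interval [20, 70].
oracle = make_oracle(lambda x: 20 <= x <= 70, lambda: random.uniform(0, 100))
training_set = [oracle() for _ in range(10)]
print(training_set[:3])
```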

Learning Axis-Aligned Rectangles. Concept class: all axis-aligned rectangles. Given m samples {(x_i, y_i, class_i)}, i = 1, ..., m, let R' be the smallest rectangle that contains all the positive examples; A(R') is the corresponding algorithm. Let R be the correct concept. W.l.o.g. Pr(R) > ɛ. Define 4 strips along the sides of R, each with probability ɛ/4: r_1, r_2, r_3, r_4. If Pr(error of A(R')) ≥ ɛ, then there is an i ∈ {1, 2, 3, 4} such that R' misses the strip r_i of probability ɛ/4 entirely, and hence there were no training examples in r_i. Therefore Pr(error of A(R') ≥ ɛ) ≤ 4(1 − ɛ/4)^m.
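A minimal sketch of the algorithm A(R'): take the smallest axis-aligned rectangle containing the positive examples. The sample points are illustrative.

```python
def tightest_rectangle(samples):
    # samples: list of ((x, y), label); return the smallest axis-aligned
    # rectangle containing all positive examples, as (x_min, x_max, y_min, y_max).
    pos = [(x, y) for (x, y), label in samples if label]
    xs, ys = [x for x, _ in pos], [y for _, y in pos]
    return min(xs), max(xs), min(ys), max(ys)

samples = [((1, 1), False), ((2, 3), True), ((4, 2), True),
           ((5, 5), False), ((3, 4), True)]
print(tightest_rectangle(samples))  # (2, 4, 2, 4)
```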

Learning Axis-Aligned Rectangles - More than One Solution. Concept class: all axis-aligned rectangles. Given m samples {(x_i, y_i, class_i)}, i = 1, ..., m, let R' be the smallest rectangle that contains all the positive examples, and let R'' be the largest rectangle that contains no negative examples. Let R be the correct concept, so R' ⊆ R ⊆ R''. Define 4 strips (inside R for R', outside R for R''), each with probability ɛ/4 of R: r_1, r_2, r_3, r_4. Again Pr(error of A(·) ≥ ɛ) ≤ 4(1 − ɛ/4)^m.

Learning Boolean Conjunctions. A Boolean literal is either x_i or x̄_i. A conjunction is x_i ∧ x̄_j ∧ x_k ∧ .... C is the set of conjunctions of up to 2n literals. The input space is {0, 1}^n. Theorem: The class of conjunctions of Boolean literals is efficiently PAC learnable.

Proof. Start with the hypothesis h = x_1 ∧ x̄_1 ∧ ... ∧ x_n ∧ x̄_n. Ignore negative examples generated by Oracle(c, D). For a positive example (a_1, ..., a_n), if a_i = 1 remove x̄_i from h, otherwise remove x_i. Lemma: At any step of the algorithm the current hypothesis never errs on a negative example. It may err on positive examples by not removing enough literals from h. Proof. Initially the hypothesis has no satisfying assignment, and it acquires one only once no literal appears in it together with its complement. A literal is removed only when it contradicts a positive example, and thus cannot be in c; hence literals of c are never removed. A negative example must contradict some literal of c, and therefore is not satisfied by h.
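A minimal sketch of this elimination algorithm; the (index, sign) representation of literals is my own choice, not from the lecture.

```python
def learn_conjunction(n, positive_examples):
    # Hypothesis: set of literals, (i, True) for x_i and (i, False) for not-x_i.
    # Start with all 2n literals and drop any literal a positive example violates.
    h = {(i, val) for i in range(n) for val in (True, False)}
    for a in positive_examples:          # negative examples are ignored
        h = {(i, val) for (i, val) in h if a[i] == val}
    return h

# Target concept: x_0 and not-x_2 over n = 3 variables.
positives = [(1, 0, 0), (1, 1, 0)]
print(sorted(learn_conjunction(3, positives)))
# [(0, True), (2, False)] -- exactly the target's literals survive
```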

Analysis. The learned hypothesis h can only err by rejecting a positive example (it rejects an input unless it has seen a similar positive example in the training set). If h errs on a positive example then h has a literal that is not in c. Let z be a literal in h and not in c, and let p(z) = Pr_{a~D}(c(a) = 1 and z = 0 in a). A literal z is bad if p(z) > ɛ/(2n). Let m ≥ (2n/ɛ)(ln(2n) + ln(1/δ)). The probability that after m samples any bad literal is left in the hypothesis is bounded by 2n(1 − ɛ/(2n))^m ≤ δ. If no bad literal survives, the error of h is at most 2n · ɛ/(2n) = ɛ.

Two fundamental questions: What concept classes are PAC learnable with a given number of (random) training examples? What concept classes are efficiently learnable (in polynomial time)? There is a complete (and beautiful) characterization for the first question, and a not very satisfying answer for the second one. Some examples: Efficiently PAC learnable: intervals in R, rectangles in R^2, disjunctions of up to n variables, 3-CNF formulas, ... PAC learnable, but not in polynomial time (unless P = NP): DNF formulas, finite automata, ... Not PAC learnable: convex bodies in R^2, {sin(hx) | 0 ≤ h ≤ π}, ...

Uniform Convergence [Vapnik-Chervonenkis 1971]. Definition: A set of functions F has the uniform convergence property with respect to a domain Z if there is a function m_F(ɛ, δ) such that for any ɛ, δ > 0, m_F(ɛ, δ) < ∞, and for any distribution D on Z and any sample z_1, ..., z_m of size m = m_F(ɛ, δ), Pr(sup_{f ∈ F} |(1/m) Σ_{i=1}^m f(z_i) − E_D[f]| ≤ ɛ) ≥ 1 − δ. Let f_E(z) = 1_{z ∈ E}; then E[f_E(z)] = Pr(E).

Uniform Convergence and Learning. Definition: A set of functions F has the uniform convergence property with respect to a domain Z if there is a function m_F(ɛ, δ) such that for any ɛ, δ > 0, m_F(ɛ, δ) < ∞, and for any distribution D on Z and any sample z_1, ..., z_m of size m = m_F(ɛ, δ), Pr(sup_{f ∈ F} |(1/m) Σ_{i=1}^m f(z_i) − E_D[f]| ≤ ɛ) ≥ 1 − δ. Let F_H = {f_h | h ∈ H}, where f_h is the loss function for hypothesis h. F_H has the uniform convergence property ⇒ an ERM (Empirical Risk Minimization) algorithm learns H. The sample complexity of learning H is bounded by m_{F_H}(ɛ, δ).

Uniform Convergence - 1971, PAC Learning - 1984. Definition: A set of functions F has the uniform convergence property with respect to a domain Z if there is a function m_F(ɛ, δ) such that for any ɛ, δ > 0, m_F(ɛ, δ) < ∞, and for any distribution D on Z and any sample z_1, ..., z_m of size m = m_F(ɛ, δ), Pr(sup_{f ∈ F} |(1/m) Σ_{i=1}^m f(z_i) − E_D[f]| ≤ ɛ) ≥ 1 − δ. Let F_H = {f_h | h ∈ H}, where f_h is the loss function for hypothesis h. F_H has the uniform convergence property ⇒ an ERM (Empirical Risk Minimization) algorithm learns H. H is efficiently PAC learnable if there is a polynomial-time (ɛ, δ)-approximation algorithm for the minimum ERM.

Uniform Convergence. Definition: A set of functions F has the uniform convergence property with respect to a domain Z if there is a function m_F(ɛ, δ) such that for any ɛ, δ > 0, m_F(ɛ, δ) < ∞, and for any distribution D on Z and any sample z_1, ..., z_m of size m = m_F(ɛ, δ), Pr(sup_{f ∈ F} |(1/m) Σ_{i=1}^m f(z_i) − E_D[f]| ≤ ɛ) ≥ 1 − δ. VC-dimension and Rademacher complexity are the two major techniques to prove that a set of functions F has the uniform convergence property and to characterize the function m_F(ɛ, δ).

Some Background. Let f_x(z) = 1_{z ≤ x} (the indicator function of the event {z ≤ x}) and F_m(x) = (1/m) Σ_{i=1}^m f_x(z_i) (the empirical distribution function). Strong Law of Large Numbers: for a given x, F_m(x) → F(x) = Pr(z ≤ x) almost surely. Glivenko-Cantelli Theorem: sup_{x ∈ R} |F_m(x) − F(x)| → 0 almost surely. Dvoretzky-Kiefer-Wolfowitz Inequality: Pr(sup_{x ∈ R} |F_m(x) − F(x)| ≥ ɛ) ≤ 2e^{−2mɛ²}. VC-dimension characterizes the uniform convergence property for arbitrary sets of events.
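A small illustration of the empirical distribution function and the DKW inequality, using uniform samples on [0, 1] so that the true CDF is F(x) = x; the sample size and ɛ are arbitrary choices.

```python
import math, random

def max_cdf_deviation(m):
    # Draw m uniform(0,1) samples; for the uniform distribution F(x) = x,
    # so sup_x |F_m(x) - F(x)| can be evaluated at the sorted sample points.
    z = sorted(random.random() for _ in range(m))
    return max(max(abs((i + 1) / m - x), abs(i / m - x)) for i, x in enumerate(z))

m, eps = 1000, 0.05
dev = max_cdf_deviation(m)
dkw = 2 * math.exp(-2 * m * eps**2)   # DKW: Pr(sup |F_m - F| >= eps) <= this
print(round(dev, 3), "DKW bound on exceeding 0.05:", round(dkw, 4))
```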