Computational and Statistical Learning Theory


Problem set 2
Due: January 31st. Email solutions to: karthik at ttic dot edu.

Notation: Input space: $\mathcal{X}$. Label space: $\mathcal{Y} = \{\pm 1\}$. Sample: $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathcal{Y}$. Hypothesis class: $\mathcal{H}$. Risk: $R(h) = \mathbb{E}\left[\mathbb{1}_{h(x) \neq y}\right]$. Empirical risk: $\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{h(x_i) \neq y_i}$.

1. Shatter Lemma: Given a set $S = \{x_1, \ldots, x_m\}$, let
$$\mathcal{H}|_{x_1, \ldots, x_m} = \left\{ (h(x_1), \ldots, h(x_m)) \in \{\pm 1\}^m : h \in \mathcal{H} \right\}.$$
Recall that we say that such a set is shattered by $\mathcal{H}$ if $|\mathcal{H}|_{x_1, \ldots, x_m}| = 2^m$, and that the VC dimension of $\mathcal{H}$ is the size of the largest sample that can be shattered. Also recall that the growth function of the hypothesis class $\mathcal{H}$ is given by
$$\Pi_{\mathcal{H}}(m) = \sup_{x_1, \ldots, x_m} |\mathcal{H}|_{x_1, \ldots, x_m}|.$$
That is, we can also define the VC dimension as the largest $m$ for which $\Pi_{\mathcal{H}}(m) = 2^m$. The aim of this exercise is to prove the Shatter Lemma: if $\mathcal{H}$ has VC dimension $d$, then for any $m$,
$$\Pi_{\mathcal{H}}(m) \leq \sum_{i=0}^{d} \binom{m}{i}. \qquad (1)$$
In order to prove (1), we will actually prove the following statement: for any set $S = \{x_1, \ldots, x_m\}$,
$$|\mathcal{H}|_S| \leq \left|\{B \subseteq S : B \text{ is shattered by } \mathcal{H}\}\right|. \qquad (2)$$
That is, the number of possible labelings of $S$ is bounded by the number of different subsets of $S$ that can be shattered. We (i.e. you) will prove (2) by induction.

(a) Establish that (2) holds for $S = \emptyset$ (the empty set).

(b) For any set $S$ and any point $x' \notin S$, assume (2) holds for $S$ and for any hypothesis class, and prove that (2) holds for $S' = S \cup \{x'\}$ and any hypothesis class. To this end, for any hypothesis class $\mathcal{H}$, write $\mathcal{H} = \mathcal{H}^- \cup \mathcal{H}^+$ where
$$\mathcal{H}^+ = \{h \in \mathcal{H} : h(x') = +1\}, \qquad \mathcal{H}^- = \{h \in \mathcal{H} : h(x') = -1\}.$$
  i. Prove that $|\mathcal{H}|_{S'}| = |\mathcal{H}^+|_{S}| + |\mathcal{H}^-|_{S}|$.
  ii. Prove that (2) holds for $S'$ and $\mathcal{H}$ by applying (2) to each of the two terms on the right-hand side above.
We can now conclude that (2) holds for any (finite) $S$ and any $\mathcal{H}$.

(c) Use (2) to establish (1).

(d) For $d \leq m$, prove that
$$\sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^{d}.$$
Optional: Prove a tighter bound on $\sum_{i=0}^{d} \binom{m}{i}$.
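As an illustration (not part of the assignment), the definitions above are easy to check by brute force for small finite classes. The Python sketch below computes the restriction $\mathcal{H}|_S$, tests shattering, evaluates the growth function over a finite pool of points, and compares it to the right-hand side of (1); the class of one-dimensional thresholds used here is an arbitrary choice.

```python
from itertools import combinations
from math import comb

def restriction(hypotheses, S):
    """H|_S: the distinct label vectors that the class induces on the points in S."""
    return {tuple(h(x) for x in S) for h in hypotheses}

def is_shattered(hypotheses, S):
    return len(restriction(hypotheses, S)) == 2 ** len(S)

def growth_on_pool(hypotheses, pool, m):
    """max_S |H|_S| over subsets S of the pool of size m (a lower bound on Pi_H(m))."""
    return max(len(restriction(hypotheses, S)) for S in combinations(pool, m))

# Toy class: thresholds h_t(x) = +1 iff x >= t.  Its VC dimension is 1.
pool = [0.1, 0.25, 0.4, 0.7, 0.9]
thresholds = [-1.0, 0.2, 0.3, 0.5, 0.8, 2.0]
H = [lambda x, t=t: +1 if x >= t else -1 for t in thresholds]

d = max(m for m in range(len(pool) + 1)
        if m == 0 or any(is_shattered(H, S) for S in combinations(pool, m)))
print("estimated VC dimension:", d)
for m in range(1, len(pool) + 1):
    sauer = sum(comb(m, i) for i in range(d + 1))   # right-hand side of (1)
    print(f"m = {m}: growth {growth_on_pool(H, pool, m)} <= Sauer bound {sauer}")
```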

2. VC Dimension:

(a) Consider the hypothesis class $\mathcal{H}_{\circ}$ of positive circles in $\mathbb{R}^2$, that is, the set of all hypotheses that are positive inside some circle and negative outside it. Calculate the VC dimension of this class, and show that this is the exact value of the VC dimension.

(b) Consider the hypothesis class $\mathcal{H}_{\pm\circ}$ of both positive and negative circles in $\mathbb{R}^2$, that is, the set of all hypotheses that are positive inside some circle and negative outside it, together with all hypotheses that are negative inside some circle and positive outside it. Show how to shatter 4 points using this class, and thereby establish a lower bound of 4 on the VC dimension of the class.

We now consider the VC dimension of the class $\mathcal{H}_d$ of linear separators in $\mathbb{R}^d$:
$$\mathcal{H}_d = \left\{ x \mapsto \operatorname{sign}(\langle w, x \rangle + b) \;:\; w \in \mathbb{R}^d,\ b \in \mathbb{R} \right\}.$$

(c) Consider the set of $d+1$ points consisting of the origin and the $d$ standard basis vectors $e_i$ (i.e. 1 in the $i$-th coordinate and 0 elsewhere). Show that these points can be shattered by $\mathcal{H}_d$.

(d) Prove that no set of $d+2$ points can be shattered by $\mathcal{H}_d$. (Hint: Use Radon's theorem, which states that any set of $d+2$ points in $\mathbb{R}^d$ can be partitioned into two disjoint sets whose convex hulls intersect.) From this we conclude that the VC dimension of $\mathcal{H}_d$ is exactly $d+1$.

(e) Prove that for any $\mathcal{H}_1 \subseteq \mathcal{H}_2$, the VC dimension of $\mathcal{H}_1$ is not larger than that of $\mathcal{H}_2$.

(f) Use the above to prove that if for some hypothesis class $\mathcal{H}$ there exists a feature map $\phi : \mathcal{X} \to \mathbb{R}^d$ such that every hypothesis $h \in \mathcal{H}$ can be written as
$$h(x) = \operatorname{sign}\left( \sum_{i=1}^{d} w_i \phi_i(x) + b \right)$$
for some $w \in \mathbb{R}^d$ and some $b \in \mathbb{R}$, then the VC dimension of $\mathcal{H}$ is at most $d+1$.

(g) Use this to obtain a tight upper bound on the VC dimension of $\mathcal{H}_{\pm\circ}$ and conclude that the VC dimension of this class is indeed four. Note that the bound you can get on $\mathcal{H}_{\circ}$ this way is not tight.
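As an illustration (not part of the assignment), the geometry in parts (a) and (b) can be explored numerically. A labeling of points in $\mathbb{R}^2$ is realizable by a positive or negative circle (with halfplanes as a limiting case) exactly when the lifted points $\phi(x) = (x_1, x_2, x_1^2 + x_2^2)$ are linearly separable with those labels, mirroring the feature-map idea of part (f). The Python sketch below, which assumes scipy is available, tests each labeling with a small feasibility LP; the two configurations compared (the corners of a unit square versus a triangle plus its centroid) are arbitrary choices.

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def realizable_by_circle(points, labels):
    """Is there (w, b) with sign(<w, phi(x)> + b) = label for every point?"""
    phi = np.array([[x, y, x**2 + y**2, 1.0] for x, y in points])  # last column is the bias b
    A_ub = -np.array(labels)[:, None] * phi        # encodes  label * <phi, v> >= 1
    b_ub = -np.ones(len(points))
    res = linprog(c=np.zeros(4), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 4, method="highs")
    return res.success

def count_realizable(points):
    return sum(realizable_by_circle(points, labels)
               for labels in product([+1, -1], repeat=len(points)))

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
triangle_plus_centroid = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866), (0.5, 0.289)]

print("square:", count_realizable(square), "of 16 labelings")                    # 14
print("triangle + centroid:", count_realizable(triangle_plus_centroid), "of 16") # 16
```

On the square only 14 of the 16 labelings are realizable (the two "diagonal" labelings fail), while the triangle-plus-centroid configuration realizes all 16, consistent with the lower bound of 4 in part (b).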

3. Description-Length Based Structural Risk Minimization: In this problem we will consider more carefully an analysis of a slightly cleaner MDL-based SRM learning rule. Recall that for any distribution $p$ over hypotheses in $\mathcal{H}$ and any $\delta > 0$, with probability at least $1-\delta$ over the sample $S := (x_1, y_1), \ldots, (x_n, y_n)$, for all $h \in \mathcal{H}$:
$$R(h) \leq \hat{R}(h) + \sqrt{\frac{\log\frac{1}{p(h)} + \log\frac{1}{\delta}}{2n}}. \qquad (3)$$
With the above in mind, for a prior distribution $p(\cdot)$, define the following learning rule:
$$\mathrm{SRM}_p(S) = \hat{h} = \operatorname*{argmin}_{h \in \mathcal{H}} \ \hat{R}(h) + \sqrt{\frac{\log\frac{1}{p(h)}}{2n}}.$$
Prove that for any $h \in \mathcal{H}$, any $\epsilon > 0$ and $\delta > 0$, with probability at least $1-\delta$ over a sample $S$ of size
$$n > \frac{\log\frac{1}{p(h)} + 4\log\frac{1}{\delta}}{\epsilon^2}$$
we will have $R(\mathrm{SRM}_p(S)) \leq R(h) + \epsilon$. (Hint: you may find the inequality $\sqrt{a+b} \leq \sqrt{a} + \sqrt{b} \leq \sqrt{2a+2b}$ useful.)
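As an illustration (not part of the assignment), the rule $\mathrm{SRM}_p$ is straightforward to implement once the hypotheses are given as an explicit (here finite) list together with the prior $p$. In the Python sketch below, the toy class of one-dimensional thresholds, the geometric prior, and the tiny sample are all arbitrary choices.

```python
import math

def empirical_risk(h, sample):
    return sum(h(x) != y for x, y in sample) / len(sample)

def srm_p(hypotheses, prior, sample):
    """Return argmin_h  empirical_risk(h) + sqrt(log(1/p(h)) / (2 n))."""
    n = len(sample)
    def objective(i):
        penalty = math.sqrt(math.log(1.0 / prior[i]) / (2 * n))
        return empirical_risk(hypotheses[i], sample) + penalty
    return hypotheses[min(range(len(hypotheses)), key=objective)]

# Toy class: thresholds h_t(x) = +1 iff x >= t, with a geometric prior over the list.
thresholds = [i / 8 for i in range(9)]
hypotheses = [lambda x, t=t: +1 if x >= t else -1 for t in thresholds]
prior = [2.0 ** -(i + 1) for i in range(len(thresholds))]
total = sum(prior)
prior = [p / total for p in prior]                  # normalize so the prior sums to 1

sample = [(0.1, -1), (0.2, -1), (0.4, +1), (0.7, +1), (0.9, +1)]
h_hat = srm_p(hypotheses, prior, sample)
print("empirical risk of the SRM_p output:", empirical_risk(h_hat, sample))
```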

4. VC versus MDL: Using the SRM rule above, for any countable hypothesis class $\mathcal{H}$, we can get arbitrarily close to the generalization error of any $h \in \mathcal{H}$. But we saw that the VC dimension correctly captures the complexity of a hypothesis class, and that, in some sense, no learning guarantee is possible for classes with infinite VC dimension. We will now try to understand why there is no contradiction here.

Let $\mathcal{X}$ be the interval $(0, 1]$. For any integer $r > 0$, consider the hypothesis class
$$\mathcal{H}_r = \{\text{all binary functions of } \phi_r(x) = \lceil r x \rceil\}.$$
Our hypothesis class will be an infinite union of such classes. To make things a bit simpler, we consider only resolutions $r$ that are integer powers of two, that is:
$$\mathcal{H} = \bigcup_{q=1}^{\infty} \mathcal{H}_{2^q}.$$

(a) Show that $\mathcal{H}$ has infinite VC dimension.

(b) Suggest either a binary description language for $\mathcal{H}$ or a distribution over it. It is OK if multiple descriptions refer to the same function, or, if you prefer a distribution, if you assign probability to multiple functions that are actually the same one. But be sure that every hypothesis in $\mathcal{H}$ has a description or positive probability mass.

(c) Show that the VC dimension of $\mathcal{H}_r$ is in fact $r$.

(d) We first establish that the ERM is not appropriate here. Consider a source distribution which is uniform on $\mathcal{X}$ and for which
$$y = \begin{cases} +1 & \text{if } x < 0.3473 \\ -1 & \text{otherwise.} \end{cases}$$
Show that for any sample and any $\epsilon > 0$, there exists a hypothesis $h \in \mathcal{H}$ with $\hat{R}(h) = 0$ but $R(h) > 1 - \epsilon$.

(e) The ERM is not appropriate, but $\mathrm{SRM}_p$ is. Calculate an explicit number $n_0$ (we are looking for an actual number here, not an expression) such that with probability at least 0.99 over a sample of size $n > n_0$, we will have $R(\mathrm{SRM}_p(S)) < 0.1$, where $p$ is the prior (description language) you suggested above.

(Don't turn in) Think of a different description language or prior distribution for the same class $\mathcal{H}$ that would require a much smaller training set size to achieve a generalization error of 0.1 for the above source distribution. Then think of an example of a source distribution for which the new description language or prior distribution would require a much larger training set size than $p$ to achieve low generalization error.

(f) For any prior distribution $p$ and any sample size $m > 0$, construct a source distribution such that there exists $h \in \mathcal{H}$ with $R(h) = 0$, but with probability at least 0.2 over a sample of size $m$, $R(\mathrm{SRM}_p(S)) > 0.2$ (it is actually possible to get $R(\mathrm{SRM}_p(S)) = 0.5$ with probability close to one).
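As an illustration (not part of the assignment), the overfitting phenomenon in part (d) is easy to see numerically: at a fine enough resolution $r$, a binary function of $\lceil r x \rceil$ can copy the sample's labels on the occupied bins while predicting the minority label on every other bin, so its empirical risk is (essentially) zero while its true risk approaches one. In the Python sketch below the sample size, the random seed, and the resolutions shown are arbitrary choices.

```python
import math
import random

THETA = 0.3473
target = lambda x: +1 if x < THETA else -1

def plus_mass(lo, hi):
    """Measure of {x in (lo, hi] : target(x) = +1} under the uniform distribution."""
    return max(0.0, min(hi, THETA) - lo)

def adversarial_in_Hr(sample, r):
    """A function of ceil(r*x): agree with the sample, be wrong almost everywhere else."""
    bin_label = {math.ceil(r * x): y for x, y in sample}
    def h(x):
        k = math.ceil(r * x)
        if k in bin_label:
            return bin_label[k]
        lo, hi = (k - 1) / r, k / r
        return -1 if plus_mass(lo, hi) > (hi - lo) / 2 else +1   # minority label
    return h

def true_risk(h, r):
    """Exact risk of a function of ceil(r*x) under the uniform distribution on (0,1]."""
    risk = 0.0
    for k in range(1, r + 1):
        lo, hi = (k - 1) / r, k / r
        pm = plus_mass(lo, hi)
        risk += (hi - lo) - pm if h(hi) == +1 else pm     # hi lies in bin k
    return risk

random.seed(0)
xs = [1.0 - random.random() for _ in range(20)]           # 20 points in (0, 1]
sample = [(x, target(x)) for x in xs]
for q in (5, 10, 15):                                     # resolutions r = 2^q
    r = 2 ** q
    h = adversarial_in_Hr(sample, r)
    emp = sum(h(x) != y for x, y in sample) / len(sample)
    print(f"r = 2^{q}: empirical risk = {emp:.2f}, true risk = {true_risk(h, r):.4f}")
```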

5. VC + MDL: MDL bounds are applicable to countable classes, and VC bounds to possibly uncountable classes with finite VC dimension. What about continuous classes with infinite VC dimension? As an example, consider the class of all polynomial threshold functions. Can we get learning guarantees for this class?

Example: Consider the input space $\mathcal{X} \subseteq \mathbb{R}$ and the hypothesis class
$$\mathcal{H}_{\mathrm{poly}} = \{x \mapsto \operatorname{sign}(f(x)) : f \text{ is a polynomial function}\}.$$
This function class is uncountable and has infinite VC dimension (e.g. any labeling of a finite set of distinct points can be realized by a polynomial threshold function). However, it is possible to get learning guarantees. To do so we use the key observation that $\mathcal{H}_{\mathrm{poly}} = \bigcup_{d=0}^{\infty} \mathcal{H}_{\mathrm{poly}_d}$, where $\mathcal{H}_{\mathrm{poly}_d}$ is the class of all polynomial threshold functions of degree $d$. Note that by Problem 2(f) we have that the VC dimension of $\mathcal{H}_{\mathrm{poly}_d}$ is bounded by $d+1$.

We (i.e. you) shall prove learning guarantees for general hypothesis classes that can be written as a countable union of classes with finite VC dimension. Consider
$$\mathcal{H} = \bigcup_{d=1}^{\infty} \mathcal{H}_d$$
where $\mathcal{H}_d$ is a hypothesis class with VC dimension $d$.

(a) Prove a generalization error bound of the following form: for any $\delta > 0$, with probability at least $1-\delta$ over a sample of size $m$, for all $h \in \mathcal{H}$,
$$R(h) \leq \hat{R}(h) + \epsilon(m, \delta, d(h)),$$
where $d(h) = \min\{d : h \in \mathcal{H}_d\}$ and, for any $\delta$ and $d$, $\epsilon(m, \delta, d) \xrightarrow{m \to \infty} 0$. Be sure to specify $\epsilon(m, \delta, d)$ explicitly.

(b) Write down a learning rule $\mathrm{SRM}_{\mathcal{H}}$ that guarantees that for any $\epsilon$, $\delta$ and $h \in \mathcal{H}$, there exists $m(h)$ such that for any source distribution, with probability at least $1-\delta$ over a sample of size $m \geq m(h)$, $R(\mathrm{SRM}_{\mathcal{H}}(S)) < R(h) + \epsilon$.
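As an illustration (not part of the assignment, and not necessarily the intended answer), the learning rule asked for in part (b) has the familiar structural-risk-minimization shape: run ERM inside each $\mathcal{H}_d$ and pick the $d$ minimizing empirical risk plus a complexity penalty, where the penalty combines a VC bound for $\mathcal{H}_d$ with a union bound over $d$ (for instance via $\delta_d = \delta/(d(d+1))$). The Python skeleton below is only a sketch of that shape: the constants in `vc_penalty` are illustrative rather than canonical, and `erm_by_d` is left as an abstract callback since it depends on the concrete classes $\mathcal{H}_d$.

```python
import math

def vc_penalty(m, delta, d):
    """One illustrative form of epsilon(m, delta, d); the constants are not canonical."""
    delta_d = delta / (d * (d + 1))                  # sum over d of delta_d is at most delta
    return math.sqrt((8 * d * math.log(2 * math.e * m / d)
                      + 8 * math.log(4 / delta_d)) / m)

def srm_over_union(erm_by_d, sample, delta, max_d):
    """Pick d (and the ERM within H_d) minimizing empirical risk plus penalty.

    erm_by_d(d, sample) should return (hypothesis, empirical_risk) for class H_d;
    it is left abstract here because it depends on the concrete classes H_d.
    """
    m = len(sample)
    best = None
    for d in range(1, max_d + 1):
        h, emp = erm_by_d(d, sample)
        score = emp + vc_penalty(m, delta, d)
        if best is None or score < best[0]:
            best = (score, d, h)
    return best                                      # (penalized risk, chosen d, hypothesis)

# The penalty indeed vanishes as m grows, for each fixed d:
for m in (100, 10_000, 1_000_000):
    print(m, round(vc_penalty(m, delta=0.05, d=5), 3))
```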

Challenge Problems:

1. VC dimension of decision trees:

(a) Prove a learning guarantee for decision trees of size $k$ (i.e. having at most $k$ leaves) over an input space of $n$ binary variables, where each decision is over a single binary variable.

(b) For the input space $\mathcal{X} = \mathbb{R}^d$, provide an upper bound (that is as tight as possible) on the VC dimension of the class of stumps
$$\mathcal{H}_{\mathrm{stump}} = \{x \mapsto \operatorname{sign}(a x_i - b) : i \in [d],\ b \in \mathbb{R},\ a \in \{\pm 1\}\}.$$

(c) For the input space $\mathcal{X} = \mathbb{R}^d$, provide an upper bound (that is as tight as possible) on the VC dimension of the class of decision trees of size $k$ where each decision is based on a stump from $\mathcal{H}_{\mathrm{stump}}$.

2. Refined PAC-Bayes:

(a) Prove the refined PAC-Bayes bound: for any $\delta > 0$ and prior $p$, with probability at least $1-\delta$ over a sample of size $m$, for any sample-dependent distribution $q$ over hypotheses,
$$\mathrm{KL}\!\left(\hat{R}(q) \,\middle\|\, R(q)\right) \leq \frac{\mathrm{KL}(q \,\|\, p) + \log 2m + \log\frac{1}{\delta}}{m-1},$$
where recall that $R(q) = \mathbb{E}_{h \sim q}[R(h)]$ and $\hat{R}(q) = \mathbb{E}_{h \sim q}[\hat{R}(h)]$ are the risk and empirical risk of the randomized predictor defined by $q$.

(b) Show that close to $\hat{R}(q) = 0$ the above bound behaves as $1/m$, and far away from $\hat{R}(q) = 0$ it behaves as $1/\sqrt{m}$, matching our realizable and non-realizable bounds.

Research Problems:

1. Show how a VC-based learning guarantee can be obtained from the PAC-Bayes bound. That is, for any class with VC dimension $d$, describe a prior $p$ and a learning rule that returns a distribution (randomized hypothesis) $q$, for which the PAC-Bayes bound guarantees $R(q) \leq \inf_{h \in \mathcal{H}} R(h) + \tilde{O}\left(\sqrt{d/m}\right)$.

2. Show that any hypothesis class with VC dimension $d$ has a compression scheme of size $d$. There is a 600 dollar prize on this problem.
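As an illustration (not part of the assignment), a bound of the form $\mathrm{KL}(\hat{R}(q)\,\|\,R(q)) \leq B$, as in Challenge Problem 2(a), can be turned into an explicit upper bound on $R(q)$ by numerically inverting the binary KL divergence. The Python sketch below does this by bisection and, using an arbitrary stand-in value for $\mathrm{KL}(q\,\|\,p)$, prints how the resulting bound scales with $m$ near $\hat{R}(q) = 0$ and away from it, matching the behaviour described in part (b).

```python
import math

def binary_kl(a, b):
    """kl(a || b) for Bernoulli parameters, with the 0 log 0 = 0 convention."""
    eps = 1e-12
    b = min(max(b, eps), 1 - eps)
    term = lambda x, y: 0.0 if x == 0 else x * math.log(x / y)
    return term(a, b) + term(1 - a, 1 - b)

def kl_inverse_upper(emp, bound, tol=1e-10):
    """Largest r >= emp with kl(emp || r) <= bound, found by bisection."""
    lo, hi = emp, 1.0 - 1e-12
    if binary_kl(emp, hi) <= bound:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(emp, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return hi

# Stand-in right-hand side: B = (KL(q||p) + log(2m) + log(1/delta)) / (m - 1),
# with KL(q||p) replaced by an arbitrary constant for illustration.
complexity, delta = 10.0, 0.05
for emp in (0.0, 0.3):
    print(f"empirical risk {emp}:")
    for m in (100, 1_000, 10_000, 100_000):
        B = (complexity + math.log(2 * m) + math.log(1 / delta)) / (m - 1)
        bound = kl_inverse_upper(emp, B)
        print(f"  m = {m:>6}: risk bound {bound:.4f} (gap {bound - emp:.4f})")
```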