Learning Theory: Lecture Notes

Lecturer: Kamalika Chaudhuri    Scribe: Qiushi Wang    October 27, 2012

1 The Agnostic PAC Model

Recall that one of the constraints of the PAC model is that the data distribution has to be separable with respect to the hypothesis class $H$. The Agnostic PAC model removes this restriction. That is, there need no longer exist an $h \in H$ with $err_D(h) = 0$.

Definition 1 (Agnostic PAC Model) A hypothesis class $H$ is said to be Agnostic PAC-Learnable if there is an algorithm $A$ with the following property. For all $\epsilon, \delta$ with $0 < \epsilon, \delta < \frac{1}{2}$ and all distributions $D$ over $X \times Y$, if $A$ is given $\epsilon$, $\delta$ and $m_H(\epsilon, \delta)$ examples drawn from $D$, then with probability at least $1 - \delta$ it outputs an $h \in H$ with
$$ err_D(h) \le \epsilon + \inf_{h' \in H} err_D(h'). $$

The learning procedure in the PAC model is to find a hypothesis in $H$ which is consistent with all the input examples. In the Agnostic PAC model, there is no such hypothesis. Instead, a common learning procedure is to find a hypothesis $h$ that minimizes the empirical error, that is, the error on the training examples. Suppose that, given a set of samples $S$ drawn from a data distribution $D$, $\hat{h}$ minimizes the empirical error $err(h, S)$ while $h_{opt}$ minimizes the true error $err_D(h)$:
$$ \hat{h} = \arg\min_{h \in H} err(h, S) \quad \text{and} \quad h_{opt} = \arg\min_{h \in H} err_D(h). $$
Our goal is to find the condition under which $err_D(\hat{h}) \le err_D(h_{opt}) + \epsilon$.

Lemma 1 For a fixed $h \in H$ and $m$ samples $S$ drawn from $D$,
$$ P\left( |err_D(h) - err(h, S)| \ge \epsilon \right) \le 2e^{-m\epsilon^2}. $$

Proof: Let $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ be the sample set, and let $Z_i = 1(h(x_i) \ne y_i)$ for the fixed $h \in H$. Then $E[Z_i] = err_D(h)$ and $err(h, S) = \frac{1}{m} \sum_i Z_i$. The bound then follows directly from applying Hoeffding's Inequality.

Theorem 1 For a finite hypothesis class $H$,
$$ P\left( err_D(\hat{h}) - err_D(h_{opt}) \ge \epsilon \right) \le 2|H| e^{-m\epsilon^2/4}. $$
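Before moving to the proof of Theorem 1, here is a minimal simulation sketch illustrating the concentration in Lemma 1. It is my addition, not part of the original notes: the toy distribution $D$ (points uniform on $[0,1]$, labeled $1(x > 0.5)$ with the label flipped with probability $0.1$) and the fixed hypothesis are assumptions chosen purely for illustration, and the observed deviation probability is compared with the $2e^{-m\epsilon^2}$ bound.

    # Simulation sketch for Lemma 1 (illustrative only; the distribution and
    # hypothesis below are assumptions, not part of the lecture notes).
    import math
    import random

    def draw_sample(m):
        """m i.i.d. points x ~ Uniform[0,1], labeled 1(x > 0.5), flipped w.p. 0.1."""
        sample = []
        for _ in range(m):
            x = random.random()
            y = int(x > 0.5)
            if random.random() < 0.1:   # label noise: the problem is agnostic
                y = 1 - y
            sample.append((x, y))
        return sample

    def empirical_error(h, sample):
        """err(h, S): fraction of examples that h misclassifies."""
        return sum(h(x) != y for x, y in sample) / len(sample)

    h = lambda x: int(x > 0.5)          # fixed hypothesis; its true error err_D(h) is 0.1
    m, eps, trials = 1000, 0.05, 2000

    deviations = sum(abs(empirical_error(h, draw_sample(m)) - 0.1) >= eps
                     for _ in range(trials))
    print("observed P(|err_D(h) - err(h,S)| >= eps):", deviations / trials)
    print("bound from Lemma 1, 2*exp(-m*eps^2):      ", 2 * math.exp(-m * eps ** 2))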

Proof: First observe that $err_D(\hat{h}) - err_D(h_{opt})$ can be split into three terms:
$$ err_D(\hat{h}) - err_D(h_{opt}) = \left[ err_D(\hat{h}) - err(\hat{h}, S) \right] + \left[ err(\hat{h}, S) - err(h_{opt}, S) \right] + \left[ err(h_{opt}, S) - err_D(h_{opt}) \right]. $$
The middle term satisfies $err(\hat{h}, S) - err(h_{opt}, S) \le 0$, because $\hat{h}$ minimizes $err(\cdot, S)$. Thus
$$ err_D(\hat{h}) - err_D(h_{opt}) \le 2 \sup_{h \in H} |err_D(h) - err(h, S)|. $$
The theorem then results from combining this with the previous lemma and applying a Union Bound over all $h \in H$:
$$ P\left( \sup_{h \in H} |err_D(h) - err(h, S)| \ge \frac{\epsilon}{2} \right) \le \sum_{h \in H} P\left( |err_D(h) - err(h, S)| \ge \frac{\epsilon}{2} \right) \le 2|H| e^{-m\epsilon^2/4}. $$

For failure probability $\delta$, the bound in Theorem 1 can be rewritten as
$$ \epsilon \le 2\sqrt{\frac{\ln(2|H|/\delta)}{m}}, \quad \text{equivalently} \quad m \ge \frac{4 \ln(2|H|/\delta)}{\epsilon^2}. $$
Contrast this with the analogous bound for PAC learning:
$$ \epsilon \le \frac{\ln(|H|/\delta)}{m}, \quad \text{equivalently} \quad m \ge \frac{\ln(|H|/\delta)}{\epsilon}. $$
Thus, since the required sample size grows as $1/\epsilon^2$ rather than $1/\epsilon$, Agnostic PAC learning is statistically harder than PAC learning. Usually it is computationally harder as well.

2 Bounds for Infinite Hypothesis Classes

The generalization bounds we have proved so far apply to finite hypothesis classes, because the union bound step breaks down when $H$ is infinite. We will now see how we can exploit the structure of a hypothesis class to show generalization bounds which apply to infinite classes as well.

What kind of structure can we exploit? In cases where a hypothesis class is infinite, many different hypotheses can produce the same labeling, so often the set of meaningful hypotheses is much smaller. We will measure the complexity of a hypothesis class by the richness of the labelings it can produce. This notion is made formal by the VC dimension.

Assuming binary classification, that is $Y = \{0, 1\}$, for a hypothesis class $H$ and a set of examples $S = \{x_1, \ldots, x_m\}$, we define:
$$ \Pi_H(S) = \{ (h(x_1), \ldots, h(x_m)) : h \in H \}. $$
Here $H$ may be infinite, but $\Pi_H(S)$ has at most $2^m$ possible elements, and under certain conditions on $H$, $\Pi_H(S)$ may have even fewer.

Definition 2 We say a hypothesis class $H$ shatters $S$ if $\Pi_H(S) = \{0, 1\}^m$.

Definition 3 The VC dimension of $H$ is the size of the largest set of examples that can be shattered by $H$. The VC dimension is infinite if for all $m$, there is a set of $m$ examples shattered by $H$.
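These definitions can be made concrete with a brute-force check. The sketch below is an added illustration (the hypothesis class and sample points are arbitrary choices of mine): it enumerates $\Pi_H(S)$ as a set of label vectors and tests whether $S$ is shattered.

    # Brute-force illustration of Pi_H(S) and shattering for a toy finite class.
    # The hypotheses and points below are arbitrary choices made for this example.
    from itertools import product

    S = [0.2, 0.5, 0.8]                       # three example points

    # A small hypothesis class: thresholds at a few fixed values, with both signs.
    H = [lambda x, t=t, s=s: int((x >= t) == s)
         for t in (0.1, 0.35, 0.65, 0.9) for s in (True, False)]

    def labelings(H, S):
        """Pi_H(S): the set of label vectors the class H produces on S."""
        return {tuple(h(x) for x in S) for h in H}

    pi = labelings(H, S)
    print("Pi_H(S) =", sorted(pi))
    print("|Pi_H(S)| =", len(pi), "out of", 2 ** len(S), "possible labelings")
    print("S is shattered by H:", pi == set(product((0, 1), repeat=len(S))))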

Example 1: Bidirectional Thresholds. Let $X = \mathbb{R}$ with $H$ indexed by $\mathbb{R} \times \{+, -\}$. Here each example is a point on a line and has a binary label. Each hypothesis in $H$ corresponds to a threshold $t$ and a sign $+$ or $-$, and can be written as $h_{t,+}$ or $h_{t,-}$, defined as follows:
$$ h_{t,+}(x) = \begin{cases} +, & x \ge t \\ -, & \text{otherwise} \end{cases} $$
In other words, $h_{t,+}$ labels everything to the right of $t$ as $+$ and everything else as $-$, and $h_{t,-}$ is defined correspondingly. Since $t$ can take on any real value, $H$ is infinite. Note that on any fixed set of points $S = \{x_1, x_2, \ldots, x_m\}$ of size $m$ (listed in increasing order), $|\Pi_H(S)| \le 2m$. Consider the following $m + 1$ intervals:
$$ (-\infty, x_1), (x_1, x_2), (x_2, x_3), \ldots, (x_{m-2}, x_{m-1}), (x_{m-1}, x_m), (x_m, \infty). \quad (3) $$
Two thresholds $t$ and $t'$ placed in the same interval and with the same sign result in the same labeling; moreover, a threshold in the leftmost interval with sign $+$ gives the same (all-$+$) labeling as a threshold in the rightmost interval with sign $-$, and a threshold in the leftmost interval with sign $-$ gives the same (all-$-$) labeling as a threshold in the rightmost interval with sign $+$. Thus there are $2(m+1) - 2 = 2m$ distinct labelings.

What is the VC dimension of this class? Thresholds can produce all possible labelings of a set of two distinct points. However, on a sequence of three points they cannot produce the labelings $+, -, +$ or $-, +, -$. Thus no set of size 3 is shattered, and the VC dimension of this hypothesis class is 2.

Example 2: Intervals on the line. Let $X = \mathbb{R}$ with $H$ indexed by $\mathbb{R} \times \mathbb{R}$. Examples are again labeled points on the line, and each hypothesis corresponds to two real values defining an interval; points inside the interval are labeled $+$ and everything else is labeled $-$. Formally, for each interval $[a, b]$, $h_{[a,b]}(x) = +$ for $a \le x \le b$, and $-$ otherwise. For any set $S = \{x_1, \ldots, x_m\}$ of $m$ points, $|\Pi_H(S)| = \binom{m+1}{2} + 1$. Any two hypotheses $h_{[a,b]}$ and $h_{[a',b']}$ where $a$ and $a'$ lie in the same interval of the sequence in Equation (3), and $b$ and $b'$ do as well, produce the same labeling of $S$. Thus there are $\binom{m+1}{2}$ distinct labelings of $S$ where not all data points are labeled $-$, corresponding to hypotheses $h_{[a,b]}$ where $a$ and $b$ lie in different intervals of the sequence in Equation (3). Finally, we add the all-$-$ labeling, which is achieved by $h_{[a,a]}$ for any $a$ outside $S$.

What is the VC dimension of intervals? Intervals can produce any labeling of two distinct points but cannot give a sequence of three distinct points the labeling $+, -, +$. Thus the VC dimension of $H$ is 2. If $H$ is expanded to allow bidirectional intervals, the previous labeling could then be produced, but labelings of four points such as $+, -, +, -$ could not be, giving a VC dimension of 3.

Example 3: Linear Classifiers. Let $X = \mathbb{R}^2$ with $H = \{\text{linear classifiers over } \mathbb{R}^2\}$. Consider a set $S$ of 3 points in general position. Figure 1 shows that all possible labelings of $S$ are achievable by $H$. Thus there exists a set of 3 points that can be shattered by $H$. On the other hand, it can be shown that no set of 4 distinct points on the plane can be shattered by $H$. Thus the VC dimension of $H$ is 3. Note that a set of 3 collinear points on the plane cannot be shattered by $H$, because the labeling $+, -, +$ is not achievable; but this does not change the VC dimension calculation, because there is a set of size 3 that can be shattered. In general, the VC dimension of the hypothesis class of linear classifiers in $\mathbb{R}^d$ is $d + 1$.

Theorem 2 For any finite hypothesis class $H$, $VCdim(H) \le \log_2 |H|$.

Proof: If $H$ shatters a set $S$ of size $m$, then $|H| \ge |\Pi_H(S)| = 2^m$, meaning the VC dimension can be at most $\log_2 |H|$.
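As a quick check on the counting in Examples 1 and 2 above, the following sketch is an added illustration with arbitrarily chosen sample points (label 1 plays the role of $+$ and 0 the role of $-$): it enumerates the distinct labelings produced by thresholds and by intervals on $m$ points and compares the counts with $2m$ and $\binom{m+1}{2} + 1$.

    # Counting labelings produced by thresholds and by intervals on m points.
    from math import comb

    points = [1.0, 2.0, 3.0, 4.0, 5.0]        # any m distinct points work
    m = len(points)

    # Candidate parameters: one value strictly inside each interval of Equation (3).
    cuts = ([points[0] - 1]
            + [(a + b) / 2 for a, b in zip(points, points[1:])]
            + [points[-1] + 1])

    threshold_labelings = {
        tuple(int((x >= t) == sign) for x in points)
        for t in cuts for sign in (True, False)          # h_{t,+} and h_{t,-}
    }
    interval_labelings = {
        tuple(int(a <= x <= b) for x in points)
        for a in cuts for b in cuts if a <= b            # h_[a,b] with endpoints in any gaps
    }

    print("thresholds:", len(threshold_labelings), "distinct labelings; 2m =", 2 * m)
    print("intervals: ", len(interval_labelings),
          "distinct labelings; C(m+1,2)+1 =", comb(m + 1, 2) + 1)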

Figure 1: All possible labelings of S are achievable by the class of linear classifiers on the plane.

Example 4: Infinite VC dimension. Let $X = \mathbb{R}$ and let $H$ be indexed by $\mathbb{R}$: for $w \in \mathbb{R}$, a hypothesis is given by $h_w(x) = \mathrm{sgn}(\sin(wx))$. For all $m$, the set $S = \{2^1, 2^2, \ldots, 2^m\}$ is shattered by $H$. To see this, let $w = \pi \cdot (0.\bar{y}_1 \bar{y}_2 \ldots \bar{y}_m)_2$ be the binary encoding of the desired labels, converting each label $+1$ to the bit 0 and each label $-1$ to the bit 1. Essentially, each $x_i = 2^i$ bit-shifts $w$ to produce the desired label, as a result of the fact that $\mathrm{sgn}(\sin(\pi z)) = (-1)^{\lfloor z \rfloor}$ for non-integer $z$. Thus the VC dimension of this hypothesis class is infinite.

2.1 Sauer's Lemma

Sauer's Lemma formally relates the VC dimension of a hypothesis class $H$ and the size of $\Pi_H(S)$ for any set $S$ of examples of size $m$.

Lemma 2 If the VC dimension of a hypothesis class $H$ is $d$, then for any set $S$ of $m$ samples, where $m \ge d$,
$$ |\Pi_H(S)| \le \sum_{i=0}^{d} \binom{m}{i} \le \left( \frac{em}{d} \right)^d = O(m^d). $$

Proof: We will prove this by induction over $m$ and $d$. Let $\Phi_d(m) = \sum_{i=0}^{d} \binom{m}{i}$. The two base cases: when $m = 0$, $S$ is the empty set, so $|\Pi_H(S)| \le 1 = \Phi_d(0)$. When $d = 0$, $H$ cannot even shatter one point, so only one labeling is possible and $|\Pi_H(S)| = 1 = \Phi_0(m)$. Then, assuming Sauer's Lemma holds for $(m-1, d)$ and $(m-1, d-1)$, we wish to show $|\Pi_H(S)| \le \Phi_d(m)$. Let $S = \{x_1, \ldots, x_m\}$. In what follows, we restrict ourselves to the sample space $S$; restriction to $S$ can only decrease the VC dimension of $H$, so it does not affect the theorem statement.

We start by splitting $\Pi_H(S)$ by introducing two new hypothesis classes $H_1$ and $H_2$, defined on the samples $S' = \{x_1, \ldots, x_{m-1}\}$. $H_1$ is identical to $H$ but ignores the last example $x_m$, while $H_2$ consists of only those hypotheses for which duplicates differing only on $x_m$ occur in $H$. A sample split could be as follows:

            H                     H_1               H_2
            x1 x2 x3 x4 x5        x1 x2 x3 x4       x1 x2 x3 x4
    h1      0  1  1  0  0         0  1  1  0
    h2      0  1  1  0  1                           0  1  1  0
    h3      0  1  1  1  0         0  1  1  1
    h4      1  0  0  1  0         1  0  0  1
    h5      1  0  0  1  1                           1  0  0  1
    h6      1  1  0  0  1         1  1  0  0

If a set is shattered by $H_1$, it is also shattered by $H$. Thus $VCdim(H_1) \le VCdim(H) = d$. If a set $T$ is shattered by $H_2$, then $T \cup \{x_m\}$ is shattered by $H$, implying $VCdim(H_2) \le VCdim(H) - 1 = d - 1$.

With this split, $|\Pi_H(S)| = |\Pi_{H_1}(S')| + |\Pi_{H_2}(S')|$. To see this, let $l$ be any labeling of $S' = S \setminus \{x_m\}$ achievable by $H$; if $(l, 0)$ and $(l, 1)$ both occur in $\Pi_H(S)$, then $l$ occurs in both $\Pi_{H_1}(S')$ and $\Pi_{H_2}(S')$; otherwise, $l$ occurs only in $\Pi_{H_1}(S')$. So by the inductive hypothesis,
$$ |\Pi_H(S)| \le \Phi_d(m-1) + \Phi_{d-1}(m-1) = \sum_{i=0}^{d} \binom{m-1}{i} + \sum_{i=0}^{d-1} \binom{m-1}{i} = \sum_{i=0}^{d} \left[ \binom{m-1}{i} + \binom{m-1}{i-1} \right] = \sum_{i=0}^{d} \binom{m}{i} = \Phi_d(m). $$

Finally, for $m \ge d$, using $\left(1 + \frac{d}{m}\right)^m \le e^d$:
$$ \Phi_d(m) = \sum_{i=0}^{d} \binom{m}{i} \le \left( \frac{m}{d} \right)^d \sum_{i=0}^{d} \binom{m}{i} \left( \frac{d}{m} \right)^i \le \left( \frac{m}{d} \right)^d \sum_{i=0}^{m} \binom{m}{i} \left( \frac{d}{m} \right)^i = \left( \frac{m}{d} \right)^d \left( 1 + \frac{d}{m} \right)^m \le \left( \frac{em}{d} \right)^d. $$
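The counting identity used in the proof can be checked directly on the table above. The sketch below is an added illustration: it takes the six hypotheses $h_1, \ldots, h_6$ from the table, forms $\Pi_H(S)$, $\Pi_{H_1}(S')$ and $\Pi_{H_2}(S')$, verifies $|\Pi_H(S)| = |\Pi_{H_1}(S')| + |\Pi_{H_2}(S')|$, and also checks the bound $\Phi_d(m) \le (em/d)^d$ for one choice of $d$ and $m$.

    # Verifying the counting identity from the proof on the table's six hypotheses.
    import math

    H = [
        (0, 1, 1, 0, 0),   # h1
        (0, 1, 1, 0, 1),   # h2
        (0, 1, 1, 1, 0),   # h3
        (1, 0, 0, 1, 0),   # h4
        (1, 0, 0, 1, 1),   # h5
        (1, 1, 0, 0, 1),   # h6
    ]

    pi_H = set(H)                          # labelings of S = {x1,...,x5}
    H1 = {h[:-1] for h in H}               # restrictions to S' = {x1,...,x4}
    H2 = {h[:-1] for h in H                # restrictions whose both extensions occur in H
          if (h[:-1] + (1 - h[-1],)) in pi_H}
    print("|Pi_H(S)| =", len(pi_H), " |Pi_H1(S')| =", len(H1), " |Pi_H2(S')| =", len(H2))
    print("identity holds:", len(pi_H) == len(H1) + len(H2))

    # Sauer's Lemma bound: Phi_d(m) = sum_{i<=d} C(m,i) <= (e*m/d)^d for m >= d.
    def phi(d, m):
        return sum(math.comb(m, i) for i in range(d + 1))
    d, m = 2, 10
    print("Phi_d(m) =", phi(d, m), " (e*m/d)^d =", (math.e * m / d) ** d)

Returning to Example 4, the bit-shifting construction can also be verified numerically for small $m$. The sketch below is likewise an added illustration: it builds $w$ from each desired labeling exactly as described there, except for one extra guard bit (my addition) appended to $w$ so that $\sin(wx)$ is never exactly zero.

    # Numerical check of the sin(wx) shattering construction in Example 4.
    import math
    from itertools import product

    def sign(z):
        return 1 if z > 0 else -1

    m = 6
    S = [2 ** i for i in range(1, m + 1)]                 # S = {2^1, ..., 2^m}

    ok = True
    for labels in product((1, -1), repeat=m):             # every possible labeling of S
        bits = [0 if y == 1 else 1 for y in labels]       # +1 -> bit 0, -1 -> bit 1
        frac = sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))
        w = math.pi * (frac + 2.0 ** -(m + 1))            # guard bit keeps sin(w*x) nonzero
        ok = ok and all(sign(math.sin(w * x)) == y for x, y in zip(S, labels))
    print("all", 2 ** m, "labelings of S realized by some h_w:", ok)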