Computational and Statistical Learning Theory Assignment 4

Due: March 2nd. Email solutions to: karthik at ttic dot edu

Notations/Definitions

Recall the definition of the sample-based Rademacher complexity:

\hat{R}_S(F) := \mathbb{E}_{\epsilon \sim \{\pm 1\}^m}\left[\sup_{f \in F} \frac{1}{m}\sum_{i=1}^{m} \epsilon_i f(x_i)\right]

Definition 1. Given a sample S = {x_1, ..., x_m} and any α > 0, a set V ⊆ R^m is said to be an α-cover (in ℓ_p) of a function class F on sample S if

\forall f \in F, \ \exists v \in V \ \text{s.t.} \ \left(\frac{1}{m}\sum_{i=1}^{m} |f(x_i) - v_i|^p\right)^{1/p} \le α

Specifically, for p = ∞ the quantity (\frac{1}{m}\sum_{i=1}^{m} |f(x_i) - v_i|^p)^{1/p} is replaced by \max_{i \in [m]} |f(x_i) - v_i|. Also define

N_p(F, α, S) := \min\{ |V| : V is an α-cover (in ℓ_p) of F on sample S \}

and

N_p(F, α, n) := \sup_{x_1, ..., x_n} N_p(F, α, \{x_1, ..., x_n\})

Definition 2. A function class F is said to α-shatter a sample S = {x_1, ..., x_m} if there exists a sequence of thresholds s_1, ..., s_m ∈ R such that

\forall ε \in \{\pm 1\}^m, \ \exists f \in F \ \text{s.t.} \ \forall i \in [m], \ ε_i\,(f(x_i) - s_i) \ge α/2

The fat-shattering dimension fat_α(F) is the size of the largest sample that F α-shatters.
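As a quick illustration of the definition above (not part of the assignment), here is a minimal Python sketch that estimates the sample-based Rademacher complexity of a small finite function class by Monte Carlo; the threshold class, the sample distribution, and all constants are illustrative choices rather than anything specified in the course.

import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(values, n_draws=2000):
    # values: array of shape (|F|, m) holding f(x_i) for every f in the (finite) class F.
    # Returns a Monte Carlo estimate of E_eps[ sup_f (1/m) sum_i eps_i f(x_i) ].
    num_f, m = values.shape
    total = 0.0
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=m)   # Rademacher vector
        total += np.max(values @ eps) / m       # sup over the finite class
    return total / n_draws

# Example: threshold functions f_t(x) = 1{x <= t} evaluated on m points in [0, 1].
m = 50
x = rng.uniform(0, 1, size=m)
thresholds = np.linspace(0, 1, 25)
values = np.array([(x <= t).astype(float) for t in thresholds])   # shape (25, m)
print("estimated sample-based Rademacher complexity:", empirical_rademacher(values))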

Problems

1. VC Lemma for Real-valued Function Classes: We shall prove that for any function class F (assume the functions in F are bounded by 1) and any scale α > 0, the ℓ_∞ covering number at scale α can be bounded using the fat-shattering dimension at that scale, by proving a statement analogous to the VC lemma. We shall proceed by first extending the statement to finite-valued (specifically {0, ..., k}-valued) function classes, and then using this to prove the final bound

N_∞(F, α, n) \le \left(\frac{en}{α}\right)^{fat_α(F)}

(a) Let F_k ⊆ {0, ..., k}^X be a function class with fat_2(F_k) = d. Show that

N_∞(F_k, 1/2, m) \le \sum_{i=0}^{d} \binom{m}{i} k^i

Show the above statement using induction on m + d (very similar to the first problem of Assignment 2). Hint: In the Assignment 2 problem, where we used H^+_S and H^-_S, use instead, for each i ∈ {0, ..., k}, the class F_i = {f ∈ F : f(x_1) = i} (note that this is a simple multilabel extension, and for k = 1 the classes F_0 and F_1 are identical to H^+ and H^-). Use the notion of 2-shattering instead of shattering from the VC case, and use a 1/2-cover instead of the growth function.

(b) Using the idea of α-discretizing the output of the function class F, we shall conclude the required statement. Do the following:

i. Create a {0, ..., k}-valued class G where k is of order 1/α. Show that covering G at scale 1/2 implies that we can cover F at scale α, and hence conclude that we can bound N_∞(F, α, m) in terms of the covering number of G at scale 1/2 (a small numerical illustration of this discretization appears after this problem).

ii. Show that fat_2(G) ≤ fat_α(F). Combine this with the bound on N_∞(G, 1/2, n) from part (a) and conclude that

N_∞(F, α, n) \le \left(\frac{en}{α}\right)^{fat_α(F)}
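Below is a minimal sketch of the discretization step in part (b).i, under the assumption that the functions in F take values in [0, 1]: mapping f to ⌊f/α⌋ yields a {0, ..., k}-valued class G with k of order 1/α, and rescaling values of G by α lands within α of F pointwise (in ℓ_∞). The concrete class, sample, and constants are illustrative only.

import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1
k = int(np.ceil(1.0 / alpha))                  # number of discrete levels, of order 1/alpha

m = 40
x = rng.uniform(0, 1, size=m)
# An arbitrary [0,1]-valued "function class", stored as rows of values f(x_1), ..., f(x_m).
F_values = np.array([np.sin(c * x) ** 2 for c in range(1, 30)])    # shape (29, m)

G_values = np.floor(F_values / alpha)          # the discretized class G, values in {0, ..., k}

# Rescaling the (here trivial, exhaustive) cover of G by alpha gives vectors that are
# within alpha of the corresponding functions of F in l_infinity distance.
linf_gap = np.max(np.abs(F_values - alpha * G_values))
print("l_infinity distance from F to the rescaled discretization:", linf_gap, "< alpha =", alpha)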

2. Dudley vs. Pollard's Bounds: In class we saw that the Rademacher complexity can be bounded in terms of covering numbers using Pollard's bound, the Dudley integral bound, and a slightly modified (refined) version of the Dudley integral bound, as follows:

\hat{R}_S(F) \le \inf_{α > 0}\left\{α + \sqrt{\frac{2\log N_1(F, α, m)}{m}}\right\} \le \inf_{α > 0}\left\{α + \sqrt{\frac{2\log N_2(F, α, m)}{m}}\right\}   (Pollard)

\hat{R}_S(F) \le \inf_{α \ge 0}\left\{4α + 12\int_{α}^{1} \sqrt{\frac{\log N_2(F, τ, m)}{m}}\, dτ\right\}   (Refined Dudley)

In this problem, using some examples, we shall compare these bounds.

(a) Class with finite VC subgraph dimension: Assume that the VC subgraph dimension of the function class F is bounded by D. In this case the result of Problem 1 can be used to bound the covering number of F in terms of D. Use this bound on the covering number and compare Pollard's bound with the refined Dudley integral bound by writing down the bound on the Rademacher complexity implied by each one.

(b) Linear class with bounded norm: Linear classes in high-dimensional spaces are probably among the most important and most used function classes in machine learning. Consider the specific example where X = {x : ‖x‖_2 ≤ 1} and

F = {x ↦ ⟨w, x⟩ : ‖w‖_2 ≤ 1}

In class we saw that for any ε > 0, fat_ε(F) ≤ 4/ε². Using this together with the result of Problem 1, we have that

N_2(F, ε, m) \le N_∞(F, ε, m) \le \left(\frac{em}{ε}\right)^{4/ε^2}

Use the above bound on the covering number and write down the bound on the Rademacher complexity implied by Pollard's bound. Then write down the bound on the Rademacher complexity implied by the refined version of the Dudley integral bound.

3. Data-Dependent Bound: Recall the Rademacher complexity bound we proved in class for function classes F bounded by 1: for any δ > 0, with probability at least 1 − δ,

\sup_{f \in F}\left(\mathbb{E}[f(x)] - \hat{\mathbb{E}}_S[f(x)]\right) \le 2\,\mathbb{E}_{S \sim D^m}\left[\hat{R}_S(F)\right] + \sqrt{\frac{\log(1/δ)}{2m}}

Note that we do not know the distribution D. One way we used the above bound was by providing upper bounds on \hat{R}_S(F) valid for any sample of size m and using these in place of \mathbb{E}_{S \sim D^m}[\hat{R}_S(F)]. But ideally we would like to get tight bounds when the distribution we are faced with is nicer. The aim of this problem is to do this. Prove that for any δ > 0, with probability at least 1 − δ over the draw of the sample S ∼ D^m,

\sup_{f \in F}\left(\mathbb{E}[f(x)] - \hat{\mathbb{E}}_S[f(x)]\right) \le 2\,\hat{R}_S(F) + K\sqrt{\frac{\log(2/δ)}{m}}

(provide an explicit value for the constant K above). Notice that in the above bound the expected Rademacher complexity is replaced by the sample-based one, which can be calculated from the training sample. Hint: Use McDiarmid's inequality on the expected Rademacher complexity.
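Following up on Problem 3's hint, here is a small illustrative experiment (not required for the solution): drawing several samples S of the same size from one fixed distribution and computing the sample-based Rademacher complexity of a threshold class on each shows the concentration around its expectation that McDiarmid's inequality quantifies. The class, distribution, and sizes below are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
thresholds = np.linspace(0, 1, 20)

def rademacher_hat(x, n_draws=1000):
    # Monte Carlo estimate of the sample-based Rademacher complexity of the threshold class on x.
    values = np.array([(x <= t).astype(float) for t in thresholds])   # shape (|F|, m)
    m = x.shape[0]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, m))
    return np.mean(np.max(values @ eps.T, axis=0)) / m

m = 200
estimates = [rademacher_hat(rng.beta(2, 5, size=m)) for _ in range(20)]
print("mean of hat R_S(F) over 20 samples:", np.mean(estimates))
print("std  of hat R_S(F) over 20 samples:", np.std(estimates))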

4. Learnability and Fat-Shattering Dimension: Recall the setting of the stochastic optimization problem, where the objective function is a mapping r : H × Z → R. A sample S = {z_1, ..., z_m} drawn i.i.d. from an unknown distribution D is provided to the learner, and the aim of the learner is to output ĥ ∈ H, based on the sample, that has low expected objective \mathbb{E}_{z \sim D}[r(ĥ, z)].

(a) Consider the stochastic optimization problem with r bounded by a, i.e. |r(h, z)| < a < ∞ for all h ∈ H and z ∈ Z. If the function class F := {z ↦ r(h, z) : h ∈ H} has finite fat_α for all α > 0, then show that the problem is learnable.

(b) Conclude that for a supervised learning problem with a bounded hypothesis class H (i.e. ∀ x ∈ X, |h(x)| < a) and a loss φ : Ŷ × Y → R that is L-Lipschitz (in its first argument), if H has finite fat_α for all α > 0, then the problem is learnable.

(c) Show a stochastic optimization problem that is learnable even though it has infinite fat_α for all α ≤ 0.1 (or any other constant of your choice). Explicitly write down the hypothesis class and the learning rule which learns the class, argue that the problem is learnable, and explain why the fat-shattering dimension is infinite. Hint: You can make the successful learning rule even be ERM.

(d) Prove that for a supervised learning problem with the absolute loss φ(ŷ, y) = |ŷ − y|, if fat_α is infinite for some α > 0, then the problem is not learnable. Hint: as in the binary case, for every m construct a distribution which is concentrated on a set of points that can be fat-shattered.

Challenge Problems

1. We saw that for any distribution D, the expected Rademacher complexity provides an upper bound on the maximal deviation between mean and empirical average, uniformly over the function class; specifically, we saw that

\mathbb{E}_{S \sim D^m}\left[\sup_{f \in F}\left(\mathbb{E}[f(x)] - \hat{\mathbb{E}}_S[f(x)]\right)\right] \le 2\,\mathbb{E}_{S \sim D^m}\left[\hat{R}_S(F)\right]

Prove the (almost) converse that

\frac{1}{2}\,\mathbb{E}_{S \sim D^m}\left[\hat{R}_S(F)\right] \le \mathbb{E}_{S \sim D^m}\left[\sup_{f \in F}\left(\mathbb{E}[f(x)] - \hat{\mathbb{E}}_S[f(x)]\right)\right]

This basically establishes that the Rademacher complexity tightly bounds the uniform maximal deviation for every distribution.
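A toy numerical companion to the challenge problem above (an illustration, not a proof): for a small threshold class under a fixed distribution one can estimate both sides of the sandwich, the expected uniform deviation and the expected sample-based Rademacher complexity, and watch them track each other. The class, distribution, and all sizes below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
thresholds = np.linspace(0, 1, 20)
m, n_samples, n_eps = 100, 200, 200

def f_values(x):
    return np.array([(x <= t).astype(float) for t in thresholds])    # shape (|F|, len(x))

true_means = thresholds          # for x ~ Uniform[0, 1], E[1{x <= t}] = t

dev, rad = [], []
for _ in range(n_samples):
    x = rng.uniform(0, 1, size=m)
    vals = f_values(x)
    dev.append(np.max(true_means - vals.mean(axis=1)))                # sup_f (E[f] - hat E_S[f])
    eps = rng.choice([-1.0, 1.0], size=(n_eps, m))
    rad.append(np.mean(np.max(vals @ eps.T, axis=0)) / m)             # hat R_S(F)

print("E_S[ sup_f (E[f] - hat E_S[f]) ] ~", np.mean(dev))
print("E_S[ hat R_S(F) ]               ~", np.mean(rad))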

2. The worst-case Rademacher complexity is defined as

R_m(F) = \sup_{S = \{x_1, ..., x_m\}} \hat{R}_S(F)

(i.e. the supremum over samples of size m).

(a) Prove that for any function class F and any τ > R_m(F), we have

fat_τ(F) \le \frac{4 m\, R_m(F)^2}{τ^2}

Hint: First prove the statement for a larger sample of size m' = fat_τ(F) · ⌈m / fat_τ(F)⌉ by taking the fat_τ(F) shattered points and repeating each of them the appropriate number of times. You will need to start with a shattered set, and you will need to use Khintchine's inequality, which states that for any n,

\mathbb{E}_{ε \sim \mathrm{Unif}\{\pm 1\}^n}\left[\left|\sum_{i=1}^{n} ε_i\right|\right] \ge \sqrt{\frac{n}{2}}

(b) Combine the above with the refined version of the Dudley integral bound to prove that

\inf_{α \ge 0}\left\{4α + 12\int_{α}^{1} \sqrt{\frac{\log N_2(F, τ, m)}{m}}\, dτ\right\} \le O\!\left(\log^{3/2} m\right) \cdot R_m(F)

This shows that the refined Dudley integral bound is tight to within log factors of the Rademacher complexity. Thus we have established that, in the worst case, complexity measures for a function class such as the Rademacher complexity, covering numbers, and the fat-shattering dimension all tightly govern the rate of uniform maximal deviation for the function class (all to within log factors).

3. Bounded Difference Inequality, Stability and Generalization: Recall that a function G : X^m → R is said to satisfy the bounded difference property if for all i ∈ [m] and all x_1, ..., x_m, x'_i ∈ X,

\left|G(x_1, ..., x_i, ..., x_m) - G(x_1, ..., x_{i-1}, x'_i, x_{i+1}, ..., x_m)\right| \le c_i

for some c_i ≥ 0. In this case McDiarmid's inequality gives us that for any δ > 0, with probability at least 1 − δ,

G(x_1, ..., x_m) \le \mathbb{E}\left[G(x_1, ..., x_m)\right] + \sqrt{\frac{\left(\sum_{i=1}^{m} c_i^2\right)\log(1/δ)}{2}}

The bounded difference property turns out to be quite useful for analyzing learning algorithms directly (instead of looking at the uniform deviation over a function class).
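As a purely illustrative sanity check on McDiarmid's inequality stated above (in its equivalent tail form P(G − E[G] ≥ t) ≤ exp(−2t² / Σ_i c_i²)): the simplest bounded-difference function is the sample mean of [0, 1]-valued variables, for which replacing one coordinate changes G by at most c_i = 1/m, so the inequality reduces to Hoeffding's bound. The short simulation below compares the bound with the empirical tail; all constants are arbitrary.

import numpy as np

rng = np.random.default_rng(4)
m, trials, t = 100, 50000, 0.05

samples = rng.uniform(0, 1, size=(trials, m))
G = samples.mean(axis=1)                        # G(x_1, ..., x_m) = (1/m) sum_i x_i
empirical_tail = np.mean(G - 0.5 >= t)          # E[G] = 0.5 for Uniform[0, 1] inputs
bound = np.exp(-2 * t**2 / (m * (1.0 / m)**2))  # exp(-2 t^2 / sum_i c_i^2) with c_i = 1/m
print("P(G - E[G] >= t) empirically :", empirical_tail)
print("McDiarmid / Hoeffding bound  :", bound)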

A proper learning algorithm A : ∪_{m=1}^{∞} X^m → F is said to be uniformly β-stable if for all i ∈ [m] and any x_1, ..., x_m, x'_i ∈ X,

\sup_{x}\left|A(x_1, ..., x_i, ..., x_m)(x) - A(x_1, ..., x_{i-1}, x'_i, x_{i+1}, ..., x_m)(x)\right| \le β

Assuming the functions in F are bounded by 1, we shall prove that a uniformly β-stable learning algorithm generalizes well (its expected loss is close to its empirical loss). Specifically, we shall prove that for any δ > 0, with probability at least 1 − δ,

R(A(S)) \le \hat{R}(A(S)) + β + (2βm + 1)\sqrt{\frac{2\log(1/δ)}{m}}

where R(A(S)) = \mathbb{E}_x[A(S)(x)] and \hat{R}(A(S)) = \frac{1}{m}\sum_{i=1}^{m} A(S)(x_i).

(a) First show that \mathbb{E}_S[R(A(S)) - \hat{R}(A(S))] \le β. Hint: Use renaming of variables to first show that for any i ∈ [m],

\mathbb{E}_S\left[R(A(S))\right] = \mathbb{E}_{S, x'_i}\left[A(x_1, ..., x_{i-1}, x'_i, x_{i+1}, ..., x_m)(x_i)\right]

(b) Show that the function G(S) = R(A(S)) − \hat{R}(A(S)) satisfies the bounded difference property with c_i ≤ 2β + 2/m. Conclude the required statement using McDiarmid's inequality.

(c) Consider the stochastic convex optimization problem where a sample point is z = (x, y), with y real-valued and the x's drawn from the unit ball of some Hilbert space, the hypotheses are weight vectors w from the same Hilbert space, and the objective is

r(w, (x, y)) = |⟨w, x⟩ − y| + λ‖w‖^2

Show that the ERM algorithm is stable for this problem, and thus provide a bound for this algorithm.
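A hedged numerical companion to part (c) above, measuring rather than proving stability: the sketch uses the closely related squared-loss objective (ridge regression), which has a closed-form minimizer, instead of the absolute-loss objective of part (c). It replaces one training point at a time and records the supremum over the unit ball of |⟨w_S, x⟩ − ⟨w_{S'}, x⟩|, which equals ‖w_S − w_{S'}‖. All constants are arbitrary choices.

import numpy as np

rng = np.random.default_rng(5)
m, d, lam = 100, 5, 1.0

def ridge(X, y, lam):
    # Closed-form minimizer of (1/n) sum_i (<w, x_i> - y_i)^2 + lam * ||w||^2.
    n = X.shape[0]
    return np.linalg.solve(X.T @ X / n + lam * np.eye(X.shape[1]), X.T @ y / n)

X = rng.normal(size=(m, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))   # keep the x's in the unit ball
y = rng.uniform(-1, 1, size=m)
w = ridge(X, y, lam)

gaps = []
for i in range(m):
    Xp, yp = X.copy(), y.copy()
    Xp[i] = rng.normal(size=d)
    Xp[i] /= max(1.0, np.linalg.norm(Xp[i]))
    yp[i] = rng.uniform(-1, 1)
    gaps.append(np.linalg.norm(w - ridge(Xp, yp, lam)))

print("largest coordinate-replacement gap ||w_S - w_S'||:", max(gaps))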

4. ℓ1 Neural Network: A k-layer 1-norm neural network is given by the function class F_k, which is defined recursively as follows:

F_1 = \left\{x ↦ \sum_{j=1}^{d} w_j x_j : ‖w‖_1 \le B\right\}

and further, for each 2 ≤ i ≤ k,

F_i = \left\{x ↦ \sum_{j=1}^{d_i} w_j\, σ(f_j(x)) : \forall j ∈ [d_i],\ f_j ∈ F_{i-1},\ ‖w‖_1 \le B\right\}

where d_i is the number of nodes in the i-th layer of the network. The function σ : R → [−1, 1] is called the squash function and is generally a smooth, monotonically non-decreasing function (a typical example is the tanh function). Assume that the input space is X = [0, 1]^d and that σ is L-Lipschitz. Prove that

\hat{R}_S(F_k) \le (2B)^k L^{k-1} \sqrt{\frac{2\log d}{m}}

Notice that the d_i's do not appear in the above bound, indicating that the number of nodes in the intermediate layers does not affect the upper bound on the Rademacher complexity. Hint: prove a bound on the Rademacher complexity of F_i recursively in terms of the Rademacher complexity of F_{i-1}.
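A hedged numerical companion to the hint's base case (k = 1): for F_1 = {x ↦ ⟨w, x⟩ : ‖w‖_1 ≤ B}, the supremum over w is attained at a signed coordinate vector, so the sample-based Rademacher complexity equals (B/m) · E_ε[ ‖Σ_i ε_i x_i‖_∞ ]. The sketch estimates this quantity and compares it with the Massart-style bound B·√(2 log(2d)/m); the dimension, sample size, and B are arbitrary choices.

import numpy as np

rng = np.random.default_rng(6)
B, d, m, n_draws = 1.0, 20, 400, 2000

X = rng.uniform(0, 1, size=(m, d))                           # inputs in [0, 1]^d
eps = rng.choice([-1.0, 1.0], size=(n_draws, m))
rad_hat = B * np.mean(np.max(np.abs(eps @ X), axis=1)) / m   # (B/m) E || sum_i eps_i x_i ||_inf
print("estimated hat R_S(F_1)  :", rad_hat)
print("B * sqrt(2 log(2d) / m) :", B * np.sqrt(2 * np.log(2 * d) / m))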