Learning Theory for Conditional Risk Minimization: Supplementary Material

Alexander Zimin, IST Austria, azimin@ist.ac.at
Christoph H. Lampert, IST Austria, chl@ist.ac.at

1 Proofs

Proof of Theorem 1. After the application of (6) and (8) we can consider the two parts separately:
$$P\Big[R_n(h_n) - \inf_{h\in\mathcal{H}} R_n(h) > \alpha\Big] \le P\Big[\sup_{h\in\mathcal{H}} \frac{1}{n}\Big|\sum_{t=1}^n \big(l(h,z_t) - R_t(h)\big)\Big| > \alpha/4\Big] + P\Big[\frac{1}{n}\sum_{t=1}^n d_{t,n} > \alpha/4\Big]. \qquad (31)-(33)$$
The convergence of the probability in (32) is guaranteed by the result of [Rakhlin et al., 2014] for any stochastic process. The convergence of (33) follows from the definition of convergent discrepancies and is the content of Lemma 2.

Lemma 2. If the double array $d_{t,n}$ is convergent, then $\frac{1}{n}\sum_{t=1}^n d_{t,n}$ converges to 0 in probability.

Proof. The proof is similar to that of the Toeplitz lemma, but adapted to our notion of convergence. Fix $\varepsilon > 0$ and $\delta > 0$. Then, by the definition of a convergent array, for $\varepsilon' = \delta' = \frac{\delta\varepsilon}{4}$,
$$\exists\, n_0, t_0:\ \forall n \ge n_0,\ \forall t \in [t_0, n]:\quad P[d_{t,n} > \varepsilon'] \le \delta'. \qquad (34)-(35)$$
In particular, this means that for any $n \ge n_0$ and $t_0 \le t \le n$ we have $E[d_{t,n}] \le \varepsilon' + \delta' = \frac{\delta\varepsilon}{2}$, because of the boundedness of $d_{t,n}$. Now, choose any $n \ge n_0$ that satisfies $\frac{t_0}{n} \le \frac{\varepsilon}{2}$. Then for any such $n$ we get
$$P\Big[\frac{1}{n}\sum_{t=1}^n d_{t,n} > \varepsilon\Big] \le P\Big[\frac{1}{n}\sum_{t=t_0+1}^n d_{t,n} > \frac{\varepsilon}{2}\Big] \le \frac{2}{\varepsilon n}\sum_{t=t_0+1}^n E[d_{t,n}] \le \delta, \qquad (36)-(38)$$
where the first inequality uses the boundedness of $d_{t,n}$ for $t \le t_0$, the second is Markov's inequality, and the last one follows from the bound on the expectations.

To characterize the complexity of a function class we use covering numbers and the sequential fat-shattering dimension. But before we can give those definitions, we need to introduce the notion of $Z$-valued trees. A $Z$-valued tree $z$ of depth $n$ is a sequence $z_{1:n}$ of mappings $z_i : \{\pm 1\}^{i-1} \to Z$. A sequence $\varepsilon_{1:n} \in \{\pm 1\}^n$ defines a path in a tree. To shorten the notation, $z_t(\varepsilon_{1:t-1})$ is denoted by $z_t(\varepsilon)$. For a double sequence $z_{1:n}$, $z'_{1:n}$, we define $\chi_t(\varepsilon)$ as $z_t$ if $\varepsilon_t = 1$ and as $z'_t$ if $\varepsilon_t = -1$. Also define distributions $p_t(\varepsilon_{1:t-1}, z_{1:t-1}, z'_{1:t-1})$ over $Z$ as $P[\,\cdot \mid \chi_1(\varepsilon_1), \dots, \chi_{t-1}(\varepsilon_{t-1})]$, where $P$ is the distribution of the process under consideration. Then we can define a distribution $\rho$ over two $Z$-valued trees $z$ and $z'$ as follows: $z_1$ and $z'_1$ are sampled independently from the initial distribution of the process, and for any path $\varepsilon_{1:n}$ and for $2 \le t \le n$, $z_t(\varepsilon)$ and $z'_t(\varepsilon)$ are sampled independently from $p_t(\varepsilon_{1:t-1}, z_{1:t-1}(\varepsilon), z'_{1:t-1}(\varepsilon))$. For any random variable $y$ that is measurable with respect to $\sigma_n$ (the $\sigma$-algebra generated by $z_{1:n}$), we define its symmetrized counterpart $\tilde{y}$ as follows. We know that there exists a measurable function $\psi$ such that $y = \psi(z_{1:n})$. Then we define $\tilde{y} = \psi(\chi_1(\varepsilon_1), \dots, \chi_n(\varepsilon_n))$, where the samples used by the $\chi_t$'s are understood from the context.

Now we can define covering numbers.

Definition 5. A set $V$ of $\mathbb{R}$-valued trees of depth $n$ is a (sequential) $\theta$-cover (with respect to the $\ell_\infty$-norm) of $\mathcal{F} \subseteq \{f : Z \to \mathbb{R}\}$ on a tree $z$ of depth $n$ if
$$\forall f \in \mathcal{F},\ \forall \varepsilon \in \{\pm 1\}^n,\ \exists v \in V:\quad \max_{1 \le t \le n} |f(z_t(\varepsilon)) - v_t(\varepsilon)| \le \theta. \qquad (39)-(40)$$
The (sequential) $\theta$-covering number of a function class $\mathcal{F}$ on a given tree $z$ is
$$\mathcal{N}_\infty(\mathcal{F}, \theta, z) = \min\{|V| : V \text{ is a } \theta\text{-cover w.r.t. the } \ell_\infty\text{-norm of } \mathcal{F} \text{ on } z\}. \qquad (41)-(42)$$
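Definition 5 can be checked by brute force on small trees. The sketch below is our illustration, not part of the paper; the tree encoding and the name `is_theta_cover` are ours. It enumerates all sign paths and tests the $\ell_\infty$ condition (39):

```python
import itertools

def is_theta_cover(V, F, z, theta, n):
    """Definition 5: V is a sequential theta-cover of F on the tree z if,
    for every f in F and every path eps in {+1,-1}^n, some single tree v in V
    stays within theta of f along the whole path (l_inf condition)."""
    for f in F:
        for eps in itertools.product((1, -1), repeat=n):
            covered = any(
                max(abs(f(z(eps, t)) - v(eps, t)) for t in range(1, n + 1)) <= theta
                for v in V
            )
            if not covered:
                return False
    return True

# Toy example: depth-2 tree over Z = R with z_1 = 0 and z_2(eps) = eps_1,
# and F the linear functions x -> a*x for a in {0, 0.5, 1}.
def z(eps, t):
    return 0.0 if t == 1 else float(eps[0])

F = [lambda x, a=a: a * x for a in (0.0, 0.5, 1.0)]
# A single R-valued tree v(eps, t) = 0.5 * z_t(eps) is a 0.5-cover of F on z,
# since |a*x - 0.5*x| <= 0.5 for all a in [0,1] and |x| <= 1.
V = [lambda eps, t: 0.5 * z(eps, t)]

print(is_theta_cover(V, F, z, 0.5, 2))  # True
print(is_theta_cover(V, F, z, 0.4, 2))  # False: f(x) = 0 misses 0.5 at depth 2
```

In this toy case a single tree suffices, so the covering number of F on z at scale 0.5 is 1, while at scale 0.4 this particular one-element V no longer covers.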

The maximal $\theta$-covering number of a function class $\mathcal{F}$ over depth-$n$ trees is
$$\mathcal{N}_\infty(\mathcal{F}, \theta, n) = \sup_z \mathcal{N}_\infty(\mathcal{F}, \theta, z). \qquad (43)$$

To control the growth of covering numbers we use the following notion of complexity.

Definition 6. A $Z$-valued tree $z$ of depth $d$ is $\theta$-shattered by a function class $\mathcal{F} \subseteq \{f : Z \to \mathbb{R}\}$ if there exists an $\mathbb{R}$-valued tree $s$ of depth $d$ such that
$$\forall \varepsilon \in \{\pm 1\}^d,\ \exists f \in \mathcal{F} \text{ s.t. } \forall t \le d:\quad \varepsilon_t\big(f(z_t(\varepsilon)) - s_t(\varepsilon)\big) \ge \theta/2. \qquad (44)-(45)$$
The (sequential) fat-shattering dimension $\mathrm{fat}_\theta(\mathcal{F})$ at scale $\theta$ is the largest $d$ such that $\mathcal{F}$ $\theta$-shatters a $Z$-valued tree of depth $d$.

An important result of [Rakhlin et al., 2014] is the following connection between the covering numbers and the fat-shattering dimension.

Lemma 3 (Corollary of [Rakhlin et al., 2014]). Let $\mathcal{F} \subseteq \{f : Z \to [0,1]\}$. For any $\theta > 0$ and any $n$, we have that
$$\mathcal{N}_\infty(\mathcal{F}, \theta, n) \le \Big(\frac{2en}{\theta}\Big)^{\mathrm{fat}_\theta(\mathcal{F})}. \qquad (46)$$

In the proofs we denote $L(\mathcal{H})$ by $\mathcal{F}$.

Proof of Theorem 2. After equations (6), (8) and (10), we are left to study the large deviations of the following quantity:
$$\Theta(J_n) = \sup_{f \in \mathcal{F}} \sum_{t=1}^n w_t(J_n)\,(f(z_t) - E_t f), \qquad (47)$$
with the weights defined as in (11). Let us define events $A_r = \{J_n = r\}$ and $B_r(j) = \{r \le \sum_{t=1}^n g(m_{t,j}) \le r+1\}$, such that $E_{k,n} = \bigcup_{r \le k} A_r \cap \bigcup_r B_r(J_n)$. Then we have
$$P[\Theta(J_n) \ge \alpha] \le P[\{\Theta(J_n) \ge \alpha\} \cap E_{k,n}] + P[E_{k,n}^c]. \qquad (48)$$
Now we can take a union bound for the first summand over the $A_r$'s and get
$$P[\{\Theta(J_n) \ge \alpha\} \cap E_{k,n}] \le \sum_{j=1}^k P\Big[\{\Theta(j) \ge \alpha\} \cap \bigcup_r B_r(j)\Big]. \qquad (49)-(50)$$
Taking another union bound for each $j$, we end up with
$$P\Big[\{\Theta(j) \ge \alpha\} \cap \bigcup_r B_r(j)\Big] \le \sum_r P[\{\Theta(j) \ge \alpha\} \cap B_r(j)]. \qquad (51)-(52)$$
Now we study the last probability for a fixed $r$ and $j$. On $B_r(j)$ we can lower bound the denominator of the weights by $\sum_{t=1}^n g(m_{t,j}) \ge r$, leading to
$$\Theta(j) \le \Theta_r(j) = \frac{1}{r}\sup_{f \in \mathcal{F}} \sum_{t=1}^n g(m_{t,j})\,(f(z_t) - E_t f).$$
Let $\lambda > 0$ and denote $V = \frac{1}{r^2}\sum_{t=1}^n g^2(m_{t,j})$ and $E = \frac{1}{r}\sum_{t=1}^n g(m_{t,j})$. Then, since $\frac{1}{r} g(m_{t,j})$ is $\sigma_{t-1}$-measurable by the definition of an M-bound, Lemma 4 gives us
$$E\big[e^{\lambda \Theta_r(j) - \lambda^2 V - 2\lambda\beta E - \log 2\mathcal{N}_\infty(\mathcal{F}, \beta, n)}\big] \le 1. \qquad (53)$$
Let $C = \{\Theta_r(j) \ge \alpha\} \cap B_r(j)$ and note that $E \le \frac{r+1}{r} \le 2$ and $V \le \frac{r+1}{r^2} \le \frac{2}{r}$ on $B_r(j)$ by the boundedness of $g$. Then we have the following chain of inequalities:
$$1 \ge E\big[e^{\lambda \Theta_r(j) - \lambda^2 V - 2\lambda\beta E - \log 2\mathcal{N}_\infty(\mathcal{F}, \beta, n)}\big] \qquad (54)$$
$$\ge E\big[e^{\lambda \Theta_r(j) - \lambda^2 V - 2\lambda\beta E - \log 2\mathcal{N}_\infty(\mathcal{F}, \beta, n)}\, \mathbb{I}_C\big] \qquad (55)$$
$$\ge e^{\lambda\alpha - \frac{2\lambda^2}{r} - 4\lambda\beta - \log 2\mathcal{N}_\infty(\mathcal{F}, \beta, n)}\, P[C]. \qquad (56)$$
Hence, by optimizing over $\lambda$, we get
$$P[\{\Theta(j) \ge \alpha\} \cap B_r(j)] \le 2\mathcal{N}_\infty(\mathcal{F}, \beta, n)\, e^{-\frac{r}{8}(\alpha - 4\beta)^2}. \qquad (57)$$
Now, coming back to (51), we can evaluate it by computing the sum over $r$ (a geometric series, using $1 - e^{-a} \ge a/2$ for $a \in (0,1]$) to obtain
$$P[\{\Theta(J_n) \ge \alpha\} \cap E_{k,n}] \le \frac{32\, k\, \mathcal{N}_\infty(\mathcal{F}, \beta, n)}{(\alpha - 4\beta)^2}\, e^{-\frac{1}{8}(\alpha - 4\beta)^2}. \qquad (58)$$

Lemma 4. Let $y_{1:n}$ be a process such that each $y_t$ is $\sigma_{t-1}$-measurable, and denote $E = \sum_{t=1}^n y_t$, $V = \sum_{t=1}^n y_t^2$. Then for fixed $\lambda, \beta > 0$ and $c = \log 2\mathcal{N}_\infty(\mathcal{F}, \beta, n)$,
$$E\big[e^{\lambda \sup_{f\in\mathcal{F}} \sum_{t=1}^n y_t(f(z_t) - E_t f) - \lambda^2 V - 2\lambda\beta E - c}\big] \le 1. \qquad (59)$$

Proof. Let $z'_{1:n}$ be a decoupled tangent sequence to $z_{1:n}$, i.e. a sequence that satisfies $E_t f(z_t) = E_t f(z'_t) = E[f(z'_t) \mid z_{1:t-1}]$. Then
$$E\big[e^{\lambda \sup_{f\in\mathcal{F}} \sum_{t=1}^n y_t(f(z_t) - E_t f) - \lambda^2 V - 2\lambda\beta E - c}\big] \qquad (60)$$
$$\le E\big[e^{\lambda \sup_{f\in\mathcal{F}} \sum_{t=1}^n y_t(f(z_t) - f(z'_t)) - \lambda^2 V - 2\lambda\beta E - c}\big]. \qquad (61)$$
Then Lemma 5 gives us that (61) is equal to
$$E_\rho E_\varepsilon\big[e^{\lambda \sup_f \sum_{t=1}^n \tilde{y}_t \varepsilon_t (f(z_t(\varepsilon)) - f(z'_t(\varepsilon))) - \lambda^2 \tilde{V} - 2\lambda\beta \tilde{E} - c}\big] \qquad (62)$$
$$\le E_{z\sim\rho} E_\varepsilon\big[e^{2\lambda \sup_f \sum_{t=1}^n \tilde{y}_t \varepsilon_t f(z_t(\varepsilon)) - \lambda^2 \tilde{V} - 2\lambda\beta \tilde{E} - c}\big], \qquad (63)$$

where $\tilde{y}$ is the symmetrized version of $y$, $\tilde{E} = \sum_{t=1}^n \tilde{y}_t$, $\tilde{V} = \sum_{t=1}^n \tilde{y}_t^2$, and we used Jensen's inequality to get the second line. Now we take a $\beta$-cover of $\mathcal{F}$ with respect to the $\ell_\infty$-norm, bounding the supremum over $\mathcal{F}$ by a maximum over the cover elements $v$ (at the price of the $2\lambda\beta\tilde{E}$ term) and taking a union bound over the cover, to get the following bound on (63):
$$E_{z\sim\rho}\big[\mathcal{N}_\infty(\mathcal{F}, \beta, n)\, E_\varepsilon\, e^{2\lambda \sum_{t=1}^n \tilde{y}_t \varepsilon_t v_t(\varepsilon) - \lambda^2 \tilde{V} - c}\big] \qquad (64)$$
$$= \tfrac{1}{2}\, E_{z\sim\rho} E_\varepsilon\big[e^{2\lambda \sum_{t=1}^n \tilde{y}_t \varepsilon_t v_t(\varepsilon) - \lambda^2 \tilde{V}}\big]. \qquad (65)$$
Introduce events $Y_+ = \{\sum_t \tilde{y}_t \varepsilon_t v_t(\varepsilon) \ge 0\}$ and $Y_- = \{\sum_t \tilde{y}_t \varepsilon_t v_t(\varepsilon) < 0\}$. Then the last line is equal to
$$\tfrac{1}{2}\, E_{z\sim\rho} E_\varepsilon\big[e^{2\lambda \sum_t \tilde{y}_t \varepsilon_t v_t(\varepsilon) - \lambda^2 \tilde{V}}\, \mathbb{I}_{Y_+}\big] + \tfrac{1}{2}\, E_{z\sim\rho} E_\varepsilon\big[e^{2\lambda \sum_t \tilde{y}_t \varepsilon_t v_t(\varepsilon) - \lambda^2 \tilde{V}}\, \mathbb{I}_{Y_-}\big] \qquad (66)-(67)$$
$$\le \tfrac{1}{2}\, E_{z\sim\rho} E_\varepsilon\big[e^{2\lambda \sum_t \tilde{y}_t \varepsilon_t v_t(\varepsilon) - \lambda^2 \tilde{V}}\big] + \tfrac{1}{2}\, E_{z\sim\rho} E_\varepsilon\big[e^{-2\lambda \sum_t \tilde{y}_t \varepsilon_t v_t(\varepsilon) - \lambda^2 \tilde{V}}\big] \qquad (68)-(69)$$
$$\le 1, \qquad (70)$$
where the last line follows by the standard martingale argument, since $\tilde{y}_t \varepsilon_t v_t(\varepsilon)$ is a martingale difference sequence (for a fixed tree $z$).

Lemma 5. Let $z_{1:n}$ be a sample from a process and $z'_{1:n}$ its decoupled sequence. Let $y_{1:n}$ be a process such that each $y_t$ is $\sigma_{t-1}$-measurable. Then for any measurable functions $\phi : \mathbb{R} \to \mathbb{R}$ and $\psi : Z^n \to \mathbb{R}$, we have
$$E\Big[\phi\Big(\sup_{f\in\mathcal{F}} \sum_{t=1}^n y_t(f(z_t) - f(z'_t))\Big)\, \psi(z_{1:n})\Big] = E_\rho E_\varepsilon\Big[\phi\Big(\sup_{f\in\mathcal{F}} \sum_{t=1}^n \tilde{y}_t \varepsilon_t \big(f(z_t(\varepsilon)) - f(z'_t(\varepsilon))\big)\Big)\, \tilde{\psi}\Big], \qquad (71)$$
where $\tilde{\psi}$ is the symmetrized version of $\psi(z_{1:n})$.

Proof. The proof is a direct extension of Theorem 3 of [Rakhlin et al., 2011], using the fact that $y_t$ is $\sigma_{t-1}$-measurable.

Proof of Corollary 1. The proof follows from Theorem 2 if we set $\beta = \frac{\alpha}{8}$ and use Lemma 3.

Proof of Lemma 1. The proof follows from the bound
$$d_{t,n} = \sup_{f \in L(\mathcal{H})} |E_t f - E_n f| \le \sup_{f \in L(\mathcal{H})} |E_t f - E_\infty f| + \sup_{f \in L(\mathcal{H})} |E_n f - E_\infty f|, \qquad (72)-(73)$$
and then the convergence of the discrepancies follows from the definition of the uniformly convergent martingale.

2 Exceptional set examples

Markov chains. First, we bound the probability of $A_k$:
$$P[n - J_n > k] \le S \max_s P[F_s > k], \qquad (74)$$
where $S$ is the number of states. On the event $B_{k,n}$ we have the following chain of inequalities:
$$\sum_{t=J_n}^n \mathbb{I}[d_{t,J_n} \le b] \ge \sum_{t=J_n}^n \mathbb{I}[d_{t,J_n} = 0] \ge \sum_{t=J_n}^n \mathbb{I}[z_t = z_{J_n}], \qquad (75)-(77)$$
which gives us
$$P\Big[\sum_{t=J_n}^n \mathbb{I}[d_{t,J_n} \le b] < m \,\Big|\, n - J_n \ge k\Big] \le P\Big[\sum_{t=J_n}^n \mathbb{I}[z_t = z_{J_n}] < m \,\Big|\, n - J_n \ge k\Big] \le S \max_s P\Big[\sum_{t=J_n}^n \mathbb{I}[z_t = s] < m \,\Big|\, n - J_n \ge k,\ z_{J_n} = s\Big]. \qquad (78)-(80)$$
Now, for a given state $s$, $\sum_{t=J_n}^n \mathbb{I}[z_t = s]$ can be lower bounded by the number of times we hit the state $s$ again. Let $T_s^i$, $i \ge 1$, be independent copies of the recurrence times. Then $\sum_{t=J_n}^n \mathbb{I}[z_t = s] \ge m$ for any $m \ge 0$ such that $\sum_{i=1}^m T_s^i \le k$. We also have the following sequence of inclusions:
$$\Big\{\forall i \le m: T_s^i \le \tfrac{k}{m}\Big\} \cap \{n - J_n \ge k\} \cap \{z_{J_n} = s\} \subseteq \Big\{\sum_{i=1}^m T_s^i \le k\Big\} \cap \{n - J_n \ge k\} \cap \{z_{J_n} = s\} \qquad (81)-(82)$$
$$\subseteq \Big\{\sum_{t=J_n}^n \mathbb{I}[z_t = s] \ge m\Big\} \cap \{n - J_n \ge k\} \cap \{z_{J_n} = s\}. \qquad (83)$$
And this gives us
$$P\Big[\sum_{t=J_n}^n \mathbb{I}[z_t = s] < m \,\Big|\, n - J_n \ge k,\ z_{J_n} = s\Big] \le P\big[\exists i \le m: T_s^i > \tfrac{k}{m}\big] \le m\, P\big[T_s > \tfrac{k}{m}\big]. \qquad (84)-(86)$$
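The step from (85) to (86) is a union bound over the i.i.d. recurrence-time copies: if the sum of $m$ copies exceeds $k$, then some single copy exceeds $k/m$. A quick Monte Carlo sanity check of this inequality, with a geometric recurrence time standing in for $T_s$ (the distribution, parameters and sample sizes here are our own choices, not from the paper):

```python
import random

random.seed(0)

def geometric(p):
    """Sample a Geometric(p) recurrence time on {1, 2, ...}."""
    t = 1
    while random.random() > p:
        t += 1
    return t

def union_bound_check(m, k, p=0.3, trials=20000):
    # Left side: P[sum of m i.i.d. recurrence times exceeds k].
    lhs = sum(sum(geometric(p) for _ in range(m)) > k
              for _ in range(trials)) / trials
    # Right side: m * P[a single recurrence time exceeds k/m].
    rhs = m * sum(geometric(p) > k / m for _ in range(trials)) / trials
    return lhs, rhs

lhs, rhs = union_bound_check(m=4, k=40)
print(lhs <= rhs)  # the union bound holds (up to Monte Carlo noise)
```

With these parameters the left side is far below the right side, which is consistent with the union bound being loose but valid.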

Dynamical systems. The bound on $P[A_k]$ follows from the fact that $n - J_n \le F(C_n)$. For $B_{k,n}$, we get
$$P[B_{k,n}] \le k \max_j P\Big[\{J_n = j\} \cap \Big\{\sum_{t=j}^n \mathbb{I}[d_{t,j} \le b] < m\Big\}\Big]. \qquad (87)$$
And similarly to the Markov chain example,
$$P\Big[\{J_n = j\} \cap \Big\{\sum_{t=j}^n \mathbb{I}[d_{t,j} \le b] < m\Big\}\Big] \le P[T(C_j) > n - j]. \qquad (88)$$

General stationary processes. The bound for this case is obtained analogously to the previous two examples, so we omit the argument.

3 Counter-example for learnability

Theorem 3. Let $Z = \{0,1\}$, $\mathcal{H} = [0,1]$ and $l(h,z) = (h - z)^2$. Also, let $\mathcal{C}$ be the class of all stationary ergodic processes taking values in $Z$. Then for any learning algorithm that produces a sequence of hypotheses $h_n$, there is a process $P \in \mathcal{C}$ such that
$$P\Big[\limsup_{n\to\infty}\Big(R_n(h_n) - \inf_{h\in\mathcal{H}} R_n(h)\Big) > \tfrac{1}{16}\Big] \ge \tfrac{1}{8}. \qquad (89)$$

Proof. Using the fact that the minimizer of $E_n[(h - z_{n+1})^2]$ is $E_n[z_{n+1}]$, we can rewrite, for any $\sigma_n$-measurable $h_n$,
$$R_n(h_n) - \inf_{h\in\mathcal{H}} R_n(h) \qquad (90)$$
$$= E_n[(h_n - z_{n+1})^2] - \inf_{h\in\mathcal{H}} E_n[(h - z_{n+1})^2] \qquad (91)$$
$$= E_n[(h_n - z_{n+1})^2] - E_n[(E_n[z_{n+1}] - z_{n+1})^2] \qquad (92)$$
$$= (h_n - E_n[z_{n+1}])^2. \qquad (93)$$
A minor modification of the proof of Theorem 1 of [Györfi et al., 1998] gives that for every algorithm that produces a sequence $h_n$ of hypotheses, there is a stationary and ergodic process such that
$$P\Big[\limsup_{n\to\infty}\, (h_n - E_n[z_{n+1}])^2 > \tfrac{1}{16}\Big] \ge \tfrac{1}{8}, \qquad (94)$$
which shows that no algorithm can be a limit learner for the class of all stationary and ergodic binary processes.

4 Connection to time series prediction

The goal of this section is to show the connection of our framework to existing theoretical approaches to time series prediction. In particular, we consider two frameworks that are close to conditional risk minimization. In both cases, we show that conditional risk minimization solves a harder problem, in the sense that its solutions can be used to solve these particular problems, but more assumptions are required for its guarantees to hold.

We start with the framework of time series prediction by statistical learning, considered for example in [Alquier et al., 2013] and [McDonald et al., 2012]. Fixing some point in time $n$, we consider a hypothesis class $\bar{\mathcal{H}} \subseteq \{h : Z^n \to Z\}$, where each hypothesis $h$ gives us a prediction of the next step by evaluating the whole history. For a loss function $l : Z \times Z \to [0,1]$, we consider the following risk minimization problem:
$$\min_{h\in\bar{\mathcal{H}}} E[l(h(z_{1:n}), z_{n+1})]. \qquad (95)$$
To set up the conditional risk minimization, we define a class of constant functions $\mathcal{H} = \{h_{z'}(z) = z' : z' \in Z\}$. Then, if the process belongs to a class learnable with $\mathcal{H}$ and $l$, we can guarantee that there is an algorithm to choose a point $\bar{z}_n$ such that, with probability at least $1 - \delta$,
$$E[l(\bar{z}_n, z_{n+1}) \mid z_{1:n}] \le \inf_{z\in Z} E[l(z, z_{n+1}) \mid z_{1:n}] + \varepsilon_n(\delta), \qquad (96)$$
where $\varepsilon_n(\delta)$ is a sequence of errors guaranteed by the algorithm for a given confidence $\delta$, with $\varepsilon_n(\delta) \to 0$. Converting this to a bound in expectation, we get
$$E[l(\bar{z}_n, z_{n+1})] \le E\Big[\inf_{z\in Z} E[l(z, z_{n+1}) \mid z_{1:n}]\Big] + \varepsilon_n(\delta) + \delta. \qquad (97)-(98)$$
Notice that
$$E\Big[\inf_{z\in Z} E[l(z, z_{n+1}) \mid z_{1:n}]\Big] \le E\Big[\inf_{h\in\bar{\mathcal{H}}} E[l(h(z_{1:n}), z_{n+1}) \mid z_{1:n}]\Big] \le \inf_{h\in\bar{\mathcal{H}}} E[l(h(z_{1:n}), z_{n+1})], \qquad (99)-(101)$$
since each $h(z_{1:n})$ is itself a point of $Z$ chosen based on the history. Therefore, if the process is from a learnable class, there is an algorithm that gives good predictions according to this framework as well.

The second setting, which was considered by [Wintenberger, 2014], is very close to online sequence prediction. In order to reduce notation and simplify the presentation, we assume that the learner has access to a (usually finite) hypothesis class $\mathcal{H}$ and at every step $t$ should choose a distribution $\pi_t$ over $\mathcal{H}$ in a way that minimizes the regret:
$$\sum_{t=1}^n E_t[l(E_{\pi_t} h, z_t)] - \min_{h\in\mathcal{H}} \sum_{t=1}^n E_t[l(h, z_t)]. \qquad (102)$$

Again, if the process belongs to a class learnable with $\mathcal{H}$ and $l$, then there is an algorithm that produces a sequence $h_t$ satisfying, with probability at least $1 - \delta$,
$$E_t[l(h_t, z_t)] \le \min_{h\in\mathcal{H}} E_t[l(h, z_t)] + \varepsilon_t(\delta/n) \qquad (103)$$
for all $t \le n$. Summing over $t$, we get
$$\sum_{t=1}^n E_t[l(h_t, z_t)] \le \sum_{t=1}^n \min_{h\in\mathcal{H}} E_t[l(h, z_t)] + \sum_{t=1}^n \varepsilon_t(\delta/n) \qquad (104)-(105)$$
$$\le \min_{h\in\mathcal{H}} \sum_{t=1}^n E_t[l(h, z_t)] + \sum_{t=1}^n \varepsilon_t(\delta/n), \qquad (106)$$
thus giving us a $\sum_{t=1}^n \varepsilon_t(\delta/n)$ bound on the regret with high probability. For nice sequences (like i.i.d.) $\varepsilon_t(\delta/n)$ is of order $O\big(\sqrt{\frac{\log t}{t}}\big)$, which gives a regret bound of order $O(\sqrt{n \log n})$. On the downside, we can get guarantees only for a class of learnable processes, while the results of [Wintenberger, 2014] hold for any stochastic process. The reason for this is that conditional risk minimization is an inherently more difficult problem, since it requires optimizing at every step and not in the cumulative sense.