Learning Theory for Conditional Risk Minimization: Supplementary Material


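Alexander Zimin, IST Austria
Christoph H. Lampert, IST Austria, chl@ist.ac.at

1 Proofs

Proof of Theorem 1. After the application of (6) and (8) we can consider the two parts separately:

$$P\left[ R_n(h_n) - \inf_{h \in \mathcal{H}} R_n(h) > \alpha \right] \tag{31}$$
$$\le P\left[ \sup_{h \in \mathcal{H}} \left| \frac{1}{n} \sum_{t=1}^{n} \big( l(h, z_t) - R_{t-1}(h) \big) \right| > \alpha/4 \right] \tag{32}$$
$$+\; P\left[ \frac{1}{n} \sum_{t=1}^{n} d_{t,n} > \alpha/4 \right]. \tag{33}$$

The convergence of the probability in (32) is guaranteed by the result of Rakhlin et al., 2014 for any stochastic process. The convergence of (33) follows from the definition of the convergent discrepancies and is the content of Lemma 2.

Lemma 2. If the double array $d_{t,n}$ is convergent, then $\frac{1}{n} \sum_{t=1}^{n} d_{t,n}$ converges to 0 in probability.

Proof. The proof is similar to that of the Toeplitz lemma, but adapted to our notion of convergence. Fix $\varepsilon > 0$ and $\delta > 0$. Then, by the definition of a convergent array, for $\varepsilon' = \delta' = \frac{\delta\varepsilon}{4}$,

$$\exists\, n_0, t_0 : 0 \le t_0 < n_0,\ \forall n \ge n_0,\ t_0 \le t < n : \tag{34}$$
$$P\left[ d_{t,n} > \varepsilon' \right] \le \delta'. \tag{35}$$

In particular, this means that for any $n \ge n_0$ and $t_0 \le t < n$ we have $E\, d_{t,n} \le \varepsilon' + \delta' = \frac{\delta\varepsilon}{2}$, because of the boundedness of $d_{t,n}$. Now, choose any $n_0'$ that satisfies $\frac{n_0}{n_0'} \le \frac{\varepsilon}{2}$. Then for any $n \ge n_0'$ we get

$$P\left[ \frac{1}{n} \sum_{t=1}^{n} d_{t,n} > \varepsilon \right] \le P\left[ \frac{1}{n} \sum_{t=n_0+1}^{n} d_{t,n} > \frac{\varepsilon}{2} \right] \tag{36}$$
$$\le \frac{2}{n\varepsilon} \sum_{t=n_0+1}^{n} E\, d_{t,n} \tag{37}$$
$$\le \delta, \tag{38}$$

where the last line follows from the bound on the expectations.

To characterize the complexity of a function class we use covering numbers and a sequential fat-shattering dimension. Before giving those definitions, we need to introduce the notion of $\mathcal{Z}$-valued trees. A $\mathcal{Z}$-valued tree of depth $n$ is a sequence $z_{1:n}$ of mappings $z_i : \{\pm 1\}^{i-1} \to \mathcal{Z}$. A sequence $\varepsilon_{1:n} \in \{\pm 1\}^n$ defines a path in a tree. To shorten the notation, $z_t(\varepsilon_{1:t-1})$ is denoted as $z_t(\varepsilon)$. For a double sequence $z_{1:n}, z'_{1:n}$, we define $\chi_t(\varepsilon)$ as $z_t$ if $\varepsilon = 1$ and $z'_t$ if $\varepsilon = -1$. Also define distributions $p_t(\varepsilon_{1:t-1}, z_{1:t-1}, z'_{1:t-1})$ over $\mathcal{Z}$ as $P[\,\cdot \mid \chi_1(\varepsilon_1), \dots, \chi_{t-1}(\varepsilon_{t-1})]$, where $P$ is the distribution of the process under consideration. Then we can define a distribution $\rho$ over two $\mathcal{Z}$-valued trees $z$ and $z'$ as follows: $z_1$ and $z'_1$ are sampled independently from the initial distribution of the process, and for any path $\varepsilon_{1:n}$ and $2 \le t \le n$, $z_t(\varepsilon)$ and $z'_t(\varepsilon)$ are sampled independently from $p_t(\varepsilon_{1:t-1}, z_{1:t-1}(\varepsilon), z'_{1:t-1}(\varepsilon))$.

For any random variable $y$ that is measurable with respect to $\sigma_n$ (the $\sigma$-algebra generated by $z_{1:n}$), we define its symmetrized counterpart $\tilde{y}$ as follows. We know that there exists a measurable function $\psi$ such that $y = \psi(z_{1:n})$. Then we define $\tilde{y} = \psi(\chi_1(\varepsilon_1), \dots, \chi_n(\varepsilon_n))$, where the samples used by the $\chi_t$'s are understood from the context.

Now we can define covering numbers.

Definition 5. A set, $V$, of $\mathbb{R}$-valued trees of depth $n$ is a (sequential) $\theta$-cover (with respect to the $\ell_\infty$-norm) of $\mathcal{F} \subseteq \{f : \mathcal{Z} \to \mathbb{R}\}$ on a tree $z$ of depth $n$ if

$$\forall f \in \mathcal{F},\ \forall \varepsilon \in \{\pm 1\}^n,\ \exists v \in V : \tag{39}$$
$$\max_{1 \le t \le n} |f(z_t(\varepsilon)) - v_t(\varepsilon)| \le \theta. \tag{40}$$

The (sequential) $\theta$-covering number of a function class $\mathcal{F}$ on a given tree $z$ is

$$N_\infty(\mathcal{F}, \theta, z) = \min\{ |V| : V \text{ is a } \theta\text{-cover w.r.t. the } \ell_\infty\text{-norm of } \mathcal{F} \text{ on } z \}. \tag{41–42}$$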
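The tree and cover definitions above can be made concrete with a small brute-force check. This is only an illustrative sketch: the representation of a tree as a list of Python functions, the names `path_values` and `is_theta_cover`, and the toy tree, function class and cover are all assumptions made for the example, not part of the paper.

```python
from itertools import product


def path_values(tree, eps):
    """Evaluate a depth-n tree along the path eps in {-1, +1}^n.

    tree[t] is a function of the first t signs (given as a tuple), which
    mirrors z_{t+1}(eps_{1:t}) in the 1-indexed notation of the text.
    """
    return [node(tuple(eps[:t])) for t, node in enumerate(tree)]


def is_theta_cover(cover, functions, tree, theta):
    """Brute-force check of the sequential theta-cover condition (39)-(40)."""
    n = len(tree)
    for eps in product((-1, 1), repeat=n):
        zs = path_values(tree, eps)
        for f in functions:
            fz = [f(z) for z in zs]
            # Some tree v in the cover must stay within theta of f along this path.
            if not any(
                max(abs(a - b) for a, b in zip(fz, path_values(v, eps))) <= theta
                for v in cover
            ):
                return False
    return True


# Hypothetical toy instance with Z = [0, 1] and depth 2.
z_tree = [lambda e: 0.5, lambda e: 0.75 if e[0] == 1 else 0.25]
F = [lambda z: z, lambda z: 1.0 - z]
V = [[lambda e: 0.5, lambda e: 0.5]]  # a single constant real-valued tree
```

In this toy instance the single constant tree covers both functions at scale 0.25 but not at scale 0.2. The check enumerates all $2^n$ paths, so it is only practical for very small depths.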

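The maximal $\theta$-covering number of a function class $\mathcal{F}$ over depth-$n$ trees is

$$N_\infty(\mathcal{F}, \theta, n) = \sup_{z} N_\infty(\mathcal{F}, \theta, z). \tag{43}$$

To control the growth of covering numbers we use the following notion of complexity.

Definition 6. A $\mathcal{Z}$-valued tree $z$ of depth $n$ is $\theta$-shattered by a function class $\mathcal{F} \subseteq \{f : \mathcal{Z} \to \mathbb{R}\}$ if there exists an $\mathbb{R}$-valued tree $s$ of depth $n$ such that

$$\forall \varepsilon \in \{\pm 1\}^n,\ \exists f \in \mathcal{F} \text{ s.t. } \forall\, 1 \le t \le n, \tag{44}$$
$$\varepsilon_t \big( f(z_t(\varepsilon)) - s_t(\varepsilon) \big) \ge \theta/2. \tag{45}$$

The (sequential) fat-shattering dimension $\mathrm{fat}_\theta(\mathcal{F})$ at scale $\theta$ is the largest $d$ such that $\mathcal{F}$ $\theta$-shatters a $\mathcal{Z}$-valued tree of depth $d$.

An important result of Rakhlin et al., 2014 is the following connection between the covering numbers and the fat-shattering dimension.

Lemma 3 (Corollary of Rakhlin et al., 2014). Let $\mathcal{F} \subseteq \{f : \mathcal{Z} \to [-1, 1]\}$. For any $\theta > 0$ and any $n \ge 1$, we have that

$$N_\infty(\mathcal{F}, \theta, n) \le \left( \frac{2en}{\theta} \right)^{\mathrm{fat}_\theta(\mathcal{F})}. \tag{46}$$

In the proofs we denote $L(\mathcal{H})$ as $\mathcal{F}$.

Proof of Theorem 2. After equations (6), (8) and (10), we are left to study the large deviations of the following quantity

$$\Theta(J_n) = \sup_{f \in \mathcal{F}} \left| \sum_{t} w_t(J_n) \big( f(z_t) - E_{t-1} f \big) \right| \tag{47}$$

with the weights defined as in (11). Let us define the events $A_r = \{J_n = r\}$ and $B_r(j) = \{ r \le \sum_t g(M_{t,j}) \le r + 1 \}$, such that $E_{k,n} = \{\bigcup_{r \le k} A_r\} \cap \{\bigcup_{r} B_r(J_n)\}$. Then we have

$$P[\Theta(J_n) \ge \alpha] \le P[\Theta(J_n) \ge \alpha \wedge E_{k,n}] + P[E_{k,n}^c]. \tag{48}$$

Now we can take a union bound for the first summand over the $A_r$'s and get

$$P[\Theta(J_n) \ge \alpha \wedge E_{k,n}] \tag{49}$$
$$\le \sum_{j=1}^{k} P\Big[\Theta(j) \ge \alpha \wedge \big\{\textstyle\bigcup_{r} B_r(j)\big\}\Big]. \tag{50}$$

Taking another union bound for each $j$, we end up with

$$P\Big[\Theta(j) \ge \alpha \wedge \big\{\textstyle\bigcup_{r} B_r(j)\big\}\Big] \tag{51}$$
$$\le \sum_{r} P[\Theta(j) \ge \alpha \wedge B_r(j)]. \tag{52}$$

Now we study the last probability for a fixed $r$ and $j$. On $B_r(j)$ we can lower bound the denominator of the weights, $\sum_t g(M_{t,j}) \ge r$, leading to

$$\Theta(j) \le \Theta_r(j) = \frac{1}{r} \sup_{f \in \mathcal{F}} \left| \sum_{t} g(M_{t,j}) \big( f(z_t) - E_{t-1} f \big) \right|.$$

Let $\lambda > 0$ and denote $V = \frac{1}{r^2} \sum_t g^2(M_{t,j})$, $E = \frac{1}{r} \sum_t g(M_{t,j})$. Then, since $\frac{1}{r} g(M_{t,j}) \in \sigma_{t-1}$ by the definition of an M-bound, Lemma 4 gives us

$$E\, e^{\lambda \Theta_r(j) - \lambda^2 V - 2\lambda\beta E - \ln 2N_\infty(\mathcal{F}, \beta, n)} \le 1. \tag{53}$$

Let $C = \{\Theta_r(j) \ge \alpha \wedge B_r(j)\}$ and note that $E \le \frac{r+1}{r} \le 2$ and $V \le \frac{r+1}{r^2} \le \frac{2}{r}$ on $B_r(j)$ by the boundedness of $g$. Then we have the following chain of inequalities:

$$1 \ge E\, e^{\lambda \Theta_r(j) - \lambda^2 V - 2\lambda\beta E - \ln 2N_\infty(\mathcal{F}, \beta, n)} \tag{54}$$
$$\ge E\, e^{\lambda \Theta_r(j) - \lambda^2 V - 2\lambda\beta E - \ln 2N_\infty(\mathcal{F}, \beta, n)}\, I[C] \tag{55}$$
$$\ge e^{\lambda\alpha - \frac{2\lambda^2}{r} - 4\lambda\beta - \ln 2N_\infty(\mathcal{F}, \beta, n)}\, P[C]. \tag{56}$$

Hence, by optimizing over $\lambda$, we get

$$P[\Theta(j) \ge \alpha \wedge B_r(j)] \le 2N_\infty(\mathcal{F}, \beta, n)\, e^{-\frac{1}{2} r (\alpha - 4\beta)^2}. \tag{57}$$

Now, coming back to (51), we can evaluate it by computing the sum to obtain

$$P[\Theta(J_n) \ge \alpha \wedge E_{k,n}] \le 2k N_\infty(\mathcal{F}, \beta, n)\, \frac{[\ldots]}{(\alpha - 4\beta)^2}\, e^{-\frac{[\ldots]}{2} (\alpha - 4\beta)^2}. \tag{58}$$

Lemma 4. Let $y_{1:n}$ be a process such that each $y_t \in \sigma_{t-1}$ and denote $E = \sum_t |y_t|$, $V = \sum_t y_t^2$. Then for fixed $\lambda, \beta > 0$ and $c = \ln 2N_\infty(\mathcal{F}, \beta, n)$,

$$E\, e^{\lambda \sup_{f \in \mathcal{F}} \left| \sum_t y_t (f(z_t) - E_{t-1} f) \right| - \lambda^2 V - 2\lambda\beta E - c} \le 1. \tag{59}$$

Proof. Let $z'_{1:n}$ be a decoupled tangent sequence to $z_{1:n}$, i.e. a sequence that satisfies $E_{t-1} f(z_t) = E_{t-1} f(z'_t) = E[f(z'_t) \mid z_{1:n}]$. Then

$$E\, e^{\lambda \sup_{f \in \mathcal{F}} \left| \sum_t y_t (f(z_t) - E_{t-1} f) \right| - \lambda^2 V - 2\lambda\beta E - c} \tag{60}$$
$$\le E\, e^{\lambda \sup_{f \in \mathcal{F}} \left| \sum_t y_t (f(z_t) - f(z'_t)) \right| - \lambda^2 V - 2\lambda\beta E - c}. \tag{61}$$

Then Lemma 5 gives us that (61) equals

$$E_{\rho} E_{\varepsilon}\, e^{\lambda \sup_{f} \left| \sum_t \tilde{y}_t \varepsilon_t (f(z_t(\varepsilon)) - f(z'_t(\varepsilon))) \right| - \lambda^2 \tilde{V} - 2\lambda\beta \tilde{E} - c} \tag{62}$$
$$\le E_{z \sim \rho} E_{\varepsilon}\, e^{2\lambda \sup_{f} \left| \sum_t \tilde{y}_t \varepsilon_t f(z_t(\varepsilon)) \right| - \lambda^2 \tilde{V} - 2\lambda\beta \tilde{E} - c}, \tag{63}$$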

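where $\tilde{y}$ is a symmetrized version of $y$, $\tilde{E} = \sum_t |\tilde{y}_t|$, $\tilde{V} = \sum_t \tilde{y}_t^2$ and we used Jensen's inequality to get the second line. Now we take a $\beta$-cover of $\mathcal{F}$ with respect to the $\ell_\infty$-norm to get the following bound on (63):

$$E_{z \sim \rho}\, N_\infty(\mathcal{F}, \beta, n)\, E_{\varepsilon}\, e^{2\lambda \left| \sum_t \tilde{y}_t \varepsilon_t f(z_t(\varepsilon)) \right| - \lambda^2 \tilde{V} - c} \tag{64}$$
$$= \frac{1}{2} E_{z \sim \rho} E_{\varepsilon}\, e^{2\lambda \left| \sum_t \tilde{y}_t \varepsilon_t f(z_t(\varepsilon)) \right| - \lambda^2 \tilde{V}}. \tag{65}$$

Introduce the events $Y_+ = \{ \sum_t \tilde{y}_t \varepsilon_t f(z_t) \ge 0 \}$ and $Y_- = \{ \sum_t \tilde{y}_t \varepsilon_t f(z_t) < 0 \}$. Then the last line is equal to

$$\frac{1}{2} E_{z \sim \rho} E_{\varepsilon}\, e^{2\lambda \sum_t \tilde{y}_t \varepsilon_t f(z_t(\varepsilon)) - \lambda^2 \tilde{V}}\, I[Y_+] \tag{66}$$
$$+ \frac{1}{2} E_{z \sim \rho} E_{\varepsilon}\, e^{-2\lambda \sum_t \tilde{y}_t \varepsilon_t f(z_t(\varepsilon)) - \lambda^2 \tilde{V}}\, I[Y_-] \tag{67}$$
$$\le \frac{1}{2} E_{z \sim \rho} E_{\varepsilon}\, e^{2\lambda \sum_t \tilde{y}_t \varepsilon_t f(z_t(\varepsilon)) - \lambda^2 \tilde{V}} \tag{68}$$
$$+ \frac{1}{2} E_{z \sim \rho} E_{\varepsilon}\, e^{-2\lambda \sum_t \tilde{y}_t \varepsilon_t f(z_t(\varepsilon)) - \lambda^2 \tilde{V}} \tag{69}$$
$$\le 1, \tag{70}$$

where the last line follows by the standard martingale argument, since $\tilde{y}_t \varepsilon_t f(z_t(\varepsilon))$ is a martingale difference sequence (for a fixed tree $z$).

Lemma 5. Let $z_{1:n}$ be a sample from a process and $z'_{1:n}$ its decoupled sequence. Let $y_{1:n}$ be a process such that each $y_t \in \sigma_{t-1}$. Then for any measurable functions $\varphi : \mathbb{R} \to \mathbb{R}$ and $\psi : \mathcal{Z}^n \to \mathbb{R}$, we have

$$E\, \varphi\Big( \sup_{f \in \mathcal{F}} \Big| \sum_t y_t (f(z_t) - f(z'_t)) \Big| - \psi(z_{1:n}) \Big) \tag{71}$$
$$= E_{\rho} E_{\varepsilon}\, \varphi\Big( \sup_{f \in \mathcal{F}} \Big| \sum_t \tilde{y}_t \varepsilon_t (f(z_t) - f(z'_t)) \Big| - \tilde{\psi} \Big),$$

where $\tilde{\psi}$ is a symmetrized version of $\psi(z_{1:n})$.

Proof. The proof is a direct extension of Theorem 3 from Rakhlin et al. 2011, using the fact that $y_t \in \sigma_{t-1}$.

Proof of Corollary 1. The proof follows from Theorem 2 if we set $\beta = \frac{\alpha}{8}$ and use Lemma 3.

Proof of Lemma 1. The proof follows from the following bound

$$d_{t,n} = \sup_{f \in L(\mathcal{H})} |E_t f - E\, x_f| \tag{72}$$
$$\le E \sup_{f \in L(\mathcal{H})} |E_t f - x_f|. \tag{73}$$

The convergence of the discrepancies then follows from the definition of the uniformly convergent martingale.

2 Exceptional set examples

Markov chains. First, we bound the probability of $A_k$:

$$P[J_n > k] \le P[F_{z_n} > k] \le |S| \max_{s} P[F_s > k]. \tag{74}$$

On the event $B_{k,n}$, we have the following chain of inequalities:

$$\sum_{t=J_n}^{n} I[d_{t,J_n} \le b] \tag{75}$$
$$\ge \sum_{t=J_n}^{n} I[d_{t,J_n} = 0] \tag{76}$$
$$\ge \sum_{t=J_n}^{n} I[z_t = z_{J_n}], \tag{77}$$

which gives us

$$P\Big[J_n \le k \wedge \sum_{t=J_n}^{n} I[d_{t,J_n} \le b] < m\Big] \tag{78}$$
$$\le P\Big[J_n \le k \wedge \sum_{t=J_n}^{n} I[z_t = z_{J_n}] < m\Big] \tag{79}$$
$$\le |S| \max_{s} P\Big[J_n \le k \wedge \sum_{t=J_n}^{n} I[z_t = s] < m \,\Big|\, z_{J_n} = s\Big]. \tag{80}$$

Now, for a given state $s$, $\sum_t I[z_t = s]$ can be lower bounded by the number of times we hit the state $s$ again. Let $T_s^i$, $i \ge 1$, be independent copies of the recurrence times. Then $\sum_t I[z_t = s] \ge m$ for any $m > 0$ such that $\sum_{i=1}^{m} T_s^i \le n - k$. We also have the following sequence of inclusions:

$$\Big\{ \forall i \le m : T_s^i \le \tfrac{n-k}{m} \wedge J_n \le k \wedge z_{J_n} = s \Big\} \tag{81}$$
$$\subseteq \Big\{ \sum_{i=1}^{m} T_s^i \le n - k \wedge J_n \le k \wedge z_{J_n} = s \Big\} \tag{82}$$
$$\subseteq \Big\{ \sum_{t} I[z_t = s] \ge m \wedge J_n \le k \wedge z_{J_n} = s \Big\}. \tag{83}$$

And this gives us

$$P\Big[J_n \le k \wedge \sum_{t} I[z_t = s] < m \,\Big|\, z_{J_n} = s\Big] \tag{84}$$
$$\le P\Big[\exists i \le m : T_s^i > \tfrac{n-k}{m}\Big] \tag{85}$$
$$\le m\, P\Big[T_s > \tfrac{n-k}{m}\Big]. \tag{86}$$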
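To get a feel for the size of a recurrence-time tail such as the one ending the Markov chain example, the following sketch computes $P[T_s > k]$ exactly for a finite chain with a known transition matrix. Everything here is an assumption made for illustration: the function names, the toy matrix, and in particular the choice to split the remaining $n - k$ steps evenly among $m$ returns in `exceptional_bound`.

```python
def return_time_tail(P, s, k):
    """P[T_s > k]: a chain started at state s does not return to s in k steps.

    P is a row-stochastic transition matrix given as a list of lists.
    """
    if k <= 0:
        return 1.0
    others = [j for j in range(len(P)) if j != s]
    # mass[j]: probability of sitting at j after the current step
    # without having revisited s so far.
    mass = {j: P[s][j] for j in others}
    for _ in range(k - 1):
        mass = {j: sum(mass[i] * P[i][j] for i in others) for j in others}
    return sum(mass.values())


def exceptional_bound(P, s, n, k, m):
    """Union bound m * P[T_s > (n - k) / m], capped at 1.

    The even split of the n - k remaining steps among m returns is an
    assumption of this sketch.
    """
    threshold = (n - k) // m
    return min(1.0, m * return_time_tail(P, s, threshold))
```

For a chain whose return times have geometric tails, the bound decays exponentially in $(n - k)/m$, so $m$ can grow with $n$ as long as it grows more slowly than $n$.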

Dynamical systems. The bound on $P[A_k]$ follows from the fact that $J_n \le F(C_n)$. For $B_{k,n}$, we get

$$P[B_{k,n}] \le k \max_{j \le k} P\Big[J_n = j \wedge \sum_{t=j}^{n} I[d_{t,j} \le b] < m\Big]. \tag{87}$$

And similarly to the Markov chain example,

$$P\Big[J_n = j \wedge \sum_{t=j}^{n} I[d_{t,j} \le b] < m\Big] \le m\, P\Big[T(C_j) > \tfrac{n-j}{m}\Big]. \tag{88}$$

General stationary processes. The bound for this case is obtained analogously to the previous two examples, thus we omit the argument.

3 Counter-example for learnability

Theorem 3. Let $\mathcal{Z} = \{0, 1\}$, $\mathcal{H} = [0, 1]$ and $l(h, z) = (h - z)^2$. Also, let $\mathcal{C}$ be the class of all stationary ergodic processes taking values in $\mathcal{Z}$. Then for any learning algorithm that produces a sequence of hypotheses $h_n$, there is a process $P \in \mathcal{C}$ such that

$$P\Big[ \limsup_{n \to \infty} \Big( R_n(h_n) - \inf_{h} R_n(h) \Big) > \frac{1}{16} \Big] \ge \frac{1}{8}. \tag{89}$$

Proof. Using the fact that the minimizer of $E_n (h - z_{n+1})^2$ is $E_n z_{n+1}$, we can rewrite for any $h \in \sigma_n$

$$R_n(h) - \inf_{h'} R_n(h') \tag{90}$$
$$= E_n (h - z_{n+1})^2 - \inf_{h'} E_n (h' - z_{n+1})^2 \tag{91}$$
$$= E_n (h - z_{n+1})^2 - E_n (E_n z_{n+1} - z_{n+1})^2 \tag{92}$$
$$= (h - E_n z_{n+1})^2. \tag{93}$$

A minor modification of the proof of Theorem 1 of Györfi et al., 1998 gives that for every algorithm that produces a sequence $h_n$ of hypotheses, there is a stationary and ergodic process such that

$$P\Big[ \limsup_{n \to \infty} (h_n - E_n z_{n+1})^2 > \frac{1}{16} \Big] \ge \frac{1}{8}, \tag{94}$$

which shows that no algorithm can be a limit learner for the class of all stationary and ergodic binary processes.

4 Connection to time series prediction

The goal of this section is to show the connection of our framework to existing theoretical approaches to time series prediction. In particular, we consider two frameworks that are close to conditional risk minimization. In both cases, we show that conditional risk minimization solves a harder problem, in the sense that its solutions can be used to solve these particular problems, but it requires more assumptions to be valid.

We start with the framework of time series prediction by statistical learning, considered for example in Alquier et al., 2013 and McDonald et al., 2012. Fixing some point in time $n$, we consider a hypothesis class $\mathcal{H} \subseteq \{h : \mathcal{Z}^n \to \mathcal{Z}\}$, where each hypothesis $h$ gives us a prediction of the next step by evaluating the whole history. For any loss function $l : \mathcal{Z} \times \mathcal{Z} \to [0, 1]$, we consider the following risk minimization problem:

$$\min_{h \in \mathcal{H}} E\, l(h(z_{1:n}), z_{n+1}). \tag{95}$$

To set up conditional risk minimization, we define a class of constant functions $\mathcal{H}' = \{h_z(z') = z,\ \forall z \in \mathcal{Z}\}$. Then if the process belongs to a class learnable with $\mathcal{H}'$ and $l$, we can guarantee that there is an algorithm to choose a point $\hat{z}_n$ such that with probability $1 - \delta$

$$E[\,l(\hat{z}_n, z_{n+1}) \mid z_{1:n}] \le \inf_{z} E[\,l(z, z_{n+1}) \mid z_{1:n}] + \varepsilon_n(\delta), \tag{96}$$

where $\varepsilon_n(\delta)$ is a sequence of errors guaranteed by the algorithm for a given confidence $\delta$ and $\varepsilon_n(\delta) \to 0$. Converting this to a bound on the expectation, we get

$$E\, l(\hat{z}_n, z_{n+1}) \le E \inf_{z} E[\,l(z, z_{n+1}) \mid z_{1:n}] \tag{97}$$
$$+\; \varepsilon_n(\delta) + \delta. \tag{98}$$

Notice that

$$E \inf_{z} E[\,l(z, z_{n+1}) \mid z_{1:n}] \tag{99}$$
$$\le E \inf_{h} E[\,l(h(z_{1:n}), z_{n+1}) \mid z_{1:n}] \tag{100}$$
$$\le \inf_{h} E\, l(h(z_{1:n}), z_{n+1}). \tag{101}$$

Therefore, if the process is from a learnable class, there is an algorithm that always gives good predictions according to this framework as well.

The second setting, which was considered by Wintenberger, 2014, is very close to online sequence prediction. To reduce the notation and simplify the presentation, we assume that the learner has access to a (usually finite) hypothesis class $\mathcal{H}$ and at every step $t$ should choose a distribution $\pi_t$ over $\mathcal{H}$ in a way that minimizes the regret:

$$\sum_{t=1}^{n} E_{t-1}\, l(E_{\pi_t} h, z_t) - \min_{h \in \mathcal{H}} \sum_{t=1}^{n} E_{t-1}\, l(h, z_t). \tag{102}$$

Again, if the process belongs to a class learnable with $\mathcal{H}$ and $l$, then there is an algorithm that produces a sequence $h_t$ satisfying, with probability $1 - \delta$,

$$E_{t-1}\, l(h_t, z_t) \le \min_{h \in \mathcal{H}} E_{t-1}\, l(h, z_t) + \varepsilon_t(\delta/n) \tag{103}$$

for all $t$. Summing up over $t$, we get

$$\sum_{t=1}^{n} E_{t-1}\, l(h_t, z_t) \tag{104}$$
$$\le \sum_{t=1}^{n} \min_{h \in \mathcal{H}} E_{t-1}\, l(h, z_t) + \sum_{t=1}^{n} \varepsilon_t(\delta/n) \tag{105}$$
$$\le \min_{h \in \mathcal{H}} \sum_{t=1}^{n} E_{t-1}\, l(h, z_t) + \sum_{t=1}^{n} \varepsilon_t(\delta/n). \tag{106}$$

This gives a bound of $\sum_{t=1}^{n} \varepsilon_t(\delta/n)$ on the regret with high probability. For nice sequences (like i.i.d.), $\varepsilon_t(\delta/n)$ is of order $O\big(\sqrt{\log n / t}\big)$, which gives a regret bound of order $O\big(\sqrt{n \log n}\big)$. On the downside, we can get guarantees only for a class of learnable processes, while the results of Wintenberger, 2014 hold for any stochastic process. The reason for this is that conditional risk minimization is an inherently more difficult problem, since it requires optimizing at every step and not in the cumulative sense.
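The step from per-round errors to a cumulative regret rate can be checked numerically. The sketch below assumes a specific error sequence $\varepsilon_t(\delta/n) = c \sqrt{\log(n/\delta)/t}$, which is a typical i.i.d.-style rate chosen here for illustration and not a formula stated in the text; the constant `c` and both function names are hypothetical as well.

```python
import math


def cumulative_error(n, delta, c=1.0):
    """Sum over t = 1..n of the assumed per-round error c * sqrt(log(n/delta) / t)."""
    return sum(c * math.sqrt(math.log(n / delta) / t) for t in range(1, n + 1))


def regret_envelope(n, delta, c=1.0):
    """Closed-form envelope 2c * sqrt(n * log(n/delta)).

    Uses the elementary bound sum_{t=1}^{n} t^(-1/2) <= 2 * sqrt(n).
    """
    return 2.0 * c * math.sqrt(n * math.log(n / delta))
```

Under this assumed rate the summed errors stay below the envelope for every $n$, which matches the $\sqrt{n \log n}$ order of growth for a fixed $\delta$.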
