Computational and Statistical Learning Theory


Problem sets 5 and 6

Due: November th

Please send your solutions to learning-submissions@ttic.edu

Notations/Definitions

Recall the definition of the sample-based Rademacher complexity:

$$\hat{\mathcal{R}}_S(\mathcal{F}) := \mathbb{E}_{\epsilon \sim \{\pm 1\}^m}\left[\, \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \epsilon_i f(x_i) \right]$$

Definition 1. Given a sample $S = \{x_1, \ldots, x_m\}$ and any $\epsilon > 0$, a set $V \subset \mathbb{R}^m$ is said to be an $\epsilon$-cover (in $\ell_p$) of a function class $\mathcal{F}$ on the sample $S$ if

$$\forall f \in \mathcal{F},\ \exists v \in V \ \text{s.t.}\ \left( \frac{1}{m} \sum_{i=1}^{m} |f(x_i) - v_i|^p \right)^{1/p} \le \epsilon$$

Specifically, for $p = \infty$, $\left( \frac{1}{m} \sum_{i=1}^{m} |f(x_i) - v_i|^p \right)^{1/p}$ is replaced by $\max_{i \in [m]} |f(x_i) - v_i|$. Also define

$$N_p(\mathcal{F}, \epsilon, S) := \min\left\{ |V| : V \text{ is an } \epsilon\text{-cover (in } \ell_p\text{) of } \mathcal{F} \text{ on sample } S \right\}$$

and

$$N_p(\mathcal{F}, \epsilon, n) := \sup_{x_1, \ldots, x_n} N_p(\mathcal{F}, \epsilon, \{x_1, \ldots, x_n\})$$

Definition 2. A function class $\mathcal{F}$ is said to $\epsilon$-shatter a sample $S = \{x_1, \ldots, x_m\}$ if there exists a sequence of thresholds $\theta_1, \ldots, \theta_m \in \mathbb{R}$ such that

$$\forall \xi \in \{\pm 1\}^m,\ \exists f \in \mathcal{F} \ \text{s.t.}\ \forall i \in [m],\ \xi_i \left( f(x_i) - \theta_i \right) \ge \frac{\epsilon}{2}$$

The fat-shattering dimension $\mathrm{fat}_\epsilon(\mathcal{F})$ is the size of the largest sample that $\mathcal{F}$ can $\epsilon$-shatter.
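For intuition, the sample-based Rademacher complexity of a small finite class can be estimated directly by Monte Carlo over the random signs. The sketch below is only an illustration (the threshold-function class and the data are made up, not part of the problems):

```python
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of R_S(F) = E_eps[ sup_f (1/m) sum_i eps_i f(x_i) ].

    `values` is a (|F|, m) array whose rows are (f(x_1), ..., f(x_m)) for each f
    in a finite class F evaluated on the sample S."""
    rng = np.random.default_rng(seed)
    n_funcs, m = values.shape
    total = 0.0
    for _ in range(n_draws):
        eps = rng.choice([-1.0, 1.0], size=m)       # Rademacher signs
        total += np.max(values @ eps) / m           # sup over the finite class
    return total / n_draws

# Hypothetical class of threshold functions f_t(x) = sign(x - t) on a random sample.
x = np.sort(np.random.default_rng(1).uniform(-1, 1, size=20))
thresholds = np.linspace(-1, 1, 15)
vals = np.sign(x[None, :] - thresholds[:, None])    # (|F|, m) matrix of function values
print(empirical_rademacher(vals))
```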

Problems

1. VC Lemma for Real-valued Function Classes: We shall prove that for any function class $\mathcal{F}$ (assume functions in $\mathcal{F}$ are bounded by 1) and scale $\epsilon > 0$, the $\ell_\infty$ covering number at scale $\epsilon$ can be bounded using the fat-shattering dimension at that scale, by proving a statement analogous to the VC lemma. We shall proceed by first extending the statement to finite (specifically $\{0, \ldots, k\}$) valued function classes and then using this to prove the final bound of

$$N_\infty(\mathcal{F}, \epsilon, n) \le \sum_{i=0}^{\mathrm{fat}_{\epsilon/2}(\mathcal{F})} \binom{n}{i} \left(\frac{2}{\epsilon}\right)^{i}$$

(a) Let $\mathcal{F}_k \subseteq \{0, \ldots, k\}^{\mathcal{X}}$ be a function class with $\mathrm{fat}_1(\mathcal{F}_k) = d$; show that

$$N_\infty(\mathcal{F}_k, 1/2, n) \le \sum_{i=0}^{d} \binom{n}{i} k^{i}$$

Show the above statement using induction on $n + d$ (very similar to the first problem on Assignment 2). Hint: In the Assignment 2 problem, where we used $H_+$ and $H_-$, instead use, for all $i \in \{0, \ldots, k\}$, $\mathcal{F}_i = \{f \in \mathcal{F} : f(x_1) = i\}$ (note that this is a simple multilabel extension, and for $k = 1$, $\mathcal{F}_0, \mathcal{F}_1$ are identical to $H_+, H_-$). Use the notion of 1-shattering instead of shattering as in the VC case, and use the 1/2-cover instead of the growth function.

(b) Using the idea of $\epsilon$-discretizing the output of the function class $\mathcal{F}$, we shall conclude the required statement. Do the following:

i. Create a $\{0, \ldots, k\}$-valued class $\mathcal{G}$ where $k$ is of order $1/\epsilon$. Show that covering $\mathcal{G}$ at scale $1/2$ implies we can cover $\mathcal{F}$ at scale $\epsilon$, and hence conclude that we can bound $N_\infty(\mathcal{F}, \epsilon, n)$ in terms of the covering number of $\mathcal{G}$ at scale $1/2$.

ii. Show that $\mathrm{fat}_1(\mathcal{G}) \le \mathrm{fat}_{\epsilon/2}(\mathcal{F})$.

iii. Combine with the bound on $N_\infty(\mathcal{G}, 1/2, n)$ from the previous sub-problem and conclude that

$$N_\infty(\mathcal{F}, \epsilon, n) \le \sum_{i=0}^{\mathrm{fat}_{\epsilon/2}(\mathcal{F})} \binom{n}{i} \left(\frac{2}{\epsilon}\right)^{i}$$
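The discretization step in part (b) is easy to see numerically: rounding every function value to the nearest multiple of $\epsilon$ yields an $\ell_\infty$ cover whose size is that of the discretized class. The sketch below is a minimal illustration on an arbitrary made-up class (the helper name and the data are hypothetical):

```python
import numpy as np

def linf_cover_by_discretization(values, eps):
    """values: (|F|, m) array with entries f(x_i) in [0, 1].
    Rounding each entry to the nearest multiple of eps keeps every f within eps/2
    (hence within eps) in l_infinity of some rounded vector, so the distinct rounded
    rows form an eps-cover of F on the sample."""
    rounded = np.round(values / eps) * eps
    return np.unique(rounded, axis=0)

rng = np.random.default_rng(0)
vals = rng.uniform(0, 1, size=(50, 10))          # 50 arbitrary functions on 10 points
C = linf_cover_by_discretization(vals, eps=0.2)
dist = np.abs(vals[:, None, :] - C[None, :, :]).max(axis=2).min(axis=1)
print(len(C), bool(dist.max() <= 0.2))           # cover size and check of the guarantee
```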

2. Dudley vs. Pollard's Bounds: In class we saw that the Rademacher complexity of a function class $\mathcal{F}$ (assume functions in $\mathcal{F}$ are bounded by 1) can be bounded in terms of covering numbers using Pollard's bound, the Dudley integral bound, and a slightly modified version of the Dudley integral bound, as follows:

$$\hat{\mathcal{R}}_S(\mathcal{F}) \le \inf_{\epsilon > 0}\left\{ \epsilon + \sqrt{\frac{2 \log N_\infty(\mathcal{F}, \epsilon, m)}{m}} \right\} \qquad \text{(Pollard)}$$

$$\hat{\mathcal{R}}_S(\mathcal{F}) \le \frac{10}{\sqrt{m}} \int_{0}^{1} \sqrt{\log N_2(\mathcal{F}, \tau, S)}\, d\tau \qquad \text{(Dudley)}$$

$$\hat{\mathcal{R}}_S(\mathcal{F}) \le \inf_{\epsilon > 0}\left\{ 4\epsilon + \frac{10}{\sqrt{m}} \int_{\epsilon}^{1} \sqrt{\log N_2(\mathcal{F}, \tau, S)}\, d\tau \right\} \qquad \text{(Refined Dudley)}$$

In this problem, using some examples, we shall compare these bounds.

(a) Class with finite VC subgraph-dimension: Assume that the VC subgraph-dimension of the function class $\mathcal{F}$ is bounded by $D$. In this case, the result in Problem 1 can be used to bound the covering number of $\mathcal{F}$ in terms of $D$. Use this bound on the covering number and compare Pollard's bound with the refined Dudley integral bound by writing down the bounds implied by each one.

(b) Linear class with bounded norm: Linear classes in high-dimensional spaces are probably among the most important and most used function classes in machine learning. Consider the specific example where $\mathcal{X} = \{x : \|x\|_2 \le 1\}$ and $\mathcal{F} = \{x \mapsto w \cdot x : \|w\|_2 \le 1\}$. In class we saw that for any $\epsilon > 0$, $\mathrm{fat}_\epsilon(\mathcal{F}) \le \frac{1}{\epsilon^2}$. Using this with the result in Problem 1 we have that:

$$N_2(\mathcal{F}, \epsilon, m) \le N_\infty(\mathcal{F}, \epsilon, m) \le \left(\frac{e m}{\epsilon}\right)^{O(1/\epsilon^2)}$$

Use the above bound on the covering number and write down the bound on the Rademacher complexity implied by Pollard's bound. Then write down the bound on the Rademacher complexity implied by the refined version of the Dudley integral bound.
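To get a feel for how such bounds behave, one can evaluate them numerically once a covering-number estimate is available. The sketch below is purely illustrative: it minimizes the Pollard expression over a grid of scales for a hypothetical covering-number profile (the function `log_cover` is an assumption, not something derived in this problem set):

```python
import numpy as np

def pollard_bound(log_cover, m, eps_grid):
    """Evaluate inf_eps { eps + sqrt(2 * log N_inf(F, eps, m) / m) } over a grid of scales.

    `log_cover(eps)` should return (an upper bound on) log N_inf(F, eps, m)."""
    return min(eps + np.sqrt(2.0 * log_cover(eps) / m) for eps in eps_grid)

# Hypothetical profile log N ~ (1/eps^2) * log(m/eps), e.g. a norm-bounded linear class.
for m in [100, 1000, 10000]:
    log_cover = lambda e, m=m: (1.0 / e**2) * np.log(m / e)
    print(m, pollard_bound(log_cover, m, np.logspace(-3, 0, 200)))
```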

3. Data-Dependent Bound: Recall the Rademacher complexity bound we proved in class for function classes $\mathcal{F}$ bounded by 1: for any $\delta > 0$, with probability at least $1 - \delta$,

$$\sup_{f \in \mathcal{F}} \left( \mathbb{E}[f(x)] - \hat{\mathbb{E}}_S[f(x)] \right) \le 2\, \mathbb{E}_{S \sim \mathcal{D}^m}\!\left[ \hat{\mathcal{R}}_S(\mathcal{F}) \right] + \sqrt{\frac{\log(1/\delta)}{m}}$$

Note that we do not know the distribution $\mathcal{D}$. One way we used the above bound was by providing upper bounds on $\hat{\mathcal{R}}_S(\mathcal{F})$ valid for any sample of size $m$, and using these instead of $\mathbb{E}_{S \sim \mathcal{D}^m}[\hat{\mathcal{R}}_S(\mathcal{F})]$. But ideally we would like to get tight bounds when the distribution we are faced with is nicer. The aim of this problem is to do exactly that. Prove that for any $\delta > 0$, with probability at least $1 - \delta$ over the draw of the sample $S \sim \mathcal{D}^m$,

$$\sup_{f \in \mathcal{F}} \left( \mathbb{E}[f(x)] - \hat{\mathbb{E}}_S[f(x)] \right) \le 2\, \hat{\mathcal{R}}_S(\mathcal{F}) + K \sqrt{\frac{\log(2/\delta)}{m}}$$

(provide an explicit value for the constant $K$ above). Notice that in the above bound the expected Rademacher complexity is replaced by the sample-based one, which can be calculated from the training sample. Hint: Use McDiarmid's inequality on the expected Rademacher complexity.

4. Learnability and Fat-shattering Dimension: Recall the setting of the stochastic optimization problem, where the objective function is a mapping $r : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$. A sample $S = \{z_1, \ldots, z_m\}$ drawn i.i.d. from an unknown distribution $\mathcal{D}$ is provided to the learner, and the aim of the learner is to output $\hat{h} \in \mathcal{H}$, based on the sample, that has low expected objective $\mathbb{E}_{z \sim \mathcal{D}}[r(\hat{h}, z)]$.

(a) Consider the stochastic optimization problem with $r$ bounded by $a$, i.e. $|r(h, z)| < a < \infty$ for all $h \in \mathcal{H}$ and $z \in \mathcal{Z}$. If the function class $\mathcal{F} := \{z \mapsto r(h, z) : h \in \mathcal{H}\}$ has finite $\mathrm{fat}_\epsilon$ for all $\epsilon > 0$, show that the problem is learnable.

(b) Conclude that for a supervised learning problem with a bounded hypothesis class $\mathcal{H}$ (i.e. $\forall x \in \mathcal{X},\ |h(x)| < a$) and a loss $\phi : \hat{\mathcal{Y}} \times \mathcal{Y} \to \mathbb{R}$ that is $L$-Lipschitz (in its first argument), if $\mathcal{H}$ has finite $\mathrm{fat}_\epsilon$ for all $\epsilon > 0$, then the problem is learnable.

(c) Show a stochastic optimization problem that is learnable even though it has infinite $\mathrm{fat}_\epsilon$ for all $\epsilon \le 0.1$ (or any other constant of your choice). Explicitly write down the hypothesis class and the learning rule which learns the class, argue that the problem is learnable, and explain why the fat-shattering dimension is infinite. Hint: You can even make the successful learning rule be ERM.

(d) Prove that for a supervised learning problem with the absolute loss $\phi(\hat{y}, y) = |\hat{y} - y|$, if $\mathrm{fat}_\epsilon$ is infinite for some $\epsilon > 0$, then the problem is not learnable. Hint: as in the binary case, for every $m$, construct a distribution which is concentrated on a set of points that can be fat-shattered.
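The hint in Problem 3 relies on $\hat{\mathcal{R}}_S(\mathcal{F})$ changing very little when a single sample point is replaced. The following sketch checks this numerically for a hypothetical class of threshold functions (illustration only; the class, the data and the constants are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0, 1, 25)
m = 200
EPS = rng.choice([-1.0, 1.0], size=(2000, m))    # fixed sign draws shared across calls

def emp_rademacher(x):
    """Sample-based Rademacher complexity of the threshold class {x -> 1{x >= t}}
    on the points x, averaged over the fixed sign vectors EPS."""
    vals = (x[None, :] >= thresholds[:, None]).astype(float)   # (|F|, m)
    return np.mean(np.max(EPS @ vals.T, axis=1)) / m

x = rng.uniform(0, 1, m)
base = emp_rademacher(x)
# Replace one point at a time: since every f takes values in [0, 1], the supremum inside
# the expectation moves by at most 1/m, which is the bounded-difference property that
# McDiarmid's inequality needs.
changes = []
for _ in range(20):
    x2 = x.copy()
    x2[rng.integers(m)] = rng.uniform(0, 1)
    changes.append(abs(emp_rademacher(x2) - base))
print(base, max(changes), 1.0 / m)
```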

5. Massart's Lemma: In this problem we want to prove Massart's lemma, i.e. that for a finite class $\mathcal{F} \subset \mathbb{R}^m$ the following inequality holds:

$$\hat{\mathcal{R}}(\mathcal{F}) \le \frac{a \sqrt{2 \log |\mathcal{F}|}}{m} \qquad (1)$$

where $a = \sup_{f \in \mathcal{F}} \|f\|_2$.

(a) Use Jensen's inequality and the union bound to prove that for any $\lambda > 0$,

$$e^{\lambda m \hat{\mathcal{R}}(\mathcal{F})} \le \sum_{f \in \mathcal{F}} \mathbb{E}_{\epsilon \sim \{\pm 1\}^m}\left[ e^{\lambda \sum_{i=1}^{m} \epsilon_i f_i} \right] \qquad (2)$$

(b) Find an upper bound for the expected value in the above inequality using the fact that $e^{x} + e^{-x} \le 2 e^{x^2/2}$.

(c) Show that for any $\lambda > 0$,

$$\hat{\mathcal{R}}(\mathcal{F}) \le \frac{1}{m}\left( \frac{\log |\mathcal{F}|}{\lambda} + \frac{\lambda a^2}{2} \right) \qquad (3)$$

(d) Find the $\lambda$ that minimizes the above bound.

6. Refined Dudley Integral Bound on the Rademacher Complexity: Given a sample $S = (x_1, \ldots, x_m)$, we shall prove that for any function class $\mathcal{F}$ of functions $f : \mathcal{X} \to \mathbb{R}$,

$$\hat{\mathcal{R}}_S(\mathcal{F}) \le \inf_{\alpha > 0}\left\{ 4\alpha + \frac{10}{\sqrt{m}} \int_{\alpha}^{\sup_{f \in \mathcal{F}} \sqrt{\hat{\mathbb{E}}[f^2]}} \sqrt{\log N_2(\mathcal{F}, \tau, S)}\, d\tau \right\} \qquad (4)$$

where $\hat{\mathbb{E}}[f^2] = \frac{1}{m} \sum_{i=1}^{m} f^2(x_i)$.

(a) Let $\beta_0 = \sup_{f \in \mathcal{F}} \sqrt{\hat{\mathbb{E}}[f^2]}$, and for any $j \in \mathbb{Z}_+$ let $\beta_j = \beta_0 / 2^{j}$ and let $C_j$ be a $\beta_j$-cover (in $\ell_2$) of the function class $\mathcal{F}$ on the sample $S$. For each $f \in \mathcal{F}$ we pick an $\hat{f}^{j} \in C_j$ such that $\sqrt{\hat{\mathbb{E}}[(f - \hat{f}^{j})^2]} \le \beta_j$. The main idea of the proof is that for any $N > 0$ we can express $f$ by chaining as

$$f = f - \hat{f}^{N} + \sum_{j=1}^{N} \left( \hat{f}^{j} - \hat{f}^{j-1} \right) \qquad (5)$$

where $\hat{f}^{0} = 0$. Show that, under the conditions stated above, for any $N > 0$ the following inequality holds:

$$\hat{\mathcal{R}}_S(\mathcal{F}) \le \beta_N + \sum_{j=1}^{N} \hat{\mathcal{R}}_S(\mathcal{G}_j) \qquad (6)$$

where $\mathcal{G}_j = \{ \hat{f}^{j} - \hat{f}^{j-1} : \hat{f}^{j} \in C_j,\ \hat{f}^{j-1} \in C_{j-1} \}$.

(b) Show that $|\mathcal{G}_j| \le N_2(\mathcal{F}, \beta_j, S)^2$.

(c) Use Massart's lemma to derive the following upper bound on the Rademacher complexity of the class $\mathcal{G}_j$:

$$\hat{\mathcal{R}}_S(\mathcal{G}_j) \le 5 \beta_j \sqrt{\frac{\log N_2(\mathcal{F}, \beta_j, S)}{m}} \qquad (7)$$

(d) Show that the following inequality holds:

$$\hat{\mathcal{R}}_S(\mathcal{F}) \le \beta_N + 10 \sum_{j=1}^{N} \left( \beta_j - \beta_{j+1} \right) \sqrt{\frac{\log N_2(\mathcal{F}, \beta_j, S)}{m}} \qquad (8)$$

(e) Prove the following bound for any $j > 0$:

$$\left( \beta_j - \beta_{j+1} \right) \sqrt{\frac{\log N_2(\mathcal{F}, \beta_j, S)}{m}} \le \int_{\beta_{j+1}}^{\beta_j} \sqrt{\frac{\log N_2(\mathcal{F}, \tau, S)}{m}}\, d\tau \qquad (9)$$

Then find an upper bound for the sum in (8) as a single integral.

(f) Show that for any $\alpha > 0$ we can choose $N$ such that

$$\hat{\mathcal{R}}_S(\mathcal{F}) \le 4\alpha + \frac{10}{\sqrt{m}} \int_{\alpha}^{\sup_{f \in \mathcal{F}} \sqrt{\hat{\mathbb{E}}[f^2]}} \sqrt{\log N_2(\mathcal{F}, \tau, S)}\, d\tau \qquad (10)$$

Conclude that the refined Dudley integral bound (4) holds.
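As a quick numerical sanity check of the statement (1) (illustration only, for an arbitrary randomly generated finite class), one can compare a Monte Carlo estimate of $\hat{\mathcal{R}}(\mathcal{F})$ with $a\sqrt{2\log|\mathcal{F}|}/m$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_funcs = 50, 40
F = rng.uniform(-1, 1, size=(n_funcs, m))        # an arbitrary finite class F in R^m

# Monte Carlo estimate of R(F) = E_eps[ max_f (1/m) sum_i eps_i f_i ]
eps = rng.choice([-1.0, 1.0], size=(20000, m))
lhs = np.mean(np.max(eps @ F.T, axis=1)) / m

a = np.max(np.linalg.norm(F, axis=1))            # a = sup_f ||f||_2
rhs = a * np.sqrt(2 * np.log(n_funcs)) / m       # Massart's bound
print(lhs, rhs, bool(lhs <= rhs))
```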

7. A General Bound for All Norms: In class we studied the hypothesis class $\mathcal{H}_B = \{w : \|w\| \le B\}$ and the corresponding rule $\mathrm{ERM}_B$, and showed that for every $B$, w.h.p. over $S$, for all $w \in \mathcal{H}_B$ we had

$$\left| \hat{L}(w) - L(w) \right| \le O\!\left( \sqrt{\frac{B^2 + \log(1/\delta)}{m}} \right) \qquad (11)$$

We then used this to get a bound on $L_{\{0,1\}}$ in terms of $\hat{L}_\gamma$, for each $\gamma$ separately. We will now change the order of quantifiers and get a bound that holds, w.h.p., over all $\gamma$ concurrently, with only a very mild (double logarithmic) additional term.

(a) Prove that, for $\mathcal{H} \subseteq [-a, a]^{\mathcal{X}}$ and any $\mathcal{D}(X, Y)$, w.h.p. over $S \sim \mathcal{D}^m$, for all $h \in \mathcal{H}$ and all $\gamma > 0$:

$$L_{\{0,1\}}(h) \le \hat{L}_\gamma(h) + O\!\left( \frac{\mathcal{R}_m(\mathcal{H})}{\gamma} + \sqrt{\frac{\log\log(a/\gamma) + \log(1/\delta)}{m}} \right) \qquad (12)$$

Rewrite the above bound with proper constants and without big-O notation.

(b) In particular, linear separators with geometric margin $\gamma$ correspond to using $\mathcal{H} = \{w : \|w\| \le 1\}$, in which case show that we have, w.h.p. over $S \sim \mathcal{D}^m$, for all $w$ and all $\gamma > 0$:

$$L_{\{0,1\}}(w) \le \hat{L}_\gamma(w) + O\!\left( \sqrt{\frac{1/\gamma^2 + \log(1/\delta)}{m}} \right) \qquad (13)$$

Rewrite the above bound with proper constants and without big-O notation.
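For a concrete reading of the quantities compared in Problem 7: the empirical margin loss $\hat{L}_\gamma$ counts training points whose margin falls below $\gamma$, while $L_{\{0,1\}}$ is the ordinary misclassification error. The sketch below just computes both for a hypothetical unit-norm linear predictor on synthetic data (an illustration, not a solution):

```python
import numpy as np

def zero_one_loss(w, X, y):
    """Misclassification rate L_{0,1} of the linear predictor x -> <w, x>."""
    return np.mean(np.sign(X @ w) != y)

def margin_loss(w, X, y, gamma):
    """Empirical margin loss: fraction of points with y * <w, x> below gamma."""
    return np.mean(y * (X @ w) < gamma)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # keep ||x||_2 <= 1
w = rng.normal(size=5); w /= np.linalg.norm(w)                    # hypothetical predictor
y = np.sign(X @ w + 0.05 * rng.normal(size=500))                  # noisy labels

for gamma in [0.0, 0.05, 0.1, 0.2]:
    print(gamma, zero_one_loss(w, X, y), margin_loss(w, X, y, gamma))
```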

Challenge Problems

1. We saw that for any distribution $\mathcal{D}$, the expected Rademacher complexity provides an upper bound on the maximum deviation between mean and empirical average, uniformly over the function class; specifically, we saw that

$$\mathbb{E}_{S \sim \mathcal{D}^m}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E}[f(x)] - \hat{\mathbb{E}}_S[f(x)] \right) \right] \le 2\, \mathbb{E}_{S \sim \mathcal{D}^m}\left[ \hat{\mathcal{R}}_S(\mathcal{F}) \right]$$

Prove the (almost) converse, that

$$\frac{1}{2}\, \mathbb{E}_{S \sim \mathcal{D}^m}\left[ \hat{\mathcal{R}}_S(\mathcal{F}) \right] \le \mathbb{E}_{S \sim \mathcal{D}^m}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E}[f(x)] - \hat{\mathbb{E}}_S[f(x)] \right) \right]$$

This basically establishes that the Rademacher complexity tightly bounds the uniform maximal deviation for every distribution.

2. The worst-case Rademacher complexity is defined as

$$\mathcal{R}_m(\mathcal{F}) = \sup_{S = \{x_1, \ldots, x_m\}} \hat{\mathcal{R}}_S(\mathcal{F})$$

(i.e. the supremum over samples of size $m$).

(a) Prove that for any function class $\mathcal{F}$ and any $\tau > \mathcal{R}_m(\mathcal{F})$, we have that

$$\mathrm{fat}_\tau(\mathcal{F}) \le \frac{4 m\, \mathcal{R}_m(\mathcal{F})^2}{\tau^2}$$

(b) Combine the above with the refined version of the Dudley integral bound to prove that

$$\inf_{\alpha \ge 0}\left\{ 4\alpha + \frac{10}{\sqrt{m}} \int_{\alpha}^{1} \sqrt{\log N_2(\mathcal{F}, \tau, m)}\, d\tau \right\} \le O\!\left(\log^{3/2} m\right) \cdot \mathcal{R}_m(\mathcal{F})$$

This shows that the refined Dudley integral bound is tight to within log factors of the Rademacher complexity. Thus we have established that, in the worst case, all the complexity measures for a function class (Rademacher complexity, covering numbers and fat-shattering dimension) tightly govern the rate of uniform maximal deviation for the function class (all to within log factors).
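The two quantities in Challenge Problem 1 can be compared numerically for a small class; the sketch below (illustration only, with a hypothetical threshold class under the uniform distribution) estimates both sides of the symmetrization bound stated above:

```python
import numpy as np

rng = np.random.default_rng(0)
thresholds = np.linspace(0, 1, 20)            # hypothetical class f_t(x) = 1{x >= t}
m, n_trials, n_eps = 100, 200, 200
true_means = 1.0 - thresholds                 # E[f_t(x)] for x ~ Uniform[0, 1]

devs, rads = [], []
for _ in range(n_trials):
    x = rng.uniform(0, 1, m)
    V = (x[None, :] >= thresholds[:, None]).astype(float)        # (|F|, m)
    devs.append(np.max(true_means - V.mean(axis=1)))             # sup_f (E f - E_S f)
    eps = rng.choice([-1.0, 1.0], size=(n_eps, m))
    rads.append(np.mean(np.max(eps @ V.T, axis=1)) / m)          # hat R_S(F)

print(np.mean(devs), 2 * np.mean(rads))       # E[sup deviation] vs 2 E[hat R_S(F)]
```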

3. Bounded Difference Inequality, Stability and Generalization: Recall that a function $G : \mathcal{X}^m \to \mathbb{R}$ is said to satisfy the bounded difference property if for all $i \in [m]$ and all $x_1, \ldots, x_m, x'_i \in \mathcal{X}$,

$$\left| G(x_1, \ldots, x_i, \ldots, x_m) - G(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_m) \right| \le c$$

for some $c \ge 0$. In this case McDiarmid's inequality gives us that for any $\delta > 0$, with probability at least $1 - \delta$,

$$G(x_1, \ldots, x_m) \le \mathbb{E}\left[ G(x_1, \ldots, x_m) \right] + \sqrt{\frac{m c^2 \log(1/\delta)}{2}}$$

The bounded difference property turns out to be quite useful for analyzing learning algorithms directly (instead of looking at the uniform deviation over a function class).

A proper learning algorithm $A : \mathcal{X}^m \to \mathcal{F}$ is said to be uniformly $\beta$-stable if for all $i \in [m]$ and any $x_1, \ldots, x_m, x'_i \in \mathcal{X}$,

$$\sup_{x} \left| A(x_1, \ldots, x_i, \ldots, x_m)(x) - A(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_m)(x) \right| \le \beta$$

Assuming functions in $\mathcal{F}$ are bounded by 1, we shall prove that the learning algorithm generalizes well (its expected loss is close to its empirical loss). Specifically, we shall prove that for any $\delta > 0$, with probability at least $1 - \delta$,

$$L(A(S)) - \hat{L}(A(S)) \le \beta + 2\left( \beta + \frac{1}{m} \right) \sqrt{\frac{m \log(1/\delta)}{2}}$$

where $L(A(S)) = \mathbb{E}_x[A(S)(x)]$ and $\hat{L}(A(S)) = \frac{1}{m} \sum_{i=1}^{m} A(S)(x_i)$.

(a) First show that $\mathbb{E}_S\left[ L(A(S)) - \hat{L}(A(S)) \right] \le \beta$. Hint: Use renaming of variables to first show that for any $i \in [m]$,

$$\mathbb{E}_S\left[ \hat{L}(A(S)) \right] = \mathbb{E}_{S, x'_i}\left[ A(x_1, \ldots, x_{i-1}, x'_i, x_{i+1}, \ldots, x_m)(x'_i) \right]$$

(b) Show that the function $G(S) = L(A(S)) - \hat{L}(A(S))$ satisfies the bounded difference property with $c \le 2\beta + \frac{2}{m}$. Conclude the required statement using McDiarmid's inequality.

(c) Consider the stochastic convex optimization problem where the sample is $z = (x, y)$, with $y$ real-valued and the $x$'s from the unit ball in some Hilbert space, the hypotheses are weight vectors $w$ from the same Hilbert space, and the objective is

$$r(w, (x, y)) = \left| \langle w, x \rangle - y \right| + \lambda \|w\|^2$$

Show that the ERM algorithm is stable for this problem, and thus provide a bound for this algorithm.
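Uniform stability is easy to probe numerically. The sketch below does so for ridge regression (squared loss with an $\ell_2$ penalty, used here because its ERM has a closed form; it is a stand-in for, not the same as, the absolute-loss objective of part (c)), measuring how far the predictions can move when one training example is replaced:

```python
import numpy as np

def ridge_erm(X, y, lam):
    """Minimizer of (1/m) * sum_i (<w, x_i> - y_i)^2 + lam * ||w||^2 (closed form)."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X / m + lam * np.eye(d), X.T @ y / m)

rng = np.random.default_rng(0)
m, d, lam = 200, 10, 0.1
X = rng.normal(size=(m, d))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)   # keep ||x||_2 <= 1
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

w = ridge_erm(X, y, lam)
max_change = 0.0
for i in range(10):
    X2, y2 = X.copy(), y.copy()
    v = rng.normal(size=d)
    X2[i], y2[i] = v / np.linalg.norm(v), rng.normal()            # swap in a new example
    w2 = ridge_erm(X2, y2, lam)
    # sup over ||x|| <= 1 of |<w, x> - <w2, x>| equals ||w - w2||
    max_change = max(max_change, np.linalg.norm(w - w2))
print(max_change, 1.0 / (lam * m))   # stability-style scale 1/(lam * m) for reference
```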

4. $\ell_1$ Neural Network: A $k$-layer $\ell_1$-norm neural network is given by the function class $\mathcal{F}_k$, defined recursively as follows:

$$\mathcal{F}_1 = \left\{ x \mapsto \sum_{j=1}^{d} w_j x_j \;:\; \|w\|_1 \le B \right\}$$

and further, for each $2 \le i \le k$,

$$\mathcal{F}_i = \left\{ x \mapsto \sum_{j=1}^{d_i} w_j\, \sigma\!\left( f_j(x) \right) \;:\; \forall j \in [d_i],\ f_j \in \mathcal{F}_{i-1},\ \|w\|_1 \le B \right\}$$

where $d_i$ is the number of nodes in the $i$-th layer of the network. The function $\sigma : \mathbb{R} \to [-1, 1]$ is called the squash function and is generally a smooth, monotonic, non-decreasing function (a typical example is the tanh function). Assume that the input space is $\mathcal{X} = [0, 1]^d$ and that $\sigma$ is $L$-Lipschitz. Prove that

$$\hat{\mathcal{R}}_S(\mathcal{F}_k) \le (2B)^{k} L^{k-1} \sqrt{\frac{2 \log d}{m}}$$

Notice that the $d_i$'s do not appear in the above bound, indicating that the number of nodes in the intermediate layers does not affect the upper bound on the Rademacher complexity. Hint: prove a bound on the Rademacher complexity of $\mathcal{F}_i$ recursively in terms of the Rademacher complexity of $\mathcal{F}_{i-1}$.
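The independence from the intermediate widths is easiest to appreciate by simply evaluating the stated bound; the helper below (a trivial illustration, using the bound exactly as written above) takes the depth, norm bound, Lipschitz constant, input dimension and sample size:

```python
import numpy as np

def l1_network_rademacher_bound(k, B, L, d, m):
    """Evaluate (2B)^k * L^(k-1) * sqrt(2 * log(d) / m); the layer widths d_i never enter."""
    return (2.0 * B) ** k * L ** (k - 1) * np.sqrt(2.0 * np.log(d) / m)

# A few depths with B = 1, a 1-Lipschitz squash function (e.g. tanh), d = 100 inputs.
for k in [1, 2, 3]:
    print(k, l1_network_rademacher_bound(k, B=1.0, L=1.0, d=100, m=10000))
```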
