Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Liu Yang, Steve Hanneke, Jaime Carbonell

December 2012
CMU-ML-1-11
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We study the optimal rates of convergence for estimating a prior distribution over a VC class from a sequence of independent data sets respectively labeled by independent target functions sampled from the prior. We specifically derive upper and lower bounds on the optimal rates under a smoothness condition on the correct prior, with the number of samples per data set equal to the VC dimension. These results have implications for the improvements achievable via transfer learning.

This research is supported in part by NSF grant IIS and a Google Core AI grant.

Keywords: Minimax Rates, Transfer Learning, VC Dimension, Bayesian Learning

1 Introduction

In the transfer learning setting, we are presented with a sequence of learning problems, each with some respective target concept we are tasked with learning. The key question in transfer learning is how to leverage our access to past learning problems in order to improve performance on learning problems we will be presented with in the future. Among the several proposed models for transfer learning, one particularly appealing model supposes the learning problems are independent and identically distributed, with unknown distribution, and the advantage of transfer learning then comes from the ability to estimate this shared distribution based on the data from past learning problems [Baxter, 1997, Yang et al., 2011]. For instance, when customizing a speech recognition system to a particular speaker's voice, we might expect the first few people would need to speak many words or phrases in order for the system to accurately identify the nuances. However, after performing this for many different people, if the software has access to those past training sessions when customizing itself to a new user, it should have identified important properties of the speech patterns, such as the common patterns within each of the major dialects or accents, and other such information about the distribution of speech patterns within the user population. It should then be able to leverage this information to reduce the number of words or phrases the next user needs to speak in order to train the system, for instance by first trying to identify the individual's dialect, then presenting phrases that differentiate common subpatterns within that dialect, and so forth.

In analyzing the benefits of transfer learning in such a setting, one important question to ask is how quickly we can estimate the distribution from which the learning problems are sampled. In recent work, [Yang et al., 2011] have shown that under mild conditions on the family of possible distributions, if the target concepts reside in a known VC class, then it is possible to estimate this distribution to arbitrary precision using only a bounded number of training samples per task: specifically, a number of samples equal to the VC dimension. However, that work left open the question of quantifying the rate of convergence. This rate of convergence can have a direct impact on how much benefit we gain from transfer learning when we are faced with only a finite sequence of learning problems. As such, it is certainly desirable to derive tight characterizations of this rate of convergence. The present work continues that of [Yang et al., 2011], bounding the rate of convergence for estimating this distribution under a smoothness condition on the distribution. We derive a generic upper bound, which holds regardless of the VC class in which the target concepts reside. The proof of this result builds on the earlier work of [Yang et al., 2011], but requires several interesting innovations to make the rate of convergence explicit, and to dramatically improve the upper bounds on certain quantities compared to the analogous bounds implicit in the original proofs. We further derive a nontrivial lower bound that holds for certain constructed scenarios, which illustrates a lower limit on how good a general upper bound we might hope for in results expressed only in terms of the number of tasks, the smoothness conditions, and the VC dimension.

2 The Setting

Let (X, B_X) be a Borel space [Schervish, 1995] (where X is called the instance space), and let D be a distribution on X (called the data distribution). Let C be a VC class of measurable classifiers h : X → {−1, +1} (called the concept space), and denote by d the VC dimension of C [Vapnik, 1982]. We suppose C is equipped with its Borel σ-algebra B induced by the pseudo-metric ρ(h, g) = D({x ∈ X : h(x) ≠ g(x)}). Though our results can be formulated for general D (with somewhat more complicated theorem statements), to simplify the statement of results we suppose ρ is actually a metric, which would follow from appropriate topological conditions on C relative to D. For any two probability measures µ₁, µ₂ on a measurable space (Ω, F), define the total variation distance

‖µ₁ − µ₂‖ = sup_{A∈F} |µ₁(A) − µ₂(A)|.

Let Π_Θ = {π_θ : θ ∈ Θ} be a family of probability measures on C (called priors), where Θ is an arbitrary index set (called the parameter space). We additionally suppose there exists a probability measure π₀ on C (called the reference measure) such that every π_θ is absolutely continuous with respect to π₀, and therefore has a density function f_θ given by the Radon-Nikodym derivative dπ_θ/dπ₀ [Schervish, 1995].

We consider the following type of estimation problem. There is a collection of C-valued random variables {h*_{tθ} : t ∈ N, θ ∈ Θ}, where for any fixed θ ∈ Θ the {h*_{tθ}}_{t=1}^∞ variables are i.i.d. with distribution π_θ. For each θ ∈ Θ, there is a sequence Z_t(θ) = {(X_{t1}, Y_{t1}(θ)), (X_{t2}, Y_{t2}(θ)), ...}, where {X_{ti}}_{t,i∈N} are i.i.d. D, and for each t, i ∈ N, Y_{ti}(θ) = h*_{tθ}(X_{ti}). We additionally denote by Z_{tk}(θ) = {(X_{t1}, Y_{t1}(θ)), ..., (X_{tk}, Y_{tk}(θ))} the first k elements of Z_t(θ), for any k ∈ N, and similarly X_{tk} = {X_{t1}, ..., X_{tk}} and Y_{tk}(θ) = {Y_{t1}(θ), ..., Y_{tk}(θ)}. Following the terminology used in the transfer learning literature, we refer to the collection of variables associated with each t collectively as the t-th task. We will be concerned with sequences of estimators θ̂_{Tθ} = θ̂_T(Z_{1k}(θ), ..., Z_{Tk}(θ)), for T ∈ N, which are based on only a bounded number k of samples per task, among the first T tasks. Our main results specifically study the case of k = d. For any such estimator, we measure the risk as E[‖π_{θ̂_{Tθ}} − π_θ‖], and will be particularly interested in upper-bounding the worst-case risk sup_{θ∈Θ} E[‖π_{θ̂_{Tθ}} − π_θ‖] as a function of T, and lower-bounding the minimum possible value of this worst-case risk over all possible θ̂_T estimators (called the minimax risk).

In previous work, [Yang et al., 2011] showed that, if Π_Θ is a totally bounded family, then even with only d samples per task, the minimax risk (as a function of the number of tasks T) converges to zero. In fact, that work also includes a proof that this is not necessarily the case in general for any number of samples less than d. However, the actual rates of convergence were not explicitly derived in that work, and indeed the upper bounds on the rates of convergence implicit in that analysis may often have fairly complicated dependences on C, Π_Θ, and D, and furthermore often provide only very slow rates of convergence.
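As a concrete illustration of the risk measure (not from the paper), the following minimal Python sketch computes the total variation distance between two priors that are specified, as above, by densities with respect to a common reference measure π₀ over a small, hypothetical finite concept class; the particular numbers are made up for illustration.

```python
import numpy as np

# Hypothetical finite concept class of 5 classifiers, a reference measure pi0,
# and two priors given by their densities f1, f2 with respect to pi0.
pi0 = np.array([0.2, 0.2, 0.2, 0.2, 0.2])   # reference measure pi_0
f1  = np.array([1.5, 1.0, 0.5, 1.0, 1.0])   # density of pi_1 w.r.t. pi_0
f2  = np.array([0.5, 1.0, 1.5, 1.0, 1.0])   # density of pi_2 w.r.t. pi_0

# Total variation distance: ||pi_1 - pi_2|| = (1/2) * integral of |f1 - f2| d pi0,
# which over a finite class is just a weighted sum.
tv = 0.5 * np.sum(np.abs(f1 - f2) * pi0)
print(tv)  # 0.2
```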

To derive explicit bounds on the rates of convergence, in the present work we specifically focus on families of smooth densities. The motivation for involving a notion of smoothness in characterizing rates of convergence is clear if we consider the extreme case in which Π_Θ contains two priors π₁ and π₂, with π₁({h}) = π₂({g}) = 1, where ρ(h, g) is a very small but nonzero value; in this case, if we have only a small number of samples per task, we would require many tasks (on the order of 1/ρ(h, g)) to observe any data points carrying any information that would distinguish between these two priors (namely, points x with h(x) ≠ g(x)); yet ‖π₁ − π₂‖ = 1, so that we have a slow rate of convergence (at least initially). A total boundedness condition on Π_Θ would limit the number of such pairs present in Π_Θ, so that for instance we cannot have arbitrarily close h and g, but less extreme variants of this can lead to slow asymptotic rates of convergence as well. Specifically, in the present work we consider the following notion of smoothness. For L ∈ (0, ∞) and α ∈ (0, 1], a function f : C → R is (L, α)-Hölder smooth if

∀h, g ∈ C, |f(h) − f(g)| ≤ L ρ(h, g)^α.

3 An Upper Bound

We now have the following theorem, holding for an arbitrary VC class C and data distribution D; it is the main result of this work.

Theorem 1. For Π_Θ any class of priors on C having (L, α)-Hölder smooth densities {f_θ : θ ∈ Θ}, for any T ∈ N, there exists an estimator θ̂_{Tθ} = θ̂_T(Z_{1d}(θ), ..., Z_{Td}(θ)) such that

sup_{θ∈Θ} E[‖π_{θ̂_{Tθ}} − π_θ‖] = Õ( L T^{−α²/(2(d+2α)(α+2(d+1)))} ).

Proof. By the standard PAC analysis [Vapnik, 1982, Blumer et al., 1989], for any γ > 0, with probability greater than 1 − γ, a sample of k = O((d/γ) log(1/γ)) random points will partition C into regions of width less than γ. For brevity, we omit the t subscript on quantities such as Z_{tk}(θ) throughout the following analysis, since the claims hold for any arbitrary value of t.

For any θ ∈ Θ, let π̄_θ denote a (conditional on X_1, ..., X_k) distribution defined as follows. Let f̄_θ denote the (conditional on X_1, ..., X_k) density function of π̄_θ with respect to π₀, and for any g ∈ C, let

f̄_θ(g) = π_θ({h ∈ C : ∀i ≤ k, h(X_i) = g(X_i)}) / π₀({h ∈ C : ∀i ≤ k, h(X_i) = g(X_i)})

(or 0 if π₀({h ∈ C : ∀i ≤ k, h(X_i) = g(X_i)}) = 0). In other words, π̄_θ has the same probability mass as π_θ for each of the equivalence classes induced by X_1, ..., X_k, but conditioned on the equivalence class, simply has a constant-density distribution over that equivalence class.
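The flattening step just described can be made concrete with a small sketch (not from the paper; the finite concept class, the uniform reference measure, and the helper name flatten_prior are all hypothetical choices for illustration): the mass that π_θ assigns to each equivalence class induced by the sample points is redistributed within that class in proportion to π₀, which is exactly what constant density with respect to π₀ means.

```python
import numpy as np

def flatten_prior(labels, prior, ref, sample_idx):
    """Redistribute `prior` within the equivalence classes of concepts that agree
    on the sample points `sample_idx`, proportionally to the reference measure
    `ref` (the pi_0 of the text); this is the constant-density flattening pi-bar.

    labels: (n_concepts, n_points) array of +/-1 values h(x).
    prior, ref: length-n_concepts probability vectors (pi_theta and pi_0).
    """
    keys = [tuple(row) for row in labels[:, sample_idx]]
    flat = np.zeros_like(prior)
    for key in set(keys):
        members = np.array([k == key for k in keys])
        mass = prior[members].sum()       # pi_theta of this equivalence class
        ref_mass = ref[members].sum()     # pi_0 of this equivalence class
        if ref_mass > 0:
            flat[members] = mass * ref[members] / ref_mass
    return flat

# Hypothetical example: 4 concepts on 3 points, flattened using only the first point.
labels = np.array([[+1, +1, -1],
                   [+1, -1, -1],
                   [-1, +1, +1],
                   [-1, -1, +1]])
prior = np.array([0.4, 0.1, 0.3, 0.2])
ref = np.full(4, 0.25)
print(flatten_prior(labels, prior, ref, sample_idx=[0]))  # [0.25 0.25 0.25 0.25]
```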

Note that, by the smoothness condition, with probability greater than 1 − γ, we have everywhere |f̄_θ(h) − f_θ(h)| < Lγ^α. So for any θ, θ′ ∈ Θ, with probability greater than 1 − γ,

‖π_θ − π_θ′‖ = (1/2) ∫ |f_θ − f_θ′| dπ₀ < Lγ^α + (1/2) ∫ |f̄_θ − f̄_θ′| dπ₀.

Furthermore, since the regions that define f̄_θ and f̄_θ′ are the same (namely, the partition induced by X_1, ..., X_k), we have

(1/2) ∫ |f̄_θ − f̄_θ′| dπ₀ = (1/2) Σ_{y_1,...,y_k ∈ {−1,+1}} |π_θ({h ∈ C : ∀i ≤ k, h(X_i) = y_i}) − π_θ′({h ∈ C : ∀i ≤ k, h(X_i) = y_i})| = ‖P_{Y_k(θ)|X_k} − P_{Y_k(θ′)|X_k}‖.

Thus, we have that with probability at least 1 − γ,

‖π_θ − π_θ′‖ < Lγ^α + ‖P_{Y_k(θ)|X_k} − P_{Y_k(θ′)|X_k}‖.

Following an argument analogous to the inductive argument of [Yang et al., 2011], suppose I ⊆ {1, ..., k}, and fix x̄_I ∈ X^I and ȳ_I ∈ {−1,+1}^I. Then the ỹ_I ∈ {−1,+1}^I for which no h ∈ C has h(x̄_I) = ỹ_I and for which ‖ȳ_I − ỹ_I‖_1 is minimal, has (1/2)‖ȳ_I − ỹ_I‖_1 ≤ d + 1, and for any i ∈ I with ȳ_i ≠ ỹ_i, letting ȳ′_j = ȳ_j for j ∈ I \ {i} and ȳ′_i = ỹ_i, we have

P_{Y_I(θ)|X_I}(ȳ_I | x̄_I) = P_{Y_{I\{i}}(θ)|X_{I\{i}}}(ȳ_{I\{i}} | x̄_{I\{i}}) − P_{Y_I(θ)|X_I}(ȳ′_I | x̄_I),

and similarly for θ′, so that

|P_{Y_I(θ)|X_I}(ȳ_I | x̄_I) − P_{Y_I(θ′)|X_I}(ȳ_I | x̄_I)| ≤ |P_{Y_{I\{i}}(θ)|X_{I\{i}}}(ȳ_{I\{i}} | x̄_{I\{i}}) − P_{Y_{I\{i}}(θ′)|X_{I\{i}}}(ȳ_{I\{i}} | x̄_{I\{i}})| + |P_{Y_I(θ)|X_I}(ȳ′_I | x̄_I) − P_{Y_I(θ′)|X_I}(ȳ′_I | x̄_I)|.

Now consider that these two terms inductively define a binary tree. Every time the tree branches left once, it arrives at a difference of probabilities for a set I of one less element than that of its parent. Every time the tree branches right once, it arrives at a difference of probabilities for a ȳ_I one closer to an unrealized ỹ_I than that of its parent. Say we stop branching the tree upon reaching a set I and a ȳ_I such that either ȳ_I is an unrealized labeling, or |I| = d. Thus, we can bound the original (root node) difference of probabilities by the sum of the differences of probabilities for the leaf nodes with |I| = d. Any path in the tree can branch left at most k − d times (total) before reaching a set I with only d elements, and can branch right at most d + 1 times in a row before reaching a ȳ_I such that both probabilities are zero, so that the difference is zero. So the depth of any leaf node with |I| = d is at most 2(k − d)d. Furthermore, at any level of the tree, from left to right the nodes have strictly decreasing |I| values, so that the maximum width of the tree is at most k − d. So the total number of leaf nodes with |I| = d is at most 2(k − d)²d. Thus, for any ȳ ∈ {−1,+1}^k and x̄ ∈ X^k,

|P_{Y_k(θ)|X_k}(ȳ | x̄) − P_{Y_k(θ′)|X_k}(ȳ | x̄)| ≤ 2(k − d)²d max_{ȳ_d ∈ {−1,+1}^d} max_{D ∈ {1,...,k}^d} |P_{Y_d(θ)|X_D}(ȳ_d | x̄_D) − P_{Y_d(θ′)|X_D}(ȳ_d | x̄_D)|.

Since ‖P_{Y_k(θ)|X_k} − P_{Y_k(θ′)|X_k}‖ = (1/2) Σ_{ȳ_k ∈ {−1,+1}^k} |P_{Y_k(θ)|X_k}(ȳ_k) − P_{Y_k(θ′)|X_k}(ȳ_k)|, and by Sauer's Lemma this is at most (ek)^d max_{ȳ_k ∈ {−1,+1}^k} |P_{Y_k(θ)|X_k}(ȳ_k) − P_{Y_k(θ′)|X_k}(ȳ_k)|, we have

‖P_{Y_k(θ)|X_k} − P_{Y_k(θ′)|X_k}‖ ≤ (ek)^d 2k²d max_{ȳ_d ∈ {−1,+1}^d} max_{D ∈ {1,...,k}^d} |P_{Y_d(θ)|X_D}(ȳ_d) − P_{Y_d(θ′)|X_D}(ȳ_d)|.
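Sauer's Lemma is what keeps the number of terms in the preceding sum manageable. As a quick numerical illustration (not from the paper), the following sketch counts the labelings that one-dimensional threshold classifiers (a class of VC dimension 1) actually realize on k random points and compares the count against the (ek/d)^d-style bound; the class and the numbers are hypothetical.

```python
import numpy as np

# Count the distinct labelings of k points realized by threshold classifiers
# h_t(x) = +1 iff x >= t (a class of VC dimension d = 1), and compare with the
# (ek/d)^d-style bound from Sauer's Lemma used in the argument above.
rng = np.random.default_rng(0)
k = 10
points = np.sort(rng.random(k))
thresholds = np.concatenate(([-np.inf], (points[:-1] + points[1:]) / 2, [np.inf]))
labelings = {tuple(np.where(points >= t, 1, -1)) for t in thresholds}
print(len(labelings), "realized labelings; (ek/d)^d with d = 1 is about", np.e * k)
```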

Thus, we have that

‖π_θ − π_θ′‖ = E[‖π_θ − π_θ′‖] < γ + Lγ^α + (ek)^d 2k²d E[ max_{ȳ_d ∈ {−1,+1}^d} max_{D ∈ {1,...,k}^d} |P_{Y_d(θ)|X_D}(ȳ_d) − P_{Y_d(θ′)|X_D}(ȳ_d)| ].

Note that

E[ max_{ȳ_d ∈ {−1,+1}^d} max_{D ∈ {1,...,k}^d} |P_{Y_d(θ)|X_D}(ȳ_d) − P_{Y_d(θ′)|X_D}(ȳ_d)| ] ≤ Σ_{ȳ_d ∈ {−1,+1}^d} Σ_{D ∈ {1,...,k}^d} E[ |P_{Y_d(θ)|X_D}(ȳ_d) − P_{Y_d(θ′)|X_D}(ȳ_d)| ] ≤ (2k)^d max_{ȳ_d ∈ {−1,+1}^d} max_{D ∈ {1,...,k}^d} E[ |P_{Y_d(θ)|X_D}(ȳ_d) − P_{Y_d(θ′)|X_D}(ȳ_d)| ],

and by exchangeability, this last line equals

(2k)^d max_{ȳ_d ∈ {−1,+1}^d} E[ |P_{Y_d(θ)|X_d}(ȳ_d) − P_{Y_d(θ′)|X_d}(ȳ_d)| ].

[Yang et al., 2011] showed that E[ |P_{Y_d(θ)|X_d}(ȳ_d) − P_{Y_d(θ′)|X_d}(ȳ_d)| ] ≤ 4 √(‖P_{Z_d(θ)} − P_{Z_d(θ′)}‖), so that in total we have

‖π_θ − π_θ′‖ < (L + 1)γ^α + 4(2ek)^{2d+2} √(‖P_{Z_d(θ)} − P_{Z_d(θ′)}‖).

Plugging in the value of k = c(d/γ) log(1/γ), this is

(L + 1)γ^α + 4( 2ec(d/γ) log(1/γ) )^{2d+2} √(‖P_{Z_d(θ)} − P_{Z_d(θ′)}‖).

So the only remaining question is the rate of convergence of our estimate of P_{Z_d}(θ*). If N(ε) is the ε-covering number of {P_{Z_d}(θ) : θ ∈ Θ}, then taking θ̂_{Tθ*} as the minimum distance skeleton estimate of [Yatracos, 1985, Devroye & Lugosi, 2001] achieves expected total variation distance ε from P_{Z_d}(θ*), for some T = O((1/ε²) log N(ε/4)). We can partition C into O((L/ε)^{d/α}) cells of diameter O((ε/L)^{1/α}), and set a constant density value within each cell, on an O(ε)-grid of density values, and every prior with (L, α)-Hölder smooth density will have density within ε of some density so-constructed; there are then at most (1/ε)^{O((L/ε)^{d/α})} such densities, so this bounds the covering numbers of Π_Θ. Furthermore, the covering number of Π_Θ upper bounds N(ε) [Yang et al., 2011], so that N(ε) ≤ (1/ε)^{O((L/ε)^{d/α})}. Solving T = O(ε^{−2}(L/ε)^{d/α} log(1/ε)) for ε, we have

ε = O( L (log(TL)/T)^{α/(d+2α)} ).

So this bounds the rate of convergence for E[‖P_{Z_d(θ̂_T)} − P_{Z_d(θ*)}‖], for θ̂_T the minimum distance skeleton estimate.
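For intuition about the estimator invoked here, the following is a generic sketch (in the spirit of Yatracos [1985] and Devroye & Lugosi [2001], but not the paper's exact skeleton construction) of minimum distance selection among finitely many candidate distributions on a small discrete support: the selected candidate minimizes its largest discrepancy with the empirical measure over the Yatracos sets A_{jl} = {x : p_j(x) > p_l(x)}. The function name and the toy candidates are hypothetical.

```python
import numpy as np

def minimum_distance_select(candidates, sample, support_size):
    """Select the candidate whose discrepancy with the empirical measure,
    maximized over the Yatracos sets A_jl = {x : p_j(x) > p_l(x)}, is smallest.
    A generic sketch of the minimum distance principle, not the paper's exact
    skeleton construction."""
    emp = np.bincount(sample, minlength=support_size) / len(sample)
    n = len(candidates)
    yatracos = [candidates[j] > candidates[l]
                for j in range(n) for l in range(n) if j != l]
    def score(p):
        return max(abs(p[A].sum() - emp[A].sum()) for A in yatracos)
    return min(range(n), key=lambda j: score(candidates[j]))

rng = np.random.default_rng(0)
candidates = [np.array([0.5, 0.3, 0.2]),
              np.array([0.2, 0.3, 0.5]),
              np.array([1/3, 1/3, 1/3])]
sample = rng.choice(3, size=200, p=candidates[0])   # data drawn from candidate 0
print(minimum_distance_select(candidates, sample, support_size=3))  # typically 0
```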

Plugging this rate into the bound on the priors, combined with Jensen's inequality, we have

E[‖π_{θ̂_{Tθ*}} − π_{θ*}‖] < (L + 1)γ^α + 4( 2ec(d/γ) log(1/γ) )^{2d+2} O( L^{1/2} (log(TL)/T)^{α/(2d+4α)} ).

This holds for any γ > 0, so minimizing this expression over γ > 0 yields a bound on the rate. For instance, with γ = Õ( T^{−α/(2(d+2α)(α+2(d+1)))} ), we have

E[‖π_{θ̂_{Tθ*}} − π_{θ*}‖] = Õ( L T^{−α²/(2(d+2α)(α+2(d+1)))} ).

4 A Minimax Lower Bound

One natural question is whether Theorem 1 can generally be improved. While we expect this to be true for some fixed VC classes (e.g., those of finite size), and in any case we expect that some of the constant factors in the exponent may be improvable, it is not at this time clear whether the general form of T^{−Θ(α²/(d+α)²)} is sometimes optimal. One way to investigate this question is to construct specific spaces C and distributions D for which a lower bound can be obtained. In particular, we are generally interested in exhibiting lower bounds that are worse than those that apply to the usual problem of density estimation based on direct access to the h*_{tθ*} values (see Theorem 3 below). Here we present a lower bound that is interesting for this reason. However, although larger than the optimal rate for methods with direct access to the target concepts, it is still far from matching the upper bound above, so that the question of tightness remains open. Specifically, we have the following result.

Theorem 2. For any integer d ≥ 1, any L > 0, and any α ∈ (0, 1], there is a value C(d, L, α) ∈ (0, ∞) such that, for any T ∈ N, there exists an instance space X, a concept space C of VC dimension d, a distribution D over X, and a distribution π₀ over C such that, for Π_Θ a set of distributions over C with (L, α)-Hölder smooth density functions with respect to π₀, any estimator θ̂_T = θ̂_T(Z_{1d}(θ*), ..., Z_{Td}(θ*)) has

sup_{θ*∈Θ} E[‖π_{θ̂_T} − π_{θ*}‖] ≥ C(d, L, α) T^{−α/(2(d+α))}.

Proof. (Sketch) We proceed by a reduction from the task of determining the bias of a coin from among two given possibilities. Specifically, fix any γ ∈ (0, 1/2) and n ∈ N, and let B_1(p), ..., B_n(p) be i.i.d. Bernoulli(p) random variables, for each p ∈ [0, 1]; then it is known that, for any (possibly nondeterministic) decision rule p̂_n : {0, 1}^n → {(1 + γ)/2, (1 − γ)/2},

(1/2) Σ_{p ∈ {(1+γ)/2, (1−γ)/2}} P( p̂_n(B_1(p), ..., B_n(p)) ≠ p ) ≥ (1/3) exp{ −128γ²n/3 }.   (1)

This easily follows from the results of [Wald, 1945, Bar-Yossef, 2003], combined with a result of [Poland & Hutter, 2006] bounding the KL divergence.
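The exponential dependence on γ²n in (1) is easy to see empirically. The following small Monte Carlo sketch (not from the paper, and not reproducing its constants; the function name error_prob is hypothetical) estimates the average error of the maximum-likelihood rule for deciding between the two coin biases; the error decays roughly like exp(−Θ(γ²n)), which is the behavior the reduction below exploits.

```python
import numpy as np

def error_prob(gamma, n, trials=20000, seed=1):
    """Monte Carlo estimate of the average error of the maximum-likelihood rule
    for deciding whether a coin has bias (1+gamma)/2 or (1-gamma)/2 from n flips.
    A sketch of the exp(-Theta(gamma^2 n)) behavior behind (1); the constants
    here are not those of the text."""
    rng = np.random.default_rng(seed)
    errors = 0
    for p in [(1 + gamma) / 2, (1 - gamma) / 2]:
        flips = rng.random((trials, n)) < p
        guess_high = flips.mean(axis=1) >= 0.5   # ML rule: compare frequency to 1/2
        errors += np.sum(guess_high != (p > 0.5))
    return errors / (2 * trials)

for n in [10, 100, 1000]:
    print(n, error_prob(gamma=0.1, n=n))
```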

To use this result, we construct a learning problem as follows. Fix some m ∈ N with m ≥ d, let X = {1, ..., m}, and let C be the space of all classifiers h : X → {−1, +1} such that |{x ∈ X : h(x) = +1}| ≤ d. Clearly the VC dimension of C is d. Define the distribution D as uniform over X. Finally, we specify a family of (L, α)-Hölder smooth priors, parameterized by Θ = {−1, +1}^{(m choose d)}, as follows. Let γ = (L/2)(1/m)^α. First, enumerate the (m choose d) distinct d-sized subsets of {1, ..., m} as X_1, X_2, ..., X_{(m choose d)}. Define the reference distribution π₀ by the property that, for any h ∈ C, letting q = |{x : h(x) = +1}|,

π₀({h}) = (1/2)^d (m−q choose d−q) / (m choose d).

For any b = (b_1, ..., b_{(m choose d)}) ∈ {−1, +1}^{(m choose d)}, define the prior π_b as the distribution of a random variable h_b specified by the following generative model. Let i* ~ Uniform({1, ..., (m choose d)}), let C_b(i*) ~ Bernoulli((1 + γ b_{i*})/2); finally, h_b ~ Uniform({h ∈ C : {x : h(x) = +1} ⊆ X_{i*}, Parity(|{x : h(x) = +1}|) = C_b(i*)}), where Parity(n) is 1 if n is odd, or 0 if n is even. We will refer to the variables in this generative model below. For any h ∈ C, letting H = {x : h(x) = +1} and q = |H|, we can equivalently express

π_b({h}) = (1/2)^d (m choose d)^{−1} Σ_{i=1}^{(m choose d)} 1[H ⊆ X_i] (1 + γ b_i)^{Parity(q)} (1 − γ b_i)^{1−Parity(q)}.

From this explicit representation, it is clear that, letting f_b = dπ_b/dπ₀, we have f_b(h) ∈ [1 − γ, 1 + γ] for all h ∈ C. The fact that f_b is Hölder smooth follows from this, since every distinct h, g ∈ C have D({x : h(x) ≠ g(x)}) ≥ 1/m = (2γ/L)^{1/α}.

Next we set up the reduction as follows. For any estimator π̂_T = π̂_T(Z_{1d}(θ*), ..., Z_{Td}(θ*)), and each i ∈ {1, ..., (m choose d)}, let h_i be the classifier with {x : h_i(x) = +1} = X_i; also, if π̂_T({h_i}) > (1/2)^d / (m choose d), let b̂_i = 2 Parity(d) − 1, and otherwise b̂_i = 1 − 2 Parity(d). We use these b̂_i values to estimate the original b_i values. Specifically, let p̂_i = (1 + γ b̂_i)/2 and p_i = (1 + γ b_i)/2, where b = θ*. Then

‖π̂_T − π_{θ*}‖ ≥ (1/2) Σ_{i=1}^{(m choose d)} |π̂_T({h_i}) − π_{θ*}({h_i})| ≥ (1/2) Σ_{i=1}^{(m choose d)} γ 2^{−d} (m choose d)^{−1} |b̂_i − b_i|/2 = (1/2) 2^{−d} (m choose d)^{−1} Σ_{i=1}^{(m choose d)} |p̂_i − p_i|.

Thus, we have reduced from the problem of deciding the biases of these (m choose d) independent Bernoulli random variables. To complete the proof, it suffices to lower bound the expectation of the right side for an arbitrary estimator. Toward this end, we in fact study an even easier problem. Specifically, consider an estimator q̂_i = q̂_i(Z_{1d}(θ*), ..., Z_{Td}(θ*), i*_1, ..., i*_T), where i*_t is the i* random variable in the generative model that defines h*_{tθ*}; that is, i*_t ~ Uniform({1, ..., (m choose d)}), C_t ~ Bernoulli((1 + γ b_{i*_t})/2), and h*_{tθ*} ~ Uniform({h ∈ C : {x : h(x) = +1} ⊆ X_{i*_t}, Parity(|{x : h(x) = +1}|) = C_t}), where the i*_t are independent across t, as are the C_t and h*_{tθ*}.
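To make the construction tangible, here is a small sketch of the generative model for π_b as reconstructed above (not code from the paper; the function name sample_h_b and the toy parameter values are hypothetical, and m and d are kept small enough to enumerate the subsets X_i explicitly).

```python
import numpy as np
from itertools import combinations

def sample_h_b(b, m, d, gamma, rng):
    """Draw one concept h_b from the generative model defining pi_b, as
    reconstructed above: pick a d-subset X_i uniformly, flip a coin with bias
    (1 + gamma*b_i)/2, then pick a uniformly random positive set H inside X_i
    whose parity |H| mod 2 matches the coin. Returns +/-1 labels over {0,...,m-1}."""
    subsets = list(combinations(range(m), d))          # X_1, ..., X_(m choose d)
    i_star = rng.integers(len(subsets))
    coin = int(rng.random() < (1 + gamma * b[i_star]) / 2)   # C_b(i*)
    pool = [H for r in range(d + 1) if r % 2 == coin
            for H in combinations(subsets[i_star], r)]
    H = pool[rng.integers(len(pool))]
    h = -np.ones(m, dtype=int)
    h[list(H)] = +1
    return h

rng = np.random.default_rng(0)
m, d, L, alpha = 6, 2, 1.0, 1.0
b = rng.choice([-1, +1], size=len(list(combinations(range(m), d))))
gamma = (L / 2) * (1 / m) ** alpha     # the gamma = (L/2)(1/m)^alpha of the text
print(sample_h_b(b, m, d, gamma, rng))
```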

Clearly the p̂_i from above can be viewed as an estimator of this type, which simply ignores the knowledge of i*_t. The knowledge of these i*_t variables simplifies the analysis, since given {i*_t : t ≤ T}, the data can be partitioned into (m choose d) disjoint sets, {{Z_{td}(θ*) : i*_t = i} : i = 1, ..., (m choose d)}, and we can use only the set {Z_{td}(θ*) : i*_t = i} to estimate p_i. Furthermore, we can use only the subset of these for which X_{td} = X_i, since otherwise we have zero information about the value of Parity(|{x : h*_{tθ*}(x) = +1}|). That is, given i*_t = i, any Z_{td}(θ*) is conditionally independent from every b_j for j ≠ i, and is even conditionally independent from b_i when X_{td} is not completely contained in X_i; specifically, in this case, regardless of b_i, the conditional distribution of Y_{td}(θ*) given i*_t = i and given X_{td} is a product distribution, which deterministically assigns label −1 to those Y_{tk}(θ*) with X_{tk} ∉ X_i, and gives uniform random values to the subset of Y_{td}(θ*) with their respective X_{tk} ∈ X_i. Finally, letting r_t = Parity(|{k ≤ d : Y_{tk}(θ*) = +1}|), we note that given i*_t = i, X_{td} = X_i, and the value r_t, b_i is conditionally independent from Z_{td}(θ*). Thus, the set of values C_{iT}(θ*) = {r_t : i*_t = i, X_{td} = X_i} is a sufficient statistic for b_i (hence for p_i). Recall that, when i*_t = i and X_{td} = X_i, the value of r_t is equal to C_t, a Bernoulli(p_i) random variable. Thus, we neither lose nor gain anything (in terms of risk) by restricting ourselves to estimators q̂_i of the type q̂_i = q̂_i(Z_{1d}(θ*), ..., Z_{Td}(θ*), i*_1, ..., i*_T) = q̂′_i(C_{iT}(θ*)), for some q̂′_i [Schervish, 1995]: that is, estimators that are a function of the N_{iT}(θ*) = |C_{iT}(θ*)| Bernoulli(p_i) random variables, which we should note are conditionally i.i.d. given N_{iT}(θ*). Thus, by (1), for any n ≤ T,

(1/2) Σ_{b_i ∈ {−1,+1}} E[ |q̂_i − p_i| | N_{iT}(θ*) = n ] = (1/2) Σ_{b_i ∈ {−1,+1}} γ P( q̂_i ≠ p_i | N_{iT}(θ*) = n ) ≥ (γ/3) exp{ −128γ²n/3 }.

Also note that, for each i, E[N_{iT}(θ*)] = d!(1/m)^d T / (m choose d) ≤ d!(d/m)^d (1/m)^d T ≤ d^{2d} (2γ/L)^{2d/α} T, so that Jensen's inequality, linearity of expectation, and the law of total expectation imply

(1/2) Σ_{b_i ∈ {−1,+1}} E[ |q̂_i − p_i| ] ≥ (γ/3) exp{ −43 (2/L)^{2d/α} d^{2d} γ^{2+2d/α} T }.

Thus, by linearity of the expectation,

2^{−(m choose d)} Σ_{b ∈ {−1,+1}^{(m choose d)}} E[ 2^{−d} (m choose d)^{−1} Σ_{i=1}^{(m choose d)} |q̂_i − p_i| ] = 2^{−d} (m choose d)^{−1} Σ_{i=1}^{(m choose d)} 2^{−(m choose d)} Σ_{b ∈ {−1,+1}^{(m choose d)}} E[ |q̂_i − p_i| ] ≥ (γ/(3·2^d)) exp{ −43 (2/L)^{2d/α} d^{2d} γ^{2+2d/α} T }.

In particular, taking

m = (L/2)^{1/α} ( 43 (2/L)^{2d/α} d^{2d} T )^{1/(2(d+α))},

we have

γ = Θ( ( 43 (2/L)^{2d/α} d^{2d} T )^{−α/(2(d+α))} ),

so that

2^{−(m choose d)} Σ_{b ∈ {−1,+1}^{(m choose d)}} E[ 2^{−d} (m choose d)^{−1} Σ_{i=1}^{(m choose d)} |q̂_i − p_i| ] = Ω( 2^{−d} ( 43 (2/L)^{2d/α} d^{2d} T )^{−α/(2(d+α))} ).

11 In particular, taking we have so that ( ) 1 ( d) b { 1,+1} ( d) = (L/) ( 1/α 43(/L) d/α d d T ) 1 (d+α), ( (43(/L) γ = Θ d/α d d T ) ) α (d+α), ( d) E 1 d( d ) ˆq i p i = Ω ( ( d 43(/L) d/α d d T ) ) α (d+α). In particular, this iplies there exists soe b for which ( d) 1 E d( ) ˆq i p i = Ω ( ( d 43(/L) d/α d d T ) ) α (d+α). d Applying this lower bound to the estiator ˆp i defined above yields the result. In the extree case of allowing arbitrary dependence on the data saples, we erely recover the known results lower bounding the risk of density estiation fro i.i.d. saples fro a sooth density, as indicated by the following result. Theore 3. For any integer d 1, there exists an instance space X, a concept space C of VC diension d, a distribution D over X, and a distribution π 0 over C such that, for Π Θ the set of distributions over C with (L, α)-hölder sooth density functions with respect to π 0, any sequence of estiators, ˆθ T = ˆθ T (Z 1 (θ ),..., Z T (θ )) (T = 1,,...), has sup E [ πˆθt θ Θ π θ ] = Ω (T α d+α The proof is a siple reduction fro the proble of estiating π θ based on direct access to h 1θ,..., h T θ, which is essentially equivalent to the standard odel of density estiation, and indeed the lower bound in Theore 3 is a well-known result for density estiation fro T i.i.d. saples fro a Hölder sooth density in a d-diensional space [?, see e.g.,]]devroye:01. 5 Future Directions There are several interesting questions that reain open at this tie. Can either the lower bound or upper bound be iproved in general? If, instead of d saples per task, we instead use d saples, how does the iniax risk vary with? Related to this, what is the optial value of to optiize the rate of convergence as a function of T, the total nuber of saples? More generally, if an estiator is peritted to use N total saples, taken fro however any tasks it wishes, what is the optial rate of convergence as a function of N? 9 ).

References

Bar-Yossef, Z. (2003). Sampling lower bounds via information theory. Proceedings of the 35th Annual ACM Symposium on the Theory of Computing.

Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28.

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36.

Devroye, L., & Lugosi, G. (2001). Combinatorial methods in density estimation. New York, NY, USA: Springer.

Poland, J., & Hutter, M. (2006). MDL convergence speed for Bernoulli sequences. Statistics and Computing, 16.

Schervish, M. J. (1995). Theory of statistics. New York, NY, USA: Springer.

Vapnik, V. (1982). Estimation of dependences based on empirical data. Springer-Verlag, New York.

Wald, A. (1945). Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16.

Yang, L., Hanneke, S., & Carbonell, J. (2011). Identifiability of priors from bounded sample sizes with applications to transfer learning. 24th Annual Conference on Learning Theory.

Yatracos, Y. G. (1985). Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13.
