Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Liu Yang, Steve Hanneke, Jaime Carbonell

December 2012
CMU-ML-1-11
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We study the optimal rates of convergence for estimating a prior distribution over a VC class from a sequence of independent data sets respectively labeled by independent target functions sampled from the prior. We specifically derive upper and lower bounds on the optimal rates under a smoothness condition on the correct prior, with the number of samples per data set equal to the VC dimension. These results have implications for the improvements achievable via transfer learning.

This research is supported in part by NSF grant IIS and a Google Core AI grant.

Keywords: Minimax Rates, Transfer Learning, VC Dimension, Bayesian Learning

1 Introduction

In the transfer learning setting, we are presented with a sequence of learning problems, each with some respective target concept we are tasked with learning. The key question in transfer learning is how to leverage our access to past learning problems in order to improve performance on learning problems we will be presented with in the future. Among the several proposed models for transfer learning, one particularly appealing model supposes the learning problems are independent and identically distributed, with unknown distribution, and the advantage of transfer learning then comes from the ability to estimate this shared distribution based on the data from past learning problems [Baxter, 1997, Yang et al., 2011]. For instance, when customizing a speech recognition system to a particular speaker's voice, we might expect the first few people would need to speak many words or phrases in order for the system to accurately identify the nuances. However, after performing this for many different people, if the software has access to those past training sessions when customizing itself to a new user, it should have identified important properties of the speech patterns, such as the common patterns within each of the major dialects or accents, and other such information about the distribution of speech patterns within the user population. It should then be able to leverage this information to reduce the number of words or phrases the next user needs to speak in order to train the system, for instance by first trying to identify the individual's dialect, then presenting phrases that differentiate common subpatterns within that dialect, and so forth.

In analyzing the benefits of transfer learning in such a setting, one important question to ask is how quickly we can estimate the distribution from which the learning problems are sampled. In recent work, [Yang et al., 2011] have shown that under mild conditions on the family of possible distributions, if the target concepts reside in a known VC class, then it is possible to estimate this distribution to arbitrary precision using only a bounded number of training samples per task: specifically, a number of samples equal to the VC dimension. However, that work left open the question of quantifying the rate of convergence. This rate of convergence can have a direct impact on how much benefit we gain from transfer learning when we are faced with only a finite sequence of learning problems. As such, it is certainly desirable to derive tight characterizations of this rate of convergence. The present work continues that of [Yang et al., 2011], bounding the rate of convergence for estimating this distribution under a smoothness condition on the distribution. We derive a generic upper bound, which holds regardless of the VC class in which the target concepts reside. The proof of this result builds on the earlier work of [Yang et al., 2011], but requires several interesting innovations to make the rate of convergence explicit, and to dramatically improve the upper bounds on certain quantities compared to the analogous bounds implicit in the original proofs. We further derive a nontrivial lower bound that holds for certain constructed scenarios, which illustrates a lower limit on how good a general upper bound we might hope for in results expressed only in terms of the number of tasks, the smoothness conditions, and the VC dimension.

2 The Setting

Let (X, B_X) be a Borel space [Schervish, 1995] (where X is called the instance space), and let D be a distribution on X (called the data distribution). Let C be a VC class of measurable classifiers h : X → {−1, +1} (called the concept space), and denote by d the VC dimension of C [Vapnik, 1982]. We suppose C is equipped with its Borel σ-algebra B induced by the pseudo-metric ρ(h, g) = D({x ∈ X : h(x) ≠ g(x)}). Though our results can be formulated for general D (with somewhat more complicated theorem statements), to simplify the statement of results we suppose ρ is actually a metric, which would follow from appropriate topological conditions on C relative to D. For any two probability measures µ₁, µ₂ on a measurable space (Ω, F), define the total variation distance

‖µ₁ − µ₂‖ = sup_{A∈F} |µ₁(A) − µ₂(A)|.

Let Π_Θ = {π_θ : θ ∈ Θ} be a family of probability measures on C (called priors), where Θ is an arbitrary index set (called the parameter space). We additionally suppose there exists a probability measure π₀ on C (called the reference measure) such that every π_θ is absolutely continuous with respect to π₀, and therefore has a density function f_θ given by the Radon-Nikodym derivative dπ_θ/dπ₀ [Schervish, 1995].

We consider the following type of estimation problem. There is a collection of C-valued random variables {h*_{tθ} : t ∈ N, θ ∈ Θ}, where for any fixed θ ∈ Θ the {h*_{tθ}}_{t=1}^∞ variables are i.i.d. with distribution π_θ. For each θ ∈ Θ, there is a sequence Z_t(θ) = {(X_{t1}, Y_{t1}(θ)), (X_{t2}, Y_{t2}(θ)), ...}, where {X_{ti}}_{t,i∈N} are i.i.d. D, and for each t, i ∈ N, Y_{ti}(θ) = h*_{tθ}(X_{ti}). We additionally denote by Z_{tk}(θ) = {(X_{t1}, Y_{t1}(θ)), ..., (X_{tk}, Y_{tk}(θ))} the first k elements of Z_t(θ), for any k ∈ N, and similarly X_{tk} = {X_{t1}, ..., X_{tk}} and Y_{tk}(θ) = {Y_{t1}(θ), ..., Y_{tk}(θ)}. Following the terminology used in the transfer learning literature, we refer to the collection of variables associated with each t collectively as the t-th task. We will be concerned with sequences of estimators θ̂_{Tθ} = θ̂_T(Z_{1k}(θ), ..., Z_{Tk}(θ)), for T ∈ N, which are based on only a bounded number k of samples per task, among the first T tasks. Our main results specifically study the case of k = d. For any such estimator, we measure the risk as E[‖π_{θ̂_{Tθ}} − π_θ‖], and will be particularly interested in upper-bounding the worst-case risk sup_{θ∈Θ} E[‖π_{θ̂_{Tθ}} − π_θ‖] as a function of T, and lower-bounding the minimum possible value of this worst-case risk over all possible θ̂_T estimators (called the minimax risk).

In previous work, [Yang et al., 2011] showed that, if Π_Θ is a totally bounded family, then even with only d samples per task, the minimax risk (as a function of the number of tasks T) converges to zero. In fact, that work also includes a proof that this is not necessarily the case in general for any number of samples less than d. However, the actual rates of convergence were not explicitly derived in that work, and indeed the upper bounds on the rates of convergence implicit in that analysis may often have fairly complicated dependences on C, Π_Θ, and D, and furthermore often provide only very slow rates of convergence.
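As a concrete illustration of the risk measure (not from the paper), the following minimal Python sketch computes the total variation distance between two priors that are specified, as above, by densities with respect to a common reference measure π₀ over a small, hypothetical finite concept class; the particular numbers are made up for illustration.

```python
import numpy as np

# Hypothetical finite concept class of 5 classifiers, a reference measure pi0,
# and two priors given by their densities f1, f2 with respect to pi0.
pi0 = np.array([0.2, 0.2, 0.2, 0.2, 0.2])   # reference measure pi_0
f1  = np.array([1.5, 1.0, 0.5, 1.0, 1.0])   # density of pi_1 w.r.t. pi_0
f2  = np.array([0.5, 1.0, 1.5, 1.0, 1.0])   # density of pi_2 w.r.t. pi_0

# Total variation distance: ||pi_1 - pi_2|| = (1/2) * integral of |f1 - f2| d pi0,
# which over a finite class is just a weighted sum.
tv = 0.5 * np.sum(np.abs(f1 - f2) * pi0)
print(tv)  # 0.2
```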

To derive explicit bounds on the rates of convergence, in the present work we specifically focus on families of smooth densities. The motivation for involving a notion of smoothness in characterizing rates of convergence is clear if we consider the extreme case in which Π_Θ contains two priors π₁ and π₂, with π₁({h}) = π₂({g}) = 1, where ρ(h, g) is a very small but nonzero value; in this case, if we have only a small number of samples per task, we would require many tasks (on the order of 1/ρ(h, g)) to observe any data points carrying any information that would distinguish between these two priors (namely, points x with h(x) ≠ g(x)); yet ‖π₁ − π₂‖ = 1, so that we have a slow rate of convergence (at least initially). A total boundedness condition on Π_Θ would limit the number of such pairs present in Π_Θ, so that for instance we cannot have arbitrarily close h and g, but less extreme variants of this can lead to slow asymptotic rates of convergence as well. Specifically, in the present work we consider the following notion of smoothness. For L ∈ (0, ∞) and α ∈ (0, 1], a function f : C → R is (L, α)-Hölder smooth if

∀h, g ∈ C, |f(h) − f(g)| ≤ L ρ(h, g)^α.

3 An Upper Bound

We now have the following theorem, holding for an arbitrary VC class C and data distribution D; it is the main result of this work.

Theorem 1. For Π_Θ any class of priors on C having (L, α)-Hölder smooth densities {f_θ : θ ∈ Θ}, for any T ∈ N, there exists an estimator θ̂_{Tθ} = θ̂_T(Z_{1d}(θ), ..., Z_{Td}(θ)) such that

sup_{θ∈Θ} E[‖π_{θ̂_{Tθ}} − π_θ‖] = Õ( L T^{−α²/(2(d+2α)(α+2(d+1)))} ).

Proof. By the standard PAC analysis [Vapnik, 1982, Blumer et al., 1989], for any γ > 0, with probability greater than 1 − γ, a sample of k = O((d/γ) log(1/γ)) random points will partition C into regions of width less than γ. For brevity, we omit the t subscript on quantities such as Z_{tk}(θ) throughout the following analysis, since the claims hold for any arbitrary value of t.

For any θ ∈ Θ, let π̄_θ denote a (conditional on X_1, ..., X_k) distribution defined as follows. Let f̄_θ denote the (conditional on X_1, ..., X_k) density function of π̄_θ with respect to π₀, and for any g ∈ C, let

f̄_θ(g) = π_θ({h ∈ C : ∀i ≤ k, h(X_i) = g(X_i)}) / π₀({h ∈ C : ∀i ≤ k, h(X_i) = g(X_i)})

(or 0 if π₀({h ∈ C : ∀i ≤ k, h(X_i) = g(X_i)}) = 0). In other words, π̄_θ has the same probability mass as π_θ for each of the equivalence classes induced by X_1, ..., X_k, but conditioned on the equivalence class, simply has a constant-density distribution over that equivalence class.
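The flattening step just described can be made concrete with a small sketch (not from the paper; the finite concept class, the uniform reference measure, and the helper name flatten_prior are all hypothetical choices for illustration): the mass that π_θ assigns to each equivalence class induced by the sample points is redistributed within that class in proportion to π₀, which is exactly what constant density with respect to π₀ means.

```python
import numpy as np

def flatten_prior(labels, prior, ref, sample_idx):
    """Redistribute `prior` within the equivalence classes of concepts that agree
    on the sample points `sample_idx`, proportionally to the reference measure
    `ref` (the pi_0 of the text); this is the constant-density flattening pi-bar.

    labels: (n_concepts, n_points) array of +/-1 values h(x).
    prior, ref: length-n_concepts probability vectors (pi_theta and pi_0).
    """
    keys = [tuple(row) for row in labels[:, sample_idx]]
    flat = np.zeros_like(prior)
    for key in set(keys):
        members = np.array([k == key for k in keys])
        mass = prior[members].sum()       # pi_theta of this equivalence class
        ref_mass = ref[members].sum()     # pi_0 of this equivalence class
        if ref_mass > 0:
            flat[members] = mass * ref[members] / ref_mass
    return flat

# Hypothetical example: 4 concepts on 3 points, flattened using only the first point.
labels = np.array([[+1, +1, -1],
                   [+1, -1, -1],
                   [-1, +1, +1],
                   [-1, -1, +1]])
prior = np.array([0.4, 0.1, 0.3, 0.2])
ref = np.full(4, 0.25)
print(flatten_prior(labels, prior, ref, sample_idx=[0]))  # [0.25 0.25 0.25 0.25]
```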

Note that, by the smoothness condition, with probability greater than 1 − γ, we have everywhere |f̄_θ(h) − f_θ(h)| < Lγ^α. So for any θ, θ′ ∈ Θ, with probability greater than 1 − γ,

‖π_θ − π_θ′‖ = (1/2) ∫ |f_θ − f_θ′| dπ₀ < Lγ^α + (1/2) ∫ |f̄_θ − f̄_θ′| dπ₀.

Furthermore, since the regions that define f̄_θ and f̄_θ′ are the same (namely, the partition induced by X_1, ..., X_k), we have

(1/2) ∫ |f̄_θ − f̄_θ′| dπ₀ = (1/2) Σ_{y_1,...,y_k ∈ {−1,+1}} |π_θ({h ∈ C : ∀i ≤ k, h(X_i) = y_i}) − π_θ′({h ∈ C : ∀i ≤ k, h(X_i) = y_i})| = ‖P_{Y_k(θ)|X_k} − P_{Y_k(θ′)|X_k}‖.

Thus, we have that with probability at least 1 − γ,

‖π_θ − π_θ′‖ < Lγ^α + ‖P_{Y_k(θ)|X_k} − P_{Y_k(θ′)|X_k}‖.

Following an argument analogous to the inductive argument of [Yang et al., 2011], suppose I ⊆ {1, ..., k}, and fix x̄_I ∈ X^I and ȳ_I ∈ {−1,+1}^I. Then the ỹ_I ∈ {−1,+1}^I for which no h ∈ C has h(x̄_I) = ỹ_I and for which ‖ȳ_I − ỹ_I‖_1 is minimal, has (1/2)‖ȳ_I − ỹ_I‖_1 ≤ d + 1, and for any i ∈ I with ȳ_i ≠ ỹ_i, letting ȳ′_j = ȳ_j for j ∈ I \ {i} and ȳ′_i = ỹ_i, we have

P_{Y_I(θ)|X_I}(ȳ_I | x̄_I) = P_{Y_{I\{i}}(θ)|X_{I\{i}}}(ȳ_{I\{i}} | x̄_{I\{i}}) − P_{Y_I(θ)|X_I}(ȳ′_I | x̄_I),

and similarly for θ′, so that

|P_{Y_I(θ)|X_I}(ȳ_I | x̄_I) − P_{Y_I(θ′)|X_I}(ȳ_I | x̄_I)| ≤ |P_{Y_{I\{i}}(θ)|X_{I\{i}}}(ȳ_{I\{i}} | x̄_{I\{i}}) − P_{Y_{I\{i}}(θ′)|X_{I\{i}}}(ȳ_{I\{i}} | x̄_{I\{i}})| + |P_{Y_I(θ)|X_I}(ȳ′_I | x̄_I) − P_{Y_I(θ′)|X_I}(ȳ′_I | x̄_I)|.

Now consider that these two terms inductively define a binary tree. Every time the tree branches left once, it arrives at a difference of probabilities for a set I of one less element than that of its parent. Every time the tree branches right once, it arrives at a difference of probabilities for a ȳ_I one closer to an unrealized ỹ_I than that of its parent. Say we stop branching the tree upon reaching a set I and a ȳ_I such that either ȳ_I is an unrealized labeling, or |I| = d. Thus, we can bound the original (root node) difference of probabilities by the sum of the differences of probabilities for the leaf nodes with |I| = d. Any path in the tree can branch left at most k − d times (total) before reaching a set I with only d elements, and can branch right at most d + 1 times in a row before reaching a ȳ_I such that both probabilities are zero, so that the difference is zero. So the depth of any leaf node with |I| = d is at most 2(k − d)d. Furthermore, at any level of the tree, from left to right the nodes have strictly decreasing |I| values, so that the maximum width of the tree is at most k − d. So the total number of leaf nodes with |I| = d is at most 2(k − d)²d. Thus, for any ȳ ∈ {−1,+1}^k and x̄ ∈ X^k,

|P_{Y_k(θ)|X_k}(ȳ | x̄) − P_{Y_k(θ′)|X_k}(ȳ | x̄)| ≤ 2(k − d)²d max_{ȳ_d ∈ {−1,+1}^d} max_{D ∈ {1,...,k}^d} |P_{Y_d(θ)|X_D}(ȳ_d | x̄_D) − P_{Y_d(θ′)|X_D}(ȳ_d | x̄_D)|.

Since ‖P_{Y_k(θ)|X_k} − P_{Y_k(θ′)|X_k}‖ = (1/2) Σ_{ȳ_k ∈ {−1,+1}^k} |P_{Y_k(θ)|X_k}(ȳ_k) − P_{Y_k(θ′)|X_k}(ȳ_k)|, and by Sauer's Lemma this is at most (ek)^d max_{ȳ_k ∈ {−1,+1}^k} |P_{Y_k(θ)|X_k}(ȳ_k) − P_{Y_k(θ′)|X_k}(ȳ_k)|, we have

‖P_{Y_k(θ)|X_k} − P_{Y_k(θ′)|X_k}‖ ≤ (ek)^d 2k²d max_{ȳ_d ∈ {−1,+1}^d} max_{D ∈ {1,...,k}^d} |P_{Y_d(θ)|X_D}(ȳ_d) − P_{Y_d(θ′)|X_D}(ȳ_d)|.
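Sauer's Lemma is what keeps the number of terms in the preceding sum manageable. As a quick numerical illustration (not from the paper), the following sketch counts the labelings that one-dimensional threshold classifiers (a class of VC dimension 1) actually realize on k random points and compares the count against the (ek/d)^d-style bound; the class and the numbers are hypothetical.

```python
import numpy as np

# Count the distinct labelings of k points realized by threshold classifiers
# h_t(x) = +1 iff x >= t (a class of VC dimension d = 1), and compare with the
# (ek/d)^d-style bound from Sauer's Lemma used in the argument above.
rng = np.random.default_rng(0)
k = 10
points = np.sort(rng.random(k))
thresholds = np.concatenate(([-np.inf], (points[:-1] + points[1:]) / 2, [np.inf]))
labelings = {tuple(np.where(points >= t, 1, -1)) for t in thresholds}
print(len(labelings), "realized labelings; (ek/d)^d with d = 1 is about", np.e * k)
```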

Thus, we have that

‖π_θ − π_θ′‖ = E[‖π_θ − π_θ′‖] < γ + Lγ^α + (ek)^d 2k²d E[ max_{ȳ_d ∈ {−1,+1}^d} max_{D ∈ {1,...,k}^d} |P_{Y_d(θ)|X_D}(ȳ_d) − P_{Y_d(θ′)|X_D}(ȳ_d)| ].

Note that

E[ max_{ȳ_d ∈ {−1,+1}^d} max_{D ∈ {1,...,k}^d} |P_{Y_d(θ)|X_D}(ȳ_d) − P_{Y_d(θ′)|X_D}(ȳ_d)| ] ≤ Σ_{ȳ_d ∈ {−1,+1}^d} Σ_{D ∈ {1,...,k}^d} E[ |P_{Y_d(θ)|X_D}(ȳ_d) − P_{Y_d(θ′)|X_D}(ȳ_d)| ] ≤ (2k)^d max_{ȳ_d ∈ {−1,+1}^d} max_{D ∈ {1,...,k}^d} E[ |P_{Y_d(θ)|X_D}(ȳ_d) − P_{Y_d(θ′)|X_D}(ȳ_d)| ],

and by exchangeability, this last line equals

(2k)^d max_{ȳ_d ∈ {−1,+1}^d} E[ |P_{Y_d(θ)|X_d}(ȳ_d) − P_{Y_d(θ′)|X_d}(ȳ_d)| ].

[Yang et al., 2011] showed that E[ |P_{Y_d(θ)|X_d}(ȳ_d) − P_{Y_d(θ′)|X_d}(ȳ_d)| ] ≤ 4 √(‖P_{Z_d(θ)} − P_{Z_d(θ′)}‖), so that in total we have

‖π_θ − π_θ′‖ < (L + 1)γ^α + 4(2ek)^{2d+2} √(‖P_{Z_d(θ)} − P_{Z_d(θ′)}‖).

Plugging in the value of k = c(d/γ) log(1/γ), this is

(L + 1)γ^α + 4( 2ec(d/γ) log(1/γ) )^{2d+2} √(‖P_{Z_d(θ)} − P_{Z_d(θ′)}‖).

So the only remaining question is the rate of convergence of our estimate of P_{Z_d}(θ*). If N(ε) is the ε-covering number of {P_{Z_d}(θ) : θ ∈ Θ}, then taking θ̂_{Tθ*} as the minimum distance skeleton estimate of [Yatracos, 1985, Devroye & Lugosi, 2001] achieves expected total variation distance ε from P_{Z_d}(θ*), for some T = O((1/ε²) log N(ε/4)). We can partition C into O((L/ε)^{d/α}) cells of diameter O((ε/L)^{1/α}), and set a constant density value within each cell, on an O(ε)-grid of density values, and every prior with (L, α)-Hölder smooth density will have density within ε of some density so-constructed; there are then at most (1/ε)^{O((L/ε)^{d/α})} such densities, so this bounds the covering numbers of Π_Θ. Furthermore, the covering number of Π_Θ upper bounds N(ε) [Yang et al., 2011], so that N(ε) ≤ (1/ε)^{O((L/ε)^{d/α})}. Solving T = O(ε^{−2}(L/ε)^{d/α} log(1/ε)) for ε, we have

ε = O( L (log(TL)/T)^{α/(d+2α)} ).

So this bounds the rate of convergence for E[‖P_{Z_d(θ̂_T)} − P_{Z_d(θ*)}‖], for θ̂_T the minimum distance skeleton estimate.
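For intuition about the estimator invoked here, the following is a generic sketch (in the spirit of Yatracos [1985] and Devroye & Lugosi [2001], but not the paper's exact skeleton construction) of minimum distance selection among finitely many candidate distributions on a small discrete support: the selected candidate minimizes its largest discrepancy with the empirical measure over the Yatracos sets A_{jl} = {x : p_j(x) > p_l(x)}. The function name and the toy candidates are hypothetical.

```python
import numpy as np

def minimum_distance_select(candidates, sample, support_size):
    """Select the candidate whose discrepancy with the empirical measure,
    maximized over the Yatracos sets A_jl = {x : p_j(x) > p_l(x)}, is smallest.
    A generic sketch of the minimum distance principle, not the paper's exact
    skeleton construction."""
    emp = np.bincount(sample, minlength=support_size) / len(sample)
    n = len(candidates)
    yatracos = [candidates[j] > candidates[l]
                for j in range(n) for l in range(n) if j != l]
    def score(p):
        return max(abs(p[A].sum() - emp[A].sum()) for A in yatracos)
    return min(range(n), key=lambda j: score(candidates[j]))

rng = np.random.default_rng(0)
candidates = [np.array([0.5, 0.3, 0.2]),
              np.array([0.2, 0.3, 0.5]),
              np.array([1/3, 1/3, 1/3])]
sample = rng.choice(3, size=200, p=candidates[0])   # data drawn from candidate 0
print(minimum_distance_select(candidates, sample, support_size=3))  # typically 0
```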

Plugging this rate into the bound on the priors, combined with Jensen's inequality, we have

E[‖π_{θ̂_{Tθ*}} − π_{θ*}‖] < (L + 1)γ^α + 4( 2ec(d/γ) log(1/γ) )^{2d+2} O( L^{1/2} (log(TL)/T)^{α/(2d+4α)} ).

This holds for any γ > 0, so minimizing this expression over γ > 0 yields a bound on the rate. For instance, with γ = Õ( T^{−α/(2(d+2α)(α+2(d+1)))} ), we have

E[‖π_{θ̂_{Tθ*}} − π_{θ*}‖] = Õ( L T^{−α²/(2(d+2α)(α+2(d+1)))} ).

4 A Minimax Lower Bound

One natural question is whether Theorem 1 can generally be improved. While we expect this to be true for some fixed VC classes (e.g., those of finite size), and in any case we expect that some of the constant factors in the exponent may be improvable, it is not at this time clear whether the general form of T^{−Θ(α²/(d+α)²)} is sometimes optimal. One way to investigate this question is to construct specific spaces C and distributions D for which a lower bound can be obtained. In particular, we are generally interested in exhibiting lower bounds that are worse than those that apply to the usual problem of density estimation based on direct access to the h*_{tθ*} values (see Theorem 3 below). Here we present a lower bound that is interesting for this reason. However, although larger than the optimal rate for methods with direct access to the target concepts, it is still far from matching the upper bound above, so that the question of tightness remains open. Specifically, we have the following result.

Theorem 2. For any integer d ≥ 1, any L > 0, and any α ∈ (0, 1], there is a value C(d, L, α) ∈ (0, ∞) such that, for any T ∈ N, there exists an instance space X, a concept space C of VC dimension d, a distribution D over X, and a distribution π₀ over C such that, for Π_Θ a set of distributions over C with (L, α)-Hölder smooth density functions with respect to π₀, any estimator θ̂_T = θ̂_T(Z_{1d}(θ*), ..., Z_{Td}(θ*)) has

sup_{θ*∈Θ} E[‖π_{θ̂_T} − π_{θ*}‖] ≥ C(d, L, α) T^{−α/(2(d+α))}.

Proof. (Sketch) We proceed by a reduction from the task of determining the bias of a coin from among two given possibilities. Specifically, fix any γ ∈ (0, 1/2) and n ∈ N, and let B_1(p), ..., B_n(p) be i.i.d. Bernoulli(p) random variables, for each p ∈ [0, 1]; then it is known that, for any (possibly nondeterministic) decision rule p̂_n : {0, 1}^n → {(1 + γ)/2, (1 − γ)/2},

(1/2) Σ_{p ∈ {(1+γ)/2, (1−γ)/2}} P( p̂_n(B_1(p), ..., B_n(p)) ≠ p ) ≥ (1/3) exp{ −128γ²n/3 }.   (1)

This easily follows from the results of [Wald, 1945, Bar-Yossef, 2003], combined with a result of [Poland & Hutter, 2006] bounding the KL divergence.
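The exponential dependence on γ²n in (1) is easy to see empirically. The following small Monte Carlo sketch (not from the paper, and not reproducing its constants; the function name error_prob is hypothetical) estimates the average error of the maximum-likelihood rule for deciding between the two coin biases; the error decays roughly like exp(−Θ(γ²n)), which is the behavior the reduction below exploits.

```python
import numpy as np

def error_prob(gamma, n, trials=20000, seed=1):
    """Monte Carlo estimate of the average error of the maximum-likelihood rule
    for deciding whether a coin has bias (1+gamma)/2 or (1-gamma)/2 from n flips.
    A sketch of the exp(-Theta(gamma^2 n)) behavior behind (1); the constants
    here are not those of the text."""
    rng = np.random.default_rng(seed)
    errors = 0
    for p in [(1 + gamma) / 2, (1 - gamma) / 2]:
        flips = rng.random((trials, n)) < p
        guess_high = flips.mean(axis=1) >= 0.5   # ML rule: compare frequency to 1/2
        errors += np.sum(guess_high != (p > 0.5))
    return errors / (2 * trials)

for n in [10, 100, 1000]:
    print(n, error_prob(gamma=0.1, n=n))
```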

To use this result, we construct a learning problem as follows. Fix some m ∈ N with m ≥ d, let X = {1, ..., m}, and let C be the space of all classifiers h : X → {−1, +1} such that |{x ∈ X : h(x) = +1}| ≤ d. Clearly the VC dimension of C is d. Define the distribution D as uniform over X. Finally, we specify a family of (L, α)-Hölder smooth priors, parameterized by Θ = {−1, +1}^{(m choose d)}, as follows. Let γ = (L/2)(1/m)^α. First, enumerate the (m choose d) distinct d-sized subsets of {1, ..., m} as X_1, X_2, ..., X_{(m choose d)}. Define the reference distribution π₀ by the property that, for any h ∈ C, letting q = |{x : h(x) = +1}|,

π₀({h}) = (1/2)^d (m−q choose d−q) / (m choose d).

For any b = (b_1, ..., b_{(m choose d)}) ∈ {−1, +1}^{(m choose d)}, define the prior π_b as the distribution of a random variable h_b specified by the following generative model. Let i* ~ Uniform({1, ..., (m choose d)}), let C_b(i*) ~ Bernoulli((1 + γ b_{i*})/2); finally, h_b ~ Uniform({h ∈ C : {x : h(x) = +1} ⊆ X_{i*}, Parity(|{x : h(x) = +1}|) = C_b(i*)}), where Parity(n) is 1 if n is odd, or 0 if n is even. We will refer to the variables in this generative model below. For any h ∈ C, letting H = {x : h(x) = +1} and q = |H|, we can equivalently express

π_b({h}) = (1/2)^d (m choose d)^{−1} Σ_{i=1}^{(m choose d)} 1[H ⊆ X_i] (1 + γ b_i)^{Parity(q)} (1 − γ b_i)^{1−Parity(q)}.

From this explicit representation, it is clear that, letting f_b = dπ_b/dπ₀, we have f_b(h) ∈ [1 − γ, 1 + γ] for all h ∈ C. The fact that f_b is Hölder smooth follows from this, since every distinct h, g ∈ C have D({x : h(x) ≠ g(x)}) ≥ 1/m = (2γ/L)^{1/α}.

Next we set up the reduction as follows. For any estimator π̂_T = π̂_T(Z_{1d}(θ*), ..., Z_{Td}(θ*)), and each i ∈ {1, ..., (m choose d)}, let h_i be the classifier with {x : h_i(x) = +1} = X_i; also, if π̂_T({h_i}) > (1/2)^d / (m choose d), let b̂_i = 2 Parity(d) − 1, and otherwise b̂_i = 1 − 2 Parity(d). We use these b̂_i values to estimate the original b_i values. Specifically, let p̂_i = (1 + γ b̂_i)/2 and p_i = (1 + γ b_i)/2, where b = θ*. Then

‖π̂_T − π_{θ*}‖ ≥ (1/2) Σ_{i=1}^{(m choose d)} |π̂_T({h_i}) − π_{θ*}({h_i})| ≥ (1/2) Σ_{i=1}^{(m choose d)} γ 2^{−d} (m choose d)^{−1} |b̂_i − b_i|/2 = (1/2) 2^{−d} (m choose d)^{−1} Σ_{i=1}^{(m choose d)} |p̂_i − p_i|.

Thus, we have reduced from the problem of deciding the biases of these (m choose d) independent Bernoulli random variables. To complete the proof, it suffices to lower bound the expectation of the right side for an arbitrary estimator. Toward this end, we in fact study an even easier problem. Specifically, consider an estimator q̂_i = q̂_i(Z_{1d}(θ*), ..., Z_{Td}(θ*), i*_1, ..., i*_T), where i*_t is the i* random variable in the generative model that defines h*_{tθ*}; that is, i*_t ~ Uniform({1, ..., (m choose d)}), C_t ~ Bernoulli((1 + γ b_{i*_t})/2), and h*_{tθ*} ~ Uniform({h ∈ C : {x : h(x) = +1} ⊆ X_{i*_t}, Parity(|{x : h(x) = +1}|) = C_t}), where the i*_t are independent across t, as are the C_t and h*_{tθ*}.
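To make the construction tangible, here is a small sketch of the generative model for π_b as reconstructed above (not code from the paper; the function name sample_h_b and the toy parameter values are hypothetical, and m and d are kept small enough to enumerate the subsets X_i explicitly).

```python
import numpy as np
from itertools import combinations

def sample_h_b(b, m, d, gamma, rng):
    """Draw one concept h_b from the generative model defining pi_b, as
    reconstructed above: pick a d-subset X_i uniformly, flip a coin with bias
    (1 + gamma*b_i)/2, then pick a uniformly random positive set H inside X_i
    whose parity |H| mod 2 matches the coin. Returns +/-1 labels over {0,...,m-1}."""
    subsets = list(combinations(range(m), d))          # X_1, ..., X_(m choose d)
    i_star = rng.integers(len(subsets))
    coin = int(rng.random() < (1 + gamma * b[i_star]) / 2)   # C_b(i*)
    pool = [H for r in range(d + 1) if r % 2 == coin
            for H in combinations(subsets[i_star], r)]
    H = pool[rng.integers(len(pool))]
    h = -np.ones(m, dtype=int)
    h[list(H)] = +1
    return h

rng = np.random.default_rng(0)
m, d, L, alpha = 6, 2, 1.0, 1.0
b = rng.choice([-1, +1], size=len(list(combinations(range(m), d))))
gamma = (L / 2) * (1 / m) ** alpha     # the gamma = (L/2)(1/m)^alpha of the text
print(sample_h_b(b, m, d, gamma, rng))
```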

Clearly the p̂_i from above can be viewed as an estimator of this type, which simply ignores the knowledge of i*_t. The knowledge of these i*_t variables simplifies the analysis, since given {i*_t : t ≤ T}, the data can be partitioned into (m choose d) disjoint sets, {{Z_{td}(θ*) : i*_t = i} : i = 1, ..., (m choose d)}, and we can use only the set {Z_{td}(θ*) : i*_t = i} to estimate p_i. Furthermore, we can use only the subset of these for which X_{td} = X_i, since otherwise we have zero information about the value of Parity(|{x : h*_{tθ*}(x) = +1}|). That is, given i*_t = i, any Z_{td}(θ*) is conditionally independent from every b_j for j ≠ i, and is even conditionally independent from b_i when X_{td} is not completely contained in X_i; specifically, in this case, regardless of b_i, the conditional distribution of Y_{td}(θ*) given i*_t = i and given X_{td} is a product distribution, which deterministically assigns label −1 to those Y_{tk}(θ*) with X_{tk} ∉ X_i, and gives uniform random values to the subset of Y_{td}(θ*) with their respective X_{tk} ∈ X_i. Finally, letting r_t = Parity(|{k ≤ d : Y_{tk}(θ*) = +1}|), we note that given i*_t = i, X_{td} = X_i, and the value r_t, b_i is conditionally independent from Z_{td}(θ*). Thus, the set of values C_{iT}(θ*) = {r_t : i*_t = i, X_{td} = X_i} is a sufficient statistic for b_i (hence for p_i). Recall that, when i*_t = i and X_{td} = X_i, the value of r_t is equal to C_t, a Bernoulli(p_i) random variable. Thus, we neither lose nor gain anything (in terms of risk) by restricting ourselves to estimators q̂_i of the type q̂_i = q̂_i(Z_{1d}(θ*), ..., Z_{Td}(θ*), i*_1, ..., i*_T) = q̂′_i(C_{iT}(θ*)), for some q̂′_i [Schervish, 1995]: that is, estimators that are a function of the N_{iT}(θ*) = |C_{iT}(θ*)| Bernoulli(p_i) random variables, which we should note are conditionally i.i.d. given N_{iT}(θ*). Thus, by (1), for any n ≤ T,

(1/2) Σ_{b_i ∈ {−1,+1}} E[ |q̂_i − p_i| | N_{iT}(θ*) = n ] = (1/2) Σ_{b_i ∈ {−1,+1}} γ P( q̂_i ≠ p_i | N_{iT}(θ*) = n ) ≥ (γ/3) exp{ −128γ²n/3 }.

Also note that, for each i, E[N_{iT}(θ*)] = d!(1/m)^d T / (m choose d) ≤ d!(d/m)^d (1/m)^d T ≤ d^{2d} (2γ/L)^{2d/α} T, so that Jensen's inequality, linearity of expectation, and the law of total expectation imply

(1/2) Σ_{b_i ∈ {−1,+1}} E[ |q̂_i − p_i| ] ≥ (γ/3) exp{ −43 (2/L)^{2d/α} d^{2d} γ^{2+2d/α} T }.

Thus, by linearity of the expectation,

2^{−(m choose d)} Σ_{b ∈ {−1,+1}^{(m choose d)}} E[ 2^{−d} (m choose d)^{−1} Σ_{i=1}^{(m choose d)} |q̂_i − p_i| ] = 2^{−d} (m choose d)^{−1} Σ_{i=1}^{(m choose d)} 2^{−(m choose d)} Σ_{b ∈ {−1,+1}^{(m choose d)}} E[ |q̂_i − p_i| ] ≥ (γ/(3·2^d)) exp{ −43 (2/L)^{2d/α} d^{2d} γ^{2+2d/α} T }.

In particular, taking

m = (L/2)^{1/α} ( 43 (2/L)^{2d/α} d^{2d} T )^{1/(2(d+α))},

we have

γ = Θ( ( 43 (2/L)^{2d/α} d^{2d} T )^{−α/(2(d+α))} ),

so that

2^{−(m choose d)} Σ_{b ∈ {−1,+1}^{(m choose d)}} E[ 2^{−d} (m choose d)^{−1} Σ_{i=1}^{(m choose d)} |q̂_i − p_i| ] = Ω( 2^{−d} ( 43 (2/L)^{2d/α} d^{2d} T )^{−α/(2(d+α))} ).

11 In particular, taking we have so that ( ) 1 ( d) b { 1,+1} ( d) = (L/) ( 1/α 43(/L) d/α d d T ) 1 (d+α), ( (43(/L) γ = Θ d/α d d T ) ) α (d+α), ( d) E 1 d( d ) ˆq i p i = Ω ( ( d 43(/L) d/α d d T ) ) α (d+α). In particular, this iplies there exists soe b for which ( d) 1 E d( ) ˆq i p i = Ω ( ( d 43(/L) d/α d d T ) ) α (d+α). d Applying this lower bound to the estiator ˆp i defined above yields the result. In the extree case of allowing arbitrary dependence on the data saples, we erely recover the known results lower bounding the risk of density estiation fro i.i.d. saples fro a sooth density, as indicated by the following result. Theore 3. For any integer d 1, there exists an instance space X, a concept space C of VC diension d, a distribution D over X, and a distribution π 0 over C such that, for Π Θ the set of distributions over C with (L, α)-hölder sooth density functions with respect to π 0, any sequence of estiators, ˆθ T = ˆθ T (Z 1 (θ ),..., Z T (θ )) (T = 1,,...), has sup E [ πˆθt θ Θ π θ ] = Ω (T α d+α The proof is a siple reduction fro the proble of estiating π θ based on direct access to h 1θ,..., h T θ, which is essentially equivalent to the standard odel of density estiation, and indeed the lower bound in Theore 3 is a well-known result for density estiation fro T i.i.d. saples fro a Hölder sooth density in a d-diensional space [?, see e.g.,]]devroye:01. 5 Future Directions There are several interesting questions that reain open at this tie. Can either the lower bound or upper bound be iproved in general? If, instead of d saples per task, we instead use d saples, how does the iniax risk vary with? Related to this, what is the optial value of to optiize the rate of convergence as a function of T, the total nuber of saples? More generally, if an estiator is peritted to use N total saples, taken fro however any tasks it wishes, what is the optial rate of convergence as a function of N? 9 ).

References

Bar-Yossef, Z. (2003). Sampling lower bounds via information theory. Proceedings of the 35th Annual ACM Symposium on the Theory of Computing.

Baxter, J. (1997). A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28.

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36.

Devroye, L., & Lugosi, G. (2001). Combinatorial methods in density estimation. New York, NY, USA: Springer.

Poland, J., & Hutter, M. (2006). MDL convergence speed for Bernoulli sequences. Statistics and Computing, 16.

Schervish, M. J. (1995). Theory of statistics. New York, NY, USA: Springer.

Vapnik, V. (1982). Estimation of dependences based on empirical data. Springer-Verlag, New York.

Wald, A. (1945). Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16.

Yang, L., Hanneke, S., & Carbonell, J. (2011). Identifiability of priors from bounded sample sizes with applications to transfer learning. 24th Annual Conference on Learning Theory.

Yatracos, Y. G. (1985). Rates of convergence of minimum distance estimators and Kolmogorov's entropy. The Annals of Statistics, 13.
