Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Journal of Machine Learning Research 11 (2010) Submitted 11/08; Revised 1/10; Published 2/10

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Mehryar Mohri
Courant Institute of Mathematical Sciences and Google Research
251 Mercer Street, New York, NY 10012

Afshin Rostamizadeh
Department of Computer Science, Courant Institute of Mathematical Sciences
251 Mercer Street, New York, NY 10012

Editor: Ralf Herbrich

Abstract

Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios.

Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory

1. Introduction

Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, such as the VC-dimension, covering numbers, or Rademacher complexity. These measures characterize a class of hypotheses, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive bounds that are tailored to specific learning algorithms and exploit their particular properties. A learning algorithm is stable if the hypothesis it

outputs varies in a limited way in response to small changes made to the training set. Algorithmic stability has been used effectively in the past to derive tight generalization bounds (Bousquet and Elisseeff, 2001, 2002; Kearns and Ron, 1997).

But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, this assumption, however, does not hold; in fact, the i.i.d. assumption is not tested or derived from any data analysis. The observations received by the learning algorithm often have some inherent temporal dependence. This is clear in system diagnosis or time series prediction problems. Clearly, prices of different stocks on the same day, or of the same stock on different days, may be dependent. But, a less apparent time dependency may affect data sampled in many other tasks as well.

This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time (Yu, 1994; Meir, 2000; Vidyasagar, 2003; Lozano et al., 2006; Mohri and Rostamizadeh, 2007). We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the usefulness of stability bounds to non-i.i.d. scenarios. Our proofs are based on the independent block technique described by Yu (1994) and attributed to Bernstein (1927), which is commonly used in such contexts. However, our analysis somewhat differs from previous uses of this technique in that the blocks of points we consider are not necessarily of equal size.

For our analysis of stationary ϕ-mixing sequences, we make use of a generalized version of McDiarmid's inequality given by Kontorovich and Ramanan (2008) that holds for ϕ-mixing sequences. This leads to stability-based generalization bounds with the standard exponential form. Our generalization bounds for stationary β-mixing sequences cover a more general non-i.i.d. scenario and use the standard McDiarmid's inequality; however, unlike the ϕ-mixing case, the β-mixing bound presented here is not a purely exponential bound and contains an additive term depending on the mixing coefficient.

We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression (SVR) (Vapnik, 1998), Kernel Ridge Regression (Saunders et al., 1998), and Support Vector Machines (SVMs) (Cortes and Vapnik, 1995). Algorithms such as SVR have been used in the context of time series prediction in which the i.i.d. assumption does not hold, some with good experimental results (Müller et al., 1997; Mattera and Haykin, 1999). However, to our knowledge, the use of these algorithms in non-i.i.d. scenarios has not been previously supported by any theoretical analysis. The stability bounds we give for SVR, SVMs, and many other kernel regularization-based and relative entropy-based regularization algorithms can thus be viewed as the first theoretical basis for their use in such scenarios.

The following sections are organized as follows. In Section 2, we introduce the definitions relevant to the non-i.i.d. problems that we are considering and discuss the learning scenarios in that context.
Section 3 gives our main generalization bounds for stationary ϕ-mixing sequences based on stability, as well as the illustration of their applications to general kernel regularization-based algorithms, including SVR, KRR, and SVMs, as well as to relative entropy-based regularization algorithms. Finally, Section 4 presents the first known stability bounds for the more general stationary β-mixing scenario.

2. Preliminaries

We first introduce some standard definitions for dependent observations in mixing theory (Doukhan, 1994) and then briefly discuss the learning scenarios in the non-i.i.d. case.

2.1 Non-i.i.d. Definitions

Definition 1 A sequence of random variables Z = {Z_t}_{t=-∞}^{+∞} is said to be stationary if for any t and non-negative integers m and k, the random vectors (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution.

Thus, the index t, or time, does not affect the distribution of a variable Z_t in a stationary sequence. This does not imply independence, however. In particular, for i < j < k, Pr[Z_j | Z_i] may not equal Pr[Z_k | Z_i], that is, conditional probabilities may vary at different points in time. The following is a standard definition giving a measure of the dependence of the random variables Z_t within a stationary sequence. There are several equivalent definitions of these quantities; we adopt here a version convenient for our analysis, as in Yu (1994).

Definition 2 Let Z = {Z_t}_{t=-∞}^{+∞} be a stationary sequence of random variables. For any i, j ∈ Z ∪ {-∞, +∞}, let σ_i^j denote the σ-algebra generated by the random variables Z_k, i ≤ k ≤ j. Then, for any positive integer k, the β-mixing and ϕ-mixing coefficients of the stochastic process Z are defined as

β(k) = sup_n E_{B ∈ σ_1^n} [ sup_{A ∈ σ_{n+k}^∞} |Pr[A | B] − Pr[A]| ],    ϕ(k) = sup_{n, A ∈ σ_{n+k}^∞, B ∈ σ_1^n} |Pr[A | B] − Pr[A]|.

Z is said to be β-mixing (ϕ-mixing) if β(k) → 0 (resp. ϕ(k) → 0) as k → ∞. It is said to be algebraically β-mixing (algebraically ϕ-mixing) if there exist real numbers β_0 > 0 (resp. ϕ_0 > 0) and r > 0 such that β(k) ≤ β_0 / k^r (resp. ϕ(k) ≤ ϕ_0 / k^r) for all k, and exponentially mixing if there exist real numbers β_0 (resp. ϕ_0 > 0), β_1 (resp. ϕ_1 > 0) and r > 0 such that β(k) ≤ β_0 exp(−β_1 k^r) (resp. ϕ(k) ≤ ϕ_0 exp(−ϕ_1 k^r)) for all k.

Both β(k) and ϕ(k) measure the dependence of an event on those that occurred more than k units of time in the past. β-mixing is a weaker assumption than ϕ-mixing and thus covers a more general non-i.i.d. scenario.

This paper gives stability-based generalization bounds both in the ϕ-mixing and the β-mixing case. The β-mixing bounds cover a more general case of course; however, the ϕ-mixing bounds are simpler and admit the standard exponential form. The ϕ-mixing bounds are based on a concentration inequality that applies to ϕ-mixing processes only. Except for the use of this concentration bound and the two Lemmas 5 and 6, all of the intermediate proofs and results used to derive a ϕ-mixing bound in Section 3 are given in the more general case of β-mixing sequences.

It has been argued by Vidyasagar (2003) that β-mixing is just the right assumption for the analysis of weakly dependent sample points in machine learning, in particular because several PAC-learning results then carry over to the non-i.i.d. case.(1) Our β-mixing generalization bounds further contribute to the analysis of this scenario.

1. Some results have also been obtained in the more general context of α-mixing but they seem to require the stronger condition of exponential mixing (Modha and Masry, 1998).
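The mixing definitions above are easiest to picture on a concrete process. The following minimal Python sketch is not from the paper; the AR(1) model and the parameter rho are illustrative assumptions. A stationary Gaussian AR(1) process with |rho| < 1 is known to be exponentially β-mixing, and the dependence between Z_t and Z_{t+k} visibly weakens as the gap k grows.

```python
import numpy as np

def ar1_sample(m, rho=0.8, seed=0):
    """Draw m points from a stationary Gaussian AR(1) process Z_t = rho*Z_{t-1} + eps_t.
    For |rho| < 1 this process is stationary and (exponentially) beta-mixing."""
    rng = np.random.default_rng(seed)
    z = np.empty(m)
    z[0] = rng.normal(scale=1.0 / np.sqrt(1 - rho**2))  # start from the stationary law
    for t in range(1, m):
        z[t] = rho * z[t - 1] + rng.normal()
    return z

# Dependence weakens with the time gap k: for a Gaussian AR(1), corr(Z_t, Z_{t+k}) = rho**k.
z = ar1_sample(10000)
for k in [1, 5, 10, 20]:
    emp = np.corrcoef(z[:-k], z[k:])[0, 1]
    print(f"lag {k:2d}: empirical corr = {emp:+.3f}, rho^k = {0.8**k:+.3f}")
```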

We describe in several instances the application of our bounds in the case of algebraic mixing. Algebraic mixing is a standard assumption on mixing coefficients that has been adopted in previous studies of learning in the presence of dependent observations (Yu, 1994; Meir, 2000; Vidyasagar, 2003; Lozano et al., 2006). Let us also point out that mixing assumptions can be checked in some cases, such as with Gaussian or Markov processes (Meir, 2000), and that mixing parameters can also be estimated in such cases.

Most previous studies use a technique originally introduced by Bernstein (1927) based on independent blocks of equal size (Yu, 1994; Meir, 2000; Lozano et al., 2006). This technique is particularly relevant when dealing with stationary β-mixing. We will need a related but somewhat different technique since the blocks we consider may not have the same size. The following lemma is a special case of Corollary 2.7 from Yu (1994).

Lemma 3 (Yu, 1994, Corollary 2.7) Let µ ≥ 1 and suppose that h is a measurable function, with absolute value bounded by M, on a product probability space (∏_{j=1}^µ Ω_j, ∏_{j=1}^µ σ_{r_j}^{s_j}), where r_i ≤ s_i ≤ r_{i+1} for all i. Let Q be a probability measure on the product space with marginal measures Q_i on (Ω_i, σ_{r_i}^{s_i}), and let Q^{i+1} be the marginal measure of Q on (∏_{j=1}^{i+1} Ω_j, ∏_{j=1}^{i+1} σ_{r_j}^{s_j}), i = 1, ..., µ − 1. Let β(Q) = sup_{1 ≤ i ≤ µ−1} β(k_i), where k_i = r_{i+1} − s_i, and P = ∏_{i=1}^µ Q_i. Then,

|E_Q[h] − E_P[h]| ≤ (µ − 1) M β(Q).

The lemma gives a measure of the difference between the distribution of µ blocks where the blocks are independent in one case and dependent in the other case. The distribution within each block is assumed to be the same in both cases. For a monotonically decreasing function β, we have β(Q) = β(k*), where k* = min_i(k_i) is the smallest gap between blocks.

2.2 Learning Scenarios

We consider the familiar supervised learning setting where the learning algorithm receives a sample of m labeled points S = (z_1, ..., z_m) = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, where X is the input space and Y the set of labels (Y ⊆ R in the regression case), both assumed to be measurable. For a fixed learning algorithm, we denote by h_S the hypothesis it returns when trained on the sample S.

The error of a hypothesis on a pair z ∈ X × Y is measured in terms of a cost function c: Y × Y → R_+. Thus, c(h(x), y) measures the error of a hypothesis h on a pair (x, y); c(h(x), y) = (h(x) − y)^2 in the standard regression case. We will often use the shorthand c(h, z) := c(h(x), y) for a hypothesis h and z = (x, y) ∈ X × Y and will assume that c is upper bounded by a constant M > 0. We denote by R̂(h) the empirical error of a hypothesis h for a training sample S = (z_1, ..., z_m):

R̂(h) = (1/m) Σ_{i=1}^m c(h, z_i).

In the standard machine learning scenario, the sample pairs z_1, ..., z_m are assumed to be i.i.d., a restrictive assumption that does not always hold in practice. We will consider here the more general case of dependent samples drawn from a stationary mixing sequence Z over X × Y. As in the i.i.d. case, the objective of the learning algorithm is to select a hypothesis with small error over future samples. But, here, we must distinguish two versions of this problem.

In the most general version, future samples depend on the training sample S and thus the generalization error or true error of the hypothesis h_S trained on S must be measured by its expected error conditioned on the sample S:

R(h_S) = E_z[c(h_S, z) | S].    (1)

This is the most realistic setting in this context, and it matches time series prediction problems. A somewhat less realistic version is one where the samples are dependent, but the test points are assumed to be independent of the training sample S. The generalization error of the hypothesis h_S trained on S is then:

R(h_S) = E_z[c(h_S, z) | S] = E_z[c(h_S, z)].

This setting seems less natural since, if samples are dependent, future test points must also depend on the training points, even if that dependence is relatively weak due to the time interval after which test points are drawn. Nevertheless, it is this somewhat less realistic setting that has been studied by all previous machine learning studies that we are aware of: Yu (1994), Meir (2000), Vidyasagar (2003) and Lozano et al. (2006), even when examining specifically a time series prediction problem (Meir, 2000). Thus, the bounds derived in these studies cannot be directly applied to the more general setting. We will consider instead the most general setting, with the definition of the generalization error based on Equation 1. Clearly, our analysis also applies to the less general setting just discussed.

Let us also briefly discuss the more general scenario of non-stationary mixing sequences, that is, one where the distribution may change over time. Within that general case, the generalization error of a hypothesis h_S, defined straightforwardly by

R(h_S, t) = E_{z_t ~ σ_t}[c(h_S, z_t) | S],

would depend on the time t, and it may be the case that R(h_S, t) ≠ R(h_S, t') for t ≠ t', making the definition of the generalization error a more subtle issue. To remove the dependence on time, one could define a weaker notion of the generalization error based on an expected loss over all time:

R(h_S) = E_t[R(h_S, t)].

It is not clear, however, whether this term could be easily computed and be useful. A stronger condition would be to minimize the generalization error for any particular target time. Studies of this type have been conducted for smoothly changing distributions, such as in Zhou et al. (2008); however, to the best of our knowledge, the scenario of both non-identical and non-independent sequences has not yet been studied.

3. ϕ-Mixing Generalization Bounds and Applications

This section gives generalization bounds for β̂-stable algorithms over a mixing stationary distribution.(2) The first two subsections present our supporting lemmas, which hold for either β-mixing or ϕ-mixing stationary distributions. In the third subsection, we will briefly discuss concentration inequalities that apply to ϕ-mixing processes only. Then, in the final subsection, we will present our main results.

2. The standard variable used for the stability coefficient is β. To avoid confusion with the β-mixing coefficient, we will use β̂ instead.
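The distinction between the two generalization errors just discussed can be made concrete with a small simulation. The sketch below is not from the paper; the AR(1) data model, the one-step-prediction task, the ridge regressor, and all parameters are illustrative assumptions. It estimates, for a single training sample S, both the conditional error of Equation 1 (test pair drawn immediately after the sample) and the independent-test error (test pair drawn from the stationary law, ignoring S).

```python
import numpy as np

rng = np.random.default_rng(1)
rho, m, lam = 0.9, 200, 1e-3

# One dependent training path and the pairs (x_t, y_t) = (z_t, z_{t+1}).
path = np.empty(m + 1)
path[0] = rng.normal(scale=1 / np.sqrt(1 - rho**2))      # stationary start
for t in range(1, m + 1):
    path[t] = rho * path[t - 1] + rng.normal()
x, y = path[:-1], path[1:]
w = (x @ y) / (x @ x + lam)                               # 1-d ridge hypothesis h_S(x) = w*x

n = 20000
# (1) R(h_S) = E_z[c(h_S, z) | S]: the test pair starts right after the sample, so we
#     average over continuations of the same chain, conditioned on its last value.
x_dep = rho * path[-1] + rng.normal(size=n)
y_dep = rho * x_dep + rng.normal(size=n)
r_dependent = np.mean((w * x_dep - y_dep) ** 2)

# (2) Independent-test version: test pairs drawn from the stationary law, ignoring S.
x_ind = rng.normal(scale=1 / np.sqrt(1 - rho**2), size=n)
y_ind = rho * x_ind + rng.normal(size=n)
r_independent = np.mean((w * x_ind - y_ind) ** 2)

print(f"conditional (dependent-test) error: {r_dependent:.3f}")
print(f"independent-test error:             {r_independent:.3f}")
```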

The condition of β̂-stability is an algorithm-dependent property first introduced by Devroye and Wagner (1979) and Kearns and Ron (1997). It was later used successfully by Bousquet and Elisseeff (2001, 2002) to show algorithm-specific stability bounds for i.i.d. samples. Roughly speaking, a learning algorithm is said to be stable if small changes to the training set do not cause large deviations in its output. The following gives the precise technical definition.

Definition 4 A learning algorithm is said to be (uniformly) β̂-stable if the hypotheses it returns for any two training samples S and S' that differ by removing a single point satisfy

∀z ∈ X × Y, |c(h_S, z) − c(h_{S'}, z)| ≤ β̂.

We note that a β̂-stable algorithm is also stable with respect to replacing a single point. Let S and S^i be two sequences differing in the ith coordinate, and let S^{/i} be equivalent to S and S^i but with the ith point removed. Then, for a β̂-stable algorithm we have

|c(h_S, z) − c(h_{S^i}, z)| = |c(h_S, z) − c(h_{S^{/i}}, z) + c(h_{S^{/i}}, z) − c(h_{S^i}, z)| ≤ |c(h_S, z) − c(h_{S^{/i}}, z)| + |c(h_{S^{/i}}, z) − c(h_{S^i}, z)| ≤ 2β̂.

The use of stability in conjunction with McDiarmid's inequality will allow us to derive generalization bounds. McDiarmid's inequality is an exponential concentration bound of the form

Pr[|Φ − E[Φ]| ≥ ε] ≤ exp(−ε² / (τ² m)),

where the probability is over a sample of size m and where τ is the Lipschitz parameter of Φ, with τ a function of m. Unfortunately, this inequality cannot be applied when the sample points are not distributed in an i.i.d. fashion. We will use instead a result of Kontorovich and Ramanan (2008) that extends McDiarmid's inequality to ϕ-mixing distributions (Theorem 8). To obtain a stability-based generalization bound, we will apply this theorem to Φ(S) = R(h_S) − R̂(h_S). To do so, we need to show, as with the standard McDiarmid's inequality, that Φ is a Lipschitz function and, to make it useful, bound E[Φ]. The next two subsections describe how we achieve both of these in this non-i.i.d. scenario.

Let us first take a brief look at the problem faced when attempting to give stability bounds for dependent sequences and give some idea of our solution for that problem. The stability proofs given by Bousquet and Elisseeff (2001) assume the i.i.d. property; thus, replacing an element in a sequence with another does not affect the expected value of a random variable defined over that sequence. In other words, the following equality holds:

E_S[V(Z_1, ..., Z_i, ..., Z_m)] = E_{S,Z'}[V(Z_1, ..., Z', ..., Z_m)],    (2)

for a random variable V that is a function of the sequence of random variables S = (Z_1, ..., Z_m). However, clearly, if the points in that sequence S are dependent, this equality may not hold anymore.
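The uniform stability of Definition 4 can also be probed numerically for a specific algorithm. The sketch below is a minimal empirical check, not part of the paper: it trains kernel ridge regression (whose stability is analyzed later, in Section 3.5.1) on a sample, removes one point at a time, and records the largest observed change in the loss at held-out points. The RBF kernel, the data-generating function, and all parameters are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def krr_fit(X, y, lam):
    # Minimizer of (1/m) sum (h(x_i)-y_i)^2 + lam*||h||_K^2  =>  alpha = (K + lam*m*I)^{-1} y
    K = rbf(X, X)
    return np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)

def krr_predict(alpha, Xtr, Xte):
    return rbf(Xte, Xtr) @ alpha

rng = np.random.default_rng(0)
m, lam = 200, 0.1
X = rng.normal(size=(m, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=m)

Xte = rng.normal(size=(500, 1))
yte = np.sin(Xte[:, 0])                      # fixed targets used only to evaluate the loss
pred_full = krr_predict(krr_fit(X, y, lam), X, Xte)

# Removal stability: max over removed points i and test points z of |c(h_S, z) - c(h_{S\i}, z)|.
worst = 0.0
for i in range(m):
    keep = np.arange(m) != i
    pred_i = krr_predict(krr_fit(X[keep], y[keep], lam), X[keep], Xte)
    worst = max(worst, np.max(np.abs((pred_full - yte) ** 2 - (pred_i - yte) ** 2)))

print(f"empirical stability estimate: {worst:.4f}  (theory predicts a value of order 1/(lambda*m))")
```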

Figure 1: Illustration of dependent (a) and independent (b) blocks. Although there is no dependence between blocks of points in (b), the distribution within each block remains the same as in (a) and thus points within a block remain dependent.

The main technique to cope with this problem is based on the so-called independent block sequence originally introduced by Bernstein (1927). This consists of eliminating from the original dependent sequence several blocks of contiguous points, leaving us with some remaining blocks of points. Instead of these dependent blocks, we then consider independent blocks of points, each with the same size and the same distribution (within each block) as the dependent ones. By Lemma 3, for a β-mixing distribution, the expected value of a random variable defined over the dependent blocks is close to the one based on these independent blocks. Working with these independent blocks brings us back to a situation similar to the i.i.d. case, with i.i.d. blocks replacing i.i.d. points. Figure 1 illustrates the two types of blocks just discussed.

Our use of this method somewhat differs from previous ones (see Yu, 1994; Meir, 2000) where many blocks of equal size are considered. We will be dealing with four blocks and with typically unequal sizes. More specifically, note that for Equation 2 to hold, we only need the variable Z_i to be independent of the other points in the sequence. To achieve this, roughly speaking, we will be discarding some of the points in the sequence surrounding Z_i. This results in a sequence of three blocks of contiguous points. If our algorithm is stable and we do not discard too many points, the hypothesis returned should not be greatly affected by this operation. In the next step, we apply the independent block lemma, which then allows us to treat each of these blocks as independent, modulo the addition of a mixing term. In particular, Z_i becomes independent of all other points. Clearly, the number of points discarded is subject to a trade-off: removing too many points could excessively modify the hypothesis returned; removing too few would maintain the dependency between Z_i and the remaining points, thereby inducing a larger penalty when applying Lemma 3. This trade-off is made explicit in the following section, where an optimal solution is sought.

3.1 Lipschitz Bound

As discussed in Section 2.2, in the most general scenario, test points depend on the training sample. We first present a lemma that relates the expected value of the generalization error in that scenario and the same expectation in the scenario where the test point is independent of the training sample. We denote by R(h_S) = E_z[c(h_S, z) | S] the expectation in the dependent case and by R̃(h_{S_b}) = E_z̃[c(h_{S_b}, z̃)] the expectation where the test points are assumed independent of the training sample, with S_b denoting a sequence similar to S but with the last b points removed. Figure 2(a) illustrates that sequence. The block S_b is assumed to have exactly the same distribution as the corresponding block of the same size in S.
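The derived sequences used in the following proofs are simple index operations on the original sample. The sketch below is one plausible way to realize them, not the paper's code: it builds the index set of S_b (last b points dropped) and a three-block split that discards b points on each side of a chosen position i while keeping z_i, in the spirit of Figure 2; the exact offsets used in the paper's proofs may differ slightly.

```python
def drop_tail(m, b):
    """Indices of S_b: the sample S with its last b points removed (cf. Figure 2(a))."""
    return list(range(m - b))

def blocks_around(m, i, b):
    """Indices of the (up to) three contiguous blocks that remain after discarding
    b points on each side of position i, keeping z_i itself."""
    left = list(range(0, max(i - b, 0)))
    middle = [i]
    right = list(range(min(i + b + 1, m), m))
    return left, middle, right

# Example: m = 20 points, b = 3 points discarded on each side of i = 8.
print(drop_tail(20, 3))
print(blocks_around(20, 8, 3))
```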

Lemma 5 Assume that the learning algorithm is β̂-stable and that the cost function c is bounded by M. Then, for any sample S of size m drawn from a ϕ-mixing stationary distribution and for any b ∈ {0, ..., m}, the following holds:

|R(h_S) − R̃(h_{S_b})| ≤ b β̂ + M ϕ(b).

Proof The β̂-stability of the learning algorithm implies that

|R(h_S) − E_z[c(h_{S_b}, z) | S_b]| = |E_z[c(h_S, z) | S] − E_z[c(h_{S_b}, z) | S_b]| ≤ b β̂.    (3)

Now, in order to remove the dependence on S_b, we bound the following difference:

|E_z[c(h_{S_b}, z) | S_b] − Ẽ_z̃[c(h_{S_b}, z̃)]|
= |Σ_{z ∈ Z} c(h_{S_b}, z)(Pr[z | S_b] − Pr[z])|
= |Σ_{z ∈ Z^+} c(h_{S_b}, z)(Pr[z | S_b] − Pr[z]) + Σ_{z ∈ Z^−} c(h_{S_b}, z)(Pr[z | S_b] − Pr[z])|
= |Σ_{z ∈ Z^+} c(h_{S_b}, z)|Pr[z | S_b] − Pr[z]| − Σ_{z ∈ Z^−} c(h_{S_b}, z)|Pr[z | S_b] − Pr[z]||
≤ max_{Z' ∈ {Z^−, Z^+}} Σ_{z ∈ Z'} c(h_{S_b}, z)|Pr[z | S_b] − Pr[z]|
≤ max_{Z' ∈ {Z^−, Z^+}} M Σ_{z ∈ Z'} |Pr[z | S_b] − Pr[z]|
= max_{Z' ∈ {Z^−, Z^+}} M |Pr[Z' | S_b] − Pr[Z']|
≤ M ϕ(b),    (4)

where the sum has been separated over the set Z^+ of zs for which the difference Pr[z | S_b] − Pr[z] is non-negative, and its complement Z^−. Using (3) and (4) and the triangle inequality yields the statement of the lemma.

Note that we assume that z immediately follows the sample S, which is the strongest dependent scenario. The following bounds can be improved in a straightforward manner if the test point z is assumed to be observed, say, k units of time after the sample S. The bound would then contain the mixing term ϕ(k + b) instead of ϕ(b).

We can now prove a Lipschitz bound for the function Φ.

Lemma 6 Let S = (z_1, ..., z_i, ..., z_m) and S^i = (z_1, ..., z_i', ..., z_m) be two sequences drawn from a ϕ-mixing stationary process that differ only in point z_i for some i ∈ {1, ..., m}, and let h_S and h_{S^i} be the hypotheses returned by a β̂-stable algorithm when trained on each of these samples. Then, for any i ∈ {1, ..., m}, the following inequality holds:

|Φ(S) − Φ(S^i)| ≤ (b + 2) 2β̂ + 2ϕ(b) M + M/m.

Figure 2: Illustration of the sequences derived from S that are considered in the proofs: (a) the sequence S_b, (b) the sequence S^i, (c) the sequence S_{i,b}, and (d) its independent-block counterpart.

Proof To prove this inequality, we first bound the difference of the empirical errors as in Bousquet and Elisseeff (2002), then the difference of the generalization errors. Bounding the difference of costs on agreeing points with 2β̂ and on the point that disagrees with M gives

|R̂(h_S) − R̂(h_{S^i})| ≤ (1/m) Σ_{j ≠ i} |c(h_S, z_j) − c(h_{S^i}, z_j)| + (1/m) |c(h_S, z_i) − c(h_{S^i}, z_i')| ≤ 2β̂ + M/m.    (5)

Since both R(h_S) and R(h_{S^i}) are defined with respect to a (different) dependent point, we can apply Lemma 5 to both generalization error terms and use β̂-stability. Using this and the triangle inequality, we can write

|R(h_S) − R(h_{S^i})| ≤ |R(h_S) − R̃(h_{S_b})| + |R̃(h_{S_b}) − R̃(h_{S^i_b})| + |R̃(h_{S^i_b}) − R(h_{S^i})|
≤ |R̃(h_{S_b}) − R̃(h_{S^i_b})| + 2bβ̂ + 2ϕ(b)M
= |Ẽ_z̃[c(h_{S_b}, z̃) − c(h_{S^i_b}, z̃)]| + 2bβ̂ + 2ϕ(b)M
≤ 2β̂ + 2bβ̂ + 2ϕ(b)M.    (6)

The statement of the lemma is obtained by combining inequalities (5) and (6).

3.2 Bound on Expectation

As mentioned earlier, to obtain an explicit bound after application of a generalized McDiarmid's inequality, we also need to bound E_S[Φ(S)]. This is done by analyzing independent blocks using Lemma 3.

Lemma 7 Let h_S be the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from a β-mixing stationary distribution. Then, for all b ∈ {1, ..., m}, the following inequality holds:

|E_S[Φ(S)]| ≤ (6b + 2)β̂ + 3β(b)M.

Proof Let S_b be defined as in the proof of Lemma 5. To deal with independent block sequences defined with respect to the same hypothesis, we will consider the sequence S_{i,b}, obtained from S by removing the b points on each side of z_i as well as, as in S_b, the last b points; it is illustrated by Figure 2(a-c). This can result in as many as four blocks. As before, we will consider a sequence S̃_{i,b} with a similar set of blocks, each with the same distribution as the corresponding blocks in S_{i,b}, but such that the blocks are independent, as seen in Figure 2(d). Since three blocks of at most b points are removed from each hypothesis, by the β̂-stability of the learning algorithm, the following holds:

|E_S[Φ(S)]| = |E_S[R(h_S) − R̂(h_S)]|
= |E_{S,z}[(1/m) Σ_{i=1}^m c(h_S, z_i) − c(h_S, z)]|
≤ |E_{S_{i,b},z}[(1/m) Σ_{i=1}^m c(h_{S_{i,b}}, z_i) − c(h_{S_{i,b}}, z)]| + 6bβ̂.

The application of Lemma 3 to the difference of two cost functions, also bounded by M, as in the right-hand side leads to

|E_S[Φ(S)]| ≤ |E_{S̃_{i,b}, z̃}[(1/m) Σ_{i=1}^m c(h_{S̃_{i,b}}, z̃_i) − c(h_{S̃_{i,b}}, z̃)]| + 6bβ̂ + 3β(b)M.

Now, since the points z̃ and z̃_i are independent and since the distribution is stationary, they have the same distribution and we can replace z̃_i with z̃ in the empirical cost. Thus, we can write

|E_S[Φ(S)]| ≤ |E_{S̃_{i,b}, z̃}[(1/m) Σ_{i=1}^m c(h_{S̃^i_{i,b}}, z̃) − c(h_{S̃_{i,b}}, z̃)]| + 6bβ̂ + 3β(b)M ≤ 2β̂ + 6bβ̂ + 3β(b)M,

where S̃^i_{i,b} is the sequence derived from S̃_{i,b} by replacing z̃_i with z̃. The last inequality holds by β̂-stability of the learning algorithm. The other side of the inequality in the statement of the lemma can be shown following the same steps.

3.3 ϕ-Mixing Concentration Bound

We are now prepared to make use of a concentration inequality to provide a generalization bound in the ϕ-mixing scenario. Several concentration inequalities have been shown in the ϕ-mixing case, for example, Marton (1998), Samson (2000), Chazottes et al. (2007) and Kontorovich and Ramanan (2008). We will use that of Kontorovich and Ramanan (2008), which is very similar to that of Chazottes et al. (2007), modulo the fact that the latter requires a finite sample space.
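The concentration bound used in the next subsection involves the factor 1 + 2 Σ_{k=1}^m ϕ(k), which stays bounded precisely when the mixing coefficients are summable. The short sketch below is only a numerical illustration (the coefficient ϕ_0 and the decay exponents are illustrative assumptions): for algebraic mixing with r > 1 or for exponential mixing the factor is essentially constant in m, while for slowly decaying coefficients it grows with the sample size.

```python
import numpy as np

def phi_sum_factor(m, phi):
    """The factor 1 + 2*sum_{k=1}^{m} phi(k) appearing in the phi-mixing concentration bound."""
    return 1.0 + 2.0 * sum(phi(k) for k in range(1, m + 1))

phi0 = 0.5
for m in [100, 1000, 10000]:
    alg_fast = phi_sum_factor(m, lambda k: phi0 * k ** -2.0)   # algebraic, r = 2 > 1: bounded
    alg_slow = phi_sum_factor(m, lambda k: phi0 * k ** -0.5)   # algebraic, r = 1/2: grows with m
    expo = phi_sum_factor(m, lambda k: phi0 * np.exp(-k))      # exponential mixing: bounded
    print(f"m={m:6d}   r=2: {alg_fast:6.2f}   r=0.5: {alg_slow:8.1f}   exp: {expo:6.2f}")
```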

The following is a concentration inequality derived from that of Kontorovich and Ramanan (2008).(3)

Theorem 8 Let Φ: Z^m → R be a measurable function that is c-Lipschitz with respect to the Hamming metric for some c > 0, and let Z_1, ..., Z_m be random variables distributed according to a ϕ-mixing distribution. Then, for any ε > 0, the following inequality holds:

Pr[|Φ(Z_1, ..., Z_m) − E[Φ(Z_1, ..., Z_m)]| ≥ ε] ≤ 2 exp( −2ε² / ( c² m (1 + 2 Σ_{k=1}^m ϕ(k))² ) ).

It should be pointed out that the statement of the theorem in this paper is improved by a factor of 4 in the exponent with respect to that of Kontorovich and Ramanan (2008, Theorem 1.1). This can be achieved straightforwardly by following the same steps as in the proof of Kontorovich and Ramanan (2008), but making use of the following general form of McDiarmid's inequality (Theorem 9) instead of Azuma's inequality. In particular, Theorem 5.1 of Kontorovich and Ramanan (2008) shows that for a ϕ-mixing distribution and a 1-Lipschitz function, the constants c_i of Theorem 9 can be bounded as follows:

∀i, c_i ≤ 1 + 2 Σ_{k=1}^m ϕ(k).

Theorem 9 (McDiarmid, 1989, 6.10) Let Z_1, ..., Z_m be arbitrary random variables taking values in Z and let Φ: Z^m → R be a measurable function satisfying, for all z_i, z_i' ∈ Z, i = 1, ..., m, the following inequalities:

|E[Φ(Z_1, ..., Z_m) | Z_1 = z_1, ..., Z_i = z_i] − E[Φ(Z_1, ..., Z_m) | Z_1 = z_1, ..., Z_i = z_i']| ≤ c_i,

where c_i > 0, i = 1, ..., m, are constants. Then, for any ε > 0, the following inequality holds:

Pr[|Φ(Z_1, ..., Z_m) − E[Φ(Z_1, ..., Z_m)]| ≥ ε] ≤ 2 exp( −2ε² / Σ_{i=1}^m c_i² ).

In the i.i.d. case, McDiarmid's theorem can be restated in the following simpler form that we shall use in Section 4.

Theorem 10 (McDiarmid, i.i.d. scenario) Let Z_1, ..., Z_m be independent random variables taking values in Z and let Φ: Z^m → R be a measurable function satisfying, for all z_i, z_i' ∈ Z, i = 1, ..., m, the following inequalities:

|Φ(z_1, ..., z_i, ..., z_m) − Φ(z_1, ..., z_i', ..., z_m)| ≤ c_i,

where c_i > 0, i = 1, ..., m, are constants. Then, for any ε > 0, the following inequality holds:

Pr[|Φ(Z_1, ..., Z_m) − E[Φ(Z_1, ..., Z_m)]| ≥ ε] ≤ 2 exp( −2ε² / Σ_{i=1}^m c_i² ).

3. We should note that the original bound is expressed in terms of η-mixing coefficients. To simplify the presentation, we are adapting it to the case of stationary ϕ-mixing sequences by using the following straightforward inequality for a stationary process: η_{ij} ≤ 2ϕ(j − i). Furthermore, the bound presented in Kontorovich and Ramanan (2008) holds when the sample space is countable; it is extended to the continuous case in Kontorovich (2007).
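A quick sanity check, not from the paper, of the i.i.d. form of McDiarmid's inequality (Theorem 10): for the empirical mean of m i.i.d. Uniform[0, 1] variables, changing one coordinate moves the mean by at most c_i = 1/m, so the deviation probability is at most 2 exp(−2mε²). The choice of distribution, m, and ε below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, trials = 200, 0.05, 200000
means = rng.random((trials, m)).mean(axis=1)          # empirical means of m Uniform[0,1] draws
empirical = np.mean(np.abs(means - 0.5) >= eps)       # observed deviation frequency
bound = 2 * np.exp(-2 * m * eps**2)                   # McDiarmid bound with c_i = 1/m
print(f"empirical deviation probability: {empirical:.4f}   McDiarmid bound: {bound:.4f}")
```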

3.4 ϕ-Mixing Generalization Bounds

This section presents several theorems that constitute the main results of this paper in the ϕ-mixing case. The following theorem is constructed from the bounds shown in the previous three sections.

Theorem 11 (General Non-i.i.d. Stability Bound) Let h_S denote the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from a ϕ-mixing stationary distribution, and let c be a measurable non-negative cost function upper bounded by M > 0. Then, for any b ∈ {0, ..., m} and any ε > 0, the following generalization bound holds:

Pr_S[ |R(h_S) − R̂(h_S)| > ε + (6b + 2)β̂ + 6Mϕ(b) ] ≤ 2 exp( −2ε² / ( m (1 + 2 Σ_{i=1}^m ϕ(i))² ((b + 2)2β̂ + 2Mϕ(b) + M/m)² ) ).

Proof The theorem follows directly from the application of Lemma 6 and Lemma 7 to Theorem 8.

The theorem gives a general stability bound for ϕ-mixing stationary sequences. If we further assume that the sequence is algebraically ϕ-mixing, that is, ϕ(k) = ϕ_0 k^{−r} for all k, for some r > 1, then we can solve for the value of b that optimizes the bound.

Theorem 12 (Non-i.i.d. Stability Bound for Algebraically Mixing Sequences) Let h_S denote the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution, ϕ(k) = ϕ_0 k^{−r} with r > 1, and let c be a measurable non-negative cost function upper bounded by M > 0. Then, for any ε > 0, the following generalization bound holds:

Pr_S[ |R(h_S) − R̂(h_S)| > ε + 8β̂ + 6(r + 1)Mϕ(b*) ] ≤ 2 exp( −2ε² / ( m (1 + 2ϕ_0 r/(r − 1))² (6β̂ + 2(r + 1)Mϕ(b*) + M/m)² ) ),

where b* = (β̂ / (rϕ_0 M))^{−1/(r+1)}.

Proof For an algebraically mixing sequence, the value of b minimizing the bound of Theorem 11 satisfies the equation β̂ b* = r M ϕ(b*). Since b must be an integer, we use the approximation b = ⌈b*⌉, with b* = (β̂ / (rϕ_0 M))^{−1/(r+1)}, when applying Theorem 11. However, observing the inequalities ϕ(b*) ≥ ϕ(b) and b* + 1 ≥ b allows us to write the statement of Theorem 12 in terms of the fractional choice b*. The term in the numerator can be bounded as

1 + 2 Σ_{i=1}^m ϕ(i) = 1 + 2 Σ_{i=1}^m ϕ_0 i^{−r} ≤ 1 + 2ϕ_0 (1 + ∫_1^m x^{−r} dx) = 1 + 2ϕ_0 (1 + (1 − m^{1−r})/(r − 1)).

Using the assumption r > 1, we can upper bound (1 − m^{1−r})/(r − 1) by 1/(r − 1) and obtain

1 + 2ϕ_0 (1 + (1 − m^{1−r})/(r − 1)) ≤ 1 + 2ϕ_0 (1 + 1/(r − 1)) = 1 + 2ϕ_0 r/(r − 1).

Plugging in this value and the minimizing value of b in the bound of Theorem 11 yields the statement of the theorem.

In the case of a zero mixing coefficient (ϕ = 0 and b = 0), the bounds of Theorem 11 coincide with the i.i.d. stability bound of Bousquet and Elisseeff (2002). In the general case, in order for the right-hand side of these bounds to converge, we must have β̂ = o(1/√m) and ϕ(b) = o(1/√m). The first condition holds for several families of algorithms with β̂ = O(1/m) (Bousquet and Elisseeff, 2002). In the case of algebraically mixing sequences with r > 1, as assumed in Theorem 12, β̂ = O(1/m) implies ϕ(b*) = ϕ_0 (β̂/(rϕ_0 M))^{r/(r+1)} < O(1/√m). More specifically, for the scenario of algebraic mixing with 1/m-stability, the following bound holds with probability at least 1 − δ:

R(h_S) − R̂(h_S) ≤ O( √( log(1/δ) / m^{(r−1)/(r+1)} ) ).

This is obtained by setting the right-hand side of Theorem 12 equal to δ and solving for ε. Furthermore, if we choose ε = C √( log(m) / m^{(r−1)/(r+1)} ) for a large enough constant C > 0, the right-hand side of Theorem 12 is summable over m and thus, by the Borel-Cantelli lemma, the following inequality holds almost surely:

R(h_S) − R̂(h_S) ≤ O( √( log(m) / m^{(r−1)/(r+1)} ) ).

Similar bounds can be given for the exponential mixing setting (ϕ(k) = ϕ_0 exp(−ϕ_1 k^r)). If we choose b polylogarithmic in m, large enough that ϕ(b) = O(1/m), and assume β̂ = O(1/m), then, with probability at least 1 − δ,

R(h_S) − R̂(h_S) ≤ O( √( log(1/δ) log²(m) / m ) ).

If we instead set ε = C √( log³(m) / m ) for a large enough constant C, the right-hand side of Theorem 12 is summable and again, by the Borel-Cantelli lemma, we have almost surely

R(h_S) − R̂(h_S) ≤ O( √( log³(m) / m ) ).

3.5 Applications

We now present the application of our stability bounds for algebraically ϕ-mixing sequences to several algorithms, including the family of kernel-based regularization algorithms and that of relative entropy-based regularization algorithms. The application of our learning bounds will benefit from the previous analysis of the stability of these algorithms by Bousquet and Elisseeff (2002).
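Before turning to these specific algorithms, the shape of the algebraic-mixing bound can be made concrete numerically. The sketch below is not from the paper: it evaluates the quantities of Theorem 12, as reconstructed above, for the 1/m-stability scenario just discussed; the function name, the default values of r, ϕ_0, M and δ, and the choice β̂ = 1/m are illustrative assumptions.

```python
import numpy as np

def algebraic_phi_bound(m, delta, r, phi0=1.0, M=1.0, beta_hat=None):
    """Sketch of the deviation guaranteed by Theorem 12 for an algebraically phi-mixing
    process (phi(k) = phi0 * k**-r, r > 1) and a beta_hat-stable algorithm."""
    if beta_hat is None:
        beta_hat = 1.0 / m                      # the 1/m-stability scenario
    b_star = (beta_hat / (r * phi0 * M)) ** (-1.0 / (r + 1))
    phi_b = phi0 * b_star ** (-r)
    shift = 8 * beta_hat + 6 * (r + 1) * M * phi_b
    lipschitz = 6 * beta_hat + 2 * (r + 1) * M * phi_b + M / m
    norm = 1 + 2 * phi0 * r / (r - 1)
    # invert 2*exp(-2*eps^2 / (m * norm^2 * lipschitz^2)) = delta for eps
    eps = norm * lipschitz * np.sqrt(m * np.log(2 / delta) / 2)
    return shift + eps

for m in [10**3, 10**4, 10**5, 10**6]:
    print(f"m={m:8d}   R - R_hat <~ {algebraic_phi_bound(m, delta=0.05, r=4.0):.3f}")
```

The printed values decrease roughly like m^{-(r-1)/(2(r+1))}, matching the rate discussed above; with slower mixing (smaller r) the constants make the bound informative only for much larger samples.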

3.5.1 Kernel-Based Regularization Algorithms

We first apply our bounds to a family of algorithms minimizing a regularized objective function based on the norm ‖·‖_K in a reproducing kernel Hilbert space, where K is a positive definite symmetric kernel:

argmin_{h ∈ H} (1/m) Σ_{i=1}^m c(h, z_i) + λ ‖h‖²_K.    (7)

The application of our bound is possible, under some general conditions, since kernel regularized algorithms are stable with β̂ = O(1/m) (Bousquet and Elisseeff, 2002). For the sake of completeness, we briefly present the proof of this β̂-stability. We will assume that the cost function c is σ-admissible, that is, there exists σ ∈ R_+ such that for any two hypotheses h, h' ∈ H and for all z = (x, y) ∈ X × Y,

|c(h, z) − c(h', z)| ≤ σ |h(x) − h'(x)|.

This assumption holds for the quadratic cost and most other cost functions when the hypothesis set and the set of output labels are bounded by some M ∈ R_+: ∀h ∈ H, ∀x ∈ X, |h(x)| ≤ M and ∀y ∈ Y, |y| ≤ M. We will also assume that c is differentiable. This assumption is in fact not necessary and all of our results hold without it, but it makes the presentation simpler.

We denote by B_F the Bregman divergence associated to a convex function F:

B_F(f ‖ g) = F(f) − F(g) − ⟨f − g, ∇F(g)⟩.

In what follows, it will be helpful to define F as the objective function of a general regularization-based algorithm,

F_S(h) = R̂_S(h) + λ N(h),

where R̂_S is the empirical error as measured on the sample S, N: H → R_+ is a regularization function and λ > 0 is the familiar trade-off parameter. Finally, we shall use the shorthand Δh = h' − h.

Lemma 13 (Bousquet and Elisseeff, 2002) A kernel-based regularization algorithm of the form (7), with bounded kernel K(x, x) ≤ κ² < ∞ and σ-admissible cost function, is β̂-stable with coefficient

β̂ ≤ σ²κ² / (λm).

Proof Let h and h' be the minimizers of F_S and F_{S'} respectively, where S and S' differ in the first coordinate (the choice of coordinate is without loss of generality). Then,

B_N(h' ‖ h) + B_N(h ‖ h') ≤ (2σ / (λm)) sup_{x ∈ S ∪ S'} |Δh(x)|.    (8)

To see this, we notice that, since B_F = B_{R̂} + λ B_N and since a Bregman divergence is non-negative,

λ (B_N(h' ‖ h) + B_N(h ‖ h')) ≤ B_{F_S}(h' ‖ h) + B_{F_{S'}}(h ‖ h').

By the definition of h and h' as the minimizers of F_S and F_{S'},

B_{F_S}(h' ‖ h) + B_{F_{S'}}(h ‖ h') = R̂_S(h') − R̂_S(h) + R̂_{S'}(h) − R̂_{S'}(h').

Finally, by the σ-admissibility of the cost function c and the definition of S and S',

λ (B_N(h' ‖ h) + B_N(h ‖ h')) ≤ R̂_S(h') − R̂_S(h) + R̂_{S'}(h) − R̂_{S'}(h')
= (1/m) (c(h', z_1) − c(h, z_1) + c(h, z_1') − c(h', z_1'))
≤ (σ/m) |Δh(x_1)| + (σ/m) |Δh(x_1')|
≤ (2σ/m) sup_{x ∈ S ∪ S'} |Δh(x)|,

which establishes (8). Now, if we consider N(·) = ‖·‖²_K, we have B_N(h' ‖ h) = ‖h' − h‖²_K; thus B_N(h' ‖ h) + B_N(h ‖ h') = 2‖Δh‖²_K and, by (8) and the reproducing kernel property,

2‖Δh‖²_K ≤ (2σ / (λm)) sup_{x ∈ S ∪ S'} |Δh(x)| ≤ (2σκ / (λm)) ‖Δh‖_K.

Thus ‖Δh‖_K ≤ σκ / (λm). Using the σ-admissibility of c and the kernel reproducing property, we obtain

∀z ∈ X × Y, |c(h', z) − c(h, z)| ≤ σ |Δh(x)| ≤ κσ ‖Δh‖_K.

Therefore,

∀z ∈ X × Y, |c(h', z) − c(h, z)| ≤ σ²κ² / (λm),

which completes the proof.

Three specific instances of kernel regularization algorithms are: SVR, for which the cost function is based on the ε-insensitive cost,

c(h, z) = 0 if |h(x) − y| ≤ ε, and |h(x) − y| − ε otherwise;

Kernel Ridge Regression (Saunders et al., 1998), for which c(h, z) = (h(x) − y)²; and finally Support Vector Machines with the hinge loss,

c(h, z) = 0 if y h(x) ≥ 1, and 1 − y h(x) if y h(x) < 1.

For kernel regularization algorithms, as pointed out in Bousquet and Elisseeff (2002, Lemma 23), a bound on the labels immediately implies a bound on the output of the hypothesis returned by the algorithm. We formally state this lemma below.

Lemma 14 Let h* be the solution of the optimization problem (7), let c be a cost function and let B(·) be a real-valued function such that for all h ∈ H, x ∈ X, and y' ∈ Y,

c(h(x), y') ≤ B(h(x)).

Then, the output of h* is bounded as follows:

∀x ∈ X, |h*(x)| ≤ κ √(B(0)/λ),

where λ is the regularization parameter and κ² ≥ K(x, x) for all x ∈ X.

Proof Let F(h) = (1/m) Σ_{i=1}^m c(h, z_i) + λ‖h‖²_K and let 0 denote the zero hypothesis. Then, by definition of F and h*,

λ ‖h*‖²_K ≤ F(h*) ≤ F(0) ≤ B(0).

Then, using the reproducing kernel property and the Cauchy-Schwarz inequality, we note that

∀x ∈ X, |h*(x)| = |⟨h*, K(x, ·)⟩| ≤ ‖h*‖_K √(K(x, x)) ≤ κ ‖h*‖_K.

Combining the two inequalities proves the lemma.

We note that in Bousquet and Elisseeff (2002), the following bound is also stated: c(h*(x), y') ≤ B(κ √(B(0)/λ)). However, when later applied, it seems that the authors use an incorrect upper bound function B(·), which we remedy in the following.

Corollary 15 Assume a bounded output Y = [0, B], for some B > 0, and assume that K(x, x) ≤ κ² for all x, for some κ > 0. Let h_S denote the hypothesis returned by the algorithm when trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution. Let u = r/(r + 1) ∈ (1/2, 1), M' = 2(r + 1)ϕ_0 M / (rϕ_0 M)^u, and ϕ_0' = (1 + 2ϕ_0 r/(r − 1)). Then, with probability at least 1 − δ, the following generalization bounds hold for

a. Support Vector Machines (SVM, with hinge loss):

R(h_S) ≤ R̂(h_S) + 8κ²/(λm) + 3M'(2κ²/(λm))^u + ϕ_0' (M + 3κ²/λ + M'(2κ²/λ)^u) √( 2 log(2/δ) / m^{2u−1} ),

where M = κ/√λ + B.

b. Support Vector Regression (SVR):

R(h_S) ≤ R̂(h_S) + 8κ²/(λm) + 3M'(2κ²/(λm))^u + ϕ_0' (M + 3κ²/λ + M'(2κ²/λ)^u) √( 2 log(2/δ) / m^{2u−1} ),

where M = κ √(B/λ) + B.

c. Kernel Ridge Regression (KRR):

R(h_S) ≤ R̂(h_S) + 32κ²B²/(λm) + 3M'(8κ²B²/(λm))^u + ϕ_0' (M + 12κ²B²/λ + M'(8κ²B²/λ)^u) √( 2 log(2/δ) / m^{2u−1} ),

where M = 2κ²B²/λ + B².

Proof For SVM, the hinge loss is 1-admissible, giving β̂ ≤ κ²/(λm). Using Lemma 14 with B(0) = 1, the loss can be bounded as follows: for all x ∈ X and y ∈ Y, |1 − y h*(x)| ≤ 1 + |h*(x)| ≤ κ/√λ + B.

Similarly, SVR has a loss function that is 1-admissible; thus, applying Lemma 13 gives β̂ ≤ κ²/(λm). Using Lemma 14 with B(0) = B, we can bound the loss as follows: for all x ∈ X and y ∈ Y, |h*(x) − y| ≤ κ √(B/λ) + B.

Finally, for KRR, we have a loss function that is 2B-admissible and, again using Lemma 13, β̂ ≤ 4κ²B²/(λm). Applying Lemma 14 with B(0) = B² gives, for all x ∈ X and y ∈ Y, (h*(x) − y)² ≤ 2κ²B²/λ + B².

Plugging these values into the bound of Theorem 12 and setting the right-hand side to δ yields the statement of the corollary.

3.5.2 Relative Entropy-Based Regularization Algorithms

In this section, we apply the results of Theorem 12 to a family of learning algorithms based on relative entropy regularization. These algorithms learn hypotheses h that are mixtures of base hypotheses in {h_θ : θ ∈ Θ}, where Θ is a measurable set. The output of these algorithms is a mixture g: Θ → R, that is, a distribution over Θ. Let G denote the set of all such distributions and let g_0 ∈ G be a fixed distribution. Relative entropy-based regularization algorithms output the solution of a minimization problem of the following form:

argmin_{g ∈ G} (1/m) Σ_{i=1}^m c(g, z_i) + λ D(g ‖ g_0),    (9)

where the cost function c: G × Z → R is defined in terms of a second internal cost function c': H × Z → R:

c(g, z) = ∫_Θ c'(h_θ, z) g(θ) dθ,

and where D(g ‖ g_0) is the relative entropy between g and g_0:

D(g ‖ g_0) = ∫_Θ g(θ) log( g(θ) / g_0(θ) ) dθ.

As shown by Bousquet and Elisseeff (2002, Theorem 24), a relative entropy-based regularization algorithm defined by (9) with bounded internal loss, c'(·) ≤ M, is β̂-stable with the following bound on the stability coefficient:

β̂ ≤ M² / (λm).

Theorem 12 combined with this inequality immediately yields the following generalization bound.

Corollary 16 Let h_S be the hypothesis solution of the optimization (9) trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution, with the internal cost function c' bounded by M. Then, with probability at least 1 − δ, the following holds:

R(h_S) ≤ R̂(h_S) + 8M²/(λm) + 3M'/(λm)^u + ϕ_0' (M + 3M²/λ + 2^u M'/λ^u) √( 2 log(2/δ) / m^{2u−1} ),

where u = r/(r + 1) ∈ (1/2, 1), M' = 2(r + 1)ϕ_0 M^{u+1} / (rϕ_0)^u, and ϕ_0' = (1 + 2ϕ_0 r/(r − 1)).

3.6 Discussion

The results presented here are, to the best of our knowledge, the first stability-based generalization bounds for the class of algorithms just studied in a non-i.i.d. scenario. These bounds are non-trivial when the condition λ ≫ 1/m^{1/2−1/r} on the regularization parameter holds for all large values of m. This condition coincides with the one obtained in the i.i.d. setting by Bousquet and Elisseeff (2002) in the limit, as r tends to infinity. The next section gives stability-based generalization bounds that hold even in the scenario of β-mixing sequences.

4. β-Mixing Generalization Bounds

In this section, we prove a stability-based generalization bound that only requires the training sequence to be drawn from a stationary β-mixing distribution. The bound is thus more general and covers the ϕ-mixing case analyzed in the previous section. However, unlike the ϕ-mixing case, the β-mixing bound presented here is not a purely exponential bound. It contains an additive term, which depends on the mixing coefficient.

As in the previous section, Φ(S) is defined by Φ(S) = R(h_S) − R̂(h_S). To simplify the presentation, here we define the generalization error of h_S by R(h_S) = E_z[c(h_S, z)]. Thus, test samples are assumed independent of S.(4) Note that for any block of points Z = z_1 ... z_k drawn independently of S, the following equality holds:

E_Z[(1/k) Σ_{z ∈ Z} c(h_S, z)] = (1/k) Σ_{i=1}^k E_Z[c(h_S, z_i)] = (1/k) Σ_{i=1}^k E_{z_i}[c(h_S, z_i)] = E_z[c(h_S, z)],

since, by stationarity, E_{z_i}[c(h_S, z_i)] = E_{z_j}[c(h_S, z_j)] for all 1 ≤ i, j ≤ k. Thus, for any such block Z, we can write R(h_S) = E_Z[(1/k) Σ_{z ∈ Z} c(h_S, z)]. For convenience, we extend the cost function c to blocks as follows:

c(h, Z) = (1/|Z|) Σ_{z ∈ Z} c(h, z).

With this notation, R(h_S) = E_Z[c(h_S, Z)] for any block drawn independently of S, regardless of the size of Z.

To derive a generalization bound for the β-mixing scenario, we apply McDiarmid's inequality (Theorem 10) to Φ defined over a sequence of independent blocks. The independent blocks we consider are non-symmetric and thus more general than those considered by previous authors (Yu, 1994; Meir, 2000).

4. In the β-mixing scenario, a result similar to that of Lemma 5 can be shown to hold in expectation with respect to the sample S. Using Markov's inequality, the inequality can be shown to hold with high probability. Thus, the results that follow can all be extended to the case where the test points depend on the training sample, at the expense of an additional confidence term.

Figure 3: Illustration of the sequences S_a and S_b derived from S that are considered in the proofs. The darkened regions are considered as being removed from the sequence.

From a sample S made of a sequence of m points, we construct two sequences of blocks, S_a and S_b, each containing µ blocks. Each block in S_a contains a points and each block in S_b contains b points (see Figure 3). S_a and S_b form a partitioning of S; for any a, b ∈ {0, ..., m} such that (a + b)µ = m, they are defined precisely as follows:

S_a = (Z_1^(a), ..., Z_µ^(a)), with Z_i^(a) = (z_{(i−1)(a+b)+1}, ..., z_{(i−1)(a+b)+a}),
S_b = (Z_1^(b), ..., Z_µ^(b)), with Z_i^(b) = (z_{(i−1)(a+b)+a+1}, ..., z_{(i−1)(a+b)+a+b}),

for all i ∈ {1, ..., µ}. We shall similarly consider sequences of i.i.d. blocks Z̃_i^(a) and Z̃_i^(b), i ∈ {1, ..., µ}, such that the points within each block are drawn according to the same original β-mixing distribution, and shall denote by S̃_a the block sequence (Z̃_1^(a), ..., Z̃_µ^(a)).

In preparation for the application of McDiarmid's inequality, we give a bound on the expectation of Φ(S̃_a). Since the expectation is taken over a sequence of i.i.d. blocks, this brings us to a situation similar to the i.i.d. scenario analyzed by Bousquet and Elisseeff (2002), with the exception that we are dealing with i.i.d. blocks instead of i.i.d. points.

Lemma 17 Let S̃_a be an independent block sequence as defined above. Then, the following bound holds for the expectation of Φ(S̃_a):

E_{S̃_a}[Φ(S̃_a)] ≤ 2a β̂.

Proof Since the blocks Z̃_i^(a) are independent, we can replace any one of them with any other block Z̃ drawn from the same distribution. However, changing the training set also changes the hypothesis, in a limited way. This is shown precisely below:

E_{S̃_a}[Φ(S̃_a)] = E_{S̃_a}[ (1/µ) Σ_{i=1}^µ c(h_{S̃_a}, Z̃_i^(a)) − E_Z[c(h_{S̃_a}, Z)] ]
= E_{S̃_a, Z̃}[ (1/µ) Σ_{i=1}^µ ( c(h_{S̃_a}, Z̃_i^(a)) − c(h_{S̃_a}, Z̃) ) ]
= E_{S̃_a, Z̃}[ (1/µ) Σ_{i=1}^µ ( c(h_{S̃_a^i}, Z̃) − c(h_{S̃_a}, Z̃) ) ],

where S̃_a^i corresponds to the block sequence S̃_a obtained by replacing the ith block with Z̃. The β̂-stability of the learning algorithm gives

E_{S̃_a, Z̃}[ (1/µ) Σ_{i=1}^µ ( c(h_{S̃_a^i}, Z̃) − c(h_{S̃_a}, Z̃) ) ] ≤ E_{S̃_a, Z̃}[ (1/µ) Σ_{i=1}^µ 2a β̂ ] ≤ 2a β̂.

We now relate the non-i.i.d. event Φ(S) ≥ ε to an independent block sequence event to which we can apply McDiarmid's inequality.

Lemma 18 Assume a β̂-stable algorithm. Then, for a sample S drawn from a β-mixing stationary distribution, the following bound holds:

Pr_S[ Φ(S) ≥ ε ] ≤ Pr_{S̃_a}[ Φ(S̃_a) − E[Φ(S̃_a)] ≥ ε_0 ] + (µ − 1) β(b),

where ε_0 = ε − µbM/m − 2µbβ̂ − E_{S̃_a}[Φ(S̃_a)].

Proof The proof consists of first rewriting the event in terms of S_a and S_b and bounding the error on the points in S_b in a trivial manner. This can be afforded since b will eventually be chosen to be small. Since |E_Z[c(h_S, Z)] − c(h_S, z')| ≤ M for any z' ∈ S_b, we can write

Pr_S[Φ(S) ≥ ε] = Pr_S[ R(h_S) − R̂(h_S) ≥ ε ]
= Pr_S[ E_Z[c(h_S, Z)] − (1/m) Σ_{z ∈ S} c(h_S, z) ≥ ε ]
≤ Pr_S[ (µa/m) E_Z[c(h_S, Z)] − (1/m) Σ_{z ∈ S_a} c(h_S, z) + (1/m) Σ_{z' ∈ S_b} |E_{Z'}[c(h_S, Z')] − c(h_S, z')| ≥ ε ]
≤ Pr_S[ (µa/m) E_Z[c(h_S, Z)] − (1/m) Σ_{z ∈ S_a} c(h_S, z) + µbM/m ≥ ε ].

By β̂-stability and µa ≤ m, this last term can be bounded as follows:

Pr_S[ (µa/m) E_Z[c(h_S, Z)] − (1/m) Σ_{z ∈ S_a} c(h_S, z) + µbM/m ≥ ε ]
≤ Pr_{S_a}[ E_Z[c(h_{S_a}, Z)] − (1/(µa)) Σ_{z ∈ S_a} c(h_{S_a}, z) + µbM/m + 2µbβ̂ ≥ ε ].

The right-hand side can be rewritten in terms of Φ and bounded in terms of a β-mixing coefficient:

Pr_{S_a}[ E_Z[c(h_{S_a}, Z)] − (1/(µa)) Σ_{z ∈ S_a} c(h_{S_a}, z) + µbM/m + 2µbβ̂ ≥ ε ]
= Pr_{S_a}[ Φ(S_a) + µbM/m + 2µbβ̂ ≥ ε ]
≤ Pr_{S̃_a}[ Φ(S̃_a) + µbM/m + 2µbβ̂ ≥ ε ] + (µ − 1) β(b),

by applying Lemma 3 to the indicator function of the event {Φ(S_a) + µbM/m + 2µbβ̂ ≥ ε}. Since E_{S̃_a}[Φ(S̃_a)] is a constant, the probability in this last term can be rewritten as

Pr_{S̃_a}[ Φ(S̃_a) + µbM/m + 2µbβ̂ ≥ ε ] = Pr_{S̃_a}[ Φ(S̃_a) − E_{S̃_a}[Φ(S̃_a)] ≥ ε − µbM/m − 2µbβ̂ − E_{S̃_a}[Φ(S̃_a)] ] = Pr_{S̃_a}[ Φ(S̃_a) − E_{S̃_a}[Φ(S̃_a)] ≥ ε_0 ],

which ends the proof of the lemma.

The last two lemmas will help us prove the main result of this section, formulated in the following theorem.

Theorem 19 Assume a β̂-stable algorithm and let ε' denote ε − µbM/m − 2µbβ̂ − 2aβ̂, as in Lemma 18. Then, for any sample S of size m drawn according to a β-mixing stationary distribution, any choice of the parameters a, b, µ > 0 such that (a + b)µ = m, and any ε > 0 such that ε' ≥ 0, the following generalization bound holds:

Pr_S[ R(h_S) − R̂(h_S) ≥ ε ] ≤ exp( −2ε'² / ( µ (4aβ̂ + (a + b)M/m)² ) ) + (µ − 1) β(b).

Proof To prove the statement of the theorem, it suffices to bound the probability term appearing in the right-hand side of the bound of Lemma 18, Pr_{S̃_a}[Φ(S̃_a) − E[Φ(S̃_a)] ≥ ε_0], which is expressed only in terms of independent blocks. We can therefore apply McDiarmid's inequality by viewing the blocks as i.i.d. points. To do so, we must bound the quantity |Φ(S̃_a) − Φ(S̃_a^i)|, where the sequences S̃_a and S̃_a^i differ in the ith block. We will bound separately the difference between the generalization errors and the difference between the empirical errors.(5) The difference in empirical errors can be bounded as follows, using the bound on the cost function c:

|R̂(h_{S̃_a}) − R̂(h_{S̃_a^i})| ≤ (1/µ) Σ_{j ≠ i} |c(h_{S̃_a}, Z_j) − c(h_{S̃_a^i}, Z_j)| + (1/µ) |c(h_{S̃_a}, Z_i) − c(h_{S̃_a^i}, Z_i')| ≤ 2aβ̂ + M/µ = 2aβ̂ + (a + b)M/m.

The difference in generalization errors can be straightforwardly bounded using β̂-stability:

|R(h_{S̃_a}) − R(h_{S̃_a^i})| = |E_Z[c(h_{S̃_a}, Z)] − E_Z[c(h_{S̃_a^i}, Z)]| = |E_Z[c(h_{S̃_a}, Z) − c(h_{S̃_a^i}, Z)]| ≤ 2aβ̂.

5. We drop the superscripts on Z_i^(a) since we will not be considering the sequence S_b in what follows.

Using these bounds in conjunction with McDiarmid's inequality yields

Pr_{S̃_a}[ Φ(S̃_a) − E[Φ(S̃_a)] ≥ ε_0 ] ≤ exp( −2ε_0² / ( µ (4aβ̂ + (a + b)M/m)² ) ) ≤ exp( −2ε'² / ( µ (4aβ̂ + (a + b)M/m)² ) ).

Note that to show the second inequality we make use of Lemma 17 to establish the fact that

ε_0 = ε − µbM/m − 2µbβ̂ − E[Φ(S̃_a)] ≥ ε − µbM/m − 2µbβ̂ − 2aβ̂ = ε'.

Finally, we make use of Lemma 18 to establish the proof:

Pr_S[Φ(S) ≥ ε] ≤ Pr_{S̃_a}[ Φ(S̃_a) − E[Φ(S̃_a)] ≥ ε_0 ] + (µ − 1)β(b) ≤ exp( −2ε'² / ( µ (4aβ̂ + (a + b)M/m)² ) ) + (µ − 1)β(b).

This concludes the proof of the theorem.

In order to make use of this bound, we must determine the values of the parameters b and µ (a is then equal to m/µ − b). There is a trade-off between selecting a large enough value for b, to ensure that the mixing term decreases, and choosing a large enough value of µ, to minimize the remaining terms of the bound. The exact choice of parameters will depend on the type of mixing that is assumed (e.g., algebraic or exponential). In order to choose optimal parameters, it will be useful to view the bound as it holds with high probability, in the following corollary.

Corollary 20 Assume a β̂-stable algorithm and let δ' denote δ − (µ − 1)β(b). Then, for any sample S of size m drawn according to a β-mixing stationary distribution, any choice of the parameters a, b, µ > 0 such that (a + b)µ = m, and any δ > 0 such that δ' > 0, the following generalization bound holds with probability at least (1 − δ):

R(h_S) − R̂(h_S) < µb (M/m + 2β̂) + 2aβ̂ + (4aβ̂ + (a + b)M/m) √( µ log(1/δ') / 2 ).

In the case of a fast mixing distribution, it is possible to select the values of the parameters to retrieve a bound as in the i.i.d. case, that is, R(h_S) − R̂(h_S) = O( √( log(1/δ) / m ) ). In particular, for β(b) → 0, we can choose b = 0, a = 1, and µ = m to retrieve the i.i.d. bound of Bousquet and Elisseeff (2001). In the following, we examine slower mixing algebraic β-mixing distributions, which are thus not close to the i.i.d. scenario. For algebraic mixing, the mixing parameter is defined as β(b) = b^{−r}. In that case, we wish to minimize the following function in terms of µ and b:

s(µ, b) = µ b^{−r} + m^{3/2} β̂ (1/µ + 1/√µ) + µb (1/m + β̂).    (10)
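The parameter trade-off just described can also be explored numerically. The sketch below is not from the paper: rather than minimizing the simplified objective s(µ, b) of Equation 10, it directly evaluates the high-probability bound of Corollary 20, as reconstructed above, over a coarse grid of admissible (µ, b) pairs. The choices m = 2^20, β̂ = 1/m, r, δ, and the power-of-two search grid are illustrative assumptions.

```python
import numpy as np

def corollary20_bound(m, mu, b, beta_hat, M=1.0, r=2.0, delta=0.05):
    """Evaluate the high-probability bound of Corollary 20 for an algebraically
    beta-mixing process (beta(b) = b**-r). Returns None for inadmissible parameters
    ((a+b)*mu must equal m, a must be positive, and delta' must stay positive)."""
    if m % mu != 0 or m // mu <= b:
        return None
    a = m // mu - b
    delta_p = delta - (mu - 1) * b ** (-r)
    if delta_p <= 0:
        return None
    dev = (4 * a * beta_hat + (a + b) * M / m) * np.sqrt(mu * np.log(1 / delta_p) / 2)
    return mu * b * (M / m + 2 * beta_hat) + 2 * a * beta_hat + dev

m = 2 ** 20
beta_hat = 1.0 / m
best = None
for mu in [2 ** k for k in range(1, 21)]:                 # block counts dividing m
    for b in [2 ** j for j in range(0, 20) if 2 ** j < m // mu]:
        val = corollary20_bound(m, mu, b, beta_hat)
        if val is not None and (best is None or val < best[0]):
            best = (val, mu, b)

print(f"best bound {best[0]:.4f} at mu={best[1]}, b={best[2]}")
```

The search exhibits exactly the tension described above: increasing b shrinks the mixing penalty (µ − 1) b^{−r} but inflates the µb terms, while increasing µ shrinks the block size a but inflates both the mixing penalty and the √µ deviation factor.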


More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

An EGZ generalization for 5 colors

An EGZ generalization for 5 colors An EGZ generalization for 5 colors David Grynkiewicz and Andrew Schultz July 6, 00 Abstract Let g zs(, k) (g zs(, k + 1)) be the inial integer such that any coloring of the integers fro U U k 1,..., g

More information

PAC-Bayes Analysis Of Maximum Entropy Learning

PAC-Bayes Analysis Of Maximum Entropy Learning PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E

More information

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

Structured Prediction Theory Based on Factor Graph Complexity

Structured Prediction Theory Based on Factor Graph Complexity Structured Prediction Theory Based on Factor Graph Coplexity Corinna Cortes Google Research New York, NY 00 corinna@googleco Mehryar Mohri Courant Institute and Google New York, NY 00 ohri@cisnyuedu Vitaly

More information

Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

The degree of a typical vertex in generalized random intersection graph models

The degree of a typical vertex in generalized random intersection graph models Discrete Matheatics 306 006 15 165 www.elsevier.co/locate/disc The degree of a typical vertex in generalized rando intersection graph odels Jerzy Jaworski a, Michał Karoński a, Dudley Stark b a Departent

More information

Multi-Dimensional Hegselmann-Krause Dynamics

Multi-Dimensional Hegselmann-Krause Dynamics Multi-Diensional Hegselann-Krause Dynaics A. Nedić Industrial and Enterprise Systes Engineering Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu B. Touri Coordinated Science Laboratory

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint.

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint. 59 6. ABSTRACT MEASURE THEORY Having developed the Lebesgue integral with respect to the general easures, we now have a general concept with few specific exaples to actually test it on. Indeed, so far

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

VC Dimension and Sauer s Lemma

VC Dimension and Sauer s Lemma CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany. New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

Accuracy at the Top. Abstract

Accuracy at the Top. Abstract Accuracy at the Top Stephen Boyd Stanford University Packard 264 Stanford, CA 94305 boyd@stanford.edu Mehryar Mohri Courant Institute and Google 25 Mercer Street New York, NY 002 ohri@cis.nyu.edu Corinna

More information

LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH

LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH GUILLERMO REY. Introduction If an operator T is bounded on two Lebesgue spaces, the theory of coplex interpolation allows us to deduce the boundedness

More information

THE CONSTRUCTION OF GOOD EXTENSIBLE RANK-1 LATTICES. 1. Introduction We are interested in approximating a high dimensional integral [0,1]

THE CONSTRUCTION OF GOOD EXTENSIBLE RANK-1 LATTICES. 1. Introduction We are interested in approximating a high dimensional integral [0,1] MATHEMATICS OF COMPUTATION Volue 00, Nuber 0, Pages 000 000 S 0025-578(XX)0000-0 THE CONSTRUCTION OF GOOD EXTENSIBLE RANK- LATTICES JOSEF DICK, FRIEDRICH PILLICHSHAMMER, AND BENJAMIN J. WATERHOUSE Abstract.

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Note on generating all subsets of a finite set with disjoint unions

Note on generating all subsets of a finite set with disjoint unions Note on generating all subsets of a finite set with disjoint unions David Ellis e-ail: dce27@ca.ac.uk Subitted: Dec 2, 2008; Accepted: May 12, 2009; Published: May 20, 2009 Matheatics Subject Classification:

More information

Introduction to Discrete Optimization

Introduction to Discrete Optimization Prof. Friedrich Eisenbrand Martin Nieeier Due Date: March 9 9 Discussions: March 9 Introduction to Discrete Optiization Spring 9 s Exercise Consider a school district with I neighborhoods J schools and

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials Fast Montgoery-like Square Root Coputation over GF( ) for All Trinoials Yin Li a, Yu Zhang a, a Departent of Coputer Science and Technology, Xinyang Noral University, Henan, P.R.China Abstract This letter

More information

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University

More information

Supplement to: Subsampling Methods for Persistent Homology

Supplement to: Subsampling Methods for Persistent Homology Suppleent to: Subsapling Methods for Persistent Hoology A. Technical results In this section, we present soe technical results that will be used to prove the ain theores. First, we expand the notation

More information

arxiv: v1 [math.nt] 14 Sep 2014

arxiv: v1 [math.nt] 14 Sep 2014 ROTATION REMAINDERS P. JAMESON GRABER, WASHINGTON AND LEE UNIVERSITY 08 arxiv:1409.411v1 [ath.nt] 14 Sep 014 Abstract. We study properties of an array of nubers, called the triangle, in which each row

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Some Classical Ergodic Theorems

Some Classical Ergodic Theorems Soe Classical Ergodic Theores Matt Rosenzweig Contents Classical Ergodic Theores. Mean Ergodic Theores........................................2 Maxial Ergodic Theore.....................................

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies OPERATIONS RESEARCH Vol. 52, No. 5, Septeber October 2004, pp. 795 803 issn 0030-364X eissn 1526-5463 04 5205 0795 infors doi 10.1287/opre.1040.0130 2004 INFORMS TECHNICAL NOTE Lost-Sales Probles with

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 11 10/15/2008 ABSTRACT INTEGRATION I

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 11 10/15/2008 ABSTRACT INTEGRATION I MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 11 10/15/2008 ABSTRACT INTEGRATION I Contents 1. Preliinaries 2. The ain result 3. The Rieann integral 4. The integral of a nonnegative

More information

Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions

Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions Corinna Cortes Spencer Greenberg Mehryar Mohri January, 9 Abstract We present an extensive analysis of relative deviation

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

Multi-kernel Regularized Classifiers

Multi-kernel Regularized Classifiers Multi-kernel Regularized Classifiers Qiang Wu, Yiing Ying, and Ding-Xuan Zhou Departent of Matheatics, City University of Hong Kong Kowloon, Hong Kong, CHINA wu.qiang@student.cityu.edu.hk, yying@cityu.edu.hk,

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Tail estimates for norms of sums of log-concave random vectors

Tail estimates for norms of sums of log-concave random vectors Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information