Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Journal of Machine Learning Research 11 (2010) Submitted 11/08; Revised 1/10; Published 2/10

Stability Bounds for Stationary ϕ-mixing and β-mixing Processes

Mehryar Mohri
Courant Institute of Mathematical Sciences and Google Research
251 Mercer Street, New York, NY 10012

Afshin Rostamizadeh
Department of Computer Science, Courant Institute of Mathematical Sciences
251 Mercer Street, New York, NY 10012

Editor: Ralf Herbrich

Abstract

Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios.

Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory

1. Introduction

Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, such as the VC-dimension, covering numbers, or Rademacher complexity. These measures characterize a class of hypotheses, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive bounds that are tailored to specific learning algorithms and exploit their particular properties. A learning algorithm is stable if the hypothesis it

outputs varies in a limited way in response to small changes made to the training set. Algorithmic stability has been used effectively in the past to derive tight generalization bounds (Bousquet and Elisseeff, 2001, 2002; Kearns and Ron, 1997).

But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, this assumption, however, does not hold; in fact, the i.i.d. assumption is not tested or derived from any data analysis. The observations received by the learning algorithm often have some inherent temporal dependence. This is clear in system diagnosis or time series prediction problems. Clearly, prices of different stocks on the same day, or of the same stock on different days, may be dependent. But, a less apparent time dependency may affect data sampled in many other tasks as well.

This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time (Yu, 1994; Meir, 2000; Vidyasagar, 2003; Lozano et al., 2006; Mohri and Rostamizadeh, 2007). We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the usefulness of stability bounds to non-i.i.d. scenarios. Our proofs are based on the independent block technique described by Yu (1994) and attributed to Bernstein (1927), which is commonly used in such contexts. However, our analysis somewhat differs from previous uses of this technique in that the blocks of points we consider are not necessarily of equal size.

For our analysis of stationary ϕ-mixing sequences, we make use of a generalized version of McDiarmid's inequality given by Kontorovich and Ramanan (2008) that holds for ϕ-mixing sequences. This leads to stability-based generalization bounds with the standard exponential form. Our generalization bounds for stationary β-mixing sequences cover a more general non-i.i.d. scenario and use the standard McDiarmid's inequality; however, unlike the ϕ-mixing case, the β-mixing bound presented here is not a purely exponential bound and contains an additive term depending on the mixing coefficient.

We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression (SVR) (Vapnik, 1998), Kernel Ridge Regression (Saunders et al., 1998), and Support Vector Machines (SVMs) (Cortes and Vapnik, 1995). Algorithms such as SVR have been used in the context of time series prediction in which the i.i.d. assumption does not hold, some with good experimental results (Müller et al., 1997; Mattera and Haykin, 1999). However, to our knowledge, the use of these algorithms in non-i.i.d. scenarios has not been previously supported by any theoretical analysis. The stability bounds we give for SVR, SVMs, and many other kernel regularization-based and relative entropy-based regularization algorithms can thus be viewed as the first theoretical basis for their use in such scenarios.

The following sections are organized as follows. In Section 2, we introduce the definitions relevant to the non-i.i.d. problems that we are considering and discuss the learning scenarios in that context.
Section 3 gives our main generalization bounds for stationary ϕ-mixing sequences based on stability, as well as the illustration of their applications to general kernel regularization-based algorithms, including SVR, KRR, and SVMs, as well as to relative entropy-based regularization algorithms. Finally, Section 4 presents the first known stability bounds for the more general stationary β-mixing scenario.

2. Preliminaries

We first introduce some standard definitions for dependent observations in mixing theory (Doukhan, 1994) and then briefly discuss the learning scenarios in the non-i.i.d. case.

2.1 Non-i.i.d. Definitions

Definition 1 A sequence of random variables Z = {Z_t}_{t=-∞}^{+∞} is said to be stationary if for any t and non-negative integers m and k, the random vectors (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution.

Thus, the index t, or time, does not affect the distribution of a variable Z_t in a stationary sequence. This does not imply independence, however. In particular, for i < j < k, Pr[Z_j | Z_i] may not equal Pr[Z_k | Z_i], that is, conditional probabilities may vary at different points in time. The following is a standard definition giving a measure of the dependence of the random variables Z_t within a stationary sequence. There are several equivalent definitions of these quantities; we adopt here a version convenient for our analysis, as in Yu (1994).

Definition 2 Let Z = {Z_t}_{t=-∞}^{+∞} be a stationary sequence of random variables. For any i, j ∈ Z ∪ {-∞, +∞}, let σ_i^j denote the σ-algebra generated by the random variables Z_k, i ≤ k ≤ j. Then, for any positive integer k, the β-mixing and ϕ-mixing coefficients of the stochastic process Z are defined as

β(k) = sup_n E_{B ∈ σ_1^n} [ sup_{A ∈ σ_{n+k}^∞} |Pr[A | B] − Pr[A]| ],    ϕ(k) = sup_{n, A ∈ σ_{n+k}^∞, B ∈ σ_1^n} |Pr[A | B] − Pr[A]|.

Z is said to be β-mixing (ϕ-mixing) if β(k) → 0 (resp. ϕ(k) → 0) as k → ∞. It is said to be algebraically β-mixing (algebraically ϕ-mixing) if there exist real numbers β_0 > 0 (resp. ϕ_0 > 0) and r > 0 such that β(k) ≤ β_0 / k^r (resp. ϕ(k) ≤ ϕ_0 / k^r) for all k, and exponentially mixing if there exist real numbers β_0 (resp. ϕ_0 > 0), β_1 (resp. ϕ_1 > 0) and r > 0 such that β(k) ≤ β_0 exp(−β_1 k^r) (resp. ϕ(k) ≤ ϕ_0 exp(−ϕ_1 k^r)) for all k.

Both β(k) and ϕ(k) measure the dependence of an event on those that occurred more than k units of time in the past. β-mixing is a weaker assumption than ϕ-mixing and thus covers a more general non-i.i.d. scenario.

This paper gives stability-based generalization bounds both in the ϕ-mixing and the β-mixing case. The β-mixing bounds cover a more general case of course; however, the ϕ-mixing bounds are simpler and admit the standard exponential form. The ϕ-mixing bounds are based on a concentration inequality that applies to ϕ-mixing processes only. Except for the use of this concentration bound and the two Lemmas 5 and 6, all of the intermediate proofs and results used to derive a ϕ-mixing bound in Section 3 are given in the more general case of β-mixing sequences.

It has been argued by Vidyasagar (2003) that β-mixing is just the right assumption for the analysis of weakly dependent sample points in machine learning, in particular because several PAC-learning results then carry over to the non-i.i.d. case.(1) Our β-mixing generalization bounds further contribute to the analysis of this scenario.

1. Some results have also been obtained in the more general context of α-mixing but they seem to require the stronger condition of exponential mixing (Modha and Masry, 1998).
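The mixing definitions above are easiest to picture on a concrete process. The following minimal Python sketch is not from the paper; the AR(1) model and the parameter rho are illustrative assumptions. A stationary Gaussian AR(1) process with |rho| < 1 is known to be exponentially β-mixing, and the dependence between Z_t and Z_{t+k} visibly weakens as the gap k grows.

```python
import numpy as np

def ar1_sample(m, rho=0.8, seed=0):
    """Draw m points from a stationary Gaussian AR(1) process Z_t = rho*Z_{t-1} + eps_t.
    For |rho| < 1 this process is stationary and (exponentially) beta-mixing."""
    rng = np.random.default_rng(seed)
    z = np.empty(m)
    z[0] = rng.normal(scale=1.0 / np.sqrt(1 - rho**2))  # start from the stationary law
    for t in range(1, m):
        z[t] = rho * z[t - 1] + rng.normal()
    return z

# Dependence weakens with the time gap k: for a Gaussian AR(1), corr(Z_t, Z_{t+k}) = rho**k.
z = ar1_sample(10000)
for k in [1, 5, 10, 20]:
    emp = np.corrcoef(z[:-k], z[k:])[0, 1]
    print(f"lag {k:2d}: empirical corr = {emp:+.3f}, rho^k = {0.8**k:+.3f}")
```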

We describe in several instances the application of our bounds in the case of algebraic mixing. Algebraic mixing is a standard assumption on mixing coefficients that has been adopted in previous studies of learning in the presence of dependent observations (Yu, 1994; Meir, 2000; Vidyasagar, 2003; Lozano et al., 2006). Let us also point out that mixing assumptions can be checked in some cases, such as with Gaussian or Markov processes (Meir, 2000), and that mixing parameters can also be estimated in such cases.

Most previous studies use a technique originally introduced by Bernstein (1927) based on independent blocks of equal size (Yu, 1994; Meir, 2000; Lozano et al., 2006). This technique is particularly relevant when dealing with stationary β-mixing. We will need a related but somewhat different technique since the blocks we consider may not have the same size. The following lemma is a special case of Corollary 2.7 from Yu (1994).

Lemma 3 (Yu, 1994, Corollary 2.7) Let µ ≥ 1 and suppose that h is a measurable function, with absolute value bounded by M, on a product probability space (∏_{j=1}^µ Ω_j, ∏_{j=1}^µ σ_{r_j}^{s_j}), where r_i ≤ s_i ≤ r_{i+1} for all i. Let Q be a probability measure on the product space with marginal measures Q_i on (Ω_i, σ_{r_i}^{s_i}), and let Q^{i+1} be the marginal measure of Q on (∏_{j=1}^{i+1} Ω_j, ∏_{j=1}^{i+1} σ_{r_j}^{s_j}), i = 1, ..., µ − 1. Let β(Q) = sup_{1 ≤ i ≤ µ−1} β(k_i), where k_i = r_{i+1} − s_i, and P = ∏_{i=1}^µ Q_i. Then,

|E_Q[h] − E_P[h]| ≤ (µ − 1) M β(Q).

The lemma gives a measure of the difference between the distribution of µ blocks where the blocks are independent in one case and dependent in the other case. The distribution within each block is assumed to be the same in both cases. For a monotonically decreasing function β, we have β(Q) = β(k*), where k* = min_i(k_i) is the smallest gap between blocks.

2.2 Learning Scenarios

We consider the familiar supervised learning setting where the learning algorithm receives a sample of m labeled points S = (z_1, ..., z_m) = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, where X is the input space and Y the set of labels (Y ⊆ R in the regression case), both assumed to be measurable. For a fixed learning algorithm, we denote by h_S the hypothesis it returns when trained on the sample S.

The error of a hypothesis on a pair z ∈ X × Y is measured in terms of a cost function c: Y × Y → R_+. Thus, c(h(x), y) measures the error of a hypothesis h on a pair (x, y); c(h(x), y) = (h(x) − y)^2 in the standard regression case. We will often use the shorthand c(h, z) := c(h(x), y) for a hypothesis h and z = (x, y) ∈ X × Y and will assume that c is upper bounded by a constant M > 0. We denote by R̂(h) the empirical error of a hypothesis h for a training sample S = (z_1, ..., z_m):

R̂(h) = (1/m) Σ_{i=1}^m c(h, z_i).

In the standard machine learning scenario, the sample pairs z_1, ..., z_m are assumed to be i.i.d., a restrictive assumption that does not always hold in practice. We will consider here the more general case of dependent samples drawn from a stationary mixing sequence Z over X × Y. As in the i.i.d. case, the objective of the learning algorithm is to select a hypothesis with small error over future samples. But, here, we must distinguish two versions of this problem.

In the most general version, future samples depend on the training sample S and thus the generalization error or true error of the hypothesis h_S trained on S must be measured by its expected error conditioned on the sample S:

R(h_S) = E_z[c(h_S, z) | S].    (1)

This is the most realistic setting in this context, and it matches time series prediction problems. A somewhat less realistic version is one where the samples are dependent, but the test points are assumed to be independent of the training sample S. The generalization error of the hypothesis h_S trained on S is then:

R(h_S) = E_z[c(h_S, z) | S] = E_z[c(h_S, z)].

This setting seems less natural since, if samples are dependent, future test points must also depend on the training points, even if that dependence is relatively weak due to the time interval after which test points are drawn. Nevertheless, it is this somewhat less realistic setting that has been studied by all previous machine learning studies that we are aware of: Yu (1994), Meir (2000), Vidyasagar (2003) and Lozano et al. (2006), even when examining specifically a time series prediction problem (Meir, 2000). Thus, the bounds derived in these studies cannot be directly applied to the more general setting. We will consider instead the most general setting, with the definition of the generalization error based on Equation 1. Clearly, our analysis also applies to the less general setting just discussed.

Let us also briefly discuss the more general scenario of non-stationary mixing sequences, that is, one where the distribution may change over time. Within that general case, the generalization error of a hypothesis h_S, defined straightforwardly by

R(h_S, t) = E_{z_t ~ σ_t}[c(h_S, z_t) | S],

would depend on the time t, and it may be the case that R(h_S, t) ≠ R(h_S, t') for t ≠ t', making the definition of the generalization error a more subtle issue. To remove the dependence on time, one could define a weaker notion of the generalization error based on an expected loss over all time:

R(h_S) = E_t[R(h_S, t)].

It is not clear, however, whether this term could be easily computed and be useful. A stronger condition would be to minimize the generalization error for any particular target time. Studies of this type have been conducted for smoothly changing distributions, such as in Zhou et al. (2008); however, to the best of our knowledge, the scenario of both non-identical and non-independent sequences has not yet been studied.

3. ϕ-Mixing Generalization Bounds and Applications

This section gives generalization bounds for β̂-stable algorithms over a mixing stationary distribution.(2) The first two subsections present our supporting lemmas, which hold for either β-mixing or ϕ-mixing stationary distributions. In the third subsection, we will briefly discuss concentration inequalities that apply to ϕ-mixing processes only. Then, in the final subsection, we will present our main results.

2. The standard variable used for the stability coefficient is β. To avoid confusion with the β-mixing coefficient, we will use β̂ instead.
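The distinction between the two generalization errors just discussed can be made concrete with a small simulation. The sketch below is not from the paper; the AR(1) data model, the one-step-prediction task, the ridge regressor, and all parameters are illustrative assumptions. It estimates, for a single training sample S, both the conditional error of Equation 1 (test pair drawn immediately after the sample) and the independent-test error (test pair drawn from the stationary law, ignoring S).

```python
import numpy as np

rng = np.random.default_rng(1)
rho, m, lam = 0.9, 200, 1e-3

# One dependent training path and the pairs (x_t, y_t) = (z_t, z_{t+1}).
path = np.empty(m + 1)
path[0] = rng.normal(scale=1 / np.sqrt(1 - rho**2))      # stationary start
for t in range(1, m + 1):
    path[t] = rho * path[t - 1] + rng.normal()
x, y = path[:-1], path[1:]
w = (x @ y) / (x @ x + lam)                               # 1-d ridge hypothesis h_S(x) = w*x

n = 20000
# (1) R(h_S) = E_z[c(h_S, z) | S]: the test pair starts right after the sample, so we
#     average over continuations of the same chain, conditioned on its last value.
x_dep = rho * path[-1] + rng.normal(size=n)
y_dep = rho * x_dep + rng.normal(size=n)
r_dependent = np.mean((w * x_dep - y_dep) ** 2)

# (2) Independent-test version: test pairs drawn from the stationary law, ignoring S.
x_ind = rng.normal(scale=1 / np.sqrt(1 - rho**2), size=n)
y_ind = rho * x_ind + rng.normal(size=n)
r_independent = np.mean((w * x_ind - y_ind) ** 2)

print(f"conditional (dependent-test) error: {r_dependent:.3f}")
print(f"independent-test error:             {r_independent:.3f}")
```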

The condition of β̂-stability is an algorithm-dependent property first introduced by Devroye and Wagner (1979) and Kearns and Ron (1997). It was later used successfully by Bousquet and Elisseeff (2001, 2002) to show algorithm-specific stability bounds for i.i.d. samples. Roughly speaking, a learning algorithm is said to be stable if small changes to the training set do not cause large deviations in its output. The following gives the precise technical definition.

Definition 4 A learning algorithm is said to be (uniformly) β̂-stable if the hypotheses it returns for any two training samples S and S' that differ by removing a single point satisfy

∀z ∈ X × Y, |c(h_S, z) − c(h_{S'}, z)| ≤ β̂.

We note that a β̂-stable algorithm is also stable with respect to replacing a single point. Let S and S^i be two sequences differing in the ith coordinate, and let S^{/i} be equivalent to S and S^i but with the ith point removed. Then, for a β̂-stable algorithm we have

|c(h_S, z) − c(h_{S^i}, z)| = |c(h_S, z) − c(h_{S^{/i}}, z) + c(h_{S^{/i}}, z) − c(h_{S^i}, z)| ≤ |c(h_S, z) − c(h_{S^{/i}}, z)| + |c(h_{S^{/i}}, z) − c(h_{S^i}, z)| ≤ 2β̂.

The use of stability in conjunction with McDiarmid's inequality will allow us to derive generalization bounds. McDiarmid's inequality is an exponential concentration bound of the form

Pr[|Φ − E[Φ]| ≥ ε] ≤ exp(−ε² / (τ² m)),

where the probability is over a sample of size m and where τ is the Lipschitz parameter of Φ, with τ a function of m. Unfortunately, this inequality cannot be applied when the sample points are not distributed in an i.i.d. fashion. We will use instead a result of Kontorovich and Ramanan (2008) that extends McDiarmid's inequality to ϕ-mixing distributions (Theorem 8). To obtain a stability-based generalization bound, we will apply this theorem to Φ(S) = R(h_S) − R̂(h_S). To do so, we need to show, as with the standard McDiarmid's inequality, that Φ is a Lipschitz function and, to make it useful, bound E[Φ]. The next two subsections describe how we achieve both of these in this non-i.i.d. scenario.

Let us first take a brief look at the problem faced when attempting to give stability bounds for dependent sequences and give some idea of our solution for that problem. The stability proofs given by Bousquet and Elisseeff (2001) assume the i.i.d. property; thus, replacing an element in a sequence with another does not affect the expected value of a random variable defined over that sequence. In other words, the following equality holds:

E_S[V(Z_1, ..., Z_i, ..., Z_m)] = E_{S,Z'}[V(Z_1, ..., Z', ..., Z_m)],    (2)

for a random variable V that is a function of the sequence of random variables S = (Z_1, ..., Z_m). However, clearly, if the points in that sequence S are dependent, this equality may not hold anymore.
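The uniform stability of Definition 4 can also be probed numerically for a specific algorithm. The sketch below is a minimal empirical check, not part of the paper: it trains kernel ridge regression (whose stability is analyzed later, in Section 3.5.1) on a sample, removes one point at a time, and records the largest observed change in the loss at held-out points. The RBF kernel, the data-generating function, and all parameters are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def krr_fit(X, y, lam):
    # Minimizer of (1/m) sum (h(x_i)-y_i)^2 + lam*||h||_K^2  =>  alpha = (K + lam*m*I)^{-1} y
    K = rbf(X, X)
    return np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)

def krr_predict(alpha, Xtr, Xte):
    return rbf(Xte, Xtr) @ alpha

rng = np.random.default_rng(0)
m, lam = 200, 0.1
X = rng.normal(size=(m, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=m)

Xte = rng.normal(size=(500, 1))
yte = np.sin(Xte[:, 0])                      # fixed targets used only to evaluate the loss
pred_full = krr_predict(krr_fit(X, y, lam), X, Xte)

# Removal stability: max over removed points i and test points z of |c(h_S, z) - c(h_{S\i}, z)|.
worst = 0.0
for i in range(m):
    keep = np.arange(m) != i
    pred_i = krr_predict(krr_fit(X[keep], y[keep], lam), X[keep], Xte)
    worst = max(worst, np.max(np.abs((pred_full - yte) ** 2 - (pred_i - yte) ** 2)))

print(f"empirical stability estimate: {worst:.4f}  (theory predicts a value of order 1/(lambda*m))")
```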

Figure 1: Illustration of dependent (a) and independent (b) blocks. Although there is no dependence between blocks of points in (b), the distribution within each block remains the same as in (a) and thus points within a block remain dependent.

The main technique to cope with this problem is based on the so-called independent block sequence originally introduced by Bernstein (1927). This consists of eliminating from the original dependent sequence several blocks of contiguous points, leaving us with some remaining blocks of points. Instead of these dependent blocks, we then consider independent blocks of points, each with the same size and the same distribution (within each block) as the dependent ones. By Lemma 3, for a β-mixing distribution, the expected value of a random variable defined over the dependent blocks is close to the one based on these independent blocks. Working with these independent blocks brings us back to a situation similar to the i.i.d. case, with i.i.d. blocks replacing i.i.d. points. Figure 1 illustrates the two types of blocks just discussed.

Our use of this method somewhat differs from previous ones (see Yu, 1994; Meir, 2000) where many blocks of equal size are considered. We will be dealing with four blocks and with typically unequal sizes. More specifically, note that for Equation 2 to hold, we only need the variable Z_i to be independent of the other points in the sequence. To achieve this, roughly speaking, we will be discarding some of the points in the sequence surrounding Z_i. This results in a sequence of three blocks of contiguous points. If our algorithm is stable and we do not discard too many points, the hypothesis returned should not be greatly affected by this operation. In the next step, we apply the independent block lemma, which then allows us to treat each of these blocks as independent, modulo the addition of a mixing term. In particular, Z_i becomes independent of all other points. Clearly, the number of points discarded is subject to a trade-off: removing too many points could excessively modify the hypothesis returned; removing too few would maintain the dependency between Z_i and the remaining points, thereby inducing a larger penalty when applying Lemma 3. This trade-off is made explicit in the following section, where an optimal solution is sought.

3.1 Lipschitz Bound

As discussed in Section 2.2, in the most general scenario, test points depend on the training sample. We first present a lemma that relates the expected value of the generalization error in that scenario and the same expectation in the scenario where the test point is independent of the training sample. We denote by R(h_S) = E_z[c(h_S, z) | S] the expectation in the dependent case and by R̃(h_{S_b}) = E_z̃[c(h_{S_b}, z̃)] the expectation where the test points are assumed independent of the training sample, with S_b denoting a sequence similar to S but with the last b points removed. Figure 2(a) illustrates that sequence. The block S_b is assumed to have exactly the same distribution as the corresponding block of the same size in S.
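The derived sequences used in the following proofs are simple index operations on the original sample. The sketch below is one plausible way to realize them, not the paper's code: it builds the index set of S_b (last b points dropped) and a three-block split that discards b points on each side of a chosen position i while keeping z_i, in the spirit of Figure 2; the exact offsets used in the paper's proofs may differ slightly.

```python
def drop_tail(m, b):
    """Indices of S_b: the sample S with its last b points removed (cf. Figure 2(a))."""
    return list(range(m - b))

def blocks_around(m, i, b):
    """Indices of the (up to) three contiguous blocks that remain after discarding
    b points on each side of position i, keeping z_i itself."""
    left = list(range(0, max(i - b, 0)))
    middle = [i]
    right = list(range(min(i + b + 1, m), m))
    return left, middle, right

# Example: m = 20 points, b = 3 points discarded on each side of i = 8.
print(drop_tail(20, 3))
print(blocks_around(20, 8, 3))
```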

Lemma 5 Assume that the learning algorithm is β̂-stable and that the cost function c is bounded by M. Then, for any sample S of size m drawn from a ϕ-mixing stationary distribution and for any b ∈ {0, ..., m}, the following holds:

|R(h_S) − R̃(h_{S_b})| ≤ b β̂ + M ϕ(b).

Proof The β̂-stability of the learning algorithm implies that

|R(h_S) − E_z[c(h_{S_b}, z) | S_b]| = |E_z[c(h_S, z) | S] − E_z[c(h_{S_b}, z) | S_b]| ≤ b β̂.    (3)

Now, in order to remove the dependence on S_b, we bound the following difference:

|E_z[c(h_{S_b}, z) | S_b] − Ẽ_z̃[c(h_{S_b}, z̃)]|
= |Σ_{z ∈ Z} c(h_{S_b}, z)(Pr[z | S_b] − Pr[z])|
= |Σ_{z ∈ Z^+} c(h_{S_b}, z)(Pr[z | S_b] − Pr[z]) + Σ_{z ∈ Z^−} c(h_{S_b}, z)(Pr[z | S_b] − Pr[z])|
= |Σ_{z ∈ Z^+} c(h_{S_b}, z)|Pr[z | S_b] − Pr[z]| − Σ_{z ∈ Z^−} c(h_{S_b}, z)|Pr[z | S_b] − Pr[z]||
≤ max_{Z' ∈ {Z^−, Z^+}} Σ_{z ∈ Z'} c(h_{S_b}, z)|Pr[z | S_b] − Pr[z]|
≤ max_{Z' ∈ {Z^−, Z^+}} M Σ_{z ∈ Z'} |Pr[z | S_b] − Pr[z]|
= max_{Z' ∈ {Z^−, Z^+}} M |Pr[Z' | S_b] − Pr[Z']|
≤ M ϕ(b),    (4)

where the sum has been separated over the set Z^+ of zs for which the difference Pr[z | S_b] − Pr[z] is non-negative, and its complement Z^−. Using (3) and (4) and the triangle inequality yields the statement of the lemma.

Note that we assume that z immediately follows the sample S, which is the strongest dependent scenario. The following bounds can be improved in a straightforward manner if the test point z is assumed to be observed, say, k units of time after the sample S. The bound would then contain the mixing term ϕ(k + b) instead of ϕ(b).

We can now prove a Lipschitz bound for the function Φ.

Lemma 6 Let S = (z_1, ..., z_i, ..., z_m) and S^i = (z_1, ..., z_i', ..., z_m) be two sequences drawn from a ϕ-mixing stationary process that differ only in point z_i for some i ∈ {1, ..., m}, and let h_S and h_{S^i} be the hypotheses returned by a β̂-stable algorithm when trained on each of these samples. Then, for any i ∈ {1, ..., m}, the following inequality holds:

|Φ(S) − Φ(S^i)| ≤ (b + 2) 2β̂ + 2ϕ(b) M + M/m.

Figure 2: Illustration of the sequences derived from S that are considered in the proofs: (a) the sequence S_b, (b) the sequence S^i, (c) the sequence S_{i,b}, and (d) its independent-block counterpart.

Proof To prove this inequality, we first bound the difference of the empirical errors as in Bousquet and Elisseeff (2002), then the difference of the generalization errors. Bounding the difference of costs on agreeing points with 2β̂ and on the point that disagrees with M gives

|R̂(h_S) − R̂(h_{S^i})| ≤ (1/m) Σ_{j ≠ i} |c(h_S, z_j) − c(h_{S^i}, z_j)| + (1/m) |c(h_S, z_i) − c(h_{S^i}, z_i')| ≤ 2β̂ + M/m.    (5)

Since both R(h_S) and R(h_{S^i}) are defined with respect to a (different) dependent point, we can apply Lemma 5 to both generalization error terms and use β̂-stability. Using this and the triangle inequality, we can write

|R(h_S) − R(h_{S^i})| ≤ |R(h_S) − R̃(h_{S_b})| + |R̃(h_{S_b}) − R̃(h_{S^i_b})| + |R̃(h_{S^i_b}) − R(h_{S^i})|
≤ |R̃(h_{S_b}) − R̃(h_{S^i_b})| + 2bβ̂ + 2ϕ(b)M
= |Ẽ_z̃[c(h_{S_b}, z̃) − c(h_{S^i_b}, z̃)]| + 2bβ̂ + 2ϕ(b)M
≤ 2β̂ + 2bβ̂ + 2ϕ(b)M.    (6)

The statement of the lemma is obtained by combining inequalities (5) and (6).

3.2 Bound on Expectation

As mentioned earlier, to obtain an explicit bound after application of a generalized McDiarmid's inequality, we also need to bound E_S[Φ(S)]. This is done by analyzing independent blocks using Lemma 3.

Lemma 7 Let h_S be the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from a β-mixing stationary distribution. Then, for all b ∈ {1, ..., m}, the following inequality holds:

|E_S[Φ(S)]| ≤ (6b + 2)β̂ + 3β(b)M.

Proof Let S_b be defined as in the proof of Lemma 5. To deal with independent block sequences defined with respect to the same hypothesis, we will consider the sequence S_{i,b}, obtained from S by removing the b points on each side of z_i as well as, as in S_b, the last b points; it is illustrated by Figure 2(a-c). This can result in as many as four blocks. As before, we will consider a sequence S̃_{i,b} with a similar set of blocks, each with the same distribution as the corresponding blocks in S_{i,b}, but such that the blocks are independent, as seen in Figure 2(d). Since three blocks of at most b points are removed from each hypothesis, by the β̂-stability of the learning algorithm, the following holds:

|E_S[Φ(S)]| = |E_S[R(h_S) − R̂(h_S)]|
= |E_{S,z}[(1/m) Σ_{i=1}^m c(h_S, z_i) − c(h_S, z)]|
≤ |E_{S_{i,b},z}[(1/m) Σ_{i=1}^m c(h_{S_{i,b}}, z_i) − c(h_{S_{i,b}}, z)]| + 6bβ̂.

The application of Lemma 3 to the difference of two cost functions, also bounded by M, as in the right-hand side leads to

|E_S[Φ(S)]| ≤ |E_{S̃_{i,b}, z̃}[(1/m) Σ_{i=1}^m c(h_{S̃_{i,b}}, z̃_i) − c(h_{S̃_{i,b}}, z̃)]| + 6bβ̂ + 3β(b)M.

Now, since the points z̃ and z̃_i are independent and since the distribution is stationary, they have the same distribution and we can replace z̃_i with z̃ in the empirical cost. Thus, we can write

|E_S[Φ(S)]| ≤ |E_{S̃_{i,b}, z̃}[(1/m) Σ_{i=1}^m c(h_{S̃^i_{i,b}}, z̃) − c(h_{S̃_{i,b}}, z̃)]| + 6bβ̂ + 3β(b)M ≤ 2β̂ + 6bβ̂ + 3β(b)M,

where S̃^i_{i,b} is the sequence derived from S̃_{i,b} by replacing z̃_i with z̃. The last inequality holds by β̂-stability of the learning algorithm. The other side of the inequality in the statement of the lemma can be shown following the same steps.

3.3 ϕ-Mixing Concentration Bound

We are now prepared to make use of a concentration inequality to provide a generalization bound in the ϕ-mixing scenario. Several concentration inequalities have been shown in the ϕ-mixing case, for example, Marton (1998), Samson (2000), Chazottes et al. (2007) and Kontorovich and Ramanan (2008). We will use that of Kontorovich and Ramanan (2008), which is very similar to that of Chazottes et al. (2007), modulo the fact that the latter requires a finite sample space.
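The concentration bound used in the next subsection involves the factor 1 + 2 Σ_{k=1}^m ϕ(k), which stays bounded precisely when the mixing coefficients are summable. The short sketch below is only a numerical illustration (the coefficient ϕ_0 and the decay exponents are illustrative assumptions): for algebraic mixing with r > 1 or for exponential mixing the factor is essentially constant in m, while for slowly decaying coefficients it grows with the sample size.

```python
import numpy as np

def phi_sum_factor(m, phi):
    """The factor 1 + 2*sum_{k=1}^{m} phi(k) appearing in the phi-mixing concentration bound."""
    return 1.0 + 2.0 * sum(phi(k) for k in range(1, m + 1))

phi0 = 0.5
for m in [100, 1000, 10000]:
    alg_fast = phi_sum_factor(m, lambda k: phi0 * k ** -2.0)   # algebraic, r = 2 > 1: bounded
    alg_slow = phi_sum_factor(m, lambda k: phi0 * k ** -0.5)   # algebraic, r = 1/2: grows with m
    expo = phi_sum_factor(m, lambda k: phi0 * np.exp(-k))      # exponential mixing: bounded
    print(f"m={m:6d}   r=2: {alg_fast:6.2f}   r=0.5: {alg_slow:8.1f}   exp: {expo:6.2f}")
```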

The following is a concentration inequality derived from that of Kontorovich and Ramanan (2008).(3)

Theorem 8 Let Φ: Z^m → R be a measurable function that is c-Lipschitz with respect to the Hamming metric for some c > 0, and let Z_1, ..., Z_m be random variables distributed according to a ϕ-mixing distribution. Then, for any ε > 0, the following inequality holds:

Pr[|Φ(Z_1, ..., Z_m) − E[Φ(Z_1, ..., Z_m)]| ≥ ε] ≤ 2 exp( −2ε² / ( c² m (1 + 2 Σ_{k=1}^m ϕ(k))² ) ).

It should be pointed out that the statement of the theorem in this paper is improved by a factor of 4 in the exponent with respect to that of Kontorovich and Ramanan (2008, Theorem 1.1). This can be achieved straightforwardly by following the same steps as in the proof of Kontorovich and Ramanan (2008), but making use of the following general form of McDiarmid's inequality (Theorem 9) instead of Azuma's inequality. In particular, Theorem 5.1 of Kontorovich and Ramanan (2008) shows that for a ϕ-mixing distribution and a 1-Lipschitz function, the constants c_i of Theorem 9 can be bounded as follows:

∀i, c_i ≤ 1 + 2 Σ_{k=1}^m ϕ(k).

Theorem 9 (McDiarmid, 1989, 6.10) Let Z_1, ..., Z_m be arbitrary random variables taking values in Z and let Φ: Z^m → R be a measurable function satisfying, for all z_i, z_i' ∈ Z, i = 1, ..., m, the following inequalities:

|E[Φ(Z_1, ..., Z_m) | Z_1 = z_1, ..., Z_i = z_i] − E[Φ(Z_1, ..., Z_m) | Z_1 = z_1, ..., Z_i = z_i']| ≤ c_i,

where c_i > 0, i = 1, ..., m, are constants. Then, for any ε > 0, the following inequality holds:

Pr[|Φ(Z_1, ..., Z_m) − E[Φ(Z_1, ..., Z_m)]| ≥ ε] ≤ 2 exp( −2ε² / Σ_{i=1}^m c_i² ).

In the i.i.d. case, McDiarmid's theorem can be restated in the following simpler form that we shall use in Section 4.

Theorem 10 (McDiarmid, i.i.d. scenario) Let Z_1, ..., Z_m be independent random variables taking values in Z and let Φ: Z^m → R be a measurable function satisfying, for all z_i, z_i' ∈ Z, i = 1, ..., m, the following inequalities:

|Φ(z_1, ..., z_i, ..., z_m) − Φ(z_1, ..., z_i', ..., z_m)| ≤ c_i,

where c_i > 0, i = 1, ..., m, are constants. Then, for any ε > 0, the following inequality holds:

Pr[|Φ(Z_1, ..., Z_m) − E[Φ(Z_1, ..., Z_m)]| ≥ ε] ≤ 2 exp( −2ε² / Σ_{i=1}^m c_i² ).

3. We should note that the original bound is expressed in terms of η-mixing coefficients. To simplify the presentation, we are adapting it to the case of stationary ϕ-mixing sequences by using the following straightforward inequality for a stationary process: η_{ij} ≤ 2ϕ(j − i). Furthermore, the bound presented in Kontorovich and Ramanan (2008) holds when the sample space is countable; it is extended to the continuous case in Kontorovich (2007).
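A quick sanity check, not from the paper, of the i.i.d. form of McDiarmid's inequality (Theorem 10): for the empirical mean of m i.i.d. Uniform[0, 1] variables, changing one coordinate moves the mean by at most c_i = 1/m, so the deviation probability is at most 2 exp(−2mε²). The choice of distribution, m, and ε below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, trials = 200, 0.05, 200000
means = rng.random((trials, m)).mean(axis=1)          # empirical means of m Uniform[0,1] draws
empirical = np.mean(np.abs(means - 0.5) >= eps)       # observed deviation frequency
bound = 2 * np.exp(-2 * m * eps**2)                   # McDiarmid bound with c_i = 1/m
print(f"empirical deviation probability: {empirical:.4f}   McDiarmid bound: {bound:.4f}")
```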

3.4 ϕ-Mixing Generalization Bounds

This section presents several theorems that constitute the main results of this paper in the ϕ-mixing case. The following theorem is constructed from the bounds shown in the previous three sections.

Theorem 11 (General Non-i.i.d. Stability Bound) Let h_S denote the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from a ϕ-mixing stationary distribution, and let c be a measurable non-negative cost function upper bounded by M > 0. Then, for any b ∈ {0, ..., m} and any ε > 0, the following generalization bound holds:

Pr_S[ |R(h_S) − R̂(h_S)| > ε + (6b + 2)β̂ + 6Mϕ(b) ] ≤ 2 exp( −2ε² / ( m (1 + 2 Σ_{i=1}^m ϕ(i))² ((b + 2)2β̂ + 2Mϕ(b) + M/m)² ) ).

Proof The theorem follows directly from the application of Lemma 6 and Lemma 7 to Theorem 8.

The theorem gives a general stability bound for ϕ-mixing stationary sequences. If we further assume that the sequence is algebraically ϕ-mixing, that is, ϕ(k) = ϕ_0 k^{−r} for all k, for some r > 1, then we can solve for the value of b that optimizes the bound.

Theorem 12 (Non-i.i.d. Stability Bound for Algebraically Mixing Sequences) Let h_S denote the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution, ϕ(k) = ϕ_0 k^{−r} with r > 1, and let c be a measurable non-negative cost function upper bounded by M > 0. Then, for any ε > 0, the following generalization bound holds:

Pr_S[ |R(h_S) − R̂(h_S)| > ε + 8β̂ + 6(r + 1)Mϕ(b*) ] ≤ 2 exp( −2ε² / ( m (1 + 2ϕ_0 r/(r − 1))² (6β̂ + 2(r + 1)Mϕ(b*) + M/m)² ) ),

where b* = (β̂ / (rϕ_0 M))^{−1/(r+1)}.

Proof For an algebraically mixing sequence, the value of b minimizing the bound of Theorem 11 satisfies the equation β̂ b* = r M ϕ(b*). Since b must be an integer, we use the approximation b = ⌈b*⌉, with b* = (β̂ / (rϕ_0 M))^{−1/(r+1)}, when applying Theorem 11. However, observing the inequalities ϕ(b*) ≥ ϕ(b) and b* + 1 ≥ b allows us to write the statement of Theorem 12 in terms of the fractional choice b*. The term in the numerator can be bounded as

1 + 2 Σ_{i=1}^m ϕ(i) = 1 + 2 Σ_{i=1}^m ϕ_0 i^{−r} ≤ 1 + 2ϕ_0 (1 + ∫_1^m x^{−r} dx) = 1 + 2ϕ_0 (1 + (1 − m^{1−r})/(r − 1)).

Using the assumption r > 1, we can upper bound (1 − m^{1−r})/(r − 1) by 1/(r − 1) and obtain

1 + 2ϕ_0 (1 + (1 − m^{1−r})/(r − 1)) ≤ 1 + 2ϕ_0 (1 + 1/(r − 1)) = 1 + 2ϕ_0 r/(r − 1).

Plugging in this value and the minimizing value of b in the bound of Theorem 11 yields the statement of the theorem.

In the case of a zero mixing coefficient (ϕ = 0 and b = 0), the bounds of Theorem 11 coincide with the i.i.d. stability bound of Bousquet and Elisseeff (2002). In the general case, in order for the right-hand side of these bounds to converge, we must have β̂ = o(1/√m) and ϕ(b) = o(1/√m). The first condition holds for several families of algorithms with β̂ = O(1/m) (Bousquet and Elisseeff, 2002). In the case of algebraically mixing sequences with r > 1, as assumed in Theorem 12, β̂ = O(1/m) implies ϕ(b*) = ϕ_0 (β̂/(rϕ_0 M))^{r/(r+1)} < O(1/√m). More specifically, for the scenario of algebraic mixing with 1/m-stability, the following bound holds with probability at least 1 − δ:

R(h_S) − R̂(h_S) ≤ O( √( log(1/δ) / m^{(r−1)/(r+1)} ) ).

This is obtained by setting the right-hand side of Theorem 12 equal to δ and solving for ε. Furthermore, if we choose ε = C √( log(m) / m^{(r−1)/(r+1)} ) for a large enough constant C > 0, the right-hand side of Theorem 12 is summable over m and thus, by the Borel-Cantelli lemma, the following inequality holds almost surely:

R(h_S) − R̂(h_S) ≤ O( √( log(m) / m^{(r−1)/(r+1)} ) ).

Similar bounds can be given for the exponential mixing setting (ϕ(k) = ϕ_0 exp(−ϕ_1 k^r)). If we choose b polylogarithmic in m, large enough that ϕ(b) = O(1/m), and assume β̂ = O(1/m), then, with probability at least 1 − δ,

R(h_S) − R̂(h_S) ≤ O( √( log(1/δ) log²(m) / m ) ).

If we instead set ε = C √( log³(m) / m ) for a large enough constant C, the right-hand side of Theorem 12 is summable and again, by the Borel-Cantelli lemma, we have almost surely

R(h_S) − R̂(h_S) ≤ O( √( log³(m) / m ) ).

3.5 Applications

We now present the application of our stability bounds for algebraically ϕ-mixing sequences to several algorithms, including the family of kernel-based regularization algorithms and that of relative entropy-based regularization algorithms. The application of our learning bounds will benefit from the previous analysis of the stability of these algorithms by Bousquet and Elisseeff (2002).
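Before turning to these specific algorithms, the shape of the algebraic-mixing bound can be made concrete numerically. The sketch below is not from the paper: it evaluates the quantities of Theorem 12, as reconstructed above, for the 1/m-stability scenario just discussed; the function name, the default values of r, ϕ_0, M and δ, and the choice β̂ = 1/m are illustrative assumptions.

```python
import numpy as np

def algebraic_phi_bound(m, delta, r, phi0=1.0, M=1.0, beta_hat=None):
    """Sketch of the deviation guaranteed by Theorem 12 for an algebraically phi-mixing
    process (phi(k) = phi0 * k**-r, r > 1) and a beta_hat-stable algorithm."""
    if beta_hat is None:
        beta_hat = 1.0 / m                      # the 1/m-stability scenario
    b_star = (beta_hat / (r * phi0 * M)) ** (-1.0 / (r + 1))
    phi_b = phi0 * b_star ** (-r)
    shift = 8 * beta_hat + 6 * (r + 1) * M * phi_b
    lipschitz = 6 * beta_hat + 2 * (r + 1) * M * phi_b + M / m
    norm = 1 + 2 * phi0 * r / (r - 1)
    # invert 2*exp(-2*eps^2 / (m * norm^2 * lipschitz^2)) = delta for eps
    eps = norm * lipschitz * np.sqrt(m * np.log(2 / delta) / 2)
    return shift + eps

for m in [10**3, 10**4, 10**5, 10**6]:
    print(f"m={m:8d}   R - R_hat <~ {algebraic_phi_bound(m, delta=0.05, r=4.0):.3f}")
```

The printed values decrease roughly like m^{-(r-1)/(2(r+1))}, matching the rate discussed above; with slower mixing (smaller r) the constants make the bound informative only for much larger samples.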

3.5.1 Kernel-Based Regularization Algorithms

We first apply our bounds to a family of algorithms minimizing a regularized objective function based on the norm ‖·‖_K in a reproducing kernel Hilbert space, where K is a positive definite symmetric kernel:

argmin_{h ∈ H} (1/m) Σ_{i=1}^m c(h, z_i) + λ ‖h‖²_K.    (7)

The application of our bound is possible, under some general conditions, since kernel regularized algorithms are stable with β̂ = O(1/m) (Bousquet and Elisseeff, 2002). For the sake of completeness, we briefly present the proof of this β̂-stability. We will assume that the cost function c is σ-admissible, that is, there exists σ ∈ R_+ such that for any two hypotheses h, h' ∈ H and for all z = (x, y) ∈ X × Y,

|c(h, z) − c(h', z)| ≤ σ |h(x) − h'(x)|.

This assumption holds for the quadratic cost and most other cost functions when the hypothesis set and the set of output labels are bounded by some M ∈ R_+: ∀h ∈ H, ∀x ∈ X, |h(x)| ≤ M and ∀y ∈ Y, |y| ≤ M. We will also assume that c is differentiable. This assumption is in fact not necessary and all of our results hold without it, but it makes the presentation simpler.

We denote by B_F the Bregman divergence associated to a convex function F:

B_F(f ‖ g) = F(f) − F(g) − ⟨f − g, ∇F(g)⟩.

In what follows, it will be helpful to define F as the objective function of a general regularization-based algorithm,

F_S(h) = R̂_S(h) + λ N(h),

where R̂_S is the empirical error as measured on the sample S, N: H → R_+ is a regularization function and λ > 0 is the familiar trade-off parameter. Finally, we shall use the shorthand Δh = h' − h.

Lemma 13 (Bousquet and Elisseeff, 2002) A kernel-based regularization algorithm of the form (7), with bounded kernel K(x, x) ≤ κ² < ∞ and σ-admissible cost function, is β̂-stable with coefficient

β̂ ≤ σ²κ² / (λm).

Proof Let h and h' be the minimizers of F_S and F_{S'} respectively, where S and S' differ in the first coordinate (the choice of coordinate is without loss of generality). Then,

B_N(h' ‖ h) + B_N(h ‖ h') ≤ (2σ / (λm)) sup_{x ∈ S ∪ S'} |Δh(x)|.    (8)

To see this, we notice that, since B_F = B_{R̂} + λ B_N and since a Bregman divergence is non-negative,

λ (B_N(h' ‖ h) + B_N(h ‖ h')) ≤ B_{F_S}(h' ‖ h) + B_{F_{S'}}(h ‖ h').

By the definition of h and h' as the minimizers of F_S and F_{S'},

B_{F_S}(h' ‖ h) + B_{F_{S'}}(h ‖ h') = R̂_S(h') − R̂_S(h) + R̂_{S'}(h) − R̂_{S'}(h').

Finally, by the σ-admissibility of the cost function c and the definition of S and S',

λ (B_N(h' ‖ h) + B_N(h ‖ h')) ≤ R̂_S(h') − R̂_S(h) + R̂_{S'}(h) − R̂_{S'}(h')
= (1/m) (c(h', z_1) − c(h, z_1) + c(h, z_1') − c(h', z_1'))
≤ (σ/m) |Δh(x_1)| + (σ/m) |Δh(x_1')|
≤ (2σ/m) sup_{x ∈ S ∪ S'} |Δh(x)|,

which establishes (8). Now, if we consider N(·) = ‖·‖²_K, we have B_N(h' ‖ h) = ‖h' − h‖²_K; thus B_N(h' ‖ h) + B_N(h ‖ h') = 2‖Δh‖²_K and, by (8) and the reproducing kernel property,

2‖Δh‖²_K ≤ (2σ / (λm)) sup_{x ∈ S ∪ S'} |Δh(x)| ≤ (2σκ / (λm)) ‖Δh‖_K.

Thus ‖Δh‖_K ≤ σκ / (λm). Using the σ-admissibility of c and the kernel reproducing property, we obtain

∀z ∈ X × Y, |c(h', z) − c(h, z)| ≤ σ |Δh(x)| ≤ κσ ‖Δh‖_K.

Therefore,

∀z ∈ X × Y, |c(h', z) − c(h, z)| ≤ σ²κ² / (λm),

which completes the proof.

Three specific instances of kernel regularization algorithms are: SVR, for which the cost function is based on the ε-insensitive cost,

c(h, z) = 0 if |h(x) − y| ≤ ε, and |h(x) − y| − ε otherwise;

Kernel Ridge Regression (Saunders et al., 1998), for which c(h, z) = (h(x) − y)²; and finally Support Vector Machines with the hinge loss,

c(h, z) = 0 if y h(x) ≥ 1, and 1 − y h(x) if y h(x) < 1.

For kernel regularization algorithms, as pointed out in Bousquet and Elisseeff (2002, Lemma 23), a bound on the labels immediately implies a bound on the output of the hypothesis returned by the algorithm. We formally state this lemma below.

Lemma 14 Let h* be the solution of the optimization problem (7), let c be a cost function and let B(·) be a real-valued function such that for all h ∈ H, x ∈ X, and y' ∈ Y,

c(h(x), y') ≤ B(h(x)).

Then, the output of h* is bounded as follows:

∀x ∈ X, |h*(x)| ≤ κ √(B(0)/λ),

where λ is the regularization parameter and κ² ≥ K(x, x) for all x ∈ X.

Proof Let F(h) = (1/m) Σ_{i=1}^m c(h, z_i) + λ‖h‖²_K and let 0 denote the zero hypothesis. Then, by definition of F and h*,

λ ‖h*‖²_K ≤ F(h*) ≤ F(0) ≤ B(0).

Then, using the reproducing kernel property and the Cauchy-Schwarz inequality, we note that

∀x ∈ X, |h*(x)| = |⟨h*, K(x, ·)⟩| ≤ ‖h*‖_K √(K(x, x)) ≤ κ ‖h*‖_K.

Combining the two inequalities proves the lemma.

We note that in Bousquet and Elisseeff (2002), the following bound is also stated: c(h*(x), y') ≤ B(κ √(B(0)/λ)). However, when later applied, it seems that the authors use an incorrect upper bound function B(·), which we remedy in the following.

Corollary 15 Assume a bounded output Y = [0, B], for some B > 0, and assume that K(x, x) ≤ κ² for all x, for some κ > 0. Let h_S denote the hypothesis returned by the algorithm when trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution. Let u = r/(r + 1) ∈ (1/2, 1), M' = 2(r + 1)ϕ_0 M / (rϕ_0 M)^u, and ϕ_0' = (1 + 2ϕ_0 r/(r − 1)). Then, with probability at least 1 − δ, the following generalization bounds hold for

a. Support Vector Machines (SVM, with hinge loss):

R(h_S) ≤ R̂(h_S) + 8κ²/(λm) + 3M'(2κ²/(λm))^u + ϕ_0' (M + 3κ²/λ + M'(2κ²/λ)^u) √( 2 log(2/δ) / m^{2u−1} ),

where M = κ/√λ + B.

b. Support Vector Regression (SVR):

R(h_S) ≤ R̂(h_S) + 8κ²/(λm) + 3M'(2κ²/(λm))^u + ϕ_0' (M + 3κ²/λ + M'(2κ²/λ)^u) √( 2 log(2/δ) / m^{2u−1} ),

where M = κ √(B/λ) + B.

c. Kernel Ridge Regression (KRR):

R(h_S) ≤ R̂(h_S) + 32κ²B²/(λm) + 3M'(8κ²B²/(λm))^u + ϕ_0' (M + 12κ²B²/λ + M'(8κ²B²/λ)^u) √( 2 log(2/δ) / m^{2u−1} ),

where M = 2κ²B²/λ + B².

Proof For SVM, the hinge loss is 1-admissible, giving β̂ ≤ κ²/(λm). Using Lemma 14 with B(0) = 1, the loss can be bounded as follows: for all x ∈ X and y ∈ Y, |1 − y h*(x)| ≤ 1 + |h*(x)| ≤ κ/√λ + B.

Similarly, SVR has a loss function that is 1-admissible; thus, applying Lemma 13 gives β̂ ≤ κ²/(λm). Using Lemma 14 with B(0) = B, we can bound the loss as follows: for all x ∈ X and y ∈ Y, |h*(x) − y| ≤ κ √(B/λ) + B.

Finally, for KRR, we have a loss function that is 2B-admissible and, again using Lemma 13, β̂ ≤ 4κ²B²/(λm). Applying Lemma 14 with B(0) = B² gives, for all x ∈ X and y ∈ Y, (h*(x) − y)² ≤ 2κ²B²/λ + B².

Plugging these values into the bound of Theorem 12 and setting the right-hand side to δ yields the statement of the corollary.

3.5.2 Relative Entropy-Based Regularization Algorithms

In this section, we apply the results of Theorem 12 to a family of learning algorithms based on relative entropy regularization. These algorithms learn hypotheses h that are mixtures of base hypotheses in {h_θ : θ ∈ Θ}, where Θ is a measurable set. The output of these algorithms is a mixture g: Θ → R, that is, a distribution over Θ. Let G denote the set of all such distributions and let g_0 ∈ G be a fixed distribution. Relative entropy-based regularization algorithms output the solution of a minimization problem of the following form:

argmin_{g ∈ G} (1/m) Σ_{i=1}^m c(g, z_i) + λ D(g ‖ g_0),    (9)

where the cost function c: G × Z → R is defined in terms of a second internal cost function c': H × Z → R:

c(g, z) = ∫_Θ c'(h_θ, z) g(θ) dθ,

and where D(g ‖ g_0) is the relative entropy between g and g_0:

D(g ‖ g_0) = ∫_Θ g(θ) log( g(θ) / g_0(θ) ) dθ.

As shown by Bousquet and Elisseeff (2002, Theorem 24), a relative entropy-based regularization algorithm defined by (9) with bounded internal loss, c'(·) ≤ M, is β̂-stable with the following bound on the stability coefficient:

β̂ ≤ M² / (λm).

Theorem 12 combined with this inequality immediately yields the following generalization bound.

Corollary 16 Let h_S be the hypothesis solution of the optimization (9) trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution, with the internal cost function c' bounded by M. Then, with probability at least 1 − δ, the following holds:

R(h_S) ≤ R̂(h_S) + 8M²/(λm) + 3M'/(λm)^u + ϕ_0' (M + 3M²/λ + 2^u M'/λ^u) √( 2 log(2/δ) / m^{2u−1} ),

where u = r/(r + 1) ∈ (1/2, 1), M' = 2(r + 1)ϕ_0 M^{u+1} / (rϕ_0)^u, and ϕ_0' = (1 + 2ϕ_0 r/(r − 1)).

3.6 Discussion

The results presented here are, to the best of our knowledge, the first stability-based generalization bounds for the class of algorithms just studied in a non-i.i.d. scenario. These bounds are non-trivial when the condition λ ≫ 1/m^{1/2−1/r} on the regularization parameter holds for all large values of m. This condition coincides with the one obtained in the i.i.d. setting by Bousquet and Elisseeff (2002) in the limit, as r tends to infinity. The next section gives stability-based generalization bounds that hold even in the scenario of β-mixing sequences.

4. β-Mixing Generalization Bounds

In this section, we prove a stability-based generalization bound that only requires the training sequence to be drawn from a stationary β-mixing distribution. The bound is thus more general and covers the ϕ-mixing case analyzed in the previous section. However, unlike the ϕ-mixing case, the β-mixing bound presented here is not a purely exponential bound. It contains an additive term, which depends on the mixing coefficient.

As in the previous section, Φ(S) is defined by Φ(S) = R(h_S) − R̂(h_S). To simplify the presentation, here we define the generalization error of h_S by R(h_S) = E_z[c(h_S, z)]. Thus, test samples are assumed independent of S.(4) Note that for any block of points Z = z_1 ... z_k drawn independently of S, the following equality holds:

E_Z[(1/k) Σ_{z ∈ Z} c(h_S, z)] = (1/k) Σ_{i=1}^k E_Z[c(h_S, z_i)] = (1/k) Σ_{i=1}^k E_{z_i}[c(h_S, z_i)] = E_z[c(h_S, z)],

since, by stationarity, E_{z_i}[c(h_S, z_i)] = E_{z_j}[c(h_S, z_j)] for all 1 ≤ i, j ≤ k. Thus, for any such block Z, we can write R(h_S) = E_Z[(1/k) Σ_{z ∈ Z} c(h_S, z)]. For convenience, we extend the cost function c to blocks as follows:

c(h, Z) = (1/|Z|) Σ_{z ∈ Z} c(h, z).

With this notation, R(h_S) = E_Z[c(h_S, Z)] for any block drawn independently of S, regardless of the size of Z.

To derive a generalization bound for the β-mixing scenario, we apply McDiarmid's inequality (Theorem 10) to Φ defined over a sequence of independent blocks. The independent blocks we consider are non-symmetric and thus more general than those considered by previous authors (Yu, 1994; Meir, 2000).

4. In the β-mixing scenario, a result similar to that of Lemma 5 can be shown to hold in expectation with respect to the sample S. Using Markov's inequality, the inequality can be shown to hold with high probability. Thus, the results that follow can all be extended to the case where the test points depend on the training sample, at the expense of an additional confidence term.

Figure 3: Illustration of the sequences S_a and S_b derived from S that are considered in the proofs. The darkened regions are considered as being removed from the sequence.

From a sample S made of a sequence of m points, we construct two sequences of blocks, S_a and S_b, each containing µ blocks. Each block in S_a contains a points and each block in S_b contains b points (see Figure 3). S_a and S_b form a partitioning of S; for any a, b ∈ {0, ..., m} such that (a + b)µ = m, they are defined precisely as follows:

S_a = (Z_1^(a), ..., Z_µ^(a)), with Z_i^(a) = (z_{(i−1)(a+b)+1}, ..., z_{(i−1)(a+b)+a}),
S_b = (Z_1^(b), ..., Z_µ^(b)), with Z_i^(b) = (z_{(i−1)(a+b)+a+1}, ..., z_{(i−1)(a+b)+a+b}),

for all i ∈ {1, ..., µ}. We shall similarly consider sequences of i.i.d. blocks Z̃_i^(a) and Z̃_i^(b), i ∈ {1, ..., µ}, such that the points within each block are drawn according to the same original β-mixing distribution, and shall denote by S̃_a the block sequence (Z̃_1^(a), ..., Z̃_µ^(a)).

In preparation for the application of McDiarmid's inequality, we give a bound on the expectation of Φ(S̃_a). Since the expectation is taken over a sequence of i.i.d. blocks, this brings us to a situation similar to the i.i.d. scenario analyzed by Bousquet and Elisseeff (2002), with the exception that we are dealing with i.i.d. blocks instead of i.i.d. points.

Lemma 17 Let S̃_a be an independent block sequence as defined above. Then, the following bound holds for the expectation of Φ(S̃_a):

E_{S̃_a}[Φ(S̃_a)] ≤ 2a β̂.

Proof Since the blocks Z̃_i^(a) are independent, we can replace any one of them with any other block Z̃ drawn from the same distribution. However, changing the training set also changes the hypothesis, in a limited way. This is shown precisely below:

E_{S̃_a}[Φ(S̃_a)] = E_{S̃_a}[ (1/µ) Σ_{i=1}^µ c(h_{S̃_a}, Z̃_i^(a)) − E_Z[c(h_{S̃_a}, Z)] ]
= E_{S̃_a, Z̃}[ (1/µ) Σ_{i=1}^µ ( c(h_{S̃_a}, Z̃_i^(a)) − c(h_{S̃_a}, Z̃) ) ]
= E_{S̃_a, Z̃}[ (1/µ) Σ_{i=1}^µ ( c(h_{S̃_a^i}, Z̃) − c(h_{S̃_a}, Z̃) ) ],

where S̃_a^i corresponds to the block sequence S̃_a obtained by replacing the ith block with Z̃. The β̂-stability of the learning algorithm gives

E_{S̃_a, Z̃}[ (1/µ) Σ_{i=1}^µ ( c(h_{S̃_a^i}, Z̃) − c(h_{S̃_a}, Z̃) ) ] ≤ E_{S̃_a, Z̃}[ (1/µ) Σ_{i=1}^µ 2a β̂ ] ≤ 2a β̂.

We now relate the non-i.i.d. event Φ(S) ≥ ε to an independent block sequence event to which we can apply McDiarmid's inequality.

Lemma 18 Assume a β̂-stable algorithm. Then, for a sample S drawn from a β-mixing stationary distribution, the following bound holds:

Pr_S[ Φ(S) ≥ ε ] ≤ Pr_{S̃_a}[ Φ(S̃_a) − E[Φ(S̃_a)] ≥ ε_0 ] + (µ − 1) β(b),

where ε_0 = ε − µbM/m − 2µbβ̂ − E_{S̃_a}[Φ(S̃_a)].

Proof The proof consists of first rewriting the event in terms of S_a and S_b and bounding the error on the points in S_b in a trivial manner. This can be afforded since b will eventually be chosen to be small. Since |E_Z[c(h_S, Z)] − c(h_S, z')| ≤ M for any z' ∈ S_b, we can write

Pr_S[Φ(S) ≥ ε] = Pr_S[ R(h_S) − R̂(h_S) ≥ ε ]
= Pr_S[ E_Z[c(h_S, Z)] − (1/m) Σ_{z ∈ S} c(h_S, z) ≥ ε ]
≤ Pr_S[ (µa/m) E_Z[c(h_S, Z)] − (1/m) Σ_{z ∈ S_a} c(h_S, z) + (1/m) Σ_{z' ∈ S_b} |E_{Z'}[c(h_S, Z')] − c(h_S, z')| ≥ ε ]
≤ Pr_S[ (µa/m) E_Z[c(h_S, Z)] − (1/m) Σ_{z ∈ S_a} c(h_S, z) + µbM/m ≥ ε ].

By β̂-stability and µa ≤ m, this last term can be bounded as follows:

Pr_S[ (µa/m) E_Z[c(h_S, Z)] − (1/m) Σ_{z ∈ S_a} c(h_S, z) + µbM/m ≥ ε ]
≤ Pr_{S_a}[ E_Z[c(h_{S_a}, Z)] − (1/(µa)) Σ_{z ∈ S_a} c(h_{S_a}, z) + µbM/m + 2µbβ̂ ≥ ε ].

The right-hand side can be rewritten in terms of Φ and bounded in terms of a β-mixing coefficient:

Pr_{S_a}[ E_Z[c(h_{S_a}, Z)] − (1/(µa)) Σ_{z ∈ S_a} c(h_{S_a}, z) + µbM/m + 2µbβ̂ ≥ ε ]
= Pr_{S_a}[ Φ(S_a) + µbM/m + 2µbβ̂ ≥ ε ]
≤ Pr_{S̃_a}[ Φ(S̃_a) + µbM/m + 2µbβ̂ ≥ ε ] + (µ − 1) β(b),

by applying Lemma 3 to the indicator function of the event {Φ(S_a) + µbM/m + 2µbβ̂ ≥ ε}. Since E_{S̃_a}[Φ(S̃_a)] is a constant, the probability in this last term can be rewritten as

Pr_{S̃_a}[ Φ(S̃_a) + µbM/m + 2µbβ̂ ≥ ε ] = Pr_{S̃_a}[ Φ(S̃_a) − E_{S̃_a}[Φ(S̃_a)] ≥ ε − µbM/m − 2µbβ̂ − E_{S̃_a}[Φ(S̃_a)] ] = Pr_{S̃_a}[ Φ(S̃_a) − E_{S̃_a}[Φ(S̃_a)] ≥ ε_0 ],

which ends the proof of the lemma.

The last two lemmas will help us prove the main result of this section, formulated in the following theorem.

Theorem 19 Assume a β̂-stable algorithm and let ε' denote ε − µbM/m − 2µbβ̂ − 2aβ̂, as in Lemma 18. Then, for any sample S of size m drawn according to a β-mixing stationary distribution, any choice of the parameters a, b, µ > 0 such that (a + b)µ = m, and any ε > 0 such that ε' ≥ 0, the following generalization bound holds:

Pr_S[ R(h_S) − R̂(h_S) ≥ ε ] ≤ exp( −2ε'² / ( µ (4aβ̂ + (a + b)M/m)² ) ) + (µ − 1) β(b).

Proof To prove the statement of the theorem, it suffices to bound the probability term appearing in the right-hand side of the bound of Lemma 18, Pr_{S̃_a}[Φ(S̃_a) − E[Φ(S̃_a)] ≥ ε_0], which is expressed only in terms of independent blocks. We can therefore apply McDiarmid's inequality by viewing the blocks as i.i.d. points. To do so, we must bound the quantity |Φ(S̃_a) − Φ(S̃_a^i)|, where the sequences S̃_a and S̃_a^i differ in the ith block. We will bound separately the difference between the generalization errors and the difference between the empirical errors.(5) The difference in empirical errors can be bounded as follows, using the bound on the cost function c:

|R̂(h_{S̃_a}) − R̂(h_{S̃_a^i})| ≤ (1/µ) Σ_{j ≠ i} |c(h_{S̃_a}, Z_j) − c(h_{S̃_a^i}, Z_j)| + (1/µ) |c(h_{S̃_a}, Z_i) − c(h_{S̃_a^i}, Z_i')| ≤ 2aβ̂ + M/µ = 2aβ̂ + (a + b)M/m.

The difference in generalization errors can be straightforwardly bounded using β̂-stability:

|R(h_{S̃_a}) − R(h_{S̃_a^i})| = |E_Z[c(h_{S̃_a}, Z)] − E_Z[c(h_{S̃_a^i}, Z)]| = |E_Z[c(h_{S̃_a}, Z) − c(h_{S̃_a^i}, Z)]| ≤ 2aβ̂.

5. We drop the superscripts on Z_i^(a) since we will not be considering the sequence S_b in what follows.

Using these bounds in conjunction with McDiarmid's inequality yields

Pr_{S̃_a}[ Φ(S̃_a) − E[Φ(S̃_a)] ≥ ε_0 ] ≤ exp( −2ε_0² / ( µ (4aβ̂ + (a + b)M/m)² ) ) ≤ exp( −2ε'² / ( µ (4aβ̂ + (a + b)M/m)² ) ).

Note that to show the second inequality we make use of Lemma 17 to establish the fact that

ε_0 = ε − µbM/m − 2µbβ̂ − E[Φ(S̃_a)] ≥ ε − µbM/m − 2µbβ̂ − 2aβ̂ = ε'.

Finally, we make use of Lemma 18 to establish the proof:

Pr_S[Φ(S) ≥ ε] ≤ Pr_{S̃_a}[ Φ(S̃_a) − E[Φ(S̃_a)] ≥ ε_0 ] + (µ − 1)β(b) ≤ exp( −2ε'² / ( µ (4aβ̂ + (a + b)M/m)² ) ) + (µ − 1)β(b).

This concludes the proof of the theorem.

In order to make use of this bound, we must determine the values of the parameters b and µ (a is then equal to m/µ − b). There is a trade-off between selecting a large enough value for b, to ensure that the mixing term decreases, and choosing a large enough value of µ, to minimize the remaining terms of the bound. The exact choice of parameters will depend on the type of mixing that is assumed (e.g., algebraic or exponential). In order to choose optimal parameters, it will be useful to view the bound as it holds with high probability, in the following corollary.

Corollary 20 Assume a β̂-stable algorithm and let δ' denote δ − (µ − 1)β(b). Then, for any sample S of size m drawn according to a β-mixing stationary distribution, any choice of the parameters a, b, µ > 0 such that (a + b)µ = m, and any δ > 0 such that δ' > 0, the following generalization bound holds with probability at least (1 − δ):

R(h_S) − R̂(h_S) < µb (M/m + 2β̂) + 2aβ̂ + (4aβ̂ + (a + b)M/m) √( µ log(1/δ') / 2 ).

In the case of a fast mixing distribution, it is possible to select the values of the parameters to retrieve a bound as in the i.i.d. case, that is, R(h_S) − R̂(h_S) = O( √( log(1/δ) / m ) ). In particular, for β(b) → 0, we can choose b = 0, a = 1, and µ = m to retrieve the i.i.d. bound of Bousquet and Elisseeff (2001). In the following, we examine slower mixing algebraic β-mixing distributions, which are thus not close to the i.i.d. scenario. For algebraic mixing, the mixing parameter is defined as β(b) = b^{−r}. In that case, we wish to minimize the following function in terms of µ and b:

s(µ, b) = µ b^{−r} + m^{3/2} β̂ (1/µ + 1/√µ) + µb (1/m + β̂).    (10)
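The parameter trade-off just described can also be explored numerically. The sketch below is not from the paper: rather than minimizing the simplified objective s(µ, b) of Equation 10, it directly evaluates the high-probability bound of Corollary 20, as reconstructed above, over a coarse grid of admissible (µ, b) pairs. The choices m = 2^20, β̂ = 1/m, r, δ, and the power-of-two search grid are illustrative assumptions.

```python
import numpy as np

def corollary20_bound(m, mu, b, beta_hat, M=1.0, r=2.0, delta=0.05):
    """Evaluate the high-probability bound of Corollary 20 for an algebraically
    beta-mixing process (beta(b) = b**-r). Returns None for inadmissible parameters
    ((a+b)*mu must equal m, a must be positive, and delta' must stay positive)."""
    if m % mu != 0 or m // mu <= b:
        return None
    a = m // mu - b
    delta_p = delta - (mu - 1) * b ** (-r)
    if delta_p <= 0:
        return None
    dev = (4 * a * beta_hat + (a + b) * M / m) * np.sqrt(mu * np.log(1 / delta_p) / 2)
    return mu * b * (M / m + 2 * beta_hat) + 2 * a * beta_hat + dev

m = 2 ** 20
beta_hat = 1.0 / m
best = None
for mu in [2 ** k for k in range(1, 21)]:                 # block counts dividing m
    for b in [2 ** j for j in range(0, 20) if 2 ** j < m // mu]:
        val = corollary20_bound(m, mu, b, beta_hat)
        if val is not None and (best is None or val < best[0]):
            best = (val, mu, b)

print(f"best bound {best[0]:.4f} at mu={best[1]}, b={best[2]}")
```

The search exhibits exactly the tension described above: increasing b shrinks the mixing penalty (µ − 1) b^{−r} but inflates the µb terms, while increasing µ shrinks the block size a but inflates both the mixing penalty and the √µ deviation factor.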


More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

List Scheduling and LPT Oliver Braun (09/05/2017)

List Scheduling and LPT Oliver Braun (09/05/2017) List Scheduling and LPT Oliver Braun (09/05/207) We investigate the classical scheduling proble P ax where a set of n independent jobs has to be processed on 2 parallel and identical processors (achines)

More information

Robustness and Regularization of Support Vector Machines

Robustness and Regularization of Support Vector Machines Robustness and Regularization of Support Vector Machines Huan Xu ECE, McGill University Montreal, QC, Canada xuhuan@ci.cgill.ca Constantine Caraanis ECE, The University of Texas at Austin Austin, TX, USA

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

Block designs and statistics

Block designs and statistics Bloc designs and statistics Notes for Math 447 May 3, 2011 The ain paraeters of a bloc design are nuber of varieties v, bloc size, nuber of blocs b. A design is built on a set of v eleents. Each eleent

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic

More information

An EGZ generalization for 5 colors

An EGZ generalization for 5 colors An EGZ generalization for 5 colors David Grynkiewicz and Andrew Schultz July 6, 00 Abstract Let g zs(, k) (g zs(, k + 1)) be the inial integer such that any coloring of the integers fro U U k 1,..., g

More information

PAC-Bayes Analysis Of Maximum Entropy Learning

PAC-Bayes Analysis Of Maximum Entropy Learning PAC-Bayes Analysis Of Maxiu Entropy Learning John Shawe-Taylor and David R. Hardoon Centre for Coputational Statistics and Machine Learning Departent of Coputer Science University College London, UK, WC1E

More information

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine

Lecture October 23. Scribes: Ruixin Qiang and Alana Shine CSCI699: Topics in Learning and Gae Theory Lecture October 23 Lecturer: Ilias Scribes: Ruixin Qiang and Alana Shine Today s topic is auction with saples. 1 Introduction to auctions Definition 1. In a single

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 018: Convex Optiization and Approxiation Instructor: Moritz Hardt Eail: hardt+ee7c@berkeley.edu Graduate Instructor: Max Sichowitz Eail: sichow+ee7c@berkeley.edu October 15,

More information

Structured Prediction Theory Based on Factor Graph Complexity

Structured Prediction Theory Based on Factor Graph Complexity Structured Prediction Theory Based on Factor Graph Coplexity Corinna Cortes Google Research New York, NY 00 corinna@googleco Mehryar Mohri Courant Institute and Google New York, NY 00 ohri@cisnyuedu Vitaly

More information

Testing Properties of Collections of Distributions

Testing Properties of Collections of Distributions Testing Properties of Collections of Distributions Reut Levi Dana Ron Ronitt Rubinfeld April 9, 0 Abstract We propose a fraework for studying property testing of collections of distributions, where the

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

Improved Guarantees for Agnostic Learning of Disjunctions

Improved Guarantees for Agnostic Learning of Disjunctions Iproved Guarantees for Agnostic Learning of Disjunctions Pranjal Awasthi Carnegie Mellon University pawasthi@cs.cu.edu Avri Blu Carnegie Mellon University avri@cs.cu.edu Or Sheffet Carnegie Mellon University

More information

The degree of a typical vertex in generalized random intersection graph models

The degree of a typical vertex in generalized random intersection graph models Discrete Matheatics 306 006 15 165 www.elsevier.co/locate/disc The degree of a typical vertex in generalized rando intersection graph odels Jerzy Jaworski a, Michał Karoński a, Dudley Stark b a Departent

More information

Multi-Dimensional Hegselmann-Krause Dynamics

Multi-Dimensional Hegselmann-Krause Dynamics Multi-Diensional Hegselann-Krause Dynaics A. Nedić Industrial and Enterprise Systes Engineering Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu B. Touri Coordinated Science Laboratory

More information

Solutions of some selected problems of Homework 4

Solutions of some selected problems of Homework 4 Solutions of soe selected probles of Hoework 4 Sangchul Lee May 7, 2018 Proble 1 Let there be light A professor has two light bulbs in his garage. When both are burned out, they are replaced, and the next

More information

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011)

E0 370 Statistical Learning Theory Lecture 5 (Aug 25, 2011) E0 370 Statistical Learning Theory Lecture 5 Aug 5, 0 Covering Nubers, Pseudo-Diension, and Fat-Shattering Diension Lecturer: Shivani Agarwal Scribe: Shivani Agarwal Introduction So far we have seen how

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint.

A := A i : {A i } S. is an algebra. The same object is obtained when the union in required to be disjoint. 59 6. ABSTRACT MEASURE THEORY Having developed the Lebesgue integral with respect to the general easures, we now have a general concept with few specific exaples to actually test it on. Indeed, so far

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

VC Dimension and Sauer s Lemma

VC Dimension and Sauer s Lemma CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions

More information

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material

Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material Consistent Multiclass Algoriths for Coplex Perforance Measures Suppleentary Material Notations. Let λ be the base easure over n given by the unifor rando variable (say U over n. Hence, for all easurable

More information

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany.

New upper bound for the B-spline basis condition number II. K. Scherer. Institut fur Angewandte Mathematik, Universitat Bonn, Bonn, Germany. New upper bound for the B-spline basis condition nuber II. A proof of de Boor's 2 -conjecture K. Scherer Institut fur Angewandte Matheati, Universitat Bonn, 535 Bonn, Gerany and A. Yu. Shadrin Coputing

More information

Bayes Decision Rule and Naïve Bayes Classifier

Bayes Decision Rule and Naïve Bayes Classifier Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.

More information

Accuracy at the Top. Abstract

Accuracy at the Top. Abstract Accuracy at the Top Stephen Boyd Stanford University Packard 264 Stanford, CA 94305 boyd@stanford.edu Mehryar Mohri Courant Institute and Google 25 Mercer Street New York, NY 002 ohri@cis.nyu.edu Corinna

More information

LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH

LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH LORENTZ SPACES AND REAL INTERPOLATION THE KEEL-TAO APPROACH GUILLERMO REY. Introduction If an operator T is bounded on two Lebesgue spaces, the theory of coplex interpolation allows us to deduce the boundedness

More information

THE CONSTRUCTION OF GOOD EXTENSIBLE RANK-1 LATTICES. 1. Introduction We are interested in approximating a high dimensional integral [0,1]

THE CONSTRUCTION OF GOOD EXTENSIBLE RANK-1 LATTICES. 1. Introduction We are interested in approximating a high dimensional integral [0,1] MATHEMATICS OF COMPUTATION Volue 00, Nuber 0, Pages 000 000 S 0025-578(XX)0000-0 THE CONSTRUCTION OF GOOD EXTENSIBLE RANK- LATTICES JOSEF DICK, FRIEDRICH PILLICHSHAMMER, AND BENJAMIN J. WATERHOUSE Abstract.

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Estimating Entropy and Entropy Norm on Data Streams

Estimating Entropy and Entropy Norm on Data Streams Estiating Entropy and Entropy Nor on Data Streas Ait Chakrabarti 1, Khanh Do Ba 1, and S. Muthukrishnan 2 1 Departent of Coputer Science, Dartouth College, Hanover, NH 03755, USA 2 Departent of Coputer

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Note on generating all subsets of a finite set with disjoint unions

Note on generating all subsets of a finite set with disjoint unions Note on generating all subsets of a finite set with disjoint unions David Ellis e-ail: dce27@ca.ac.uk Subitted: Dec 2, 2008; Accepted: May 12, 2009; Published: May 20, 2009 Matheatics Subject Classification:

More information

Introduction to Discrete Optimization

Introduction to Discrete Optimization Prof. Friedrich Eisenbrand Martin Nieeier Due Date: March 9 9 Discussions: March 9 Introduction to Discrete Optiization Spring 9 s Exercise Consider a school district with I neighborhoods J schools and

More information

Distributed Subgradient Methods for Multi-agent Optimization

Distributed Subgradient Methods for Multi-agent Optimization 1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions

More information

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials Fast Montgoery-like Square Root Coputation over GF( ) for All Trinoials Yin Li a, Yu Zhang a, a Departent of Coputer Science and Technology, Xinyang Noral University, Henan, P.R.China Abstract This letter

More information

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks

Bounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University

More information

Supplement to: Subsampling Methods for Persistent Homology

Supplement to: Subsampling Methods for Persistent Homology Suppleent to: Subsapling Methods for Persistent Hoology A. Technical results In this section, we present soe technical results that will be used to prove the ain theores. First, we expand the notation

More information

arxiv: v1 [math.nt] 14 Sep 2014

arxiv: v1 [math.nt] 14 Sep 2014 ROTATION REMAINDERS P. JAMESON GRABER, WASHINGTON AND LEE UNIVERSITY 08 arxiv:1409.411v1 [ath.nt] 14 Sep 014 Abstract. We study properties of an array of nubers, called the triangle, in which each row

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Some Classical Ergodic Theorems

Some Classical Ergodic Theorems Soe Classical Ergodic Theores Matt Rosenzweig Contents Classical Ergodic Theores. Mean Ergodic Theores........................................2 Maxial Ergodic Theore.....................................

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies

Lost-Sales Problems with Stochastic Lead Times: Convexity Results for Base-Stock Policies OPERATIONS RESEARCH Vol. 52, No. 5, Septeber October 2004, pp. 795 803 issn 0030-364X eissn 1526-5463 04 5205 0795 infors doi 10.1287/opre.1040.0130 2004 INFORMS TECHNICAL NOTE Lost-Sales Probles with

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 11 10/15/2008 ABSTRACT INTEGRATION I

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 11 10/15/2008 ABSTRACT INTEGRATION I MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.436J/15.085J Fall 2008 Lecture 11 10/15/2008 ABSTRACT INTEGRATION I Contents 1. Preliinaries 2. The ain result 3. The Rieann integral 4. The integral of a nonnegative

More information

Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions

Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions Corinna Cortes Spencer Greenberg Mehryar Mohri January, 9 Abstract We present an extensive analysis of relative deviation

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

Multi-kernel Regularized Classifiers

Multi-kernel Regularized Classifiers Multi-kernel Regularized Classifiers Qiang Wu, Yiing Ying, and Ding-Xuan Zhou Departent of Matheatics, City University of Hong Kong Kowloon, Hong Kong, CHINA wu.qiang@student.cityu.edu.hk, yying@cityu.edu.hk,

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

Tail estimates for norms of sums of log-concave random vectors

Tail estimates for norms of sums of log-concave random vectors Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Feature Extraction Techniques

Feature Extraction Techniques Feature Extraction Techniques Unsupervised Learning II Feature Extraction Unsupervised ethods can also be used to find features which can be useful for categorization. There are unsupervised ethods that

More information

Support Vector Machines MIT Course Notes Cynthia Rudin

Support Vector Machines MIT Course Notes Cynthia Rudin Support Vector Machines MIT 5.097 Course Notes Cynthia Rudin Credit: Ng, Hastie, Tibshirani, Friedan Thanks: Şeyda Ertekin Let s start with soe intuition about argins. The argin of an exaple x i = distance

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information