Stability Bounds for Non-i.i.d. Processes

Mehryar Mohri
Courant Institute of Mathematical Sciences and Google Research
251 Mercer Street, New York, NY 10012

Afshin Rostamizadeh
Department of Computer Science, Courant Institute of Mathematical Sciences
251 Mercer Street, New York, NY 10012

Abstract

The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds [2-4, 6]. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weakens over time. It proves novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application to several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.

1 Introduction

The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds [2-4, 6]. A learning algorithm is stable when the hypotheses it outputs differ in a limited way when small changes are made to the training set. A key advantage of stability bounds is that they are tailored to specific learning algorithms, exploiting their particular properties. They do not depend on complexity measures such as the VC-dimension, covering numbers, or Rademacher complexity, which characterize a class of hypotheses independently of any algorithm.

But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). Note that the i.i.d. assumption is typically not tested or derived from a data analysis. In many machine learning applications this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. A typical example of time series data is stock pricing, where prices of different stocks on the same day, or of the same stock on different days, may clearly be dependent.

This paper studies the scenario where the observations are drawn from a stationary mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations that weakens over time [8, 10, 16, 17]. Our proofs are also based on the independent block technique commonly used in such contexts [17] and a generalized version of McDiarmid's inequality [7]. We prove novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the usefulness of stability bounds to non-i.i.d. scenarios.

It also illustrates their application to general classes of learning algorithms, including Support Vector Regression (SVR) [15] and Kernel Ridge Regression [13].

Algorithms such as support vector regression (SVR) [14, 15] have been used in the context of time series prediction, in which the i.i.d. assumption does not hold, some with good experimental results [9, 12]. To our knowledge, the use of these algorithms in non-i.i.d. scenarios has not been supported by any theoretical analysis. The stability bounds we give for SVR and many other kernel regularization-based algorithms can thus be viewed as the first theoretical basis for their use in such scenarios.

In Section 2, we introduce the definitions for the non-i.i.d. problems we are considering and discuss the learning scenarios. Section 3 gives our main generalization bounds based on stability, including the full proof and analysis. In Section 4, we apply these bounds to general kernel regularization-based algorithms, including Support Vector Regression and Kernel Ridge Regression.

2 Preliminaries

We first introduce some standard definitions for dependent observations in mixing theory [5] and then briefly discuss the learning scenarios in the non-i.i.d. case.

2.1 Non-i.i.d. Definitions

Definition 1. A sequence of random variables Z = {Z_t}_{t=-∞}^{+∞} is said to be stationary if for any t and non-negative integers m and k, the random vectors (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution.

Thus, the index t, or time, does not affect the distribution of a variable Z_t in a stationary sequence. This does not imply independence, however. In particular, for i < j < k, Pr[Z_j | Z_i] may not equal Pr[Z_k | Z_i]. The following standard definition gives a measure of the dependence of the random variables Z_t within a stationary sequence. There are several equivalent definitions of this quantity; we adopt here that of [17].

Definition 2. Let Z = {Z_t}_{t=-∞}^{+∞} be a stationary sequence of random variables. For any i, j ∈ Z ∪ {-∞, +∞}, let σ_i^j denote the σ-algebra generated by the random variables Z_k, i ≤ k ≤ j. Then, for any positive integer k, the β-mixing and ϕ-mixing coefficients of the stochastic process Z are defined as

  β(k) = sup_n E_{B ∈ σ_{-∞}^n} [ sup_{A ∈ σ_{n+k}^{+∞}} |Pr[A | B] - Pr[A]| ],    ϕ(k) = sup_{n, A ∈ σ_{n+k}^{+∞}, B ∈ σ_{-∞}^n} |Pr[A | B] - Pr[A]|.   (1)

Z is said to be β-mixing (ϕ-mixing) if β(k) → 0 (resp. ϕ(k) → 0) as k → ∞. It is said to be algebraically β-mixing (algebraically ϕ-mixing) if there exist real numbers β_0 > 0 (resp. ϕ_0 > 0) and r > 0 such that β(k) ≤ β_0/k^r (resp. ϕ(k) ≤ ϕ_0/k^r) for all k, and exponentially mixing if there exist real numbers β_0 (resp. ϕ_0 > 0) and β_1 (resp. ϕ_1 > 0) such that β(k) ≤ β_0 exp(-β_1 k^r) (resp. ϕ(k) ≤ ϕ_0 exp(-ϕ_1 k^r)) for all k.

Both β(k) and ϕ(k) measure the dependence of events on those that occurred more than k units of time in the past. β-mixing is a weaker assumption than ϕ-mixing. We will be using a concentration inequality that leads to simple bounds but that applies to ϕ-mixing processes only. However, the main proofs presented in this paper are given in the more general case of β-mixing sequences. This is a standard assumption adopted in previous studies of learning in the presence of dependent observations [8, 10, 16, 17]. As pointed out in [16], β-mixing seems to be just the right assumption for carrying over several PAC-learning results to the case of weakly dependent sample points. Several results have also been obtained in the more general context of α-mixing, but they seem to require the stronger condition of exponential mixing. Mixing assumptions can be checked in some cases, such as with Gaussian or Markov processes [10]. The mixing parameters can also be estimated in such cases.
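As a concrete, if minimal, illustration of such a process (not part of the original paper; the function name and parameter values are assumptions of this sketch), the following Python code samples a stationary Gaussian AR(1) sequence, a standard example of a Markov process whose dependence decays with the time gap k:

    import numpy as np

    def stationary_ar1(m, a=0.8, sigma=1.0, seed=0):
        """Sample m points from a stationary Gaussian AR(1) process.

        Z_t = a * Z_{t-1} + eps_t with |a| < 1 is stationary when Z_0 is drawn
        from N(0, sigma^2 / (1 - a^2)); Gaussian Markov processes of this kind
        are mixing, so the dependence between Z_t and Z_{t+k} weakens with k.
        """
        rng = np.random.default_rng(seed)
        z = np.empty(m)
        z[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - a ** 2))
        for t in range(1, m):
            z[t] = a * z[t - 1] + rng.normal(0.0, sigma)
        return z

    # The lag-k correlation a**k decays with k, mirroring a decaying mixing coefficient.
    z = stationary_ar1(1000)
    for k in (1, 5, 20):
        print(k, np.corrcoef(z[:-k], z[k:])[0, 1])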

Most previous studies use a technique originally introduced by [1] based on independent blocks of equal size [8, 10, 17]. This technique is particularly relevant when dealing with stationary β-mixing sequences. We will need a related but somewhat different technique, since the blocks we consider may not have the same size. The following lemma is a special case of Corollary 2.7 from [17].

Lemma 1 (Yu [17], Corollary 2.7). Let μ ≥ 1 and suppose that h is a measurable function, with absolute value bounded by M, on a product probability space (∏_{j=1}^{μ} Ω_j, ∏_{j=1}^{μ} σ_{r_j}^{s_j}), where r_i ≤ s_i ≤ r_{i+1} for all i. Let Q be a probability measure on the product space with marginal measures Q_i on (Ω_i, σ_{r_i}^{s_i}), and let Q^{i+1} be the marginal measure of Q on (∏_{j=1}^{i+1} Ω_j, ∏_{j=1}^{i+1} σ_{r_j}^{s_j}), i = 1, ..., μ - 1. Let β(Q) = sup_{1 ≤ i ≤ μ-1} β(k_i), where k_i = r_{i+1} - s_i, and let P = ∏_{i=1}^{μ} Q_i. Then,

  |E_Q[h] - E_P[h]| ≤ (μ - 1) M β(Q).   (2)

The lemma gives a measure of the difference between the distribution of μ blocks where the blocks are independent in one case and dependent in the other case. The distribution within each block is assumed to be the same in both cases. For a monotonically decreasing function β, we have β(Q) = β(k*), where k* = min_i(k_i) is the smallest gap between blocks.

2.2 Learning Scenarios

We consider the familiar supervised learning setting where the learning algorithm receives a sample of m labeled points S = (z_1, ..., z_m) = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m, where X is the input space and Y the set of labels (Y = R in the regression case), both assumed to be measurable. For a fixed learning algorithm, we denote by h_S the hypothesis it returns when trained on the sample S.

The error of a hypothesis on a pair z ∈ X × Y is measured in terms of a cost function c : Y × Y → R_+. Thus, c(h(x), y) measures the error of a hypothesis h on a pair (x, y); c(h(x), y) = (h(x) - y)^2 in the standard regression case. We will use the shorthand c(h, z) := c(h(x), y) for a hypothesis h and z = (x, y) ∈ X × Y, and will assume that c is upper bounded by a constant M > 0. We denote by R̂(h) the empirical error of a hypothesis h for a training sample S = (z_1, ..., z_m):

  R̂(h) = (1/m) ∑_{i=1}^{m} c(h, z_i).   (3)

In the standard machine learning scenario, the sample pairs z_1, ..., z_m are assumed to be i.i.d., a restrictive assumption that does not always hold in practice. We will consider here the more general case of dependent samples drawn from a stationary mixing sequence Z over X × Y. As in the i.i.d. case, the objective of the learning algorithm is to select a hypothesis with small error over future samples. But, here, we must distinguish two versions of this problem.

In the most general version, future samples depend on the training sample S, and thus the generalization error or true error of the hypothesis h_S trained on S must be measured by its expected error conditioned on the sample S:

  R(h_S) = E_z[c(h_S, z) | S].   (4)

This is the most realistic setting in this context, and it matches time series prediction problems. A somewhat less realistic version is one where the samples are dependent, but the test points are assumed to be independent of the training sample S. The generalization error of the hypothesis h_S trained on S is then:

  R(h_S) = E_z[c(h_S, z) | S] = E_z[c(h_S, z)].   (5)

This setting seems less natural since, if samples are dependent, then future test points must also depend on the training points, even if that dependence is relatively weak due to the time interval after which test points are drawn. Nevertheless, it is this somewhat less realistic setting that has been studied by all previous machine learning studies that we are aware of [8, 10, 16, 17], even when examining specifically a time series prediction problem [10]. Thus, the bounds derived in these studies cannot be applied to the more general setting. We will consider instead the most general setting, with the definition of the generalization error based on Eq. (4). Clearly, our analysis applies to the less general setting just discussed as well.
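For concreteness, here is a small Python sketch of the empirical error R̂(h) of Eq. (3). This is an illustration of my own, not code from the paper; the helper name and the choice of the squared cost are assumptions.

    import numpy as np

    def empirical_error(h, S, cost=lambda pred, y: (pred - y) ** 2):
        """R_hat(h) = (1/m) * sum_i c(h, z_i), Eq. (3), with the squared cost
        of the standard regression case; c is assumed bounded by some M > 0."""
        return float(np.mean([cost(h(x), y) for (x, y) in S]))

    # Toy usage: a fixed hypothesis evaluated on a small sample S = ((x_1, y_1), ...).
    S = [(0.0, 0.1), (0.5, 0.4), (1.0, 1.2)]
    h = lambda x: x  # identity predictor
    print(empirical_error(h, S))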

3 Non-i.i.d. Stability Bounds

This section gives generalization bounds for β̂-stable algorithms over a mixing stationary distribution. The first two subsections present our main proofs, which hold for β-mixing stationary distributions. In the third subsection, we use a concentration inequality that applies to ϕ-mixing processes only.

The condition of β̂-stability is an algorithm-dependent property first introduced in [4] and [6]. It was later used successfully by [2, 3] to show algorithm-specific stability bounds for i.i.d. samples. Roughly speaking, a learning algorithm is said to be stable if small changes to the training set do not produce large deviations in its output. The following gives the precise technical definition. (The standard variable used for the stability coefficient is β; to avoid confusion with the β-mixing coefficient, we use β̂ instead.)

Definition 3. A learning algorithm is said to be (uniformly) β̂-stable if the hypotheses it returns for any two training samples S and S' that differ by a single point satisfy

  ∀z ∈ X × Y, |c(h_S, z) - c(h_{S'}, z)| ≤ β̂.   (6)

Many generalization error bounds rely on McDiarmid's inequality. But this inequality requires the random variables to be i.i.d. and thus is not directly applicable in our scenario. Instead, we will use a theorem that extends McDiarmid's inequality to general mixing distributions (Theorem 1, Section 3.3). To obtain a stability-based generalization bound, we will apply this theorem to Φ(S) = R(h_S) - R̂(h_S). To do so, we need to show, as with the standard McDiarmid's inequality, that Φ is a Lipschitz function and, to make the bound useful, bound E[Φ(S)]. The next two subsections describe how we achieve both of these in this non-i.i.d. scenario.

3.1 Lipschitz Condition

As discussed in Section 2.2, in the most general scenario, test points depend on the training sample. We first present a lemma that relates the expected value of the generalization error in that scenario to the same expectation in the scenario where the test point is independent of the training sample. We denote by R(h_S) = E_z[c(h_S, z) | S] the expectation in the dependent case and by R̃(h_{S_b}) = E_{z̃}[c(h_{S_b}, z̃)] that expectation when the test point z̃ is assumed independent of the training sample, with S_b denoting a sequence similar to S but with the last b points removed. Figure 1(a) illustrates that sequence. The block containing z̃ is assumed to have exactly the same distribution as the corresponding block of the same size in S.

Lemma 2. Assume that the learning algorithm is β̂-stable and that the cost function c is bounded by M. Then, for any sample S of size m drawn from a β-mixing stationary distribution and for any b ∈ {0, ..., m}, the following holds:

  |E_S[R(h_S)] - E_S[R̃(h_{S_b})]| ≤ b β̂ + β(b) M.   (7)

Proof. The β̂-stability of the learning algorithm implies that

  E_S[R(h_S)] = E_{S,z}[c(h_S, z)] ≤ E_{S,z}[c(h_{S_b}, z)] + b β̂.   (8)

The application of Lemma 1 yields

  E_S[R(h_S)] ≤ E_{S,z̃}[c(h_{S_b}, z̃)] + b β̂ + β(b) M = E_S[R̃(h_{S_b})] + b β̂ + β(b) M.   (9)

The other side of the inequality of the lemma can be shown following the same steps.

We can now prove a Lipschitz bound for the function Φ.
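As a brief aside on Definition 3 before the Lipschitz bound, the following sketch (an illustration of my own, not the paper's) probes β̂-stability numerically for a simple regularized least-squares learner by replacing a single training point and measuring the largest change in cost over a finite set of points; this only lower-bounds the supremum in Eq. (6), so it is a sanity check rather than a proof of stability.

    import numpy as np

    def train_ridge(S, lam):
        """Regularized least squares on scalar inputs: a simple stable learner."""
        X = np.array([[x] for x, _ in S])
        y = np.array([y for _, y in S])
        w = np.linalg.solve(X.T @ X + lam * len(S) * np.eye(1), X.T @ y)
        return lambda x: float(w[0] * x)

    def stability_estimate(S, S_prime, test_points, lam=1.0):
        """Largest change in the cost c(h, z) = (h(x) - y)^2 over test_points when
        a single training point is replaced; a finite-sample proxy for Eq. (6)."""
        h, h_prime = train_ridge(S, lam), train_ridge(S_prime, lam)
        return max(abs((h(x) - y) ** 2 - (h_prime(x) - y) ** 2) for x, y in test_points)

    rng = np.random.default_rng(1)
    S = [(x, 2 * x + rng.normal(0.0, 0.1)) for x in rng.uniform(0.0, 1.0, 50)]
    S_prime = list(S)
    S_prime[0] = (0.5, -3.0)  # replace a single point
    print(stability_estimate(S, S_prime, S))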

[Figure 1, panels (a)-(d): Illustration of the sequences derived from S that are considered in the proofs.]

Lemma 3. Let S = (z_1, ..., z_m) and S^i = (z_1, ..., z_{i-1}, z_i', z_{i+1}, ..., z_m) be two sequences drawn from a β-mixing stationary process that differ only in point i ∈ [1, m], and let h_S and h_{S^i} be the hypotheses returned by a β̂-stable algorithm when trained on each of these samples. Then, for any i ∈ [1, m] and any b ∈ [0, m], the following inequality holds:

  |Φ(S) - Φ(S^i)| ≤ (b + 1) 2β̂ + 2β(b) M + M/m.   (10)

Proof. To prove this inequality, we first bound the difference of the empirical errors as in [3], then the difference of the true errors. Bounding the difference of costs on the agreeing points with β̂ and on the point that disagrees with M yields

  |R̂(h_S) - R̂(h_{S^i})| = (1/m) |∑_{j=1}^{m} c(h_S, z_j) - c(h_{S^i}, z_j)|
    = (1/m) |∑_{j≠i} (c(h_S, z_j) - c(h_{S^i}, z_j)) + c(h_S, z_i) - c(h_{S^i}, z_i')|
    ≤ β̂ + M/m.   (11)

Now, applying Lemma 2 to both generalization error terms and using β̂-stability results in

  |R(h_S) - R(h_{S^i})| ≤ |R̃(h_{S_b}) - R̃(h_{S^i_b})| + 2bβ̂ + 2β(b)M
    = |E_{z̃}[c(h_{S_b}, z̃) - c(h_{S^i_b}, z̃)]| + 2bβ̂ + 2β(b)M
    ≤ β̂ + 2bβ̂ + 2β(b)M.   (12)

The lemma's statement is obtained by combining inequalities (11) and (12).

3.2 Bound on E[Φ]

As mentioned earlier, to make the bound useful, we also need to bound E_S[Φ(S)]. This is done by analyzing independent blocks using Lemma 1.

Lemma 4. Let h_S be the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from a stationary β-mixing distribution. Then, for all b ∈ [1, m], the following inequality holds:

  |E_S[Φ(S)]| ≤ (6b + 1) β̂ + 3β(b) M.   (13)

Proof. We first analyze the term E_S[R̂(h_S)]. Let S_i be the sequence S with the b points before and after point z_i removed. Figure 1(b) illustrates this definition. S_i is thus made of three blocks. Let S̃_i denote a similar set of three blocks, each with the same distribution as the corresponding block in S_i, but such that the three blocks are independent. In particular, the middle block, reduced to the single point z_i, is independent of the two others. By the β̂-stability of the algorithm,

  E_S[R̂(h_S)] = (1/m) ∑_{i=1}^{m} E_S[c(h_S, z_i)] ≤ (1/m) ∑_{i=1}^{m} E_{S_i}[c(h_{S_i}, z_i)] + 2bβ̂.   (14)

Applying Lemma 1 to the first term of the right-hand side yields

  E_S[R̂(h_S)] ≤ (1/m) ∑_{i=1}^{m} E_{S̃_i}[c(h_{S̃_i}, z̃_i)] + 2bβ̂ + 2β(b)M.   (15)

Combining the independent block sequences associated with R̂(h_S) and R(h_S) will help us prove the lemma in a way similar to the i.i.d. case treated in [3]. Let S_b be defined as in the proof of Lemma 2. To deal with independent block sequences defined with respect to the same hypothesis, we consider the sequence S_{i,b} = S_i ∩ S_b, which is illustrated by Figure 1(c). This can result in as many as four blocks. As before, we consider a sequence S̃_{i,b} with a similar set of blocks, each with the same distribution as the corresponding block in S_{i,b}, but such that the blocks are independent. Since three blocks of at most b points are removed from each hypothesis, by the β̂-stability of the learning algorithm, the following holds:

  E_S[Φ(S)] = E_S[R(h_S) - R̂(h_S)] = E_{S,z}[(1/m) ∑_{i=1}^{m} (c(h_S, z) - c(h_S, z_i))]   (16)
    ≤ E_{S_{i,b},z}[(1/m) ∑_{i=1}^{m} (c(h_{S_{i,b}}, z) - c(h_{S_{i,b}}, z_i))] + 6bβ̂.   (17)

Now, the application of Lemma 1 to the difference of two cost functions, also bounded by M, as in the right-hand side, leads to

  E_S[Φ(S)] ≤ E_{S̃_{i,b},z̃}[(1/m) ∑_{i=1}^{m} (c(h_{S̃_{i,b}}, z̃) - c(h_{S̃_{i,b}}, z̃_i))] + 6bβ̂ + 3β(b)M.   (18)

Since z̃ and z̃_i are independent and the distribution is stationary, they have the same distribution and we can replace z̃_i with z̃ in the empirical cost and write

  E_S[Φ(S)] ≤ E[(1/m) ∑_{i=1}^{m} (c(h_{S̃_{i,b}}, z̃) - c(h_{S̃^i_{i,b}}, z̃))] + 6bβ̂ + 3β(b)M ≤ β̂ + 6bβ̂ + 3β(b)M,   (19)

where S̃^i_{i,b} is the sequence derived from S̃_{i,b} by replacing z̃_i with z̃. The last inequality holds by the β̂-stability of the learning algorithm. The other side of the inequality in the statement of the lemma can be shown following the same steps.

3.3 Main Results

This section presents several theorems that constitute the main results of this paper. We will use the following theorem, which extends McDiarmid's inequality to ϕ-mixing distributions.

Theorem 1 (Kontorovich and Ramanan [7], Theorem 1.1). Let Φ : Z^m → R be a function defined over a countable space Z. If Φ is l-Lipschitz with respect to the Hamming metric for some l > 0, then the following holds for all ε > 0:

  Pr_Z[|Φ(Z) - E[Φ(Z)]| > ε] ≤ 2 exp( -ε² / ( 2 m l² (1 + 2 ∑_{k=1}^{m} ϕ(k))² ) ).   (20)

Theorem 2 (General Non-i.i.d. Stability Bound). Let h_S denote the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from a ϕ-mixing stationary distribution, and let c be a measurable non-negative cost function upper bounded by M > 0. Then, for any b ∈ [0, m] and any ε > 0, the following generalization bound holds:

  Pr_S[ |R(h_S) - R̂(h_S)| > ε + (6b + 1)β̂ + 6Mϕ(b) ] ≤ 2 exp( -ε² / ( 2m (1 + 2 ∑_{i=1}^{m} ϕ(i))² ((b + 1)2β̂ + 2Mϕ(b) + M/m)² ) ).

Proof. The theorem follows directly from the application of Lemma 3 and Lemma 4 to Theorem 1.

The theorem gives a general stability bound for ϕ-mixing stationary sequences. If we further assume that the sequence is algebraically ϕ-mixing, that is, for all k, ϕ(k) = ϕ_0 k^{-r} for some r > 1, then we can solve for the value of b that optimizes the bound.
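As a purely numerical illustration (my own sketch, following the bound of Theorem 2 as reconstructed above, not code from the paper), the right-hand side can be evaluated for given values of m, b, β̂, M and a mixing function ϕ:

    import math

    def theorem2_bound(m, b, beta_hat, M, phi, eps):
        """Evaluate the Theorem 2 bound as reconstructed above: the deviation term
        (6b + 1)*beta_hat + 6*M*phi(b) added to eps, and the probability
        2*exp(-eps^2 / (2*m*(tau*L)^2)) with
        L = (b + 1)*2*beta_hat + 2*M*phi(b) + M/m and tau = 1 + 2*sum_{i<=m} phi(i)."""
        L = (b + 1) * 2 * beta_hat + 2 * M * phi(b) + M / m
        tau = 1 + 2 * sum(phi(i) for i in range(1, m + 1))
        slack = (6 * b + 1) * beta_hat + 6 * M * phi(b)
        prob = 2 * math.exp(-eps ** 2 / (2 * m * (tau * L) ** 2))
        return slack, prob

    # Algebraically mixing coefficients phi(k) = phi0 * k**(-r) and beta_hat = O(1/m).
    m, phi0, r, M = 10000, 0.5, 2.0, 1.0
    slack, prob = theorem2_bound(m, b=50, beta_hat=1.0 / m, M=M,
                                 phi=lambda k: phi0 * k ** (-r), eps=0.1)
    print(slack, prob)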

Theorem 3 (Non-i.i.d. Stability Bound for Algebraically Mixing Sequences). Let h_S denote the hypothesis returned by a β̂-stable algorithm trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution, ϕ(k) = ϕ_0 k^{-r} with r > 1, and let c be a measurable non-negative cost function upper bounded by M > 0. Then, for any ε > 0, the following generalization bound holds:

  Pr_S[ |R(h_S) - R̂(h_S)| > ε + β̂ + (r + 1) 6Mϕ(b) ] ≤ 2 exp( -ε² / ( 2m (1 + 2ϕ_0 r/(r - 1))² (2β̂ + (r + 1) 2Mϕ(b) + M/m)² ) ),

where ϕ(b) = ϕ_0 (β̂/(r ϕ_0 M))^{r/(r+1)}.

Proof. For an algebraically mixing sequence, the value of b minimizing the bound of Theorem 2 satisfies β̂ b = r M ϕ(b), which gives b = (r ϕ_0 M/β̂)^{1/(r+1)} and ϕ(b) = ϕ_0 (β̂/(r ϕ_0 M))^{r/(r+1)}. The following term can be bounded as

  1 + 2 ∑_{i=1}^{m} ϕ(i) = 1 + 2ϕ_0 ∑_{i=1}^{m} i^{-r} ≤ 1 + 2ϕ_0 (1 + ∫_1^m x^{-r} dx) = 1 + 2ϕ_0 (1 + (1 - m^{1-r})/(r - 1)).   (21)

For r > 1, the exponent of m is negative, and so we can bound this last term by 1 + 2ϕ_0 r/(r - 1). Plugging in this value and the minimizing value of b into the bound of Theorem 2 yields the statement of the theorem.

In the case of a zero mixing coefficient (ϕ = 0 and b = 0), the bounds of Theorem 2 and Theorem 3 coincide with the i.i.d. stability bound of [3]. In order for the right-hand side of these bounds to converge, we must have β̂ = o(1/√m) and ϕ(b) = o(1/√m). For several general classes of algorithms, β̂ ≤ O(1/m) [3]. In the case of the algebraically mixing sequences with r > 1 assumed in Theorem 3, β̂ ≤ O(1/m) implies ϕ(b) = ϕ_0 (β̂/(r ϕ_0 M))^{r/(r+1)} < O(1/√m). The next section illustrates the application of Theorem 3 to several general classes of algorithms.

4 Application

We now present the application of our stability bounds to several algorithms in the case of an algebraically mixing sequence. Our bound applies to all algorithms based on the minimization of a regularized objective function based on the norm ||·||_K in a reproducing kernel Hilbert space, where K is a positive definite symmetric kernel:

  argmin_{h ∈ H} (1/m) ∑_{i=1}^{m} c(h, z_i) + λ ||h||_K²,   (22)

under some general conditions, since these algorithms are stable with β̂ ≤ O(1/m) [3]. Two specific instances of these algorithms are SVR, for which the cost function is based on the ε-insensitive cost:

  c(h, z) = |h(x) - y|_ε = 0 if |h(x) - y| ≤ ε, and |h(x) - y| - ε otherwise,   (23)

and Kernel Ridge Regression [13], for which c(h, z) = (h(x) - y)².
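The optimal block size of Theorem 3 is easy to compute explicitly; the sketch below (again an illustration under the formulas stated above, not the authors' code, with parameter values chosen arbitrarily) does so for a stability coefficient of the kernel-regularization type, β̂ = O(1/m), as in the corollary that follows.

    def algebraic_mixing_block_size(beta_hat, M, phi0, r):
        """Block size minimizing the Theorem 2 bound when phi(k) = phi0 * k**(-r):
        it solves beta_hat * b = r * M * phi(b), giving
        b = (r*phi0*M / beta_hat)**(1/(r+1)) and
        phi(b) = phi0 * (beta_hat / (r*phi0*M))**(r/(r+1))."""
        b = (r * phi0 * M / beta_hat) ** (1.0 / (r + 1))
        phi_b = phi0 * (beta_hat / (r * phi0 * M)) ** (r / (r + 1))
        return b, phi_b

    # SVR-type stability coefficient beta_hat <= kappa^2 / (2*lam*m), cf. Corollary 1 below.
    kappa, lam, m, M, phi0, r = 1.0, 0.1, 10000, 1.0, 0.5, 2.0
    beta_hat = kappa ** 2 / (2 * lam * m)
    b, phi_b = algebraic_mixing_block_size(beta_hat, M, phi0, r)
    print(b, phi_b, beta_hat * b, r * M * phi_b)  # the last two coincide at the optimum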

Corollary 1. Assume a bounded output Y = [0, B], a bounded cost function with bound M > 0, and a kernel bounded by K(x, x) ≤ κ² for all x, for some κ > 0. Let h_S denote the hypothesis returned by the algorithm when trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution. Then, with probability at least 1 - δ, the following generalization bounds hold for

a. Support Vector Regression (SVR):

  R(h_S) ≤ R̂(h_S) + κ²/(2λm) + 3M'(κ²/(2λm))^u + ϕ'_0 ( κ²/(λm) + M'(κ²/(2λm))^u + M/m ) √(2m log(2/δ)),

b. Kernel Ridge Regression (KRR):

  R(h_S) ≤ R̂(h_S) + 2κ²B²/(λm) + 3M'(2κ²B²/(λm))^u + ϕ'_0 ( 4κ²B²/(λm) + M'(2κ²B²/(λm))^u + M/m ) √(2m log(2/δ)),

with u = r/(r + 1) ∈ (1/2, 1), M' = 2(r + 1)Mϕ_0/(rϕ_0M)^u, and ϕ'_0 = 1 + 2ϕ_0 r/(r - 1).

Proof. It has been shown in [3] that for SVR, β̂ ≤ κ²/(2λm), and for KRR, β̂ ≤ 2κ²B²/(λm). Plugging these values into the bound of Theorem 3 and setting the right-hand side to δ yields the statement of the corollary.

These bounds give, to the best of our knowledge, the first stability-based generalization bounds for SVR and KRR in a non-i.i.d. scenario. Similar bounds can be obtained for other families of algorithms such as maximum entropy discrimination, which can be shown to have comparable stability properties [3]. These bounds are non-trivial when the condition λ ≫ 1/m^{1/2 - 1/r} on the regularization parameter holds for all large values of m, which clearly coincides with the i.i.d. case as r tends to infinity. It would be interesting to give a quantitative comparison of our bounds and the generalization bounds of [10] based on covering numbers for mixing stationary distributions, in the scenario where test points are independent of the training sample. In general, because the bounds of [10] are not algorithm-dependent, one can expect tighter bounds using stability, provided that a tight bound is given on the stability coefficient. The comparison also depends on how fast the covering number grows with the sample size and trade-off parameters such as λ. For a fixed λ, the asymptotic behavior of our stability bounds for SVR and KRR is tight.

5 Conclusion

Our stability bounds for mixing stationary sequences apply to large classes of algorithms, including SVR and KRR, extending to weakly dependent observations existing bounds in the i.i.d. case. Since they are algorithm-specific, these bounds can often be tighter than other generalization bounds. Weaker notions of stability might help further improve or refine them.

Acknowledgments

This work was partially funded by the New York State Office of Science Technology and Academic Research (NYSTAR) and was also sponsored in part by the Department of the Army, Award Number W23RYX N605. The U.S. Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick, MD, is the awarding and administering acquisition office. The content of this material does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

[1] S. N. Bernstein. Sur l'extension du théorème limite du calcul des probabilités aux sommes de quantités dépendantes. Math. Ann., 97:1-59, 1927.
[2] O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In Advances in Neural Information Processing Systems (NIPS 2000).
[3] O. Bousquet and A. Elisseeff. Stability and generalization. JMLR, 2:499-526, 2002.
[4] L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5), 1979.
[5] P. Doukhan. Mixing: Properties and Examples. Springer-Verlag, 1994.
[6] M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. In Computational Learning Theory, pages 152-162, 1997.
[7] L. Kontorovich and K. Ramanan. Concentration inequalities for dependent random variables via the martingale method.
[8] A. Lozano, S. Kulkarni, and R. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In NIPS.
[9] D. Mattera and S. Haykin. Support vector machines for dynamic reconstruction of a chaotic system. In Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[10] R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, 39(1):5-34, 2000.
[11] D. Modha and E. Masry. On the consistency in nonparametric estimation under mixing assumptions. IEEE Transactions on Information Theory, 44:117-133, 1998.
[12] K.-R. Müller, A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik. Predicting time series with support vector machines. In Proceedings of ICANN'97, LNCS. Springer, 1997.

[13] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of ICML'98. Morgan Kaufmann Publishers Inc., 1998.
[14] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[15] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
[16] M. Vidyasagar. Learning and Generalization: With Applications to Neural Networks. Springer, 2003.
[17] B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94-116, January 1994.
