A Unified Approach to Universal Prediction: Generalized Upper and Lower Bounds


IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 3, MARCH 2015

Nuri Denizcan Vanli and Suleyman S. Kozat, Senior Member, IEEE

Manuscript received July 2013; revised January 2014 and April 2014; accepted April 2014; date of current version February 2015. This work was supported in part by the IBM Faculty Award and in part by TUBITAK. The authors are with the Department of Electrical and Electronics Engineering, Bilkent University, Ankara 06800, Turkey (e-mail: vanli@ee.bilkent.edu.tr; kozat@ee.bilkent.edu.tr).

Abstract—We study sequential prediction of real-valued, arbitrary, and unknown sequences under the squared error loss, compared against the best parametric predictor out of a large, continuous class of predictors. Inspired by recent results from computational learning theory, we refrain from any statistical assumptions and define the performance with respect to the class of general parametric predictors. In particular, we present generic lower and upper bounds on this relative performance by transforming the prediction task into a parameter learning problem. We first introduce the lower bounds on this relative performance in the mixture of experts framework, where we show that for any sequential algorithm, there always exists a sequence for which the performance of the sequential algorithm is lower bounded by zero. We then introduce a sequential learning algorithm to predict such arbitrary and unknown sequences, and calculate upper bounds on its total squared prediction error for every bounded sequence. We further show that in some scenarios, we achieve matching lower and upper bounds, demonstrating that our algorithms are optimal in a strong minimax sense such that their performances cannot be improved further. As an interesting result, we also prove that for the worst-case scenario, the performance of randomized output algorithms can be achieved by sequential algorithms, so that randomized output algorithms do not improve the performance.

Index Terms—Online learning, sequential prediction, worst-case performance.

I. INTRODUCTION

In this brief, we investigate the generic sequential (online) prediction problem from an individual sequence perspective using tools of computational learning theory, where we refrain from any statistical assumptions either in modeling or on signals [1]–[4]. In this approach, we have an arbitrary, deterministic, bounded, and unknown signal $\{x[t]\}_{t \ge 1}$, where $|x[t]| < A < \infty$ and $x[t] \in \mathbb{R}$. Since we do not impose any statistical assumptions on the underlying data, we, motivated by recent results from sequential learning [1]–[4], define the performance of a sequential algorithm with respect to a comparison class, where the predictors of the comparison class are formed by observing the entire sequence in hindsight, under the squared error loss, that is,

$\sum_{t=1}^{n} (x[t] - \hat{x}_s[t])^2 - \inf_{c \in \mathcal{C}} \sum_{t=1}^{n} (x[t] - \hat{x}_c[t])^2$   (1)

for an arbitrary length of data $n$ and for any possible sequence $\{x[t]\}_{t \ge 1}$, where $\hat{x}_s[t]$ is the prediction at time $t$ of any sequential algorithm that has access to the data from $x[1]$ up to $x[t-1]$ for prediction, and $\hat{x}_c[t]$ is the prediction at time $t$ of the predictor $c$ such that $c \in \mathcal{C}$, where $\mathcal{C}$ represents the class of predictors we compete against. We emphasize that since the predictors $\hat{x}_c[t]$, $c \in \mathcal{C}$, have access to the entire sequence before the processing starts, the minimum squared prediction error that can be achieved with a sequential predictor $\hat{x}_s[t]$ is, at best, equal to the squared prediction error of the optimal batch predictor $\hat{x}_c[t]$, $c \in \mathcal{C}$. Here, we call the difference in the squared prediction errors of the sequential algorithm $\hat{x}_s[t]$ and the optimal batch predictor $\hat{x}_c[t]$, $c \in \mathcal{C}$, the regret of not using the optimal predictor (or, equivalently, of not knowing the future). Therefore, we seek sequential algorithms $\hat{x}_s[t]$ that minimize this regret, or loss, for any possible $\{x[t]\}_{t \ge 1}$. We emphasize that this regret definition is for the accumulated sequential cost, instead of the batch cost.
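To make the regret in (1) concrete, the following sketch (not part of the brief; the comparison class, the example sequence, and the naive sequential predictor are all illustrative choices) accumulates the loss of a sequential algorithm and subtracts the loss of the best predictor chosen in hindsight from a simple comparison class, here first-order linear predictors $\hat{x}_c[t] = c\,x[t-1]$.

```python
import numpy as np

def regret(x, seq_pred):
    """Regret (1): accumulated squared error of a sequential predictor minus
    that of the best batch predictor, here over the class of first-order
    linear predictors x_hat_c[t] = c * x[t-1] fitted in hindsight."""
    n = len(x)
    seq_loss = 0.0
    batch_num = batch_den = 0.0
    x_prev = 0.0                                     # before any data is seen
    for t in range(n):
        seq_loss += (x[t] - seq_pred(x[:t])) ** 2    # predictor sees x[1..t-1] only
        batch_num += x[t] * x_prev                   # accumulate for the hindsight fit
        batch_den += x_prev ** 2
        x_prev = x[t]
    c_star = batch_num / batch_den if batch_den > 0 else 0.0
    batch_loss = sum((x[t] - c_star * (x[t - 1] if t > 0 else 0.0)) ** 2 for t in range(n))
    return seq_loss - batch_loss

# Example: a naive "predict the previous sample" sequential algorithm.
rng = np.random.default_rng(0)
x = np.clip(np.cumsum(rng.normal(size=200)) * 0.05, -1.0, 1.0)   # bounded |x[t]| <= A = 1
print(regret(x, lambda past: past[-1] if len(past) else 0.0))
```

Any sequential algorithm can be plugged in through `seq_pred`, which only ever receives $x[1], \ldots, x[t-1]$.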
Instead of fixing a comparison class of predictors, we parameterize the comparison classes such that the parameter set and functional form of these classes can be chosen as desired. In this sense, in this brief, we consider the most general class of parametric predictors as our class of predictors $\mathcal{C}$, such that the regret for an arbitrary length of data $n$ is given by

$\sum_{t=1}^{n} (x[t] - \hat{x}_s[t])^2 - \inf_{\mathbf{w} \in \mathbb{R}^m} \sum_{t=1}^{n} \big(x[t] - f(\mathbf{w}, x_{t-a}^{t-1})\big)^2$   (2)

where $f(\mathbf{w}, \cdot)$ is a parametric function whose parameters $\mathbf{w} = [w_1, \ldots, w_m]^T$ can be set prior to prediction, and this function uses the data $x_{t-a}^{t-1}$ for prediction for some arbitrary integer $a$, which can be viewed as the tap size of the predictor. Although the parameters of the parametric prediction function $f(\mathbf{w}, \cdot)$ can be set arbitrarily, even by observing all the data $\{x[t]\}_{t \ge 1}$ a priori, the function is naturally restricted to use only the sequential data in prediction [5]–[7].

Since we have no statistical assumptions on the underlying data, the corresponding lower and upper bounds on the regret in (2) provide the ultimate measure of the learning performance for any sequential predictor. We emphasize that lower bounds not only provide the worst-case performance of an algorithm, but also quantify the prediction power of the parametric class. As such, a positive lower bound guarantees the existence of a data sequence, of arbitrary length, such that no matter how smart the learning algorithm is, its performance on this sequence will be worse than that of the class of parametric predictors by at least the order of the lower bound. Hence, if an algorithm is found such that the upper bound on its regret matches the lower bound, then that algorithm is optimal in a strong minimax sense: its worst-case convergence performance cannot be further improved [7]. To this end, the minimax-sense optimality of different parametric learning algorithms, such as the well-known least mean squares (LMS) [8] and recursive least squares (RLS) [8] prediction algorithms and the online sequential extreme learning machine of [1], can be determined using the lower bounds provided in this brief. In this sense, the rates of the corresponding upper and lower bounds are analogous to the VC dimension [9] of classifiers and can be used to quantify the learning performance [1]–[3], [10].

Throughout, all vectors are column vectors and are denoted by boldface lowercase letters. For a vector $\mathbf{u}$, $\mathbf{u}^T$ is the ordinary transpose. We denote $x_a^b \triangleq \{x[t]\}_{t=a}^{b}$.

Various sequential learning algorithms have been proposed in [1], [7], [8], [10]–[12], and [13] in order to efficiently learn the relationship between the observations and the desired data. One of the simplest methods is to linearly model this relationship, i.e., $f(\mathbf{w}[t], x_{t-a}^{t-1}) = \mathbf{w}[t]^T x_{t-a}^{t-1}$, and then update $\mathbf{w}[t]$ using well-known algorithms such as the LMS or RLS algorithms [1], [8]. In more recent studies [7], [12], universal algorithms have been proposed that achieve the performance of the optimal weighting vector without any statistical assumptions. Kivinen and Warmuth [10] proposed a multiplicative update of the weights and provided guaranteed upper bounds on the performance of their algorithm. On the other hand, in order to introduce nonlinear modeling, similar learning methods are usually extended either by mapping the observations to higher dimensions, as in polynomial and Volterra filters [11], or by partitioning the observation space and fitting linear models in each partition, i.e., piecewise linear modeling [13]. In order to derive upper and lower bounds on the performance of such learning algorithms, the mixture of experts framework is usually used. As examples, linear prediction [5], [7], [12], nonlinear models based on piecewise linear approximations [13], and the learning of an individual noise-corrupted deterministic sequence [14] have been studied. These results were then extended to filtering problems [15], [16]. In this brief, on the other hand, we consider a holistic approach and provide upper and lower bounds for the general framework, which was previously missing in the literature.
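The linear modeling with LMS-type updates mentioned above can be written in a few lines; the following is a generic sketch of the standard LMS recursion (the tap size `m`, the step size `mu`, and the test signal are free choices for illustration, not values prescribed in this brief).

```python
import numpy as np

def lms_predict(x, m=4, mu=0.05):
    """Sequential linear prediction x_hat[t] = w[t]^T [x[t-1], ..., x[t-m]]
    with the standard LMS (stochastic gradient) weight update."""
    n = len(x)
    w = np.zeros(m)
    x_hat = np.zeros(n)
    for t in range(n):
        u = np.array([x[t - i] if t - i >= 0 else 0.0 for i in range(1, m + 1)])  # regressor
        x_hat[t] = w @ u                       # predict before seeing x[t]
        e = x[t] - x_hat[t]                    # prediction error
        w += mu * e * u                        # LMS gradient step
    return x_hat

rng = np.random.default_rng(1)
x = np.sin(0.2 * np.arange(300)) + 0.1 * rng.normal(size=300)
print(np.mean((x - lms_predict(x)) ** 2))
```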
Our main contribution in this brief is to obtain generalized lower bounds for a variety of prediction frameworks by transforming the prediction problem into a well-known and well-studied statistical parameter learning problem [1], [4]–[7]. By doing so, we prove that for any sequential algorithm there always exists some data sequence, of any length, such that the regret of the sequential algorithm is lower bounded by zero. We further derive lower bounds for important classes of predictors heavily investigated in the machine learning literature, including univariate polynomial, multivariate polynomial, and linear predictors [4]–[7], [10]–[12], [14]. We also provide a universal sequential prediction algorithm, calculate upper bounds on the regret of this algorithm, and show that we obtain matching lower and upper bounds in some scenarios. As an interesting result, we also show that, given the regret in (2) as the performance measure, there is no additional gain achieved by using randomized algorithms in the worst-case scenario.

The rest of this brief is organized as follows. In Section II, we first present general lower bounds and then analyze a couple of specific scenarios. We then introduce a universal prediction algorithm and calculate upper bounds on its regret in Section III. In Section IV, we show that in the worst-case scenario, the performance of randomized algorithms can be achieved by sequential algorithms. Finally, conclusions are drawn in Section V.

II. LOWER BOUNDS

In this section, we investigate the worst-case performance of sequential algorithms to obtain guaranteed lower bounds on the regret. Hence, for any arbitrary length of data $n$ and $\{x[t]\}_{t \ge 1}$, we are trying to find a lower bound on the following:

$\sup_{x_1^n} \Big\{ \sum_{t=1}^{n} (x[t] - \hat{x}_s[t])^2 - \inf_{\mathbf{w} \in \mathbb{R}^m} \sum_{t=1}^{n} \big(x[t] - f(\mathbf{w}, x_{t-a}^{t-1})\big)^2 \Big\}.$

For this regret, we have the following theorem, which relates the performance of any sequential algorithm to the general class of parametric predictors. While proving this theorem, we also provide a generic procedure to find lower bounds on the regret in (2) and later use this method to derive lower bounds for parametric classes, including the classes of univariate polynomial, multivariate polynomial, and linear predictors [4]–[7], [10]–[12], [14].

Theorem 1: There is no best sequential algorithm for all sequences for any class in the parametric form $f(\mathbf{w}, \cdot)$, where $\mathbf{w} \in \mathbb{R}^m$. Given a parametric class, there always exists a sequence such that the regret in (2) is lower bounded by some nonnegative value.

This theorem implies that no matter how smart a sequential algorithm is, or how naive the competition class is, it is not possible to outperform the competition class for all sequences. As an example, this result demonstrates that even when competing against the class of constant predictors, i.e., the most naive competition class, where $\hat{x}_c[t]$ always predicts a constant value, any sequential algorithm, no matter how smart, cannot outperform this class of constant predictors for all sequences. We emphasize that, in this sense, the lower bounds quantify the prediction and modeling power of the parametric class.
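A small numerical illustration of this point (my own construction, not taken from the brief): let an adversary always place $x[t] \in \{-A, A\}$ on the side farther from the sequential prediction. Then every sequential prediction incurs squared error at least $A^2$, while the best constant predictor chosen in hindsight incurs total loss $nA^2 - n\bar{x}^2$, so the regret is nonnegative, and strictly positive whenever the generated sequence is unbalanced.

```python
import numpy as np

A = 1.0
n = 500

def seq_pred(past):
    # any sequential algorithm; here, the running mean of the past samples
    return float(np.mean(past)) if len(past) else 0.0

x = []
for t in range(n):
    p = seq_pred(x)
    x.append(-A if p >= 0 else A)        # adversary: pick the point farther from p
x = np.array(x)

seq_loss = sum((x[t] - seq_pred(x[:t])) ** 2 for t in range(n))
c_star = x.mean()                        # best constant predictor in hindsight
batch_loss = np.sum((x - c_star) ** 2)   # equals n*A^2 - n*mean(x)^2 here
print(seq_loss - batch_loss)             # regret; always >= 0 for this construction
```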

Proof of Theorem 1: We begin our proof by pointing out that finding the best sequential predictor for an arbitrary and unknown sequence $x_1^n$ is not straightforward. Yet, for a specific distribution on $x_1^n$, the best predictor under the squared error is the conditional mean [17]. Therefore, through this transformation, we are able to evaluate the regret in (2) in the expectation sense and prove this theorem. Since the supremum in (2) is taken over all $x_1^n$, for any distribution on $x_1^n$ the regret is lower bounded by

$\sup_{x_1^n} \Big\{ \sum_{t=1}^{n} (x[t] - \hat{x}_s[t])^2 - \inf_{\mathbf{w} \in \mathbb{R}^m} \sum_{t=1}^{n} \big(x[t] - f(\mathbf{w}, x_{t-a}^{t-1})\big)^2 \Big\} \ge E_{x_1^n}\Big[ \sum_{t=1}^{n} (x[t] - \hat{x}_s[t])^2 - \inf_{\mathbf{w} \in \mathbb{R}^m} \sum_{t=1}^{n} \big(x[t] - f(\mathbf{w}, x_{t-a}^{t-1})\big)^2 \Big] \triangleq L(n)$

where the expectation is taken with respect to this particular distribution. Hence, it is enough to lower bound $L(n)$ to get a final lower bound. By the linearity of expectation,

$L(n) = E_{x_1^n}\Big[ \sum_{t=1}^{n} (x[t] - \hat{x}_s[t])^2 \Big] - E_{x_1^n}\Big[ \inf_{\mathbf{w} \in \mathbb{R}^m} \sum_{t=1}^{n} \big(x[t] - f(\mathbf{w}, x_{t-a}^{t-1})\big)^2 \Big].$   (3)

The squared error loss $E[(x[t] - \hat{x}_s[t])^2]$ is minimized by the well-known minimum mean squared error (MMSE) predictor, given by [17]

$\hat{x}_s[t] = E\big[x[t] \mid x[t-1], \ldots, x[1]\big] \triangleq E\big[x[t] \mid x_1^{t-1}\big]$   (4)

where we drop the explicit $x_1^n$-dependence of the expectation to simplify the presentation. Suppose we select a parametric distribution for $x_1^n$ with parameter vector $\theta = [\theta_1, \ldots, \theta_m]^T$. Then, for the second term in (3), we use the following inequality:

$E_{\theta}\Big[ E_{x_1^n \mid \theta}\Big[ \inf_{\mathbf{w} \in \mathbb{R}^m} \sum_{t=1}^{n} \big(x[t] - f(\mathbf{w}, x_{t-a}^{t-1})\big)^2 \Big] \Big] \le E_{\theta}\Big[ \inf_{\mathbf{w} \in \mathbb{R}^m} E_{x_1^n \mid \theta}\Big[ \sum_{t=1}^{n} \big(x[t] - f(\mathbf{w}, x_{t-a}^{t-1})\big)^2 \Big] \Big].$   (5)

By using (4) and (5) and expanding the expectation, we can lower bound $L(n)$ as

$L(n) \ge E_{\theta}\Big[ E_{x_1^n \mid \theta}\Big[ \sum_{t=1}^{n} \big(x[t] - E[x[t] \mid x_1^{t-1}]\big)^2 \Big] \Big] - E_{\theta}\Big[ \inf_{\mathbf{w} \in \mathbb{R}^m} E_{x_1^n \mid \theta}\Big[ \sum_{t=1}^{n} \big(x[t] - f(\mathbf{w}, x_{t-a}^{t-1})\big)^2 \Big] \Big].$   (6)

The inequality in (6) is true for any distribution on $x_1^n$. Hence, for a distribution on $x_1^n$ such that

$E\big[x[t] \mid x_1^{t-1}, \theta\big] = h(\theta, x_{t-a}^{t-1})$   (7)

with some function $h$, if we can find a vector function $g(\theta)$ satisfying $f(g(\theta), x_{t-a}^{t-1}) = h(\theta, x_{t-a}^{t-1})$, then the last term in (6) yields

$E_{\theta}\Big[ \inf_{\mathbf{w} \in \mathbb{R}^m} E_{x_1^n \mid \theta}\Big[ \sum_{t=1}^{n} \big(x[t] - f(\mathbf{w}, x_{t-a}^{t-1})\big)^2 \Big] \Big] \le E_{\theta}\Big[ E_{x_1^n \mid \theta}\Big[ \sum_{t=1}^{n} \big(x[t] - h(\theta, x_{t-a}^{t-1})\big)^2 \Big] \Big].$

Thus, (6) can be written as

$L(n) \ge E_{\theta}\Big[ E_{x_1^n \mid \theta}\Big[ \sum_{t=1}^{n} \big(x[t] - E[x[t] \mid x_1^{t-1}]\big)^2 \Big] \Big] - E_{\theta}\Big[ E_{x_1^n \mid \theta}\Big[ \sum_{t=1}^{n} \big(x[t] - E[x[t] \mid x_1^{t-1}, \theta]\big)^2 \Big] \Big]$

which, by the definition of the MMSE estimator, is always lower bounded by zero, i.e., $L(n) \ge 0$, since conditioning on the additional parameter $\theta$ cannot increase the MMSE. By this inequality, we conclude that for predictors of the form $f(\mathbf{w}, \cdot)$ for which such a parametric distribution, i.e., a mapping $\mathbf{w} = g(\theta)$, exists, the best sequential predictor will always be outperformed by some predictor in this class on some sequence $x_1^n$. Hence, there is no best algorithm for all sequences for any class in this parametric form.

The remaining question is whether a suitable distribution on $x_1^n$ can be found for a given $f(\mathbf{w}, \cdot)$ such that $f(g(\theta), \cdot) = h(\theta, \cdot)$ with a suitable transformation $g(\theta)$. Suppose $f(\mathbf{w}, \cdot)$ is bounded by some $0 < M < \infty$ for all $|x[t]| \le A$, i.e., $|f(\mathbf{w}, x_{t-a}^{t-1})| \le M$. Then, given $\theta$ drawn from a beta distribution with parameters $(C, C)$, $C \in \mathbb{R}^+$, we generate a sequence $x_1^n$ such that $x[t] = (A/M) f(\mathbf{w}, x_{t-a}^{t-1})$ with probability $\theta$ and $x[t] = -(A/M) f(\mathbf{w}, x_{t-a}^{t-1})$ with probability $(1-\theta)$. Then

$E\big[x[t] \mid x_1^{t-1}, \theta\big] = \frac{A}{M}(2\theta - 1) f(\mathbf{w}, x_{t-a}^{t-1}).$

Hence, this concludes the proof of the Theorem.
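The construction used in the proof can be simulated directly. In the sketch below (the choices of $f$, $\mathbf{w}$, $M$, $C$, and $A$ are arbitrary placeholders), $\theta$ is drawn from a Beta$(C, C)$ prior, the sequence is generated as $x[t] = \pm (A/M) f(\mathbf{w}, x[t-1])$, and the empirical mean of $x[t] / \big((A/M) f(\mathbf{w}, x[t-1])\big)$ is compared with $2\theta - 1$, as claimed for the conditional mean.

```python
import numpy as np

rng = np.random.default_rng(2)
A, M, C, n = 1.0, 2.0, 1.0, 50_000
w = 0.7
f = lambda w, u: w * (1.0 + 0.5 * np.sin(u))     # any prediction function with |f| <= M

theta = rng.beta(C, C)                           # theta ~ Beta(C, C)
x = np.zeros(n)
x[0] = A
for t in range(1, n):
    mag = (A / M) * f(w, x[t - 1])
    x[t] = mag if rng.random() < theta else -mag  # +mag w.p. theta, -mag w.p. (1 - theta)

# E[x[t] | x[t-1], theta] = (A/M)(2*theta - 1) f(w, x[t-1]); equivalently the ratio
# x[t] / ((A/M) f(w, x[t-1])) is +1 w.p. theta and -1 otherwise, so it averages to 2*theta - 1.
ratios = x[1:] / ((A / M) * f(w, x[:-1]))
print(ratios.mean(), 2 * theta - 1)
```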
As an important special case, if we use a restricted functional form $f(\mathbf{w}, \cdot)$ so that $f(\mathbf{w}, \cdot)$ is separable, then the prediction problem is transformed into a parameter estimation problem. The separable form is given by $f(\mathbf{w}, x_{t-a}^{t-1}) = f_w(\mathbf{w})^T f_x(x_{t-a}^{t-1})$, where $f_w(\mathbf{w})$ and $f_x(x_{t-a}^{t-1})$ are vector functions of size $m$ for some integer $m$. Then, (7) can be written as

$E\big[x[t] \mid x_1^{t-1}, \theta\big] = f_w(g(\theta))^T f_x(x_{t-a}^{t-1})$

where $f_w(g(\theta)) = (A/M)(2\theta - 1) f_w(\mathbf{w})$. Denoting $f_n(\mathbf{w}) \triangleq (A/M) f_w(\mathbf{w})$ as the normalized prediction function, and after some algebra, (6) becomes

$L(n) \ge E_{\theta}\Big[ E_{x_1^n \mid \theta}\Big[ \sum_{t=1}^{n} \big(x[t] - E[(2\theta - 1) \mid x_1^{t-1}]\, f_n(\mathbf{w})^T f_x(x_{t-a}^{t-1})\big)^2 \Big] \Big] - E_{\theta}\Big[ E_{x_1^n \mid \theta}\Big[ \sum_{t=1}^{n} \big(x[t] - (2\theta - 1)\, f_n(\mathbf{w})^T f_x(x_{t-a}^{t-1})\big)^2 \Big] \Big]$

so that the regret of the sequential algorithm over the best prediction function is due to the regret attained by the sequential algorithm while learning the parameters of the prediction function, i.e., the parameters of the underlying distribution. To illustrate this procedure, we investigate the regret in (2) for three candidate function classes that are widely studied in computational learning theory.

A. mth-Order Univariate Polynomial Prediction

For an $m$th-order polynomial in $x[t-1]$, the regret is given by

$\sup_{x_1^n} \Big\{ \sum_{t=1}^{n} (x[t] - \hat{x}_s[t])^2 - \inf_{\mathbf{w} \in \mathbb{R}^m} \sum_{t=1}^{n} \Big(x[t] - \sum_{i=1}^{m} w_i\, x^i[t-1]\Big)^2 \Big\}$   (8)

where $\hat{x}_s[t]$ is the prediction at time $t$ of any sequential algorithm that has access to the data from $x[1]$ up to $x[t-1]$ for prediction, $\mathbf{w} = [w_1, \ldots, w_m]^T$ is the parameter vector, and $x^i[t-1]$ is the $i$th power of $x[t-1]$. Since $\sum_{i=1}^{m} w_i\, x^i[t-1] = w_1 x[t-1]$ with an appropriate selection of $\mathbf{w}$, we can lower bound the regret in (8) by considering the following distribution on $x_1^n$. Given $\theta$ drawn from a beta distribution with parameters $(C, C)$, $C \in \mathbb{R}^+$, we generate a sequence $x_1^n$ having only two values, $A$ and $-A$, such that $x[t] = x[t-1]$ with probability $\theta$ and $x[t] = -x[t-1]$ with probability $(1-\theta)$. Then, $E[x[t] \mid x_1^{t-1}, \theta] = (2\theta - 1)\, x[t-1]$, giving $h(\theta, x_{t-a}^{t-1}) = (2\theta - 1)\, x[t-1]$. Since the MMSE given $\theta$ is linear in $x[t-1]$, the optimum $\mathbf{w}$ that minimizes the accumulated error for this distribution is $\mathbf{w} = [(2\theta - 1), 0, \ldots, 0]^T$. After following the lines in [5], we obtain a lower bound of the form $O(\ln(n))$.

B. Multivariate Polynomial Prediction

Suppose the prediction function is given by $\mathbf{w}^T f_x(x_{t-a}^{t-1}) = \sum_{k=1}^{m} w_k f_k(x_{t-r}^{t-1})$, where each $f_k(x_{t-r}^{t-1})$ is a multivariate polynomial function (as an example, $f_k(x_{t-r}^{t-1}) = x[t-1]\,x[t-2]/x[t-3]$), and the regret is taken over all $\mathbf{w} = [w_1, \ldots, w_m]^T \in \mathbb{R}^m$, that is,

$\sup_{x_1^n} \Big\{ \sum_{t=1}^{n} (x[t] - \hat{x}_s[t])^2 - \inf_{\mathbf{w} \in \mathbb{R}^m} \sum_{t=1}^{n} \big(x[t] - \mathbf{w}^T f_x(x_{t-a}^{t-1})\big)^2 \Big\}$

where $\hat{x}_s[t]$ is the prediction at time $t$ of any sequential algorithm that has access to the data from $x[1]$ up to $x[t-1]$ for prediction, and $\mathbf{w}$ is the parameter vector for prediction. We emphasize that this class of predictors is not only a superset of the univariate polynomial predictors, but is also widely used in many signal processing applications to model nonlinearity, e.g., in Volterra filters [11]. This filtering technique is attractive when linear filtering techniques do not provide satisfactory results, and it includes cross products of the input signals.
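For intuition on this class, a second-order Volterra-type feature map over the last few samples can be formed as below (a generic sketch; the particular monomials are illustrative and not the ones analyzed in this brief).

```python
import numpy as np
from itertools import combinations_with_replacement

def volterra_features(past, r=3):
    """Second-order Volterra-type feature vector built from the last r samples:
    the samples themselves plus all pairwise products (cross terms included)."""
    u = np.array([past[-i] if i <= len(past) else 0.0 for i in range(1, r + 1)])
    feats = list(u)                                      # linear terms x[t-1..t-r]
    for i, j in combinations_with_replacement(range(r), 2):
        feats.append(u[i] * u[j])                        # quadratic and cross terms
    return np.array(feats)

print(volterra_features([0.2, -0.5, 0.1, 0.4]))          # uses the last 3 samples
```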

Since $\sum_{k=1}^{m} w_k f_k(x_{t-r}^{t-1}) = w_1 f_1(x_{t-r}^{t-1})$ with an appropriate selection of $\mathbf{w}$ and redefinition of $f_1(x_{t-r}^{t-1})$, we define the following parametric distribution on $x_1^n$ to obtain a lower bound. Given $\theta$ drawn from a beta distribution with parameters $(C, C)$, $C \in \mathbb{R}^+$, we generate a sequence $x_1^n$ having only two values, $A$ and $-A$, such that $x[t] = f_n(x_{t-r}^{t-1})$ with probability $\theta$ and $x[t] = -f_n(x_{t-r}^{t-1})$ with probability $(1-\theta)$, where $f_n(x_{t-r}^{t-1}) \triangleq A f_1(x_{t-r}^{t-1})/M$, i.e., the normalized version of $f_1(x_{t-r}^{t-1})$. Thus, given $\theta$, $x_1^n$ forms a two-state Markov chain with transition probability $(1-\theta)$. Hence, we have $E[x[t] \mid x_1^{t-1}, \theta] = (2\theta - 1) f_n(x_{t-r}^{t-1})$. The lower bound for the regret is then given by

$L(n) \ge E\Big[ \sum_{t=1}^{n} \big(x[t] - (2\hat{\theta}_t - 1) f_n(x_{t-r}^{t-1})\big)^2 \Big] - E\Big[ \sum_{t=1}^{n} \big(x[t] - (2\theta - 1) f_n(x_{t-r}^{t-1})\big)^2 \Big]$

where $\hat{\theta}_t \triangleq E[\theta \mid x_1^{t-1}]$. After some algebra, we achieve

$L(n) \ge \sum_{t=1}^{n} \Big\{ -4 E\big[\hat{\theta}_t\, x[t] f_n(x_{t-r}^{t-1})\big] + 4 E\big[\theta\, x[t] f_n(x_{t-r}^{t-1})\big] + A^2 E\big[(2\hat{\theta}_t - 1)^2\big] - A^2 E\big[(2\theta - 1)^2\big] \Big\}.$

It can be deduced that

$\hat{\theta}_t = E[\theta \mid x_1^{t-1}] = \frac{t - 1 - F_{t-1} + C}{t - 1 + 2C}$

where $F_{t-1}$ is the total number of transitions between the two states in a sequence of length $(t-1)$; i.e., $1 - \hat{\theta}_t$ is essentially the ratio of the number of transitions to the time period. Using this expression, the first expectation above can be evaluated explicitly by noting that $E[x[t] f_n(x_{t-r}^{t-1})] = E[(2\theta - 1)] A^2 = 0$ and that, given $\theta$, $F_{t-1}$ is a binomial random variable with parameters $(1-\theta)$ and size $(t-1)$. After this point, the derivation follows similar lines to [7], giving a lower bound of the form $O(\ln(n))$ for the regret.

C. k-Ahead mth-Order Linear Prediction

The regret in (2) for $k$-ahead $m$th-order linear prediction is given by

$\sup_{x_1^n} \Big\{ \sum_{t=1}^{n} (x[t] - \hat{x}_s[t])^2 - \inf_{\mathbf{w} \in \mathbb{R}^m} \sum_{t=1}^{n} \big(x[t] - \mathbf{w}^T \mathbf{x}[t-k]\big)^2 \Big\}$   (9)

where $\hat{x}_s[t]$ is the prediction at time $t$ of any sequential algorithm that has access to the data from $x[1]$ up to $x[t-k]$ for prediction for some integer $k$, $\mathbf{w} = [w_1, \ldots, w_m]^T$ is the parameter vector, and $\mathbf{x}[t-k] \triangleq [x[t-k], \ldots, x[t-k-m+1]]^T$. We first find a lower bound for $k$-ahead first-order prediction, where $\mathbf{w}^T \mathbf{x}[t-k] = w\, x[t-k]$. For this purpose, we define the following parametric distribution on $x_1^n$ as in [5]. Given $\theta$ drawn from a beta distribution with parameters $(C, C)$, $C \in \mathbb{R}^+$, we generate a sequence $x_1^n$ having only two values, $A$ and $-A$, such that $x[t] = x[t-k]$ with probability $\theta$ and $x[t] = -x[t-k]$ with probability $(1-\theta)$. Thus, given $\theta$, $x_1^n$ forms a two-state Markov chain with transition probability $(1-\theta)$. Then, $E[x[t] \mid x_1^{t-k}, \theta] = (2\theta - 1)\, x[t-k]$, giving $h(\theta, x_{t-a}^{t-k}) = (2\theta - 1)\, x[t-k]$ and $g(\theta) = (2\theta - 1)$. After this point, the derivation exactly follows the lines in [5], resulting in a lower bound of the form $O(\ln(n))$. For $k$-ahead $m$th-order prediction, we generalize the lower bound obtained for $k$-ahead first-order prediction and, following the lines in [5], obtain a lower bound of the form $O(m \ln(n))$.
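Both constructions above reduce the prediction problem to estimating $\theta$ from the observed transitions of a two-state chain. A minimal simulation (illustrative parameter values) of the chain and of the posterior-mean estimate $\hat{\theta}_t = (t - 1 - F_{t-1} + C)/(t - 1 + 2C)$ used in the derivation:

```python
import numpy as np

rng = np.random.default_rng(3)
A, C, n = 1.0, 1.0, 5000
theta = rng.beta(C, C)                       # "stay" probability, Beta(C, C) prior

x = np.zeros(n)
x[0] = A
transitions = 0
for t in range(1, n):
    # Posterior mean of theta given x[1..t-1]: (t - 1 - F_{t-1} + C) / (t - 1 + 2C).
    theta_hat = (t - 1 - transitions + C) / (t - 1 + 2 * C)
    stay = rng.random() < theta
    x[t] = x[t - 1] if stay else -x[t - 1]   # two-state chain on {A, -A}
    transitions += 0 if stay else 1

print(theta, theta_hat)                      # the final estimate is close to the drawn theta
```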
III. COMPREHENSIVE APPROACH TO REGRET MINIMIZATION

In this section, we introduce a method that can be used to predict a bounded, arbitrary, and unknown sequence. We derive upper bounds for this algorithm such that, for any sequence $x_1^n$, our algorithm will not perform worse than the presented upper bounds. In some cases, by achieving matching upper and lower bounds, we prove that this algorithm is optimal in a strong minimax sense such that the worst-case performance cannot be further improved.

We restrict the prediction functions to be separable, i.e., $f(\mathbf{w}, x_{t-a}^{t-1}) = f_w(\mathbf{w})^T f_x(x_{t-a}^{t-1})$, where $f_w(\mathbf{w})$ and $f_x(x_{t-a}^{t-1})$ are vector functions of size $m$ for some integer $m$. To avoid any confusion, we simply denote $\beta \triangleq f_w(\mathbf{w})$, where $\beta \in \mathbb{R}^m$. Hence, the same prediction function can be written as $f(\mathbf{w}, x_{t-a}^{t-1}) = \beta^T f_x(x_{t-a}^{t-1})$. If the parameter vector $\beta$ is selected such that the total squared prediction error is minimized over a batch of data of length $n$, then the coefficients are given by

$\beta[n] = \arg\min_{\beta \in \mathbb{R}^m} \sum_{t=1}^{n} \big(x[t] - \beta^T f_x(x_{t-a}^{t-1})\big)^2.$

The well-known least-squares solution to this problem is given by $\beta[n] = (R_{ff}^n)^{-1} r_f^n$, where $R_{ff}^n \triangleq \sum_{t=1}^{n} f_x(x_{t-a}^{t-1}) f_x(x_{t-a}^{t-1})^T$ is invertible and $r_f^n \triangleq \sum_{t=1}^{n} x[t]\, f_x(x_{t-a}^{t-1})$. When $R_{ff}^n$ is singular, the solution is no longer unique; however, a suitable choice can be made using, e.g., pseudoinverses. We also consider the more general least-squares (ridge regression) problem that arises in many signal processing problems, in which the regularized total squared prediction error is minimized over a batch of data of length $n$:

$\beta[n] = \arg\min_{\beta \in \mathbb{R}^m} \Big\{ \sum_{t=1}^{n} \big(x[t] - \beta^T f_x(x_{t-a}^{t-1})\big)^2 + \delta \|\beta\|^2 \Big\} = \big(R_{ff}^n + \delta I\big)^{-1} r_f^n.$

We define a universal predictor $\hat{x}_u[n]$ as

$\hat{x}_u[n] = \beta_u[n-1]^T f_x(x_{n-a}^{n-1})$

where $\beta_u[n] \triangleq \beta[n] = (R_{ff}^n + \delta I)^{-1} r_f^n$ and $\delta > 0$ is a positive constant.
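Read operationally, the universal predictor is ridge regression refit causally at every step and applied one step ahead. The following is a direct, non-recursive sketch of that recipe (the feature map `f_x`, its dimension, and the test sequence are placeholders chosen for illustration); in practice the matrix inverse would be updated recursively, RLS-style, rather than recomputed from scratch.

```python
import numpy as np

def universal_predict(x, f_x, m, delta=1.0):
    """Sequential ridge predictor x_hat_u[t] = beta_u[t-1]^T f_x(past),
    with beta_u[t] = (R_ff^t + delta*I)^{-1} r_f^t accumulated causally."""
    n = len(x)
    R = delta * np.eye(m)          # R_ff accumulated so far, plus the regularizer
    r = np.zeros(m)
    x_hat = np.zeros(n)
    for t in range(n):
        feat = f_x(x[:t])          # features built from x[1..t-1] only
        beta = np.linalg.solve(R, r)
        x_hat[t] = beta @ feat     # predict, then update the statistics with x[t]
        R += np.outer(feat, feat)
        r += x[t] * feat
    return x_hat

# Example: second-order linear prediction, f_x(past) = [x[t-1], x[t-2]].
f_x = lambda past: np.array([past[-1] if len(past) >= 1 else 0.0,
                             past[-2] if len(past) >= 2 else 0.0])
rng = np.random.default_rng(4)
x = np.clip(np.sin(0.3 * np.arange(400)) + 0.1 * rng.normal(size=400), -2, 2)
print(np.sum((x - universal_predict(x, f_x, m=2)) ** 2))
```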

Theorem 2: The total squared prediction error of the $m$th-order universal predictor for any bounded arbitrary sequence $\{x[t]\}_{t \ge 1}$, $|x[t]| \le A$, having an arbitrary length of $n$ satisfies

$\sum_{t=1}^{n} \big(x[t] - \hat{x}_u[t]\big)^2 \le \min_{\beta \in \mathbb{R}^m} \Big\{ \sum_{t=1}^{n} \big(x[t] - \beta^T f_x(x_{t-a}^{t-1})\big)^2 + \delta \|\beta\|^2 \Big\} + A^2 \ln\Big| I + \frac{R_{ff}^n}{\delta} \Big|.$

Theorem 2 indicates that the total squared prediction error of the $m$th-order universal predictor is within $O(m \ln(n))$ of that of the best batch $m$th-order parametric predictor for any individual sequence $\{x[t]\}_{t \ge 1}$. This result implies that in order to learn $m$ parameters, the universal algorithm pays a regret of $O(m \ln(n))$, which can be viewed as the parameter regret. After we prove Theorem 2, we apply it to the competition classes discussed in Section II.

Proof of Theorem 2: We prove this result for a scalar prediction function such that $f_x(x_{t-a}^{t-1}) = f(x_{t-a}^{t-1})$ to avoid any confusion. For a vector prediction function $f_x(\cdot)$, one can follow the exact same steps with vector extensions of the Gaussian mixture. The derivations follow similar lines to [5] and [10]; hence, only the main points are presented. We first define a function of the loss, namely a probability, for a predictor having parameter $\beta$, as follows:

$P_\beta(x_1^n) \triangleq \exp\Big( -\frac{1}{2h} \sum_{k=1}^{n} \big(x[k] - \beta f(x_{k-a}^{k-1})\big)^2 \Big)$

which can be viewed as a probability assignment of the predictor with parameter $\beta$ to the data $x[t]$, $1 \le t \le n$, induced by the performance of $\beta$ on the sequence $x_1^n$. We then construct a universal estimate of the probability of the sequence $x_1^n$ as an a priori weighted mixture among all of these probabilities, i.e., $P_u(x_1^n) = \int p(\beta) P_\beta(x_1^n)\, d\beta$, where $p(\beta)$ is an a priori weight assigned to the parameter $\beta$ and is selected as Gaussian in order to obtain closed-form bounds, i.e., $p(\beta) = (2\pi)^{-1/2} \sigma^{-1} \exp(-\beta^2/(2\sigma^2))$. Following similar lines to [7] with a predictor of the form $\beta f(x_{t-a}^{t-1})$, we obtain

$P_u\big(x[n] \mid x_1^{n-1}\big) \ge \gamma \exp\Big( -\frac{\gamma^2}{2h} \big(x[n] - \beta[n-1] f(x_{n-a}^{n-1})\big)^2 \Big)$

where $\gamma \triangleq \big( (R_{ff}^{n-1} + \delta)/(R_{ff}^{n} + \delta) \big)^{1/2}$. If we can find another Gaussian probability assignment $\tilde{P}_u$ satisfying $\tilde{P}_u(x_1^n) \le P_u(x_1^n)$, this completes the proof of the theorem. After some algebra, we find that the universal predictor is given by

$\hat{x}_u[n] = \gamma^2 \beta[n-1] f(x_{n-a}^{n-1}) = \frac{r_f^{n-1}}{R_{ff}^{n} + \delta} f(x_{n-a}^{n-1}).$

Now, we can select the smallest value of $h$ such that, over the region $[-A, A]$, $\tilde{P}_u(x[n] \mid x_1^{n-1})$ is not larger than $P_u(x[n] \mid x_1^{n-1})$; this condition must hold for all values of $\hat{x}_u[n] \in [-A, A]$, and it yields $h \ge A^2(1 - \gamma^2)/(-\ln(\gamma^2))$, where $\gamma < 1$. Note that for $0 < \gamma < 1$ we have $0 < (1 - \gamma^2)/(-\ln(\gamma^2)) < 1$, which implies that taking $h \ge A^2$ is enough to ensure that $\tilde{P}_u \le P_u$. In fact, since this bound on the value of $h$ depends upon the values of $\gamma$ and $\hat{x}_u[n]$, and is only tight for $\gamma \to 1$ and $\hat{x}_u[n] \to 0$, the restriction that $|x[n]| < A$ can occasionally be violated, as long as $\tilde{P}_u \le P_u$ still holds.

To illustrate this procedure, we investigate the upper bound on the regret in (2) for the same candidate function classes that we investigated in Section II.

A. mth-Order Univariate Polynomial Predictor

For an $m$th-order polynomial in $x[t-1]$, the prediction function is given by $f(\mathbf{w}, x_{t-a}^{t-1}) = \beta^T f_x(x_{t-a}^{t-1}) = \beta^T \mathbf{x}[t-1]$, where $\mathbf{x}[t-1] \triangleq [x[t-1], \ldots, x^m[t-1]]^T$, i.e., the vector of powers of $x[t-1]$. After replacing $R_{ff}^n = R^n \triangleq \sum_{t=1}^{n} \mathbf{x}[t-1]\mathbf{x}[t-1]^T$ and $r_f^n = r^n \triangleq \sum_{t=1}^{n} x[t]\,\mathbf{x}[t-1]$, we obtain an upper bound

$\sum_{t=1}^{n} \big(x[t] - \hat{x}_u[t]\big)^2 \le \min_{\beta \in \mathbb{R}^m} \Big\{ \sum_{t=1}^{n} \big(x[t] - \beta^T \mathbf{x}[t-1]\big)^2 + \delta \|\beta\|^2 \Big\} + A^2 \ln\Big| I + \frac{R^n}{\delta} \Big| \le \min_{\beta \in \mathbb{R}^m} \Big\{ \sum_{t=1}^{n} \big(x[t] - \beta^T \mathbf{x}[t-1]\big)^2 + \delta \|\beta\|^2 \Big\} + m A^2 \ln\big(1 + A^{2m} n/\delta\big).$

B. Multivariate Polynomial Prediction

The upper bound for a multivariate polynomial prediction function $f_x(x_{t-a}^{t-1})$ exactly follows the upper bound derivation for the $m$th-order univariate polynomial predictor, giving

$\sum_{t=1}^{n} \big(x[t] - \hat{x}_u[t]\big)^2 \le \min_{\beta \in \mathbb{R}^m} \Big\{ \sum_{t=1}^{n} \big(x[t] - \beta^T f_x(x_{t-a}^{t-1})\big)^2 + \delta \|\beta\|^2 \Big\} + m A^2 \ln\big(1 + A^2 n/\delta\big).$
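As a numerical sanity check of Theorem 2 (using the bound in the form reconstructed above, and the scalar form of the universal predictor that appears in the proof), the sketch below runs first-order linear prediction on an arbitrary bounded sequence and verifies that the accumulated loss stays below the regularized batch loss plus $A^2 \ln(1 + \sum_t f^2[t]/\delta)$. This is an illustration under those assumptions, not a proof.

```python
import numpy as np

rng = np.random.default_rng(5)
A, delta, n = 1.0, 1.0, 1000
x = np.clip(np.cumsum(rng.normal(size=n)) * 0.1, -A, A)     # any bounded sequence

# Scalar universal predictor as in the proof: x_hat_u[t] = r^{t-1} f[t] / (R^t + delta),
# with f[t] = x[t-1] (first-order linear prediction).
R, r, loss_u = 0.0, 0.0, 0.0
for t in range(n):
    f = x[t - 1] if t > 0 else 0.0
    R += f ** 2                                  # R includes the current regressor
    loss_u += (x[t] - (r / (R + delta)) * f) ** 2
    r += x[t] * f                                # r still excludes x[t] when predicting

# Best regularized batch predictor in hindsight, and the regret term of Theorem 2.
feats = np.concatenate(([0.0], x[:-1]))
beta_star = (feats @ x) / (feats @ feats + delta)
batch = np.sum((x - beta_star * feats) ** 2) + delta * beta_star ** 2
bound = batch + A ** 2 * np.log(1 + feats @ feats / delta)
print(loss_u <= bound, loss_u, bound)
```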
C. k-Ahead mth-Order Linear Prediction

For $k$-ahead $m$th-order prediction, the prediction class is given by $f(\mathbf{w}, \cdot) = \beta^T f_x(\cdot) = \beta^T \mathbf{x}[t-k]$, where $\mathbf{x}[t-k] \triangleq [x[t-k], \ldots, x[t-k-m+1]]^T$ as before. After replacing $R_{ff}^n = R^n \triangleq \sum_{t} \mathbf{x}[t-k]\mathbf{x}[t-k]^T$ and $r_f^n = r^n \triangleq \sum_{t} x[t]\,\mathbf{x}[t-k]$ with suitable limits, we obtain an upper bound

$\sum_{t=1}^{n} \big(x[t] - \hat{x}_u[t]\big)^2 \le \min_{\beta \in \mathbb{R}^m} \Big\{ \sum_{t=1}^{n} \big(x[t] - \beta^T \mathbf{x}[t-k]\big)^2 + \delta \|\beta\|^2 \Big\} + m A^2 \ln\big(1 + A^2 n/\delta\big).$

IV. RANDOMIZED OUTPUT PREDICTIONS

In this section, we investigate the performance of randomized output algorithms in the worst-case scenario with respect to linear predictors, using the same regret measure as in (2). We emphasize that randomized output algorithms are a superset of the deterministic sequential predictors, and the derivations here can be readily generalized to include any prediction class. In particular, we consider randomized output algorithms $f(\theta(x_1^{t-1}), x_1^{t-1})$ such that the randomization parameters $\theta \in \mathbb{R}^m$ can be a function of the whole past. Hence, a randomized sequential algorithm introduces randomization, or uncertainty, into its output, so that the output also depends on a random element. Note that such methods are widely used in applications involving security considerations. As an example, suppose there are $m$ prediction algorithms running in parallel to sequentially predict the observation sequence $\{x[t]\}_{t \ge 1}$. At each time $t$, the randomized output algorithm selects one of the constituent algorithms randomly, such that algorithm $k$ is selected with probability $p_k[t]$. By definition, $\sum_{k=1}^{m} p_k[t] = 1$, and $p_k[t]$ may be generated as a combination of the past observation samples and a seed independent of the observations.
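Before formalizing this, a small illustration (with arbitrary constituent predictors and selection probabilities) of why such randomization cannot help under the squared error: for every $t$, the expected squared error of the random selection equals the squared error of the probability-weighted mean output plus a nonnegative variance term, so the deterministic algorithm that outputs the weighted mean is never worse in expectation.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, trials = 3, 200, 500
x = np.sin(0.1 * np.arange(n))                             # the target sequence

# m constituent predictors and (arbitrary) selection probabilities p_k[t].
preds = np.stack([np.roll(x, k + 1) for k in range(m)])    # predictor k outputs x[t-k-1]
p = rng.dirichlet(np.ones(m), size=n).T                    # column t sums to 1

# Monte Carlo estimate of the randomized algorithm's expected accumulated loss.
rand_loss = 0.0
for _ in range(trials):
    picks = np.array([rng.choice(m, p=p[:, t]) for t in range(n)])
    rand_loss += np.sum((x - preds[picks, np.arange(n)]) ** 2)
rand_loss /= trials

# Deterministic algorithm that outputs the p-weighted mean of the same predictors.
mean_loss = np.sum((x - np.sum(p * preds, axis=0)) ** 2)
print(mean_loss <= rand_loss, mean_loss, rand_loss)        # weighted mean is never worse in expectation
```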

For such randomized output prediction algorithms, we consider the following time-accumulated prediction error over a deterministic sequence $\{x[t]\}_{t \ge 1}$:

$P_{\mathrm{rand}}(n) \triangleq E_{\theta}\Big[ \sum_{t=1}^{n} \big(x[t] - f(\theta(x_1^{t-1}), x_1^{t-1})\big)^2 \Big].$   (10)

This expectation is taken over all the randomization due to independent or dependent seeds. Hence, our general regret can be extended to include this performance measure:

$\sup_{x_1^n} \Big\{ P_{\mathrm{rand}}(n) - \min_{\mathbf{w} \in \mathbb{R}^m} \sum_{t=1}^{n} \big(x[t] - \mathbf{w}^T \mathbf{x}[t-1]\big)^2 \Big\}.$   (11)

Expanding (10), we obtain

$P_{\mathrm{rand}}(n) = \sum_{t=1}^{n} \Big\{ \big(x[t] - E_{\theta}\big[f(\theta(x_1^{t-1}), x_1^{t-1})\big]\big)^2 + \mathrm{Var}_{\theta}\big(f(\theta(x_1^{t-1}), x_1^{t-1})\big) \Big\}$

noting that $x[t]$ is independent of the randomization. Since $E_{\theta}[f(\theta(x_1^{t-1}), x_1^{t-1})]$ is a sequential function of $x_1^{t-1}$ and $\mathrm{Var}_{\theta}(f(\theta(x_1^{t-1}), x_1^{t-1}))$ is always nonnegative, the performance of a randomized output algorithm can be attained by a deterministic sequential algorithm. Since deterministic algorithms are a subclass of randomized output algorithms, the upper bounds we derived for $k$-ahead $m$th-order prediction in (9) also hold for (11). Since we also proved that the lower bounds for such $m$th-order linear predictors are of the form $O(m \ln(n))$, the lower and upper bounds are tight and of the form $O(m \ln(n))$.

V. CONCLUSION

In this brief, we considered the problem of sequential prediction from a mixture of experts perspective. We introduced comprehensive lower bounds for the sequential learning framework by proving that, for any sequential algorithm, there always exists a sequence for which the sequential predictor cannot outperform the class of parametric predictors whose parameters are set noncausally. The lower bounds for important parametric classes, such as the univariate polynomial, multivariate polynomial, and linear predictor classes, were derived in detail. We then introduced a universal sequential prediction algorithm and investigated the upper bound on the regret of this algorithm. We also derived the upper bounds in detail for the same important classes that we discussed for the lower bounds, and we further showed that this algorithm is optimal in a strong minimax sense in some scenarios. Finally, we proved that, in the worst-case scenario, randomized output algorithms cannot provide any improvement in performance compared with sequential algorithms.

REFERENCES

[1] N.-Y. Liang, G.-B. Huang, P. Saratchandran, and N. Sundararajan, "A fast and accurate online sequential learning algorithm for feedforward networks," IEEE Trans. Neural Netw., vol. 17, no. 6, pp. 1411–1423, Nov. 2006.
[2] L. Devroye, T. Linder, and G. Lugosi, "Nonparametric estimation and classification using radial basis function nets and empirical risk minimization," IEEE Trans. Neural Netw., vol. 7, no. 2, Mar. 1996.
[3] A. Krzyzak and T. Linder, "Radial basis function networks and complexity regularization in function learning," IEEE Trans. Neural Netw., vol. 9, no. 2, pp. 247–256, Mar. 1998.
[4] N. Cesa-Bianchi, P. M. Long, and M. K. Warmuth, "Worst-case quadratic loss bounds for prediction using linear functions and gradient descent," IEEE Trans. Neural Netw., vol. 7, no. 3, May 1996.
[5] A. C. Singer and M. Feder, "Universal linear prediction by model order weighting," IEEE Trans. Signal Process., vol. 47, no. 10, Oct. 1999.
[6] G. C. Zeitler and A. C. Singer, "Universal linear least-squares prediction in the presence of noise," in Proc. IEEE/SP 14th Workshop on Statistical Signal Processing (SSP), Aug. 2007.
[7] A. C. Singer, S. S. Kozat, and M. Feder, "Universal linear least squares prediction: Upper and lower bounds," IEEE Trans. Inf. Theory, vol. 48, no. 8, Aug. 2002.
[8] T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation. Englewood Cliffs, NJ, USA: Prentice-Hall, 2000.
[9] V. Cherkassky, X. Shao, F. M. Mulier, and V. N. Vapnik, "Model complexity control for regression using VC generalization bounds," IEEE Trans. Neural Netw., vol. 10, no. 5, Sep. 1999.
[10] J. Kivinen and M. K. Warmuth, "Exponentiated gradient versus gradient descent for linear predictors," Inf. Comput., vol. 132, no. 1, pp. 1–63, 1997.
[11] V. J. Mathews, "Adaptive polynomial filters," IEEE Signal Process. Mag., vol. 8, no. 3, pp. 10–26, Jul. 1991.
[12] V. Vovk, "Competitive on-line statistics," Int. Statist. Rev., vol. 69, no. 2, pp. 213–248, 2001.
[13] S. S. Kozat, A. C. Singer, and G. C. Zeitler, "Universal piecewise linear prediction via context trees," IEEE Trans. Signal Process., vol. 55, no. 7, Jul. 2007.
[14] T. Weissman and N. Merhav, "Universal prediction of individual binary sequences in the presence of noise," IEEE Trans. Inf. Theory, vol. 47, no. 6, pp. 2151–2173, Sep. 2001.
[15] T. Moon and T. Weissman, "Universal FIR MMSE filtering," IEEE Trans. Signal Process., vol. 57, no. 3, Mar. 2009.
[16] T. Moon and T. Weissman, "Competitive on-line linear FIR MMSE filtering," in Proc. IEEE Int. Symp. Information Theory (ISIT), Jun. 2007.
[17] H. Stark and J. Woods, Probability, Random Processes, and Estimation Theory for Engineers. Upper Saddle River, NJ, USA: Prentice-Hall, 1994.


More information

Topic 5a Introduction to Curve Fitting & Linear Regression

Topic 5a Introduction to Curve Fitting & Linear Regression /7/08 Course Instructor Dr. Rayond C. Rup Oice: A 337 Phone: (95) 747 6958 E ail: rcrup@utep.edu opic 5a Introduction to Curve Fitting & Linear Regression EE 4386/530 Coputational ethods in EE Outline

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information Cite as: Straub D. (2014). Value of inforation analysis with structural reliability ethods. Structural Safety, 49: 75-86. Value of Inforation Analysis with Structural Reliability Methods Daniel Straub

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence

Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr

More information

Decentralized Adaptive Control of Nonlinear Systems Using Radial Basis Neural Networks

Decentralized Adaptive Control of Nonlinear Systems Using Radial Basis Neural Networks 050 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 44, NO., NOVEMBER 999 Decentralized Adaptive Control of Nonlinear Systes Using Radial Basis Neural Networks Jeffrey T. Spooner and Kevin M. Passino Abstract

More information

On Conditions for Linearity of Optimal Estimation

On Conditions for Linearity of Optimal Estimation On Conditions for Linearity of Optial Estiation Erah Akyol, Kuar Viswanatha and Kenneth Rose {eakyol, kuar, rose}@ece.ucsb.edu Departent of Electrical and Coputer Engineering University of California at

More information

World's largest Science, Technology & Medicine Open Access book publisher

World's largest Science, Technology & Medicine Open Access book publisher PUBLISHED BY World's largest Science, Technology & Medicine Open Access book publisher 2750+ OPEN ACCESS BOOKS 95,000+ INTERNATIONAL AUTHORS AND EDITORS 88+ MILLION DOWNLOADS BOOKS DELIVERED TO 5 COUNTRIES

More information

arxiv: v1 [cs.ds] 3 Feb 2014

arxiv: v1 [cs.ds] 3 Feb 2014 arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models 2014 IEEE International Syposiu on Inforation Theory Copression and Predictive Distributions for Large Alphabet i.i.d and Markov odels Xiao Yang Departent of Statistics Yale University New Haven, CT, 06511

More information

Efficient Learning with Partially Observed Attributes

Efficient Learning with Partially Observed Attributes Nicolò Cesa-Bianchi DSI, Università degli Studi di Milano, Italy Shai Shalev-Shwartz The Hebrew University, Jerusale, Israel Ohad Shair The Hebrew University, Jerusale, Israel Abstract We describe and

More information

CHAPTER 8 CONSTRAINED OPTIMIZATION 2: SEQUENTIAL QUADRATIC PROGRAMMING, INTERIOR POINT AND GENERALIZED REDUCED GRADIENT METHODS

CHAPTER 8 CONSTRAINED OPTIMIZATION 2: SEQUENTIAL QUADRATIC PROGRAMMING, INTERIOR POINT AND GENERALIZED REDUCED GRADIENT METHODS CHAPER 8 CONSRAINED OPIMIZAION : SEQUENIAL QUADRAIC PROGRAMMING, INERIOR POIN AND GENERALIZED REDUCED GRADIEN MEHODS 8. Introduction In the previous chapter we eained the necessary and sufficient conditions

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information