ONLINE BAYESIAN KERNEL REGRESSION FROM NONLINEAR MAPPING OF OBSERVATIONS

Matthieu Geist 1,2, Olivier Pietquin 1 and Gabriel Fricout 2
1 SUPELEC, IMS Research Group, Metz, France
2 ArcelorMittal Research, MCE Department, Maizières-lès-Metz, France

ABSTRACT

In a large number of applications, engineers have to estimate a function linked to the state of a dynamic system. To do so, a sequence of samples drawn from this unknown function is observed while the system transits from state to state, and the problem is to generalize these observations to unvisited states. Several solutions can be envisioned, among which regressing a family of parameterized functions so as to fit the observed samples as well as possible. However, classical methods cannot handle the case where the actual samples are not directly observable and only a nonlinear mapping of them is available, which happens when a special sensor has to be used or when solving the Bellman equation in order to control the system. This paper introduces a method based on Bayesian filtering and kernel machines designed to solve this tricky problem. First experimental results are promising.

1. INTRODUCTION

In a large number of applications, engineers have to estimate a function linked to the state of a dynamic system. This function generally qualifies the system (e.g. the magnitude of the electromagnetic field in a wifi network according to the power emitted by each antenna in the network). To do so, a sequence of samples drawn from this unknown function is observed while the system transits from state to state, and the problem is to generalize these observations to unvisited states. Several solutions can be envisioned, among which regressing a family of parameterized functions so as to fit the observed samples as well as possible. This is a well-known problem that is addressed by many standard methods in the literature, such as kernel machines [1] or neural networks [2]. Yet several problems are not usually handled by standard techniques. For instance, the actual samples are sometimes not directly observable and only a nonlinear mapping of them is available. This is the case when a special sensor has to be used (e.g. measuring a temperature with a spectrometer or a thermocouple). This is also the case when solving the Bellman equation in a Markovian decision process with unknown deterministic transitions [3]. This is important for (asynchronous and online) dynamic programming, and more generally for control theory. Another problem is to handle online regression, that is, to incrementally improve the regression results as new samples are observed by recursively updating previously computed parameters. In this paper we propose a method, based on the Bayesian filtering paradigm and on kernel machines, for online regression from nonlinear mappings of observations. We have chosen here to describe a quite general formulation of the problem in order to appeal to a broader audience. Indeed, the technique introduced below handles nonlinearities well in a derivative-free way, and it should be useful in other fields.

1.1. Problem statement

Throughout the rest of this paper, x will denote a column vector and x a scalar. Let x = (x^1, x^2) ∈ X = X_1 × X_2, where X_1 (resp. X_2) is a compact set of R^n (resp. R^m). Let t : X_1 × X_2 → X_1 be a nonlinear transformation (transitions in the case of dynamic systems) which will be observed. Let g be a nonlinear mapping such that g : f ∈ R^X ↦ g_f ∈ R^{X×X_1}. The aim here is to approximate sequentially the nonlinear function f : x ∈ X ↦ f(x) = f(x^1, x^2) ∈ R, from samples (x_k, t_k = t(x^1_k, x^2_k), y_k = g_f(x^1_k, x^2_k, t_k))_k, by a function f̂_θ(x) = f̂_θ(x^1, x^2) parameterized by the vector θ.
Here the output is scalar; however, the proposed method can be straightforwardly extended to vectorial outputs. Note that, as one can associate to any g_f ∈ R^{X×X_1} the function g̃_f ∈ R^X defined by g̃_f : x ∈ X ↦ g_f(x, t(x)), this problem statement is quite general. Thus the work presented in this paper can also be considered with g : f ∈ R^X ↦ g_f ∈ R^X (g being known analytically), this case being more specific. The interest of this particular formulation is that one part of the nonlinearities has to be known analytically (the mapping g), whereas the other part can simply be observed (the t function).
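To make the setting concrete, here is a small, purely illustrative instance of the problem statement with made-up choices of f, t and g (none of them taken from the paper): the regressor only ever sees the input x_k, the transformed point t_k and the mapped value y_k, never f(x_k) itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ingredients, for illustration only: X_1 = X_2 = [-1, 1].
f = lambda x1, x2: np.sin(3 * x1) + x1 * x2              # unknown function to be regressed
t = lambda x1, x2: np.tanh(x1 + x2)                      # observed nonlinear transformation t : X_1 x X_2 -> X_1
g_f = lambda x1, x2, tk: np.exp(f(x1, x2)) + f(tk, x2)   # nonlinear mapping of f (g is known, f is not)

def sample():
    """Draw one training triple (x_k, t_k, y_k); note that f(x_k) itself is never observed."""
    x1, x2 = rng.uniform(-1.0, 1.0, size=2)
    tk = t(x1, x2)
    return (x1, x2), tk, g_f(x1, x2, tk)

print(sample())
```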

1.2. Outline

A kernel-based regression is used, namely the approximation is of the form f̂_θ(x) = Σ_{i=1}^p α_i K(x, x_i), where K is a kernel, that is a continuous, symmetric and positive semi-definite function. The parameter vector θ contains the weights (α_i)_i, and possibly the centers (x_i)_i and some parameters of the kernel (e.g. the variance for Gaussian kernels). These methods rely on the Mercer theorem [4], which states that each kernel is a dot product in a higher dimensional space. More precisely, for each kernel K, there exists a mapping φ : X → F (F being called the feature space) such that ∀x, y ∈ X, K(x, y) = ⟨φ(x), φ(y)⟩. Thus, any linear regression algorithm which only uses dot products can be cast by this kernel trick into a nonlinear one by implicitly mapping the original space X to a higher dimensional one. Many approaches to kernel regression can be found in the literature, the most classical being the Support Vector Machines (SVM) framework [4]. There are quite fewer Bayesian approaches; nonetheless, the reader can refer to [5] or [6] for interesting examples. To our knowledge, none of them is designed to handle the regression problem described in this paper.

As a preprocessing step, a prior on the kernel to be used is put (e.g. a Gaussian kernel with a specific width), and a dictionary method [7] is used to compute an approximate basis of φ(X). This gives a fairly sparse and noise-independent initialisation for the number of kernels to be used and their centers. Following [8], the regression problem is cast into the state-space representation

  θ_{k+1} = θ_k + v_k
  y_k = g_{f̂_{θ_k}}(x^1_k, x^2_k, t_k) + n_k        (1)

where v_k is an artificial process noise and n_k is an observation noise. A Sigma Point Kalman Filter (SPKF) [9] is used to sequentially estimate the parameters, as the samples (x_k, t(x_k), y_k)_k are observed.

In the following sections, the SPKF approach and the dictionary method will first be briefly presented. The proposed algorithm, which is based on the two aforementioned methods, will then be exhibited, and first experimental results will be shown. Eventually, future works will be sketched.

2. BACKGROUND

2.1. Bayesian Filtering Paradigm

The problem of Bayesian filtering can be expressed in its state-space formulation:

  s_{k+1} = ψ_k(s_k, v_k)
  y_k = h_k(s_k, n_k)        (2)

The objective is to sequentially infer E[s_k | y_{1:k}], the expectation of the hidden state s_k given the sequence of observations up to time k, y_{1:k}, knowing that the state evolution is driven by the possibly nonlinear mapping ψ_k and the process noise v_k. The observation y_k, through the possibly nonlinear mapping h_k, is a function of the state s_k, corrupted by an observation noise n_k. If the mappings are linear and if the noises v_k and n_k are additive Gaussian noises, the optimal solution is given by the Kalman filter [10]: quantities of interest are random variables, and inference (that is, prediction of these quantities and correction of them given a new observation) is done online by propagating sufficient statistics through linear transformations. If one of these two hypotheses does not hold, approximate solutions exist. See [11] for a survey on Bayesian filtering.
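For reference, the linear-Gaussian special case mentioned above can be written in a few lines: when ψ_k and h_k are linear and the noises are additive Gaussian, the sufficient statistics (mean and covariance) are propagated exactly. This is only a reminder sketch of the classical Kalman recursion [10], with generic matrices F and H chosen for the example.

```python
import numpy as np

def kalman_step(s, P, y, F, H, Q, R):
    """One predict/correct cycle for the linear-Gaussian model
    s_{k+1} = F s_k + v_k,  y_k = H s_k + n_k,  v ~ N(0, Q), n ~ N(0, R)."""
    # Prediction: propagate mean and covariance through the (linear) evolution
    s_pred = F @ s
    P_pred = F @ P @ F.T + Q
    # Correction: update the sufficient statistics with the new observation y
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    s_new = s_pred + K @ (y - H @ s_pred)
    P_new = P_pred - K @ S @ K.T
    return s_new, P_new

# Tiny usage example with a 2-d state and a scalar observation
F = np.eye(2); H = np.array([[1.0, 0.5]])
Q = 0.01 * np.eye(2); R = np.array([[0.1]])
s, P = np.zeros(2), np.eye(2)
s, P = kalman_step(s, P, np.array([1.2]), F, H, Q, R)
print(s)
```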
The Sigma Point framework [9] is a nonlinear extension of Kalman filtering (random noises are still Gaussian). The basic idea is that it is easier to approximate a probability distribution than an arbitrary nonlinear function. Recall that in the Kalman filter framework, basically linear transformations are applied to sufficient statistics. In the Extended Kalman Filter (EKF) approach, nonlinear mappings are linearized. In the Sigma Point approach, a set of so-called sigma points is deterministically computed using the estimates of the mean and covariance of the random variable of interest. These points are representative of the current distribution. Then, instead of computing a linear (or linearized) mapping of the distribution of interest, one calculates a nonlinear mapping of these sigma points, and uses them to compute the sufficient statistics of interest for the prediction and correction equations, that is, to approximate the following distributions:

  p(s_k | y_{1:k-1}) = ∫_S p(s_k | s_{k-1}) p(s_{k-1} | y_{1:k-1}) ds_{k-1}
  p(s_k | y_{1:k}) = p(y_k | s_k) p(s_k | y_{1:k-1}) / ∫_S p(y_k | s_k) p(s_k | y_{1:k-1}) ds_k

Note that the SPKF and classical Kalman equations are very similar. The major change is how the sufficient statistics are computed (directly for Kalman, through sigma-point computation for the SPKF). Table 1 sketches an SPKF update in the case of additive noise, based on the state-space formulation, and using the standard Kalman notations: s_{k|k-1} denotes a prediction, s_{k|k} an estimate (or correction), P_{s_k,y_k} a covariance matrix, n̄_k a mean, and k is the discrete time index. The reader can refer to [9] for details.

Table 1. SPKF Update
  inputs: s̄_{k-1|k-1}, P_{s_{k-1|k-1}}
  outputs: s̄_{k|k}, P_{s_{k|k}}
  Sigma-point computation:
    compute deterministically the sigma-point set S_{k-1|k-1} from s̄_{k-1|k-1} and P_{s_{k-1|k-1}}
  Prediction step:
    compute S_{k|k-1} from ψ_k(S_{k-1|k-1}, v̄_k) and the process noise covariance
    compute s̄_{k|k-1} and P_{s_{k|k-1}} from S_{k|k-1}
  Correction step:
    observe y_k
    Y_{k|k-1} = h_k(S_{k|k-1}, n̄_k)
    compute ȳ_{k|k-1}, P_{y_{k|k-1}} and P_{s_{k|k-1},y_{k|k-1}} from S_{k|k-1}, Y_{k|k-1} and the observation noise covariance
    K_k = P_{s_{k|k-1},y_{k|k-1}} P_{y_{k|k-1}}^{-1}
    s̄_{k|k} = s̄_{k|k-1} + K_k (y_k − ȳ_{k|k-1})
    P_{s_{k|k}} = P_{s_{k|k-1}} − K_k P_{y_{k|k-1}} K_k^T
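To make the correction step of Table 1 more concrete, here is a rough sketch of a sigma-point measurement update specialized to the parameter-estimation case used later (identity evolution, additive noises, scalar observation). It follows a generic unscented recipe with a simple weight choice, not the exact SR-CDKF implementation of [9]; the callable h stands for the observation mapping (g_{f̂_θ} in this paper) and is an assumption of the example.

```python
import numpy as np

def sigma_points(mean, cov, kappa):
    """Deterministic sigma-point set and weights for a Gaussian N(mean, cov)."""
    q = mean.size
    S = np.linalg.cholesky((q + kappa) * cov)            # scaled matrix square root
    pts = [mean] + [mean + S[:, i] for i in range(q)] + [mean - S[:, i] for i in range(q)]
    w = np.full(2 * q + 1, 1.0 / (2.0 * (q + kappa)))
    w[0] = kappa / (q + kappa)
    return np.array(pts), w

def spkf_parameter_update(theta, P, x, y, h, R_v, R_n, kappa=1.0):
    """One sigma-point update for theta_{k+1} = theta_k + v_k, y_k = h(theta_k, x_k) + n_k
    (identity evolution, additive noises, scalar observation)."""
    P_pred = P + R_v                                     # prediction step
    pts, w = sigma_points(theta, P_pred, kappa)
    Y = np.array([h(p, x) for p in pts])                 # propagate sigma points through h
    y_pred = w @ Y                                       # predicted observation
    P_yy = w @ (Y - y_pred) ** 2 + R_n                   # innovation variance
    P_ty = (pts - theta).T @ (w * (Y - y_pred))          # parameter/observation cross-covariance
    K = P_ty / P_yy                                      # Kalman gain
    theta_new = theta + K * (y - y_pred)                 # correction of the mean
    P_new = P_pred - np.outer(K, K) * P_yy               # correction of the covariance
    return theta_new, P_new, K

# Tiny usage example: estimate the single weight of a fixed Gaussian kernel
h = lambda th, x: th[0] * np.exp(-x ** 2 / 2.0)
theta, P = np.zeros(1), np.eye(1)
theta, P, K = spkf_parameter_update(theta, P, x=0.5, y=1.0, h=h, R_v=1e-3 * np.eye(1), R_n=0.05 ** 2)
print(theta)
```

A square-root, central-difference variant of this update (the SR-CDKF used later in the paper) avoids manipulating the full covariance explicitly and is computationally cheaper.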

2.2. Dictionary method

As said in section 1, the kernel trick corresponds to a dot product in a higher dimensional space F, associated with a mapping φ. By observing that, although F is a (very) high dimensional space, φ(X) can be a quite smaller embedding, the objective is to find a set of p points in X such that φ(X) ≈ Span{φ(x̃_1), ..., φ(x̃_p)}. This procedure is iterative. Suppose that samples x_1, x_2, ... are sequentially observed. At time k, a dictionary D_{k-1} = (x̃_j)_{j=1}^{m_{k-1}} ⊂ (x_j)_{j=1}^{k-1} of m_{k-1} elements is available, where by construction the feature vectors φ(x̃_j) are approximately linearly independent in F. A sample x_k is then observed, and it is added to the dictionary if φ(x_k) is linearly independent of D_{k-1}. To test this, weights a = (a_1, ..., a_{m_{k-1}})^T have to be computed so as to verify

  δ_k = min_{a ∈ R^{m_{k-1}}} || Σ_{j=1}^{m_{k-1}} a_j φ(x̃_j) − φ(x_k) ||²        (3)

Formally, if δ_k = 0 then the feature vectors are linearly dependent, otherwise not. Practically, an approximate dependence is allowed, and δ_k is compared to a predefined threshold ν determining the quality of the approximation (and consequently the sparsity of the dictionary). Thus the feature vectors will be considered as approximately linearly dependent if δ_k ≤ ν. By using the kernel trick and the bilinearity of dot products, equation (3) can be rewritten as

  δ_k = min_a { a^T K̃_{k-1} a − 2 a^T k̃_{k-1}(x_k) + K(x_k, x_k) }        (4)

where (K̃_{k-1})_{i,j} = K(x̃_i, x̃_j) is an m_{k-1} × m_{k-1} matrix and (k̃_{k-1}(x))_i = K(x, x̃_i) is an m_{k-1} × 1 vector. If δ_k > ν, x_k = x̃_{m_k} is added to the dictionary, otherwise not. Equation (4) admits the analytical solution a_k = K̃_{k-1}^{-1} k̃_{k-1}(x_k). Moreover, there exists a computationally efficient algorithm which uses the partitioned matrix inversion formula to construct this dictionary. See [7] for details.

Thus, by choosing a prior on the kernel to be used, and by applying this algorithm to a set of N points randomly sampled from X, a sparse set of good candidates for the kernel regression problem is obtained. This method is theoretically well founded, easy to implement, computationally efficient, and it depends neither on the kernel nor on the topology of the space. Note that, despite the fact that this algorithm is naturally online, this dictionary cannot be built (straightforwardly) while estimating the parameters, since the parameters of the chosen kernels (such as mean and deviation for Gaussian kernels) will be parameterized as well. Observe that only bounds on X have to be known in order to compute this dictionary, and not the samples used for regression.
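The sparsification test (3)-(4) translates almost directly into code. The sketch below is a hedged re-implementation of the dictionary construction of [7], growing K̃^{-1} with the partitioned matrix inversion formula; the kernel is passed as a generic callable, and the settings of the usage example (N = 1000 points, width 4.4, ν = 0.1) simply mirror those used in the experiments of Section 4.

```python
import numpy as np

def build_dictionary(samples, kernel, nu):
    """Greedy ALD sparsification: keep a point if its feature vector is not approximately
    spanned by the feature vectors of the current dictionary (delta_k > nu)."""
    D = [samples[0]]
    K_inv = np.array([[1.0 / kernel(samples[0], samples[0])]])   # inverse of the 1x1 kernel matrix
    for x in samples[1:]:
        k_vec = np.array([kernel(x, d) for d in D])              # vector k~_{k-1}(x)
        a = K_inv @ k_vec                                        # optimal expansion weights of Eq. (4)
        delta = kernel(x, x) - k_vec @ a                         # residual delta_k
        if delta > nu:                                           # approximately independent: add it
            D.append(x)
            # grow K~^{-1} with the partitioned matrix inversion formula
            K_inv = np.block([
                [delta * K_inv + np.outer(a, a), -a[:, None]],
                [-a[None, :],                    np.ones((1, 1))],
            ]) / delta
    return D

# Usage mirroring the preprocessing of Section 4: Gaussian kernel, width 4.4, nu = 0.1
rng = np.random.default_rng(0)
points = rng.uniform(-10.0, 10.0, size=(1000, 2))
gauss = lambda u, v: float(np.exp(-np.sum((u - v) ** 2) / (2.0 * 4.4 ** 2)))
D = build_dictionary(points, gauss, nu=0.1)
print(len(D))   # number of kernel centers retained
```

The threshold ν directly trades the sparsity of the dictionary against the quality of the feature-space approximation.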
3. PROPOSED ALGORITHM

Recall that the objective here is to sequentially approximate a nonlinear function, as samples (x_k, t_k = t(x^1_k, x^2_k), y_k) become available, with a set of kernel functions. This parameter estimation problem can be expressed as the state-space problem (1). Note that here f does not depend on time, but we think that this approach can easily be extended to nonstationary function approximation. In this paper the approximation is of the form

  f̂_θ(x) = Σ_{i=1}^p α_i K_{σ_i^1}(x^1, x_i^1) K_{σ_i^2}(x^2, x_i^2),        (5)

with θ = [(α_i)_{i=1}^p, ((x_i^1)^T)_{i=1}^p, (σ_i^1)_{i=1}^p, ((x_i^2)^T)_{i=1}^p, (σ_i^2)_{i=1}^p]^T and K_{σ_i^j}(x^j, x_i^j) = exp(−||x^j − x_i^j||² / (2(σ_i^j)²)), j = 1, 2. Note that K_{σ_i}(x, x_i) = K_{(σ_i^1,σ_i^2)}((x^1, x^2), (x_i^1, x_i^2)) = K_{σ_i^1}(x^1, x_i^1) K_{σ_i^2}(x^2, x_i^2) is a product of kernels, thus it is a kernel.

The optimal number of kernels and a good initialisation for the parameters have to be determined first. To tackle this initial difficulty, the dictionary method is used in a preprocessing step. A prior σ_0 = (σ_0^1, σ_0^2) on the Gaussian widths is first put (the same for each kernel). Then a set of N random points is sampled uniformly from X. Finally, these samples are used to compute the dictionary. A set of p points D = {x̃_1, ..., x̃_p} is thus obtained such that φ_{σ_0}(X) ≈ Span{φ_{σ_0}(x̃_1), ..., φ_{σ_0}(x̃_p)}, where φ_{σ_0} is the mapping corresponding to K_{σ_0}. Note that even though this sparsification procedure is offline, the proposed algorithm (the regression part) is online. Moreover, no training sample is required for this preprocessing step, but only a classical prior which is anyway required for the Bayesian filter (σ_0), one sparsification parameter ν and bounds for X.
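As an illustration of the parameterization (5), the following sketch evaluates f̂_θ for a parameter vector packed as [weights, centers x^1, widths σ^1, centers x^2, widths σ^2]; this packing order (and the helper name) is an assumption made for the example, the text only specifies that θ contains these (n + m + 3)p quantities.

```python
import numpy as np

def f_hat(theta, x1, x2, p, n, m):
    """Evaluate Eq. (5): a weighted sum of p products of Gaussian kernels, one per input block.
    theta packs, in order: p weights, p centers in R^n, p widths sigma^1, p centers in R^m,
    p widths sigma^2, so that len(theta) = (n + m + 3) * p."""
    i = 0
    alpha = theta[i:i + p];                  i += p
    c1 = theta[i:i + p * n].reshape(p, n);   i += p * n
    s1 = theta[i:i + p];                     i += p
    c2 = theta[i:i + p * m].reshape(p, m);   i += p * m
    s2 = theta[i:i + p]
    k1 = np.exp(-np.sum((x1 - c1) ** 2, axis=1) / (2.0 * s1 ** 2))
    k2 = np.exp(-np.sum((x2 - c2) ** 2, axis=1) / (2.0 * s2 ** 2))
    return float(alpha @ (k1 * k2))

# Example with p = 2 kernels and scalar inputs (n = m = 1)
theta = np.array([1.0, -0.5,  0.0, 2.0,  1.5, 1.5,  0.0, 1.0,  2.0, 2.0])
print(f_hat(theta, np.array([0.5]), np.array([0.3]), p=2, n=1, m=1))
```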

A SPKF is then used to estimate the parameters. As for any Bayesian approach, an initial prior on the parameter distribution has to be put. The initial parameter vector follows a Gaussian law, that is θ_0 ∼ N(θ̄_0, Σ_{θ_0}), where

  θ̄_0 = [α_0, ..., α_0, D^1, σ_0^1, ..., σ_0^1, D^2, σ_0^2, ..., σ_0^2]^T,
  Σ_{θ_0} = diag(σ_{α_0}², ..., σ_{μ_0^1}², ..., σ_{σ_0^1}², ..., σ_{μ_0^2}², ..., σ_{σ_0^2}², ...).

In these equations, α_0 is the prior mean on the kernel weights, D^j = [(x̃_1^j)^T, ..., (x̃_p^j)^T] is derived from the dictionary computed in the preprocessing step, σ_0^j is the prior mean on the kernel deviations, and σ_{α_0}², σ_{μ_0^j}² (which is a row vector with one variance for each component of x^j) and σ_{σ_0^j}² are respectively the prior variances on kernel weights, centers and deviations. All these parameters (except the dictionary) have to be set up beforehand. Let |θ| = q = (n + m + 3)p; note that θ̄_0 ∈ R^q and Σ_{θ_0} ∈ R^{q×q}. A prior on the noises has to be put too: more precisely, v_0 ∼ N(0, R_{v_0}) where R_{v_0} = σ_{v_0}² I_q, I_q being the identity matrix, and n ∼ N(0, R_n) where R_n = σ_n². Note that here the observation noise is also structural; in other words, it models the ability of the parameterization f̂_θ to approximate the true function of interest f. Then an SPKF update (which updates the sufficient statistics of the parameters) is simply applied at each time step, as a new training sample (x_k, t_k, y_k) becomes available.

A specific form of SPKF is used, the so-called Square-Root Central Difference Kalman Filter (SR-CDKF) in its parameter estimation form. This implementation uses the fact that, for this parameter estimation problem, the evolution equation is linear and the noise is additive. This approach is computationally cheaper: the complexity per iteration is O(|θ|²), whereas it is O(|θ|³) in the general case. See [9] for details.

A last issue is to choose the artificial process noise. Formally, since the target function is stationary, there is no process noise. But the parameter space has to be explored in order to find a good solution, that is, an artificial process noise must be introduced. Choosing this noise is still an open research problem. Following [9], a Robbins-Monro stochastic approximation scheme for estimating the innovation is used. That is, the process noise covariance is set to

  R_{v_k} = (1 − α) R_{v_{k-1}} + α K_k (y_k − g_{f̂_{θ_{k-1}}}(x_k, t_k))² K_k^T.

Here α is a forgetting factor set by the user, and K_k is the Kalman gain obtained during the SR-CDKF update. This approach is summarized in Table 2.

Table 2. Proposed algorithm
  inputs: ν, N, α_0, σ_0^j, σ_{α_0}, σ_{σ_0^j}, σ_{μ_0^j}, σ_{v_0}, σ_n
  outputs: θ̄, Σ_θ
  Compute dictionary:
    ∀i ∈ {1 ... N}, x_i ∼ U_X
    set X̃ = {x_1, ..., x_N}
    D = Compute-Dictionary(X̃, ν, σ_0)
  Initialization:
    initialize θ̄_0, Σ_{θ_0}, R_n, R_{v_0}
  for k = 1, 2, ... do
    observe (x_k, t_k, y_k)
    SR-CDKF update:
      [θ̄_k, Σ_{θ_k}, K_k] = SR-CDKF-Update(θ̄_{k-1}, Σ_{θ_{k-1}}, x_k, t_k, y_k, R_{v_{k-1}}, R_n)
    Artificial process noise update:
      R_{v_k} = (1 − α) R_{v_{k-1}} + α K_k (y_k − g_{f̂_{θ_{k-1}}}(x_k, t_k))² K_k^T
  end for
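The artificial process-noise recursion above is simple enough to state directly in code; the sketch below is a minimal stand-alone version of that single update (the Kalman gain K and the innovation are assumed to come from the SR-CDKF update, which is not reproduced here).

```python
import numpy as np

def update_process_noise(R_v, K, innovation, alpha):
    """Robbins-Monro adaptation of the artificial process-noise covariance:
    R_v_k = (1 - alpha) * R_v_{k-1} + alpha * K_k * (y_k - y_hat_k)^2 * K_k^T."""
    return (1.0 - alpha) * R_v + alpha * (innovation ** 2) * np.outer(K, K)

# Example with q = 3 parameters and the forgetting factor alpha = 0.7 used in the experiments
R_v = 0.1 * np.eye(3)
R_v = update_process_noise(R_v, K=np.array([0.2, -0.1, 0.05]), innovation=0.3, alpha=0.7)
print(R_v)
```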
4. PRELIMINARY RESULTS

In this section the results of our preliminary experiments are presented. The proposed artificial problem is quite arbitrary; it has been chosen for its interesting nonlinearities so as to emphasize the potential benefits of the proposed approach. More results focusing on the application of this method to reinforcement learning are presented in [12].

Let X = [−10, 10]² and let f be a 2d nonlinear cardinal sine:

  f(x^1, x^2) = sin(x^1)/x^1 + x^1 x^2 / 10².

Let the observed nonlinear transformation be

  t(x^1, x^2) = 10 tanh(x^1/7) + sin(x^2).

Recall that this analytical expression is only used to generate the observations in this experiment; it is not used in the regression algorithm. Let the nonlinear mapping g be

  g_f(x^1, x^2, t(x^1, x^2)) = f(x^1, x^2) − γ max_{z∈X_2} f(t(x^1, x^2), z),

with γ ∈ [0, 1) a predefined constant. This is a specific form of the Bellman equation [3], with the transformation t being a deterministic transition function. Recall that this function is of special interest in optimal control theory. The associated state-space formulation is thus

  θ_{k+1} = θ_k + v_k
  y_k = f̂_{θ_k}(x^1_k, x^2_k) − γ max_{z∈X_2} f̂_{θ_k}(t(x^1_k, x^2_k), z) + n_k        (6)

4.1. Problem statement and settings

At each time step k, the filter observes x_k ∼ U_X (x_k uniformly sampled from X), the transformation t_k = t(x^1_k, x^2_k) and y_k = f(x_k) − γ max_{z∈X_2} f(t_k, z). Once again, the regressor does not have to know the function t analytically; it just has to observe it. The function f is shown on Fig. 1(a) and its nonlinear mapping on Fig. 1(b). For these experiments the algorithm parameters were set to N = 1000, σ_0^j = 4.4, ν = 0.1, α_0 = 0, σ_n = 0.05, α = 0.7, and all variances (σ_{α_0}², σ_{μ_0^j}², σ_{σ_0^j}², σ_{v_0}²) to 0.1, for j = 1, 2. The factor γ was set to 0.9. Note that these empirical parameters were not finely tuned.
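A sketch of how the training triples of this experiment can be generated: the functions f and t are those given above, and the max over z ∈ X_2 is approximated on a finite grid, which is an implementation choice of this example rather than something specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
z_grid = np.linspace(-10.0, 10.0, 201)                   # grid approximation of the max over X_2

def f(x1, x2):
    """2d nonlinear cardinal sine, sin(x1)/x1 + x1*x2/10^2 (np.sinc(u) = sin(pi u)/(pi u))."""
    return np.sinc(x1 / np.pi) + x1 * x2 / 100.0

def t(x1, x2):
    """Observed nonlinear transformation."""
    return 10.0 * np.tanh(x1 / 7.0) + np.sin(x2)

def sample():
    """One training triple (x_k, t_k, y_k), with y_k = f(x_k) - gamma * max_z f(t_k, z)."""
    x1, x2 = rng.uniform(-10.0, 10.0, size=2)
    tk = t(x1, x2)
    yk = f(x1, x2) - gamma * np.max(f(tk, z_grid))
    return (x1, x2), tk, yk

print(sample())
```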

Fig. 1. 2-dimensional nonlinear cardinal sine and its observed nonlinear mapping: (a) the 2-dimensional nonlinear cardinal sine; (b) the nonlinear mapping observed by the regressor.

4.2. Quality of regression

Because of the special form of g_f, any bias added to f still gives the same observations, and other invariances may exist: the root mean square error (RMSE) between f and f̂ should therefore not be used to measure the quality of regression. Instead, the nonlinear mapping of f is computed from its estimate f̂ and is then used to calculate the RMSE. The quality of regression is thus measured with the following RMSE:

  ( ∫_X (g_f(x, t(x)) − g_{f̂_θ}(x, t(x)))² dx )^{1/2}.

As it is computed from f̂, it really measures the quality of regression, and as it uses the associated nonlinear mapping, it does not take the bias into account. Practically, it is computed on 10^4 equally spaced points. The function g_f is what is observed (Fig. 1(b)), and it is used by the SPKF to approximate the function f (Fig. 1(a)) by f̂_θ (Fig. 2(a)). We compute g_{f̂_θ} from f̂_θ (Fig. 2(b)) and use it to compute the RMSE.

Fig. 2. Approximation of f(x) and associated approximated nonlinear mapping: (a) approximation of f(x); (b) nonlinear mapping calculated from f̂_θ(x).

As the proposed algorithm is stochastic, the results have been averaged over 100 runs. Fig. 3 shows an errorbar plot, that is, the mean ± standard deviation of the RMSE. The average number of kernels was ±.

Fig. 3. RMSE (g_{f̂_θ}) averaged over 100 runs.

These results can be compared with the RMSE directly computed from f̂, that is ( ∫_{x∈X} (f(x) − f̂_θ(x))² dx )^{1/2}, which is illustrated in Table 3. There is more variance and bias in these results because of the possible invariances of the considered nonlinear mapping. However, one can observe on Fig. 2(a) that a good approximation is obtained.

Table 3. RMSE (mean ± deviation) of g_{f̂_θ} and f̂_θ as a function of the number of samples.
  g_{f̂_θ}:   ±      ±      ±
  f̂_θ:       ±      ±      ±

Thus the RMSE computed from g_f is a better quality measure. As far as we know, no other method handling such a problem has been published so far, which makes comparison with other approaches difficult. However, the order of magnitude of the RMSE obtained from g_f is comparable with state-of-the-art regression approaches when the function of interest is directly observed. This is demonstrated in Table 4. Here the problem is to regress a linear 2d cardinal sine f(x^1, x^2) = sin(x^1)/x^1 + x^2/2. For the proposed algorithm, the nonlinear mapping is the same as considered before. We compare the RMSE results (10^3 training points) to Kernel Recursive Least Squares (KRLS) and Support Vector Regression (SVR). For the proposed approach, the approximation is computed from the nonlinear mapping of observations; for KRLS and SVR, it is computed from direct observations. Notice that SVR is a batch method, that is, it needs the whole set of samples in advance. The target of the regressor is indeed g_f, and we obtain the same order of magnitude (even though many more nonlinearities are introduced for our method and the representation is sparser). The RMSE on f̂ is slightly higher for this contribution, but this can mostly be explained by the invariances (such as the bias invariance) induced by the nonlinear mapping.

Table 4. Comparative results (KRLS and SVR results are reproduced from [7]). The first columns give the test error (RMSE from f̂ and from g_{f̂} for the proposed algorithm, from f̂ for the two others), and the last one the percentage of dictionary/support vectors from the training samples used.
  Method          test error (f̂ / g_{f̂})    %DVs/SVs
  Proposed Alg.                                  %
  KRLS                                           %
  SVR                                            %
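A sketch of the quality measure of Section 4.2: the RMSE is computed between g_f and g_{f̂} on a regular grid of 10^4 points, rather than between f and f̂. The stand-in estimate f̂ below is just a biased copy of f, chosen only to show that the mapped RMSE is far less sensitive to a constant bias than the direct one.

```python
import numpy as np

gamma = 0.9
grid = np.linspace(-10.0, 10.0, 100)                     # 100 x 100 = 10^4 evaluation points
z_grid = np.linspace(-10.0, 10.0, 201)

f = lambda x1, x2: np.sinc(x1 / np.pi) + x1 * x2 / 100.0     # true function
f_hat = lambda x1, x2: f(x1, x2) + 0.5                        # stand-in estimate: a biased copy of f
t = lambda x1, x2: 10.0 * np.tanh(x1 / 7.0) + np.sin(x2)

def g(func, x1, x2):
    """Bellman-like mapping g_func(x, t(x)) = func(x) - gamma * max_z func(t(x), z)."""
    tk = t(x1, x2)
    return func(x1, x2) - gamma * np.max(func(tk, z_grid))

# RMSE computed through the nonlinear mapping
err2 = [(g(f, a, b) - g(f_hat, a, b)) ** 2 for a in grid for b in grid]
print(np.sqrt(np.mean(err2)))   # about 0.05 here, versus a direct RMSE of 0.5 between f and f_hat
```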

5. CONCLUSIONS AND FUTURE WORKS

An approach allowing to regress a function of interest f from observations obtained through a nonlinear mapping of it has been proposed. The regression is online, kernel-based, nonlinear and Bayesian. A preprocessing step allows to achieve sparsity of the representation. This method has proven to be effective on an artificial problem and reaches good performance although many more nonlinearities are introduced. This approach is planned to be extended to stochastic transformation functions (the observed part of the nonlinearities) in order to solve a more general form of the Bellman equation. The adaptive artificial process noise is planned to be extended, and an adaptive observation noise is under study. Finally, the approximation f̂_θ being a mapping of the random variable θ, it is itself a random variable. An uncertainty information can thus be derived from it and is planned to be linked to the quality of regression. Yet the uncertainty reduction occurring when acquiring more samples (thus more information) could be quantified.

6. REFERENCES

[1] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press.
[2] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, New York, USA.
[3] R. Bellman, Dynamic Programming, Dover Publications, sixth edition.
[4] V. N. Vapnik, Statistical Learning Theory, John Wiley & Sons, Inc.
[5] C. M. Bishop and M. E. Tipping, "Bayesian Regression and Classification," in Advances in Learning Theory: Methods, Models and Applications, vol. 190, IOS Press, NATO Science Series III: Computer and Systems Sciences, 2003.
[6] J. Vermaak, S. J. Godsill, and A. Doucet, "Sequential Bayesian Kernel Regression," in Advances in Neural Information Processing Systems, MIT Press.
[7] Y. Engel, S. Mannor, and R. Meir, "The Kernel Recursive Least Squares Algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8.
[8] E. A. Wan and R. van der Merwe, "The Unscented Kalman Filter for nonlinear estimation," in Adaptive Systems for Signal Processing, Communications, and Control Symposium, 2000.
[9] R. van der Merwe and E. A. Wan, "Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models," in Workshop on Advances in Machine Learning, Montreal, Canada.
[10] R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems," Transactions of the ASME, Journal of Basic Engineering, vol. 82, Series D, 1960.
[11] Z. Chen, "Bayesian Filtering: From Kalman Filters to Particle Filters, and Beyond," Tech. Rep., Adaptive Systems Lab, McMaster University.
[12] M. Geist, O. Pietquin, and G. Fricout, "Bayesian Reward Filtering," in European Workshop on Reinforcement Learning (EWRL 2008), Lille (France), 2008, Lecture Notes in Artificial Intelligence, Springer Verlag, to appear.
