Online Learning Algorithms for Stochastic Water-Filling

Yi Gai and Bhaskar Krishnamachari
Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089, USA
Email: {ygai, bkrishna}@usc.edu

Abstract

Water-filling is the term for the classic solution to the problem of allocating constrained power to a set of parallel channels to maximize the total data-rate. It is used widely in practice, for example, for power allocation to sub-carriers in multi-user OFDM systems such as WiMax. The classic water-filling algorithm is deterministic and requires perfect knowledge of the channel gain to noise ratios. In this paper we consider how to do power allocation over stochastically time-varying (i.i.d.) channels with unknown gain to noise ratio distributions. We adopt an online learning framework based on stochastic multi-armed bandits. We consider two variations of the problem, one in which the goal is to find a power allocation to maximize $\sum_i \mathbb{E}[\log(1 + \mathrm{SNR}_i)]$, and another in which the goal is to find a power allocation to maximize $\sum_i \log(1 + \mathbb{E}[\mathrm{SNR}_i])$. For the first problem, we propose a cognitive water-filling algorithm that we call CWF1. We show that CWF1 obtains a regret (defined as the cumulative gap over time between the sum-rate obtained by a distribution-aware genie and this policy) that grows polynomially in the number of channels and logarithmically in time, implying that it asymptotically achieves the optimal time-averaged rate that can be obtained when the gain distributions are known. For the second problem, we present an algorithm called CWF2, which is, to our knowledge, the first algorithm in the literature on stochastic multi-armed bandits to exploit non-linear dependencies between the arms. We prove that the number of times CWF2 picks the incorrect power allocation is bounded by a function that is polynomial in the number of channels and logarithmic in time, implying that its frequency of incorrect allocation tends to zero.

I. INTRODUCTION

A fundamental resource allocation problem that arises in many settings in communication networks is to allocate a constrained amount of power across many parallel channels in order to maximize the sum-rate.
Assuming that the power-rate function for each channel is proportional to $\log(1 + \mathrm{SNR})$, as per Shannon's capacity theorem for AWGN channels, it is well known that the optimal power allocation can be determined by a water-filling strategy [1]. The classic water-filling solution is a deterministic algorithm, and requires perfect knowledge of all channel gain-to-noise ratios. In practice, however, channel gain-to-noise ratios are stochastic quantities. To handle this randomness, we consider an alternative approach, based on online learning, specifically stochastic multi-armed bandits. We formulate the problem of stochastic water-filling as follows: time is discretized into slots; each channel's gain-to-noise ratio is modeled as an i.i.d. random variable with an unknown distribution. In our general formulation, the power-to-rate function for each channel is allowed to be any sub-additive function (see footnote 1). We seek a power allocation that maximizes the expected sum-rate (i.e., an optimization of the form $\mathbb{E}[\sum_i \log(1 + \mathrm{SNR}_i)]$). Even if the channel gain-to-noise ratios are random variables with known distributions, this turns out to be a hard combinatorial stochastic optimization problem. Our focus in this paper is thus on a more challenging case.

In the classical multi-armed bandit (MAB), there is a player playing K arms that yield stochastic rewards with unknown means at each time, in an i.i.d. fashion over time. The player seeks a policy to maximize its total expected reward over time. The performance metric of interest in such problems is regret, defined as the cumulative difference in expected reward between a model-aware genie and that obtained by the given learning policy. It is of interest to show that the regret grows sub-linearly with time, so that the time-averaged regret asymptotically goes to zero, implying that the time-averaged reward of the model-aware genie is obtained asymptotically by the learning policy.

This research was sponsored in part by the U.S. Army Research Laboratory under the Network Science Collaborative Technology Alliance, Agreement Number W911NF-09-2-0053.
We show that it is possible to map the problem of stochastic water-filling to an MAB formulation by treating each possible power allocation as an arm (we consider discrete power levels in this paper; if there are P possible power levels for each of N channels, there would be $P^N$ total arms). We present a novel combinatorial policy for this problem that we call CWF1, which yields regret growing polynomially in N and logarithmically over time. Despite the exponentially growing set of arms, CWF1 observes and maintains information for $P \cdot N$ variables, one corresponding to each power-level and channel, and exploits linear dependencies between the arms based on these variables.

Typically, the way the randomness in the channel gain-to-noise ratios is dealt with is that the mean channel gain-to-noise ratios are estimated first, based on averaging a finite set of training observations, and then the estimated gains are used in a deterministic water-filling procedure. Essentially, this approach tries to identify the power allocation that maximizes a pseudo-sum-rate, which is determined based on the power-rate equation applied to the mean channel gain-to-noise ratios (i.e., an optimization of the form $\sum_i \log(1 + \mathbb{E}[\mathrm{SNR}_i])$). We also present a different stochastic water-filling algorithm that we call CWF2, which learns to do this in an online fashion. This algorithm observes and maintains information for N variables, one corresponding to each channel, and exploits non-linear dependencies between the arms based on these variables. To our knowledge, CWF2 is the first MAB algorithm to exploit non-linear dependencies between the arms. We show that the number of times CWF2 plays a non-optimal combination of powers is uniformly bounded by a function that is logarithmic in time. Under some restrictive conditions, CWF2 may also solve the first problem more efficiently.

Footnote 1: A function f is subadditive if $f(x + y) \le f(x) + f(y)$; for any concave function g, if $g(0) \ge 0$ (such as $\log(1 + x)$), g is subadditive.

II. RELATED WORK

The classic water-filling strategy is described in [1]. There are a few other stochastic variations of water-filling that have been covered in the literature that are different in spirit from our formulation. When a fading distribution over the gains is known a priori, the power constraint is expressed over time, and the instantaneous gains are also known, then a deterministic joint frequency-time water-filling strategy can be used [2], [3]. In [4], a stochastic gradient approach based on Lagrange duality is proposed to solve this problem when the fading distribution is unknown but instantaneous gains are still available. By contrast, in our work we do not assume that the instantaneous gains are known, and focus on keeping the same power constraint at each time while considering unknown gain distributions.

Another work [5] considers water-filling over stochastic non-stationary fading channels, and proposes an adaptive learning algorithm that tracks the time-varying optimal power allocation by incorporating a forgetting factor. However, the focus of their algorithm is on minimizing the maximum mean squared error assuming imperfect channel estimates, and they prove only that their algorithm would converge in a stationary setting. Although their algorithm can be viewed as a learning mechanism, they do not treat stochastic water-filling from the perspective of multi-armed bandits, which is a novel contribution of our work.
In our work, we focus on a stationary setting with perfect channel estimates, but prove stronger results, showing that our learning algorithm not only converges to the optimal allocation, it does so with sub-linear regret.

There has been a long line of work on stochastic multi-armed bandits involving playing arms yielding stochastically time-varying rewards with unknown distributions. Several authors [6]-[9] present learning policies that yield regret growing logarithmically over time (asymptotically, in the case of [6]-[8], and uniformly over time in the case of [9]). Our algorithms build on the UCB1 algorithm proposed in [9] but make significant modifications to handle the combinatorial nature of the arms in this problem. CWF1 has some commonalities with the LLR algorithm we recently developed for a completely different problem, that of stochastic combinatorial bipartite matching for channel allocation [10], but is modified to account for the non-linear power-rate function in this paper. Other recent work on stochastic MAB has considered decentralized settings [11]-[14], and non-i.i.d. reward processes [15]-[19]. With respect to this literature, the problem setting for stochastic water-filling is novel in that it involves a non-linear function of the action and unknown variables. In particular, as far as we are aware, our CWF2 policy is the first to exploit the non-linear dependencies between arms to provably improve the regret performance.

III. PROBLEM FORMULATION

We define the stochastic version of the classic communication theory problem of power allocation for maximizing rate over parallel channels (water-filling) as follows.

We consider a system with N channels, where the channel gain-to-noise ratios are unknown random processes $X_i(n)$, $1 \le i \le N$. Time is slotted and indexed by n. We assume that $X_i(n)$ evolves as an i.i.d. random process over time (i.e., we consider block fading), with the only restriction that its distribution has a finite support. Without loss of generality, we normalize $X_i(n) \in [0, 1]$. We do not require that $X_i(n)$ be independent across i. This random process is assumed to have a mean $\theta_i = \mathbb{E}[X_i]$ that is unknown to the users. We denote the set of all these means by $\Theta = \{\theta_i\}$.
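At each decision period n (also referred to interchangeably as a time slot), an N-dimensional action vector $\mathbf{a}(n)$, representing a power allocation on these N channels, is selected under a policy $\pi(n)$. We assume that the power levels are discrete, and we can put any constraint on the selections of power allocations such that they are from a finite set $\mathcal{F}$ (e.g., a maximum total power constraint, or an upper bound on the maximum allowed power per subcarrier). We assume $a_i(n) \ge 0$ for all $1 \le i \le N$. When a particular power allocation $\mathbf{a}(n)$ is selected, the channel gain-to-noise ratios corresponding to nonzero components of $\mathbf{a}(n)$ are revealed, i.e., the value of $X_i(n)$ is observed for all i such that $a_i(n) \ne 0$. We denote by $\mathcal{A}_{\mathbf{a}(n)} = \{i : a_i(n) \ne 0, 1 \le i \le N\}$ the index set of all $a_i(n) \ne 0$ for an allocation $\mathbf{a}$.

We adopt a general formulation for water-filling, where the sum-rate (we refer to rate and reward interchangeably in this paper) obtained at time n by allocating a set of powers $\mathbf{a}(n)$ is defined as:

$$R_{\mathbf{a}(n)}(n) = \sum_{i \in \mathcal{A}_{\mathbf{a}(n)}} f_i(a_i(n), X_i(n)), \qquad (1)$$

where for all i, $f_i(a_i(n), X_i(n))$ is a nonlinear continuous increasing sub-additive function in $X_i(n)$, and $f_i(a_i(n), 0) = 0$ for any $a_i(n)$. We assume $f_i$ is defined on $\mathbb{R}^+ \times \mathbb{R}^+$.

Our formulation is general enough to include as a special case the rate function obtained from Shannon's capacity theorem for AWGN, which is widely used in communication networks:

$$R_{\mathbf{a}(n)}(n) = \sum_{i=1}^{N} \log(1 + a_i(n) X_i(n)).$$

In the typical formulation there is a total power constraint and individual power constraints; the corresponding constraint set is

$$\mathcal{F} = \Big\{\mathbf{a} : \sum_{i=1}^{N} a_i \le P_{total} \wedge 0 \le a_i \le P_i, \forall i\Big\},$$

where $P_{total}$ is the total power constraint and $P_i$ is the maximum allowed power per channel.

Our goal is to maximize the expected sum-rate when the distributions of all $X_i$ are unknown, as shown in (2). We refer to this objective as $O_1$.

$$\max_{\mathbf{a} \in \mathcal{F}} \mathbb{E}\Big[\sum_{i \in \mathcal{A}_{\mathbf{a}}} f_i(a_i, X_i)\Big] \qquad (2)$$

Note that even when the $X_i$ have known distributions, this is a hard combinatorial non-linear stochastic optimization problem. In our setting, with unknown distributions, we can formulate this as a multi-armed bandit problem, where each power allocation $\mathbf{a}(n) \in \mathcal{F}$ is an arm and the reward function is in a combinatorial non-linear form. The optimal arms are the ones with the largest expected reward, denoted as $O^* = \{\mathbf{a}^*\}$. For the rest of the paper, we use $*$ as the index indicating that a parameter is for an optimal arm. If more than one optimal arm exists, $*$ refers to any one of them.

We note that for the combinatorial multi-armed bandit problem with linear rewards, where the reward function is defined by $R_{\mathbf{a}(n)}(n) = \sum_{i \in \mathcal{A}_{\mathbf{a}(n)}} a_i(n) X_i(n)$, $\mathbf{a}^*$ is a solution to a deterministic optimization problem because $\max_{\mathbf{a} \in \mathcal{F}} \mathbb{E}[\sum_{i \in \mathcal{A}_{\mathbf{a}}} a_i X_i] = \max_{\mathbf{a} \in \mathcal{F}} \sum_{i \in \mathcal{A}_{\mathbf{a}}} a_i \mathbb{E}[X_i]$. Different from the combinatorial multi-armed bandit problem with linear rewards, here $\mathbf{a}^*$ is a solution to a stochastic optimization problem, i.e.,

$$\mathbf{a}^* \in O^* = \Big\{\tilde{\mathbf{a}} : \tilde{\mathbf{a}} = \arg\max_{\mathbf{a} \in \mathcal{F}} \mathbb{E}\Big[\sum_{i \in \mathcal{A}_{\mathbf{a}}} f_i(a_i, X_i)\Big]\Big\}. \qquad (3)$$

We evaluate policies for $O_1$ with respect to regret, which is defined as the difference between the expected reward that could be obtained by a genie that can pick an optimal arm at each time, and that obtained by the given policy. Note that minimizing the regret is equivalent to maximizing the expected rewards. Regret can be expressed as:

$$\mathfrak{R}^{\pi}(n) = n R^* - \mathbb{E}\Big[\sum_{t=1}^{n} R_{\pi(t)}(t)\Big], \qquad (4)$$

where $R^* = \max_{\mathbf{a} \in \mathcal{F}} \mathbb{E}[\sum_{i \in \mathcal{A}_{\mathbf{a}}} f_i(a_i, X_i)]$ is the expected reward of an optimal arm.

Intuitively, we would like the regret $\mathfrak{R}^{\pi}(n)$ to be as small as possible. If it is sub-linear with respect to time n, the time-averaged regret will tend to zero and the maximum possible time-averaged reward can be achieved. Note that the number of arms $|\mathcal{F}|$ can be exponential in the number of unknown random variables N.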
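As a small illustration of objective $O_1$ and of how quickly the arm set grows, the following Python sketch enumerates a feasible set under a total power budget and evaluates the expected sum-rate exactly for finite-support channel distributions. The toy instance (three channels, power levels {0, 1, 2}, budget 3, and the listed distributions) and the helper names `feasible_set` and `expected_sum_rate` are assumptions made for illustration only; they do not come from the paper. Brute-force enumeration is used only because the instance is tiny.

```python
import itertools
import math

def feasible_set(num_channels, levels, p_total):
    """Enumerate all discrete allocations whose total power is within budget."""
    return [a for a in itertools.product(levels, repeat=num_channels)
            if sum(a) <= p_total]

def expected_sum_rate(a, dists):
    """Compute E[sum_i log(1 + a_i X_i)] for finite-support X_i.

    dists[i] is a list of (value, probability) pairs for channel i.
    By linearity of expectation only the marginals are needed, so the
    channels do not have to be independent. Channels with a_i = 0
    contribute nothing, matching the index set of nonzero components.
    """
    total = 0.0
    for a_i, dist in zip(a, dists):
        if a_i == 0:
            continue
        total += sum(p * math.log(1.0 + a_i * x) for x, p in dist)
    return total

# Hypothetical toy instance (values in [0, 1], chosen for illustration).
dists = [
    [(0.2, 0.5), (1.0, 0.5)],  # channel 0
    [(0.5, 1.0)],              # channel 1 (deterministic)
    [(0.0, 0.7), (0.9, 0.3)],  # channel 2
]
F = feasible_set(3, (0, 1, 2), 3)

# A genie that knows the distributions simply picks the best arm in F.
best = max(F, key=lambda a: expected_sum_rate(a, dists))
best_value = expected_sum_rate(best, dists)
print(len(F), best, round(best_value, 4))
```

Without the budget there would be $3^3 = 27$ arms here; the same enumeration with P levels and N channels has up to $P^N$ entries, which is why a learning policy that keeps one statistic per arm does not scale.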
We also note that for the stochastic version of the water-filling problem, a typical way in practice to deal with the unknown randomness is to estimate the mean channel gain-to-noise ratios first and then find the optimized allocation based on the mean values. This approach tries to identify the power allocation that maximizes the power-rate equation applied to the mean channel gain-to-noise ratios. We refer to the maximized quantity as the sum-pseudo-rate over averaged channels. We denote this objective by $O_2$, as shown in (5).

$$\max_{\mathbf{a} \in \mathcal{F}} \sum_{i \in \mathcal{A}_{\mathbf{a}}} f_i(a_i, \mathbb{E}[X_i]) \qquad (5)$$

We would also like to develop an online learning policy for $O_2$. Note that the optimal arm $\mathbf{a}^*$ of $O_2$ is a solution to a deterministic optimization problem. So, we evaluate the policies for $O_2$ with respect to the expected total number of times that a non-optimal power allocation is selected. We denote by $T_{\mathbf{a}}(n)$ the number of times that a power allocation $\mathbf{a}$ is picked up to time n. We denote $r_{\mathbf{a}} = \sum_{i \in \mathcal{A}_{\mathbf{a}}} f_i(a_i, \mathbb{E}[X_i])$. Let $T^{\pi}_{non}(n)$ denote the total number of times that a policy $\pi$ selects a power allocation with $r_{\mathbf{a}} < r_{\mathbf{a}^*}$. Denote by $\mathbb{1}^{\pi}_t(\mathbf{a})$ the indicator function which is equal to 1 if $\mathbf{a}$ is selected under policy $\pi$ at time t, and 0 otherwise. Then

$$\mathbb{E}[T^{\pi}_{non}(n)] = n - \mathbb{E}\Big[\sum_{t=1}^{n} \mathbb{1}^{\pi}_t(\mathbf{a}^*)\Big] = \sum_{\mathbf{a} : r_{\mathbf{a}} < r_{\mathbf{a}^*}} \mathbb{E}[T_{\mathbf{a}}(n)]. \qquad (6)$$

IV. ONLINE LEARNING FOR MAXIMIZING THE SUM-RATE

We first present in this section an online learning policy for stochastic water-filling under objective $O_1$.

A. Policy Design

A straightforward, naive way to solve this problem is to use the UCB1 policy proposed in [9]. For UCB1, each power allocation is treated as an arm, and the arm that maximizes $\hat{Y}_k + \sqrt{\frac{2 \ln n}{m_k}}$ will be selected at each time slot, where $\hat{Y}_k$ is the mean observed reward on arm k, and $m_k$ is the number of times that arm k has been played. This approach essentially ignores the underlying dependencies across the different arms, requires storage that is linear in the number of arms, and yields regret growing linearly with the number of arms. Since there can be an exponential number of arms, the UCB1 algorithm performs poorly on this problem.

We note that for combinatorial optimization problems with linear reward functions, an online learning algorithm LLR has been proposed in [6] as an efficient solution.
LLR stores the mean of observed values for every underlying unknown random variable, as well as the number of times each has been observed. So the storage of LLR is linear in the number of unknown random variables, and the analysis in [6] shows LLR achieves a regret that grows logarithmically in time, and polynomially in the number of unknown parameters.
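To make the naive baseline concrete, the per-arm UCB1 rule from [9] (play the arm maximizing the sample mean plus $\sqrt{2 \ln n / m_k}$) can be sketched in Python as below. The function names and the list-based storage are assumptions chosen for illustration; the point is only that the two lists `means` and `plays` need one entry per power allocation, i.e., per arm.

```python
import math

def ucb1_index(mean_reward, plays, n):
    """UCB1 index of one arm: sample mean plus sqrt(2 ln n / m_k)."""
    return mean_reward + math.sqrt(2.0 * math.log(n) / plays)

def ucb1_select(means, plays, n):
    """Return the arm to play at time n; arms never played go first."""
    for k, m_k in enumerate(plays):
        if m_k == 0:
            return k
    return max(range(len(means)),
               key=lambda k: ucb1_index(means[k], plays[k], n))

def ucb1_update(means, plays, k, reward):
    """Fold one observed reward into the running mean of arm k."""
    plays[k] += 1
    means[k] += (reward - means[k]) / plays[k]
```

With P power levels on each of N channels, both lists would hold up to $P^N$ entries, and every arm has to be tried at least once before the index is defined, which is the storage and regret scaling that the combinatorial policies below are designed to avoid.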

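However, for stochastic water-filling with objective $O_1$, where the expectation is outside the non-linear reward function, directly storing the mean observations of $X_i$ will not work. To deal with this challenge, we propose to store the information for each $(a_i, X_i)$ combination, i.e., $\forall 1 \le i \le N$, $\forall a_i$, we define a new set of random variables $Y_{i,a_i} = f_i(a_i, X_i)$. So now the number of random variables $Y_{i,a_i}$ is $\sum_{i=1}^{N} |\mathcal{B}_i|$, where $\mathcal{B}_i = \{a_i : a_i \ne 0\}$. Note that $\sum_{i=1}^{N} |\mathcal{B}_i| \le PN$. Then the reward function can be expressed as

$$R_{\mathbf{a}} = \sum_{i \in \mathcal{A}_{\mathbf{a}}} Y_{i,a_i}. \qquad (7)$$

Note that (7) is in a combinatorial linear form. For this redefined MAB problem with $\sum_{i=1}^{N} |\mathcal{B}_i|$ unknown random variables and the linear reward function (7), we propose the following online learning policy CWF1 for stochastic water-filling, as shown in Algorithm 1.

Algorithm 1 Online Learning for Stochastic Water-Filling: CWF1
1: // INITIALIZATION
2: If $\max_{\mathbf{a}} |\mathcal{A}_{\mathbf{a}}|$ is known, let $L = \max_{\mathbf{a}} |\mathcal{A}_{\mathbf{a}}|$; else, $L = N$;
3: for n = 1 to N do
4:   Play any arm $\mathbf{a}$ such that $n \in \mathcal{A}_{\mathbf{a}}$;
5:   $\forall i \in \mathcal{A}_{\mathbf{a}}, \forall a_i \in \mathcal{B}_i$, $\bar{Y}_{i,a_i} := \frac{\bar{Y}_{i,a_i} m_i + f_i(a_i, X_i)}{m_i + 1}$;
6:   $\forall i \in \mathcal{A}_{\mathbf{a}}$, $m_i := m_i + 1$;
7: end for
8: // MAIN LOOP
9: while 1 do
10:   n := n + 1;
11:   Play an arm $\mathbf{a}$ which solves the maximization problem
      $$\max_{\mathbf{a} \in \mathcal{F}} \sum_{i \in \mathcal{A}_{\mathbf{a}}} \Big(\bar{Y}_{i,a_i} + \sqrt{\frac{(L+1) \ln n}{m_i}}\Big); \qquad (8)$$
12:   $\forall i \in \mathcal{A}_{\mathbf{a}(n)}, \forall a_i \in \mathcal{B}_i$, $\bar{Y}_{i,a_i} := \frac{\bar{Y}_{i,a_i} m_i + f_i(a_i, X_i)}{m_i + 1}$;
13:   $\forall i \in \mathcal{A}_{\mathbf{a}(n)}$, $m_i := m_i + 1$;
14: end while

To have a tighter bound on regret, different from the LLR algorithm, instead of storing the number of times that each unknown random variable $Y_{i,a_i}$ has been observed, we use a 1 by N vector, denoted as $(m_i)_{1 \times N}$, to store the number of times that $X_i$ has been observed up to the current time slot.

We use a 1 by $\sum_{i=1}^{N} |\mathcal{B}_i|$ vector, denoted as $(\bar{Y}_{i,a_i})_{1 \times \sum_{i=1}^{N} |\mathcal{B}_i|}$, to store the information based on the observed values. This vector is updated as shown in line 12. Each time an arm $\mathbf{a}(n)$ is played, $\forall i \in \mathcal{A}_{\mathbf{a}(n)}$, the observed value of $X_i$ is obtained. For every observed value of $X_i$, $|\mathcal{B}_i|$ values are updated: $\forall a_i \in \mathcal{B}_i$, the average value $\bar{Y}_{i,a_i}$ of all the values of $Y_{i,a_i}$ up to the current time slot is updated. The CWF1 policy requires storage linear in $\sum_{i=1}^{N} |\mathcal{B}_i|$.

B. Analysis of regret

Theorem 1: The expected regret under the CWF1 policy is at most

$$\Big[\frac{4 L^2 (L+1) N \ln n}{(\Delta_{\min})^2} + N + \frac{\pi^2}{3} L N\Big] \Delta_{\max}, \qquad (9)$$

where $\Delta_{\min} = \min_{\mathbf{a} \ne \mathbf{a}^*} (R^* - \mathbb{E}[R_{\mathbf{a}}])$ and $\Delta_{\max} = \max_{\mathbf{a} \ne \mathbf{a}^*} (R^* - \mathbb{E}[R_{\mathbf{a}}])$. Note that $L \le N$.

The proof of Theorem 1 is omitted.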
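The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cwf1`, its arguments, the 0-based channel indexing, and the use of a fixed horizon are all hypothetical, every channel is assumed to be nonzero in at least one arm of `F`, and the maximization (8) is solved by brute force over `F`, which is only practical for small instances.

```python
import math
import random

def cwf1(F, f, sample_x, horizon, seed=0):
    """Sketch of CWF1 on a finite arm set F (list of power tuples).

    f(p, x) is the per-channel rate function; sample_x(i, rng) draws
    the gain-to-noise ratio of channel i for the current slot.
    """
    rng = random.Random(seed)
    N = len(F[0])

    def active(a):
        # Index set of channels with nonzero power in allocation a.
        return [i for i in range(N) if a[i] != 0]

    L = max(len(active(a)) for a in F)
    # B[i]: nonzero power levels channel i can take anywhere in F.
    B = [sorted({a[i] for a in F if a[i] != 0}) for i in range(N)]
    m = [0] * N                                   # observation counts
    y_bar = {(i, p): 0.0 for i in range(N) for p in B[i]}
    history = []

    def play(a):
        for i in active(a):
            x = sample_x(i, rng)
            # One observation of X_i refreshes every level in B[i].
            for p in B[i]:
                y_bar[(i, p)] = (y_bar[(i, p)] * m[i] + f(p, x)) / (m[i] + 1)
            m[i] += 1
        history.append(a)

    # Initialization: make sure every channel is observed at least once.
    for i in range(N):
        play(next(a for a in F if a[i] != 0))

    # Main loop: play the arm with the largest index as in (8).
    for n in range(N + 1, horizon + 1):
        def index(a):
            return sum(y_bar[(i, a[i])]
                       + math.sqrt((L + 1) * math.log(n) / m[i])
                       for i in active(a))
        play(max(F, key=index))
    return history, y_bar, m
```

A call such as `cwf1(F, lambda p, x: math.log(1 + p * x), sampler, 1000)` returns the sequence of played allocations together with the stored averages and counts. Note that the state kept between slots is one count per channel plus one average per (channel, power level) pair, rather than one statistic per arm.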
Remark 1: For the CWF1 policy, although there are $\sum_{i=1}^{N} |\mathcal{B}_i|$ random variables, the upper bound on regret remains $O(N^4 \log n)$, which is the same as LLR, as shown by Theorem 2 in [6]. Directly applying the LLR algorithm to solve the redefined MAB problem in (7) will result in a regret that grows as $O(P^4 N^4 \log n)$.

Remark 2: Algorithm 1 will even work for rate functions that do not satisfy subadditivity.

Remark 3: We can develop similar policies and results when the $X_i$ are Markovian rewards, as in [19] and [20].

V. ONLINE LEARNING FOR SUM-PSEUDO-RATE

We now show our novel online learning algorithm CWF2 for stochastic water-filling with objective $O_2$. Unlike CWF1, CWF2 exploits non-linear dependencies between the choices of power allocations and requires lower storage. Under the condition where the power allocation that maximizes $O_2$ also maximizes $O_1$, we will see through simulations that CWF2 has better regret performance.

A. Policy Design

Our proposed policy CWF2 for stochastic water-filling with objective $O_2$ is shown in Algorithm 2.

Algorithm 2 Online Learning for Stochastic Water-Filling: CWF2
1: // INITIALIZATION
2: If $\max_{\mathbf{a}} |\mathcal{A}_{\mathbf{a}}|$ is known, let $L = \max_{\mathbf{a}} |\mathcal{A}_{\mathbf{a}}|$; else, $L = N$;
3: for n = 1 to N do
4:   Play any arm $\mathbf{a}$ such that $n \in \mathcal{A}_{\mathbf{a}}$;
5:   $\forall i \in \mathcal{A}_{\mathbf{a}}$, $\bar{X}_i := \frac{\bar{X}_i m_i + X_i}{m_i + 1}$, $m_i := m_i + 1$;
6: end for
7: // MAIN LOOP
8: while 1 do
9:   n := n + 1;
10:   Play an arm $\mathbf{a}$ which solves the maximization problem
      $$\max_{\mathbf{a} \in \mathcal{F}} \sum_{i \in \mathcal{A}_{\mathbf{a}}} \Big(f_i(a_i, \bar{X}_i) + f_i\Big(a_i, \sqrt{\frac{(L+1) \ln n}{m_i}}\Big)\Big); \qquad (10)$$
11:   $\forall i \in \mathcal{A}_{\mathbf{a}(n)}$, $\bar{X}_i := \frac{\bar{X}_i m_i + X_i}{m_i + 1}$, $m_i := m_i + 1$;
12: end while

We use two 1 by N vectors to store the information after we play an arm at each time slot. One is $(\bar{X}_i)_{1 \times N}$, in which $\bar{X}_i$ is the average (sample mean) of all the observed values of $X_i$ up to the current time slot (obtained through potentially different sets of arms over time). The other one is $(m_i)_{1 \times N}$, in which $m_i$ is the number of times that $X_i$ has been observed up to the current time slot. So the CWF2 policy requires storage linear in N.

B. Analysis of regret

For the analysis of the upper bound for $\mathbb{E}[T^{\pi}_{non}(n)]$ of the CWF2 policy, we use the inequalities stated in the Chernoff-Hoeffding bound as follows:

Lemma 1 (Chernoff-Hoeffding bound [21]): $X_1, \ldots, X_n$ are random variables with range $[0, 1]$, and $\mathbb{E}[X_t | X_1, \ldots, X_{t-1}] = \mu$, $\forall 1 \le t \le n$. Denote $S_n = \sum_{i=1}^{n} X_i$. Then for all $a \ge 0$,

$$P\{S_n \ge n\mu + a\} \le e^{-2a^2/n}, \qquad P\{S_n \le n\mu - a\} \le e^{-2a^2/n}. \qquad (11)$$

Theorem 2: Under the CWF2 policy, the expected total number of times that non-optimal power allocations are selected is at most

$$\mathbb{E}[T^{\pi}_{non}(n)] \le \frac{N (L+1) \ln n}{B_{\min}^2} + N + \frac{\pi^2}{3} L N, \qquad (12)$$

where $B_{\min}$ is a constant defined by $\delta_{\min}$ and L; $\delta_{\min} = \min_{\mathbf{a} : r_{\mathbf{a}} < r^*} (r^* - r_{\mathbf{a}})$.

Proof: See [22].

Remark 4: CWF2 can be used to solve the stochastic water-filling problem with objective $O_1$ as well if $\exists \mathbf{a}^* \in O^*$ such that $\forall \mathbf{a} \notin O^*$,

$$\sum_{i \in \mathcal{A}_{\mathbf{a}^*}} f_i(a_i^*, \theta_i) > \sum_{j \in \mathcal{A}_{\mathbf{a}}} f_j(a_j, \theta_j). \qquad (13)$$

Then the regret of CWF2 is at most

$$\mathfrak{R}^{CWF2}(n) \le \Big[\frac{N (L+1) \ln n}{B_{\min}^2} + N + \frac{\pi^2}{3} L N\Big] \Delta_{\max}. \qquad (14)$$

REFERENCES

[1] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[2] A. J. Goldsmith and P. P. Varaiya, "Capacity of Fading MIMO Channels with Channel Estimation Error," IEEE International Conference on Communications (ICC), June 2004.
[3] A. J. Goldsmith, Wireless Communications. New York: Cambridge University Press, 2005.
[4] X. Wang, D. Wang, H. Zhuang, and S. D. Morgera, "Energy-Efficient Resource Allocation in Wireless Sensor Networks over Fading TDMA," vol. 28, no. 7, pp. 1063-1072, 2010.
[5] I. Zaidi and V. Krishnamurthy, "Stochastic Adaptive Multilevel Waterfilling in MIMO-OFDM WLANs," the 39th Asilomar Conference on Signals, Systems and Computers, October 2005.
[6] Y. Gai, B. Krishnamachari, and R. Jain, "Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards and Individual Observations," to appear in IEEE/ACM Transactions on Networking.
[7] V. Anantharam, P. Varaiya, and J. Walrand, "Asymptotically Efficient Allocation Rules for the Multiarmed Bandit Problem with Multiple Plays-Part I: IID Rewards," IEEE Transactions on Automatic Control, vol. 32, no. 11, pp. 968-976, 1987.
[8] R. Agrawal, "Sample Mean Based Index Policies with O(log n) Regret for the Multi-Armed Bandit Problem," Advances in Applied Probability, vol. 27, no. 4, pp. 1054-1078, 1995.
[9] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem," Machine Learning, vol. 47, no. 2, pp. 235-256, 2002.
[10] Y. Gai, B. Krishnamachari, and R. Jain, "Learning Multiuser Channel Allocations in Cognitive Radio Networks: A Combinatorial Multi-armed Bandit Formulation," IEEE International Dynamic Spectrum Access Networks (DySPAN) Symposium, Singapore, April 2010.
[11] A. Anandkumar, N. Michael, and A. K. Tang, "Opportunistic Spectrum Access with Multiple Users: Learning under Competition," IEEE International Conference on Computer Communications (INFOCOM), March 2010.
[12] A. Anandkumar, N. Michael, A. Tang, and A. Swami, "Distributed Learning and Allocation of Cognitive Users with Logarithmic Regret," IEEE JSAC on Advances in Cognitive Radio Networking and Communications, vol. 29, no. 4, pp. 781-745.
[13] K. Liu and Q. Zhao, "Distributed Learning in Cognitive Radio Networks: Multi-Armed Bandit with Distributed Multiple Players," Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2010.
[14] Y. Gai and B. Krishnamachari, "Decentralized Online Learning Algorithms for Opportunistic Spectrum Access," IEEE Global Communications Conference (GLOBECOM), December.
[15] C. Tekin and M. Liu, "Online Algorithms for the Multi-Armed Bandit Problem with Markovian Rewards," the 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), September 2010.
[16] C. Tekin and M. Liu, "Online Learning in Opportunistic Spectrum Access: A Restless Bandit Approach," IEEE International Conference on Computer Communications (INFOCOM), April.
[17] H. Liu, K. Liu, and Q. Zhao, "Learning and Sharing in a Changing World: Non-Bayesian Restless Bandit with Multiple Players," Information Theory and Applications Workshop (ITA), January.
[18] W. Dai, Y. Gai, B. Krishnamachari, and Q. Zhao, "The Non-Bayesian Restless Multi-Armed Bandit: A Case of Near-Logarithmic Regret," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May.
[19] Y. Gai, B. Krishnamachari, and M. Liu, "On the Combinatorial Multi-Armed Bandit Problem with Markovian Rewards," IEEE Global Communications Conference (GLOBECOM), December.
[20] Y. Gai, B. Krishnamachari, and M. Liu, "Online learning for combinatorial network optimization with restless Markovian rewards," arXiv:1109.1606.
[21] D. Pollard, Convergence of Stochastic Processes. Berlin: Springer, 1984.
[22] Y. Gai and B. Krishnamachari, "Online Learning Algorithms for Stochastic Water-Filling," arXiv:1109.2088.