Online Learning of Power Allocation Policies in Energy Harvesting Communications


Pranav Sakulkar and Bhaskar Krishnamachari
Ming Hsieh Department of Electrical Engineering, Viterbi School of Engineering
University of Southern California, Los Angeles, CA, USA

Abstract—We consider the problem of power allocation over a time-varying channel with an unknown distribution in energy harvesting communication systems. In this problem, the transmitter needs to choose its transmit power based on the amount of stored energy in its battery, with the goal of maximizing the average rate obtained over time. We model this problem as a Markov decision process (MDP) with the transmitter as the agent, the battery status as the state, the transmit power as the action and the rate obtained as the reward. The average reward maximization problem over the MDP can be solved by a linear program (LP) that uses the transition probabilities of the state-action pairs and their mean rewards to choose a power allocation policy. Since the rewards associated with the state-action pairs are unknown, we propose an online learning algorithm called UCLP that learns these rewards and adapts its policy over time. The UCLP algorithm solves the LP at each time-step to choose its policy using the upper confidence bounds on the rewards. We prove that the reward loss or regret incurred by UCLP is upper bounded by a constant.

Index Terms—Energy harvesting communications, Markov decision process (MDP), online learning, contextual bandits.

I. INTRODUCTION

Communication systems whose transmissions are powered by harvested energy have rapidly emerged as a viable option for next-generation wireless networks with prolonged lifetime [1]. The performance of such systems depends on the efficient utilization of the energy that is currently stored in the battery, as well as the energy that is to be harvested over time. In [2], power allocation policies over a finite time horizon with known channel gain and harvested energy distributions are studied. In [3], a similar problem is analyzed, but the energy arrivals are assumed to be deterministic and known in advance. The algorithms presented in [4] assume knowledge of the energy arrivals and try to minimize the overall scheduling time for the data packets. In our problem, however, the channel gain distribution is unknown and the harvested energy is assumed to be stochastically varying with a known distribution. The transmitter has to decide the transmit power level based on the current battery status with the goal of maximizing the average expected transmission rate obtained over time. We model the system as an MDP with the battery status as the state, the transmit power as the action, and the rate as the reward. The power allocation problem, therefore, reduces to the average reward maximization problem for an MDP.

Our problem can also be seen through the lens of contextual bandits. In the standard contextual bandit problems [5], [6], [7], the contexts are assumed to be drawn from an unknown distribution independently over time. In this paper, we model the context transitions by MDPs. The action the agent takes at time t, therefore, affects not only the instantaneous reward but also the context in slot t + 1. Thus the agent needs to decide its actions with the global objective in mind, i.e. maximizing the average reward over time. It must be noted that the MDP formulation generalizes the standard contextual bandits [8] for the case where the mapping from the context and the random instance to the reward is a known monotonic function, since the i.i.d. context case can be viewed as a single-state MDP. Our problem is also closely related to the reinforcement learning problem over MDPs from [9], [10], [11]. The objective in these problems is to maximize the average undiscounted reward over time.
In [9], [10], the agent is unaware of both the transition probabilities and the rewards corresponding to the state-action pairs. In [11], the agent knows the rewards, but the transition probabilities are still unknown. In our problem, however, the transition probabilities of the MDP can be inferred from the knowledge of the arrival distribution and the action taken from each state. The goal of our problem is to maximize the average reward by learning the rewards of the state-action pairs over time. One additional feature of our problem is that the function mapping the state-action pair and the channel gain to the rate is known to the agent. The reward information revealed after every action can, therefore, be used to infer the rewards of the other state-action pairs as well. Our contributions in this paper are as follows:
- We formulate the power allocation problem as an online learning problem over an MDP with the goal of maximizing the average reward over time.
- We prove that the MDP is ergodic and therefore use its corresponding LP formulation to characterize the optimal policy.
- We propose an online learning algorithm UCLP that learns the rewards of the state-action pairs and adapts its policy over time.
- We characterize the regret contributions from following a non-optimal policy and from not being at stationarity while following the optimal policy, and prove a constant regret upper bound for UCLP.

Fig. 1. Power allocation over a wireless channel in energy harvesting communications: the harvested power p_t charges the battery (state Q_t), and the transmit power q_t is used for transmission from Tx to Rx over the wireless channel in each time slot.

This paper is organized as follows. First, we describe the model for the energy harvesting communication system studied in this paper, formulate this problem as an MDP and discuss the structure of the optimal policy in section II. We then propose our online learning algorithm UCLP and prove its regret bounds in section III. Section IV presents the results of numerical simulations for this problem and section V concludes the paper. We also include appendices B and C to discuss and prove some of the technical lemmas at the end of the paper.

II. SYSTEM MODEL

Consider a time-slotted energy harvesting communication system where the transmitter uses the harvested power for transmission over a channel with stochastically varying channel gains with an unknown distribution, as shown in figure 1. Let p_t denote the harvested power in the t-th slot, which is assumed to be i.i.d. over time. Let Q_t denote the stored energy in the transmitter's battery, which has a capacity of Q_max. Assume that the transmitter decides to use q_t (≤ Q_t) amount of power for transmission in the t-th slot. We assume a discrete and finite number of power levels for the harvested and transmit powers. The rate obtained during the t-th slot is assumed to follow the relationship

r_t = B log_2(1 + q_t X_t),    (1)

where X_t denotes the instantaneous channel gain-to-noise ratio of the channel, which is assumed to be i.i.d. over time, and B is the channel bandwidth. The battery state gets updated in the next slot as

Q_{t+1} = min{Q_t − q_t + p_t, Q_max}.    (2)

The goal is to utilize the harvested power and choose a transmit power q_t in each slot sequentially to maximize the expected average rate lim_{T→∞} E[(1/T) Σ_{t=1}^{T} r_t] obtained over time.

A. Problem Formulation

Consider an MDP M with a finite state space S and a finite action space A. Let A_s ⊆ A denote the set of allowed actions from state s. When the agent chooses an action a_t ∈ A_{s_t} in state s_t ∈ S, it receives a random reward r_t(s_t, a_t). Based on the agent's decision, the system undergoes a random transition to a state s_{t+1} according to the transition probability P(s_{t+1} | s_t, a_t). In the energy harvesting problem, the battery status Q_t represents the system state s_t and the transmit power q_t represents the action taken at any slot t. In this paper, we consider systems where the random rewards of the various state-action pairs can be modelled as

r_t(s_t, a_t) = f(s_t, a_t, X_t),    (3)

where f is a reward function known to the agent and X_t is a random variable internal to the system that is i.i.d. over time. Note that in the energy harvesting communications problem, the reward is the rate obtained in each slot and the reward function is defined in equation (1). In this problem, the channel gain-to-noise ratio X_t corresponds to the system's internal random variable. We assume that the distribution of the harvested energy p_t is known to the agent. This implies that the state transition probabilities P(s_{t+1} | s_t, a_t) can be inferred by the agent based on the update equation (2).

A policy is defined as any rule for choosing the actions in successive time slots. The action chosen at time t may, therefore, depend on the history of previous states, actions and rewards. It may even be randomized such that the action a_t ∈ A_{s_t} is chosen from some distribution over the actions. A policy is said to be stationary if the action chosen at time t is only a function of the system state at t. This means that a deterministic stationary policy β is a mapping from the state s ∈ S to its corresponding action a ∈ A_s. When a stationary policy is played, the sequence of states {s_t : t = 1, 2, ...} follows a Markov chain. An MDP is said to be ergodic if every deterministic stationary policy leads to an irreducible and aperiodic Markov chain.
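To make the dynamics of equations (1) and (2) concrete, the following minimal Python sketch simulates one trajectory of the system under a simple example policy that spends the whole battery in every slot. The bandwidth, the harvested-power distribution and the policy are illustrative assumptions rather than parameters from the paper; the scaled Bernoulli channel mirrors the one used in the simulations of section IV.

import numpy as np

rng = np.random.default_rng(0)

B = 1.0      # channel bandwidth (illustrative)
Q_max = 4    # battery capacity (illustrative)
T = 10       # number of slots to simulate

Q = 0        # initial battery state Q_0
for t in range(T):
    p_t = rng.integers(0, Q_max + 1)                 # harvested power, i.i.d. over slots
    X_t = rng.choice([0.0, 10.0], p=[0.8, 0.2])      # channel gain-to-noise ratio, i.i.d.
    q_t = Q                                          # example policy: spend the whole battery
    r_t = B * np.log2(1.0 + q_t * X_t)               # rate from equation (1)
    Q = min(Q - q_t + p_t, Q_max)                    # battery update from equation (2)
    print(f"slot {t}: q={q_t}, X={X_t}, rate={r_t:.3f}, next battery={Q}")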
According to section V.3 of [12], the average reward of an ergodic MDP with a finite state space can be maximized by an appropriate deterministic stationary policy β. In order to arrive at an ergodic MDP for the energy harvesting communications problem, we make the following assumptions: 1) when the battery state Q_t > 0, the transmit power q_t > 0; 2) the distribution of the harvested energy is such that Pr{p_t = p} > 0 for all 0 ≤ p ≤ Q_max. Under these assumptions, we prove the ergodicity of the MDP as follows.

Proposition 1. The MDP corresponding to the power allocation application in energy harvesting communications is ergodic.

Proof: Consider any policy β and let P^(n)(s, s') be the n-step transition probabilities associated with the Markov chain resulting from the policy. First, we prove that P^(1)(s, s') > 0 for any s' ≥ s as follows. According to the state update equations,

s_{t+1} = s_t − β(s_t) + p_t.    (4)

The transition probabilities can, therefore, be expressed as

P^(1)(s, s') = Pr{p_t = s' − s + β(s)} > 0,    (5)

since s' ≥ s and β(s) ≥ 0 for all states.

This implies that any state s' ∈ S is accessible from any other state s in the resultant Markov chain if s' ≥ s. Now, we prove that P^(1)(s, s − 1) > 0 for all s ≥ 1 as follows. From equation (5), we observe that

P^(1)(s, s − 1) = Pr{p_t = β(s) − 1} > 0,    (6)

since β(s) ≥ 1 for all s ≥ 1. This implies that every state s ∈ S is accessible from the state s + 1 in the resultant Markov chain. Equations (5) and (6) imply that all the state pairs (s, s + 1) communicate with each other. Since communication is an equivalence relationship, all the states communicate with each other and the resultant Markov chain is irreducible. Also, equation (5) implies that P^(1)(s, s) > 0 for all the states, and the Markov chain is, therefore, aperiodic.

Since the MDP under consideration is ergodic, we restrict ourselves to the set of deterministic stationary policies, which we interchangeably refer to as policies henceforth. Let µ(s, a) denote the expected reward associated with the state-action pair (s, a), which can be expressed as

µ(s, a) = E[r(s, a)] = E_X[f(s, a, X)].    (7)

For ergodic MDPs, the optimal mean reward ρ* is independent of the initial state (see [13], section 8.3.3). It is specified as

ρ* = max_{β ∈ B} ρ(β, M),    (8)

where B is the set of all policies, M is the matrix whose (s, a)-th entry is µ(s, a), and ρ(β, M) is the average expected reward per slot using policy β. We use the optimal mean reward as the benchmark and define the cumulative regret of a learning algorithm after T time-slots as

R(T) := T ρ* − E[Σ_{t=1}^{T} r_t].    (9)

B. Optimal Stationary Policy

When the expected rewards µ(s, a) of all state-action pairs and the transition probabilities P(s' | s, a) are known, the problem of determining the optimal policy that maximizes the average expected reward over time can be formulated as the linear program (LP) shown below (see e.g. [12], section V.3):

maximize   Σ_{s ∈ S} Σ_{a ∈ A_s} π(s, a) µ(s, a)
subject to π(s, a) ≥ 0 for all s ∈ S, a ∈ A_s,
           Σ_{s ∈ S} Σ_{a ∈ A_s} π(s, a) = 1,
           Σ_{a ∈ A_{s'}} π(s', a) = Σ_{s ∈ S} Σ_{a ∈ A_s} π(s, a) P(s' | s, a) for all s' ∈ S,    (10)

where π(s, a) denotes the stationary distribution of the MDP. The objective function of the LP from equation (10) gives the average rate corresponding to the stationary distribution π(s, a), while the constraints make sure that this stationary distribution corresponds to a valid policy on the MDP. Such LPs can be solved by using standard solvers like CVXPY [14]. If π*(s, a) is the solution to the LP from (10), then for every s ∈ S, π*(s, a) > 0 for only one action a ∈ A_s. This is due to the fact that the optimal policy β* is deterministic for ergodic MDPs in average reward maximization problems (see [13], section 8.3.3). Thus for this problem, β*(s) = arg max_{a ∈ A_s} π*(s, a). Note that we henceforth drop the action index from the stationary distribution, since the policies under consideration are deterministic and the corresponding action is, therefore, deterministically known. In general, we use π_β(s) to denote the stationary distribution corresponding to the policy β. It must be noted that the stationary distribution of any policy is independent of the reward values and only depends on the transition probabilities of the state-action pairs. The expected average reward depends on the stationary distribution as

ρ(β, M) = Σ_{s ∈ S} π_β(s) µ(s, β(s)).    (11)

In terms of this notation, the LP from (10) is equivalent to max_{β ∈ B} ρ(β, M). Since the matrix M is unknown, we develop an online learning framework to learn the optimal policy in the next section.

III. ONLINE LEARNING ALGORITHMS

For the power allocation problem under consideration, although the agent knows the state transition probabilities, the mean rewards µ(s, a) of the state-action pairs are still unknown. Hence, the agent cannot solve the LP from (10) to figure out the optimal policy.
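For reference, when the mean rewards are available, the LP (10) can be posed directly in CVXPY [14]. The sketch below is illustrative rather than the authors' implementation; the function and array names (solve_average_reward_lp, rewards, P, valid) are assumptions introduced here. The same routine can later be called with the UCB matrix in place of the mean rewards, which is exactly how UCLP uses it.

import numpy as np
import cvxpy as cp

def solve_average_reward_lp(rewards, P, valid):
    """Sketch of the LP (10): maximize the average reward over stationary
    state-action distributions pi(s, a).

    rewards[s, a] : reward value used in the objective (mu or, later, a UCB)
    P[s, a, s2]   : known transition probabilities P(s2 | s, a)
    valid[s]      : list of allowed actions A_s for state s
    """
    S, A = rewards.shape
    pi = cp.Variable((S, A), nonneg=True)          # pi(s, a) >= 0

    constraints = [cp.sum(pi) == 1]                # total probability is 1
    for s in range(S):                             # no mass on disallowed actions
        for a in range(A):
            if a not in valid[s]:
                constraints.append(pi[s, a] == 0)
    for s2 in range(S):                            # stationarity (balance) constraints
        inflow = cp.sum(cp.multiply(pi, P[:, :, s2]))
        constraints.append(cp.sum(pi[s2, :]) == inflow)

    cp.Problem(cp.Maximize(cp.sum(cp.multiply(pi, rewards))), constraints).solve()

    pi_star = pi.value
    policy = {s: int(np.argmax(pi_star[s])) for s in range(S)}   # beta(s) = argmax_a pi(s, a)
    return pi_star, policy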
Any online learning algorithm needs to learn the reward values over time and update its policy adaptively. One interesting aspect of the problem, however, is that the reward function from equation (3) is known to the agent. Since the reward function under consideration (1) is bijective, once the reward is revealed to the agent, it can invert the function to infer the instantaneous realization of the random variable X_t. This inference can be used to predict the rewards that would have been obtained for the other state-action pairs using the knowledge of the function. In our online learning framework, we store the average values of these inferred rewards θ_t(s, a) for all state-action pairs. Also, we define confidence bounds at time t:

u_{t,λ}(s, a) = θ_t(s, a) + B(s, a) √(λ ln t / t),    (12)
l_{t,λ}(s, a) = θ_t(s, a) − B(s, a) √(λ ln t / t),    (13)

which are referred to as the UCB and LCB, respectively, where B(s, a) ≥ max_x f(s, a, x) − min_x f(s, a, x) denotes any upper bound on the maximum possible range of the reward of the state-action pair (s, a). The idea behind our algorithms is to use the UCB values in the objective function of the LP from (10) instead of the unknown µ(s, a) values. Since the θ_t(s, a) values get updated after each reward revelation, the agent needs to solve the LP again and again.

We propose our online learning algorithm UCLP, where the agent solves the LP at each slot. Although the agent is unaware of the actual µ(s, a) values, it learns the statistics θ_t(s, a) over time and eventually figures out the optimal policy. We use the following notation in the analysis of our algorithms: B_0 := max_{(s,a)} B(s, a), Δ_min := ρ* − max_{β ≠ β*} ρ(β, M). The total numbers of states and actions are specified as S := |S| and A := |A|, respectively. Also, U_{t,λ} and L_{t,λ} denote the matrices containing the entries u_{t,λ}(s, a) and l_{t,λ}(s, a) at time t, respectively.

The UCLP algorithm presented in Algorithm 1 solves the LP at each time-step and updates its policy based on the solution obtained. It stores only one θ value per state-action pair; its required storage is, therefore, O(SA). In Theorem 1, we derive an upper bound on the expected number of slots where the LP fails to find the optimal solution using UCLP. We use this result to bound the total expected regret of UCLP in Theorem 2. These results guarantee that the regret is always upper bounded by a constant. Note that, for ease of exposition, we assume that the time starts at t = 0. This simplifies the analysis, but has no significant impact on the regret bounds.

Algorithm 1 UCLP
1: Parameters: λ > 1/2.
2: Initialization: for all (s, a) pairs, θ(s, a) = 0.
3: for n = 0 do
4:   Given the state s_0, choose any valid action;
5:   Update all (s, a) pairs: θ(s, a) = f(s, a, x_0);
6: end for
7: // MAIN LOOP
8: while 1 do
9:   n = n + 1;
10:  Confidence bounds: u_{n,λ}(s, a) = θ(s, a) + B(s, a) √(λ ln n / n);
11:  Solve the LP from (10) with u_{n,λ}(s, a) instead of the unknown µ(s, a);
12:  In terms of the LP solution π^(n), define β_n(s) = arg max_{a ∈ A_s} π^(n)(s, a) for all s ∈ S;
13:  Given the state s_n, select the action β_n(s_n);
14:  Update for all (s, a) pairs: θ(s, a) ← (n θ(s, a) + f(s, a, x_n)) / (n + 1);
15: end while

Theorem 1. The expected number of slots where non-optimal policies are played by UCLP is upper bounded by

n_0 + (1 + A) S σ_λ,    (14)

where σ_λ = Σ_{t=1}^{∞} t^{−2λ} and n_0 denotes the minimum value of n ∈ N for which Δ_min ≥ 2 B_0 √(λ ln n / n).

The proof of Theorem 1 is provided in appendix A. UCLP requires λ > 1/2 in order to have a constant regret upper bound from equation (14), since σ_λ < ∞ for λ > 1/2. It is important to note that even if the optimal policy is found by the LP and played during certain slots, it does not mean that the regret contribution of those slots is zero. According to the definition of regret from equation (9), the regret contribution of a certain slot is zero if and only if the optimal policy is played and the corresponding Markov chain is at its stationary distribution. In appendix C, we introduce tools to analyze the mixing of Markov chains and characterize this regret contribution in Theorem 3. These results are used to upper bound the UCLP regret in the next theorem.

Theorem 2. The total expected regret of UCLP is upper bounded by

(n_0 + (1 + A) S σ_λ) Δ_max + (1 + (1 + A) S σ_λ) µ_max / (1 − γ*),    (15)

where γ* = max_{s, s' ∈ S} ||P*(s, ·) − P*(s', ·)||_TV, P* denotes the transition probability matrix corresponding to the optimal policy, µ_max = max_{s ∈ S, a ∈ A_s} µ(s, a) and Δ_max = ρ* − min_{s ∈ S, a ∈ A_s} µ(s, a).

Proof: The regret of UCLP arises when either non-optimal actions are taken, or optimal actions are taken but the corresponding Markov chain is not at stationarity. For the first source of regret, it is sufficient to analyze the number of instances where the LP fails to find the optimal policy. For the second source, however, we need to analyze the total number of phases where the optimal policy is found in succession. Since only the optimal policy is played in the consecutive slots of such a phase, it corresponds to transitions on the Markov chain associated with the optimal policy and the tools from appendix C can be applied.
According to Theorem 3, the regret contribution of any such phase is bounded from above by µ_max / (1 − γ*). As proved in Theorem 1, for t ≥ n_0, the expected number of instances of non-optimal policies is upper bounded by (1 + A) S σ_λ. Even if none of these instances appear in successive slots, the expected number of optimal phases is upper bounded by 1 + (1 + A) S σ_λ. Hence, for t ≥ n_0, the expected regret contribution from the slots following the optimal policy is upper bounded by

(1 + (1 + A) S σ_λ) µ_max / (1 − γ*).    (16)

Note that the maximum regret possible during one slot is Δ_max. Hence, for the first n_0 slots, the regret is bounded by n_0 Δ_max. Since there are at most (1 + A) S σ_λ slots with a non-optimal policy for t ≥ n_0 in expectation, their expected regret is upper bounded by (1 + A) S σ_λ Δ_max. The overall expected regret of the UCLP algorithm is, therefore, bounded from above by equation (15).

Remark 1. It must be noted that we call two policies the same if and only if they recommend identical actions for every state. It is, therefore, possible for a non-optimal policy to recommend optimal actions for some of the states. In the analysis of UCLP, however, we assumed that any occurrence of a non-optimal policy contributes to the regret. Although this is not necessary, it leads us to a valid upper bound in the proof.
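Putting Algorithm 1 and the confidence bounds (12)-(13) together, a minimal sketch of the UCLP loop is given below. The environment object env, the reward-inversion helper invert_reward and the solve_average_reward_lp routine from the earlier LP sketch are illustrative assumptions, not an implementation supplied with the paper; for the rate function (1), inverting the reward amounts to solving r = B log_2(1 + q X) for X whenever q > 0.

import numpy as np

def uclp(env, f, invert_reward, B_range, P, valid, lam=2.0, horizon=1000):
    """Sketch of UCLP (Algorithm 1).

    env           : environment with reset() -> s_0 and step(a) -> (reward, next state)
    f             : known reward function f(s, a, x) from equation (3)
    invert_reward : assumed helper returning the realization x with f(s, a, x) = r
    B_range       : B_range[s, a] bounds the reward range of the pair (s, a)
    P, valid      : known transition probabilities P[s, a, s'] and allowed actions A_s
    """
    S, A = B_range.shape
    theta = np.zeros((S, A))                       # running averages of inferred rewards

    # n = 0: take any valid action and initialize every theta(s, a).
    s = env.reset()
    a = valid[s][0]
    r, s_next = env.step(a)
    x = invert_reward(s, a, r)
    theta = np.array([[f(si, ai, x) for ai in range(A)] for si in range(S)])
    s = s_next

    for n in range(1, horizon):
        radius = B_range * np.sqrt(lam * np.log(n) / n)
        ucb = theta + radius                                   # u_{n,lambda}(s, a), eq. (12)
        _, policy = solve_average_reward_lp(ucb, P, valid)     # LP (10) with UCBs
        a = policy[s]
        r, s_next = env.step(a)
        x = invert_reward(s, a, r)                             # infer X_n from the revealed reward
        for si in range(S):                                    # line 14: update all pairs
            for ai in range(A):
                theta[si, ai] = (n * theta[si, ai] + f(si, ai, x)) / (n + 1)
        s = s_next
    return policy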

IV. NUMERICAL SIMULATIONS

We perform simulations for the power allocation problem with S = {0, 1, 2, 3, 4} and A = {0, 1, 2, 3, 4}. Note that each state s corresponds to Q_t from equation (2) with Q_max = 4 and each action a corresponds to the transmit power q_t from equation (1). The valid actions for each state are shown in table I. The reward function is the rate function from equation (1) and the channel gain is a scaled Bernoulli random variable with Pr{X_t = 10} = 0.2 and Pr{X_t = 0} = 0.8. We use CVXPY [14] for solving the LPs in our algorithm. For the simulations in figure 2, we use λ = 2 and plot the average regret performance over 10^3 independent runs of the different algorithms. Here, the naive policy never uses the battery, i.e. it uses all the arriving power for the current transmission. Note that the optimal policy also incurs a regret because the corresponding Markov chain is not at stationarity. We observe that UCLP follows the performance of the optimal policy, with the difference in regret stemming from the first few time-slots when the channel statistics are not yet properly learnt and UCLP thus fails to find the optimal policy. As time progresses, UCLP finds the optimal policy and its regret follows the regret pattern of the optimal policy.

Fig. 2. Regret performance of the different algorithms: regret versus the number of trials for UCLP, the optimal policy and the naive policy.

TABLE I
VALID ACTIONS PER STATE

State (s)    Actions (A_s)
0            {0}
1            {1}
2            {1, 2}
3            {1, 2, 3}
4            {1, 2, 3, 4}

V. CONCLUSION

We have considered the problem of power allocation over a stochastically varying channel with an unknown distribution in an energy harvesting communication system. We have cast this problem as an online learning problem over an MDP. If the transition probabilities and the mean rewards associated with the MDP are known, the optimal policy maximizing the average expected reward over time can be found by solving an LP specified in the paper. Since the agent is only assumed to know the distribution of the harvested energy, she needs to learn the rewards of the state-action pairs over time and make her decisions based on the learnt behaviour. For this problem, we have proposed our online learning algorithm UCLP, which solves the LP at each time-slot using the updated upper confidence bounds of the rewards instead of the unknown mean rewards. We have shown that the regret incurred by UCLP is bounded from above by a constant. Through the numerical simulations, we have shown that the regret of UCLP is very close to that of the optimal policy. The UCLP algorithm presented in this paper solves an LP and therefore requires a lot of computation at each time-step. Reducing the computational needs of UCLP is a potential direction for future work in this area.

REFERENCES

[1] S. Ulukus, A. Yener, E. Erkip, O. Simeone, M. Zorzi, P. Grover, and K. Huang, "Energy harvesting wireless communications: A review of recent advances," IEEE Journal on Selected Areas in Communications, vol. 33, no. 3.
[2] C. K. Ho and R. Zhang, "Optimal energy allocation for wireless communications with energy harvesting constraints," IEEE Transactions on Signal Processing, vol. 60, no. 9.
[3] K. Tutuncuoglu and A. Yener, "Optimum transmission policies for battery limited energy harvesting nodes," IEEE Transactions on Wireless Communications, vol. 11, no. 3.
[4] J. Yang and S. Ulukus, "Optimal packet scheduling in an energy harvesting communication system," IEEE Transactions on Communications, vol. 60, no. 1.
[5] J. Langford and T. Zhang, "The epoch-greedy algorithm for multi-armed bandits with side information," in Advances in Neural Information Processing Systems.
[6] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang, "Efficient optimal learning for contextual bandits," in Conference on Uncertainty in Artificial Intelligence.
[7] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire, "Taming the monster: A fast and simple algorithm for contextual bandits," in International Conference on Machine Learning.
[8] P. Sakulkar and B. Krishnamachari, "Stochastic contextual bandits with known reward functions," USC ANRG Technical Report, http://anrg.usc.edu/www/papers/dcb_ANRG_TechReport.pdf.
[9] P. Ortner and R. Auer, "Logarithmic online regret bounds for undiscounted reinforcement learning," in Proceedings of the 2006 Conference on Advances in Neural Information Processing Systems, vol. 19, p. 49.
[10] P. Auer, T. Jaksch, and R. Ortner, "Near-optimal regret bounds for reinforcement learning," in Advances in Neural Information Processing Systems.
[11] A. Tewari and P. L. Bartlett, "Optimistic linear programming gives logarithmic regret for irreducible MDPs," in Advances in Neural Information Processing Systems.
[12] S. Ross, Introduction to Stochastic Dynamic Programming. Academic Press.
[13] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
[14] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," Journal of Machine Learning Research, to appear.
[15] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, 1963.

[16] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Society.

APPENDIX A
PROOF OF THEOREM 1

Let β_t denote the policy obtained by UCLP at time t and let I(z) be the indicator function, defined to be 1 when the predicate z is true and 0 otherwise. The number of slots where non-optimal policies are played can be expressed as

N_1 = 1 + Σ_{t=1}^{∞} I{β_t ≠ β*}
    ≤ n_0 + Σ_{t=n_0}^{∞} I{β_t ≠ β*}
    ≤ n_0 + Σ_{t=n_0}^{∞} I{ρ(β_t, U_{t,λ}) ≥ ρ(β*, U_{t,λ})}.    (17)

We observe that ρ(β_t, U_{t,λ}) ≥ ρ(β*, U_{t,λ}) implies that at least one of the following inequalities must be true:

ρ(β*, U_{t,λ}) ≤ ρ(β*, M),    (18)
ρ(β_t, L_{t,λ}) ≥ ρ(β_t, M),    (19)
ρ(β*, M) < ρ(β_t, M) + ρ(β_t, U_{t,λ}) − ρ(β_t, L_{t,λ}).    (20)

Hence, we upper bound the probability of each of these events. For the first event, from condition (18) we get

Pr{ρ(β*, U_{t,λ}) ≤ ρ(β*, M)}
  = Pr{Σ_{s ∈ S} π*(s, β*(s)) u_{t,λ}(s, β*(s)) ≤ Σ_{s ∈ S} π*(s, β*(s)) µ(s, β*(s))}
  ≤ Pr{for at least one state s ∈ S: π*(s, β*(s)) u_{t,λ}(s, β*(s)) ≤ π*(s, β*(s)) µ(s, β*(s))}
  ≤ Σ_{s ∈ S} Pr{π*(s, β*(s)) u_{t,λ}(s, β*(s)) ≤ π*(s, β*(s)) µ(s, β*(s))}
  = Σ_{s ∈ S} Pr{u_{t,λ}(s, β*(s)) ≤ µ(s, β*(s))}
  ≤(a) Σ_{s ∈ S} t^{−2λ} = S t^{−2λ},    (21)

where (a) holds due to the concentration of the confidence bounds from Lemma 2 (see appendix B). Similarly, for the second event, from condition (19) we get

Pr{ρ(β_t, L_{t,λ}) ≥ ρ(β_t, M)}
  = Pr{Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) l_{t,λ}(s, a) ≥ Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) µ(s, a)}
  ≤ Pr{for at least one state-action pair (s, a): π_{β_t}(s, a) l_{t,λ}(s, a) ≥ π_{β_t}(s, a) µ(s, a)}
  ≤ Σ_{s ∈ S} Σ_{a ∈ A_s} Pr{π_{β_t}(s, a) l_{t,λ}(s, a) ≥ π_{β_t}(s, a) µ(s, a)}
  = Σ_{s ∈ S} Σ_{a ∈ A_s} Pr{l_{t,λ}(s, a) ≥ µ(s, a)}
  ≤(b) Σ_{s ∈ S} Σ_{a ∈ A_s} t^{−2λ} ≤ S A t^{−2λ},    (22)

where (b) holds due to the concentration bounds from Lemma 2 (see appendix B). Now let us analyze the third event from condition (20):

ρ(β_t, U_{t,λ}) − ρ(β_t, L_{t,λ})
  = Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) u_{t,λ}(s, a) − Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) l_{t,λ}(s, a)
  = Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) (u_{t,λ}(s, a) − l_{t,λ}(s, a))
  = Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) · 2 B(s, a) √(λ ln t / t)
  ≤ 2 √(λ ln t / t) Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) B_0
  ≤ 2 B_0 √(λ ln t / t).    (23)

Since Δ_min ≤ ρ(β*, M) − ρ(β_t, M) for any non-optimal β_t, for all t ≥ n_0 we get

ρ(β*, M) − ρ(β_t, M) − (ρ(β_t, U_{t,λ}) − ρ(β_t, L_{t,λ}))
  ≥ Δ_min − 2 B_0 √(λ ln t / t)
  ≥ Δ_min − 2 B_0 √(λ ln n_0 / n_0)
  ≥ 0.    (24)

This implies that condition (20) is always false for t ≥ n_0. The expected number of incorrect policies from equation (17) can, therefore, be expressed as

E[N_1] ≤ n_0 + Σ_{t=n_0}^{∞} Pr{ρ(β_t, U_{t,λ}) ≥ ρ(β*, U_{t,λ})}
  ≤ n_0 + Σ_{t=n_0}^{∞} (Pr{ρ(β*, U_{t,λ}) ≤ ρ(β*, M)} + Pr{ρ(β_t, L_{t,λ}) ≥ ρ(β_t, M)})
  ≤ n_0 + Σ_{t=n_0}^{∞} (S t^{−2λ} + S A t^{−2λ})
  = n_0 + (1 + A) S Σ_{t=n_0}^{∞} t^{−2λ}
  ≤ n_0 + (1 + A) S σ_λ,    (25)

where σ_λ < ∞ since λ > 1/2.

APPENDIX B
TECHNICAL LEMMAS AND PROOFS

Lemma 1 (Hoeffding's concentration inequality from [15]). Let Y_1, ..., Y_n be i.i.d. random variables with mean µ and range [0, 1]. Let S_n = Σ_{t=1}^{n} Y_t. Then for all α ≥ 0,

Pr{S_n ≥ nµ + α} ≤ e^{−2α²/n},
Pr{S_n ≤ nµ − α} ≤ e^{−2α²/n}.

Lemma 2 (Concentration of confidence bounds). At any time t, for any valid state-action pair (s, a), the following inequalities hold:
1) Pr{u_{t,λ}(s, a) ≤ µ(s, a)} ≤ t^{−2λ},
2) Pr{l_{t,λ}(s, a) ≥ µ(s, a)} ≤ t^{−2λ}.

Proof: For the first inequality,

Pr{u_{t,λ}(s, a) ≤ µ(s, a)}
  = Pr{θ_t(s, a) + B(s, a) √(λ ln t / t) ≤ µ(s, a)}
  = Pr{(θ_t(s, a) − µ(s, a)) / B(s, a) ≤ −√(λ ln t / t)}
  ≤(a) e^{−2(λ t ln t)/t} = t^{−2λ},    (26)

where (a) is obtained by applying the left-sided Hoeffding inequality (see Lemma 1) to the t normalized reward samples with α = √(λ t ln t). Similarly, using the right-sided version of the concentration inequality, we get the second inequality.

APPENDIX C
ANALYSIS OF MARKOV CHAIN MIXING

We briefly introduce the tools required for the analysis of Markov chain mixing (see [16], chapter 4 for a detailed discussion). The total variation (TV) distance between two probability distributions φ and ψ on a sample space Ω is defined by

||φ − ψ||_TV = max_{E ⊆ Ω} |φ(E) − ψ(E)|.    (27)

Intuitively, the TV distance between φ and ψ is the maximum difference between the probabilities that the two distributions assign to a single event. The TV distance is related to the L1 distance as follows:

||φ − ψ||_TV = (1/2) Σ_{ω ∈ Ω} |φ(ω) − ψ(ω)|.    (28)

We wish to bound the maximal distance between the stationary distribution π and the distribution over the states after t steps of a Markov chain. Let P^(t) be the t-step transition matrix, with P^(t)(s, s') being the transition probability from state s to state s' in t steps, and let P denote the collection of all probability distributions on Ω. Also, let P^(t)(s, ·) be the row, or distribution, corresponding to the initial state s. Based on these notations, we define two useful t-step distances as follows:

d(t) := max_{s ∈ S} ||π − P^(t)(s, ·)||_TV = sup_{φ ∈ P} ||π − φ P^(t)||_TV,    (29)
d̂(t) := max_{s, s' ∈ S} ||P^(t)(s, ·) − P^(t)(s', ·)||_TV = sup_{ψ, φ ∈ P} ||ψ P^(t) − φ P^(t)||_TV.    (30)

For irreducible and aperiodic Markov chains, the distances d(t) and d̂(t) have the following special properties.

Lemma 3 ([16], lemma 4.11). For all t > 0, d(t) ≤ d̂(t) ≤ 2 d(t).

Lemma 4 ([16], lemma 4.12). The function d̂ is submultiplicative: d̂(t_1 + t_2) ≤ d̂(t_1) d̂(t_2).

These lemmas lead to the following useful corollary.

Corollary 1. For all t ≥ 0, d(t) ≤ d̂(1)^t.

Consider an MDP with optimal stationary policy β*. Since the MDP might not start at the stationary distribution π* corresponding to the optimal policy, even the optimal policy incurs some regret as defined in equation (9). We characterize this regret in the following theorem.

Theorem 3 (Regret of the optimal policy). For an ergodic MDP, the total expected regret of the optimal stationary policy with transition probability matrix P* is upper bounded by µ_max / (1 − γ*), where γ* = max_{s, s' ∈ S} ||P*(s, ·) − P*(s', ·)||_TV and µ_max = max_{s ∈ S, a ∈ A_s} µ(s, a).

Proof: Let φ_0 be the initial distribution over the states and let φ_t = φ_0 P*^(t) be the corresponding distribution at time t, both represented as row vectors. Also, let µ* be the row vector whose entry corresponding to state s is µ(s, β*(s)). We use d*(t) and d̂*(t) to denote the t-step distances from equations (29) and (30) for the optimal policy. Ergodicity of the MDP ensures that the Markov chain corresponding to the optimal policy is irreducible and aperiodic, and thus Lemmas 3 and 4 hold. The regret of the optimal policy, therefore, simplifies as:

R*(φ_0, T) = T ρ* − Σ_{t=1}^{T} φ_t µ*ᵀ
  = Σ_{t=1}^{T} (π* µ*ᵀ − φ_t µ*ᵀ)
  = Σ_{t=1}^{T} (π* − φ_t) µ*ᵀ
  ≤ Σ_{t=1}^{T} (π* − φ_t)_+ µ*ᵀ    (negative entries ignored)
  = Σ_{t=1}^{T} Σ_{s ∈ S} (π*(s) − φ_t(s))_+ µ*(s)
  ≤ µ_max Σ_{t=1}^{T} Σ_{s ∈ S} (π*(s) − φ_t(s))_+
  = µ_max Σ_{t=1}^{T} ||π* − φ_0 P*^(t)||_TV
  ≤ µ_max Σ_{t=1}^{T} d*(t)
  ≤ µ_max Σ_{t=1}^{T} (d̂*(1))^t    (from Corollary 1)
  = µ_max Σ_{t=1}^{T} (γ*)^t
  ≤ µ_max γ* / (1 − γ*) ≤ µ_max / (1 − γ*).

Note that this regret bound is independent of the initial distribution over the states.
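For intuition, the quantities in (27)-(30) are easy to evaluate numerically for a small chain. The sketch below computes the TV distance, the coefficient γ* = d̂(1) and the geometric bound of Corollary 1 for an illustrative two-state transition matrix; the matrix itself is an assumption and not taken from the paper.

import numpy as np

def tv_distance(phi, psi):
    """Total variation distance, equation (28): half the L1 distance."""
    return 0.5 * np.sum(np.abs(phi - psi))

def d_hat(P):
    """d_hat(1) = max_{s, s'} || P(s, .) - P(s', .) ||_TV, i.e. gamma*."""
    S = P.shape[0]
    return max(tv_distance(P[s], P[s2]) for s in range(S) for s2 in range(S))

# Illustrative two-state chain (not from the paper).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

gamma = d_hat(P)
phi = np.array([1.0, 0.0])          # start deterministically in state 0
for t in range(1, 6):
    phi = phi @ P                   # distribution over states after t steps
    print(t, tv_distance(pi, phi), gamma ** t)   # d(t) <= d_hat(1)^t, as in Corollary 1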


More information

Martingales Stopping Time Processes

Martingales Stopping Time Processes IOSR Journal of Mahemaics (IOSR-JM) e-issn: 2278-5728, p-issn: 2319-765. Volume 11, Issue 1 Ver. II (Jan - Feb. 2015), PP 59-64 www.iosrjournals.org Maringales Sopping Time Processes I. Fulaan Deparmen

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

Sliding Mode Extremum Seeking Control for Linear Quadratic Dynamic Game

Sliding Mode Extremum Seeking Control for Linear Quadratic Dynamic Game Sliding Mode Exremum Seeking Conrol for Linear Quadraic Dynamic Game Yaodong Pan and Ümi Özgüner ITS Research Group, AIST Tsukuba Eas Namiki --, Tsukuba-shi,Ibaraki-ken 5-856, Japan e-mail: pan.yaodong@ais.go.jp

More information

School and Workshop on Market Microstructure: Design, Efficiency and Statistical Regularities March 2011

School and Workshop on Market Microstructure: Design, Efficiency and Statistical Regularities March 2011 2229-12 School and Workshop on Marke Microsrucure: Design, Efficiency and Saisical Regulariies 21-25 March 2011 Some mahemaical properies of order book models Frederic ABERGEL Ecole Cenrale Paris Grande

More information

Stable Scheduling Policies for Maximizing Throughput in Generalized Constrained Queueing Systems

Stable Scheduling Policies for Maximizing Throughput in Generalized Constrained Queueing Systems 1 Sable Scheduling Policies for Maximizing Throughpu in Generalized Consrained Queueing Sysems Prasanna Chaporar, Suden Member, IEEE, Saswai Sarar, Member, IEEE Absrac We consider a class of queueing newors

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Georey E. Hinton. University oftoronto. Technical Report CRG-TR February 22, Abstract

Georey E. Hinton. University oftoronto.   Technical Report CRG-TR February 22, Abstract Parameer Esimaion for Linear Dynamical Sysems Zoubin Ghahramani Georey E. Hinon Deparmen of Compuer Science Universiy oftorono 6 King's College Road Torono, Canada M5S A4 Email: zoubin@cs.orono.edu Technical

More information

non -negative cone Population dynamics motivates the study of linear models whose coefficient matrices are non-negative or positive.

non -negative cone Population dynamics motivates the study of linear models whose coefficient matrices are non-negative or positive. LECTURE 3 Linear/Nonnegaive Marix Models x ( = Px ( A= m m marix, x= m vecor Linear sysems of difference equaions arise in several difference conexs: Linear approximaions (linearizaion Perurbaion analysis

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

Book Corrections for Optimal Estimation of Dynamic Systems, 2 nd Edition

Book Corrections for Optimal Estimation of Dynamic Systems, 2 nd Edition Boo Correcions for Opimal Esimaion of Dynamic Sysems, nd Ediion John L. Crassidis and John L. Junins November 17, 017 Chaper 1 This documen provides correcions for he boo: Crassidis, J.L., and Junins,

More information

8. Basic RL and RC Circuits

8. Basic RL and RC Circuits 8. Basic L and C Circuis This chaper deals wih he soluions of he responses of L and C circuis The analysis of C and L circuis leads o a linear differenial equaion This chaper covers he following opics

More information