An Asymptotically Efficient Simulation-Based Algorithm for Finite Horizon Stochastic Dynamic Programming

Hyeong Soo Chang, Michael C. Fu, Jiaqiao Hu, and Steven I. Marcus

Abstract: We present a simulation-based algorithm called Simulated Annealing Multiplicative Weights (SAMW) for solving large finite-horizon stochastic dynamic programming problems. At each iteration of the algorithm, a probability distribution over candidate policies is updated by a simple multiplicative weight rule, and with proper annealing of a control parameter, the generated sequence of distributions converges to a distribution concentrated only on the best policies. The algorithm is asymptotically efficient, in the sense that for the goal of estimating the value of an optimal policy, a provably convergent finite-time upper bound for the sample mean is obtained.

Index Terms: stochastic dynamic programming, Markov decision processes, simulation, learning algorithms, simulated annealing

I. INTRODUCTION

Consider a discrete-time system with a finite horizon $H$: $x_{t+1} = f(x_t, a_t, w_t)$ for $t = 0, 1, \ldots, H-1$, where $x_t$ is the state at time $t$ ranging over a (possibly infinite) set $X$, $a_t$ is the action at time $t$ to be chosen from a nonempty subset $A(x_t)$ of a given (possibly infinite) set of available actions $A$ at time $t$, $w_t$ is a random disturbance uniformly and independently selected from $[0,1]$ at time $t$, representing the uncertainty in the system, and $f : X \times A(X) \times [0,1] \to X$ is a next-state function. Throughout, we assume the initial state $x_0$ is given, but this is without loss of generality, as the results in the paper carry through for the case where $x_0$ follows a given distribution.

Define a nonstationary (non-randomized) policy $\pi = \{\pi_t \mid \pi_t : X \to A(X),\ t = 0, 1, \ldots, H-1\}$, and its corresponding finite-horizon discounted value function given by

$$V^{\pi} = E_{w_0,\ldots,w_{H-1}}\left[\sum_{t=0}^{H-1} \gamma^{t} R(x_t, \pi_t(x_t), w_t)\right], \tag{1}$$

with discount factor $\gamma \in (0,1]$ and one-period reward function $R : X \times A(X) \times [0,1] \to \mathbb{R}^{+}$. We suppress explicit dependence of the horizon $H$ on $V$. The function $f$, together with $X$, $A$, and $R$, comprise a stochastic dynamic programming problem or a Markov decision process (MDP) [1], [8].
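As a minimal sketch of how a single simulated sample path yields an estimate of the discounted sum inside the expectation in (1), the following Python fragment evaluates a nonstationary policy on a given disturbance sequence. The function name `sample_path_value` and the toy system below it are hypothetical illustrations, not part of the paper.

```python
import random

def sample_path_value(f, R, policy, x0, w, gamma=1.0):
    """One sample-path estimate of the finite-horizon discounted reward.

    f(x, a, w) -> next state, R(x, a, w) -> one-period reward,
    policy[t](x) -> action at time t, w -> list of H disturbances in [0, 1].
    """
    x, total = x0, 0.0
    for t, w_t in enumerate(w):
        a = policy[t](x)                      # action chosen by pi_t
        total += (gamma ** t) * R(x, a, w_t)  # discounted one-period reward
        x = f(x, a, w_t)                      # next state
    return total

# Hypothetical toy system (an assumption for illustration only): the state
# moves up by the action when w < 0.5, and the reward a / H respects the
# one-period bound 1/H for actions in {0, 1}.
H = 3
f = lambda x, a, w: x + a if w < 0.5 else x
R = lambda x, a, w: a / H
policy = [lambda x: 1] * H                    # always choose action 1

rng = random.Random(0)
w = [rng.random() for _ in range(H)]          # one U(0,1) disturbance per period
v = sample_path_value(f, R, policy, x0=0, w=w, gamma=1.0)
```

Averaging such estimates over independent disturbance sequences approximates $V^{\pi}$; reusing the same sequence `w` across policies corresponds to simulating them with common random numbers.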
We assume throughout that the one-period reward function is bounded. For simplicity, but without loss of generality, we take the bound to be $1/H$, i.e., $\sup_{x \in X,\, a \in A,\, w \in [0,1]} R(x, a, w) \le 1/H$, so $0 \le V^{\pi} \le 1$. The problem we consider is estimating the optimal value over a given finite set of policies $\Pi$:

$$V^{*} := \max_{\pi \in \Pi} V^{\pi}. \tag{2}$$

Any policy $\pi^{*}$ that achieves $V^{*}$ is called an optimal policy.

This work was supported in part by the National Science Foundation under Grant DMI-00, in part by the Air Force Office of Scientific Research under Grant FA , and in part by the Department of Defense. The work of H.S. Chang was also supported by the Sogang University research grants in 2006. H.S. Chang is with the Department of Computer Science and Engineering at Sogang University, Seoul -74, Korea (e-mail: hschang@sogang.ac.kr). M.C. Fu is with the Robert H. Smith School of Business and the Institute for Systems Research at the University of Maryland, College Park (e-mail: mfu@rhsmith.umd.edu). J. Hu is with the Department of Applied Mathematics & Statistics, SUNY, Stony Brook (e-mail: jqhu@xx.xx.edu). S.I. Marcus is with the Department of Electrical & Computer Engineering and the Institute for Systems Research at the University of Maryland, College Park (e-mail: marcus@eng.umd.edu). Preliminary portions of this paper appeared in the Proceedings of the 42nd IEEE Conference on Decision and Control, 2003.

Our setting is that in which explicit forms for $f$ and $R$ are not available, but both can be simulated, i.e., sample paths for the states and rewards can be generated from a given random number sequence $\{w_0, \ldots, w_{H-1}\}$. We present a simulation-based algorithm called Simulated Annealing Multiplicative Weights (SAMW) for solving (2), based on the weighted majority algorithm of [12]. Specifically, we exploit the recent work on the multiplicative weights algorithm studied by Freund and Schapire [7] in a completely different context: noncooperative repeated two-player bimatrix zero-sum games.
At each iteration, the algorithm updates a probability distribution over $\Pi$ by a multiplicative weight rule using the estimated (from simulation) value functions for all policies in $\Pi$, requiring $|\Pi|$ sample paths. With a proper annealing of the control parameter associated with the algorithm, as in Simulated Annealing (SA) [10], the sequence of distributions generated by the multiplicative weight rule converges to a distribution concentrated only on policies that achieve $V^{*}$, motivating our choice of SAMW for the name of the algorithm. The algorithm is asymptotically efficient, in the sense that a finite-time upper bound is obtained for the sample mean of the value of an optimal policy, and the upper bound converges to $V^{*}$ with rate $O(1/\sqrt{T})$, where $T$ is the number of iterations. A sampling version of the algorithm that does not enumerate all policies in $\Pi$ at each iteration, but instead samples from the sequence of generated distributions, is also shown to converge to $V^{*}$. The sampling version can be used as an on-line simulation-based control in the context of planning.

SAMW differs from the usual SA in that it does not perform any local search; rather, it directly updates a probability distribution over $\Pi$ at each iteration and has a much simpler tuning process than SA. In this regard, it may be said that SAMW is a compressed version of SA with an extension to stochastic dynamic programming. The use of a probability distribution on the search space is a fundamentally different approach from existing simulation-based optimization techniques for solving MDPs, such as (basis-function based) neuro-dynamic programming [2], model-free approaches of Q-learning [19] and TD($\lambda$)-learning [17], and (bandit-theory based) adaptive multistage sampling [4]. Updating a probability distribution over the search space is similar to the learning automata approach for stochastic optimization [13], but SAMW is based on a different multiplicative weight rule.
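Ordinal comparison [9], which simply chooses the current best $\pi \in \Pi$ from the sample means of the simulated values, does not provide a deterministic upper bound even if a probabilistic bound is possible (see, e.g., [6], treating each arm of the bandit as a policy). Furthermore, it is not clear how to design a variant of ordinal comparison that does not enumerate all policies in $\Pi$. This is also true for the recently proposed on-line control algorithms, parallel rollout and policy switching [5], for MDPs.

This paper is organized as follows. In Section II, we present the SAMW algorithm, and in Sections III and IV, we analyze its convergence properties. Section V gives a numerical example, and we conclude in Section VI with some remarks.

II. BASIC ALGORITHM DESCRIPTION

Let $\Phi$ be the set of all probability distributions over $\Pi$. For $\phi \in \Phi$ and $\pi \in \Pi$, let $\phi(\pi)$ denote the probability for policy $\pi$. The goal is to concentrate the probability on the optimal policies $\pi^{*}$ in $\Pi$. The SAMW algorithm iteratively generates a sequence of distributions, where $\phi^{i}$ denotes the distribution at iteration $i$. Each iteration of SAMW requires $H$ random numbers $w_0, \ldots, w_{H-1}$, i.e., i.i.d. $U(0,1)$ and independent from previous iterations. Each policy $\pi \in \Pi$ is then simulated using the same sequence of random numbers for that iteration (different random number sequences can also be used for each policy, and all of the results still hold) in order to obtain a sample path estimate of the value function (1):

$$V_i^{\pi} := \sum_{t=0}^{H-1} \gamma^{t} R(x_t, \pi_t(x_t), w_t), \tag{3}$$

where the subscript $i$ denotes the iteration count, which has been omitted for notational simplicity in the quantities $x_t$ and $w_t$. The estimates $\{V_i^{\pi}, \pi \in \Pi\}$ are used for updating a probability distribution over $\Pi$ at each iteration $i$. Note that $0 \le V_i^{\pi} \le 1$ (a.s.) by the boundedness assumption. Note also that the size of $\Pi$ in the worst case can be quite large, i.e., $|A|^{|X|H}$, so we assume here that $\Pi$ is relatively small. In Section IV, we study the convergence property of a sampling version of SAMW that does not enumerate all policies in $\Pi$ at each iteration.

The iterative updating to compute the new distribution $\phi^{i+1}$ from $\phi^{i}$ and $\{V_i^{\pi}\}$ uses a simple multiplicative rule:

$$\phi^{i+1}(\pi) = \phi^{i}(\pi)\,\frac{\beta_i^{V_i^{\pi}}}{Z^{i}}, \quad \forall\, \pi \in \Pi, \tag{4}$$

where $\beta_i > 1$ is a parameter of the algorithm, the normalization factor $Z^{i}$ is given by $Z^{i} = \sum_{\pi \in \Pi} \phi^{i}(\pi)\,\beta_i^{V_i^{\pi}}$, and the initial distribution $\phi^{1}$ is the uniform distribution, i.e., $\phi^{1}(\pi) = 1/|\Pi|$ for all $\pi \in \Pi$.

III. CONVERGENCE ANALYSIS

For $\phi \in \Phi$, define

$$\bar{V}_i(\phi) := \sum_{\pi \in \Pi} V_i^{\pi}\,\phi(\pi), \qquad \Psi_T^{\pi} := \frac{1}{T}\sum_{i=1}^{T} V_i^{\pi},$$

where $\Psi_T^{\pi}$ is the sample mean estimate for the value function of policy $\pi$. Again, note that (a.s.) $0 \le \bar{V}_i(\phi) \le 1$ for all $\phi \in \Phi$. We remark that $\bar{V}_i(\phi)$ represents an expected reward for each fixed (iteration) experiment $i$, where the expectation is w.r.t. the distribution of the policy.

The following lemma provides a finite-time upper bound for the sample mean of the value function of an optimal policy in terms of the probability distributions generated by SAMW via (4).

Lemma 3.1: For $\beta_i = \beta > 1$, $i = 1, \ldots, T$, the sequence of distributions $\phi^{1}, \ldots, \phi^{T}$ generated by SAMW via (4) satisfies (a.s.)

$$\Psi_T^{\pi^{*}} \le \frac{\beta - 1}{\ln \beta}\cdot\frac{1}{T}\sum_{i=1}^{T} \bar{V}_i(\phi^{i}) + \frac{\ln|\Pi|}{T \ln \beta},$$

for any optimal policy $\pi^{*}$.

Proof: The proof idea follows that of Theorem 1 in [7], for which it is convenient to introduce the following measure of distance between two probability distributions, called the relative entropy (also known as Kullback-Leibler entropy):

$$D(p, q) := \sum_{\pi \in \Pi} p(\pi) \ln\!\left(\frac{p(\pi)}{q(\pi)}\right), \quad p, q \in \Phi. \tag{5}$$

Although $D(p, q) \ge 0$ for any $p$ and $q$, and $D(p, q) = 0$ if and only if $p = q$, the measure is not symmetric, hence not a true metric.

Consider any Dirac distribution $\phi^{*} \in \Phi$ such that for an optimal policy $\pi^{*}$ in $\Pi$, $\phi^{*}(\pi^{*}) = 1$ and $\phi^{*}(\pi) = 0$ for all $\pi \in \Pi \setminus \{\pi^{*}\}$. We first prove that

$$V_i^{\pi^{*}} \le \frac{(\beta - 1)\,\bar{V}_i(\phi^{i}) + D(\phi^{*}, \phi^{i}) - D(\phi^{*}, \phi^{i+1})}{\ln \beta}, \tag{6}$$

where $\phi^{i}$ and $\phi^{i+1}$ are generated by SAMW via (4) and $\beta > 1$. From the definition of $D$ given by (5),

$$\begin{aligned}
D(\phi^{*}, \phi^{i+1}) - D(\phi^{*}, \phi^{i})
&= \sum_{\pi \in \Pi} \phi^{*}(\pi) \ln\!\left(\frac{\phi^{i}(\pi)}{\phi^{i+1}(\pi)}\right)
 = \sum_{\pi \in \Pi} \phi^{*}(\pi) \ln \frac{Z^{i}}{\beta^{V_i^{\pi}}} \\
&= -\sum_{\pi \in \Pi} \phi^{*}(\pi) \ln \beta^{V_i^{\pi}} + \ln Z^{i} \\
&= (-\ln \beta)\sum_{\pi \in \Pi} \phi^{*}(\pi) V_i^{\pi} + \ln\!\left[\sum_{\pi \in \Pi} \phi^{i}(\pi)\,\beta^{V_i^{\pi}}\right] \\
&\le (-\ln \beta)\,\bar{V}_i(\phi^{*}) + \ln\!\left[\sum_{\pi \in \Pi} \phi^{i}(\pi)\bigl(1 + (\beta - 1) V_i^{\pi}\bigr)\right] \\
&= (-\ln \beta)\,V_i^{\pi^{*}} + \ln\!\left[1 + (\beta - 1)\,\bar{V}_i(\phi^{i})\right] \\
&\le (-\ln \beta)\,V_i^{\pi^{*}} + (\beta - 1)\,\bar{V}_i(\phi^{i}),
\end{aligned}$$

where the first inequality follows from the property $\beta^{a} \le 1 + (\beta - 1)a$ for $\beta \ge 0$, $a \in [0,1]$, and the last inequality follows from the property $\ln(1 + a) \le a$ for $a > -1$. Solving for $V_i^{\pi^{*}}$ (recall $\beta > 1$) yields (6). Summing the inequality (6) over $i = 1, \ldots, T$,

$$\begin{aligned}
\sum_{i=1}^{T} V_i^{\pi^{*}}
&\le \frac{\beta - 1}{\ln \beta}\sum_{i=1}^{T} \bar{V}_i(\phi^{i}) + \frac{1}{\ln \beta}\Bigl(D(\phi^{*}, \phi^{1}) - D(\phi^{*}, \phi^{T+1})\Bigr) \\
&\le \frac{\beta - 1}{\ln \beta}\sum_{i=1}^{T} \bar{V}_i(\phi^{i}) + \frac{1}{\ln \beta}\,D(\phi^{*}, \phi^{1}) \\
&\le \frac{\beta - 1}{\ln \beta}\sum_{i=1}^{T} \bar{V}_i(\phi^{i}) + \frac{\ln|\Pi|}{\ln \beta},
\end{aligned}$$

where the second inequality follows from $D(\phi^{*}, \phi^{T+1}) \ge 0$, and the last inequality uses the uniform distribution property that $\phi^{1}(\pi) = 1/|\Pi|$ for all $\pi$ implies $D(\phi^{*}, \phi^{1}) \le \ln|\Pi|$. Dividing both sides by $T$ yields the desired result.

If $(\beta - 1)/\ln \beta$ is very close to 1 and at the same time $\ln|\Pi|/(T \ln \beta)$ is very close to 0, then the above inequality implies that the expected per-iteration performance of SAMW is very close to the optimal value. However, letting $\beta \to 1$, $\ln|\Pi|/\ln \beta \to \infty$. On the other hand, for fixed $\beta$ and $T$ increasing, $\ln|\Pi|/\ln \beta$ becomes negligible relative to $T$. Thus, from the form of the bound, it is clear that the sequence $\beta_T$ should be chosen as a function of $T$ such that $\beta_T \to 1$ and $T \ln \beta_T \to \infty$ in order to achieve convergence.

Define the total variation distance for probability distributions $p$ and $q$ by $d_T(p, q) := \sum_{\pi \in \Pi} |p(\pi) - q(\pi)|$. The following lemma states that the sequence of distributions generated by SAMW converges to a stationary distribution, with a proper tuning or annealing of the $\beta$-parameter.

Lemma 3.2: Let $\{\psi(T)\}$ be a decreasing sequence such that $\psi(T) > 1$ for all $T$ and $\lim_{T \to \infty} \psi(T) = 1$. For $\beta_i = \psi(i)$, $i = 1, \ldots, T + k$, $k \ge 1$, the sequence of distributions generated by SAMW via (4) satisfies (a.s.)

$$\lim_{T \to \infty} d_T(\phi^{T}, \phi^{T+k}) = 0.$$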
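The update rule (4) and the total variation distance $d_T$ can be sketched numerically as follows. This is only an illustration under stated assumptions: the three "true" policy values, the noise model, and the schedule `psi(i) = 1 + sqrt(1/i)` (one decreasing sequence above 1 with limit 1) are hypothetical choices made for the example, and the function names are not from the paper.

```python
import math
import random

def samw_update(phi, values, beta):
    """Multiplicative weight rule (4): phi_next(pi) = phi(pi) * beta**V(pi) / Z."""
    weights = [p * beta ** v for p, v in zip(phi, values)]
    z = sum(weights)                          # normalization factor Z
    return [wgt / z for wgt in weights]

def total_variation(p, q):
    """d_T(p, q): sum over policies of |p(pi) - q(pi)|."""
    return sum(abs(a - b) for a, b in zip(p, q))

def psi(i):
    """Illustrative annealing schedule (assumption): decreasing, > 1, limit 1."""
    return 1.0 + math.sqrt(1.0 / i)

rng = random.Random(1)
true_values = [0.2, 0.5, 0.8]                 # hypothetical policy values in [0, 1]
phi = [1.0 / len(true_values)] * len(true_values)   # uniform initial distribution
distances = []
for i in range(1, 2001):
    w = rng.random()                          # common random number for all policies
    estimates = [v + 0.2 * (w - 0.5) for v in true_values]  # noisy values in [0.1, 0.9]
    new_phi = samw_update(phi, estimates, psi(i))
    distances.append(total_variation(phi, new_phi))
    phi = new_phi
```

In this toy run the distance between successive distributions shrinks as the parameter is annealed toward 1, and the probability mass gathers on the policy with the largest value, which is the behavior the lemma describes.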
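Proof: From the definition of $D$ given by (5),

$$\begin{aligned}
D(\phi^{T}, \phi^{T+1})
&= \sum_{\pi \in \Pi} \phi^{T}(\pi) \ln\!\left(\frac{\phi^{T}(\pi)}{\phi^{T+1}(\pi)}\right)
 \le \max_{\pi} \ln\!\left(\frac{\phi^{T}(\pi)}{\phi^{T+1}(\pi)}\right) \\
&= \max_{\pi} \ln \frac{Z^{T}}{\psi(T)^{V_T^{\pi}}}
 = -\min_{\pi} \ln \frac{\psi(T)^{V_T^{\pi}}}{Z^{T}}
 \le \ln \psi(T),
\end{aligned}$$

since $\psi(T)^{V_T^{\pi}} \ge 1$ and $Z^{T} \le \psi(T)$ for all $\pi$ and any $T$.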

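Applying Pinsker's inequality [18],

$$d_T(\phi^{T}, \phi^{T+1}) \le \sqrt{2 D(\phi^{T}, \phi^{T+1})} \le \sqrt{2 \ln \psi(T)}.$$

Therefore,

$$d_T(\phi^{T}, \phi^{T+k}) \le \sum_{j=1}^{k} d_T(\phi^{T+j-1}, \phi^{T+j}) \le \sum_{j=0}^{k-1} \sqrt{2 \ln \psi(T + j)}.$$

Because $d_T(\phi^{T}, \phi^{T+k}) \ge 0$ for any $k$ and $\sum_{j=0}^{k-1} \sqrt{2 \ln \psi(T + j)} \to 0$ as $T \to \infty$, $d_T(\phi^{T}, \phi^{T+k}) \to 0$ as $T \to \infty$.

Theorem 3.1: Let $\{\psi(T)\}$ be a decreasing sequence such that $\psi(T) > 1$ for all $T$, $\lim_{T \to \infty} \psi(T) = 1$, and $\lim_{T \to \infty} T \ln \psi(T) = \infty$. For $\beta_i = \psi(T)$, $i = 1, \ldots, T$, the sequence of distributions $\phi^{1}, \ldots, \phi^{T}$ generated by SAMW via (4) satisfies (a.s.)

$$\frac{\psi(T) - 1}{\ln \psi(T)}\cdot\frac{1}{T}\sum_{i=1}^{T} \bar{V}_i(\phi^{i}) + \frac{\ln|\Pi|}{T \ln \psi(T)} \to V^{*},$$

and $\phi^{T} \to \phi^{*} \in \Phi$, where $\phi^{*}(\pi) = 0$ for all $\pi$ such that $V^{\pi} < V^{*}$.

Proof: Using $x - 1 \le x \ln x$ for all $x \ge 1$ and Lemma 3.1,

$$\Psi_T^{\pi^{*}} \le \frac{\psi(T) - 1}{\ln \psi(T)}\cdot\frac{1}{T}\sum_{i=1}^{T} \bar{V}_i(\phi^{i}) + \frac{\ln|\Pi|}{T \ln \psi(T)} \le \psi(T)\cdot\frac{1}{T}\sum_{i=1}^{T} \bar{V}_i(\phi^{i}) + \frac{\ln|\Pi|}{T \ln \psi(T)}. \tag{7}$$

In the limit as $T \to \infty$, the left-hand side converges to $V^{*}$ by the law of large numbers, and in the rightmost expression in (7), $\psi(T) \to 1$ and the second term vanishes, so it suffices to show that $\frac{1}{T}\sum_{i=1}^{T} \bar{V}_i(\phi^{i})$ is bounded from above by $V^{*}$ (in the limit). From Lemma 3.2, for every $\epsilon > 0$, there exists $T_\epsilon < \infty$ such that $d_T(\phi^{i}, \phi^{i+k}) \le \epsilon$ for all $i > T_\epsilon$ and any integer $k \ge 1$. Then, for $T > T_\epsilon$, we have (a.s.)

$$\begin{aligned}
\frac{1}{T}\sum_{i=1}^{T} \bar{V}_i(\phi^{i})
&= \frac{1}{T}\left[\sum_{i=1}^{T_\epsilon} \bar{V}_i(\phi^{i}) + \sum_{i=T_\epsilon+1}^{T} \bar{V}_i(\phi^{i})\right] \\
&= \frac{1}{T}\sum_{i=1}^{T_\epsilon} \bar{V}_i(\phi^{i}) + \frac{1}{T}\sum_{i=T_\epsilon+1}^{T} \bar{V}_i(\phi^{T}) + \frac{1}{T}\sum_{i=T_\epsilon+1}^{T} \bigl(\bar{V}_i(\phi^{i}) - \bar{V}_i(\phi^{T})\bigr) \\
&= \frac{1}{T}\sum_{i=1}^{T_\epsilon} \bar{V}_i(\phi^{i}) + \frac{1}{T}\sum_{i=T_\epsilon+1}^{T} \bar{V}_i(\phi^{T}) + \frac{1}{T}\sum_{i=T_\epsilon+1}^{T}\sum_{\pi \in \Pi} \bigl(\phi^{i}(\pi) - \phi^{T}(\pi)\bigr) V_i^{\pi} \\
&\le \frac{1}{T}\sum_{i=1}^{T_\epsilon} \bar{V}_i(\phi^{i}) + \frac{1}{T}\sum_{i=T_\epsilon+1}^{T} \bar{V}_i(\phi^{T}) + \frac{1}{T}\sum_{i=T_\epsilon+1}^{T}\sum_{\pi \in \Pi} \bigl|\phi^{i}(\pi) - \phi^{T}(\pi)\bigr|\, V_i^{\pi} \\
&\le \frac{1}{T}\sum_{i=1}^{T_\epsilon} \bar{V}_i(\phi^{i}) + \frac{1}{T}\sum_{i=T_\epsilon+1}^{T} \bar{V}_i(\phi^{T}) + |\Pi|\,\epsilon,
\end{aligned} \tag{8}$$

the last inequality following from $\max_{\pi} |\phi^{i+k}(\pi) - \phi^{i}(\pi)| \le \epsilon$ and $V_i^{\pi} \le 1$ for all $i > T_\epsilon$, $k \ge 1$, $\pi \in \Pi$. As $T \to \infty$, the first term of (8) vanishes, and the second term converges by the law of large numbers to $V^{\pi}$, $\pi \sim \phi^{T}$ (that is, to $\sum_{\pi \in \Pi} V^{\pi} \phi^{T}(\pi)$), which is bounded from above by $V^{*}$. Since $\epsilon$ can be chosen arbitrarily close to zero, the desired convergence follows.

The second part of the theorem follows directly from the first part with Lemma 3.2, with a proof obtained in a straightforward manner by assuming there exists a $\pi \in \Pi$ such that $\phi^{*}(\pi) \ne 0$ and $V^{\pi} < V^{*}$, leading to a contradiction. We skip the details.

An example of a decreasing sequence $\{\psi(T)\}$, $T = 1, 2, \ldots$, that satisfies the condition of Theorem 3.1 is $\psi(T) = 1 + \sqrt{1/T}$, $T > 0$.

IV. CONVERGENCE OF THE SAMPLING VERSION OF THE ALGORITHM

Instead of estimating the value functions for every policy in $\Pi$ according to (3), which requires simulating all policies in $\Pi$, a sampling version of the algorithm would sample a subset of the policies in $\Pi$ at each iteration $i$ according to $\phi^{i}$ and simulate only those policies (and estimate their corresponding value functions). In this context, Theorem 3.1 essentially establishes that the expected per-iteration performance of SAMW approaches the optimal value as $T \to \infty$ for an appropriately selected tuning sequence $\{\beta_i\}$. Here, we show that the actual (distribution sampled) per-iteration performance also converges to the optimal value using a particular annealing schedule of the parameter $\beta$. For simplicity, we assume that a single policy is sampled at each iteration (i.e., the subset is a singleton). A related result is proven by Freund and Schapire within the context of solving two-player zero-sum bimatrix repeated games [7], and the proof of the following theorem is based on theirs.

Theorem 4.1: Let $T_k = \sum_{j=1}^{k} j^{2}$. For $\beta_i = 1 + 1/k$, $T_{k-1} < i \le T_k$, let $\{\phi^{i}\}$ denote the sequence of distributions generated by SAMW via (4), with resetting of $\phi^{i}(\pi) = 1/|\Pi|$ for all $\pi$ at each $i = T_k$. Let $\hat{\pi}(\phi^{i})$ denote the policy sampled from $\phi^{i}$ (at iteration $i$). Then (a.s. as $k \to \infty$),

$$\frac{1}{T_k}\sum_{i=1}^{T_k} V_i^{\hat{\pi}(\phi^{i})} \to V^{*}.$$

Proof: The sequence of random variables $\kappa_i = V_i^{\hat{\pi}(\phi^{i})} - \bar{V}_i(\phi^{i})$ forms a martingale difference sequence with $|\kappa_i| \le 1$, since $E[\kappa_i \mid \kappa_1, \ldots, \kappa_{i-1}] = 0$ for all $i$. Let $\epsilon_k = 2\sqrt{\ln k}/k$ and $I_k = [T_{k-1} + 1, T_k]$. Applying Azuma's inequality [14, p. 309], we have that for every $\epsilon_k > 0$,

$$P\!\left(\left|\frac{1}{k^{2}}\sum_{i \in I_k} \Bigl(V_i^{\hat{\pi}(\phi^{i})} - \bar{V}_i(\phi^{i})\Bigr)\right| > \epsilon_k\right) \le 2e^{-0.5 k^{2} \epsilon_k^{2}} = \frac{2}{k^{2}}. \tag{9}$$

The sum of the probability bound in (9) over all $k$ from 1 to $\infty$ is finite. Therefore, by the Borel-Cantelli lemma, (a.s.) all but a finite number of $I_k$'s ($k = 1, \ldots, \infty$) satisfy

$$\sum_{i \in I_k} \bar{V}_i(\phi^{i}) \le \sum_{i \in I_k} V_i^{\hat{\pi}(\phi^{i})} + k^{2}\epsilon_k, \tag{10}$$

so those $I_k$ that violate (10) can be ignored (a.s.). From Lemma 3.1 with the definition of $\beta_i$, for all $i \in I_k$,

$$\begin{aligned}
\sum_{i \in I_k} V_i^{\pi^{*}}
&\le \frac{\beta_i - 1}{\ln \beta_i}\sum_{i \in I_k} \bar{V}_i(\phi^{i}) + \frac{\ln|\Pi|}{\ln \beta_i}
 \le \beta_i \sum_{i \in I_k} \bar{V}_i(\phi^{i}) + \ln|\Pi|\,\frac{\beta_i}{\beta_i - 1} \\
&= \left(1 + \frac{1}{k}\right)\sum_{i \in I_k} \bar{V}_i(\phi^{i}) + \ln|\Pi|\,(k + 1)
 \le \sum_{i \in I_k} \bar{V}_i(\phi^{i}) + k + \ln|\Pi|\,(k + 1),
\end{aligned} \tag{11}$$

where the last inequality follows from $\bar{V}_i(\phi) \le 1$ for all $\phi \in \Phi$ and $|I_k| = k^{2}$.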
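Combining (10) and (11) and summing, we have

$$T_k\,\Psi_{T_k}^{\pi^{*}} \le \sum_{i=1}^{T_k} V_i^{\hat{\pi}(\phi^{i})} + \sum_{j=1}^{k}\Bigl[2j\sqrt{\ln j} + j(\ln|\Pi| + 1) + \ln|\Pi|\Bigr],$$

so

$$\Psi_{T_k}^{\pi^{*}} \le \frac{1}{T_k}\sum_{i=1}^{T_k} V_i^{\hat{\pi}(\phi^{i})} + \frac{1}{T_k}\sum_{j=1}^{k}\Bigl[2j\sqrt{\ln j} + j(\ln|\Pi| + 1) + \ln|\Pi|\Bigr]. \tag{12}$$

Because $T_k$ is $O(k^{3})$, the second term on the right-hand side of (12) vanishes as $k \to \infty$. Therefore, for every $\epsilon > 0$, (a.s.) for all but a finite number of values of $T_k$,

$$\Psi_{T_k}^{\pi^{*}} \le \frac{1}{T_k}\sum_{i=1}^{T_k} V_i^{\hat{\pi}(\phi^{i})} + \epsilon.$$

We now argue that $\{\phi^{i}\}$ converges to a fixed distribution as $k \to \infty$, so that eventually the term $\frac{1}{T_k}\sum_{i=1}^{T_k} V_i^{\hat{\pi}(\phi^{i})}$ is bounded from above by $V^{*}$. With similar reasoning as in the proof of Lemma 3.2, for every $\epsilon > 0$, there exists $T_\epsilon \in I_k$ for some $k > 1$ such that for all $i > T_\epsilon$ with $i + j \in I_k$, $j \ge 1$, $d_T(\phi^{i}, \phi^{i+j}) \le \epsilon$. Taking $T > T_\epsilon$ with $T \in I_k$,

$$\begin{aligned}
\frac{1}{T}\sum_{i=1}^{T} V_i^{\hat{\pi}(\phi^{i})}
&= \frac{1}{T}\left[\sum_{i=1}^{T_\epsilon} V_i^{\hat{\pi}(\phi^{i})} + \sum_{i=T_\epsilon+1}^{T} V_i^{\hat{\pi}(\phi^{i})}\right] \\
&= \frac{1}{T}\sum_{i=1}^{T_\epsilon} V_i^{\hat{\pi}(\phi^{i})} + \frac{1}{T}\sum_{i=T_\epsilon+1}^{T} V_i^{\hat{\pi}(\phi^{T})} + \frac{1}{T}\sum_{i=T_\epsilon+1}^{T}\Bigl(V_i^{\hat{\pi}(\phi^{i})} - V_i^{\hat{\pi}(\phi^{T})}\Bigr).
\end{aligned} \tag{13}$$

Letting $T \to \infty$, the first term on the right-hand side of (13) vanishes, and the second term is bounded from above by $V^{*}$, because the second term converges to $V^{\pi}$, $\pi \sim \phi^{T}$, from the law of large numbers. We know that for all $i > T_\epsilon$ in $I_k$, $-\epsilon + \phi^{i}(\pi) \le \phi^{i+j}(\pi) \le \phi^{i}(\pi) + \epsilon$ for all $\pi \in \Pi$ and any $j$. Therefore, as $\epsilon$ can be chosen arbitrarily close to zero, $\{\phi^{i}\}$ converges to the distribution $\phi^{T}$, making the last term vanish (once each policy is sampled from the same distribution over $\Pi$, the simulated value would be the same for the same random numbers), which provides the desired convergence result.

V. A NUMERICAL EXAMPLE

To illustrate the performance of SAMW, we consider a simple finite-horizon inventory control problem with lost sales, zero order lead time, and linear holding and shortage costs. Given an inventory level, orders are placed and received, demand is realized, and the new inventory level is calculated. Formally, we let $D_t$, a discrete random variable, denote the demand in period $t$, $x_t$ the inventory level at period $t$, $a_t$ the order amount at period $t$, $p$ the per period per unit demand lost penalty cost, $h$ the per period per unit inventory holding cost, and $M$ the inventory capacity. Thus, the inventory level evolves according to the following dynamics:

$$x_{t+1} = \max\{0,\ x_t + a_t - D_t\}.$$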
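The lost-sales dynamics above fit the generic form $x_{t+1} = f(x_t, a_t, w_t)$ once the demand is generated from the uniform disturbance $w_t$. The sketch below shows one way to do this by inverse transform; the helper names are hypothetical, and the demand support $\{0, 5, 10, 15, 20\}$ is assumed here from the experimental setup the example goes on to describe.

```python
DEMAND_VALUES = [0, 5, 10, 15, 20]   # assumed demand support (discrete uniform)

def demand_from_uniform(w):
    """Map a U(0,1) disturbance w to a discrete uniform demand (inverse transform)."""
    index = min(int(w * len(DEMAND_VALUES)), len(DEMAND_VALUES) - 1)
    return DEMAND_VALUES[index]

def next_inventory(x, a, w):
    """Lost-sales dynamics x_{t+1} = max(0, x_t + a_t - D_t)."""
    return max(0, x + a - demand_from_uniform(w))

def simulate_levels(x0, orders, w):
    """Inventory levels x_0, ..., x_H for a fixed order sequence and disturbances."""
    levels = [x0]
    for a, w_t in zip(orders, w):
        levels.append(next_inventory(levels[-1], a, w_t))
    return levels

# Three periods: order 5, then 0, then 10, with fixed disturbances.
levels = simulate_levels(5, [5, 0, 10], [0.1, 0.5, 0.95])
# demands are 0, 10, 20, so the levels are [5, 10, 0, 0]
```

Feeding the same disturbance list to every policy reproduces the common-random-numbers simulation used in each SAMW iteration.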
The goal is to minimize, over a given set of (nonstationary) policies $\Pi$, the expected total cost over the entire horizon from a given initial inventory level $x_0$, i.e.,

$$\min_{\pi \in \Pi} E\left[\sum_{t=0}^{H-1} \Bigl[h \max\{0,\ x_t + \pi_t(x_t) - D_t\} + p \max\{0,\ D_t - x_t - \pi_t(x_t)\}\Bigr] \,\Big|\, x_0 = x\right].$$

The following set of parameters is used in our experiments: $M = 20$, $H = 3$, $h = 0.003$, $p = 0.012$, $x_0 = 5$ and $x_t \in \{0, 5, 10, 15, 20\}$ for $t = 1, \ldots, H$, $a_t \in \{0, 5, 10, 15, 20\}$ for all $t = 0, \ldots, H-1$, and $D_t$ is a discrete uniformly distributed random variable taking values in $\{0, 5, 10, 15, 20\}$. The values of $h$ and $p$ are chosen so as to satisfy the one-period reward bound assumed in the SAMW convergence results. Note that since we are ignoring the setup cost (i.e., no fixed order cost), it is easy to see that the optimal order policy follows a threshold rule, in which an order is placed at period $t$ if the inventory level $x_t$ is below a certain threshold $S_t$, and the amount to order is equal to the difference $\max\{0,\ S_t - x_t\}$. Thus, by taking advantage of this structure, in the actual implementation of SAMW, we restrict the search of the algorithm to the set of threshold policies, i.e., $\Pi = \{(S_0, S_1, S_2) : S_t \in \{0, 5, 10, 15, 20\},\ t = 0, 1, 2\}$, rather than the set of all admissible policies.

We implemented two versions of SAMW, i.e., the fully sampled version of SAMW, which constructs the optimal value function estimate by enumerating all policies in $\Pi$ and using all value function estimates, and the single sampling version of SAMW introduced in Theorem 4.1, which uses just one sampled policy in each iteration to update the optimal value function estimate; however, updating $\phi^{i}$ requires value function estimates for all policies in $\Pi$.

For numerical comparison, we also applied the adaptive multi-stage sampling (AMS) algorithm [4] and a non-adaptive multi-stage sampling (NMS) algorithm. These two algorithms are simulation-tree based methods, where each node in the tree represents a state, with the root node being the initial state, and each edge signifies the sampling of an action.
Both AMS and NMS use forward search to generate sample paths from the initial state to the final state, and update the value function backwards only at those visited states. The difference between the AMS and NMS algorithms is in the way actions are sampled at each decision period: AMS samples actions in an adaptive manner according to some performance index, whereas NMS simply samples each action a fixed number of times. A detailed description of these approaches can be found in [4].

[Figure 1: value function estimate versus total periods simulated ($\times 10^{5}$). Curves: SAMW (fully sampled), SAMW (single sampling), SAMW (fully sampled) with constant $\beta$, SAMW (single sampling) with constant $\beta$, AMS, NMS, Optimal.]

Fig. 1. Average performance (mean of 25 simulation replications, resulting in confidence half-widths within 5% of estimated mean) of SAMW, AMS, and NMS on the inventory control problem ($h = 0.003$, $p = 0.012$).

Figure 1 shows the performance of these algorithms as a function of the total number of periods simulated, based on 25 independent replications. The results indicate convergence of both versions of SAMW; however, the two alternative benchmark algorithms AMS and NMS seem to provide superior empirical performance over SAMW. We believe this is because the annealing schedule for $\beta$ used in SAMW is too conservative for this problem, thus leading to slow convergence. To improve the empirical performance of SAMW, we also implemented both versions of the algorithm with $\beta$ being held constant throughout the search, i.e., independent of $T$. The constant-$\beta$ case is included in Figure 1, which shows significantly improved performance. Experimentation with the SAMW algorithm also revealed that it performed even better for cost parameter values in the inventory control problem that do not satisfy the strict reward bound. One such example, with different values of $h$ and $p$ (all other parameter values unchanged), is shown in Figure 2.

[Figure 2: value function estimate versus total periods simulated ($\times 10^{5}$). Curves: SAMW (fully sampled), SAMW (single sampling), AMS, NMS, Optimal.]

Fig. 2. Average performance (mean of 25 simulation replications, resulting in confidence half-widths within 5% of estimated mean) of SAMW, AMS, and NMS on the inventory control problem with cost parameters $h$ and $p$ that do not satisfy the reward bound.

VI. CONCLUDING REMARKS

SAMW can be naturally parallelized to reduce its computational cost. Partition the given policy space $\Pi$ into $\{\Delta_j\}$ such that $\Delta_j \cap \Delta_{j'} = \emptyset$ for all $j \ne j'$ and $\bigcup_j \Delta_j = \Pi$, and apply the algorithm in parallel for $T$ iterations on each $\Delta_j$. For a fixed value of $\beta > 1$, we have the following finite-time bound from Lemma 3.1:

$$\Psi_T^{\pi^{*}} \le \max_{j}\left\{\frac{\beta - 1}{\ln \beta}\cdot\frac{1}{T}\sum_{i=1}^{T} \bar{V}_i(\phi_j^{i}) + \frac{\ln|\Delta_j|}{T \ln \beta}\right\},$$

where $\phi_j^{i}$ is the distribution generated for $\Delta_j$ at iteration $i$.

The original version of SAMW recalculates an estimate of the value function for all policies in $\Pi$ at each iteration, requiring each policy to be simulated. If $\Pi$ is large, this may not be practical, and the sampling version of SAMW given by Theorem 4.1 also requires each value function estimate in order to update $\phi^{i}$ at each iteration. One simple alternative is to use the prior value function estimates for updating $\phi^{i}$, except for the single sampled one; thus, only one simulation per iteration would be required. Specifically, $V_i^{\pi} := V_{i-1}^{\pi}$ if $\pi$ is not sampled at iteration $i$; else obtain a new estimate of $V_i^{\pi}$ via (3). An extension of this is to use a threshold on $\phi^{i}$ to determine which policies will be simulated.
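The simple alternative described above, in which only the sampled policy is re-simulated and all other policies keep their prior estimates, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the callback `simulate`, the single initial pass that gives every policy a first estimate, the constant value of `beta`, and the toy values are all hypothetical choices.

```python
import random

def samw_single_sample(simulate, num_policies, beta, iterations, seed=0):
    """Update rule (4) where only the sampled policy is re-simulated per iteration.

    simulate(policy_index, w) -> sample-path value in [0, 1] (hypothetical callback).
    """
    rng = random.Random(seed)
    phi = [1.0 / num_policies] * num_policies          # uniform start
    w0 = rng.random()
    # Assumption: one initial pass so that every policy has a prior estimate.
    estimates = [simulate(k, w0) for k in range(num_policies)]
    for _ in range(iterations):
        sampled = rng.choices(range(num_policies), weights=phi)[0]
        estimates[sampled] = simulate(sampled, rng.random())  # refresh one estimate
        weights = [p * beta ** v for p, v in zip(phi, estimates)]
        z = sum(weights)                                # normalization factor
        phi = [x / z for x in weights]
    return phi

def simulate(k, w):
    """Toy stand-in for a simulated policy value: fixed mean plus bounded noise."""
    return [0.2, 0.5, 0.8][k] + 0.2 * (w - 0.5)

phi_final = samw_single_sample(simulate, num_policies=3, beta=1.05, iterations=2000)
```

In this toy setting the stale estimates remain correctly ordered, so the distribution still concentrates on the best policy while using one simulation per iteration; in general, how well the reuse works depends on how noisy the retained estimates are.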
Since the sequence of distributions generated by SAMW converges to a distribution concentrated on the optimal policies in $\Pi$, as the number of iterations increases, the contributions from non-optimal policies get smaller and smaller, so these policies need not be resimulated (and their value function estimates updated) very often. Specifically, $V_i^{\pi} := V_{i-1}^{\pi}$ if $\phi^{i}(\pi) \le \epsilon$; else obtain a new estimate of $V_i^{\pi}$ via (3).

The cooling schedule presented in Theorem 4.1 is just one way of controlling the parameter $\beta$. Characterizing properties of good schedules is critical to effective implementation, as the numerical experiments showed. The numerical experiments also demonstrated that the algorithm may work well outside the boundaries of the assumptions under which theoretical convergence is proved, specifically the bound on the one-period reward function and the value of the cooling parameter $\beta$. We suspect this has something to do with the scaling of the algorithm, but more investigation into this phenomenon is clearly warranted.

Finally, we presented SAMW in the MDP framework for optimization of sequential decision making processes. Even though the idea is general, the actual algorithm depends on the sequential structure of MDPs. The general problem given by (2) takes the form of a general stochastic optimization problem, so SAMW can also be adapted to serve as a global stochastic optimization algorithm for bounded expected value objective functions.

REFERENCES

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control, Volumes 1 and 2. Athena Scientific, 1995.
[2] D. P. Bertsekas and J. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, 1996.
[3] H. S. Chang, M. C. Fu, and S. I. Marcus, "An asymptotically efficient algorithm for finite horizon stochastic dynamic programming problems," in Proc. of the 42nd IEEE Conf. Decision and Control, 2003.
[4] H. S. Chang, M. C. Fu, J. Hu, and S. I. Marcus, "An adaptive sampling algorithm for solving Markov decision processes," Operations Research, vol. 53, no. 1, pp. 126-139, 2005.
[5] H. S. Chang, R. Givan, and E. K. P. Chong, "Parallel rollout for online solution of partially observable Markov decision processes," Discrete Event Dynamic Systems: Theory and Application, vol. 14, no. 3, pp. 309-341, 2004.
[6] E. Even-Dar, S. Mannor, and Y. Mansour, "PAC bounds for multi-armed bandit and Markov decision processes," in Proc. of the 15th Annual Conf. on Computational Learning Theory, 2002.
[7] Y. Freund and R. Schapire, "Adaptive game playing using multiplicative weights," Games and Economic Behavior, vol. 29, pp. 79-103, 1999.
[8] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, 1996.
[9] Y. C. Ho, C. Cassandras, C-H. Chen, and L. Dai, "Ordinal optimization and simulation," J. of Operations Research Society, 2000.
[10] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, 1983.
[11] A. J. Kleywegt, A. Shapiro, and R. Homem-de-Mello, "The sample average approximation method for stochastic discrete optimization," SIAM J. on Control and Optimization, vol. 12, no. 2, 2001.
[12] N. Littlestone and M. K. Warmuth, "The weighted majority algorithm," Information and Computation, vol. 108, pp. 212-261, 1994.
[13] A. S. Poznyak and K. Najim, Learning Automata and Stochastic Optimization. Springer-Verlag, 1997.
[14] S. M. Ross, Stochastic Processes, Second Edition. John Wiley & Sons, 1996.
[15] R. Y. Rubinstein and A. Shapiro, Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization by the Score Function Method. John Wiley & Sons, 1993.
[16] N. Shimkin and A. Shwartz, "Guaranteed performance regions in Markovian systems with competing decision makers," IEEE Trans. on Automatic Control, vol. 38, no. 1, pp. 84-95, 1993.
[17] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts, 1998.
[18] F. Topsoe, "Bounds for entropy and divergence for distributions over a two-element set," J. of Inequalities in Pure and Applied Mathematics, vol. 2, issue 2, Article 25, 2001.
[19] C. J. C. H. Watkins, "Q-learning," Machine Learning, vol. 8, no. 3, pp. 279-292, 1992.
