MANAGEMENT SCIENCE, Vol. 53, No. 2, February 2007, INFORMS

Bias and Variance Approximation in Value Function Estimates

Shie Mannor, Department of Electrical and Computer Engineering, McGill University, Montreal, Quebec, Canada H3A 2A7, shie@ece.mcgill.ca
Duncan Simester, Sloan School of Management, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, simester@mit.edu
Peng Sun, Fuqua School of Business, Duke University, Durham, North Carolina 27708, psun@duke.edu
John N. Tsitsiklis, Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, jnt@mit.edu

We consider a finite-state, finite-action, infinite-horizon, discounted reward Markov decision process and study the bias and variance in the value function estimates that result from empirical estimates of the model parameters. We provide closed-form approximations for the bias and variance, which can then be used to derive confidence intervals around the value function estimates. We illustrate and validate our findings using a large database describing the transaction and mailing histories for customers of a mail-order catalog firm.

Key words: value function; confidence interval; variance; bias
History: Accepted by Wallace J. Hopp, stochastic models and simulation; received July 14. This paper was with the authors 4½ months for 4 revisions.

1. Introduction

Bellman's value function plays a central role in the optimization of dynamic decision-making models, as well as in the structural estimation of dynamic models of rational agents. For the important case of a finite-state Markov decision process (MDP), the value function depends on two types of model parameters: the transition probabilities between states and the expected one-step rewards from each state. In many applications in the social sciences and in engineering, the transition probabilities and expected rewards are not known and instead must be estimated from finite samples of data. The estimation errors for these parameters introduce bias and variance in the value function estimates. In this paper, we present a methodology for evaluating this bias and variance. This, in turn, allows the calculation of confidence intervals around the value function estimates.
The confidence intervals are themselves approximations. For analytical and computational tractability, they rely on second-order Taylor series approximations. Moreover, because the expressions for the bias and the variance approximation require the true but unknown model parameters, we replace these unknown parameters by their estimates. We evaluate the accuracy of these approximations and validate the expressions using a large sample of real data obtained from a mail-order catalog company.

1.1. Sources of Variance

We start by distinguishing between two types of variance that can arise in an MDP: internal and parametric. Internal variance reflects the stochasticity in the transitions and rewards. For example, in a marketing setting there is rarely certainty as to whether an individual customer will purchase, resulting in genuinely stochastic transitions and rewards. Parametric variance arises if the true transition probabilities and expected rewards are estimated rather than known; the potential for error in the estimates of these parameters introduces variance in the value function estimates.

The two types of variance have different sources and can be illustrated through different experiments. To illustrate internal variance, we can fix the model parameters and then generate a number of finite-length sample trajectories (with all trajectories having the same length, starting from the same state, and using a common control policy). The variation across sample trajectories in the total rewards and/or the identity of the final state reflects internal variance. In contrast, aggregation across samples does not mitigate parametric variance. The latter can be illustrated by comparing the average outcomes from a large number of samples generated under different estimates for the model parameters. The variation in the average outcomes under different estimates reflects parametric variance.

Internal variance has already been considered in the literature. In particular, Sobel (1982) provides an expression for the internal variance in an MDP with discounted rewards, while Filar et al. (1989) and Baukal-Gursoy and Ross (1992) consider the average reward criterion. In this paper, we focus on parametric variance. Our motivation is that in many contexts the underlying objective involves averaging outcomes across a large number of samples, in which case the internal variance is averaged out. For example, in a marketing application, firm profits typically represent the aggregation of outcomes across a large number of customers. Similarly, in a labor economics setting, a firm often aggregates across a large number of employees. Of course, there are settings where internal variance is also important. For example, when allocating financial portfolios, the (internal) variance of the return on a single financial portfolio is important in its own right.

1.2. Literature

Markov decision problems and the associated methodology of dynamic programming have found a broad range of applications in numerous fields in the social sciences and in engineering. These applications can be broadly divided in two categories based on the research objectives.

The first and more traditional category of applications focuses on optimizing the operation of human or engineering systems and on providing tools for effective decision making. The application areas are vast, and include finance (Luenberger 1997, Campbell and Viceira 2002), economics (Dixit and Pindyck 1994), inventory control and supply chain management (Zipkin 2000), revenue and yield management (McGill and van Ryzin 1999), transportation (Godfrey and Powell 2002), communications, water resource management, and electric power systems. The vast majority of this literature assumes that an accurate system model is available.
There is an underlying implicit assumption that the true model will be estimated using statistical methods on the basis of whatever data are available. However, the statistical ramifications of working with finite data records have received little attention. An exception is the literature dealing with online learning of optimal policies (adaptive control of Markov chains, reinforcement learning) (Sutton and Barto 1998, Bertsekas and Tsitsiklis 1996). However, this literature is concerned with asymptotic convergence as opposed to the common statistical questions of standard errors and confidence intervals.

The second category of applications focuses on explaining observed phenomena. Among the most widely cited examples is the work of Rust (1987), who develops a discrete dynamic programming model of the optimal replacement policy for bus engines. According to this approach, the researcher starts by assuming that individuals or firms behave optimally, but that the parameters of the firm or the customer decision problem are unknown. By maximizing the likelihood of the empirically observed actions of individuals or firms under the optimal policies for different sets of parameters, the researcher seeks to identify these unobserved parameters. Similar applications of discrete dynamic programming models have become increasingly common, particularly in the labor (Keane and Wolpin 1994), industrial organization (Hendel and Nevo 2002), and marketing (Gönül and Shi 1998) literatures. While these methods use a variety of approaches to calculate or approximate the value function, the value function relies on point estimates of the model parameters. Previous attempts to consider the impact of parameter error on the calculated value function have been limited to simulation-based approaches.

We finally note that the impact of uncertainty in the model parameters on the accuracy of the value function estimates has received attention in the finance literature. For example, Xia (2001) and Barberis (2000) investigate how dynamic learning about stock return predictability affects optimal portfolio allocations. The general problem considered in these studies is similar to the one addressed in this paper. However, the sources of variance are different.
In particular, the finance literature is concerned with internal variance due to the stochasticity in the underlying process and parametric variance due to nonstationarity of the model parameters, including changes in the investment horizon and/or dynamic learning. In contrast, we abstract away from the problem of internal variance, assume that the model parameters are stationary, and focus on the parametric variance that results from estimating the model parameters from a finite sample of data.

1.3. Overview

As far as we know, this is the first paper to study parametric bias and variance in MDPs. It serves two purposes: first, to illustrate the potential for error in value function estimates and to highlight the potential magnitude of these errors; second, to provide formulas and a methodology for estimating the bias and variance in value function estimates, which can then be used to construct confidence intervals around the value function estimates.

We begin with some notation and background material in §2. In §3, we illustrate the relationship between errors in the model parameters and the accuracy of value function estimates, using actual data from a catalog mailing context. In §4, we present a methodology for estimating the bias and variance in the value function estimates. In §5, we validate our methodology using the catalog mailing data. We conclude in §6 with a review of the findings and a discussion of opportunities for future research.

2. A Formal Description of the Problem

We consider a finite-state, finite-action, infinite-horizon, discounted reward MDP, where \(S\) denotes the set of states of cardinality \(m\), \(A\) is the set of actions, \(\alpha \in (0,1)\) is the discount factor, and \(P^a_{ij}\) and \(R^a_{ij}\) (\(i, j \in S\), \(a \in A\)) denote the transition probabilities and the conditional expected rewards. The scalars \(P^a_{ij}\) and \(R^a_{ij}\) are interpreted as follows: if the current state is \(i\) and action \(a\) is applied, then the next state is \(j\) with probability \(P^a_{ij}\); furthermore, given that a transition from \(i\) to \(j\) occurs following an action equal to \(a\), a random reward is obtained, whose conditional expectation is equal to \(R^a_{ij}\). Note that if action \(a\) is applied at state \(i\), the expected reward, denoted by \(R^a_i\), is equal to \(\sum_j P^a_{ij} R^a_{ij}\).

We are interested in the value function associated with a stationary, Markovian, possibly randomized, fixed policy \(\pi\). The assumption that the policy is fixed allows us to initially abstract away from the control problem. As we discuss in §4.2, the impact of parameter uncertainty on the solution to the control problem raises additional issues. We use \(\pi(a \mid i)\) to denote the conditional probability of applying action \(a\) when at state \(i\). Let \(P^\pi_{ij} = \sum_a \pi(a \mid i) P^a_{ij}\), which is the transition probability from \(i\) to \(j\) under policy \(\pi\). The superscript here is a slight abuse of notation. It will be clear from the context that a Greek letter as a superscript indicates that the parameter is a function with the policy as an argument. Similarly, denote

\[ R^\pi_i = \sum_a \pi(a \mid i) R^a_i = \sum_a \pi(a \mid i) \sum_j P^a_{ij} R^a_{ij}, \tag{1} \]

which is the expected reward at state \(i\), under the policy \(\pi\). We use \(P^\pi\) to denote the \(m \times m\) matrix with entries \(P^\pi_{ij}\), and \(R^\pi\) to denote the \(m\)-dimensional vector with components \(R^\pi_i\).
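As a minimal sketch of Equation (1), the following pure-Python snippet collapses per-action parameters into the policy-level matrix \(P^\pi\) and vector \(R^\pi\). The function name `policy_model`, the nested-list layout, and the toy numbers are assumptions made for illustration; they do not come from the paper.

```python
def policy_model(P, R, pi):
    """Collapse per-action parameters into policy-level P^pi and R^pi (Eq. (1)).

    P[a][i][j]: transition probability i -> j under action a
    R[a][i][j]: conditional expected reward for that transition
    pi[i][a]:   probability of applying action a in state i
    """
    n_actions, m = len(P), len(P[0])
    # P^pi_ij = sum_a pi(a|i) * P^a_ij
    P_pi = [[sum(pi[i][a] * P[a][i][j] for a in range(n_actions))
             for j in range(m)] for i in range(m)]
    # R^pi_i = sum_a pi(a|i) * sum_j P^a_ij * R^a_ij
    R_pi = [sum(pi[i][a] * sum(P[a][i][j] * R[a][i][j] for j in range(m))
                for a in range(n_actions)) for i in range(m)]
    return P_pi, R_pi


# Hypothetical two-state, two-action example (all numbers are made up).
P = [[[0.9, 0.1], [0.5, 0.5]],   # action 0
     [[0.2, 0.8], [0.0, 1.0]]]   # action 1
R = [[[1.0, 0.0], [2.0, 2.0]],   # action 0
     [[0.0, 5.0], [0.0, 1.0]]]   # action 1
pi = [[0.5, 0.5], [1.0, 0.0]]    # randomized in state 0, deterministic in state 1

P_pi, R_pi = policy_model(P, R, pi)
```

Each row of `P_pi` is a convex combination of the per-action rows, so it remains a probability distribution whenever every `pi[i]` sums to one.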
Define the value function associated with policy \(\pi\) to be the \(m\)-dimensional vector given by

\[ Y^\pi = \sum_{k=0}^{\infty} \alpha^k (P^\pi)^k R^\pi = (I - \alpha P^\pi)^{-1} R^\pi. \]

In our setting, the true model parameters, \(P^a_{ij}\) and \(R^a_{ij}\), are not known. Instead, we have access to a finite sample of data from which these parameters can be estimated. Specifically, assume that for every \(i\) and \(a\), we have a record of \(N^a_i\) transitions out of state \(i\), under action \(a\), and the associated rewards. We treat the numbers \(N^a_i\) as fixed (not as random variables), and assume that \(N^a_i > 0\) for every \(i\) and \(a\). This last assumption restricts attention to actions that have been tried before. For at least two reasons, we anticipate that this will be a relatively weak assumption in practice. First, the inability to evaluate actions in one state does not restrict our ability to evaluate the same action in other states because we can still evaluate an action at any state where the action has been tried before. Thus, the restriction only applies to states in which there is no past information about the outcome. Second, there is a tremendous amount of variation in historical policies in many real-world applications. This variation may arise for a lot of reasons including experimentation, implementation errors, or nonstationarity in the policy. If there is interest in untried actions and there are priors available to help predict the outcome, then a Bayesian approach can be used. For completeness, we detail such an approach in Appendix D (provided in the e-companion; see footnote 1). Furthermore, we do not assume any relation between the sampling process and the policy of interest; in particular, the \(N^a_i\) for different \(a\) need not be proportional to \(\pi(a \mid i)\), and the number \(N_i = \sum_a N^a_i\) of transitions out of state \(i\) need not be related to the steady-state probability of state \(i\) under policy \(\pi\).

For the \(N^a_i\) transitions out of state \(i\) under action \(a\) in the sample data, let \(N^a_{ij}\) be the number of transitions that lead to state \(j\). Furthermore, let \(C^a_{ij}\) be the sum of the rewards associated with these \(N^a_{ij}\) transitions (for completeness, we define \(C^a_{ij} = 0\) if \(N^a_{ij} = 0\)). We define

\[ \hat{P}^a_{ij} = \frac{N^a_{ij}}{N^a_i}, \qquad \hat{R}^a_{ij} = \frac{C^a_{ij}}{N^a_{ij}}, \]

which will be our estimates of \(P^a_{ij}\) and \(R^a_{ij}\), respectively. When \(N^a_{ij} = 0\), we define \(\hat{R}^a_{ij} = 0\). The possibility of \(N^a_{ij}\) being zero for feasible transitions introduces some additional bias, which will not be accounted for.
However, in our analysis, we will assume that any transition with \(N^a_{ij} = 0\) is infeasible. In addition, we define

\[ \hat{P}^\pi_{ij} = \sum_a \pi(a \mid i) \hat{P}^a_{ij}, \qquad \hat{R}^a_i = \sum_j \hat{P}^a_{ij} \hat{R}^a_{ij} = \frac{\sum_j C^a_{ij}}{N^a_i}, \qquad \hat{R}^\pi_i = \sum_a \pi(a \mid i) \hat{R}^a_i, \tag{2} \]

which will be our estimates of \(P^\pi_{ij}\), \(R^a_i\), and \(R^\pi_i\), respectively. We finally define a matrix \(\hat{P}^\pi\) and a vector \(\hat{R}^\pi\), with entries \(\hat{P}^\pi_{ij}\) and \(\hat{R}^\pi_i\), respectively, which will be our estimates of \(P^\pi\) and \(R^\pi\). Based on these estimates, we obtain an estimated value function \(\hat{Y}^\pi\), given by

\[ \hat{Y}^\pi = (I - \alpha \hat{P}^\pi)^{-1} \hat{R}^\pi. \tag{3} \]

Footnote 1. An electronic companion to this paper is available as part of the online version that can be found at informs.org/.

We assume that the sample data reflect the true process in the following sense. The vector \((N^a_{i1}, \ldots, N^a_{im})\) follows a multinomial distribution with parameters \((N^a_i; P^a_{i1}, \ldots, P^a_{im})\). Let \(E\) denote expectation under the true model. We then have \(E[N^a_{ij}] = N^a_i P^a_{ij}\). A last assumption, which reflects our earlier assumptions that \(N^a_i\) is fixed and that each sample reward is conditionally independent from the past, is that \(E[C^a_{ij} \mid N^a_{ij}] = N^a_{ij} R^a_{ij}\). Under these assumptions, it is easily verified that \(\hat{P}^\pi\) and \(\hat{R}^\pi\) are unbiased estimates of \(P^\pi\) and \(R^\pi\).

Based on Equation (3), we can anticipate the impact of errors in \(\hat{P}^\pi\) and \(\hat{R}^\pi\) on \(\hat{Y}^\pi\). Note first that \(\hat{Y}^\pi\) is linear in \(\hat{R}^\pi\), so that if \(P^\pi\) were observed without error (i.e., if \(\hat{P}^\pi = P^\pi\)), the variance of \(\hat{R}^\pi\) would lead to variance in \(\hat{Y}^\pi\) but not to bias (because \(\hat{R}^\pi\) is unbiased). In contrast, \(\hat{Y}^\pi\) is nonlinear in \(\hat{P}^\pi\), so that errors in \(\hat{P}^\pi\) lead to both bias and variance in \(\hat{Y}^\pi\). Moreover, due to the matrix inversion the nonlinearity is substantial, so that any error in \(\hat{P}^\pi\) can translate to a large error in \(\hat{Y}^\pi\). This is particularly true when \(\alpha\) is close to one. Furthermore, if the errors in \(\hat{P}^\pi\) and \(\hat{R}^\pi\) are correlated, the nonlinearity implies that errors in \(\hat{R}^\pi\) will also lead to bias in \(\hat{Y}^\pi\).

3. An Illustration

To illustrate the bias and variance that can be introduced to value function estimates by errors in the model parameters, we use real data from a mail-order catalog company. While this application serves as a useful case study, our findings are not limited to this application.

Deciding who should receive a catalog is amongst the most important decisions that mail-order companies must address. Yet, identifying an optimal mailing policy is a difficult task. Customer response functions are highly stochastic, reflecting in part the relative paucity of information that firms have about each customer. Moreover, the problem is a dynamic one. Purchasing decisions are influenced not just by the firm's most recent mailing decision, but also by prior mailing decisions.
As a result, the optimal mailing decision depends on past and future mailing decisions.

A typical catalog company might mail 25 catalogs per year. The number of catalogs, the dates that they are mailed, and the content of the catalogs are determined up to a year before the firm decides to whom each catalog will be mailed. For this reason, these decisions are typically treated as fixed when deciding who to mail to. Accordingly, the firm only needs to decide which customers to mail to, on each exogenously determined mailing date (a discrete infinite-horizon problem). The firm's objective is to maximize its expected total discounted profits. Rewards (profits) in each period are calculated as the revenue earned from customer purchases (if any) less the cost of the goods sold and the mailing costs (approximately 65 cents per catalog mailed).

To support their mailing decisions, catalog firms typically maintain large databases describing the individual purchase and mailing histories for each customer. We are fortunate to have access to a large database describing the transaction and mailing histories for the women's apparel division of a moderately large catalog company. This data is described in detail in Simester et al. (2006). It includes the complete transaction histories for approximately 1.72 million customers. The mailing histories are complete for the six-year period from 1996 through 2002 (the company did not maintain a record of the mailing history prior to 1996). Catalogs were mailed on 133 occasions in this six-year period, so that on average a mailing decision occurred every two to three weeks.

The catalog mailing problem can be modelled as an MDP (as in Gönül and Shi 1998), where the state is a summary of the customer's history, and the action at each period is to either mail or not mail. The construction of the state space is an interesting problem that we will not consider here. We will instead follow a standard industry approach to this problem that uses three state variables, the so-called RFM measures (e.g., Bult and Wansbeek 1995, Bitran and Mondschein 1996). These measures describe the recency, frequency, and monetary value of customers' prior purchases. Recency is measured as the number of days (in hundreds) since a customer's last purchase.
Frequency measures the number of items that customers previously purchased. Monetary value measures the average price (in dollars) of the items ordered by each customer. For the purposes of this illustration, we constructed a state space by quantizing each of the RFM variables to four discrete levels, yielding a state space with \(|S| = 4^3 = 64\) states.

At each historical mailing epoch, we evaluate the RFM variables of each customer (regardless of whether the customer received a catalog or made a purchase) and assign him or her to one of the 64 states. We also treat the purchase amount (zero if no purchase in the epoch) less the mailing cost as a reward sample. Therefore, each customer's historical data over time serves as a sample trajectory. Following the procedure described in the previous section, we may then estimate the model parameters \(\hat{P}^\pi\) and \(\hat{R}^\pi\) and calculate \(\hat{Y}^\pi\) for the current policy embedded in the data.
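The estimation step just described can be sketched in a few lines of pure Python. This is an illustrative sketch, not the authors' code: the names `estimate_model` and `evaluate` and the toy trajectories are hypothetical. It also assumes that the evaluated policy is the one embedded in the data with \(\pi(a \mid i)\) set to the empirical action frequencies, in which case the policy-level estimates reduce to pooled counts \(N_{ij}/N_i\) and pooled average rewards per state. The value function is obtained by iterating \(Y \leftarrow \hat{R} + \alpha \hat{P} Y\), which converges to \((I - \alpha \hat{P})^{-1} \hat{R}\) for \(\alpha < 1\).

```python
def estimate_model(trajectories, m):
    """Count-based estimates under the policy embedded in the data.

    Each trajectory is a list of (state, reward, next_state) observations.
    Returns P_hat[i][j] = N_ij / N_i and R_hat[i] = (sum of rewards out of i) / N_i.
    """
    N = [[0] * m for _ in range(m)]   # transition counts N_ij
    C = [0.0] * m                     # summed rewards out of each state
    for traj in trajectories:
        for i, reward, j in traj:
            N[i][j] += 1
            C[i] += reward
    P_hat, R_hat = [], []
    for i in range(m):
        n_i = sum(N[i])
        if n_i == 0:
            raise ValueError(f"state {i} has no observed transitions")
        P_hat.append([N[i][j] / n_i for j in range(m)])
        R_hat.append(C[i] / n_i)
    return P_hat, R_hat


def evaluate(P, R, alpha, tol=1e-12, max_iter=100_000):
    """Solve Y = R + alpha * P Y by fixed-point iteration (requires alpha < 1)."""
    m = len(R)
    Y = [0.0] * m
    for _ in range(max_iter):
        Y_new = [R[i] + alpha * sum(P[i][j] * Y[j] for j in range(m))
                 for i in range(m)]
        if max(abs(a - b) for a, b in zip(Y_new, Y)) < tol:
            return Y_new
        Y = Y_new
    return Y


# Hypothetical data: two customers, two states, rewards in dollars.
trajectories = [
    [(0, 1.0, 0), (0, 1.0, 1), (1, 0.0, 0)],
    [(0, 3.0, 1), (1, 2.0, 1)],
]
P_hat, R_hat = estimate_model(trajectories, m=2)
Y_hat = evaluate(P_hat, R_hat, alpha=0.9)
avf = sum(Y_hat) / len(Y_hat)   # equally weighted average over states
```

For a state space as small as the 64-state example, a direct linear solve would work equally well; the iteration is used here only to keep the sketch free of external dependencies.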

Because the firm is interested in the average profit per customer rather than the profit earned from an individual customer, internal variance averages out. However, parametric variance is of interest because it affects the comparison of different policies. In particular, when evaluating a new policy, the firm would like both a prediction of the expected profits from adopting the new policy, together with confidence bounds around that prediction.

To illustrate the impact of parametric variance, we randomly divided the 1.72 million customers and 164 million observations into 250 equally-sized subsamples, each containing approximately 657,000 observations. By observation, we mean a mailing period and an associated state transition in the history of a customer, irrespective of whether a catalog was mailed or a purchase was made during that time period. We then separately estimated the model parameters \(\hat{P}^\pi\) and \(\hat{R}^\pi\) following §2 using the observations from each of these subsamples. Here, we considered the policy \(\pi\) to be the same as the sampling policy that generated the data. Using Equation (3), we calculated 250 estimates of the value function. As a benchmark, we also estimated the model parameters using the full sample of 1.72 million customers. For the purposes of this illustration, we will interpret the model estimated using the full sample as the true model, which is essentially equivalent to assuming that the 1.72 million customers are the full population.

Thus, within a typical subsample, the expected reward in each state \(\hat{R}^\pi_i\) was estimated using an average of approximately 10,000 observations (\(N_i\)), while the transition matrix \(\hat{P}^\pi\) was estimated using an average of 160 observations per transition. In practice, most of the transitions are infeasible; for example, a customer cannot transition from having three prior purchases to only having two prior purchases. When limiting attention to only those transitions that are feasible, the average number of observations per transition was approximately 1,400. (The average of the positive \(N_{ij}\)'s is around 1,400.)
In Figure 1, we report the empirical distribution (histogram) of the value function \(\hat{Y}^\pi\) across all 250 subsamples under the historical policy used by the firm (as calculated using the whole sample). To summarize an estimated value function with a single number, we average the estimates across states for each subsample, weighing each state equally. We will refer to this measure as the average value function (AVF). We use equal weights to increase the clarity of illustration. By using equal weights (as with any fixed weights), we avoid potentially introducing an additional source of variance due to the weights themselves being random variables. The true AVF, computed from the parameters estimated for the full sample, is $28.54. In comparison, the average of the estimates is $28.65, with an empirical standard deviation of $0.97. The difference between $28.54 and $28.65 is not statistically significant and is of seemingly little managerial importance. However, the variance is potentially very important. The 95% confidence interval around the 250 AVF estimates ranges from $26.59 to $30.49, an interval whose width is roughly 14% of the true mean.

Figure 1. Mail Catalog Problem: A Histogram of the AVF of the Historical Policy for a Partition of the Customers to 250 Subsamples (horizontal axis: AVF per subsample; vertical axis: number of subsamples).
Note. The discount factor per period is \(\alpha\) = [value missing]. The policy used is the historical (mixed) policy used by the firm, and the value function is weighted uniformly across states. The AVF obtained from the full data is $28.54, and is plotted as a vertical line. The empirical standard deviation is $0.97.

Of course, we were able to estimate the $0.97 standard deviation only because we had access to many subsamples. In a real-world setting, where only a single sample is available, the researcher generally relies on simulations or jack-knifing techniques to estimate the standard deviation. In this paper, we will present a procedure for deriving closed-form approximations of the standard deviation directly from the data.

We can demonstrate the robustness of the above described results by varying both the size of the subsamples and the discount factor. In Table 1, we present the empirical bias and standard deviation for different discount factors (averaged over 10 repetitions).
In each repetition, we divide the data set into 100 subsamples and compute the AVF for each subsample. We calculate the average absolute value of the bias and the empirical standard deviation of the AVF estimates across subsamples. It can be seen that the average bias is small for discount factors that are not too close to one. For discount factors that are close to one, the bias becomes more meaningful but still remains much smaller than the standard deviation.

Table 1. Bias and Variance as a Function of the Discount Factor (reported for each discount factor: Bias/AVF (%) and STD/AVF (%)).
Note. For each discount factor, we partition the data 10 times, with each partition resulting in 100 subsamples (each with roughly 1.6 million observations). In the table, we present the mean absolute value of the bias and the mean empirical standard deviation, each averaged across the 10 repetitions. Both of these means are standardized by dividing by the AVF associated with the historical policy (as measured on the whole data set).

In another experiment, we varied the precision of the estimates by changing the size of the subsamples and repeated the analysis using subsamples with different numbers of observations. In Figure 2, we report empirical standard deviations of the AVF estimates under the different-sized subsamples. Each cross in Figure 2 represents a random assignment of the observations to subsamples (the different assignments lead to variation in the subsamples between repetitions). While increasing the size of the subsamples increases the accuracy of the model parameters, and in turn reduces the variance in the AVF estimates, the rate at which the variance approaches zero slows down as the subsamples increase in size. It seems that even when estimating the model parameters with very large amounts of data, parametric variance leads to nonnegligible variance in the value function estimates.

Figure 2. Mail Catalog Problem: The Empirical Standard Deviation of the AVF as a Function of the Sample Size (horizontal axis: number of observations per subsample, in millions; vertical axis: STD of AVF (%)).
Note. Each cross represents a single (random) partition of the observations into subsamples.

4. Analysis

In this section, we provide closed-form approximations for the bias and variance of the estimated value function using second-order approximations. We then briefly discuss the control problem where, in addition to the estimation process, we look for an optimal policy. In §4.1, we will drop the superscript \(\pi\) because we consider a fixed policy.

4.1. Approximations for Bias and Variance in the Estimated Value Function

We now derive closed-form approximations for the (parametric) bias and variance of \(\hat{Y}\).
The analysis follows a classical (non-Bayesian) approach, where the bias and variance are expressed in terms of the (unknown) true parameters. Because the true model parameters are unknown, we substitute the estimated parameters, which is standard practice. However, as a result of this substitution, the values obtained for the bias and variance are themselves estimates. For completeness, we also provide in Appendix D a Bayesian analysis. Under the Bayesian approach, \(P\) and \(R\) are treated as random variables with known prior distributions, and we deduce approximations for the conditional bias and variance, given the values of \(\hat{P}\) and \(\hat{R}\). The expressions obtained using the Bayesian approach are almost identical to the ones in the classical approach (unless an informative prior is available).

Our goal is to calculate \(E[\hat{Y}]\) and the covariance matrix for \(\hat{Y}\), defined by

\[ \operatorname{cov}(\hat{Y}) = E[\hat{Y}\hat{Y}'] - E[\hat{Y}]\,E[\hat{Y}]'. \]

We define a random \(m \times m\) matrix \(\Delta P = \hat{P} - P\) and a random \(m\)-vector \(\Delta R = \hat{R} - R\). Note that \(\Delta P\) and \(\Delta R\) are zero mean random variables that represent the difference between the true model and the estimated model.

To help interpret some of the later analysis, it will be helpful to have a sense of the magnitudes of \(\Delta P\) and \(\Delta R\). Because the transition probabilities are bounded by zero and one, the absolute errors in these probabilities are also bounded between zero and one. The transition probabilities themselves will tend to be smaller the larger the number of states to which transitions are feasible, while the errors in these probabilities will be smaller the more observations there are relative to the number of feasible transitions. In the example discussed in §3 and Figure 1, the maximum error in the transition probabilities in a subsample (\(\max_{i,j} |\Delta P_{ij}|\)) has a mean of [value missing] and a standard deviation of [value missing]. Furthermore, the average (averaged over all pairs with nonzero transition probability) absolute error in the transition probability estimates, \(|\Delta P_{ij}|\), has a mean of [value missing] and an empirical standard deviation of [value missing]. Note that in that example, the feasible transitions consist of less than 10% of the \(64^2\) entries in \(P\). The expected rewards are not bounded a priori and so the errors are also unbounded. In the catalog example, the average absolute error in the reward estimates, \(|\Delta R_i|\), has a mean of $4.25 and a standard deviation of $1.82. The maximal error in the reward estimates, \(\max_i |\Delta R_i|\), has a mean of $56.3 and a standard deviation of $43.2.

We now write the expectation of \(\hat{Y}\) (cf. Equation (3)) as

\[ E[\hat{Y}] = E\big[(I - \alpha(P + \Delta P))^{-1}(R + \Delta R)\big] = E\Big[\sum_{k=0}^{\infty} \alpha^k (P + \Delta P)^k (R + \Delta R)\Big], \tag{4} \]

where the geometric series expansion of \((I - \alpha(P + \Delta P))^{-1}\) was used to obtain the second equality. We use the notation \(X = (I - \alpha P)^{-1}\) and \(f_k(\Delta P) = X(\Delta P\, X)^k = (X \Delta P)^k X\). The following lemma will be useful.

Lemma 4.1. \(\sum_{l=0}^{\infty} \alpha^l (P + \Delta P)^l = \sum_{k=0}^{\infty} \alpha^k f_k(\Delta P)\).

Proof.

\[ \sum_{k=0}^{\infty} \alpha^k f_k(\Delta P) = \sum_{k=0}^{\infty} \alpha^k (X \Delta P)^k X = (I - \alpha X \Delta P)^{-1} X = (X^{-1} - \alpha X^{-1} X \Delta P)^{-1} = (I - \alpha P - \alpha \Delta P)^{-1} = \sum_{l=0}^{\infty} \alpha^l (P + \Delta P)^l, \]

where we repeatedly used the definition of \(X\) and the fact that \(X\) is invertible.

Using Lemma 4.1 in Equation (4), we obtain

\[ E[\hat{Y}] = (I - \alpha P)^{-1} R + \sum_{k=1}^{\infty} \alpha^k E[f_k(\Delta P)]\, R + \sum_{k=0}^{\infty} \alpha^k E[f_k(\Delta P)\, \Delta R]. \tag{5} \]

There are three terms on the right-hand side of Equation (5). The first term is the value function for the true model. The second term reflects the bias introduced by the uncertainty in \(P\) alone, and the third term represents the bias introduced by the correlation between the errors in \(P\) and \(R\).

Equation (5) provides a series expansion of the error in terms of high-order moments and cross moments of the errors in \(P\) and \(R\). The calculation of the bias is tedious because the term \(E[f_k(\Delta P)]\) involves \(k\)th-order moments of multinomial distributions. But because \(\Delta P\) is typically small, \((\Delta P)^k\) is generally close to zero for large \(k\). For this reason, we limit our attention to a second-order approximation and we will assume that \(E[f_k(\Delta P)] \approx 0\) for \(k > 2\), and that \(E[f_k(\Delta P)\, \Delta R] \approx 0\) for \(k > 1\). We use the catalog data to investigate the appropriateness of this assumption in §5.
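Lemma 4.1 and the second-order truncation can be checked numerically on a small example. The sketch below is an illustration under assumed inputs: the 2-by-2 matrices `P` and `dP`, the discount factor, and all helper names are hypothetical. It compares the exact inverse \((I - \alpha(P + \Delta P))^{-1}\) with partial sums of \(\sum_k \alpha^k f_k(\Delta P)\), where \(f_k(\Delta P) = X(\Delta P\, X)^k\).

```python
def matmul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def add(A, B, scale=1.0):
    """Return A + scale * B, elementwise."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def inv2(A):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


I2 = [[1.0, 0.0], [0.0, 1.0]]
alpha = 0.9
P = [[0.7, 0.3], [0.4, 0.6]]            # hypothetical true transition matrix
dP = [[0.02, -0.02], [-0.01, 0.01]]     # hypothetical error; rows sum to zero

X = inv2(add(I2, P, -alpha))                  # X = (I - alpha P)^{-1}
exact = inv2(add(I2, add(P, dP), -alpha))     # (I - alpha (P + dP))^{-1}


def series(K):
    """Partial sum of alpha^k f_k(dP) for k = 0..K, with f_k(dP) = X (dP X)^k."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    term = X                                  # k = 0 term
    for _ in range(K + 1):
        total = add(total, term)
        # next term: alpha^(k+1) X (dP X)^(k+1) = alpha * term * dP * X
        term = [[alpha * v for v in row] for row in matmul(matmul(term, dP), X)]
    return total


err_second_order = max_abs_diff(series(2), exact)   # truncation used in the text
err_long_series = max_abs_diff(series(60), exact)   # numerically the full sum
```

With these inputs the long partial sum matches the exact inverse to floating-point precision, while the second-order truncation leaves a small residual. The size of that residual depends on \(\alpha\) and on the magnitude of \(\Delta P\), so it should be rechecked for any other setting.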
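Therefore, we can write Equation (5) as

\[ E[\hat{Y}] = (I - \alpha P)^{-1} R + \alpha E[f_1(\Delta P)]\, R + \alpha^2 E[f_2(\Delta P)]\, R + X E[\Delta R] + \alpha E[f_1(\Delta P)\, \Delta R] + L_{\exp}, \tag{6} \]

where we represent all the terms of order greater than two in

\[ L_{\exp} = \sum_{k=3}^{\infty} \alpha^k E[f_k(\Delta P)]\, R + \sum_{k=2}^{\infty} \alpha^k E[f_k(\Delta P)\, \Delta R]. \tag{7} \]

Given that we will be using second-order approximations, we expect that the mean and variance of \(\hat{Y}\) can be calculated as long as we are able to compute the covariance between various entries of \(\Delta R\) and \(\Delta P\). We start with \(\Delta P\). First, we introduce some notation. We use the notation \(A_{i\cdot}\) and \(A_{\cdot i}\) to denote the \(i\)th row and column, respectively, of a matrix \(A\), and \(\operatorname{diag}(A_{i\cdot})\) to denote a diagonal matrix with the entries of \(A_{i\cdot}\) along the diagonal. We note that \(\Delta P_{i\cdot}\) and \(\Delta P_{j\cdot}\) are independent when \(i \neq j\). To find the covariance matrix of \(\Delta P_{i\cdot}\), we consider the row vectors \(\hat{P}^a_{i\cdot}\) and \(P^a_{i\cdot}\) with the estimated and true transition probabilities, and define \(\Delta P^a_{i\cdot}\) to be their difference. Note that

\[ \Delta P_{i\cdot} = \sum_a \pi(a \mid i)\, \Delta P^a_{i\cdot}. \]

For each state-action pair, we define

\[ M^a_i = \operatorname{diag}(P^a_{i\cdot}) - (P^a_{i\cdot})' P^a_{i\cdot}, \]

which is a symmetric positive semidefinite matrix. Recall that for each \(a\), we have \(\hat{P}^a_{ij} = N^a_{ij}/N^a_i\), where the \(N^a_{ij}\) are drawn from a multinomial distribution. The covariance matrix of \(\Delta P^a_{i\cdot}\) is \(M^a_i / N^a_i\), and the covariance matrix of \(\Delta P_{i\cdot}\) is

\[ \operatorname{cov}^{(i)} = E\big[(\Delta P_{i\cdot})'\, \Delta P_{i\cdot}\big] = \sum_a \frac{\pi(a \mid i)^2}{N^a_i} M^a_i. \]

Now we consider \(\Delta R\). Because \(C^a_{ij}\) is independent of \(C^b_{kl}\) whenever \(i \neq k\), we have \(E[\Delta R_i\, \Delta R_k] = 0\) for \(i \neq k\). Furthermore,

\[ E[(\Delta R_i)^2] = \sum_a \pi(a \mid i)^2\, E[(\Delta R^a_i)^2]. \]

In the following, we use \(N^a_{i\cdot}\) to represent the vector with components \(N^a_{ij}\), \(j = 1, \ldots, m\), and \(R^a_{i\cdot}\) to represent the vector with components \(R^a_{ij}\), \(j = 1, \ldots, m\). Note that \(C^a_{ij}\) and \(C^a_{ik}\) are independent given \(N^a_{ij}\) and \(N^a_{ik}\), so that

\[
\begin{aligned}
E[(\Delta R^a_i)^2] &= \operatorname{var}\Big(\frac{\sum_j C^a_{ij}}{N^a_i}\Big) = \frac{1}{(N^a_i)^2} \operatorname{var}\Big(\sum_j C^a_{ij}\Big) \\
&= \frac{1}{(N^a_i)^2} \Big\{ \operatorname{var}\Big(E\Big[\sum_j C^a_{ij} \,\Big|\, N^a_{i\cdot}\Big]\Big) + E\Big[\operatorname{var}\Big(\sum_j C^a_{ij} \,\Big|\, N^a_{i\cdot}\Big)\Big] \Big\} \\
&= \frac{1}{(N^a_i)^2} \Big\{ \operatorname{var}\big(R^a_{i\cdot} (N^a_{i\cdot})'\big) + E\big[V^a_{i\cdot} (N^a_{i\cdot})'\big] \Big\} \\
&= \frac{1}{(N^a_i)^2} \Big\{ (N^a_i)^2 \operatorname{var}\big(R^a_{i\cdot} (\hat{P}^a_{i\cdot})'\big) + N^a_i\, V^a_{i\cdot} (P^a_{i\cdot})' \Big\} \\
&= \frac{1}{N^a_i} \Big( R^a_{i\cdot} M^a_i (R^a_{i\cdot})' + V^a_{i\cdot} (P^a_{i\cdot})' \Big).
\end{aligned} \tag{8}
\]

Here, \(V^a_{ij}\) is the variance of the rewards associated with a transition from \(i\) to \(j\), under action \(a\).

To account for the correlation between \(\Delta P\) and \(\Delta R\), we use Equation (2) to obtain

\[ \hat{R}_i = \sum_a \pi(a \mid i) \sum_j \hat{R}^a_{ij} \hat{P}^a_{ij} = \sum_a \pi(a \mid i) \sum_j \big( R^a_{ij} P^a_{ij} + \Delta R^a_{ij} P^a_{ij} + R^a_{ij} \Delta P^a_{ij} + \Delta R^a_{ij} \Delta P^a_{ij} \big), \tag{9} \]

where \(\Delta R^a_{ij} = \hat{R}^a_{ij} - R^a_{ij}\). Comparing with Equation (1), we have

\[ \Delta R_i = \hat{R}_i - R_i = \sum_a \pi(a \mid i) \sum_j \big( \Delta R^a_{ij} P^a_{ij} + R^a_{ij} \Delta P^a_{ij} + \Delta R^a_{ij} \Delta P^a_{ij} \big). \tag{10} \]

We use \(\circ\) to denote Hadamard multiplication: for any two matrices \(A\) and \(B\) with the same dimensions, \(A \circ B\) is a matrix (again with the same dimensions) with entries \((A \circ B)_{ij} = A_{ij} B_{ij}\). We also use \(e\) to denote the \(m\)-dimensional vector with all components equal to one, and we use \(\pi^a\) to denote the \(m\)-dimensional vector with the \(i\)th component being \(\pi(a \mid i)\). With this notation, Equation (10) becomes

\[ \Delta R = \sum_a \pi^a \circ \Big[ \big( P^a \circ \Delta R^a + R^a \circ \Delta P^a + \Delta R^a \circ \Delta P^a \big)\, e \Big]. \tag{11} \]

We define an \(m \times m\) matrix \(Q\) with entries

\[ Q_{ij} = \operatorname{cov}^{(i)}_{j\cdot}\, X_{\cdot i}. \tag{12} \]

(Recall the definition \(X = (I - \alpha P)^{-1}\), and that \(Y = XR\) is the true value function.) We define an \(m\)-dimensional vector \(B\) with its \(i\)th component defined as

\[ B_i = \sum_a \frac{\pi(a \mid i)^2}{N^a_i}\, R^a_{i\cdot} M^a_i X_{\cdot i}. \]

The following proposition quantifies the bias under the second-order approximation assumption. The proof is given in Appendix A.

Proposition 4.1. The expectation of the estimated value function \(\hat{Y}\) satisfies

\[ E[\hat{Y}] = Y + \alpha^2 X Q Y + \alpha X B + L_{\exp}, \]

where \(L_{\exp}\) is defined in Equation (7) and

\[ L_{\exp} = o\Big(\frac{1}{\underline{N}}\Big), \]

where \(\underline{N} = \min_{i,a:\, \pi(a \mid i) > 0} N^a_i\) and the term \(o(1/\underline{N})\) satisfies \(\lim_{\underline{N} \to \infty} o(1/\underline{N})\, \underline{N} = 0\).

In the above proposition, \(\underline{N}\) is the number of observations for the least sampled state-action pair that is used by the policy. The term \(L_{\exp}\) decreases to zero faster than \(1/\underline{N}\), whereas \(Q\) and \(B\) can be shown to decrease like \(1/\underline{N}\). Therefore, our approximation for the bias in the value function estimates will be

\[ \alpha^2 X Q Y + \alpha X B. \]

For the purposes of the next proposition, we introduce more notation. We define the diagonal matrix \(W\), whose diagonal entries are given by

\[ W_{ii} = \sum_a \frac{\pi(a \mid i)^2}{N^a_i} \Big[ (\alpha Y' + R^a_{i\cdot})\, M^a_i\, (\alpha Y' + R^a_{i\cdot})' + V^a_{i\cdot} (P^a_{i\cdot})' \Big]. \tag{13} \]

The next proposition provides an expression for the second moment, \(E[\hat{Y}\hat{Y}']\). Together with the expression for \(E[\hat{Y}]\) in the preceding proposition, it leads to an approximation for the covariance matrix of \(\hat{Y}\). The proof is given in Appendix B.

Proposition 4.2.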
The second moment of \(\hat{Y}\) satisfies

\[ E[\hat{Y}\hat{Y}'] = YY' + X \big\{ \alpha^2 (Q Y R' + R Y' Q') + \alpha (B R' + R B') + W \big\} X' + L_{\mathrm{var}}, \]

where \(L_{\mathrm{var}}\) is given by

\[ L_{\mathrm{var}} = \sum_{k,l:\, k+l>2} \alpha^{k+l} E\big[ f_k(\Delta P)\, (RR' + \Delta R\, \Delta R')\, f_l(\Delta P)' \big] + \alpha E\big[ X\, \Delta R\, \Delta R'\, f_1(\Delta P)' \big] + \alpha E\big[ f_1(\Delta P)\, \Delta R\, \Delta R'\, X' \big] = o\Big(\frac{1}{\underline{N}}\Big). \]

By taking the difference between \(E[\hat{Y}\hat{Y}']\), as given by Proposition 4.2, and \(E[\hat{Y}]\,E[\hat{Y}]'\), as prescribed by Proposition 4.1, the following corollary is easily derived.

Corollary 4.1. The covariance matrix of the estimated value function satisfies

\[ \operatorname{cov}(\hat{Y}) = X W X' + o\Big(\frac{1}{\underline{N}}\Big). \]

The expressions in Propositions 4.1 and 4.2 and Corollary 4.1 yield several insights. First, as the counts \(N^a_i\) increase to infinity, \(\operatorname{cov}^{(i)}\) approaches zero, and thus all the terms involving the matrices \(Q\), \(B\), and \(W\) converge to zero. As expected, this implies that as the sample size increases and the accuracy of the estimated parameters improves, both the bias and the variance decrease to zero. Second, the expressions for the bias and variance rely on the true model parameters, which are unknown. As discussed in the introduction, to obtain computable approximations of the bias and variance, we will use instead \(\hat{P}\), \(\hat{R}\), and the empirical variance of each \(\hat{R}^a_{ik}\). In principle, we could also estimate the bias and variance due to this approximation, but this is tedious and, as suggested by the experimental results in the next section, generally unnecessary. Third, when \(\underline{N} = \min N^a_i\) is large, it follows that the nonzero entries of \(B\), \(W\), and \(Q\) decrease to zero like \(1/\underline{N}\). Therefore, the standard deviation decreases to zero like \(1/\sqrt{\underline{N}}\), which is the usual behavior of empirical estimates.

The expressions in Proposition 4.1 and Corollary 4.1 allow us to qualitatively compare the magnitude of the bias and variance. According to Corollary 4.1, the standard deviation of \(\hat{Y}_i\) can be approximately estimated as

\[ \sigma_{\hat{Y}}(i) = \sqrt{X_{i\cdot}\, W\, (X_{i\cdot})'}. \tag{14} \]

The next proposition, proved in Appendix C, quantifies the ratio between the standard deviation and the bias. Recall that for two positive functions \(f\) and \(g\) (defined on the real numbers), we write \(f(n) = \Omega(g(n))\) if there exist constants \(N_0\) and \(C > 0\) such that \(f(n) \geq C g(n)\) for \(n \geq N_0\).

Proposition 4.3. Suppose that \(\sigma_{\hat{Y}}(i) > 0\) and \(\underline{N}/N^a_i > c > 0\) for all \(i\) and \(a\). Then,

\[ \frac{\sigma_{\hat{Y}}(i)}{\big| E[\hat{Y}_i] - Y_i \big|} = \Omega\big(\sqrt{\underline{N}}\big) \quad \text{for all } i. \]

Proposition 4.3 implies that the errors introduced by the parametric variance will generally be much larger than the bias. Note that because \(W\) is a positive semidefinite matrix, \(\sigma_{\hat{Y}}(i) > 0\) is a very weak nondegeneracy assumption. The condition \(\underline{N}/N^a_i > c > 0\) requires that sample sizes increase uniformly.

While the expression in Corollary 4.1 allows us to approximate the covariance matrix of the estimated value function, the findings on their own do not allow us to calculate confidence intervals around these estimates. Calculating a confidence interval requires that we know the distribution of the value function estimates. A central limit theorem (Serfling 1980, p. 122, Theorem A) speaks to this issue.

Theorem 4.1 (Serfling 1980). Suppose that a sequence of random vectors \(X_n = (X_{n1}, \ldots, X_{nk})\) is \(AN(\mu, b_n^2 \Sigma)\) (asymptotically normal, that is, \((X_n - \mu)/b_n \xrightarrow{d} N(0, \Sigma)\)); and the sequence of scalars \(b_n\) converges to 0. Let \(g(x) = (g_1(x), \ldots, g_m(x))\), \(x = (x_1, \ldots, x_k)\), be a vector-valued function for which each component function \(g_i(x)\) is a real-valued function and has a nonzero gradient at \(x = \mu\).
Let

$$D = \left[\frac{\partial g_i}{\partial x_j}\bigg|_{x=\mu}\right]_{m \times k}.$$

Then, $g(X_n)$ is $AN\big(g(\mu),\, b_n^2 D\Sigma D^{\top}\big)$.

Because $\hat{P}$ and $\hat{R}$ are all estimators that asymptotically follow normal distributions, we may consider $\hat{Y}^{\pi}$ as the function $g$ in the above theorem and conclude that $\hat{Y}^{\pi}$ is asymptotically normal. We further investigate this issue using catalog mailing data in §5, where we report that a Kolmogorov-Smirnov test cannot reject the hypothesis that $\hat{Y}^{\pi}$ is normally distributed.

Readers may wonder whether we could have used Serfling's (1980) result to derive our earlier findings. It is technically possible to do so. Indeed, under the assumption that all of the $N_i^a$'s are identical, we were able to show that the two approaches yield the same result, and observed that the two derivations were of comparable length and complexity. However, if sampling occurs at different rates in different states, the rate at which the $N_i^a$'s approach infinity will generally vary. In this case, use of the Serfling theorem, or any related central limit theorem, requires extensive additional derivation. Moreover, these theorems do not address the issue of bias.

The same approach can be used for infinite-horizon average reward MDPs. Under mild assumptions on the structure of the Markov chain, we get similar approximations to the bias and variance for the average reward. The idea can also be extended to semi-Markov processes, where the transition times between time epochs are random and estimated from sampled data.

4.2. The Control Problem

To this point, we have focused on the value function under a fixed policy. In many applications, we are interested in comparing an existing policy with an alternative policy, possibly derived through a policy optimization process. We know from the MDP theory that there exists an optimal policy $\pi^*$ such that $Y_i^{\pi^*} \ge Y_i^{\pi}$ for all admissible policies $\pi$ and all states $i \in S$. An optimal policy may be obtained using value iteration, policy iteration, or linear programming algorithms. See, for example, Bertsekas (2000). Because we do not have access to the true model parameters $P$ and $R$, optimization based on the estimated parameters $\hat{P}$ and $\hat{R}$ produces an optimal policy $\hat{\pi}^*$ such that $\hat{Y}^{\hat{\pi}^*} \ge \hat{Y}^{\pi}$ for all admissible policies $\pi$. In general, policy $\hat{\pi}^*$ is different from $\pi^*$.
Moreover, because the policy $\hat{\pi}^*$ is obtained through an optimization process, the estimates of the model parameters for that policy ($\hat{P}^{\hat{\pi}^*}$ and $\hat{R}^{\hat{\pi}^*}$) will no longer be unbiased estimates of the true model parameters ($P^{\hat{\pi}^*}$ and $R^{\hat{\pi}^*}$). Therefore, we cannot use the approximation derived in Proposition 4.1 (for a fixed policy) to evaluate the bias in the optimal value function, nor can we use the approximations in Proposition 4.2 and Corollary 4.1 to estimate the covariance matrix.

We can illustrate the problem through a simple example. Consider a single-state MDP with two actions, that is, $S = \{1\}$ and $A = \{0, 1\}$. Both actions yield identical zero-mean random rewards. Clearly, in such a problem $\pi^*$ could be either action 0 or 1, with value functions $Y^{\pi} = Y^{\pi^*} = 0$. Now assume that we have $n$ samples to estimate the expected reward $R^a$ for either action. Indeed, both $\hat{R}^a$ follow (approximately) a normal distribution $N(0, 1/n)$. The policy optimization procedure chooses the action with the largest $\hat{R}^a$. If we use $\hat{R}^{\hat{\pi}^*}$ to denote the maximum of $\hat{R}^0$ and $\hat{R}^1$, we know from Jensen's inequality that $\mathbb{E}[\hat{R}^{\hat{\pi}^*}] > 0$, and so the value function estimated for the chosen policy will on average be positively biased:

$$\mathbb{E}\big[\hat{Y}^{\hat{\pi}^*}\big] = \frac{1}{1-\alpha}\,\mathbb{E}\big[\hat{R}^{\hat{\pi}^*}\big] = \frac{1}{1-\alpha}\,\mathbb{E}\big[\max\{\hat{R}^0, \hat{R}^1\}\big] > \frac{1}{1-\alpha}\,\max\big\{\mathbb{E}[\hat{R}^0], \mathbb{E}[\hat{R}^1]\big\} = 0.$$

The magnitude of $\mathbb{E}[\hat{Y}^{\hat{\pi}^*}]$, and therefore the bias in this example, is studied in the order statistics literature (Leadbetter et al. 1983). We also refer readers to Clark (1961), which presents a procedure to approximate moments of the maximum of a finite number of correlated Gaussian random variables.

This problem raises two issues. First, how can we de-bias the estimates of $P^{\hat{\pi}^*}$ and $R^{\hat{\pi}^*}$ so that we can use our earlier results to estimate the bias and covariance matrix of a value function when the policy is derived from an optimization procedure? Second, because the optimization procedures themselves rely on estimates $\hat{P}$ and $\hat{R}$, the policies derived from standard dynamic programming algorithms will generally not be truly optimal ($\hat{\pi}^* \ne \pi^*$). In the remainder of this section, we propose a cross-validation approach that can help to address the first issue. Unfortunately, we do not have a solution to the second issue. Indeed, it seems unlikely that a general procedure can be found that resolves the second issue, as the suboptimality reflects the absence of complete information in the training data.

The bias in the estimates of $P^{\hat{\pi}^*}$ and $R^{\hat{\pi}^*}$ arises because optimization methods tend to favor actions for which the estimation errors in $\hat{P}$ and $\hat{R}$ lead to inflated estimates of the value function.
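The single-state example can be checked numerically. The sketch below is only an illustration under stated assumptions: rewards are taken to have unit variance (so each estimated reward is drawn from $N(0, 1/n)$, as in the example), the function name `selection_bias` and the values of `n` and `trials` are hypothetical, and the closed form $s/\sqrt{\pi}$ for the expected maximum of two independent $N(0, s^2)$ variables is a standard result rather than something taken from the paper. It reports the bias in the estimated reward; in the discounted setting the value function bias would be this quantity scaled by $1/(1-\alpha)$.

```python
import numpy as np

def selection_bias(n, trials=200_000, seed=0):
    """Monte Carlo estimate of E[max(R0_hat, R1_hat)] for two actions
    whose true expected rewards are both zero."""
    rng = np.random.default_rng(seed)
    # Each estimated reward is a sample mean of n unit-variance rewards,
    # so it is drawn directly from N(0, 1/n) (an assumption of this sketch).
    r_hat = rng.normal(0.0, 1.0 / np.sqrt(n), size=(trials, 2))
    # The optimizer picks the action with the larger estimate.
    return r_hat.max(axis=1).mean()

n = 100
simulated = selection_bias(n)
# Expected maximum of two i.i.d. N(0, s^2) variables is s / sqrt(pi).
exact = 1.0 / np.sqrt(np.pi * n)
print(f"simulated bias: {simulated:.4f}, closed form: {exact:.4f}")
```

Both numbers are strictly positive even though every action has a true expected reward of zero, which is the positive bias described in the example.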
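As long as the errors in $\hat{P}$ and $\hat{R}$ are independent across samples, we can derive unbiased estimates of $P^{\hat{\pi}^*}$ and $R^{\hat{\pi}^*}$ if we use a different sample of data to evaluate the policy than the sample we used to design the policy. In particular, consider the following approach: Start by dividing the training data into two subsamples: a calibration sample and a validation sample. Use the calibration sample to estimate the model parameters $\hat{P}_{\mathrm{cal}}$ and $\hat{R}_{\mathrm{cal}}$ and obtain the optimal policy

$$\hat{\pi}^*_{\mathrm{cal}} = \arg\max_{\pi}\,\big(I - \alpha\hat{P}^{\pi}_{\mathrm{cal}}\big)^{-1}\hat{R}^{\pi}_{\mathrm{cal}}.$$

Then, estimate model parameters $\hat{P}_{\mathrm{val}}$ and $\hat{R}_{\mathrm{val}}$ from the validation sample and (following Equation (3)) evaluate the policy using these new parameters:

$$\hat{Y}^{\hat{\pi}^*_{\mathrm{cal}}}_{\mathrm{val}} = \big(I - \alpha\hat{P}^{\hat{\pi}^*_{\mathrm{cal}}}_{\mathrm{val}}\big)^{-1}\hat{R}^{\hat{\pi}^*_{\mathrm{cal}}}_{\mathrm{val}}.$$

Through this procedure, we can de-bias the value function estimates by reporting $\hat{Y}^{\hat{\pi}^*_{\mathrm{cal}}}_{\mathrm{val}}$ instead of $\hat{Y}^{\hat{\pi}^*_{\mathrm{cal}}}_{\mathrm{cal}}$, where $\hat{Y}^{\hat{\pi}^*_{\mathrm{cal}}}_{\mathrm{cal}} = \big(I - \alpha\hat{P}^{\hat{\pi}^*_{\mathrm{cal}}}_{\mathrm{cal}}\big)^{-1}\hat{R}^{\hat{\pi}^*_{\mathrm{cal}}}_{\mathrm{cal}}$. Accordingly, we may also approximate the bias and variance, and therefore the confidence bounds, of $\hat{Y}^{\hat{\pi}^*_{\mathrm{cal}}}_{\mathrm{val}}$ following Proposition 4.1 and Corollary 4.1.

The assumption that the estimation errors in $\hat{P}$ and $\hat{R}$ are independent across the calibration and validation subsamples is obviously critical. In this paper, we have assumed that estimates $\hat{P}$ and $\hat{R}$ are derived from straightforward nonparametric aggregates of the available data. Under this approach, the estimation errors are independent across the subsamples as long as any measurement errors are independent across observations. However, in some settings, it is common to estimate the model parameters from maximum likelihood estimates that require functional form and distribution assumptions (this is particularly common in the economics literature). Under this alternative approach, any errors introduced by the functional form and distribution assumptions will be correlated across the subsamples. As a result, the cross-validation procedure that we have proposed will not de-bias the estimates of $P^{\hat{\pi}^*}$ and $R^{\hat{\pi}^*}$, even if the measurement errors are independent across the observations.

5. Experiments

The reliance on a second-order expansion in deriving the approximations for the bias and variance presumes that higher-order terms are relatively unimportant. We now examine this assumption in further detail by using the catalog mailing data to validate the findings.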
These data also enable us to investigate the impact (if any) of using estimates of the model parameters in these expressions (in the absence of the true model parameters). If the value function estimates follow a normal distribution, the variance and bias expressions derived in the previous section facilitate calculation of confidence intervals around the de-biased value function estimates. We can investigate the accuracy of these confidence intervals by comparing how frequently the true value function falls within the confidence intervals. We would expect that on average the true value will fall within one standard deviation of the unbiased mean 68% of the time and within two standard deviations 95% of the time.

We begin by investigating whether the value function estimates follow a normal distribution. We do so by using a Kolmogorov-Smirnov test on each of the data points reported in §3. The hypothesis that the reward is two-sided Gaussian could not be rejected at the 0.05 significance level at any instance. The average p-value was … with a minimum of … and a maximum of …. This indicates that it cannot be determined that the data do not follow a Gaussian rule.

We use the same partitions of the data as in §2. In Figure 3, the percentage of times that the true value function was within one standard deviation is denoted by + and within two standard deviations by a second marker. For example, for the 250 subsamples (with about 657,000 observations each), we report the percentage of the 250 estimates in which the true AVF (as estimated on the full sample) was within the estimated confidence interval. By redrawing the 250 subsamples 10 times, we report 10 instances of this percentage. An analogous process was used with other choices of the subsample size. The findings in Figure 3 confirm that the percentages of estimates that fall within one and two standard deviations of the true AVF are close to the targets of 68% and 95%, respectively.

We next consider the importance of the second-order approximations. We do so by taking advantage of the role played by the discount factor. The importance of higher-order terms in the series expansions increases as the discount factor approaches one. In Table 2, we repeat the analysis for 250 subsamples of fixed size, but for different discount factors (same settings as in Table 1). As expected, as $\alpha$ approaches one, the accuracy of the confidence intervals degrades. We attribute this to the error introduced by the second-order approximation.

Figure 3. The Percentage of the AVF Estimates that Fall Within One (+) and Two Standard Deviations from the Value Calculated Based on the Full Data Set. Vertical axis: percentage below 1 or 2 STDs. Horizontal axis: observations per subsample (millions).
Note. Each marker represents a random partition of the full data to subsamples. The discount factor is $\alpha$ = ….

Table 2. Percentage of the AVF Estimates that Fall Within One and Two Standard Deviations. Columns: $\alpha$; samples within one STD (%); samples within two STDs (%).
Note. We randomly partitioned the data while varying the discount factor. For each discount factor, we performed the partition 10 times; each partition was to 250 subsamples (each with roughly 657,000 observations). We present the percentage of samples in which the estimated AVF is within one standard deviation (as predicted by Proposition 4.2) of the value as measured on all the data; the minimum and maximum percentages over the 10 runs are provided in parentheses. The same statistics are presented for two standard deviations.

The Control Problem

As discussed in §4.2, an obvious application of our analysis is the comparison of a current policy with a new policy generated through some optimization process. We cautioned that before applying the expressions for the bias and the variance to a policy derived from such a process, we should first obtain unbiased estimates of the model parameters, using an independent validation sample. We will use the catalog mailing data to illustrate the importance of this first step.

We begin by randomly selecting a portion of the available data to be used as a calibration sample, and retain the remaining data as a validation sample. To demonstrate how the size of the calibration sample affects the findings, we repeat this process for calibration samples of different sizes. The calibration sample is used to estimate model parameters $\hat{P}_{\mathrm{cal}}$ and $\hat{R}_{\mathrm{cal}}$. Then, we run a policy iteration algorithm to identify an optimal policy $\hat{\pi}^*_{\mathrm{cal}}$ from $\hat{P}_{\mathrm{cal}}$ and $\hat{R}_{\mathrm{cal}}$. We will compare two AVF estimates for this policy: the AVF calculated on the basis of the model estimated using the calibration sample (denoted by $Y_{\mathrm{cal}}$), and the AVF of that policy as estimated using the validation sample (denoted by $Y_{\mathrm{val}}$). The difference between the two estimates represents the bias introduced by the error in the model parameters (the errors no longer have zero expectation due to the optimization process).
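The calibration and validation comparison described above can be sketched on a small synthetic MDP. Everything in this block is an assumption made for illustration and not the catalog mailing setup: the model sizes, the Dirichlet-generated transition matrices, the unit-variance Gaussian reward noise, the choice of true rewards equal to zero for every action (so that any positive gap is pure optimization bias), and the helper names `estimate_model`, `evaluate`, and `policy_iteration`.

```python
import numpy as np

def estimate_model(P, R, n, rng):
    """Empirical transition probabilities and mean rewards from n draws
    per (action, state) pair; rewards carry unit-variance Gaussian noise."""
    A, S, _ = P.shape
    P_hat = np.empty_like(P)
    R_hat = np.empty_like(R)
    for a in range(A):
        for s in range(S):
            P_hat[a, s] = rng.multinomial(n, P[a, s]) / n
            R_hat[a, s] = R[a, s] + rng.normal(0.0, 1.0, n).mean()
    return P_hat, R_hat

def evaluate(P, R, policy, alpha):
    """Value function of a fixed policy: Y = (I - alpha * P_pi)^{-1} R_pi."""
    S = len(policy)
    idx = np.arange(S)
    P_pi = P[policy, idx]   # row s is taken from action policy[s]
    R_pi = R[policy, idx]
    return np.linalg.solve(np.eye(S) - alpha * P_pi, R_pi)

def policy_iteration(P, R, alpha, max_iter=1000):
    """Policy iteration on the model (P, R); stops at a greedy-stable policy."""
    policy = np.zeros(P.shape[1], dtype=int)
    for _ in range(max_iter):
        Y = evaluate(P, R, policy, alpha)
        improved = (R + alpha * (P @ Y)).argmax(axis=0)  # greedy improvement
        if np.array_equal(improved, policy):
            break
        policy = improved
    return policy

rng = np.random.default_rng(0)
A, S, alpha, n = 3, 5, 0.9, 50
P = rng.dirichlet(np.ones(S), size=(A, S))  # hypothetical true transitions
R = np.zeros((A, S))                         # every action is equally good

gaps = []
for _ in range(200):
    P_cal, R_cal = estimate_model(P, R, n, rng)   # calibration sample
    P_val, R_val = estimate_model(P, R, n, rng)   # independent validation sample
    pi_cal = policy_iteration(P_cal, R_cal, alpha)
    y_cal = evaluate(P_cal, R_cal, pi_cal, alpha).mean()  # AVF on calibration model
    y_val = evaluate(P_val, R_val, pi_cal, alpha).mean()  # AVF on validation model
    gaps.append(y_cal - y_val)

mean_gap = float(np.mean(gaps))
print(f"mean of Y_cal - Y_val over 200 splits: {mean_gap:.3f}")
```

Because the policy is chosen on the calibration estimates but scored on independent validation estimates, the average gap should come out positive in this sketch, mirroring the bias described in the text.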

Figure 4. The Differences (Marked by +) Between the AVF Estimates (in Dollars, and Averaged Over All States) Based on the Calibration Sample and the Validation Sample for the Policy Identified Through an Optimization Process. Vertical axis: bias, $Y_{\mathrm{cal}} - Y_{\mathrm{val}}$. Horizontal axis: size of calibration sample (% of data set).
Note. Each + was generated by randomly partitioning the data to a calibration and a validation sample. The horizontal axis corresponds to the size of the calibration sample, as a percentage of the full data sample. Here, $\alpha = 0.98$, for which the true optimal AVF is approximately $….

Figure 5. The Differences (Marked by +) Between the AVF Estimates (in Dollars) of the Optimal Policy Based on the Calibration Sample and the AVF of the Optimal Policy Found by Optimizing on the Validation Sample. Vertical axis: suboptimality, $Y_{\mathrm{val}} - Y^{*}$. Horizontal axis: size of calibration sample (% of data set).
Note. Each + was generated by randomly partitioning the data to a calibration and a validation sample. The horizontal axis corresponds to the size of the calibration sample, as a percentage of the full data sample. Here, $\alpha = 0.98$, for which the true optimal AVF is approximately $….

This bias is illustrated in Figure 4 for calibration samples of varying sizes. It can be seen that value function estimates from the calibration sample are almost uniformly greater than the estimates from the validation sample. This bias is statistically significant. It is also managerially relevant, averaging around 6.3% of the true optimal AVF ($33.59) for a calibration sample that consists of approximately 1.6 million observations (1% of the data). In addition, the $33.59 AVF for the optimal policy can be compared with the $28.54 AVF for the historical policy (reported in Figure 1). These results indicate that the optimal policy offers a potential profit improvement of approximately 17%.

We can also use the catalog data to investigate the extent to which parametric variance leads to suboptimal policies. To do so, we compared the optimal policy derived using each subsample with the true optimal policy derived using the entire data set. Both policies are evaluated on the validation sample. We use $Y^{*}$ to denote the AVF for the optimal policy found by optimizing on the entire data set. The findings are reported in Figure 5. As expected, the optimal policy always outperforms the policy derived from the calibration subsample. The differences are again statistically significant. Note that the computation of $Y^{*}$ and $Y_{\mathrm{val}}$ uses the same data, which may introduce correlation between the two quantities. This will tend to diminish our estimates of the suboptimality. We also computed $Y^{*}_{\mathrm{val}}$, the optimal AVF over the validation set, in place of $Y^{*}$ for Table 3 and Figures 4 and 5. The results are similar.

To demonstrate the robustness of the findings, we performed an experiment similar to the one reported in Table 2. In Table 3, we present the bias and suboptimality introduced by the optimization process for different values of $\alpha$. Specifically, the bias was calculated as $(Y_{\mathrm{cal}} - Y_{\mathrm{val}})/Y^{*}$; the suboptimality was calculated as $(Y_{\mathrm{val}} - Y^{*})/Y^{*}$. From Table 3, we can easily obtain the mean standard errors as the sample standard deviations divided by 10 (the square root of the sample size, 100). It is clear that both the bias and the suboptimality are generally significantly greater than zero, with the bias averaging around 2% of the AVF and the suboptimality averaging around 1%.

Table 3. Optimization Bias for Different Values of $\alpha$. Columns: $\alpha$; bias in percent (mean, STD); suboptimality in percent (mean, STD).
Note. For each discount factor, we performed random sampling of the data 100 times. Each time we use a random calibration sample of 20% of the entire data set (each with roughly eight million observations) and the other 80% as a validation sample. We found the optimal policy in each such MDP and present in the table the bias, $Y_{\mathrm{cal}} - Y_{\mathrm{val}}$, normalized by $Y^{*}$. We also present the suboptimality, $Y_{\mathrm{val}} - Y^{*}$, normalized similarly. The means of the biases are significantly greater than zero.
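The standard error arithmetic used with Table 3 can be written out as a short sketch. The table entries themselves are not reproduced here, so the numbers below are hypothetical: the 2.0% mean echoes the average bias quoted in the text, while the 1.5% sample standard deviation is an assumed value chosen only to make the example concrete. The function name `mean_standard_error` is likewise illustrative.

```python
import math

def mean_standard_error(sample_std, runs):
    """Standard error of a mean computed from `runs` independent repetitions."""
    return sample_std / math.sqrt(runs)

# Hypothetical Table 3 style entry: mean bias 2.0%, sample STD 1.5%, 100 runs.
mean_bias, std_bias, runs = 2.0, 1.5, 100
se = mean_standard_error(std_bias, runs)  # sample STD divided by sqrt(100) = 10
z = mean_bias / se                         # distance of the mean from zero, in standard errors
print(f"standard error = {se:.2f}, z = {z:.1f}")
```

A mean that lies many standard errors above zero is what the text refers to as significantly greater than zero.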

We conclude that parametric variance introduces two issues in policy optimization. First, the estimates of the transition probabilities and the rewards for the optimal policy are biased, leading to a positive bias in the value function estimates. This problem can be remedied relatively easily by evaluating the policy on a separate validation sample. The second problem is more difficult to resolve: errors in the model parameters also lead to suboptimal policies. As we discussed in §4.2, this second problem is at least to some extent inevitable in the absence of the true model parameters.

There is an interesting question raised by a referee: Given a fixed amount of data, how do we divide it into the calibration and validation samples? Including more data in the calibration sample potentially leads to a better policy, while more data in the validation sample means a tighter confidence interval when we evaluate the policy. This trade-off can often be resolved empirically.

6. Concluding Remarks

We have provided closed-form approximations for the bias and variance of estimated value functions caused by uncertainty in the true model parameters. For small and mid-sized MDP models, the expressions can be easily calculated and used to evaluate existing policies or to compare new policies with existing ones. For the case where a new policy is derived through a policy optimization process, we also demonstrated how to remove the additional bias introduced by the optimization process, by using a validation sample.

The expressions are based on second-order approximations. Moreover, in the absence of the true model parameters, the expressions are evaluated by relying on estimates of the model parameters and are therefore themselves estimates, subject to parametric variance. We used a large sample of data from a catalog mailing company to investigate the impact of these approximations. The findings indicate that the confidence intervals obtained on the basis of the bias and variance expressions are reassuringly accurate.

Both the catalog mailing data and our theoretical analysis provide a comparison of the relative magnitude of the variance and the different biases.
The variance introduced by parametric uncertainty is considerable, suggesting both practical and statistical importance. Of the two biases, only the bias in optimal policies introduced by the optimization process is significant. For a fixed policy, the bias introduced by parametric uncertainty will generally be negligible when compared to the variance.

While we report the average of the value functions (averaged over all states), the disaggregate results may also be of interest. In particular, the variance of the value functions for alternative policies could be used to guide future experimentation. Future experimentation may favor actions that might have a large effect on the variance of the value function estimate. In this manner, the findings may contribute to our understanding of the trade-off between exploration and exploitation. The findings may also help to improve the policy optimization process. The policy improvement portions of standard algorithms focus on point estimates of the value functions and overlook the variance around these estimates.

Finally, we caution that, as with all analyses of MDPs, our findings rely on an assumption that the data are sampled from a Markov process. In our experiments, we ensured satisfaction of this condition by sampling observations rather than trajectories (a trajectory here would be the complete history of a customer).

7. Electronic Companion

An electronic companion to this paper is available as part of the online version that can be found at mansci.journal.informs.org/.

Acknowledgments

This research was partially supported by NSF Grant DMI-…. This paper has benefited from comments by workshop participants at Duke University, University of Pennsylvania, Washington University at St. Louis, the 2004 International Conference on Machine Learning, and the INFORMS Annual Meeting. The authors are thankful to Yann Le Tallec for finding an error in a previous version and to the referees for constructive comments. The authors are especially thankful to the department editor for a detailed review and many constructive suggestions.

Appendix A. Proof of Proposition 4.1

Lemma A.1. For any action $a$, we have $\mathbb{E}[\tilde{P}^{a}] = 0$, $\mathbb{E}[\tilde{R}^{a}] = 0$, and furthermore, $\tilde{R}^{a}$ is uncorrelated with any function of $\tilde{P}^{1}, \ldots, \tilde{P}^{|A|}$, in which $|A|$ is the cardinality of the action space $A$.

Proof.
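The property $\mathbb{E}[\tilde{P}^{a}] = 0$ is obvious. Let $N$ stand for the collection of random variables $N^{b}_{jk}$ for every $j$, $k$, and $b$. We have $\mathbb{E}[\hat{R}^{a} - R^{a} \mid N] = 0$. The fact $\mathbb{E}[\tilde{R}^{a}] = 0$ follows by taking the unconditional expectation. Furthermore, because any $\tilde{P}^{a}$ is completely determined by $N$, it follows that for any function $g$, we have

$$\mathbb{E}\big[g(\tilde{P}^{1}, \ldots, \tilde{P}^{|A|})\,\tilde{R}^{a} \mid N\big] = g(\tilde{P}^{1}, \ldots, \tilde{P}^{|A|})\,\mathbb{E}\big[\tilde{R}^{a} \mid N\big] = 0.$$

By taking unconditional expectations, we obtain the last part of the lemma.

For the proof of Proposition 4.1, we start from Equation (5) and substitute the expression from Equation (11) for $\hat{R}^{\pi}$, to obtain

$$\mathbb{E}\big[\hat{Y}^{\pi}\big] = (I - \alpha P^{\pi})^{-1}R^{\pi} + \left(\sum_{k=1}^{\infty}\alpha^{k}\,\mathbb{E}\big[f_k(\tilde{P}^{\pi})\big]\right)R^{\pi}.$$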
