Bias and Variance Approximation in Value Function Estimates

Shie Mannor, Duncan Simester, Peng Sun, John N. Tsitsiklis

July 11, 2004. Revised: July 5, 2005

Abstract

We consider a Markov Decision Process and study the bias and variance in the value function estimates that result from empirical estimates of the model parameters. We provide closed-form approximations for the bias and variance, which can then be used to derive confidence intervals around the value function estimates. We illustrate and validate our findings using a large database describing the transaction and mailing histories for customers of a mail-order catalog firm.

This research was partially supported by NSF grant DMI [...]. The paper has benefited from comments by workshop participants at Duke University, University of Pennsylvania, Washington University at St. Louis, the 2004 International Conference on Machine Learning and INFORMS Annual Meeting. The authors are thankful to Yann Le Tallec for finding an error in a previous version and referees for constructive comments. The authors are especially thankful to the department editor for a detailed review and many constructive suggestions.

Shie Mannor: Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139; current address: Department of Electrical and Computer Engineering, McGill University, Montreal, Quebec H3A 2A7, Canada, shie@ece.mcgill.ca
Duncan Simester: Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02139, simester@mit.edu
Peng Sun: Fuqua School of Business, Duke University, Durham, NC 27708, psun@duke.edu
John N. Tsitsiklis: Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, jnt@mit.edu

1 Introduction

Bellman's value function plays a central role in the optimization of dynamic decision-making models, as well as in the structural estimation of dynamic models of rational agents. For the important case of a finite-state Markov Decision Process (MDP), the value function depends on two types of model parameters: the transition probabilities between states and the expected one-step rewards from each state. In many applications in the social sciences and in engineering, the transition probabilities and expected rewards are not known and instead must be estimated from finite samples of data. The estimation errors for these parameters introduce errors and biases in the value function estimates.

In this paper, we present a methodology for evaluating the bias and variance in value function estimates caused by errors in the model parameters. This, in turn, allows the calculation of confidence intervals around the value function estimates. The confidence intervals are themselves approximations. For analytical and computational tractability, they rely on second order Taylor series approximations. Moreover, because the expressions for the bias and the variance approximation require the true but unknown model parameters, we replace these unknown parameters by their estimates. We evaluate the accuracy of these approximations and validate the expressions using a large sample of real data obtained from a mail-order catalog company.

Sources of Variance

We start by distinguishing between two types of variance that can arise in an MDP: internal and parametric. Internal variance reflects the stochasticity in the transitions and rewards. For example, in a marketing setting there is rarely certainty as to whether an individual customer will purchase, resulting in genuinely stochastic transitions and rewards. Parametric variance arises if the true transition probabilities and expected rewards are estimated rather than known; the potential for error in the estimates of these parameters introduces variance in the value function estimates.

The two types of variance have different sources and can be illustrated through different experiments. To illustrate internal variance, we can fix the model parameters and then generate a number of finite-length sample trajectories (with all trajectories having the same length, starting from the same state, and using a common control policy). The variation across sample trajectories in the total rewards and/or the identity of the final state reflects internal variance.

In contrast, aggregation across samples does not mitigate parametric variance. The latter can be illustrated by comparing the average outcomes from a large number of samples generated under different estimates for the model parameters. The variation in the average outcomes under different estimates reflects parametric variance.

Internal variance has already been considered in the literature. In particular, Sobel (1982) provides an expression for the internal variance in a Markov Decision Process with discounted rewards, while Filar et al. (1989) and Baykal-Gursoy and Ross (1992) consider the average reward criterion. In this paper we focus on parametric variance. Our motivation is that in many contexts the underlying objective involves averaging outcomes across a large number of samples, in which case the internal variance is averaged out. For example, in a marketing application, firm profits typically represent the aggregation of outcomes across a large number of customers. Similarly, in a labor economics setting, a firm often aggregates across a large number of employees. Of course, there are settings where internal variance is also important. For example, when allocating financial portfolios, the (internal) variance of the return on a single financial portfolio is important in its own right.

Literature

Markov Decision Problems, and the associated methodology of Dynamic Programming, have found a broad range of applications in numerous fields in the social sciences and in engineering. These applications can be broadly divided into two categories, based upon the research objectives.

The first and more traditional category of applications focuses on optimizing the operation of human or engineering systems, and on providing tools for effective decision-making. The application areas are vast, and include finance (Luenberger, 1997; Campbell and Viceira, 2002), economics (Dixit and Pindyck, 1994), inventory control and supply chain management (Zipkin, 2000), revenue and yield management (McGill and van Ryzin, 1999), transportation (Godfrey and Powell, 2002), communications, water resource management, and electric power systems.
The vast majority of this literature assumes that an accurate system model is available. There is an underlying implicit assumption that the true model will be estimated using statistical methods on the basis of whatever data are available. However, the statistical ramifications of working with finite data records have received little attention. An exception is the literature dealing with on-line learning of optimal policies (adaptive control of Markov chains, reinforcement learning) (Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996). However, this literature is concerned with asymptotic convergence as opposed to the common statistical questions of standard errors and confidence intervals.

The second category of applications focuses on explaining observed phenomena. Amongst the most widely cited examples is the work of Rust (1987), who develops a discrete dynamic programming model of the optimal replacement policy for bus engines. According to this approach the researcher starts by assuming that individuals or firms behave optimally, but that the parameters of the firm or customer decision problem are unknown. By maximizing the likelihood of the empirically observed actions of individuals or firms under the optimal policies for different sets of parameters, the researcher seeks to identify these unobserved parameters. Similar applications of discrete dynamic programming models have become increasingly common, particularly in the labor (Keane and Wolpin, 1994), industrial organization (Hendel and Nevo, 2002), and marketing (Gönül and Shi, 1998) literatures. While these methods use a variety of approaches to calculate or approximate the value function, the value function relies upon point estimates of the model parameters. Previous attempts to consider the impact of parameter error on the calculated value function have been limited to simulation-based approaches.

We finally note that the impact of uncertainty in the model parameters on the accuracy of the value function estimates has received attention in the finance literature. For example, Xia (2001) and Barberis (2000) investigate how dynamic learning about stock return predictability affects optimal portfolio allocations. The general problem considered in these studies is similar to the one addressed in this paper. However, the sources of variance are different. In particular, the finance literature is concerned with internal variance due to the stochasticity in the underlying process, and parametric variance due to non-stationarity of the model parameters, including changes in the investment horizon and/or dynamic learning.
In contrast, we abstract away from the problem of internal variance, assume that the model parameters are stationary, and focus on the parametric variance that results from estimating the model parameters from a finite sample of data.

Overview

As far as we know this is the first paper to study parametric bias and variance in Markov Decision Processes. It serves two purposes. First, to illustrate the potential for error in value function estimates and to highlight the potential magnitude of these errors. Second, to provide formulas and a methodology for estimating the bias and variance in value function estimates, which can then be used to construct confidence intervals around the value function estimates.

We begin with some notation and background material in Section 2. In Section 3 we illustrate the relationship between errors in the model parameters and the accuracy of value function estimates using actual data from a catalog mailing context. In Section 4, we present a methodology for estimating the bias and variance in the value function estimates. In Section 5, we validate our methodology using the catalog mailing data. We conclude in Section 6 with a review of the findings and a discussion of opportunities for future research.

2 A Formal Description of the Problem

We consider Markov Decision Processes (MDP) with a fixed policy, where both the MDP and the policy are assumed stationary. The assumption that the policy is fixed allows us to initially abstract away from the control problem. As we discuss in Section 4.2, the impact of parameter uncertainty on the solution to the control problem raises additional issues.

An MDP is specified by a finite set \(S\) of states, of cardinality \(m\), a finite set \(A\) of actions, and two scalars, \(P^a_{ij}\) and \(R^a_{ij}\), for every \(i, j \in S\) and every \(a \in A\). These scalars are interpreted as follows: if the current state is \(i\) and action \(a\) is applied, then the next state is \(j\) with probability \(P^a_{ij}\); furthermore, given that a transition from \(i\) to \(j\) occurs following an action equal to \(a\), a random reward is obtained, whose conditional expectation is equal to \(R^a_{ij}\). We make the usual Markovian assumptions, namely, that given \(i\) and \(a\), the next state is conditionally independent from the past history of the process; also, that given \(i\), \(a\), and \(j\), the associated reward is again conditionally independent from the past history of the process. Note that if action \(a\) is applied at state \(i\), the expected reward, denoted by \(R^a_i\), is equal to \(\sum_j P^a_{ij} R^a_{ij}\).

We are interested in the value function associated with a stationary, Markovian, possibly randomized, policy \(\pi\).
We use \(\pi(a|i)\) to denote the conditional probability of applying action \(a\) when at state \(i\). Let \(P^\pi_{ij} = \sum_a \pi(a|i) P^a_{ij}\), which is the transition probability from \(i\) to \(j\), and

\[ R^\pi_i = \sum_a \pi(a|i) R^a_i = \sum_a \pi(a|i) \sum_j P^a_{ij} R^a_{ij}, \tag{1} \]

which is the expected reward at state \(i\), under the policy \(\pi\). We use \(P^\pi\) to denote the \(m \times m\) matrix with entries \(P^\pi_{ij}\), and \(R^\pi\) to denote the \(m\)-dimensional vector with components \(R^\pi_i\).

We restrict our attention to the infinite horizon, discounted reward criterion for a fixed discount factor \(\alpha \in (0, 1)\). Define the value function associated with policy \(\pi\) to be the \(m\)-dimensional vector given by

\[ Y^\pi = \sum_{k=0}^{\infty} \alpha^k (P^\pi)^k R^\pi. \]

Using the geometric series formula, the value function is given by (Bellman, 1957)

\[ Y^\pi = (I - \alpha P^\pi)^{-1} R^\pi. \]

In our setting the true model parameters, \(P^a_{ij}\) and \(R^a_{ij}\), are not known. Instead, we have access to a finite sample of data, from which these parameters can be estimated. Specifically, assume that for every \(i\) and \(a\), we have a record of \(N^a_i\) transitions out of state \(i\), under action \(a\), and the associated rewards. We treat the numbers \(N^a_i\) as fixed (not as random variables), and assume that \(N^a_i > 0\) for every \(i\) and \(a\).

This last assumption restricts attention to actions that have been tried before. For at least two reasons we anticipate that this will be a relatively weak assumption in practice. First, the inability to evaluate actions in one state does not restrict our ability to evaluate the same action in other states, because we can still evaluate an action at any state where the action has been tried before. Thus the restriction only applies to states in which there is no past information about the outcome. Second, there is a tremendous amount of variation in historical policies in many real-world applications. This variation may arise for a lot of reasons including experimentation, implementation errors or non-stationarity in the policy. If there is interest in untried actions, and there are priors available to help predict the outcome, then a Bayesian approach can be used. For completeness we detail such an approach in the online Appendix D (Mannor et al., 2005).
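
The closed form \(Y^\pi = (I - \alpha P^\pi)^{-1} R^\pi\) given earlier in this section can be checked numerically against the truncated series. The following is a minimal sketch, assuming NumPy; the 3-state chain, its rewards, and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def value_function(P_pi, R_pi, alpha):
    """Solve (I - alpha * P_pi) Y = R_pi for the discounted value function Y."""
    m = P_pi.shape[0]
    return np.linalg.solve(np.eye(m) - alpha * P_pi, R_pi)

def value_function_series(P_pi, R_pi, alpha, n_terms=2000):
    """Truncated sum of alpha^k (P_pi)^k R_pi, for comparison with the closed form."""
    total = np.zeros_like(R_pi, dtype=float)
    term = R_pi.astype(float)
    for _ in range(n_terms):
        total += term
        term = alpha * (P_pi @ term)  # next term: alpha^(k+1) P^(k+1) R
    return total

# Hypothetical 3-state chain under a fixed policy (illustrative numbers only).
P_pi = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
R_pi = np.array([1.0, 0.0, 5.0])
alpha = 0.9

Y = value_function(P_pi, R_pi, alpha)
Y_series = value_function_series(P_pi, R_pi, alpha)
print(Y, Y_series)
```

Solving the linear system rather than forming the inverse explicitly is a design choice of this sketch; both give the same vector up to rounding.
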
Furthermore, we do not assume any relation between the sampling process and the policy \(\pi\) of interest; in particular, the \(N^a_i\), for different \(a\), need not be proportional to the \(\pi(a|i)\), and the number \(N_i = \sum_a N^a_i\) of transitions out of state \(i\) need not be related to the steady-state probability of state \(i\) under policy \(\pi\).

For the \(N^a_i\) transitions out of state \(i\) under action \(a\) in the sample data, let \(N^a_{ij}\) be the number of transitions that lead to state \(j\). Furthermore, let \(C^a_{ij}\) be the sum of the rewards associated with these \(N^a_{ij}\) transitions (for completeness we define \(C^a_{ij} = 0\) if \(N^a_{ij} = 0\)). We define

\[ \hat P^a_{ij} = \frac{N^a_{ij}}{N^a_i}, \qquad \hat R^a_{ij} = \frac{C^a_{ij}}{N^a_{ij}}, \]

which will be our estimates of \(P^a_{ij}\) and \(R^a_{ij}\), respectively. When \(N^a_{ij} = 0\), we define \(\hat R^a_{ij} = 0\). [1] In addition, we define \(\hat P^\pi_{ij} = \sum_a \pi(a|i) \hat P^a_{ij}\), and

\[ \hat R^a_i = \sum_j \hat P^a_{ij} \hat R^a_{ij} = \frac{\sum_j C^a_{ij}}{N^a_i}, \qquad \hat R^\pi_i = \sum_a \pi(a|i) \hat R^a_i, \tag{2} \]

which will be our estimates of \(P^\pi_{ij}\), \(R^a_i\), and \(R^\pi_i\), respectively. We finally define a matrix \(\hat P^\pi\) and a vector \(\hat R^\pi\), with entries \(\hat P^\pi_{ij}\) and \(\hat R^\pi_i\), respectively, which will be our estimates of \(P^\pi\) and \(R^\pi\). Based on these estimates, we obtain an estimated value function \(\hat Y^\pi\), given by

\[ \hat Y^\pi = (I - \alpha \hat P^\pi)^{-1} \hat R^\pi. \tag{3} \]

We assume that the sample data reflect the true process, in the following sense. The vector \((N^a_{i1}, \ldots, N^a_{im})\) follows a multinomial distribution with parameters \((N^a_i; P^a_{i1}, \ldots, P^a_{im})\). Let \(\mathbb{E}\) denote expectation under the true model. We then have \(\mathbb{E}[N^a_{ij}] = N^a_i P^a_{ij}\). A last assumption, which reflects our earlier assumptions that \(N^a_i\) is fixed and that each sample reward is conditionally independent from the past, is that \(\mathbb{E}[C^a_{ij} \mid N^a_{ij}] = N^a_{ij} R^a_{ij}\). Under these assumptions it is easily verified that \(\hat P^\pi\) and \(\hat R^\pi\) are unbiased estimates of \(P^\pi\) and \(R^\pi\).

Based on Eq. (3), we can anticipate the impact of errors in \(\hat P^\pi\) and \(\hat R^\pi\) on \(\hat Y^\pi\). Notice first that \(\hat Y^\pi\) is linear in \(\hat R^\pi\), so that if \(P^\pi\) were observed without error (i.e., if \(\hat P^\pi = P^\pi\)), the variance of \(\hat R^\pi\) would lead to variance in \(\hat Y^\pi\) but not to bias (since \(\hat R^\pi\) is unbiased). In contrast, \(\hat Y^\pi\) is nonlinear in \(\hat P^\pi\), so that errors in \(\hat P^\pi\) lead to both bias and variance in \(\hat Y^\pi\). Moreover, due to the matrix inversion the nonlinearity is substantial, so that any error in \(\hat P^\pi\) can translate to a large error in \(\hat Y^\pi\). This is particularly true when \(\alpha\) is close to one. Furthermore, if the errors in \(\hat P^\pi\) and \(\hat R^\pi\) are correlated, the nonlinearity implies that errors in \(\hat R^\pi\) will also lead to bias in \(\hat Y^\pi\).

[1] The possibility of \(N^a_{ij}\) being zero for feasible transitions introduces some additional bias, which will not be accounted for. However, in our analysis, we will assume that any transition with \(N^a_{ij} = 0\) is infeasible.

3 An Illustration

To illustrate the bias and variance that can be introduced to value function estimates by errors in the model parameters we use real data from a mail-order catalog company. While this application serves as a useful case study, our findings are not limited to this application.

Deciding who should receive a catalog is amongst the most important decisions that mail-order companies must address. Yet, identifying an optimal mailing policy is a difficult task. Customer response functions are highly stochastic, reflecting in part the relative paucity of information that firms have about each customer. Moreover, the problem is a dynamic one. Purchasing decisions are influenced not just by the firm's most recent mailing decision, but also by prior mailing decisions. As a result, the optimal mailing decision depends upon past and future mailing decisions.

A typical catalog company might mail 25 catalogs per year. The number of catalogs, the dates that they are mailed, and the content of the catalogs are determined up to a year before the firm decides to whom each catalog will be mailed. For this reason, these decisions are typically treated as fixed when deciding who to mail to. Accordingly, the firm only needs to decide which customers to mail to, on each exogenously determined mailing date (a discrete infinite horizon problem). The firm's objective is to maximize its expected total discounted profits. Rewards (profits) in each period are calculated as the revenue earned from customer purchases (if any) less the cost of the goods sold and the mailing costs (approximately 65 cents per catalog mailed).

To support their mailing decisions, catalog firms typically maintain large databases describing the individual purchase and mailing histories for each customer. We are fortunate to have access to a large database describing the transaction and mailing histories for the women's apparel division of a moderately large catalog company. This data is described in detail in Simester et al. (2004). It includes the complete transaction histories for approximately 1.72 million customers.
The mailing histories are complete for the six-year period from 1996 through 2002 (the company did not maintain a record of the mailing history prior to 1996). Catalogs were mailed on 133 occasions in this six-year period, so that on average a mailing decision occurred every 2-3 weeks.

The catalog mailing problem can be modelled as an MDP (as in Gönül and Shi, 1998), where the state is a summary of the customer's history, and the action at each period is to either mail or not mail. The construction of the state space is an interesting problem that we will not consider here. We will instead follow a standard industry approach to this problem that uses three state variables, the so-called RFM measures (e.g., Bult and Wansbeek, 1995; Bitran and Mondschein, 1996). These measures describe the recency, frequency and monetary value of customers' prior purchases. [2]

For the purposes of this illustration, we constructed a state space by quantizing each of the RFM variables to 4 discrete levels, yielding a state space with \(|S| = 4^3 = 64\) states. At each historical mailing epoch, we evaluate the RFM variables of each customer (regardless of whether the customer received a catalog or made a purchase) and assign him/her to one of the 64 states. We also treat the purchase amount (zero if no purchase in the epoch) less the mailing cost as a reward sample. Therefore each customer's historical data over time serves as a sample trajectory. Following the procedure described in the previous section, we may then estimate the model parameters \(\hat P\) and \(\hat R\) and calculate \(\hat Y\) for the current policy embedded in the data.

Since the firm is interested in the average profit per customer, rather than the profit earned from an individual customer, internal variance is averaged out. However, parametric variance is of interest because it affects the comparison of different policies. In particular, when evaluating a new policy, the firm would like both a prediction of the expected profits from adopting the new policy, together with confidence bounds around that prediction.

In order to illustrate the impact of parametric variance, we randomly divided the 1.72 million customers and 164 million observations into 250 equally sized sub-samples, each containing approximately 657 thousand observations.
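
The estimation procedure of Section 2, which this illustration applies to each sub-sample, can be sketched as follows. This is a minimal illustration assuming NumPy; the array layout (`N[a, i, j]`, `C[a, i, j]`, `pi[i, a]`), the function names, and the two-state numbers are assumptions made for the example and are not taken from the catalog data.

```python
import numpy as np

def estimate_model(N, C, pi):
    """Plug-in estimates of P^pi and R^pi from counts and reward sums.

    N[a, i, j]: number of observed transitions i -> j under action a.
    C[a, i, j]: sum of the rewards observed on those transitions.
    pi[i, a]:   probability that the policy applies action a in state i.
    Assumes every (i, a) pair was sampled, i.e. N[a, i, :].sum() > 0.
    """
    N = np.asarray(N, dtype=float)
    C = np.asarray(C, dtype=float)
    N_ia = N.sum(axis=2)                          # N^a_i, shape (actions, states)
    P_hat_a = N / N_ia[:, :, None]                # P-hat^a_ij = N^a_ij / N^a_i
    R_hat_a = C.sum(axis=2) / N_ia                # R-hat^a_i = sum_j C^a_ij / N^a_i
    P_hat = np.einsum("ia,aij->ij", pi, P_hat_a)  # mix over actions with pi(a|i)
    R_hat = np.einsum("ia,ai->i", pi, R_hat_a)
    return P_hat, R_hat

def estimated_value(P_hat, R_hat, alpha):
    """Estimated value function of Eq. (3)."""
    m = P_hat.shape[0]
    return np.linalg.solve(np.eye(m) - alpha * P_hat, R_hat)

# Hypothetical data: 2 actions, 2 states.
N = np.array([[[8, 2], [3, 7]],
              [[5, 5], [1, 9]]])
C = np.array([[[16.0, 2.0], [0.0, 7.0]],
              [[10.0, 0.0], [4.0, 9.0]]])
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])

P_hat, R_hat = estimate_model(N, C, pi)
Y_hat = estimated_value(P_hat, R_hat, alpha=0.9)
print(P_hat, R_hat, Y_hat)
```
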
By observation we mean a mailing period and an associated state transition in the history of a customer, irrespective of whether a catalog was mailed or a purchase was made during that time period. We then separately estimated the model parameters \(\hat P^\pi\) and \(\hat R^\pi\) following Section 2 using the observations from each of these sub-samples. Here we considered the policy \(\pi\) to be the same as the sampling policy that generated the data. Using equation (3) we calculated 250 estimates of the value function.

[2] Recency is measured as the number of days (in hundreds) since a customer's last purchase. Frequency measures the number of items that customers previously purchased. Monetary Value measures the average price (in dollars) of the items ordered by each customer.

As a benchmark, we also estimated the model parameters using the full sample of 1.72 million customers. For the purposes of this illustration, we will interpret the model estimated using the full sample as the true model, which is essentially equivalent to assuming that the 1.72 million customers are the full population. Thus, within a typical sub-sample, the expected rewards in each state \(\hat R^\pi\) were estimated using an average of approximately 10 thousand observations (\(N_i\)), while the transition matrix \(\hat P^\pi\) was estimated using an average of 160 observations per transition. In practice, most of the transitions are infeasible; for example, a customer cannot transition from having 3 prior purchases to only having 2 prior purchases. When limiting attention to only those transitions that are feasible, the average number of observations per transition was approximately 1,400. (The average of the positive \(N_{ij}\)'s is around 1,400.)

In Figure 1 we report the empirical distribution (histogram) of the value function \(\hat Y^\pi\) across all 250 sub-samples under the historical policy used by the firm (as calculated using the whole sample). In order to summarize an estimated value function with a single number (to be referred to as the average value function, or AVF) for each sub-sample, we average the estimates across states (weighting each state equally). The true AVF, computed from the parameters estimated for the full sample, is $28.54. In comparison, the average of the 250 estimates is $28.65, with an empirical standard deviation of $0.97. The difference between $28.54 and $28.65 is not statistically significant and is of seemingly little managerial importance. However, the variance is potentially very important. The 95% confidence interval around the 250 AVF estimates ranges from $26.59 to $30.49, a width of roughly 14% of the true mean. Of course, we were able to estimate the $0.97 standard deviation only because we had access to many sub-samples.
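
The sub-sampling experiment described above can be imitated on synthetic data. The sketch below is an illustration under stated assumptions, not a reproduction of the catalog study: the 4-state chain, the reward model (normally distributed rewards with a common standard deviation), the sample sizes, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha = 4, 0.9
n_per_state, n_subsamples = 500, 250

# Hypothetical "true" model: transition matrix and expected reward per transition.
P = rng.dirichlet(np.ones(m), size=m)
R_trans = rng.uniform(0.0, 10.0, size=(m, m))
reward_sd = 2.0

def avf(P_mat, R_vec):
    """Average value function: value function averaged uniformly over states."""
    Y = np.linalg.solve(np.eye(m) - alpha * P_mat, R_vec)
    return Y.mean()

true_avf = avf(P, (P * R_trans).sum(axis=1))

estimates = []
for _ in range(n_subsamples):
    # Transition counts out of each state follow a multinomial distribution.
    N = np.array([rng.multinomial(n_per_state, P[i]) for i in range(m)])
    # Reward sums: N_ij draws with mean R_ij and standard deviation reward_sd.
    C = N * R_trans + reward_sd * np.sqrt(N) * rng.standard_normal((m, m))
    P_hat = N / n_per_state
    R_hat = C.sum(axis=1) / n_per_state
    estimates.append(avf(P_hat, R_hat))
estimates = np.array(estimates)

print("true AVF:", true_avf)
print("mean of estimates:", estimates.mean())
print("empirical std of estimates:", estimates.std(ddof=1))
```

The spread of `estimates` here is purely parametric: each sub-sample yields a different estimated model, and the AVF is computed exactly for that model, so no internal variance enters.
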
In a real world setting, where only a single sample is available, the researcher generally relies on simulations or jack-knifing techniques to estimate the standard deviation. In this paper, we will present a procedure for deriving closed-form approximations of the standard deviation directly from the data.

[Figure 1: histogram; horizontal axis: AVF per sub-sample; vertical axis: number of sub-samples.]
Figure 1: Mail catalog problem: histogram of the AVF of the historical policy for a partition of the customers to 250 sub-samples. The discount factor per period is \(\alpha =\) [...]. The policy used is the historical (mixed) policy used by the firm, and the value function is weighted uniformly across states. The AVF obtained from the full data is $28.54, and is plotted as a vertical line. The empirical standard deviation is $0.97.

We can demonstrate the robustness of the above described results by varying both the size of the sub-samples and the discount factor. In Table 1 we present the empirical bias and standard deviation for different discount factors (averaged over 10 repetitions). In each repetition, we divide the data set into 100 sub-samples and compute the AVF for each sub-sample. We calculate the average absolute value of the bias and the empirical standard deviation of the AVF estimates across sub-samples. It can be seen that the average bias is small for discount factors that are not too close to 1. For discount factors that are close to 1, the bias becomes more meaningful but still remains much smaller than the standard deviation.

α        bias/AVF    STD/AVF
[...]    [...]%      3.57%
[...]    [...]%      3.37%
[...]    [...]%      3.32%
[...]    [...]%      3.26%
[...]    [...]%      3.33%
[...]    [...]%      3.88%
[...]    [...]%      5.26%

Table 1: Bias and variance as a function of the discount factor. For each discount factor, we partition the data 10 times, with each partition resulting in 100 sub-samples (each with roughly 1.6 million observations). We present in the table the mean absolute value of the bias and the mean empirical standard deviation (each averaged across the ten repetitions). Both of these means are standardized by dividing by the AVF associated with the historical policy (as measured on the whole data set).

In another experiment we varied the precision of the estimates by changing the size of the sub-samples and repeated the analysis using sub-samples with different numbers of observations. In Figure 2 we report empirical standard deviations of the AVF estimates under the different sized sub-samples. Each cross in Figure 2 represents a random assignment of the observations to sub-samples (the different assignments lead to variation in the sub-samples between repetitions). While increasing the size of the sub-samples increases the accuracy of the model parameters, and in turn reduces the variance in the AVF estimates, the rate at which the variance approaches zero slows down as the sub-samples increase in size. It seems that even when estimating the model parameters with very large amounts of data, parametric variance leads to non-negligible variance in the value function estimates.

[Figure 2: scatter plot; horizontal axis: number of observations per sub-sample (millions); vertical axis: STD of AVF, from $0 to $1.5.]
Figure 2: Mail catalog problem: the empirical standard deviation of the AVF as a function of the sample size. Each cross represents a single (random) partition of the observations into sub-samples.

4 Analysis

In this section we provide closed-form approximations for the bias and variance of the estimated value function using second order approximations. We then briefly discuss the control problem where, in addition to the estimation process, we look for an optimal policy. In Section 4.1 we will drop the superscript \(\pi\), because we consider a fixed policy \(\pi\).

4.1 Approximations for Bias and Variance in the Estimated Value Function

We now derive closed-form approximations for the (parametric) bias and variance of \(\hat Y\). The analysis follows a classical (non-Bayesian) approach, where the bias and variance are expressed in terms of the (unknown) true parameters. Since the true model parameters are unknown, we substitute the estimated parameters, which is standard practice. However, as a result of this substitution, the values obtained for the bias and variance are themselves estimates. For completeness we also provide in the online Appendix D (Mannor et al., 2005) a Bayesian analysis. Under the Bayesian approach \(P\) and \(R\) are treated as random variables with known prior distributions, and we deduce approximations for the conditional bias and variance, given the values of \(\hat P\) and \(\hat R\). The expressions obtained using the Bayesian approach are almost identical to the ones in the classical approach (unless an informative prior is available).

Our goal is to calculate \(\mathbb{E}[\hat Y]\) and the covariance matrix for \(\hat Y\), defined by

\[ \mathrm{cov}(\hat Y) = \mathbb{E}[\hat Y \hat Y^\top] - \mathbb{E}[\hat Y]\,\mathbb{E}[\hat Y]^\top. \]

We define a random \(m \times m\) matrix \(\tilde P = \hat P - P\) and a random \(m\)-vector \(\tilde R = \hat R - R\). Note that \(\tilde P\) and \(\tilde R\) are zero mean random variables that represent the difference between the true model and the estimated model.

To help interpret some of the later analysis, it will be helpful to have a sense of the magnitudes of \(\tilde P\) and \(\tilde R\). Because the transition probabilities are bounded by zero and one, the absolute errors in these probabilities are also bounded between zero and one. The transition probabilities themselves will tend to be smaller the larger the number of states to which transitions are feasible, while the errors in these probabilities will be smaller the more observations there are relative to the number of feasible transitions. In the example discussed in Section 3 and Figure 1, the maximum error in the transition probabilities in a sub-sample (\(\max_{ij} |\tilde P_{ij}|\)) has a mean of 0.011, and a standard deviation of [...]. Furthermore, the average (averaged over all pairs \((i, j)\) with nonzero transition probability) absolute error in the transition probability estimates, \(|\tilde P_{ij}|\), has a mean of [...] and an empirical standard deviation of [...]. Note that in that example, the feasible transitions consist of less than 10% of the \(64^2\) entries in \(P\).

The expected rewards are not bounded a priori and so the errors are also unbounded. In the catalog example, the average absolute error in the reward estimates, \(|\tilde R_{ij}|\), has a mean of $4.25 and a standard deviation of $1.82. The maximal error in the reward estimates, \(\max_{ij} |\tilde R_{ij}|\), has a mean of $56.3 and a standard deviation of $43.2.

We now write the expectation of \(\hat Y\) (cf. Eq. (3)) as:

\[ \mathbb{E}[\hat Y] = \mathbb{E}\Big[ \big(I - \alpha (P + \tilde P)\big)^{-1} (R + \tilde R) \Big] = \mathbb{E}\Big[ \sum_{k=0}^{\infty} \alpha^k (P + \tilde P)^k (R + \tilde R) \Big], \tag{4} \]

where the geometric series expansion of \(\big(I - \alpha(P + \tilde P)\big)^{-1}\) was used to obtain the second equality. We use the notation \(X = (I - \alpha P)^{-1}\) and

\[ f_k(\tilde P) = X (\tilde P X)^k = (X \tilde P)^k X. \]

The following lemma will be useful.

Lemma 4.1 \(\sum_{l=0}^{\infty} \alpha^l (P + \tilde P)^l = \sum_{k=0}^{\infty} \alpha^k f_k(\tilde P)\).

Proof:

\[ \sum_{k=0}^{\infty} \alpha^k f_k(\tilde P) = \sum_{k=0}^{\infty} \alpha^k (X \tilde P)^k X = (I - \alpha X \tilde P)^{-1} X = \big(X^{-1} - X^{-1} \alpha X \tilde P\big)^{-1} = (I - \alpha P - \alpha \tilde P)^{-1} = \sum_{l=0}^{\infty} \alpha^l (P + \tilde P)^l, \]

where we repeatedly used the definition of \(X\), and the fact that \(X\) is invertible.

Using Lemma 4.1 in Eq. (4), we obtain:

\[ \mathbb{E}[\hat Y] = (I - \alpha P)^{-1} R + \Big( \sum_{k=1}^{\infty} \alpha^k \mathbb{E}[f_k(\tilde P)] \Big) R + \sum_{k=0}^{\infty} \alpha^k \mathbb{E}[f_k(\tilde P) \tilde R]. \tag{5} \]

There are three terms on the right-hand side of Eq. (5). The first term is the value function for the true model. The second term reflects the bias introduced by the uncertainty in \(\hat P\) alone, and the third term represents the bias introduced by the correlation between the errors in \(\hat P\) and \(\hat R\). Equation (5) provides a series expansion of the error in terms of high order moments and cross moments of the errors in \(\hat P\) and \(\hat R\).

The calculation of the bias is tedious because the term \(\mathbb{E}[f_k(\tilde P)]\) involves \(k\)th order moments of multinomial distributions. But since \(\tilde P_{ij}\) is typically small, \(\tilde P^k\) is generally close to zero for large \(k\). For this reason we limit our attention to a second order approximation and we will assume that \(\mathbb{E}[f_k(\tilde P)] \approx 0\) for \(k > 2\), and that \(\mathbb{E}[f_k(\tilde P) \tilde R] \approx 0\) for \(k > 1\). We use the catalog data to investigate the appropriateness of this assumption in Section 5. Under this assumption, we can write Equation (5) as:

\[ \mathbb{E}[\hat Y] = (I - \alpha P)^{-1} R + \alpha \mathbb{E}[f_1(\tilde P)] R + \alpha^2 \mathbb{E}[f_2(\tilde P)] R + X \mathbb{E}[\tilde R] + \alpha \mathbb{E}[f_1(\tilde P) \tilde R] + L_{\exp}, \tag{6} \]

where we represent all the terms of order greater than 2 in

\[ L_{\exp} = \sum_{k=3}^{\infty} \alpha^k \mathbb{E}[f_k(\tilde P)] R + \sum_{k=2}^{\infty} \alpha^k \mathbb{E}[f_k(\tilde P) \tilde R]. \]
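
Lemma 4.1 and the second order truncation used in Eq. (6) can be checked numerically on a small example. The sketch below assumes NumPy; the 4-state matrix `P` and the perturbed estimate `P_hat` are hypothetical, and the perturbation is kept small enough that the series in the lemma converges.

```python
import numpy as np

rng = np.random.default_rng(1)
m, alpha = 4, 0.9

# Hypothetical true model and a nearby estimate (both are stochastic matrices).
P = rng.dirichlet(np.ones(m), size=m)
P_hat = 0.98 * P + 0.02 * rng.dirichlet(np.ones(m), size=m)
P_tilde = P_hat - P                      # estimation error, rows sum to zero

X = np.linalg.inv(np.eye(m) - alpha * P)

def f(k):
    """f_k(P_tilde) = X (P_tilde X)^k."""
    return X @ np.linalg.matrix_power(P_tilde @ X, k)

# Left-hand side of Lemma 4.1 in closed form: (I - alpha (P + P_tilde))^{-1}.
lhs = np.linalg.inv(np.eye(m) - alpha * P_hat)
# Right-hand side, truncated after many terms.
rhs = sum(alpha**k * f(k) for k in range(200))
# Second order truncation, keeping only k = 0, 1, 2.
second_order = sum(alpha**k * f(k) for k in range(3))

print("max |lhs - rhs|:", np.abs(lhs - rhs).max())
print("max error of zeroth order (X alone):", np.abs(lhs - X).max())
print("max error of second order truncation:", np.abs(lhs - second_order).max())
```
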

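Given that we will be using second order approximations, we expect that the mean and variance of \(\hat Y\) can be calculated as long as we are able to compute the covariance between various entries of \(\tilde R\) and \(\tilde P\). We start with \(\tilde P\).

First we introduce some notation. We use the notation \(A_{i\cdot}\) and \(A_{\cdot i}\) to denote the \(i\)th row and column, respectively, of a matrix \(A\), and \(\mathrm{diag}(A_{i\cdot})\) to denote a diagonal matrix with the entries of \(A_{i\cdot}\) along the diagonal. We note that \(\tilde P_{i\cdot}\) and \(\tilde P_{j\cdot}\) are independent when \(i \neq j\). To find the covariance matrix of \(\tilde P_{i\cdot}\), we consider the row vectors \(\hat P^a_{i\cdot}\) and \(P^a_{i\cdot}\) with the estimated and true transition probabilities, and define \(\tilde P^a_{i\cdot}\) to be their difference. Note that \(\tilde P_{i\cdot} = \sum_a \pi(a|i) \tilde P^a_{i\cdot}\). For each state-action pair \((i, a)\), we define

\[ M^a_i = \mathrm{diag}(P^a_{i\cdot}) - (P^a_{i\cdot})^\top P^a_{i\cdot}, \]

which is a symmetric positive semi-definite matrix. Recall that for each \((i, a)\), we have \(\hat P^a_{ij} = N^a_{ij} / N^a_i\), where the \(N^a_{ij}\) are drawn from a multinomial distribution. The covariance matrix of \(\hat P^a_{i\cdot}\) is \(M^a_i / N^a_i\), and the covariance matrix of \(\hat P_{i\cdot}\) is

\[ \mathrm{COV}^{(i)} = \mathbb{E}[\tilde P_{i\cdot}^\top \tilde P_{i\cdot}] = \sum_a \frac{\pi(a|i)^2}{N^a_i} M^a_i. \]

Now we consider \(\tilde R\). Since \(C^a_{ij}\) is independent of \(C^b_{kl}\) whenever \(i \neq k\), we have \(\mathbb{E}[\tilde R_i \tilde R_k] = 0\), for \(i \neq k\). Furthermore,

\[ \mathbb{E}[\tilde R_i^2] = \sum_a \pi(a|i)^2\, \mathbb{E}[(\tilde R^a_i)^2]. \]

In the following we use \(N^a_{i\cdot}\) to represent the vector with components \(N^a_{ij}\), \(j = 1, \ldots, m\), and \(R^a_{i\cdot}\) to represent the vector with components \(R^a_{ij}\), \(j = 1, \ldots, m\). Note that \(C^a_{ij}\) and \(C^a_{ik}\) are independent given \(N^a_{ij}\) and \(N^a_{ik}\), so that

\[
\begin{aligned}
\mathbb{E}[(\tilde R^a_i)^2] &= \mathrm{var}\Big[ \frac{\sum_j C^a_{ij}}{N^a_i} \Big] = \frac{1}{(N^a_i)^2} \mathrm{var}\Big[ \sum_j C^a_{ij} \Big] \\
&= \frac{1}{(N^a_i)^2} \Big\{ \mathrm{var}\Big( \mathbb{E}\Big[ \sum_j C^a_{ij} \,\Big|\, N^a_{i\cdot} \Big] \Big) + \mathbb{E}\Big( \mathrm{var}\Big[ \sum_j C^a_{ij} \,\Big|\, N^a_{i\cdot} \Big] \Big) \Big\} \\
&= \frac{1}{(N^a_i)^2} \Big\{ \mathrm{var}\Big( \sum_j R^a_{ij} N^a_{ij} \Big) + \mathbb{E}\Big[ \sum_j V^a_{ij} N^a_{ij} \Big] \Big\} \\
&= \frac{1}{(N^a_i)^2} \Big\{ (N^a_i)^2\, \mathrm{var}\big( R^a_{i\cdot} (\hat P^a_{i\cdot})^\top \big) + N^a_i \sum_j V^a_{ij} P^a_{ij} \Big\} \\
&= \frac{1}{N^a_i} \Big( R^a_{i\cdot} M^a_i (R^a_{i\cdot})^\top + V^a_{i\cdot} (P^a_{i\cdot})^\top \Big).
\end{aligned}
\tag{7}
\]

Here, \(V^a_{ij}\) is the variance of the rewards associated with a transition from \(i\) to \(j\), under action \(a\).

In order to account for the correlation between \(\tilde P\) and \(\tilde R_i\), we use Eq. (2), to obtain

\[ \hat R_i = \sum_a \pi(a|i) \sum_j \hat R^a_{ij} \hat P^a_{ij} = \sum_a \pi(a|i) \sum_j \big( R^a_{ij} P^a_{ij} + R^a_{ij} \tilde P^a_{ij} + \tilde R^a_{ij} P^a_{ij} + \tilde R^a_{ij} \tilde P^a_{ij} \big), \tag{8} \]

where \(\tilde R^a_{ij} = \hat R^a_{ij} - R^a_{ij}\). Comparing with Eq. (1), we have

\[ \tilde R_i = \hat R_i - R_i = \sum_a \pi(a|i) \sum_j \big( R^a_{ij} \tilde P^a_{ij} + \tilde R^a_{ij} P^a_{ij} + \tilde R^a_{ij} \tilde P^a_{ij} \big). \tag{9} \]

We use \(\circ\) to denote Hadamard multiplication: for any two matrices \(A\) and \(B\) with the same dimensions, \(A \circ B\) is a matrix (again with the same dimensions) with entries \((A \circ B)_{ij} = A_{ij} B_{ij}\). We also use \(e\) to denote the \(m\)-dimensional vector with all components equal to one. And we use \(\pi^a\) to denote the \(m\)-dimensional vector with components \(\pi^a_i = \pi(a|i)\). With this notation, Eq. (9) becomes

\[ \tilde R = \sum_a \pi^a \circ \big[ \big( \tilde P^a \circ R^a + \tilde R^a \circ P^a + \tilde R^a \circ \tilde P^a \big) e \big]. \tag{10} \]

We define an \(m \times m\) matrix \(Q\) with entries

\[ Q_{ij} = \mathrm{COV}^{(i)}_{j\cdot} X_{\cdot i}. \tag{11} \]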
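
The multinomial covariance \(M^a_i / N^a_i\) used above can be checked by simulation. The sketch below assumes NumPy; the probability row `p`, the sample size `n`, and the number of simulated draws are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

p = np.array([0.2, 0.3, 0.5])     # hypothetical true row P^a_{i.}
n = 50                            # hypothetical sample size N^a_i

# M^a_i = diag(P^a_{i.}) - (P^a_{i.})^T P^a_{i.}
M = np.diag(p) - np.outer(p, p)

# Simulate many estimated rows P-hat^a_{i.} = (multinomial counts) / n.
draws = rng.multinomial(n, p, size=200_000) / n
empirical_cov = np.cov(draws, rowvar=False)

print("theoretical covariance M / n:\n", M / n)
print("simulated covariance:\n", empirical_cov)
print("smallest eigenvalue of M:", np.linalg.eigvalsh(M).min())
```

Because each row of \(M^a_i\) sums to zero, the estimated probabilities move only within the simplex: an overestimate of one transition probability is offset by underestimates of others.
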

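(Recall the definition \(X = (I - \alpha P)^{-1}\), and that \(Y = XR\) is the true value function.) And we define an \(m\)-dimensional vector \(B\) with its \(i\)th component defined as

\[ B_i = \sum_a \frac{\pi(a|i)^2}{N^a_i} R^a_{i\cdot} M^a_i X_{\cdot i}. \]

The following proposition quantifies the bias under the second order approximation assumption. The proof is given in Appendix A.

Proposition 4.1 The expectation of the estimated value function \(\hat Y\) satisfies

\[ \mathbb{E}[\hat Y] = Y + \alpha^2 X Q Y + \alpha X B + L_{\exp}, \]

where

\[ L_{\exp} = \sum_{k=3}^{\infty} \alpha^k \mathbb{E}\big[ f_k(\tilde P) \big] R + \sum_{k=2}^{\infty} \alpha^k \mathbb{E}\Big[ f_k(\tilde P) \Big( \sum_a \pi^a \circ \big( (\tilde P^a \circ R^a) e \big) \Big) \Big] = o\Big( \frac{1}{N^{a^*}_{i^*}} \Big), \]

where \(N^{a^*}_{i^*} = \min_{(i,a):\, \pi(a|i) > 0} N^a_i\) and the term \(o(\cdot)\) satisfies \(\lim_{N \to \infty} o(1/N)\, N = 0\).

In the above proposition, \((i^*, a^*)\) represents the least sampled state-action pair that is used by the policy. The term \(L_{\exp}\) decreases to 0 faster than \(1/N^{a^*}_{i^*}\), whereas \(Q\) and \(B\) can be shown to decrease like \(1/N^{a^*}_{i^*}\). Therefore, our approximation of the bias in the value function estimates will be \(\alpha^2 X Q Y + \alpha X B\).

For the purposes of the next proposition, we introduce some more notation. We define the diagonal matrix \(W\) whose diagonal entries are given by

\[ W_{ii} = \sum_a \frac{\pi(a|i)^2}{N^a_i} \Big[ \big( \alpha Y^\top + R^a_{i\cdot} \big) M^a_i \big( \alpha Y + (R^a_{i\cdot})^\top \big) + V^a_{i\cdot} (P^a_{i\cdot})^\top \Big]. \tag{12} \]

The next proposition provides an expression for the second moment, \(\mathbb{E}[\hat Y \hat Y^\top]\). Together with the expression for \(\mathbb{E}[\hat Y]\) in the preceding proposition, it leads to an approximation for the covariance matrix of \(\hat Y\). The proof is given in Appendix B.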
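
The quantities \(Q\), \(B\) and \(W\) defined above, and the bias approximation \(\alpha^2 X Q Y + \alpha X B\), can be computed as in the following sketch. It assumes NumPy and, as a loudly stated simplification, a single action per state (\(\pi(a|i) = 1\)), so the sums over actions collapse to one term. The function name, argument layout, and the 4-state inputs are hypothetical.

```python
import numpy as np

def bias_and_w(P, R_trans, V_trans, N, alpha):
    """Second order bias approximation and the diagonal of W (Eq. (12)).

    Simplifying assumption: one action per state, pi(a|i) = 1.
    P[i, j]:       transition probabilities.
    R_trans[i, j]: expected reward on a transition i -> j.
    V_trans[i, j]: variance of the reward on a transition i -> j.
    N[i]:          number of transitions sampled out of state i.
    """
    m = P.shape[0]
    X = np.linalg.inv(np.eye(m) - alpha * P)
    R = (P * R_trans).sum(axis=1)        # R_i = sum_j P_ij R_ij
    Y = X @ R                            # value function of the supplied model
    Q = np.zeros((m, m))
    B = np.zeros(m)
    W = np.zeros(m)
    for i in range(m):
        M_i = np.diag(P[i]) - np.outer(P[i], P[i])
        cov_i = M_i / N[i]               # COV^(i) with a single action
        Q[i] = cov_i @ X[:, i]           # Q_ij = COV^(i)_{j.} X_{.i}
        B[i] = R_trans[i] @ M_i @ X[:, i] / N[i]
        u = alpha * Y + R_trans[i]
        W[i] = (u @ M_i @ u + V_trans[i] @ P[i]) / N[i]
    bias = alpha**2 * X @ Q @ Y + alpha * X @ B
    return bias, W, Q

# Hypothetical inputs.
rng = np.random.default_rng(3)
m, alpha = 4, 0.9
P = rng.dirichlet(np.ones(m), size=m)
R_trans = rng.uniform(0.0, 10.0, size=(m, m))
V_trans = np.full((m, m), 4.0)
N = np.full(m, 500)

bias, W, Q = bias_and_w(P, R_trans, V_trans, N, alpha)
bias_double, W_double, _ = bias_and_w(P, R_trans, V_trans, 2 * N, alpha)
print("approximate bias:", bias)
print("diagonal of W:", W)
```

In practice the true parameters are unknown, so the function would be called with the estimates in place of `P`, `R_trans` and `V_trans`, as the paper describes. Doubling every sample size halves both outputs, which matches the \(1/N\) behavior stated for \(Q\), \(B\) and \(W\).
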

Proposition 4.2 The second moment of $\hat Y$ satisfies

$$
\mathbb{E}[\hat Y \hat Y^\top] = YY^\top + X\Big\{\alpha^2\big(QYR^\top + RY^\top Q^\top\big) + \alpha\big(BR^\top + RB^\top\big) + W\Big\}X^\top + L_{\mathrm{var}},
$$

where $L_{\mathrm{var}}$ is given by

$$
L_{\mathrm{var}} = \sum_{k,l:\,k+l>2} \alpha^{k+l}\, \mathbb{E}\Big[f_k(\tilde P)\big(RR^\top + \tilde R \tilde R^\top\big) f_l(\tilde P)^\top\Big] + \alpha\, \mathbb{E}\Big[X \tilde R \tilde R^\top f_1(\tilde P)^\top\Big] + \alpha\, \mathbb{E}\Big[f_1(\tilde P) \tilde R \tilde R^\top X^\top\Big] = o\Big(\frac{1}{N^{a^*}_{i^*}}\Big).
$$

By taking the difference between $\mathbb{E}[\hat Y \hat Y^\top]$, as given by Proposition 4.2, and $\mathbb{E}[\hat Y]\,\mathbb{E}[\hat Y]^\top$, as prescribed by Proposition 4.1, the following corollary is easily derived.

Corollary 4.1 The covariance matrix of the estimated value function satisfies

$$
\mathrm{cov}(\hat Y) = XWX^\top + o\Big(\frac{1}{N^{a^*}_{i^*}}\Big).
$$

The expressions in Propositions 4.1, 4.2 and Corollary 4.1 yield several insights. First, as the counts $N^a_i$ increase to infinity, $\mathrm{cov}^{(i)}$ approaches 0, and thus all the terms involving the matrices $Q$, $B$ and $W$ converge to 0. As expected, this implies that as the sample size increases and the accuracy of the estimated parameters improves, both the bias and the variance decrease to 0. Second, the expressions for the bias and variance rely on the true model parameters, which are unknown. As discussed in the introduction, to obtain computable approximations of the bias and variance, we will use instead $\hat P$, $\hat R$, and the empirical variance of each $R^a_{ik}$. In principle, we could also estimate the bias and variance due to this approximation, but this is tedious and, as suggested by the experimental results in the next section, generally unnecessary. Third, when $\min_{i,a} N^a_i$ is large, it follows that the non-zero entries of $B$, $W$, and $Q$ decrease to 0 like $1/N^{a^*}_{i^*}$. Therefore the standard deviation decreases to 0 like $1/\sqrt{N}$, which is the usual behavior of empirical estimates.

The expressions in Proposition 4.1 and Corollary 4.1 allow us to qualitatively compare the magnitude of the bias and variance. According to Corollary 4.1, the standard deviation of $\hat Y_i$

can be approximately estimated as

$$
\sigma(\hat Y_i) = \sqrt{X_{i\cdot}\, W\, X_{i\cdot}^\top}. \qquad (13)
$$

The next proposition, proved in Appendix C, quantifies the ratio between the standard deviation and the bias. Recall that for two functions $f$ and $g$ (defined on the real numbers) we write $f(n) = \Omega(g(n))$ if there exist constants $N_0$ and $C > 0$ such that $f(n) \ge C g(n)$ for $n \ge N_0$.

Proposition 4.3 Suppose that $\sigma(\hat Y_i) > 0$ and $N^{a^*}_{i^*}/N^a_i > c > 0$ for all $a$ and $i$. Then

$$
\frac{\sigma(\hat Y_i)}{\big|\mathbb{E}[\hat Y_i] - Y_i\big|} = \Omega\Big(\sqrt{N^{a^*}_{i^*}}\Big)
$$

for all $i$.

Proposition 4.3 implies that the errors introduced by the parametric variance will generally be much larger than the bias. Note that since $W$ is a positive semi-definite matrix, $\sigma(\hat Y_i) > 0$ is a very weak non-degeneracy assumption. The condition $N^{a^*}_{i^*}/N^a_i > c > 0$ requires that sample sizes increase uniformly. The conditions in this proposition are somewhat stronger than necessary, for simplicity of exposition.

While the expression in Corollary 4.1 allows us to approximate the covariance matrix of the estimated value function, the findings on their own do not allow us to calculate confidence intervals around these estimates. Calculating a confidence interval requires that we know the distribution of the value function estimates. A central limit theorem (Serfling, 1980, page 122, Theorem A) speaks to this issue.³

³ We thank the department editor for directing our attention to this theorem.

Theorem 4.1 (Serfling, 1980) Suppose that $X_n := (X_{n1}, \ldots, X_{nk})$ asymptotically approaches $N(\mu, b_n^2 \Sigma)$ with $b_n \to 0$. Let $g(x) = (g_1(x), \ldots, g_m(x))$, $x = (x_1, \ldots, x_k)$, be a vector-valued function for which each component function $g_i(x)$ is a real-valued function and has a non-zero gradient at $x = \mu$. Let

$$
D = \Big[\frac{\partial g_i}{\partial x_j}\Big|_{x=\mu}\Big]_{m \times k}.
$$

Then $g(X_n)$ asymptotically approaches $N\big(g(\mu),\, b_n^2 D \Sigma D^\top\big)$.

Because $\hat P^a_{ij}$ and $\hat R^a_{ij}$ are all estimators that asymptotically follow normal distributions, we may consider $\hat Y$ as the function $g$ in the above theorem and conclude that $\hat Y$ is asymptotically

normal. We further investigate this issue using catalog mailing data in Section 5, where we report that a Kolmogorov-Smirnov test cannot reject the hypothesis that $\hat Y$ is normally distributed.

Readers may wonder whether we could have used the Serfling result to derive our earlier findings. It is technically possible to do so. Indeed, under the assumption that all of the $N^a_i$'s are identical, we were able to show that the two approaches yield the same result, and observed that the two derivations were of comparable length and complexity. However, if sampling occurs at different rates in different states, the rate at which the $N^a_i$'s approach infinity will generally vary. In this case use of the Serfling theorem, or any related central limit theorem, requires extensive additional derivation. Moreover, these theorems do not address the issue of bias.

4.2 The Control Problem

To this point we have focused on the value function under a fixed policy. In many applications we are interested in comparing an existing policy with an alternative policy, possibly derived through a policy optimization process. We know from MDP theory that there exists an optimal policy $\pi^*$ such that $Y^{\pi^*}_i \ge Y^{\pi}_i$ for all admissible policies $\pi$ and all states $i \in S$. The optimal policy may be obtained from value iteration, policy iteration or linear programming algorithms. See, for example, Bertsekas (2000).

Since we do not have access to the true model parameters $P$ and $R$, optimization based on the estimated parameters $\hat P$ and $\hat R$ produces an optimal policy $\hat\pi^*$ such that $\hat Y^{\hat\pi^*} \ge \hat Y^{\pi}$ for all admissible policies $\pi$. In general, policy $\hat\pi^*$ is different from $\pi^*$. Moreover, since the policy $\hat\pi^*$ is obtained through an optimization process, the estimates of the model parameters for that policy ($\hat P^{\hat\pi^*}$ and $\hat R^{\hat\pi^*}$) will no longer be unbiased estimates of the true model parameters ($P^{\hat\pi^*}$ and $R^{\hat\pi^*}$). Therefore we cannot use the approximation derived in Proposition 4.1 (for a fixed policy) to evaluate the bias in the optimal value function. Nor can we use the approximations in Proposition 4.2 and Corollary 4.1 to estimate the covariance matrix.
We can illustrate the problem through a simple example. Consider a single state MDP with two actions, that is, $S = \{1\}$ and $A = \{0, 1\}$. Both actions yield identical zero-mean random rewards. Clearly in such a problem $\pi^*$ could be either action 0 or 1, with value

functions $Y^{\pi^*} = Y^{\hat\pi^*} = 0$. Now assume that we have $n$ samples to estimate the expected reward $\hat R_a$ for either action $a$. Indeed both $\hat R_a$ follow (approximately) a normal distribution $N(0, 1/n)$. The policy optimization procedure chooses the action with the largest $\hat R_a$. If we use $\hat R^*$ to denote the maximum of $\hat R_0$ and $\hat R_1$, we know from Jensen's Inequality that $\mathbb{E}[\hat R^*] > 0$, and so the value function estimated for the chosen policy will on average be positively biased:

$$
\mathbb{E}\big[\hat Y^{\hat\pi^*}\big] = \mathbb{E}\big[\hat R^*\big] = \mathbb{E}\big[\max\{\hat R_0, \hat R_1\}\big] > \max\big\{\mathbb{E}[\hat R_0],\, \mathbb{E}[\hat R_1]\big\} = 0.
$$

The magnitude of $\mathbb{E}[\hat Y^{\hat\pi^*}]$, and therefore the bias in this example, is studied in the order statistics literature (Leadbetter et al., 1983). We also refer readers to Clark (1961), where the author presents a procedure to approximate moments of the maximum of a finite number of correlated Gaussian random variables.

This problem raises two issues. First, how can we de-bias the estimates of $\hat P^{\hat\pi^*}$ and $\hat R^{\hat\pi^*}$ so that we can use our earlier results to estimate the bias and covariance matrix of the value function when the policy is derived from an optimization procedure? Second, because the optimization procedures themselves rely on estimates $\hat P^{\pi}$ and $\hat R^{\pi}$, the policies derived from standard dynamic programming algorithms will generally not be truly optimal ($\hat\pi^* \ne \pi^*$). In the remainder of this section we propose a cross-validation approach that can help to address the first issue. Unfortunately, we do not have a solution to the second issue. Indeed, it seems unlikely that a general procedure can be found that resolves the second issue, as the sub-optimality reflects the absence of complete information in the training data.

The bias in the estimates of $\hat P^{\hat\pi^*}$ and $\hat R^{\hat\pi^*}$ arises because optimization methods tend to favor actions for which the estimation errors in $\hat P^{\pi}$ and $\hat R^{\pi}$ lead to inflated estimates of the value function. As long as the errors in $\hat P$ and $\hat R$ are independent across samples, we can derive unbiased estimates of $P$ and $R$ if we use a different sample of data to evaluate the policy $\hat\pi^*$ than the sample we used to design the policy. In particular, consider the following approach.
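As a numerical check of the single-state example, the sketch below simulates both the naive estimate (the chosen action is evaluated on the sample that selected it) and an estimate of the same action from an independent sample. It is an illustration under stated assumptions rather than part of the original analysis: the two estimated rewards are drawn exactly from $N(0, 1/n)$ and are independent, and the function name `selection_bias_demo` is hypothetical. For two independent $N(0, \sigma^2)$ variables the expected maximum is $\sigma/\sqrt{\pi}$, so the naive estimate should average about $1/\sqrt{\pi n}$.

```python
import numpy as np

def selection_bias_demo(n=100, trials=200_000, seed=0):
    """Average naive and held-out value estimates for the two-action example."""
    rng = np.random.default_rng(seed)
    sd = 1.0 / np.sqrt(n)  # each estimated reward is N(0, 1/n)

    # Design sample: estimated rewards of actions 0 and 1 in every trial.
    r_design = rng.normal(0.0, sd, size=(trials, 2))
    chosen = r_design.argmax(axis=1)   # action picked by the optimization
    naive = r_design.max(axis=1)       # value reported from the same sample

    # Independent evaluation sample: re-estimate only the chosen action.
    r_eval = rng.normal(0.0, sd, size=(trials, 2))
    held_out = r_eval[np.arange(trials), chosen]

    return naive.mean(), held_out.mean()

naive_mean, held_out_mean = selection_bias_demo()
print("naive estimate:   ", naive_mean)
print("theory (max bias):", 1.0 / np.sqrt(np.pi * 100))
print("held-out estimate:", held_out_mean)
```

The naive average sits well above zero while the held-out average is close to zero, which is the effect that motivates evaluating the policy on a sample other than the one used to design it.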
Start by dividing the training data into two sub-samples: a calibration sample and a validation sample. Use the calibration sample to estimate the model parameters $\hat P_{cal}$ and $\hat R_{cal}$ and obtain

the optimal policy $\hat\pi^*_{cal} = \arg\max_{\pi}\, (I - \alpha \hat P^{\pi}_{cal})^{-1} \hat R^{\pi}_{cal}$. Then estimate model parameters $\hat P_{val}$ and $\hat R_{val}$ from the validation sample and (following Equation (3)) evaluate the policy using these new parameters:

$$
\hat Y^{\hat\pi^*_{cal}}_{val} = \big(I - \alpha \hat P^{\hat\pi^*_{cal}}_{val}\big)^{-1} \hat R^{\hat\pi^*_{cal}}_{val}.
$$

Through this procedure we can de-bias the value function estimates by reporting $\hat Y^{\hat\pi^*_{cal}}_{val}$ instead of $\hat Y^{\hat\pi^*_{cal}}_{cal}$, where $\hat Y^{\hat\pi^*_{cal}}_{cal} = \big(I - \alpha \hat P^{\hat\pi^*_{cal}}_{cal}\big)^{-1} \hat R^{\hat\pi^*_{cal}}_{cal}$. Accordingly, we may also approximate the bias and variance, and therefore the confidence bounds, of $\hat Y^{\hat\pi^*_{cal}}_{val}$ following Proposition 4.1 and Corollary 4.1.

The assumption that the estimation errors in $\hat P$ and $\hat R$ are independent across the calibration and validation sub-samples is obviously critical. In this paper we have assumed that estimates $\hat P$ and $\hat R$ are derived from straightforward non-parametric aggregates of the available data. Under this approach the estimation errors are independent across the sub-samples as long as any measurement errors are independent across observations. However, in some settings, it is common to estimate the model parameters from maximum likelihood estimates that require functional form and distribution assumptions (this is particularly common in the economics literature). Under this alternative approach, any errors introduced by the functional form and distribution assumptions will be correlated across the sub-samples. As a result, the cross-validation procedure that we have proposed will not de-bias the estimates of $\hat P^{\hat\pi^*}$ and $\hat R^{\hat\pi^*}$, even if the measurement errors are independent across the observations.

5 Experiments

The reliance on a second order expansion in deriving the approximations for the bias and variance presumes that higher order terms are relatively unimportant. We now examine this assumption in further detail by using the catalog mailing data to validate the findings. These data also enable us to investigate the impact (if any) of using estimates of the model parameters in these expressions (in the absence of the true model parameters).

If the value function estimates follow a normal distribution, the variance and bias expressions derived in the previous section facilitate calculation of confidence intervals around the de-biased value function estimates. We can investigate the accuracy of these confidence intervals by checking how frequently the true value function falls within them. We would expect that on average the true value will fall within one standard deviation of the unbiased mean 68% of the time and within two standard deviations 95% of the time.

We begin by investigating whether the value function estimates follow a normal distribution. We do so by using a Kolmogorov-Smirnov test on each of the data points reported in Section 3. The hypothesis that the reward is two-sided Gaussian could not be rejected at the 0.05 significance level in any instance. The average P-value was …, with a minimum of … and a maximum of …. This indicates that we cannot conclude that the data deviate from a Gaussian distribution.

We use the same partitions of the data as in Section 2. In Figure 3 the percentage of times that the true value function was within one standard deviation is denoted by a "+" and within two standard deviations by an "x". For example, for the 250 sub-samples (with about 657,000 observations each), we report the percentage of the 250 estimates in which the true average value function (AVF) (as estimated on the full sample) was within the estimated confidence interval. By re-drawing the 250 sub-samples ten times, we report ten instances of this percentage. An analogous process was used with other choices of the sub-sample size. The findings in Figure 3 confirm that the percentages of estimates that fall within one and two standard deviations of the true AVF are close to the targets of 68% and 95% respectively.

We next consider the importance of the second order approximations. We do so by taking advantage of the role played by the discount factor $\alpha$. The importance of higher order terms in the series expansions increases as the discount factor approaches one. In Table 2 we repeat the analysis for 250 sub-samples of fixed size, but for different discount factors (same settings as in Table 1).
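The coverage check used in this section reduces to counting how often the reference value lies within one and two estimated standard deviations of each de-biased estimate. The sketch below shows one way to compute it. The function name `coverage` and the synthetic inputs are hypothetical, and the estimates are assumed to be de-biased already, with their standard deviations taken from Eq. (13).

```python
import numpy as np

def coverage(estimates, std_devs, true_value):
    """Fractions of estimates within one and two standard deviations of true_value."""
    estimates = np.asarray(estimates, dtype=float)
    std_devs = np.asarray(std_devs, dtype=float)
    z = np.abs(estimates - true_value) / std_devs  # distance in standard deviations
    return float(np.mean(z <= 1.0)), float(np.mean(z <= 2.0))

# Synthetic stand-in for 250 sub-sample estimates of an AVF (illustrative only).
rng = np.random.default_rng(0)
true_avf = 30.0
sigma = 0.5
estimates = rng.normal(true_avf, sigma, size=250)
within_one, within_two = coverage(estimates, np.full(250, sigma), true_avf)
print(f"within 1 STD: {within_one:.1%}, within 2 STD: {within_two:.1%}")
```

If the estimates are approximately normal and the standard deviations are accurate, the two fractions should be near 68% and 95%; systematic shortfalls point to an understated variance or a residual bias.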
As expected, as $\alpha$ approaches 1, the accuracy of the confidence intervals degrades. We attribute this to the error introduced by the second order approximation.

5.1 The Control Problem

As discussed in Section 4.2, an obvious application of our analysis is the comparison of a current policy with a new policy generated through some optimization process. We cautioned that before applying the expressions for the bias and the variance to a policy derived from

such a process, we should first obtain unbiased estimates of the model parameters, using an independent validation sample. We will use the catalog mailing data to illustrate the importance of this first step.

Figure 3: The percentage of the AVF estimates that fall within one ("+") and two ("x") standard deviations from the value calculated based on the full data set. Each "+" and "x" represents a random partition of the full data to sub-samples. The discount factor was α = …. (Vertical axis: percentage below 1 (+) or 2 (x) STDs. Horizontal axis: observations per sub-sample, in millions.)

We begin by randomly selecting a portion of the available data, to be used as a calibration sample, and retain the remaining data as a validation sample. To demonstrate how the size of the calibration sample affects the findings, we repeat this process for calibration samples of different sizes. The calibration sample is used to estimate model parameters $\hat P_{cal}$ and $\hat R_{cal}$. Then we run a policy iteration algorithm to identify an optimal policy $\hat\pi^*_{cal}$ from $\hat P_{cal}$ and $\hat R_{cal}$. We will compare two AVF estimates for this policy: the AVF calculated on the basis of the model estimated using the calibration sample (denoted by $Y_{cal}$); and the AVF of that policy as estimated using the validation sample (denoted by $Y_{val}$). The difference between the two estimates represents the bias introduced by the error in the model parameters (the errors no longer have zero expectation due to the optimization process). This bias is illustrated in Figure 4 for calibration samples of varying sizes. It can be seen that value function estimates from

the calibration sample are almost uniformly greater than the estimates from the validation sample. This bias is statistically significant. It is also managerially relevant, averaging around 6.3% of the true optimal AVF (\$33.59) for a calibration sample that consists of approximately 1.6 million observations (1% of the data). As an aside, the \$33.59 AVF for the optimal policy can be compared with the \$28.54 AVF for the historical policy (reported in Figure 1). These results indicate that the optimal policy offers a potential profit improvement of approximately 17%.

α | Samples within 1 STD | Samples within 2 STD
… | …% (…) | 95.44% (…)
… | …% (…) | 94.84% (…)
… | …% (…) | 95.08% (…)
… | …% (…) | 94.76% (…)
… | …% (…) | 95.52% (…)
… | …% (…) | 94.92% (…)
… | …% (…) | 92.20% (…)

Table 2: We randomly partitioned the data while varying the discount factor. For each discount factor, we performed the partition 10 times; each partition was to 250 sub-samples (each with roughly 657,000 observations). We present the percentage of samples in which the estimated AVF is within one standard deviation (as predicted by Proposition 4.2) of the value as measured on all the data; the minimum and maximum percentages over the 10 runs are provided in parentheses. The same statistics are presented for two standard deviations.

Figure 4: The differences (marked by "+") between the AVF estimates (in Dollars, and averaged over all states) based on the calibration sample and the validation sample, for the policy identified through an optimization process. Each "+" was generated by randomly partitioning the data to a calibration and a validation sample. The horizontal axis corresponds to the size of the calibration sample, as a percentage of the full data sample. Here α = 0.98, for which the true optimal AVF is approximately \$…. (Vertical axis: bias, $Y_{cal} - Y_{val}$, in Dollars. Horizontal axis: size of calibration sample, % of data set.)

We can also use the catalog data to investigate the extent to which parametric variance leads to sub-optimal policies. To do so, we compared the optimal policy derived using each sub-sample with the true optimal policy derived using the entire data set. Both policies are evaluated on the validation sample. We use $Y^*$ to denote the AVF for the optimal policy found by optimizing on the entire data set.⁴ The findings are reported in Figure 5. As expected, the optimal policy always outperforms the policy derived from the calibration sub-sample. The differences are again statistically significant.

Figure 5: The differences (marked by "+") between the AVF estimates (in Dollars) of the optimal policy based on the calibration sample and the AVF of the optimal policy found by optimizing on the validation sample. Each "+" was generated by randomly partitioning the data to a calibration and a validation sample. The horizontal axis corresponds to the size of the calibration sample, as a percentage of the full data sample. Here α = 0.98, for which the true optimal AVF is approximately \$…. (Vertical axis: suboptimality, $Y_{val} - Y^*$, in Dollars. Horizontal axis: size of calibration sample, % of data set.)

In order to demonstrate the robustness of the findings, we performed an experiment similar to the one reported in Table 2. In Table 3 we present the bias and sub-optimality introduced by the optimization process, for different values of $\alpha$.⁵ From Table 3 we can easily obtain the mean standard errors as the sample standard deviations divided by 10 (the square root of the sample size, 100). It is clear that both the bias and the sub-optimality are generally significantly greater than zero, with the bias averaging around 2% of the AVF and the sub-optimality averaging around 1%.

We conclude that parametric variance introduces two issues in policy optimization. First,

⁴ Note that the computation of $Y^*$ and $Y_{val}$ uses the same data, which may introduce correlation between the two quantities. This will tend to diminish our estimates of the sub-optimality. We also computed $Y^*_{val}$, the optimal AVF over the validation set, in place of $Y^*$ for Table 3 and Figures 4 and 5. The results are similar.
⁵ The bias was calculated as $(Y_{cal} - Y_{val})/Y^*$; the sub-optimality was calculated as $(Y_{val} - Y^*)/Y^*$.
