Importance Sampling with Unequal Support


Philip S. Thomas and Emma Brunskill
Carnegie Mellon University

Abstract

Importance sampling is often used in machine learning when training and testing data come from different distributions. In this paper we propose a new variant of importance sampling that can reduce the variance of importance sampling-based estimates by orders of magnitude when the supports of the training and testing distributions differ. After motivating and presenting our new importance sampling estimator, we provide a detailed theoretical analysis that characterizes both its bias and variance relative to the ordinary importance sampling estimator (in various settings, which include cases where ordinary importance sampling is biased, while our new estimator is not, and vice versa). We conclude with an example of how our new importance sampling estimator can be used to improve estimates of how well a new treatment policy for diabetes will work for an individual, using only data from when the individual used a previous treatment policy.

Introduction

A key challenge in artificial intelligence is to estimate the expectation of a random variable. Instances of this problem arise in areas ranging from planning and decision making (e.g., estimating the expected sum of rewards produced by a policy for decision making under uncertainty) to probabilistic inference. Although the estimation of an expected value is straightforward if we can generate many independent and identically distributed (i.i.d.) samples from the relevant probability distribution (which we refer to as the target distribution), we may not have generative access to the target distribution. Instead, we might only have data from a different distribution that we call the sampling distribution. For example, in off-policy evaluation for reinforcement learning, the goal is to estimate the expected sum of rewards that a decision policy will produce, given only data gathered using some other policy.
Similarly, in supervised learning, we may wish to predict the performance of a regressor or classifier if it were to be applied to data that comes from a distribution that differs from the distribution of the available data (e.g., we might predict the accuracy of a classifier for hand-written letters given that observed letter frequencies come from English, using a corpus of labeled letters collected from German documents).

Copyright 2017, Association for the Advancement of Artificial Intelligence. All rights reserved.

More precisely, we consider the problem of estimating $\theta := \mathbf{E}[h(X)]$, where $h$ is a real-valued function and the expectation is over the random variable $X$, which is a sample from the target distribution. As input we assume access to $n$ i.i.d. samples from a sampling distribution that is different from the target distribution. A classical approach to this problem is to use importance sampling (IS), which reweights the observed samples to account for the difference between the target and sampling distributions (Kahn, 1955). Importance sampling produces an unbiased but often high-variance estimate of $\theta$.

We introduce importance sampling with unequal support (US), a simple new importance sampling estimator that can drastically reduce the variance of importance sampling when the supports of the sampling and target distributions differ. This setting with unequal support can occur, for example, in our earlier example, where German documents might include symbols like ß that the classifier will not encounter. US essentially performs importance sampling only on the data that falls within the support of the target distribution, and then scales this estimate by a constant that reflects the relative support of the target and sampling distributions. US typically has lower variance than ordinary importance sampling (sometimes by orders of magnitude), and is unbiased in the important setting where at least one sample falls within the support of the target distribution. If no samples do, then none of the available data could have been generated by the target distribution, and so it is unclear what would make for a reasonable estimate.
Furthermore, the conditionally unbiased nature of US is sufficient to allow for its use with concentration inequalities like Hoeffding's inequality to construct confidence bounds on $\theta$. By contrast, weighted importance sampling (Rubinstein, 1981) is another variant of importance sampling that can reduce variance, but which introduces bias that makes it incompatible with Hoeffding's inequality.

Problem Setting and Importance Sampling

Let $f$ and $g$ be probability density functions (PDFs) for two distributions that we call the target distribution and sampling distribution, respectively. Let $h : \mathbb{R} \to \mathbb{R}$ be called the evaluation function. Let $\theta := \mathbf{E}_f[h(X)]$, where $\mathbf{E}_f$ denotes the expected value given that $f$ is the PDF of the random variables in the expectation (in this case, just $X$). Let $F := \{x \in \mathbb{R} : f(x) \neq 0\}$, $G := \{x \in \mathbb{R} : g(x) \neq 0\}$, and $H := \{x \in \mathbb{R} : h(x) \neq 0\}$ be the supports of the target and sampling distributions, and the evaluation function, respectively. In this paper we discuss techniques for estimating $\theta$ given $n \in \mathbb{N}_{>0}$ i.i.d. samples, $X_n := \{X_1, \dots, X_n\}$, from the sampling distribution, and we focus on the setting where $F \cap H \subset G$, that is, where the joint support of $F$ and $H$ is a strict subset of the support of $G$.

The importance sampling estimator,

$$\operatorname{IS}(X_n) := t + \frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)} \bigl(h(X_i) - t\bigr), \tag{1}$$

is a widely used estimator of $\theta$, where $t = 0$ (we consider non-zero values of $t$ later). If $F \cap H \subseteq G$, then $\operatorname{IS}(X_n)$ is a consistent and unbiased estimator of $\theta$. That is, $\operatorname{IS}(X_n) \to \theta$ almost surely and $\mathbf{E}_g[\operatorname{IS}(X_n)] = \theta$ (we review this latter result in Property 1 in the supplemental document).

A control variate is a constant, $t \in \mathbb{R}$, that is subtracted from each $h(X_i)$ and then added back to the final estimate, as in (1) (Hammersley, 1960; Hammersley and Handscomb, 1964). Although control variates, $t(X_i)$, that depend on the sample, $X_i$, can be beneficial, for our later purposes we only consider constant control variates. Intuitively, including a constant control variate equates to estimating $\theta' := \mathbf{E}_f[h'(X)]$ using importance sampling without a control variate, where $h'(x) = h(x) - t$, and then adding $t$ to the resulting estimate to get an estimate of $\theta$. Later we show that the variance of importance sampling increases with $\theta^2$, and so applying importance sampling to $h$ results in higher variance than applying importance sampling to $h'$ with $t \approx \theta$, since then $\theta' \approx 0$. That is, by inducing a kind of normalization, a control variate can reduce the variance of estimates without introducing bias, a property that has made the inclusion of control variates a popular topic in some recent works using importance sampling (Dudík et al., 2011; Jiang and Li, 2016; Thomas and Brunskill, 2016).

Although later we discuss control variates more, for simplicity our derivations focus on importance sampling estimators without control variates. There are also other extensions of the importance sampling estimator that can reduce variance, notably the weighted importance sampling estimator, which we compare to later, and which can provide large reductions of variance and mean squared error, but which introduces bias.
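The estimator in (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `importance_sampling` and the particular densities `f`, `g`, and evaluation function `h` are hypothetical choices made only so that the example runs.

```python
import random

def importance_sampling(samples, f, g, h, t=0.0):
    """Ordinary importance sampling estimate of E_f[h(X)], as in equation (1).

    samples: draws from the sampling distribution with PDF g.
    t: constant control variate (t = 0 gives plain importance sampling).
    """
    n = len(samples)
    total = sum(f(x) / g(x) * (h(x) - t) for x in samples)
    return t + total / n

# Hypothetical densities, chosen only for illustration:
# the target f is uniform on [0, 1], the sampling density g is uniform on [0, 2].
def f(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def g(x):
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

def h(x):
    # Evaluation function with support [0, 1]; here theta = E_f[h(X)] = 3.5.
    return 3.0 + x if 0.0 <= x <= 1.0 else 0.0

rng = random.Random(0)
xs = [rng.uniform(0.0, 2.0) for _ in range(10_000)]
plain = importance_sampling(xs, f, g, h)             # no control variate
centered = importance_sampling(xs, f, g, h, t=3.5)   # t chosen close to theta
```

Both calls estimate the same quantity. The second subtracts a control variate close to $\theta$, which is the case where the text expects the variance to be lower.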
An Illustrative Example

In this section we present an example that highlights the peculiar behavior of the IS estimator when $F \cap H \subset G$. The illustrative example, depicted in Figure 1, is defined as follows. Let $g(x) = 0.5$ if $x \in [0, 2]$ and $g(x) = 0$ otherwise, and let $f(x) = 1$ if $x \in [0, 1]$ and $f(x) = 0$ otherwise. So, $F = [0, 1]$ and $G = [0, 2]$. Let $h(x) = 1$ if $x \in [0, 1]$ and $h(x) = 0$ otherwise, so that $H = [0, 1]$. Notice that $\theta = 1$.

Figure 1: Depiction of the illustrative example. The evaluation function is not shown because $h = f$ and $H = F$.

Since the sampling and target distributions are both uniform, an obvious estimator of $\theta$ (if $f$ and $g$ are known but $h$ is not) would be the average of the points that fall within $F$. Let $\#(X_i \in F)$ denote the number of samples in $X_n$ that are in $F$. Formally, the obvious estimator is

$$\hat{\theta}(X_n) := \frac{1}{\#(X_i \in F)} \sum_{i=1}^{n} \mathbf{1}_F(X_i)\, h(X_i),$$

where $\mathbf{1}_A(x) = 1$ if $x \in A$ and $\mathbf{1}_A(x) = 0$ otherwise. Given our knowledge of $h$, it is straightforward to show that this estimator is equal to $1$ if $\#(X_i \in F) > 0$ and is undefined otherwise: it is exactly correct (has zero bias and variance) as long as at least one sample falls within $F$. If no samples fall within $F$, then we have only observed data that will never occur under the target distribution, and so we have no useful information about $\theta$. In this case, we might define our obvious estimator to return an arbitrary value, e.g., zero.

Perhaps surprisingly, the importance sampling estimator does not degenerate to this obvious estimator:

$$\operatorname{IS}(X_n) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_F(X_i)\, \frac{1}{0.5}\, h(X_i) = \frac{2\, \#(X_i \in F)}{n}.$$

Since $\mathbf{E}_g[\#(X_i \in F)/n] = 1/2$, this estimate is correct in expectation, but it does not have zero variance given that at least one sample falls within $F$. If more than $1/2$ of the samples fall within $F$, this estimate will be an over-estimate of $\theta$, and if fewer than $1/2$ of the samples fall within $F$, this estimate will be an under-estimate. Although correct on average, the importance sampling estimator has unnecessary additional variance relative to the obvious estimator.

Importance Sampling with Unequal Support

We propose a new importance sampling estimator, importance sampling with unequal support (ISUS, or US for brevity), that does degenerate to the obvious estimator for our illustrative example.
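The contrast between the two estimators in the illustrative example can be checked numerically. The sketch below hard-codes the example ($g$ uniform on $[0, 2]$, $f$ uniform on $[0, 1]$, $h = 1$ on $[0, 1]$); the function names, the seed, and the trial counts are arbitrary choices made for illustration.

```python
import random

def h(x):
    """Evaluation function of the example: 1 on [0, 1], 0 elsewhere."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def obvious_estimator(samples):
    """Average of h over the samples that fall within F = [0, 1]."""
    in_f = [x for x in samples if 0.0 <= x <= 1.0]
    if not in_f:
        return 0.0  # arbitrary value when no sample falls within F
    return sum(h(x) for x in in_f) / len(in_f)

def is_estimator(samples):
    """Importance sampling: samples in F have weight f/g = 1/0.5 = 2."""
    return sum(2.0 * h(x) for x in samples if 0.0 <= x <= 1.0) / len(samples)

rng = random.Random(1)
estimates = []
for _ in range(2000):
    xs = [rng.uniform(0.0, 2.0) for _ in range(20)]
    estimates.append((obvious_estimator(xs), is_estimator(xs)))

mean_is = sum(e[1] for e in estimates) / len(estimates)
spread_is = max(e[1] for e in estimates) - min(e[1] for e in estimates)
```

Across trials the obvious estimator only ever returns $1$ (or the arbitrary $0$ when no sample lands in $F$), whereas the IS estimate varies with the fraction of samples that happen to fall in $F$, even though its mean stays near $\theta = 1$.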
Intuitively, US prunes from $X_n$ the samples that are outside $F$ (or more generally, outside some set $C$ that we define later) to construct a new data set, $X'$, that has fewer samples. This new data set can be viewed as $\#(X_i \in F)$ i.i.d. samples from a different sampling distribution: a distribution with PDF $g'$, which is simply $g$, but truncated to only have support on $F$ and re-normalized to integrate to one. US then applies ordinary importance sampling to this new data set.

For generality, we allow US to prune from $X_n$ all of the points that are not in a set, $C$, which can be defined in many different ways, including $C := F$ as in our previous example. Our only requirement is that $F \cap H \subseteq C \subseteq G$. In order to compute US, we must compute a value,

$$c := \int_C g(x)\, dx,$$

which is the probability that a sample from the sampling distribution will be in $C$. In general, $C$ should be chosen to be as small as possible while still ensuring both that $F \cap H \subseteq C \subseteq G$ (so that informative samples are not discarded) and that $c$ can be computed. Ideally, we would select $C = F \cap H$; however, in some cases $c$ cannot be computed for this value of $C$. For example, in our later experiments we consider a problem where $h$ and $H$ are not known, but $F$ is, and so we can compute $c$ using $C = F$, but not $C = F \cap H$.

Let $k(X_n) := \sum_{i=1}^{n} \mathbf{1}_C(X_i)$ be the number of $X_i$ that are in $C$. The US estimator is then defined as:

$$\operatorname{US}(X_n) := \frac{c}{k(X_n)} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)}\, h(X_i), \tag{2}$$

if $k(X_n) > 0$, and $\operatorname{US}(X_n) := 0$ if $k(X_n) = 0$. This is equivalent to applying importance sampling to the pruned data set, $X'$, since then $g'(x) = g(x)/c$ for $x \in C$. Also, in (2) we sum over all $n$ samples rather than just the $k(X_n)$ samples in $C$ because $f(X_i) h(X_i) = 0$ for all $X_i$ not in $C$.

Although we analyze the US estimator as defined in (2), it can be generalized to use measure theoretic probability and to incorporate a control variate. In this more general setting, $f$ and $g$ are probability measures, $f$ is absolutely continuous with respect to $g$, $t(X_i)$ denotes a real-valued sample-dependent control variate, and

$$\operatorname{US}(X_n) := \frac{g(C)}{k(X_n)} \sum_{i=1}^{n} \left( \frac{df}{dg}(X_i)\, h(X_i) - t(X_i) \right) + \mathbf{E}_g[t(X)].$$

Theoretical Analysis of US

We begin with two simple theorems that elucidate the relationship between IS and US. The proofs of both theorems are straightforward, but deferred to the supplemental document. First, Theorem 1 shows that, when $C = G$, US degenerates to IS. One case where $C = G$ is when the supports of the target distribution and evaluation function are both equal to the support of the sampling distribution, i.e., when $F = H = G$, and so $C = G$ necessarily.

Theorem 1. If $C = G$, then $\operatorname{US}(X_n) = \operatorname{IS}(X_n)$.

Theorem 2 shows that, if we replace $c$ in the definition of US with an empirical estimate, $\hat{c}(X_n) := k(X_n)/n$, then US and IS are equivalent. This provides some intuition for why US tends to outperform IS when $C \subset G$: IS is US, but using an empirical estimate of the probability that a sample falls within $C$, in place of its known value.

Theorem 2.
If we replace $c$ with an empirical estimate, $\hat{c}(X_n) := k(X_n)/n$, then $\operatorname{US}(X_n) = \operatorname{IS}(X_n)$.

In Table 1 we summarize more theoretical results that clarify the differences between IS and US in several settings. The first setting is the standard setting where we consider the ordinary (unconditional) expected value and variance of the two estimators. The second setting conditions on the event that at least one sample falls within $C$, that is, the event that $k(X_n) > 0$. This is a reasonable setting to consider if one takes the view that no estimate should be returned if all of the samples are outside $C$. That is, if the pruned data set, $X'$, is empty, then no estimate should be produced or considered, just as IS does not produce an estimate when $n = 0$ (when there are no samples at all). Finally, the third setting conditions on the event that $k(X_n) = \kappa$, that is, that a specific constant number, $\kappa$, of the $n$ samples are in $C$.

Table 1 and the theorems that it references use additional symbols that we review here. Let $\rho := \Pr(k(X_n) > 0)$ be the probability that at least one of the $n$ samples is in $C$. Let $\operatorname{Var}_g$ denote the variance given that the random variables within the parentheses are sampled from the distribution with PDF $g$. Let

$$v := \operatorname{Var}_g\!\left( \frac{f(X)}{g(X)}\, h(X) \,\middle|\, X \in C \right)$$

be the conditional variance of the importance sampling estimate when using a single sample and given that the sample is in $C$. Let $B(n, c)$ denote the binomial distribution with parameters $n$ and $c$, and let $\mathbf{E}_{B(n,c)}$ denote the expected value given that $\kappa \sim B(n, c)$.

Although the proofs of the claims in Table 1 are some of the primary contributions of this work, we defer them to the supplemental document because they are straightforward (though lengthy) and do not provide further insights into the results.

The primary result of Table 1 is that US is unbiased and often has lower variance in the key setting of interest: when at least one sample is in the support of the target distribution (when $k(X_n) > 0$). We find this setting compelling because, when no samples are in $F$, little can be inferred about $\mathbf{E}_f[h(X)]$. In this setting (the entries of Table 1 conditioned on $k(X_n) > 0$), US is an unbiased estimator, while IS is not (although the bias of IS does go to zero as $n \to \infty$).
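A minimal sketch of the US estimator from (2), together with numerical checks of the statements of Theorems 1 and 2 on a tiny hand-picked data set. The function names and the membership-test argument `in_c` are hypothetical; the densities are those of the illustrative example with $C = F$.

```python
def is_estimate(samples, f, g, h):
    """Ordinary importance sampling without a control variate."""
    return sum(f(x) / g(x) * h(x) for x in samples) / len(samples)

def k_count(samples, in_c):
    """k(X_n): the number of samples that fall within C."""
    return sum(1 for x in samples if in_c(x))

def us_estimate(samples, f, g, h, in_c, c):
    """US as in equation (2); returns 0 when no sample falls within C."""
    k = k_count(samples, in_c)
    if k == 0:
        return 0.0
    return c / k * sum(f(x) / g(x) * h(x) for x in samples)

# Illustrative example: f uniform on [0, 1], g uniform on [0, 2], h = 1 on [0, 1].
def f(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def g(x):
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

h = f  # in the example the evaluation function coincides with f

def in_F(x):
    return 0.0 <= x <= 1.0

def in_G(x):
    return 0.0 <= x <= 2.0

xs = [0.2, 0.4, 0.6, 1.5]                       # three of four samples in F
us_known_c = us_estimate(xs, f, g, h, in_F, c=0.5)
# Theorem 2: plugging in the empirical estimate k/n reproduces IS.
c_hat = k_count(xs, in_F) / len(xs)
us_empirical_c = us_estimate(xs, f, g, h, in_F, c=c_hat)
# Theorem 1: with C = G (so c = 1), US reduces to IS.
us_c_equals_g = us_estimate(xs, f, g, h, in_G, c=1.0)
```

With the known value $c = 0.5$, US returns the exact answer $\theta = 1$ on this data set, while the two variants that mimic IS return $1.5$.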
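To understand the source of this bias, consider the bias of IS given that $k(X_n) = \kappa$ (the third setting in Table 1). In this case, $\mathbf{E}_g[\operatorname{IS}(X_n) \mid k(X_n) = \kappa] = \frac{\kappa}{nc}\, \theta$. Recall that IS uses an empirical estimate of $c$, i.e., $\hat{c}(X_n) = \kappa/n$ (as discussed in Theorem 2). When this estimate is correct, the terms in $\frac{\kappa}{nc}$ cancel, making IS unbiased. Thus, the bias of IS when conditioning on the event that $k(X_n) > 0$ stems from IS's use of an estimate of $c$.

Footnote: If we do not condition on the event that $k(X_n) > 0$, then US is a biased estimator of $\theta$. This is because it is unclear how to define $\operatorname{US}(X_n)$ when $k(X_n) = 0$, and we chose (arbitrarily) to define it to be $0$. However, the bias of $\operatorname{US}(X_n)$ in this setting converges quickly to zero, since $\rho$ (the probability that at least one sample falls within $C$) converges quickly to one as $n \to \infty$.

Next we discuss the variance of the two estimators given that at least one sample falls within $C$, i.e., in the setting conditioned on $k(X_n) > 0$. First consider how the variances of IS and US change as $c \to 0$, that is, as the differences between the supports of the sampling and target distributions increase. Specifically, let $c_i := 1/i$ for $i \in \mathbb{N}_{>0}$. We then have that:

$$\operatorname{Var}\bigl(\operatorname{IS}(X_n) \mid k(X_n) > 0, c_i\bigr) \geq \frac{c_i v}{\rho n} = \frac{v}{\rho\, i\, n} \geq \frac{v}{i\, n},$$

since $\rho \in (0, 1]$, and

$$\operatorname{Var}\bigl(\operatorname{US}(X_n) \mid k(X_n) > 0, c_i\bigr) = \frac{v}{i^2}\, \mathbf{E}_{B(n,c_i)}\!\left[\frac{1}{\kappa} \,\middle|\, \kappa > 0\right] \leq \frac{v}{i^2},$$

since $\mathbf{E}_{B(n,c_i)}[1/\kappa \mid \kappa > 0] \leq 1$. Thus, as $i \to \infty$ (as $c \to 0$ logarithmically), and given some fixed $n$ and $v$, the variance of US goes to zero much faster than the variance of IS. The variance of US as a function of $i$ converges to zero linearly (or faster), while the variance of IS converges to zero sublinearly (at best, logarithmically).

Table 1: Theoretical properties of the IS and US estimators. Entries are given with no conditions, conditioned on the event that $k(X_n) > 0$ (at least one sample is in $C$), and conditioned on the event that $k(X_n) = \kappa$ (exactly $\kappa$ of the $n$ samples are in $C$). All theorems require the assumption that $F \cap H \subseteq G$. The consistency results follow immediately from the fact that the biases and variances all converge to zero as $n \to \infty$ (Thomas and Brunskill, 2016, Lemma 3).

IS:
  $\mathbf{E}_g[\operatorname{IS}(X_n)] = \theta$ (Property 1)
  $\mathbf{E}_g[\operatorname{IS}(X_n) \mid k(X_n) > 0] = \theta / \rho$ (Theorem 6)
  $\mathbf{E}_g[\operatorname{IS}(X_n) \mid k(X_n) = \kappa] = \frac{\kappa}{nc}\, \theta$ (Theorem 5)
  $\operatorname{Var}(\operatorname{IS}(X_n)) = \frac{c v}{n} + \theta^2\, \frac{1 - c}{c n}$
  $\operatorname{Var}(\operatorname{IS}(X_n) \mid k(X_n) > 0) = \frac{c v}{\rho n} + \theta^2 \left( \frac{1 - c}{\rho c n} + \frac{\rho - 1}{\rho^2} \right)$ (Theorem 9)
  Strongly consistent: yes

US:
  $\mathbf{E}_g[\operatorname{US}(X_n)] = \rho\, \theta$ (Theorem 7)
  $\mathbf{E}_g[\operatorname{US}(X_n) \mid k(X_n) > 0] = \theta$ (Theorem 4)
  $\mathbf{E}_g[\operatorname{US}(X_n) \mid k(X_n) = \kappa] = \theta$ (Theorem 3)
  $\operatorname{Var}(\operatorname{US}(X_n)) = \rho\, c^2 v\, \mathbf{E}_{B(n,c)}[1/\kappa \mid \kappa > 0] + \theta^2 \rho (1 - \rho)$ (Theorem 10)
  $\operatorname{Var}(\operatorname{US}(X_n) \mid k(X_n) > 0) = c^2 v\, \mathbf{E}_{B(n,c)}[1/\kappa \mid \kappa > 0]$ (Theorem 8)
  Strongly consistent: yes

Next note that the variance of US in this setting is independent of $\theta^2$, but the variance of IS increases with $\theta^2$ (see Property 3 in the supplemental document, applied to Theorem 9). To ameliorate this issue, a control variate, $t$, can be used to center the data so that $\theta \approx 0$. However, since $\theta$ is not known a priori, selecting $t = \theta$ is not practical. The term that scales with $\theta^2$ in the variance of IS given that $k(X_n) > 0$ therefore means that the variance of IS depends on the quality of the control variate: poor control variates can cause IS to have high variance. By contrast, the variance of US in this setting does not have a term that scales with $\theta^2$, and so the quality of the control variate is less important.

There is a rare case when IS can have a lower variance than US. First, we assume that the control variate is perfect so that $\theta = 0$ (which, as discussed before, is impractical) and consider the term that scales with $v$. From this term, it is clear that US will have lower variance than IS if:

$$\mathbf{E}_{B(n,c)}\!\left[\frac{1}{\kappa} \,\middle|\, \kappa > 0\right] \leq \frac{1}{\rho\, c\, n}. \tag{3}$$

Notice that this inequality depends only on $n$ and $c$, which must both be known in order to implement US, and so we can test a priori whether US will have lower variance than IS. That is, if (3) holds, then US will have lower variance than IS, given that $k(X_n) > 0$.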
However, if (3) does not hold, it does not mean that IS will have lower variance than US unless the perfect (typically unknown) control variate is used so that $\theta = 0$.

Footnote: The quality of the control variate can still impact the variance of US estimates though, since it can change $v$.

Application to Illustrative Example

Because neither method is always superior, here we consider the application of IS and US to the illustrative example to see when each method works best, and by how much. We consider the setting where $C = F$, but modify the example slightly. First, although the target distribution is always uniform, we allow for its support to be scaled. Specifically, we define the support of $f$ to be $[0, F_{\max}]$, where $F_{\max} \in (0, 2]$. When $F_{\max}$ is small, it corresponds to significant differences in support, while large $F_{\max}$ correspond to small differences: when $F_{\max} = 2$, $C = F = G$ and so the two estimators are equivalent. We also modify $h$ to allow for various values of $\theta$. Specifically, we define $h(x) = \theta + 1$ if $x < F_{\max}/2$ and $h(x) = \theta - 1$ if $x \geq F_{\max}/2$. Notice that, although we defined $h$ in terms of $\theta$, $\theta$ remains $\mathbf{E}_f[h(X)]$, and also that using this definition of $h$ and $\theta = 0$ is an instance that is particularly favorable to IS.

For this example, it is straightforward to verify that $v = 4/F_{\max}^2$ for any definition of $\theta$, and $c = F_{\max}/2$. Given these two values and $n$, we can compute the bias and variance of each estimator. The biases and variances of the two estimators for various settings are depicted in Figure 2. Notice that US is always competitive with IS, although the reverse is not true. Particularly, when $F_{\max}$ is small (so that $c$ is small), or when $\theta$ is large, US can have orders of magnitude lower variance than IS. Also, as $n$ increases, the two estimators become increasingly similar, since the empirical estimate of $c$ used by IS becomes increasingly accurate, although US is still vastly superior to IS even when $n$ is large if $c$ is correspondingly small.
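The a priori test in inequality (3) and the conditional variances for this example can be evaluated exactly, since they depend only on $n$, $c$, $v$, and $\theta$. The sketch below is an illustration under stated assumptions: the closed forms for the two conditional variances are derived here from the definitions of $\rho$, $v$, and $c$ rather than copied from the supplemental document, so they should be checked against it, and all function names are hypothetical.

```python
from math import comb

def rho(n, c):
    """Probability that at least one of n samples falls within C."""
    return 1.0 - (1.0 - c) ** n

def expected_inverse_k(n, c):
    """E[1/kappa | kappa > 0] for kappa ~ Binomial(n, c), by direct summation."""
    total = sum(comb(n, k) * c**k * (1.0 - c) ** (n - k) / k
                for k in range(1, n + 1))
    return total / rho(n, c)

def inequality_3_holds(n, c):
    """A priori check of inequality (3)."""
    return expected_inverse_k(n, c) <= 1.0 / (rho(n, c) * c * n)

def var_us_given_hit(n, c, v):
    """Variance of US given k(X_n) > 0 (assumed form, derived from the definitions)."""
    return c**2 * v * expected_inverse_k(n, c)

def var_is_given_hit(n, c, v, theta):
    """Variance of IS given k(X_n) > 0 (assumed form, derived from the definitions)."""
    r = rho(n, c)
    return c * v / (r * n) + theta**2 * ((1.0 - c) / (r * c * n) + (r - 1.0) / r**2)

# Illustrative example with F_max = 0.5: c = F_max / 2 and v = 4 / F_max^2.
f_max = 0.5
c_example, v_example = f_max / 2.0, 4.0 / f_max**2
us_var = var_us_given_hit(50, c_example, v_example)
is_var = var_is_given_hit(50, c_example, v_example, theta=10.0)
```

For $F_{\max} = 0.5$, $n = 50$, and $\theta = 10$, this evaluates to a US variance well below 0.1 against an IS variance of about 6. Note that (3) is a sufficient condition only: it can fail while US still has far lower variance, because the $\theta^2$ term dominates the variance of IS.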
This matches our theoretical analysis from the previous section: we expect US to perform better when $c$ is small (by our convergence rate analysis) or when $\theta$ is large (due to US's lesser dependence on the quality of the control variate), and we expect the two estimators to become increasingly similar as $n \to \infty$ because $\hat{c}$ becomes increasingly similar to $c$. Notice also that gains are not only obtained when $c$ is so small relative to $n$ that no samples are expected to fall within $C$ (a relatively uninteresting setting). For example, the right-most plot in Figure 2 shows that with $F_{\max} = 0.5$, where $\Pr(k(X_{50}) > 0) = \rho \approx 1$, the MSE of US is approximately 0.086, while the MSE of IS is approximately 6.08: US has roughly 1/70 the MSE of IS (1/8 the RMSE).

Perhaps surprisingly, there are cases where IS has lower variance than US even when both are unbiased, since $\theta = 0$. For example, consider the plot with $\theta = 0$, and the position on the horizontal axis that corresponds to $F_{\max} = 1.0$. This is one case where IS is marginally better than US: it has lower variance in both settings, and neither estimator is biased. Intuitively, the IS estimator includes the points outside the support of $F$, although they have associated values, $h(X_i) = 0$, which pulls the importance sampling estimate towards zero. In this case, when $\theta = 0$, this extra pull towards zero happens to be beneficial. However, to remain unbiased given the pull towards zero, IS also increases the magnitudes of the weights associated with points in $F$, which incurs additional variance. When $F_{\max}$ is small enough, this additional variance outweighs the variance reduction that results from the extra pull towards zero, and so US is again superior. This intuition is supported by the fact that in Figure 2 IS does not outperform US for small $F_{\max}$ or when $\theta \neq 0$, since then a pull towards zero is detrimental.

Figure 2: The variances of IS and US across various settings of $\theta$ and $n$ (denoted along the left and top). At a glance, notice that the red and green curves (US) tend to be below the black curves (IS), particularly when considering the logarithmic scale of the vertical axes. The dotted lines show the variance conditioned on the event that $k(X_n) > 0$. The green line shows the mean squared error of the US estimator without any conditions, which shows that the variance reduction of US is not completely offset by increased bias (compare the solid black and green curves). When $\theta = 0$ the green line obscures the solid red line. The plot on the right shows a zoomed-in view of the $\theta = 10$, $n = 50$ plot without the logarithmic vertical axis.

Finally, we consider the use of IS and US to create high-confidence upper and lower bounds on $\theta$ using a concentration inequality (Massart, 2007) like Hoeffding's inequality (Hoeffding, 1963). If $b$ denotes the range of the function $f(x)h(x)/g(x)$, for $x \in G$, then using Hoeffding's inequality, we have that

$$\operatorname{IS}(X_n) - b \sqrt{\frac{\ln(1/\delta)}{2n}}$$

is a $1 - \delta$ confidence lower bound on $\theta$. Similarly, we can use US with Hoeffding's inequality to create a $1 - \delta$ confidence lower bound:

$$\operatorname{US}(X_n) - c\, b \sqrt{\frac{\ln(1/\delta)}{2\, k(X_n)}},$$

since the range of the $k(X_n)$ i.i.d. random variables averaged by $\operatorname{US}(X_n)$ is $cb$. Notice that, if $k(X_n) = 0$, then this second estimator is undefined (one might define the lower bound to be a known lower bound on $\theta$ in this setting). Although we expect that $k(X_n) \approx cn$, the resulting $c$ in the denominator of the US-based bound is within the square root, while the $c$ in the numerator is not, and so the bound constructed using US should tend to be tighter when $c$ is small.

Application to Diabetes Treatment

We applied US and IS to the problem of predicting the effectiveness of altering the treatment policy for a particular person with type 1 diabetes.
That is, we would like to use prior data from when the individual was treated with one treatment policy to estimate how well a related policy would work. The treatment policy is parameterized by two numbers, CR and CF, and dictates how much insulin a person should inject prior to eating a meal in order to keep his or her blood glucose close to optimum levels. CR and CF are typically specified by a diabetologist and tweaked during follow-up visits every 3 to 6 months. If follow-up visits are not an option, recent research has suggested using reinforcement learning algorithms to tune CR and CF (Bastani, 2014). Here we focus on a sub-problem of improving CR and CF: using data collected from an initial range of admissible values of CR and CF to predict how well a new range of values for CR and CF would perform.

When collecting data, CR and CF are drawn uniformly from an initial admissible range, and then used for one day (which we view as one episode of a Markov decision process). The performance during each day is measured using an objective function similar to the reward function proposed by Bastani (2014), which measures the deviation of blood glucose from optimum levels, with larger penalties for low blood glucose levels. We refer to the measure of how good the outcome was from one day as the return associated with that day, with larger values being better. Using approximately 30 days of data, our goal is to estimate the expected return if a different distribution of CR and CF were to be used.

We consider a specific in silico person (a person simulated using a metabolic simulator). We used the subject Adult#003 in the Type 1 Diabetes Metabolic Simulator (T1DMS) (Dalla Man et al., 2014), a simulator that has been approved by the US Food and Drug Administration as a substitute for animal trials in pre-clinical testing of treatment policies for type 1 diabetes. During each day, the subject is given three or four meals of randomized sizes at randomized times, similar to the experimental setup proposed by Bastani (2014). As a result of this randomness, and the stochastic nature of the T1DMS model, applying the same values of CR and CF can produce different returns if used for multiple days.

Figure 3: The first and second plots show an estimate of the expected return for various CR and CF, from two different angles (the second is a side-view of the first). The second plot also includes blue points depicting the Monte Carlo returns observed from using different values of CR and CF for a day (notice the high variance). The two plots on the right depict the bias, variance, and MSE of IS, US, and WIS (without any conditioning) for various values of $c$ and both without (third plot) and with (fourth plot) a control variate. The curves for US are largely obscured by the corresponding curves for WIS. Notice that the variance of IS approaches 0.06, which is enormous relative to the difference between the best and worst CR and CF pairs possible under the sampling policy.

After analyzing the performance of many CR and CF pairs, we selected an initial range that results in good performance: $\mathrm{CR} \in [8.5, 11]$ and $\mathrm{CF} \in [10, 15]$. Using a large number of samples, we computed a Monte Carlo estimate of the expected return if different CR and CF values are used for a single day; this estimate is depicted in Figure 3. As described by Bastani (2014), when the value of CR is set appropriately, performance is robust to changes in CF. We therefore focus on possible changes to CR. Specifically, we consider new treatment policies where CF remains sampled from the uniform distribution over $[10, 15]$, but where CR is sampled from the truncated normal distribution over $[\mathrm{CR}_{\min}, 11]$, with mean $11$ and standard deviation $11 - \mathrm{CR}_{\min}$. This distribution places the largest probability densities at the upper end of the range of CR, which favors better policies. As $\mathrm{CR}_{\min}$ increases towards $11$, the supports of the sampling distribution and target distribution become increasingly different ($c = (11 - \mathrm{CR}_{\min})/2.5$) and the expected return increases.
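The sampling and target distributions over CR described above can be sketched as follows. This is a hedged illustration: the numeric ranges are the ones read from the text, the function names are hypothetical, and because CF has the same distribution under both policies the importance weight is assumed to depend on CR only.

```python
import math

CR_LOW, CR_HIGH = 8.5, 11.0  # sampling range of CR, as read from the text

def support_probability(cr_min):
    """c: probability that a uniformly sampled CR lands in [cr_min, CR_HIGH]."""
    return (CR_HIGH - cr_min) / (CR_HIGH - CR_LOW)

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def target_pdf(cr, cr_min):
    """Normal with mean CR_HIGH and std (CR_HIGH - cr_min), truncated to [cr_min, CR_HIGH]."""
    if not cr_min <= cr <= CR_HIGH:
        return 0.0
    sigma = CR_HIGH - cr_min
    z = (cr - CR_HIGH) / sigma
    # The truncation interval spans one standard deviation below the mean.
    mass = normal_cdf(0.0) - normal_cdf(-1.0)
    return normal_pdf(z) / (sigma * mass)

def importance_weight(cr, cr_min):
    """f/g for one day of data; the sampling density of CR is uniform."""
    sampling_pdf = 1.0 / (CR_HIGH - CR_LOW)
    return target_pdf(cr, cr_min) / sampling_pdf

c_quarter = support_probability(10.375)  # the setting with c = 1/4
```

Days whose sampled CR falls below $\mathrm{CR}_{\min}$ receive weight zero, which is exactly the data that US prunes before averaging.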
For each value of $\mathrm{CR}_{\min}$ (each of which corresponds to a value of $c$), we performed ,433 trials, each of which involved generating the returns from 30 days, where the values of CR and CF used for each day were sampled uniformly from $\mathrm{CR} \in [8.5, 11]$ and $\mathrm{CF} \in [10, 15]$, and then using IS, US, and weighted importance sampling (WIS) to estimate the expected return if CR and CF were sampled from the target distribution (the truncated Gaussian parameterized by $\mathrm{CR}_{\min}$). Figure 3 displays the bias, variance, and mean squared error (MSE) of these ,433 estimates, using an estimate of ground truth computed using Monte Carlo sampling. Figure 3 also shows the impact of providing a constant control variate to all the estimators: the chosen control variate was the expected return under the sampling distribution.

Notice that we see the same trend as in the illustrative example: for small $c$ (the best treatment policies, which have small ranges of CR), US significantly outperforms IS. Furthermore, when a decent control variate is not used, the benefits of US are increased, even when controlling for the resulting bias by measuring the mean squared error. We also computed the biases and variances given that $k(X_n) > 0$, and observed similar results (not shown), which favored US slightly more.

Notice that WIS and US perform very similarly. Indeed, if the sampling and target distributions are both uniform, it is straightforward to verify that WIS and US are equivalent. In other experiments (not shown) we found that WIS yields lower variance than US when the target distribution is modified to be even less like the uniform distribution. However, it is often important to be able to produce confidence intervals around estimates (especially when data is limited), and since WIS is biased, it cannot be used with standard concentration inequalities. We used Hoeffding's inequality to compute a 90% confidence interval around the estimates produced by IS and US (without control variates and with $\mathrm{CR}_{\min} = 10.375$, so that $c = 1/4$) using various numbers of samples (days of data).
The mean confidence intervals are depicted in Figure 4, which also shows a Monte Carlo estimate of $\theta$, as well as deterministic (domain-specific) upper and lower bounds on $h(x)$, denoted by "h range" in the legend. If $k(X_n) = 0$, then US is not defined, and so the confidence intervals shown for US are averaged only over the instances where $k(X_n) > 0$. To show how often US returns a solution, Figure 4 also shows $\rho$ (the probability that US will produce a confidence bound) using the right vertical axis for scale.

US produces a much tighter confidence interval than IS in all cases. Furthermore, the setting where US often does not return a bound corresponds to the setting where IS produces a confidence interval that is outside the deterministic bound on $h(x)$ (a trivial confidence interval). In additional experiments (not shown) we defined the bounds to be truncated to always be within the deterministic bounds on $h(x)$, and defined the bound produced using US to be conservative (equal to the deterministic bounds) when $k(X_n) = 0$. In this experiment we saw similar results: the confidence intervals produced using US were much tighter than those using IS.
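The Hoeffding-based lower bounds used for IS and US can be written as two small functions. This is a sketch under the assumptions stated earlier in the paper ($b$ is the range of $f(x)h(x)/g(x)$ on $G$, and the US bound is undefined when $k(X_n) = 0$); the function names and the example values of $b$, $n$, and $\delta$ are hypothetical.

```python
import math

def hoeffding_lower_bound_is(is_estimate, b, n, delta):
    """1 - delta confidence lower bound on theta built from an IS estimate."""
    return is_estimate - b * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def hoeffding_lower_bound_us(us_estimate, b, c, k, delta):
    """1 - delta confidence lower bound built from a US estimate.

    Returns None when k = 0, where the bound is undefined; a caller might
    substitute a known lower bound on theta in that case.
    """
    if k == 0:
        return None
    return us_estimate - c * b * math.sqrt(math.log(1.0 / delta) / (2.0 * k))

# When k equals its expected value c * n, the US slack is sqrt(c) times the IS slack.
b, n, c, delta = 4.0, 40, 0.25, 0.05
k = int(c * n)
slack_is = 0.0 - hoeffding_lower_bound_is(0.0, b, n, delta)
slack_us = 0.0 - hoeffding_lower_bound_us(0.0, b, c, k, delta)
```

With $c = 1/4$ and $k(X_n) = cn$, the US slack is half the IS slack, which illustrates why the US-based bound tends to be tighter when $c$ is small.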

Figure 4: Confidence bounds using IS and US.

Should One Use US or WIS in Practice?

The results presented in the previous section might raise the question: when should one use US rather than WIS? Previously we hinted at the problem with WIS: it is a biased estimator. Here we discuss why this theoretical property has important practical ramifications that rule out the use of WIS, but not US, for many high-risk problems.

First we list the troublesome theoretical properties of the WIS estimator, which are discussed in the work of Thomas (2015, Section 3.8). When there is only a single sample, i.e., when $n = 1$, WIS is an unbiased estimator of $\mathbf{E}_g[h(X)]$. As $n$ increases, the expected value of the WIS estimator shifts towards the target value, $\mathbf{E}_f[h(X)]$. If the samples that are likely under $g$ are extremely unlikely under $f$, then the shift of the expected value of the WIS estimator from $\mathbf{E}_g[h(X)]$ to $\mathbf{E}_f[h(X)]$ can be exceedingly slow.

Consider what this would mean for our diabetes experiment. Here the behavior policy (sampling distribution) is a relatively decent policy that we might be considering changing. The evaluation policy (target distribution) might be a new treatment policy that is both dangerously worse than the behavior policy and quite different from the behavior policy. To determine whether the evaluation policy should be deployed, we might rely on high-confidence guarantees, as has been suggested for similar problems (Thomas et al., 2015a). That is, we might use Hoeffding's inequality to construct a high-confidence lower bound on the expected value of the WIS estimator, and then require this bound to be not far below the performance of the behavior policy. Because the behavior and evaluation policies are quite different, the WIS estimator will produce relatively low-variance estimates centered near the performance of the reasonable behavior policy, rather than estimates centered near the dangerously poor performance of the evaluation policy. This means that the lower bound that we compute will be a lower bound on the performance of the decent behavior policy, rather than the true poor performance of the evaluation policy.
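The single-sample behavior of WIS can be seen directly from its self-normalized form. The sketch below uses the standard definition of weighted importance sampling (the weighted sum divided by the sum of the weights); the densities are hypothetical and share the same support so that every weight is positive.

```python
def wis_estimate(samples, f, g, h):
    """Weighted (self-normalized) importance sampling estimate of E_f[h(X)]."""
    weights = [f(x) / g(x) for x in samples]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # arbitrary value when every weight is zero
    return sum(w * h(x) for w, x in zip(weights, samples)) / total

# Hypothetical densities on [0, 1]: g is uniform, f(x) = 2x, and h(x) = x,
# so E_g[h(X)] = 1/2 while E_f[h(X)] = 2/3.
def f(x):
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def g(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def h(x):
    return x

single = wis_estimate([0.3], f, g, h)    # the weight cancels, leaving h(0.3)
pair = wis_estimate([0.2, 0.8], f, g, h)
```

With one sample the weight cancels and the estimate is simply $h(X_1)$, whose expectation under the sampling distribution is $\mathbf{E}_g[h(X)]$ rather than $\mathbf{E}_f[h(X)]$. This is the bias at $n = 1$ described above.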
Moreover, if one uses Student's t-test or a bootstrap method to construct the confidence interval, as has been suggested when using WIS (Thomas et al., 2015b), we might obtain a very tight confidence interval around the performance of the behavior policy. This exemplifies the problem with using WIS for high-risk problems: the bias of the WIS estimator can cause us to often erroneously conclude that dangerous policies are safe to deploy.

Conclusion and Future Work

We have presented a simple new variant of importance sampling, US. Our analytical and empirical results suggest that US can significantly outperform ordinary importance sampling when the supports of the sampling and target distributions differ. We also provide an inequality that can be evaluated prior to observing any data, and which, if satisfied, guarantees that US will have lower variance than ordinary importance sampling. Unlike some other importance sampling estimators that have been developed to reduce variance (like WIS), US is unbiased given mild conditions that still permit the easy computation of confidence intervals.

References

M. Bastani. Model-free intelligent diabetes management using machine learning. Master's thesis, Department of Computing Science, University of Alberta, 2014.
C. Dalla Man, F. Micheletto, D. Lv, M. Breton, B. Kovatchev, and C. Cobelli. The UVA/Padova type 1 diabetes simulator: new features. Journal of Diabetes Science and Technology, 8(1):26-34, 2014.
M. Dudík, J. Langford, and L. Li. Doubly robust policy evaluation and learning. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, 2011.
J. M. Hammersley. Monte Carlo methods for solving multivariable problems. Annals of the New York Academy of Sciences, 86(3), 1960.
J. M. Hammersley and D. C. Handscomb. Monte Carlo methods, Methuen & Co. Ltd., London, page 40, 1964.
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.
N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.
H. Kahn. Use of different Monte Carlo sampling techniques. Technical Report P-766, The RAND Corporation, September 1955.
P. Massart. Concentration Inequalities and Model Selection. Springer, 2007.
R. Rubinstein. Simulation and the Monte Carlo method. Wiley, New York, 1981.
P. S. Thomas. Safe Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2015.
P. S. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.
P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the Twenty-Ninth Conference on Artificial Intelligence, 2015a.
P. S. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning, 2015b.

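Supplemental Document

In this supplemental document we prove the various properties and theorems referenced earlier, particularly those in Table 1.

Property 1. If $F \cap H \subseteq G$ then $E_g[\operatorname{IS}(X_n)] = \theta$.

$$
\begin{aligned}
E_g[\operatorname{IS}(X_n)] &= E_g\!\left[\frac{1}{n}\sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)} h(X_i)\right] \overset{(a)}{=} E_g\!\left[\frac{f(X)}{g(X)} h(X)\right] = \int_G g(x) \frac{f(x)}{g(x)} h(x)\,dx \\
&\overset{(b)}{=} \int_{F \cap H} g(x) \frac{f(x)}{g(x)} h(x)\,dx = \int_{F \cap H} f(x) h(x)\,dx = E_f[h(X)] = \theta,
\end{aligned}
$$

where (a) holds because $\operatorname{IS}(X_n)$ is the mean of $n$ independent and identically distributed random variables, and (b) holds because $f(x)h(x) = 0$ for all $x \in G \setminus (F \cap H)$.

We now provide a proof of Theorem 1, which states that if $C = G$, then $\operatorname{US}(X_n) = \operatorname{IS}(X_n)$. In this setting, $c = \int_G g(x)\,dx = 1$, and since every $X_i$ must be within $C$, $k(X_n) = n$. So,

$$
\operatorname{US}(X_n) = \frac{c}{k(X_n)} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)} h(X_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)} h(X_i) = \operatorname{IS}(X_n).
$$

We now provide a proof of Theorem 2, which states that if we replace $c$ with an empirical estimate, $\hat{c}(X_n) := k(X_n)/n$, then $\operatorname{US}(X_n) = \operatorname{IS}(X_n)$. Using the empirical estimate, $\hat{c}(X_n)$, in place of $c$ within US we have:

$$
\operatorname{US}(X_n) = \frac{\hat{c}(X_n)}{k(X_n)} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)} h(X_i) = \frac{k(X_n)}{n\,k(X_n)} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)} h(X_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)} h(X_i) = \operatorname{IS}(X_n).
$$

Theorem 3. If $F \cap H \subseteq C \subseteq G$ and $\kappa \in \mathbb{N}_{>0}$, then $E_g[\operatorname{US}(X_n) \mid k(X_n) = \kappa] = \theta$.

Let $\Pr_g(X \in C)$ denote the probability that a sample, $X$, from the sampling distribution is in $C$.

$$
\begin{aligned}
E_g[\operatorname{US}(X_n) \mid k(X_n) = \kappa] &= E_g\!\left[\frac{c}{\kappa} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)} h(X_i) \,\middle|\, k(X_n) = \kappa\right] \\
&\overset{(a)}{=} \frac{c}{\kappa}\, E_g\!\left[\sum_{i=1}^{\kappa} \frac{f(X_i)}{g(X_i)} h(X_i) \,\middle|\, \forall i \in \{1, \dots, \kappa\},\ X_i \in C\right] \\
&\overset{(b)}{=} c\, E_g\!\left[\frac{f(X)}{g(X)} h(X) \,\middle|\, X \in C\right] \\
&\overset{(c)}{=} c \int_C \frac{g(x)}{\Pr_g(X \in C)} \frac{f(x)}{g(x)} h(x)\,dx \\
&\overset{(d)}{=} \int_C g(x) \frac{f(x)}{g(x)} h(x)\,dx = \int_C f(x) h(x)\,dx \\
&\overset{(e)}{=} E_f[h(X)] = \theta,
\end{aligned}
$$

where (a) holds because $f(X_i)h(X_i) = 0$ for all but $\kappa$ of the terms in the summation, and so by re-ordering the $X_i$ so that these terms have indices $1, \dots, \kappa$ we need only sum to $\kappa$ rather than $n$, (b) holds because the summation is over $\kappa$ independent and identically distributed random variables, (c) holds by the definition of conditional expectations, (d) holds because $\Pr_g(X \in C) = c$, and (e) holds because $F \cap H \subseteq C$.

Theorem 4. If $F \cap H \subseteq C \subseteq G$ then $E_g[\operatorname{US}(X_n) \mid k(X_n) > 0] = \theta$.

$$
E_g[\operatorname{US}(X_n) \mid k(X_n) > 0] = \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\Pr(k(X_n) > 0)}\, E_g[\operatorname{US}(X_n) \mid k(X_n) = \kappa] \overset{(a)}{=} \theta \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\Pr(k(X_n) > 0)} = \theta,
$$

where (a) holds because, by Theorem 3, $E_g[\operatorname{US}(X_n) \mid k(X_n) = \kappa] = \theta$.

Theorem 5. If $F \cap H \subseteq C \subseteq G$ and $\kappa \in \mathbb{N}_{>0}$, then

$$
E_g[\operatorname{IS}(X_n) \mid k(X_n) = \kappa] = \frac{\kappa}{nc}\,\theta. \tag{4}
$$

Following roughly the same steps as used to prove Theorem 3 we have that:

$$
\begin{aligned}
E_g[\operatorname{IS}(X_n) \mid k(X_n) = \kappa] &= E_g\!\left[\frac{1}{n} \sum_{i=1}^{n} \frac{f(X_i)}{g(X_i)} h(X_i) \,\middle|\, k(X_n) = \kappa\right] \\
&= \frac{1}{n}\, E_g\!\left[\sum_{i=1}^{\kappa} \frac{f(X_i)}{g(X_i)} h(X_i) \,\middle|\, \forall i \in \{1, \dots, \kappa\},\ X_i \in C\right] \\
&= \frac{\kappa}{n}\, E_g\!\left[\frac{f(X)}{g(X)} h(X) \,\middle|\, X \in C\right] \\
&= \frac{\kappa}{n} \int_C \frac{g(x)}{c} \frac{f(x)}{g(x)} h(x)\,dx = \frac{\kappa}{nc}\, E_f[h(X)],
\end{aligned}
$$

and so (4) follows.

Theorem 6. If $F \cap H \subseteq C \subseteq G$ then $E_g[\operatorname{IS}(X_n) \mid k(X_n) > 0] = \dfrac{\theta}{1 - (1-c)^n}$.

Recall from Property 1 that $E_g[\operatorname{IS}(X_n)] = \theta$. By marginalizing over whether or not $k(X_n) > 0$, we also have that:

$$
E_g[\operatorname{IS}(X_n)] = \Pr(k(X_n) > 0)\, E_g[\operatorname{IS}(X_n) \mid k(X_n) > 0] + \Pr(k(X_n) = 0)\, E_g[\operatorname{IS}(X_n) \mid k(X_n) = 0].
$$

So,

$$
E_g[\operatorname{IS}(X_n) \mid k(X_n) > 0] = \frac{\theta - \Pr(k(X_n) = 0)\, E_g[\operatorname{IS}(X_n) \mid k(X_n) = 0]}{\Pr(k(X_n) > 0)} \overset{(a)}{=} \frac{\theta}{1 - (1-c)^n},
$$

where (a) holds because $E_g[\operatorname{IS}(X_n) \mid k(X_n) = 0] = 0$ and $\Pr(k(X_n) > 0) = 1 - \Pr(k(X_n) = 0) = 1 - (1-c)^n$.

Theorem 7. If $F \cap H \subseteq C \subseteq G$, then $E_g[\operatorname{US}(X_n)] = \left(1 - (1-c)^n\right)\theta$.

$$
E_g[\operatorname{US}(X_n)] = \Pr(k(X_n) > 0)\, \underbrace{E_g[\operatorname{US}(X_n) \mid k(X_n) > 0]}_{=\,\theta \text{, by Theorem 4}} + \Pr(k(X_n) = 0)\, \underbrace{E_g[\operatorname{US}(X_n) \mid k(X_n) = 0]}_{=\,0} = \left(1 - (1-c)^n\right)\theta.
$$

Before continuing, recall the following property, which we prove for completeness:

Property 2. Let $X_1, \dots, X_n$ be $n$ independent and identically distributed random variables, each with finite mean and variance. Then,

$$
E\!\left[\left(\sum_{i=1}^{n} X_i\right)^{2}\right] = n \operatorname{Var}(X_1) + n^2 E[X_1]^2.
$$

Recall that

$$
\operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right) = E\!\left[\left(\sum_{i=1}^{n} X_i\right)^{2}\right] - E\!\left[\sum_{i=1}^{n} X_i\right]^{2}.
$$

So, by rearranging terms:

$$
E\!\left[\left(\sum_{i=1}^{n} X_i\right)^{2}\right] = \operatorname{Var}\!\left(\sum_{i=1}^{n} X_i\right) + E\!\left[\sum_{i=1}^{n} X_i\right]^{2}.
$$

Since the $X_i$ are independent and identically distributed, we therefore have that:

$$
E\!\left[\left(\sum_{i=1}^{n} X_i\right)^{2}\right] = n \operatorname{Var}(X_1) + \left(n E[X_1]\right)^2 = n \operatorname{Var}(X_1) + n^2 E[X_1]^2.
$$

Theorem 8. If $F \cap H \subseteq C \subseteq G$ then

$$
\operatorname{Var}_g\!\left(\operatorname{US}(X_n) \mid k(X_n) > 0\right) = c^2 v\, E\!\left[\frac{1}{B(n,c)} \,\middle|\, B(n,c) > 0\right].
$$

$$
\begin{aligned}
\operatorname{Var}_g(\operatorname{US}(X_n) \mid k(X_n) > 0) &= E_g[\operatorname{US}(X_n)^2 \mid k(X_n) > 0] - E_g[\operatorname{US}(X_n) \mid k(X_n) > 0]^2 \\
&= E_g[\operatorname{US}(X_n)^2 \mid k(X_n) > 0] - \theta^2 \\
&= \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\Pr(k(X_n) > 0)}\, E_g[\operatorname{US}(X_n)^2 \mid k(X_n) = \kappa] - \theta^2.
\end{aligned} \tag{5}
$$

We will write $y$ to denote a vector in $\mathbb{R}^n$, the elements of which are $y_1, \dots, y_n \in \mathbb{R}$. We also write $y_{i:j}$ to denote the $i$th through $j$th entries of $y$, i.e., $y_{i:j} := [y_i, y_{i+1}, \dots, y_{j-1}, y_j]$. Let $G_\kappa := \{y \in G^n : k(y) = \kappa\}$ be the set of all possible tuples of $n$ samples where exactly $\kappa$ are in $C$. We also overload the definition of $g$ by defining $g(y) := \prod_i g(y_i)$. Using this notation, we have that (where $\times$ is used to denote that a long line is split across multiple lines via scalar multiplication):

$$
\begin{aligned}
E_g[\operatorname{US}(X_n)^2 \mid k(X_n) = \kappa] &= \int_{G_\kappa} \frac{g(y)}{\Pr(k(X_n) = \kappa)} \operatorname{US}(y)^2\,dy \\
&\overset{(a)}{=} \frac{\binom{n}{\kappa}}{\Pr(k(X_n) = \kappa)} \int_{C^\kappa} \int_{(G \setminus C)^{n-\kappa}} g(y) \operatorname{US}(y)^2\,dy_{\kappa+1:n}\,dy_{1:\kappa} \\
&\overset{(b)}{=} \frac{\binom{n}{\kappa}}{\Pr(k(X_n) = \kappa)} \int_{C^\kappa} \int_{(G \setminus C)^{n-\kappa}} g(y_{1:\kappa})\, g(y_{\kappa+1:n}) \operatorname{US}(y_{1:\kappa})^2\,dy_{\kappa+1:n}\,dy_{1:\kappa} \\
&= \frac{\binom{n}{\kappa}}{\Pr(k(X_n) = \kappa)} \int_{C^\kappa} g(y_{1:\kappa}) \operatorname{US}(y_{1:\kappa})^2\,dy_{1:\kappa} \\
&\qquad \times \underbrace{\int_{(G \setminus C)^{n-\kappa}} g(y_{\kappa+1:n})\,dy_{\kappa+1:n}}_{=\,(1-c)^{n-\kappa}} \\
&\overset{(c)}{=} \frac{1}{c^\kappa} \int_{C^\kappa} g(y_{1:\kappa}) \left(\frac{c}{\kappa} \sum_{i=1}^{\kappa} \frac{f(y_i)}{g(y_i)} h(y_i)\right)^{2} dy_{1:\kappa} \\
&= \frac{c^2}{\kappa^2} \int_{C^\kappa} \frac{g(y_{1:\kappa})}{c^\kappa} \left(\sum_{i=1}^{\kappa} \frac{f(y_i)}{g(y_i)} h(y_i)\right)^{2} dy_{1:\kappa} \\
&= \frac{c^2}{\kappa^2}\, E_g\!\left[\left(\sum_{i=1}^{\kappa} \frac{f(X_i)}{g(X_i)} h(X_i)\right)^{2} \,\middle|\, X_1, \dots, X_\kappa \in C\right] \\
&\overset{(d)}{=} \frac{c^2}{\kappa^2} \left(\kappa v + \kappa^2\, E\!\left[\frac{f(X)}{g(X)} h(X) \,\middle|\, X \sim g,\ X \in C\right]^{2}\right) \\
&= \frac{c^2 v}{\kappa} + c^2 \left(\int_C \frac{g(x)}{c} \frac{f(x)}{g(x)} h(x)\,dx\right)^{2} \\
&= \frac{c^2 v}{\kappa} + \theta^2,
\end{aligned} \tag{6}
$$

where (a) comes from the fact that there are $\binom{n}{\kappa}$ ways of ordering $n$ elements such that $\kappa$ are in $C$ and $n - \kappa$ are in $G \setminus C$, and the fact that US does not depend on the order of its inputs, (b) comes from the property that $\operatorname{US}(y)$ does not change if additional samples are appended to $y$ that are not in $C$ and the fact that $g(y)$ can be decomposed into $g(y_{1:\kappa})\, g(y_{\kappa+1:n})$ since it represents the joint probability density function for $n$ independent and identically distributed random variables, (c) comes from the fact that $\Pr(k(X_n) = \kappa) = \binom{n}{\kappa} c^\kappa (1-c)^{n-\kappa}$, and (d) comes from Property 2, with $v$ the variance of $\frac{f(X)}{g(X)} h(X)$ for $X \sim g$ conditioned on $X \in C$.

Combining (5) with (6) we have that

$$
\begin{aligned}
\operatorname{Var}_g(\operatorname{US}(X_n) \mid k(X_n) > 0) &= \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\Pr(k(X_n) > 0)} \left(\frac{c^2 v}{\kappa} + \theta^2\right) - \theta^2 \\
&= c^2 v \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\Pr(k(X_n) > 0)} \frac{1}{\kappa} + \theta^2 \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\Pr(k(X_n) > 0)} - \theta^2 \\
&= c^2 v \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\Pr(k(X_n) > 0)} \frac{1}{\kappa} \\
&= c^2 v\, E\!\left[\frac{1}{B(n,c)} \,\middle|\, B(n,c) > 0\right],
\end{aligned}
$$

where the last step holds because $k(X_n)$ is binomially distributed with parameters $n$ and $c$.

Theorem 9. If $F \cap H \subseteq C \subseteq G$ then

$$
\operatorname{Var}_g(\operatorname{IS}(X_n) \mid k(X_n) > 0) = \frac{v c}{\rho n} + \theta^2 \frac{1-c}{\rho c n} + \theta^2 \frac{\rho - 1}{\rho^2}.
$$

At a high level, this proof is similar to the proof of Theorem 8, but uses the property that $\operatorname{IS}(X_n) = \frac{k(X_n)}{nc} \operatorname{US}(X_n)$.

$$
\begin{aligned}
\operatorname{Var}_g(\operatorname{IS}(X_n) \mid k(X_n) > 0) &= E_g[\operatorname{IS}(X_n)^2 \mid k(X_n) > 0] - E_g[\operatorname{IS}(X_n) \mid k(X_n) > 0]^2 \\
&\overset{(a)}{=} E_g[\operatorname{IS}(X_n)^2 \mid k(X_n) > 0] - \frac{\theta^2}{\rho^2} \\
&= \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\Pr(k(X_n) > 0)}\, E_g[\operatorname{IS}(X_n)^2 \mid k(X_n) = \kappa] - \frac{\theta^2}{\rho^2},
\end{aligned} \tag{7}
$$

where (a) comes from Theorem 6. Also,

$$
E_g[\operatorname{IS}(X_n)^2 \mid k(X_n) = \kappa] \overset{(a)}{=} E_g\!\left[\left(\frac{k(X_n)}{nc}\right)^{2} \operatorname{US}(X_n)^2 \,\middle|\, k(X_n) = \kappa\right] = \frac{\kappa^2}{n^2 c^2}\, E_g[\operatorname{US}(X_n)^2 \mid k(X_n) = \kappa] \overset{(b)}{=} \frac{\kappa v}{n^2} + \frac{\kappa^2 \theta^2}{n^2 c^2}, \tag{8}
$$

where (a) holds because $\operatorname{IS}(X_n) = \frac{k(X_n)}{nc} \operatorname{US}(X_n)$ and (b) follows from (6). Using the shorthand $\rho := \Pr(k(X_n) > 0)$, and by combining (7) with (8), we have that:

$$
\begin{aligned}
\operatorname{Var}_g(\operatorname{IS}(X_n) \mid k(X_n) > 0) &= \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\Pr(k(X_n) > 0)} \left(\frac{\kappa v}{n^2} + \frac{\kappa^2 \theta^2}{n^2 c^2}\right) - \frac{\theta^2}{\rho^2} \\
&= \frac{v}{n^2 \rho} \underbrace{\sum_{\kappa=1}^{n} \Pr(k(X_n) = \kappa)\, \kappa}_{=\,E[B(n,c)]} + \frac{\theta^2}{n^2 c^2 \rho} \underbrace{\sum_{\kappa=1}^{n} \Pr(k(X_n) = \kappa)\, \kappa^2}_{=\,E[B(n,c)^2]} - \frac{\theta^2}{\rho^2} \\
&= \frac{v\,nc}{n^2 \rho} + \frac{\theta^2 \left(nc(1-c) + n^2 c^2\right)}{n^2 c^2 \rho} - \frac{\theta^2}{\rho^2} \\
&= \frac{v c}{\rho n} + \theta^2 \frac{1-c}{\rho c n} + \frac{\theta^2}{\rho} - \frac{\theta^2}{\rho^2} \\
&= \frac{v c}{\rho n} + \theta^2 \frac{1-c}{\rho c n} + \theta^2 \frac{\rho - 1}{\rho^2}.
\end{aligned}
$$

Theorem 10. If $F \cap H \subseteq C \subseteq G$ then

$$
\operatorname{Var}_g(\operatorname{US}(X_n)) = \rho c^2 v\, E\!\left[\frac{1}{B(n,c)} \,\middle|\, B(n,c) > 0\right] + \theta^2 \rho (1 - \rho).
$$

$$
\begin{aligned}
\operatorname{Var}_g(\operatorname{US}(X_n)) &= E_g[\operatorname{US}(X_n)^2] - E_g[\operatorname{US}(X_n)]^2 \\
&\overset{(a)}{=} E_g[\operatorname{US}(X_n)^2] - \rho^2 \theta^2 \\
&= \sum_{\kappa=0}^{n} \Pr(k(X_n) = \kappa)\, E_g[\operatorname{US}(X_n)^2 \mid k(X_n) = \kappa] - \rho^2 \theta^2 \\
&= \Pr(k(X_n) = 0)\, \underbrace{E_g[\operatorname{US}(X_n)^2 \mid k(X_n) = 0]}_{=\,0} + \sum_{\kappa=1}^{n} \Pr(k(X_n) = \kappa)\, E_g[\operatorname{US}(X_n)^2 \mid k(X_n) = \kappa] - \rho^2 \theta^2 \\
&\overset{(b)}{=} \rho \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\rho} \left(\frac{c^2 v}{\kappa} + \theta^2\right) - \rho^2 \theta^2 \\
&= \rho c^2 v \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\rho} \frac{1}{\kappa} + \rho\,\theta^2 \sum_{\kappa=1}^{n} \frac{\Pr(k(X_n) = \kappa)}{\rho} - \rho^2 \theta^2 \\
&= \rho c^2 v\, E\!\left[\frac{1}{B(n,c)} \,\middle|\, B(n,c) > 0\right] + \theta^2 \rho (1 - \rho),
\end{aligned}
$$

where (a) comes from Theorem 7, and (b) comes from (6) and from multiplying one term by $\rho / \rho$.

Theorem 11. If $F \cap H \subseteq C \subseteq G$ then

$$
\operatorname{Var}_g(\operatorname{IS}(X_n)) = \frac{c v}{n} + \theta^2 \frac{1-c}{c n}.
$$

Writing $\operatorname{IS}(X)$ for the estimator applied to a single sample $X \sim g$:

$$
\begin{aligned}
\operatorname{Var}_g(\operatorname{IS}(X_n)) &\overset{(a)}{=} \frac{1}{n} \operatorname{Var}_g(\operatorname{IS}(X)) = \frac{1}{n} \left(E_g[\operatorname{IS}(X)^2] - E_g[\operatorname{IS}(X)]^2\right) \\
&\overset{(b)}{=} \frac{1}{n} \left(E_g[\operatorname{IS}(X)^2] - \theta^2\right) \\
&= \frac{1}{n} \Big(\Pr(X \in C \mid X \sim g)\, E_g[\operatorname{IS}(X)^2 \mid X \in C] + \Pr(X \notin C \mid X \sim g)\, \underbrace{E_g[\operatorname{IS}(X)^2 \mid X \notin C]}_{=\,0} - \theta^2\Big) \\
&\overset{(c)}{=} \frac{1}{n} \left(c \left(v + \frac{\theta^2}{c^2}\right) - \theta^2\right) \\
&= \frac{c v}{n} + \theta^2 \frac{1-c}{c n},
\end{aligned}
$$

where (a) holds because $\operatorname{IS}(X_n)$ is the mean of $n$ independent and identically distributed random variables, (b) comes from Property 1, and (c) comes from applying (8) with $n = 1$ and $\kappa = 1$.

Property 3. $\dfrac{1-c}{\rho c n} + \dfrac{\rho - 1}{\rho^2} \geq 0$.

Recall that $\rho := 1 - (1-c)^n$, so we have that:

$$
\frac{1-c}{\rho c n} + \frac{\rho - 1}{\rho^2} = \frac{1-c}{c n \left(1 - (1-c)^n\right)} - \frac{(1-c)^n}{\left(1 - (1-c)^n\right)^2}. \tag{9}
$$

We will show by induction that (9) is non-negative for all $n \geq 1$. First, notice that for the base case where $n = 1$, (9) is equal to zero. For the inductive step we will show that (9) is non-negative for $n + 1$ given that it is non-negative for $n$: [...], where (a) is positive by the inductive hypothesis, and so we need only show that [...]. Since [...], and [...] because $c \in (0, 1]$, we conclude.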
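The closed forms for the mean and variance of IS and US can be sanity-checked numerically. The sketch below assumes a hypothetical three-outcome discrete example (so sums play the role of the integrals), and it assumes the closed forms read as $E_g[\operatorname{IS}] = \theta$, $E_g[\operatorname{US}] = \rho\theta$, $\operatorname{Var}_g(\operatorname{IS}) = (cv + \theta^2(1-c)/c)/n$, and $\operatorname{Var}_g(\operatorname{US}) = \rho c^2 v\, E[1/B \mid B > 0] + \theta^2 \rho(1-\rho)$ with $B \sim \text{Binomial}(n, c)$. The distributions and function names are illustrative choices, not taken from the paper.

```python
from itertools import product
from math import comb

# Hypothetical discrete example (assumption for illustration only).
g = {0: 0.5, 1: 0.3, 2: 0.2}  # sampling distribution, support G = {0, 1, 2}
f = {0: 0.0, 1: 0.6, 2: 0.4}  # target distribution, support F = {1, 2}
C = {1, 2}                    # chosen so that F ∩ H ⊆ C ⊆ G
n = 4                         # number of samples


def h(x):
    return float(x)


c = sum(g[x] for x in C)             # Pr_g(X in C)
theta = sum(f[x] * h(x) for x in f)  # E_f[h(X)]
rho = 1.0 - (1.0 - c) ** n           # Pr(k(X_n) > 0)


def weighted_sum(xs):
    return sum(f[x] / g[x] * h(x) for x in xs)


def is_estimate(xs):
    """Ordinary importance sampling."""
    return weighted_sum(xs) / len(xs)


def us_estimate(xs):
    """Importance sampling with unequal support; defined as 0 when k = 0."""
    k = sum(x in C for x in xs)
    return c / k * weighted_sum(xs) if k > 0 else 0.0


def moments(estimator):
    """Exact mean and variance under g, enumerating all n-tuples."""
    m1 = m2 = 0.0
    for xs in product(g, repeat=n):
        p = 1.0
        for x in xs:
            p *= g[x]
        val = estimator(xs)
        m1 += p * val
        m2 += p * val * val
    return m1, m2 - m1 * m1


# v: variance of f(X)/g(X) h(X) under g, conditioned on X in C.
cond_mean = sum(g[x] / c * (f[x] / g[x] * h(x)) for x in C)
v = sum(g[x] / c * (f[x] / g[x] * h(x)) ** 2 for x in C) - cond_mean ** 2

# E[1/B | B > 0] for B ~ Binomial(n, c).
inv_b = sum(comb(n, j) * c**j * (1 - c) ** (n - j) / j
            for j in range(1, n + 1)) / rho

mean_is, var_is = moments(is_estimate)
mean_us, var_us = moments(us_estimate)

# Closed forms as assumed in the lead-in.
var_is_formula = (c * v + theta**2 * (1 - c) / c) / n
var_us_formula = rho * c**2 * v * inv_b + theta**2 * rho * (1 - rho)

print(mean_is, theta)
print(mean_us, rho * theta)
print(var_is, var_is_formula)
print(var_us, var_us_formula)
```

Each printed pair should agree up to floating-point error. Exact enumeration is used instead of Monte Carlo sampling so that the comparison is deterministic; it is only practical for small $n$ and small outcome spaces.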

arXiv v1 [cs.LG], 10 Nov 2016
