Supplementary Material: A Robust Approach to Sequential Information Theoretic Planning

Sue Zheng*   Jason Pacheco*   John W. Fisher III

(*Equal contribution. Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Boston, MA, USA. Correspondence to: Sue Zheng <szheng@csail.mit.edu>. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the authors.)

1. Proofs of Estimator Properties

1.1. Proof of Prop. 1

Here we show how the bias of the empirical estimator depends on M. Since we use a plug-in estimate \hat{p}(y_i) of the predictive posterior p(y_i | Y) to estimate I, we first determine the mean and variance of the samples \theta_i used for estimating I, which incorporate the plug-in estimate:

    \theta_i = \log \frac{p(y_i | x_i)}{\hat{p}(y_i)}.

Let \{(x_i, y_i)\}_{i=1}^{N} \sim p(x, y | Y) and, for each i, let \{x_{ij}\}_{j=1}^{M} \sim p(x | Y) be independent samples. Going forward we drop the explicit dependence on the observed data Y for clarity, but all expressions are conditioned on it. The empirical estimate of the posterior predictive takes the form \hat{p}(y_i) = \frac{1}{M}\sum_{j} p(y_i | x_{ij}).

From Thm. 3.9 of DasGupta (2008), if we have iid observations X_1, X_2, \ldots, X_M with finite fourth moment, mean \mu, and variance \sigma^2, and a scalar transformation g with four uniformly bounded derivatives, then

    E[g(\bar{X})] = g(\mu) + \frac{\sigma^2}{2M} g''(\mu) + O(M^{-2}),
    var[g(\bar{X})] = \frac{\sigma^2}{M} g'(\mu)^2 + O(M^{-2}),        (1)

where \bar{X} is the empirical mean \bar{X} = \frac{1}{M}\sum_i X_i. In our case, the samples p(y_i | x_{ij}) are iid when conditioned on y_i, with conditional mean p(y_i) and conditional variance

    \sigma^2(y_i) = \int p(y_i | x)^2 p(x)\, dx - p(y_i)^2.

We are interested in the transformation g = \log. Consequently, we have

    E[\log \hat{p}(y_i) \mid y_i] = \log p(y_i) - \frac{\sigma^2(y_i)}{2M\, p(y_i)^2} + O(M^{-2}),
    var[\log \hat{p}(y_i) \mid y_i] = \frac{\sigma^2(y_i)}{M\, p(y_i)^2} + O(M^{-2}).

Using the law of total expectation to remove the conditioning on y_i in the mean, and dropping the superscript i since measurements are sampled iid, we have

    E[\log \hat{p}(y)] = -H(Y) - \frac{1}{2M} E\!\left[\frac{\sigma^2(y)}{p(y)^2}\right] + O(M^{-2}).        (2)

Expanding the second term on the RHS, we obtain

    E\!\left[\frac{\sigma^2(y)}{p(y)^2}\right]
      = \int p(y)\, \frac{\int p(y|x)^2 p(x)\,dx - p(y)^2}{p(y)^2}\, dy
      = \int\!\!\int \frac{p(y|x)^2\, p(x)}{p(y)}\, dx\, dy - 1
      = \chi^2\big(p(y,x) \,\|\, p(y)p(x)\big),

where \chi^2(p(y,x) \| p(y)p(x)) denotes the chi-square divergence of p(y,x) from p(y)p(x). Plugging this back into (2), we obtain

    E[\log \hat{p}(y)] = -H(Y) - \frac{1}{2M}\, \chi^2\big(p(y,x) \,\|\, p(y)p(x)\big) + O(M^{-2}).        (3)

From Eqn. (3) we can now obtain the mean of each sample \theta_i = \log[p(y_i|x_i)/\hat{p}(y_i)] used in the empirical estimate \hat{I} = \frac{1}{N}\sum_i \theta_i, and therefore the mean of the estimate:

    E[\hat{I}] = E[\theta] = E[\log p(y|x)] - E[\log \hat{p}(y)]
              = -H(Y|X) + H(Y) + \frac{1}{2M}\, \chi^2\big(p(y,x) \,\|\, p(y)p(x)\big) + O(M^{-2})
              = I(X;Y) + \frac{1}{2M}\, \chi^2\big(p(y,x) \,\|\, p(y)p(x)\big) + O(M^{-2}).        (4)

We have thus shown how the bias of the empirical estimator depends on M.
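To make the quantities above concrete, the following is a minimal sketch (not part of the paper) of the plug-in estimator \hat{I} for a toy linear-Gaussian model in which I(X;Y) = -0.5 log(1 - rho^2) is known in closed form. The toy model, sample sizes, and all names in the code are illustrative assumptions of the sketch.

# Nested Monte Carlo (plug-in) estimate of I(X;Y) for a toy Gaussian model.
# Illustrative sketch only; the toy model and all names are assumptions.
import numpy as np

def nested_mi_estimate(rho, N, M, rng):
    """I_hat = (1/N) sum_i log[p(y_i|x_i) / p_hat(y_i)],
    with p_hat(y_i) = (1/M) sum_j p(y_i | x_ij) and x_ij ~ p(x)."""
    s2 = 1.0 - rho**2
    # Joint samples (x_i, y_i): y = rho*x + sqrt(1-rho^2)*eps, x, eps ~ N(0,1)
    x = rng.standard_normal(N)
    y = rho * x + np.sqrt(s2) * rng.standard_normal(N)
    # Conditional density p(y_i | x_i) = N(y_i; rho*x_i, 1-rho^2)
    log_p_y_given_x = -0.5 * (y - rho * x)**2 / s2 - 0.5 * np.log(2 * np.pi * s2)
    # Inner samples x_ij ~ p(x), used to form p_hat(y_i) for every i
    x_inner = rng.standard_normal((N, M))
    p_y_given_xin = np.exp(-0.5 * (y[:, None] - rho * x_inner)**2 / s2) / np.sqrt(2 * np.pi * s2)
    log_p_hat = np.log(p_y_given_xin.mean(axis=1))
    theta = log_p_y_given_x - log_p_hat          # per-sample terms theta_i
    return theta.mean()

rng = np.random.default_rng(0)
rho = 0.8
I_true = -0.5 * np.log(1.0 - rho**2)             # closed form for the Gaussian toy
print(I_true, nested_mi_estimate(rho, N=2000, M=200, rng=rng))

With moderate M the estimate already sits close to the true value; the residual gap is the O(1/M) bias quantified in Prop. 1.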

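Continuing the same sketch: for this particular Gaussian toy the chi-square divergence \chi^2(p(y,x) \| p(y)p(x)) works out to rho^2/(1 - rho^2) (a property of the toy model stated here as an assumption of the illustration, not a quantity computed in the paper), so the observed bias can be compared against the predicted leading term \chi^2/(2M):

# Check that the bias of the plug-in estimator decays like chi2/(2M).
# chi2 = rho^2/(1-rho^2) is specific to the Gaussian toy above (an
# assumption of the illustration, not a result from the paper).
chi2 = rho**2 / (1.0 - rho**2)
for M in [10, 40, 160, 640]:
    reps = [nested_mi_estimate(rho, N=4000, M=M, rng=rng) for _ in range(20)]
    bias = np.mean(reps) - I_true
    print(f"M={M:4d}  observed bias={bias:.4f}  predicted chi2/(2M)={chi2/(2*M):.4f}")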
1.2. Proof of Prop. 2

We first show that the empirical plug-in estimator is consistent and asymptotically normal. We will then show that the influence function of the robust estimator converges to the influence function of the empirical estimator, so that the asymptotic analysis of the empirical estimator holds for the robust estimate as well.

We begin by finding expressions for the mean and variance of the iid samples \theta_i used in estimating I. Then, using these expressions, we show that their empirical mean converges to a normal distribution. In Sec. 1.1 we obtained an expression for the variance of the log predictive posterior conditioned on y_i. We can remove the conditioning on y_i through the law of total variance. Again dropping the superscript i for clarity, we have

    var[\log \hat{p}(y)] = E\big[var(\log \hat{p}(y) \mid y)\big] + var\big[E(\log \hat{p}(y) \mid y)\big]        (5)
                         = \frac{1}{M} E\!\left[\frac{\sigma^2(y)}{p(y)^2}\right] + O(M^{-2}) + var\big[E(\log \hat{p}(y) \mid y)\big].

We can simplify the first term on the RHS through the chi-square identity used for Eqn. (3). Keeping track of only the terms of order 1/M and higher, we expand the last term on the RHS as

    var\big[E(\log \hat{p}(y) \mid y)\big]
      = E\!\left[\Big(\log p(y) - \frac{\sigma^2(y)}{2M\, p(y)^2}\Big)^2\right] - \big(E[\log \hat{p}(y)]\big)^2 + O(M^{-2})
      = \int p(y) \log^2 p(y)\,dy - \frac{1}{M}\int \frac{\sigma^2(y)\log p(y)}{p(y)}\,dy
        - \Big(H(Y)^2 + \frac{H(Y)}{M}\chi^2\big(p(y,x)\,\|\,p(y)p(x)\big)\Big) + O(M^{-2})
      = var[\log p(y)] - \frac{1}{M}\Big[\int \frac{\sigma^2(y)\log p(y)}{p(y)}\,dy + H(Y)\,\chi^2\big(p(y,x)\,\|\,p(y)p(x)\big)\Big] + O(M^{-2}).

We plug the above back into the variance expression (5) to get the following dependence on M:

    var[\log \hat{p}(y)] = var[\log p(y)]
        + \frac{1}{M}\Big[\big(1 - H(Y)\big)\,\chi^2\big(p(y,x)\,\|\,p(y)p(x)\big) - \int \frac{\sigma^2(y)\log p(y)}{p(y)}\,dy\Big] + O(M^{-2}).        (6)

We can now find the variance of the samples used in estimating I:

    var(\theta) = var[\log p(y|x)] - 2\,cov\big(\log p(y|x),\, \log \hat{p}(y)\big) + var[\log \hat{p}(y)].        (7)

To simplify our analysis, let us assume that we use independent sets of samples \{(x_i, y_i)\}_{i=1}^{N} and \{(x_i, y_i)\}_{i=N+1}^{2N} for estimating E[\log p(y|x)] and E[\log \hat{p}(y)], respectively. Thus, the covariance term in Eqn. (7) is zero. Substituting the log posterior predictive variance (6) into the sample variance (7) yields the full expression for the variance of the I samples:

    var(\theta) = var[\log p(y|x)] + var[\log p(y)]
        + \frac{1}{M}\Big[\big(1 - H(Y)\big)\,\chi^2\big(p(y,x)\,\|\,p(y)p(x)\big) - \int \frac{\sigma^2(y)\log p(y)}{p(y)}\,dy\Big] + O(M^{-2})
              = var\!\left[\log \frac{p(y|x)}{p(y)}\right]
        + \frac{1}{M}\Big[\big(1 - H(Y)\big)\,\chi^2\big(p(y,x)\,\|\,p(y)p(x)\big) - \int \frac{\sigma^2(y)\log p(y)}{p(y)}\,dy\Big] + O(M^{-2}),        (8)

where in the compact form the numerator and denominator are evaluated on independent draws, consistent with the sample-splitting assumption above. We now have the mean and the variance of the \theta_i, the samples that compose the plug-in empirical estimate of I; a small numerical sketch of these moments is given below. We next show that this estimator admits a CLT.
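The per-sample variance just derived can be probed numerically as well. The continuation below (an illustrative sketch reusing numpy as np, rng, and rho from the sketches above) estimates var(\theta_i) for increasing M; note that it reuses the same pair (x_i, y_i) in the numerator and in the plug-in denominator, so its large-M limit is var[\log p(y|x) - \log p(y)] computed on common draws, rather than the sample-splitting variant used in the proof.

# Continuation of the Gaussian toy sketch: per-sample variance of theta_i.
# The large-M reference value is var[log p(y|x) - log p(y)], estimated by
# plain Monte Carlo from the exact toy densities (illustration only).
def theta_samples(rho, N, M, rng):
    s2 = 1.0 - rho**2
    x = rng.standard_normal(N)
    y = rho * x + np.sqrt(s2) * rng.standard_normal(N)
    log_p_y_given_x = -0.5 * (y - rho * x)**2 / s2 - 0.5 * np.log(2 * np.pi * s2)
    x_in = rng.standard_normal((N, M))
    p_in = np.exp(-0.5 * (y[:, None] - rho * x_in)**2 / s2) / np.sqrt(2 * np.pi * s2)
    return log_p_y_given_x - np.log(p_in.mean(axis=1))

# Large-M reference: var[log p(y|x) - log p(y)] with p(y) = N(y; 0, 1)
x = rng.standard_normal(200000)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(200000)
log_py_given_x = -0.5 * (y - rho * x)**2 / (1 - rho**2) - 0.5 * np.log(2 * np.pi * (1 - rho**2))
log_py = -0.5 * y**2 - 0.5 * np.log(2 * np.pi)
print("large-M limit:", (log_py_given_x - log_py).var())
for M in [10, 100, 1000]:
    print(M, theta_samples(rho, N=5000, M=M, rng=rng).var())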

Let us define \tilde{\theta}_i = \theta_i - I(X;Y) and let each \tilde{\theta}_i have cdf F(\tilde{\theta}_i) = P(\tilde{\Theta}_i \le \tilde{\theta}_i). Let us assume that it has finite moments and that its moment generating function \varphi_{\tilde{\theta}}(t) = E[\exp(t\tilde{\theta}_i)] exists. We will show that the moment generating function of

    Z = \sqrt{N}\,\big(\hat{I} - I(X;Y)\big) = \frac{1}{\sqrt{N}} \sum_i \tilde{\theta}_i

approaches that of a zero-mean normal, so that the estimator is consistent and a CLT holds. Because the \tilde{\theta}_i are iid, the mgf of Z can be written as

    \varphi_Z(t) = \prod_{i=1}^{N} \varphi_{\tilde{\theta}_i/\sqrt{N}}(t) = \varphi_{\tilde{\theta}}(t/\sqrt{N})^N = \big(E[\exp(t\tilde{\theta}/\sqrt{N})]\big)^N.        (9)

Using Taylor's theorem on the exponential, we obtain

    \varphi_Z(t) = \Big(E\Big[1 + \frac{t}{\sqrt{N}}\tilde{\theta} + \frac{t^2}{2N}\tilde{\theta}^2 + \frac{t^3}{6 N^{3/2}}\tilde{\theta}^3 + \cdots\Big]\Big)^N
                 = \Big(1 + \frac{t}{2M\sqrt{N}}\,\chi^2\big(p(y,x)\,\|\,p(y)p(x)\big) + \frac{t^2}{2N}\, E[\tilde{\theta}^2] + O\Big(\frac{t^3}{N^{3/2}}\Big) E[\tilde{\theta}^3] + \cdots\Big)^N.        (10)

Note that, since \tilde{\theta} = \theta - I(X;Y), we have var(\tilde{\theta}) = var(\theta) and E[\tilde{\theta}] = O(M^{-1}), so that (E[\tilde{\theta}])^2 = O(M^{-2}). We now expand the third term in Eqn. (10) using our expression (8) for the variance of \theta:

    E[\tilde{\theta}^2] = var(\tilde{\theta}) + (E[\tilde{\theta}])^2 = var(\theta) + O(M^{-2})
        = var\!\left[\log \frac{p(y|x)}{p(y)}\right]
          + \frac{1}{M}\Big[\big(1 - H(Y)\big)\,\chi^2\big(p(y,x)\,\|\,p(y)p(x)\big) - \int \frac{\sigma^2(y)\log p(y)}{p(y)}\,dy\Big] + O(M^{-2}).        (11)

Furthermore, note that since all the moments of \tilde{\theta} are finite, we can ignore the N^{-3/2} and lower-order terms, since they disappear as N \to \infty. For the limit to exist, the coefficient of the 1/\sqrt{N} term must decay at a rate of at least 1/\sqrt{N}; thus we must have M = \Omega(\sqrt{N}). Finally, recall that the mgf of a Gaussian with mean \mu and variance \sigma^2 is \exp(t\mu + \sigma^2 t^2/2). Therefore, to have a consistent estimator (Z with zero mean in the limit), we require the coefficient on the t/\sqrt{N} term to vanish in the limit. This requires that M grow strictly faster than \sqrt{N}: M = \omega(\sqrt{N}). Letting M = \omega(\sqrt{N}), the limit of Eqn. (10) is

    \lim_{N \to \infty} \varphi_Z(t) = \exp\!\left(var\!\left[\log \frac{p(y|x)}{p(y)}\right] \frac{t^2}{2}\right).        (12)

Since the mgf approaches that of a zero-mean Gaussian with variance var[\log(p(y|x)/p(y))], we can conclude that the empirical plug-in estimator of I is consistent and satisfies the following CLT when M = \omega(\sqrt{N}):

    \frac{\hat{I} - I(X;Y)}{\sqrt{var(\hat{I})}} \;\to\; N(0, 1) \quad \text{as } N \to \infty.        (13)

Lastly, we demonstrate that as N \to \infty the robust influence function approaches that of the empirical estimator, so that the consistency and CLT guarantees also hold for our robust estimator. Let \alpha denote the scale parameter of the robust estimator (see Sec. 1.3), which shrinks as N grows. Note that the influence function corresponding to an empirical estimate is \psi_{emp}(x) = cx, where c is any constant. The robust cost function, parameterized by \alpha, is given as

    \psi(x; \alpha) = \begin{cases} \log\big(1 + \alpha x + (\alpha x)^2/2\big), & x \ge 0 \\ -\log\big(1 - \alpha x + (\alpha x)^2/2\big), & x < 0. \end{cases}

For simplicity of analysis we will only consider x \ge 0; the proof for x < 0 follows similarly. Note that \lim_{N\to\infty} \alpha x = 0. Therefore, we take the Taylor series of \log(1 + u + u^2/2) at u = 0, where u = \alpha x, and obtain

    \psi(x; \alpha) = \alpha x + 0\cdot(\alpha x)^2 - \frac{(\alpha x)^3}{3!} + \cdots

As N \to \infty the first term dominates, and the robust influence function approaches the empirical influence function with c = \alpha.
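The robust estimator is an M-estimator of location built from this influence function: \hat{I} solves \sum_i \psi(\alpha(\theta_i - \hat{I})) = 0, the ROOT operation used in Sec. 1.3 below. The sketch that follows is a minimal illustration of that computation; the default choice of alpha (shrinking like 1/\sqrt{N}) is an assumption of the example, not the exact constant used in the paper.

# Catoni-style robust mean: solve sum_i psi(alpha*(theta_i - m)) = 0 for m.
# The influence function matches Sec. 1.2; alpha is an illustrative choice.
import numpy as np
from scipy.optimize import brentq

def psi(u):
    # log(1 + u + u^2/2) for u >= 0, and -log(1 - u + u^2/2) for u < 0
    return np.where(u >= 0.0,
                    np.log1p(u + 0.5 * u**2),
                    -np.log1p(-u + 0.5 * u**2))

def robust_mean(theta, alpha=None):
    theta = np.asarray(theta, dtype=float)
    if alpha is None:
        # Illustrative scale that shrinks like 1/sqrt(N), as assumed in Sec. 1.2
        alpha = np.sqrt(2.0 / (len(theta) * (theta.var() + 1e-12)))
    f = lambda m: psi(alpha * (theta - m)).sum()   # decreasing in m
    lo, hi = theta.min() - 1.0, theta.max() + 1.0  # f(lo) > 0 > f(hi)
    return brentq(f, lo, hi)

# Heavy-tail demonstration: a few corrupted theta_i barely move the estimate
rng = np.random.default_rng(1)
theta = rng.normal(1.0, 0.5, size=500)
theta[:5] += 50.0                                  # gross outliers
print("empirical mean:", theta.mean(), " robust mean:", robust_mean(theta))

Because \psi grows only logarithmically, a handful of grossly corrupted \theta_i moves the robust root far less than it moves the empirical mean, which is the qualitative behaviour behind the tighter deviation bound of Sec. 1.3.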

1.3. Proof of Prop. 3

Here we first show the finite-sample deviation bound on the robust estimator, and then the bound on the empirical estimator. Let \{(x_i, y_i)\}_{i=1}^{N} \sim p(x, y | Y) and, for each i, let \{x_{ij}\}_{j=1}^{M} \sim p(x | Y) be independent samples. We define \bar{x}_i = \{x_{ij}\}_{j=1}^{M}. Recall that the robust estimate \hat{I} is the root of \sum_i \psi\big(\alpha(\theta_i - \hat{I})\big) = 0, where \theta_i = \log[p(y_i|x_i)/\hat{p}(y_i; \bar{x}_i)] and \hat{p}(\cdot\,; \bar{x}_i) is the posterior predictive estimate. Assuming \alpha = \sqrt{2/(N v)} and N > 2(1 + \log(1/\epsilon)), we have from Prop. 2.4 of Catoni (2012) that the robust estimator satisfies, with probability at least 1 - 2\epsilon,

    -c \le \hat{I} - m \le c,        (14)

where m = E[\theta_i] = E\big[\log\big(p(y|x)/\hat{p}(y; \bar{x})\big)\big], v denotes the variance of the \theta_i, and

    c = \frac{\sqrt{2 v (1 + \log(1/\epsilon))/N}}{1 - 2(1 + \log(1/\epsilon))/N}.

Since the samples are identically distributed, m does not depend on the sample index i. Note that, because the denominator contains the estimated posterior predictive, this does not yet bound the deviation from the true value I. We can rewrite m as follows:

    m = E\Big[\log \frac{p(y|x)\, p(y|Y)}{\hat{p}(y; \bar{x})\, p(y|Y)}\Big]
      = E\Big[\log \frac{p(y|x)}{p(y|Y)}\Big] + E\Big[\log \frac{p(y|Y)}{\hat{p}(y; \bar{x})}\Big]
      = I + E\Big[\log \frac{p(y|Y)}{\hat{p}(y; \bar{x})}\Big]
      = I + E_{\bar{x}}\big[KL\big(p(y|Y)\,\|\,\hat{p}(y; \bar{x})\big)\big].        (15)

Note that the above derivation holds for both the robust and the empirical estimators. Letting b = E_{\bar{x}}[KL(p(y|Y) \| \hat{p}(y; \bar{x}))], we obtain the presented deviation bounds on the robust estimator:

    b - c \le \hat{I} - I \le b + c.        (16)

Similarly, the empirical estimate can be bounded as in Eqn. (14), but with c = \sqrt{v/(N\epsilon)} by Chebyshev's inequality. Since the same derivation (15) relating the true mean to the plug-in mean holds for the empirical estimator, Eqn. (16) carries through as well. Although the high-level expression stays the same, we must be careful to note that both b and c differ for the two estimators. The difference in b is a bit more subtle: b is the expected KL divergence between the posterior predictive and the estimated posterior predictive, and the estimated posterior predictive differs between the robust and empirical estimators.

1.4. Proof of Prop. 4

We now show how we approximate the probability that the correct action is selected. Without loss of generality, let I_1 \ge I_2 \ge \cdots \ge I_{|A|}, and let f_a(\hat{I}_a) and F_a(\hat{I}_a) denote the pdf and cdf of the estimate \hat{I}_a. Since independent samples are drawn to estimate I under each candidate action, the estimates are independent. This allows us to decompose the probability of selecting the correct action as

    P(a^* = 1) = \int_{-\infty}^{+\infty} f_1(\hat{I}_1) \prod_{a=2}^{|A|} \Big[\int_{-\infty}^{\hat{I}_1} f_a(\hat{I}_a)\, d\hat{I}_a\Big]\, d\hat{I}_1
               = \int_{-\infty}^{+\infty} f_1(\hat{I}_1) \prod_{a=2}^{|A|} F_a(\hat{I}_1)\, d\hat{I}_1.        (17)

Approximating each estimate by its limiting Gaussian distribution (Prop. 2), we have

    P(a^* = 1) = \int_{-\infty}^{+\infty} N(\hat{I}_1;\, I_1, \sigma_1^2) \prod_{a=2}^{|A|} \Phi\Big(\frac{\hat{I}_1 - I_a}{\sigma_a}\Big)\, d\hat{I}_1,

where \sigma_a^2 = \frac{1}{N}\, var\big[\log\big(p_a(y|x)/p_a(y|Y)\big)\big] and \Phi is the cumulative distribution function of the standard normal distribution.

[Figure 1: Memory retention model. Posterior probability of the correct retention model over 500 random trials with 500 particles, under random design, sequential SMC design, and our robust approach. Data are sampled from POW (left) and EXP (right); plots show the median (solid) and quartiles (dashed) over the same set of runs.]

2. Experiment: Memory Retention Model Selection

We apply our method to sequential experiment design for Bayesian model discrimination. We adopt the analysis posed in Cavagnaro et al. (2010), where the aim is to discriminate among candidate models of memory retention. At each stage an experiment is conducted in which a simulated participant is given a list of words to memorize. After a chosen lag time d_t, the participant is asked to recall the words. At stage t the aim is to select the lag time d_t which maximally discriminates between two retention models, power (POW) and exponential (EXP):

    p_0(d_t; \theta) = \theta_0 (d_t + 1)^{-\theta_1}, \qquad p_1(d_t; \theta) = \theta_0 e^{-\theta_1 d_t}.

The joint distribution combines the retention model with a response variable y_t, where y_t = 1 indicates that the participant remembers the words and y_t = 0 otherwise:

    m \sim Unif\{0, 1\}, \quad \theta_0 \sim Beta(\alpha_0, \beta_0), \quad \theta_1 \sim Beta(\alpha_1, \beta_1), \quad y_t \mid m, \theta; d_t \sim Bernoulli\big(p_m(d_t; \theta)\big).

At each stage we select the lag time which maximizes the mutual information between the model and the observation, \arg\max_{d_t} I_{d_t}(M; Y_t \mid y_{1:t-1}). Expectations over the model M and the response Y_t are discrete, and can be computed efficiently. Furthermore, under a uniform prior the model posterior is proportional to the evidence,

    Z_{m,t}(y_{1:t}, d_{1:t}) = \int p(\theta \mid m) \prod_{i=1}^{t} p(y_i \mid \theta, m; d_i)\, d\theta.        (18)

We compare our integrated inference and planning approach to the sequential Monte Carlo (SMC) approach of Drovandi et al. (2014) and Chopin (2002), which utilizes the SMC estimate of the evidence for planning: \log \hat{Z}_{m,t} = \log \hat{Z}_{m,t-1} + \log \sum_i w_i^{m,t}. We observe moderately improved median performance and tighter quartiles, as shown in Fig. 1.
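The lag-time selection in this experiment reduces to a small discrete mutual-information computation: with particles representing \theta under each model, I(M; Y_t | d) involves only the binary model indicator and the binary response. The sketch below illustrates the first design step (particles drawn from the prior); the priors, particle count, and lag grid are assumptions of the illustration, while the retention models follow Cavagnaro et al. (2010).

# Sketch of the lag-time selection step: choose d maximizing I(M; Y | d)
# for the POW/EXP retention models. Priors, grid, and particle count are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
K = 500                                   # particles per model
theta0 = rng.beta(2.0, 1.0, size=(2, K))  # theta_0 ~ Beta(alpha_0, beta_0), per model
theta1 = rng.beta(1.0, 4.0, size=(2, K))  # theta_1 ~ Beta(alpha_1, beta_1), per model

def recall_prob(m, d):
    """p(y=1 | theta, m; d) for each particle of model m (0 = POW, 1 = EXP)."""
    if m == 0:
        return theta0[0] * (d + 1.0) ** (-theta1[0])   # power model
    return theta0[1] * np.exp(-theta1[1] * d)          # exponential model

def entropy_bernoulli(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def mutual_information(d, prior_m=(0.5, 0.5)):
    # p(y=1|m) marginalizes theta via the particle approximation
    p_y_m = np.array([recall_prob(0, d).mean(), recall_prob(1, d).mean()])
    p_y = prior_m[0] * p_y_m[0] + prior_m[1] * p_y_m[1]
    # I(M;Y) = H(Y) - sum_m p(m) H(Y|m)
    return entropy_bernoulli(p_y) - sum(pm * entropy_bernoulli(pym)
                                        for pm, pym in zip(prior_m, p_y_m))

lags = np.arange(0, 41)                    # candidate lag times d_t
mi = np.array([mutual_information(d) for d in lags])
print("selected lag:", lags[np.argmax(mi)], " I(M;Y):", mi.max())

At later stages the same computation applies with the prior draws replaced by particles weighted by the Bernoulli likelihoods of the responses observed so far.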

Improvements over SMC are modest, likely due to the low-dimensional (2-D) integration over \theta. We also find that when data are sampled from the exponential model, the posterior concentrates quickly under both methods, which confirms the findings of the original authors. Both our method and SMC significantly outperform random design.

References

Catoni, O. Challenging the empirical mean and empirical variance: A deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1148-1185, 2012. doi: 10.1214/11-AIHP454.

Cavagnaro, D. R., Myung, J. I., Pitt, M. A., and Kujala, J. V. Adaptive design optimization: A mutual information-based approach to model discrimination in cognitive science. Neural Computation, 22(4), 2010.

Chopin, N. A sequential particle filter method for static models. Biometrika, 89(3), 2002.

DasGupta, A. Asymptotic Theory of Statistics and Probability. Springer Texts in Statistics. Springer, New York, 2008.

Drovandi, C. C., McGree, J. M., and Pettitt, A. N. A sequential Monte Carlo algorithm to incorporate model uncertainty in Bayesian sequential design. Journal of Computational and Graphical Statistics, 23:3-24, 2014.
