Online Companion: Risk-Averse Approximate Dynamic Programming with Quantile-Based Risk Measures


Daniel R. Jiang and Warren B. Powell

Abstract

In this online companion, we provide some additional preliminary information regarding the risk-directed sampling (RDS) method. We also evaluate the policies generated by Dynamic-QBRM ADP with RDS on more practical metrics of risk and reward, and observe that they behave in an intuitively appealing way. Some necessary material from the main paper has been reproduced here for convenience.

1. Motivation and Preliminaries for RDS

Suppose the distribution of the exogenous information $W_{t+1}$ has a density $p_t(w)$. Notice that, for any $(s, a)$ and $t$,

$$Q_t(s, a) = \int H_t\bigl(u_t^1, u_t^2, \ldots, u_t^m, Q_{t+1}, w\bigr)(s, a)\, p_t(w)\, dw. \qquad (1.1)$$

For convenience, we use the shorthand notation

$$H_t(w \mid s, a) = H_t\bigl(u_t^1, u_t^2, \ldots, u_t^m, Q_{t+1}, w\bigr)(s, a)$$

to emphasize the variable of integration, $w$. From the principle of importance sampling (see, e.g., Bucklew [2004]), it is known that to produce a low-variance estimate of $Q_t(s, a)$ using Monte Carlo sampling, one should sample from a distribution whose density is nearly proportional to the absolute value of the integrand of (1.1).

1.1. Review of Importance Sampling

Let $X$ be a random variable on $\mathbb{R}^k$ with density $p$ and consider the problem of estimating the expected value of a function $\phi : \mathbb{R}^k \to \mathbb{R}$ of $X$, i.e.,

$$\mathbf{E}_p[\phi(X)] = \int \phi(x)\, p(x)\, dx. \qquad (1.2)$$

The core idea of importance sampling is to perform a change of measure and cast the above problem into another mean estimation problem:

$$\mathbf{E}_p[\phi(X)] = \int \phi(x)\, p(x)\, dx = \int \phi(x)\, \frac{p(x)}{\tilde{p}(x)}\, \tilde{p}(x)\, dx = \mathbf{E}_{\tilde{p}}\!\left[\phi(X)\, \frac{p(X)}{\tilde{p}(X)}\right],$$

provided that $\tilde{p}(x) > 0$ whenever the original integrand satisfies $|\phi(x)|\, p(x) > 0$. The term $p(x)/\tilde{p}(x)$ is called the likelihood ratio. Consequently, when performing Monte Carlo sampling for the new problem, we can draw samples from $\tilde{p}$ rather than $p$. This technique is of value when $\tilde{p}$ is chosen so that the variance of $\phi(X)\, p(X)/\tilde{p}(X)$ when $X \sim \tilde{p}$ is smaller than the variance of $\phi(X)$ when $X \sim p$. It is well known that the density for which this variance is minimized is given by

$$\frac{|\phi(x)|\, p(x)}{\mathbf{E}_p|\phi(X)|} \propto |\phi(x)|\, p(x),$$

where, of course, the difficulty is that the denominator is unknown. Importance sampling is typically applied in situations where $\phi$ is known, but this is not true in our ADP setting, as our $\phi$ requires knowledge of $Q$ and $u$.

1.2. Motivation

In a risk setting such as ours, the integrand of (1.1) can be multimodal, and we argue that in a large class of applications, the integrand is, at the very least, bimodal. Roughly speaking, one mode comes from the density $p_t(w)$, corresponding to the bulk of the distribution of $W_{t+1}$; a second mode often exists due to the QBRM $\rho_t^\alpha$ (specifically, the function $\Phi$), and it corresponds to the high costs of the tail risk that $\rho_t^\alpha$ penalizes. Additional modes can easily exist as well, depending on the distribution $p_t(w)$ and the risk measure $\rho_t^\alpha$. Figure 1 illustrates two simple cases when $Q_{t+1}(s, a) = 0$ for all $(s, a)$. We use the risk measure

$$\rho_t^\alpha(X) = (1 - \lambda)\, \mathbf{E}[X \mid \mathcal{F}_t] + \lambda\, \mathrm{CVaR}_t^\alpha(X),$$

with parameters $\lambda = 0.5$, $\alpha = \ldots$ The figure shows the cases of the normal and lognormal distributions for $W_{t+1}$ combined with simple cost functions that are independent of $(s, a)$.

Because our integrands are multimodal, the most natural class of parameterized importance sampling distributions to consider is a mixture of unimodal distributions. Let $\varphi_1, \varphi_2, \ldots, \varphi_K$ be densities of $K$ basis distributions. The density of a mixture of these distributions is a linear combination of the $\varphi_k$ and can be written

$$\tilde{p}(x) = \sum_{k=1}^{K} \theta_k\, \varphi_k(x),$$

for some vector $\theta$ of nonnegative mixture weights $\theta_k$ such that $\|\theta\|_1 = 1$.
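The change of measure and the mixture density above can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not taken from the paper: the original density is standard normal, the basis densities are normal, and the test function `tail_indicator`, the mixture weights, and the component parameters are all hypothetical choices made only to show the mechanics.

```python
import math
import random


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, theta, components):
    """Mixture density sum_k theta_k * phi_k(x) with normal basis densities."""
    return sum(th * normal_pdf(x, mu, sd) for th, (mu, sd) in zip(theta, components))


def sample_mixture(rng, theta, components):
    """Pick a component with probability theta_k, then draw from it."""
    mu, sd = rng.choices(components, weights=theta, k=1)[0]
    return rng.gauss(mu, sd)


def importance_estimate(phi, rng, theta, components, n):
    """Estimate E_p[phi(X)] for p = N(0, 1) by sampling from the mixture.

    Returns the sample mean and sample variance of phi(X) p(X) / p_tilde(X).
    """
    terms = []
    for _ in range(n):
        x = sample_mixture(rng, theta, components)
        likelihood_ratio = normal_pdf(x) / mixture_pdf(x, theta, components)
        terms.append(phi(x) * likelihood_ratio)
    mean = sum(terms) / n
    var = sum((t - mean) ** 2 for t in terms) / (n - 1)
    return mean, var


def tail_indicator(x):
    """Hypothetical test function: 1 if x exceeds 3, else 0."""
    return 1.0 if x > 3.0 else 0.0


truth = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # exact P(X > 3) under N(0, 1)
rng = random.Random(0)

# Two-component mixture: the original density plus a component shifted into the tail.
is_mean, is_var = importance_estimate(
    tail_indicator, rng, [0.5, 0.5], [(0.0, 1.0), (3.5, 1.0)], 20000
)
# Degenerate one-component "mixture" equal to p itself, i.e. plain Monte Carlo.
mc_mean, mc_var = importance_estimate(
    tail_indicator, rng, [1.0], [(0.0, 1.0)], 20000
)

print(f"exact={truth:.6f}  IS={is_mean:.6f}  MC={mc_mean:.6f}")
print(f"per-sample variance: IS={is_var:.3e}  MC={mc_var:.3e}")
```

Keeping a component equal to the original density in the mixture bounds the likelihood ratio by $1/\theta_1$, which is one reason a mixture can be a forgiving choice of sampling family in a sketch like this.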
Not only are mixture classes straightforward to sample from, they are flexible and intuitively easy for practitioners to specify [Oh and Berger, 1993]. Two other methods from the literature are: (1) as in Bardou and Frikha [2009], we can specify the importance sampling distribution to be a translation of the original distribution, but this does not, in general, approximate multimodal distributions well; (2) we can search for candidates within an exponential family of distributions as in Siegmund [1976] and Ryu and Boyd [2015], but the main disadvantage here is that multimodal exponential families have a significantly more complex mathematical description (see Cobb et al. [1983]).

Figure 1: Examples of Integrands of (1.1) under Normal and Lognormal Distributions. Panel (A): $\mathcal{N}(0, \ldots)$ with $c_t(s, a, w) = w$. Panel (B): $\log \mathcal{N}(0, 0.5)$ with $c_t(s, a, w) = w$. Each panel plots $g_t(w \mid s, a)\, p_t(w \mid s, a)$ against $w$.
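The bimodality described for the normal case can be checked numerically. The sketch below makes several assumptions that are not taken from the paper: $W_{t+1}$ is standard normal, $\alpha = 0.99$, and $\Phi$ is written in the standard quantile form of the mean-CVaR combination, $\Phi(w, q) = (1 - \lambda)\, w + \lambda\,(q + (w - q)^+/(1 - \alpha))$ with $q$ the $\alpha$-quantile of the cost. Only $\lambda = 0.5$, $c_t = w$, and $Q_{t+1} = 0$ come from the text.

```python
from statistics import NormalDist

LAMBDA = 0.5   # tradeoff weight, as in the text
ALPHA = 0.99   # assumed quantile level (illustrative)

std_normal = NormalDist(0.0, 1.0)     # assumed distribution of W_{t+1}
q_alpha = std_normal.inv_cdf(ALPHA)   # alpha-quantile of the cost c_t = w


def risk_weighted_cost(w):
    """Phi(w, q) = (1 - lambda) w + lambda (q + (w - q)^+ / (1 - alpha))."""
    excess = max(w - q_alpha, 0.0)
    return (1.0 - LAMBDA) * w + LAMBDA * (q_alpha + excess / (1.0 - ALPHA))


def integrand(w):
    """Absolute value of the integrand of (1.1) when Q_{t+1} = 0 and c_t = w."""
    return abs(risk_weighted_cost(w)) * std_normal.pdf(w)


grid = [-5.0 + 0.01 * i for i in range(1001)]
values = [integrand(w) for w in grid]
peak = max(values)

# Keep interior local maxima at least 5% as high as the global maximum.
modes = [
    grid[i]
    for i in range(1, len(grid) - 1)
    if values[i] > values[i - 1]
    and values[i] >= values[i + 1]
    and values[i] >= 0.05 * peak
]
print("prominent modes near w =", [round(m, 2) for m in modes])
```

Under these assumptions the scan reports one mode in the bulk of the distribution and a second one just beyond the quantile `q_alpha`, which is the situation a two-component mixture is meant to capture.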

Various approaches have been studied in the literature to iteratively find the optimal sampling density within a parametric family. One line of work (see Al-Qaq et al. [1995], Bardou and Frikha [2009], Egloff and Leippold [2010], and Ryu and Boyd [2015]) proposes using sample gradients to directly minimize the quantity $\mathrm{Var}_{\tilde{p}}[\phi(X)\, p(X)/\tilde{p}(X)]$. Another route is to seek a sampling distribution whose density minimizes the distance to the optimal sampling density, i.e., the density that is proportional to $|H_t(w \mid s, a)|\, p_t(w)$. The cross-entropy method [Rubinstein, 1999] uses this strategy and chooses the distance measure to be the Kullback-Leibler (KL) divergence, which for two probability densities $p$ and $q$ is given by

$$\int p(x) \log \frac{p(x)}{q(x)}\, dx.$$

2. Behavior of the Risk-Averse Policies in Practical Settings

In the main paper, we evaluated the performance of the ADP risk-averse policies in terms of the dynamic risk measure objective function. While this is important for gaining an understanding of the convergence properties of the algorithm, it may not be the most useful metric when evaluating policies in practice. It is common in both the literature and in industry to evaluate risk-averse policies along two dimensions, one metric for risk and one metric for reward; see, e.g., Philpott and de Matos [2012], Shapiro et al. [2013], and Çavus and Ruszczyński [2014]. The way that risk and reward are disentangled from each other is completely problem dependent, and typically this separation allows us to gain some qualitative insight regarding the behavior of the risk-averse policies.

The original motivation for the problem is to control the number of energy shortages, so an immediately obvious metric for risk is the probability of a shortage event. Another example metric for measuring risk in practical settings is the right tail of the cumulative penalties, which might be quantified using the familiar conditional value at risk. We can make these metrics precise as follows.
Let $\{A_0^\pi, A_1^\pi, \ldots, A_{T-1}^\pi\}$ be a policy and let $\{S_t^\pi\}$ and $\{F_t^\pi\}$ be processes representing the states visited and rewards/penalties assessed under the policy indexed by $\pi$. Moreover, we define the bad events to be $B_t^\pi = \{\mu_s(S_t^\pi) + U_{t+1} < 0\}$, that is, the events under which we are assessed a penalty. The formulas for our risk metrics are given by

$$\mathrm{Shortage}(\pi) = \mathbf{E}\!\left[\frac{1}{T} \sum_{t=0}^{T-1} \mathbf{1}_{B_t^\pi}\right] \quad \text{and} \quad \mathrm{Penalty}(\pi) = \mathrm{CVaR}_{0.99}\!\left[\sum_{t=0}^{T-1} F_{t+1}^\pi\, \mathbf{1}_{B_t^\pi}\right].$$

We can certainly construct other reasonable metrics for risk, but we note that the appropriateness of one particular metric over another is very much dependent on the application domain. In the results that follow, reward is measured using the standard risk-neutral objective given by

$$\mathrm{Revenue}(\pi) = \mathbf{E}\!\left[\sum_{t=0}^{T-1} c_t\bigl(S_t^\pi, A_t^\pi(S_t^\pi), W_{t+1}\bigr)\right].$$

We compute the above metrics of risk and reward for approximate policies produced by Dynamic-QBRM ADP with RDS after N = 5,000,000 iterations and plot a risk-reward frontier in Figure 2. We see that the risk can be driven down significantly by increasing the value of $\lambda$ (the tradeoff is, of course, lower revenue) in a way that completely aligns with our intuition. Notice that the policy associated with $\lambda = 0$ maximizes expected revenue (by definition), yet the undesirable events $B_t^\pi$ occur nearly ...% of the time. This amount of risk

may be intolerable in many real applications, particularly energy; therefore, one solution for producing risk-averse policies is to simply apply Dynamic-QBRM ADP with some $\lambda > 0$. In this case, the practitioner does not necessarily need to be concerned with the dynamic risk measure objective function, so long as the resulting ADP policies produce the desired behavior when it comes to the tradeoff between risk and reward.

Figure 2: Risk-Reward Frontiers from Dynamic-QBRM ADP with RDS for N = 5,000,000. Panel (A), Shortage Risk Metric: Shortage (%) against Revenue (in dollars). Panel (B), Penalty Risk Metric: Penalty (in dollars) against Revenue (in dollars). Points on each frontier are labeled by the value of $\lambda$.
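For readers who want to compute comparable metrics from their own simulations, the following sketch estimates the three quantities from sample paths. It rests on assumptions that go beyond the text: Shortage is taken to be the time-averaged frequency of the bad events, Penalty the empirical CVaR of the cumulative penalties assessed during bad events, and Revenue the sample mean of total revenue. The path layout (keys `bad`, `penalty`, `revenue`) and the helper names are hypothetical, and the empirical CVaR used here is the simple "average of the worst samples" estimator rather than anything specified in the paper.

```python
import math


def empirical_cvar(samples, alpha):
    """Average of the worst (largest) ceil((1 - alpha) * n) samples."""
    n = len(samples)
    # Rounding guards against floating-point noise such as (1 - 0.99) * 100 > 1.
    k = max(1, math.ceil(round((1.0 - alpha) * n, 9)))
    worst = sorted(samples, reverse=True)[:k]
    return sum(worst) / k


def risk_reward_metrics(paths, alpha=0.99):
    """Estimate Shortage, Penalty, and Revenue from simulated sample paths.

    Each path is a dict with hypothetical keys:
      'bad'     -- list of booleans, True when the bad event B_t occurred
      'penalty' -- list of penalties F_{t+1}, aligned with 'bad'
      'revenue' -- total revenue collected along the path
    """
    shortage_fractions = []
    cumulative_penalties = []
    revenues = []
    for path in paths:
        bad, penalty = path["bad"], path["penalty"]
        shortage_fractions.append(sum(bad) / len(bad))
        cumulative_penalties.append(sum(f for f, b in zip(penalty, bad) if b))
        revenues.append(path["revenue"])
    n = len(paths)
    return {
        "shortage": sum(shortage_fractions) / n,
        "penalty": empirical_cvar(cumulative_penalties, alpha),
        "revenue": sum(revenues) / n,
    }


# Four hand-made sample paths of length 4, for illustration only.
paths = [
    {"bad": [False, True, False, False], "penalty": [0.0, 5.0, 0.0, 0.0], "revenue": 100.0},
    {"bad": [False, False, False, False], "penalty": [0.0, 0.0, 0.0, 0.0], "revenue": 120.0},
    {"bad": [True, True, False, False], "penalty": [3.0, 4.0, 0.0, 0.0], "revenue": 80.0},
    {"bad": [False, False, False, False], "penalty": [0.0, 0.0, 0.0, 0.0], "revenue": 100.0},
]
metrics = risk_reward_metrics(paths, alpha=0.5)
print(metrics)
```

With so few paths the level `alpha=0.5` is used only to keep the toy example readable; a level such as 0.99 needs far more sample paths before the tail average is meaningful.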

References

W. A. Al-Qaq, M. Devetsikiotis, and J. K. Townsend. Stochastic gradient optimization of importance sampling for the efficient simulation of digital communication systems. IEEE Transactions on Communications, 1995.

O. Bardou and N. Frikha. Computing VaR and CVaR using stochastic approximation and adaptive unconstrained importance sampling. Monte Carlo Methods and Applications, 2009.

J. Bucklew. Introduction to rare event simulation. Springer New York, 2004.

O. Çavus and A. Ruszczyński. Computational methods for risk-averse undiscounted transient Markov models. Operations Research, 2014.

L. Cobb, P. Koppstein, and N. H. Chen. Estimation and moment recursion relations for multimodal distributions of the exponential family. Journal of the American Statistical Association, 1983.

D. Egloff and M. Leippold. Quantile estimation with adaptive importance sampling. The Annals of Statistics, 2010.

M. S. Oh and J. O. Berger. Integration of multimodal functions by Monte Carlo importance sampling. Journal of the American Statistical Association, 1993.

A. B. Philpott and V. L. de Matos. Dynamic sampling algorithms for multi-stage stochastic programs with risk aversion. European Journal of Operational Research, 2012.

R. Y. Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1999.

E. K. Ryu and S. P. Boyd. Adaptive importance sampling via stochastic convex programming. Working Paper, Stanford University, 2015.

A. Shapiro, W. Tekaya, J. P. da Costa, and M. P. Soares. Risk neutral and risk averse stochastic dual dynamic programming method. European Journal of Operational Research, 2013.

D. Siegmund. Importance sampling in the Monte Carlo study of sequential tests. The Annals of Statistics, 1976.
