Markov Chain Monte Carlo Methods for Stochastic Optimization
John R. Birge, The University of Chicago Booth School of Business
Joint work with Nicholas Polson, Chicago Booth
JRBirge, U Florida, Nov 2013
Themes
- Dynamic stochastic optimization models face the curse of dimensionality
- Randomization can overcome some dimensionality effects
- With sufficient symmetry, the results can be quite effective
- Keys are conditioning properly on outcomes, appropriate re-sampling and split sampling, and uses of Gaussian mixtures
Outline
- Basic problem
- Problems of estimation
- Particle filters and sequential methods for simulation
- Optimization by slice sampling
- Dynamic Gaussian problems
- Extensions to Gaussian mixtures
General Problem: Find a set of controls, u_t, t = 0, ..., T − 1, to

  min_{u_t ∈ U_t ∩ A_t; x_{t+1} = g_t(x_t, u_t)}  E[ Σ_{t=0}^{T−1} f_t(x_{t+1}, u_t) ]

where A_t is Σ_t-measurable (nonanticipative), U_t represents the possibility of additional constraints on the controls, g_t represents the state dynamics, and f_t represents the cost of the current action and next state transition.
Challenges:
- Rapid expansion of states in the dimension of x_t in each period
- Exponential growth of scenario trees over time
First Problem: What is x_t?
Example: hidden Markov model
- We begin with inventory x_0 (which may be uncertain)
- We make observations y_1, ..., y_T based on reported sales and new inputs
- What is the distribution of actual inventory x_t at any time t?
Bayesian Interpretation
- Suppose a distribution on x_0: f(x_0)
- Assume some transition: p(x_t | x_{t−1})
- Observations: q(y_t | x_t)
- Goal: find P(x_t | y_1, ..., y_t)
- Idea: use Bayes' rule to find the conditional probability:
  P(x_t | y^t) = P(x_t, y^t) / P(y^t) ∝ Π_{s ≤ t} q(y_s | x_s) p(x_s | x_{s−1}) f(x_0)
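When the state space is discrete, this Bayes-rule recursion can be computed exactly; a minimal sketch (the function and the two-state example in the test are illustrative, not from the talk):

```python
import numpy as np

def forward_filter(prior, trans, lik, obs):
    """Exact Bayes filter for a discrete-state hidden Markov model.

    prior: (S,) initial distribution f(x_0)
    trans: (S, S) matrix with trans[i, j] = p(x_t = j | x_{t-1} = i)
    lik:   function (y, j) -> q(y | x_t = j)
    obs:   observations y_1, ..., y_T
    Returns the filtered distributions P(x_t | y_1, ..., y_t) for each t.
    """
    post = prior.copy()
    out = []
    for y in obs:
        pred = post @ trans          # predict: sum_i p(x_t | x_{t-1}=i) P(x_{t-1}=i | y^{t-1})
        w = np.array([lik(y, j) for j in range(len(pred))])
        post = pred * w              # multiply in the observation likelihood q(y_t | x_t)
        post /= post.sum()           # normalize by P(y_t | y^{t-1})
        out.append(post.copy())
    return out
```

For continuous states without the linear-Gaussian structure below, the same recursion must be approximated, which is where the particle methods enter.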
Particle Methods
- Idea: keep a limited number (N) of particles in each period
- Propagate forward to obtain new points
- Re-sample (re-weight) backwards to get consistent weights
- Re-balance if weights are low
References: Gordon, Salmond, Smith (1993); Doucet (2008)
General Processes
- If the observation and transition functions q and p are quite general, then P(x_t | y^t) may not be available analytically (e.g., no sales observed if stocked out, a maximum limit, bulky demand)
- Alternatives: Markov chain Monte Carlo (MCMC); particle methods
Sequential Importance Resampling (SIR) Method
1. Propagate. Draw x_{t+1,j}, j = 1, ..., N, from f(x_{t+1,j} | x_{t,j}).
2. Re-weight. Update the weights ŵ_{t,j}, j = 1, ..., N, to
   ŵ_{t+1,j} = w_{t,j} g(y_{t+1} | x_{t+1,j}) / Σ_{j'=1}^N ŵ_{t,j'} g(y_{t+1} | x_{t+1,j'}), for j = 1, ..., N.
3. Re-sample. If a re-sampling condition is satisfied, draw N new samples x'_{t+1,j}, j = 1, ..., N, from {x_{t+1,j}} with probabilities ŵ_{t+1,j}; set x_{t+1,j} = x'_{t+1,j} and ŵ_{t+1,j} = 1/N for j = 1, ..., N.
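One SIR update can be sketched as follows; the effective-sample-size re-sampling trigger is an assumption (consistent with the Rosenbrock test slide later), and the Gaussian toy model in the test is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sir_step(particles, weights, propagate, obs_lik, y, ess_frac=0.5):
    """One SIR update: propagate, re-weight by the observation likelihood,
    and re-sample when the effective sample size drops below ess_frac * N."""
    N = len(particles)
    particles = propagate(particles)            # 1. propagate x_{t+1,j} ~ f(. | x_{t,j})
    w = weights * obs_lik(y, particles)         # 2. re-weight by g(y_{t+1} | x_{t+1,j})
    w /= w.sum()
    if 1.0 / np.sum(w ** 2) < ess_frac * N:     # 3. re-sample if ESS is too low
        idx = rng.choice(N, size=N, p=w)
        particles, w = particles[idx], np.full(N, 1.0 / N)
    return particles, w
```

Repeating this step per observation maintains a constant number of samples across time, in contrast to an explosively growing scenario tree.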
Tractable Case: LQG
Linear-Quadratic Gaussian (LQG) Model
Implications: all of the distributions are normal, and we can derive analytical formulas (the Kalman filter):
  y_t = H x_t + v_t;  x_t = F x_{t−1} + u_t;  v ~ N(0, R);  u ~ N(0, Q)
Iterations (predict, then update):
  x_t^− = F x_{t−1};  P^− = F P F^T + Q
  w_t = y_t − H x_t^−;  S = H P^− H^T + R;  K = P^− H^T S^{−1}
  x_t = x_t^− + K w_t;  P = (I − K H) P^−
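One predict/update cycle of these iterations as a sketch; matrix names follow the slide, and the scalar inventory numbers in the test are illustrative:

```python
import numpy as np

def kalman_step(x, P, y, F, H, Q, R):
    """One predict/update cycle of the Kalman filter for
    x_t = F x_{t-1} + u_t, u ~ N(0, Q);  y_t = H x_t + v_t, v ~ N(0, R)."""
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update
    w = y - H @ x_pred                      # innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ w
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

Running this once per observation gives the closed-form posterior x_t ~ N(x̂_t, P_t) shown on the next slide.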
Kalman Prediction
[Figure: inventory over 10 periods; actual (*) vs. observed (+)]
Kalman estimate: x_t ~ N(x̂_t, P_t); here x_10 ~ N(13.8, 0.62)
Comparison to MCMC/Particle Filter
- Idea: construct a sample distribution that fits the data using only information about one-step transitions
- Goal: maintain a constant number of samples across time instead of an explosively growing tree
- Can it work? Why?
Method Structure
[Diagram: weighted particles (w_t^j, x_t^j), j = 1, ..., N, propagate through the observation y_{t+1} and are re-sampled to (w_{t+1}^j, x_{t+1}^j)]
Cumulative Results (N = 50)
[Figure: cumulative particle results compared to the Kalman filter]
Particle Frequency (N = 50)
[Figure: particle frequencies (+) vs. Kalman density (*)]
Other Uses of Particles
- Particle filtering: finding x_t from y^t = (y_1, ..., y_t)
- Particle smoothing: finding x_t from y^T = (y_1, ..., y_T)
- Particle learning (Polson et al.): find parameters θ_t at the same time to explain the model
- Simulation
Extensions to Optimization
Problem: find x to min_x f(x) subject to g(x) ≤ 0
Pincus (1968): enough to sample from π_κ(x) ∝ exp{−κ f(x)}
With constraints (e.g., Geman/Geman (1984)): π_{κ,λ}(x) ∝ exp{−κ(f(x) + λ_κ g(x))}
Extension: augment with latent variables (λ, ω); then we can write
  π_{κ,λ}(x) ∝ exp{−κ(f(x) + λ_κ g(x))} = e^{ax} E_ω{e^{−x²ω/2}} · e^{bx} E_λ{e^{−x²λ/2}}
where the parameters (a, b) depend on (κ, λ_κ) and the functional forms of (f(x), g(x)).
Basic Optimization via Simulation
Pincus's result:
- Model: min_x F(x)
- Create a distribution P_λ(x) ∝ exp(−λ F(x))
- Then: lim_{λ→∞} E_{P_λ}(x) = x*
Markov Chain Simulation
- From x_t to x_{t+1}: sample over exp(−λ F(μ x_t + (1 − μ) x_{t+1}))
- Choose directions to obtain mixing in the region
Example
Rosenbrock function: φ(x = (x_1, x_2)) = (1 − x_1)² + 100 (x_2 − x_1²)², which has a global minimum at x* = (1, 1).
Specification for the chain:
  x_t = u if v ≤ e^{−κ(φ(u) − φ(x_{t−1}))};  x_t = x_{t−1} otherwise,
where u ~ U([−4, 4]²), a uniform distribution over [−4, 4] × [−4, 4], and v ~ U([0, 1]), a standard uniform draw.
Note: this corresponds to the Metropolis-Hastings method of Markov chain Monte Carlo using the distribution proportional to e^{−κ φ(x)} in the acceptance criterion.
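The accept/reject rule above can be sketched directly; the best-point tracking is an added convenience for reading off the minimizer, not part of the chain itself:

```python
import numpy as np

rng = np.random.default_rng(1)

def rosenbrock(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

def mh_minimize(phi, kappa=10.0, T=5000):
    """Independence Metropolis-Hastings targeting exp(-kappa * phi(x)):
    propose u uniformly on [-4, 4]^2 and accept with probability
    min(1, e^{-kappa (phi(u) - phi(x))})."""
    x = rng.uniform(-4, 4, size=2)
    best = x.copy()
    for _ in range(T):
        u = rng.uniform(-4, 4, size=2)          # uniform proposal over the box
        d = phi(u) - phi(x)
        if d <= 0 or rng.uniform() <= np.exp(-kappa * d):
            x = u                                # accept
        if phi(x) < phi(best):
            best = x.copy()                      # track the best point visited
    return best
```

Larger κ concentrates the target near the minimizer but slows mixing, which is the trade-off the κ = 2, 10, 100 comparison on the next slides explores.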
Test for Rosenbrock
- T = 5000 iterations
- N = 500 replications
- Re-sample when: effective sample size < 250
- κ = 2, 10, 100
Rosenbrock Graphs
[Figures: results for κ = 2, 10, 100]
Other Examples: Rastrigin
Himmelblau
Shubert
Combine Particles and Simulation
- Suppose N particles are used
- Consider a two-stage problem: f(x, ξ) = c(x) + Q(x, y, ξ)
- Distribution on x, y_i, ξ: P(x, y_1, ..., y_N, ξ) ∝ Π_{i=1}^N (c(x) + Q(x, y_i, ξ_i))
- Result: E_P(x) → x*
Implementation
- Propagation forward on the y's: q(ξ_j | x) ∝ {c(x) + Q(ξ_j, y_j, x)} p(ξ_j), where y_j = argmin_y (c(x) + Q(ξ_j, y, x))
- p(x | ξ_1, y_1, ..., ξ_N, y_N) ∝ Π_{j=1}^N {c(x) + Q(ξ_j, y_j, x)} p(ξ_j)
- The particles concentrate the distribution on x* as in the deterministic optimization.
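A minimal sketch of the two-stage idea, using random-walk Metropolis on the first-stage decision x with target proportional to exp(−κ Σ_j [c(x) + Q(ξ_j, x)]). This is a Pincus-style exponential variant of the product form on the slide, and it assumes the inner minimization over y is already folded into Q; the quadratic toy costs in the test are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def two_stage_mcmc(c, Q, xis, kappa=5.0, T=4000, step=0.5):
    """Random-walk Metropolis on x, targeting a density proportional to
    exp(-kappa * sum_j [c(x) + Q(xi_j, x)]), where Q(xi_j, x) is the
    optimal second-stage (recourse) cost for scenario xi_j."""
    def total_cost(x):
        return sum(c(x) + Q(xi, x) for xi in xis)   # sample-average cost over scenarios
    x = 0.0
    samples = []
    for t in range(T):
        u = x + step * rng.normal()                 # random-walk proposal
        d = total_cost(u) - total_cost(x)
        if d <= 0 or rng.uniform() <= np.exp(-kappa * d):
            x = u
        if t >= T // 2:                             # discard the first half as burn-in
            samples.append(x)
    return np.mean(samples)                         # posterior mean estimates x*
```

As κ grows, the target concentrates on the minimizer of the sample-average cost, mirroring E_P(x) → x* on the previous slide.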
Example
From Birge/Louveaux, Chapter 1 (the farmer's problem)
Dynamic Models
General idea:
- Create a distribution over decision variables and random variables
- Use MCMC to sample with a sufficient number of particles or annealing parameter so that the marginal mean converges to the optimum
Extension for dynamics:
- Keep the number of sample particles constant per stage
- Convergence with a fixed number of particles per stage?
Example: Linear-Quadratic Gaussian
- All distributions are available in closed form
- Dynamics: x_t = A x_{t−1} + B_t u_{t−1} + v_t
- All normal distributions with a risk-sensitive objective:
  −log E(exp(−(θ/2)(x_2^T Q x_2 + u_1^T R u_1)))
  where u_1 is the control in the first period
- Note: equivalent to a robust formulation
- Now, all quantities can be found in closed form
Results for LQG
- Mean of u: u_1^m = L_1(θ) x_1, where L_1 is the result from the Kalman filter
- Variance of u: V_{u_1}(N), where V_{u_1}(N) → 0 as N → ∞
- u_1^m(θ) → u_1* as θ → 0, where u_1* is the two-stage risk-neutral solution
Extensions to T Stages
Can derive S_t^θ recursively to show that:
  p(x_t | x^{t−1}, u^{t−1}) = K exp(−(1/2)(Σ_{s=1}^{t−1} (x_s^T Q_s x_s + u_s^T R_s u_s) + x_t^T S_t^θ x_t + (x_t − A x_{t−1} − B u_{t−1})^T V^{−1} (x_t − A x_{t−1} − B u_{t−1})))
For particle methods, j = 1, ..., N: each x_{t,j} is drawn consistently from this distribution.
General Method Structure
[Diagram: weighted particle-control triples (w_t^j, x_t^j, u_t^j), j = 1, ..., N, propagate to x_{t+1}^j and are then re-sampled]
General Result
- If the propagation step is unbiased, then E(û_1) → u_1*
- The unbiased result is possible in LQG from the recursive structure
- Harder to obtain in more general systems (without additional future sampling)
Two-Stage: Compare to Direct MC
[Figures: low risk aversion vs. high risk aversion]
Three Stage Example
Generalizing Objectives
- Direct: risk-sensitive exponential utility exp(−θ Q(x, y, ξ)), with Q quadratic
- Objective: e^{−θ|y − d|} = E_λ[exp(−(θ²/(2λ_j))(y_j − d)²)], where λ ~ Exp(2)
- Also, use: E[e^{−θ|y − d|}] ≈ 1 − θ E|y − d| + O(θ²)
Further Generalizations
- Use objective approximations that preserve normal forms
- Extend LQG-like results to this general framework
- Find other frameworks where unbiased future samples are available
- Obtain sensitivity results on bias from imperfect future sampling
- Use optimization to pick the maximum-likelihood point instead
Conclusions
- Techniques from MCMC can be used effectively in optimization to reduce dimensionality effects
- Results work well in situations with sufficient symmetry
- Keys are to effectively use re-sampling, split (and slice) sampling, and latent variables
Other Questions to Consider
- How can learning and parameter estimation be combined with optimization?
- What is the range of problems that can be approached in this manner?
- What are the best uses of optimization technology as part of this process?
Thank you! Questions?