A variational formula for risk-sensitive control


1 A variational formula for risk-sensitive control Vivek S. Borkar, IIT Bombay May 9, 2018, IMA, Minneapolis

2 PART I: Risk-sensitive reward for Markov chains (joint work with V. Anantharam, University of California, Berkeley). PART II: Risk-sensitive cost for reflected diffusions (joint work with Ari Arapostathis, University of Texas at Austin, and K. Suresh Kumar, IIT Bombay). PART III: Linear/dynamic programming approaches for degenerate finite risk-sensitive reward problems.

3 PART I

4 Courant-Fischer formula for the principal eigenvalue of a positive definite matrix $A \in \mathbb{R}^{d \times d}$:
$$\lambda = \max_{0 \neq x \in \mathbb{R}^d} \frac{x^T A x}{x^T x}.$$
Consider an irreducible nonnegative $Q \in \mathbb{R}^{d \times d}$. Then the Perron-Frobenius theorem guarantees a positive principal eigenvalue with an associated positive eigenvector. Is there a counterpart of the Courant-Fischer formula?

5 YES!! The Collatz-Wielandt formula for the principal eigenvalue of an irreducible nonnegative matrix $Q = [[q(i,j)]] \in \mathbb{R}^{d \times d}$:
$$\lambda = \sup_{x = [x_1, \dots, x_d]^T,\ x_i > 0\ \forall i}\ \min_{i : x_i > 0} \frac{(Qx)_i}{x_i} = \inf_{x = [x_1, \dots, x_d]^T,\ x_i > 0\ \forall i}\ \max_{i : x_i > 0} \frac{(Qx)_i}{x_i}.$$
An alternative characterization: write $Q = DP$, where $D = \mathrm{diag}(d_1, \dots, d_d)$ and $P = [[p(j|i)]]$ is stochastic.
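
[Not part of the slides: a small numerical sanity check of the Collatz-Wielandt characterization, using NumPy on an arbitrary strictly positive (hence irreducible) matrix; the matrix and tolerances are illustrative choices only.]

```python
# Sanity check of the Collatz-Wielandt formula for a nonnegative matrix.
# Illustrative sketch only; Q below is an arbitrary strictly positive matrix.
import numpy as np

rng = np.random.default_rng(0)
d = 5
Q = rng.uniform(0.1, 1.0, size=(d, d))    # strictly positive => irreducible

# Perron-Frobenius eigenvalue (largest real part, positive for this Q).
eigvals, eigvecs = np.linalg.eig(Q)
k = np.argmax(eigvals.real)
lam = eigvals[k].real
v = np.abs(eigvecs[:, k].real)            # positive Perron eigenvector

def cw_lower(x):  # min_i (Qx)_i / x_i, a lower bound on lambda for x > 0
    return np.min(Q @ x / x)

def cw_upper(x):  # max_i (Qx)_i / x_i, an upper bound on lambda for x > 0
    return np.max(Q @ x / x)

# Any positive x brackets lambda; the Perron eigenvector makes both sides equal.
x = rng.uniform(0.5, 1.5, size=d)
print(cw_lower(x), "<=", lam, "<=", cw_upper(x))
assert cw_lower(x) <= lam + 1e-10 and lam <= cw_upper(x) + 1e-10
assert np.isclose(cw_lower(v), lam) and np.isclose(cw_upper(v), lam)
```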

6 Also define $G_0 := \{ (\pi, \tilde{P}) : \pi$ is a stationary probability for the stochastic matrix $\tilde{P} = [[\tilde{p}(j|i)]] \}$. Then the following representation holds:
$$\log \lambda = \sup_{(\pi, \tilde{P}) \in G_0} \sum_i \pi(i) \big[ \log d(i) - D(\tilde{p}(\cdot|i) \,\|\, p(\cdot|i)) \big].$$
This is the Donsker-Varadhan formula for the principal eigenvalue of a nonnegative matrix (cf. the book by Dembo and Zeitouni).
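
[Again not from the slides: a hedged numerical illustration of this representation. For the decomposition $Q = DP$, the supremum is attained at the "twisted" kernel built from the Perron eigenvector; evaluating the bracketed expression at that kernel recovers $\log \lambda$. The matrix is arbitrary and the choice of maximizer is assumed.]

```python
# Numerical illustration of the Donsker-Varadhan representation
#   log(lambda) = sup_{(pi, P~)} sum_i pi(i) [ log d(i) - D(p~(.|i) || p(.|i)) ],
# evaluated at the candidate maximizer induced by the Perron eigenvector.
import numpy as np

rng = np.random.default_rng(1)
d = 4
Q = rng.uniform(0.1, 1.0, size=(d, d))

eigvals, eigvecs = np.linalg.eig(Q)
k = np.argmax(eigvals.real)
lam = eigvals[k].real
v = np.abs(eigvecs[:, k].real)             # right Perron eigenvector

D_row = Q.sum(axis=1)                       # d(i): row sums
P = Q / D_row[:, None]                      # p(j|i) = q(i,j)/d(i), stochastic

# Twisted kernel p~(j|i) proportional to q(i,j) v(j), normalized row-wise.
P_twist = Q * v[None, :]
P_twist = P_twist / P_twist.sum(axis=1, keepdims=True)

# Stationary distribution pi of the twisted kernel (left eigenvector for eigenvalue 1).
w, vl = np.linalg.eig(P_twist.T)
pi = np.abs(vl[:, np.argmin(np.abs(w - 1.0))].real)
pi = pi / pi.sum()

kl_rows = (P_twist * np.log(P_twist / P)).sum(axis=1)   # D(p~(.|i) || p(.|i))
value = np.dot(pi, np.log(D_row) - kl_rows)

print(value, "should match", np.log(lam))
assert np.isclose(value, np.log(lam), atol=1e-8)
```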

7 An infinite-dimensional generalization of the Perron-Frobenius theorem is given by the Krein-Rutman theorem: Let 1. $B$ be a Banach space with a positive cone $K$ such that $K - K$ is dense in $B$, and 2. $T : B \to B$ a compact linear operator which is strongly positive. Then a principal eigenvalue (unique, positive) and eigenvector (positive) exist.

8 Our interest is in the following nonlinear scenario arising in risk-sensitive control: consider a controlled Markov chain $\{X_n\}$ on a compact metric state space $S$, an associated control process $\{Z_n\}$ in a compact metric control space $U$, a per-stage reward function $r : S \times U \times S \to \mathbb{R}$ with $r \in C(S \times U \times S)$,

9 a controlled transition kernel $p(dy|x,u)$ with full support, such that
$$P(X_{n+1} \in A \mid X_m, Z_m,\ m \leq n) = p(A \mid X_n, Z_n),$$
and the maps $(x,u) \mapsto \int f(y)\, p(dy|x,u)$, $f \in C(S)$, $\|f\| \leq 1$, are equicontinuous. (This can be relaxed via an approximation argument.)

10 The control problem is to maximize the asymptotic growth rate of the exponential reward:
$$\lambda := \sup_{x \in S} \sup \liminf_{N \to \infty} \frac{1}{N} \log E\Big[ e^{\sum_{m=0}^{N-1} r(X_m, Z_m, X_{m+1})} \,\Big|\, X_0 = x \Big].$$
The second supremum is over all admissible (i.e., non-anticipative) controls. We allow relaxed (i.e., probability measure valued) controls. Define
$$T f(x) := \sup_{\varphi \in P(U)} \int\!\!\int p(dy|x,u)\, \varphi(du)\, e^{r(x,u,y)} f(y), \qquad T^{(n)} := T \circ T \circ \cdots \circ T \ (n \text{ times}), \qquad T^{(0)} := \mathrm{Id}.$$
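
[Not from the slides: a finite-state, fixed-policy (i.e., uncontrolled) illustration of the quantity inside the supremum. The expected exponential reward over $N$ steps equals $(M^N \mathbf{1})(i)$ with $M(i,j) = p(j|i)\, e^{r(i,j)}$, so the growth rate converges to the log of the Perron root of $M$. All model data below are made up.]

```python
# Finite-state, uncontrolled illustration of the exponential-reward growth rate:
#   E_i[ exp(sum_{m<N} r(X_m, X_{m+1})) ] = (M^N 1)(i),  M(i,j) = p(j|i) e^{r(i,j)},
# so (1/N) log E_i[...] converges to log of the Perron root of M. Toy data only.
import numpy as np

rng = np.random.default_rng(2)
d = 4
P = rng.uniform(0.1, 1.0, size=(d, d))
P = P / P.sum(axis=1, keepdims=True)        # transition matrix p(j|i)
r = rng.uniform(-1.0, 1.0, size=(d, d))     # per-stage reward r(i, j)

M = P * np.exp(r)                            # "multiplicative" kernel
log_rho = np.log(np.max(np.linalg.eigvals(M).real))

for N in (1, 10, 50, 200):
    w = np.linalg.matrix_power(M, N) @ np.ones(d)   # E_i[exp(total reward)] over N steps
    print(N, np.log(w) / N)                          # converges componentwise to log_rho
print("log Perron root:", log_rho)
```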

11 $T^{(n)} : C(S) \to C(S)$, $n \geq 0$, is the Nisio semigroup, satisfying: 1. strictly increasing: $f > g \implies T^{(n)} f > T^{(n)} g$; 2. strongly positive: $f \geq 0$, $f \neq \theta \implies T^{(n)} f \in \mathrm{int}(C_+(S))$; 3. positively 1-homogeneous: $c > 0 \implies T^{(n)}(cf) = c\, T^{(n)} f$; 4. compact.

12 This leads to an abstract Collatz-Wielandt formula:
Theorem. There exist $\rho > 0$, $\psi \in \mathrm{int}(C_+(S))$ such that $T\psi = \rho\psi$ and
$$\rho = \inf_{f \in \mathrm{int}(C_+(S))} \sup_{\mu \in M_+(S)} \frac{\int T f\, d\mu}{\int f\, d\mu} = \sup_{f \in \mathrm{int}(C_+(S))} \inf_{\mu \in M_+(S)} \frac{\int T f\, d\mu}{\int f\, d\mu}.$$
Furthermore, $\log \rho$ is the optimal reward for the risk-sensitive control problem. The proof uses the Nussbaum-Ogiwara formulation of a nonlinear Krein-Rutman theorem.
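
[Not from the slides: a sketch of how $(\rho, \psi)$ can be approximated in a finite-state, finite-action model by normalized fixed-point ("power") iteration on $T$, using the fact that for finite $U$ the supremum over $P(U)$ is attained at a point mass. All data are synthetic and full support of the kernels is assumed.]

```python
# Normalized fixed-point iteration for the nonlinear eigenproblem T(psi) = rho * psi,
# where (T f)(i) = max_u sum_j p(j|i,u) e^{r(i,u,j)} f(j). Synthetic finite model.
import numpy as np

rng = np.random.default_rng(3)
nS, nU = 5, 3
P = rng.uniform(0.1, 1.0, size=(nU, nS, nS))
P = P / P.sum(axis=2, keepdims=True)              # p(j|i,u), full support
r = rng.uniform(-1.0, 1.0, size=(nU, nS, nS))     # r(i,u,j), indexed as r[u,i,j]

def T(f):
    # (T f)(i) = max over u of sum_j p(j|i,u) * exp(r(i,u,j)) * f(j)
    return np.max(np.einsum('uij,uij,j->ui', P, np.exp(r), f), axis=0)

psi = np.ones(nS)
for _ in range(2000):
    Tpsi = T(psi)
    rho = np.max(Tpsi)          # any norm works: T is positively 1-homogeneous
    psi_new = Tpsi / rho
    if np.max(np.abs(psi_new - psi)) < 1e-12:
        psi = psi_new
        break
    psi = psi_new

print("rho =", rho, " log rho (optimal growth rate) =", np.log(rho))
print("fixed-point residual:", np.max(np.abs(T(psi) - rho * psi)))
```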

13 A variational formula: Let $G :=$ the set of probability measures $\eta(dx, du, dy) \in P(S \times U \times S)$ which disintegrate as $\eta(dx, du, dy) = \eta_0(dx)\, \eta_1(du|x)\, \eta_2(dy|x,u)$, such that $\eta_0$ is invariant under the transition kernel $\int_U \eta_2(dy|x,u)\, \eta_1(du|x)$. Let $D(\cdot \,\|\, \cdot)$ denote the Kullback-Leibler divergence or relative entropy.

14 Theorem. Under the above hypotheses,
$$\log \rho = \sup_{\eta \in G} \int \eta_0(dx)\, \eta_1(du|x) \Big[ \int r(x,u,y)\, \eta_2(dy|x,u) - D(\eta_2(dy|x,u) \,\|\, p(dy|x,u)) \Big].$$
This generalizes the Donsker-Varadhan formula. The hypotheses can be relaxed to: 1. $r$ may take the value $-\infty$, with $e^r \in C(S \times U \times S)$; 2. $p(dy|x,u)$ need not have full support. (Extension via an approximation argument.)

15 We have an equivalent concave maximization problem, as opposed to the team problem one would obtain from the usual logarithmic transformation. If $\rho(\varphi)$ denotes the asymptotic growth rate for a randomized Markov control $\varphi$, then $\rho = \max_\varphi \rho(\varphi)$ (sufficiency of randomized Markov controls). Related to entropy-penalized control (Bierkens-Kappen, Guan-Raginsky-Willett, Todorov).

16 Applications: 1. Growth rate of the number of directed paths in a graph (requires $-\infty$ as a possible reward value). 2. Portfolio optimization in the framework of Bielecki, Hernandez-Hernández and Pliska. 3. The problem of minimizing the exit rate from a domain.

17 PART II

18 Reflected diffusion:
$$dX(t) = b(X(t), v(t))\, dt + \sigma(X(t))\, dW(t) - \gamma(X(t))\, d\xi(t), \qquad d\xi(t) = I\{X(t) \in \partial Q\}\, d\xi(t), \qquad t \geq 0.$$
Here: 1. $Q$ is open and bounded with $C^3$ boundary $\partial Q$. 2. $b$ is continuous, with $b(\cdot, u)$ Lipschitz uniformly in $u$. 3. $\sigma$ is $C^{1, \beta_0}$ and uniformly non-degenerate.

19 4. $\gamma(x) = \sigma(x)\sigma(x)^T \eta(x)$, where $\eta(x)$ is the unit outward normal. 5. $W(\cdot)$ is a Brownian motion and $v(\cdot)$ is a non-anticipative control taking values in a compact action space.
OBJECTIVE: Minimize
$$\lim_{t \to \infty} \frac{1}{t} \log E\Big[ e^{\int_0^t r(X(s), v(s))\, ds} \Big],$$
where $r$ is continuous.
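
[Not from the slides: a crude Monte Carlo sketch of this objective for a one-dimensional diffusion reflected at the boundary of $Q = (0,1)$ with a fixed (constant) control, using Euler-Maruyama with simple projection onto $[0,1]$ in place of an exact reflection scheme. All coefficients are made up and the estimate is only indicative, since the expectation of an exponential functional is heavy-tailed.]

```python
# Crude Monte Carlo estimate of (1/t) log E[ exp( int_0^t r(X(s)) ds ) ]
# for a 1-D diffusion reflected at the boundary of Q = (0,1).
# Euler-Maruyama with projection; illustrative only, coefficients are arbitrary.
import numpy as np

rng = np.random.default_rng(4)

def b(x):      return -0.5 * (x - 0.3)     # drift (fixed control absorbed into b)
def sigma(x):  return 0.4                  # diffusion coefficient
def r(x):      return x * (1.0 - x)        # running cost

T, dt, n_paths = 20.0, 1e-3, 2000
n_steps = int(T / dt)

x = np.full(n_paths, 0.5)
integral_r = np.zeros(n_paths)
for _ in range(n_steps):
    integral_r += r(x) * dt
    dw = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    x = x + b(x) * dt + sigma(x) * dw
    x = np.clip(x, 0.0, 1.0)               # projection in place of exact reflection

estimate = np.log(np.mean(np.exp(integral_r))) / T
print("Monte Carlo estimate of the risk-sensitive cost:", estimate)
```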

20 Nisio semigroup: for $t \geq 0$,
$$S_t f(x) := \inf_{v(\cdot)} E_x\Big[ e^{\int_0^t r(X(s), v(s))\, ds} f(X(t)) \Big].$$
Then $S_t : C(\bar{Q}) \to C(\bar{Q})$ is a strongly continuous semigroup of bounded, Lipschitz, monotone operators with infinitesimal generator $G$ defined by
$$G f(x) = \frac{1}{2} \mathrm{tr}\big( \sigma(x)\sigma^T(x)\, \nabla^2 f(x) \big) + \min_v \big[ \langle b(x,v), \nabla f(x) \rangle + r(x,v) f(x) \big].$$
It is also superadditive, positively 1-homogeneous, strongly positive, and completely continuous.

21 Let $C^2_{\gamma,+}(\bar{Q}) := \{ f : \bar{Q} \to [0, \infty) : f$ is $C^2$ with $\langle \nabla f(x), \gamma(x) \rangle = 0$ for $x \in \partial Q \}$.
Nonlinear Krein-Rutman theorem $\implies$ there exists a unique pair $(\rho, \varphi) \in \mathbb{R} \times C^2_{\gamma,+}(\bar{Q})$ with $\varphi \geq 0$, $\|\varphi\|_{\infty, \bar{Q}} = 1$, such that $S_t \varphi = e^{\rho t} \varphi$. This solves
$$G\varphi(x) = \rho\, \varphi(x), \ x \in Q, \qquad \langle \nabla \varphi(x), \gamma(x) \rangle = 0, \ x \in \partial Q.$$

22 Abstract Collatz-Wielandt formula $\implies$
$$\rho = \inf_{f \in C^2_{\gamma,+}(\bar{Q}),\ f > 0}\ \sup_{\nu \in P(\bar{Q})} \int \frac{G f}{f}\, d\nu = \sup_{f \in C^2_{\gamma,+}(\bar{Q}),\ f > 0}\ \inf_{\nu \in P(\bar{Q})} \int \frac{G f}{f}\, d\nu.$$
In the uncontrolled case, the first formula is the convex dual of the Donsker-Varadhan formula for the principal eigenvalue of $G$:
$$\rho = \sup_{\nu \in P(\bar{Q})} \Big( \int_{\bar{Q}} r(x)\, \nu(dx) - I(\nu) \Big), \qquad \text{where} \quad I(\nu) := -\inf_{f \in C^2_{\gamma,+}(\bar{Q}),\ f > 0} \int_{\bar{Q}} \frac{\tilde{G} f}{f}\, d\nu,$$
with $\tilde{G} := G - r$ denoting the generator without the potential term.

23 PART III

24 Notation: $\{X_n, n \geq 0\}$ : controlled Markov chain with finite state space $S := \{1, 2, \dots, s\}$ and finite action space $U$. $\{Z_n\}$ : $U$-valued control process. $P_u = [[p(j|i,u)]]_{i,j \in S}$ : controlled transition matrix. $r(\cdot,\cdot,\cdot) \in C(S \times U \times S)$ : per-stage reward.

25 Risk-sensitive reward. Aim: maximize the asymptotic growth rate
$$\lambda := \sup_i \sup_{\{Z_n\}} \liminf_{N \to \infty} \frac{1}{N} \log E_i\Big[ e^{\sum_{m=0}^{N-1} r(X_m, Z_m, X_{m+1})} \Big].$$
Here $E_i[\,\cdot\,] :=$ expectation given $X_0 = i$, and the inner supremum is over all admissible controls.

26 Consider a controlled Markov chain $\{Y_n\}$ on $S$ with state-dependent action space at state $i$ given by
$$\tilde{U}_i := \bigcup_{u \in U} \big( \{u\} \times V_{i,u} \big), \qquad \text{where} \quad V_{i,u} := \Big\{ q(\cdot|i,u) : q(\cdot|i,u) \geq 0,\ \sum_j q(j|i,u) = 1 \Big\}.$$
(This is isomorphic to $P(S)$.) Let $K := \bigcup_{i \in S} \big( \{i\} \times \tilde{U}_i \big)$. The (controlled) transition probabilities of $\{Y_n\}$ are
$$\tilde{p}\big(j \mid i, (u, q(\cdot|i,u))\big) := q(j|i,u).$$

27 Define the per-stage reward $\tilde{r} : K \times S \to \mathbb{R}$ by
$$\tilde{r}\big(i, (u, q(\cdot|i,u)), j\big) := r(i,u,j) - D\big(q(\cdot|i,u) \,\|\, p(\cdot|i,u)\big).$$
Let $\{(Z_n, Q_n), n \geq 0\}$ := the $\tilde{U}_{Y_n}$-valued control process. Consider the problem: maximize the long-run average reward
$$\liminf_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} E_x\big[ \tilde{r}\big(Y_n, (Z_n, Q_n), Y_{n+1}\big) \big].$$

28 Define the corresponding ergodic occupation measure $\gamma \in P(K \times S)$ by
$$\gamma(i, (u, dq), j) := \gamma_1(i)\, \gamma_2(u, dq \mid i)\, \gamma_3(j \mid i, (u, q)),$$
where $\gamma_1$ is an invariant probability distribution (not necessarily unique) under the transition kernel
$$\check{\gamma}(j|i) = \sum_u \int_{V_{i,u}} \gamma_2(u, dq \mid i)\, \gamma_3(j \mid i, (u, q)).$$
Let $E :=$ the set of such $\gamma$.

29 The above average-reward control problem is equivalent to the linear program
(P0): Maximize $\sum_{i,j,u} \int \gamma(i, (u, dq), j)\, \tilde{r}(i, (u, q), j)$ over $E$.
(Recall that $E$ is specified by linear constraints.) The maximum will be attained at an extreme point of $E$ corresponding to a stationary Markov policy.

30 This LP can be simplified as: Maximize
$$\sum_{i,u,j} \gamma'(i,u,j) \big[ r(i,u,j) - D\big(q(\cdot|i,u) \,\|\, p(\cdot|i,u)\big) \big]$$
over
$$\tilde{E} := \Big\{ \gamma' \in P(S \times U \times S) : \gamma'(i,u,j) = \gamma'_1(i)\, \phi(u|i)\, q(j|i,u), \text{ where } \gamma'_1(\cdot) \text{ is invariant under the transition kernel } \hat{\gamma}(j|i) := \sum_u \phi(u|i)\, q(j|i,u) \Big\}.$$
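
[Not part of the slides: a sketch of how this simplified concave program might be solved numerically for a small synthetic model. The objective is rewritten with relative-entropy atoms $\gamma'(i,u,j)\log\big(\gamma'(i,u,j)/(m(i,u)\,p(j|i,u))\big)$, $m(i,u) := \sum_j \gamma'(i,u,j)$, and the invariance condition becomes linear balance equations. It uses CVXPY and assumes an exponential-cone-capable solver is installed.]

```python
# Sketch of the simplified concave program over occupation measures gamma'(i,u,j):
#   maximize  sum gamma'*r  -  sum_{i,u,j} gamma' log( gamma' / (m(i,u) p(j|i,u)) ),
# subject to gamma' >= 0, sum = 1, and the balance (invariance) equations.
# Synthetic model; by the variational formula the optimal value equals log rho.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
nS, nU = 4, 2
P = rng.uniform(0.1, 1.0, size=(nS, nU, nS))
P = P / P.sum(axis=2, keepdims=True)             # p(j|i,u)
R = rng.uniform(-1.0, 1.0, size=(nS, nU, nS))    # r(i,u,j)

P2 = P.reshape(nS * nU, nS)                      # row index = i*nU + u
R2 = R.reshape(nS * nU, nS)

g = cp.Variable((nS * nU, nS), nonneg=True)      # gamma'(i,u,j)
m = cp.sum(g, axis=1)                            # m(i,u) = sum_j gamma'(i,u,j)

# KL-type penalty: sum_{i,u,j} rel_entr( gamma'(i,u,j), m(i,u) * p(j|i,u) )
m_mat = cp.reshape(m, (nS * nU, 1)) @ np.ones((1, nS))
penalty = cp.sum(cp.rel_entr(g, cp.multiply(P2, m_mat)))
objective = cp.sum(cp.multiply(R2, g)) - penalty

constraints = [cp.sum(g) == 1]
for j in range(nS):
    inflow = cp.sum(g[:, j])                     # sum_{i,u} gamma'(i,u,j)
    outflow = cp.sum(g[j * nU:(j + 1) * nU, :])  # sum_{u,k} gamma'(j,u,k)
    constraints.append(inflow == outflow)

prob = cp.Problem(cp.Maximize(objective), constraints)
prob.solve()
print("optimal value (should equal log rho for this model):", prob.value)
```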

31 The dual LP is: Minimize $\lambda$ subject to
$$\lambda \geq \lambda(i), \qquad \lambda(i) \geq \sum_j q(j|i,u)\, \lambda(j), \qquad \lambda(i) + V(i) \geq \sum_j q(j|i,u) \big( \tilde{r}(i, (u, q(\cdot|i,u)), j) + V(j) \big),$$
for all $i \in S$ and $(u, q(\cdot|i,u)) \in \tilde{U}_i$. The proof goes through finite approximations. Note that the LP has infinitely many constraints.

32 Dynamic Programming. The equivalent dynamic programming formulation is:
$$\lambda = \max_i \lambda(i),$$
$$\lambda(i) + V(i) = \max_{(u,\, q(\cdot|i,u))} \sum_j q(j|i,u) \big( V(j) + \tilde{r}(i, (u, q(\cdot|i,u)), j) \big), \qquad (*)$$
$$\lambda(i) = \max_{(u,\, q(\cdot|i,u)) \in B_i} \sum_j q(j|i,u)\, \lambda(j), \qquad i \in S,$$
where $B_i$ is the Argmax in $(*)$. Once again, the proof goes through finite approximations.

33 The maximization over $q$ in $(*)$ can be performed explicitly using the Gibbs variational principle from statistical mechanics: for fixed $i, u$, the maximum is attained at
$$q^*(j|i,u) := \frac{p(j|i,u)\, e^{r(i,u,j) + V(j)}}{\sum_k p(k|i,u)\, e^{r(i,u,k) + V(k)}}.$$
Substitute back, set $\Phi(i) := e^{V(i)}$, $\Lambda(i) := e^{\lambda(i)}$, $i \in S$, and exponentiate both sides of $(*)$.
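
[Not from the slides: the Gibbs variational principle invoked here states that $\max_q \{ \sum_j q(j) a(j) - D(q \,\|\, p) \} = \log \sum_j p(j) e^{a(j)}$, attained at $q^*(j) \propto p(j) e^{a(j)}$; here $a(j)$ plays the role of $r(i,u,j) + V(j)$ for fixed $(i,u)$. A quick numerical check with arbitrary data:]

```python
# Check of the Gibbs variational principle used to eliminate q in (*):
#   max_q [ sum_j q(j) a(j) - D(q || p) ] = log sum_j p(j) exp(a(j)),
# attained at q*(j) = p(j) exp(a(j)) / sum_k p(k) exp(a(k)). Toy data only.
import numpy as np

rng = np.random.default_rng(6)
n = 6
p = rng.uniform(0.1, 1.0, size=n); p /= p.sum()
a = rng.uniform(-1.0, 1.0, size=n)

def value(q):
    return np.dot(q, a) - np.dot(q, np.log(q / p))   # objective at kernel q

q_star = p * np.exp(a); q_star /= q_star.sum()
log_sum = np.log(np.dot(p, np.exp(a)))

print(value(q_star), "=", log_sum)
assert np.isclose(value(q_star), log_sum)

# Any other q gives a value no larger than the log-sum-exp.
for _ in range(5):
    q = rng.uniform(0.1, 1.0, size=n); q /= q.sum()
    assert value(q) <= log_sum + 1e-12
```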

34 This leads to the multiplicative dynamic programming equations for infinite-horizon risk-sensitive reward in the general degenerate case:
$$\Lambda(i)\Phi(i) = \max_u \sum_j p(j|i,u)\, e^{r(i,u,j)}\, \Phi(j), \qquad (**)$$
$$\Lambda(i) = \max_{u \in D_i} \sum_j \frac{p(j|i,u)\, e^{r(i,u,j)}\, \Phi(j)}{\sum_k p(k|i,u)\, e^{r(i,u,k)}\, \Phi(k)}\, \Lambda(j), \qquad i \in S,$$
where $D_i$ is the Argmax in $(**)$. This is the analog of the Howard-Kallenberg results for ergodic control. Observe the occurrence of the twisted kernel.

35 FUTURE PROBLEMS: 1. Extension to non-compact state spaces (cf. recent work of Cavazos-Cadena (Math. OR, 2018)) 2. Degenerate case for diffusions

36 References
1. V. Anantharam, V. S. Borkar, A variational formula for risk-sensitive reward, submitted.
2. A. Arapostathis, V. S. Borkar, K. Suresh Kumar, Risk-sensitive control and an abstract Collatz-Wielandt formula, Journal of Theoretical Probability 29 (2016), 1458-1484.

37 3. V. S. Borkar, Linear and dynamic programming approaches to degenerate risk-sensitive reward processes, 56th IEEE Conference on Decision and Control, December 2017, Melbourne, Australia.

38 THANK YOU!
