Process-Based Risk Measures for Observable and Partially Observable Discrete-Time Controlled Systems


Jingnan Fan    Andrzej Ruszczyński

November 5, 2014; revised April 15, 2015

Abstract

For controlled discrete-time stochastic processes we introduce a new class of dynamic risk measures, which we call process-based. Their main feature is that they measure the risk of processes that are functions of the history of the base process. We introduce a new concept of conditional stochastic time consistency and derive the structure of process-based risk measures enjoying this property. We show that they can be equivalently represented by a collection of static law-invariant risk measures on the space of functions of the state of the base process. We apply this result to controlled Markov processes and derive dynamic programming equations. Next, we consider partially observable processes and derive the structure of stochastically conditionally time-consistent risk measures in this case. We prove that they can be represented by a sequence of law-invariant risk measures on the space of functions of the observable part of the state. We also prove the corresponding dynamic programming equations.

Keywords: Dynamic Risk Measures; Time Consistency; Partially Observable Markov Processes

1 Introduction

The main objective of this paper is to provide theoretical foundations of the theory of dynamic risk measures for controlled stochastic processes, in particular Markov processes, with only partial observation of the state. The theory of dynamic risk measures in discrete time has been intensively developed in the last 10 years (see [12, 43, 37, 39, 20, 10, 42, 2, 35, 26, 25, 11] and the references therein). The basic setting is the following: we have a probability space (Ω, F, P), a filtration {F_t}_{t=1,...,T} with a trivial F_1, and we define appropriate spaces Z_t of F_t-measurable random variables, t = 1,...,T.
For each t = 1,...,T, a mapping ρ_{t,T} : Z_T → Z_t is called a conditional risk measure. The central role in the theory is played by the concept of time consistency, which regulates the relations between the mappings ρ_{t,T} and ρ_{s,T} for different s and t.

Rutgers University, Department of Management Science and Information Systems, Piscataway, NJ 08854, USA, jf546@rutgers.edu
Rutgers University, Department of Management Science and Information Systems, Piscataway, NJ 08854, USA, rusz@rutgers.edu

One definition employed in the literature is the following: for all Z, W ∈ Z_T, if ρ_{t,T}(Z) ≤ ρ_{t,T}(W), then ρ_{s,T}(Z) ≤ ρ_{s,T}(W) for all s < t. This can be used to derive the recursive relation ρ_{t,T}(Z) = ρ_t(ρ_{t+1,T}(Z)), with simpler one-step conditional risk mappings ρ_t : Z_{t+1} → Z_t, t = 1,...,T−1. Much effort has been devoted to deriving dual representations of the conditional risk mappings and to studying their evolution in various settings.

When applied to controlled processes, and in particular to Markov decision problems, the theory of dynamic measures of risk encounters difficulties. The spaces Z_t become larger as t grows, and each one-step mapping ρ_t has different domain and range spaces. There is no easy way to exploit stationarity of the process, nor does a clear path exist to extend the theory to infinite-horizon problems. Moreover, no satisfactory theory of law-invariant dynamic risk measures exists which would be suitable for Markov control problems (the restrictive definitions of law invariance employed in [27] and [44] lead to conclusions of limited practical usefulness, while the translation of the approach of [46] to the Markov case appears to be difficult).

Motivated by these issues, in [41] we introduced a specific class of dynamic risk measures which is well suited for Markov problems. We postulated that the one-step conditional risk mappings ρ_t have a special form, which allows for their representation in terms of risk measures on a space of functions defined on the state space of the Markov process. This restriction allowed for the development of dynamic programming equations and corresponding solution methods, which generalize the well-known results for expected-value problems. Our ideas were successfully extended in [8, 7, 30, 45]. In this paper, we introduce and analyze a general class of risk measures, which we call process-based.
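The one-step recursion just described can be made concrete numerically. Below is a minimal sketch (ours, not from the paper) assuming a finite state space, Markov transition matrices, and deterministic stage costs; `one_step` plays the role of the one-step conditional risk mapping ρ_t, instantiated here by the risk-neutral expectation and by an entropic mapping.

```python
import math

def evaluate_dynamic_risk(Z, P, one_step):
    """Backward evaluation of rho_{t,T}(Z) = Z_t + rho_t(rho_{t+1,T}(Z)).

    Z        : list of T lists, Z[t][i] = cost at time t in state i
    P        : list of T-1 row-stochastic matrices, P[t][i][j] = prob of state j given state i
    one_step : (row of probabilities, next-stage values) -> scalar risk
    Returns rho_{1,T}(Z) as a list indexed by the initial state.
    """
    T = len(Z)
    rho = list(Z[T - 1])                       # rho_{T,T} = Z_T
    for t in range(T - 2, -1, -1):             # backward in time
        rho = [Z[t][i] + one_step(P[t][i], rho) for i in range(len(Z[t]))]
    return rho

# Two admissible one-step mappings: risk-neutral expectation and entropic.
def expectation(q, v):
    return sum(qi * vi for qi, vi in zip(q, v))

def entropic(q, v, gamma=1.0):
    return math.log(sum(qi * math.exp(gamma * vi) for qi, vi in zip(q, v))) / gamma

Z = [[1.0, 2.0], [0.0, 3.0]]                   # two stages, two states
P = [[[0.5, 0.5], [0.2, 0.8]]]
print(evaluate_dynamic_risk(Z, P, expectation))  # approximately [2.5, 4.4]
print(evaluate_dynamic_risk(Z, P, entropic))     # pointwise >= the risk-neutral value
```

Swapping `one_step` changes the risk preference while the recursion itself stays fixed, which is exactly the decomposition exploited later in Theorem 2.7.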
We consider a controlled process {X_t}_{t=1,...,T} taking values in a Polish space X (the state space), whose conditional distributions are described by controlled transition kernels Q_t : X^t × U → P(X), t = 1,...,T−1, where U is a certain control space. Any history-dependent (measurable) control u_t = π_t(x_1,...,x_t) is allowed. In this setting, we are only interested in measuring the risk of stochastic processes of the form Z_t = c_t(x_t, u_t), t = 1,...,T, where c_t : X × U → R can be any bounded measurable function. This restriction of the class of stochastic processes for which risk needs to be measured is one of the two cornerstones of our approach. The other cornerstone is our new concept of stochastic conditional time-consistency. It is more restrictive than the usual time consistency, because it involves conditional distributions and uses stochastic dominance rather than the pointwise order. These two foundations allow for the development of a theory of dynamic risk measures which can be fully described by a sequence of law-invariant risk measures on a space V of measurable functions on the state space X. In the special case of controlled Markov processes, we derive the structure postulated in [41], thus providing its solid theoretical foundations. We also derive dynamic programming equations in a much more general setting than that of [41].

In the extant literature, three basic approaches to introducing risk aversion in Markov decision processes have been employed: utility functions (see, e.g., [23, 24, 15]), mean-variance models (see, e.g., [47, 19, 31, 1]), and entropic (exponential) models (see, e.g., [22, 32, 13, 17, 29, 5]). Our approach generalizes the utility and exponential models; the mean-variance models do not satisfy, in general, the monotonicity and time-consistency conditions, which are relevant for our theory. However, a version satisfying these conditions has recently been proposed in [9].
In the second part of the paper, we extend our theory to partially observable controlled Markov processes. In the expected-value case, this classical topic is covered in many monographs (see, e.g., [21, 6, 4] and the references therein). The standard approach is to consider the belief state space, involving the space of probability distributions of the unobserved part of the state. The recent report [18] provides the state-of-the-art setting. The risk-averse case has so far been dealt with by the use of the entropic risk measure (see [36, 5]). Our main result in the

risk-averse case is that the conditional risk mappings can be equivalently modeled by risk measures on the space of functions defined on the observable part of the state only. We derive this observation in two parallel but equivalent ways: by considering a history-dependent description of the evolution of the observable part, and by considering belief states, that is, posterior distributions of the unobservable part. We also derive risk-averse dynamic programming equations for partially observable Markov models.

The paper is organized as follows. In sections 2.1–2.3, we formalize the basic model and review the concepts of risk measures and their time consistency. The first set of original results is presented in section 2.4: we introduce a new concept of stochastic conditional time consistency and characterize the structure of dynamic risk measures enjoying this property (Theorem 2.14). In section 3, we extend these ideas to the case of controlled processes, and we prove Theorem 3.3 on the structure of measures of risk in this case. These results are further specialized to controlled Markov processes in section 4. We introduce the concept of a Markov risk measure and derive its structure (Theorem 4.4). In section 4.2, we prove an analog of dynamic programming equations in this case. Section 5 is devoted to the analysis of the structure of risk measures for partially observable controlled Markov processes. We attack the problem in two different ways. In section 5.3, we focus on the observable part of the process and its history-dependent dynamics. The key result is Theorem 5.6, which simplifies the modeling of conditional risk measures in this case. In section 5.5, we consider the Markov model in the extended state space, involving the posterior distribution of the unobservable part (the belief state). In Theorem 5.8 we derive an equivalent representation of risk measures, which also reduces to law-invariant risk models on the space of functions of the next observation.
Finally, in section 5.6, we derive an analog of dynamic programming equations for risk-averse partially observable models.

2 Risk Measures Based on Observable Processes

In this section we introduce fundamental concepts and properties of dynamic risk measures for uncontrolled stochastic processes in discrete time. In subsection 2.1 we set up our probabilistic framework, and in subsections 2.2 and 2.3 we revisit some important concepts existing in the literature. Beginning from subsection 2.4, we introduce the new notions of conditional time consistency and stochastic conditional time-consistency, which are strictly stronger requirements on the dynamic risk measure than the standard concept of time consistency, and which are particularly useful for controlled stochastic processes. Based on these two concepts, we derive the structure of dynamic risk measures, which is based on so-called transition risk mappings.

2.1 Preliminaries

In all subsequent considerations, we work with a Borel subset (X, B(X)) of a Polish space and the canonical measurable space (X^T, B(X)^T), where T is a natural number and B(X)^T is the product σ-algebra. We use {X_t}_{t=1,...,T} to denote the discrete-time process of canonical projections. We also define H_t = X^t to be the space of possible histories up to time t, and we use h_t for a generic element of H_t: a specific history up to time t. The random vector (X_1,...,X_t) will be denoted by H_t. We assume that for all t = 1,...,T−1, the transition kernels, which describe the conditional distribution of X_{t+1} given X_1,...,X_t, are measurable functions

    Q_t : X^t → P(X),  t = 1,...,T−1,   (1)

where P(X) is the set of probability measures on (X, B(X)). These kernels, along with the initial distribution of X_1, define a unique probability measure P on the product space X^T with the product σ-algebra.

For the stochastic system described above, we consider a sequence of random variables {Z_t}_{t=1,...,T} taking values in R; we assume that lower values of Z_t are preferred (e.g., Z_t represents a cost at time t). We require that {Z_t}_{t=1,...,T} is bounded and adapted to {F_t}_{t=1,...,T}, the natural filtration generated by the process. In order to facilitate our discussion, we introduce the following spaces:

    Z_t = { Z : X^T → R : Z is F_t-measurable and bounded },  t = 1,...,T.   (2)

It is then equivalent to say that Z_t ∈ Z_t. We also introduce the spaces Z_{t,T} = Z_t × ··· × Z_T, t = 1,...,T. Since Z_t is F_t-measurable, a measurable function φ_t : X^t → R exists such that Z_t = φ_t(X_1,...,X_t). With a slight abuse of notation, we still use Z_t to denote this function.

2.2 Dynamic risk measures

In this subsection, we quickly review some definitions related to risk measures. Note that all relations (e.g., equality, inequality) between random variables are understood in the everywhere sense.

Definition 2.1. A mapping ρ_{t,T} : Z_{t,T} → Z_t, where 1 ≤ t ≤ T, is called a conditional risk measure if it satisfies the monotonicity property: for all (Z_t,...,Z_T) and (W_t,...,W_T) in Z_{t,T}, if Z_s ≤ W_s for all s = t,...,T, then ρ_{t,T}(Z_t,...,Z_T) ≤ ρ_{t,T}(W_t,...,W_T).

Definition 2.2. A conditional risk measure ρ_{t,T} : Z_{t,T} → Z_t
i) is normalized if ρ_{t,T}(0,...,0) = 0;
ii) is translation-invariant if for all (Z_t,...,Z_T) ∈ Z_{t,T}, ρ_{t,T}(Z_t,...,Z_T) = Z_t + ρ_{t,T}(0, Z_{t+1},...,Z_T).

Throughout the paper, we assume all conditional risk measures to be at least normalized. Translation-invariance is a fundamental property, which will also be frequently used; under normalization, it implies that ρ_{t,T}(Z_t, 0,...,0) = Z_t.

Definition 2.3.
A conditional risk measure ρ_{t,T} has the local property if for any (Z_t,...,Z_T) ∈ Z_{t,T} and for any event A ∈ F_t, we have 1_A ρ_{t,T}(Z_t,...,Z_T) = ρ_{t,T}(1_A Z_t,...,1_A Z_T).

The local property is important because it means that the conditional risk measure at time t, restricted to any F_t-event A, is not influenced by the values that Z_t,...,Z_T take on A^c.

Definition 2.4. A dynamic risk measure ρ = {ρ_{t,T}}_{t=1,...,T} is a sequence of conditional risk measures ρ_{t,T} : Z_{t,T} → Z_t. We say that ρ is translation-invariant or has the local property if all ρ_{t,T}, t = 1,...,T, satisfy the respective conditions of Definitions 2.2 or 2.3, respectively.

2.3 Time Consistency

The notion of time-consistency can be formulated in different ways, with weaker or stronger assumptions, but the key idea is that if one sequence of costs, compared to another sequence, has the same current cost and lower risk in the future, then it should have lower current risk. In this and the next subsection, we discuss three formulations of time-consistency, from the weakest to the strongest one. We also show how the tower property (the recursive relation between ρ_{t,T} and ρ_{t+1,T} that time-consistency implies) improves with more refined time-consistency concepts. The following definition of time consistency was employed in [41].

Definition 2.5. A dynamic risk measure {ρ_{t,T}}_{t=1,...,T} is time-consistent if and only if for any 1 ≤ τ < θ ≤ T and for any (Z_τ,...,Z_T) and (W_τ,...,W_T) in Z_{τ,T}, the conditions

    Z_t = W_t,  t = τ,...,θ−1,
    ρ_{θ,T}(Z_θ,...,Z_T) ≤ ρ_{θ,T}(W_θ,...,W_T),

imply ρ_{τ,T}(Z_τ,...,Z_T) ≤ ρ_{τ,T}(W_τ,...,W_T).

The following proposition simplifies the definition of a time-consistent measure, which will be useful in further derivations. The proof is omitted; it can be found in [41].

Proposition 2.6. A dynamic risk measure {ρ_{t,T}}_{t=1,...,T} is time-consistent if for any 1 ≤ t < T and for all (Z_t,...,Z_T), (W_t,...,W_T) ∈ Z_{t,T}, the conditions

    Z_t = W_t,
    ρ_{t+1,T}(Z_{t+1},...,Z_T) ≤ ρ_{t+1,T}(W_{t+1},...,W_T),

imply that ρ_{t,T}(Z_t,...,Z_T) ≤ ρ_{t,T}(W_t,...,W_T).

It turns out that a translation-invariant and time-consistent dynamic risk measure can be decomposed into, and then reconstructed from, so-called one-step conditional risk mappings.

Theorem 2.7 ([41]). A dynamic risk measure {ρ_{t,T}}_{t=1,...,T} is translation-invariant and time-consistent if and only if there exist monotonic mappings ρ_t : Z_{t+1} → Z_t with ρ_t(0) = 0, t = 1,...,T−1, called one-step conditional risk mappings, such that for all t = 1,...,T−1,

    ρ_{t,T}(Z_t,...,Z_T) = Z_t + ρ_t(ρ_{t+1,T}(Z_{t+1},...,Z_T)).   (3)
As an immediate comment, we would like to point out that translation invariance and time consistency do not imply the local property, as is demonstrated by the following example.

Example 2.8. An example of a dynamic risk measure which is translation-invariant and time-consistent but fails to have the local property: T = 2, X = {1, 2},

    ρ_{1,2}(Z_1, Z_2)(x_1) = Z_1(x_1) + E[Z_2 | X_1 = x_1],  x_1 ∈ {1, 2},
    ρ_{2,2}(Z_2)(x_1, x_2) = Z_2(x_1, x_2),  (x_1, x_2) ∈ {1, 2}^2.

Conceptually, the one-step conditional risk mappings play a role similar to one-step conditional expectations, and will be very useful when the tower property is involved. At this stage, without further refinement of assumptions, it remains a fairly abstract and general object that is

hard to characterize. In [41], a special variational form of these one-step conditional risk mappings is imposed, which is well suited for Markovian applications, while it is unclear whether other forms of such mappings exist. In order to gain a deeper understanding of these concepts, in the following we introduce two stronger notions of time consistency, and we argue that any one-step conditional risk mapping is of the variational form postulated in [41]. To this end, we use the particular structure of the space (X^T, B(X)^T) and the way a probability measure is defined on this space.

2.4 Stochastic Conditional Time-Consistency and Transition Risk Mappings

We now refine the concept of time-consistency for process-based risk measures.

Definition 2.9. A dynamic risk measure {ρ_{t,T}}_{t=1,...,T} is conditionally time-consistent if for any 1 ≤ t ≤ T−1, for any h_t ∈ X^t, and for all (Z_t,...,Z_T), (W_t,...,W_T) ∈ Z_{t,T}, the conditions

    Z_t(h_t) = W_t(h_t),   (4)
    ρ_{t+1,T}(Z_{t+1},...,Z_T)(h_t, ·) ≤ ρ_{t+1,T}(W_{t+1},...,W_T)(h_t, ·),

imply ρ_{t,T}(Z_t,...,Z_T)(h_t) ≤ ρ_{t,T}(W_t,...,W_T)(h_t).

Proposition 2.10. Conditional time-consistency implies time-consistency and the local property.

Proof. If {ρ_{t,T}}_{t=1,...,T} is conditionally time-consistent, then it satisfies the conditions of Proposition 2.6 and is thus time-consistent. Let us prove by induction on t, from T down to 1, that the ρ_{t,T} satisfy the local property. Clearly, ρ_{T,T} does: if A ∈ F_T, then 1_A ρ_{T,T}(Z_T) = 1_A Z_T = ρ_{T,T}(1_A Z_T). Suppose ρ_{t+1,T} satisfies the local property for some 1 ≤ t < T, and consider any A ∈ F_t, any h_t ∈ X^t, and any (Z_t,...,Z_T) ∈ Z_{t,T}. Two cases may occur. If 1_A(h_t) = 0, then [1_A Z_t](h_t) = 0, and the local property for t+1 yields

    [ρ_{t+1,T}(1_A Z_{t+1},...,1_A Z_T)](h_t, ·) = [1_A ρ_{t+1,T}(Z_{t+1},...,Z_T)](h_t, ·) = 0.

By conditional time-consistency, ρ_{t,T}(1_A Z_t, 1_A Z_{t+1},...,1_A Z_T)(h_t) = ρ_{t,T}(0,...,0)(h_t) = 0.
If 1_A(h_t) = 1, then [1_A Z_t](h_t) = Z_t(h_t), and the local property for t+1 implies that

    [ρ_{t+1,T}(1_A Z_{t+1},...,1_A Z_T)](h_t, ·) = [1_A ρ_{t+1,T}(Z_{t+1},...,Z_T)](h_t, ·) = ρ_{t+1,T}(Z_{t+1},...,Z_T)(h_t, ·).

By conditional time-consistency, ρ_{t,T}(1_A Z_t,...,1_A Z_T)(h_t) = ρ_{t,T}(Z_t,...,Z_T)(h_t). In both cases, ρ_{t,T}(1_A Z_t,...,1_A Z_T)(h_t) = [1_A ρ_{t,T}(Z_t,...,Z_T)](h_t).

Conditional time-consistency is thus a stronger property than time-consistency; it allows us to exclude risk measures which do not satisfy the local property, like Example 2.8.

Generally, the definition of dynamic risk measures in Section 2.2 and the definition of the time-consistency property in Section 2.3 are valid for any filtration on an underlying probability space (Ω, F, P), not only for the process-generated filtration. However, from Section 2.4 to the end of the paper, we consider only dynamic risk measures defined with the filtration generated by a specific process, because we are interested in those risk measures which can be evaluated on each specific history path. That is why we call these risk measures process-based. The following theorem shows that conditional time-consistency also yields a refined version of the one-step risk mapping. Throughout the paper, we define V to be the set of all bounded measurable functions on X; it turns out that such mappings can be equivalently represented by risk measures on V.

Theorem 2.11. A dynamic risk measure {ρ_{t,T}}_{t=1,...,T} is translation-invariant and conditionally time-consistent if and only if there exist functionals σ_t : X^t × V → R, t = 1,...,T−1, called transition risk mappings, such that

i) for all t = 1,...,T−1 and all h_t ∈ X^t, the functional σ_t(h_t, ·) is a normalized and monotonic risk measure on V;

ii) for all t = 1,...,T−1, for all (Z_t,...,Z_T) ∈ Z_{t,T}, and for all h_t ∈ X^t, we have the recursive relation

    ρ_{t,T}(Z_t,...,Z_T)(h_t) = Z_t(h_t) + σ_t(h_t, ρ_{t+1,T}(Z_{t+1},...,Z_T)(h_t, ·)).   (5)

Moreover, for all t = 1,...,T−1 the functional σ_t is uniquely determined by ρ_{t,T} as follows: for every h_t ∈ X^t and every v ∈ V,

    σ_t(h_t, v) = ρ_{t,T}(0, V, 0,...,0)(h_t),   (6)

where V ∈ Z_{t+1} satisfies the equation V(h_t, ·) = v(·), and can be arbitrary elsewhere.

Proof. Assume {ρ_{t,T}}_{t=1,...,T} is translation-invariant and conditionally time-consistent. For any h_t ∈ X^t, formula (6) defines a normalized and monotonic risk measure on the space V.
Setting v(x) = ρ_{t+1,T}(Z_{t+1},...,Z_T)(h_t, x), x ∈ X, and

    V(h_{t+1}) = { v(x), if h_{t+1} = (h_t, x);  0, otherwise, }

we obtain, by conditional time-consistency,

    ρ_{t+1,T}(V, 0,...,0)(h_t, ·) = ρ_{t+1,T}(Z_{t+1},...,Z_T)(h_t, ·).

Thus, by the translation property,

    ρ_{t,T}(Z_t,...,Z_T)(h_t) = Z_t(h_t) + ρ_{t,T}(0, Z_{t+1},...,Z_T)(h_t)
                             = Z_t(h_t) + ρ_{t,T}(0, V, 0,...,0)(h_t)
                             = Z_t(h_t) + σ_t(h_t, v).

This chain of relations also proves the uniqueness of σ_t.

On the other hand, if such transition risk mappings exist, then {ρ_{t,T}}_{t=1,...,T} is conditionally time-consistent by the monotonicity of σ_t(h_t, ·). We can then use (5) to obtain, for any t = 1,...,T−1 and for all h_t ∈ X^t, the identity

    ρ_{t,T}(0, Z_{t+1},...,Z_T)(h_t) = σ_t(h_t, ρ_{t+1,T}(Z_{t+1},...,Z_T)(h_t, ·)) = ρ_{t,T}(Z_t,...,Z_T)(h_t) − Z_t(h_t),

which is the translation invariance of ρ_{t,T}.

A further refinement of the concept of time-consistency can be obtained by the use of the transition kernels Q_t and stochastic, rather than pointwise, orders. We stress that, unlike the previous notions of time-consistency, the new one depends on the underlying kernels Q_t.

Definition 2.12. A dynamic risk measure {ρ_{t,T}}_{t=1,...,T} is stochastically conditionally time-consistent with respect to {Q_t}_{t=1,...,T−1} if for any 1 ≤ t ≤ T−1, for any h_t ∈ X^t, and for all (Z_t,...,Z_T), (W_t,...,W_T) ∈ Z_{t,T}, the conditions

    Z_t(h_t) = W_t(h_t),   (7)
    ( ρ_{t+1,T}(Z_{t+1},...,Z_T) | H_t = h_t ) ≤_st ( ρ_{t+1,T}(W_{t+1},...,W_T) | H_t = h_t ),

imply

    ρ_{t,T}(Z_t,...,Z_T)(h_t) ≤ ρ_{t,T}(W_t,...,W_T)(h_t),   (8)

where the relation ≤_st stands for the conditional stochastic order, understood as follows:

    Q_t(h_t)({ x ∈ X : ρ_{t+1,T}(Z_{t+1},...,Z_T)(h_t, x) > η }) ≤ Q_t(h_t)({ x ∈ X : ρ_{t+1,T}(W_{t+1},...,W_T)(h_t, x) > η }),  for all η ∈ R.

When the choice of the underlying transition kernels is clear from the context, we will simply say that the dynamic risk measure is stochastically conditionally time-consistent. We first note that this new notion of time-consistency is even stronger than the conditional time-consistency introduced in Definition 2.9.

Proposition 2.13. Stochastic conditional time-consistency implies conditional time-consistency.

Proof. Suppose {ρ_{t,T}}_{t=1,...,T} is stochastically conditionally time-consistent. Fix t = 1,...,T−1, and let h_t and (Z_t,...,Z_T), (W_t,...,W_T) satisfy the conditions (4) of Definition 2.9.
Then for all η ∈ R we obtain

    { x ∈ X : ρ_{t+1,T}(Z_{t+1},...,Z_T)(h_t, x) > η } ⊆ { x ∈ X : ρ_{t+1,T}(W_{t+1},...,W_T)(h_t, x) > η },

and thus (8) follows.

It comes as no surprise that with this stronger notion of time-consistency, the mapping σ_t appearing in Theorem 2.11 has a more refined characterization. The following theorem proves that this extra characterization is the law-invariance property.

Theorem 2.14. A process-based dynamic risk measure {ρ_{t,T}}_{t=1,...,T} is translation-invariant and stochastically conditionally time-consistent if and only if there exist functionals σ_t : graph(Q_t) × V → R, t = 1,...,T−1, such that

i) for all t = 1,...,T−1 and all h_t ∈ X^t, the functional σ_t(h_t, Q_t(h_t), ·) is a normalized, monotonic, and law-invariant risk measure on V with respect to the distribution Q_t(h_t);

ii) for all t = 1,...,T−1, for all (Z_t,...,Z_T) ∈ Z_{t,T}, and for all h_t ∈ X^t,

    ρ_{t,T}(Z_t,...,Z_T)(h_t) = Z_t(h_t) + σ_t(h_t, Q_t(h_t), ρ_{t+1,T}(Z_{t+1},...,Z_T)(h_t, ·)).   (9)

Moreover, for all t = 1,...,T−1, σ_t is uniquely determined by ρ_{t,T} as follows: for every h_t ∈ X^t and every v ∈ V,

    σ_t(h_t, Q_t(h_t), v) = ρ_{t,T}(0, V, 0,...,0)(h_t),   (10)

where V ∈ Z_{t+1} satisfies the equation V(h_t, ·) = v(·), and can be arbitrary elsewhere.

Proof. On the one hand, if {ρ_{t,T}}_{t=1,...,T} is translation-invariant and stochastically conditionally time-consistent, then by Proposition 2.13 and Theorem 2.11 there exist normalized and monotonic risk measures σ_t(h_t, Q_t(h_t), ·), h_t ∈ X^t, t = 1,...,T−1, that satisfy (9) and (6). We need only verify the postulated law invariance of σ_t(h_t, Q_t(h_t), ·). If V, V′ ∈ Z_{t+1} have the same conditional distribution given h_t, then Definition 2.12 implies that ρ_{t,T}(0, V, 0,...,0)(h_t) = ρ_{t,T}(0, V′, 0,...,0)(h_t), and law invariance follows from (6). On the other hand, if such {σ_t}_{t=1,...,T−1} exist, the law-invariance of σ_t(h_t, Q_t(h_t), ·) implies the stochastic conditional time-consistency.

Remark 2.15. With a slight abuse of notation, we included the distribution Q_t(h_t) as an argument of the transition risk mapping, in view of the application to controlled processes.

Example 2.16. In the theory of risk-sensitive Markov decision processes, the following family of entropic risk measures is employed (see [22, 36, 32, 14, 40, 13, 17, 29, 5]):

    ρ_{t,T}(Z_t,...,Z_T) = (1/γ) ln( E[ exp( γ Σ_{s=t}^T Z_s ) | F_t ] ),  t = 1,...,T,  γ > 0.
It is stochastically conditionally time-consistent, and corresponds to the transition risk mapping

    σ_t(h_t, q, v) = (1/γ) ln( E_q[ e^{γv} ] ) = (1/γ) ln( ∫ e^{γ v(x)} q(dx) ),  γ > 0.   (11)

We could also make γ in (11) dependent on the time t, the current state x_t, or even the entire history h_t, and still obtain a stochastically conditionally time-consistent dynamic risk measure.

Example 2.17. The following transition risk mapping satisfies the conditions of Theorem 2.14 and corresponds to a stochastically conditionally time-consistent dynamic risk measure:

    σ_t(h_t, q, v) = ∫ v(s) q(ds) + ϰ_t(h_t) ( ∫ ( v(s) − ∫ v(s′) q(ds′) )_+^p q(ds) )^{1/p},   (12)

where ϰ_t : X^t → [0, 1] is a measurable function and p ∈ [1, +∞). It is an analogue of the static mean-semideviation measure of risk, whose consistency with stochastic dominance is well known [33, 34]. In the construction of a dynamic risk measure, we use q = Q_t(h_t). If ϰ_t depends on x_t only, the mapping (12) corresponds to a Markov risk measure discussed in sections 4.1 and 4.2.
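Both transition risk mappings, (11) and (12), are easy to evaluate when q has finite support. The sketch below is our illustration (the function names are not from the paper); it computes both mappings for a fixed discrete q, with the history argument h_t suppressed.

```python
import math

def entropic(q, v, gamma=1.0):
    # Formula (11): (1/gamma) * ln( E_q[ exp(gamma * v) ] ).
    return math.log(sum(qi * math.exp(gamma * vi) for qi, vi in zip(q, v))) / gamma

def mean_semideviation(q, v, kappa=0.5, p=1):
    # Formula (12): E_q[v] + kappa * ( E_q[ ((v - E_q[v])_+)^p ] )^(1/p).
    mean = sum(qi * vi for qi, vi in zip(q, v))
    semidev = sum(qi * max(vi - mean, 0.0) ** p for qi, vi in zip(q, v))
    return mean + kappa * semidev ** (1.0 / p)

q = [0.25, 0.25, 0.5]     # a finite-support distribution
v = [1.0, 2.0, 4.0]       # a cost function on three states; E_q[v] = 2.75
print(mean_semideviation(q, v))   # 2.75 + 0.5 * 0.625 = 3.0625
print(entropic(q, v))             # exceeds the mean 2.75 (risk aversion)
```

Both functions depend on v only through the distribution (v; q), which is exactly the law invariance required by Theorem 2.14.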

Example 2.18. Our use of the stochastic dominance relation in the definition of stochastic conditional time consistency rules out some candidates for transition risk mappings. Suppose σ_t(h_t, q, v) = v(x_1), where x_1 is a selected state. Such a mapping is a coherent measure of risk as a function of the last argument, and may be law invariant. In particular, it is law invariant with X = {x_1, x_2}, q(x_1) = 1/3, q(x_2) = 2/3, v(x_1) = 3, v(x_2) = 1, w(x_1) = 2, w(x_2) = 4. For this mapping, we have v ≤_st w under q, but σ_t(h_t, q, v) > σ_t(h_t, q, w), and thus the condition of stochastic conditional time consistency is violated. We consciously exclude such cases because, in controlled systems, which will be discussed in the next section, the second argument (q) is the only one that depends on our decisions. It should be included in the definition of our preferences if practically meaningful results are to be obtained.

3 Risk Measures for Controlled Stochastic Processes

We now extend the setting of Section 2 by making the kernels (1) depend on some control variables u_t.

3.1 The Model

We still work with the process {X_t}_{t=1,...,T} on the space X^T and introduce a measurable control space (U, B(U)). At each time t, we observe the state x_t and then apply a control u_t ∈ U. We assume that the admissible control sets and the transition kernels (conditional distributions of the next state) depend on all currently-known state and control values. More precisely, we make the following assumptions:

1. For all t = 1,...,T, we require that u_t ∈ U_t(x_1, u_1,...,x_{t−1}, u_{t−1}, x_t), where U_t : G_t ⇉ U is a measurable multifunction, and G_1,...,G_T are the sets of histories of all currently-known state and control values before applying each control:

    G_1 = X,  G_{t+1} = graph(U_t) × X,  t = 1,...,T−1;

2.
For all t = 1,...,T−1, the control-dependent transition kernels

    Q_t : graph(U_t) → P(X),  t = 1,...,T−1,   (13)

are measurable, and for all t = 1,...,T−1 and all (x_1, u_1,...,x_t, u_t) ∈ graph(U_t), Q_t(x_1, u_1,...,x_t, u_t) describes the conditional distribution of X_{t+1}, given the currently-known states and controls.

The reason why we allow U_t and Q_t to depend on all known state and control values is that in partially observable Markov decision problems, which we study in Section 5, the conditional distribution of the next observable part of the state does depend on all past controls. We plan to apply the results of the present section to the partially observable case.

For this controlled process, a (deterministic) history-dependent admissible policy π = (π_1,...,π_T) is a sequence of measurable selectors, called decision rules, π_t : G_t → U such that π_t(g_t) ∈ U_t(g_t) for all g_t ∈ G_t. We can easily prove by induction on t that for such an admissible policy π, each π_t reduces to a measurable function on X^t, because u_s = π_s(h_s) for all s = 1,...,t−1. We still use π_s to denote the decision rule, although formally it is a different function; this will not lead to any misunderstanding. The set of admissible policies is

    Π := { π = (π_1,...,π_T) : for all t and all (x_1,...,x_t) ∈ X^t, π_t(x_1,...,x_t) ∈ U_t( x_1, π_1(x_1),...,x_{t−1}, π_{t−1}(x_1,...,x_{t−1}), x_t ) }.   (14)

For any fixed policy π ∈ Π, the transition kernels can be rewritten as measurable functions from X^t to P(X):

    Q_t^π : (x_1,...,x_t) ↦ Q_t( x_1, π_1(x_1),...,x_t, π_t(x_1,...,x_t) ),  t = 1,...,T−1,   (15)

just like the transition kernels of the uncontrolled case given in (1), but indexed by π. Thus, for any policy π ∈ Π, we can consider {X_t}_{t=1,...,T} as an uncontrolled process on the probability space (X^T, B(X)^T, P^π), with P^π defined by {Q_t^π}_{t=1,...,T−1}, which is adapted to the policy-independent filtration {F_t}_{t=1,...,T}. As before and throughout this paper, h_t ∈ X^t will stand for (x_1,...,x_t). We still use the same spaces Z_t, t = 1,...,T, as defined in (2), for the costs incurred at each stage; these spaces also allow us to consider control-dependent costs as collections of policy-indexed costs in Z_{1,T}. Thus, we are able to classify (time-consistent) dynamic risk measures ρ^π for each fixed π ∈ Π, as in Section 2. Note that the ρ^π are defined on the same spaces, independently of π, because the filtration and the spaces Z_t, t = 1,...,T, do not depend on π; however, we do need to index the measures of risk by the policy π, because the transition kernels and, consequently, the probability measure on the space X^T, depend on π.

3.2 Stochastic Conditional Time-Consistency and Transition Risk Mappings

In the previous subsection we discussed a family of dynamic risk measures (ρ^π)_{π∈Π} such that for each fixed π ∈ Π, ρ^π is a dynamic risk measure. In future considerations, we will compare risk levels among different policies, so a meaningful order on (ρ^π)_{π∈Π} is needed. It turns out that our concept of stochastic conditional time-consistency can be extended to this setting and helps us achieve this goal.

Definition 3.1.
A family of process-based dynamic risk measures {ρ_{t,T}^π : π ∈ Π, t = 1,...,T} is stochastically conditionally time-consistent if for any π, π′ ∈ Π, for any 1 ≤ t < T, for all h_t ∈ X^t, for all (Z_t,...,Z_T) ∈ Z_{t,T} and all (W_t,...,W_T) ∈ Z_{t,T}, the conditions

    Z_t(h_t) = W_t(h_t),
    ( ρ_{t+1,T}^π(Z_{t+1},...,Z_T) | H_t = h_t ) ≤_st ( ρ_{t+1,T}^{π′}(W_{t+1},...,W_T) | H_t = h_t ),

imply

    ρ_{t,T}^π(Z_t,...,Z_T)(h_t) ≤ ρ_{t,T}^{π′}(W_t,...,W_T)(h_t).

Remark 3.2. As in Definition 2.12, the conditional stochastic order ≤_st should be understood as follows: for all η ∈ R we have

    Q_t^π(h_t)({ x ∈ X : ρ_{t+1,T}^π(Z_{t+1},...,Z_T)(h_t, x) > η }) ≤ Q_t^{π′}(h_t)({ x ∈ X : ρ_{t+1,T}^{π′}(W_{t+1},...,W_T)(h_t, x) > η }).

The above definition of stochastic conditional time-consistency of a family of dynamic risk measures (ρ^π)_{π∈Π} plays a pivotal role in building an intrinsic connection among them, as we explain below. If a family of process-based dynamic risk measures {ρ_{t,T}^π : π ∈ Π, t = 1,...,T} is

stochastically conditionally time-consistent, then for each fixed π ∈ Π the process-based dynamic risk measure {ρ_{t,T}^π}_{t=1,...,T} is stochastically conditionally time-consistent, as defined in Definition 2.12. By virtue of Theorem 2.14, for each π ∈ Π there exist functionals σ_t^π : graph(Q_t^π) × V → R, t = 1,...,T−1, such that for all t = 1,...,T−1 and all h_t ∈ X^t, the functional σ_t^π(h_t, Q_t^π(h_t), ·) is a law-invariant risk measure on V with respect to the distribution Q_t^π(h_t), and

    ρ_{t,T}^π(Z_t,...,Z_T)(h_t) = Z_t(h_t) + σ_t^π( h_t, Q_t^π(h_t), ρ_{t+1,T}^π(Z_{t+1},...,Z_T)(h_t, ·) ),  h_t ∈ X^t.

Consider any π, π′ ∈ Π, h_t ∈ X^t, and (Z_t,...,Z_T) ∈ Z_{t,T}, (W_t,...,W_T) ∈ Z_{t,T} such that

    Z_t(h_t) = W_t(h_t),
    Q_t^π(h_t) = Q_t^{π′}(h_t),
    ρ_{t+1,T}^π(Z_{t+1},...,Z_T)(h_t, ·) = ρ_{t+1,T}^{π′}(W_{t+1},...,W_T)(h_t, ·).

Then we have

    ( ρ_{t+1,T}^π(Z_{t+1},...,Z_T) | H_t = h_t ) =_st ( ρ_{t+1,T}^{π′}(W_{t+1},...,W_T) | H_t = h_t ),

where the relation =_st means that both ≤_st and ≥_st are true; in other words, equality in law. Because of the stochastic conditional time-consistency,

    ρ_{t,T}^π(Z_t,...,Z_T)(h_t) = ρ_{t,T}^{π′}(W_t,...,W_T)(h_t),

whence

    σ_t^π( h_t, Q_t^π(h_t), ρ_{t+1,T}^π(Z_{t+1},...,Z_T)(h_t, ·) ) = σ_t^{π′}( h_t, Q_t^{π′}(h_t), ρ_{t+1,T}^{π′}(W_{t+1},...,W_T)(h_t, ·) ).

This proves that in fact σ^π does not depend on π, and all dependence on π is carried by the controlled kernel Q_t^π. This is a highly desirable property when we apply dynamic risk measures to a control problem. We summarize this important observation in the following theorem, which extends Theorem 2.14 to the case of controlled processes.

Theorem 3.3. A family of process-based dynamic risk measures {ρ_{t,T}^π : π ∈ Π, t = 1,...,T} is translation-invariant and stochastically conditionally time-consistent if and only if there exist functionals

    σ_t : ⋃_{π∈Π} graph(Q_t^π) × V → R,  t = 1,...,T−1,
T 1, π Π such that i) For all t = 1,..., T 1 and all h t t, σ t h t,, ) is normalized and satisfies the following strong monotonicity with respect to stochastic dominance: q 1, q 2 { Q π t h t) : π Π }, v 1, v 2 V, v 1 ; q 1 ) st v 2 ; q 2 ) σ t h t, q 1, v 1 ) σ t h t, q 2, v 2 ), where v; q) = q v 1 means the distribution of v under q; 12

ii) For all $\pi\in\Pi$, for all $t=1,\dots,T-1$, for all $(Z_t,\dots,Z_T)\in\mathcal Z_{t,T}$, and for all $h_t\in\mathcal H_t$,
\[
\rho^\pi_{t,T}(Z_t,\dots,Z_T)(h_t)=Z_t(h_t)+\sigma_t\big(h_t,Q^\pi_t(h_t),\rho^\pi_{t+1,T}(Z_{t+1},\dots,Z_T)(h_t,\cdot)\big).\tag{16}
\]
Moreover, for all $t=1,\dots,T-1$, $\sigma_t$ is uniquely determined by $\rho_{t,T}$ as follows: for every $h_t\in\mathcal H_t$, for every $q\in\{Q^\pi_t(h_t):\pi\in\Pi\}$, and for every $v\in\mathcal V$,
\[
\sigma_t(h_t,q,v)=\rho^\pi_{t,T}(0,V,0,\dots,0)(h_t),\tag{17}
\]
where $\pi$ is any admissible policy such that $q=Q^\pi_t(h_t)$, and $V\in\mathcal Z_{t+1}$ satisfies the equation $V(h_t,\cdot)=v(\cdot)$ and can be arbitrary elsewhere.

Proof. We have shown the existence of $\{\sigma_t\}_{t=1,\dots,T-1}$ satisfying (16) and (17) in the discussion preceding the theorem. We can verify the strong law-invariance by (17) and the definitions of Section 2.

3.1 Strong monotonicity with respect to stochastic dominance

In Theorem 3.3 we introduced the notion of strong monotonicity with respect to stochastic dominance, and in this subsection we shed more light on this notion. For a measurable space $\mathcal X$, we use $\mathcal P(\mathcal X)$ to denote the space of probability measures on $\mathcal X$, and we use $\mathcal S$ to denote a subset of $\mathcal P(\mathcal X)$. Recall that $\mathcal V$ stands for the set of bounded measurable functions on $\mathcal X$. We say that a function $\sigma:\mathcal S\times\mathcal V\to\mathbb R$ is *strongly monotonic with respect to stochastic dominance on $\mathcal S$* if
\[
\forall\,q_1,q_2\in\mathcal S,\ \forall\,v_1,v_2\in\mathcal V:\quad (v_1;q_1)\preceq_{\rm st}(v_2;q_2)\ \Longrightarrow\ \sigma(q_1,v_1)\le\sigma(q_2,v_2).\tag{18}
\]
The property of strong monotonicity with respect to stochastic dominance is strictly stronger than the usual monotonicity and law-invariance of $\sigma(q,\cdot)$ with respect to $q$ for all $q\in\mathcal S$. We remind the reader that the usual law-invariance means that for all $v_1$ and $v_2$ having the same distribution under $q$, we must have $\sigma(q,v_1)=\sigma(q,v_2)$. Furthermore, a mapping $\sigma$ which is strongly monotonic with respect to stochastic dominance is nothing else but a function that evaluates the distributions of $v$ under $q$, for all $v\in\mathcal V$ and $q\in\mathcal S$, and preserves the (first-order) stochastic dominance.
Thus we can rewrite it as a function of distributions, or a function of cumulative distribution functions,
\[
F_{v;q}(\alpha)=q\big(\{x\in\mathcal X: v(x)\le\alpha\}\big),\qquad \alpha\in\mathbb R.
\]

Proposition 3.4. A function $\sigma:\mathcal S\times\mathcal V\to\mathbb R$ is strongly monotonic with respect to stochastic dominance on $\mathcal S$ if and only if there exists $\tilde\sigma:\{F_{v;q}: v\in\mathcal V,\ q\in\mathcal S\}\to\mathbb R$ such that $\sigma(q,v)=\tilde\sigma(F_{v;q})$ for all $v\in\mathcal V$ and $q\in\mathcal S$, and for all $F_1,F_2\in\{F_{v;q}: v\in\mathcal V,\ q\in\mathcal S\}$,
\[
F_1\ge F_2\ \Longrightarrow\ \tilde\sigma(F_1)\le\tilde\sigma(F_2).\tag{19}
\]

Consequently, to construct risk measures which are strongly monotonic with respect to stochastic dominance, one possibility is to take any function of distributions satisfying (19) and the normalization condition. Such functions include the expected value, the mean-upper-semideviations, as in Example 2.18, or, more generally, coherent risk measures given by the Kusuoka representation:
\[
\sigma(q,v)=\sup_{\mu\in\mathcal M}\int_0^1 \mathrm{AVaR}_{1-\alpha}(v;q)\,\mu(d\alpha),\qquad q\in\mathcal P(\mathcal X),\ v\in\mathcal V,\tag{20}
\]

where $\mathcal M$ is a convex subset of $\mathcal P((0,1])$, and $\mathrm{AVaR}$ is defined by
\[
\mathrm{AVaR}_{1-\alpha}(v;q)=\frac{1}{1-\alpha}\int_\alpha^1 F^{-1}_{v;q}(s)\,ds,
\]
with $F^{-1}_{v;q}$ being the quantile function of $v$ under $q$. Even more general mappings are monotonic and normalized risk measures on the space of bounded quantile functions, as discussed in [16]. We can also include all non-decreasing transformations of such measures, provided that they satisfy the normalization condition, such as the entropic risk measure discussed in Section 2. Note that $\{F_{v;q}: v\in\mathcal V,\ q\in\mathcal S\}$ does not always span all bounded distributions on $\mathbb R$, especially when $\mathcal X$ is a discrete space; thus, we also allow different forms for each $q$, as shown in the following example.

Example 3.5. Let $\mathcal X=\{1,2,3\}$, $\mathcal S=\big\{q^1=(0.1,0.1,0.8);\ q^2=(\tfrac13,\tfrac13,\tfrac13)\big\}$, and
\[
\sigma(q,v)=\begin{cases} \mathbb E_{q^1}(v)+\kappa\,\mathbb E_{q^1}\big([v-\mathbb E_{q^1}(v)]_+\big), & \text{if } q=q^1,\\[3pt] 0.4\max\{v(1),v(2),v(3)\}+0.6\,\mathbb E_{q^2}(v), & \text{if } q=q^2,\end{cases}
\]
with any $0\le\kappa\le1$. Then $\sigma(\cdot,\cdot)$ is strongly monotonic with respect to stochastic dominance on $\mathcal S$. Indeed, each $\sigma(q^i,\cdot)$ is a coherent risk measure on $\mathcal V$. We fix any $v_2\in\mathcal V$ and denote $a=\min\{v_2(1),v_2(2),v_2(3)\}$ and $c=\max\{v_2(1),v_2(2),v_2(3)\}$. Then we have
\[
0.4c+0.2(a+a+c)\ \le\ \sigma(q^2,v_2)\ \le\ 0.4c+0.2(a+c+c),
\]
and

i) for any $v_1,v_2\in\mathcal V$ such that $(v_1;q^1)\preceq_{\rm st}(v_2;q^2)$, we have $(v_1;q^1)\preceq_{\rm st}(w;q^1)$ with $w:=(c,c,a)$, and thus
\[
\sigma(q^1,v_1)\ \le\ \mathbb E_{q^1}(w)+\kappa\,\mathbb E_{q^1}\big([w-\mathbb E_{q^1}(w)]_+\big)=0.2c+0.8a+0.16\,\kappa(c-a)\ \le\ 0.6c+0.4a\ \le\ \sigma(q^2,v_2);
\]

ii) for any $v_1,v_2\in\mathcal V$ such that $(v_1;q^1)\succeq_{\rm st}(v_2;q^2)$, we have $(v_1;q^1)\succeq_{\rm st}(w';q^1)$ with $w':=(a,a,c)$, and thus
\[
\sigma(q^1,v_1)\ \ge\ \mathbb E_{q^1}(w')+\kappa\,\mathbb E_{q^1}\big([w'-\mathbb E_{q^1}(w')]_+\big)=0.8c+0.2a+0.16\,\kappa(c-a)\ \ge\ 0.8c+0.2a\ \ge\ \sigma(q^2,v_2).
\]
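The constructions above can be sketched numerically. The following illustrative code (names are ours, not the paper's) computes the AVaR building block of the Kusuoka representation (20) for a discrete distribution by integrating the quantile function over $(\alpha,1]$, and implements the two measures of Example 3.5 with the arbitrary admissible choice $\kappa=0.5$:

```python
# Sketch, not from the paper: AVaR for a discrete law, and the two risk
# measures of Example 3.5 (kappa = 0.5 is an arbitrary choice in [0, 1]).

def avar(values, probs, alpha):
    """AVaR_{1-alpha}(v;q) = (1/(1-alpha)) * integral_alpha^1 F^{-1}(s) ds."""
    pairs = sorted(zip(values, probs))   # quantile function is piecewise constant
    integral, level = 0.0, 0.0           # level = cumulative probability reached
    for v, p in pairs:
        lo, hi = level, level + p        # v is the quantile on (lo, hi]
        integral += v * max(0.0, min(hi, 1.0) - max(lo, alpha))
        level = hi
    return integral / (1.0 - alpha)

q1 = [0.1, 0.1, 0.8]
q2 = [1.0 / 3, 1.0 / 3, 1.0 / 3]
kappa = 0.5

def sigma_q1(v):                         # mean-upper-semideviation under q1
    m = sum(p * x for p, x in zip(q1, v))
    return m + kappa * sum(p * max(x - m, 0.0) for p, x in zip(q1, v))

def sigma_q2(v):                         # 0.4 * max + 0.6 * expectation under q2
    return 0.4 * max(v) + 0.6 * sum(p * x for p, x in zip(q2, v))

print(avar([0.0, 10.0], [0.5, 0.5], 0.5))   # 10.0 (upper half of the quantiles)
v = [1.0, 2.0, 3.0]
print(sigma_q1(v))   # E_{q1}(v) = 2.7, semideviation 0.24, so about 2.82
print(sigma_q2(v))   # 0.4*3 + 0.6*2, i.e. 2.4
```

Note how $\sigma(q^2,\cdot)$ evaluates the function $v$, not a distribution with respect to $q^2$ alone, which is exactly why different functional forms are allowed for each $q$.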
4 Application to Controlled Markov Systems

Our results can be further specialized to the case when $\{X_t\}$ is a controlled Markov system, in which we assume the following conditions:

i) The admissible control sets are measurable multifunctions of the current state: $U_t:\mathcal X\rightrightarrows\mathcal U$, $t=1,\dots,T$;

ii) The dependence of the transition kernels (13) on the history is carried only through the last state and control: $Q_t:\operatorname{graph}(U_t)\to\mathcal P(\mathcal X)$, $t=1,\dots,T-1$;

iii) The step-wise costs depend only on the current state and control: $Z_t=c_t(x_t,u_t)$, $t=1,\dots,T$, where $c_t:\operatorname{graph}(U_t)\to\mathbb R$, $t=1,\dots,T$, are measurable bounded functions.

Let $\Pi$ be the set of admissible history-dependent policies:
\[
\Pi:=\big\{\pi=(\pi_1,\dots,\pi_T):\ \forall t,\ \pi_t(x_1,\dots,x_t)\in U_t(x_t)\big\}.
\]
To alleviate notation, for all $\pi\in\Pi$ and for all measurable $c=(c_1,\dots,c_T)$, we write
\[
v^{c,\pi}_t(h_t):=\rho^\pi_{t,T}\big(c_t(X_t,\pi_t(H_t)),\dots,c_T(X_T,\pi_T(H_T))\big)(h_t).
\]
The following result is the direct translation of Theorem 3.3 to the Markovian case.

Corollary 4.1. For a controlled Markov system, a family of process-based dynamic risk measures $\{\rho^\pi_{t,T}\}_{\pi\in\Pi,\;t=1,\dots,T}$ is translation-invariant and stochastically conditionally time-consistent if and only if there exist functionals
\[
\sigma_t:\big\{(h_t,Q_t(x_t,u)):\ h_t\in\mathcal H_t,\ u\in U_t(x_t)\big\}\times\mathcal V\to\mathbb R,\qquad t=1,\dots,T-1,
\]
such that

i) For all $t=1,\dots,T-1$ and all $h_t\in\mathcal H_t$, $\sigma_t(h_t,\cdot,\cdot)$ is normalized and strongly monotonic with respect to stochastic dominance on $\{Q_t(x_t,u):\ u\in U_t(x_t)\}$;

ii) For all $\pi\in\Pi$, for all bounded measurable $c$, for all $t=1,\dots,T-1$, and for all $h_t\in\mathcal H_t$,
\[
v^{c,\pi}_t(h_t)=c_t(x_t,\pi_t(h_t))+\sigma_t\big(h_t,Q_t(x_t,\pi_t(h_t)),v^{c,\pi}_{t+1}(h_t,\cdot)\big).\tag{21}
\]

Proof. To verify the "if and only if" statement, we can show that (17) is true if $\sigma_t$ satisfies (21) for all measurable bounded $c$.

4.1 Markov Risk Measures

Because of the Markov property of the transition kernels, for a fixed Markov policy¹ $\pi$, the future evolution of the process $\{X_\tau\}_{\tau=t,\dots,T}$ is solely dependent on the current state $x_t$, and so is the distribution of the future costs $c_\tau(X_\tau,\pi_\tau(X_\tau))$, $\tau=t,\dots,T$. Therefore, it is reasonable to assume that the dependence of the conditional risk measure on the history is also carried by the current state only.

Definition 4.2. A family of process-based dynamic risk measures $\{\rho^\pi_{t,T}\}_{\pi\in\Pi,\;t=1,\dots,T}$ for a controlled Markov system is *Markov* if for all Markov policies $\pi\in\Pi$, for all measurable $c=(c_1,\dots,c_T)$, and for all $h_t=(x_1,\dots,x_t)$ and $h'_t=(x'_1,\dots,x'_t)$ in $\mathcal H_t$ such that $x_t=x'_t$, we have $v^{c,\pi}_t(h_t)=v^{c,\pi}_t(h'_t)$.

Proposition 4.3. Under the translation invariance and the stochastic conditional time consistency, $\{\rho^\pi_{t,T}\}_{\pi\in\Pi,\;t=1,\dots,T}$ is Markov if and only if the dependence of $\sigma_t$ on $h_t$ is carried only by $x_t$, for all $t=1,\dots,T-1$.

Proof. Suppose the family is Markov. For all $t=1,\dots,T-1$, all $h_t,h'_t\in\mathcal H_t$ such that $x_t=x'_t$, all $u\in U_t(x_t)$, and all $v\in\mathcal V$, there exists a Markov policy $\pi\in\Pi$ such that $\pi_t(x_t)=u$. By setting $c=(0,\dots,0,c_{t+1},0,\dots,0)$ with $c_{t+1}:(x',u')\mapsto v(x')$, we obtain
\[
\sigma_t\big(h_t,Q_t(x_t,u),v\big)=v^{c,\pi}_t(h_t)=v^{c,\pi}_t(h'_t)=\sigma_t\big(h'_t,Q_t(x_t,u),v\big).
\]

¹ A Markov policy $\pi$ is composed of state-dependent measurable decision rules $\pi_t:\mathcal X\to\mathcal U$, $t=1,\dots,T$.

Therefore, $\sigma_t$ is indeed memoryless, that is, its dependence on $h_t$ is carried by $x_t$ only. Conversely, if $\sigma_t$, $t=1,\dots,T-1$, are all memoryless, we can prove by induction backward in time that for all $t=T,\dots,1$, $v^{c,\pi}_t(h_t)=v^{c,\pi}_t(h'_t)$ for all Markov $\pi$ and all $h_t,h'_t\in\mathcal H_t$ such that $x_t=x'_t$.

Theorem 4.4. For a controlled Markov system, a family of process-based dynamic risk measures $\{\rho^\pi_{t,T}\}_{\pi\in\Pi,\;t=1,\dots,T}$ is translation-invariant, stochastically conditionally time-consistent, and Markov if and only if there exist functionals
\[
\sigma_t:\big\{(x,Q_t(x,u)):\ x\in\mathcal X,\ u\in U_t(x)\big\}\times\mathcal V\to\mathbb R,\qquad t=1,\dots,T-1,
\]
where $\mathcal V$ is the set of bounded measurable functions on $\mathcal X$, such that:

i) For all $t=1,\dots,T-1$ and all $x\in\mathcal X$, $\sigma_t(x,\cdot,\cdot)$ is normalized and strongly monotonic with respect to stochastic dominance on $\{Q_t(x,u):\ u\in U_t(x)\}$;

ii) For all $\pi\in\Pi$, for all measurable bounded $c$, for all $t=1,\dots,T-1$, and for all $h_t\in\mathcal H_t$,
\[
v^{c,\pi}_t(h_t)=c_t(x_t,\pi_t(h_t))+\sigma_t\big(x_t,Q_t(x_t,\pi_t(h_t)),v^{c,\pi}_{t+1}(h_t,\cdot)\big).\tag{22}
\]

Remark 4.5. In [41], formula (22) was postulated as a definition of a Markov risk measure.

Theorem 4.4 provides us with a simple recursive formula (22) for the evaluation of the risk of a Markov policy $\pi$:
\[
v^{c,\pi}_T(x)=c_T(x,\pi_T(x)),\qquad x\in\mathcal X,
\]
\[
v^{c,\pi}_t(x)=c_t(x,\pi_t(x))+\sigma_t\big(x,Q_t(x,\pi_t(x)),v^{c,\pi}_{t+1}\big),\qquad x\in\mathcal X,\quad t=T-1,\dots,1.
\]
It involves the calculation of the values of the functions $v^{c,\pi}_t(\cdot)$ on the state space $\mathcal X$.

4.2 Dynamic Programming

In this section, we fix the cost functions $c_1,\dots,c_T$ and consider a family of dynamic risk measures $\{\rho^\pi_{t,T}\}_{\pi\in\Pi,\;t=1,\dots,T}$ which is translation-invariant, stochastically conditionally time-consistent, and Markov. Our objective is to analyze the risk minimization problem
\[
\min_{\pi\in\Pi}\ v^\pi_1(x_1),\qquad x_1\in\mathcal X.
\]
For this purpose, we introduce the family of value functions
\[
v^*_t(h_t)=\inf_{\pi\in\Pi_{t,T}} v^\pi_t(h_t),\qquad t=1,\dots,T,\quad h_t\in\mathcal H_t,\tag{23}
\]
where $\Pi_{t,T}$ is the set of feasible deterministic policies $\pi=\{\pi_t,\dots,\pi_T\}$.
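The recursive evaluation of a Markov policy described after Remark 4.5 can be sketched on a toy two-state chain. All data below are invented for illustration; the mean-upper-semideviation mapping (with $\kappa=1$) stands in for $\sigma_t$, and the policy has already been substituted into the costs and kernels:

```python
# Illustrative sketch (not from the paper): backward evaluation of the risk of
# a fixed Markov policy, v_t(x) = c_t(x) + sigma(Q_t(x), v_{t+1}), on a chain
# with two states and horizon T = 3. c[t][x] and Q[t][x] are hypothetical.

def sigma(q, v, kappa=1.0):
    """Mean-upper-semideviation risk mapping (p = 1)."""
    m = sum(p * w for p, w in zip(q, v))
    return m + kappa * sum(p * max(w - m, 0.0) for p, w in zip(q, v))

T = 3
c = [[1.0, 2.0], [1.0, 2.0], [0.0, 5.0]]   # c[t][x], t = 0, ..., T-1
Q = [[[0.5, 0.5], [0.2, 0.8]],             # Q[t][x] = next-state distribution
     [[0.9, 0.1], [0.2, 0.8]]]             # (no kernel needed at the last stage)

v = c[T - 1][:]                            # v_T(x) = terminal cost
for t in range(T - 2, -1, -1):             # t = T-2, ..., 0
    v = [c[t][x] + sigma(Q[t][x], v) for x in range(2)]
print(v)                                   # risk of the policy from each state
```

Each backward step evaluates one transition risk mapping per state, so the whole evaluation costs $O(T\,|\mathcal X|^2)$ operations here, just as risk-neutral policy evaluation would.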
As stated in Theorem 4.4, transition risk mappings $\{\sigma_t\}_{t=1,\dots,T-1}$ exist such that
\[
v^\pi_t(h_t)=c_t(x_t,\pi_t(h_t))+\sigma_t\big(x_t,Q_t(x_t,\pi_t(h_t)),v^\pi_{t+1}(h_t,\cdot)\big),\qquad t=1,\dots,T-1,\quad \pi\in\Pi,\quad h_t\in\mathcal H_t.\tag{24}
\]
Our intention is to prove that the value functions $v^*_t(\cdot)$ are memoryless, that is, for all $h_t=(x_1,\dots,x_t)$ and $h'_t=(x'_1,\dots,x'_t)$ such that $x_t=x'_t$, we have $v^*_t(h_t)=v^*_t(h'_t)$. In this case, with a slight abuse of notation, we shall simply write $v^*_t(x_t)$. In order to formulate the main result of this subsection, we equip the space $\mathcal P(\mathcal X)$ of probability measures on $\mathcal X$ with the topology of weak convergence.

Theorem 4.6. We assume in addition that the following conditions are satisfied:

i) The transition kernels $Q_t(\cdot,\cdot)$, $t=1,\dots,T$, are continuous;

ii) For every lower semicontinuous $v\in\mathcal V$, the transition risk mappings $\sigma_t(\cdot,\cdot,v)$, $t=1,\dots,T$, are lower semicontinuous;

iii) The functions $c_t(\cdot,\cdot)$, $t=1,\dots,T$, are lower semicontinuous;

iv) The multifunctions $U_t(\cdot)$, $t=1,\dots,T$, are measurable, compact-valued, and upper semicontinuous.

Then the functions $v^*_t$, $t=1,\dots,T$, are measurable, memoryless, lower semicontinuous, and satisfy the following dynamic programming equations:
\[
v^*_T(x)=\min_{u\in U_T(x)} c_T(x,u),\qquad x\in\mathcal X,\tag{25}
\]
\[
v^*_t(x)=\min_{u\in U_t(x)}\big\{c_t(x,u)+\sigma_t\big(x,Q_t(x,u),v^*_{t+1}\big)\big\},\qquad x\in\mathcal X,\quad t=T-1,\dots,1.\tag{26}
\]
Moreover, an optimal Markov policy $\hat\pi$ exists and satisfies the equations:
\[
\hat\pi_T(x)\in\operatorname*{argmin}_{u\in U_T(x)} c_T(x,u),\qquad x\in\mathcal X,\tag{27}
\]
\[
\hat\pi_t(x)\in\operatorname*{argmin}_{u\in U_t(x)}\big\{c_t(x,u)+\sigma_t\big(x,Q_t(x,u),v^*_{t+1}\big)\big\},\qquad x\in\mathcal X,\quad t=T-1,\dots,1.\tag{28}
\]

Proof. We prove the memoryless property of $v^*_t(\cdot)$ and construct the optimal Markov policy by induction backward in time. For all $h_T\in\mathcal H_T$ we have
\[
v^*_T(h_T)=\inf_{\pi\in\Pi} c_T(x_T,\pi_T(h_T))=\inf_{u\in U_T(x_T)} c_T(x_T,u).\tag{29}
\]
Since $c_T(\cdot,\cdot)$ is lower semicontinuous, it is a normal integrand, that is, its epigraphical mapping $x\mapsto\{(u,\alpha)\in\mathcal U\times\mathbb R: c_T(x,u)\le\alpha\}$ is a closed-valued and measurable multifunction [38]. Due to assumption iv), the mapping
\[
\bar c_T(x,u)=\begin{cases} c_T(x,u) & \text{if } u\in U_T(x),\\ +\infty & \text{otherwise},\end{cases}
\]
is a normal integrand as well. By virtue of [38], the infimum in (29) is attained and is a measurable function of $x_T$. Hence, $v^*_T(\cdot)$ is measurable and memoryless. By assumptions iii) and iv) and the Berge theorem, it is also lower semicontinuous (see, e.g., [3]). Moreover, the optimal solution mapping $\Psi_T(x)=\{u\in U_T(x): c_T(x,u)=v^*_T(x)\}$ is measurable and has nonempty and closed values. Therefore, a measurable selector $\hat\pi_T$ of $\Psi_T$ exists [28], [3].
Suppose $v^*_{t+1}(\cdot)$ is measurable, memoryless, and lower semicontinuous, and Markov decision rules $\{\hat\pi_{t+1},\dots,\hat\pi_T\}$ exist such that
\[
v^*_{t+1}(x_{t+1})=v^{\{\hat\pi_{t+1},\dots,\hat\pi_T\}}_{t+1}(x_{t+1}),\qquad h_{t+1}\in\mathcal H_{t+1}.
\]
Then for any $h_t\in\mathcal H_t$ we have
\[
v^*_t(h_t)=\inf_{\pi\in\Pi} v^\pi_t(h_t)=\inf_{\pi\in\Pi}\big\{c_t(x_t,\pi_t(h_t))+\sigma_t\big(x_t,Q_t(x_t,\pi_t(h_t)),v^\pi_{t+1}(h_t,\cdot)\big)\big\}.
\]

On the one hand, since $v^\pi_{t+1}(h_t,\cdot)\ge v^*_{t+1}(\cdot)$ and $\sigma_t$ is non-decreasing with respect to the last argument, we obtain
\[
v^*_t(h_t)\ \ge\ \inf_{\pi\in\Pi}\big\{c_t(x_t,\pi_t(h_t))+\sigma_t\big(x_t,Q_t(x_t,\pi_t(h_t)),v^*_{t+1}\big)\big\}
=\inf_{u\in U_t(x_t)}\big\{c_t(x_t,u)+\sigma_t\big(x_t,Q_t(x_t,u),v^*_{t+1}\big)\big\}.\tag{30}
\]
By assumptions i)-iii), the mapping $(x,u)\mapsto c_t(x,u)+\sigma_t(x,Q_t(x,u),v^*_{t+1})$ is lower semicontinuous. Invoking [38] and assumption iv) again, exactly as in the case of $t=T$, we conclude that the optimal solution mapping
\[
\Psi_t(x)=\Big\{u\in U_t(x):\ c_t(x,u)+\sigma_t\big(x,Q_t(x,u),v^*_{t+1}\big)=\inf_{u'\in U_t(x)}\big\{c_t(x,u')+\sigma_t\big(x,Q_t(x,u'),v^*_{t+1}\big)\big\}\Big\}
\]
is measurable and has nonempty and closed values; hence, a measurable selector $\hat\pi_t$ of $\Psi_t$ exists [3]. Substituting this selector into (30), we obtain
\[
v^*_t(h_t)\ \ge\ c_t(x_t,\hat\pi_t(x_t))+\sigma_t\big(x_t,Q_t(x_t,\hat\pi_t(x_t)),v^{\{\hat\pi_{t+1},\dots,\hat\pi_T\}}_{t+1}\big)=v^{\{\hat\pi_t,\dots,\hat\pi_T\}}_t(x_t).
\]
In the last equation, we used (24) and the fact that the decision rules $\hat\pi_t,\dots,\hat\pi_T$ are Markov. On the other hand,
\[
v^*_t(h_t)=\inf_{\pi\in\Pi} v^\pi_t(h_t)\ \le\ v^{\{\hat\pi_t,\dots,\hat\pi_T\}}_t(x_t).
\]
Therefore, $v^*_t(h_t)=v^{\{\hat\pi_t,\dots,\hat\pi_T\}}_t(x_t)$ is measurable and memoryless, and
\[
v^*_t(x_t)=\min_{u\in U_t(x_t)}\big\{c_t(x_t,u)+\sigma_t\big(x_t,Q_t(x_t,u),v^*_{t+1}\big)\big\}
=c_t(x_t,\hat\pi_t(x_t))+\sigma_t\big(x_t,Q_t(x_t,\hat\pi_t(x_t)),v^*_{t+1}\big).
\]
By assumptions ii), iii), iv), and the Berge theorem, $v^*_t(\cdot)$ is lower semicontinuous (see, e.g., [3]). This completes the induction step.

Let us verify the weak lower semicontinuity assumption ii) for the mean-semideviation transition risk mapping of Example 2.18. To make the mapping Markovian, we assume that the parameter $\varkappa$ depends on $x$ only, that is,
\[
\sigma(x,q,v)=\int v(s)\,q(ds)+\varkappa(x)\bigg(\int\Big[v(s)-\int v(s')\,q(ds')\Big]_+^p\,q(ds)\bigg)^{1/p}.\tag{31}
\]
As before, $p\in[1,\infty)$. For simplicity, we skip the subscript $t$ of $\sigma$ and $\varkappa$.

Lemma 4.7. Suppose $\varkappa(\cdot)$ is continuous. Then for every lower semicontinuous function $v$, the mapping $(x,q)\mapsto\sigma(x,q,v)$ is lower semicontinuous.

Proof. Let $q_k\to q$ weakly and $x_k\to x$.
For all $s\in\mathcal X$ we have the inequality
\[
0\ \le\ \Big[v(s)-\int v(s')\,q(ds')\Big]_+\ \le\ \Big[v(s)-\int v(s')\,q_k(ds')\Big]_+\ +\ \Big|\int v(s')\,q_k(ds')-\int v(s')\,q(ds')\Big|.
\]

By the triangle inequality for the norm in $\mathcal L_p\big(\mathcal X,\mathcal B(\mathcal X),q_k\big)$,
\[
\bigg(\int\Big[v(s)-\int v(s')\,q(ds')\Big]_+^p q_k(ds)\bigg)^{1/p}
\le\bigg(\int\Big[v(s)-\int v(s')\,q_k(ds')\Big]_+^p q_k(ds)\bigg)^{1/p}
+\Big|\int v(s')\,q_k(ds')-\int v(s')\,q(ds')\Big|.
\]
Adding $\int v(s)\,q(ds)$ to both sides and using the identity $\int v\,dq+\big|\int v\,dq_k-\int v\,dq\big|=\max\big\{\int v\,dq,\int v\,dq_k\big\}+\big[\int v\,dq-\int v\,dq_k\big]_+$, we obtain
\[
\int v(s)\,q(ds)+\bigg(\int\Big[v(s)-\int v(s')\,q(ds')\Big]_+^p q_k(ds)\bigg)^{1/p}
\le\bigg(\int\Big[v(s)-\int v(s')\,q_k(ds')\Big]_+^p q_k(ds)\bigg)^{1/p}
+\max\Big\{\int v(s)\,q(ds),\int v(s)\,q_k(ds)\Big\}+\Big[\int v(s)\,q(ds)-\int v(s)\,q_k(ds)\Big]_+.
\]
By the lower semicontinuity of $v$ and the weak convergence of $q_k$ to $q$, we have
\[
\int v(s)\,q(ds)\ \le\ \liminf_{k\to\infty}\int v(s)\,q_k(ds),
\]
that is, for every $\varepsilon>0$ we can find $k_\varepsilon$ such that for all $k\ge k_\varepsilon$,
\[
\int v(s)\,q_k(ds)\ \ge\ \int v(s)\,q(ds)-\varepsilon/2.
\]
Therefore, for these $k$ we obtain
\[
\int v(s)\,q(ds)+\bigg(\int\Big[v(s)-\int v(s')\,q(ds')\Big]_+^p q_k(ds)\bigg)^{1/p}
\le\int v(s)\,q_k(ds)+\bigg(\int\Big[v(s)-\int v(s')\,q_k(ds')\Big]_+^p q_k(ds)\bigg)^{1/p}+\varepsilon.
\]
Taking the $\liminf$ of both sides, and using the weak convergence of $q_k$ to $q$ and the lower semicontinuity of the functions integrated on the left-hand side, we conclude that
\[
\int v(s)\,q(ds)+\bigg(\int\Big[v(s)-\int v(s')\,q(ds')\Big]_+^p q(ds)\bigg)^{1/p}
\le\liminf_{k\to\infty}\bigg\{\int v(s)\,q_k(ds)+\bigg(\int\Big[v(s)-\int v(s')\,q_k(ds')\Big]_+^p q_k(ds)\bigg)^{1/p}\bigg\}+\varepsilon.
\]
As $\varepsilon>0$ was arbitrary, the last relation proves the lower semicontinuity of $\sigma$ in the case when $\varkappa(x)\equiv 1$. The case of a continuous $\varkappa(x)\in[0,1]$ can now be easily analyzed by noticing that $\sigma(x,q,v)$ is a convex combination of the expected value and the risk measure of the last displayed relation:
\[
\sigma(x,q,v)=\big(1-\varkappa(x)\big)\int v(s)\,q(ds)
+\varkappa(x)\bigg\{\int v(s)\,q(ds)+\bigg(\int\Big[v(s)-\int v(s')\,q(ds')\Big]_+^p q(ds)\bigg)^{1/p}\bigg\}.
\]
As both components are lower semicontinuous in $(x,q)$, so is their sum.
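To close Section 4, here is a numerical sketch of the dynamic programming equations (25)-(26), using the mean-semideviation mapping (31) with $p=1$ and a constant $\varkappa=0.5$. States, controls, kernels, and costs are invented for illustration:

```python
# Illustrative sketch (not from the paper): risk-averse backward induction
# v*_t(x) = min_u { c_t(x,u) + sigma(x, Q_t(x,u), v*_{t+1}) }, equations
# (25)-(26), together with a minimizing decision rule as in (27)-(28).

def sigma(q, v, kappa=0.5):
    """Mean-upper-semideviation mapping (31) with p = 1, constant kappa."""
    m = sum(p * w for p, w in zip(q, v))
    return m + kappa * sum(p * max(w - m, 0.0) for p, w in zip(q, v))

T = 2
states, controls = range(2), range(2)
c = [[[1.0, 3.0], [2.0, 0.5]],        # c[t][x][u]
     [[0.0, 0.0], [4.0, 4.0]]]        # terminal costs (control-independent here)
Q = [[[[0.8, 0.2], [0.1, 0.9]],       # Q[t][x][u] = distribution of next state
      [[0.5, 0.5], [0.9, 0.1]]]]

v = [min(c[T - 1][x][u] for u in controls) for x in states]     # equation (25)
policy = []
for t in range(T - 2, -1, -1):                                  # equation (26)
    q_values = [[c[t][x][u] + sigma(Q[t][x][u], v) for u in controls]
                for x in states]
    policy.insert(0, [min(controls, key=lambda u: q_values[x][u])
                      for x in states])
    v = [min(q_values[x]) for x in states]
print(v, policy)    # risk-averse value function at t = 1 and a minimizing rule
```

Because $\sigma$ is strongly monotonic, replacing the expectation of risk-neutral dynamic programming by $\sigma$ changes only the one-step evaluation; the backward structure of the algorithm is unchanged.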

5 Application to Partially-Observable Markov Decision Processes

In this section, we apply our previous results to a partially-observable Markov decision process $\{(X_t,Y_t)\}_{t=1,\dots,T}$, in which $\{X_t\}_{t=1,\dots,T}$ is observable and $\{Y_t\}_{t=1,\dots,T}$ is not. We use the term "partially-observable Markov decision process" in a more general way than the extant literature, because we consider dynamic risk measures as the objective of control, rather than just the expected value of the cost.

5.1 The Model

In order to develop our subsequent theory, it is essential to define our partially-observable Markov decision process (POMDP) in a clear and rigorous way. This subsection mostly follows Chapter 5 of [4], and readers are encouraged to consult [4] for more details. The state space of the model is defined as $\mathcal X\times\mathcal Y$, where $(\mathcal X,\mathcal B(\mathcal X))$ and $(\mathcal Y,\mathcal B(\mathcal Y))$ are two Polish spaces. From the modeling perspective, $x\in\mathcal X$ is the part of the state that we can observe at each step, while $y\in\mathcal Y$ is unobservable. The measurable space that we will work with is then given by $(\mathcal X\times\mathcal Y)^T$ endowed with the canonical product $\sigma$-field, and we use $x_t$ and $y_t$ to denote the canonical projections at time $t$. The control space is still given by $\mathcal U$, and since only $\mathcal X$ is observable, the set of admissible controls at step $t$ is still given by a multifunction $U_t:\mathcal X\rightrightarrows\mathcal U$. The transition kernel at time $t$ is now given by $Q_t:\operatorname{graph}(U_t)\times\mathcal Y\to\mathcal P(\mathcal X\times\mathcal Y)$; in other words, given that at time $t$ the state is $(x,y)$ and we apply control $u$, the distribution of the next state is given by $Q_t(x,y,u)$. Finally, the cost incurred at each stage $t$ is given by the functional $c_t:\operatorname{graph}(U_t)\to\mathbb R$, which is assumed to be measurable and bounded.

We would also like to comment on the notion of history in the current POMDP setting. At time $t$, all the information available for making a decision is given by $g_t=(x_1,u_1,\dots,x_t)$, because $y_t$ is unobservable. In the following, the notion of a history will always correspond to this observable history.
5.2 Bayes Operator

In a POMDP, the Bayes operator provides a way to update from a prior belief to a posterior belief. More precisely, assume that our current state observation is $x$, our action is $u$, and our current best guess of the distribution of the unobservable state is $\xi$. Given a new observation $x'$, we can find a formula to determine the posterior distribution of the unobservable state. Let us start with a fairly general construction of the Bayes operator. Assuming the above setup, for given $(x,\xi,u)\in\mathcal X\times\mathcal P(\mathcal Y)\times\mathcal U$, define a new measure $m_t(x,\xi,u)$ on $\mathcal X\times\mathcal Y$, initially on all measurable rectangles $A\times B$, as
\[
m_t(x,\xi,u)(A\times B)=\int_{\mathcal Y} Q_t(A\times B\mid x,y,u)\,\xi(dy).
\]
We verify readily that this uniquely defines a probability measure on $\mathcal X\times\mathcal Y$. If the measurable space $(\mathcal Y,\mathcal B(\mathcal Y))$ is standard, i.e., isomorphic to a Borel subspace of $\mathbb R$, we can disintegrate
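When both $\mathcal X$ and $\mathcal Y$ are finite, the measure $m_t$ above is a finite sum and the disintegration reduces to ordinary Bayesian conditioning. The following sketch (all names and numbers hypothetical) forms $m_t(x,\xi,u)$ and the resulting posterior over $y'$ given an observed $x'$:

```python
# Illustrative sketch (finite-space specialization, not the paper's general
# construction): m_t(x, xi, u)(A x B) = sum_y Q_t(A x B | x, y, u) xi(y),
# followed by conditioning on the observed component x'.

def joint_next(kernel, xi):
    """kernel[y][(x1, y1)] = Q_t(x1, y1 | x, y, u) for a fixed (x, u);
    xi[y] = current belief over y. Returns m_t as a dict over (x1, y1)."""
    m = {}
    for y, prob_y in enumerate(xi):
        for xy_next, p in kernel[y].items():
            m[xy_next] = m.get(xy_next, 0.0) + prob_y * p
    return m

def bayes_update(kernel, xi, x_obs):
    """Posterior over y' after observing x' = x_obs (the Bayes operator)."""
    m = joint_next(kernel, xi)
    marginal = sum(p for (x1, _), p in m.items() if x1 == x_obs)
    return {y1: p / marginal for (x1, y1), p in m.items() if x1 == x_obs}

# Two observable and two unobservable states; kernel for one fixed (x, u).
kernel = [{(0, 0): 0.7, (1, 1): 0.3},    # transitions from y = 0
          {(0, 1): 0.2, (1, 0): 0.8}]    # transitions from y = 1
xi = [0.5, 0.5]
print(bayes_update(kernel, xi, 0))       # posterior over y' given x' = 0
```

Dividing by the marginal probability of the observed $x'$ is exactly the disintegration step; it is well defined whenever that marginal is positive.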


More information

Lecture 17 Brownian motion as a Markov process

Lecture 17 Brownian motion as a Markov process Lecture 17: Brownian motion as a Markov process 1 of 14 Course: Theory of Probability II Term: Spring 2015 Instructor: Gordan Zitkovic Lecture 17 Brownian motion as a Markov process Brownian motion is

More information

On Kusuoka representation of law invariant risk measures

On Kusuoka representation of law invariant risk measures On Kusuoka representation of law invariant risk measures Alexander Shapiro School of Industrial & Systems Engineering, Georgia Institute of Technology, 765 Ferst Drive, Atlanta, GA 3332. Abstract. In this

More information

Asymptotics of minimax stochastic programs

Asymptotics of minimax stochastic programs Asymptotics of minimax stochastic programs Alexander Shapiro Abstract. We discuss in this paper asymptotics of the sample average approximation (SAA) of the optimal value of a minimax stochastic programming

More information

A Note on Robust Representations of Law-Invariant Quasiconvex Functions

A Note on Robust Representations of Law-Invariant Quasiconvex Functions A Note on Robust Representations of Law-Invariant Quasiconvex Functions Samuel Drapeau Michael Kupper Ranja Reda October 6, 21 We give robust representations of law-invariant monotone quasiconvex functions.

More information

Observer design for a general class of triangular systems

Observer design for a general class of triangular systems 1st International Symposium on Mathematical Theory of Networks and Systems July 7-11, 014. Observer design for a general class of triangular systems Dimitris Boskos 1 John Tsinias Abstract The paper deals

More information

4. Conditional risk measures and their robust representation

4. Conditional risk measures and their robust representation 4. Conditional risk measures and their robust representation We consider a discrete-time information structure given by a filtration (F t ) t=0,...,t on our probability space (Ω, F, P ). The time horizon

More information

arxiv: v2 [math.ag] 24 Jun 2015

arxiv: v2 [math.ag] 24 Jun 2015 TRIANGULATIONS OF MONOTONE FAMILIES I: TWO-DIMENSIONAL FAMILIES arxiv:1402.0460v2 [math.ag] 24 Jun 2015 SAUGATA BASU, ANDREI GABRIELOV, AND NICOLAI VOROBJOV Abstract. Let K R n be a compact definable set

More information

A Concise Course on Stochastic Partial Differential Equations

A Concise Course on Stochastic Partial Differential Equations A Concise Course on Stochastic Partial Differential Equations Michael Röckner Reference: C. Prevot, M. Röckner: Springer LN in Math. 1905, Berlin (2007) And see the references therein for the original

More information

The small ball property in Banach spaces (quantitative results)

The small ball property in Banach spaces (quantitative results) The small ball property in Banach spaces (quantitative results) Ehrhard Behrends Abstract A metric space (M, d) is said to have the small ball property (sbp) if for every ε 0 > 0 there exists a sequence

More information

Relaxations of linear programming problems with first order stochastic dominance constraints

Relaxations of linear programming problems with first order stochastic dominance constraints Operations Research Letters 34 (2006) 653 659 Operations Research Letters www.elsevier.com/locate/orl Relaxations of linear programming problems with first order stochastic dominance constraints Nilay

More information

Propagating terraces and the dynamics of front-like solutions of reaction-diffusion equations on R

Propagating terraces and the dynamics of front-like solutions of reaction-diffusion equations on R Propagating terraces and the dynamics of front-like solutions of reaction-diffusion equations on R P. Poláčik School of Mathematics, University of Minnesota Minneapolis, MN 55455 Abstract We consider semilinear

More information

The Asymptotic Theory of Transaction Costs

The Asymptotic Theory of Transaction Costs The Asymptotic Theory of Transaction Costs Lecture Notes by Walter Schachermayer Nachdiplom-Vorlesung, ETH Zürich, WS 15/16 1 Models on Finite Probability Spaces In this section we consider a stock price

More information

Stochastic Dynamic Programming: The One Sector Growth Model

Stochastic Dynamic Programming: The One Sector Growth Model Stochastic Dynamic Programming: The One Sector Growth Model Esteban Rossi-Hansberg Princeton University March 26, 2012 Esteban Rossi-Hansberg () Stochastic Dynamic Programming March 26, 2012 1 / 31 References

More information

A SET OF LECTURE NOTES ON CONVEX OPTIMIZATION WITH SOME APPLICATIONS TO PROBABILITY THEORY INCOMPLETE DRAFT. MAY 06

A SET OF LECTURE NOTES ON CONVEX OPTIMIZATION WITH SOME APPLICATIONS TO PROBABILITY THEORY INCOMPLETE DRAFT. MAY 06 A SET OF LECTURE NOTES ON CONVEX OPTIMIZATION WITH SOME APPLICATIONS TO PROBABILITY THEORY INCOMPLETE DRAFT. MAY 06 CHRISTIAN LÉONARD Contents Preliminaries 1 1. Convexity without topology 1 2. Convexity

More information

Deterministic Dynamic Programming

Deterministic Dynamic Programming Deterministic Dynamic Programming 1 Value Function Consider the following optimal control problem in Mayer s form: V (t 0, x 0 ) = inf u U J(t 1, x(t 1 )) (1) subject to ẋ(t) = f(t, x(t), u(t)), x(t 0

More information

THE DAUGAVETIAN INDEX OF A BANACH SPACE 1. INTRODUCTION

THE DAUGAVETIAN INDEX OF A BANACH SPACE 1. INTRODUCTION THE DAUGAVETIAN INDEX OF A BANACH SPACE MIGUEL MARTÍN ABSTRACT. Given an infinite-dimensional Banach space X, we introduce the daugavetian index of X, daug(x), as the greatest constant m 0 such that Id

More information

2 Statement of the problem and assumptions

2 Statement of the problem and assumptions Mathematical Notes, 25, vol. 78, no. 4, pp. 466 48. Existence Theorem for Optimal Control Problems on an Infinite Time Interval A.V. Dmitruk and N.V. Kuz kina We consider an optimal control problem on

More information

Valid Inequalities and Restrictions for Stochastic Programming Problems with First Order Stochastic Dominance Constraints

Valid Inequalities and Restrictions for Stochastic Programming Problems with First Order Stochastic Dominance Constraints Valid Inequalities and Restrictions for Stochastic Programming Problems with First Order Stochastic Dominance Constraints Nilay Noyan Andrzej Ruszczyński March 21, 2006 Abstract Stochastic dominance relations

More information

c 2012 Ozlem Cavus ALL RIGHTS RESERVED

c 2012 Ozlem Cavus ALL RIGHTS RESERVED c 2012 Ozlem Cavus ALL RIGHTS RESERVED RISK-AVERSE CONTROL OF UNDISCOUNTED TRANSIENT MARKOV MODELS by OZLEM CAVUS A dissertation submitted to the Graduate School New Brunswick Rutgers, The State University

More information

Optimal Stopping under Adverse Nonlinear Expectation and Related Games

Optimal Stopping under Adverse Nonlinear Expectation and Related Games Optimal Stopping under Adverse Nonlinear Expectation and Related Games Marcel Nutz Jianfeng Zhang First version: December 7, 2012. This version: July 21, 2014 Abstract We study the existence of optimal

More information

Bayesian Persuasion Online Appendix

Bayesian Persuasion Online Appendix Bayesian Persuasion Online Appendix Emir Kamenica and Matthew Gentzkow University of Chicago June 2010 1 Persuasion mechanisms In this paper we study a particular game where Sender chooses a signal π whose

More information

Lecture 21 Representations of Martingales

Lecture 21 Representations of Martingales Lecture 21: Representations of Martingales 1 of 11 Course: Theory of Probability II Term: Spring 215 Instructor: Gordan Zitkovic Lecture 21 Representations of Martingales Right-continuous inverses Let

More information

SUMMARY OF RESULTS ON PATH SPACES AND CONVERGENCE IN DISTRIBUTION FOR STOCHASTIC PROCESSES

SUMMARY OF RESULTS ON PATH SPACES AND CONVERGENCE IN DISTRIBUTION FOR STOCHASTIC PROCESSES SUMMARY OF RESULTS ON PATH SPACES AND CONVERGENCE IN DISTRIBUTION FOR STOCHASTIC PROCESSES RUTH J. WILLIAMS October 2, 2017 Department of Mathematics, University of California, San Diego, 9500 Gilman Drive,

More information

HOPF S DECOMPOSITION AND RECURRENT SEMIGROUPS. Josef Teichmann

HOPF S DECOMPOSITION AND RECURRENT SEMIGROUPS. Josef Teichmann HOPF S DECOMPOSITION AND RECURRENT SEMIGROUPS Josef Teichmann Abstract. Some results of ergodic theory are generalized in the setting of Banach lattices, namely Hopf s maximal ergodic inequality and the

More information

On Semicontinuity of Convex-valued Multifunctions and Cesari s Property (Q)

On Semicontinuity of Convex-valued Multifunctions and Cesari s Property (Q) On Semicontinuity of Convex-valued Multifunctions and Cesari s Property (Q) Andreas Löhne May 2, 2005 (last update: November 22, 2005) Abstract We investigate two types of semicontinuity for set-valued

More information

arxiv: v1 [math.ds] 4 Jan 2019

arxiv: v1 [math.ds] 4 Jan 2019 EPLODING MARKOV OPERATORS BARTOSZ FREJ ariv:1901.01329v1 [math.ds] 4 Jan 2019 Abstract. A special class of doubly stochastic (Markov operators is constructed. These operators come from measure preserving

More information

Equilibria with a nontrivial nodal set and the dynamics of parabolic equations on symmetric domains

Equilibria with a nontrivial nodal set and the dynamics of parabolic equations on symmetric domains Equilibria with a nontrivial nodal set and the dynamics of parabolic equations on symmetric domains J. Földes Department of Mathematics, Univerité Libre de Bruxelles 1050 Brussels, Belgium P. Poláčik School

More information

CLASSES OF STRICTLY SINGULAR OPERATORS AND THEIR PRODUCTS

CLASSES OF STRICTLY SINGULAR OPERATORS AND THEIR PRODUCTS CLASSES OF STRICTLY SINGULAR OPERATORS AND THEIR PRODUCTS GEORGE ANDROULAKIS, PANDELIS DODOS, GLEB SIROTKIN AND VLADIMIR G. TROITSKY Abstract. Milman proved in [18] that the product of two strictly singular

More information

arxiv:math/ v4 [math.pr] 12 Apr 2007

arxiv:math/ v4 [math.pr] 12 Apr 2007 arxiv:math/612224v4 [math.pr] 12 Apr 27 LARGE CLOSED QUEUEING NETWORKS IN SEMI-MARKOV ENVIRONMENT AND ITS APPLICATION VYACHESLAV M. ABRAMOV Abstract. The paper studies closed queueing networks containing

More information

Competitive Equilibria in a Comonotone Market

Competitive Equilibria in a Comonotone Market Competitive Equilibria in a Comonotone Market 1/51 Competitive Equilibria in a Comonotone Market Ruodu Wang http://sas.uwaterloo.ca/ wang Department of Statistics and Actuarial Science University of Waterloo

More information

The Canonical Gaussian Measure on R

The Canonical Gaussian Measure on R The Canonical Gaussian Measure on R 1. Introduction The main goal of this course is to study Gaussian measures. The simplest example of a Gaussian measure is the canonical Gaussian measure P on R where

More information

Notes on Measure Theory and Markov Processes

Notes on Measure Theory and Markov Processes Notes on Measure Theory and Markov Processes Diego Daruich March 28, 2014 1 Preliminaries 1.1 Motivation The objective of these notes will be to develop tools from measure theory and probability to allow

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Some SDEs with distributional drift Part I : General calculus. Flandoli, Franco; Russo, Francesco; Wolf, Jochen

Some SDEs with distributional drift Part I : General calculus. Flandoli, Franco; Russo, Francesco; Wolf, Jochen Title Author(s) Some SDEs with distributional drift Part I : General calculus Flandoli, Franco; Russo, Francesco; Wolf, Jochen Citation Osaka Journal of Mathematics. 4() P.493-P.54 Issue Date 3-6 Text

More information

Banach-Alaoglu theorems

Banach-Alaoglu theorems Banach-Alaoglu theorems László Erdős Jan 23, 2007 1 Compactness revisited In a topological space a fundamental property is the compactness (Kompaktheit). We recall the definition: Definition 1.1 A subset

More information

Risk Measures in non-dominated Models

Risk Measures in non-dominated Models Purpose: Study Risk Measures taking into account the model uncertainty in mathematical finance. Plan 1 Non-dominated Models Model Uncertainty Fundamental topological Properties 2 Risk Measures on L p (c)

More information

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries

Chapter 1. Measure Spaces. 1.1 Algebras and σ algebras of sets Notation and preliminaries Chapter 1 Measure Spaces 1.1 Algebras and σ algebras of sets 1.1.1 Notation and preliminaries We shall denote by X a nonempty set, by P(X) the set of all parts (i.e., subsets) of X, and by the empty set.

More information

The Subdifferential of Convex Deviation Measures and Risk Functions

The Subdifferential of Convex Deviation Measures and Risk Functions The Subdifferential of Convex Deviation Measures and Risk Functions Nicole Lorenz Gert Wanka In this paper we give subdifferential formulas of some convex deviation measures using their conjugate functions

More information

Normed Vector Spaces and Double Duals

Normed Vector Spaces and Double Duals Normed Vector Spaces and Double Duals Mathematics 481/525 In this note we look at a number of infinite-dimensional R-vector spaces that arise in analysis, and we consider their dual and double dual spaces

More information

Nonlinear representation, backward SDEs, and application to the Principal-Agent problem

Nonlinear representation, backward SDEs, and application to the Principal-Agent problem Nonlinear representation, backward SDEs, and application to the Principal-Agent problem Ecole Polytechnique, France April 4, 218 Outline The Principal-Agent problem Formulation 1 The Principal-Agent problem

More information

Some Background Math Notes on Limsups, Sets, and Convexity

Some Background Math Notes on Limsups, Sets, and Convexity EE599 STOCHASTIC NETWORK OPTIMIZATION, MICHAEL J. NEELY, FALL 2008 1 Some Background Math Notes on Limsups, Sets, and Convexity I. LIMITS Let f(t) be a real valued function of time. Suppose f(t) converges

More information

Packing-Dimension Profiles and Fractional Brownian Motion

Packing-Dimension Profiles and Fractional Brownian Motion Under consideration for publication in Math. Proc. Camb. Phil. Soc. 1 Packing-Dimension Profiles and Fractional Brownian Motion By DAVAR KHOSHNEVISAN Department of Mathematics, 155 S. 1400 E., JWB 233,

More information

Noncooperative continuous-time Markov games

Noncooperative continuous-time Markov games Morfismos, Vol. 9, No. 1, 2005, pp. 39 54 Noncooperative continuous-time Markov games Héctor Jasso-Fuentes Abstract This work concerns noncooperative continuous-time Markov games with Polish state and

More information

Differential Games II. Marc Quincampoix Université de Bretagne Occidentale ( Brest-France) SADCO, London, September 2011

Differential Games II. Marc Quincampoix Université de Bretagne Occidentale ( Brest-France) SADCO, London, September 2011 Differential Games II Marc Quincampoix Université de Bretagne Occidentale ( Brest-France) SADCO, London, September 2011 Contents 1. I Introduction: A Pursuit Game and Isaacs Theory 2. II Strategies 3.

More information

Mean-field dual of cooperative reproduction

Mean-field dual of cooperative reproduction The mean-field dual of systems with cooperative reproduction joint with Tibor Mach (Prague) A. Sturm (Göttingen) Friday, July 6th, 2018 Poisson construction of Markov processes Let (X t ) t 0 be a continuous-time

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

On the distributional divergence of vector fields vanishing at infinity

On the distributional divergence of vector fields vanishing at infinity Proceedings of the Royal Society of Edinburgh, 141A, 65 76, 2011 On the distributional divergence of vector fields vanishing at infinity Thierry De Pauw Institut de Recherches en Mathématiques et Physique,

More information

Inverse Stochastic Dominance Constraints Duality and Methods

Inverse Stochastic Dominance Constraints Duality and Methods Duality and Methods Darinka Dentcheva 1 Andrzej Ruszczyński 2 1 Stevens Institute of Technology Hoboken, New Jersey, USA 2 Rutgers University Piscataway, New Jersey, USA Research supported by NSF awards

More information

Weak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij

Weak convergence and Brownian Motion. (telegram style notes) P.J.C. Spreij Weak convergence and Brownian Motion (telegram style notes) P.J.C. Spreij this version: December 8, 2006 1 The space C[0, ) In this section we summarize some facts concerning the space C[0, ) of real

More information

Alternative Characterizations of Markov Processes

Alternative Characterizations of Markov Processes Chapter 10 Alternative Characterizations of Markov Processes This lecture introduces two ways of characterizing Markov processes other than through their transition probabilities. Section 10.1 describes

More information

1. Stochastic Processes and filtrations

1. Stochastic Processes and filtrations 1. Stochastic Processes and 1. Stoch. pr., A stochastic process (X t ) t T is a collection of random variables on (Ω, F) with values in a measurable space (S, S), i.e., for all t, In our case X t : Ω S

More information

Boundedly complete weak-cauchy basic sequences in Banach spaces with the PCP

Boundedly complete weak-cauchy basic sequences in Banach spaces with the PCP Journal of Functional Analysis 253 (2007) 772 781 www.elsevier.com/locate/jfa Note Boundedly complete weak-cauchy basic sequences in Banach spaces with the PCP Haskell Rosenthal Department of Mathematics,

More information

Risk neutral and risk averse approaches to multistage stochastic programming.

Risk neutral and risk averse approaches to multistage stochastic programming. Risk neutral and risk averse approaches to multistage stochastic programming. A. Shapiro School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0205, USA

More information

Generalized Hypothesis Testing and Maximizing the Success Probability in Financial Markets

Generalized Hypothesis Testing and Maximizing the Success Probability in Financial Markets Generalized Hypothesis Testing and Maximizing the Success Probability in Financial Markets Tim Leung 1, Qingshuo Song 2, and Jie Yang 3 1 Columbia University, New York, USA; leung@ieor.columbia.edu 2 City

More information

Lecture Notes 1: Decisions and Data. In these notes, I describe some basic ideas in decision theory. theory is constructed from

Lecture Notes 1: Decisions and Data. In these notes, I describe some basic ideas in decision theory. theory is constructed from Topics in Data Analysis Steven N. Durlauf University of Wisconsin Lecture Notes : Decisions and Data In these notes, I describe some basic ideas in decision theory. theory is constructed from The Data:

More information

Notes on uniform convergence

Notes on uniform convergence Notes on uniform convergence Erik Wahlén erik.wahlen@math.lu.se January 17, 2012 1 Numerical sequences We begin by recalling some properties of numerical sequences. By a numerical sequence we simply mean

More information

Optimality of Walrand-Varaiya Type Policies and. Approximation Results for Zero-Delay Coding of. Markov Sources. Richard G. Wood

Optimality of Walrand-Varaiya Type Policies and. Approximation Results for Zero-Delay Coding of. Markov Sources. Richard G. Wood Optimality of Walrand-Varaiya Type Policies and Approximation Results for Zero-Delay Coding of Markov Sources by Richard G. Wood A thesis submitted to the Department of Mathematics & Statistics in conformity

More information

Dynamic Games with Asymmetric Information: Common Information Based Perfect Bayesian Equilibria and Sequential Decomposition

Dynamic Games with Asymmetric Information: Common Information Based Perfect Bayesian Equilibria and Sequential Decomposition Dynamic Games with Asymmetric Information: Common Information Based Perfect Bayesian Equilibria and Sequential Decomposition 1 arxiv:1510.07001v1 [cs.gt] 23 Oct 2015 Yi Ouyang, Hamidreza Tavafoghi and

More information