Decentralized Stochastic Control with Partial Sharing Information Structures: A Common Information Approach

Ashutosh Nayyar, Aditya Mahajan and Demosthenis Teneketzis

Abstract: A general model of a decentralized stochastic control problem in which multiple controllers share part of their information with each other is investigated. The general model subsumes several models of information sharing in decentralized stochastic control as special cases. Structural results for optimal control strategies for the general model are presented. A dynamic program for finding the optimal strategies is also derived. These results are obtained by identifying common information among controllers and formulating the decentralized problem as a centralized problem from the perspective of a coordinator who knows the common information.

Index Terms: Decentralized Control, Stochastic Control, Information Structures, Markov Decision Theory, Team Theory

I. INTRODUCTION

The theory of stochastic control provides analytical and computational tools for centralized sequential decision-making problems with system and observation uncertainty. While general decision-making problems with arbitrary models of uncertainty remain intractable, stochastic control theory has been successful in identifying models (for example, Markov decision processes and linear-quadratic-Gaussian control problems) for which it can provide intuitively appealing and computationally tractable results. An essential feature of these models is the assumption of centralized control, that is, a single controller/decision-maker has access to all the observations, has perfect recall of all past observations, and is in charge of making all the decisions. However,

a number of modern control problems arising in diverse applications like networked control systems, communication and queuing networks, sensor networks, etc., require control decisions to be made by different controllers/decision-makers with access to different information. In this paper, we investigate such problems of decentralized stochastic control.

The presence of multiple decision-makers with different information implies that the tools of centralized stochastic control cannot be directly applied to decentralized problems. Two general approaches that indirectly use centralized stochastic control for addressing decentralized decision-making problems have been used in the literature: the person-by-person approach and the designer's approach.

The general philosophy of the person-by-person approach is to study the decentralized problem by first arbitrarily fixing the decision strategies of all but the i-th decision-maker and then using centralized control tools to identify an optimal decision strategy for the i-th decision-maker. If the resulting decision strategy of the i-th decision-maker has a structural property that is independent of the choice of the other decision-makers' strategies, then such a structural property is true for a globally optimal strategy of the i-th decision-maker. This method has been successfully used to identify structural properties of globally optimal decision strategies in several decentralized problems including problems of real-time communication [1]-[5], decentralized hypothesis testing and decentralized quickest change detection problems [6]-[14], decentralized control problems [15]-[17], etc. A recursive application of this approach starts with one decision-maker (say, the i-th), fixes the strategies of all other decision-makers, finds the optimal strategy of the i-th decision-maker, and then repeats this process, cyclically, for all decision-makers. If this recursive approach yields a fixed point, then the resulting strategies are person-by-person optimal [18], i.e., no decision-maker can improve performance by unilaterally deviating from these strategies. Since the problem of finding globally optimal strategies is, in general, a non-convex functional optimization problem, person-by-person optimal strategies may not be globally optimal strategies.
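To make the cyclic best-response recursion just described concrete, here is a minimal sketch; `best_response` and `cost` are hypothetical placeholders (a centralized solver for one decision-maker with all the others fixed, and an expected-cost evaluator), not constructs defined in this paper.

```python
# A minimal sketch of the cyclic person-by-person (best-response) iteration.
# `best_response(i, strategies)` is a hypothetical centralized solver that
# returns an optimal strategy for decision-maker i when all other strategies
# are held fixed; `cost` evaluates the expected total cost of a profile.

def person_by_person(strategies, best_response, cost, max_rounds=100, tol=1e-9):
    prev = cost(strategies)
    for _ in range(max_rounds):
        for i in range(len(strategies)):
            strategies[i] = best_response(i, strategies)
        new = cost(strategies)
        if prev - new < tol:
            # Fixed point reached: person-by-person optimal, but possibly
            # not globally optimal (the joint problem is non-convex).
            break
        prev = new
    return strategies
```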

The general philosophy of the designer's approach is to consider the problem from the perspective of a designer who has to choose decision strategies for all the decision-makers. The designer's optimization problem can be viewed as a centralized sequential decision-making problem for which a sequential decomposition can be obtained. This approach has been developed in [19], [20].

In addition to the above general approaches, other specialized approaches have been developed to address specific problems in decentralized systems. Decentralized problems with partially nested information structure were defined and studied in [21]. Decentralized linear quadratic Gaussian (LQG) control problems with two controllers and partially nested information structure were studied in [22], [23]. Partially nested decentralized LQG problems with controllers connected via a graph were studied in [24], [25]. A generalization of partial nestedness, called stochastic nestedness, was defined and studied in [26]. An important property of LQG control problems with partially nested information structure is that there exists an affine control strategy which is globally optimal. In general, the problem of finding the best affine control strategies may not be a convex optimization problem. Conditions under which the problem of determining optimal control strategies within the class of affine control strategies becomes a convex optimization problem were identified in [27], [28].

Decentralized stochastic control problems with specific models of information sharing among controllers have also been studied in the literature. Control problems with delayed information sharing among controllers were studied in [29]-[31]. The case of periodic sharing of information among controllers was studied in [32]. Decentralized problems where controllers share their past actions were investigated in [33], [34]. A decentralized control problem with a broadcast information structure was defined and studied in [17]. A decentralized problem with common and private observations for controllers was studied in [35].

A. Contributions of the Paper

We introduce a general model of a decentralized stochastic control problem in which multiple controllers share part of their information with each other. We call this model the partial sharing information structure. This model subsumes several models of information sharing in decentralized stochastic control as special cases (see Section II-B). We establish two results for our model. Firstly, we establish a structural property of optimal control strategies. Secondly, we provide a sequential decomposition of the problem of finding optimal control strategies. As in [31], [35], our results are derived using a common information based approach (see Section III). This approach differs from the person-by-person approach and the designer's approach mentioned earlier. In particular, the structural properties found in this paper cannot be found by the person-by-person approach described earlier. Moreover, the sequential decomposition found in this paper is distinct from and simpler than the sequential decomposition based on the designer's approach. For a general framework for using common information in sequential decision-making problems, see [36].

B. Notation

Random variables are denoted by upper case letters; their realizations by the corresponding lower case letters. $X_{a:b}$ is a short hand for the vector $(X_a, X_{a+1}, \ldots, X_b)$, while $X^{c:d}$ is a short hand for the vector $(X^c, X^{c+1}, \ldots, X^d)$. The combined notation $X^{c:d}_{a:b}$ is a short hand for the vector $(X^j_i : i = a, a+1, \ldots, b,\ j = c, c+1, \ldots, d)$. In general, subscripts are used as time indices while superscripts are used to index controllers. Bold letters $\mathbf{X}$ are used as a short hand for the vector $(X^{1:n})$. $P(\cdot)$ is the probability of an event and $E(\cdot)$ is the expectation of a random variable. For a collection of functions $g$, we use $P^{g}(\cdot)$ and $E^{g}(\cdot)$ to denote that the probability measure/expectation depends on the choice of the functions in $g$. $\mathbb{1}_A(\cdot)$ is the indicator function of a set $A$. For singleton sets $\{a\}$, we also denote $\mathbb{1}_{\{a\}}(\cdot)$ by $\mathbb{1}_a(\cdot)$. For a singleton $a$ and a set $B$, $\{a, B\}$ denotes the set $\{a\} \cup B$. For two sets $A$ and $B$, $\{A, B\}$ denotes the set $A \cup B$. For two finite sets $A, B$, $F(A, B)$ is the set of all functions from $A$ to $B$; if $A = \emptyset$, $F(A, B) := B$. For a finite set $A$, $\Delta(A)$ is the set of all probability mass functions over $A$. For ease of exposition, we assume that all state, observation and control variables take values in finite sets. For two random variables (or random vectors) $X$ and $Y$ taking values in $\mathcal{X}$ and $\mathcal{Y}$, $P(X = x \mid Y)$ denotes the conditional probability of the event $\{X = x\}$ given $Y$, and $P(X \mid Y)$ denotes the conditional PMF (probability mass function) of $X$ given $Y$, that is, the collection of conditional probabilities $P(X = x \mid Y)$, $x \in \mathcal{X}$. Finally, all equalities involving random variables are to be interpreted as almost sure equalities (that is, they hold with probability one).

C. Organization

The rest of this paper is organized as follows. We present our model of a decentralized stochastic control problem and state our main results in Section II. We also present several special cases of our model in this section. We prove our results in Section III. We consider a generalization of our model in Section IV. We consider the infinite time-horizon discounted cost analogue of our problem in Section V. Finally, we conclude in Section VI.

II. PROBLEM FORMULATION AND MAIN RESULTS

A. Basic Model: Partial Sharing Information Structure

1) The Dynamic System: Consider a dynamic system with n controllers. The system operates in discrete time for a horizon T. Let $X_t \in \mathcal{X}_t$ denote the state of the system at time t, $U^i_t \in \mathcal{U}^i_t$ denote the control action of controller i, i = 1, ..., n, at time t, and $U_t$ denote the vector $(U^1_t, \ldots, U^n_t)$. The initial state $X_1$ has a probability distribution $Q_1$, which is common knowledge to all the controllers. The state of the system evolves according to

X_{t+1} = f_t(X_t, U_t, W^0_t),   (1)

where $\{W^0_t, t = 1, \ldots, T\}$ is a sequence of i.i.d. random variables with probability distribution $Q^0_W$.

2) Data available at the controllers: At any time t, each controller has access to two types of data:

(i) Current local observation: Each controller makes a local observation $Y^i_t \in \mathcal{Y}^i_t$ on the state of the system at time t,

Y^i_t = h^i_t(X_t, W^i_t),   (2)

where $\{W^i_t, t = 1, \ldots, T\}$ is a sequence of i.i.d. random variables with probability distribution $Q^i_W$. Let $Y_t$ denote the vector $(Y^1_t, \ldots, Y^n_t)$. We assume that the random variables in the collection $\{X_1, W^j_t, t = 1, \ldots, T, j = 0, 1, \ldots, n\}$, called primitive random variables, are mutually independent.

(ii) Local memory and shared memory:

a) Local memory: Each controller remembers a subset $M^i_t$ of its past local observations and its past actions:

M^i_t \subseteq \{Y^i_{1:t-1}, U^i_{1:t-1}\}.   (3)

At t = 1, the local memory is empty, $M^i_1 = \emptyset$.

b) Shared memory: In addition to its local memory, each controller has access to a shared memory. The contents $C_t$ of the shared memory at time t are a subset of the past local observations and control actions of all controllers:

C_t \subseteq \{Y_{1:t-1}, U_{1:t-1}\}.   (4)

At t = 1, the shared memory is empty, $C_1 = \emptyset$. Recall that $Y_t$ and $U_t$ denote the vectors $(Y^1_t, \ldots, Y^n_t)$ and $(U^1_t, \ldots, U^n_t)$ respectively.

Controller i chooses action $U^i_t$ as a function of the total data $(Y^i_t, M^i_t, C_t)$ available to it. Specifically, for every controller i, i = 1, ..., n,

U^i_t = g^i_t(Y^i_t, M^i_t, C_t),   (5)

where $g^i_t$ is called the control law of controller i. The collection $g^i = (g^i_1, \ldots, g^i_T)$ is called the control strategy of controller i. The collection $g^{1:n} = (g^1, \ldots, g^n)$ is called the control strategy of the system.

3) Update of local and shared memories:

(i) Shared memory update: After taking the control action at time t, the local information at controller i consists of the contents $M^i_t$ of its local memory, its local observation $Y^i_t$ and its control action $U^i_t$. Controller i sends a subset $Z^i_t$ of this local information $\{M^i_t, Y^i_t, U^i_t\}$ to the shared memory. The subset $Z^i_t$ is chosen according to a pre-specified protocol. The contents of the shared memory are nested in time, that is, the contents $C_{t+1}$ of the shared memory at time t+1 are the contents $C_t$ at time t augmented with the new data $Z_t = (Z^1_t, Z^2_t, \ldots, Z^n_t)$ sent by all the controllers at time t:

C_{t+1} = \{C_t, Z_t\}.   (6)

(ii) Local memory update: After taking the control action and sending data to the shared memory at time t, controller i updates its local memory according to a pre-specified protocol. The contents $M^i_{t+1}$ of the local memory can at most equal the total local information $\{M^i_t, Y^i_t, U^i_t\}$ at the controller. However, to ensure that the local and shared memories at time t+1 don't overlap, we assume that

M^i_{t+1} \subseteq \{M^i_t, Y^i_t, U^i_t\} \setminus Z^i_t.   (7)

Figure 1 shows the time ordering of observations, actions and memory updates. We refer to the above model as the partial sharing information structure.

4) The optimization problem: At time t, the system incurs a cost $\ell(X_t, U_t)$. The performance of the control strategy of the system is measured by the expected total cost

J(g^{1:n}) := E^{g^{1:n}}\left[\sum_{t=1}^{T} \ell(X_t, U_t)\right],   (8)

where the expectation is with respect to the joint probability measure on $(X_{1:T}, U_{1:T})$ induced by the choice of $g^{1:n}$.

[Fig. 1: Time ordering of observations, actions and memory updates. At time t, controller i holds $(M^i_t, C_t)$, observes $Y^i_t$, takes action $U^i_t$, sends $Z^i_t$ to the shared memory (so that $C_{t+1} = \{C_t, Z_t\}$), and updates its local memory to $M^i_{t+1}$.]

We are interested in the following optimization problem.

Problem 1: For the model described above, given the distributions $Q_1$, $Q^i_W$, i = 0, 1, ..., n, and the horizon T, find a control strategy $g^{1:n}$ for the system that minimizes the expected total cost given by (8).

B. Special Cases: The Models

In the above model, although we have not specified the exact protocol by which controllers update the local and shared memories, we assume that a pre-specified protocol is being used. Different choices of this protocol result in different information structures for the system. In this section, we describe several models of decentralized control systems that can be viewed as special cases of our model by assuming a particular choice of protocol for local and shared memory updates.

1) Delayed Sharing Information Structure: Consider the following special case of the model of Section II-A.

(i) The shared memory at the beginning of time t is $C_t = \{Y_{1:t-s}, U_{1:t-s}\}$, where $s \geq 1$ is a fixed number. The local memory at the beginning of time t is $M^i_t = \{Y^i_{t-s+1:t-1}, U^i_{t-s+1:t-1}\}$.

(ii) At each time t, after taking the action $U^i_t$, controller i sends $Z^i_t = \{Y^i_{t-s+1}, U^i_{t-s+1}\}$ to the shared memory, and the shared memory at t+1 becomes $C_{t+1} = \{Y_{1:t-s+1}, U_{1:t-s+1}\}$.

(iii) After sending $Z^i_t = \{Y^i_{t-s+1}, U^i_{t-s+1}\}$ to the shared memory, controller i updates its local memory to $M^i_{t+1} = \{Y^i_{t-s+2:t}, U^i_{t-s+2:t}\}$.

In this special case, the observations and control actions of each controller are shared with every other controller after a delay of s time steps. Hence, the above special case corresponds to the delayed sharing information structure considered in [29], [31], [37]. (A code sketch of this protocol appears at the end of this subsection.)

2) Delayed State Sharing Information Structure: A special case of the delayed sharing information structure (which itself is a special case of our basic model) is the delayed state sharing information structure [30]. This information structure can be obtained from the delayed sharing information structure by making the following assumptions:

(i) The state of the system at time t is an n-dimensional vector $X_t = (X^1_t, X^2_t, \ldots, X^n_t)$.

(ii) At each time t, the current local observation of controller i is $Y^i_t = X^i_t$, for i = 1, 2, ..., n.

In this special case, the complete state vector $X_t$ is available to all controllers after a delay of s time steps.

3) Periodic Sharing Information Structure: Consider the following special case of the model of Section II-A where controllers update the shared memory periodically with period $s \geq 1$:

(i) For time $ks < t \leq (k+1)s$, where k = 0, 1, 2, ..., the shared memory at the beginning of time t is

C_t = \{Y_{1:ks}, U_{1:ks}\}.   (9)

The local memory at the beginning of time t is

M^i_t = \{Y^i_{ks+1:t-1}, U^i_{ks+1:t-1}\}.   (10)

(ii) At each time t, after taking the action $U^i_t$, controller i sends $Z^i_t$ to the shared memory, where

Z^i_t = \emptyset, if $ks < t < (k+1)s$; Z^i_t = \{Y^i_{ks+1:(k+1)s}, U^i_{ks+1:(k+1)s}\}, if $t = (k+1)s$.   (11)

(iii) After sending $Z^i_t$ to the shared memory, controller i updates its local memory to $M^i_{t+1} = \{M^i_t, Y^i_t, U^i_t\} \setminus Z^i_t$.

In this special case, the entire history of observations and control actions is shared periodically between controllers with period s. Hence, the above special case corresponds to the periodic sharing information structure considered in [32].

4) Control Sharing Information Structure: Consider the following special case of the model of Section II-A.

(i) The shared memory at the beginning of time t is $C_t = \{U_{1:t-1}\}$. The local memory at the beginning of time t is $M^i_t = \{Y^i_{1:t-1}\}$.

(ii) At each time t, after taking the action $U^i_t$, controller i sends $Z^i_t = \{U^i_t\}$ to the shared memory.

(iii) After sending $Z^i_t = U^i_t$ to the shared memory, controller i updates its local memory to $M^i_{t+1} = Y^i_{1:t}$.

In this special case, the control actions of each controller are shared with every other controller after a delay of 1 time step. Hence, the above special case corresponds to the control sharing information structure considered in [33]. A related special case is the situation where the local memory at each controller consists of only the s most recent observations, that is, $M^i_t = Y^i_{t-s:t-1}$.

5) No Shared Memory, with or without finite local memory: Consider the following special case of the model of Section II-A.

(i) The shared memory at each time is empty, $C_t = \emptyset$, and the local memory at the beginning of time t is $M^i_t = \{Y^i_{t-s:t-1}, U^i_{t-s:t-1}\}$, where $s \geq 1$ is a fixed number.

(ii) Controllers do not send any data to the shared memory, $Z^i_t = \emptyset$.

(iii) At the end of time t, controllers update their local memories to $M^i_{t+1} = \{Y^i_{t-s+1:t}, U^i_{t-s+1:t}\}$.

In this special case, the controllers don't share any data. The above model is related to the finite-memory controller model of [38]. A related special case is the situation where the local memory at each controller consists of all of its past local observations and its past actions, that is, $M^i_t = \{Y^i_{1:t-1}, U^i_{1:t-1}\}$.
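As a concrete illustration of these memory-update protocols, the following sketch simulates the delayed sharing protocol of Section II-B1 for one controller. The deque-based representation and the variable names are illustrative choices, not part of the formal model.

```python
from collections import deque

# A minimal sketch of the delayed sharing protocol of Section II-B1 for a
# single controller. The local memory M_t holds the last s-1 (Y, U) pairs;
# at each step the oldest pair is sent as Z_t and joins the shared memory.

def delayed_sharing_step(t, y_t, u_t, local_mem, shared_mem, s):
    """One memory update: returns (M_{t+1}, C_{t+1}) given (M_t, C_t)."""
    local_mem.append((y_t, u_t))   # total local information {M_t, Y_t, U_t}
    if t - s + 1 >= 1:             # Z_t = {Y_{t-s+1}, U_{t-s+1}}
        shared_mem.append(local_mem.popleft())   # C_{t+1} = {C_t, Z_t}
    return local_mem, shared_mem

# Usage: the pair (Y_t, U_t) enters the shared memory C_{t+s}, i.e., it
# becomes common knowledge after a delay of s time steps.
local_mem, shared_mem = deque(), []
for t, (y, u) in enumerate([("y1", "u1"), ("y2", "u2"), ("y3", "u3")], start=1):
    local_mem, shared_mem = delayed_sharing_step(t, y, u, local_mem, shared_mem, s=2)
```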

Remark 1: All the special cases considered above are examples of symmetric sharing. That is, different controllers update their local memories according to identical protocols, and the data sent by a controller to the shared memory is selected according to identical protocols. However, this symmetry is not required for our model. Consider, for example, the delayed sharing information structure where at the end of time t, controller i sends $\{Y^i_{t-s^i}, U^i_{t-s^i}\}$ to the shared memory, with $s^i$, i = 1, 2, ..., n, being fixed, but not necessarily identical, numbers. This kind of asymmetric sharing is also a special case of our model.

C. Results

For centralized systems, stochastic control theory provides two important analytical results. Firstly, it provides a structural result. This result states that there is an optimal control strategy which selects control actions as a function only of the controller's posterior belief on the state of the system conditioned on all its observations and actions till the current time. The controller's posterior belief is called its information state. Secondly, stochastic control theory provides a sequential decomposition of the problem of finding optimal control strategies in centralized systems. This sequential decomposition, also called the dynamic program, allows one to evaluate the optimal action for each realization of the controller's information state in a backward inductive manner. In this paper, we provide a structural result and a sequential decomposition for the decentralized stochastic control problem with partial information sharing formulated above.

1) Structural Result:

Theorem 1 (Structural Result for Optimal Control Strategies): In Problem 1, there exist optimal control strategies of the form

U^i_t = g^i_t(Y^i_t, M^i_t, \Pi_t), i = 1, 2, ..., n,   (12)

where

\Pi_t := P^{g^{1:n}_{1:t-1}}(X_t, Y_t, M_t \mid C_t).   (13)

We call $\Pi_t$ the common information state. We denote by $\mathcal{B}_t$ the space of possible realizations of $\Pi_t$. Thus,

\mathcal{B}_t := \Delta(\mathcal{X}_t \times \mathcal{Y}^1_t \times \mathcal{M}^1_t \times \cdots \times \mathcal{Y}^n_t \times \mathcal{M}^n_t).   (14)

2) Sequential Decomposition: Consider a control strategy $g^i$ for controller i of the form specified in Theorem 1. The control law $g^i_t$ at time t is a function from the space $\mathcal{Y}^i_t \times \mathcal{M}^i_t \times \mathcal{B}_t$ to the space of decisions $\mathcal{U}^i_t$. Equivalently, the control law $g^i_t$ can be represented as a collection of functions $\{g^i_t(\cdot, \cdot, \pi)\}_{\pi \in \mathcal{B}_t}$, where each element of this collection is a function from $\mathcal{Y}^i_t \times \mathcal{M}^i_t$ to $\mathcal{U}^i_t$. An element $g^i_t(\cdot, \cdot, \pi)$ of this collection specifies a control action for each possible realization of $Y^i_t, M^i_t$ and a fixed realization $\pi$ of $\Pi_t$. We call $g^i_t(\cdot, \cdot, \pi)$ the partial control law of controller i at time t for the given realization $\pi$ of the common information state $\Pi_t$.

We now describe a sequential decomposition of the problem of finding optimal control strategies. This sequential decomposition allows us to evaluate optimal partial control laws for each realization $\pi$ of the common information state in a backward inductive manner. Recall that $\mathcal{B}_t$ is the space of all possible realizations of $\Pi_t$ (see (14)) and $F(\mathcal{Y}^i_t \times \mathcal{M}^i_t, \mathcal{U}^i_t)$ is the set of all functions from $\mathcal{Y}^i_t \times \mathcal{M}^i_t$ to $\mathcal{U}^i_t$ (see Section I-B).

Theorem 2: Define the functions $V_t : \mathcal{B}_t \to \mathbb{R}$, for t = 1, ..., T, as follows:

V_T(\pi) = \inf_{\{\gamma^i_T \in F(\mathcal{Y}^i_T \times \mathcal{M}^i_T, \mathcal{U}^i_T), 1 \leq i \leq n\}} E\{\ell(X_T, \gamma^1_T(Y^1_T, M^1_T), \ldots, \gamma^n_T(Y^n_T, M^n_T)) \mid \Pi_T = \pi\},   (15)

and for 1 \leq t \leq T-1,

V_t(\pi) = \inf_{\{\gamma^i_t \in F(\mathcal{Y}^i_t \times \mathcal{M}^i_t, \mathcal{U}^i_t), 1 \leq i \leq n\}} E\{\ell(X_t, \gamma^1_t(Y^1_t, M^1_t), \ldots, \gamma^n_t(Y^n_t, M^n_t)) + V_{t+1}(\eta_t(\pi, \gamma^1_t, \ldots, \gamma^n_t, Z_t)) \mid \Pi_t = \pi\},   (16)

where $\eta_t$ is a $\mathcal{B}_{t+1}$-valued function defined later in Section III (see equation (44) and Appendix A).

For $1 \leq t \leq T$ and for each $\pi \in \mathcal{B}_t$, an optimal partial control law for controller i is the minimizing choice of $\gamma^i$ in the definition of $V_t(\pi)$. Thus, an optimal control strategy can be described in terms of the partial control laws as follows:

(g^{*,i}_T(\cdot, \cdot, \pi), 1 \leq i \leq n) = \arg\inf_{\{\gamma^i_T \in F(\mathcal{Y}^i_T \times \mathcal{M}^i_T, \mathcal{U}^i_T), 1 \leq i \leq n\}} E\{\ell(X_T, \gamma^1_T(Y^1_T, M^1_T), \ldots, \gamma^n_T(Y^n_T, M^n_T)) \mid \Pi_T = \pi\},   (17)

and for 1 \leq t \leq T-1,

(g^{*,i}_t(\cdot, \cdot, \pi), 1 \leq i \leq n) = \arg\inf_{\{\gamma^i_t \in F(\mathcal{Y}^i_t \times \mathcal{M}^i_t, \mathcal{U}^i_t), 1 \leq i \leq n\}} E\{\ell(X_t, \gamma^1_t(Y^1_t, M^1_t), \ldots, \gamma^n_t(Y^n_t, M^n_t)) + V_{t+1}(\eta_t(\pi, \gamma^1_t, \ldots, \gamma^n_t, Z_t)) \mid \Pi_t = \pi\}.   (18)
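To illustrate the backward induction of Theorem 2 computationally, here is a brute-force sketch over small finite sets. It assumes the reachable beliefs have been discretized into a finite set, and `expected_stage_cost` and `next_beliefs` are hypothetical stand-ins for the conditional expectation in (16) and the update $\eta_t$ of (44); none of these names come from the paper.

```python
from itertools import product

# A minimal sketch of the sequential decomposition of Theorem 2. A
# "prescription" gamma^i maps (y^i, m^i) to u^i; we enumerate all
# prescription profiles by brute force. `expected_stage_cost(pi, gammas)`
# stands in for the conditional expectation in (16); `next_beliefs(pi,
# gammas)` stands in for the distribution of eta_t(pi, gammas, Z_t).

def prescriptions(Y, M, U):
    """All functions from Y x M to U, each encoded as a dict."""
    domain = [(y, m) for y in Y for m in M]
    return [dict(zip(domain, vals)) for vals in product(U, repeat=len(domain))]

def solve(beliefs, T, n, Y, M, U, expected_stage_cost, next_beliefs):
    V = {pi: 0.0 for pi in beliefs}                 # V_{T+1} == 0
    policy = {}
    for t in reversed(range(1, T + 1)):             # backward induction
        V_new, policy[t] = {}, {}
        for pi in beliefs:
            best, best_gammas = float("inf"), None
            for gammas in product(prescriptions(Y, M, U), repeat=n):
                cost = expected_stage_cost(pi, gammas)
                cost += sum(p * V[nxt] for nxt, p in next_beliefs(pi, gammas))
                if cost < best:
                    best, best_gammas = cost, gammas
            V_new[pi], policy[t][pi] = best, best_gammas
        V = V_new
    return policy                # policy[t][pi] = optimal prescription profile
```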

D. Special Cases: The Results

In Section II-B, we described several models of decentralized control problems that are special cases of the model described in Section II-A. In this section, we state the results of Theorems 1 and 2 for these models.

Corollary 1: In the delayed sharing information structure of Section II-B1, there exist optimal control strategies of the form

U^i_t = g^i_t(Y^i_{t-s+1:t}, U^i_{t-s+1:t-1}, \Pi_t), i = 1, 2, ..., n,   (19)

where

\Pi_t := P^{g^{1:n}_{1:t-1}}(X_t, Y_{t-s+1:t}, U_{t-s+1:t-1} \mid C_t).   (20)

Moreover, optimal control strategies can be obtained by a dynamic program similar to that of Theorem 2. The above result is analogous to the result in [31].

Corollary 2: In the delayed state sharing information structure of Section II-B2, there exist optimal control strategies of the form

U^i_t = g^i_t(X^i_{t-s+1:t}, U^i_{t-s+1:t-1}, \Pi_t), i = 1, 2, ..., n,   (21)

where

\Pi_t := P^{g^{1:n}_{1:t-1}}(X_{t-s+1:t}, U_{t-s+1:t-1} \mid C_t).   (22)

Moreover, optimal control strategies can be obtained by a dynamic program similar to that of Theorem 2. The above result is analogous to the result in [31].

Corollary 3: In the periodic sharing information structure of Section II-B3, there exist optimal control strategies of the form

U^i_t = g^i_t(Y^i_{ks+1:t}, U^i_{ks+1:t-1}, \Pi_t), i = 1, 2, ..., n, ks < t \leq (k+1)s,   (23)

where

\Pi_t := P^{g^{1:n}_{1:t-1}}(X_t, Y_{ks+1:t}, U_{ks+1:t-1} \mid C_t), ks < t \leq (k+1)s.   (24)

Moreover, optimal control strategies can be obtained by a dynamic program similar to that of Theorem 2. The above result is analogous to the result in [32].

Corollary 4: In the control sharing information structure of Section II-B4, there exist optimal control strategies of the form

U^i_t = g^i_t(Y^i_{1:t}, \Pi_t), i = 1, 2, ..., n,   (25)

where

\Pi_t := P^{g^{1:n}_{1:t-1}}(X_t, Y_{1:t} \mid U_{1:t-1}).   (26)

(Here we have instantiated Theorem 1 with the memories of Section II-B4, namely $M^i_t = Y^i_{1:t-1}$ and $C_t = U_{1:t-1}$.) Moreover, optimal control strategies can be obtained by a dynamic program similar to that of Theorem 2.

The Case of No Shared Memory: In the information structure of Section II-B5, the shared memory is always empty. Thus the common information state defined in Theorem 1 is now the unconditional probability

\Pi_t = P^{g^{1:n}_{1:t-1}}(X_t, Y_t, M_t).

In particular, $\Pi_t$ is a constant random variable and takes a fixed value that depends only on the choice of past control laws. Therefore, with probability 1, $g^i_t(Y^i_t, M^i_t, \Pi_t) = \tilde{g}^i_t(Y^i_t, M^i_t)$, for an appropriately defined $\tilde{g}^i_t$. Thus, the result of Theorem 1 for this case,

U^i_t = g^i_t(Y^i_t, M^i_t, \Pi_t) = \tilde{g}^i_t(Y^i_t, M^i_t),

is a redundant result, since all control laws are of the above form. However, the optimal control laws can still be found using the dynamic program of Theorem 2. Our information state and dynamic program in this case are similar to the results in [38] for the case of one controller with finite memory and to those in [20] for the case of two controllers with finite memories.

III. PROOF OF THE RESULTS

The main idea of the proof is to formulate an equivalent centralized stochastic control problem, solve the equivalent problem using classical stochastic control techniques, and translate the results back to the basic model. To that end, we proceed as follows:

1) Formulate a centralized coordinated system from the point of view of a coordinator that observes only the common information among the controllers in the basic model, i.e., the coordinator observes the shared memory $C_t$ but not the local memories $(M^i_t, i = 1, \ldots, n)$ or local observations $(Y^i_t, i = 1, \ldots, n)$.

2) Show that the coordinated system is a POMDP (partially observable Markov decision process).

3) For the coordinated system, determine the structure of an optimal coordination strategy and a dynamic program to find an optimal coordination strategy.

4) Show that any strategy of the coordinated system is implementable in the basic model with the same value of the total expected cost. Conversely, any strategy of the basic model is implementable in the coordinated system with the same value of the total expected cost. Hence, the two systems are equivalent.

5) Translate the structural results and dynamic programming decomposition of the coordinated system (obtained in stage 3) to the basic model.

Stage 1: The coordinated system

Consider a coordinated system that consists of a coordinator and n passive controllers. The coordinator knows the shared memory $C_t$ at time t, but not the local memories $(M^i_t, i = 1, \ldots, n)$ or local observations $(Y^i_t, i = 1, \ldots, n)$. At each time t, the coordinator chooses mappings $\Gamma^i_t : \mathcal{Y}^i_t \times \mathcal{M}^i_t \to \mathcal{U}^i_t$, i = 1, 2, ..., n, according to

\Gamma_t = d_t(C_t, \Gamma_{1:t-1}),   (27)

where $\Gamma_t = (\Gamma^1_t, \Gamma^2_t, \ldots, \Gamma^n_t)$. The function $d_t$ is called the coordination rule at time t, and the collection of functions $d := (d_1, \ldots, d_T)$ is called the coordination strategy. The selected $\Gamma^i_t$ is communicated to controller i at time t. The function $\Gamma^i_t$ tells controller i how to process its current local observation and its local memory at time t; for that reason, we call $\Gamma^i_t$ the coordinator's prescription to controller i.

Controller i generates an action using its prescription as follows:

U^i_t = \Gamma^i_t(Y^i_t, M^i_t).   (28)

For this coordinated system, the system dynamics, the observation model and the cost are the same as in the basic model of Section II-A: the system dynamics are given by (1), each controller's current observation is given by (2), and the instantaneous cost at time t is $\ell(X_t, U_t)$. As before, the performance of a coordination strategy is measured by the expected total cost

\hat{J}(d) = E\left[\sum_{t=1}^{T} \ell(X_t, U_t)\right],   (29)

where the expectation is with respect to the joint measure on $(X_{1:T}, U_{1:T})$ induced by the choice of d. In this coordinated system, we are interested in the following optimization problem:

Problem 2: For the model of the coordinated system described above, find a coordination strategy d that minimizes the total expected cost given by (29).

Stage 2: The coordinated system as a POMDP

We will now show that the coordinated system is a partially observed Markov decision process. To that end, we first describe the model of POMDPs [39].

POMDP Model: A partially observable Markov decision process consists of a state process $S_t \in \mathcal{S}$, an observation process $O_t \in \mathcal{O}$, an action process $A_t \in \mathcal{A}$, t = 1, 2, ..., T, and a single decision-maker, where:

1) The action at time t is chosen by the decision-maker as a function of the observation and action history, that is,

A_t = d_t(O_{1:t}, A_{1:t-1}),   (30)

where $d_t$ is the decision rule at time t.

2) After the action at time t is taken, the new state and new observation are generated according to the transition probability rule

P(S_{t+1}, O_{t+1} \mid S_{1:t}, O_{1:t}, A_{1:t}) = P(S_{t+1}, O_{t+1} \mid S_t, A_t).   (31)

3) At each time, an instantaneous cost $\tilde{\ell}(S_t, A_t)$ is incurred.

4) The optimization problem for the decision-maker is to choose a decision strategy $d := (d_1, \ldots, d_T)$ to minimize the total cost

E\left[\sum_{t=1}^{T} \tilde{\ell}(S_t, A_t)\right].   (32)

The following well-known result provides the structure of optimal strategies and a dynamic program for POMDPs. For details, see [39].

Theorem 3 (POMDP Result): Let $\Theta_t$ be the conditional probability distribution of the state $S_t$ at time t given the observations $O_{1:t}$ and actions $A_{1:t-1}$,

\Theta_t(s) = P(S_t = s \mid O_{1:t}, A_{1:t-1}), s \in \mathcal{S}.

Then:

1) $\Theta_{t+1} = \eta_t(\Theta_t, A_t, O_{t+1})$, where $\eta_t$ is the standard non-linear filter: if $\theta_t, a_t, o_{t+1}$ are the realizations of $\Theta_t$, $A_t$ and $O_{t+1}$, then the realization of the s-th element of the vector $\Theta_{t+1}$ is

\theta_{t+1}(s) = \frac{\sum_{s'} \theta_t(s')\, P(S_{t+1} = s, O_{t+1} = o_{t+1} \mid S_t = s', A_t = a_t)}{\sum_{\hat{s}, \tilde{s}} \theta_t(\hat{s})\, P(S_{t+1} = \tilde{s}, O_{t+1} = o_{t+1} \mid S_t = \hat{s}, A_t = a_t)} =: \eta^s_t(\theta_t, a_t, o_{t+1}),   (33)

and $\eta_t(\theta_t, a_t, o_{t+1})$ is the vector $(\eta^s_t(\theta_t, a_t, o_{t+1}))_{s \in \mathcal{S}}$.

2) There exists an optimal decision strategy of the form $A_t = d_t(\Theta_t)$. Further, such a strategy can be found by the following dynamic program:

V_T(\theta) = \inf_a E\{\tilde{\ell}(S_T, a) \mid \Theta_T = \theta\},   (34)

and for 1 \leq t \leq T-1,

V_t(\theta) = \inf_a E\{\tilde{\ell}(S_t, a) + V_{t+1}(\eta_t(\theta, a, O_{t+1})) \mid \Theta_t = \theta, A_t = a\}.   (35)

We will now show that the coordinated system can be viewed as an instance of the above POMDP model with the state process $S_t := \{X_t, Y_t, M_t\}$, the observation process $O_t := Z_{t-1}$,

and the action process $A_t := \Gamma_t$.

Lemma 1: For the coordinated system of Problem 2:

1) There exist functions $\hat{f}_t$ and $\hat{h}_t$, t = 1, ..., T, such that

S_{t+1} = \hat{f}_t(S_t, \Gamma_t, W^0_t, W_{t+1}),   (36)

and

Z_t = \hat{h}_t(S_t, \Gamma_t).   (37)

In particular, we have that

P(S_{t+1}, Z_t \mid S_{1:t}, Z_{1:t-1}, \Gamma_{1:t}) = P(S_{t+1}, Z_t \mid S_t, \Gamma_t).   (38)

2) Furthermore, there exists a function $\tilde{\ell}$ such that

\ell(X_t, U_t) = \tilde{\ell}(S_t, \Gamma_t).   (39)

Thus, the objective of minimizing (29) is the same as minimizing

\hat{J}(d) = E\left[\sum_{t=1}^{T} \tilde{\ell}(S_t, \Gamma_t)\right].   (40)

Proof: The existence of $\hat{f}_t$ follows from (1), (2), (28), (7) and the definition of $S_t$. The existence of $\hat{h}_t$ follows from the fact that $Z^i_t$ is a fixed subset of $\{M^i_t, Y^i_t, U^i_t\}$, equation (28) and the definition of $S_t$. Equation (38) follows from (36) and the independence of $W^0_t, W_{t+1}$ from all random variables in the conditioning on the left hand side of (38). The existence of $\tilde{\ell}$ follows from the definition of $S_t$ and (28).

Recall that the coordinator is choosing its actions according to a coordination strategy of the form

\Gamma_t = d_t(C_t, \Gamma_{1:t-1}) = d_t(Z_{1:t-1}, \Gamma_{1:t-1}).   (41)

Equation (41) and Lemma 1 imply that the coordinated system is an instance of the POMDP model described above.
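As a concrete rendering of the non-linear filter (33) that underlies the coordinator's belief update, the following sketch updates a finite-state belief. Representing the joint kernel $P(S_{t+1}, O_{t+1} \mid S_t, A_t)$ as a NumPy array is an illustrative choice, not the paper's notation.

```python
import numpy as np

# A minimal sketch of the non-linear filter eta_t of (33) for finite sets.
# kernel[s_prev, a, s_next, o] = P(S_{t+1}=s_next, O_{t+1}=o | S_t=s_prev, A_t=a).
# The array representation is an illustrative choice, not the paper's notation.

def belief_update(theta, a, o, kernel):
    """Return theta_{t+1} = eta_t(theta_t, a_t, o_{t+1}) as in equation (33)."""
    unnormalized = theta @ kernel[:, a, :, o]   # numerator of (33) for every s
    total = unnormalized.sum()                  # denominator of (33)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / total

# Tiny usage example: 2 states, 1 action, 2 observations.
kernel = np.zeros((2, 1, 2, 2))
kernel[0, 0] = [[0.45, 0.05], [0.40, 0.10]]   # rows: next state, cols: observation
kernel[1, 0] = [[0.10, 0.40], [0.05, 0.45]]
theta = np.array([0.5, 0.5])
theta_next = belief_update(theta, a=0, o=1, kernel=kernel)
```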

Stage 3: Structural result and dynamic program for the coordinated system

Since the coordinated system is a POMDP, Theorem 3 gives the structure of optimal coordination strategies. To that end, define the coordinator's information state

\Pi_t := P(S_t \mid Z_{1:t-1}, \Gamma_{1:t-1}) = P(S_t \mid C_t, \Gamma_{1:t-1}).   (42)

Then, we have the following:

Proposition 1: For Problem 2, there is no loss of optimality in restricting attention to coordination rules of the form

\Gamma_t = d_t(\Pi_t).   (43)

Furthermore, an optimal coordination strategy of the above form can be found using a dynamic program. To that end, observe that we can write

\Pi_{t+1} = \eta_t(\Pi_t, \Gamma_t, Z_t),   (44)

where $\eta_t$ is the standard non-linear filtering update function (see Appendix A). Recall that $\mathcal{B}_t$ is the space of all possible realizations of $\Pi_t$ (see (14)) and $F(\mathcal{Y}^i_t \times \mathcal{M}^i_t, \mathcal{U}^i_t)$ is the set of all functions from $\mathcal{Y}^i_t \times \mathcal{M}^i_t$ to $\mathcal{U}^i_t$ (see Section I-B). Then, we have the following result.

Proposition 2: For all $\pi$ in $\mathcal{B}_t$, define

V_T(\pi) = \inf_{\{\gamma^i_T \in F(\mathcal{Y}^i_T \times \mathcal{M}^i_T, \mathcal{U}^i_T), 1 \leq i \leq n\}} E[\tilde{\ell}(S_T, \Gamma_T) \mid \Pi_T = \pi, \Gamma_T = (\gamma^1_T, \ldots, \gamma^n_T)],   (45)

and for 1 \leq t \leq T-1,

V_t(\pi) = \inf_{\{\gamma^i_t \in F(\mathcal{Y}^i_t \times \mathcal{M}^i_t, \mathcal{U}^i_t), 1 \leq i \leq n\}} E[\tilde{\ell}(S_t, \Gamma_t) + V_{t+1}(\eta_t(\Pi_t, \Gamma_t, Z_t)) \mid \Pi_t = \pi, \Gamma_t = (\gamma^1_t, \ldots, \gamma^n_t)].   (46)

Then the arg inf at each time step gives the coordinator's optimal prescriptions for the controllers when the coordinator's information state is $\pi$.

Proposition 2 gives a dynamic program for the coordinator's problem (Problem 2). Since the coordinated system is a POMDP, computational algorithms for POMDPs can be used to solve the dynamic program for the coordinator's problem as well. We refer the reader to [40] and references therein for a review of algorithms to solve POMDPs.

Stage 4: Equivalence between the two models

We first observe that, since $C_s \subseteq C_t$ for all $s < t$, any coordination strategy of the form

\Gamma_t = d_t(C_t, \Gamma_{1:t-1}), t = 1, 2, ..., T,   (47)

can be transformed to a strategy of the form

\Gamma_t = \hat{d}_t(C_t), t = 1, 2, ..., T,   (48)

by recursive substitution. For example,

\Gamma_2 = d_2(C_2, \Gamma_1) = d_2(C_2, d_1(C_1)) =: \hat{d}_2(C_2).

In the following proposition, we only consider coordination strategies of the form in (48).

Proposition 3: The basic model of Section II-A and the coordinated system are equivalent. More precisely:

(a) Given any control strategy $g^{1:n}$ for the basic model, choose a coordination strategy d for the coordinated system of Stage 1 as

d_t(C_t) = \left(g^1_t(\cdot, \cdot, C_t), \ldots, g^n_t(\cdot, \cdot, C_t)\right).

Then $J(g^{1:n}) = \hat{J}(d)$.

(b) Conversely, for any coordination strategy d for the coordinated system, choose a control strategy $g^{1:n}$ for the basic model as

g^i_t(\cdot, \cdot, C_t) = d^i_t(C_t),

where $d^i_t(C_t)$ is the i-th component of $d_t(C_t)$ (that is, $d^i_t(C_t)$ gives the coordinator's prescription for the i-th controller). Then $\hat{J}(d) = J(g^{1:n})$.

Proof: See Appendix B.

Stage 5: Structural result and dynamic program for the basic model

Proposition 1 states that there is no loss of optimality in using $\Pi_t$ instead of $C_t$ to decide the coordinator's prescriptions in the coordinated system. Combining this with Proposition 3 implies that there is no loss of optimality in using $\Pi_t$ instead of $C_t$ in the control strategies of the basic model. This establishes the structural result of Theorem 1. Combining Propositions 2 and 3, we get the sequential decomposition of Theorem 2.
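The two directions of Proposition 3 are essentially a re-indexing of the same functions. A minimal sketch, using Python closures as illustrative stand-ins for control laws and coordination rules:

```python
# A minimal sketch of the strategy translations in Proposition 3. A control
# law g_t^i maps (y, m, c) to an action; a coordination rule d_t maps c to a
# tuple of prescriptions, each mapping (y, m) to an action.

def d_from_g(g_t):
    """Direction (a): d_t(c) = (g_t^1(., ., c), ..., g_t^n(., ., c))."""
    def d_t(c):
        return tuple((lambda y, m, g=g: g(y, m, c)) for g in g_t)
    return d_t

def g_from_d(d_t, i):
    """Direction (b): g_t^i(y, m, c) = (i-th prescription of d_t(c))(y, m)."""
    def g_t_i(y, m, c):
        return d_t(c)[i](y, m)
    return g_t_i
```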

IV. A SIMPLE GENERALIZATION

The methodology described in Section III relies on the fact that the shared memory is common information among all controllers. Since the coordinator in the coordinated system knows only the common information, any coordination strategy can be mapped to an equivalent control strategy in the basic model (see Stage 4 of Section III). In some cases, in addition to the shared memory, the current observation (or, if the current observation is a vector, some components of it) may also be commonly available to all controllers. The general methodology of Section III can be easily modified to include such cases as well.

Consider the model of Section II-A with the following modifications:

1) In addition to their current local observations, all controllers have a common observation at time t,

Y^{com}_t = h^{com}_t(X_t, V_t),   (49)

where $\{V_t, t = 1, \ldots, T\}$ is a sequence of i.i.d. random variables with probability distribution $Q_V$ which is independent of all other primitive random variables.

2) The shared memory $C_t$ at time t is a subset of $\{Y^{com}_{1:t-1}, Y_{1:t-1}, U_{1:t-1}\}$.

3) Each controller selects its action using a control law of the form

U^i_t = g^i_t(Y^i_t, M^i_t, C_t, Y^{com}_t).   (50)

4) After taking the control action at time t, controller i sends to the shared memory a subset $Z^i_t$ of $\{M^i_t, Y^i_t, U^i_t, Y^{com}_t\}$ that necessarily includes $Y^{com}_t$. That is, $Y^{com}_t \in Z^i_t \subseteq \{M^i_t, Y^i_t, U^i_t, Y^{com}_t\}$. This implies that the history of common observations is necessarily a part of the shared memory, that is, $Y^{com}_{1:t-1} \subseteq C_t$.

The rest of the model is the same as in Section II-A. In particular, the local memory update satisfies (7), so the local memory and shared memory at time t+1 don't overlap. The instantaneous cost is given by $\ell(X_t, U_t)$ and the objective is to minimize the expected total cost given by (8).

The arguments of Section III are also valid for this model. The observation process in Lemma 1 is now defined as $O_{t+1} := \{Z_t, Y^{com}_{t+1}\}$. The analysis of Section III leads to structural results and sequential decompositions analogous to Theorems 1 and 2, with $\Pi_t$ now defined as

\Pi_t := P^{g^{1:n}_{1:t-1}}(X_t, Y_t, M_t \mid C_t, Y^{com}_t).   (51)

A. Examples of the Generalized Model

1) Controllers with Identical Information: Consider the following special case of the above generalized model.

1) All controllers only make the common observation $Y^{com}_t$; controllers have no local observations or local memories.

2) The shared memory at time t is $C_t = Y^{com}_{1:t-1}$. Thus, at time t, all controllers have identical information given by $\{C_t, Y^{com}_t\} = Y^{com}_{1:t}$.

3) After taking the action at time t, each controller sends $Z^i_t = Y^{com}_t$ to the shared memory.

Recall that the coordinator's prescriptions $\Gamma^i_t$ in Section III are chosen from the set of functions from $\mathcal{Y}^i_t \times \mathcal{M}^i_t$ to $\mathcal{U}^i_t$. Since, in this case, $\mathcal{Y}^i_t = \mathcal{M}^i_t = \emptyset$, we interpret the coordinator's prescriptions as prescribed actions, that is, $\Gamma^i_t \in \mathcal{U}^i_t$. With this interpretation, the common information state becomes

\Pi_t := P^{g^{1:n}_{1:t-1}}(X_t \mid Y^{com}_{1:t}),   (52)

and the dynamic program of Theorem 2 becomes

V_T(\pi) = \inf_{\{u^i_T \in \mathcal{U}^i_T, 1 \leq i \leq n\}} E\{\ell(X_T, u^1_T, \ldots, u^n_T) \mid \Pi_T = \pi\},   (53)

and for 1 \leq t \leq T-1,

V_t(\pi) = \inf_{\{u^i_t \in \mathcal{U}^i_t, 1 \leq i \leq n\}} E\{\ell(X_t, u^1_t, \ldots, u^n_t) + V_{t+1}(\eta_t(\pi, u^1_t, \ldots, u^n_t, Y^{com}_t)) \mid \Pi_t = \pi\}.   (54)

Since all the controllers have identical information, the above results correspond to the centralized dynamic program of Theorem 3 with a single controller choosing all the actions. Moreover, it can be shown that the information state $\Pi_t$ defined in equation (52), which does not use past actions in the conditioning, and the information state $\Theta_t = P(X_t \mid Y^{com}_{1:t}, U_{1:t-1})$ (as would be suggested by a direct application of Theorem 3) are identical [41].

2) Coupled subsystems with control sharing information structure: Consider the following special case of the above generalized model.

1) The state of the system at time t is an (n+1)-dimensional vector $X_t = (X^1_t, X^2_t, \ldots, X^n_t, X^0_t)$, where $X^i_t$, i = 1, ..., n, corresponds to the local state of subsystem i, and $X^0_t$ is a global state of the system.

2) The state update function is such that the global state evolves according to

X^0_{t+1} = f^0_t(X^0_t, U_t, N^0_t),

while the local state of subsystem i evolves according to

X^i_{t+1} = f^i_t(X^i_t, X^0_t, U_t, N^i_t),

where $\{N^0_t, t = 1, \ldots, T\}, \ldots, \{N^n_t, t = 1, \ldots, T\}$ are mutually independent i.i.d. noise processes that are independent of the primitive random variables.

3) At time t, the common observation of all controllers is given by $Y^{com}_t = X^0_t$.

4) At time t, the local observation of controller i is given by $Y^i_t = X^i_t$, i = 1, ..., n.

5) The shared memory at time t is $C_t = \{X^0_{1:t-1}, U_{1:t-1}\}$. At each time t, after taking the action $U^i_t$, controller i sends $Z^i_t = \{X^0_t, U^i_t\}$ to the shared memory.

6) No controller has any local memory, i.e., $M_t = \emptyset$.

The above special case corresponds to the model of coupled subsystems with control sharing considered in [34], where several applications of this model are also presented. It is shown in [34] that assumption 6 above (the absence of local memory) does not entail any loss of optimality. The results of Theorems 1 and 2 apply for this model with $\Pi_t$ defined as

\Pi_t := P^{g^{1:n}_{1:t-1}}(X^0_t, X^1_t, \ldots, X^n_t \mid X^0_{1:t}, U_{1:t-1}).

Note that $\Pi_t$ can be evaluated from $X^0_t$ and $P^{g^{1:n}_{1:t-1}}(X^1_t, \ldots, X^n_t \mid X^0_{1:t}, U_{1:t-1})$. It is shown in [34] that $X^1_t, X^2_t, \ldots, X^n_t$ are conditionally independent given $(X^0_{1:t}, U_{1:t-1})$; hence the joint distribution $P^{g^{1:n}_{1:t-1}}(X^1_t, \ldots, X^n_t \mid X^0_{1:t}, U_{1:t-1})$ is a product of its marginal distributions.

3) Broadcast information structure: Consider the following special case of the above generalized model.

1) The state of the system at time t is an n-dimensional vector $X_t = (X^1_t, X^2_t, \ldots, X^n_t)$, where $X^i_t$, i = 1, ..., n, corresponds to the local state of subsystem i. The first component, i = 1, is special and called the central node. The other components, i = 2, ..., n, are called peripheral nodes.

2) The state update function is such that the state of the central node evolves according to

X^1_{t+1} = f^1_t(X^1_t, U^1_t, N^1_t),

while the states of the peripheral nodes evolve according to

X^i_{t+1} = f^i_t(X^i_t, X^1_t, U^i_t, U^1_t, N^i_t),

where $\{N^i_t, i = 1, 2, \ldots, n; t = 1, \ldots\}$ are noise processes that are independent across time and independent of each other.

3) At time t, the common observation of all controllers is given by $Y^{com}_t = X^1_t$.

4) At time t, the local observation of controller i, $i \geq 2$, is given by $Y^i_t = X^i_t$. Controller 1 does not have any local observations.

5) No controller sends any additional data to the shared memory. Thus, the shared memory consists of just the history of common observations, i.e., $C_t = Y^{com}_{1:t-1} = X^1_{1:t-1}$.

6) No controller has any local memory, i.e., $M_t = \emptyset$.

The above special case corresponds to the model of decentralized systems with broadcast structure considered in [17]. It is shown in [17] that assumption 6 above does not entail any loss of optimality. The results of Theorems 1 and 2 apply for this model with $\Pi_t$ defined as

\Pi_t := P^{g^{1:n}_{1:t-1}}(X^1_t, \ldots, X^n_t \mid X^1_{1:t}).

Note that $\Pi_t$ can be evaluated from $X^1_t$ and $P^{g^{1:n}_{1:t-1}}(X^2_t, \ldots, X^n_t \mid X^1_{1:t})$.

V. EXTENSION TO INFINITE HORIZON

In this section, we consider the basic model of Section II-A with an infinite time horizon. Assume that the state of the system, the observations and the control actions take values in time-invariant sets and that the dynamics of the system (equation (1)) and the observation model (equation (2)) are time-homogeneous. That is, the functions $f_t$ and $h_t$ in equations (1) and (2) do not vary with time. Also, the local memories $M^i_t$ and the updates to the shared memory $Z^i_t$ take values in time-invariant sets $\mathcal{M}^i$ and $\mathcal{Z}^i$ respectively. Let the cost of using a strategy $g^{1:n}$ be defined as

J(g^{1:n}) := E^{g^{1:n}}\left[\sum_{t=1}^{\infty} \beta^{t-1} \ell(X_t, U_t)\right],   (55)

where $\beta \in [0, 1)$ is a discount factor.

We can follow the arguments of Section III to formulate the problem of the coordinated system with an infinite time horizon. As in Section III, the coordinated system is equivalent to a POMDP. The time-homogeneous nature of the coordinated system and its equivalence to a POMDP allow us to use known POMDP results (see [41], Chapter 8, Section 3) to conclude the following theorem for the infinite time horizon problem.

Theorem 4: Consider Problem 1 with an infinite time horizon and the objective of minimizing the expected cost given by equation (55). Then, there exists an optimal time-invariant control strategy of the form

U^i_t = g^i(Y^i_t, M^i_t, \Pi_t), i = 1, 2, ..., n.   (56)

Furthermore, consider the fixed point equation

V(\pi) = \inf_{\{\gamma^i \in F(\mathcal{Y}^i \times \mathcal{M}^i, \mathcal{U}^i), 1 \leq i \leq n\}} E\{\ell(X_t, \gamma^1(Y^1_t, M^1_t), \ldots, \gamma^n(Y^n_t, M^n_t)) + \beta V(\eta(\pi, \gamma^1, \ldots, \gamma^n, Z_t)) \mid \Pi_t = \pi\}.   (57)

Then, for any realization $\pi$ of $\Pi_t$, the optimal partial control laws are the choices of $\gamma^i$ that achieve the infimum on the right hand side of (57).

The Case of No Shared Memory: As discussed in Section II-D, if the shared memory is always empty, then the common information state defined in Theorem 1 is the unconditional probability $\Pi_t = P^{g^{1:n}_{1:t-1}}(X_t, Y_t, M_t)$. In particular, $\Pi_t$ is a constant random variable and takes a fixed value that depends only on the choice of past control laws. Therefore, for any function $g^i_t$ of $(Y^i_t, M^i_t, \Pi_t)$, there exists a function $\tilde{g}^i_t$ of $(Y^i_t, M^i_t)$ such that $\tilde{g}^i_t(Y^i_t, M^i_t) = g^i_t(Y^i_t, M^i_t, \Pi_t)$ with probability 1. While Theorem 4 establishes the optimality of a time-invariant $g^i_t$, such time-invariance may not hold for the corresponding $\tilde{g}^i_t$. Similar observations were reported in [42].
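Numerically, the fixed point equation (57) can be approached by value iteration, since its right-hand side defines a $\beta$-contraction. The sketch below reuses the hypothetical discretized-belief interface of the finite-horizon sketch in Section II; those names are illustrative assumptions, not constructs from the paper.

```python
# A minimal value-iteration sketch for the fixed point equation (57) over a
# finite (discretized) set of beliefs. `profiles(pi)` yields candidate
# prescription profiles (gamma^1, ..., gamma^n); `stage_cost` and
# `next_beliefs` are hypothetical stand-ins for the conditional expectation
# and the update eta in (57).

def value_iteration(beliefs, profiles, stage_cost, next_beliefs, beta, tol=1e-8):
    V = {pi: 0.0 for pi in beliefs}
    while True:
        V_new = {
            pi: min(
                stage_cost(pi, gammas)
                + beta * sum(p * V[nxt] for nxt, p in next_beliefs(pi, gammas))
                for gammas in profiles(pi)
            )
            for pi in beliefs
        }
        # Sup-norm contraction: successive iterates converge geometrically.
        if max(abs(V_new[pi] - V[pi]) for pi in beliefs) < tol:
            return V_new
        V = V_new
```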

VI. CONCLUSION

We studied the decentralized stochastic control problem with multiple controllers that share part of their information with each other. Our model subsumes several models of decentralized control with various kinds of information sharing among controllers. We established the structure of optimal control strategies and provided a dynamic program for finding optimal control strategies. Our results rely crucially on identifying common information among controllers and formulating the decentralized problem as a centralized problem from the perspective of a coordinator who knows the common information. Identifying the coordinator's problem as a POMDP allows us to obtain our main results. Further, the relation with POMDPs also implies that one can use POMDP algorithms and approximations to solve the coordinator's dynamic program.

By explicitly including a shared memory in the system, our model ensured that there is some common information among the controllers. More generally, we can define common information for any sequential decision-making problem and then address the problem from the perspective of a coordinator who knows the common information. Such a common information based approach for general sequential decision-making problems is presented in [36].

VII. ACKNOWLEDGMENTS

This work was supported by the Natural Sciences and Engineering Research Council of Canada through the grant NSERC-RGPIN #402753-11 and by NSF through the grant CCF-1111061.

APPENDIX A
THE UPDATE FUNCTION $\eta_t$ OF THE COORDINATOR'S INFORMATION STATE

Consider a realization $c_{t+1}$ of the shared memory $C_{t+1}$ at time t+1. Let $\gamma_{1:t}$ be the corresponding realization of the coordinator's prescriptions until time t. We assume the realization $(c_{t+1}, \pi_{1:t}, \gamma_{1:t})$ to be of non-zero probability. Then, the realization $\pi_{t+1}$ of $\Pi_{t+1}$ is given by

\pi_{t+1}(s) = P\{S_{t+1} = s \mid c_{t+1}, \gamma_{1:t}\}.   (58)

Use Lemma 1 to simplify the above expression as

\sum_{s_t, w^0_t, w_{t+1}} \mathbb{1}_s\big(\hat{f}_t(s_t, \gamma_t, w^0_t, w_{t+1})\big)\, P\{W^0_t = w^0_t\}\, P\{W_{t+1} = w_{t+1}\}\, P\{S_t = s_t \mid c_{t+1}, \gamma_{1:t}\}.   (59)

Since $c_{t+1} = (c_t, z_t)$, write the last term of (59) as

P\{S_t = s_t \mid c_t, z_t, \gamma_{1:t}\} = \frac{P\{S_t = s_t, Z_t = z_t \mid c_t, \gamma_{1:t}\}}{\sum_{s'} P\{S_t = s', Z_t = z_t \mid c_t, \gamma_{1:t}\}}.   (60)

Use Lemma 1 and the sequential order in which the system variables are generated to write the numerator as

P\{S_t = s_t, Z_t = z_t \mid c_t, \gamma_{1:t}\} = \mathbb{1}_{\hat{h}_t(s_t, \gamma_t)}(z_t)\, P\{S_t = s_t \mid c_t, \gamma_{1:t-1}\}   (61)
= \mathbb{1}_{\hat{h}_t(s_t, \gamma_t)}(z_t)\, \pi_t(s_t),   (62)

where we dropped $\gamma_t$ from the conditioning in (61) since, under the given coordination strategy, it is a function of the rest of the terms in the conditioning. Substitute (62), (60), and (59) into (58) to get $\pi_{t+1}(s) = \eta^s_t(\pi_t, \gamma_t, z_t)$, where $\eta^s_t(\cdot)$ is given by (58)-(62), and $\eta_t(\cdot)$ is the vector $(\eta^s_t(\cdot))_{s \in \mathcal{S}}$.

APPENDIX B
PROOF OF PROPOSITION 3

(a) For any given control strategy $g^{1:n}$ in the basic model, define a coordination strategy d for the coordinated system as

d_t(C_t) = \left(g^1_t(\cdot, \cdot, C_t), \ldots, g^n_t(\cdot, \cdot, C_t)\right).   (63)

Consider Problems 1 and 2. Use the control strategy $g^{1:n}$ in Problem 1 and the coordination strategy d given by (63) in Problem 2. Fix a specific realization of the primitive random variables $\{X_1, W^j_t, t = 1, \ldots, T, j = 0, 1, \ldots, n\}$ in the two problems. Equation (2) implies that the realization of $Y_1$ will be the same in the two problems. Then, the choice of d according to (63) implies that the realization of the control actions $U_1$ will be the same in the two problems. This implies that the realizations of the next state $X_2$ and the memories $M_2, C_2$ will be the same in the two problems. Proceeding in a similar manner, it is clear that the choice of d according to (63) implies that the realizations of the states $\{X_t; t = 1, \ldots, T\}$, the observations $\{Y_t; t = 1, \ldots, T\}$, the control actions $\{U_t; t = 1, \ldots, T\}$ and the memories $\{M_t; t = 1, \ldots, T\}$ and $\{C_t; t = 1, \ldots, T\}$ are all identical in Problems 1 and 2. Thus, the total expected cost under $g^{1:n}$ in Problem 1 is the same as the total expected cost under the coordination strategy given by (63) in Problem 2. That is, $J(g^{1:n}) = \hat{J}(d)$.

(b) The second part of Proposition 3 follows from similar arguments as above.

REFERENCES

[1] H. S. Witsenhausen, "On the structure of real-time source coders," Bell System Technical Journal, vol. 58, no. 6, pp. 1437-1451, July-August 1979.
[2] J. C. Walrand and P. Varaiya, "Optimal causal coding-decoding problems," IEEE Trans. Inf. Theory, vol. 29, no. 6, pp. 814-820, Nov. 1983.
[3] D. Teneketzis, "On the structure of optimal real-time encoders and decoders in noisy communication," IEEE Trans. Inf. Theory, pp. 4017-4035, Sep. 2006.
[4] A. Nayyar and D. Teneketzis, "On the structure of real-time encoders and decoders in a multi-terminal communication system," IEEE Trans. Inf. Theory, vol. 57, no. 9, pp. 6196-6214, Sep. 2011.
[5] Y. Kaspi and N. Merhav, "Structure theorem for real-time variable-rate lossy source encoders and memory-limited decoders with side information," in International Symposium on Information Theory, 2010.
[6] R. R. Tenney and N. R. Sandell Jr., "Detection with distributed sensors," IEEE Trans. Aerospace Electron. Systems, vol. AES-17, no. 4, pp. 501-510, July 1981.