Markov LIMID Processes for Representing and Solving Renewal Problems


Final preprint (accepted version) of an article published in Annals of Operations Research. Please cite as: Jørgensen, E., A.R. Kristensen & D. Nilsson. Markov LIMID Processes for Representing and Solving Renewal Problems. Annals of Operations Research. DOI: 0.00/s0-0-0-

Markov LIMID Processes for Representing and Solving Renewal Problems

Erik Jørgensen · Anders R. Kristensen · Dennis Nilsson

Received: date / Accepted: date

Abstract In this paper a new tool for simultaneous optimisation of decisions on multiple time scales is presented. The tool combines the dynamic properties of Markov decision processes with the flexible and compact state space representation of Limited Memory Influence Diagrams (LIMIDs). A temporal version of LIMIDs, TEMLIMIDs, is defined by adding time-related functions to utility nodes. As a result, expected discounted utility, as well as expected relative utility, might be used as optimisation criteria in TEMLIMIDs. Optimisation proceeds as in ordinary LIMIDs. A sequence of such TEMLIMIDs can be used to model a Markov LIMID Process, where each TEMLIMID represents a macro action. Algorithms are presented to find optimal plans for a sequence of such macro actions. Use of the algorithms is illustrated by an extended version of an example from pig production originally used to introduce the LIMID concept.

Keywords Markov Decision Process, MDP, Multilevel hierarchical Markov Process, MLHMP, Limited Memory Influence Diagram, LIMID, pig production

This research was carried out as part of Dina, Danish Informatics Network in the Agricultural Sciences.

Erik Jørgensen, Department of Animal Science, Faculty of Science and Technology, Aarhus University, Blichers Allé, Postbox, DK- Tjele, Denmark. erik.jorgensen@agrsci.dk

Anders R. Kristensen, Department of Large Animal Science, Faculty of Life Sciences, University of Copenhagen, Grønnegårdsvej, DK- Frederiksberg C, Denmark. ark@sund.ku.dk

Dennis Nilsson, University of Copenhagen; present address: Nordic Director, Pricing and Business Intelligence at RSA, Gl. Kongevej, DK- Copenhagen, Denmark.

Introduction

Inspired by the problem of maintenance and replacement in livestock production, Kristensen introduced the notion of hierarchical semi-Markov decision processes, which were structured with two levels: the main level holding one main process and the sub-level holding a series of consecutive sub-processes. The stages of the main process corresponded to animals successively replacing each other in an infinite chain of stages. Transition to a new

stage corresponded to a replacement. Each animal was, furthermore, modeled at the sub-level by a separate finite-time semi-Markov decision process representing the states and decisions corresponding to daily maintenance. In livestock production, large models (Houben et al; Verstegen et al; Mourits et al) have been developed using this hierarchical technique, which has had a positive effect on the efficiency of optimisation. Later, the notion was developed further (Kristensen and Jørgensen) into multi-level hierarchical Markov decision processes (MLHMDP) to allow simultaneous optimisation of decisions on multiple time scales. The two-level concept was extended via a founder level process (corresponding to the main process) and nested child processes (corresponding to the sub-processes) to handle several levels in the hierarchy. As an example, Kristensen and Søllested implemented a sow replacement model using MLHMDP, where the child processes modeled very different processes. Each child process included several types of decisions, in addition to the replacement decision, and each child process was still modeled as a semi-Markov decision process. It is of interest, however, to explore whether the newer modeling techniques based on graphical modeling can be used for this purpose.

Often the state of the system (e.g. an animal) being modeled in Markov decision processes (MDPs) is defined by the values of state variables, each representing a trait of the system. The Cartesian product of the value sets of the individual state variables is defined as the state space. A more structured description of the state space, however, can usually be made. The structured description leads to a more efficient representation of the transition probabilities between states of consecutive stages. Such efficient representations may be obtained using Influence Diagrams (IDs) (Howard and Matheson), which are representations of decision problems with uncertainty. Originally used as representations of decision trees, IDs are now usually seen as natural extensions of Bayesian networks. As with Bayesian networks, there exist local computation algorithms to find optimal strategies within IDs (Olmsted; Shachter; Shenoy; Jensen et al). Such IDs consist of a Bayesian network augmented with decision and utility nodes, together with a protocol that identifies which variables are known to the decision maker when decisions are made (Lauritzen). Tatman and Shachter showed that finite-length MDPs could be formulated as IDs with a restriction on the utility function of the MDP. The ID directly distinguishes between variables known and unknown to the decision maker. Thus, we have a framework for modeling with only partial observations of variables. In MDPs, it is generally assumed that variables are known and that the state space is fully observable. It is more logical, however, to distinguish between directly observable state variables, such as milk yield or litter size, and latent variables, such as genetic value. Partially Observable Markov Decision Processes (POMDPs) are a framework allowing for modeling with such latent variables (refer for instance to Lovejoy for a survey or Kaelbling et al for an introduction). IDs may be used to model such processes. The complexity of finding globally optimal strategies is daunting, however, and hinders the application of IDs to many real decision problems.
This is mainly because of the implicit no-forgetting assumption, i.e., when the decision maker makes a decision, it is assumed that all previous decisions, as well as the observations they were based on, are known. With models covering even moderate time spans, the complexity of keeping this information in the model becomes prohibitive. Thus early considerations of implementing IDs within the MLHMDP were disappointing. The statistical models underlying the MLHMDP could naturally be expressed using latent variables, and it would be natural to express them directly as IDs. Due to the complexity of the resulting model, however, it is necessary to use several tricks that are not a natural part of the modeling process (see e.g. Kristensen; Relund Nielsen et al).

The extension of IDs called LIMIDs, or LImited Memory Influence Diagrams (Lauritzen and Nilsson), relaxes the no-forgetting assumption, thus providing a computationally tractable decision problem without assuming a Markov process. In general, optimisation algorithms based on LIMIDs are approximate; in some special cases, however, they are exact. By relaxing the no-forgetting assumption, previously intractable problems can be handled, at least approximately.

In addition to the computational complexity of IDs, another problem, which the current definition of LIMIDs does not solve, is the static nature of IDs. Although decisions are ordered according to time, so that a multi-stage decision problem can be handled, time is not an integrated element of an ID. It is implicitly assumed that there is no time preference concerning rewards (or utilities, as they are denoted in IDs). Each time stage has, furthermore, to be defined explicitly in the model, as opposed to implicitly as in Markov decision processes. If we want to use LIMIDs for dynamic decision problems in herd management, we therefore need to extend the definition.

The purpose of this paper is to combine multi-level Markov processes with LIMIDs to provide a tool for dealing with dynamic decision problems at two or more levels. We will refer to this combination as Markov LIMID Processes (MLP). The MLPs integrate the dynamic nature of semi-Markov decision processes with the state space representation of LIMIDs, allowing for the use of unobservable variables. The integration is done by defining a semi-Markov decision process at the founder level, interacting with slightly modified LIMIDs at the child level. A protocol for communication between the two levels is defined, and an optimisation algorithm is presented and proved. The potential of the combined method is, furthermore, illustrated through examples related to the animal replacement problem.

The PIGS example

The PIGS example from Lauritzen and Nilsson is fictitious, but our extension will make it more realistic and closer to direct application. According to Lauritzen and Nilsson:

A pig breeder is growing pigs for a period of four months and subsequently selling them. During this period the pig may or may not develop a certain disease. If the pig has the disease at the time when it must be sold, the pig must be sold for slaughtering and its expected market price is then 300 DKK (Danish kroner). If it is disease free, its expected market price as a breeding animal is 1000 DKK. Once a month, a veterinary doctor sees the pig and makes a test for presence of the disease. If the pig is ill, the test will indicate this with probability 0.80, and if the pig is healthy, the test will indicate this with probability 0.90. At each monthly visit, the doctor may or may not treat the pig for the disease by injecting a certain drug. The cost of an injection is 100 DKK. A pig has the disease in the first month with probability 0.10. A healthy pig develops the disease in the subsequent month with probability 0.20 without injection, whereas a healthy and treated pig develops the disease with probability 0.10, so the injection has some preventive effect. An untreated pig which is unhealthy will remain so in the subsequent month with probability 0.90, whereas the similar probability is 0.50 for an unhealthy pig which is treated. Thus, although spontaneous cure is possible, treatment is beneficial on average.

We will extend the example to illustrate how to handle repeated decision making.
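For later reference, the numbers of the quoted example can be collected in a few lines of Python. This is only a bookkeeping sketch; the variable names and the dictionary layout are ours and are not part of any of the software systems mentioned later in the paper.

```python
# The quoted PIGS numbers as plain Python dictionaries (illustrative layout).

# P(h1): probability of disease in the first month
p_h1 = {"ill": 0.10, "healthy": 0.90}

# P(h_{k+1} | h_k, d_k): monthly disease dynamics, by treatment decision
p_h_next = {
    ("healthy", "no-treat"): {"ill": 0.20, "healthy": 0.80},
    ("healthy", "treat"):    {"ill": 0.10, "healthy": 0.90},
    ("ill",     "no-treat"): {"ill": 0.90, "healthy": 0.10},
    ("ill",     "treat"):    {"ill": 0.50, "healthy": 0.50},
}

# P(t_k | h_k): the monthly veterinary test
p_test = {
    "ill":     {"positive": 0.80, "negative": 0.20},
    "healthy": {"positive": 0.10, "negative": 0.90},
}

# Utilities (DKK): treatment cost and final sales price
cost_injection = -100
sales_price = {"ill": 300, "healthy": 1000}
```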

Prerequisites

The MLPs are based on a combination of the concepts of semi-Markov processes and LIMIDs. The term LIMID (LImited Memory Influence Diagram) was introduced as a modification of IDs to represent multistage decision problems and to relax the no-forgetting assumption (Nilsson and Lauritzen; Lauritzen and Nilsson). We will assume that semi-Markov processes are known, but we will provide an introduction to LIMIDs and their predecessors, IDs. For proofs, the reader is referred to the above sources.

Influence Diagrams

The original formulation of Influence Diagrams (IDs) was presented by Howard and Matheson. The main element of an ID, $L$, is a directed acyclic graph (DAG) with three sets of nodes: chance nodes $\Gamma$, decision nodes $\Delta$, and utility nodes $\Upsilon$. In Fig. (left), the PIGS problem is shown as an ID. Ovals indicate chance nodes $h_t$ and $t_t$, where $h_t$ denotes the health state of the pig at time $t$, and $t_t$ denotes the outcome of the test the veterinarian performs. Rectangles indicate decision nodes $d_t$, and diamonds indicate utility nodes; e.g., $u_t$ corresponds to the cost incurred, depending on whether the pig is treated or not, and $u_D$ is the sales price of the pig, depending on its health state. The directed edges are used for two purposes: first to indicate conditional dependency, e.g., the outcome of test $t_1$ depends directly on health state $h_1$, and second to describe informational relationships. When decision $d_1$ is made, the outcome of the first test $t_1$ is known, but neither health state $h_1$ nor the outcome of the second test $t_2$ is known. When decision $d_2$ is made, the outcome of $t_2$ is known. IDs have an implicit no-forgetting assumption, meaning that previous observations and decisions are known. Thus $t_1$ and $d_1$ are known when $d_2$ is made. The informational edges also serve the role of specifying an ordering of the outcomes and decisions. A unique ordering is necessary to solve the influence diagram. We use the convention of showing informational edges as dotted lines.

Fig.: PIGS example shown in two versions, as an ID (left) and as a LIMID (right). Ovals indicate chance nodes ($h_t$ is the health state, $t_t$ is the test outcome), rectangles indicate decision nodes, and diamonds indicate utility nodes. Informational edges are indicated by dotted lines to represent the implicit no-forgetting assumption in the ID.
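The two diagrams in the figure differ only in their informational edges. A minimal sketch of that difference, with the graphs encoded as parent sets in plain Python (an ad hoc encoding of ours, read off the figure description; node names are illustrative):

```python
# Structural parents shared by the ID and the LIMID of the PIGS problem.
parents_common = {
    "h1": [], "t1": ["h1"],
    "h2": ["h1", "d1"], "t2": ["h2"],
    "h3": ["h2", "d2"], "t3": ["h3"],
    "h4": ["h3", "d3"],
    "u1": ["d1"], "u2": ["d2"], "u3": ["d3"], "uD": ["h4"],
}

# Informational edges (parents of the decision nodes).
# LIMID version: each decision sees only the most recent test result.
info_limid = {"d1": ["t1"], "d2": ["t2"], "d3": ["t3"]}

# ID version: no-forgetting, so earlier tests and decisions are also remembered.
info_id = {
    "d1": ["t1"],
    "d2": ["t1", "d1", "t2"],
    "d3": ["t1", "d1", "t2", "d2", "t3"],
}
```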

LIMIDs

A LIMID $L$ is a directed acyclic graph (DAG) with three sets of nodes: chance nodes $\Gamma$, decision nodes $\Delta$, and utility nodes $\Upsilon$. For each node $n$ in the set $V = \Gamma \cup \Delta$, we associate a variable $X_n$ that can take values in a finite set $\mathcal{X}_n$. For a subset $A \subseteq V$, we let $X_A = (X_n)_{n \in A}$ and $\mathcal{X}_A = \times_{n \in A} \mathcal{X}_n$. Typical elements of $\mathcal{X}_A$ are denoted by $x_A, y_A$, etc. For a node $n$ in $L$, its parent set is denoted $\mathrm{pa}(n)$, the set of descendants of $n$ is denoted $\mathrm{de}(n)$, and the family of $n$ is denoted $\mathrm{fa}(n) = \mathrm{pa}(n) \cup \{n\}$.

Chance nodes, shown as ovals, represent random variables. For each chance node $r \in \Gamma$, there is a probability distribution $p_r$ defined by $p_r : \mathcal{X}_r \to [0,1]$, such that $\sum_{x_r} p_r(x_r) = 1$.

Decision nodes, shown as rectangles, represent decision variables. For each decision node $d \in \Delta$, the decision maker has the opportunity to choose a policy $\delta_d$. A policy for $d$ associates with each state $x_{\mathrm{pa}(d)}$ of $X_{\mathrm{pa}(d)}$ a probability distribution $\delta_d(\cdot \mid x_{\mathrm{pa}(d)})$ on $\mathcal{X}_d$. We assume that the states in $\mathcal{X}_{\mathrm{pa}(d)}$ are known to the decision maker when making decision $d$. For that reason, directed edges into decision nodes are termed informational edges. In contrast to IDs, only $X_{\mathrm{pa}(d)}$ is known; there is no implicit no-forgetting assumption.

Finally, utility nodes, shown as diamonds, represent utility functions. For each utility node $u$, we associate a (local) utility function $U_u$ defined by $U_u : \mathcal{X}_{\mathrm{pa}(u)} \to \mathbb{R}$.

Thus, Fig. may be considered as a graphical representation of an ID as well as of a LIMID. However, they are two different models. The right part of Fig. shows the LIMID specification that corresponds to the ID, i.e., the informational arcs representing the available information are added.

Global maximum strategies in LIMIDs

A strategy $q$ in $L$ is a set of policies, one for each decision node $d \in \Delta$, and has the form $q = \{\delta_d : d \in \Delta\}$. The strategy $q$ generates a joint distribution over all variables in $V = \Gamma \cup \Delta$ as

$$f_q = \prod_{r \in \Gamma} p_r \prod_{d \in \Delta} \delta_d.$$

The expected utility of $q$ is

$$EU(q) = \sum_{x_V} f_q(x_V) \sum_{u \in \Upsilon} U_u(x_{\mathrm{pa}(u)}) = \sum_{x_V} f_q(x_V)\, U(x_V),$$

where $U(x_V) = \sum_{u \in \Upsilon} U_u(x_{\mathrm{pa}(u)})$. A global maximum strategy $\hat{q}$ in $L$ has the property that

$$EU(\hat{q}) = \max_q EU(q).$$
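As an illustration of these definitions, the joint distribution $f_q$ and the expected utility $EU(q)$ of a given strategy can be evaluated by brute-force enumeration. The sketch below is ours and is only meant to make the product form concrete; practical implementations use local computations instead.

```python
# Brute-force evaluation of EU(q) for a small LIMID, following
# f_q = prod_r p_r * prod_d delta_d.  Nodes are visited in a topological
# order, and every conditional table is a function of the configuration.
from itertools import product

def expected_utility(nodes, states, prob_or_policy, utilities):
    """nodes: topological order of the chance and decision node names.
    states[n]: list of possible values of node n.
    prob_or_policy[n](value, config): p_r(value | parents) or delta_d(value | parents).
    utilities: list of functions U_u(config) -> float."""
    eu = 0.0
    for values in product(*(states[n] for n in nodes)):
        config = dict(zip(nodes, values))
        weight = 1.0
        for n in nodes:
            weight *= prob_or_policy[n](config[n], config)
        eu += weight * sum(U(config) for U in utilities)
    return eu

# Toy usage: one health node, one test, and one treat/no-treat decision whose
# policy depends only on the test (the numbers are the quoted PIGS ones).
nodes = ["h1", "t1", "d1"]
states = {"h1": ["ill", "healthy"], "t1": ["pos", "neg"], "d1": ["treat", "no"]}
prob_or_policy = {
    "h1": lambda v, c: {"ill": 0.10, "healthy": 0.90}[v],
    "t1": lambda v, c: {"ill": {"pos": 0.80, "neg": 0.20},
                        "healthy": {"pos": 0.10, "neg": 0.90}}[c["h1"]][v],
    "d1": lambda v, c: 1.0 if v == ("treat" if c["t1"] == "pos" else "no") else 0.0,
}
utilities = [lambda c: -100.0 if c["d1"] == "treat" else 0.0]
print(expected_utility(nodes, states, prob_or_policy, utilities))  # approx -17.0
```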

Other aspects of LIMIDs

In the process of defining LIMIDs, new features of decision problems were identified and new computational strategies were developed. Probably the most important development was a new iterative approach for the evaluation of LIMIDs based on local computations, single-policy updating. It was shown that it was possible to reduce the LIMID before the optimisation, by algorithmically identifying information that was not required for the decisions. Finally, it was shown that IDs could be considered a subclass of what was called soluble LIMIDs. Soluble LIMIDs can be solved exactly by using the single-policy updating algorithm. Important aspects of these developments are presented in the following sections.

Single-Policy Updating for LIMIDs

Single-Policy Updating is an iterative procedure to evaluate LIMIDs. The procedure starts with an initial strategy $q^0$ and proceeds by modifying (updating) individual policies in a given order. If the current strategy is $q^i = \{\delta^i_d : d \in \Delta\}$ and the policy of decision $d$ is to be modified, then the following steps are performed:

Retract: Retract policy $\delta^i_d$ from $q^i$ to obtain $q^i_{-d} := q^i \setminus \{\delta^i_d\}$.

Optimise: Compute a new policy $\delta^*_d$ for $d$ by $\delta^*_d = \arg\max_{\delta_d} EU(q^i_{-d} \cup \{\delta_d\})$.

Replace: Let $q^{i+1} = q^i_{-d} \cup \{\delta^*_d\}$.

In this way, policies are updated until no single policy modification can increase the expected utility. Such a strategy is termed a local maximum strategy. Under some conditions, the local maximum strategy obtained is a global maximum strategy, as described in the next section.

Soluble LIMIDs

We start this section by introducing a bit of notation. For subsets $A$, $B$, and $S$ of nodes in $L$, let the symbolic expression $A \perp_L B \mid S$ denote that $A$ and $B$ are d-separated by $S$ in the DAG formed by the nodes in $L$, including utility nodes. A decision node $d_0$ is said to be extremal whenever the following d-separation holds for all utility nodes $u \in \Upsilon$:

$$u \perp_L \Bigl(\bigcup_{d \in \Delta \setminus \{d_0\}} \mathrm{fa}(d)\Bigr) \Bigm| \mathrm{fa}(d_0).$$

The decision nodes in $L$ have an exact solution ordering $d_1, \ldots, d_k$ if, for all $i$, $d_i$ is extremal when $d_{i+1}, \ldots, d_k$ have been converted into chance nodes. Further, $L$ is soluble if the decision nodes in $L$ obey an exact solution ordering. Soluble LIMIDs have an important property, as stated in the following theorem. Here, a uniform strategy $\bar{q}$ is given by $\bar{q} = \{\bar{\delta}_d : d \in \Delta\}$, where $\bar{\delta}_d$ is the uniform distribution over $\mathcal{X}_d$.

Theorem 1 Suppose we are given a LIMID $L$ whose decision nodes have an exact solution ordering $d_1, \ldots, d_k$. If we perform Single-Policy Updating on $L$, starting from the uniform strategy and using the updating order $d_k, \ldots, d_1$, then after one update of each policy we obtain a global maximum strategy.

A large subset of the soluble LIMIDs is the set of IDs. So, the above theorem implies that a global maximum strategy for an ID can be achieved by applying Single-Policy Updating, which has led to more efficient algorithms for solving IDs (Nilsson and Lauritzen; Madsen and Nilsson). A crucial step in these algorithms consists of removing informational arcs in $L$ that are not relevant to the decisions, i.e., they are non-requisite. The notion of non-requisiteness is described in the next section.

Non-requisite information in LIMIDs

Before applying Single-Policy Updating on a given LIMID $L$, it is wise to remove non-requisite informational edges. In this section we define the notion of non-requisiteness and show results that reduce the complexity of Single-Policy Updating. A parent $n$ of a decision node $d$ is non-requisite for $d$ if the following d-separation holds:

$$n \perp_L (\Upsilon \cap \mathrm{de}(d)) \mid (\mathrm{fa}(d) \setminus \{n\}).$$

As a corollary, we say that $n$ is requisite for $d$ if $n$ is not non-requisite for $d$. Similarly for informational edges: if $n$ is non-requisite for $d$, the directed edge from $n$ into $d$ is non-requisite. If a new LIMID $L'$ is obtained by successively removing non-requisite informational arcs from $L$, then $L'$ is said to be a reduction of $L$. There exists a unique reduction of $L$, called the minimal reduction, with the property that all informational arcs are requisite. The minimal reduction of $L$ is denoted $L_{\min}$. Clearly, a reduction $L'$ is less complex than $L$, because it has fewer edges. The following theorem shows that reductions have other nice properties:

Theorem 2 Suppose $L'$ is a reduction of $L$. Then the following conditions hold:
1. (Preserves solubility) If $L$ is soluble, then $L'$ is also soluble.
2. (Preserves local maxima) If $q$ is a local maximum strategy for $L'$, then $q$ is also a local maximum strategy for $L$.

As a consequence of the above theorem, a global maximum strategy for a reduction $L'$ of $L$ is also a global maximum strategy for $L$.

Temporal LIMIDs

This section presents the necessary extension of LIMIDs to account for the dynamic nature of the decision problem. A temporal LIMID, or TEMLIMID, is a simple generalisation of LIMIDs that allows for the possibility of defining additional properties of the utilities. The additional properties might include the time of achieving the utility, the duration covered by the utility, or any kind of physical property (e.g. the litter size of a sow or the milk yield of a cow). A TEMLIMID consists of the same three sets of nodes as a LIMID: a set $\Gamma$ of chance nodes, a set $\Delta$ of decision nodes, and a set $\Upsilon$ of utility nodes. Informational edges, furthermore, have the same semantics as in ordinary LIMIDs. So, in TEMLIMIDs we also allow for the possibility of representing decision problems where decisions are taken with limited memory.

Utility nodes in TEMLIMIDs

To represent additional properties of the utility, we associate a number of optional functions with each utility node $u$. The number of functions depends on two aspects: the problem and the criterion of optimality to be used in the optimisation. For convenience, however, we will assume that three functions are available for a utility node $u$: a utility function $U_u : \mathcal{X}_{\mathrm{pa}(u)} \to \mathbb{R}$, a time function $T_u : \mathcal{X}_{\mathrm{pa}(u)} \to \mathbb{R}_+ \cup \{0, \infty\}$, and a duration function $O_u : \mathcal{X}_{\mathrm{pa}(u)} \to \mathbb{R}_+ \cup \{0\}$.

The utility function has the same meaning as in LIMIDs, whereas the time function specifies the time until the utility is given. If $U_u(x_{\mathrm{pa}(u)})$ is never paid, then $T_u(x_{\mathrm{pa}(u)}) = \infty$, but if $U_u(x_{\mathrm{pa}(u)})$ is paid immediately, then $T_u(x_{\mathrm{pa}(u)}) = 0$. The duration function specifies the time interval that the utility covers, but a similar definition can be used for any quantity associated with the utility, e.g., a physical output. In this paper, we will restrict the definition of the function to duration.

The time function is used when the expected present value of the TEMLIMID is maximised instead of the expected total value. In that case, utilities are discounted by a discount factor $\lambda \in \,]0,1[$; that is, a utility of $u$ units paid after $t$ time units has a discounted value of $\tilde{u} = u \cdot \lambda^t$ units. The duration function is used to express the deviation of the utility from a specified norm $g$ relative to the duration. Thus, if a utility of $u$ units is paid and a duration of $o$ units is involved, then the revised utility becomes $\tilde{u} = u - g \cdot o$; if $u/o = g$, then $\tilde{u} = 0$. Thus, a TEMLIMID consists of a LIMID $L$, its discount factor $\lambda$, and a norm $g$. We will use the notation $L(\lambda, g)$ to indicate this.

Example: PIGS as a TEMLIMID

The PIGS example shown in Fig. can readily be extended to a TEMLIMID. The utility assigned to $u_1, u_2, u_3$ is $-100$ DKK if the decision is to treat and 0 if it is not. The utility $u_D$ is 300 DKK if the pig is diseased and 1000 DKK if it is healthy. In the original PIGS example, time steps were of equidistant length. A more realistic description, however, will have time steps of varying lengths. For example, the first decision is made some months after the start, the second and third later still, and the final utility is received at the end of the growing period. For the utility nodes $u_1, u_2, u_3, u_D$, therefore, the time functions $(T_{u_1}, T_{u_2}, T_{u_3}, T_{u_D})$ take increasing values, and the duration functions $(O_{u_1}, O_{u_2}, O_{u_3}, O_{u_D})$ take the corresponding interval lengths, with $O_{u_D} = 0$. With an annual interest rate (implying a corresponding monthly discount factor $\lambda$), the revised utilities $(\tilde{u}_1, \tilde{u}_2, \tilde{u}_3)$ are obtained by discounting with the time function, and the sales prices $\tilde{u}_D$ are discounted likewise. Similarly, with the duration function and a norm $g$, we obtain revised utilities $(\tilde{u}_1, \tilde{u}_2, \tilde{u}_3)$ for the treatments and $(300, 1000)$ for the sales prices (where the duration $O_{u_D}$ is 0).

Premature ending of the process

The original PIGS example was based on a fixed length of the time steps in the model, i.e., the pig was always sold at the final time step. A more realistic model, however, would allow for the possibility of ending the process at each time step. Thus, we may decide to sell the pig immediately if we expect a loss by keeping it. The extended utilities of the sales prices are shown in the table below.

In addition, we need to keep track of whether the pig is alive or not. One way to obtain this is to augment the state space of the $h_t$ nodes with a sold state, and to add a similar test outcome sold to the $t_t$ nodes. Note that it may be necessary to keep track of the time of delivery, e.g., by adding sold at time $t$ as state levels.

Table: Extended utility of sales prices. The pig is sold immediately after the decision is made; thus, the duration is 0 (except for the first time period). If the pig has already been sold, no further utilities are obtained.

State, $h_t$       Utility function $U_{u_1}\,U_{u_2}\,U_{u_3}\,U_{u_D}$    Time function $T_{u_1}\,T_{u_2}\,T_{u_3}\,T_{u_D}$    Duration function $O_{u_1}\,O_{u_2}\,O_{u_3}\,O_{u_D}$
Healthy
Diseased
Sold at time 1
Sold at time 2
Sold at time 3

Optimisation criteria in TEMLIMIDs

With each strategy $q = \{\delta_d : d \in \Delta\}$ in a TEMLIMID $L(\lambda, g)$, we may associate one of two optimisation criteria: the expected discounted utility or the expected relative utility. The expected discounted utility is

$$EDU_{L(\lambda,g)}(q) = \sum_x f_q(x) \Bigl\{ \sum_{u \in \Upsilon} U_u(x_{\mathrm{pa}(u)})\, \lambda^{T_u(x_{\mathrm{pa}(u)})} \Bigr\},$$

where we define $\lambda^{\infty} = 0$. As usual, $f_q$ denotes the joint distribution over $\Gamma \cup \Delta$ induced by $q$. The expected relative utility is

$$ERU_{L(\lambda,g)}(q) = \sum_x f_q(x) \Bigl\{ \sum_{u \in \Upsilon} \bigl( U_u(x_{\mathrm{pa}(u)}) - g\, O_u(x_{\mathrm{pa}(u)}) \bigr) \Bigr\}.$$

Depending on the definition of optimality, we search for a strategy $q$ that maximises either the expected discounted utility or the expected relative utility:

Definition 1 A global maximum present value strategy $\hat{q}_p$ in a TEMLIMID $L(\lambda,g)$ is a strategy that satisfies $EDU_{L(\lambda,g)}(\hat{q}_p) = \max_q EDU_{L(\lambda,g)}(q)$.

Definition 2 A global maximum relative value strategy $\hat{q}_r$ in a TEMLIMID $L(\lambda,g)$ is a strategy that satisfies $ERU_{L(\lambda,g)}(\hat{q}_r) = \max_q ERU_{L(\lambda,g)}(q)$.
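In code, the two criteria only change how each local utility term enters the sum. A minimal sketch (ours; it assumes the `expected_utility` enumerator from the LIMID section above is in scope and represents each utility node by a triple of functions $(U_u, T_u, O_u)$ of the parent configuration):

```python
# EDU and ERU of a strategy q, by revising each local utility term.
# math.inf marks a utility that is never paid (lam**inf is defined as 0).
import math

def edu(nodes, states, prob_or_policy, triples, lam):
    # triples: list of (U, T, O) function triples, one per utility node
    revised = [
        (lambda c, U=U, T=T: 0.0 if math.isinf(T(c)) else U(c) * lam ** T(c))
        for (U, T, O) in triples
    ]
    return expected_utility(nodes, states, prob_or_policy, revised)

def eru(nodes, states, prob_or_policy, triples, g):
    revised = [(lambda c, U=U, O=O: U(c) - g * O(c)) for (U, T, O) in triples]
    return expected_utility(nodes, states, prob_or_policy, revised)
```

A global maximum present value or relative value strategy is then searched for exactly as in ordinary LIMIDs, only with the revised utility terms.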

Solving TEMLIMIDs

The concepts of solubility, requisiteness, and reduction from LIMIDs have the same meaning in TEMLIMIDs and are defined in the same way. To solve a TEMLIMID $L(\lambda,g)$, we first convert it into a traditional LIMID. Before the conversion, however, it is advantageous to remove non-requisite informational arcs from $L(\lambda,g)$.

Maximizing expected present value

The conversion from the TEMLIMID $L(\lambda,g)$ into a LIMID consists of a single step in which every utility function $U_u$, $u \in \Upsilon$, is discounted:

$$U_u(x_{\mathrm{pa}(u)}) \leftarrow U_u(x_{\mathrm{pa}(u)})\, \lambda^{T_u(x_{\mathrm{pa}(u)})},$$   (1)

where we define $\lambda^{\infty} = 0$. The LIMID obtained in this way is denoted $L_\lambda$. After this conversion, the expected utility of a strategy $q$ in $L_\lambda$ equals the expected discounted utility of $q$ in $L(\lambda,g)$:

$$EU_{L_\lambda}(q) = EDU_{L(\lambda,g)}(q).$$

As a consequence of this conversion, a global maximum strategy $\hat{q}_p$ in $L_\lambda$ is also a global maximum present value strategy in $L(\lambda,g)$.

Maximizing expected relative value

The conversion from the TEMLIMID $L(\lambda,g)$ into a LIMID in this case also consists of a single step in which every utility function $U_u$, $u \in \Upsilon$, is transformed:

$$U_u(x_{\mathrm{pa}(u)}) \leftarrow U_u(x_{\mathrm{pa}(u)}) - g\, O_u(x_{\mathrm{pa}(u)}).$$   (2)

The LIMID obtained in this way is denoted $L_g$. After this conversion, the expected utility of a strategy $q$ in $L_g$ equals the expected relative utility of $q$ in $L(\lambda,g)$, i.e.

$$EU_{L_g}(q) = ERU_{L(\lambda,g)}(q).$$

As a consequence of this conversion, a global maximum strategy $\hat{q}_r$ in $L_g$ is also a global maximum relative value strategy in $L(\lambda,g)$.

Use of TEMLIMIDs

TEMLIMIDs are intended primarily for use in connection with MLPs, and we will not go into detail about the potential for direct application. In most cases, modifications of utilities can be implemented directly when LIMIDs are implemented within specific domains. The possibility of keeping the distinct features of the utility function separate during the modeling process, however, might make the TEMLIMID an interesting modeling option on its own.
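The two conversions are pure table transformations and can be sketched in a few lines (again a layout of our own: each utility node is stored as parallel dictionaries $U$, $T$ and $O$ keyed by the parent configuration):

```python
# Sketch of the conversions L(lambda, g) -> L_lambda and L(lambda, g) -> L_g.
import math

def to_L_lambda(U, T, lam):
    """Eq. (1): discount every utility entry, U(x) * lam**T(x), with lam**inf := 0."""
    return {x: (0.0 if math.isinf(T[x]) else U[x] * lam ** T[x]) for x in U}

def to_L_g(U, O, g):
    """Eq. (2): subtract the norm times the duration, U(x) - g * O(x)."""
    return {x: U[x] - g * O[x] for x in U}

# After either conversion the revised tables define an ordinary LIMID, and
# single-policy updating can be applied to it unchanged.
```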

Markov LIMID Processes

Macro states and macro actions

This section introduces the basic components of a Markov LIMID Process (MLP) and provides the notation used throughout. The notation is inspired by Puterman. The procedure by which the MLP is governed is presented briefly here; details are given in subsequent sections. The process enters an initial macro state, denoted $s_I$, according to a known distribution on the set of macro states. Based on the state $s_I$, we associate a macro action $q(s_I)$. A macro action is made during a period of time referred to as a decision epoch. The macro action generates three consequences: a cumulative (possibly discounted) utility during the decision epoch; a transition to another macro state $s_O$; and a cumulative expected duration during the decision epoch. The three consequences are handled by simple modifications of the TEMLIMID, as presented below. This process is continued ad infinitum, i.e. the process goes through an infinite number of decision epochs. A policy for the MLP associates with each macro state a macro action for the decision epoch. We therefore need to find a policy that optimises the utility of the system, i.e., we need to optimise the utility obtained from an infinite sequence of macro actions, as formulated below.

Example: PIGS as a renewal problem

We have already modified the PIGS example to include premature ending of the process and to extend the utilities with time functions and duration functions. We now extend the example further so that it fits into the MLP framework. The pig breeder has two separate sow units, A and B, and a central stable for growing pigs. In even-numbered months, pigs from sow unit A are moved to the central stable, whereas in odd-numbered months, pigs from sow unit B are moved to the stable. Thus the pig breeder can influence whether a new pig is from A or B by prematurely delivering the present pig in an even- or odd-numbered month. Suppose sows in A are carriers of the disease, so that pigs in A are infected, whereas pigs in B are not infected. As a result, the natural immune response will protect pigs coming from A against a new outbreak of the disease, whereas pigs from B are prone to the disease. The initial probability of having the disease in the first month, therefore, is reduced for pigs from A, whereas it remains at the original value of 0.10 for pigs from B. An untreated, healthy pig from B has the original probability of developing the disease of 0.20, whereas the corresponding risk for pigs from A is reduced. Other disease aspects are identical to the original example. As an additional option, the pig breeder might choose, at a certain cost in DKK, to vaccinate pigs at insertion into the stable for growing pigs. Vaccination of a pig from A has no effect, because such pigs are already protected by the natural immune response. If a pig from B is vaccinated, however, then it becomes protected to the same extent as pigs from A.

The resulting TEMLIMID is shown in Fig. The extension is modeled by an initial chance node $s_I$ with two states, A and B. An initial decision node $d_I$ is added to model the vaccination decision, with two states, Yes and No. The vaccination decision is made based on information about the unit of origin, as indicated by the informational edge from

$s_I$ to $d_I$. The unit of origin of the next pig (replacing the present one) is modeled by the node $s_O$, which has the same state space as $s_I$. The treatment/selling strategy should naturally depend on the unit of origin and the vaccination state, so informational edges are added from $s_I$ and $d_I$ to each of the decision nodes $d_1$, $d_2$, and $d_3$. To keep track of even- versus odd-numbered months, an edge is added between $s_I$ and $s_O$.

Fig.: Extended PIGS example modified to be included in the MLP. An input state $s_I$ and an output state $s_O$ are added, as well as a terminal utility node $u_T$. The states of the $h_t$ nodes are augmented to include if and when the pig is sold. The vaccination decision $d_I$ with utility $u_I$, based on the input state, is also added. As the informational edges (dotted lines) indicate, $s_I$ and $d_I$ are known when decisions are made within the TEMLIMID.

Problem illustration

A decision maker, agent, or controller is faced with the problem (or opportunity) of influencing the behaviour of a probabilistic system as it evolves through time. In the PIGS example, the pig breeder should influence his pig production not only for a single pig but for a sequence of pigs. When a pig is delivered, the breeder will replace it with a new one, choosing macro actions covering a specific period of time. In the example, a macro action is a set of rules on how to react to test observations from a new pig. The breeder will choose a different macro action depending on the unit of origin of the pig. Because the risk of disease differs between units, the optimal treatment strategy may also differ. The goal is to choose a sequence of macro actions which causes the system to perform optimally with respect to some predetermined overall optimality criterion. The system we model is ongoing, so the state of the system prior to the next macro action depends on the current macro action. If the pig breeder chooses to deliver the present pig in an even-numbered month, for example, then this influences the unit of origin of the pig replacing the present one. Consequently, macro actions must not be chosen myopically, but must anticipate the opportunities associated with future states.
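The control loop just described can be made concrete with a small simulation sketch. Everything below is hypothetical scaffolding: `P_s`, `u` and `m` stand for the quantities derived from the TEMLIMID in the following sections, a policy is simply a table from macro states to macro actions, and the loop mixes expected durations with sampled transitions purely for illustration.

```python
# Rough Monte Carlo sketch of an MLP policy driving the renewal process.
import random

def simulate(policy, P_s, u, m, s0, epochs, lam=1.0):
    """Accumulate (discounted) utility over a finite number of decision epochs."""
    s, total, t = s0, 0.0, 0.0
    for _ in range(epochs):
        q = policy[s]                  # macro action chosen for the current macro state
        total += lam ** t * u(s, q)    # expected utility earned during the epoch
        t += m(s, q)                   # expected length of the epoch
        probs = P_s(s, q)              # distribution over output macro states
        s = random.choices(list(probs), weights=list(probs.values()))[0]
    return total
```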

Modifying TEMLIMIDs to handle macro states and macro actions

Macro actions are chosen at the start of time periods referred to as decision epochs. A TEMLIMID $L(\lambda,g)$, as defined above, can be used to represent the decision structure within a decision epoch. The TEMLIMID might need three modifications to be used in this way. Firstly, a chance node is added to represent the input state, i.e., the state of the process when the decision epoch starts. Secondly, a chance node is added to represent the output state, i.e., the state of the process when the decision epoch ends. Thirdly, a utility node is added to represent the future reward or utility value. At the start of the decision epoch, therefore, the process occupies an input state, and at the end of the decision epoch, the system occupies an output state. The consequences of a decision epoch depend on the macro action. In the PIGS example, the start of a decision epoch is when a new pig is introduced. The input state is the unit of origin of that pig. The output state is the unit of origin of the pig replacing the present one.

We will use the following notation. In $L(\lambda,g)$ there is a chance node $s_I$, termed the input node, representing the input state, and a chance node $s_O$, termed the output node, representing the output state. The associated variables $X_{s_I}$ and $X_{s_O}$ are called input and output variables. Here, $\mathcal{X}_{s_I} = \mathcal{X}_{s_O}$, and the elements of this set will be called macro states. We require that a special utility node $u_T$, referred to as the terminal utility node and having the property that $s_O \in \mathrm{pa}(u_T)$, can be added to the TEMLIMID $L(\lambda,g)$, which is then denoted $L^*(\lambda,g)$ with the extended utility node set $\Upsilon^* = \Upsilon \cup \{u_T\}$. We need to specify the three functions defined for utility nodes in a TEMLIMID: $U_{u_T}$, $T_{u_T}$, and $O_{u_T}$. The utility function $U_{u_T}$ represents a terminal reward received at the end of the time period modeled by the TEMLIMID. The time function $T_{u_T}$ specifies the time when the terminal reward is paid. The duration function $O_{u_T}$ is normally 0. Note that because the output node is a parent of the terminal utility node, the terminal reward, as well as its time, might depend on the output state $x_{s_O} \in \mathcal{X}_{s_O}$. In Fig., the PIGS example is shown with these modifications.

At the beginning of a decision epoch, the input state $x_{s_I}$ is known to the decision maker. The input state at decision epoch $n$ is obtained by observing the output variable at the end of the previous decision epoch $n-1$. If, at the start of some decision epoch, the decision maker or agent observes the system in input state $x_{s_I}$, then the agent must choose a macro action $q$ from some set $Q$. The set $Q$ consists of all strategies in $L(\lambda,g)$. The term macro action, therefore, is used because every element of $Q$ consists of a set of decision functions, i.e., one decision function for each decision node in $L(\lambda,g)$.

Example: Macro actions in PIGS

For the PIGS example, Lauritzen and Nilsson found the optimal LIMID strategy: never treat in the first month, and treat in the second and third months if the test is positive. This strategy may be used as a macro action. In the extended example we have two more options: vaccination and delivery. The strategy might, furthermore, differ depending on whether the pig is from A or B. Thus, a full macro action could be:

Decision $d_I$: Pigs from B should be vaccinated, whereas pigs from A should be left unvaccinated.

Decision $d_1$: No pigs should be treated or sold, no matter the test results.

Decision $d_2$: Pigs from A should be kept and treated if the test is positive, whereas they should be sold if it is negative. Pigs from B should be kept and not treated, no matter the test result.

Decision $d_3$: Pigs should be treated if $t_3$ is positive. If $t_3$ is negative, however, a pig from B should be kept and not treated, whereas a pig from A should be sold.

Depending on the input state, therefore, we choose a different macro action.

Consequences of macro actions

For each input state, we calculate the consequences of each macro action. The macro action influences the probability of the output state and the expected value of the utility criteria.

Transitions between macro states

As a result of choosing macro action $q \in Q$ in input state $x_{s_I}$ at a decision epoch, the output state $x_{s_O}$ is determined by the conditional state transition probability function $P_s(x_{s_O} \mid x_{s_I}, q)$, where the index $s$ connotes state transition. This state transition probability function is found using the TEMLIMID $L(\lambda,g)$ with macro action $q$:

$$P_s(x_{s_O} \mid x_{s_I}, q) = f_q(x_{s_O} \mid x_{s_I}),$$   (3)

where $f_q$ denotes the conditional probability distribution over the variables in $L(\lambda,g)$ induced by macro action $q$. Calculations proceed as in Bayesian networks, by entering evidence on the input state. After propagation, the conditional distribution over the output states can be read off directly.

Example: Transitions between input and output states for a given macro action

From Eq. (3) we see that the selling decisions $d_t$ influence the probabilities of transitions between macro states. Assume, for instance, that a macro action implies that the pig is always kept when decision $d_2$ is made and always sold when decision $d_3$ is made, some months after insertion into the stable. Thus, if we have a pig from unit A, it was inserted in an even-numbered month. If a new pig is inserted an odd number of months later, it will be inserted in an odd-numbered month, meaning a macro state transition from A to B. If the pig is from B, on the other hand, we have the opposite transition from B to A. Under such a macro action, the transition matrix is

$$P_s = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

Under the described special conditions, transitions are deterministic and the elements $p_{ij}$ equal 0 or 1. More sophisticated macro actions, where the selling decisions depend on the observed tests, typically result in transition matrices whose elements $p_{ij}$ satisfy $0 < p_{ij} < 1$.
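Eq. (3) can also be evaluated without any special machinery, by conditioning the joint distribution on the input state and marginalising onto the output node. The brute-force sketch below (our own layout, reusing the enumeration style of the earlier sketches) is only meant to make the definition concrete; the software described later does this by propagation, as in a Bayesian network.

```python
# Sketch of P_s(x_sO | x_sI, q) by brute-force conditioning.
# joint(config) -> probability is any function returning f_q over the nodes.
from itertools import product

def transition_matrix(joint, nodes, states, s_I="s_I", s_O="s_O"):
    macro = states[s_I]                        # macro states (shared by s_I and s_O)
    P = {a: {b: 0.0 for b in macro} for a in macro}
    norm = {a: 0.0 for a in macro}
    for values in product(*(states[n] for n in nodes)):
        config = dict(zip(nodes, values))
        w = joint(config)
        P[config[s_I]][config[s_O]] += w
        norm[config[s_I]] += w
    return {a: {b: (P[a][b] / norm[a] if norm[a] else 0.0) for b in macro}
            for a in macro}

# For the deterministic macro action described above, the result would be the
# alternating matrix {"A": {"A": 0, "B": 1}, "B": {"A": 1, "B": 0}}.
```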

Utilities

We now consider each of the consequences described above. As a result of choosing macro action $q$ in input state $x_{s_I}$ at decision epoch $n$, the decision maker receives a utility $u(x_{s_I}, q)$ corresponding to the total expected utility received during the interval from decision epoch $n$ to decision epoch $n+1$. If the utilities are discounted, the real-valued utility function $u(x_{s_I}, q)$ denotes the value at the beginning of decision epoch $n$. From the perspective of the model studied here, it is unimportant how the discounted utility is accrued during the interval; we require only that its expected value be known before choosing a macro action. Following the TEMLIMID definitions, the function $u(x_{s_I}, q)$ is computed from $L(\lambda,g)$ given $X_{s_I} = x_{s_I}$ as

$$u(x_{s_I}, q) = EDU_{L(\lambda,g)}(q \mid X_{s_I} = x_{s_I})$$   (4)

if we discount the utilities, or, if we are interested only in the expected utility,

$$u(x_{s_I}, q) = EU_{L(\lambda,g)}(q \mid X_{s_I} = x_{s_I}).$$   (5)

The values of the function are computed conditional on $X_{s_I} = x_{s_I}$. Correspondingly, the expected duration $m(x_{s_I}, q)$ is computed simply by interpreting the duration functions as utilities and then computing the expected value given $X_{s_I} = x_{s_I}$, completely analogous to the expected utility. Because the duration functions $O_u$ of the utility nodes denote the time intervals, $m(x_{s_I}, q)$ will be the expected inter-transition time between macro state transitions.

Discounted transition probability function

By using the time function $T_{u_T}$ of the terminal utility node $u_T$, we can also specify the discounted transition probability function $Q(x_{s_O} \mid x_{s_I}, q)$ as

$$Q(x_{s_O} \mid x_{s_I}, q) = \sum_{x_{V \setminus \{s_O, s_I\}}} \lambda^{T_{u_T}(x_{\mathrm{pa}(u_T)})}\, f_q(x_{V \setminus \{s_I\}} \mid x_{s_I}),$$   (6)

where $x_V$ denotes a configuration of the variables in $L^*(\lambda,g)$. The function $Q(x_{s_O} \mid x_{s_I}, q)$ is the expected discounted value of a terminal utility of 1, given that the input state is $x_{s_I}$, the strategy is $q$, and the output state is $x_{s_O}$.

Optimisation of macro actions

Interpretation of an MLP as a semi-Markov decision process

We now show that an MLP is a special case of an infinite horizon stationary semi-Markov decision process (SMDP). Hence, we can apply value iteration and policy iteration techniques for optimisation. Assume the state space $\Omega$ and the action space $D$ are finite sets. A semi-Markov decision process is characterised by a sequence of decision epochs where, at the start of a decision epoch $n$, the state $i \in \Omega$ of the system is observed. Based on state $i$, an action $d \in D$ is taken. Three consequences of taking action $d$ in state $i$ are:

– A reward $R_i^d$ is gained. We refer to the expected reward as $r_i^d$, i.e., $E(R_i^d) = r_i^d$. It is sufficient to know the expected reward $r_i^d$.
– The duration of the decision epoch will be $M_{ij}^d$ if the state observed at the next decision epoch is state $j$. We refer to the expected duration as $m_i^d$, i.e., $E(M_{ij}^d) = m_i^d$. Depending on the criterion of optimality, it is sufficient to know the expected duration $m_i^d$.
– At the end of the decision epoch, the system will make a transition from state $i$ to state $j$ with probability $p_{ij}^d$, where $\sum_{j \in \Omega} p_{ij}^d = 1$ for all $i$ and $d$.

We refer to $r_i^d$, $M_{ij}^d$, and $p_{ij}^d$ as the parameter set of the decision epoch. If the state and action spaces and the parameter set are the same for all decision epochs, we refer to the SMDP as stationary. If, furthermore, the sequence of decision epochs continues infinitely, we refer to the SMDP as an infinite horizon process. For a stationary SMDP, a policy $\delta$ associates to each state $i \in \Omega$ a decision $\delta(i) \in D$. An optimal policy maximises a predefined objective function $v(\delta)$. We consider three criteria of optimality:

1. Expected present value, maximising the expected sum of all future rewards discounted to the present decision epoch. To discount a reward from time $t$ to time $t' < t$, we use a discount factor $\lambda^{(t - t')}$, where $\lambda$ is a predefined constant such that $0 < \lambda \leq 1$. Under this criterion, we specify the discounted probability parameters $Q_{ij}^d = \lambda^{M_{ij}^d} p_{ij}^d$ for all $i$, $j$, and $d$, implying that the duration $M_{ij}^d$ is known.
2. Average reward over time, maximising the infinite future expected reward/output ratio. Under this criterion, it is sufficient to know the expected duration $m_i^d$.
3. Average reward per stage, maximising the infinite future reward/decision-epoch ratio. This criterion is a special case of the former where all $m_i^d = 1$.

An MLP as defined above is an infinite horizon stationary SMDP. We show this by specifying all elements of the SMDP. A new decision epoch begins each time the decision maker chooses a macro action; the epoch is modeled by the TEMLIMID $L(\lambda,g)$. The state space $\Omega$ of the decision epoch is identical to the state space of the input node $s_I$, i.e. $\Omega = \mathcal{X}_{s_I}$. The action space $D$ of the decision epoch is the set of all possible strategies of the TEMLIMID $L(\lambda,g)$, i.e., an action $d$ in the SMDP is a strategy $q$ in the TEMLIMID. The expected reward $r_i^d$ for state $i = x_{s_I}$ under action $d = q$ is the expected utility of the TEMLIMID, $u(x_{s_I}, q)$, as in Eq. (4) or (5). The expected duration is $m_i^d = m(x_{s_I}, q)$, calculated from the TEMLIMID as described above. The transition probability from state $i = x_{s_I}$ to state $j = x_{s_O}$ under action $d = q$ is $p_{ij}^d = P_s(x_{s_O} \mid x_{s_I}, q)$, as in Eq. (3). Finally, when needed, the corresponding discounted probability parameter is $Q_{ij}^d = Q(x_{s_O} \mid x_{s_I}, q)$, as in Eq. (6). Having identified all elements of the SMDP, we see that the infinite sequence of identical TEMLIMIDs forms an infinite horizon stationary SMDP, where the state and action spaces and the parameter sets are available through the TEMLIMID. Thus, we can apply value iteration and policy iteration techniques for optimisation.

Optimisation in an SMDP

Two algorithms are available for optimisation in an SMDP: value iteration and policy iteration. Each technique may be described by four subroutines as follows:

Algorithm 1 (Optimisation) Value iteration and policy iteration for an infinite horizon semi-Markov decision process:
1. Initialise.
2. Repeat until Convergence:
   (a) DeterminePolicy
   (b) DetermineValue

We next describe the subroutines for the policy iteration algorithm. The Initialise subroutine defines the initial conditions: we set the iteration counter $n$ to 0 and the value function $v_i(0)$ to 0 for all states $i \in \Omega$; finally, the average reward $g_0$ is set to 0.

The DeterminePolicy subroutine increments $n$ by 1 (i.e. sets $n = n+1$) and returns an updated (improved) policy $\delta_n$. The updated policy is determined as

$$\forall i \in \Omega : \quad \delta_n(i) = \arg\max_d \Bigl\{ r_i^d - g_{n-1}\, m_i^d + \sum_{j \in \Omega} Q_{ij}^d\, v_j(n-1) \Bigr\}.$$   (7)

The DetermineValue subroutine returns updated values $v_i(n)$ and $g_n$ determined by solving the simultaneous linear equations

$$g_n\, m_i^{\delta_n(i)} + v_i(n) = r_i^{\delta_n(i)} + \sum_{j \in \Omega} Q_{ij}^{\delta_n(i)}\, v_j(n).$$   (8)

Under the average criteria (where $g_n \neq 0$), an additional equation, e.g. $v_1(n) = 0$, is needed for a unique solution. Under the expected present value criterion we let $g_n = 0$.

The Convergence subroutine sets the stopping criterion of the iteration. For policy iteration, the convergence criterion is $v_i(n) = v_i(n-1)$ for all $i$, and furthermore $g_n = g_{n-1}$.

The four subroutines for the value iteration algorithm are similar to those for the policy iteration algorithm. The value of $g_n$ is determined indirectly, implying that when using Eqs. (7) and (8) we set $g_n = 0$. The convergence criterion needs to be modified, however, because value iteration guarantees only an approximately optimal solution, as opposed to policy iteration, which guarantees an optimal solution within a finite number of iterations.

Optimisation in an MLP

We showed above that an MLP is an SMDP with known parameters. In principle, therefore, we can use Algorithm 1 directly. Because the parameters are known, the DetermineValue subroutine is performed easily. The DeterminePolicy subroutine, however, is more problematic, because a new policy is determined state by state by successively evaluating Eq. (7) for each action $d \in D$ and afterwards choosing the action that maximises the expression. An action $d$ in the MLP, however, is an entire strategy $q$ in the TEMLIMID $L(\lambda,g)$. Evaluating all actions by Eq. (7) is prohibitive, because the number of possible strategies is high (although finite) in most cases. We therefore need another way to determine a strategy that maximises the right-hand side of Eq. (7). We can use SINGLE POLICY UPDATING in the TEMLIMID to determine such a strategy. For an MLP, therefore, we define the DeterminePolicy subroutine in terms of the TEMLIMID $L(\lambda,g)$:

DeterminePolicy: An improved policy is found as follows.
1. Increment the iteration counter $n$ by 1.
2. Add the terminal utility node $u_T$ to $L(\lambda,g)$ to form $L^*(\lambda,g)$.
3. Set the utility function of the terminal utility node $u_T$ equal to the previous value of the value function, i.e. $U_{u_T}(x_{\mathrm{pa}(u_T) \setminus \{s_O\}}, s) = v_s(n-1)$, $s \in \mathcal{X}_{s_O}$.
4. Depending on the criterion:
   (a) For the expected present value criterion, modify the utilities of $L^*(\lambda,g)$ as described in Eq. (1) to obtain the LIMID $L^*_\lambda$.
   (b) For average rewards over time, modify the utilities of $L^*(\lambda,g)$ as described in Eq. (2) to obtain the LIMID $L^*_g$.
5. Depending on the criterion:
   (a) For the expected present value criterion, perform SINGLE POLICY UPDATING on the LIMID $L^*_\lambda$ to obtain a local maximum strategy $q_n$.
   (b) For average rewards over time, perform SINGLE POLICY UPDATING on the LIMID $L^*_g$ to obtain a local maximum strategy $q_n$.
6. Define the new policy $\delta_n$ of the MLP as $\delta_n = q_n$.

The theorem below assures the correctness of the DeterminePolicy subroutine defined for an MLP.

Theorem 3 For an MLP in which a decision epoch is defined by the soluble TEMLIMID $L(\lambda,g)$, the DeterminePolicy subroutine defined above returns a policy maximising the expression on the right-hand side of Eq. (7).

Proof: We first prove the theorem for the expected present value criterion. Recall that $L(\lambda,g)$ (and hence $L^*(\lambda,g)$) is soluble. It follows directly that the strategy $q_n$ satisfies

$$EDU_{L^*(\lambda,g)}(q_n \mid X_{s_I} = s) = \max_q EDU_{L^*(\lambda,g)}(q \mid X_{s_I} = s).$$

Let $\varphi(q_n \mid s) = EDU_{L^*(\lambda,g)}(q_n \mid X_{s_I} = s)$. Because utilities are considered as additive in LIMIDs, we have

$$\begin{aligned}
\varphi(q_n \mid s) &= \max_q \Bigl\{ EDU_{L(\lambda,g)}(q \mid X_{s_I} = s) + E_q\bigl( \lambda^{T_{u_T}(X_{\mathrm{pa}(u_T)})}\, U_{u_T}(X_{\mathrm{pa}(u_T)}) \bigm| X_{s_I} = s \bigr) \Bigr\} \\
&= \max_q \Bigl\{ u(s,q) + \sum_{x_{s_O}} \sum_{x_{V \setminus \{s_O, s_I\}}} \lambda^{T_{u_T}(x_{\mathrm{pa}(u_T)})}\, U_{u_T}(x_{\mathrm{pa}(u_T)})\, f_q(x_{V \setminus \{s_I\}} \mid s) \Bigr\} \\
&= \max_q \Bigl\{ u(s,q) + \sum_{x_{s_O}} \sum_{x_{V \setminus \{s_O, s_I\}}} \lambda^{T_{u_T}(x_{\mathrm{pa}(u_T)})}\, f_q(x_{V \setminus \{s_I\}} \mid s)\, U_{u_T}(x_{\mathrm{pa}(u_T) \setminus \{s_O\}}, x_{s_O}) \Bigr\} \\
&= \max_q \Bigl\{ u(s,q) + \sum_{x_{s_O}} Q(x_{s_O} \mid s, q)\, v_{x_{s_O}}(n-1) \Bigr\}.
\end{aligned}$$

Comparing with Eq. (7) and interpreting the MLP as an SMDP as above, we see that, for a soluble LIMID, the SINGLE POLICY UPDATING technique can be used as a method

to update the policy in the DeterminePolicy subroutine under the expected present value criterion, where always $g_n = 0$.

For the average rewards over time criterion, we similarly define the function $\varphi(q_n \mid s) = ERU_{L^*(\lambda,g)}(q_n \mid X_{s_I} = s)$. Because utilities are considered as additive in LIMIDs and because $L(\lambda,g)$ is soluble, we obtain

$$\varphi(q_n \mid s) = \max_q ERU_{L^*(\lambda,g)}(q \mid X_{s_I} = s) = \max_q \Bigl\{ ERU_{L(\lambda,g)}(q \mid X_{s_I} = s) + E_q\bigl( U_{u_T}(X_{\mathrm{pa}(u_T)}) \bigm| X_{s_I} = s \bigr) \Bigr\}.$$

Again using the additivity property,

$$ERU_{L(\lambda,g)}(q \mid X_{s_I} = s) = EU_{L(\lambda,g)} - g\, ED_{L(\lambda,g)} = u(s,q) - g\, m(s,q),$$

where $ED$ denotes the expected duration, as defined above. Similarly, for the second expression,

$$E_q\bigl( U_{u_T}(X_{\mathrm{pa}(u_T)}) \bigm| X_{s_I} = s \bigr) = \sum_{x_{s_O}} P_s(x_{s_O} \mid s, q)\, v_{x_{s_O}}(n-1).$$

Thus,

$$\varphi(q_n \mid s) = \max_q \Bigl\{ u(s,q) - g\, m(s,q) + \sum_{x_{s_O}} P_s(x_{s_O} \mid s, q)\, v_{x_{s_O}}(n-1) \Bigr\}.$$

Under the average rewards over time criterion, $\lambda = 1$, so we have $P_s(x_{s_O} \mid s, q) = Q(x_{s_O} \mid s, q)$. We have thus confirmed that the SINGLE POLICY UPDATING technique can also be used as a method to update the policy in the DeterminePolicy subroutine.

Implementation of the algorithm

Software systems used

The concept of an MLP has been implemented within the framework of the MLHMP software (Kristensen). It is a general, Java-based software system for multi-level hierarchical Markov processes (Kristensen and Jørgensen). The MLP concept has been integrated into the MLHMP framework by use of the Esthauge LIMID Software System, which is also Java based and is a general software system for Bayesian networks and LIMIDs. The Esthauge system comes with the SINGLE POLICY UPDATING algorithm already implemented. The software system, furthermore, is able to check a LIMID for solubility and to remove non-requisite informational edges. It has a graphical user interface for browsing the directed acyclic graph and for editing models. Fig. (a) shows the PIGS example as it is displayed in the Esthauge LIMID Software System.

A more general form of the MLP concept than that presented here has been implemented. In that form, for any decision epoch of a Markov decision process to be represented

as a TEMLIMID, it is enough that the state spaces of the input and output nodes match the state spaces of the present and the following decision epoch of the Markov decision process, respectively. Each of the three criteria of optimality defined above has been implemented.

Fig.: The PIGS example as it is displayed in the Esthauge LIMID Software System. The three node types (chance, decision, and utility nodes) are distinguished by colours in the user interface instead of shapes (here translated to different shades of grey, chance nodes the lightest and decision nodes the darkest). The figure to the left (a) shows the LIMID version and the one to the right (b) shows the ID version, with all informational arcs added to satisfy the no-forgetting assumption.

Example

The PIGS example has been implemented as a plug-in to the MLHMP software system (Kristensen). If an exact solution is required, informational edges must be added as shown in Fig. (b). The TEMLIMID then becomes soluble, and convergence of the optimisation algorithm is guaranteed according to Theorem 3; the resulting policy is optimal. We illustrate the optimisation for the soluble version under the criterion of average rewards over time. Four iterations were needed to find an optimal policy. The convergence of the value functions $v_i(n)$ and $g_n$ is shown in the table below. The optimal strategy determined by the policy iteration algorithm, using SINGLE POLICY UPDATING in the TEMLIMID, is to vaccinate pigs from B and, as expected, to leave pigs from A untreated.

Table: Convergence of the value functions for the PIGS example (soluble version) using policy iteration under the criterion of average rewards over time. The relative value 0 for Unit B is arbitrary.

                                              Iteration
Macro state                                   n = 1    n = 2    n = 3    n = 4
Unit A (relative value), $v_A(n)$
Unit B (relative value), $v_B(n)$             0        0        0        0
Average rewards over time (DKK/month), $g_n$
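The policy-iteration scheme used to produce the table above can be sketched generically for a small SMDP with explicitly enumerated actions. The sketch below is ours: under the average-rewards-over-time criterion it alternates the DeterminePolicy step of Eq. (7) with the linear system of Eq. (8) plus a normalisation, as described earlier. In the MLP itself the argmax over actions is replaced by single-policy updating in the TEMLIMID, and all numbers below are invented placeholders, not the values of the PIGS example.

```python
# Generic policy iteration for a small SMDP under the average-reward criterion.
# r[i, d], m[i, d] and P[d, i, j] are expected reward, expected duration and
# transition probabilities; in the MLP they come from the TEMLIMID.
import numpy as np

def policy_iteration_avg(r, m, P, max_iter=100):
    n_states, n_actions = r.shape
    v, g = np.zeros(n_states), 0.0
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        # DeterminePolicy: maximise r - g*m + sum_j P*v for every state (Eq. 7)
        scores = r - g * m + np.einsum("dij,j->id", P, v)
        new_policy = scores.argmax(axis=1)
        # DetermineValue: solve g*m_i + v_i = r_i + sum_j P_ij v_j (Eq. 8)
        A = np.zeros((n_states + 1, n_states + 1))
        b = np.zeros(n_states + 1)
        for i, d in enumerate(new_policy):
            A[i, :n_states] = -P[d, i]
            A[i, i] += 1.0
            A[i, n_states] = m[i, d]
            b[i] = r[i, d]
        A[n_states, 0] = 1.0          # normalisation: the first relative value is 0
        sol = np.linalg.solve(A, b)
        new_v, new_g = sol[:n_states], sol[n_states]
        if (np.array_equal(new_policy, policy)
                and np.allclose(new_v, v) and np.isclose(new_g, g)):
            return new_policy, new_v, new_g
        policy, v, g = new_policy, new_v, new_g
    return policy, v, g

# Placeholder two-macro-state, two-action example (values invented):
r = np.array([[500.0, 650.0], [400.0, 600.0]])
m = np.array([[3.0, 4.0], [3.0, 4.0]])
P = np.array([[[0.0, 1.0], [1.0, 0.0]],      # action 0
              [[0.3, 0.7], [0.6, 0.4]]])     # action 1
print(policy_iteration_avg(r, m, P))
```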


More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

10/3/2018. Our main example: SimFlock. Breeding animals Hens & Cocks

10/3/2018. Our main example: SimFlock. Breeding animals Hens & Cocks What is simulation? Monte Carlo Simulation I Anders Ringgaard Kristensen Simulation is an attempt to model a real world system in order to: Obtain a better understanding of the system (including interactions)

More information

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012 CSE 573: Artificial Intelligence Autumn 2012 Reasoning about Uncertainty & Hidden Markov Models Daniel Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1 Outline

More information

2534 Lecture 4: Sequential Decisions and Markov Decision Processes

2534 Lecture 4: Sequential Decisions and Markov Decision Processes 2534 Lecture 4: Sequential Decisions and Markov Decision Processes Briefly: preference elicitation (last week s readings) Utility Elicitation as a Classification Problem. Chajewska, U., L. Getoor, J. Norman,Y.

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

15-780: Graduate Artificial Intelligence. Reinforcement learning (RL)

15-780: Graduate Artificial Intelligence. Reinforcement learning (RL) 15-780: Graduate Artificial Intelligence Reinforcement learning (RL) From MDPs to RL We still use the same Markov model with rewards and actions But there are a few differences: 1. We do not assume we

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

RL 14: POMDPs continued

RL 14: POMDPs continued RL 14: POMDPs continued Michael Herrmann University of Edinburgh, School of Informatics 06/03/2015 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams)

Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams) Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams) Ross D. Shachter Engineering-Economic Systems and Operations Research

More information

Infinite-Horizon Discounted Markov Decision Processes

Infinite-Horizon Discounted Markov Decision Processes Infinite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1 Outline The expected

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

CS 7180: Behavioral Modeling and Decisionmaking

CS 7180: Behavioral Modeling and Decisionmaking CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and

More information

Bayesian networks II. Model building. Anders Ringgaard Kristensen

Bayesian networks II. Model building. Anders Ringgaard Kristensen Bayesian networks II. Model building Anders Ringgaard Kristensen Outline Determining the graphical structure Milk test Mastitis diagnosis Pregnancy Determining the conditional probabilities Modeling methods

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

Decision Theory: Q-Learning

Decision Theory: Q-Learning Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Computer Science CPSC 322. Lecture 23 Planning Under Uncertainty and Decision Networks

Computer Science CPSC 322. Lecture 23 Planning Under Uncertainty and Decision Networks Computer Science CPSC 322 Lecture 23 Planning Under Uncertainty and Decision Networks 1 Announcements Final exam Mon, Dec. 18, 12noon Same general format as midterm Part short questions, part longer problems

More information

CS 4100 // artificial intelligence. Recap/midterm review!

CS 4100 // artificial intelligence. Recap/midterm review! CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

What is it all about? Introduction to Bayesian Networks. Method to reasoning under uncertainty. Where we reason using probabilities

What is it all about? Introduction to Bayesian Networks. Method to reasoning under uncertainty. Where we reason using probabilities What is it all about? Introduction to ayesian Networks Method to reasoning under uncertainty dvanced Herd Management 28th of september 2009 Where we reason using probabilities Tina irk Jensen Reasoning

More information

Solving Hybrid Influence Diagrams with Deterministic Variables

Solving Hybrid Influence Diagrams with Deterministic Variables 322 LI & SHENOY UAI 2010 Solving Hybrid Influence Diagrams with Deterministic Variables Yijing Li and Prakash P. Shenoy University of Kansas, School of Business 1300 Sunnyside Ave., Summerfield Hall Lawrence,

More information

Optimal Stopping Problems

Optimal Stopping Problems 2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating

More information

Analysis of Algorithms. Outline. Single Source Shortest Path. Andres Mendez-Vazquez. November 9, Notes. Notes

Analysis of Algorithms. Outline. Single Source Shortest Path. Andres Mendez-Vazquez. November 9, Notes. Notes Analysis of Algorithms Single Source Shortest Path Andres Mendez-Vazquez November 9, 01 1 / 108 Outline 1 Introduction Introduction and Similar Problems General Results Optimal Substructure Properties

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Probabilistic Model Checking Michaelmas Term Dr. Dave Parker. Department of Computer Science University of Oxford

Probabilistic Model Checking Michaelmas Term Dr. Dave Parker. Department of Computer Science University of Oxford Probabilistic Model Checking Michaelmas Term 20 Dr. Dave Parker Department of Computer Science University of Oxford Overview PCTL for MDPs syntax, semantics, examples PCTL model checking next, bounded

More information

Applying Bayesian networks in the game of Minesweeper

Applying Bayesian networks in the game of Minesweeper Applying Bayesian networks in the game of Minesweeper Marta Vomlelová Faculty of Mathematics and Physics Charles University in Prague http://kti.mff.cuni.cz/~marta/ Jiří Vomlel Institute of Information

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Decision Graphs - Influence Diagrams. Rudolf Kruse, Pascal Held Bayesian Networks 429

Decision Graphs - Influence Diagrams. Rudolf Kruse, Pascal Held Bayesian Networks 429 Decision Graphs - Influence Diagrams Rudolf Kruse, Pascal Held Bayesian Networks 429 Descriptive Decision Theory Descriptive Decision Theory tries to simulate human behavior in finding the right or best

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

An Introduction to Markov Decision Processes. MDP Tutorial - 1

An Introduction to Markov Decision Processes. MDP Tutorial - 1 An Introduction to Markov Decision Processes Bob Givan Purdue University Ron Parr Duke University MDP Tutorial - 1 Outline Markov Decision Processes defined (Bob) Objective functions Policies Finding Optimal

More information

Decayed Markov Chain Monte Carlo for Interactive POMDPs

Decayed Markov Chain Monte Carlo for Interactive POMDPs Decayed Markov Chain Monte Carlo for Interactive POMDPs Yanlin Han Piotr Gmytrasiewicz Department of Computer Science University of Illinois at Chicago Chicago, IL 60607 {yhan37,piotr}@uic.edu Abstract

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Chapter 16 focused on decision making in the face of uncertainty about one future

Chapter 16 focused on decision making in the face of uncertainty about one future 9 C H A P T E R Markov Chains Chapter 6 focused on decision making in the face of uncertainty about one future event (learning the true state of nature). However, some decisions need to take into account

More information

Planning Under Uncertainty: Structural Assumptions and Computational Leverage

Planning Under Uncertainty: Structural Assumptions and Computational Leverage Planning Under Uncertainty: Structural Assumptions and Computational Leverage Craig Boutilier Dept. of Comp. Science Univ. of British Columbia Vancouver, BC V6T 1Z4 Tel. (604) 822-4632 Fax. (604) 822-5485

More information

Infinite-Horizon Average Reward Markov Decision Processes

Infinite-Horizon Average Reward Markov Decision Processes Infinite-Horizon Average Reward Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Average Reward MDP 1 Outline The average

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Practicable Robust Markov Decision Processes

Practicable Robust Markov Decision Processes Practicable Robust Markov Decision Processes Huan Xu Department of Mechanical Engineering National University of Singapore Joint work with Shiau-Hong Lim (IBM), Shie Mannor (Techion), Ofir Mebel (Apple)

More information

RL 14: Simplifications of POMDPs

RL 14: Simplifications of POMDPs RL 14: Simplifications of POMDPs Michael Herrmann University of Edinburgh, School of Informatics 04/03/2016 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

Advanced Herd Management Probabilities and distributions

Advanced Herd Management Probabilities and distributions Advanced Herd Management Probabilities and distributions Anders Ringgaard Kristensen Slide 1 Outline Probabilities Conditional probabilities Bayes theorem Distributions Discrete Continuous Distribution

More information

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games International Journal of Fuzzy Systems manuscript (will be inserted by the editor) A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games Mostafa D Awheda Howard M Schwartz Received:

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed

More information

Discrete-Time Markov Decision Processes

Discrete-Time Markov Decision Processes CHAPTER 6 Discrete-Time Markov Decision Processes 6.0 INTRODUCTION In the previous chapters we saw that in the analysis of many operational systems the concepts of a state of a system and a state transition

More information

Influence Diagrams with Memory States: Representation and Algorithms

Influence Diagrams with Memory States: Representation and Algorithms Influence Diagrams with Memory States: Representation and Algorithms Xiaojian Wu, Akshat Kumar, and Shlomo Zilberstein Computer Science Department University of Massachusetts Amherst, MA 01003 {xiaojian,akshat,shlomo}@cs.umass.edu

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Bayesian networks in Mastermind

Bayesian networks in Mastermind Bayesian networks in Mastermind Jiří Vomlel http://www.utia.cas.cz/vomlel/ Laboratory for Intelligent Systems Inst. of Inf. Theory and Automation University of Economics Academy of Sciences Ekonomická

More information

Markov Decision Processes

Markov Decision Processes Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour

More information

Partially Observable Markov Decision Processes (POMDPs)

Partially Observable Markov Decision Processes (POMDPs) Partially Observable Markov Decision Processes (POMDPs) Sachin Patil Guest Lecture: CS287 Advanced Robotics Slides adapted from Pieter Abbeel, Alex Lee Outline Introduction to POMDPs Locally Optimal Solutions

More information

Markov Decision Processes Infinite Horizon Problems

Markov Decision Processes Infinite Horizon Problems Markov Decision Processes Infinite Horizon Problems Alan Fern * * Based in part on slides by Craig Boutilier and Daniel Weld 1 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T)

More information

c 2011 Nisha Somnath

c 2011 Nisha Somnath c 2011 Nisha Somnath HIERARCHICAL SUPERVISORY CONTROL OF COMPLEX PETRI NETS BY NISHA SOMNATH THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Aerospace

More information

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational

More information

Simulation Study on Heterogeneous Variance Adjustment for Observations with Different Measurement Error Variance

Simulation Study on Heterogeneous Variance Adjustment for Observations with Different Measurement Error Variance Simulation Study on Heterogeneous Variance Adjustment for Observations with Different Measurement Error Variance Pitkänen, T. 1, Mäntysaari, E. A. 1, Nielsen, U. S., Aamand, G. P 3., Madsen 4, P. and Lidauer,

More information

Monitoring and data filtering II. Dynamic Linear Models

Monitoring and data filtering II. Dynamic Linear Models Monitoring and data filtering II. Dynamic Linear Models Advanced Herd Management Cécile Cornou, IPH Dias 1 Program for monitoring and data filtering Friday 26 (morning) - Lecture for part I : use of control

More information

Motivation for introducing probabilities

Motivation for introducing probabilities for introducing probabilities Reaching the goals is often not sufficient: it is important that the expected costs do not outweigh the benefit of reaching the goals. 1 Objective: maximize benefits - costs.

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

Tractable Inference in Hybrid Bayesian Networks with Deterministic Conditionals using Re-approximations

Tractable Inference in Hybrid Bayesian Networks with Deterministic Conditionals using Re-approximations Tractable Inference in Hybrid Bayesian Networks with Deterministic Conditionals using Re-approximations Rafael Rumí, Antonio Salmerón Department of Statistics and Applied Mathematics University of Almería,

More information

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes RAÚL MONTES-DE-OCA Departamento de Matemáticas Universidad Autónoma Metropolitana-Iztapalapa San Rafael

More information

Best Guaranteed Result Principle and Decision Making in Operations with Stochastic Factors and Uncertainty

Best Guaranteed Result Principle and Decision Making in Operations with Stochastic Factors and Uncertainty Stochastics and uncertainty underlie all the processes of the Universe. N.N.Moiseev Best Guaranteed Result Principle and Decision Making in Operations with Stochastic Factors and Uncertainty by Iouldouz

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

THE VINE COPULA METHOD FOR REPRESENTING HIGH DIMENSIONAL DEPENDENT DISTRIBUTIONS: APPLICATION TO CONTINUOUS BELIEF NETS

THE VINE COPULA METHOD FOR REPRESENTING HIGH DIMENSIONAL DEPENDENT DISTRIBUTIONS: APPLICATION TO CONTINUOUS BELIEF NETS Proceedings of the 00 Winter Simulation Conference E. Yücesan, C.-H. Chen, J. L. Snowdon, and J. M. Charnes, eds. THE VINE COPULA METHOD FOR REPRESENTING HIGH DIMENSIONAL DEPENDENT DISTRIBUTIONS: APPLICATION

More information

A Nonlinear Predictive State Representation

A Nonlinear Predictive State Representation Draft: Please do not distribute A Nonlinear Predictive State Representation Matthew R. Rudary and Satinder Singh Computer Science and Engineering University of Michigan Ann Arbor, MI 48109 {mrudary,baveja}@umich.edu

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

Markov decision processes

Markov decision processes CS 2740 Knowledge representation Lecture 24 Markov decision processes Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Administrative announcements Final exam: Monday, December 8, 2008 In-class Only

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Explanation of Bayesian networks and influence diagrams in Elvira

Explanation of Bayesian networks and influence diagrams in Elvira JOURNAL OF L A TEX CLASS FILES, VOL. 1, NO. 11, NOVEMBER 2002 1 Explanation of Bayesian networks and influence diagrams in Elvira Carmen Lacave, Manuel Luque and Francisco Javier Díez Abstract Bayesian

More information

Introduction to Reinforcement Learning Part 1: Markov Decision Processes

Introduction to Reinforcement Learning Part 1: Markov Decision Processes Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for

More information

Artificial Intelligence & Sequential Decision Problems

Artificial Intelligence & Sequential Decision Problems Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet

More information

A Class of Star-Algebras for Point-Based Qualitative Reasoning in Two- Dimensional Space

A Class of Star-Algebras for Point-Based Qualitative Reasoning in Two- Dimensional Space From: FLAIRS- Proceedings. Copyright AAAI (www.aaai.org). All rights reserved. A Class of Star-Algebras for Point-Based Qualitative Reasoning in Two- Dimensional Space Debasis Mitra Department of Computer

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)

More information

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION BY PING HOU A dissertation submitted to the Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy Major Subject:

More information

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty 2011 IEEE International Conference on Robotics and Automation Shanghai International Conference Center May 9-13, 2011, Shanghai, China A Decentralized Approach to Multi-agent Planning in the Presence of

More information