Markov LIMID Processes for Representing and Solving Renewal Problems


Final preprint (accepted version) of an article published in Annals of Operations Research. Please cite as: Jørgensen, E., A.R. Kristensen & D. Nilsson. Markov LIMID Processes for Representing and Solving Renewal Problems. Annals of Operations Research. DOI: 0.00/s0-0-0-

Markov LIMID Processes for Representing and Solving Renewal Problems

Erik Jørgensen · Anders R. Kristensen · Dennis Nilsson

Received: date / Accepted: date

Abstract In this paper a new tool for simultaneous optimisation of decisions on multiple time scales is presented. The tool combines the dynamic properties of Markov decision processes with the flexible and compact state space representation of Limited Memory Influence Diagrams (LIMIDs). A temporal version of LIMIDs, TEMLIMIDs, is defined by adding time-related functions to utility nodes. As a result, expected discounted utility, as well as expected relative utility, might be used as optimisation criteria in TEMLIMIDs. Optimisation proceeds as in ordinary LIMIDs. A sequence of such TEMLIMIDs can be used to model a Markov LIMID Process, where each TEMLIMID represents a macro action. Algorithms are presented to find optimal plans for a sequence of such macro actions. Use of the algorithms is illustrated by an extended version of an example from pig production originally used to introduce the LIMID concept.

Keywords Markov Decision Process, MDP, Multilevel hierarchical Markov Process, MLHMP, Limited Memory Influence Diagram, LIMID, pig production

This research was carried out as part of Dina, Danish Informatics Network in the Agricultural Sciences.

Erik Jørgensen, Department of Animal Science, Faculty of Science and Technology, Aarhus University, Blichers Allé, Postbox, DK- Tjele, Denmark. erik.jorgensen@agrsci.dk

Anders R. Kristensen, Department of Large Animal Science, Faculty of Life Sciences, University of Copenhagen, Grønnegårdsvej, DK- Frederiksberg C, Denmark. ark@sund.ku.dk

Dennis Nilsson, University of Copenhagen; present address: Nordic Director, Pricing and Business Intelligence at RSA, Gl. Kongevej, DK- Copenhagen, Denmark.

Introduction

Inspired by the problem of maintenance and replacement in livestock production, Kristensen introduced the notion of hierarchical semi-Markov decision processes, which were structured with two levels: the main level holding one main process and the sub-level holding a series of consecutive sub-processes. The stages of the main process corresponded to animals successively replacing each other in an infinite chain of stages. Transition to a new

stage corresponded to a replacement. Each animal was, furthermore, modeled at the sub-level by a separate finite-time semi-Markov decision process representing the states and decisions corresponding to daily maintenance. In livestock production, large models (Houben et al; Verstegen et al; Mourits et al) have been developed using this hierarchical technique, which has had a positive effect on the efficiency of optimisation. Later, the notion was developed further (Kristensen and Jørgensen) into multi-level hierarchical Markov decision processes (MLHMDP) to allow simultaneous optimisation of decisions on multiple time scales. The two-level concept was extended via a founder level process (corresponding to the main process) and nested child processes (corresponding to the sub-processes) to handle several levels in the hierarchy. As an example, Kristensen and Søllested implemented a sow replacement model using MLHMDP, where the child processes modeled very different processes. Each child process included several types of decisions, in addition to the replacement decision, and each child process was still modeled as a semi-Markov decision process. It is of interest, however, to explore whether the newer modeling techniques based on graphical modeling can be used for this purpose.

Often the state of the system (e.g. an animal) being modeled in Markov decision processes (MDPs) is defined by the values of state variables, each representing a trait of the system. The Cartesian product of the value sets of the individual state variables is defined as the state space. A more structured description of the state space, however, can usually be made. The structured description leads to a more efficient representation of the transition probabilities between states of consecutive stages. Such efficient representations may be obtained using Influence Diagrams (IDs) (Howard and Matheson), which are representations of decision problems with uncertainty. Originally used as representations of decision trees, IDs are now usually seen as natural extensions of Bayesian networks. As with Bayesian networks, there exist local computation algorithms to find optimal strategies within IDs (Olmsted; Shachter; Shenoy; Jensen et al). Such IDs consist of a Bayesian network augmented with decision and utility nodes, together with a protocol that identifies which variables are known to the decision maker when decisions are made (Lauritzen). Tatman and Shachter showed that finite-length MDPs could be formulated as IDs with a restriction on the utility function of the MDP. The ID directly distinguishes between variables known and unknown to the decision maker. Thus, we have a framework for modeling with only partial observations of variables. In MDPs, it is generally assumed that variables are known and that the state space is fully observable. It is more logical, however, to distinguish between directly observable state variables, such as milk yield or litter size, and latent variables, such as genetic value. Partially Observable Markov Decision Processes (POMDPs) are a framework allowing for modeling with such latent variables (refer for instance to Lovejoy for a survey or Kaelbling et al for an introduction). IDs may be used to model such processes. The complexity of finding globally optimal strategies is daunting, however, and hinders the application of IDs to many real decision problems.
This is mainly because of the implicit no-forgetting assumption, i.e., when the decision maker makes a decision, it is assumed that all previous decisions, as well as the observations they were based on, are known. With models covering even moderate time spans, the complexity of keeping this information in the model becomes prohibitive. Thus early considerations of implementing IDs within the MLHMDP were disappointing. The statistical models underlying the MLHMDP could naturally be expressed using latent variables, and it would be natural to express them directly as IDs. Due to the complexity of the resulting model, however, it is necessary to use several tricks that are not a natural part of the modeling process (see e.g. Kristensen; Relund Nielsen et al).

The extension of IDs called LIMIDs, or LImited Memory Influence Diagrams (Lauritzen and Nilsson), relaxes the no-forgetting assumption, thus providing a computationally tractable decision problem without assuming a Markov process. In general, optimisation algorithms based on LIMIDs are approximate; in some special cases, however, they are exact. By relaxing the no-forgetting assumption, previously intractable problems can be handled, at least approximately.

In addition to the computational complexity of IDs, another problem, which the current definition of LIMIDs does not solve, is the static nature of IDs. Although decisions are ordered according to time, so that a multi-stage decision problem can be handled, time is not an integrated element of an ID. It is implicitly assumed that there is no time preference concerning rewards (or utilities, as they are denoted in IDs). Each time stage has, furthermore, to be defined explicitly in the model, as opposed to implicitly as in Markov decision processes. If we want to use LIMIDs for dynamic decision problems in herd management, we therefore need to extend the definition.

The purpose of this paper is to combine multi-level Markov processes with LIMIDs to provide a tool for dealing with dynamic decision problems at two or more levels. We will refer to this combination as Markov LIMID Processes (MLP). The MLPs integrate the dynamic nature of semi-Markov decision processes with the state space representation of LIMIDs, allowing for the use of unobservable variables. The integration is done by defining a semi-Markov decision process at the founder level, interacting with slightly modified LIMIDs at the child level. A protocol for communication between the two levels is defined, and an optimisation algorithm is presented and proved. The potential of the combined method is, furthermore, illustrated through examples related to the animal replacement problem.

The PIGS example

The PIGS example from Lauritzen and Nilsson is fictitious, but our extension will make it more realistic and closer to direct application. According to Lauritzen and Nilsson:

A pig breeder is growing pigs for a period of four months and subsequently selling them. During this period the pig may or may not develop a certain disease. If the pig has the disease at the time when it must be sold, the pig must be sold for slaughtering and its expected market price is then 300 DKK (Danish kroner). If it is disease free, its expected market price as a breeding animal is 1000 DKK. Once a month, a veterinary doctor sees the pig and makes a test for presence of the disease. If the pig is ill, the test will indicate this with probability 0.80, and if the pig is healthy, the test will indicate this with probability 0.90. At each monthly visit, the doctor may or may not treat the pig for the disease by injecting a certain drug. The cost of an injection is 100 DKK. A pig has the disease in the first month with probability 0.10. A healthy pig develops the disease in the subsequent month with probability 0.20 without injection, whereas a healthy and treated pig develops the disease with probability 0.10, so the injection has some preventive effect. An untreated pig which is unhealthy will remain so in the subsequent month with probability 0.90, whereas the similar probability is 0.50 for an unhealthy pig which is treated. Thus, although spontaneous cure is possible, treatment is beneficial on average.

We will extend the example to illustrate how to handle repeated decision making.
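For later reference, the numbers of the quoted example can be collected in a few lines of Python. This is only a bookkeeping sketch; the variable names and the dictionary layout are ours and are not part of any of the software systems mentioned later in the paper.

```python
# The quoted PIGS numbers as plain Python dictionaries (illustrative layout).

# P(h1): probability of disease in the first month
p_h1 = {"ill": 0.10, "healthy": 0.90}

# P(h_{k+1} | h_k, d_k): monthly disease dynamics, by treatment decision
p_h_next = {
    ("healthy", "no-treat"): {"ill": 0.20, "healthy": 0.80},
    ("healthy", "treat"):    {"ill": 0.10, "healthy": 0.90},
    ("ill",     "no-treat"): {"ill": 0.90, "healthy": 0.10},
    ("ill",     "treat"):    {"ill": 0.50, "healthy": 0.50},
}

# P(t_k | h_k): the monthly veterinary test
p_test = {
    "ill":     {"positive": 0.80, "negative": 0.20},
    "healthy": {"positive": 0.10, "negative": 0.90},
}

# Utilities (DKK): treatment cost and final sales price
cost_injection = -100
sales_price = {"ill": 300, "healthy": 1000}
```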

Prerequisites

The MLPs are based on a combination of the concepts of semi-Markov processes and LIMIDs. The term LIMID (LImited Memory Influence Diagram) was introduced as a modification of IDs to represent multistage decision problems and to relax the no-forgetting assumption (Nilsson and Lauritzen; Lauritzen and Nilsson). We will assume that semi-Markov processes are known, but we will provide an introduction to LIMIDs and their predecessors, IDs. For proofs, the reader is referred to the above sources.

Influence Diagrams

The original formulation of Influence Diagrams (IDs) was presented by Howard and Matheson. The main element of an ID, $L$, is a directed acyclic graph (DAG) with three sets of nodes: chance nodes $\Gamma$, decision nodes $\Delta$, and utility nodes $\Upsilon$. In Fig. (left), the PIGS problem is shown as an ID. Ovals indicate chance nodes $h_t$ and $t_t$, where $h_t$ denotes the health state of the pig at time $t$, and $t_t$ denotes the outcome of the test the veterinarian performs. Rectangles indicate decision nodes $d_t$, and diamonds indicate utility nodes; e.g., $u_t$ corresponds to the cost incurred, depending on whether the pig is treated or not, and $u_D$ is the sales price of the pig, depending on its health state. The directed edges are used for two purposes: first to indicate conditional dependency, e.g., the outcome of test $t_1$ depends directly on health state $h_1$, and second to describe informational relationships. When decision $d_1$ is made, the outcome of the first test $t_1$ is known, but neither health state $h_1$ nor the outcome of the second test $t_2$ is known. When decision $d_2$ is made, the outcome of $t_2$ is known. IDs have an implicit no-forgetting assumption, meaning that previous observations and decisions are known. Thus $t_1$ and $d_1$ are known when $d_2$ is made. The informational edges also serve the role of specifying an ordering of the outcomes and decisions. A unique ordering is necessary to solve the influence diagram. We use the convention of showing informational edges as dotted lines.

Fig.: PIGS example shown in two versions, as an ID (left) and as a LIMID (right). Ovals indicate chance nodes ($h_t$ is the health state, $t_t$ is the test outcome), rectangles indicate decision nodes, and diamonds indicate utility nodes. Informational edges are indicated by dotted lines to represent the implicit no-forgetting assumption in the ID.
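The two diagrams in the figure differ only in their informational edges. A minimal sketch of that difference, with the graphs encoded as parent sets in plain Python (an ad hoc encoding of ours, read off the figure description; node names are illustrative):

```python
# Structural parents shared by the ID and the LIMID of the PIGS problem.
parents_common = {
    "h1": [], "t1": ["h1"],
    "h2": ["h1", "d1"], "t2": ["h2"],
    "h3": ["h2", "d2"], "t3": ["h3"],
    "h4": ["h3", "d3"],
    "u1": ["d1"], "u2": ["d2"], "u3": ["d3"], "uD": ["h4"],
}

# Informational edges (parents of the decision nodes).
# LIMID version: each decision sees only the most recent test result.
info_limid = {"d1": ["t1"], "d2": ["t2"], "d3": ["t3"]}

# ID version: no-forgetting, so earlier tests and decisions are also remembered.
info_id = {
    "d1": ["t1"],
    "d2": ["t1", "d1", "t2"],
    "d3": ["t1", "d1", "t2", "d2", "t3"],
}
```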

LIMIDs

A LIMID $L$ is a directed acyclic graph (DAG) with three sets of nodes: chance nodes $\Gamma$, decision nodes $\Delta$, and utility nodes $\Upsilon$. For each node $n$ in the set $V = \Gamma \cup \Delta$, we associate a variable $X_n$ that can take values in a finite set $\mathcal{X}_n$. For a subset $A \subseteq V$, we let $X_A = (X_n)_{n \in A}$ and $\mathcal{X}_A = \times_{n \in A} \mathcal{X}_n$. Typical elements of $\mathcal{X}_A$ are denoted by $x_A, y_A$, etc. For a node $n$ in $L$, its parent set is denoted $\mathrm{pa}(n)$, the set of descendants of $n$ is denoted $\mathrm{de}(n)$, and the family of $n$ is denoted $\mathrm{fa}(n) = \mathrm{pa}(n) \cup \{n\}$.

Chance nodes, shown as ovals, represent random variables. For each chance node $r \in \Gamma$, there is a probability distribution $p_r$ defined by $p_r : \mathcal{X}_r \to [0,1]$, such that $\sum_{x_r} p_r(x_r) = 1$.

Decision nodes, shown as rectangles, represent decision variables. For each decision node $d \in \Delta$, the decision maker has the opportunity to choose a policy $\delta_d$. A policy for $d$ associates with each state $x_{\mathrm{pa}(d)}$ of $X_{\mathrm{pa}(d)}$ a probability distribution $\delta_d(\cdot \mid x_{\mathrm{pa}(d)})$ on $\mathcal{X}_d$. We assume that the states in $\mathcal{X}_{\mathrm{pa}(d)}$ are known to the decision maker when making decision $d$. For that reason, directed edges into decision nodes are termed informational edges. In contrast to IDs, only $X_{\mathrm{pa}(d)}$ is known; there is no implicit no-forgetting assumption.

Finally, utility nodes, shown as diamonds, represent utility functions. For each utility node $u$, we associate a (local) utility function $U_u$ defined by $U_u : \mathcal{X}_{\mathrm{pa}(u)} \to \mathbb{R}$.

Thus, Fig. may be considered as a graphical representation of an ID as well as of a LIMID. However, they are two different models. The right part of Fig. shows the LIMID specification that corresponds to the ID, i.e., the informational arcs representing the available information are added.

Global maximum strategies in LIMIDs

A strategy $q$ in $L$ is a set of policies, one for each decision node $d \in \Delta$, and has the form $q = \{\delta_d : d \in \Delta\}$. The strategy $q$ generates a joint distribution over all variables in $V = \Gamma \cup \Delta$ as

$$f_q = \prod_{r \in \Gamma} p_r \prod_{d \in \Delta} \delta_d.$$

The expected utility of $q$ is

$$EU(q) = \sum_{x_V} f_q(x_V) \sum_{u \in \Upsilon} U_u(x_{\mathrm{pa}(u)}) = \sum_{x_V} f_q(x_V)\, U(x_V),$$

where $U(x_V) = \sum_{u \in \Upsilon} U_u(x_{\mathrm{pa}(u)})$. A global maximum strategy $\hat{q}$ in $L$ has the property that

$$EU(\hat{q}) = \max_q EU(q).$$
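As an illustration of these definitions, the joint distribution $f_q$ and the expected utility $EU(q)$ of a given strategy can be evaluated by brute-force enumeration. The sketch below is ours and is only meant to make the product form concrete; practical implementations use local computations instead.

```python
# Brute-force evaluation of EU(q) for a small LIMID, following
# f_q = prod_r p_r * prod_d delta_d.  Nodes are visited in a topological
# order, and every conditional table is a function of the configuration.
from itertools import product

def expected_utility(nodes, states, prob_or_policy, utilities):
    """nodes: topological order of the chance and decision node names.
    states[n]: list of possible values of node n.
    prob_or_policy[n](value, config): p_r(value | parents) or delta_d(value | parents).
    utilities: list of functions U_u(config) -> float."""
    eu = 0.0
    for values in product(*(states[n] for n in nodes)):
        config = dict(zip(nodes, values))
        weight = 1.0
        for n in nodes:
            weight *= prob_or_policy[n](config[n], config)
        eu += weight * sum(U(config) for U in utilities)
    return eu

# Toy usage: one health node, one test, and one treat/no-treat decision whose
# policy depends only on the test (the numbers are the quoted PIGS ones).
nodes = ["h1", "t1", "d1"]
states = {"h1": ["ill", "healthy"], "t1": ["pos", "neg"], "d1": ["treat", "no"]}
prob_or_policy = {
    "h1": lambda v, c: {"ill": 0.10, "healthy": 0.90}[v],
    "t1": lambda v, c: {"ill": {"pos": 0.80, "neg": 0.20},
                        "healthy": {"pos": 0.10, "neg": 0.90}}[c["h1"]][v],
    "d1": lambda v, c: 1.0 if v == ("treat" if c["t1"] == "pos" else "no") else 0.0,
}
utilities = [lambda c: -100.0 if c["d1"] == "treat" else 0.0]
print(expected_utility(nodes, states, prob_or_policy, utilities))  # approx -17.0
```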

Other aspects of LIMIDs

In the process of defining LIMIDs, new features of decision problems were identified and new computational strategies were developed. Probably the most important development was a new iterative approach for the evaluation of LIMIDs based on local computations, single-policy updating. It was shown that it was possible to reduce the LIMID before the optimisation, by algorithmically identifying information that was not required for the decisions. Finally, it was shown that IDs could be considered a subclass of what was called soluble LIMIDs. Soluble LIMIDs can be solved exactly by using the single-policy updating algorithm. Important aspects of these developments are presented in the following sections.

Single-Policy Updating for LIMIDs

Single-Policy Updating is an iterative procedure to evaluate LIMIDs. The procedure starts with an initial strategy $q^0$ and proceeds by modifying (updating) individual policies in a given order. If the current strategy is $q^i = \{\delta^i_d : d \in \Delta\}$ and the policy of decision $d$ is to be modified, then the following steps are performed:

Retract: Retract policy $\delta^i_d$ from $q^i$ to obtain $q^i_{-d} := q^i \setminus \{\delta^i_d\}$.

Optimise: Compute a new policy $\delta^*_d$ for $d$ by $\delta^*_d = \arg\max_{\delta_d} EU(q^i_{-d} \cup \{\delta_d\})$.

Replace: Let $q^{i+1} = q^i_{-d} \cup \{\delta^*_d\}$.

In this way, policies are updated until no single policy modification can increase the expected utility. Such a strategy is termed a local maximum strategy. Under some conditions, the local maximum strategy obtained is a global maximum strategy, as described in the next section.

Soluble LIMIDs

We start this section by introducing a bit of notation. For subsets $A$, $B$, and $S$ of nodes in $L$, let the symbolic expression $A \perp_L B \mid S$ denote that $A$ and $B$ are d-separated by $S$ in the DAG formed by the nodes in $L$, including utility nodes. A decision node $d_0$ is said to be extremal whenever the following d-separation holds for all utility nodes $u \in \Upsilon$:

$$u \perp_L \Bigl(\bigcup_{d \in \Delta \setminus \{d_0\}} \mathrm{fa}(d)\Bigr) \Bigm| \mathrm{fa}(d_0).$$

The decision nodes in $L$ have an exact solution ordering $d_1, \ldots, d_k$ if, for all $i$, $d_i$ is extremal when $d_{i+1}, \ldots, d_k$ have been converted into chance nodes. Further, $L$ is soluble if the decision nodes in $L$ obey an exact solution ordering. Soluble LIMIDs have an important property, as stated in the following theorem. Here, a uniform strategy $\bar{q}$ is given by $\bar{q} = \{\bar{\delta}_d : d \in \Delta\}$, where $\bar{\delta}_d$ is the uniform distribution over $\mathcal{X}_d$.

Theorem 1 Suppose we are given a LIMID $L$ whose decision nodes have an exact solution ordering $d_1, \ldots, d_k$. If we perform Single-Policy Updating on $L$, starting from the uniform strategy and using the updating order $d_k, \ldots, d_1$, then after one update of each policy we obtain a global maximum strategy.

A large subset of the soluble LIMIDs is the set of IDs. So, the above theorem implies that a global maximum strategy for an ID can be achieved by applying Single-Policy Updating, which has led to more efficient algorithms for solving IDs (Nilsson and Lauritzen; Madsen and Nilsson). A crucial step in these algorithms consists of removing informational arcs in $L$ that are not relevant to the decisions, i.e., they are non-requisite. The notion of non-requisiteness is described in the next section.

Non-requisite information in LIMIDs

Before applying Single-Policy Updating on a given LIMID $L$, it is wise to remove non-requisite informational edges. In this section we define the notion of non-requisiteness and show results that reduce the complexity of Single-Policy Updating. A parent $n$ of a decision node $d$ is non-requisite for $d$ if the following d-separation holds:

$$n \perp_L (\Upsilon \cap \mathrm{de}(d)) \mid (\mathrm{fa}(d) \setminus \{n\}).$$

As a corollary, we say that $n$ is requisite for $d$ if $n$ is not non-requisite for $d$. Similarly for informational edges: if $n$ is non-requisite for $d$, the directed edge from $n$ into $d$ is non-requisite. If a new LIMID $L'$ is obtained by successively removing non-requisite informational arcs from $L$, then $L'$ is said to be a reduction of $L$. There exists a unique reduction of $L$, called the minimal reduction, with the property that all informational arcs are requisite. The minimal reduction of $L$ is denoted $L_{\min}$. Clearly, a reduction $L'$ is less complex than $L$, because it has fewer edges. The following theorem shows that reductions have other nice properties:

Theorem 2 Suppose $L'$ is a reduction of $L$. Then the following conditions hold:
1. (Preserves solubility) If $L$ is soluble, then $L'$ is also soluble.
2. (Preserves local maxima) If $q$ is a local maximum strategy for $L'$, then $q$ is also a local maximum strategy for $L$.

As a consequence of the above theorem, a global maximum strategy for a reduction $L'$ of $L$ is also a global maximum strategy for $L$.

Temporal LIMIDs

This section presents the necessary extension of LIMIDs to account for the dynamic nature of the decision problem. A temporal LIMID, or TEMLIMID, is a simple generalisation of LIMIDs that allows for the possibility of defining additional properties of the utilities. The additional properties might include the time of achieving the utility, the duration covered by the utility, or any kind of physical property (e.g. the litter size of a sow or the milk yield of a cow). A TEMLIMID consists of the same three sets of nodes as a LIMID: a set $\Gamma$ of chance nodes, a set $\Delta$ of decision nodes, and a set $\Upsilon$ of utility nodes. Informational edges, furthermore, have the same semantics as in ordinary LIMIDs. So, in TEMLIMIDs we also allow for the possibility of representing decision problems where decisions are taken with limited memory.

Utility nodes in TEMLIMIDs

To represent additional properties of the utility, we associate a number of optional functions with each utility node $u$. The number of functions depends on two aspects: the problem and the criterion of optimality to be used in the optimisation. For convenience, however, we will assume that three functions are available for a utility node $u$: a utility function $U_u : \mathcal{X}_{\mathrm{pa}(u)} \to \mathbb{R}$, a time function $T_u : \mathcal{X}_{\mathrm{pa}(u)} \to \mathbb{R}_+ \cup \{0, \infty\}$, and a duration function $O_u : \mathcal{X}_{\mathrm{pa}(u)} \to \mathbb{R}_+ \cup \{0\}$.

The utility function has the same meaning as in LIMIDs, whereas the time function specifies the time until the utility is given. If $U_u(x_{\mathrm{pa}(u)})$ is never paid, then $T_u(x_{\mathrm{pa}(u)}) = \infty$, but if $U_u(x_{\mathrm{pa}(u)})$ is paid immediately, then $T_u(x_{\mathrm{pa}(u)}) = 0$. The duration function specifies the time interval that the utility covers, but a similar definition can be used for any quantity associated with the utility, e.g., a physical output. In this paper, we will restrict the definition of the function to duration.

The time function is used when the expected present value of the TEMLIMID is maximised instead of the expected total value. In that case, utilities are discounted by a discount factor $\lambda \in \,]0,1[$; that is, a utility of $u$ units paid after $t$ time units has a discounted value of $\tilde{u} = u \cdot \lambda^t$ units. The duration function is used to express the deviation of the utility from a specified norm $g$ relative to the duration. Thus, if a utility of $u$ units is paid and a duration of $o$ units is involved, then the revised utility becomes $\tilde{u} = u - g \cdot o$; if $u/o = g$, then $\tilde{u} = 0$. Thus, a TEMLIMID consists of a LIMID $L$, its discount factor $\lambda$, and a norm $g$. We will use the notation $L(\lambda, g)$ to indicate this.

Example: PIGS as a TEMLIMID

The PIGS example shown in Fig. can readily be extended to a TEMLIMID. The utility assigned to $u_1, u_2, u_3$ is $-100$ DKK if the decision is to treat and 0 if it is not. The utility $u_D$ is 300 DKK if the pig is diseased and 1000 DKK if it is healthy. In the original PIGS example, time steps were of equidistant length. A more realistic description, however, will have time steps of varying lengths. For example, the first decision is made some months after the start, the second and third later still, and the final utility is received at the end of the growing period. For the utility nodes $u_1, u_2, u_3, u_D$, therefore, the time functions $(T_{u_1}, T_{u_2}, T_{u_3}, T_{u_D})$ take increasing values, and the duration functions $(O_{u_1}, O_{u_2}, O_{u_3}, O_{u_D})$ take the corresponding interval lengths, with $O_{u_D} = 0$. With an annual interest rate (implying a corresponding monthly discount factor $\lambda$), the revised utilities $(\tilde{u}_1, \tilde{u}_2, \tilde{u}_3)$ are obtained by discounting with the time function, and the sales prices $\tilde{u}_D$ are discounted likewise. Similarly, with the duration function and a norm $g$, we obtain revised utilities $(\tilde{u}_1, \tilde{u}_2, \tilde{u}_3)$ for the treatments and $(300, 1000)$ for the sales prices (where the duration $O_{u_D}$ is 0).

Premature ending of the process

The original PIGS example was based on a fixed length of the time steps in the model, i.e., the pig was always sold at the final time step. A more realistic model, however, would allow for the possibility of ending the process at each time step. Thus, we may decide to sell the pig immediately if we expect a loss by keeping it. The extended utilities of the sales prices are shown in the table below.

In addition, we need to keep track of whether the pig is alive or not. One way to obtain this is to augment the state space of the $h_t$ nodes with a sold state, and to add a similar test outcome sold to the $t_t$ nodes. Note that it may be necessary to keep track of the time of delivery, e.g., by adding sold at time $t$ as state levels.

Table: Extended utility of sales prices. The pig is sold immediately after the decision is made; thus, the duration is 0 (except for the first time period). If the pig has already been sold, no further utilities are obtained.

State, $h_t$       Utility function $U_{u_1}\,U_{u_2}\,U_{u_3}\,U_{u_D}$    Time function $T_{u_1}\,T_{u_2}\,T_{u_3}\,T_{u_D}$    Duration function $O_{u_1}\,O_{u_2}\,O_{u_3}\,O_{u_D}$
Healthy
Diseased
Sold at time 1
Sold at time 2
Sold at time 3

Optimisation criteria in TEMLIMIDs

With each strategy $q = \{\delta_d : d \in \Delta\}$ in a TEMLIMID $L(\lambda, g)$, we may associate one of two optimisation criteria: the expected discounted utility or the expected relative utility. The expected discounted utility is

$$EDU_{L(\lambda,g)}(q) = \sum_x f_q(x) \Bigl\{ \sum_{u \in \Upsilon} U_u(x_{\mathrm{pa}(u)})\, \lambda^{T_u(x_{\mathrm{pa}(u)})} \Bigr\},$$

where we define $\lambda^{\infty} = 0$. As usual, $f_q$ denotes the joint distribution over $\Gamma \cup \Delta$ induced by $q$. The expected relative utility is

$$ERU_{L(\lambda,g)}(q) = \sum_x f_q(x) \Bigl\{ \sum_{u \in \Upsilon} \bigl( U_u(x_{\mathrm{pa}(u)}) - g\, O_u(x_{\mathrm{pa}(u)}) \bigr) \Bigr\}.$$

Depending on the definition of optimality, we search for a strategy $q$ that maximises either the expected discounted utility or the expected relative utility:

Definition 1 A global maximum present value strategy $\hat{q}_p$ in a TEMLIMID $L(\lambda,g)$ is a strategy that satisfies $EDU_{L(\lambda,g)}(\hat{q}_p) = \max_q EDU_{L(\lambda,g)}(q)$.

Definition 2 A global maximum relative value strategy $\hat{q}_r$ in a TEMLIMID $L(\lambda,g)$ is a strategy that satisfies $ERU_{L(\lambda,g)}(\hat{q}_r) = \max_q ERU_{L(\lambda,g)}(q)$.
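In code, the two criteria only change how each local utility term enters the sum. A minimal sketch (ours; it assumes the `expected_utility` enumerator from the LIMID section above is in scope and represents each utility node by a triple of functions $(U_u, T_u, O_u)$ of the parent configuration):

```python
# EDU and ERU of a strategy q, by revising each local utility term.
# math.inf marks a utility that is never paid (lam**inf is defined as 0).
import math

def edu(nodes, states, prob_or_policy, triples, lam):
    # triples: list of (U, T, O) function triples, one per utility node
    revised = [
        (lambda c, U=U, T=T: 0.0 if math.isinf(T(c)) else U(c) * lam ** T(c))
        for (U, T, O) in triples
    ]
    return expected_utility(nodes, states, prob_or_policy, revised)

def eru(nodes, states, prob_or_policy, triples, g):
    revised = [(lambda c, U=U, O=O: U(c) - g * O(c)) for (U, T, O) in triples]
    return expected_utility(nodes, states, prob_or_policy, revised)
```

A global maximum present value or relative value strategy is then searched for exactly as in ordinary LIMIDs, only with the revised utility terms.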

Solving TEMLIMIDs

The concepts of solubility, requisiteness, and reduction from LIMIDs have the same meaning in TEMLIMIDs and are defined in the same way. To solve a TEMLIMID $L(\lambda,g)$, we first convert it into a traditional LIMID. Before the conversion, however, it is advantageous to remove non-requisite informational arcs from $L(\lambda,g)$.

Maximizing expected present value

The conversion from the TEMLIMID $L(\lambda,g)$ into a LIMID consists of a single step in which every utility function $U_u$, $u \in \Upsilon$, is discounted:

$$U_u(x_{\mathrm{pa}(u)}) \leftarrow U_u(x_{\mathrm{pa}(u)})\, \lambda^{T_u(x_{\mathrm{pa}(u)})},$$   (1)

where we define $\lambda^{\infty} = 0$. The LIMID obtained in this way is denoted $L_\lambda$. After this conversion, the expected utility of a strategy $q$ in $L_\lambda$ equals the expected discounted utility of $q$ in $L(\lambda,g)$:

$$EU_{L_\lambda}(q) = EDU_{L(\lambda,g)}(q).$$

As a consequence of this conversion, a global maximum strategy $\hat{q}_p$ in $L_\lambda$ is also a global maximum present value strategy in $L(\lambda,g)$.

Maximizing expected relative value

The conversion from the TEMLIMID $L(\lambda,g)$ into a LIMID in this case also consists of a single step in which every utility function $U_u$, $u \in \Upsilon$, is transformed:

$$U_u(x_{\mathrm{pa}(u)}) \leftarrow U_u(x_{\mathrm{pa}(u)}) - g\, O_u(x_{\mathrm{pa}(u)}).$$   (2)

The LIMID obtained in this way is denoted $L_g$. After this conversion, the expected utility of a strategy $q$ in $L_g$ equals the expected relative utility of $q$ in $L(\lambda,g)$, i.e.

$$EU_{L_g}(q) = ERU_{L(\lambda,g)}(q).$$

As a consequence of this conversion, a global maximum strategy $\hat{q}_r$ in $L_g$ is also a global maximum relative value strategy in $L(\lambda,g)$.

Use of TEMLIMIDs

TEMLIMIDs are intended primarily for use in connection with MLPs, and we will not go into detail about the potential for direct application. In most cases, modifications of utilities can be implemented directly when LIMIDs are implemented within specific domains. The possibility of keeping the distinct features of the utility function separate during the modeling process, however, might make the TEMLIMID an interesting modeling option on its own.
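The two conversions are pure table transformations and can be sketched in a few lines (again a layout of our own: each utility node is stored as parallel dictionaries $U$, $T$ and $O$ keyed by the parent configuration):

```python
# Sketch of the conversions L(lambda, g) -> L_lambda and L(lambda, g) -> L_g.
import math

def to_L_lambda(U, T, lam):
    """Eq. (1): discount every utility entry, U(x) * lam**T(x), with lam**inf := 0."""
    return {x: (0.0 if math.isinf(T[x]) else U[x] * lam ** T[x]) for x in U}

def to_L_g(U, O, g):
    """Eq. (2): subtract the norm times the duration, U(x) - g * O(x)."""
    return {x: U[x] - g * O[x] for x in U}

# After either conversion the revised tables define an ordinary LIMID, and
# single-policy updating can be applied to it unchanged.
```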

Markov LIMID Processes

Macro states and macro actions

This section introduces the basic components of a Markov LIMID Process (MLP) and provides the notation used throughout. The notation is inspired by Puterman. The procedure by which the MLP is governed is presented briefly here; details are given in subsequent sections. The process enters an initial macro state, denoted $s_I$, according to a known distribution on the set of macro states. Based on the state $s_I$, we associate a macro action $q(s_I)$. A macro action is made during a period of time referred to as a decision epoch. The macro action generates three consequences: a cumulative (possibly discounted) utility during the decision epoch; a transition to another macro state $s_O$; and a cumulative expected duration during the decision epoch. The three consequences are handled by simple modifications of the TEMLIMID, as presented below. This process is continued ad infinitum, i.e. the process goes through an infinite number of decision epochs. A policy for the MLP associates with each macro state a macro action for the decision epoch. We therefore need to find a policy that optimises the utility of the system, i.e., we need to optimise the utility obtained from an infinite sequence of macro actions, as formulated below.

Example: PIGS as a renewal problem

We have already modified the PIGS example to include premature ending of the process and to extend the utilities with time functions and duration functions. We now extend the example further so that it fits into the MLP framework. The pig breeder has two separate sow units, A and B, and a central stable for growing pigs. In even-numbered months, pigs from sow unit A are moved to the central stable, whereas in odd-numbered months, pigs from sow unit B are moved to the stable. Thus the pig breeder can influence whether a new pig is from A or B by prematurely delivering the present pig in an even- or odd-numbered month. Suppose sows in A are carriers of the disease, so that pigs in A are infected, whereas pigs in B are not infected. As a result, the natural immune response will protect pigs coming from A against a new outbreak of the disease, whereas pigs from B are prone to the disease. The initial probability of having the disease in the first month, therefore, is reduced for pigs from A, whereas it remains at the original value of 0.10 for pigs from B. An untreated, healthy pig from B has the original probability of developing the disease of 0.20, whereas the corresponding risk for pigs from A is reduced. Other disease aspects are identical to the original example. As an additional option, the pig breeder might choose, at a certain cost in DKK, to vaccinate pigs at insertion into the stable for growing pigs. Vaccination of a pig from A has no effect, because such pigs are already protected by the natural immune response. If a pig from B is vaccinated, however, then it becomes protected to the same extent as pigs from A.

The resulting TEMLIMID is shown in Fig. The extension is modeled by an initial chance node $s_I$ with two states, A and B. An initial decision node $d_I$ is added to model the vaccination decision, with two states, Yes and No. The vaccination decision is made based on information about the unit of origin, as indicated by the informational edge from

$s_I$ to $d_I$. The unit of origin of the next pig (replacing the present one) is modeled by the node $s_O$, which has the same state space as $s_I$. The treatment/selling strategy should naturally depend on the unit of origin and the vaccination state, so informational edges are added from $s_I$ and $d_I$ to each of the decision nodes $d_1$, $d_2$, and $d_3$. To keep track of even- versus odd-numbered months, an edge is added between $s_I$ and $s_O$.

Fig.: Extended PIGS example modified to be included in the MLP. An input state $s_I$ and an output state $s_O$ are added, as well as a terminal utility node $u_T$. The states of the $h_t$ nodes are augmented to include if and when the pig is sold. The vaccination decision $d_I$ with utility $u_I$, based on the input state, is also added. As the informational edges (dotted lines) indicate, $s_I$ and $d_I$ are known when decisions are made within the TEMLIMID.

Problem illustration

A decision maker, agent, or controller is faced with the problem (or opportunity) of influencing the behaviour of a probabilistic system as it evolves through time. In the PIGS example, the pig breeder should influence his pig production not only for a single pig but for a sequence of pigs. When a pig is delivered, the breeder will replace it with a new one, choosing macro actions covering a specific period of time. In the example, a macro action is a set of rules on how to react to test observations from a new pig. The breeder will choose a different macro action depending on the unit of origin of the pig. Because the risk of disease differs between units, the optimal treatment strategy may also differ. The goal is to choose a sequence of macro actions which causes the system to perform optimally with respect to some predetermined overall optimality criterion. The system we model is ongoing, so the state of the system prior to the next macro action depends on the current macro action. If the pig breeder chooses to deliver the present pig in an even-numbered month, for example, then this influences the unit of origin of the pig replacing the present one. Consequently, macro actions must not be chosen myopically, but must anticipate the opportunities associated with future states.
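The control loop just described can be made concrete with a small simulation sketch. Everything below is hypothetical scaffolding: `P_s`, `u` and `m` stand for the quantities derived from the TEMLIMID in the following sections, a policy is simply a table from macro states to macro actions, and the loop mixes expected durations with sampled transitions purely for illustration.

```python
# Rough Monte Carlo sketch of an MLP policy driving the renewal process.
import random

def simulate(policy, P_s, u, m, s0, epochs, lam=1.0):
    """Accumulate (discounted) utility over a finite number of decision epochs."""
    s, total, t = s0, 0.0, 0.0
    for _ in range(epochs):
        q = policy[s]                  # macro action chosen for the current macro state
        total += lam ** t * u(s, q)    # expected utility earned during the epoch
        t += m(s, q)                   # expected length of the epoch
        probs = P_s(s, q)              # distribution over output macro states
        s = random.choices(list(probs), weights=list(probs.values()))[0]
    return total
```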

Modifying TEMLIMIDs to handle macro states and macro actions

Macro actions are chosen at the start of time periods referred to as decision epochs. A TEMLIMID $L(\lambda,g)$, as defined above, can be used to represent the decision structure within a decision epoch. The TEMLIMID might need three modifications to be used in this way. Firstly, a chance node is added to represent the input state, i.e., the state of the process when the decision epoch starts. Secondly, a chance node is added to represent the output state, i.e., the state of the process when the decision epoch ends. Thirdly, a utility node is added to represent the future reward or utility value. At the start of the decision epoch, therefore, the process occupies an input state, and at the end of the decision epoch, the system occupies an output state. The consequences of a decision epoch depend on the macro action. In the PIGS example, the start of a decision epoch is when a new pig is introduced. The input state is the unit of origin of that pig. The output state is the unit of origin of the pig replacing the present one.

We will use the following notation. In $L(\lambda,g)$ there is a chance node $s_I$, termed the input node, representing the input state, and a chance node $s_O$, termed the output node, representing the output state. The associated variables $X_{s_I}$ and $X_{s_O}$ are called input and output variables. Here, $\mathcal{X}_{s_I} = \mathcal{X}_{s_O}$, and the elements of this set will be called macro states. We require that a special utility node $u_T$, referred to as the terminal utility node and having the property that $s_O \in \mathrm{pa}(u_T)$, can be added to the TEMLIMID $L(\lambda,g)$, which is then denoted $L^*(\lambda,g)$ with the extended utility node set $\Upsilon^* = \Upsilon \cup \{u_T\}$. We need to specify the three functions defined for utility nodes in a TEMLIMID: $U_{u_T}$, $T_{u_T}$, and $O_{u_T}$. The utility function $U_{u_T}$ represents a terminal reward received at the end of the time period modeled by the TEMLIMID. The time function $T_{u_T}$ specifies the time when the terminal reward is paid. The duration function $O_{u_T}$ is normally 0. Note that because the output node is a parent of the terminal utility node, the terminal reward, as well as its time, might depend on the output state $x_{s_O} \in \mathcal{X}_{s_O}$. In Fig., the PIGS example is shown with these modifications.

At the beginning of a decision epoch, the input state $x_{s_I}$ is known to the decision maker. The input state at decision epoch $n$ is obtained by observing the output variable at the end of the previous decision epoch $n-1$. If, at the start of some decision epoch, the decision maker or agent observes the system in input state $x_{s_I}$, then the agent must choose a macro action $q$ from some set $Q$. The set $Q$ consists of all strategies in $L(\lambda,g)$. The term macro action, therefore, is used because every element of $Q$ consists of a set of decision functions, i.e., one decision function for each decision node in $L(\lambda,g)$.

Example: Macro actions in PIGS

For the PIGS example, Lauritzen and Nilsson found the optimal LIMID strategy: never treat in the first month, and treat in the second and third months if the test is positive. This strategy may be used as a macro action. In the extended example we have two more options: vaccination and delivery. The strategy might, furthermore, differ depending on whether the pig is from A or B. Thus, a full macro action could be:

Decision $d_I$: Pigs from B should be vaccinated, whereas pigs from A should be left unvaccinated.

Decision $d_1$: No pigs should be treated or sold, no matter the test results.

Decision $d_2$: Pigs from A should be kept and treated if the test is positive, whereas they should be sold if it is negative. Pigs from B should be kept and not treated, no matter the test result.

Decision $d_3$: Pigs should be treated if $t_3$ is positive. If $t_3$ is negative, however, a pig from B should be kept and not treated, whereas a pig from A should be sold.

Depending on the input state, therefore, we choose a different macro action.

Consequences of macro actions

For each input state, we calculate the consequences of each macro action. The macro action influences the probability of the output state and the expected value of the utility criteria.

Transitions between macro states

As a result of choosing macro action $q \in Q$ in input state $x_{s_I}$ at a decision epoch, the output state $x_{s_O}$ is determined by the conditional state transition probability function $P_s(x_{s_O} \mid x_{s_I}, q)$, where the index $s$ connotes state transition. This state transition probability function is found using the TEMLIMID $L(\lambda,g)$ with macro action $q$:

$$P_s(x_{s_O} \mid x_{s_I}, q) = f_q(x_{s_O} \mid x_{s_I}),$$   (3)

where $f_q$ denotes the conditional probability distribution over the variables in $L(\lambda,g)$ induced by macro action $q$. Calculations proceed as in Bayesian networks, by entering evidence on the input state. After propagation, the conditional distribution over the output states can be read off directly.

Example: Transitions between input and output states for a given macro action

From Eq. (3) we see that the selling decisions $d_t$ influence the probabilities of transitions between macro states. Assume, for instance, that a macro action implies that the pig is always kept when decision $d_2$ is made and always sold when decision $d_3$ is made, some months after insertion into the stable. Thus, if we have a pig from unit A, it was inserted in an even-numbered month. If a new pig is inserted an odd number of months later, it will be inserted in an odd-numbered month, meaning a macro state transition from A to B. If the pig is from B, on the other hand, we have the opposite transition from B to A. Under such a macro action, the transition matrix is

$$P_s = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

Under the described special conditions, transitions are deterministic and the elements $p_{ij}$ equal 0 or 1. More sophisticated macro actions, where the selling decisions depend on the observed tests, typically result in transition matrices whose elements $p_{ij}$ satisfy $0 < p_{ij} < 1$.
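Eq. (3) can also be evaluated without any special machinery, by conditioning the joint distribution on the input state and marginalising onto the output node. The brute-force sketch below (our own layout, reusing the enumeration style of the earlier sketches) is only meant to make the definition concrete; the software described later does this by propagation, as in a Bayesian network.

```python
# Sketch of P_s(x_sO | x_sI, q) by brute-force conditioning.
# joint(config) -> probability is any function returning f_q over the nodes.
from itertools import product

def transition_matrix(joint, nodes, states, s_I="s_I", s_O="s_O"):
    macro = states[s_I]                        # macro states (shared by s_I and s_O)
    P = {a: {b: 0.0 for b in macro} for a in macro}
    norm = {a: 0.0 for a in macro}
    for values in product(*(states[n] for n in nodes)):
        config = dict(zip(nodes, values))
        w = joint(config)
        P[config[s_I]][config[s_O]] += w
        norm[config[s_I]] += w
    return {a: {b: (P[a][b] / norm[a] if norm[a] else 0.0) for b in macro}
            for a in macro}

# For the deterministic macro action described above, the result would be the
# alternating matrix {"A": {"A": 0, "B": 1}, "B": {"A": 1, "B": 0}}.
```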

Utilities

We now consider each of the consequences described above. As a result of choosing macro action $q$ in input state $x_{s_I}$ at decision epoch $n$, the decision maker receives a utility $u(x_{s_I}, q)$ corresponding to the total expected utility received during the interval from decision epoch $n$ to decision epoch $n+1$. If the utilities are discounted, the real-valued utility function $u(x_{s_I}, q)$ denotes the value at the beginning of decision epoch $n$. From the perspective of the model studied here, it is unimportant how the discounted utility is accrued during the interval; we require only that its expected value be known before choosing a macro action. Following the TEMLIMID definitions, the function $u(x_{s_I}, q)$ is computed from $L(\lambda,g)$ given $X_{s_I} = x_{s_I}$ as

$$u(x_{s_I}, q) = EDU_{L(\lambda,g)}(q \mid X_{s_I} = x_{s_I})$$   (4)

if we discount the utilities, or, if we are interested only in the expected utility,

$$u(x_{s_I}, q) = EU_{L(\lambda,g)}(q \mid X_{s_I} = x_{s_I}).$$   (5)

The values of the function are computed conditional on $X_{s_I} = x_{s_I}$. Correspondingly, the expected duration $m(x_{s_I}, q)$ is computed simply by interpreting the duration functions as utilities and then computing the expected value given $X_{s_I} = x_{s_I}$, completely analogous to the expected utility. Because the duration functions $O_u$ of the utility nodes denote the time intervals, $m(x_{s_I}, q)$ will be the expected inter-transition time between macro state transitions.

Discounted transition probability function

By using the time function $T_{u_T}$ of the terminal utility node $u_T$, we can also specify the discounted transition probability function $Q(x_{s_O} \mid x_{s_I}, q)$ as

$$Q(x_{s_O} \mid x_{s_I}, q) = \sum_{x_{V \setminus \{s_O, s_I\}}} \lambda^{T_{u_T}(x_{\mathrm{pa}(u_T)})}\, f_q(x_{V \setminus \{s_I\}} \mid x_{s_I}),$$   (6)

where $x_V$ denotes a configuration of the variables in $L^*(\lambda,g)$. The function $Q(x_{s_O} \mid x_{s_I}, q)$ is the expected discounted value of a terminal utility of 1, given that the input state is $x_{s_I}$, the strategy is $q$, and the output state is $x_{s_O}$.

Optimisation of macro actions

Interpretation of an MLP as a semi-Markov decision process

We now show that an MLP is a special case of an infinite horizon stationary semi-Markov decision process (SMDP). Hence, we can apply value iteration and policy iteration techniques for optimisation. Assume the state space $\Omega$ and the action space $D$ are finite sets. A semi-Markov decision process is characterised by a sequence of decision epochs where, at the start of a decision epoch $n$, the state $i \in \Omega$ of the system is observed. Based on state $i$, an action $d \in D$ is taken. Three consequences of taking action $d$ in state $i$ are:

– A reward $R_i^d$ is gained. We refer to the expected reward as $r_i^d$, i.e., $E(R_i^d) = r_i^d$. It is sufficient to know the expected reward $r_i^d$.
– The duration of the decision epoch will be $M_{ij}^d$ if the state observed at the next decision epoch is state $j$. We refer to the expected duration as $m_i^d$, i.e., $E(M_{ij}^d) = m_i^d$. Depending on the criterion of optimality, it is sufficient to know the expected duration $m_i^d$.
– At the end of the decision epoch, the system will make a transition from state $i$ to state $j$ with probability $p_{ij}^d$, where $\sum_{j \in \Omega} p_{ij}^d = 1$ for all $i$ and $d$.

We refer to $r_i^d$, $M_{ij}^d$, and $p_{ij}^d$ as the parameter set of the decision epoch. If the state and action spaces and the parameter set are the same for all decision epochs, we refer to the SMDP as stationary. If, furthermore, the sequence of decision epochs continues infinitely, we refer to the SMDP as an infinite horizon process. For a stationary SMDP, a policy $\delta$ associates to each state $i \in \Omega$ a decision $\delta(i) \in D$. An optimal policy maximises a predefined objective function $v(\delta)$. We consider three criteria of optimality:

1. Expected present value, maximising the expected sum of all future rewards discounted to the present decision epoch. To discount a reward from time $t$ to time $t' < t$, we use a discount factor $\lambda^{(t - t')}$, where $\lambda$ is a predefined constant such that $0 < \lambda \leq 1$. Under this criterion, we specify the discounted probability parameters $Q_{ij}^d = \lambda^{M_{ij}^d} p_{ij}^d$ for all $i$, $j$, and $d$, implying that the duration $M_{ij}^d$ is known.
2. Average reward over time, maximising the infinite future expected reward/output ratio. Under this criterion, it is sufficient to know the expected duration $m_i^d$.
3. Average reward per stage, maximising the infinite future reward/decision-epoch ratio. This criterion is a special case of the former where all $m_i^d = 1$.

An MLP as defined above is an infinite horizon stationary SMDP. We show this by specifying all elements of the SMDP. A new decision epoch begins each time the decision maker chooses a macro action; the epoch is modeled by the TEMLIMID $L(\lambda,g)$. The state space $\Omega$ of the decision epoch is identical to the state space of the input node $s_I$, i.e. $\Omega = \mathcal{X}_{s_I}$. The action space $D$ of the decision epoch is the set of all possible strategies of the TEMLIMID $L(\lambda,g)$, i.e., an action $d$ in the SMDP is a strategy $q$ in the TEMLIMID. The expected reward $r_i^d$ for state $i = x_{s_I}$ under action $d = q$ is the expected utility of the TEMLIMID, $u(x_{s_I}, q)$, as in Eq. (4) or (5). The expected duration is $m_i^d = m(x_{s_I}, q)$, calculated from the TEMLIMID as described above. The transition probability from state $i = x_{s_I}$ to state $j = x_{s_O}$ under action $d = q$ is $p_{ij}^d = P_s(x_{s_O} \mid x_{s_I}, q)$, as in Eq. (3). Finally, when needed, the corresponding discounted probability parameter is $Q_{ij}^d = Q(x_{s_O} \mid x_{s_I}, q)$, as in Eq. (6). Having identified all elements of the SMDP, we see that the infinite sequence of identical TEMLIMIDs forms an infinite horizon stationary SMDP, where the state and action spaces and the parameter sets are available through the TEMLIMID. Thus, we can apply value iteration and policy iteration techniques for optimisation.

Optimisation in an SMDP

Two algorithms are available for optimisation in an SMDP: value iteration and policy iteration. Each technique may be described by four subroutines as follows:

Algorithm 1 (Optimisation) Value iteration and policy iteration for an infinite horizon semi-Markov decision process:
1. Initialise.
2. Repeat until Convergence:
   (a) DeterminePolicy
   (b) DetermineValue

We next describe the subroutines for the policy iteration algorithm. The Initialise subroutine defines the initial conditions: we set the iteration counter $n$ to 0 and the value function $v_i(0)$ to 0 for all states $i \in \Omega$; finally, the average reward $g_0$ is set to 0.

The DeterminePolicy subroutine increments $n$ by 1 (i.e. sets $n = n+1$) and returns an updated (improved) policy $\delta_n$. The updated policy is determined as

$$\forall i \in \Omega : \quad \delta_n(i) = \arg\max_d \Bigl\{ r_i^d - g_{n-1}\, m_i^d + \sum_{j \in \Omega} Q_{ij}^d\, v_j(n-1) \Bigr\}.$$   (7)

The DetermineValue subroutine returns updated values $v_i(n)$ and $g_n$ determined by solving the simultaneous linear equations

$$g_n\, m_i^{\delta_n(i)} + v_i(n) = r_i^{\delta_n(i)} + \sum_{j \in \Omega} Q_{ij}^{\delta_n(i)}\, v_j(n).$$   (8)

Under the average criteria (where $g_n \neq 0$), an additional equation, e.g. $v_1(n) = 0$, is needed for a unique solution. Under the expected present value criterion we let $g_n = 0$.

The Convergence subroutine sets the stopping criterion of the iteration. For policy iteration, the convergence criterion is $v_i(n) = v_i(n-1)$ for all $i$, and furthermore $g_n = g_{n-1}$.

The four subroutines for the value iteration algorithm are similar to those for the policy iteration algorithm. The value of $g_n$ is determined indirectly, implying that when using Eqs. (7) and (8) we set $g_n = 0$. The convergence criterion needs to be modified, however, because value iteration guarantees only an approximately optimal solution, as opposed to policy iteration, which guarantees an optimal solution within a finite number of iterations.

Optimisation in an MLP

We showed above that an MLP is an SMDP with known parameters. In principle, therefore, we can use Algorithm 1 directly. Because the parameters are known, the DetermineValue subroutine is performed easily. The DeterminePolicy subroutine, however, is more problematic, because a new policy is determined state by state by successively evaluating Eq. (7) for each action $d \in D$ and afterwards choosing the action that maximises the expression. An action $d$ in the MLP, however, is an entire strategy $q$ in the TEMLIMID $L(\lambda,g)$. Evaluating all actions by Eq. (7) is prohibitive, because the number of possible strategies is high (although finite) in most cases. We therefore need another way to determine a strategy that maximises the right-hand side of Eq. (7). We can use SINGLE POLICY UPDATING in the TEMLIMID to determine such a strategy. For an MLP, therefore, we define the DeterminePolicy subroutine in terms of the TEMLIMID $L(\lambda,g)$:

DeterminePolicy: An improved policy is found as follows.
1. Increment the iteration counter $n$ by 1.
2. Add the terminal utility node $u_T$ to $L(\lambda,g)$ to form $L^*(\lambda,g)$.
3. Set the utility function of the terminal utility node $u_T$ equal to the previous value of the value function, i.e. $U_{u_T}(x_{\mathrm{pa}(u_T) \setminus \{s_O\}}, s) = v_s(n-1)$, $s \in \mathcal{X}_{s_O}$.
4. Depending on the criterion:
   (a) For the expected present value criterion, modify the utilities of $L^*(\lambda,g)$ as described in Eq. (1) to obtain the LIMID $L^*_\lambda$.
   (b) For average rewards over time, modify the utilities of $L^*(\lambda,g)$ as described in Eq. (2) to obtain the LIMID $L^*_g$.
5. Depending on the criterion:
   (a) For the expected present value criterion, perform SINGLE POLICY UPDATING on the LIMID $L^*_\lambda$ to obtain a local maximum strategy $q_n$.
   (b) For average rewards over time, perform SINGLE POLICY UPDATING on the LIMID $L^*_g$ to obtain a local maximum strategy $q_n$.
6. Define the new policy $\delta_n$ of the MLP as $\delta_n = q_n$.

The theorem below assures the correctness of the DeterminePolicy subroutine defined for an MLP.

Theorem 3 For an MLP in which a decision epoch is defined by the soluble TEMLIMID $L(\lambda,g)$, the DeterminePolicy subroutine defined above returns a policy maximising the expression on the right-hand side of Eq. (7).

Proof: We first prove the theorem for the expected present value criterion. Recall that $L(\lambda,g)$ (and hence $L^*(\lambda,g)$) is soluble. It follows directly that the strategy $q_n$ satisfies

$$EDU_{L^*(\lambda,g)}(q_n \mid X_{s_I} = s) = \max_q EDU_{L^*(\lambda,g)}(q \mid X_{s_I} = s).$$

Let $\varphi(q_n \mid s) = EDU_{L^*(\lambda,g)}(q_n \mid X_{s_I} = s)$. Because utilities are considered as additive in LIMIDs, we have

$$\begin{aligned}
\varphi(q_n \mid s) &= \max_q \Bigl\{ EDU_{L(\lambda,g)}(q \mid X_{s_I} = s) + E_q\bigl( \lambda^{T_{u_T}(X_{\mathrm{pa}(u_T)})}\, U_{u_T}(X_{\mathrm{pa}(u_T)}) \bigm| X_{s_I} = s \bigr) \Bigr\} \\
&= \max_q \Bigl\{ u(s,q) + \sum_{x_{s_O}} \sum_{x_{V \setminus \{s_O, s_I\}}} \lambda^{T_{u_T}(x_{\mathrm{pa}(u_T)})}\, U_{u_T}(x_{\mathrm{pa}(u_T)})\, f_q(x_{V \setminus \{s_I\}} \mid s) \Bigr\} \\
&= \max_q \Bigl\{ u(s,q) + \sum_{x_{s_O}} \sum_{x_{V \setminus \{s_O, s_I\}}} \lambda^{T_{u_T}(x_{\mathrm{pa}(u_T)})}\, f_q(x_{V \setminus \{s_I\}} \mid s)\, U_{u_T}(x_{\mathrm{pa}(u_T) \setminus \{s_O\}}, x_{s_O}) \Bigr\} \\
&= \max_q \Bigl\{ u(s,q) + \sum_{x_{s_O}} Q(x_{s_O} \mid s, q)\, v_{x_{s_O}}(n-1) \Bigr\}.
\end{aligned}$$

Comparing with Eq. (7) and interpreting the MLP as an SMDP as above, we see that, for a soluble LIMID, the SINGLE POLICY UPDATING technique can be used as a method

to update the policy in the DeterminePolicy subroutine under the expected present value criterion, where always $g_n = 0$.

For the average rewards over time criterion, we similarly define the function $\varphi(q_n \mid s) = ERU_{L^*(\lambda,g)}(q_n \mid X_{s_I} = s)$. Because utilities are considered as additive in LIMIDs and because $L(\lambda,g)$ is soluble, we obtain

$$\varphi(q_n \mid s) = \max_q ERU_{L^*(\lambda,g)}(q \mid X_{s_I} = s) = \max_q \Bigl\{ ERU_{L(\lambda,g)}(q \mid X_{s_I} = s) + E_q\bigl( U_{u_T}(X_{\mathrm{pa}(u_T)}) \bigm| X_{s_I} = s \bigr) \Bigr\}.$$

Again using the additivity property,

$$ERU_{L(\lambda,g)}(q \mid X_{s_I} = s) = EU_{L(\lambda,g)} - g\, ED_{L(\lambda,g)} = u(s,q) - g\, m(s,q),$$

where $ED$ denotes the expected duration, as defined above. Similarly, for the second expression,

$$E_q\bigl( U_{u_T}(X_{\mathrm{pa}(u_T)}) \bigm| X_{s_I} = s \bigr) = \sum_{x_{s_O}} P_s(x_{s_O} \mid s, q)\, v_{x_{s_O}}(n-1).$$

Thus,

$$\varphi(q_n \mid s) = \max_q \Bigl\{ u(s,q) - g\, m(s,q) + \sum_{x_{s_O}} P_s(x_{s_O} \mid s, q)\, v_{x_{s_O}}(n-1) \Bigr\}.$$

Under the average rewards over time criterion, $\lambda = 1$, so we have $P_s(x_{s_O} \mid s, q) = Q(x_{s_O} \mid s, q)$. We have thus confirmed that the SINGLE POLICY UPDATING technique can also be used as a method to update the policy in the DeterminePolicy subroutine.

Implementation of the algorithm

Software systems used

The concept of an MLP has been implemented within the framework of the MLHMP software (Kristensen). It is a general, Java-based software system for multi-level hierarchical Markov processes (Kristensen and Jørgensen). The MLP concept has been integrated into the MLHMP framework by use of the Esthauge LIMID Software System, which is also Java based and is a general software system for Bayesian networks and LIMIDs. The Esthauge system comes with the SINGLE POLICY UPDATING algorithm already implemented. The software system, furthermore, is able to check a LIMID for solubility and to remove non-requisite informational edges. It has a graphical user interface for browsing the directed acyclic graph and for editing models. Fig. (a) shows the PIGS example as it is displayed in the Esthauge LIMID Software System.

A more general form of the MLP concept than that presented here has been implemented. In that form, for any decision epoch of a Markov decision process to be represented

as a TEMLIMID, it is enough that the state spaces of the input and output nodes match the state spaces of the present and the following decision epoch of the Markov decision process, respectively. Each of the three criteria of optimality defined above has been implemented.

Fig.: The PIGS example as it is displayed in the Esthauge LIMID Software System. The three node types (chance, decision, and utility nodes) are distinguished by colours in the user interface instead of shapes (here translated to different shades of grey, chance nodes the lightest and decision nodes the darkest). The figure to the left (a) shows the LIMID version and the one to the right (b) shows the ID version, with all informational arcs added to satisfy the no-forgetting assumption.

Example

The PIGS example has been implemented as a plug-in to the MLHMP software system (Kristensen). If an exact solution is required, informational edges must be added as shown in Fig. (b). The TEMLIMID then becomes soluble, and convergence of the optimisation algorithm is guaranteed according to Theorem 3; the resulting policy is optimal. We illustrate the optimisation for the soluble version under the criterion of average rewards over time. Four iterations were needed to find an optimal policy. The convergence of the value functions $v_i(n)$ and $g_n$ is shown in the table below. The optimal strategy determined by the policy iteration algorithm, using SINGLE POLICY UPDATING in the TEMLIMID, is to vaccinate pigs from B and, as expected, to leave pigs from A untreated.

Table: Convergence of the value functions for the PIGS example (soluble version) using policy iteration under the criterion of average rewards over time. The relative value 0 for Unit B is arbitrary.

                                              Iteration
Macro state                                   n = 1    n = 2    n = 3    n = 4
Unit A (relative value), $v_A(n)$
Unit B (relative value), $v_B(n)$             0        0        0        0
Average rewards over time (DKK/month), $g_n$
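The policy-iteration scheme used to produce the table above can be sketched generically for a small SMDP with explicitly enumerated actions. The sketch below is ours: under the average-rewards-over-time criterion it alternates the DeterminePolicy step of Eq. (7) with the linear system of Eq. (8) plus a normalisation, as described earlier. In the MLP itself the argmax over actions is replaced by single-policy updating in the TEMLIMID, and all numbers below are invented placeholders, not the values of the PIGS example.

```python
# Generic policy iteration for a small SMDP under the average-reward criterion.
# r[i, d], m[i, d] and P[d, i, j] are expected reward, expected duration and
# transition probabilities; in the MLP they come from the TEMLIMID.
import numpy as np

def policy_iteration_avg(r, m, P, max_iter=100):
    n_states, n_actions = r.shape
    v, g = np.zeros(n_states), 0.0
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iter):
        # DeterminePolicy: maximise r - g*m + sum_j P*v for every state (Eq. 7)
        scores = r - g * m + np.einsum("dij,j->id", P, v)
        new_policy = scores.argmax(axis=1)
        # DetermineValue: solve g*m_i + v_i = r_i + sum_j P_ij v_j (Eq. 8)
        A = np.zeros((n_states + 1, n_states + 1))
        b = np.zeros(n_states + 1)
        for i, d in enumerate(new_policy):
            A[i, :n_states] = -P[d, i]
            A[i, i] += 1.0
            A[i, n_states] = m[i, d]
            b[i] = r[i, d]
        A[n_states, 0] = 1.0          # normalisation: the first relative value is 0
        sol = np.linalg.solve(A, b)
        new_v, new_g = sol[:n_states], sol[n_states]
        if (np.array_equal(new_policy, policy)
                and np.allclose(new_v, v) and np.isclose(new_g, g)):
            return new_policy, new_v, new_g
        policy, v, g = new_policy, new_v, new_g
    return policy, v, g

# Placeholder two-macro-state, two-action example (values invented):
r = np.array([[500.0, 650.0], [400.0, 600.0]])
m = np.array([[3.0, 4.0], [3.0, 4.0]])
P = np.array([[[0.0, 1.0], [1.0, 0.0]],      # action 0
              [[0.3, 0.7], [0.6, 0.4]]])     # action 1
print(policy_iteration_avg(r, m, P))
```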


More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

10/3/2018. Our main example: SimFlock. Breeding animals Hens & Cocks

10/3/2018. Our main example: SimFlock. Breeding animals Hens & Cocks What is simulation? Monte Carlo Simulation I Anders Ringgaard Kristensen Simulation is an attempt to model a real world system in order to: Obtain a better understanding of the system (including interactions)

More information

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012

Outline. CSE 573: Artificial Intelligence Autumn Agent. Partial Observability. Markov Decision Process (MDP) 10/31/2012 CSE 573: Artificial Intelligence Autumn 2012 Reasoning about Uncertainty & Hidden Markov Models Daniel Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer 1 Outline

More information

2534 Lecture 4: Sequential Decisions and Markov Decision Processes

2534 Lecture 4: Sequential Decisions and Markov Decision Processes 2534 Lecture 4: Sequential Decisions and Markov Decision Processes Briefly: preference elicitation (last week s readings) Utility Elicitation as a Classification Problem. Chajewska, U., L. Getoor, J. Norman,Y.

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning March May, 2013 Schedule Update Introduction 03/13/2015 (10:15-12:15) Sala conferenze MDPs 03/18/2015 (10:15-12:15) Sala conferenze Solving MDPs 03/20/2015 (10:15-12:15) Aula Alpha

More information

15-780: Graduate Artificial Intelligence. Reinforcement learning (RL)

15-780: Graduate Artificial Intelligence. Reinforcement learning (RL) 15-780: Graduate Artificial Intelligence Reinforcement learning (RL) From MDPs to RL We still use the same Markov model with rewards and actions But there are a few differences: 1. We do not assume we

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important

More information

RL 14: POMDPs continued

RL 14: POMDPs continued RL 14: POMDPs continued Michael Herrmann University of Edinburgh, School of Informatics 06/03/2015 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams)

Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams) Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams) Ross D. Shachter Engineering-Economic Systems and Operations Research

More information

Infinite-Horizon Discounted Markov Decision Processes

Infinite-Horizon Discounted Markov Decision Processes Infinite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1 Outline The expected

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

CS 7180: Behavioral Modeling and Decisionmaking

CS 7180: Behavioral Modeling and Decisionmaking CS 7180: Behavioral Modeling and Decisionmaking in AI Markov Decision Processes for Complex Decisionmaking Prof. Amy Sliva October 17, 2012 Decisions are nondeterministic In many situations, behavior and

More information

Bayesian networks II. Model building. Anders Ringgaard Kristensen

Bayesian networks II. Model building. Anders Ringgaard Kristensen Bayesian networks II. Model building Anders Ringgaard Kristensen Outline Determining the graphical structure Milk test Mastitis diagnosis Pregnancy Determining the conditional probabilities Modeling methods

More information

CSE250A Fall 12: Discussion Week 9

CSE250A Fall 12: Discussion Week 9 CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.

More information

Decision Theory: Q-Learning

Decision Theory: Q-Learning Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning

More information

Internet Monetization

Internet Monetization Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition

More information

Computer Science CPSC 322. Lecture 23 Planning Under Uncertainty and Decision Networks

Computer Science CPSC 322. Lecture 23 Planning Under Uncertainty and Decision Networks Computer Science CPSC 322 Lecture 23 Planning Under Uncertainty and Decision Networks 1 Announcements Final exam Mon, Dec. 18, 12noon Same general format as midterm Part short questions, part longer problems

More information

CS 4100 // artificial intelligence. Recap/midterm review!

CS 4100 // artificial intelligence. Recap/midterm review! CS 4100 // artificial intelligence instructor: byron wallace Recap/midterm review! Attribution: many of these slides are modified versions of those distributed with the UC Berkeley CS188 materials Thanks

More information

Distributed Optimization. Song Chong EE, KAIST

Distributed Optimization. Song Chong EE, KAIST Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links

More information

Q-Learning in Continuous State Action Spaces

Q-Learning in Continuous State Action Spaces Q-Learning in Continuous State Action Spaces Alex Irpan alexirpan@berkeley.edu December 5, 2015 Contents 1 Introduction 1 2 Background 1 3 Q-Learning 2 4 Q-Learning In Continuous Spaces 4 5 Experimental

More information

Decision Theory: Markov Decision Processes

Decision Theory: Markov Decision Processes Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies

More information

What is it all about? Introduction to Bayesian Networks. Method to reasoning under uncertainty. Where we reason using probabilities

What is it all about? Introduction to Bayesian Networks. Method to reasoning under uncertainty. Where we reason using probabilities What is it all about? Introduction to ayesian Networks Method to reasoning under uncertainty dvanced Herd Management 28th of september 2009 Where we reason using probabilities Tina irk Jensen Reasoning

More information

Solving Hybrid Influence Diagrams with Deterministic Variables

Solving Hybrid Influence Diagrams with Deterministic Variables 322 LI & SHENOY UAI 2010 Solving Hybrid Influence Diagrams with Deterministic Variables Yijing Li and Prakash P. Shenoy University of Kansas, School of Business 1300 Sunnyside Ave., Summerfield Hall Lawrence,

More information

Optimal Stopping Problems

Optimal Stopping Problems 2.997 Decision Making in Large Scale Systems March 3 MIT, Spring 2004 Handout #9 Lecture Note 5 Optimal Stopping Problems In the last lecture, we have analyzed the behavior of T D(λ) for approximating

More information

Analysis of Algorithms. Outline. Single Source Shortest Path. Andres Mendez-Vazquez. November 9, Notes. Notes

Analysis of Algorithms. Outline. Single Source Shortest Path. Andres Mendez-Vazquez. November 9, Notes. Notes Analysis of Algorithms Single Source Shortest Path Andres Mendez-Vazquez November 9, 01 1 / 108 Outline 1 Introduction Introduction and Similar Problems General Results Optimal Substructure Properties

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Probabilistic Model Checking Michaelmas Term Dr. Dave Parker. Department of Computer Science University of Oxford

Probabilistic Model Checking Michaelmas Term Dr. Dave Parker. Department of Computer Science University of Oxford Probabilistic Model Checking Michaelmas Term 20 Dr. Dave Parker Department of Computer Science University of Oxford Overview PCTL for MDPs syntax, semantics, examples PCTL model checking next, bounded

More information

Applying Bayesian networks in the game of Minesweeper

Applying Bayesian networks in the game of Minesweeper Applying Bayesian networks in the game of Minesweeper Marta Vomlelová Faculty of Mathematics and Physics Charles University in Prague http://kti.mff.cuni.cz/~marta/ Jiří Vomlel Institute of Information

More information

Lecture 3: Markov Decision Processes

Lecture 3: Markov Decision Processes Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov

More information

Decision Graphs - Influence Diagrams. Rudolf Kruse, Pascal Held Bayesian Networks 429

Decision Graphs - Influence Diagrams. Rudolf Kruse, Pascal Held Bayesian Networks 429 Decision Graphs - Influence Diagrams Rudolf Kruse, Pascal Held Bayesian Networks 429 Descriptive Decision Theory Descriptive Decision Theory tries to simulate human behavior in finding the right or best

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

An Introduction to Markov Decision Processes. MDP Tutorial - 1

An Introduction to Markov Decision Processes. MDP Tutorial - 1 An Introduction to Markov Decision Processes Bob Givan Purdue University Ron Parr Duke University MDP Tutorial - 1 Outline Markov Decision Processes defined (Bob) Objective functions Policies Finding Optimal

More information

Decayed Markov Chain Monte Carlo for Interactive POMDPs

Decayed Markov Chain Monte Carlo for Interactive POMDPs Decayed Markov Chain Monte Carlo for Interactive POMDPs Yanlin Han Piotr Gmytrasiewicz Department of Computer Science University of Illinois at Chicago Chicago, IL 60607 {yhan37,piotr}@uic.edu Abstract

More information

Using first-order logic, formalize the following knowledge:

Using first-order logic, formalize the following knowledge: Probabilistic Artificial Intelligence Final Exam Feb 2, 2016 Time limit: 120 minutes Number of pages: 19 Total points: 100 You can use the back of the pages if you run out of space. Collaboration on the

More information

Chapter 16 focused on decision making in the face of uncertainty about one future

Chapter 16 focused on decision making in the face of uncertainty about one future 9 C H A P T E R Markov Chains Chapter 6 focused on decision making in the face of uncertainty about one future event (learning the true state of nature). However, some decisions need to take into account

More information

Planning Under Uncertainty: Structural Assumptions and Computational Leverage

Planning Under Uncertainty: Structural Assumptions and Computational Leverage Planning Under Uncertainty: Structural Assumptions and Computational Leverage Craig Boutilier Dept. of Comp. Science Univ. of British Columbia Vancouver, BC V6T 1Z4 Tel. (604) 822-4632 Fax. (604) 822-5485

More information

Infinite-Horizon Average Reward Markov Decision Processes

Infinite-Horizon Average Reward Markov Decision Processes Infinite-Horizon Average Reward Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Average Reward MDP 1 Outline The average

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

Practicable Robust Markov Decision Processes

Practicable Robust Markov Decision Processes Practicable Robust Markov Decision Processes Huan Xu Department of Mechanical Engineering National University of Singapore Joint work with Shiau-Hong Lim (IBM), Shie Mannor (Techion), Ofir Mebel (Apple)

More information

RL 14: Simplifications of POMDPs

RL 14: Simplifications of POMDPs RL 14: Simplifications of POMDPs Michael Herrmann University of Edinburgh, School of Informatics 04/03/2016 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally

More information

Advanced Herd Management Probabilities and distributions

Advanced Herd Management Probabilities and distributions Advanced Herd Management Probabilities and distributions Anders Ringgaard Kristensen Slide 1 Outline Probabilities Conditional probabilities Bayes theorem Distributions Discrete Continuous Distribution

More information

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games

A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games International Journal of Fuzzy Systems manuscript (will be inserted by the editor) A Residual Gradient Fuzzy Reinforcement Learning Algorithm for Differential Games Mostafa D Awheda Howard M Schwartz Received:

More information

Reinforcement Learning II

Reinforcement Learning II Reinforcement Learning II Andrea Bonarini Artificial Intelligence and Robotics Lab Department of Electronics and Information Politecnico di Milano E-mail: bonarini@elet.polimi.it URL:http://www.dei.polimi.it/people/bonarini

More information

Directed and Undirected Graphical Models

Directed and Undirected Graphical Models Directed and Undirected Davide Bacciu Dipartimento di Informatica Università di Pisa bacciu@di.unipi.it Machine Learning: Neural Networks and Advanced Models (AA2) Last Lecture Refresher Lecture Plan Directed

More information

Discrete-Time Markov Decision Processes

Discrete-Time Markov Decision Processes CHAPTER 6 Discrete-Time Markov Decision Processes 6.0 INTRODUCTION In the previous chapters we saw that in the analysis of many operational systems the concepts of a state of a system and a state transition

More information

Influence Diagrams with Memory States: Representation and Algorithms

Influence Diagrams with Memory States: Representation and Algorithms Influence Diagrams with Memory States: Representation and Algorithms Xiaojian Wu, Akshat Kumar, and Shlomo Zilberstein Computer Science Department University of Massachusetts Amherst, MA 01003 {xiaojian,akshat,shlomo}@cs.umass.edu

More information

16.4 Multiattribute Utility Functions

16.4 Multiattribute Utility Functions 285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

More information

Bayesian networks in Mastermind

Bayesian networks in Mastermind Bayesian networks in Mastermind Jiří Vomlel http://www.utia.cas.cz/vomlel/ Laboratory for Intelligent Systems Inst. of Inf. Theory and Automation University of Economics Academy of Sciences Ekonomická

More information

Markov Decision Processes

Markov Decision Processes Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour

More information

Partially Observable Markov Decision Processes (POMDPs)

Partially Observable Markov Decision Processes (POMDPs) Partially Observable Markov Decision Processes (POMDPs) Sachin Patil Guest Lecture: CS287 Advanced Robotics Slides adapted from Pieter Abbeel, Alex Lee Outline Introduction to POMDPs Locally Optimal Solutions

More information

Markov Decision Processes Infinite Horizon Problems

Markov Decision Processes Infinite Horizon Problems Markov Decision Processes Infinite Horizon Problems Alan Fern * * Based in part on slides by Craig Boutilier and Daniel Weld 1 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T)

More information

c 2011 Nisha Somnath

c 2011 Nisha Somnath c 2011 Nisha Somnath HIERARCHICAL SUPERVISORY CONTROL OF COMPLEX PETRI NETS BY NISHA SOMNATH THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Aerospace

More information

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount

The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount The Simplex and Policy Iteration Methods are Strongly Polynomial for the Markov Decision Problem with Fixed Discount Yinyu Ye Department of Management Science and Engineering and Institute of Computational

More information

Simulation Study on Heterogeneous Variance Adjustment for Observations with Different Measurement Error Variance

Simulation Study on Heterogeneous Variance Adjustment for Observations with Different Measurement Error Variance Simulation Study on Heterogeneous Variance Adjustment for Observations with Different Measurement Error Variance Pitkänen, T. 1, Mäntysaari, E. A. 1, Nielsen, U. S., Aamand, G. P 3., Madsen 4, P. and Lidauer,

More information

Monitoring and data filtering II. Dynamic Linear Models

Monitoring and data filtering II. Dynamic Linear Models Monitoring and data filtering II. Dynamic Linear Models Advanced Herd Management Cécile Cornou, IPH Dias 1 Program for monitoring and data filtering Friday 26 (morning) - Lecture for part I : use of control

More information

Motivation for introducing probabilities

Motivation for introducing probabilities for introducing probabilities Reaching the goals is often not sufficient: it is important that the expected costs do not outweigh the benefit of reaching the goals. 1 Objective: maximize benefits - costs.

More information

Procedia Computer Science 00 (2011) 000 6

Procedia Computer Science 00 (2011) 000 6 Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-

More information

Tractable Inference in Hybrid Bayesian Networks with Deterministic Conditionals using Re-approximations

Tractable Inference in Hybrid Bayesian Networks with Deterministic Conditionals using Re-approximations Tractable Inference in Hybrid Bayesian Networks with Deterministic Conditionals using Re-approximations Rafael Rumí, Antonio Salmerón Department of Statistics and Applied Mathematics University of Almería,

More information

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes RAÚL MONTES-DE-OCA Departamento de Matemáticas Universidad Autónoma Metropolitana-Iztapalapa San Rafael

More information

Best Guaranteed Result Principle and Decision Making in Operations with Stochastic Factors and Uncertainty

Best Guaranteed Result Principle and Decision Making in Operations with Stochastic Factors and Uncertainty Stochastics and uncertainty underlie all the processes of the Universe. N.N.Moiseev Best Guaranteed Result Principle and Decision Making in Operations with Stochastic Factors and Uncertainty by Iouldouz

More information

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro

CMU Lecture 12: Reinforcement Learning. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 12: Reinforcement Learning Teacher: Gianni A. Di Caro REINFORCEMENT LEARNING Transition Model? State Action Reward model? Agent Goal: Maximize expected sum of future rewards 2 MDP PLANNING

More information

Final Exam December 12, 2017

Final Exam December 12, 2017 Introduction to Artificial Intelligence CSE 473, Autumn 2017 Dieter Fox Final Exam December 12, 2017 Directions This exam has 7 problems with 111 points shown in the table below, and you have 110 minutes

More information

Markov Decision Processes Chapter 17. Mausam

Markov Decision Processes Chapter 17. Mausam Markov Decision Processes Chapter 17 Mausam Planning Agent Static vs. Dynamic Fully vs. Partially Observable Environment What action next? Deterministic vs. Stochastic Perfect vs. Noisy Instantaneous vs.

More information

THE VINE COPULA METHOD FOR REPRESENTING HIGH DIMENSIONAL DEPENDENT DISTRIBUTIONS: APPLICATION TO CONTINUOUS BELIEF NETS

THE VINE COPULA METHOD FOR REPRESENTING HIGH DIMENSIONAL DEPENDENT DISTRIBUTIONS: APPLICATION TO CONTINUOUS BELIEF NETS Proceedings of the 00 Winter Simulation Conference E. Yücesan, C.-H. Chen, J. L. Snowdon, and J. M. Charnes, eds. THE VINE COPULA METHOD FOR REPRESENTING HIGH DIMENSIONAL DEPENDENT DISTRIBUTIONS: APPLICATION

More information

A Nonlinear Predictive State Representation

A Nonlinear Predictive State Representation Draft: Please do not distribute A Nonlinear Predictive State Representation Matthew R. Rudary and Satinder Singh Computer Science and Engineering University of Michigan Ann Arbor, MI 48109 {mrudary,baveja}@umich.edu

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

Markov decision processes

Markov decision processes CS 2740 Knowledge representation Lecture 24 Markov decision processes Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Administrative announcements Final exam: Monday, December 8, 2008 In-class Only

More information

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:

Marks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam: Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,

More information

Explanation of Bayesian networks and influence diagrams in Elvira

Explanation of Bayesian networks and influence diagrams in Elvira JOURNAL OF L A TEX CLASS FILES, VOL. 1, NO. 11, NOVEMBER 2002 1 Explanation of Bayesian networks and influence diagrams in Elvira Carmen Lacave, Manuel Luque and Francisco Javier Díez Abstract Bayesian

More information

Introduction to Reinforcement Learning Part 1: Markov Decision Processes

Introduction to Reinforcement Learning Part 1: Markov Decision Processes Introduction to Reinforcement Learning Part 1: Markov Decision Processes Rowan McAllister Reinforcement Learning Reading Group 8 April 2015 Note I ve created these slides whilst following Algorithms for

More information

Artificial Intelligence & Sequential Decision Problems

Artificial Intelligence & Sequential Decision Problems Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet

More information

A Class of Star-Algebras for Point-Based Qualitative Reasoning in Two- Dimensional Space

A Class of Star-Algebras for Point-Based Qualitative Reasoning in Two- Dimensional Space From: FLAIRS- Proceedings. Copyright AAAI (www.aaai.org). All rights reserved. A Class of Star-Algebras for Point-Based Qualitative Reasoning in Two- Dimensional Space Debasis Mitra Department of Computer

More information

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time

A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time A Review of the E 3 Algorithm: Near-Optimal Reinforcement Learning in Polynomial Time April 16, 2016 Abstract In this exposition we study the E 3 algorithm proposed by Kearns and Singh for reinforcement

More information

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability

CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability CS188: Artificial Intelligence, Fall 2009 Written 2: MDPs, RL, and Probability Due: Thursday 10/15 in 283 Soda Drop Box by 11:59pm (no slip days) Policy: Can be solved in groups (acknowledge collaborators)

More information

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School

PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION PING HOU. A dissertation submitted to the Graduate School PROBABILISTIC PLANNING WITH RISK-SENSITIVE CRITERION BY PING HOU A dissertation submitted to the Graduate School in partial fulfillment of the requirements for the degree Doctor of Philosophy Major Subject:

More information

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty

A Decentralized Approach to Multi-agent Planning in the Presence of Constraints and Uncertainty 2011 IEEE International Conference on Robotics and Automation Shanghai International Conference Center May 9-13, 2011, Shanghai, China A Decentralized Approach to Multi-agent Planning in the Presence of

More information