Bayesian Networks and Decision Graphs


Bayesian Networks and Decision Graphs A 3-week course at Reykjavik University Finn V. Jensen & Uffe Kjærulff ({fvj,uk}@cs.aau.dk) Group of Machine Intelligence Department of Computer Science, Aalborg University 26 April – 13 May, 2005

Outline 1 Introduction 2 Contents of Course 3 Causal Networks and Relevance Analysis 4 Bayesian Probability Theory Reykjavik University, April/May 2005, BNs and DGs: Introduction 2

Introduction 1 Artificial Intelligence 2 Expert Systems 3 Normative Expert Systems 4 Sample Application Areas Reykjavik University, April/May 2005, BNs and DGs: Introduction 3

Artificial Intelligence / Machine Intelligence What is artificial intelligence? Device or service that reasons and makes decisions under uncertainty, extracts knowledge from data/experience, and solves problems efficiently and adapts to new situations. Reykjavik University, April/May 2005, BNs and DGs: Introduction 4

Artificial Intelligence / Machine Intelligence What is artificial intelligence? Device or service that reasons and makes decisions under uncertainty, extracts knowledge from data/experience, and solves problems efficiently and adapts to new situations. Why use artificial intelligence? Automate tasks. Automate reasoning and decision making. Extract knowledge and information from data. Reykjavik University, April/May 2005, BNs and DGs: Introduction 5

Expert Systems The first expert systems were constructed in the late 1960s. Expert System = Knowledge Base + Inference Engine Reykjavik University, April/May 2005, BNs and DGs: Introduction 6

Expert Systems The first expert systems were constructed in the late 1960s. Expert System = Knowledge Base + Inference Engine The first expert systems were constructed as computer models of the expert, e.g. production rules like: if condition, then fact if condition, then action Reykjavik University, April/May 2005, BNs and DGs: Introduction 7

Expert Systems The first expert systems were constructed in the late 1960s. Expert System = Knowledge Base + Inference Engine The first expert systems were constructed as computer models of the expert, e.g. production rules like: if condition, then fact if condition, then action In most systems there is a need for handling uncertainty: if condition with certainty x, then fact with certainty f (x) The algebras for combining certainty factors are not mathematically coherent and can lead to incorrect conclusions. Reykjavik University, April/May 2005, BNs and DGs: Introduction 8

Normative Expert Systems Model the problem domain, not the expert. Support the expert, don't substitute for the expert. Use classical probability calculus and decision theory, not a non-coherent uncertainty calculus. Closed-world representation of a given problem domain (i.e., the domain model assumes some given background conditions or context in which the model is valid). Reykjavik University, April/May 2005, BNs and DGs: Introduction 9

Motivation for Normative Approach Some important motivations for using model-based systems: Procedure-based (extensional) systems are semantically sloppy, model-based (intensional) systems are not. Speak the language of causality, use a single knowledge base to provide simulation, diagnosis, and prognosis. Both knowledge and data can be used to construct Bayesian networks. Adapt to individual settings. Probabilities make it easy to interface with decision and utility theory. Reykjavik University, April/May 2005, BNs and DGs: Introduction 10

Bayes Rule P(Y | X) = P(X | Y) P(Y) / P(X) Rev. Thomas Bayes (1702–1761), an 18th century minister from England. The rule, as generalized by Laplace, is the basic starting point for inference problems using probability theory as logic. Reykjavik University, April/May 2005, BNs and DGs: Introduction 11
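To make the rule concrete, here is a minimal Python sketch of a diagnostic-test calculation. It is an illustration only; the prior, sensitivity and false-positive rate below are invented numbers, not figures from the course.

# Bayes' rule: P(Y | X) = P(X | Y) P(Y) / P(X), for a hypothetical diagnostic test.
p_disease = 0.01              # prior P(disease)          -- assumed number
p_pos_given_disease = 0.95    # P(positive | disease)     -- assumed number
p_pos_given_healthy = 0.10    # P(positive | no disease)  -- assumed number

# Denominator by the rule of total probability.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.088: the positive test lifts 1% to roughly 9%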

Example: Chest Clinic Chest Clinic Shortness-of-breath (dyspnoea) may be due to tuberculosis, lung cancer or bronchitis, or none of them, or more than one of them. A recent visit to Asia increases the chances of tuberculosis, while smoking is known to be a risk factor for both lung cancer and bronchitis. The results of a single chest X-ray do not discriminate between lung cancer and tuberculosis, as neither does the presence or absence of dyspnoea. This is a typical diagnostic situation. Reykjavik University, April/May 2005, BNs and DGs: Introduction 12

Bayesian Decision Problems Bayesian networks can be augmented with explicit representation of decisions and utilities. Such augmented models are denoted influence diagrams (or decision networks). Bayesian decision theory provides a solid foundation for assessing and thinking about actions under uncertainty. Intuitive, graphical specification of a decision problem. Automatic determination of an optimal strategy and computation of the maximal expected utility of this strategy. Reykjavik University, April/May 2005, BNs and DGs: Introduction 13

Example: Oil Wildcatter Oil Wildcatter An oil wildcatter must decide either to drill or not to drill. He is uncertain whether the hole is dry, wet, or soaking. The wildcatter could perform a seismic soundings test that will help determine the geological structure of the site. The soundings will give a closed reflection pattern (indication of much oil), an open pattern (indication of some oil), or a diffuse pattern (almost no hope of oil). The cost of testing is $10,000 whereas the cost of drilling is $70,000. The utility of drilling is $270,000, $120,000, and $0 for a soaking, wet, and dry hole, respectively. This is a typical decision scenario. Reykjavik University, April/May 2005, BNs and DGs: Introduction 14
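To give a feel for how such a problem is evaluated, the sketch below computes the expected utility of drilling without a seismic test. The priors over the hole state are not stated on the slide; the classical textbook values 0.5/0.3/0.2 are assumed here purely for illustration.

# Expected utility of drilling vs. not drilling, ignoring the seismic test option.
priors = {"dry": 0.5, "wet": 0.3, "soaking": 0.2}        # assumed priors, not from the slide
payoff = {"dry": 0.0, "wet": 120_000.0, "soaking": 270_000.0}
drill_cost = 70_000.0

eu_drill = sum(priors[s] * payoff[s] for s in priors) - drill_cost
eu_no_drill = 0.0

print(f"EU(drill) = {eu_drill:,.0f}")                     # 20,000 under the assumed priors
print("recommended action:", "drill" if eu_drill > eu_no_drill else "do not drill")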

Normative Systems Characteristics Systems based on Bayesian networks and influence diagrams are normative, and have the following characteristics: Graph representing causal relations. Strength of relations by probabilities. Preferences represented by utilities. Recommendations based on maximizing expected utility. Reykjavik University, April/May 2005, BNs and DGs: Introduction 15

Normative Systems Task Categories The class of tasks suitable for normative systems can be divided into three broad subclasses: Forecasting: Computing probability distributions for future events. Interpretation: Pattern identification (diagnosis, classification). Planning: Generation of optimal sequences of decisions/actions. Reykjavik University, April/May 2005, BNs and DGs: Introduction 16

Sample Application Areas of Normative Systems Medical, software, information processing, industry, economy, military, agriculture, mining, law enforcement, etc. Reykjavik University, April/May 2005, BNs and DGs: Introduction 17

Outline 1 Introduction 2 Contents of Course 3 Causal Networks and Relevance Analysis 4 Bayesian Probability Theory Reykjavik University, April/May 2005, BNs and DGs: Contents of Course 18

Contents of Course Lecture plan:
1. Causal networks and Bayesian probability calculus (Tue 26/4, UK)
2. Construction of Bayesian networks (Wed 27/4, UK)
3. Workshop: Construction of Bayesian networks (Thu 28/4, UK)
4. Inference and analyses in Bayesian networks (Fri 29/4, UK)
5. Decisions, utilities and decision trees (Mon 2/5, FVJ)
6. Troubleshooting and influence diagrams (Tue 3/5, FVJ)
7. Solution of influence diagrams (Wed 4/5, FVJ)
8. Methods for analysing an ID spec. dec. scenario (Fri 6/5, FVJ)
9. Workshop: Construction of influence diagrams (Mon 9/5, FVJ)
10. Learning parameters from data (Tue 10/5, FVJ)
11. Bayesian networks as classifiers (Wed 11/5, FVJ)
12. Learning the structure of Bayesian networks (Thu 12/5, FVJ)
13. Continuous variables (Fri 13/5, FVJ)
The two workshops have a duration of 4 hours. All lectures start at 10:00. The plan is subject to changes. Reykjavik University, April/May 2005, BNs and DGs: Contents of Course 19

Literature, Requirements, Etc. Literature: Finn V. Jensen (2001), Bayesian Networks and Decision Graphs, Springer-Verlag. Exercises suggested after each lecture. To pass the course you are required to attend all lectures and hand in written answers to home assignments. A number of the exercises require access to the HUGIN Tool. See the course home page for instructions on downloading and installing the HUGIN Tool. Home page: www.cs.aau.dk/~uk/teaching/reykjavik-05/. Reykjavik University, April/May 2005, BNs and DGs: Contents of Course 20

Outline 1 Introduction 2 Contents of Course 3 Causal Networks and Relevance Analysis 4 Bayesian Probability Theory Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 21

Causal Networks and Relevance Analysis 1 Causal networks, variables and DAGs 2 Relevance analysis (transmission of evidence) Three types of connections Explaining away d-separation Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 22

Burglary or Earthquake Burglary or Earthquake Mr. Holmes is working in his office when he receives a phone call from his neighbor Dr. Watson, who tells him that Holmes's burglar alarm has gone off. Convinced that a burglar has broken into his house, Holmes rushes to his car and heads for home. On his way, he listens to the radio, and in the news it is reported that there has been a small earthquake in the area. Knowing that earthquakes have a tendency to turn burglar alarms on, he returns to his work. Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 23

Burglary or Earthquake: The Model Burglary Earthquake Alarm RadioNews WatsonCalls Each node in the graph represents a random variable. In this example, each variable has state space {no, yes}. Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 24

Burglary or Earthquake: The Model Burglary Earthquake Alarm RadioNews WatsonCalls Each node in the graph represents a random variable. In this example, each variable has state space {no, yes}. Three types of connections: Serial Diverging Converging Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 25

Serial Connections Burglary has a causal influence on Alarm, which in turn has a causal influence on Watson calls. Burglary Alarm WatsonCalls Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 26

Serial Connections Burglary has a causal influence on Alarm, which in turn has a causal influence on Watson calls. Burglary Alarm WatsonCalls If we observe Alarm, any information about the state of Burglary is irrelevant to our belief about Watson calls and vice versa. Burglary Alarm WatsonCalls Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 27

Serial Connections X has a causal influence on Y that has a causal influence on Z: X → Y → Z Serial Connections Information may be transmitted through a serial connection unless the state of the variable (Y) in the connection is known. Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 28

Diverging Connections Earthquake has a causal influence on both Alarm and Radio news. Alarm Earthquake RadioNews Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 29

Diverging Connections Earthquake has a causal influence on both Alarm and Radio news. Alarm Earthquake RadioNews If we observe Earthquake, any information about the state of Alarm is irrelevant for our belief about an earthquake report in the Radio news and vice versa. Alarm Earthquake RadioNews Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 30

Diverging Connections Y has a causal influence on both X and Z: X ← Y → Z Diverging Connections Information may be transmitted through a diverging connection, unless the state of the variable (Y) in the connection is known. Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 31

Converging Connections Alarm is causally influenced by both Burglary and Earthquake. Burglary Alarm Earthquake Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 32

Converging Connections Alarm is causally influenced by both Burglary and Earthquake. Burglary Alarm Earthquake If we observe Alarm and Burglary, then this will affect our belief about Earthquake: Burglary explains the alarm, reducing our belief that earthquake is the triggering factor, and vice versa. Burglary Alarm Earthquake Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 33

Converging Connections Both X and Z have a causal influence on Y: X → Y ← Z Converging Connections Information may only be transmitted through a converging connection if information about the state of the variable in the connection (Y) or one of its descendants is available. Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 34

Transmission of Evidence Serial connection (A → B → C): evidence may be transmitted unless the state of B is known. Diverging connection (B with children A_1, ..., A_n): evidence may be transmitted unless the state of B is known. Converging connection (B with parents A_1, ..., A_n): evidence may only be transmitted if B or one of its descendants has received evidence. It takes hard evidence to block a serial or diverging connection, whereas to open a converging connection soft evidence suffices. Notice the explaining away effect in a converging connection: B has been observed; then if A_i is observed, it explains the observation of B and the other causes are explained away (i.e., the beliefs in them are reduced). Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 35

Explaining Away I Burglary Earthquake Alarm RadioNews WatsonCalls The converging connection realizes the explaining away mechanism: The news about the earthquake strongly suggests that the earthquake is the cause of the alarm, and thereby explains away burglary as the cause. The ability to perform this kind of intercausal reasoning is unique to graphical models and is one of the main differences between automatic reasoning systems based on graphical models and those based on e.g. production rules. Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 36

Explaining Away II Assume that we have observed the symptom RunnyNose, and that there are two competing causes of it: Cold and Allergy. Observing Fever, however, provides strong evidence that cold is the cause of the problem, while our belief in Allergy being the cause decreases substantially (i.e., it is explained away by the observation of Fever). (Figure: Cold and Allergy both point to RunnyNose, and Cold also points to Fever; the arcs illustrate causal, diagnostic, and intercausal reasoning.) The ability of probabilistic networks to automatically perform such intercausal inference is a key contribution to their reasoning power. Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 37

d-separation The rules for transmission of evidence over serial, diverging, and converging connections can be combined into one general rule known as d-separation: d-separation A path π = ⟨u, ..., v⟩ in a DAG, G = (V, E), is blocked by S ⊆ V if π contains a node w such that either w ∈ S and the connections in π do not meet head-to-head at w, or w ∉ S, w has no descendants in S, and the connections in π meet head-to-head at w. For three (not necessarily disjoint) subsets A, B, S of V, A and B are said to be d-separated by S if all paths between A and B are blocked by S. Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 38

d-separation: Dependence and Independence If X and Y are not d-separated, they are d-connected. d-separation provides a criterion for reading statements of (conditional) dependence and independence (or relevance and irrelevance) from a causal structure. Dependence and independence depends on what you know (and do not know). Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 39
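The d-separation criterion is easy to mechanize. Below is a small Python sketch (an illustration written for these notes, not code from the course) that checks whether two variables are d-separated given an evidence set, by searching for an active trail with the usual serial/diverging/converging rules; it is tried out on the Burglary/Earthquake network.

from collections import defaultdict

def ancestors_of(nodes, parents):
    """All ancestors of `nodes`, including the nodes themselves."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(parents[n])
    return result

def d_separated(x, y, z, edges):
    """True iff x and y are d-separated by the set z in the DAG given by
    `edges` (a list of (parent, child) pairs). Assumes x and y are not in z."""
    parents, children = defaultdict(list), defaultdict(list)
    for p, c in edges:
        parents[c].append(p)
        children[p].append(c)
    z = set(z)
    anc_z = ancestors_of(z, parents)            # evidence that opens converging connections
    visited, frontier = set(), [(x, "from_child")]
    while frontier:
        node, how = frontier.pop()              # `how` records which way the trail entered the node
        if (node, how) in visited:
            continue
        visited.add((node, how))
        if node == y:
            return False                        # an active trail reaches y, so they are d-connected
        if how == "from_child" and node not in z:
            frontier += [(p, "from_child") for p in parents[node]]   # serial connection (reversed)
            frontier += [(c, "from_parent") for c in children[node]] # diverging connection
        elif how == "from_parent":
            if node not in z:                   # serial connection is open
                frontier += [(c, "from_parent") for c in children[node]]
            if node in anc_z:                   # converging connection opened by evidence below
                frontier += [(p, "from_child") for p in parents[node]]
    return True

edges = [("Burglary", "Alarm"), ("Earthquake", "Alarm"),
         ("Earthquake", "RadioNews"), ("Alarm", "WatsonCalls")]
print(d_separated("Burglary", "RadioNews", [], edges))          # True: blocked at the converging Alarm
print(d_separated("Burglary", "RadioNews", ["Alarm"], edges))   # False: evidence on Alarm opens the path

With evidence on Alarm, Burglary and RadioNews become d-connected, which is exactly the explaining-away mechanism described above.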

Example: Dependence and Independence (The slide shows a DAG over the variables A, B, C, D, E, F, G, H.) 1. C and G are d-connected. 2. C and E are d-separated. 3. C and E are d-connected given evidence on G. 4. A and G are d-separated given evidence on D and E. 5. A and G are d-connected given evidence on D. Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 40

Summary Causal networks Serial/diverging/converging connections Transmission of evidence in causal networks Explaining away (intercausal reasoning) Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 41

Summary Causal networks Serial/diverging/converging connections Transmission of evidence in causal networks Explaining away (intercausal reasoning) Dependence and independence d-separation in causal networks Reykjavik University, April/May 2005, BNs and DGs: Causal Networks and Relevance Analysis 42

Outline 1 Introduction 2 Contents of Course 3 Causal Networks and Relevance Analysis 4 Bayesian Probability Theory Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 43

Bayesian Probability Theory Axioms of probability theory Probability calculus Fundamental rule Bayes rule The chain rule Combination and marginalization Conditional independence Evidence Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 44

Axioms of Probability The probability of an event, a, is denoted P(a). Probabilities obey the following axioms: 1. 0 ≤ P(a) ≤ 1, with P(a) = 1 if a is certain. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 45

Axioms of Probability The probability of an event, a, is denoted P(a). Probabilities obey the following axioms: 1. 0 ≤ P(a) ≤ 1, with P(a) = 1 if a is certain. 2. If events a and b are mutually exclusive, then P(a or b) ≡ P(a ∨ b) = P(a) + P(b). In general, if events a_1, a_2, ... are pairwise incompatible, then P(∪_i a_i) = P(a_1) + P(a_2) + ... = Σ_i P(a_i). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 46

Axioms of Probability The probability of an event, a, is denoted P(a). Probabilities obey the following axioms: 1. 0 ≤ P(a) ≤ 1, with P(a) = 1 if a is certain. 2. If events a and b are mutually exclusive, then P(a or b) ≡ P(a ∨ b) = P(a) + P(b). In general, if events a_1, a_2, ... are pairwise incompatible, then P(∪_i a_i) = P(a_1) + P(a_2) + ... = Σ_i P(a_i). 3. Joint probability: P(a and b) ≡ P(a, b) = P(b | a) P(a). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 47

Conditional Probabilities The basic concept in the Bayesian treatment of uncertainty in causal networks is conditional probability. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 48

Conditional Probabilities The basic concept in the Bayesian treatment of uncertainty in causal networks is conditional probability. Every probability is conditioned on a context. For example, P(six) = 1/6 and P(six | SymmetricDie) = 1/6. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 49

Conditional Probabilities The basic concept in the Bayesian treatment of uncertainty in causal networks is conditional probability. Every probability is conditioned on a context. For example, P(six) = 1/6 and P(six | SymmetricDie) = 1/6. In general, "given the event b, the conditional probability of the event a is x" is written P(a | b) = x. It does not mean that whenever b holds we have P(a) = x. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 50

Conditional Probabilities The basic concept in the Bayesian treatment of uncertainty in causal networks is conditional probability. Every probability is conditioned on a context. For example, P(six) = 1/6 and P(six | SymmetricDie) = 1/6. In general, "given the event b, the conditional probability of the event a is x" is written P(a | b) = x. It does not mean that whenever b holds we have P(a) = x. Conditional Probability If b is true and everything else known is irrelevant for a, then the probability of a is x. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 51

A Justification of Axiom 3 The Fundamental Rule C: a set of n cats. B: the subset of brown cats (m of them). A: the subset of Abyssinians (i of them are brown). Frequencies: f(A | B, C) = i/m, f(B | C) = m/n. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 52

A Justification of Axiom 3 The Fundamental Rule C: a set of n cats. B: the subset of brown cats (m of them). A: the subset of Abyssinians (i of them are brown). Frequencies: f(A | B, C) = i/m, f(B | C) = m/n. f(A, B | C) = i/n = (i/m)(m/n) = f(A | B, C) f(B | C). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 53

Discrete Random Variables A discrete random variable, A, has a set of exhaustive and mutually exclusive states, dom(A) = {a_1, ..., a_n}. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 54

Discrete Random Variables A discrete random variable, A, has a set of exhaustive and mutually exclusive states, dom(A) = {a_1, ..., a_n}. In this context, an event is an assignment of values to a set of variables and P(A = a_1 ∨ ... ∨ A = a_n) = P(A = a_1) + ... + P(A = a_n) = Σ_{i=1}^{n} P(A = a_i) = 1. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 55

Discrete Random Variables A discrete random variable, A, has a set of exhaustive and mutually exclusive states, dom(A) = {a_1, ..., a_n}. In this context, an event is an assignment of values to a set of variables and P(A = a_1 ∨ ... ∨ A = a_n) = P(A = a_1) + ... + P(A = a_n) = Σ_{i=1}^{n} P(A = a_i) = 1. Capital letters will denote a variable, or a set of variables, and lower case letters will denote states (values) of variables. Example: R = r, B = b (Rain? = Raining, BirdsOnRoof = No) is an event. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 56

Probability Distributions for Variables If X is a variable with states x_1, ..., x_n, then P(X) denotes a probability distribution over these states: P(X) = (P(X = x_1), ..., P(X = x_n)), where P(X = x_i) ≥ 0 and Σ_{i=1}^{n} P(X = x_i) = 1. P(X | pa(X)) consists of one P(X) for each configuration of the parents, pa(X), of X. Example for the network B → A ← E, the table P(A | B, E):
        B = n, E = n   B = n, E = y   B = y, E = n   B = y, E = y
A = n      0.999           0.1            0.05           0.01
A = y      0.001           0.9            0.95           0.99
Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 57
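In code, a conditional probability table is simply an array with one axis per variable. The NumPy sketch below (an illustration only) stores the table P(A | B, E) above and checks that the distribution over A sums to one for every parent configuration.

import numpy as np

# P(A | B, E) with axes ordered (A, B, E); the state order is (n, y) on every axis.
p_A_given_BE = np.array([
    [[0.999, 0.1],    # A = n: inner axis is E = (n, y), outer rows are B = (n, y)
     [0.05,  0.01]],
    [[0.001, 0.9],    # A = y
     [0.95,  0.99]],
])

# Each parent configuration carries a distribution over A, so summing out A must give 1 everywhere.
print(p_A_given_BE.sum(axis=0))   # [[1. 1.] [1. 1.]]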

Rule of Total Probability P(A) = P(A, B = b_1) + ... + P(A, B = b_n) = P(A | B = b_1) P(B = b_1) + ... + P(A | B = b_n) P(B = b_n). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 58

Rule of Total Probability P(A) = P(A, B = b_1) + ... + P(A, B = b_n) = P(A | B = b_1) P(B = b_1) + ... + P(A | B = b_n) P(B = b_n). Computing P(A) from P(A, B) using the rule of total probability is often called marginalization, and is written compactly as P(A) = Σ_i P(A, B = b_i), or even shorter as P(A) = Σ_B P(A, B). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 59

The Fundamental Rule and Bayes Rule The fundamental rule of probability calculus on variables: P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 60

The Fundamental Rule and Bayes Rule The fundamental rule of probability calculus on variables: P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X). Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X) = P(X | Y) P(Y) / [P(X | Y = y_1) P(Y = y_1) + ... + P(X | Y = y_n) P(Y = y_n)]. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 61

Simple Bayesian Inference Graphically, inference using Bayes rule corresponds to reversing arrows: A → B with P(A, B) = P(A) P(B | A), versus A ← B with P(A, B) = P(B) P(A | B). P(A | B) = P(A, B) / P(B) = P(A) P(B | A) / Σ_{a ∈ dom(A)} P(A = a) P(B | A = a). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 62
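The arrow reversal can be carried out numerically as a couple of table operations. The sketch below (hypothetical numbers, for illustration) starts from P(A) and P(B | A), forms the joint, and recovers P(B) and P(A | B).

import numpy as np

p_A = np.array([0.7, 0.3])              # P(A)              -- assumed numbers
p_B_given_A = np.array([[0.9, 0.2],     # P(B | A), rows index B, columns index A
                        [0.1, 0.8]])

p_AB = p_B_given_A * p_A                # joint P(A, B) with axes (B, A)
p_B = p_AB.sum(axis=1)                  # marginal P(B) by the rule of total probability
p_A_given_B = p_AB / p_B[:, None]       # Bayes rule: P(A | B) = P(A, B) / P(B)

print(p_B)                              # [0.69 0.31]
print(p_A_given_B)                      # each row is a distribution over A given one state of B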

The Chain Rule Let V = {X_1, ..., X_n} be a set of variables. Let P(V) denote the joint probability distribution over V. Using the Fundamental Rule, P(V) can be written as P(V) = Π_{i=1}^{n} P(X_i | X_1, ..., X_{i-1}). Thus, any joint can be represented as a product of conditionals, e.g., P(X_1, X_2, X_3) = P(X_3 | X_1, X_2) P(X_2, X_1) = P(X_3 | X_1, X_2) P(X_2 | X_1) P(X_1). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 63
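The chain rule can be verified mechanically: multiply the conditionals back together and check that the product is a proper joint distribution. The sketch below does this for three binary variables with randomly generated (hypothetical) tables.

import numpy as np

rng = np.random.default_rng(0)

def random_dist(shape):
    """A random (conditional) table; the first axis indexes the child's states."""
    t = rng.random(shape)
    return t / t.sum(axis=0)

p_x1 = random_dist((2,))                  # P(X1)
p_x2_given_x1 = random_dist((2, 2))       # P(X2 | X1), axes (X2, X1)
p_x3_given_x1x2 = random_dist((2, 2, 2))  # P(X3 | X1, X2), axes (X3, X1, X2)

# Chain rule: P(x1, x2, x3) = P(x3 | x1, x2) P(x2 | x1) P(x1).
joint = np.zeros((2, 2, 2))
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            joint[x1, x2, x3] = (p_x3_given_x1x2[x3, x1, x2]
                                 * p_x2_given_x1[x2, x1]
                                 * p_x1[x1])

print(joint.sum())   # 1.0: the product of the conditionals is a full joint distribution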

The Chain Rule and Graph Structure Let V = {A, B, C, D}. Then P(V) factorizes as P(V) = P(A, B, C, D) = P(A | B, C, D) P(B, C, D) Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 64

The Chain Rule and Graph Structure Let V = {A, B, C, D}. Then P(V) factorizes as P(V) = P(A, B, C, D) = P(A | B, C, D) P(B, C, D) = P(A | B, C, D) P(B | C, D) P(C, D) Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 65

The Chain Rule and Graph Structure Let V = {A, B, C, D}. Then P(V) factorizes as P(V) = P(A, B, C, D) = P(A | B, C, D) P(B, C, D) = P(A | B, C, D) P(B | C, D) P(C, D) = P(A | B, C, D) P(B | C, D) P(C | D) P(D) (1) (Graph (1): the DAG over A, B, C, D induced by factorization (1).) Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 66

The Chain Rule and Graph Structure Let V = {A, B, C, D}. Then P(V) factorizes as P(V) = P(A, B, C, D) = P(A | B, C, D) P(B, C, D) = P(A | B, C, D) P(B | C, D) P(C, D) = P(A | B, C, D) P(B | C, D) P(C | D) P(D) (1) = P(B | A, C, D) P(D | A, C) P(C | A) P(A) (2) (Graphs (1) and (2): the DAGs over A, B, C, D induced by factorizations (1) and (2).) Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 67

The Chain Rule and Graph Structure Let V = {A, B, C, D}. Then P(V) factorizes as P(V) = P(A, B, C, D) = P(A | B, C, D) P(B, C, D) = P(A | B, C, D) P(B | C, D) P(C, D) = P(A | B, C, D) P(B | C, D) P(C | D) P(D) (1) = P(B | A, C, D) P(D | A, C) P(C | A) P(A) (2) = etc. (Graphs (1) and (2): the DAGs over A, B, C, D induced by the two factorizations.) Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 68

Combination and Marginalization Combination of probability distributions is multiplication. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 69

Combination and Marginalization Combination of probability distributions is multiplication. Using the Rule of Total Probability, from P(X, Y) the marginal probability distribution P(X) can be computed: P(x) = P(X = x) = Σ_{y ∈ dom(Y)} P(X = x, Y = y). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 70

Combination and Marginalization Combination of probability distributions is multiplication. Using the Rule of Total Probability, from P(X, Y) the marginal probability distribution P(X) can be computed: P(x) = P(X = x) = Σ_{y ∈ dom(Y)} P(X = x, Y = y). A variable Y is marginalized out of P(X, Y) as P(X) = Σ_{y ∈ dom(Y)} P(X, Y = y), or for short: P(X) = Σ_Y P(X, Y). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 71

Combination and Marginalization Combination of probability distributions is multiplication. Using the Rule of Total Probability, from P(X, Y) the marginal probability distribution P(X) can be computed: P(x) = P(X = x) = Σ_{y ∈ dom(Y)} P(X = x, Y = y). A variable Y is marginalized out of P(X, Y) as P(X) = Σ_{y ∈ dom(Y)} P(X, Y = y), or for short: P(X) = Σ_Y P(X, Y). The unity rule: Σ_X P(X | pa(X)) = 1_{pa(X)} (a table of ones over the configurations of the parents). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 72

Combination and Marginalization Example Combination:
       b1    b2                          b1     b2
 a1   0.4   0.2                    a1   0.12   0.14
 a2   0.5   0.6  x  (0.3, 0.7) =   a2   0.15   0.42
 a3   0.1   0.2                    a3   0.03   0.14
Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 73

Combination and Marginalization Example Combination:
       b1    b2                          b1     b2
 a1   0.4   0.2                    a1   0.12   0.14
 a2   0.5   0.6  x  (0.3, 0.7) =   a2   0.15   0.42
 a3   0.1   0.2                    a3   0.03   0.14
Marginalization: P(A) = (0.12 + 0.14, 0.15 + 0.42, 0.03 + 0.14) = (0.26, 0.57, 0.17)
Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 74
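The same example takes a couple of lines of NumPy, reading the 3×2 table as P(A | B) and the vector (0.3, 0.7) as P(B); this is just the arithmetic of the slide repeated in code.

import numpy as np

p_A_given_B = np.array([[0.4, 0.2],   # rows: a1, a2, a3; columns: b1, b2
                        [0.5, 0.6],
                        [0.1, 0.2]])
p_B = np.array([0.3, 0.7])

p_AB = p_A_given_B * p_B     # combination: multiply each column by the matching P(B) entry
print(p_AB)                  # [[0.12 0.14] [0.15 0.42] [0.03 0.14]]

p_A = p_AB.sum(axis=1)       # marginalization: sum B out
print(p_A)                   # [0.26 0.57 0.17]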

Conditional Independence A variable X is independent of Y given Z if P(x_i | y_j, z_k) = P(x_i | z_k) for all i, j, k. Shorthand: P(X | Y, Z) = P(X | Z). Notice that the definition is symmetric. Notation: X ⟂ Y | Z. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 75

Conditional Independence A variable X is independent of Y given Z if P(x_i | y_j, z_k) = P(x_i | z_k) for all i, j, k. Shorthand: P(X | Y, Z) = P(X | Z). Notice that the definition is symmetric. Notation: X ⟂ Y | Z. Under X ⟂ Y | Z the Fundamental Rule reduces to: P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 76
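A conditional independence statement can also be checked numerically: build a joint of the form P(X | Z) P(Y | Z) P(Z) and confirm that P(X | Y, Z) does not depend on the value of Y. A minimal sketch with hypothetical binary tables:

import numpy as np

rng = np.random.default_rng(1)

def random_dist(shape):
    """A random (conditional) table; the first axis indexes the child's states."""
    t = rng.random(shape)
    return t / t.sum(axis=0)

p_Z = random_dist((2,))
p_X_given_Z = random_dist((2, 2))   # axes (X, Z)
p_Y_given_Z = random_dist((2, 2))   # axes (Y, Z)

# Joint over (X, Y, Z) in which X and Y are conditionally independent given Z.
joint = p_X_given_Z[:, None, :] * p_Y_given_Z[None, :, :] * p_Z[None, None, :]

# P(X | Y, Z) = P(X, Y, Z) / P(Y, Z); under X ⟂ Y | Z the two Y-slices must coincide.
p_X_given_YZ = joint / joint.sum(axis=0, keepdims=True)
print(np.allclose(p_X_given_YZ[:, 0, :], p_X_given_YZ[:, 1, :]))   # True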

Graphical Representations of X ⟂ Y | Z X and Y are conditionally independent given Z: P(X, Y, Z) = P(X | Y, Z) P(Y | Z) P(Z) = P(X | Z) P(Y | Z) P(Z); graph: X ← Z → Y. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 77

Graphical Representations of X ⟂ Y | Z X and Y are conditionally independent given Z: P(X, Y, Z) = P(X | Y, Z) P(Y | Z) P(Z) = P(X | Z) P(Y | Z) P(Z); graph: X ← Z → Y. P(X, Y, Z) = P(X) P(Y | X, Z) P(Z | X) = P(X) P(Y | Z) P(Z | X); graph: X → Z → Y. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 78

Graphical Representations of X ⟂ Y | Z X and Y are conditionally independent given Z: P(X, Y, Z) = P(X | Y, Z) P(Y | Z) P(Z) = P(X | Z) P(Y | Z) P(Z); graph: X ← Z → Y. P(X, Y, Z) = P(X) P(Y | X, Z) P(Z | X) = P(X) P(Y | Z) P(Z | X); graph: X → Z → Y. P(X, Y, Z) = P(X | Y, Z) P(Y) P(Z | Y) = P(X | Z) P(Y) P(Z | Y); graph: X ← Z ← Y. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 79

Evidence Definition 1 An instantiation of a variable X is an observation on the exact state of X. f (X) = (0,...,0, 1, 0,...,0) Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 80

Evidence Definition 1 An instantiation of a variable X is an observation on the exact state of X. f (X) = (0,...,0, 1, 0,...,0) Definition 2 Let X be a variable with n states. An evidence function on X is an n-dimensional table of zeros and ones. f (X) = (0,...,0, 1, 0,...,0, 1, 0,...,0) Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 81

Evidence Definition 1 An instantiation of a variable X is an observation on the exact state of X. f (X) = (0,...,0, 1, 0,...,0) Definition 2 Let X be a variable with n states. An evidence function on X is an n-dimensional table of zeros and ones. f (X) = (0,...,0, 1, 0,...,0, 1, 0,...,0) Definition 3 Let X be a variable with n states. An evidence function (likelihood evidence) on X is an n-dimensional table of non-negative numbers. f (X) = (0,...,0, 2, 0,...,0, 1, 0,...,0) Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 82

Evidence Let U be a set of variables and let {X_1, ..., X_n} be a subset of U. Given a set of evidence ε = {ε_1, ..., ε_m} we want to compute P(X_i | ε) for all i. This can be done via P(X_i | ε) = P(X_i, ε) / P(ε) = Σ_{U \ {X_i}} P(U, ε) / Σ_U P(U, ε). This requires the full joint P(U). Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 83
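A brute-force version of this computation is easy to write once the full joint is available: multiply the joint by the evidence functions, sum out everything but the target variable, and normalize by P(ε). The Python sketch below is an illustration of that recipe on a small hypothetical joint.

import numpy as np

def posterior(joint, evidence, target_axis):
    """P(target | evidence), computed by brute force from a full joint table.
    `joint` has one axis per variable; `evidence` maps an axis to an evidence
    function, i.e. a vector of non-negative numbers such as (0, 1) for a hard finding."""
    table = joint.copy()
    for axis, e in evidence.items():
        shape = [1] * table.ndim
        shape[axis] = len(e)
        table = table * np.asarray(e, dtype=float).reshape(shape)   # multiply the evidence in
    other_axes = tuple(a for a in range(table.ndim) if a != target_axis)
    marginal = table.sum(axis=other_axes)                            # sum out all other variables
    return marginal / marginal.sum()                                 # divide by P(evidence)

# A hypothetical joint over three binary variables (X0, X1, X2), purely for illustration.
rng = np.random.default_rng(2)
joint = rng.random((2, 2, 2))
joint /= joint.sum()

print(posterior(joint, evidence={2: (0, 1)}, target_axis=0))   # P(X0 | X2 = its second state)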

Summary Axioms of probability theory. Conditional probabilities. The fundamental rule and Bayes rule. Probability calculus. Fundamental Rule. Bayes Rule. Combination and marginalization. The chain rule. Conditional independence. Evidence. Reykjavik University, April/May 2005, BNs and DGs: Bayesian Probability Theory 84