Distinguished Prof. Dr. Panos M. Pardalos
Center for Applied Optimization
Department of Industrial & Systems Engineering
Computer & Information Science & Engineering Department
Biomedical Engineering Program, McKnight Brain Institute
University of Florida
http://www.ise.ufl.edu/pardalos/
Lecture Outline

Introduction
Probability Space, Conditional Probability, Bayes Rule
Bayesian Approach and Example
Graph Theory Concepts, Bayesian Network Definition, Inferencing
Introduction

Bayesian networks are applied in cases of uncertainty, when we know certain (conditional) probabilities and are looking for unknown probabilities given specific conditions.

Applications: bioinformatics and medicine, engineering, document classification, image processing, data fusion, decision support systems, etc.

Examples:
Inference: P(Diagnosis | Symptom)
Anomaly detection: Is this observation anomalous?
Active data collection: What is the next diagnostic test to run, given a set of observations?
Discrete Random Variables

Let A denote a Boolean-valued random variable. Then A denotes an event, and there is some degree of uncertainty as to whether A occurs.

Examples:
A = the patient has tuberculosis
A = the coin flip comes up heads
A = France will win the World Cup in 2010
Intuition Behind Probability

Intuitively, the probability of event A equals the proportion of the outcomes where A is true.

Ω is the set of all possible outcomes; its area is P(Ω) = 1. The set colored in orange corresponds to the outcomes where A is true, so P(A) is the area of the orange oval. Clearly $0 \le P(A) \le 1$.
Kolmogorov's Probability Axioms

"The theory of probability as a mathematical discipline can and should be developed from axioms in exactly the same way as geometry and algebra."
Andrey Nikolaevich Kolmogorov, Foundations of the Theory of Probability, 1933.

1. $P(A) \ge 0$ for every event $A \subseteq \Omega$
2. $P(\Omega) = 1$
3. σ-additivity: any countable sequence of pairwise disjoint events $A_1, A_2, \ldots$ satisfies
$$P\Big(\bigcup_i A_i\Big) = \sum_i P(A_i)$$
Other Ways to Deal with Uncertainty

Three-valued logic: True / False / Maybe
Fuzzy logic (truth values between 0 and 1)
Non-monotonic reasoning (especially focused on penguin informatics)
Dempster-Shafer theory (and an extension known as quasi-Bayesian theory)
Possibilistic logic

But...
Coherence of the Axioms

Kolmogorov's axioms of probability are the only model with this property: wagers (probabilities) are assigned in such a way that no matter what set of wagers your opponent chooses, you are not exposed to certain loss.

Bruno de Finetti, Probabilismo, Napoli, Logos 14, 1931, pp. 163-219.
Bruno de Finetti, Probabilism: A Critical Essay on the Theory of Probability and on the Value of Science (translation of the 1931 article), Erkenntnis, volume 31, September 1989, pp. 169-223.
Consequences of the Axioms

$P(\bar{A}) = 1 - P(A)$, where $\bar{A} = \Omega \setminus A$
$P(\emptyset) = 0$
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
$P(A) = P(A \cap B) + P(A \cap \bar{B})$
Conditional Probability

$P(B \mid A)$ is the proportion of the outcomes in which A is true that also have B true.

Formal definition:
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$
Conditional Probability: Example

Let us draw a card from a deck of 52 playing cards.

A = the card is a court card. $P(A) = 12/52 = 3/13$
B = the card is a queen. $P(B) = 4/52 = 1/13$; since every queen is a court card, $P(A \cap B) = P(B) = 1/13$

If we apply the definition, we obtain a very intuitive result:
$$P(B \mid A) = \frac{1/13}{3/13} = \frac{1}{3}, \qquad P(A \mid B) = \frac{1/13}{1/13} = 1$$

C = the suit is spades. $P(C) = 1/4$. Note that $P(C \mid A) = P(C) = 1/4$; in other words, event C is independent of A.
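A minimal Python sketch (not part of the original slides) that verifies these conditional probabilities by enumerating all 52 equally likely outcomes; the event encodings are illustrative assumptions:

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in ranks for suit in suits]

def prob(event):
    """P(event) under the uniform distribution on the deck."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def cond_prob(event_b, event_a):
    """P(B | A) = P(A and B) / P(A)."""
    return prob(lambda c: event_a(c) and event_b(c)) / prob(event_a)

court = lambda c: c[0] in ("J", "Q", "K")  # event A
queen = lambda c: c[0] == "Q"              # event B
spade = lambda c: c[1] == "spades"         # event C

print(cond_prob(queen, court))  # 1/3
print(cond_prob(court, queen))  # 1
print(cond_prob(spade, court))  # 1/4 = P(C), so C is independent of A
```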
Independent Events

Definition: Two events A and B are independent if and only if $P(A \cap B) = P(A)P(B)$. Let us denote independence of A and B by I(A, B).

The independence of A and B implies
$P(A \mid B) = P(A)$, if $P(B) \neq 0$
$P(B \mid A) = P(B)$, if $P(A) \neq 0$
Why?
Conditional Independence

One might observe that people with longer arms tend to have higher levels of reading skills. If the age is fixed, this relationship disappears: arm length and reading skills are conditionally independent given the age.

Definition: Two events A and B are conditionally independent given C if and only if $P(A \cap B \mid C) = P(A \mid C)P(B \mid C)$. Notation: I(A, B | C).

$P(A \mid B, C) = P(A \mid C)$, if $P(B \cap C) \neq 0$
$P(B \mid A, C) = P(B \mid C)$, if $P(A \cap C) \neq 0$
Bayes Rule

The definition of conditional probability
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
implies the chain rule: $P(A \cap B) = P(A \mid B)P(B)$. By symmetry, $P(A \cap B) = P(B \mid A)P(A)$. After we equate the right-hand sides and do some algebra, we obtain Bayes Rule:
$$P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A)}$$
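As a quick sanity check (not in the original slides), Bayes Rule applied to the card example above recovers the direct computation, with A = "court card" and B = "queen" as before:
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} = \frac{1 \cdot 1/13}{3/13} = \frac{1}{3}$$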
Monty Hall Problem

The treasure is contained in one of the boxes A, B, and C with equal probability, i.e., P(A) = P(B) = P(C) = 1/3. You are offered to choose one of them; let us say you choose box A. Then the host of the game opens a box which you did not choose and which does not contain the treasure.
For instance, the host has opened box C. Then you are offered an option to reconsider your choice. What would you do?

In other words, letting $N_{A,C}$ denote the event "you chose A and the host opened C", what are the probabilities $P(A \mid N_{A,C})$ and $P(B \mid N_{A,C})$? What does your intuition advise? Now apply Bayes Rule.
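For verification (not part of the original slides): $P(N_{A,C} \mid A) = 1/2$ (the host picks B or C at random), $P(N_{A,C} \mid B) = 1$, and $P(N_{A,C} \mid C) = 0$, so Bayes Rule gives $P(B \mid N_{A,C}) = \frac{1 \cdot 1/3}{\frac{1}{2} \cdot \frac{1}{3} + 1 \cdot \frac{1}{3} + 0} = \frac{2}{3}$ and $P(A \mid N_{A,C}) = 1/3$: switching is better. A minimal Monte Carlo sketch (box labels and trial count are illustrative) confirms this:

```python
import random

def monty_hall(trials=100_000):
    """Estimate the win probabilities of the stay and switch strategies."""
    stay_wins = switch_wins = 0
    boxes = ["A", "B", "C"]
    for _ in range(trials):
        treasure = random.choice(boxes)
        pick = "A"  # as on the slide, we initially choose box A
        # The host opens a box that is neither our pick nor the treasure.
        opened = random.choice([b for b in boxes if b != pick and b != treasure])
        remaining = next(b for b in boxes if b != pick and b != opened)
        stay_wins += (pick == treasure)
        switch_wins += (remaining == treasure)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())  # approximately (0.333, 0.667)
```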
Classification Based on Bayes Theorem

Let Y denote the class variable; for example, we want to predict whether a borrower will default. Let $X = (X_1, X_2, \ldots, X_k)$ denote the attribute set (e.g., home owner, marital status, annual income).

We can treat X and Y as random variables and determine $P(Y \mid X)$ (the posterior probability). Knowing $P(Y \mid X)$, we can relate the record X to the class that maximizes the posterior probability. How can we estimate $P(Y \mid X)$ from training data?
#   Home Owner  Marital Status  Annual Income  Defaulted Borrower
    (binary)    (categorical)   (continuous)   (class)
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Table: Historical data for default prediction
Bayes Approach

An accurate estimate of the posterior probability for every possible combination of attributes and classes requires a very large training set, even for a moderate number of attributes. We can utilize Bayes theorem instead:
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$

P(X) is a constant and can be calculated as a normalization multiplier.
P(Y) can be easily estimated from the training set (the fraction of training records that belong to each class).
Estimating P(X | Y) is the more challenging task. Methods: naïve Bayes classifier, Bayesian network.
Naïve Bayes Classifier

Attributes are assumed to be conditionally independent given the class label y; thus
$$P(X \mid Y = y) = \prod_{i=1}^{k} P(X_i \mid Y = y)$$
and
$$P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{k} P(X_i \mid Y)}{P(X)}$$
Now we need to estimate $P(X_i \mid Y)$ for $i = 1, \ldots, k$.
Estimating Probabilities

$P(X_i = x \mid Y = y)$ is estimated as the fraction of training instances of class y that take on the particular attribute value x. For example:
P(Home Owner = Yes | Y = No) = 3/7
P(Marital Status = Single | Y = Yes) = 2/3

What about continuous attributes? One solution is to discretize each continuous attribute and then replace its value with the corresponding interval (transform continuous attributes into ordinal attributes). But how should we discretize?
Continuous Attributes

Assume a certain type of probability distribution for the continuous attribute. For example, it can be a Gaussian distribution with p.d.f.
$$f_{ij}(x_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left(-\frac{(x_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$
The parameters $\mu_{ij}$, $\sigma_{ij}$ of $f_{ij}$ can be estimated from the training records that belong to class $y_j$.
Continuous Attributes

Using the approximation
$$P(x_i < X_i \le x_i + \epsilon \mid Y = y_j) = \int_{x_i}^{x_i + \epsilon} f_{ij}(t)\,dt \approx f_{ij}(x_i)\,\epsilon$$
and the fact that ε cancels out when we normalize the posterior probability $P(Y \mid X)$, we are allowed to set
$$P(X_i = x_i \mid Y = y_j) = f_{ij}(x_i)$$
Example

The sample mean for the annual income attribute with respect to class No:
$$\bar{x} = \frac{125 + 100 + 70 + \ldots + 75}{7} = 110$$
Variance:
$$s^2 = \frac{(125 - 110)^2 + (100 - 110)^2 + \ldots + (75 - 110)^2}{6} = 2975, \qquad s = \sqrt{2975} = 54.54$$
Given a test record with income $120K:
$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi} \cdot 54.54}\, e^{-\frac{(120 - 110)^2}{2 \cdot 2975}} = 0.0072$$
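A minimal sketch (not part of the original slides) reproducing these estimates from the class-No incomes in the table:

```python
import math

incomes_no = [125, 100, 70, 120, 60, 220, 75]  # class-No incomes from the table

mean = sum(incomes_no) / len(incomes_no)  # sample mean
var = sum((x - mean) ** 2 for x in incomes_no) / (len(incomes_no) - 1)  # sample variance

def gaussian_pdf(x, mu, sigma2):
    """Gaussian density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(mean, var)                     # 110.0 2975.0
print(gaussian_pdf(120, mean, var))  # approximately 0.0072
```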
Suppose X = (Home Owner = No, Marital Status = Married, Income = $120K). From the training data:

P(Home Owner = Yes | Y = No) = 3/7
P(Home Owner = No | Y = No) = 4/7
P(Home Owner = Yes | Y = Yes) = 0
P(Home Owner = No | Y = Yes) = 1
P(Marital Status = Divorced | Y = No) = 1/7
P(Marital Status = Married | Y = No) = 4/7
P(Marital Status = Single | Y = No) = 2/7
P(Marital Status = Divorced | Y = Yes) = 1/3
P(Marital Status = Married | Y = Yes) = 0
P(Marital Status = Single | Y = Yes) = 2/3
For annual income:
class No: $\bar{x} = 110$, $s^2 = 2975$
class Yes: $\bar{x} = 90$, $s^2 = 25$

Class-conditional probabilities:
P(X | No) = P(Home Owner = No | No) × P(Status = Married | No) × P(Annual Income = $120K | No) = 4/7 × 4/7 × 0.0072 = 0.0024
P(X | Yes) = P(Home Owner = No | Yes) × P(Status = Married | Yes) × P(Annual Income = $120K | Yes) = 1 × 0 × $1.2 \times 10^{-9}$ = 0
Posterior probabilities:
P(No | X) = α × 7/10 × 0.0024 = 0.0016α
P(Yes | X) = 0
where α = 1/P(X).

Since P(No | X) > P(Yes | X), the record is classified as No.
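The whole worked example can be reproduced with a short script. This is a minimal sketch (not a production classifier); the tuple encoding of the training table is an assumption of the sketch, and the printed scores are the unnormalized products P(Y) P(X | Y), i.e., the posteriors up to the factor α:

```python
import math
from collections import Counter

# Each record: (home_owner, marital_status, annual_income_K, defaulted)
data = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def class_likelihood(x, y):
    """P(X | Y=y) under the naive Bayes conditional-independence assumption."""
    rows = [r for r in data if r[3] == y]
    home, status, income = x
    p_home = sum(r[0] == home for r in rows) / len(rows)
    p_status = sum(r[1] == status for r in rows) / len(rows)
    # Gaussian density for the continuous income attribute.
    vals = [r[2] for r in rows]
    mu = sum(vals) / len(vals)
    var = sum((v - mu) ** 2 for v in vals) / (len(vals) - 1)
    p_income = math.exp(-(income - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p_home * p_status * p_income

x = ("No", "Married", 120)
prior = Counter(r[3] for r in data)
scores = {y: (prior[y] / len(data)) * class_likelihood(x, y) for y in ("No", "Yes")}
print(scores)                       # {'No': ~0.0016, 'Yes': 0.0}
print(max(scores, key=scores.get))  # 'No'
```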
Naïve Bayes Classifier: Discussion

Robust to isolated noise points, because such points are averaged out when estimating conditional probabilities from data.
Can handle missing values by ignoring the example during model building and classification.
Robust to irrelevant attributes: if $X_i$ is irrelevant, then $P(X_i \mid Y)$ is almost uniformly distributed and thus has little impact on the posterior probability.
Correlated attributes can degrade the performance, because conditional independence no longer holds; methods such as Bayesian networks take into account the dependence between attributes.
Directed Graph

Definition: A directed graph or digraph G is an ordered pair G := (V, A), where V is a set whose elements are called vertices or nodes, and $A \subseteq V \times V$ is a set of ordered pairs of vertices, called directed edges, arcs, or arrows.

Example: $V = \{V_1, V_2, V_3, V_4, V_5\}$, $A = \{(V_1, V_1), (V_1, V_4), (V_2, V_1), (V_4, V_2), (V_5, V_5)\}$
Cycle: $V_1 \to V_4 \to V_2 \to V_1$
Directed Acyclic Graph

Definition: A directed acyclic graph (DAG) is a directed graph with no directed cycles; that is, for any vertex v, there is no nonempty directed path that starts and ends on v.
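A minimal sketch (not part of the original slides) of a standard acyclicity test, Kahn's algorithm: a digraph is a DAG if and only if repeatedly deleting vertices of in-degree zero eventually deletes every vertex. On the digraph from the previous slide it correctly detects the cycle:

```python
from collections import defaultdict

def is_dag(vertices, arcs):
    """Return True iff the digraph (vertices, arcs) has no directed cycle."""
    indegree = {v: 0 for v in vertices}
    successors = defaultdict(list)
    for u, v in arcs:
        successors[u].append(v)
        indegree[v] += 1
    queue = [v for v in vertices if indegree[v] == 0]
    removed = 0
    while queue:
        u = queue.pop()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return removed == len(vertices)  # every vertex removed iff acyclic

V = ["V1", "V2", "V3", "V4", "V5"]
A = [("V1", "V1"), ("V1", "V4"), ("V2", "V1"), ("V4", "V2"), ("V5", "V5")]
print(is_dag(V, A))                      # False: self-loops and V1 -> V4 -> V2 -> V1
print(is_dag(["X", "Y"], [("X", "Y")]))  # True
```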
Some Graph Theory Notions

$V_1$ and $V_4$ are parents of $V_2$: $(V_1, V_2) \in E$ and $(V_4, V_2) \in E$
$V_5$, $V_3$ and $V_2$ are descendants of $V_1$: $V_1$ is connected to $V_5$, $V_3$ and $V_2$ by directed paths
$V_4$ and $V_2$ are ancestors of $V_3$: there exist directed paths from $V_4$ and $V_2$ to $V_3$
$V_6$ and $V_4$ are nondescendents of $V_1$: directed paths from $V_1$ to $V_4$ and $V_6$ do not exist
Bayesian Network Definition

Elements of a Bayesian network:
A directed acyclic graph (DAG) encodes the dependence relationships among a set of variables; each node of the graph represents a variable, and each arc asserts a dependence relationship between a pair of variables.
A probability table associates each node with its immediate parent nodes.
The DAG satisfies the Markov condition.
The Markov Condition

Definition: Suppose we have a joint probability distribution P of the random variables in some set V and a DAG G = (V, E). We say that (G, P) satisfies the Markov condition if each variable $X \in V$ is conditionally independent of the set of all its nondescendents ($ND_X$) given the set of all its parents ($PA_X$): $I(\{X\}, ND_X \mid PA_X)$.

The definition implies that a root node X, which has no parents, is unconditionally independent of its nondescendents.
Figure: Bayesian network case study (E = Exercise, D = Diet, HD = Heart Disease, Hb = Heartburn, B = Blood Pressure, C = Chest Pain)
Markov Condition Example

Node  Parents  Independence
E     none     I(E, {D, Hb})
D     none     I(D, E)
HD    E, D     I(HD, Hb | {E, D})
Hb    ??       ??
B     ??       ??
C     ??       ??

Note that I(A, B | C) implies I(A, D | C) whenever $D \subseteq B$.
Naïve Bayes Representation

Recall that a naïve Bayes classifier assumes conditional independence of the attributes $X_1, X_2, \ldots, X_k$ given the target class Y. This can be represented by the Bayesian network below, in which Y is the only parent of each attribute node $X_i$.
Inferencing

We can compute the joint probability from a Bayesian network:
$$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parents}(X_i))$$
Thus we can compute any conditional probability:
$$P(X_k \mid X_m) = \frac{P(X_k, X_m)}{P(X_m)} = \frac{\sum_{\text{entries } X \text{ matching } X_k, X_m} P(X)}{\sum_{\text{entries } X \text{ matching } X_m} P(X)}$$
Example of Inferencing

Suppose no prior information about the person is given. What is the probability of developing heart disease? With α ∈ {Yes, No} and β ∈ {Healthy, Unhealthy} (E and D are independent root nodes, so P(E, D) factorizes):
$$P(HD = \text{Yes}) = \sum_{\alpha} \sum_{\beta} P(HD = \text{Yes} \mid E = \alpha, D = \beta)\, P(E = \alpha, D = \beta)$$
$$= \sum_{\alpha} \sum_{\beta} P(HD = \text{Yes} \mid E = \alpha, D = \beta)\, P(E = \alpha)\, P(D = \beta)$$
$$= 0.25 \cdot 0.7 \cdot 0.25 + 0.45 \cdot 0.7 \cdot 0.75 + 0.55 \cdot 0.3 \cdot 0.25 + 0.75 \cdot 0.3 \cdot 0.75 = 0.49$$
Now let us compute the probability of heart disease when the person has high blood pressure. With γ ∈ {Yes, No}, the probability of high blood pressure is
$$P(B = \text{High}) = \sum_{\gamma} P(B = \text{High} \mid HD = \gamma)\, P(HD = \gamma) = 0.85 \cdot 0.49 + 0.2 \cdot 0.51 = 0.5185$$
The posterior probability of heart disease given high blood pressure is
$$P(HD = \text{Yes} \mid B = \text{High}) = \frac{P(B = \text{High} \mid HD = \text{Yes})\, P(HD = \text{Yes})}{P(B = \text{High})} = \frac{0.85 \cdot 0.49}{0.5185} = 0.8033$$
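These computations (including P(HD = Yes) = 0.49 from the previous slide) can be checked by inference by enumeration, i.e., by summing the factorized joint from the Inferencing slide over the matching entries. This is a minimal sketch restricted to the E, D, HD, B fragment of the network, using only the CPT values quoted on these slides (the value name "Low" for normal blood pressure is an assumption; the tables for Hb and C are not given here):

```python
from itertools import product

p_e  = {"Yes": 0.7, "No": 0.3}               # P(E), exercise
p_d  = {"Healthy": 0.25, "Unhealthy": 0.75}  # P(D), diet
p_hd = {("Yes", "Healthy"): 0.25, ("Yes", "Unhealthy"): 0.45,
        ("No", "Healthy"): 0.55, ("No", "Unhealthy"): 0.75}  # P(HD=Yes | E, D)
p_b  = {"Yes": 0.85, "No": 0.2}              # P(B=High | HD)

def joint(e, d, hd, b):
    """P(E,D,HD,B) = P(E) P(D) P(HD | E,D) P(B | HD)."""
    p = p_e[e] * p_d[d]
    p *= p_hd[(e, d)] if hd == "Yes" else 1 - p_hd[(e, d)]
    p *= p_b[hd] if b == "High" else 1 - p_b[hd]
    return p

def query(target, evidence):
    """P(target | evidence) by summing the joint over matching entries."""
    num = den = 0.0
    for e, d, hd, b in product(p_e, p_d, ["Yes", "No"], ["High", "Low"]):
        entry = {"E": e, "D": d, "HD": hd, "B": b}
        p = joint(e, d, hd, b)
        if all(entry[k] == v for k, v in evidence.items()):
            den += p
            if all(entry[k] == v for k, v in target.items()):
                num += p
    return num / den

print(query({"HD": "Yes"}, {}))             # 0.49
print(query({"B": "High"}, {}))             # 0.5185
print(query({"HD": "Yes"}, {"B": "High"}))  # approximately 0.8033
```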
Complexity Issues

Recall that we can compute any conditional probability:
$$P(X_k \mid X_m) = \frac{P(X_k, X_m)}{P(X_m)} = \frac{\sum_{\text{entries } X \text{ matching } X_k, X_m} P(X)}{\sum_{\text{entries } X \text{ matching } X_m} P(X)}$$
In general this requires an exponentially large number of operations. We can apply various tricks to reduce the complexity, but querying of Bayes nets is NP-hard.

D. M. Chickering, D. Heckerman, C. Meek, Large-Sample Learning of Bayesian Networks is NP-Hard. Journal of Machine Learning Research, 5 (2004) 1287-1330.
Discussion

A Bayesian network is an elegant way of encoding causal probabilistic dependencies, and the dependency model can be represented graphically. Constructing a network requires effort, but adding a new variable is quite straightforward. Bayesian networks are well suited to incomplete data, and due to the probabilistic nature of the model the method is robust to model overfitting.
What We Have Learned

Independence and conditional independence
Bayes theorem
Naïve Bayes classification
The definition of a Bayes net
Computing probabilities with a Bayes net
Literature

Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Addison-Wesley, 2005.
Finn V. Jensen, Thomas D. Nielsen, Bayesian Networks and Decision Graphs, 2nd Ed., Springer, 2007.
Richard E. Neapolitan, Learning Bayesian Networks, Prentice Hall, 2003.
Judea Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.