A.I. in Health Informatics, Lecture 3: Clinical Reasoning & Probabilistic Inference, II
Kevin Small & Byron Wallace
Slides borrow heavily from Andrew Moore, Weng-Keen Wong, and Longin Jan Latecki
today
- probabilistic reasoning with Bayesian networks
- reasoning with uncertainty: a crucial building block for automated clinical reasoning systems
- review of conditional independence and (a little) graph theory
introduction: diagnosing inhalational anthrax
observe the following symptoms:
- patient has difficulty breathing
- patient has a cough
- patient has a fever
- patient has diarrhea
- patient has an inflamed mediastinum
introduction
- diagnoses are often stated as probabilities (e.g., a 30% chance of inhalational anthrax)
- additional evidence should change your degree of belief in the diagnosis
- how much evidence until you are absolutely certain?
- Bayesian networks are a methodology for reasoning with uncertainty
review: random variables
- the basic element of probabilistic reasoning
- refers to an event drawn from a distribution modeling the uncertain outcome of that event
Boolean random variables
- take the values true or false
- can be thought of as "event occurred" or "event didn't occur"
examples (notation):
- patient has inhalational anthrax: A
- patient has difficulty breathing: B
- patient has a cough: C
- patient has a fever: F
- patient has diarrhea: D
- patient has an inflamed mediastinum: M
joint probability distribution
- expresses the probability over an arbitrary number of variables
- for each combination of values, states how probable that combination is
- entries must sum to 1

A      D      M      P(A,D,M)
false  false  false  0.65
false  false  true   0.03
false  true   false  0.10
false  true   true   0.04
true   false  false  0.02
true   false  true   0.06
true   true   false  0.03
true   true   true   0.07
reasoning with the joint
- with the joint, you can compute anything
- may need marginalization and/or Bayes' rule to do so

P(d) = P(a,d,m) + P(a,d,¬m) + P(¬a,d,m) + P(¬a,d,¬m)
     = 0.07 + 0.03 + 0.04 + 0.10 = 0.24
P(a,m | d) = P(a,m,d) / P(d) = 0.07 / 0.24 ≈ 0.292
P(a | m,d) = P(a,m,d) / P(m,d) = 0.07 / 0.11 ≈ 0.636
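These queries can be reproduced mechanically from the table. A minimal Python sketch (the `joint` dict and `marginal` helper are illustrative names, not from the slides):

```python
# The P(A, D, M) joint from the slide, keyed by (a, d, m) truth values.
joint = {
    (False, False, False): 0.65, (False, False, True): 0.03,
    (False, True,  False): 0.10, (False, True,  True): 0.04,
    (True,  False, False): 0.02, (True,  False, True): 0.06,
    (True,  True,  False): 0.03, (True,  True,  True): 0.07,
}

def marginal(event):
    """Sum joint entries consistent with a partial assignment.

    `event` maps variable positions (0 = A, 1 = D, 2 = M) to values.
    """
    return sum(p for assign, p in joint.items()
               if all(assign[i] == v for i, v in event.items()))

p_d = marginal({1: True})
p_a_given_md = marginal({0: True, 1: True, 2: True}) / marginal({1: True, 2: True})
print(round(p_d, 2), round(p_a_given_md, 3))   # 0.24 0.636
```

Every query against a joint reduces to the same pattern: sum the matching rows, and divide by another such sum when conditioning.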
problems with the joint
- not a compact representation: requires 2^n − 1 parameters to express
- requires a lot of data to estimate accurately
- (conditional) independence to the rescue!
independence
random variables A and B are independent if
P(a,b) = P(a) P(b)
equivalently, P(a | b) = P(a) and P(b | a) = P(b)
knowledge regarding the outcome of A provides no additional information about the outcome of B
independence
- independence allows a compact representation
- suppose n coin flips: the joint requires 2^n − 1 parameters
- if the flips are independent, it requires only n parameters
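The coin-flip count can be sketched directly (function names are mine, not the slides'):

```python
# Parameters to specify a distribution over n Boolean variables (e.g., coin flips).
def joint_params(n):
    return 2 ** n - 1   # one probability per assignment, minus the sum-to-1 constraint

def independent_params(n):
    return n            # just P(X_i = true) for each flip

for n in (3, 10, 30):
    print(n, joint_params(n), independent_params(n))
```

Already at n = 30 the full joint needs over a billion parameters, while the independent model needs thirty.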
conditional independence
random variables A and B are conditionally independent given C if
P(a,b | c) = P(a | c) P(b | c)
equivalently, P(a | b,c) = P(a | c) and P(b | a,c) = P(b | c)
given knowledge of C, the outcome of A provides no additional information about the outcome of B
Bayesian networks (finally!)
a Bayesian network G = (V, E) is composed of
- a directed acyclic graph (here: A → B, B → C, B → D)
- a set of conditional probability tables (CPTs)

A      P(A)
false  0.6
true   0.4

A      B      P(B|A)
false  false  0.01
false  true   0.99
true   false  0.7
true   true   0.3

B      C      P(C|B)
false  false  0.4
false  true   0.6
true   false  0.9
true   true   0.1

B      D      P(D|B)
false  false  0.02
false  true   0.98
true   false  0.05
true   true   0.95
semantics of structure
- each vertex is a random variable (graph: A → B, B → C, B → D)
- B is a parent of D; D is conditioned on B
- each vertex has a CPT p(x_i | Parents(X_i))

A      P(A)
false  0.6
true   0.4

B      D      P(D|B)
false  false  0.02
false  true   0.98
true   false  0.05
true   true   0.95
conditional probability tables
with one parent (A → B):

A      B      P(B|A)
false  false  0.01
false  true   0.99
true   false  0.7
true   true   0.3

with two parents (A → B ← E):

A      B      E      P(B|A,E)
false  false  false  0.2
false  false  true   0.1
false  true   false  0.8
false  true   true   0.9
true   false  false  0.25
true   false  true   0.98
true   true   false  0.75
true   true   true   0.02

a Boolean variable with n parents has 2^(n+1) CPT entries (2^n of which must be stored)
note what must sum to 1: the entries over B for each fixed setting of the parents
utility of Bayes nets
two important properties:
- encodes the conditional independence relationships between random variables in the graph
- compact representation of the joint
given its parents (P1, P2), a vertex X is conditionally independent of its non-descendants (ND1, ND2); its children (C1, C2) are descendants
calculating the joint
can compute the joint using the Markov condition:
p(X_1 = x_1, …, X_n = x_n) = ∏_{i=1}^{n} p(X_i = x_i | Parents(X_i))
for the example network (A → B, B → C, B → D):
p(a, b, ¬c, d) = p(a) p(b|a) p(¬c|b) p(d|b) = 0.4 × 0.3 × 0.9 × 0.95 = 0.1026
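The factorization is a one-liner once the CPTs are stored. A sketch, with CPT dicts keyed as (parent value, child value) — the naming is mine:

```python
# Markov-condition factorization for the example network A -> B, B -> C, B -> D.
p_A = {True: 0.4, False: 0.6}
p_B_given_A = {(False, False): 0.01, (False, True): 0.99,
               (True, False): 0.70, (True, True): 0.30}
p_C_given_B = {(False, False): 0.40, (False, True): 0.60,
               (True, False): 0.90, (True, True): 0.10}
p_D_given_B = {(False, False): 0.02, (False, True): 0.98,
               (True, False): 0.05, (True, True): 0.95}

def joint(a, b, c, d):
    """p(a, b, c, d) = p(a) p(b|a) p(c|b) p(d|b)."""
    return p_A[a] * p_B_given_A[(a, b)] * p_C_given_B[(b, c)] * p_D_given_B[(b, d)]

# p(a, b, ¬c, d) = 0.4 * 0.3 * 0.9 * 0.95
print(round(joint(True, True, False, True), 4))   # 0.1026
```

Note that the network stores 9 numbers (1 + 4 + 2 + 2 independent entries) instead of the 15 a full joint over four Boolean variables would need.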
inference
- computing probabilities specified by the model
- generally queries of the form p(X | E), where X are the query variable(s) and E the evidence variable(s)
inference
- computing probabilities specified by the model
- let's try p(C | a), with C the query variable and A the evidence variable — to the board!
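One way to check the board exercise is brute-force enumeration: sum the hidden variables out of the joint, then normalize. A sketch over the example network (function names are mine):

```python
from itertools import product

# CPTs of the example network A -> B, B -> C, B -> D, keyed (parent, child).
p_A = {True: 0.4, False: 0.6}
p_B = {(False, False): 0.01, (False, True): 0.99, (True, False): 0.70, (True, True): 0.30}
p_C = {(False, False): 0.40, (False, True): 0.60, (True, False): 0.90, (True, True): 0.10}
p_D = {(False, False): 0.02, (False, True): 0.98, (True, False): 0.05, (True, True): 0.95}

def joint(a, b, c, d):
    return p_A[a] * p_B[(a, b)] * p_C[(b, c)] * p_D[(b, d)]

def p_c_given_a(c, a):
    """P(C = c | A = a): sum the hidden variables B and D out of the joint."""
    num = sum(joint(a, b, c, d) for b, d in product([False, True], repeat=2))
    den = sum(joint(a, b, c2, d) for b, c2, d in product([False, True], repeat=3))
    return num / den

print(round(p_c_given_a(True, True), 2))   # 0.45
```

By hand this collapses to P(c | a) = Σ_b P(b|a) P(c|b) = 0.3 × 0.1 + 0.7 × 0.6 = 0.45, since D sums out.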
bad news
- exact inference is feasible only in small- to medium-sized networks
- exact inference in larger networks takes a long time
- can use approximate inference instead
network structure
- use domain expert knowledge to design it, or
- learn it from data (not trivial)
- the good news: clinical expertise is high
naïve Bayes
- another option is to make strong (conditional) independence assumptions
- often effective for classification models
- structure: A is the sole parent of B, C, D, F, M
Bayes revisited
posterior = (prior × likelihood) / evidence
p(a | b,c,d,f,m) = p(a) p(b,c,d,f,m | a) / p(b,c,d,f,m)
conditional independence
assume the input variables are conditionally independent given the class:
p(a | x) = p(a) ∏_{i=1}^{n} p(x_i | a) / p(x)
naïve Bayes classification
since p(x) is the same for every outcome of A:
â = argmax_{a′ ∈ A} p(A = a′) ∏_{i=1}^{n} p(x_i | A = a′)
number of parameters
- joint probability distribution: 2^n − 1 = 63
- naïve Bayes: (|A| − 1) + |A| ∑_{i=1}^{n} (|X_i| − 1) = 11
- inference runtime: O(n |A|)
- to estimate parameters, count (and smooth)
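Both counts for the anthrax example can be reproduced in a few lines (variable names are mine):

```python
# Class A plus five Boolean findings (B, C, F, D, M): n = 6 variables total.
n = 6
joint_params = 2 ** n - 1              # full joint: one entry per assignment, minus sum-to-1

size_A = 2                             # |A|
feature_sizes = [2, 2, 2, 2, 2]        # |X_i| for the five findings
nb_params = (size_A - 1) + size_A * sum(k - 1 for k in feature_sizes)

print(joint_params, nb_params)         # 63 11
```

The naïve Bayes count is one prior parameter plus, for each of the two classes, one likelihood parameter per Boolean feature: 1 + 2 × 5 = 11.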
example [Mitchell's Machine Learning book]
given that today is sunny and cool but windy with high humidity, will we play tennis?

day  outlook   temperature  humidity  wind    tennis
1    sunny     hot          high      weak    no
2    sunny     hot          high      strong  no
3    overcast  hot          high      weak    yes
4    rain      mild         high      weak    yes
5    rain      cool         normal    weak    yes
6    rain      cool         normal    strong  no
7    overcast  cool         normal    strong  yes
8    sunny     mild         high      weak    no
9    sunny     cool         normal    weak    yes
10   rain      mild         normal    weak    yes
11   sunny     mild         normal    strong  yes
12   overcast  mild         high      strong  yes
13   overcast  hot          normal    weak    yes
14   rain      mild         high      strong  no
example
given that today is sunny and cool but windy with high humidity, will we play tennis?
p(T = no | x) ∝ p(T = no) p(O = sunny | T = no) p(Temp = cool | T = no) p(H = high | T = no) p(W = strong | T = no)
             = (5/14)(3/5)(1/5)(4/5)(3/5) ≈ 2.1e-2
p(T = yes | x) ∝ p(T = yes) p(O = sunny | T = yes) p(Temp = cool | T = yes) p(H = high | T = yes) p(W = strong | T = yes)
             = (9/14)(2/9)(3/9)(3/9)(3/9) ≈ 5.3e-3
so we predict no tennis today
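The same arithmetic as a naïve Bayes sketch: the `counts` dict hard-codes the tallies from the 14-day table for the query day's feature values (names are mine):

```python
# Class totals and per-class counts for outlook=sunny, temp=cool,
# humidity=high, wind=strong, read off the tennis table.
counts = {
    "yes": {"n": 9, "sunny": 2, "cool": 3, "high": 3, "strong": 3},
    "no":  {"n": 5, "sunny": 3, "cool": 1, "high": 4, "strong": 3},
}
total_days = 14

def score(label):
    """Unnormalized posterior: the prior times the per-feature likelihoods."""
    c = counts[label]
    s = c["n"] / total_days
    for feature in ("sunny", "cool", "high", "strong"):
        s *= c[feature] / c["n"]
    return s

scores = {label: score(label) for label in counts}
print(max(scores, key=scores.get))   # no
```

In practice the counts would be smoothed (e.g., add-one) so that an unseen feature value does not zero out an entire class.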
Population-wide ANomaly Detection and Assessment (PANDA) [Wong et al., KDD 2005]
- a detector for a large-scale outdoor release of inhalational anthrax
- a massive Bayes net
- population-wide: each person has their own subnetwork in the model
population-wide approach
- anthrax is non-contagious; this is reflected in the network structure
[figure: global Anthrax Release, Location of Release, and Time of Release nodes feed separate person models]
person model
[figure: two example person subnetworks. The global Anthrax Release, Time of Release, and Location of Release nodes, together with each person's Age Decile, Gender, and Home Zip, determine an Anthrax Infection node; this and Other ED Disease drive Respiratory from Anthrax / Respiratory CC from Other, ED Admit from Anthrax / ED Admit from Other, and the observed Respiratory CC, Respiratory CC When Admitted, and ED Admission nodes]
advanced topics
- learning network structure: generally a search procedure
- Markov networks: consider undirected edges
- influence diagrams: generalize with deterministic vertices
- more inference: variable elimination, approximate inference
more?
- the current standard bearer
- the classic
- really interesting