제 7 장 : 베이즈넷과확률추론 ( 교재 ) 장병탁, 인공지능개론, 2017 장병탁 서울대학교컴퓨터공학부 ( 인공지능강의슬라이드 )

Size: px

Start display at page:

Download "제 7 장 : 베이즈넷과확률추론 ( 교재 ) 장병탁, 인공지능개론, 2017 장병탁 서울대학교컴퓨터공학부 ( 인공지능강의슬라이드 )"

Darrell Walker
5 years ago
Views:

1 제 7 장 : 베이즈넷과확률추론 ( 인공지능강의슬라이드 ) ( 교재 ) 장병탁, 인공지능개론, 2017 장병탁 서울대학교컴퓨터공학부

2 목차 7.1 베이즈넷구조 베이지안학습 넷구조학습 확률적추론 응용사례

3 7.1 베이즈넷구조 3

4 A Bayesian Network The ICU alarm network MINVOLSET 37 binary random variables PULMEMBOLUS INTUBATION KINKEDTUBE VENTMACH DISCONNECT 509 parameters instead of PAP SHUNT VENTLUNG VENITUBE PRESS MINOVL FIO2 VENTALV ANAPHYLAXIS PVSAT ARTCO2 TPR SAO2 INSUFFANESTH EXPCO2 HYPOVOLEMIA LVFAILURE CATECHOL LVEDVOLUME STROEVOLUME HISTORY ERRBLOWOUTPUT HR ERRCAUTER CVP PCWP CO HREKG HRSAT HRBP BP

5 What s and Why s What is a Bayesian network? Why Bayesian networks are useful? Why learn a Bayesian network?

6 What is a Bayesian Network? also called belief networks, and (directed acyclic) graphical models Directed acyclic graph Nodes are variables (discrete or continuous) Arcs indicate dependence between variables. Conditional Probabilities (local distributions) Missing arcs implies conditional independence Independencies + local distributions => modular specification of a joint distribution X 1 X 2 X 3 p( x1 ) p( x2 x1 ) p( x3 x1, x2)

7 Bayesian networks A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions Syntax: a set of nodes, one per variable a directed, acyclic graph (link "directly influences") a conditional distribution for each node given its parents: P (X i Parents(X i )) In the simplest case, conditional distribution represented as a conditional probability table (CPT) giving the distribution over X i for each combination of parent values

8 Bayesian Networks 1. Finite, directed acyclic graph 2. Nodes: (discrete) random variables 3. Edges: direct influences 4. Associated with each node: a table representing a conditional probability distribution (CPD), quantifying the effect the parents have on the node E J A B M

9 l Bayesian network Bayesian Networks DAG (Directed Acyclic Graph) Express dependence relations between variables Can use prior knowledge on the data (parameters) A B C D E n P(X)= P(X i pa i )! i=1 P(A,B,C,D,E) = P(A)P(B A)P(C B) P(D A,B)P(E B,C,D) 2007, SNU Biointelligence Lab, 9

10 Associated CPDs naive representation tables other representations E A B decision trees rules neural networks support vector machines... E e e e e B P(A E,B) b.9.1 b.7.3 b.8.2 b.99.01

11 Bayesian Networks X 1 X 2 (0.2, 0.8) (0.6, 0.4) X 3 true 1 (0.2,0.8) true 2 (0.5,0.5) false 1 (0.23,0.77) false 2 (0.53,0.47) - Introduction

12 Cf. Markov Networks Undirected graphs Nodes = random variables Cliques = potentials (~ local jpd)

13 Graphical Models Graphical Models (GM) Causal Models Chain Graphs Other Semantics Dependency Networks Directed GMs Undirected GMs FST HMMs DBNs Factorial HMM Mixed Memory Markov Models Kalman Bayesian Networks Mixture Models Segment Models BMMs Decision Trees PCA Simple Models LDA Markov Random Fields / Markov networks Gibbs/Boltzman Distributions

14 Example Topology of network encodes conditional independence assertions: Weather is independent of the other variables Toothache and Catch are conditionally independent given Cavity

15 Example I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar? Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls Network topology reflects "causal" knowledge: A burglar can set the alarm off An earthquake can set the alarm off The alarm can cause Mary to call The alarm can cause John to call

16 Example contd.

17 Compactness A CPT for Boolean X i with k Boolean parents has 2 k rows for the combinations of parent values Each row requires one number p for X i = true (the number for X i = false is just 1-p) If each variable has no more than k parents, the complete network requires O(n 2 k ) numbers I.e., grows linearly with n, vs. O(2 n ) for the full joint distribution For burglary net, = 10 numbers (vs = 31)

18 Semantics The full joint distribution is defined as the product of the local conditional distributions: e.g., n P(X 1,...,X n )= P(X i )Pa(X i )! i=1! P( j,m,a,b,e)= P( j a)p(m a)p(a b,e)p(b)p(e)

19 Why Bayesian Networks? Expressive language Finite mixture models, Factor analysis, HMM, Kalman filter, Intuitive language Can utilize causal knowledge in constructing models Domain experts comfortable building a network General purpose inference algorithms P(Bad Battery Has Gas, Won t Start) Battery Gas Start Exact: Modular specification leads to large computational efficiencies Approximate: Loopy belief propagation

20 Why Learning? Knowledge-based (expert systems) - Causal discovery - Data visualization - Concise model of data - Prediction Data-based

21 Illustration of independent equivalence X Y Z X X Y Z Y Z Independence assertion: X and Z are conditionally independent given Y 21

22 d-separation Follows from basic independence assumption of Bayes networks d-separation = direction-dependent separation Let E = set of evidence nodes (subset of variables in Bayes network) Let V i, V j be two variables in the network

23 d-separation Nodes V i and V j are conditionally independent given set E if E d-separates V i and V j E d-separates V i, V j if all (undirected) paths (V i,v j ) are blocked by E If E d-separates V i, V j, then V i and V j are conditionally independent, given E We write I(V i,v j E) This means: p(v i,v j E) = p(v i E) * p(v j E)

24 BLOCKING A PATH A path between V i and V j is blocked by nodes E if there is a blocking node V b on the path. V b blocks the path if one of the following holds: V b in E and both arcs on path lead out of V b, or V b in E and one arc on path leads into V b and one out, or neither V b nor any descendant of V b is in E, and both arcs on path lead into V b

25 7.2 베이지안학습 25

26 Overview Learning Probabilities (Local Distributions) Introduction to Bayesian statistics: Learning a probability Learning probabilities in a Bayes net Learning Bayes-Net Structure Bayesian model selection/averaging Applications

27 Learning Probabilities: Classical Approach Simple case: Flipping a thumbtack heads tails True probability q is unknown Given iid data, estimate q using an estimator with good properties: low bias, low variance, consistent (e.g., ML estimate)

28 Learning Probabilities: Bayesian Approach heads tails True probability q is unknown Bayesian probability density for q p(q) 0 1 q

29 Bayesian Approach: Use Bayes' rule to compute a new density for q given data posterior prior likelihood p( q data) = ò p( p( q q ) p(data ) p(data q q ) ) d q µ p( q) p( data q)

30 The Likelihood p( heads q ) = q p( tails q ) = ( 1-q ) p # h ( hhth... ttth q ) = q (1 -q binomial distribution ) # t

31 Example: Application of Bayes rule to the observation of a single "heads" p(q) p(heads q)= q p(q heads) µ 0 1 q 0 1 q 0 1 q prior likelihood posterior

32 The probability of heads on the next toss ) ( ) ( ) ( ), ( ) ( ) ( q q q q q q q q d d d d d E p d p d p h p h p = = = ò ò Note: This yields nearly identical answers to ML estimates when one uses a flat prior

33 Overview Learning Probabilities Introduction to Bayesian statistics: Learning a probability Learning probabilities in a Bayes net Applications Learning Bayes-net structure Bayesian model selection/averaging Applications

34 From thumbtacks to Bayes nets Thumbtack problem can be viewed as learning the probability for a very simple BN: X heads/tails ( ) ( ) P X = heads = f q Q Q X 1 X 2... X N toss 1 toss 2 toss N i=1 to N X i

35 The next simplest Bayes net X heads/tails Y heads/tails tails heads heads tails

36 The next simplest Bayes net X heads/tails Y heads/tails Q X? Q Y i=1 to N X i Y i

37 The next simplest Bayes net X heads/tails Y heads/tails "parameter independence" Q X Q Y i=1 to N X i Y i

38 The next simplest Bayes net X heads/tails Y heads/tails "parameter independence" Q X Q Y ß two separate thumbtack-like learning problems i=1 to N X i Y i

39 A bit more difficult... X heads/tails Y heads/tails Three probabilities to learn: q X=heads q Y=heads X=heads q Y=heads X=tails

40 A bit more difficult... X heads/tails Y heads/tails?? Q X? Q Y X=heads Q Y X=tails X 1 Y 1 case 1 X 2! Y 2! case 2

41 A bit more difficult... X heads/tails Y heads/tails Q X Q Y X=heads Q Y X=tails X 1 Y 1 case 1 X 2! Y 2! case 2

42 A bit more difficult... X heads/tails Y heads/tails Q X Q Y X=heads Q Y X=tails X 1 heads Y 1 case 1 X 2! tails Y 2! case 2 3 separate thumbtack-like problems

43 In general Learning probabilities in a BN is straightforward if Likelihoods from the exponential family (multinomial, poisson, gamma,...) Parameter independence Conjugate priors Complete data

44 Incomplete data makes parameters dependent X heads/tails Y heads/tails Q X Q Y X=heads Q Y X=tails X 1 Y 1

45 Incomplete data Incomplete data makes parameters dependent Parameter Learning for incomplete data Monte-Carlo integration Investigate properties of the posterior and perform prediction Large-sample approximation (Laplace/Gaussian approx.) Expectation-maximization (EM) algorithm and inference to compute mean and variance. Variational methods

46 7.3 넷구조학습 46

47 Overview Learning Probabilities Introduction to Bayesian statistics: Learning a probability Learning probabilities in a Bayes net Learning Bayes-Net Structure Bayesian model selection/averaging Applications

48 Overview Learning Probabilities Introduction to Bayesian statistics: Learning a probability Learning probabilities in a Bayes net Learning Bayes-Net Structure Bayesian model selection/averaging Applications

49 Two Types of Methods for Learning Constraint based BNs Finds a Bayesian network structure whose implied independence constraints match those found in the data. Scoring methods (Bayesian, MDL, MML) Find the Bayesian network structure that can represent distributions that match the data (i.e. could have generated the data).

50 Learning Bayes-net structure Given data, which model is correct? model 1: X Y model 2: X Y

51 Bayesian approach Given data, which model is correct? more likely? model 1: X Y p( m 1 ) = 0.7 Data d p( m 1 d) = 0.1 model 2: X Y p( m2 ) = 0.3 p( m 2 d) = 0.9

52 Bayesian approach: Model Averaging Given data, which model is correct? more likely? model 1: X Y p( m 1 ) = 0.7 Data d p( m 1 d) = 0.1 model 2: X Y p( m2 ) = 0.3 p( m 2 d) = 0.9 average predictions

53 Bayesian approach: Model Selection Given data, which model is correct? more likely? model 1: X Y p( m 1 ) = 0.7 Data d p( m 1 d) = 0.1 model 2: X Y p( m2 ) = 0.3 p( m 2 d) = 0.9 Keep the best model: - Explanation - Understanding - Tractability

54 To score a model, use Bayes rule Given data d: model score p( m d ) µ p( m) p( d m) "marginal likelihood" likelihood ò p( d m) d = p( q, m) p( q m) dq

55 The Bayesian approach and Occam s ò p( d m) d Razor = p( q, m) p( q m) dq m m m Simple model True distribution Just right p(q m m) Complicated model All distributions

56 Computation of Marginal Likelihood Efficient closed form if Likelihoods from the exponential family (binomial, poisson, gamma,...) Parameter independence Conjugate priors No missing data, including no hidden variables Else use approximations Monte-Carlo integration Large-sample approximations Variational methods

57 Practical considerations The number of possible BN structures is super exponential in the number of variables. How do we find the best graph(s)?

58 Model search Finding the BN structure with the highest score among those structures with at most k parents is NP hard for k>1 (Chickering, 1995) initialize structure Heuristic methods Greedy Greedy with restarts MCMC methods score all possible single changes any changes better? yes perform best change no return saved structure

59 Learning the correct model True graph G and P is the generative distribution Markov Assumption: P satisfies the independencies implied by G Faithfulness Assumption: P satisfies only the independencies implied by G Theorem: Under Markov and Faithfulness, with enough data generated from P one can recover G (up to equivalence). Even with the greedy method!

60 Learning Bayes Nets From Data data Bayes net(s) X 1 true false false true X X 3 Red Blue Green Red Bayes-net learner X 1 X 2 X 3 X 4 X 5 X 6 X 7 + prior/expert information X 8 X 9

61 베이즈넷학습 Data +prior knowledge Inducer Bayesian Network 61

62 Constructing Bayesian networks 1. Choose an ordering of variables X 1,,X n 2. For i = 1 to n add X i to the network select parents from X 1,,X i-1 such that P (X i Parents(X i )) = P (X i X 1,... X i-1 ) n This choice of parents guarantees: P (X 1,,X n ) = π i =1 P (X i X 1,, X i-1 ) (chain rule) = π i =1 P (X i Parents(X i )) (by construction) n

63 Example Suppose we choose the ordering M, J, A, B, E P(J M) = P(J)?

64 Example Suppose we choose the ordering M, J, A, B, E P(J M) = P(J)? No P(A J, M) = P(A J)? P(A J, M) = P(A)?

65 Example Suppose we choose the ordering M, J, A, B, E P(J M) = P(J)? No P(A J, M) = P(A J)? P(A J, M) = P(A)? No P(B A, J, M) = P(B A)? P(B A, J, M) = P(B)?

66 Example Suppose we choose the ordering M, J, A, B, E P(J M) = P(J)? No P(A J, M) = P(A J)? P(A J, M) = P(A)? No P(B A, J, M) = P(B A)? Yes P(B A, J, M) = P(B)? No P(E B, A,J, M) = P(E A)? P(E B, A, J, M) = P(E A, B)?

67 Example Suppose we choose the ordering M, J, A, B, E P(J M) = P(J)? No P(A J, M) = P(A J)? P(A J, M) = P(A)? No P(B A, J, M) = P(B A)? Yes P(B A, J, M) = P(B)? No P(E B, A,J, M) = P(E A)? No P(E B, A, J, M) = P(E A, B)? Yes

68 Example contd. Deciding conditional independence is hard in noncausal directions (Causal models and conditional independence seem hardwired for humans!) Network is less compact: = 13 numbers needed

69 7.4 확률적추론 69

70 Inference Diagnostic Inferences (effects to causes) Causal Inferences (causes to effects) Intercausal Inferences - or Explaining Away (between causes of common effect) Mixed Inferences (combination of two or more of the above) 70

71 Inference contd. Q E Q E E Q E Q E Diagnostic Causal Intercausal Mixed 71

72 Inference contd. Burglary Earthquake Alarm Mary Calls John Calls 72

73 Inference in Singly Connected Networks E.g. P(X E): Involves computing two values: Causal Support (evidence variables above X connected through it s parents) Evidential Support (evidence variables below X connected through it s children Algorithm can perform in Linear Time. 73

74 Inference Algorithm Causal Support E Spreads out from Q to evidence nodes, root nodes and leaf nodes. E Q E Evidential Support Each recursive call excludes the node from which it was called. 74

75 Inference in Multiply Connected Networks Exact Inference is known to be NP-Hard Approaches include: Clustering Conditioning Stochastic Simulation Stochastic Simulation is most often used, particularly on large networks. 75

76 Clustering Cloudy Cloudy Sprinkler Rain Sprinkler & Rain C P(S) T F Wet Grass C P(R) T F Wet Grass C P(S&R) TT TF FT FF T F

77 Conditioning Cloudy + Cloudy Cloudy Cloudy Sprinkler Rain Sprinkler Rain Wet Grass Wet Grass 77

78 Stochastic Simulation - Example A P(A=1) = 0.2 A P(B=1) B C A p(c=1) D E 78

79 Stochastic Simulation Run repeated simulations to estimate the probability distribution Let Wx = the states of all other variables except x. Let the Markov Blanket of a node be all of its parents, children and parents of children. Distribution of each node, x, conditioned upon Wx can be computed locally from their own probability with their children s : P(a Wa) = a. P(a). P(b a). P(c a) P(b Wb) = a. P(b a). P(d b,c) P(c Wc) = a. P(c a). P(d b,c). P(e c) Therefore, only the Markov blanket of a node is required to compute the distribution 79

80 The Algorithm Set all observed nodes to their values Set all other nodes to random values STEP 1 Select a node randomly from the network According to the states of the node s markov blanket, compute P(x=state, Wx) for all states STEP 2 Use a random number generator that is biased according to the distribution computed in step 1 to select the next value of the node Repeat 80

81 Algorithm contd. The final probability distribution of each unobserved node is calculated from either: 1) the number of times each node took a particular state 2) the average conditional probability of each node taking a particular state given the other variables states. 81

82 7.5 응용사례 82

Sequence analysis and classification Speech recognition (Translation, Paraphrasing Biological sequences

83 Fielded Applications Expert systems Medical diagnosis (Mammography) Fault diagnosis (jet-engines, Windows 98) Monitoring Space shuttle engines (Vista project) Freeway traffic, Activity Recognition Sequence analysis and classification Speech recognition (Translation, Paraphrasing Biological sequences (DNA, Proteins, RNA,..) Information access Collaborative filtering Information retrieval & extraction among others?

84 Preference Prediction (a.k.a. Collaborative Filtering) Example: Predict what products a user will likely purchase given items in their shopping basket Basic idea: use other people s preferences to help predict a new user s preferences. Numerous applications Tell people about books or web-pages of interest Movies TV shows

85 Example: TV viewing Nielsen data: 2/6/95-2/19/95 Show1 Show2 Show3 viewer 1 y n n viewer 2 n y y... viewer 3 n n n etc. ~200 shows, ~3000 viewers Goal: For each viewer, recommend shows they haven t watched that they are likely to watch

87 Use a DAG to model the causality. Example Martin Oversleep Train Strike Norman Oversleep Martin Late Norman Late Boss Failure-in-Love Project Delay Office Dirty Boss Angry

9 Train Strike Norman oversleep Probability T 0.2 F 0.

88 Attach prior probabilities to all root nodes Martin oversleep Probability T 0.01 F 0.99 Martin Oversleep Train Example Strike Probability T 0.1 F 0.9 Train Strike Norman oversleep Probability T 0.2 F 0.8 Norman Oversleep Martin Late Norman Late Boss Failure-in-Love Project Delay Boss failure-in-love Probability T 0.01 Boss Angry F 0.99 Office Dirty

89 Attach prior probabilities to non-root nodes Example Each column is summed to 1. Martin Oversleep Train Strike Norman Oversleep Martin Late Norman Late Norman untidy Martin Late Train strike Boss Project T F Failure-in-Love Delay Martin oversleep T F T F T F Boss Angry Norman untidy Norman oversleep Office Dirty T F T F

90 Attach prior probabilities to non-root nodes Boss Angry Example T Boss Each Failure-in-love column is summed to 1. Project Delay T F T F Office Dirty MartinT F T F TTrain F T F Oversleep Strike very mid little Martin Norman Late Late no F Norman Oversleep Norman untidy Boss Failure-in-Love Project Delay Office Dirty Boss Angry

91 Inference

92 A Bayesian Network The ICU alarm network MINVOLSET 37 binary random variables PULMEMBOLUS INTUBATION KINKEDTUBE VENTMACH DISCONNECT 509 parameters instead of PAP SHUNT VENTLUNG VENITUBE PRESS MINOVL FIO2 VENTALV ANAPHYLAXIS PVSAT ARTCO2 TPR SAO2 INSUFFANESTH EXPCO2 HYPOVOLEMIA LVFAILURE CATECHOL LVEDVOLUME STROEVOLUME HISTORY ERRBLOWOUTPUT HR ERRCAUTER CVP PCWP CO HREKG HRSAT HRBP BP

94 How do we deal with uncertainty? limplicit: Ignore what you are uncertain if you can Build procedures that are robust to uncertainty lexplicit: Build a model of the world that describes uncertainty about its state, dynamics, and observations Reason about the effects of actions given the model l Graphical models = explicit, model-based

95 Probability A well-founded framework for uncertainty Clear semantics: joint prob. distribution Provides principled answers for: Combining evidence Predictive & diagnostic reasoning Incorporation of new evidence Intuitive (at some level) to human experts Can automatically be estimated from data

96 Joint Probability Distribution truth table of set of random variables X1 X 2 X 3 true 1 green true 1 blue true 2 green true 2 blue false 2 blue 0.2 Any probability we are interested in can be computed from it

97 Representing Prob. Distributions Probability distribution = probability for each combination of values of these attributes Hospital patients described by Background: age, gender, history of diseases, Symptoms: fever, blood pressure, headache, Diseases: pneumonia, heart attack, Naïve representations (such as tables) run into troubles 20 attributes require more than 2 20 ³10 6 parameters Real applications usually involve hundreds of attributes

98 Illustration of independent equivalence X Y Z X X Y Z Y Z Independence assertion: X and Z are conditionally independent given Y 98

99 d-separation Follows from basic independence assumption of Bayes networks d-separation = direction-dependent separation Let E = set of evidence nodes (subset of variables in Bayes network) Let V i, V j be two variables in the network

100 d-separation Nodes V i and V j are conditionally independent given set E if E d-separates V i and V j E d-separates V i, V j if all (undirected) paths (V i,v j ) are blocked by E If E d-separates V i, V j, then V i and V j are conditionally independent, given E We write I(V i,v j E) This means: p(v i,v j E) = p(v i E) * p(v j E)

101 BLOCKING A PATH A path between V i and V j is blocked by nodes E if there is a blocking node V b on the path. V b blocks the path if one of the following holds: V b in E and both arcs on path lead out of V b, or V b in E and one arc on path leads into V b and one out, or neither V b nor any descendant of V b is in E, and both arcs on path lead into V b

102 CONDITION 1 V b is a common cause: V b V i V j

103 CONDITION 2 V b is a closer, more direct cause of V j than V i is V i Vb V j

104 CONDITION 3 V b is not a common consequence of V i, V j V i V j V b V b not in E V d V d not in E

Bayesian Networks Inference with Probabilistic Graphical Models

4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning