Bayesian Networks Practice

Bayesian Networks Practice Part 2 2016-03-17 Byoung-Hee Kim, Seong-Ho Son Biointelligence Lab, CSE, Seoul National University

Agenda: Probabilistic inference in Bayesian networks. Probability basics. D-separation. Probabilistic inference in polytrees. Exercise: inference by hand (self); inference by GeNIe (self); learning from data using Weka. Appendix: AI & Uncertainty.

(Figure: an example Bayesian network, a directed acyclic graph (DAG).)

Bayesian Networks. The joint distribution defined by a graph is given by the product of a conditional distribution of each node conditioned on its parent nodes: p(x) = ∏_{k=1}^{K} p(x_k | Pa(x_k)), where Pa(x_k) denotes the set of parents of x_k. Ex.: p(x_1, x_2, ..., x_7) = (the product of one conditional per node, read off the example DAG in the figure). * Without a given DAG structure, the usual chain rule can be applied to get the joint distribution, but the computational cost is much higher.
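As a minimal sketch of this factorization (the three-node structure and the CPT numbers below are illustrative assumptions, not taken from the slides), the probability of a full assignment is just the product of one CPT entry per node:

```python
# Sketch of p(x) = prod_k p(x_k | Pa(x_k)) for an assumed network B -> M <- L
# (structure and numbers are illustrative only)
cpts = {
    "B": {(): {True: 0.95, False: 0.05}},          # p(B)
    "L": {(): {True: 0.70, False: 0.30}},          # p(L)
    "M": {                                         # p(M | B, L)
        (True, True): {True: 0.90, False: 0.10},
        (True, False): {True: 0.05, False: 0.95},
        (False, True): {True: 0.00, False: 1.00},
        (False, False): {True: 0.00, False: 1.00},
    },
}
parents = {"B": (), "L": (), "M": ("B", "L")}

def joint(assignment):
    """Product of p(x_k | Pa(x_k)) over all nodes, for a full assignment dict."""
    p = 1.0
    for node, table in cpts.items():
        pa_values = tuple(assignment[pa] for pa in parents[node])
        p *= table[pa_values][assignment[node]]
    return p

print(joint({"B": True, "L": True, "M": True}))   # 0.95 * 0.70 * 0.90 = 0.5985
```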

Probability. Probability plays a central role in modern pattern recognition; it is the main tool for dealing with uncertainties. All of probabilistic inference and learning amounts to repeated application of the sum rule and the product rule. Random variables: variables + probability.

19.1 Review of Probability Theory (1/4). Random variables. Joint probability. Ex.: (B (BAT_OK), M (MOVES), L (LIFTABLE), G (GAUGE)). Joint probability values: (True, True, True, True) 0.5686; (True, True, True, False) 0.0299; (True, True, False, True) 0.0135; (True, True, False, False) 0.0007.

19.1 Review of Probability Theory (2/4). Marginal probability (Ex.). Conditional probability (Ex.): the probability that the battery is charged given that the arm does not move.
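Marginal and conditional probabilities can be checked directly on the joint entries listed two slides above (the slide lists only the entries with BAT_OK = MOVES = True): summing out variables gives a marginal, and a conditional is a ratio of such sums. A small sketch:

```python
# Joint entries from the slide, keyed by (BAT_OK, MOVES, LIFTABLE, GAUGE)
joint = {
    (True, True, True, True):   0.5686,
    (True, True, True, False):  0.0299,
    (True, True, False, True):  0.0135,
    (True, True, False, False): 0.0007,
}

# Marginal p(BAT_OK=True, MOVES=True): sum out LIFTABLE and GAUGE
p_b_m = sum(p for (b, m, l, g), p in joint.items() if b and m)
print(p_b_m)  # 0.6127

# A conditional is a ratio of sums, e.g. p(LIFTABLE=True | BAT_OK=True, MOVES=True)
p_l_given_bm = sum(p for (b, m, l, g), p in joint.items() if b and m and l) / p_b_m
print(round(p_l_given_bm, 4))  # ~0.9768
```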

Bayes Theorem. p(Y | X) = p(X | Y) p(Y) / p(X): posterior = likelihood × prior / normalizing constant, where p(X) = Σ_Y p(X | Y) p(Y). Posterior ∝ likelihood × prior.

Bayes Theorem. Figure from Figure 1 in (Adams et al., 2013), obtained from http://journal.frontiersin.org/article/10.3389/fpsyt.2013.00047/full

Bayesian Probabilities - Frequentist vs. Bayesian. Likelihood: p(D | w). Frequentist: w is a fixed parameter determined by an estimator. Maximum likelihood: error function = -log p(D | w). Error bars: obtained from the distribution of possible data sets (bootstrap, cross-validation). Bayesian: p(w | D) = p(D | w) p(w) / p(D) is a probability distribution over w expressing the uncertainty in the parameters. Prior knowledge: noninformative (uniform) prior, Laplace correction in estimating priors. Computation: Monte Carlo methods, variational Bayes, EP. [Thomas Bayes] (See the article "Where Do Probabilities Come From?" on page 491 of the textbook (Russell and Norvig, 2010) for more discussion.)

Conditional Independence. Conditional independence simplifies both the structure of a model and the computations. An important feature of graphical models is that conditional independence properties of the joint distribution can be read directly from the graph without having to perform any analytical manipulations. The general framework for this is called d-separation.

19.3 Bayes Networks (1/2). A directed, acyclic graph (DAG) whose nodes are labeled by random variables. Characteristics of Bayesian networks: node V_i is conditionally independent of any subset of nodes that are not descendants of V_i, given its parents: p(V_1, V_2, ..., V_k) = ∏_{i=1}^{k} p(V_i | Pa(V_i)). Prior probability. Conditional probability table (CPT).

19.3 Bayes Networks (2/2). (Figure.)

19.4 Patterns of Inference in Bayes Networks (1/3). Causal or top-down inference. Ex.: the probability that the arm moves given that the block is liftable: p(M | L) = Σ_B p(M | B, L) p(B | L) = p(M | B, L) p(B) + p(M | ¬B, L) p(¬B), since B is independent of L.

19.4 Patterns of Inference in Bayes Networks (2/3). Diagnostic or bottom-up inference: using an effect (or symptom) to infer a cause. Ex.: the probability that the block is not liftable given that the arm does not move. From causal reasoning, p(¬M | ¬L) = 0.9525, so p(¬M | ¬L) p(¬L) = 0.9525 × 0.3 = 0.28575; likewise p(¬M | L) p(L) = 0.03665. By Bayes rule, p(¬L | ¬M) = 0.28575 / (0.28575 + 0.03665) ≈ 0.88632.
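A quick check of this bottom-up computation, using only the two products shown on the slide:

```python
# Diagnostic inference p(not L | not M) by Bayes rule, with the slide's numbers
p_notM_and_notL = 0.9525 * 0.3   # p(not M | not L) * p(not L) = 0.28575
p_notM_and_L = 0.03665           # p(not M | L) * p(L), as given on the slide
p_notM = p_notM_and_notL + p_notM_and_L
print(round(p_notM_and_notL / p_notM, 5))  # 0.88632
```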

19.4 Patterns of Inference in Bayes Networks (3/3). Explaining away: ¬B explains ¬M, making ¬L less certain: p(¬L | ¬M, ¬B) ≈ 0.030, compared with p(¬L | ¬M) = 0.88632. The derivation uses Bayes rule, the definition of conditional probability, and the structure of the Bayes network.

d-separation. Tail-to-tail node or head-to-tail node: think of the head as the parent node and the tail as the descendant node. The path is blocked if the node is observed; the path is unblocked if the node is unobserved. Remember: the paths we are talking about here are UNDIRECTED! Ex1: c is a tail-to-tail node because both arcs on the path lead out of c. Ex2: c is a head-to-tail node because one arc on the path leads in to c, while the other leads out.

d-separation. Head-to-head node: the path is blocked when the node is unobserved; the path is unblocked if the node itself and/or at least one of its descendants is observed. Ex3: c is a head-to-head node because both arcs on the path lead in to c.

d-separation. d-separation? All paths between two nodes (variables) are blocked. The joint distribution will then satisfy conditional independence with respect to the concerned variables.

d-separation. (Evidence nodes are the observed ones.) Ex4: V_b1 is a tail-to-tail node and is observed, so it blocks the path. V_b2 is a head-to-tail node and is observed, so it blocks the path. V_b3 is a head-to-head node and is unobserved, so it blocks the path. All the paths from V_i to V_j are blocked, so they are conditionally independent.
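One compact way to test d-separation programmatically (a sketch based on the standard moralized-ancestral-graph criterion, not an algorithm given in the slides): restrict the DAG to the ancestors of X ∪ Y ∪ Z, moralize it (marry parents of a common child, drop edge directions), delete Z, and check whether X and Y are disconnected.

```python
from itertools import combinations

def d_separated(dag, x, y, z):
    """True if x and y are d-separated given set z, for dag = {node: set(parents)}."""
    # 1. Ancestral subgraph of {x, y} union z
    needed, stack = set(), [x, y, *z]
    while stack:
        n = stack.pop()
        if n not in needed:
            needed.add(n)
            stack.extend(dag[n])
    # 2. Moralize: undirected parent-child edges plus edges between co-parents
    adj = {n: set() for n in needed}
    for child in needed:
        pars = dag[child] & needed
        for p in pars:
            adj[p].add(child); adj[child].add(p)
        for p, q in combinations(pars, 2):
            adj[p].add(q); adj[q].add(p)
    # 3. Remove the conditioning set and test connectivity between x and y
    reachable, stack = set(), [x]
    while stack:
        n = stack.pop()
        if n in reachable or n in z:
            continue
        reachable.add(n)
        stack.extend(adj[n])
    return y not in reachable

# Head-to-head example a -> c <- b: blocked marginally, unblocked once c is observed
dag = {"a": set(), "b": set(), "c": {"a", "b"}}
print(d_separated(dag, "a", "b", set()))   # True  (path blocked)
print(d_separated(dag, "a", "b", {"c"}))   # False (conditioning on c unblocks it)
```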

D-Separation: 1st case. None of the variables are observed: node c is tail-to-tail and the path from a to b is unblocked. When the variable c is observed, the conditioned node blocks the path from a to b, causing a and b to become (conditionally) independent.

D-Separation: 2nd case. None of the variables are observed: node c is head-to-tail and the path from a to b is unblocked. When the variable c is observed, the conditioned node blocks the path from a to b, causing a and b to become (conditionally) independent.

D-Separation: 3rd case. Node c is head-to-head. When none of the variables are observed, node c blocks the path and the variables a and b are independent. Conditioning on c unblocks the path and renders a and b dependent.

Fuel gauge example. B - battery, F - fuel, G - electric fuel gauge (a rather unreliable fuel gauge). Checking the fuel gauge makes it more likely (that the tank is empty). Does checking the battery as well change anything? It makes it less likely than observation of the fuel gauge only (explaining away).

d-separation. (a) a is dependent on b given c: the head-to-head node e is unblocked, because a descendant of e, namely c, is in the conditioning set; the tail-to-tail node f is unblocked. (b) a is independent of b given f: the head-to-head node e is blocked, and the tail-to-tail node f is blocked.

19.7 Probabilistic Inference in Polytrees (1/2). Polytree: a DAG for which there is just one path, along arcs in either direction, between any two nodes in the DAG.

19.7 Probabilistic Inference in Polytrees (2/2). A node is above Q if it is connected to Q only through Q's parents. A node is below Q if it is connected to Q only through Q's immediate successors. Three types of evidence: all evidence nodes are above Q; all evidence nodes are below Q; there are evidence nodes both above and below Q.

Evidence Above and Below. With evidence E⁺ above Q (nodes {P5, P4} in the figure) and evidence E⁻ below Q (nodes {P12, P13, P14, P11}), p(Q | E⁺, E⁻) = k p(E⁻ | Q) p(Q | E⁺), where k is a normalizing constant.

A Numerical Example (1/2). Conditioning on R and summing it out, p(Q | U) = Σ_R p(Q | R, U) p(R | U) ≈ 0.80, and p(Q | ¬U) = Σ_R p(Q | R, ¬U) p(R | ¬U) ≈ 0.019 (using the CPT values shown in the figure).

A Numerical Example (2/2). Combining the messages from above gives roughly 0.60 for Q and 0.21 for ¬Q; the evidence below contributes likelihood factors 0.05 and 0.95. Multiplying: k × 0.60 × 0.05 = k × 0.03 and k × 0.21 × 0.95 ≈ k × 0.20, so k ≈ 1 / (0.03 + 0.20) ≈ 4.35, and p(Q | all evidence) ≈ 4.35 × 0.03 ≈ 0.13. Other techniques for approximate inference: bucket elimination, Monte Carlo methods, clustering.

Exercise

Exercise 1 (inference). What is the probability that it is raining, given the grass is wet?
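The slide's network and CPTs appear only as a figure, so the numbers below are placeholder values for the classic cloudy/sprinkler/rain/wet-grass network; the point is the mechanics of inference by enumeration, p(R | W = true) ∝ Σ_{C,S} p(C) p(S | C) p(R | C) p(W | S, R):

```python
from itertools import product

# Placeholder CPTs (assumed, not from the slides) for Cloudy, Sprinkler, Rain, WetGrass
pC = {True: 0.5, False: 0.5}
pS = {True: {True: 0.1, False: 0.9}, False: {True: 0.5, False: 0.5}}   # p(S | C)
pR = {True: {True: 0.8, False: 0.2}, False: {True: 0.2, False: 0.8}}   # p(R | C)
pW = {(True, True): 0.99, (True, False): 0.9,                          # p(W=true | S, R)
      (False, True): 0.9, (False, False): 0.0}

def joint(c, s, r, w):
    pw = pW[(s, r)] if w else 1.0 - pW[(s, r)]
    return pC[c] * pS[c][s] * pR[c][r] * pw

# p(Rain | WetGrass = true) by enumerating the hidden variables C and S
num = sum(joint(c, s, True, True) for c, s in product([True, False], repeat=2))
den = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
print(round(num / den, 4))  # ~0.708 with these placeholder numbers
```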

Exercise 2 (inference). Q1) p(U | R, Q, S) = ? Q2) p(P | Q) = ? Q3) p(Q | P) = ? First, you may try to calculate by hand. Next, you can check the answer with GeNIe.

Dataset for Exercise with GeNIe. Alarm Network: data_alarm_modified.xdsl. Pima Indians Diabetes: discretization with Weka: pima_diabetes.arff (result: pima_diabetes_supervised_discretized.csv); learning a Bayesian network from data: pima_diabetes_supervised_discretized.csv.

Dataset #1: Alarm Network. Description: the network for a medical diagnostic system developed for on-line monitoring of patients in intensive care units. You will learn how to do inference with a given Bayesian network. Configuration of the data set: 37 variables, discrete (2-4 levels); the variables represent various states of the heart, blood vessels and lungs. Three kinds of variables: diagnostic (basis of the alarm), measurement (observations), intermediate (states of a patient).

Dataset #2: Pima Indians Diabetes. Description: Pima Indians have the highest prevalence of diabetes in the world. You will learn how to learn structures and parameters of Bayesian networks from data; we may get possible causal relationships between features that affect diabetes in the Pima tribe. Configuration of the data set: 768 instances, 8 attributes (age, number of times pregnant, results of medical tests/analysis); a discretized set will be used for the BN. Class value = 1 (positive example), interpreted as "tested positive for diabetes": 268 instances. Class value = 0 (negative example): 500 instances.

Exercise: Inference with the Alarm Network. Monitoring screen (figure), with legend: diagnostic node, measurement node, intermediate node.

Exercise: Inference with the Alarm Network. Inference tasks: set evidences (according to observations or sensors), then Network > Update Beliefs, or F5.

Exercise: Inference with the Alarm Network. Inference tasks: Network > Probability of Evidence.

Exercise: Inference with the Alarm Network. Inference tasks: based on a set of observed nodes we can estimate the most probable states of the target nodes, and we can calculate the probability of this configuration. Network > Annealed MAP.

Exercise: Learning from Diabetes data. Pima Indians Diabetes data, Step 1: discretization of real-valued features with Weka. 1. Open pima_diabetes.arff. 2. Apply Filter > Supervised > Attribute > Discretize with the default settings. 3. Save as pima_diabetes_supervised_discretized.csv.

Exercise: Learning from Diabetes data. Pima Indians Diabetes data, Step 2: learning the structure of the Bayesian network. 1. File > Open Data File: pima_diabetes_supervised_discretized.csv. 2. Data > Learn New Network. 3. Set parameters as in Fig. 1. 4. Edit the resulting graph: change positions, colors. (Fig. 1: parameter setting. Fig. 2: learned structure.)

Exercise: Learning from Diabetes data. Pima Indians Diabetes data, Step 3: learning the parameters of the Bayesian network. 1. Check the default parameters (based on counts in the data): the F8 key shows distributions for all the nodes as bar charts; the F5 key shows the probabilities. 2. Network > Learn Parameters. 3. Just click the OK button for each dialogue box. 4. Check the change of the parameters with the F5 key.

Appendix - AI & UNCERTAINTY - BAYESIAN NETWORKS IN DETAIL

AI & Uncertainty

Probability. Probability plays a central role in modern pattern recognition; it is the main tool for dealing with uncertainties. All of probabilistic inference and learning amounts to repeated application of the sum rule and the product rule. Random variables: variables + probability.

Artificial Intelligence (AI). The objective of AI is to build intelligent computers. We want intelligent, adaptive, robust behavior (figure: cat, car). Often hand programming is not possible. Solution? Get the computer to program itself, by showing it examples of the behavior we want! This is the learning approach to AI.

Artificial Intelligence (AI). (Traditional) AI: knowledge & reasoning; work with facts/assertions; develop rules of logical inference. Planning: work with applicability/effects of actions; develop searches for actions which achieve goals/avert disasters. Expert systems: develop by hand a set of rules for examining inputs, updating internal states and generating outputs.

Artificial Intelligence (AI). Probabilistic AI: emphasis on noisy measurements, approximation in hard cases, learning, algorithmic issues. The power of learning: automatic system building - old expert systems needed hand coding of knowledge and of output semantics, whereas learning automatically constructs rules and supports all types of queries. Probabilistic databases: traditional DB technology cannot answer queries about items that were never loaded into the dataset; UAI models are like probabilistic databases.

Uncertainty and Artificial Intelligence (UAI). Probabilistic methods can be used to: make decisions given partial information about the world; account for noisy sensors or actuators; explain phenomena not part of our models; describe inherently stochastic behavior in the world.

Other Names for UAI. Machine learning (ML), data mining, applied statistics, adaptive (stochastic) signal processing, probabilistic planning/reasoning... Some differences: data mining almost always uses large data sets, statistics almost always small ones; data mining, planning and decision theory often have no internal parameters to be learned; statistics often has no algorithm to run! ML/UAI algorithms are rarely online and rarely scale to huge data (changing now).

Learning in AI. Learning is most useful when the structure of the task is not well understood but can be characterized by a dataset with strong statistical regularity. It is also useful in adaptive or dynamic situations when the task (or its parameters) is constantly changing. Currently, these are challenging topics of machine learning and data mining research.

Probabilistic AI. Let inputs = X, correct answers = Y, outputs of our machine = Z. Learning: estimation of p(X, Y). The central object of interest is the joint distribution. The main difficulty is compactly representing it and robustly learning its shape given noisy samples.

Probabilistic Graphical Models (PGMs). Probabilistic graphical models represent large joint distributions compactly using a set of local relationships specified by a graph. Each random variable in our model corresponds to a graph node.

Probabilistic Graphical Models (PGMs). There are useful properties in using probabilistic graphical models: a simple way to visualize the structure of a probabilistic model; insights into the properties of the model; complex computations (for inference and learning) can be expressed in terms of graphical manipulations of the underlying mathematical expressions.

Directed graph vs. undirected graph. Both are (probabilistic) graphical models: they specify a factorization (how to express the joint distribution) and define a set of conditional independence properties. Bayesian networks (BN, directed): parent-child relations, local conditional distributions. Markov random fields (MRF, undirected): maximal cliques, potential functions.
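In formulas, the two factorizations being contrasted are (standard forms, stated here for reference):

```latex
% Directed (Bayesian network): one conditional per node given its parents
p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{Pa}(x_k))

% Undirected (Markov random field): one potential per maximal clique C, with partition function Z
p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C),
\qquad
Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C)
```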

Bayesian Networks in Detail

(Figure: an example Bayesian network, a directed acyclic graph (DAG).)

Designing a Bayesian Network Model. TakeHeart II: a decision support system for clinical cardiovascular risk assessment.

Inference in a Bayesian Network Model. Given an assignment of a subset of variables (evidence) in a BN, estimate the posterior distribution over another subset of unobserved variables of interest. Inference can be viewed as message passing along the network.

Bayesian Networks. The joint distribution defined by a graph is given by the product of a conditional distribution of each node conditioned on its parent nodes: p(x) = ∏_{k=1}^{K} p(x_k | Pa(x_k)), where Pa(x_k) denotes the set of parents of x_k. Ex.: p(x_1, x_2, ..., x_7) = (the product of one conditional per node, read off the example DAG in the figure). * Without a given DAG structure, the usual chain rule can be applied to get the joint distribution, but the computational cost is much higher.

Bayes Theorem. p(Y | X) = p(X | Y) p(Y) / p(X): posterior = likelihood × prior / normalizing constant, where p(X) = Σ_Y p(X | Y) p(Y). Posterior ∝ likelihood × prior.

Bayes Theorem. Figure from Figure 1 in (Adams et al., 2013), obtained from http://journal.frontiersin.org/article/10.3389/fpsyt.2013.00047/full

Bayesian Probabilities - Frequentist vs. Bayesian. Likelihood: p(D | w). Frequentist: w is a fixed parameter determined by an estimator. Maximum likelihood: error function = -log p(D | w). Error bars: obtained from the distribution of possible data sets (bootstrap, cross-validation). Bayesian: p(w | D) = p(D | w) p(w) / p(D) is a probability distribution over w expressing the uncertainty in the parameters. Prior knowledge: noninformative (uniform) prior, Laplace correction in estimating priors. Computation: Monte Carlo methods, variational Bayes, EP. [Thomas Bayes] (See the article "Where Do Probabilities Come From?" on page 491 of the textbook (Russell and Norvig, 2010) for more discussion.)

Conditional Independence. Conditional independence simplifies both the structure of a model and the computations. An important feature of graphical models is that conditional independence properties of the joint distribution can be read directly from the graph without having to perform any analytical manipulations. The general framework for this is called d-separation.

Three example graphs - 1st case. None of the variables are observed: node c is tail-to-tail and the path from a to b is unblocked. When the variable c is observed, the conditioned node blocks the path from a to b, causing a and b to become (conditionally) independent.

Three example graphs - 2nd case. None of the variables are observed: node c is head-to-tail and the path from a to b is unblocked. When the variable c is observed, the conditioned node blocks the path from a to b, causing a and b to become (conditionally) independent.

Three example graphs - 3rd case. Node c is head-to-head. When none of the variables are observed, node c blocks the path and the variables a and b are independent. Conditioning on c unblocks the path and renders a and b dependent.

Three example graphs - Fuel gauge example. B - battery, F - fuel, G - electric fuel gauge (a rather unreliable fuel gauge). Checking the fuel gauge makes it more likely (that the tank is empty). Does checking the battery as well change anything? It makes it less likely than observation of the fuel gauge only (explaining away).

d-separation. Tail-to-tail node or head-to-tail node: the path is unblocked unless the node is observed, in which case it blocks the path. Head-to-head node: blocks a path if it is unobserved; but if the node itself and/or at least one of its descendants is observed, the path becomes unblocked. d-separation? All paths are blocked; the joint distribution will then satisfy conditional independence w.r.t. the concerned variables.

d-separation. (a) a is dependent on b given c: the head-to-head node e is unblocked, because a descendant of e, namely c, is in the conditioning set; the tail-to-tail node f is unblocked. (b) a is independent of b given f: the head-to-head node e is blocked, and the tail-to-tail node f is blocked.

d-separation. Another example of conditional independence and d-separation: i.i.d. (independent, identically distributed) data. Problem: finding the posterior distribution for the mean of a univariate Gaussian distribution. Every path is blocked, and so the observations D = {x_1, ..., x_N} are independent given the mean μ. (If μ is not conditioned on, the observations are in general no longer independent!)
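Written out, the independence structure that this graph encodes is (standard form):

```latex
p(\mathcal{D} \mid \mu) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2),
\qquad
p(\mathcal{D}) = \int \Big[ \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2) \Big] p(\mu)\, d\mu
\;\neq\; \prod_{n=1}^{N} p(x_n).
```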

d-separation. Naïve Bayes model. Key assumption: conditioned on the class z, the distributions of the input variables x_1, ..., x_D are independent. Given inputs {x_1, ..., x_N} with their class labels, we can fit the naïve Bayes model to the training data using maximum likelihood, assuming that the data are drawn independently from the model.
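A minimal maximum-likelihood fit of a naïve Bayes model by counting (a sketch on a tiny made-up binary dataset; smoothing is omitted):

```python
from collections import Counter, defaultdict

# Toy binary data: each row is (class z, (x1, x2)) -- made-up values for illustration
data = [(1, (1, 0)), (1, (1, 1)), (1, (0, 1)), (0, (0, 0)), (0, (1, 0)), (0, (0, 0))]

class_counts = Counter(z for z, _ in data)
feat_counts = defaultdict(Counter)            # feat_counts[(z, d)][value]
for z, x in data:
    for d, xd in enumerate(x):
        feat_counts[(z, d)][xd] += 1

def predict(x):
    """argmax_z p(z) * prod_d p(x_d | z), with relative-frequency (ML) estimates."""
    scores = {}
    for z, nz in class_counts.items():
        score = nz / len(data)
        for d, xd in enumerate(x):
            score *= feat_counts[(z, d)][xd] / nz
        scores[z] = score
    return max(scores, key=scores.get)

print(predict((1, 1)))  # -> 1 for this toy data
```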

d-separation. Markov blanket or Markov boundary. When dealing with the conditional distribution of x_i, consider the minimal set of nodes that isolates x_i from the rest of the graph. The set of nodes comprising the parents, the children, and the co-parents is called the Markov blanket.

Probability Distributions. Discrete variables: beta, Bernoulli, binomial; Dirichlet, multinomial. Continuous variables: normal (Gaussian), Student-t. Exponential family & conjugacy: many probability densities on x can be represented in the same form, p(x | η) = h(x) g(η) exp{η^T u(x)}. There are conjugate families of density functions having the same form: beta & binomial, Dirichlet & multinomial, normal & normal. (Figure: beta, binomial, Dirichlet, multinomial and Gaussian densities.)

Inference in Graphical Models. Inference in graphical models: given evidence (some nodes are clamped to observed values), we wish to compute the posterior distributions of other nodes. Inference algorithms in graphical structures - main idea: propagation of local messages. Exact inference: sum-product algorithm, max-product algorithm, junction tree algorithm. Approximate inference: loopy belief propagation plus a message-passing schedule; variational methods; sampling methods (Monte Carlo methods). (Figure: a junction tree with cliques ABD, BCD, CDE over nodes A-E.)

Learning Parameters of Bayesian Networks. Parameters: the probabilities in the conditional probability tables (CPTs) for all the variables in the network. Learning parameters: assuming that the structure is fixed, i.e. designed or learned, we need data, i.e. observed instances. Estimation is based on relative frequencies from data + belief. (Figure: SEASON -> RAIN network with a CPT over DRY/RAINY and YES/NO to be filled in.) Example: coin toss, estimating P(heads) in various ways. 1. The principle of indifference: head and tail are equally probable, P(heads) = 1/2. 2. If we tossed a coin 10,000 times and it landed heads 3373 times, we would estimate the probability of heads to be about 0.3373.

Learning Parameters of Bayesian Networks. Learning parameters (continued): estimation based on relative frequencies from data + belief. Example: an A-match soccer game between Korea and Japan. How probable is it, do you think, that Korea would win? A: 0.85 (a Korean), B: 0.3 (a Japanese). 3. This probability is not a ratio, and it is not a relative frequency, because the game cannot be repeated many times under the exact same conditions; it is a degree of belief, or subjective probability. Usual method: estimate the probability distribution of a variable X based on a relative frequency and a belief concerning that relative frequency.

Learning Parameters of Bayesian Networks. Simple counting solution (Bayesian point of view): parameter estimation for a single node, assuming local parameter independence. For a binary variable (for example, a coin toss): prior Beta(a, b); after we have observed m heads and N-m tails, the posterior is Beta(a+m, b+N-m) and P(X = head) = (a+m) / (a+b+N) (conjugacy of the beta and binomial distributions).

Learning Parameters of Bayesian Networks. Simple counting solution (Bayesian point of view): for a multinomial variable (for example, a die toss): prior Dirichlet(a_1, a_2, ..., a_d), with P(X = k) = a_k / N, where N = Σ_k a_k. Observing state i updates the prior to Dirichlet(a_1, ..., a_i + 1, ..., a_d) (conjugacy of the Dirichlet and multinomial distributions). For an entire network we simply iterate over its nodes. In the case of incomplete data (in real data, many of the variable values may be incorrect or missing), the usual approximate solution is given by Gibbs sampling or the EM (expectation-maximization) technique.
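A sketch of the counting rules from the last two slides, using the standard conjugate posterior-predictive forms (the helper names below are ours, not from the slides):

```python
def beta_posterior_predictive(a, b, heads, n):
    """Prior Beta(a, b); after observing `heads` heads in n tosses the posterior is
    Beta(a + heads, b + n - heads) and P(X = head) = (a + heads) / (a + b + n)."""
    return (a + heads) / (a + b + n)

def dirichlet_posterior_predictive(alpha, counts):
    """Prior Dirichlet(alpha_1..alpha_d); with observed counts n_k,
    P(X = k) = (alpha_k + n_k) / (sum(alpha) + N)."""
    total = sum(alpha) + sum(counts)
    return [(a_k + n_k) / total for a_k, n_k in zip(alpha, counts)]

print(beta_posterior_predictive(1, 1, heads=7, n=10))        # 8/12 with a uniform prior
print(dirichlet_posterior_predictive([1, 1, 1], [3, 0, 7]))  # [4/13, 1/13, 8/13]
```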

Learning Parameters of Bayesian Networks. Smoothing - another viewpoint. Laplace smoothing or additive smoothing: given observed counts for the d states of a variable X = (x_1, x_2, ..., x_d), P(X = k) = (x_k + α) / (N + αd), k = 1, ..., d, with α = α_1 = α_2 = ... = α_d. From a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter α as a prior. Additive smoothing is commonly a component of naive Bayes classifiers.
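The same formula as code; with α = 1 it reproduces the symmetric-Dirichlet result from the previous sketch:

```python
def additive_smoothing(counts, alpha=1.0):
    """Laplace / additive smoothing: P(X = k) = (x_k + alpha) / (N + alpha * d)."""
    n, d = sum(counts), len(counts)
    return [(x_k + alpha) / (n + alpha * d) for x_k in counts]

print(additive_smoothing([3, 0, 7], alpha=1.0))  # [4/13, 1/13, 8/13]
```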

Learning the Graph Structure. Learning the graph structure itself from data requires: a space of possible structures, and a measure that can be used to score each structure. From a Bayesian viewpoint the tough point is the score for each model: marginalization over latent variables leads to a challenging computational problem. Exploring the space of structures can also be problematic: the number of different graph structures grows exponentially with the number of nodes. Usually we resort to heuristics: local score based, global score based, conditional-independence-test based, ...

Bayesian Networks as Tools for AI. Learning: extracting and encoding knowledge from data; knowledge is represented as probabilistic relationships among variables, causal relationships, and a network of variables; a common framework for machine learning models (supervised and unsupervised learning). Knowledge representation & reasoning: Bayesian networks can be constructed from prior knowledge alone, and the constructed model can be used for reasoning based on probabilistic inference methods. Expert systems: uncertain expert knowledge can be encoded into a Bayesian network; the DAG is hand-constructed by domain experts, and the conditional probabilities are then assessed by the expert, learned from data, or obtained using a combination of both techniques; Bayesian network-based expert systems are popular. Planning: in a somewhat different form, known as decision graphs or influence diagrams; we do not cover this direction.

Advantages of Bayesian Networks for Data Analysis. Ability to handle missing data, because the model encodes dependencies among all variables. Learning causal relationships: can be used to gain understanding about a problem domain and to predict the consequences of intervention. Having both causal and probabilistic semantics: an ideal representation for combining prior knowledge (which often comes in causal form) and data. Efficient and principled approach for avoiding the overfitting of data, by Bayesian statistical methods in conjunction with Bayesian networks. (Summarized from the abstract of D. Heckerman's tutorial on Bayesian networks; read its Introduction section for detailed explanations.)

References. K. Mohan & J. Pearl, UAI'12 Tutorial on Graphical Models for Causal Inference. S. Roweis, MLSS'05 Lecture on Probabilistic Graphical Models. Chapter 1, Chapter 2, and Chapter 8 (Graphical Models) in Pattern Recognition and Machine Learning by C. M. Bishop, 2006. D. Heckerman, A Tutorial on Learning with Bayesian Networks. R. E. Neapolitan, Learning Bayesian Networks, Pearson Prentice Hall, 2004.

More Textbooks and Courses. https://www.coursera.org/course/pgm : Probabilistic Graphical Models by D. Koller.