4190.408 Artificial Intelligence (2016 Spring)
Bayesian Networks: Inference with Probabilistic Graphical Models
Byoung-Tak Zhang
Biointelligence Lab, Seoul National University
What Is Machine Learning?
Learning system: a system that autonomously improves its performance (P) by automatically forming a model (M) from experiential data (D) obtained through interaction with its environment (E).
- Self-improving systems (AI perspective)
- Knowledge discovery (data-mining perspective)
- Data-driven software design (software-engineering perspective)
- Automatic programming (computer-engineering perspective)
Machine Learning as Automatic Programming
- Traditional programming: Data + Program -> Computer -> Output
- Machine learning: Data + Output -> Computer -> Program
Machine Learning (ML): Three Tasks
- Supervised learning: estimate an unknown mapping from known input and target output pairs.
  Learn f_w from a training set D = {(x, y)} such that f_w(x) = y = f(x).
  Classification: y is discrete. Regression: y is continuous.
- Unsupervised learning: only input values are provided.
  Learn f_w from D = {x} such that f_w(x) ≈ x.
  Density estimation and compression; clustering, dimensionality reduction.
- Sequential (reinforcement) learning: not targets, but rewards (critiques) are provided sequentially.
  Learn a heuristic function f_w from D_t = {(s_t, a_t, r_t) | t = 1, 2, ...}, with respect to the future, not just the past.
  Sequential decision-making; action selection and policy learning.
Machine Learning Models
- Supervised learning: neural nets, decision trees, k-nearest neighbors, support vector machines
- Unsupervised learning: self-organizing maps, clustering algorithms, manifold learning, evolutionary learning
- Probabilistic graphical models: Bayesian networks, Markov networks, hidden Markov models, hypernetworks
- Dynamic systems: Kalman filters, sequential Monte Carlo (particle filters)
- Reinforcement learning
Outline
- Bayesian inference: Monte Carlo, importance sampling, MCMC
- Probabilistic graphical models: Bayesian networks, Markov random fields, hypernetworks
- Architecture and algorithms
- Application examples
- Discussion
Bayes Theorem
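The extracted slide lost its equation; the standard statement, for a hypothesis h and data D, is:

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}, \qquad
P(D) = \sum_{h'} P(D \mid h')\, P(h')
```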
MAP vs. ML
What is the most probable hypothesis h given data D? From Bayes theorem:
- MAP (maximum a posteriori): h_MAP = argmax_h P(h | D) = argmax_h P(D | h) P(h)
- ML (maximum likelihood): h_ML = argmax_h P(D | h)
Bayesian Inference
Source: Prof. Schrater's lecture notes (Univ. of Minnesota)
Monte Carlo (MC) Approximation
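A minimal sketch of plain Monte Carlo approximation of an expectation E[f(X)] ≈ (1/n) Σ f(x_i) with x_i drawn from p. The test function and distribution below are illustrative choices, not from the slides:

```python
import random

def mc_expectation(f, sampler, n=100_000):
    """Approximate E[f(X)] by averaging f over i.i.d. samples from p(x)."""
    return sum(f(sampler()) for _ in range(n)) / n

random.seed(0)
# Illustration: E[X^2] for X ~ Uniform(0, 1) is exactly 1/3.
est = mc_expectation(lambda x: x * x, random.random, n=200_000)
```

The estimator is unbiased and its error shrinks as O(1/sqrt(n)), independent of dimension, which is why MC is the workhorse for the high-dimensional integrals of Bayesian inference.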
Markov Chain Monte Carlo
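A sketch of random-walk Metropolis, one standard MCMC scheme: it only needs the target density up to a normalizing constant, which is exactly the situation in Bayesian posteriors. The target (a standard normal) and tuning constants are illustrative:

```python
import math, random

def metropolis(log_p, x0, n_samples, step=1.0, burn_in=1000):
    """Random-walk Metropolis: sample from p(x) known only up to a constant."""
    x, samples = x0, []
    for i in range(n_samples + burn_in):
        x_new = x + random.gauss(0.0, step)
        # Accept with probability min(1, p(x_new) / p(x)).
        if math.log(random.random()) < log_p(x_new) - log_p(x):
            x = x_new
        if i >= burn_in:
            samples.append(x)
    return samples

random.seed(1)
# Target: standard normal, with density specified only up to normalization.
chain = metropolis(lambda x: -0.5 * x * x, x0=0.0, n_samples=50_000)
mean = sum(chain) / len(chain)
```

Successive samples are correlated, so the effective sample size is smaller than the chain length; burn-in discards the transient before the chain reaches its stationary distribution.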
MC with Importance Sampling
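A sketch of importance sampling: draw from an easy proposal q and reweight by w = p/q to estimate expectations under p. The particular target and proposal (both normalized Gaussians here) are illustrative assumptions:

```python
import math, random

def importance_expectation(f, log_p, log_q, sample_q, n=100_000):
    """E_p[f(X)] ≈ (1/n) Σ f(x_i) w(x_i) with w = p/q and x_i ~ q.
    (If p is unnormalized, self-normalized weights are used instead.)"""
    total = 0.0
    for _ in range(n):
        x = sample_q()
        total += f(x) * math.exp(log_p(x) - log_q(x))
    return total / n

random.seed(2)
# Estimate E[X] = 3 for p = N(3, 1) using the broader proposal q = N(0, 2).
log_p = lambda x: -0.5 * (x - 3) ** 2 - 0.5 * math.log(2 * math.pi)
log_q = lambda x: -0.5 * (x / 2) ** 2 - math.log(2) - 0.5 * math.log(2 * math.pi)
est = importance_expectation(lambda x: x, log_p, log_q,
                             lambda: random.gauss(0, 2))
```

The proposal must have heavier tails than the target (here, standard deviation 2 vs. 1), otherwise the weights have unbounded variance and the estimate degrades badly.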
Graphical Models
- Graphical models (GM): directed GMs, undirected GMs, causal models, chain graphs, and other semantics (e.g., dependency networks)
- Directed GMs: Bayesian networks, including FSTs, HMMs (factorial HMMs, segment models, mixed-memory Markov models, BMMs), DBNs, Kalman filters, mixture models, decision trees, simple models, PCA, LDA
- Undirected GMs: Markov random fields / Markov networks, Gibbs/Boltzmann distributions
BAYESIAN NETWORKS
Bayesian Networks
- DAG (directed acyclic graph)
- Expresses dependence relations between variables
- Can use prior knowledge about the data (parameters)

P(X) = Π_{i=1}^{n} P(X_i | pa(X_i))

Example (nodes A, B, C, D, E):
P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D)
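The slide's factorization can be sketched directly as a product of CPT lookups. The CPT numbers below are invented for illustration; only the factorization structure comes from the slide:

```python
from itertools import product

# Hypothetical CPTs (values invented) for the slide's factorization
# P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D).
P_A = {True: 0.3, False: 0.7}
P_B = {a: {True: p, False: 1 - p} for a, p in [(True, 0.8), (False, 0.1)]}
P_C = {b: {True: p, False: 1 - p} for b, p in [(True, 0.5), (False, 0.2)]}
P_D = {(a, b): {True: p, False: 1 - p}
       for (a, b), p in {(True, True): 0.9, (True, False): 0.6,
                         (False, True): 0.4, (False, False): 0.05}.items()}
P_E = {(b, c, d): {True: 0.7 if d else 0.2, False: 0.3 if d else 0.8}
       for b, c, d in product([True, False], repeat=3)}

def joint(a, b, c, d, e):
    """One multiplication per node: P(X_i | parents(X_i))."""
    return (P_A[a] * P_B[a][b] * P_C[b][c]
            * P_D[(a, b)][d] * P_E[(b, c, d)][e])

# Sanity check: the factorized product defines a proper distribution.
total = sum(joint(*vals) for vals in product([True, False], repeat=5))
```

Note the saving: five binary variables need 2^5 - 1 = 31 free parameters in a full table, but only 1 + 2 + 2 + 4 + 8 = 17 CPT entries here, and far fewer when the graph is sparse.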
Representing Probability Distributions
- A probability distribution assigns a probability to each combination of attribute values.
- Example: hospital patients described by
  background (age, gender, history of diseases, ...),
  symptoms (fever, blood pressure, headache, ...),
  diseases (pneumonia, heart attack, ...).
- Naive representations (such as tables) run into trouble: 20 binary attributes already require more than 2^20 ≈ 10^6 parameters, and real applications usually involve hundreds of attributes.
Bayesian Networks - Key Idea
- Exploit regularities: utilize conditional independence.
- Graphical representation of conditional independence and, correspondingly, of causal dependencies.
Bayesian Networks
1. Finite, directed acyclic graph
2. Nodes: (discrete) random variables
3. Edges: direct influences
4. Associated with each node: a table representing a conditional probability distribution (CPD), quantifying the effect the parents have on the node
[Figure: example network with nodes E, B, A, J, M]
Bayesian Networks
[Figure: network with parents X1, X2 and child X3]
P(X1) = (0.2, 0.8); P(X2) = (0.6, 0.4)
P(X3 | X1, X2):
  X1=true,  X2=1: (0.2, 0.8)
  X1=true,  X2=2: (0.5, 0.5)
  X1=false, X2=1: (0.23, 0.77)
  X1=false, X2=2: (0.53, 0.47)
Example: Use a DAG to Model Causality
[Figure: DAG over the variables Martin Oversleep, Train Strike, Norman Oversleep, Martin Late, Norman Late, Norman Untidy, Boss Failure-in-Love, Project Delay, Office Dirty, Boss Angry]
Example: Attach Prior Probabilities to All Root Nodes
  Martin Oversleep:      P(T) = 0.01, P(F) = 0.99
  Train Strike:          P(T) = 0.1,  P(F) = 0.9
  Norman Oversleep:      P(T) = 0.2,  P(F) = 0.8
  Boss Failure-in-Love:  P(T) = 0.01, P(F) = 0.99
Example: Attach Conditional Probabilities to Non-Root Nodes
Each column sums to 1.

P(Martin Late | Train Strike, Martin Oversleep):
  Train Strike:      T     T     F     F
  Martin Oversleep:  T     F     T     F
  Martin Late = T:   0.95  0.8   0.7   0.05
  Martin Late = F:   0.05  0.2   0.3   0.95

P(Norman Untidy | Norman Oversleep):
  Norman Oversleep:    T    F
  Norman Untidy = T:   0.6  0.2
  Norman Untidy = F:   0.4  0.8
Example: Attach Conditional Probabilities to Non-Root Nodes (cont.)
Each column sums to 1.

P(Boss Angry | Boss Failure-in-Love, Project Delay, Office Dirty):
  Boss Failure-in-Love:  T     T     T     T     F     F     F     F
  Project Delay:         T     T     F     F     T     T     F     F
  Office Dirty:          T     F     T     F     T     F     T     F
  Boss Angry = very:     0.98  0.85  0.6   0.5   0.3   0.2   0     0.01
  Boss Angry = mid:      0.02  0.15  0.3   0.25  0.5   0.5   0.2   0.02
  Boss Angry = little:   0     0     0.1   0.25  0.2   0.3   0.7   0.07
  Boss Angry = no:       0     0     0     0     0     0     0.1   0.9
Inference
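The slide title stands alone; as a concrete illustration, here is exact inference by enumeration on a hypothetical three-node chain A -> B -> C (the CPT numbers are invented, not the slide's network): sum the joint over the hidden variable and normalize.

```python
from itertools import product

# Hypothetical chain A -> B -> C with invented CPTs, illustrating
# inference by enumeration: P(A | C=True) ∝ Σ_B P(A) P(B|A) P(C|B).
P_A = {True: 0.2, False: 0.8}
P_B = {True: {True: 0.9, False: 0.1}, False: {True: 0.3, False: 0.7}}  # P(B|A)
P_C = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}  # P(C|B)

def joint(a, b, c):
    return P_A[a] * P_B[a][b] * P_C[b][c]

# Unnormalized posterior over A given the evidence C=True.
unnorm = {a: sum(joint(a, b, True) for b in (True, False))
          for a in (True, False)}
z = sum(unnorm.values())
posterior = {a: unnorm[a] / z for a in (True, False)}
```

Enumeration is exact but its cost grows exponentially in the number of hidden variables, which is why practical systems use variable elimination, junction trees, or the sampling methods from the first part of the lecture.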
MARKOV RANDOM FIELDS (MARKOV NETWORKS)
Graphical Models
- Directed graph (e.g., Bayesian network)
- Undirected graph (e.g., Markov random field)
Bayesian Image Analysis
Noise in transmission degrades the original image into the observed (degraded) image.

P(Original | Degraded) = P(Degraded | Original) P(Original) / P(Degraded)

a posteriori probability = (degradation-process likelihood × a priori probability) / marginal likelihood
Image Analysis
We can represent both the observed image X and the true image Y as Markov random fields, and invoke the Bayesian framework to find P(Y | X).
Details
Remember: P(Y | X) = P(X | Y) P(Y) / P(X) ∝ P(X | Y) P(Y)
- P(X | Y) is the data model.
- P(Y) models the label interaction.
Next we need to specify the prior P(Y = y) and the likelihood P(X | Y).
Back to Image Analysis
- The likelihood can be modeled as a mixture of Gaussians.
- The potential is modeled to capture domain knowledge. One common model is the Ising model, with pairwise potentials of the form β y_i y_j.
Bayesian Image Analysis
- Let X be the observed image {x_1, x_2, ..., x_mn}.
- Let Y be the true image {y_1, y_2, ..., y_mn}.
- Goal: find Y = y* = {y_1*, y_2*, ...} such that P(Y = y* | X) is maximum.
- This is a labeling problem with a search space of size |L|^(mn), where L is the set of labels and there are m×n observations.
Unfortunately...
[Figure: an observed image segmented by an SVM vs. an MRF]
Markov Random Fields (MRFs)
- Introduced in the 1960s: a principled approach for incorporating context information and domain knowledge.
- Works within the Bayesian framework.
- Widely worked on in the 1970s, largely abandoned during the 1980s, and finally made a big comeback in the late 1990s.
Markov Random Field
Random field: let F = {F_1, F_2, ..., F_M} be a family of random variables defined on a set of sites S, in which each random variable F_i takes a value f_i in a label set L. The family F is called a random field.

Markov random field: F is said to be a Markov random field on S with respect to a neighborhood system N if and only if the following two conditions are satisfied:
- Positivity: P(f) > 0 for all f
- Markovianity: P(f_i | f_{S−{i}}) = P(f_i | f_{N_i})
Inference
- Find the optimal y* such that P(Y = y* | X) is maximum.
- The search space is exponential.
- Exponential-time algorithm: simulated annealing (SA)
- Greedy algorithm: iterated conditional modes (ICM)
- There are other, more advanced graph-cut-based strategies.
Sampling and Simulated Annealing
- Sampling: a way to generate random samples from a (potentially very complicated) probability distribution, e.g., Gibbs or Metropolis sampling.
- Simulated annealing: a schedule for modifying the probability distribution so that, at zero temperature, you draw samples only from the MAP solution.
- If you can find the right cooling schedule, the algorithm converges to a global MAP solution.
- Flip side: it is slow, and finding the correct schedule is non-trivial.
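The steps above can be sketched generically: Metropolis moves with a slowly decreasing temperature. The toy energy, geometric cooling schedule, and tuning constants below are illustrative assumptions, not the lecture's recipe:

```python
import math, random

def simulated_annealing(energy, propose, x0, t0=2.0, cooling=0.999,
                        steps=20_000):
    """Minimize an energy via Metropolis moves under a geometric cooling
    schedule; track the best configuration ever visited."""
    x, t = x0, t0
    best, best_e = x0, energy(x0)
    for _ in range(steps):
        x_new = propose(x)
        de = energy(x_new) - energy(x)
        # Always accept downhill moves; accept uphill moves with prob e^(-ΔE/t).
        if de < 0 or random.random() < math.exp(-de / t):
            x = x_new
        if energy(x) < best_e:
            best, best_e = x, energy(x)
        t *= cooling  # slowly lower the temperature
    return best

random.seed(3)
# Toy double-well energy: local minimum near x=+1, global minimum near x=-1.
E = lambda x: (x * x - 1) ** 2 + 0.3 * x
best = simulated_annealing(E, lambda x: x + random.gauss(0, 0.5), x0=1.0)
```

At high temperature the chain crosses the barrier between the wells freely; as t falls it settles into the deeper well, which is what a greedy method started at x=1 would miss.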
Iterated Conditional Modes
- Greedy strategy with fast convergence.
- Idea: iteratively maximize the local conditional probabilities, given an initial solution.
- Equivalent to simulated annealing with T = 0.
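A sketch of ICM for binary image denoising under an Ising-style posterior energy; the energy form U(y) = -h Σ x_i y_i - β Σ y_i y_j and the parameter values are illustrative assumptions:

```python
def icm_denoise(x, beta=1.5, h=1.0, sweeps=5):
    """Iterated conditional modes for binary denoising: greedily minimize
    U(y) = -h Σ x_i y_i - β Σ_{<i,j>} y_i y_j by local label updates."""
    H, W = len(x), len(x[0])
    y = [row[:] for row in x]          # initialize labels at the observation
    for _ in range(sweeps):
        for i in range(H):
            for j in range(W):
                s = sum(y[a][b]
                        for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                        if 0 <= a < H and 0 <= b < W)
                # Pick the label maximizing the local conditional probability.
                y[i][j] = 1 if h * x[i][j] + beta * s > 0 else -1
    return y

# A mostly +1 image with one flipped pixel: ICM restores it.
noisy = [[1, 1, 1], [1, -1, 1], [1, 1, 1]]
clean = icm_denoise(noisy)
```

Each update can only lower the energy, so ICM converges quickly, but only to a local minimum determined by the initialization, in line with the slide's "simulated annealing with T = 0" remark.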
Parameter Learning
Supervised learning (easiest case): maximum likelihood, θ* = argmax_θ P_θ(f).
For an MRF: P_θ(f) = (1 / Z(θ)) e^(−U(f)/T)
Pseudo-Likelihood
The partition function Z(θ) makes the likelihood intractable, so we approximate it. With the energy decomposed as U(f) = Σ_i U_i(f_i, f_{N_i}):

PL(f) = Π_i P(f_i | f_{N_i}) = Π_i e^(−U_i(f_i, f_{N_i})) / Σ_{f'_i ∈ L} e^(−U_i(f'_i, f_{N_i}))

- Large-lattice theorem: in the large-lattice limit M → ∞, the PL estimate converges to the ML estimate.
- It turns out that a local learning method like pseudo-likelihood, combined with a local inference method such as ICM, does quite well, with close-to-optimal results.
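For the Ising special case, each local conditional is P(y_i | y_Ni) = e^(β y_i s_i) / (e^(β s_i) + e^(−β s_i)) with s_i the sum of neighboring spins, so the log pseudo-likelihood needs no partition function at all. A sketch (the 4×4 configuration and grid search are invented illustrations):

```python
import math

def neighbors(i, j, H, W):
    """4-connected lattice neighbors."""
    return [(a, b) for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
            if 0 <= a < H and 0 <= b < W]

def log_pseudo_likelihood(y, beta):
    """log PL(y; β) = Σ_i log P(y_i | y_Ni) for the Ising model:
    each term is β y_i s_i - log(e^{β s_i} + e^{-β s_i})."""
    H, W = len(y), len(y[0])
    total = 0.0
    for i in range(H):
        for j in range(W):
            s = sum(y[a][b] for a, b in neighbors(i, j, H, W))
            total += (beta * y[i][j] * s
                      - math.log(math.exp(beta * s) + math.exp(-beta * s)))
    return total

# Grid-search the PL estimate of β on a fully aligned configuration.
y = [[1] * 4 for _ in range(4)]
betas = [k * 0.1 for k in range(31)]
beta_hat = max(betas, key=lambda b: log_pseudo_likelihood(y, b))
```

On perfectly aligned data the pseudo-likelihood is monotonically increasing in β, so β̂ runs to the grid boundary (3.0 here), mirroring the same degeneracy ML exhibits; real data with disagreements gives an interior maximum.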