Y. Xiang, Learning Bayesian Networks


Learning Bayesian Networks

Objectives
Acquisition of BNs
Technical context of BN learning
Criterion of sound structure learning
BN structure learning in 2 steps
BN CPT estimation
Reference: R.E. Neapolitan, Learning Bayesian Networks (2004)

Acquisition of BNs
Elicitation-based acquisition
Determine the set V of env variables and their domains.
Determine the graphical dependence structure.
Determine CPTs, one for each variable.
Time consuming for domain experts & agent developer.
Learning-based acquisition
Input: a training set R of examples in an application env
Output: a BN for inference about the env
Unsupervised learning
The focus of this unit.

Task Decomposition
Denote a BN by S = (V, G, Pb), where V is a set of environment variables, G = (V, E) is a DAG, and Pb is a set of CPTs, Pb = {P(v | π(v)) | v ∈ V}.
Task: Learning a BN from a training set R
1) Identification of V
2) Definition of variable domains
3) Construction of the dependency structure G (referred to as structure learning)
4) Estimation of CPTs (referred to as parameter learning)

Review of BN Semantics
1. A variable v in a BN is conditionally independent of its non-descendants given its parents π(v).
2. Variables x and y are dependent given their common descendant(s).
Ex Burglar-quake: burglary (b) → alarm (a) ← quake (q), alarm (a) → callByJohn (j), alarm (a) → callByMary (m)
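To make the S = (V, G, Pb) decomposition and the burglar-quake example concrete, here is a minimal Python sketch of one way such a network could be stored. The variable names follow the slides; the boolean domains and all CPT numbers are illustrative assumptions, not values from the slides.

    # V: environment variables and their domains (boolean domains assumed)
    V = {"burglary": (True, False), "quake": (True, False),
         "alarm": (True, False), "callByJohn": (True, False),
         "callByMary": (True, False)}

    # G: the DAG, stored as the parent set pi(v) of each variable v
    G = {"burglary": (), "quake": (),
         "alarm": ("burglary", "quake"),
         "callByJohn": ("alarm",), "callByMary": ("alarm",)}

    # Pb: one CPT P(v | pi(v)) per variable, keyed by (value, parent assignment);
    # the probabilities below are placeholders for illustration only
    Pb = {
        "burglary": {(True, ()): 0.01, (False, ()): 0.99},
        "alarm": {(True, (True, True)): 0.95, (False, (True, True)): 0.05,
                  (True, (True, False)): 0.94, (False, (True, False)): 0.06,
                  (True, (False, True)): 0.29, (False, (False, True)): 0.71,
                  (True, (False, False)): 0.001, (False, (False, False)): 0.999},
        # ... CPTs for quake, callByJohn and callByMary would follow the same shape
    }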

Technical Context of BN Learning
The environment is characterized by an unknown full joint P*(V).
An unknown BN S* = (V, G*, Pb*) encodes the same conditional independencies as P*(V) by G*. S* perfectly encodes P*(V).
A set R of training examples is obtained from the environment (i.e., from P*(V)) by independent trials.
Task: From R, learn a BN S = (V, G, Pb) that models P*(V) as accurately and concisely as possible.

Structure Markov Equivalence
Suppose a full joint P*(V) over env V can be perfectly encoded by a BN S = (V, G, Pb). Is the DAG structure G unique?
Ex P*(V) can be perfectly encoded by a BN with G.
G: child_age → foot_size → shoe_size
Two DAGs are Markov equivalent if they entail the same conditional independencies.
Ex G': child_age ← foot_size ← shoe_size
Are G and G' Markov equivalent?

Criterion of Sound Structure Learning
Let S = (V, G, Pb) and S' = (V, G', Pb') be BNs s.t.
a) G and G' are Markov equivalent, and
b) Pb and Pb' are derived from the same environment.
Then S and S' model the same full joint over V.
Ex Env V = {a, b} with true full joint P*(V):
  a b   P*(a,b)
  t t   0.05
  t f   0.35
  f t   0.50
  f f   0.10
G1: a → b; G2: a ← b; G3 is disconnected.
If the full joint P*(V) can be encoded by BN S = (V, G, Pb) but S' = (V, G', Pb') is learned, then structure learning is sound as long as G and G' are Markov equivalent.

Learning DAG Skeleton
Let G = (V, E) be a DAG. The undirected graph G' = (V, E'), where E' is obtained by removing the direction of each link in E, is the skeleton of G.
Structure learning can be performed in two steps.
1) Learn a skeleton G' = (V, E').
2) Direct links in G' to obtain the DAG G = (V, E).
In skeleton learning, how do we know whether a pair of variables should be adjacent?
[Theorem] Variables x, y ∈ V are adjacent in G iff there exists no Z ⊆ V\{x,y} s.t. I(x, Z, y) holds.
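The slides define Markov equivalence through the entailed conditional independencies; a common operational test, not stated on the slides, is that two DAGs are Markov equivalent iff they have the same skeleton and the same uncoupled head-to-head meetings. A small Python sketch of that test, applied to the two child_age / foot_size / shoe_size chains:

    from itertools import combinations

    def skeleton(dag):
        """Undirected edge set of a DAG given as {child: tuple of parents}."""
        return {frozenset((p, c)) for c, ps in dag.items() for p in ps}

    def head_to_head_meetings(dag):
        """Uncoupled meetings x -> z <- y where x and y are not adjacent."""
        skel = skeleton(dag)
        return {(frozenset((x, y)), z)
                for z, ps in dag.items()
                for x, y in combinations(ps, 2)
                if frozenset((x, y)) not in skel}

    def markov_equivalent(g1, g2):
        """Same skeleton and same uncoupled head-to-head meetings."""
        return (skeleton(g1) == skeleton(g2)
                and head_to_head_meetings(g1) == head_to_head_meetings(g2))

    G      = {"child_age": (), "foot_size": ("child_age",), "shoe_size": ("foot_size",)}
    Gprime = {"shoe_size": (), "foot_size": ("shoe_size",), "child_age": ("foot_size",)}
    print(markov_equivalent(G, Gprime))   # True

Both chains have the same skeleton and no head-to-head meetings, so the test returns True: reversing a chain does not change the independencies it entails.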

Entropy
In decision tree learning, the amount of information contained in the value of a variable is measured by entropy.
Let X be a set of variables with JPD P(X). The entropy of X is
  H(X) = - Σ_x P(x) log2 P(x).
Interpretation
1) H(X) is the measure of uncertainty associated with X.
2) H(X) is the amount of info in an assignment x of X.

How to Determine I(x,Z,y)?
[Theorem] For variables x, y ∈ V and Z ⊆ V\{x,y}, we have I(x,Z,y) iff H(x,Z,y) = H(x,Z) + H(Z,y) - H(Z).
Algorithm: Test I(x,Z,y) using training set R
  estimate P(x,Z,y) from R;
  marginalize P(x,Z,y) to obtain P(x,Z), P(Z,y) and P(Z);
  compute H(x,Z,y), H(x,Z), H(Z,y), and H(Z);
  compute diff = |H(x,Z,y) - (H(x,Z) + H(Z,y) - H(Z))|;
  if diff < threshold, return I(x,Z,y); else return ¬I(x,Z,y);

How to Choose Z to Test I(x,Z,y)?
Given Z ⊆ V\{x,y} with ¬I(x, Z, y), it is possible that I(x, Z⁻, y) for some Z⁻ ⊂ Z, or I(x, Z⁺, y) for some Z⁺ ⊃ Z.
It appears that, to determine I(x,Z,y), all subsets of V\{x,y} must be tested.
[Theorem] In a BN, let x, y ∈ V and Z ⊆ V\{x,y} with I(x,Z,y). Then either I(x, π(x), y) or I(x, π(y), y) holds.
Idea a) Start with the complete graph and delete <x,y> if I(x,Z,y) for some Z.
Idea b) To find Z s.t. I(x,Z,y), limit the search to Z ⊆ Adj(x) and Z ⊆ Adj(y).
Idea c) Test smaller subsets Z first.

Structure Learning Algorithm
learnBnDag(V, R) {
  G' = complete undirected graph over V;
  for each link <x,y> in G', associate <x,y> with set Sxy = null;
  G' = getSkeleton(G', R);
  G = directLink(G');
  return G;
}
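A minimal Python sketch of the entropy-based independence test above, assuming the training set R is represented as a list of dict-valued examples; the threshold value is an arbitrary placeholder that would have to be tuned in practice.

    import math
    from collections import Counter

    def entropy(records, variables):
        """Entropy of the joint over the given variables, estimated from records."""
        n = len(records)
        counts = Counter(tuple(r[v] for v in variables) for r in records)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def independent(records, x, Z, y, threshold=0.01):
        """Accept I(x, Z, y) when H(x,Z,y) is close to H(x,Z) + H(Z,y) - H(Z)."""
        Z = list(Z)
        diff = abs(entropy(records, [x] + Z + [y])
                   - (entropy(records, [x] + Z)
                      + entropy(records, Z + [y])
                      - entropy(records, Z)))
        return diff < threshold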

geskeleon(g, R) { k = 0; done = false; while done = false, do { done = rue; for each node x in G, ge Adj(x); if Adj(x) k, coninue; done = false; for each node y in Adj(x), for each subse Z of Adj(x)\{y wih Z =k, if I(x,Z,y), hen {Sxy = Z; rm <x,y> in G ; break; k++; 13 Types of Chains in Srucure Undireced chain A chain x-z-y where x and y are no adjacen is called an uncoupled meeing. Direced chain 1. A chain x z y is a head-o-ail meeing a z. 2. A chain x z y is a ail-o-ail meeing a z. 3. A chain x z y is a head-o-head meeing a z. When isolaed, are hese direced meeings Markov equivalen? Y. Xiang, Learning Bayesian Neworks 14 Direc Links in Skeleon Idea: Direc head-o-head meeings firs, and use DAG consrain o direc remaining links. [Theorem] Le S be a BN, x-z-y be a uncoupled meeing in is skeleon, I(x,W,y) for W V\{x,y, and z W. Then x-z-y is a head-o-head meeing in S. Operaional implicaion 1) If x-z-y is an uncoupled meeing, hen Sxy null. 2) If z Sxy, hen x-z-y mus be x z y. Y. Xiang, Learning Bayesian Neworks 15 direclink(g) { for each uncoupled meeing x-z-y, if (z Sxy) direc x-z-y as x z y; // rule 1 done = false; while done = false, do done = rue; for each uncoupled meeing x z-y, direc z-y as z y; done = false; // rule 2 for each x-y s.. here is a direced pah from x o y, direc x-y as x y; done = false; // rule 3 for each uncoupled meeing x-z-y s.. x w, z-w & y w, direc z-w as z w; done = false; // rule 4 direc remaining links randomly s.. no direced cycle or head-o-head meeing is creaed; 16 4

Direct Link by DAG Constraint
[Rule 2] For each uncoupled meeting x → z - y, direct it as x → z → y. Why?
[Rule 3] For each x-y s.t. there is a directed path from x to y, direct x-y as x → y. Why?
[Rule 4] For each uncoupled meeting x-z-y s.t. x → w, z-w and y → w, direct z-w as z → w. Why?

CPT Estimation
To get P(x | π(x)), for each x = u and each assignment w of π(x), we need to estimate P(u | w).
Maximum likelihood estimation
1. Gather the set S of examples in R that satisfy π(x) = w.
2. N = |S|.
3. M = number of examples in S that satisfy x = u.
4. Estimate P(u | w) based on N and M.
To determine P(X), estimate P(x) for each assignment x of X.
A. Gather the set T of examples in R that satisfy X = x.
B. Estimate P(x) as |T| / |R|.

Maximum Likelihood Estimation
1. Denote the unknown P(u | w) as parameter θ ∈ [0, 1].
2. Denote the examples in S as e1, e2, ..., eN.
3. Derive the likelihood P(S | θ) of observing S.
4. Determine the parameter θ that maximizes P(S | θ).
   a) Maximizing P(S | θ) is equivalent to maximizing the log likelihood ln P(S | θ).
   b) Differentiate ln P(S | θ).
   c) Set the derivative to 0.
   d) Solve for θ.

Remarks
BN learning overcomes the bottleneck of knowledge acquisition by elicitation, and allows BN inference to be more widely applied.
More advanced topics:
1) Alternative BN structure learning methods
2) Alternative BN CPT estimation methods
3) Integrating BN learning with elicitation
4) Learning BNs with continuous variables
5) Learning BNs in dynamic envs
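Putting the CPT-estimation and maximum-likelihood slides together for a boolean x: the likelihood of S is P(S | θ) = θ^M (1-θ)^(N-M), so setting the derivative of ln P(S | θ) = M ln θ + (N-M) ln(1-θ) to zero gives θ = M/N, the relative frequency. A toy Python sketch under that boolean assumption, with made-up data and variable names reused from the burglar-quake example:

    from fractions import Fraction

    def estimate_cpt_entry(records, x, u, parents, w):
        """MLE of P(x=u | pi(x)=w) by relative frequency.
        For boolean x, maximizing theta^M (1-theta)^(N-M) gives theta = M / N."""
        S = [r for r in records if all(r[p] == val for p, val in zip(parents, w))]
        N = len(S)
        M = sum(1 for r in S if r[x] == u)
        return Fraction(M, N) if N else None   # undefined when pi(x)=w never occurs in R

    # Example: estimate P(alarm=True | burglary=True, quake=False) from toy data
    R = [{"burglary": True,  "quake": False, "alarm": True},
         {"burglary": True,  "quake": False, "alarm": True},
         {"burglary": True,  "quake": False, "alarm": False},
         {"burglary": False, "quake": False, "alarm": False}]
    print(estimate_cpt_entry(R, "alarm", True, ("burglary", "quake"), (True, False)))  # 2/3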