Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning


Huizhen Yu, janey.yu@cs.helsinki.fi
Dept. Computer Science, Univ. of Helsinki
Probabilistic Models, Spring 2010

Notices: I corrected a number of errors/typos in the slides of Lec. 1; this affected in particular slides 5, 6, 32, 34, 36. There may be other corrections after today's lecture. Please check the online version of the slides; I will put an update sign beside the link. Please do not hesitate to contact me if you have any questions before the exam.

Our Model and Data

Let $X = \{X_v, v \in V\}$ be a collection of discrete random variables, and let $G$ be a DAG on $V$. Our model for $X$ is the set of all distributions $P(X)$ that factorize recursively according to $G$. The true, unknown distribution of $X$ is $Q^*$, not necessarily in our model.

Maximum likelihood (ML) estimation:

Data: $\{x^1, x^2, \ldots, x^n\}$, $n$ observations independently generated according to $Q^*$ (i.e., a random sample of size $n$).

The empirical distribution $\hat Q(\cdot)$: $\hat Q(X = x)$ is the observed frequency of the configuration $x$ in the data.

$P^{ML}$: the distribution in our model that maximizes the likelihood function based on the data,
$$L(P) = \prod_{i=1}^n P(X = x^i) = \prod_{i=1}^n \prod_{v \in V} p\big(x^i_v \mid x^i_{pa(v)}\big).$$
(For simplicity, we do not use the $\theta$ notation for parameters here.)
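To make these objects concrete, here is a minimal Python sketch (my own illustration, not part of the original slides) that computes the empirical distribution $\hat Q$ from a data matrix and evaluates $\ln L(P)$ for a model given by a parent map and conditional probability tables; the function names, data layout, and toy numbers are all assumptions.

```python
from collections import Counter
import math

def empirical_distribution(data):
    """Q-hat: the observed frequency of each full configuration x."""
    n = len(data)
    return {x: c / n for x, c in Counter(map(tuple, data)).items()}

def log_likelihood(data, parents, cpts):
    """ln L(P) = sum_i sum_v ln p(x_v^i | x_pa(v)^i), with the DAG given
    as parents[v] = tuple of parent indices and cpts[v][pa_cfg][x_v]."""
    ll = 0.0
    for x in data:
        for v, pa in parents.items():
            pa_cfg = tuple(x[u] for u in pa)
            ll += math.log(cpts[v][pa_cfg][x[v]])
    return ll

# Toy example: V = {0, 1}, edge 0 -> 1, binary variables.
data = [(0, 0), (0, 1), (1, 1), (1, 1)]
parents = {0: (), 1: (0,)}
cpts = {0: {(): {0: 0.5, 1: 0.5}},
        1: {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.2, 1: 0.8}}}
print(empirical_distribution(data))  # {(0,0): 0.25, (0,1): 0.25, (1,1): 0.5}
print(log_likelihood(data, parents, cpts))
```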

Relation between the ML Estimate, the Empirical and the True Distributions

The relation between $P^{ML}$, $\hat Q$ and $Q^*$:

[Figure: the model $\{P(X) : P \text{ factorizes recursively according to } G\}$, with $P^{ML}$ inside it as the projection of the empirical distribution $\hat Q$; the true distribution $Q^*$ (unknown) may lie outside the model.]

Among all $P$ in our model, $P^{ML}$ is the closest distribution to $\hat Q$ in terms of the KL-divergence $KL(\hat q, p)$. ($\hat q$ is the PMF of $\hat Q$.) (See discussions in Lec. 3 and Problem 3 of Exercise 2.)

Expression of the ML Estimate

The ML estimate $P^{ML}$ is the distribution given by
$$p^{ML}(x) = \prod_{v \in V} p^{ML}\big(x_v \mid x_{pa(v)}\big),$$
where the component conditional distributions are defined by
$$p^{ML}\big(x_v \mid x_{pa(v)}\big) = \hat Q\big(X_v = x_v \mid X_{pa(v)} = x_{pa(v)}\big) = \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})}, \qquad (1)$$
and in the last expression, $n(x_{pa(v)})$ is the count of the configuration $x_{pa(v)}$ in the data, and $n(x_v, x_{pa(v)})$ is the count of the configuration $(x_v, x_{pa(v)})$ in the data.

The maximized log likelihood can be expressed as
$$\ell(P^{ML}) = n\, E_{\hat Q}\big[\ln p^{ML}(X)\big] = n \sum_{v \in V} E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa(v)}\big)\big], \qquad (2)$$
where $E_{\hat Q}$ denotes expectation with respect to the empirical distribution $\hat Q$. (Eqs. (1)-(2) can be derived using the information inequality; see the derivation at the end of these slides for details.)
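Eq. (1) estimates each conditional distribution by a ratio of counts. A minimal sketch of that computation, reusing the `data`/`parents` layout from the previous sketch (again my own illustration, not the author's code):

```python
from collections import defaultdict

def mle_cpts(data, parents):
    """ML conditional probability tables via Eq. (1):
    p_ML(x_v | x_pa(v)) = n(x_v, x_pa(v)) / n(x_pa(v))."""
    pair_counts = {v: defaultdict(int) for v in parents}  # n(x_v, x_pa(v))
    pa_counts = {v: defaultdict(int) for v in parents}    # n(x_pa(v))
    for x in data:
        for v, pa in parents.items():
            pa_cfg = tuple(x[u] for u in pa)
            pair_counts[v][(x[v], pa_cfg)] += 1
            pa_counts[v][pa_cfg] += 1
    cpts = {v: {} for v in parents}
    for v in parents:
        for (xv, pa_cfg), c in pair_counts[v].items():
            cpts[v].setdefault(pa_cfg, {})[xv] = c / pa_counts[v][pa_cfg]
    return cpts

data = [(0, 0), (0, 1), (1, 1), (1, 1)]
parents = {0: (), 1: (0,)}
print(mle_cpts(data, parents))
# e.g. cpts[1][(1,)] == {1: 1.0}: both samples with x_0 = 1 have x_1 = 1
```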

Learning a Rooted Tree

Problem: Given the data as described earlier, find a rooted tree $G$ which maximizes the profile log likelihood $\ell_p(G)$:
$$\ell_p(G) \stackrel{\text{def}}{=} \ell\big(G, P_G^{ML}\big) = \max_{P \in \mathcal P(G)} \ell(G, P).$$
Here $\mathcal P(G)$ is the set of all distributions that factorize recursively according to $G$.

Such a tree is also called a Chow-Liu tree, and can be found by the Chow-Liu tree algorithm (Chow and Liu, 1968). The algorithm can be generalized to solve similar types of problems (we will show one).

Recall Mutual Information and Conditional Mutual Information

Let $X, Y, Z$ be discrete random variables with joint distribution $P$. The mutual information between $X$ and $Y$ is defined as
$$I(X; Y) = E\left[\ln \frac{p(X, Y)}{p(X)\,p(Y)}\right],$$
and equivalently,
$$I(X; Y) = \sum_{x, y} p(x, y) \ln\left(\frac{p(x, y)}{p(x)\,p(y)}\right).$$
The conditional mutual information between $X$ and $Y$ given $Z$ is defined as
$$I(X; Y \mid Z) = E\left[\ln \frac{p(X, Y \mid Z)}{p(X \mid Z)\,p(Y \mid Z)}\right],$$
and equivalently,
$$I(X; Y \mid Z) = \sum_z p(z) \sum_{x, y} p(x, y \mid z) \ln\left(\frac{p(x, y \mid z)}{p(x \mid z)\,p(y \mid z)}\right).$$
By the information inequality, $I(X; Y) \ge 0$, and $I(X; Y) = 0$ iff $X \perp Y$; likewise $I(X; Y \mid Z) \ge 0$, and $I(X; Y \mid Z) = 0$ iff $X \perp Y \mid Z$.

Deriving the Algorithm

We start with the profile log likelihood: by Eq. (2),
$$\ell_p(G) = n \sum_{v \in V} E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa_G(v)}\big)\big].$$
Here $pa_G(v)$ is the parent of $v$ in the rooted tree $G$. Rewrite $\ell_p(G)$ in terms of the mutual information $I_{\hat Q}\big(X_v; X_{pa_G(v)}\big)$, $v \in V$ (w.r.t. the distribution $\hat Q$):
$$E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa_G(v)}\big)\big] = E_{\hat Q}\left[\ln\left(\frac{\hat q\big(X_v \mid X_{pa_G(v)}\big)\, \hat q\big(X_{pa_G(v)}\big)\, \hat q(X_v)}{\hat q(X_v)\, \hat q\big(X_{pa_G(v)}\big)}\right)\right] = E_{\hat Q}\left[\ln \frac{\hat q\big(X_v, X_{pa_G(v)}\big)}{\hat q(X_v)\, \hat q\big(X_{pa_G(v)}\big)}\right] + E_{\hat Q}\big[\ln \hat q(X_v)\big] = I_{\hat Q}\big(X_v; X_{pa_G(v)}\big) + E_{\hat Q}\big[\ln \hat q(X_v)\big];$$
hence
$$\tfrac{1}{n}\,\ell_p(G) = \sum_{v \in V} I_{\hat Q}\big(X_v; X_{pa_G(v)}\big) + \sum_{v \in V} E_{\hat Q}\big[\ln \hat q(X_v)\big]. \qquad (3)$$

In the last equation, the second term does not depend on $G$ and therefore can be left out when maximizing $\ell_p(G)$ over $G$; moreover, the mutual information is symmetric: $I_{\hat Q}\big(X_v; X_{pa_G(v)}\big) = I_{\hat Q}\big(X_{pa_G(v)}; X_v\big)$. Therefore,
$$\max_{G \in \{\text{rooted trees}\}} \ell_p(G) \;\Longleftrightarrow\; \max_{G \in \{\text{undirected trees}\}} \sum_{(v, u) \in G} I_{\hat Q}(X_v; X_u), \qquad (4)$$
where the summation is over all edges of $G$.

The Chow-Liu Tree Algorithm

(1) Compute all pairwise mutual information
$$I_{\hat Q}(X_v; X_u) = E_{\hat Q}\left[\ln \frac{\hat q(X_v, X_u)}{\hat q(X_v)\,\hat q(X_u)}\right], \quad v, u \in V.$$
(2) Find a maximum spanning tree of the undirected, fully connected graph on $V$ with edge weight $I_{\hat Q}(X_v; X_u)$ between nodes $v$ and $u$. This can be done by Kruskal's algorithm: repeatedly select an edge with maximum weight that does not create a cycle.
(3) Make any node of the spanning tree the root and direct the edges away from it.

The result is a rooted tree $G$ that maximizes $\ell_p(G)$.
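Steps (1)-(3) translate almost line by line into code. Below is a compact sketch of the Chow-Liu procedure, assuming integer-coded data rows: empirical pairwise mutual information, Kruskal's maximum spanning tree with a union-find, and rooting by breadth-first search. The function names and toy data are my own, not from the slides.

```python
import math
from collections import Counter, deque
from itertools import combinations

def mutual_information(data, v, u):
    """I_Qhat(X_v; X_u) under the empirical distribution."""
    n = len(data)
    pv = Counter(x[v] for x in data)
    pu = Counter(x[u] for x in data)
    pvu = Counter((x[v], x[u]) for x in data)
    return sum((c / n) * math.log((c / n) / ((pv[a] / n) * (pu[b] / n)))
               for (a, b), c in pvu.items())

def chow_liu(data, num_vars):
    """Steps (1)-(3): MI edge weights, maximum spanning tree (Kruskal
    with union-find), then direct edges away from an arbitrary root."""
    edges = sorted(((mutual_information(data, v, u), v, u)
                    for v, u in combinations(range(num_vars), 2)),
                   reverse=True)                       # step (1)
    parent = list(range(num_vars))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]              # path halving
            a = parent[a]
        return a
    tree = []
    for w, v, u in edges:                              # step (2)
        rv, ru = find(v), find(u)
        if rv != ru:
            parent[rv] = ru
            tree.append((v, u))
    adj = {v: [] for v in range(num_vars)}
    for v, u in tree:
        adj[v].append(u)
        adj[u].append(v)
    directed, seen, queue = [], {0}, deque([0])        # step (3): root at 0
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                directed.append((v, u))
                queue.append(u)
    return directed  # (parent, child) edges of the rooted Chow-Liu tree

data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(chow_liu(data, 3))  # X_0 and X_1 are perfectly correlated here
```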

Generalization to Learning Tree Augmented Naive Bayes

A naive Bayes classifier with class variable $C$ and feature variables $F_1, F_2, \ldots, F_m$:

[Figure: the DAG with $C$ as the single parent of each of $F_1, F_2, \ldots, F_m$.]

Tree augmented naive Bayes classifiers (TAN):

Naive Bayes neglects the dependence between feature variables. This can be troublesome for rare classes that have characteristic combinations of features.

In a TAN, each feature variable has at most one other feature variable as its parent besides the class variable. In other words, the subgraph induced by the feature variables is a rooted tree or forest.

Consider the problem of learning a TAN $G$ with maximum likelihood.

Learning TAN

Notation:
$X_v, v \in V$: the feature variables.
$\widehat G$: the subgraph of $G$ induced by the feature variables $X_v, v \in V$.
$pa_{\widehat G}(v)$: the parent of $v$ in $\widehat G$, i.e., the parent of $v$ in $G$ besides $C$.
Note that a TAN $G$ is uniquely determined by its associated $\widehat G$.

Apply the Chow-Liu tree algorithm to learning TAN: replace all pairwise mutual information by the conditional mutual information between all pairs of feature variables given the class variable:
$$I_{\hat Q}(X_v; X_u \mid C) = E_{\hat Q}\left[\ln \frac{\hat q(X_v, X_u \mid C)}{\hat q(X_v \mid C)\,\hat q(X_u \mid C)}\right], \quad v, u \in V.$$
The output of the algorithm is the subgraph $\widehat G$ whose associated TAN $G$ maximizes the profile log likelihood among all TANs.

Deriving the Algorithm for TAN

Similarly to learning a rooted tree, we start with the profile log likelihood: by Eq. (2),
$$\ell_p(\widehat G) = \ell_p(G) = n \sum_{v \in V} E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa_{\widehat G}(v)}, C\big)\big] + n\, E_{\hat Q}\big[\ln \hat q(C)\big]. \qquad (5)$$
We rewrite $\ell_p(\widehat G)$ in terms of the conditional mutual information $I_{\hat Q}\big(X_v; X_{pa_{\widehat G}(v)} \mid C\big)$ between $X_v$ and $X_{pa_{\widehat G}(v)}$ given $C$ for $v \in V$:
$$E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa_{\widehat G}(v)}, C\big)\big] = E_{\hat Q}\left[\ln\left(\frac{\hat q\big(X_v \mid X_{pa_{\widehat G}(v)}, C\big)\, \hat q\big(X_{pa_{\widehat G}(v)} \mid C\big)\, \hat q(X_v \mid C)}{\hat q(X_v \mid C)\, \hat q\big(X_{pa_{\widehat G}(v)} \mid C\big)}\right)\right] = E_{\hat Q}\left[\ln \frac{\hat q\big(X_v, X_{pa_{\widehat G}(v)} \mid C\big)}{\hat q(X_v \mid C)\, \hat q\big(X_{pa_{\widehat G}(v)} \mid C\big)}\right] + E_{\hat Q}\big[\ln \hat q(X_v \mid C)\big] = I_{\hat Q}\big(X_v; X_{pa_{\widehat G}(v)} \mid C\big) + E_{\hat Q}\big[\ln \hat q(X_v \mid C)\big];$$
hence
$$\tfrac{1}{n}\,\ell_p(\widehat G) = \sum_{v \in V} I_{\hat Q}\big(X_v; X_{pa_{\widehat G}(v)} \mid C\big) + \sum_{v \in V} E_{\hat Q}\big[\ln \hat q(X_v \mid C)\big] + E_{\hat Q}\big[\ln \hat q(C)\big]. \qquad (6)$$

In the last equation, the second and third terms do not depend on $\widehat G$ and therefore can be left out when maximizing $\ell_p(\widehat G)$ over $\widehat G$; the conditional mutual information is symmetric: $I_{\hat Q}\big(X_v; X_{pa_{\widehat G}(v)} \mid C\big) = I_{\hat Q}\big(X_{pa_{\widehat G}(v)}; X_v \mid C\big)$; and if $\widehat G$ is a forest, adding edges to make it a tree will not decrease $\ell_p(\widehat G)$. Therefore,
$$\max_{\widehat G \in \{\text{rooted trees}\}} \ell_p(\widehat G) \;\Longleftrightarrow\; \max_{\widehat G \in \{\text{undirected trees}\}} \sum_{(v, u) \in \widehat G} I_{\hat Q}(X_v; X_u \mid C), \qquad (7)$$
where the summation is over all edges of $\widehat G$.

This verifies the earlier claim that we can apply the Chow-Liu tree algorithm with $I_{\hat Q}(X_v; X_u \mid C)$ replacing $I_{\hat Q}(X_v; X_u)$ for all $v, u \in V$ to obtain the desired $\widehat G$.
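Relative to the Chow-Liu sketch above, the only change is the edge weight: condition everything on the class. A sketch of the conditional mutual information $I_{\hat Q}(X_v; X_u \mid C)$, with hypothetical names (`labels` holds the class column, one entry per data row):

```python
import math
from collections import Counter

def conditional_mutual_information(data, labels, v, u):
    """I_Qhat(X_v; X_u | C): per-class MI averaged with the
    empirical class probabilities q(c) = n_c / n."""
    n = len(labels)
    cmi = 0.0
    for c, nc in Counter(labels).items():
        rows = [x for x, y in zip(data, labels) if y == c]
        pv = Counter(x[v] for x in rows)
        pu = Counter(x[u] for x in rows)
        pvu = Counter((x[v], x[u]) for x in rows)
        mi_c = sum((k / nc) * math.log((k / nc) / ((pv[a] / nc) * (pu[b] / nc)))
                   for (a, b), k in pvu.items())
        cmi += (nc / n) * mi_c
    return cmi
```

Running the same Kruskal and rooting steps on these weights over the feature variables, and then adding $C$ as a parent of every feature, yields the TAN corresponding to $\widehat G$.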

Discussion

Rooted trees and TANs are perfect DAGs: $G^m = G$, i.e., moralization adds no edges. So the models are equivalent to those associated with the corresponding undirected graphs, and it is not surprising that the structure learning algorithms we derived can disregard edge directions.

For learning a singly connected network (under certain assumptions) with the Chow-Liu tree algorithm, see Pearl's 1988 book.

Further Readings

For TAN:
1. Finn V. Jensen and Thomas D. Nielsen. Bayesian Networks and Decision Graphs. Springer, 2007. Chap. 8.

An old review article discussing the ideas and steps involved in developing a probabilistic expert system, using the example CHILD network:
2. David J. Spiegelhalter et al. Bayesian analysis in expert systems. Statistical Science, Vol. 8, No. 3, pp. 219-283, 1993.
(It includes Bayesian inference, which we did not talk about.) You may also find the related materials in the book by Cowell et al. 2007.

A recent book by Koller and Friedman, Probabilistic Graphical Models, 2009, has many materials on both approximate and exact inference algorithms.

Derivation of Eqs. (1)-(2)

The likelihood and log likelihood functions are
$$L(P) = \prod_{i=1}^n \prod_{v \in V} p\big(x^i_v \mid x^i_{pa(v)}\big), \qquad \ell(P) = \sum_{i=1}^n \sum_{v \in V} \ln p\big(x^i_v \mid x^i_{pa(v)}\big).$$
The variables in the maximization of $\ell(P)$ are the conditional distributions $p(x_v \mid x_{pa(v)})$ of $X_v$ for each configuration $x_{pa(v)}$ of $v$'s parents, for all $v \in V$. We next express $\ell(P)$ in terms of these variables.

By exchanging the order of summations in the expression of $\ell(P)$,
$$\ell(P) = \sum_{i=1}^n \sum_{v \in V} \ln p\big(x^i_v \mid x^i_{pa(v)}\big) = \sum_{v \in V} \sum_{x_{pa(v)}} \sum_{x_v} n(x_v, x_{pa(v)}) \ln p\big(x_v \mid x_{pa(v)}\big),$$
where $n(x_v, x_{pa(v)})$ is the count of the configuration $(x_v, x_{pa(v)})$ in the data.

Under our model, there are no constraints between the component conditional distributions we can choose. So the maximization problem $\max_P \ell(P)$ decomposes into separate maximization problems, one for each $v$ and each parent configuration $x_{pa(v)}$:
$$\max_{p(\cdot \mid x_{pa(v)})} \sum_{x_v} n(x_v, x_{pa(v)}) \ln p\big(x_v \mid x_{pa(v)}\big). \qquad (8)$$
($x_{pa(v)}$ is fixed in the above subproblem.)

The subproblem (8) is equivalent to
$$\max_{p(\cdot \mid x_{pa(v)})} \sum_{x_v} \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})} \ln p\big(x_v \mid x_{pa(v)}\big), \qquad (9)$$
where $n(x_{pa(v)}) = \sum_{x_v} n(x_v, x_{pa(v)})$, and it is the count of the parent configuration $x_{pa(v)}$ in the data.

By the information inequality (see Lec. 3), the maximum of (9) is attained at
$$p\big(x_v \mid x_{pa(v)}\big) = \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})}, \quad \forall x_v,$$
which is the ML estimate $p^{ML}(\cdot \mid x_{pa(v)})$ given in Eq. (1).

The maximized log likelihood thus equals
$$\ell(P^{ML}) = \sum_{v \in V} \sum_{x_{pa(v)}} \sum_{x_v} n(x_v, x_{pa(v)}) \ln \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})} = n \sum_{v \in V} \sum_{x_{pa(v)}} \sum_{x_v} \frac{n(x_v, x_{pa(v)})}{n} \ln \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})} = n \sum_{v \in V} \sum_{x_{pa(v)}} \sum_{x_v} \hat q(x_v, x_{pa(v)}) \ln \hat q\big(x_v \mid x_{pa(v)}\big) = n \sum_{v \in V} E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa(v)}\big)\big].$$
($\hat q$ is the PMF of $\hat Q$.) This verifies Eq. (2).
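The information-inequality step, that (9) is maximized at the empirical proportions, is easy to spot-check numerically. A small sketch (my own illustration, not from the slides) compares the objective of subproblem (8) at $p = n(x_v, x_{pa(v)})/n(x_{pa(v)})$ against random points on the probability simplex:

```python
import math
import random

def objective(counts, p):
    """sum_x n(x) ln p(x), the inner objective of subproblem (8)."""
    return sum(n * math.log(q) for n, q in zip(counts, p))

random.seed(0)
counts = [5, 3, 2]                  # n(x_v, x_pa(v)) for one fixed x_pa(v)
total = sum(counts)
p_ml = [n / total for n in counts]  # the claimed maximizer, Eq. (1)
best = objective(counts, p_ml)
for _ in range(10000):              # random distributions on the simplex
    w = [random.expovariate(1.0) for _ in counts]
    s = sum(w)
    assert objective(counts, [x / s for x in w]) <= best + 1e-12
print("empirical proportions maximize the objective:", best)
```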