Representing Arbitrary Probability Distributions. Inference: Exact Inference; Approximate Inference


Bayesian Learning. So far: What does it mean to be Bayesian? Naïve Bayes, independence assumptions; the EM algorithm, learning with hidden variables. Today: Representing arbitrary probability distributions; inference: exact inference, approximate inference. HW7 released already (Learning Representations of Probability Distributions); you cannot do Problem 2 yet. Due on 5/1; no late hours. Final: May 9. Same style as the mid-term; comprehensive, but focus on material after the mid-term. Projects: reports and 5 min. videos are due on May 11.

Unsupervised Learning. We get as input (n+1)-tuples: (X_1, X_2, ..., X_n, X_{n+1}). There is no notion of a class variable or a label. After seeing a few examples, we would like to know something about the domain: correlations between variables, probability of certain events, etc. We want to learn the most likely model that generated the data.

Simple Distributions. In general, the problem is very hard. But, under some assumptions on the distribution, we have shown that we can do it (exercise: show it is the most likely distribution). The model is naïve Bayes: a root y with children x_1, x_2, x_3, ..., x_n, parameterized by P(y) and P(x_1 | y), P(x_2 | y), ..., P(x_n | y). Assumptions (conditional independence given y): P(x_i | x_j, y) = P(x_i | y) ∀ i, j. Can these assumptions be relaxed? Can we learn more general probability distributions? (These are essential in many applications: language, vision.)

Simple Distributions. Under the assumption P(x_i | x_j, y) = P(x_i | y) ∀ i, j, we can compute the joint probability distribution on the n+1 variables: P(y, x_1, x_2, ..., x_n) = P(y) ∏_{i=1..n} P(x_i | y). Therefore, we can compute the probability of any event: P(x_1=0, x_2=0, y=1) = Σ_{b_3,...,b_n ∈ {0,1}} P(y=1, x_1=0, x_2=0, x_3=b_3, x_4=b_4, ..., x_n=b_n). More efficiently: P(x_1=0, x_2=0, y=1) = P(x_1=0 | y=1) P(x_2=0 | y=1) P(y=1). We can compute the probability of any event or conditional event over the n+1 variables.
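
The two ways of computing P(x_1=0, x_2=0, y=1) can be checked against each other numerically. A minimal sketch, assuming made-up CPT values for a naïve Bayes model with n = 4 binary features (none of the numbers come from the lecture):

```python
import itertools

# Assumed naive Bayes parameters for n = 4 binary features (illustrative only).
p_y1 = 0.3                      # P(y=1)
p_x_given_y = {                 # p_x_given_y[i][y] = P(x_i = 1 | y)
    1: {0: 0.2, 1: 0.7}, 2: {0: 0.5, 1: 0.1},
    3: {0: 0.4, 1: 0.6}, 4: {0: 0.3, 1: 0.9},
}

def joint(y, xs):
    """P(y, x_1..x_n) = P(y) * prod_i P(x_i | y)."""
    p = p_y1 if y == 1 else 1 - p_y1
    for i, x in enumerate(xs, start=1):
        q = p_x_given_y[i][y]
        p *= q if x == 1 else 1 - q
    return p

# Brute force: sum the joint over the unconstrained variables x_3, x_4.
brute = sum(joint(1, (0, 0, b3, b4)) for b3, b4 in itertools.product((0, 1), repeat=2))

# Shortcut: P(x_1=0, x_2=0, y=1) = P(x_1=0|y=1) P(x_2=0|y=1) P(y=1).
shortcut = (1 - p_x_given_y[1][1]) * (1 - p_x_given_y[2][1]) * p_y1

print(brute, shortcut)   # the two numbers agree
```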

Representing Probability Distributions. Goal: to represent all joint probability distributions over a set of random variables X_1, X_2, ..., X_n. There are many ways to represent distributions. A table, listing the probability of each instance in {0,1}^n: we will need 2^n - 1 numbers. What can we do? Make independence assumptions. Multi-linear polynomials: polynomials over variables (Samdani & Roth '09, Poon & Domingos '11). Bayesian Networks: directed acyclic graphs. Markov Networks: undirected graphs.

Unsupervised Learning (recap). We can compute the probability of any event or conditional event over the n+1 variables. In general, the problem is very hard, but under the naïve Bayes assumptions (conditional independence given y: P(x_i | x_j, y) = P(x_i | y) ∀ i, j) we have shown that we can do it (exercise: show it is the most likely distribution). Can these assumptions be relaxed? Can we learn more general probability distributions? (These are essential in many applications: language, vision.)

Graphical Models of Probability Distributions. Bayesian Networks represent the joint probability distribution over a set of variables, arranged as a DAG (e.g., z_i is a parent of x_i, and x_i is a descendant of y). Independence assumption: ∀ x_i, x_i is independent of its non-descendants given its parents. With these conventions, the joint probability distribution is given by P(y, x_1, x_2, ..., x_n) = P(y) ∏_i P(x_i | Parents(x_i)). This is a theorem; to prove it, order the nodes from leaves up, and use the product rule. The terms P(x_i | Parents(x_i)) are called CPTs (Conditional Probability Tables) and they completely define the probability distribution.

Bayesian Network. Semantics of the DAG: nodes are random variables; edges represent causal influences; each node is associated with a conditional probability distribution. Two equivalent viewpoints: a data structure that represents the joint distribution compactly, and a representation of a set of conditional independence assumptions about a distribution.

Bayesian Network: Example. The burglar alarm in your house rings when there is a burglary or an earthquake. An earthquake will be reported on the radio. If an alarm rings and your neighbors hear it, they will call you. What are the random variables?

Bayesian Network: Example. The network: Earthquake → Radio, Earthquake → Alarm, Burglary → Alarm, Alarm → Mary Calls, Alarm → John Calls. If there is an earthquake, you'll probably hear about it on the radio. An alarm can ring because of a burglary or an earthquake. If your neighbors hear an alarm, they will call you. How many parameters do we have? How many would we have if we had to store the entire joint?

Bayesian Network: Example. With these probabilities (and the assumptions encoded in the graph: P(E), P(B), P(R | E), P(A | E, B), P(M | A), P(J | A)) we can compute the probability of any event over these variables:
P(E, B, A, R, M, J) = P(E) P(B, A, R, M, J | E)
= P(E) P(B) P(A, R, M, J | E, B)
= P(E) P(B) P(R | E, B) P(M, J, A | E, B)
= P(E) P(B) P(R | E) P(M, J | A, E, B) P(A | E, B)
= P(E) P(B) P(R | E) P(A | E, B) P(M | A) P(J | A)
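
To make the factorization concrete, here is a small sketch that plugs hypothetical CPT values into the alarm network (all numbers are assumptions, not part of the lecture) and evaluates the joint for one full assignment, plus a marginal by brute-force enumeration:

```python
import itertools

# Hypothetical CPTs for the alarm network (numbers are illustrative, not from the lecture).
P_E = {True: 0.002, False: 0.998}
P_B = {True: 0.001, False: 0.999}
P_R_given_E = {True: 0.9, False: 0.0001}          # P(R=true | E)
P_A_given_EB = {(True, True): 0.95, (True, False): 0.29,
                (False, True): 0.94, (False, False): 0.001}   # P(A=true | E, B)
P_M_given_A = {True: 0.7, False: 0.01}            # P(M=true | A)
P_J_given_A = {True: 0.9, False: 0.05}            # P(J=true | A)

def joint(E, B, A, R, M, J):
    """P(E,B,A,R,M,J) = P(E) P(B) P(R|E) P(A|E,B) P(M|A) P(J|A)."""
    def pick(p_true, val):
        return p_true if val else 1 - p_true
    return (P_E[E] * P_B[B]
            * pick(P_R_given_E[E], R)
            * pick(P_A_given_EB[(E, B)], A)
            * pick(P_M_given_A[A], M)
            * pick(P_J_given_A[A], J))

# Probability of one full assignment.
print(joint(E=False, B=True, A=True, R=False, M=True, J=True))

# A marginal, e.g. P(M=true, J=true), by summing over the other variables.
p_mj = sum(joint(E, B, A, R, True, True)
           for E, B, A, R in itertools.product((True, False), repeat=4))
print(p_mj)
```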

Computational Problems. Learning the structure of the Bayes net (what would be the guiding principle?). Learning the parameters: supervised? unsupervised? Inference, i.e., computing the probability of an event [#P-complete; Roth '93, '96]: given structure and parameters, and given an observation E, what is the probability of Y? P(Y=y | E=e) (E, Y are sets of instantiated variables). Most likely explanation (Maximum A Posteriori assignment, MAP, MPE) [NP-hard; Shimony '94]: given structure and parameters, and given an observation E, what is the most likely assignment to Y? argmax_y P(Y=y | E=e) (E, Y are sets of instantiated variables).

Inference. Inference in Bayesian Networks is generally intractable in the worst case. Two broad approaches for inference: exact inference (e.g., Variable Elimination) and approximate inference (e.g., Gibbs sampling).

Tree Dependent Distributions. A directed acyclic graph in which each node has at most one parent. Independence assumption: x is independent of its non-descendants given its parents (x is independent of other nodes given z; v is independent of w given u). P(y, x_1, x_2, ..., x_n) = P(y) ∏_i P(x_i | Parents(x_i)). We need to know two numbers for each link, e.g. P(x | z), and a prior for the root, P(y).

Tree Dependent Distributions. This is a generalization of naïve Bayes. Inference problem: given the tree with all the associated probabilities, evaluate the probability of an event, p(x).
P(x=1) = P(x=1 | z=1) P(z=1) + P(x=1 | z=0) P(z=0)
Recursively, go up the tree:
P(z=1) = P(z=1 | y=1) P(y=1) + P(z=1 | y=0) P(y=0)
P(z=0) = P(z=0 | y=1) P(y=1) + P(z=0 | y=0) P(y=0)
Now we have everything in terms of the CPTs (conditional probability tables). This is a linear time algorithm.
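
The "go up the tree" recursion can be written directly: compute the root's marginal, then push marginals down each edge with P(child=1) = P(child=1 | z=1) P(z=1) + P(child=1 | z=0) P(z=0). A sketch over an assumed chain y → z → x with made-up CPTs:

```python
# A small tree y -> z -> x, with assumed CPTs (illustrative numbers).
p_root = {"y": 0.4}                       # P(y=1)
parent = {"z": "y", "x": "z"}             # each node's single parent
cpt = {                                   # cpt[node][parent_value] = P(node=1 | parent)
    "z": {0: 0.3, 1: 0.8},
    "x": {0: 0.1, 1: 0.6},
}

def marginal(node):
    """P(node = 1), computed by recursing up toward the root."""
    if node in p_root:
        return p_root[node]
    p_parent1 = marginal(parent[node])
    # P(x=1) = P(x=1|z=1) P(z=1) + P(x=1|z=0) P(z=0)
    return cpt[node][1] * p_parent1 + cpt[node][0] * (1 - p_parent1)

print(marginal("x"))   # P(x=1)
```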

Tree Dependent Distributions. Inference problem: given the tree with all the associated probabilities, evaluate the probability of an event p(x, y).
P(x=1, y=0) = P(x=1 | y=0) P(y=0)
Recursively, go up the tree along the path from x to y:
P(x=1 | y=0) = Σ_{z=0,1} P(x=1 | y=0, z) P(z | y=0) = Σ_{z=0,1} P(x=1 | z) P(z | y=0)
Now we have everything in terms of the CPTs (conditional probability tables).

Tree Dependent Distributions. Inference problem: given the tree with all the associated probabilities, evaluate the probability of an event p(x, u) when there is no direct path from x to u.
P(x=1, u=0) = P(x=1 | u=0) P(u=0)
Let y be a parent of x and u (we always have one):
P(x=1 | u=0) = Σ_{y=0,1} P(x=1 | u=0, y) P(y | u=0) = Σ_{y=0,1} P(x=1 | y) P(y | u=0)
Now we have reduced it to cases we have seen.

Tree Dependent Distributions. Inference problem: given the tree with all the associated CPTs, we showed that we can evaluate the probability of all events efficiently. There are more efficient algorithms; the idea here was to show that inference in this case is a simple application of Bayes rule and probability theory. P(y, x_1, x_2, ..., x_n) = P(y) ∏_i P(x_i | Parents(x_i)).

Graphical Models of Probability Distributions. For general Bayesian Networks, the learning problem is hard, and the inference problem (given the network, evaluate the probability of a given event) is hard (#P-complete). In general a node may have several parents, e.g. P(x_i | z_1, z_2, z_3), and P(y, x_1, x_2, ..., x_n) = P(y) ∏_i P(x_i | Parents(x_i)).

Variable Elimination. Suppose the query is P(X_1). Key intuition: move irrelevant terms outside the summation and cache intermediate results.

Variable Elimination: Example 1. The chain A → B → C; we want to compute P(C).
P(C) = Σ_A Σ_B P(A, B, C) = Σ_A Σ_B P(A) P(B | A) P(C | B)
= Σ_B P(C | B) Σ_A P(A) P(B | A)
Let's call Σ_A P(A) P(B | A) = f_A(B):
= Σ_B P(C | B) f_A(B)
A has been (instantiated and) eliminated. What have we saved with this procedure? How many multiplications and additions did we perform?
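
A minimal numerical sketch of this elimination on the chain A → B → C, with assumed CPTs; f_A(B) is computed once and reused, and the result is checked against brute-force enumeration:

```python
# Assumed CPTs for the chain A -> B -> C (binary variables, illustrative numbers).
P_A = {0: 0.6, 1: 0.4}
P_B_given_A = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}   # P_B_given_A[a][b]
P_C_given_B = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # P_C_given_B[b][c]

# Eliminate A: f_A(b) = sum_a P(a) P(b|a)
f_A = {b: sum(P_A[a] * P_B_given_A[a][b] for a in (0, 1)) for b in (0, 1)}

# P(C=c) = sum_b P(c|b) f_A(b)
P_C = {c: sum(P_C_given_B[b][c] * f_A[b] for b in (0, 1)) for c in (0, 1)}
print(P_C)

# Brute force for comparison: sum_a sum_b P(a) P(b|a) P(c|b)
brute = {c: sum(P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]
                for a in (0, 1) for b in (0, 1)) for c in (0, 1)}
print(brute)   # matches P_C
```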

Variable Elimination. VE is a sequential procedure. Given an ordering of variables to eliminate, for each variable v that is not in the query, replace it with a new function f_v; that is, marginalize v out. The actual computation depends on the order. What is the domain and range of f_v? It need not be a probability distribution.

Variable Elimination: Example 2. The alarm network, with CPTs P(E), P(B), P(R | E), P(A | E, B), P(M | A), P(J | A). What is P(M, J | B)?
P(E, B, A, R, M, J) = P(E | B, A, R, M, J) P(B, A, R, M, J) = P(E) P(B) P(R | E) P(A | E, B) P(M | A) P(J | A)

Variable Elimination: Example 2. P(E, B, A, R, M, J) = P(E) P(B) P(R | E) P(A | E, B) P(M | A) P(J | A) (these are the assumptions: the graph and the joint representation).
P(M, J | B=true) = P(M, J, B=true) / P(B=true); it is sufficient to compute the numerator and normalize.
P(M, J, B=true) = Σ_{E,A,R} P(E, B=true, A, R, M, J) = Σ_{E,A,R} P(E) P(B=true) P(R | E) P(A | E, B=true) P(M | A) P(J | A)
Elimination order R, A, E. To eliminate R: f_R(E) = Σ_R P(R | E).

Variable Elimination: Example 2 (continued). After eliminating R:
P(M, J, B=true) = Σ_{E,A} P(E) P(B=true) P(A | E, B=true) P(M | A) P(J | A) f_R(E), where f_R(E) = Σ_R P(R | E).
Remaining elimination order A, E. To eliminate A: f_A(E, M, J) = Σ_A P(A | E, B=true) P(M | A) P(J | A).

Variable Elimination: Example 2 (continued). After eliminating A:
P(M, J, B=true) = Σ_E P(E) P(B=true) f_A(E, M, J) f_R(E), where f_R(E) = Σ_R P(R | E) and f_A(E, M, J) = Σ_A P(A | E, B=true) P(M | A) P(J | A).
The functions f_R and f_A are called factors. Finally, eliminate E.
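
The same elimination order (R, then A, then E) can be carried out numerically. The sketch below uses assumed CPT values; note that f_R(E) = Σ_R P(R | E) is identically 1, which is why eliminating R, a variable outside the query and evidence with no observed descendants, costs nothing:

```python
# Assumed CPTs for the alarm network (illustrative numbers only, not from the lecture).
P_E = {1: 0.002, 0: 0.998}
P_B = {1: 0.001, 0: 0.999}

def P_R_E(r, e):            # P(R=r | E=e)
    p1 = 0.9 if e else 0.0001
    return p1 if r else 1 - p1

def P_A_EB(a, e, b):        # P(A=a | E=e, B=b)
    p1 = {(1, 1): 0.95, (1, 0): 0.29, (0, 1): 0.94, (0, 0): 0.001}[(e, b)]
    return p1 if a else 1 - p1

def P_M_A(m, a):            # P(M=m | A=a)
    p1 = 0.7 if a else 0.01
    return p1 if m else 1 - p1

def P_J_A(j, a):            # P(J=j | A=a)
    p1 = 0.9 if a else 0.05
    return p1 if j else 1 - p1

b = 1                                                           # evidence: B = true
f_R = {e: sum(P_R_E(r, e) for r in (0, 1)) for e in (0, 1)}     # equals 1 for every e

def f_A(e, m, j):                                               # eliminate A
    return sum(P_A_EB(a, e, b) * P_M_A(m, a) * P_J_A(j, a) for a in (0, 1))

def numerator(m, j):                                            # P(M=m, J=j, B=true), eliminating E last
    return sum(P_E[e] * P_B[b] * f_A(e, m, j) * f_R[e] for e in (0, 1))

Z = sum(numerator(m, j) for m in (0, 1) for j in (0, 1))        # = P(B=true)
print(numerator(1, 1) / Z)                                      # P(M=true, J=true | B=true)
```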

Variable Elimination. The order in which variables are eliminated matters. In the previous example, what would happen if we eliminated E first? The size of the factors would be larger. Complexity of Variable Elimination: exponential in the size of the factors. What about the worst case? The worst case is intractable.

Inference. Exact inference in Bayesian Networks is #P-hard: we can count the number of satisfying assignments for 3-SAT with a Bayesian Network. Approximate inference: e.g., Gibbs sampling.

Approximate Inference. Basic idea: if we had access to a set of examples from the joint distribution, we could just count: E[f(x)] ≈ (1/N) Σ_{i=1}^{N} f(x^(i)). For inference, we generate instances from the joint and count. How do we generate instances?

Generating Instances. Sampling from the Bayesian Network. For conditional probabilities, that is, P(X | E), only generate instances that are consistent with E. Problems? How many samples? [Law of large numbers.] What if the evidence E is a very low probability event?

Detour: Markov Chain Review. A Markov chain over states A, B, C generates a sequence of A, B, C. It is defined by initial and transition probabilities, P(X_0) and P(X_{t+1} = i | X_t = j); P_ij is the time-independent transition probability matrix. Stationary distributions: a vector q is called a stationary distribution if q_j = Σ_i q_i P_ij (q_i is the probability of being in state i). If we sample from the Markov Chain repeatedly, the distribution over the states converges to the stationary distribution.
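
A quick check of the stationary-distribution definition: start from an arbitrary distribution and repeatedly apply the transition matrix; the result converges to a q with q_j = Σ_i q_i P_ij. The 3x3 matrix below is an assumption (the figure's exact numbers are not recoverable from the transcript):

```python
import numpy as np

# Assumed transition matrix over states A, B, C; row i gives P(X_{t+1} = j | X_t = i).
P = np.array([[0.1, 0.6, 0.3],
              [0.3, 0.1, 0.6],
              [0.5, 0.4, 0.1]])

q = np.array([1.0, 0.0, 0.0])     # arbitrary initial distribution
for _ in range(1000):             # repeated application converges to the stationary q
    q = q @ P

print(q)                          # satisfies q ≈ q P, i.e. q_j = sum_i q_i P_ij
print(np.allclose(q, q @ P))
```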

Markov Chain Monte Carlo. Our goal: to sample from P(X | e). Overall idea: the next sample is a function of the current sample, and the samples can be thought of as coming from a Markov Chain whose stationary distribution is the distribution we want. This can approximate any distribution.

Gibbs Sampling. The simplest MCMC method to sample from P(X = x_1 x_2 ... x_n | e). It creates a Markov Chain of samples as follows: initialize X randomly; at each time step, fix all random variables except one, and sample that random variable from the corresponding conditional distribution.

Gibbs Sampling. Algorithm: initialize X randomly. Iterate: pick a variable X_i uniformly at random; sample x_i^(t+1) from P(x_i | x_1^(t), ..., x_{i-1}^(t), x_{i+1}^(t), ..., x_n^(t), e); set X_k^(t+1) = x_k^(t) for all other k. This is the next sample. X^(1), X^(2), ..., X^(t) forms a Markov Chain. Why is Gibbs Sampling easy for Bayes Nets? P(x_i | x_{-i}^(t), e) is local.
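
A sketch of why the update is local: P(x_i | x_{-i}, e) is proportional to the product of the CPTs that mention x_i, i.e. its Markov blanket. The example network (the chain A → B → C) and all CPT numbers below are assumptions; the sampler estimates P(C=1 | A=1) by counting after burn-in:

```python
import random

random.seed(0)

# Assumed CPTs for the chain A -> B -> C (illustrative numbers).
P_A1 = 0.4
P_B1_given_A = {0: 0.3, 1: 0.8}
P_C1_given_B = {0: 0.1, 1: 0.6}

def bernoulli(p):
    return 1 if random.random() < p else 0

def resample_B(a, c):
    # P(B | A=a, C=c) is proportional to P(B|A=a) * P(C=c|B): the factors touching B.
    w1 = P_B1_given_A[a] * (P_C1_given_B[1] if c else 1 - P_C1_given_B[1])
    w0 = (1 - P_B1_given_A[a]) * (P_C1_given_B[0] if c else 1 - P_C1_given_B[0])
    return bernoulli(w1 / (w1 + w0))

def resample_C(b):
    return bernoulli(P_C1_given_B[b])          # C's Markov blanket is just its parent B

# Estimate P(C=1 | A=1): clamp the evidence A=1 and resample the rest.
a, b, c = 1, 0, 0
burn_in, samples, count = 1000, 20000, 0
for t in range(burn_in + samples):
    if random.random() < 0.5:
        b = resample_B(a, c)
    else:
        c = resample_C(b)
    if t >= burn_in:
        count += c
print(count / samples)   # compare with the exact value 0.8*0.6 + 0.2*0.1 = 0.50
```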

Gibbs Sampling: Big picture. Given some conditional distribution we wish to compute, collect samples from the Markov Chain. Typically, the chain is allowed to run for some time before collecting samples (the burn-in period), so that the chain settles into the stationary distribution. Using the samples, we approximate the posterior by counting.

Gibbs Sampling Example 1. The network over A, B, C; we want to compute P(C). Suppose, after burn-in, the Markov Chain is at A=true, B=false, C=false. 1. Pick a variable, say B. 2. Draw the new value of B from P(B | A=true, C=false) = P(B | A=true); suppose B_new = true. 3. Our new sample is A=true, B=true, C=false. 4. Repeat.

Gibbs Sampling Example 2. The alarm network, with CPTs P(E), P(B), P(R | E), P(A | E, B), P(M | A), P(J | A). Exercise: P(M, J | B)?

Example: Hidden Markov Model. A chain of hidden states Y_1 → Y_2 → ... → Y_6 (transition probabilities), with each Y_i emitting an observation X_i (emission probabilities). This is a Bayesian Network with a specific structure. The Xs are called the observations and the Ys the hidden states. Useful for sequence tagging tasks: part of speech tagging, modeling temporal structure, speech recognition, etc.

HMM: Computational Problems. Probability of an observation given an HMM, P(X | parameters): dynamic programming. Finding the best hidden states for a given sequence, P(Y | X, parameters): dynamic programming. Learning the parameters from observations: EM.

Gibbs Sampling for HMM. Goal: computing P(y | x). Initialize the Ys randomly. Iterate: pick a random Y_i and draw Y_i^(t) from P(Y_i | Y_{i-1}, Y_{i+1}, X_i); only these variables are needed because they form the Markov blanket of Y_i. Compute the probability using counts after the burn-in period. Gibbs sampling allows us to introduce priors on the emission and transition probabilities.
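
A sketch of this sampler for a toy two-state HMM with assumed transition and emission tables: each Y_i is redrawn from a distribution proportional to P(Y_i | Y_{i-1}) P(Y_{i+1} | Y_i) P(X_i | Y_i), exactly its Markov blanket:

```python
import random

random.seed(1)

# Assumed two-state HMM (states 0/1, binary observations); numbers are illustrative.
pi = [0.5, 0.5]                           # P(Y_1)
T = [[0.8, 0.2], [0.3, 0.7]]              # T[y][y'] = P(Y_{i+1}=y' | Y_i=y)
E = [[0.9, 0.1], [0.2, 0.8]]              # E[y][x] = P(X_i=x | Y_i=y)

x = [0, 0, 1, 1, 1, 0]                    # observed sequence
n = len(x)
y = [random.randint(0, 1) for _ in range(n)]   # random initialization

def resample(i):
    """Draw Y_i from P(Y_i | Y_{i-1}, Y_{i+1}, X_i), up to normalization."""
    weights = []
    for s in (0, 1):
        w = E[s][x[i]]
        w *= T[y[i - 1]][s] if i > 0 else pi[s]
        if i + 1 < n:
            w *= T[s][y[i + 1]]
        weights.append(w)
    z = weights[0] + weights[1]
    return 1 if random.random() < weights[1] / z else 0

burn_in, sweeps = 500, 5000
counts = [0] * n
for t in range(burn_in + sweeps):
    i = random.randrange(n)
    y[i] = resample(i)
    if t >= burn_in:
        for j in range(n):
            counts[j] += y[j]

print([c / sweeps for c in counts])       # estimated P(Y_j = 1 | x) for each position
```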

Bayesian Networks: Summary. Bayesian Networks are a compact representation of probability distributions. They are universal: they can represent all distributions, though in the worst case every random variable will be connected to all others. Inference: inference is hard in the worst case; exact inference is #P-hard and approximate inference is NP-hard [Roth '93, '96]; inference for trees is efficient; general exact inference: Variable Elimination. Learning?

Tree Dependent Distributions. Learning problem: given data (n-tuples) assumed to be sampled from a tree-dependent distribution (what does that mean?), find the tree representation of the distribution, a generative model (what does that mean?). P(y, x_1, x_2, ..., x_n) = P(y) ∏_i P(x_i | Parents(x_i)). Among all trees, find the most likely one given the data: P(T | D) = P(D | T) P(T) / P(D).

Tree Dependent Distributions. Learning problem: given data (n-tuples) assumed to be sampled from a tree-dependent distribution, find the tree representation of the distribution. Assuming a uniform prior on trees, the Maximum Likelihood approach is to maximize P(D | T): T_ML = argmax_T P(D | T) = argmax_T ∏_{x ∈ D} P_T(x_1, x_2, ..., x_n). Now we can see why we had to solve the inference problem first; it is required for learning.

Tree Dependent Distributions. Continuing: T_ML = argmax_T P(D | T) = argmax_T ∏_{x ∈ D} P_T(x_1, x_2, ..., x_n) = argmax_T ∏_{x ∈ D} ∏_i P_T(x_i | Parents(x_i)). Try this for naïve Bayes.

Example: Learning Distributions.
Probability Distribution 1, a table over x_1 x_2 x_3 x_4: 0000: 0.1, 0001: 0.1, 0010: 0.1, 0011: 0.1, 0100: 0.1, 0101: 0.1, 0110: 0.1, 0111: 0.1, 1000: 0, 1001: 0, 1010: 0, 1011: 0, 1100: 0.05, 1101: 0.05, 1110: 0.05, 1111: 0.05.
Probability Distribution 2: a tree with root X_4 (prior P(x_4)) and children X_1, X_2, X_3 (CPTs P(x_1 | x_4), P(x_2 | x_4), P(x_3 | x_4)).
Probability Distribution 3: a tree with root X_4 (prior P(x_4)), children X_1 and X_2 (CPTs P(x_1 | x_4), P(x_2 | x_4)), and X_3 a child of X_2 (CPT P(x_3 | x_2)).
Are these representations of the same distribution? Given a sample, which of these generated it?

Example: Learning Distributions. We are given 3 data points: 1011; 1001; 0100. Which of the three distributions above (the table, tree 2, or tree 3) is the target distribution?

Example: Learning Distributions. We are given 3 data points: 1011; 1001; 0100. What is the likelihood that the table (Distribution 1) generated the data? P(T | D) = P(D | T) P(T) / P(D), and Likelihood(T) ≈ P(D | T) ≈ P(1011 | T) P(1001 | T) P(0100 | T). Here P(1011 | T) = 0, P(1001 | T) = 0.1, P(0100 | T) = 0.1, so P(Data | Table) = 0.

Example: Learning Distributions. What is the likelihood that the data was sampled from Distribution 2? We need to define it: P(x_4=1) = 1/2; p(x_1=1 | x_4=0) = 1/2, p(x_1=1 | x_4=1) = 1/2; p(x_2=1 | x_4=0) = 1/3, p(x_2=1 | x_4=1) = 1/3; p(x_3=1 | x_4=0) = 1/6, p(x_3=1 | x_4=1) = 5/6.
Likelihood(T) ≈ P(D | T) ≈ P(1011 | T) P(1001 | T) P(0100 | T).
P(1011 | T) = p(x_4=1) p(x_1=1 | x_4=1) p(x_2=0 | x_4=1) p(x_3=1 | x_4=1) = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(1001 | T) = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(0100 | T) = 1/2 · 1/2 · 2/3 · 5/6 = 10/72
P(Data | Tree) = (10/72)^3 = 125/(4^3 · 3^6)

Example: Learning Distributions. What is the likelihood that the data was sampled from Distribution 3? We need to define it: P(x_4=1) = 2/3; p(x_1=1 | x_4=0) = 1/3, p(x_1=1 | x_4=1) = 1; p(x_2=1 | x_4=0) = 1, p(x_2=1 | x_4=1) = 1/2; p(x_3=1 | x_2=0) = 2/3, p(x_3=1 | x_2=1) = 1/6.
Likelihood(T) ≈ P(D | T) ≈ P(1011 | T) P(1001 | T) P(0100 | T).
P(1011 | T) = p(x_4=1) p(x_1=1 | x_4=1) p(x_2=0 | x_4=1) p(x_3=1 | x_2=0) = 2/3 · 1 · 1/2 · 2/3 = 2/9
P(1001 | T) = 1/2 · 1/2 · 2/3 · 1/6 = 1/36
P(0100 | T) = 1/2 · 1/2 · 1/3 · 5/6 = 5/72
P(Data | Tree) = 2/9 · 1/36 · 5/72 = 10/23328
Distribution 2 is the most likely distribution to have produced the data.

Example: Summary. We are now in the same situation we were in when we decided which of two coins, fair (0.5, 0.5) or biased (0.7, 0.3), generated the data. But this isn't the most interesting case. In general, we will not have a small number of possible distributions to choose from, but rather a parameterized family of distributions (analogous to a coin with p ∈ [0,1]). We need a systematic way to search this family of distributions.

Example: Summary. First, let's make sure we understand what we are after. We have 3 data points that have been generated according to our target distribution: 1011; 1001; 0100. What is the target distribution? We cannot find THE target distribution. What is our goal? As before, we are interested in generalization. Given the data (e.g., the above 3 data points), we would like to know P(1111) or P(11**), P(***0), etc. We could compute it directly from the data, but... Assumptions about the distribution are crucial here.

Learning Tree Dependent Distributions. Learning problem: 1. Given data (n-tuples) assumed to be sampled from a tree-dependent distribution, find the most probable tree representation of the distribution. 2. Given data (n-tuples), find the tree representation that best approximates the distribution (without assuming that the data is sampled from a tree-dependent distribution). In the second case, the target distribution lies in the space of all distributions, and within the space of all tree distributions we look for the tree closest to the target.

Learning Tree Dependent Distributions. The simple-minded algorithm for learning a tree-dependent distribution requires: (1) for each tree, compute its likelihood L(T) = P(D | T) = ∏_{x ∈ D} P_T(x_1, x_2, ..., x_n) = ∏_{x ∈ D} ∏_i P_T(x_i | Parents(x_i)); (2) find the maximal one.

1. Distance Measure. To measure how well a probability distribution P is approximated by a probability distribution T we use here the Kullback-Leibler cross-entropy measure (KL divergence): D(P, T) = Σ_x P(x) log [P(x) / T(x)]. It is non-negative, D(P, T) = 0 iff P and T are identical, and non-symmetric; it measures how much P differs from T.

2. Ranking Dependencies. Intuitively, the important edges to keep in the tree are edges (x---y) for x, y which depend on each other. Given that the distance between the distributions is measured using the KL divergence, the corresponding measure of dependence is the mutual information between x and y (measuring the information x gives about y): I(x, y) = Σ_{x,y} P(x, y) log [P(x, y) / (P(x) P(y))], which we can estimate with respect to the empirical distribution (that is, the given data).
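
The estimate is just counting. A small sketch that computes the empirical mutual information between two columns of a dataset (the data matrix here is a placeholder):

```python
import math
from collections import Counter

def empirical_mi(data, i, j):
    """I(x_i, x_j) estimated from the empirical distribution of a list of tuples."""
    n = len(data)
    p_ij = Counter((row[i], row[j]) for row in data)
    p_i = Counter(row[i] for row in data)
    p_j = Counter(row[j] for row in data)
    mi = 0.0
    for (a, b), c in p_ij.items():
        pab = c / n
        mi += pab * math.log(pab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

# Placeholder data: each row is one sample (x_0, x_1, x_2).
data = [(1, 1, 0), (1, 1, 1), (0, 0, 1), (0, 0, 0), (1, 0, 1)]
print(empirical_mi(data, 0, 1), empirical_mi(data, 0, 2))
```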

Learning Tree Dependent Distributions. The algorithm is given m independent measurements from P. For each variable x, estimate P(x) (binary variables: n numbers). For each pair of variables x, y, estimate P(x, y) (O(n^2) numbers). For each pair of variables compute the mutual information. Build a complete undirected graph with all the variables as vertices, and let I(x, y) be the weight of the edge (x, y). Build a maximum weighted spanning tree.

Spanning Tree. Goal: find a subset of the edges that forms a tree that includes every vertex, where the total weight of all the edges in the tree is maximized. Sort the weights. Start greedily with the largest one. Add the next largest as long as it does not create a loop; in case of a loop, discard this weight and move on to the next weight. This algorithm will create a tree; it is a spanning tree: it touches all the vertices. It is not hard to see that this is the maximum weighted spanning tree. The complexity is O(n^2 log n).

Learning Tree Dependent Distributions. The algorithm is given m independent measurements from P. (1) For each variable x, estimate P(x) (binary variables: n numbers); for each pair of variables x, y, estimate P(x, y) (O(n^2) numbers); for each pair of variables compute the mutual information. (2) Build a complete undirected graph with all the variables as vertices, let I(x, y) be the weight of the edge (x, y), and build a maximum weighted spanning tree. (3) Transform the resulting undirected tree to a directed tree: choose a root variable and set the direction of all the edges away from it; place the corresponding conditional probabilities on the edges.
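
Putting the three steps together, a compact sketch of the procedure under the stated assumptions (binary data given as a list of tuples; the empirical mutual information as in the sketch above; a greedy maximum spanning tree; then a BFS from an arbitrary root to orient the edges). The sample data is a placeholder:

```python
import math
from collections import Counter, defaultdict, deque

def empirical_mi(data, i, j):
    n = len(data)
    p_ij = Counter((r[i], r[j]) for r in data)
    p_i, p_j = Counter(r[i] for r in data), Counter(r[j] for r in data)
    return sum((c / n) * math.log((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
               for (a, b), c in p_ij.items())

def chow_liu(data):
    n_vars = len(data[0])
    # 1. Mutual information for every pair: these are the edge weights.
    edges = sorted(((empirical_mi(data, i, j), i, j)
                    for i in range(n_vars) for j in range(i + 1, n_vars)), reverse=True)
    # 2. Maximum weighted spanning tree (greedy, skipping edges that close a loop).
    comp = list(range(n_vars))                      # union-find over components
    def find(v):
        while comp[v] != v:
            v = comp[v]
        return v
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            comp[ri] = rj
            tree.append((i, j))
    # 3. Direct the edges away from an arbitrary root (here variable 0).
    adj = defaultdict(list)
    for i, j in tree:
        adj[i].append(j)
        adj[j].append(i)
    parent, seen, queue = {0: None}, {0}, deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                parent[v] = u
                queue.append(v)
    return parent          # parent[v] = parent of v in the learned tree (None for the root)

# Placeholder samples (x_0, x_1, x_2).
data = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
print(chow_liu(data))
```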

Correctness (1). Place the corresponding conditional probabilities on the edges: given a tree t, defining the probability distribution T by forcing the conditional probabilities along the edges to coincide with those computed from a sample taken from P gives the best tree-dependent approximation to P. Let T be the tree-dependent distribution according to the fixed tree t. Recall: T(x) = ∏_i T(x_i | Parent(x_i)) = ∏_i P(x_i | π(x_i)), and D(P, T) = Σ_x P(x) log [P(x) / T(x)].

Correctness (1).
D(P, T) = Σ_x P(x) log [P(x) / T(x)]
= Σ_x P(x) log P(x) - Σ_x P(x) log T(x)
= -H(x) - Σ_x P(x) Σ_{i=1,n} log T(x_i | π(x_i))
(with a slight abuse of notation at the root). When is the second term maximized? That is, how should we define T(x_i | π(x_i))?

Correctness (1).
D(P, T) = Σ_x P(x) log [P(x) / T(x)] = Σ_x P(x) log P(x) - Σ_x P(x) log T(x)
= -H(x) - Σ_x P(x) Σ_{i=1,n} log T(x_i | π(x_i))
= -H(x) - Σ_{i=1,n} Σ_x P(x) log T(x_i | π(x_i))
= -H(x) - Σ_{i=1,n} E_P[log T(x_i | π(x_i))]        (definition of expectation)
= -H(x) - Σ_{i=1,n} Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log T(x_i | π(x_i))
Each term Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log T(x_i | π(x_i)) takes its maximal value when we set T(x_i | π(x_i)) = P(x_i | π(x_i)).

Correctness (2). Let I(x, y) be the weight of the edge (x, y). Maximizing the sum of the information gains minimizes the distributional distance. We showed that
D(P, T) = -H(x) - Σ_{i=1,n} Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log P(x_i | π(x_i)).
However,
Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log P(x_i | π(x_i)) = Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log [P(x_i, π(x_i)) / (P(x_i) P(π(x_i)))] + Σ_{x_i, π(x_i)} P(x_i, π(x_i)) log P(x_i)
= I(x_i, π(x_i)) + Σ_{x_i} P(x_i) log P(x_i).
This gives: D(P, T) = -H(x) - Σ_{1,n} I(x_i, π(x_i)) - Σ_{1,n} Σ_{x_i} P(x_i) log P(x_i). The 1st and 3rd terms do not depend on the tree structure. Since the distance is non-negative, minimizing it is equivalent to maximizing the sum of the edge weights I(x, y).

Correctness (2). Let I(x, y) be the weight of the edge (x, y). Maximizing the sum of the information gains minimizes the distributional distance: we showed that T is the best tree approximation of P if it is chosen to maximize the sum of the edge weights, since D(P, T) = -H(x) - Σ_{1,n} I(x_i, π(x_i)) - Σ_{1,n} Σ_{x_i} P(x_i) log P(x_i). The minimization problem is solved without the need to exhaustively consider all possible trees. This was achieved since we transformed the problem of finding the best tree to that of finding the heaviest one, with mutual information on the edges.

Correctness (3). Transform the resulting undirected tree to a directed tree (choose a root variable and direct all the edges away from it). What does it mean that you get the same distribution regardless of the chosen root? (Exercise.) This algorithm learns the best tree-dependent approximation of a distribution D, with L(T) = P(D | T) = ∏_{x ∈ D} ∏_i P_T(x_i | Parent(x_i)). Given data, this algorithm finds the tree that maximizes the likelihood of the data. The algorithm is called the Chow-Liu algorithm, suggested in 1968 in the context of data compression, and adapted by Pearl to Bayesian Networks. It has been invented a couple more times, and generalized since then.

Example: Learning Tree Dependent Distributions. We have 3 data points, over variables (A, B, C, D), that have been generated according to the target distribution: 1011; 1001; 0100. We need to estimate some parameters: P(A=1) = 2/3, P(B=1) = 1/3, P(C=1) = 1/3, P(D=1) = 2/3. With I(x, y) = Σ_{x,y} P(x, y) log [P(x, y) / (P(x) P(y))], and listing joint values for (0,0), (0,1), (1,0), (1,1) respectively:
P(A,B) = 0; 1/3; 2/3; 0.  P(A,B)/P(A)P(B) = 0; 3; 3/2; 0.  I(A,B) ~ 9/2
P(A,C) = 1/3; 0; 1/3; 1/3.  P(A,C)/P(A)P(C) = 3/2; 0; 3/4; 3/2.  I(A,C) ~ 15/4
P(A,D) = 1/3; 0; 0; 2/3.  P(A,D)/P(A)P(D) = 3; 0; 0; 3/2.  I(A,D) ~ 9/2
P(B,C) = 1/3; 1/3; 1/3; 0.  P(B,C)/P(B)P(C) = 3/4; 3/2; 3/2; 0.  I(B,C) ~ 15/4
P(B,D) = 0; 2/3; 1/3; 0.  P(B,D)/P(B)P(D) = 0; 3; 3/2; 0.  I(B,D) ~ 9/2
P(C,D) = 1/3; 1/3; 0; 1/3.  P(C,D)/P(C)P(D) = 3/2; 3/4; 0; 3/2.  I(C,D) ~ 15/4
Generate the tree; place the probabilities.
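
The "~ 9/2" and "~ 15/4" values above appear to sum the ratios P(x,y)/(P(x)P(y)); computing the actual empirical mutual information gives the same ranking of edges, as this short sketch confirms:

```python
import math
from itertools import combinations

data = [(1, 0, 1, 1), (1, 0, 0, 1), (0, 1, 0, 0)]   # 1011, 1001, 0100 as (A, B, C, D)
names = "ABCD"
n = len(data)

def prob(pred):
    """Empirical probability of a predicate over the three data points."""
    return sum(1 for row in data if pred(row)) / n

for i, j in combinations(range(4), 2):
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = prob(lambda r: r[i] == a and r[j] == b)
            if p_ab > 0:
                p_a = prob(lambda r: r[i] == a)
                p_b = prob(lambda r: r[j] == b)
                mi += p_ab * math.log(p_ab / (p_a * p_b))
    print(names[i] + names[j], round(mi, 3))

# AB, AD, BD come out highest (about 0.64 nats); AC, BC, CD lower (about 0.17 nats),
# matching the ranking implied by the 9/2 vs. 15/4 values above.
```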

Learning Tree Dependent Distributions. The Chow-Liu algorithm finds the tree that maximizes the likelihood. In particular, if D is a tree-dependent distribution, this algorithm learns D (what does that mean?). Less is known on how many examples are needed in order for it to converge (what does that mean?). Notice that we are taking statistics to estimate the probabilities of some events in order to generate the tree; then, we intend to use it to evaluate the probability of other events. One may ask: why do we need this structure? Why can't we answer the query directly from the data? (Almost like making predictions directly from the data in the badges problem.)