CS839: Probabilistic Graphical Models. Lecture 7: Learning Fully Observed BNs. Theo Rekatsinas


Exponential family: a basic building block. For a numeric random variable $X$, a distribution of the form
$$p(x \mid \eta) = h(x)\exp\{\eta^\top T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\, h(x)\exp(\eta^\top T(x))$$
is an exponential family distribution with natural (canonical) parameter $\eta$. The function $T(x)$ is a sufficient statistic, and $A(\eta) = \log Z(\eta)$ is the log normalizer. Examples: Bernoulli, multinomial, Gaussian, Poisson, Gamma, categorical.
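As a concrete check (a standard derivation, not taken from the slides), the Bernoulli distribution can be rewritten in exactly this form:

```latex
% Bernoulli(p) rewritten in exponential family form (standard result, for illustration)
P(x) = p^{x}(1-p)^{1-x}
     = \exp\!\Big\{ x \log\tfrac{p}{1-p} + \log(1-p) \Big\},
\qquad
\eta = \log\tfrac{p}{1-p},\quad T(x) = x,\quad
A(\eta) = \log\!\big(1 + e^{\eta}\big),\quad h(x) = 1.
```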

Why the exponential family? Moment generating property: we can easily compute the moments of any exponential family distribution by taking derivatives of the log normalizer $A(\eta)$.
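The property follows directly from the definition; a short derivation for scalar $\eta$ (standard, added here for completeness; for vector $\eta$ replace the derivatives by the gradient and Hessian):

```latex
% First two derivatives of the log normalizer give the mean and variance of T(x)
\frac{dA}{d\eta}
  = \frac{d}{d\eta}\log \!\int h(x)\, e^{\eta T(x)}\,dx
  = \frac{\int T(x)\, h(x)\, e^{\eta T(x)}\,dx}{\int h(x)\, e^{\eta T(x)}\,dx}
  = \mathbb{E}_{p(x\mid\eta)}[T(x)],
\qquad
\frac{d^{2}A}{d\eta^{2}} = \mathrm{Var}[T(x)].
```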

MLE for the exponential family. For iid data we write down the log-likelihood, take derivatives and set them to zero, and obtain a moment-matching condition, from which the canonical parameters can be inferred (see the sketch below).
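A sketch of that derivation in the notation of the previous slides ($\psi$ here denotes the mean mapping $\nabla_\eta A$; this shorthand is mine):

```latex
% Log-likelihood of iid data under an exponential family model
\ell(\eta; D) = \sum_{n=1}^{N}\log h(x_n) + \eta^\top \sum_{n=1}^{N} T(x_n) - N\,A(\eta)
% Setting the gradient to zero gives moment matching:
\nabla_\eta \ell = \sum_{n=1}^{N} T(x_n) - N\,\nabla_\eta A(\eta) = 0
\;\Longrightarrow\;
\hat{\mu}_{\mathrm{MLE}} = \frac{1}{N}\sum_{n=1}^{N} T(x_n),
\qquad
\hat{\eta}_{\mathrm{MLE}} = \psi^{-1}\big(\hat{\mu}_{\mathrm{MLE}}\big),
\quad \text{where } \psi(\eta) = \nabla_\eta A(\eta).
```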

Generalized Linear Models. The same graphical model covers linear regression and discriminative linear classification. In a generalized linear model, $\mathbb{E}_p[T] = \mu = f(\theta^\top x)$. The observed input $x$ is assumed to enter the model via a linear combination of its elements, $\xi = \theta^\top x$. The conditional mean $\mu$ is represented as a function $f(\xi)$ of $\xi$, where $f$ is known as the response function. The observed output $y$ is assumed to be characterized by an exponential family distribution with conditional mean $\mu$.
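Two familiar special cases, stated here for concreteness (standard examples, not from the slide): linear regression uses the identity response with Gaussian output, and logistic regression uses the logistic response with Bernoulli output.

```latex
% Linear regression: Gaussian output with the identity response function
y \mid x \;\sim\; \mathcal{N}(\mu, \sigma^2), \qquad \mu = f(\xi) = \xi = \theta^\top x
% Logistic regression: Bernoulli output with the logistic response function
y \mid x \;\sim\; \mathrm{Bernoulli}(\mu), \qquad \mu = f(\xi) = \frac{1}{1 + e^{-\xi}}, \quad \xi = \theta^\top x
```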

Learning in Graphical Models. Goal: given a set of independent samples (assignments to the random variables), find the best Bayesian network, i.e., both the DAG (structure learning, covered later in the class) and the CPDs (parameter learning). For example, from samples such as (B,E,A,C,R) = (T,F,F,T,F), (B,E,A,C,R) = (T,F,T,T,F), (B,E,A,C,R) = (F,T,T,T,F).

Parameter Estimation for Fully Observed GMs. The data: $D = (x_1, x_2, x_3, \ldots, x_N)$. Assume the graph $G$ is known and fixed, either from expert design or from structure learning. Goal: estimate the parameters $\theta$ from a dataset $D$ of $N$ independent, identically distributed (iid) training examples. Each training example is a vector of $M$ values, one per node (random variable). The model must be completely observed: no missing values, no hidden variables.

Density estimation: the construction of an estimate, based on observed data, of an unobservable underlying probability density function. Density estimators can be viewed as single-node graphical models; they are instances of exponential family distributions and the building blocks of general GMs. We consider MLE and Bayesian estimates.

Discrete Distributions. Bernoulli distribution: $P(x) = p^x (1-p)^{1-x}$. Multinomial distribution $\mathrm{Mult}(1, \theta)$ (a single die roll): $X = [X_1, X_2, X_3, X_4, X_5, X_6]$ with $X_j \in \{0, 1\}$ and $\sum_{j \in [1,\ldots,6]} X_j = 1$; $P(X_j = 1) = \theta_j$ with $\sum_{j \in [1,\ldots,6]} \theta_j = 1$.

Discrete Distributions. Multinomial distribution $\mathrm{Mult}(n, \theta)$: $n = [n_1, n_2, \ldots, n_K]$ where $\sum_j n_j = N$, and
$$p(n \mid \theta) = \frac{N!}{n_1!\, n_2! \cdots n_K!}\, \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_K^{n_K}.$$

Example: multinomial model. Data: we observed $N$ iid die rolls ($K$-sided), $D = \{5, 1, K, \ldots, 3\}$, encoded as $x_n = [x_{n,1}, x_{n,2}, \ldots, x_{n,K}]$ where $x_{n,k} \in \{0, 1\}$ and $\sum_k x_{n,k} = 1$. Model: $X_{n,k} = 1$ with probability $\theta_k$, where $\sum_{k=1}^{K} \theta_k = 1$. Likelihood of an observation: $P(x_n) = P(\{x_{n,k} = 1, \text{ where } k \text{ is the index of the } n\text{-th roll}\}) = \theta_k = \theta_1^{x_{n,1}} \theta_2^{x_{n,2}} \cdots \theta_K^{x_{n,K}} = \prod_k \theta_k^{x_{n,k}}$. Likelihood of $D$: $P(x_1, x_2, \ldots, x_N) = \prod_{n=1}^{N} P(x_n) = \prod_k \theta_k^{n_k}$, where $n_k = \sum_n x_{n,k}$ counts the rolls that landed on face $k$.

MLE: constrained optimization. Objective function: $\ell(\theta; D) = \log P(D \mid \theta) = \log \prod_k \theta_k^{n_k} = \sum_k n_k \log \theta_k$. We need to maximize this subject to the constraint $\sum_k \theta_k = 1$. Lagrange multipliers: $\tilde{\ell}(\theta; D) = \sum_k n_k \log \theta_k + \lambda\big(1 - \sum_k \theta_k\big)$. Derivatives: $\frac{\partial \tilde{\ell}}{\partial \theta_k} = \frac{n_k}{\theta_k} - \lambda = 0 \;\Rightarrow\; n_k = \lambda \theta_k \;\Rightarrow\; \sum_k n_k = N = \lambda \;\Rightarrow\; \hat{\theta}_{k,\mathrm{MLE}} = \frac{n_k}{N} = \frac{1}{N} \sum_n x_{n,k}$. Sufficient statistics?

Bayesian estimation. We need a prior over the parameters $\theta$: the Dirichlet distribution, $P(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$. Posterior of $\theta$: $P(\theta \mid x_1, \ldots, x_N) = \frac{p(x_1, \ldots, x_N \mid \theta)\, p(\theta)}{p(x_1, \ldots, x_N)} = C \prod_k \theta_k^{n_k} \prod_k \theta_k^{\alpha_k - 1} = C(\alpha) \prod_k \theta_k^{\alpha_k + n_k - 1}$, so the posterior is isomorphic to the prior (the Dirichlet is a conjugate prior). Posterior mean estimate: $\bar{\theta}_k = \int \theta_k\, p(\theta \mid D)\, d\theta = \frac{n_k + \alpha_k}{N + \sum_{k'} \alpha_{k'}}$.
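A minimal numerical sketch of the two estimators above; the toy die rolls, the variable names, and the uniform pseudocounts `alpha` are invented for illustration:

```python
import numpy as np

K = 6                                   # number of faces of the die
rolls = np.array([5, 1, 6, 3, 3, 1])    # toy observations (faces 1..K), illustration only
alpha = np.ones(K)                      # Dirichlet pseudocounts (uniform prior)

# Sufficient statistics: counts n_k of each face
counts = np.bincount(rolls - 1, minlength=K)
N = counts.sum()

theta_mle = counts / N                              # MLE: n_k / N
theta_bayes = (counts + alpha) / (N + alpha.sum())  # posterior mean: (n_k + a_k) / (N + sum_k a_k)

print("MLE:           ", theta_mle)
print("Posterior mean:", theta_bayes)
```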

Continuous Distributions: uniform, Gaussian, multivariate Gaussian.

MLE for a multivariate Gaussian. One can show that the MLEs for $\mu$ and $\Sigma$ are $\hat{\mu}_{\mathrm{MLE}} = \frac{1}{N} \sum_n x_n$ and $\hat{\Sigma}_{\mathrm{MLE}} = \frac{1}{N} \sum_n (x_n - \hat{\mu})(x_n - \hat{\mu})^\top$. What are the sufficient statistics? Rewriting the log-likelihood shows that the sufficient statistics are $\sum_n x_n$ and $\sum_n x_n x_n^\top$. Bayesian estimation is similar, using a normal prior.

MLE for general BNs. If we assume the parameters of each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node. MLE-based parameter estimation of the GM therefore reduces to local estimation of each GLIM.
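In symbols, the standard decomposition reads as follows (a restatement, with $\pi_i$ denoting the parents of node $i$):

```latex
% Log-likelihood of a fully observed BN factorizes over nodes
\ell(\theta; D) = \log \prod_{n} p(x_n \mid \theta)
              = \log \prod_{n} \prod_{i} p\big(x_{n,i} \mid x_{n,\pi_i}, \theta_i\big)
              = \sum_{i} \Big( \sum_{n} \log p\big(x_{n,i} \mid x_{n,\pi_i}, \theta_i\big) \Big),
% so each theta_i can be estimated separately from the observations of node i and its parents.
```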

Decomposable likelihood of a BN. Consider the four-node GM from the slide: learning it is the same as learning four separate smaller BNs, each of which consists of a node and its parents.

MLE for BNs with tabular CPDs. Each CPD is represented as a table (multinomial) with entries $\theta_{ijk} = P(X_i = j \mid X_{\pi_i} = k)$; with multiple parents the CPD becomes a high-dimensional table. The sufficient statistics are counts of variable configurations, $n_{ijk} = \sum_n \mathbb{1}[x_{n,i} = j,\; x_{n,\pi_i} = k]$. The log-likelihood is $\ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}$, and using a Lagrange multiplier to enforce that each conditional sums to 1 we obtain $\hat{\theta}_{ijk,\mathrm{MLE}} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$.
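A small sketch of this counting estimator on a toy fully observed dataset; the variables B, E, A, the data values, and the helper `tabular_cpd_mle` are invented for illustration, and only the counting-and-normalizing logic follows the slide:

```python
import pandas as pd

# Toy fully observed data over binary variables B, E, A (A has parents B and E) -- invented example
data = pd.DataFrame({
    "B": [1, 1, 0, 0, 1, 0],
    "E": [0, 0, 1, 0, 0, 1],
    "A": [0, 1, 1, 0, 1, 1],
})

def tabular_cpd_mle(df, child, parents, smoothing=0.0):
    """MLE (optionally pseudocount-smoothed) of P(child | parents) from fully observed data."""
    # Count each (parent configuration, child value) combination, then normalize each row
    counts = df.groupby(parents + [child]).size().unstack(fill_value=0) + smoothing
    return counts.div(counts.sum(axis=1), axis=0)

print(tabular_cpd_mle(data, child="A", parents=["B", "E"]))
```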

What about parameter priors? In a BN we have a collection of local distributions, so how can we define priors over the whole BN? We could write $P(x_1, x_2, \ldots, x_n \mid G, \theta)\, p(\theta \mid \alpha)$. Symbolically this is the same as before, but $\theta$ is now defined over a vector of random variables that follow different distributions. We need $\theta$ to decompose so that we can use local rules; otherwise we can no longer decompose the likelihood. This requires certain assumptions on $\theta$: complete model equivalence, global parameter independence, local parameter independence, and likelihood and prior modularity.

Global and Local Parameter Independence. Global parameter independence: for every DAG model, the prior factorizes over nodes, $p(\theta \mid G) = \prod_{i=1}^{M} p(\theta_i \mid G)$. Local parameter independence: for every node, the prior further factorizes over parent configurations, $p(\theta_i \mid G) = \prod_{j} p(\theta_{x_i \mid \pi_i = j} \mid G)$.


Which PDFs satisfy these assumptions? Discrete DAG models and Gaussian DAG models.

Parameter sharing. The transition probabilities $P(X_t \mid X_{t-1})$ could all be different at each time step: what would the parameterization cost be?

Parameter sharing. If the transition probabilities $P(X_t \mid X_{t-1})$ can differ across time steps, we need a different local conditional distribution for each $t$. How can we learn these transition probabilities? We would need iid data for each one, so the training examples would have to correspond to multiple rolls of the dice (multiple independent runs of the process), aligned in time. What do we do instead?

Parameter sharing. We make the assumption that every transition follows the same conditional distribution. Consider a time-invariant (stationary) 1st-order Markov model with an initial state probability vector $\pi$ and a state transition probability matrix $A$. Now $\pi$ is simply a multinomial parameter. What about $A$? The two can be optimized separately (see the sketch below).
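The reason they can be optimized separately: with shared parameters, the log-likelihood of the observed sequences splits into a $\pi$ term and an $A$ term (a standard derivation, sketched here in my notation):

```latex
% Stationary first-order Markov model over an observed sequence x_{1:T}
p(x_{1:T} \mid \pi, A) = p(x_1 \mid \pi) \prod_{t=2}^{T} p(x_t \mid x_{t-1}, A)
\;\Longrightarrow\;
\ell(\pi, A; D) = \sum_{n} \log \pi_{x_{n,1}}
               + \sum_{n} \sum_{t=2}^{T_n} \log A_{x_{n,t-1},\, x_{n,t}},
% so pi and A are estimated by two separate multinomial MLE problems.
```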

Learning a Markov chain transition matrix. $A$ is a stochastic matrix with $\sum_j A_{ij} = 1$, so each row of $A$ is a multinomial distribution. The MLE of $A_{ij}$ is the fraction of transitions from $i$ to $j$. Sparse data problem: if the transition $i \to j$ never occurs in the data, then $\hat{A}_{ij} = 0$, and any future sequence containing the pair $i \to j$ is assigned zero probability; a remedy is backoff smoothing.
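A minimal sketch of the counting estimator for a single observed state sequence; the toy sequence, the `transition_mle` helper, and the optional pseudocount argument (anticipating the smoothing discussed below) are invented for illustration:

```python
import numpy as np

def transition_mle(seq, num_states, pseudocount=0.0):
    """Estimate a Markov transition matrix from one observed state sequence.

    A[i, j] = (# transitions i -> j + pseudocount) / (row total), i.e. the MLE when
    pseudocount = 0 and a smoothed estimate otherwise. Note: with pseudocount = 0,
    a state that is never left in the data yields an undefined (0/0) row.
    """
    counts = np.full((num_states, num_states), pseudocount, dtype=float)
    for prev, nxt in zip(seq[:-1], seq[1:]):
        counts[prev, nxt] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

seq = [0, 1, 1, 2, 0, 1, 2, 2, 0]           # toy observed state path over states {0, 1, 2}
print(transition_mle(seq, num_states=3))    # each row sums to 1
```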

Example: HMM supervised ML estimation. Given observations $x = x_1, x_2, \ldots, x_N$ for which the true state path $y = y_1, y_2, \ldots, y_N$ is known, define $A_{ij}$ as the number of transitions $i \to j$ in the state path and $B_{ik}$ as the number of times state $i$ emits symbol $k$. The maximum likelihood parameters $\theta$ are then the normalized counts, $\hat{a}_{ij} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$ and $\hat{b}_{ik} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$. If $x$ is continuous we can instead apply the learning rules for a Gaussian.

Supervised ML estimation. Intuition: when we know the underlying states, the best estimate of $\theta$ is the observed frequency of transitions and emissions in the training data. Drawback: given little data, we may overfit (recall the zero-probability problem). Example: given only 10 rolls, some transitions or emissions may never be observed, and their estimated probabilities are then exactly zero.

Pseudocounts. A solution for small training sets: add pseudocounts to the observed counts. The pseudocounts represent our prior belief: a large total pseudocount means a strong prior belief, while a small total pseudocount provides just enough smoothing to avoid zero probabilities. This is equivalent to Bayesian estimation under a uniform prior with parameter strength equal to the pseudocounts (see below).
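Written out for the transition-matrix case (the same smoothed estimator used optionally in the code sketch above; the $\alpha_{ij}$ notation is mine):

```latex
% Pseudocount-smoothed transition estimate; alpha_ij > 0 are the pseudocounts
\hat{A}_{ij} = \frac{n_{i \to j} + \alpha_{ij}}{\sum_{j'} \big( n_{i \to j'} + \alpha_{ij'} \big)}
% alpha_ij = 0 recovers the MLE; a constant alpha corresponds to a symmetric Dirichlet prior on each row.
```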

Summary. For a fully observed BN, the log-likelihood function decomposes into a sum of local terms, one per node, so learning also factorizes. Learning a single-node GM is density estimation: exponential family distributions, the typical discrete and continuous distributions, and conjugate priors. Learning a BN with more nodes reduces to these local operations.