Maximum likelihood in log-linear models


Graphical Models, Lecture 4, Michaelmas Term 2010 October 22, 2010

Generating class

Let $A$ denote an arbitrary set of subsets of $V$. A density $f$ (or function) factorizes w.r.t. $A$ if there exist functions $\psi_a(x)$ which depend on $x_a$ only such that
$$f(x) = \prod_{a \in A} \psi_a(x).$$
This is similar to factorization w.r.t. a graph, but the sets in $A$ are not necessarily complete subsets of a graph. The set $P_A$ of distributions which factorize w.r.t. $A$ is the hierarchical log-linear model generated by $A$. To avoid redundancy, it is common to assume the sets in $A$ to be incomparable, in the sense that no member of $A$ is contained in any other member of $A$. $A$ is the generating class of the log-linear model.
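As a small illustration (my own example, not from the slides): take $V = \{1, 2, 3\}$ and the generating class $A = \{\{1,2\}, \{2,3\}\}$. A density in the corresponding hierarchical log-linear model factorizes as
$$f(x_1, x_2, x_3) = \psi_{12}(x_1, x_2)\, \psi_{23}(x_2, x_3),$$
with no factor depending jointly on $(x_1, x_3)$; the two generators are incomparable, so $A$ is a valid generating class.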

Dependence graph of log-linear model

For any generating class $A$ we construct the dependence graph $G(A) = G(P_A)$ of the log-linear model $P_A$. Since the pairwise Markov property has to hold for all members of $P_A$, it has at least to hold for all positive members. The dependence graph is therefore determined by the relation
$$\alpha \sim \beta \iff \exists\, a \in A : \alpha, \beta \in a.$$
Sets in $A$ are clearly complete in $G(A)$, so distributions in $P_A$ factorize according to $G(A)$; they are thus also global, local, and pairwise Markov w.r.t. $G(A)$. On the other hand, any graph with fewer edges would not suffice.
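A minimal computational sketch (my own, not part of the lecture) of the rule above: the dependence graph joins two vertices exactly when some generator contains both. Function and variable names are assumptions for illustration.

```python
# Sketch (assumed names): build the edge set of the dependence graph G(A)
# from a generating class A, using  alpha ~ beta  iff  alpha, beta lie in
# a common generator a in A.
from itertools import combinations

def dependence_graph_edges(generating_class):
    """Return the set of edges {alpha, beta} of G(A)."""
    edges = set()
    for a in generating_class:
        for alpha, beta in combinations(sorted(a), 2):
            edges.add(frozenset((alpha, beta)))
    return edges

# Example: A = {{1,2},{2,3}} gives the chain 1 - 2 - 3.
A = [{1, 2}, {2, 3}]
print(sorted(tuple(sorted(e)) for e in dependence_graph_edges(A)))
# [(1, 2), (2, 3)]
```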

Conformal graphical models

Just as a generating class defines a dependence graph $G(A)$, the reverse is also true: the set $C(G)$ of cliques (maximal complete subsets) of $G$ is a generating class for the log-linear model of distributions which factorize w.r.t. $G$. If the dependence graph completely summarizes the restrictions imposed by $A$, i.e. if $A = C(G(A))$, we say that $A$ is conformal.
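A standard example of a non-conformal generating class (my addition, not on this slide) is the no-three-factor-interaction model: with $V = \{1,2,3\}$ and
$$A = \{\{1,2\}, \{2,3\}, \{1,3\}\},$$
the dependence graph $G(A)$ is the complete graph on $\{1,2,3\}$, so $C(G(A)) = \{\{1,2,3\}\} \neq A$ and $A$ is not conformal; the graphical model generated by $C(G(A))$ is strictly larger than the hierarchical model $P_A$.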

Factor graphs

[Figure: factor graph with factor nodes $\varphi_{IJ}$, $\varphi_{IK}$, $\varphi_{JK}$ and variable nodes $I$, $J$, $K$.]

The factor graph of $A$ is the bipartite graph with vertex set $V \cup A$ and edges defined by
$$\alpha \sim a \iff \alpha \in a.$$
Using this graph, even non-conformal log-linear models admit a simple visual representation.

Data in list form

Consider a sample $X^1 = x^1, \dots, X^n = x^n$ from a distribution with probability mass function $p$. We refer to such data as being in list form, e.g. as

case  Admitted  Sex
1     Yes       Male
2     Yes       Female
3     No        Male
4     Yes       Male
...

Contingency table

Data are often presented in the form of a contingency table or cross-classification, obtained from the list by sorting according to category:

                Sex
Admitted    Male    Female
Yes         1198      557
No          1493     1278

The numerical entries are cell counts $n(x) = |\{\nu : x^\nu = x\}|$ and the total number of observations is $n = \sum_{x \in X} n(x)$.
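A small sketch (mine, not from the lecture) of going from list form to a table of cell counts; the variable names and the tiny four-case list are illustrative only.

```python
# Sketch: tabulate cell counts n(x) from data in list form.
import numpy as np

levels = {"Admitted": ["Yes", "No"], "Sex": ["Male", "Female"]}
cases = [("Yes", "Male"), ("Yes", "Female"), ("No", "Male"), ("Yes", "Male")]

counts = np.zeros((2, 2), dtype=int)   # axis 0: Admitted, axis 1: Sex
for admitted, sex in cases:
    counts[levels["Admitted"].index(admitted), levels["Sex"].index(sex)] += 1

print(counts)         # cell counts n(x)
print(counts.sum())   # total number of observations n
```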

Assume now $p \in P_A$ but otherwise unknown. The likelihood function can be expressed as
$$L(p) = \prod_{\nu=1}^{n} p(x^\nu) = \prod_{x \in X} p(x)^{n(x)}.$$
In contingency table form the data follow a multinomial distribution,
$$P\{N(x) = n(x),\; x \in X\} = \frac{n!}{\prod_{x \in X} n(x)!} \prod_{x \in X} p(x)^{n(x)},$$
but this only affects the likelihood function by a constant factor.
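A small numerical sketch (my own) of the log-likelihood $\log L(p) = \sum_x n(x) \log p(x)$ computed from a count table, using the Berkeley counts from the previous slide and a uniform $p$ as an arbitrary candidate:

```python
# Sketch: log-likelihood of a candidate cell-probability table p
# given observed counts n(x); cells with n(x) = 0 contribute nothing.
import numpy as np

def log_likelihood(counts, p):
    counts = np.asarray(counts, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = counts > 0
    return float(np.sum(counts[mask] * np.log(p[mask])))

n_x = np.array([[1198, 557], [1493, 1278]])   # Admitted (rows) x Sex (cols)
print(log_likelihood(n_x, np.full((2, 2), 0.25)))
```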

The likelihood function
$$L(p) = \prod_{x \in X} p(x)^{n(x)}$$
is continuous as a function of the ($|X|$-dimensional vector) unknown probability distribution $p$. Since the closure $\overline{P}_A$ is compact (bounded and closed), $L$ attains its maximum on $\overline{P}_A$. Unfortunately, $P_A$ is not closed in itself, so limits of factorizing distributions do not necessarily factorize. The maximum of the likelihood function is thus not necessarily attained on $P_A$ itself, so in general it is necessary to include the boundary points.

Indeed, it is also true that $L$ has a unique maximum over $\overline{P}_A$, which we shall now show. For simplicity, we only establish uniqueness within $P_A$. The proof is indirect, but quite simple. Assume $p_1, p_2 \in P_A$ with $p_1 \neq p_2$ and
$$L(p_1) = L(p_2) = \sup_{p \in P_A} L(p). \qquad (1)$$
Define $p_{12}(x) = c\,\sqrt{p_1(x)\, p_2(x)}$, where $c^{-1} = \sum_x \sqrt{p_1(x)\, p_2(x)}$ is a normalizing constant.

Then $p_{12} \in P_A$ because
$$p_{12}(x) = c\,\sqrt{p_1(x)\, p_2(x)} = c \prod_{a \in A} \sqrt{\psi_a^1(x)\, \psi_a^2(x)} = \prod_{a \in A} \psi_a^{12}(x),$$
where e.g. $\psi_a^{12}(x) = c^{1/|A|} \sqrt{\psi_a^1(x)\, \psi_a^2(x)}$. The Cauchy-Schwarz inequality yields
$$c^{-1} = \sum_x \sqrt{p_1(x)\, p_2(x)} < \sqrt{\sum_x p_1(x)}\; \sqrt{\sum_x p_2(x)} = 1,$$
i.e. we have $c > 1$; the inequality is strict because $p_1 \neq p_2$, so $\sqrt{p_1}$ and $\sqrt{p_2}$ are not proportional.

Hence
$$L(p_{12}) = \prod_x p_{12}(x)^{n(x)} = \prod_x \left\{ c\,\sqrt{p_1(x)\, p_2(x)} \right\}^{n(x)} = c^n \prod_x \sqrt{p_1(x)}^{\,n(x)} \sqrt{p_2(x)}^{\,n(x)} = c^n \sqrt{L(p_1)\, L(p_2)} > \sqrt{L(p_1)\, L(p_2)} = L(p_1) = L(p_2),$$
which contradicts (1). Hence we conclude $p_1 = p_2$. The extension to $\overline{P}_A$ is almost identical; it just needs a limit argument to establish that $p_1, p_2 \in \overline{P}_A$ implies $p_{12} \in \overline{P}_A$.

The maximum likelihood estimate $\hat p$ of $p$ is the unique element of $\overline{P}_A$ which satisfies the system of equations
$$n \hat p(x_a) = n(x_a), \quad a \in A,\ x_a \in X_a. \qquad (2)$$
Here $g(x_a) = \sum_{y : y_a = x_a} g(y)$ is the $a$-marginal of the function $g$. The system of equations (2) expresses the fitting of the marginals in $A$. It can be seen as an instance of the fact that in an exponential family (a log-linear model is an exponential family), the MLE is found by equating the sufficient statistics (here the marginal counts) to their expectations.
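A brief sketch (my own) of computing $a$-marginals of a table by summing over the coordinates outside $a$; with the Berkeley counts this reproduces the marginal totals used later.

```python
# Sketch: the a-marginal g(x_a) = sum over all y with y_a = x_a of g(y),
# implemented as a sum over the axes not in a (kept as singleton axes so
# the result broadcasts against the full table).
import numpy as np

def marginal(table, a):
    other = tuple(i for i in range(table.ndim) if i not in a)
    return table.sum(axis=other, keepdims=True)

n_x = np.array([[1198.0, 557.0], [1493.0, 1278.0]])   # Admitted x Sex
print(marginal(n_x, a=(0,)).ravel())   # Admitted-marginal: [1755. 2771.]
print(marginal(n_x, a=(1,)).ravel())   # Sex-marginal:      [2691. 1835.]
```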

Proof: Assume $p^* \in P_A$ is a solution to the equations (2). That $p^*$ maximizes the likelihood function follows from the calculation below, where $p \in P_A$ is arbitrary and $\phi_a = \log \psi_a$:
$$\log L(p) = \sum_{x \in X} n(x) \log p(x) = \sum_{x \in X} n(x) \sum_{a \in A} \phi_a(x)
= \sum_{a \in A} \sum_{x \in X} n(x)\, \phi_a(x)
= \sum_{a \in A} \sum_{x_a \in X_a} \sum_{y : y_a = x_a} n(y)\, \phi_a(y)
= \sum_{a \in A} \sum_{x_a \in X_a} n(x_a)\, \phi_a(x),$$
since $n(x_a) = \sum_{y : y_a = x_a} n(y)$ and $\phi_a$ depends on $x_a$ only.

Further we get
$$\log L(p) = \sum_{a \in A} \sum_{x_a \in X_a} n(x_a)\, \phi_a(x)
= \sum_{a \in A} \sum_{x_a \in X_a} n p^*(x_a)\, \phi_a(x)
= \sum_{a \in A} \sum_{x \in X} n p^*(x)\, \phi_a(x)
= \sum_{x \in X} n p^*(x) \log p(x).$$
Thus, for any $p \in P_A$ we have established that
$$\log L(p) = \sum_{x \in X} n p^*(x) \log p(x).$$

This holds in particular also for $p^*$ itself. The information inequality (non-negativity of the Kullback-Leibler divergence) now yields
$$\log L(p) = \sum_{x \in X} n p^*(x) \log p(x) \le \sum_{x \in X} n p^*(x) \log p^*(x) = \log L(p^*).$$
The case of $p \in \overline{P}_A$ needs an additional limit argument.

To show that the equations (2) indeed have a solution, we simply describe a convergent algorithm which solves them. This algorithm cycles (repeatedly) through all the $a$-marginals in $A$ and fits them one by one. For $a \in A$ define the following scaling operation on $p$:
$$(T_a p)(x) = p(x)\, \frac{n(x_a)}{n\, p(x_a)}, \quad x \in X,$$
where $0/0 = 0$ and $b/0$ is undefined if $b \neq 0$.
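A sketch (mine, with assumed names) of the scaling operation $T_a$ acting on a probability table, with the $0/0 = 0$ convention handled explicitly:

```python
# Sketch of T_a: (T_a p)(x) = p(x) * n(x_a) / (n * p(x_a)), with 0/0 := 0.
import numpy as np

def scale(p, counts, a):
    """One marginal-fitting step; p and counts are arrays with one axis
    per variable, a is a tuple of the axes forming the generator."""
    n = counts.sum()
    other = tuple(i for i in range(p.ndim) if i not in a)
    num = counts.sum(axis=other, keepdims=True)        # n(x_a)
    den = n * p.sum(axis=other, keepdims=True)         # n p(x_a)
    # Cells with p(x_a) = 0 are left at 0 (the 0/0 = 0 convention).
    ratio = np.divide(num, den, out=np.zeros_like(den, dtype=float),
                      where=den > 0)
    return p * ratio
```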

Fitting the marginals

The operation $T_a$ fits the $a$-marginal provided $p(x_a) > 0$ whenever $n(x_a) > 0$:
$$n (T_a p)(x_a) = n \sum_{y : y_a = x_a} p(y)\, \frac{n(y_a)}{n\, p(y_a)}
= n\, \frac{n(x_a)}{n\, p(x_a)} \sum_{y : y_a = x_a} p(y)
= n\, \frac{n(x_a)}{n\, p(x_a)}\, p(x_a) = n(x_a).$$
Consequently, we have $T_a^2 = T_a$: there is no reason to apply it twice in a row.

Make an ordering of the generators, $A = \{a_1, \dots, a_k\}$, and define $S$ as a full cycle of scalings
$$S p = T_{a_k} \cdots T_{a_2} T_{a_1} p.$$
Define the iteration
$$p_0(x) \equiv 1/|X|, \qquad p_n = S p_{n-1}, \quad n = 1, 2, \dots$$
It then holds that
$$\lim_{n \to \infty} p_n = \hat p,$$
where $\hat p$ is the unique maximum likelihood estimate of $p \in \overline{P}_A$, i.e. the solution of the equation system (2).
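Continuing the sketch (still my own illustration), the full cycle $S$ iterated from the uniform table; `scale` is the function sketched after the previous slide, and the stopping rule and iteration cap are arbitrary choices of mine:

```python
# Sketch: iterative proportional scaling, p_0 = 1/|X|, p_n = S p_{n-1},
# cycling through the generators until the table stops changing.
import numpy as np

def ips(counts, generators, max_iter=1000, tol=1e-10):
    counts = np.asarray(counts, dtype=float)
    p = np.full(counts.shape, 1.0 / counts.size)   # p_0(x) = 1/|X|
    for _ in range(max_iter):
        p_prev = p
        for a in generators:                       # one full cycle S
            p = scale(p, counts, a)                # T_a, sketched above
        if np.max(np.abs(p - p_prev)) < tol:
            break
    return p
```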

This is known as the IPS algorithm (iterative proportional scaling) or the IPF algorithm (iterative proportional fitting), among a variety of other names. It is implemented, e.g. (inefficiently), in R in loglin, with front end loglm in MASS. Key elements in the proof of convergence:
1. If $p \in \overline{P}_A$, so is $T_a p$;
2. $T_a$ is continuous at any point $p$ of $\overline{P}_A$ with $p(x_a) > 0$ whenever $n(x_a) > 0$;
3. $L(T_a p) \ge L(p)$, so the likelihood never decreases;
4. $\hat p$ is the unique fixed point of the $T_a$ (and of $S$);
5. $\overline{P}_A$ is compact.

A simple example

                Admitted
Sex          Yes     No    S-marginal
Male        1198   1493    2691
Female       557   1278    1835
A-marginal  1755   2771    4526

Admissions data from Berkeley. Consider the model $A \perp\!\!\!\perp S$ (Admission independent of Sex), corresponding to $A = \{\{A\}, \{S\}\}$. We then fit the A-marginal and the S-marginal iteratively.
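With the sketched `ips` and `scale` functions from the previous slides (my own code, not the lecture's), the fit below reproduces, up to rounding, the tables on the following slides; the axis ordering (Sex as rows, Admitted as columns) is my choice.

```python
# Sketch: fit the independence model A = {{Admitted}, {Sex}} to the
# Berkeley table by IPS and display fitted counts and probabilities.
import numpy as np

n_x = np.array([[1198.0, 1493.0],     # Male:   Yes, No
                [557.0, 1278.0]])     # Female: Yes, No

p_hat = ips(n_x, generators=[(0,), (1,)])   # fit S- and A-marginals
print(np.round(n_x.sum() * p_hat, 2))       # fitted counts, cf. next slides
print(np.round(p_hat, 3))                   # fitted probabilities
```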

Initial values

                Admitted
Sex            Yes       No    S-marginal
Male        1131.5   1131.5    2691
Female      1131.5   1131.5    1835
A-marginal    1755     2771    4526

Entries are all equal to $4526/4 = 1131.5$, giving the initial values of $n p_0$; the marginal row and column show the observed marginals to be fitted.

Fitting the S-marginal

                Admitted
Sex            Yes      No    S-marginal
Male        1345.5  1345.5    2691
Female       917.5   917.5    1835
A-marginal    1755    2771    4526

For example
$$1345.5 = 2691 \times \frac{1131.5}{1131.5 + 1131.5},$$
and so on.

Fitting the A-marginal

                Admitted
Sex             Yes        No    S-marginal
Male        1043.46   1647.54    2691
Female       711.54   1123.46    1835
A-marginal     1755      2771    4526

For example
$$711.54 = 1755 \times \frac{917.5}{917.5 + 1345.5},$$
and so on. The algorithm has converged, as both marginals now fit!

Normalised to probabilities

                Admitted
Sex           Yes      No    S-marginal
Male        0.231   0.364    0.595
Female      0.157   0.248    0.405
A-marginal  0.388   0.612    1

Dividing everything by 4526 yields $\hat p$. It is overkill to use the IPS algorithm here, as there is an explicit formula, as we shall see next time.

- $L$ attains its maximum on $\overline{P}_A$.
- The maximizer $\hat p \in \overline{P}_A$ of $L$ is unique.
- The maximizer $\hat p$ is determined as the unique solution within $\overline{P}_A$ of the equations $n \hat p(x_a) = n(x_a)$, $a \in A$, $x_a \in X_a$.
- The maximizer is the limit of the convergent repeated fitting of the marginals, $T_a : p(x) \mapsto p(x)\, n(x_a)/\{n\, p(x_a)\}$, $x \in X$, $a \in A$.