Posterior Regularization


1 Introduction

One of the key challenges in probabilistic structured learning is the intractability of the posterior distribution, which is needed for fast inference. Numerous methods have been proposed to approximate the posterior so that it becomes easier to work with. Posterior Regularization is one such method: it approximates the joint distribution over a set of structured variables while taking constraints into account. A comprehensive review of the method and the ideas leading up to it is given in [1].

2 Modelling the problem

In Posterior Regularization we perform inference over a posterior distribution while treating constraints as a form of indirect supervision. We define X as the input observation variables and Y as the latent variables about which we want to make predictions. An example is POS tagging, a structured prediction problem in which we observe input sentences x and predict output tags y. We assume a generative model with prior p(Y) encoding our knowledge about the latent variables, likelihood p(X | Y), and marginal likelihood (evidence)

    L(θ) = ln p(X; θ) = ln ∫_Y p(Y) p(X | Y) dY.

However, the same probabilistic modelling can also be used to learn in a discriminative way.
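As a concrete illustration of the evidence above, the following minimal sketch computes L(θ) for a toy discrete model by summing the prior times the likelihood over all latent configurations. The two-state prior and the emission table are made-up numbers for illustration only, not part of the note.

import numpy as np

# Toy discrete generative model (made-up numbers):
# latent tag Y in {0, 1}, observation X in {0, 1, 2}.
prior = np.array([0.6, 0.4])                # p(Y)
likelihood = np.array([[0.7, 0.2, 0.1],     # p(X | Y=0)
                       [0.1, 0.3, 0.6]])    # p(X | Y=1)

def evidence(x):
    """L(theta) = ln p(X=x) = ln sum_Y p(Y) p(X=x | Y)."""
    return np.log(np.sum(prior * likelihood[:, x]))

print(evidence(2))  # log marginal likelihood of observing X=2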

3 Approximating the posterior

We now want to use a distribution q(Y) to approximate the real, intractable posterior p(Y | X; θ). Here q(Y) is a simpler parametric distribution with parameter γ that is easy to work with (strictly, it should be written q(Y; γ), but it is common to drop the parameters). To approximate the true posterior p(·) by the simpler distribution q(·), we need a measure of distance between the two distributions that is also practical to compute. One of the best-known distance measures between distributions is the Kullback-Leibler divergence, or KL-divergence for short.

Lemma 1. We have

    L(θ) = log p(X; θ) = J(q, θ) + KL(q(Y) ‖ p(Y | X; θ)),        (1)

in which

    J(q, θ) = ∫ q(Y) log [p(X, Y; θ) / q(Y)] dY,                  (2)

    KL(q(Y) ‖ p(Y | X; θ)) = E_q[ log (q(Y) / p(Y | X; θ)) ].

Proof. By Bayes' rule,

    p(Y | X; θ) = p(X, Y; θ) / p(X; θ),

so that

    log p(X; θ) = log p(X, Y; θ) − log p(Y | X; θ)
                = log [p(X, Y; θ) / q(Y)] − log [p(Y | X; θ) / q(Y)].

Multiplying both sides by q(Y) and integrating over Y, we have

    ∫ q(Y) log p(X; θ) dY = ∫ q(Y) log [p(X, Y; θ) / q(Y)] dY − ∫ q(Y) log [p(Y | X; θ) / q(Y)] dY.

On the left-hand side, p(X; θ) is not a function of Y, and thus ∫ q(Y) log p(X; θ) dY = log p(X; θ). Also note that

    KL(q(Y) ‖ p(Y | X; θ)) = −∫ q(Y) log [p(Y | X; θ) / q(Y)] dY = E_q[ log (q(Y) / p(Y | X; θ)) ],

and therefore

    L(θ) = J(q, θ) + KL(q(Y) ‖ p(Y | X; θ)).
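The decomposition in Lemma 1 can be checked numerically. The sketch below is my own illustration: it reuses the toy model above (made-up numbers), picks an arbitrary q(Y), and verifies that J(q, θ) + KL(q ‖ p(Y | X)) equals the log-evidence.

import numpy as np

# Same toy model as before (made-up numbers): Y in {0,1}, X in {0,1,2}.
prior = np.array([0.6, 0.4])
likelihood = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])

x = 2
joint = prior * likelihood[:, x]          # p(X=x, Y)
evidence = joint.sum()                    # p(X=x)
posterior = joint / evidence              # p(Y | X=x)

q = np.array([0.5, 0.5])                  # an arbitrary approximating q(Y)

J  = np.sum(q * np.log(joint / q))        # lower bound J(q, theta), eq. (2)
KL = np.sum(q * np.log(q / posterior))    # KL(q || p(Y | X))

# Lemma 1: the log-evidence decomposes as J + KL.
assert np.isclose(np.log(evidence), J + KL)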

Lemma 2. For any given observations and latent variables, the following inequality holds:

    L(θ) ≥ J(q, θ).

Proof. We use Jensen's inequality to show that J(q, θ) is a lower bound on the original log-likelihood:

    L(θ) = ln p(X; θ) = ln ∫ p(X, Y; θ) dY = ln ∫ q(Y) [p(X, Y; θ) / q(Y)] dY
         ≥ ∫ q(Y) ln [p(X, Y; θ) / q(Y)] dY = J(q, θ).

This shows that exp[J(q, θ)] is a lower bound on the likelihood p(X; θ):

    p(X; θ) ≥ exp[J(q, θ)],   or equivalently   L(θ) ≥ J(q, θ).

Corollary 1. Maximizing the lower bound J(q, θ) also drives up the likelihood p(X; θ). For this reason, J(q, θ) is often called the ELBO (Evidence Lower BOund).

4 Posterior Regularization

We define the following objective function, to be maximized:

    J(θ, q) = L(θ) − KL(q(Y) ‖ p(Y | X; θ))   subject to   E_q[φ(X, Y)] ≤ b.        (3)

Remark 1. Since the KL-divergence is always non-negative, this objective never exceeds L(θ); by equation (1) it equals the lower bound J(q, θ) of equation (2), now maximized under the constraint.

We define a set of constraints in the output space as follows:

    Q = { q : E_q[φ(X, Y)] ≤ b }.

Note that in this definition E_q[φ(X, Y)] is a function of X, which means we place constraints on the input observations so that they follow the structure encoded by the inequality and by the feature function φ(X, Y). This definition can encode both hard and soft constraints.
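To make the constraint set concrete, the sketch below is my own example with made-up quantities: it defines a simple feature function φ and checks whether a candidate q(Y) satisfies the expectation constraint E_q[φ(X, Y)] ≤ b that defines Q.

import numpy as np

# Candidate approximate posterior over 3 latent configurations (made up).
q = np.array([0.2, 0.5, 0.3])

# Feature values phi(X, Y) for each latent configuration, for a fixed X,
# e.g. "number of tags of a disallowed type" in a tagging problem.
phi = np.array([0.0, 1.0, 2.0])

b = 1.0  # constraint bound (an assumption for this illustration)

def in_constraint_set(q, phi, b):
    """Check membership in Q = {q : E_q[phi(X, Y)] <= b}."""
    return np.dot(q, phi) <= b

print(np.dot(q, phi), in_constraint_set(q, phi, b))  # 1.1, False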

5 Relaxed form

To cut the objective function and the constraints some slack, one can introduce extra slack variables ξ:

    J(θ, q) = L(θ) − KL(q(Y) ‖ p(Y | X; θ)) − σ‖ξ‖_β   subject to   E_q[φ(X, Y)] − b ≤ ξ.

6 Learning and inference over the regularized posterior

Ignoring the constraints for the moment, training consists of a two-step procedure similar to the EM algorithm:

1. Initialize θ and the variational parameters of q(·).
2. Repeat until convergence:
   (a) E-step: q^(t+1) = arg max_q J(q, θ^(t))
   (b) M-step: θ^(t+1) = arg max_θ J(q^(t+1), θ)

It is worth mentioning that maximizing the objective over q is equivalent to minimizing the KL-divergence KL(q(Y) ‖ p(Y | X; θ)), since L(θ) does not depend on q. When the constraints are taken into account, the procedure above is restricted to those q(·) that satisfy the constraint set Q. One way to do this is to constrain the E-step to Q:

    E-step (modified): q^(t+1) = arg max_{q ∈ Q} J(q, θ^(t))

Instead of enforcing this constraint directly, which is hard to compute, we can work with the dual form of the E-step together with the constraint inequality.

Theorem 1. Consider the following problem, with θ held fixed:

    min_{q, ξ}  KL(q(Y) ‖ p(Y | X; θ))   subject to   E_q[φ(X, Y)] − b ≤ ξ,   ‖ξ‖_β ≤ ε.        (4)

The primal solution q*(Y) is unique, since the KL-divergence is strictly convex:

    q*(Y) = p(Y | X; θ) exp{−λ*·φ(X, Y)} / Z(λ*),

in which Z(λ*) = ∫ p(Y | X; θ) exp{−λ*·φ(X, Y)} dY is the normalizing constant. The dual of the above program is

    max_{λ ≥ 0}  −b·λ − ln Z(λ) − ε‖λ‖_{β*},

in which ‖·‖_{β*} is the dual norm of ‖·‖_β. The proof can be found in [1].
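The dual view suggests a simple way to carry out the modified E-step when the latent space is small and discrete. The sketch below is my own illustration, not taken from [1]: it handles a single scalar constraint in the exact-constraint case ε = 0, optimizes λ ≥ 0 by projected gradient ascent on the dual, and recovers q*(Y) ∝ p(Y | X) exp{−λ·φ(X, Y)}. All numbers are made up.

import numpy as np

# Modified E-step via the dual, for one scalar constraint E_q[phi] <= b.
p_post = np.array([0.1, 0.3, 0.6])   # p(Y | X; theta) over 3 latent configs
phi    = np.array([0.0, 1.0, 2.0])   # feature values phi(X, Y)
b      = 1.0                         # constraint bound

def q_of_lambda(lam):
    """q_lambda(Y) = p(Y | X) exp(-lam * phi(Y)) / Z(lam)."""
    w = p_post * np.exp(-lam * phi)
    return w / w.sum()

# Projected gradient ascent on the dual  max_{lam >= 0} -b*lam - ln Z(lam);
# the gradient with respect to lam is E_{q_lambda}[phi] - b.
lam, step = 0.0, 0.5
for _ in range(200):
    grad = np.dot(q_of_lambda(lam), phi) - b
    lam = max(0.0, lam + step * grad)

q_star = q_of_lambda(lam)
print(lam, q_star, np.dot(q_star, phi))  # E_{q*}[phi] should be close to b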

7 Bibliographical notes

I used Jiayu Zhou's notes (http://www.public.asu.edu/~jzhou29/slides/PR_Notes.pdf) in writing this document.

References

[1] Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049, 2010.