Multi-Conditional Learning for Joint Probability Models with Latent Variables

Chris Pal, Xuerui Wang, Michael Kelm and Andrew McCallum
Department of Computer Science, University of Massachusetts Amherst, Amherst, MA USA
{pal,xuerui,mccallum}@cs.umass.edu

Abstract

We introduce Multi-Conditional Learning, a framework for optimizing graphical models based not on joint likelihood, nor on conditional likelihood, but on a product of several marginal conditional likelihoods, each relying on common sets of parameters from an underlying joint model and predicting different subsets of variables conditioned on other subsets. When applied to undirected models with latent variables, such as the Harmonium, this approach can result in powerful, structured latent variable representations that combine some of the advantages of conditional random fields with the unsupervised clustering ability of popular topic models, such as latent Dirichlet allocation and its successors. We present new algorithms for parameter estimation using expected gradient based optimization and develop fast approximate inference algorithms inspired by the contrastive divergence approach. Our initial experimental results show improved cluster quality on synthetic data, promising results on a vowel recognition problem, and significant improvement in inferring hidden document categories from multiple attributes of documents.

1 Introduction

Recently, there has been substantial interest in Conditional Random Fields (CRFs) [6] for sequence processing. CRFs are random fields for a joint distribution globally conditioned on feature observations. The CRF construction can be contrasted with MRFs, which have been used in the past to define and model the joint distribution of both labels and features. CRFs are also optimized using a maximum conditional likelihood objective, as opposed to the maximum (joint) likelihood objective traditionally used for MRFs. In the machine learning community, the Boltzmann machine [1] is another well known example of an MRF, and recent attention has been given to a restricted type of Boltzmann machine [3] known as a Harmonium [8]. We are interested in more deeply exploring the relationships between these models, the distributions they define and the ways in which they can be optimized.

In the approach we propose and develop here, one begins by specifying a joint probability model for all quantities one wishes to treat as random; for our discussion here we can think of this as a joint model for labels, features, hidden variables and, if desired, parameters. However, to optimize the joint model we propose finding point estimates for parameters using an objective function consisting of the product of selected (marginal) conditional distributions.
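Before the latent-variable examples that follow, a minimal sketch may help fix the idea of scoring one joint model under several of its conditionals. The snippet below is purely illustrative (it is not code from the paper); the joint table P, the variable names a and b, and the data pairs are invented for the example.

```python
import numpy as np

# Toy illustration (not from the paper): a joint model over two discrete
# variables a and b, parameterized directly as a normalized table.
# The multi-conditional objective scores the SAME joint table by how well
# it models both conditionals P(a | b) and P(b | a).

def log_joint(P, a, b):
    """log P(a, b) under the joint table P."""
    return np.log(P[a, b])

def log_multi_conditional(P, a, b):
    """log P(a | b) + log P(b | a), both derived from the single joint P."""
    p_a_given_b = P[a, b] / P[:, b].sum()
    p_b_given_a = P[a, b] / P[a, :].sum()
    return np.log(p_a_given_b) + np.log(p_b_given_a)

# Example joint table over 2 x 3 states (rows: a, columns: b).
P = np.array([[0.10, 0.25, 0.05],
              [0.20, 0.10, 0.30]])

data = [(0, 1), (1, 2), (1, 0)]  # observed (a, b) pairs
print("joint log-likelihood :", sum(log_joint(P, a, b) for a, b in data))
print("multi-cond. objective:", sum(log_multi_conditional(P, a, b) for a, b in data))
```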

The goal of this objective is thus to obtain a Random Field that has been optimized to be good at modeling a number of conditional distributions that the modeler is particularly interested in capturing well within the context of a single joint model. Our experiments show that latent variable models obtained by optimization with respect to this type of objective can produce both qualitatively better clusters and quantitatively better structured latent topic spaces.

2 Multi-Conditional Learning in Joint Models

2.1 A Simple Illustrative Example in a Locally Normalized Model

Here we present an example of how one can optimize a joint probability model under a number of different objectives. Consider a Gaussian mixture model (GMM) for real valued observed random variables x (e.g., observed 2-D values) with an unobserved sub-class s associated with each observed class label c. We will use the notation x̃ and c̃ to denote observations or instantiations of continuous and discrete random variables. We can write a model for the joint distribution of these random variables as P(x, c, s) = P(x|s)P(s|c)P(c), where P(s|c) is a sparse matrix associating a number of sub-classes with each class. We shall use Θ to denote all the parameters of the model.

Now consider that it is possible to optimize the GMM in a number of different ways. First, consider the log marginal joint likelihood L_{x,c} of such a model, which can be expressed as

L_{x,c}(Θ; {x̃_i}, {c̃_i}) = Σ_i log P(x̃_i, c̃_i | Θ) = Σ_i log Σ_s P(x̃_i, s, c̃_i | Θ) = L_{x,c}(Θ).    (1)

Second, in contrast to the log marginal joint likelihood, the log marginal conditional likelihood L_{c|x} can be expressed as

L_{c|x}(Θ; {x̃_i}, {c̃_i}) = Σ_i log Σ_s P(c̃_i, s | x̃_i, Θ) = L_{x,c}(Θ) - L_x(Θ).    (2)

Third, consider the following multi-conditional objective function L_{c|x, x|c}, which we express as

L_{c|x, x|c}(Θ; {x̃_i}, {c̃_i}) = Σ_i log P(c̃_i | x̃_i, Θ) + Σ_i log P(x̃_i | c̃_i, Θ) = 2 L_{x,c}(Θ) - L_x(Θ) - L_c(Θ).    (3)

Consider now the following simple example data set, which is similar to the example presented in Jebara's work [4] to illustrate his Conditional Expectation Maximization (CEM) approach. Similarly, we generate data from two classes, each with four sub-classes drawn from 2-D isotropic Gaussians. The data are illustrated by the red and blue points in Figure 1. In contrast to [4], here we fit models with diagonal covariance matrices and we use the conditional expected gradient [7] optimization approach to update parameters. To illustrate the effects of the different optimization criteria we have fit models with two sub-classes for each class. We run each algorithm with random initializations using gradient based optimization for the three objective functions and choose the best model under the joint, conditional and multi-conditional objectives, (1), (2) and (3), respectively. We illustrate the model parameters using ellipses of constant probability under the model.

From this illustrative example, we see that the joint likelihood based objective encodes no element explicitly enforcing a good model of the conditional distribution of the class label, and can thus place probability mass in poor locations with respect to classification. The conditional objective focuses completely on the decision boundary and can produce parameters with very little interpretability. Our multi-conditional objective, in contrast, explicitly optimizes both for a good class conditional distribution and for a setting of parameters well suited to making classifications.
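To make equations (1)-(3) concrete, here is a small evaluation-only sketch of the three objectives for the sub-class GMM P(x, c, s) = P(x|s)P(s|c)P(c) with diagonal covariances. It is our own illustration under assumed toy parameter values and data, not the authors' code, and it only computes the objectives rather than optimizing them with expected gradients.

```python
import numpy as np

def diag_gauss_logpdf(x, mean, var):
    # log N(x; mean, diag(var)) for a single 2-D point
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def log_joint_xc(x, c, log_pc, log_ps_given_c, means, variances):
    """log P(x, c) = log sum_s P(x|s) P(s|c) P(c)  -- used by objective (1)."""
    terms = [log_pc[c] + log_ps_given_c[c, s]
             + diag_gauss_logpdf(x, means[s], variances[s])
             for s in range(means.shape[0])]
    return np.logaddexp.reduce(terms)

def log_marginal_x(x, log_pc, log_ps_given_c, means, variances):
    """log P(x) = log sum_c P(x, c)."""
    return np.logaddexp.reduce([log_joint_xc(x, c, log_pc, log_ps_given_c, means, variances)
                                for c in range(log_pc.shape[0])])

def objectives(X, C, log_pc, log_ps_given_c, means, variances):
    L_xc = sum(log_joint_xc(x, c, log_pc, log_ps_given_c, means, variances)
               for x, c in zip(X, C))
    L_x = sum(log_marginal_x(x, log_pc, log_ps_given_c, means, variances) for x in X)
    L_c = sum(log_pc[c] for c in C)          # log P(c) of the observed labels
    joint = L_xc                              # objective (1)
    conditional = L_xc - L_x                  # objective (2): sum_i log P(c_i | x_i)
    multi_conditional = 2 * L_xc - L_x - L_c  # objective (3)
    return joint, conditional, multi_conditional

# Two classes, two sub-classes per class (all values invented for illustration).
log_pc = np.log(np.array([0.5, 0.5]))
log_ps_given_c = np.log(np.array([[0.5, 0.5, 0.0, 0.0],    # class 0 -> sub-classes 0, 1
                                  [0.0, 0.0, 0.5, 0.5]]) + 1e-12)
means = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [4.0, 1.0]])
variances = np.ones((4, 2))

X = np.array([[0.1, 0.2], [3.2, 0.1], [0.9, 1.1]])
C = np.array([0, 1, 0])
print(objectives(X, C, log_pc, log_ps_given_c, means, variances))
```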

Figure 1: (Left) Joint likelihood optimization. (Middle) One of the many near optimal solutions found by conditional likelihood optimization. (Right) An optimal solution found by our multi-conditional objective.

Quantitatively, we have found that a similar multi-conditional optimization and model selection procedure for the multi-class, isolated vowel recognition problem in [2] leads to a test set error rate of .6, compared to . using the ML and CL objectives. In contrast, the best published result is .9, using multivariate adaptive regression splines (MARS) [2].

2.2 Our Proposed Multi-Conditional Objective

More generally, our objective function can be expressed as follows. Consider a data set consisting of i = 1, ..., N observation instances, with hidden discrete and continuous variables {z} and z̄ respectively. We define j = 1, ..., M pairs of disjoint subsets of observed variables, where {x}_{j,i} represents the i-th instance of the variables in subset j and {x̄}_{j,i} is the other half of the pair (which we will condition upon). Using these definitions, the optimal parameter settings under our multi-conditional criterion are given by

argmax_θ ∏_{j,i} Σ_{{z}_{j,i}} ∫ P({x}_{j,i}, {z}_{j,i}, z̄_{j,i} | {x̄}_{j,i}; θ) dz̄_{j,i},    (4)

where we derive these marginal conditional likelihoods from a single underlying joint model, which itself may be normalized locally, globally, or using some combination of the two.

2.3 A Structured, Globally Normalized, Latent Variable Model for Documents

A Harmonium model [8] is a Markov Random Field (MRF) consisting of observed variables and hidden variables. Like all MRFs, the model we present here will be defined in terms of a globally normalized product of (un-normalized) potential functions defined upon subsets of variables. A Harmonium can also be described as a type of restricted Boltzmann machine [3] which can be written as an exponential family model. In particular, the exponential family Harmonium structured model we develop here can be written as

P(x, y | Θ) = exp{ Σ_i θ_i^T f_i(x_i) + Σ_j θ_j^T f_j(y_j) + Σ_{i,j} θ_{ij}^T f_{ij}(x_i, y_j) - A(Θ) },    (5)

where y is a vector of hidden variables, x is a vector of observations, θ_i and θ_j represent parameter vectors (or weights), θ_{ij} represents a parameter vector on a cross product of states, f denotes potential functions, Θ = {θ_i, θ_j, θ_{ij}} is the set of all parameters and A is the log-partition function or normalization constant. A Harmonium model factorizes the third term of (5) into θ_{ij}^T f_{ij}(x_i, y_j) = f_i(x_i)^T W_{ij}^T f_j(y_j), where W_{ij}^T is a parameter matrix with dimensions a × b, i.e., with rows equal to the number of states of f_i(x_i) and columns equal to the number of states of f_j(y_j). Figure 2 illustrates a Harmonium model as a factor graph [5]. Importantly, a Harmonium describes the factorization of a joint distribution for observed and hidden variables into a globally normalized product of local functions.
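As a concrete reading of (5), the sketch below evaluates the unnormalized log-potential of a Harmonium with binary observed units and real-valued hidden units, using the bilinear factorization of the pairwise term with identity feature functions. The shapes, parameter values and variable names are assumptions made for illustration; the paper does not specify an implementation.

```python
import numpy as np

# Illustrative sketch (not the authors' implementation) of the exponent in (5)
# for a Harmonium with binary observed units x and real-valued hidden units y,
# using theta_ij^T f_ij(x_i, y_j) = f_i(x_i)^T W^T f_j(y_j) with identity features.

def harmonium_log_potential(x, y, theta_x, theta_y, W):
    """Exponent of (5) up to the log-partition function A(Theta)."""
    # W is assumed to have shape (len(x), len(y)), so x @ W @ y is the bilinear term.
    return theta_x @ x + theta_y @ y + x @ W @ y

rng = np.random.default_rng(0)
n_obs, n_hid = 6, 3
theta_x = rng.normal(size=n_obs)            # observed-side biases
theta_y = rng.normal(size=n_hid)            # hidden-side biases
W = 0.1 * rng.normal(size=(n_obs, n_hid))   # pairwise weights

x = rng.integers(0, 2, size=n_obs).astype(float)  # binary observation vector
y = rng.normal(size=n_hid)                        # hidden vector
print(harmonium_log_potential(x, y, theta_x, theta_y, W))
```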

In our experiments here we shall use the Harmonium's factorization structure to define an MRF, and we will then define sets of marginal conditional distributions of some observed variables given others that are of particular interest, so as to form our multi-conditional objective. Importantly, using a globally normalized joint distribution with this construction, it is also possible to derive two consistent conditional models, one for hidden variables given observed variables and one for observed variables given hidden variables [9]. The conditional distributions defined by these models can also be used to implement sampling schemes for various probabilities in the underlying joint model. However, it is important to remember that the original model parameterization is not defined in terms of these conditional distributions.

In our specific experiments below we use a joint model with a form defined by (5), with W^T = [W_b^T W_d^T], such that the (exponential family) conditional distributions consistent with the joint model are given by

P(y | x̃) = N(y; μ̂, I),   μ̂ = μ + W^T x̃,    (6)
P(x_b | ỹ) = B(x_b; θ̂_b),   θ̂_b = θ_b + W_b ỹ,    (7)
P(x_d | ỹ) = D(x_d; θ̂_d),   θ̂_d = θ_d + W_d ỹ,    (8)

where N(), B() and D() represent Normal, Bernoulli and Discrete distributions respectively. The following equation can be used to represent the marginal

P(x | θ, Λ) = exp{ θ^T x + (1/2) x^T Λ x - A(θ, Λ) },    (9)

where Λ = W W^T. In an exponential family model with exponential function F(x; θ), it is easy to verify that the gradient of the log joint likelihood can be expressed as

∂L(θ; x)/∂θ = N [ E_{P̃(x)}[∂F(x; θ)/∂θ] - E_{P(x; θ)}[∂F(x; θ)/∂θ] ],    (10)

where E_{P̃(x)} denotes the expectation under the empirical distribution, E_{P(x; θ)} is an expectation under the model's marginal distribution and N is the number of data elements. We can thus compute the gradient of the log-likelihood under our construction using

∂L(W^T; X)/∂W^T = (1/N_d) Σ_{i=1}^{N_d} ( W^T x̃_i x̃_i^T - (1/N_s) Σ_{s=1}^{N_s} W^T x_{i,(s)} x_{i,(s)}^T ),    (11)

where N_d is the number of vectors of observed data, the x_{i,(s)} are samples indexed by s, and N_s is the number of MCMC samples used per data vector, computed by Gibbs sampling with the conditionals (6), (7) and (8). In our experiments here we have found it possible to use either one or a small number of MCMC steps initialized from the data vector (the contrastive divergence approach), but a more standard MCMC approximation is also possible. Finally, for conditional likelihood and multi-conditional likelihood based learning, gradient values can be obtained, for each conditioned pair of subsets x_j and x̄_j, from

∂L/∂θ = N [ E_{P̃(x_j, x̄_j)}[∂F(x_j, x̄_j; θ)/∂θ] - E_{P̃(x̄_j)} E_{P(x_j | x̄_j; θ)}[∂F(x_j, x̄_j; θ)/∂θ] ].    (12)
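The contrastive-divergence style estimate of (11) can be sketched as follows for the simplified case of binary observed units only, sampling brief Gibbs chains with the conditionals (6) and (7). This is a hypothetical illustration with invented sizes and data, not the authors' implementation, and it omits the discrete units of (8) and the conditional/multi-conditional variant (12).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_gradient_W(X, W, mu, theta_b, n_steps=1):
    """Contrastive-divergence style estimate in the spirit of (11):
    data statistics minus statistics from short Gibbs chains started at the data,
    using P(y|x) = N(mu + W^T x, I) and P(x|y) = Bernoulli(sigmoid(theta_b + W y)).
    Returns an estimated gradient with the shape of W^T."""
    N_d = X.shape[0]
    grad = np.zeros_like(W.T)
    for x_data in X:
        # positive statistics from the data vector (first term of (11))
        grad += W.T @ np.outer(x_data, x_data)
        # brief Gibbs chain initialized at the data vector (contrastive divergence)
        x = x_data.copy()
        for _ in range(n_steps):
            y = mu + W.T @ x + rng.normal(size=mu.shape)                  # sample via (6)
            x = (rng.random(theta_b.shape) < sigmoid(theta_b + W @ y)).astype(float)  # via (7)
        # negative statistics from the sampled vector (second term of (11))
        grad -= W.T @ np.outer(x, x)
    return grad / N_d

# Made-up sizes and data purely for illustration.
n_obs, n_hid = 8, 3
W = 0.05 * rng.normal(size=(n_obs, n_hid))
mu = np.zeros(n_hid)
theta_b = np.zeros(n_obs)
X = (rng.random((20, n_obs)) < 0.3).astype(float)
print(cd_gradient_W(X, W, mu, theta_b).shape)   # (n_hid, n_obs), matching W^T
```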

3 Experiments, Results and Analysis

We are interested in examining the quality of the latent representations obtained when optimizing multi-attribute Harmonium structured models under ML, CL and MCL objectives. We use a similar testing strategy to [9], but focus on comparing the different latent spaces obtained with the different optimization objectives. For our experiments, we use the reduced 20 newsgroups dataset prepared in MATLAB by Sam Roweis. In this data set, documents are represented by binary word occurrences over a reduced vocabulary and are labeled as one of four domains. To evaluate the quality of our latent space, we retrieve documents that have the same domain label as a test document based on their cosine coefficient in the latent space when observing only binary occurrences. We randomly split the data into a training set and a test set. We use a joint model with a corresponding full rank multivariate Bernoulli conditional for binary word occurrences and a discrete conditional for domains.

Figure 2: (Left) A factor graph for a Harmonium model, with observed variables x and unobserved variables y. (Right) Precision-recall curves for the newsgroups data using ML, CL and MCL with latent variables. Random guessing is a horizontal line.

Figure 2 shows precision-recall results. ML-1 is our model with no domain label information. ML-2 is optimized with domain label information. CL is optimized to predict domains from words, and MCL is optimized to predict both words from domains and domains from words. From Figure 2 we see that the latent space captured by the model is more relevant for domain classification when the model is optimized under the CL and MCL objectives. Further, at low recall both the CL and MCL derived latent spaces produce similar precision. However, as recall increases, the precision for comparisons made in the MCL derived latent space is consistently better. In conclusion, these results lead us to believe that further investigation is warranted into the use of Multi-Conditional Learning methods for deriving both more meaningful and more useful hidden variable models.

References

[1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9:147-169, 1985.
[2] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
[3] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771-1800, 2002.
[4] T. Jebara and A. Pentland. On reversing Jensen's inequality. In NIPS, 2000.
[5] F. R. Kschischang, B. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498-519, 2001.
[6] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282-289, 2001.
[7] Ruslan Salakhutdinov, Sam Roweis, and Zoubin Ghahramani. Optimization with EM and expectation-conjugate-gradient. In Proceedings of ICML, 2003.
[8] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory, chapter 6, pages 194-281. McGraw-Hill, New York, 1986.
[9] Max Welling, Michal Rosen-Zvi, and Geoffrey Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS 17, pages 1481-1488, 2004.