SUPERVISED MULTI-MODAL TOPIC MODEL FOR IMAGE ANNOTATION

Thu Hoai Tran (2) and Seungjin Choi (1,2)
(1) Department of Computer Science and Engineering, POSTECH, Korea
(2) Division of IT Convergence Engineering, POSTECH, Korea
thtlamson@gmail.com, seungjin@postech.ac.kr

ABSTRACT

Multi-modal topic models are probabilistic generative models where hidden topics are learned from data of different types. In this paper we present supervised multi-modal latent Dirichlet allocation (sMMLDA), where we incorporate a class label (a global description) into the joint modeling of visual words and caption words (local descriptions) for the image annotation task. We derive a variational inference algorithm to approximately compute the posterior distribution over latent variables. Experiments on a subset of the LabelMe dataset demonstrate the useful behavior of our model compared to existing topic models.

Index Terms — Image annotation, latent Dirichlet allocation, topic models

1. INTRODUCTION

Latent Dirichlet allocation (LDA) is a widely used topic model which was originally developed to model text corpora [5]. It is a hierarchical Bayesian model in which each observed item is modeled as a finite mixture over an underlying set of topics, and each topic is characterized by a distribution over words. The basic idea of LDA when it is applied to model a set of images, treating an image as a collection of visual words, is shown in Fig. 1. The same intuition in the case of documents can be found in [2]. Multi-modal extensions of LDA, referred to as multi-modal topic models, have been proposed to jointly model data of different types. These models were mainly applied to image annotation, where the goal is to assign a set of keywords to an image, learning underlying topics from a set of image-annotation pairs. Earlier work in this direction is correspondence LDA (cLDA) [3], which finds conditional relationships between latent variable representations of visual words and caption words. The conditional distribution of the annotation given visual descriptors is modeled for automatic image annotation. Topic
regression multi-modal LDA (tr-mmLDA) [10] is an alternative method for capturing the statistical association between image and text. Unlike cLDA, tr-mmLDA learns two separate sets of hidden topics and counts on a regression module to allow a set of caption topics to be linearly predicted from the set of image topics. It was motivated by the regression-based latent factor model [1], which was further elaborated in a hierarchical Bayesian framework [9]. It was shown in [10] that tr-mmLDA is more flexible than cLDA in the sense that the former allows the number of image topics to be different from the number of caption topics.

A class label is a global description of an image, while annotated keywords are local descriptions of image patches. The class label and the annotations are related to each other. For instance, an image labeled as a highway scene is more likely to be annotated with "car" and "road" rather than "apple" and "desk".

Fig. 1. A codeword is assigned to each image patch to represent an image as a collection of visual words. We assume that some number of topics, which are distributions over words, exist for the set of images. An illustration of how an image is generated by an LDA model is shown here: we first choose a distribution over topics (the histogram at right); then, for each visual word, we choose a topic assignment (the circles with patterns filled in) and choose the visual word from the corresponding topic.

In this paper we present supervised multi-modal latent Dirichlet allocation (sMMLDA), where we incorporate a class label into tr-mmLDA, so that two sets of hidden topics, which are related via linear regression, are learned from data of two types as well as from class labels. Several extensions of LDA to incorporate supervision have been developed in the literature [4, 6, 7, 11, 13]. Most of these existing methods are limited to learning from data of a single type. The model tr-mmLDA outperforms most previous methods in the task of image annotation, but it is an unsupervised method. Our model sMMLDA extends the previous state of the art in this domain, tr-mmLDA, by
incorporating supervision in the form of a class label.

2. LATENT DIRICHLET ALLOCATION

We briefly give an overview of LDA [5]. LDA is a generative probabilistic model of a corpus in which documents are represented
as random mixtures over latent topics, where each topic is described by a distribution over words. Each document w_{d,1:N} is a sequence of N words, for d = 1, ..., D (D is the size of the corpus), and each word w_dn ∈ R^V (V is the size of the vocabulary) is a unit vector that has a single entry equal to one and all other entries equal to zero. For instance, if w_dn is the v-th word in the vocabulary, then w_{dn,v} = 1 and w_{dn,j} = 0 for j ≠ v. The graphical model for LDA is shown in Fig. 2, where each document w_{d,1:N} is assumed to be generated as follows:

- Draw a vector of topic proportions θ_d ∈ R^K: θ_d ~ Dir(α_1, ..., α_K).
- For each word n:
  - Draw a topic assignment z_dn ∈ R^K from a multinomial distribution: z_dn | θ_d ~ Mult(θ_d).
  - Draw a word w_dn ∈ R^V: w_dn | z_dn, φ_{1:K} ~ p(w_dn | z_dn, φ_{1:K}).

Given parameters α and φ_{1:K}, the joint distribution of hidden and observed variables is given by

  p(θ_d, z_{d,1:N}, w_{d,1:N} | α, φ_{1:K}) = p(θ_d | α) ∏_{n=1}^N p(z_dn | θ_d) p(w_dn | z_dn, φ_{1:K}).

Integrating over θ_d and summing over z_{d,1:N}, the marginal distribution of a document is given by

  p(w_{d,1:N} | α, φ_{1:K}) = ∫ p(θ_d | α) ∏_{n=1}^N Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, φ_{1:K}) dθ_d.

Taking the product of the marginal probabilities of single documents, the probability of the corpus (the marginal likelihood) is given by

  p(w_{1:D,1:N} | α, φ_{1:K}) = ∏_{d=1}^D p(w_{d,1:N} | α, φ_{1:K}).   (1)

Variational inference allows us to calculate approximate posterior distributions over the hidden variables {θ_d, z_dn} by maximizing a variational lower bound on the log marginal likelihood.

Fig. 2. Graphical model for LDA.

3. SUPERVISED MULTI-MODAL LDA

In this section we present our main result, supervised multi-modal LDA (sMMLDA), where we incorporate a class label into the joint modeling of visual words {r_dn} and caption words {w_dm}, whose latent variable representations are related via linear regression. The graphical model for sMMLDA is shown in Fig. 3.

Fig. 3. Graphical model for supervised multi-modal LDA (sMMLDA).

3.1. Model

The generative process for each visual word r_dn and caption word w_dm is as follows.

- Choose a category label c_d ∈ R^C: c_d ~ Mult(η),
  i.e., p(c_d | η) = ∏_{j=1}^C η_j^{c_dj}, where c_d is a C-dimensional unit vector: if c_d indicates class label j, then c_dj = 1 and c_di = 0 for i ≠ j.
- Draw a vector of image topic proportions θ_d ∈ R^K: p(θ_d | α, c_d) = ∏_{j=1}^C Dir(θ_d | α_j)^{c_dj}.
- For each visual word r_dn, draw an image topic assignment z_dn ∈ R^K: z_dn ~ Mult(θ_d), i.e., p(z_dn | θ_d) = ∏_{k=1}^K θ_dk^{z_dnk}.
- Draw a visual word: r_dn | z_dn, c_d ~ Mult(β^r), i.e.,

    p(r_dn | z_dn, c_d, β^r) = ∏_{i=1}^C ∏_{k=1}^K ∏_{j=1}^{V_r} (β^r_{ikj})^{c_di z_dnk r_dnj},

  where V_r is the size of the visual word vocabulary.
- Given the empirical image topic frequency z̄_d = (1/N) Σ_{n=1}^N z_dn, sample a real-valued topic proportion variable for the caption text: x_d | z̄_d, A, µ, Λ ~ N(x_d | A z̄_d + µ, Λ^{-1}).
- Compute the caption topic proportions: v_dl = e^{x_dl} / Σ_{l'=1}^L e^{x_dl'}.
- For each caption word w_dm:
  - Draw a caption topic assignment: y_dm ~ Mult(v_d), i.e., p(y_dm | v_d) = ∏_{l=1}^L v_dl^{y_dml}.
  - Draw a caption word: w_dm | y_dm, c_d ~ Mult(β^w), i.e.,

      p(w_dm | y_dm, c_d, β^w) = ∏_{i=1}^C ∏_{l=1}^L ∏_{j=1}^{V_w} (β^w_{ilj})^{c_di y_dml w_dmj},

    where V_w is the size of the caption word vocabulary.

We define the sets of variables R = {r_dn}, W = {w_dm}, Z = {z_dn}, Y = {y_dm}, C = {c_d}, Θ = {θ_d}, and X = {x_d}. The joint distribution over these variables then obeys the following factorization:

  p(R, W, C, Θ, X, Z, Y) = p(C | η) p(Θ | C, α) p(Z | Θ) p(R | Z, β^r, C) p(X | Z, A, µ, Λ) p(Y | X) p(W | Y, C),   (2)

where

  p(C | η) = ∏_{d=1}^D ∏_{j=1}^C η_j^{c_dj},
  p(Θ | C, α) = ∏_{d=1}^D ∏_{j=1}^C Dir(θ_d | α_j)^{c_dj},
  p(Z | Θ) = ∏_{d=1}^D ∏_{n=1}^N ∏_{k=1}^K θ_dk^{z_dnk},
  p(R | Z, β^r, C) = ∏_{d=1}^D ∏_{n=1}^N ∏_{i=1}^C ∏_{k=1}^K ∏_{j=1}^{V_r} (β^r_{ikj})^{c_di z_dnk r_dnj},
  p(X | Z, A, µ, Λ) = ∏_{d=1}^D N(x_d | A z̄_d + µ, Λ^{-1}),
  p(Y | X) = ∏_{d=1}^D ∏_{m=1}^M p(y_dm | x_d) = ∏_{d=1}^D ∏_{m=1}^M ∏_{l=1}^L v_dl^{y_dml},
  p(W | Y, C) = ∏_{d=1}^D ∏_{m=1}^M ∏_{i=1}^C ∏_{l=1}^L ∏_{j=1}^{V_w} (β^w_{ilj})^{c_di y_dml w_dmj}.

3.2. Variational Inference

The log marginal likelihood is given by

  log p(R, W, C) = log ∫∫ Σ_{Z,Y} p(R, W, C, Θ, X, Z, Y) dΘ dX
                 ≥ ∫∫ Σ_{Z,Y} q(Θ, X, Z, Y) log [ p(R, W, C, Θ, X, Z, Y) / q(Θ, X, Z, Y) ] dΘ dX = F(q),   (3)

where q(Θ, X, Z, Y) denotes the variational distribution and Jensen's inequality is used to reach the variational lower bound F(q). We assume that the variational distribution factorizes as

  q(Θ, X, Z, Y) = q(Θ) q(X) q(Z) q(Y),   (4)

where each factor is assumed to be of the form given in Table 1. The variational parameters {α̃_d}, {x̃_d, Γ_d^{-1}}, {τ_dn}, and {ρ_dml} are determined by maximizing the variational lower bound

  F(q) = E_q[ log p(C | η) + log p(Θ | C, α) + log p(Z | Θ) + log p(R | Z, β^r, C) + log p(X | Z, A, µ, Λ) + log p(Y | X) + log p(W | Y, C) ]
       − E_q[ log q(Θ) + log q(X) + log q(Z) + log q(Y) ],

where E_q denotes the statistical expectation with respect to the variational distribution q(·). Detailed derivations for variational inference are omitted here due to the space limitation; they can be done in a similar manner to [10].
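For concreteness, the generative process of Section 3.1 can be sketched in NumPy as below. This is a minimal illustrative sketch: all sizes (C, K, L, V_r, V_w, N, M) and all parameter values (η, α, β^r, β^w, A, µ, Λ) are placeholders chosen for the example, not settings estimated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes -- placeholders, not the paper's settings.
C, K, L = 8, 25, 25        # classes, image topics, caption topics
Vr, Vw = 256, 100          # visual / caption vocabulary sizes
N, M = 50, 5               # visual words and caption words in one image

# Model parameters, randomly initialized for the sketch.
eta = np.full(C, 1.0 / C)                          # class prior
alpha = np.ones((C, K))                            # class-specific Dirichlet parameters
beta_r = rng.dirichlet(np.ones(Vr), size=(C, K))   # p(visual word | class, image topic)
beta_w = rng.dirichlet(np.ones(Vw), size=(C, L))   # p(caption word | class, caption topic)
A = 0.5 * rng.standard_normal((L, K))              # regression matrix
mu = np.zeros(L)                                   # regression offset
Lambda_inv = 0.1 * np.eye(L)                       # noise covariance (Lambda^{-1})

# Generative process for one image-caption pair (Section 3.1).
c = rng.choice(C, p=eta)                           # class label c_d ~ Mult(eta)
theta = rng.dirichlet(alpha[c])                    # theta_d ~ Dir(alpha_c)
z = rng.choice(K, size=N, p=theta)                 # image topic assignments z_dn
r = np.array([rng.choice(Vr, p=beta_r[c, k]) for k in z])  # visual words r_dn

z_bar = np.bincount(z, minlength=K) / N            # empirical topic frequency z-bar_d
x = rng.multivariate_normal(A @ z_bar + mu, Lambda_inv)    # x_d ~ N(A z-bar_d + mu, Lambda^{-1})
v = np.exp(x - x.max())                            # caption topic proportions v_d via
v /= v.sum()                                       #   softmax (shifted for stability)
y = rng.choice(L, size=M, p=v)                     # caption topic assignments y_dm
w = np.array([rng.choice(Vw, p=beta_w[c, l]) for l in y])  # caption words w_dm
```

Note how supervision enters twice: the class label c selects both the Dirichlet prior over image topics and the class-specific word distributions, while the regression x = A z̄ + µ + noise couples image topics to caption topics.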
In particular, as in [10], the term E_q[log v_dl] = E_q[x_dl] − E_q[log Σ_{l'=1}^L e^{x_dl'}] cannot be maximized directly; instead its convex lower bound is maximized:

  E_q[log v_dl] ≥ x̃_dl − (1/ξ_d) Σ_{l'=1}^L e^{x̃_dl' + γ_dl'/2} − log ξ_d + 1,

where ξ_d is an auxiliary variational parameter and γ_dl is the l-th variational variance of x_dl.

Table 1. Variational posterior distributions and updating equations for the variational parameters.

  q(Θ) = ∏_d Dir(θ_d | α̃_d),   with   α̃_d = Σ_{i=1}^C c_di α_i + Σ_{n=1}^N τ_dn.

  q(Z) = ∏_d ∏_n ∏_{k=1}^K τ_dnk^{z_dnk},   with
    log τ_dnk ∝ ψ(α̃_dk) − ψ(α̃_d1 + ... + α̃_dK) + Σ_{i=1}^C Σ_{j=1}^{V_r} c_di r_dnj log β^r_{ikj}
              + (1/N) a_k^T Λ (x̃_d − µ) − (1/(2N²)) [diag(A^T Λ A)]_k − (1/N²) a_k^T Λ A Σ_{m≠n} τ_dm,
    where a_k denotes the k-th column of A.

  q(Y) = ∏_d ∏_{m=1}^M ∏_{l=1}^L ρ_dml^{y_dml},   with
    log ρ_dml ∝ x̃_dl + Σ_{i=1}^C Σ_{j=1}^{V_w} c_di w_dmj log β^w_{ilj}.

  q(X) = ∏_d N(x_d | x̃_d, Γ_d^{-1});   x̃_d and the variances γ_dl are determined by the Newton–Raphson method.

3.3. Parameter Estimation

Coordinate ascent algorithms for updating the variational parameters are summarized in Table 1. The regression parameters {A, µ, Λ^{-1}} are updated as

  A = [ Σ_{d=1}^D (x̃_d − µ) z̄_d^T ] [ Σ_{d=1}^D (1/N²) ( Σ_n diag(τ_dn) + Σ_n Σ_{m≠n} τ_dm τ_dn^T ) ]^{-1},
  µ = (1/D) Σ_{d=1}^D ( x̃_d − A z̄_d ),
  Λ^{-1} = (1/D) Σ_{d=1}^D [ (x̃_d − µ)(x̃_d − µ)^T + Γ_d^{-1} − A z̄_d (x̃_d − µ)^T ],

where z̄_d = (1/N) Σ_{n=1}^N τ_dn. The multinomial parameters {β^r, β^w} are updated as

  β^r_{ikj} = Σ_d Σ_n c_di τ_dnk r_dnj / Σ_{j'=1}^{V_r} Σ_d Σ_n c_di τ_dnk r_dnj',
  β^w_{ilj} = Σ_d Σ_m c_di ρ_dml w_dmj / Σ_{j'=1}^{V_w} Σ_d Σ_m c_di ρ_dml w_dmj'.

The Dirichlet parameters {α_c} are updated using the Newton–Raphson method, as in LDA [5]. Given a test image r_{1:N}, the class label and annotations are determined by choosing the most probable ones among the conditional probabilities p(c_d | r_{1:N}) and p(w_dm | r_{1:N}).

4. EXPERIMENTS

We use the 8-category subset of the LabelMe dataset [12] to perform image annotation experiments. Categories include coast, forest, highway, inside city, mountain, open country, street, and tall building. This subset has 2686 images of size 256 × 256 with complete annotations. We use 128-dimensional SIFT descriptors [8] computed on 20 × 20 image patches, where each image patch is obtained by sliding a window with a 20-pixel interval. Then we run k-means clustering on the 128-dimensional descriptors to learn a 256-word visual codebook. For the annotation words, we remove the words appearing less than 3 times in the whole data. Finally we have a complete set of triples (visual words, caption words, class label). The whole dataset is separated into a training set of size 2000 and a test set of size 686.

We evaluate the performance in terms of caption perplexity, defined as

  Perplexity = exp{ − Σ_d Σ_{m=1}^{M_d} log p(w_dm | r_{d,1:N}) / Σ_d M_d },

where p(w_dm | r_{d,1:N}) is the conditional probability of a caption word given an image r_{d,1:N} and M_d is the number of caption words in document d. A higher conditional likelihood leads to a lower perplexity.
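The caption perplexity defined above is straightforward to compute once a model supplies the per-word conditional probabilities. The sketch below assumes a hypothetical container `cond_probs` holding one array of probabilities p(w_dm | r_{d,1:N}) per test image; the function name is ours, not from the paper.

```python
import numpy as np

def caption_perplexity(cond_probs):
    """Perplexity = exp(-(sum_d sum_m log p(w_dm | r_d)) / (sum_d M_d)).

    cond_probs: list with one 1-D array per test image; entry m of
    array d is the conditional probability of that image's m-th
    caption word under the model being evaluated.
    """
    log_lik = sum(np.log(p).sum() for p in cond_probs)   # total log likelihood
    n_words = sum(len(p) for p in cond_probs)            # total caption words
    return float(np.exp(-log_lik / n_words))

# Sanity check: assigning every caption word a uniform probability over
# a 50-word vocabulary gives a perplexity of 50 (up to rounding).
uniform = [np.full(m, 1.0 / 50) for m in (3, 4, 5)]
perp = caption_perplexity(uniform)
```

Lower perplexity corresponds to a higher conditional likelihood of the held-out captions, which is why Table 2 reports smaller numbers for the better model.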
The performance comparison to the previous state of the art, tr-mmLDA, is summarized in Table 2, where our sMMLDA outperforms tr-mmLDA.

Table 2. Perplexity comparison.

  Method               | K = 25 | K = 30
  tr-mmLDA [10]        |  35    |  36
  sMMLDA (our method)  |  28.5  |  30.4

5. CONCLUSIONS

In this paper we have presented a multi-modal extension of LDA with supervision, leading to sMMLDA. We have developed variational inference algorithms to approximately compute posterior distributions over the variables of interest in sMMLDA. Applications to image annotation demonstrated the high performance of sMMLDA compared to the previous state of the art.

Acknowledgments: This work was supported by the National Research Foundation (NRF) of Korea (NRF-2013R1A2A2A01067464) and the POSTECH Rising Star Program.
6. REFERENCES

[1] D. Agarwal and B.-C. Chen, "Regression-based latent factor models," in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Paris, France, 2009.
[2] D. M. Blei, "Probabilistic topic models," Communications of the ACM, vol. 55, no. 4, pp. 77-84, 2012.
[3] D. M. Blei and M. I. Jordan, "Modeling annotated data," in Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Toronto, Canada, 2003.
[4] D. M. Blei and J. McAuliffe, "Supervised topic models," in Advances in Neural Information Processing Systems (NIPS), vol. 20, MIT Press, 2008.
[5] D. M. Blei, A. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[6] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, 2005.
[7] S. Lacoste-Julien, F. Sha, and M. I. Jordan, "DiscLDA: Discriminative learning for dimensionality reduction and classification," in Advances in Neural Information Processing Systems (NIPS), vol. 21, 2009.
[8] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[9] S. Park, Y.-D. Kim, and S. Choi, "Hierarchical Bayesian matrix factorization with side information," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Beijing, China, 2013.
[10] D. Putthividhya, H. T. Attias, and S. S. Nagarajan, "Topic regression multi-modal latent Dirichlet allocation for image annotation," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 2010.
[11] D. Ramage, D. Hall, R. Nallapati, and C. Manning, "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 2009.
[12] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation,"
International Journal of Computer Vision, vol. 77, pp. 157-173, 2008.
[13] C. Wang, D. M. Blei, and L. Fei-Fei, "Simultaneous image classification and annotation," in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 2009.