A Probabilistic Multimedia Retrieval Model and Its Evaluation


EURASIP Journal on Applied Signal Processing 2003:2, 186–198
© 2003 Hindawi Publishing Corporation

Thijs Westerveld
National Research Institute for Mathematics and Computer Science (CWI), P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
Email: thijs@cwi.nl

Arjen P. de Vries
National Research Institute for Mathematics and Computer Science (CWI), P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
Email: arjen@cwi.nl

Alex van Ballegooij
National Research Institute for Mathematics and Computer Science (CWI), P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
Email: alexb@cwi.nl

Franciska de Jong
University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
Email: fdejong@cs.utwente.nl

Djoerd Hiemstra
University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands
Email: hiemstra@cs.utwente.nl

Received 2 March 2002 and in revised form November 2002

We present a probabilistic model for the retrieval of multimodal documents. The model is based on Bayesian decision theory and combines models for text-based search with models for visual search. The textual model is based on the language modelling approach to text retrieval, and the visual information is modelled as a mixture of Gaussian densities. Both models have proved successful on various standard retrieval tasks. We evaluate the multimodal model on the search task of TREC's video track. We found that the disclosure of video material based on visual information only is still too difficult. Even with purely visual information needs, text-based retrieval still outperforms visual approaches. The probabilistic model is useful for text, visual, and multimedia retrieval. Unfortunately, simplifying assumptions that reduce its computational complexity degrade retrieval effectiveness. Regarding the question whether the model can effectively combine information from different modalities, we conclude that whenever both modalities yield reasonable scores, a combined run outperforms the individual runs.
Keywords and phrases: multimedia retrieval, evaluation, probabilistic models, Gaussian mixture models, language models.

1. INTRODUCTION

Both image analysis and video motion processing have been unable to meet the requirements for disclosing the content of large-scale unstructured video archives. There appear to be two major unsolved problems in the indexing and retrieval of video material on the basis of these technologies, namely, (a) image and video processing is still far away from understanding the content of a picture in the sense of a knowledge-based understanding and (b) there is no effective query language (in the wider sense) for searching image and video databases. Unlike the target content in the field of text retrieval, the content of video archives is hard to capture at the conceptual level. An increasing number of developers that accept this analysis of the state of the art in the field have started to use human language as the media interlingua, making the assumption that as long as there is no possibility to carry out both a broad-scale recognition of visual objects and an automatic mapping from such objects to linguistic representations, the detailed content of video material is best disclosed through the linguistic content (text) that may be associated with the images: speech transcripts,

manually generated annotations, subtitles, captions, and so on [1]. Since the recent advances in automatic speech recognition, the potential role of speech transcripts in improving the disclosure of multimedia archives has been given a lot of attention. One of the insights gained by these investigations is that, for the purpose of indexing and retrieval, perfect word recognition is not an indispensable condition, since not every word will have to make it into the index, relevant words are likely to occur more than once, and not every expression in the index is likely to be queried. Research into the differences between text retrieval and spoken document retrieval indicates that, given the current level of performance of information retrieval techniques, recognition errors do not add new problems to the retrieval task [2, 3]. The limitations inherent in the deployment of language features only have already led to several attempts to deal with the requirements of video retrieval by a closer integration of human language technology and image processing. The notions of multimodal and, even more ambitious, cross-modal retrieval have come into use to refer to the exploitation of the analysis of a variety of feature types in representing and indexing aspects of video documents [4, 5, 6, 7, 8, 9]. As indicated, many useful tools and techniques have become available from various research areas that have contributed to the domain of multimedia retrieval, but the integration of automatically generated multimodal metadata is most often done in an ad hoc manner. The various information modalities that play a role in video documents are each handled by different tools. How the various analyses affect the retrieval performance is hard to establish, and it is impossible to give an explanation of performance results in terms of a formal retrieval model. This paper describes an approach which employs both textual and image features and represents them in terms of one uniform theoretical framework.
The output from various feature extraction tools is represented in probabilistic models based on Bayesian decision theory, and the resulting model is a transparent combination of two similar models: one for textual features based on language models for text and speech retrieval [10], and the other for image features based on a mixture of Gaussian densities [11]. Initial deployment of the approach within the search tasks for the video retrieval tracks in TREC-2001 [12] and TREC-2002 [13] has demonstrated the possibility of using this model in retrieval experiments for unstructured video content. Additional experiments have taken place for smaller test collections.

Section 2 of this paper describes the general probabilistic retrieval model and its textual (Section 2.1) and visual (Section 2.2) constituents. Section 3 presents the experimental setup, followed by a number of experimental results to evaluate the effectiveness of the retrieval model. Finally, Section 4 summarises our main conclusions.

2. PROBABILISTIC RETRIEVAL MODEL

If we reformulate the information retrieval problem as one of pattern classification, the goal is to find the class to which the query belongs. Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_M\}$ be the set of classes underlying our document collection and $Q$ be a query representation. Using the optimal Bayes or maximum a posteriori classifier, we can then find the class $\omega$ with minimal probability of classification error,

\[ \omega = \arg\max_i P(\omega_i \mid Q). \tag{1} \]

In a retrieval setting, the best strategy is to rank classes by increasing probability of classification error. When no classification is available, we can simply let each document be a separate class. It is hard to estimate (1) directly; therefore, we reverse the probabilities using Bayes' rule:

\[ \omega = \arg\max_i \frac{P(Q \mid \omega_i)\, P(\omega_i)}{P(Q)} = \arg\max_i P(Q \mid \omega_i)\, P(\omega_i). \tag{2} \]

If the a priori probabilities of all classes are equal (i.e., $P(\omega_i)$ is uniform), the maximum a posteriori classifier (2) reduces to the maximum likelihood classifier, which is approximated by the Kullback-Leibler (KL) divergence between query model and class model:

\[ \omega = \arg\min_i \mathrm{KL}\bigl[P_q(x) \,\|\, P_i(x)\bigr]. \tag{3} \]
The KL-divergence measures the amount of information there is to discriminate one model from another. The best matching document is the document whose model is hardest to discriminate from the query model. Figure 1 illustrates the retrieval framework. We build models for queries and documents and compare them using the KL-divergence between the models.¹ The visual part is modelled as a mixture of Gaussians (see Section 2.2); for the textual part, we use the language modelling approach, in which documents are treated as bags of words (see Section 2.1). The KL-divergence between query model and document model is defined as follows:

\[ \mathrm{KL}\bigl[P_q(x) \,\|\, P_i(x)\bigr] = \int P(x \mid \omega_q) \log \frac{P(x \mid \omega_q)}{P(x \mid \omega_i)}\,dx = \int P(x \mid \omega_q) \log P(x \mid \omega_q)\,dx - \int P(x \mid \omega_q) \log P(x \mid \omega_i)\,dx. \tag{4} \]

The first integral is independent of $\omega_i$ and can be ignored; thus,

¹ The query model is here, like the document models, represented as a Gaussian mixture model, but it can also be represented as a bag of blocks (see Section 2.2).

Figure 1: Retrieval framework: the image is represented as a Gaussian mixture and the text as a language model ("bags of words"); query model and document models are compared using the KL-divergence.

\[ \omega = \arg\min_i \mathrm{KL}\bigl[P_q(x) \,\|\, P_i(x)\bigr] = \arg\max_i \int P(x \mid \omega_q) \log P(x \mid \omega_i)\,dx. \tag{5} \]

When working with multimodal material like video, the documents in our collection contain features in different modalities. This means that the classes underlying our document collection may contain different feature subclasses. The class conditional densities can thus be described as mixtures of feature densities:

\[ P(x \mid \omega_i) = \sum_{f=1}^{F} P(x \mid \omega_{i,f})\, P(\omega_{i,f}), \tag{6} \]

where $F$ is the number of underlying feature subclasses, $P(\omega_{i,f})$ is the probability of subclass $f$ of class $\omega_i$, and $P(x \mid \omega_{i,f})$ is the subclass conditional density for this subclass. When we draw a random sample from class $\omega_i$, we first select a feature subclass according to $P(\omega_{i,f})$ and then draw a sample from this subclass using $P(x \mid \omega_{i,f})$. To arrive at a generic expression for similarity between mixture models, Vasconcelos [11] partitions the feature space into disjoint subspaces, where each point in the feature space is assigned to the subspace corresponding to the most probable feature subclass:

\[ \chi_k = \bigl\{\, x : P(\omega_{i,k} \mid x) \ge P(\omega_{i,l} \mid x),\ \forall l \neq k \,\bigr\}. \tag{7} \]
Using this partition, (5) can be rewritten as (the proof is given in [11])

\[ \int P(x \mid \omega_q) \log P(x \mid \omega_i)\,dx = \sum_{f,k} P(\omega_{q,f}) \Bigl[ \log P(\omega_{i,k}) \int_{\chi_k} P(x \mid \omega_{q,f})\,dx + \int_{\chi_k} P(x \mid \omega_{q,f}) \log \frac{P(x \mid \omega_{i,k})}{P(\omega_{i,k} \mid x)}\,dx \Bigr]. \tag{8} \]

When the subspaces $\chi_k$ form the same hard partitioning of the feature space for all query and document models, that is, when

\[ P(\omega_{i,k} \mid x) = P(\omega_{q,k} \mid x) = \begin{cases} 1, & \text{if } x \in \chi_k, \\ 0, & \text{otherwise}, \end{cases} \tag{9} \]

then

\[ \int_{\chi_k} P(x \mid \omega_{q,f})\,dx = \begin{cases} 1, & \text{if } f = k, \\ 0, & \text{otherwise}, \end{cases} \qquad P(\omega_{i,k} \mid x) = 1,\ \forall x \in \chi_k. \tag{10} \]

This reduces (8) to

\[ \int P(x \mid \omega_q) \log P(x \mid \omega_i)\,dx = \sum_f P(\omega_{q,f}) \log P(\omega_{i,f}) + \sum_f P(\omega_{q,f}) \int_{\chi_f} P(x \mid \omega_{q,f}) \log P(x \mid \omega_{i,f})\,dx. \tag{11} \]
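The derivation above reduces ranking to the cross-entropy term $\int P(x \mid \omega_q) \log P(x \mid \omega_i)\,dx$, since the query-entropy term is the same for every document. A minimal discrete sketch of this ranking (the vocabularies and probabilities below are made up for illustration, not taken from the paper):

```python
import math

def cross_entropy_score(query_model, doc_model, floor=1e-12):
    """Ranking part of the KL-divergence: sum_x P_q(x) * log P_d(x).

    Higher is better; the query-entropy term is constant across
    documents and is dropped, as in the derivation above. `floor`
    stands in for smoothing of unseen events in this toy setting.
    """
    return sum(p_q * math.log(doc_model.get(x, floor))
               for x, p_q in query_model.items())

# Toy discrete models (hypothetical probabilities).
query = {"car": 0.5, "race": 0.5}
docs = {
    "d1": {"car": 0.4, "race": 0.4, "track": 0.2},
    "d2": {"weather": 0.9, "car": 0.1},
}
# d1, whose model is hardest to discriminate from the query model,
# ranks first.
ranked = sorted(docs, key=lambda d: cross_entropy_score(query, docs[d]),
                reverse=True)
```

The document whose model minimises the KL-divergence to the query model maximises this cross-entropy score, so sorting by it descending reproduces the ranking of (5).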

This ranking formula is general and can, in principle, be used for any kind of multimodal document collection. In the rest of the paper, we limit ourselves to video collections represented by still frames and speech-recognised transcripts. The classes underlying our collection are defined through the shots in the videos. Furthermore, we assume that we have two feature subclasses, namely, a subclass generating textual features and another generating visual features. We can now partition the feature space into two distinct subspaces for textual and visual features: $\chi_t$ and $\chi_v$. This partitioning is hard, that is, a feature can be textual or visual but never both. Our ranking formula becomes

\[ \begin{aligned} \omega &= \arg\max_i \int P(x \mid \omega_q) \log P(x \mid \omega_i)\,dx \\ &= \arg\max_i \Bigl[ P(\omega_{q,t}) \log P(\omega_{i,t}) + P(\omega_{q,t}) \int_{\chi_t} P(x \mid \omega_{q,t}) \log P(x \mid \omega_{i,t})\,dx \\ &\qquad\quad + P(\omega_{q,v}) \log P(\omega_{i,v}) + P(\omega_{q,v}) \int_{\chi_v} P(x \mid \omega_{q,v}) \log P(x \mid \omega_{i,v})\,dx \Bigr]. \end{aligned} \tag{12} \]

The mixture probabilities for the textual and visual models, $P(\omega_{i,t})$ and $P(\omega_{i,v})$, might be derived from background knowledge about the class $\omega_i$. If, for example, we know that $\omega_i$ is a class from a news broadcast, we might assign a higher value to $P(\omega_{i,t})$, since the probability that there is text that helps us in finding relevant information is relatively high. On the other hand, if $\omega_i$ is from a documentary or a silent movie, we might gain less information from the text from $\omega_i$ and assign a lower value to $P(\omega_{i,t})$. At the moment, however, we have no background information; therefore, we do not distinguish between classes and use uniform mixture probabilities. This means that the first and third terms from (12) are independent of $\omega_i$ and can be ignored. Our final (general) ranking formula becomes

\[ \omega = \arg\max_i \Bigl[ P(t) \int_{\chi_t} P(x \mid \omega_{q,t}) \log P(x \mid \omega_{i,t})\,dx + P(v) \int_{\chi_v} P(x \mid \omega_{q,v}) \log P(x \mid \omega_{i,v})\,dx \Bigr], \tag{13} \]

where $P(t)$ and $P(v)$ are the class-independent probabilities of drawing textual and visual features, respectively.

2.1. Text retrieval

For the textual part of our ranking function, we use statistical language models.
A famous application of these models is Shannon's illustration of the implications of coding and information theory using models of letter sequences and word sequences [14]. In the 1970s, statistical language models were developed as a general natural-language-processing tool, first for automatic speech recognition [15] and later also for, for example, part-of-speech tagging [16] and machine translation [17]. Recently, statistical language models have been suggested for information retrieval by Ponte and Croft [18], Hiemstra [19], and Miller et al. [20]. The language modelling approach to information retrieval defines a simple unigram language model for each document in a collection. For each document $\omega_{i,t}$, the language model defines the probability $P(x_{t,1}, \ldots, x_{t,N_t} \mid \omega_{i,t})$ of a sequence of $N_t$ textual features (i.e., words) $x_{t,1}, \ldots, x_{t,N_t}$, and the documents are ranked by that probability. The standard language modelling approach to information retrieval uses a linear interpolation of the document model $P(x_{t,j} \mid \omega_i)$ with a general collection model $P(x_{t,j})$ [19, 20, 21, 22]. As these models operate on discrete signals, the integral from (13) can be replaced by a sum. Furthermore, if we use the empirical distribution of the query as the query model, then the standard textual part of (13) is

\[ \omega_t = \arg\max_i \frac{1}{N_t} \sum_{j=1}^{N_t} \log\bigl[\lambda P(x_{t,j} \mid \omega_i) + (1-\lambda) P(x_{t,j})\bigr]. \tag{14} \]

The linear combination needs a smoothing parameter $\lambda$, which is set empirically on some test collection or, alternatively, estimated by the expectation-maximisation (EM) algorithm [23] on a test collection. The probability of drawing textual feature $x_{t,j}$ from document $\omega_i$, $P(x_{t,j} \mid \omega_i)$, is computed as follows: if the document contains 100 terms in total and the term $x_{t,j}$ occurs 2 times, this probability would simply be $2/100 = 0.02$. Similarly, $P(x_{t,j})$ is the probability of drawing $x_{t,j}$ from the entire document collection. Using the statistical language modelling approach for video retrieval, we would like to exploit the hierarchical data model of video, in which a video is subdivided into scenes, which are subdivided into shots, which are, in turn, subdivided into frames.
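The interpolated unigram ranking above can be sketched as follows (a minimal illustration with toy documents; the value λ = 0.85 is arbitrary, not the paper's tuned setting, and query terms are assumed to occur somewhere in the collection):

```python
import math
from collections import Counter

def lm_score(query_terms, doc_terms, collection_terms, lam=0.85):
    """Smoothed unigram score: the mean over query terms of
    log[ lam * P(term | doc) + (1 - lam) * P(term | collection) ].

    Term probabilities are relative frequencies, as in the 2/100 = 0.02
    example above. Assumes every query term occurs in the collection.
    """
    doc_tf, coll_tf = Counter(doc_terms), Counter(collection_terms)
    doc_len, coll_len = len(doc_terms), len(collection_terms)
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf[t] / doc_len      # foreground: document model
        p_coll = coll_tf[t] / coll_len   # background: collection model
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score / len(query_terms)
```

Because of the background model, a document that misses a query term still gets a nonzero (but low) score, so ranking degrades gracefully rather than failing on partial matches.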
Statistical language models are particularly well suited for modelling such complex representations of the data. We can simply extend the mixture to include the different levels of the hierarchy, with models for shots and scenes:²

\[ \mathrm{Shot} = \arg\max_i \frac{1}{N_t} \sum_{j=1}^{N_t} \log\bigl[\lambda_{\mathrm{Shot}} P(x_{t,j} \mid \mathrm{Shot}_i) + \lambda_{\mathrm{Scene}} P(x_{t,j} \mid \mathrm{Scene}_i) + \lambda_{\mathrm{Coll}} P(x_{t,j})\bigr], \quad \text{with } \lambda_{\mathrm{Coll}} = 1 - \lambda_{\mathrm{Shot}} - \lambda_{\mathrm{Scene}}. \tag{15} \]

The main idea behind this approach is that a good shot contains the query terms and is part of a scene having more occurrences of the query terms. Also, by including scenes in

² We assume that each shot is a separate class and replace $\omega_i$ with $\mathrm{Shot}_i$.

the ranking function, we hope to retrieve the shot of interest even if the video's speech describes the shot just before it begins or just after it has finished. Depending on the information need of the user, we might use a similar strategy to rank scenes or complete videos instead of shots; that is, the best scene might be a scene that contains a shot in which the query terms (co-)occur.

2.2. Image retrieval

In order to specialise the visual part of our ranking formula (13), we need to estimate the class conditional densities for the visual features $P(x_v \mid \omega_i)$. We follow Vasconcelos [11] and model them using Gaussian mixture models. The idea behind modelling shots as a mixture of Gaussians is that each shot contains a certain number of classes or components and that each sample from a shot (i.e., each block of 8 by 8 pixels extracted from a frame) was generated by one of these components. The class conditional densities for a Gaussian mixture model are defined as follows:

\[ P(x_v \mid \omega_i) = \sum_{c=1}^{C} P(\theta_{i,c})\, \mathcal{G}\bigl(x_v, \mu_{i,c}, \Sigma_{i,c}\bigr), \tag{16} \]

where $C$ is the number of components in the mixture model, $\theta_{i,c}$ is component $c$ of class model $\omega_i$, and $\mathcal{G}(x, \mu, \Sigma)$ is the Gaussian density with mean vector $\mu$ and covariance matrix $\Sigma$,

\[ \mathcal{G}(x, \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}}\, e^{-\frac{1}{2} \|x-\mu\|^2_{\Sigma}}, \tag{17} \]

where $n$ is the dimensionality of the feature space and

\[ \|x-\mu\|^2_{\Sigma} = (x-\mu)^T \Sigma^{-1} (x-\mu). \tag{18} \]

2.2.1. Estimating model parameters

The parameters of the models for a given shot can be estimated using the EM algorithm. This algorithm iterates between estimating the a posteriori class probabilities for each sample, $P(\theta_c \mid x_v)$ (the E-step), and re-estimating the components' parameters, $\mu_c$, $\Sigma_c$, and $P(\theta_c)$, based on the sample distribution (the M-step).³ The approach is rather general: any kind of feature vectors can be used to describe samples. Our sampling process is as follows (it is illustrated in Figure 2). First, we convert the keyframe of a shot to the YCbCr colour space. Then, we cut it into distinct blocks of 8 by 8 pixels. On these blocks, we perform the discrete cosine transform (DCT) for each of the colour channels.
We now take the first 10 DCT coefficients from the Y channel and only the DC coefficient from both the Cb and the Cr channels to describe the samples. These feature vectors are then fed to the EM algorithm to find the parameters ($\mu_c$, $\Sigma_c$, and $P(\theta_c)$). The EM algorithm first assigns each sample to a random component. Next, we compute the parameters ($\mu_c$, $\Sigma_c$, and $P(\theta_c)$) for each component, based on the samples assigned to that component. We re-estimate the class assignments, that is, we compute the posterior probabilities ($P(\theta_c \mid x)$ for all $c$).⁴ We iterate between estimating class assignments (expectation step) and estimating class parameters (maximisation step) until the algorithm converges. Figure 3 shows a query image and the component assignments after different iterations of the EM algorithm. Instead of a random initialisation, we initially assigned the left-most part of the samples to component 1, the samples in the middle to component 2, and the right-most samples to component 3. This way, it is clearly visible how the component assignments move about the image. Finally, after convergence of the EM algorithm, we describe the position in the image plane of each component as a 2D Gaussian with mean and covariance computed from the positions of the samples assigned to this component.

³ Looking at a single shot, we can drop the class subscripts.

2.2.2. Bags of blocks

Just like in our textual approach, for the query model we can simply take the empirical distribution of the query samples. If a query image $x_v$ consists of $N_v$ samples, $x_v = (x_{v,1}, x_{v,2}, \ldots, x_{v,N_v})$, then $P(x_{v,i} \mid \omega_q) = 1/N_v$. For the document model, we take a mixture of foreground and background probabilities, that is, the (foreground) probability of drawing a query sample from the document's Gaussian mixture model, and the (background) probability of drawing it from any Gaussian mixture in the collection. In other words, the query image is viewed as a bag of blocks (BoB), and its probability is estimated as the joint probability of all its blocks.
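The EM fitting of Section 2.2.1 can be sketched in simplified form. The paper fits full-covariance Gaussian mixtures to multi-dimensional DCT feature vectors; this toy version keeps the same E-step/M-step structure for one-dimensional data, and the deterministic spread-out initialisation is an illustrative choice, not the paper's positional scheme:

```python
import math

def em_gmm_1d(samples, n_components=2, n_iter=50):
    """Toy 1-D EM for a Gaussian mixture, mirroring Section 2.2.1:
    the E-step computes posterior component assignments P(theta_c | x);
    the M-step re-estimates weights, means, and variances from the
    samples weighted by those posteriors."""
    lo, hi = min(samples), max(samples)
    # Deterministic spread-out initialisation (illustrative choice).
    means = [lo + (c + 0.5) * (hi - lo) / n_components
             for c in range(n_components)]
    variances = [1.0] * n_components
    weights = [1.0 / n_components] * n_components

    def gauss(x, mu, var):
        return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

    for _ in range(n_iter):
        # E-step: posterior P(theta_c | x) for every sample.
        posteriors = []
        for x in samples:
            joint = [w * gauss(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            norm = sum(joint) or 1e-300
            posteriors.append([j / norm for j in joint])
        # M-step: weighted re-estimation of the component parameters.
        for c in range(n_components):
            resp = [p[c] for p in posteriors]
            n_c = sum(resp) or 1e-300
            weights[c] = n_c / len(samples)
            means[c] = sum(r * x for r, x in zip(resp, samples)) / n_c
            variances[c] = max(
                sum(r * (x - means[c]) ** 2
                    for r, x in zip(resp, samples)) / n_c,
                1e-6)  # variance floor avoids degenerate components
    return weights, means, variances
```

On well-separated data the means converge to the cluster centres within a few iterations; the weighted M-step updates correspond to the footnoted remark that samples contribute to each component in proportion to their posterior membership.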
The BoB measure for query images then becomes

\[ \omega_v = \arg\max_i \frac{1}{N_v} \sum_{j=1}^{N_v} \log\bigl[\kappa P(x_{v,j} \mid \omega_i) + (1-\kappa) P(x_{v,j})\bigr], \tag{19} \]

where $\kappa$ is a mixing parameter and the background probability $P(x_{v,j})$ can be found by marginalising over all $M$ documents in the collection:

\[ P(x_{v,j}) = \sum_{i=1}^{M} P(x_{v,j} \mid \omega_i)\, P(\omega_i). \tag{20} \]

Again, we assume uniform document priors ($P(\omega_i) = 1/M$ for all $i$). In text retrieval, one of the reasons for mixing the document model with a collection model is to assign nonzero probabilities to words that are not observed in a document. Smoothing is not necessary in the visual case, since the documents are modelled as mixtures of Gaussians having infinite support. Another motivation for mixing is to weight term importance: a common sample $x$ (i.e., a sample that occurs frequently in the collection) has a relatively

⁴ In practice, a sample does not always belong entirely to one component. In fact, we compute means, covariances, and priors on the weighted feature vectors, where the feature vectors are weighted by their proportion of belonging to the class under consideration.

Figure 2: Building a Gaussian mixture model from an image (split the colour channels Y, Cb, and Cr; take samples; compute DCT coefficients; run the EM algorithm).

high probability $P(x)$ (equal for all documents) and, therefore, $P(x \mid \omega)$ has only little influence on the probability estimate. In other words, common terms and common blocks influence the final ranking only marginally.

2.2.3. Asymptotic likelihood approximation

A disadvantage of using the BoB measure is its computational complexity. In order to rank the collection given a query, we need to compute the posterior probability $P(x_v \mid \omega_i)$

Figure 3: Class assignments (3 classes) for the image at the top, after different numbers of iterations of the EM algorithm.

of each image block $x_v$ in the query for each document $\omega_i$ in the collection. For evaluating a retrieval method, this is fine, but for an interactive retrieval system, optimisation is necessary. An alternative is to represent the query image, like the document image, as a Gaussian model (instead of by its empirical distribution as a bag of blocks) and then compare these two models using the KL-divergence. Yet, if we use Gaussians to model the class conditional densities of the mixture components, there is no closed-form solution for the visual part of the resulting ranking formula (13). As a solution, Vasconcelos assumes that the Gaussians are well separated and derives an approximation ignoring the overlap between the mixture components: the asymptotic likelihood approximation (ALA) [11]. Starting from (8), he arrives at

\[ \begin{aligned} \omega_v &= \arg\max_i \int_{\chi_v} P(x_v \mid \omega_q) \log P(x_v \mid \omega_i)\,dx_v \approx \arg\max_i \mathrm{ALA}\bigl[P_q(x_v) \,\|\, P_i(x_v)\bigr] \\ &= \arg\max_i \sum_c P(\theta_{q,c}) \Bigl\{ \log P(\theta_{i,\alpha(c)}) + \log \mathcal{G}\bigl(\mu_{q,c}, \mu_{i,\alpha(c)}, \Sigma_{i,\alpha(c)}\bigr) - \tfrac{1}{2}\, \mathrm{trace}\bigl[\Sigma_{i,\alpha(c)}^{-1} \Sigma_{q,c}\bigr] \Bigr\}, \\ &\quad \text{where } \alpha(c) = k \iff \|\mu_{q,c} - \mu_{i,k}\|_{\Sigma_{i,k}} < \|\mu_{q,c} - \mu_{i,l}\|_{\Sigma_{i,l}},\ \forall l \neq k. \end{aligned} \tag{21} \]

In this equation, subscripts indicate, respectively, classes and components (e.g., $\mu_{i,c}$ is the mean for component $\theta_c$ of class $\omega_i$).

2.3. ALA assumptions

The main assumption behind the ALA is that the Gaussians for the components $\theta_c$ within a class model $\omega_i$ have small overlap; in fact, there are two parts to this [11]. The first assumption is that each image sample is assigned to one and only one of the mixture components. The second is that samples from the support set of a single query component are all assigned to the same document component. More formally, we have the following assumptions.

Assumption 1. For each sample, the component with maximum posterior probability has posterior probability one:

\[ \forall \omega_i,\ \forall x:\ \max_k P(\theta_{i,k} \mid x) = 1. \tag{22} \]

Assumption 2.
For any document $\omega_j$, the component with maximum posterior probability is the same for all samples of the support set of a single query component $\theta_{q,k}$:

\[ \forall \theta_{q,k},\ \forall \omega_j\ \exists l_1:\ \forall x,\ P(x \mid \theta_{q,k}) > 0 \implies \arg\max_l P(\theta_{j,l} \mid x) = l_1. \tag{23} \]

We used Monte Carlo simulation to test these assumptions on our collection (the TREC-2002 video collection, see Section 3.1) as follows. First, we took a random document $\omega_i$ from the search collection and then a random mixture component $\theta_{i,k}$ from the mixture model of this document. We then drew 1,000 random samples from this component and, for each sample $x$, computed

(i) $P(\theta_{i,l} \mid x)$, the posterior component assignment within document $i$, for all components $\theta_{i,l}$;
(ii) $P(\theta_{j,m} \mid x)$, the posterior component assignment in a different, randomly chosen document $j$, for all components $\theta_{j,m}$.

For the first measure, we simply took the maximum posterior probability for each sample. We averaged the second measure over all 1,000 samples and took the maximum over all components to approximate the proportion of samples assigned to the most probable component

Figure 4: Testing ALA Assumptions 1 (histogram (a), $\max_l P(\theta_{i,l} \mid x)$) and 2 (histogram (b), $\max_m P(\theta_{j,m} \mid x)$); samples $x$ are drawn from $P(x \mid \theta_{i,k})$.

(remember, there should be a component that explains all samples). We repeated this process for 1,000 iterations with documents and components selected at random, and histogrammed the results (Figure 4). Both measures should be close to 1, the first to satisfy Assumption 1 and the second to satisfy Assumption 2. As we can see from the plots in Figure 4, the first assumption appears reasonable, but the second does not hold.⁵ We investigate the effect of this observation in the retrieval experiments below.

⁵ The bar at probability zero results from a truncation error in the Bayesian inversion to compute $P(\theta_{j,m} \mid x)$ from a (too small) probability $P(x \mid \theta_{j,m})$.

3. EXPERIMENTS

We evaluated the model outlined above and the presented measures on the search task of the video track of the Text REtrieval Conference TREC-2002 [13].

3.1. TREC video track

TREC is a series of workshops for large-scale evaluation of information retrieval technology [24, 25]. The goal is to test retrieval technology on realistic test collections using uniform and appropriate scoring procedures. The general procedure is as follows: (i) a set of statements of an information need (topics) is created; (ii) participants search the collection and return the top $N$ results for each topic; (iii) returned documents are pooled and judged for relevance to the topic; (iv) systems are evaluated using the relevance judgements. The measures used in evaluation are usually precision- and recall-oriented. Precision and recall are defined as follows:

\[ \text{precision} = \frac{\text{number of relevant shots retrieved}}{\text{total number of shots retrieved}}, \qquad \text{recall} = \frac{\text{number of relevant shots retrieved}}{\text{total number of relevant shots in collection}}. \tag{24} \]

The video track was introduced at TREC-2001 to evaluate content-based retrieval from digital video [12].
Here, we use the data from the TREC-2002 video track [13]. The track defines three tasks: shot boundary detection, feature detection, and general information search. The goal of the shot boundary task is to identify shot boundaries in a given video clip. In the feature detection task, we have to assign a set of predefined features to a shot, for example, indoor, outdoor, people, and speech. In the search task, the goal is to find relevant shots given a description of an information need, expressed by a multimedia topic. Both in the feature detection task and in the search task, a predefined set of shots is to be used. In our experiments, we focus on the search task. The collection to be searched in this task consists of approximately 40 hours of MPEG-1 encoded video; in addition, a set of 23 hours of training material was available. The topics consist of a textual description of the information need, accompanied by images, video fragments, and/or audio fragments illustrating what is needed. For each topic, a system could return a ranked list of video fragments. The top returned shots of each run are then pooled and judged. We report experimental results using the standard TREC measures, average precision and mean average precision (MAP). Average precision is the average of the precision values obtained after each relevant document is retrieved (when a relevant document is not retrieved at all, its precision is assumed to be 0). MAP is the mean of the average precision values over all topics.
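As a concrete illustration of these measures, the following sketch computes average precision and MAP exactly as defined above; the function names and toy data are ours, not part of the track:

```python
def average_precision(ranked, relevant):
    """Mean of the precision values observed at each relevant shot in the
    ranking; relevant shots that are never retrieved contribute precision 0."""
    hits, precisions = 0, []
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a list of (ranked_list, relevant_set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Relevant shots s1 and s3 are found at ranks 1 and 3; shot s9 is missed:
ap = average_precision(["s1", "s2", "s3", "s4"], {"s1", "s3", "s9"})
# (1/1 + 2/3 + 0) / 3 = 5/9
```

Note how the missed shot s9 still divides the sum: this is what penalises low recall even when the retrieved shots are all relevant.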

EURASIP Journal on Applied Signal Processing

Figure 5: MAP on the video search task for different κ.

For the textual descriptions of the shots, we used speech transcripts kindly provided by LIMSI. These transcripts were aligned to the predefined video shots. We did not have or define a semantic division of the video into scenes but defined scenes simply as overlapping windows of 5 consecutive shots.6 We removed common words from the transcripts (stopping) and stemmed all terms using the Porter stemmer [26]. For the visual description, we took keyframes from the common video shots, and we used EM to find the parameters of Gaussian mixture models. Keyframe selection was straightforward: we simply used the middle frame from each shot as representative for the shot.

3.2. Estimating the mixture parameters

The model does not specify the values of the mixing parameters λ, λ_Shot, λ_Scene, and κ. An optimal value can only be found a posteriori by evaluating retrieval performance for different values on a test collection; a priori, we must make an educated guess at the right values. Figure 5 shows the MAP scores on the TREC-2002 video track search task for different values of κ. We can see that retrieval results are insensitive to the value of the mixing parameter as long as we take both foreground and background into account. The plot has a similar shape as that found in Hiemstra's thesis for the λ parameter in the standard language model [10]. For the transcripts, we tried over thirty combinations of settings, using two sets of text queries (see also Section 3.4). For query set Tlong, this resulted in optimal settings for MAP with λ_Shot = 0.090, λ_Scene = 0.210, and λ_Coll = 0.700. Here, modelling the hierarchy in the video makes sense because shot and scene both contribute to the results in the ranking (λ_Shot and λ_Scene are larger than zero). For set Tshort, however, the optimal settings had λ_Shot = 0 and the resulting model is identical to the original language model.

6 In preliminary experiments on the TREC-2001 collection, when varying the window lengths, 5 shots were the optimum.
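The shot/scene/collection interpolation tuned above can be sketched as follows. This is a minimal sketch assuming maximum-likelihood unigram estimates; the helper names are ours, and the λ defaults follow the Tlong settings reported in the text (they sum to one, so the result is again a probability):

```python
from collections import Counter

def mle(term, terms):
    """Maximum-likelihood unigram estimate P(term | model)."""
    return Counter(terms)[term] / len(terms) if terms else 0.0

def interpolated_prob(term, shot, scene, collection,
                      lam_shot=0.090, lam_scene=0.210, lam_coll=0.700):
    """Linear interpolation of the shot, scene, and collection
    language models (Jelinek-Mercer style smoothing)."""
    return (lam_shot * mle(term, shot)
            + lam_scene * mle(term, scene)
            + lam_coll * mle(term, collection))

# Toy transcripts: the term is frequent in the shot, rarer in the background.
p = interpolated_prob("bridge",
                      shot=["golden", "bridge"],
                      scene=["golden", "gate", "bridge", "span"],
                      collection=["bridge"] + ["other"] * 9)
# 0.090 * 0.5 + 0.210 * 0.25 + 0.700 * 0.1 = 0.1675
```

The collection term keeps the probability of unseen terms nonzero, which is exactly why retrieval degrades badly only when the background component is switched off entirely.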
Summarizing and ranking transcript units longer than shots is important, but we cannot conclude from these experiments whether modelling the hierarchy is really necessary. In all experiments, the differences between the better parameter choices are not significant, but a particularly bad choice may seriously degrade retrieval effectiveness. In the remainder of this work, we have used κ = 0.9, λ_Shot = 0.090, λ_Scene = 0.210, and λ_Coll = 0.700.

3.3. Using all or some image examples

In general, it is hard to guess what would be a good example image for a specific query. If we look for shots of the Golden Gate Bridge, we might not care from what angle the bridge was filmed, or whether the clip was filmed on a sunny or a cloudy day; visually, however, such examples may be quite different (Figure 6). If a user has presented three examples and no additional information, the best we can do is to try to find documents that describe all example images well. Unfortunately, a document may be ranked low even though it models the samples from one example image well, as it may not explain the samples from the other images. For each topic, we computed which of the example images would have given the best results if it had been used as the only example for that topic. We compared these best-example results to the full-topic results in which we used all available visual examples. The experiment was done using both the ALA and the BoB measure. In the full-topic case, the set of available examples was regarded as one large bag of samples. For the ALA measure, we built one mixture model to describe all available visual examples. For BoB, we ranked documents by their probability of generating all samples in all query images. For the single-image queries in the best-example case, we built a separate mixture model from each example and used it for ALA ranking. For BoB ranking, we used all samples from the single visual example. Since it is problematic to use multiple examples in a query, we wanted to see if it is possible to guess in advance what would be a good example for a specific topic.
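The BoB ranking described above, scoring a document by the probability of its mixture model generating every query sample, can be sketched as follows. The diagonal-covariance mixtures and all names here are our own simplification, not the paper's implementation:

```python
import numpy as np

def gmm_log_likelihood(samples, weights, means, variances):
    """Sum over samples x of log sum_k w_k N(x; mu_k, diag(sigma_k^2))."""
    x = samples[:, None, :]                                   # (N, 1, D)
    log_comp = -0.5 * np.sum(
        np.log(2 * np.pi * variances) + (x - means) ** 2 / variances, axis=2)
    return np.logaddexp.reduce(np.log(weights) + log_comp, axis=1).sum()

def bob_rank(query_samples, doc_models):
    """Rank documents by the joint log-probability of all query samples
    under each document's mixture model (higher is better)."""
    scores = {doc: gmm_log_likelihood(query_samples, *model)
              for doc, model in doc_models.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Two toy one-component models; the query samples sit near model "a".
models = {
    "a": (np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2))),
    "b": (np.array([1.0]), np.full((1, 2), 10.0), np.ones((1, 2))),
}
query = np.array([[0.1, -0.2], [0.3, 0.2]])
# bob_rank(query, models) ranks "a" before "b"
```

Working in log space with logaddexp avoids the underflow that a product of many small per-sample probabilities would otherwise cause.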
Therefore, for each topic, we also hand-picked a single representative from the available examples and compared these manual-example results to the other two result sets. The results for the different settings are listed in Table 1. A first thing to notice is that all scores are rather low. When we take a closer look at the topics with higher average precision scores, we see that these mainly contain examples from the search collection. In other words, we can find similar shots from within the same video, but generalisation is a problem. Comparing BoB to ALA, we see that, averaged over all topics for each set of examples, BoB outperforms ALA. For some specific topics, ALA gives higher scores, but again these are cases with examples from within the collection. In general, the BoB approach, which uses fewer assumptions, performs better. The fact that using the best image example outperforms the use of all examples shows that combining results from

Figure 6: Visual examples of the Golden Gate Bridge.

Table 1: MAP for full topics, best examples, and manual examples (BoB and ALA scores per topic, vt75–vt99).

different visual examples can indeed degrade results. Looking at the results, manually selecting good examples seems a nontrivial task, but the drop in performance is partly due to the generalisation problem. If one of the image examples happens to come from the collection, it scores high. If we fail to select that particular example, the score for the manual-example run drops. Simply counting how often the manually selected example was the same as the best-performing example, we see that this was the case for 8 of the topics.7

7 If we ignore the topics for which there is only one example and the ones for which the best example scored 0.

3.4. Using example transcripts

We took two different approaches in building textual queries from the multimedia topics. The first set of textual queries, Tshort, was constructed simply by taking the textual description from the topic. In the second set of queries, Tlong, we augmented these with the speech transcripts from the video examples available for a topic. The assumption here is that relevant shots share a vocabulary with example shots; thus, using example transcripts might improve retrieval results. In both sets of queries, we removed common words and stemmed all terms. We found that, across topics, Tlong outperformed Tshort in MAP. For detailed per-topic information, see Table 2.
3.5. Combining textual and visual runs

We combined textual and visual runs using our combined ranking formula. Since we had no data to estimate the parameters for mixing textual and visual information, we used P(t) = P(v) = 0.5. For the textual part, we tried both short and long queries; for the visual part, we used full queries and best-example queries. Table 2 shows the results for combinations with the BoB measure. We also experimented with combinations with the ALA measure, but we found that in the ALA case, it is difficult to combine textual and visual scores because they are on different scales. The BoB measure is closer to the KL-divergence and, on top of that, more similar to our textual approach, and thus easier to combine with the textual scores. For most of the topics, textual runs give the best results; however, for some topics, using the visual examples is useful. This is mainly the case either when the topics come from the search collection or when the relevant documents are outliers in the collection. This illustrates how difficult it is to search a generic video collection using visual information only. We succeed only if the relevant documents are either highly similar to the examples provided or very dissimilar from the other documents in the collection (and, therefore, relatively similar to the query examples). When both textual and visual runs have reasonable scores, combining the runs can improve on the individual runs; however, when one of them has inferior performance, a combination only adds noise and lowers the scores.

4. CONCLUSIONS

We presented a probabilistic framework for multimodal retrieval in which textual and visual retrieval models are

Table 2: Average precision per topic for textual runs (Tshort, Tlong), BoB runs (BoBfull, BoBbest), and the four combined runs (BoBfull+Tshort, BoBfull+Tlong, BoBbest+Tshort, BoBbest+Tlong), over topics vt75–vt99.

integrated seamlessly, and evaluated the framework using the search task from the TREC-2002 video track. We found that even though the topics were specifically designed for content-based retrieval and relevance was defined visually, a textual search outperforms visual search for most topics. As we have seen before [6], standard image retrieval techniques cannot readily be applied to satisfy a variety of information requests from a generic video collection. Future work has to show how incorporating different sources of additional information (e.g., contextual frames, the movement in video, or user interaction) can help improve results. In the text-only experiments, we saw that using the transcripts from the example videos in queries improves results. We also found that it is useful to take transcripts from surrounding shots into account to describe a shot. However, it is still unclear whether a hierarchical description of scenes and shots is necessary. In our visual experiments, we found that the general probabilistic framework is useful for image retrieval. However, we found that one of the assumptions underlying the ALA of the KL-divergence does not hold for the generic video collection we used.
This was reflected in the difference in performance of the ALA and the BoB model. Unfortunately, computing the joint block probabilities in the BoB model is computationally expensive and unsuitable for an interactive retrieval system. Future work will investigate ways to speed up the process. Furthermore, we noticed generalisation problems. The visual models only gave satisfying results if the relevant documents were either highly similar to the query image(s) (i.e., the query images came from the collection) or highly dissimilar to the rest of the collection (i.e., the relevant documents were outliers in the collection). When either textual or visual results are poor, combining them, thus adding noise, seems to degrade the scores. However, when both modalities yield reasonable scores, a combined run outperforms the individual runs.

REFERENCES

[1] F. de Jong, J.-L. Gauvain, D. Hiemstra, and K. Netter, Language-based multimedia information retrieval, in Proc. RIAO 2000 Content-Based Multimedia Information Access, Paris, France, April 2000.
[2] J.-L. Gauvain, L. Lamel, and G. Adda, Transcribing broadcast news for audio and video indexing, Communications of the ACM, vol. 43, no. 2, 2000.
[3] G. J. F. Jones, J. T. Foote, K. Spärck Jones, and S. J. Young, The video mail retrieval project: experiences in retrieving

spoken documents, in Intelligent Multimedia Information Retrieval, M. T. Maybury, Ed., pp. 191–214, AAAI Press/MIT Press, Cambridge, Mass, USA, 1997.
[4] K. Barnard and D. Forsyth, Learning the semantics of words and pictures, in Proc. International Conference on Computer Vision, vol. 2, pp. 408–415, Vancouver, Canada, 2001.
[5] M. La Cascia, S. Sethi, and S. Sclaroff, Combining textual and visual cues for content-based image retrieval on the world wide web, in Proc. IEEE Workshop on Content-Based Access of Image and Video Libraries, pp. 24–28, Santa Barbara, Calif, USA, June 1998.
[6] The Lowlands team, Lazy users and automatic video retrieval tools in (the) lowlands, in The 10th Text REtrieval Conference (TREC-2001), E. M. Voorhees and D. K. Harman, Eds., pp. 59–68, National Institute of Standards and Technology, NIST, Gaithersburg, Md, USA, 2002.
[7] T. Westerveld, Image retrieval: Content versus context, in Proc. RIAO 2000 Content-Based Multimedia Information Access, Paris, France, April 2000.
[8] T. Westerveld, Probabilistic multimedia retrieval, in Proc. 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 437–438, Tampere, Finland, 2002.
[9] T. Westerveld, A. P. de Vries, and A. van Ballegooij, CWI at the TREC-2002 video track, in The 11th Text REtrieval Conference (TREC-2002), E. M. Voorhees and D. K. Harman, Eds., National Institute of Standards and Technology, NIST, Gaithersburg, Md, USA, 2002.
[10] D. Hiemstra, Using language models for information retrieval, Ph.D. thesis, Centre for Telematics and Information Technology, University of Twente, The Netherlands, 2001.
[11] N. Vasconcelos, Bayesian models for visual information retrieval, Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, Mass, USA, 2000.
[12] P. Over and R. Taban, The TREC-2001 video track framework, in Proc. 10th Text REtrieval Conference (TREC-2001), E. M. Voorhees and D. K. Harman, Eds., pp. 79–87, National Institute of Standards and Technology, NIST, Gaithersburg, Md, USA, 2002.
[13] A. F. Smeaton and P.
Over, The TREC-2002 video track report, in The 11th Text REtrieval Conference (TREC-2002), E. M. Voorhees and D. K. Harman, Eds., National Institute of Standards and Technology, NIST, Gaithersburg, Md, USA, 2002.
[14] C. E. Shannon, A mathematical theory of communication, Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
[15] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, Cambridge, Mass, USA, 1997.
[16] D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun, A practical part-of-speech tagger, in Proc. 3rd Conference on Applied Natural Language Processing, pp. 133–140, Trento, Italy, 1992.
[17] P. F. Brown, J. Cocke, S. A. Della Pietra, et al., A statistical approach to machine translation, Computational Linguistics, vol. 16, no. 2, pp. 79–85, 1990.
[18] J. M. Ponte and W. B. Croft, A language modeling approach to information retrieval, in Proc. 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 275–281, Melbourne, Australia, 1998.
[19] D. Hiemstra, A linguistically motivated probabilistic model of information retrieval, in Proc. 2nd European Conference on Research and Advanced Technology for Digital Libraries, C. Nicolaou and C. Stephanidis, Eds., pp. 569–584, Heraklion, Crete, Greece, September 1998.
[20] D. R. H. Miller, T. Leek, and R. M. Schwartz, A hidden Markov model information retrieval system, in Proc. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 214–221, Berkeley, Calif, USA, August 1999.
[21] J. Lafferty and C. Zhai, Document language models, query models, and risk minimization for information retrieval, in Proc. 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 111–119, New Orleans, La, USA, September 2001.
[22] K. Ng, A maximum likelihood ratio information retrieval model, in Proc. 8th Text REtrieval Conference (TREC-8), NIST Special Publications, National Institute of Standards and Technology, NIST, Gaithersburg, Md, USA, 2000.
[23] A. P. Dempster, N. M. Laird, and D. B.
Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.
[24] E. M. Voorhees and D. K. Harman, Eds., The 11th Text REtrieval Conference (TREC-2002), National Institute of Standards and Technology, NIST, Gaithersburg, Md, USA, 2002.
[25] E. M. Voorhees and D. K. Harman, Eds., The 10th Text REtrieval Conference (TREC-2001), National Institute of Standards and Technology, NIST, Gaithersburg, Md, USA, 2002.
[26] M. F. Porter, An algorithm for suffix stripping, Program, vol. 14, no. 3, pp. 130–137, 1980.

Thijs Westerveld received the M.S. degree in computer science from the University of Twente. As a research assistant at the same university, he has participated in a number of EU projects in the area of multimedia information retrieval. Working on the national Waterland project at the CWI, the National Research Institute for Mathematics and Computer Science in the Netherlands, he investigates, for his Ph.D., the use of probabilistic models for retrieval from generic multimedia collections.

Arjen P. de Vries received his Ph.D. in computer science from the University of Twente in 1999, on the integration of content management in database systems. He is especially interested in the design of database systems that support search in multimedia digital libraries. Arjen works as a postdoctoral researcher at the CWI, the National Research Institute for Mathematics and Computer Science in the Netherlands.

Alex van Ballegooij received the M.S. degree in computer science from the Vrije Universiteit Amsterdam in 1999. He works towards his Ph.D. on the national ICES-KIS MIA project at the CWI, the National Research Institute for Mathematics and Computer Science in the Netherlands. His current research activities entail the investigation of aspects that make a database system suitable for computationally intensive tasks, specifically search in multimedia digital libraries.

Franciska de Jong has been Full Professor of language technology at the Computer Science Department of the University of Twente, Enschede, since 1992. She is also affiliated with TNO TPD in Delft. She has a background in theoretical and computational linguistics and received the Ph.D. degree from the University of Utrecht in 1991. She worked as a researcher at Philips Research on the Rosetta machine translation project (1985–1992). Currently, her main research interest is in the field of multimedia indexing and retrieval. She is frequently involved in international program committees, expert groups, and review panels and has initiated a number of EU projects.

Djoerd Hiemstra has been an Assistant Professor in the Database Group of the Computer Science Department of the University of Twente since 2001. At this same university, he studied computer science and graduated in the field of language technology (1996). In 2000, he worked for three months at Microsoft Research in Cambridge. He wrote a Ph.D. thesis on probabilistic retrieval using language models. Multimedia databases, cross-language information retrieval, and statistical language modelling are among the research themes he is currently working on. Together with Arjen de Vries, he initiated the project CIRQUID, funded by NWO.