Cross-Language Information Retrieval (CLIR)
Ananthakrishnan R
Computer Science & Engg., IIT Bombay
anand@cse
April 7, 2006
Natural Language Processing/Language Technology for the Web

Cross Language Information Retrieval (CLIR)
- A subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query.
- E.g., using Hindi queries to retrieve English documents.
- Also called multi-lingual, cross-lingual, or trans-lingual IR.

Why CLIR?
E.g., on the web we have:
- Documents in different languages
- Multilingual documents
- Images with captions in different languages
A single query should retrieve all such resources.

Approaches to CLIR
- Query Translation (most efficient; commonly used)
  - Knowledge-based: Dictionary/Thesaurus-based
  - Corpus-based: Pseudo-Relevance Feedback (PRF)
- Document Translation (infeasible for large collections)
  - Knowledge-based: MT (rule-based)
  - Corpus-based: MT (EBMT/StatMT)
- Intermediate Representation
  - Knowledge-based: UNL (AgroExplorer)
  - Corpus-based: Latent Semantic Indexing
Most effective approaches are hybrid: a combination of knowledge-based and corpus-based methods.

Dictionary-based Query Translation
[Flow diagram: Hindi query "आयरलैंड शांति वार्ता" -> phrase identification; words to be transliterated -> Hindi-English dictionaries -> English query "Ireland peace talks" -> search -> Collection]

The problem with dictionary-based CLIR: ambiguity
Query: अंतरिक्षीय घटना
- अंतरिक्षीय: cosmic, outer-space
- घटना: incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce
Query: जाली धन
- जाली: lattice, mesh, net, wire_netting, meshed_fabric, counterfeit, forged, false, fabricated, small_net, network, gauze, grating, sieve
- धन: money, riches, wealth, appositive, property
Query: आयरलैंड शांति वार्ता
- आयरलैंड: Ireland
- शांति: peace, calm, tranquility, silence, quietude
- वार्ता: conversation, talk, negotiation, tale
Filtering/disambiguation is required after query translation.

Disambiguation using co-occurrence statistics
Hypothesis: correct translations of query terms will co-occur, and incorrect translations will tend not to co-occur.

Problem with counting co-occurrences: data sparsity
- freq(Marathi, Shallow, Parsing, CRFs)
- freq(Marathi, Shallow, Structuring, CRFs)
- freq(Marathi, Shallow, Analyzing, CRFs)
are all zero. How do we choose between parsing, structuring, and analyzing?

Pair-wise co-occurrence
Query: अंतरिक्षीय घटना
- अंतरिक्षीय: cosmic, outer-space
- घटना: incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce
freq(cosmic, incident)         70800
freq(cosmic, event)            269000
freq(cosmic, lessen)           7130
freq(cosmic, subside)          3120
freq(outer-space, incident)    26100
freq(outer-space, event)       104000
freq(outer-space, lessen)      2600
freq(outer-space, subside)     980

Shallow Parsing, Structuring, or Analyzing?
Marathi parsing        17100
Marathi structuring    511
Marathi analyzing      12200
shallow parsing        166000
shallow structuring    180000
shallow analyzing      1230000
CRFs parsing           540
CRFs structuring       125
CRFs analyzing         765
But:
analyzing              74100000
parsing                40400000
structuring            17400000
shallow                33300000
Collocation?
shallow parsing        40700
shallow structuring    11
shallow analyzing      2

Ranking senses using co-occurrence statistics
Use co-occurrence scores to calculate similarity between two words: sim(x, y)
- Point-wise mutual information (PMI)
- Dice coefficient
- PMI-IR
\[ \text{PMI-IR}(x, y) = \log \frac{\mathrm{hits}(x \text{ AND } y)}{\mathrm{hits}(x)\,\mathrm{hits}(y)} \]

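A minimal sketch of the PMI-IR formula, applied to the hit counts listed on the preceding slide. It assumes that the first "shallow" table gives joint hits for `shallow AND <word>` and that the counts under "collocation?" are exact-phrase hits; both readings, and the helper name `pmi_ir`, are assumptions rather than something the slides state.

```python
import math

def pmi_ir(hits_xy: int, hits_x: int, hits_y: int) -> float:
    """PMI-IR(x, y) = log( hits(x AND y) / (hits(x) * hits(y)) )."""
    if hits_xy == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log(hits_xy / (hits_x * hits_y))

# Single-term hit counts from the slide.
HITS = {"shallow": 33_300_000, "parsing": 40_400_000,
        "structuring": 17_400_000, "analyzing": 74_100_000}

# Assumed reading: joint hits for "shallow AND <word>".
JOINT = {"parsing": 166_000, "structuring": 180_000, "analyzing": 1_230_000}

# Assumed reading: exact-phrase hits for "shallow <word>".
PHRASE = {"parsing": 40_700, "structuring": 11, "analyzing": 2}

def best_partner(pair_hits: dict) -> str:
    """Return the candidate with the highest PMI-IR against 'shallow'."""
    scores = {w: pmi_ir(h, HITS["shallow"], HITS[w]) for w, h in pair_hits.items()}
    return max(scores, key=scores.get)

print(best_partner(JOINT))   # winner under the joint-hit counts
print(best_partner(PHRASE))  # winner under the phrase counts
```

Under these assumptions the two count sets pick different words, which shows how sensitive the ranking is to the way co-occurrence is counted.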
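Disambiguation algorithm
User's query: $q = \{q^s_1, q^s_2, \ldots, q^s_m\}$
For each $q^s_i$, the set of translations is $S_i = \{t_{i,j}\}$.
1. \[ sim(t_{i,j}, S_{i'}) = \sum_{t_{i',l} \in S_{i'}} sim(t_{i,j}, t_{i',l}) \]
2. \[ score(t_{i,j}) = \sum_{i' \neq i} sim(t_{i,j}, S_{i'}) \]
3. \[ q^t_i = \arg\max_{t_{i,j}} score(t_{i,j}) \]
Translated query: $q^t = \{q^t_1, q^t_2, \ldots, q^t_m\}$
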
Example
Query: अंतरिक्षीय घटना
- अंतरिक्षीय: cosmic, outer-space
- घटना: incident, event, lessen, subside, decrease, lower, diminish, ebb, decline, reduce
score(cosmic) = PMI-IR(cosmic, incident) + PMI-IR(cosmic, event) + PMI-IR(cosmic, lessen) + PMI-IR(cosmic, subside)

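The disambiguation steps can be sketched as follows: each candidate translation is scored by its summed similarity to every candidate translation of the other query terms, and the best-scoring candidate per term is kept. The names `disambiguate` and `cooc` are hypothetical. Because single-term hit counts for these words are not given in the deck, the raw pair-wise frequencies from the pair-wise co-occurrence slide stand in for sim here; that substitution is an assumption, not the PMI-IR measure itself.

```python
from typing import Callable, List

def disambiguate(translations: List[List[str]],
                 sim: Callable[[str, str], float]) -> List[str]:
    """Pick one translation per query term.

    score(t) = sum of sim(t, t') over all candidate translations t'
    of the *other* query terms; the argmax is kept for each term.
    """
    chosen = []
    for i, candidates in enumerate(translations):
        others = [t for k, s in enumerate(translations) if k != i for t in s]
        chosen.append(max(candidates,
                          key=lambda t: sum(sim(t, o) for o in others)))
    return chosen

# Pair-wise counts quoted in the deck (used here as a stand-in similarity).
FREQ = {
    ("cosmic", "incident"): 70800, ("cosmic", "event"): 269000,
    ("cosmic", "lessen"): 7130, ("cosmic", "subside"): 3120,
    ("outer-space", "incident"): 26100, ("outer-space", "event"): 104000,
    ("outer-space", "lessen"): 2600, ("outer-space", "subside"): 980,
}

def cooc(a: str, b: str) -> float:
    """Symmetric lookup; unseen pairs count as zero."""
    return FREQ.get((a, b), FREQ.get((b, a), 0))

translations = [["cosmic", "outer-space"],
                ["incident", "event", "lessen", "subside"]]
print(disambiguate(translations, cooc))  # ['cosmic', 'event']
```

Passing `sim` as a parameter keeps the selection logic independent of the similarity measure, so PMI, Dice, or PMI-IR could be swapped in when the required counts are available.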
Disambiguation algorithm: sample outputs
- आयरलैंड शांति वार्ता: Ireland peace talks
- अंतरिक्षीय घटना: cosmic events
- जाली धन: net money?

Results on TREC8 (disks 4 and 5)
- English topics (401-450) manually translated to Hindi
- Assumption: relevance judgments for English topics hold for the translated queries
Results (all TF-IDF):
Technique                    MAP
Monolingual                  23
All-translations             16
PMI based disambiguation     20.5
Manual filtering             21.5

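MAP in the table is mean average precision. A minimal sketch of how that measure is computed from ranked result lists and relevance judgments is given below; the document ids and function names are hypothetical, and this is not the evaluation script used for the reported numbers.

```python
from typing import Iterable, List, Set, Tuple

def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs: Iterable[Tuple[List[str], Set[str]]]) -> float:
    """Average of the per-topic average precision values."""
    runs = list(runs)
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy topics with hypothetical document ids.
topic_a = (["d1", "d2", "d3", "d4"], {"d1", "d3"})
topic_b = (["a", "b"], {"b"})
print(mean_average_precision([topic_a, topic_b]))
```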
Pseudo-Relevance Feedback for CLIR
User Relevance Feedback (mono-lingual)
1. Retrieve documents using the user's query
2. The user marks relevant documents
3. Choose the top N terms from these documents (top terms: IDF is one option for scoring)
4. Add these N terms to the user's query to form a new query
5. Use this new query to retrieve a new set of documents

Pseudo-Relevance Feedback (PRF) (mono-lingual)
1. Retrieve documents using the user's query
2. Assume that the top M documents retrieved are relevant
3. Choose the top N terms from these M documents
4. Add these N terms to the user's query to form a new query
5. Use this new query to retrieve a new set of documents

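The five PRF steps can be sketched on a toy collection as below. The TF-IDF scorer, the smoothed IDF variant, the sample documents, and the names `retrieve` and `prf_expand` are all illustrative assumptions; a real system would use its own retrieval engine for steps 1 and 5.

```python
import math
from collections import Counter
from typing import List

def idf(term: str, docs: List[List[str]]) -> float:
    """Smoothed inverse document frequency (one possible term score)."""
    df = sum(1 for d in docs if term in d)
    return math.log((len(docs) + 1) / (df + 1)) + 1.0

def retrieve(query: List[str], docs: List[List[str]]) -> List[int]:
    """Steps 1 and 5: document indices sorted by a simple TF-IDF score."""
    def score(doc: List[str]) -> float:
        tf = Counter(doc)
        return sum(tf[t] * idf(t, docs) for t in set(query))
    return sorted(range(len(docs)), key=lambda i: score(docs[i]), reverse=True)

def prf_expand(query: List[str], docs: List[List[str]],
               m: int = 2, n: int = 2) -> List[str]:
    top = retrieve(query, docs)[:m]                 # step 2: top M assumed relevant
    tf = Counter(t for i in top for t in docs[i])
    scored = {t: c * idf(t, docs) for t, c in tf.items() if t not in query}
    new_terms = sorted(scored, key=lambda t: (-scored[t], t))[:n]  # step 3
    return list(query) + new_terms                  # step 4

docs = [
    "ireland peace talks resume belfast".split(),
    "peace talks belfast stall".split(),
    "cosmic events observed telescope".split(),
    "stock market talks".split(),
]
expanded = prf_expand(["ireland", "peace"], docs, m=2, n=2)
print(expanded)                  # expanded query
print(retrieve(expanded, docs))  # step 5: new ranking
```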
PRF for CLIR
Corpus-based Query Translation
Uses a parallel corpus of documents:
Hindi collection H     English collection E
H_1                    E_1
H_2                    E_2
...                    ...
H_m                    E_m

PRF for CLIR
1. Retrieve documents in H using the user's query
2. Assume that the top M documents retrieved are relevant
3. Select the M documents in E that are aligned to the top M retrieved documents
4. Choose the top N terms from these documents
5. These N terms are the translated query
6. Use this query to retrieve from the target collection (which is in the same language as E)

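A minimal sketch of steps 1 to 5, assuming a tiny hand-made parallel corpus in which `coll_h[i]` is aligned with `coll_e[i]`. The term-count ranker, the TF-IDF term weighting, the sample documents, and the function names are hypothetical choices made only for illustration.

```python
import math
from collections import Counter
from typing import List

def rank(query: List[str], docs: List[List[str]]) -> List[int]:
    """Step 1: indices of docs sorted by number of query-term occurrences."""
    return sorted(range(len(docs)),
                  key=lambda i: sum(docs[i].count(t) for t in query),
                  reverse=True)

def translate_query_prf(query_h: List[str], coll_h: List[List[str]],
                        coll_e: List[List[str]], m: int = 2, n: int = 3) -> List[str]:
    top = rank(query_h, coll_h)[:m]        # step 2: top M assumed relevant
    aligned = [coll_e[i] for i in top]     # step 3: aligned English documents
    tf = Counter(t for d in aligned for t in d)

    def weight(term: str) -> float:
        df = sum(1 for d in coll_e if term in d)
        return tf[term] * math.log(len(coll_e) / df)

    # Steps 4 and 5: the top N English terms form the translated query.
    return sorted(tf, key=lambda t: (-weight(t), t))[:n]

coll_h = [["आयरलैंड", "शांति", "वार्ता"], ["शांति", "वार्ता"],
          ["अंतरिक्षीय", "घटना"], ["जाली", "धन"], ["चुनाव", "परिणाम"]]
coll_e = [["ireland", "peace", "talks"], ["peace", "talks"],
          ["cosmic", "events"], ["counterfeit", "money"], ["election", "results"]]

query_e = translate_query_prf(["आयरलैंड", "शांति", "वार्ता"], coll_h, coll_e)
print(query_e)
```

Step 6 would then pass `query_e` to whatever retrieval engine indexes the target English collection.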
Cross-Lingual Relevance Models
- Estimate relevance models using a parallel corpus

Ranking with Relevance Models
- The relevance model (or query model) distribution encodes the information need: $P(w \mid \Theta_R)$, the probability of word occurrence in a relevant document
- $P(w \mid D)$: the probability of word occurrence in the candidate document
- Ranking function (relative entropy or KL divergence):
\[ KL(D \,\|\, \Theta_R) = \sum_{w} P(w \mid D) \log \frac{P(w \mid D)}{P(w \mid \Theta_R)} \]

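A small sketch of ranking by this divergence: candidate documents are sorted by increasing KL divergence between the document model and the relevance model. The distributions below are made-up numbers, and the code assumes the relevance model assigns non-zero probability to every word that occurs in a document (for example through smoothing).

```python
import math
from typing import Dict, List

def kl_divergence(p_doc: Dict[str, float], p_rel: Dict[str, float]) -> float:
    """KL(D || Theta_R) = sum_w P(w|D) * log(P(w|D) / P(w|Theta_R))."""
    return sum(p * math.log(p / p_rel[w]) for w, p in p_doc.items() if p > 0)

def rank_by_kl(doc_models: Dict[str, Dict[str, float]],
               p_rel: Dict[str, float]) -> List[str]:
    """Smaller divergence means the document is closer to the relevance model."""
    return sorted(doc_models, key=lambda d: kl_divergence(doc_models[d], p_rel))

# Hypothetical relevance model and document models.
relevance = {"ireland": 0.4, "peace": 0.3, "talks": 0.2, "money": 0.1}
doc_models = {
    "d1": {"ireland": 0.5, "peace": 0.3, "talks": 0.2},
    "d2": {"money": 0.9, "talks": 0.1},
}
print(rank_by_kl(doc_models, relevance))
```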
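Estimating Mono-Lingual Relevance Models
\[ P(w \mid \Theta_R) \approx P(w \mid Q) = P(w \mid h_1 h_2 \ldots h_m) = \frac{P(w, h_1 h_2 \ldots h_m)}{P(h_1 h_2 \ldots h_m)} \]
\[ P(w, h_1 h_2 \ldots h_m) = \sum_{M \in \mathcal{M}} P(M)\, P(w \mid M) \prod_{i=1}^{m} P(h_i \mid M) \]
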
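Estimating Cross-Lingual Relevance Models
\[ P(w, h_1 h_2 \ldots h_m) = \sum_{\{M_H, M_E\} \in \mathcal{M}} P(\{M_H, M_E\})\, P(w \mid M_E) \prod_{i=1}^{m} P(h_i \mid M_H) \]
\[ P(w \mid M_X) = \lambda \frac{\mathrm{freq}(w, X)}{\sum_{v} \mathrm{freq}(v, X)} + (1 - \lambda)\, P(w) \]
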
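A minimal sketch of the cross-lingual estimate on a toy parallel corpus. It assumes a uniform prior over document pairs, collection frequencies as the background probability P(w), λ = 0.5, and query words that all occur somewhere in the Hindi collection; none of these choices come from the slides, and the function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

def background(docs: List[List[str]]) -> Dict[str, float]:
    """Collection-wide relative frequencies, used as P(w)."""
    tf = Counter(t for d in docs for t in d)
    total = sum(tf.values())
    return {t: c / total for t, c in tf.items()}

def smoothed(doc: List[str], bg: Dict[str, float], lam: float) -> Callable[[str], float]:
    """P(w|M_X) = lam * freq(w,X) / sum_v freq(v,X) + (1 - lam) * P(w)."""
    tf, n = Counter(doc), len(doc)
    return lambda w: lam * tf[w] / n + (1 - lam) * bg.get(w, 0.0)

def relevance_model(query_h: List[str],
                    pairs: List[Tuple[List[str], List[str]]],
                    lam: float = 0.5) -> Dict[str, float]:
    bg_h = background([h for h, _ in pairs])
    bg_e = background([e for _, e in pairs])
    prior = 1.0 / len(pairs)              # uniform P({M_H, M_E}), an assumption
    joint = dict.fromkeys(bg_e, 0.0)      # P(w, h_1 ... h_m) per English word
    for doc_h, doc_e in pairs:
        p_h, p_e = smoothed(doc_h, bg_h, lam), smoothed(doc_e, bg_e, lam)
        q_lik = 1.0
        for h in query_h:                 # product over the Hindi query words
            q_lik *= p_h(h)
        for w in joint:
            joint[w] += prior * p_e(w) * q_lik
    total = sum(joint.values())           # equals P(h_1 ... h_m) over this vocabulary
    return {w: p / total for w, p in joint.items()}

pairs = [(["आयरलैंड", "शांति", "वार्ता"], ["ireland", "peace", "talks"]),
         (["अंतरिक्षीय", "घटना"], ["cosmic", "events"]),
         (["जाली", "धन"], ["counterfeit", "money"])]
model = relevance_model(["शांति", "वार्ता"], pairs)
top3 = sorted(model, key=model.get, reverse=True)[:3]
print(top3)
```

The resulting distribution over English words could then serve as the relevance model when ranking English documents by KL divergence.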
CLIR Evaluation: TREC (Text REtrieval Conference)
- TREC CLIR track (2001 and 2002)
- Retrieval of Arabic language newswire documents from topics in English
- 383,872 Arabic documents (896 MB) with SGML markup
- 50 topics
- Use of provided resources (stemmers, bilingual dictionaries, MT systems, parallel corpora) is encouraged to minimize variability
- http://trec.nist.gov/

CLIR Evaluation: CLEF (Cross Language Evaluation Forum)
- Major CLIR evaluation forum
- Tracks include:
  - Multilingual retrieval on news collections (topics will be provided in many languages, including Hindi)
  - Multiple language Question Answering
  - ImageCLEF
  - Cross Language Speech Retrieval
  - WebCLEF
- http://www.clef-campaign.org/

Summary
- CLIR techniques:
  - Query Translation-based
  - Document Translation-based
  - Intermediate Representation-based
- Query translation using dictionaries, followed by disambiguation, is a simple and effective technique for CLIR
- PRF uses a parallel corpus for query translation
- Parallel corpora can also be used to estimate cross-lingual relevance models
- CLEF and TREC: important CLIR evaluation conferences

Thank You