Cross-Language Information Retrieval (CLIR)


Cross-Language Information Retrieval (CLIR). Ananthakrishnan R, Computer Science & Engg., IIT Bombay, anand@cse. April 7, 2006. Natural Language Processing/Language Technology for the Web.

Cross-Language Information Retrieval (CLIR): a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query. E.g., using Hindi queries to retrieve English documents. Also called multi-lingual, cross-lingual, or trans-lingual IR.

Why CLIR? E.g., on the web we have: documents in different languages; multilingual documents; images with captions in different languages. A single query should retrieve all such resources.

Approaches to CLIR
Query Translation (most efficient; commonly used): knowledge-based (dictionary/thesaurus-based); corpus-based (Pseudo-Relevance Feedback, PRF).
Document Translation (infeasible for large collections): MT (rule-based); MT (EBMT/StatMT).
Intermediate Representation: UNL (AgroExplorer); Latent Semantic Indexing.
The most effective approaches are hybrid: a combination of knowledge-based and corpus-based methods.

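Dictionary-based Query Translation: the Hindi query आयरलैंड शांति वार्ता goes through phrase identification and identification of words to be transliterated, is looked up in Hindi-English dictionaries to give "Ireland peace talks", and the translated query is used to search the collection.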

The problem with dictionary-based CLIR: ambiguity. Example queries: अंतरिक्षीय घटना; जाली धन; आयरलैंड शांति वार्ता.
अंतरिक्षीय: cosmic, outer-space
घटना: incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce
जाली: lattice, mesh, net, wire_netting, meshed_fabric, counterfeit, forged, false, fabricated, small_net, network, gauze, grating, sieve
धन: money, riches, wealth, appositive, property
आयरलैंड: Ireland
शांति: peace, calm, tranquility, silence, quietude
वार्ता: conversation, talk, negotiation, tale

Filtering/disambiguation is required after query translation.

Disambiguation using co-occurrence statistics. Hypothesis: correct translations of query terms will co-occur, and incorrect translations will tend not to co-occur.

Problem with counting co-occurrences: data sparsity. freq(Marathi Shallow Parsing CRFs), freq(Marathi Shallow Structuring CRFs), and freq(Marathi Shallow Analyzing CRFs) are all zero. How do we choose between parsing, structuring, and analyzing?

Pair-wise co-occurrence
अंतरिक्षीय: cosmic, outer-space
घटना: incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce
freq(cosmic, incident) = 70800
freq(cosmic, event) = 269000
freq(cosmic, lessen) = 7130
freq(cosmic, subside) = 3120
freq(outer-space, incident) = 26100
freq(outer-space, event) = 104000
freq(outer-space, lessen) = 2600
freq(outer-space, subside) = 980

Shallow Parsing, Structuring or Analyzing?
shallow parsing 166000; shallow structuring 180000; shallow analyzing 1230000
CRFs parsing 540; CRFs structuring 125; CRFs analyzing 765
But: analyzing 74100000; parsing 40400000; structuring 17400000; shallow 33300000
Marathi parsing 17100; Marathi structuring 511; Marathi analyzing 12200
shallow parsing 40700; shallow structuring 11; shallow analyzing 2 (collocation?)

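Ranking senses using co-occurrence statistics. Use co-occurrence scores to calculate the similarity between two words, sim(x, y): point-wise mutual information (PMI), Dice coefficient, PMI-IR.
$\mathrm{PMI\text{-}IR}(x, y) = \log \dfrac{\mathrm{hits}(x \text{ AND } y)}{\mathrm{hits}(x)\,\mathrm{hits}(y)}$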
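As a rough illustration of the PMI-IR score, here is a minimal Python sketch. The function name `pmi_ir`, the handling of zero counts, and the reading of the slide's "shallow parsing / structuring / analyzing" figures as joint hit counts are assumptions made for this example, not something the slides specify.

```python
import math

def pmi_ir(hits_xy: int, hits_x: int, hits_y: int) -> float:
    """PMI-IR(x, y) = log(hits(x AND y) / (hits(x) * hits(y)))."""
    if hits_xy == 0 or hits_x == 0 or hits_y == 0:
        # Assumption: a zero count gets the lowest possible score.
        return float("-inf")
    return math.log(hits_xy / (hits_x * hits_y))

# Hit counts quoted on the slides (first set of "shallow ..." pair counts).
hits = {"shallow": 33_300_000, "parsing": 40_400_000,
        "structuring": 17_400_000, "analyzing": 74_100_000}
joint_with_shallow = {"parsing": 166_000, "structuring": 180_000,
                      "analyzing": 1_230_000}

# Score each candidate by its association with "shallow".
scores = {w: pmi_ir(joint_with_shallow[w], hits["shallow"], hits[w])
          for w in joint_with_shallow}
```

Dividing by the individual hit counts is what keeps a very frequent word from winning on raw co-occurrence alone.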

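Disambiguation algorithm
User's query: $q^s = \{q^s_1, q^s_2, \ldots, q^s_m\}$. For each $q^s_i$, the set of translations is $S_i = \{t_{i,j}\}$.
1. $\mathrm{sim}(t_{i,j}, S_{i'}) = \sum_{t_{i',l} \in S_{i'}} \mathrm{sim}(t_{i,j}, t_{i',l})$
2. $\mathrm{score}(t_{i,j}) = \sum_{i' \neq i} \mathrm{sim}(t_{i,j}, S_{i'})$
3. $q^t_i = \arg\max_{t_{i,j}} \mathrm{score}(t_{i,j})$; translated query: $q^t = \{q^t_1, q^t_2, \ldots, q^t_m\}$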
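The three steps of the disambiguation algorithm (similarity to each other translation set, summed score, arg max) can be sketched in Python as follows. The function name `disambiguate`, the toy similarity table, and the assumption that every translation set is non-empty are illustrative choices; in practice `sim` would be a co-occurrence measure such as PMI-IR.

```python
from typing import Callable, List

def disambiguate(translations: List[List[str]],
                 sim: Callable[[str, str], float]) -> List[str]:
    """Pick one translation per source term, scoring each candidate by its
    summed similarity to the candidates of all the other source terms."""
    chosen = []
    for i, candidates in enumerate(translations):
        def score(t: str) -> float:
            # Steps 1 and 2: sum sim(t, t') over every candidate t'
            # of every other source term.
            return sum(sim(t, other)
                       for k, others in enumerate(translations) if k != i
                       for other in others)
        # Step 3: keep the highest-scoring candidate for term i.
        chosen.append(max(candidates, key=score))
    return chosen

# Toy, made-up similarity values (symmetric; missing pairs count as 0).
toy = {frozenset(("cosmic", "event")): 3.0,
       frozenset(("cosmic", "incident")): 2.0,
       frozenset(("outer-space", "event")): 1.0}

def toy_sim(a: str, b: str) -> float:
    return toy.get(frozenset((a, b)), 0.0)

result = disambiguate([["cosmic", "outer-space"],
                       ["incident", "event", "lessen"]], toy_sim)
```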

Example
अंतरिक्षीय: cosmic, outer-space
घटना: incident, event, lessen, subside, decrease, lower, diminish, ebb, decline, reduce
score(cosmic) = PMI-IR(cosmic, incident) + PMI-IR(cosmic, event) + PMI-IR(cosmic, lessen) + PMI-IR(cosmic, subside)

Disambiguation algorithm: sample outputs
आयरलैंड शांति वार्ता: Ireland peace talks
अंतरिक्षीय घटना: cosmic events
जाली धन: net money (?)

Results on TREC8 (disks 4 and 5). English topics 401-450 were manually translated to Hindi. Assumption: relevance judgments for the English topics hold for the translated queries. Results (all TF-IDF):

Technique                   MAP
Monolingual                 23
All translations            16
PMI-based disambiguation    20.5
Manual filtering            21.5

Pseudo-Relevance Feedback for CLIR

User Relevance Feedback (mono-lingual)
1. Retrieve documents using the user's query.
2. The user marks relevant documents.
3. Choose the top N terms from these documents (top terms: IDF is one option for scoring).
4. Add these N terms to the user's query to form a new query.
5. Use this new query to retrieve a new set of documents.

Pseudo-Relevance Feedback (PRF) (mono-lingual)
1. Retrieve documents using the user's query.
2. Assume that the top M documents retrieved are relevant.
3. Choose the top N terms from these M documents.
4. Add these N terms to the user's query to form a new query.
5. Use this new query to retrieve a new set of documents.
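A minimal Python sketch of the mono-lingual PRF loop is given here for illustration. The names `prf_expand` and `retrieve`, the toy overlap-based ranking, and scoring expansion terms by raw frequency are all assumptions; the slides leave the retrieval function and the term-scoring method open.

```python
from collections import Counter
from typing import Callable, List

def prf_expand(query: List[str],
               retrieve: Callable[[List[str]], List[List[str]]],
               top_m: int = 10, top_n: int = 5) -> List[str]:
    ranked = retrieve(query)              # step 1: initial retrieval
    pseudo_relevant = ranked[:top_m]      # step 2: assume top M are relevant
    # Step 3: score candidate terms (here: raw frequency, an assumption).
    counts = Counter(t for doc in pseudo_relevant for t in doc
                     if t not in query)
    new_terms = [t for t, _ in counts.most_common(top_n)]
    return query + new_terms              # step 4: expanded query

# Toy collection and ranking function (hypothetical data).
docs = [["ireland", "peace", "talks", "belfast"],
        ["peace", "talks", "belfast", "agreement"],
        ["cricket", "score"]]

def retrieve(query: List[str]) -> List[List[str]]:
    # Rank documents by how many query terms they contain.
    return sorted(docs, key=lambda d: -len(set(query) & set(d)))

expanded = prf_expand(["peace", "talks"], retrieve, top_m=2, top_n=1)
second_pass = retrieve(expanded)          # step 5: retrieve again
```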

PRF for CLIR: Corpus-based Query Translation. Uses a parallel corpus of documents: each document H_i in the Hindi collection H = {H_1, H_2, ..., H_m} is aligned with the document E_i in the English collection E = {E_1, E_2, ..., E_m}.

PRF for CLIR
1. Retrieve documents in H using the user's query.
2. Assume that the top M documents retrieved are relevant.
3. Select the M documents in E that are aligned to the top M retrieved documents.
4. Choose the top N terms from these documents.
5. These N terms are the translated query.
6. Use this query to retrieve from the target collection (which is in the same language as E).
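The cross-lingual variant can be sketched the same way. In this hedged example, `prf_translate` is a hypothetical name, the parallel corpus is two toy lists aligned by position, the Hindi tokens are romanized placeholders, and both the overlap ranking and the frequency-based term selection are assumptions standing in for a real retrieval model.

```python
from collections import Counter
from typing import List

def prf_translate(query_h: List[str],
                  hindi_docs: List[List[str]],
                  english_docs: List[List[str]],
                  top_m: int = 2, top_n: int = 3) -> List[str]:
    """english_docs[k] is assumed to be the aligned translation of hindi_docs[k]."""
    # Step 1: rank Hindi documents by overlap with the query (toy ranking).
    order = sorted(range(len(hindi_docs)),
                   key=lambda k: -len(set(query_h) & set(hindi_docs[k])))
    top = order[:top_m]                          # step 2: assume top M relevant
    aligned = [english_docs[k] for k in top]     # step 3: aligned English docs
    # Steps 4-5: the N most frequent English terms form the translated query.
    counts = Counter(t for doc in aligned for t in doc)
    return [t for t, _ in counts.most_common(top_n)]

# Hypothetical parallel corpus (romanized Hindi placeholders).
hindi = [["ayarlaind", "shanti", "varta"],
         ["shanti", "varta", "samjhauta"],
         ["kriket", "score"]]
english = [["ireland", "peace", "talks"],
           ["peace", "talks", "agreement"],
           ["cricket", "score"]]

translated = prf_translate(["shanti", "varta"], hindi, english,
                           top_m=2, top_n=2)
```

Step 6 then runs `translated` as an ordinary query against the target-language collection.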

Cross-Lingual Relevance Models: estimate relevance models using a parallel corpus.

Ranking with Relevance Models
Relevance model (or query model) distribution, which encodes the information need: $P(w \mid \Theta_R)$, the probability of word occurrence in a relevant document.
$P(w \mid D)$: the probability of word occurrence in the candidate document.
Ranking function (relative entropy, or KL divergence):
$KL(D \,\|\, \Theta_R) = \sum_{w} P(w \mid D) \cdot \log \dfrac{P(w \mid D)}{P(w \mid \Theta_R)}$
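To make the ranking function concrete, here is a small Python sketch that ranks candidate documents by KL divergence from a relevance model. It assumes the direction KL(D || Θ_R) that the slide's formula appears to use, that distributions are plain dictionaries, and that the relevance model gives non-zero probability to every word in a document (for example, after smoothing). The names and toy numbers are illustrative only.

```python
import math
from typing import Dict

def kl_divergence(p_doc: Dict[str, float], p_rel: Dict[str, float]) -> float:
    """KL(D || Theta_R) = sum_w P(w|D) * log(P(w|D) / P(w|Theta_R))."""
    # Assumption: p_rel[w] > 0 for every w with p_doc[w] > 0.
    return sum(p * math.log(p / p_rel[w]) for w, p in p_doc.items() if p > 0)

# Toy relevance model and candidate document models (made-up numbers).
relevance_model = {"peace": 0.5, "talks": 0.5}
doc_models = {
    "d1": {"peace": 0.5, "talks": 0.5},
    "d2": {"peace": 0.9, "talks": 0.1},
}

# Smaller divergence means the document is closer to the relevance model.
ranking = sorted(doc_models,
                 key=lambda d: kl_divergence(doc_models[d], relevance_model))
```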

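Estimating Mono-Lingual Relevance Models
$P(w \mid \Theta_R) \approx P(w \mid Q) = P(w \mid h_1 h_2 \ldots h_m) = \dfrac{P(w, h_1 h_2 \ldots h_m)}{P(h_1 h_2 \ldots h_m)}$
$P(w, h_1 h_2 \ldots h_m) = \sum_{M \in \mathcal{M}} P(M)\, P(w \mid M) \prod_{i=1}^{m} P(h_i \mid M)$

Estimating Cross-Lingual Relevance Models
$P(w, h_1 h_2 \ldots h_m) = \sum_{\{M_H, M_E\} \in \mathcal{M}} P(\{M_H, M_E\})\, P(w \mid M_E) \prod_{i=1}^{m} P(h_i \mid M_H)$
$P(v \mid M_X) = \lambda \dfrac{\mathrm{freq}(v, X)}{|X|} + (1 - \lambda)\, P(v)$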
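The cross-lingual estimate can be sketched in Python as a sum over aligned document pairs. This is only an illustration under several assumptions: a uniform prior over pairs, linear smoothing of each document model with a background distribution (weight `lam`), non-empty documents, romanized placeholder tokens, and normalization over a fixed English vocabulary. None of these choices is confirmed by the slides.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

def smoothed(doc: List[str], background: Dict[str, float],
             lam: float) -> Callable[[str], float]:
    """P(v|M_X) = lam * freq(v, X) / |X| + (1 - lam) * P(v)."""
    counts, n = Counter(doc), len(doc)   # assumption: doc is non-empty
    return lambda v: lam * counts[v] / n + (1 - lam) * background.get(v, 0.0)

def cross_lingual_relevance_model(
        query_h: List[str],
        pairs: List[Tuple[List[str], List[str]]],
        bg_h: Dict[str, float], bg_e: Dict[str, float],
        lam: float = 0.5) -> Dict[str, float]:
    joint = {w: 0.0 for w in bg_e}
    prior = 1.0 / len(pairs)             # assumption: uniform P({M_H, M_E})
    for doc_h, doc_e in pairs:
        p_h, p_e = smoothed(doc_h, bg_h, lam), smoothed(doc_e, bg_e, lam)
        query_likelihood = math.prod(p_h(h) for h in query_h)
        for w in joint:
            joint[w] += prior * p_e(w) * query_likelihood
    total = sum(joint.values())          # plays the role of P(h_1 ... h_m)
    return {w: p / total for w, p in joint.items()} if total else joint

# Hypothetical aligned pairs and uniform background models.
pairs = [(["shanti", "varta"], ["peace", "talks"]),
         (["kriket", "score"], ["cricket", "score"])]
bg_h = {t: 0.25 for t in ["shanti", "varta", "kriket", "score"]}
bg_e = {t: 0.25 for t in ["peace", "talks", "cricket", "score"]}

model = cross_lingual_relevance_model(["shanti"], pairs, bg_h, bg_e)
```

The resulting distribution over English words is what a KL-based ranking function would compare against each candidate document model.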

CLIR Evaluation: TREC (Text REtrieval Conference). TREC CLIR track (2001 and 2002): retrieval of Arabic-language newswire documents from topics in English. 383872 Arabic documents (896 MB) with SGML markup; 50 topics. Use of provided resources (stemmers, bilingual dictionaries, MT systems, parallel corpora) is encouraged to minimize variability. http://trec.nist.gov/

CLIR Evaluation: CLEF (Cross Language Evaluation Forum). Major CLIR evaluation forum. Tracks include: multilingual retrieval on news collections (topics will be provided in many languages, including Hindi); multiple-language Question Answering; ImageCLEF; Cross Language Speech Retrieval; WebCLEF. http://www.clef-campaign.org/

Summary
CLIR techniques: query translation-based; document translation-based; intermediate representation-based.
Query translation using dictionaries, followed by disambiguation, is a simple and effective technique for CLIR.
PRF uses a parallel corpus for query translation.
Parallel corpora can also be used to estimate cross-lingual relevance models.
CLEF and TREC: important CLIR evaluation conferences.


Thank You