Cross-language Retrieval Experiments at CLEF-2002

1 Cross-language Retrieval Experiments at CLEF-2002. Aitao Chen, School of Information Management and Systems, University of California at Berkeley. CLEF 2002 Workshop: 19-20 September 2002, Rome, Italy

2 Talk Outline: Overview of our CLEF-2002 experiments; Evaluation of merging strategies; German/Dutch decompounding; Query expansion; Conclusions

3 Overview of Multilingual Information Retrieval Experiments [Diagram: the English query is searched directly against the English docs (IR/RF). Babelfish and L&H translate it into French, German, Italian, and Spanish (boxes labeled French/2, German/2, Italian/2, Spanish/2); each translated query is searched against the docs in its own language (IR/RF; IR/RF/DC for German). A merger combines the five result lists into one combined ranked list of documents.]

4 Multilingual Information Retrieval: Direct Merging. Each collection (English, French, Italian, German, Spanish docs) returns a ranked list of documents with raw scores: E1 e1, E2 e2, ..., E1000 e1000; F1 f1, F2 f2, ..., F1000 f1000; I1 i1, I2 i2, ..., I1000 i1000; G1 g1, G2 g2, ..., G1000 g1000; S1 s1, S2 s2, ..., S1000 s1000. (1) Combine the ranked lists; (2) sort by raw score; (3) take the top 1000 docs. Weakness: the raw relevance scores in the individual ranked lists may not be comparable.
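
The three steps of direct merging (combine, sort by raw score, truncate) can be sketched in Python as follows. This is a minimal illustration, not the system used in the experiments; the function name `direct_merge`, the data layout, and the toy scores are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# One ranked list per collection: (document id, raw score), best first.
Ranked = List[Tuple[str, float]]


def direct_merge(runs: Dict[str, Ranked], k: int = 1000) -> Ranked:
    """Combine all ranked lists, sort by raw score, keep the top k."""
    combined = [pair for ranked in runs.values() for pair in ranked]
    combined.sort(key=lambda pair: pair[1], reverse=True)
    return combined[:k]


# Toy scores (hypothetical), two collections only.
runs = {
    "english": [("E1", 0.90), ("E2", 0.40)],
    "french": [("F1", 0.70), ("F2", 0.65)],
}
merged = direct_merge(runs, k=3)
print([doc for doc, _ in merged])
```

Because the raw scores are compared directly, a collection whose scores run systematically higher will dominate the merged list, which is the weakness the slide points out.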

5 Multilingual Information Retrieval: Normalized Merging. Each collection (English, French, Italian, German, Spanish docs) returns a ranked list of documents with raw scores: E1 e1, E2 e2, ..., E1000 e1000; F1 f1, ..., F1000 f1000; I1 i1, ..., I1000 i1000; G1 g1, ..., G1000 g1000; S1 s1, ..., S1000 s1000. Each raw score is divided by the top score of its own list: E1 e1/e1, E2 e2/e1, ..., E1000 e1000/e1; F1 f1/f1, F2 f2/f1, ..., F1000 f1000/f1; I1 i1/i1, ..., I1000 i1000/i1; G1 g1/g1, ..., G1000 g1000/g1; S1 s1/s1, ..., S1000 s1000/s1. (1) Combine the ranked lists; (2) sort by normalized score; (3) take the top 1000 docs. Weakness: sensitive to a skewed distribution of relevant documents over the document subcollections.
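
A minimal Python sketch of normalized merging follows, assuming every list is sorted best first and has a positive top score. The function name `normalized_merge` and the toy scores are hypothetical.

```python
from typing import Dict, List, Tuple

# One ranked list per collection: (document id, raw score), best first.
Ranked = List[Tuple[str, float]]


def normalized_merge(runs: Dict[str, Ranked], k: int = 1000) -> Ranked:
    """Divide each score by the top score of its own list, then merge."""
    combined: Ranked = []
    for ranked in runs.values():
        if not ranked:
            continue
        top = ranked[0][1]
        if top <= 0:
            # Assumption of this sketch: scores are positive.
            raise ValueError("top score must be positive")
        combined.extend((doc, score / top) for doc, score in ranked)
    combined.sort(key=lambda pair: pair[1], reverse=True)
    return combined[:k]


# Toy scores (hypothetical) on two very different scales.
runs = {
    "english": [("E1", 8.0), ("E2", 6.0)],
    "french": [("F1", 0.5), ("F2", 0.45)],
}
merged = normalized_merge(runs)
print([doc for doc, _ in merged])
```

After normalization every list starts at 1.0, so each collection contributes documents near the top of the merged list even when it holds few relevant documents. That is the weakness named on the slide.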

6 Optimal Merging (Known Relevance) (red = relevant doc; black = irrelevant doc)

Table 1: ranked lists
Rank  Run A  Run B  Run C
1     A1     B1     C1
2     A2     B2     C2
3     A3     B3     C3
4     A4     B4     C4

Table 2: sets in each run, labeled (number of irrelevant docs, number of relevant docs)
       Set 1                  Set 2           Set 3
Run A  (0,1) {A1}             (1,1) {A2,A3}   (1,0) {A4}
Run B  (2,1) {B1,B2,B3}       (1,0) {B4}
Run C  (1,3) {C1,C2,C3,C4}

Table 3: optimal ranking, first set chosen: (0,1) {A1}

Choose the set with the smallest number of irrelevant documents, but the largest number of relevant documents from the set of active sets.

7 Optimal Merging (Known Relevance) (red = relevant doc; black = irrelevant doc)

Sets in the order chosen: (0,1) {A1}; (1,3) {C1,C2,C3,C4}; (1,1) {A2,A3}; (2,1) {B1,B2,B3}; (1,0) {A4}; (1,0) {B4}

Optimal ranking
Rank  1   2   3   4   5   6   7   8   9   10  11  12
Doc   A1  C1  C2  C3  C4  A2  A3  B1  B2  B3  A4  B4
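
The greedy selection over active sets can be sketched in Python as below. The slides state the selection rule only informally, so the tie-breaking here is an assumption: sets with no relevant documents are deferred, then the set with the fewest irrelevant documents wins, then the set with the most relevant documents. The function name `optimal_merge` and the tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

# (number of irrelevant docs, number of relevant docs, doc ids in rank order)
DocSet = Tuple[int, int, List[str]]


def optimal_merge(runs: Dict[str, List[DocSet]]) -> List[str]:
    """Repeatedly pick the best active set (the next unused set of each run)."""
    position = {run: 0 for run in runs}
    ranking: List[str] = []
    while True:
        active = [
            (run, sets[position[run]])
            for run, sets in runs.items()
            if position[run] < len(sets)
        ]
        if not active:
            return ranking
        # Assumed ordering: sets with a relevant doc first, then fewest
        # irrelevant docs, then most relevant docs; ties keep run order.
        run, chosen = min(
            active,
            key=lambda item: (item[1][1] == 0, item[1][0], -item[1][1]),
        )
        ranking.extend(chosen[2])
        position[run] += 1


# The sets of runs A, B, and C from the slides.
runs = {
    "A": [(0, 1, ["A1"]), (1, 1, ["A2", "A3"]), (1, 0, ["A4"])],
    "B": [(2, 1, ["B1", "B2", "B3"]), (1, 0, ["B4"])],
    "C": [(1, 3, ["C1", "C2", "C3", "C4"])],
}
print(optimal_merge(runs))
```

Under this reading the sketch yields A1, C1 to C4, A2, A3, B1 to B3, A4, B4. Since the procedure needs relevance judgments, it serves as an upper bound for comparing merging strategies rather than as a usable strategy.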

8 Performances of MLIR with Different Merging Strategies (Topics: English, TD)

Individual runs
English docs  French docs  German docs  Italian docs  Spanish docs
(.4705)       (.4773)      (.4479)      (.4008)       (.4567)

Merged runs
Direct Merging      (.3762)  72.67% of optimal
Normalized Merging  (.3570)  68.96% of optimal
Optimal Merging     (.5177)

No. topics with no rel docs, by collection (English, French, German, Italian, Spanish): values missing

9 Why decompounding? Topic 09: Computersicherheit (computer security) appears in the title and desc fields, but not in the German document collection. Topic 88: the Dutch compound gekkekoeienziekte (mad cow disease) is not in the Dutch document collection. Topic 3: Fussballeuropameisterschaft in the title, but Europameisterschaft im Fußball in the desc (ß/ss). Topic 5: Scheidungsstatistiken (divorce statistics) in the title, but Statistiken über die Scheidungsraten in the desc. In Der Spiegel: Literaturnobelpreisträger, Literatur-Nobelpreisträger, Literaturnobelpreis-Trägerin; Literaturnobelpreis vs. Nobelpreis für Literatur. Babelfish translated Latin America into German as lateinischem Amerika, not Lateinamerika. Babelfish did not translate Bronchialasthma into English, but it did translate Bronchial and Asthma.

10 German/Dutch Decompounding Procedure. (1) Create a German/Dutch base dictionary consisting of single words only (compounds are excluded). (2) Decompose a compound into component words found in the German/Dutch base dictionary. (3) Choose the decomposition with the minimum number of component words. (4) If more than one decomposition has the minimum number of component words, choose the one with the highest probability.
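
The decomposition and minimum-components steps can be sketched in Python as below. This is a plain recursive enumeration, not necessarily how the original system searched; the function names are hypothetical, and the dictionary is the small example dictionary used in the talk for fussballeuropameisterschaft.

```python
from typing import List, Set


def decompositions(word: str, dictionary: Set[str]) -> List[List[str]]:
    """Return every way to split `word` into dictionary entries."""
    if not word:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in dictionary:
            for tail in decompositions(word[end:], dictionary):
                results.append([head] + tail)
    return results


def shortest_decompositions(word: str, dictionary: Set[str]) -> List[List[str]]:
    """Keep only the decompositions with the fewest component words."""
    candidates = decompositions(word, dictionary)
    if not candidates:
        return []
    fewest = min(len(parts) for parts in candidates)
    return [parts for parts in candidates if len(parts) == fewest]


base = {"ball", "europa", "fuss", "fussball", "meisterschaft", "s"}
print(decompositions("fussballeuropameisterschaft", base))
print(shortest_decompositions("fussballeuropameisterschaft", base))
```

When `shortest_decompositions` returns more than one candidate, the probability criterion of the last step decides among them.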

11 German Decompounding: Example 1. Compound: fussballeuropameisterschaft (European Football Cup). 1. Base dictionary: ball, europa, fuss, fussball, meisterschaft, s. 2. Decompose the compound with respect to the base dictionary: (1) fuss ball europa meisterschaft; (2) fussball europa meisterschaft. 3. Choose the decomposition with the smallest number of component words: fussballeuropameisterschaft = fussball europa meisterschaft

12 German Decompounding: Example 2. Compound: wintersports (winter sports). 1. Base dictionary: port, ports, s, sport, sports, winter, winters. 2. Decompose the compound with respect to the base dictionary; decompositions, each with its log p(d) (values missing): winter s ports; winter sports; winters ports. 3. Choose the most likely decomposition: wintersports = winter sports

13 Decompounding: Probability of Decomposition. For a compound $C = W_1 W_2 W_3 W_4$: $p(C) = p(W_1)\,p(W_2)\,p(W_3)\,p(W_4)$, where $p(w) = \dfrac{tfc(w)}{\sum_{i=1}^{n} tfc(w_i)}$ is the relative frequency of $w$ in a collection. $tfc(w)$ is the number of times word $w$ occurs in a corpus. $n$ is the number of unique words (including compounds) in a corpus.
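
A short Python sketch of the probability criterion: p(w) is the relative frequency of w, and the probability of a decomposition is the product of its component probabilities, computed here as a sum of logs. The corpus counts below are invented for illustration only, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def word_probability(word: str, tfc: Dict[str, int]) -> float:
    """Relative frequency of `word`: tfc(w) divided by the sum of all tfc values."""
    return tfc[word] / sum(tfc.values())


def log_probability(parts: List[str], tfc: Dict[str, int]) -> float:
    """log p(C) = sum of log p(W_i) over the component words."""
    return sum(math.log(word_probability(w, tfc)) for w in parts)


def most_likely(candidates: List[List[str]], tfc: Dict[str, int]) -> List[str]:
    """Pick the decomposition with the highest (log) probability."""
    return max(candidates, key=lambda parts: log_probability(parts, tfc))


# Hypothetical corpus counts, NOT taken from the CLEF collections.
tfc = {"winter": 500, "winters": 20, "sport": 400, "sports": 300,
       "port": 60, "ports": 50, "s": 10}
candidates = [["winter", "sports"], ["winters", "ports"]]
print(most_likely(candidates, tfc))
```

Working in log space avoids underflow for long compounds and matches the log p(d) column shown in the wintersports example.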

14 Evaluation of Decompounding
Columns: Test collection; Run type; Without decompounding; With decompounding; Change (%). The numeric values are missing.
CLEF-2002  German--German
CLEF-2002  Dutch--Dutch
CLEF-2001  Dutch--Dutch
CLEF-2002  English--German (L&H)
CLEF-2002  English--German (Babelfish)
CLEF-2002  French--German (Babelfish)

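15 Document Ranking [Figure: the terms of a query (t1, t2, t3, ...) matched against the terms of the documents.] Matching statistics: $n$; $\{qtf_i, dtf_i, ctf_i\}$ for $i = 1..n$; $ql$, $dl$, $cl$. Features: $x_1$ is a sum over $i = 1..n$ involving $qtf_i$ and $ql$; $x_2$ is a sum of logs involving $dtf_i$ and $dl$; $x_3$ is a sum of logs involving $ctf_i$ and $cl$ (the exact fractions and constants $c$ are garbled in this transcription); $x_4 = n$. Model: $\operatorname{logit}(p(q,d)) = \log\dfrac{p}{1-p} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$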

16 Query Expansion Select terms from top-ranked documents after the initial search. Assign weight to selected terms. Combine selected terms with original query terms.

17 Query Expansion (2)

Step 1: term selection.
              relevant  irrelevant
indexed       n1        n2
not indexed   n3        n4
$w = \dfrac{n_1 / n_3}{n_2 / n_4}$
1) Rank terms in the presumed relevant documents by w in descending order; 2) choose the top-ranked m terms, m = 2 * average-number-of-unique-query-terms. Alternative weighting schemes include a) Maximum Likelihood Ratio; b) Chi-square statistics; c) Mutual information.

Steps 2 & 3: term-weighting and merging.
Initial query:   T1 (1), T2 (2), T3 (1)
Selected terms:  T2 (2*0.5), T3 (1*0.5), T4 (0.5)
Expanded query:  T1 (1.0), T2 (3.0), T3 (1.5), T4 (0.5)
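
The selection and merging steps can be sketched in Python as follows. The merging rule is a reading of the slide's example, not a documented specification: a selected term that already occurs in the query adds half of its query weight, and a new term enters with weight 0.5. The selection weights passed to `select_terms` are hypothetical values standing in for w, and both function names are made up for the sketch.

```python
from typing import Dict, List


def select_terms(weights: Dict[str, float], m: int) -> List[str]:
    """Rank candidate terms by selection weight w, descending; keep the top m."""
    return sorted(weights, key=lambda term: weights[term], reverse=True)[:m]


def expand_query(query: Dict[str, float], selected: List[str],
                 factor: float = 0.5) -> Dict[str, float]:
    """Merge selected terms into the query, down-weighted by `factor`."""
    expanded = dict(query)
    for term in selected:
        # Assumed rule: half the query weight for known terms, 0.5 for new ones.
        added = query[term] * factor if term in query else factor
        expanded[term] = expanded.get(term, 0.0) + added
    return expanded


initial = {"T1": 1.0, "T2": 2.0, "T3": 1.0}
# Hypothetical selection weights for terms in the top-ranked documents.
w = {"T2": 3.2, "T3": 2.1, "T4": 1.7, "T5": 0.4}
selected = select_terms(w, m=3)
print(expand_query(initial, selected))
```

With these inputs the sketch reproduces the expanded-query weights of the slide's example: the original query terms keep their weight, and feedback terms contribute at half strength.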

18 Evaluation of Query Expansion (0 terms/0 docs)
Columns: Run id; Run type; No query expansion; Query expansion; Change (%). The numeric values are missing.
bky2monl     Dutch-Dutch
bky2mofr     French-French
bky2mode     German-German
bky2moit     Italian-Italian
bky2moes     Spanish-Spanish
bky2bienfr   English-French
bky2bienfr2  English-French
bky2bidefr   German-French
bky2biende   English-German
bky2bifrde   French-German
bky2bienit   English-Italian
bky2bienes   English-Spanish
bky2biennl   English-Dutch

19 Evaluation of Decompounding, Stemming and Query Expansion in Monolingual Retrieval (Topics: German, TD)
decomp+stem+expan: .5234 (51.18%)
decomp+stem: .4393 (26.89%); decomp+expan: .4517 (30.47%); stem+expan: .4393 (26.89%)
decomp: .3859 (11.47%); stem: .3633 (4.94%); expan: .4145 (19.73%)
baseline: .3465

20 English-to-French Dictionary Built from Parallel Texts
Example phrases with French translation candidates and their probabilities:
1. fall in sale of cars: autome 0.32, tomber 0.08, telever 0.06
2. ski race; car race: race 0.60, courser 0.8, racial 0.05
3. pop star; galaxy star: star 0.62, etoile 0.3, etoiler
4. rock music: rock 0.89, rocher 0.02, pierre 0.0
5. lead singer: mener 0.3, conduire 0.08, amener 0.07, principal
Run id / Resource / AP (AP values missing): bky2enfr3, Babelfish; bky2enfr4, L&H; bky2enfr5, Parallel texts

21 Conclusions The simplest direct merging method worked better than the score-normalized method when the intermediate ranked lists were produced under similar conditions (e.g., roughly the same query length, the same number of terms selected from the same number of documents). Decompounding improved the retrieval performance of German/Dutch monolingual and cross-language retrieval to German. The margin of improvement varies from one topic set to another. Query expansion substantially improved the performance of monolingual, cross-language, and multilingual retrieval.

22 THANK YOU


More information

Query Performance Prediction: Evaluation Contrasted with Effectiveness

Query Performance Prediction: Evaluation Contrasted with Effectiveness Query Performance Prediction: Evaluation Contrasted with Effectiveness Claudia Hauff 1, Leif Azzopardi 2, Djoerd Hiemstra 1, and Franciska de Jong 1 1 University of Twente, Enschede, the Netherlands {c.hauff,

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 18: Latent Semantic Indexing Hinrich Schütze Center for Information and Language Processing, University of Munich 2013-07-10 1/43

More information

INF 141 IR METRICS LATENT SEMANTIC ANALYSIS AND INDEXING. Crista Lopes

INF 141 IR METRICS LATENT SEMANTIC ANALYSIS AND INDEXING. Crista Lopes INF 141 IR METRICS LATENT SEMANTIC ANALYSIS AND INDEXING Crista Lopes Outline Precision and Recall The problem with indexing so far Intuition for solving it Overview of the solution The Math How to measure

More information

Language Models. Web Search. LM Jelinek-Mercer Smoothing and LM Dirichlet Smoothing. Slides based on the books: 13

Language Models. Web Search. LM Jelinek-Mercer Smoothing and LM Dirichlet Smoothing. Slides based on the books: 13 Language Models LM Jelinek-Mercer Smoothing and LM Dirichlet Smoothing Web Search Slides based on the books: 13 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis

More information

Linear Algebra Background

Linear Algebra Background CS76A Text Retrieval and Mining Lecture 5 Recap: Clustering Hierarchical clustering Agglomerative clustering techniques Evaluation Term vs. document space clustering Multi-lingual docs Feature selection

More information

INFO 4300 / CS4300 Information Retrieval. IR 9: Linear Algebra Review

INFO 4300 / CS4300 Information Retrieval. IR 9: Linear Algebra Review INFO 4300 / CS4300 Information Retrieval IR 9: Linear Algebra Review Paul Ginsparg Cornell University, Ithaca, NY 24 Sep 2009 1/ 23 Overview 1 Recap 2 Matrix basics 3 Matrix Decompositions 4 Discussion

More information

CINQA Workshop Probability Math 105 Silvia Heubach Department of Mathematics, CSULA Thursday, September 6, 2012

CINQA Workshop Probability Math 105 Silvia Heubach Department of Mathematics, CSULA Thursday, September 6, 2012 CINQA Workshop Probability Math 105 Silvia Heubach Department of Mathematics, CSULA Thursday, September 6, 2012 Silvia Heubach/CINQA 2012 Workshop Objectives To familiarize biology faculty with one of

More information

MAJOR IN INTERNATIONAL STUDIES, EUROPEAN STUDIES CONCENTRATION

MAJOR IN INTERNATIONAL STUDIES, EUROPEAN STUDIES CONCENTRATION Major in International Studies, European Studies Concentration 1 MAJOR IN INTERNATIONAL STUDIES, EUROPEAN STUDIES CONCENTRATION Requirements Effective Fall 2018 Freshman ANTH 200 Cultures and the Global

More information

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002 CS276A Text Information Retrieval, Mining, and Exploitation Lecture 4 15 Oct 2002 Recap of last time Index size Index construction techniques Dynamic indices Real world considerations 2 Back of the envelope

More information

Behavioral Data Mining. Lecture 3 Naïve Bayes Classifier and Generalized Linear Models

Behavioral Data Mining. Lecture 3 Naïve Bayes Classifier and Generalized Linear Models Behavioral Data Mining Lecture 3 Naïve Bayes Classifier and Generalized Linear Models Outline Naïve Bayes Classifier Regularization in Linear Regression Generalized Linear Models Assignment Tips: Matrix

More information

Information Retrieval Basic IR models. Luca Bondi

Information Retrieval Basic IR models. Luca Bondi Basic IR models Luca Bondi Previously on IR 2 d j q i IRM SC q i, d j IRM D, Q, R q i, d j d j = w 1,j, w 2,j,, w M,j T w i,j = 0 if term t i does not appear in document d j w i,j and w i:1,j assumed to

More information

Nuevo examen - 02 de Febrero de 2017 [280 marks]

Nuevo examen - 02 de Febrero de 2017 [280 marks] Nuevo examen - 0 de Febrero de 0 [0 marks] Jar A contains three red marbles and five green marbles. Two marbles are drawn from the jar, one after the other, without replacement. a. Find the probability

More information

Dimensionality Reduction

Dimensionality Reduction Dimensionality Reduction Given N vectors in n dims, find the k most important axes to project them k is user defined (k < n) Applications: information retrieval & indexing identify the k most important

More information

Words vs. Terms. Words vs. Terms. Words vs. Terms. Information Retrieval cares about terms You search for em, Google indexes em Query:

Words vs. Terms. Words vs. Terms. Words vs. Terms. Information Retrieval cares about terms You search for em, Google indexes em Query: Words vs. Terms Words vs. Terms Information Retrieval cares about You search for em, Google indexes em Query: What kind of monkeys live in Costa Rica? 600.465 - Intro to NLP - J. Eisner 1 600.465 - Intro

More information

3. Name two countries that have a very high percentage of arable land. 4. What economic activity does most of Kenya s wealth come from?

3. Name two countries that have a very high percentage of arable land. 4. What economic activity does most of Kenya s wealth come from? AP Human Geography Chapter 1: Intro To Human Geo Reader s Notes I. What is Human Geography? 8-9 1. What South American country has the highest average daily calorie consumption per capita? 2. What are

More information

Semantic Similarity from Corpora - Latent Semantic Analysis

Semantic Similarity from Corpora - Latent Semantic Analysis Semantic Similarity from Corpora - Latent Semantic Analysis Carlo Strapparava FBK-Irst Istituto per la ricerca scientifica e tecnologica I-385 Povo, Trento, ITALY strappa@fbk.eu Overview Latent Semantic

More information

Wednesday, 10 September 2008

Wednesday, 10 September 2008 MA211 : Calculus, Part 1 Lecture 2: Sets and Functions Dr Niall Madden (Mathematics, NUI Galway) Wednesday, 10 September 2008 MA211 Lecture 2: Sets and Functions 1/33 Outline 1 Short review of sets 2 Sets

More information

Pivoted Length Normalization I. Summary idf II. Review

Pivoted Length Normalization I. Summary idf II. Review 2 Feb 2006 1/11 COM S/INFO 630: Representing and Accessing [Textual] Digital Information Lecturer: Lillian Lee Lecture 3: 2 February 2006 Scribes: Siavash Dejgosha (sd82) and Ricardo Hu (rh238) Pivoted

More information

Outline. Wednesday, 10 September Schedule. Welcome to MA211. MA211 : Calculus, Part 1 Lecture 2: Sets and Functions

Outline. Wednesday, 10 September Schedule. Welcome to MA211. MA211 : Calculus, Part 1 Lecture 2: Sets and Functions Outline MA211 : Calculus, Part 1 Lecture 2: Sets and Functions Dr Niall Madden (Mathematics, NUI Galway) Wednesday, 10 September 2008 1 Short review of sets 2 The Naturals: N The Integers: Z The Rationals:

More information

International Semester Modeling and Simulation in Chemical and Process Engineering

International Semester Modeling and Simulation in Chemical and Process Engineering International Semester Modeling and Simulation in Chemical and Process Engineering October March Internationales Zentrum Clausthal (IZC) International Center Clausthal (IZC) Clausthal University of Technology

More information

Sets and Venn Diagrams

Sets and Venn Diagrams 1) Sets and Venn Diagrams In a survey, 100 students are asked if they like basketball (), football (F) and swimming (S). The Venn diagram shows the results. F 20 25 q 12 17 p 8 r S 42 students like swimming.

More information

Exercise 1: Basics of probability calculus

Exercise 1: Basics of probability calculus : Basics of probability calculus Stig-Arne Grönroos Department of Signal Processing and Acoustics Aalto University, School of Electrical Engineering stig-arne.gronroos@aalto.fi [21.01.2016] Ex 1.1: Conditional

More information

University of Pittsburgh at Johnstown. Morehead State University (Kentucky) COURSE COURSE NUMBER PITT JOHNSTOWN COURSE TITLE CREDITS TRANSFER SUBJECT

University of Pittsburgh at Johnstown. Morehead State University (Kentucky) COURSE COURSE NUMBER PITT JOHNSTOWN COURSE TITLE CREDITS TRANSFER SUBJECT TITLE PITT JOHNSTOWN TITLE ACCT 281 Principles of Financial Accounting 3 BUS 0000 Non-Equivalent* 3 ACCT 282 Principles of Managerial Accounting 3 BUS 0000 Non-Equivalent* 3 ACCT 381 Intermediate Accounting

More information

Blog Distillation via Sentiment-Sensitive Link Analysis

Blog Distillation via Sentiment-Sensitive Link Analysis Blog Distillation via Sentiment-Sensitive Link Analysis Giacomo Berardi, Andrea Esuli, Fabrizio Sebastiani, and Fabrizio Silvestri Istituto di Scienza e Tecnologie dell Informazione, Consiglio Nazionale

More information