Towards an Efficient Combination of Similarity Measures for Semantic Relation Extraction
1 Towards an Efficient Combination of Similarity Measures for Semantic Relation Extraction Alexander Panchenko Université catholique de Louvain & Bauman Moscow State Technical University 5th December 2011 / CLAIM Seminar, BMSTU Alexander Panchenko 1/30
Plan
1 Introduction
2 Methodology
3 Evaluation
4 Results
5 Conclusion and Further Research
Reference Papers
Panchenko A. Method for Automatic Construction of Semantic Relations Between Concepts of an Information Retrieval Thesaurus. In Herald of the Voronezh State University, Series: Systems Analysis and Information Technologies, vol. 2, 2010.
Panchenko A. Comparison of the Knowledge-, Corpus-, and Web-based Similarity Measures for Semantic Relations Extraction. In Proceedings of the GEMS 2011 Workshop on Geometrical Models of Natural Language Semantics, EMNLP 2011, pages 11-21.
Panchenko A. Towards an Efficient Combination of Similarity Measures for Semantic Relation Extraction. Submitted to the Student Workshop of EACL.
Semantic Relations
r = ⟨c_i, t, c_j⟩ is a semantic relation, where c_i, c_j ∈ C and t ∈ T
C: terms, e.g. "radio" or "receiver operating characteristic"
T: semantic relation types, e.g. hyponymy or synonymy
R ⊆ C × T × C: a set of semantic relations
Semantic Relations Example: Thesaurus
Figure: A part of the information retrieval thesaurus EuroVoc.
R = { ⟨energy-generating product, NT, energy industry⟩,
⟨energy technology, NT, energy industry⟩,
⟨petroleum, RT, fossil fuel⟩,
⟨energy technology, RT, oil technology⟩, ... }
General Problem: Automatic Thesaurus Construction
Figure: A technology of automatic thesaurus construction.
How is the thesaurus used?
- Query expansion and query suggestion
- Navigation and browsing of the corpus
- Visualization of the corpus
The Problem: Semantic Relations Extraction
Input: terms C, semantic relation types T
Output: lexico-semantic relations ^R ⊆ R
Pattern-based relation extraction, with patterns built manually (Hearst, 1992) or semi-automatically (Snow, 2004):
(+) High precision
(-) Complexity and cost of pattern construction
(-) Patterns are highly task- and domain-dependent
Similarity-based relation extraction (Philippovich and Prokhorov, 2002; Grefenstette, 1994; Curran and Moens, 2002):
(-) Less precise
(+) Little or no manual work
(+) More adaptive across domains
Similarity-based Relation Extraction
State of the Art:
- There exist many heterogeneous similarity measures based on corpora, knowledge, the web, definitions, etc.
- Various measures provide complementary types of semantic information.
- This suggests their combination.
Research Questions:
- Which similarity measure is the best for relation extraction?
- How to efficiently combine similarity measures so as to improve relation extraction?
The Key Contributions Up To Now
- A protocol for evaluation of similarity-based relation extraction
- A comparison of 34 single measures
- Two methods of combination: similarity fusion and relation fusion
- Six best combinations outperforming the single measures are found
Similarity-based Semantic Relations Extraction
Semantic Relations Extraction Algorithm
Input: terms C, similarity parameters P, kNN threshold k, minimum similarity value γ
Output: semantic relations ^R (unlabeled)
1 S ← sim(C, P)
2 S ← normalize(S)
3 ^R ← threshold(S, k, γ)
4 return ^R
sim: a similarity measure
normalize: similarity score normalization
threshold: kNN thresholding,
^R = ∪_{i=1}^{|C|} { ⟨c_i, t, c_j⟩ : c_j ∈ top k% of terms most similar to c_i, s_ij ≥ γ }
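The kNN thresholding step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation; the function name `threshold_knn` and the toy similarity matrix are invented for the example:

```python
import numpy as np

def threshold_knn(S, k, gamma):
    """Keep, for each term c_i, relations to its top-k% most similar
    terms c_j whose normalized similarity s_ij is at least gamma."""
    n = S.shape[0]
    top = max(1, int(round(n * k / 100.0)))  # number of neighbours kept per term
    relations = set()
    for i in range(n):
        # indices of the top-k% neighbours of c_i, self excluded
        neighbours = [j for j in np.argsort(-S[i]) if j != i][:top]
        for j in neighbours:
            if S[i, j] >= gamma:
                relations.add((i, j))  # an unlabeled relation <c_i, t, c_j>
    return relations

# Toy normalized similarity matrix over 4 terms
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
print(threshold_knn(S, k=25, gamma=0.5))
```

With k = 25% each term keeps its single nearest neighbour, so the two tight pairs (terms 0-1 and 2-3) are extracted in both directions.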
Knowledge-based Measures (6)
Data: semantic network WordNet 3.0, corpus SemCor.
Variables:
- len(c_i, c_j): length of the shortest path between terms c_i and c_j
- len(c_i, lcs(c_i, c_j)): length of the shortest path from c_i to the lowest common subsumer (LCS) of c_i and c_j
- len(c_root, lcs(c_i, c_j)): length of the shortest path from the root term c_root to the LCS of c_i and c_j
- P(c): probability of the term c, estimated from a corpus
- P(lcs(c_i, c_j)): probability of the LCS of c_i and c_j
Measures: Inverted Edge Count (Jurafsky and Martin, 2009), Leacock-Chodorow (1998), Wu-Palmer (1994), Resnik (1995), Jiang-Conrath (1997), Lin (1998).
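Several of these measures are simple closed-form functions of the variables listed above. A sketch of four of them, using their standard published formulas (the probabilities and depths passed in below are hypothetical values, not taken from WordNet/SemCor):

```python
import math

def resnik(p_lcs):
    # Resnik (1995): information content of the lowest common subsumer
    return -math.log(p_lcs)

def lin(p_i, p_j, p_lcs):
    # Lin (1998): IC of the LCS normalized by the terms' own IC
    return 2 * math.log(p_lcs) / (math.log(p_i) + math.log(p_j))

def jiang_conrath(p_i, p_j, p_lcs):
    # Jiang-Conrath (1997) distance: small when terms share a specific LCS
    return -math.log(p_i) - math.log(p_j) + 2 * math.log(p_lcs)

def wu_palmer(depth_lcs, depth_i, depth_j):
    # Wu-Palmer (1994): depth of the LCS relative to the terms' depths
    return 2.0 * depth_lcs / (depth_i + depth_j)

# Hypothetical corpus probabilities: two rare terms under an LCS with P = 0.01
print(round(lin(0.001, 0.002, 0.01), 3))
```

Note that Resnik, Lin, and Jiang-Conrath need the corpus (here SemCor) only for the probability estimates, while Wu-Palmer needs only the network structure.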
Web-based Measures (9)
Data: number of hits returned by an information retrieval system (GOOGLE, YAHOO, YAHOO BOSS, BING).
Variables:
- h_i: number of hits returned by the query "c_i"
- h_ij: number of hits returned by the query "c_i AND c_j"
Measures: NGD (Cilibrasi and Vitanyi, 2007), PMI-IR (Turney, 2001).
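Both web-based measures follow directly from the hit counts h_i, h_j, h_ij. A sketch using their standard published formulas; the hit counts and index size below are made-up numbers for illustration, not real search-engine results:

```python
import math

def ngd(h_i, h_j, h_ij, n):
    """Normalized Google Distance (Cilibrasi and Vitanyi, 2007) from the
    hit counts of the two queries, their joint query, and index size n."""
    log_hi, log_hj = math.log(h_i), math.log(h_j)
    return (max(log_hi, log_hj) - math.log(h_ij)) / \
           (math.log(n) - min(log_hi, log_hj))

def pmi_ir(h_i, h_j, h_ij, n):
    """PMI-IR (Turney, 2001): pointwise mutual information estimated from
    hit counts; positive when the terms co-occur more often than chance."""
    return math.log2((h_ij * n) / (h_i * h_j))

# Hypothetical hit counts for a related term pair on a 10^10-page index
print(ngd(1_000_000, 2_000_000, 500_000, 10**10))
print(pmi_ir(1_000_000, 2_000_000, 500_000, 10**10))
```

NGD is a distance (0 for identical distributions of hits), while PMI-IR is a similarity, so in practice one of them has to be inverted before the scores are normalized and thresholded.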
Corpus-based Measures (13)
Data: corpora WACYPEDIA (800M tokens) and UKWAC (2000M tokens).
Variables:
- f_i: context window feature vector of term c_i
- f_i^s: syntactic feature vector of c_i
Measures: BDA (Sahlgren, 2006), SDA (Curran, 2003), LSA on the TASA corpus (Landauer and Dumais, 1997), NGD and PMI-IR on the Factiva corpus (Veksler et al., 2008).
Corpus-based Measures: Distributional Analysis
Distributional Similarity Measure
Input: terms C, corpus D, number of features β, minimum term frequency θ, feature matrix construction parameters P
Output: similarity matrix S [C × C]
1 F ← construct_fmatrix(C, D, β, θ, P)
2 F ← pmi(F)
3 S ← cos(F)
4 return S
PMI normalization: f_ij = log( P(c_i, f_j) / (P(c_i) P(f_j)) )
Cosine similarity: s_ij = cos(f_i, f_j) = (f_i · f_j) / (||f_i|| ||f_j||)
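The PMI reweighting and cosine steps of this algorithm can be sketched directly from the two formulas. A minimal version under stated assumptions (the toy count matrix is invented; undefined log(0) cells are simply zeroed, one common convention):

```python
import numpy as np

def pmi_weight(F):
    """Reweight a term-by-feature co-occurrence count matrix F with
    pointwise mutual information: f_ij = log P(c_i,f_j)/(P(c_i)P(f_j))."""
    total = F.sum()
    p_term = F.sum(axis=1, keepdims=True) / total   # P(c_i)
    p_feat = F.sum(axis=0, keepdims=True) / total   # P(f_j)
    p_joint = F / total                             # P(c_i, f_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_term * p_feat))
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts get PMI 0
    return pmi

def cosine_sim(F):
    """Pairwise cosine similarity between the rows of F."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    Fn = F / norms
    return Fn @ Fn.T

# Toy counts: 3 terms x 4 context features
F = np.array([[4.0, 0.0, 1.0, 0.0],
              [3.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 5.0]])
S = cosine_sim(pmi_weight(F))
print(np.round(S, 2))
```

Terms 0 and 1 share the dominant feature and come out much more similar to each other than either is to term 2, which lives in disjoint contexts.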
Definition-based Measures (6)
Data: definitions from WordNet, Wikipedia, and Wiktionary.
Variables:
- gloss(c): definition of the term c
- sim(gloss(c_i), gloss(c_j)): similarity of the terms' glosses
- f_i: context vector of c_i, calculated on the corpus of all glosses
- f_i: bag-of-words vector, derived from the definition of c_i
- exist(c_i, c_j): a relation between c_i and c_j in the dictionary
Measures: BDA using Wiktionary and Wikipedia; Extended Lesk using WordNet (Banerjee and Pedersen, 2003); Gloss Vectors using WordNet (Patwardhan and Pedersen, 2006).
Definition-based Measures
Wiktionary-based Similarity Measure
Input: terms C, flag UseWikipedia, number of features β
Output: similarity matrix S [C × C]
1 D ← get_wiktionary_definitions(C)
2 if UseWikipedia then
3   D ← D ∪ get_wikipedia_definitions(C)
4 F ← construct_fmatrix(C, D, β)
5 F ← pmi(F)
6 S ← cos(F)
7 S ← update_similarity(S)
8 return S
Combined Measures
Similarity Fusion: S_cmb = (1/N) Σ_{i=1}^{N} S_i
Relation Fusion:
Relation fusion measure
Input: similarity matrices produced by N measures {S_1, ..., S_N}, kNN threshold k
Output: combined similarity matrix S_cmb
1 for i = 1, N do
2   ^R_i ← threshold(S_i, k, γ = 0)
3   R_i ← relation_matrix(^R_i)
4 S_cmb ← (1/N) Σ_{i=1}^{N} R_i
5 return S_cmb
where r_ij = 1 if ⟨c_i, t, c_j⟩ ∈ ^R_k, and 0 otherwise.
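The two fusion schemes can be sketched side by side. Similarity fusion averages the raw (normalized) scores; relation fusion first binarizes each measure via the kNN threshold and then averages the votes. A minimal illustration with invented toy matrices, not the author's code:

```python
import numpy as np

def similarity_fusion(matrices):
    """Similarity fusion: S_cmb = (1/N) * sum of N normalized matrices."""
    return sum(matrices) / len(matrices)

def relation_fusion(matrices, k, gamma=0.0):
    """Relation fusion: each measure votes. Binarize every matrix by
    keeping, per row, its top-k% highest-scoring cells (self excluded),
    then average the resulting 0/1 relation matrices."""
    n = matrices[0].shape[0]
    top = max(1, int(round(n * k / 100.0)))
    votes = np.zeros_like(matrices[0])
    for S in matrices:
        for i in range(n):
            neighbours = [j for j in np.argsort(-S[i]) if j != i][:top]
            for j in neighbours:
                if S[i, j] >= gamma:
                    votes[i, j] += 1.0
    return votes / len(matrices)

# Two toy normalized similarity matrices over 3 terms
S1 = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
S2 = np.array([[1.0, 0.8, 0.3], [0.8, 1.0, 0.1], [0.3, 0.1, 1.0]])
print(relation_fusion([S1, S2], k=34))
```

Relations that every measure proposes get score 1.0; relations proposed by only some measures get a fractional score, so the combined matrix can itself be thresholded again.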
Combined Measures
Which of the 34 single measures should we combine? We present combinations of three groups of measures:
Group4 = WN-Resnik, BDA, SDA, Def-WktWiki-1000
Group8 = Group4 + WN-WuPalmer, LSA-Tasa, Def-GlossVec., and Def-Ext.Lesk
Group14 = Group8 + WN-LeacockChodorow, WN-Lin, WN-JiangConrath, NGD-Factiva, NGD-Yahoo, and NGD-GoogleWiki.
Evaluation with Human Judgments
Each word pair ⟨c_i, c_j⟩ receives a human similarity score s, a measure similarity score ^s, and the corresponding human rank r and similarity rank ^r; example pairs: ⟨tiger, cat⟩, ⟨book, paper⟩, ⟨computer, keyboard⟩, ⟨possibility, girl⟩, ⟨sugar, approach⟩.
Human judgment datasets:
- WordSim353 (Finkelstein, 2002): 353 pairs
- Miller-Charles (1991): 30 pairs
- Rubenstein-Goodenough (1965): 65 pairs
Pearson's correlation: ρ = cov(s, ^s) / (σ(s) σ(^s))
Spearman's correlation: r = cov(r, ^r) / (σ(r) σ(^r))
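The two correlations differ only in that Spearman's is Pearson's formula applied to the ranks. A minimal self-contained sketch (the human and measure scores below are made-up numbers for five hypothetical word pairs; ties in the ranking are ignored for brevity):

```python
def pearson(x, y):
    """Pearson's rho: covariance of the scores over the product of
    their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    # rank positions (1 = largest score); ties not handled in this sketch
    order = sorted(range(len(x)), key=lambda i: -x[i])
    r = [0] * len(x)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def spearman(x, y):
    """Spearman's r: Pearson's correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical human scores vs. measure scores for five word pairs
human = [9.0, 7.5, 7.0, 2.0, 1.0]
measure = [0.80, 0.90, 0.60, 0.30, 0.10]
print(pearson(human, measure), spearman(human, measure))
```

Spearman's correlation is the more forgiving of the two here: it only penalizes the one swapped rank, while Pearson's also reacts to the non-linear spacing of the raw scores.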
Evaluation with Semantic Relations
Example relations ⟨target term c_i, relatum term c_j, relation type t⟩ for the target "judge": adjudicate (syn), arbitrate (syn), assessor (syn), chancellor (syn), gendarmerie (syn), sheriff (syn); pc (random), fare (random), lemon (random).
The number of correct and random relations is equal for each target term!
Semantic Relations Datasets:
- BLESS (Baroni and Lenci, 2011): relations of types hyper, coord, mero, event, attri, random
- SN (Panchenko, ?): relations of types syn, random
Evaluation with Semantic Relations
Let:
- R: all semantic relations which are not random
- ^R: extracted relations
- k: kNN threshold
Evaluation Metrics:
Precision = |R ∩ ^R| / |^R|
Recall = |R ∩ ^R| / |R|
F1 = 2 · Precision · Recall / (Precision + Recall)
MAP(M) = (1/M) Σ_{k=1}^{M} Precision(k)
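These metrics reduce to set operations on the extracted and gold relations. A short sketch, reusing the "judge" examples from the previous slide (which relations count as extracted here is invented for illustration):

```python
def evaluate(extracted, gold):
    """Precision, recall and F1 of extracted relations ^R against the
    non-random gold relations R, both given as sets of term pairs."""
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def mean_average_precision(precisions):
    """MAP over the M values of the kNN threshold k."""
    return sum(precisions) / len(precisions)

gold = {("judge", "adjudicate"), ("judge", "arbitrate"), ("judge", "sheriff")}
extracted = {("judge", "adjudicate"), ("judge", "arbitrate"), ("judge", "lemon")}
print(evaluate(extracted, gold))
```

Two of the three extracted relations are correct and two of the three gold relations are found, so precision, recall, and F1 all equal 2/3 in this toy case.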
Example: Evaluation with Semantic Relations
Relations extracted for the target word "aficionado", ordered by decreasing similarity: enthusiast (syn), fan (syn), admirer (syn), addict (syn), devotee (syn), foundling (random), fanatic (syn), adherent (syn), capital (random), statute (random), blot (random), meddler (random), enlargement (random), bawdyhouse (random).
Precision(50%) is the precision over the top 50% of this ranked list.
Results on the Human Judgements Datasets
Results on the Semantic Relations Datasets
Precision-Recall Curves
Figure: PR graphs of (left) the best single and combined measures; (right) the Wiktionary measures.
Precision-Recall Curves
Figure: PR graph of four combined measures.
Conclusion
The best single measures:
- WordNet-based measure WN-Resnik
- Bag-of-words distributional measure BDA
- Syntactic distributional measure SDA
- Wiktionary measure Def-WktWiki-1000
The best combined measure:
- Relation fusion of 8 measures, Comb-Rel-810
- Very close to the combined measures using 14 measures
Further Research: More Sophisticated Combination Methods
Unsupervised feature combination:
- Bag-of-words features of Distributional Analysis + Wikipedia/Wiktionary/WordNet definitions
- Feature tensor: jointly co-occurring DA features, tensor decompositions for better fusion
- Similarity tensor: yet another similarity fusion technique
Supervised linear combination of pairwise similarities
Supervised linear combination of the features used by the single measures
Further Research: Evaluation
- Domain-specific terms and relations: Agrovoc, MeSH, etc.
- An application-based evaluation: query expansion
Methods:
- Corpus-based: DA with n-grams, surface patterns, LSA, LDA, syntactic tree kernels
- Web-based: more experiments with Google hits
- Knowledge-based: SimRank, random walks and the like on the Wikipedia/Wiktionary/WordNet category lattice
- Surface-based: edit distance, longest common substring, etc.
- Relation types: a supervised model trained on a set of hyponyms, synonyms, etc.
Questions
Thank you! Questions?
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 6: Numerical Linear Algebra: Applications in Machine Learning Cho-Jui Hsieh UC Davis April 27, 2017 Principal Component Analysis Principal
More informationPROBABILISTIC LATENT SEMANTIC ANALYSIS
PROBABILISTIC LATENT SEMANTIC ANALYSIS Lingjia Deng Revised from slides of Shuguang Wang Outline Review of previous notes PCA/SVD HITS Latent Semantic Analysis Probabilistic Latent Semantic Analysis Applications
More informationEvaluation. Brian Thompson slides by Philipp Koehn. 25 September 2018
Evaluation Brian Thompson slides by Philipp Koehn 25 September 2018 Evaluation 1 How good is a given machine translation system? Hard problem, since many different translations acceptable semantic equivalence
More informationPV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211
PV211: Introduction to Information Retrieval https://www.fi.muni.cz/~sojka/pv211 IIR 18: Latent Semantic Indexing Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk University,
More informationIII.6 Advanced Query Types
III.6 Advanced Query Types 1. Query Expansion 2. Relevance Feedback 3. Novelty & Diversity Based on MRS Chapter 9, BY Chapter 5, [Carbonell and Goldstein 98] [Agrawal et al 09] 123 1. Query Expansion Query
More informationVariable Latent Semantic Indexing
Variable Latent Semantic Indexing Prabhakar Raghavan Yahoo! Research Sunnyvale, CA November 2005 Joint work with A. Dasgupta, R. Kumar, A. Tomkins. Yahoo! Research. Outline 1 Introduction 2 Background
More informationHMM Expanded to Multiple Interleaved Chains as a Model for Word Sense Disambiguation
HMM Expanded to Multiple Interleaved Chains as a Model for Word Sense Disambiguation Denis Turdakov and Dmitry Lizorkin Institute for System Programming of the Russian Academy of Sciences, 25 Solzhenitsina
More informationInvestigation of Latent Semantic Analysis for Clustering of Czech News Articles
Investigation of Latent Semantic Analysis for Clustering of Czech News Articles Michal Rott, Petr Červa Laboratory of Computer Speech Processing 4. 9. 2014 Introduction Idea of article clustering Presumptions:
More informationOntology-Based News Recommendation
Ontology-Based News Recommendation Wouter IJntema Frank Goossen Flavius Frasincar Frederik Hogenboom Erasmus University Rotterdam, the Netherlands frasincar@ese.eur.nl Outline Introduction Hermes: News
More informationCross-Lingual Language Modeling for Automatic Speech Recogntion
GBO Presentation Cross-Lingual Language Modeling for Automatic Speech Recogntion November 14, 2003 Woosung Kim woosung@cs.jhu.edu Center for Language and Speech Processing Dept. of Computer Science The
More informationAn Efficient Sliding Window Approach for Approximate Entity Extraction with Synonyms
An Efficient Sliding Window Approach for Approximate Entity Extraction with Synonyms Jin Wang (UCLA) Chunbin Lin (Amazon AWS) Mingda Li (UCLA) Carlo Zaniolo (UCLA) OUTLINE Motivation Preliminaries Framework
More information16 The Information Retrieval "Data Model"
16 The Information Retrieval "Data Model" 16.1 The general model Not presented in 16.2 Similarity the course! 16.3 Boolean Model Not relevant for exam. 16.4 Vector space Model 16.5 Implementation issues
More informationCollaborative NLP-aided ontology modelling
Collaborative NLP-aided ontology modelling Chiara Ghidini ghidini@fbk.eu Marco Rospocher rospocher@fbk.eu International Winter School on Language and Data/Knowledge Technologies TrentoRISE Trento, 24 th
More informationOutline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting
Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient
More informationA Neural Passage Model for Ad-hoc Document Retrieval
A Neural Passage Model for Ad-hoc Document Retrieval Qingyao Ai, Brendan O Connor, and W. Bruce Croft College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA, USA,
More informationKnowledge Discovery in Data: Overview. Naïve Bayesian Classification. .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..
Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar Knowledge Discovery in Data: Naïve Bayes Overview Naïve Bayes methodology refers to a probabilistic approach to information discovery
More informationFall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26
Fall 2016 CS646: Information Retrieval Lecture 6 Boolean Search and Vector Space Model Jiepu Jiang University of Massachusetts Amherst 2016/09/26 Outline Today Boolean Retrieval Vector Space Model Latent
More informationLatent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology
Latent Semantic Indexing (LSI) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276,
More informationInformation Retrieval
https://vvtesh.sarahah.com/ Information Retrieval Venkatesh Vinayakarao Term: Aug Dec, 2018 Indian Institute of Information Technology, Sri City Characteristic vectors representing code are often high
More informationLatent semantic indexing
Latent semantic indexing Relationship between concepts and words is many-to-many. Solve problems of synonymy and ambiguity by representing documents as vectors of ideas or concepts, not terms. For retrieval,
More informationCollaborative topic models: motivations cont
Collaborative topic models: motivations cont Two topics: machine learning social network analysis Two people: " boy Two articles: article A! girl article B Preferences: The boy likes A and B --- no problem.
More informationManning & Schuetze, FSNLP (c) 1999,2000
558 15 Topics in Information Retrieval (15.10) y 4 3 2 1 0 0 1 2 3 4 5 6 7 8 Figure 15.7 An example of linear regression. The line y = 0.25x + 1 is the best least-squares fit for the four points (1,1),
More informationPROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search IR models: Vector Space Model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Brosing boolean vector probabilistic
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 5: Multi-class and preference learning Juho Rousu 11. October, 2017 Juho Rousu 11. October, 2017 1 / 37 Agenda from now on: This week s theme: going
More informationClick Models for Web Search
Click Models for Web Search Lecture 1 Aleksandr Chuklin, Ilya Markov Maarten de Rijke a.chuklin@uva.nl i.markov@uva.nl derijke@uva.nl University of Amsterdam Google Research Europe AC IM MdR Click Models
More informationCan Vector Space Bases Model Context?
Can Vector Space Bases Model Context? Massimo Melucci University of Padua Department of Information Engineering Via Gradenigo, 6/a 35031 Padova Italy melo@dei.unipd.it Abstract Current Information Retrieval
More informationData Mining Recitation Notes Week 3
Data Mining Recitation Notes Week 3 Jack Rae January 28, 2013 1 Information Retrieval Given a set of documents, pull the (k) most similar document(s) to a given query. 1.1 Setup Say we have D documents
More informationPredicting Neighbor Goodness in Collaborative Filtering
Predicting Neighbor Goodness in Collaborative Filtering Alejandro Bellogín and Pablo Castells {alejandro.bellogin, pablo.castells}@uam.es Universidad Autónoma de Madrid Escuela Politécnica Superior Introduction:
More informationSupervised Metric Learning with Generalization Guarantees
Supervised Metric Learning with Generalization Guarantees Aurélien Bellet Laboratoire Hubert Curien, Université de Saint-Etienne, Université de Lyon Reviewers: Pierre Dupont (UC Louvain) and Jose Oncina
More informationVector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model
Vector Space Model Yufei Tao KAIST March 5, 2013 In this lecture, we will study a problem that is (very) fundamental in information retrieval, and must be tackled by all search engines. Let S be a set
More informationLecture 5: Web Searching using the SVD
Lecture 5: Web Searching using the SVD Information Retrieval Over the last 2 years the number of internet users has grown exponentially with time; see Figure. Trying to extract information from this exponentially
More informationAn Introduction to String Re-Writing Kernel
An Introduction to String Re-Writing Kernel Fan Bu 1, Hang Li 2 and Xiaoyan Zhu 3 1,3 State Key Laboratory of Intelligent Technology and Systems 1,3 Tsinghua National Laboratory for Information Sci. and
More informationFrom ITDL to Place2Vec Reasoning About Place Type Similarity and Relatedness by Learning Embeddings From Augmented Spatial Contexts
From ITDL to Place2Vec Reasoning About Place Type Similarity and Relatedness by Learning Embeddings From Augmented Spatial Contexts ABSTRACT Bo Yan STKO Lab University of California, Santa Barbara boyan@geog.ucsb.edu
More information13 Searching the Web with the SVD
13 Searching the Web with the SVD 13.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this
More informationPrincipal Component Analysis and Singular Value Decomposition. Volker Tresp, Clemens Otte Summer 2014
Principal Component Analysis and Singular Value Decomposition Volker Tresp, Clemens Otte Summer 2014 1 Motivation So far we always argued for a high-dimensional feature space Still, in some cases it makes
More informationNatural Language Processing
David Packard, A Concordance to Livy (1968) Natural Language Processing Info 159/259 Lecture 8: Vector semantics and word embeddings (Sept 18, 2018) David Bamman, UC Berkeley 259 project proposal due 9/25
More informationFielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data
Fielded Sequential Dependence Model for Ad-Hoc Entity Retrieval in the Web of Data Nikita Zhiltsov 1,2 Alexander Kotov 3 Fedor Nikolaev 3 1 Kazan Federal University 2 Textocat 3 Textual Data Analytics
More informationComputer science research seminar: VideoLectures.Net recommender system challenge: presentation of baseline solution
Computer science research seminar: VideoLectures.Net recommender system challenge: presentation of baseline solution Nino Antulov-Fantulin 1, Mentors: Tomislav Šmuc 1 and Mile Šikić 2 3 1 Institute Rudjer
More informationNearest Neighbor Search with Keywords
Nearest Neighbor Search with Keywords Yufei Tao KAIST June 3, 2013 In recent years, many search engines have started to support queries that combine keyword search with geography-related predicates (e.g.,
More informationORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation
ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation Chin-Yew Lin and Franz Josef Och Information Sciences Institute University of Southern California 4676 Admiralty Way
More informationSimilarity for Conceptual Querying
Similarity for Conceptual Querying Troels Andreasen, Henrik Bulskov, and Rasmus Knappe Department of Computer Science, Roskilde University, P.O. Box 260, DK-4000 Roskilde, Denmark {troels,bulskov,knappe}@ruc.dk
More informationMidterm Examination Practice
University of Illinois at Urbana-Champaign Midterm Examination Practice CS598CXZ Advanced Topics in Information Retrieval (Fall 2013) Professor ChengXiang Zhai 1. Basic IR evaluation measures: The following
More information