What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured.
|
|
- Godfrey Moses Flowers
- 5 years ago
- Views:
Transcription
1
2 What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured.
3 Text mining What can be used for text mining?? Classification/categorization Clustering Summarization Retrieval.
4 Pre-processing of text Tokenisation: Separation of tokens with removal of special symbols that are not required in the text. Stemming: Convering the words like playing, played into play. Lemmatisation: Returning the base form of the word. Eg: heard hear Case folding: Conversion of case- caps to small Stop word removal: the, an, on are called stop words. They are of limited use when it comes to determine the weight of a document for retreival. Normalisation: Equivalence classing. Use of synonyms. Spell check too can be performed here.
5 Different models for representation Term frequency and weighing Bag of words: number of occurance of word where the exact ordering is ignored. Vector space model and so on.
6 Term frequency Term frequency is the number of times, the term occurs in the document. Eg: Cricket is a game. Sam likes the game of cricket. Ter ms Cricket is a game Sam likes the of Freq nor mali sed 2/10 1/10 1/10 2/10 1/10 1/10 1/10 1/10 Each documents varies in size. Thus the frequency of terms differs with the size. And it impacts the smaller ones Thus it is normalised
7 Inverse document frequency The whole intension for the terms generation is finding out relevant documents to one specific or to a query that is fired. Occurrence of a term more times cannot indicate the power or potential to determine relevance. Thus their weight needs to be scaled down. We use idf : N idf t = 1 + log e df t Where t is the terms, N = total no. of documents and df t = no. of documents with t term.
8 So, For 3 documents: D1= Cricket is a game. Sam likes the game of cricket. D2= Do you play cricket? D3 = Playing any game is good for health. I play basketball. For D1, the idf values: Cricket is a game Sam likes the of Tf Normalise d tf 2/10 1/10 1/10 2/10 1/10 1/10 1/10 1/10 idf Log(3/2) Log(3/2) Log(3/1) Log(3/2) Log(3/1) Log(3/1) Log(3/1) Log(3/1)
9 Tf-idf To find relevant documents, generally a combined weighted approach is used called as tf-idf. So: w t,d = tf t,d idf t Representation of set of documents as vectors in common vector space is known as vector space model.
10 Calculating similarities between the documents Often cosine similarity is used. We are interested in determining the orthogonality.
11 More about Dot product: When we consider the dot product of two vectors say a. b, we are trying to project a into b. The angle between these vectors determines the orthagonality. If it is 90 degrees, the vectors are orthogonal.
12 Cosine similarity Cosine Similarity (d1, d2) = Dot product(d1, d2) / d1 * d2 Dot product (d1,d2) = d1[0] * d2[0] + d1[1] * d2[1] + + d1[n] * d2[n] d1 = square root(d1[0] 2 + d1[1] d1[n] 2 ) d2 = square root(d2[0] 2 + d2[1] d2[n] 2 )
13 What is to be done and how? To determine which document matches or is most similar to the input query from the given sets of documents 1. For every term present in the document calculate its term frequency. That is no. of times it is occurring in that document. This is to be done for every document. 2. Calculate idf for every term. Considering the entire corpus of documents, the idf is to be calculated for every term.
14 3. Now, given a query, Calculate tf*idf of those words only which are in the query for every document and So, if the words are game and cricket in query then Doc 1 Doc 2 Doc 3 Game Val 11 Val21 Val31 cricket val12 val22 Val32 4. Same way for the query we have to calculate the tf idf score. Now since we are assuming the query as game cricket, the tf will be 1 and 1. Normalized tf will be 0.5 and 0.5. What about idf? Here the idf of training data is considered directly. So tf*idf value will be 0.5 * idf (obtained in step 2) Let game tf*idf = Qval1 and cricket tf*idf= Qval2
15 Once we have the tf*idf values, every document now will be represented using the FV with just two terms. Doc 1 = Val11, val12 Doc 2 = val21, val22 Doc 3= val31, val32 Query = Qval1, Qval2 Cosine similarity between doc 1 and query = Step 1: dot product =(val11 * Qval1) + (val12 *Qval2) Step 2: Query = sqrt ((Qval1) 2 + (Qval2) 2 ) Doc 1 = sqrt((val11) 2 + (Val12) 2 ) Cosine value = dot product / Query * Doc1 * This way we need to calculate the similarity of all the documents with query and determine the match. * Value = 1 is exact match
16 Problem : D1: This is big classroom D2: Classroom has many benches D3: This is house D4: The house has garden D5: The house is big Big house Query Calculate cosine similarity and determine which document matches to the input query.
Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics
More informationVector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Collection Frequency, cf Define: The total
More informationBoolean and Vector Space Retrieval Models
Boolean and Vector Space Retrieval Models Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) 1
More informationVector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics Querying
More informationNatural Language Processing. Topics in Information Retrieval. Updated 5/10
Natural Language Processing Topics in Information Retrieval Updated 5/10 Outline Introduction to IR Design features of IR systems Evaluation measures The vector space model Latent semantic indexing Background
More informationVariable Latent Semantic Indexing
Variable Latent Semantic Indexing Prabhakar Raghavan Yahoo! Research Sunnyvale, CA November 2005 Joint work with A. Dasgupta, R. Kumar, A. Tomkins. Yahoo! Research. Outline 1 Introduction 2 Background
More informationDealing with Text Databases
Dealing with Text Databases Unstructured data Boolean queries Sparse matrix representation Inverted index Counts vs. frequencies Term frequency tf x idf term weights Documents as vectors Cosine similarity
More informationCS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 4 15 Oct 2002 Recap of last time Index size Index construction techniques Dynamic indices Real world considerations 2 Back of the envelope
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More information.. CSC 566 Advanced Data Mining Alexander Dekhtyar..
.. CSC 566 Advanced Data Mining Alexander Dekhtyar.. Information Retrieval Latent Semantic Indexing Preliminaries Vector Space Representation of Documents: TF-IDF Documents. A single text document is a
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationVector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model
Vector Space Model Yufei Tao KAIST March 5, 2013 In this lecture, we will study a problem that is (very) fundamental in information retrieval, and must be tackled by all search engines. Let S be a set
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 12: Language Models for IR Outline Language models Language Models for IR Discussion What is a language model? We can view a finite state automaton as a deterministic
More informationTerm Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze
Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationLecture 9: Probabilistic IR The Binary Independence Model and Okapi BM25
Lecture 9: Probabilistic IR The Binary Independence Model and Okapi BM25 Trevor Cohn (Slide credits: William Webber) COMP90042, 2015, Semester 1 What we ll learn in this lecture Probabilistic models for
More informationTerm Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan
Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes
More informationRanked Retrieval (2)
Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF
More information1 Information retrieval fundamentals
CS 630 Lecture 1: 01/26/2006 Lecturer: Lillian Lee Scribes: Asif-ul Haque, Benyah Shaparenko This lecture focuses on the following topics Information retrieval fundamentals Vector Space Model (VSM) Deriving
More informationInformation Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)
Information Retrieval and Topic Models Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare
More informationInformation Retrieval
Introduction to Information Retrieval Lecture 11: Probabilistic Information Retrieval 1 Outline Basic Probability Theory Probability Ranking Principle Extensions 2 Basic Probability Theory For events A
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections
More informationRanking-II. Temporal Representation and Retrieval Models. Temporal Information Retrieval
Ranking-II Temporal Representation and Retrieval Models Temporal Information Retrieval Ranking in Information Retrieval Ranking documents important for information overload, quickly finding documents which
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More informationMachine Learning for natural language processing
Machine Learning for natural language processing Classification: k nearest neighbors Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 28 Introduction Classification = supervised method
More informationText Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University
Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Retrieval Models and Implementation Ulf Leser Content of this Lecture Information Retrieval Models Boolean Model Vector Space Model Inverted Files Ulf Leser: Maschinelle
More informationText Analytics (Text Mining)
http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS
More informationOutline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting
Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient
More informationSparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More informationANLP Lecture 22 Lexical Semantics with Dense Vectors
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous
More informationBoolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).
Boolean and Vector Space Retrieval Models 2013 CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). 1 Table of Content Boolean model Statistical vector space model Retrieval
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search IR models: Vector Space Model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Brosing boolean vector probabilistic
More informationChapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze
Chapter 10: Information Retrieval See corresponding chapter in Manning&Schütze Evaluation Metrics in IR 2 Goal In IR there is a much larger variety of possible metrics For different tasks, different metrics
More informationWeb Information Retrieval Dipl.-Inf. Christoph Carl Kling
Institute for Web Science & Technologies University of Koblenz-Landau, Germany Web Information Retrieval Dipl.-Inf. Christoph Carl Kling Exercises WebIR ask questions! WebIR@c-kling.de 2 of 40 Probabilities
More informationScoring, Term Weighting and the Vector Space
Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J
More informationLatent Semantic Analysis. Hongning Wang
Latent Semantic Analysis Hongning Wang CS@UVa Recap: vector space model Represent both doc and query by concept vectors Each concept defines one dimension K concepts define a high-dimensional space Element
More informationPart I: Web Structure Mining Chapter 1: Information Retrieval and Web Search
Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus
More informationText Analytics (Text Mining)
http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS
More informationMATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson
MATRIX DECOMPOSITION AND LATENT SEMANTIC INDEXING (LSI) Introduction to Information Retrieval CS 150 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Latent
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-017 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 1 Boolean Retrieval Thus far,
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics,
More informationEstimating the Selectivity of tf-idf based Cosine Similarity Predicates
Estimating the Selectivity of tf-idf based Cosine Similarity Predicates Sandeep Tata Jignesh M. Patel Department of Electrical Engineering and Computer Science University of Michigan 22 Hayward Street,
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;
More informationInformation Retrival Ranking. tf-idf. Eddie Aronovich. October 30, 2012
October 30, 2012 Table of contents 1 2 Term Frequency tf idf Inforamation Retrival the techniques of storing and recovering and often disseminating recorded data especially through the use of a computerized
More informationLanguage as a Stochastic Process
CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any
More informationRecap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors?
CS276A Information Retrieval Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces Lecture 7 This
More informationMapping of Science. Bart Thijs ECOOM, K.U.Leuven, Belgium
Mapping of Science Bart Thijs ECOOM, K.U.Leuven, Belgium Introduction Definition: Mapping of Science is the application of powerful statistical tools and analytical techniques to uncover the structure
More informationInvestigation of Latent Semantic Analysis for Clustering of Czech News Articles
Investigation of Latent Semantic Analysis for Clustering of Czech News Articles Michal Rott, Petr Červa Laboratory of Computer Speech Processing 4. 9. 2014 Introduction Idea of article clustering Presumptions:
More informationFall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26
Fall 2016 CS646: Information Retrieval Lecture 6 Boolean Search and Vector Space Model Jiepu Jiang University of Massachusetts Amherst 2016/09/26 Outline Today Boolean Retrieval Vector Space Model Latent
More informationTDDD43. Information Retrieval. Fang Wei-Kleiner. ADIT/IDA Linköping University. Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1
TDDD43 Information Retrieval Fang Wei-Kleiner ADIT/IDA Linköping University Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1 Outline 1. Introduction 2. Inverted index 3. Ranked Retrieval tf-idf
More informationTerms in Time and Times in Context: A Graph-based Term-Time Ranking Model
Terms in Time and Times in Context: A Graph-based Term-Time Ranking Model Andreas Spitz, Jannik Strötgen, Thomas Bögel and Michael Gertz Heidelberg University Institute of Computer Science Database Systems
More informationRETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
RETRIEVAL MODELS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Boolean model Vector space model Probabilistic
More informationLanguage Processing with Perl and Prolog
Language Processing with Perl and Prolog Chapter 5: Counting Words Pierre Nugues Lund University Pierre.Nugues@cs.lth.se http://cs.lth.se/pierre_nugues/ Pierre Nugues Language Processing with Perl and
More informationProbabilistic Information Retrieval
Probabilistic Information Retrieval Sumit Bhatia July 16, 2009 Sumit Bhatia Probabilistic Information Retrieval 1/23 Overview 1 Information Retrieval IR Models Probability Basics 2 Document Ranking Problem
More informationRetrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1
Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency Srihari: CSE 626 1 Text Retrieval Retrieval of text-based information is referred to as Information Retrieval (IR)
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 11: Probabilistic Information Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk
More informationNon-Boolean models of retrieval: Agenda
Non-Boolean models of retrieval: Agenda Review of Boolean model and TF/IDF Simple extensions thereof Vector model Language Model-based retrieval Matrix decomposition methods Non-Boolean models of retrieval:
More informationDocument Similarity in Information Retrieval
Document Similarity in Information Retrieval Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of
More informationLatent semantic indexing
Latent semantic indexing Relationship between concepts and words is many-to-many. Solve problems of synonymy and ambiguity by representing documents as vectors of ideas or concepts, not terms. For retrieval,
More information5 10 12 32 48 5 10 12 32 48 4 8 16 32 64 128 4 8 16 32 64 128 2 3 5 16 2 3 5 16 5 10 12 32 48 4 8 16 32 64 128 2 3 5 16 docid score 5 10 12 32 48 O'Neal averaged 15.2 points 9.2 rebounds and 1.0 assists
More information6.034 Introduction to Artificial Intelligence
6.34 Introduction to Artificial Intelligence Tommi Jaakkola MIT CSAIL The world is drowning in data... The world is drowning in data...... access to information is based on recommendations Recommending
More informationLecture 1b: Text, terms, and bags of words
Lecture 1b: Text, terms, and bags of words Trevor Cohn (based on slides by William Webber) COMP90042, 2015, Semester 1 Corpus, document, term Body of text referred to as corpus Corpus regarded as a collection
More informationDimensionality reduction
Dimensionality reduction ML for NLP Lecturer: Kevin Koidl Assist. Lecturer Alfredo Maldonado https://www.cs.tcd.ie/kevin.koidl/cs4062/ kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie 2017 Recapitulating: Evaluating
More informationFROM QUERIES TO TOP-K RESULTS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
FROM QUERIES TO TOP-K RESULTS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Retrieval evaluation Link
More informationSemantic Similarity from Corpora - Latent Semantic Analysis
Semantic Similarity from Corpora - Latent Semantic Analysis Carlo Strapparava FBK-Irst Istituto per la ricerca scientifica e tecnologica I-385 Povo, Trento, ITALY strappa@fbk.eu Overview Latent Semantic
More informationSemantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing
Semantics with Dense Vectors Reference: D. Jurafsky and J. Martin, Speech and Language Processing 1 Semantics with Dense Vectors We saw how to represent a word as a sparse vector with dimensions corresponding
More informationLecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu
Lecture 16-17: Bayesian Nonparametrics I STAT 6474 Instructor: Hongxiao Zhu Plan for today Why Bayesian Nonparametrics? Dirichlet Distribution and Dirichlet Processes. 2 Parameter and Patterns Reference:
More informationText Analysis. Week 5
Week 5 Text Analysis Reference and Slide Source: ChengXiang Zhai and Sean Massung. 2016. Text Data Management and Analysis: a Practical Introduction to Information Retrieval and Text Mining. Association
More informationPROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability
More informationQuerying. 1 o Semestre 2008/2009
Querying Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2008/2009 Outline 1 2 3 4 5 Outline 1 2 3 4 5 function sim(d j, q) = 1 W d W q W d is the document norm W q is the
More informationManning & Schuetze, FSNLP, (c)
page 554 554 15 Topics in Information Retrieval co-occurrence Latent Semantic Indexing Term 1 Term 2 Term 3 Term 4 Query user interface Document 1 user interface HCI interaction Document 2 HCI interaction
More informationInformation Retrieval Basic IR models. Luca Bondi
Basic IR models Luca Bondi Previously on IR 2 d j q i IRM SC q i, d j IRM D, Q, R q i, d j d j = w 1,j, w 2,j,, w M,j T w i,j = 0 if term t i does not appear in document d j w i,j and w i:1,j assumed to
More informationChap 2: Classical models for information retrieval
Chap 2: Classical models for information retrieval Jean-Pierre Chevallet & Philippe Mulhem LIG-MRIM Sept 2016 Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81 Outline Basic IR Models 1 Basic
More informationLatent Dirichlet Allocation Introduction/Overview
Latent Dirichlet Allocation Introduction/Overview David Meyer 03.10.2016 David Meyer http://www.1-4-5.net/~dmm/ml/lda_intro.pdf 03.10.2016 Agenda What is Topic Modeling? Parametric vs. Non-Parametric Models
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,
More informationCS 572: Information Retrieval
CS 572: Information Retrieval Lecture 5: Term Weighting and Ranking Acknowledgment: Some slides in this lecture are adapted from Chris Manning (Stanford) and Doug Oard (Maryland) Lecture Plan Skip for
More informationMotivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models
3. Retrieval Models Motivation Information Need User Retrieval Model Result: Query 1. 2. 3. Document Collection 2 Agenda 3.1 Boolean Retrieval 3.2 Vector Space Model 3.3 Probabilistic IR 3.4 Statistical
More informationSemantic Similarity and Relatedness
Semantic Relatedness Semantic Similarity and Relatedness (Based on Budanitsky, Hirst 2006 and Chapter 20 of Jurafsky/Martin 2 nd. Ed. - Most figures taken from either source.) Many applications require
More informationComplex Data Mining & Workflow Mining. Introduzione al text mining
Complex Data Mining & Workflow Mining Introduzione al text mining Outline Introduzione e concetti di base Motivazioni, applicazioni Concetti di base nell analisi dei dati complessi Text/Web Mining Concetti
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 6: Numerical Linear Algebra: Applications in Machine Learning Cho-Jui Hsieh UC Davis April 27, 2017 Principal Component Analysis Principal
More informationFAQs. Fitting h θ (x)
10/15/2018 - FALL 2018 W9.A.0.0 10/15/2018 - FALL 2018 W9.A.1 FAs How to iprove the ter project proposal? Brainstor with your teaates Make an appointent with e (eeting ust include all of the tea ebers)
More informationA Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR)
A Study for Evaluating the Importance of Various Parts of Speech (POS) for Information Retrieval (IR) Chirag Shah Dept. of CSE IIT Bombay, Powai Mumbai - 400 076, Maharashtra, India. Email: chirag@cse.iitb.ac.in
More informationToday s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara
Spring 2017 W5.B.1 CS435 BIG DATA Today s topics PART 1. LARGE SCALE DATA ANALYSIS USING MAPREDUCE FAQs Minhash Minhash signature Calculating Minhash with MapReduce Locality Sensitive Hashing Sangmi Lee
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 9: Dimension Reduction/Word2vec Cho-Jui Hsieh UC Davis May 15, 2018 Principal Component Analysis Principal Component Analysis (PCA) Data
More informationNeural Networks for Web Page Classification Based on Augmented PCA
Neural Networks for Web Page Classification Based on Augmented PCA Ali Selamat and Sigeru Omatu Division of Computer and Systems Sciences, Graduate School of Engineering, Osaka Prefecture University, Sakai,
More informationText Analytics. Models for Information Retrieval 1. Ulf Leser
Text Analytics Models for Information Retrieval 1 Ulf Leser Processing Pipeline Parsing Normalization Term Weighting Normalization Parsing Corpus IR Database Interface User Document IDs Filtering Ranking
More informationA Study of the Dirichlet Priors for Term Frequency Normalisation
A Study of the Dirichlet Priors for Term Frequency Normalisation ABSTRACT Ben He Department of Computing Science University of Glasgow Glasgow, United Kingdom ben@dcs.gla.ac.uk In Information Retrieval
More informationA Neural Passage Model for Ad-hoc Document Retrieval
A Neural Passage Model for Ad-hoc Document Retrieval Qingyao Ai, Brendan O Connor, and W. Bruce Croft College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA, USA,
More informationText Document Clustering Using Global Term Context Vectors
Knowledge and Information Systems, 31:3, pp. 455 474, 212 (This is recompilation of the published material.) Text Document Clustering Using Global Term Context Vectors Argyris Kalogeratos Aristidis Likas
More informationRetrieval Models. Boolean and Vector Space Retrieval Models. Common Preprocessing Steps. Boolean Model. Boolean Retrieval Model
1 Boolean and Vecor Space Rerieval Models Many slides in his secion are adaped from Prof. Joydeep Ghosh (UT ECE) who in urn adaped hem from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) Rerieval
More informationClassification. Team Ravana. Team Members Cliffton Fernandes Nikhil Keswaney
Email Classification Team Ravana Team Members Cliffton Fernandes Nikhil Keswaney Hello! Cliffton Fernandes MS-CS Nikhil Keswaney MS-CS 2 Topic Area Email Classification Spam: Unsolicited Junk Mail Ham
More informationCSE 494/598 Lecture-4: Correlation Analysis. **Content adapted from last year s slides
CSE 494/598 Lecture-4: Correlation Analysis LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **Content adapted from last year s slides Announcements Project-1 Due: February 12 th 2016 Analysis report:
More informationAn R Package for Latent Semantic Analysis
An R Package for Latent Semantic Analysis LSA-TEL, March 29th 2007, Heerlen, NL Fridolin Wild Vienna University of Economics and Business Administration Structure of the Talk Concepts of the Package Analysis
More informationLecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science
Lecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science http://www.cs.odu.edu/ jbollen January 30, 2003 Page 1 Structure 1. IR formal characterization (a) Mathematical
More information3. Basics of Information Retrieval
Text Analysis and Retrieval 3. Basics of Information Retrieval Prof. Bojana Dalbelo Bašić Assoc. Prof. Jan Šnajder With contributions from dr. sc. Goran Glavaš Mladen Karan, mag. ing. University of Zagreb
More informationVisualização Multi-dimensional: Mineração Visual de Dados multidimensionais e aplicações Parte II Variações e Aplicações. Rosane Minghim
Visualização Multi-dimensional: Mineração Visual de Dados multidimensionais e aplicações Parte II Variações e Aplicações Rosane Minghim 2008-2012 1 Basic Concepts Text Preprocessing Data and text mining
More informationLecture 4 Ranking Search Results. Many thanks to Prabhakar Raghavan for sharing most content from the following slides
Lecture 4 Ranking Search Results Many thanks to Prabhakar Raghavan for sharing most content from the following slides Recap of the previous lecture Index construction Doing sorting with limited main memory
More information