Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search
|
|
- Wilfrid Copeland
- 6 years ago
- Views:
Transcription
1 Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search
2 The Web Challenges Tim Berners-Lee, Information Management: A Proposal, CERN, March
3 The Web Challenges 8 years later The recent Web is huge an grows increibly fast. About ten years after the Tim Berners-Lee proposal the Web was estimate to 50 million noes (pages an.7 billion eges (links. Now it inclues more than 4 billion pages, with about a million ae every ay. Restricte formal semantics - noes are ust web pages an links are of a single type (e.g. refer to. The meaning of the noes an links is not a part of the web system, rather it is left to the web page evelopers to escribe in the page content what their web ocuments mean an what kin of relations they have with the ocumente they link to. As there is no central authority or eitors relevance, popularity or authority of web pages are har to evaluate. Links are also very iverse an many have nothing to o with content or authority (e.g. navigation links. 3
4 The Web Challenges How to turn the web ata into web knowlege Use the existing Web Web Search Engines Topic Directories Change the Web Semantic Web 4
5 Crawling The Web To make Web search efficient search engines collect web ocuments an inex them by the wors (terms they contain. For the purposes of inexing web pages are first collecte an store in a local repository Web crawlers (also calle spiers or robots are programs that systematically an exhaustively browse the Web an store all visite pages Crawlers follow the hyperlinks in the Web ocuments implementing graph search algorithms like epth-first an breath-first 5
6 Crawling The Web Depth-first Web crawling limite to epth 3 6
7 Crawling The Web Breath-first Web crawling limite to epth 3 7
8 Crawling The Web Issues in Web Crawling: Network latency (multithreaing Aress resolution (DNS caching Extracting URLs (use canonical form Managing a huge web page repository Upating inices Responing to constantly changing Web Interaction of Web page evelopers Avance crawling by guie (informe search (using web page ranks 8
9 Inexing an Keywor Search We nee efficient content-base access to Web ocuments Document representation: Term-ocument matrix (inverte inex Relevance ranking: Vector space moel 9
10 Inexing an Keywor Search Creating term-ocument matrix (inverte inex Documents are tokenize (punctuation marks are remove an the character strings without spaces are consiere as tokens All characters are converte to upper or to lower case. Wors are reuce to their canonical form (stemming Stopwors (a, an, the, on, in, at, etc. are remove. The remaining wors, now calle terms are use as features (attributes in the term-ocument matrix 0
11 CCSU Departments example Document statistics Document ID Document name Wors Terms Anthropology Art Biology Chemistry Communication Computer Science Criminal Justice Economics English Geography History Mathematics Moern Languages Music Philosophy Physics Political Science Psychology Sociology Theatre 6 80 Total number of wors/terms Number of ifferent wors/terms
12 CCSU Departments example Boolean (Binary Term Document Matrix DID lab laboratory programming computer program
13 CCSU Departments example Term ocument matrix with positions DID lab laboratory programming computer program [7] [7] 3 0 [65,69] 0 [68] [26] [30,43] [40,42] [,3,7,3,26,34] [,8,6] [9,42] [57] [7] [42] 0 0 [4] [7] [37,38] [8] [68] [5]
14 Vector Space Moel Boolean representation ocuments, 2,, n terms t, t 2,, t m term t i occurs n i times in ocument. Boolean representation: r ( K 2 m i 0 if if n n i i > 0 0 For example, if the terms are: lab, laboratory, programming, computer an program. Then the Computer Science ocument is represente by the Boolean vector r 6 (0 0 4
15 Term Frequency (TF representation Document vector r ( K 2 m with components i TF( ti, Using the sum of term counts: Using the maximum of term counts: TF( t, i 0 ni m n k TF( t i, n k if n if n 0 n max i k i i 0 > 0 n k if if n n i i 0 > 0 Cornell SMART system: TF( t i, 0 + log( + log n i if if n n i i 0 > 0 5
16 Inverte Document Frequency (IDF U n Document collection: D, ocuments that contain term : t i Dt i i { n > 0} Simple fraction: or IDF ( t i D D t i Using a log function: IDF( t i + D log D t i 6
17 TFIDF representation i TF( ti, IDF( ti For example, the computer science TF vector r 6 ( scale with the IDF of the terms results in lab laboratory Programming computer program r 6 (
18 Relevance Ranking Represent the query as a vector q {computer, program} q r ( Apply IDF to its components q r ( Use Eucliean norm of the vector ifference lab laboratory Programming computer program r r q or Cosine similarity (equivalent to ot prouct for normalize vectors r q. r m i q i i m i ( q i i 2 8
19 Relevance Ranking q r ( Cosine similarities an istances to r Doc TFIDF Coorinates (normalize q r (normalize r r. (rank q (rank ( ( ( ( (3.043 ( D
20 Relevance Feeback The user provies fee back: Relevant ocuments Irrelevant ocuments D + D q r The original query vector is upate (Rocchio s metho r r q α q + β D + D r γ + D r Pseuo-relevance feeback Top 0 ocuments returne by the original query belong to D + The rest of ocuments belong to D - 20
21 Avance text search Using OR or NOT boolean operators Phrase Search Statistical methos to extract phrases from text Inexing phrases Part-of-speech tagging Approximate string matching (using n-grams Example: match program an prorgam {pr, ro, og, gr, ra, am} {pr, ro, or, rg, ga, am} {pr, ro, am} 2
22 Using the HTML structure in keywor search Titles an metatags Use them as tags in inexing Moify ranking epening on the context where the term occurs Heaings an font moifiers (prone to spam Anchor text Plays an important role in web page inexing an search Allows to increase search inices with pages that have never been crawle Allows to inex non-textual content (such as images an programs 22
23 Evaluating search quality Assume that there is a set of queries Q an a set of ocuments D, an for each query q Q submitte to the system we have: The response set of ocuments (retrieve ocuments R q D D q The set of relevant ocuments selecte manually from the whole collection of ocuments, i.e. D q D precision D q I R R q q recall D q I R D q q 23
24 Precision-recall framework (set-value Determine the relationship between the set of relevant ocuments ( D q an the set of retrieve ocuments ( R q Ieally D q R q Generally D q I R q D q A very general query leas to recall, but low precision A very restrictive query leas to precision, but low recall A goo balance is neee to maximize both precision an recall 24
25 Precision-recall framework (using ranks With thousans of ocuments fining D q is practically impossible. So, let s consier a list R (,..., q, 2 m of ranke ocuments (highest rank first if i Dq For each i R q compute its relevance as ri 0 otherwise Then efine precision at rank k as precision ( k k An recall at rank k as k recall ( k D q k i i r r i i 25
26 Precision-recall framework (example k Relevant ocuments D q,, Document inex r recall (k precision (k k (
27 Precision-recall framework Average precision r k precision ( k D Combines precision an recall an also evaluates ocument ranking The maximal value of is reache when all relevant ocuments are retrieve an ranke before any irrelevant ones. q D k Practically to compute the average precision we first go over the ranke ocuments from R q an then continue with the rest of the ocuments from D until all relevant ocuments are inclue. 27
28 Similarity Search Cluster hypothesis in IR: Documents similar to relevant ocuments are likely to be relevant too Once a relevant ocument is foun a larger collection of possibly relevant ocuments may be foun by retrieving similar ocuments. Similarity measure between ocuments Cosine Similarity Jaccar Similarity (most popular approach Document Resemblance 28
29 Cosine Similarity Given ocument an collection D the problem is to fin a number (usually 0 or 20 of ocuments i D, which have the largest value of cosine similarity to. No problems with the small query vector, so we can use more (or all imensions of the vector space How many an which terms to use? o All terms from the corpus o Select the terms that best represent the ocuments (Feature Selection Use highest TF score Use highest IDF score Combine TF an IDF scores (e.g. TFIDF 29
30 Jaccar Similarity Use Boolean ocument representation an only the nonzero coorinates of the vectors (i.e. those that are Jaccar Coefficient: proportion of coorinates that are in both ocuments to those that are in either of the ocuments 30 both ocuments to those that are in either of the ocuments Set formulation } { } {, ( sim r r ( ( ( (, ( T T T T sim U I
31 Computing Jaccar Coefficient Problems Direct computation is straightforwar, but with large collections it may lea to inefficient similarity search. Fining similar ocuments at query time is impractical. Solutions Do most of the computation offline Create a list of all ocument pairs sorte by the similarity of the ocuments in each pair. Then the k most similar ocuments to a given ocument are those that are paire with in the first k pairs from the list. Eliminate frequent terms Pair ocuments that share at least one term 3
32 Document Resemblance The task is fining ientical or nearly ientical ocuments, or ocuments that share phrases, sentences or paragraphs. The set of wors approach oes not work. Consier the ocument as a sequence of wors an extract from this sequence short subsequences with fixe length (n-grams or shingles. Represent ocument as a set of w-grams S(, w Example: T ( { a, b, c,, e} S (,2 { ab, bc, c, e} Note that T ( S(, an S (, Use Jaccar to compute resemblance r w (, 2 S( S(, w I S(, w U S( 2 2, w, w 32
. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.
S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial
More informationBoolean and Vector Space Retrieval Models
Boolean and Vector Space Retrieval Models Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) 1
More informationChap 2: Classical models for information retrieval
Chap 2: Classical models for information retrieval Jean-Pierre Chevallet & Philippe Mulhem LIG-MRIM Sept 2016 Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81 Outline Basic IR Models 1 Basic
More informationCUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, and Tony Wu
CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, an Tony Wu Abstract Popular proucts often have thousans of reviews that contain far too much information for customers to igest. Our goal for the
More informationNatural Language Processing. Topics in Information Retrieval. Updated 5/10
Natural Language Processing Topics in Information Retrieval Updated 5/10 Outline Introduction to IR Design features of IR systems Evaluation measures The vector space model Latent semantic indexing Background
More informationCollapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling
Case Stuy : Document Retrieval Collapse Gibbs an Variational Methos for LDA Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 7 th, 0 Example
More informationLDA Collapsed Gibbs Sampler, VariaNonal Inference. Task 3: Mixed Membership Models. Case Study 5: Mixed Membership Modeling
Case Stuy 5: Mixe Membership Moeling LDA Collapse Gibbs Sampler, VariaNonal Inference Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox May 8 th, 05 Emily Fox 05 Task : Mixe
More informationBoolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).
Boolean and Vector Space Retrieval Models 2013 CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). 1 Table of Content Boolean model Statistical vector space model Retrieval
More informationA Pairwise Document Analysis Approach for Monolingual Plagiarism Detection
A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection Introuction Plagiarism: Unauthorize use of Text, coe, iea, Plagiarism etection research area has receive increasing attention
More informationOptimization of Geometries by Energy Minimization
Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.
More informationClassifying Biomedical Text Abstracts based on Hierarchical Concept Structure
Classifying Biomeical Text Abstracts base on Hierarchical Concept Structure Rozilawati Binti Dollah an Masai Aono Abstract Classifying biomeical literature is a ifficult an challenging tas, especially
More information1 Information retrieval fundamentals
CS 630 Lecture 1: 01/26/2006 Lecturer: Lillian Lee Scribes: Asif-ul Haque, Benyah Shaparenko This lecture focuses on the following topics Information retrieval fundamentals Vector Space Model (VSM) Deriving
More informationRetrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1
Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency Srihari: CSE 626 1 Text Retrieval Retrieval of text-based information is referred to as Information Retrieval (IR)
More informationRETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
RETRIEVAL MODELS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Boolean model Vector space model Probabilistic
More informationSYNCHRONOUS SEQUENTIAL CIRCUITS
CHAPTER SYNCHRONOUS SEUENTIAL CIRCUITS Registers an counters, two very common synchronous sequential circuits, are introuce in this chapter. Register is a igital circuit for storing information. Contents
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 11: Probabilistic Information Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationTopic 2.3: The Geometry of Derivatives of Vector Functions
BSU Math 275 Notes Topic 2.3: The Geometry of Derivatives of Vector Functions Textbook Sections: 13.2 From the Toolbox (what you nee from previous classes): Be able to compute erivatives scalar-value functions
More informationSurvey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013
Survey Sampling Kosuke Imai Department of Politics, Princeton University February 19, 2013 Survey sampling is one of the most commonly use ata collection methos for social scientists. We begin by escribing
More informationInformation Retrieval and Web Search
Information Retrieval and Web Search IR models: Vector Space Model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Brosing boolean vector probabilistic
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationTerm Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan
Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections
More informationCS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 4 15 Oct 2002 Recap of last time Index size Index construction techniques Dynamic indices Real world considerations 2 Back of the envelope
More informationAn Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback
Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus
More informationModern Information Retrieval
Modern Information Retrieval Chapter 3 Modeling Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Retrieval Evaluation, Modern Information Retrieval,
More informationLagrangian and Hamiltonian Mechanics
Lagrangian an Hamiltonian Mechanics.G. Simpson, Ph.. epartment of Physical Sciences an Engineering Prince George s Community College ecember 5, 007 Introuction In this course we have been stuying classical
More informationVariable Latent Semantic Indexing
Variable Latent Semantic Indexing Prabhakar Raghavan Yahoo! Research Sunnyvale, CA November 2005 Joint work with A. Dasgupta, R. Kumar, A. Tomkins. Yahoo! Research. Outline 1 Introduction 2 Background
More informationScoring, Term Weighting and the Vector Space
Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J
More informationIntroduction to Information Retrieval
Introduction to Information Retrieval http://informationretrieval.org IIR 19: Size Estimation & Duplicate Detection Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2008.07.08
More informationMultivariable Calculus: Chapter 13: Topic Guide and Formulas (pgs ) * line segment notation above a variable indicates vector
Multivariable Calculus: Chapter 13: Topic Guie an Formulas (pgs 800 851) * line segment notation above a variable inicates vector The 3D Coorinate System: Distance Formula: (x 2 x ) 2 1 + ( y ) ) 2 y 2
More informationCMSC 313 Preview Slides
CMSC 33 Preview Slies These are raft slies. The actual slies presente in lecture may be ifferent ue to last minute changes, scheule slippage,... UMBC, CMSC33, Richar Chang CMSC 33 Lecture
More information2-7. Fitting a Model to Data I. A Model of Direct Variation. Lesson. Mental Math
Lesson 2-7 Fitting a Moel to Data I BIG IDEA If you etermine from a particular set of ata that y varies irectly or inversely as, you can graph the ata to see what relationship is reasonable. Using that
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationPD Controller for Car-Following Models Based on Real Data
PD Controller for Car-Following Moels Base on Real Data Xiaopeng Fang, Hung A. Pham an Minoru Kobayashi Department of Mechanical Engineering Iowa State University, Ames, IA 5 Hona R&D The car following
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;
More informationRanked Retrieval (2)
Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF
More informationSparse Reconstruction of Systems of Ordinary Differential Equations
Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA
More informationCalculus BC Section II PART A A GRAPHING CALCULATOR IS REQUIRED FOR SOME PROBLEMS OR PARTS OF PROBLEMS
Calculus BC Section II PART A A GRAPHING CALCULATOR IS REQUIRED FOR SOME PROBLEMS OR PARTS OF PROBLEMS. An isosceles triangle, whose base is the interval from (0, 0) to (c, 0), has its verte on the graph
More informationDealing with Text Databases
Dealing with Text Databases Unstructured data Boolean queries Sparse matrix representation Inverted index Counts vs. frequencies Term frequency tf x idf term weights Documents as vectors Cosine similarity
More informationIR Models: The Probabilistic Model. Lecture 8
IR Models: The Probabilistic Model Lecture 8 ' * ) ( % $ $ +#! "#! '& & Probability of Relevance? ' ', IR is an uncertain process Information need to query Documents to index terms Query terms and index
More informationVectors in two dimensions
Vectors in two imensions Until now, we have been working in one imension only The main reason for this is to become familiar with the main physical ieas like Newton s secon law, without the aitional complication
More informationHigh Dimensional Search Min- Hashing Locality Sensi6ve Hashing
High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of
More information16 The Information Retrieval "Data Model"
16 The Information Retrieval "Data Model" 16.1 The general model Not presented in 16.2 Similarity the course! 16.3 Boolean Model Not relevant for exam. 16.4 Vector space Model 16.5 Implementation issues
More informationThe Principle of Least Action
Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of
More informationTAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS
MISN-0-4 TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS f(x ± ) = f(x) ± f ' (x) + f '' (x) 2 ±... 1! 2! = 1.000 ± 0.100 + 0.005 ±... TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS by Peter Signell 1.
More informationAPPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France
APPROXIMAE SOLUION FOR RANSIEN HEA RANSFER IN SAIC URBULEN HE II B. Bauouy CEA/Saclay, DSM/DAPNIA/SCM 91191 Gif-sur-Yvette Ceex, France ABSRAC Analytical solution in one imension of the heat iffusion equation
More informationOutline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting
Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient
More informationEquations of lines in
Roberto s Notes on Linear Algebra Chapter 6: Lines, planes an other straight objects Section 1 Equations of lines in What ou nee to know alrea: The ot prouct. The corresponence between equations an graphs.
More informationLower bounds on Locality Sensitive Hashing
Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,
More informationLeast-Squares Regression on Sparse Spaces
Least-Squares Regression on Sparse Spaces Yuri Grinberg, Mahi Milani Far, Joelle Pineau School of Computer Science McGill University Montreal, Canaa {ygrinb,mmilan1,jpineau}@cs.mcgill.ca 1 Introuction
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Retrieval Models and Implementation Ulf Leser Content of this Lecture Information Retrieval Models Boolean Model Vector Space Model Inverted Files Ulf Leser: Maschinelle
More informationWhat is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured.
What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured. Text mining What can be used for text mining?? Classification/categorization
More informationLatent Dirichlet Allocation in Web Spam Filtering
Latent Dirichlet Allocation in Web Spam Filtering István Bíró Jácint Szabó Anrás A. Benczúr Data Mining an Web search Research Group, Informatics Laboratory Computer an Automation Research Institute of
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-017 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 1 Boolean Retrieval Thus far,
More informationRecap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors?
CS276A Information Retrieval Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces Lecture 7 This
More informationPart A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )
Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds
More informationPivoted Length Normalization I. Summary idf II. Review
2 Feb 2006 1/11 COM S/INFO 630: Representing and Accessing [Textual] Digital Information Lecturer: Lillian Lee Lecture 3: 2 February 2006 Scribes: Siavash Dejgosha (sd82) and Ricardo Hu (rh238) Pivoted
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,
More informationLecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012
CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration
More informationA Course in Machine Learning
A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.
More informationLevel Construction of Decision Trees in a Partition-based Framework for Classification
Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S
More informationVector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics Querying
More informationTerm Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze
Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationInformation Retrieval Basic IR models. Luca Bondi
Basic IR models Luca Bondi Previously on IR 2 d j q i IRM SC q i, d j IRM D, Q, R q i, d j d j = w 1,j, w 2,j,, w M,j T w i,j = 0 if term t i does not appear in document d j w i,j and w i:1,j assumed to
More informationHow does Google rank webpages?
Linear Algebra Spring 016 How does Google rank webpages? Dept. of Internet and Multimedia Eng. Konkuk University leehw@konkuk.ac.kr 1 Background on search engines Outline HITS algorithm (Jon Kleinberg)
More informationIR: Information Retrieval
/ 44 IR: Information Retrieval FIB, Master in Innovation and Research in Informatics Slides by Marta Arias, José Luis Balcázar, Ramon Ferrer-i-Cancho, Ricard Gavaldá Department of Computer Science, UPC
More informationTopic Modeling: Beyond Bag-of-Words
Hanna M. Wallach Cavenish Laboratory, University of Cambrige, Cambrige CB3 0HE, UK hmw26@cam.ac.u Abstract Some moels of textual corpora employ text generation methos involving n-gram statistics, while
More informationInformation Retrieval. Lecture 6
Information Retrieval Lecture 6 Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces This lecture
More informationMotivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models
3. Retrieval Models Motivation Information Need User Retrieval Model Result: Query 1. 2. 3. Document Collection 2 Agenda 3.1 Boolean Retrieval 3.2 Vector Space Model 3.3 Probabilistic IR 3.4 Statistical
More informationTable of Common Derivatives By David Abraham
Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec
More informationGeneralizing Kronecker Graphs in order to Model Searchable Networks
Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu
More informationCS 572: Information Retrieval
CS 572: Information Retrieval Lecture 5: Term Weighting and Ranking Acknowledgment: Some slides in this lecture are adapted from Chris Manning (Stanford) and Doug Oard (Maryland) Lecture Plan Skip for
More informationMath 342 Partial Differential Equations «Viktor Grigoryan
Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite
More informationLink Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci
Link Analysis Information Retrieval and Data Mining Prof. Matteo Matteucci Hyperlinks for Indexing and Ranking 2 Page A Hyperlink Page B Intuitions The anchor text might describe the target page B Anchor
More informationQuerying. 1 o Semestre 2008/2009
Querying Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2008/2009 Outline 1 2 3 4 5 Outline 1 2 3 4 5 function sim(d j, q) = 1 W d W q W d is the document norm W q is the
More informationThe Spearman s Rank Correlation Coefficient is a statistical test that examines the degree to which two data sets are correlated, if at all.
4e A Guie to Spearman s Rank The Spearman s Rank Correlation Coefficient is a statistical test that examines the egree to which two ata sets are correlate, if at all. Why woul we use Spearman s Rank? While
More informationSparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More informationANLP Lecture 22 Lexical Semantics with Dense Vectors
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous
More informationA study on ant colony systems with fuzzy pheromone dispersion
A stuy on ant colony systems with fuzzy pheromone ispersion Louis Gacogne LIP6 104, Av. Kenney, 75016 Paris, France gacogne@lip6.fr Sanra Sanri IIIA/CSIC Campus UAB, 08193 Bellaterra, Spain sanri@iiia.csic.es
More informationMulti-View Clustering via Canonical Correlation Analysis
Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms
More informationSection 2.7 Derivatives of powers of functions
Section 2.7 Derivatives of powers of functions (3/19/08) Overview: In this section we iscuss the Chain Rule formula for the erivatives of composite functions that are forme by taking powers of other functions.
More informationMath Skills. Fractions
Throughout your stuy of science, you will often nee to solve math problems. This appenix is esigne to help you quickly review the basic math skills you will use most often. Fractions Aing an Subtracting
More informationChapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze
Chapter 10: Information Retrieval See corresponding chapter in Manning&Schütze Evaluation Metrics in IR 2 Goal In IR there is a much larger variety of possible metrics For different tasks, different metrics
More informationThermal conductivity of graded composites: Numerical simulations and an effective medium approximation
JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University
More informationDATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS
DATA MINING LECTURE 3 Link Analysis Ranking PageRank -- Random walks HITS How to organize the web First try: Manually curated Web Directories How to organize the web Second try: Web Search Information
More informationVector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model
Vector Space Model Yufei Tao KAIST March 5, 2013 In this lecture, we will study a problem that is (very) fundamental in information retrieval, and must be tackled by all search engines. Let S be a set
More informationGeneric Text Summarization
June 27, 2012 Outline Introduction 1 Introduction Notation and Terminology 2 3 4 5 6 Text Summarization Introduction Notation and Terminology Two Types of Text Summarization Query-Relevant Summarization:
More informationDocument Similarity in Information Retrieval
Document Similarity in Information Retrieval Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of
More informationFast Resampling Weighted v-statistics
Fast Resampling Weighte v-statistics Chunxiao Zhou Mar O. Hatfiel Clinical Research Center National Institutes of Health Bethesa, MD 20892 chunxiao.zhou@nih.gov Jiseong Par Dept of Math George Mason Univ
More information12.5. Differentiation of vectors. Introduction. Prerequisites. Learning Outcomes
Differentiation of vectors 12.5 Introuction The area known as vector calculus is use to moel mathematically a vast range of engineering phenomena incluing electrostatics, electromagnetic fiels, air flow
More informationTDDD43. Information Retrieval. Fang Wei-Kleiner. ADIT/IDA Linköping University. Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1
TDDD43 Information Retrieval Fang Wei-Kleiner ADIT/IDA Linköping University Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1 Outline 1. Introduction 2. Inverted index 3. Ranked Retrieval tf-idf
More informationDiophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations
Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything
More informationMulti-Task Minimum Error Rate Training for SMT
Minimum Error Rate Training for SMT Patrick Katharina Stefan Department of Computational Linguistics University of Heielberg, Germany Learning Multi-task learning aims at learning several ifferent tasks
More informationSimilarity Measures for Categorical Data A Comparative Study. Technical Report
Similarity Measures for Categorical Data A Comparative Stuy Technical Report Department of Computer Science an Engineering University of Minnesota 4-92 EECS Builing 200 Union Street SE Minneapolis, MN
More informationThis module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics
This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.
More informationDATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationWeb Information Retrieval Dipl.-Inf. Christoph Carl Kling
Institute for Web Science & Technologies University of Koblenz-Landau, Germany Web Information Retrieval Dipl.-Inf. Christoph Carl Kling Exercises WebIR ask questions! WebIR@c-kling.de 2 of 40 Probabilities
More information