Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Size: px
Start display at page:

Download "Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search"

Transcription

1 Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search

2 The Web Challenges Tim Berners-Lee, Information Management: A Proposal, CERN, March

3 The Web Challenges 8 years later The recent Web is huge an grows increibly fast. About ten years after the Tim Berners-Lee proposal the Web was estimate to 50 million noes (pages an.7 billion eges (links. Now it inclues more than 4 billion pages, with about a million ae every ay. Restricte formal semantics - noes are ust web pages an links are of a single type (e.g. refer to. The meaning of the noes an links is not a part of the web system, rather it is left to the web page evelopers to escribe in the page content what their web ocuments mean an what kin of relations they have with the ocumente they link to. As there is no central authority or eitors relevance, popularity or authority of web pages are har to evaluate. Links are also very iverse an many have nothing to o with content or authority (e.g. navigation links. 3

4 The Web Challenges How to turn the web ata into web knowlege Use the existing Web Web Search Engines Topic Directories Change the Web Semantic Web 4

5 Crawling The Web To make Web search efficient search engines collect web ocuments an inex them by the wors (terms they contain. For the purposes of inexing web pages are first collecte an store in a local repository Web crawlers (also calle spiers or robots are programs that systematically an exhaustively browse the Web an store all visite pages Crawlers follow the hyperlinks in the Web ocuments implementing graph search algorithms like epth-first an breath-first 5

6 Crawling The Web Depth-first Web crawling limite to epth 3 6

7 Crawling The Web Breath-first Web crawling limite to epth 3 7

8 Crawling The Web Issues in Web Crawling: Network latency (multithreaing Aress resolution (DNS caching Extracting URLs (use canonical form Managing a huge web page repository Upating inices Responing to constantly changing Web Interaction of Web page evelopers Avance crawling by guie (informe search (using web page ranks 8

9 Inexing an Keywor Search We nee efficient content-base access to Web ocuments Document representation: Term-ocument matrix (inverte inex Relevance ranking: Vector space moel 9

10 Inexing an Keywor Search Creating term-ocument matrix (inverte inex Documents are tokenize (punctuation marks are remove an the character strings without spaces are consiere as tokens All characters are converte to upper or to lower case. Wors are reuce to their canonical form (stemming Stopwors (a, an, the, on, in, at, etc. are remove. The remaining wors, now calle terms are use as features (attributes in the term-ocument matrix 0

11 CCSU Departments example Document statistics Document ID Document name Wors Terms Anthropology Art Biology Chemistry Communication Computer Science Criminal Justice Economics English Geography History Mathematics Moern Languages Music Philosophy Physics Political Science Psychology Sociology Theatre 6 80 Total number of wors/terms Number of ifferent wors/terms

12 CCSU Departments example Boolean (Binary Term Document Matrix DID lab laboratory programming computer program

13 CCSU Departments example Term ocument matrix with positions DID lab laboratory programming computer program [7] [7] 3 0 [65,69] 0 [68] [26] [30,43] [40,42] [,3,7,3,26,34] [,8,6] [9,42] [57] [7] [42] 0 0 [4] [7] [37,38] [8] [68] [5]

14 Vector Space Moel Boolean representation ocuments, 2,, n terms t, t 2,, t m term t i occurs n i times in ocument. Boolean representation: r ( K 2 m i 0 if if n n i i > 0 0 For example, if the terms are: lab, laboratory, programming, computer an program. Then the Computer Science ocument is represente by the Boolean vector r 6 (0 0 4

15 Term Frequency (TF representation Document vector r ( K 2 m with components i TF( ti, Using the sum of term counts: Using the maximum of term counts: TF( t, i 0 ni m n k TF( t i, n k if n if n 0 n max i k i i 0 > 0 n k if if n n i i 0 > 0 Cornell SMART system: TF( t i, 0 + log( + log n i if if n n i i 0 > 0 5

16 Inverte Document Frequency (IDF U n Document collection: D, ocuments that contain term : t i Dt i i { n > 0} Simple fraction: or IDF ( t i D D t i Using a log function: IDF( t i + D log D t i 6

17 TFIDF representation i TF( ti, IDF( ti For example, the computer science TF vector r 6 ( scale with the IDF of the terms results in lab laboratory Programming computer program r 6 (

18 Relevance Ranking Represent the query as a vector q {computer, program} q r ( Apply IDF to its components q r ( Use Eucliean norm of the vector ifference lab laboratory Programming computer program r r q or Cosine similarity (equivalent to ot prouct for normalize vectors r q. r m i q i i m i ( q i i 2 8

19 Relevance Ranking q r ( Cosine similarities an istances to r Doc TFIDF Coorinates (normalize q r (normalize r r. (rank q (rank ( ( ( ( (3.043 ( D

20 Relevance Feeback The user provies fee back: Relevant ocuments Irrelevant ocuments D + D q r The original query vector is upate (Rocchio s metho r r q α q + β D + D r γ + D r Pseuo-relevance feeback Top 0 ocuments returne by the original query belong to D + The rest of ocuments belong to D - 20

21 Avance text search Using OR or NOT boolean operators Phrase Search Statistical methos to extract phrases from text Inexing phrases Part-of-speech tagging Approximate string matching (using n-grams Example: match program an prorgam {pr, ro, og, gr, ra, am} {pr, ro, or, rg, ga, am} {pr, ro, am} 2

22 Using the HTML structure in keywor search Titles an metatags Use them as tags in inexing Moify ranking epening on the context where the term occurs Heaings an font moifiers (prone to spam Anchor text Plays an important role in web page inexing an search Allows to increase search inices with pages that have never been crawle Allows to inex non-textual content (such as images an programs 22

23 Evaluating search quality Assume that there is a set of queries Q an a set of ocuments D, an for each query q Q submitte to the system we have: The response set of ocuments (retrieve ocuments R q D D q The set of relevant ocuments selecte manually from the whole collection of ocuments, i.e. D q D precision D q I R R q q recall D q I R D q q 23

24 Precision-recall framework (set-value Determine the relationship between the set of relevant ocuments ( D q an the set of retrieve ocuments ( R q Ieally D q R q Generally D q I R q D q A very general query leas to recall, but low precision A very restrictive query leas to precision, but low recall A goo balance is neee to maximize both precision an recall 24

25 Precision-recall framework (using ranks With thousans of ocuments fining D q is practically impossible. So, let s consier a list R (,..., q, 2 m of ranke ocuments (highest rank first if i Dq For each i R q compute its relevance as ri 0 otherwise Then efine precision at rank k as precision ( k k An recall at rank k as k recall ( k D q k i i r r i i 25

26 Precision-recall framework (example k Relevant ocuments D q,, Document inex r recall (k precision (k k (

27 Precision-recall framework Average precision r k precision ( k D Combines precision an recall an also evaluates ocument ranking The maximal value of is reache when all relevant ocuments are retrieve an ranke before any irrelevant ones. q D k Practically to compute the average precision we first go over the ranke ocuments from R q an then continue with the rest of the ocuments from D until all relevant ocuments are inclue. 27

28 Similarity Search Cluster hypothesis in IR: Documents similar to relevant ocuments are likely to be relevant too Once a relevant ocument is foun a larger collection of possibly relevant ocuments may be foun by retrieving similar ocuments. Similarity measure between ocuments Cosine Similarity Jaccar Similarity (most popular approach Document Resemblance 28

29 Cosine Similarity Given ocument an collection D the problem is to fin a number (usually 0 or 20 of ocuments i D, which have the largest value of cosine similarity to. No problems with the small query vector, so we can use more (or all imensions of the vector space How many an which terms to use? o All terms from the corpus o Select the terms that best represent the ocuments (Feature Selection Use highest TF score Use highest IDF score Combine TF an IDF scores (e.g. TFIDF 29

30 Jaccar Similarity Use Boolean ocument representation an only the nonzero coorinates of the vectors (i.e. those that are Jaccar Coefficient: proportion of coorinates that are in both ocuments to those that are in either of the ocuments 30 both ocuments to those that are in either of the ocuments Set formulation } { } {, ( sim r r ( ( ( (, ( T T T T sim U I

31 Computing Jaccar Coefficient Problems Direct computation is straightforwar, but with large collections it may lea to inefficient similarity search. Fining similar ocuments at query time is impractical. Solutions Do most of the computation offline Create a list of all ocument pairs sorte by the similarity of the ocuments in each pair. Then the k most similar ocuments to a given ocument are those that are paire with in the first k pairs from the list. Eliminate frequent terms Pair ocuments that share at least one term 3

32 Document Resemblance The task is fining ientical or nearly ientical ocuments, or ocuments that share phrases, sentences or paragraphs. The set of wors approach oes not work. Consier the ocument as a sequence of wors an extract from this sequence short subsequences with fixe length (n-grams or shingles. Represent ocument as a set of w-grams S(, w Example: T ( { a, b, c,, e} S (,2 { ab, bc, c, e} Note that T ( S(, an S (, Use Jaccar to compute resemblance r w (, 2 S( S(, w I S(, w U S( 2 2, w, w 32

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences. S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial

More information

Boolean and Vector Space Retrieval Models

Boolean and Vector Space Retrieval Models Boolean and Vector Space Retrieval Models Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE) who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong) 1

More information

Chap 2: Classical models for information retrieval

Chap 2: Classical models for information retrieval Chap 2: Classical models for information retrieval Jean-Pierre Chevallet & Philippe Mulhem LIG-MRIM Sept 2016 Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81 Outline Basic IR Models 1 Basic

More information

CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, and Tony Wu

CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, and Tony Wu CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, an Tony Wu Abstract Popular proucts often have thousans of reviews that contain far too much information for customers to igest. Our goal for the

More information

Natural Language Processing. Topics in Information Retrieval. Updated 5/10

Natural Language Processing. Topics in Information Retrieval. Updated 5/10 Natural Language Processing Topics in Information Retrieval Updated 5/10 Outline Introduction to IR Design features of IR systems Evaluation measures The vector space model Latent semantic indexing Background

More information

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling

Collapsed Gibbs and Variational Methods for LDA. Example Collapsed MoG Sampling Case Stuy : Document Retrieval Collapse Gibbs an Variational Methos for LDA Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 7 th, 0 Example

More information

LDA Collapsed Gibbs Sampler, VariaNonal Inference. Task 3: Mixed Membership Models. Case Study 5: Mixed Membership Modeling

LDA Collapsed Gibbs Sampler, VariaNonal Inference. Task 3: Mixed Membership Models. Case Study 5: Mixed Membership Modeling Case Stuy 5: Mixe Membership Moeling LDA Collapse Gibbs Sampler, VariaNonal Inference Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox May 8 th, 05 Emily Fox 05 Task : Mixe

More information

Boolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).

Boolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). Boolean and Vector Space Retrieval Models 2013 CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). 1 Table of Content Boolean model Statistical vector space model Retrieval

More information

A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection

A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection A Pairwise Document Analysis Approach for Monolingual Plagiarism Detection Introuction Plagiarism: Unauthorize use of Text, coe, iea, Plagiarism etection research area has receive increasing attention

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information

Classifying Biomedical Text Abstracts based on Hierarchical Concept Structure

Classifying Biomedical Text Abstracts based on Hierarchical Concept Structure Classifying Biomeical Text Abstracts base on Hierarchical Concept Structure Rozilawati Binti Dollah an Masai Aono Abstract Classifying biomeical literature is a ifficult an challenging tas, especially

More information

1 Information retrieval fundamentals

1 Information retrieval fundamentals CS 630 Lecture 1: 01/26/2006 Lecturer: Lillian Lee Scribes: Asif-ul Haque, Benyah Shaparenko This lecture focuses on the following topics Information retrieval fundamentals Vector Space Model (VSM) Deriving

More information

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1 Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency Srihari: CSE 626 1 Text Retrieval Retrieval of text-based information is referred to as Information Retrieval (IR)

More information

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS RETRIEVAL MODELS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Boolean model Vector space model Probabilistic

More information

SYNCHRONOUS SEQUENTIAL CIRCUITS

SYNCHRONOUS SEQUENTIAL CIRCUITS CHAPTER SYNCHRONOUS SEUENTIAL CIRCUITS Registers an counters, two very common synchronous sequential circuits, are introuce in this chapter. Register is a igital circuit for storing information. Contents

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 11: Probabilistic Information Retrieval Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics, Masaryk

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Topic 2.3: The Geometry of Derivatives of Vector Functions

Topic 2.3: The Geometry of Derivatives of Vector Functions BSU Math 275 Notes Topic 2.3: The Geometry of Derivatives of Vector Functions Textbook Sections: 13.2 From the Toolbox (what you nee from previous classes): Be able to compute erivatives scalar-value functions

More information

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013

Survey Sampling. 1 Design-based Inference. Kosuke Imai Department of Politics, Princeton University. February 19, 2013 Survey Sampling Kosuke Imai Department of Politics, Princeton University February 19, 2013 Survey sampling is one of the most commonly use ata collection methos for social scientists. We begin by escribing

More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search IR models: Vector Space Model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Brosing boolean vector probabilistic

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections

More information

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002 CS276A Text Information Retrieval, Mining, and Exploitation Lecture 4 15 Oct 2002 Recap of last time Index size Index construction techniques Dynamic indices Real world considerations 2 Back of the envelope

More information

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback

An Optimal Algorithm for Bandit and Zero-Order Convex Optimization with Two-Point Feedback Journal of Machine Learning Research 8 07) - Submitte /6; Publishe 5/7 An Optimal Algorithm for Banit an Zero-Orer Convex Optimization with wo-point Feeback Oha Shamir Department of Computer Science an

More information

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 3 Modeling Introduction to IR Models Basic Concepts The Boolean Model Term Weighting The Vector Model Probabilistic Model Retrieval Evaluation, Modern Information Retrieval,

More information

Lagrangian and Hamiltonian Mechanics

Lagrangian and Hamiltonian Mechanics Lagrangian an Hamiltonian Mechanics.G. Simpson, Ph.. epartment of Physical Sciences an Engineering Prince George s Community College ecember 5, 007 Introuction In this course we have been stuying classical

More information

Variable Latent Semantic Indexing

Variable Latent Semantic Indexing Variable Latent Semantic Indexing Prabhakar Raghavan Yahoo! Research Sunnyvale, CA November 2005 Joint work with A. Dasgupta, R. Kumar, A. Tomkins. Yahoo! Research. Outline 1 Introduction 2 Background

More information

Scoring, Term Weighting and the Vector Space

Scoring, Term Weighting and the Vector Space Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 19: Size Estimation & Duplicate Detection Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2008.07.08

More information

Multivariable Calculus: Chapter 13: Topic Guide and Formulas (pgs ) * line segment notation above a variable indicates vector

Multivariable Calculus: Chapter 13: Topic Guide and Formulas (pgs ) * line segment notation above a variable indicates vector Multivariable Calculus: Chapter 13: Topic Guie an Formulas (pgs 800 851) * line segment notation above a variable inicates vector The 3D Coorinate System: Distance Formula: (x 2 x ) 2 1 + ( y ) ) 2 y 2

More information

CMSC 313 Preview Slides

CMSC 313 Preview Slides CMSC 33 Preview Slies These are raft slies. The actual slies presente in lecture may be ifferent ue to last minute changes, scheule slippage,... UMBC, CMSC33, Richar Chang CMSC 33 Lecture

More information

2-7. Fitting a Model to Data I. A Model of Direct Variation. Lesson. Mental Math

2-7. Fitting a Model to Data I. A Model of Direct Variation. Lesson. Mental Math Lesson 2-7 Fitting a Moel to Data I BIG IDEA If you etermine from a particular set of ata that y varies irectly or inversely as, you can graph the ata to see what relationship is reasonable. Using that

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

PD Controller for Car-Following Models Based on Real Data

PD Controller for Car-Following Models Based on Real Data PD Controller for Car-Following Moels Base on Real Data Xiaopeng Fang, Hung A. Pham an Minoru Kobayashi Department of Mechanical Engineering Iowa State University, Ames, IA 5 Hona R&D The car following

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;

More information

Ranked Retrieval (2)

Ranked Retrieval (2) Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF

More information

Sparse Reconstruction of Systems of Ordinary Differential Equations

Sparse Reconstruction of Systems of Ordinary Differential Equations Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA

More information

Calculus BC Section II PART A A GRAPHING CALCULATOR IS REQUIRED FOR SOME PROBLEMS OR PARTS OF PROBLEMS

Calculus BC Section II PART A A GRAPHING CALCULATOR IS REQUIRED FOR SOME PROBLEMS OR PARTS OF PROBLEMS Calculus BC Section II PART A A GRAPHING CALCULATOR IS REQUIRED FOR SOME PROBLEMS OR PARTS OF PROBLEMS. An isosceles triangle, whose base is the interval from (0, 0) to (c, 0), has its verte on the graph

More information

Dealing with Text Databases

Dealing with Text Databases Dealing with Text Databases Unstructured data Boolean queries Sparse matrix representation Inverted index Counts vs. frequencies Term frequency tf x idf term weights Documents as vectors Cosine similarity

More information

IR Models: The Probabilistic Model. Lecture 8

IR Models: The Probabilistic Model. Lecture 8 IR Models: The Probabilistic Model Lecture 8 ' * ) ( % $ $ +#! "#! '& & Probability of Relevance? ' ', IR is an uncertain process Information need to query Documents to index terms Query terms and index

More information

Vectors in two dimensions

Vectors in two dimensions Vectors in two imensions Until now, we have been working in one imension only The main reason for this is to become familiar with the main physical ieas like Newton s secon law, without the aitional complication

More information

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of

More information

16 The Information Retrieval "Data Model"

16 The Information Retrieval Data Model 16 The Information Retrieval "Data Model" 16.1 The general model Not presented in 16.2 Similarity the course! 16.3 Boolean Model Not relevant for exam. 16.4 Vector space Model 16.5 Implementation issues

More information

The Principle of Least Action

The Principle of Least Action Chapter 7. The Principle of Least Action 7.1 Force Methos vs. Energy Methos We have so far stuie two istinct ways of analyzing physics problems: force methos, basically consisting of the application of

More information

TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS

TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS MISN-0-4 TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS f(x ± ) = f(x) ± f ' (x) + f '' (x) 2 ±... 1! 2! = 1.000 ± 0.100 + 0.005 ±... TAYLOR S POLYNOMIAL APPROXIMATION FOR FUNCTIONS by Peter Signell 1.

More information

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France APPROXIMAE SOLUION FOR RANSIEN HEA RANSFER IN SAIC URBULEN HE II B. Bauouy CEA/Saclay, DSM/DAPNIA/SCM 91191 Gif-sur-Yvette Ceex, France ABSRAC Analytical solution in one imension of the heat iffusion equation

More information

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient

More information

Equations of lines in

Equations of lines in Roberto s Notes on Linear Algebra Chapter 6: Lines, planes an other straight objects Section 1 Equations of lines in What ou nee to know alrea: The ot prouct. The corresponence between equations an graphs.

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

Least-Squares Regression on Sparse Spaces

Least-Squares Regression on Sparse Spaces Least-Squares Regression on Sparse Spaces Yuri Grinberg, Mahi Milani Far, Joelle Pineau School of Computer Science McGill University Montreal, Canaa {ygrinb,mmilan1,jpineau}@cs.mcgill.ca 1 Introuction

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Retrieval Models and Implementation Ulf Leser Content of this Lecture Information Retrieval Models Boolean Model Vector Space Model Inverted Files Ulf Leser: Maschinelle

More information

What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured.

What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured. What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured. Text mining What can be used for text mining?? Classification/categorization

More information

Latent Dirichlet Allocation in Web Spam Filtering

Latent Dirichlet Allocation in Web Spam Filtering Latent Dirichlet Allocation in Web Spam Filtering István Bíró Jácint Szabó Anrás A. Benczúr Data Mining an Web search Research Group, Informatics Laboratory Computer an Automation Research Institute of

More information

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-017 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 1 Boolean Retrieval Thus far,

More information

Recap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors?

Recap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors? CS276A Information Retrieval Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces Lecture 7 This

More information

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 ) Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds

More information

Pivoted Length Normalization I. Summary idf II. Review

Pivoted Length Normalization I. Summary idf II. Review 2 Feb 2006 1/11 COM S/INFO 630: Representing and Accessing [Textual] Digital Information Lecturer: Lillian Lee Lecture 3: 2 February 2006 Scribes: Siavash Dejgosha (sd82) and Ricardo Hu (rh238) Pivoted

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,

More information

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012 CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

Level Construction of Decision Trees in a Partition-based Framework for Classification

Level Construction of Decision Trees in a Partition-based Framework for Classification Level Construction of Decision Trees in a Partition-base Framework for Classification Y.Y. Yao, Y. Zhao an J.T. Yao Department of Computer Science, University of Regina Regina, Saskatchewan, Canaa S4S

More information

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics Querying

More information

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Information Retrieval Basic IR models. Luca Bondi

Information Retrieval Basic IR models. Luca Bondi Basic IR models Luca Bondi Previously on IR 2 d j q i IRM SC q i, d j IRM D, Q, R q i, d j d j = w 1,j, w 2,j,, w M,j T w i,j = 0 if term t i does not appear in document d j w i,j and w i:1,j assumed to

More information

How does Google rank webpages?

How does Google rank webpages? Linear Algebra Spring 016 How does Google rank webpages? Dept. of Internet and Multimedia Eng. Konkuk University leehw@konkuk.ac.kr 1 Background on search engines Outline HITS algorithm (Jon Kleinberg)

More information

IR: Information Retrieval

IR: Information Retrieval / 44 IR: Information Retrieval FIB, Master in Innovation and Research in Informatics Slides by Marta Arias, José Luis Balcázar, Ramon Ferrer-i-Cancho, Ricard Gavaldá Department of Computer Science, UPC

More information

Topic Modeling: Beyond Bag-of-Words

Topic Modeling: Beyond Bag-of-Words Hanna M. Wallach Cavenish Laboratory, University of Cambrige, Cambrige CB3 0HE, UK hmw26@cam.ac.u Abstract Some moels of textual corpora employ text generation methos involving n-gram statistics, while

More information

Information Retrieval. Lecture 6

Information Retrieval. Lecture 6 Information Retrieval Lecture 6 Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces This lecture

More information

Motivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models

Motivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models 3. Retrieval Models Motivation Information Need User Retrieval Model Result: Query 1. 2. 3. Document Collection 2 Agenda 3.1 Boolean Retrieval 3.2 Vector Space Model 3.3 Probabilistic IR 3.4 Statistical

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

Generalizing Kronecker Graphs in order to Model Searchable Networks

Generalizing Kronecker Graphs in order to Model Searchable Networks Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu

More information

CS 572: Information Retrieval

CS 572: Information Retrieval CS 572: Information Retrieval Lecture 5: Term Weighting and Ranking Acknowledgment: Some slides in this lecture are adapted from Chris Manning (Stanford) and Doug Oard (Maryland) Lecture Plan Skip for

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci Link Analysis Information Retrieval and Data Mining Prof. Matteo Matteucci Hyperlinks for Indexing and Ranking 2 Page A Hyperlink Page B Intuitions The anchor text might describe the target page B Anchor

More information

Querying. 1 o Semestre 2008/2009

Querying. 1 o Semestre 2008/2009 Querying Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2008/2009 Outline 1 2 3 4 5 Outline 1 2 3 4 5 function sim(d j, q) = 1 W d W q W d is the document norm W q is the

More information

The Spearman s Rank Correlation Coefficient is a statistical test that examines the degree to which two data sets are correlated, if at all.

The Spearman s Rank Correlation Coefficient is a statistical test that examines the degree to which two data sets are correlated, if at all. 4e A Guie to Spearman s Rank The Spearman s Rank Correlation Coefficient is a statistical test that examines the egree to which two ata sets are correlate, if at all. Why woul we use Spearman s Rank? While

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information

ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22 Lexical Semantics with Dense Vectors ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous

More information

A study on ant colony systems with fuzzy pheromone dispersion

A study on ant colony systems with fuzzy pheromone dispersion A stuy on ant colony systems with fuzzy pheromone ispersion Louis Gacogne LIP6 104, Av. Kenney, 75016 Paris, France gacogne@lip6.fr Sanra Sanri IIIA/CSIC Campus UAB, 08193 Bellaterra, Spain sanri@iiia.csic.es

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

Section 2.7 Derivatives of powers of functions

Section 2.7 Derivatives of powers of functions Section 2.7 Derivatives of powers of functions (3/19/08) Overview: In this section we iscuss the Chain Rule formula for the erivatives of composite functions that are forme by taking powers of other functions.

More information

Math Skills. Fractions

Math Skills. Fractions Throughout your stuy of science, you will often nee to solve math problems. This appenix is esigne to help you quickly review the basic math skills you will use most often. Fractions Aing an Subtracting

More information

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze

Chapter 10: Information Retrieval. See corresponding chapter in Manning&Schütze Chapter 10: Information Retrieval See corresponding chapter in Manning&Schütze Evaluation Metrics in IR 2 Goal In IR there is a much larger variety of possible metrics For different tasks, different metrics

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS DATA MINING LECTURE 3 Link Analysis Ranking PageRank -- Random walks HITS How to organize the web First try: Manually curated Web Directories How to organize the web Second try: Web Search Information

More information

Vector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model

Vector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model Vector Space Model Yufei Tao KAIST March 5, 2013 In this lecture, we will study a problem that is (very) fundamental in information retrieval, and must be tackled by all search engines. Let S be a set

More information

Generic Text Summarization

Generic Text Summarization June 27, 2012 Outline Introduction 1 Introduction Notation and Terminology 2 3 4 5 6 Text Summarization Introduction Notation and Terminology Two Types of Text Summarization Query-Relevant Summarization:

More information

Document Similarity in Information Retrieval

Document Similarity in Information Retrieval Document Similarity in Information Retrieval Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of

More information

Fast Resampling Weighted v-statistics

Fast Resampling Weighted v-statistics Fast Resampling Weighte v-statistics Chunxiao Zhou Mar O. Hatfiel Clinical Research Center National Institutes of Health Bethesa, MD 20892 chunxiao.zhou@nih.gov Jiseong Par Dept of Math George Mason Univ

More information

12.5. Differentiation of vectors. Introduction. Prerequisites. Learning Outcomes

12.5. Differentiation of vectors. Introduction. Prerequisites. Learning Outcomes Differentiation of vectors 12.5 Introuction The area known as vector calculus is use to moel mathematically a vast range of engineering phenomena incluing electrostatics, electromagnetic fiels, air flow

More information

TDDD43. Information Retrieval. Fang Wei-Kleiner. ADIT/IDA Linköping University. Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1

TDDD43. Information Retrieval. Fang Wei-Kleiner. ADIT/IDA Linköping University. Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1 TDDD43 Information Retrieval Fang Wei-Kleiner ADIT/IDA Linköping University Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1 Outline 1. Introduction 2. Inverted index 3. Ranked Retrieval tf-idf

More information

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything

More information

Multi-Task Minimum Error Rate Training for SMT

Multi-Task Minimum Error Rate Training for SMT Minimum Error Rate Training for SMT Patrick Katharina Stefan Department of Computational Linguistics University of Heielberg, Germany Learning Multi-task learning aims at learning several ifferent tasks

More information

Similarity Measures for Categorical Data A Comparative Study. Technical Report

Similarity Measures for Categorical Data A Comparative Study. Technical Report Similarity Measures for Categorical Data A Comparative Stuy Technical Report Department of Computer Science an Engineering University of Minnesota 4-92 EECS Builing 200 Union Street SE Minneapolis, MN

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.

More information

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

Web Information Retrieval Dipl.-Inf. Christoph Carl Kling

Web Information Retrieval Dipl.-Inf. Christoph Carl Kling Institute for Web Science & Technologies University of Koblenz-Landau, Germany Web Information Retrieval Dipl.-Inf. Christoph Carl Kling Exercises WebIR ask questions! WebIR@c-kling.de 2 of 40 Probabilities

More information