Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search


Chapter 1: Information Retrieval and Web Search
Outline:
- The Web Challenges
- Crawling the Web
- Indexing and Keyword Search
- Evaluating Search Quality
- Similarity Search

The Web Challenges
Tim Berners-Lee, Information Management: A Proposal, CERN, March 1989.

The Web Challenges: 18 years later
The recent Web is huge and grows incredibly fast. About ten years after the Tim Berners-Lee proposal, the Web was estimated at 150 million nodes (pages) and 1.7 billion edges (links). Now it includes more than 4 billion pages, with about a million added every day.
Restricted formal semantics: nodes are just web pages and links are of a single type (e.g., "refers to"). The meaning of the nodes and links is not a part of the web system; rather, it is left to the web page developers to describe in the page content what their web documents mean and what kind of relations they have with the documents they link to.
As there is no central authority or editors, the relevance, popularity, or authority of web pages is hard to evaluate. Links are also very diverse, and many have nothing to do with content or authority (e.g., navigation links).

The Web Challenges
How to turn the web data into web knowledge:
- Use the existing Web: web search engines, topic directories
- Change the Web: Semantic Web

Crawling the Web
To make Web search efficient, search engines collect web documents and index them by the words (terms) they contain. For the purposes of indexing, web pages are first collected and stored in a local repository. Web crawlers (also called spiders or robots) are programs that systematically and exhaustively browse the Web and store all visited pages. Crawlers follow the hyperlinks in Web documents, implementing graph search algorithms such as depth-first and breadth-first; a breadth-first sketch follows.
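As a minimal sketch of such a crawler, assuming Python and only its standard library: the regular-expression link extraction, the fixed timeout, and the crawl_bfs name are illustrative choices, not part of the original slides. A production crawler would also need robots.txt handling, politeness delays, and robust HTML parsing.

    import re
    from collections import deque
    from urllib.parse import urljoin
    from urllib.request import urlopen

    HREF = re.compile(r'href="([^"#]+)"')  # naive link extraction

    def crawl_bfs(seed, max_depth=3):
        """Breadth-first crawl limited to a fixed depth from the seed URL."""
        seen = {seed}                # URLs already scheduled (absolute, canonical form)
        queue = deque([(seed, 0)])   # frontier of (url, depth) pairs
        repository = {}              # local repository: url -> page source
        while queue:
            url, depth = queue.popleft()
            try:
                page = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue             # skip unreachable or malformed URLs
            repository[url] = page
            if depth < max_depth:
                for link in HREF.findall(page):
                    absolute = urljoin(url, link)  # resolve to canonical absolute form
                    if absolute not in seen:
                        seen.add(absolute)
                        queue.append((absolute, depth + 1))
        return repository

Replacing queue.popleft() with queue.pop() turns the same loop into the depth-first variant pictured on the next two slides.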

Crawling the Web
(Figure: depth-first Web crawling limited to depth 3)

Crawling the Web
(Figure: breadth-first Web crawling limited to depth 3)

Crawling the Web
Issues in Web crawling:
- Network latency (multithreading)
- Address resolution (DNS caching)
- Extracting URLs (use canonical form)
- Managing a huge web page repository
- Updating indices
- Responding to a constantly changing Web
- Interaction with Web page developers
- Advanced crawling by guided (informed) search (using web page ranks)

Indexing and Keyword Search
We need efficient content-based access to Web documents.
- Document representation: term-document matrix (inverted index)
- Relevance ranking: vector space model

Indexing and Keyword Search
Creating the term-document matrix (inverted index):
- Documents are tokenized: punctuation marks are removed, and the character strings without spaces are considered as tokens.
- All characters are converted to upper or to lower case.
- Words are reduced to their canonical form (stemming).
- Stopwords (a, an, the, on, in, at, etc.) are removed.
- The remaining words, now called terms, are used as features (attributes) in the term-document matrix.
A sketch of this pipeline follows.
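A minimal sketch of the steps above in Python, assuming a toy stopword list; stemming is omitted here, where a real pipeline would apply, e.g., a Porter stemmer:

    import re
    from collections import Counter

    STOPWORDS = {"a", "an", "the", "on", "in", "at", "of", "and", "to"}  # toy list

    def extract_terms(text):
        tokens = re.findall(r"[a-z]+", text.lower())        # tokenize and case-fold
        return [t for t in tokens if t not in STOPWORDS]    # remove stopwords

    def term_document_matrix(docs):
        """docs: {doc_id: raw text}; returns {doc_id: Counter of term counts}."""
        return {doc_id: Counter(extract_terms(text)) for doc_id, text in docs.items()}

    def boolean_row(counts, vocabulary):
        """Boolean representation: a term's entry is 1 iff its count is nonzero."""
        return [1 if counts[t] > 0 else 0 for t in vocabulary]

For instance, boolean_row(matrix[6], ["lab", "laboratory", "programming", "computer", "program"]) would produce the (0, 0, 1, 1, 1) row of the Boolean matrix shown below.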

CCSU Departments example: document statistics
(Table: for each of the 20 CCSU department documents, identified by document IDs 1-20 — Anthropology, Art, Biology, Chemistry, Communication, Computer Science, Criminal Justice, Economics, English, Geography, History, Mathematics, Modern Languages, Music, Philosophy, Physics, Political Science, Psychology, Sociology, Theatre — the number of words and the number of terms remaining after preprocessing; the Computer Science document, for example, contains 77 terms in total. The last rows give the total number of words/terms and the number of different words/terms over the whole collection.)

CCSU Departments example: Boolean (binary) term-document matrix

DID | lab | laboratory | programming | computer | program
  1 |  0  |     0      |      0      |    0     |    1
  2 |  0  |     0      |      0      |    0     |    1
  3 |  0  |     1      |      0      |    1     |    0
  4 |  0  |     0      |      0      |    1     |    1
  5 |  0  |     0      |      0      |    0     |    0
  6 |  0  |     0      |      1      |    1     |    1
  7 |  0  |     0      |      0      |    0     |    1
  8 |  0  |     0      |      0      |    0     |    1
  9 |  0  |     0      |      0      |    0     |    0
 10 |  0  |     0      |      0      |    0     |    0
 11 |  0  |     0      |      0      |    0     |    0
 12 |  0  |     0      |      0      |    1     |    0
 13 |  0  |     0      |      0      |    0     |    0
 14 |  1  |     0      |      0      |    1     |    1
 15 |  0  |     0      |      0      |    0     |    1
 16 |  0  |     0      |      0      |    0     |    1
 17 |  0  |     0      |      0      |    0     |    1
 18 |  0  |     0      |      0      |    0     |    0
 19 |  0  |     0      |      0      |    0     |    1
 20 |  0  |     0      |      0      |    0     |    0

CCSU Departments example: term-document matrix with positions

DID | lab  | laboratory | programming | computer         | program
  1 | 0    | 0          | 0           | 0                | [7]
  2 | 0    | 0          | 0           | 0                | [7]
  3 | 0    | [65,69]    | 0           | [68]             | 0
  4 | 0    | 0          | 0           | [26]             | [30,43]
  5 | 0    | 0          | 0           | 0                | 0
  6 | 0    | 0          | [40,42]     | [,3,7,3,26,34]   | [,8,6]
  7 | 0    | 0          | 0           | 0                | [9,42]
  8 | 0    | 0          | 0           | 0                | [57]
  9 | 0    | 0          | 0           | 0                | 0
 10 | 0    | 0          | 0           | 0                | 0
 11 | 0    | 0          | 0           | 0                | 0
 12 | 0    | 0          | 0           | [7]              | 0
 13 | 0    | 0          | 0           | 0                | 0
 14 | [42] | 0          | 0           | [4]              | [7]
 15 | 0    | 0          | 0           | 0                | [37,38]
 16 | 0    | 0          | 0           | 0                | [8]
 17 | 0    | 0          | 0           | 0                | [68]
 18 | 0    | 0          | 0           | 0                | 0
 19 | 0    | 0          | 0           | 0                | [5]
 20 | 0    | 0          | 0           | 0                | 0

Vector Space Model
Boolean representation. Given documents $d_1, d_2, \ldots, d_n$ and terms $t_1, t_2, \ldots, t_m$, where term $t_i$ occurs $n_i$ times in document $d$, the Boolean representation of $d$ is the vector $\vec{d} = (d_1, d_2, \ldots, d_m)$ with
$d_i = \begin{cases} 1 & \text{if } n_i > 0 \\ 0 & \text{if } n_i = 0 \end{cases}$
For example, if the terms are lab, laboratory, programming, computer, and program, then the Computer Science document is represented by the Boolean vector $\vec{d}_6 = (0, 0, 1, 1, 1)$.

Term Frequency (TF) representation
Document vector $\vec{d} = (d_1, d_2, \ldots, d_m)$ with components $d_i = TF(t_i, d)$.
Using the sum of term counts:
$TF(t_i, d) = \begin{cases} \frac{n_i}{\sum_k n_k} & \text{if } n_i > 0 \\ 0 & \text{if } n_i = 0 \end{cases}$
Using the maximum of term counts:
$TF(t_i, d) = \begin{cases} \frac{n_i}{\max_k n_k} & \text{if } n_i > 0 \\ 0 & \text{if } n_i = 0 \end{cases}$
Cornell SMART system:
$TF(t_i, d) = \begin{cases} 0 & \text{if } n_i = 0 \\ 1 + \log(1 + \log n_i) & \text{if } n_i > 0 \end{cases}$

Inverted Document Frequency (IDF)
Given a document collection $D$, let $D_{t_i}$ be the set of documents that contain term $t_i$: $D_{t_i} = \{d \mid n_i > 0\}$.
Simple fraction:
$IDF(t_i) = \frac{|D|}{|D_{t_i}|}$
or, using a log function:
$IDF(t_i) = \log\left(1 + \frac{|D|}{|D_{t_i}|}\right)$

TFIDF representation
$d_i = TF(t_i, d) \cdot IDF(t_i)$
For example, scaling the Computer Science TF vector $\vec{d}_6 = (0, 0, 0.026, 0.078, 0.039)$ with the IDF of the terms

lab 3.04452 | laboratory 3.04452 | programming 3.04452 | computer 1.43508 | program 0.55966

results in $\vec{d}_6 = (0, 0, 0.079, 0.112, 0.022)$.
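A small sketch tying the TF and IDF slides together, using the sum-based TF. The term counts for the Computer Science document (programming 2, computer 6, program 3, out of 77 terms) are implied by the slide's TF vector; the IDF values are taken from the slide directly rather than recomputed, and the lumped "other" entry is an illustrative device:

    def tf_sum(counts, term):
        """Sum-based term frequency: n_i divided by the sum of term counts."""
        total = sum(counts.values())
        return counts.get(term, 0) / total if counts.get(term, 0) > 0 else 0.0

    def tfidf_vector(counts, terms, idf_values):
        return [tf_sum(counts, t) * idf_values[t] for t in terms]

    terms = ["lab", "laboratory", "programming", "computer", "program"]
    idf_values = {"lab": 3.04452, "laboratory": 3.04452, "programming": 3.04452,
                  "computer": 1.43508, "program": 0.55966}
    # Computer Science document: 77 terms in total; the remaining 66
    # occurrences of other terms are lumped into one dummy entry.
    cs_counts = {"programming": 2, "computer": 6, "program": 3, "other": 66}
    print(tfidf_vector(cs_counts, terms, idf_values))
    # -> approximately [0, 0, 0.079, 0.112, 0.022], as on the slide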

Relevance Ranking
Represent the query as a vector: $q$ = {computer, program}, $\vec{q} = (0, 0, 0, 0.5, 0.5)$.
Apply IDF to its components: $\vec{q} = (0, 0, 0, 0.718, 0.280)$.

lab 3.04452 | laboratory 3.04452 | programming 3.04452 | computer 1.43508 | program 0.55966

Rank either by the Euclidean norm of the vector difference
$|\vec{d} - \vec{q}| = \sqrt{\sum_{i=1}^{m} (q_i - d_i)^2}$
or by cosine similarity (equivalent to the dot product for normalized vectors):
$\cos(\vec{d}, \vec{q}) = \frac{\sum_{i=1}^{m} q_i d_i}{\sqrt{\sum_{i=1}^{m} q_i^2}\,\sqrt{\sum_{i=1}^{m} d_i^2}}$
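A minimal sketch of cosine-based ranking in plain Python; the helper names are illustrative, and the three document vectors are the normalized TFIDF rows of docs 4, 6, and 14 from the table below:

    from math import sqrt

    def normalize(v):
        norm = sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm else v     # leave zero vectors as-is

    def cosine(u, v):
        return sum(a * b for a, b in zip(u, v))         # dot product of unit vectors

    query = normalize([0, 0, 0, 0.718, 0.280])          # -> (0, 0, 0, 0.932, 0.363)
    docs = {4: [0, 0, 0, 0.783, 0.622],
            6: [0, 0, 0.559, 0.811, 0.172],
            14: [0.890, 0, 0, 0.424, 0.167]}
    ranked = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
    print(ranked)                                       # -> [4, 6, 14], as ranked below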

Relevance Ranking
$\vec{q}$ (normalized) = (0, 0, 0, 0.932, 0.363). Cosine similarities and distances to $\vec{q}$:

Doc | TFIDF coordinates (normalized) | $\vec{d} \cdot \vec{q}$ (rank) | $|\vec{d} - \vec{q}|$ (rank)
  1 | (0, 0, 0, 0, 1)                | 0.363     | 1.129
  2 | (0, 0, 0, 0, 1)                | 0.363     | 1.129
  3 | (0, 0.972, 0, 0.234, 0)        | 0.218     | 1.250
  4 | (0, 0, 0, 0.783, 0.622)        | 0.956 (1) | 0.298 (1)
  5 | (0, 0, 0, 0, 1)                | 0.363     | 1.129
  6 | (0, 0, 0.559, 0.811, 0.172)    | 0.819 (2) | 0.603 (2)
  7 | (0, 0, 0, 0, 1)                | 0.363     | 1.129
  8 | (0, 0, 0, 0, 1)                | 0.363     | 1.129
  9 | (0, 0, 0, 0, 0)                | 0         | 1
 10 | (0, 0, 0, 0, 0)                | 0         | 1
 11 | (0, 0, 0, 0, 0)                | 0         | 1
 12 | (0, 0, 0, 1, 0)                | 0.932     | 0.369
 13 | (0, 0, 0, 0, 0)                | 0         | 1
 14 | (0.890, 0, 0, 0.424, 0.167)    | 0.456 (3) | 1.043 (3)
 15 | (0, 0, 0, 0, 1)                | 0.363     | 1.129
 16 | (0, 0, 0, 0, 1)                | 0.363     | 1.129
 17 | (0, 0, 0, 0, 1)                | 0.363     | 1.129
 18 | (0, 0, 0, 0, 0)                | 0         | 1
 19 | (0, 0, 0, 0, 1)                | 0.363     | 1.129
 20 | (0, 0, 0, 0, 0)                | 0         | 1

Relevance Feedback
The user provides feedback:
- Relevant documents $D^+$
- Irrelevant documents $D^-$
The original query vector is updated (Rocchio's method):
$\vec{q}\,' = \alpha \vec{q} + \frac{\beta}{|D^+|} \sum_{\vec{d} \in D^+} \vec{d} - \frac{\gamma}{|D^-|} \sum_{\vec{d} \in D^-} \vec{d}$
Pseudo-relevance feedback: the top 10 documents returned by the original query belong to $D^+$; the rest of the documents belong to $D^-$.
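A minimal sketch of the Rocchio update in plain Python; the default values for alpha, beta, and gamma are common illustrative settings, not values from the slides:

    def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
        """Update a query vector from relevant (D+) and irrelevant (D-) doc vectors."""
        def centroid(vectors, dim):
            if not vectors:
                return [0.0] * dim
            return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        dim = len(query)
        pos = centroid(relevant, dim)       # mean of the relevant documents
        neg = centroid(irrelevant, dim)     # mean of the irrelevant documents
        return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]

With pseudo-relevance feedback, the relevant argument would simply be the vectors of the top 10 documents returned by the original query.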

Advanced text search
- Using the OR and NOT Boolean operators
- Phrase search: statistical methods to extract phrases from text; indexing phrases; part-of-speech tagging
- Approximate string matching (using n-grams). Example: match program and prorgam by their sets of character bigrams:
{pr, ro, og, gr, ra, am} ∩ {pr, ro, or, rg, ga, am} = {pr, ro, am}
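A small sketch of bigram-based approximate matching; the similarity measure shown (shared bigrams over the larger bigram set) is one simple illustrative choice among several:

    def bigrams(word):
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def ngram_similarity(a, b):
        shared = bigrams(a) & bigrams(b)
        return len(shared) / max(len(bigrams(a)), len(bigrams(b)))

    print(bigrams("program") & bigrams("prorgam"))   # -> {'pr', 'ro', 'am'}
    print(round(ngram_similarity("program", "prorgam"), 3))   # -> 0.5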

Using the HTML structure in keyword search
- Titles and metatags: use them as tags in indexing; modify ranking depending on the context in which the term occurs
- Headings and font modifiers (prone to spam)
- Anchor text: plays an important role in web page indexing and search; allows search indices to include pages that have never been crawled; allows indexing of non-textual content (such as images and programs)

Evaluating search quality
Assume that there is a set of queries $Q$ and a set of documents $D$, and for each query $q \in Q$ submitted to the system we have:
- The response set of retrieved documents $R_q \subseteq D$
- The set of relevant documents $D_q \subseteq D$, selected manually from the whole collection of documents
$\text{precision} = \frac{|D_q \cap R_q|}{|R_q|} \qquad \text{recall} = \frac{|D_q \cap R_q|}{|D_q|}$

Precision-recall framework (set-valued)
Determine the relationship between the set of relevant documents $D_q$ and the set of retrieved documents $R_q$:
- Ideally $D_q = R_q$; generally $D_q \cap R_q \subset D_q$.
- A very general query leads to recall 1 but low precision.
- A very restrictive query leads to precision 1 but low recall.
- A good balance is needed to maximize both precision and recall.

Precision-recall framework (using ranks)
With thousands of documents, finding $D_q$ is practically impossible. So let us consider a list $R_q = (d_1, d_2, \ldots, d_m)$ of ranked documents (highest rank first). For each $d_i \in R_q$, compute its relevance
$r_i = \begin{cases} 1 & \text{if } d_i \in D_q \\ 0 & \text{otherwise} \end{cases}$
Then define precision at rank $k$ as
$\text{precision}(k) = \frac{1}{k} \sum_{i=1}^{k} r_i$
and recall at rank $k$ as
$\text{recall}(k) = \frac{1}{|D_q|} \sum_{i=1}^{k} r_i$
A sketch of both measures follows.
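A minimal sketch of these two measures in Python, applied to the relevance column of the worked example on the next slide (the flags list encodes which ranked documents are relevant there):

    def precision_at_k(relevance, k):
        """relevance: list of 0/1 relevance flags in rank order."""
        return sum(relevance[:k]) / k

    def recall_at_k(relevance, k, n_relevant):
        return sum(relevance[:k]) / n_relevant

    flags = [1, 0, 1, 1, 0, 0, 0]       # example ranking; |D_q| = 3
    print([round(precision_at_k(flags, k), 3) for k in range(1, 8)])
    # -> [1.0, 0.5, 0.667, 0.75, 0.6, 0.5, 0.429]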

Precision-recall framework (example)
Relevant documents: $D_q = \{d_4, d_6, d_{14}\}$

k | Document index | $r_i$ | recall(k) | precision(k)
1 |       4        |  1    |   0.333   | 1
2 |      12        |  0    |   0.333   | 0.5
3 |       6        |  1    |   0.667   | 0.667
4 |      14        |  1    |   1       | 0.75
5 |       1        |  0    |   1       | 0.6
6 |       2        |  0    |   1       | 0.5
7 |       3        |  0    |   1       | 0.429

Precision-recall framework
Average precision:
$\text{average precision} = \frac{1}{|D_q|} \sum_{k} r_k \cdot \text{precision}(k)$
It combines precision and recall and also evaluates document ranking. Its maximal value of 1 is reached when all relevant documents are retrieved and ranked before any irrelevant ones. Practically, to compute the average precision we first go over the ranked documents from $R_q$ and then continue with the rest of the documents from $D$ until all relevant documents are included.
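Continuing the sketch above, average precision for the same example; the 0.806 result is computed here from the slide's table, not taken from the slides:

    def precision_at_k(relevance, k):
        return sum(relevance[:k]) / k

    def average_precision(relevance, n_relevant):
        # sum precision(k) over the ranks k where a relevant document appears
        return sum(precision_at_k(relevance, k + 1)
                   for k, r in enumerate(relevance) if r) / n_relevant

    print(round(average_precision([1, 0, 1, 1, 0, 0, 0], 3), 3))
    # -> 0.806, i.e. (1 + 0.667 + 0.75) / 3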

Similarity Search
Cluster hypothesis in IR: documents similar to relevant documents are likely to be relevant too. Once a relevant document is found, a larger collection of possibly relevant documents may be found by retrieving similar documents.
Similarity measures between documents:
- Cosine similarity
- Jaccard similarity (the most popular approach)
- Document resemblance

Cosine Similarity
Given a document $d$ and a collection $D$, the problem is to find a number (usually 10 or 20) of documents $d_i \in D$ that have the largest value of cosine similarity to $d$. There are no problems with the small query vector here, so we can use more (or all) dimensions of the vector space.
How many and which terms to use?
- All terms from the corpus
- Select the terms that best represent the documents (feature selection):
  - use the highest TF score
  - use the highest IDF score
  - combine TF and IDF scores (e.g., TFIDF)

Jaccard Similarity
Use the Boolean document representation and only the nonzero coordinates of the vectors (i.e., those that are 1). The Jaccard coefficient is the proportion of coordinates that are 1 in both documents to those that are 1 in either of the documents:
$sim(d_1, d_2) = \frac{|\{j \mid d_{1j} = d_{2j} = 1\}|}{|\{j \mid d_{1j} = 1 \lor d_{2j} = 1\}|}$
Set formulation:
$sim(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
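A minimal sketch of the set formulation, with documents given as Python sets of terms (the example terms are illustrative):

    def jaccard(terms1, terms2):
        """Jaccard coefficient of two documents given as sets of terms."""
        if not terms1 and not terms2:
            return 0.0
        return len(terms1 & terms2) / len(terms1 | terms2)

    print(round(jaccard({"computer", "program", "lab"},
                        {"computer", "program"}), 3))   # -> 0.667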

Computing the Jaccard Coefficient
Problems:
- Direct computation is straightforward, but with large collections it may lead to inefficient similarity search.
- Finding similar documents at query time is impractical.
Solutions:
- Do most of the computation offline: create a list of all document pairs sorted by the similarity of the documents in each pair. Then the $k$ most similar documents to a given document $d$ are those that are paired with $d$ in the first $k$ pairs of the list.
- Eliminate frequent terms.
- Pair documents that share at least one term.

Document Resemblance
The task is finding identical or nearly identical documents, or documents that share phrases, sentences, or paragraphs. The set-of-words approach does not work here. Instead, consider the document as a sequence of words and extract from this sequence short subsequences of fixed length (n-grams or shingles). Represent document $d$ as a set of $w$-grams $S(d, w)$. Example:
$T(d) = \{a, b, c, d, e\} \qquad S(d, 2) = \{ab, bc, cd, de\}$
Note that $T(d) = S(d, 1)$.
Use the Jaccard coefficient to compute resemblance:
$r_w(d_1, d_2) = \frac{|S(d_1, w) \cap S(d_2, w)|}{|S(d_1, w) \cup S(d_2, w)|}$
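A minimal sketch of shingling and resemblance; tokenization is kept trivial (documents are given as word lists), and the function names are illustrative:

    def shingles(words, w):
        """Set of w-grams (shingles) of a word sequence."""
        return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

    def resemblance(words1, words2, w=2):
        s1, s2 = shingles(words1, w), shingles(words2, w)
        return len(s1 & s2) / len(s1 | s2)

    d = ["a", "b", "c", "d", "e"]
    print(shingles(d, 2))
    # -> {('a','b'), ('b','c'), ('c','d'), ('d','e')}, i.e. S(d, 2) from the slide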