Part I: Web Structure Mining
Chapter 1: Information Retrieval and Web Search
The Web Challenges - Crawling the Web - Indexing and Keyword Search - Evaluating Search Quality - Similarity Search
The Web Challenges

Tim Berners-Lee, Information Management: A Proposal, CERN, March 1989.
The Web Challenges: 18 years later

The current Web is huge and grows incredibly fast. About ten years after Tim Berners-Lee's proposal, the Web was estimated at 150 million nodes (pages) and 1.7 billion edges (links). Now it includes more than 4 billion pages, with about a million added every day.

Restricted formal semantics: nodes are just web pages, and links are of a single type (e.g., "refers to"). The meaning of the nodes and links is not a part of the web system; it is left to the web page developers to describe in the page content what their web documents mean and what kind of relations they have with the documents they link to.

As there is no central authority or editors, the relevance, popularity, or authority of web pages is hard to evaluate. Links are also very diverse, and many have nothing to do with content or authority (e.g., navigation links).
The Web Challenges

How to turn the web data into web knowledge:
- Use the existing Web: web search engines, topic directories
- Change the Web: Semantic Web
Crawling the Web

To make Web search efficient, search engines collect web documents and index them by the words (terms) they contain. For the purposes of indexing, web pages are first collected and stored in a local repository.

Web crawlers (also called spiders or robots) are programs that systematically and exhaustively browse the Web and store all visited pages. Crawlers follow the hyperlinks in Web documents, implementing graph search algorithms such as depth-first and breadth-first search.
Crawling the Web

[Figure: Depth-first Web crawling limited to depth 3]
Crawling the Web

[Figure: Breadth-first Web crawling limited to depth 3]
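The breadth-first crawl sketched in the figures can be expressed with a queue and a visited set. Below is a minimal sketch; `get_links` is a hypothetical stand-in for fetching a page and extracting its hyperlinks (a real crawler would also download and store each visited page in the repository), and the tiny `web` dictionary simulates a link graph so the example is self-contained.

```python
from collections import deque

def crawl_bfs(start_url, get_links, max_depth=3):
    """Breadth-first crawl from start_url, following links up to max_depth.

    get_links(url) stands in for fetching a page and extracting its
    hyperlinks; a real crawler would store the page before expanding it.
    """
    visited = {start_url}           # avoid re-fetching: the Web is a graph, not a tree
    frontier = deque([(start_url, 0)])
    order = []                      # pages in the order they are crawled
    while frontier:
        url, depth = frontier.popleft()
        order.append(url)
        if depth == max_depth:
            continue                # depth limit reached; do not expand further
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                frontier.append((link, depth + 1))
    return order

# A tiny simulated Web: each page maps to the pages it links to.
web = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": ["A"]}
print(crawl_bfs("A", lambda u: web.get(u, []), max_depth=3))   # ['A', 'B', 'C', 'D', 'E']
```

A depth-first crawler is the same sketch with the deque used as a stack (`pop()` instead of `popleft()`); the visited set is what prevents the crawler from looping on cycles such as E -> A above.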
Crawling the Web

Issues in Web crawling:
- Network latency (multithreading)
- Address resolution (DNS caching)
- Extracting URLs (use canonical form)
- Managing a huge web page repository
- Updating indices
- Responding to the constantly changing Web
- Interaction with Web page developers
- Advanced crawling by guided (informed) search (using web page ranks)
Indexing and Keyword Search

We need efficient content-based access to Web documents.
- Document representation: term-document matrix (inverted index)
- Relevance ranking: vector space model
Indexing and Keyword Search

Creating the term-document matrix (inverted index):
- Documents are tokenized: punctuation marks are removed, and the character strings without spaces are considered as tokens.
- All characters are converted to upper or to lower case.
- Words are reduced to their canonical form (stemming).
- Stopwords (a, an, the, on, in, at, etc.) are removed.
- The remaining words, now called terms, are used as features (attributes) in the term-document matrix.
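The pipeline above can be sketched in a few lines. The stopword list is illustrative, and `naive_stem` is a toy suffix-stripper standing in for a real stemmer (e.g., Porter's); both are assumptions, not part of the slides.

```python
import re

# Illustrative stopword list (a real system would use a larger one).
STOPWORDS = {"a", "an", "the", "on", "in", "at", "of", "and", "is", "are"}

def naive_stem(word):
    # Toy suffix-stripping stand-in for a real stemmer such as Porter's.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_terms(text):
    # 1. Tokenize: keep alphabetic character strings, drop punctuation.
    tokens = re.findall(r"[a-zA-Z]+", text)
    # 2. Case-fold to lower case.
    tokens = [t.lower() for t in tokens]
    # 3. Remove stopwords, then reduce the rest to canonical form.
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(extract_terms("The programs in the lab at CCSU."))   # ['program', 'lab', 'ccsu']
```

The resulting term lists are what gets counted into the term-document matrix in the next slides.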
CCSU Departments example: document statistics

DID  Document name      Words  Terms
 1   Anthropology         114     86
 2   Art                  153    105
 3   Biology              123     91
 4   Chemistry             87     58
 5   Communication        124     88
 6   Computer Science     101     77
 7   Criminal Justice      85     60
 8   Economics            107     76
 9   English              116     80
10   Geography             95     68
11   History              108     78
12   Mathematics           89     66
13   Modern Languages     110     75
14   Music                137     91
15   Philosophy            85     54
16   Physics              130    100
17   Political Science    120     86
18   Psychology            96     60
19   Sociology             99     66
20   Theatre              116     80

Total number of words/terms:     2195 / 1545
Number of different words/terms:  744 / 671
CCSU Departments example: Boolean (binary) term-document matrix

DID  lab  laboratory  programming  computer  program
 1    0       0            0           0        1
 2    0       0            0           0        1
 3    0       1            0           1        0
 4    0       0            0           1        1
 5    0       0            0           0        0
 6    0       0            1           1        1
 7    0       0            0           0        1
 8    0       0            0           0        1
 9    0       0            0           0        0
10    0       0            0           0        0
11    0       0            0           0        0
12    0       0            0           1        0
13    0       0            0           0        0
14    1       0            0           1        1
15    0       0            0           0        1
16    0       0            0           0        1
17    0       0            0           0        1
18    0       0            0           0        0
19    0       0            0           0        1
20    0       0            0           0        0
CCSU Departments example: term-document matrix with positions

DID  lab    laboratory  programming  computer          program
 1   0      0           0            0                 [7]
 2   0      0           0            0                 [7]
 3   0      [65,69]     0            [68]              0
 4   0      0           0            [26]              [30,43]
 5   0      0           0            0                 0
 6   0      0           [40,42]      [1,3,7,13,26,34]  [1,8,16]
 7   0      0           0            0                 [9,42]
 8   0      0           0            0                 [57]
 9   0      0           0            0                 0
10   0      0           0            0                 0
11   0      0           0            0                 0
12   0      0           0            [7]               0
13   0      0           0            0                 0
14   [42]   0           0            [4]               [7]
15   0      0           0            0                 [37,38]
16   0      0           0            0                 [8]
17   0      0           0            0                 [68]
18   0      0           0            0                 0
19   0      0           0            0                 [5]
20   0      0           0            0                 0
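A term-document matrix with positions is normally stored sparsely as a positional inverted index: each term maps to the documents it occurs in and the positions within them. A minimal sketch, using made-up toy documents rather than the CCSU data:

```python
def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} -- a term-document
    matrix with positions, stored sparsely as an inverted index."""
    index = {}
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms, start=1):   # positions are 1-based
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index

# Toy documents given as term sequences (doc IDs are arbitrary).
docs = {
    3: ["computer", "laboratory", "laboratory"],
    6: ["computer", "program", "computer", "programming"],
}
index = build_positional_index(docs)
print(index["computer"])   # {3: [1], 6: [1, 3]}
```

Keeping positions (not just counts) is what later enables phrase search: a phrase matches where the terms occur at consecutive positions in the same document.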
Vector Space Model: Boolean representation

Documents d1, d2, ..., dn; terms t1, t2, ..., tm; term ti occurs ni times in document d.

Boolean representation: d = (d1, d2, ..., dm), where
  di = 1 if ni > 0,  di = 0 if ni = 0

For example, if the terms are lab, laboratory, programming, computer, and program, then the Computer Science document is represented by the Boolean vector
  d6 = (0 0 1 1 1)
Term Frequency (TF) representation

Document vector d = (d1, d2, ..., dm) with components di = TF(ti, d).

Using the sum of term counts:
  TF(ti, d) = 0 if ni = 0;  TF(ti, d) = ni / Σk nk if ni > 0

Using the maximum of term counts:
  TF(ti, d) = 0 if ni = 0;  TF(ti, d) = ni / maxk nk if ni > 0

Cornell SMART system:
  TF(ti, d) = 0 if ni = 0;  TF(ti, d) = 1 + log(1 + log ni) if ni > 0
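The three TF variants follow directly from the formulas above. The counts below are illustrative; note that the slides' sum normalization divides by the total number of term occurrences in the document, which the toy `counts` dictionary stands in for here.

```python
import math

def tf_sum(counts):
    """TF normalized by the sum of term counts."""
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items() if n > 0}

def tf_max(counts):
    """TF normalized by the most frequent term's count."""
    peak = max(counts.values())
    return {t: n / peak for t, n in counts.items() if n > 0}

def tf_smart(counts):
    """Cornell SMART damped TF: 1 + log(1 + log n)."""
    return {t: 1 + math.log(1 + math.log(n)) for t, n in counts.items() if n > 0}

counts = {"programming": 2, "computer": 6, "program": 3}
print(tf_sum(counts))   # computer -> 6/11, the largest share
```

The SMART variant damps repeated occurrences: the sixth occurrence of a term adds far less weight than the first, which keeps long, repetitive documents from dominating.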
Inverse Document Frequency (IDF)

Document collection: D = ∪ d. Documents that contain term ti: Dti = {d | ni > 0}.

Simple fraction:
  IDF(ti) = |D| / |Dti|

Using a log function:
  IDF(ti) = log(1 + |D| / |Dti|)
TFIDF representation

  di = TF(ti, d) · IDF(ti)

For example, scaling the Computer Science TF vector
  d6 = (0 0 0.026 0.078 0.039)
with the IDF of the terms

  lab      laboratory  programming  computer  program
  3.04452  3.04452     3.04452      1.43508   0.55966

results in
  d6 = (0 0 0.079 0.112 0.022)
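A sketch of the TFIDF computation, assuming sum-normalized TF and the log-damped IDF log(1 + |D|/|Dt|). The document frequencies below are read off the Boolean matrix; with this particular IDF damping the computer and program weights come out slightly different from the table above, so treat the numbers as illustrative of the method rather than a reproduction of the slides' exact values.

```python
import math

def idf(df, n_docs):
    """Log-damped inverse document frequency."""
    return math.log(1 + n_docs / df)

def tfidf_vector(counts, doc_len, doc_freq, n_docs, vocab):
    """Sum-normalized TF times IDF, over a fixed vocabulary."""
    vec = []
    for term in vocab:
        n = counts.get(term, 0)
        tf = n / doc_len if n > 0 else 0.0
        vec.append(tf * idf(doc_freq[term], n_docs))
    return vec

vocab = ["lab", "laboratory", "programming", "computer", "program"]
# Document frequencies from the Boolean matrix (program taken as 11 docs).
doc_freq = {"lab": 1, "laboratory": 1, "programming": 1, "computer": 5, "program": 11}
# Computer Science document: 77 terms in total, with these counts.
counts = {"programming": 2, "computer": 6, "program": 3}
print([round(x, 3) for x in tfidf_vector(counts, 77, doc_freq, 20, vocab)])
```

Rare terms (lab, laboratory, programming: one document each) get the largest IDF, while program, which occurs in over half the collection, is heavily discounted.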
Relevance Ranking

Represent the query as a vector: q = {computer, program}
  q = (0 0 0 0.5 0.5)

Apply IDF to its components:

  lab      laboratory  programming  computer  program
  3.04452  3.04452     3.04452      1.43508   0.55966

  q = (0 0 0 0.718 0.280)

Use the Euclidean norm of the vector difference:
  ||d - q||

or cosine similarity (equivalent to the dot product for normalized vectors):
  cos(q, d) = (q · d) / (||q|| ||d||) = Σi qi di / sqrt(Σi qi² · Σi di²)
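Cosine-based ranking can be sketched as follows. The query is the IDF-weighted vector from above, and the document coordinates are normalized TFIDF vectors for three of the CCSU documents (4, 6, and 12); note that document 12, which contains only computer, also scores high against this particular query.

```python
import math

def cosine(u, v):
    """Cosine similarity: the dot product of the normalized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query_vec, doc_vecs):
    """Return doc ids sorted by decreasing cosine similarity to the query."""
    scores = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Query "computer program" after IDF weighting.
q = [0, 0, 0, 0.718, 0.280]
docs = {
    4: [0, 0, 0, 0.783, 0.622],     # normalized TFIDF coordinates
    6: [0, 0, 0.559, 0.81, 0.172],
    12: [0, 0, 0, 1, 0],
}
print(rank(q, docs))   # [4, 12, 6]
```

Because cosine ignores vector length, a short document about exactly the query terms (doc 12) can outrank a longer one that matches more terms but less purely (doc 6).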
Relevance Ranking

q (normalized) = (0 0 0 0.932 0.363)

Cosine similarities and distances to q:

Doc  TFIDF coordinates (normalized)        d·q (rank)  ||d-q|| (rank)
 1   (0     0      0      0      1    )    0.363       1.129
 2   (0     0      0      0      1    )    0.363       1.129
 3   (0     0.972  0      0.234  0    )    0.218       1.250
 4   (0     0      0      0.783  0.622)    0.956 (1)   0.298 (1)
 5   (0     0      0      0      1    )    0.363       1.129
 6   (0     0      0.559  0.81   0.172)    0.819 (2)   0.603 (2)
 7   (0     0      0      0      1    )    0.363       1.129
 8   (0     0      0      0      1    )    0.363       1.129
 9   (0     0      0      0      0    )    0           1
10   (0     0      0      0      0    )    0           1
11   (0     0      0      0      0    )    0           1
12   (0     0      0      1      0    )    0.932       0.369
13   (0     0      0      0      0    )    0           1
14   (0.890 0      0      0.424  0.167)    0.456 (3)   1.043 (3)
15   (0     0      0      0      1    )    0.363       1.129
16   (0     0      0      0      1    )    0.363       1.129
17   (0     0      0      0      1    )    0.363       1.129
18   (0     0      0      0      0    )    0           1
19   (0     0      0      0      1    )    0.363       1.129
20   (0     0      0      0      0    )    0           1
Relevance Feedback

The user provides feedback:
- Relevant documents D+
- Irrelevant documents D-

The original query vector q is updated (Rocchio's method):

  q' = α q + β (1/|D+|) Σ_{d ∈ D+} d - γ (1/|D-|) Σ_{d ∈ D-} d

Pseudo-relevance feedback:
- The top 10 documents returned by the original query belong to D+
- The rest of the documents belong to D-
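Rocchio's update can be sketched directly from the formula. The weights α, β, γ below are conventional illustrative defaults, and the document vectors are made up, not taken from the slides.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update: move the query toward the centroid of the relevant
    document vectors and away from the centroid of the irrelevant ones."""
    m = len(query)

    def centroid(vecs):
        if not vecs:
            return [0.0] * m
        return [sum(v[i] for v in vecs) / len(vecs) for i in range(m)]

    pos, neg = centroid(relevant), centroid(irrelevant)
    return [alpha * query[i] + beta * pos[i] - gamma * neg[i] for i in range(m)]

q = [0, 0, 0, 0.5, 0.5]
relevant = [[0, 0, 0, 1.0, 0.6]]     # feedback: one relevant document
irrelevant = [[1.0, 0, 0, 0, 0]]     # feedback: one irrelevant document
print(rocchio(q, relevant, irrelevant))
```

Components pushed below zero by the γ term (the first coordinate here) are usually clipped to 0 before the updated query is rerun.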
Advanced text search

- Using OR or NOT Boolean operators
- Phrase search
  - Statistical methods to extract phrases from text
  - Indexing phrases
  - Part-of-speech tagging
- Approximate string matching (using n-grams)
  Example: match program and prorgam
    {pr, ro, og, gr, ra, am}
    {pr, ro, or, rg, ga, am}
  The shared n-grams are {pr, ro, am}.
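The n-gram matching example can be verified in a few lines; the similarity score below is the Jaccard coefficient on the bigram sets, one common way (among several) to turn shared n-grams into a number.

```python
def ngrams(word, n=2):
    """Set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_similarity(a, b, n=2):
    """Proportion of shared n-grams (Jaccard on the n-gram sets)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

print(sorted(ngrams("program") & ngrams("prorgam")))   # ['am', 'pr', 'ro']
```

Because a single transposition only disturbs the n-grams around it, the two spellings still share 3 of their 9 distinct bigrams, which is what makes n-grams useful for catching typos.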
Using the HTML structure in keyword search

- Titles and metatags
  - Use them as tags in indexing
  - Modify ranking depending on the context where the term occurs
- Headings and font modifiers (prone to spam)
- Anchor text
  - Plays an important role in web page indexing and search
  - Allows search indices to include pages that have never been crawled
  - Allows indexing of non-textual content (such as images and programs)
Evaluating search quality

Assume that there is a set of queries Q and a set of documents D, and for each query q ∈ Q submitted to the system we have:
- The response set of retrieved documents Rq ⊆ D
- The set of relevant documents Dq ⊆ D, selected manually from the whole collection of documents

  precision = |Dq ∩ Rq| / |Rq|
  recall    = |Dq ∩ Rq| / |Dq|
Precision-recall framework (set-valued)

Determine the relationship between the set of relevant documents Dq and the set of retrieved documents Rq.
- Ideally, Dq = Rq
- Generally, Dq ∩ Rq ⊂ Dq
- A very general query leads to recall 1, but low precision
- A very restrictive query leads to precision 1, but low recall
- A good balance is needed to maximize both precision and recall
Precision-recall framework (using ranks)

With thousands of documents, finding Dq is practically impossible. So let's consider a list Rq = (d1, d2, ..., dm) of ranked documents (highest rank first).

For each di ∈ Rq compute its relevance:
  ri = 1 if di ∈ Dq, and ri = 0 otherwise

Then define precision at rank k as
  precision(k) = (1/k) Σ_{i=1..k} ri

and recall at rank k as
  recall(k) = (1/|Dq|) Σ_{i=1..k} ri
Precision-recall framework (example)

Relevant documents: Dq = {d4, d6, d14}

k   Document index  rk  recall(k)  precision(k)
1         4          1    0.333        1
2         2          0    0.333        0.5
3         6          1    0.667        0.667
4        14          1    1            0.75
5         1          0    1            0.6
6        12          0    1            0.5
7         3          0    1            0.429
Precision-recall framework

Average precision:

  average precision = (1/|Dq|) Σ_k rk · precision(k)

Combines precision and recall and also evaluates document ranking. The maximal value of 1 is reached when all relevant documents are retrieved and ranked before any irrelevant ones.

Practically, to compute the average precision we first go over the ranked documents from Rq and then continue with the rest of the documents from D until all relevant documents are included.
Similarity Search

Cluster hypothesis in IR: documents similar to relevant documents are likely to be relevant too. Once a relevant document is found, a larger collection of possibly relevant documents may be found by retrieving similar documents.

Similarity measures between documents:
- Cosine similarity
- Jaccard similarity (most popular approach)
- Document resemblance
Cosine Similarity

Given a document d and a collection D, the problem is to find a number (usually 10 or 20) of documents di ∈ D which have the largest value of cosine similarity to d.

There is no small query vector to worry about here, so we can use more (or all) dimensions of the vector space. How many and which terms to use?
- All terms from the corpus
- Select the terms that best represent the documents (feature selection):
  - Use the highest TF score
  - Use the highest IDF score
  - Combine TF and IDF scores (e.g., TFIDF)
Jaccard Similarity

Use the Boolean document representation and only the nonzero coordinates of the vectors (i.e., those that are 1).

Jaccard coefficient: the proportion of coordinates that are 1 in both documents to those that are 1 in either of the documents:

  sim(d1, d2) = |{j | d1j = 1 and d2j = 1}| / |{j | d1j = 1 or d2j = 1}|

Set formulation:

  sim(d1, d2) = |T(d1) ∩ T(d2)| / |T(d1) ∪ T(d2)|
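The coefficient follows directly from the definition. The two vectors below are the Boolean rows for documents 4 and 6 of the CCSU example (terms: lab, laboratory, programming, computer, program).

```python
def jaccard(d1, d2):
    """Jaccard coefficient over Boolean vectors: coordinates that are 1
    in both documents over coordinates that are 1 in either."""
    both = sum(1 for a, b in zip(d1, d2) if a and b)
    either = sum(1 for a, b in zip(d1, d2) if a or b)
    return both / either if either else 0.0

d4 = [0, 0, 0, 1, 1]      # Chemistry: computer, program
d6 = [0, 0, 1, 1, 1]      # Computer Science: programming, computer, program
print(jaccard(d4, d6))    # 2 shared terms out of 3 distinct -> 0.666...
```

Unlike cosine similarity, the coefficient ignores term counts entirely: a document mentioning computer once and one mentioning it fifty times contribute the same coordinate.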
Computing the Jaccard Coefficient

Problems:
- Direct computation is straightforward, but with large collections it may lead to inefficient similarity search.
- Finding similar documents at query time is impractical.

Solutions:
- Do most of the computation offline: create a list of all document pairs sorted by the similarity of the documents in each pair. Then the k most similar documents to a given document d are those that are paired with d in the first k pairs from the list.
- Eliminate frequent terms.
- Pair documents that share at least one term.
Document Resemblance

The task is finding identical or nearly identical documents, or documents that share phrases, sentences, or paragraphs. The set-of-words approach does not work here.

Consider the document as a sequence of words and extract from this sequence short subsequences with fixed length (n-grams or shingles). Represent document d as a set of w-grams S(d, w).

Example: T(d) = {a, b, c, d, e}, S(d, 2) = {ab, bc, cd, de}

Note that T(d) = S(d, 1).

Use Jaccard to compute resemblance:

  rw(d1, d2) = |S(d1, w) ∩ S(d2, w)| / |S(d1, w) ∪ S(d2, w)|
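Shingling and resemblance can be sketched as follows; the two toy documents differ only in the order of their last two words, which the set-of-words view cannot see but the shingle view can.

```python
def shingles(words, w):
    """Set of w-grams (shingles): consecutive word subsequences of length w."""
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(words1, words2, w):
    """Jaccard coefficient on the w-shingle sets of two documents."""
    s1, s2 = shingles(words1, w), shingles(words2, w)
    return len(s1 & s2) / len(s1 | s2)

d1 = ["a", "b", "c", "d", "e"]
d2 = ["a", "b", "c", "e", "d"]   # same words, last two swapped
print(sorted(shingles(d1, 2)))   # the 2-grams of d1
print(resemblance(d1, d2, 2))    # word sets are identical, shingle sets are not
```

With w = 1 the shingle set degenerates to the set of words T(d), so resemblance generalizes plain Jaccard similarity; larger w makes the measure increasingly sensitive to shared phrases rather than shared vocabulary.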