Midterm Review. Hongning Wang


1 Midterm Review. Hongning Wang

2 Core concepts
- Search engine architecture: key components in a modern search engine
- Crawling & text processing: different strategies for crawling; challenges in crawling; text processing pipeline; Zipf's law
- Inverted index: index compression; phrase queries

3 Core concepts
- Vector space model: basic term/document weighting schemata
- Latent semantic analysis: word space to concept space
- Probabilistic ranking principle: risk minimization; document generation model
- Language model: N-gram language models; smoothing techniques

4 Core concepts
- Classical IR evaluations: basic components in IR evaluations; evaluation metrics; annotation strategy and annotation consistency

5 Abstraction of search engine architecture
[Architecture diagram: a crawler builds the indexed corpus; the doc analyzer produces doc representations for the indexer, which builds the index; the query representation goes to the ranker, which returns results to the user; evaluation and (query) feedback close the loop. The ranking procedure receives most research attention.]

6 IR vs. DBs
Information Retrieval:
- Unstructured data
- Semantics of objects are subjective
- Simple keyword queries
- Relevance-driven retrieval
- Effectiveness is the primary issue, though efficiency is also important
Database Systems:
- Structured data
- Semantics of each object are well defined
- Structured query languages (e.g., SQL)
- Exact retrieval
- Emphasis on efficiency

7 Crawler: visiting strategy
- Breadth first: uniformly explore from the entry page; memorize all nodes on the previous level (as shown in the pseudo code; a sketch follows below)
- Depth first: explore the web branch by branch; biased crawling, given the web is not a tree structure
- Focused crawling: prioritize the new links by predefined strategies
- Challenges: avoid duplicate visits; re-visit policy
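The breadth-first strategy lends itself to a queue-based traversal. Below is a minimal sketch, not the course's own pseudo code: `fetch` and `extract_links` are assumed, caller-supplied helpers, and a real crawler would add politeness delays, robots.txt handling, and the re-visit policy mentioned above.

```python
from collections import deque

def bfs_crawl(seed_url, fetch, extract_links, max_pages=1000):
    """Breadth-first crawl from a seed page, avoiding duplicate visits.

    fetch(url) -> html and extract_links(html) -> list of URLs are
    assumed to be supplied by the caller.
    """
    visited = {seed_url}            # memorize nodes already seen
    frontier = deque([seed_url])    # FIFO queue gives breadth-first order
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        pages[url] = html
        for link in extract_links(html):
            if link not in visited:  # avoid duplicate visits
                visited.add(link)
                frontier.append(link)
    return pages
```

Swapping the deque for a stack (pop from the same end links are pushed) turns this into the depth-first variant.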

8 Automatic text indexing
- Tokenization: regular expression based; learning-based
- Normalization
- Stemming
- Stopword removal

9 Statistical property of language
- Zipf's law: the frequency of any word is inversely proportional to its rank in the frequency table
- Formally: f(k; s, N) = (1/k^s) / (Σ_{n=1}^{N} 1/n^s), a discrete version of a power law, where k is the rank of the word, N is the vocabulary size, and s is a language-specific parameter
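As a quick check of the formula, this sketch (illustrative names, not from the slides) computes the Zipf-predicted relative frequency for a given rank:

```python
def zipf_frequency(k, s, N):
    """Predicted relative frequency of the rank-k word under Zipf's law:
    f(k; s, N) = (1/k^s) / sum_{n=1}^{N} 1/n^s."""
    normalizer = sum(1.0 / n**s for n in range(1, N + 1))
    return (1.0 / k**s) / normalizer

# With s = 1 and a 1000-word vocabulary, the top word takes ~13% of all
# occurrences, and frequency decays roughly as 1/rank.
print([round(zipf_frequency(k, s=1.0, N=1000), 4) for k in (1, 2, 3, 10)])
```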

10 Automatic text indexing
- Remove non-informative words
- Remove rare words

11 Inverted index
- Build a look-up table for each word in the vocabulary: from a word, find the documents containing it!
- Dictionary and postings (query: "information retrieval"): terms such as information, retrieval, retrieved, is, helpful, for, you, everyone, each pointing to a posting list over Doc1 and Doc2
- Time complexity: O(|q| · |L|), where |L| is the average length of a posting list
- By Zipf's law, |L| << |D|
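A minimal in-memory version of such a look-up table, using toy documents assumed here for illustration; a production index would also store term frequencies and positions (next slide):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of docIDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Conjunctive query: intersect the posting lists of the query terms."""
    postings = [set(index.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

# Toy corpus (my reconstruction, for illustration only).
docs = {1: "information retrieval is helpful for everyone",
        2: "helpful information is retrieved for you"}
index = build_inverted_index(docs)
print(index["information"])               # [1, 2]
print(search(index, "information retrieval"))  # [1]
```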

12 Structures for inverted index
- Dictionary: modest size; needs fast random access; stays in memory; hash table, B-tree, trie, ...
- Postings: huge; sequential access is expected; stay on disk; contain docID, term freq, term position, ...; compression is needed
- "Key data structure underlying modern IR" - Christopher D. Manning

13 Sorting-based inverted index construction
- Parse & count: scan each document (doc1, doc2, ..., doc300) and emit tuples <termID, docID, count>, e.g., <1,1,3>, <2,1,2>, <3,1,1>, ...
- Local sort: sort each in-memory run of tuples by termID (then docID)
- Merge sort: merge the sorted runs so that all information about term 1, term 2, ... becomes contiguous, e.g., <1,1,3>, <1,2,2>, ..., <1,300,3>, <2,1,2>, ...
- A term lexicon (e.g., 1 = "the", 2 = "cold", 3 = "days", 4 = "a") and a docID lexicon (1 = doc1, 2 = doc2, ...) map the integer IDs back to strings

14 A close look at inverted index
- Approximate search: e.g., misspelled queries, wildcard queries
- Proximity search: e.g., phrase queries
- Dynamic index update
- Index compression
[Diagram: the dictionary (information, retrieval, retrieved, is, helpful, for, you, everyone) and its postings over Doc1/Doc2, annotated with these query types.]

15 Index compression
- Observation on posting files: instead of storing docIDs in a posting list, store the gaps between docIDs, since they are ordered
- Zipf's law again: the more frequent a word is, the smaller the gaps are; the less frequent a word is, the shorter the posting list is
- A heavily biased distribution gives us a great opportunity for compression!
- Information theory: entropy measures compression difficulty.
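One concrete realization of gap-based compression is variable-byte encoding of the gaps; this is a common scheme, offered here as a sketch rather than the specific method covered in the course:

```python
def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers: 7 payload bits per
    byte, with the high bit set on the final byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk = chunk[::-1]
        chunk[-1] |= 0x80          # mark the last byte of this number
        out.extend(chunk)
    return bytes(out)

def compress_postings(doc_ids):
    """docIDs are sorted, so encode the first ID and then the gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

# 6 bytes total instead of three fixed-width 4-byte integers.
print(len(compress_postings([824, 829, 215406])))
```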

16 Phrase query
- Generalized postings matching: equality condition check with a requirement on the position pattern between two query terms
- e.g., T2.pos - T1.pos = 1 (T1 must be immediately before T2 in any matched document)
- Proximity query: |T2.pos - T1.pos| <= k
- Scan the postings of Term1 and Term2 in parallel (a sketch follows below)
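A sketch of this postings matching for a two-term proximity query, assuming (my assumption, not the slide's data structure) that each term maps to a dict of docID -> sorted positions:

```python
def phrase_match(postings1, postings2, k=1):
    """Return docs where term2 occurs within k positions after term1.

    postings1/postings2: dict mapping docID -> sorted list of positions.
    k=1 recovers the exact phrase condition T2.pos - T1.pos = 1.
    """
    matches = []
    for doc in postings1.keys() & postings2.keys():  # docs with both terms
        pos1, pos2 = postings1[doc], postings2[doc]
        if any(0 < p2 - p1 <= k for p1 in pos1 for p2 in pos2):
            matches.append(doc)
    return sorted(matches)

# "information" at positions 0 and 7 in doc 1; "retrieval" at position 1.
print(phrase_match({1: [0, 7]}, {1: [1]}))  # [1]
```

The nested position scan is kept for brevity; real engines walk the two sorted position lists with a two-pointer merge instead.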

17 Considerations in result display
- Relevance: order the results by relevance
- Diversity: maximize the topical coverage of the displayed results
- Navigation: help users easily explore the related search space; query suggestion; search by example

18 Deficiency of Boolean model
- The query is unlikely to be precise: an over-constrained query (terms are too specific) finds no relevant documents; an under-constrained query (terms are too general) over-delivers; it is hard to find the right position between these two extremes (hard for users to specify constraints)
- Even if it is accurate: not all users would like to use such queries; not all relevant documents are equally important; no one would go through all the matched results
- Relevance is a matter of degree!

19 Vector space model
- Represent both doc and query by concept vectors: each concept defines one dimension; k concepts define a high-dimensional space; each element of the vector corresponds to a concept weight, e.g., d = (x_1, ..., x_k), where x_i is the importance of concept i
- Measure relevance by the distance between the query vector and the document vector in this concept space

20 What is a good basic concept?
- Orthogonal: linearly independent basis vectors; non-overlapping in meaning; no ambiguity
- Weights can be assigned automatically and accurately
- Existing solutions: terms or N-grams, i.e., bag-of-words; topics, i.e., topic model

21 TF normalization
- Sublinear TF scaling: tf(t, d) = 1 + log f(t, d) if f(t, d) > 0, and 0 otherwise
[Plot: normalized TF grows slowly with raw TF.]

22 TF normalization
- Maximum TF scaling: tf(t, d) = α + (1 - α) · f(t, d) / max_t f(t, d)
- Normalize by the most frequent word in this doc
[Plot: normalized TF vs. raw TF, lower-bounded by α.]
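Both scaling schemes above in code, as a small sketch; the log base, the value of `alpha`, and the raw counts are illustrative choices:

```python
import math

def sublinear_tf(raw_tf):
    """tf(t,d) = 1 + log f(t,d) if f(t,d) > 0, else 0."""
    return 1 + math.log(raw_tf) if raw_tf > 0 else 0.0

def max_tf(raw_tf, max_tf_in_doc, alpha=0.5):
    """tf(t,d) = alpha + (1 - alpha) * f(t,d) / max_t f(t,d)."""
    return alpha + (1 - alpha) * raw_tf / max_tf_in_doc

print(sublinear_tf(100))  # ~5.6: strongly dampens very frequent terms
print(max_tf(3, 10))      # 0.65: normalized by the doc's most frequent word
```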

23 IDF weighting
- Solution: assign higher weights to the rare terms
- Formula: IDF(t) = 1 + log(N / df(t)), where N is the total number of docs in the collection and df(t) is the number of docs containing term t
- A corpus-specific property, independent of any single document
- Non-linear scaling

24 TF-IDF weighting
- Combining TF and IDF: common in doc → high tf → high weight; rare in collection → high idf → high weight
- w(t, d) = TF(t, d) × IDF(t)
- The most well-known document representation schema in IR! (G. Salton et al. 1983)
- "Salton was perhaps the leading computer scientist working in the field of information retrieval during his time." - Wikipedia. The Gerard Salton Award is the highest achievement award in IR.
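Putting the two together as w(t,d) = TF(t,d) × IDF(t); a minimal sketch reusing the sublinear TF and the IDF formula from the previous slides (`doc_freq` and the toy numbers are assumptions for illustration):

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, doc_freq, num_docs):
    """w(t,d) = (1 + log c(t,d)) * (1 + log(N / df(t))) for terms in d."""
    counts = Counter(doc_tokens)
    return {t: (1 + math.log(c)) * (1 + math.log(num_docs / doc_freq[t]))
            for t, c in counts.items()}

# doc_freq maps each term to the number of documents containing it.
vec = tf_idf_vector(["data", "mining", "data"],
                    doc_freq={"data": 50, "mining": 5}, num_docs=100)
print(vec)  # "mining" (rarer in the collection) gets the larger IDF boost
```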

25 From distance to angle
- Angle: how much the vectors overlap
- Cosine similarity: projection of one vector onto another
[Plot in TF-IDF space with axes Finance and Sports: query and documents D1, D2, contrasting the choice of Euclidean distance with the choice of angle.]

26 Advantages of VS model
- Empirically effective! (Top TREC performance)
- Intuitive
- Easy to implement
- Well-studied/mostly evaluated
- The SMART system: developed at Cornell; still widely used
- Warning: many variants of TF-IDF!

27 Disadvantages of VS model
- Assumes term independence
- Assumes query and document to be the same
- Lack of predictive adequacy: arbitrary term weighting; arbitrary similarity measure
- Lots of parameter tuning!

28 Latent semantic analysis
- Solution: low-rank matrix approximation
- Imagine one matrix is the *true* concept-document matrix, and our observed term-document matrix adds random noise from the word selection in each document

29 Latent Semantic Analysis (LSA)
- Solve LSA by SVD: Z* = argmin_{Z: rank(Z) = k} ||C - Z||_F = argmin_{Z: rank(Z) = k} sqrt(Σ_{i=1}^{M} Σ_{j=1}^{N} (C_ij - Z_ij)^2)
- Procedure of LSA: 1. perform SVD on the M × N document-term adjacency matrix C; 2. construct C_k by keeping only the largest k singular values in Σ non-zero
- This maps documents from the word space to a lower-dimensional concept space
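A minimal LSA sketch with numpy: keep the top-k singular values of the SVD to obtain the rank-k approximation C_k (the toy matrix is illustrative):

```python
import numpy as np

def lsa_approx(C, k):
    """Rank-k approximation of term-document matrix C via truncated SVD."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # this is C_k

C = np.array([[2., 0., 1.],   # toy term-document counts
              [1., 1., 0.],
              [0., 2., 1.]])
C1 = lsa_approx(C, k=1)
print(np.linalg.norm(C - C1))  # Frobenius error of the best rank-1 fit
```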

30 Probabilistic ranking principle
- From decision theory, two types of loss: Loss(retrieved | non-relevant) = a1; Loss(not retrieved | relevant) = a2
- φ(d, q): probability of d being relevant to q
- Expected loss for the decision of including d in the final results: retrieve: (1 - φ(d, q)) · a1; not retrieve: φ(d, q) · a2
- Your decision criterion?

31 Probabilistic ranking principle
- From decision theory, we make the decision: if (1 - φ(d, q)) · a1 < φ(d, q) · a2, retrieve d; otherwise, do not retrieve d
- Equivalently, check whether φ(d, q) > a1 / (a1 + a2)
- Ranking documents in descending order of φ(d, q) minimizes the loss

32 Conditional models for P(R=1|Q,D)
- Basic idea: relevance depends on how well a query matches a document
- P(R=1|Q,D) = g(Rep(Q,D); θ), a functional form
- Rep(Q,D): feature representation of the query-doc pair, e.g., # matched terms, highest IDF of a matched term, docLen
- Use training data (with known relevance judgments) to estimate the parameter θ, then apply the model to rank new documents
- Special case: logistic regression

33 Generative models for P(R=1|Q,D)
- Basic idea: compute Odd(R=1|Q,D) using Bayes' rule:
  Odd(R=1|Q,D) = P(R=1|Q,D) / P(R=0|Q,D) = [P(Q,D|R=1) / P(Q,D|R=0)] · [P(R=1) / P(R=0)], where the prior ratio is ignored for ranking
- Assumption: relevance is a binary variable
- Variants: document generation, P(Q,D|R) = P(D|Q,R) P(Q|R); query generation, P(Q,D|R) = P(Q|D,R) P(D|R)

34 Document generation model
- Odd(R=1|Q,D) ∝ Π_{i: A_i=1, q_i=1} (p_i / u_i) · Π_{i: A_i=0, q_i=1} ((1 - p_i) / (1 - u_i))
- with p_i = P(A_i=1|Q, R=1) and u_i = P(A_i=1|Q, R=0):
  relevant (R=1): term present (A_i=1): p_i; term absent (A_i=0): 1 - p_i
  nonrelevant (R=0): term present: u_i; term absent: 1 - u_i
- The first product is over query terms occurring in the doc; the second is over query terms that do not occur in the doc
- Important trick/assumption: terms not occurring in the query are equally likely to occur in relevant and nonrelevant documents, i.e., p_t = u_t

35 Maximum likelihood estimation
- Data: a document d with counts c(w_1), ..., c(w_N)
- Model: multinomial distribution p(W|θ) with parameters θ_i = p(w_i)
- Maximum likelihood estimator: θ̂ = argmax_θ p(W|θ), where p(W|θ) = (N; c(w_1), ..., c(w_N)) Π_{i=1}^{N} θ_i^{c(w_i)} ∝ Π_{i=1}^{N} θ_i^{c(w_i)}, so log p(W|θ) = Σ_{i=1}^{N} c(w_i) log θ_i
- Using the Lagrange multiplier approach with the probability requirement Σ_{i=1}^{N} θ_i = 1, we tune θ_i to maximize L(W, θ) = Σ_i c(w_i) log θ_i + λ (Σ_i θ_i - 1)
- Setting partial derivatives to zero: ∂L/∂θ_i = c(w_i)/θ_i + λ = 0 gives θ_i = -c(w_i)/λ
- Since Σ_i θ_i = 1, we have λ = -Σ_i c(w_i), so the ML estimate is θ_i = c(w_i) / Σ_{j=1}^{N} c(w_j)

36 The BM25 formula
- A closer look: rel(q, D) = Σ_{i=1}^{n} IDF(q_i) · [tf_i · (k1 + 1)] / [tf_i + k1 · (1 - b + b · |D| / avg|D|)] · [qtf_i · (k2 + 1)] / [qtf_i + k2]
- The first two factors form the TF-IDF component for the document; the last is the TF component for the query
- b is usually set to [0.75, 1.2]; k1 is usually set to [1.2, 2.0]; k2 is usually set to (0, 1000]
- Essentially a vector space model with a TF-IDF schema!
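The scoring function in code, following the formula above; since the slide does not pin down the exact IDF variant, this sketch reuses IDF(t) = 1 + log(N/df(t)) from the IDF weighting slide, and the toy inputs are illustrative:

```python
import math

def bm25(query_terms, doc_tf, doc_len, avg_doc_len, df, N,
         k1=1.2, b=0.75, k2=100.0):
    """Sum over query terms of IDF * doc-TF component * query-TF component."""
    score = 0.0
    for t in set(query_terms):
        if t not in doc_tf:
            continue
        idf = 1 + math.log(N / df[t])          # IDF variant from slide 23
        tf = doc_tf[t]
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        qtf = query_terms.count(t)
        qtf_part = qtf * (k2 + 1) / (qtf + k2)  # matters for long queries
        score += idf * tf_part * qtf_part
    return score

print(bm25(["data", "mining"], {"data": 3, "mining": 1}, doc_len=100,
           avg_doc_len=120, df={"data": 50, "mining": 5}, N=1000))
```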

37 Source-Channel framework [Shannon 48]
- Source P(X) → Transmitter (encoder) → X → Noisy Channel P(Y|X) → Y → Receiver (decoder) → Destination; the decoder asks P(X|Y) = ?
- X̂ = argmax_X p(X|Y) = argmax_X p(Y|X) p(X) (Bayes' rule)
- When X is text, p(X) is a language model
- Many examples: speech recognition: X = word sequence, Y = speech signal; machine translation: X = English sentence, Y = Chinese sentence; OCR error correction: X = correct word, Y = erroneous word; information retrieval: X = document, Y = query; summarization: X = summary, Y = document

38 More sophisticated LMs
- N-gram language models: in general, p(w_1 w_2 ... w_n) = p(w_1) p(w_2|w_1) ... p(w_n|w_1 ... w_{n-1})
- An N-gram model conditions only on the past N-1 words; e.g., bigram: p(w_1 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_2) ... p(w_n|w_{n-1})
- Remote-dependence language models (e.g., maximum entropy model)
- Structured language models (e.g., probabilistic context-free grammar)
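A bigram model estimated by MLE, sketching the chain-rule factorization above; unseen bigrams get probability 0, which is exactly the problem the smoothing slides below address. The sentences and names are illustrative:

```python
from collections import Counter

def train_bigram(sentences):
    """MLE bigram model: p(w_n | w_{n-1}) = c(w_{n-1}, w_n) / c(w_{n-1})."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split()           # <s> marks sentence start
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return lambda prev, w: (bigrams[(prev, w)] / unigrams[prev]
                            if unigrams[prev] else 0.0)

p = train_bigram(["information retrieval is fun", "information is useful"])
print(p("information", "retrieval"))  # 0.5: one of two continuations seen
print(p("information", "theory"))     # 0.0: unseen bigram, needs smoothing
```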

39 Justification from PRP
- O(R=1|Q,D) = [P(Q|D,R=1) P(D|R=1)] / [P(Q|D,R=0) P(D|R=0)] ∝ P(Q|D,R=1) · [P(D|R=1) / P(D|R=0)], assuming P(Q|D,R=0) ≈ P(Q|R=0)
- P(Q|D,R=1) is the query likelihood p(q|d), i.e., query generation; P(D|R=1)/P(D|R=0) is the document prior
- Assuming a uniform document prior, we have O(R=1|Q,D) ∝ P(Q|D,R=1)

40 Problem with MLE
- What probability should we give a word that has not been observed in the document? log 0?
- If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
- This is the so-called smoothing

41 Illustration of language model smoothing
- Max. likelihood estimate: p_ML(w) = (count of w) / (count of all words)
- The smoothed LM discounts probability mass from the seen words and assigns non-zero probabilities to the unseen words
[Plot: P(w|d) over words w, comparing the ML estimate with the smoothed LM.]

42 Refine the idea of smoothing
- Should all unseen words get equal probabilities?
- We can use a reference model to discriminate the unseen words
- Discounted ML estimate: p(w|d) = p_seen(w|d) if w is seen in d, and α_d · p(w|REF) otherwise
- where α_d = (1 - Σ_{w seen} p_seen(w|d)) / Σ_{w unseen} p(w|REF), and p(w|REF) is the reference language model

43 Smoothing methods
- Method 1: additive smoothing: add a constant to the counts of each word: p(w|d) = (c(w,d) + 1) / (|d| + |V|)
- "Add one", Laplace smoothing; c(w,d) is the count of w in d, |V| the vocabulary size, |d| the length of d (total counts)
- Problems? Hint: are all words equally important?

44 Smoothing methods
- Method 2: absolute discounting: subtract a constant δ from the counts of each word: p(w|d) = (max(c(w,d) - δ, 0) + δ · |d|_u · p(w|REF)) / |d|, where |d|_u is the number of unique words in d
- Problems? Hint: varied document length?

45 Smoothing methods
- Method 3: linear interpolation, Jelinek-Mercer: shrink uniformly toward p(w|REF): p(w|d) = (1 - λ) · c(w,d)/|d| + λ · p(w|REF)
- The first term is the MLE; λ is a parameter
- Problems? Hint: what is missing?

46 Smoothing methods
- Method 4: Dirichlet prior/Bayesian: assume pseudo counts μ · p(w|REF): p(w|d) = (c(w,d) + μ · p(w|REF)) / (|d| + μ) = |d|/(|d| + μ) · c(w,d)/|d| + μ/(|d| + μ) · p(w|REF)
- μ is a parameter
- Problems?
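The Jelinek-Mercer and Dirichlet-prior smoothers side by side, as a sketch; `p_ref` stands in for p(w|REF), and the parameter defaults are illustrative:

```python
def jelinek_mercer(c_wd, doc_len, p_ref, lam=0.1):
    """p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF)."""
    return (1 - lam) * c_wd / doc_len + lam * p_ref

def dirichlet(c_wd, doc_len, p_ref, mu=2000.0):
    """p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)."""
    return (c_wd + mu * p_ref) / (doc_len + mu)

# An unseen word (count 0) still gets non-zero probability from p(w|REF).
print(jelinek_mercer(0, 100, p_ref=0.001))  # 0.0001
print(dirichlet(0, 100, p_ref=0.001))       # ~0.00095
```

Note the difference the hints point at: Jelinek-Mercer mixes with a fixed λ regardless of document length, while the Dirichlet prior's effective interpolation weight μ/(|d| + μ) shrinks for longer documents.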

47 Two-stage smoothing [Zhai & Lafferty 02]
- Stage 1: explain unseen words: Dirichlet prior (Bayesian), parameter μ
- Stage 2: explain noise in the query: two-component mixture, parameter λ
- P(w|d) = (1 - λ) · [c(w,d) + μ · p(w|C)] / (|d| + μ) + λ · p(w|U)
- p(w|C) is the collection LM; p(w|U) is the user background model

48 Understanding smoothing
- Query = "the algorithms for data mining"; the topical words are "algorithms", "data", "mining"
- Comparing p_ML(w|d1) and p_ML(w|d2): p("algorithms"|d1) = p("algorithms"|d2), p("data"|d1) < p("data"|d2), p("mining"|d1) < p("mining"|d2)
- Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)
- So we should make p("the") and p("for") less different across docs, and smoothing helps to achieve this goal
- After smoothing with p(w|d) = 0.1 · p_ML(w|d) + 0.9 · p(w|REF), p(q|d1) < p(q|d2)!
[Table of p(w|REF) and the smoothed p(w|d1), p(w|d2) values omitted.]

49 Smoothing & TF-IDF weighting
- Retrieval formula using the general smoothing scheme p(w|d) = p_seen(w|d) if w is seen in d, α_d · p(w|C) otherwise:
  log p(q|d) = Σ_{w: c(w,q)>0} c(w,q) log p(w|d)
  = Σ_{w: c(w,q)>0, c(w,d)>0} c(w,q) log p_seen(w|d) + Σ_{w: c(w,q)>0, c(w,d)=0} c(w,q) log [α_d · p(w|C)]
  = Σ_{w: c(w,q)>0, c(w,d)>0} c(w,q) log [p_seen(w|d) / (α_d · p(w|C))] + |q| log α_d + Σ_{w: c(w,q)>0} c(w,q) log p(w|C)
- Key rewriting step (where did we see it before?): the last term does not depend on the document for a fixed query
- Similar rewritings are very common when using probabilistic models for IR

50 Retrieval evaluation
- The aforementioned evaluation criteria are all good, but not essential
- Goal of any IR system: satisfying users' information need
- Core quality measure criterion: "how well a system meets the information needs of its users" - wiki
- Unfortunately vague and hard to execute

51 Classical IR evaluation
- Three key elements for IR evaluation:
  1. A document collection
  2. A test suite of information needs, expressible as queries
  3. A set of relevance judgments, e.g., a binary assessment of either relevant or nonrelevant for each query-document pair

52 Evaluation of unranked retrieval sets
- Summarizing precision and recall into a single value, in order to compare different systems
- F-measure: weighted harmonic mean of precision and recall, where α balances the trade-off: F = 1 / (α · (1/P) + (1 - α) · (1/R))
- F1 = 2 / (1/P + 1/R): equal weight between precision and recall
- Why the harmonic mean? e.g., System1: P: 0.53, R: 0.36; System2: P: 0.01, R: …; unlike the arithmetic mean, the harmonic mean punishes a system that trades one measure to an extreme for the other

53 Evaluation of ranked retrieval results
- Summarize the ranking performance with a single number
- Binary relevance: eleven-point interpolated average precision; Precision@K (P@K); Mean Average Precision (MAP); Mean Reciprocal Rank (MRR)
- Multiple grades of relevance: Normalized Discounted Cumulative Gain (NDCG)
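Sketches of per-query average precision and reciprocal rank (MAP and MRR average these values over queries); `ranking` is the system's ordered result list and `relevant` the judged-relevant set, both illustrative:

```python
def average_precision(ranking, relevant):
    """Mean of precision@k over the ranks k where a relevant doc appears."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document (0 if none retrieved)."""
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / k
    return 0.0

print(average_precision(["d3", "d1", "d7"], {"d1", "d7"}))  # (1/2 + 2/3)/2
print(reciprocal_rank(["d3", "d1", "d7"], {"d1", "d7"}))    # 1/2
```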

54 AvgPrec is about one query
[Figure: AvgPrec of two rankings; from Manning's Stanford CS276, Lecture 8.]

55 What does query averaging hide?
[Figure: per-query precision-recall curves; from Doug Oard's presentation, originally from Ellen Voorhees' presentation.]

56 Measuring assessor consistency: kappa statistic
- A measure of agreement between judges: κ = (P(A) - P(E)) / (1 - P(E))
- P(A) is the proportion of the times the judges agreed; P(E) is the proportion of times they would be expected to agree by chance
- κ = 1 if two judges always agree; κ = 0 if two judges agree only by chance; κ < 0 if they agree less often than chance (e.g., always disagree)
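A sketch of the statistic for two judges making binary relevance judgments, estimating P(E) from each judge's own marginal label rates (the Cohen variant; the slide does not specify how P(E) is estimated):

```python
def kappa(judge1, judge2):
    """Kappa for two equal-length lists of 0/1 relevance judgments:
    kappa = (P(A) - P(E)) / (1 - P(E))."""
    n = len(judge1)
    p_a = sum(a == b for a, b in zip(judge1, judge2)) / n  # observed agreement
    p1 = sum(judge1) / n                                   # judge 1 "relevant" rate
    p2 = sum(judge2) / n                                   # judge 2 "relevant" rate
    p_e = p1 * p2 + (1 - p1) * (1 - p2)                    # chance agreement
    return (p_a - p_e) / (1 - p_e)

# Judges agree on 4 of 6 pairs; chance agreement is 0.5, so kappa ~= 0.33.
print(kappa([1, 1, 0, 1, 0, 0], [1, 0, 0, 1, 0, 1]))
```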

57 What we have not considered
- The physical form of the output: user interface
- The effort, intellectual or physical, demanded of the user: user effort when using the system
- These omissions bias IR research towards optimizing relevance-centric metrics
