Web-Mining Agents Probabilistic Information Retrieval


1 Web-Mining Agents: Probabilistic Information Retrieval. Prof. Dr. Ralf Möller, Universität zu Lübeck, Institut für Informationssysteme. Karsten Martiny (exercises)

2 Acknowledgements. Slides taken from: Introduction to Information Retrieval, Christopher Manning and Prabhakar Raghavan

3 The typical IR problem (figure): a query and a document collection are mapped to a query representation and a document representation, which are matched to produce an answer. How exact is the representation of the document? How exact is the representation of the query? How well is the query matched to the data? How relevant is the result to the query?

4 Why probabilities in IR? User information need → query representation: our understanding of the user need is uncertain. Documents → document representation: an uncertain guess of whether a document has relevant content. How to match? In traditional IR systems, matching between each document and query is attempted in a semantically imprecise space of index terms. Probabilities provide a principled foundation for uncertain reasoning. Can we use probabilities to quantify our uncertainties?

5 Probabilistic approaches to IR: Probability Ranking Principle (Robertson, 70s; Maron & Kuhns, 1959); Information Retrieval as probabilistic inference (van Rijsbergen & co., since the 70s); probabilistic indexing (Fuhr & co., late 80s-90s); Bayesian nets in IR (Turtle & Croft, 90s). Success: varied.

6 Probability Ranking Principle. Given a collection of documents, the user issues a query and a set of documents needs to be returned. Question: in what order should documents be presented to the user?

7 Probability Ranking Principle. Question: in what order should documents be presented to the user? Intuitively, we want the best document to be first, the second best second, etc. We need a formal way to judge the "goodness" of documents w.r.t. queries. Idea: the probability of relevance of the document w.r.t. the query.

8 Let us recap probability theory. Bayesian probability formulas:

p(a|b) p(b) = p(a ∩ b) = p(b|a) p(a)
p(a|b) = p(b|a) p(a) / p(b)
p(¬a|b) = p(b|¬a) p(¬a) / p(b)

Odds: O(y) = p(y) / p(¬y) = p(y) / (1 − p(y))

9 Odds vs. probabilities (figure comparing the two scales).
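
Not from the slides, but as a quick worked illustration of the two scales, a few lines of Python (function names are illustrative):

```python
def odds(p):
    """Convert a probability p in [0, 1) to odds O(p) = p / (1 - p)."""
    return p / (1.0 - p)

def prob(o):
    """Convert odds back to a probability: p = O / (1 + O)."""
    return o / (1.0 + o)

for p in (0.1, 0.5, 0.9):
    print(f"p = {p:.1f} -> odds = {odds(p):.2f} -> back to p = {prob(odds(p)):.1f}")
```

Note that odds grow without bound as p approaches 1, which is why log-odds are a convenient additive score later on.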

10 Probability Ranking Principle. Let x be a document in the retrieved collection. Let R represent relevance of a document w.r.t. a given (fixed) query, and let NR represent non-relevance. We need to find p(R|x), the probability that a retrieved document x is relevant. By Bayes' rule:

p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)

p(R), p(NR): prior probability of retrieving a relevant or non-relevant document, respectively. p(x|R), p(x|NR): probability that if a relevant (non-relevant) document is retrieved, it is x.

11 Probability Ranking Principle.

p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)

Ranking principle (Bayes decision rule): if p(R|x) > p(NR|x) then x is relevant, otherwise x is not relevant. Note: p(R|x) + p(NR|x) = 1.

12 Probability Ranking Principle. Claim: the PRP minimizes the average probability of error:

p(error|x) = p(R|x) if we decide NR; p(NR|x) if we decide R
p(error) = Σ_x p(error|x) p(x)

p(error) is minimal when all the p(error|x) are minimal, and the Bayes decision rule minimizes each p(error|x).

13 Probability Ranking Principle. More complex case: retrieval costs. Let C be the cost of retrieval of a relevant document and C′ the cost of retrieval of a non-relevant document, and let d be a document. Probability Ranking Principle: if

C · p(R|d) + C′ · (1 − p(R|d)) ≤ C · p(R|d′) + C′ · (1 − p(R|d′))

for all d′ not yet retrieved, then d is the next document to be retrieved.
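
A minimal sketch of this selection rule (not the slides' code; `candidates`, `p_rel`, and the cost arguments are illustrative names), assuming the relevance probabilities p(R|d) have already been estimated:

```python
def next_document(candidates, p_rel, cost_rel, cost_nonrel):
    """Pick the not-yet-retrieved document d minimizing the expected
    retrieval cost  C * p(R|d) + C' * (1 - p(R|d)).

    candidates: iterable of document ids not yet retrieved
    p_rel: dict mapping document id -> estimated p(R|d)
    """
    def expected_cost(d):
        p = p_rel[d]
        return cost_rel * p + cost_nonrel * (1.0 - p)
    return min(candidates, key=expected_cost)
```

With C < C′ (retrieving a relevant document is cheaper), the expected cost is C′ + (C − C′) p(R|d), so minimizing it reduces to ranking by descending p(R|d), i.e. the plain PRP.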

14 PRP: issues. How do we compute all those probabilities? We cannot compute exact probabilities; we have to use estimates (Binary Independence Retrieval, BIR; see below). Restrictive assumptions: relevance of each document is independent of the relevance of other documents; most applications are for the Boolean model.

15 Bayesian nets in IR. Bayesian nets are the most popular way of doing probabilistic inference. What is a Bayesian net? How can Bayesian nets be used in IR? (J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988)

16 Bayesian nets for IR: idea. Document network (large, but computed once for each document collection): d1, d2, ..., dn are the documents; t1, ..., tn their representations; r1, r2, ..., rk the concepts. Query network (small, computed once for every query): c1, c2, ..., cm are the query concepts; q1, q2 the high-level concepts; I is the goal node.

17 Example (figure): the query "reason trouble two". Document network: documents Hamlet and Macbeth connected to the terms reason, trouble, double. Query network: the concepts reason, trouble, two, combined via OR and NOT into the user query node.

18 Bayesian nets for IR: roadmap. Construct the document network (once!). For each query: construct the best query network; attach it to the document network; find the subset of di's which maximizes the probability value of node I (best subset); retrieve these di's as the answer to the query.

19 Bayesian nets in IR: pros/cons. Pros: more of a cookbook solution; flexible (create your own document and query networks); relatively easy to update; generalizes other probabilistic approaches (PRP, probabilistic indexing). Cons: best-subset computation is NP-hard, so quick approximations have to be used; approximated best subsets may not contain the best documents; and where do we get the numbers?

20 Relevance models. Given: PRP. Goal: estimate the probability P(R|q,d). Binary Independence Retrieval (BIR): many documents D, one query q; estimate P(R|q,d) by considering whether d in D is relevant for q. Binary Independence Indexing (BII): one document d, many queries Q; estimate P(R|q,d) by considering whether document d is relevant for a query q in Q.

21 Binary Independence Retrieval. Traditionally used in conjunction with the PRP. "Binary" = Boolean: documents are represented as binary vectors of terms, x = (x1, ..., xn), with xi = 1 iff term i is present in document x. "Independence": terms occur in documents independently. Different documents can be modeled as the same vector.

22 Binary Independence Retrieval. Queries: binary vectors of terms. Given a query q, for each document d we need to compute p(R|q,d); replace this with computing p(R|q,x), where x is the vector representing d. Since we are interested only in ranking, we use odds:

O(R|q,x) = p(R|q,x) / p(NR|q,x) = [p(R|q) / p(NR|q)] · [p(x|R,q) / p(x|NR,q)]

23 Binary Independence Retrieval. Using the independence assumption:

p(x|R,q) / p(x|NR,q) = ∏_{i=1..n} p(xi|R,q) / p(xi|NR,q)

So:

O(R|q,d) = O(R|q) · ∏_{i=1..n} p(xi|R,q) / p(xi|NR,q)

The first factor is constant for each query; the product needs estimation.

24 Binary Independence Retrieval.

O(R|q,d) = O(R|q) · ∏_{i=1..n} p(xi|R,q) / p(xi|NR,q)

Since each xi is either 0 or 1:

O(R|q,d) = O(R|q) · ∏_{xi=1} p(xi=1|R,q)/p(xi=1|NR,q) · ∏_{xi=0} p(xi=0|R,q)/p(xi=0|NR,q)

Let pi = p(xi=1|R,q) and ri = p(xi=1|NR,q). Assume pi = ri for all terms not occurring in the query (qi = 0). Then...

25 Binary Independence Retrieval.

O(R|q,x) = O(R|q) · ∏_{xi=1, qi=1} pi/ri · ∏_{xi=0, qi=1} (1−pi)/(1−ri)    (all matching terms · non-matching query terms)

= O(R|q) · ∏_{xi=1, qi=1} [pi(1−ri)] / [ri(1−pi)] · ∏_{qi=1} (1−pi)/(1−ri)    (all matching terms · all query terms)

26 Binary Independence Retrieval.

O(R|q,x) = O(R|q) · ∏_{xi=1, qi=1} [pi(1−ri)] / [ri(1−pi)] · ∏_{qi=1} (1−pi)/(1−ri)

The first and last factors are constant for each query. The only quantity to be estimated for rankings is the Retrieval Status Value:

RSV = log ∏_{xi=1, qi=1} [pi(1−ri)] / [ri(1−pi)] = Σ_{xi=1, qi=1} log [pi(1−ri)] / [ri(1−pi)]

27 Binary Independence Retrieval. All boils down to computing the RSV:

RSV = Σ_{xi=1, qi=1} ci,  where  ci = log [pi(1−ri)] / [ri(1−pi)]

So how do we compute the ci's from our data?
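
A minimal Python sketch of the RSV computation (illustrative, not from the slides), assuming per-term estimates pi and ri are given as dicts:

```python
import math

def rsv(query_terms, doc_terms, p, r):
    """Retrieval Status Value for one document:
    RSV = sum over terms present in both query and document of
          c_i = log( p_i * (1 - r_i) / (r_i * (1 - p_i)) ).

    query_terms, doc_terms: sets of term ids
    p, r: dicts mapping term id -> estimates of p_i and r_i
    """
    score = 0.0
    for t in query_terms & doc_terms:   # terms with x_i = q_i = 1
        score += math.log(p[t] * (1.0 - r[t]) / (r[t] * (1.0 - p[t])))
    return score
```

Ranking documents by this score is rank-equivalent to ranking by O(R|q,x), since the factors dropped are constant per query.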

28 Binary Independence Retrieval. Estimating the RSV coefficients: for each term i, look at the following table of document counts:

             Relevant   Non-relevant    Total
  xi = 1     s          n − s           n
  xi = 0     S − s      N − n − S + s   N − n
  Total      S          N − S           N

Estimates:  pi ≈ s/S,  ri ≈ (n − s)/(N − S),
ci ≈ K(N, n, S, s) = log [ s/(S − s) ] / [ (n − s)/(N − n − S + s) ]
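
The same estimate in code; the 0.5 added to every cell is a common smoothing convention (an assumption here, not stated on the slide) that keeps the log finite when a cell count is zero:

```python
import math

def c_i(N, n, S, s, eps=0.5):
    """Estimate the term weight
    c_i = log( (s / (S - s)) / ((n - s) / (N - n - S + s)) ),
    with eps added to every contingency-table cell for stability.

    N: total documents, n: documents containing term i,
    S: relevant documents, s: relevant documents containing term i.
    """
    return math.log(((s + eps) / (S - s + eps)) /
                    ((n - s + eps) / (N - n - S + s + eps)))
```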

29 Binary Independence Indexing. "Learning" from queries: more queries give better results. p(q|x,R) is the probability that, if document x had been deemed relevant, query q had been asked. The rest of the framework is similar to BIR:

p(R|q,x) = p(q|x,R) p(R|x) / p(q|x)

30 Binary Independence Indexing vs. Binary Independence Retrieval. BIR: many documents, one query; Bayesian probability p(R|q,x) = p(x|q,R) p(R|q) / p(x|q); varies: document representation; constant: query representation. BII: one document, many queries; Bayesian probability p(R|q,x) = p(q|x,R) p(R|x) / p(q|x); varies: query; constant: document representation.

31 Estimation: the key challenge. If non-relevant documents are approximated by the whole collection, then ri (the probability of occurrence in non-relevant documents for the query) is n/N, and

log (1 − ri)/ri = log (N − n)/n ≈ log N/n = IDF!

pi (the probability of occurrence in relevant documents) can be estimated in various ways: from relevant documents, if we know some (relevance weighting can be used in a feedback loop); as a constant (Croft and Harper combination match), in which case we just get idf weighting of terms; or proportional to the probability of occurrence in the collection (more accurately, to the log of this; Greiff, SIGIR 1998). So we have a nice theoretical foundation for weighted idf.
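
A quick numeric check of the approximation (numbers chosen purely for illustration): with N = 1,000,000 documents and a term occurring in n = 1,000 of them,

log₂((N − n)/n) = log₂(999) ≈ 9.96  and  log₂(N/n) = log₂(1000) ≈ 9.97,

so for all but the most frequent terms the exact and the idf-style weights are practically identical.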

32 Iteratively estimating pi. 1. Assume that pi is constant over all xi in the query: pi = 0.5 (even odds) for any given doc. 2. Determine a guess of the relevant document set: V is a fixed-size set of the highest-ranked documents on this model (note: now a bit like tf.idf!). 3. We need to improve our guesses for pi and ri, so use the distribution of xi in the docs in V: let Vi be the set of documents in V containing xi, and set pi = |Vi| / |V|; assume that if a document is not retrieved it is not relevant, so ri = (ni − |Vi|) / (N − |V|). 4. Go to 2 until the ranking converges, then return it.
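
A sketch of the whole loop under stated assumptions (documents as sets of term ids, df as per-term document frequencies ni; the 0.5 smoothing is added so the logs stay finite and is not part of the slide's formulas):

```python
import math

def iterate_p_estimates(docs, query_terms, N, df, v_size=10, iters=5):
    """Pseudo-relevance iteration for the BIR model.

    docs: dict doc_id -> set of term ids
    query_terms: set of term ids, N: collection size
    df: dict term id -> document frequency n_i
    """
    p = {t: 0.5 for t in query_terms}        # step 1: even odds
    r = {t: df[t] / N for t in query_terms}  # collection-based guess
    ranking = []
    for _ in range(iters):
        def score(terms):
            return sum(math.log(p[t] * (1 - r[t]) / (r[t] * (1 - p[t])))
                       for t in query_terms & terms)
        ranking = sorted(docs, key=lambda d: score(docs[d]), reverse=True)
        V = ranking[:v_size]                 # step 2: guess relevant set V
        for t in query_terms:                # step 3: re-estimate p_i, r_i
            Vi = sum(1 for d in V if t in docs[d])
            p[t] = (Vi + 0.5) / (len(V) + 1.0)
            r[t] = (df[t] - Vi + 0.5) / (N - len(V) + 1.0)
    return ranking                           # step 4: iterate, then return
```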

33 Probabilistic relevance feedback. 1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents V, as above. 2. Interact with the user to refine the description: learn some definite members of R and NR. 3. Re-estimate pi and ri on the basis of these, or combine the new information with the original guess using a Bayesian prior:

pi^(2) = (|Vi| + κ · pi^(1)) / (|V| + κ),  where κ is the prior weight.

4. Repeat, thus generating a succession of approximations to R.
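
The Bayesian update of step 3 as a one-liner (illustrative; the default value of kappa is arbitrary):

```python
def bayesian_update(Vi, V, p_prev, kappa=5.0):
    """Blend the observed fraction |Vi|/|V| with the previous estimate
    p_prev, weighting the prior by kappa:
    p_new = (|Vi| + kappa * p_prev) / (|V| + kappa)."""
    return (Vi + kappa * p_prev) / (V + kappa)

# e.g. 3 of 10 feedback documents contain the term, prior estimate 0.5:
print(bayesian_update(Vi=3, V=10, p_prev=0.5))  # -> 0.3666...
```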

34 PRP and BIR: the lessons. Getting reasonable approximations of probabilities is possible. Simple methods work only with restrictive assumptions: term independence; terms not in the query do not affect the outcome; Boolean representation of documents/queries; document relevance values are independent. Some of these assumptions can be removed.

35 Food for thought. Think through the differences between standard tf.idf and the probabilistic retrieval model in the first iteration. Think through the differences between vector-space (pseudo-) relevance feedback and probabilistic (pseudo-) relevance feedback.

36 Good and bad news. Standard vector space model: empirical for the most part, with success measured by results; few properties provable. Probabilistic model, advantages: based on a firm theoretical foundation; theoretically justified optimal ranking scheme. Disadvantages: binary word-in-doc weights (not using term frequencies); independence of terms (can be alleviated); amount of computation; has never worked convincingly better in practice.
