Dealing with Text Databases

Topics
- Unstructured data
- Boolean queries
- Sparse matrix representation
- Inverted index
- Counts vs. frequencies
- Term frequency
- tf x idf term weights
- Documents as vectors
- Cosine similarity
- Dimensionality reduction
- Vectors and Boolean queries

Christopher Manning, Prabhakar Raghavan, Hinrich Schütze, University of Stuttgart

Unstructured data
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? (Calpurnia was the third and last wife of Julius Caesar.)
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia?
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial

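Term-document incidence
- Columns (plays): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth
- Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser
- An entry is 1 if the play contains the word, 0 otherwise (the individual 0/1 entries are not recoverable from this transcription)
- Query: Brutus AND Caesar but NOT Calpurnia

Incidence vectors
- So we have a 0/1 vector for each term
- To answer the query: take the vectors for Brutus, Caesar and NOT Calpurnia (that is, Calpurnia's vector complemented), then bitwise AND them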
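The bitwise AND over incidence vectors can be sketched in Python, using integers as bit vectors. The 0/1 rows below are an assumption: they follow the commonly reproduced textbook version of this Shakespeare example rather than values legible in these slides, and the helper names are illustrative.

```python
# Term-document incidence vectors, one bit per play (leftmost bit = first play).
# ASSUMPTION: these 0/1 rows follow the widely used textbook example.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

INCIDENCE = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

def boolean_query(must, must_not, n_docs=len(PLAYS)):
    """AND together the vectors in `must` and the complements of `must_not`."""
    mask = (1 << n_docs) - 1          # keeps each complement to n_docs bits
    result = mask
    for term in must:
        result &= INCIDENCE[term]
    for term in must_not:
        result &= ~INCIDENCE[term] & mask
    return result

def matching_plays(bits):
    """Translate a result bit vector back into play titles."""
    n = len(PLAYS)
    return [PLAYS[i] for i in range(n) if (bits >> (n - 1 - i)) & 1]

answer = boolean_query(["Brutus", "Caesar"], ["Calpurnia"])
print(format(answer, "06b"), matching_plays(answer))
```

The mask matters because Python integers are unbounded: without it, complementing a vector would produce a negative number instead of a six-bit pattern.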

Answers to query
- Antony and Cleopatra
- Hamlet

Sparse matrix representation
- For real data the matrix becomes very big
- The matrix has many more zeros than ones: it is extremely sparse
- Why? Not every term (word) is present in every document
- What's a better representation? We only record the 1 positions

Inverted index
- For each term T, we must store a list of all documents that contain T
- Do we use an array or a list for this?
- Example terms: Brutus, Calpurnia, Caesar (their postings lists are not recoverable from this transcription)
- What happens if the word Caesar is added to document 14?

Inverted index
- Linked lists are generally preferred to arrays
  - Dynamic space allocation
  - Insertion of terms into documents is easy
  - Space overhead of pointers

Inverted index construction
- Documents to be indexed: "Friends, Romans, countrymen."
- The tokenizer produces the token stream: Friends, Romans, Countrymen
- The linguistic modules produce modified tokens: friend, roman, countryman
- The indexer produces the inverted index: friend -> 2, 4; roman -> 1, 2; countryman -> (postings not recoverable from this transcription)

Boolean queries: exact match
- In the Boolean retrieval model, a query is a Boolean expression
  - Boolean queries use AND, OR and NOT to join query terms
  - The model views each document as a set of words (terms)
  - It is precise: a document either matches the condition or it does not
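The construction pipeline (tokenizer, linguistic modules, indexer) and a Boolean AND over postings lists can be sketched as follows. This is a minimal illustration, not the slides' implementation: the normalization rules are a toy stand-in for real linguistic modules, and the three documents and their IDs are hypothetical, chosen so that the postings for friend and roman match the ones shown above.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Tokenizer: split raw text into word tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Stand-in for the linguistic modules (ASSUMPTION: a toy rule set).
    Lowercases, maps a 'men' ending to 'man', and strips a plural 's'."""
    token = token.lower()
    if token.endswith("men"):
        return token[:-3] + "man"
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def build_index(docs):
    """Indexer: map each term to a sorted list of doc IDs (its postings)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def intersect(p1, p2):
    """Boolean AND: merge two sorted postings lists in linear time."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical mini-collection; the doc IDs are illustrative only.
docs = {
    1: "Romans, lend me your ears",
    2: "Friends, Romans, countrymen.",
    4: "A friend in need",
}
index = build_index(docs)
both = intersect(index["friend"], index["roman"])
print(index["friend"], index["roman"], both)
```

Keeping postings sorted is what makes the linear-time merge in `intersect` possible; an unsorted list would force a nested scan or a set conversion per query.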

Exact match
- Primary commercial retrieval tool for 3 decades
- Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting

Scoring
- So far our queries have all been Boolean
- Good for expert users with a precise understanding of their needs and the corpus
- Not good for (the majority of) users, who formulate their needs poorly as Boolean expressions

Scoring
- We wish to return, in order, the documents most likely to be useful to the searcher
- How can we rank-order the docs in the corpus with respect to a query?
- Assign a score, say in [0,1], for each doc on each query

Incidence matrices
- Recall: a document (or a zone in it) is a binary vector $X \in \{0,1\}^v$
- The query is a vector $Y$
- Score: overlap measure $|X \cap Y|$

Example
- On the query "ides of march", Shakespeare's Julius Caesar has a score of 3
- All other Shakespeare plays have a score of 2 (because they contain "march") or 1
- Thus in a rank order, Julius Caesar would come out on top

Overlap matching
- What's wrong with the overlap measure? It doesn't consider:
  - Term frequency in the document
  - Term scarcity in the collection (document mention frequency): "of" is more common than "ides" or "march"
  - Length of documents
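The overlap measure is simple enough to sketch directly. The three one-line "documents" below are hypothetical stand-ins, not Shakespeare text; they are chosen so the scores come out as 3, 2 and 1, as in the example above.

```python
def overlap_score(query, document):
    """Overlap measure: number of distinct query terms present in the document."""
    return len(set(query.lower().split()) & set(document.lower().split()))

# Hypothetical one-line "documents"; not actual play text.
docs = {
    "A": "beware the ides of march",
    "B": "the march of time",
    "C": "a tale of two cities",
}
query = "ides of march"

scores = {name: overlap_score(query, text) for name, text in docs.items()}
ranked = sorted(docs, key=lambda name: scores[name], reverse=True)
print(ranked, scores)
```

Because the measure works on sets, repeating "ides" five times in a document would not change its score, which is exactly the first shortcoming listed above.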

Scoring: density-based
- Obvious next idea: if a document talks about a topic more, then it is a better match
- This applies even when we only have a single query term
- A document is relevant if it has a lot of the terms
- This leads to the idea of term weighting

Term-document count matrices
- Consider the number of occurrences of a term in a document
- Bag of words model
- A document is a vector in $\mathbb{N}^v$: one column of the term-document count matrix
- (The count matrix covers the same six plays and seven terms; the counts are not recoverable from this transcription.)

Counts vs. frequencies
- Consider again the "ides of march" query
  - Julius Caesar has 5 occurrences of "ides"
  - No other play has "ides"
  - "march" occurs in over a dozen
  - All the plays contain "of"
- By this scoring measure, the top-scoring play is likely to be the one with the most occurrences of "of"

Digression: terminology
- WARNING: in a lot of IR literature, "frequency" is used to mean "count"
- Thus "term frequency" in IR literature means the number of occurrences in a doc
- It is not divided by document length (which would actually make it a frequency)
- We will conform to this misnomer: by "term frequency" we mean the number of occurrences of a term in a document

Term frequency tf
- Long docs are favored because they're more likely to contain query terms
- Can fix this to some extent by normalizing for document length
- But is raw tf the right measure?

Weighting term frequency: tf
- What is the relative importance of
  - 0 vs. 1 occurrence of a term in a doc
  - 1 vs. 2 occurrences
  - 2 vs. 3 occurrences
- Unclear: while it seems that more is better, a lot isn't proportionally better than a few
- Can just use raw tf
- Another option commonly used in practice (t = term, d = document):

$$wf_{t,d} = \begin{cases} 0 & \text{if } tf_{t,d} = 0 \\ 1 + \log tf_{t,d} & \text{otherwise} \end{cases}$$
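A short sketch of the wf weighting shows how quickly the gain from extra occurrences flattens out. The slides do not state the base of the logarithm; the natural log is assumed here, and a different base only rescales the log term.

```python
import math

def wf(tf):
    """Sublinear term-frequency weight: 0 if tf == 0, else 1 + log(tf).
    ASSUMPTION: natural logarithm; the source does not fix the base."""
    return 0.0 if tf == 0 else 1.0 + math.log(tf)

for count in (0, 1, 2, 3, 10, 100):
    print(count, round(wf(count), 3))
```

Going from 0 to 1 occurrence adds a full unit of weight, while going from 2 to 3 adds less than going from 1 to 2, which matches the intuition that a lot isn't proportionally better than a few.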

Score computation
- Score for a query q = sum over the terms t in q (several terms):

$$\mathrm{Score}(q, d) = \sum_{t \in q} tf_{t,d}$$

- [Note: 0 if no query terms in document]
- This score can be zone-combined
- Can use wf instead of tf in the above
- Still doesn't consider term scarcity in the collection ("ides" is rarer than "of")

Weighting should depend on the term overall
- Which of these tells you more about a doc?
  - 10 occurrences of "hernia"?
  - 10 occurrences of "the"?
- Would like to attenuate the weight of a common term
- But what is "common"?
- Suggest looking at collection frequency (cf): the total number of occurrences of the term in the entire collection of documents
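The raw-tf score can be sketched in a few lines. The two documents below are hypothetical and deliberately contrived so that a document stuffed with "of" outranks one that actually mentions "ides", which is the term-scarcity problem noted above.

```python
from collections import Counter

def tf_score(query, document):
    """Score(q, d) = sum of raw term counts tf(t, d) over the query terms t."""
    counts = Counter(document.lower().split())
    return sum(counts[t] for t in query.lower().split())

query = "ides of march"
d1 = "ides of march ides"        # hypothetical: mentions the rare term twice
d2 = "of of of of of march"      # hypothetical: dominated by a common term

print(tf_score(query, d1), tf_score(query, d2))
```

`Counter` returns 0 for a missing term, so a document with no query terms scores 0, in line with the note on the slide.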

Document frequency
- But document frequency (df) may be better: df = number of docs in the corpus containing the term
- (Example table with columns Word, cf, df and rows including "alfa" and "insurance"; the counts are not recoverable from this transcription.)
- Document/collection frequency weighting is only possible in a known (static) collection
  - (it requires the number of documents in the entire collection of documents)
- So how do we make use of df?

tf x idf term weights
- The tf x idf measure combines:
  - term frequency (tf), or wf: some measure of term density in a doc
  - inverse document frequency (idf): a measure of the informativeness of a term, i.e. its rarity across the whole corpus
- idf could just be based on the raw count of the number of documents the term occurs in ($idf_i = 1/df_i$)
- but by far the most commonly used version is

$$idf_i = \log\left(\frac{n}{df_i}\right)$$

  where $n$ is the total number of documents
- See Kishore Papineni, NAACL 2, 2002 for a theoretical justification

Summary: tf x idf (or tf.idf)
- Assign a tf.idf weight to each term i in each document d:

$$w_{i,d} = tf_{i,d} \times \log\left(\frac{n}{df_i}\right)$$

  - $tf_{i,d}$ = frequency of term i in document d
  - $n$ = total number of documents
  - $df_i$ = the number of documents that contain term i
- Increases with the number of occurrences within a doc
- Increases with the rarity of the term across the whole corpus
- What is the weight of a term that occurs in all of the docs?
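The tf.idf weight can be computed directly from its definition. The sketch below uses a hypothetical three-document collection and the natural log (an assumption, since the base is not stated); it also answers the closing question, because a term present in every document gets log(n/n) = 0.

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Return {doc_id: {term: tf * log(n / df)}} for a dict of tokenized docs."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))          # each doc counts once per term
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights

# Hypothetical mini-collection, already tokenized.
docs = {
    "d1": "the ides of march".split(),
    "d2": "the march of the penguins".split(),
    "d3": "the rest is silence".split(),
}
w = tf_idf_weights(docs)
print(w["d1"])
```

Here "the" occurs in all three documents, so its weight is zero even in d2 where it appears twice, while "ides" (one document) outweighs "march" (two documents).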


Real-valued term-document matrices
- Function (scaling) of the count of a word in a document
- Bag of words model
- Each document is a vector in $\mathbb{R}^v$
- Here: log-scaled tf.idf. Note that entries can be > 1!
- (The matrix covers the same six plays and seven terms; the weights are not recoverable from this transcription.)

Documents as vectors
- Each doc j can now be viewed as a vector of wf x idf values, one component for each term
- So we have a vector space
  - terms are axes
  - docs live in this space
  - even with stemming, it may have 20,000+ dimensions

Why turn docs into vectors?
- First application: query-by-example
  - Given a doc d, find others "like" it
- Now that d is a vector, find vectors (docs) "near" it

Intuition
- [Figure: documents d1 to d5 drawn as vectors in a space with term axes t1, t2, t3; angles θ and φ marked between vectors]
- Postulate: documents that are "close together" in the vector space talk about the same things

Desiderata for proximity
- If d1 is near d2, then d2 is near d1
- If d1 is near d2, and d2 is near d3, then d1 is not far from d3
- No doc is closer to d than d itself

First cut
- Idea: the distance between d1 and d2 is the length of the vector |d1 - d2| (Euclidean distance)
- Why is this not a great idea?
- We still haven't dealt with the issue of length normalization
  - Short documents would be more similar to each other by virtue of length, not topic
- However, we can implicitly normalize by looking at angles instead

Cosine similarity
- The distance between vectors d1 and d2 is captured by the cosine of the angle between them
- Note: this is a similarity, not a distance
  - No triangle inequality for similarity
- Varies from 0 to 1 (for non-negative term weights)
- [Figure: vectors d1 and d2 in a space with term axes t1, t2, t3, separated by angle θ]

Cosine similarity
- A vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm:

$$\|x\|_2 = \sqrt{\sum_i x_i^2}$$

- This maps vectors onto the unit sphere. Then, with v the number of terms,

$$|\vec{d_j}| = \sqrt{\sum_{i=1}^{v} w_{i,j}^2} = 1$$

- Longer documents don't get more weight
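Length normalization and the cosine measure can be sketched in plain Python. The vectors are hypothetical: the "long" document has the same term proportions as the "short" one at ten times the magnitude, so the two should be indistinguishable under cosine. Returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    length = l2_norm(vec)
    return list(vec) if length == 0 else [x / length for x in vec]

def cosine(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    denom = l2_norm(a) * l2_norm(b)
    return 0.0 if denom == 0 else sum(x * y for x, y in zip(a, b)) / denom

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]     # same term proportions, ten times the length
other = [0.0, 0.0, 3.0]          # shares no terms with the other two

print(cosine(short_doc, long_doc), cosine(short_doc, other))
print(l2_norm(normalize(long_doc)))
```

The Euclidean distance between `short_doc` and `long_doc` is large even though they are about the same thing in the same proportions, which is why angles are preferred to raw distances.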

Normalized vectors
- For normalized vectors, the cosine is simply the dot product:

$$\cos(\vec{d_j}, \vec{d_k}) = \vec{d_j} \cdot \vec{d_k}$$

Example
- Docs: Austen's Sense and Sensibility (SaS), Pride and Prejudice (PaP); Brontë's Wuthering Heights (WH)
- Terms: affection, jealous, gossip
- (Two tables of tf weights for SaS, PaP and WH over these terms appear here; their values are not recoverable from this transcription.)
- cos(SaS, PaP) = .996 x ... + ... x ... + ... x 0.0 = ...
- cos(SaS, WH) = .996 x ... + ... x ... + ... x .254 = ...

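Euclidean distance
- Euclidean distance between vectors, with v the number of terms:

$$|\vec{d_j} - \vec{d_k}| = \sqrt{\sum_{i=1}^{v} (d_{i,j} - d_{i,k})^2}$$

- For normalized vectors, Euclidean distance gives the same proximity ordering as the cosine measure

Queries in the vector space model
- Central idea: the query as a vector
  - We regard the query as a short document
  - We return the documents ranked by the closeness of their vectors to the query, also represented as a vector

$$sim(\vec{d_j}, \vec{d_q}) = \frac{\vec{d_j} \cdot \vec{d_q}}{|\vec{d_j}|\,|\vec{d_q}|} = \frac{\sum_{i=1}^{v} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{v} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{v} w_{i,q}^2}}$$

- Note that $\vec{d_q}$ is very sparse!
- Varies from 0 to 1 (for non-negative term weights)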
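The claim that Euclidean distance and cosine give the same ordering on normalized vectors follows from expanding the squared distance between unit vectors: |a - b|^2 = |a|^2 + |b|^2 - 2 a·b = 2 - 2 cos(a, b). The sketch below ranks documents against a query both ways and checks that identity numerically; all weights are hypothetical values over a four-term vocabulary.

```python
import math

def unit(vec):
    """Scale a non-zero vector to unit L2 length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical weights over a 4-term vocabulary; the query is a short document.
docs = {"d1": [3.0, 0.0, 1.0, 0.0],
        "d2": [0.0, 2.0, 0.0, 2.0],
        "d3": [1.0, 1.0, 1.0, 0.0]}
query = [1.0, 0.0, 1.0, 0.0]

q = unit(query)
units = {name: unit(vec) for name, vec in docs.items()}

# Highest cosine first, and smallest distance first.
by_cosine = sorted(units, key=lambda name: dot(units[name], q), reverse=True)
by_distance = sorted(units, key=lambda name: euclidean(units[name], q))
print(by_cosine, by_distance)
```

Since the squared distance is a decreasing function of the cosine, sorting by either key produces the same ranking, so an implementation can pick whichever is cheaper to compute.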


Dimensionality reduction
- What if we could take our vectors and "pack" them into fewer dimensions (say 50, ...) while preserving distances? (Well, almost.)
- Speeds up cosine computations


Measures for Results
- All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise
- The key measure: user happiness
  - What is this?
  - Speed of response and size of index are factors
  - But blindingly fast, useless answers won't make a user happy
- Need a way of quantifying user happiness

Measuring user happiness
- Issue: who is the user we are trying to make happy? Depends on the setting
- Web engine: the user finds what they want and returns to the engine
  - Can measure the rate of return users
- eCommerce site: the user finds what they want and makes a purchase
  - Is it the end-user, or the eCommerce site, whose happiness we measure?
  - Measure time to purchase, or fraction of searchers who become buyers?

Measuring user happiness
- Enterprise (company/govt/academic): care about "user productivity"
  - How much time do my users save when looking for information?
  - Many other criteria having to do with breadth of access, secure access, etc.

Happiness: elusive to measure
- Commonest proxy: relevance of search results
- But how do you measure relevance?
- We will detail a methodology here, then examine its issues
- Relevance measurement requires 3 elements:
  1. A benchmark document collection
  2. A benchmark suite of queries
  3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
- There is some work on more-than-binary assessments, but it is not the standard

Evaluating an IR system
- Note: the information need is translated into a query
- Relevance is assessed relative to the information need, not the query
- E.g., information need: "I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine."
- Query: wine red white heart attack effective
- You evaluate whether the doc addresses the information need, not whether it has those words

Unranked retrieval evaluation: Precision and Recall
- Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
- Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)

                 Relevant   Not Relevant
  Retrieved      tp         fp
  Not Retrieved  fn         tn

- Precision P = tp/(tp + fp)
- Recall R = tp/(tp + fn)
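Precision and recall follow directly from the contingency table once the retrieved and relevant sets are known. The doc IDs below are hypothetical, and returning 0.0 when a denominator would be zero is a convention chosen for this sketch rather than something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)     # relevant and retrieved
    fp = len(retrieved - relevant)     # retrieved but not relevant
    fn = len(relevant - retrieved)     # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: 4 docs retrieved, 5 docs relevant, 2 in common.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5, 6, 7})
print(p, r)
```

Note that tn never appears in either formula, so both measures ignore the very large number of irrelevant documents that were correctly left out.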

Accuracy
- Given a query, an engine classifies each doc as "Relevant" or "Irrelevant"
- Accuracy of an engine: the fraction of these classifications that is correct
- Why is this not a very useful evaluation measure in IR?
  - An engine that returns no results still scores close to 100% accuracy, because almost all docs are irrelevant to any given query

Precision/Recall
- You can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved
- In a good system, precision decreases as either the number of docs retrieved or recall increases
  - A fact with strong empirical confirmation
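A quick calculation makes the accuracy problem concrete. The corpus size and number of relevant documents below are hypothetical round numbers; the point is only that an engine which retrieves nothing still classifies almost every document correctly.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/irrelevant classifications that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical corpus: 1,000,000 docs, of which 10 are relevant to the query.
# An engine that retrieves nothing has tp = 0, fp = 0, fn = 10, tn = 999,990.
lazy_accuracy = accuracy(tp=0, fp=0, fn=10, tn=999_990)
lazy_recall = 0 / (0 + 10)            # tp / (tp + fn)
print(lazy_accuracy, lazy_recall)
```

Accuracy comes out at 0.99999 while recall is 0, which is why precision and recall, not accuracy, are the usual measures.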

Difficulties in using precision/recall
- Should average over large corpus/query ensembles
- Need human relevance assessments
  - People aren't reliable assessors
- Assessments have to be binary
  - Nuanced assessments?
- Heavily skewed by corpus/authorship
  - Results may not translate from one domain to another

A combined measure: F
- The combined measure that assesses this tradeoff is the F measure (weighted harmonic mean):

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \qquad \beta^2 = \frac{1 - \alpha}{\alpha}$$

- People usually use the balanced F1 measure, i.e., with β = 1 or α = ½
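Both forms of the F measure can be written out and compared numerically. This is a small sketch with hypothetical precision and recall values; the two forms agree whenever alpha = 1 / (beta^2 + 1), and returning 0.0 when both inputs are zero is a convention assumed here.

```python
def f_measure(precision, recall, beta=1.0):
    """Beta form: (beta^2 + 1) P R / (beta^2 P + R); 0.0 if both P and R are 0."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

def f_from_alpha(precision, recall, alpha=0.5):
    """Alpha form: 1 / (alpha / P + (1 - alpha) / R); needs P > 0 and R > 0."""
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

P, R = 0.5, 0.4                        # hypothetical precision and recall
f1 = f_measure(P, R)                   # balanced: beta = 1, alpha = 1/2
f2 = f_measure(P, R, beta=2.0)         # recall-weighted: alpha = 1/5
print(round(f1, 4), round(f2, 4))
```

Because F is a harmonic mean, it stays close to the smaller of P and R, so a system cannot score well by maximizing one at the expense of the other.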

Evaluating ranked results
- The system can return any number of results
- By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve

A precision-recall curve
- [Figure: precision-recall curve, with axes labelled Precision and Recall]
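The points of a precision-recall curve can be produced by cutting the ranked list at every depth k and computing precision and recall for the top k results. The ranking and relevance judgments below are hypothetical, and the sketch assumes at least one relevant document so that recall is defined.

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) after each rank k = 1..len(ranking).
    Assumes `relevant` is non-empty."""
    relevant = set(relevant)
    points, hits = [], 0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

# Hypothetical system output (best first) and relevance judgments.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d4"}

points = precision_recall_points(ranking, relevant)
for recall, precision in points:
    print(round(recall, 2), round(precision, 2))
```

Recall never decreases as k grows, while precision drops each time an irrelevant document is added and recovers partly at each relevant one, which gives the curve its characteristic sawtooth shape.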


More information

P1 Chapter 8 :: Binomial Expansion

P1 Chapter 8 :: Binomial Expansion P Chapter 8 :: Biomial Expasio jfrost@tiffi.kigsto.sch.uk www.drfrostmaths.com @DrFrostMaths Last modified: 6 th August 7 Use of DrFrostMaths for practice Register for free at: www.drfrostmaths.com/homework

More information

Review Problems 1. ICME and MS&E Refresher Course September 19, 2011 B = C = AB = A = A 2 = A 3... C 2 = C 3 = =

Review Problems 1. ICME and MS&E Refresher Course September 19, 2011 B = C = AB = A = A 2 = A 3... C 2 = C 3 = = Review Problems ICME ad MS&E Refresher Course September 9, 0 Warm-up problems. For the followig matrices A = 0 B = C = AB = 0 fid all powers A,A 3,(which is A times A),... ad B,B 3,... ad C,C 3,... Solutio:

More information

CALCULATION OF FIBONACCI VECTORS

CALCULATION OF FIBONACCI VECTORS CALCULATION OF FIBONACCI VECTORS Stuart D. Aderso Departmet of Physics, Ithaca College 953 Daby Road, Ithaca NY 14850, USA email: saderso@ithaca.edu ad Dai Novak Departmet of Mathematics, Ithaca College

More information

if j is a neighbor of i,

if j is a neighbor of i, We see that if i = j the the coditio is trivially satisfied. Otherwise, T ij (i) = (i)q ij mi 1, (j)q ji, ad, (i)q ij T ji (j) = (j)q ji mi 1, (i)q ij. (j)q ji Now there are two cases, if (j)q ji (i)q

More information

Anna Janicka Mathematical Statistics 2018/2019 Lecture 1, Parts 1 & 2

Anna Janicka Mathematical Statistics 2018/2019 Lecture 1, Parts 1 & 2 Aa Jaicka Mathematical Statistics 18/19 Lecture 1, Parts 1 & 1. Descriptive Statistics By the term descriptive statistics we will mea the tools used for quatitative descriptio of the properties of a sample

More information

Section 11.8: Power Series

Section 11.8: Power Series Sectio 11.8: Power Series 1. Power Series I this sectio, we cosider geeralizig the cocept of a series. Recall that a series is a ifiite sum of umbers a. We ca talk about whether or ot it coverges ad i

More information

CS284A: Representations and Algorithms in Molecular Biology

CS284A: Representations and Algorithms in Molecular Biology CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

R is a scalar defined as follows:

R is a scalar defined as follows: Math 8. Notes o Dot Product, Cross Product, Plaes, Area, ad Volumes This lecture focuses primarily o the dot product ad its may applicatios, especially i the measuremet of agles ad scalar projectio ad

More information

OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES

OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES Peter M. Maurer Why Hashig is θ(). As i biary search, hashig assumes that keys are stored i a array which is idexed by a iteger. However, hashig attempts to bypass

More information

Lesson 10: Limits and Continuity

Lesson 10: Limits and Continuity www.scimsacademy.com Lesso 10: Limits ad Cotiuity SCIMS Academy 1 Limit of a fuctio The cocept of limit of a fuctio is cetral to all other cocepts i calculus (like cotiuity, derivative, defiite itegrals

More information

A Simple Probabilistic Explanation of Term Frequency-Inverse Document Frequency (tf-idf) Heuristic (and Variations Motivated by This Explanation)

A Simple Probabilistic Explanation of Term Frequency-Inverse Document Frequency (tf-idf) Heuristic (and Variations Motivated by This Explanation) Uiversity of Texas at El Paso DigitalCommos@UTEP Departmetal Techical Reports (CS) Departmet of Computer Sciece 5-2014 A Simple Probabilistic Explaatio of Term Frequecy-Iverse Documet Frequecy (tf-idf)

More information

Topic 1 2: Sequences and Series. A sequence is an ordered list of numbers, e.g. 1, 2, 4, 8, 16, or

Topic 1 2: Sequences and Series. A sequence is an ordered list of numbers, e.g. 1, 2, 4, 8, 16, or Topic : Sequeces ad Series A sequece is a ordered list of umbers, e.g.,,, 8, 6, or,,,.... A series is a sum of the terms of a sequece, e.g. + + + 8 + 6 + or... Sigma Notatio b The otatio f ( k) is shorthad

More information

Math 10A final exam, December 16, 2016

Math 10A final exam, December 16, 2016 Please put away all books, calculators, cell phoes ad other devices. You may cosult a sigle two-sided sheet of otes. Please write carefully ad clearly, USING WORDS (ot just symbols). Remember that the

More information

Machine Learning for Data Science (CS 4786)

Machine Learning for Data Science (CS 4786) Machie Learig for Data Sciece CS 4786) Lecture & 3: Pricipal Compoet Aalysis The text i black outlies high level ideas. The text i blue provides simple mathematical details to derive or get to the algorithm

More information

REVISION SHEET FP1 (MEI) ALGEBRA. Identities In mathematics, an identity is a statement which is true for all values of the variables it contains.

REVISION SHEET FP1 (MEI) ALGEBRA. Identities In mathematics, an identity is a statement which is true for all values of the variables it contains. the Further Mathematics etwork wwwfmetworkorguk V 07 The mai ideas are: Idetities REVISION SHEET FP (MEI) ALGEBRA Before the exam you should kow: If a expressio is a idetity the it is true for all values

More information

Parallel Vector Algorithms David A. Padua

Parallel Vector Algorithms David A. Padua Parallel Vector Algorithms 1 of 32 Itroductio Next, we study several algorithms where parallelism ca be easily expressed i terms of array operatios. We will use Fortra 90 to represet these algorithms.

More information

1 Introduction to reducing variance in Monte Carlo simulations

1 Introduction to reducing variance in Monte Carlo simulations Copyright c 010 by Karl Sigma 1 Itroductio to reducig variace i Mote Carlo simulatios 11 Review of cofidece itervals for estimatig a mea I statistics, we estimate a ukow mea µ = E(X) of a distributio by

More information

10.6 ALTERNATING SERIES

10.6 ALTERNATING SERIES 0.6 Alteratig Series Cotemporary Calculus 0.6 ALTERNATING SERIES I the last two sectios we cosidered tests for the covergece of series whose terms were all positive. I this sectio we examie series whose

More information

Hashing and Amortization

Hashing and Amortization Lecture Hashig ad Amortizatio Supplemetal readig i CLRS: Chapter ; Chapter 7 itro; Sectio 7.. Arrays ad Hashig Arrays are very useful. The items i a array are statically addressed, so that isertig, deletig,

More information

1 Lesson 6: Measure of Variation

1 Lesson 6: Measure of Variation 1 Lesso 6: Measure of Variatio 1.1 The rage As we have see, there are several viable coteders for the best measure of the cetral tedecy of data. The mea, the mode ad the media each have certai advatages

More information

7.1 Convergence of sequences of random variables

7.1 Convergence of sequences of random variables Chapter 7 Limit theorems Throughout this sectio we will assume a probability space (Ω, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite

More information

Bertrand s Postulate

Bertrand s Postulate Bertrad s Postulate Lola Thompso Ross Program July 3, 2009 Lola Thompso (Ross Program Bertrad s Postulate July 3, 2009 1 / 33 Bertrad s Postulate I ve said it oce ad I ll say it agai: There s always a

More information

Chapter 18 Summary Sampling Distribution Models

Chapter 18 Summary Sampling Distribution Models Uit 5 Itroductio to Iferece Chapter 18 Summary Samplig Distributio Models What have we leared? Sample proportios ad meas will vary from sample to sample that s samplig error (samplig variability). Samplig

More information

3. Z Transform. Recall that the Fourier transform (FT) of a DT signal xn [ ] is ( ) [ ] = In order for the FT to exist in the finite magnitude sense,

3. Z Transform. Recall that the Fourier transform (FT) of a DT signal xn [ ] is ( ) [ ] = In order for the FT to exist in the finite magnitude sense, 3. Z Trasform Referece: Etire Chapter 3 of text. Recall that the Fourier trasform (FT) of a DT sigal x [ ] is ω ( ) [ ] X e = j jω k = xe I order for the FT to exist i the fiite magitude sese, S = x [

More information

Kinetics of Complex Reactions

Kinetics of Complex Reactions Kietics of Complex Reactios by Flick Colema Departmet of Chemistry Wellesley College Wellesley MA 28 wcolema@wellesley.edu Copyright Flick Colema 996. All rights reserved. You are welcome to use this documet

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 018/019 DR. ANTHONY BROWN 8. Statistics 8.1. Measures of Cetre: Mea, Media ad Mode. If we have a series of umbers the

More information

CHAPTER I: Vector Spaces

CHAPTER I: Vector Spaces CHAPTER I: Vector Spaces Sectio 1: Itroductio ad Examples This first chapter is largely a review of topics you probably saw i your liear algebra course. So why cover it? (1) Not everyoe remembers everythig

More information

PRACTICE PROBLEMS FOR THE FINAL

PRACTICE PROBLEMS FOR THE FINAL PRACTICE PROBLEMS FOR THE FINAL Math 36Q Fall 25 Professor Hoh Below is a list of practice questios for the Fial Exam. I would suggest also goig over the practice problems ad exams for Exam ad Exam 2 to

More information

19.1 The dictionary problem

19.1 The dictionary problem CS125 Lecture 19 Fall 2016 19.1 The dictioary proble Cosider the followig data structural proble, usually called the dictioary proble. We have a set of ites. Each ite is a (key, value pair. Keys are i

More information

x a x a Lecture 2 Series (See Chapter 1 in Boas)

x a x a Lecture 2 Series (See Chapter 1 in Boas) Lecture Series (See Chapter i Boas) A basic ad very powerful (if pedestria, recall we are lazy AD smart) way to solve ay differetial (or itegral) equatio is via a series expasio of the correspodig solutio

More information

NUMERICAL METHODS COURSEWORK INFORMAL NOTES ON NUMERICAL INTEGRATION COURSEWORK

NUMERICAL METHODS COURSEWORK INFORMAL NOTES ON NUMERICAL INTEGRATION COURSEWORK NUMERICAL METHODS COURSEWORK INFORMAL NOTES ON NUMERICAL INTEGRATION COURSEWORK For this piece of coursework studets must use the methods for umerical itegratio they meet i the Numerical Methods module

More information

PH 425 Quantum Measurement and Spin Winter SPINS Lab 1

PH 425 Quantum Measurement and Spin Winter SPINS Lab 1 PH 425 Quatum Measuremet ad Spi Witer 23 SPIS Lab Measure the spi projectio S z alog the z-axis This is the experimet that is ready to go whe you start the program, as show below Each atom is measured

More information

11.6 Absolute Convergence and the Ratio and Root Tests

11.6 Absolute Convergence and the Ratio and Root Tests .6 Absolute Covergece ad the Ratio ad Root Tests The most commo way to test for covergece is to igore ay positive or egative sigs i a series, ad simply test the correspodig series of positive terms. Does

More information

Axis Aligned Ellipsoid

Axis Aligned Ellipsoid Machie Learig for Data Sciece CS 4786) Lecture 6,7 & 8: Ellipsoidal Clusterig, Gaussia Mixture Models ad Geeral Mixture Models The text i black outlies high level ideas. The text i blue provides simple

More information

Query. Information Retrieval (IR) Term-document incidence. Incidence vectors. Bigger corpora. Answers to query

Query. Information Retrieval (IR) Term-document incidence. Incidence vectors. Bigger corpora. Answers to query Information Retrieval (IR) Based on slides by Prabhaar Raghavan, Hinrich Schütze, Ray Larson Query Which plays of Shaespeare contain the words Brutus AND Caesar but NOT Calpurnia? Could grep all of Shaespeare

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machie learig Mid-term exam October, ( poits) Your ame ad MIT ID: Problem We are iterested here i a particular -dimesioal liear regressio problem. The dataset correspodig to this problem has examples

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

Fortgeschrittene Datenstrukturen Vorlesung 11

Fortgeschrittene Datenstrukturen Vorlesung 11 Fortgeschrittee Datestruture Vorlesug 11 Schriftführer: Marti Weider 19.01.2012 1 Succict Data Structures (ctd.) 1.1 Select-Queries A slightly differet approach, compared to ra, is used for select. B represets

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information