Dealing with Text Databases
- Colin Perkins
Transcription
Dealing with Text Databases
- Unstructured data
- Boolean queries
- Sparse matrix representation
- Inverted index
- Counts vs. frequencies
- Term frequency
- tf x idf term weights
- Documents as vectors
- Cosine similarity
- Dimensionality reduction
- Vectors and Boolean queries
Christopher Manning; Prabhakar Raghavan; Hinrich Schütze, University of Stuttgart

Unstructured data
- Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? (Calpurnia: third and last wife of Julius Caesar.)
- One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia, but:
  - Slow (for large corpora)
  - NOT Calpurnia is non-trivial
Term-document incidence
- [Term-document incidence matrix. Columns (plays): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth. Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. The 0/1 values are not recoverable.]
- Query: Brutus AND Caesar but NOT Calpurnia
- An entry is 1 if the play contains the word, 0 otherwise.

Incidence vectors
- So we have a 0/1 vector for each term.
- To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them.
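The bitwise-AND step can be sketched in a few lines of Python. The 0/1 values below are assumed for illustration, since the transcription lost the original matrix; the names `PLAYS`, `INCIDENCE` and `boolean_and_not` are likewise hypothetical.

```python
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Assumed incidence vectors, one bit per play (leftmost bit = first play).
INCIDENCE = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

def boolean_and_not(include, exclude, incidence=INCIDENCE, n_docs=len(PLAYS)):
    """AND together the vectors of `include` and the complements of `exclude`."""
    mask = (1 << n_docs) - 1          # keeps complements within n_docs bits
    result = mask
    for term in include:
        result &= incidence[term]
    for term in exclude:
        result &= ~incidence[term] & mask
    # Decode the surviving bits back into play titles.
    return [PLAYS[i] for i in range(n_docs)
            if (result >> (n_docs - 1 - i)) & 1]

answer = boolean_and_not(["Brutus", "Caesar"], ["Calpurnia"])
print(answer)
```

With these assumed values the query selects Antony and Cleopatra and Hamlet. The mask matters because Python integers are unbounded, so a bare `~` would produce a negative number rather than a six-bit complement.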
Answers to query
- The bitwise AND of the three vectors selects two plays: Antony and Cleopatra, and Hamlet.

Sparse matrix representation
- For real data the matrix becomes very big.
- The matrix has many more zeros than ones, so it is extremely sparse.
- Why? Not every term (word) is present in every document.
- What's a better representation? We only record the 1 positions.
Inverted index
- For each term T, we must store a list of all documents that contain T.
- Do we use an array or a list for this?
- [Figure: postings lists for Brutus, Calpurnia and Caesar; document IDs not recoverable.]
- What happens if the word Caesar is added to document 14?

Inverted index
- Linked lists are generally preferred to arrays:
  - Dynamic space allocation
  - Insertion of terms into documents is easy
  - Space overhead of pointers
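A minimal sketch of postings lists and of the "Caesar is added to document 14" update follows. The document IDs are hypothetical, and Python lists are dynamic arrays rather than linked lists, so this illustrates the sorted-postings idea rather than the pointer-based layout the slide favors.

```python
import bisect

# Hypothetical postings: each term maps to a sorted list of document IDs.
postings = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [13, 16],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings, keeping them sorted and unique."""
    plist = index.setdefault(term, [])
    pos = bisect.bisect_left(plist, doc_id)
    if pos == len(plist) or plist[pos] != doc_id:
        plist.insert(pos, doc_id)

def intersect(p1, p2):
    """Linear-time merge of two sorted postings lists (Boolean AND)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

add_posting(postings, "Caesar", 14)
both = intersect(postings["Brutus"], postings["Caesar"])
print(postings["Caesar"], both)
```

Keeping the lists sorted is what makes the merge in `intersect` linear in the combined length of the two lists.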
Inverted index construction
- Documents to be indexed: "Friends, Romans, countrymen."
- The tokenizer produces the token stream: Friends, Romans, Countrymen
- Linguistic modules produce the modified tokens: friend, roman, countryman
- The indexer produces the inverted index: friend -> 2, 4; roman -> 1, 2; countryman -> (postings not recoverable)

Boolean queries: Exact match
- In the Boolean retrieval model, a query is a Boolean expression:
  - Boolean queries use AND, OR and NOT to join query terms.
  - The model views each document as a set of words (terms).
  - It is precise: a document matches the condition or it does not.
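The tokenizer, linguistic-module and indexer stages can be sketched as below. This is a simplified illustration: `normalize` only lowercases (real linguistic modules would also stem "friends" to "friend"), and the two documents are made up.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Stand-in for the linguistic modules: lowercasing only, no stemming."""
    return token.lower()

def build_index(docs):
    """Map each normalized term to the sorted list of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical two-document collection.
docs = {1: "Friends, Romans, countrymen.", 2: "Romans and friends"}
index = build_index(docs)
print(index)
```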
Exact match
- Primary commercial retrieval tool for 3 decades.
- Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.

Scoring
- Our queries have all been Boolean.
- Good for expert users with a precise understanding of their needs and the corpus.
- Not good for (the majority of) users, who formulate their needs poorly as Boolean expressions.
Scoring
- We wish to return, in order, the documents most likely to be useful to the searcher.
- How can we rank order the docs in the corpus with respect to a query?
- Assign a score, say in [0,1], for each doc on each query.

Incidence matrices
- Recall: a document (or a zone in it) is a binary vector $X$ in $\{0,1\}^v$.
- The query is a vector $Y$.
- Score: overlap measure $|X \cap Y|$
- [Term-document incidence matrix. Columns (plays): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth. Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. Values not recoverable.]
Example
- On the query "ides of march", Shakespeare's Julius Caesar has a score of 3.
- All other Shakespeare plays have a score of 2 (because they contain "march") or 1.
- Thus in a rank order, Julius Caesar would come out on top.

Overlap matching
- What's wrong with the overlap measure? It doesn't consider:
  - Term frequency in the document
  - Term scarcity in the collection (document mention frequency): "of" is more common than "ides" or "march"
  - Length of documents
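The overlap measure is simple enough to write out directly. The three short "documents" below are invented stand-ins, not Shakespeare's text; `overlap_score` is a hypothetical helper name.

```python
def overlap_score(query, document):
    """Number of distinct query terms that also occur in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)

# Hypothetical toy documents.
docs = {
    "A": "beware the ides of march",
    "B": "the march of time",
    "C": "a tale of two cities",
}
scores = {name: overlap_score("ides of march", text) for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Document C still scores 1 purely because it contains "of", which illustrates the weakness listed above: the measure ignores how common a term is in the collection.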
Scoring: density-based
- Obvious next idea: if a document talks about a topic more, then it is a better match.
- This applies even when we only have a single query term.
- A document is relevant if it has a lot of the terms.
- This leads to the idea of term weighting.

Term-document count matrices
- Consider the number of occurrences of a term in a document:
  - Bag of words model
  - A document is a vector in $\mathbb{N}^v$: a column of the matrix
- [Term-document count matrix. Columns (plays): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth. Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. Counts not recoverable.]
Counts vs. frequencies
- Consider again the "ides of march" query.
  - Julius Caesar has 5 occurrences of "ides".
  - No other play has "ides".
  - "march" occurs in over a dozen plays.
  - All the plays contain "of".
- By this scoring measure, the top-scoring play is likely to be the one with the most occurrences of "of".

Digression: terminology
- WARNING: In a lot of IR literature, "frequency" is used to mean "count".
- Thus term frequency in IR literature means the number of occurrences in a doc,
- not divided by document length (which would actually make it a frequency).
- We will conform to this misnomer: in saying term frequency we mean the number of occurrences of a term in a document.
Term frequency tf
- Long docs are favored because they're more likely to contain query terms.
- Can fix this to some extent by normalizing for document length.
- But is raw tf the right measure?

Weighting term frequency: tf
- What is the relative importance of
  - 0 vs. 1 occurrence of a term in a doc
  - 1 vs. 2 occurrences
  - 2 vs. 3 occurrences
- Unclear: while it seems that more is better, a lot isn't proportionally better than a few.
  - Can just use raw tf.
  - Another option commonly used in practice (t = term, d = document):
    $wf_{t,d} = 0$ if $tf_{t,d} = 0$, and $wf_{t,d} = 1 + \log tf_{t,d}$ otherwise
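A small sketch of the log-scaled weight wf shows how it damps large counts. The slide does not state the base of the logarithm; base 10 is assumed here.

```python
import math

def wf(tf):
    """Log-scaled term frequency: 0 for absent terms, 1 + log10(tf) otherwise.

    The base-10 logarithm is an assumption; the source only writes "log".
    """
    if tf < 0:
        raise ValueError("term frequency cannot be negative")
    return 0.0 if tf == 0 else 1.0 + math.log10(tf)

for count in (0, 1, 2, 10, 1000):
    print(count, round(wf(count), 3))
```

Going from 1 to 1000 occurrences raises the weight only from 1 to 4, which matches the intuition that a lot is not proportionally better than a few.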
Score computation
- Score for a query q = sum over the terms t in q (several terms):
  $\text{score} = \sum_{t \in q} tf_{t,d}$
- [Note: 0 if no query terms in document]
- This score can be zone-combined.
- Can use wf instead of tf in the above.
- Still doesn't consider term scarcity in the collection ("ides" is rarer than "of").

Weighting should depend on the term overall
- Which of these tells you more about a doc?
  - 10 occurrences of "hernia"?
  - 10 occurrences of "the"?
- Would like to attenuate the weight of a common term. But what is "common"?
- Suggest looking at collection frequency (cf): the total number of occurrences of the term in the entire collection of documents.
Document frequency
- But document frequency (df) may be better: df = number of docs in the corpus containing the term.
- [Table: cf and df for the example words "alfa" and "insurance"; values not recoverable.]
- Document/collection frequency weighting is only possible in a known (static) collection, where the number of documents in the entire collection is fixed.
- So how do we make use of df?

tf x idf term weights
- The tf x idf measure combines:
  - term frequency (tf) or wf, some measure of term density in a doc
  - inverse document frequency (idf), a measure of the informativeness of a term: its rarity across the whole corpus
    - could just be based on the raw count of the number of documents the term occurs in ($idf_i = 1/df_i$)
    - but by far the most commonly used version is $idf_i = \log(n / df_i)$, where $n$ is the total number of documents
- See Kishore Papineni, NAACL 2, 2002 for theoretical justification.
Summary: tf x idf (or tf.idf)
- Assign a tf.idf weight to each term i in each document d:
  $w_{i,d} = tf_{i,d} \times \log(n / df_i)$
  - $tf_{i,d}$ = frequency of term i in document d
  - $n$ = total number of documents
  - $df_i$ = the number of documents that contain term i
- Increases with the number of occurrences within a doc.
- Increases with the rarity of the term across the whole corpus.
- What is the weight of a term that occurs in all of the docs?
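The weight formula can be checked numerically with a short sketch. The counts are hypothetical and the logarithm base (10) is an assumption, as the slide only writes "log".

```python
import math

def idf(n_docs, df):
    """Inverse document frequency log10(n / df); base 10 is an assumption."""
    if not 0 < df <= n_docs:
        raise ValueError("df must be between 1 and n_docs")
    return math.log10(n_docs / df)

def tf_idf(tf, df, n_docs):
    """Weight w = tf * log(n / df) for one term in one document."""
    return tf * idf(n_docs, df)

# Hypothetical collection of 1,000 documents.
rare = tf_idf(tf=3, df=10, n_docs=1000)        # term found in 10 docs
everywhere = tf_idf(tf=3, df=1000, n_docs=1000)  # term found in every doc
print(rare, everywhere)
```

The second call addresses the question posed on the slide: a term that occurs in all of the docs has n/df = 1, so its idf, and therefore its weight, is 0 regardless of tf.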
Real-valued term-document matrices
- Function (scaling) of the count of a word in a document:
  - Bag of words model
  - Each document is a vector in $\mathbb{R}^v$
  - Here log-scaled tf.idf
- Note that entries can be > 1!
- [Real-valued term-document matrix. Columns (plays): Antony and Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth. Rows (terms): Antony, Brutus, Caesar, Calpurnia, Cleopatra, mercy, worser. Values not recoverable.]

Documents as vectors
- Each doc j can now be viewed as a vector of wf x idf values, one component for each term.
- So we have a vector space:
  - terms are axes
  - docs live in this space
  - even with stemming, it may have 20,000+ dimensions
Why turn docs into vectors?
- First application: query-by-example.
  - Given a doc d, find others "like" it.
- Now that d is a vector, find vectors (docs) "near" it.

Intuition
- [Figure: documents d1 to d5 drawn as vectors in a space with term axes t1, t2, t3; angles θ and φ marked between document vectors.]
- Postulate: documents that are "close together" in the vector space talk about the same things.
Desiderata for proximity
- If d1 is near d2, then d2 is near d1.
- If d1 is near d2, and d2 is near d3, then d1 is not far from d3.
- No doc is closer to d than d itself.

First cut
- Idea: the distance between d1 and d2 is the length of the vector |d1 - d2| (Euclidean distance).
- Why is this not a great idea?
- We still haven't dealt with the issue of length normalization: short documents would be more similar to each other by virtue of length, not topic.
- However, we can implicitly normalize by looking at angles instead.
Cosine similarity
- The distance between vectors d1 and d2 is captured by the cosine of the angle θ between them.
- Note that this is similarity, not distance: there is no triangle inequality for similarity.
- [Figure: vectors d1 and d2 in the space of terms t1, t2, t3, with the angle θ between them.]
- Varies from 0 to 1 (term weights are non-negative).

Cosine similarity
- A vector can be normalized (given a length of 1) by dividing each of its components by its length; here we use the L2 norm:
  $\|x\|_2 = \sqrt{\sum_i x_i^2}$
- This maps vectors onto the unit sphere. Then
  $|\vec{d}_j| = \sqrt{\sum_{i=1}^{v} w_{i,j}^2} = 1$
- Longer documents don't get more weight.
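Normalized vectors
- For normalized vectors, the cosine is simply the dot product:
  $\cos(\vec{d}_j, \vec{d}_k) = \vec{d}_j \cdot \vec{d}_k$
- Varies from 0 to 1.

Example
- Docs: Austen's Sense and Sensibility (SaS), Pride and Prejudice (PaP); Bronte's Wuthering Heights (WH).
- [Two tables of tf weights for the terms affection, jealous and gossip in SaS, PaP and WH; values not recoverable.]
- cos(SaS, PaP) and cos(SaS, WH) are each a sum of three component products. Only fragments survive: .996 is the first factor in both sums, and the last factor is 0.0 for PaP and .254 for WH.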
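A minimal sketch of L2 normalization followed by a dot product is given here. The vectors are made-up two-dimensional examples, not the Austen/Bronte weights, which the transcription lost.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length under the L2 norm."""
    length = math.sqrt(sum(x * x for x in vec))
    if length == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / length for x in vec]

def cosine(u, v):
    """Cosine similarity: dot product of the two normalized vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

short_doc = [3.0, 4.0]
long_doc = [6.0, 8.0]      # same direction, twice the length
other_doc = [4.0, 3.0]
print(cosine(short_doc, long_doc), cosine(short_doc, other_doc))
```

`long_doc` is `short_doc` repeated twice, so the cosine between them is 1: after normalization, document length no longer affects the score.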
Euclidean distance between vectors:
  $|\vec{d}_j - \vec{d}_k| = \sqrt{\sum_{i=1}^{v} (d_{i,j} - d_{i,k})^2}$
- For normalized vectors, Euclidean distance gives the same proximity ordering as the cosine measure.

Queries in the vector space model
- Central idea: the query as a vector.
- We regard the query as a short document.
- We return the documents ranked by the closeness of their vectors to the query, also represented as a vector:
  $sim(\vec{d}_j, \vec{d}_q) = \frac{\vec{d}_j \cdot \vec{d}_q}{|\vec{d}_j| \, |\vec{d}_q|} = \frac{\sum_{i=1}^{v} w_{i,j} w_{i,q}}{\sqrt{\sum_{i=1}^{v} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{v} w_{i,q}^2}}$
- Note that $\vec{d}_q$ is very sparse!
- Varies from 0 to 1.
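The claim that Euclidean distance and cosine give the same ordering for normalized vectors follows from expanding the squared distance. The short derivation below assumes both vectors have unit length.

```latex
\begin{aligned}
|\vec{d}_j - \vec{d}_k|^2
  &= (\vec{d}_j - \vec{d}_k) \cdot (\vec{d}_j - \vec{d}_k) \\
  &= |\vec{d}_j|^2 + |\vec{d}_k|^2 - 2\,\vec{d}_j \cdot \vec{d}_k \\
  &= 2 - 2\cos(\vec{d}_j, \vec{d}_k)
  \qquad \text{since } |\vec{d}_j| = |\vec{d}_k| = 1 .
\end{aligned}
```

The distance is therefore a strictly decreasing function of the cosine, so ranking by smallest distance and ranking by largest cosine produce the same order.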
Dimensionality reduction
- What if we could take our vectors and "pack" them into fewer dimensions (say 50, ...) while preserving distances? (Well, almost.)
- Speeds up cosine computations.
Measures for Results
- All of the preceding criteria are measurable: we can quantify speed/size; we can make expressiveness precise.
- The key measure: user happiness. What is this?
  - Speed of response/size of index are factors.
  - But blindingly fast, useless answers won't make a user happy.
- Need a way of quantifying user happiness.

Measuring user happiness
- Issue: who is the user we are trying to make happy? Depends on the setting.
- Web engine: the user finds what they want and returns to the engine.
  - Can measure the rate of return users.
- eCommerce site: the user finds what they want and makes a purchase.
  - Is it the end-user, or the eCommerce site, whose happiness we measure?
  - Measure time to purchase, or fraction of searchers who become buyers?
Measuring user happiness
- Enterprise (company/govt/academic): care about "user productivity".
  - How much time do my users save when looking for information?
  - Many other criteria having to do with breadth of access, secure access, etc.

Happiness: elusive to measure
- Commonest proxy: relevance of search results.
- But how do you measure relevance?
- We will detail a methodology here, then examine its issues.
- Relevance measurement requires 3 elements:
  1. A benchmark document collection
  2. A benchmark suite of queries
  3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
- There is some work on more-than-binary assessments, but it is not the standard.
Evaluating an IR system
- Note: the information need is translated into a query.
- Relevance is assessed relative to the information need, not the query.
- E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
- Query: wine red white heart attack effective
- You evaluate whether the doc addresses the information need, not whether it has those words.

Unranked retrieval evaluation: Precision and Recall
- Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
- Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)

                   Relevant   Not Relevant
    Retrieved      tp         fp
    Not Retrieved  fn         tn

- Precision P = tp/(tp + fp)
- Recall R = tp/(tp + fn)
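Both measures can be computed from the set of retrieved doc IDs and the set of relevant doc IDs. The IDs below are hypothetical, and returning 0.0 for an empty set is a convention chosen for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query from two sets of doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: 4 docs retrieved, 6 docs relevant, 2 in common.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 5, 6, 7, 8})
print(p, r)
```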
Accuracy
- Given a query, an engine classifies each doc as Relevant or Irrelevant.
- Accuracy of an engine: the fraction of these classifications that is correct.
- Why is this not a very useful evaluation measure in IR? No results, (nearly) 100% accuracy: almost all docs are irrelevant to any given query, so an engine that returns nothing is almost always "right".

Precision/Recall
- You can get high recall (but low precision) by retrieving all docs for all queries!
- Recall is a non-decreasing function of the number of docs retrieved.
- In a good system, precision decreases as either the number of docs retrieved or recall increases.
  - A fact with strong empirical confirmation.
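A quick numeric sketch shows why accuracy misleads. The collection size and number of relevant docs are invented for illustration.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of Relevant/Irrelevant classifications that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical query: 1,000 docs in the collection, 10 of them relevant,
# and an engine that returns nothing at all.
tp, fp, fn, tn = 0, 0, 10, 990
acc = accuracy(tp, fp, fn, tn)
recall = tp / (tp + fn)
print(acc, recall)
```

The empty result list is 99% accurate yet has zero recall, which is why precision and recall are preferred for retrieval evaluation.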
Difficulties in using precision/recall
- Should average over large corpus/query ensembles.
- Need human relevance assessments.
  - People aren't reliable assessors.
- Assessments have to be binary.
  - Nuanced assessments?
- Heavily skewed by corpus/authorship.
  - Results may not translate from one domain to another.

A combined measure: F
- The combined measure that assesses this tradeoff is the F measure (weighted harmonic mean):
  $F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}$
- People usually use the balanced F1 measure, i.e., with β = 1 or α = 1/2.
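The two forms of the F measure can be compared numerically. The sketch assumes the standard correspondence α = 1/(β² + 1) between the two parameters, which the slide implies (β = 1 matches α = 1/2) but does not state in general; the precision and recall values are arbitrary.

```python
def f_beta(precision, recall, beta=1.0):
    """F measure in its beta form: (beta^2 + 1) P R / (beta^2 P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

def f_alpha(precision, recall, alpha=0.5):
    """F measure as a weighted harmonic mean; needs nonzero P and R."""
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

P, R = 0.5, 0.25
f1 = f_beta(P, R)                    # balanced: beta = 1
f2 = f_beta(P, R, beta=2.0)          # weights recall more heavily
f2_alpha = f_alpha(P, R, alpha=1 / (2.0 ** 2 + 1))
print(f1, f2, f2_alpha)
```

With β = 2 the score moves toward the (lower) recall value, showing how β > 1 emphasizes recall.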
Evaluating ranked results
- Evaluation of ranked results: the system can return any number of results.
- By taking various numbers of the top returned documents (levels of recall), the evaluator can produce a precision-recall curve.

A precision-recall curve
- [Figure: a precision-recall curve; axes: Precision vs. Recall.]
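The points of such a curve can be computed by cutting the ranked list at every depth k. The relevance judgments below are hypothetical, and `pr_points` is an illustrative helper rather than a standard function.

```python
def pr_points(ranked_relevance, total_relevant):
    """Return (recall, precision) after each of the top-k results.

    ranked_relevance: booleans for the ranked results, best first.
    total_relevant:   number of relevant docs in the whole collection.
    """
    points = []
    hits = 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        hits += int(is_relevant)
        points.append((hits / total_relevant, hits / k))
    return points

# Hypothetical ranking of 5 results; 4 relevant docs exist in the collection.
points = pr_points([True, False, True, True, False], total_relevant=4)
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

Recall never decreases as k grows, while precision rises and falls depending on whether each newly included result is relevant.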
Dealing with Text Databases
Dealing with Text Databases Unstructured data Boolean queries Sparse matrix representation Inverted index Counts vs. frequencies Term frequency tf x idf term weights Documents as vectors Cosine similarity
More informationCS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 4 15 Oct 2002 Recap of last time Index size Index construction techniques Dynamic indices Real world considerations 2 Back of the envelope
More informationQuiz #2 TEXT SIMILARITY. Class feedback. Class presentations 3/21/11
Quiz #2 Out of 30 poits High: 28.75 Ave: 23 Will drop lowest quiz I do ot grade based o absolutes TEXT SIMILARITY David Kauchak CS159 Sprig 2011 Class feedback Class presetatios Thaks! Specific commets:
More informationCS276A Practice Problem Set 1 Solutions
CS76A Practice Problem Set Solutios Problem. (i) (ii) 8 (iii) 6 Compute the gamma-codes for the followig itegers: (i) (ii) 8 (iii) 6 Problem. For this problem, we will be dealig with a collectio of millio
More informationTerm Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan
Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;
More informationTerm Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze
Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationVector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval Informatics 141 / CS 121 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections
More informationScoring, Term Weighting and the Vector Space
Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J
More informationStatistics 511 Additional Materials
Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2014 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationVector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Collection Frequency, cf Define: The total
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationScoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology
Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationUnderstanding Samples
1 Will Moroe CS 109 Samplig ad Bootstrappig Lecture Notes #17 August 2, 2017 Based o a hadout by Chris Piech I this chapter we are goig to talk about statistics calculated o samples from a populatio. We
More informationProbability, Expectation Value and Uncertainty
Chapter 1 Probability, Expectatio Value ad Ucertaity We have see that the physically observable properties of a quatum system are represeted by Hermitea operators (also referred to as observables ) such
More informationVector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics Querying
More informationCHAPTER 10 INFINITE SEQUENCES AND SERIES
CHAPTER 10 INFINITE SEQUENCES AND SERIES 10.1 Sequeces 10.2 Ifiite Series 10.3 The Itegral Tests 10.4 Compariso Tests 10.5 The Ratio ad Root Tests 10.6 Alteratig Series: Absolute ad Coditioal Covergece
More informationTDDD43. Information Retrieval. Fang Wei-Kleiner. ADIT/IDA Linköping University. Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1
TDDD43 Information Retrieval Fang Wei-Kleiner ADIT/IDA Linköping University Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1 Outline 1. Introduction 2. Inverted index 3. Ranked Retrieval tf-idf
More informationInforma(on Retrieval
Introduc*on to Informa(on Retrieval Lecture 6-2: The Vector Space Model Outline The vector space model 2 Binary incidence matrix Anthony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth...
More informationInforma(on Retrieval
Introduc*on to Informa(on Retrieval Lecture 6-2: The Vector Space Model Binary incidence matrix Anthony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth... ANTHONY BRUTUS CAESAR CALPURNIA
More informationBoolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).
Boolean and Vector Space Retrieval Models 2013 CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK). 1 Table of Content Boolean model Statistical vector space model Retrieval
More informationQuantum Information & Quantum Computation
CS9A, Sprig 5: Quatum Iformatio & Quatum Computatio Wim va Dam Egieerig, Room 59 vadam@cs http://www.cs.ucsb.edu/~vadam/teachig/cs9/ Admiistrivia Do the exercises. Aswers will be posted at the ed of the
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics,
More informationLecture 2: Monte Carlo Simulation
STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?
More informationt distribution [34] : used to test a mean against an hypothesized value (H 0 : µ = µ 0 ) or the difference
EXST30 Backgroud material Page From the textbook The Statistical Sleuth Mea [0]: I your text the word mea deotes a populatio mea (µ) while the work average deotes a sample average ( ). Variace [0]: The
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More information(VII.A) Review of Orthogonality
VII.A Review of Orthogoality At the begiig of our study of liear trasformatios i we briefly discussed projectios, rotatios ad projectios. I III.A, projectios were treated i the abstract ad without regard
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationAverage-Case Analysis of QuickSort
Average-Case Aalysis of QuickSort Comp 363 Fall Semester 003 October 3, 003 The purpose of this documet is to itroduce the idea of usig recurrece relatios to do average-case aalysis. The average-case ruig
More informationInfinite Sequences and Series
Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet
More information5.1 Review of Singular Value Decomposition (SVD)
MGMT 69000: Topics i High-dimesioal Data Aalysis Falll 06 Lecture 5: Spectral Clusterig: Overview (cotd) ad Aalysis Lecturer: Jiamig Xu Scribe: Adarsh Barik, Taotao He, September 3, 06 Outlie Review of
More informationConfidence Intervals for the Population Proportion p
Cofidece Itervals for the Populatio Proportio p The cocept of cofidece itervals for the populatio proportio p is the same as the oe for, the samplig distributio of the mea, x. The structure is idetical:
More information6 Integers Modulo n. integer k can be written as k = qn + r, with q,r, 0 r b. So any integer.
6 Itegers Modulo I Example 2.3(e), we have defied the cogruece of two itegers a,b with respect to a modulus. Let us recall that a b (mod ) meas a b. We have proved that cogruece is a equivalece relatio
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationBig Picture. 5. Data, Estimates, and Models: quantifying the accuracy of estimates.
5. Data, Estimates, ad Models: quatifyig the accuracy of estimates. 5. Estimatig a Normal Mea 5.2 The Distributio of the Normal Sample Mea 5.3 Normal data, cofidece iterval for, kow 5.4 Normal data, cofidece
More informationInformation Retrieval Using Boolean Model SEEM5680
Information Retrieval Using Boolean Model SEEM5680 1 Unstructured (text) vs. structured (database) data in 1996 2 2 Unstructured (text) vs. structured (database) data in 2009 3 3 The problem of IR Goal
More informationDisjoint set (Union-Find)
CS124 Lecture 7 Fall 2018 Disjoit set (Uio-Fid) For Kruskal s algorithm for the miimum spaig tree problem, we foud that we eeded a data structure for maitaiig a collectio of disjoit sets. That is, we eed
More information7.1 Convergence of sequences of random variables
Chapter 7 Limit Theorems Throughout this sectio we will assume a probability space (, F, P), i which is defied a ifiite sequece of radom variables (X ) ad a radom variable X. The fact that for every ifiite
More informationInformation Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)
Information Retrieval and Topic Models Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 11
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple
More informationShannon s noiseless coding theorem
18.310 lecture otes May 4, 2015 Shao s oiseless codig theorem Lecturer: Michel Goemas I these otes we discuss Shao s oiseless codig theorem, which is oe of the foudig results of the field of iformatio
More informationTopic 10: Introduction to Estimation
Topic 0: Itroductio to Estimatio Jue, 0 Itroductio I the simplest possible terms, the goal of estimatio theory is to aswer the questio: What is that umber? What is the legth, the reactio rate, the fractio
More informationBUSINESS STATISTICS (PART-9) AVERAGE OR MEASURES OF CENTRAL TENDENCY: THE GEOMETRIC AND HARMONIC MEANS
BUSINESS STATISTICS (PART-9) AVERAGE OR MEASURES OF CENTRAL TENDENCY: THE GEOMETRIC AND HARMONIC MEANS. INTRODUCTION We have so far discussed three measures of cetral tedecy, viz. The Arithmetic Mea, Media
More informationAlgebra of Least Squares
October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal
More informationECE 901 Lecture 12: Complexity Regularization and the Squared Loss
ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality
More informationApproximations and more PMFs and PDFs
Approximatios ad more PMFs ad PDFs Saad Meimeh 1 Approximatio of biomial with Poisso Cosider the biomial distributio ( b(k,,p = p k (1 p k, k λ: k Assume that is large, ad p is small, but p λ at the limit.
More informationMeasures of Spread: Standard Deviation
Measures of Spread: Stadard Deviatio So far i our study of umerical measures used to describe data sets, we have focused o the mea ad the media. These measures of ceter tell us the most typical value of
More informationStatistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.
Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized
More informationSection 1.1. Calculus: Areas And Tangents. Difference Equations to Differential Equations
Differece Equatios to Differetial Equatios Sectio. Calculus: Areas Ad Tagets The study of calculus begis with questios about chage. What happes to the velocity of a swigig pedulum as its positio chages?
More informationMath 451: Euclidean and Non-Euclidean Geometry MWF 3pm, Gasson 204 Homework 3 Solutions
Math 451: Euclidea ad No-Euclidea Geometry MWF 3pm, Gasso 204 Homework 3 Solutios Exercises from 1.4 ad 1.5 of the otes: 4.3, 4.10, 4.12, 4.14, 4.15, 5.3, 5.4, 5.5 Exercise 4.3. Explai why Hp, q) = {x
More information18.S096: Homework Problem Set 1 (revised)
8.S096: Homework Problem Set (revised) Topics i Mathematics of Data Sciece (Fall 05) Afoso S. Badeira Due o October 6, 05 Exteded to: October 8, 05 This homework problem set is due o October 6, at the
More informationMath 216A Notes, Week 5
Math 6A Notes, Week 5 Scribe: Ayastassia Sebolt Disclaimer: These otes are ot early as polished (ad quite possibly ot early as correct) as a published paper. Please use them at your ow risk.. Thresholds
More informationL = n i, i=1. dp p n 1
Exchageable sequeces ad probabilities for probabilities 1996; modified 98 5 21 to add material o mutual iformatio; modified 98 7 21 to add Heath-Sudderth proof of de Fietti represetatio; modified 99 11
More informationLecture 2: April 3, 2013
TTIC/CMSC 350 Mathematical Toolkit Sprig 203 Madhur Tulsiai Lecture 2: April 3, 203 Scribe: Shubhedu Trivedi Coi tosses cotiued We retur to the coi tossig example from the last lecture agai: Example. Give,
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus
More informationMath 113 Exam 3 Practice