Boolean and Vector Space Retrieval Models

CS 290N. Some slides are from R. Mooney (UTexas), J. Ghosh (UT ECE), and D. Lee (USTHK).


Many slides in this section are adapted from Prof. Joydeep Ghosh (UT ECE), who in turn adapted them from Prof. Dik Lee (Univ. of Science and Tech, Hong Kong).

Retrieval Models

A retrieval model specifies the details of:
- Document representation
- Query representation
- Retrieval function

It determines a notion of relevance, which can be binary or continuous (i.e. ranked retrieval).

Common Preprocessing Steps

- Strip unwanted characters/markup (e.g. HTML tags, punctuation, numbers, etc.).
- Break into tokens (keywords) on whitespace.
- Stem tokens to "root" words: computational → comput.
- Remove common stopwords (e.g. a, the, it, etc.).
- Detect common phrases (possibly using a domain-specific dictionary).
- Build inverted index (keyword → list of docs containing it).

Boolean Model

- A document is represented as a set of keywords.
- Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, including the use of brackets to indicate scope:
  [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
- Output: a document is relevant or not. No partial matches or ranking.

Boolean Retrieval Model

A popular retrieval model because:
- Easy to understand for simple queries.
- Clean formalism.
- Boolean models can be extended to include ranking.
- Reasonably efficient implementations are possible for normal queries.

Boolean Models: Problems

- Very rigid: AND means "all"; OR means "any".
- Difficult to express complex user requests.
- Difficult to control the number of documents retrieved: all matched documents will be returned.
- Difficult to rank output: all matched documents logically satisfy the query.
- Difficult to perform relevance feedback: if a document is identified by the user as relevant or irrelevant, how should the query be modified?
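To make the preprocessing steps and Boolean matching above concrete, here is a minimal Python sketch (not from the slides: the documents and stopword list are invented for illustration, and stemming is omitted for brevity). It builds an inverted index and evaluates the bracketed example query with set operations:

    # Minimal sketch of preprocessing plus Boolean retrieval over an
    # inverted index. Documents and stopwords are made up; no stemming.

    def preprocess(text):
        """Strip punctuation, lowercase, tokenize, drop stopwords."""
        stopwords = {"a", "the", "it", "in", "for", "and", "of"}
        cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
        return [t for t in cleaned.split() if t not in stopwords]

    def build_inverted_index(docs):
        """Map each keyword to the set of ids of docs containing it."""
        index = {}
        for doc_id, text in docs.items():
            for token in preprocess(text):
                index.setdefault(token, set()).add(doc_id)
        return index

    docs = {
        1: "Hilton hotel in Rio, Brazil",
        2: "hotel listings for Hilo, Hawaii",
        3: "surfing in Hawaii",
    }
    index = build_inverted_index(docs)
    all_ids = set(docs)

    def posting(term):
        return index.get(term, set())

    # The slide's query: [[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton
    # AND is set intersection, OR is union, NOT is complement.
    hits = (((posting("rio") & posting("brazil"))
             | (posting("hilo") & posting("hawaii")))
            & posting("hotel")
            & (all_ids - posting("hilton")))
    print(hits)  # {2}: each document is simply relevant or not; no ranking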

Statistical Models

- A document is typically represented by a bag of words (unordered words with frequencies).
- Bag = set that allows multiple occurrences of the same element.
- The user specifies a set of desired terms with optional weights:
  - Weighted query terms: Q = <database 0.5; text 0.8; information 0.2>
  - Unweighted query terms: Q = <database; text; information>
  - No Boolean conditions are specified in the query.

Statistical Retrieval

- Retrieval is based on similarity between query and documents.
- Output documents are ranked according to similarity to the query.
- Similarity is based on occurrence frequencies of keywords in query and document.
- Automatic relevance feedback can be supported:
  - Relevant documents "added" to the query.
  - Irrelevant documents "subtracted" from the query.

Issues for the Vector Space Model

- How to determine important words in a document? Word sense? Word n-grams (and phrases, idioms, ...) → terms.
- How to determine the degree of importance of a term within a document and within the entire collection?
- How to determine the degree of similarity between a document and the query?
- In the case of the web, what is a collection, and what are the effects of links, formatting information, etc.?

The Vector-Space Model

- Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
- These "orthogonal" terms form a vector space: dimension = t = |vocabulary|.
- Each term i in a document or query j is given a real-valued weight w_ij.
- Both documents and queries are expressed as t-dimensional vectors:
  d_j = (w_1j, w_2j, ..., w_tj)

Graphic Representation

Example:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + 2T3

(Figure: D1, D2, and Q plotted as vectors in the three-dimensional term space T1, T2, T3.)

- Is D1 or D2 more similar to Q?
- How to measure the degree of similarity? Distance? Angle? Projection?

Document Collection

- A collection of n documents can be represented in the vector space model by a term-document matrix.
- An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term has no significance in the document, or simply doesn't exist in the document.

         T1    T2   ...   Tt
  D1    w11   w21  ...   wt1
  D2    w12   w22  ...   wt2
  :      :     :          :
  Dn    w1n   w2n  ...   wtn
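As a small illustration of the term-document matrix (a sketch with an invented toy corpus, using raw term frequencies f_ij where a real system would use the tf-idf weights introduced next):

    from collections import Counter

    # Build a term-document matrix of raw term frequencies from a toy
    # corpus. Each row is a document vector over the vocabulary; a zero
    # entry means the term does not occur in that document.
    docs = {
        "D1": "database text database information database text text",
        "D2": "text information retrieval",
    }
    vocabulary = sorted({t for text in docs.values() for t in text.split()})
    for name, text in docs.items():
        counts = Counter(text.split())
        row = [counts[term] for term in vocabulary]
        print(name, dict(zip(vocabulary, row)))
    # D1 {'database': 3, 'information': 1, 'retrieval': 0, 'text': 3}
    # D2 {'database': 0, 'information': 1, 'retrieval': 1, 'text': 1}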

Term Weights: Term Frequency

- More frequent terms in a document are more important, i.e. more indicative of the topic:
  f_ij = frequency of term i in document j
- May want to normalize term frequency (tf) across the entire corpus:
  tf_ij = f_ij / max{f_ij}

Term Weights: Inverse Document Frequency

- Terms that appear in many different documents are less indicative of the overall topic:
  df_i  = document frequency of term i = number of documents containing term i
  idf_i = inverse document frequency of term i = log(N / df_i), where N is the total number of documents
- An indication of a term's discrimination power.
- The log is used to dampen the effect relative to tf.

TF-IDF Weighting

- A typical combined term importance indicator is tf-idf weighting:
  w_ij = tf_ij * idf_i = tf_ij * log(N / df_i)
- A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
- Many other ways of determining term weights have been proposed.
- Experimentally, tf-idf has been found to work well.

Computing TF-IDF: An Example

Given a document containing terms with frequencies A(3), B(2), C(1), and a collection of 10,000 documents in which the document frequencies of these terms are A(50), B(1300), C(250), then:

  A: tf = 3/3; idf = log(10000/50)   = 5.3; tf-idf = 5.3
  B: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
  C: tf = 1/3; idf = log(10000/250)  = 3.7; tf-idf = 1.2

Query Vector

- The query vector is typically treated as a document and also tf-idf weighted.
- An alternative is for the user to supply weights for the given query terms.

Similarity Measure

- A similarity measure is a function that computes the degree of similarity between two vectors.
- Using a similarity measure between the query and each document:
  - It is possible to rank the retrieved documents in the order of presumed relevance.
  - It is possible to enforce a certain threshold so that the size of the retrieved set can be controlled.
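The worked example can be reproduced in a few lines. This sketch assumes the natural logarithm, which is what matches the slide's idf values; note that B's tf-idf comes out 1.4 when computed directly, while the slide reports 1.3 by rounding the idf to 2.0 before multiplying:

    import math

    # Reproduce the slide's tf-idf example: one document with term
    # frequencies A(3), B(2), C(1); N = 10,000 documents in the
    # collection; document frequencies A(50), B(1300), C(250).
    N = 10_000
    f = {"A": 3, "B": 2, "C": 1}        # f_ij: frequency in this document
    df = {"A": 50, "B": 1300, "C": 250}  # df_i: docs containing term i

    max_f = max(f.values())
    for term in f:
        tf = f[term] / max_f              # tf_ij = f_ij / max{f_ij}
        idf = math.log(N / df[term])      # idf_i = log(N / df_i)
        print(term, round(idf, 1), round(tf * idf, 1))
    # A 5.3 5.3
    # B 2.0 1.4   (slide reports 1.3, rounding idf to 2.0 first)
    # C 3.7 1.2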

Similarity Measure: Inner Product

- Similarity between the vectors for document d_j and query q can be computed as the vector inner product:
  sim(d_j, q) = d_j · q = Σ_{i=1..t} (w_ij * w_iq)
  where w_ij is the weight of term i in document j and w_iq is the weight of term i in the query.
- For binary vectors, the inner product is the number of matched query terms in the document (the size of the intersection).
- For weighted term vectors, it is the sum of the products of the weights of the matched terms.

Properties of Inner Product

- The inner product is unbounded.
- It favors long documents with a large number of unique terms.
- It measures how many terms matched, but not how many terms are not matched.

Inner Product: Examples

Binary (size of vector = size of vocabulary = 7; a 0 means the corresponding term is not found in the document or query; terms: retrieval, database, architecture, computer, text, management, information):
  D = (1, 1, 1, 0, 1, 1, 0)
  Q = (1, 0, 1, 0, 0, 1, 1)
  sim(D, Q) = 3

Weighted:
  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + 1T3
  Q  = 0T1 + 0T2 + 2T3
  sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
  sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2

Cosine Similarity Measure

- Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.

  CosSim(d_j, q) = (d_j · q) / (|d_j| * |q|)
                 = Σ_{i=1..t} (w_ij * w_iq) / (sqrt(Σ_{i=1..t} w_ij^2) * sqrt(Σ_{i=1..t} w_iq^2))

(Figure: Q, D1, and D2 drawn as vectors, with angle θ1 between D1 and Q and angle θ2 between D2 and Q.)

  D1 = 2T1 + 3T2 + 5T3;  CosSim(D1, Q) = 10 / sqrt((4+9+25)(0+0+4)) = 0.81
  D2 = 3T1 + 7T2 + 1T3;  CosSim(D2, Q) = 2 / sqrt((9+49+1)(0+0+4)) = 0.13

- D1 is 6 times better than D2 using cosine similarity, but only 5 times better using the inner product.

Naïve Implementation

- Convert all documents in collection D to tf-idf weighted vectors d_j over the keyword vocabulary V.
- Convert the query to a tf-idf weighted vector q.
- For each d_j in D, compute the score s_j = CosSim(d_j, q).
- Sort documents by decreasing score and present the top-ranked documents to the user.
- Time complexity: O(|V| * |D|). Bad for large V and D! For |V| = 10,000 and |D| = 100,000, |V| * |D| = 1,000,000,000.

Comments on Vector Space Models

- Simple, mathematically based approach.
- Considers both local (tf) and global (idf) word occurrence frequencies.
- Provides partial matching and ranked results.
- Tends to work quite well in practice despite obvious weaknesses.
- Allows efficient implementation for large document collections.
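Both similarity measures and the naïve ranking loop fit in a short sketch. This uses the slide's weighted example vectors directly; in practice the vectors would be tf-idf weighted as described above:

    import math

    def inner_product(d, q):
        # sim(d, q): sum over i of w_id * w_iq
        return sum(wd * wq for wd, wq in zip(d, q))

    def cos_sim(d, q):
        # Inner product normalized by the two vector lengths.
        norm_d = math.sqrt(sum(w * w for w in d))
        norm_q = math.sqrt(sum(w * w for w in q))
        return inner_product(d, q) / (norm_d * norm_q)

    D1, D2, Q = [2, 3, 5], [3, 7, 1], [0, 0, 2]
    print(inner_product(D1, Q), inner_product(D2, Q))          # 10 2
    print(round(cos_sim(D1, Q), 2), round(cos_sim(D2, Q), 2))  # 0.81 0.13

    # Naive retrieval: score every document against the query and sort
    # by decreasing score, which is O(|V| * |D|) over the collection.
    docs = {"D1": D1, "D2": D2}
    ranked = sorted(docs, key=lambda name: cos_sim(docs[name], Q),
                    reverse=True)
    print(ranked)  # ['D1', 'D2']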

Problems with the Vector Space Model

- Missing semantic information (e.g. word sense).
- Missing syntactic information (e.g. phrase structure, word order, proximity information).
- Assumption of term independence (e.g. ignores synonymy).
- Lacks the control of a Boolean model (e.g., requiring a term to appear in a document): given a two-term query "A B", it may prefer a document containing A frequently but not B over a document that contains both A and B, but both less frequently.