Computing Relevance, Similarity: The Vector Space Model

Based on Larson and Hearst's slides at UC-Berkeley (http://www.sims.berkeley.edu/courses/is202/f00/).

Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
  - A vector is like an array of floating point numbers
  - Has direction and magnitude
  - Each vector holds a place for every term in the collection
  - Therefore, most vectors are sparse
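To make the sparse bag-of-words representation concrete, here is a minimal sketch in Python (not part of the original slides; the toy corpus and names are invented for illustration).

```python
from collections import Counter

# Toy corpus: each document is a bag of words (order is ignored).
docs = {
    "A": "nova galaxy heat nova nova galaxy".split(),
    "B": "film role diet fur film".split(),
}

# One vector slot for every term that occurs anywhere in the collection.
vocab = sorted({term for words in docs.values() for term in words})

def to_vector(words, vocab):
    """Term-count vector over the shared vocabulary; most entries are 0."""
    counts = Counter(words)
    return [counts.get(term, 0) for term in vocab]

for doc_id, words in docs.items():
    print(doc_id, to_vector(words, vocab))
```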

Document Vectors: One location for each word.

Each document A through I gets one vector position per term in the collection: nova, galaxy, heat, h'wood, film, role, diet, fur. For example, "nova" occurs 10 times in text A, "galaxy" occurs 5 times in text A, and "heat" occurs 3 times in text A; a blank position means 0 occurrences. (The slide shows the full table of term counts, with document ids A-I as rows and the terms as columns.)

We Can Plot the Vectors
- The slide plots documents along the axes "star" and "diet": a doc about astronomy, a doc about movie stars, and a doc about mammal behavior.
- Assumption: Documents that are close in space are similar.

Vector Space Model
- Documents are represented as vectors in term space
  - Terms are usually stems
  - Documents represented by binary vectors of terms
- Queries represented the same as documents
- A vector distance measure between the query and documents is used to rank retrieved documents
  - Query and document similarity is based on length and direction of their vectors
  - Vector operations to capture boolean query conditions
  - Terms in a vector can be weighted in many ways

Vector Space Documents and Queries
- The slide lists example documents D1-D10 as vectors over terms t1, t2, t3, together with the retrieval status value RSV = Q.Di for each, and plots them with the Boolean term combinations. Q is a query, also represented as a vector.

Assigning Weights to Terms
- Binary weights
- Raw term frequency
- tf x idf
  - Recall the Zipf distribution
  - Want to weight terms highly if they are frequent in relevant documents BUT infrequent in the collection as a whole

Binary Weights
- Only the presence (1) or absence (0) of a term is included in the vector. (The slide shows documents D1-D10 as 0/1 vectors over terms t1, t2, t3.)

Raw Term Weights
- The frequency of occurrence for the term in each document is included in the vector. (The slide shows the same documents with raw term counts in place of the 0/1 entries.)

TF x IDF Weights
- tf x idf measure: Term Frequency (tf) x Inverse Document Frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Goal: Assign a tf * idf weight to each term in each document

TF x IDF Calculation

$w_{ik} = tf_{ik} \cdot \log(N / n_k)$

where
- $T_k$ = term k in document $D_i$
- $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
- $idf_k$ = inverse document frequency of term $T_k$ in C
- $N$ = total number of documents in the collection C
- $n_k$ = the number of documents in C that contain $T_k$
- $idf_k = \log(N / n_k)$
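A minimal sketch of this calculation (mine, not the slides'); the toy collection and log base 10 are assumptions chosen to match the idf examples on the next slide.

```python
import math

# Toy collection (assumed for illustration): document id -> bag of terms.
docs = {
    "D1": ["nova", "nova", "galaxy", "heat"],
    "D2": ["galaxy", "film", "role"],
    "D3": ["film", "role", "diet", "fur"],
}
N = len(docs)  # total number of documents in the collection

# n_k: number of documents that contain term k
doc_freq = {}
for terms in docs.values():
    for term in set(terms):
        doc_freq[term] = doc_freq.get(term, 0) + 1

def tfidf(doc_terms, term):
    """w_ik = tf_ik * log10(N / n_k), as in the slide's formula."""
    tf = doc_terms.count(term)
    return tf * math.log10(N / doc_freq[term])

print(tfidf(docs["D1"], "nova"))    # rare term, appears only in D1 -> higher weight
print(tfidf(docs["D1"], "galaxy"))  # appears in 2 of 3 documents -> lower weight
```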

Inverse Document Frequency
- IDF provides high values for rare words and low values for common words.

For a collection of 10000 documents:
- $\log(10000 / 10000) = 0$
- $\log(10000 / 5000) = 0.301$
- $\log(10000 / 20) = 2.698$
- $\log(10000 / 1) = 4$

TF x IDF Normalization
- Normalize the term weights (so longer documents are not unfairly given more weight)
  - Usually means forcing all values to fall within a certain range, typically between 0 and 1, inclusive.

$w_{ik} = \dfrac{tf_{ik} \, \log(N / n_k)}{\sqrt{\sum_{k=1}^{t} (tf_{ik})^2 \, [\log(N / n_k)]^2}}$
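A small sketch of these two pieces (assumed helper names, not from the slides): the idf values quoted above for a 10,000-document collection, and normalization of a tf*idf weight vector.

```python
import math

def idf(N, n_k):
    """Inverse document frequency, log base 10 as in the examples above."""
    return math.log10(N / n_k)

# IDF for a collection of 10000 documents
for n_k in (10000, 5000, 20, 1):
    print(n_k, round(idf(10000, n_k), 3))  # 0.0, 0.301, 2.699, 4.0

def normalized_weights(tf_vector, df_vector, N):
    """Normalize tf*idf weights so each falls between 0 and 1."""
    raw = [tf * idf(N, df) for tf, df in zip(tf_vector, df_vector)]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

print(normalized_weights([3, 1, 0], [20, 5000, 10000], 10000))
```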

Pair-wise Document Similarity

How to compute document similarity? With each document represented as a term-weight vector $D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$, the (unnormalized) similarity is the inner product

$sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \, w_{jk}$

Example term counts over nova, galaxy, heat, h'wood, film, role, diet, fur (blank = 0):
- A: nova 1, galaxy 3
- B: nova 5, galaxy 2
- C: h'wood 2, film 1, role 5
- D: h'wood 4, film 1

sim(A, B) = (1 x 5) + (3 x 2) = 11
sim(A, C) = 0
sim(A, D) = 0
sim(B, C) = 0
sim(B, D) = 0
sim(C, D) = (2 x 4) + (1 x 1) = 9
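A short sketch (my code, not the slides') that reproduces these inner-product similarities from the term counts listed above.

```python
# Term-count vectors for the four example documents (missing term = 0).
docs = {
    "A": {"nova": 1, "galaxy": 3},
    "B": {"nova": 5, "galaxy": 2},
    "C": {"h'wood": 2, "film": 1, "role": 5},
    "D": {"h'wood": 4, "film": 1},
}

def sim(d1, d2):
    """Unnormalized similarity: inner product over shared terms."""
    return sum(w * d2.get(term, 0) for term, w in d1.items())

print(sim(docs["A"], docs["B"]))  # (1*5) + (3*2) = 11
print(sim(docs["A"], docs["C"]))  # 0, no shared terms
print(sim(docs["C"], docs["D"]))  # (2*4) + (1*1) = 9
```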

Pair-wise Document Similarity (cosine normalization)

unnormalized: $sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \, w_{jk}$

cosine normalized: $sim(D_i, D_j) = \dfrac{\sum_{k=1}^{t} w_{ik} \, w_{jk}}{\sqrt{\sum_{k=1}^{t} (w_{ik})^2} \cdot \sqrt{\sum_{k=1}^{t} (w_{jk})^2}}$

Vector Space Relevance Measure

With $D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$ and $Q = (q_1, q_2, \ldots, q_t)$, where $d_{ij} = 0$ if a term is absent:

- if term weights are normalized: $sim(Q, D_i) = \sum_{j=1}^{t} q_j \, d_{ij}$
- otherwise, normalize in the similarity comparison: $sim(Q, D_i) = \dfrac{\sum_{j=1}^{t} q_j \, d_{ij}}{\sqrt{\sum_{j=1}^{t} (q_j)^2} \cdot \sqrt{\sum_{j=1}^{t} (d_{ij})^2}}$

Computing Relevance Scores

Say we have query vector Q = (0.4, 0.8) and document D2 = (0.2, 0.7). What does their similarity comparison yield?

$sim(Q, D_2) = \dfrac{(0.4 \times 0.2) + (0.8 \times 0.7)}{\sqrt{[(0.4)^2 + (0.8)^2] \times [(0.2)^2 + (0.7)^2]}} = \dfrac{0.64}{\sqrt{0.42}} = 0.98$

Vector Space with Term Weights and Cosine Matching

The slide plots Q = (0.4, 0.8), D1 = (0.8, 0.3) and D2 = (0.2, 0.7) in the Term A / Term B plane, with an angle between the query and each document. In general, for $D_i = (d_{i1}, d_{i2}, \ldots, d_{it})$ and $Q = (q_1, q_2, \ldots, q_t)$:

$sim(Q, D_i) = \dfrac{\sum_{j=1}^{t} q_j \, d_{ij}}{\sqrt{\sum_{j=1}^{t} (q_j)^2 \cdot \sum_{j=1}^{t} (d_{ij})^2}}$

$sim(Q, D_2) = \dfrac{0.64}{\sqrt{0.42}} = 0.98 \qquad sim(Q, D_1) = \dfrac{0.56}{\sqrt{0.58}} = 0.74$
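A quick sketch (not from the slides) that recomputes these cosine scores.

```python
import math

def cosine_sim(q, d):
    """Cosine similarity between a query vector and a document vector."""
    dot = sum(qj * dj for qj, dj in zip(q, d))
    norm = math.sqrt(sum(qj * qj for qj in q)) * math.sqrt(sum(dj * dj for dj in d))
    return dot / norm

Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

print(round(cosine_sim(Q, D2), 2))  # 0.98
print(round(cosine_sim(Q, D1), 2))  # 0.73 (the slide's 0.74 comes from rounding the denominator to sqrt(0.58))
```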

Similarity Measures

- Simple matching (coordination level match): $|Q \cap D|$
- Dice's Coefficient: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$
- Jaccard's Coefficient: $\dfrac{|Q \cap D|}{|Q \cup D|}$
- Cosine Coefficient: $\dfrac{|Q \cap D|}{|Q|^{1/2}\,|D|^{1/2}}$
- Overlap Coefficient: $\dfrac{|Q \cap D|}{\min(|Q|, |D|)}$

(A short code sketch of these coefficients appears below.)

Text Clustering
- Finds overall similarities among groups of documents
- Finds overall similarities among groups of tokens
- Picks out some themes, ignores others
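A compact sketch of the similarity coefficients above (my code, not the slides'), treating Q and D as sets of terms.

```python
def simple_match(Q, D): return len(Q & D)
def dice(Q, D):         return 2 * len(Q & D) / (len(Q) + len(D))
def jaccard(Q, D):      return len(Q & D) / len(Q | D)
def cosine(Q, D):       return len(Q & D) / (len(Q) ** 0.5 * len(D) ** 0.5)
def overlap(Q, D):      return len(Q & D) / min(len(Q), len(D))

# Example query and document term sets (invented for illustration).
Q = {"information", "retrieval"}
D = {"information", "retrieval", "systems", "evaluation"}
for measure in (simple_match, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(Q, D), 3))
```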

Text Clustering

Clustering is "The art of finding groups in data." -- Kaufmann and Rousseeuw. (The slide shows a scatter of documents over the axes Term 1 and Term 2, grouped into clusters.)

Problems with Vector Space
- There is no real theoretical basis for the assumption of a term space
  - It is more for visualization than having any real basis
  - Most similarity measures work about the same
- Terms are not really orthogonal dimensions
  - Terms are not independent of all other terms; remember our discussion of correlated terms in text

Probabilistic Models
- A rigorous formal model that attempts to predict the probability that a given document will be relevant to a given query
- Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
- Relies on accurate estimates of probabilities

Probability Ranking Principle

"If a reference retrieval system's response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data." -- Stephen E. Robertson, J. Documentation, 1977

Iterative Query Refinement

Query Modification
- Problem: How can we reformulate the query to help a user who is trying several searches to get at the same information?
  - Thesaurus expansion: Suggest terms similar to query terms
  - Relevance feedback: Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant

Relevance Feedback
- Main idea: Modify the existing query based on relevance judgements
  - Extract terms from relevant documents and add them to the query
  - AND/OR re-weight the terms already in the query
- There are many variations:
  - Usually positive weights for terms from relevant docs
  - Sometimes negative weights for terms from non-relevant docs
- Users, or the system, guide this process by selecting terms from an automatically-generated list.

Rocchio Method
- Rocchio automatically
  - re-weights terms
  - adds in new terms (from relevant docs)
    - have to be careful when using negative terms
- Rocchio is not a machine learning algorithm

Rocchio Method (formula)

$Q_1 = \alpha\,Q_0 + \dfrac{\beta}{n_1} \sum_{i=1}^{n_1} R_i - \dfrac{\gamma}{n_2} \sum_{i=1}^{n_2} S_i$

where
- $Q_0$ = the vector for the initial query
- $R_i$ = the vector for relevant document i
- $S_i$ = the vector for non-relevant document i
- $n_1$ = the number of relevant documents chosen
- $n_2$ = the number of non-relevant documents chosen
- $\alpha$, $\beta$ and $\gamma$ tune the importance of relevant and non-relevant terms (in some studies it is best to set $\beta$ to 0.75 and $\gamma$ to 0.25)

Rocchio/Vector Illustration

In the Retrieval (x) / Information (y) plane:
- Q0 = "retrieval of information" = (0.7, 0.3)
- D1 = "information science" = (0.2, 0.8)
- D2 = "retrieval systems" = (0.9, 0.1)
- Q' = 1/2 * Q0 + 1/2 * D1 = (0.45, 0.55)
- Q'' = 1/2 * Q0 + 1/2 * D2 = (0.80, 0.20)
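A brief sketch of the Rocchio update (my code, not the slides'); with one relevant document, alpha = beta = 1/2 and gamma = 0, it reproduces the two illustration points above.

```python
def rocchio(q0, relevant, non_relevant, alpha, beta, gamma):
    """Q1 = alpha*Q0 + (beta/n1)*sum(R_i) - (gamma/n2)*sum(S_i), element-wise."""
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(q0)
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    r, s = centroid(relevant), centroid(non_relevant)
    return [alpha * q + beta * ri - gamma * si for q, ri, si in zip(q0, r, s)]

Q0 = [0.7, 0.3]  # "retrieval of information"
D1 = [0.2, 0.8]  # "information science"
D2 = [0.9, 0.1]  # "retrieval systems"

print([round(x, 2) for x in rocchio(Q0, [D1], [], 0.5, 0.5, 0.0)])  # [0.45, 0.55]
print([round(x, 2) for x in rocchio(Q0, [D2], [], 0.5, 0.5, 0.0)])  # [0.8, 0.2]
```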