RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS



2 Outline
- Intro
- Basics of probability and information theory
- Retrieval models
  - Boolean model
  - Vector space model
  - Probabilistic retrieval
- Retrieval evaluation
- Link analysis
- From queries to top-k results
- Social search

3 Boolean model
- Exact match, i.e., a document either matches the query or it does not
- No ranking
- Query composed of Boolean operators
- Examples: Richard Nixon Watergate, Schwarzenegger (governor politics) (movie action)
Advantages
- Efficient processing
- Results are easy to explain
Main disadvantages
- Effectiveness (i.e., whether the user will find all relevant results) depends entirely on the user
- Not practical for web retrieval (the result set would be either huge or too small)

4 Extended Boolean model (1)
- Consider term weights, e.g., for document $d$ and term $t$: $w(t, d) = P(t \mid d) \propto tf(t, d)$
- Consider the vector $(t_1, \dots, t_m)$ of all terms in all documents (in some order); $d$ can then be represented as $d = (w(t_1, d), \dots, w(t_m, d))$
- For a conjunctive query $q = t_{i_1} \wedge \dots \wedge t_{i_k}$:
  $$sim(d, q) = 1 - \left( \frac{(1 - w(t_{i_1}, d))^p + \dots + (1 - w(t_{i_k}, d))^p}{k} \right)^{1/p}$$
- For a disjunctive query $q = t_{i_1} \vee \dots \vee t_{i_k}$:
  $$sim(d, q) = \left( \frac{w(t_{i_1}, d)^p + \dots + w(t_{i_k}, d)^p}{k} \right)^{1/p}$$
- Both similarities are based on the p-norm
- Rank documents by decreasing similarity to the query
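
The two p-norm similarities can be sketched in a few lines of Python. This is a minimal illustration, assuming the term weights are already normalized to [0, 1]; the function names `pnorm_and` and `pnorm_or` are hypothetical and not taken from the slides.

```python
def pnorm_or(weights, p=2.0):
    """Similarity of a document to a disjunctive (OR) query.

    weights: the weights w(t, d) of the k query terms in document d, each in [0, 1].
    """
    k = len(weights)
    return (sum(w ** p for w in weights) / k) ** (1.0 / p)


def pnorm_and(weights, p=2.0):
    """Similarity of a document to a conjunctive (AND) query."""
    k = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / k) ** (1.0 / p)


if __name__ == "__main__":
    # A document that contains only one of the two query terms.
    print(pnorm_or([1.0, 0.0]), pnorm_and([1.0, 0.0]))
```

With p = 1 both functions reduce to the average term weight; larger values of p move the scores towards strict Boolean behavior.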

5 Extended Boolean model (2)
Source: G. Salton, Extended Boolean Information Retrieval. CACM 1983

6 Extended Boolean model: example
Documents X, Z, Y and queries $A \vee B$ and $A \wedge B$
Source: G. Salton, Extended Boolean Information Retrieval. CACM 1983

7 Vector space model (1)
Term-document matrix (entries missing in the transcription)
- Columns (documents): $d_1, d_2, d_3, d_4, d_5, d_6$
- Rows (terms): champion, football, goal, law, party, politician, rain, score, soccer, weather, wind

8 Vector space model (2)
Main idea
- Terms span a vector space
- Documents and queries are vectors in that space
- Distance is measured by cosine similarity: for $d_i = (t_{i1}, \dots, t_{im})$ and $q = (q_1, \dots, q_m)$
  $$sim(d_i, q) = \frac{\sum_j t_{ij} \, q_j}{\sqrt{\sum_j t_{ij}^2} \, \sqrt{\sum_j q_j^2}}$$
  where $t_{ij}$ is the feature value of a doc term (e.g., tf-idf) and $q_j$ is the feature value of a query term
- Measures angle-based affinity between query and doc vector
Figure: vectors $d_i$, $d_j$ and $q$ in the space spanned by the terms $t_1, t_2, t_3$
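
The cosine similarity above can be sketched as follows. This is a minimal illustration on plain Python lists; the function name `cosine_similarity` and the convention of returning 0.0 for an all-zero vector are assumptions made here, not part of the slides.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between a document vector d and a query vector q."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0.0 or norm_q == 0.0:
        # Assumption: an empty vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_d * norm_q)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

Because the score depends only on the angle, scaling a document vector (for example, by concatenating the document with itself) leaves the similarity unchanged.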

9 Vector space model: example
Query: football score
Query vector: $q$ (entries missing in the transcription)
Ranking by using the term frequencies $tf(\text{football}, d)$ and $tf(\text{score}, d)$ of each document $d$ (numeric values missing in the transcription):
$$sim(d_1, q) = \frac{d_1^T q}{\|d_1\| \, \|q\|} = 0.63, \qquad sim(d_2, q) = 0.52$$

10 Relevance feedback
- User issues a simple query
- System returns initial results
- User marks some results as relevant, others as non-relevant
- Initial query can be reformulated to yield better results based on the above relevance feedback:
  $$q_{opt} = \arg\max_q \left[ sim(q, R) - sim(q, \bar{R}) \right]$$
  with $R$ the relevant docs and $\bar{R}$ the non-relevant docs
- In the vector space model:
  $$q_{opt} = \arg\max_q \left[ \frac{1}{|R|} \sum_{d_j \in R} sim(d_j, q) - \frac{1}{|\bar{R}|} \sum_{d_j \in \bar{R}} sim(d_j, q) \right]$$

11 Relevance feedback: Rocchio algorithm
- Used as relevance feedback algorithm in the SMART system
  $$q = \alpha q_0 + \beta \frac{1}{|R|} \sum_{d_j \in R} d_j - \gamma \frac{1}{|\bar{R}|} \sum_{d_j \in \bar{R}} d_j$$
- With typical parameter values $\alpha = 8$, $\beta = 16$, $\gamma = 4$
- Negative feature values are set to 0
Source: Introduction to Information Retrieval
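
The Rocchio update can be sketched directly from the formula. This is a minimal illustration on dense Python lists; the function name `rocchio` and the handling of an empty feedback set (its centroid is taken to be the zero vector) are assumptions made here.

```python
def rocchio(q0, relevant, non_relevant, alpha=8.0, beta=16.0, gamma=4.0):
    """Return the reformulated query vector.

    q0: initial query vector; relevant / non_relevant: lists of document
    vectors of the same length as q0.
    """
    m = len(q0)

    def centroid(docs):
        # Assumption: an empty feedback set contributes the zero vector.
        if not docs:
            return [0.0] * m
        return [sum(d[j] for d in docs) / len(docs) for j in range(m)]

    c_rel = centroid(relevant)
    c_non = centroid(non_relevant)
    # Negative feature values are clipped to 0, as stated on the slide.
    return [max(0.0, alpha * q0[j] + beta * c_rel[j] - gamma * c_non[j])
            for j in range(m)]


if __name__ == "__main__":
    q = rocchio([1.0, 0.0, 0.0],
                relevant=[[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
                non_relevant=[[0.0, 0.0, 1.0]])
    print(q)
```

In the usage example the second term enters the query because it appears in a relevant document, while the third term, seen only in a non-relevant document, is clipped to 0.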

12 Vector space model for Latent Semantic Indexing (1)
Map documents to corresponding latent topics
Term-document matrix (entries missing in the transcription)
- Columns (documents): $d_1, d_2, d_3, d_4, d_5, d_6$
- Rows (terms): champion, football, goal, law, party, politician, rain, score, soccer, weather, wind

13 Vector space model for Latent Semantic Indexing (2)
Basic idea (further details in the Data Mining lecture): approximate the term-document matrix $A$ by a rank-$k$ singular value decomposition $A \approx U_k \Sigma_k V_k^T$
- $U_k$ holds orthonormal eigenvectors of $A A^T$ (term-term similarity)
- $\Sigma_k$ holds the positive roots of the eigenvalues of $A^T A$ (singular values)
- $V_k$ holds orthonormal eigenvectors of $A^T A$ (doc-doc similarity)
- Mapping of docs $d_i, d_j$ into the latent space: $d_i \mapsto U_k^T d_i = d_i' = \Sigma_k (V_k^T)_{*i}$
- Map the keyword query, $q \mapsto U_k^T q = q'$, and measure cosine similarity
- Mapping of a new doc $d$ into the latent space: $d \mapsto U_k^T d = d'$; append $d'$ as the last column of $\Sigma_k V_k^T$, the matrix of mapped documents

14 Vector space model: pros & cons
Advantages
- Simple framework for ranking
- Other similarity measures or similarity weighting schemes can be used, e.g.:

  Metric distance | Definition
  Euclidean | $\|x - y\|_2 = \sqrt{\sum_i (x_i - y_i)^2}$
  Manhattan | $\|x - y\|_1 = \sum_i |x_i - y_i|$
  Maximum | $\|x - y\|_\infty = \max_i |x_i - y_i|$

  with $sim(x, y) \propto \frac{1}{1 + dist(x, y)}$ or $sim(x, y) \propto \frac{1}{e^{dist(x, y)}}$
Disadvantages
- Term independence assumption
- Impractical for relevance prediction
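
The three metric distances and the two distance-to-similarity mappings can be sketched as follows. This is a minimal illustration; all function names are hypothetical, and the proportionality constants are simply taken to be 1.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def sim_inverse(dist):
    """Similarity 1 / (1 + dist): equals 1 for identical vectors."""
    return 1.0 / (1.0 + dist)


def sim_exponential(dist):
    """Similarity 1 / e^dist: decays faster with growing distance."""
    return math.exp(-dist)


if __name__ == "__main__":
    x, y = [0.0, 0.0], [3.0, 4.0]
    for dist in (euclidean, manhattan, maximum):
        print(dist.__name__, dist(x, y), sim_inverse(dist(x, y)))
```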

15 Probabilistic retrieval
Basic idea
- Return documents to a query ranked by decreasing probability of relevance (with respect to the query)
- If probabilities are calculated on all available data, the retrieval effectiveness should be best
Figure: a sample doc, returned by a retrieval system to a user's query, within the set of all documents, which is partitioned into $R_q$ (relevant) and $\bar{R}_q$ (non-relevant)
Bayes decision rule: document $d$ is relevant if
$$P(R_q \mid d) > P(\bar{R}_q \mid d) \iff P(R_q, d) > P(\bar{R}_q, d)$$

16 Binary independence model with Naïve Bayes
- Estimate probabilities: $P(R_q, d) = P(d \mid R_q) \, P(R_q)$
- Assume independence between terms in the document given the class (e.g., $R_q$ or $\bar{R}_q$):
  $$P(d \mid R_q) = \prod_{t_i \in d} P(t_i \mid R_q)$$
- Document $d$ can be classified as relevant if
  $$\frac{P(d \mid R_q)}{P(d \mid \bar{R}_q)} > \frac{P(\bar{R}_q)}{P(R_q)}$$
- Note: a document is a vector of binary features (i.e., 1 indicating term occurrence and 0 indicating non-occurrence)

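17 Bayes decision with binary independence model
Let $r_i := P(t_i \mid R_q)$ and $\bar{r}_i := P(t_i \mid \bar{R}_q)$. Then
$$\frac{P(d \mid R_q)}{P(d \mid \bar{R}_q)} = \prod_{i: t_i = 1} \frac{r_i}{\bar{r}_i} \cdot \prod_{i: t_i = 0} \frac{1 - r_i}{1 - \bar{r}_i}$$
$$= \prod_{i: t_i = 1} \frac{r_i}{\bar{r}_i} \cdot \prod_{i: t_i = 1} \frac{1 - \bar{r}_i}{1 - r_i} \cdot \prod_{i: t_i = 1} \frac{1 - r_i}{1 - \bar{r}_i} \cdot \prod_{i: t_i = 0} \frac{1 - r_i}{1 - \bar{r}_i}$$
$$= \prod_{i: t_i = 1} \frac{r_i (1 - \bar{r}_i)}{\bar{r}_i (1 - r_i)} \cdot \prod_{i} \frac{1 - r_i}{1 - \bar{r}_i}$$
The last product runs over all terms and does not depend on $d$, so it can be dropped for ranking. In practice, the score is computed by taking the logarithm:
$$\log \frac{P(d \mid R_q)}{P(d \mid \bar{R}_q)} \stackrel{rank}{=} \sum_{i: t_i = 1} \log \frac{r_i (1 - \bar{r}_i)}{\bar{r}_i (1 - r_i)}$$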
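
The log-odds score can be sketched as a sum over the terms that occur in the document. This is a minimal illustration; the function name `bim_score` and the dictionary-based inputs are assumptions, and the probabilities are expected to lie strictly between 0 and 1 (for example, after smoothing).

```python
import math


def bim_score(doc_terms, r, r_bar):
    """Rank-equivalent log-odds score of the binary independence model.

    doc_terms: set of terms occurring in the document (t_i = 1).
    r[t]:      estimate of P(t | relevant).
    r_bar[t]:  estimate of P(t | non-relevant).
    Terms without estimates are skipped (assumption of this sketch).
    """
    score = 0.0
    for t in doc_terms:
        if t in r and t in r_bar:
            score += math.log(r[t] * (1.0 - r_bar[t]) / (r_bar[t] * (1.0 - r[t])))
    return score


if __name__ == "__main__":
    r = {"football": 0.8, "score": 0.6}
    r_bar = {"football": 0.2, "score": 0.6}
    print(bim_score({"football", "score", "rain"}, r, r_bar))
```

A term that is equally likely in relevant and non-relevant documents (here "score") contributes nothing to the score.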

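18 BM25
Contingency table

            | Relevant  | Non-relevant        | All
  $t_i = 1$ | $R_i$     | $n_i - R_i$         | $n_i$
  $t_i = 0$ | $R - R_i$ | $N - R - n_i + R_i$ | $N - n_i$
  All       | $R$       | $N - R$             | $N$

Smoothed maximum likelihood estimations for $r_i$ and $\bar{r}_i$:
$$r_i = \frac{R_i + 0.5}{R + 1}, \qquad \bar{r}_i = \frac{n_i - R_i + 0.5}{N - R + 1}$$
Plugging these estimates into the log-odds score gives
$$\log \frac{P(d \mid R_q)}{P(d \mid \bar{R}_q)} \stackrel{rank}{=} \sum_{i: t_i = 1} \log \frac{r_i (1 - \bar{r}_i)}{\bar{r}_i (1 - r_i)} = \sum_{i: t_i = 1} \log \frac{(R_i + 0.5)(N - R - n_i + R_i + 0.5)}{(n_i - R_i + 0.5)(R - R_i + 0.5)}$$
BM25 (short for Best Match 25):
$$BM25(d, q) = \sum_{i: t_i = 1} \log \frac{(R_i + 0.5)(N - R - n_i + R_i + 0.5)}{(n_i - R_i + 0.5)(R - R_i + 0.5)} \cdot \frac{(\alpha + 1) f_i}{A + f_i} \cdot \frac{(\beta + 1) \, qf_i}{\beta + qf_i}$$
Here $f_i$ is the frequency of term $t_i$ in $d$ and $qf_i$ its frequency in the query. $\alpha$, $\beta$, $A$ are parameters, where
$$A = \alpha \left( (1 - \gamma) + \gamma \frac{|d|}{\mathrm{avg}_k |d_k|} \right)$$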
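
A BM25 scorer along these lines can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the dictionary-based inputs, and the default parameter values (alpha = 1.2, beta = 100, gamma = 0.75, which are commonly used settings) are assumptions. The term weight adds 0.5 to each count, and without relevance information R and R_i are set to 0.

```python
import math


def rsj_weight(n_i, N, R_i=0, R=0):
    """Relevance weight of a term, with 0.5 added to each count.

    n_i: number of documents containing the term; N: number of documents.
    R_i, R: relevant documents containing the term / in total (0 if unknown).
    """
    return math.log(((R_i + 0.5) * (N - R - n_i + R_i + 0.5))
                    / ((n_i - R_i + 0.5) * (R - R_i + 0.5)))


def bm25(query_tf, doc_tf, doc_len, avg_doc_len, df, N,
         alpha=1.2, beta=100.0, gamma=0.75):
    """BM25 score of one document for one query.

    query_tf / doc_tf: term -> frequency in the query / document.
    df: term -> document frequency n_i.
    """
    A = alpha * ((1.0 - gamma) + gamma * doc_len / avg_doc_len)
    score = 0.0
    for term, qf in query_tf.items():
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # only terms occurring in the document contribute
        score += (rsj_weight(df[term], N)
                  * (alpha + 1.0) * f / (A + f)
                  * (beta + 1.0) * qf / (beta + qf))
    return score


if __name__ == "__main__":
    print(bm25({"football": 1, "score": 1},
               {"football": 3, "goal": 2},
               doc_len=5, avg_doc_len=8.0,
               df={"football": 2, "score": 4}, N=100))
```

Note that the weight becomes negative for terms that occur in more than half of the documents; practical systems often clip or modify it, which this sketch does not do.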

19 Language models
- Text is generated from a probability distribution over terms (i.e., a language model)
- Terms are drawn with the probability of their occurrence
- Suppose each document $d = (t_1, \dots, t_m)$ is represented by a language model, e.g., a multinomial distribution:
  $$P(t_1 = k_1, \dots, t_m = k_m; \, p_1, \dots, p_m) = \frac{|d|!}{k_1! \cdots k_m!} \, p_1^{k_1} \cdots p_m^{k_m}$$
  with $\theta_d = (p_1, \dots, p_m)$, such that $\sum_i p_i = 1$ and $\sum_i k_i = |d|$
- Given a query $q = (q_1, \dots, q_n)$, what is the likelihood $P(q \mid \theta_d)$ that $q$ is a sample from a document $d$?
- Rank documents by decreasing likelihood of having generated $q$
Figure: documents with their language models $LM(\theta_1)$, $LM(\theta_2)$, $LM(\theta_3)$

20 Language models: the query likelihood model
- Start with the joint probability distribution
  $$P(\theta_d, q) = P(q \mid \theta_d) \, P(\theta_d)$$
  where $P(q \mid \theta_d)$ is the likelihood and $P(\theta_d)$ the prior
- Assume a uniform prior and conditional independence between query terms given the parameters $\theta_d$ (unigram model):
  $$P(q \mid \theta_d) = \prod_i P(q_i \mid \theta_d)$$
- Rank documents by decreasing likelihoods $P(q \mid \theta_{d_j})$

21 Language models: estimating probabilities
$$P(q \mid \theta_d) = \prod_i P(q_i \mid \theta_d)$$
Estimate $P(q_i \mid \theta_d)$ by maximum likelihood estimation:
$$P(q_i \mid \theta_d) = \frac{freq(q_i, d)}{|d|}$$
Problem: what if $q_i$ does not occur in $d$?

22 Language models: smoothing
Maximum likelihood estimation:
$$P(q_i \mid \theta_d) = \frac{freq(q_i, d)}{|d|}$$
Problem: what if $q_i$ does not occur in $d$?
- Use smoothing: a technique for estimating probabilities of missing words
- Mixture model for smoothing:
  $$P(q \mid \theta_d) = \prod_i \left[ (1 - \alpha) \, P(q_i \mid \theta_d) + \alpha \, P(q_i \mid C) \right]$$
  where $C$ is the whole document corpus and $P(q_i \mid C)$ is the background model

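23 Smoothing & tf-idf
Mixture model for smoothing:
$$P(q \mid \theta_d) = \prod_i \left[ (1 - \alpha) \, P(q_i \mid \theta_d) + \alpha \, P(q_i \mid C) \right]$$
Taking the logarithm, with $f_{q_i}$ denoting the frequency of $q_i$ in $d$:
$$\log \prod_i \left[ (1 - \alpha) P(q_i \mid \theta_d) + \alpha P(q_i \mid C) \right] = \sum_{i: f_{q_i} > 0} \log \left[ (1 - \alpha) P(q_i \mid \theta_d) + \alpha P(q_i \mid C) \right] + \sum_{i: f_{q_i} = 0} \log \alpha P(q_i \mid C)$$
$$= \sum_{i: f_{q_i} > 0} \log \frac{(1 - \alpha) P(q_i \mid \theta_d) + \alpha P(q_i \mid C)}{\alpha P(q_i \mid C)} + \sum_i \log \alpha P(q_i \mid C)$$
$$\stackrel{rank}{=} \sum_{i: f_{q_i} > 0} \log \left( \frac{(1 - \alpha) P(q_i \mid \theta_d)}{\alpha P(q_i \mid C)} + 1 \right)$$
The last step drops $\sum_i \log \alpha P(q_i \mid C)$, which is the same for every document.
What does this have to do with tf-idf? The numerator $P(q_i \mid \theta_d)$ grows with the term frequency in $d$, while the denominator $P(q_i \mid C)$ grows with the frequency of the term in the corpus, so its inverse plays a role similar to idf.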

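24 Jelinek-Mercer smoothing
$$P(q \mid \theta_d) = \prod_i \left[ (1 - \alpha) \, P(q_i \mid \theta_d) + \alpha \, P(q_i \mid C) \right]$$
Set $\alpha$ to a constant $\lambda$ and use maximum likelihood estimations:
$$P(q \mid \theta_d) = \prod_i \left[ (1 - \lambda) \frac{freq(q_i, d)}{|d|} + \lambda \frac{freq(q_i, C)}{|C|} \right]$$
In practice: use logarithms for ranking
$$Score(q, d) = \sum_i \log \left[ (1 - \lambda) \frac{freq(q_i, d)}{|d|} + \lambda \frac{freq(q_i, C)}{|C|} \right]$$
where $|C| = \sum_{t \in C} freq(t, C)$ is the total number of term occurrences in the corpus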
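
The Jelinek-Mercer score can be sketched as follows. This is a minimal illustration; the function name `jm_score`, the token-list representation of documents, and the choice to return negative infinity for a query term that is absent from the whole corpus are assumptions of this sketch.

```python
import math
from collections import Counter


def jm_score(query, doc, corpus_tf, corpus_len, lam=0.2):
    """Log query likelihood under Jelinek-Mercer smoothing.

    query, doc: lists of tokens.
    corpus_tf:  term -> frequency in the whole corpus.
    corpus_len: total number of term occurrences in the corpus.
    """
    tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_doc = tf[term] / doc_len if doc_len else 0.0
        p_bg = corpus_tf.get(term, 0) / corpus_len
        p = (1.0 - lam) * p_doc + lam * p_bg
        if p == 0.0:
            # Assumption: a term unseen in the entire corpus rules the document out.
            return float("-inf")
        score += math.log(p)
    return score


if __name__ == "__main__":
    corpus_tf = {"party": 3, "wind": 1, "rain": 4}
    doc = ["party", "party", "rain"]
    print(jm_score(["party", "wind"], doc, corpus_tf, corpus_len=8))
```

In the usage example "wind" does not occur in the document, yet the background model still gives it a nonzero probability.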

25 Dirichlet smoothing
$$P(q \mid \theta_d) = \prod_i \left[ (1 - \alpha) \, P(q_i \mid \theta_d) + \alpha \, P(q_i \mid C) \right]$$
Basic idea
- A term that does not occur in a long document should be assigned a low smoothed probability (the longer the document, the lower this probability)
- Set $\alpha = \frac{\mu}{\mu + |d|}$ and use maximum likelihood estimations:
  $$P(q \mid \theta_d) = \prod_i \frac{freq(q_i, d) + \mu \, freq(q_i, C) / |C|}{\mu + |d|}$$
  $$Score(d, q) = \sum_i \log \frac{freq(q_i, d) + \mu \, freq(q_i, C) / |C|}{\mu + |d|}$$
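
The Dirichlet-smoothed score can be sketched in the same style. This is a minimal illustration; the function name `dirichlet_score`, the token-list inputs, and the default value of mu are assumptions, and a query term that is absent from the whole corpus yields negative infinity by choice of this sketch.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, corpus_tf, corpus_len, mu=2000.0):
    """Log query likelihood under Dirichlet smoothing.

    query, doc: lists of tokens.
    corpus_tf:  term -> frequency in the whole corpus.
    corpus_len: total number of term occurrences in the corpus.
    """
    tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_bg = corpus_tf.get(term, 0) / corpus_len
        p = (tf[term] + mu * p_bg) / (mu + doc_len)
        if p == 0.0:
            # Assumption: a term unseen in the entire corpus rules the document out.
            return float("-inf")
        score += math.log(p)
    return score


if __name__ == "__main__":
    corpus_tf = {"a": 1, "b": 1, "c": 2}
    short_doc = ["a", "b"]
    long_doc = ["a", "b", "a", "b", "a", "b"]
    # "c" is missing from both documents; the longer one gets the lower score.
    print(dirichlet_score(["c"], short_doc, corpus_tf, 4, mu=2.0))
    print(dirichlet_score(["c"], long_doc, corpus_tf, 4, mu=2.0))
```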

26 Dirichlet smoothing: side note
- Motivated by Bayesian reasoning:
  $$P(\theta \mid D) \propto P(D \mid \theta) \, P(\theta)$$
  with posterior $P(\theta \mid D)$, likelihood $P(D \mid \theta)$ and prior $P(\theta)$
- The Dirichlet distribution is the conjugate prior of the multinomial distribution (enables closed-form calculation: the prior has the same functional form as the posterior)
- In Dirichlet smoothing: a Dirichlet prior is assumed over the model parameters, and the likelihood is multinomially distributed
- Priors are updated as more and more data is observed
- Further details in the Data Mining lecture

27 Laplace/Lidstone smoothing
Estimate $P(q_i \mid \theta_d)$ by
$$P(q_i \mid \theta_d) = \frac{freq(q_i, d) + 1}{\sum_{t \in C} \left( freq(t, d) + 1 \right)} = \frac{freq(q_i, d) + 1}{|C| + |d|}$$
where the sum runs over the distinct terms of the corpus, so $|C|$ here counts the distinct terms (vocabulary size)
For ranking:
$$Score(q, d) = \sum_i \log \frac{freq(q_i, d) + 1}{|C| + |d|}$$
- Advantage: easy-to-use (i.e., parameter-less) additive smoothing technique, no background model
- Disadvantage: for a corpus with many short documents, probability estimations are biased towards corpus frequencies
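
Add-one smoothing can be sketched as follows. This is a minimal illustration; the function names and the token-list representation are assumptions, and `vocab_size` stands for the number of distinct terms in the corpus.

```python
import math
from collections import Counter


def laplace_prob(term, doc, vocab_size):
    """Add-one estimate of P(term | document model)."""
    tf = Counter(doc)
    return (tf[term] + 1) / (vocab_size + len(doc))


def laplace_score(query, doc, vocab_size):
    """Log query likelihood under add-one smoothing."""
    return sum(math.log(laplace_prob(t, doc, vocab_size)) for t in query)


if __name__ == "__main__":
    doc = ["party", "party", "rain"]
    vocabulary = ["party", "rain", "wind"]
    # The smoothed estimates over the vocabulary sum to 1.
    print([laplace_prob(t, doc, len(vocabulary)) for t in vocabulary])
    print(laplace_score(["party", "wind"], doc, len(vocabulary)))
```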

28 Smoothing: examples
Each method is applied to $P(\text{party} \mid d_1)$ and $P(\text{wind} \mid d_2)$ (the computed values are missing in the transcription):
- Jelinek-Mercer smoothing ($\lambda = 0.2$)
- Dirichlet smoothing ($\mu = 0.2$)
- Laplace/Lidstone smoothing

29 Language models vs. tf-idf
Figure: precision-recall results on a TREC dataset. Source: A Language Modeling Approach to Information Retrieval by J. M. Ponte and W. B. Croft
More on evaluation measures in the next lecture

30 Smoothing: summary
Jelinek-Mercer smoothing
- Flexible combination of term frequencies with a background model
- Empirically: the longer the queries, the larger $\lambda$ (i.e., the more smoothing is needed)
- Does not take document length into account
Dirichlet smoothing
- Probabilistic (i.e., Bayesian) reasoning about the prior distribution on model parameters
- Empirically: the longer the queries, the larger $\mu$
- However, $\mu$ is collection dependent
Laplace/Lidstone smoothing
- Easy-to-use additive smoothing technique
- Probability estimations may be biased towards corpus frequencies
Two-stage smoothing is possible (i.e., Jelinek-Mercer smoothing on a Dirichlet-smoothed estimation)

31 Language models for relevance (1)
- In document language models we were interested in estimating the query likelihood $P(q \mid \theta_d)$
- In relevance language models we are interested in $P(d \mid \theta_{R_q})$, the probability of a document being generated by a relevance model (i.e., the document is a sample from this model)
- Can the two models be combined for ranking?
- One possibility: rank documents by increasing dissimilarity between $P(q \mid \theta_d)$ and $P(d \mid \theta_{R_q})$

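32 Language models for relevance (2)
Measure the negative Kullback-Leibler divergence:
$$-KL\left( P(d \mid \theta_{R_q}) \,\|\, P(q \mid \theta_d) \right) = -\sum_t P(t \mid \theta_{R_q}) \log \frac{P(t \mid \theta_{R_q})}{P(t \mid \theta_d)}$$
$$= \sum_t P(t \mid \theta_{R_q}) \log P(t \mid \theta_d) - \sum_t P(t \mid \theta_{R_q}) \log P(t \mid \theta_{R_q})$$
Simple estimation of $P(t \mid \theta_{R_q})$:
$$P(t \mid \theta_{R_q}) \approx P(t \mid q_1, \dots, q_n) = \frac{P(t, q_1, \dots, q_n)}{P(q_1, \dots, q_n)}$$
$$P(t, q_1, \dots, q_n) = \sum_{d \in C} P(d) \, P(t, q_1, \dots, q_n \mid d) = \sum_{d \in C} P(d) \, P(t \mid d) \prod_i P(q_i \mid d)$$
where the last step uses the conditional independence assumption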
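
The negative KL divergence used for ranking can be sketched as follows. This is a minimal illustration; the function name `neg_kl` and the dictionary-based distributions are assumptions, and the document model is expected to be smoothed so that every term with nonzero relevance probability also has nonzero document probability.

```python
import math


def neg_kl(p_rel, p_doc):
    """Negative KL divergence between a relevance model and a document model.

    p_rel: term -> P(t | relevance model).
    p_doc: term -> P(t | document model), smoothed (no zeros on p_rel's support).
    Higher values (closer to 0) mean the document model is closer to the relevance model.
    """
    total = 0.0
    for term, p in p_rel.items():
        if p > 0.0:
            total -= p * math.log(p / p_doc[term])
    return total


if __name__ == "__main__":
    p_rel = {"football": 0.5, "score": 0.5}
    doc_a = {"football": 0.5, "score": 0.5}
    doc_b = {"football": 0.25, "score": 0.75}
    print(neg_kl(p_rel, doc_a), neg_kl(p_rel, doc_b))
```

Only the cross term, the sum of P(t | relevance model) times log P(t | document model), depends on the document, so ranking by it alone gives the same order.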

33 Pseudo-relevance feedback algorithm
Estimation of $P(t \mid \theta_{R_q})$:
$$P(t \mid \theta_{R_q}) \approx P(t \mid q_1, \dots, q_n) = \frac{P(t, q_1, \dots, q_n)}{P(q_1, \dots, q_n)}$$
$$P(t, q_1, \dots, q_n) = \sum_{d \in C} P(d) \, P(t, q_1, \dots, q_n \mid d) = \sum_{d \in C} P(d) \, P(t \mid d) \prod_i P(q_i \mid d)$$
1. Rank documents by $P(q \mid \theta_d)$
2. Confine the corpus to $C$ = {top-k ranked documents}
3. Estimate $P(d \mid \theta_{R_q})$ on this new corpus
4. Rerank documents by using $KL\left( P(d \mid \theta_{R_q}) \,\|\, P(q \mid \theta_d) \right)$
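
The estimation formula for the relevance model can be sketched as follows. This is a minimal illustration; the function name `relevance_model`, the uniform document prior P(d), and the representation of each (top-k) document as a smoothed term distribution are assumptions of this sketch.

```python
def relevance_model(query, doc_models, vocabulary):
    """Estimate P(t | q_1, ..., q_n) for every term t in the vocabulary.

    query:      list of query terms.
    doc_models: list of dicts term -> P(term | d), one per (top-k) document.
    Assumes a uniform prior P(d) over the given documents.
    """
    prior = 1.0 / len(doc_models)
    joint = {}
    for t in vocabulary:
        total = 0.0
        for model in doc_models:
            query_likelihood = 1.0
            for q_i in query:
                query_likelihood *= model.get(q_i, 0.0)
            total += prior * model.get(t, 0.0) * query_likelihood
        joint[t] = total  # P(t, q_1, ..., q_n)
    # Dividing by the sum over t normalizes the joint to P(t | q_1, ..., q_n);
    # it equals P(q_1, ..., q_n) when each model sums to 1 over the vocabulary.
    z = sum(joint.values())
    return {t: (v / z if z > 0.0 else 0.0) for t, v in joint.items()}


if __name__ == "__main__":
    models = [{"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9}]
    print(relevance_model(["a"], models, ["a", "b"]))
```

Documents that explain the query well (high query likelihood) dominate the estimate, which is what lets the top-ranked documents act as pseudo-relevant feedback.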

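34 Integrating relevance feedback in language models
Figure: the document $D$ yields a document model $\theta_D$ and the query $Q$ a query model $\theta_Q$; results are ranked by the KL divergence $D(\theta_Q \,\|\, \theta_D)$; the feedback docs $F = \{d_1, d_2, \dots, d_k\}$ yield a feedback model $\theta_F$, which updates the query model to $\theta_{Q'} = (1 - \alpha) \, \theta_Q + \alpha \, \theta_F$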

35 Language models: summary
- Tractable retrieval framework with appealing interpretation
- Smoothed maximum likelihood estimations for unigrams work well in practice
- Estimations work less well for multigram likelihoods, a problem known as the curse of dimensionality
- Relevance model is possible
- Relevance feedback is difficult to integrate

36 Advanced probabilistic models: Bayesian networks
- Bayesian network: directed acyclic graph representing dependencies between random variables
- Local Markov property: a variable node is independent of all its non-descendants given its parents
- One can reason about joint probabilities, e.g., $P(d_i, I)$
Figure: network with doc variables $d_1, d_2$, feature variables $f_1, \dots, f_i, \dots, f_m$, query variables $q_1, q_2, \dots, q_k$, and the information need variable $I$
More details in the Data Mining lecture

37 Ranking support vector machine (Ranking SVM)
- Motivation: feedback is cheap and plentiful
- Input: pairs of queries and rankings $(q, r), (q', r'), \dots$, where $(d_i, d_j) \in r$ if $d_i$ ranks higher than $d_j$
- Example: $(d_1, d_2), (d_1, d_6), (d_5, d_6) \in r$ and $(d_5, d_2), (d_5, d_7), (d_7, d_9) \in r'$
- Can we learn a weight vector $w$ such that $\forall r \; \forall (d_i, d_j) \in r: \; w^T d_i > w^T d_j$?
- Optimization problem:
  $$\text{minimize} \quad \frac{1}{2} \|w\|^2 + \lambda \sum_{i,j,r} \xi_{i,j,r}$$
  $$\text{subject to} \quad \forall r \; \forall (d_i, d_j) \in r: \; w^T d_i \ge w^T d_j + 1 - \xi_{i,j,r}, \quad \xi_{i,j,r} \ge 0$$
  where the slack variable $\xi_{i,j,r}$ allows some misclassification
More details on SVMs in the Data Mining lecture
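
The objective of this optimization problem can be evaluated for a given weight vector by replacing each slack variable with its smallest feasible value, max(0, 1 - (w·d_i - w·d_j)). The following is a minimal sketch of that evaluation only, not a solver; the function name `ranking_svm_objective` and the representation of preference pairs as (higher-ranked vector, lower-ranked vector) tuples are assumptions.

```python
def ranking_svm_objective(w, pairs, lam=1.0):
    """Value of 0.5 * ||w||^2 + lam * (sum of minimal slacks).

    w:     weight vector.
    pairs: list of (d_hi, d_lo) feature vectors, where d_hi should rank above d_lo.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    regularizer = 0.5 * dot(w, w)
    # Smallest slack satisfying w.d_hi >= w.d_lo + 1 - slack and slack >= 0.
    slack = sum(max(0.0, 1.0 - (dot(w, d_hi) - dot(w, d_lo)))
                for d_hi, d_lo in pairs)
    return regularizer + lam * slack


if __name__ == "__main__":
    w = [1.0, 0.0]
    satisfied = [([2.0, 0.0], [0.0, 0.0])]   # margin 2, no slack needed
    violated = [([0.0, 0.0], [1.0, 0.0])]    # wrong order, slack 2
    print(ranking_svm_objective(w, satisfied), ranking_svm_objective(w, violated))
```

A pair contributes no loss once the higher-ranked document scores at least 1 above the lower-ranked one; minimizing this objective over w is what a Ranking SVM solver would do.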

38 Retrieval models (overview)
Figure: overview of retrieval models (only the label "Binary Independence" survives in the transcription)
Source:

39 Summary
Boolean model
- Concatenation of query keywords by Boolean operators
- Either match or no match, no ranking
- Allows efficient processing but is impractical for large-scale retrieval
Vector space model
- Vector representations of query and documents allow algebraic methods for measuring similarity
- Good for ranking, but not good at dealing with uncertainty or predicting relevance
Probabilistic and language models
- Extensible and versatile
- Can handle uncertainty (by reasoning about probabilities)
- Maximum likelihood estimations work well for unigram models, less well for multigram models
- Difficult to maintain at large scale and in the presence of frequent updates


Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Most slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

More information

Modeling Environment

Modeling Environment Topic Model Modeling Environment What does it mean to understand/ your environment? Ability to predict Two approaches to ing environment of words and text Latent Semantic Analysis (LSA) Topic Model LSA

More information
