Concept Tracking for Microblog Search


Taiki Miyanishi (a)   Kazuhiro Seki (b)   Kuniaki Uehara (c)

Received: December 22, 2013; Accepted: April 7, 2014

Abstract: Incorporating the temporal property of words into query expansion methods based on relevance feedback has been shown to have a significant positive effect on microblog searching. In this paper, we propose a concept-based query expansion method based on a temporal relevance model that uses the temporal variation of concepts (e.g., terms or phrases) on microblogs. Our model naturally extends an extremely effective existing concept-based relevance model by tracking the concept frequency over time. Moreover, the proposed model produces important concepts that are frequently used within a particular time period associated with a given topic, which have more power to discriminate between relevant and non-relevant microblog documents than words. Our experiments using a corpus of microblog data (the Tweets2011 corpus) show that the proposed concept-based query expansion method improves search performance significantly, especially when retrieving highly relevant documents.

Keywords: Twitter, microblog search, pseudo-relevance feedback, query expansion

1. Introduction

Temporal properties of queries and documents have been exploited for time-aware retrieval [10], [11], [12], [20], [21], [30], and for microblog search in particular [8], [23], [24], [27], [28]. A preliminary version of this paper was presented at WebDB 2013 (November 2013).

Affiliation: Graduate School of System Informatics, Kobe University, Kobe, Hyogo 657-8501, Japan
a) miyanishi@ai.cs.kobe-u.ac.jp
b) seki@cs.kobe-u.ac.jp
c) uehara@kobe-u.ac.jp

(c) 2014 Information Processing Society of Japan

Query expansion based on pseudo-relevance feedback (PRF) [27] typically operates on single words, whereas concept-based approaches to verbose queries [1], [2], [3], [4], [5], [16], [18], [25], [26] have proven effective for Web search; Latent Concept Expansion (LCE) [26] is a representative example. In this paper we extend the word-based relevance model (wrm) [17] and LCE into a concept-based temporal relevance model (ctrm). Consider TREC topic MB044, "White House spokesman replaced," which concerns Jay Carney, previously spokesman for Vice President Joseph R. Biden Jr., becoming White House press secretary. Table 1 contrasts the expansion words suggested by wrm with the concepts suggested by the lexical (Lexical) and temporal (Temporal) components of ctrm.

Table 1: Example of expanded words and concepts for topic MB044, "White House spokesman replaced," suggested by a word-based PRF method (wrm) and the concept-based temporal one (ctrm). The suggestions include words such as "jay," "carney," "qantas," "new," and "obama" (wrm), and concepts such as "jay carney," "press secretary," "spokesman," and "biden spokesman" (ctrm).

The temporal component of ctrm promotes concepts that are frequently used within the particular time period associated with the topic, while the lexical component promotes concepts that co-occur with the query in feedback documents; their combination yields concepts such as "jay carney" and "press secretary" that discriminate relevant from non-relevant tweets better than single words do.

We evaluate the proposed method on the TREC 2011 and 2012 microblog track test collections, which are built on the Tweets2011 corpus*1 of approximately 16 million tweets sampled from Twitter. The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the retrieval models, and Section 4 reports our experiments. Section 5 concludes the paper.

*1 http://trec.nist.gov/data/tweets/

2. Related Work

2.1 Temporal information retrieval
Teevan et al. [35] compared microblog search with Web search, and Li and Croft [19] proposed time-based language models.

Efron and Golovchinsky [10] proposed estimation methods for ranking recent information, Dakka et al. [9] addressed general time-sensitive queries, Keikha et al. [14] proposed time-based relevance models, Lin and Efron [21] studied temporal relevance profiles for tweet search, Miyanishi et al. [27] combined recency and topic-dependent temporal variation for microblog search, and Jones and Diaz [13] characterized the temporal profiles of queries. In addition, Efron et al. [12] improved the retrieval of short texts through document expansion, and Miyanishi et al. [28] improved pseudo-relevance feedback for tweet search via tweet selection.

2.2 Concept-based retrieval models
Markov random field (MRF) models capture term dependencies for Web search: Metzler and Croft introduced the MRF retrieval model [25] and extended it to Latent Concept Expansion (LCE) [26]. Bendersky et al. learned concept importance with a weighted dependence model using external sources such as Wikipedia [3], proposed parameterized concept weighting for verbose queries [4] and query formulation from multiple information sources [5], and modeled higher-order term dependencies with query hypergraphs [2], with evaluations on TREC Web collections.

3. Retrieval Models

Our methods build on the language modeling approach to information retrieval introduced by Ponte and Croft [31].

3.1 Query likelihood model
Given a query Q, documents D are ranked by the posterior P(D|Q). By Bayes' rule, P(D|Q) \propto P(Q|D) P(D); assuming a uniform document prior P(D), ranking by P(D|Q) is equivalent to ranking by the query likelihood P(Q|D). Writing Q as the term sequence q_1, q_2, ..., q_n and assuming the terms are independent given D,

    P(Q|D) = \prod_{i=1}^{n} P(q_i|D),    (1)

where q_i is the i-th of the n query terms.
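As a minimal sketch of Eq. (1), assuming toy tokenized documents and unsmoothed maximum-likelihood term estimates (the function name is illustrative, not from the paper):

```python
import math
from collections import Counter

def query_likelihood_mle(query_terms, doc_terms):
    """log P(Q|D) under Eq. (1) with maximum-likelihood estimates
    P(q|D) = f(q;D) / |D|; any absent query term zeroes the product,
    which is exactly why smoothed estimates are used in practice."""
    tf = Counter(doc_terms)
    n = len(doc_terms)
    score = 0.0
    for q in query_terms:
        p = tf[q] / n
        if p == 0.0:
            return float("-inf")  # unsmoothed: one missing term kills the score
        score += math.log(p)
    return score
```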

The maximum likelihood estimate of the probability of term w in document D is

    P_{ml}(w|D) = f(w;D) / \sum_{w' \in V} f(w';D),

where f(w;D) is the frequency of w in D and V is the vocabulary. Under Eq. (1), a document missing even one query term receives probability 0, so the estimates are smoothed with a Dirichlet prior [36]:

    P(w|D) = \frac{|D|}{|D|+\mu} P_{ml}(w|D) + \frac{\mu}{|D|+\mu} P(w|C),

where P(w|C) is the probability of w in the collection C and \mu is the smoothing parameter.

3.2 Relevance model
Lavrenko and Croft [17] estimate P(w|R), the probability of observing word w in documents relevant to the query. Approximating P(w|R) by P(w|Q), computed over the set R of the top-M retrieved (pseudo-relevant) documents,

    P(w|Q) = \sum_{D \in R} P(w, D|Q) = \frac{1}{P(Q)} \sum_{D \in R} P(D) P(w|D) \prod_{i=1}^{n} P(q_i|D),    (2)

where P(Q) is constant for a given query. The k words with the highest P(w|Q) are added to the query.

3.3 Latent concept expansion
Metzler and Croft [26] proposed LCE, which generalizes relevance-model expansion from single terms to concepts (terms or phrases). LCE scores a candidate expansion concept c over the top-M feedback documents R as

    S_{LCE}(c, Q) \propto \sum_{D \in R} \exp\{\gamma_1 \phi_1(Q, D) + \gamma_2 \phi_2(c, D) - \gamma_3 \phi_3(c, C)\},    (3)

where \phi_1(Q, D) measures the match between Q and D, \phi_2(c, D) the match between c and D, \phi_3(c, C) the match between c and the collection C, and \gamma_1, \gamma_2, \gamma_3 are free parameters. Following Bendersky et al. [4], we set \gamma_1 = \gamma_2 = \gamma_3 = 1, \phi_1(Q, D) = \log P(Q|D), \phi_2(c, D) = \log P(c|D), and \phi_3(c, C) = 0.*2 Representing D and Q as bags of concepts, with query concepts \hat{q}_1, \hat{q}_2, ..., \hat{q}_m, the score of concept c becomes

    S_{crm}(c, Q) \propto \sum_{D \in R} P(D) P(c|D) \prod_{j=1}^{m} P(\hat{q}_j|D),    (4)

where \hat{q}_j is the j-th of the m query concepts. Eq. (4) has the same form as the relevance model of Eq. (2) [17], so we refer to this simplified concept-based LCE as the concept-based relevance model (crm).

3.4 Concept-based temporal relevance model
Temporal evidence has repeatedly proven useful for microblog search [8], [23], [24], [27], [28]. We therefore extend crm to track how concept frequencies vary over time.

*2 Following Macdonald and Ounis [22], we set \phi_3(c, C) = 0.
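A minimal sketch of relevance-model expansion in the style of Eqs. (2) and (4), assuming toy tokenized pseudo-relevant documents, a uniform prior P(D), and the Dirichlet-smoothed estimates above; the function names and the small \mu value are illustrative, not from the paper:

```python
import math
from collections import Counter

def dirichlet_prob(w, tf, dlen, coll_tf, coll_len, mu):
    """Dirichlet-smoothed P(w|D) with collection model P(w|C)."""
    p_c = coll_tf.get(w, 0) / coll_len
    return (tf.get(w, 0) + mu * p_c) / (dlen + mu)

def relevance_model(query, feedback_docs, collection, mu=10.0, k=3):
    """Rank expansion words by P(w|Q) ~ sum_D P(D) P(w|D) prod_i P(q_i|D),
    scored over the pseudo-relevant feedback documents (uniform P(D))."""
    coll_tf = Counter(t for d in collection for t in d)
    coll_len = sum(coll_tf.values())
    scores = Counter()
    for d in feedback_docs:
        tf, dlen = Counter(d), len(d)
        # query likelihood prod_i P(q_i|D), as in Eq. (1)
        ql = math.prod(dirichlet_prob(q, tf, dlen, coll_tf, coll_len, mu)
                       for q in query)
        for w in tf:
            scores[w] += dirichlet_prob(w, tf, dlen, coll_tf, coll_len, mu) * ql
    return [w for w, _ in scores.most_common(k)]
```

The same loop scores multi-term concepts instead of words (Eq. (4)) once f(c;D) counts concept occurrences rather than term occurrences.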

To estimate P(c|R), we use two kinds of pseudo-relevant feedback documents: lexical feedback documents R_l and temporal feedback documents R_t. Then

    P(c|Q) = \sum_{D_l \in R_l} \sum_{D_t \in R_t} P(c, D_l, D_t|Q)
           = \sum_{D_t \in R_t} \sum_{D_l \in R_l} P(D_l | c, D_t, Q) P(c, D_t|Q).    (5)

Following Efron and Golovchinsky [10], we assume that D_l is independent of D_t in Eq. (5). Dropping D_t from the conditional, applying Bayes' rule, and omitting the constant P(Q),

    P(c|Q) = \sum_{D_l \in R_l} P(D_l | c, Q) \sum_{D_t \in R_t} P(c, D_t|Q)
           = \frac{1}{P(c|Q)} \sum_{D_l \in R_l} P(c, D_l|Q) \sum_{D_t \in R_t} P(c, D_t|Q)
           \propto \frac{1}{P(c|Q)} \sum_{D_l \in R_l} P(D_l) P(c, Q|D_l) \sum_{D_t \in R_t} P(D_t) P(c, Q|D_t).

Representing D_l and D_t as bags of concepts, with query concepts \hat{q}_1, \hat{q}_2, ..., \hat{q}_m,

    P(c|Q) \propto \frac{1}{P(c|Q)} \sum_{D_l \in R_l} P(D_l) P(c|D_l) \prod_{j=1}^{m} P(\hat{q}_j|D_l) \sum_{D_t \in R_t} P(D_t) P(c|D_t) \prod_{j=1}^{m} P(\hat{q}_j|D_t),

so that P(c|Q) is rank-equivalent to the score

    S_{ctRM}(c, Q) \stackrel{rank}{=} \Big\{ \underbrace{\sum_{D_l \in R_l} P(D_l) P(c|D_l) \prod_{j=1}^{m} P(\hat{q}_j|D_l)}_{Lexical} \; \underbrace{\sum_{D_t \in R_t} P(D_t) P(c|D_t) \prod_{j=1}^{m} P(\hat{q}_j|D_t)}_{Temporal} \Big\}^{1/2},    (6)

where D_t denotes a temporal document (the feedback documents posted at time t), P(c|D_t) is the probability of concept c at time t, and P(\hat{q}_j|D_t) is the probability of query concept \hat{q}_j at time t. The Temporal factor of Eq. (6) rewards concepts that are frequent in the same periods in which the query concepts \hat{q}_1, ..., \hat{q}_m are frequent, while the Lexical factor, \sum_{D_l \in R_l} P(D_l) P(c|D_l) P(Q|D_l), coincides with the concept-based relevance model of Eq. (4), i.e., simplified LCE. As with Eq. (1), P(c|D_t) must not be zero, so it is smoothed as

    P(c|D_t) = \frac{|D_t|}{|D_t|+\mu_t} \hat{P}_{ml}(c|D_t) + \frac{\mu_t}{|D_t|+\mu_t} P(c|C),    (7)

with \hat{P}_{ml}(c|D_t) = f(c;D_t) / \sum_{c' \in V_c} f(c';D_t), where f(c;D_t) is the frequency of concept c in D_t, V_c is the concept vocabulary, and \mu_t is the temporal smoothing parameter. The k concepts with the highest S_{ctRM}(c, Q) are used for query expansion.

4. Experiments

4.1 Experimental setup
4.1.1 Test collections
We used the test collections of the TREC 2011 and 2012 microblog tracks, which are based on the Tweets2011 corpus, a sample of approximately 16 million tweets posted between January 23 and February 8, 2011. There are 110 topics in total; each topic consists of a topic number (<num>), a query (<title>), and a timestamp (<querytime>). The <title> field serves as the query, issued as of the <querytime>. Figure 1 shows an example.

Figure 1: Example topic from the TREC 2011 microblog track.
    <num> MB001
    <title> BBC World Service staff cuts
    <querytime> Tue Feb 08 12:30:27 +0000 2011
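Eq. (6) combines the lexical and temporal relevance-model scores of each candidate concept by a geometric mean. A minimal sketch, assuming the two component scores have already been computed as in Eq. (4) (over R_l, and over the temporal documents R_t smoothed with \mu_t as in Eq. (7)); the helper name is hypothetical:

```python
import math

def ctrm_scores(lexical, temporal, k=5):
    """S_ctRM(c,Q) = sqrt(lexical[c] * temporal[c]) over concepts scored
    by both components (Eq. (6)); returns the top-k expansion concepts."""
    combined = {c: math.sqrt(lexical[c] * temporal[c])
                for c in lexical.keys() & temporal.keys()}
    return sorted(combined, key=combined.get, reverse=True)[:k]
```

A concept must score well both lexically and temporally to survive: the product suppresses concepts that co-occur with the query but do not burst in the topic's time period, and vice versa.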

Table 2: Summary of TREC collections and topics used for evaluation.

    Name       Type      #Topics   Topic Numbers
    TREC 2011  allrel    49        1-49
    TREC 2011  highrel   33        1, 10-30, 32, 36-38, 40-42, 44-46, 49
    TREC 2012  allrel    59        51-75, 77-110
    TREC 2012  highrel   56        51, 52, 54-68, 70-75, 77-104, 106-110

Tweets are judged on a graded scale from non-relevant to highly relevant. In the allrel setting, relevant and highly relevant tweets both count as relevant; in the highrel setting, only highly relevant tweets do. Table 2 summarizes the collections and topic sets used for evaluation.

4.1.2 Preprocessing
Tweets were indexed with Indri [34]. Stopwords were removed and terms were stemmed with the Krovetz stemmer [15].

4.1.3 Retrieval settings
As the baseline (LM), tweets were retrieved with Indri's query likelihood model with Dirichlet smoothing [36], setting \mu = 2500 following Efron et al. [12]. The temporal smoothing parameter was set to \mu_t = 150 for allrel and \mu_t = 350 for highrel. Non-English tweets were filtered out with the language detector ldig,*3 and retweets beginning with "RT" were removed following Choi and Croft [8], since the TREC microblog track guidelines [29], [33] treat retweets as non-relevant.*4 For each topic, the top 1,000 tweets were retrieved.

*3 https://github.com/shuyo/ldig
*4 Tweets whose HTTP status code at crawl time was 301, 302, 403, or 404 (redirected or deleted tweets) were also removed.

4.2 Compared methods
4.2.1 Concepts and model variants
Following the sequential dependence model [25], candidate concepts are single terms and two-term concepts: exact phrases #1(w_1, w_2) and unordered windows of width 8, #uw8(w_1, w_2). In Eq. (7), the collection probability P(c|C) is estimated as df(c)/N, where df(c) is the document frequency of concept c and N is the number of documents in the collection. Besides ctrm, we evaluate a word-based temporal relevance model (wtrm), which restricts ctrm to single words; likewise, crm (Eq. (4)) restricted to single words reduces to Lavrenko's word-based relevance model (wrm) [17].
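The retweet and language filtering of Section 4.1.3 can be sketched as follows; the function name is illustrative, and the `is_english` predicate stands in for a real language detector such as ldig:

```python
def filter_tweets(tweets, is_english=lambda text: True):
    """Drop retweets (text beginning with "RT") and non-English tweets,
    mirroring the corpus preprocessing described in Section 4.1.3."""
    return [t for t in tweets
            if not t.lstrip().startswith("RT") and is_english(t)]
```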

Table 3: Summary of the evaluated retrieval methods, by whether they use lexical evidence, temporal evidence, and concepts.

    Method   Lexical   Temporal   Concept
    wrm      yes       -          -
    crm      yes       -          yes
    wtrm     yes       yes        -
    ctrm     yes       yes        yes

Note that crm corresponds to LCE (Eq. (3)) under the simplifications of Bendersky et al. [4], and that ctrm differs from crm only in its additional use of the temporal feedback documents R_t. As Table 3 shows, comparing wrm with crm (and wtrm with ctrm) isolates the effect of concepts, while comparing wrm with wtrm (and crm with ctrm) isolates the effect of temporal evidence.

4.2.2 Query reformulation
URLs, @-mentions, and #-hashtags are removed from the top-M feedback tweets. Assuming uniform priors P(D_l) and P(D_t), the top-k concepts by S_{crm}(c, Q) or S_{ctRM}(c, Q) are added to the query. Figure 2 shows an example of a reformulated query.

Figure 2: Example of query reformulation of topic MB001, "BBC World Service staff cuts," from the TREC 2011 microblog track queries.

    #weight( λ1 #combine(bbc world service staff cuts)
             λ2 #weight( s1 #1(service outlines)
                         s2 #uw8(bbc outlines)
                         s3 outlines
                         ...
                         sk #1(weds bbcworldservice)))

Queries are executed with Indri [34]. Each concept weight is normalized as s_i = S_{ctRM}(c_i, Q) / \sum_{j=1}^{k} S_{ctRM}(c_j, Q), and the original and expanded parts are weighted 1:1 (λ1 = λ2 = 0.5).

4.2.3 Parameter settings
For wtrm and ctrm we tuned the number of lexical feedback documents M ∈ {10, 20, 30}, the number of temporal feedback documents N ∈ {10, 20, 30}, and the number of expansion terms or concepts k ∈ {10, 20, 30, 40}, separately for allrel and highrel, selecting the values that maximize average precision (AP) over the top 1,000 results [6]. Table 4 lists the selected values for the TREC 2011 and 2012 topic sets.

Table 4: Parameters for IR models.

                      allrel           highrel
    Method   Year   M    N    K      M    N    K
    wtrm     2011   10   30   10     10   30   40
    ctrm     2011   30   20   20     10   30   40
    wtrm     2012   30   10   10     30   10   20
    ctrm     2012   20   30   40     10   30   40
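The query reformulation of Figure 2 can be sketched as string assembly of an Indri #weight query; the builder function is hypothetical, and concepts are passed as (Indri expression, score) pairs whose weights are normalized as in Section 4.2.2:

```python
def build_expanded_query(original_terms, scored_concepts, lam1=0.5, lam2=0.5):
    """Build an Indri #weight query as in Figure 2: the original query
    via #combine, plus expansion concepts weighted by the normalized
    scores s_i = S(c_i,Q) / sum_j S(c_j,Q)."""
    total = sum(s for _, s in scored_concepts)
    inner = " ".join(f"{s / total:.4f} {c}" for c, s in scored_concepts)
    return (f"#weight( {lam1} #combine({' '.join(original_terms)}) "
            f"{lam2} #weight( {inner} ))")
```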

4.3 Evaluation measures
We evaluate retrieval effectiveness with average precision (AP), precision at rank 30 (P@30), and binary preference (bpref) [7] over the top 1,000 results; P@30 is the official measure of the TREC 2011 and 2012 microblog tracks. Statistical significance is tested with a paired randomization test [32] using 100,000 samples at the p < 0.05 level.

4.4 Results
4.4.1 Word-based PRF
Table 5 compares the word-based methods against the LM baseline, with wrm following Lavrenko and Croft [17]. Superscripts α and β denote statistically significant improvements over LM and wrm, respectively. Both wrm and wtrm significantly outperform LM on allrel and highrel, and wtrm significantly outperforms wrm in AP and bpref on allrel.

Table 5: Performance comparison of the word-based PRF methods.

              allrel                          highrel
    Method  AP        P@30      bpref       AP       P@30     bpref
    LM      0.2936    0.3981    0.3103      0.2130   0.1723   0.1933
    wrm     0.3502α   0.4543α   0.3594α     0.2473α  0.2086α  0.2242
    wtrm    0.3726αβ  0.4660α   0.3872αβ    0.2580α  0.2094α  0.2361α

4.4.2 Concept-based PRF
Table 6 compares the concept-based methods, with crm following Bendersky et al. [4]. Superscripts α and β denote statistically significant improvements over LM and crm, respectively. Both crm and ctrm significantly outperform LM, and ctrm significantly outperforms crm in bpref on allrel and in AP on highrel.

Table 6: Performance comparison of the concept-based PRFs.

              allrel                          highrel
    Method  AP        P@30      bpref       AP        P@30     bpref
    LM      0.2936    0.3981    0.3103      0.2130    0.1723   0.1933
    crm     0.3385α   0.4506α   0.3479α     0.2511α   0.2060α  0.2356α
    ctrm    0.3644α   0.4485    0.3825αβ    0.2694αβ  0.2101α  0.2527α

4.4.3 Word-based vs. concept-based temporal PRF
Table 7 compares the two proposed temporal PRF methods; superscripts α and β denote statistically significant improvements over wtrm and ctrm, respectively. On allrel, wtrm achieves significantly higher P@30 than ctrm, whereas on highrel ctrm outperforms wtrm on every measure. As an example, for topic MB109, "Gasland,"*5 wtrm suggested the word "oscar."

Table 7: Performance comparison of the proposed word- and concept-based temporal PRF methods.

              allrel                          highrel
    Method  AP        P@30      bpref       AP        P@30     bpref
    wtrm    0.3726    0.4660β   0.3872      0.2580    0.2094   0.2361
    ctrm    0.3644    0.4485    0.3825      0.2694    0.2101   0.2527

*5 "Gasland" (topic MB109) is a documentary film whose Academy Award nomination was announced in 2011.
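The evaluation measures of Section 4.3 are standard; a minimal sketch of AP and P@k over a ranked list and a relevant-document set (function names illustrative):

```python
def average_precision(ranked, relevant, cutoff=1000):
    """AP over the top-`cutoff` results: mean of precision at each
    relevant rank, normalized by the number of relevant documents."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k=30):
    """P@k: fraction of the top-k result slots occupied by relevant docs."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k
```

Note that P@k divides by k even when fewer than k documents are retrieved, as in the official trec_eval convention.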

ctrm, by contrast, suggested the concept "oscar nomination" rather than the separate words "oscar" and "nomination": the concept pinpoints tweets about the film's Academy Award nomination and thus discriminates relevant from non-relevant tweets better than the individual words, on both allrel and highrel.

5. Conclusion

We proposed a concept-based query expansion method (ctrm) based on a temporal relevance model that tracks the temporal variation of concepts on microblogs, extending both the relevance model and Latent Concept Expansion. Experiments on the TREC 2011 and 2012 microblog test collections showed that ctrm significantly improves search performance, especially when retrieving highly relevant documents.

References
[1] Bendersky, M. and Croft, W.B.: Discovering key concepts in verbose queries, SIGIR, pp. 491-498 (2008).
[2] Bendersky, M. and Croft, W.B.: Modeling higher-order term dependencies in information retrieval using query hypergraphs, SIGIR, pp. 941-950 (2012).
[3] Bendersky, M., Metzler, D. and Croft, W.B.: Learning concept importance using a weighted dependence model, WSDM, pp. 31-40 (2010).
[4] Bendersky, M., Metzler, D. and Croft, W.B.: Parameterized concept weighting in verbose queries, SIGIR, pp. 605-614 (2011).
[5] Bendersky, M., Metzler, D. and Croft, W.B.: Effective query formulation with multiple information sources, WSDM, pp. 443-452 (2012).
[6] Buckley, C. and Voorhees, E.M.: Evaluating evaluation measure stability, SIGIR, pp. 33-40 (2000).
[7] Buckley, C. and Voorhees, E.M.: Retrieval evaluation with incomplete information, SIGIR, pp. 25-32 (2004).
[8] Choi, J. and Croft, W.B.: Temporal models for microblogs, CIKM, pp. 2491-2494 (2012).
[9] Dakka, W., Gravano, L. and Ipeirotis, P.G.: Answering general time-sensitive queries, TKDE, Vol. 24, No. 2, pp. 220-235 (2012).
[10] Efron, M. and Golovchinsky, G.: Estimation methods for ranking recent information, SIGIR, pp. 495-504 (2011).
[11] Efron, M.: Query-specific recency ranking: Survival analysis for improved microblog retrieval, #TAIA (2012).
[12] Efron, M., Organisciak, P. and Fenlon, K.: Improving retrieval of short texts through document expansion, SIGIR, pp. 911-920 (2012).
[13] Jones, R. and Diaz, F.: Temporal profiles of queries, TOIS, Vol. 25, No. 3 (2007).
[14] Keikha, M., Gerani, S. and Crestani, F.: Time-based relevance models, SIGIR, pp. 1087-1088 (2011).
[15] Krovetz, R.: Viewing morphology as an inference process, SIGIR, pp. 191-202 (1993).
[16] Lang, H., Metzler, D., Wang, B. and Li, J.-T.: Improved latent concept expansion using hierarchical Markov random fields, CIKM, pp. 249-258 (2010).
[17] Lavrenko, V. and Croft, W.B.: Relevance based language models, SIGIR, pp. 120-127 (2001).
[18] Lease, M.: An improved Markov random field model for supporting verbose queries, SIGIR, pp. 476-483 (2009).
[19] Li, X. and Croft, W.: Time-based language models, CIKM, pp. 469-475 (2003).
[20] Liang, F., Qiang, R. and Yang, J.: Exploiting real-time information retrieval in the microblogosphere, JCDL, pp. 267-276 (2012).
[21] Lin, J. and Efron, M.: Temporal relevance profiles for tweet search, #TAIA (2013).
[22] Macdonald, C. and Ounis, I.: Global statistics in proximity weighting models, SIGIR Web N-gram Workshop (2010).
[23] Massoudi, K., Tsagkias, M., de Rijke, M. and Weerkamp, W.: Incorporating query expansion and quality indicators in searching microblog posts, ECIR, pp. 362-367 (2011).
[24] Metzler, D., Cai, C. and Hovy, E.: Structured event retrieval over microblog archives, HLT/NAACL, pp. 646-655 (2012).
[25] Metzler, D. and Croft, W.B.: A Markov random field model for term dependencies, SIGIR, pp. 472-479 (2005).
[26] Metzler, D. and Croft, W.B.: Latent concept expansion using Markov random fields, SIGIR, pp. 311-318 (2007).
[27] Miyanishi, T., Seki, K. and Uehara, K.: Combining recency and topic-dependent temporal variation for microblog search, ECIR, pp. 331-343 (2013).
[28] Miyanishi, T., Seki, K. and Uehara, K.: Improving pseudo-relevance feedback via tweet selection, CIKM, pp. 439-448 (2013).
[29] Ounis, I., Macdonald, C., Lin, J. and Soboroff, I.: Overview of the TREC-2011 microblog track, TREC (2011).
[30] Peetz, M.H., Meij, E., de Rijke, M. and Weerkamp, W.: Adaptive temporal query modeling, ECIR, pp. 455-458 (2012).
[31] Ponte, J. and Croft, W.: A language modeling approach to information retrieval, SIGIR, pp. 275-281 (1998).
[32] Smucker, M.D., Allan, J. and Carterette, B.: A comparison of statistical significance tests for information retrieval evaluation, CIKM, pp. 623-632 (2007).
[33] Soboroff, I., Ounis, I. and Lin, J.: Overview of the TREC-2012 microblog track, TREC (2012).
[34] Strohman, T., Metzler, D., Turtle, H. and Croft, W.: Indri: A language model-based search engine for complex queries, ICIA, pp. 2-6 (2005).
[35] Teevan, J., Ramage, D. and Morris, M.: #TwitterSearch: A comparison of microblog search and web search, WSDM, pp. 35-44 (2011).
[36] Zhai, C. and Lafferty, J.: A study of smoothing methods for language models applied to information retrieval, TOIS, Vol. 22, No. 2, pp. 179-214 (2004).