Concept Tracking for Microblog Search

Size: px

Start display at page:

Download "Concept Tracking for Microblog Search"

Phyllis Lewis
5 years ago
Views:

1 1,a) 1,b) 1,c) , Tweets2011 Twitter Concept Tracking for Microblog Search Taiki Miyanishi 1,a) Kazuhiro Seki 1,b) Kuniaki Uehara 1,c) Received: December 22, 2013, Accepted: April 7, 2014 Abstract: Incorporating the temporal property of words into query expansion methods based on relevance feedback has been shown to have a significant positive effect on microblog searching. In this paper, we propose a concept-based query expansion method based on a temporal relevance model that uses the temporal variation of concepts (e.g., terms or phrases) on microblogs. Our model naturally extends an extremely effective existing concept-based relevance model by tracking the concept frequency over time. Moreover, the proposed model produces important concepts that are frequently used within a particular time period associated with a given topic, which have more power to discriminate between relevant and non-relevant microblog documents than words. Our experiments using a corpus of microblog data (the Tweets2011 corpus) show that the proposed concept-based query expansion method improves search performance significantly, especially when retrieving highly relevant documents. Keywords: Twitter, microblog search, pseudo-relevance feedback, query expansion 1. 1 Graduate School of System Informatics, Kobe University, Kobe, Hyogo , Japan a) miyanishi@ai.cs.kobe-u.ac.jp b) seki@cs.kobe-u.ac.jp c) uehara@kobe-u.ac.jp [10], [11], [12], [20], [21], [30] [8], [23], [24], [27], [28] WebDB 2013 c 2014 Information Processing Society of Japan 1

2 [27] [1], [2], [3], [4], [5], [16], [18], [25], [26] Web Latent Concept Expansion; LCE [26] wrm [17] ctrm TREC White House spokesman replaced MB044 1 Joseph R. Diden Jr. Jay Carney wrm ctrm Lexical ctrm Temporal 1 wrm jay carney ctrm Lexical news spokesman press secretary jay carney 1 Table 1 MB044: White House spokesman replaced Example of expanded words and concepts about a topic MB044: White House spokesman replaced suggested by a word-based PRF (wrm) and a concept-based temporal one (ctrm). wrm ctrm (Lexical) ctrm (Temporal) jay jay carney carney carney jay qantas qantas press secretary new new spokesman jay carney obama new biden spokesman ctrm ctrm Temporal ctrm Lexical ctrm LCE TREC ,600 tweet Twitter Tweets2011 * [35] Li [19] *1 c 2014 Information Processing Society of Japan 2

3 Efron [10] Dakka [9] Keikha [14] Lin [21] Miyanishi [27] Jones [13] 2 Efron [12] Miyanishi [28] tweet 1 Twitter 2.2 Markov Random Field; MRF MRF Web Metzler [25] MRF MRF [26] LCE Bendersky [3] Wikipedia Bendersky [4], [5] Bendersky [2] TREC Web Wikipedia 3. Ponte [31] 3.1 Q D P (D Q) P (D Q) P (Q D) P (D) P (D Q) P (Q D)P (D) P (D) D n q 1,q 2,...,q n Q P (Q D) =P (q 1,q 2,...,q n D) n P (Q D) = P (q i D) (1) i=1 n q i c 2014 Information Processing Society of Japan 3

4 Q i P (q D) f(w;d) P ml (w D) = w V f(w ;D) f(w; D) D w w V f(w ; D) D V (1) 1 P (Q D) 0 [36] P (w D) = D D + μ P μ ml(w D)+ D + μ P (w C) P (w C) C w μ 3.2 Lavrenko [17] P (w R) w w P (w R) P (w Q) = D R P (w, D Q) = 1 P (Q) P (w, Q D)P (D) D R D R P (D)P (w D) n P (q i D), (2) R Q M P (Q) P (w Q) D w q i (2) P (w Q) w k 3.3 Metzler [26] LCE LCE 1 LCE i [25] LCE Bendersky [4] LCE LCE R M c S LCE (c, Q) D R exp{γ 1 φ 1 (Q, D)+γ 2 φ 2 (c, D) γ 3 φ 3 (c, C)}(3) φ 1 (Q, D) Q D c φ 2 (c, D) c D φ 3 (c, C) C c γ 1 γ 2 γ 3 (3) γ 1 = γ 2 = γ 3 =1 φ 1 (Q, D) = log P (Q D) φ 2 (c, D) =logp (c D) φ 3 (c, C) =0 *2 D Q ˆq 1, ˆq 2,...,ˆq m c bag-of-concepts c Q S crm (c, Q) S crm (c, Q) D R P (D)P (c D) m P (ˆq j D) (4) ˆq j Q j (4) (3) (4) (2) Lavrenko [17] (2) (4) 3.4 [8], [23], [24], [27], [28] *2 [22] φ 3 (c, C) =0 j c 2014 Information Processing Society of Japan 4

5 P (c R) 2 c R l R t P (c Q) P (c Q) = P (c, D l,d t Q) D l R l D t R t = P (D l c, D t,q)p(c, D t Q) (5) D t R t D l R l D l R l D t R t Efron [10] D t D l (5) D t P (Q) P (c Q) P (c Q) = P (D l c, Q) P (c, D t Q) D l R l D t R t 1 = P (c, D l Q) P (c, D t Q) P (c Q) D l R t D t R t 1 P (D l )P (c, Q D l ) P (c Q) D l R t P (D t )P (c, Q D t ) D t R t bag-of-concepts D l D t Q ˆq 1, ˆq 2,...,ˆq m c 1 m P (c Q) P (D l )P (c D l ) P (ˆq j D l ) P (c Q) D l R l j m P (D t )P (c D t ) P (ˆq j D t ) D t R t j P (c Q) c Q S ct RM (c, Q) c { m S ct RM (c, Q) rank = P (D l )P (c D l ) P (ˆq j D l ) D l R l j }{{} Lexical m } 1/2 P (D t )P (c D t ) P (ˆq j D t ), D t R t j }{{} Temporal (6) P (c D t ) t c P (ˆq j D t ) t ˆq j (6) P (c D t ) m j P (ˆq j D t ) c ˆq 1, ˆq 2,...,ˆq m (6) R t c P (ˆq j D t ) P (c D t ) D l R l P (D l )P (c D l )P (Q D l ) (4) LCE (6) (1) P (c D t ) 0 P (c D t ) P (c D t )= D t ˆPml (c D t )+ P (c C) (7) D t + μ t D t + μ t D t D t f(c;d t) P ml (c D t )= c Vc f(c ;D V t) c f(c; D t ) t c μ t S ct RM (c, Q) k TREC Tweets2011 tweet Tweets ,600 tweet <num> <title> <querytime> TREC <title> TREC tweet tweet <num> MB001 <title> BBC World Service staff cuts <querytime> Tue Feb 08 12:30: TREC 2011 Fig. 1 Example topic from the TREC 2011 microblog track. μ t c 2014 Information Processing Society of Japan 5

6 2 Table 2 TREC Summary of TREC collections and topics used for evaluation. Name Type #Topics Topic Numbers TREC 2011 allrel highrel 33 1, 10-30, 32, 36-38, 40-42, 44-46, 49 TREC 2012 allrel , highrel 56 51, 52, 54-68, 70-75, , tweet allrel tweet highrel TREC 2011 TREC 2012 allrel highrel tweet tweet Indri [34] stopword Krovetz stemmer [15] stemming Tweet Tweet Indri [36] [12] μ = 2500 LM μ t allrel μ t = 150 highrel μ t = 350 LM tweet ldig *3 retweet RT tweet *4 TREC [29], [33] retweet [8] retweet retweet *3 *4 tweet tweet retweet retweet HTTP tweet 1, tweet sequential-dependence [25] w 1 w 2 #1(w 1,w 2 ) w 1 w 2 8 #8(w 1,w 2 ) (7) P (c C) df (c)/n P (c C) df (c) c N df (c)/n P (c C) μ t (4) ctrm 2 wtrm wtrm (4) 1 wtrm ctrm wtrm ctrm wtrm ctrm wrm [17] wrm 2 2 c 2014 Information Processing Society of Japan 6

7 3 Lexical Temporal Concept Table 3 Summary of the evaluated retrieval methods. Method Lexical Temporal Concept wrm crm wtrm ctrm crm LCE (3) LCE Bendersky [4] LCE crm ctrm ctrm crm R t 3 wrm crm wtrm ctrm 3 crm ctrm wrm wtrm wtrm ctrm wrm crm wrm crm wtrm ctrm tweet Tweets2011 tweet wrm crm wtrm ctrm # tweet P (D l ) P (D t ) S crm (c, Q) S ct RM (c, Q) k S crm (c, Q) S ct RM (c, Q) k 2 #weight( λ 1 #combine(bbc world service staff cuts) λ 2 #weight( s 1 #1(service outlines) s 2 #uw8(bbc outlines) s 3 outlines... s k #1(weds bbcworldservice))) TREC 2011 MB001: BBC world service staff cuts Fig. 2 Example of query reformulation of topic MB001: BBC world service staff cuts from TREC 2011 Microblog track queries. 4 Table 4 Parameters for IR models. allrel highrel Method Year M N K M N K wtrm ctrm wtrm ctrm Indri [34] 2 i ScT RM (ci,q) s i = k j ScT RM (cj,q) 1:1 λ 1 = λ 2 = M wtrm ctrm N k M N k allrel highrel 1,000 Average Precision AP AP Precision [6]TREC ,000 AP TREC 2011 TREC 2012 TREC wtrm ctrm M N k TREC c 2014 Information Processing Society of Japan 7

8 Table 5 5 The performance comparison of the word-based PRFs. allrel highrel Method AP P@30 bpref AP P@30 bpref LM wrm α α α α α wtrm α β α α β α α α Table 6 6 The performance comparison of the concept-based PRFs. allrel highrel Method AP P@30 bpref AP P@30 bpref LM crm α α α α α α ctrm α α α β α β α α 4.3 Average Precision AP 30 Precision P@30 binary preference bpref [7] P@30 TREC 2011 TREC 2012 P@30 AP bpref AP Precision TREC 2012 highrel allrel highrel t [32] 100,000 p< wtrm ctrm wtrm ctrm wrm crm LM [17] wrm wtrm AP P@30 bpref LM wrm wtrm α β γ 5 wrm wtrm allrel highrel LM allrel highrel wtrm wrm allrel AP bpref [4] crm Table 7 7 The performance comparison of the proposed wordand concept-based temporal PRF methods. allrel highrel Method AP P@30 bpref AP P@30 bpref wtrm β ctrm ctrm AP P@30 bpref LM crm ctrm α β γ 6 crm ctrm allrel highrel LM ctrm crm allrel bref highrel AP wtrm ctrm AP P@30 bpref wtrm ctrm α β 7 allrel wtrm ctrm P@30 P@30 highrel ctrm wtrm tweet Gasland *5 wtrm oscar *5 Gasland MB c 2014 Information Processing Society of Japan 8

9 nomination ctrm oscar nomination nomination oscar nomination ctrm wtrm allrel highrel 5. 2 TREC [1] Bendersky, M. and Croft, W.B.: Discovering key concepts in verbose queries, SIGIR, pp (2008). [2] Bendersky, M. and Croft, W.B.: Modeling higher-order term dependencies in information retrieval using query hypergraphs, SIGIR, pp (2012). [3] Bendersky, M., Metzler, D. and Croft, W.B.: Learning concept importance using a weighted dependence model, WSDM, pp (2010). [4] Bendersky, M., Metzler, D. and Croft, W.B.: Parameterized concept weighting in verbose queries, SIGIR, pp (2011). [5] Bendersky, M., Metzler, D. and Croft, W.B.: Effective query formulation with multiple information sources, WSDM, pp (2012). [6] Buckley, C. and Voorhees, E.M.: Evaluating evaluation measure stability, SIGIR, pp (2000). [7] Buckley, C. and Voorhees, E.M.: Retrieval evaluation with incomplete information, SIGIR, pp (2004). [8] Choi, J. and Croft, W.B.: Temporal models for microblogs, CIKM, pp (2012). [9] Dakka, W., Gravano, L. and Ipeirotis, P.G.: Answering general time-sensitive queries, TKDE, Vol.24, No.2, pp (2012). [10] Efron, M. and Golovchinsky, G.: Estimation methods for ranking recent information, SIGIR, pp (2011). [11] Efron, M.: Query-specific recency ranking: Survival analysis for improved microblog retrieval, #TAIA (2012). [12] Efron, M., Organisciak, P. and Fenlon, K.: Improving retrieval of short texts through document expansion, SI- GIR, pp (2012). [13] Jones, R. and Diaz, F.: Temporal profiles of queries, TOIS, Vol.25, No.3 (2007). [14] Keikha, M., Gerani, S. and Crestani, F.: Time-based relevance models, SIGIR, pp (2011). [15] Krovetz, R.: Viewing morphology as an inference process, SIGIR, pp (1993). [16] Lang, H., Metzler, D., Wang, B. and Li, J.-T.: Improved latent concept expansion using hierarchical Markov random fields, CIKM, pp (2010). [17] Lavrenko, V. and Croft, W.B.: Relevance based language models, SIGIR, pp (2001). [18] Lease, M.: An improved Markov random field model for supporting verbose queries, SIGIR, pp (2009). [19] Li, X. and Croft, W.: Time-based language models, CIKM, pp (2003). [20] Liang, F., Qiang, R. and Yang, J.: Exploiting real-time information retrieval in the microblogosphere, JCDL, pp (2012). [21] Lin, J. and Efron, M.: Temporal relevance profiles for tweet search, #TAIA (2013). [22] Macdonald, C. and Ounis, I.: Global statistics in proximity weighting models, SIGIR Web N-gram Workshop (2010). [23] Massoudi, K., Tsagkias, M., de Rijke, M. and Weerkamp, W.: Incorporating query expansion and quality indicators in searching microblog posts, ECIR, pp (2011). [24] Metzler, D., Cai, C. and Hovy, E.: Structured event retrieval over microblog archives, HLT/NAACL, pp (2012). [25] Metzler, D. and Croft, W.B.: A Markov random field model for term dependencies, SIGIR, pp (2005). [26] Metzler, D. and Croft, W.B.: Latent concept expansion using Markov random fields, SIGIR, pp (2007). [27] Miyanishi, T., Seki, K. and Uehara, K.: Combining recency and topic-dependent temporal variation for microblog search, ECIR, pp (2013). [28] Miyanishi, T., Seki, K. and Uehara, K.: Improving pseudo-relevance feedback via tweet selection, CIKM, pp (2013). [29] Ounis, I., Macdonald, C., Lin, J. and Soboroff, I.: Overview of the TREC-2011 microblog track, TREC (2011). [30] Peetz, M.H., Meij, E., de Rijke, M. and Weerkamp, W.: Adaptive temporal query modeling, ECIR, pp (2012). [31] Ponte, J. and Croft, W.: A language modeling approach to information retrieval, SIGIR, pp (1998). [32] Smucker, M.D., Allan, J. and Carterette, B.: A comparison of statistical significance tests for information retrieval evaluation, CIKM, pp (2007). [33] Soboroff, I., Ounis, I. and Lin, J.: Overview of the TREC-2012 microblog track, TREC (2012). [34] Strohman, T., Metzler, D., Turtle, H. and Croft, W.: Indri: A language model-based search engine for complex queries, ICIA, pp.2 6 (2005). [35] Teevan, J., Ramage, D. and Morris, M.: #Twitter- Search: A comparison of microblog search and web search, WSDM, pp (2011). [36] Zhai, C. and Lafferty, J.: A study of smoothing methc 2014 Information Processing Society of Japan 9

10 ods for language models applied to information retrieval, TOIS, Vol.22, No.2, pp (2004) Web Ph.D AAAI c 2014 Information Processing Society of Japan 10

Mining the Temporal Statistics of Query Terms for Searching Social Media Posts

Mining the Temporal Statistics of Query Terms for Searching Social Media Posts Jinfeng Rao, 1 Ferhan Ture, 2 Xing Niu, 1 and Jimmy Lin 3 1 Department of Computer Science, University of Maryland 2 Comcast