pp.379-384 http://dx.doi.org/10.14257/astl.2013.29.78

A Framework of Detecting Burst Events from Micro-blogging Streams

Kaifang Yang, Yongbo Yu, Lizhou Zheng, Peiquan Jin
School of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, China
jpq@ustc.edu.cn

Abstract. Micro-blogs greatly accelerate the flow of information on the Internet. Recent studies have shown that burst events spread much faster on micro-blogging platforms than in any other medium. It is therefore very useful to detect burst events from micro-blogging streams, so that real-time news events can be acquired and used for government and business decision making. Traditional event detection methods focus mainly on public news media and are not well suited to micro-blogging platforms. In this paper, we propose a framework for burst event detection that emphasizes the characteristics of micro-blogging streams. Experimental results on a real data set crawled from a commercial micro-blogging platform demonstrate the effectiveness of our method.

Keywords: Micro-blog, Event Detection, Burst Event.

1 Introduction

Micro-blogging platforms have become very popular in people's daily lives. For example, Sina Weibo is a popular social networking platform in China where users share and discuss news and events. Compared with traditional Web pages, micro-blogs have the following features: (a) the length of a micro-blog is restricted to 140 characters, which makes each message short and brief; (b) micro-blogs are more colloquial and informal; (c) micro-blogs usually report more real-time information. Since micro-blogging services have become one of the major information sharing platforms, both governments and businesses want to extract valuable information, such as burst events, from micro-blogs.
However, traditional event detection methods are designed for long texts such as Web pages and are not suitable for micro-blogs. It is therefore urgent to study effective methods for burst event detection on micro-blogging platforms. In this paper we introduce a new method to solve this problem. Compared with previous work, our contributions can be summarized as follows:

(1) Based on the characteristics of micro-blogs, we propose a framework to detect burst events from micro-blogging streams.
(2) We conduct experiments on a real data set crawled from Sina Weibo, and the experimental results demonstrate the effectiveness of our method.

The rest of the paper is organized as follows. Section 2 describes related work. Section 3 discusses the framework for detecting burst events from micro-blogs. Section 4 gives the experimental results, and Section 5 presents the conclusion and future work.

ISSN: 2287-1233 ASTL Copyright 2013 SERSC

2 Related Work

Burst event detection concerns events tied to a specific time and place, a notion that extends the definition of a topic. The prevailing view is that a topic covers a central event or activity together with directly related events and the ensuing discussion. An event, by contrast, usually refers to something that happened at a particular time and place and involves specific subjects (people, institutions, organizations, and so on). In this paper we do not draw a sharp distinction between topic detection and event detection.

Traditional methods for burst event detection on micro-blogs can be divided into the following categories:

(a) Using traditional topic detection and tracking (TDT), combined with widely used supervised machine learning methods, to obtain candidate events [1, 2].
(b) Monitoring sudden changes in word frequency: [3] examines the occurrences of each word in recent time slices to determine whether it is a burst term, and [4] proposes using wavelets to analyze microblogging information.
(c) Combining sentiment analysis methods to detect burst events [5].
(d) Improved clustering based on microblog features: latent semantic analysis (LSA) is applied to the term TF-IDF matrix and combined with microblog-specific features such as tags and emoticons to cluster microblogs, detect events, and extract event summaries.
Most existing microblogging event detection methods either do not fully consider the features of microblogging data or consider them too simply [3, 6]. Following the method in [6], we propose two measures, burst heat and burst factor. We filter microblogging keywords according to these measures and then combine the remaining keywords to obtain the summary of each event.

3 A Framework to Detect Burst Events from Micro-blogs

The framework of our method is shown in Fig. 1.
[Figure 1: three-stage pipeline. Preprocessing: micro-blogging message acquisition, data cleaning, segmentation and POS tagging. Model: feature trajectory, discrete Fourier transform, word filtering. Summary: word merger, summary.]

Fig. 1. The framework of detecting burst events from micro-blogs

First, we preprocess the original data set to obtain clean text from the raw data. Then we define a model to identify the burst intervals and burst words. Finally, we produce the burst event summary and other parameters.

3.1 Preprocessing

Micro-blogging message acquisition. We use the API provided by Sina Weibo to acquire public micro-blogging texts.

Data cleaning. The acquired messages contain a lot of noise that is not useful for detecting burst events. During preprocessing we remove this noise from the message contents, including emoticons, mentioned user names, URLs, and other non-text content.

Segmentation and POS tagging. We use the widely used ICTCLAS [7] software, which segments text with hidden Markov models. After segmentation, we discard the stop words.

3.2 Feature trajectory model

Lexical item features. To measure lexical items, we use the definition of feature trajectories from [6]. The trajectory (DF-IDF) of word $w$ over the event window $T$ is

$$y_w = [y_w(1), y_w(2), \ldots, y_w(T)], \qquad y_w(t) = \frac{DF_w(t)}{N(t)} \times \log\frac{N}{DF_w},$$

where $DF_w(t)$ is the number of microblogs that contain word $w$ in slice $t$, $N(t)$ is the total number of microblogs in slice $t$, $N$ is the total number of microblogs in window $T$, and $DF_w$ is the number of microblogs that contain word $w$ in window $T$.

Discrete Fourier Transform (DFT). Applying the discrete Fourier transform to a feature trajectory yields the corresponding spectrogram. From the spectrogram we define the burst heat

$$S_w = \max\{\|x_k\|^2\}, \quad k = 2, 3, \ldots, T,$$

and the burst factor

$$\text{burstness} = \frac{S_w}{\|x_1\|^2},$$

where $\|x_1\|^2$ is the average DF-IDF value over the corresponding interval $T$. The spectrogram can be used for frequency-domain analysis and word filtering.

Word filtering. The spectrogram shows the burst degree of each word, so hot words can be selected according to their spectrogram features. First, we remove periodic signals: words whose primary cycle is less than one are filtered out. By setting different thresholds, we can detect burst words at different levels of heat. It is also worth noting that many words in the experiments appear only once or a few times in a time slice; these are considered noise.

3.3 Words Merger and Summary

Word merger. A burst event or topic may have multiple burst words, so we use a clustering method to merge them. We define the word distance as the degree of co-occurrence. The co-occurrence of two words $w_1$ and $w_2$, and of two word clusters $R_1$ and $R_2$, is defined as

$$d(w_1, w_2) = \frac{|M_{w_1} \cap M_{w_2}|}{\min\{|M_{w_1}|, |M_{w_2}|\}}, \qquad d(R_1, R_2) = \min\{d(w_1, w_2)\}, \quad w_1 \in R_1,\ w_2 \in R_2,$$

where $M_w$ is the set of microblogs that contain word $w$ and $|M_w|$ is their number. We use hierarchical clustering to group the burst words and obtain the summary of each burst event.

Summary. After the hot words are merged, we can generate the summary of each hot event. An event summary has the following format: event = {burst word set, burst time interval, heat, burst factor}.

4 Experimental Results

4.1 Data Set

We acquired a collection of 1,728,000 microblogs from Sina Weibo; the sampling rate is less than 0.57%. In our experiment, the observation window T is 3*24 hours. The burst factor threshold is the mean value of the burst factor.
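To make the Section 3.2 definitions and the mean-value threshold concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each time slice is a list of microblogs already segmented into sets of words; the function names, the toy data, and the convention that index 0 of the power spectrum plays the role of $x_1$ (the constant component) are all assumptions made for illustration.

```python
import cmath
import math
from typing import List, Sequence, Set, Tuple


def dfidf_trajectory(slices: Sequence[Sequence[Set[str]]], word: str) -> List[float]:
    """y_w(t) = DF_w(t) / N(t) * log(N / DF_w) for every slice t."""
    df_t = [sum(1 for post in s if word in post) for s in slices]  # DF_w(t)
    n_t = [len(s) for s in slices]                                  # N(t)
    df_w, n = sum(df_t), sum(n_t)                                   # DF_w and N
    if df_w == 0:
        return [0.0] * len(slices)
    idf = math.log(n / df_w)
    return [(d / m) * idf if m else 0.0 for d, m in zip(df_t, n_t)]


def power_spectrum(y: Sequence[float]) -> List[float]:
    """Squared DFT magnitudes; index 0 is assumed to match k = 1 in the text."""
    size = len(y)
    return [
        abs(sum(y[t] * cmath.exp(-2j * cmath.pi * k * t / size) for t in range(size))) ** 2
        for k in range(size)
    ]


def burst_measures(y: Sequence[float]) -> Tuple[float, float]:
    """Return (burst heat, burst factor) for one trajectory."""
    power = power_spectrum(y)
    if len(power) < 2:
        return 0.0, 0.0
    heat = max(power[1:])                      # maximum over the non-constant components
    factor = heat / power[0] if power[0] > 0 else 0.0
    return heat, factor


def select_burst_words(slices: Sequence[Sequence[Set[str]]], vocabulary: Sequence[str]) -> List[str]:
    """Keep words whose burst factor exceeds the mean burst factor."""
    factors = {w: burst_measures(dfidf_trajectory(slices, w))[1] for w in vocabulary}
    if not factors:
        return []
    threshold = sum(factors.values()) / len(factors)
    return [w for w in vocabulary if factors[w] > threshold]


# Hypothetical toy data: four slices of four posts; "quake" bursts in slice 1 only.
toy_slices = []
for t in range(4):
    posts = [{"weather", "a"}, {"weather", "b"}, {"c"}, {"d"}]
    if t == 1:
        posts = [p | {"quake"} for p in posts[:3]] + posts[3:]
    toy_slices.append(posts)

print(select_burst_words(toy_slices, ["weather", "quake"]))
```

In this toy example the steadily occurring word has a flat trajectory, so almost all of its spectral energy sits in the constant component and its burst factor is close to zero, while the word that appears in a single slice has a burst factor close to one. The naive transform is quadratic in the number of slices, which is acceptable for a window of a few dozen slices.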
The co-occurrence threshold is 0.5. Because there is no standard data set and no standard evaluation criterion for Chinese microblogging burst event detection, we refer to the precision and recall used in traditional information retrieval as evaluation criteria. However, because the sampling rate of the experimental data is less than 0.57%, it is impossible to find all hot events, so recall cannot be measured reliably. This paper therefore uses only the accuracy rate: the ratio of the number of detected events that correspond to real events to the number of all detected burst events.
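The word merger of Section 3.3, together with the co-occurrence threshold of 0.5 stated above, can be sketched as follows. This is an illustration under assumptions, not the authors' code: the paper does not say exactly how the threshold is applied, so the sketch assumes that the two clusters with the highest co-occurrence are merged repeatedly until no pair reaches the threshold. The function names and the toy posting sets are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set


def cooccurrence(m1: Set[int], m2: Set[int]) -> float:
    """d(w1, w2): shared microblogs divided by the size of the smaller set."""
    smaller = min(len(m1), len(m2))
    return len(m1 & m2) / smaller if smaller else 0.0


def cluster_cooccurrence(r1: Sequence[str], r2: Sequence[str],
                         posts_by_word: Dict[str, Set[int]]) -> float:
    """d(R1, R2): the minimum word-level co-occurrence across the two clusters."""
    return min(cooccurrence(posts_by_word[a], posts_by_word[b]) for a in r1 for b in r2)


def merge_burst_words(posts_by_word: Dict[str, Set[int]],
                      threshold: float = 0.5) -> List[List[str]]:
    """Agglomerative merging; stops when no pair of clusters reaches the threshold."""
    clusters = [[w] for w in sorted(posts_by_word)]
    while len(clusters) > 1:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_cooccurrence(clusters[p[0]], clusters[p[1]], posts_by_word),
        )
        if cluster_cooccurrence(clusters[i], clusters[j], posts_by_word) < threshold:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])  # i < j, so index j is still valid
        del clusters[j]
    return clusters


# Hypothetical burst words mapped to the ids of the microblogs that contain them.
toy_posts = {
    "earthquake": {1, 2, 3, 4},
    "magnitude": {2, 3, 4},
    "rescue": {3, 4, 9},
    "concert": {10, 11},
    "tickets": {10, 11, 12},
}
print(merge_burst_words(toy_posts))
```

Taking the minimum over word pairs means a cluster is only merged when every word in it co-occurs sufficiently with every word in the other cluster, which keeps unrelated burst words from being chained together through a single shared term.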
To verify whether the burst events detected in the experiment are real events, we check each of them with the Baidu advanced search tool. We use the summary of the detected event as the search keywords and limit the search period to the time period of the event. If at least three of the top five search results are related to the detected event, we regard it as a real event.

4.2 Experimental Results

In total we detected 40 burst events. According to the verification method described above, 30 of the 40 are real news events, which gives a correct detection rate of 75%. Manual checking of the remaining ten showed that four are commercial promotions or advertisements and the other six are noise data containing hot burst words.

Several factors affect the accuracy. First, Sina Weibo contains many commercial promotions and much noise data, and some zombie followers use hashtags that include burst tags to attract user attention; all of these reduce the purity of the experimental data. Second, limits on the crawling frequency keep the sampling rate low, which also affects the results. Furthermore, since microblogging texts are informal and include many out-of-vocabulary new words, traditional word segmentation tools cannot recognize the new words, so the segmentation error rate on microblog text is higher than on traditional text.

5 Conclusion

In this paper, we proposed a framework for detecting burst events in micro-blogging streams. The experimental results on a real micro-blogging data set showed that the proposed method is effective for burst event detection. However, our method does not consider the noise in micro-blogging streams. Future work will therefore focus on detecting and eliminating noise data in micro-blogs. Another direction for future work is to incorporate social network properties into burst event detection.
Acknowledgement. This paper is supported by the National Science Foundation of China (No. 60776801 and No. 71273010), and the National Science Foundation of Anhui Province (No. 1208085MG117).

References

1. Ritter, Alan, Oren Etzioni, and Sam Clark. Open domain event extraction from twitter. In Proc. of SIGKDD, ACM press, 2012.
2. Popescu, Ana-Maria, Marco Pennacchiotti, and Deepa Paranjpe. Extracting events and event descriptions from twitter. In Proc. of WWW, ACM press, 2011.
3. Mathioudakis, Michael, and Nick Koudas. Twitter monitor: trend detection over the twitter stream. In Proc. of SIGMOD, ACM press, 2010.
4. Weng, Jianshu, and Bu-Sung Lee. Event detection in Twitter. In Proc. of the 5th International AAAI Conference on Weblogs and Social Media, 2011.
5. Thelwall, M., Buckley, K., and Paltoglou, G. Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406-418, 2011.
6. Qi He, Kuiyu Chang, and Ee-Peng Lim. Analyzing feature trajectories for event detection. In Proc. of SIGIR, ACM press, 2007.
7. ICTCLAS, Available at http://ictclas.nlpir.org/