A Framework of Detecting Burst Events from Micro-blogging Streams


pp. 379-384
http://dx.doi.org/10.14257/astl.2013.29.78

Kaifang Yang, Yongbo Yu, Lizhou Zheng, Peiquan Jin
School of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, China
jpq@ustc.edu.cn

Abstract. Micro-blogs greatly accelerate the flow of information on the Internet. Recent studies have shown that burst events spread much faster on micro-blogging platforms than in any other medium. It is therefore very useful to detect burst events from micro-blogging streams, so that real-time news events can be acquired and used for government and business decision making. Traditional event detection methods focus mainly on public news media and are not well suited to micro-blogging platforms. In this paper, we propose a framework for burst event detection that emphasizes the characteristics of micro-blogging streams. Experimental results on a real data set crawled from a commercial micro-blogging platform demonstrate the effectiveness of our method.

Keywords: Micro-blog, Event Detection, Burst Event.

1 Introduction

Micro-blogging platforms have become very popular in people's daily lives. For example, Sina Weibo is a popular social networking platform in China, where users share and discuss news and events. Compared with traditional Web pages, micro-blogs have the following features: (a) the length of a micro-blog is restricted to 140 characters, which makes micro-blogs very short and brief in describing information; (b) micro-blogs are more colloquial and informal; (c) micro-blogs usually report more real-time information. Since micro-blogging services have become one of the major information sharing platforms, both governments and businesses want to extract valuable information, such as burst events, from micro-blogs.
However, traditional event detection methods are designed for long texts such as Web pages and are not suitable for micro-blogs. It is therefore urgent to study effective methods for burst event detection on micro-blogging platforms. In this paper, we introduce a new method to solve this problem. Compared with previous work, our contributions can be summarized as follows: (1) Based on the characteristics of micro-blogs, we propose a framework to detect burst events from micro-blogging streams. (2) We conduct experiments on a real data set crawled from Sina Weibo, and the experimental results demonstrate the effectiveness of our method.

The rest of the paper is organized as follows: Section 2 describes related work. Section 3 discusses the framework for detecting burst events from micro-blogs. Section 4 presents the experimental results, and Section 5 gives the conclusion and future work.

ISSN: 2287-1233 ASTL, Copyright 2013 SERSC

2 Related Work

Burst event detection concerns events that occur at a specific time and place, a notion that extends the definition of topics. The currently popular view is that a topic generally represents a relatively specific event: a central event or activity, together with directly related events and the ensuing discussion. An event usually refers to something that happened at a particular time and place and is related to specific subjects (people, institutions, organizations and so on). In this paper we make no clear distinction between topic detection and event detection.

Existing methods for burst event detection in micro-blogging can be divided into the following categories: (a) Applying traditional topic detection and tracking (TDT), combined with widely used supervised machine learning methods, to obtain candidate events [1, 2]. (b) Monitoring sudden changes in word frequency: [3] determines whether a word is a burst term by detecting its appearances in recent time slices, and [4] proposes using wavelets to analyze micro-blogging information. (c) Combining sentiment analysis methods to detect burst events [5]. (d) Using an improved clustering method based on micro-blog features, which applies latent semantic analysis (LSA) to the TF-IDF matrix of the text vocabulary and combines it with features unique to micro-blogs, such as labels and emoticons, to cluster micro-blogs, detect events and extract event summaries.
Most existing micro-blogging event detection methods either do not fully consider the features of micro-blogging data or consider them too simply [3, 6]. Following the method in [6], we propose two measurement criteria, burst hot and burst factor. We filter micro-blogging keywords according to these criteria and then combine the keywords to obtain the corresponding event summaries.

3 A Framework to Detect Burst Events from Micro-blogs

The framework of our method is shown in Fig. 1.

[Figure: three stages. Preprocessing: Micro-blogging Message Acquisition, Data Cleaning, Segmentation and POS-tagging. Model: Feature Trajectory, Discrete Fourier Transform, Word Filtering. Summary: Word Merger, Summary.]

Fig. 1. The framework of detecting burst events from micro-blogs

First, we preprocess the original data set to obtain clean text from the raw data. Then we define a model to obtain the burst intervals and burst words. Finally, we obtain the burst event summary and other parameters.

3.1 Preprocessing

Micro-blogging message acquisition. We use the API provided by Sina Weibo to acquire public micro-blogging texts.

Data cleaning. The acquired messages contain a lot of noise that is not useful for detecting burst events. During preprocessing, we remove this noise from the message contents, including emoticons, mentioned user names, URLs and other non-text content.

Segmentation and POS-tagging. We use the widely used ICTCLAS [7] software, which segments text using hidden Markov models. After segmentation, we discard the stop-words.

3.2 Feature trajectory model

Lexical item features. To measure lexical items fully, we use the definition of feature trajectories from [6]. The feature trajectory (DF-IDF) of a word w over an event window of T time slices is

$$y_w = [y_w(1), y_w(2), \ldots, y_w(T)], \qquad y_w(t) = \frac{DF_w(t)}{N(t)} \times \log\frac{N}{DF_w},$$

where $DF_w(t)$ is the number of micro-blogs containing word w in slice t, $N(t)$ is the total number of micro-blogs in slice t, $N$ is the total number of micro-blogs in window T, and $DF_w$ is the number of micro-blogs containing word w in window T.

Discrete Fourier Transform (DFT). Applying the discrete Fourier transform to a feature trajectory yields the corresponding spectrogram. From the spectrogram we define the burst hot

$$S_w = \max\{\|X_k\|^2\}, \quad k = 2, 3, \ldots, T,$$

and the burst factor

$$burstness = \frac{S_w}{\|X_1\|^2},$$

where $X_k$ is the k-th DFT coefficient of $y_w$ and $\|X_1\|^2$ is the average DF-IDF value over the corresponding interval T. The spectrogram can be used for frequency-domain analysis and word filtering.

Word filtering. The spectrogram shows the burst degree of each word, so hot words can be filtered according to their spectrogram features. First, periodic signals are filtered out: words whose primary cycle is less than one are removed. By setting different thresholds, we can detect hot and burst words at different levels of heat. It is also worth mentioning that many words in the experiments appeared only once or a few times in a time slice; these are considered noise.

3.3 Words Merger and Summary

Word merger. A burst event or topic may have multiple burst words, so we use a clustering method to merge them. We define the distance between words as their degree of co-occurrence. The co-occurrence of two words $w_1$ and $w_2$, and of two word clusters $R_1$ and $R_2$, is defined as

$$d(w_1, w_2) = \frac{|M_{w_1} \cap M_{w_2}|}{\min\{|M_{w_1}|, |M_{w_2}|\}}, \qquad d(R_1, R_2) = \min\{d(w_1, w_2)\},\ w_1 \in R_1,\ w_2 \in R_2,$$

where $M_w$ is the set of micro-blogs containing word w and $|M_w|$ is their number. We use hierarchical clustering to cluster the burst words and obtain the summary of each burst event.

Summary. After merging hot words, we can generate summaries of hot events. An event summary has the following format: event = {burst hot word set, burst time interval, heat, burst factor}.

4 Experimental Results

4.1 Data Set

We acquired a collection of 1,728,000 micro-blogs from Sina Weibo; the sampling frequency is less than 0.57%. In our experiment, the observation window T is 3*24 hours. The burst factor threshold is the mean value of the burst factor.
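The DF-IDF trajectory, burst hot and burst factor of Section 3.2, together with the mean-value threshold stated above, can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions, not the authors' implementation: all function names and the toy counts are hypothetical, the window totals N and DF_w are taken as sums of the per-slice counts, X_1 is read as the zero-frequency DFT coefficient, the burst factor is read as the ratio of S_w to the squared magnitude of X_1, a word is kept when its burst factor is strictly above the mean, and the periodic-signal filter is omitted.

```python
import cmath
import math


def df_idf_trajectory(df_per_slice, n_per_slice):
    """Return [y_w(1), ..., y_w(T)] with y_w(t) = DF_w(t)/N(t) * log(N/DF_w)."""
    total_n = sum(n_per_slice)    # N: assumed to be the sum of N(t) over the window
    total_df = sum(df_per_slice)  # DF_w: assumed to be the sum of DF_w(t)
    if total_df == 0:
        return [0.0] * len(df_per_slice)
    idf = math.log(total_n / total_df)
    return [(df / n) * idf if n else 0.0
            for df, n in zip(df_per_slice, n_per_slice)]


def power_spectrum(y):
    """Squared magnitudes of the DFT coefficients of y (index 0 is X_1 in the text)."""
    size = len(y)
    spectrum = []
    for k in range(size):
        coeff = sum(y[t] * cmath.exp(-2j * cmath.pi * k * t / size)
                    for t in range(size))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def burst_measures(y):
    """Return (burst hot S_w, burst factor) for one feature trajectory."""
    spectrum = power_spectrum(y)
    burst_hot = max(spectrum[1:])  # k = 2, ..., T in the 1-based notation
    dc_power = spectrum[0]
    burst_factor = burst_hot / dc_power if dc_power > 0 else 0.0
    return burst_hot, burst_factor


def select_burst_words(trajectories):
    """Keep words whose burst factor exceeds the mean burst factor (assumed strict)."""
    factors = {w: burst_measures(y)[1] for w, y in trajectories.items()}
    threshold = sum(factors.values()) / len(factors)
    return sorted(w for w, f in factors.items() if f > threshold)


# Hypothetical toy counts: 4 slices of 100 micro-blogs each.
slices = [100, 100, 100, 100]
trajectories = {
    "earthquake": df_idf_trajectory([0, 0, 20, 0], slices),  # one sharp spike
    "weather": df_idf_trajectory([5, 5, 5, 5], slices),      # evenly spread
}
burst_words = select_burst_words(trajectories)
```

In this toy input, the word concentrated in a single slice has a burst factor close to 1, while the evenly spread word has a burst factor close to 0, so only the former passes the mean threshold.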
The co-occurrence threshold is 0.5. Because there is no standard data set and no standard evaluation criterion for comparing Chinese micro-blogging burst event detection methods, we refer to the precision and recall used in traditional information retrieval as evaluation criteria. However, because the sampling frequency of the experimental data is less than 0.57%, it is impossible to find all hot events, so recall cannot be measured. This paper therefore uses only the accuracy rate: the ratio of the number of detected hot events that correspond to real events to the number of all detected burst hot events.
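The word merger of Section 3.3, with the co-occurrence threshold of 0.5, can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: the function names and the toy data are hypothetical, each M_w is assumed to be a set of micro-blog identifiers, co-occurrence is read as the size of the intersection of M_w1 and M_w2 divided by the size of the smaller set, and clusters are assumed to be merged greedily for as long as the best pair has a co-occurrence of at least the threshold.

```python
def cooccurrence(posts_a, posts_b):
    """d(w1, w2): shared micro-blogs divided by the size of the smaller set."""
    smaller = min(len(posts_a), len(posts_b))
    return len(posts_a & posts_b) / smaller if smaller else 0.0


def cluster_cooccurrence(cluster_a, cluster_b, postings):
    """d(R1, R2): the minimum pairwise co-occurrence between two clusters."""
    return min(cooccurrence(postings[a], postings[b])
               for a in cluster_a for b in cluster_b)


def merge_burst_words(postings, threshold=0.5):
    """Agglomerative merging: join the closest pair until none reaches the threshold."""
    clusters = [frozenset([word]) for word in sorted(postings)]
    while len(clusters) > 1:
        # Score every unordered pair of clusters and pick the best one.
        best_score, (i, j) = max(
            ((cluster_cooccurrence(a, b, postings), (i, j))
             for i, a in enumerate(clusters)
             for j, b in enumerate(clusters) if i < j),
            key=lambda scored: scored[0])
        if best_score < threshold:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical toy postings: burst word -> ids of micro-blogs containing it.
postings = {
    "earthquake": {1, 2, 3, 4},
    "magnitude": {2, 3, 4},
    "concert": {7, 8},
    "tickets": {8, 9},
}
events = merge_burst_words(postings, threshold=0.5)
```

Taking the minimum over word pairs means two clusters are merged only when every word in one co-occurs sufficiently with every word in the other, which keeps unrelated burst words out of the same event.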

To verify whether the burst events detected in the experiment are real events, we check each of them with the Baidu advanced search tool. We use the summary of each detected event as the search keywords and restrict the search period to the time period of the event. If three of the top five search results are related to the detected event, we regard it as a real event.

4.2 Experimental Results

In total, we detected 40 burst events. According to the judging method described above, 30 of the 40 are real news events, so the method achieves a correct detection rate of 75%. Manual checking of the remaining ten showed that four are commercial promotional advertisements and the other six are noise data containing hot burst words. The accuracy is affected by the following factors. First, Sina Weibo contains a great deal of commercial promotion and noise data, and some zombie fans use hashtags that include burst tags to attract user attention; all of this affects the purity of the experimental data. Second, limits on the crawling frequency keep the sampling rate of the crawled data low, which also affects the results. Furthermore, since micro-blogging text is informal and includes many out-of-vocabulary new words, traditional word segmentation tools cannot recognize the new words in micro-blogs, so the segmentation error rate on micro-blog text is higher than on traditional text.

5 Conclusion

In this paper, we proposed a framework for detecting burst events in micro-blogging streams. Experimental results on a real micro-blogging data set showed that the proposed method is effective in burst event detection. However, our method does not consider the noise in micro-blogging streams. Future work will therefore focus on detecting and eliminating noise data in micro-blogging. Another direction for future work is to take social network properties into account in burst event detection.
Acknowledgement. This paper is supported by the National Science Foundation of China (No. 60776801 and No. 71273010) and the National Science Foundation of Anhui Province (No. 1208085MG117).

References

1. Ritter, Alan, Oren Etzioni, and Sam Clark. Open domain event extraction from Twitter. In Proc. of SIGKDD, ACM Press, 2012.
2. Popescu, Ana-Maria, Marco Pennacchiotti, and Deepa Paranjpe. Extracting events and event descriptions from Twitter. In Proc. of WWW, ACM Press, 2011.
3. Mathioudakis, Michael, and Nick Koudas. TwitterMonitor: trend detection over the Twitter stream. In Proc. of SIGMOD, ACM Press, 2010.
4. Weng, Jianshu, and Bu-Sung Lee. Event detection in Twitter. In Proc. of the 5th International AAAI Conference on Weblogs and Social Media, 2011.
5. Thelwall, M., Buckley, K., and Paltoglou, G. Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406-418, 2011.
6. He, Qi, Kuiyu Chang, and Ee-Peng Lim. Analyzing feature trajectories for event detection. In Proc. of SIGIR, ACM Press, 2007.
7. ICTCLAS. Available at http://ictclas.nlpir.org/