Incorporating Social Context and Domain Knowledge for Entity Recognition

Incorporating Social Context and Domain Knowledge for Entity Recognition
Jie Tang, Zhanpeng Fang (Department of Computer Science, Tsinghua University)
Jimeng Sun (College of Computing, Georgia Institute of Technology)

Entity Recognition in Social Media
People use blogs, forums, and review sites to share opinions on politicians or products. One fundamental analytic task is to recognize entity instances in these short user-generated (UGC) documents. The problem is very challenging because mentions are informal and ambiguous, e.g.:
- "S4" vs. Samsung Galaxy S4
- "Fruit company" vs. Apple Inc.
- "Peace West King" vs. Xilai Bo (a sensitive Chinese politician)

A Concrete Example
(Figure: three users A, B, and C connected by reply and retweet links; their short posts, e.g. "Both Disease 1 and Disease 2 have symptom 1", "Re: Remember D2 also has symptom 3", "RT: S1 can be resolved by treatment 1 //@A: Both Disease 1 and Disease 2...", are linked to a health knowledge base containing Symptom, Disease, and Treatment concepts.)
Challenge: short text + social networks + domain knowledge = ?

Related Work
Entity recognition:
- Modeling entity recognition as a ranking problem based on boosting and the voted perceptron (Collins [9])
- Incorporating long-distance dependencies (Finkel et al. [13])
- Using Labeled LDA [26] to exploit Freebase to help extraction (Ritter et al. [27])
- Entity morphs (Huang et al. [17])
Entity resolution:
- A collective method for entity resolution in relational data (Bhattacharya and Getoor [4])
- A hierarchical topic model for resolving name ambiguity (Kataria et al. [18])
- Name disambiguation in digital libraries (Tang et al. [32])

Approach: the SOCINST Framework

Preliminary: Sequential Labeling
(Figure: the input text x, "Peace-West King from Chongqing fell", with candidate label sequences y over the tags POL, LOC, and OTH.)
The goal is to find the most likely label sequence
    y* = argmax_y p(y | x; f, Θ)
where f represents the features and Θ the model parameters.

Sequential Labeling with CRFs
(Figure: x = "Peace-West King from Chongqing fell" with labels y = POL POL OTH LOC OTH.)
    p(y | x, λ, μ) = (1/Z) exp( Σ_i Σ_k λ_k f_k(x, y_i) + Σ_i Σ_j μ_j f_j(x, y_i, y_{i+1}) )
where f_k denotes the k-th feature defined for token x_i, f_j denotes the j-th feature defined for two consecutive tokens x_i and x_{i+1}, and λ and μ are parameters to be learned from the training data.
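A minimal sketch of training such a linear-chain CRF with the sklearn-crfsuite package (not the authors' implementation; the feature set below is an illustrative assumption):

```python
# Minimal linear-chain CRF sketch using sklearn-crfsuite (illustrative only).
# pip install sklearn-crfsuite
import sklearn_crfsuite

def token_features(sent, i):
    """Per-token feature dict f_k(x, y_i); the concrete features are assumptions."""
    w = sent[i]
    feats = {
        "word.lower": w.lower(),
        "word.istitle": w.istitle(),
        "prefix3": w[:3],
        "suffix3": w[-3:],
    }
    if i > 0:
        feats["prev.word.lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True
    return feats

# Toy training data mirroring the slide's example sentence.
sentence = ["Peace-West", "King", "from", "Chongqing", "fell"]
X_train = [[token_features(sentence, i) for i in range(len(sentence))]]
y_train = [["POL", "POL", "OTH", "LOC", "OTH"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```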

Sequential Labeling with CRFs (cont.)
Same model as above; however, the performance of the model degrades when dealing with short text, because the token features become too sparse.

Sequential Labeling Incorporating Topics
(Figure: each token x_i is additionally associated with a latent topic z_i through P(z | y) and P(x | z).)
    p(y | x, θ, λ, μ) = (1/Z) exp( Σ_i Σ_k λ_k f_k(x, θ, y_i) + Σ_i Σ_j μ_j f_j(x, θ, y_i, y_{i+1}) )
i.e. the same CRF, but the features may now also depend on the learned topic distributions θ.
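Continuing the sketch above, one plausible way to expose topics to the CRF (an assumption for illustration, not necessarily how SOCINST wires it) is to add each token's most likely topic id as an extra feature:

```python
# Hypothetical extension of token_features: add a topic feature derived from a
# previously trained topic model (word2topic is an assumed word -> topic-id map).
def token_features_with_topics(sent, i, word2topic):
    feats = token_features(sent, i)
    feats["topic"] = str(word2topic.get(sent[i].lower(), "UNK"))
    return feats
```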

Latent Dirichlet Allocation
(Plate diagram: φ_k, k ∈ [1, K], is the distribution of topic k over words; θ_m, m ∈ [1, M], is the distribution of document m over topics; z_{m,n} and x_{m,n}, n ∈ [1, N_m], are the topic and the word at position n of document m; α and β are Dirichlet priors.)
    p(x, z, θ, φ | α, β) = Π_{k=1}^{K} p(φ_k | β) Π_{d=1}^{M} [ p(θ_d | α) Π_{i=1}^{N_d} p(x_{d,i} | φ_{z_{d,i}}) p(z_{d,i} | θ_d) ]
[5] D. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. JMLR, 3:993-1022, 2003.
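For reference, a minimal LDA fit with gensim (the tooling and the toy corpus are assumptions; any LDA implementation would do):

```python
# Minimal LDA sketch with gensim (illustrative; corpus and K are toy values).
from gensim import corpora
from gensim.models import LdaModel

texts = [["disease", "symptom", "health"],
         ["treatment", "operation", "disease"],
         ["symptom", "treatment", "health"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="auto", passes=10, random_state=0)
for k in range(2):
    print(lda.show_topic(k, topn=3))  # top words per topic
```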

Extend to Model Authorship and Categories
(Figure: example generative process; each topic z has a distribution over words P(w | z) and over categories P(c | z), e.g. a symptom-oriented topic {disease 0.23, symptom 0.19, health 0.17, ...} and a treatment-oriented topic {treatment 0.23, operation 0.19, disease 0.17, ...}. Example article "Liberia Declared Free of Ebola", with authors Shafiei and Milios and the categories Disease and Treatment: "After the West African nation goes more than a month with no new reported cases of viral infection, the World Health Organization says the country is Ebola-free.")
[35] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: Extraction and mining of academic social networks. In KDD'08, pages 990-998, 2008.

ACT Model
Generative process (sketch): for each token, an author of the document is chosen; the author generates a topic, which in turn generates the word and a category tag (ACT model [35]).

Remaining Challenges
However, neither of these models captures domain knowledge or social context!
SOCINST: modeling domain knowledge and social context simultaneously.

Modeling Domain Knowledge
(Figure: a Dirichlet tree prior. A flat prior draws θ ~ Dirichlet(β) directly over the words w_1, ..., w_k. A Dirichlet tree prior θ ~ DirichletTree(β, η) first groups related words w_j from the knowledge base under internal concept nodes c_1, c_2, ..., with edge weights ηβ inside a group and β elsewhere, so that words belonging to the same concept receive correlated probability mass.)
[1] D. Andrzejewski, X. Zhu, and M. Craven. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In ICML'09, pages 25-32, 2009.
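A rough numpy sketch of drawing a word distribution from such a two-level Dirichlet tree (the edge-weight convention used here, ηβ inside a concept group and mass proportional to group size at the root, is an assumption for illustration, not the paper's exact construction):

```python
# Rough sketch: sample a word distribution from a two-level Dirichlet tree prior.
# groups: lists of word indices that the knowledge base says belong together.
import numpy as np

def sample_dirichlet_tree(groups, vocab_size, beta=0.1, eta=100.0, rng=None):
    rng = rng or np.random.default_rng(0)
    grouped = {w for g in groups for w in g}
    singles = [[w] for w in range(vocab_size) if w not in grouped]
    children = groups + singles                       # children of the root node
    root_weights = [len(c) * beta for c in children]  # assumed: mass ~ group size
    branch = rng.dirichlet(root_weights)              # probability of each root branch
    theta = np.zeros(vocab_size)
    for p, child in zip(branch, children):
        if len(child) == 1:
            theta[child[0]] = p
        else:
            within = rng.dirichlet([eta * beta] * len(child))  # tight coupling in group
            theta[np.array(child)] = p * within
    return theta

# Example: words 0 and 1 are known to belong to the same concept.
print(sample_dirichlet_tree(groups=[[0, 1]], vocab_size=5))
```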

Modeling Social Context
(Figure: users v_1, v_2, v_3 with topic distributions θ_{v1} = <0.1, 0.5, ...>, θ_{v2} = <0.3, 0.2, ...>, θ_{v3}, connected in a network A-B-C with topic distributions θ_A, θ_B, θ_C.)
A user v_i's social context is defined as a mixture of the topic distributions of their neighbors, i.e.
    Σ_{j ∈ NB(v_i)} γ_j θ_j
a multinomial mixture (e.g. the combined context of v_1 and v_2 is proportional to θ_{v1} + θ_{v2}).
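A tiny sketch of that mixture, assuming uniform neighbor weights γ_j (the weighting scheme and the toy numbers are assumptions):

```python
# Social-context prior as a mixture of neighbors' topic distributions (sketch).
import numpy as np

theta = {                          # per-user topic distributions (toy values)
    "v1": np.array([0.1, 0.5, 0.4]),
    "v2": np.array([0.3, 0.2, 0.5]),
    "v3": np.array([0.6, 0.3, 0.1]),
}
neighbors = {"v1": ["v2", "v3"]}   # NB(v1)

def social_context(user, theta, neighbors):
    nb = neighbors[user]
    gamma = np.full(len(nb), 1.0 / len(nb))            # assumed uniform weights
    return sum(g * theta[j] for g, j in zip(gamma, nb))

print(social_context("v1", theta, neighbors))           # mixture over topics
```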

Theoretical Basis
Aggregation property of the Dirichlet distribution:
If (θ_1, ..., θ_i, θ_{i+1}, ..., θ_K) ~ Dirichlet(α_1, ..., α_i, α_{i+1}, ..., α_K),
then (θ_1, ..., θ_i + θ_{i+1}, ..., θ_K) ~ Dirichlet(α_1, ..., α_i + α_{i+1}, ..., α_K).
Inverse of the aggregation property:
If (θ_1, ..., θ_K) ~ Dirichlet(α_1, ..., α_K),
then (θ_1, ..., τθ_i, (1-τ)θ_i, ..., θ_K) ~ Dirichlet(α_1, ..., τα_i, (1-τ)α_i, ..., α_K).
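A quick Monte Carlo sanity check of the aggregation property (illustrative only; toy parameters):

```python
# Monte Carlo check of the Dirichlet aggregation property.
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([1.0, 2.0, 3.0, 4.0])
samples = rng.dirichlet(alpha, size=200_000)

# Merge coordinates 2 and 3, and compare with a direct Dirichlet draw.
merged = np.column_stack([samples[:, 0], samples[:, 1] + samples[:, 2], samples[:, 3]])
direct = rng.dirichlet([1.0, 2.0 + 3.0, 4.0], size=200_000)

print(merged.mean(axis=0))   # ~ [0.1, 0.5, 0.4]
print(direct.mean(axis=0))   # ~ [0.1, 0.5, 0.4]
```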

Model Learning

Sequential Labeling Incorporating Topics (summary)
(Figure: SOCINST combines the social-context mixture of neighbors' topic distributions with the Dirichlet-tree prior built from domain knowledge; the resulting topic distributions θ feed the CRF as features.)
    p(y | x, θ, λ, μ) = (1/Z) exp( Σ_i Σ_k λ_k f_k(x, θ, y_i) + Σ_i Σ_j μ_j f_j(x, θ, y_i, y_{i+1}) )

Experiments

Datasets
All code and datasets can be downloaded at http://aminer.org/socinst/

Data set          #documents   #instances   #relationships
Weibo                  1,800          545           10,763
I2B2                     899        2,400           27,175
ICDM'12 Contest        2,110          565              N/A

Goals:
- Weibo: extract real morph instances in the dataset.
- I2B2: extract private health information instances in the dataset.
- ICDM'12 Contest: recognize product mentions in the dataset.

I2B2 Example
"HISTORY OF PRESENT ILLNESS: Mr. Blind is a 79-year-old white male with a history of diabetes mellitus, inferior myocardial infarction, who underwent open repair of his increased diverticulum November 13th at Sephsandpot Center. The patient developed hematemesis November 15th and was intubated for respiratory distress. He was transferred to the Valtawnprinceel Community Memorial Hospital for endoscopy and esophagoscopy on the 16th of November which showed a 2 cm linear tear of the esophagus at 30 to 32 cm."
Entity types to extract: Patient, Doctor, Date, Location, Hospital.

ICDM'12 Contest

Results
(Figure: bar chart of F1-measure, 0.0 to 0.9, for each method on Weibo, I2B2, and ICDM'12.)
- SM: simply extracts all the terms/symbols that are annotated
- RT: recognizes target instances in the test data with a set of rule templates
- CRF: trains a CRF model using features associated with each token
- CRF+AT: uses the Author-Topic model (AT) [30] to learn topics, then uses the learned topics as features in a CRF for instance recognition
- SOCINST: our proposed model

Results (cont.)
(Comparison of the same methods: SM, RT, CRF, CRF+AT, and SOCINST.)

More Results: ICDM'12 Contest
Performance comparison of SOCINST against the first-place system [38] in the ICDM'12 Contest, obtained by incorporating the SOCINST modeling results into that CRF-based model.
[38] S. Wu, Z. Fang, and J. Tang. Accurate product name recognition from user generated content. In ICDM'12 Contest.

Effects of Social Context and Domain Knowledge
- SOCINSTbase: both social context and domain knowledge removed from our method
- SOCINST-SC: social context removed from our method
- SOCINST-DK: domain knowledge removed from our method

Parameter Analysis

Parameter Analysis (cont.)
(* All other hyperparameters fixed; the number of topics is set to K = 15.)

AMiner (http://aminer.org)

Conclusion
- Study the problem of instance recognition by incorporating social context and domain knowledge
- Propose a topic modeling approach that learns topics by considering social relationships between users and context information from a domain knowledge base
- Experimental results on three different datasets validate the effectiveness and efficiency of the proposed method

Future Work
- The general idea of incorporating social context and domain knowledge for entity recognition represents a new research direction
- Combining the sequential labeling model and the proposed SOCINST model into a unified model should be beneficial
- Further incorporating other social interactions, such as social influence, to help instance recognition is an intriguing direction

Thank you!
Collaborators: Jimeng Sun (Georgia Tech), Zhanpeng Fang (THU)
Jie Tang, KEG, Tsinghua University: http://keg.cs.tsinghua.edu.cn/jietang
Download all data & code: http://aminer.org/socinst

(Backup) Modeling Short Text with Topics
    p_d(x) = λ_B p(x | θ_B) + (1 - λ_B) Σ_{k=1}^{K} π_{d,k} p(x | θ_k)
    log p(d) = Σ_{x ∈ V} n(x, d) log[ λ_B p(x | θ_B) + (1 - λ_B) Σ_{k=1}^{K} π_{d,k} p(x | θ_k) ]
(Figure: topics θ_1, θ_2, θ_3, e.g. {warning 0.3, system 0.2, ...}, {aid 0.1, donation 0.05, support 0.02, ...}, {statistics 0.2, loss 0.1, dead 0.05, ...}, plus a background distribution θ_B, e.g. {is 0.05, the 0.04, a 0.03, ...}; each word x of document d is generated either from the background, with probability λ_B, or from one of the topics.)
Parameters: the background model θ_B (noise level) is set manually; the θ_k and π are estimated by maximum likelihood.
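A compact EM sketch for this background-mixture model, written directly from the formulas above (standard mixture updates, not the authors' code; the background is fixed to corpus frequencies as an assumption):

```python
# EM for the background-mixture topic model above (sketch, toy data).
# counts[d, x] = n(x, d); theta_B is fixed, theta_k and pi_{d,k} are estimated.
import numpy as np

def em_background_mixture(counts, K, lambda_B=0.5, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    theta_B = counts.sum(axis=0) / counts.sum()        # assumed: background = corpus frequencies
    theta = rng.dirichlet(np.ones(V), size=K)          # K topic-word distributions
    pi = rng.dirichlet(np.ones(K), size=D)             # per-document topic mixtures
    for _ in range(iters):
        # E-step: responsibility of each topic k for word x in document d.
        topic_part = pi[:, :, None] * theta[None, :, :]                 # D x K x V
        denom = lambda_B * theta_B[None, :] + (1 - lambda_B) * topic_part.sum(axis=1)
        resp = (1 - lambda_B) * topic_part / denom[:, None, :]          # p(k | d, x)
        # M-step: re-estimate pi and theta from expected counts.
        weighted = resp * counts[:, None, :]                            # D x K x V
        pi = weighted.sum(axis=2)
        pi /= pi.sum(axis=1, keepdims=True)
        theta = weighted.sum(axis=0)
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta

counts = np.array([[5, 3, 0, 1], [0, 1, 4, 4]], dtype=float)            # 2 docs, 4 words
pi, theta = em_background_mixture(counts, K=2)
print(pi)
```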

(Backup) Plate diagram of the extended model, with priors α and β, topic-word distributions φ_k (k ∈ [1, K]), topic mixtures θ, and, for each document m ∈ [1, M] and token n ∈ [1, N_m], the author a_m, word x_{m,n}, topic z_{m,n}, and category c_{m,n}.