Text mining and natural language analysis. Jefrey Lijffijt

PART I: Introduction to Text Mining

Why text mining
The amount of text published on paper, on the web, and even within companies is inconceivably large. We need automated methods to find, extract, and link information from documents.

Main problems
- Classification: categorise texts into classes (given a set of classes)
- Clustering: categorise texts into classes (not given any set of classes)
- Sentiment analysis: determine the sentiment/attitude of texts
- Key-word analysis: find the most important terms in texts

Main problems
- Summarisation: give a brief summary of texts
- Retrieval: find the most relevant texts to a query
- Question-answering: answer a given question
- Language modeling: uncover structure and semantics of texts
And answer-questioning?

Related domains (definitions from Wikipedia)
- Text mining [...] refers to the process of deriving high-quality information from text.
- Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources.
- Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
- Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective.

PART II: Clustering & Topic Models

Main topic today
Text clustering & topic models: useful to categorise texts and to uncover structure in text corpora.

Primary problem
How to represent text? What are the relevant features?

Main solution
Vector-space (bag-of-words) model:

          Word 1    Word 2    Word 3
Text 1    w_{1,1}   w_{1,2}   w_{1,3}
Text 2    w_{2,1}   w_{2,2}   w_{2,3}
Text 3    w_{3,1}   w_{3,2}   w_{3,3}
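
To make this concrete, here is a minimal sketch of building such a word-count matrix with scikit-learn (the three example texts are made up for illustration):

```python
# A minimal sketch of the bag-of-words representation, assuming
# scikit-learn is available. The example texts are made up.
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "genes and dna encode proteins",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # rows are texts, columns are words

print(vectorizer.get_feature_names_out())  # the "Word j" columns
print(X.toarray())                         # the w_{i,j} counts
```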

Simple text clustering
Clustering with the k-means algorithm and cosine similarity. Idea: two texts are similar if the frequencies at which words occur are similar:

s(w_1, w_2) = (w_1 · w_2) / (‖w_1‖ ‖w_2‖)

The score lies in [0, 1] (since w_{i,j} ≥ 0). Widely used in text mining.
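
As a sketch of how this could look in code (not the Matlab demo below; the tiny corpus and cluster count are made up): L2-normalising the count vectors makes ordinary Euclidean k-means group texts by cosine similarity, because for unit vectors ‖w_1 - w_2‖² = 2 - 2·s(w_1, w_2).

```python
# A hedged sketch of k-means text clustering by cosine similarity using
# scikit-learn. Rows are scaled to unit length, so Euclidean k-means is
# equivalent to clustering by the cosine score s above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

texts = [
    "stocks fell as markets closed lower",
    "markets rallied and stocks rose",
    "the genome was sequenced from dna",
    "dna sequencing reveals the genome",
]

X = normalize(CountVectorizer().fit_transform(texts))  # unit-length rows
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: the finance texts vs. the genetics texts
```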

Demo
Reuters-21578: 8300 (categorised) newswire articles. Clustering is a single command in Matlab.
Data (original, and processed .mat):
http://www.daviddlewis.com/resources/testcollections/reuters21578/
http://www.cad.zju.edu.cn/home/dengcai/data/textdata.html

More advanced clustering
Latent Dirichlet Allocation (LDA), also known as topic modeling. Idea: texts are a weighted mix of topics.
[The following slides are inspired by, and the figures taken from, David Blei (2012). Probabilistic topic models. CACM 55(4): 77–84.]

[Figure 1 from Blei (2012): the intuitions behind latent Dirichlet allocation. We assume that some number of topics, which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows: first choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. Illustrative topics shown: genetics (gene, dna, genetic), evolutionary biology (life, evolve, organism), neuroscience (brain, neuron, nerve), and data analysis (data, number, computer). The topics and topic assignments in this figure are illustrative; they are not fit from real data. See Figure 2 for topics fit from data.]

Technically (1/2)
- Topics are probability distributions over words. The distribution defines how often each word occurs, given that the topic is discussed.
- Texts are probability distributions over topics. The distribution defines how often a word is due to a topic.
These are the free parameters of the model.
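
As a toy sketch of the generative story these parameters define (the two topics and all numbers below are made up for illustration, not fit from data):

```python
# A toy sketch of LDA's generative process: sample a text's topic
# proportions from a Dirichlet prior, then for each word sample a topic
# assignment and a word from that topic. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "dna", "brain", "neuron", "data", "computer"]

topics = np.array([                         # each row: a distribution over words
    [0.40, 0.40, 0.05, 0.05, 0.05, 0.05],   # a "genetics"-like topic
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],   # a "computing"-like topic
])
K = len(topics)

theta = rng.dirichlet([0.5] * K)            # this text's distribution over topics

doc = []
for _ in range(10):                         # generate a ten-word text
    z = rng.choice(K, p=theta)              # choose a topic assignment
    doc.append(rng.choice(vocab, p=topics[z]))  # choose a word from that topic
print(theta, doc)
```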

Technically (2/2)
For each word in a text, we can compute how probable it is that it belongs to a certain topic. Given the topic proportions and the topics, we can compute the likelihood of a document.
[Plate diagram of the LDA graphical model: assumed priors α and η; latent per-text topic proportions θ_d; latent word assignments Z_{d,n}; observed words W_{d,n}; latent (assumed) topics β_k; plates over the N words per text, the D texts, and the K topics.]
The optimisation problem is to find the posterior distributions for the topics and the texts (see article).
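
For illustration, a hedged sketch of inferring both kinds of distributions with scikit-learn's variational inference implementation (the tiny corpus is made up; the article discusses the inference problem in general terms):

```python
# A sketch of fitting LDA and reading off the two kinds of distributions,
# assuming scikit-learn. The corpus is made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

texts = [
    "gene dna genetic sequencing genome",
    "computer data software network systems",
    "dna genome gene data computer software",
]

vec = CountVectorizer()
X = vec.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Topics as (unnormalised) distributions over words:
words = vec.get_feature_names_out()
for k, comp in enumerate(lda.components_):
    top = comp.argsort()[::-1][:3]
    print(f"topic {k}:", [words[i] for i in top])

# Texts as distributions over topics (the inferred posterior proportions):
print(lda.transform(X))  # one row per text; each row sums to 1
```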

[Figure 2 from Blei (2012): real inference with LDA. A 100-topic LDA model was fit to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in Figure 1; at right, the top 15 most frequent words from the most frequent topics found in this article: "Genetics" (human, genome, dna, genetic, genes, ...), "Evolution" (evolution, evolutionary, species, organisms, life, ...), "Disease" (disease, host, bacteria, diseases, resistance, ...), and "Computers" (computer, models, information, data, computers, ...).]

Active area of research: extensions of LDA
[Figure 5 from Blei (2012), first topic: a dynamic topic model fit to Science, 1880 to 2002, with the top words illustrated at each decade. The topic drifts from "molecules, atoms, molecular, matter" (1880) via "electrons, atoms, atom, electron" (1930) and "particles, nuclear, electron, atomic" (1950) to "state, quantum, electron" (2000); example articles range from "Alchemy" (1891) and "Mass and Energy" (1907) to "The Z Boson" (1990) and "Quantum Criticality: Competing Ground States in Low Dimensions" (2000).]

Active area of research: extensions of LDA
[Figure 5 from Blei (2012), second topic of the same dynamic topic model. The topic drifts from "french, france, england, country, europe" (1880) via "war, france, british" (1920) and "soviet, nuclear, international" (1960) to "european, nuclear, countries" (2000); example articles range from "Speed of Railway Trains in Europe" (1889) and "Farming and Food Supplies in Time of War" (1915) to "The Costs of the Soviet Empire" (1985) and "Post-Cold War Nuclear Dangers" (1995).]

Summary
Text mining is concerned with automated methods to find, extract, and link information from text. Text clustering and topic models help us to:
- Organise text corpora
- Find relevant documents
- Uncover relations between documents

Further reading
- David Blei (2012). Probabilistic topic models. Communications of the ACM 55(4): 77–84.
- Christopher D. Manning & Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.