Text mining and natural language analysis
Jefrey Lijffijt
PART I: Introduction to Text Mining
Why text mining?
The amount of text published on paper, on the web, and even within companies is inconceivably large. We need automated methods to find, extract, and link information from documents.
Main problems
Classification: categorise texts into classes (given a set of classes)
Clustering: categorise texts into classes (not given any set of classes)
Sentiment analysis: determine the sentiment/attitude of texts
Keyword analysis: find the most important terms in texts
Main problems (continued)
Summarisation: give a brief summary of texts
Retrieval: find the most relevant texts to a query
Question answering: answer a given question
Language modelling: uncover structure and semantics of texts
And answer-questioning?
Related domains: definitions from Wikipedia
Text mining [...] refers to the process of deriving high-quality information from text.
Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources.
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages.
Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective.
PART II: Clustering & Topic Models
Main topic today
Text clustering & topic models: useful to categorise texts and to uncover structure in text corpora.
Primary problem
How to represent text? What are the relevant features?
Main solution
The vector-space (bag-of-words) model:

          Word 1    Word 2    Word 3
Text 1    w_{1,1}   w_{1,2}   w_{1,3}
Text 2    w_{2,1}   w_{2,2}   w_{2,3}
Text 3    w_{3,1}   w_{3,2}   w_{3,3}
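A minimal sketch of building such a term-document matrix in Python; the three example texts and the use of scikit-learn's CountVectorizer are illustrative choices, not part of the slides.

```python
# Build a small term-document (bag-of-words) matrix.
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "the cat sat on the mat",     # made-up example texts
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = CountVectorizer()
W = vectorizer.fit_transform(texts)   # rows = texts, columns = words

print(vectorizer.get_feature_names_out())
print(W.toarray())                    # entry w_{i,j} = count of word j in text i
```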
Simple text clustering
Clustering with the k-means algorithm and cosine similarity. Idea: two texts are similar if the frequencies at which words occur are similar.

s(w_1, w_2) = (w_1 · w_2) / (||w_1|| ||w_2||)

Score in [0, 1] (since w_{i,j} ≥ 0). Widely used in text mining.
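As a sketch, the similarity above can be computed directly in NumPy; the two count vectors here are hypothetical.

```python
import numpy as np

def cosine_similarity(w1, w2):
    # s(w1, w2) = (w1 . w2) / (||w1|| ||w2||);
    # lies in [0, 1] for non-negative count vectors.
    return w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))

w1 = np.array([2.0, 0.0, 1.0])   # hypothetical word counts for text 1
w2 = np.array([1.0, 1.0, 0.0])   # hypothetical word counts for text 2
print(cosine_similarity(w1, w2))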
Demo
Reuters-21578: 8,300 (categorised) newswire articles. Clustering is a single command in Matlab.
Data (original and processed .mat):
http://www.daviddlewis.com/resources/testcollections/reuters21578/
http://www.cad.zju.edu.cn/home/dengcai/data/textdata.html
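The slides use Matlab; an equivalent in Python is similarly compact. The sketch below reuses the term-document matrix W built earlier and assumes (as a common approximation, not something stated in the slides) that unit-normalising rows makes Euclidean k-means behave like k-means under cosine similarity.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Unit-normalise rows so Euclidean distance is monotone in cosine
# similarity, then run standard k-means.
W_unit = normalize(W)                # W: term-document matrix from above
labels = KMeans(n_clusters=2, n_init=10).fit_predict(W_unit)
print(labels)                        # cluster index per text
```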
More advanced clustering
Latent Dirichlet Allocation (LDA), also known as topic modelling. Idea: texts are a weighted mix of topics.
[The following slides are inspired by, and figures taken from, David Blei (2012). Probabilistic topic models. CACM 55(4): 77–84.]
[Figure 1 from Blei (2012): the intuitions behind latent Dirichlet allocation. We assume that some number of topics, which are distributions over words, exists for the whole collection. Each document is generated by first choosing a distribution over the topics, then, for each word, choosing a topic assignment and choosing the word from the corresponding topic. Technically, the model assumes that the topics are generated first, before the documents; the per-document topic distribution is used to allocate the words of the document to different topics. The topics and topic assignments in the figure are illustrative; they are not fit from real data (see Figure 2 for topics fit from data).]
Technically (1/2)
Topics are probability distributions over words: the distribution defines how often each word occurs, given that the topic is discussed.
Texts are probability distributions over topics: the distribution defines how much of each text is due to each topic.
These are the free parameters of the model.
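A toy NumPy sketch of this generative view; the vocabulary, topic distributions, and document length are made-up numbers, purely to show the two kinds of distributions at work.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["gene", "dna", "brain", "neuron", "data", "computer"]

# Topics: probability distributions over words (each row sums to 1).
topics = np.array([
    [0.45, 0.45, 0.02, 0.02, 0.03, 0.03],   # a "genetics" topic
    [0.02, 0.02, 0.45, 0.45, 0.03, 0.03],   # a "neuroscience" topic
])

# A text: a probability distribution over topics.
theta = np.array([0.7, 0.3])

# Generate a 10-word document: pick a topic per word,
# then pick the word from that topic.
for _ in range(10):
    z = rng.choice(len(topics), p=theta)    # topic assignment
    w = rng.choice(vocab, p=topics[z])      # word from that topic
    print(w, end=" ")
```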
Technically (2/2)
For each word in a text, we can compute how probable it is that it belongs to a certain topic. Given the topic proportions and the topics, we can compute the likelihood of a document.
Latent (= hidden) variables: the per-text topic proportions θ_d, the per-word topic assignments Z_{d,n}, and the topics β_k. Only the words W_{d,n} are observed. The priors α and η are assumed, as are the number of words per text (N), the number of texts (D), and the number of topics (K).
The optimisation problem is to find the posterior distributions for the topics and the texts (see article).
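A hedged sketch of this posterior inference in practice, using scikit-learn's LatentDirichletAllocation (which uses variational inference); the tiny corpus and parameter values are illustrative only.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus; in the demo this would be e.g. Reuters-21578.
texts = [
    "gene dna genetic sequencing genome",
    "brain neuron nerve cortex",
    "computer data network software",
    "dna genome gene mapping",
]

W = CountVectorizer().fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

theta = lda.fit_transform(W)   # per-text topic proportions (the theta_d)
beta = lda.components_         # per-topic word weights (the beta_k, unnormalised)

print(theta.round(2))
```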
[Figure 2 from Blei (2012): real inference with LDA. A 100-topic LDA model was fit to 17,000 articles from the journal Science. At left, the inferred topic proportions for the example article in Figure 1; at right, the top 15 most frequent words from the most frequent topics found in this article (labelled Genetics, Evolution, Disease, and Computers).]
Active area of research: extensions of LDA
[Figure 5 from Blei (2012): two topics from a dynamic topic model fit to Science from 1880 to 2002, showing the top words at each decade. One topic drifts from "molecules/atoms" towards "electrons/quantum"; the other from "france/england/europe" through "war" towards "nuclear/soviet/european".]
Summary
Text mining is concerned with automated methods to find, extract, and link information from text.
Text clustering and topic models help us organise text corpora, find relevant documents, and uncover relations between documents.
Further reading
David Blei (2012). Probabilistic topic models. Communications of the ACM 55(4): 77–84.
Christopher D. Manning & Hinrich Schütze (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.