Welcome to CAMCOS Reports Day, Fall 2011
CAMCOS: Text Mining and Emerging Topics
Damien Adams, Neeti Mittal, Joanna Spencer, Huan Trinh, Annie Vu, Orvin Weng, Rachel Zadok
December 9, 2011
What is Text Mining?
Our work deals with modeling and detecting emerging topics in documents using text mining. So what exactly is text mining? Text mining is the act of getting a computer to read a document and identify its topics.
Why a Computer?
You may ask: why do we want a computer to do this seemingly simple task? It is easy to read a paper that is only a few pages long and identify its topics; we learned how to do this in English class in grade school. But what about dozens, hundreds, or thousands of documents? The goal of text mining is to tackle collections of sizes that are not humanly feasible.
The DeLorean Motor Company
In 2013, the DeLorean Motor Company will be producing DeLoreans again.
Flux Capacitor
Suppose they need to recall certain DeLoreans due to flux capacitor issues.
Without Text Mining
Without text mining, DMC would have to:
Spend days reading all reports
Manually identify topics, including the flux capacitor issue
Recall the affected DeLoreans
This could take days or even weeks!
With Text Mining
On the other hand, DMC could use text mining to:
Spend 10 minutes feeding all reports into the computer
Read the topics found by text mining
Recall the affected DeLoreans
This could take less than an hour!
Topic Modeling
The idea behind text mining is topic modeling. Given a document (a "bag of words"), we wish to identify its topics. In the previous example, the document would be the collection of incident reports, and the topic would be the flux capacitor issue.
Topic Modeling
[Diagram: a document decomposed into its words]
What is a Topic?
What exactly is a topic? When we read a paper, the topic is the main idea. How can a topic be defined?
Definition: A topic is a distribution of words, in a document, over a predetermined vocabulary.
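To make the definition concrete, here is a toy sketch in Python (the words and probabilities below are invented for illustration, not taken from our experiments): a topic simply assigns a probability to every vocabulary word, and those probabilities sum to 1.

```python
# A toy "topic": a probability distribution over a tiny four-word vocabulary.
# The words and the probabilities are made up for illustration only.
topic = {"habitat": 0.4, "species": 0.3, "natural": 0.2, "population": 0.1}

# A valid topic assigns every vocabulary word a probability, and they sum to 1.
assert abs(sum(topic.values()) - 1.0) < 1e-9
```

In a real run the vocabulary is the whole document's word list, so most words get very small probabilities.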
Topic Modeling
What is topic modeling? We talked about it before, but here is a formal definition.
Definition: Topic modeling is the use of methods to automatically assign words in documents to topics.
LDA
We focus on topic modeling using LDA, Latent Dirichlet Allocation. LDA (2002) is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan.
LDA
Definition: Latent Dirichlet Allocation (LDA) is a generative process that defines a joint probability distribution over both the observed and hidden random variables. Simply put, LDA uncovers the thematic structure hidden in a document. It generates the main ideas of a set of documents, which we call topics.
Latent Dirichlet Allocation
Examining what the words stand for:
Latent: we observe the words in the documents, but the topics are hidden (latent)
Dirichlet: LDA uses the Dirichlet distribution (next slide)
Allocation: we allocate topics to documents
Dirichlet Distribution
The Dirichlet distribution, with parameters α_1, ..., α_K, is a multivariate distribution of K random variables x_1, ..., x_K. Its density is
Dir(α_1, ..., α_K) ∝ ∏_{i=1}^{K} x_i^{α_i − 1}.
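As a quick numerical sanity check of this density (a Python sketch using NumPy's built-in Dirichlet sampler; our actual software is in R): every draw lies on the probability simplex, and the mean of component i approaches α_i / Σ_j α_j.

```python
import numpy as np

alpha = [2.0, 3.0, 5.0]                       # example parameters
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=100_000)  # 100,000 draws from Dir(alpha)

# Every draw is a probability vector: nonnegative entries summing to 1.
assert np.all(samples >= 0)
assert np.allclose(samples.sum(axis=1), 1.0)

# The sample mean of component i approaches alpha_i / sum(alpha).
assert np.allclose(samples.mean(axis=0), np.array(alpha) / sum(alpha), atol=0.01)
```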
Example of the Dirichlet Distribution (from Wikipedia)
Consider an urn containing balls of K different colors. Initially, the urn contains α_1 balls of color 1, α_2 balls of color 2, and so on. Now perform N draws from the urn, where after each draw the ball is placed back into the urn together with an additional ball of the same color. In the limit as N approaches infinity, the proportions of the different colored balls in the urn will be distributed as Dir(α_1, ..., α_K). Jumping ahead: in our case, θ_i will be the importance of topic i among the K topics.
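The urn scheme above is easy to simulate; this Python sketch (illustrative only, not part of our R software) performs one long run of draws and returns the final color proportions, which form a single approximate sample from Dir(α_1, ..., α_K).

```python
import random

def polya_urn(alpha, n_draws, seed=0):
    """Simulate a Polya urn: repeatedly draw a ball at random and return it
    to the urn together with one extra ball of the same color."""
    rng = random.Random(seed)
    counts = list(alpha)                 # initial ball counts per color
    for _ in range(n_draws):
        total = sum(counts)
        r = rng.uniform(0, total)        # pick a ball proportionally to counts
        acc = 0.0
        for i, c in enumerate(counts):
            acc += c
            if r <= acc:
                counts[i] += 1           # add an extra ball of that color
                break
    total = sum(counts)
    return [c / total for c in counts]   # final color proportions

# One long run yields (approximately) a single draw from Dir(1, 1, 1).
proportions = polya_urn([1, 1, 1], 10_000)
```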
LDA Assumptions
The topics are Dirichlet distributed over the words
The documents are Dirichlet distributed over the topics
The order of the documents does not matter (this deficiency is exactly what we address)
The order of the words in the documents does not matter
The number of topics is assumed known and fixed
Additional Assumptions
LDA automatically removes all common words such as "with", "the", "and", etc. It also removes all numeric values, commas, parentheses, etc. It eliminates words that are repeated very many times in the document. We only considered words with 3 or more letters.
Example
We ran LDA over a document about habitats. We looked for five topics.
Output
Topic 1: species .028, environment .021, effect .014, references .014, ecosystem .014, ...
Topic 2: plant .019, botanical .019, physical .019, zoological .014, habitat .013, ...
Topic 3: habitat .038, population .019, species .019, natural .019, trophic .019, ...
Topic 4: species .021, additional .021, cycle .021, effect .021, help .020, ...
Topic 5: species .076, particular .052, analysis .041, population .030, abundance .024, ...
Five topics, each distributed over a list of words. Each list consists of the entire vocabulary, with varying probabilities. In other words, the sum of the probabilities of all the words under any given topic is 1. Under each topic, as a word's probability decreases, its position drops.
Problems with LDA
The assumption that the order of the documents does not matter prevents us from differentiating between new information and prior knowledge. LDA fails to infer time-sensitive information.
Questions?
Q&A
Time for a Break...
We will now take a five-minute break.
Our Contribution
Our contribution to this field is the ability to automatically detect emerging topics.
Definition
An emerging topic is a topic that is more prominent now than it was before. We implemented a variable that measures that prominence. θ_i is the prominence of topic i over, say, the last month's worth of data. Now if θ′_i is the prominence of topic i over, say, the last week, then we mathematically define:
Definition: Topic i is an emerging topic if θ′_i / θ_i > 1.
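In code, this definition is a one-liner; here is a Python sketch (the helper name and the prominence values are hypothetical) that flags every topic whose recent prominence θ′_i exceeds its long-run prominence θ_i.

```python
def emerging_topics(theta, theta_recent):
    """Indices i with theta'_i / theta_i > 1, i.e., suspected emerging topics.
    theta: prominences over the long window (e.g., last month);
    theta_recent: prominences over the short window (e.g., last week)."""
    return [i for i, (t, tp) in enumerate(zip(theta, theta_recent)) if tp / t > 1]

# Made-up prominences: topic 2 grew from 0.2 to 0.6 (ratio 3 > 1).
print(emerging_topics([0.5, 0.3, 0.2], [0.2, 0.2, 0.6]))  # [2]
```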
Definition
David Blei devised the LDA algorithm. Our team implemented LDA with a new feature of emerging topic detection: topic i is an emerging topic if θ′_i / θ_i > 1.
Another Approach to Defining an Emerging Topic
Ranking is the change in a topic's position of importance. Alternate definition: topic i is an emerging topic if there is a positive change in its position. We did not use this definition, as it would have assumed that LDA gives its topic output in some order.
Goal
We want to use text mining to detect emerging topics relative to two different documents. We then want to observe only the topics with the greatest relative importance.
Our Algorithm
Written in the statistical language R. Uses the package topicmodels by Grün and Hornik.
The Input
Our algorithm takes three inputs:
A document (which we suspect of having an emerging topic)
An estimated number of topics (K)
The percentage of recent data, e.g.,
14% = 1/7, if document = week and recent = last day
23% = 7/30, if document = month and recent = last week
17% = 2/12, if document = year and recent = last two months
How the Algorithm Works
1. Preprocessing: common words (of, the, is, from, ...), special characters ($, %, ...), and numbers are discarded
2. Using LDA, we discover the K topics in the entire document as well as their importances θ_1, θ_2, ..., θ_K
3. For each topic i, we compute its importance θ′_i in the recent part of the document
4. The topics are sorted in decreasing order according to θ′_i / θ_i
5. The topics for which θ′_i / θ_i > 1 are displayed as suspected emerging topics
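The five steps can be sketched as follows. This is a rough Python analogue using scikit-learn's LDA (our actual implementation is in R with the topicmodels package); the function name and the chunking of the document into chronological pieces are assumptions made for the sketch.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def suspected_emerging_topics(chunks, recent_frac=0.10, n_topics=5):
    """chunks: pieces of the document in chronological order."""
    # Step 1: preprocessing -- drop English stop words, and keep only
    # alphabetic tokens of 3 or more letters (this also discards numbers).
    vec = CountVectorizer(stop_words="english",
                          token_pattern=r"(?u)\b[a-zA-Z]{3,}\b")
    X = vec.fit_transform(chunks)
    # Step 2: discover the K topics and their overall importances theta_i.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(X)             # per-chunk topic mixtures
    theta = doc_topic.mean(axis=0)
    # Step 3: importance theta'_i in the recent part of the document.
    n_recent = max(1, round(recent_frac * len(chunks)))
    theta_recent = doc_topic[-n_recent:].mean(axis=0)
    # Steps 4-5: sort by theta'_i / theta_i and report ratios above 1.
    ratio = theta_recent / theta
    order = np.argsort(-ratio)
    return [(int(i), float(ratio[i])) for i in order if ratio[i] > 1]
```

The returned pairs (topic index, θ′_i/θ_i ratio) are the suspected emerging topics, most prominent first.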
The Output
A list of words grouped by topic
A plot of the relative importance of topics in the document
Setup
To test our algorithm, we created a document with an emerging topic in it. We took the Wikipedia entry for Habitat and introduced an emerging topic by appending an article from the EPA on climate change. The emerging topic represented about 10% of the size of the entire document.
Why an Emerging Topic Should Take Up About 10% of a Document
We will examine three different situations. Consider a company that turns in daily reports. From these reports, you want to discover emerging topics.
Consider comparing the reports from today against the last week of reports. If half of today's reports are about emerging topics, then 1/2 × 1/7 ≈ 0.071. That is, about 7% of the last week's reports are about emerging topics.
Now consider comparing the last week of reports against the past month of reports. If half of the last week's reports are about emerging topics, then 1/2 × 7/30 ≈ 0.117. That is, about 12% of the past month's reports are about emerging topics.
Now consider comparing the last month's worth of reports against the past year's worth of reports. If half of the last month's reports are about emerging topics, then 1/2 × 2/12 ≈ 0.083. That is, about 8% of the past year's worth of reports are about emerging topics.
So let's recap what we have: 1/2 × 1/7 ≈ 0.071; 1/2 × 7/30 ≈ 0.117; 1/2 × 2/12 ≈ 0.083. As you can see, these values are fairly close; their average is about 0.09. Thus 10%, give or take, is a good estimate.
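The recap arithmetic checks out directly:

```python
# Half of today's reports vs. a week, half of a week vs. a month,
# half of two months vs. a year.
fractions = [0.5 * 1 / 7, 0.5 * 7 / 30, 0.5 * 2 / 12]
average = sum(fractions) / len(fractions)

print([round(f, 3) for f in fractions])  # [0.071, 0.117, 0.083]
print(round(average, 2))                 # 0.09
```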
Results
We were looking for 10 topics. Here are our results.
Relative Importance of Topics
[Plot: changes in relative importance θ′_i / θ_i, ranging from about 1.0 to 2.0, for the topics in order 10, 4, 9, 1, 3, 6, 8, 2, 5, 7]
Emerging Topics
And here are the program's suggested emerging topics. In particular, the topic "Greenhouse Atmosphere Breeding Optional Relative", which is clearly attributable to climate change, is correctly discovered and identified.
Example 2
Our next example is based on character merchandise sales reports from Star Wars and Disney.
Setup
We ran our program over these reports. We looked for 10 topics.
Results
Here are our results. Clearly, there are topics from both the Star Wars and the Disney reports.
Relative Importance of Topics
[Plot: relative importance θ′_i / θ_i of each topic]
Emerging Topics
Since our potential emerging topics come from both sales reports, we don't know which topic is our emerging topic.
How Can We Determine the Emerging Topics?
To find the emerging topics, we took the first 90% of the sales reports and ran LDA over it.
90% Results
Here are the topics from the first 90% of the reports. As you can see, all of them are Star Wars topics. We now compare these against the potential emerging topics from before and can discard the Star Wars topics as emerging topics.
Emerging Topics
Here are the program's suggested emerging topics. Since we have discarded the Star Wars topics, Topics 10, 6, and 3 are the emerging topics. That is, the Disney topics are our emerging topics.
Conclusions
We have developed an algorithm that automatically detects emerging topics. It performs well in our experiments. Our original purpose was to find emerging topics in NASA air traffic control incident reports; we are in the process of examining the NASA data.
Future Work
Gain a better understanding of the relationship between emerging and old topics (i.e., what is the mathematical meaning of the value of θ′_i / θ_i?). We have made our software (in R) and test data publicly available at http://www.math.sjsu.edu/~koev/camcos
Acknowledgments and References
We would like to thank:
All of you for coming
David Blei, for his LDA and DTM implementations and his paper Introduction to Probabilistic Topic Models
Bettina Grün and Kurt Hornik, for their paper topicmodels: An R Package for Fitting Topic Models and their R package and script
Additional Thanks
We would also like to thank:
Our sponsor, NASA
CAMCOS
Professor Hsu
Dr. Ginger Koev
Professor Koev, for supervising our team
We would like to extend our gratitude to our friends and families for their support
Questions?
Q&A
Thanks!
Thank you for coming to CAMCOS Reports Day, Fall 2011
Directions to Lunch
Please join us for lunch at Flames! [Map: Flames at 4th St. and San Fernando, near the King Library, the SJSU campus, San Salvador, and the Student Union]