Welcome to CAMCOS Reports Day Fall 2011



CAMCOS: Text Mining and Emerging Topics. Damien Adams, Neeti Mittal, Joanna Spencer, Huan Trinh, Annie Vu, Orvin Weng, Rachel Zadok. December 9, 2011

Outline


What is Text Mining? Our work deals with modeling and detecting emerging topics in documents using text mining. So what exactly is text mining?

Text mining is the act of getting a computer to read a document and identify its topics.

Why a Computer? You may ask: why do we want to get a computer to do this seemingly simple task? It is easy to read a paper that is only a few pages long and identify its topics; we learned how to do this in English class in grade school. But what about dozens, hundreds, or thousands of documents? The goal of text mining is to tackle document collections of sizes that are not humanly feasible.

The DeLorean Motor Company. In 2013, the DeLorean Motor Company will be producing DeLoreans again.

Flux Capacitor. Suppose they need to recall certain DeLoreans due to flux capacitor issues.

Without Text Mining. Without text mining, DMC would have to spend days reading all the reports, manually identify the topics (including the flux capacitor issue), and then recall the affected DeLoreans. This could take days or even weeks!

With Text Mining. On the other hand, DMC could use text mining: spend 10 minutes feeding all the reports into the computer, read the topics found by text mining, and recall the affected DeLoreans. This could take less than an hour!

Topic Modeling. The idea behind text mining is topic modeling. Given a document (a "bag of words"), we wish to identify its topics. In the previous example, the document would be the collection of incident reports, and the topic would be the flux capacitor issue.

Topic Modeling: Document → Words


What is a Topic? What exactly is a topic? When we read a paper, the topic is the main idea.

How can a topic be defined? Definition: A topic is a distribution of words in a document over a predetermined vocabulary.

Topic Modeling. What is topic modeling? We talked about it before, but here is a formal definition. Definition: Topic modeling is the use of methods to automatically assign the words in documents to topics.

We focus on topic modeling using LDA, Latent Dirichlet Allocation. LDA (2002) is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael Jordan.

Definition: Latent Dirichlet Allocation (LDA) is a generative process that defines a joint probability distribution over both the observed and the hidden random variables. Simply put, LDA uncovers the thematic structure hidden in a document. It generates the main ideas of a set of documents, which we call topics.

Latent Dirichlet Allocation. Examining what the words stand for: Latent: we observe the words in the documents, but the topics are hidden (latent). Dirichlet: LDA uses the Dirichlet distribution (next slide). Allocation: we allocate topics to documents.

Dirichlet Distribution. The Dirichlet distribution, with parameters $\alpha_1, \ldots, \alpha_K$, is a multivariate distribution of $K$ random variables $x_1, \ldots, x_K$. Its density satisfies
$$\mathrm{Dir}(\alpha_1, \ldots, \alpha_K) \propto \prod_{i=1}^{K} x_i^{\alpha_i - 1}.$$
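As an aside (an illustration added here, not from the talk), a draw from a Dirichlet distribution can be simulated in base R via the standard Gamma construction; every draw is a probability vector:

```r
# Minimal sketch, base R only (illustrative; not the team's code).
# A Dir(alpha) sample is a vector of independent Gamma(alpha_i, 1) draws,
# normalized so that the entries sum to 1.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}
rdirichlet1(c(2, 3, 5))  # e.g. 0.17 0.31 0.52 (random; always sums to 1)
```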

Example of Dirichlet Distribution (from Wikipedia). Consider an urn containing balls of $K$ different colors. Initially, the urn contains $\alpha_1$ balls of color 1, $\alpha_2$ balls of color 2, and so on. Now perform $N$ draws from the urn, where after each draw the ball is placed back into the urn along with an additional ball of the same color. In the limit as $N$ approaches infinity, the proportions of the different colored balls in the urn will be distributed as $\mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$. Jumping ahead: in our case, $\alpha_i$ will be the importance of topic $i$ among the $K$ topics.
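The urn scheme is easy to simulate; this sketch (ours, for illustration) reinforces the drawn color and returns the final color proportions, which for large $N$ behave like a single draw from $\mathrm{Dir}(\alpha_1, \ldots, \alpha_K)$:

```r
# Polya urn simulation (illustrative sketch, not from the talk).
polya_urn <- function(alpha, n_draws) {
  counts <- alpha                                         # initial balls per color
  for (d in seq_len(n_draws)) {
    color <- sample(seq_along(counts), 1, prob = counts)  # draw a ball
    counts[color] <- counts[color] + 1                    # replace it, add one more
  }
  counts / sum(counts)                                    # final color proportions
}
polya_urn(c(1, 2, 3), 10000)  # approximately one draw from Dir(1, 2, 3)
```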


LDA Assumptions. The topics are Dirichlet distributed over the words. The documents are Dirichlet distributed over the topics. The order of the documents does not matter (this is a deficiency, and exactly the one we address). The order of the words in the documents does not matter. The number of topics is assumed known and fixed.

Additional Assumptions. LDA automatically removes all the common words such as with, the, and, etc. It also removes all the numeric values, commas, parentheses, etc. It eliminates all the words that are repeated many times in the document. We only considered words with 3 or more letters.
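These rules map directly onto the controls of a document-term matrix in the tm package; a minimal sketch of such preprocessing (our illustration; the team's actual script may differ):

```r
# Sketch of the preprocessing rules above using the tm package.
library(tm)
my_text <- c("The DeLoreans were recalled, due to 12 flux capacitor issues.")
corpus  <- VCorpus(VectorSource(my_text))
dtm <- DocumentTermMatrix(corpus, control = list(
  removePunctuation = TRUE,      # commas, parentheses, special characters
  removeNumbers     = TRUE,      # numeric values
  stopwords         = TRUE,      # common words: with, the, and, ...
  wordLengths       = c(3, Inf)  # keep only words with 3 or more letters
))
inspect(dtm)
```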

Example. We ran LDA over a document about habitats, looking for five topics.

Output. The five discovered topics, with their top words and probabilities:

Topic 1           Topic 2          Topic 3          Topic 4          Topic 5
species .028      plant .019       habitat .038     species .021     species .076
environment .021  botanical .019   population .019  additional .021  particular .052
effect .014       physical .019    species .019     cycle .021       analysis .041
references .014   zoological .014  natural .019     effect .021      population .030
ecosystem .014    habitat .013     trophic .019     help .020        abundance .024

Each of the five topics is distributed over a list of words. This list consists of the entire vocabulary, with varying probabilities: the sum of the probabilities of all the words under any given topic is 1. Under each topic, as a word's probability decreases, its position drops.
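Output of this shape can be reproduced with the topicmodels package; a minimal sketch, assuming `dtm` now holds the document-term matrix of the habitats document:

```r
# Fit LDA with 5 topics and inspect the per-topic word distributions
# (sketch; `dtm` is a document-term matrix as built in the earlier snippet).
library(topicmodels)
fit  <- LDA(dtm, k = 5)
terms(fit, 5)          # top 5 words per topic, as in the table above
beta <- exp(fit@beta)  # topics-by-vocabulary matrix of word probabilities
rowSums(beta)          # each row sums to 1, as stated above
```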

Problems with LDA. The assumption that the order of the documents does not matter prevents us from differentiating between new information and prior knowledge. LDA fails to infer time-sensitive information.

Questions? Q&A

Time for a Break... We will now take a five-minute break.


Our Contribution. Our contribution to this field is the ability to automatically detect emerging topics.

Definition. An emerging topic is a topic that is more prominent now than it was before. We implemented a variable that measures this prominence: $\alpha_i$ is the prominence of topic $i$ over, say, the last month's worth of data. Now if $\alpha_i'$ is the prominence of topic $i$ over, say, the last week, then we mathematically define: Definition: topic $i$ is an emerging topic if $\alpha_i'/\alpha_i > 1$.
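In code, the definition is a one-line test on the two prominence vectors (the numbers below are made up for illustration):

```r
# Emerging-topic test on illustrative prominence values.
alpha       <- c(0.30, 0.25, 0.45)   # prominence over, say, the last month
alpha_prime <- c(0.20, 0.35, 0.45)   # prominence over, say, the last week
alpha_prime / alpha                  # 0.67 1.40 1.00
which(alpha_prime / alpha > 1)       # topic 2 is emerging
```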

Definition. David Blei devised the LDA algorithm. Our team implemented LDA with a new feature: emerging topic detection. Topic $i$ is an emerging topic if $\alpha_i'/\alpha_i > 1$.

Another Approach to Defining an Emerging Topic. Ranking is the change in a topic's position of importance. Alternate definition: topic $i$ is an emerging topic if there is a positive change in its position. We did not use this definition, as it would have required assuming that LDA outputs topics in some fixed order.

Goal. We want to use text mining to detect emerging topics relative to two different documents. We then want to observe only the topics with the greatest relative importance.

Our Algorithm

Our Algorithm. Written in the statistical language R; it uses the package topicmodels by Grün and Hornik.

The Input. Our algorithm takes three inputs:
1. A document (which we suspect of containing an emerging topic)
2. An estimated number of topics, $K$
3. The percentage of recent data, e.g.,
   14% = 1/7, if document = week and recent = last day
   23% = 7/30, if document = month and recent = last week
   17% = 2/12, if document = year and recent = last two months

How the Algorithm Works.
1. Preprocessing: common words (of, the, is, from, ...), special characters ($, %, ...), and numbers are discarded.
2. Using LDA, we discover the $K$ topics in the entire document, as well as their importances $\alpha_1, \alpha_2, \ldots, \alpha_K$.
3. For each topic $i$, we compute its importance $\alpha_i'$ in the recent part of the document.
4. The topics are sorted in decreasing order of $\alpha_i'/\alpha_i$.
5. The topics for which $\alpha_i'/\alpha_i > 1$ are displayed as suspected emerging topics.
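A compact sketch of these five steps, built on the tm and topicmodels packages. This is our illustrative reading of the algorithm, not the team's published script (which is linked at the end of the talk); in particular, taking a topic's average posterior weight as its importance $\alpha_i$ is an assumption:

```r
library(tm)
library(topicmodels)

# Sketch: `reports` is a character vector ordered in time; `recent_frac` is
# the percentage of recent data (e.g. 7/30 for last week within a month).
detect_emerging <- function(reports, K, recent_frac) {
  # Step 1: preprocessing via document-term-matrix controls
  dtm <- DocumentTermMatrix(VCorpus(VectorSource(reports)),
    control = list(removePunctuation = TRUE, removeNumbers = TRUE,
                   stopwords = TRUE, wordLengths = c(3, Inf)))

  # Step 2: discover K topics; average posterior weight as importance alpha_i
  fit   <- LDA(dtm, k = K)
  gamma <- posterior(fit)$topics                 # documents x topics
  alpha <- colMeans(gamma)

  # Step 3: importance alpha_i' over the recent part of the corpus
  recent      <- seq(ceiling((1 - recent_frac) * nrow(gamma)) + 1, nrow(gamma))
  alpha_prime <- colMeans(gamma[recent, , drop = FALSE])

  # Steps 4-5: sort by alpha_i'/alpha_i; ratios above 1 are suspects
  ratio <- sort(alpha_prime / alpha, decreasing = TRUE)
  list(topics = terms(fit, 5), ratio = ratio, emerging = ratio[ratio > 1])
}
```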

The Output. A list of words grouped by topic, and a plot of the relative importance of the topics in the document.


Setup. To test our algorithm, we created a document with an emerging topic in it. We took the Wikipedia entry for Habitat and introduced an emerging topic by appending an article from the EPA on climate change. The emerging topic represented about 10% of the size of the entire document.

Why an Emerging Topic Should Take Up About 10% of a Document. We will examine three different situations. Consider a company that turns in daily reports; from these reports, you want to discover emerging topics.

First, compare today's reports against the last week of reports. If half of today's reports concern emerging topics, then $\frac{1}{2} \cdot \frac{1}{7} = 0.071$; that is, about 7% of the last week's reports are about emerging topics.

Next, compare the last week of reports against the past month of reports. If half of the last week's reports concern emerging topics, then $\frac{1}{2} \cdot \frac{7}{30} = 0.117$; that is, about 12% of the last month's reports are about emerging topics.

Finally, compare the last two months' worth of reports against the past year's worth. If half of the last two months' reports concern emerging topics, then $\frac{1}{2} \cdot \frac{2}{12} = 0.083$; that is, about 8% of the past year's reports are about emerging topics.

To recap: $\frac{1}{2} \cdot \frac{1}{7} = 0.071$, $\frac{1}{2} \cdot \frac{7}{30} = 0.117$, and $\frac{1}{2} \cdot \frac{2}{12} = 0.083$. As you can see, these values are pretty close; in fact, their average is about 0.09. Thus 10%, give or take, is a good estimate.
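The same arithmetic, checked in R:

```r
# The three ratios from the recap and their average.
ratios <- c(1/2 * 1/7, 1/2 * 7/30, 1/2 * 2/12)
round(ratios, 3)  # 0.071 0.117 0.083
mean(ratios)      # about 0.09, hence the 10% rule of thumb
```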

Results. We were looking for 10 topics. Here are our results.

[Plot: "Changes in Relative Importance" — the relative importance $\alpha_i'/\alpha_i$ of each topic, with the vertical axis running from 1.0 to 2.0 and the topics ordered by ratio: 10, 4, 9, 1, 3, 6, 8, 2, 5, 7.]

Emerging Topics. And here are the program's suggested emerging topics. In particular, the topic (greenhouse, atmosphere, breeding, optional, relative), which is clearly attributable to climate change, is correctly discovered and identified.

Example 2. Our next example is based on character merchandise sales reports from Star Wars and Disney.

Setup. We ran our program over these reports, looking for 10 topics.

Results. Here are our results. Clearly, there are topics from both the Star Wars and the Disney reports.

[Plot: the relative importance $\alpha_i'/\alpha_i$ of the 10 topics.]

Emerging Topics. Since our potential emerging topics come from both sales reports, we don't know which topic is our emerging topic.

How Can We Determine the Emerging Topics? To find the emerging topics, we took the first 90% of the sales reports and ran LDA over it.
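In code, this baseline run is one more LDA fit on the earlier rows of the document-term matrix (a sketch reusing names from the pipeline above):

```r
# Refit on the first 90% of the reports to get a pre-emergence baseline.
n_docs   <- nrow(dtm)
baseline <- LDA(dtm[seq_len(floor(0.9 * n_docs)), ], k = 10)
terms(baseline, 5)   # in this example: Star Wars topics only
```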

90% Results. Here are the topics from the first 90% of the reports. As you can see, all of them are Star Wars topics. Comparing them against the potential emerging topics from before, we can discard the Star Wars topics as candidate emerging topics.

Emerging Topics. Here are the program's suggested emerging topics. Since we have discarded the Star Wars topics, topics 10, 6, and 3 are the emerging topics. That is, the Disney topics are our emerging topics.


Conclusions. We have developed an algorithm that automatically detects emerging topics, and it performs well in our experiments. Our original purpose was to find emerging topics in NASA air traffic control incident reports; we are in the process of examining the NASA data.

Future work: gain a better understanding of the relationship between emerging and old topics (i.e., what is the mathematical meaning of the value of $\alpha_i'/\alpha_i$?). We have made our software (in R) and test data publicly available at http://www.math.sjsu.edu/~koev/camcos

Acknowledgments and References. We would like to thank: all of you for coming; David Blei for his LDA and DTM implementations and his paper Introduction to Probabilistic Topic Models; and Bettina Grün and Kurt Hornik for their paper topicmodels: An R Package for Fitting Topic Models, and for their R package and script.

Additional Thanks. We would also like to thank our sponsor NASA, CAMCOS, Professor Hsu, Dr. Ginger Koev, and Professor Koev for supervising our team. We extend our gratitude to our friends and families for their support.

Questions? Q&A

Thanks! Thank you for coming to CAMCOS Reports Day Fall 2011.

Directions to Lunch. Please join us for lunch at Flames! [Map: Flames at 4th St. and San Fernando St., near the King Library, the SJSU campus, the Student Union, and San Salvador St.]