Week 7 Sentiment Analysis, Topic Detection


Reference and Slide Source: ChengXiang Zhai and Sean Massung. 2016. Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. Association for Computing Machinery and Morgan & Claypool, New York, NY, USA.

Context: multimedia systems handle several data types, including sensory data (video, audio, etc.), web data (text from OSN, news, etc.), and user data (user attributes, preferences, etc.).

Opinions

Text is generated by humans, and is therefore rich in subjective information! The real world is perceived from a perspective (the observed world) and then expressed in a language (e.g., English) as text data. Text data is subjective and opinion rich!

Applications: decision support (whether to go to the movie or not?), understanding human preferences, advertising, recommendations, and business intelligence (product feature evaluation).

What is an opinion? A subjective statement describing what a person thinks or believes about something!

Example: "Chomu is the best town in India." Something = Chomu; Belief = best town!

Naam Shabana reviews: "Awesome, I saw it twice..." "Unbelievably disappointing." "A good movie; if you have seen Baby previously then it must be watched. The role of Taapsee Pannu is really appreciable." "Instead of Naam Shabana they should have made Sar Dabaana, or strangle whoever shows it."

Opinion is generally analyzed in terms of sentiment, e.g., positive, negative & neutral!

Sentiment analysis has many other names: opinion extraction, opinion mining, sentiment mining, and subjectivity analysis.

Opinion Mining Tasks: detecting the opinion holder, detecting the opinion target, and detecting the opinion sentiment.

Sentiment Analysis. Simplest task: is the attitude of this text positive or negative? More complex: rate the attitude of this text from 1 to 5. Advanced: detect the target, source, or complex attitude types.

Sentiment analysis assumes all other parameters are known! E.g., in teaching feedback, students are writing about the instructor.

Feature Selection

Choose all words or only adjectives? Generally, using all words turns out to work better in experiments.

Word occurrence may matter more than word frequency: the occurrence of the word "fantastic" tells us a lot; the fact that it occurs 5 times may not tell us much more.

"I didn't like this movie" vs. "I really like this movie": how to handle negation?

Add NOT_ to every word between the negation and the following punctuation: "didn't like this movie, but I" becomes "didn't NOT_like NOT_this NOT_movie but I".
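A minimal Python sketch of this NOT_ marking heuristic. The negation cue list and the punctuation set are illustrative assumptions, not part of the original slide.

```python
import re

# Illustrative negation cues and clause-ending punctuation (assumptions).
NEGATIONS = {"not", "no", "never", "didn't", "don't", "isn't", "wasn't"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}

def mark_negation(text):
    """Prefix NOT_ to every token between a negation cue and the next punctuation."""
    tokens = re.findall(r"\w+'\w+|\w+|[.,!?;:]", text.lower())
    out, in_scope = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            in_scope = False        # negation scope ends at punctuation
            out.append(tok)
        elif in_scope:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok in NEGATIONS:
                in_scope = True     # start marking following words
    return " ".join(out)

print(mark_negation("I didn't like this movie, but I loved the cast."))
# i didn't NOT_like NOT_this NOT_movie , but i loved the cast .
```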

n-grams. Character n-grams: sequences of n adjacent characters treated as a unit. Word n-grams: sequences of n adjacent words treated as a unit. n-grams generally move by 1 unit.

Character n-grams example Text: student 2-grams st tu ud de en nt

Character n-grams example Text: In this paper 4-grams In_t n_th _thi this his_ is_p s_pa _pap pape aper

Word n-grams example Text: The cow jumps over the moon 2-grams the cow cow jumps jumps over over the the moon

n=1 - unigram n=2 - bigram n=3 - trigram
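A minimal sketch that reproduces the character and word n-gram examples from the slides (spaces shown as underscores in character n-grams, and the window moving by 1 unit).

```python
def char_ngrams(text, n):
    """Character n-grams; spaces shown as underscores, as in the slides."""
    s = text.replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def word_ngrams(text, n):
    """Word n-grams, moving one word at a time."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams("student", 2))        # ['st', 'tu', 'ud', 'de', 'en', 'nt']
print(char_ngrams("In this paper", 4))  # ['In_t', 'n_th', '_thi', 'this', ...]
print(word_ngrams("The cow jumps over the moon", 2))
# ['the cow', 'cow jumps', 'jumps over', 'over the', 'the moon']
```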

What happens to n-grams when one character is misspelled in a word? Assignment vs. Asignment: most n-grams are still shared, so n-grams are robust to such spelling errors!

Word vs. character n-grams: which one is more sparse? Character n-grams are far fewer in number than word n-grams, so they are less sparse.

Other n-grams: n-grams of PoS tags (e.g., an adjective followed by a noun), n-grams of word classes (places, names, people), and n-grams of word topics (politics, technical).

Analysis Methods

Classification/Regression: Support Vector Machines, Naïve Bayes Classifier, Logistic Regression, Ordinal Logistic Regression.

Logistic Regression for Binary Classification:
$$\log \frac{p(y=1 \mid X)}{p(y=0 \mid X)} = \log \frac{p(y=1 \mid X)}{1 - p(y=1 \mid X)} = \alpha + \sum_{i=1}^{M} x_i \beta_i$$
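A minimal sketch of evaluating this model: solving the log-odds equation above for p(y=1|X) gives the sigmoid of the linear score. The feature vector and weights below are hypothetical, not learned values.

```python
import math

def p_positive(x, alpha, beta):
    """p(y=1 | X) implied by the log-odds alpha + sum_i x_i * beta_i."""
    score = alpha + sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical features, e.g., counts of positive/negative cue words.
x = [2, 0, 1]
print(p_positive(x, alpha=-0.5, beta=[0.8, -1.2, 0.3]))  # probability of a positive label
```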

Rating Detection. Sentiment is generally measured on a scale from positive to negative, with other labels in between. Logistic regression can be extended to multi-class classification, e.g., ratings from 1 to 5.

Multi-level Logistic Regression (figure: a cascade of binary classifiers for sentiment analysis, i.e., prediction of ratings). Is p(r ≥ k | X) > 0.5? If yes, r = k. If no, is p(r ≥ k−1 | X) > 0.5? If yes, r = k−1; and so on down to p(r ≥ 2 | X) > 0.5. If that also fails, r = 1.

Problems: too many parameters, and the classifiers are not really independent.

Ordinal Logistic Regression. Assume the β parameters are the same for all classifiers and keep a different α_j for each classifier. Hence, M + K − 1 parameters in total:
$$\log \frac{p(y_j = 1 \mid X)}{p(y_j = 0 \mid X)} = \log \frac{p(r \geq j \mid X)}{1 - p(r \geq j \mid X)} = \alpha_j + \sum_{i=1}^{M} x_i \beta_i$$
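A minimal sketch of rating prediction with a shared β vector and one α_j per threshold, applied with the cascade from the previous slide. The feature vector and parameter values are illustrative assumptions; with M = 3 features and K = 5 levels there are M + K − 1 = 7 parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_rating(x, alphas, beta):
    """
    Ordinal logistic regression prediction on a 1..K scale.
    alphas[j] is the intercept of the classifier "is r >= j+2?",
    so there are K-1 alphas plus one shared beta vector.
    """
    score = sum(xi * bi for xi, bi in zip(x, beta))  # shared linear score
    k = len(alphas) + 1                              # number of rating levels
    for j in range(k, 1, -1):                        # test r>=k, then r>=k-1, ..., r>=2
        if sigmoid(alphas[j - 2] + score) > 0.5:
            return j
    return 1

# Hypothetical 5-level example: 3 betas + 4 alphas = 7 parameters.
x = [1.0, 0.0, 2.0]
alphas = [1.5, 0.5, -0.5, -1.5]   # intercepts for r>=2, r>=3, r>=4, r>=5
beta = [0.4, -0.7, 0.2]
print(predict_rating(x, alphas, beta))  # 4
```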

More Sentiments. Emotion: angry, sad, ashamed. Mood: cheerful, depressed, irritable... Interpersonal stances: friendly, cold... Attitudes: loving, hating... Personality traits: introvert, extrovert...

Topic detection: given a text document, what main topics does the document cover? 1. I like to eat broccoli and bananas. 2. My sister adopted a kitten yesterday. 3. Look at this cute hamster munching on a piece of broccoli.

Topic detection has two main tasks: discover the k topics covered across all documents, and measure the individual topic coverage of each document.

Idea: choose individual terms as topics! E.g., Sports, Travel, Food.

Terms can be chosen based on TF or TF-IDF ranking!

Problem: the top few terms can be similar or even synonymous. Solution: go down the ranking and choose terms that are different from the already chosen terms: Maximal Marginal Relevance (MMR).

Maximal Marginal Relevance (MMR): Score = λ · Relevance − (1 − λ) · Redundancy, where Relevance is the original ranking score and Redundancy is the similarity with already selected terms.
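A minimal sketch of greedy MMR-style term selection. The relevance scores, the character-overlap similarity function, and λ = 0.7 are illustrative assumptions; in practice Relevance would come from TF or TF-IDF and Redundancy from a term-similarity measure.

```python
def mmr_select(candidates, relevance, similarity, k, lam=0.7):
    """Greedily pick terms maximizing lam*Relevance - (1-lam)*Redundancy."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def score(t):
            redundancy = max((similarity(t, s) for s in selected), default=0.0)
            return lam * relevance[t] - (1 - lam) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

def sim(a, b):
    """Hypothetical similarity: Jaccard overlap of character sets."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

relevance = {"sports": 0.9, "sport": 0.85, "travel": 0.6, "food": 0.5}
print(mmr_select(relevance.keys(), relevance, sim, k=3))
# ['sports', 'travel', 'food'] -- 'sport' is skipped as redundant with 'sports'
```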

Count the frequency of these terms in the document to measure coverage. For example, in document d_i: count(sports, d_i) = 4, count(travel, d_i) = 2, count(science, d_i) = 1. The coverage of topic θ_j in document d_i is
$$\pi_{ij} = \frac{\text{count}(\theta_j, d_i)}{\sum_{L=1}^{k} \text{count}(\theta_L, d_i)}$$
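A minimal sketch of this coverage formula, using the counts from the slide.

```python
def topic_coverage(term_counts):
    """pi_ij = count(theta_j, d_i) / sum_L count(theta_L, d_i)."""
    total = sum(term_counts.values())
    return {topic: c / total for topic, c in term_counts.items()}

# Counts from the slide: sports 4, travel 2, science 1 in document d_i.
print(topic_coverage({"sports": 4, "travel": 2, "science": 1}))
# {'sports': 0.571..., 'travel': 0.285..., 'science': 0.142...}
```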

Limitations of a single term as a topic: 1. It is difficult to decide which terms are similar when choosing a topic. 2. Topics can be complicated, e.g., politics in sports or sports injuries. 3. There can be multiple topics in a document, e.g., sports and politics. 4. Ambiguous words.

Solution: model topics as word distributions!
θ_1 = Sports, P(w | θ_1): sports 0.02, game 0.01, basketball 0.005, football 0.004, play 0.003, star 0.003, nba 0.001, travel 0.0005
θ_2 = Travel, P(w | θ_2): travel 0.05, attraction 0.03, trip 0.01, flight 0.004, hotel 0.003, island 0.003, culture 0.001, play 0.0002
θ_k = Science, P(w | θ_k): science 0.04, scientist 0.03, spaceship 0.006, telescope 0.004, genomics 0.004, star 0.002, genetics 0.001, travel 0.00001
Vocabulary set V = {w_1, w_2, ...}, with $\sum_{w \in V} p(w \mid \theta_i) = 1$.

Refined Topic Modeling. Input: a set of documents. Output: two types of distributions: a word distribution for each topic, and a topic distribution for each document.

Two Distributions. Word distribution for topic i (global across all documents): $\sum_{w \in V} p(w \mid \theta_i) = 1$. Topic distribution for document i (local to a document): $\sum_{j=1}^{k} \pi_{ij} = 1$ for all i.

How to obtain the probability distributions from the input data?

Maximum Likelihood Estimate of a Generative Model. The model defines $p(\text{data} \mid \text{model}, \Lambda)$; parameter estimation/inference finds $\Lambda^* = \arg\max_{\Lambda} p(\text{data} \mid \text{model}, \Lambda)$, i.e., choose the parameters that give the maximum probability of producing the training data.

Advantages Multiple words allow us to describe fairly complicated topics. The term weights model subtle differences of semantics in related topics. Because we have probabilities for the same word in different topics, we can accommodate multiple senses of a word, addressing the issue of word ambiguity. 48

Mining One Topic. Input: C = {d}, V. Output: {θ}, i.e., P(w | θ) for a single topic. Example: for document d, the topic θ assigns probabilities to words such as text, mining, association, database, query, and the document covers this one topic 100%.

Mining One Topic (maximum likelihood estimate):
$$p(w_i \mid \theta) = \frac{c(w_i, d)}{|d|}, \quad i = 1 \text{ to } M,$$
where |d| is the document length and M is the number of words in the dictionary.
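A minimal sketch of this maximum likelihood estimate: the single-topic word distribution is just the normalized word counts of the document. The toy document below is an illustrative assumption.

```python
from collections import Counter

def mle_unigram(doc_words):
    """p(w | theta) = c(w, d) / |d| -- normalized word counts (ML estimate)."""
    counts = Counter(doc_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

doc = "text mining text data mining association database query text".split()
theta = mle_unigram(doc)
print(theta["text"], theta["mining"])          # 3/9 and 2/9
assert abs(sum(theta.values()) - 1.0) < 1e-9   # probabilities sum to 1
```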

Mining Multiple Topics. From documents Doc 1, Doc 2, ..., Doc N, learn k topic word distributions and the per-document coverages. E.g., θ_1: sports 0.02, game 0.01, basketball 0.005, football 0.004; θ_2: travel 0.05, attraction 0.03, trip 0.01; ...; θ_k: science 0.04, scientist 0.03, spaceship 0.006. Coverages: π_11 = 30%, π_21 = 0%, π_N1 = 0%, π_12 = 12%, ..., π_Nk = 8%.

How to Compress Text? The computer science and engineering department offers many computer related courses. Although there is a science term in the name, no science courses are offered in the department.

Run Length Coding? The computer science and engineering department offers many computer related courses. Although there is science term in the name, no science courses are offered in the department. Not a good idea!

Differential Coding? The computer science and engineering department offers many computer related courses. Although there is science term in the name, no science courses are offered in the department. Not a good idea!

Huffman Coding? The computer science and engineering department offers many computer related courses. Although there is science term in the name, no science courses are offered in the department.

Huffman Coding. Removes redundancy in the source information: frequent symbols are assigned shorter codes, and the result is a prefix code.

Can we refer to a dictionary entry of the word?

Dictionary-based Compression. Do not encode individual symbols as variable-length bit strings; instead, encode variable-length strings of symbols as single tokens. The tokens form an index into a phrase dictionary. If the tokens are smaller than the phrases they replace, we have compression.

Lempel-Ziv-Welch (LZW) Coding: a dictionary-based coding algorithm that builds the dictionary dynamically. Initially the dictionary contains only character codes; the remaining entries are then built dynamically.

LZW Compression
set w = NIL
loop
    read a character k
    if wk exists in the dictionary
        w = wk
    else
        output the code for w
        add wk to the dictionary
        w = k
endloop
Ref: http://users.ece.utexas.edu/~ryerraballi/msb/pdfs/m2l3.pdf

Example. Input string: ^WED^WE^WEE^WEB^WET

w     k     Output   New dictionary entry
NIL   ^
^     W     ^        256 = ^W
W     E     W        257 = WE
E     D     E        258 = ED
D     ^     D        259 = D^
^     W
^W    E     <256>    260 = ^WE
E     ^     E        261 = E^
^     W
^W    E
^WE   E     <260>    262 = ^WEE
E     ^
E^    W     <261>    263 = E^W
W     E
WE    B     <257>    264 = WEB
B     ^     B        265 = B^
^     W
^W    E
^WE   T     <260>    266 = ^WET
T     EOF   T

Ref: http://users.ece.utexas.edu/~ryerraballi/msb/pdfs/m2l3.pdf
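A runnable Python sketch of the compression pseudocode above, assuming the dictionary is initialized with all 256 single-character codes; it reproduces the worked example.

```python
def lzw_compress(text):
    """LZW compression following the pseudocode: grow the dictionary as we go."""
    dictionary = {chr(i): i for i in range(256)}   # initial single-character codes
    next_code = 256
    w, output = "", []
    for k in text:
        wk = w + k
        if wk in dictionary:
            w = wk                                 # extend the current phrase
        else:
            output.append(dictionary[w])           # emit code for w
            dictionary[wk] = next_code             # add new phrase wk
            next_code += 1
            w = k
    if w:
        output.append(dictionary[w])               # emit the final phrase
    return output

codes = lzw_compress("^WED^WE^WEE^WEB^WET")
print(codes)
# [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
# i.e. ^ W E D <256> E <260> <261> <257> B <260> T, as in the trace above.
```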

LZW Decompression
read a fixed length token k (code or char)
output k
w = k
loop
    read a fixed length token k (code or char)
    entry = dictionary entry for k
    output entry
    add w + first char of entry to the dictionary
    w = entry
endloop
Ref: http://users.ece.utexas.edu/~ryerraballi/msb/pdfs/m2l3.pdf

Example of LZW decompression. Input string (to decode): ^WED<256>E<260><261><257>B<260>T

w     k       Output   New dictionary entry
      ^       ^
^     W       W        256 = ^W
W     E       E        257 = WE
E     D       D        258 = ED
D     <256>   ^W       259 = D^
^W    E       E        260 = ^WE
E     <260>   ^WE      261 = E^
^WE   <261>   E^       262 = ^WEE
E^    <257>   WE       263 = E^W
WE    B       B        264 = WEB
B     <260>   ^WE      265 = B^
^WE   T       T        266 = ^WET

Ref: http://users.ece.utexas.edu/~ryerraballi/msb/pdfs/m2l3.pdf
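A matching Python sketch of the decompression pseudocode, which rebuilds the same dictionary on the fly. The branch for a code that is not yet in the dictionary is the standard LZW corner case; it is not needed for this example and is included only for completeness.

```python
def lzw_decompress(codes):
    """LZW decompression following the pseudocode above."""
    dictionary = {i: chr(i) for i in range(256)}   # initial single-character codes
    next_code = 256
    w = dictionary[codes[0]]
    output = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        else:
            # Standard corner case: the code refers to the entry being built.
            entry = w + w[0]
        output.append(entry)
        dictionary[next_code] = w + entry[0]       # add w + first char of entry
        next_code += 1
        w = entry
    return "".join(output)

codes = [94, 87, 69, 68, 256, 69, 260, 261, 257, 66, 260, 84]
print(lzw_decompress(codes))   # ^WED^WE^WEE^WEB^WET
```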

Where is the Compression? Input string: ^WED^WE^WEE^WEB^WET = 19 × 8 bits = 152 bits. Encoded: ^WED<256>E<260><261><257>B<260>T = 12 × 9 bits = 108 bits (7 symbols and 5 codes, each of 9 bits). Why 9 bits?

With 9-bit tokens, codes whose leading bit is 0 represent ASCII characters (0 to 255), and codes whose leading bit is 1 represent dictionary codes (256 to 511). Ref: http://users.ece.utexas.edu/~ryerraballi/msb/pdfs/m2l3.pdf

Original LZW uses 12-bit codes: entries 0 to 255 are ASCII characters, and entries 256 onward are dictionary codes (4096 entries in total). Ref: http://users.ece.utexas.edu/~ryerraballi/msb/pdfs/m2l3.pdf

What if we run out of codes? Flush the dictionary periodically, or grow the length of the codes dynamically.