Behavioral Data Mining. Lecture 2




Bayes Theorem

Bayes Theorem P(A|B) = probability of A given that B is true. P(A|B) = P(B|A) P(A) / P(B). In practice we are most interested in dealing with events e and data D: e = I have a cold, D = runny nose, watery eyes, coughing. Then P(e|D) = P(D|e) P(e) / P(D). So Bayes theorem is diagnostic.
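
As a rough illustration, here is a minimal Scala sketch of the diagnostic calculation for the cold example; all of the probability values are made-up assumptions, not data from the slides.

```scala
// Minimal sketch of the diagnostic use of Bayes theorem for the cold example.
// All probability values here are illustrative assumptions.
object BayesExample {
  def main(args: Array[String]): Unit = {
    val pCold      = 0.05   // prior P(e): I have a cold
    val pDGivenE   = 0.90   // likelihood P(D|e): symptoms given a cold
    val pDGivenNot = 0.10   // P(D|not e): symptoms without a cold

    // Total probability of the data D
    val pD = pDGivenE * pCold + pDGivenNot * (1 - pCold)

    // Posterior P(e|D) = P(D|e) P(e) / P(D)
    val posterior = pDGivenE * pCold / pD
    println(f"P(cold | symptoms) = $posterior%.3f")
  }
}
```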

Bayes Theorem P(e|D) = P(D|e) P(e) / P(D). P(e) is called the prior probability of e. It's what we know (or think we know) about e with no other evidence. P(D|e) is the conditional probability of D given that e happened, or just the likelihood of D. This can often be measured or computed precisely; it follows from your model assumptions. P(e|D) is the posterior probability of e given D. It's the answer we want, or the way we choose a best answer. You can see that the posterior is heavily colored by the prior, so Bayes has a GIGO liability, e.g. it's not used to test hypotheses.

Bayesian vs Frequentist Perspectives Bayesian inference is most often used to represent probabilities which are beliefs about the world. This is the Bayesian perspective. The Bayesian perspective was not widely accepted until the mid-20th century, since most statisticians held a frequentist perspective. The frequentist perspective interprets probability narrowly as relative frequencies of events. Bayesian User Conference 2012: Tentative program follows. Based on attendance, each day's final program will be announced the following day.

Alternatives We know from the axioms of probability that P(e) + P(¬e) = 1. Or if we have multiple, mutually exclusive and exhaustive hypotheses (e.g. when classifying) e_1, ..., e_n, then P(e_1) + ... + P(e_n) = 1. It's also true that P(e|D) + P(¬e|D) = 1, and P(e_1|D) + ... + P(e_n|D) = 1. The normalizing probability P(D) in Bayes theorem may not be known directly, in which case we can compute it as P(D) = P(D|e_1) P(e_1) + ... + P(D|e_n) P(e_n). Or not. We don't need it to rank hypotheses.

Ranking The un-normalized form of Bayes is P(e|D) ∝ P(D|e) P(e), which is good enough to find the best hypothesis.
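
A small Scala sketch of ranking by the un-normalized posterior P(D|e) P(e); the hypotheses, priors, and likelihoods are illustrative assumptions.

```scala
// Sketch: rank hypotheses by the un-normalized posterior P(D|e) P(e).
// Dividing every score by P(D) would not change the ordering.
object RankHypotheses {
  def main(args: Array[String]): Unit = {
    // Map each hypothesis to (prior, likelihood of the observed data)
    val hypotheses = Map(
      "cold" -> (0.05, 0.90),
      "flu"  -> (0.02, 0.80),
      "none" -> (0.93, 0.05)
    )
    val scores = hypotheses.map { case (h, (prior, lik)) => h -> lik * prior }
    val best = scores.maxBy(_._2)
    println(s"best hypothesis: ${best._1} (score ${best._2})")
  }
}
```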

Chains of evidence Bayes theorem has a recursive form for updates given new information: P(e|D_2, D_1) = P(D_2|e, D_1) P(e|D_1) / P(D_2|D_1), where the old posterior P(e|D_1) becomes the new prior.
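
A Scala sketch of the recursive update, assuming each piece of evidence is conditionally independent given e; the prior and likelihood values are made-up assumptions.

```scala
// Sketch of sequential Bayesian updating: after each piece of evidence the
// old posterior becomes the new prior. Evidence is assumed conditionally
// independent given e; all numbers are illustrative.
object SequentialUpdate {
  // One Bayes step for a binary hypothesis e vs. not-e.
  def update(prior: Double, pDGivenE: Double, pDGivenNotE: Double): Double = {
    val pD = pDGivenE * prior + pDGivenNotE * (1 - prior)
    pDGivenE * prior / pD
  }

  def main(args: Array[String]): Unit = {
    val evidence = List((0.9, 0.1), (0.7, 0.3), (0.8, 0.2)) // (P(D_i|e), P(D_i|not e))
    val posterior = evidence.foldLeft(0.05) { case (prior, (pe, pne)) =>
      update(prior, pe, pne)
    }
    println(f"posterior after all evidence = $posterior%.3f")
  }
}
```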

Using Logs For any positive p and q, p > q if and only if log(p) > log(q). It follows that we can work with logs of probabilities and still find the same best or most likely hypothesis. Using logs: Avoids numerical problems (overflow and underflow) that happen with long chains of multiplications. Often leads to linear operations between model and data values.
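
A quick Scala sketch of the numerical motivation: a direct product of many small probabilities underflows to zero in Double arithmetic, while the sum of logs remains a usable score.

```scala
// Sketch of why logs help: the direct product of many small probabilities
// underflows, while the log-space sum stays well-behaved.
object LogSpace {
  def main(args: Array[String]): Unit = {
    val probs = Seq.fill(500)(0.001) // 500 term probabilities of 0.001 each

    val direct = probs.product            // underflows to 0.0
    val logSum = probs.map(math.log).sum  // a finite, comparable score

    println(s"direct product = $direct")  // 0.0
    println(s"sum of logs    = $logSum")  // about -3454
  }
}
```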

Naïve Bayes Classifier We assume we have a set of documents, and a set of classification classes. Then the probability that a document d lies in class c is P(c|d) ∝ P(c) ∏_{1 ≤ k ≤ n_d} P(t_k|c), where P(t_k|c) is the probability that the k-th term (word) appears in a document in class c, P(c) is the prior for class c, and n_d is the number of terms in document d. This simple form follows from the assumption that the P(t_k|c) are independent. Since that's almost never true, this is a naïve assumption.

Classification For classification, we want the most likely class given the data. This is the Maximum A-Posteriori estimate, or MAP estimate: c_map = argmax_c P̂(c|d) = argmax_c P̂(c) ∏_{1 ≤ k ≤ n_d} P̂(t_k|c), where the hat (^) denotes an estimate of the corresponding probability. Things will start to simplify when we take logs: c_map = argmax_c [ log P̂(c) + Σ_{1 ≤ k ≤ n_d} log P̂(t_k|c) ]. And then notice that the second term can be computed with a vector dot product.

Classification i.e. if we compute log P(t|c) for all terms, and then take an inner product with a (0,1) sparse vector for terms occurring in the document, we get the RHS sum. Note: for fast classification, you can insert the log P(t|c) vectors as documents in a search engine that supports term weights. Then using a new document as a query, the engine will retrieve a likelihood-sorted list of classes. The first one is the best match.
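
A minimal Scala sketch of this dot-product view, assuming per-class maps of log P(t|c) and equal priors; the vocabulary and values are illustrative, and out-of-vocabulary terms are simply skipped here (a real classifier would rely on the smoothing discussed later).

```scala
// Sketch of classification as a dot product: given per-class log P(t|c)
// values and a log prior, score a document by summing the entries for the
// terms it contains. Classes, terms, and probabilities are assumptions.
object DotProductNB {
  def score(logPrior: Double, logPtc: Map[String, Double], docTerms: Set[String]): Double =
    logPrior + docTerms.toSeq.flatMap(logPtc.get).sum

  def main(args: Array[String]): Unit = {
    val classes = Map(
      "pos" -> (math.log(0.5), Map("great" -> math.log(0.02), "boring" -> math.log(0.001))),
      "neg" -> (math.log(0.5), Map("great" -> math.log(0.002), "boring" -> math.log(0.02)))
    )
    val doc = Set("great", "fun") // "fun" is out of vocabulary and is skipped here
    val best = classes.map { case (c, (lp, lptc)) => c -> score(lp, lptc, doc) }.maxBy(_._2)
    println(s"predicted class: ${best._1}")
  }
}
```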

Best and Worst of NB Classifiers Best: Simple and fast. Depends only on term frequency data for the classes. One shot, no iteration. Very well-behaved numerically. Term weight depends only on the frequency of that term, decoupled from other terms. Can work very well with sparse data, where combinations of dependent terms are rare. Worst: Subject to error and bias when term probabilities are not independent (e.g. URL prefixes). Can't model patterns in the data. Typically not as accurate as other methods*. * This may not be true when data is very sparse.

Parameter Estimation For the priors, we can use a maximum likelihood estimate (MLE) which is just P̂(c) = N_c / N, where N_c is the number of training docs in class c and N is the total number of training docs. For the term probabilities we have P̂(t|c) = T_ct / Σ_{t' ∈ V} T_ct', where T_ct is the number of occurrences of term t in training documents from class c, and V is the term vocabulary.
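
A Scala sketch of these MLE estimates on a toy labelled corpus; the documents and class labels are assumptions chosen only to show the counting.

```scala
// Sketch of the MLE estimates from labelled training documents, where each
// document is a (class, terms) pair. The toy corpus is an assumption.
object MLEEstimates {
  def main(args: Array[String]): Unit = {
    val docs: Seq[(String, Seq[String])] = Seq(
      ("pos", Seq("great", "fun", "great")),
      ("pos", Seq("fun")),
      ("neg", Seq("boring", "dull"))
    )
    val n = docs.size.toDouble

    // Priors: P(c) = N_c / N
    val priors = docs.groupBy(_._1).map { case (c, ds) => c -> ds.size / n }

    // Term probabilities: P(t|c) = T_ct / sum over t' of T_ct'
    val condProbs = docs.groupBy(_._1).map { case (c, ds) =>
      val counts = ds.flatMap(_._2).groupBy(identity).map { case (t, ts) => t -> ts.size.toDouble }
      val total  = counts.values.sum
      c -> counts.map { case (t, cnt) => t -> cnt / total }
    }

    println(priors)
    println(condProbs)
  }
}
```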

Bernoulli vs Multinomial models There are two ways to deal with repetition of terms: Multinomial models: distinct placements of terms are treated as independent terms with the same probability. Bernoulli: placements are not counted; term presence is a binary variable. The multinomial formula has n_d terms, where n_d is the length of the doc. The Bernoulli formula has M terms, where M is the number of distinct terms in the doc.
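
A small Scala sketch of the counting difference on one document; the document is an illustrative assumption.

```scala
// Sketch of how the two models see a document's terms: the multinomial model
// uses one factor per token (n_d factors), the Bernoulli model one binary
// factor per distinct vocabulary term.
object TermCounting {
  def main(args: Array[String]): Unit = {
    val doc = Seq("to", "be", "or", "not", "to", "be")

    // Multinomial: every placement counts, repeated terms contribute repeatedly
    val multinomialCounts = doc.groupBy(identity).map { case (t, ts) => t -> ts.size }
    // Bernoulli: presence only, one binary value per distinct term
    val bernoulliPresence = doc.toSet

    println(s"multinomial counts: $multinomialCounts") // to -> 2, be -> 2, or -> 1, not -> 1
    println(s"bernoulli presence: $bernoulliPresence") // {to, be, or, not}
  }
}
```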

Bernoulli vs Multinomial models Multinomial: regarded as better for long docs. Bernoulli: better for short docs (Tweets? SMS?). Also often better for web browsing data: counts are less meaningful because of browser refresh, the back button, etc.

Dealing with Zero Counts It's quite common (because of power law behavior) to get T_ct counts which are zero in the training set, even though those terms may occur in class documents outside the training set. Just one of these drives the posterior estimate for that document/class to zero, when in fact it may be a good choice. You can avoid zero probability estimates by smoothing.

Laplace Smoothing Or add-one smoothing: P̂(t|c) = (T_ct + 1) / (Σ_{t' ∈ V} T_ct' + B), where B = |V| is the vocabulary size. Rationale: Analyze the counts as an explicit probabilistic process (Dirichlet/multinomial), and find the expected value of the model parameter, instead of the MLE estimate.

Additive Smoothing In practice, it's worth trying smaller values of the additive smoothing parameter. Typically a positive value < 1 is used. This is a compromise between the expected and the most likely parameter for the count model.
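
A Scala sketch of additive smoothing, assuming the usual convention of calling the smoothing parameter alpha (alpha = 1 gives Laplace add-one smoothing); the counts and vocabulary are illustrative assumptions.

```scala
// Sketch of additive smoothing for the term probabilities:
// P(t|c) = (T_ct + alpha) / (total count in class + alpha * |V|).
object Smoothing {
  def smoothedProbs(counts: Map[String, Double], vocab: Set[String], alpha: Double): Map[String, Double] = {
    val total = counts.values.sum
    val denom = total + alpha * vocab.size
    vocab.map(t => t -> (counts.getOrElse(t, 0.0) + alpha) / denom).toMap
  }

  def main(args: Array[String]): Unit = {
    val vocab  = Set("great", "fun", "boring", "dull")
    val counts = Map("great" -> 3.0, "fun" -> 1.0) // "boring", "dull" unseen in this class
    println(smoothedProbs(counts, vocab, alpha = 1.0)) // add-one (Laplace)
    println(smoothedProbs(counts, vocab, alpha = 0.1)) // smaller alpha, closer to the MLE
  }
}
```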

Results

First Assignment Train and evaluate a naïve Bayes classifier on a sentiment dataset: a collection of movie reviews with positive/negative tags. Use ScalaNLP to preprocess the text data into sparse vectors and matrices, and write your classifier for that data. Run it and measure accuracy on a disjoint sample of the same dataset (cross-validation). The result will be usable as a sentiment weight vector for marking up social media posts.

Feature Selection Word frequencies follow power laws. (Figure: word frequencies in Wikipedia, with word rank on the x axis and frequency on the y axis.)

Feature Selection One consequence of this is that vocabulary size grows almost linearly with the size of the corpus. A second consequence is that about half the words occur only once. Rare words are less helpful for classification; most are not useful at all. Feature selection is the process of trimming a feature set (in this case the set of words) to a more practical size.
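
A minimal Scala sketch of the crudest form of trimming, dropping terms below an assumed frequency threshold; the corpus and threshold are illustrative.

```scala
// Sketch of the simplest feature selection: drop terms whose corpus frequency
// falls below a threshold. The tiny corpus and threshold are assumptions.
object FrequencyCutoff {
  def main(args: Array[String]): Unit = {
    val corpus = Seq("the", "movie", "was", "great", "the", "acting", "was", "great", "zanily")
    val minCount = 2
    val kept = corpus.groupBy(identity)
      .collect { case (t, ts) if ts.size >= minCount => t }
      .toSet
    println(kept) // terms occurring at least twice: the, was, great
  }
}
```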

Mutual Information Mutual information measures the extent to which knowledge of one variable influences the distribution of another: I(U; C) = Σ_{e_t ∈ {0,1}} Σ_{e_c ∈ {0,1}} P(U = e_t, C = e_c) log2[ P(U = e_t, C = e_c) / (P(U = e_t) P(C = e_c)) ], where U is a random variable which is 1 if term t is in a given document, 0 otherwise, and C is 1 if the document is in the class c, 0 otherwise. These are called indicator random variables. Mutual information can be used to rank terms; the highest will be kept for the classifier and the rest ignored.
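
A Scala sketch that computes this mutual information from a 2x2 table of document counts; the counts in the example call are made-up assumptions.

```scala
// Sketch of mutual information between the term-presence indicator U and the
// class indicator C, computed from a 2x2 table of document counts.
object MutualInformation {
  def log2(x: Double): Double = math.log(x) / math.log(2)

  // n11: term present, in class; n10: present, not in class;
  // n01: absent, in class;      n00: absent, not in class.
  def mi(n11: Double, n10: Double, n01: Double, n00: Double): Double = {
    val n = n11 + n10 + n01 + n00
    val cells = Seq(
      (n11, n11 + n10, n11 + n01),
      (n10, n11 + n10, n10 + n00),
      (n01, n01 + n00, n11 + n01),
      (n00, n01 + n00, n10 + n00)
    )
    // Each term: (joint/N) * log2( N * joint / (row marginal * column marginal) )
    cells.collect { case (joint, rowSum, colSum) if joint > 0 =>
      (joint / n) * log2(n * joint / (rowSum * colSum))
    }.sum
  }

  def main(args: Array[String]): Unit =
    println(f"I(U;C) = ${mi(30.0, 70.0, 20.0, 880.0)}%.6f")
}
```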

CHI-Squared CHI-squared is an important statistic to know for comparing count data. Here it is used to measure dependence between word counts in documents and in classes. Similar to mutual information, terms that show dependence are good candidates for feature selection. CHI-squared can be visualized as a test on contingency tables like this one:

            Right-Handed   Left-Handed   Total
Males                 43             9      52
Females               44             4      48
Total                 87            13     100

CHI-Squared The CHI-squared statistic is χ² = Σ_i (O_i − E_i)² / E_i. It measures the extent to which the actual (observed) counts O_i in the table are different from the expected counts E_i, which assume independence.
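
A Scala sketch of the statistic applied to the handedness table on the previous slide; the expected counts are computed from the row and column totals under independence.

```scala
// Sketch of the chi-squared statistic on a contingency table: observed counts
// vs. expected counts under independence of rows and columns.
object ChiSquared {
  def chiSquared(observed: Array[Array[Double]]): Double = {
    val total   = observed.map(_.sum).sum
    val rowSums = observed.map(_.sum)
    val colSums = observed.transpose.map(_.sum)
    val terms = for {
      i <- observed.indices
      j <- observed(i).indices
      expected = rowSums(i) * colSums(j) / total
    } yield math.pow(observed(i)(j) - expected, 2) / expected
    terms.sum
  }

  def main(args: Array[String]): Unit = {
    // Rows: Males, Females; Columns: Right-Handed, Left-Handed
    val table = Array(Array(43.0, 9.0), Array(44.0, 4.0))
    println(f"chi-squared = ${chiSquared(table)}%.3f") // roughly 1.8 for this table
  }
}
```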

Feature Set Size and Accuracy

Summary Bayes theorem allows us to adjust beliefs about hidden events from data. Naïve Bayes classifiers use simple generative models (multinomial or Bernoulli) to infer posterior probabilities of class membership. Naïve Bayes assumes independence between term probabilities. NB is fast, simple, and very robust. Feature selection can drastically reduce feature set size and improve accuracy.