Entropy. Expected Surprise
(c) 2003 Thomas G. Dietterich


Entropy

Let X be a discrete random variable.
The surprise of observing X = x is defined as -log2 P(X = x).
The surprise of an outcome with probability 1 is zero; the surprise of an outcome with probability 0 is infinite.

Expected Surprise

What is the expected surprise of X?
Σ_x P(X = x) [-log2 P(X = x)] = -Σ_x P(X = x) log2 P(X = x)
This is known as the entropy of X, written H(X).

[Figure: H(X) plotted against P(X = 0) for a binary variable X; the curve rises from 0 to a maximum of 1 bit at P(X = 0) = 0.5 and falls back to 0 at P(X = 0) = 1.]
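
As a quick illustration of these definitions (not from the original slides), here is a minimal Python sketch that computes the surprise of individual outcomes and the entropy of a discrete distribution:

```python
import math

def surprise(p):
    """Surprise of an outcome with probability p, in bits: -log2 p."""
    return float("inf") if p == 0 else -math.log2(p)

def entropy(dist):
    """Entropy H(X) = -sum_x P(X=x) log2 P(X=x); terms with P = 0 contribute 0."""
    return sum(p * surprise(p) for p in dist.values() if p > 0)

# A fair coin has 1 bit of entropy; a biased coin has less.
print(entropy({"heads": 0.5, "tails": 0.5}))   # 1.0
print(entropy({"heads": 0.9, "tails": 0.1}))   # about 0.469
```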

Shannon's Experiment

Measure the entropy of English.
Ask humans to rank-order the next letter given all of the previous letters in a text.
Compute the position of the correct letter in this rank order.
Produce a histogram of these positions.
Estimate P(X) from this histogram.
Compute the entropy H(X) = expected number of bits of surprise on seeing each new letter.

Predicting the Next Letter

Everything_in_camp_was_drenched_the_camp_fire_as_well_for_they_ were but heedless lads like their generation and had made no provision against rain Here was matter for dismay for they were soaked through and chilled They were eloquent in their distress but they presently discovered that the fire had eaten so far up under the great log it had been built against where it curved upward and separated itself from the ground that a hand breadth or so of it had escaped wetting so they patiently wrought until with shreds and bark gathered from the under sides of sheltered logs they coaxed the fire to burn again Then they piled on great dead boughs till they had a roaring furnace and were glad hearted once more They dried their boiled ham and had a feast and after that they sat by the fire and expanded and glorified their midnight adventure until morning for there was not a dry spot to sleep on anywhere around
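
A minimal sketch of the estimation step, assuming the rank positions of the correct letters have been collected into a list; the function name and example data are illustrative:

```python
from collections import Counter
import math

def entropy_from_ranks(ranks):
    """Estimate entropy from the rank positions of the correct next letter.

    ranks: list of ints, the position of the true letter in the guesser's
    rank ordering (1 = guessed first). P(rank) is estimated from the histogram.
    """
    counts = Counter(ranks)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: most letters are guessed on the first try, a few take longer.
print(entropy_from_ranks([1, 1, 1, 2, 1, 3, 1, 1, 2, 1]))
```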

Statistical Learning Methods

The Density Estimation Problem:
Given: a set of random variables U = {V1, ..., Vn} and a set S of training examples {U1, ..., UN} drawn independently according to an unknown distribution P(U).
Find: a Bayesian network with probabilities Θ that is a good approximation to P(U).

Four cases:
Known structure; fully observable
Known structure; partially observable
Unknown structure; fully observable
Unknown structure; partially observable

Bayesian Learning Theory

Fundamental question: given S, how do we choose Θ?
Bayesian answer: don't choose a single Θ.

A Bayesian Network for Learning Bayesian Networks

[Diagram: the parameter node Θ is a parent of the query variable U and of each training example U1, U2, ..., UN.]

P(U | U1, ..., UN) = P(U | S) = P(U, S) / P(S) = [Σ_Θ P(U | Θ) Π_i P(U_i | Θ) P(Θ)] / P(S)
P(U | S) = Σ_Θ P(U | Θ) [P(S | Θ) P(Θ) / P(S)]
P(U | S) = Σ_Θ P(U | Θ) P(Θ | S)

Each Θ votes for U according to its posterior probability. This is Bayesian model averaging.

Approximating Bayesian Model Averaging

Summing over all possible Θ's is usually impossible. Approximate the sum by the single most likely value, Θ_MAP:
Θ_MAP = argmax_Θ P(Θ | S) = argmax_Θ P(S | Θ) P(Θ)
P(U | S) ≈ P(U | Θ_MAP)
This is the Maximum A Posteriori (MAP) approximation.
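
To make the distinction between model averaging and the MAP approximation concrete, here is a small illustrative sketch (not from the slides) using a coin-flip model over a discrete grid of parameter values θ, where the sum over Θ can be computed exactly:

```python
# Bayesian model averaging vs. the MAP approximation for a coin with unknown
# heads probability theta, restricted to a small discrete grid of values.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = {t: 1 / len(thetas) for t in thetas}          # uniform prior P(theta)
data = [1, 1, 0, 1, 1, 1, 0, 1]                       # observed flips (1 = heads)

def likelihood(theta, data):
    p = 1.0
    for x in data:
        p *= theta if x == 1 else (1 - theta)
    return p

# Posterior P(theta | S) proportional to P(S | theta) P(theta)
unnorm = {t: likelihood(t, data) * prior[t] for t in thetas}
z = sum(unnorm.values())
posterior = {t: w / z for t, w in unnorm.items()}

# Model averaging: P(next flip = heads | S) = sum_theta theta * P(theta | S)
p_avg = sum(t * posterior[t] for t in thetas)

# MAP approximation: keep only the single most probable theta
theta_map = max(posterior, key=posterior.get)

print(p_avg, theta_map)
```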

Maximum Likelihood Approximation

If we assume P(Θ) is constant for all Θ, then MAP becomes MLE, the Maximum Likelihood Estimate:
Θ_MLE = argmax_Θ P(S | Θ)
P(S | Θ) is called the likelihood function. We often take logarithms:
Θ_MLE = argmax_Θ P(S | Θ) = argmax_Θ log P(S | Θ) = argmax_Θ log Π_i P(U_i | Θ) = argmax_Θ Σ_i log P(U_i | Θ)

Experimental Methodology

Collect data.
Divide the data randomly into training and testing sets.
Choose Θ to maximize the log likelihood of the training data.
Evaluate the log likelihood on the test data.
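
A small sketch of the MLE and the train/test methodology, using the same coin-flip setting; the data here are synthetic:

```python
import math, random

random.seed(0)
flips = [1 if random.random() < 0.7 else 0 for _ in range(200)]  # synthetic data
train, test = flips[:150], flips[150:]

# MLE for a Bernoulli parameter: the fraction of heads in the training data.
theta_mle = sum(train) / len(train)

def log_likelihood(theta, data):
    """sum_i log P(U_i | theta) for coin flips."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data)

print("theta_MLE =", theta_mle)
print("test log likelihood =", log_likelihood(theta_mle, test))
```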

Known Structure, Fully Observable

[Figure: a Bayesian network over the variables Preg, Glucose, Insulin, Mass, Age, and Diabetes, shown together with a table of discretized training cases for those variables.]

Learning Process

Simply count the cases:
P(Age = 2) = N(Age = 2) / N
P(Mass = 0 | Preg = 1, Age = 2) = N(Mass = 0, Preg = 1, Age = 2) / N(Preg = 1, Age = 2)
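
To make the counting concrete, here is a minimal Python sketch (the toy records are made up; they are not the table from the slide) that computes the two estimates above directly from counts:

```python
cases = [
    {"Age": 2, "Preg": 1, "Mass": 0},
    {"Age": 2, "Preg": 1, "Mass": 1},
    {"Age": 1, "Preg": 0, "Mass": 0},
    {"Age": 2, "Preg": 0, "Mass": 0},
]

N = len(cases)

# P(Age = 2) = N(Age = 2) / N
p_age2 = sum(c["Age"] == 2 for c in cases) / N

# P(Mass = 0 | Preg = 1, Age = 2) = N(Mass = 0, Preg = 1, Age = 2) / N(Preg = 1, Age = 2)
num = sum(c["Mass"] == 0 and c["Preg"] == 1 and c["Age"] == 2 for c in cases)
den = sum(c["Preg"] == 1 and c["Age"] == 2 for c in cases)
p_mass0_given = num / den

print(p_age2, p_mass0_given)   # 0.75, 0.5
```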

Laplace Corrections

Probabilities of 0 and 1 are undesirable because they are too strong. To avoid them, we can apply the Laplace correction. Suppose there are k possible values for Age:
P(Age = 2) = (N(Age = 2) + 1) / (N + k)
Implementation: initialize all counts to 1. When the counts are normalized, this automatically accounts for k.

Spam Filtering using Naïve Bayes

[Figure: a naïve Bayes network with class node Spam and child nodes for words such as money, confidential, nigeria, machine, learning.]

Spam ∈ {0, 1}. There is one random variable for each possible word that could appear in email, e.g. P(money = 1 | Spam = 1) and P(money = 1 | Spam = 0).
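
A small sketch of the Laplace correction as described above; the domain and data are illustrative:

```python
from collections import Counter

def laplace_estimate(values, domain):
    """P(value) with Laplace correction: (N(value) + 1) / (N + k), where k = |domain|."""
    counts = Counter({v: 1 for v in domain})   # initialize all counts to 1
    counts.update(values)
    total = sum(counts.values())               # equals N + k
    return {v: counts[v] / total for v in domain}

# Even a value never seen in the data (3) gets a small nonzero probability.
print(laplace_estimate([2, 2, 1, 2], domain=[1, 2, 3]))
```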

Probabilistic Reasoning

All of the variables are observed except Spam, so the reasoning is very simple:
P(spam = 1 | w1, w2, ..., wn) = α P(w1 | spam = 1) P(w2 | spam = 1) ... P(wn | spam = 1) P(spam = 1)

Likelihood Ratio

To avoid normalization, we can compute the log odds:
P(spam = 1 | w1, ..., wn) / P(spam = 0 | w1, ..., wn)
  = [α P(w1 | spam = 1) ... P(wn | spam = 1) P(spam = 1)] / [α P(w1 | spam = 0) ... P(wn | spam = 0) P(spam = 0)]
  = [P(w1 | spam = 1) / P(w1 | spam = 0)] ... [P(wn | spam = 1) / P(wn | spam = 0)] [P(spam = 1) / P(spam = 0)]

log [P(spam = 1 | w1, ..., wn) / P(spam = 0 | w1, ..., wn)]
  = log [P(w1 | spam = 1) / P(w1 | spam = 0)] + ... + log [P(wn | spam = 1) / P(wn | spam = 0)] + log [P(spam = 1) / P(spam = 0)]
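
A minimal sketch of the log-odds computation, assuming boolean word features; the word probabilities and prior below are made up for illustration:

```python
import math

# Illustrative (made-up) per-word probabilities P(w = 1 | spam) for a few words.
p_word_given_spam = {"money": 0.40, "confidential": 0.20, "nigeria": 0.15}
p_word_given_ham  = {"money": 0.05, "confidential": 0.03, "nigeria": 0.001}
p_spam = 0.6   # prior P(spam = 1); made up for illustration

def log_odds(words_present):
    """log [P(spam=1 | words) / P(spam=0 | words)] for words observed as present."""
    score = math.log(p_spam / (1 - p_spam))
    for w in words_present:
        score += math.log(p_word_given_spam[w] / p_word_given_ham[w])
    return score

print(log_odds(["money", "nigeria"]))   # large positive value => classify as spam
```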

Design Issues

What should we consider as words? Read Paul Graham's articles (see web page) and read about CRM114.
Do we define w_j to be the number of times word j appears in the email, or do we just use a boolean for the presence/absence of the word?
How do we handle previously unseen words? Laplace estimates will assign them probabilities of 0.5 and 0.5, and naïve Bayes will therefore ignore them.
Efficient implementation: two hash tables, one for spam and one for non-spam, containing the counts of the number of messages in which each word was seen (see the sketch after this slide).

Correlations?

Naïve Bayes assumes that each word is generated independently given the class. HTML tokens are not generated independently. Should we model this?

[Figure: an extended network with class node Spam, an intermediate HTML node, and token nodes money, confidential, nigeria, <b>, and href.]
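
Here is a minimal sketch of that two-table design, assuming boolean (presence/absence) features; the add-one smoothing in the estimate follows the Laplace-style initialization discussed earlier:

```python
from collections import defaultdict

# One table of message counts per class: in how many spam / non-spam messages
# was each word seen at least once?
word_counts = {1: defaultdict(int), 0: defaultdict(int)}   # 1 = spam, 0 = non-spam
msg_counts = {1: 0, 0: 0}

def train(tokens, is_spam):
    label = 1 if is_spam else 0
    msg_counts[label] += 1
    for w in set(tokens):                  # boolean presence, so count each word once
        word_counts[label][w] += 1

def p_word_given_class(w, label):
    # Laplace-style correction: (count + 1) / (messages + 2) keeps estimates off 0 and 1.
    return (word_counts[label][w] + 1) / (msg_counts[label] + 2)

train(["money", "confidential", "nigeria"], is_spam=True)
train(["machine", "learning", "money"], is_spam=False)
print(p_word_given_class("money", 1), p_word_given_class("money", 0))
```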

Dynamics

We are engaged in an arms race between the spammers and the spam filters. Spam is changing all the time, so we need our estimates P(w_i | spam) to change too.
One method: exponential moving average. Each time we process a new training message, we decay the previous counts slightly. For every w_i:
N(w_i | spam = 1) := N(w_i | spam = 1) × 0.9999
N(w_i | spam = 0) := N(w_i | spam = 0) × 0.9999
Then add in the counts for the new words. Choose the constant (0.9999) carefully.

Decay Parameter

[Figure: plot of 0.9999^x for x from 0 to 50,000.]

The half life is about 6,930 updates (how did I compute that?).
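
A short sketch of the decayed-count update, plus the half-life computation the slide hints at; the decay constant matches the slide, everything else is illustrative:

```python
import math
from collections import defaultdict

DECAY = 0.9999

spam_counts = defaultdict(float)
ham_counts = defaultdict(float)

def update(tokens, is_spam):
    """Decay all existing counts, then add the counts from the new message."""
    for table in (spam_counts, ham_counts):
        for w in table:
            table[w] *= DECAY
    target = spam_counts if is_spam else ham_counts
    for w in set(tokens):
        target[w] += 1.0

# Half life: the number of updates n such that DECAY**n = 0.5,
# i.e. n = ln(0.5) / ln(DECAY), which is about 6931.
print(math.log(0.5) / math.log(DECAY))
```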

Architecture

.procmailrc is read by sendmail on engr accounts. This allows you to pipe your email into a program you write yourself.

# .procmail recipe
# pipe mail into myprogram, then continue processing it
:0fw:.msgid.lock
| /home/tgd/myprogram

# if myprogram added the spam header, then file into
# the spam mail file
:0:
* ^X-SPAM-Status: SPAM.*
mail/spam

Architecture (2)

[Figure: processing pipeline: Tokenize → Hash → Classify.]

Classification Decision

False positives: good email misclassified as spam.
False negatives: spam misclassified as good email.
Choose a threshold θ and classify a message as spam when
log [P(spam = 1 | w1, ..., wn) / P(spam = 0 | w1, ..., wn)] > θ

Plot of False Positives versus False Negatives

As we vary θ, we change the number of false positives and false negatives. We can choose the threshold that achieves the desired ratio of FP to FN.

[Figure: false negatives plotted against false positives as the threshold θ is varied, with points along the curve labeled by threshold setting.]
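
A short sketch of sweeping the threshold θ over a set of scored validation messages; the (log-odds, label) pairs below are made up:

```python
# Sweep the threshold over made-up (log_odds, is_spam) scores and count
# false positives / false negatives at each setting.
scores = [(4.2, True), (1.5, True), (0.3, False), (-0.8, False),
          (2.7, False), (-3.1, True), (5.0, True), (-2.2, False)]

for theta in [-4, -2, 0, 2, 4]:
    fp = sum(1 for s, spam in scores if s > theta and not spam)
    fn = sum(1 for s, spam in scores if s <= theta and spam)
    print(f"theta={theta:>3}: FP={fp}, FN={fn}")
```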

Methodology

Collect data (spam and non-spam).
Divide the data into training and testing sets (presumably by choosing a cutoff date).
Train on the training data.
Test on the testing data.
Compute the confusion matrix:

                   Predicted spam   Predicted nonspam
True spam          TP               FN
True nonspam       FP               TN

Choosing θ by Internal Validation

Subdivide the training data into a subtraining set and a validation set.
Train on the subtraining set.
Classify the validation set and record the predicted log odds of spam for each validation example.
Sort these scores and construct the FP/FN graph.
Choose θ.
Now retrain on the entire training set.
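
Finally, a sketch of the internal-validation procedure, assuming we already have log-odds scores for the validation examples (generated synthetically here) and that we weight a false positive as ten times worse than a false negative; that cost ratio is an assumption, not something the slides specify:

```python
import random

random.seed(1)

# Hypothetical validation data: (log_odds_score, is_spam) pairs. In a real run
# these scores would come from a model trained on the subtraining set.
validation = [(random.gauss(3, 2), True) for _ in range(50)] + \
             [(random.gauss(-3, 2), False) for _ in range(50)]

def errors_at(theta, data):
    fp = sum(1 for s, spam in data if s > theta and not spam)
    fn = sum(1 for s, spam in data if s <= theta and spam)
    return fp, fn

# Choose the threshold whose errors best match the desired trade-off
# (here: each false positive counts 10 times as much as a false negative).
candidates = sorted(s for s, _ in validation)
best_theta = min(candidates, key=lambda t: 10 * errors_at(t, validation)[0]
                                           + errors_at(t, validation)[1])
print("chosen theta =", best_theta)
# After choosing theta, retrain the classifier on the entire training set.
```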