Machine Learning. Hal Daumé III. Computer Science, University of Maryland. CS 421: Introduction to Artificial Intelligence. 8 May 2012

Machine Learning. Hal Daumé III, Computer Science, University of Maryland, me@hal3.name. CS 421: Introduction to Artificial Intelligence, 8 May 2012. Many slides courtesy of Dan Klein, Stuart Russell, or Andrew Moore.

Announcements. Final exam review: please vote for a time slot at http://www.when2meet.com/?439876-ejd1w. Will decide by class Thursday!

Machine Learning. Up until now: how to reason in a model and how to make optimal decisions. Machine learning: how to select a model on the basis of data / experience. Learning parameters (e.g. probabilities). Learning structure (e.g. BN graphs). Learning hidden concepts (e.g. clustering).

Example: Spam Filter. Input: an email. Output: spam or ham. Setup: Get a large collection of example emails, each labeled spam or ham. Note: someone has to hand label all this data! Want to learn to predict labels of new, future emails. Features: the attributes used to make the ham / spam decision. Words: FREE! Text patterns: $dd, CAPS. Non-text: SenderInContacts. (Slide shows sample emails: two spam messages, "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret…" and "TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT \"REMOVE\" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99", plus a ham message about reviving an old Dell Dimension XPS.)

Example: Digit Recognition. Input: images / pixel grids. Output: a digit 0-9. Setup: Get a large collection of example images, each labeled with a digit. Note: someone has to hand label all this data! Want to learn to predict labels of new, future digit images. Features: the attributes used to make the digit decision. Pixels: (6,8) = ON. Shape patterns: NumComponents, AspectRatio, NumLoops. (Slide shows sample digit images labeled 0, 1, 2, 1, ??.)

Other Classification Tasks. In classification, we predict labels y (classes) for inputs x. Examples: Spam detection (input: document; classes: spam / ham). OCR (input: images; classes: characters). Medical diagnosis (input: symptoms; classes: diseases). Automatic essay grading (input: document; classes: grades). Fraud detection (input: account activity; classes: fraud / no fraud). Customer service email routing. Many more. Classification is an important commercial technology!

Important Concepts. Data: labeled instances, e.g. emails marked spam/ham, split into a training set, a held-out set, and a test set. Features: attribute-value pairs which characterize each x. Experimentation cycle: learn parameters (e.g. model probabilities) on the training set; (tune hyperparameters on the held-out set); compute accuracy on the test set. Very important: never peek at the test set! Evaluation: accuracy is the fraction of instances predicted correctly. Overfitting and generalization: we want a classifier which does well on test data. Overfitting means fitting the training data very closely, but not generalizing well.
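
A minimal sketch of this experimentation cycle in Python; the split ratios and helper names are illustrative, not from the slides:

```python
import random

def accuracy(classifier, data):
    """Fraction of (x, y) instances the classifier labels correctly."""
    return sum(classifier(x) == y for x, y in data) / len(data)

def split_data(data, train_frac=0.7, heldout_frac=0.15):
    """Shuffle labeled instances and split into train / held-out / test."""
    data = data[:]
    random.shuffle(data)
    n_train = int(train_frac * len(data))
    n_heldout = int(heldout_frac * len(data))
    train = data[:n_train]
    heldout = data[n_train:n_train + n_heldout]
    test = data[n_train + n_heldout:]  # never peek at this until the very end!
    return train, heldout, test
```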

Bayes Nets for Classification. One method of classification: use a probabilistic model! Features are observed random variables Fi. Y is the query variable. Use probabilistic inference to compute the most likely Y. You already know how to do this inference.

Simple Classification. Simple example: a class variable M with two binary features S and F. Three ways to estimate P(M | S, F): a direct estimate; a Bayes estimate with no assumptions, P(M | S, F) ∝ P(S, F | M) P(M); and an estimate assuming conditional independence of the features given the class, P(S, F | M) = P(S | M) P(F | M).

General Naïve Bayes. A general naïve Bayes model: the class Y has children F1, F2, …, Fn. Parameters: |Y| parameters for the prior P(Y), and n × |F| × |Y| parameters for the conditional tables P(Fi | Y) (one |F| × |Y| table per feature). We only specify how each feature depends on the class. The total number of parameters is linear in n.

Inference for Naïve Bayes. Goal: compute the posterior over causes. Step 1: get the joint probability of causes and evidence, P(y, f1 … fn) = P(y) ∏i P(fi | y). Step 2: get the probability of the evidence. Step 3: renormalize.
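
A minimal sketch of these three steps; the table layout (dicts keyed by class and by feature/class pair) is an assumption for illustration:

```python
def naive_bayes_posterior(features, prior, cond):
    """Posterior P(y | f1..fn) under a naive Bayes model.

    prior: dict mapping class y -> P(y)
    cond:  dict mapping (feature, y) -> P(feature | y)
    """
    # Step 1: joint probability of each cause with the evidence.
    joint = {}
    for y, p_y in prior.items():
        p = p_y
        for f in features:
            p *= cond[(f, y)]
        joint[y] = p
    # Step 2: probability of the evidence (sum the joint over causes).
    evidence = sum(joint.values())
    # Step 3: renormalize.
    return {y: p / evidence for y, p in joint.items()}
```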

General Naïve Bayes. What do we need in order to use naïve Bayes? Inference (you know this part): start with a bunch of conditionals, P(Y) and the P(Fi | Y) tables, and use standard inference to compute P(Y | F1 … Fn). Nothing new here. Estimates of local conditional probability tables: P(Y), the prior over labels, and P(Fi | Y) for each feature (evidence variable). These probabilities are collectively called the parameters of the model and denoted by θ. Up until now, we assumed these appeared by magic, but they typically come from training data; we'll look at this now.

A Digit Recognizer. Input: pixel grids. Output: a digit 0-9.

Naïve Bayes for Digits. Simple version: one feature Fij for each grid position <i,j>. Possible feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image. Each input maps to a feature vector. (Slide shows an example image and its feature vector.) Here: lots of features, each is binary. Naïve Bayes model: P(Y | F) ∝ P(Y) ∏i,j P(Fi,j | Y). What do we need to learn?

Example CPTs. Three tables from the slide: the uniform prior P(Y), and P(F = on | Y) for two example pixel features (labels F_A, F_B stand in for two pixel positions):

Y   P(Y)   P(F_A = on | Y)   P(F_B = on | Y)
1   0.1    0.01              0.05
2   0.1    0.05              0.01
3   0.1    0.05              0.90
4   0.1    0.30              0.80
5   0.1    0.80              0.90
6   0.1    0.90              0.90
7   0.1    0.05              0.25
8   0.1    0.60              0.85
9   0.1    0.50              0.60
0   0.1    0.80              0.80

Parameter Estimation. Estimating the distribution of random variables like X or X | Y. Empirically: use training data. For each outcome x, look at the empirical rate of that value: P_ML(x) = count(x) / total samples. This is the estimate that maximizes the likelihood of the data. Elicitation: ask a human! Usually need domain experts, and sophisticated ways of eliciting probabilities (e.g. betting games). Trouble calibrating.
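
The empirical (maximum likelihood) estimate as a sketch; the Counter-based representation is an illustrative choice:

```python
from collections import Counter

def mle_estimate(samples):
    """Relative-frequency estimate P_ML(x) = count(x) / total samples."""
    counts = Counter(samples)
    total = len(samples)
    return {x: c / total for x, c in counts.items()}

# e.g. mle_estimate(['H', 'H', 'T']) -> {'H': 2/3, 'T': 1/3}
```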

A Spam Filter. A naïve Bayes spam filter. Data: a collection of emails, labeled spam or ham. Note: someone has to hand label all this data! Split into training, held-out, and test sets. Classifiers: learn on the training set, (tune it on a held-out set), test it on new emails. (Slide shows the same sample spam and ham emails as before.)

Naïve Bayes for Text. Bag-of-words naïve Bayes: predict an unknown class label (spam vs. ham); assume evidence features (e.g. the words) are independent. Warning: subtly different assumptions than before! Generative model: P(C, W1 … Wn) = P(C) ∏i P(Wi | C). Tied distributions and bag-of-words: usually, each variable gets its own conditional probability distribution P(F | Y). In a bag-of-words model, Wi is the word at position i, not the i-th word in the dictionary! Each position is identically distributed, and all positions share the same conditional probs P(W | C). Why make this assumption?

Example: Spam Filtering. Model: P(C, W1 … Wn) = P(C) ∏i P(Wi | C). What are the parameters?

P(C):           P(W | ham):          P(W | spam):
ham   0.66      the    0.0156        the    0.0210
spam  0.33      to     0.0153        to     0.0133
                and    0.0115        of     0.0119
                of     0.0095        2002   0.0110
                you    0.0093        with   0.0108
                a      0.0086        from   0.0107
                with   0.0080        and    0.0105
                from   0.0075        a      0.0100
                ...                  ...

Where do these tables come from?

Spam Example. (Tot Spam / Tot Ham are running log-probability totals.)

Word      P(w | spam)   P(w | ham)   Tot Spam   Tot Ham
(prior)   0.33333       0.66666      -1.1       -0.4
Gary      0.00002       0.00021      -11.8      -8.9
would     0.00069       0.00084      -19.1      -16.0
you       0.00881       0.00304      -23.8      -21.8
like      0.00086       0.00083      -30.9      -28.9
to        0.01517       0.01339      -35.1      -33.2
lose      0.00008       0.00002      -44.5      -44.0
weight    0.00016       0.00002      -53.3      -55.0
while     0.00027       0.00027      -61.5      -63.2
you       0.00881       0.00304      -66.2      -69.0
sleep     0.00006       0.00001      -76.0      -80.5

P(spam | w) = 98.9%
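
The running totals are just sums of logs, so here is a sketch of the computation (probabilities copied from the table above; natural log assumed, which matches the slide's totals):

```python
import math

def log_scores(words, prior, cond):
    """Cumulative log P(c) + sum_i log P(w_i | c), per class c."""
    scores = {c: math.log(p) for c, p in prior.items()}
    for w in words:
        for c in scores:
            scores[c] += math.log(cond[c][w])
    return scores

prior = {'spam': 0.33333, 'ham': 0.66666}
cond = {
    'spam': {'Gary': 0.00002, 'would': 0.00069, 'you': 0.00881},
    'ham':  {'Gary': 0.00021, 'would': 0.00084, 'you': 0.00304},
}
scores = log_scores(['Gary', 'would', 'you'], prior, cond)
# Working in log space avoids underflow on long messages; exponentiate
# and renormalize at the end to recover the posterior.
z = sum(math.exp(s) for s in scores.values())
posterior = {c: math.exp(s) / z for c, s in scores.items()}
```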

Example: Overfitting. 2 wins!! (Slide shows the scores on an example image, where the overfit relative-frequency model makes the digit 2 win.)

Example: Spam Filtering. Raw probabilities alone don't affect the posteriors; relative probabilities (odds ratios) do. The largest odds ratios in the unsmoothed model are all infinite:

Ham-indicative:  south-west, nation, morally, nicely, extent, seriously, … (ratio: inf)
Spam-indicative: screens, minute, guaranteed, $205.00, delivery, signature, … (ratio: inf)

What went wrong here?

Generalization and Overfitting. Relative-frequency parameters will overfit the training data! Just because we never saw a 3 with pixel (15,15) on during training doesn't mean we won't see it at test time. Unlikely that every occurrence of "minute" is 100% spam. Unlikely that every occurrence of "seriously" is 100% ham. What about all the words that don't occur in the training set at all? In general, we can't go around giving unseen events zero probability. As an extreme case, imagine using the entire email as the only feature: we would get the training data perfect (if labeling is deterministic), but wouldn't generalize at all. Just making the bag-of-words assumption gives us some generalization, but it isn't enough. To generalize better we need to smooth or regularize the estimates.

Estimation: Smoothing. Problems with maximum likelihood estimates: If I flip a coin once, and it's heads, what's the estimate for P(heads)? What if I flip 10 times with 8 heads? What if I flip 10M times with 8M heads? Basic idea: we have some prior expectation about parameters (here, the probability of heads). Given little evidence, we should skew towards our prior. Given a lot of evidence, we should listen to the data.

Estimation: Smoothing. Relative frequencies are the maximum likelihood estimates: θ_ML = argmax_θ P(data | θ), which for counts gives P_ML(x) = count(x) / total. In Bayesian statistics, we think of the parameters as just another random variable with its own distribution, and take a MAP estimate: θ_MAP = argmax_θ P(θ | data).

Estimation: Laplace Smoothing. Laplace's estimate: pretend you saw every outcome once more than you actually did: P_LAP(x) = (count(x) + 1) / (N + |X|). For example, after observing H, H, T: P_ML(H) = 2/3, but P_LAP(H) = (2 + 1) / (3 + 2) = 3/5. Can derive this as a MAP estimate with Dirichlet priors (see CS 5350).

Estimation: Laplace Smoothing. Laplace's estimate (extended): pretend you saw every outcome k extra times: P_LAP,k(x) = (count(x) + k) / (N + k|X|). What's Laplace with k = 0? (Just the maximum likelihood estimate.) k is the strength of the prior. Laplace for conditionals: smooth each condition independently: P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|).
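
A minimal sketch of Laplace smoothing with strength k; passing the outcome space explicitly keeps unseen outcomes from getting zero probability:

```python
from collections import Counter

def laplace_estimate(samples, outcomes, k=1.0):
    """P_LAP,k(x) = (count(x) + k) / (N + k * |X|)."""
    counts = Counter(samples)
    denom = len(samples) + k * len(outcomes)
    return {x: (counts[x] + k) / denom for x in outcomes}

# e.g. laplace_estimate(['H', 'H', 'T'], ['H', 'T'], k=1)
#      -> {'H': 3/5, 'T': 2/5}, vs. the ML estimate {'H': 2/3, 'T': 1/3}
```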

Estimation: Linear Interpolation. In practice, Laplace often performs poorly for P(X | Y) when |X| is very large or |Y| is very large. Another option: linear interpolation. Also get P(X) from the data, and make sure the estimate of P(X | Y) isn't too different from P(X): P_LIN(x | y) = α · P_ML(x | y) + (1 − α) · P_ML(x). What if α is 0? 1?
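
The interpolated estimate itself is one line; the comments answer the slide's closing question:

```python
def interpolated_estimate(p_x_given_y, p_x, alpha):
    """P_LIN(x|y) = alpha * P_ML(x|y) + (1 - alpha) * P_ML(x).

    alpha = 1 recovers the conditional ML estimate;
    alpha = 0 ignores the class entirely and backs off to P(x).
    """
    return alpha * p_x_given_y + (1.0 - alpha) * p_x
```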

Real NB: Smoothing. For real classification problems, smoothing is critical. New odds ratios after smoothing:

Ham-indicative:        Spam-indicative:
helvetica  11.4        verdana  28.8
seems      10.8        Credit   28.4
group      10.2        ORDER    27.2
ago         8.4        <FONT>   26.9
areas       8.3        money    26.5
...                    ...

Do these make more sense?

Tuning on Held-Out Data. Now we've got two kinds of unknowns. Parameters: the probabilities P(X | Y) and P(Y). Hyperparameters: like the amount of smoothing to do, k or α. Where to learn? Learn parameters from the training data. Must tune hyperparameters on different data. Why? For each value of the hyperparameters, train on the training set and evaluate on the held-out data; choose the best value and do a final test on the test data.
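
A minimal sketch of the tuning loop, with the training and accuracy routines passed in as functions; the candidate grid is illustrative:

```python
def tune_smoothing(train_fn, accuracy_fn, train_set, heldout_set, test_set,
                   candidates=(0.1, 1.0, 10.0)):
    """Pick the smoothing strength k by held-out accuracy, then test once."""
    best_k, best_acc = None, -1.0
    for k in candidates:
        model = train_fn(train_set, k)          # fit parameters on training data
        acc = accuracy_fn(model, heldout_set)   # evaluate on held-out data
        if acc > best_acc:
            best_k, best_acc = k, acc
    final_model = train_fn(train_set, best_k)
    return best_k, accuracy_fn(final_model, test_set)  # one final test only
```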

Generative vs. Discriminative. Generative classifiers: e.g. naïve Bayes. We build a causal model of the variables, then query that model for causes, given evidence. Discriminative classifiers: e.g. the perceptron (next). No causal model, no Bayes rule, often no probabilities; try to predict the output directly. Loosely: mistake driven rather than model driven.

The Binary Perceptron. Inputs are feature values. Each feature has a weight. The sum is the activation: activation(x) = Σi wi · fi(x). If the activation is positive, output 1; if negative, output 0. (Diagram: features f1, f2, f3 feed weights w1, w2, w3 into a sum Σ, followed by a >0 test.)

Example: Spam. Imagine 4 features: free (number of occurrences of "free"), money (occurrences of "money"), BIAS (always has value 1), … For the input "free money":

Feature   f(x)   w
BIAS      1      -3
free      1       4
money     1       2
the       0       0
...
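
Computing the activation for this example; representing both vectors as dicts keyed by feature name is the only assumption:

```python
def activation(weights, features):
    """Dot product of the weight and feature vectors."""
    return sum(weights[name] * value for name, value in features.items())

w = {'BIAS': -3, 'free': 4, 'money': 2, 'the': 0}
f = {'BIAS': 1, 'free': 1, 'money': 1, 'the': 0}
print(activation(w, f))  # -3 + 4 + 2 + 0 = 3 > 0, so output 1 (spam)
```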

Binary Decision Rule. In the space of feature vectors: any weight vector is a hyperplane. One side corresponds to class +1 (SPAM), the other to class -1 (HAM). (Plot: "free" on one axis, "money" on the other; the weights BIAS = -3, free = 4, money = 2 define the boundary line.)

Multiclass Decision Rule. If we have more than two classes: have a weight vector wy for each class y. Calculate an activation for each class: activationy(x) = wy · f(x). The highest activation wins: y = argmaxy wy · f(x).

Example. Input: "win the vote". Its feature vector and the three class weight vectors (generic class names; the slide's labels were not captured):

Feature   f(x)   w(class 1)   w(class 2)   w(class 3)
BIAS      1      -2           1            2
win       1       4           2            0
game      0       4           0            2
vote      1       0           4            0
the       1       0           0            0
...
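
The multiclass rule applied to this example, as a sketch:

```python
def multiclass_predict(weight_vectors, features):
    """Return the class whose weight vector gives the highest activation."""
    def score(w):
        return sum(w[name] * value for name, value in features.items())
    return max(weight_vectors, key=lambda c: score(weight_vectors[c]))

f = {'BIAS': 1, 'win': 1, 'game': 0, 'vote': 1, 'the': 1}
w = {
    'class1': {'BIAS': -2, 'win': 4, 'game': 4, 'vote': 0, 'the': 0},  # activation 2
    'class2': {'BIAS': 1, 'win': 2, 'game': 0, 'vote': 4, 'the': 0},   # activation 7
    'class3': {'BIAS': 2, 'win': 0, 'game': 2, 'vote': 0, 'the': 0},   # activation 2
}
print(multiclass_predict(w, f))  # 'class2', the highest activation (7)
```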

The Perceptron Update Rule. Start with zero weights. Pick up training instances one by one. Try to classify each one. If correct: no change! If wrong: lower the score of the wrong answer and raise the score of the right answer, i.e. wy = wy − f(x) for the predicted (wrong) class y, and wy* = wy* + f(x) for the true class y*.
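
A minimal sketch of the whole update loop; the dict-of-weights representation and the fixed pass count are illustrative (the slides instead run until held-out accuracy peaks):

```python
from collections import defaultdict

def train_perceptron(data, classes, passes=5):
    """Multiclass perceptron. data: list of (features, label) pairs."""
    weights = {c: defaultdict(float) for c in classes}

    def score(c, features):
        return sum(weights[c][n] * v for n, v in features.items())

    for _ in range(passes):
        for features, label in data:
            guess = max(classes, key=lambda c: score(c, features))
            if guess != label:                  # wrong: update both classes
                for n, v in features.items():
                    weights[guess][n] -= v      # lower the wrong answer
                    weights[label][n] += v      # raise the right answer
    return weights
```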

Example. Training instances: "win the vote", "win the election", "win the game". (Slide steps through the weight tables over BIAS, win, game, vote, the after each update.)

Examples: Perceptron. Separable case. (Slide shows a linearly separable dataset and a separator found by the perceptron.)

Mistake-Driven Classification. In naïve Bayes, the parameters come from data statistics, have a causal interpretation, and are set in one pass through the data. For the perceptron, the parameters come from reactions to mistakes, have a discriminative interpretation, and are updated by going through the data until held-out accuracy maxes out.

Properties of Perceptrons. Separability: some parameters get the training set perfectly correct. Convergence: if the training data are separable, the perceptron will eventually converge (binary case). Mistake bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability. (Slide illustrates separable vs. non-separable data.)

Examples: Perceptron. Non-separable case. (Slide shows a dataset no line can separate.)

Issues with Perceptrons. Overtraining: test / held-out accuracy usually rises, then falls. Overtraining isn't quite as bad as overfitting, but it is similar. Regularization: if the data aren't separable, the weights might thrash around; averaging weight vectors over time can help (the averaged perceptron). Mediocre generalization: the perceptron finds a barely separating solution.

Linear Separators. Which of these linear separators is optimal? (Slide shows several candidate separators for the same data.)

Support Vector Machines. Maximizing the margin: good according to both intuition and theory. Only the support vectors matter; other training examples are ignorable. Support vector machines (SVMs) find the separator with the maximum margin.
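
For contrast with the hand-rolled perceptron above, a max-margin separator via an off-the-shelf library; a sketch using scikit-learn (not something the slides reference), on a tiny illustrative dataset:

```python
from sklearn import svm

# Two small, linearly separable point clouds (made-up illustration data).
X = [[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]]
y = [-1, -1, -1, 1, 1, 1]

clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)     # only these points pin down the separator
print(clf.predict([[2, 2]]))    # classify a new point
```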

Summary. Naïve Bayes: builds classifiers using a model of the training data; smoothing the estimates is important in real systems. Perceptrons: make fewer assumptions about the data; mistake-driven learning; multiple passes through the data.