Natural Language Processing: Classification I (Dan Klein, UC Berkeley)

Transcription:

Natural Language Processing: Classification I
Dan Klein, UC Berkeley

Classification
- Automatically make a decision about inputs:
  - Example: document -> category
  - Example: image of digit -> digit
  - Example: image of object -> object type
  - Example: query + web pages -> best match
  - Example: symptoms -> diagnosis
- Three main ideas:
  - Representation as feature vectors / kernel functions
  - Scoring by linear functions
  - Learning by optimization

Some Definitions
- INPUTS, CANDIDATE SET, CANDIDATES, TRUE OUTPUTS, FEATURE VECTORS
- Running example: input "close the ___", candidate set {door, table, ...}, candidate "table", true output "door"
- Example features (indicators over input/candidate pairs):
  - "close" in x AND y = "door"
  - x_{-1} = "the" AND y = "door"
  - y occurs in x
  - x_{-1} = "the" AND y = "table"

Feature Vectors
- Example: web page ranking (not actually classification)
- Features over query/page pairs, e.g. x_i = "Apple Computers" (figure not reproduced)
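The feature list on the "Some Definitions" slide reads most naturally as indicator (0/1) feature functions over input/candidate pairs. A sketch in that notation (the names f_1 through f_4 are labels added here, not from the slides):

```latex
f_1(x, y) = \mathbf{1}[\text{``close''} \in x \,\wedge\, y = \text{door}]
f_2(x, y) = \mathbf{1}[x_{-1} = \text{``the''} \,\wedge\, y = \text{door}]
f_3(x, y) = \mathbf{1}[y \text{ occurs in } x]
f_4(x, y) = \mathbf{1}[x_{-1} = \text{``the''} \,\wedge\, y = \text{table}]
```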

Block Feature Vectors
- Sometimes, we think of the input as having features, which are multiplied by outputs to form the candidates
- Example from the figure: "win election" (block-feature diagram not reproduced)

Non-Block Feature Vectors
- Sometimes the features of candidates cannot be decomposed in this regular way
- Example: a parse tree's features may be the productions present in the tree (the figure shows two candidate trees over S, NP, VP, N, V; not reproduced)
- Different candidates will thus often share features
- We'll return to the non-block case later

Linear Models: Scoring
- In a linear model, each feature gets a weight w
- We score hypotheses by multiplying features and weights (see the reconstruction below)

Linear Models: Decision Rule
- The linear decision rule: predict the highest-scoring candidate

Binary Classification
- Important special case: binary classification
- Classes are y = +1 / -1 (here +1 = SPAM, -1 = HAM)
- Example weight vector: BIAS: -3, free: 4, money: 2
- Decision boundary is a hyperplane (the figure plots it in the (free, money) count space; not reproduced)
- We've said nothing about where weights come from
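The scoring and decision-rule formulas on the slides above did not survive transcription; a standard reconstruction consistent with the lecture's notation is:

```latex
\text{score}(x, y; w) \;=\; w^\top f(x, y) \;=\; \sum_i w_i \, f_i(x, y)
\qquad\qquad
y^* \;=\; \arg\max_{y \in \mathcal{Y}(x)} \; w^\top f(x, y)
```

In the binary case, predict +1 (SPAM) exactly when the score is positive. For instance, with the weights above, a message containing "free" once and "money" twice would score -3 + 4(1) + 2(2) = 5 > 0 (an illustrative calculation, not one from the slides).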

Multiclass Decision Rule
- If more than two classes:
  - Highest score wins
  - Boundaries are more complex
  - Harder to visualize
- There are other ways: e.g. reconcile pairwise decisions

Learning Classifier Weights
- Two broad approaches to learning weights
- Generative: work with a probabilistic model of the data; weights are (log) local conditional probabilities
  - Advantages: learning weights is easy, smoothing is well understood, backed by understanding of modeling
- Discriminative: set weights based on some error-related criterion
  - Advantages: error-driven; often weights which are good for classification aren't the ones which best describe the data
- We'll mainly talk about the latter for now

Learning
- How to pick weights? Goal: choose the "best" vector w given training data
  - For now, we mean "best for classification"
- The ideal: the weights which have greatest test set accuracy / F1 / whatever
  - But we don't have the test set
  - Must compute weights from the training set
- Maybe we want weights which give best training set accuracy?
  - Hard, discontinuous optimization problem
  - May not (does not) generalize to the test set
  - Easy to overfit
  - Though, min-error training for MT does exactly this

Minimize Training Error?
- A loss function declares how costly each mistake is
  - E.g. 0 loss for correct label, 1 loss for wrong label
  - Can weight mistakes differently (e.g. false positives worse than false negatives, or Hamming distance over structured labels)
- We could, in principle, minimize training loss
- This is a hard, discontinuous optimization problem

Linear Models: Perceptron
- The perceptron algorithm
  - Iteratively processes the training set, reacting to training errors
  - Can be thought of as trying to drive down training error
- The (online) perceptron algorithm (see the sketch after this list):
  - Start with zero weights w
  - Visit training instances one by one
    - Try to classify
  - If correct, no change!
  - If wrong: adjust weights
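A minimal sketch of the online multiclass perceptron loop described on the last slide, assuming features come back as sparse dicts from a user-supplied feats(x, y) function; all names here are illustrative, not from the slides:

```python
from collections import defaultdict

def dot(w, f):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def perceptron(train, feats, classes, epochs=5):
    """Online multiclass perceptron.

    train   : list of (x, y_true) pairs
    feats   : function mapping (x, y) -> dict of feature counts
    classes : list of candidate outputs y
    """
    w = defaultdict(float)                        # start with zero weights
    for _ in range(epochs):
        for x, y_true in train:                   # visit instances one by one
            y_hat = max(classes, key=lambda y: dot(w, feats(x, y)))
            if y_hat != y_true:                   # if wrong: adjust weights
                for k, v in feats(x, y_true).items():
                    w[k] += v                     # pull toward the true candidate
                for k, v in feats(x, y_hat).items():
                    w[k] -= v                     # push away from the prediction
    return w
```

The averaged perceptron mentioned on the next slides is a small extension: keep a running sum of the weight vector after every step and return the average instead of the final w.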

Example: Best Web Page
- x_i = "Apple Computers" (ranking figure not reproduced)

Examples: Perceptron
- Separable case (figure not reproduced)

Perceptrons and Separability
- A data set is separable if some parameters classify it perfectly (separable vs. non-separable figures not reproduced)
- Convergence: if the training data is separable, the perceptron will separate it (binary case)
- Mistake Bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability

Examples: Perceptron
- Non-separable case (figure not reproduced)

Issues with Perceptrons
- Overtraining: test / held-out accuracy usually rises, then falls
  - Overtraining isn't the typically discussed source of overfitting, but it can be important
- Regularization: if the data isn't separable, weights often thrash around
  - Averaging weight vectors over time can help (averaged perceptron) [Freund & Schapire 99, Collins 02]

Problems with Perceptrons
- Perceptron "goal": separate the training data
  1. This may be an entire feasible space
  2. Or it may be impossible
- Mediocre generalization: finds a "barely" separating solution
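The mistake bound mentioned under "Perceptrons and Separability" was not transcribed; the standard statement it refers to (the usual binary perceptron convergence bound, with notation chosen for this note) is:

```latex
% If some unit-norm weight vector u separates the data with margin \gamma,
% i.e. y_i \, (u \cdot f(x_i)) \ge \gamma for all i, and \|f(x_i)\| \le R,
% then the perceptron makes at most this many mistakes:
\text{mistakes} \;\le\; \frac{R^2}{\gamma^2}
```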

Objective Functions
- What do we want from our weights? Depends!
- So far: minimize (training) errors
  - This is the zero-one loss
  - Discontinuous; minimizing it is NP-complete
  - Not really what we want anyway
- Maximum entropy and SVMs have other objectives related to zero-one loss

Margin

Linear Separators
- Which of these linear separators is optimal? (figure not reproduced)

Classification Margin (Binary)
- Distance of x_i to the separator is its margin, m_i
- Examples closest to the hyperplane are support vectors
- Margin m of the separator is the minimum m_i

Classification Margin
- For each example x_i and possible mistaken candidate y, we avoid that mistake by a margin m_i(y) (with zero-one loss)
- Margin of the entire separator is the minimum m
- It is also the largest margin for which the constraints hold (see the reconstruction after this list)

Maximum Margin
- Separable SVMs: find the max-margin w
- Can stick this into Matlab and (slowly) get an SVM
- Won't work (well) if non-separable
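The constraints referenced on the "Classification Margin" slide were lost in transcription; a reconstruction of the separable max-margin program in the lecture's multiclass feature notation is:

```latex
\max_{\|w\| = 1,\; \gamma} \;\; \gamma
\quad \text{s.t.} \quad
w^\top f(x_i, y_i) \;\ge\; w^\top f(x_i, y) + \gamma
\qquad \forall i,\; \forall y \ne y_i
```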

Why Max Margin?
- Why do this? Various arguments:
  - Solution depends only on the boundary cases, or support vectors (but remember how this diagram is broken!)
  - Solution robust to movement of support vectors
  - Sparse solutions (features not in support vectors get zero weight)
  - Generalization bound arguments
  - Works well in practice for many problems

Max Margin / Small Norm
- Reformulation: find the smallest w which separates the data
- Remember this condition? The margin scales linearly in w, so if w isn't constrained, we can take any separating w and scale up our margin
- Instead of fixing the scale of w, we can fix the margin at 1

Soft Margin Classification
- What if the training set is not linearly separable?
- Slack variables ξ_i can be added to allow misclassification of difficult or noisy examples, resulting in a soft-margin classifier (figure with support vectors and slack not reproduced)

Maximum Margin (Non-separable SVMs)
- Add slack ξ_i to the constraints
- Make the objective pay (linearly) for slack (see the programs below)
  - Note: other choices of how to penalize slacks exist!
- C is called the capacity of the SVM: the smoothing knob
- Learning:
  - Can still stick this into Matlab if you want
  - Constrained optimization is hard; better methods exist!
  - We'll come back to this later

Maximum Margin

Likelihood
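The two programs referenced above (the small-norm reformulation and the slack version) can be reconstructed as the standard primal SVM objectives, written here in the lecture's multiclass feature notation:

```latex
% Small-norm (separable) reformulation: fix the margin at 1, minimize the norm.
\min_{w} \;\; \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad
w^\top f(x_i, y_i) \;\ge\; w^\top f(x_i, y) + 1
\qquad \forall i,\; \forall y \ne y_i

% Soft-margin (non-separable) version: pay linearly for slack, with capacity C.
\min_{w,\; \xi \ge 0} \;\; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad
w^\top f(x_i, y_i) \;\ge\; w^\top f(x_i, y) + 1 - \xi_i
\qquad \forall i,\; \forall y \ne y_i
```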

Linear Models: Maximum Entropy
- Maximum entropy (logistic regression)
- Use the scores as probabilities: make them positive, then normalize (see the reconstruction below)
- Maximize the (log) conditional likelihood of the training data

Maximum Entropy II
- Motivation for maximum entropy:
  - Connection to the maximum entropy principle (sort of)
  - Might want to do a good job of being uncertain on noisy cases
  - In practice, though, posteriors are pretty peaked
- Regularization (smoothing)

Maximum Entropy

Loss Comparison

Log Loss
- If we view maxent as a minimization problem, it minimizes the log loss on each example
- One view: log loss is an upper bound on zero-one loss

Remember SVMs
- We had a constrained minimization, but we can solve for the slack ξ_i, giving an unconstrained (hinge-loss) form
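The maxent formulas ("make positive", "normalize", and the training objective) can be reconstructed as:

```latex
% Exponentiate the linear scores to make them positive, then normalize:
P(y \mid x; w) \;=\; \frac{\exp\big(w^\top f(x, y)\big)}
                          {\sum_{y'} \exp\big(w^\top f(x, y')\big)}

% Training maximizes the (log) conditional likelihood of the training data:
w^* \;=\; \arg\max_{w} \; \sum_i \log P(y_i \mid x_i; w)
```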

Hinge Loss
- Consider the per-instance objective (reconstructed below)
- This is called the hinge loss
- Unlike maxent / log loss, you stop gaining objective once the true label wins by enough
- You can start from here and derive the SVM objective
- Can solve directly with sub-gradient descent (e.g. Pegasos: Shalev-Shwartz et al 07)
- (The plot is really only right in the binary case)

SVMs: Max vs "Soft Max" Margin
- SVM vs maxent objectives: you can make the hinge term zero, but not the maxent log term
- Very similar! Both try to make the true score better than a function of the other scores
  - The SVM tries to beat the augmented runner-up
  - The maxent classifier tries to beat the "soft max"

Loss Functions: Comparison
- Zero-one loss, hinge loss, log loss (comparison plot not reproduced)

Separators: Comparison
- (figure not reproduced)

Example: Sensors (Conditional vs Joint Likelihood)
- Reality: Raining: P(+,+,r) = 3/8, P(-,-,r) = 1/8; Sunny: P(+,+,s) = 1/8, P(-,-,s) = 3/8
- NB model: Raining? -> M1, M2 (graphical model figure not reproduced)
- NB factors: P(s) = 1/2, P(+|s) = 1/4, P(+|r) = 3/4
- Predictions: P(r,+,+) = (1/2)(3/4)(3/4), P(s,+,+) = (1/2)(1/4)(1/4), so P(r|+,+) = 9/10 and P(s|+,+) = 1/10
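The two per-instance objectives being compared can be reconstructed as follows (a margin of 1 stands in here for a general loss between labels):

```latex
% SVM (hinge loss): zero once the true label beats the augmented runner-up.
L_i^{\text{hinge}}(w) \;=\; \max\!\Big(0,\; \max_{y \ne y_i}\big[\,w^\top f(x_i, y) + 1\,\big] \;-\; w^\top f(x_i, y_i)\Big)

% Maxent (log loss): the true score must beat the soft max of all scores,
% which it can approach but never reach, so this term is never exactly zero.
L_i^{\text{log}}(w) \;=\; \log \sum_{y} \exp\big(w^\top f(x_i, y)\big) \;-\; w^\top f(x_i, y_i)
```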

Example: Stoplights
- Reality: Lights Working vs. Lights Broken
  - P(g,r,w) = 3/7, P(r,g,w) = 3/7, P(r,r,b) = 1/7
- NB model: Working? -> NS, EW (graphical model figure not reproduced)
- NB factors: P(w) = 6/7, P(r|w) = 1/2, P(g|w) = 1/2; P(b) = 1/7, P(r|b) = 1, P(g|b) = 0

Example: Stoplights
- What does the model say when both lights are red?
  - P(b,r,r) = (1/7)(1)(1) = 1/7 = 4/28
  - P(w,r,r) = (6/7)(1/2)(1/2) = 6/28
  - P(w|r,r) = 6/10, so we'll guess that (r,r) indicates the lights are working!
- Imagine if P(b) were boosted higher, to 1/2:
  - P(b,r,r) = (1/2)(1)(1) = 1/2 = 4/8
  - P(w,r,r) = (1/2)(1/2)(1/2) = 1/8
  - P(w|r,r) = 1/5
- Changing the parameters bought accuracy at the expense of data likelihood
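A small sketch (not from the slides) that reproduces the stoplight arithmetic above as a naive Bayes posterior computation:

```python
def nb_posterior_working(prior_w, p_red_given_w, prior_b, p_red_given_b):
    """P(working | both lights red) under the naive Bayes model, where the
    two lights are conditionally independent given the working/broken state."""
    joint_w = prior_w * p_red_given_w ** 2        # P(w, r, r)
    joint_b = prior_b * p_red_given_b ** 2        # P(b, r, r)
    return joint_w / (joint_w + joint_b)

# Maximum-likelihood parameters from the slide: P(w)=6/7, P(r|w)=1/2, P(r|b)=1
print(nb_posterior_working(6/7, 1/2, 1/7, 1.0))   # ~0.6 -> guess "working"

# Boosting P(b) to 1/2 buys accuracy at the expense of data likelihood:
print(nb_posterior_working(1/2, 1/2, 1/2, 1.0))   # ~0.2 -> now guess "broken"
```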