Machine Learning for NLP

Machine Learning for NLP: Linear Models
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Slides adapted from Ryan McDonald, Google Research

Outline
Last time: preliminaries (input/output, features, etc.), the perceptron, Assignment 2
Today: large-margin classifiers (SVMs, MIRA), logistic regression (maximum entropy)
Next time: Naive Bayes classifiers, generative and discriminative models

Perceptron Summary
The perceptron learns a linear classifier that minimizes error, and it is guaranteed to find a separating w in a finite amount of time (if the data are separable).
Improvement 1: shuffle the training data between iterations.
Improvement 2: average the weight vectors seen during training.
The perceptron is an example of an online learning algorithm: w is updated based on a single training instance in isolation,
  $w^{(i+1)} = w^{(i)} + f(x_t, y_t) - f(x_t, y')$
where $y'$ is the model's current (incorrect) prediction for $x_t$.
Compare decision trees, which perform batch learning: all training instances are used to find the best split.
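
To make the update concrete, here is a minimal sketch of an averaged multiclass perceptron. The feature function feats(x, y) and the classes list are hypothetical placeholders, not part of the original slides.

```python
import random
import numpy as np

def perceptron_train(data, feats, classes, dim, epochs=10, seed=0):
    """Averaged multiclass perceptron (sketch).

    data    : list of (x, y) training pairs
    feats   : hypothetical feature function, feats(x, y) -> np.ndarray of length dim
    classes : list of all possible labels
    """
    rng = random.Random(seed)
    w = np.zeros(dim)
    w_sum = np.zeros(dim)              # running sum of weight vectors for averaging
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)              # Improvement 1: shuffle between iterations
        for x, y_true in data:
            # predict with the current weights: argmax_y w . f(x, y)
            y_hat = max(classes, key=lambda y: w @ feats(x, y))
            if y_hat != y_true:
                # online update: w <- w + f(x_t, y_t) - f(x_t, y')
                w = w + feats(x, y_true) - feats(x, y_hat)
            w_sum += w                 # Improvement 2: average the weight vectors
    return w_sum / (epochs * len(data))
```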

Margin
[Figure: two panels, Training and Testing, illustrating the margin around a linear separator]
Denote the value of the margin by $\gamma$.

Maximizing Margin
For a training set $T$, the margin of a weight vector $w$ is the largest $\gamma$ such that
  $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq \gamma$
for every training instance $(x_t, y_t) \in T$ and every $y' \in \bar{Y}_t$, where $\bar{Y}_t$ denotes the set of incorrect outputs for $x_t$.

Maximizing Margin
Intuitively, maximizing the margin makes sense. More importantly, the generalization error on unseen test data is proportional to the inverse of the margin:
  $\epsilon \propto \frac{R^2}{\gamma^2 |T|}$
where $R$ bounds the norm of the feature vectors.
Perceptron: we have shown that if a training set is separable by some margin, the perceptron will find a w that separates the data. But the perceptron does not pick w to maximize the margin!

Maximizing Margin
Let $\gamma > 0$ and solve:
  $\max_{\|w\| \leq 1} \gamma$
  such that $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq \gamma$ for all $(x_t, y_t) \in T$ and $y' \in \bar{Y}_t$
Note: the algorithm still minimizes error.
$\|w\|$ must be bounded, since scaling w trivially produces a larger margin:
  $\beta (w \cdot f(x_t, y_t) - w \cdot f(x_t, y')) \geq \beta\gamma$, for any $\beta \geq 1$

Max Margin = Min Norm
Let $\gamma > 0$. The following two problems are equivalent.
Max margin:
  $\max_{\|w\| \leq 1} \gamma$
  such that $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq \gamma$ for all $(x_t, y_t) \in T$ and $y' \in \bar{Y}_t$
Min norm:
  $\min_{w} \frac{1}{2}\|w\|^2$
  such that $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq 1$ for all $(x_t, y_t) \in T$ and $y' \in \bar{Y}_t$
Instead of fixing $\|w\|$ we fix the margin $\gamma = 1$; technically $\gamma \propto 1/\|w\|$.

Support Vector Machines
  $\min_{w} \frac{1}{2}\|w\|^2$
  such that $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq 1$ for all $(x_t, y_t) \in T$ and $y' \in \bar{Y}_t$
This is a quadratic programming problem and can be solved with out-of-the-box algorithms.
SVM training is a batch learning algorithm: w is set with respect to all training points.
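
As a hedged illustration of "out-of-the-box" solvers: for the ordinary multiclass setting (one feature vector per input, rather than the joint f(x, y) representation used above), scikit-learn's LinearSVC optimizes a soft-margin relaxation of this objective. The data below are made up for illustration.

```python
from sklearn.svm import LinearSVC

# Toy data: four 2-dimensional inputs with binary labels (illustrative only).
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.5]]
y = [0, 0, 1, 1]

# LinearSVC minimizes a regularized hinge loss, i.e. a soft-margin relaxation
# of the min-norm problem above; C trades margin size against training error.
clf = LinearSVC(C=1.0)
clf.fit(X, y)
print(clf.predict([[0.5, 1.0], [2.5, 0.2]]))
```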

Support Vector Machines
Problem: sometimes $|T|$ is far too large, and the resulting number of constraints can make the quadratic programming problem very difficult to solve.
A common technique is Sequential Minimal Optimization (SMO).
The solution is sparse: it depends only on the features of the support vectors.

Margin Infused Relaxed Algorithm (MIRA)
Another option: maximize the margin using an online algorithm.
Batch vs. online:
  Batch: update based on the entire training set (SVM)
  Online: update based on one instance at a time (perceptron)
MIRA can be seen as a max-margin perceptron, or an online SVM.

MIRA
Batch (SVM):
  $\min_{w} \frac{1}{2}\|w\|^2$
  such that $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq 1$ for all $(x_t, y_t) \in T$ and $y' \in \bar{Y}_t$
Online (MIRA), for training data $T = \{(x_t, y_t)\}_{t=1}^{|T|}$:
  1. $w^{(0)} = 0$; $i = 0$
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     $w^{(i+1)} = \arg\min_{w^*} \|w^* - w^{(i)}\|$ such that $w^* \cdot f(x_t, y_t) - w^* \cdot f(x_t, y') \geq 1$ for all $y' \in \bar{Y}_t$
  5.     $i = i + 1$
  6. return $w^{(i)}$
MIRA solves much smaller optimization problems, with only $|\bar{Y}_t|$ constraints each.
Cost: the overall optimization is sub-optimal compared to the batch solution.
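
A common simplification of line 4 keeps only one constraint, for the current highest-scoring incorrect output y'; in that case the update has a closed form. A minimal sketch (the closed-form step is standard for single-constraint MIRA, but the function name and interface are illustrative):

```python
import numpy as np

def mira_update(w, f_gold, f_pred):
    """One single-constraint MIRA step: make the smallest change to w such
    that the gold output beats the chosen incorrect output by a margin of 1.

    w      : current weight vector (np.ndarray)
    f_gold : f(x_t, y_t), features of the correct output
    f_pred : f(x_t, y'), features of the highest-scoring incorrect output
    """
    delta = f_gold - f_pred
    loss = max(0.0, 1.0 - w @ delta)   # how far the margin constraint is violated
    norm_sq = delta @ delta
    if loss == 0.0 or norm_sq == 0.0:
        return w                       # constraint already satisfied; keep w unchanged
    tau = loss / norm_sq               # closed-form step size for a single constraint
    return w + tau * delta
```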

Interim Summary
What we have covered:
  Linear classifiers: perceptron, SVMs, MIRA
  All are trained to minimize error, with or without maximizing the margin, online or batch
What is next:
  Logistic regression / maximum entropy: training linear classifiers to maximize likelihood

Logistic Regression / Maximum Entropy
Define a conditional probability:
  $P(y|x) = \frac{e^{w \cdot f(x,y)}}{Z_x}$, where $Z_x = \sum_{y' \in Y} e^{w \cdot f(x,y')}$
Note: this is still a linear classifier:
  $\arg\max_y P(y|x) = \arg\max_y \frac{e^{w \cdot f(x,y)}}{Z_x} = \arg\max_y e^{w \cdot f(x,y)} = \arg\max_y w \cdot f(x,y)$

Log-Linear Models
Linear model: $f(x, y) \cdot w$
Make the scores positive: $\exp[f(x, y) \cdot w]$
Normalize: $P(y|x) = \frac{\exp[f(x, y) \cdot w]}{\sum_{i=1}^{n} \exp[f(x, y_i) \cdot w]}$
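
These three steps map directly onto a few lines of NumPy; a sketch (the max-subtraction for numerical stability is an implementation detail, not part of the slides):

```python
import numpy as np

def log_linear_probs(scores):
    """Turn linear scores f(x, y) . w (one per class) into probabilities:
    exponentiate to make them positive, then normalize."""
    scores = np.asarray(scores, dtype=float)
    exp_scores = np.exp(scores - scores.max())   # stability trick; cancels in the ratio
    return exp_scores / exp_scores.sum()

print(log_linear_probs([1.0, -2.0, 0.5]))        # three classes, made-up scores
```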

Log-Linear Models
Crash course in exponentiation: $\exp x = a^x$ (for some base $a$), and note that
  $0 < \exp x < 1$ if $x < 0$
  $\exp x = 1$ if $x = 0$
  $\exp x > 1$ if $x > 0$
The inverse of exponentiation is the logarithm: $\log \exp x = x$.
Hence, the log-linear model is linear in log(arithmic) space.

Log-Linear Models
Suppose we have (only) two classes with the following scores:
  $f(x, y_1) \cdot w = 1.0$
  $f(x, y_2) \cdot w = -2.0$
Using base 2, we have:
  $\exp[f(x, y_1) \cdot w] = 2^{1.0} = 2$
  $\exp[f(x, y_2) \cdot w] = 2^{-2.0} = 0.25$
Normalizing, we get:
  $P(y_1|x) = \frac{\exp[f(x, y_1) \cdot w]}{\exp[f(x, y_1) \cdot w] + \exp[f(x, y_2) \cdot w]} = \frac{2}{2.25} = 0.89$
  $P(y_2|x) = \frac{\exp[f(x, y_2) \cdot w]}{\exp[f(x, y_1) \cdot w] + \exp[f(x, y_2) \cdot w]} = \frac{0.25}{2.25} = 0.11$
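
The arithmetic above can be checked in a couple of lines:

```python
scores = {"y1": 1.0, "y2": -2.0}
exp_scores = {y: 2.0 ** s for y, s in scores.items()}        # base 2: 2^1 = 2, 2^-2 = 0.25
Z = sum(exp_scores.values())                                  # 2.25
print({y: round(e / Z, 2) for y, e in exp_scores.items()})    # {'y1': 0.89, 'y2': 0.11}
```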

Logistic Regression / Maximum Entropy
  $P(y|x) = \frac{e^{w \cdot f(x,y)}}{Z_x}$
Q: How do we learn the weights w?
A: Set the weights to maximize the log-likelihood of the training data:
  $w = \arg\max_w \prod_t P(y_t|x_t) = \arg\max_w \sum_t \log P(y_t|x_t)$
In a nutshell, we set the weights w so that we assign as much probability as possible to the correct label $y_t$ for each $x_t$ in the training set.
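
The training objective can be written down directly; a sketch using the natural exponential and the same hypothetical feats(x, y) helper as before:

```python
import numpy as np

def log_likelihood(w, data, feats, classes):
    """Sum of log P(y_t | x_t) over the training data, with
    P(y | x) = exp(w . f(x, y)) / Z_x."""
    total = 0.0
    for x, y_true in data:
        scores = np.array([w @ feats(x, y) for y in classes])
        log_Z = np.logaddexp.reduce(scores)      # log of Z_x = sum_y exp(w . f(x, y))
        total += w @ feats(x, y_true) - log_Z    # log P(y_t | x_t)
    return total
```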

Aside: Min Error versus Max Log-Likelihood
The two are highly related but not identical.
Example: consider a training set T with 1001 points:
  1000 points $(x_i, y = 0)$ with $f(x_i, 0) = [-1, 1, 0, 0]$, for $i = 1 \ldots 1000$
  1 point $(x_{1001}, y = 1)$ with $f(x_{1001}, 1) = [0, 0, 3, 1]$
Now consider $w = [-1, 0, 1, 0]$. The error in this case is 0, so w minimizes error:
  $[-1, 0, 1, 0] \cdot [-1, 1, 0, 0] = 1 > [-1, 0, 1, 0] \cdot [0, 0, -1, 1] = -1$
  $[-1, 0, 1, 0] \cdot [0, 0, 3, 1] = 3 > [-1, 0, 1, 0] \cdot [3, 1, 0, 0] = -3$
However, the log-likelihood is only $-126.9$ (calculation omitted).

Aside: Min Error versus Max Log-Likelihood
Same training set as before, but now consider $w = [-1, 7, 1, 0]$. The error in this case is 1, so w does not minimize error:
  $[-1, 7, 1, 0] \cdot [-1, 1, 0, 0] = 8 > [-1, 7, 1, 0] \cdot [0, 0, -1, 1] = -1$
  $[-1, 7, 1, 0] \cdot [0, 0, 3, 1] = 3 < [-1, 7, 1, 0] \cdot [3, 1, 0, 0] = 4$
However, the log-likelihood is $-1.4$: a better log-likelihood, but a worse error.
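
The log-likelihood figures on these two slides can be checked numerically. A small sketch, assuming the natural-exponential model and the feature vectors given above:

```python
import numpy as np

def log_likelihood(w, examples):
    """examples: list of (f_gold, f_wrong) feature-vector pairs, one per instance."""
    total = 0.0
    for f_gold, f_wrong in examples:
        s_gold, s_wrong = w @ f_gold, w @ f_wrong
        total += s_gold - np.logaddexp(s_gold, s_wrong)   # log P(correct label)
    return total

# 1000 instances with gold label 0 and one instance (x_1001) with gold label 1.
frequent = (np.array([-1, 1, 0, 0]), np.array([0, 0, -1, 1]))   # f(x_i, 0), f(x_i, 1)
rare = (np.array([0, 0, 3, 1]), np.array([3, 1, 0, 0]))         # f(x_1001, 1), f(x_1001, 0)
examples = [frequent] * 1000 + [rare]

print(log_likelihood(np.array([-1, 0, 1, 0]), examples))   # about -126.9 (zero errors)
print(log_likelihood(np.array([-1, 7, 1, 0]), examples))   # about -1.4  (one error)
```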

Aside: Min Error versus Max Log-Likelihood
Max likelihood ≠ min error:
  Max likelihood pushes as much probability as possible onto the correct labeling of each training instance, even at the cost of mislabeling a few examples.
  Min error forces all training instances to be correctly classified.
  (SVMs with slack variables likewise allow some examples to be classified wrong if the resulting margin is improved on other examples.)

Logistic Regression
  $P(y|x) = \frac{e^{w \cdot f(x,y)}}{Z_x}$, where $Z_x = \sum_{y' \in Y} e^{w \cdot f(x,y')}$
  $w = \arg\max_w \sum_t \log P(y_t|x_t)$  (*)
    $= \arg\min_w -\sum_t \log P(y_t|x_t)$
The objective function (*) is concave (its negation is convex), so there is a single global maximum (minimum).
There is no closed-form solution, but there are lots of numerical techniques.

Gradient Descent
We want to minimize the negative log-likelihood, and convexity guarantees a single (global) minimum.
Gradient descent:
  1. Guess an initial weight vector $w^0$ (e.g., all weights set to 0.0)
  2. Repeat until convergence:
    2.1 Use the gradient at $w^i$ to determine the descent direction
    2.2 Update $w^{i+1} \leftarrow w^i$ by taking a step in the descent direction (along the negative gradient of the objective)
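
A minimal batch gradient-descent sketch for this objective, again assuming the hypothetical feats(x, y) helper; the per-instance gradient is the model's expected feature vector minus the observed one, which matches the derivative shown on the next slide (with the sign flipped for minimization):

```python
import numpy as np

def train_logreg(data, feats, classes, dim, step=0.1, iters=100):
    """Batch gradient descent on the negative log-likelihood (sketch)."""
    w = np.zeros(dim)
    for _ in range(iters):
        grad = np.zeros(dim)
        for x, y_true in data:
            scores = np.array([w @ feats(x, y) for y in classes])
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                        # P(y | x) for every y
            expected = np.zeros(dim)
            for p, y in zip(probs, classes):
                expected += p * feats(x, y)             # expected feature counts
            grad += expected - feats(x, y_true)         # expected - empirical counts
        w -= step * grad                                # step along the negative gradient
    return w
```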

Logistic Regression = Maximum Entropy
This is a well-known equivalence.
MaxEnt: maximize entropy subject to constraints on features; the empirical feature counts must equal the expected counts.
Quick intuition, via the partial derivative in logistic regression:
  $\frac{\partial F(w)}{\partial w_i} = \sum_t f_i(x_t, y_t) - \sum_t \sum_{y' \in Y} P(y'|x_t) f_i(x_t, y')$
This difference is exactly empirical counts minus expected counts. Setting the derivative to zero maximizes the function, so equal counts optimize the logistic regression objective!

Linear Models
Basic form of a linear (multiclass) classifier:
  $y = \arg\max_y w \cdot f(x, y)$
Different learning objectives:
  Perceptron: separate the data (0-1 loss)
  SVM/MIRA: maximize the margin (hinge loss)
  Logistic regression: maximize the likelihood (log loss)
Generalized learning objective:
  $\arg\min_w \sum_{i=1}^{n} \ell(y_i, \arg\max_y w \cdot f(x_i, y))$
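
Written in terms of the margin m = w · f(x, y_gold) − w · f(x, y'), where y' is the highest-scoring incorrect output, the three losses look roughly like this (the log loss is shown in its two-class form; function names are illustrative):

```python
import numpy as np

def zero_one_loss(m):
    """0-1 loss: 1 if the correct output does not win, else 0."""
    return 1.0 if m <= 0 else 0.0

def hinge_loss(m):
    """Hinge loss (SVM/MIRA): penalize any margin smaller than 1."""
    return max(0.0, 1.0 - m)

def log_loss(m):
    """Log loss (logistic regression), two-class case as a function of the margin."""
    return float(np.log1p(np.exp(-m)))

for m in (-1.0, 0.0, 0.5, 2.0):
    print(m, zero_one_loss(m), hinge_loss(m), round(log_loss(m), 3))
```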

Regularization
Regularized learning objective:
  $\arg\min_w \sum_{i=1}^{n} \ell(y_i, \arg\max_y w \cdot f(x_i, y)) + \lambda R(w)$
$R(w)$ prevents the weights from getting too large (overfitting).
Common regularization functions:
  L1 norm: $R(w) = \sum_{i=1}^{n} |w_i|$ (promotes sparse weights)
  L2 norm: $R(w) = \sum_{i=1}^{n} w_i^2$ (promotes dense weights)
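
A sketch of how the regularization term enters the objective (function names and the lam value are illustrative):

```python
import numpy as np

def regularizer(w, penalty="l2"):
    """R(w): L1 promotes sparse weights, L2 promotes dense (but small) weights."""
    w = np.asarray(w, dtype=float)
    return np.abs(w).sum() if penalty == "l1" else (w ** 2).sum()

def regularized_objective(per_instance_losses, w, lam=0.1, penalty="l2"):
    """Sum of the per-instance losses plus lambda * R(w)."""
    return sum(per_instance_losses) + lam * regularizer(w, penalty)

print(regularized_objective([0.3, 1.2], w=[0.5, -2.0, 0.0], lam=0.1, penalty="l1"))  # 1.75
```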