NLP Programming Tutorial 6 - Advanced Discriminative Learning


Graham Neubig, Nara Institute of Science and Technology (NAIST)

Review: Classifiers and the Perceptron

Prediction Problems: given x, predict y.

Example we will use: given an introductory sentence from Wikipedia, predict whether the article is about a person. This is binary classification.

    Given: Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods.  Predict: Yes!
    Given: Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture.  Predict: No!

Mathematical Formulation

    y = sign(w · ϕ(x)) = sign( Σ_{i=1..I} w_i ϕ_i(x) )

x: the input
ϕ(x): vector of feature functions {ϕ_1(x), ϕ_2(x), ..., ϕ_I(x)}
w: the weight vector {w_1, w_2, ..., w_I}
y: the prediction, +1 if yes, -1 if no (sign(v) is +1 if v >= 0, -1 otherwise)
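As a concrete illustration, here is a minimal Python sketch of this formulation. It assumes sparse dictionaries for both w and ϕ(x); the "UNI:" feature-name prefix and whitespace tokenization are illustrative choices, while the helper names create_features and predict_one match the pseudocode used later in the tutorial.

    from collections import defaultdict

    def create_features(x):
        # unigram feature map for a (pre-tokenized) sentence x; "UNI:" prefix is illustrative
        phi = defaultdict(float)
        for word in x.split():
            phi["UNI:" + word] += 1
        return phi

    def predict_one(w, phi):
        # y = sign(w . phi(x)); a score of exactly 0 is treated as +1, matching sign(v) above
        score = sum(w.get(name, 0.0) * value for name, value in phi.items())
        return 1 if score >= 0 else -1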

Online Learning

    create map w
    for I iterations
        for each labeled pair x, y in the data
            phi = create_features(x)
            y' = predict_one(w, phi)
            if y' != y
                update_weights(w, phi, y)

In other words: try to classify each training example, and every time we make a mistake, update the weights. There are many different online learning algorithms; the simplest is the perceptron.

Perceptron Weight Update

    w ← w + y ϕ(x)

In other words: if y = 1, increase the weights for the features in ϕ(x), so features of positive examples get higher weights; if y = -1, decrease the weights for the features in ϕ(x), so features of negative examples get lower weights. Every time we update, our predictions get better!

    update_weights(w, phi, y)
        for name, value in phi:
            w[name] += value * y

Stochastic Gradient Descent and Logistic Regression

Perceptron and Probabilities

Sometimes we want the probability P(y|x), for example to estimate confidence in predictions or to combine the classifier with other systems. However, the perceptron only gives us a prediction:

    y = sign(w · ϕ(x))

In other words:

    P(y=1|x) = 1 if w · ϕ(x) >= 0
    P(y=1|x) = 0 if w · ϕ(x) < 0

[figure: P(y|x) plotted against w · ϕ(x) is a step function jumping from 0 to 1 at zero]

The Logistic Function

The logistic function is a softened version of the function used in the perceptron:

    Perceptron:         P(y=1|x) = 1 if w · ϕ(x) >= 0, 0 otherwise
    Logistic function:  P(y=1|x) = e^(w · ϕ(x)) / (1 + e^(w · ϕ(x)))

[figure: the step function vs. the smooth logistic curve of P(y|x) against w · ϕ(x)]

The logistic function can account for uncertainty and is differentiable.
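A small sketch of the logistic function applied to the classifier score, continuing the sparse-dict representation above (the function name prob_y1 is an illustrative choice; the formula is rewritten in the algebraically equivalent form 1 / (1 + e^(-s)) for numerical convenience):

    import math

    def prob_y1(w, phi):
        # P(y=1|x) = e^(w.phi) / (1 + e^(w.phi)) = 1 / (1 + e^(-w.phi))
        score = sum(w.get(name, 0.0) * value for name, value in phi.items())
        return 1.0 / (1.0 + math.exp(-score))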

Logistic Regression

Train based on conditional likelihood: find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:

    ŵ = argmax_w Π_i P(y_i | x_i; w)

How do we solve this?

Stochastic Gradient Descent

An online training algorithm for probabilistic models (including logistic regression):

    create map w
    for I iterations
        for each labeled pair x, y in the data
            w += α * dP(y|x)/dw

In other words: for every training example, calculate the gradient (the direction that will increase the probability of y), then move in that direction, multiplied by the learning rate α.

Gradient of the Logistic Function

Take the derivative of the probability:

    d/dw P(y=1|x)  = d/dw [ e^(w · ϕ(x)) / (1 + e^(w · ϕ(x))) ]
                   = ϕ(x) e^(w · ϕ(x)) / (1 + e^(w · ϕ(x)))^2

    d/dw P(y=-1|x) = d/dw [ 1 - e^(w · ϕ(x)) / (1 + e^(w · ϕ(x))) ]
                   = -ϕ(x) e^(w · ϕ(x)) / (1 + e^(w · ϕ(x)))^2

[figure: dP(y|x)/d(w · ϕ(x)) as a function of w · ϕ(x), peaking at 0.25 at zero and decaying toward 0 in both directions]
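A sketch of the resulting SGD step, using the dict-based weights from earlier (the function name sgd_update and the default alpha are assumptions): it moves w by α times the gradient derived on this slide, and reproduces the 0.25 and 0.196 coefficients of the two worked examples that follow.

    import math

    def sgd_update(w, phi, y, alpha=1.0):
        # one SGD step on P(y|x): gradient = +/- e^s / (1 + e^s)^2 * phi(x),
        # where s = w . phi(x) and the sign is + for y = +1, - for y = -1
        s = sum(w.get(name, 0.0) * value for name, value in phi.items())
        coeff = y * math.exp(s) / (1.0 + math.exp(s)) ** 2
        for name, value in phi.items():
            w[name] = w.get(name, 0.0) + alpha * coeff * value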

Example: Initial Update

Set α = 1, initialize w = 0.

    x = A site, located in Maizuru, Kyoto
    y = -1

Since w · ϕ(x) = 0:

    d/dw P(y=-1|x) = -e^0 / (1 + e^0)^2 ϕ(x) = -0.25 ϕ(x)
    w ← w - 0.25 ϕ(x)

Resulting weights:

    w[unigram "A"] = -0.25        w[unigram "site"] = -0.25
    w[unigram ","] = -0.5         w[unigram "located"] = -0.25
    w[unigram "in"] = -0.25       w[unigram "Maizuru"] = -0.25
    w[unigram "Kyoto"] = -0.25

Example: Second Update

    x = Shoken, monk born in Kyoto
    y = 1

    w · ϕ(x) = -0.5 - 0.25 - 0.25 = -1

    d/dw P(y=1|x) = e^(-1) / (1 + e^(-1))^2 ϕ(x) = 0.196 ϕ(x)
    w ← w + 0.196 ϕ(x)

Resulting weights:

    w[unigram "A"] = -0.25        w[unigram "site"] = -0.25
    w[unigram ","] = -0.304       w[unigram "located"] = -0.25
    w[unigram "in"] = -0.054      w[unigram "Maizuru"] = -0.25
    w[unigram "Kyoto"] = -0.054   w[unigram "Shoken"] = 0.196
    w[unigram "monk"] = 0.196     w[unigram "born"] = 0.196
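Assuming the create_features and sgd_update sketches above, the two worked updates can be reproduced as follows (a sketch; the inputs are written pre-tokenized so that commas become their own unigram features, and the "UNI:" names are illustrative):

    w = {}
    phi1 = create_features("A site , located in Maizuru , Kyoto")
    sgd_update(w, phi1, y=-1)   # s = 0, coefficient = -0.25: every weight moves by -0.25 per count
    phi2 = create_features("Shoken , monk born in Kyoto")
    sgd_update(w, phi2, y=1)    # s = -1, coefficient = 0.196: shared features move back toward zero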

SGD Learning Rate

How do we set the learning rate α? Usually it is decayed over time:

    α = 1 / (C + t)

where C is a parameter and t is the number of samples seen so far. Alternatively, use held-out data and reduce the learning rate when the held-out likelihood stops improving.
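A small illustration of the decaying schedule (a sketch; the value of C is arbitrary):

    def learning_rate(C, t):
        # alpha = 1 / (C + t): large at the start, shrinking as more samples are seen
        return 1.0 / (C + t)

    # e.g. with C = 10: learning_rate(10, 0) = 0.1, learning_rate(10, 90) = 0.01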

Classification Margins

Choosing between Equally Accurate Classifiers

Which classifier is better, the dotted line or the dashed line?

[figure: O and X points separated by two candidate decision boundaries, one dotted and one dashed]

Answer: probably the dashed line. Why? It has a larger margin.

What is a Margin?

The margin is the distance between the classification plane and the nearest example.

[figure: a separating boundary with the margin to the closest O and X points]

Support Vector Machines

The most famous margin-based classifier.

    Hard margin: explicitly maximize the margin.
    Soft margin: allow for some mistakes.

SVMs usually use batch learning. Batch learning gives slightly higher accuracy and is more stable; online learning is simpler, uses less memory, and converges faster.

Learn more about SVMs: http://disi.unitn.it/moschitti/material/interspeech2010-tutorial.moschitti.pdf
Batch learning libraries: LIBSVM, LIBLINEAR, SVMlight

Online Learning with a Margin

Penalize not only mistakes, but also correct answers that fall under the margin:

    create map w
    for I iterations
        for each labeled pair x, y in the data
            phi = create_features(x)
            val = w * phi * y
            if val <= margin
                update_weights(w, phi, y)

(A correct classifier will always make w * phi * y > 0.) If margin = 0, this is the perceptron algorithm.
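A runnable sketch of this loop, reusing the dict-based create_features helper from earlier (the function name train_margin, the default iteration count, and the default margin are assumptions):

    def train_margin(data, iterations=10, margin=1.0):
        # data is a list of (x, y) pairs with y in {+1, -1};
        # margin = 0 recovers the plain perceptron
        w = {}
        for _ in range(iterations):
            for x, y in data:
                phi = create_features(x)
                val = y * sum(w.get(name, 0.0) * value for name, value in phi.items())
                if val <= margin:
                    for name, value in phi.items():
                        w[name] = w.get(name, 0.0) + value * y
        return w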

Regularization

Cannot Distinguish Between Large and Small Classifiers

For these examples:

    -1  he saw a bird in the park
    +1  he saw a robbery in the park

Which classifier is better?

    Classifier 1: he +3, saw -5, a +0.5, bird -1, robbery +1, in +5, the -3, park -2
    Classifier 2: bird -1, robbery +1

Probably classifier 2! It doesn't use irrelevant information.

Regularization

A penalty on adding extra weights.

    L2 regularization: big penalty on large weights, small penalty on small weights → high accuracy.
    L1 regularization: the penalty grows at the same rate whether weights are large or small → will cause many weights to become zero → a small model.

[figure: the L2 (quadratic) and L1 (absolute-value) penalty curves as a function of the weight]

L1 Regularization in Online Learning

After each update, reduce every weight by a constant c:

    update_weights(w, phi, y, c)
        for name, value in w:
            if abs(value) < c:
                w[name] = 0
            else:
                w[name] -= sign(value) * c
        for name, value in phi:
            w[name] += value * y

If the absolute value is smaller than c, set the weight to zero; if the value is > 0, decrease it by c; if the value is < 0, increase it by c.

Example

Every turn, we regularize, then update: Regularize, Update, Regularize, Update, ...
Regularization: c = 0.1
Updates: {1, 0} on the 1st and 5th turns, {0, -1} on the 3rd turn

    Step     R1           U1           R2           U2           R3           U3
    Change:  {0, 0}       {1, 0}       {-0.1, 0}    {0, 0}       {-0.1, 0}    {0, -1}
    w:       {0, 0}       {1, 0}       {0.9, 0}     {0.9, 0}     {0.8, 0}     {0.8, -1}

    Step     R4           U4           R5           U5           R6           U6
    Change:  {-0.1, 0.1}  {0, 0}       {-0.1, 0.1}  {1, 0}       {-0.1, 0.1}  {0, 0}
    w:       {0.7, -0.9}  {0.7, -0.9}  {0.6, -0.8}  {1.6, -0.8}  {1.5, -0.7}  {1.5, -0.7}
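A short sketch that reproduces this table (the two-element weight list and the helper name l1_regularize are purely for illustration):

    def l1_regularize(w, c):
        # clip each weight toward zero by c; zero it if it is within c of zero
        return [0.0 if abs(v) <= c else v - c * (1 if v > 0 else -1) for v in w]

    w = [0.0, 0.0]
    updates = {1: (1, 0), 3: (0, -1), 5: (1, 0)}    # turns with non-zero updates
    for turn in range(1, 7):
        w = l1_regularize(w, 0.1)                                        # regularize
        w = [wi + di for wi, di in zip(w, updates.get(turn, (0, 0)))]    # update
        print(turn, [round(v, 2) for v in w])       # ends at [1.5, -0.7], matching the table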

Efficiency Problems

Typical numbers of features:

    Each sentence (phi): 10 to 1,000
    Overall (w): 1,000,000 to 100,000,000

    update_weights(w, phi, y, c)
        for name, value in w:          # this loop over every weight is VERY SLOW!
            if abs(value) <= c:
                w[name] = 0
            else:
                w[name] -= sign(value) * c
        for name, value in phi:
            w[name] += value * y

Efficiency Trick

Regularize a weight only when it is actually used!

    getw(w, name, c, iter, last)
        if iter != last[name]:         # regularize several times at once
            c_size = c * (iter - last[name])
            if abs(w[name]) <= c_size:
                w[name] = 0
            else:
                w[name] -= sign(w[name]) * c_size
            last[name] = iter
        return w[name]

This is called lazy evaluation, and it is used in many applications.
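A self-contained sketch of the same trick in Python, using default-valued dicts; the score helper showing how the lazily regularized weights get used is an illustrative addition, not part of the tutorial code:

    from collections import defaultdict

    w = defaultdict(float)      # weight vector
    last = defaultdict(int)     # iteration at which each weight was last regularized

    def getw(name, c, it):
        # apply all the L1 penalties this weight has skipped since it was last touched
        if last[name] != it:
            c_size = c * (it - last[name])
            if abs(w[name]) <= c_size:
                w[name] = 0.0
            elif w[name] > 0:
                w[name] -= c_size
            else:
                w[name] += c_size
            last[name] = it
        return w[name]

    def score(phi, c, it):
        # only the weights appearing in phi are touched, avoiding a full pass over w
        return sum(getw(name, c, it) * value for name, value in phi.items())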

Choosing the Regularization Constant

The regularization constant c has a large effect:

    Large value → small model → lower score on the training set → less overfitting
    Small value → large model → higher score on the training set → more overfitting

Choose the best regularization value on a development set, e.g. try 0.0001, 0.001, 0.01, 0.1, 1.0.
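A sketch of such a sweep over constants (train_lr, accuracy, train_data, and dev_data are hypothetical helpers and datasets, named only for illustration):

    best_c, best_acc = None, -1.0
    for c in [0.0001, 0.001, 0.01, 0.1, 1.0]:
        model = train_lr(train_data, reg_constant=c)   # hypothetical trainer
        acc = accuracy(model, dev_data)                # hypothetical held-out evaluation
        if acc > best_acc:
            best_c, best_acc = c, acc
    print("best regularization constant:", best_c)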

Exercise

Write a program, train-svm or train-lr, that creates an SVM or logistic regression (LR) model with L2 regularization constant 0.001.

    Train a model on data-en/titles-en-train.labeled
    Predict the labels of data-en/titles-en-test.word
    Grade your answers and compare them with the perceptron:
        script/grade-prediction.py data-en/titles-en-test.labeled your_answer

Extra challenge: try many different regularization constants, and implement the efficiency trick.

Thank You!