Machine Learning for NLP
Linear Models
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Slides adapted from Ryan McDonald, Google Research

Outline
Last time:
  Preliminaries: input/output, features, etc.
  Perceptron
  Assignment 2
Today:
  Large-margin classifiers (SVMs, MIRA)
  Logistic regression (Maximum Entropy)
Next time:
  Naive Bayes classifiers
  Generative and discriminative models

Perceptron Summary
Learns a linear classifier that minimizes error
Guaranteed to find a separating w in a finite amount of time if the training data are separable
Improvement 1: shuffle training data between iterations
Improvement 2: average weight vectors seen during training
Perceptron is an example of an online learning algorithm
  w is updated based on a single training instance in isolation:
  $w^{(i+1)} = w^{(i)} + f(x_t, y_t) - f(x_t, y')$
Compare decision trees, which perform batch learning
  All training instances are used to find the best split

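The online update above can be sketched in a few lines of Python. This is a minimal illustration, not the course implementation: the helper names (`dot`, `perceptron_update`), the dictionary layout mapping each label to its feature vector, and the toy vectors are all assumptions made for the example.

```python
def dot(w, v):
    """Inner product of two equal-length lists."""
    return sum(wi * vi for wi, vi in zip(w, v))


def perceptron_update(w, feats, gold):
    """One online perceptron step on a single training instance.

    feats maps each candidate label y to its feature vector f(x_t, y)
    (an assumed data layout); gold is the correct label y_t.
    """
    # Predict the highest-scoring label under the current weights.
    predicted = max(feats, key=lambda y: dot(w, feats[y]))
    if predicted == gold:
        return list(w)  # correct prediction: weights are unchanged
    # w(i+1) = w(i) + f(x_t, y_t) - f(x_t, y')
    return [wi + g - p for wi, g, p in zip(w, feats[gold], feats[predicted])]


# Hypothetical toy instance with two labels and four features.
feats = {0: [3.0, 1.0, 0.0, 0.0], 1: [0.0, 0.0, 3.0, 1.0]}
w = perceptron_update([0.0, 0.0, 0.0, 0.0], feats, gold=1)
```

Starting from the zero vector, both labels tie, the first label wins the tie, and the update moves w toward the features of the correct label. A second call on the same instance leaves w unchanged because the prediction is now correct.
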
Margin
[Figure: two panels, "Training" and "Testing", illustrating the margin]
Denote the value of the margin by $\gamma$

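Maximizing Margin
For a training set T, the margin of a weight vector w is the largest $\gamma$ such that
  $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq \gamma$
for every training instance $(x_t, y_t) \in T$ and every $y' \in \bar{Y}_t$, where $\bar{Y}_t$ is the set of incorrect labels for $x_t$
Equivalently, the margin is the smallest score gap between a correct label and any incorrect label
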
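As a rough sketch, the margin of a given w can be computed as the smallest score gap between the correct label and any incorrect label over the training set. The function names, the data layout and the toy feature vectors below are assumptions for illustration only.

```python
def dot(w, v):
    """Inner product of two equal-length lists."""
    return sum(wi * vi for wi, vi in zip(w, v))


def margin(w, data):
    """Smallest gap w.f(x_t, y_t) - w.f(x_t, y') over all instances
    and all incorrect labels y'.

    data is a list of (feats, gold) pairs, where feats maps each label
    to its feature vector (an assumed layout).
    """
    return min(
        dot(w, feats[gold]) - dot(w, feats[y])
        for feats, gold in data
        for y in feats
        if y != gold
    )


# Hypothetical two-instance training set with labels 0 and 1.
data = [
    ({0: [-1, 1, 0, 0], 1: [0, 0, -1, 1]}, 0),
    ({0: [3, 1, 0, 0], 1: [0, 0, 3, 1]}, 1),
]
gamma = margin([-1, 0, 1, 0], data)
```

A positive result means w separates the data. Because scaling w scales the result, margins are only comparable between weight vectors of the same norm.
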
Maximizing Margin
Intuitively, maximizing the margin makes sense
More importantly, generalization error on unseen test data is proportional to the inverse of the margin:
  $\epsilon \propto \frac{R^2}{\gamma^2 \times |T|}$
Perceptron: we have shown that
  If a training set is separable by some margin, the perceptron will find a w that separates the data
  But the perceptron does not pick w to maximize the margin!

Maximizing Margin
Let $\gamma > 0$:
  $\max_{\|w\| \leq 1} \gamma$
  such that: $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq \gamma$
  $\forall (x_t, y_t) \in T$ and $y' \in \bar{Y}_t$
Note: the algorithm still minimizes error
$\|w\|$ is bounded, since scaling trivially produces a larger margin:
  $\beta (w \cdot f(x_t, y_t) - w \cdot f(x_t, y')) \geq \beta\gamma$, for some $\beta \geq 1$

Max Margin = Min Norm
Let $\gamma > 0$
Max Margin:
  $\max_{\|w\| \leq 1} \gamma$
  such that: $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq \gamma$
  $\forall (x_t, y_t) \in T$ and $y' \in \bar{Y}_t$
=
Min Norm:
  $\min_w \frac{1}{2} \|w\|^2$
  such that: $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq 1$
  $\forall (x_t, y_t) \in T$ and $y' \in \bar{Y}_t$
Instead of fixing $\|w\|$ we fix the margin $\gamma = 1$
Technically $\gamma \propto 1 / \|w\|$

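The equivalence rests on a rescaling step that the slide leaves implicit. The following is a short sketch of that argument; the auxiliary vector u is introduced here only for illustration.

```latex
% Take any w with \|w\| \le 1 that achieves margin \gamma > 0, and rescale it:
u = \frac{w}{\gamma}
\quad\Longrightarrow\quad
u \cdot f(x_t, y_t) - u \cdot f(x_t, y')
  = \frac{w \cdot f(x_t, y_t) - w \cdot f(x_t, y')}{\gamma} \;\ge\; 1 .

% The rescaled vector has margin 1 and norm
\|u\| = \frac{\|w\|}{\gamma} \;\le\; \frac{1}{\gamma} ,

% so a larger achievable margin \gamma corresponds to a smaller norm \|u\|:
\max_{\|w\| \le 1} \gamma
\quad\Longleftrightarrow\quad
\min_{u} \|u\|
\quad\Longleftrightarrow\quad
\min_{u} \tfrac{1}{2}\|u\|^{2}
\quad \text{subject to margin} \ge 1 .
```

Squaring the norm and adding the factor 1/2 does not change the minimizer; it only makes the objective a convenient quadratic.
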
Support Vector Machines
  $\min_w \frac{1}{2} \|w\|^2$
  such that: $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq 1$
  $\forall (x_t, y_t) \in T$ and $y' \in \bar{Y}_t$
Quadratic programming problem
Can be solved with out-of-the-box algorithms
Batch learning algorithm: w is set with respect to all training points

Support Vector Machines
Problem: sometimes $|T|$ is far too large
Thus the number of constraints might make solving the quadratic programming problem very difficult
Common technique: Sequential Minimal Optimization (SMO)
Sparse: the solution depends only on features in support vectors

Margin Infused Relaxed Algorithm (MIRA)
Another option: maximize the margin using an online algorithm
Batch vs. Online
  Batch: update based on the entire training set (SVM)
  Online: update based on one instance at a time (Perceptron)
MIRA can be seen as a max-margin perceptron or an online SVM

MIRA
Batch (SVMs):
  $\min_w \frac{1}{2} \|w\|^2$
  such that: $w \cdot f(x_t, y_t) - w \cdot f(x_t, y') \geq 1$
  $\forall (x_t, y_t) \in T$ and $y' \in \bar{Y}_t$
Online (MIRA):
  Training data: $T = \{(x_t, y_t)\}_{t=1}^{|T|}$
  1. $w^{(0)} = 0$; $i = 0$
  2. for $n : 1..N$
  3.   for $t : 1..|T|$
  4.     $w^{(i+1)} = \arg\min_{w^*} \|w^* - w^{(i)}\|$
         such that: $w^* \cdot f(x_t, y_t) - w^* \cdot f(x_t, y') \geq 1$ $\forall y' \in \bar{Y}_t$
  5.     $i = i + 1$
  6. return $w^{(i)}$
MIRA solves much smaller optimization problems, with only $|\bar{Y}_t|$ constraints each
Cost: sub-optimal optimization

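Step 4 is a small quadratic program in general. As a hedged sketch, the version below keeps only one constraint per update, the one for the highest-scoring incorrect label, which admits a closed-form solution. This single-constraint simplification is a swapped-in variant, not the full algorithm on the slide, and the names `mira_update`, `rival` and `tau` are assumptions for illustration.

```python
def dot(w, v):
    """Inner product of two equal-length lists."""
    return sum(wi * vi for wi, vi in zip(w, v))


def mira_update(w, feats, gold):
    """Single-constraint MIRA step (simplified variant).

    Finds the w* closest to w such that the gold label outscores the
    highest-scoring incorrect label by at least 1.
    """
    wrong = [y for y in feats if y != gold]
    rival = max(wrong, key=lambda y: dot(w, feats[y]))
    # delta = f(x_t, y_t) - f(x_t, y'); the constraint is w* . delta >= 1.
    delta = [g - r for g, r in zip(feats[gold], feats[rival])]
    norm_sq = dot(delta, delta)
    if norm_sq == 0:
        return list(w)  # identical feature vectors: constraint cannot be met
    # Smallest step along delta that satisfies the constraint (0 if already met).
    tau = max(0.0, (1.0 - dot(w, delta)) / norm_sq)
    return [wi + tau * di for wi, di in zip(w, delta)]


# Hypothetical toy instance with two labels and four features.
feats = {0: [3.0, 1.0, 0.0, 0.0], 1: [0.0, 0.0, 3.0, 1.0]}
w_new = mira_update([0.0, 0.0, 0.0, 0.0], feats, gold=1)
gap = dot(w_new, feats[1]) - dot(w_new, feats[0])
```

With only two labels there is a single incorrect label, so in this toy case the simplified step coincides with the full update. Unlike the perceptron, the step size adapts so that the margin constraint is met exactly rather than overshot.
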
Interim Summary
What we have covered
  Linear classifiers: Perceptron, SVMs, MIRA
  All are trained to minimize error
    With or without maximizing margin
    Online or batch
What is next
  Logistic Regression / Maximum Entropy
  Train linear classifiers to maximize likelihood

Logistic Regression / Maximum Entropy
Define a conditional probability:
  $P(y | x) = \frac{e^{w \cdot f(x, y)}}{Z_x}$, where $Z_x = \sum_{y' \in Y} e^{w \cdot f(x, y')}$
Note: still a linear classifier
  $\arg\max_y P(y | x) = \arg\max_y \frac{e^{w \cdot f(x, y)}}{Z_x}$
  $= \arg\max_y e^{w \cdot f(x, y)}$
  $= \arg\max_y w \cdot f(x, y)$

Log-Linear Models
Linear model: $f(x, y) \cdot w$
Make scores positive: $\exp[f(x, y) \cdot w]$
Normalize: $P(y | x) = \frac{\exp[f(x, y) \cdot w]}{\sum_{i=1}^{n} \exp[f(x, y_i) \cdot w]}$

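The three steps (score, exponentiate, normalize) can be sketched as follows. The function names and toy vectors are assumptions for illustration. Subtracting the maximum score before exponentiating is a common numerical-stability detail that is not on the slide and does not change the resulting probabilities.

```python
import math


def dot(w, v):
    """Inner product of two equal-length lists."""
    return sum(wi * vi for wi, vi in zip(w, v))


def label_probs(w, feats):
    """P(y | x) for every label y under a log-linear model.

    feats maps each label to its feature vector f(x, y) (assumed layout).
    """
    scores = {y: dot(w, f) for y, f in feats.items()}   # linear model
    top = max(scores.values())                          # stability shift
    exps = {y: math.exp(s - top) for y, s in scores.items()}  # positive
    z = sum(exps.values())                              # normalizer Z_x
    return {y: e / z for y, e in exps.items()}


# Hypothetical toy instance with two labels and four features.
feats = {0: [-1, 1, 0, 0], 1: [0, 0, -1, 1]}
probs = label_probs([-1, 0, 1, 0], feats)
best = max(probs, key=probs.get)
```

The most probable label is always the one with the highest linear score, which is why the model remains a linear classifier.
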
Log-Linear Models
Crash course in exponentiation:
  Note: $\exp x = a^x$ (for some base $a > 1$)
  $0 < \exp x < 1$ if $x < 0$
  $\exp x = 1$ if $x = 0$
  $1 < \exp x$ if $x > 0$
The inverse of exponentiation is the logarithm: $\log \exp x = x$
Hence, the log-linear model is linear in log(arithmic) space

Log-Linear Models
Suppose we have (only) two classes with the following scores:
  $f(x, y_1) \cdot w = 1.0$
  $f(x, y_2) \cdot w = -2.0$
Using base 2, we have:
  $\exp[f(x, y_1) \cdot w] = 2^{1} = 2$
  $\exp[f(x, y_2) \cdot w] = 2^{-2} = 0.25$
Normalizing, we get:
  $P(y_1 | x) = \frac{\exp[f(x, y_1) \cdot w]}{\exp[f(x, y_1) \cdot w] + \exp[f(x, y_2) \cdot w]} = \frac{2}{2.25} = 0.89$
  $P(y_2 | x) = \frac{\exp[f(x, y_2) \cdot w]}{\exp[f(x, y_1) \cdot w] + \exp[f(x, y_2) \cdot w]} = \frac{0.25}{2.25} = 0.11$

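The arithmetic of this two-class example can be checked with a few lines of Python. The scores are taken as 1.0 and -2.0; the second score has to be negative for base 2 to give 0.25. The variable names are assumptions for illustration.

```python
# Linear scores f(x, y) . w for the two classes.
scores = {"y1": 1.0, "y2": -2.0}
base = 2.0

# Make scores positive by exponentiating in base 2: 2**1 and 2**-2.
positive = {y: base ** s for y, s in scores.items()}

# Normalize so the values sum to one.
z = sum(positive.values())
probs = {y: v / z for y, v in positive.items()}
```

Here `positive` holds 2.0 and 0.25, `z` is 2.25, and the probabilities round to 0.89 and 0.11, matching the slide.
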
Logistic Regression / Maximum Entropy
  $P(y | x) = \frac{e^{w \cdot f(x, y)}}{Z_x}$
Q: How do we learn the weights w?
A: Set the weights to maximize the log-likelihood of the training data:
  $w = \arg\max_w \prod_t P(y_t | x_t) = \arg\max_w \sum_t \log P(y_t | x_t)$
In a nutshell, we set the weights w so that we assign as much probability as possible to the correct label y for each x in the training set

Aside: Min error versus max log-likelihood
Highly related but not identical
Example: consider a training set T with 1001 points
  1000 × $f(x_i, y = 0) = [-1, 1, 0, 0]$ for $i = 1 \ldots 1000$
  1 × $f(x_{1001}, y = 1) = [0, 0, 3, 1]$
Now consider $w = [-1, 0, 1, 0]$
Error in this case is 0, so w minimizes error
  $[-1, 0, 1, 0] \cdot [-1, 1, 0, 0] = 1 > [-1, 0, 1, 0] \cdot [0, 0, -1, 1] = -1$
  $[-1, 0, 1, 0] \cdot [0, 0, 3, 1] = 3 > [-1, 0, 1, 0] \cdot [3, 1, 0, 0] = -3$
However, log-likelihood = -126.9 (calculation omitted)

Aside: Min error versus max log-likelihood
Highly related but not identical
Example: consider a training set T with 1001 points
  1000 × $f(x_i, y = 0) = [-1, 1, 0, 0]$ for $i = 1 \ldots 1000$
  1 × $f(x_{1001}, y = 1) = [0, 0, 3, 1]$
Now consider $w = [-1, 7, 1, 0]$
Error in this case is 1, so w does not minimize error
  $[-1, 7, 1, 0] \cdot [-1, 1, 0, 0] = 8 > [-1, 7, 1, 0] \cdot [0, 0, -1, 1] = -1$
  $[-1, 7, 1, 0] \cdot [0, 0, 3, 1] = 3 < [-1, 7, 1, 0] \cdot [3, 1, 0, 0] = 4$
However, log-likelihood = -1.4
Better log-likelihood and worse error

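The omitted log-likelihood calculation can be reproduced with a short sketch. The feature vectors are the ones in the example (with the feature vector of each wrong label read off the dot products shown); the function and variable names are assumptions for illustration.

```python
import math


def dot(w, v):
    """Inner product of two equal-length lists."""
    return sum(wi * vi for wi, vi in zip(w, v))


def log_prob(w, feats, gold):
    """log P(gold | x) = w.f(x, gold) - log Z_x (stable log-sum-exp)."""
    scores = [dot(w, f) for f in feats.values()]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return dot(w, feats[gold]) - log_z


# Feature vectors f(x, y) for the two kinds of training point.
common = {0: [-1, 1, 0, 0], 1: [0, 0, -1, 1]}  # 1000 points, gold label 0
rare = {0: [3, 1, 0, 0], 1: [0, 0, 3, 1]}      # 1 point, gold label 1


def log_likelihood(w):
    """Log-likelihood of the 1001-point training set under w."""
    return 1000 * log_prob(w, common, 0) + log_prob(w, rare, 1)


ll_zero_errors = log_likelihood([-1, 0, 1, 0])  # about -126.9
ll_one_error = log_likelihood([-1, 7, 1, 0])    # about -1.4
```

The first weight vector classifies every point correctly but is only mildly confident on the 1000 common points, and those small penalties add up. The second sacrifices the single rare point in exchange for near-certainty on the rest.
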
Aside: Min error versus max log-likelihood
Max likelihood ≠ min error
Max likelihood pushes as much probability as possible onto the correct labeling of the training instances
  Even at the cost of mislabeling a few examples
Min error forces all training instances to be correctly classified
SVMs with slack variables allow some examples to be classified wrongly if the resulting margin is improved on other examples

Logistic Regression
  $P(y | x) = \frac{e^{w \cdot f(x, y)}}{Z_x}$, where $Z_x = \sum_{y' \in Y} e^{w \cdot f(x, y')}$
  $w = \arg\max_w \sum_t \log P(y_t | x_t)$ (*)
  $w = \arg\min_w -\sum_t \log P(y_t | x_t)$ (*)
The objective function (*) is concave (log-likelihood) / convex (negative log-likelihood)
Therefore there is a global maximum/minimum
No closed-form solution, but lots of numerical techniques

Gradient Descent
We want to minimize the negative log-likelihood
Convexity guarantees a single minimum
Gradient descent:
  1. Guess an initial weight vector $w^{(0)}$ (all $w^{(0)}_j = 0.0$)
  2. Repeat until convergence:
     2.1 Use the gradient at $w^{(i)}$ to determine the descent direction
     2.2 Update $w^{(i+1)} \leftarrow w^{(i)}$ + a step in the descent direction

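The procedure can be sketched for the multiclass logistic regression objective as follows. This is a minimal illustration under stated assumptions: a fixed step size, a cap on the number of iterations, a toy two-point training set, and hypothetical function names. Practical systems typically use more sophisticated optimizers.

```python
import math


def dot(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))


def label_probs(w, feats):
    """P(y | x) for each label, via a numerically stable softmax."""
    scores = {y: dot(w, f) for y, f in feats.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def nll(w, data):
    """Negative log-likelihood of the training data."""
    return -sum(math.log(label_probs(w, feats)[gold]) for feats, gold in data)


def nll_gradient(w, data):
    """Gradient of the NLL: expected minus empirical feature counts."""
    grad = [0.0] * len(w)
    for feats, gold in data:
        probs = label_probs(w, feats)
        for y, f in feats.items():
            for k, fk in enumerate(f):
                grad[k] += probs[y] * fk      # expected counts
        for k, fk in enumerate(feats[gold]):
            grad[k] -= fk                     # empirical counts
    return grad


def gradient_descent(data, dim, step=0.1, tol=1e-6, max_iters=500):
    w = [0.0] * dim                           # 1. initial guess: all zeros
    for _ in range(max_iters):                # 2. repeat until convergence
        grad = nll_gradient(w, data)          # 2.1 gradient at current w
        if max(abs(g) for g in grad) < tol:
            break
        # 2.2 step against the gradient (the descent direction)
        w = [wi - step * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical toy training set: (label -> feature vector, gold label).
data = [
    ({0: [-1, 1, 0, 0], 1: [0, 0, -1, 1]}, 0),
    ({0: [3, 1, 0, 0], 1: [0, 0, 3, 1]}, 1),
]
w_start = [0.0, 0.0, 0.0, 0.0]
w_final = gradient_descent(data, dim=4)
```

On separable data like this toy set the unregularized objective has no finite minimizer, so the loop ends at the iteration cap rather than at the tolerance. The negative log-likelihood still decreases at every step for a sufficiently small step size.
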
Logistic Regression = Maximum Entropy
Well-known equivalence
Max Ent: maximize entropy subject to constraints on features
  Empirical feature counts must equal expected counts
Quick intuition
  Partial derivative in logistic regression:
  $\frac{\partial}{\partial w_i} F(w) = \sum_t f_i(x_t, y_t) - \sum_t \sum_{y' \in Y} P(y' | x_t) f_i(x_t, y')$
  Difference: empirical counts - expected counts
  Setting the derivative to zero maximizes the function
  Equal counts optimize the logistic regression objective!

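The partial derivative follows from the definition of P(y | x) in a few steps. The sketch below assumes F(w) denotes the log-likelihood of the training data.

```latex
% Log-likelihood, using P(y | x) = e^{w \cdot f(x,y)} / Z_x:
F(w) = \sum_t \log P(y_t \mid x_t)
     = \sum_t \Big[ w \cdot f(x_t, y_t) - \log Z_{x_t} \Big],
\qquad
Z_{x_t} = \sum_{y' \in Y} e^{w \cdot f(x_t, y')} .

% Derivative of the log-normalizer:
\frac{\partial}{\partial w_i} \log Z_{x_t}
  = \frac{1}{Z_{x_t}} \sum_{y' \in Y} e^{w \cdot f(x_t, y')} \, f_i(x_t, y')
  = \sum_{y' \in Y} P(y' \mid x_t) \, f_i(x_t, y') .

% Putting the two pieces together:
\frac{\partial}{\partial w_i} F(w)
  = \underbrace{\sum_t f_i(x_t, y_t)}_{\text{empirical counts}}
  \; - \;
  \underbrace{\sum_t \sum_{y' \in Y} P(y' \mid x_t) \, f_i(x_t, y')}_{\text{expected counts}} .
```

At the optimum every partial derivative is zero, which is exactly the maximum entropy constraint that empirical and expected feature counts match.
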
Linear Models
Basic form of a linear (multiclass) classifier:
  $\hat{y} = \arg\max_y w \cdot f(x, y)$
Different learning objectives:
  Perceptron: separate data (0-1 loss)
  SVM/MIRA: maximize margin (hinge loss)
  Logistic regression: maximize likelihood (log loss)
Generalized learning objective:
  $\arg\min_w \sum_{i=1}^{n} l(y_i, \arg\max_y w \cdot f(x_i, y))$

Regularization
Regularized learning objective:
  $\arg\min_w \sum_{i=1}^{n} l(y_i, \arg\max_y w \cdot f(x_i, y)) + \lambda R(w)$
$R(w)$ prevents weights from getting too large (overfitting)
Common regularization functions:
  $L_1$ norm $= \sum_{i=1}^{n} |w_i|$: promotes sparse weights
  (squared) $L_2$ norm $= \sum_{i=1}^{n} w_i^2$: promotes dense weights

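The two penalties and the way they enter the objective can be sketched as below. The function names, the example weight vector and the values of the loss and λ are assumptions for illustration only.

```python
def l1_penalty(w):
    """L1 norm: sum of absolute weight values."""
    return sum(abs(wi) for wi in w)


def l2_penalty(w):
    """Squared L2 norm: sum of squared weight values."""
    return sum(wi * wi for wi in w)


def regularized_objective(loss, w, lam, penalty):
    """Training loss plus lambda times the regularizer R(w)."""
    return loss + lam * penalty(w)


# Hypothetical weight vector and loss value.
w = [-1, 7, 1, 0]
r1 = l1_penalty(w)
r2 = l2_penalty(w)
objective = regularized_objective(1.4, w, lam=0.5, penalty=l1_penalty)
```

The squared penalty charges the large weight 7 far more than the small ones, which is why it spreads weight across many features, while the absolute-value penalty charges every unit of weight equally and tends to push small weights to exactly zero.
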