
Machine Learning for NLP
Uppsala University, Department of Linguistics and Philology
Slides borrowed from Ryan McDonald, Google Research

Introduction: Linear Classifiers
Classifiers covered so far: decision trees, nearest neighbor
Next two lectures: linear classifiers
Statistics from Google Scholar (October 2009):
  "Maximum Entropy" & NLP: 2660 hits, 141 before 2000
  SVM & NLP: 2210 hits, 16 before 2000
  Perceptron & NLP: 947 hits, 118 before 2000
All are linear classifiers that have become important tools in any NLP/CL researcher's toolbox in the past 10 years

Introduction: Outline
Today:
  Preliminaries: input/output, features, etc.
  Linear classifiers
  Perceptron
  Large-margin classifiers (SVMs, MIRA)
  Logistic regression (maximum entropy)
Next time:
  Structured prediction with linear classifiers
  Structured perceptron
  Structured large-margin classifiers (SVMs, MIRA)
  Conditional random fields
  Case study: dependency parsing

Preliminaries: Inputs and Outputs
Input: x ∈ X, e.g., a document or sentence with some words x = w_1 ... w_n, or a series of previous actions
Output: y ∈ Y, e.g., a parse tree, document class, part-of-speech tags, word sense
Input/output pair: (x, y) ∈ X × Y, e.g., a document x and its label y
Sometimes x is explicit in y, e.g., a parse tree y will contain the sentence x

Preliminaries: Feature Representations
We assume a mapping from input-output pairs (x, y) to a high-dimensional feature vector f(x, y) : X × Y → R^m
In some cases, e.g., binary classification with Y = {-1, +1}, we can map only from the input to the feature space: f(x) : X → R^m
However, most problems in NLP require more than two classes, so we focus on the multi-class case
For any vector v ∈ R^m, let v_j be the j-th value

Preliminaries: Features and Classes
All features must be numerical:
  Numerical features are represented directly as f_i(x, y) ∈ R
  Binary (boolean) features are represented as f_i(x, y) ∈ {0, 1}
  Multinomial (categorical) features must be binarized:
    instead of f_i(x, y) ∈ {v_0, ..., v_p}
    we have f_{i+0}(x, y) ∈ {0, 1}, ..., f_{i+p}(x, y) ∈ {0, 1}
    such that f_{i+j}(x, y) = 1 iff f_i(x, y) = v_j
We need distinct features for distinct output classes:
  instead of f_i(x) (1 ≤ i ≤ m)
  we have f_{i+0m}(x, y), ..., f_{i+Nm}(x, y) for Y = {0, ..., N}
  such that f_{i+jm}(x, y) = f_i(x) iff y = y_j
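To make the class-specific copies concrete, here is a minimal Python sketch (the helper name block_feature_vector and the use of numpy are illustrative assumptions, not notation from the slides) that places the input features f(x) into the block belonging to label y and leaves every other block at zero:

```python
import numpy as np

def block_feature_vector(input_features, y, num_classes):
    """Copy the input features f(x) into the block that corresponds to label y.

    input_features: 1-D array of m numerical input features f(x)
    y: integer label in {0, ..., num_classes - 1}
    Returns a vector of length m * num_classes that is zero outside block y.
    """
    m = len(input_features)
    f = np.zeros(m * num_classes)
    f[y * m:(y + 1) * m] = input_features
    return f

# Two input features, three classes: only the block for y = 2 is non-zero.
print(block_feature_vector(np.array([1.0, 0.5]), y=2, num_classes=3))
# [0.  0.  0.  0.  1.  0.5]
```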

Preliminaries: Examples
x is a document and y is a label:
  f_j(x, y) = 1 if x contains the word "interest" and y = financial, 0 otherwise
  f_j(x, y) = % of words in x with punctuation, if y = scientific (0 otherwise)
x is a word and y is a part-of-speech tag:
  f_j(x, y) = 1 if x = "bank" and y = Verb, 0 otherwise

Preliminaries: Examples
x is a name, y is a label classifying the name:
  f_0(x, y) = 1 if x contains "George" and y = Person, 0 otherwise
  f_1(x, y) = 1 if x contains "Washington" and y = Person, 0 otherwise
  f_2(x, y) = 1 if x contains "Bridge" and y = Person, 0 otherwise
  f_3(x, y) = 1 if x contains "General" and y = Person, 0 otherwise
  f_4(x, y) = 1 if x contains "George" and y = Object, 0 otherwise
  f_5(x, y) = 1 if x contains "Washington" and y = Object, 0 otherwise
  f_6(x, y) = 1 if x contains "Bridge" and y = Object, 0 otherwise
  f_7(x, y) = 1 if x contains "General" and y = Object, 0 otherwise
x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 0 0 0 0]
x = "George Washington Bridge", y = Object: f(x, y) = [0 0 0 0 1 1 1 0]
x = "George Washington George", y = Object: f(x, y) = [0 0 0 0 1 1 0 0]

Preliminaries: Block Feature Vectors
x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 0 0 0 0]
x = "George Washington Bridge", y = Object: f(x, y) = [0 0 0 0 1 1 1 0]
x = "George Washington George", y = Object: f(x, y) = [0 0 0 0 1 1 0 0]
One equal-size block of the feature vector for each label
Input features are duplicated in each block
Non-zero values are allowed only in one block

Linear Classifiers
Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights
Let w ∈ R^m be a high-dimensional weight vector
If we assume that w is known, then we define our classifier as follows
Multiclass classification, Y = {0, 1, ..., N}:
  y = arg max_y w · f(x, y) = arg max_y Σ_{j=0}^{m} w_j f_j(x, y)
Binary classification is just a special case of multiclass
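As a small sketch of this decision rule (feature_fn stands in for the mapping f(x, y); all names here are illustrative rather than taken from the slides):

```python
import numpy as np

def classify(x, labels, w, feature_fn):
    """Return the label y maximizing the linear score w . f(x, y)."""
    scores = [np.dot(w, feature_fn(x, y)) for y in labels]
    return labels[int(np.argmax(scores))]
```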

Linear Classifiers: Bias Terms
Often linear classifiers are presented as y = arg max_y Σ_{j=0}^{m} w_j f_j(x, y) + b_y, where b_y is a bias or offset term
But this can be folded into f:
  x = "General George Washington", y = Person: f(x, y) = [1 1 0 1 1 0 0 0 0 0]
  x = "General George Washington", y = Object: f(x, y) = [0 0 0 0 0 1 1 0 1 1]
  f_4(x, y) = 1 if y = Person, 0 otherwise
  f_9(x, y) = 1 if y = Object, 0 otherwise
w_4 and w_9 are now the bias terms for the labels

Binary Linear Classifier
Divides all points into two regions (figure omitted in transcription)

Multiclass Linear Classifier
Defines regions of space (figure omitted in transcription), i.e., "+" is all points (x, y) where + = arg max_y w · f(x, y)

Separability
A set of points is separable if there exists a w such that classification is perfect (figure: separable vs. not separable)
This can also be defined mathematically (and we will shortly)

Supervised Learning: How to Find w
Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}
Input: feature representation f
Output: w that maximizes/minimizes some important function on the training set:
  minimize error (perceptron, SVMs, boosting)
  maximize likelihood of the data (logistic regression, naive Bayes)
Assumption: the training data is separable
  Not necessary, it just makes life easier
  There is a lot of good work in machine learning to tackle the non-separable case

Perceptron
Choose a w that minimizes error:
  w = arg min_w Σ_t 1[y_t ≠ arg max_y w · f(x_t, y)]
  where 1[p] = 1 if p is true, 0 otherwise
This is a 0-1 loss function
Aside: when minimizing error, people tend to use hinge loss or other smoother loss functions

Perceptron Learning Algorithm
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..|T|
4.     let y' = arg max_y w^(i) · f(x_t, y)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) - f(x_t, y')
7.       i = i + 1
8. return w^(i)
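A Python sketch of the same algorithm, assuming features are dense numpy vectors (train, feature_fn and num_epochs are illustrative names; num_epochs plays the role of N above):

```python
import numpy as np

def perceptron(train, labels, feature_fn, m, num_epochs=10):
    """Multiclass perceptron, following the pseudocode above.

    train: list of (x, y_true) pairs
    labels: the label set Y
    feature_fn: f(x, y) -> numpy array of length m
    """
    w = np.zeros(m)
    for _ in range(num_epochs):
        for x, y_true in train:
            # Predict with the current weights
            y_pred = max(labels, key=lambda y: np.dot(w, feature_fn(x, y)))
            if y_pred != y_true:
                # Move toward the correct label and away from the prediction
                w += feature_fn(x, y_true) - feature_fn(x, y_pred)
    return w
```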

Perceptron: Separability and Margin
Given a training instance (x_t, y_t), define Ȳ_t = Y - {y_t}, i.e., Ȳ_t is the set of incorrect labels for x_t
A training set T is separable with margin γ > 0 if there exists a vector u with ||u|| = 1 such that u · f(x_t, y_t) - u · f(x_t, y') ≥ γ for all y' ∈ Ȳ_t, where ||u|| = sqrt(Σ_j u_j²) (the Euclidean or L2 norm)
Assumption: the training set is separable with margin γ

Perceptron: Main Theorem
Theorem: for any training set separable with a margin of γ, the following holds for the perceptron algorithm:
  mistakes made during training ≤ R² / γ²
  where R ≥ ||f(x_t, y_t) - f(x_t, y')|| for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Thus, after a finite number of training iterations, the error on the training set will converge to zero
For a proof, see Collins (2002)

Perceptron Summary
Learns a linear classifier that minimizes error
Guaranteed to find a w in a finite amount of time
The perceptron is an example of an online learning algorithm: w is updated based on a single training instance in isolation, w^(i+1) = w^(i) + f(x_t, y_t) - f(x_t, y')
Compare decision trees, which perform batch learning: all training instances are used to find the best split

Margin
(Figure omitted in transcription: training and testing points separated by a margin)
Denote the value of the margin by γ

Maximizing Margin
For a training set T, the margin of a weight vector w is the smallest γ such that
  w · f(x_t, y_t) - w · f(x_t, y') ≥ γ
for every training instance (x_t, y_t) ∈ T and y' ∈ Ȳ_t
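A small sketch of this definition, computing the margin of a given w on a training set (names are illustrative; a non-positive result means w does not separate the data):

```python
import numpy as np

def margin(w, train, labels, feature_fn):
    """Smallest score gap between the correct label and any incorrect label."""
    gaps = []
    for x, y_gold in train:
        gold_score = np.dot(w, feature_fn(x, y_gold))
        gaps.extend(gold_score - np.dot(w, feature_fn(x, y))
                    for y in labels if y != y_gold)
    return min(gaps)
```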

Maximizing Margin
Intuitively, maximizing the margin makes sense
More importantly, the generalization error on unseen test data is proportional to the inverse of the margin: ε ∝ R² / (γ² |T|)
Perceptron: we have shown that if a training set is separable by some margin, the perceptron will find a w that separates the data
However, the perceptron does not pick w to maximize the margin!

Maximizing Margin
Let γ > 0 and solve:
  max_{||w|| ≤ 1} γ
  such that w · f(x_t, y_t) - w · f(x_t, y') ≥ γ for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Note: the algorithm still minimizes error
||w|| is bounded, since scaling w trivially produces a larger margin: β(w · f(x_t, y_t) - w · f(x_t, y')) ≥ βγ for any β ≥ 1

Max Margin = Min Norm
Let γ > 0
Max margin:
  max_{||w|| ≤ 1} γ
  such that w · f(x_t, y_t) - w · f(x_t, y') ≥ γ for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
is equivalent to
Min norm:
  min_w (1/2) ||w||²
  such that w · f(x_t, y_t) - w · f(x_t, y') ≥ 1 for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Instead of fixing ||w|| we fix the margin γ = 1
Technically γ ∝ 1/||w||

Support Vector Machines
min_w (1/2) ||w||²
such that w · f(x_t, y_t) - w · f(x_t, y') ≥ 1 for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
This is a quadratic programming problem, a well-known convex optimization problem
It can be solved with out-of-the-box algorithms
Batch learning algorithm: w is set with respect to all training points
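As a rough sketch of how this QP could be handed to an off-the-shelf convex solver, here is a formulation with cvxpy (using that particular library is an assumption of this example, not something the slides prescribe; feats(t, y) stands in for f(x_t, y)):

```python
import cvxpy as cp

def train_hard_margin_svm(feats, instances, labels, m):
    """Solve min 1/2 ||w||^2 s.t. w . (f(x_t, y_t) - f(x_t, y')) >= 1 for all wrong y'.

    feats: function (t, y) -> numpy array f(x_t, y) of length m
    instances: list of (t, y_t) pairs for the training set
    labels: the full label set Y
    """
    w = cp.Variable(m)
    constraints = [
        w @ (feats(t, y_t) - feats(t, y)) >= 1
        for (t, y_t) in instances
        for y in labels
        if y != y_t
    ]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()  # infeasible if the training set is not separable
    return w.value
```

With a hard margin the problem has no solution when the data are not separable, which is exactly what the slack-variable variant mentioned later relaxes.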

Support Vector Machines
Problem: sometimes |T| is far too large
Thus the number of constraints might make solving the quadratic programming problem very difficult
Common technique: Sequential Minimal Optimization (SMO)
Sparse: the solution depends only on the features in the support vectors

Margin Infused Relaxed Algorithm (MIRA)
Another option: maximize the margin using an online algorithm
Batch vs. online:
  Batch: update parameters based on the entire training set (SVM)
  Online: update parameters based on a single training instance at a time (perceptron)
MIRA can be thought of as a max-margin perceptron or an online SVM

MIRA
Batch (SVMs):
  min_w (1/2) ||w||²
  such that w · f(x_t, y_t) - w · f(x_t, y') ≥ 1 for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
Online (MIRA):
  Training data: T = {(x_t, y_t)}_{t=1}^{|T|}
  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..|T|
  4.     w^(i+1) = arg min_{w*} ||w* - w^(i)||
           such that w* · f(x_t, y_t) - w* · f(x_t, y') ≥ 1 for all y' ∈ Ȳ_t
  5.     i = i + 1
  6. return w^(i)
MIRA solves much smaller optimizations, with only |Ȳ_t| constraints each
Cost: sub-optimal overall optimization
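The per-instance optimization in step 4 becomes especially simple in the common 1-best relaxation of MIRA, where only the single highest-scoring incorrect label is constrained; under that simplifying assumption the update has the closed form sketched below (illustrative names, not from the slides):

```python
import numpy as np

def mira_1best_update(w, f_gold, f_pred):
    """One MIRA step against the single best-scoring incorrect label.

    Solves min ||w* - w||^2 subject to w* . (f_gold - f_pred) >= 1, whose
    solution is w* = w + tau * (f_gold - f_pred) for the tau computed below.
    """
    delta = f_gold - f_pred
    norm_sq = np.dot(delta, delta)
    if norm_sq == 0.0:
        return w  # identical feature vectors, nothing to update
    # How far the current weights are from satisfying the margin constraint
    tau = max(0.0, 1.0 - np.dot(w, delta)) / norm_sq
    return w + tau * delta
```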

Interim Summary
What we have covered: linear classifiers (perceptron, SVMs, MIRA)
  All are trained to minimize error
  With or without maximizing the margin
  Online or batch
What is next: logistic regression / maximum entropy
  Training linear classifiers to maximize likelihood

Logistic Regression / Maximum Entropy
Define a conditional probability:
  P(y|x) = e^{w · f(x, y)} / Z_x, where Z_x = Σ_{y' ∈ Y} e^{w · f(x, y')}
Note: this is still a linear classifier:
  arg max_y P(y|x) = arg max_y e^{w · f(x, y)} / Z_x = arg max_y e^{w · f(x, y)} = arg max_y w · f(x, y)
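A direct transcription of this definition into Python (feature_fn is an assumed stand-in for f(x, y); subtracting the maximum score is a standard numerical-stability trick that leaves P(y|x) unchanged):

```python
import numpy as np

def conditional_prob(x, y, labels, w, feature_fn):
    """P(y | x) = exp(w . f(x, y)) / Z_x under the log-linear model above."""
    scores = np.array([np.dot(w, feature_fn(x, y2)) for y2 in labels])
    scores -= scores.max()  # stability only; cancels in the ratio
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[labels.index(y)]
```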

Logistic Regression / Maximum Entropy
P(y|x) = e^{w · f(x, y)} / Z_x
Q: How do we learn the weights w?
A: Set the weights to maximize the log-likelihood of the training data:
  w = arg max_w Π_t P(y_t|x_t) = arg max_w Σ_t log P(y_t|x_t)
In a nutshell, we set the weights w so that we assign as much probability as possible to the correct label y_t for each x_t in the training set

Aside: Min Error versus Max Log-Likelihood
Highly related, but not identical
Example: consider a training set T with 1001 points:
  1000 points (x_i, y = 0), i = 1...1000, with f(x_i, 0) = [-1, 1, 0, 0] and f(x_i, 1) = [0, 0, -1, 1]
  1 point (x_1001, y = 1) with f(x_1001, 1) = [0, 0, 3, 1] and f(x_1001, 0) = [3, 1, 0, 0]
Now consider w = [-1, 0, 1, 0]
Error in this case is 0, so w minimizes error:
  w · [-1, 1, 0, 0] = 1 > w · [0, 0, -1, 1] = -1
  w · [0, 0, 3, 1] = 3 > w · [3, 1, 0, 0] = -3
However, log-likelihood = -126.9 (calculation omitted)

Aside: Min Error versus Max Log-Likelihood
Highly related, but not identical
Same example: a training set T with 1001 points:
  1000 points (x_i, y = 0), i = 1...1000, with f(x_i, 0) = [-1, 1, 0, 0] and f(x_i, 1) = [0, 0, -1, 1]
  1 point (x_1001, y = 1) with f(x_1001, 1) = [0, 0, 3, 1] and f(x_1001, 0) = [3, 1, 0, 0]
Now consider w = [-1, 7, 1, 0]
Error in this case is 1, so w does not minimize error:
  w · [-1, 1, 0, 0] = 8 > w · [0, 0, -1, 1] = -1
  w · [0, 0, 3, 1] = 3 < w · [3, 1, 0, 0] = 4
However, log-likelihood = -1.4
Better log-likelihood and worse error
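A quick numeric check of the two log-likelihood values on the last two slides (the helper function and data layout are illustrative, not part of the slides):

```python
import numpy as np

def log_likelihood(w, data):
    """Sum of log P(y_t | x_t) for (count, f_gold, f_other) triples with two labels."""
    total = 0.0
    for count, f_gold, f_other in data:
        scores = np.array([w @ f_gold, w @ f_other])
        log_p_gold = scores[0] - np.log(np.exp(scores).sum())
        total += count * log_p_gold
    return total

data = [
    (1000, np.array([-1., 1., 0., 0.]), np.array([0., 0., -1., 1.])),  # the 1000 y=0 points
    (1,    np.array([0., 0., 3., 1.]),  np.array([3., 1., 0., 0.])),   # the single y=1 point
]
print(log_likelihood(np.array([-1., 0., 1., 0.]), data))  # about -126.9
print(log_likelihood(np.array([-1., 7., 1., 0.]), data))  # about -1.4
```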

Aside: Min Error versus Max Log-Likelihood
Max likelihood ≠ min error
Max likelihood pushes as much probability as possible onto the correct labeling of each training instance, even at the cost of mislabeling a few examples
Min error forces all training instances to be correctly classified
SVMs with slack variables allow some examples to be classified wrong if the resulting margin is improved on other examples

Aside: Max Margin versus Max Log-Likelihood
Let's re-write the max likelihood objective function:
  w = arg max_w Σ_t log P(y_t|x_t)
    = arg max_w Σ_t log ( e^{w · f(x_t, y_t)} / Σ_{y' ∈ Y} e^{w · f(x_t, y')} )
    = arg max_w Σ_t ( w · f(x_t, y_t) - log Σ_{y' ∈ Y} e^{w · f(x_t, y')} )
Pick w to maximize the score difference between the correct labeling and every possible labeling
Margin: maximize the difference between the correct and all incorrect labelings
The above formulation is often referred to as the soft-margin

Logistic Regression
P(y|x) = e^{w · f(x, y)} / Z_x, where Z_x = Σ_{y' ∈ Y} e^{w · f(x, y')}
w = arg max_w Σ_t log P(y_t|x_t)  (*)
The objective function (*) is concave
  Therefore there is a global maximum
No closed-form solution, but lots of numerical techniques:
  Gradient methods (gradient ascent, iterative scaling)
  Newton methods (limited-memory quasi-Newton)

Logistic Regression Summary
Define the conditional probability P(y|x) = e^{w · f(x, y)} / Z_x
Set the weights to maximize the log-likelihood of the training data: w = arg max_w Σ_t log P(y_t|x_t)
Can find the gradient and run gradient ascent (or any gradient-based optimization algorithm):
  ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)
  ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) - Σ_t Σ_{y' ∈ Y} P(y'|x_t) f_i(x_t, y')
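A batch gradient-ascent sketch built directly on this gradient (all names are illustrative assumptions; the fixed step size lr is a simplification of the step size α discussed in the appendix, which is chosen so that F increases):

```python
import numpy as np

def train_logistic_regression(train, labels, feature_fn, m, lr=0.1, num_iters=100):
    """Maximize the log-likelihood by batch gradient ascent, using the gradient above.

    train: list of (x, y_gold) pairs; feature_fn: f(x, y) -> length-m numpy array.
    """
    w = np.zeros(m)
    for _ in range(num_iters):
        grad = np.zeros(m)
        for x, y_gold in train:
            feats = {y: feature_fn(x, y) for y in labels}
            scores = np.array([w @ feats[y] for y in labels])
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()
            # empirical feature counts minus expected feature counts
            grad += feats[y_gold]
            grad -= sum(p * feats[y] for p, y in zip(probs, labels))
        w += lr * grad
    return w
```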

Logistic Regression = Maximum Entropy
A well-known equivalence
MaxEnt: maximize entropy subject to constraints on features
  Empirical feature counts must equal expected counts
Quick intuition: the partial derivative in logistic regression is
  ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) - Σ_t Σ_{y' ∈ Y} P(y'|x_t) f_i(x_t, y')
The first term is the empirical feature count and the second term is the expected count
Setting the derivative to zero maximizes the function
Therefore, when both counts are equal, we optimize the logistic regression objective!

Linear Classification
Basic form of a (multiclass) classifier: y = arg max_y w · f(x, y)
Different learning methods:
  Perceptron: separate the data (0-1 loss, online)
  Support vector machine: maximize the margin (hinge loss, batch)
  Logistic regression: maximize likelihood (log loss, batch)
All three methods are widely used in NLP

Appendix: Proofs and Derivations

Convergence Proof for Perceptron
(Recall the perceptron learning algorithm above)
Let w^(k-1) be the weights before the k-th mistake
Suppose the k-th mistake is made at the t-th example (x_t, y_t):
  y' = arg max_y w^(k-1) · f(x_t, y), with y' ≠ y_t
  w^(k) = w^(k-1) + f(x_t, y_t) - f(x_t, y')
Now: u · w^(k) = u · w^(k-1) + u · (f(x_t, y_t) - f(x_t, y')) ≥ u · w^(k-1) + γ
Now: since w^(0) = 0 and u · w^(0) = 0, by induction on k, u · w^(k) ≥ kγ
Now: since ||u|| ||w^(k)|| ≥ u · w^(k) and ||u|| = 1, we have ||w^(k)|| ≥ kγ
Now: ||w^(k)||² = ||w^(k-1)||² + ||f(x_t, y_t) - f(x_t, y')||² + 2 w^(k-1) · (f(x_t, y_t) - f(x_t, y'))
  so ||w^(k)||² ≤ ||w^(k-1)||² + R²
  (since R ≥ ||f(x_t, y_t) - f(x_t, y')|| and w^(k-1) · f(x_t, y_t) - w^(k-1) · f(x_t, y') ≤ 0)

Convergence Proof for Perceptron (continued)
We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||² ≤ ||w^(k-1)||² + R²
By induction on k, and since w^(0) = 0 and ||w^(0)||² = 0: ||w^(k)||² ≤ kR²
Therefore k²γ² ≤ ||w^(k)||² ≤ kR², and solving for k: k ≤ R² / γ²
Therefore the number of errors is bounded!

Gradient Ascent for Logistic Regression
Gradient ascent:
  Let F(w) = Σ_t log ( e^{w · f(x_t, y_t)} / Z_{x_t} )
  We want to find arg max_w F(w)
  Set w^(0) = 0 (the zero vector in R^m)
  Iterate until convergence: w^(i) = w^(i-1) + α ∇F(w^(i-1)), with α > 0 set so that F(w^(i)) > F(w^(i-1))
∇F(w) is the gradient of F with respect to w, i.e., the vector of all partial derivatives over the variables w_i:
  ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)
Gradient ascent will always find a w that maximizes F

Gradient Ascent for Logistic Regression: The Partial Derivatives
We need to find all partial derivatives ∂F(w)/∂w_i
F(w) = Σ_t log P(y_t|x_t)
     = Σ_t log ( e^{w · f(x_t, y_t)} / Σ_{y' ∈ Y} e^{w · f(x_t, y')} )
     = Σ_t log ( e^{Σ_j w_j f_j(x_t, y_t)} / Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t, y')} )

Gradient Ascent for Logistic Regression: Partial Derivatives, Some Reminders
1. ∂/∂x log F = (1/F) ∂F/∂x (we always assume log is the natural logarithm log_e)
2. ∂/∂x e^F = e^F ∂F/∂x
3. ∂/∂x Σ_t F_t = Σ_t ∂F_t/∂x
4. ∂/∂x (F/G) = ( (∂F/∂x) G - F (∂G/∂x) ) / G²

Gradient Ascent for Logistic Regression: The Partial Derivatives
∂F(w)/∂w_i = ∂/∂w_i Σ_t log ( e^{Σ_j w_j f_j(x_t, y_t)} / Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t, y')} )
= Σ_t ∂/∂w_i log ( e^{Σ_j w_j f_j(x_t, y_t)} / Z_{x_t} )
= Σ_t ( Z_{x_t} / e^{Σ_j w_j f_j(x_t, y_t)} ) ∂/∂w_i ( e^{Σ_j w_j f_j(x_t, y_t)} / Z_{x_t} )

Gradient Ascent for Logistic Regression: The Partial Derivatives
Now,
∂/∂w_i ( e^{Σ_j w_j f_j(x_t, y_t)} / Z_{x_t} )
= ( (∂/∂w_i e^{Σ_j w_j f_j(x_t, y_t)}) Z_{x_t} - e^{Σ_j w_j f_j(x_t, y_t)} ∂Z_{x_t}/∂w_i ) / Z_{x_t}²
= ( Z_{x_t} e^{Σ_j w_j f_j(x_t, y_t)} f_i(x_t, y_t) - e^{Σ_j w_j f_j(x_t, y_t)} ∂Z_{x_t}/∂w_i ) / Z_{x_t}²
= ( e^{Σ_j w_j f_j(x_t, y_t)} / Z_{x_t}² ) ( Z_{x_t} f_i(x_t, y_t) - ∂Z_{x_t}/∂w_i )
= ( e^{Σ_j w_j f_j(x_t, y_t)} / Z_{x_t}² ) ( Z_{x_t} f_i(x_t, y_t) - Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t, y')} f_i(x_t, y') )
because ∂Z_{x_t}/∂w_i = ∂/∂w_i Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t, y')} = Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t, y')} f_i(x_t, y')

Gradient Ascent for Logistic Regression: The Partial Derivatives
From before, substituting this back in:
∂F(w)/∂w_i = Σ_t ( Z_{x_t} / e^{Σ_j w_j f_j(x_t, y_t)} ) ( e^{Σ_j w_j f_j(x_t, y_t)} / Z_{x_t}² ) ( Z_{x_t} f_i(x_t, y_t) - Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t, y')} f_i(x_t, y') )
= Σ_t (1 / Z_{x_t}) ( Z_{x_t} f_i(x_t, y_t) - Σ_{y' ∈ Y} e^{Σ_j w_j f_j(x_t, y')} f_i(x_t, y') )
= Σ_t f_i(x_t, y_t) - Σ_t Σ_{y' ∈ Y} ( e^{Σ_j w_j f_j(x_t, y')} / Z_{x_t} ) f_i(x_t, y')
= Σ_t f_i(x_t, y_t) - Σ_t Σ_{y' ∈ Y} P(y'|x_t) f_i(x_t, y')

Gradient Ascent for Logistic Regression: FINALLY!!!
After all that,
  ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) - Σ_t Σ_{y' ∈ Y} P(y'|x_t) f_i(x_t, y')
And the gradient is:
  ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)
So we can now use gradient ascent to find w!