Partially Directed Graphs and Conditional Random Fields. Sargur Srihari (srihari@cedar.buffalo.edu)

Topics
- Conditional Random Fields
- Gibbs distribution and CRF
- Directed and undirected independencies
- View as a combination of BN and MN
- CRF for image segmentation
- CRF for text analytics
- Naïve Bayes and naïve Markov
- Learning the models

Conditional Distribution Representation
- Nodes correspond to $Y \cup X$: $Y$ are target variables and $X$ are observed variables
- Parameterized as an ordinary Markov network: a set of factors $\phi_1(D_1), \ldots, \phi_m(D_m)$
- Can be encoded as a log-linear model, or viewed simply as encoding a set of factors
- The model represents $P(Y \mid X)$ rather than $P(Y, X)$
- To naturally represent a conditional distribution, we avoid representing a probabilistic model over $X$: potentials involving only variables in $X$ are disallowed

Conditional Random Fields
- A Markov network encodes a joint distribution over $X$
- An MN can also be used to represent a conditional distribution $P(Y \mid X)$, where $Y$ is a set of target variables and $X$ is a set of observed variables
- This representation is called a CRF
- It has an analog in directed graphical models: conditional Bayesian networks

CRF Definition
- An undirected graph $H$ with nodes $X \cup Y$, annotated with a set of factors $\phi_1(D_1), \ldots, \phi_m(D_m)$ such that no scope $D_i$ lies entirely within $X$
- The network encodes a conditional distribution as
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \qquad \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i), \qquad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
- $\tilde{P}(Y, X)$ is the unnormalized joint distribution, a product of factors
- $Z(X)$ is the partition function, which is now a function of $X$ (it sums $\tilde{P}$ over $Y$ only)
- Two variables in $H$ are connected by an edge whenever they appear together in the scope of a factor
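
To make the definition concrete, here is a minimal sketch (a toy example of my own, not from the slides) that builds $P(Y \mid X = x)$ for two binary target variables by multiplying assumed factor tables and normalizing over $Y$ alone, so that a different $Z(x)$ is induced for each observed assignment $x$:

```python
# Minimal sketch (assumed toy factors): a CRF conditional P(Y | X=x) is obtained
# by normalizing the product of factors over Y only, for each observed x.
import itertools
import numpy as np

# One pairwise factor phi(Y1, Y2) and two observation factors phi(Yi, Xi);
# all variables are binary. The table values are illustrative assumptions.
phi_pair = np.array([[4.0, 1.0], [1.0, 4.0]])   # favours Y1 == Y2
phi_obs  = np.array([[3.0, 1.0], [1.0, 3.0]])   # favours Yi == Xi

def unnormalized(y, x):
    """tilde P(y, x) = product of all factor values for assignment (y, x)."""
    y1, y2 = y
    x1, x2 = x
    return phi_pair[y1, y2] * phi_obs[y1, x1] * phi_obs[y2, x2]

def crf_conditional(x):
    """P(Y | X = x): normalize over assignments to Y only, so Z is a function of x."""
    scores = {y: unnormalized(y, x) for y in itertools.product([0, 1], repeat=2)}
    Z_x = sum(scores.values())                  # partition function Z(x)
    return {y: s / Z_x for y, s in scores.items()}

print(crf_conditional((1, 0)))   # a different Z(x) is induced for each x
```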

Deriving the CRF Definition
(1) Conditional distribution from Bayes' rule:
    $$P(Y \mid X) = \frac{P(Y, X)}{P(X)}$$
(2) From the definition of the Gibbs distribution, the numerator of (1) is
    $$P(Y, X) = \frac{1}{Z} \tilde{P}(Y, X), \quad \text{where } \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i) \text{ and } Z = \sum_{Y, X} \tilde{P}(Y, X)$$
(3) The denominator of (1) follows from the sum rule applied to the numerator:
    $$P(X) = \sum_{Y} P(Y, X) = \frac{1}{Z} \sum_{Y} \tilde{P}(Y, X)$$
(4) Combining (1)-(3) gives the CRF:
    $$P(Y \mid X) = \frac{\tilde{P}(Y, X)}{\sum_{Y} \tilde{P}(Y, X)} = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \text{where } Z(X) = \sum_{Y} \tilde{P}(Y, X)$$

Difference between CRF and Gibbs Distribution
- The difference is the normalization in the partition function $Z(X)$
- A Gibbs distribution factorizes into a set of factors and a single partition function $Z$:
  $$P_\Phi(X_1, \ldots, X_n) = \frac{1}{Z} \tilde{P}(X_1, \ldots, X_n), \quad \tilde{P}(X_1, \ldots, X_n) = \prod_{i=1}^{m} \phi_i(D_i), \quad Z = \sum_{X_1, \ldots, X_n} \tilde{P}(X_1, \ldots, X_n)$$
- A CRF induces a different value of $Z$ for every assignment $x$ to $X$; the summation is only over $Y$:
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
- In diagrams the difference is indicated by the feature variables being greyed out: $X$ is known (shown dark grey), and $Y$ has a distribution that depends on $X$
  (Figure: a network over $X_1, \ldots, X_5$ and $Y_1, \ldots, Y_5$ with the $X$ nodes greyed out.)

Example of a CRF
- CRF over $Y = \{Y_1, \ldots, Y_k\}$ and $X = \{X_1, \ldots, X_k\}$, with edges $Y_i - Y_{i+1}$ and $Y_i - X_i$:
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
- This is a linear chain-structured CRF, used for sequence labeling
- The observed feature variables $X_1, \ldots, X_5$ are assumed known when the model is used (hence greyed out in the figure)
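
The linear-chain factorization above can be illustrated with a brute-force sketch. The factor tables, chain length, and observation below are illustrative assumptions, and the sum over all $2^k$ label sequences is only feasible because $k$ is tiny:

```python
# Minimal sketch of the linear-chain factorization
# tilde P(Y, X) = prod_i phi(Y_i, Y_{i+1}) * prod_i phi(Y_i, X_i),
# with P(Y | X) obtained by brute-force summation over Y.
import itertools
import numpy as np

phi_trans = np.array([[2.0, 1.0], [1.0, 2.0]])  # phi(Y_i, Y_{i+1}), assumed values
phi_emit  = np.array([[5.0, 1.0], [1.0, 5.0]])  # phi(Y_i, X_i), assumed values

def tilde_p(y, x):
    """Unnormalized score of a label sequence y for observations x."""
    score = 1.0
    for i in range(len(y) - 1):
        score *= phi_trans[y[i], y[i + 1]]
    for yi, xi in zip(y, x):
        score *= phi_emit[yi, xi]
    return score

def label_posterior(x):
    """P(Y | X = x) over all 2^k binary label sequences."""
    ys = list(itertools.product([0, 1], repeat=len(x)))
    scores = np.array([tilde_p(y, x) for y in ys])
    probs = scores / scores.sum()               # divide by Z(x)
    return dict(zip(ys, probs))

post = label_posterior((1, 1, 0, 0, 1))
print(max(post, key=post.get))                  # most probable label sequence
```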

Main Strength of the CRF
- Avoids encoding a distribution over the variables in $X$
- Allows incorporating into the model a rich set of observed variables whose dependencies are complex or poorly understood
- Allows including continuous variables whose distributions may not have simple parametric forms
- Can incorporate domain knowledge: rich features without modeling their joint distribution

CRF for Image Segmentation
(Figure: original image of a cow on grass, with each superpixel a random variable; classification using node potentials alone; segmentation using a pairwise Markov network encoding.)
- Each image defines a probability distribution over the variables representing superpixel labels
- Rather than defining a joint distribution over pixel values, we define a conditional distribution over segment labels given the pixel values
- This avoids making a parametric assumption over the (continuous) pixel values
- We can use image-processing routines to define rich features, e.g., the presence or direction of an image gradient at a pixel; such features usually rely on multiple pixels, so defining a correct joint distribution or independence properties over the features is non-trivial

Directed and Undirected Dependencies
- A CRF defines a conditional distribution of $Y$ given $X$
- It can therefore be viewed as a partially directed graph, with an undirected component over $Y$ that has the variables in $X$ as parents

CRFs for Text Analysis
- An important use of the CRF framework
- Part-of-speech labeling
- Named entity tagging: people, places, organizations, etc.
- Extracting structured information from text, e.g., from a reference list: publications, titles, authors, journals, year
- These models share a similar structure

Named Entity (NE) Tagging
- Entities often span multiple words
- The type of an entity may not be apparent from individual words: "New York" is a location, "New York Times" is an organization
- For each word $X_i$, introduce a target variable $Y_i$ giving its entity type
- Outcomes for $Y_i$ (in BIO notation): B-PERSON, I-PERSON, B-LOCATION, I-LOCATION, B-ORGANIZATION, I-ORGANIZATION, OTHER
- B: beginning of entity, I: inside entity; B allows segmenting adjacent entities of the same type

CRF for NE Tagging
(Figure: linear-chain CRF for the word sequence "Mrs. Green spoke today in New York. Green chairs the finance committee" with labels B-PER I-PER OTH OTH OTH B-LOC I-LOC B-PER OTH OTH OTH OTH.)
- $X$ is the set of known variables (the words)
- There are two factors for each word: $\phi^1_t(Y_t, Y_{t+1})$ and $\phi^2_t(Y_t, X_1, \ldots, X_T)$
- $\phi^1_t(Y_t, Y_{t+1})$ represents the dependency between neighboring target variables
- $\phi^2_t(Y_t, X_1, \ldots, X_T)$ represents the dependency between target $Y_t$ and its context in the word sequence; it can depend on arbitrary features of the entire input word sequence $X_1, \ldots, X_T$ (a window of three words in the figure)

Linear-Chain CRF for NE
- The factor $\phi^2_t(Y_t, X_1, \ldots, X_T)$ represents the dependency between target $Y_t$ and its context in the word sequence
- It can depend on arbitrary features of the entire input word sequence $X_1, \ldots, X_T$
- Such factors are not encoded as table factors; log-linear models are used instead
- The factors are derived from feature functions such as $f_t(Y_t, X_t) = I\{Y_t = \text{B-ORGANIZATION}, X_t = \text{"Times"}\}$
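
The following sketch shows, under assumed feature functions and weights (the slides do not specify a concrete feature set), how a log-linear factor $\phi^2_t$ can be assembled from indicator features like the one above:

```python
# Sketch: building a log-linear factor from indicator feature functions, e.g.
# f_t(Y_t, X_t) = I{Y_t = B-ORGANIZATION, X_t = "Times"}.
# The feature list and weights below are illustrative assumptions.
import math

def f_is_times_org(y_t, words, t):
    return 1.0 if y_t == "B-ORGANIZATION" and words[t] == "Times" else 0.0

def f_capitalized_person(y_t, words, t):
    return 1.0 if y_t.endswith("PER") and words[t][0].isupper() else 0.0

features = [(f_is_times_org, 2.5), (f_capitalized_person, 1.2)]  # (feature, weight)

def phi2(y_t, words, t):
    """Log-linear factor phi^2_t(Y_t, X_1..X_T) = exp(sum_j w_j f_j(Y_t, X, t))."""
    return math.exp(sum(w * f(y_t, words, t) for f, w in features))

words = ["New", "York", "Times"]
print(phi2("B-ORGANIZATION", words, 2))   # larger than the score for OTHER below
print(phi2("OTHER", words, 2))
```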

Features for NE Tagging
- For word $X_i$: is capitalized, appears in a list of common person names, appears in an atlas of location names, ends with "ton", is exactly "York", is followed by "Times"
- For the word sequence: contains more than two sports-related terms, "New York" is a sports organization
- Hundreds or thousands of features, sparse (zero for most words)
- The same feature variable can be connected to multiple target variables: $Y_i$ depends on the identity of several words in a window

Performance of CRFs
- Linear-chain CRFs provide high per-token accuracies: high 90% range on many natural data sets
- High per-field precision and recall, where entire phrase categories and boundaries must be correct: 80-95% depending on the data set

Including Additional Information in NE
- The linear-chain graphical model is augmented: when a word occurs multiple times in a document, it should have the same label
- Include factors that connect identical words
- This results in the skip-chain CRF shown next

Skip-Chain CRF for NE Recognition
(Figure: skip-chain CRF for "Mrs. Green spoke today in New York. Green chairs the finance committee" with labels B-PER I-PER OTH OTH OTH B-LOC I-LOC B-PER OTH OTH OTH OTH, and a long-range factor connecting the two occurrences of "Green".)
- The first occurrence of "Green" has neighboring words that provide strong evidence that it is a person; the second occurrence is more ambiguous
- Augmenting the model with a long-range factor allows it to predict the second occurrence correctly
- The graphical structure over $Y$ can easily depend on the $X$s

Joint Inference: Part-of-Speech Labeling / Noun-Phrase Segmentation
- A pair of coupled linear-chain CRFs
- A noun phrase is composed of several words, and its segmentation depends on the POS tags and the words
(Figure: NP labels "B I O O O B I O B I I" and POS labels "ADJ N V IN V PRP N IN DT N N" over the sentence "British Airways rose after announcing its withdrawal from the UAL deal".)
Key: B = begin noun phrase, I = within noun phrase, O = not a noun phrase; N = noun, ADJ = adjective, V = verb, IN = preposition, PRP = possessive pronoun, DT = determiner (e.g., a, an, the)

Partially Directed Models
- Probabilistic graphical models are useful for complex systems: directed (Bayesian networks) and undirected (Markov networks)
- The two representations can be unified, incorporating both directed and undirected dependencies
- CRFs can be viewed as partially directed graphs
- CRFs can be generalized to chain graphs, which have subgraphs with chains: networks in which undirected components depend upon each other in a directed fashion

Directed and Undirected Dependencies
- A CRF defines a conditional distribution of $Y$ given $X$
- Thus it can be viewed as a partially directed graph

CRF as a Partially Directed Graph
- A CRF defines a conditional distribution of $Y$ given $X$
- A linear-chain CRF (undirected, over $X_1, \ldots, X_5$ and $Y_1, \ldots, Y_5$) can be viewed as a model with an undirected component over $Y$ that has the $X$ variables as parents
- The equivalent partially directed variant has factors defined over the $Y$s only, each of which has an $X$ value as a parent
- The two models are equivalent

CRF vs HMM
- Logistic CPDs (logistic regression) are the conditional analog of naïve Bayes
- The CRF is the conditional analog of the HMM

Models for Sequence Labeling
- HMM (generative; conditioning on the unknown $Y$):
  $$P(X, Y) = \prod_{i=1}^{k} P(X_i \mid Y_i) \, P(Y_{i+1} \mid Y_i)$$
  Since the $Y_i$ are unknown, the joint distribution has to be estimated from data; determining $P(Y \mid X)$ depends on first determining $P(X, Y)$
- CRF (discriminative; conditioning on the known $X$): $P(Y \mid X)$ is obtained directly,
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
- MEMM (discriminative, directed):
  $$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i) \, P(Y_i \mid Y_{i-1})$$

CRF (Partially Directed) and MEMM
- Linear chain-structured CRF over $Y = \{Y_1, \ldots, Y_k\}$ and $X = \{X_1, \ldots, X_k\}$:
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
  with an equivalent partially directed variant
- The fully directed version (a Bayesian network) is called a maximum entropy Markov model (MEMM); it is also a conditional model, but it is not equivalent:
  $$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i) \, P(Y_i \mid Y_{i-1})$$
- In the MEMM, $Y_1$ is independent of $X_2$ if $Y_2$ is unknown; if $Y_2$ is known, there is a dependency due to the v-structure
- A sound conditional BN requires edges from all variables in $X$ to each $Y_i$
- In the CRF, the probability of $Y$ depends on the values of all variables $X = \{X_1, \ldots, X_k\}$
- The MEMM is more efficient, with fewer parameters to be learned

Models for Sequence Labeling
- A sequence of observations $X = \{X_1, \ldots, X_k\}$ needs a joint label $Y = \{Y_1, \ldots, Y_k\}$
- Both the CRF and the MEMM are discriminative models that directly obtain the conditional probability $P(Y \mid X)$; the HMM is a generative model that needs the joint probability $P(X, Y)$
- CRF:
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
- MEMM:
  $$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i) \, P(Y_i \mid Y_{i-1})$$
  In the MEMM, $Y_1$ is independent of $X_2$ if we are not given $Y_2$: a later observation has no effect on the posterior probability of the current state. In activity recognition from a video sequence, where frames are labelled as running or walking, earlier frames may be blurry while later ones are clearer, so this matters
- HMM: needs the joint distribution
  $$P(X, Y) = \prod_{i=1}^{k} P(X_i \mid Y_i) \, P(Y_i \mid Y_{i-1}), \qquad P(Y \mid X) = \frac{P(X, Y)}{P(X)}$$
- The models have trade-offs in expressive power and learnability: the MEMM and HMM are more easily learned; as purely directed models their parameters can be computed in closed form using maximum likelihood. The CRF requires an iterative gradient-based approach, which is more expensive

CRF Example: Naïve Markov Model
- Binary-valued variables $X = \{X_1, \ldots, X_k\}$ and $Y$
- The $X_i$ are independent of each other and depend only on the class $Y$
- Pairwise potential between $Y$ and each $X_i$: $\phi_i(X_i, Y) = \exp\{w_i \, I\{X_i = 1, Y = 1\}\}$
- Single-node potential: $\phi_0(Y) = \exp\{w_0 \, I\{Y = 1\}\}$
- $I$ is the indicator function, which takes value 1 when its argument is true and 0 otherwise
- From the CRF definition, the unnormalized conditional is
  $$\tilde{P}(Y = 1 \mid x_1, \ldots, x_k) = \exp\Big\{w_0 + \sum_{i=1}^{k} w_i x_i\Big\}, \qquad \tilde{P}(Y = 0 \mid x_1, \ldots, x_k) = \exp\{0\} = 1$$
  which is equivalent to
  $$P(Y = 1 \mid x_1, \ldots, x_k) = \mathrm{sigmoid}\Big(w_0 + \sum_{i=1}^{k} w_i x_i\Big), \quad \text{where } \mathrm{sigmoid}(z) = \frac{e^z}{1 + e^z}$$
- This is a logistic CPD (logistic regression): not defined by a table but induced by parameters. It is efficient: linear (not exponential, as in a full BN) in the number of parents
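
A quick numerical check of this equivalence, with arbitrarily assumed weights $w_0, \ldots, w_k$ (a sketch, not part of the slides): normalizing the two unnormalized scores reproduces the sigmoid value exactly.

```python
# Sketch: normalizing the naive Markov potentials exp{w0 + sum_i w_i x_i} against
# exp{0} reproduces the logistic CPD. Weights and input are assumed values.
import numpy as np

def sigmoid(z):
    return np.exp(z) / (1.0 + np.exp(z))

w0, w = -1.0, np.array([2.0, -0.5, 1.5])      # assumed parameters
x = np.array([1, 0, 1])

score_y1 = np.exp(w0 + w @ x)                 # unnormalized tilde P(Y=1 | x)
score_y0 = np.exp(0.0)                        # unnormalized tilde P(Y=0 | x) = 1
p_y1 = score_y1 / (score_y1 + score_y0)

assert np.isclose(p_y1, sigmoid(w0 + w @ x))  # same value as the logistic CPD
print(p_y1)
```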

Naïve Markov and Naïve Bayes
- Binary variables $X = \{X_1, \ldots, X_k\}$ and $Y$
- Logistic regression is the conditional analog of the naïve Bayes classifier
- Logistic regression (discriminative model, parameters $w_0, w_1, \ldots, w_k$):
  $$P(Y = 1 \mid x_1, \ldots, x_k) = \mathrm{sigmoid}\Big(w_0 + \sum_{i=1}^{k} w_i x_i\Big)$$
- Naïve Bayes (generative model, CPDs $P(Y)$ and $P(X_i \mid Y)$):
  $$P(Y, X_1, \ldots, X_k) = P(Y) \prod_{i=1}^{k} P(X_i \mid Y), \qquad P(Y = 1 \mid X_1, \ldots, X_k) = \frac{P(Y = 1, X_1, \ldots, X_k)}{\sum_{y} P(Y = y, X_1, \ldots, X_k)}$$
- In the generative model we first have to obtain the $k$ CPDs conditioned on the unknown $Y$, from which we can then get the distribution conditioned on the known $X$

Logistic Regression Revisited
- Input $X$, target classes $Y = 0$ and $Y = 1$
- The posterior probability of $Y = 1$ is $P(Y = 1 \mid X) = y(X) = \sigma(w^T X)$, where $X$ is an $M$-dimensional feature vector and $\sigma(\cdot)$ is the logistic sigmoid function
- Goal: determine the $M$ parameters
- Known as logistic regression in statistics, although it is a model for classification rather than regression
- Properties of the logistic sigmoid (figure: plot of $\sigma(a)$ versus $a$):
  A. Symmetry: $\sigma(-a) = 1 - \sigma(a)$
  B. Inverse: $a = \ln(\sigma / (1 - \sigma))$, known as the logit; also known as the log odds, since it is the ratio $\ln[P(Y = 1 \mid x) / P(Y = 0 \mid x)]$
  C. Derivative: $d\sigma/da = \sigma(1 - \sigma)$

Determining Logistic Regression Parameters
- Maximum likelihood approach for two classes
- The data set consists of (input, target) pairs $(X_n, t_n)$, where $t_n \in \{0, 1\}$, $n = 1, \ldots, N$
- Since $t$ is binary, we can use the Bernoulli distribution for it: $p(t \mid w) = y^t (1 - y)^{1 - t}$, where $y = \sigma(w^T X)$
- The likelihood function associated with the $N$ observations is
  $$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
  where $\mathbf{t} = (t_1, \ldots, t_N)^T$ and $y_n = p(Y = 1 \mid X_n)$

Error Function for Logistic Regression
- Likelihood function:
  $$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
- The error function is the negative of the log-likelihood:
  $$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}$$
- Known as the cross-entropy error function

Gradient of the Error Function
- Error function:
  $$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}, \quad \text{where } y_n = \sigma(w^T X_n)$$
- Using the derivative of the logistic sigmoid, $d\sigma/da = \sigma(1 - \sigma)$, the gradient of the error function is
  $$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n \qquad \text{(error} \times \text{feature vector)}$$
- The contribution to the gradient from data point $n$ is the error between target $t_n$ and prediction $y_n = \sigma(w^T X_n)$, times the input $X_n$
- Analytical derivation: let $z = z_1 + z_2$ with $z_1 = t \ln \sigma(w^T X)$ and $z_2 = (1 - t)\ln[1 - \sigma(w^T X)]$. Then
  $$\frac{dz_1}{dw} = \frac{t \, \sigma(w^T X)[1 - \sigma(w^T X)]}{\sigma(w^T X)} X = t[1 - \sigma(w^T X)] X, \qquad \frac{dz_2}{dw} = -\frac{(1 - t)\, \sigma(w^T X)[1 - \sigma(w^T X)]}{1 - \sigma(w^T X)} X = -(1 - t)\, \sigma(w^T X) X$$
  so $-\dfrac{dz}{dw} = (\sigma(w^T X) - t)\, X = (y - t)\, X$
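
As a sanity check of the gradient formula, the following sketch (synthetic data and assumed shapes, not from the slides) compares $\sum_n (y_n - t_n) X_n$ against a finite-difference approximation of $E(w)$:

```python
# Sketch: verify grad E(w) = sum_n (y_n - t_n) X_n against finite differences
# on synthetic data (all values assumed).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # N=20 inputs, M=3 features
t = (rng.random(20) < 0.5).astype(float)      # binary targets
w = rng.normal(size=3)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def E(w):
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

grad = X.T @ (sigmoid(X @ w) - t)             # (y - t) times feature vectors

eps = 1e-6
num = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(grad, num, atol=1e-4))      # True: formula matches finite differences
```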

Simple Sequential Algorithm
- There is no closed-form maximum likelihood solution for determining $w$
- Given the gradient of the error function $\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n$, solve using an iterative approach:
  $$w^{\tau + 1} = w^{\tau} - \eta \nabla E_n, \qquad \nabla E_n = (y_n - t_n) X_n \quad \text{(error} \times \text{feature vector)}$$
- The solution has severe over-fitting problems for linearly separable data, so the IRLS algorithm is used instead
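
A minimal sketch of this sequential update on synthetic data; the learning rate $\eta$, epoch count, and data-generating weights are arbitrary illustrative choices:

```python
# Sketch of the sequential update w <- w - eta * (y_n - t_n) X_n on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])           # assumed data-generating weights
t = (1.0 / (1.0 + np.exp(-(X @ true_w))) > rng.random(100)).astype(float)

w, eta = np.zeros(3), 0.1
for epoch in range(50):
    for n in rng.permutation(len(X)):         # visit data points in random order
        y_n = 1.0 / (1.0 + np.exp(-(w @ X[n])))
        w -= eta * (y_n - t[n]) * X[n]        # gradient of E_n = (y_n - t_n) X_n
print(w)                                      # should point roughly along true_w
```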

Multi-class Logistic Regression
- Work with the softmax function instead of the logistic sigmoid:
  $$p(Y = k \mid X) = y_k(X) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = w_k^T X$$
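
A small sketch of the softmax posterior for assumed weight vectors $w_k$; the max-subtraction line is a standard numerical-stability trick and does not change the probabilities:

```python
# Sketch of the softmax posterior p(Y=k|X) = exp(a_k) / sum_j exp(a_j), a_k = w_k^T X.
import numpy as np

def softmax_posterior(W, x):
    a = W @ x                                  # activations a_k = w_k^T x
    a -= a.max()                               # stabilizer; does not change the ratio
    e = np.exp(a)
    return e / e.sum()

W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])   # K=3 classes, M=2 features (assumed)
x = np.array([0.4, 1.1])
p = softmax_posterior(W, x)
print(p, p.sum())                              # probabilities summing to 1
```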

Multi-class Likelihood Function
- 1-of-$K$ coding scheme: for feature vector $X_n$, the target vector $t_n$ for class $Y = k$ is a binary vector with all elements zero except element $k$
- Likelihood:
  $$p(T \mid w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k \mid X_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}, \qquad y_{nk} = y_k(X_n)$$
- $T$ is an $N \times K$ matrix with elements $t_{nk}$

Multi-class Error Function
1. Error function: the negative log-likelihood
   $$E(w_1, \ldots, w_K) = -\ln p(T \mid w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$
   known as the cross-entropy error function for multi-class
2. Gradient of the error function with respect to one parameter vector $w_j$:
   $$\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) X_n \qquad \text{(error} \times \text{feature vector)}$$
   using the derivatives of the softmax $y_k(X) = \exp(a_k)/\sum_j \exp(a_j)$, $a_k = w_k^T X$: $\partial y_k / \partial a_j = y_k (I_{kj} - y_j)$, where $I_{kj}$ are elements of the identity matrix
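
The per-class gradients can be stacked and computed in one matrix product; the sketch below uses synthetic data and 1-of-$K$ coded targets (all values assumed):

```python
# Sketch: multi-class gradient grad_{w_j} E = sum_n (y_nj - t_nj) X_n for all
# classes at once, as (Y - T)^T X, on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 30, 4, 3
X = rng.normal(size=(N, M))
T = np.eye(K)[rng.integers(0, K, size=N)]      # 1-of-K coded targets, N x K
W = rng.normal(size=(K, M))

A = X @ W.T                                    # activations, N x K
Y = np.exp(A - A.max(axis=1, keepdims=True))
Y /= Y.sum(axis=1, keepdims=True)              # softmax outputs y_nk

grad = (Y - T).T @ X                           # row j holds the gradient wrt w_j (K x M)
print(grad.shape)
```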

IRLS Algorithm for Multi-class
3. The Hessian matrix comprises blocks of size $M \times M$; block $j, k$ is given by
   $$\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) X_n X_n^T$$
   The Hessian matrix is positive definite, therefore the error function has a unique minimum
4. A batch algorithm based on Newton-Raphson is used

Iterative Reweighted Least Squares (IRLS)
- Efficient approximation using Newton-Raphson iterative optimization:
  $$w^{(\text{new})} = w^{(\text{old})} - H^{-1} \nabla E(w)$$
  where $H$ is the Hessian matrix, whose elements are the second derivatives of $E(w)$ with respect to the components of $w$
- Newton's method: since we are solving for a zero of the derivative of $E(w)$, we need the second derivative

IRLS Steps
- IRLS is applicable to both linear regression and logistic regression
- We discuss logistic regression, for which we need:
  1. The error function $E(w)$ (for logistic regression, from the Bernoulli likelihood function)
  2. The gradient $\nabla E(w)$
  3. The Hessian $H = \nabla \nabla E(w)$
  4. The Newton-Raphson update $w^{(\text{new})} = w^{(\text{old})} - H^{-1} \nabla E(w)$

IRLS for Logistic Regression
- The posterior probability of class $Y = 1$ is $p(Y = 1 \mid X) = y(X) = \sigma(w^T X)$
- Likelihood function for the data set $\{X_n, t_n\}$, $t_n \in \{0, 1\}$:
  $$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
1. Error function: the negative log-likelihood yields the cross-entropy
   $$E(w) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}$$

IRLS for Logistic Regression
2. Gradient of the error function:
   $$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n = X^T (\mathbf{y} - \mathbf{t})$$
3. Hessian:
   $$H = \nabla \nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n) X_n X_n^T = X^T R X$$
   where $R$ is an $N \times N$ diagonal matrix with elements $R_{nn} = y_n(1 - y_n) = \sigma(w^T X_n)[1 - \sigma(w^T X_n)]$
- The Hessian is not constant: it depends on $w$ through $R$
- Since $H$ is positive definite (i.e., $u^T H u > 0$ for arbitrary $u$), the error function is a convex function of $w$ and so has a unique minimum
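
A short sketch (synthetic data, assumed weights) of forming $H = X^T R X$ and checking that it is positive definite for that data set:

```python
# Sketch: the Hessian H = X^T R X with R_nn = y_n (1 - y_n), plus a positive-
# definiteness check on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
w = rng.normal(size=3)

y = 1.0 / (1.0 + np.exp(-(X @ w)))
R = np.diag(y * (1.0 - y))                     # N x N diagonal weighting matrix
H = X.T @ R @ X                                # M x M Hessian

print(np.all(np.linalg.eigvalsh(H) > 0))       # True: positive definite here
```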

IRLS for Logistic Regression
4. Newton-Raphson update:
   $$w^{(\text{new})} = w^{(\text{old})} - H^{-1} \nabla E(w)$$
   Substituting $H = X^T R X$ and $\nabla E(w) = X^T(\mathbf{y} - \mathbf{t})$:
   $$w^{(\text{new})} = w^{(\text{old})} - (X^T R X)^{-1} X^T (\mathbf{y} - \mathbf{t}) = (X^T R X)^{-1}\{X^T R X\, w^{(\text{old})} - X^T (\mathbf{y} - \mathbf{t})\} = (X^T R X)^{-1} X^T R\, \mathbf{z}$$
   where $\mathbf{z}$ is an $N$-dimensional vector with $\mathbf{z} = X w^{(\text{old})} - R^{-1}(\mathbf{y} - \mathbf{t})$
- The update formula is a set of weighted normal equations
- Since the Hessian depends on $w$, apply the equations iteratively, each time using the new weight vector
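
A minimal IRLS sketch on synthetic data, written as the equivalent Newton step $w - H^{-1}\nabla E(w)$ rather than by explicitly forming $\mathbf{z}$ (algebraically the same as the weighted normal equations above); data, initialization, and iteration count are assumptions:

```python
# Sketch of IRLS for logistic regression: iterate the Newton step
# w <- w - (X^T R X)^{-1} X^T (y - t), which equals (X^T R X)^{-1} X^T R z.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])            # assumed data-generating weights
t = (1.0 / (1.0 + np.exp(-(X @ true_w))) > rng.random(200)).astype(float)

w = np.zeros(3)
for _ in range(10):
    y = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (y - t)                       # X^T (y - t)
    R = y * (1.0 - y)                          # diagonal of R as a vector
    H = X.T @ (R[:, None] * X)                 # X^T R X without forming an N x N matrix
    w = w - np.linalg.solve(H, grad)           # Newton-Raphson / IRLS step
print(w)                                       # close to the maximum likelihood solution
```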