Machine Learning

Classification & Discriminative Learning
Structured output, structured input, discriminative function, joint input-output features, likelihood maximization, logistic regression (binary & multi-class case), conditional random fields

Marc Toussaint
University of Stuttgart
Summer 2015
Discriminative Function

Represent a discrete-valued function F : R^n → Y via a discriminative function
  f : R^n × Y → R
such that
  F : x ↦ argmax_y f(x, y)

A discriminative function f(x, y) maps an input x to an output
  ŷ(x) = argmax_y f(x, y)

A discriminative function f(x, y) has high value if y is a correct answer to x; and low value if y is a false answer.

In that way a discriminative function e.g. discriminates correct sequence/image/graph-labellings from wrong ones.

2/35
Example Discriminative Function

Input: x ∈ R^2; output y ∈ {1, 2, 3}

(figure: three panels displaying p(y=1|x), p(y=2|x), p(y=3|x) over the input plane, with contour lines at 0.1, 0.5, 0.9)

(here already scaled to the interval [0,1]... explained later)

3/35
How could we parameterize a discriminative function?

Well, linear in features!
  f(x, y) = Σ_{j=1}^k φ_j(x, y) β_j = φ(x, y)^T β

Example: Let x ∈ R and y ∈ {1, 2, 3}. Typical features might be
  φ(x, y) = ( [y=1] x,  [y=2] x,  [y=3] x,  [y=1] x²,  [y=2] x²,  [y=3] x² )^T

Example: Let x, y ∈ {0, 1} both be discrete. Features might be
  φ(x, y) = ( [x=0][y=0],  [x=0][y=1],  [x=1][y=0],  [x=1][y=1] )^T

4/35
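As a concrete illustration (not part of the original slides), here is a minimal numpy sketch of such joint input-output features for the first example, x ∈ R and y ∈ {1, 2, 3}; the function names phi and f are my own choices.

import numpy as np

def phi(x, y):
    """Joint input-output features for x in R, y in {1, 2, 3}:
    indicator [y = c] multiplied with the input features (x, x^2)."""
    feats = np.zeros(6)
    for c in (1, 2, 3):
        if y == c:
            feats[c - 1] = x          # [y = c] * x
            feats[3 + c - 1] = x**2   # [y = c] * x^2
    return feats

def f(x, y, beta):
    """Discriminative function f(x, y) = phi(x, y)^T beta."""
    return phi(x, y) @ beta

# prediction: y_hat(x) = argmax_y f(x, y)
beta = np.random.randn(6)
x = 0.7
y_hat = max((1, 2, 3), key=lambda y: f(x, y, beta))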
more intuition...

Features "connect" input and output. Each φ_j(x, y) allows f to capture a certain dependence between x and y.

If both x and y are discrete, a feature φ_j(x, y) is typically a joint indicator function (logical function), indicating a certain "event".

Each weight β_j mirrors how important/frequent/infrequent a certain dependence described by φ_j(x, y) is.

f(x, y) is also called energy, and the following methods are also called energy-based modelling, esp. in neural modelling.

5/35
In the remainder:
– Logistic regression: binary case
– Multi-class case
– Preliminary comments on the general structured output case (Conditional Random Fields)

6/35
Logistic regression: Binary case 7/35
Binary classification example

(figure: training data and decision boundary; MT/plot.h -> gnuplot pipe)

Input x ∈ R^2
Output y ∈ {0, 1}

Example shows RBF Ridge Logistic Regression

8/35
A loss function for classification

Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {0, 1}

Bad idea: Squared error regression (See also Hastie 4.2)

Maximum likelihood:
We interpret the discriminative function f(x, y) as defining class probabilities
  p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}
p(y | x) should be high for the correct class, and low otherwise.

For each (x_i, y_i) we want to maximize the likelihood p(y_i | x_i):
  L^{neg-log-likelihood}(β) = − Σ_{i=1}^n log p(y_i | x_i)

9/35
Logistic regression

In the binary case, we have "two" functions f(x, 0) and f(x, 1). W.l.o.g. we may fix f(x, 0) = 0 to zero.

Therefore we choose features
  φ(x, y) = [y = 1] φ(x)
with arbitrary input features φ(x) ∈ R^k.

We have
  ŷ = argmax_y f(x, y) = 1 if φ(x)^T β > 0, else 0
and conditional class probabilities
  p(1 | x) = e^{f(x,1)} / (e^{f(x,0)} + e^{f(x,1)}) = σ(f(x, 1))
with the logistic sigmoid function σ(z) = e^z / (1 + e^z) = 1 / (e^{−z} + 1).

(figure: plot of the sigmoid exp(x)/(1+exp(x)))

Given data D = {(x_i, y_i)}_{i=1}^n, we minimize
  L^{logistic}(β) = − Σ_{i=1}^n log p(y_i | x_i) + λ ||β||²
                  = − Σ_{i=1}^n [ y_i log p(1 | x_i) + (1 − y_i) log[1 − p(1 | x_i)] ] + λ ||β||²

10/35
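A small numpy sketch of these definitions (my own illustration, not from the slides), with the input features φ(x_i) stacked as rows of a matrix Phi; names like neg_log_likelihood are assumptions, and the eps term is only for numerical safety.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def class_prob(Phi, beta):
    """p(y=1 | x_i) = sigma(phi(x_i)^T beta) for all rows of Phi."""
    return sigmoid(Phi @ beta)

def neg_log_likelihood(Phi, y, beta, lam):
    """L^logistic(beta) = -sum_i [y_i log p_i + (1-y_i) log(1-p_i)] + lam * ||beta||^2"""
    p = class_prob(Phi, beta)
    eps = 1e-12  # numerical safety, not part of the slide formula
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)) + lam * beta @ beta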
Optimal parameters β

Gradient (see exercises):
  ∂L^{logistic}(β)/∂β = Σ_{i=1}^n (p_i − y_i) φ(x_i) + 2λIβ = X^T (p − y) + 2λIβ
with p_i := p(1 | x_i) and X = (φ(x_1), .., φ(x_n))^T ∈ R^{n×k}

∂L^{logistic}(β)/∂β is non-linear in β (it enters also the calculation of p_i)
→ does not have an analytic solution

Newton algorithm: iterate
  β ← β − H^{-1} ∂L^{logistic}(β)/∂β
with Hessian
  H = ∂²L^{logistic}(β)/∂β² = X^T W X + 2λI
W diagonal with W_ii = p_i (1 − p_i)

11/35
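A hedged numpy sketch of this Newton iteration (my own illustration; function and variable names are assumptions, and a fixed iteration count stands in for a proper convergence test):

import numpy as np

def fit_logistic_newton(Phi, y, lam=1e-2, iters=20):
    """Newton iterates beta <- beta - H^{-1} dL/dbeta for ridge-regularized
    logistic regression, with H = Phi^T W Phi + 2*lam*I."""
    n, k = Phi.shape
    beta = np.zeros(k)
    I = np.eye(k)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Phi @ beta))        # p_i = p(1 | x_i)
        grad = Phi.T @ (p - y) + 2 * lam * beta      # X^T (p - y) + 2*lam*I*beta
        W = p * (1 - p)                              # diagonal of W
        H = Phi.T @ (Phi * W[:, None]) + 2 * lam * I
        beta = beta - np.linalg.solve(H, grad)
    return beta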
RBF ridge logistic regression:

(figure: training data, decision boundary, and p(1|x) surface; MT/plot.h -> gnuplot pipe)

./x.exe -mode 2 -modelfeaturetype 4 -lambda 1e+0 -rbfbias 0 -rbfwidth .2

12/35
polynomial (cubic) logistic regression:

(figure: training data, decision boundary, and p(1|x) surface; MT/plot.h -> gnuplot pipe)

./x.exe -mode 2 -modelfeaturetype 3 -lambda 1e+0

13/35
Recap

Regression:
  parameters β
  predictive function f(x) = φ(x)^T β
  least squares loss L^{ls}(β) = Σ_{i=1}^n (y_i − f(x_i))²

Classification:
  parameters β
  discriminative function f(x, y) = φ(x, y)^T β
  class probabilities p(y | x) ∝ e^{f(x,y)}
  neg-log-likelihood L^{neg-log-likelihood}(β) = − Σ_{i=1}^n log p(y_i | x_i)

14/35
Logistic regression: Multi-class case 15/35
Logistic regression: Multi-class case

Data D = {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {1, .., M}

We choose f(x, y) = φ(x, y)^T β with
  φ(x, y) = ( [y=1] φ(x),  [y=2] φ(x),  ..,  [y=M] φ(x) )^T
where φ(x) are arbitrary features. We have M (or M−1) parameter vectors β.

Conditional class probabilities
  p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')}
(optionally we may set f(x, M) = 0 and drop the last entry)
  f(x, y) = log [ p(y | x) / p(y=M | x) ]   (the discriminative functions model "log-ratios")

Given data D = {(x_i, y_i)}_{i=1}^n, we minimize
  L^{logistic}(β) = − Σ_{i=1}^n log p(y = y_i | x_i) + λ ||β||²

16/35
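A small numpy sketch of this construction (my own illustration; names are assumptions): the class-stacked features and the resulting softmax class probabilities.

import numpy as np

def phi_joint(phi_x, y, M):
    """Stack phi(x) into the block belonging to class y: phi(x,y) in R^{M*k}."""
    k = phi_x.shape[0]
    out = np.zeros(M * k)
    out[(y - 1) * k : y * k] = phi_x       # [y = c] phi(x) for the block c = y
    return out

def class_probs(phi_x, beta, M):
    """p(y|x) = exp(f(x,y)) / sum_y' exp(f(x,y')) with f(x,y) = phi(x,y)^T beta."""
    f = np.array([phi_joint(phi_x, y, M) @ beta for y in range(1, M + 1)])
    f = f - f.max()                        # subtract max for numerical stability
    e = np.exp(f)
    return e / e.sum()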
Optimal parameters β

Gradient:
  ∂L^{logistic}(β)/∂β_c = Σ_{i=1}^n (p_ic − y_ic) φ(x_i) + 2λIβ_c = X^T (p_c − y_c) + 2λIβ_c
with p_ic = p(y = c | x_i)

Hessian:
  H = ∂²L^{logistic}(β)/∂β_c ∂β_d = X^T W_cd X + 2[c = d] λI
W_cd diagonal with W_cd,ii = p_ic ([c = d] − p_id)

17/35
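For intuition, here is a minimal sketch (my own, with assumed names) of the per-class gradient blocks X^T(p_c − y_c) + 2λβ_c, with y_ic the one-hot encoding [y_i = c]:

import numpy as np

def multiclass_gradient(Phi, y, beta_blocks, lam):
    """Gradient blocks dL/dbeta_c = Phi^T (p_c - y_c) + 2*lam*beta_c.
    Phi: (n, k) input features; y: labels in {1,..,M}; beta_blocks: (M, k)."""
    M, k = beta_blocks.shape
    F = Phi @ beta_blocks.T                       # f(x_i, c) for all i, c
    P = np.exp(F - F.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)             # p_ic = p(y=c | x_i)
    Y = np.eye(M)[y - 1]                          # one-hot targets y_ic = [y_i = c]
    return (Phi.T @ (P - Y)).T + 2 * lam * beta_blocks   # row c is the gradient w.r.t. beta_c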
polynomial (quadratic) ridge 3-class logistic regression:

(figure: training data, p=0.5 decision boundaries, and class-probability contours at 0.1, 0.5, 0.9; MT/plot.h -> gnuplot pipe)

./x.exe -mode 3 -modelfeaturetype 3 -lambda 1e+1

18/35
Structured Output & Structured Input 19/35
Structured Output & Structured Input

regression:
  R^n → R

structured output:
  R^n → binary class label {0, 1}
  R^n → integer class label {1, 2, .., M}
  R^n → sequence labelling y_{1:T}
  R^n → image labelling y_{1:W,1:H}
  R^n → graph labelling y_{1:N}

structured input:
  relational database → R
  labelled graph/sequence → R

20/35
Examples for Structured Output

Text tagging
  X = sentence
  Y = tagging of each word
  http://sourceforge.net/projects/crftagger

Image segmentation
  X = image
  Y = labelling of each pixel
  http://scholar.google.com/scholar?cluster=344770229904273582

Depth estimation
  X = single image
  Y = depth map
  http://make3d.cs.cornell.edu/

21/35
CRFs in image processing 22/35
CRFs in image processing

Google "conditional random field image":
– Multiscale Conditional Random Fields for Image Labeling (CVPR 2004)
– Scale-Invariant Contour Completion Using Conditional Random Fields (ICCV 2005)
– Conditional Random Fields for Object Recognition (NIPS 2004)
– Image Modeling using Tree Structured Conditional Random Fields (IJCAI 2007)
– A Conditional Random Field Model for Video Super-resolution (ICPR 2006)

23/35
Conditional Random Fields 24/35
Conditional Random Fields (CRFs)

CRFs are a generalization of logistic binary and multi-class classification.

The output y may be an arbitrary (usually discrete) thing (e.g., sequence/image/graph-labelling).

Hopefully we can efficiently compute
  argmax_y f(x, y)
over the output! → f(x, y) should be structured in y so this optimization is efficient.

The name CRF describes that p(y | x) ∝ e^{f(x,y)} defines a probability distribution (a.k.a. random field) over the output y conditional to the input x. The word "field" usually means that this distribution is structured (a graphical model; see later part of lecture).

25/35
CRFs: Core equations

  f(x, y) = φ(x, y)^T β
  p(y | x) = e^{f(x,y)} / Σ_{y'} e^{f(x,y')} = e^{f(x,y) − Z(x,β)}
  Z(x, β) = log Σ_{y'} e^{f(x,y')}   (log partition function)
  L(β) = − Σ_i log p(y_i | x_i) = − Σ_i [ f(x_i, y_i) − Z(x_i, β) ]
  ∇Z(x, β) = Σ_y p(y | x) ∇f(x, y)
  ∇²Z(x, β) = Σ_y p(y | x) ∇f(x, y) ∇f(x, y)^T − ∇Z ∇Z^T

This gives the neg-log-likelihood L(β), its gradient and Hessian.

26/35
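When the output space is small enough to enumerate, these equations can be checked directly; here is a brute-force numpy sketch (my own illustration, not an efficient CRF implementation; phi and ys are assumed to be supplied by the caller):

import numpy as np

def crf_quantities(phi, x, ys, beta):
    """Brute-force evaluation of the CRF core equations for an enumerable
    output space ys. phi(x, y) returns the joint feature vector."""
    F = np.array([phi(x, y) @ beta for y in ys])        # f(x, y) for all y
    Z = np.log(np.sum(np.exp(F - F.max()))) + F.max()   # log partition function
    p = np.exp(F - Z)                                    # p(y | x)
    Phi = np.array([phi(x, y) for y in ys])
    gradZ = p @ Phi                                      # sum_y p(y|x) grad f(x,y)
    return Z, p, gradZ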
Training CRFs

Maximize conditional likelihood. But the Hessian is typically too large (images: 10 000 pixels, 50 000 features). If f(x, y) has a chain structure over y, the Hessian is usually banded → computation time linear in chain length.

Alternative: Efficient gradient method, e.g.:
Vishwanathan et al.: Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods

Other loss variants, e.g., hinge loss as with Support Vector Machines ("Structured output SVMs")

Perceptron algorithm: Minimizes hinge loss using a gradient method

27/35
CRFs: the structure is in the features

Assume y = (y_1, .., y_l) is a tuple of individual (local) discrete labels.

We can assume that f(x, y) is linear in features
  f(x, y) = Σ_{j=1}^k φ_j(x, y_j) β_j = φ(x, y)^T β
where each feature φ_j(x, y_j) depends only on a subset y_j of labels. φ_j(x, y_j) effectively couples the labels y_j. Then e^{f(x,y)} is a factor graph.

28/35
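As a toy illustration (my own sketch, not from the slides), here is f(x, y) for a chain of labels built from unary and pairwise features; the specific feature choice is an assumption:

import numpy as np

def chain_score(x, y, unary_w, pair_w):
    """f(x, y) = sum_t unary(x_t, y_t) + sum_t pair(y_t, y_{t+1})
    for a label chain y = (y_1, .., y_T); each feature couples only a
    small subset of labels, so exp(f) factorizes as a factor graph.
    x: (T, d) local observations, y: length-T sequence of labels in {0,..,L-1},
    unary_w: (L, d) weights per label, pair_w: (L, L) transition weights."""
    score = 0.0
    T = len(y)
    for t in range(T):
        score += unary_w[y[t]] @ x[t]          # feature coupling x_t and y_t
        if t + 1 < T:
            score += pair_w[y[t], y[t + 1]]    # feature coupling y_t and y_{t+1}
    return score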
Example: pair-wise coupled pixel labels

(figure: factor graph of an image x coupled to a grid of pixel labels y_11, .., y_WH)

Each black box corresponds to features φ_j(y_j) which couple neighboring pixel labels y_j.

Each gray box corresponds to features φ_j(x_j, y_j) which couple a local pixel observation x_j with a pixel label y_j.

29/35
Kernel Ridge Regression — the "Kernel Trick"

Reconsider the solution of Ridge regression (using the Woodbury identity):
  β̂^{ridge} = (X^T X + λI_k)^{-1} X^T y = X^T (XX^T + λI_n)^{-1} y

Recall X = (φ(x_1), .., φ(x_n))^T ∈ R^{n×k}, then:
  f^{ridge}(x) = φ(x)^T β^{ridge} = φ(x)^T X^T (XX^T + λI)^{-1} y = κ(x)^T (K + λI)^{-1} y
with φ(x)^T X^T =: κ(x)^T and XX^T =: K

K is called kernel matrix and has elements K_ij = k(x_i, x_j) := φ(x_i)^T φ(x_j)
κ is the vector: κ(x)^T = φ(x)^T X^T = k(x, x_{1:n})
The kernel function k(x, x') calculates the scalar product in feature space.

30/35
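A compact numpy sketch of kernel ridge regression in this dual form (my own illustration; the RBF kernel choice and the function names are assumptions):

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2) for all pairs of rows of A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_ridge_predict(X_train, y_train, X_test, lam=1e-2, gamma=1.0):
    """f(x) = kappa(x)^T (K + lam*I)^{-1} y, never touching features or beta."""
    K = rbf_kernel(X_train, X_train, gamma)             # K_ij = k(x_i, x_j)
    alpha = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    kappa = rbf_kernel(X_test, X_train, gamma)          # kappa_i(x) = k(x, x_i)
    return kappa @ alpha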
The Kernel Trick

We can rewrite kernel ridge regression as:
  f^{ridge}(x) = κ(x)^T (K + λI)^{-1} y
with K_ij = k(x_i, x_j), κ_i(x) = k(x, x_i)

→ at no place we actually need to compute the parameters β̂
→ at no place we actually need to compute the features φ(x_i)
→ we only need to be able to compute k(x, x') for any x, x'

This rewriting is called kernel trick. It has great implications:
– Instead of inventing funny non-linear features, we may directly invent funny kernels
– Inventing a kernel is intuitive: k(x, x') expresses how correlated y and y' should be: it is a measure of similarity, it compares x and x'. Specifying how comparable x and x' are is often more intuitive than defining features that might work.

31/35
Every choice of features implies a kernel.

But does every choice of kernel correspond to a specific choice of features?

32/35
Reproducing Kernel Hilbert Space

Let's define a vector space H_k, spanned by infinitely many basis elements
  {φ_x = k(·, x) : x ∈ R^d}
Vectors in this space are linear combinations of such basis elements, e.g.,
  f = Σ_i α_i φ_{x_i},   f(x) = Σ_i α_i k(x, x_i)

Let's define a scalar product in this space, by first defining the scalar product for every basis element,
  ⟨φ_x, φ_y⟩ := k(x, y)
This is positive definite. Note, it follows
  ⟨φ_x, f⟩ = Σ_i α_i ⟨φ_x, φ_{x_i}⟩ = Σ_i α_i k(x, x_i) = f(x)

The φ_x = k(·, x) is the "feature" we associate with x. Note that this is a function and infinite dimensional.

Choosing α = (K + λI)^{-1} y represents f^{ridge}(x) = Σ_{i=1}^n α_i k(x, x_i) = κ(x)^T α, and shows that ridge regression has a finite-dimensional solution in the basis elements {φ_{x_i}}. A more general version of this insight is called representer theorem.

33/35
Example Kernels

Kernel functions need to be positive definite: ∀ z: |z| > 0: z^T K z > 0, i.e., K is a positive definite matrix.

Examples:
– Polynomial: k(x, x') = (x^T x')^d
  Let's verify for d = 2, φ(x) = (x_1², √2 x_1 x_2, x_2²)^T:
    k(x, x') = ((x_1, x_2)(x'_1, x'_2)^T)² = (x_1 x'_1 + x_2 x'_2)²
             = x_1² x'_1² + 2 x_1 x_2 x'_1 x'_2 + x_2² x'_2²
             = (x_1², √2 x_1 x_2, x_2²)(x'_1², √2 x'_1 x'_2, x'_2²)^T = φ(x)^T φ(x')
– Squared exponential (radial basis function): k(x, x') = exp(−γ |x − x'|²)

34/35
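A quick numerical check of the polynomial identity, plus the squared exponential kernel (my own sketch; function names are assumptions):

import numpy as np

def poly2_kernel(x, xp):
    """Polynomial kernel k(x, x') = (x^T x')^2."""
    return (x @ xp) ** 2

def poly2_features(x):
    """phi(x) = (x1^2, sqrt(2) x1 x2, x2^2) for x in R^2."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def rbf(x, xp, gamma=1.0):
    """Squared exponential kernel k(x, x') = exp(-gamma * |x - x'|^2)."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
# k(x, x') and phi(x)^T phi(x') agree up to numerical precision
assert np.isclose(poly2_kernel(x, xp), poly2_features(x) @ poly2_features(xp))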
Kernel Logistic Regression

For logistic regression we compute β using the Newton iterates
  β ← β − (X^T W X + 2λI)^{-1} [X^T (p − y) + 2λβ]                       (1)
    = (X^T W X + 2λI)^{-1} X^T [W Xβ − (p − y)]                          (2)

Using the Woodbury identity we can rewrite this as
  (X^T W X + A)^{-1} X^T W = A^{-1} X^T (X A^{-1} X^T + W^{-1})^{-1}      (3)
  β ← (1/2λ) X^T (X (1/2λ) X^T + W^{-1})^{-1} W^{-1} [W Xβ − (p − y)]     (4)
    = X^T (XX^T + 2λ W^{-1})^{-1} [Xβ − W^{-1}(p − y)]                   (5)

We can now compute the discriminative function values f_X = Xβ ∈ R^n at the training points by iterating over those instead of β:
  f_X ← X X^T (XX^T + 2λ W^{-1})^{-1} [Xβ − W^{-1}(p − y)]               (6)
      = K (K + 2λ W^{-1})^{-1} [f_X − W^{-1}(p_X − y)]                   (7)
Note that p_X on the RHS also depends on f_X. Given f_X we can compute the discriminative function values f_Z = Zβ ∈ R^m for a set of m query points Z using
  f_Z ← κ (K + 2λ W^{-1})^{-1} [f_X − W^{-1}(p_X − y)],   κ = Z X^T       (8)

35/35
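A rough numpy sketch of iterating equation (7) directly in f_X and then predicting with (8) (my own illustration; the kernel matrices are assumed to be given, the clipping of W and the fixed iteration count are pragmatic assumptions, not part of the slides):

import numpy as np

def kernel_logistic_fit_predict(K, y, K_query, lam=1e-2, iters=30):
    """Iterate f_X <- K (K + 2*lam*W^-1)^-1 [f_X - W^-1 (p_X - y)], eq. (7),
    then predict f_Z with eq. (8). K: (n, n) kernel matrix on training points,
    K_query: (m, n) kernel between query and training points."""
    n = len(y)
    f_X = np.zeros(n)
    for _ in range(iters):
        p_X = 1.0 / (1.0 + np.exp(-f_X))           # p_X depends on the current f_X
        w = np.clip(p_X * (1 - p_X), 1e-6, None)   # diagonal of W, kept away from 0
        rhs = f_X - (p_X - y) / w                  # f_X - W^-1 (p_X - y)
        A = K + 2 * lam * np.diag(1.0 / w)         # K + 2*lam*W^-1
        f_X = K @ np.linalg.solve(A, rhs)
    p_X = 1.0 / (1.0 + np.exp(-f_X))
    w = np.clip(p_X * (1 - p_X), 1e-6, None)
    A = K + 2 * lam * np.diag(1.0 / w)
    f_Z = K_query @ np.linalg.solve(A, f_X - (p_X - y) / w)   # eq. (8)
    return f_X, f_Z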