Lecture 5: Linear Models for Classification, Logistic Regression, Gradient Descent, and Second-Order Methods

COMP-652 and ECSE-608, September 18, 2017

Outline:
- Linear models for classification
- Logistic regression
- Gradient descent and second-order methods
- Kernelizing linear methods

Regression vs classification

- Regression: map $x \in \mathbb{R}^n$ to $y \in \mathbb{R}$.
- Classification: map $x \in \mathbb{R}^n$ to a label $y \in \mathcal{Y} = \{1, \dots, k\}$ ($k$ classes). The input space is divided into decision regions.
- Linear classification: the decision surfaces are linear (or affine) in $x$.
- How to represent the output? If $k = 2$, $\mathcal{Y} = \{-1, 1\}$ can be convenient; one-hot encoding is another option.
- We will mainly talk about binary classification, i.e. $k = 2$.

Linear models for classification

- Hyperplane in $\mathbb{R}^n$: defined by the equation $w^\top x = 0$ for some $w \in \mathbb{R}^n$.
- Least-squares for classification: $\mathcal{Y} = \{1, \dots, k\}$ and
$$h_W(x) = \arg\max_{i=1,\dots,k} (W^\top x)_i \quad \text{where } W = [\, w_1 \cdots w_k \,] \in \mathbb{R}^{n \times k}.$$
- Learning: encode each output sample $y_i$ as a vector in $\mathbb{R}^k$ using one-hot encoding and minimize the squared error (closed-form solution).

Linear models for classification (cont'd)

- Least-squares classification is very sensitive to outliers: points far from the decision boundary can pull the least-squares fit, and hence the boundary, toward them.
- Recall that minimizing the squared error corresponds to maximizing the likelihood under the assumption of a Gaussian conditional distribution, which is a poor model for binary labels.

Linear models for classification (cont'd)

- Binary linear discriminant function: $\mathcal{Y} = \{-1, 1\}$ and $h_w(x) = \mathrm{sign}(w^\top x)$.
- Learning: the perceptron algorithm (Rosenblatt, 1957), an iterative algorithm closely related to stochastic gradient descent (more on that later). It converges when the data is linearly separable.
- Converges... ok, but to which hyperplane? What is the best one? Support vector machines (next lecture).
- More generally, generalized linear models: $h_w(x) = \psi(w^\top x)$ for some activation function $\psi$. If $\psi$ takes its values in $(0, 1)$, we can interpret $h_w(x)$ as the conditional probability $P(y = 1 \mid x)$!

Logistic sigmoid (recall)

$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

- Symmetry: $\sigma(-z) = 1 - \sigma(z)$
- Inverse: $z = \ln\left(\frac{\sigma(z)}{1 - \sigma(z)}\right)$
- Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$

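A minimal Python sketch of the sigmoid, with numerical checks of these three properties (the stable two-branch formula is a standard implementation choice, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), computed stably for large |z|."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

z = np.linspace(-5, 5, 11)
assert np.allclose(sigmoid(-z), 1 - sigmoid(z))               # symmetry
assert np.allclose(np.log(sigmoid(z) / (1 - sigmoid(z))), z)  # inverse (logit)
eps = 1e-6  # central finite difference to check the derivative identity
num_deriv = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
assert np.allclose(num_deriv, sigmoid(z) * (1 - sigmoid(z)), atol=1e-5)
```
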
Logistic regression

- Suppose we represent the hypothesis itself as a logistic function of a linear combination of inputs:
$$h(x) = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)}$$
This is also known as a sigmoid neuron.
- Suppose we interpret $h(x)$ as $P(y = 1 \mid x)$. Then the log-odds ratio,
$$\ln\left(\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}\right) = w^\top x,$$
is linear in $x$ (nice!).
- The optimal weights maximize the conditional likelihood of the outputs, given the inputs.

The cross-entropy error function

- Suppose we interpret the output of the hypothesis, $h(x_i)$, as the probability that $y_i = 1$.
- Setting $\mathcal{Y} = \{0, 1\}$ for convenience, the log-likelihood of a hypothesis $h$ is:
$$\log L(h) = \sum_{i=1}^m \log P(y_i \mid x_i, h) = \sum_{i=1}^m \begin{cases} \log h(x_i) & \text{if } y_i = 1 \\ \log(1 - h(x_i)) & \text{if } y_i = 0 \end{cases}$$
$$= \sum_{i=1}^m y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i))$$
- The cross-entropy error function is the negative of this quantity:
$$J_D(w) = -\sum_{i=1}^m \left( y_i \log h(x_i) + (1 - y_i) \log(1 - h(x_i)) \right)$$

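In code, $J_D(w)$ is a few lines; a minimal sketch (the clipping to avoid $\log 0$ is our own numerical safeguard):

```python
import numpy as np

def cross_entropy(w, X, y):
    """Cross-entropy error J_D(w) for logistic regression.

    X: (m, n) design matrix; y: (m,) labels in {0, 1}.
    """
    h = 1.0 / (1.0 + np.exp(-X @ w))   # h(x_i) = sigma(w^T x_i)
    h = np.clip(h, 1e-12, 1 - 1e-12)   # keep log() finite
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```
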
Cross-entropy error surface for the logistic function

$$J_D(w) = -\sum_{i=1}^m \left( y_i \log \sigma(w^\top x_i) + (1 - y_i) \log(1 - \sigma(w^\top x_i)) \right)$$

Nice error surface, unique minimum, but the minimizer cannot be found in closed form.

Gradient descent

- The gradient of $J$ at a point $w$ can be thought of as a vector indicating which way is uphill.

[Figure: 3-D surface plot of the SSQ error over weights $w_0$ and $w_1$]

- If this is an error function, we want to move downhill on it, i.e., in the direction opposite to the gradient.

Example gradient descent traces

- If the error function is convex, gradient descent converges to the global minimum (under some conditions on the learning rates).
- For more general hypothesis classes, there may be local optima. In this case, the final solution may depend on the initial parameters.

Gradient descent algorithm

- The basic algorithm assumes that $\nabla J$ is easily computed.
- We want to produce a sequence of vectors $w^1, w^2, w^3, \dots$ with the goal that:
  - $J(w^1) > J(w^2) > J(w^3) > \dots$
  - $\lim_{i \to \infty} w^i = w$, and $w$ is locally optimal.
- The algorithm: given $w^0$, for $i = 0, 1, 2, \dots$ do
$$w^{i+1} = w^i - \alpha_i \nabla J(w^i),$$
where $\alpha_i > 0$ is the step size or learning rate for iteration $i$.

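As a minimal Python sketch of this loop (the fixed step size, tolerance, and toy quadratic objective are our own illustrative choices):

```python
import numpy as np

def gradient_descent(grad_J, w0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Minimize J given its gradient grad_J, starting from w0."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        step = alpha * grad_J(w)
        w = w - step                    # move opposite to the gradient
        if np.linalg.norm(step) < tol:  # stop when updates become tiny
            break
    return w

# Example: minimize J(w) = ||w - 3||^2, whose gradient is 2(w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w0=np.zeros(2))
```
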
Maximization procedure: gradient ascent

First we compute the gradient of $\log L(w)$ with respect to $w$:
$$\nabla \log L(w) = \sum_i \left[ y_i \frac{1}{h_w(x_i)} h_w(x_i)(1 - h_w(x_i))\, x_i - (1 - y_i) \frac{1}{1 - h_w(x_i)} h_w(x_i)(1 - h_w(x_i))\, x_i \right]$$
$$= \sum_i x_i \left( y_i - y_i h_w(x_i) - h_w(x_i) + y_i h_w(x_i) \right) = \sum_i (y_i - h_w(x_i))\, x_i = X^\top (y - \hat{y})$$
where $\hat{y} \in \mathbb{R}^m$ is defined by $(\hat{y})_i = h_w(x_i)$.

The update rule (because we maximize) is:
$$w \leftarrow w + \alpha \nabla \log L(w) = w + \alpha X^\top (y - \hat{y})$$
where $\alpha \in (0, 1)$ is a step-size or learning rate parameter.

- This is called logistic regression.
- If one uses features $\phi$ of the input, we simply have:
$$w \leftarrow w + \alpha \Phi^\top (y - \hat{y})$$

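Putting the update rule into code, a small sketch (the toy two-blob dataset, step size, and iteration count are illustrative choices, not from the lecture):

```python
import numpy as np

def fit_logistic(X, y, alpha=0.01, n_iter=1000):
    """Batch gradient ascent on the log-likelihood: w <- w + alpha * X^T (y - yhat)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        y_hat = 1.0 / (1.0 + np.exp(-X @ w))  # predicted P(y=1|x) for each row
        w += alpha * X.T @ (y - y_hat)
    return w

# Toy example: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w = fit_logistic(X, y)
```
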
Roadmap: theoretical guarantees for gradient descent and second-order methods

- If the cost function is convex, then gradient descent converges to the optimal solution for an appropriate choice of the learning rates.
- We will show that the cross-entropy error function is convex.
- We will see how we can use a second-order method to choose optimal learning rates.

Convexity

- A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if for all $a, b \in \mathbb{R}^d$ and $\lambda \in [0, 1]$:
$$f(\lambda a + (1 - \lambda) b) \le \lambda f(a) + (1 - \lambda) f(b).$$
- $f$ is concave if $-f$ is convex.
- If $f$ and $g$ are convex functions, $\alpha f + \beta g$ is also convex for any non-negative real numbers $\alpha$ and $\beta$.

Characterizations of convexity

- First-order characterization: $f$ is convex $\iff$ for all $a, b$: $f(a) \ge f(b) + \nabla f(b)^\top (a - b)$ (the function is globally above the tangent at $b$).
- Second-order characterization: $f$ is convex $\iff$ the Hessian of $f$ is positive semi-definite. The Hessian contains the second-order derivatives of $f$:
$$H_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j}.$$
- $H$ is positive semi-definite $\iff$ $a^\top H a \ge 0$ for all $a \in \mathbb{R}^d$ $\iff$ $H$ has only non-negative eigenvalues.

Logistic regression: convexity of the cost function

$$J(w) = -\sum_{i=1}^m \left( y_i \log \sigma(w^\top x_i) + (1 - y_i) \log(1 - \sigma(w^\top x_i)) \right)$$
where $\sigma$ is the sigmoid function.

We show that $-\log \sigma(w^\top x)$ and $-\log(1 - \sigma(w^\top x))$ are convex in $w$:
$$\nabla_w \left( -\log \sigma(w^\top x) \right) = -\frac{\nabla_w\, \sigma(w^\top x)}{\sigma(w^\top x)} = -\frac{\sigma'(w^\top x)\, \nabla_w (w^\top x)}{\sigma(w^\top x)} = (\sigma(w^\top x) - 1)\, x$$
$$\nabla^2_w \left( -\log \sigma(w^\top x) \right) = \nabla_w \left( (\sigma(w^\top x) - 1)\, x \right) = \sigma(w^\top x)(1 - \sigma(w^\top x))\, x x^\top$$
It is easy to check that this matrix is positive semi-definite for any $x$.

Convexity of the cost function (cont'd)

$$J(w) = -\sum_{i=1}^m \left( y_i \log \sigma(w^\top x_i) + (1 - y_i) \log(1 - \sigma(w^\top x_i)) \right)$$
where $\sigma(z) = 1/(1 + e^{-z})$ (check that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$).

Similarly, you can show that
$$\nabla_w \left( -\log(1 - \sigma(w^\top x)) \right) = \sigma(w^\top x)\, x$$
$$\nabla^2_w \left( -\log(1 - \sigma(w^\top x)) \right) = \sigma(w^\top x)(1 - \sigma(w^\top x))\, x x^\top$$

- Hence $J(w)$ is convex in $w$.
- The gradient of $J$ is $X^\top (\hat{y} - y)$ where $\hat{y}_i = \sigma(w^\top x_i) = h(x_i)$.
- The Hessian of $J$ is $X^\top R X$ where $R$ is diagonal with entries $R_{i,i} = h(x_i)(1 - h(x_i))$.

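These formulas are easy to verify numerically; a sketch that builds the gradient and Hessian as given and checks that the Hessian is positive semi-definite (the random data is our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

h = 1.0 / (1.0 + np.exp(-X @ w))
grad = X.T @ (h - y)           # gradient of J: X^T (yhat - y)
R = np.diag(h * (1 - h))
H = X.T @ R @ X                # Hessian of J: X^T R X

# Positive semi-definite: all eigenvalues of the symmetric matrix H are >= 0.
assert np.all(np.linalg.eigvalsh(H) >= -1e-10)
```
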
Another algorithm for optimization

- Recall Newton's method for finding the zero of a function $g : \mathbb{R} \to \mathbb{R}$:
  - At point $w^t$, approximate the function by a straight line (its tangent).
  - Solve the linear equation for where the tangent equals 0, and move the parameter to this point:
$$w^{t+1} = w^t - \frac{g(w^t)}{g'(w^t)}$$

Application to machine learning

- Suppose for simplicity that the error function $J$ has only one parameter.
- We want to optimize $J$, so we can apply Newton's method to find the zeros of $J' = \frac{d}{dw} J$.
- We obtain the iteration:
$$w^{t+1} = w^t - \frac{J'(w^t)}{J''(w^t)}$$
- Note that there is no step size parameter!
- This is a second-order method, because it requires computing the second derivative.
- But if our error function is quadratic, this will find the global optimum in one step!

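A one-dimensional sketch of the iteration (the quadratic test function is our own example, chosen to show the one-step convergence):

```python
def newton_1d(dJ, d2J, w, n_iter=10):
    """Newton's method on J'(w) = 0: w <- w - J'(w) / J''(w)."""
    for _ in range(n_iter):
        w = w - dJ(w) / d2J(w)
    return w

# Quadratic J(w) = 3(w - 5)^2: J'(w) = 6(w - 5), J''(w) = 6.
# A single Newton step from any starting point lands exactly on w = 5.
w_min = newton_1d(lambda w: 6 * (w - 5), lambda w: 6.0, w=0.0, n_iter=1)
```
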
Second-order methods: multivariate setting

- If we have an error function $J$ that depends on many variables, we can compute the Hessian matrix, which contains the second-order derivatives of $J$:
$$H_{ij} = \frac{\partial^2 J}{\partial w_i \partial w_j}$$
- The inverse of the Hessian gives the optimal learning rates.
- The weights are updated as:
$$w \leftarrow w - H^{-1} \nabla_w J$$
- Applied to logistic regression, this is also called the Newton-Raphson method, or Fisher scoring.

Which method is better?

- Newton's method usually requires significantly fewer iterations than gradient descent.
- Computing the Hessian requires a batch of data, so there is no natural online algorithm.
- Inverting the Hessian explicitly is expensive, but almost never necessary.
- Computing the product of a Hessian with a vector can be done in linear time (Pearlmutter, 1993), which also helps to compute the product of the inverse Hessian with a vector without explicitly computing $H$.

Newton-Raphson for logistic regression

- Leads to a nice algorithm called iteratively reweighted least squares (IRLS).
- The Hessian has the form $H = \Phi^\top R \Phi$, where $R$ is the diagonal matrix with entries $h(x_i)(1 - h(x_i))$ (you can check that this is the form of the second derivative).
- The weight update becomes:
$$w \leftarrow w - (\Phi^\top R \Phi)^{-1} \Phi^\top (\hat{y} - y)$$
which can be rewritten as the solution of a weighted least-squares problem:
$$w \leftarrow (\Phi^\top R \Phi)^{-1} \Phi^\top R \left( \Phi w - R^{-1} (\hat{y} - y) \right)$$

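A compact sketch of IRLS following these formulas; solving the linear system replaces the explicit inverse, and the stopping tolerance is our own choice:

```python
import numpy as np

def irls(Phi, y, n_iter=25, tol=1e-10):
    """Newton-Raphson / IRLS for logistic regression.

    Each step solves (Phi^T R Phi) d = Phi^T (yhat - y) and sets w <- w - d.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y_hat = 1.0 / (1.0 + np.exp(-Phi @ w))
        r = y_hat * (1 - y_hat)            # diagonal of R, kept as a vector
        H = Phi.T @ (r[:, None] * Phi)     # Phi^T R Phi without forming diag(R)
        # (For perfectly separable data H can become ill-conditioned;
        #  adding a small ridge term to H helps in practice.)
        d = np.linalg.solve(H, Phi.T @ (y_hat - y))
        w -= d
        if np.linalg.norm(d) < tol:
            break
    return w
```
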
Regularization for logistic regression

- One can regularize logistic regression just as in the case of linear regression.
- Recall that regularization makes a statement about the weights, so it is independent of the data-fit (error) term.
- E.g., $L_2$ regularization gives the optimization criterion:
$$J(w) = J_D(w) + \frac{\lambda}{2} w^\top w$$

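With the $L_2$ penalty, the gradient of $J$ simply gains a $\lambda w$ term; a sketch modifying the earlier gradient update (step size and iteration count are again illustrative):

```python
import numpy as np

def fit_logistic_l2(X, y, lam=1.0, alpha=0.01, n_iter=2000):
    """Gradient descent on J(w) = J_D(w) + (lambda/2) w^T w."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        y_hat = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (y_hat - y) + lam * w  # data-fit gradient plus penalty gradient
        w -= alpha * grad
    return w
```
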
Probabilistic view of logistic regression

- Consider the additive noise model we discussed before: $y_i = h_w(x_i) + \epsilon$, where the $\epsilon$ are drawn i.i.d. from some distribution.
- At first glance, logistic regression does not fit this model very well.
- We will instead think of a latent variable $\hat{y}_i$ such that $\hat{y}_i = h_w(x_i) + \epsilon$.
- Then the output is generated as: $y_i = 1$ if and only if $\hat{y}_i > 0$.

Generalized Linear Models

- Logistic regression is a special case of a generalized linear model: $E[Y \mid x] = g^{-1}(w^\top x)$.
- $g$ is called the link function; it relates the mean of the response to the linear predictor.
- Linear regression: $E[Y \mid x] = E[w^\top x + \varepsilon \mid x] = w^\top x$ ($g$ is the identity).
- Logistic regression: $E[Y \mid x] = P(Y = 1 \mid x) = \sigma(w^\top x)$.
- Poisson regression: $E[Y \mid x] = \exp(w^\top x)$ (for count data).
- ...

Linear regression with feature vectors revisited

- Recall: we want to minimize the (regularized) error function:
$$J(w) = \frac{1}{2} (\Phi w - y)^\top (\Phi w - y) + \frac{\lambda}{2} w^\top w$$
- Using the identity $(M^\top M + \alpha I)^{-1} M^\top = M^\top (M M^\top + \alpha I)^{-1}$, the solution can be rewritten as
$$w = (\Phi^\top \Phi + \lambda I_n)^{-1} \Phi^\top y = \Phi^\top (\Phi \Phi^\top + \lambda I_m)^{-1} y = \Phi^\top a$$
with $a \in \mathbb{R}^m$.
- Inversion of an $m \times m$ matrix instead of $n \times n$!
- The solution $w$ is a linear combination of the input points!

Linear regression with feature vectors revisited (cont'd)

- Let $K = \Phi \Phi^\top \in \mathbb{R}^{m \times m}$ be the so-called Gram matrix. $K_{i,j} = \phi(x_i)^\top \phi(x_j)$ measures the similarity between inputs $i$ and $j$.
- No need to compute the feature map; we just need to know how to compute $\phi(u)^\top \phi(v)$ for any $u, v$.
- Solution of regularized least squares: $w = \Phi^\top a$ with $a = (K + \lambda I)^{-1} y$.
- The predictions for the input data are given by $\hat{y} = \Phi w = \Phi \Phi^\top a = K (K + \lambda I)^{-1} y$.
- The prediction for a new input point $x$ is given by $y = \phi(x)^\top \Phi^\top a = k_x^\top a$, where $k_x \in \mathbb{R}^m$ is defined by $(k_x)_i = \phi(x)^\top \phi(x_i)$.
- We never need to compute the feature map $\phi$ explicitly!

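These formulas translate directly into a kernel ridge regression sketch; the Gaussian (RBF) kernel and the toy sine data are our illustrative choices:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix K with K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def fit_kernel_ridge(X, y, lam=0.1, gamma=1.0):
    """Dual solution a = (K + lambda I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, a, X_new, gamma=1.0):
    """Prediction: y = k_x^T a, with (k_x)_i = k(x_new, x_i)."""
    return rbf_kernel(X_new, X_train, gamma) @ a

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=40)
a = fit_kernel_ridge(X, y)
y_new = predict(X, a, np.array([[0.5]]))
```
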
Another derivation: linear regression with feature vectors revisited

- Find the weight vector $w$ which minimizes the (regularized) error function:
$$J(w) = \frac{1}{2} (\Phi w - y)^\top (\Phi w - y) + \frac{\lambda}{2} w^\top w$$
- Instead of the closed-form solution, take the gradient and rearrange the terms.
- Setting the gradient to zero (equivalently, a fixed point of gradient descent with learning rate $\alpha = 1/\lambda$), the solution takes the form:
$$w = -\frac{1}{\lambda} \sum_{i=1}^m (w^\top \phi(x_i) - y_i)\, \phi(x_i) = \sum_{i=1}^m a_i\, \phi(x_i) = \Phi^\top a$$
where $a$ is a vector of size $m$, with $a_i = -\frac{1}{\lambda} (w^\top \phi(x_i) - y_i)$.
- Main idea: use $a$ instead of $w$ as the parameter vector.

Another derivation: re-writing the error function

- Instead of $J(w)$ we have $J(a)$:
$$J(a) = \frac{1}{2} a^\top \Phi \Phi^\top \Phi \Phi^\top a - a^\top \Phi \Phi^\top y + \frac{1}{2} y^\top y + \frac{\lambda}{2} a^\top \Phi \Phi^\top a$$
- Denote $\Phi \Phi^\top = K$. Hence, we can re-write this as:
$$J(a) = \frac{1}{2} a^\top K K a - a^\top K y + \frac{1}{2} y^\top y + \frac{\lambda}{2} a^\top K a$$
- This is quadratic in $a$, so we can set the gradient to 0 and solve.

Another derivation: dual-view regression

- By setting the gradient to 0 we get: $a = (K + \lambda I_m)^{-1} y$.
- Note that this is similar to re-formulating a weight vector in terms of a linear combination of instances.
- Again, the feature mapping is not needed, either to learn or to make predictions!
- This approach is useful if the feature space is very large.

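A quick numerical check that the primal and dual solutions coincide (using the identity feature map, so $\Phi$ is just the data matrix; the comparison itself is our own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(50, 4))   # feature matrix (identity feature map)
y = rng.normal(size=50)
lam = 0.5

# Primal: w = (Phi^T Phi + lambda I_n)^{-1} Phi^T y  (n x n system)
w_primal = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ y)

# Dual: a = (K + lambda I_m)^{-1} y, then w = Phi^T a  (m x m system)
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(50), y)
w_dual = Phi.T @ a

assert np.allclose(w_primal, w_dual)
```
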
Kernel functions

- Whenever a learning algorithm can be written in terms of dot products, it can be generalized to kernels.
- A kernel is any function $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ which corresponds to a dot product for some feature mapping $\phi$: $K(x_1, x_2) = \phi(x_1)^\top \phi(x_2)$ for some $\phi$.
- Conversely, by choosing a feature map $\phi : \mathbb{R}^n \to \mathbb{R}^l$, we implicitly choose a kernel function ($l$ may even be infinite!).
- Recall that $\phi(x_1)^\top \phi(x_2) = \|\phi(x_1)\| \, \|\phi(x_2)\| \cos \angle(\phi(x_1), \phi(x_2))$, where $\angle$ denotes the angle between the vectors, so a kernel function can be thought of as a notion of similarity.

Example: quadratic kernel

- Let $K(x, z) = (x^\top z)^2$. Is this a kernel?
$$K(x, z) = \left( \sum_{i=1}^n x_i z_i \right) \left( \sum_{j=1}^n x_j z_j \right) = \sum_{i,j \in \{1 \dots n\}} x_i z_i x_j z_j = \sum_{i,j \in \{1 \dots n\}} (x_i x_j)(z_i z_j)$$
- Hence, it is a kernel, with feature mapping:
$$\phi(x) = \left( x_1^2,\ x_1 x_2,\ \dots,\ x_1 x_n,\ x_2 x_1,\ x_2^2,\ \dots,\ x_n^2 \right)$$
The feature vector includes all squares of elements and all cross terms.
- Note that computing $\phi$ takes $O(n^2)$ but computing $K$ takes only $O(n)$!

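A numerical check that the explicit feature map matches the kernel (np.outer enumerates all products $x_i x_j$; the check itself is our own illustration):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map: all products x_i * x_j, in R^(n^2)."""
    return np.outer(x, x).ravel()

def quad_kernel(x, z):
    """K(x, z) = (x . z)^2, computed in O(n)."""
    return (x @ z) ** 2

rng = np.random.default_rng(4)
x, z = rng.normal(size=5), rng.normal(size=5)
assert np.isclose(phi(x) @ phi(z), quad_kernel(x, z))
```
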