Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

CS446: Machine Learning, Fall 2017
Lecturer: Sanmi Koyejo
Scribe: Ke Wang, Oct. 24th, 2017

Agenda

- Recap: SVM and hinge loss, Representer theorem
- Stochastic Gradient Descent
- Perceptron, other examples

Kernels

Recall the feature transformation. For $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, the prediction function is

$$f : \mathcal{X} \to \mathcal{Y}.$$

Let $\phi$ be the feature transformation function, which gives the mapping

$$x \mapsto \phi(x).$$

After the transformation, we have

$$f(x) = w^T \phi(x) + w_0.$$

Note that $f(x)$ is linear in $\phi(x)$, and non-linear in $x$ whenever $\phi(x)$ is a non-linear function of $x$.

Define the kernel function as

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

This is especially useful when $\phi(x)$ is infinite dimensional. The motivation for an infinite-dimensional $\phi(x)$ is that, in theory, you can almost always make your data points linearly separable by bending the data space into an infinite-dimensional space. See Figure 1.

Figure 1: Data lifted to a higher dimension becomes linearly separable, from Kim (2013).

When $\phi(x)$ has infinite dimensions, solving for $w$ is hard, because $w$ in $w^T \phi(x)$ is also infinite dimensional. Since the value of $f(x)$ usually depends on the features only through inner products of the form

$$\phi(x_i)^T \phi(x),$$

we can use the kernel function to represent $f(x)$ without knowing $\phi(x)$ and $w$. For a given data point $x$,

$$f(x) = w^T \phi(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$$

We have shown two applications of kernels:

- Linear regression: using a linear algebra identity
- SVM: using the method of Lagrange multipliers
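To make the kernel identity $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ concrete, the following is a minimal sketch using the degree-2 homogeneous polynomial kernel on 2-d inputs. The kernel choice and the names `phi`, `dot`, and `poly_kernel` are illustrative assumptions, not taken from the lecture. It checks that the kernel value matches the inner product of the explicit features.

```python
import math


def phi(x):
    """Explicit feature map for the kernel (x^T z)^2 on 2-d inputs (illustrative choice)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def poly_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed without ever forming phi(x) or phi(z)."""
    return dot(x, z) ** 2


x, z = [1.0, 2.0], [3.0, -1.0]
explicit = dot(phi(x), phi(z))   # inner product in the 3-d feature space
implicit = poly_kernel(x, z)     # same value from the 2-d inputs directly
print(explicit, implicit)
```

The kernel evaluation only touches the original 2-d vectors, which is what makes the approach workable when $\phi$ is very high or infinite dimensional.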

Hilbert Spaces

$\mathcal{H}_k$ is the space of well-behaved linear functions such as $w^T \phi(x)$. This is equivalent to the space of bounded functions of the form

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i), \qquad K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

Support Vector Machines

The optimization problem is to maximize the margin, which is proportional to $\frac{1}{\|w\|}$:

$$\min_w \|w\| \quad \text{s.t.} \quad y_i f(x_i) \ge 1, \ \forall i \in [n],$$

where

$$f(x) = w^T x.$$

Why set the threshold to 1? Suppose we change it to some constant $C > 0$:

$$y_i f(x_i) \ge C \iff \frac{y_i f(x_i)}{C} \ge 1 \iff y_i \left(\frac{w}{C}\right)^T x_i \ge 1.$$

This only rescales $w$, so the separating hyperplane that solves the optimization problem does not change. We can therefore set $C = 1$ without loss of generality:

$$\min_w \|w\| \quad \text{s.t.} \quad y_i f(x_i) \ge 1, \ \forall i \in [n].$$

Non-separable Data

Recall that with the slack variables $\xi_i$, we have

$$\min_{w, \xi} \ \|w\| + \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \xi_i \ge 0, \quad y_i f(x_i) \ge 1 - \xi_i.$$
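As a small illustration of the slack constraints, the sketch below computes, for a fixed $w$, the smallest $\xi_i \ge 0$ satisfying $y_i f(x_i) \ge 1 - \xi_i$, which is $\max(0, 1 - y_i f(x_i))$. It also checks the rescaling argument for the threshold. The data, the weight vector, and the function names are made up for illustration.

```python
def f(w, x):
    """Linear score f(x) = w^T x (no bias term, as in the SVM section)."""
    return sum(wj * xj for wj, xj in zip(w, x))


def slacks(w, X, y):
    """Smallest xi_i >= 0 with y_i f(x_i) >= 1 - xi_i, for a fixed w."""
    return [max(0.0, 1.0 - yi * f(w, xi)) for xi, yi in zip(X, y)]


def meets_threshold(w, X, y, c):
    """For each point, does it satisfy y_i f(x_i) >= c?"""
    return [yi * f(w, xi) >= c for xi, yi in zip(X, y)]


# Hypothetical toy data: the last point is on the wrong side of the boundary.
X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.25, 0.0]]
y = [1, 1, -1, -1]
w = [1.0, 0.0]

xi = slacks(w, X, y)

# Threshold c with weights c * w gives the same feasible points as threshold 1 with w.
c = 4.0
scaled_w = [c * wj for wj in w]
same = meets_threshold(w, X, y, 1.0) == meets_threshold(scaled_w, X, y, c)
print(xi, same)
```

Points with zero slack satisfy the original margin constraint. A slack between 0 and 1 means the point is inside the margin but still correctly classified, and a slack above 1 means it is misclassified.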

Figure 2: Visualization of an SVM with slack variables (circled data points are on the margin).

Intuitively, $\xi_i$ represents how much tolerance to misclassification the model has. See Figure 2.

With $f(x) = w^T \phi(x)$, the solution has the form

$$w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i), \quad \text{where } \alpha_i \ge 0,$$

or equivalently

$$w = \sum_{i=1}^{n} \alpha'_i \phi(x_i), \quad \text{where } \alpha'_i = \alpha_i y_i.$$

- When the data is separable (no slack), $\alpha_i > 0$ iff $y_i f(x_i) = 1$.
- For the non-separable case, $\alpha_i > 0$ iff $y_i f(x_i) = 1 - \xi_i$; otherwise $\alpha_i = 0$.

The prediction function can be written in the form

$$f(x) = \sum_{i=1}^{n} \alpha'_i K(x, x_i).$$

Note that storage for the kernel predictor is $O(nd)$, since all $n$ training points of dimension $d$ must be kept.

By contrast, storage for the SVM is $O(dS)$, where $S$ is the number of support vectors. This is because we can ignore the terms $K(x, x_i)$ for which $x_i$ is not a support vector.

Representer Theorem

The optimization problem in the Hilbert space is

$$\min_{f \in \mathcal{H}_k} \sum_{i=1}^{n} l(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_k}^2,$$

where $l(y_i, f(x_i))$ is the loss function. The Representer Theorem says that the solution can be written as

$$f^*(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i).$$

We have shown this in two cases:

- In kernel ridge regression, using a matrix identity, with $l(y_i, f(x_i)) = (y_i - f(x_i))^2$.
- In kernel SVM, using the dual representation (Lagrange multipliers), with $l(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$.

This also applies to other kernel algorithms, for example:

- Kernel logistic regression
- Kernel Poisson regression
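For the squared loss, the representer form can be sketched in a few lines. The example below assumes the standard closed-form kernel ridge solution $\alpha = (K + \lambda I)^{-1} y$ and an RBF kernel. The kernel choice, the toy data, and all function names are illustrative assumptions, not part of the lecture. The linear system is solved with a small hand-written Gaussian elimination to keep the example dependency-free.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2); an assumed kernel choice."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def solve(A, b):
    """Solve A a = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    a = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(M[r][c] * a[c] for c in range(r + 1, n))
        a[r] = (M[r][n] - s) / M[r][r]
    return a


def fit_kernel_ridge(X, y, lam, kernel):
    """Return alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    A = [[kernel(X[i], X[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(A, y)


def predict(x, X, alpha, kernel):
    """Representer form: f(x) = sum_i alpha_i K(x, x_i)."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X))


# Hypothetical 1-d training set (y = x^2 at four points).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]
alpha = fit_kernel_ridge(X, y, lam=1e-6, kernel=rbf_kernel)
fitted = [predict(xi, X, alpha, rbf_kernel) for xi in X]
print(fitted)
```

Note that `predict` never needs $w$ or $\phi$: only the coefficients `alpha` and the training points are stored, which matches the $O(nd)$ storage remark for kernel predictors.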

Stochastic Gradient Descent

All the loss functions we have seen so far are average losses over the data points:

$$L(w) = \frac{1}{n} \sum_{i=1}^{n} L_i(w).$$

In regression we have

$$L_i(w) = (y_i - w^T x_i - w_0)^2.$$

In SVM we have

$$L_i(w) = \max(0, 1 - y_i (w^T x_i + w_0)).$$

The gradient is

$$\nabla_w L(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla_w L_i(w).$$

The population risk is

$$\mathcal{L}(w) = E_P[L(w)].$$

Under weak conditions we have

$$\nabla_w \mathcal{L}(w) = E_P[\nabla_w L(w)].$$

We can replace $E_P[\nabla_w L(w)]$ with an unbiased estimate, meaning we can pick a $\beta$ such that

$$\text{bias} = E[\beta] - E_P[\nabla_w L(w)] = 0.$$

For example, to estimate a mean we can choose the sample mean $\hat{\mu}$. Let $X \sim P$ and $\mu = E[X]$; then

$$E[\hat{\mu}] = E\left[\frac{1}{n} \sum_{i=1}^{n} x_i\right] = E[X].$$

We can also choose a single example $x_i$ from the data points, because $E[x_i] = E[X]$. So we can choose $\nabla_w L_i(w)$ as an unbiased estimator of $E[\nabla_w L(w)]$.

Then the Stochastic Gradient Descent algorithm is defined as follows. In the $t$-th iteration:

1. Pick $i$ uniformly from $[n]$.
2. Update the parameter: $w_{t+1} = w_t - \eta_t \nabla_w L_i(w_t)$.
3. Go to the next iteration until a stopping condition is hit. For example, we can limit the number of iterations.

That is, in each iteration we randomly pick a data point $(y_i, x_i)$ and use it to calculate $L_i(w)$ and $\nabla_w L_i(w)$. We then use this randomly selected gradient to update $w$. For example, in the linear regression case:

$$L_i(w) = (y_i - w^T x_i)^2$$

$$\nabla_w L_i(w) = -2 x_i (y_i - w^T x_i)$$

$$w_{t+1} = w_t - \eta_t \left(-2 x_i (y_i - w_t^T x_i)\right) = w_t + 2 \eta_t x_i (y_i - w_t^T x_i)$$

Some useful properties of Stochastic Gradient Descent:

- It converges eventually if $L(w)$ is convex and the step size is chosen appropriately. See Figure 3 and Figure 4.
- It easily handles streaming data for online learning.
- It is adaptive. See Figure 5.
- It deals with non-differentiable functions easily. See Figure 6.

Figure 3: Behavior of gradient descent, from Wikipedia, the free encyclopedia (2007).

The intuition is:

$$\text{SGD} \approx \text{GD} + \text{noise}, \qquad \nabla_w L_i(w) \approx \nabla_w L(w) + \text{noise}.$$
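The linear regression case can be sketched as follows. This is a minimal illustration, assuming a constant step size, a fixed iteration budget as the stopping condition, and noiseless toy data. The function name `sgd_linear_regression` and all parameter values are hypothetical.

```python
import random


def sgd_linear_regression(X, y, eta=0.05, num_iters=2000, seed=0):
    """SGD on L_i(w) = (y_i - w^T x_i)^2 with a constant step size (assumed)."""
    rng = random.Random(seed)      # seeded for reproducibility
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(num_iters):     # stopping condition: iteration limit
        i = rng.randrange(n)       # pick i uniformly from [n]
        residual = y[i] - sum(wj * xj for wj, xj in zip(w, X[i]))
        grad = [-2.0 * xj * residual for xj in X[i]]   # gradient of L_i at w
        w = [wj - eta * gj for wj, gj in zip(w, grad)]  # step against the gradient
    return w


# Hypothetical noiseless data generated from y = 2 * x1 - 1 * x2.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [2.0 * a - 1.0 * b for a, b in X]
w = sgd_linear_regression(X, y)
print(w)
```

Because the toy data is noiseless, every per-example gradient vanishes at the true weights, so a constant step size is enough here. With noisy data a decreasing step size $\eta_t$ is typically needed for the iterates to settle.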

Figure 4: Behavior of stochastic gradient descent.

Figure 5: The SGD model shifts as it gets new data (the new data points are the bigger ones).

Perceptron

The perceptron is an algorithm used for linear classification. If the data is separable, the perceptron will find a linear separator.

Figure 6: GD gets stuck at the flat point, but SGD won't.

Consider the linear model

$$f(x) = w^T x + w_0, \qquad h(x) = \text{sign}(f(x)).$$

The loss function is

$$L_i(w) = \max(0, -y_i f(x_i)).$$

Then we have the (sub)gradient

$$\nabla_w L_i(w) = \begin{cases} 0, & y_i f(x_i) > 0 \\ -y_i x_i, & y_i f(x_i) < 0 \\ 0, & y_i f(x_i) = 0 \end{cases}$$

The perceptron algorithm is defined as follows. For $i \in [n]$, the update rule in each iteration is

$$w_{t+1} = \begin{cases} w_t + \eta_t y_i x_i, & \text{if } y_i f(x_i) < 0 \text{ (wrong prediction)} \\ w_t, & \text{if } y_i f(x_i) > 0 \text{ (correct prediction)} \end{cases}$$
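The update rule can be sketched as a short training loop. Two choices below are assumptions rather than part of the notes: points with $y_i f(x_i) = 0$ are treated as mistakes (otherwise a zero initialization would never update), and the bias $w_0$ is updated by $\eta\, y_i$, which corresponds to treating it as a weight on a constant feature. The data and names are hypothetical.

```python
def perceptron(X, y, eta=1.0, max_epochs=100):
    """Cycle through the data, updating w only on mistakes."""
    d = len(X[0])
    w, w0 = [0.0] * d, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + w0
            if yi * score <= 0:  # assumption: ties count as wrong predictions
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                w0 += eta * yi   # bias handled as a weight on a constant 1
                mistakes += 1
        if mistakes == 0:        # a full pass with no mistakes: data separated
            break
    return w, w0


def classify(w, w0, x):
    """h(x) = sign(f(x)), returning +1 or -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + w0 > 0 else -1


# Hypothetical linearly separable toy data.
X = [[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
w, w0 = perceptron(X, y)
predictions = [classify(w, w0, xi) for xi in X]
print(w, w0, predictions)
```

On separable data such as this toy set, the loop stops once a full pass produces no mistakes. On non-separable data it would run until `max_epochs` is reached.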


Bibliography

Kim, E. (2013). (left) the decision boundary w shown to be linear in 3-d space, (right) the decision boundary w, when transformed back to 2-d space, is nonlinear. [Online; accessed October 24, 2017]. URL

Wikipedia, the free encyclopedia (2007). Illustration of gradient descent on a series of level sets. [Online; accessed October 24, 2017]. URL
