CS446: Machine Learning, Fall 2017

Lecture 14: Online Learning, Stochastic Gradient Descent, Perceptron
Lecturer: Sanmi Koyejo        Scribe: Ke Wang, Oct. 24th, 2017

Agenda
- Recap: SVM and hinge loss, Representer theorem
- Stochastic Gradient Descent
- Perceptron, other examples

Kernels

Recall the feature transformation. For $x \in \mathcal{X}$, $y \in \mathcal{Y}$, the prediction function is $f : \mathcal{X} \to \mathcal{Y}$. Let $\phi$ be the feature transformation; then we have the mapping $x \mapsto \phi(x)$, and after the transformation,

$$f(x) = w^T \phi(x) + w_0.$$

Note that $f(x)$ is linear with respect to $\phi(x)$ but non-linear with respect to $x$ (provided $\phi$ is a non-linear function).

Define the kernel function as

$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

This is especially useful when $\phi(x)$ is infinite-dimensional. The motivation for an infinite-dimensional $\phi(x)$ is that, in theory, you can almost always make your data points linearly separable by bending the data space into a higher- (even infinite-) dimensional space. See Figure 1.

Figure 1: Data lifted to a higher dimension becomes linearly separable, from Kim (2013)

When $\phi(x)$ is infinite-dimensional, solving for $w$ in $f(x) = w^T \phi(x)$ is hard, since $w$ is also infinite-dimensional. But because $f(x)$ typically depends on the data only through inner products of the form $\phi(x_i)^T \phi(x_j)$, we can use the kernel function to represent $f(x)$ without ever computing $\phi(x)$ or $w$ explicitly. For a given data point,

$$f(x) = w^T \phi(x) = \sum_i \alpha_i K(x, x_i).$$

We have shown two applications of kernels:
- Linear regression, using a linear algebra identity
- SVM, using the method of Lagrange multipliers
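To make the kernel identity concrete, here is a small sketch (not from the lecture) using the quadratic kernel $K(x, z) = (x^T z)^2$ on $\mathbb{R}^2$, whose explicit feature map is only three-dimensional; the same inner product is computed with and without forming $\phi$:

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for 2-d input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(x, z):
    """Quadratic kernel K(x, z) = (x . z)^2, computed without forming phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Both evaluate the same feature-space inner product.
print(np.dot(phi(x), phi(z)))  # 16.0
print(K(x, z))                 # 16.0
```

For a truly infinite-dimensional $\phi$ (e.g. the Gaussian kernel), only the kernel form remains computable, which is exactly the point above.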
Hilbert Spaces

$\mathcal{H}_K$ is the space of well-behaved linear functions of the form $w^T \phi(x)$. Equivalently, it is the space of bounded functions

$$f(x) = \sum_i \alpha_i K(x, x_i), \qquad K(x_i, x_j) = \phi(x_i)^T \phi(x_j).$$

Support Vector Machines

The optimization problem is to maximize the margin, which is proportional to $1 / \|w\|$:

$$\min_w \ \frac{\|w\|_2^2}{2} \quad \text{s.t.} \quad y_i f(x_i) \geq 1, \ \forall i \in [n],$$

where $f(x) = w^T x$.

Why set the threshold to 1? Suppose we change it to some constant $C > 0$:

$$y_i f(x_i) \geq C \iff \frac{y_i f(x_i)}{C} \geq 1 \iff y_i \left(\frac{w}{C}\right)^T x_i \geq 1.$$

Rescaling $w$ by $C$ does not change the solution of the optimization problem (up to the same rescaling), so we can set $C = 1$ without loss of generality.

Non-separable Data

Recall that with slack variables $\xi_i$ (and a trade-off parameter on the slack penalty, conventionally also written $C$), we have:

$$\min_{w, \xi} \ \frac{\|w\|_2^2}{2} + C \sum_i \xi_i \quad \text{s.t.} \quad \xi_i \geq 0, \ y_i f(x_i) \geq 1 - \xi_i.$$
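For fixed $w$, the smallest feasible slack is $\xi_i = \max(0, 1 - y_i f(x_i))$: zero for points beyond the margin, positive for margin violations. This is exactly the hinge loss. A quick sketch with hypothetical scores $f(x_i)$ (the numbers are illustrative, not from the lecture):

```python
import numpy as np

y = np.array([+1, +1, -1, -1])
f = np.array([2.0, 0.5, -3.0, 0.2])   # hypothetical scores f(x_i)

# Optimal slack per point: xi_i = max(0, 1 - y_i f(x_i)).
xi = np.maximum(0.0, 1.0 - y * f)
print(xi)  # 0.0, 0.5, 0.0, 1.2
```

The first and third points are beyond the margin (no slack needed); the second is inside the margin; the fourth is misclassified, so its slack exceeds 1.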
Figure 2: Visualization of the SVM with slack variables (circled data points lie on the margin)

Intuitively, $\xi_i$ represents how much misclassification of point $i$ the model tolerates. See Figure 2.

With $f(x) = w^T \phi(x)$, the dual gives

$$w = \sum_i \alpha_i y_i \phi(x_i), \quad \alpha_i \geq 0,$$

or equivalently $w = \sum_i \tilde{\alpha}_i \phi(x_i)$, where $\tilde{\alpha}_i = \alpha_i y_i$.

When the data is separable (no slack), $\alpha_i \neq 0$ iff $y_i f(x_i) = 1$. In the non-separable case, $\alpha_i > 0$ iff $y_i f(x_i) = 1 - \xi_i$; otherwise $\alpha_i = 0$.

The prediction function can therefore be written as

$$f(x) = \sum_i \tilde{\alpha}_i K(x, x_i).$$

Note that storage for a generic kernel predictor is $O(nd)$,
whereas storage for the SVM predictor is only $O(Sd)$, where $S$ is the number of support vectors. This is because we can ignore the entries $K(x, x_i)$ for which $x_i$ is not a support vector.

Representer Theorem

The optimization problem in the Hilbert space is

$$\min_{f \in \mathcal{H}_K} \ \sum_i \ell(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_K}^2,$$

where $\ell(y_i, f(x_i))$ is the loss function. The representer theorem says that the solution can be written as

$$f^*(x) = \sum_i \alpha_i K(x, x_i).$$

We have shown this in two cases:
- Kernel ridge regression, using a matrix identity, with $\ell(y_i, f(x_i)) = (y_i - f(x_i))^2$
- Kernel SVM, using the dual representation (Lagrange multipliers), with $\ell(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$

The theorem also applies to other kernel algorithms, for example:
- Kernel logistic regression
- Kernel Poisson regression
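As a minimal sketch of the representer theorem in action (the data, RBF kernel, and regularization value are illustrative assumptions, not from the lecture): for the squared loss, the coefficients have the closed form $\alpha = (K + \lambda I)^{-1} y$, and the learned function is a kernel expansion over the training points.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))        # training inputs (illustrative)
y = X[:, 0] ** 2 + X[:, 1]          # a non-linear target

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-gamma * ||a_i - b_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

lam = 0.1
K = rbf_kernel(X, X)
# Kernel ridge regression: alpha = (K + lambda I)^{-1} y.
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def f(x_new):
    """Representer form: f*(x) = sum_i alpha_i K(x, x_i)."""
    return rbf_kernel(np.atleast_2d(x_new), X) @ alpha

print(np.abs(f(X).ravel() - y).mean())  # small training error
```

Note that prediction touches only kernel evaluations against training points; $\phi$ never appears.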
Stochastic Gradient Descent

All the loss functions we have seen so far are averages over data points:

$$L(w) = \frac{1}{n} \sum_{i=1}^n L_i(w).$$

In regression we have $L_i(w) = (y_i - w^T x_i - w_0)^2$; in the SVM we have $L_i(w) = \max(0, 1 - y_i (w^T x_i + w_0))$.

The gradient is

$$\nabla_w L(w) = \frac{1}{n} \sum_{i=1}^n \nabla_w L_i(w).$$

The population risk replaces the sample average with an expectation, $L(w) = \mathbb{E}_P[L_i(w)]$, and under weak conditions the gradient and expectation can be exchanged:

$$\nabla_w L(w) = \mathbb{E}_P[\nabla_w L_i(w)].$$

We can replace $\mathbb{E}_P[\nabla_w L(w)]$ with an unbiased estimator $\beta$, i.e. one such that

$$\text{bias} = \mathbb{E}[\beta] - \mathbb{E}_P[\nabla_w L(w)] = 0.$$

For example, to estimate a mean $\mu = \mathbb{E}[X]$ with $X \sim P$, we can use the sample mean, since

$$\mathbb{E}\left[\frac{1}{n} \sum_i x_i\right] = \mathbb{E}[X],$$

but we can equally well use a single example $x_i$, because $\mathbb{E}[x_i] = \mathbb{E}[X]$. In the same way, $\nabla_w L_i(w)$ for a uniformly random index $i$ is an unbiased estimator of $\mathbb{E}[\nabla_w L(w)]$.

The Stochastic Gradient Descent algorithm is then defined as follows. In the $t$-th iteration:
1. Pick $i$ uniformly from $[n]$.
2. Update the parameter: $w_{t+1} = w_t - \eta_t \nabla_w L_i(w_t)$.
3. Go to the next iteration, until a stopping condition is hit (for example, a limit on the number of iterations).

That is, in each iteration we randomly pick a data point $(x_i, y_i)$ and use it to calculate $L_i(w)$ and $\nabla L_i(w)$; this randomly selected gradient is then used to update $w$. For example, in the linear regression case:

$$L_i(w) = (y_i - w^T x_i)^2, \qquad \nabla L_i(w) = -2 x_i (y_i - w^T x_i),$$

$$w_{t+1} = w_t + 2 \eta_t x_i (y_i - w_t^T x_i).$$

Some useful properties of Stochastic Gradient Descent:
- It will converge eventually, if $L(w)$ is convex and the step size is chosen appropriately. See Figure 3 and Figure 4.
- It easily handles streaming data for online learning, and is adaptive. See Figure 5.
- It deals with non-differentiable functions easily. See Figure 6.

Figure 3: Behavior of gradient descent, from Wikipedia, the free encyclopedia (2007)

The intuition is that SGD behaves like GD plus noise:

$$\nabla_w L_i(w) = \nabla_w L(w) + \text{noise}.$$
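The linear regression case above can be sketched as follows (the synthetic data, constant step size, and iteration count are illustrative choices, not from the lecture; theory suggests a decaying $\eta_t$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)   # near-linear data

w = np.zeros(d)
eta = 0.01                                   # constant step size for simplicity
for t in range(5000):
    i = rng.integers(n)                      # pick i uniformly from [n]
    grad = -2 * X[i] * (y[i] - X[i] @ w)     # gradient of L_i(w) = (y_i - w^T x_i)^2
    w = w - eta * grad                       # descent step w_{t+1} = w_t - eta * grad

print(w)  # close to w_true
```

Each step uses one random point, so an iteration costs $O(d)$ instead of the $O(nd)$ of full gradient descent.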
Figure 4: Behavior of stochastic gradient descent

Figure 5: The SGD model shifts as it receives new data (the new data points are drawn larger)

Perceptron

The perceptron is an algorithm for linear classification. If the data is separable, the perceptron will find a linear separator.
Figure 6: GD gets stuck at the flat point, but SGD does not

Consider the linear model

$$f(x) = w^T x + w_0, \qquad h(x) = \text{sign}(f(x)).$$

The loss function is

$$L_i(w) = \max(0, -y_i f(x_i)).$$

Then we have the (sub)gradient

$$\nabla_w L_i(w) = \begin{cases} 0 & y_i f(x_i) > 0 \\ -y_i x_i & y_i f(x_i) < 0 \\ 0 & y_i f(x_i) = 0 \end{cases}$$

The perceptron algorithm is defined as follows. For $i \in [n]$, the update rule in each iteration is

$$w_{t+1} = \begin{cases} w_t + \eta_t y_i x_i & \text{if } y_i f(x_i) < 0 \text{ (wrong prediction)} \\ w_t & \text{if } y_i f(x_i) > 0 \text{ (correct prediction)} \end{cases}$$
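The update rule above can be sketched on a toy separable dataset (the dataset, bias handling, and stopping rule are illustrative assumptions, not from the lecture):

```python
import numpy as np

def perceptron(X, y, max_epochs=100, eta=1.0):
    """Train a perceptron; labels y are in {-1, +1}. Returns (w, w0)."""
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + w0) <= 0:      # wrong (or on-boundary) prediction
                w += eta * yi * xi           # w_{t+1} = w_t + eta * y_i * x_i
                w0 += eta * yi
                mistakes += 1
        if mistakes == 0:                    # separator found: stop
            break
    return w, w0

# Linearly separable toy data: label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, w0 = perceptron(X, y)
print(np.sign(X @ w + w0))  # matches y
```

This matches the SGD view: the perceptron is SGD on the loss $\max(0, -y_i f(x_i))$, updating only on mistakes.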
Bibliography

Kim, E. (2013). (Left) the decision boundary, shown to be linear in 3-d space; (right) the decision boundary, when transformed back to 2-d space, is nonlinear. [Online; accessed October 24, 2017]. URL http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html

Wikipedia, the free encyclopedia (2007). Illustration of gradient descent on a series of level sets. [Online; accessed October 24, 2017]. URL https://commons.wikimedia.org/wiki/File:Gradient_descent.png