Lecture 14: Online Learning, Stochastic Gradient Descent, Perceptron

CS446: Machine Learning, Fall 2017

Lecture 14: Online Learning, Stochastic Gradient Descent, Perceptron

Lecturer: Sanmi Koyejo    Scribe: Ke Wang, Oct. 24th, 2017

Agenda

- Recap: SVM and hinge loss, Representer theorem
- Stochastic Gradient Descent
- Perceptron, other examples

Kernels

Recall the feature transformation. For x ∈ X, y ∈ Y, the prediction function is

    f : X → Y.

Let φ be the feature transformation function, giving the mapping x → φ(x). After the transformation we have

    f(x) = w^T φ(x) + w_0.

Note that f(x) is a linear function of φ(x), but a non-linear function of x (provided φ is non-linear).

Define the kernel function as

    K(x_i, x_j) = φ(x_i)^T φ(x_j).

This is especially useful when φ(x) is infinite dimensional. The motivation for an infinite dimensional φ(x) is that, in theory, you can almost always make your data points linearly separable by bending the data space into an infinite dimensional space; see Figure 1.

[Figure 1: Data lifted to a higher dimension becomes linearly separable, from Kim (2013).]
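To make the identity K(x_i, x_j) = φ(x_i)^T φ(x_j) concrete, here is a minimal numpy sketch (not from the lecture) that checks it for the degree-2 polynomial kernel K(x, z) = (x^T z)^2, whose explicit feature map for 2-dimensional inputs is φ(x) = (x_1^2, √2 x_1 x_2, x_2^2).

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for 2-d inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, evaluated without forming phi."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The two values agree: K(x, z) = phi(x)^T phi(z).
print(phi(x) @ phi(z))    # 1.0
print(poly_kernel(x, z))  # 1.0
```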

When φ(x) is infinite dimensional, solving for w directly is hard: in w^T φ(x), w is also infinite dimensional. But since f(x) typically depends on the data only through inner products φ(x_i)^T φ(x), we can use the kernel function to represent f(x) without ever knowing φ(x) or w. For a given data point,

    f(x) = w^T φ(x) = Σ_i α_i K(x, x_i).

We have shown two applications of kernels:

- Linear regression, using a linear algebra identity
- SVM, using the method of Lagrange multipliers
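As a concrete illustration of evaluating f(x) = Σ_i α_i K(x, x_i) without ever forming φ(x), here is a minimal numpy sketch using the Gaussian (RBF) kernel, whose feature map is infinite dimensional; the training points and the coefficients α below are hypothetical placeholders, not values produced by any particular training procedure from the lecture.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: its feature map is infinite dimensional,
    yet the kernel itself costs only O(d) to evaluate."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(x, X_train, alpha, gamma=1.0):
    """Kernel predictor f(x) = sum_i alpha_i * K(x, x_i)."""
    return sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha, X_train))

# Hypothetical training points and coefficients, for illustration only.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
alpha = np.array([0.5, -1.0, 0.3])

print(predict(np.array([1.0, 0.0]), X_train, alpha))
```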

Hilbert Spaces

H_K is the space of well-behaved linear functions of the form w^T φ(x). This is equivalent to the space of bounded functions

    f(x) = Σ_i α_i K(x, x_i),    where K(x_i, x_j) = φ(x_i)^T φ(x_j).

Support Vector Machines

The optimization problem is to maximize the margin, which is proportional to 1/‖w‖:

    min_w  ‖w‖²₂ / 2
    s.t.   y_i f(x_i) ≥ 1,  i ∈ [n],

where f(x) = w^T x.

Why set the threshold to 1? Suppose we change it to some constant C:

    y_i f(x_i) ≥ C  ⟺  y_i f(x_i) / C ≥ 1  ⟺  y_i (w/C)^T x_i ≥ 1.

After this rescaling of w, the problem has the same form as

    min_w  ‖w‖²₂ / 2    s.t.  y_i f(x_i) ≥ 1,  i ∈ [n],

so the separating hyperplane is unchanged, and we can set C = 1 without loss of generality.

Non-separable Data

Recall that with slack variables ξ_i we have

    min_{w,ξ}  ‖w‖²₂ + λ Σ_i ξ_i
    s.t.       ξ_i ≥ 0,  y_i f(x_i) ≥ 1 − ξ_i,

for some trade-off constant λ > 0.
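The soft-margin problem above is a quadratic program, so it can be handed to a generic convex solver. Here is a minimal sketch using cvxpy (not part of the lecture); the toy data X and y, the intercept w0, and the trade-off constant λ are all hypothetical choices added for illustration.

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data: rows of X are the points x_i, y holds labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape
lam = 1.0  # assumed trade-off constant on the slack term

w = cp.Variable(d)
w0 = cp.Variable()
xi = cp.Variable(n)

# min ||w||^2 + lam * sum_i xi_i   s.t.  xi_i >= 0,  y_i (w^T x_i + w0) >= 1 - xi_i
objective = cp.Minimize(cp.sum_squares(w) + lam * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + w0) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, w0.value, xi.value)
```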

[Figure 2: Visualization of the SVM with slack variables (circled data points lie on the margin).]

Intuitively, ξ_i represents how much tolerance to misclassification the model has; see Figure 2. With

    f(x) = w^T φ(x),    w = Σ_i α_i y_i φ(x_i)  where α_i ≥ 0,

we can also write w = Σ_i α̃_i φ(x_i), where α̃_i = α_i y_i.

When the data is separable (no slack), α_i ≠ 0 iff y_i f(x_i) = 1. For the non-separable case, α_i > 0 iff y_i f(x_i) = 1 − ξ_i; otherwise α_i = 0.

The prediction function can therefore be written in the form

    f(x) = Σ_i α̃_i K(x, x_i).

Note that storage for a generic kernel predictor is O(nd), whereas storage for the SVM is O(dS), where S is the number of support vectors. This is because we can ignore the entries K(x, x_i) for which x_i is not a support vector.
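To illustrate the O(dS) storage point, here is a minimal sketch (not from the lecture) that drops the non-support vectors (those with α_i = 0, up to a small tolerance) before predicting; the kernel choice and the coefficients below are hypothetical, standing in for the output of a dual SVM solver.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def compress(X_train, alpha, tol=1e-8):
    """Keep only support vectors (alpha_i != 0); storage drops from O(nd) to O(Sd)."""
    mask = np.abs(alpha) > tol
    return X_train[mask], alpha[mask]

def predict(x, X_sv, alpha_sv, gamma=1.0):
    """f(x) = sum over support vectors of alpha_i * K(x, x_i)."""
    return sum(a * rbf_kernel(x, xi, gamma) for a, xi in zip(alpha_sv, X_sv))

# Hypothetical coefficients: only two of the four points are support vectors.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5], [3.0, 3.0]])
alpha = np.array([0.0, 0.7, -0.4, 0.0])

X_sv, alpha_sv = compress(X_train, alpha)
print(X_sv.shape)                                     # (2, 2)
print(predict(np.array([1.0, 0.5]), X_sv, alpha_sv))
```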

Representer Theorem

The optimization problem in the Hilbert space is

    min_{f ∈ H_K}  Σ_{i=1}^n ℓ(y_i, f(x_i)) + λ ‖f‖²_{H_K},

where ℓ(y_i, f(x_i)) is the loss function. The Representer Theorem says that the solution can be written as

    f*(x) = Σ_{i=1}^n α_i K(x, x_i).

We have shown this in two cases (a minimal kernel ridge regression sketch is given below):

- In kernel ridge regression, using a matrix identity, with ℓ(y_i, f(x_i)) = (y_i − f(x_i))².
- In kernel SVM, using the dual representation (Lagrange multipliers), with ℓ(y_i, f(x_i)) = max(0, 1 − y_i f(x_i)).

This also applies to other kernel algorithms, for example kernel logistic regression and kernel Poisson regression.
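As a concrete instance of the representer theorem for the squared loss, here is a minimal kernel ridge regression sketch (not from the lecture): with the objective written as above, the coefficients have the closed form α = (K + λ I)^{-1} y, where K is the n×n kernel matrix. The kernel choice, λ, and the synthetic data are hypothetical.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma=1.0):
    """Pairwise Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_kernel_ridge(X, y, lam=0.1, gamma=1.0):
    """Closed-form representer coefficients: alpha = (K + lam * I)^{-1} y."""
    K = rbf_kernel_matrix(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_test, X_train, alpha, gamma=1.0):
    """f(x) = sum_i alpha_i K(x, x_i), evaluated for every row of X_test."""
    return rbf_kernel_matrix(X_test, X_train, gamma) @ alpha

# Hypothetical 1-d regression data.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 3.0, 20).reshape(-1, 1)
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=20)

alpha = fit_kernel_ridge(X, y)
print(predict(np.array([[1.5]]), X, alpha))
```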

Stochastic Gradient Descent

All the loss functions we have seen so far are average losses over the data points:

    L(w) = (1/n) Σ_{i=1}^n L_i(w).

In regression we have

    L_i(w) = (y_i − w^T x_i − w_0)²,

and in the SVM we have

    L_i(w) = max(0, 1 − y_i (w^T x_i + w_0)).

The gradient is

    ∇_w L(w) = (1/n) Σ_{i=1}^n ∇_w L_i(w).

The population version of the empirical risk is L̄(w) = E_P[L(w)], and under weak conditions the gradient and the expectation can be exchanged:

    ∇ L̄(w) = E_P[∇_w L(w)].

We can replace ∇_w L(w) with an unbiased estimate, meaning we pick a β such that

    bias = E[β] − E_P[∇_w L(w)] = 0.

The analogy is estimating a mean: let X ~ P with µ = E[X]. The sample mean is unbiased,

    E[µ̂] = E[(1/n) Σ_i x_i] = E[X],

but so is a single example x_i, because E[x_i] = E[X]. In the same way, for an index i chosen uniformly at random, ∇_w L_i(w) is an unbiased estimator of ∇_w L(w).
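A quick numerical check of this unbiasedness claim (not from the lecture): averaging the per-example gradients ∇_w L_i(w) over many uniformly sampled indices approaches the full gradient ∇_w L(w). The squared-loss data below is synthetic and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)

def grad_i(i):
    """Per-example gradient of L_i(w) = (y_i - w^T x_i)^2."""
    return -2 * X[i] * (y[i] - X[i] @ w)

full_grad = np.mean([grad_i(i) for i in range(n)], axis=0)
sampled = np.mean([grad_i(rng.integers(n)) for _ in range(100_000)], axis=0)

# The two should be close, since E_i[grad_i] equals the full gradient.
print(full_grad)
print(sampled)
```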

Then the Stochastic Gradient Descent algorithm is defined as follows. In the t-th iteration:

- Pick i uniformly at random from [n].
- Update the parameter: w_{t+1} = w_t − η_t ∇_w L_i(w_t).
- Go to the next iteration until a stopping condition is hit, for example a limit on the number of iterations.

That is, in each iteration we randomly pick a data point (y_i, x_i), use it to compute L_i(w) and ∇L_i(w), and then use this randomly selected gradient to update w. For example, in the linear regression case:

    L_i(w) = (y_i − w^T x_i)²
    ∇L_i(w) = −2 x_i (y_i − w^T x_i)
    w_{t+1} = w_t + η_t · 2 x_i (y_i − w_t^T x_i)

(a minimal code sketch of this update is given at the end of this section).

Some useful properties of Stochastic Gradient Descent:

- It will converge eventually if L(w) is convex and the step size is chosen appropriately; see Figures 3 and 4.
- It deals easily with streaming data for online learning, and it is adaptive; see Figure 5.
- It deals easily with non-differentiable functions; see Figure 6.

[Figure 3: Behavior of gradient descent, from Wikipedia, the free encyclopedia (2007).]

The intuition is that SGD ≈ GD + noise:

    ∇_w L_i(w) ≈ ∇_w L(w) + noise.
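Here is the promised minimal SGD sketch for the linear regression example above (not from the lecture); the constant step size η, the iteration budget, and the synthetic data are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(d)
eta = 0.01  # constant step size; a decaying eta_t also satisfies the convergence conditions

for t in range(20_000):
    i = rng.integers(n)                      # pick i uniformly from [n]
    grad = -2 * X[i] * (y[i] - X[i] @ w)     # gradient of L_i(w) = (y_i - w^T x_i)^2
    w = w - eta * grad                       # w_{t+1} = w_t - eta_t * grad L_i(w_t)

print(w)       # close to w_true
print(w_true)
```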

[Figure 4: Behavior of stochastic gradient descent.]

[Figure 5: The SGD model shifts as new data arrive (the new data points are the larger ones).]

Perceptron

The perceptron is an algorithm for linear classification. If the data is linearly separable, the perceptron will find a linear separator.

[Figure 6: GD gets stuck at the flat point, but SGD does not.]

Consider the linear model

    f(x) = w^T x + w_0,    h(x) = sign(f(x)).

The loss function is

    L_i(w) = max(0, −y_i f(x_i)).

Then we have the (sub)gradient

    ∇_w L_i(w) =  0,          if y_i f(x_i) > 0
                  −y_i x_i,   if y_i f(x_i) < 0
                  0,          if y_i f(x_i) = 0.

The perceptron algorithm is then defined as follows: for i ∈ [n], the update rule in each iteration is

    w_{t+1} =  w_t + η_t y_i x_i,   if y_i f(x_i) < 0 (wrong prediction)
               w_t,                 if y_i f(x_i) > 0 (correct prediction).
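A minimal perceptron training loop following the update rule above (not from the lecture). The toy data, the step size η_t = 1, and the iteration limit are hypothetical; the code also updates on the boundary case y_i f(x_i) = 0, a common convention that lets learning start from w = 0.

```python
import numpy as np

# Hypothetical linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(X.shape[1])
w0 = 0.0
eta = 1.0  # the classic perceptron uses eta_t = 1

for _ in range(100):                    # iteration limit as the stopping condition
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + w0) <= 0:     # wrong (or boundary) prediction
            w = w + eta * yi * xi       # w_{t+1} = w_t + eta_t * y_i * x_i
            w0 = w0 + eta * yi
            mistakes += 1
    if mistakes == 0:                   # separable data: stop once a full pass makes no mistakes
        break

print(w, w0)
```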

Bibliography

Kim, E. (2013). (Left) the decision boundary w shown to be linear in 3-d space; (right) the decision boundary w, when transformed back to 2-d space, is nonlinear. [Online; accessed October 24, 2017]. URL http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html

Wikipedia, the free encyclopedia (2007). Illustration of gradient descent on a series of level sets. [Online; accessed October 24, 2017]. URL https://commons.wikimedia.org/wiki/file:gradient_descent.png