Functional Gradient Descent

Statistical Techniques in Robotics (16-831, F12)    Lecture #21 (Nov 14, 2012)
Functional Gradient Descent
Lecturer: Drew Bagnell    Scribe: Daniel Carlton Smith*

*Based on the scribe work of Abhinav Shrivastava, Varun Ramakrishna, Dave Rollinson, Daniel Munoz, Tomas Simon, Jack Singleton and Sergio Valcarcel.

1 Goal of Functional Gradient Descent

We have seen how to use online convex programming to learn linear functions by optimizing costs of the following form:

$$L(w) = \underbrace{\sum_i (y_i - w^T x_i)^2}_{\text{loss}} + \underbrace{\lambda \|w\|^2}_{\text{regularization/prior}}$$

We want to generalize this to learn over a space of more general functions $f : \mathbb{R}^n \to \mathbb{R}$. The high-level idea is to learn non-linear models using the same gradient-based approach used to learn linear models:

$$L(f) = \sum_i (y_i - f(x_i))^2 + \lambda \|f\|^2$$

Up until now we have only considered functions of the form $f(x) = w^T x$, but we will now extend this to a more general space of functions.

2 Functionals

In the case of functional gradient descent, we would like to work in a generalized space of functions, instead of a space of weights. In order to proceed, we will need some notion of functions of functions: functionals. A functional $E : f \mapsto \mathbb{R}$ maps a function $f \in \mathcal{H}_K$ to a real number. In contrast, an operator accepts a function and returns a function.

Operator: $E : f \mapsto f$
Functional: $E : f \mapsto \mathbb{R}$

As an example, let us write the terms of our loss function from above as functionals:

$$E_1[f] = \|f\|^2, \qquad E_2[f] = (y - f(x))^2, \qquad E[f] = \sum_i (y_i - f(x_i))^2 + \lambda \|f\|^2$$
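To make the function/functional/operator distinction concrete, here is a small illustrative sketch (not part of the original notes); the sample point, the shift operator, and the grid approximation of $\|f\|^2$ are assumptions chosen purely for illustration.

```python
import numpy as np

# A "function" f: R -> R, represented as an ordinary Python callable.
f = lambda x: np.sin(x)

# A functional maps a function to a real number.
# E_2[f] = (y - f(x))^2 for one fixed sample (x, y).
x_sample, y_sample = 0.5, 1.0
E2 = lambda func: (y_sample - func(x_sample)) ** 2

# E_1[f] = ||f||^2, approximated here by a Riemann sum on a grid.
xs = np.linspace(0.0, 2 * np.pi, 10_000)
dx = xs[1] - xs[0]
E1 = lambda func: dx * np.sum(func(xs) ** 2)

# An operator maps a function to another function, e.g. the shift (S f)(x) = f(x - 1).
shift = lambda func: (lambda x: func(x - 1.0))

print(E2(f))          # a real number
print(E1(f))          # approx. pi for f = sin on [0, 2*pi]
print(shift(f)(0.5))  # shift(f) is itself a function; only evaluating it yields a number
```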

Another simple example of a functional is the evaluation functional, $F_x[f]$, which evaluates $f(x)$ for a given value of $x$. Other examples include functionals that return the arc length of a function, or functionals that return the maximum value of a function.

3 Functional Gradients

A gradient can be thought of as:

- A vector of partial derivatives.
- The direction of steepest ascent ($\max f(x + \Delta x)$, where $\Delta x$ is small).
- The linear approximation of the function (or functional), i.e. $f(x_0 + \epsilon) = f(x_0) + \epsilon \underbrace{\nabla f(x_0)}_{\text{gradient}} + O(\epsilon^2)$.

We will use the third definition. A functional gradient $\nabla E[f]$ is defined implicitly as the linear term of the change in a functional due to a small perturbation $\epsilon g$ of its argument:

$$E[f + \epsilon g] = E[f] + \epsilon \langle \nabla E[f], g \rangle + O(\epsilon^2).$$

Before computing the gradients of these functionals, let us look at a few tools that will help us derive the gradient of the loss functional.

3.1 Chain Rule for Functional Gradients

Consider differentiable functions $C : \mathbb{R} \to \mathbb{R}$ that are functions of functionals $G$, i.e. $C(G[f])$. Our cost function $L[f]$ from before was such a function; these are precisely the functions we are interested in minimizing. The gradient of such a composition follows the chain rule:

$$\nabla C(G[f]) = \frac{\partial C(G[f])}{\partial G[f]} \, \nabla G[f] \qquad (1)$$

Example: If $C = (\|f\|^2)^3$, then $\nabla C = 3 (\|f\|^2)^2 (2f)$.

3.2 Another useful functional gradient

As stated above, one simple but useful functional that we frequently come across is the evaluation functional, which evaluates $f$ at the specified $x$:

$$E_x[f] = f(x)$$

Its gradient is $\nabla E_x = K(x, \cdot)$, since (using the reproducing property of the kernel $K$, introduced in Section 5, $g(x) = \langle K(x, \cdot), g \rangle$):

$$E_x[f + \epsilon g] = f(x) + \epsilon g(x) + 0 = f(x) + \epsilon \langle K(x, \cdot), g \rangle + 0 = E_x[f] + \epsilon \langle \nabla E_x, g \rangle + O(\epsilon^2)$$

It is called a linear functional because the expansion is exact: there is no higher-order term in the perturbation $\epsilon$.
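As a quick sanity check on the definition (this snippet is not from the original notes), we can discretize functions on a grid, take $\langle u, v \rangle = \Delta x \sum_i u_i v_i$ as the inner product, and verify numerically that the chain-rule example $C = (\|f\|^2)^3$, $\nabla C = 3(\|f\|^2)^2 (2f)$ captures the linear term of the perturbation; the particular choices of $f$ and $g$ below are arbitrary.

```python
import numpy as np

# Discretize functions on [0, 1] so that <u, v> = dx * sum(u * v).
xs = np.linspace(0.0, 1.0, 1001)
dx = xs[1] - xs[0]
inner = lambda u, v: dx * np.sum(u * v)

f = np.sin(2 * np.pi * xs)          # an arbitrary "function" f
g = np.exp(-xs)                     # an arbitrary perturbation direction g

# C[f] = (||f||^2)^3 and its claimed functional gradient 3(||f||^2)^2 (2f).
C = lambda u: inner(u, u) ** 3
grad_C = 3 * inner(f, f) ** 2 * (2 * f)

for eps in [1e-1, 1e-2, 1e-3]:
    lhs = C(f + eps * g) - C(f)           # actual change in the functional
    rhs = eps * inner(grad_C, g)          # predicted linear term
    print(eps, lhs, rhs, abs(lhs - rhs))  # the discrepancy shrinks like O(eps^2)
```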

4 Review of Kernels

Ultimately, we wish to learn a function $f : \mathbb{R}^n \to \mathbb{R}$ that assigns a meaningful score to a data point. For example, in binary classification we would like $f(\cdot)$ to return positive values on positive samples and negative values on negative samples.

A kernel $K : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ intuitively measures the correlation between $f(x_i)$ and $f(x_j)$. Considering a matrix $K$ with entries $K_{ij} = K(x_i, x_j)$, the matrix $K$ must satisfy the following properties:

- $K$ is symmetric ($K_{ij} = K_{ji}$).
- $K$ is positive-definite ($\forall x \in \mathbb{R}^n, x \neq 0 : x^T K x > 0$).

Hence, a valid kernel is the inner product: $K_{ij} = \langle x_i, x_j \rangle$.

A function can then be considered as a weighted composition of many kernels centered at various locations $x_i$:

$$f(\cdot) = \sum_{i=1}^{Q} \alpha_i K(x_i, \cdot), \qquad (2)$$

where $Q$ is the number of kernels that compose $f(\cdot)$ and $\alpha_i \in \mathbb{R}$ is each kernel's associated weight. All functions $f(\cdot)$ built from a kernel $K$ that satisfies the above properties and that can be written in the form of Equation 2 are said to lie in a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}_K$: $f \in \mathcal{H}_K$. However, to do gradient descent on the space of such functions, we need the notions of a distance, a norm, and an inner product. We formalize these by introducing the Reproducing Kernel Hilbert Space.

5 Reproducing Kernel Hilbert Space

The Reproducing Kernel Hilbert Space (RKHS), denoted by $\mathcal{H}_K$, is the space of functions $f(\cdot)$ that can be written as $\sum_i \alpha_i K(x_i, \cdot)$, where $K(x_i, x_j)$ satisfies the properties described above. To be able to manipulate objects in this space of functions, we will look at some key properties.

The inner product of $f, g \in \mathcal{H}_K$ is defined as

$$\langle f, g \rangle = \sum_i \sum_j \alpha_i \beta_j K(x_i, x_j) = \alpha^T K \beta,$$

where $f(\cdot) = \sum_i \alpha_i K(x_i, \cdot)$, $g(\cdot) = \sum_j \beta_j K(x_j, \cdot)$, $\alpha$ and $\beta$ are the vectors of $\alpha_i$ and $\beta_j$ components respectively, and $K$ is $n \times m$ (where $n$ is the number of centers $x_i$ in $f$ and $m$ the number in $g$) with $K_{ij} = K(x_i, x_j)$. Note that this inner product satisfies linearity (in both arguments):

$$\langle \lambda f, g \rangle = \lambda \langle f, g \rangle$$
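To make the kernel-matrix properties concrete, here is a small sketch (illustrative, not from the notes) that builds the Gram matrix for the linear kernel and checks symmetry and non-negativity of the eigenvalues; note that with more points than dimensions the linear-kernel Gram matrix is only positive semi-definite, so the check uses a small numerical tolerance.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Build K with K[i, j] = kernel(x_i, x_j) for all rows x_i, x_j of X."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

linear_kernel = lambda a, b: a @ b                                # K_ij = <x_i, x_j>
rbf_kernel = lambda a, b, gamma=1.0: np.exp(-np.sum((a - b) ** 2) / gamma)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = gram_matrix(X, linear_kernel)

print(np.allclose(K, K.T))              # symmetric
print(np.linalg.eigvalsh(K) >= -1e-10)  # eigenvalues non-negative (up to round-off)
```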

$$\langle f_1 + f_2, g \rangle = \langle f_1, g \rangle + \langle f_2, g \rangle$$

With this inner product, the norm is $\|f\|^2 = \langle f, f \rangle = \alpha^T K \alpha$. It is worth noting that $\langle f, g \rangle$ is large and positive when the two functions have a lot of overlap, i.e. they put bumps at similar places with agreeing weights; it is strongly negative when the weights oppose each other, and zero when the functions are orthogonal.

The reproducing property is observed by taking the inner product of a function with a kernel, $\langle f, K(x', \cdot) \rangle$, via the evaluation functional $E_{x'}$:

$$E_{x'}[f] = \langle f, K(x', \cdot) \rangle = \Big\langle \sum_i \alpha_i K(x_i, \cdot), K(\cdot, x') \Big\rangle = \sum_i \alpha_i \langle K(x_i, \cdot), K(\cdot, x') \rangle = \sum_i \alpha_i K(x_i, x') = \underbrace{f(x')}_{\text{eval at } x'}$$

An example of a valid kernel for $x \in \mathbb{R}^n$ is the inner product $K(x_i, x_j) = x_i^T x_j$. Intuitively, the kernel measures the correlation between $x_i$ and $x_j$. A very commonly used kernel is the RBF, or Radial Basis Function, kernel, which takes the form $K(x_i, x_j) = e^{-\frac{1}{\gamma} \|x_i - x_j\|^2}$. With this kernel in mind, a function can be viewed as a weighted (by $\alpha_i$) composition of bumps (the kernels) centered at the $Q$ locations $x_i$:

$$f(\cdot) = \sum_{i=1}^{Q} \alpha_i K(x_i, \cdot)$$

6 Loss Minimization

Again, let us consider our cost function defined over all functions $f$ in our RKHS; as before, our loss is:

$$L(f) = \sum_i (y_i - f(x_i))^2 + \lambda \|f\|^2$$

The purpose of $\langle f, f \rangle$ is to penalize the complexity of the solution $f$. Here it acts like the (negative) log of a Gaussian prior over functions. Intuitively, the probability can be thought of as being distributed according to $P(f) = \frac{1}{Z} e^{-\frac{1}{2} \langle f, f \rangle}$ (in practice this expression does not quite work, because $Z$ becomes infinite).

We want to find the function $f$ in our RKHS that minimizes this cost, and we will do so by moving in the direction of the negative gradient: $f \leftarrow f - \alpha \nabla L$. To do this, we first have to be able to express the gradient of a function of functions (i.e. a functional such as $L[f]$).

6.1 Functional gradient of the regularized least squares loss function

Let us look at the functional gradient of the second term of the loss function:

$$E[f] = \|f\|^2 \qquad (3)$$
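The RKHS inner product and the reproducing property can be checked numerically. The following sketch (not part of the original notes; the RBF bandwidth and the random centers and weights are arbitrary) represents $f$ and $g$ by their kernel centers and weights, computes $\langle f, g \rangle = \alpha^T K \beta$, and verifies that $\langle f, K(x', \cdot) \rangle = f(x')$.

```python
import numpy as np

gamma = 1.0
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / gamma)

rng = np.random.default_rng(1)
centers_f = rng.normal(size=(4, 2))     # centers x_i of f
centers_g = rng.normal(size=(3, 2))     # centers x_j of g
alpha = rng.normal(size=4)              # weights of f
beta = rng.normal(size=3)               # weights of g

def evaluate(centers, weights, x):
    """f(x) = sum_i alpha_i K(x_i, x)."""
    return sum(w * rbf(c, x) for c, w in zip(centers, weights))

# Inner product <f, g> = alpha^T K beta with K_ij = K(x_i, x_j).
K_fg = np.array([[rbf(ci, cj) for cj in centers_g] for ci in centers_f])
print("<f, g> =", alpha @ K_fg @ beta)

# Reproducing property: <f, K(x', .)> equals f(x').
x_prime = rng.normal(size=2)
K_fx = np.array([rbf(ci, x_prime) for ci in centers_f])
print(alpha @ K_fx, "==", evaluate(centers_f, alpha, x_prime))
```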

Expanding it out using a Taylor-series-type expansion:

$$E[f + \epsilon g] = \langle f + \epsilon g, f + \epsilon g \rangle = \|f\|^2 + 2 \langle f, \epsilon g \rangle + \epsilon^2 \|g\|^2 = \|f\|^2 + \epsilon \langle 2f, g \rangle + O(\epsilon^2)$$

We observe that

$$\nabla E[f] = \nabla \|f\|^2 = 2f \qquad (4)$$

Now for the first term of the loss function,

$$E[f] = \sum_i (y_i - f(x_i))^2. \qquad (5)$$

Using the chain rule we have

$$\nabla E[f] = \sum_i -2 (y_i - f(x_i)) \, \nabla (f(x_i)). \qquad (6)$$

We observe that $\nabla(f(x_i))$ is the gradient of the evaluation functional. Substituting in the gradient of the evaluation functional as computed in the previous section, we have:

$$\nabla E[f] = \sum_i -2 (y_i - f(x_i)) K(x_i, \cdot) \qquad (7)$$

7 Functional Gradient Descent

Regularized least squares loss function $L[f]$:

$$L[f] = \sum_i (y_i - f(x_i))^2 + \lambda \|f\|^2 = \sum_i (y_i - E_{x_i}[f])^2 + \lambda \|f\|^2$$

$$\nabla L[f] = \sum_i -2 (y_i - f(x_i)) K(x_i, \cdot) + 2 \lambda f$$

Update rule for the regularized least squares loss function, using the single sample $(x_t, y_t)$ observed at time step $t$:

$$f_{t+1} \leftarrow f_t - \eta_t \nabla L = f_t - \eta_t \big( -2 (y_t - f_t(x_t)) K(x_t, \cdot) + 2 \lambda f_t \big) = f_t (1 - 2 \eta_t \lambda) + 2 \eta_t (y_t - f_t(x_t)) K(x_t, \cdot),$$

where $\eta_t$ is the learning rate at time step $t$. The update rule is therefore equivalent to (see the sketch below):

- Adding a kernel $K(x_t, \cdot)$ weighted by $2 \eta_t (y_t - f_t(x_t))$.
- Shrinking all other weights by a $(1 - 2 \eta_t \lambda)$ multiplier.
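Here is a minimal sketch of this online regularized least-squares update (my own illustration, not from the notes), representing $f_t$ by its list of kernel centers and weights; the RBF kernel, the constant step size, and the toy data stream are assumptions chosen for illustration.

```python
import numpy as np

gamma, lam, eta = 0.5, 0.01, 0.1
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / gamma)

centers, weights = [], []               # f_t(.) = sum_i weights[i] * K(centers[i], .)

def f_eval(x):
    return sum(w * rbf(c, x) for c, w in zip(centers, weights))

rng = np.random.default_rng(0)
for t in range(200):                    # online stream of samples (x_t, y_t)
    x_t = rng.uniform(-3, 3, size=1)
    y_t = np.sin(x_t[0]) + 0.1 * rng.normal()

    residual = y_t - f_eval(x_t)
    # Shrink all existing weights by (1 - 2*eta*lam) ...
    weights = [(1 - 2 * eta * lam) * w for w in weights]
    # ... and add a new kernel centered at x_t with weight 2*eta*(y_t - f_t(x_t)).
    centers.append(x_t)
    weights.append(2 * eta * residual)

# The learned f should roughly track sin near 0 on this toy stream.
print("f(0) ~", f_eval(np.array([0.0])), " true:", np.sin(0.0))
```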

SVM loss function $L(f)$:

$$L(f(x_t), y_t) = \max(0, 1 - y_t f(x_t)) + \lambda \|f\|^2 \qquad (8)$$

The sub-gradient $\nabla L$ has two cases: one where the prediction is correct by a margin of at least 1, and one where it is not (a margin error). For the hinge term,

$$\nabla \max(0, 1 - y_t f(x_t)) = \begin{cases} 0 & \text{if } 1 - y_t f(x_t) \le 0 \\ -y_t K(x_t, \cdot) & \text{otherwise (margin error)} \end{cases} \qquad (9)$$

Together with the gradient $2 \lambda f$ of the regularizer, the resulting update rule is equivalent to (see the sketch below):

- Adding a kernel $K(x_t, \cdot)$ weighted by $\eta_t y_t$ in case of a margin error.
- Shrinking all other weights by a $(1 - 2 \eta_t \lambda)$ multiplier.
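A corresponding sketch of the online kernelized SVM update (again my own illustration, using the same function representation as the regression sketch above; the toy labeling rule and hyperparameters are assumptions):

```python
import numpy as np

gamma, lam, eta = 0.5, 0.01, 0.5
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / gamma)

centers, weights = [], []               # f_t(.) = sum_i weights[i] * K(centers[i], .)
f_eval = lambda x: sum(w * rbf(c, x) for c, w in zip(centers, weights))

rng = np.random.default_rng(0)
for t in range(300):                    # online stream of labeled points
    x_t = rng.normal(size=2)
    y_t = 1.0 if x_t[0] + x_t[1] > 0 else -1.0   # toy labeling rule

    margin = y_t * f_eval(x_t)
    # Always shrink existing weights by (1 - 2*eta*lam), from the regularizer.
    weights = [(1 - 2 * eta * lam) * w for w in weights]
    if margin < 1:                      # margin error: add a kernel weighted by eta*y_t
        centers.append(x_t)
        weights.append(eta * y_t)

x_test = np.array([1.0, 1.0])
print("sign(f(x_test)) =", np.sign(f_eval(x_test)))   # expect +1 under the toy rule
```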