LMS Algorithm Summary


Step size tradeoff

Other Iterative Algorithms
LMS algorithm with variable step size: w(k+1) = w(k) + μ(k) e(k) x(k). When the step size is μ(k) = μ/k, the algorithm converges almost surely to the optimal weights.
Conjugate gradient (CG) algorithm: gradients and line searches are used to form conjugate directions; the algorithm converges in n steps.
Newton's algorithm: let g(n) be the gradient and H(n) the Hessian at w(n), and approximate the energy function by J(w) ≈ J(w(n)) + (w − w(n))^T g(n) + ½ (w − w(n))^T H(n) (w − w(n)). Taking the gradient of the approximation and setting it to zero gives w(n+1) = w(n) − H(n)^{-1} g(n). The algorithm involves inverting the Hessian matrix (costly).
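As a rough illustration (not part of the slides), here is a minimal NumPy sketch of the variable-step LMS update and the Newton step; the function names, data shapes, and default step size are illustrative assumptions.

```python
import numpy as np

def lms_variable_step(X, d, mu=0.1):
    """Sketch of LMS with decaying step size mu(k) = mu/k.
    Rows of X are the inputs x(k); d holds the desired outputs."""
    w = np.zeros(X.shape[1])
    for k, (x, target) in enumerate(zip(X, d), start=1):
        e = target - w @ x            # instantaneous error e(k)
        w = w + (mu / k) * e * x      # w(k+1) = w(k) + mu(k) e(k) x(k)
    return w

def newton_step(w, g, H):
    """One Newton step w(n+1) = w(n) - H(n)^{-1} g(n) (solve rather than invert)."""
    return w - np.linalg.solve(H, g)
```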

Recursive Least Squares Algorithm
We can develop an on-line version of the LS algorithm, called the Recursive LS (RLS) algorithm. The algorithm is based on the Sherman-Morrison-Woodbury formula: (A + v v^T)^{-1} = A^{-1} − A^{-1} v (1 + v^T A^{-1} v)^{-1} v^T A^{-1}, where A = X^T X contains the old data and v = x(m+1) contains the new data at time m+1. This is similar to the Kalman filter equations, where the estimate is updated recursively by adding new information or innovations. Each update is O(n^2) operations.
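A minimal sketch of one exponentially weighted RLS update built on the matrix-inversion identity above; the variable names and the forgetting factor lam are assumptions for illustration.

```python
import numpy as np

def rls_update(P, w, x, d, lam=1.0):
    """One RLS step: P approximates the inverse correlation matrix,
    lam is the forgetting factor; cost is O(n^2) per update."""
    Px = P @ x
    k = Px / (lam + x @ Px)            # gain vector (Sherman-Morrison)
    e = d - w @ x                      # a priori error (innovation)
    w = w + k * e                      # weight update
    P = (P - np.outer(k, Px)) / lam    # recursive inverse-correlation update
    return P, w

# Typical initialization: w = 0 and P = delta * I with a large delta.
```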

RLS Algorithm Comments
Often an exponentially weighted version of the algorithm is implemented: update the correlation matrix, the gain factor, and the weights. Parameters of the RLS algorithm: the initial correlation matrix and the forgetting (weight-decay) factor. Convergence is typically an order of magnitude faster than the LMS algorithm. The algorithm theoretically converges to zero excess mean squared error, and its convergence does not depend on the eigenvalue spread of the input correlation matrix. Many variations exist for more stable matrix computations, e.g., QR and Cholesky factorizations.

Nonlinear Methods
Multilayer feedforward networks: error back-propagation algorithm. Kernel methods: Support Vector Machines (SVM), least squares methods, Radial Basis Functions (RBF).

Kernel Methods
In many classification and detection problems a linear classifier is not sufficient. However, working in higher dimensions can lead to the curse of dimensionality. Solution: use kernel methods, where the computations are done in the dual (observation) space.
(Figure: a map Φ: X → Z from the input space to the feature space.)

Linear SVM (dual representation)
The SVM quadratic programming problem involves computing inner products:
max W(α) = Σ_i α(i) − ½ Σ_i Σ_j α(i) α(j) d(i) d(j) (x(i)^T x(j))
subject to α(i) ≥ 0 and Σ_i α(i) d(i) = 0.
The hyperplane decision function can be written as f(x) = sgn(Σ_i α(i) d(i) x^T x(i) + b).
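A small sketch of evaluating this dual decision function with NumPy, given support vectors, labels, multipliers and bias (all names here are illustrative):

```python
import numpy as np

def linear_svm_decision(x, X_sv, d_sv, alpha_sv, b):
    """f(x) = sgn( sum_i alpha(i) d(i) x^T x(i) + b ) over the support vectors."""
    return np.sign((alpha_sv * d_sv) @ (X_sv @ x) + b)
```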

Polynomial Transformations
Φ is a transformation from the input space to a higher-order polynomial space. Example: let x = (x_1, x_2)^T; then a quadratic polynomial can be represented as Φ(x) = (x_1^2, x_2^2, √2 x_1 x_2)^T, and the computations can be expressed through kernels using inner products: k(x,z) = Φ(x)^T Φ(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 z_1 x_2 z_2 = (x^T z)^2. With polynomials the feature-space dimension grows quickly, but the kernel trick allows us to do the computations using the kernel function k(x,z) = (x^T z)^2 without ever forming the transformation Φ(x).
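A quick numerical check of the kernel trick for this quadratic map (the values are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for x = (x1, x2)^T."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# Inner product in feature space equals the kernel (x^T z)^2 in input space.
assert np.isclose(phi(x) @ phi(z), (x @ z) ** 2)
```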

Mercer Kernels
Φ is a transformation from the input space R^n to a feature space R^d (which could be infinite dimensional). We consider Mercer kernels, which can be represented as K(x,z) = Σ_i λ_i φ_i(x) φ_i(z) = Φ(x)^T Φ(z), where the kernel is positive semi-definite: ∫∫ K(x,z) g(x) g(z) dx dz ≥ 0 for any square-integrable function g. The kernel trick allows us to do computations using the kernel function K(x,z) without knowledge of the transformation Φ(x).
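An empirical (finite-sample) illustration of the Mercer condition: the Gram matrix of a valid kernel on any data set is symmetric positive semidefinite. A sketch with an assumed quadratic kernel and random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = (X @ X.T + 1.0) ** 2                     # Gram matrix of the kernel (x^T z + 1)^2
assert np.allclose(K, K.T)                   # symmetry
assert np.linalg.eigvalsh(K).min() > -1e-8   # eigenvalues (numerically) nonnegative
```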

Examples of Kernel Functions
Homogeneous polynomial kernel of order p: K(x,z) = (x^T z)^p.
Inhomogeneous polynomial kernel of order p: K(x,z) = (x^T z + 1)^p.
Gaussian kernel: K(x,z) = exp(−||x − z||^2 / (2σ^2)).
Sigmoidal kernel: K(x,z) = tanh(a x^T z − b) for some values of a and b.
Bilinear transformation: K(x,z) = f(x) f(z).
Other kernels: splines, string kernels, K(x,z) = min(x,z).
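These kernels can be written down directly; a sketch with assumed default parameters (p, sigma, a, b):

```python
import numpy as np

def poly_hom(x, z, p=2):           return (x @ z) ** p                 # (x^T z)^p
def poly_inhom(x, z, p=2):         return (x @ z + 1.0) ** p           # (x^T z + 1)^p
def gaussian(x, z, sigma=1.0):     return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
def sigmoidal(x, z, a=1.0, b=0.0): return np.tanh(a * (x @ z) - b)     # Mercer only for some a, b
def bilinear(x, z, f=np.sum):      return f(x) * f(z)                  # K(x,z) = f(x) f(z)
```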

Kernel Properties
Symmetry: K(x,z) = K(z,x).
Schwarz inequality: |K(x,z)| ≤ (K(x,x) K(z,z))^{1/2}.
Closure properties.
Reproducing Kernel Hilbert Spaces (RKHS).
Autocorrelation function of a second-order process (Gaussian processes).

Closure Properties and Feature Space Representation
Let K_1(x,z) and K_2(x,z) be kernel functions.
Addition, K(x,z) = K_1(x,z) + K_2(x,z): Φ(x) = (Φ_1(x), Φ_2(x))^T.
Positive scaling, K(x,z) = α K_1(x,z) for α > 0: Φ(x) = α^{1/2} Φ_1(x).
Product, K(x,z) = K_1(x,z) K_2(x,z): Φ(x)(i,j) = Φ_1(x)(i) Φ_2(x)(j) (tensor product).
Symmetric positive semidefinite matrix A, K(x,z) = x^T A z: Φ(x) = L^T x, where A = L L^T is the Cholesky factorization.
Bilinear transformation, K(x,z) = f(x) f(z): Φ(x) = f(x).
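A quick numerical illustration of these closure properties (a sketch, not a proof): sums, positive scalings, and elementwise products of valid Gram matrices stay positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(15, 2))
K1 = X @ X.T                                                 # linear kernel
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # pairwise ||x - z||^2
K2 = np.exp(-sq / 2.0)                                       # Gaussian kernel
for K in (K1 + K2, 3.0 * K1, K1 * K2):
    assert np.linalg.eigvalsh(K).min() > -1e-8               # still PSD
```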

Support Vector Machine
The optimal margin classifier with slack variables and kernel functions is the Support Vector Machine (SVM):
min_{w,ξ} ½||w||^2 + C Σ_i ξ_i subject to ξ_i ≥ 0 for all i, d(i)(w^T Φ(x(i)) + b) ≥ 1 − ξ_i for all i, and C > 0 (hinge loss function).
In the dual space:
max W(α) = Σ_i α(i) − ½ Σ_i Σ_j α(i) α(j) d(i) d(j) K(x(i), x(j))
subject to 0 ≤ α(i) ≤ C and Σ_i α(i) d(i) = 0.
The weights can be found from w = Σ_i α(i) d(i) Φ(x(i)).
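For illustration, scikit-learn's SVC solves this soft-margin kernel dual; the data set and the values of C and gamma below are arbitrary assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
d = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)      # labels needing a nonlinear boundary

clf = SVC(C=1.0, kernel="rbf", gamma=1.0)       # K(x,z) = exp(-gamma ||x - z||^2)
clf.fit(X, d)
# clf.dual_coef_ stores alpha(i) d(i) for the support vectors,
# clf.support_vectors_ the x(i), and clf.intercept_ the bias b.
print(clf.predict([[1.0, 1.0], [-1.0, 1.0]]))
```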

Representation of the decision surface
In the primal space the decision surface is a linear hyperplane in the feature space and can be represented as f(x) = sgn(w^T Φ(x) + b). In the dual space the decision surface can be represented via kernels and Lagrange multipliers as f(x) = sgn(Σ_i α(i) d(i) K(x, x(i)) + b).

Dual representation of SVM

Other Kernel Applications
Multiclass problems. Regression problems (ε-insensitive loss function). Quadratic cost functions with equality constraints: LS-SVM pattern classification and Kernel Fisher Discriminant Analysis; LS-SVM regression and kernel ridge regression. Unsupervised learning: Principal Component Analysis (PCA) and Kernel PCA.

SVM Regression
For regression problems the target output is y ∈ R. Estimates will have error, and we need to decide what error criterion to use. Let the error be e = d − f(x); then the ε-insensitive cost criterion is defined as (|e| − ε) u(|e| − ε), where u is the unit step function.
(Figure: the ε-insensitive cost is zero for |e| ≤ ε and grows linearly outside the interval [−ε, ε].)
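The ε-insensitive cost is easy to write down directly (a sketch; eps is an assumed tube width):

```python
import numpy as np

def eps_insensitive(e, eps=0.1):
    """(|e| - eps) u(|e| - eps): zero inside the eps-tube, linear outside."""
    return np.maximum(np.abs(e) - eps, 0.0)
```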

SVM Regression QP Formulation
min_{w,b,ξ,ξ*} ½||w||^2 + C Σ_i (ξ_i + ξ_i*)
subject to ξ_i ≥ 0, d(i) − w^T Φ(x(i)) − b ≤ ε + ξ_i,
ξ_i* ≥ 0, w^T Φ(x(i)) + b − d(i) ≤ ε + ξ_i*, for 1 ≤ i ≤ m.
Points that lie outside the ε-tube are the support vectors.

SVM Regression QP Dual Space Representation
As in SVM classification, we use Lagrange multipliers to add the inequality constraints to the objective function and then take partial derivatives to get the following dual space representation:
max W(α, α*) = Σ_i d(i)(α(i) − α(i)*) − ½ Σ_i Σ_j (α(i) − α(i)*)(α(j) − α(j)*) K(x(i), x(j)) − ε Σ_i (α(i) + α(i)*)
subject to 0 ≤ α(i) ≤ C, 0 ≤ α(i)* ≤ C, and Σ_i (α(i) − α(i)*) = 0.
The dual space representation of the estimate is f(x) = Σ_i (α(i) − α(i)*) K(x, x(i)) + b.
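scikit-learn's SVR solves this dual problem; a hedged sketch with arbitrary data and parameter values:

```python
import numpy as np
from sklearn.svm import SVR

X = np.linspace(-3.0, 3.0, 80).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.default_rng(3).normal(size=80)

reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)     # epsilon-tube width and penalty C
reg.fit(X, y)
# reg.dual_coef_ holds (alpha(i) - alpha(i)*) for the support vectors.
print(reg.predict([[0.5]]))
```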

Least Squares SVM Regression
Consider changing the SVM to the LS-SVM by making the following modifications:
min_{w,e} ½||w||^2 + ½C Σ_i e(i)^2 subject to d(i) − (w^T Φ(x(i)) + b) = e(i) for all i, with C > 0.
Note that e(i) is the error term. Key differences between the SVM and the LS-SVM: the ε-insensitive cost is replaced by a quadratic error cost, and the inequality constraints are replaced by equality constraints.

Primal Solution
Substitute for e(i), take partial derivatives of the objective function with respect to w and b, and set them to zero. This yields the least squares solution given by
(R(l) + I/(lC)) w = P(l)
m_X(l)^T w + b = m_Y(l)
where m_X(l) and m_Y(l) are the sample means of Φ(X) and Y, respectively, and R(l) and P(l) are the sample autocorrelation of Φ(X) and the crosscorrelation of Φ(X) and Y, respectively. If (R(l) + I/(lC)) is nonsingular we have w = (R(l) + I/(lC))^{-1} P(l), which is the MSE solution plus a regularization term.
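A sketch of this primal solution in NumPy, under the assumption that R(l) and P(l) are computed from mean-centered features and targets (made explicit in the comments):

```python
import numpy as np

def ls_svm_primal(Phi, y, C=10.0):
    """Solve (R(l) + I/(lC)) w = P(l), then b from m_X^T w + b = m_Y.
    Assumes R, P are sample correlations of mean-centered Phi and y."""
    l, dim = Phi.shape
    mX, mY = Phi.mean(axis=0), y.mean()
    Phic, yc = Phi - mX, y - mY
    R = Phic.T @ Phic / l                           # sample autocorrelation of Phi
    P = Phic.T @ yc / l                             # crosscorrelation of Phi and y
    w = np.linalg.solve(R + np.eye(dim) / (l * C), P)
    b = mY - mX @ w
    return w, b
```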

Finding the Dual Solution
Introduce Lagrange multipliers:
L(w,b,e,α) = ½||w||^2 + ½C Σ_i e(i)^2 − Σ_i α(i) (w^T Φ(x(i)) + b + e(i) − d(i)),
where the α(i) are unrestricted in sign, since the constraints are equalities.

KKT Conditions
Again take partial derivatives and set them to zero: ∂L/∂w = 0, ∂L/∂b = 0, ∂L/∂α(i) = 0, ∂L/∂e(i) = 0. We therefore have
w = Σ_i α(i) Φ(x(i))
Σ_i α(i) = 0
α(i) = C e(i), 1 ≤ i ≤ m
d(i) − (w^T Φ(x(i)) + b) − e(i) = 0, 1 ≤ i ≤ m.

Dual Solution to the LS-SVM
Let α be the vector of Lagrange multipliers and d the vector of outputs; then the solution has the following form:
[ 0   1^T     ] [ b ]   [ 0 ]
[ 1   K + I/C ] [ α ] = [ d ]
where K(i,j) = Φ(x(i))^T Φ(x(j)) and 1 denotes the vector of ones. The resulting estimate is f(x) = Σ_i α(i) K(x, x(i)) + b.
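The dual system is a single linear solve; a sketch given a precomputed kernel matrix K and a target vector d:

```python
import numpy as np

def ls_svm_dual(K, d, C=10.0):
    """Solve [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; d] for b and alpha."""
    m = K.shape[0]
    ones = np.ones((m, 1))
    A = np.block([[np.zeros((1, 1)), ones.T],
                  [ones, K + np.eye(m) / C]])
    sol = np.linalg.solve(A, np.concatenate(([0.0], d)))
    return sol[0], sol[1:]                        # bias b, multipliers alpha

# Prediction: f(x) = sum_i alpha(i) K(x, x(i)) + b
```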

Comments about the LS-SVM
The solution to the LS-SVM depends on d, the dimensionality of the feature space Φ(x), in the primal space, and on m, the number of training samples, in the dual space. Both solutions involve solving a set of linear equations; work in whichever space has the lower dimension. Adaptive on-line solutions can now be implemented, and the algorithm is easily constructed for pattern classification problems. In the dual space practically all training examples are support vectors, since the Lagrange multipliers α are proportional to the errors e.

Similarity to other methods
Kernel Fisher Discriminant Analysis, Proximal Support Vector Machines, Radial Basis Functions, Gaussian Processes, Kriging.

LS-SVM Solution
The solution in the primal or dual space involves solving a set of d+1 or m+1 linear equations, respectively. Dual space solution: unlike the SVM solution, all training examples are support vectors. Objectives: good performance with low to moderate computational complexity. On-line vs. batch. Sparseness: reduced system, subspace methods; criteria for choosing support vectors. Numerical stability: matrix computations.