Plan of Class 4. Radial Basis Functions with moving centers. Projection Pursuit Regression and ridge. Principal Component Analysis: basic ideas

Plan of Class 4
- Radial Basis Functions with moving centers
- Multilayer Perceptrons
- Projection Pursuit Regression and ridge function approximation
- Principal Component Analysis: basic ideas
- Radial Basis Functions and dimensionality reduction
- From Additive Splines to ridge function approximation

Regularization approach
A smooth function that approximates the data set D = \{(x_i, y_i) \in R^d \times R\}_{i=1}^N can be found by minimizing the functional

H[f] = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda \int_{R^d} ds\, \frac{|\tilde f(s)|^2}{\tilde G(s)}

where \tilde G is a positive, even function, decreasing to zero at infinity, and \lambda is a small, positive number.

Main result (Duchon, 1977; Meinguet, 1979; Wahba, 1977; Madych and Nelson, 1990; Poggio and Girosi, 1989; Girosi, 1992)
The function that minimizes the functional

H[f] = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda \int_{R^d} ds\, \frac{|\tilde f(s)|^2}{\tilde G(s)}

has the form

f(x) = \sum_{i=1}^N c_i G(x - x_i) + \sum_{\alpha=1}^k d_\alpha \psi_\alpha(x)

where G is conditionally positive definite of order m, \tilde G is the Fourier transform of G, and \{\psi_\alpha\}_{\alpha=1}^k is a basis for the space of polynomials of degree m - 1.

Computation of the coefficients
The coefficients are found by solving the linear system

(G + \lambda I)c + \Psi^T d = y
\Psi c = 0

where I is the identity matrix, and we have defined

(y)_i = y_i,  (c)_i = c_i,  (d)_\alpha = d_\alpha,  (G)_{ij} = G(x_i - x_j),  (\Psi)_{\alpha i} = \psi_\alpha(x_i)

Radial Basis Functions
If the function \tilde G is radial (\tilde G = \tilde G(\|s\|)), the smoothness functional is rotationally invariant, that is

\Phi[f(x)] = \Phi[f(Rx)] \quad \forall R \in O(d)

where O(d) is the group of rotations in d dimensions. The regularization solution becomes the so-called Radial Basis Functions technique:

f(x) = \sum_{i=1}^N c_i G(\|x - x_i\|) + p(x)

Radial Basis Functions
Define r = \|x\|:

G(x) = e^{-r^2}                 (Gaussian)
G(x) = \sqrt{r^2 + c^2}         (multiquadric)
G(x) = 1/\sqrt{c^2 + r^2}       (inverse multiquadric)
G(x) = r^{2n+1}                 (multivariate splines)
G(x) = r^{2n} \ln r             (multivariate splines)
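As a quick illustration (not part of the original slides), here is a minimal sketch of a regularization network with a Gaussian basis function and one center per data point. Since the Gaussian is positive definite, the polynomial term p(x) can be dropped and the coefficients come from the linear system (G + \lambda I)c = y of the previous slides; the toy data, \sigma and \lambda below are my own placeholders.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    # Gaussian basis function G(r) = exp(-r^2 / sigma^2)
    return np.exp(-(r / sigma) ** 2)

def fit_rbf(X, y, lam=1e-2, sigma=1.0):
    # Gram matrix G_ij = G(||x_i - x_j||), one center per data point
    G = gaussian(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1), sigma)
    # coefficients solve (G + lambda * I) c = y
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def predict_rbf(X_train, c, X_new, sigma=1.0):
    K = gaussian(np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=-1), sigma)
    return K @ c

# toy 1-d example: noisy samples of a smooth function
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
c = fit_rbf(X, y)
print(predict_rbf(X, c, np.array([[0.0], [1.5]])))
```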

Some good properties of RBF
- Very well motivated in the framework of regularization theory
- The solution is unique and equivalent to solving a linear system
- The amount of smoothness is tunable (with \lambda)
- Radial Basis Functions are universal approximators
- Huge amount of literature on this subject (interpretation in terms of "neural networks")
- Biologically plausible
- Simple interpretation in terms of a "smooth look-up table"
- Similar to other non-parametric techniques, such as nearest neighbor and kernel regression

Some not-so-good properties of RBF
- Computationally expensive (O(N^3), where N = number of data points)
- The linear system is often badly ill-conditioned
- The same amount of smoothness is imposed on different regions of the domain

(Figure: a function with different smoothness properties in different regions of its domain.)

Least Squares Regularization Networks
We look for an approximation to the regularization solution

f(x) = \sum_{i=1}^N c_i G(x - x_i)

of the form

f^*(x) = \sum_{\alpha=1}^n c_\alpha G(x - t_\alpha)

where n << N and the vectors t_\alpha are called centers. (Broomhead and Lowe, 1988; Moody and Darken, 1989; Poggio and Girosi, 1989)

Least Squares Regularization Networks

f^*(x) = \sum_{\alpha=1}^n c_\alpha G(x - t_\alpha)

Suppose the centers t_\alpha have been fixed. How do we find the coefficients c_\alpha?  =>  Least Squares

Least Squares Regularization Networks
Define

E(c_1, \ldots, c_n) = \sum_{i=1}^N (y_i - f^*(x_i))^2

The least squares criterion is

\min_{c} E(c_1, \ldots, c_n)

The problem is convex and quadratic in the c_\alpha, and the solution satisfies

\frac{\partial E}{\partial c_\alpha} = 0

Least Squares Regularization Networks

\frac{\partial E}{\partial c_\alpha} = 0 \;\Rightarrow\; G^T G\, c = G^T y

where we have defined

(y)_i = y_i,  (c)_\alpha = c_\alpha,  (G)_{i\alpha} = G(x_i - t_\alpha)

Therefore c = G^+ y, where G^+ = (G^T G)^{-1} G^T is the pseudoinverse of G.
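A minimal sketch of this least squares step, assuming a Gaussian basis function and centers taken as a random subset of the examples (both choices are mine, for illustration). The coefficients are the least squares solution c = G^+ y, computed here with np.linalg.lstsq rather than by forming (G^T G)^{-1} explicitly.

```python
import numpy as np

def design_matrix(X, centers, sigma=1.0):
    # (G)_{i,alpha} = G(||x_i - t_alpha||) for a Gaussian basis function
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    return np.exp(-(d / sigma) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

# centers: a random subset of the examples (choice 1 on the next slide)
centers = X[rng.choice(len(X), size=15, replace=False)]
G = design_matrix(X, centers)
c, *_ = np.linalg.lstsq(G, y, rcond=None)   # same minimizer as c = G^+ y
f_star = G @ c                              # fitted values f*(x_i)
print(np.mean((y - f_star) ** 2))
```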

Least Squares Regularization Networks
Given the centers t_\alpha, we know how to find the c_\alpha. How do we choose the t_\alpha?
1. as a subset of the examples
2. by a clustering algorithm (k-means, for example)
3. by least squares ("moving centers")

Centers as a subset of the examples
A fair technique. The subset is a random subset, which should reflect the distribution of the data. Not many theoretical results available. Main problem: how many centers? Main answer: we don't know. Cross validation techniques seem a reasonable choice.

Clustering in R^d
Clustering a set of data points \{x_i\}_{i=1}^N means finding a set of m representative points \{m_k\}_{k=1}^m. Let S_k be the Voronoi polytope with center m_k:

S_k = \{x : \|x - m_k\| < \|x - m_j\|, \; j \neq k\}

A set of representative points M = \{m_k\}_{k=1}^m is optimal if it solves

\min_M \sum_{k=1}^m \sum_{x_i \in S_k} \|x_i - m_k\|^2

K-means algorithm
The k-means algorithm (MacQueen, 1967) is an iterative technique for finding optimal representative points (cluster centers):

m_k^{(t+1)} = \frac{1}{\# S_k^{(t)}} \sum_{x_i \in S_k^{(t)}} x_i

This algorithm is guaranteed to find a local minimum. Many variations of this algorithm have been proposed.
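A minimal sketch of this iteration (assign each point to its nearest center, then move each center to the mean of its Voronoi cell); the toy two-dimensional data are made up.

```python
import numpy as np

def kmeans(X, m, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), m, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: index of the closest center for every point
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :],
                                          axis=-1), axis=1)
        # update step: m_k <- mean of the points in S_k (keep empty cells fixed)
        for k in range(m):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (-2, 0, 2)])
centers, labels = kmeans(X, m=3)
print(centers)
```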

K-means algorithm
(Figures: three successive iterations of the algorithm, showing how the clusters S_1, S_2, S_3 evolve.)

Finding the centers by clustering
Very common. However, it makes sense only if the input data points are clustered. No theoretical results. It is not clear that it is a good idea, especially for pattern classification problems.

Moving centers
Define

E(c_1, \ldots, c_n, t_1, \ldots, t_n) = \sum_{i=1}^N (y_i - f^*(x_i))^2

The least squares criterion is

\min_{c, t} E(c_1, \ldots, c_n, t_1, \ldots, t_n)

The problem is not convex and quadratic anymore: expect multiple local minima.

Moving centers
:-) Very flexible, in principle very powerful
:-) Some theoretical understanding
:-( Very expensive computationally due to the local minima problem
:-( Centers sometimes move in "weird" ways

Some new approximation schemes
Radial Basis Functions with moving centers is a particular case of a function approximation technique of the form

f(x) = \sum_{\alpha=1}^n c_\alpha H(x; p_\alpha)

where \{p_\alpha\}_{\alpha=1}^n is a set of parameters, which can be estimated by least squares techniques. Radial Basis Functions corresponds to the choice

H(x; p_\alpha) = G(\|x - p_\alpha\|)

Neural Networks
Neural networks correspond to a different choice. We set p_\alpha \equiv (w_\alpha, \theta_\alpha) and choose

H(x; p_\alpha) = \sigma(x \cdot w_\alpha + \theta_\alpha)

where \sigma is a sigmoidal function. The resulting approximation scheme is

f(x) = \sum_{\alpha=1}^n c_\alpha\, \sigma(x \cdot w_\alpha + \theta_\alpha)

and it is called a Multilayer Perceptron with one layer of hidden units.

Examples of sigmoidal functions

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma(x) = \tanh(x)

Notice that when the slope of the sigmoid (e.g. the \beta in \sigma(x) = 1/(1 + e^{-\beta x})) is very large, the sigmoid approaches a step function.
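For concreteness, a minimal sketch (not from the slides) of evaluating a one-hidden-layer perceptron f(x) = \sum_\alpha c_\alpha \sigma(x \cdot w_\alpha + \theta_\alpha) with the logistic sigmoid. The weights are random placeholders, since fitting them is the non-convex least squares problem discussed below.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mlp(x, W, theta, c):
    # W has one row w_alpha per hidden unit; theta and c are vectors of length n
    return sigmoid(x @ W.T + theta) @ c

rng = np.random.default_rng(0)
d, n = 3, 5                       # input dimension, number of hidden units
W = rng.standard_normal((n, d))   # placeholder weights (would be fit by least squares)
theta = rng.standard_normal(n)
c = rng.standard_normal(n)
x = rng.standard_normal((10, d))  # a small batch of inputs
print(mlp(x, W, theta, c).shape)  # (10,)
```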

One-hidden-layer Perceptron (figure)

Multilayer Perceptrons
:-) Approximation schemes of this type have been successful in a number of cases
:-( The interpretation of the approximation technique is not immediate
:-) Multilayer Perceptrons are "universal approximators"
:-) Some theoretical results are available
:-( Computationally expensive, local minima
:-| What is the motivation for this technique?

Multilayer perceptrons are particular cases of the more general ridge function approximation techniques

f(x) = \sum_{\alpha=1}^n h_\alpha(x \cdot w_\alpha)

in which we have chosen

h_\alpha(x) = c_\alpha\, h(x + \theta_\alpha)

for some fixed function h (the sigmoid).

Projection Pursuit Regression
(Friedman and Stuetzle, 1981; Huber, 1986)
Look for an approximating function of the form

f(x) = \sum_{\alpha=1}^n h_\alpha(x \cdot w_\alpha)

where the w_\alpha are unit vectors and the h_\alpha are functions to be fitted to the data (often cubic splines, trigonometric polynomials or Radial Basis Functions). The underlying idea is that all the information lies in a few one-dimensional projections of the data. The unit vectors w_\alpha are the "interesting projections" that have to be pursued.

PPR algorithm
Assume that the first \alpha - 1 terms of the expansion have been determined, and define the residuals

r_i^{(\alpha-1)} = y_i - \sum_{\beta=1}^{\alpha-1} h_\beta(x_i \cdot w_\beta)

Find the \alpha-th vector w_\alpha and the \alpha-th function h_\alpha that solve

\min_{w_\alpha, h_\alpha} \sum_{i=1}^N \left( r_i^{(\alpha-1)} - h_\alpha(x_i \cdot w_\alpha) \right)^2

Go back to the first step and iterate.
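A rough sketch of this loop, with two simplifications that are mine rather than the slides': each h_\alpha is a cubic polynomial fitted with np.polyfit, and each direction w_\alpha is chosen by a crude random search over unit vectors instead of a proper optimization.

```python
import numpy as np

def fit_ppr(X, y, n_terms=3, n_candidates=500, seed=0):
    rng = np.random.default_rng(seed)
    r = y.copy()                            # residuals r^(alpha-1)
    terms = []
    for _ in range(n_terms):
        best = None
        for _ in range(n_candidates):
            w = rng.standard_normal(X.shape[1])
            w /= np.linalg.norm(w)          # candidate unit projection direction
            z = X @ w
            coeffs = np.polyfit(z, r, deg=3)  # h_alpha fitted on the projection
            err = np.sum((r - np.polyval(coeffs, z)) ** 2)
            if best is None or err < best[0]:
                best = (err, w, coeffs)
        _, w, coeffs = best
        r = r - np.polyval(coeffs, X @ w)   # update the residuals
        terms.append((w, coeffs))
    return terms

def predict_ppr(terms, X):
    return sum(np.polyval(coeffs, X @ w) for w, coeffs in terms)

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 5))
y = np.sin(X @ np.array([1.0, -1.0, 0, 0, 0]) / np.sqrt(2)) + 0.05 * rng.standard_normal(300)
terms = fit_ppr(X, y)
print(np.mean((y - predict_ppr(terms, X)) ** 2))
```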

Motivation of PPR
Cluster analysis of a high-dimensional set of data points D: Diaconis and Freedman (1984) show that for most high-dimensional clouds, most one-dimensional projections are approximately normal. In cluster analysis, normal means "uninteresting". Therefore there are few interesting projections, and they can be found by maximizing a "projection index" that measures deviation from normality.

Principal Component Analysis

Let \{x_\mu\}_{\mu=1}^N be a set of points in R^d, with d possibly very large. Every point is specified by d numbers. Is there a more compact representation? It depends on the distribution of the data points in R^d...

There are random points...

0.058 0.567 0.147
0.337 0.918 0.5
0.005 0.273 0.678
0.112 0.688 0.821
0.948 0.91 0.961
0.867 0.475 0.117
0.013 0.554 0.912
0.655 0.244 0.708
0.405 0.091 0.628
0.285 0.385 0.032
0.034 0.972 0.003
0.529 0.816 0.542
0.973 0.14 0.692
0.263 0.266 0.283
0.403 0.504 0.425
0.498 0.069 0.019
0.172 0.06 0.01
0.352 0.46 0.835

and random points...

0.033 0.42 0.84
0.727 0.577 1.154
0.372 0.957 1.914
0.177 0.45 0.9
0.222 0.562 1.124
0.693 0.644 1.288
0.985 0.805 1.61
0.167 0.766 1.532
0.701 0.078 0.156
0.768 0.791 1.582
0.51 0.874 1.748
0.432 0.375 0.75
0.942 0.251 0.502
0.805 0.795 1.59
0.373 0.181 0.362
0.189 0.541 1.082
0.77 0.724 1.448

Nr.   Feature
1     pupil to nose vertical distance
2     pupil to mouth vertical distance
3     pupil to chin vertical distance
4     nose width
5     mouth width
6     face width halfway between nose tip and eyes
7     face width at nose position
8-13  chin radii
14    mouth height
15    upper lip thickness
16    lower lip thickness
17    pupil to eyebrow separation
18    eyebrow thickness

(Figure: features 8 to 14.)

Principal Component Analysis
The data might lie (at least approximately) in a k-dimensional linear subspace of R^d, with k << d:

x \approx \sum_{i=1}^k (x \cdot d_i)\, d_i

where D = \{d_i\}_{i=1}^k is an orthonormal set of vectors:

d_i \cdot d_j = \delta_{ij}

Principal Component Analysis lets us find the best orthonormal set of vectors \{d_i\}_{i=1}^k, which are called the first k principal components.

The orthonormal basis D = \{d_i\}_{i=1}^k is found by solving the following least squares problem:

\min_D \sum_{\mu=1}^N \Big\| x_\mu - \sum_{i=1}^k (x_\mu \cdot d_i)\, d_i \Big\|^2
\;\Longleftrightarrow\;
\max_D \frac{1}{N} \sum_{\mu=1}^N \sum_{i=1}^k (x_\mu \cdot d_i)^2
\;=\;
\max_D \sum_{i=1}^k d_i \cdot C\, d_i

where C is the correlation matrix:

C_{ij} = \frac{1}{N} \sum_{\mu=1}^N x_\mu^i x_\mu^j

Let us consider the case k = 1, and write d_1 = d. The problem is now to solve

\max_{\|d\| = 1} d \cdot C d

where C is a symmetric matrix (the correlation matrix). Writing C = R^T \Lambda R and d' = R d, with R a rotation matrix and \Lambda diagonal, we now have to solve

\max_{\|d'\| = 1} d' \cdot \Lambda d'

Notice that, since \|d'\| = 1,

d' \cdot \Lambda d' \leq \lambda_{max}

If we choose the vector d'_{max} = (0, 0, \ldots, 1, 0, \ldots, 0), with the 1 in the position of the largest eigenvalue, so that

\Lambda d'_{max} = \lambda_{max} d'_{max}

then d'_{max} \cdot \Lambda d'_{max} = \lambda_{max}, and d'_{max} maximizes the quadratic form d' \cdot \Lambda d'.

The first principal component, that is the vector d_{max} that maximizes the quadratic form d \cdot C d, is therefore

d_{max} = R^T d'_{max}

and it is easily seen to satisfy the eigenvector equation

C d_{max} = \lambda_{max} d_{max}

The first principal component is therefore the eigenvector of the correlation matrix corresponding to the maximum eigenvalue.

Case k > 1
It can be shown that the first k principal components are the eigenvectors of the correlation matrix C corresponding to the k largest eigenvalues.
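A minimal sketch of PCA computed this way: the principal components are the top eigenvectors of the correlation matrix C = (1/N) X^T X of the (centered) data. The toy data mimic the 20-dimensional example used below, and the last line checks numerically that the reconstruction error of the first k components equals N times the sum of the discarded eigenvalues (the "try this at home" exercise).

```python
import numpy as np

def pca(X, k):
    C = X.T @ X / len(X)                  # correlation matrix C_ij = (1/N) sum_mu x_mu^i x_mu^j
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # sort in decreasing order
    return eigvals[order][:k], eigvecs[:, order][:, :k]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
X[:, 9] = X[:, 7] + X[:, 11]              # x_10 = x_8 + x_12, as in the 20-d example below
X = X - X.mean(axis=0)                    # zero-mean data

lams, D = pca(X, k=20)
k = 10
proj = X @ D[:, :k] @ D[:, :k].T          # projection on the first k components
print(np.sum((X - proj) ** 2), len(X) * lams[k:].sum())  # E(k) vs N * sum of discarded eigenvalues
```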

Try this at home:
The approximation error of the first k principal components \{d_i\}_{i=1}^k is given by

E(k) = \sum_{\mu=1}^N \Big\| x_\mu - \sum_{i=1}^k (x_\mu \cdot d_i)\, d_i \Big\|^2 = N \sum_{i=k+1}^d \lambda_i

where the \lambda_i are the eigenvalues of the correlation matrix C. Hint: when k = d the error is zero because the correlation matrix C is symmetric, and its eigenvectors form a complete basis.

An example in 20 dimensions: x \in R^{20}, x_{10} = x_8 + x_{12}, 500 points.

Correlation matrix (figure: image plot of the 20 x 20 matrix)

Correlation matrix (figure: 3-d surface plot)

Eigenvalues of the correlation matrix (figure)

Another point of view on PCA
Let \{x_\mu\}_{\mu=1}^N be a set of points in R^d. Problem: find the unit vector d for which the projection of the data on d has the maximum variance. Assuming that the data points have zero mean, this means:

\max_{\|d\| = 1} \frac{1}{N} \sum_{\mu=1}^N (x_\mu \cdot d)^2

This is formally the same as finding the first principal component. The second principal component is the projection of maximum variance in a subspace orthogonal to the first principal component... The k-th principal component is the projection of maximum variance in a subspace which is orthogonal to the subspace spanned by the first k - 1 principal components.
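A small numerical check of this view (toy data of my own): the variance of the projection on the first principal component is at least as large as the variance of the projection on any other unit direction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)                                # zero-mean data
C = X.T @ X / len(X)                                  # correlation matrix
d1 = np.linalg.eigh(C)[1][:, -1]                      # first principal component
w = rng.standard_normal(5)
w /= np.linalg.norm(w)                                # a random unit direction
print(np.mean((X @ d1) ** 2), np.mean((X @ w) ** 2))  # the first variance is the larger one
```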

Extensions of Radial Basis Functions
- Different variables can have different scales: f(x, y) = y^2 \sin(100x)
- Different variables can have different units of measure: f = f(x, \dot x, \ddot x)
- Not all the variables are independent or relevant: f(x, y, z, t) = g(x, y, z(x, y))
- Only some linear combinations of the variables are relevant: f(x, y, z) = \sin(x + y + z)

Extensions of regularization theory
A priori knowledge: the relevant variables are linear combinations of the original ones, z = Wx, for some (possibly rectangular) matrix W:

f(x) = g(Wx) = g(z)

and the function g is smooth. The regularization functional is now

H[g] = \sum_{i=1}^N (y_i - g(z_i))^2 + \lambda \Phi[g]

where z_i = W x_i.

Extensions of regularization theory (continued)
The solution is

g(z) = \sum_{i=1}^N c_i G(z - z_i)

Therefore the solution for f is

f(x) = g(Wx) = \sum_{i=1}^N c_i G(Wx - W x_i)

If the matrix W were known, the coefficients could be computed as in the radial case:

(G + \lambda I)c = y

where

(y)_i = y_i,  (c)_i = c_i,  (G)_{ij} = G(W x_i - W x_j)

The same criticisms made of the Regularization Networks technique apply, leading to Generalized Regularization Networks:

f^*(x) = \sum_{\alpha=1}^n c_\alpha G(Wx - W t_\alpha)

Since W is not known, it can be found by least squares. Define

E(c_1, \ldots, c_n, W) = \sum_{i=1}^N (y_i - f^*(x_i))^2

Then we solve

\min_{c, W} E(c_1, \ldots, c_n, W)

The problem is not convex and quadratic anymore: expect multiple local minima.

From RBF to HyperBF
When the basis function is radial, the Generalized Regularization Network becomes

f(x) = \sum_{\alpha=1}^n c_\alpha G(\|x - t_\alpha\|_W)

with the weighted norm \|z\|_W^2 = z^T W^T W z, that is, a non-radial basis function technique (HyperBF).
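A minimal sketch of evaluating such a HyperBF expansion with a Gaussian G and the weighted norm \|z\|_W^2 = z^T W^T W z; the parameters here are random placeholders, since in practice c, the centers t_\alpha and W would all be found by (non-convex) least squares, as summarized on the next slide.

```python
import numpy as np

def hyperbf(X, centers, c, W):
    diff = X[:, None, :] - centers[None, :, :]   # shape (N, n, d)
    z = diff @ W.T                               # apply the metric W
    r2 = np.sum(z ** 2, axis=-1)                 # ||x - t_alpha||_W^2
    return np.exp(-r2) @ c                       # Gaussian G

rng = np.random.default_rng(0)
d, n = 4, 6
X = rng.standard_normal((8, d))
centers = rng.standard_normal((n, d))            # placeholder centers t_alpha
c = rng.standard_normal(n)                       # placeholder coefficients
W = rng.standard_normal((2, d))                  # rectangular metric (d' = 2)
print(hyperbf(X, centers, c, W).shape)           # (8,)
```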

Least Squares
1. \min_{c} E(c_1, \ldots, c_n)
2. \min_{c, t} E(c_1, \ldots, c_n, t_1, \ldots, t_n)
3. \min_{c, W} E(c_1, \ldots, c_n, W)
4. \min_{c, t, W} E(c_1, \ldots, c_n, t_1, \ldots, t_n, W)

A non-radial Gaussian function (figure)

A non-radial multiquadric function (figure)

Additive models
An additive model has the form

f(x) = \sum_{\mu=1}^d f_\mu(x^\mu)

where x^\mu is the \mu-th component of x and

f_\mu(x^\mu) = \sum_{i=1}^N c_i^\mu G(x^\mu - x_i^\mu)

In other words,

f(x) = \sum_{\mu=1}^d \sum_{i=1}^N c_i^\mu G(x^\mu - x_i^\mu)

Extensions of Additive Models
If we have fewer centers than examples we obtain

f(x) = \sum_{\mu=1}^d f_\mu(x^\mu)

where

f_\mu(x^\mu) = \sum_{\alpha=1}^n c_\alpha^\mu G(x^\mu - t_\alpha^\mu)

In other words,

f(x) = \sum_{\mu=1}^d \sum_{\alpha=1}^n c_\alpha^\mu G(x^\mu - t_\alpha^\mu)

Extensions of Additive Models
If we now allow for an arbitrary linear transformation of the inputs, x \to Wx, where W is a d' x d matrix, we obtain

f(x) = \sum_{\mu=1}^{d'} \sum_{\alpha=1}^n c_\alpha^\mu G(x \cdot w^\mu - t_\alpha^\mu)

where w^\mu is the \mu-th row of the matrix W.

Extensions of Additive Models
The expression

f(x) = \sum_{\mu=1}^{d'} \sum_{\alpha=1}^n c_\alpha^\mu G(x \cdot w^\mu - t_\alpha^\mu)

can be written as

f(x) = \sum_{\mu=1}^{d'} h_\mu(x \cdot w^\mu)

where

h_\mu(y) = \sum_{\alpha=1}^n c_\alpha^\mu G(y - t_\alpha^\mu)

This form of approximation is called ridge approximation.

From the extension of additive models we can therefore justify an approximation technique of the form

f(x) = \sum_{\mu=1}^{d'} \sum_{\alpha=1}^n c_\alpha^\mu G(x \cdot w^\mu - t_\alpha^\mu)

Particular case: n = 1 (one center). Then we obtain the following technique:

f(x) = \sum_{\mu=1}^{d'} c^\mu G(x \cdot w^\mu - t^\mu)

which is a Multilayer Perceptron with a Radial Basis Function G in place of the sigmoid function. Notice that the sigmoid function is not a Radial Basis Function.
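A minimal sketch of this n = 1 ridge scheme, with a Gaussian applied to each scalar projection; the parameters are random placeholders, since fitting them is again a non-convex least squares problem.

```python
import numpy as np

def ridge_rbf(X, Wrows, t, c):
    # Wrows has one row w_mu per unit; t and c are vectors of length d'
    z = X @ Wrows.T - t                 # scalar projections x . w_mu - t_mu
    return np.exp(-z ** 2) @ c          # Gaussian basis function instead of a sigmoid

rng = np.random.default_rng(0)
d, d_prime = 5, 7
Wrows = rng.standard_normal((d_prime, d))   # placeholder projection directions
t = rng.standard_normal(d_prime)            # placeholder centers
c = rng.standard_normal(d_prime)            # placeholder coefficients
X = rng.standard_normal((10, d))
print(ridge_rbf(X, Wrows, t, c).shape)      # (10,)
```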

Regularization Networks (diagram)

Regularization Networks (RN):
- Radial Stabilizer -> RBF
- Additive Stabilizer -> Additive Splines
- Product Stabilizer -> Tensor Product Splines

Generalized Regularization Networks (GRN), obtained from each of the above by allowing a movable metric and movable centers:
- RBF -> HyperBF
- Additive Splines -> Ridge Approximation
- Tensor Product Splines -> (movable metric, movable centers)