Spectral Filtering for Multi-Output Learning


1 Spectral Filtering for Multi-Output Learning. Lorenzo Rosasco, Center for Biological and Computational Learning, MIT, and Università di Genova, Italy.

2 Plan: learning with kernels; multi-output kernels and regularization; spectral filtering; perspectives.

3 Scalar Case. Function estimation from samples: $f : \mathbb{R}^d \to \mathbb{R}$, given $(x_i, y_i)_{i=1}^n$. Kernel models: $f = \sum_j K(x_j, \cdot)\, c_j$.

4 Kernels and Regularization. RKHS definition: a Hilbert space of functions $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ such that $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, $k(x, \cdot) \in \mathcal{H}$ and $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}}$. Tikhonov regularization: $\min_{f \in \mathcal{H}} \big\{ \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 \big\}$.
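A minimal sketch of scalar Tikhonov regularization with a Gaussian kernel (not from the slides; kernel width and $\lambda$ are illustrative). It uses the standard closed-form solution $c = (\mathbf{K} + \lambda n I)^{-1} y$, the same formula that reappears on a later slide.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian kernel matrix k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def tikhonov_fit(X, y, lam, sigma=1.0):
    """Solve min_f 1/n sum_i (y_i - f(x_i))^2 + lam ||f||_H^2; returns coefficients c."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def tikhonov_predict(X_train, c, X_test, sigma=1.0):
    """Evaluate f(x) = sum_j k(x_j, x) c_j at the test points."""
    return gaussian_kernel(X_test, X_train, sigma) @ c

# toy 1-D example
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
c = tikhonov_fit(X, y, lam=1e-2)
print(tikhonov_predict(X, c, np.array([[0.5]])))
```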

5 Kernel Design. Feature map: $K(x, s) = \langle \Phi(x), \Phi(s) \rangle$. Regularizer: $J(f) = \|f\|_{\mathcal{H}}^2$.

6 Multiple Outputs. Vector-valued functions $f : \mathbb{R}^d \to \mathbb{R}^T$; samples $(x_i^t, y_i^t)_{i=1}^{n_t}$ for $t = 1, \dots, T$. Kernel models: $f = \sum_j K(x_j, \cdot)\, c_j$ with $c_j \in \mathbb{R}^T$.

7 RKHS. Vector-valued RKHS definition: a Hilbert space of functions $(\mathcal{H}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ such that $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{T \times T}$ and, for $c \in \mathbb{R}^T$, $K(x, \cdot)c \in \mathcal{H}$ and $\langle f(x), c \rangle = \langle f, K(\cdot, x)c \rangle_{\mathcal{H}}$. Tikhonov regularization: $\min_{f = (f^1, \dots, f^T) \in \mathcal{H}} \big\{ \sum_{j=1}^T \frac{1}{n_j} \sum_{i=1}^{n_j} (y_i^j - f^j(x_i^j))^2 + \lambda \|f\|_{\mathcal{H}}^2 \big\}$. The representer theorem shows that the regularized solution can be written as a finite kernel expansion.

8 Which Kernels? Component-wise definition: $K : (\mathbb{R}^d \times \{1, \dots, T\}) \times (\mathbb{R}^d \times \{1, \dots, T\}) \to \mathbb{R}$, i.e. $K((x, t), (x', t'))$. A general class of kernels: $K(x, x') = \sum_r k_r(x, x') A_r$.

9 Kernels and Regularizers. Consider $K(x, x') = k(x, x') A$. Then $\|f\|_{\mathcal{H}}^2 = \sum_{j,i} A_{j,i} \langle f^j, f^i \rangle_k$, with $f = (f^1, \dots, f^T)$.
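For the separable case $K(x, x') = k(x, x') A$, the full $nT \times nT$ kernel matrix is a Kronecker product of the scalar kernel matrix with $A$. A sketch (the Gaussian base kernel and the particular $A$ are placeholder choices, not from the slides):

```python
import numpy as np

def scalar_kernel(X, Z, sigma=1.0):
    """Gaussian base kernel k(x, z)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def separable_kernel_matrix(X, A, sigma=1.0):
    """Big kernel matrix for K(x, x') = k(x, x') A, shape (nT) x (nT)."""
    K = scalar_kernel(X, X, sigma)      # n x n
    return np.kron(K, A)                # (nT) x (nT)

# example: T = 3 outputs coupled by a chosen output matrix A
n, d, T = 20, 2, 3
X = np.random.default_rng(0).standard_normal((n, d))
A = 0.5 * np.eye(T) + 0.5 * np.ones((T, T)) / T
G = separable_kernel_matrix(X, A)
print(G.shape)  # (60, 60)
```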

10 Example: Mixed Effect. $\Gamma_\omega(x, x') = K(x, x')\,(\omega \mathbf{1} + (1 - \omega) I)$, where $\mathbf{1}$ is the $T \times T$ matrix whose entries are all equal to 1. The induced regularizer is $J(f) = A_\omega \big( B_\omega \sum_{l=1}^T \|f^l\|_K^2 + \omega T \sum_{l=1}^T \big\| f^l - \frac{1}{T} \sum_{q=1}^T f^q \big\|_K^2 \big)$, with $A_\omega$, $B_\omega$ constants depending on $\omega$. It is composed of a term penalizing each component and a term penalizing the deviation of each component from their mean.

11 Example: Clustering Outputs. $M$ specifies the clusters: $G_{lq} = \epsilon_1 \delta_{lq} + (\epsilon_2 - \epsilon_1) M_{lq}$, and $K(x, x') = k(x, x')\, G$. The induced regularizer is $J(f) = \epsilon_1 \sum_{c=1}^r \sum_{l \in I(c)} \|f^l - \bar{f}_c\|_K^2 + \epsilon_2 \sum_{c=1}^r m_c \|\bar{f}_c\|_K^2$, where $I(c)$ is the index set of cluster $c$, $m_c$ its cardinality and $\bar{f}_c$ the mean of the components in cluster $c$.

12 Example: Graph. $M$ is an adjacency matrix among the tasks and $L = D - M$ the corresponding graph Laplacian; $K(x, x') = k(x, x')\, L$. The regularizer can be rewritten as $J(f) = \frac{1}{2} \sum_{l,q=1}^T \|f^l - f^q\|_K^2 M_{lq} + \sum_{l=1}^T \|f^l\|_K^2 M_{ll}$.
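The three output matrices of the last slides (mixed effect, cluster, graph Laplacian) can be written down directly; a small sketch, where $\epsilon_1$, $\epsilon_2$, $\omega$ are free parameters and the example clustering is made up for illustration:

```python
import numpy as np

def mixed_effect_matrix(T, omega):
    """omega * 1 + (1 - omega) * I, with 1 the all-ones T x T matrix."""
    return omega * np.ones((T, T)) + (1 - omega) * np.eye(T)

def cluster_matrix(M, eps1, eps2):
    """G_lq = eps1 * delta_lq + (eps2 - eps1) * M_lq; M_lq = 1 iff l, q in the same cluster."""
    T = M.shape[0]
    return eps1 * np.eye(T) + (eps2 - eps1) * M

def graph_laplacian(M):
    """L = D - M for a task adjacency matrix M."""
    return np.diag(M.sum(axis=1)) - M

# example: 4 tasks, two clusters {0, 1} and {2, 3}
M = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]])
print(mixed_effect_matrix(4, 0.3))
print(cluster_matrix(M, 1.0, 2.0))
print(graph_laplacian(M))
```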

13 Inference and Computations. Least squares and Tikhonov: $c = (\mathbf{K} + \lambda n I)^{-1} Y$. The kernel matrix $\mathbf{K}$ is $(Tn) \times (Tn)$; $c$ and $Y$ are vectors of length $Tn$. Computing the solution for $N$ different regularization parameters is expensive: $O(N (Tn)^3)$.
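A direct sketch of this slide (assuming the big kernel matrix is built as in the earlier Kronecker sketch): each regularization parameter costs one dense solve, so a grid of $N$ values pays roughly $N$ times $O((Tn)^3)$ unless one eigendecomposition is reused, as anticipated by the filtering view on the next slides.

```python
import numpy as np

def tikhonov_multioutput(Kbig, Y, lam, n):
    """c = (Kbig + lam * n * I)^{-1} Y, with Kbig the (nT x nT) kernel matrix
    and Y the vectorized outputs of length nT."""
    nT = Kbig.shape[0]
    return np.linalg.solve(Kbig + lam * n * np.eye(nT), Y)

def tikhonov_path_naive(Kbig, Y, lambdas, n):
    """Naive regularization path: one O((nT)^3) solve per lambda."""
    return [tikhonov_multioutput(Kbig, Y, lam, n) for lam in lambdas]

def tikhonov_path_eig(Kbig, Y, lambdas, n):
    """Reuse one eigendecomposition: each additional lambda costs only O((nT)^2)."""
    sig, U = np.linalg.eigh(Kbig)
    UtY = U.T @ Y
    return [U @ (UtY / (sig + lam * n)) for lam in lambdas]

# usage (Kbig and Y as above, hypothetical names):
# coeffs = tikhonov_path_eig(Kbig, Y, lambdas=[1e-3, 1e-2, 1e-1], n=n)
```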

14 Ill-posed Problems. Well-posedness in the sense of Hadamard: a solution exists, the solution is unique, and the solution depends continuously on the data. Problems that are not well-posed are termed ill-posed. Well-posedness is studied for the linear equation $Af = g$, with $A : \mathcal{H} \to \mathcal{K}$ an operator between Hilbert spaces, $f \in \mathcal{H}$ and $g \in \mathcal{K}$.

15 Regularization and Filtering. Tikhonov: $c = \sum_i \frac{1}{\sigma_i + \lambda n} \langle u_i, Y \rangle\, u_i$. Spectral filtering: $c = \sum_i G_\lambda(\sigma_i) \langle u_i, Y \rangle\, u_i$, where $(\sigma_i, u_i)$ are the eigenvalues and eigenvectors of the kernel matrix.
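The general recipe of this slide as code: eigendecompose the kernel matrix once and replace $1/(\sigma + \lambda n)$ with any filter $G_\lambda(\sigma)$. A minimal sketch (the specific filter passed in is up to the user):

```python
import numpy as np

def spectral_filter_solve(K, Y, filter_fn):
    """c = sum_i G_lambda(sigma_i) <u_i, Y> u_i for a symmetric PSD kernel matrix K."""
    sig, U = np.linalg.eigh(K)              # K = U diag(sig) U^T
    return U @ (filter_fn(sig) * (U.T @ Y))

def tikhonov_filter(lam, n):
    """Tikhonov as a special case: G_lambda(sigma) = 1 / (sigma + lam * n)."""
    return lambda sig: 1.0 / (sig + lam * n)

# usage (K and Y as in the previous sketches, hypothetical names):
# c = spectral_filter_solve(K, Y, tikhonov_filter(lam=1e-2, n=K.shape[0]))
```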

16 Regularization and Filtering. [Plot comparing $\frac{1}{\sigma_j} \langle Y, u_j \rangle$ with $G_\lambda(\sigma_j) \langle Y, u_j \rangle$ as a function of the index $j$: the filter acts as a low-pass filter, keeping the components associated with large eigenvalues and damping those associated with small eigenvalues.]

17 Classical Examples. Tikhonov regularization: $G_\lambda(\sigma) = \frac{1}{\sigma + \lambda}$.

18 Other Examples. Many other examples of filters (only some known in machine learning): TSVD (principal component regression), Landweber iteration ($L_2$ boosting), the $\nu$-method, iterated Tikhonov (Engl et al.; Rosasco et al. '05; Lo Gerfo et al. '08; Bauer et al. '05).
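For illustration, a few of the filters named above written as functions of the eigenvalues, compatible with the `spectral_filter_solve` sketch earlier. These follow the standard forms from the inverse-problems literature (e.g. Engl et al.), so the exact parameterizations should be checked against the references; the $\nu$-method is omitted since its coefficients are less compact.

```python
import numpy as np

def tsvd_filter(lam):
    """Truncated SVD / principal component regression: keep eigenvalues above lam."""
    return lambda sig: np.where(sig >= lam, 1.0 / np.maximum(sig, 1e-300), 0.0)

def landweber_filter(t, eta):
    """Landweber / L2 boosting after t steps: G_t(sigma) = (1 - (1 - eta*sigma)^t) / sigma."""
    return lambda sig: (1.0 - (1.0 - eta * sig) ** t) / np.maximum(sig, 1e-300)

def iterated_tikhonov_filter(lam, nu):
    """nu-times iterated Tikhonov: ((sigma + lam)^nu - lam^nu) / (sigma (sigma + lam)^nu)."""
    return lambda sig: ((sig + lam) ** nu - lam ** nu) / (
        np.maximum(sig, 1e-300) * (sig + lam) ** nu)
```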

19 Early Stopping. The filter corresponds to a truncated expansion of the inverse: $G_\lambda(\sigma) = \eta \sum_{j=1}^t (1 - \eta\sigma)^{j-1} \approx \frac{1}{\sigma}$, i.e. $A^{-1} \approx \eta \sum_{j=1}^t (I - \eta A)^{j-1}$. Implementation: set $\alpha^0 = 0$ and, for $i = 1, \dots, t$, $\alpha^i = \alpha^{i-1} + \eta (Y - \mathbf{K}\alpha^{i-1})$. Estimator: $f^t = \sum_{i=1}^n \alpha_i^t K(x_i, \cdot)$.
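The update on this slide as code: gradient descent on the unregularized empirical risk, stopped after $t$ steps. The step-size condition $\eta < 2/\|\mathbf{K}\|$ is a standard convergence requirement, not stated on the slide.

```python
import numpy as np

def early_stopping(K, Y, t, eta=None):
    """Landweber / early stopping: alpha^0 = 0, alpha^i = alpha^{i-1} + eta (Y - K alpha^{i-1})."""
    if eta is None:
        eta = 1.0 / np.linalg.norm(K, 2)   # safe step size based on the spectral norm
    alpha = np.zeros_like(Y, dtype=float)
    for _ in range(t):
        alpha = alpha + eta * (Y - K @ alpha)
    return alpha                            # f_t = sum_i alpha_i k(x_i, .)
```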

20-23 Early stopping at work. [Plots of the data and the estimator ($Y$ versus $X$) after an increasing number of iterations.]

24 Remarks. Empirical risk minimization with no constraints; the number of iterations $t$ plays the role of the regularization parameter. No need for an SVD: only matrix/vector multiplications, so the whole regularization path costs $O(N (Tn)^2)$.

25 Fast Solution for Tikhonov Regularization. For kernels of the form $K(x, x') = k(x, x') A$ we can diagonalize $A$ and rotate the data: Tikhonov regularization can then be solved at the price of a single task!
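A sketch of this trick, under the assumption that every output is observed at every input: diagonalize $A = V \mathrm{diag}(d) V^\top$, rotate the outputs by $V$, solve $T$ independent scalar Tikhonov problems that share one eigendecomposition of the scalar kernel matrix, and rotate back.

```python
import numpy as np

def fast_separable_tikhonov(K, Y, A, lam):
    """Tikhonov for K(x, x') = k(x, x') A; K is n x n, Y is n x T (all outputs observed)."""
    n, T = Y.shape
    dA, V = np.linalg.eigh(A)            # A = V diag(dA) V^T
    sig, U = np.linalg.eigh(K)           # scalar kernel matrix eigendecomposition
    Yt = U.T @ Y @ V                     # rotate inputs (eigenbasis of K) and outputs (eigenbasis of A)
    # elementwise: each (input, output) eigendirection is a scalar problem
    Ct = Yt / (sig[:, None] * dA[None, :] + lam * n)
    return U @ Ct @ V.T                  # coefficients C (n x T): f(x) = sum_i k(x_i, x) A c_i
```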

26 Vector Fields. $v_1(x, y) = 2\sin(3x)\sin(1.5y)$, $v_2(x, y) = 2\cos(3y)\cos(1.5x)$, plus convolution with a Gaussian.
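A possible way to generate such a smoothed synthetic field (not from the slides: the grid resolution and smoothing width are arbitrary choices, and scipy's Gaussian filter stands in for the convolution):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# grid over [0, 1]^2 (resolution and smoothing width are arbitrary choices)
xs = np.linspace(0, 1, 70)
X, Y = np.meshgrid(xs, xs)

V1 = 2 * np.sin(3 * X) * np.sin(1.5 * Y)
V2 = 2 * np.cos(3 * Y) * np.cos(1.5 * X)

# convolution with a Gaussian to smooth the field
V1s = gaussian_filter(V1, sigma=3)
V2s = gaussian_filter(V2, sigma=3)
```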

27 Useful Kernels. Divergence free: $\Gamma_{df}(x, x') = \frac{1}{\sigma^2} e^{-\frac{\|x - x'\|^2}{2\sigma^2}} A_{x,x'}$, with $A_{x,x'} = \frac{(x - x')(x - x')^T}{\sigma^2} + \big( (T - 1) - \frac{\|x - x'\|^2}{\sigma^2} \big) I$. Curl free: $\Gamma_{cf}(x, x') = \frac{1}{\sigma^2} e^{-\frac{\|x - x'\|^2}{2\sigma^2}} \big( I - \big(\frac{x - x'}{\sigma}\big)\big(\frac{x - x'}{\sigma}\big)^T \big)$.
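The two matrix-valued kernels as code, following the formulas above (with $T$ the dimension of the field). A sketch to illustrate the structure; the constants are reconstructed from a garbled source, so they are worth checking against the original reference before use.

```python
import numpy as np

def curl_free_kernel(x, xp, sigma):
    """Gamma_cf(x, x') = (1/sigma^2) exp(-||x - x'||^2 / (2 sigma^2)) (I - dd^T / sigma^2)."""
    d = x - xp
    g = np.exp(-(d @ d) / (2 * sigma ** 2)) / sigma ** 2
    return g * (np.eye(len(x)) - np.outer(d, d) / sigma ** 2)

def div_free_kernel(x, xp, sigma):
    """Gamma_df(x, x') = (1/sigma^2) exp(-||x - x'||^2 / (2 sigma^2)) A_{x,x'},
    with A_{x,x'} = dd^T / sigma^2 + ((T - 1) - ||d||^2 / sigma^2) I."""
    d = x - xp
    T = len(x)
    g = np.exp(-(d @ d) / (2 * sigma ** 2)) / sigma ** 2
    A = np.outer(d, d) / sigma ** 2 + ((T - 1) - (d @ d) / sigma ** 2) * np.eye(T)
    return g * A

print(div_free_kernel(np.array([0.1, 0.2]), np.zeros(2), sigma=0.5))
```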

28 Numerical Results. [Figure 1: (a) the artificial vector field without noise; (b) the field estimated from a noisy training set; (c-d) the divergence-free and curl-free parts estimated with the corresponding kernels.] A sample of points is drawn from the grid and the predictions on the whole grid are compared to the correct field; for each training-set size the experiment is repeated with different samplings of the training points. Errors are measured with an angular measure: if $v_o$ and $v_e$ are the original and estimated fields, each vector is mapped to $\tilde{v} = \frac{1}{\|(v_1, v_2, 1)\|}(v_1, v_2, 1)$ and the error is $\mathrm{err} = \arccos(\tilde{v}_e \cdot \tilde{v}_o)$. This measure, which interprets the vector field as a velocity field, handles large and small signals without the amplification inherent in a relative measure of vector differences. Both the noiseless case and the case of Gaussian noise proportional to the signal (standard deviation proportional to the local field magnitude) are considered. [Figure 2: test errors for the proposed vector-valued approach (VVR) and for learning each component of the field independently (INDIP); noiseless case and field with Gaussian noise of width equal to 2% of the field magnitude, as a function of the number of training examples.] The advantage of combining the divergence-free and curl-free kernels is apparent, especially for fewer training points.

29 Numerical Results. [Figure 3: computation time (in seconds) for the whole regularization path with the $\nu$-method and with Tikhonov regularization, as a function of the number of training examples.]

30 Some Theory: Random Operators. $T_{\mathbf{x}} f(x) = \frac{1}{n} \sum_{i=1}^n K(x, x_i) f(x_i)$, $\quad T f(x) = \int K(x, s) f(s)\, d\rho(s)$, and $P\big( \|T - T_{\mathbf{x}}\| \le \frac{C t}{\sqrt{n}} \big) \ge 1 - e^{-t^2}$. The above result implies convergence of eigenfunctions and eigenvalues.

31 Learning Rates. Theorem: if $\|T^{-r} f_\rho\| \le R$ with $r > 1/2$ and $\sigma_i \sim i^{-1/b}$, $b > 1$, then $P\big( \|f_n - f_\rho\|_\rho^2 \le C \tau\, n^{-\frac{2rb}{2rb+1}} \big) \ge 1 - e^{-\tau^2}$ for $\lambda = n^{-\frac{1}{2rb+1}}$. The above estimate is optimal in a minimax sense. The parameter choice can be done adaptively (Caponnetto et al.; for $b = 1$, Bauer et al.).

32 Comments One name, 3 problems? Learning the kernels?

33 Vector Fields and Multi-tasks. [Plots contrasting the multi-task view (separate plots of $Y$ versus $X$ for Task 1 and Task 2) with the vector-field view (Component 1 and Component 2 of a single field over the same $X$).]

34 Different Regimes? $n > d > T$ (classical); $d > n$ (high-dimensional inference); $T > n$ or $n > T$ (??). Curse of dimensionality vs. blessing of smoothness: smoothness$/d$ should be big.

35 Multiple Classes. Inputs belong to one of $T$ classes. See Rifkin and Klautau, "In Defense of One-Vs-All Classification", Journal of Machine Learning Research 5 (2004).

36 One Versus All. Coding: class 1 $\mapsto (1, -1, -1, \dots)$, class 2 $\mapsto (-1, 1, -1, \dots)$, ... Regression of the coding vectors: $\min_{f = (f^1, \dots, f^T)} \sum_{i=1}^n \|y_i - f(x_i)\|_T^2 + \lambda \sum_{j=1}^T \|f^j\|^2$. Classification rule: $c(x) = \arg\max_{j = 1, \dots, T} f^j(x)$.
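One-vs-all with regularized least squares on the $\pm 1$ coding vectors; a minimal sketch reusing a Gaussian kernel (the kernel choice and $\lambda$ are illustrative, not from the slides). One linear solve with $T$ right-hand sides covers all classes.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def ova_fit(X, labels, T, lam, sigma=1.0):
    """Regress the +/-1 coding vectors: one RLS problem per class with a shared kernel matrix."""
    n = X.shape[0]
    Y = -np.ones((n, T))
    Y[np.arange(n), labels] = 1.0                     # coding: class j -> j-th entry +1, rest -1
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), Y)  # one solve, T right-hand sides

def ova_predict(X_train, C, X_test, sigma=1.0):
    scores = gaussian_kernel(X_test, X_train, sigma) @ C
    return scores.argmax(axis=1)                      # c(x) = argmax_j f^j(x)
```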

37 Remarks. No correlation among classes is modeled: how can we estimate it? In simulations one observes an improvement in probability estimation but NOT in classification performance.

38 Regression vs Classification. The components of the regression function are proportional to the conditional probabilities of each class; the obtained estimator is Bayes consistent.

39 Learning the Kernel? Bayesian Approaches (consistency guarantees/stability/computability?) Regularization (what is the underlying Kernel? How are the outputs related?)
