Learning with primal and dual model representations

Size: px

Start display at page:

Download "Learning with primal and dual model representations"

Archibald Hopkins
5 years ago
Views:

1 Learning with primal and dual model representations Johan Suykens KU Leuven, ESAT-STADIUS Kasteelpark Arenberg 10 B-3001 Leuven (Heverlee), Belgium CIMI 2015, Toulouse Learning with primal and dual model representations - Johan Suykens

2 Introduction and motivation Learning with primal and dual model representations - Johan Suykens

3 Data world (a) C C C 3 C 4 Learning with primal and dual model representations - Johan Suykens 1

4 Challenges data-driven general methodology scalability need for new mathematical frameworks Learning with primal and dual model representations - Johan Suykens 2

5 Different paradigms SVM & Kernel methods Convex Optimization Sparsity & Compressed sensing Learning with primal and dual model representations - Johan Suykens 3

6 Different paradigms SVM & Kernel methods Convex Optimization? Sparsity & Compressed sensing Learning with primal and dual model representations - Johan Suykens 3

7 Sparsity through regularization or loss function Learning with primal and dual model representations - Johan Suykens 3

8 Sparsity: through regularization or loss function through regularization: model ŷ = w T x + b min j w j + γ i e 2 i sparse w through loss function: model ŷ = i α ik(x, x i ) + b min w T w + γ i L(e i ) sparse α ε 0 +ε Learning with primal and dual model representations - Johan Suykens 4

9 Sparsity: matrices and tensors vector x matrix X tensor X data vector x data matrix X data tensor X vector model: matrix model: tensor model: ŷ = w T x ŷ = W,X ŷ = W, X sparsity: sparsity: sparsity: j w j W W [Signoretto M., Tran Dinh Q., De Lathauwer L., Suykens J.A.K., ML 2014] Robust tensor completion [Yang, Feng, Suykens, 2014] Learning with primal and dual model representations - Johan Suykens 5

10 Sparsity: matrices and tensors vector x matrix X tensor X data vector x data matrix X data tensor X vector model: matrix model: tensor model: ŷ = w T x ŷ = W,X ŷ = W, X sparsity: sparsity: sparsity: j w j W W Learning with tensors [Signoretto, Tran Dinh, De Lathauwer, Suykens, ML 2014] Robust tensor completion [Yang, Feng, Suykens, 2014] Learning with primal and dual model representations - Johan Suykens 5

11 Function estimation in RKHS Find function f such that [Wahba, 1990; Evgeniou et al., 2000] 1 min f H K N N L(y i, f(x i )) + λ f 2 K i=1 with L(, ) the loss function. f K is norm in RKHS H K defined by K. Representer theorem: for convex loss function, solution of the form f(x) = NX α i K(x, x i ) i=1 Reproducing property f(x) = f, K x K with K x ( ) = K(x, ) Sparse representation by ǫ-insensitive loss [Vapnik, 1998] Learning with primal and dual model representations - Johan Suykens 6

12 Learning with primal and dual model representations Learning with primal and dual model representations - Johan Suykens 6

13 Learning models from data: alternative views - Consider model ŷ = f(x; w), given input/output data {(x i, y i )} N i=1 : min w wt w + γ N (y i f(x i ;w)) 2 i=1 - Rewrite the problem as min w,e w T w + γ N i=1 (y i f(x i ;w)) 2 subject to e i = y i f(x i ;w), i = 1,...,N - Construct Lagrangian and take condition for optimality - For a model f(x;w) = h j=1 w jϕ j (x) = w T ϕ(x) one obtains then ˆf(x) = N i=1 α ik(x, x i ) with K(x, x i ) = ϕ(x) T ϕ(x i ). Learning with primal and dual model representations - Johan Suykens 7

14 Learning models from data: alternative views - Consider model ŷ = f(x; w), given input/output data {(x i, y i )} N i=1 : min w wt w + γ N (y i f(x i ; w)) 2 i=1 - Rewrite the problem as min w,e w T w + γ N i=1 e2 i subject to e i = y i f(x i ;w),i = 1,...,N - Express the solution and the model in terms of Lagrange multipliers α i - For a model f(x;w) = h j=1 w jϕ j (x) = w T ϕ(x) one obtains then ˆf(x) = N i=1 α ik(x, x i ) with K(x, x i ) = ϕ(x) T ϕ(x i ). Learning with primal and dual model representations - Johan Suykens 7

15 Least Squares Support Vector Machines: core models Regression Classification min w,b,e wt w + γ i min w,b,e wt w + γ i e 2 i e 2 i s.t. y i = w T ϕ(x i ) + b + e i, i s.t. y i (w T ϕ(x i ) + b) = 1 e i, i Kernel pca (V = I), Kernel spectral clustering (V = D 1 ) min w,b,e wt w + γ i v i e 2 i s.t. e i = w T ϕ(x i ) + b, i Kernel canonical correlation analysis/partial least squares min w,v,b,d,e,r wt w + v T v + ν i (e i r i ) 2 s.t. { ei = w T ϕ (1) (x i ) + b = v T ϕ (2) (y i ) + d r i [Suykens & Vandewalle, 1999; Suykens et al., 2002; Alzate & Suykens, 2010] Learning with primal and dual model representations - Johan Suykens 8

16 Probability and quantum mechanics Kernel pmf estimation Primal: 1 min w,p 2 w, w subject to p i = w,ϕ(x i ), i = 1,...,N and N i=1 p i = 1 i Dual: p i = P Nj=1 K(x j,x i ) P Ni=1 P Nj=1 K(x j,x i ) Quantum measurement: state vector ψ, measurement operators M i Primal: 1 min w,p 2 w w subject to p i = Re( w M i ψ ), i = 1,...,N and N i=1 p i = 1 i Dual: p i = ψ M i ψ (Born rule, orthogonal projective measurement) [Suykens, Physical Review A, 2013] Learning with primal and dual model representations - Johan Suykens 9

17 SVMs: living in two worlds... Primal space x x x x x o o o o Feature space ϕ(x) x x x o oo x x o Input space Parametric Dual space ŷ = sign[w T ϕ(x) + b] Nonparametric ŷ K(x i, x j ) = ϕ(x i ) T ϕ(x j ) (Mercer) ŷ = sign[ P #sv i=1 α iy i K(x, x i ) + b] ŷ w 1 w nh α 1 ϕ 1 (x) ϕ nh (x) K(x, x 1 ) x x α #sv K(x, x #sv ) Learning with primal and dual model representations - Johan Suykens 10

18 SVMs: living in two worlds... Primal space x x x x x o o o o Feature space ϕ(x) x x x o oo x x o Input space Parametric Dual space ŷ = sign[w T ϕ(x) + b] Nonparametric ŷ K(x i, x j ) = ϕ(x i ) T ϕ(x j ) ( Kernel trick ) ŷ = sign[ P #sv i=1 α iy i K(x, x i ) + b] ŷ w 1 w nh α 1 ϕ 1 (x) ϕ nh (x) K(x, x 1 ) x x Parametric Non parametric α #sv K(x, x #sv ) Learning with primal and dual model representations - Johan Suykens 10

19 Linear model: solving in primal or dual? inputs x R d, output y R training set {(x i, y i )} N i=1 Model ր ց (P) : ŷ = w T x + b, w R d (D) : ŷ = i α i x T i x + b, α RN Learning with primal and dual model representations - Johan Suykens 11

20 Linear model: solving in primal or dual? inputs x R d, output y R training set {(x i, y i )} N i=1 Model ր ց (P) : ŷ = w T x + b, w R d (D) : ŷ = i α i x T i x + b, α RN Learning with primal and dual model representations - Johan Suykens 11

21 Linear model: solving in primal or dual? few inputs, many data points: d N primal : w R d dual: α R N (large kernel matrix: N N) Learning with primal and dual model representations - Johan Suykens 12

22 Linear model: solving in primal or dual? many inputs, few data points: d N primal: w R d dual : α R N (small kernel matrix: N N) Learning with primal and dual model representations - Johan Suykens 12

23 From linear to nonlinear model: Feature map and kernel Model ր ց (P) : (D) : ŷ = w T ϕ(x) + b ŷ = i α i K(x i, x) + b Mercer theorem: K(x, z) = ϕ(x) T ϕ(z) Feature map ϕ(x) = [ϕ 1 (x); ϕ 2 (x);...;ϕ h (x)] Kernel function K(x, z) (e.g. linear, polynomial, RBF,...) Use of feature map and positive definite kernel [Cortes & Vapnik, 1995] Extension to infinite dimensional case: - LS-SVM formulation [Signoretto, De Lathauwer, Suykens, 2011] - HHK Transform, coherent states, wavelets [Fanuel & Suykens, 2015] Learning with primal and dual model representations - Johan Suykens 13

24 Coherent states { η x H} x X in HHK transform min w H,e i,b 1 2 w w H + γ 2 N i=1 e 2 i s.t. y i = η xi w H +b+e i, i = 1,...,N HHK Transform: W η : H H K : w η w H (P) : ŷ = η x w H + b transform ŷ = W η η x W η w K + b ր M K(x, z) = η x η z H K(x, z) = ξ x ξ z K, ξ x = W η η x ց (D) : ŷ = i α i K(x i,x) + b ŷ = i α i K(x i,x) + b [Fanuel & Suykens, TR15-101, 2015] Learning with primal and dual model representations - Johan Suykens 14

25 HHK transform Coherent states { η x H} x X in min w H,e i,b 1 2 w w H + γ 2 N i=1 e 2 i s.t. y i = η xi w H +b+e i, i = 1,...,N HHK Transform: W η : H H K : w η w H (P) : ŷ = η x w H + b H H K ŷ = W η η x W η w K + b ր M K(x, z) = η x η z H K(x, z) = ξ x ξ z K, ξ x = W η η x ց (D) : ŷ = i α i K(x i,x) + b ŷ = i α i K(x i, x) + b [Fanuel & Suykens, TR15-101, 2015] Learning with primal and dual model representations - Johan Suykens 14

26 Sparsity by fixed-size kernel method Learning with primal and dual model representations - Johan Suykens 14

27 Fixed-size method: steps 1. selection of a subset from the data 2. kernel matrix on the subset 3. eigenvalue decomposition of kernel matrix 4. approximation of the feature map based on the eigenvectors (Nyström approximation) 5. estimation of the model in the primal using the approximate feature map (applicable to large data sets) [Suykens et al., 2002] (ls-svm book) Learning with primal and dual model representations - Johan Suykens 15

28 Selection of subset random quadratic Renyi entropy incomplete Cholesky factorization Learning with primal and dual model representations - Johan Suykens 16

29 Nyström method big kernel matrix: Ω (N,N) R N N small kernel matrix: Ω (M,M) R M M (on subset) Eigenvalue decompositions: Ω (N,N) Ũ = Ũ Λ and Ω (M,M) U = U Λ Relation to eigenvalues and eigenfunctions of the integral equation K(x, x )φ i (x)p(x)dx = λ i φ i (x ) with ˆλ i = 1 M λ i, ˆφi (x k ) = M u ki, ˆφi (x ) = M λ i M u ki K(x k,x ) k=1 [Williams & Seeger, 2001] (Nyström method in GP) Learning with primal and dual model representations - Johan Suykens 17

30 Fixed-size method: estimation in primal For the feature map ϕ( ) : R d R h obtain an approximation ϕ( ) : R d R M based on the eigenvalue decomposition of the kernel matrix with ϕ i (x ) = ˆλi ˆφi (x ) (on a subset of size M N). Estimate in primal: min w, b 1 2 wt w + γ 1 2 N (y i w T ϕ(x i ) b) 2 i=1 Sparse representation is obtained: w R M with M N and M h. [Suykens et al., 2002; De Brabanter et al., CSDA 2010] Learning with primal and dual model representations - Johan Suykens 18

31 Fixed-size method: performance in classification pid spa mgt adu ftc N N cv N test d FS-LSSVM (# SV) C-SVM (# SV) ν-svm (# SV) RBF FS-LSSVM 76.7(3.43) 92.5(0.67) 86.6(0.51) 85.21(0.21) 81.8(0.52) Lin FS-LSSVM 77.6(0.78) 90.9(0.75) 77.8(0.23) 83.9(0.17) 75.61(0.35) RBF C-SVM 75.1(3.31) 92.6(0.76) 85.6(1.46) 84.81(0.20) 81.5(no cv) Lin C-SVM 76.1(1.76) 91.9(0.82) 77.3(0.53) 83.5(0.28) 75.24(no cv) RBF ν-svm 75.8(3.34) 88.7(0.73) 84.2(1.42) 83.9(0.23) 81.6(no cv) Maj. Rule 64.8(1.46) 60.6(0.58) 65.8(0.28) 83.4(0.1) 51.23(0.20) Fixed-size (FS) LSSVM: good performance and sparsity wrt C-SVM and ν-svm Challenging to achieve high performance by very sparse models [De Brabanter et al., CSDA 2010] Learning with primal and dual model representations - Johan Suykens 19

32 Two stages of sparsity stage 1 primal FS model estimation reweighted l 1 dual subset selection Nyström approximation Synergy between parametric & kernel-based models [Mall & Suykens, IEEE-TNNLS, 2015], reweighted l 1 [Candes et al., 2008] Other possible approaches with improved sparsity: SCAD [Fan & Li, 2001]; coefficientbased l q (0 < q 1) [Shi et al., 2013]; two-level l 1 [Huang et al., 2014] Learning with primal and dual model representations - Johan Suykens 20

33 Two stages of sparsity stage 1 primal FS model estimation reweighted l 1 dual subset selection Nyström approximation Synergy between parametric & kernel-based models [Mall & Suykens, IEEE-TNNLS, 2015], reweighted l 1 [Candes et al., 2008] Other possible approaches with improved sparsity: SCAD [Fan & Li, 2001]; coefficientbased l q (0 < q 1) [Shi et al., 2013]; two-level l 1 [Huang et al., 2014] Learning with primal and dual model representations - Johan Suykens 20

34 Two stages of sparsity stage 1 stage 2 primal FS model estimation reweighted l 1 dual subset selection Nyström approximation Synergy between parametric & kernel-based models [Mall & Suykens, IEEE-TNNLS 2015], reweighted l 1 [Candes et al., 2008] Other possible approaches with improved sparsity: SCAD [Fan & Li, 2001]; coefficientbased l q (0 < q 1) [Shi et al., 2013]; two-level l 1 [Huang et al., 2014] Learning with primal and dual model representations - Johan Suykens 20

35 Two stages of sparsity stage 1 stage 2 primal FS model estimation reweighted l 1 dual subset selection Nyström approximation Synergy between parametric & kernel-based models [Mall & Suykens, IEEE-TNNLS 2015], reweighted l 1 [Candes et al., 2008] Other possible approaches with improved sparsity: SCAD [Fan & Li, 2001]; coefficientbased l q (0 < q 1) [Shi et al., 2013]; two-level l 1 [Huang et al., 2014] Learning with primal and dual model representations - Johan Suykens 20

36 Kernel-based models for spectral clustering Learning with primal and dual model representations - Johan Suykens 20

37 Kernel PCA Primal problem: [Suykens et al., 2002] min w,b,e 1 2 wt w 1 N 2 γ i=1 e 2 i s.t. e i = w T ϕ(x i ) + b, i = 1,...,N. Dual problem corresponds to kernel PCA [Scholkopf et al., 1998] Ω c α = λα with λ = 1/γ with Ω c,ij = (ϕ(x i ) ˆµ ϕ ) T (ϕ(x j ) ˆµ ϕ ) the centered kernel matrix. Interpretation: 1. pool of candidate components (objective function equals zero) 2. select relevant components Robust and sparse versions [Alzate & Suykens, 2008]: by taking other loss functions Learning with primal and dual model representations - Johan Suykens 21

38 Robustness: Kernel Component Analysis original image corrupted image KPCA reconstruction KCA reconstruction Weighted LS-SVM [Alzate & Suykens, IEEE-TNN 2008]: robustness and sparsity Learning with primal and dual model representations - Johan Suykens 22

39 Kernel Spectral Clustering (KSC): case of two clusters Primal problem: training on given data {x i } N i=1 1 min w,b,e 2 wt w γ 1 2 et V e subject to e i = w T ϕ(x i ) + b, i = 1,...,N with weighting matrix V and ϕ( ) : R d R h the feature map. Dual: V M V Ωα = λα with λ = 1/γ, M V = I N T N V 1 N 1 T NV weighted centering matrix, N Ω = [Ω ij ] kernel matrix with Ω ij = ϕ(x i ) T ϕ(x j ) = K(x i, x j ) Taking V = D 1 with degree matrix D = diag{d 1,...,d N } and d i = N j=1 Ω ij relates to random walks algorithm. [Alzate & Suykens, IEEE-PAMI, 2010] Learning with primal and dual model representations - Johan Suykens 23

40 Lagrangian: Lagrangian and conditions for optimality L(w,b, e;α) = 1 2 wt w γ 1 2 N v i e 2 i + i=1 N α i (e i w T ϕ(x i ) b) i=1 Conditions for optimality: L w = 0 w = i α iϕ(x i ) L b = 0 i α i = 0 L = 0 α i = γv i e i, i = 1,...,N e i L = 0 e i = w T ϕ(x i ) + b, i = 1,...,N α i Eliminate w,b,e, write solution in Lagrange multipliers α i. Learning with primal and dual model representations - Johan Suykens 24

41 Kernel spectral clustering: more clusters Case of k clusters: additional sets of constraints 1 min w (l),e (l),b l 2 k 1 l=1 w (l)t w (l) 1 2 k 1 l=1 γ l e (l)t D 1 e (l) subject to e (1) = Φ N nh w (1) + b 1 1 N e (2) = Φ N nh w (2) + b 2 1 N. e (k 1) = Φ N nh w (k 1) + b k 1 1 N where e (l) = [e (l) 1 ;...;e(l) N ] and Φ N n h = [ϕ(x 1 ) T ;...;ϕ(x N ) T ] R N n h. Dual problem: M D Ωα (l) = λdα (l), l = 1,...,k 1. [Alzate & Suykens, IEEE-PAMI, 2010] Learning with primal and dual model representations - Johan Suykens 25

42 Primal and dual model representations k clusters k 1 sets of constraints (index l = 1,...,k 1) M ր ց (P) : sign[ê (l) ] = sign[w (l)t ϕ(x ) + b l ] (D) : sign[ê (l) ] = sign[ j α(l) j K(x,x j ) + b l ] Learning with primal and dual model representations - Johan Suykens 26

43 Advantages of kernel-based setting model-based approach out-of-sample extensions, applying model to new data consider training, validation and test data (training problem corresponds to eigenvalue decomposition problem) model selection procedures sparse representations and large scale methods Learning with primal and dual model representations - Johan Suykens 27

44 Model selection: toy example e (2) i,val σ 2 = 0.5, BLF = e (2) i,val e (1) i,val σ 2 = 0.16, BLF = e (1) i,val validation set x (2) x (2) x (1) x (1) train + validation + test data BAD GOOD Learning with primal and dual model representations - Johan Suykens 28

Example: image segmentation 4 3 2 i,val e

2 e (1) i,val 1 0 1 2 Learning with primal

45 Example: image segmentation i,val e (3) e (2) i,val e (1) i,val Learning with primal and dual model representations - Johan Suykens 29

46 Hierarchical KSC [Alzate & Suykens, 2012] Learning with primal and dual model representations - Johan Suykens 30

47 Hierarchical KSC [Alzate & Suykens, 2012] Learning with primal and dual model representations - Johan Suykens 31

175 SV Time-complexity O(R 2 N 2 ) in [Alzate & Suykens, 2008] Time-complexity O(R 2 N) in [Novak,

48 Kernel spectral clustering: sparse kernel models original image binary clustering Incomplete Cholesky decomposition: Ω GG T with G R N R and R N Image (Berkeley image dataset): (154, 401 pixels), 175 SV Time-complexity O(R 2 N 2 ) in [Alzate & Suykens, 2008] Time-complexity O(R 2 N) in [Novak, Alzate, Langone, Suykens, 2014] Learning with primal and dual model representations - Johan Suykens 32

49 Kernel spectral clustering: sparse kernel models original image sparse kernel model Incomplete Cholesky decomposition: Ω GG T with G R N R and R N Image (Berkeley image dataset): (154, 401 pixels), 175 SV Time-complexity O(R 2 N 2 ) in [Alzate & Suykens, 2008] Time-complexity O(R 2 N) in [Novak, Alzate, Langone, Suykens, 2014] Learning with primal and dual model representations - Johan Suykens 32

50 Incomplete Cholesky decomposition and reduced set For KSC problem M D Ωα = λdα, solve the approximation U T M D UΛ 2 ζ = λζ from Ω GG T, singular value decomposition G = UΛV T and ζ = U T α. A smaller matrix of size R R is obtained instead of N N. Pivots are used as subset { x i } for the data Reduced set method [Scholkopf et al., 1999]: approximation of w = N i=1 α iϕ(x i ) by w = M j=1 β jϕ( x j ) in the sense min w w 2 2 β j β j Sparser solutions by adding l 1 penalty, reweighted l 1 or group Lasso. [Alzate & Suykens, 2008, 2011; Mall & Suykens, 2014] Learning with primal and dual model representations - Johan Suykens 33

51 Incomplete Cholesky decomposition and reduced set For KSC problem M D Ωα = λdα, solve the approximation U T M D UΛ 2 ζ = λζ from Ω GG T, singular value decomposition G = UΛV T and ζ = U T α. A smaller matrix of size R R is obtained instead of N N. Pivots are used as subset { x i } for the data Reduced set method [Scholkopf et al., 1999]: approximation of w = N i=1 α iϕ(x i ) by w = M j=1 β jϕ( x j ) in the sense min β w w ν j β j Sparser solutions by adding l 1 penalty, reweighted l 1 or group Lasso. [Alzate & Suykens, 2008, 2011; Mall & Suykens, 2014] Learning with primal and dual model representations - Johan Suykens 33

52 Core models + constraints Core model + regularization terms additional constraints Learning with primal and dual model representations - Johan Suykens 34

53 Core models + constraints Core model + regularization terms additional constraints optimal model representation model estimate Learning with primal and dual model representations - Johan Suykens 34

54 Kernel spectral clustering: adding prior knowledge Pair of points x, x : c = 1 must-link, c = 1 cannot-link Primal problem [Alzate & Suykens, IJCNN 2009] min w (l),e (l),b l 1 2 k 1 l=1 w (l)t w (l) k 1 l=1 γ l e (l)t D 1 e (l) subject to e (1) = Φ N nh w (1) + b 1 1 N. e (k 1) = Φ N nh w (k 1) + b k 1 1 N w (1)T ϕ(x ) = cw (1)T ϕ(x ). w (k 1)T ϕ(x ) = cw (k 1)T ϕ(x ) Dual problem: yields rank-one downdate of the kernel matrix Learning with primal and dual model representations - Johan Suykens 35

55 Adding prior knowledge original image without constraints Learning with primal and dual model representations - Johan Suykens 36

56 Adding prior knowledge original image with constraints Learning with primal and dual model representations - Johan Suykens 36

57 Semi-supervised learning using KSC (1) N unlabeled data, but additional labels on M N data X = {x 1,...,x N,x N+1,...,x M } Kernel spectral clustering as core model (binary case [Alzate & Suykens, WCCI 2012], multi-way/multi-class [Mehrkanoon et al., TNNLS 2015]) min w,e,b 1 2 wt w γ 1 2 et D 1 e+ρ 1 2 M m=n+1 subject to e i = w T ϕ(x i ) + b, i = 1,...,M (e m y m ) 2 Dual solution is characterized by a linear system. Suitable for clustering as well as classification. Other approaches in semi-supervised learning and manifold learning, e.g. [Belkin et al., 2006] Learning with primal and dual model representations - Johan Suykens 37

58 Semi-supervised learning using KSC (2) Dataset size n L /n U test (%) FS semi-ksc RD semi-ksc Lap-SVMp Spambase / (20%) ± ± ± 0.03 Satimage / (20%) ± ± ± Ring / (20%) ± ± ± Magic / (20%) ± ± ± Cod-rna / (20%) ± ± ± Covertype / (5%) ± ± ± / ± ± / ± ± / ± ± 0.06 FS semi-ksc: Fixed-size semi-supervised KSC RD semi-ksc: other subset selection related to [Lee & Mangasarian, 2001] Lap-SVM: Laplacian support vector machine [Belkin et al., 2006] [Mehrkanoon & Suykens, 2014] Learning with primal and dual model representations - Johan Suykens 38

59 Semi-supervised learning using KSC (3) original image KSC given a few labels semi-supervised KSC [Mehrkanoon, Alzate, Mall, Langone, Suykens, IEEE-TNNLS 2015], videos Learning with primal and dual model representations - Johan Suykens 39

60 SVD from LS-SVM Learning with primal and dual model representations - Johan Suykens 39

61 SVD within the LS-SVM setting (1) Singular Value Decomposition (SVD) of A R N M A = UΣV T with U T U = I N, V T V = I M, Σ = diag(σ 1,...,σ p ) R N M. Obtain two sets of data points (rows and columns): x i = A T ǫ i, z j = Aε j for i = 1,...,N, j = 1,...,M where ǫ i, ε j are standard basis vectors of dimension N and M. Compatible feature maps: ϕ : R M R N, ψ : R N R N where ϕ(x i ) = C T x i = C T A T ǫ i ψ(z j ) = z j = Aε j with C R M N a compatibility matrix. [Suykens, ACHA, 2015, in press] Learning with primal and dual model representations - Johan Suykens 40

62 Primal problem: SVD within the LS-SVM setting (2) min w,v,e,r wt v γ N i=1 e2 i γ M j=1 r2 j subject to e i = w T ϕ(x i ), i = 1,...,N r j = v T ψ(z j ), j = 1,...,M From the Lagrangian and conditions for optimality one obtains: [ ] ϕ(x i ) T ψ(z j ) [β] = [α] Λ [ ] ψ(z j ) T ϕ(x i ) [α] = [β] Λ Theorem: If ACA = A holds, this corresponds to the shifted eigenvalue problem in Lanczos decomposition theorem. Goes beyond the use of Mercer theorem; extensions to nonlinear SVDs [Suykens, ACHA, 2015, in press] Learning with primal and dual model representations - Johan Suykens 41

63 Conclusions Synergies parametric and kernel based-modelling Primal and dual representations Sparse kernel models using fixed-size method Applications in supervised and unsupervised learning and beyond Finite and infinite dimensional case Beyond Mercer kernels Software: see ERC AdG A-DATADRIVE-B website Learning with primal and dual model representations - Johan Suykens 42

64 Acknowledgements (1) Co-workers at ESAT-STADIUS: M. Agudelo, C. Alaiz, C. Alzate, A. Argyriou, R. Castro, J. De Brabanter, K. De Brabanter, L. De Lathauwer, B. De Moor, M. Espinoza, M. Fanuel, Y. Feng, E. Frandi, B. Gauthier, D. Geebelen, H. Hang, X. Huang, L. Houthuys, V. Jumutc, Z. Karevan, R. Langone, Y. Liu, R. Mall, S. Mehrkanoon, M. Novak, J. Puertas, L. Shi, M. Signoretto, V. Van Belle, J. Vandewalle, S. Van Huffel, C. Varon, X. Xi, Y. Yang, and others Many people for joint work, discussions, invitations, organizations Support from ERC AdG A-DATADRIVE-B, KU Leuven, GOA-MaNet, OPTEC, IUAP DYSCO, FWO projects, IWT, iminds, BIL, COST Learning with primal and dual model representations - Johan Suykens 43

65 Acknowledgements (2) Learning with primal and dual model representations - Johan Suykens 44

66 Thank you Learning with primal and dual model representations - Johan Suykens 45

Learning with primal and dual model representations: a unifying picture

Learning with primal and dual model representations: a unifying picture Johan Suykens KU Leuven, ESAT-STADIUS Kasteelpark Arenberg 10 B-3001 Leuven (Heverlee), Belgium Email: johan.suykens@esat.kuleuven.be