Kernel Reconstruction ICA for Sparse Representation


Yanhui Xiao, Zhenfeng Zhu, and Yao Zhao

arXiv preprint [cs.CV], 9 Apr 2013

Y. Xiao and Z. Zhu are with the Institute of Information Science, Beijing Jiaotong University, and with the Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing 100044, China (e-mail: xiaoyanhui@gmail.com, zhfzhu@bjtu.edu.cn). Y. Zhao is with the Institute of Information Science, Beijing Jiaotong University, and with the State Key Laboratory of Rail Traffic Control and Safety, Beijing 100044, China (e-mail: yzhao@bjtu.edu.cn).

Abstract: Independent Component Analysis (ICA) is an effective unsupervised tool for learning statistically independent representations. However, ICA is not only sensitive to whitening but also has difficulty learning an over-complete basis. Consequently, ICA with a soft Reconstruction cost (RICA) was presented to learn sparse representations with an over-complete basis even on unwhitened data. However, RICA cannot represent data with nonlinear structure due to its intrinsic linearity. In addition, RICA is essentially an unsupervised method and cannot utilize class information. In this paper, we propose a kernel ICA model with a reconstruction constraint (kRICA) to capture nonlinear features. To bring in class information, we further extend the unsupervised kRICA to a supervised one by introducing a discrimination constraint, namely d-kRICA. This constraint leads to learning a structured basis consisting of basis vectors from different basis subsets corresponding to different class labels. Each subset will then sparsely represent its own class well but not the others. Furthermore, data samples belonging to the same class will have similar representations, and thereby the learned sparse representations carry more discriminative power. Experimental results validate the effectiveness of kRICA and d-kRICA for image classification.

Index Terms: Independent component analysis, nonlinear mapping, supervised learning, image classification.

1 INTRODUCTION

Sparsity is an attribute characterizing a mass of natural and man-made signals [1], and has played a vital role in the success of many machine learning algorithms and techniques, such as compressed sensing [2], matrix factorization [3], sparse coding [4], dictionary learning [5], [6], sparse auto-encoders [7], Restricted Boltzmann Machines (RBMs) [8] and Independent Component Analysis (ICA) [9]. Among these, ICA transforms an observed multidimensional random vector into sparse components that are statistically as independent from each other as possible. Specifically, to estimate the independent components, a general principle is the maximization of non-gaussianity [9]. This is based on the central limit theorem, which implies that a sum of independent random variables is closer to gaussian than any of the original random variables, i.e., non-gaussian indicates independent. Meanwhile, sparsity is one form of non-gaussianity [10], and it is dominant in natural images, so maximization of sparseness in natural images is basically equivalent to maximization of non-gaussianity. Thus, ICA has been successfully applied to learn sparse representations for classification tasks by maximizing sparsity [11].

However, there are two main drawbacks to standard ICA. 1) ICA is sensitive to whitening, which is an important preprocessing step in ICA to extract efficient features. In addition, standard ICA has difficulty exactly whitening high-dimensional data. For example, an input image of 100 x 100 pixels could be exactly whitened by principal component analysis (PCA), while
it has to solve the eigen-decomposition of the 10,000 x 10,000 covariance matrix. 2) ICA is hard to use for learning an over-complete basis (that is, a basis in which the number of basis vectors is greater than the dimensionality of the input data). Yet Coates et al. [12] have shown that several approaches with over-complete bases, e.g., sparse auto-encoders [7], K-means [12] and RBMs [8], obtain an improvement in classification performance. This puts ICA at a disadvantage compared to these methods.

Both drawbacks are mainly due to the hard orthonormality constraint in standard ICA. Mathematically, that is WW^T = I, which is utilized to prevent degenerate solutions for the basis matrix W, where each basis vector is a row of W. However, this orthonormalization cannot be satisfied when W is over-complete. Specifically, the optimization problem of standard ICA is generally solved using gradient descent methods, where W is orthonormalized at each iteration by symmetric orthonormalization, i.e., W <- (WW^T)^{-1/2} W, which does not work for over-complete learning, as the sketch below illustrates. In addition, although alternative orthonormalization methods could be employed to learn an over-complete basis, they are not only expensive to compute but also may suffer from the accumulation of errors.

To address the above issues, Q. V. Le et al. [13] replaced the orthonormality constraint with a robust soft reconstruction cost for ICA (RICA). Thus, RICA can learn sparse representations with a highly over-complete basis even on unwhitened data.
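To see concretely why the hard constraint blocks over-complete learning, the following small NumPy sketch (our own illustration, not the authors' code) applies the symmetric orthonormalization step W <- (WW^T)^{-1/2} W: it succeeds for a complete basis, but for an over-complete W the matrix WW^T is rank-deficient, so its inverse square root does not exist.

```python
import numpy as np

def sym_orthonormalize(W):
    """Symmetric orthonormalization W <- (W W^T)^{-1/2} W used in standard ICA."""
    M = W @ W.T
    evals, evecs = np.linalg.eigh(M)
    return evecs @ np.diag(evals ** -0.5) @ evecs.T @ W

rng = np.random.default_rng(0)
n = 10
W_complete = rng.standard_normal((n, n))        # complete basis: orthonormalization works
W_ortho = sym_orthonormalize(W_complete)
print(np.allclose(W_ortho @ W_ortho.T, np.eye(n)))   # True

W_over = rng.standard_normal((2 * n, n))        # over-complete basis (K = 2n > n)
M = W_over @ W_over.T                           # rank-deficient, so (W W^T)^{-1/2} is undefined
print(np.linalg.matrix_rank(M), M.shape[0])     # rank n < 2n
```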

However, this model is so far also a linear technique, which cannot discover nonlinear relationships among the input data. Additionally, as an unsupervised method, RICA may not be sufficient for classification tasks, since it fails to consider the association between a training sample and its class.

Recall that, to explore nonlinear features, the kernel trick [14] can be used to nonlinearly project the input data into a high-dimensional feature space. Therefore, we develop a kernel extension of RICA (kRICA) to represent data with nonlinear structure. In addition, to bring in label information, we further extend the unsupervised kRICA to a supervised one by introducing a discrimination constraint, namely d-kRICA. In particular, this constraint maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost jointly, which leads to learning a structured basis consisting of basis vectors from different basis subsets corresponding to the class labels. Each subset will then sparsely represent its own class well but not the others. Furthermore, data samples belonging to the same class will have similar representations, and thereby the obtained sparse representations carry more discriminative power.

It is important to note that this work is fundamentally based on our previous work DRICA [15]. In comparison to DRICA, we improve our work as follows. 1) By taking advantage of the kernel trick, we replace the linear projection with a nonlinear one to capture nonlinear features. Experimental results show that this kernel extension usually further improves image classification accuracy. 2) The discriminative capability of the basis is further enhanced by maximizing the homogeneous representation cost in addition to minimizing the inhomogeneous representation cost. Thus, we obtain a set of more discriminative basis vectors that are forced to sparsely represent their own classes better but the other classes more poorly. Experiments show that this basis further boosts image classification performance. 3) In the experiments, we conduct a comprehensive analysis of the proposed method, e.g., the effects of different parameters and kernels on image classification, experiment settings, and a comparative similarity analysis.

The rest of the paper is organized as follows. In Section 2, we revisit related work on sparse coding and RICA, and describe the connection between them. We then give a brief review of reconstruction ICA in Section 3. Section 4 introduces the details of our proposed kRICA, including its optimization problem and implementation. By incorporating the discrimination constraint, kRICA is further extended to supervised learning in Section 5. Section 6 presents extensive experimental results on image classification. Finally, we conclude our work in Section 7.

2 RELATED WORK

In this section, we review related work in the following aspects: (1) sparse coding and its applications; (2) the connection between RICA and sparse coding; (3) other kernel sparse representation algorithms.

Sparse coding is an unsupervised method for reconstructing a given signal by selecting a relatively small subset of basis vectors from an over-complete basis set, while keeping the reconstruction error as small as possible. Because of its plausible statistical theory [16], sparse coding has attracted more and more attention in the computer vision field. Meanwhile, it has been successfully used in more and more computer vision applications, e.g., image classification [17], [18], [19], face
recognition [20], image restoration [21], etc. This success is largely due to two factors. 1) The sparsity characteristic exists ubiquitously in many computer vision applications. For example, in image classification, the components of an image can be sparsely reconstructed from similar components of other images of the same class [17]. Another example is face recognition: the face image to be tested can be accurately reconstructed from a few training images of the same category [20]. As a consequence, sparsity is the foundation of these applications based on sparse coding. 2) Images are often corrupted by noise, which may arise from sensor imperfection, poor illumination or communication errors. Sparse coding can effectively select the related basis vectors to reconstruct the clean image, and meanwhile can deal with noise by allowing reconstruction error and promoting sparsity. Therefore, sparse coding has been successfully applied to image denoising [22], image restoration [21], etc.

Similar to sparse coding, ICA with a reconstruction cost (RICA) [13] can also learn highly over-complete sparse representations. In addition, it has been shown in [13] that RICA is mathematically equivalent to sparse coding when using explicit encoding and ignoring the norm ball constraint.

The above-mentioned studies only seek sparse representations of the input data in the original data space, which is not adequate for representing data with nonlinear structure. To solve this problem, Yang et al. [23] developed a two-phase kernel ICA algorithm: whitened kernel principal component analysis (KPCA) plus ICA. Different from [23], another solution [24] uses a contrast function based on canonical correlations in a reproducing kernel Hilbert space. However, neither of these methods can learn over-complete sparse representations of nonlinear features, due to the orthonormality constraint. Therefore, to find such representations, Gao et al. [25], [26] presented a kernel sparse coding method (KSR) in a high-dimensional feature space. But this work, being an unsupervised approach, fails to utilize class information. Additionally, in Section 4.3, we show that our proposed kernel extension of RICA (kRICA) is equivalent to KSR under certain conditions.

3 RECONSTRUCTION ICA

Since sparsity is one form of non-gaussianity, maximization of sparsity for ICA is equivalent to maximization of independence [10]. Given the unlabeled data set X = {x_i}_{i=1}^m, where x_i ∈ R^n, the optimization problem of standard ICA [9] is generally defined as

\min_W \sum_{i=1}^{m} \sum_{j=1}^{K} g(w_j x_i)  s.t.  WW^T = I,   (1)

where g(·) is a nonlinear convex function, W = [w_1, w_2, ..., w_K]^T ∈ R^{K×n} is the basis matrix, K is the number of basis vectors, w_j is the j-th row basis vector of W, and I is the identity matrix. The orthonormality constraint WW^T = I is traditionally utilized to prevent the basis vectors in W from becoming degenerate. Meanwhile, a good general-purpose smooth L1 penalty is g(·) = log(cosh(·)) [10].

However, as pointed out above, the orthonormality constraint makes it difficult for standard ICA to learn an over-complete basis. In addition, ICA is sensitive to whitening. These drawbacks restrict ICA from scaling to high-dimensional data. Consequently, RICA [13] uses a soft reconstruction cost to replace the orthonormality constraint in ICA. Applying this replacement to Equation (1), RICA can be formulated as the following unconstrained problem:

\min_W \sum_{i=1}^{m} [ \|W^T W x_i - x_i\|_2^2 + λ g(W x_i) ],   (2)

where the parameter λ is a tradeoff between reconstruction and sparsity. By swapping the orthonormality constraint for a reconstruction penalty, RICA can learn sparse representations even on unwhitened data when W is over-complete.

Furthermore, since the L1 penalty is not sufficient to learn invariant features [10], RICA [13], [27] replaced it with an L2 pooling penalty, which encourages pooling features to group similar features together so as to achieve complex invariances such as scale and rotational invariance. Besides, L2 pooling also promotes sparsity for feature learning. In particular, L2 pooling [28], [29] is a two-layered network with a square nonlinearity in the first layer and a square-root nonlinearity in the second layer:

g(W x_i) = \sum_{j=1}^{K} \sqrt{ ε + H_j (W x_i)^2 },   (3)

where H_j is the j-th row of the spatial pooling matrix H ∈ R^{K×K}, fixed to uniform weights, and ε is a small constant that prevents division by zero.

Nevertheless, RICA cannot represent data with nonlinear structure due to its intrinsic linearity. In addition, this model simply learns an over-complete basis set with a reconstruction cost and fails to consider the association between a training sample and its class, which may be insufficient for classification tasks. To address these problems, on one hand, we focus on developing a kernel extension of RICA to find sparse representations of nonlinear features; on the other hand, we aim to learn a more discriminative basis than unsupervised RICA by bringing in class information, which will facilitate better performance of the sparse representations in classification tasks.
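As a concrete illustration of Equations (2) and (3), the NumPy sketch below evaluates the RICA objective (soft reconstruction cost plus L2 pooling penalty) for a given basis. The function and variable names are our own, and the uniform pooling matrix H is just one simple choice consistent with the description above; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def l2_pooling(WX, H, eps=1e-6):
    """L2 pooling penalty of Eq. (3): sum_j sqrt(eps + H_j (W x_i)^2), summed over samples."""
    # WX: K x m matrix of responses W x_i; H: K x K pooling matrix with uniform weights.
    pooled = H @ (WX ** 2)               # K x m, row j holds H_j (W x_i)^2 for every sample i
    return np.sqrt(eps + pooled).sum()

def rica_objective(W, X, lam, H):
    """RICA objective of Eq. (2): reconstruction cost + lambda * L2 pooling penalty."""
    # X: n x m data matrix (columns are samples), W: K x n basis matrix.
    WX = W @ X                           # K x m responses
    recon = W.T @ WX - X                 # n x m residuals W^T W x_i - x_i
    return np.sum(recon ** 2) + lam * l2_pooling(WX, H)

# Toy usage with an over-complete basis (K > n) on random data.
rng = np.random.default_rng(0)
n, K, m = 64, 128, 500
X = rng.standard_normal((n, m))
W = rng.standard_normal((K, n))
H = np.full((K, K), 1.0 / K)             # uniform pooling weights
print(rica_objective(W, X, lam=0.1, H=H))
```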
4 KERNEL EXTENSION FOR RICA

Motivated by the success of the kernel trick in capturing the nonlinear structure of data [14], we propose a kernel version of RICA, called kRICA, to learn sparse representations of nonlinear features.

4.1 Model Formulation

Suppose that there is a kernel function κ(·,·) induced by a high-dimensional feature mapping φ : R^n → R^N, where n ≪ N. Given two data points x_i and x_j, κ(x_i, x_j) = φ(x_i)^T φ(x_j) represents a nonlinear similarity between them. The mapping takes the data and the basis from the original data space to the feature space as follows:

x →_φ φ(x),   W = [w_1, ..., w_K]^T →_φ W = [φ(w_1), ..., φ(w_K)]^T.   (4)

Furthermore, by substituting the mapped data and basis into Equation (2), we obtain the following objective function of kRICA:

\min_W \sum_{i=1}^{m} [ \|W^T W φ(x_i) - φ(x_i)\|_2^2 + λ g(W φ(x_i)) ].   (5)

Owing to its excellent performance in many computer vision applications [14], [25], the Gaussian kernel, i.e., κ(x_i, x_j) = exp(-γ ||x_i - x_j||_2^2), is used in this study. Thus, the norm ball constraints on the basis in RICA can be removed, since φ(w_i)^T φ(w_i) = κ(w_i, w_i) = 1. In addition, we perform kernel principal component analysis (KPCA) in the feature space for data whitening, similar to [23], which makes the ICA estimation problem simpler and better conditioned [10].

When the data is whitened, there is a close relationship between kernel ICA [23] and kRICA. Regarding this relationship, we have the following Lemma:

Lemma 4.1. When the input data set X = {x_i}_{i=1}^m is whitened in the feature space, the reconstruction cost \sum_{i=1}^{m} \|W^T W φ(x_i) - φ(x_i)\|_2^2 is equivalent to the orthonormality cost \|W^T W - I\|_F^2, where ||·||_F is the Frobenius norm.

Lemma 4.1 shows that kernel ICA's hard orthonormality constraint and kRICA's reconstruction cost are equivalent when the data is whitened. However, kRICA can learn over-complete sparse representations of nonlinear features, whereas kernel ICA fails to do so because of the orthonormality constraint. Please see Appendix A for a detailed proof.
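The identity behind Lemma 4.1 can be checked numerically. The sketch below does so in a finite-dimensional linear space (i.e., with φ taken as the identity map), which is enough to see the argument of Appendix A at work; the variable names are ours, and the data is whitened so that Σ_i x_i x_i^T = I, as assumed in the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, m = 20, 35, 1000
X = rng.standard_normal((n, m))

# Whiten so that sum_i x_i x_i^T = X X^T = I, the assumption used in Appendix A.
C = X @ X.T
evals, evecs = np.linalg.eigh(C)
X_w = evecs @ np.diag(evals ** -0.5) @ evecs.T @ X

W = rng.standard_normal((K, n))                      # over-complete basis (K > n)
recon_cost = np.sum((W.T @ (W @ X_w) - X_w) ** 2)    # sum_i ||W^T W x_i - x_i||^2
ortho_cost = np.linalg.norm(W.T @ W - np.eye(n), 'fro') ** 2
print(recon_cost, ortho_cost)                        # the two costs agree up to floating-point error
```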

4.2 Implementation

Equation (5) is an unconstrained optimization problem. To solve it, we rewrite the objective as follows:

f(W) = \sum_{i=1}^{m} [ \|W^T W φ(x_i) - φ(x_i)\|_2^2 + λ g(W φ(x_i)) ]
     = \sum_{i=1}^{m} [ 1 + \sum_{u=1}^{K} \sum_{v=1}^{K} κ(w_u, x_i) κ(w_u, w_v) κ(w_v, x_i) - 2 \sum_{u=1}^{K} (κ(w_u, x_i))^2
       + λ \sum_{j=1}^{K} \sqrt{ ε + \sum_{u=1}^{K} h_{ju} (κ(w_u, x_i))^2 } ],   (6)

where w_u and w_v are rows of the basis W, and h_{ju} is an element of the pooling matrix H. Since the row w_j of W appears inside the kernel κ(w_j, ·), it is very hard to directly apply the optimization methods used in RICA, e.g., L-BFGS and CG [30], to compute the optimal basis. Thus, we alternatively optimize each row of the basis W instead. With respect to each updated row w_p of W, the derivative of f(W) is

∂f/∂w_p = γ \sum_{i=1}^{m} [ -4 \sum_{v=1}^{K} κ(w_p, x_i) κ(w_p, w_v) κ(w_v, x_i) ((w_p - x_i) + (w_p - w_v))
          + 8 (κ(w_p, x_i))^2 (w_p - x_i)
          - 2λ \sum_{j=1}^{K} h_{jp} (κ(w_p, x_i))^2 (w_p - x_i) / \sqrt{ ε + \sum_{v=1}^{K} h_{jv} (κ(w_v, x_i))^2 } ].   (7)

Then, to compute the optimal w_p, we set ∂f/∂w_p = 0. Since w_p appears inside κ(w_p, ·), it is challenging to solve Equation (7) directly. Thus, we seek an approximate solution instead of the exact one. Inspired by the fixed point algorithm [25], to update w_p in the q-th iteration, we use the result of w_p from the (q-1)-th iteration to evaluate the terms inside the kernel function. In addition, we use k-means to initialize the basis, following [25]. Denoting the value of w_p in the q-th iteration by w_{p,(q)}, Equation (7) with respect to w_{p,(q)} becomes

γ \sum_{i=1}^{m} [ -4 \sum_{v=1}^{K} κ(w_{p,(q-1)}, x_i) κ(w_{p,(q-1)}, w_v) κ(w_v, x_i) ((w_{p,(q)} - x_i) + (w_{p,(q)} - w_v))
  + 8 (κ(w_{p,(q-1)}, x_i))^2 (w_{p,(q)} - x_i)
  - 2λ \sum_{j=1}^{K} h_{jp} (κ(w_{p,(q-1)}, x_i))^2 (w_{p,(q)} - x_i) / \sqrt{ ε + \sum_{v=1}^{K} h_{jv} (κ(w_v, x_i))^2 } ] = 0.

When all the remaining rows are fixed, this becomes a linear equation in w_{p,(q)}, which can be solved straightforwardly.

4.3 Connection between kRICA and KSR

There is clearly a close connection between the proposed kRICA and KSR [25]. Similar to kRICA, KSR attempts to find sparse representations of nonlinear features in a high-dimensional feature space, and its optimization problem is

\min_{W, s_i} \sum_{i=1}^{m} [ \|W^T s_i - φ(x_i)\|_2^2 + λ \|s_i\|_1 ],   (8)

where s_i ∈ R^K is the sparse representation of sample x_i. There are two major differences between the two models. (1) Unlike kRICA, which uses the explicit encoding s_i = W φ(x_i) for the sparse representation of an input sample, KSR treats the codes as free variables; since the objective of Equation (8) in KSR is not convex, the basis W and the sparse codes s_i have to be optimized alternately. (2) The simple L1 penalty, g(s_i) = ||s_i||_1, is employed by KSR to promote sparsity, while kRICA uses L2 pooling instead, which forces the pooling features to group similar features together to achieve invariance while still promoting sparsity.
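For clarity, the following NumPy sketch evaluates the expanded objective f(W) of Equation (6) using only Gaussian kernel evaluations, which is what makes the computation feasible without forming φ explicitly. The names and the toy initialization are ours, and the row-wise fixed-point update of Equation (7) is not reproduced here; this is an illustrative sketch under those assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """kappa(a, b) = exp(-gamma * ||a - b||^2) for all rows of A against all rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def krica_objective(W, X, H, lam, gamma, eps=1e-6):
    """Evaluate Eq. (6) through kernel evaluations only.
    W: K x n basis (rows w_u), X: m x n data (rows x_i), H: K x K pooling matrix."""
    Kwx = gaussian_kernel(W, X, gamma)          # K x m, entries kappa(w_u, x_i)
    Kww = gaussian_kernel(W, W, gamma)          # K x K, entries kappa(w_u, w_v)
    m = X.shape[0]
    recon = m + np.sum(Kwx * (Kww @ Kwx)) - 2 * np.sum(Kwx**2)   # feature-space reconstruction cost
    pool = np.sqrt(eps + H @ (Kwx**2)).sum()                     # L2 pooling on kernel responses
    return recon + lam * pool

# Toy usage with a k-means-style initialization of the basis (here: rows sampled from the data).
rng = np.random.default_rng(0)
m, n, K = 200, 16, 32
X = rng.standard_normal((m, n))
W = X[rng.choice(m, K, replace=False)]          # initialize basis rows from data points
H = np.full((K, K), 1.0 / K)
print(krica_objective(W, X, H, lam=0.1, gamma=0.1))
```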
5 SUPERVISED KERNEL RICA

Given labeled training data, our goal is to utilize class information to learn a structured basis set, which consists of basis vectors from different basis subsets corresponding to different class labels. Each subset will then sparsely represent its own class well but not the others. Thus, to learn such a basis, we further extend the unsupervised kRICA to a supervised one by introducing a discrimination constraint, namely d-kRICA. Mathematically, when the sample x_i is labeled as y_i ∈ {1, ..., c}, where c is the total number of classes, we can further utilize class information to learn a structured basis set W = [W^(1), W^(2), ..., W^(c)]^T ∈ R^{K×n}, where W^(y_i) ∈ R^{k×n} is the basis subset that represents well a sample x_i belonging to the y_i-th class rather than the others, k is the number of basis vectors in each subset, and K = k·c. Let s_i = W x_i, where s_i can be regarded as the sparse representation of sample x_i [31].

5.1 Discrimination constraint

Since we aim to utilize class information to learn a structured basis, we hope that a sample x_i labeled as y_i will be reconstructed only by the basis subset W^(y_i) with coefficients s_i. To achieve this goal, an inhomogeneous representation cost constraint [15], [31] was previously utilized to minimize the inhomogeneous representation coefficients of s_i, i.e., the coefficients corresponding to basis vectors that do not belong to W^(y_i). However, this constraint only focuses on minimizing the inhomogeneous coefficients and fails to consider maximizing the homogeneous ones, which is not sufficient to learn an optimal structured basis.

Consequently, to learn such a basis, we introduce a discrimination constraint, which maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost jointly. Mathematically, we define the homogeneous cost as P+ and the inhomogeneous cost as P-. Specifically,

P+ = \|D_{+y_i} s_i\|_2^2,   P- = \|D_{-y_i} s_i\|_2^2,   (9)

where D_{+y_i} ∈ R^{1×K} and D_{-y_i} ∈ R^{1×K} select the homogeneous and inhomogeneous representation coefficients of s_i, respectively. For example, assuming W = [W^(1), W^(2), W^(3)]^T with W^(y_i) ∈ R^{2×n} (y_i ∈ {1, 2, 3}) and y_i = 3, D_{+y_i} and D_{-y_i} can be defined as

D_{+3} = [0 0 0 0 1 1],   D_{-3} = [1 1 1 1 0 0].

Intuitively, we could define the discrimination constraint function d(s_i) as P- - P+, which means that the sparse representation s_i in terms of the basis matrix W will concentrate only on the basis subset W^(y_i). However, this constraint is non-convex and unstable. To address this problem, we propose to incorporate an elastic term ||s_i||_2^2 into d(s_i). Thus, d(s_i) is defined as

d(s_i) = \|D_{-y_i} s_i\|_2^2 - \|D_{+y_i} s_i\|_2^2 + η \|s_i\|_2^2.   (10)

It can be proved that if η ≥ k+1, then d(s_i) is strictly convex in s_i; please see Appendix B for a detailed proof. The constraint (10) maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost simultaneously, which leads to learning a structured basis consisting of basis vectors from different basis subsets corresponding to the class labels. Each subset will then sparsely represent its own class well but not the others. Furthermore, data samples belonging to the same class will have similar representations, and thereby the obtained new representations carry more discriminative power.

By incorporating the discrimination constraint into the kRICA framework (d-kRICA), we obtain the following objective function:

\min_W \sum_{i=1}^{m} [ \|W^T W φ(x_i) - φ(x_i)\|_2^2 + λ g(W φ(x_i)) + α d(W φ(x_i)) ],   (11)

where λ and α are scalars controlling the relative contribution of the corresponding terms. Given a test sample, Equation (11) means that the learned basis set can sparsely represent it with nonlinear structure while demanding that its homogeneous representations be as large as possible and its inhomogeneous representations as small as possible. Following kRICA, the optimization problem (11) can be solved by the fixed point algorithm proposed above.
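To make the selector construction concrete, the sketch below builds the indicator vectors of Equations (9)-(10) for a toy configuration with c classes and k basis vectors per class, and evaluates d(s). It is our own illustrative code, following the row-vector form of the selectors used in Appendix B; the function names are assumptions, not the authors' API.

```python
import numpy as np

def selectors(y, c, k):
    """Indicator row vectors D_{+y} (homogeneous) and D_{-y} (inhomogeneous) for class y in {1..c}."""
    K = k * c
    d_plus = np.zeros(K)
    d_plus[(y - 1) * k : y * k] = 1.0       # ones over the k coefficients of class y
    d_minus = 1.0 - d_plus                  # ones over all remaining coefficients
    return d_plus, d_minus

def discrimination(s, y, c, k, eta):
    """d(s) of Eq. (10): ||D_{-y} s||^2 - ||D_{+y} s||^2 + eta * ||s||^2."""
    d_plus, d_minus = selectors(y, c, k)
    return (d_minus @ s) ** 2 - (d_plus @ s) ** 2 + eta * s @ s

# Toy usage: c = 3 classes, k = 2 basis vectors each, a sample of class 3, eta = k + 1 for convexity.
s = np.array([0.1, -0.2, 0.0, 0.3, 0.9, 0.8])
print(selectors(3, c=3, k=2))               # D_{+3} = [0 0 0 0 1 1], D_{-3} = [1 1 1 1 0 0]
print(discrimination(s, y=3, c=3, k=2, eta=3.0))
```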
6 EXPERIMENTS

In this section, we first introduce the feature extraction used for image classification. Then, we evaluate the performance of our kRICA and d-kRICA for image classification on three public datasets: Caltech 101 [32], CIFAR-10 [12] and STL-10 [12]. Furthermore, we study the selection of tuning parameters and kernel functions for our method. Finally, we give the similarity matrices to further illustrate the performance of kRICA and d-kRICA.

6.1 Feature Extraction for Classification

Given a p × p input image patch (with d channels) x ∈ R^n (n = p × p × d), kRICA transforms it into a new representation s = W φ(x) ∈ R^K in the feature space, where p is termed the receptive field size. For an image of N × M pixels (with d channels), we obtain a (N - p + 1) × (M - p + 1) feature map (with K channels), following the same setting as in [13], by estimating the representation for each p × p sub-patch of the input image. To reduce the dimensionality of the image representation, we use a pooling method similar to [13] to form a reduced 4K-dimensional pooled representation for image classification. Given the pooled feature for each image, we use a linear SVM for classification.

6.2 Classification on Caltech 101

The Caltech 101 dataset consists of 9144 images divided among 101 object classes and 1 background class, including animals, vehicles, etc. Following the common experimental setup [17], we run our algorithm with 15 and 30 training images per category, using a basis size of K = 1020 and 10 × 10 receptive fields. Comparison results are shown in Table 1. We compare our classification accuracy with ScSPM [17], D-KSVD [6], LC-KSVD [19], RICA [13], KICA [23], KSR [25] and DRICA [15]. In addition, in order to compare with DRICA, we incorporate the discrimination constraint (10) into the RICA framework (2), namely d-RICA. Table 1 shows that kRICA and d-kRICA outperform the other competing approaches.

TABLE 1
Image classification accuracy on the Caltech 101 dataset

Training size     15      30
ScSPM [17]        67.0%   73.2%
D-KSVD [6]        65.1%   73.0%
LC-KSVD [19]      67.7%   73.6%
RICA [13]         67%     73.7%
KICA [23]         65.2%   72.8%
KSR [25]          67.9%   75%
DRICA [15]        67.8%   74.4%
d-RICA            68.7%   75.6%
kRICA             68.2%   75.4%
d-kRICA           73%     77.1%
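The per-image pipeline of Section 6.1 can be summarized in a few lines: encode every p × p sub-patch and then pool the resulting feature map over four spatial regions to obtain a 4K-dimensional vector. The sketch below is our own simplified illustration with a generic encoder and 2 × 2 quadrant sum-pooling; the paper only states that the pooled representation is 4K-dimensional, so the pooling details here are assumptions, not the authors' released pipeline.

```python
import numpy as np

def extract_patches(img, p):
    """All p x p sub-patches of an N x M x d image, flattened to rows of length p*p*d."""
    N, M, d = img.shape
    patches = [img[i:i+p, j:j+p, :].reshape(-1)
               for i in range(N - p + 1) for j in range(M - p + 1)]
    return np.array(patches), (N - p + 1, M - p + 1)

def pooled_representation(img, p, encode):
    """(N-p+1) x (M-p+1) x K feature map, sum-pooled over the 2 x 2 image quadrants -> 4K dims."""
    patches, (rows, cols) = extract_patches(img, p)
    feat = encode(patches).reshape(rows, cols, -1)        # K channels per spatial location
    hr, hc = rows // 2, cols // 2
    quadrants = [feat[:hr, :hc], feat[:hr, hc:], feat[hr:, :hc], feat[hr:, hc:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])   # 4K-dimensional vector

# Toy usage with a random linear encoder standing in for s = W phi(x).
rng = np.random.default_rng(0)
K, p, d = 32, 6, 3
W = rng.standard_normal((K, p * p * d))
img = rng.random((32, 32, d))
print(pooled_representation(img, p, encode=lambda P: P @ W.T).shape)   # (128,) = 4K
```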

TABLE 2
Test classification accuracy on the CIFAR-10 dataset

Model                                    Accuracy
Improved Local Coord. Coding [18]        74.5%
Conv. Deep Belief Net (2 layers) [33]    78.9%
Sparse auto-encoder [12]                 73.4%
Sparse RBM [12]                          72.4%
K-means (Hard) [12]                      68.6%
K-means (Triangle) [12]                  77.9%
K-means (Triangle, 4000 features) [12]   79.6%
RICA [13]                                81.4%
KICA [23]                                78.3%
KSR [25]                                 82.6%
DRICA [15]                               82%
d-RICA                                   82.9%
kRICA                                    83.4%
d-kRICA                                  84.5%

6.3 Classification on CIFAR-10

The CIFAR-10 dataset includes 10 categories and 60,000 color images in all, with 6,000 images per category, such as airplane, automobile, truck and horse. There are 50,000 training images and 10,000 testing images: 1,000 images from each class are randomly selected as test images and the other 5,000 images from each class serve as training images. In this experiment, we fix the size of the basis set to 4000 with 6 × 6 receptive fields, following [12]. We compare our approach with RICA, K-means (Triangle, 4000 features) [12], KSR, DRICA and d-RICA, etc. Table 2 shows the effectiveness of our proposed kRICA and d-kRICA.

Fig. 1. Classification performance on the STL-10 dataset with varying basis size and 8 × 8 receptive fields (accuracy (%) versus the size of the over-complete basis for d-kRICA, kRICA, KSR, d-RICA, DRICA, RICA, ICA, KICA and k-means).

TABLE 3
Test classification accuracy on the STL-10 dataset

Model                                    Accuracy
Raw pixels [12]                          31.8%
K-means (Triangle, 1600 features) [12]   51.5%
RICA (8x8 receptive fields) [13]         54%
RICA (10x10 receptive fields) [13]       52.9%
KICA [23]                                51.1%
KSR [25]                                 54.4%
DRICA [15]                               54.2%
d-RICA                                   54.8%
kRICA                                    55.2%
d-kRICA                                  56.9%

6.4 Classification on STL-10

STL-10 has 10 classes (e.g., airplane, dog, monkey and ship), where each image is a 96x96-pixel color image. The dataset provides 500 training images per class (10 pre-defined folds), 800 test images per class and 100,000 unlabeled images for unsupervised learning. In our experiments, we choose the basis size and 8 × 8 receptive fields in the same manner as described in [13]. Table 3 shows the classification results of raw pixels [12], K-means, RICA, KSR, DRICA, d-RICA, kRICA and d-kRICA.

As can be seen, d-RICA achieves better performance than DRICA on all of the above datasets. This is because DRICA only minimizes the inhomogeneous representation cost for structured basis learning, while d-RICA simultaneously maximizes the homogeneous representation cost and minimizes the inhomogeneous representation cost, which makes the learned sparse representations carry more discriminative power. Although both DRICA and d-RICA introduce class information, the unsupervised kRICA still performs better than both of these algorithms. This means that kRICA captures more discriminative power for classification by representing data with nonlinear structure. Additionally, since kRICA uses L2 pooling instead of the L1 penalty to achieve feature invariance, it demonstrates better performance than KSR. Furthermore, d-kRICA achieves better performance than kRICA in all cases by bringing in class information.

We also investigate the effect of basis size for our proposed kRICA and d-kRICA on the STL-10 dataset. In our experiments, we try seven sizes, including 100, 200, 400, 800 and 1200. As shown in Fig. 1, the classification accuracies of d-kRICA and kRICA continue to increase as the basis size grows, with only slight gains beyond a basis size of 800. In particular, d-kRICA outperforms all the other algorithms across the whole range.
6.5 Tuning Parameter and Kernel Selection

In the experiments, the tuning parameters of kRICA and d-kRICA, i.e., λ, α and γ in the objective function, are chosen by cross-validation to avoid over-fitting. More specifically, we experimentally set these parameters as follows.

The effect of λ: The parameter λ is the weight of the sparsity term, which is an important factor in kRICA. To facilitate parameter selection, we experimentally investigate how the performance of kRICA varies with λ on the STL-10 dataset in Fig. 2 (with γ = 10^-1).

Fig. 2. The relationship between the weight of the sparsity term (λ) and classification accuracy on the STL-10 dataset (accuracy (%) versus λ for RICA and kRICA).

Fig. 3. The relationship between the weight of the discrimination constraint term (α) and classification accuracy on the STL-10 dataset (accuracy (%) versus α for d-kRICA, d-RICA and DRICA).

Fig. 2 shows that kRICA achieves its best performance when λ is fixed to 10^-2. Thus, we set λ = 10^-2 for the STL-10 data. In addition, we test the accuracy of RICA under the same sparsity weight. It is easy to see that our proposed nonlinear RICA (kRICA) consistently outperforms linear RICA across the range of λ. Similarly, we experimentally set λ = 10^-1 for the Caltech data and λ = 10^-2 for the CIFAR-10 data.

The effect of α: The parameter α controls the weight of the discrimination constraint term. When α = 0, the supervised d-kRICA optimization problem reduces to the unsupervised kRICA problem. Fig. 3 shows the relationship between the weight of the discrimination constraint term α and classification accuracy on STL-10. We can see that d-kRICA achieves its best performance when α = 10^-1; hence, we set α = 10^-1 for the STL-10 data. In particular, d-RICA achieves better performance than DRICA over a wide range of α values. This is because DRICA only minimizes the inhomogeneous representation cost, while d-RICA jointly optimizes both the homogeneous and inhomogeneous representation costs for basis learning, which makes the learned sparse representations carry more discriminative power. Furthermore, by representing data with nonlinear structure, d-kRICA provides even more discriminative power for classification and outperforms both of these algorithms. Similarly, we set α = 1 for the Caltech data and α = 10^-1 for the CIFAR-10 data.

The effect of γ: When we use the Gaussian kernel in kRICA, it is vital to select the kernel parameter γ, which affects the image classification accuracy. Fig. 4 shows the relationship between γ and classification accuracy on the STL-10 dataset. Therefore, we set γ = 10^-1 for the STL-10 data. Similarly, we experimentally set γ = 10^-2 for the Caltech data and γ = 10^-1 for the CIFAR-10 data.

Fig. 4. Classification performance on the STL-10 dataset with varying kernel parameter (γ) in the Gaussian kernel (accuracy (%) versus γ for kRICA).

We also investigate the effect of different kernels for kRICA in image classification, i.e., the Polynomial kernel (1 + x^T y)^b, the Inverse Distance kernel 1/(1 + b||x - y||), the Inverse Square Distance kernel 1/(1 + b||x - y||^2), and the Exponential Histogram Intersection kernel \sum_i \min(e^{b x_i}, e^{b y_i}). Table 4 reports the classification performance of the different kernels on the STL-10 dataset; the Gaussian kernel outperforms the other kernels, so we employ the Gaussian kernel in our studies. Following [26], we set b = 3 for the Polynomial kernel and b = 1 for the others.

TABLE 4
Classification performance of different kernels on the STL-10 dataset

Kernel                                      Accuracy
Polynomial kernel                           54.2%
Inverse Distance kernel                     38.3%
Inverse Square Distance kernel              47.6%
Exponential Histogram Intersection kernel   36.5%
Gaussian kernel                             56.9%
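For reference, the kernels compared in Table 4 can be written compactly as below. This is a small illustrative sketch of the formulas as stated above, with b and γ as the hyper-parameters just discussed; the exact formulations in the original experiments may differ slightly, so treat the function names and defaults as our own assumptions.

```python
import numpy as np

def polynomial(x, y, b=3):
    """Polynomial kernel (1 + x^T y)^b."""
    return (1.0 + x @ y) ** b

def inverse_distance(x, y, b=1.0):
    """Inverse Distance kernel 1 / (1 + b * ||x - y||)."""
    return 1.0 / (1.0 + b * np.linalg.norm(x - y))

def inverse_square_distance(x, y, b=1.0):
    """Inverse Square Distance kernel 1 / (1 + b * ||x - y||^2)."""
    return 1.0 / (1.0 + b * np.sum((x - y) ** 2))

def exp_histogram_intersection(x, y, b=1.0):
    """Exponential Histogram Intersection kernel sum_i min(e^{b x_i}, e^{b y_i})."""
    return np.minimum(np.exp(b * x), np.exp(b * y)).sum()

def gaussian(x, y, gamma=0.1):
    """Gaussian kernel exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([0.2, 0.5, 0.1]), np.array([0.3, 0.4, 0.0])
for kern in (polynomial, inverse_distance, inverse_square_distance,
             exp_histogram_intersection, gaussian):
    print(kern.__name__, kern(x, y))
```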

6.6 Similarity Analysis

In the sections above, we have shown the effectiveness of kRICA and d-kRICA for image classification. To further illustrate their performance, we first choose 90 images from three classes of Caltech 101, with 30 images per class. We then compute the similarity between the sparse representations of these images for RICA, kRICA and d-kRICA, respectively. Fig. 5 shows the similarity matrices corresponding to the sparse representations of RICA, kRICA and d-kRICA. Each element (i, j) of a similarity matrix is the sparse representation similarity, measured by Euclidean distance, between images i and j. Since a good sparse representation method should make the new representations belonging to the same class more similar, its similarity matrix should be block-wise. Fig. 5 shows that the nonlinear kRICA carries more discriminative power than the linear RICA, and d-kRICA achieves the best result by bringing in class information.

Fig. 5. The similarity matrices for the sparse representations of (a) RICA, (b) kRICA and (c) d-kRICA.

7 CONCLUSIONS

In this paper, we propose a kernel ICA model with a reconstruction constraint (kRICA) to capture nonlinear features. To bring in class information, we further extend the unsupervised kRICA to a supervised one by introducing a discrimination constraint. This constraint leads to learning a structured basis consisting of basis vectors from different basis subsets corresponding to different class labels. Each subset will then sparsely represent its own class well but not the others. Furthermore, data samples belonging to the same class will have similar representations, and thereby the obtained sparse representations carry more discriminative power. The experiments conducted on standard datasets demonstrate the effectiveness of the proposed method.

APPENDIX A
PROOF OF LEMMA 4.1

Proof. Since the input data set X = {x_i}_{i=1}^m is whitened in the feature space by KPCA, we have E[φ(X)φ(X)^T] = \sum_{i=1}^{m} φ(x_i)φ(x_i)^T = I, where I is the identity matrix. Furthermore, we can obtain

\sum_{i=1}^{m} \|W^T W φ(x_i) - φ(x_i)\|_2^2
  = \sum_{i=1}^{m} Tr[ (W^T W φ(x_i) - φ(x_i))^T (W^T W φ(x_i) - φ(x_i)) ]
  = Tr[ (W^T W - I)^T (W^T W - I) \sum_{i=1}^{m} φ(x_i)φ(x_i)^T ]
  = \|W^T W - I\|_F^2,

where Tr[·] denotes the trace of a matrix and the derivation uses the matrix property Tr(AB) = Tr(BA). Thus, the reconstruction cost is equivalent to the orthonormality constraint when the data is whitened in the feature space.

APPENDIX B
PROOF OF THE CONVEXITY OF d(s_i)

We rewrite Equation (10) as

d(s_i) = \|D_{-y_i} s_i\|_2^2 - \|D_{+y_i} s_i\|_2^2 + η \|s_i\|_2^2
       = Tr[ s_i^T D_{-y_i}^T D_{-y_i} s_i - s_i^T D_{+y_i}^T D_{+y_i} s_i + η s_i^T s_i ].   (12)

Then we can obtain its Hessian matrix ∇²d with respect to s_i:

∇²d = 2 D_{-y_i}^T D_{-y_i} - 2 D_{+y_i}^T D_{+y_i} + 2ηI.   (13)

Without loss of generality, we assume

D_{+y_i} = [0 ... 0  \overbrace{1 ... 1}^{k}  0 ... 0] ∈ R^{1×K},   D_{-y_i} = [1 ... 1  \overbrace{0 ... 0}^{k}  1 ... 1] ∈ R^{1×K}.

After some derivation, we have ∇²d = 2A, where A_{ii} = η + 1 for inhomogeneous coordinates i, A_{ii} = η - 1 for homogeneous coordinates i, A_{ij} = 1 when i ≠ j are both inhomogeneous, A_{ij} = -1 when i ≠ j are both homogeneous, and A_{ij} = 0 otherwise. The convexity of d(s_i) depends on whether its Hessian matrix ∇²d, i.e., the matrix A, is positive definite [34]. The K × K matrix A is positive definite if and only if z^T A z > 0 for all nonzero vectors z ∈ R^K [35], where z^T denotes the transpose. Let the size of the upper-left (inhomogeneous) block of A be t × t, and let z = [z_1, ..., z_t, z_{t+1}, ..., z_{t+k}, z_{t+k+1}, ..., z_K]^T. Then we have

Az = [ (η+1)z_1 + z_2 + ... + z_t + z_{t+k+1} + ... + z_K
       ...
       z_1 + z_2 + ... + (η+1)z_t + z_{t+k+1} + ... + z_K
       (η-1)z_{t+1} - z_{t+2} - ... - z_{t+k}
       ...
       -z_{t+1} - z_{t+2} - ... + (η-1)z_{t+k}
       z_1 + z_2 + ... + z_t + (η+1)z_{t+k+1} + ... + z_K
       ...
       z_1 + z_2 + ... + z_t + z_{t+k+1} + ... + (η+1)z_K ].

Furthermore, we can get

z^T A z = (η+1) \sum_{i=1}^{t} z_i^2 + (η-1) \sum_{i=t+1}^{t+k} z_i^2 + (η+1) \sum_{i=t+k+1}^{K} z_i^2
          + 2 \sum_{1≤i<j≤t} z_i z_j + 2 \sum_{1≤i≤t, t+k+1≤j≤K} z_i z_j + 2 \sum_{t+k+1≤i<j≤K} z_i z_j - 2 \sum_{t+1≤i<j≤t+k} z_i z_j
        = η ( \sum_{i=1}^{t} z_i^2 + \sum_{i=t+k+1}^{K} z_i^2 ) + ( z_1 + z_2 + ... + z_t + z_{t+k+1} + ... + z_K )^2
          + (η-1) \sum_{i=t+1}^{t+k} z_i^2 - 2 \sum_{t+1≤i<j≤t+k} z_i z_j.

Define h(η) = z^T A z. When η ≥ k+1, it is easy to verify that

h(η) ≥ h(k+1) = (k+1) ( \sum_{i=1}^{t} z_i^2 + \sum_{i=t+k+1}^{K} z_i^2 ) + ( z_1 + ... + z_t + z_{t+k+1} + ... + z_K )^2
                + k \sum_{i=t+1}^{t+k} z_i^2 - 2 \sum_{t+1≤i<j≤t+k} z_i z_j
              = k ( \sum_{i=1}^{t} z_i^2 + \sum_{i=t+k+1}^{K} z_i^2 ) + \sum_{i=1}^{K} z_i^2
                + ( z_1 + ... + z_t + z_{t+k+1} + ... + z_K )^2 + \sum_{t+1≤i<j≤t+k} ( z_i - z_j )^2.

Since \sum_{i=1}^{K} z_i^2 > 0 for nonzero z, we have h(η) ≥ h(k+1) > 0. Thus, the Hessian matrix ∇²d is positive definite for η ≥ k+1, which guarantees that d(s_i) is convex in s_i.
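As a quick sanity check on this result, the sketch below forms the matrix A = D_-^T D_- - D_+^T D_+ + ηI for a small configuration, building the indicator vectors of Section 5.1 inline, and verifies that its smallest eigenvalue is positive at η = k + 1. This is our own illustrative code, not part of the paper.

```python
import numpy as np

def hessian_half(y, c, k, eta):
    """A = D_{-y}^T D_{-y} - D_{+y}^T D_{+y} + eta * I, so that the Hessian of d(s) is 2A."""
    K = k * c
    d_plus = np.zeros(K)
    d_plus[(y - 1) * k : y * k] = 1.0
    d_minus = 1.0 - d_plus
    return np.outer(d_minus, d_minus) - np.outer(d_plus, d_plus) + eta * np.eye(K)

c, k, y = 4, 3, 2
A = hessian_half(y, c, k, eta=k + 1)
print(np.linalg.eigvalsh(A).min())                                   # > 0: strictly convex at eta = k + 1
print(np.linalg.eigvalsh(hessian_half(y, c, k, eta=k - 0.5)).min())  # negative when eta is below k
```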
ACKNOWLEDGMENTS

We thank Wende Dong for helpful discussions, and acknowledge Quoc V. Le for providing the RICA code.

REFERENCES

[1] E. Candes and M. Wakin, "An introduction to compressive sampling," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21-30, March 2008.
[2] D. Donoho, "Compressed sensing," IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289-1306, 2006.
[3] D. Lee, H. Seung, et al., "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788-791, 1999.
[4] B. Olshausen et al., "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, no. 6583, pp. 607-609, 1996.
[5] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311-4322, 2006.
[6] Q. Zhang and B. Li, "Discriminative K-SVD for dictionary learning in face recognition," in Proc. Comput. Vis. Pattern Recognit., 2010.
[7] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy layer-wise training of deep networks," in Proc. Adv. Neural Inform. Process. Syst., vol. 19, 2007, p. 153.
[8] G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527-1554, 2006.
[9] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. Wiley-Interscience, 2001, vol. 26.
[10] A. Hyvärinen, J. Hurri, and P. Hoyer, Natural Image Statistics. Springer, 2009.
[11] A. Hyvärinen, P. Hoyer, and M. Inki, "Topographic independent component analysis," Neural Comput., vol. 13, no. 7, 2001.
[12] A. Coates, H. Lee, and A. Ng, "An analysis of single-layer networks in unsupervised feature learning," in Proc. AISTATS, 2011.
[13] Q. Le, A. Karpenko, J. Ngiam, and A. Ng, "ICA with reconstruction cost for efficient overcomplete feature learning," in Proc. Adv. Neural Inform. Process. Syst., 2011.
[14] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[15] Y. Xiao, Z. Zhu, S. Wei, and Y. Zhao, "Discriminative ICA model with reconstruction constraint for image classification," in Proc. ACM Multimedia, 2012.
[16] D. Donoho, "For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution," Commun. Pure Appl. Math., vol. 59, no. 6, pp. 797-829, 2006.
[17] J. Yang, K. Yu, Y. Gong, and T. Huang, "Linear spatial pyramid matching using sparse coding for image classification," in Proc. Comput. Vis. Pattern Recognit., 2009.
[18] K. Yu and T. Zhang, "Improved local coordinate coding using local tangents," in Proc. Int. Conf. Mach. Learn., 2010.
[19] Z. Jiang, Z. Lin, and L. Davis, "Learning a discriminative dictionary for sparse coding via label consistent K-SVD," in Proc. Comput. Vis. Pattern Recognit., 2011.
[20] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210-227, 2009.

[21] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Non-local sparse models for image restoration," in Proc. Comput. Vis. Pattern Recognit., 2009.
[22] M. Elad, Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing. Springer, 2010.
[23] J. Yang, X. Gao, D. Zhang, and J. Yang, "Kernel ICA: An alternative formulation and its application to face recognition," Pattern Recognition, vol. 38, no. 10, 2005.
[24] F. Bach and M. Jordan, "Kernel independent component analysis," J. Mach. Learn. Res., vol. 3, pp. 1-48, 2003.
[25] S. Gao, I. Tsang, and L. Chia, "Kernel sparse representation for image classification and face recognition," in Proc. ECCV, 2010, pp. 1-14.
[26] S. Gao, I. W.-H. Tsang, and L.-T. Chia, "Sparse representation with kernels," IEEE Trans. Image Process., vol. 22, no. 2, Feb. 2013.
[27] Q. Le, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng, "Building high-level features using large-scale unsupervised learning," in Proc. Int. Conf. Mach. Learn., 2012.
[28] Y. LeCun, "Learning invariant feature hierarchies," in Proc. ECCV Workshops and Demonstrations, 2012.
[29] Y.-L. Boureau, J. Ponce, and Y. LeCun, "A theoretical analysis of feature pooling in visual recognition," in Proc. Int. Conf. Mach. Learn., 2010, pp. 111-118.
[30] M. Schmidt, minFunc, 2005.
[31] J. Yang, J. Wang, and T. Huang, "Learning the sparse representation for classification," in Proc. ICME, 2011, pp. 1-6.
[32] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Comput. Vis. Image Und., vol. 106, no. 1, pp. 59-70, 2007.
[33] A. Krizhevsky, "Convolutional deep belief networks on CIFAR-10," Unpublished manuscript, 2010.
[34] S. Boyd and L. Vandenberghe, Convex Optimization, 2004.
[35] G. H. Golub and C. F. Van Loan, Matrix Computations, 1996.


More information

On the theoretical analysis of cross validation in compressive sensing

On the theoretical analysis of cross validation in compressive sensing MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.erl.co On the theoretical analysis of cross validation in copressive sensing Zhang, J.; Chen, L.; Boufounos, P.T.; Gu, Y. TR2014-025 May 2014 Abstract

More information

Estimating Parameters for a Gaussian pdf

Estimating Parameters for a Gaussian pdf Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3

More information

Compressive Distilled Sensing: Sparse Recovery Using Adaptivity in Compressive Measurements

Compressive Distilled Sensing: Sparse Recovery Using Adaptivity in Compressive Measurements 1 Copressive Distilled Sensing: Sparse Recovery Using Adaptivity in Copressive Measureents Jarvis D. Haupt 1 Richard G. Baraniuk 1 Rui M. Castro 2 and Robert D. Nowak 3 1 Dept. of Electrical and Coputer

More information

arxiv: v1 [cs.cv] 28 Aug 2015

arxiv: v1 [cs.cv] 28 Aug 2015 Discrete Hashing with Deep Neural Network Thanh-Toan Do Anh-Zung Doan Ngai-Man Cheung Singapore University of Technology and Design {thanhtoan do, dung doan, ngaian cheung}@sutd.edu.sg arxiv:58.748v [cs.cv]

More information

A Nonlinear Sparsity Promoting Formulation and Algorithm for Full Waveform Inversion

A Nonlinear Sparsity Promoting Formulation and Algorithm for Full Waveform Inversion A Nonlinear Sparsity Prooting Forulation and Algorith for Full Wavefor Inversion Aleksandr Aravkin, Tristan van Leeuwen, Jaes V. Burke 2 and Felix Herrann Dept. of Earth and Ocean sciences University of

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

HIGH RESOLUTION NEAR-FIELD MULTIPLE TARGET DETECTION AND LOCALIZATION USING SUPPORT VECTOR MACHINES

HIGH RESOLUTION NEAR-FIELD MULTIPLE TARGET DETECTION AND LOCALIZATION USING SUPPORT VECTOR MACHINES ICONIC 2007 St. Louis, O, USA June 27-29, 2007 HIGH RESOLUTION NEAR-FIELD ULTIPLE TARGET DETECTION AND LOCALIZATION USING SUPPORT VECTOR ACHINES A. Randazzo,. A. Abou-Khousa 2,.Pastorino, and R. Zoughi

More information

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition Upper bound on false alar rate for landine detection and classification using syntactic pattern recognition Ahed O. Nasif, Brian L. Mark, Kenneth J. Hintz, and Nathalia Peixoto Dept. of Electrical and

More information

Stochastic Subgradient Methods

Stochastic Subgradient Methods Stochastic Subgradient Methods Lingjie Weng Yutian Chen Bren School of Inforation and Coputer Science University of California, Irvine {wengl, yutianc}@ics.uci.edu Abstract Stochastic subgradient ethods

More information

Kernel based Object Tracking using Color Histogram Technique

Kernel based Object Tracking using Color Histogram Technique Kernel based Object Tracking using Color Histogra Technique 1 Prajna Pariita Dash, 2 Dipti patra, 3 Sudhansu Kuar Mishra, 4 Jagannath Sethi, 1,2 Dept of Electrical engineering, NIT, Rourkela, India 3 Departent

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

On Conditions for Linearity of Optimal Estimation

On Conditions for Linearity of Optimal Estimation On Conditions for Linearity of Optial Estiation Erah Akyol, Kuar Viswanatha and Kenneth Rose {eakyol, kuar, rose}@ece.ucsb.edu Departent of Electrical and Coputer Engineering University of California at

More information

arxiv: v2 [cs.lg] 30 Mar 2017

arxiv: v2 [cs.lg] 30 Mar 2017 Batch Renoralization: Towards Reducing Minibatch Dependence in Batch-Noralized Models Sergey Ioffe Google Inc., sioffe@google.co arxiv:1702.03275v2 [cs.lg] 30 Mar 2017 Abstract Batch Noralization is quite

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

Multi-Scale/Multi-Resolution: Wavelet Transform

Multi-Scale/Multi-Resolution: Wavelet Transform Multi-Scale/Multi-Resolution: Wavelet Transfor Proble with Fourier Fourier analysis -- breaks down a signal into constituent sinusoids of different frequencies. A serious drawback in transforing to the

More information

arxiv: v1 [cs.ds] 29 Jan 2012

arxiv: v1 [cs.ds] 29 Jan 2012 A parallel approxiation algorith for ixed packing covering seidefinite progras arxiv:1201.6090v1 [cs.ds] 29 Jan 2012 Rahul Jain National U. Singapore January 28, 2012 Abstract Penghui Yao National U. Singapore

More information

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science

Qualitative Modelling of Time Series Using Self-Organizing Maps: Application to Animal Science Proceedings of the 6th WSEAS International Conference on Applied Coputer Science, Tenerife, Canary Islands, Spain, Deceber 16-18, 2006 183 Qualitative Modelling of Tie Series Using Self-Organizing Maps:

More information

e-companion ONLY AVAILABLE IN ELECTRONIC FORM

e-companion ONLY AVAILABLE IN ELECTRONIC FORM OPERATIONS RESEARCH doi 10.1287/opre.1070.0427ec pp. ec1 ec5 e-copanion ONLY AVAILABLE IN ELECTRONIC FORM infors 07 INFORMS Electronic Copanion A Learning Approach for Interactive Marketing to a Custoer

More information

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels

Extension of CSRSM for the Parametric Study of the Face Stability of Pressurized Tunnels Extension of CSRSM for the Paraetric Study of the Face Stability of Pressurized Tunnels Guilhe Mollon 1, Daniel Dias 2, and Abdul-Haid Soubra 3, M.ASCE 1 LGCIE, INSA Lyon, Université de Lyon, Doaine scientifique

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Fixed-to-Variable Length Distribution Matching

Fixed-to-Variable Length Distribution Matching Fixed-to-Variable Length Distribution Matching Rana Ali Ajad and Georg Böcherer Institute for Counications Engineering Technische Universität München, Gerany Eail: raa2463@gail.co,georg.boecherer@tu.de

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and This article appeared in a ournal published by Elsevier. The attached copy is furnished to the author for internal non-coercial research and education use, including for instruction at the authors institution

More information

Combining Classifiers

Combining Classifiers Cobining Classifiers Generic ethods of generating and cobining ultiple classifiers Bagging Boosting References: Duda, Hart & Stork, pg 475-480. Hastie, Tibsharini, Friedan, pg 246-256 and Chapter 10. http://www.boosting.org/

More information

Warning System of Dangerous Chemical Gas in Factory Based on Wireless Sensor Network

Warning System of Dangerous Chemical Gas in Factory Based on Wireless Sensor Network 565 A publication of CHEMICAL ENGINEERING TRANSACTIONS VOL. 59, 07 Guest Editors: Zhuo Yang, Junie Ba, Jing Pan Copyright 07, AIDIC Servizi S.r.l. ISBN 978-88-95608-49-5; ISSN 83-96 The Italian Association

More information

UNIVERSITY OF TRENTO ON THE USE OF SVM FOR ELECTROMAGNETIC SUBSURFACE SENSING. A. Boni, M. Conci, A. Massa, and S. Piffer.

UNIVERSITY OF TRENTO ON THE USE OF SVM FOR ELECTROMAGNETIC SUBSURFACE SENSING. A. Boni, M. Conci, A. Massa, and S. Piffer. UIVRSITY OF TRTO DIPARTITO DI IGGRIA SCIZA DLL IFORAZIO 3823 Povo Trento (Italy) Via Soarive 4 http://www.disi.unitn.it O TH US OF SV FOR LCTROAGTIC SUBSURFAC SSIG A. Boni. Conci A. assa and S. Piffer

More information

ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD

ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical and Matheatical Sciences 04,, p. 7 5 ON THE TWO-LEVEL PRECONDITIONING IN LEAST SQUARES METHOD M a t h e a t i c s Yu. A. HAKOPIAN, R. Z. HOVHANNISYAN

More information

Using a De-Convolution Window for Operating Modal Analysis

Using a De-Convolution Window for Operating Modal Analysis Using a De-Convolution Window for Operating Modal Analysis Brian Schwarz Vibrant Technology, Inc. Scotts Valley, CA Mark Richardson Vibrant Technology, Inc. Scotts Valley, CA Abstract Operating Modal Analysis

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

ZISC Neural Network Base Indicator for Classification Complexity Estimation

ZISC Neural Network Base Indicator for Classification Complexity Estimation ZISC Neural Network Base Indicator for Classification Coplexity Estiation Ivan Budnyk, Abdennasser Сhebira and Kurosh Madani Iages, Signals and Intelligent Systes Laboratory (LISSI / EA 3956) PARIS XII

More information

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search Quantu algoriths (CO 781, Winter 2008) Prof Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search ow we begin to discuss applications of quantu walks to search algoriths

More information

Fast Structural Similarity Search of Noncoding RNAs Based on Matched Filtering of Stem Patterns

Fast Structural Similarity Search of Noncoding RNAs Based on Matched Filtering of Stem Patterns Fast Structural Siilarity Search of Noncoding RNs Based on Matched Filtering of Ste Patterns Byung-Jun Yoon Dept. of Electrical Engineering alifornia Institute of Technology Pasadena, 91125, S Eail: bjyoon@caltech.edu

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Hybrid System Identification: An SDP Approach

Hybrid System Identification: An SDP Approach 49th IEEE Conference on Decision and Control Deceber 15-17, 2010 Hilton Atlanta Hotel, Atlanta, GA, USA Hybrid Syste Identification: An SDP Approach C Feng, C M Lagoa, N Ozay and M Sznaier Abstract The

More information

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices CS71 Randoness & Coputation Spring 018 Instructor: Alistair Sinclair Lecture 13: February 7 Disclaier: These notes have not been subjected to the usual scrutiny accorded to foral publications. They ay

More information

Data-Driven Imaging in Anisotropic Media

Data-Driven Imaging in Anisotropic Media 18 th World Conference on Non destructive Testing, 16- April 1, Durban, South Africa Data-Driven Iaging in Anisotropic Media Arno VOLKER 1 and Alan HUNTER 1 TNO Stieltjesweg 1, 6 AD, Delft, The Netherlands

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

OPTIMIZATION in multi-agent networks has attracted

OPTIMIZATION in multi-agent networks has attracted Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract

More information

Sparse beamforming in peer-to-peer relay networks Yunshan Hou a, Zhijuan Qi b, Jianhua Chenc

Sparse beamforming in peer-to-peer relay networks Yunshan Hou a, Zhijuan Qi b, Jianhua Chenc 3rd International Conference on Machinery, Materials and Inforation echnology Applications (ICMMIA 015) Sparse beaforing in peer-to-peer relay networs Yunshan ou a, Zhijuan Qi b, Jianhua Chenc College

More information

Topic 5a Introduction to Curve Fitting & Linear Regression

Topic 5a Introduction to Curve Fitting & Linear Regression /7/08 Course Instructor Dr. Rayond C. Rup Oice: A 337 Phone: (95) 747 6958 E ail: rcrup@utep.edu opic 5a Introduction to Curve Fitting & Linear Regression EE 4386/530 Coputational ethods in EE Outline

More information

On the Impact of Kernel Approximation on Learning Accuracy

On the Impact of Kernel Approximation on Learning Accuracy On the Ipact of Kernel Approxiation on Learning Accuracy Corinna Cortes Mehryar Mohri Aeet Talwalkar Google Research New York, NY corinna@google.co Courant Institute and Google Research New York, NY ohri@cs.nyu.edu

More information