Sparse Representation Based Augmented Multinomial Logistic Extreme Learning Machine with Weighted Composite Features for Spectral-Spatial Classification of Hyperspectral Images

Faxian Cao 1, Zhijing Yang 1*, Jinchang Ren 2, Wing-Kuen Ling 1
1 School of Information Engineering, Guangdong University of Technology, Guangzhou, China
2 Department of Electronic and Electrical Engineering, University of Strathclyde, Glasgow, G1 1XW, UK
* yzhj@gdut.edu.cn

Abstract: Although the extreme learning machine (ELM) has been successfully applied to a number of pattern recognition problems, it fails to provide sufficiently good results in hyperspectral image (HSI) classification due to two main drawbacks. The first is the random weights and bias of the ELM, which may lead to ill-posed problems. The second is the lack of spatial information in the classification. To tackle these two problems, this paper proposes a new framework for ELM-based spectral-spatial classification of HSI, where probabilistic modelling with sparse representation and weighted composite features (WCFs) are employed to derive the optimized output weights and to extract spatial features, respectively. First, the ELM is represented as a concave logarithmic likelihood function under statistical modelling using the maximum a posteriori (MAP) criterion. Second, a sparse representation with a Laplacian prior is applied to efficiently obtain a logarithmic posterior with a unique maximum, which solves the ill-posed problem of the ELM. Variable splitting and the augmented Lagrangian are subsequently used to further reduce the computational complexity of the proposed algorithm, which has proven to be an efficient way to improve the speed. Third, spatial information is extracted using the weighted composite features (WCFs) to construct the spectral-spatial classification framework. In addition, the lower bound of the proposed method is derived by a rigorous mathematical proof. Experimental results on two publicly available HSI data sets demonstrate that the proposed methodology outperforms the ELM and a number of state-of-the-art approaches.

Index Terms: hyperspectral image (HSI), spectral-spatial classification, extreme learning machine (ELM), maximum a posteriori (MAP), sparse representation, Laplacian prior, variable splitting, augmented Lagrangian.

I. INTRODUCTION

With the rich spectral and spatial information contained in a three-dimensional hypercube, hyperspectral images (HSIs) provide a unique way to characterize objects in geographical scenes, especially in remote sensing [1]. However, classification of high-dimensional data such as HSI is still challenging, particularly because of the unfavorable ratio between the limited number of training samples and the large number of spectral bands, i.e. the Hughes phenomenon [2]-[4]. To tackle this problem, a number of feature extraction and data classification approaches have been proposed [12]. These include singular spectrum analysis (SSA) [5]-[8], segmented auto-encoders [11], principal component analysis (PCA) and its variations [13]-[14], and spectral-spatial classification using multiple kernels and active learning [15][16], where support vector machines (SVMs) [9][10] are widely used for the data classification. Although these approaches have produced quite good results, their performance can be further improved by addressing two main difficulties: (1) Inaccurate classification under a large number of spectral bands yet limited training samples.
(2) Relatively low efficiency in processing high-dimensional HSI data.

As a single-hidden-layer feedforward neural network, the extreme learning machine (ELM) is a fast and effective machine learning method and has received wide attention due to its good performance [17]-[19]. The ELM does not need to tune the hidden-layer parameters once the number of hidden nodes is determined. In the ELM, the weight and bias vectors between the input layer and the hidden layer are randomly generated and are independent of the training samples and the specific application [1]. The ELM has achieved good performance in many applications [20]-[23] and has also been widely applied to HSI classification. For example, in [24] and [25], bilateral filtering and extended morphological profiles were used for feature extraction, respectively, with the ELM used for classification. In [26]-[28], superpixels, the watershed transform and Gabor filters were used for feature extraction, again with the ELM as the classifier. Although these ELM-based methods have achieved good performance, they simply combine the ELM with another feature extraction method and ignore an important problem of the ELM: the randomly generated input weights and biases may cause ill-posed problems. From this perspective, we propose the augmented sparse multinomial logistic ELM (ASMLELM) for HSI classification. Based on the proposed ASMLELM, we further propose weighted composite features (WCFs) to extract rich spatial information. Therefore, a novel framework, ASMLELM-WCFs, is proposed for spectral-spatial classification of HSI, and the main contributions can be highlighted as follows.

First, the ELM is represented by a maximum a posteriori (MAP) based probabilistic model, which is further expressed as a concave logarithmic likelihood function (LLF). As the LLF value can be arbitrarily large if the training data are separable, a prior/regularization term on the LLF is critical [29], for which a sparse representation of the ELM is employed. To settle the ill-posed problem of the ELM, a Laplacian prior/regularization term is applied to improve the sparsity of the learnt weights of the proposed ELM classifier. The Laplacian density is the heaviest-tailed density that is still log-concave, and its sparsity-promoting nature has been theoretically well justified [29]-[32]. This guarantees that the logarithmic posterior has a unique maximum when it is combined with a concave logarithmic likelihood [29]. As the ELM is represented by a probabilistic model under the MAP criterion, it enjoys the advantage of possessing a unique maximum. The ill-posed problem of the ELM is therefore addressed by this formulation, which we call the sparse multinomial logistic ELM (SMLELM). Recently, variable splitting and the augmented Lagrangian [33] have been proposed for speed improvement. Hence, we adopt variable splitting and the augmented Lagrangian for the proposed SMLELM to construct the augmented SMLELM (ASMLELM), which transforms the SMLELM into a new form for faster computation.

Second, by combining composite kernels (CK) [34] and weighted mean filters (WMFs) [35], the weighted composite features (WCFs) are utilized to extract spatial features and further improve the classification accuracy. Accordingly, three improved spectral-spatial classifiers are derived, based on the basic ELM, the nonlinear ELM (NLELM) and the kernel ELM (KELM), respectively.

Third, the generalization bounds of the proposed method are derived in a similar way to the margin bounds of sparse multinomial logistic regression (SMLR) [29][36][37]. These bounds provide theoretical insight into, and justification for, the proposed methods.

The rest of this paper is organized as follows. In Section II, the background of the ELM is introduced. The proposed method is detailed in Section III. Section IV reports the experimental results and benchmarks them against a few state-of-the-art approaches. Finally, conclusions are drawn in Section V.

II. THE EXTREME LEARNING MACHINE (ELM)

A. Basic Concepts of the ELM

The ELM is a generalized single-hidden-layer feedforward neural network (SLFN) [1], [17]. The weight vectors and the biases between the input layer and the hidden layer are randomly initialized, while the output weights are determined by the learning process. Once the values of the input weight/bias vectors are assigned, the hidden-layer output matrix remains unchanged during learning [1]. Let $X = (x_1, x_2, \dots, x_N) \in R^{d \times N}$ be the training data of an HSI, which has $N$ pixels, each with a $d$-dimensional feature vector. Let $Y = (y_1, y_2, \dots, y_N) \in R^{M \times N}$ be the matrix of class labels of the training samples, where $M$ is the number of classes in the data set. Given a pixel label $y_i$, if the pixel belongs to the $k$-th class, we have
$$y_{i,j} = \begin{cases} 1, & j = k, \\ 0, & \text{otherwise.} \end{cases}$$
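For readers who prefer code, a minimal NumPy illustration of this one-hot label encoding is given below (this sketch and its function name are ours, not part of the paper):

```python
import numpy as np

def one_hot_labels(labels, num_classes):
    """Build the M x N one-hot label matrix Y from integer labels in {0, ..., M-1}."""
    n = len(labels)
    Y = np.zeros((num_classes, n))
    Y[labels, np.arange(n)] = 1.0   # y_{k,i} = 1 when sample i belongs to class k
    return Y

# Example: four samples drawn from three classes
print(one_hot_labels(np.array([0, 2, 1, 2]), 3))
```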
The model of a single-hidden-layer neural network with $L$ hidden neurons and activation function $H(\cdot)$ can be expressed as follows:
$$\sum_{j=1}^{L} \beta_j H(w_j^T x_i + b_j) = y_i, \quad i = 1, 2, \dots, N \qquad (1)$$
where $\beta_j$ is the weight vector between the $j$-th hidden neuron and the output layer; $w_j$ and $b_j$ are the weight vector and bias from the inputs to the $j$-th hidden neuron, respectively; and $H(w_j^T x_i + b_j)$ is the output of the $j$-th hidden neuron for the input sample $x_i$. Equation (1) can be written in the following matrix form:
$$H^T \beta = Y^T \qquad (2)$$
where $\beta = [\beta_1, \dots, \beta_M] \in R^{L \times M}$, $H = [H(x_1), \dots, H(x_N)] \in R^{L \times N}$, and $H(x_i) = [H_1(x_i), \dots, H_L(x_i)]^T \in R^{L \times 1}$. $H$ is the hidden-layer output matrix and $\beta$ is the output weight matrix between the hidden layer and the output layer. From (2), $\beta$ can be obtained simply as
$$\beta = (H^T)^{\dagger} Y^T \qquad (3)$$
where $(\cdot)^{\dagger}$ denotes the Moore-Penrose generalized inverse of a matrix [17].

B. Constrained Optimization of the ELM

The constrained optimization of the ELM aims to achieve not only the smallest training error but also the smallest norm of the output weights [19]:
$$\min \|H^T \beta - Y^T\|^2 \quad \text{and} \quad \min \|\beta\|^2 \qquad (4)$$
According to Bartlett's neural network generalization theory [38], smaller norms of the weights tend to give better generalization performance of feedforward neural networks. According to this optimization theory, Eq. (4) can be rewritten as
$$\min_{\beta, \xi_i} L_{ELM} = \frac{1}{2}\|\beta\|_F^2 + C\,\frac{1}{2}\sum_{i=1}^{N} \|\xi_i\|_2^2, \quad \text{s.t. } H(x_i)^T \beta = y_i^T - \xi_i^T, \quad i = 1, \dots, N \qquad (5)$$
where $\xi_i$ is the training error for the training sample $x_i$ and $C$ is the regularization parameter.
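As an informal illustration of the basic ELM in (1)-(3), the following NumPy sketch (ours, with our own function and variable names, not the authors' implementation) draws random input weights and biases, builds a sigmoid hidden layer, and solves for the output weights with the Moore-Penrose pseudoinverse:

```python
import numpy as np

def elm_train(X, Y, L, rng=None):
    """Basic ELM. X: d x N training data, Y: M x N one-hot labels, L: hidden neurons.
    Returns (W, b, beta) so that class scores are beta.T @ sigmoid(W @ x + b)."""
    rng = np.random.default_rng(0) if rng is None else rng
    d, N = X.shape
    W = rng.standard_normal((L, d))          # random input weights, never retrained
    b = rng.standard_normal((L, 1))          # random biases
    H = 1.0 / (1.0 + np.exp(-(W @ X + b)))   # L x N hidden-layer output matrix
    beta = np.linalg.pinv(H.T) @ Y.T         # Eq. (3): beta = (H^T)^dagger Y^T, L x M
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(W @ X + b)))
    return np.argmax(beta.T @ H, axis=0)     # predicted class per column of X
```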

Based on the Karush-Kuhn-Tucker (KKT) theorem [39], training the ELM is equivalent to solving the following dual optimization problem:
$$\min L_{ELM} = \frac{1}{2}\|\beta\|_F^2 + C\,\frac{1}{2}\sum_{i=1}^{N}\|\xi_i\|_2^2 - \sum_{i=1}^{N}\sum_{j=1}^{M}\alpha_{i,j}\big(H(x_i)^T\beta_j - y_{i,j} + \xi_{i,j}\big) \qquad (6)$$
where $\beta_j$ is the $j$-th column of the matrix $\beta$ and $\alpha_{i,j}$ is a Lagrange multiplier. From the KKT conditions, we can further derive
$$\frac{\partial L_{ELM}}{\partial \beta_j} = 0 \ \Rightarrow\ \beta = H\alpha \qquad (7)$$
$$\frac{\partial L_{ELM}}{\partial \xi_i} = 0 \ \Rightarrow\ \alpha_i = C\,\xi_i, \quad i = 1, \dots, N \qquad (8)$$
$$\frac{\partial L_{ELM}}{\partial \alpha_i} = 0 \ \Rightarrow\ H(x_i)^T\beta = y_i^T - \xi_i^T, \quad i = 1, \dots, N \qquad (9)$$
where $\alpha_i = [\alpha_{i,1}, \alpha_{i,2}, \dots, \alpha_{i,M}]$ and $\alpha = [\alpha_1; \alpha_2; \dots; \alpha_N] \in R^{N \times M}$. Then it can be shown that the output weight $\beta$ is
$$\beta = H\Big(\frac{I}{C} + H^T H\Big)^{-1} Y^T \qquad (10)$$
When the activation functions of the hidden neurons are unknown, any kernel satisfying Mercer's conditions can be used:
$$\Omega_{KELM} = H^T H: \quad \Omega_{KELM}(x_i, x_j) = h(x_i)^T h(x_j) = K(x_i, x_j) \qquad (11)$$
In practice, the Gaussian kernel is a good choice:
$$K_{ELM}(x_i, x_j) = \exp\Big(-\frac{\|x_i - x_j\|^2}{2\sigma_{ELM}^2}\Big) \qquad (12)$$
Based on the above analysis, two well-known constrained optimization variants of the ELM have been proposed. One defines $\beta$ as in (10) without a kernel, namely the nonlinear ELM (NLELM). The other uses the kernel function, giving the kernel ELM (KELM):
$$\beta_{NLELM} = H\Big(\frac{I}{C} + H^T H\Big)^{-1} Y^T \qquad (13)$$
$$\beta_{KELM} = \Big(\frac{I}{C} + K\Big)^{-1} Y^T \qquad (14)$$

III. THE PROPOSED FRAMEWORK

A. Augmented Sparse Multinomial Logistic Extreme Learning Machine (ASMLELM)

The goal of a supervised learning algorithm is to design a classifier, based on a set of $N$ training samples, that is capable of distinguishing $M$ classes from an input vector of length $d$ [29]. Under the multinomial logistic regression model [40], $\beta$ in equations (3), (13) and (14) can be transformed into a new form via a probability model. If training sample $x_i$ belongs to the $j$-th class, the probability model can be written as
$$P(y_{i,j} = 1 \mid H(x_i), \beta) = \frac{\exp\big(\beta_j^T H(x_i)\big)}{\sum_{k=1}^{M}\exp\big(\beta_k^T H(x_i)\big)} \qquad (15)$$
In (3), (13), (14) and (15), $\beta$ may not be optimal due to the ill-posed problem. Therefore, it is important to find the optimal $\beta$ in order to obtain high classification accuracy.
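Before re-estimating $\beta$, the kernel variant (14) and the softmax model (15) can be spelled out in code. The following NumPy sketch is ours and purely illustrative: the kernel width and regularization values are arbitrary placeholders, not the cross-validated values used in the experiments.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between columns of A (d x n) and B (d x m), as in Eq. (12)."""
    sq = np.sum(A**2, 0)[:, None] + np.sum(B**2, 0)[None, :] - 2.0 * A.T @ B
    return np.exp(-sq / (2.0 * sigma**2))

def kelm_train(X, Y, C=2.0**5, sigma=1.0):
    """Kernel ELM coefficients, Eq. (14): pi = (I/C + K)^(-1) Y^T."""
    K = gaussian_kernel(X, X, sigma)                        # N x N training kernel
    return np.linalg.solve(np.eye(K.shape[0]) / C + K, Y.T) # N x M coefficients

def kelm_posteriors(X_train, X_test, pi, sigma=1.0):
    """Softmax class posteriors in the spirit of Eq. (15), with kernel scores as logits."""
    scores = (gaussian_kernel(X_train, X_test, sigma).T @ pi).T  # M x N_test
    e = np.exp(scores - scores.max(axis=0, keepdims=True))       # numerically stable softmax
    return e / e.sum(axis=0, keepdims=True)
```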

To find the optimal parameter $\beta$ for the ELM, $\beta$ is re-estimated after representing the ELM by a probabilistic model. To this end, maximum likelihood (ML) estimation is introduced to the ELM. Let $\beta = [\beta_1; \beta_2; \dots; \beta_M] \in R^{(L \cdot M) \times 1}$ be a column vector with $L \cdot M$ elements. Maximization of the logarithmic likelihood can be expressed as
$$\max_{\beta}\ L(\beta) = \sum_{i=1}^{N}\Big(\sum_{j=1}^{M} y_{i,j}\,\beta_j^T H(x_i) - \log\sum_{j=1}^{M}\exp\big(\beta_j^T H(x_i)\big)\Big) \qquad (16)$$
In order to maximize $L(\beta)$, consider the second-order Taylor series of $L(\beta)$ evaluated at $\hat\beta$:
$$L(\beta) - L(\hat\beta) = (\beta - \hat\beta)^T \nabla L(\hat\beta) + \frac{1}{2}(\beta - \hat\beta)^T \nabla^2 L\big(\hat\beta + \rho(\beta - \hat\beta)\big)(\beta - \hat\beta) \ \geq\ (\beta - \hat\beta)^T \nabla L(\hat\beta) + \frac{1}{2}(\beta - \hat\beta)^T B\,(\beta - \hat\beta) \qquad (17)$$
where $\rho \in (0, 1)$ and
$$B \triangleq -\frac{1}{2}\Big[I - \frac{\mathbf{1}\mathbf{1}^T}{M}\Big] \otimes H H^T \qquad (18)$$
Here $I \in R^{M \times M}$ is the identity matrix, $\mathbf{1} = [1, 1, \dots, 1]^T$, and $\otimes$ is the Kronecker matrix product. This result has been proved in [40], [41]. The ML estimate can then be expressed as
$$\hat\beta = \arg\max_{\beta}\ \beta^T\big(\nabla L(\hat\beta) - B\hat\beta\big) + \frac{1}{2}\beta^T B\,\beta \qquad (19)$$
Hence, $\beta$ at the $(t+1)$-th iteration can be obtained by a simple update equation:
$$\beta^{t+1} = B^{-1}\big(B\beta^{t} - \nabla L(\beta^{t})\big) \qquad (20)$$
From (20), it can be seen that the update is very similar to an iteratively reweighted least squares (IRLS) algorithm [42], with the Hessian matrix of the IRLS algorithm replaced by the matrix $B$. Since $B^{-1}$ can be precomputed, this is a big advantage of the proposed approach: the Hessian in the IRLS algorithm must be re-inverted at each iteration [29], [43], so our approach is more efficient than IRLS. However, the concave LLF value can be arbitrarily large if the training data are separable. From [29], it is known that a prior on the logarithmic likelihood is crucial. In order to address the ill-posed problem of the ELM, a prior/regularization term is therefore imposed on $\beta$. Here, the Laplacian prior is used:
$$L_1(\beta) = L(\beta) - L(\hat\beta) + \log p(\beta) \qquad (21)$$
where
$$p(\beta) \propto \exp\big(-\lambda\|\beta\|_1\big) \qquad (22)$$
and $\|\beta\|_1 = \sum_l |\beta_l|$ denotes the $\ell_1$ norm, with $|\beta_l| = \sqrt{\beta_l^2}$. Consider the following inequality for $h > 0$ and $u \geq 0$: since $\sqrt{h} + u/\sqrt{h} \geq 2\sqrt{u}$,
$$\sqrt{u} \leq \frac{1}{2}\Big(\frac{u}{\sqrt{h}} + \sqrt{h}\Big) \qquad (23)$$
Therefore, for any $\hat\beta$ we have
$$\lambda\|\beta\|_1 \leq \frac{1}{2}\lambda\Big(\sum_l \frac{\beta_l^2}{|\hat\beta_l|} + \sum_l |\hat\beta_l|\Big) \qquad (24)$$
so that the surrogate
$$\beta^T\big(\nabla L(\hat\beta) - B\hat\beta\big) + \frac{1}{2}\beta^T\big(B - \lambda\Lambda\big)\beta \qquad (25)$$
can be maximized instead. Here,
$$\Lambda = \mathrm{diag}\big\{|\hat\beta_1|^{-1}, \dots, |\hat\beta_{LM}|^{-1}\big\} \qquad (26)$$
Finally, (20) becomes
$$\beta^{t+1} = \big(B - \lambda\Lambda_t\big)^{-1}\big(B\beta^{t} - \nabla L(\beta^{t})\big) \qquad (27)$$
From the above, it can be seen that the Laplacian prior/regularization term is applied to $\beta$ with $\lambda$ acting as a regularization parameter. The Laplacian prior imposed on the sparse multinomial logistic ELM (SMLELM) controls the complexity of the SMLELM classifier and improves its generalization capacity, since $p(\beta)$ in (22) forces many components of $\beta$ to be zero. Since the term $L(\beta)$ in (16) is not quadratic and $p(\beta)$ in (22) is nonsmooth, solving this $\ell_1$-regularized optimization problem directly is very difficult. The majorization-minimization method [43] has been proposed to decompose this kind of problem [29], [44]-[47]; however, its computational complexity is very large. In [48], logistic regression via variable splitting and an augmented Lagrangian (LORSAL) was used for speed improvement, and variable splitting with the augmented Lagrangian has also been used for acceleration in other work [33], [16], [48]. This algorithm has been proven to greatly reduce the computational complexity [16], [45], [49]. As the variable splitting and augmented Lagrangian approach performs well in these works, we use it to reduce the computation time of the proposed SMLELM by transforming the SMLELM into a new form. Variable splitting is a very simple procedure that consists in creating a new variable [50]. The problem defined in (21) is then equivalent to:
$$(\hat\beta, \hat v) = \arg\min_{\beta, v}\ -L(\beta) + \lambda\|v\|_1 \quad \text{s.t. } \beta = v \qquad (28)$$
This optimization problem can be solved by applying the alternating direction method of multipliers [51] (see also [52] and the references therein). The resulting neural network is called the augmented SMLELM (ASMLELM). Applying the augmented Lagrangian [50] to solve equation (28), the solution of (28) at the $(t+1)$-th iteration can be written as:
$$\beta^{t+1} = \arg\min_{\beta}\ -L(\beta) + \frac{\gamma}{2}\|\beta - v^{t} - b^{t}\|^2 \qquad (29)$$
$$v^{t+1} = \arg\min_{v}\ \lambda\|v\|_1 + \frac{\gamma}{2}\|\beta^{t+1} - v - b^{t}\|^2 \qquad (30)$$
and
$$b^{t+1} = b^{t} - \beta^{t+1} + v^{t+1} \qquad (31)$$
where $\gamma > 0$ is the weight of the augmented Lagrangian term of the ASMLELM. For any $\gamma > 0$, the sequence $\beta^{t}$ converges to a minimizer [45], [48], [50]. For convenience, $\gamma = 10\lambda$ is set.
The $v$-subproblem in (30) has the simple soft-threshold solution [51], [45]:
$$v^{t+1} = \mathrm{sign}(e)\,\max\Big(0,\ |e| - \frac{1}{10}\Big) \qquad (32)$$
where
$$e = \beta^{t+1} - b^{t} \qquad (33)$$
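The complete update cycle (29)-(33) can be sketched as follows. This is an illustrative Python fragment under our own assumptions, not the authors' implementation; in particular, the $\beta$-subproblem (29) is approximated here by a few gradient steps on a user-supplied gradient of $-L$, which is a simplification rather than the exact solver prescribed by the paper.

```python
import numpy as np

def soft_threshold(e, tau):
    """Elementwise soft-threshold operator, Eqs. (32)-(33)."""
    return np.sign(e) * np.maximum(0.0, np.abs(e) - tau)

def asmlelm_updates(grad_negL, beta0, lam, n_iter=50, gamma=None, lr=1e-3):
    """Variable-splitting / augmented-Lagrangian iterations for
    min_beta -L(beta) + lam * ||v||_1  s.t.  beta = v   (Eqs. (28)-(33)).
    grad_negL(beta) must return the gradient of -L at beta."""
    gamma = 10.0 * lam if gamma is None else gamma     # paper sets gamma = 10 * lambda
    beta = np.asarray(beta0, dtype=float).copy()
    v, b = beta.copy(), np.zeros_like(beta)
    for _ in range(n_iter):
        for _ in range(10):                            # approximate beta-update, Eq. (29)
            beta -= lr * (grad_negL(beta) + gamma * (beta - v - b))
        v = soft_threshold(beta - b, lam / gamma)      # v-update, Eqs. (30)/(32)
        b = b - beta + v                               # multiplier update, Eq. (31)
    return beta
```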

After designing the ASMLELM framework, which can be applied to the ELM, the NLELM and the KELM, three new spectral algorithms for HSI classification can be generated. They are named the ASMLELM for the basic ELM (ASMLBELM), the ASMLELM for the NLELM (ASMLNLELM) and the ASMLELM for the KELM (ASMLKELM). The pseudocode for these three methods is shown in Algorithm 1.

Algorithm 1: The ASMLELM for the basic ELM, the NLELM and the KELM
Input: the training sample pairs {x = (x_1, x_2, ..., x_n), Y = (y_1, y_2, ..., y_n)} and the parameter λ; b = 0. L: the number of nodes in the hidden layer. H(·): the sigmoid function. β: the output weight matrix of the hidden layer.
Training phase:
1: Randomly generate the input weights {w_1, ..., w_L} and the biases {b_1, ..., b_L}.
2: For each training sample x_i, calculate the hidden-layer vector H(x_i) = [H_1(w_1^T x_i + b_1), ..., H_L(w_L^T x_i + b_L)]^T ∈ R^{L×1}.
3: Calculate the output weight:
 (1) β = (H^T)^† Y^T for the ASMLBELM.
 (2) Minimize (5); here β = H(I/C + H^T H)^{-1} Y^T for the ASMLNLELM.
 (3) With the Gaussian kernel K_train(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ_ELM²)), compute π = (I/C + K_train)^{-1} Y^T for the ASMLKELM.
4: Represent the ELM by a probability model:
 (1) P(y_{i,j} = 1 | H(x_i), β) = exp(β_j^T H(x_i)) / Σ_k exp(β_k^T H(x_i)) for the ASMLBELM and the ASMLNLELM.
 (2) P(y_{i,j} = 1 | K_train(x_i), π) = exp(π_j^T K_train(x_i)) / Σ_k exp(π_k^T K_train(x_i)) for the ASMLKELM.
5: ASMLELM: the ML estimate based on the sparse representation with the Laplacian prior and the LORSAL algorithm.
 5.1: β̂ = arg max_β β^T(∇L(β̂) − Bβ̂) + (1/2)β^T(B − λΛ_t)β.
 5.2: Set t = 0.
 5.3: Repeat:
 5.4: β^{t+1} = arg min_β −L(β) + (10λ/2)‖β − v^t − b^t‖².
 5.5: v^{t+1} = arg min_v λ‖v‖_1 + (10λ/2)‖β^{t+1} − v − b^t‖².
 5.6: b^{t+1} = b^t − β^{t+1} + v^{t+1}.
 5.7: Increase t to t + 1 (if the ASMLKELM is applied, replace β by π).
 5.8: Quit when the stopping criterion is met.
Prediction phase: test samples {X = (x_1, x_2, ..., x_N), Y = (y_1, y_2, ..., y_N)}.
1: (1) Calculate the output-layer vectors H(x_i) = [H_1(w_1^T x_i + b_1), ..., H_L(w_L^T x_i + b_L)]^T, i = 1, ..., N, for the ASMLBELM and the ASMLNLELM.
 (2) Compute the test kernel K_test(x_i, x_j) = exp(−‖x_i − x_j‖²/(2σ_ELM²)) between each test sample x_i and each training sample x_j for the ASMLKELM.
2: (1) P(y_{i,j} = 1 | H(x_i), β) = exp(β_j^T H(x_i)) / Σ_k exp(β_k^T H(x_i)) for the ASMLBELM and the ASMLNLELM.
 (2) P(y_{i,j} = 1 | K_test(x_i), π) = exp(π_j^T K_test(x_i)) / Σ_k exp(π_k^T K_test(x_i)) for the ASMLKELM.

B. Weighted Composite Features Based ASMLELM (ASMLELM-WCFs)

From the above, it can be seen that the ASMLELM uses only the spectral information of the HSI data for classification. However, a pixel and the pixels in its spatial neighborhood are likely to belong to the same class [1], so the spatial information is very important and should be incorporated into the ASMLELM. Hence, the WCFs are used to perform spectral-spatial classification within the proposed ASMLELM. For a given pixel $x_i$ with coordinates $(p, q)$, the local pixel neighborhood centered at $x_i$ is
$$N(x_i) = \{x = (p', q') \mid p' \in [p - a, p + a],\ q' \in [q - a, q + a]\}, \quad a = (w_{opt} - 1)/2$$
where $w_{opt}$ is the width of the neighborhood window. Let $x_i^w$ be the spectral feature of the training sample and $x_i^s$ be the information extracted from the local spatial neighborhood of the pixel $x_i^w$, represented as
$$x_i^s = \frac{\sum_{x_k^w \in N(x_i)} v_k\, x_k^w}{\sum_{x_k^w \in N(x_i)} v_k}$$
with the weight $v_k = \exp\{-z\|x_i^w - x_{ik}^w\|^2\}$ measuring the spectral distance between the central pixel and the neighboring pixels $x_{ik} \in N(x_i)$. Following the setting in [35], we set $z = 0.2$ in this work. Then the hidden-layer output matrix defined in (2) and (13) can be expressed as
$$H = \mu H^w + (1 - \mu) H^s \qquad (34)$$
where
$$H^w = [H^w(x_1^w), \dots, H^w(x_n^w)] \in R^{L \times N} \qquad (35)$$
$$H^s = [H^s(x_1^s), \dots, H^s(x_n^s)] \in R^{L \times N} \qquad (36)$$
and $\mu$ is a combination coefficient balancing the spectral and spatial information. For the KELM defined in (14), $\beta$ becomes
$$\beta = \Big(\frac{I}{C} + K\Big)^{-1} Y^T \qquad (37)$$
where
$$K = \mu K_{H^w} + (1 - \mu) K_{H^s} \qquad (38)$$
$$K_{H^w}(x_i^w, x_j^w) = \exp\Big(-\frac{\|x_i^w - x_j^w\|^2}{2\sigma_w^2}\Big) \qquad (39)$$
$$K_{H^s}(x_i^s, x_j^s) = \exp\Big(-\frac{\|x_i^s - x_j^s\|^2}{2\sigma_s^2}\Big) \qquad (40)$$
Here, $\sigma_w$ and $\sigma_s$ control the widths of the spectral and spatial Gaussian kernels, respectively. Three new methods for spectral-spatial HSI classification can now be proposed by combining the ASMLELM and the WCFs, namely the ASMLBELM-WCFs, the ASMLNLELM-WCFs and the ASMLKELM-WCFs. The details are summarized in Algorithm 2, and a brief code sketch of the spatial feature is given below. Figure 1 shows the flowchart of the proposed ASMLELM-WCFs.
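As referenced above, here is a minimal sketch of the weighted-mean spatial feature. This is our own illustrative code, and it assumes the HSI is stored as a rows x cols x bands array, which is an assumption about data layout not specified in the paper.

```python
import numpy as np

def wcf_spatial_feature(cube, row, col, window=9, z=0.2):
    """Weighted mean filter around pixel (row, col) of an HSI cube (rows x cols x bands).
    Weights v_k = exp(-z * ||x_i - x_k||^2) favour spectrally similar neighbours."""
    a = (window - 1) // 2
    center = cube[row, col]
    r0, r1 = max(0, row - a), min(cube.shape[0], row + a + 1)
    c0, c1 = max(0, col - a), min(cube.shape[1], col + a + 1)
    patch = cube[r0:r1, c0:c1].reshape(-1, cube.shape[2])      # neighbours as rows
    v = np.exp(-z * np.sum((patch - center) ** 2, axis=1))     # spectral-distance weights
    return (v[:, None] * patch).sum(axis=0) / v.sum()          # weighted mean, x_i^s

# The spectral and spatial hidden-layer matrices are then mixed as in Eq. (34):
# H = mu * H_w + (1 - mu) * H_s, with mu set empirically to 0.1 in the paper's experiments.
```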

Algorithm 2: ASMLELM with WCFs (ASMLELM-WCFs)
Input: the spectral features {x^w = (x_1^w, x_2^w, ..., x_n^w), Y = (y_1, y_2, ..., y_n)}, the spatial features {x^s = (x_1^s, x_2^s, ..., x_n^s), Y = (y_1, y_2, ..., y_n)} and the parameters C and λ; b = 0. L: the number of nodes in the hidden layer. H(·): the sigmoid function. β: the output weight matrix of the hidden layer.
Training phase:
1: Randomly generate the input weights {w_1, ..., w_L} and the biases {b_1, ..., b_L}.
2: For each training sample x_i, calculate the hidden-layer vectors H^w(x_i^w) = [H_1(w_1^T x_i^w + b_1), ..., H_L(w_L^T x_i^w + b_L)]^T and H^s(x_i^s) = [H_1(w_1^T x_i^s + b_1), ..., H_L(w_L^T x_i^s + b_L)]^T. Here H = μH^w + (1 − μ)H^s for the ASMLBELM-WCFs, the ASMLNLELM-WCFs and the ASMLKELM-WCFs.
3: Calculate the output weight:
 (1) β = (H^T)^† Y^T for the ASMLBELM-WCFs.
 (2) Minimize (5); here β = H(I/C + H^T H)^{-1} Y^T for the ASMLNLELM-WCFs.
 (3) π = (I/C + K_train)^{-1} Y^T, with K_train = μ(H^w)^T H^w + (1 − μ)(H^s)^T H^s = μK_{H^w} + (1 − μ)K_{H^s}, for the ASMLKELM-WCFs.
4: Represent the ELM by a probability model:
 (1) P(y_{i,j} = 1 | H(x_i), β) = exp(β_j^T H(x_i)) / Σ_k exp(β_k^T H(x_i)) for the ASMLBELM-WCFs and the ASMLNLELM-WCFs.
 (2) P(y_{i,j} = 1 | K_train(x_i), π) = exp(π_j^T K_train(x_i)) / Σ_k exp(π_k^T K_train(x_i)) for the ASMLKELM-WCFs.
5: ASMLELM: the ML estimate based on the sparse representation with the Laplacian prior and the LORSAL algorithm.
 5.1: β̂ = arg max_β β^T(∇L(β̂) − Bβ̂) + (1/2)β^T(B − λΛ_t)β.
 5.2: Set t = 0.
 5.3: Repeat:
 5.4: β^{t+1} = arg min_β −L(β) + (10λ/2)‖β − v^t − b^t‖².
 5.5: v^{t+1} = arg min_v λ‖v‖_1 + (10λ/2)‖β^{t+1} − v − b^t‖².
 5.6: b^{t+1} = b^t − β^{t+1} + v^{t+1}.
 5.7: Increase t to t + 1; quit when the stopping criterion is met.
Prediction phase: the spectral features {X^w = (x_1^w, x_2^w, ..., x_N^w), Y = (y_1, y_2, ..., y_N)} and the spatial features {X^s = (x_1^s, x_2^s, ..., x_N^s), Y = (y_1, y_2, ..., y_N)}.
1: Calculate the output-layer vectors H^w(x_i^w) = [H_1(w_1^T x_i^w + b_1), ..., H_L(w_L^T x_i^w + b_L)]^T and H^s(x_i^s) = [H_1(w_1^T x_i^s + b_1), ..., H_L(w_L^T x_i^s + b_L)]^T. Here,
 (1) H = μH^w + (1 − μ)H^s for the ASMLBELM-WCFs and the ASMLNLELM-WCFs.
 (2) H = μH^w + (1 − μ)H^s and K_test = μ(H^w)^T H_test^w + (1 − μ)(H^s)^T H_test^s = μK_{H^w} + (1 − μ)K_{H^s} for the ASMLKELM-WCFs.
2: (1) P(y_{i,j} = 1 | H(x_i), β) = exp(β_j^T H(x_i)) / Σ_k exp(β_k^T H(x_i)) for the ASMLBELM-WCFs and the ASMLNLELM-WCFs.
 (2) P(y_{i,j} = 1 | K_test(x_i), π) = exp(π_j^T K_test(x_i)) / Σ_k exp(π_k^T K_test(x_i)) for the ASMLKELM-WCFs.

D. The Lower Bound of the ASMLELM

In this section, the lower bound of the proposed ASMLELM is derived. From (19), we have
$$\nabla^2 L(\beta) \succeq B \qquad (41)$$
The matrix $B$ defined in (18) is symmetric, negative definite and independent of $\beta$. $\beta$ at the $(t+1)$-th iteration is defined as
$$\beta^{t+1} = \beta^{t} - (B - \lambda\Lambda_t)^{-1}\nabla L(\beta^{t}) \qquad (42)$$
Equation (42) corresponds to maximizing
$$Q(\beta) = (\beta - \hat\beta)^T\nabla L(\hat\beta) + \frac{1}{2}(\beta - \hat\beta)^T B(\beta - \hat\beta) - \lambda\|\beta\|_1 \qquad (43)$$
From [29], (43) can be bounded from below by
$$Q_1(\beta) = (\beta - \hat\beta)^T\nabla L(\hat\beta) + \frac{1}{2}(\beta - \hat\beta)^T (B - \lambda\Lambda)(\beta - \hat\beta) \qquad (44)$$
Then we have the following two lemmas.

Lemma 1:
(a) $Q_1(\beta)$ is maximized at $\beta^{*} = \hat\beta - (B - \lambda\Lambda_t)^{-1}\nabla L(\hat\beta)$.
(b) $Q_1(\beta^{*}) = -\frac{1}{2}\nabla L(\hat\beta)^T (B - \lambda\Lambda_t)^{-1}\nabla L(\hat\beta) \geq 0$, where the inequality is strict if $\nabla L(\hat\beta) \neq 0$.
Proof:
(a) Setting $\nabla Q_1(\beta) = \nabla L(\hat\beta) + (B - \lambda\Lambda)(\beta - \hat\beta) = 0$ gives $\beta^{*} = \hat\beta - (B - \lambda\Lambda_t)^{-1}\nabla L(\hat\beta)$.
(b) $Q_1(\beta^{*}) = -\big((B - \lambda\Lambda_t)^{-1}\nabla L(\hat\beta)\big)^T \nabla L(\hat\beta) + \frac{1}{2}\big((B - \lambda\Lambda_t)^{-1}\nabla L(\hat\beta)\big)^T (B - \lambda\Lambda_t)\big((B - \lambda\Lambda_t)^{-1}\nabla L(\hat\beta)\big) = -\nabla L(\hat\beta)^T (B - \lambda\Lambda_t)^{-1}\nabla L(\hat\beta) + \frac{1}{2}\nabla L(\hat\beta)^T (B - \lambda\Lambda_t)^{-1}\nabla L(\hat\beta) = -\frac{1}{2}\nabla L(\hat\beta)^T (B - \lambda\Lambda_t)^{-1}\nabla L(\hat\beta) \geq 0$, where the inequality is strict if $\nabla L(\hat\beta) \neq 0$.

Lemma 2:
(a) Monotonicity: $L(\beta^{t+1}) \geq L(\beta^{t})$.
(b) Convergence: if $L$ is bounded, then, together with the monotonicity in (a), the sequence $\nabla L(\beta^{t})$ converges to 0.
Proof:

(a) For convenience, let $h = -(B - \lambda\Lambda_t)^{-1}\nabla L(\beta^{t})$. Then we have
$$L(\beta^{t+1}) - L(\beta^{t}) = h^T\nabla L(\beta^{t}) + \frac{1}{2}h^T\nabla^2 L(\beta^{t} + \rho h)\,h \ \geq\ h^T\nabla L(\beta^{t}) + \frac{1}{2}h^T(B - \lambda\Lambda_t)\,h \ \geq\ 0.$$
(b) Suppose that $\nabla L(\beta^{t})$ remained bounded away from zero. By part (b) of Lemma 1, the increments of the objective would then be bounded below by a positive value, which contradicts the boundedness of $L$. As a result, the sequence $\nabla L(\beta^{t})$ converges to 0.

Figure 1. The flowchart of the proposed ASMLELM-WCFs framework.

IV. EXPERIMENTS AND ANALYSIS

In this section, the performance of the proposed framework is evaluated on two well-known HSI data sets, the Indian Pines data set and the Pavia University data set, both of which are publicly available.

A. HSI Data Sets

(1) Indian Pines data set: the Indian Pines HSI data set was collected by the AVIRIS sensor in June 1992. The image scene has 145 x 145 pixels and 200 spectral bands after removing 20 water absorption bands. The wavelength ranges from 0.2 μm to 2.4 μm and the spatial resolution is 20 m per pixel. There are 16 classes in total.

(2) Pavia University data set: the Pavia University HSI data set was acquired by the ROSIS instrument over the city of Pavia, Italy. The image scene has 610 x 340 pixels and 103 spectral bands after removing 12 water absorption bands. The spatial resolution is 1.3 m per pixel. There are 9 classes in total.

B. Measurements and Parameter Setting

The parameter settings and the measurements are described below before conducting the experiments. The six proposed methods are compared with state-of-the-art methods including the SVM, the SVM with composite kernels (SVM-CK), LORSAL, KLORSAL, SMLR-SpATV, BELM, NLELM and KELM. The LIBSVM [53] software is used for the implementation of the SVM and the SVM-CK. For the kernel-based methods, i.e. the SVM, the SVM-CK, the KELM, the ASMLKELM and the ASMLKELM-WCFs, the Gaussian kernel is used. The Gaussian kernel parameter σ and the penalty parameter C are automatically tuned by three-fold cross-validation, with $C = 2^p$, $\sigma = 2^q$, $p \in \{1, 2, \dots, 15\}$ and $q \in \{-6, -5, \dots, 1\}$. The other parameters of the SVM and the SVM-CK are set the same as in [1].

The parameters of the LORSAL, the KLORSAL (kernel-based LORSAL) and the SMLR-SpATV (KLORSAL with a weighted Markov random field [54]-[55]) are chosen as in [43]. All experiments are conducted in MATLAB R2015a on a computer with a 2.9 GHz CPU and 32.0 GB of RAM.

(1) For the proposed ASMLBELM, the total number of hidden-layer neurons L and the regularization parameter λ are important parameters; they are evaluated in the next subsection.
(2) For the ASMLNLELM, the parameter C is automatically tuned by three-fold cross-validation. The effects of L and λ are evaluated in the next subsection.
(3) The important parameters of the ASMLKELM are C, σ and λ. C and σ are automatically tuned by three-fold cross-validation, and the effect of λ is evaluated in the next subsection.
(4) C, L, λ and μ are important parameters of the proposed ASMLBELM-WCFs. C is automatically tuned by three-fold cross-validation, and μ is empirically set to 0.1 in the experiments. The effects of L and λ are evaluated in the next subsection.
(5) The proposed ASMLNLELM-WCFs has the parameters L, C, λ and μ. C is automatically tuned by three-fold cross-validation, and μ is empirically set to 0.1. The effects of L and λ are evaluated in the next subsection.
(6) For the ASMLKELM-WCFs, C and σ are automatically tuned by three-fold cross-validation, μ is empirically set to 0.1, and the effect of λ is evaluated in the next subsection.

C. Discussion on Parameters

In this subsection, several important parameters of the proposed methods are evaluated and the performance is compared with that of the BELM and the NLELM. It is worth noting that the window size is set to 9 for the WCF-based methods in both Experiment 1 and Experiment 2, i.e. the width of the neighborhood window is 9. The effect of the window size on the proposed WCF-based methods is evaluated after evaluating the effects of λ and L.

Experiment 1: In Fig. 2, the effect of the parameter λ (λ = 2^a) on the proposed methods is evaluated. For convenience, the number of hidden neurons is set to L = 550 and L = 900 for the Indian Pines data set and the Pavia University data set, respectively. Figs. 2(a) and 2(b) plot the OA as a function of a with 1043 and 3921 training samples (10% and 9% of the available samples) for the Indian Pines and Pavia University data sets, respectively. From Fig. 2, it can be seen that the three spectral and the three spectral-spatial classifiers remain stable and achieve good performance as the parameter a varies. It can also be seen that the proposed spectral classifiers achieve higher accuracy when a is relatively large. Overall, the six proposed classifiers achieve good and stable performance as a varies. For the following experiments, on both the Indian Pines and Pavia University data sets, we set a to -20 for the proposed ASMLBELM-WCFs, ASMLNLELM-WCFs and ASMLKELM-WCFs, and a to -10 for the proposed ASMLBELM and ASMLNLELM. For the proposed ASMLKELM, we set a to -17 on the Indian Pines data set and -13 on the Pavia University data set.

Experiment 2: In Fig. 3, the OA is plotted as a function of the number of hidden-layer neurons L for the proposed methods, the BELM and the NLELM.
From Fig. 3, it can be seen that the ASMLBELM and the ASMLNLELM always achieve higher accuracies than the BELM and the NLELM. It can also be seen that the proposed spectral-spatial classifiers, the ASMLBELM-WCFs and the ASMLNLELM-WCFs, greatly improve on the corresponding spectral classifiers, the ASMLBELM and the ASMLNLELM, because the spatial information is taken into account. Therefore, the ASMLBELM, the ASMLNLELM, the ASMLBELM-WCFs and the ASMLNLELM-WCFs achieve better performance than the BELM and the NLELM for any value of L. For all the following experiments, unless otherwise stated, we set L to 450 for the BELM and the ASMLBELM-based methods (including the ASMLBELM and the ASMLBELM-WCFs), and to 1000 for the NLELM and the ASMLNLELM-based methods (including the ASMLNLELM and the ASMLNLELM-WCFs) on the Indian Pines data set. On the Pavia University data set, we set L to 1100 for the BELM, NLELM, ASMLBELM, ASMLNLELM, ASMLBELM-WCFs and ASMLNLELM-WCFs.

Experiment 3: In this experiment, the effect of the window width for the ASMLBELM-WCFs, the ASMLNLELM-WCFs and the ASMLKELM-WCFs is evaluated. Fig. 4 plots the OA as a function of the window size for these three spectral-spatial classifiers. From Fig. 4, it can be seen that the results of the proposed methods are not very good when the window size is too small, but become very good and stable when the window size is large, which shows the robustness of the spectral-spatial classifiers. For convenience, the window size is set to 13 for both the Indian Pines and the Pavia University data sets in the following experiments.

Fig. 2. Effect of the parameter a for (a) the Indian Pines data set and (b) the Pavia University data set.

Fig. 3. Effect of the total number of hidden-layer neurons L for (a) the Indian Pines data set and (b) the Pavia University data set.

Fig. 4. Effect of the window size for (a) the Indian Pines data set and (b) the Pavia University data set.

Table 1. Classification accuracy using 10% of the labeled samples per class for the Indian Pines data set (the best results are bolded). [Columns: class no., numbers of training and test samples, and the accuracy of each method: SVM, SVM-CK, LORSAL, KLORSAL, SMLR-SpATV, BELM, NLELM, KELM, ASMLBELM, ASMLNLELM, ASMLKELM, ASMLBELM-WCFs, ASMLNLELM-WCFs, ASMLKELM-WCFs; rows: the 16 classes followed by OA, AA and the kappa coefficient.]

Table 2. Classification accuracy using 9% of the labeled samples per class for the Pavia University data set (the best results are bolded). [Columns: class no., numbers of training and test samples, and the accuracy of each method: SVM, SVM-CK, LORSAL, KLORSAL, SMLR-SpATV, BELM, NLELM, KELM, ASMLBELM, ASMLNLELM, ASMLKELM, ASMLBELM-WCFs, ASMLNLELM-WCFs, ASMLKELM-WCFs; rows: the 9 classes followed by OA, AA and the kappa coefficient.]

D. Experimental Results and Comparisons on the Indian Pines Data Set and the Pavia University Data Set

In this subsection, the classification results are evaluated on the Indian Pines data set and the Pavia University data set. Table 1 and Table 2 list the numbers of training and testing samples for the two data sets. For the Indian Pines data set, 10% of the labeled samples are used for training and the remaining samples for testing; similarly, for the Pavia University data set, 9% of the labeled samples are used for training and the rest for testing. The classification accuracies are shown in Table 1 and Table 2. It can be seen that the ASMLBELM, the ASMLNLELM and the ASMLKELM obtain higher accuracies than the BELM, the NLELM and the KELM, respectively. The performance of these three proposed spectral classifiers improves dramatically when the spatial information (WCFs) is considered. Compared with other spectral-spatial classifiers, the ASMLBELM-WCFs, the ASMLNLELM-WCFs and the ASMLKELM-WCFs achieve higher classification accuracies than the SVM-CK and the SMLR-SpATV. Based on the above results, it can be concluded that the six proposed methods achieve very good performance. Fig. 5 and Fig. 6 show the classification maps obtained by the different methods and the ground truth for the Indian Pines and Pavia University data sets.

E. Experimental Results with Different Numbers of Training Samples

The performance of the six proposed methods is evaluated under different numbers of training samples. The number of training samples per class is chosen as 5, 10, 15, 20, 25, 30, 35 and 40; when this number exceeds half of the available samples of a class, half of the samples of that class are used instead. For the parameters of the proposed methods, the settings discussed in the previous subsections are employed.

Fig. 5. Classification maps for the Indian Pines data set with 10% training samples. (a) SVM (OA=85.71). (b) SVM-CK (OA=96.05). (c) LORSAL (OA=73.14). (d) KLORSAL (OA=82.26). (e) SMLR-SpATV (OA=97.50). (f) BELM (OA=77.16). (g) NLELM (OA=79.18). (h) KELM (OA=85.28). (i) ASMLBELM (OA=80.62). (j) ASMLNLELM (OA=80.47). (k) ASMLKELM (OA=85.93). (l) ASMLBELM-WCFs (OA=98.33). (m) ASMLNLELM-WCFs (OA=98.21). (n) ASMLKELM-WCFs (OA=98.57). (o) Ground truth.

Fig. 6. Classification maps for the Pavia University data set with 9% training samples. (a) SVM (OA=89.14). (b) SVM-CK (OA=98.93). (c) LORSAL (OA=77.63). (d) KLORSAL (OA=87.79). (e) SMLR-SpATV (OA=98.88). (f) BELM (OA=90.87). (g) NLELM (OA=90.80). (h) KELM (OA=92.82). (i) ASMLBELM (OA=92.06). (j) ASMLNLELM (OA=91.71). (k) ASMLKELM (OA=93.07). (l) ASMLBELM-WCFs (OA=99.46). (m) ASMLNLELM-WCFs (OA=99.44). (n) ASMLKELM-WCFs (OA=99.69). (o) Ground truth.

From Table 3 and Table 4, it can be seen that the classification accuracies of the three proposed spectral classifiers are better than those of the BELM, the NLELM and the KELM, respectively, and that the three spectral-spatial classification algorithms perform better still. Compared with the other classification methods listed in Table 3 and Table 4, the three spectral-spatial methods achieve higher accuracies than all the other spectral and spectral-spatial methods. An interesting observation in Table 3 and Table 4 is that the classification accuracy of the BELM decreases when the number of training samples increases in most cases; this is because the ill-posed problem is particularly serious in the BELM. Fortunately, the proposed ASMLBELM and ASMLNLELM-WCFs alleviate this problem well. Table 3 and Table 4 also report the execution time of the six proposed algorithms and the other methods when 100 training samples are used. For both the Indian Pines and Pavia University data sets, the three proposed spectral algorithms, i.e. the ASMLBELM, the ASMLNLELM and the ASMLKELM, need more time than the BELM, the NLELM and the KELM, respectively, but less time than the SVM. For the Indian Pines data set, the proposed ASMLBELM-WCFs and ASMLNLELM-WCFs need less time than the SVM-CK and the SMLR-SpATV; although the proposed ASMLKELM-WCFs needs more time than the SMLR-SpATV, it needs less time than the SVM-CK. For the Pavia University data set, the three proposed spectral-spatial algorithms need less time than the SVM-CK and the SMLR-SpATV. Based on the above analysis, the six proposed classification methods achieve very good performance, especially the three spectral-spatial algorithms.

Table 3. Classification accuracy (%) under different numbers of training samples for the Indian Pines data set (the best results are bolded). [Rows: OA, AA and kappa for 5, 10, 15, 20, 25, 30, 35 and 40 training samples per class, plus the execution time in seconds with 100 training samples; columns: SVM, SVM-CK, LORSAL, KLORSAL, SMLR-SpATV, BELM, NLELM, KELM, ASMLBELM, ASMLNLELM, ASMLKELM, ASMLBELM-WCFs, ASMLNLELM-WCFs, ASMLKELM-WCFs.]

Table 4. Classification accuracy (%) under different numbers of training samples for the Pavia University data set (the best results are bolded). [Rows: OA, AA and kappa for 5, 10, 15, 20, 25, 30, 35 and 40 training samples per class, plus the execution time in seconds with 100 training samples; columns: SVM, SVM-CK, LORSAL, KLORSAL, SMLR-SpATV, BELM, NLELM, KELM, ASMLBELM, ASMLNLELM, ASMLKELM, ASMLBELM-WCFs, ASMLNLELM-WCFs, ASMLKELM-WCFs.]

V. CONCLUSION

This paper proposes three spectral algorithms and three spectral-spatial methods, which improve the ELM for HSI classification. First, the ELM is represented by a probabilistic model via the MAP criterion and is then expressed as a concave logarithmic likelihood function whose maximum can be obtained. Second, the Laplacian prior is adopted, which improves the performance of the proposed framework. Third, the LORSAL algorithm is adopted in the proposed framework to further improve efficiency. Fourth, the spatial information is considered and a spectral-spatial framework is employed for HSI classification to further improve the classification accuracy. Finally, the lower bound of the proposed method is derived by a rigorous mathematical proof, which supports the good performance of the six proposed algorithms. For future work, we will focus on improving the computational efficiency of the proposed ASMLKELM through further sparse representation, and on further improving the classification accuracy by resorting to the extended multi-attribute profiles (EMAPs) method [56].

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China and the Training Program for Outstanding Young Teachers in Higher Education Institutions of Guangdong Province.

REFERENCES
[1] Y. Zhou, J. Peng, and C. L. P. Chen, "Extreme learning machine with composite kernels for hyperspectral image classification," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 6.
[2] A. Plaza, J. A. Benediktsson, J. W. Boardman, et al., "Recent advances in techniques for hyperspectral image processing," Remote Sensing of Environment, vol. 113, pp. S110-S122.
[3] G. Hughes, "On the mean accuracy of statistical pattern recognizers," IEEE Transactions on Information Theory, vol. 14, no. 1.
[4] J. Li, P. R. Marpu, A. Plaza, et al., "Generalized composite kernel framework for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 9.
[5] T. Qiao, J. Ren, et al., "Effective denoising and classification of hyperspectral images using curvelet transform and singular spectrum analysis," IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 1.

[6] J. Zabalza, J. Ren, J. Zheng, J. Han, H. Zhao, S. Li, and S. Marshall, "Novel two dimensional singular spectrum analysis for effective feature extraction and data classification in hyperspectral imaging," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 8.
[7] T. Qiao, J. Ren, C. Craigie, Z. Zabalza, C. Maltin, and S. Marshall, "Singular spectrum analysis for improving hyperspectral imaging based beef eating quality evaluation," Computers and Electronics in Agriculture.
[8] J. Zabalza, J. Ren, Z. Wang, H. Zhao, J. Wang, and S. Marshall, "Fast implementation of singular spectrum analysis for effective feature extraction in hyperspectral imaging," IEEE Journal of Selected Topics in Earth Observation and Remote Sensing, vol. 8, no. 6.
[9] F. Melgani and L. Bruzzone, "Classification of hyperspectral remote sensing images with support vector machines," IEEE Transactions on Geoscience and Remote Sensing, vol. 42, no. 8.
[10] Q. He, C. Du, Q. Wang, F. Zhuang, and Z. Shi, "A parallel incremental extreme SVM classifier," Neurocomputing, vol. 74.
[11] J. Zabalza, J. Ren, J. Zheng, H. Zhao, C. Qing, Z. Yang, P. Du, and S. Marshall, "Novel segmented stacked autoencoder for effective dimensionality reduction and feature extraction in hyperspectral imaging," Neurocomputing, vol. 185, pp. 1-10, 2016.
[12] J. Ren, Z. Zabalza, S. Marshall, and J. Zheng, "Effective feature extraction and data reduction with hyperspectral imaging in remote sensing," IEEE Signal Processing Magazine, vol. 31, no. 4.
[13] J. Zabalza, J.-C. Ren, J. Ren, Z. Liu, and S. Marshall, "Structured covariance principal component analysis for real-time onsite feature extraction and dimensionality reduction in hyperspectral imaging," Applied Optics, vol. 53, no. 20.
[14] J. Zabalza, J. Ren, M. Yang, Y. Zhang, J. Wang, S. Marshall, and J. Han, "Novel Folded-PCA for improved feature extraction and data reduction with hyperspectral imaging and SAR in remote sensing," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 93, no. 7.
[15] L. Fang, S. Li, W. Duan, J. Ren, and J. Atli Benediktsson, "Classification of hyperspectral images by exploiting spectral-spatial information of superpixel via multiple kernels," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 12.
[16] J. Li, J. M. Bioucas-Dias, and A. Plaza, "Spectral-spatial classification of hyperspectral data using loopy belief propagation and active learning," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 2.
[17] G. B. Huang, Q. Y. Zhu, and C. K. Siew, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1.
[18] G. B. Huang, "An insight into extreme learning machines: random neurons, random features and kernels," Cognitive Computation, vol. 6, no. 3.
[19] G. B. Huang, H. Zhou, X. Ding, et al., "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 2.
[20] Y. Wang, F. Cao, and Y. Yuan, "A study on effectiveness of extreme learning machine," Neurocomputing, vol. 74.
[21] H. J. Rong, Y. S. Ong, A. H. Tan, and Z. Zhu, "A fast pruned-extreme learning machine for classification problem," Neurocomputing, vol. 72.
[22] G. B. Huang, X. Ding, and H. Zhou, "Optimization method based extreme learning machine for classification," Neurocomputing, vol. 74.
[23] A. Samat, P. Du, S. Liu, et al., "E2LMs: ensemble extreme learning machines for hyperspectral image classification," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 4.
[24] Y. Shen, J. Xu, H. Li, et al., "ELM-based spectral-spatial classification of hyperspectral images using bilateral filtering information on spectral band-subsets," in Proc. IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2016.
[25] F. Argüello and D. B. Heras, "ELM-based spectral-spatial classification of hyperspectral images using extended morphological profiles and composite feature mappings," International Journal of Remote Sensing, vol. 36, no. 2, 2015.
[26] W. Duan, S. Li, and L. Fang, "Spectral-spatial hyperspectral image classification using superpixel and extreme learning machines," in Proc. Chinese Conference on Pattern Recognition, Springer, Berlin, Heidelberg, 2014.
[27] D. B. Heras, F. Argüello, and P. Quesada-Barriuso, "Exploring ELM-based spatial-spectral classification of hyperspectral images," International Journal of Remote Sensing, vol. 35, no. 2, 2014.
[28] C. Chen, W. Li, H. Su, et al., "Spectral-spatial classification of hyperspectral image based on kernel extreme learning machine," Remote Sensing, vol. 6, no. 6, 2014.
[29] B. Krishnapuram, L. Carin, M. A. T. Figueiredo, et al., "Sparse multinomial logistic regression: fast algorithms and generalization bounds," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6.
[30] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization," Proceedings of the National Academy of Sciences, vol. 100, no. 5.
[31] J. Friedman, T. Hastie, S. Rosset, et al., "Discussion of boosting papers," Annals of Statistics, vol. 32.
[32] A. Y. Ng, "Feature selection, L1 vs. L2 regularization, and rotational invariance," in Proceedings of the Twenty-First International Conference on Machine Learning, ACM.
[33] M. V. Afonso, J. M. Bioucas-Dias, and M. A. Figueiredo, "Fast image recovery using variable splitting and constrained optimization," IEEE Transactions on Image Processing, vol. 19, no. 9, 2010.


More information

Spectral and Spatial Methods for the Classification of Urban Remote Sensing Data

Spectral and Spatial Methods for the Classification of Urban Remote Sensing Data Spectral and Spatial Methods for the Classification of Urban Remote Sensing Data Mathieu Fauvel gipsa-lab/dis, Grenoble Institute of Technology - INPG - FRANCE Department of Electrical and Computer Engineering,

More information

Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

More information

Jeff Howbert Introduction to Machine Learning Winter

Jeff Howbert Introduction to Machine Learning Winter Classification / Regression Support Vector Machines Jeff Howbert Introduction to Machine Learning Winter 2012 1 Topics SVM classifiers for linearly separable classes SVM classifiers for non-linearly separable

More information

Non-negative Matrix Factorization on Kernels

Non-negative Matrix Factorization on Kernels Non-negative Matrix Factorization on Kernels Daoqiang Zhang, 2, Zhi-Hua Zhou 2, and Songcan Chen Department of Computer Science and Engineering Nanjing University of Aeronautics and Astronautics, Nanjing

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

Robust Multiple Estimator Systems for the Analysis of Biophysical Parameters From Remotely Sensed Data

Robust Multiple Estimator Systems for the Analysis of Biophysical Parameters From Remotely Sensed Data IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 43, NO. 1, JANUARY 2005 159 Robust Multiple Estimator Systems for the Analysis of Biophysical Parameters From Remotely Sensed Data Lorenzo Bruzzone,

More information

Classification and Support Vector Machine

Classification and Support Vector Machine Classification and Support Vector Machine Yiyong Feng and Daniel P. Palomar The Hong Kong University of Science and Technology (HKUST) ELEC 5470 - Convex Optimization Fall 2017-18, HKUST, Hong Kong Outline

More information

BANA 7046 Data Mining I Lecture 6. Other Data Mining Algorithms 1

BANA 7046 Data Mining I Lecture 6. Other Data Mining Algorithms 1 BANA 7046 Data Mining I Lecture 6. Other Data Mining Algorithms 1 Shaobo Li University of Cincinnati 1 Partially based on Hastie, et al. (2009) ESL, and James, et al. (2013) ISLR Data Mining I Lecture

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Hsuan-Tien Lin Learning Systems Group, California Institute of Technology Talk in NTU EE/CS Speech Lab, November 16, 2005 H.-T. Lin (Learning Systems Group) Introduction

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Regularization in Neural Networks

Regularization in Neural Networks Regularization in Neural Networks Sargur Srihari 1 Topics in Neural Network Regularization What is regularization? Methods 1. Determining optimal number of hidden units 2. Use of regularizer in error function

More information

An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM

An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

A Bregman alternating direction method of multipliers for sparse probabilistic Boolean network problem

A Bregman alternating direction method of multipliers for sparse probabilistic Boolean network problem A Bregman alternating direction method of multipliers for sparse probabilistic Boolean network problem Kangkang Deng, Zheng Peng Abstract: The main task of genetic regulatory networks is to construct a

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

NON-FIXED AND ASYMMETRICAL MARGIN APPROACH TO STOCK MARKET PREDICTION USING SUPPORT VECTOR REGRESSION. Haiqin Yang, Irwin King and Laiwan Chan

NON-FIXED AND ASYMMETRICAL MARGIN APPROACH TO STOCK MARKET PREDICTION USING SUPPORT VECTOR REGRESSION. Haiqin Yang, Irwin King and Laiwan Chan In The Proceedings of ICONIP 2002, Singapore, 2002. NON-FIXED AND ASYMMETRICAL MARGIN APPROACH TO STOCK MARKET PREDICTION USING SUPPORT VECTOR REGRESSION Haiqin Yang, Irwin King and Laiwan Chan Department

More information

HYPERSPECTRAL imaging sensors collect hyperspectral

HYPERSPECTRAL imaging sensors collect hyperspectral 1 Hyperspectral Image Spectral-Spatial Feature Extraction via Tensor Principal Component Analysis Yuemei Ren, Liang Liao, Member, IEEE, Stephen John Maybank, Fellow, IEEE, Yanning Zhang, Senior member,

More information

Statistical Methods for Data Mining

Statistical Methods for Data Mining Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Links between Perceptrons, MLPs and SVMs

Links between Perceptrons, MLPs and SVMs Links between Perceptrons, MLPs and SVMs Ronan Collobert Samy Bengio IDIAP, Rue du Simplon, 19 Martigny, Switzerland Abstract We propose to study links between three important classification algorithms:

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

CMU-Q Lecture 24:

CMU-Q Lecture 24: CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

DATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane

DATA MINING AND MACHINE LEARNING. Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane DATA MINING AND MACHINE LEARNING Lecture 4: Linear models for regression and classification Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Linear models for regression Regularized

More information

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp.

ESANN'2003 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 2003, d-side publi., ISBN X, pp. On different ensembles of kernel machines Michiko Yamana, Hiroyuki Nakahara, Massimiliano Pontil, and Shun-ichi Amari Λ Abstract. We study some ensembles of kernel machines. Each machine is first trained

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Lecture Notes on Support Vector Machine

Lecture Notes on Support Vector Machine Lecture Notes on Support Vector Machine Feng Li fli@sdu.edu.cn Shandong University, China 1 Hyperplane and Margin In a n-dimensional space, a hyper plane is defined by ω T x + b = 0 (1) where ω R n is

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition

Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition J. Uglov, V. Schetinin, C. Maple Computing and Information System Department, University of Bedfordshire, Luton,

More information

Constrained Optimization and Support Vector Machines

Constrained Optimization and Support Vector Machines Constrained Optimization and Support Vector Machines Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/

More information

Machine Learning Support Vector Machines. Prof. Matteo Matteucci

Machine Learning Support Vector Machines. Prof. Matteo Matteucci Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way

More information

Support Vector Machine (continued)

Support Vector Machine (continued) Support Vector Machine continued) Overlapping class distribution: In practice the class-conditional distributions may overlap, so that the training data points are no longer linearly separable. We need

More information

Self-Tuning Semantic Image Segmentation

Self-Tuning Semantic Image Segmentation Self-Tuning Semantic Image Segmentation Sergey Milyaev 1,2, Olga Barinova 2 1 Voronezh State University sergey.milyaev@gmail.com 2 Lomonosov Moscow State University obarinova@graphics.cs.msu.su Abstract.

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

Research on Feature Extraction Method for Handwritten Chinese Character Recognition Based on Kernel Independent Component Analysis

Research on Feature Extraction Method for Handwritten Chinese Character Recognition Based on Kernel Independent Component Analysis Research Journal of Applied Sciences, Engineering and echnology 6(7): 183-187, 013 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 013 Submitted: November 13, 01 Accepted: January 11,

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction Linear vs Non-linear classifier CS789: Machine Learning and Neural Network Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Linear classifier is in the

More information

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 A. Kernels 1. Let X be a finite set. Show that the kernel

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Margin Maximizing Loss Functions

Margin Maximizing Loss Functions Margin Maximizing Loss Functions Saharon Rosset, Ji Zhu and Trevor Hastie Department of Statistics Stanford University Stanford, CA, 94305 saharon, jzhu, hastie@stat.stanford.edu Abstract Margin maximizing

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection

An Improved 1-norm SVM for Simultaneous Classification and Variable Selection An Improved 1-norm SVM for Simultaneous Classification and Variable Selection Hui Zou School of Statistics University of Minnesota Minneapolis, MN 55455 hzou@stat.umn.edu Abstract We propose a novel extension

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 29 April, SoSe 2015 Support Vector Machines (SVMs) 1. One of

More information

Support Vector Machine Regression for Volatile Stock Market Prediction

Support Vector Machine Regression for Volatile Stock Market Prediction Support Vector Machine Regression for Volatile Stock Market Prediction Haiqin Yang, Laiwan Chan, and Irwin King Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin,

More information

Support Vector Machine via Nonlinear Rescaling Method

Support Vector Machine via Nonlinear Rescaling Method Manuscript Click here to download Manuscript: svm-nrm_3.tex Support Vector Machine via Nonlinear Rescaling Method Roman Polyak Department of SEOR and Department of Mathematical Sciences George Mason University

More information

Fantope Regularization in Metric Learning

Fantope Regularization in Metric Learning Fantope Regularization in Metric Learning CVPR 2014 Marc T. Law (LIP6, UPMC), Nicolas Thome (LIP6 - UPMC Sorbonne Universités), Matthieu Cord (LIP6 - UPMC Sorbonne Universités), Paris, France Introduction

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Electric Load Forecasting Using Wavelet Transform and Extreme Learning Machine

Electric Load Forecasting Using Wavelet Transform and Extreme Learning Machine Electric Load Forecasting Using Wavelet Transform and Extreme Learning Machine Song Li 1, Peng Wang 1 and Lalit Goel 1 1 School of Electrical and Electronic Engineering Nanyang Technological University

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Semi-supervised Dictionary Learning Based on Hilbert-Schmidt Independence Criterion

Semi-supervised Dictionary Learning Based on Hilbert-Schmidt Independence Criterion Semi-supervised ictionary Learning Based on Hilbert-Schmidt Independence Criterion Mehrdad J. Gangeh 1, Safaa M.A. Bedawi 2, Ali Ghodsi 3, and Fakhri Karray 2 1 epartments of Medical Biophysics, and Radiation

More information

Machine Learning, Midterm Exam

Machine Learning, Midterm Exam 10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

More information

Automatic Rank Determination in Projective Nonnegative Matrix Factorization

Automatic Rank Determination in Projective Nonnegative Matrix Factorization Automatic Rank Determination in Projective Nonnegative Matrix Factorization Zhirong Yang, Zhanxing Zhu, and Erkki Oja Department of Information and Computer Science Aalto University School of Science and

More information

SUPPORT VECTOR MACHINE FOR THE SIMULTANEOUS APPROXIMATION OF A FUNCTION AND ITS DERIVATIVE

SUPPORT VECTOR MACHINE FOR THE SIMULTANEOUS APPROXIMATION OF A FUNCTION AND ITS DERIVATIVE SUPPORT VECTOR MACHINE FOR THE SIMULTANEOUS APPROXIMATION OF A FUNCTION AND ITS DERIVATIVE M. Lázaro 1, I. Santamaría 2, F. Pérez-Cruz 1, A. Artés-Rodríguez 1 1 Departamento de Teoría de la Señal y Comunicaciones

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information