Sensors and Actuators B 105 (2005) 378-392. doi:10.1016/j.snb.2004.06.026

A powerful method for feature extraction and compression of electronic nose responses

A. Leone (a), C. Distante (a,*), N. Ancona (b), K.C. Persaud (c), E. Stella (b), P. Siciliano (a)

(a) Istituto per la Microelettronica e Microsistemi CNR - Sez. Lecce, via Monteroni, 73100 Lecce, Italy
(b) Istituto di Studi sui Sistemi Intelligenti per l'Automazione CNR, via Amendola 166/5, 70123 Bari, Italy
(c) DIAS, UMIST, P.O. Box 88, Sackville Street, Manchester M60 1QD, UK

Received 30 April 2004; received in revised form 16 June 2004; accepted 21 June 2004. Available online 27 July 2004.

* Corresponding author. E-mail addresses: alessandro.leone@imm.cnr.it; cosimo.distante@imm.cnr.it (C. Distante); kcpersaud@umist.ac.uk (K.C. Persaud).

Abstract

This paper focuses on the problem of data representation for feature selection and extraction from 1D electronic nose signals. Whereas PCA yields a problem-dependent signal representation, we propose a novel approach based on frame theory, in which an over-complete dictionary of functions is considered in order to find a near-optimal representation of any 1D signal at hand. Feature selection is accomplished with an iterative method called matching pursuit, which selects from the dictionary the functions that most reduce the reconstruction error. The representation functions found in this way can then be used for feature extraction or for signal compression. Classification of the selected features with a neural approach shows the high discriminatory power of the extracted features. (c) 2004 Elsevier B.V. All rights reserved.

Keywords: Feature extraction; Electronic nose; Gabor functions; Frame theory; Matching pursuit

1. Introduction

Data representation is one of the main concerns in developing intelligent machines able to solve real-life problems. The choice of the representation best suited to the particular application can significantly influence the overall performance of the machine, in terms of both computational speed and reliability. The goal of data representation is to find an appropriate set of coefficients (or features) that numerically describe the data. Such a representation embodies some prior knowledge of the problem at hand and involves linear or non-linear transformations of the original data (Fig. 1). A system that performs the representation task can be independent of the particular data, as in the case of Haar functions, or dependent on the data, as in the case of principal component analysis. In both cases, the features extracted from the original signal f characterize the signal itself. Usually these transformations imply the convolution of the original signal with appropriate kernel functions that often constitute a basis of the input space; in such a case, the representation of any signal with respect to the given basis is unique. We are interested in studying systems of functions that are over-complete (not an orthogonal basis), in which the representation of each element of the space is not necessarily unique: the representation can be obtained in several ways from linearly dependent functions. As an example, consider a basis to be a small dictionary of just a few thousand words. Any concept can be described using terms of the vocabulary, but at the expense of long sentences.
On the other hand, with a very large dictionary of, say, 100,000 words, concepts can be described with much shorter sentences, in a more compact way, sometimes with a single word. Moreover, and more importantly for our context, large dictionaries allow the same concept to be described with different sentences, according to the message we want to transmit to the audience or reader. Over-complete dictionaries likewise allow signals to be represented in many different ways and, in principle, it is possible to envisage that, among all the possible representations, there is one suited to the particular application at hand.

Fig. 1. Pseudo-code of the matching pursuit algorithm.

Traditional transform-based compression methods often use an orthogonal basis, with the objective of representing as much signal information as possible with a weighted sum of as few basis vectors as possible. The optimal transform depends on the statistics of the stochastic process that produced the signal. Often a Gaussian assumption is made, and one finds the eigenvectors of the autocorrelation matrix of the steady-state responses in order to minimize the average differential entropy [1]. If the process is not Gaussian, principal component analysis need not be the optimal transform, and it is a nontrivial task to find the optimal transform even when the statistics are known. Moreover, the signal is non-stationary, so no fixed transform will be optimal in all signal regions. One way to overcome this problem is to use an over-complete set of vectors, a "frame", which spans the finite-dimensional input space. Introduced in the context of non-harmonic Fourier series, frames have recently become an interesting new trend in signal processing [2,3]. The advantage of using a frame rather than an orthogonal transform is the large number of available kernel functions from which to select a small set whose linear combination matches the original signal rather well.

In this paper, we analyze e-nose signal representation in terms of frames in the context of odor detection and recognition, a problem which is attracting particular attention for its practical and commercial exploitation in different fields. In [4], an orthogonal decomposition of the signal is performed using wavelet functions; in that study, several feature extraction methods with complete dictionaries were compared by minimizing a generalization error. To approach this problem we follow the route of learning from examples, in which the odor we want to detect is described by previously stored patterns. More specifically, signal detection can be seen as a classification problem, especially in the context of ambient air monitoring, where measurements are made in an uncontrolled environment. The general problem of learning from examples, and in particular classification, can be interpreted as the problem of approximating a multivariate function from sparse data [5] (recall that, in feature extraction, sparsity refers to selecting a representation of a signal with few non-zero coefficients), where data come in the form of (input, output) pairs. The optimal solution is the one minimizing the generalization error, called risk [6], or equivalently maximizing the capacity of the machine to correctly classify never-before-seen patterns. The solution found by a support vector machine (SVM) for two-class classification problems [7,8] is the hyperplane minimizing an upper bound of the risk, constituted by the sum of the empirical risk (the error on the training data) and a measure of the complexity of the hypothesis space (by this term we mean the set of functions the machine implements), expressed in terms of the VC-dimension. In this paper, we investigate the connections between the form of the representation and the generalization properties of the learning machines.

It is thus natural to ask how to select an appropriate basis for processing a particular class of signals. The decomposition coefficients of a signal in a basis define a representation that highlights some particular signal properties.
For example, wavelet coefficients provide explicit information on the location and type of signal singularities. The problem is to find a criterion for selecting a basis that is intrinsically well adapted to represent a class of signals. Mathematical approximation theory suggests choosing a basis that can construct precise signal approximations with a linear combination of a small number of vectors selected from the basis. These selected vectors can be interpreted as intrinsic signal structures: the signal is approximated, or represented, as a superposition of basic waveforms chosen from a dictionary of such waveforms so as to best match the signal. The goal is to obtain expansions that provide good synthesis for applications such as denoising, compression and feature extraction for classification purposes. The question we want to answer is: among all the possible representations of the data in terms of elements of over-complete dictionaries, is there a representation of the data providing the best generalization capacity of the learning machine? We investigate several representations of e-nose signals, combining the method of frames proposed in [9] with the matching pursuit method introduced in [10]. Over-complete dictionaries are obtained by using translated and scaled versions of Haar and Gabor functions with different numbers of center frequencies.

2. Function representation and reconstruction

One of the main problems of computational intelligence is function representation and reconstruction. Functions must be discretized so as to be implemented on a computer, and what is actually specified on the computer is a representation of the function. Associated with each representation technique we must have a reconstruction method. These two operations enable us to move functions between the mathematical and the representation universes when necessary. As we will see, reconstruction methods are directly related to the theory of function approximation.

Let us denote by S a space of sequences $(c_j)$ of real or complex numbers (such a sequence can be thought of as the set of coefficients obtained by projecting a function onto some space). The space S has a norm that allows us to compute distances between sequences of the space. A representation of a space of functions $\Lambda$ is an operator $F: \Lambda \to S$ into some space of sequences; for a given function $f \in \Lambda$, its representation $F(f)$ is a sequence,

$F(f) = (c_j) \in S.$

F is called the representation operator. In general it is required that F preserves the norm or satisfies some stability conditions. The most important examples of representations occur when the space of functions is a subspace of the space $L^2(\mathbb{R})$ of square integrable (finite energy) functions,

$L^2(\mathbb{R}) = \{ f: \mathbb{R} \to \mathbb{R};\ \int_{\mathbb{R}} |f(t)|^2\, dt < \infty \},$   (1)

and the representation space S is the space $\ell^2$ of square summable sequences,

$\ell^2 = \{ (c_j);\ \|c\|^2 = \sum_j |c_j|^2 < \infty \}.$   (2)

When the representation operator is invertible, we can reconstruct f from its representation sequence, $f = F^{-1}((c_j))$. In this case we have an exact representation, also called an ideal representation, and a method to compute the inverse operator gives us the reconstruction equation. We should remark that invertibility is in general a very strong requirement for a representation operator; in fact, weaker conditions such as invertibility on the left suffice to obtain exact representations. Several representation/reconstruction methods are available in the literature, among them basis and frame representations.

2.1. Basis representation

A natural technique for obtaining a representation of a space of functions consists in constructing a basis of the space. A set $B = \{ e_j;\ j \in \mathbb{Z} \}$ is a basis of a function space $\Lambda$ if the vectors $e_j$ are linearly independent and for each $f \in \Lambda$ there exists a sequence $(c_j)$ of numbers such that

$f = \sum_j c_j e_j,$   (3)

that is, $\lim_{n \to \infty} \| f - \sum_{j=-n}^{n} c_j e_j \| = 0$. One can think of B as the basis obtained by extracting the principal components of raw data, whose eigenvectors are linearly independent. If all the eigenvectors are kept in the reconstruction process, the entire signal is completely recovered. Projecting the function onto those eigenvectors, we obtain a sequence of coefficients $(c_j)$ which constitutes a different representation of the original signal. A representation operator can then be defined by

$F(f) = (c_j).$   (4)

In order to guarantee uniqueness of the representation, we must require that the space of functions admits an inner product and is complete in the norm induced by that inner product. These spaces are called Hilbert spaces, and we denote them by the symbol H.

Definition 1. A collection of functions $\{\varphi_j;\ j \in \mathbb{Z}\}$ in a separable Hilbert space H is a complete orthonormal set if the following conditions are satisfied:

1. Normalization: $\|\varphi_j\| = 1$ for each $j \in \mathbb{Z}$;
2. Orthogonality: $\langle \varphi_j, \varphi_k \rangle = 0$ if $j \neq k$;
3. Completeness: for all $f \in H$ and any $\epsilon > 0$ there exists $N \geq 0$ such that

$\left\| f - \sum_{j=-N}^{N} \langle f, \varphi_j \rangle \varphi_j \right\| < \epsilon.$   (5)
The third condition says that linear combinations of functions from the set can be used to approximate arbitrary functions of the space. Complete orthonormal sets are also called orthonormal bases of the Hilbert space. As is usually done with principal component analysis, we retain the first eigenvectors, which carry the largest amount of information, and discard the remaining eigenvectors whose corresponding eigenvalues are very small. If all the eigenvectors are chosen, the norm in Eq. (5) is zero; otherwise $\epsilon$ depends on the number of retained eigenvectors (components). The reconstruction of the original signal is given by

$f = \sum_j \langle f, \varphi_j \rangle \varphi_j.$   (6)

It is easy to see that the orthogonality condition implies that the elements $\varphi_j$ are linearly independent, which in turn implies that the representation sequence $(\langle f, \varphi_j \rangle)$ is uniquely determined by f.
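As a concrete illustration of Eqs. (4) and (6) (an example of ours, not from the paper), the following Python sketch builds an orthonormal basis of $\mathbb{R}^n$ from a QR decomposition, computes the coefficients by projection, and checks exact reconstruction and norm preservation:

import numpy as np

# Illustrative sketch (not from the paper): represent a signal in an
# orthonormal basis of R^n and reconstruct it exactly, as in Eq. (6).
rng = np.random.default_rng(0)
n = 8
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # columns = orthonormal basis
f = rng.standard_normal(n)                        # the "signal"

c = Q.T @ f          # coefficients c_j = <f, phi_j>, Eq. (4)
f_rec = Q @ c        # reconstruction f = sum_j c_j phi_j, Eq. (6)

assert np.allclose(f, f_rec)                             # exact representation
assert np.isclose(np.linalg.norm(c), np.linalg.norm(f))  # norms coincide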

2.2. Frame theory

The basic idea of representing a function in a basis is to decompose it using a countable set of simpler functions. The existence of a complete orthonormal set, and above all its construction, is in general a very difficult task; moreover, orthonormal representations are rather restrictive and rigid. It is therefore important to obtain collections of functions $\{\varphi_j;\ j \in \mathbb{Z}\}$ which do not necessarily constitute an orthonormal set, and need not be linearly independent, but can still be used to define a representation operator. One such collection is constituted by frames.

Definition 2. A collection of functions $\{\varphi_j;\ j \in \mathbb{Z}\}$ is a frame if there exist constants A and B with $0 < A \leq B < +\infty$ such that, for all $f \in H$,

$A \|f\|^2 \leq \sum_j |\langle f, \varphi_j \rangle|^2 \leq B \|f\|^2.$   (7)

The constants A and B are called frame bounds. When A = B we say that the frame is tight, and for all $f \in H$:

$\sum_j |\langle f, \varphi_j \rangle|^2 = A \|f\|^2.$   (8)

Moreover, a tight frame of unit-norm vectors with A = 1 is an orthonormal basis. Conversely, let $c_j = \langle f, \varphi_j \rangle$ for all $j \in \mathbb{Z}$; if the system $\{\varphi_j;\ j \in \mathbb{Z}\}$ is an orthonormal basis, then $\langle \varphi_j, \varphi_k \rangle = \delta_{jk}$, where $\delta_{jk}$ is the Kronecker symbol.

Theorem 3. If $B = \{\varphi_j;\ j \in \mathbb{Z}\}$ is an orthonormal basis, then B is a tight frame with A = 1.

Proof. Since B is an orthonormal basis, $\|\varphi_j\| = 1$ for all j, and

$\|f\|^2 = \langle f, f \rangle = \left\langle \sum_j c_j \varphi_j, \sum_k c_k \varphi_k \right\rangle = \sum_j \sum_k c_j c_k \langle \varphi_j, \varphi_k \rangle = \sum_k c_k^2 = \sum_k |\langle f, \varphi_k \rangle|^2,$

so (8) holds with A = 1.

Expression (6) shows that, in the case of an orthonormal basis, any f can be expressed as a linear combination of the $\varphi_j$, with coefficients obtained by projecting f onto the same elements of the basis. Let us analyse this property when $\{\varphi_j;\ j \in \mathbb{Z}\}$ is a tight frame. Using the polarization identity and Eq. (8), for all $g \in H$ we have

$\langle f, g \rangle = \frac{1}{A} \sum_j \langle f, \varphi_j \rangle \langle \varphi_j, g \rangle.$   (9)

It follows that, in the weak sense,

$f = \frac{1}{A} \sum_j \langle f, \varphi_j \rangle \varphi_j.$   (10)

The representation of f in terms of the elements of a tight frame, given by Eq. (10), is very similar to the one in terms of the elements of an orthonormal basis given by (6). It is important to underline, however, that frames, even tight frames, are not orthonormal bases.

Expression (10) shows how we can obtain approximations of a function f using frames. In fact, it motivates us to define a new representation operator F, also called the frame operator, analogous to the one defined for orthonormal bases:

$F(f) = (c_j)$ where $c_j = \langle f, \varphi_j \rangle.$   (11)

We should remark that this operator is in general not invertible. Note that F takes as input a function $f \in H$ and returns a sequence $c \in \ell^2$, that is, by (2), a set of coefficients (features). The j-th component $c_j = (Ff)_j$ is the projection of the function f onto the j-th element $\varphi_j$ of the frame. Moreover, due to the linearity of the scalar product, F is a linear operator. By the definition of frame (7) and by the definition of the frame operator, we have

$A \|f\|^2 \leq \sum_j |(Ff)_j|^2 \leq B \|f\|^2.$   (12)

Since $\sum_j |(Ff)_j|^2 = \sum_j c_j^2 = \|c\|^2$, we can conclude that $\|Ff\|^2 \leq B \|f\|^2$, which implies that the frame operator F is bounded, hence also continuous. For this type of operator we can define the adjoint operator $F^*: \ell^2 \to H$, which takes as input a sequence $c \in \ell^2$ and returns a function of H, defined by

$\langle F^* c, f \rangle = \langle c, Ff \rangle$ for all $c \in \ell^2$ and all $f \in H.$

Using the definition (11) of the frame operator,

$\langle F^* c, f \rangle = \sum_j c_j (Ff)_j = \sum_j c_j \langle f, \varphi_j \rangle = \left\langle \sum_j c_j \varphi_j, f \right\rangle,$

so we can conclude that

$F^* c = \sum_j c_j \varphi_j.$

This formula makes evident the role of the adjoint operator $F^*$ of the frame operator F: it takes as input a sequence $c \in \ell^2$ and returns the function of H obtained as a linear combination of the frame elements $\varphi_j$ with coefficients $c_j$. Finally, note that the operators F and $F^*$ have the same norm, so since F is bounded, its adjoint $F^*$ is bounded as well. In finite dimensions, where $\langle u, v \rangle = u^\top v$, the defining identity $\langle F^* c, f \rangle = \langle c, Ff \rangle$ reads $(F^* c)^\top f = c^\top F f$, from which it follows that $(F^*)^\top = F$, that is, $F^* = F^\top$: the adjoint of the frame operator F is the transpose of the matrix F.

By the definition (11) of the frame operator we have

$\sum_j |\langle f, \varphi_j \rangle|^2 = \sum_j (Ff)_j^2 = \|Ff\|^2 = \langle Ff, Ff \rangle = \langle c, Ff \rangle.$

Using the definition of the adjoint operator,

$\sum_j |\langle f, \varphi_j \rangle|^2 = \langle F^* c, f \rangle = \langle F^* F f, f \rangle,$

and the frame condition (7) becomes

$A \langle f, f \rangle \leq \langle F^* F f, f \rangle \leq B \langle f, f \rangle.$

The operator $F^* F$ is self-adjoint (a bounded linear operator F mapping a Hilbert space H into itself is said to be self-adjoint if $F = F^*$), because $(F^* F)^* = F^* (F^*)^* = F^* F$, and therefore

$A\,\mathrm{Id} \leq F^* F \leq B\,\mathrm{Id},$

where $\mathrm{Id}\,f = f$ is the identity operator. Moreover, $F^* F$ is a bounded operator, and positive definite since A > 0; it is therefore invertible, and the inverse operator $(F^* F)^{-1}$ is bounded by $A^{-1}$:

$B^{-1}\,\mathrm{Id} \leq (F^* F)^{-1} \leq A^{-1}\,\mathrm{Id}.$

We can apply the inverse operator $(F^* F)^{-1}$ to the original frame $\{\varphi_j\}$, obtaining a new family

$\tilde{\varphi}_j = (F^* F)^{-1} \varphi_j.$   (13)

The family $\{\tilde{\varphi}_j\}$ constitutes a frame, called the dual frame, as specified by the following theorem.

Theorem 4. Let $\{\varphi_j\}$ be a frame with bounds A, B. The dual frame defined by $\tilde{\varphi}_j = (F^* F)^{-1} \varphi_j$ satisfies, for all $f \in H$,

$\frac{1}{B} \|f\|^2 \leq \sum_j |\langle f, \tilde{\varphi}_j \rangle|^2 \leq \frac{1}{A} \|f\|^2$

and

$f = \sum_j \langle f, \varphi_j \rangle \tilde{\varphi}_j = \sum_j \langle f, \tilde{\varphi}_j \rangle \varphi_j.$

If the frame is tight (i.e. A = B), then $\tilde{\varphi}_j = A^{-1} \varphi_j$.

Theorem 4 makes evident the fact that if we want to reconstruct f from the coefficients $\langle f, \varphi_j \rangle$, we must use the dual frame of $\{\varphi_j\}$, that is,

$f = \sum_j \langle f, \varphi_j \rangle \tilde{\varphi}_j.$   (14)

In conclusion, when we have a frame $\{\varphi_j\}$, the only thing we need to do in order to apply (14) is to compute the dual frame $\tilde{\varphi}_j = (F^* F)^{-1} \varphi_j$.
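As a small numerical check of Eqs. (8) and (10) (our example, not the authors'), the three unit vectors of $\mathbb{R}^2$ spaced 120 degrees apart (the so-called Mercedes-Benz frame) form a tight frame with A = B = 3/2, and for a tight frame the dual is just $A^{-1}\varphi_j$, so formula (10) reconstructs any f exactly:

import numpy as np

# Tight-frame sketch: three unit vectors at 120 degrees in R^2, A = B = 3/2.
angles = np.pi / 2 + np.array([0.0, 2.0, 4.0]) * np.pi / 3
Phi = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # rows = frame vectors

f = np.array([0.7, -1.3])
c = Phi @ f                          # c_j = <f, phi_j>
A = np.sum(c**2) / np.dot(f, f)      # Eq. (8): equals 3/2 for every f

f_rec = (1.0 / A) * (Phi.T @ c)      # Eq. (10): f = (1/A) sum_j <f, phi_j> phi_j
assert np.isclose(A, 1.5) and np.allclose(f, f_rec)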

3. Finite dimensional frames

In this section we analyze frames in the particular cases, useful for practical applications, of $H = \mathbb{R}^n$ and $H = \mathbb{C}^n$. To this aim, consider a family of vectors $(\varphi_j)_{j=1}^{l}$ of $\mathbb{R}^n$. The family is called a dictionary [11] and its elements are called atoms. By the definition of frame, the family $(\varphi_j)_{j=1}^{l}$ constitutes a frame if there exist two constants $A > 0$ and $B < \infty$ such that, for all $f \in \mathbb{R}^n$,

$A \|f\|^2 \leq \sum_{j=1}^{l} |\langle f, \varphi_j \rangle|^2 \leq B \|f\|^2.$   (15)

Let F be the $l \times n$ matrix having the frame vectors as its rows. The matrix F is the frame operator, $F: \mathbb{R}^n \to \mathbb{R}^l$. Let $c \in \mathbb{R}^l$ be the vector obtained by applying the frame operator F to the vector $f \in \mathbb{R}^n$, that is, $c = Ff$. Then

$c_j = (Ff)_j = \langle f, \varphi_j \rangle$ for $j = 1, 2, \ldots, l.$

By the definition of the frame operator F, (15) becomes

$A \|f\|^2 \leq \|Ff\|^2 \leq B \|f\|^2,$   (16)

and since these inequalities hold for all $f \in \mathbb{R}^n$, F is a bounded operator. Consider the adjoint $F^*$ of the frame operator. We have already seen that the adjoint operator coincides with the transpose of F, that is, $F^* = F^\top$. The operator $F^*: \mathbb{R}^l \to \mathbb{R}^n$ is the $n \times l$ matrix such that, for all $f \in \mathbb{R}^n$ and $c \in \mathbb{R}^l$,

$\langle F^* c, f \rangle = \langle c, Ff \rangle.$

Let us analyze in detail how $F^*$ works. Substituting in the definition of the adjoint operator,

$\langle F^* c, f \rangle = \sum_{j=1}^{l} c_j (Ff)_j = \sum_{j=1}^{l} c_j \langle \varphi_j, f \rangle,$

which implies

$F^* c = \sum_{j=1}^{l} c_j \varphi_j.$

This equality shows that $F^*$ associates to a vector $c \in \mathbb{R}^l$ the vector $v \in \mathbb{R}^n$ equal to the linear combination of the frame elements $\varphi_j$ with coefficients $c_1, c_2, \ldots, c_l$. Rewriting the inequalities (16) with explicit scalar products,

$A \langle f, f \rangle \leq \langle Ff, Ff \rangle \leq B \langle f, f \rangle,$

and using the definition of the adjoint operator,

$A \langle f, f \rangle \leq \langle F^\top F f, f \rangle \leq B \langle f, f \rangle,$   (17)

where the matrix $F^\top F$ is symmetric. By the definition of frame we know that A > 0, so the matrix $F^\top F$ is positive definite, that is, $f^\top F^\top F f > 0$ for all $f \neq 0$. The matrix $F^\top F$ is then invertible, and we can compute the vectors of the dual frame $(\tilde{\varphi}_j)_{j=1}^{l}$ using formula (13), $\tilde{\varphi}_j = (F^\top F)^{-1} \varphi_j$. Moreover, the vector f can be recovered from the coefficients $c = Ff$, where $c_j = \langle f, \varphi_j \rangle$, by formula (14):

$f = \sum_{j=1}^{l} c_j \tilde{\varphi}_j = \sum_{j=1}^{l} c_j (F^\top F)^{-1} \varphi_j.$

Noting that here the $\tilde{\varphi}_j$ are column vectors, while the $\varphi_j$ are row vectors in the frame operator F, we can write

$f = F^\dagger c,$   (18)

where $F^\dagger = (F^\top F)^{-1} F^\top$ is the pseudoinverse of F. Formula (18) makes evident the difference between the frame operator F and the operator $F^\dagger$. F, also called the analysis operator, associates a vector of coefficients c (features) to a signal f, i.e. $Ff = c$, decomposing (projecting) the signal onto the atoms of the dictionary; this operation involves an $l \times n$ matrix. Conversely $F^\dagger$, also called the synthesis operator, builds up a signal f as a superposition of the atoms of the dual dictionary weighted with the coefficients c, i.e. $f = F^\dagger c$; this operation involves an $n \times l$ matrix. Note that the columns of $F^\dagger$ are the elements of the dual frame.

So far we have seen how, given a frame $(\varphi_j)_{j=1}^{l}$ of vectors of $\mathbb{R}^n$, to decompose a signal using the frame elements and how to reconstruct it using the dual frame elements. It remains to understand when a given family of vectors constitutes a frame and how to compute the frame bounds. Using the definition of the scalar product between elements of $\mathbb{R}^n$, $\langle f, v \rangle = f^\top v$, from (17) it follows that

$A f^\top f \leq f^\top F^\top F f \leq B f^\top f.$

For $f \neq 0$ we have

$A \leq \frac{f^\top F^\top F f}{f^\top f} \leq B.$

The central term of this inequality is known as the Rayleigh quotient R(f) of the matrix $F^\top F$. It is possible to show [12] that, if $\lambda_{min}$ and $\lambda_{max}$ are respectively the minimum and maximum eigenvalues of the symmetric matrix $F^\top F$, then $\lambda_{min} \leq R(f) \leq \lambda_{max}$; hence the eigenvalues of the matrix $F^\top F$ lie in the interval [A, B]. Since any A' and B' with $0 < A' \leq A$ and $B \leq B' < \infty$ can also serve as frame bounds, we choose A as large as possible and B as small as possible, which is equivalent to taking the minimum and maximum eigenvalues of $F^\top F$ as the frame bounds. Note that in the case of a tight frame all the eigenvalues of $F^\top F$ coincide. We therefore have a practical recipe for establishing whether a system of vectors $(\varphi_j)_{j=1}^{l}$ is a frame: build the matrix F having the vectors $\varphi_j$ as rows and compute the minimum and maximum eigenvalues of $F^\top F$; if $\lambda_{min} > 0$, the system $(\varphi_j)_{j=1}^{l}$ is a frame, with frame bounds $\lambda_{min}$ and $\lambda_{max}$. Finally, note that the redundancy ratio B/A coincides with the condition number $\lambda_{max}/\lambda_{min}$ of $F^\top F$.
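This recipe is a few lines of Python. The sketch below is our own, with random Gaussian atoms standing in for a real dictionary (the sizes n = 16 and l = 48 are arbitrary): stack the atoms as rows of F, read the frame bounds off the extreme eigenvalues of $F^\top F$, and reconstruct through the pseudoinverse as in Eq. (18).

import numpy as np

# Frame check and dual-frame reconstruction in R^n, following Section 3.
rng = np.random.default_rng(1)
n, l = 16, 48                        # signal length, number of atoms (assumed)
F = rng.standard_normal((l, n))      # rows = atoms phi_j of the dictionary

eigvals = np.linalg.eigvalsh(F.T @ F)          # ascending eigenvalues
A, B = eigvals[0], eigvals[-1]                 # tightest frame bounds
assert A > 0, "lambda_min = 0: the family does not span R^n, hence no frame"
print(f"A = {A:.3f}, B = {B:.3f}, redundancy ratio B/A = {B / A:.3f}")

f = rng.standard_normal(n)
c = F @ f                            # analysis: c_j = <f, phi_j>
f_rec = np.linalg.pinv(F) @ c        # synthesis with the dual frame, Eq. (18)
assert np.allclose(f, f_rec)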
3.1. Frames and economic representations

We have seen that, in the case of frames, the representation of a signal f in terms of the frame elements is not unique. Once we have computed the coefficients c by projecting the signal onto the frame elements, there is a unique way to recover the signal f, namely by using the elements of the dual frame (14). Now the question is: among all possible representations of the signal f in terms of the frame elements, what properties do the coefficients c have? In [3] it is shown (see also [11]) that, among all representations of a signal, the method of frames selects the one whose coefficients have minimum $\ell^2$ norm. This is equivalent to saying that the vector c is the solution of the following problem:

$\min_c \ \tfrac{1}{2} \|c\|^2 \quad \text{s.t.} \quad Fc = f,$

where, for convenience, the frame elements are here the columns of the matrix F (which is equivalent to starting from the dual frame). In other words, to represent the signal f using the elements of the dictionary F, the method of frames selects the minimum-length solution. Finally, note that the method of frames is not sparsity preserving. Suppose that the signal is just an element of the dictionary, so that c should have only one non-zero component; on the contrary, every atom having a non-zero scalar product with the signal contributes to the solution, so the corresponding component of c is different from zero.
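The following sketch (ours) makes the remark concrete: even when the signal is exactly one atom of the dictionary, the minimum-norm coefficients selected by the method of frames are dense.

import numpy as np

# Method of frames is not sparsity preserving: min ||c|| s.t. Dc = f is dense.
rng = np.random.default_rng(2)
n, l = 16, 48
D = rng.standard_normal((n, l))      # columns = atoms, as in the problem above
D /= np.linalg.norm(D, axis=0)       # unit-norm atoms

f = D[:, 7]                                   # the signal IS atom number 7
c = np.linalg.lstsq(D, f, rcond=None)[0]      # minimum-norm solution of Dc = f
assert np.allclose(D @ c, f)
print(np.sum(np.abs(c) > 1e-8), "non-zero coefficients out of", l)  # ~l, not 1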

4. An overview of the matching pursuit (MP) method

Matching pursuit (MP) is a greedy algorithm which, rather than solving the optimal approximation problem, progressively refines the signal approximation with an iterative procedure. In other words, MP is a non-linear algorithm that decomposes any signal into a linear expansion of waveforms that generally belong to a dictionary D of functions. At each step, MP selects the atom of the dictionary D which best reduces the residual between the current approximation of the signal and the signal itself. The best-adapted decomposition is selected by the following greedy strategy; with MP we are interested in finding a subset of D that best matches the given signal f.

Let us consider a dictionary D of atoms $\{\varphi_j\}_{j=1}^{L}$ such that D is a frame with L atoms, and let u denote the residual of the signal f to be analysed. The first step of MP approximates f by projecting it onto a vector $\varphi_{j_0} \in D$, so that

$f = \langle f, \varphi_{j_0} \rangle \varphi_{j_0} + u.$   (19)

Since the residual u is orthogonal to $\varphi_{j_0}$ (rewriting Eq. (19) as $u = f - \langle f, \varphi_{j_0} \rangle \varphi_{j_0}$, we see that the component of f in the direction of $\varphi_{j_0}$ has been removed),

$\|f\|^2 = |\langle f, \varphi_{j_0} \rangle|^2 + \|u\|^2.$

The objective is to minimize the residual u, hence to choose the atom $\varphi_{j_0}$ that maximizes $|\langle f, \varphi_j \rangle|$. Since we are working in finite dimensions, we choose $\varphi_{j_0}$ such that

$|\langle f, \varphi_{j_0} \rangle| = \max_{\varphi_j \in D} |\langle f, \varphi_j \rangle|.$   (20)

Suppose we have already computed the residual $u_k$ at the k-th step. We choose $\varphi_{j_k} \in D$ such that

$|\langle u_k, \varphi_{j_k} \rangle| = \max_{\varphi_j \in D} |\langle u_k, \varphi_j \rangle|,$

and projecting $u_k$ onto $\varphi_{j_k}$ we obtain

$u_{k+1} = u_k - \langle u_k, \varphi_{j_k} \rangle \varphi_{j_k}.$

In other words, at the k-th step the algorithm selects the dictionary atom that best correlates with the residual (minimizing the reconstruction error) and adds the selected atom $\varphi_{j_k}$ to the current approximation. The orthogonality of $u_{k+1}$ and $\varphi_{j_k}$ implies that

$\|u_{k+1}\|^2 = \|u_k\|^2 - |\langle u_k, \varphi_{j_k} \rangle|^2.$

Therefore, at a generic n-th step the original signal f can be expressed as

$f = \sum_{k=0}^{n-1} \langle u_k, \varphi_{j_k} \rangle \varphi_{j_k} + u_n.$   (21)

Similarly, for k between 0 and n - 1,

$\|f\|^2 = \sum_{k=0}^{n-1} |\langle u_k, \varphi_{j_k} \rangle|^2 + \|u_n\|^2.$   (22)

The residual $u_n$ is the approximation error of f after choosing n vectors in the dictionary, and the energy of this error is given by Eq. (22). For any $f \in H$, the convergence of the error to zero is a consequence of a theorem proved by Jones [13], i.e. $\lim_{n \to \infty} \|u_n\| = 0$. Hence

$f = \sum_{k=0}^{\infty} \langle u_k, \varphi_{j_k} \rangle \varphi_{j_k}$ and $\|f\|^2 = \sum_{k=0}^{\infty} |\langle u_k, \varphi_{j_k} \rangle|^2.$

In finite dimensions, the convergence is exponential. From Eq. (21) it follows that MP represents the signal f as a linear combination of the dictionary atoms, with coefficients computed by minimizing the residual at each step. In general the algorithm ends when the residual is smaller than a given threshold or when the number of iterations k equals the number of desired atoms. If we stop the algorithm after a few steps (high threshold), MP provides an approximation of the signal with only a few atoms, and the reconstruction error can be very high (low computational time); when the algorithm selects a greater number of atoms (low threshold), it provides a better approximation of the signal (i.e. a low reconstruction error), but the computational time will be higher. A minimal implementation is sketched at the end of this section.

The approximations derived from MP can be refined by orthogonalizing the directions of projection; the resulting orthogonal pursuit converges in a finite number of iterations in finite-dimensional spaces, which is not the case for a non-orthogonal MP. The vector selected at each iteration by the MP algorithm is not, in general, orthogonal to the previously selected vectors: at iteration p, the selected atom $\varphi_{j_p}$ is not necessarily orthogonal to the previously selected atoms $(\varphi_{j_0}, \varphi_{j_1}, \ldots, \varphi_{j_{p-1}})$. In subtracting the projection of $u_k$ onto $\varphi_{j_k}$, the algorithm reintroduces new components in the directions of the previously selected atoms. This can be avoided by orthogonalizing the $\{\varphi_{j_k}\}_{k=0}^{p}$ with a Gram-Schmidt procedure. It has been proved in [14] that the residual of an orthogonal pursuit converges strongly to zero and that the number of iterations required for convergence is less than or equal to the dimension of the space H. Thus, in finite-dimensional spaces, orthogonal MP is guaranteed to converge in a finite number of steps, unlike non-orthogonal pursuits.
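For reference, here is a minimal matching pursuit sketch following Eqs. (19)-(21); it is our code, not the authors' implementation, and the random unit-norm dictionary is only a real-valued stand-in for the Haar and Gabor dictionaries used later.

import numpy as np

def matching_pursuit(f, atoms, n_atoms=10, tol=1e-6):
    """Greedy MP: atoms is an (l, n) array of unit-norm rows."""
    residual = f.copy()
    coeffs, picked = [], []
    for _ in range(n_atoms):
        corr = atoms @ residual                   # <u_k, phi_j> for every atom
        j = int(np.argmax(np.abs(corr)))          # best-matching atom, Eq. (20)
        coeffs.append(corr[j])
        picked.append(j)
        residual = residual - corr[j] * atoms[j]  # u_{k+1} = u_k - <u_k,phi> phi
        if np.linalg.norm(residual) < tol:        # stopping criterion
            break
    return np.array(coeffs), picked, residual

# Usage on a toy over-complete dictionary (sizes are assumptions).
rng = np.random.default_rng(3)
F = rng.standard_normal((64, 32))
F /= np.linalg.norm(F, axis=1, keepdims=True)
f = rng.standard_normal(32)
c, idx, u_n = matching_pursuit(f, F, n_atoms=20)
f_hat = c @ F[idx]                                # Eq. (21) without the residual
print("residual energy:", np.linalg.norm(u_n) ** 2)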

5. 1D Gabor function

A one-dimensional (1D) Gabor function $h_{\omega_0,\sigma}(t)$ is a complex function given by

$h_{\omega_0,\sigma}(t) = g_\sigma(t) f_{\omega_0}(t),$   (23)

where $g_\sigma(t)$ is a normalized Gaussian function with standard deviation $\sigma$,

$g_\sigma(t) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-t^2/2\sigma^2},$   (24)

and

$f_{\omega_0}(t) = e^{j\omega_0 t}$   (25)

is a complex exponential with angular frequency $\omega_0$ (recall that $\omega_0 = 2\pi f_0$). The real and imaginary components of a Gabor function are given by (see Fig. 2)

$h^r_{\omega_0,\sigma}(t) = g_\sigma(t) \cos(\omega_0 t),$   (26)

$h^i_{\omega_0,\sigma}(t) = g_\sigma(t) \sin(\omega_0 t).$   (27)

Fig. 2. Gabor function: (top) real part, (middle) imaginary part and (bottom) Fourier transform.

From Eq. (23) we see that a Gabor function is the product in time of a Gaussian function and a complex exponential. By the convolution theorem, in the frequency domain we then have

$H_{\sigma,\omega_0}(\omega) = (G_\sigma * F_{\omega_0})(\omega).$   (28)

It is simple to see that the Fourier transform of a Gabor function is a Gaussian function shifted in the frequency domain, centered on the frequency $\omega_0$ and with maximum amplitude $2\pi$:

$H_{\sigma,\omega_0}(\omega) = 2\pi\, e^{-\sigma^2 (\omega - \omega_0)^2 / 2}.$   (29)

Moreover, $H_{\sigma,\omega_0}(\omega)$ is a real function and represents a band-pass filter. Since the Gaussian function is strictly positive, it is not band-limited: there is no frequency $\omega_c$ beyond which $H_{\sigma,\omega_0}(\omega) = 0$. We therefore determine the half-peak bandwidth of the filter, that is, the cut-off frequency $\omega_c$ at which the filter response falls to half its peak value:

$H_{\sigma,\omega_0}(\omega_c) = \frac{1}{2} \max_\omega H_{\sigma,\omega_0}(\omega).$   (30)

Using Eq. (29), we obtain

$\omega_c = \omega_0 \pm \frac{\sqrt{2\ln 2}}{\sigma}.$   (31)

Therefore, the half-peak bandwidth of a Gabor function with standard deviation $\sigma$ is

$B = \frac{2\sqrt{2\ln 2}}{\sigma},$   (32)

showing that the filter bandwidth B is proportional to the inverse of $\sigma$.
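A sampled Gabor atom is easy to generate. The sketch below (our code; the function name and parameter choices are ours) implements Eqs. (24), (26) and (27) on a discrete time axis of n points, with the center frequency f0 expressed in cycles per sample:

import numpy as np

def gabor_atom(n, f0, sigma, t0):
    """Real and imaginary parts of a 1D Gabor atom, Eqs. (26)-(27)."""
    t = np.arange(n) - t0                   # translated time axis
    g = np.exp(-t**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)  # (24)
    w0 = 2.0 * np.pi * f0                   # angular frequency
    return g * np.cos(w0 * t), g * np.sin(w0 * t)

h_r, h_i = gabor_atom(n=256, f0=0.05, sigma=20.0, t0=128)  # example parameters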

Now let $f_0$ be the center frequency of a generic Gabor filter in a bank, where each filter is an appropriate Gabor function with a passband of B octaves. The value of $\sigma_0$ in a filter bank is

$\sigma_0 = \frac{\sqrt{2\ln 2}\,(2^B + 1)}{2\pi f_0\,(2^B - 1)}.$   (33)

Note that each filter of the bank is a B-octave filter, where the passband of the particular filter depends on its center frequency $f_0$; in particular, the $\sigma_0$ of an octave Gabor filter is proportional to the inverse of the center frequency $f_0$. Moreover, denoting by $f_{c_i}$ the center frequency of the interval $[f_i, f_{i+1}]$ for i = 0, 1, ..., n - 1, the center frequencies of the subintervals satisfy the relation

$f_{c_i} = 2^B f_{c_{i-1}}$ for $i = 1, 2, \ldots, n - 1.$   (34)

6. Experimental results

We show results for the analysis and synthesis of 1D real signals produced by an e-nose. In the one-dimensional case, each atom of the dictionary has length equal to the length of the 1D signal we want to analyze (we call n the length of the 1D signal). The typical signal f is composed of n = 256 components ($n = 2^l = 256$ with l = 8). Fig. 3 shows a typical 1D signal, the response of an e-nose.

Fig. 3. Typical e-nose response.

6.1. Haar dictionary

The first dictionary we considered is an over-complete Haar dictionary, obtained by scaling and translating the Haar function with shift parameter t = 1, so that for each value of the scaling parameter n vectors are added to the dictionary (a construction in this spirit is sketched in code after Table 1). The elements of the Haar basis were obtained as impulse responses of the Haar synthesis filters. Fig. 4 shows some atoms of a Haar dictionary used in the experiments (see Table 1), generated using t = 1 and scaling parameters that are powers of 2, from $2^0$ to $2^l$.

Fig. 4. Some atoms of the Haar dictionary.

Table 1 reports the results obtained for the generated Haar dictionary. Here $\lambda_{min}$ and $\lambda_{max}$ are the minimum and maximum eigenvalues of $F^\top F$, computed via the SVD of the $l \times n$ dictionary matrix F, where l is the number of atoms (the rows of F), that is, the number of elements of the dictionary, and $e = \|f - \hat{f}\|$ is the error between the input signal f and the recovered signal $\hat{f}$.

Table 1. Haar dictionary results

  t:       1
  l:       2049
  λ_min:   1.0
  λ_max:   16.0
  e:       4.3e-12
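The sketch below is a hedged approximation of ours: the paper derives its atoms from the impulse responses of the Haar synthesis filters, whereas this code directly translates (unit shift) and dyadically scales a Haar-shaped step, so it need not reproduce the exact atom count l = 2049 of Table 1.

import numpy as np

def haar_dictionary(n):
    """Over-complete dictionary of unit-norm, Haar-shaped atoms (rows)."""
    atoms = [np.ones(n) / np.sqrt(n)]               # coarsest (constant) atom
    s = 2
    while s <= n:                                   # dyadic scales 2, 4, ..., n
        proto = np.concatenate([np.ones(s // 2), -np.ones(s // 2)])
        proto /= np.linalg.norm(proto)
        for t in range(n - s + 1):                  # shift parameter t = 1
            atom = np.zeros(n)
            atom[t:t + s] = proto
            atoms.append(atom)
        s *= 2
    return np.stack(atoms)                          # shape (l, n)

F = haar_dictionary(256)
print(F.shape)                                      # many more atoms than samples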

6.2. Gabor dictionary

We performed the same analysis and synthesis experiments using a time- and frequency-translation-invariant Gabor dictionary [10], obtained by scaling, translating and modulating a Gaussian window. Gaussian windows are used because of their optimal time-frequency energy concentration, as expressed by the Heisenberg uncertainty principle. In the generation of the Gabor dictionary, for each value of the parameters $\omega_0$ and $\sigma$ in (23), $h_{\omega_0,\sigma}$ is sampled in n points and all its translated versions are considered as elements (atoms) of the dictionary; in other words, for each pair of parameters, n vectors are added to the dictionary. All the center frequencies $f_0 = \omega_0/2\pi$ are in the range [0, 1/2], where 1/2 is the Nyquist critical frequency. Note that the atoms of the Gabor dictionary are complex, being modulated versions ($e^{j\omega_0 t}$) of the Gaussian window; therefore, each atom has a real part and an imaginary part. In Table 2 we report the results obtained for different values of the parameters $\omega_0$ and B. There, $f_0$ indicates the number of center frequencies used for the generation of the Gabor dictionary; the frequencies were obtained by spacing the range [0, 1/2] so that $f_{c_i} = 2^B f_{c_{i-1}}$.

Table 2. Gabor dictionary results

  f_0   B    l     λ_min      λ_max     e
  5     0.5  1280  4.16e-16   18.65     1.6e-05
  5     1    1280  1.32e-03   44.52     1.3e-12
  5     2    1280  1.81e-01   281.38    2.6e-12
  10    0.5  2560  1.70e-02   91.22     8.7e-13
  10    1    2560  1.18       595.08    2.0e-13
  10    2    2560  0.18       1561.40   4.7e-12

6.3. MP filters for 1D-signal analysis

An electronic nose incorporates an array of chemical sensors whose response constitutes an odor pattern. A single sensor in the array should not be highly specific in its responses but should respond to a broad range of compounds, so that different patterns can be expected to be related to different odors. A typical gas sensor response is shown in Fig. 3. In the MP scheme, a particular response of the electronic nose is represented by an n-dimensional vector $(c_0, c_1, \ldots, c_{L-1})$, called a coefficient vector. The coefficient values are computed by projecting the response 1D signal onto a set of basis functions $\{\varphi_0, \ldots, \varphi_{L-1}\}$, which need not be orthogonal. Because the basis is not necessarily orthogonal, an iterative projection algorithm is applied to calculate the coefficients; if the basis is orthogonal, the algorithm reduces to the standard projection method. The projection algorithm adjusts for non-orthogonality by using the residual 1D signal. If f is a 1D signal (or template) as in Fig. 3, then $u_k$ is the residual signal at iteration k, with $u_0 = f$. The coefficient $c_k$ is the projection of the residual signal $u_k$ onto the basis element $\varphi_k$,

$c_k = \langle u_k, \varphi_k \rangle,$

where $\langle \cdot, \cdot \rangle$ is the inner product between two functions. The residual signal is updated after each iteration by

$u_{k+1} = u_k - c_k \varphi_k.$   (35)

After the n-th iteration, the 1D signal f is decomposed into a sum of residual signals:

$f = \sum_{k=0}^{n-1} (u_k - u_{k+1}) + u_n.$   (36)

Rearranging Eq. (35) and substituting into Eq. (36) yields

$f = \sum_{k=0}^{n-1} c_k \varphi_k + u_n,$   (37)

so the approximation of the original signal after n iterations is

$\hat{f} = \sum_{k=0}^{n-1} c_k \varphi_k.$   (38)

The approximation need not be very accurate, since the encoded information is enough for recognition. When the desired approximation error is reached, a subset of atoms of D is defined (called $F \subset D$): the number of rows of F is equal to the number of atoms selected by the MP algorithm (n), whereas the number of columns of F is equal to the number of samples of the signal f (m = 256). A generic signal can therefore be expressed by projecting it onto the matrix $F_{n \times m}$:

$c = F f,$   (39)

where c is the coefficient vector relative to the signal f (c is n-dimensional).

Fig. 5. Reconstruction of the known curve using different numbers of atoms.

6.4. MP filters for 1D-signal synthesis

The signal f can be recovered from the coefficient vector $c = Ff$, where $c_k = \langle f, \varphi_k \rangle$, using the formula

$\hat{f} = F^\dagger c,$   (40)

where $F^\dagger$ is the pseudoinverse of F. Eq. (40) makes evident the difference between the operator F and the operator $F^\dagger$. The operator F, also called the analysis operator, associates a vector of coefficients c (features) to a signal f, decomposing (projecting) the signal onto the atoms of the sub-dictionary F. Conversely $F^\dagger$, also called the synthesis operator, builds up a reconstructed signal $\hat{f}$ as a superposition of the atoms of the dual sub-dictionary weighted with the coefficients c. Note that the columns of $F^\dagger$ are the elements of the dual frame of F, so $F^\dagger$ is an $m \times n$ matrix. A code sketch of this analysis/synthesis pair is given below.

Fig. 6. Reconstruction error along iterations (number of selected atoms).

Fig. 5 shows the reconstruction, for several numbers of extracted atoms, of the curve used to train the MP process. The more atoms are extracted, the better, of course, is the representation, as shown in Fig. 6. Sets of 5, 10, 20 and 35 atoms were found with MP using the curve shown in Fig. 3. With these sets of extracted atoms we can also reconstruct unseen signals, as shown in Fig. 7. This means that the extracted atoms span very well the domain represented by functions of this type, let us say e-nose functions. A comparison between MP coefficients and standard Haar wavelet transform coefficients was also performed; synthesis over the two sets of coefficients is shown in Fig. 8.

Fig. 7. Reconstruction of an unseen signal using different numbers of atoms extracted from the curve shown in Fig. 3.
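The analysis/synthesis pair of Eqs. (39) and (40) can be sketched as follows (our code, with random stand-ins for the selected atoms and the response signal; np.linalg.pinv computes the Moore-Penrose pseudoinverse, so the synthesis yields the best approximation of f in the span of the selected atoms):

import numpy as np

rng = np.random.default_rng(4)
m, n_sel = 256, 8                        # samples per signal, atoms kept by MP
F_sub = rng.standard_normal((n_sel, m))  # rows = MP-selected atoms (stand-in)
f = rng.standard_normal(m)               # stand-in for an e-nose response

c = F_sub @ f                      # analysis operator, Eq. (39)
f_hat = np.linalg.pinv(F_sub) @ c  # synthesis operator, Eq. (40)
print("reconstruction error:", np.linalg.norm(f - f_hat))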

Fig. 8. Synthesis comparison between eight MP coefficients and eight standard Haar wavelet transform coefficients.

In the standard Haar wavelet transform, the fifth decomposition level was chosen, with its eight approximation coefficients; no detail coefficients were used in the synthesis process (i.e. they were set to zero). Note that the MP coefficients are more expressive than the standard wavelet transform coefficients, since the latter cannot capture high-frequency content other than noise. By contrast, the atoms selected in the MP process are strongly scaled and localized in the two important regions of the signal (the starting times of the sampling and recovery processes), which allows a better synthesis of the analysed signal (Fig. 8). Recall that the eight atoms used with MP were obtained from the signal shown in Fig. 3.

6.5. MP features classification

The coefficient vector c for a given signal is the feature vector given as input to a recognition system. Here the idea is to use such features to classify odors; in our case, an odor is simply expressed by the responses of an array of 19 sensors. Fig. 9 shows the responses of an array of chemical sensors when exposed to a given odor. For each odor response, analysis is performed using the MP algorithm, varying the number of atoms in the sub-dictionary F. In order to evaluate the best number of coefficients to retain (the number of atoms to select from the over-complete dictionary D), a generalization error is evaluated: the objective is to find, with a leave-one-out procedure, the number of atoms per signal that minimizes the generalization error.

Fig. 9. Responses of an array of 19 gas sensors.

Table 3. Summary of the PNN classification error (leave-one-out procedure) for the methodologies investigated. The number of coefficients refers to the features extracted from each sensor response.

  Method                           Error (%)   Number of coefficients
  Haar wavelet transform           11          8
  MP over Haar dictionary          10          5
  MP over Gabor dictionary         9           18
  MP over Haar-Gabor dictionary    9           18

Fig. 10. Classification error with the Haar dictionary.

Fig. 11. Classification error with different Gabor dictionaries.

Fig. 12. Classification error with Gabor and Haar dictionaries.

The 19 coefficient vectors (one for each signal of the response, each with length equal to the number of atoms of F) are given as input to a neural network. Table 3 reports the classification errors for the different techniques proposed in the paper. It is interesting to note that, with almost the same classification error, it is possible to represent the signal with a smaller amount of information. For example, with the classical Haar wavelet transform, eight coefficients per sensor response are needed to reach the lowest classification error of 11%, while with the method proposed in this paper (matching pursuit over an over-complete Haar dictionary) only five coefficients are needed to obtain almost the same classification error, thus saving 37% of the memory allocation and a certain amount of computational power. For a single sensor this is obviously irrelevant, but for a large array of sensors it becomes very important. The last two lines of Table 3 report the results for coefficients extracted from each response curve by MP over a Gabor dictionary and by MP over a joint Gabor-Haar dictionary: the classification error decreases slightly with a larger number of coefficients, and with a better reconstruction error with respect to MP over the Haar dictionary. This verifies that the extracted features are good features for discriminatory purposes in the classification process, with the aim of having clustered responses associated with the same odors.

A probabilistic neural network (PNN) is used for classification. PNNs are a class of neural networks that combine some of the best attributes of statistical pattern recognition and feed-forward neural networks [15]. They feature very fast training times and produce outputs with Bayes posterior probabilities; these useful properties come at the expense of larger memory requirements and slower execution for the prediction of unknown patterns, compared to conventional neural networks. Figs. 10-12 show the classification error for the different generated sub-dictionaries. In particular, in Fig. 11 the Gabor sub-dictionaries are obtained by also varying the number of center frequencies, while Fig. 12 shows classification results over sub-dictionaries generated by the union of the two over-complete Gabor and Haar dictionaries.
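The paper does not spell the PNN out; the sketch below is our minimal Parzen-window formulation after the standard PNN of [15], with equal class priors. The smoothing parameter sigma is an assumption and would be tuned, for example, with the leave-one-out procedure described above.

import numpy as np

def pnn_predict(X_train, y_train, X_test, sigma=1.0):
    """One Gaussian window per training pattern; scores approximate posteriors."""
    classes = np.unique(y_train)
    # squared distances between every test and every training pattern
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * sigma**2))                 # Parzen kernel responses
    scores = np.stack([K[:, y_train == c].mean(axis=1) for c in classes], axis=1)
    post = scores / scores.sum(axis=1, keepdims=True)  # Bayes-like posteriors
    return classes[np.argmax(scores, axis=1)], post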

7. Conclusions

In this paper a new feature extraction technique has been studied and implemented for an array of conductive polymer gas sensors. It has been shown that the extracted features provide good discrimination capabilities in the recognition phase and are superior to classical orthonormal decomposition methods such as the discrete wavelet transform. The use of a large dictionary allows the selection of a subset of analyzing functions that span the whole space of the problem domain. The basis functions of the feature space were selected with the well-known matching pursuit algorithm which, rather than solving the optimal approximation problem, progressively refines the signal approximation with an iterative procedure. Convergence of the procedure is assured, and within a few iterations it has been possible to capture a near-optimal signal representation. Several dictionaries have been used, and the results of the MP algorithm have been compared in terms of classification error.

Acknowledgment

The authors would like to thank Osmetech Inc., USA, for providing the data used in the experiments of this work. This work is supported in part by NATO-CNR N.217.34PROT.038131 and NOSE II - 2nd Network on artificial Olfactory Sensing.

References

[1] S. Mallat, A Wavelet Tour of Signal Processing, second ed., Academic Press, 1999.
[2] I. Daubechies, Ten Lectures on Wavelets, CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, 1992.
[3] N. Ancona, E. Stella, Image representation with overcomplete dictionaries for object detection, Internal report 02/2003, ISSIA CNR, Bari, Italy, July 2003. http://www.ba.cnr.it/iesina18/publications/ancona-stella-report-02 2003.pdf.
[4] C. Distante, M. Leo, P. Siciliano, K.C. Persaud, On the study of feature extraction method for electronic nose, Sens. Actuators B 87 (2002) 274-288.
[5] T. Poggio, F. Girosi, A theory of networks for approximation and learning, Technical report, Massachusetts Institute of Technology, A.I. Memo 1140, 1989.
[6] V. Vapnik, Statistical Learning Theory, Wiley, 1998.
[7] C. Distante, N. Ancona, P. Siciliano, Support vector machines for olfactory signals recognition, Sens. Actuators B 88 (2003) 30-39.
[8] C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (1998) 1-43, Kluwer Academic Publishers.
[9] R.J. Duffin, A.C. Schaeffer, A class of non-harmonic Fourier series, Trans. Am. Math. Soc. 72 (1952) 341-366.
[10] S. Mallat, Z. Zhang, Matching pursuit with time-frequency dictionaries, IEEE Trans. Signal Process. 41 (1993) 3397-3415.
[11] S. Chen, D. Donoho, M. Saunders, Atomic decomposition by basis pursuit, Technical Report 479, Department of Statistics, Stanford University, 1995.
[12] G. Strang, Linear Algebra and its Applications, Harcourt Brace & Company, 1988.
[13] L.K. Jones, On a conjecture of Huber concerning the convergence of projection pursuit regression, Ann. Statist. 15 (1987) 880-882.
[14] L. Blum, M. Shub, S. Smale, On the theory of computation and complexity over the real numbers: NP-completeness, recursive functions, and universal machines, Bull. Am. Math. Soc. 21 (1989) 1-46.
[15] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley Interscience, 2000.