Discriminant Analysis: A Unified Approach


Peng Zhang & Jing Peng
Tulane University
Electrical Engineering & Computer Science Department
New Orleans, LA 70118
{zhangp,jp}@eecs.tulane.edu

Norbert Riedel
Tulane University
Mathematics Department
New Orleans, LA 70118
riedel@math.tulane.edu

Abstract

Linear discriminant analysis (LDA) as a dimension reduction method is widely used in data mining and machine learning. It however suffers from the small sample size (SSS) problem when the data dimensionality is greater than the sample size. Many modified methods have been proposed to address some aspect of this difficulty from a particular viewpoint. A comprehensive framework that provides a complete solution to the SSS problem is still missing. In this paper, we provide a unified approach to LDA, and investigate the SSS problem in the framework of statistical learning theory. In such a unified approach, our analysis results in a deeper understanding of LDA. We demonstrate that LDA (and its nonlinear extension) belongs to the same framework in which powerful classifiers such as support vector machines (SVMs) are formulated. In addition, this approach allows us to establish an error bound for LDA. Finally, our experiments validate our theoretical analysis.

1 Introduction

The purpose of this paper is to present a unified theoretical framework for discriminant analysis. Discriminant analysis [6, 10] has been successfully used as a dimensionality reduction technique for many classification problems. According to Fisher's criterion, one has to find a projection matrix W that maximizes

J(W) = |W^T S_b W| / |W^T S_w W|,   (1)

where S_b and S_w are the so-called between-class and within-class scatter matrices, respectively. In practice, the small sample size (SSS) problem is often encountered, where S_w is singular. Therefore, the maximization problem can be difficult to solve. To address this issue, the term εI is added, where ε is a small positive number and I the identity matrix of proper size. This results in maximizing

J(W) = |W^T S_b W| / |W^T (S_w + εI) W|.   (2)

It can then be solved without any numerical problems. In [9], Friedman discusses regularized discriminant analysis with regard to the small sample size problem. Equation (2) is a special case of his regularized discriminant analysis in practice. However, Friedman leaves open the question of whether an optimal choice for the parameter ε exists, and if so, whether this optimal choice is unique. Further, Friedman's analysis is based on traditional parametric methods in statistics, and Gaussian distributions for the underlying random variables are often assumed. That is, his regularization technique is one of improving the estimation of covariance matrices that are possibly biased due to the small data size. The validity of this technique can be very limited. For example, the nonlinear case of discriminant analysis, called generalized discriminant analysis (GDA), has become very popular in practice and shows promising potential in many applications. The idea is to map the data into a kernel-induced feature space and then perform discriminant analysis there. However, in a high and possibly infinite dimensional feature space, GDA is not well justified from Friedman's point of view. Thus, a rigorous justification for the regularization term εI is still missing, especially when GDA is considered.
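To make the role of the εI term in (2) concrete, the following minimal numpy sketch builds the within-class scatter for a synthetic two-class problem whose dimensionality exceeds the sample size, so S_w is singular, and then solves for the regularized discriminant direction. The data, the value of ε, and all variable names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small-sample-size setup: 20-dimensional data, only 5 samples per class,
# so the within-class scatter has rank at most 8 and is singular.
X1 = rng.normal(0.0, 1.0, (5, 20))   # class 1 samples (rows)
X2 = rng.normal(1.0, 1.0, (5, 20))   # class 2 samples (rows)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter

eps = 1e-2                            # small positive regularizer, as in eq. (2)
# For two classes, maximizing eq. (2) yields a direction proportional to
# (S_w + eps I)^{-1} (m_1 - m_2); the ridge term makes the solve well defined.
w = np.linalg.solve(Sw + eps * np.eye(Sw.shape[0]), m1 - m2)
w /= np.linalg.norm(w)
print("regularized discriminant direction (first 5 coords):", w[:5])
```

Without the εI term, the same solve would fail (or be meaningless) because S_w is rank deficient in this regime.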
In [7], Evgeniou, Pontil and Poggio present a theoretical framework for the regularization functional (the so-called regularization network)

min_{f ∈ H} H[f] = Σ_{i=1}^{ℓ} (y_i - f(x_i))^2 + λ ||f||^2_K,   (3)

where ||f||^2_K is a norm in a Reproducing Kernel Hilbert Space H defined by the positive definite function K, ℓ is the number of data points or examples, and λ is a regularization parameter. In their paper they justify the above functional using statistical learning theory, which is mostly developed by Vapnik [22].

In this paper, we will show that discriminant analysis can be cast in the framework of statistical learning theory. More specifically, Fisher's criterion (1) is equivalent to the regularization network (3) without regularization, that is,

min_{f ∈ H} H[f] = Σ_{i=1}^{ℓ} (y_i - f(x_i))^2.

And Equation (2) is equivalent to the regularization network (3) with the parameter ε corresponding to the regularization term λ. Thus the role of εI is well justified by statistical learning theory. Further, we show that the optimal choice of λ exists, and that it is unique. Additionally, we establish an error bound for the function that discriminant analysis tries to find.

1.1 Related Work

The relationship between least squares regression and Fisher's linear discriminant has been well known for a long time. There is a good review in [6]. Actually, in the seminal paper [8], Fisher already pointed out its connection to the regression solution. In [17], Mika also provides an extensive review of LDA. Another widely used technique in statistics, canonical correlation analysis, is also closely related to least squares regression [1].

Regularized least squares (RLS) methods have been studied for a long time, under different names. In statistics, ridge regression [12] has been very popular for solving badly conditioned linear regression problems. After Tikhonov published his book [21], it was realized that ridge regression uses the regularization term in Tikhonov's sense. In the 1980's, weight decay was proposed to help prune unimportant neural network connections, and it was soon recognized [13] that weight decay is equivalent to ridge regression. Recently this old method was found essential in the framework of statistical learning theory, labeled by different names (RLS [18], Regularization Networks [7], least squares SVMs [19], and proximal SVMs [11]). It has been demonstrated that this regression method is fully comparable to SVMs when used as a classifier [18, 24]. Most recently, an error bound for RLS given a finite sample data set was developed in [5].

Many elements discussed in this paper are scattered throughout the literature and may look familiar to some readers but not to others. The fact is that most treatments are not comprehensive enough for this topic, and unclear issues remain for many. For example, many researchers such as Mika [17] apply the regularization technique to address the SSS problem, but they still follow Friedman's principle, which is quite limited. It is our goal to adopt a simple yet unified approach that makes this topic more comprehensive.

2 Discriminant Learning Analysis

In this section, we first review LDA using Fisher's criterion, and then go on to study it from a learning theory point of view, which we call discriminant learning analysis (DLA).

2.1 Linear Discriminant Analysis

In LDA, within-class, between-class, and mixture scatter matrices are used to formulate the criteria of class separability. Consider a J-class problem, where m_0 is the mean vector of all data, and m_j is the mean vector of the j-th class data. A within-class scatter matrix characterizes the scatter of samples around their respective class mean vectors, and it is expressed by

S_w = Σ_{j=1}^{J} Σ_{i=1}^{ℓ_j} (x_i^j - m_j)(x_i^j - m_j)^T,

where ℓ_j is the size of the data in the j-th class. A between-class scatter matrix characterizes the scatter of the class means around the mixture mean m_0. It is expressed by

S_b = Σ_{j=1}^{J} ℓ_j (m_j - m_0)(m_j - m_0)^T.

The mixture scatter matrix is the covariance matrix of all samples, regardless of their class assignments, and it is given by

S_m = Σ_{i=1}^{ℓ} (x_i - m_0)(x_i - m_0)^T = S_w + S_b.
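A small numpy sketch of the three scatter matrices just defined (as unnormalized sums over the samples) may help fix the bookkeeping; the toy data, the labels, and the helper name scatter_matrices are illustrative only.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class, between-class, and mixture scatter for samples X (rows) with labels y."""
    m0 = X.mean(axis=0)                       # mixture mean m_0
    p = X.shape[1]
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                  # class mean m_j
        Sw += (Xc - mc).T @ (Xc - mc)         # scatter around the class mean
        Sb += len(Xc) * np.outer(mc - m0, mc - m0)
    Sm = (X - m0).T @ (X - m0)                # mixture scatter
    return Sw, Sb, Sm

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = rng.integers(0, 3, size=30)               # a toy 3-class problem
Sw, Sb, Sm = scatter_matrices(X, y)
print(np.allclose(Sm, Sw + Sb))               # checks the identity S_m = S_w + S_b
```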
Fisher's criterion is used to find the projection matrix that maximizes (1). In order to determine the matrix W that maximizes J(W), one can solve the generalized eigenvalue problem S_b w_i = λ_i S_w w_i. The eigenvectors corresponding to the largest eigenvalues form the columns of W. For a two-class problem, it can be written in a simpler form:

S_w w = Δm = m_1 - m_2,   (4)

where m_1 and m_2 are the means of the two classes.

2.2 LDA as Least Squares Function Estimation

LDA is a supervised subspace learning problem where we are given a set z = {(x_i, y_i)}_{i=1}^{ℓ} of training examples drawn i.i.d. from the probability space Z = X × Y. Here a probability measure ρ is defined, and the x_i are the n-dimensional inputs. We first consider two-class problems, where we have y_i ∈ {1, -1} as class labels. Without loss of generality, we assume x has zero mean, i.e., E(x) = 0. Let us consider

f_MSE = arg min_f Σ_{i=1}^{ℓ} (y_i - f(x_i))^2.   (5)

This is a mean squared function estimation problem. Here we first consider the hypothesis space H_L, the function class of linear projections. That is, H_L = {f | f(x) = w^T x, x ∈ X}. Thus we will solve

w_opt = arg min_w Σ_{i=1}^{ℓ} (y_i - w^T x_i)^2.   (6)

It turns out that the solution of (6) is the same as the solution obtained by Fisher's criterion.

Lemma 1. The linear system derived by the least squares criterion (6) is equivalent to the one derived by Fisher's criterion (1), up to a constant, in two-class problems.

Proof: We rewrite the objective in (6) as

Σ_{i=1}^{ℓ} (y_i - w^T x_i)^2 = (y - X^T w)^T (y - X^T w) = (y^T - w^T X)(y - X^T w) = ℓ - 2(ℓ_1 m_1 - ℓ_2 m_2)^T w + w^T S_m w,

where we use the fact that Xy = ℓ_1 m_1 - ℓ_2 m_2. Taking the derivative with respect to w and setting the result to 0, we obtain

S_m w = ℓ_1 m_1 - ℓ_2 m_2.   (7)

This is equivalent to Equation (4). When S_w is full rank, the three equations S_m w = ℓ_1 m_1 - ℓ_2 m_2, S_m w = m_1 - m_2, and S_w w = m_1 - m_2 have the same solution w up to a constant factor, given that the overall mean is 0. For details see Appendix A.

2.3 Interpretation of LDA within Learning Theory

When both classes are Gaussian distributed, LDA can be shown to be optimal. More often in reality they are not, however, and then it is not clear how good LDA is. The conventional view of LDA is limited here. The learning framework, on the other hand, is more general. There is no need to assume the distribution of the data in order to discuss the goodness. Instead, the learning approach directly aims to find a good function mapping, and the goodness is well defined (see Appendix C). In a general sense, LDA tries to find a transformation f such that the transformed data have a large class mean difference and small within-class scatters. The equivalence between (6) and (1) can be understood in another way: one minimizes the mean squared error with respect to each class mean while keeping the mean difference constant.

We define f_ρ as the best function that minimizes the mean squared error. That is,

f_ρ = arg min_f ∫_Z (y - f(x))^2 dρ.   (8)

Given a set of data z, one tries to learn the function f_z that is as close as possible to the optimal mapping f_ρ. That is, f_opt = arg min_{f_z} ∫_Z (f_z - f_ρ)^2 dρ. However, the probability measure ρ is unknown and so is f_ρ. Instead, we may consider (5), which is called empirical error (risk) minimization. Because (6) is equivalent to Fisher's criterion, it is clear that Fisher's criterion leads to an empirical error minimization problem. From the statistical learning theory point of view, however, without controlling the function norm, solving equation (5) often leads to overfitting the data, and the solution is not stable. And when the sample size is smaller than the dimensionality, it is an ill-posed problem and the solution is not unique. That is where diverse LDA techniques have been developed to solve the SSS problem.

2.4 Discriminant Learning Analysis as Regularized Least Squares Regression

Following [7], we instead minimize over a hypothesis space H the following regularized functional for a fixed positive parameter λ:

f_DLA = arg min_{f ∈ H} Σ_{i=1}^{ℓ} (y_i - f(x_i))^2 + λ ||f||^2_H.   (9)

Again, we consider the linear projection function space H_L. In this case, ||f||^2 = w^T w, and

w_opt = arg min_w Σ_{i=1}^{ℓ} (y_i - w^T x_i)^2 + λ w^T w.   (10)

Thus,

Σ_{i=1}^{ℓ} (y_i - w^T x_i)^2 + λ w^T w = ℓ - 2(ℓ_1 m_1 - ℓ_2 m_2)^T w + w^T (S_m + λI) w.

Taking the derivative with respect to w and setting the result to 0, we have

(S_m + λI) w = ℓ_1 m_1 - ℓ_2 m_2.   (11)

In Appendix A, it is shown that (11) is equivalent to

(S_w + λI) w = Δm,   (12)

and therefore solving it is equivalent to maximizing (2). We call this approach discriminant learning analysis. Since for any given positive regularization term λ, (9) always has a unique solution, we have:

Proposition 2. Discriminant learning analysis is a well-posed problem.
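The equivalence between the ridge-regression view (10)-(11) and the discriminant view (12) can be checked numerically. The following minimal sketch uses made-up two-class data with the overall mean removed, as assumed in Section 2.2, and an arbitrary λ; it is a sanity check, not part of the original experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, d = 12, 8, 5
X = np.vstack([rng.normal(-1.0, 1.0, (n1, d)),
               rng.normal(+1.0, 1.0, (n2, d))])
y = np.concatenate([np.ones(n1), -np.ones(n2)])
X -= X.mean(axis=0)                      # enforce zero overall mean

m1, m2 = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
Sw = ((X[y == 1] - m1).T @ (X[y == 1] - m1)
      + (X[y == -1] - m2).T @ (X[y == -1] - m2))

lam = 0.5
# Regularized least squares, eq. (10): normal equations (X^T X + lam I) w = X^T y,
# which is eq. (11), since X^T X = S_m and X^T y = l_1 m_1 - l_2 m_2 here.
w_rls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
# Discriminant learning analysis, eq. (12): (S_w + lam I) w = m_1 - m_2.
w_dla = np.linalg.solve(Sw + lam * np.eye(d), m1 - m2)

cos = w_rls @ w_dla / (np.linalg.norm(w_rls) * np.linalg.norm(w_dla))
print("cosine between the two solutions:", cos)   # ~1.0: same direction up to scale
```

With the overall mean at zero, the two directions coincide up to scale, which is the content of the Appendix A argument in the regularized setting.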
2.5 Nonlinear DLA

In the new framework of discriminant learning analysis, the nonlinear case is not an extension of the linear case by the kernel trick. In the above analysis we have adopted the linear hypothesis space H_L, whereby we have been looking for an optimum solution in that space. If we choose a general Reproducing Kernel Hilbert Space (RKHS) as our hypothesis space, we will obtain nonlinear function mappings as the subspace. The derivation of DLA in a general function space is simply carried out in terms of functional analysis, in a similar way as in the linear space. Here we skip the details of the derivations. Briefly speaking, for a two-class problem, the nonlinear DLA solution is found by following a simple algorithm [18]:

(K + λI) c = y,   (13)

where K is the Gram matrix of the training data, K = [k(x_i, x_j)], and k is the kernel function. y is the vector of class labels. After solving for c = [c_1, ..., c_ℓ], any given new data point x will be projected onto the subspace by s = Σ_{i=1}^{ℓ} c_i k(x_i, x).

2.6 Error Bound for DLA

By setting discriminant learning analysis in the framework of statistical learning theory, we can provide an error bound for the f_DLA obtained by (9) in the following theorem:

Theorem 3. Let f_ρ be defined by (8). With confidence 1 - δ, the error bound for the solution f_DLA of (9) is given by

∫_X (f_DLA - f_ρ)^2 dρ_X ≤ S(λ) + A(λ),   (14)

where A(λ) and S(λ) are given by λ^{1/2} ||L_K^{-1/4} f_ρ||^2 and 32M^2 (λ + C_K)^2 / λ^2 · v*(ℓ, δ), respectively. And there exists a unique λ that minimizes S(λ) + A(λ).

This decomposition is often called the bias-variance tradeoff. A(λ) is called the approximation error. Given a hypothesis space, it measures the error between the optimal function in it and the true target. L_K^{-1/4} in A(λ) is simply an operator. S(λ) bounds the sample error and is essentially derived from the law of large numbers. M and C_K are constants, and v*(ℓ, δ) is the unique solution of an equation whose coefficients contain the sample size ℓ and the confidence parameter δ. Due to its complex nature, we do not provide the details of the proof, which can be found in [5]. To the best of our knowledge, this is the first known error bound for LDA.

3 Multi-class DLA

We have presented discriminant learning analysis as an alternative approach to LDA in two-class problems. The data is projected onto only one dimension, which is adequate for the two-class problem. However, LDA is generally used to find a subspace with d dimensions for a multiple-class problem. In this section we extend DLA to the multi-class case.

For a two-class problem, the function f is a mapping from a vector x to a scalar y ∈ R. If we want to find a subspace with d > 1 dimensions, we actually consider a mapping from a vector x to a vector y ∈ R^d. This direct approach has some difficulties. First, statistical learning theory is concerned with only the first kind of function mapping f : R^n → R, and cannot be directly extended to f : R^n → R^d. Second, the subspace dimensionality d is not fixed (rather, it is often provided by the user). These difficulties make the attempt at directly casting DLA in the multiple-class case as a least squares estimation problem a very difficult task.

Notice that LDA in a multiple-class problem can be decomposed into two-class problems. In the i-th two-class problem, it treats the i-th class as one class and all remaining classes as the second class. Each binary class problem is solved first, and after finding all subspaces, PCA is applied to find the eigenvectors having the largest eigenvalues. These new eigenvectors are the solution of the original multi-class LDA problem (see Appendix B). Thus, DLA can be naturally extended to the multi-class case. Simply, in the decomposition step, we replace two-class LDA by two-class DLA. In the linear case, it turns out that the DLA algorithm in the multi-class case simply solves

S_b W = (S_w + λI) W Λ.   (15)

This is equivalent to solving (2).

4 Discussions

In this section, we will analyze the role of the regularization term λ. We are especially interested in comparing the DLA algorithm (15) to various LDA algorithms, such as PCA+LDA [2, 20], Scatter-LDA [15, 10], newLDA [3], and DLDA [23].
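For concreteness, a minimal sketch of the multi-class DLA solve in eq. (15) is given below, using scipy's generalized symmetric eigensolver; the synthetic data, the choice λ = 1, and the helper name dla_subspace are assumptions of this illustration rather than details fixed by the paper.

```python
import numpy as np
from scipy.linalg import eigh

def dla_subspace(X, y, lam, d):
    """Solve S_b W = (S_w + lam I) W Lambda (eq. (15)) and return the top-d directions."""
    m0 = X.mean(axis=0)
    p = X.shape[1]
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - m0, mc - m0)
    # Generalized symmetric-definite eigenproblem: S_b w = eigval (S_w + lam I) w.
    eigvals, eigvecs = eigh(Sb, Sw + lam * np.eye(p))
    order = np.argsort(eigvals)[::-1]          # largest eigenvalues first
    return eigvecs[:, order[:d]]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(mu, 1.0, (10, 50)) for mu in (-2.0, 0.0, 2.0)])
y = np.repeat([0, 1, 2], 10)                   # 30 samples in 50 dimensions: SSS case
W = dla_subspace(X, y, lam=1.0, d=2)           # project onto a 2-D DLA subspace
print((X @ W).shape)
```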
These techniques are mostly proposed for solving facial recognition problems, where the SSS problem always occurs. We summarize them briefly:

PCA+LDA: Apply PCA to remove the null space of S_w first, then maximize J(W) = |W^T S_b W| / |W^T S_w W|.

Scatter-LDA: Same as PCA+LDA but maximizing J(W) = |W^T S_b W| / |W^T S_m W| instead.

newLDA: If S_w is full rank, then solve regular LDA; else, in the null space of S_w, find the eigenvectors of S_b with the largest eigenvalues.

DLDA: Apply PCA to remove the null space of S_b first, then find the eigenvectors of S_w corresponding to the smallest eigenvalues.

Some researchers also maximize J(W) = |W^T S_m W| / |W^T S_w W|; this is equivalent to maximizing J(W) = |W^T S_b W| / |W^T S_w W|.
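As one possible reading of the PCA+LDA baseline summarized above (in the spirit of [2, 20]), the sketch below first reduces the data with PCA so that the within-class scatter becomes nonsingular in the reduced space, and then applies Fisher's criterion there. The number of retained components, the toy data, and the helper name pca_lda are assumptions of this illustration; actual implementations differ in such details.

```python
import numpy as np
from scipy.linalg import eigh

def pca_lda(X, y, d):
    """PCA to a subspace where S_w is nonsingular, then Fisher's criterion there."""
    n, p = X.shape
    classes = np.unique(y)
    # PCA step: keep at most n - (number of classes) components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[: n - len(classes)].T               # p x k projection
    Z = Xc @ P
    # LDA step in the reduced space.
    m0 = Z.mean(axis=0)
    k = Z.shape[1]
    Sw, Sb = np.zeros((k, k)), np.zeros((k, k))
    for c in classes:
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sw += (Zc - mc).T @ (Zc - mc)
        Sb += len(Zc) * np.outer(mc - m0, mc - m0)
    eigvals, eigvecs = eigh(Sb, Sw)
    order = np.argsort(eigvals)[::-1]
    return P @ eigvecs[:, order[:d]]           # map back to the original space

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(mu, 1.0, (10, 50)) for mu in (-2.0, 0.0, 2.0)])
y = np.repeat([0, 1, 2], 10)
W = pca_lda(X, y, d=2)
print((X @ W).shape)
```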

It should be noted that PCA+LDA and Scatter-LDA can be equivalent when S_w and S_m span the same subspace. However, they are different when S_b totally or partially spans the null space of S_w, so that S_w and S_m span different subspaces. For facial images the latter case turns out to be more common. In [3], Chen et al. prove that the null space of S_w contains discriminant information. They also show that Scatter-LDA is not optimal in that it fails to distinguish the most discriminant information in the null space of S_w. Thus they proposed the newLDA method. However, newLDA falls short of making use of any information outside of that null space. DLDA, on the other hand, discards the null space of S_b. This is very problematic. First, it fails to distinguish the most discriminant information in the null space of S_w, as in Scatter-LDA. Second, the null space of S_b is not completely useless. This can be seen in a two-class case: DLDA simply chooses Δm as the subspace. However, this can be biased because it ignores S_w. It should be noted that in [16] there is a misleading claim, where the authors mention that there is no significant difference between newLDA and DLDA. This incorrect claim is also pointed out by others [14]. In a word, all these techniques make hard decisions, either discarding a null space, or only working in a null space. In the following, we can see that DLA always works in the whole space, not making any hard decisions. Instead, it fuses information from both subspaces.

DLA can work in the whole space because it is well-posed. We first consider how LDA is solved. For the moment, assume that S_w^{-1} exists. For the symmetric matrix S_w there is a matrix U such that U^T U = I, S_w = U Σ U^T, and S_w^{-1} = U Σ^{-1} U^T, where Σ is a diagonal matrix. U^T is a transform matrix. Consider the following in a two-class LDA, and assume that S_w is full rank; thus

w = S_w^{-1} Δm = U Σ^{-1} U^T Δm.   (16)

The between vector Δm is transformed to a new basis by U^T, then multiplied by the constant Σ^{-1}_{ii} along each dimension, and finally transformed again by U back to the original basis. To illustrate this process, we consider a simple case where the data are in two dimensions. x_1 and x_2 are the original axis bases, as shown in Figure 1, and x'_1 and x'_2 are the orthogonal axis bases in which S_w is diagonalized. We choose S_w such that U^T S_w U = Σ = diag(2, 1/2). Thus Σ^{-1}_{11} = 1/2 and Σ^{-1}_{22} = 2. Let m' = U^T Δm. Then m'_1 and m'_2 are the projections of Δm on x'_1 and x'_2, respectively. Multiplying by Σ^{-1} maps m'_1 to (1/2) m'_1 and m'_2 to 2 m'_2. The LDA solution, S_w^{-1} Δm, is shown in Figure 1(a) and denoted by LDA.

Now consider DLA: (S_w + λI) w = Δm. Notice that if the matrix U can diagonalize S_w then it can diagonalize (S_w + λI) at the same time: U^T (S_w + λI) U = U^T S_w U + λ U^T U = Σ + λI. Thus, (S_w + λI)^{-1} = U (Σ + λI)^{-1} U^T. To illustrate the effect of the regularization term, we choose λ = 1; in Figure 1, the Λ_i are the eigenvalues of (S_w + λI)^{-1}. Now Λ_1 = 1/3 and Λ_2 = 2/3. The vector (S_w + λI)^{-1} Δm is shown in Figure 1(a) and denoted by DLA.

We now consider the case where the SSS problem occurs. We keep Σ_22 = 1/2 the same, but the original eigenvalue of S_w is now 0 along the first dimension, so that Λ_1 = 1. This case is plotted in Figure 1(b). DLA takes advantage of information from both worlds. However, PCA+LDA will throw away the dimension x'_1 and only keep the direction x'_2. Thus it ignores critical discriminant information.

[Figure 1. Illustration and comparison of the solutions obtained by LDA, DLA, DLDA and newLDA: (a) the solution of LDA and DLA; (b) the solution of DLA when S_w's eigenvalue is zero in one dimension.]
newlda, on the other hand, wi ony keep the dimension x and throw away any discriminant information aong x 2. DLDA simpy chooses m. Thus, when the SSS probem does not occur, DLA has a smoothing effect on the eigenvaues of S w. If the SSS probem does occur, DLA fuses and keeps a baance between the information both in and out of the nu space of S w. This baance is achieved by choosing proper λ. 5 Experiments 5. Feret Facia Images We tested DLA agorithm against PCA+LDA, newlda, DLDA and Scatter-LDA on FERET facia image data ( We extracted 400 images for the experiment, where there are 50 individuas with 8 images from each. The images have been preprocessed and normaized using standard methods from the FERET database. The size of each image is 50 x 30 pixes, with 256 grey eves per pixe.

We randomly choose five images per person for training, and the remaining three for testing. Thus the training set has 250 images while the testing set has 150. Subspaces are calculated from the training data, and the 1-NN classifier (3-NN failed for all the methods due to the five images per person in the training data) is used to obtain the accuracy rates after projecting the data onto the subspace. To obtain average performance, each method is repeated 10 times; thus there are in total 1,500 testing images for each method. The regularization term λ is chosen by 10-fold cross-validation.

The average accuracy rates are shown in Figure 2. The X-axis represents the dimensionality of the subspace. For each technique, the higher the dimension, the less discriminant the dimension. For most techniques, the accuracy rates increase quickly in the first 5 dimensions, and then increase slowly with additional dimensions. DLA is uniformly better than the other algorithms, demonstrating its efficacy, and achieves the highest accuracy rate. newLDA performs quite well in these experiments, again demonstrating that the most discriminant information is in the null space of S_w for the facial recognition task. On the other hand, Scatter-LDA does not perform well at lower dimensional subspaces, but it eventually performs better than PCA+LDA when the dimension exceeds 42. All methods achieve their highest accuracy rate with a 49-dimensional subspace, which is not surprising, for this is a 50-class problem. It is noticed that the performance of newLDA and Scatter-LDA (its tail is not shown in the plot) drops quickly with unnecessary dimensions.

[Figure 2. Comparison of DLA, PCA+LDA, newLDA, DLDA and Scatter-LDA on the FERET image data: accuracy rate versus subspace dimensionality.]

[Table 1. Classification error rates in subspaces computed by GDA and NDLA, using a 3-NN classifier, on the UCI data sets: breast cancer, cancer Wisconsin, credit, heart Cleve, heart Hungary, ionosphere, new thyroid, pima, glass, sonar, iris, and their average.]

5.2 UCI Data Sets

In these experiments, we compare nonlinear LDA (GDA) and nonlinear DLA (NDLA) on two-class classification problems. We did not test other nonlinear techniques such as KDDA [16] due to their complex implementation; the implementation of kernel DLDA is very involved. We use data sets from the UCI machine learning database. They are all two-class classification problems. For each data set, we randomly choose 60% as training and the remaining 40% as testing. We train GDA and NDLA on the training data and obtain projections. We then project both training and test data onto the chosen subspace and use the 3-NN classifier to obtain error rates. Note that for the two-class case, a one-dimensional subspace is sufficient. We use the Gaussian kernel k(x_1, x_2) = exp(-||x_1 - x_2||^2 / σ^2) to induce the Reproducing Kernel Hilbert Space for both GDA and NDLA. The kernel parameter σ and the regularization term λ were determined through 10-fold cross-validation. We repeat the experiments 10 times on each data set to obtain the average error rates.

The results are shown in Table 1. On 9 data sets out of 11, NDLA performs better than GDA. Especially on the breast cancer, credit, heart Cleve, heart Hungary and glass data, it is significantly better than GDA, with 95% confidence.

6 Summary

This paper presents a learning approach, Discriminant Learning Analysis, for effective dimension reduction. While discriminant analysis is formulated as a least squares approximation problem, DLA is a regularized one. Regardless of the data distribution, the approach allows us to always measure the goodness of the mapping function that DLA computes: an error bound for it is established. The commonly adopted technique S_w + εI is well justified in the framework of statistical learning theory, and the existence and uniqueness of the term ε is obtained. DLA is also compared to many other competing algorithms, and it is demonstrated, both theoretically and experimentally, that DLA is more robust than the others.

Appendix A

In this appendix, we say Aw = c_1 m and Aw = c_2 m are equivalent in the sense that the solutions w are in the same direction, where A is a matrix, the c_i are scalars, and m is a vector. First we show that S_m w = Δm and S_w w = Δm have the same solution (same set of eigenvectors), where Δm = m_1 - m_2 is the mean difference of the two-class problem. We know that solving S_w w = Δm is equivalent to solving [6]

S_w^{-1} S_b Φ = Φ Λ,   (17)

where Φ and Λ are the eigenvector and eigenvalue matrices of S_w^{-1} S_b. Since we have S_w = S_m - S_b, following [10] (pp. 454), (17) can be converted to

(S_m - S_b) Φ Λ = S_b Φ,
S_m Φ Λ = S_b Φ (I + Λ),
S_m^{-1} S_b Φ = Φ Λ (I + Λ)^{-1}.   (18)

This shows that Φ is also the eigenvector matrix of S_m^{-1} S_b, and its eigenvalue matrix is Λ(I + Λ)^{-1}. When the components α_i of Λ are arranged from the largest to the smallest as α_1 ≥ ... ≥ α_n, the corresponding components of Λ(I + Λ)^{-1} are also arranged as α_1/(1 + α_1) ≥ ... ≥ α_n/(1 + α_n). That is, the t eigenvectors of S_m^{-1} S_b corresponding to the t largest eigenvalues are the same as the t eigenvectors of S_w^{-1} S_b corresponding to the t largest eigenvalues. As a special case (two-class problems), S_m w = Δm and S_w w = Δm share the same eigenvector solution.

Now we show that S_m w = ℓ_1 m_1 - ℓ_2 m_2 and S_m w = m_1 - m_2 also share the same solution. Consider that the overall mean m_0 is 0. Since m_0 = (ℓ_1 m_1 + ℓ_2 m_2)/ℓ = 0, we have m_1 = (ℓ_2/ℓ) Δm, m_2 = -(ℓ_1/ℓ) Δm, and ℓ_1 m_1 - ℓ_2 m_2 = (2 ℓ_1 ℓ_2 / ℓ) Δm. Thus S_m w = ℓ_1 m_1 - ℓ_2 m_2 becomes S_m w = (2 ℓ_1 ℓ_2 / ℓ) Δm. For a constant c, the solution of S_m w = c Δm is still in the same direction along the mean difference Δm, and thus it is equivalent to solving S_w^{-1} S_b Φ = Φ Λ. It is easy to show that (S_m + λI) w = ℓ_1 m_1 - ℓ_2 m_2 is equivalent to (S_w + λI) w = m_1 - m_2, simply by replacing S_w and S_m with S_w + λI and S_m + λI in the above analysis.

Appendix B: Multi-Class LDA

In Appendix A, we showed that for a two-class problem, LDA and DLA correspond to solving (4) and (12), respectively. S_m can be written as S_m = U_m Λ_m^{1/2} Λ_m^{1/2} U_m^T, so (4) (in its equivalent form S_m w = Δm) can be written as Λ_m^{1/2} U_m^T w = Λ_m^{-1/2} U_m^T Δm = v. Note that v can be treated as a data point. For a multi-class problem, assuming S_m is full rank, LDA corresponds to the system S_b W = S_m W Λ. It can be written as

N_b V = V Λ,   (19)

where V = Λ_m^{1/2} U_m^T W and N_b = Λ_m^{-1/2} U_m^T S_b U_m Λ_m^{-1/2}. This is a simple eigenvalue problem. By solving for V, we can compute W as W = U_m Λ_m^{-1/2} V. The eigenvectors in V with the largest eigenvalues have a one-to-one correspondence to the eigenvectors in W with the corresponding largest eigenvalues in the original equation. Note that, with m_0 = 0, S_b = Σ_{j=1}^{J} ℓ_j (m_j - m_0)(m_j - m_0)^T = Σ_{j=1}^{J} (√ℓ_j m_j)(√ℓ_j m_j)^T, and N_b is

N_b = Λ_m^{-1/2} U_m^T S_b U_m Λ_m^{-1/2} = Σ_{j=1}^{J} (√ℓ_j Λ_m^{-1/2} U_m^T m_j)(√ℓ_j Λ_m^{-1/2} U_m^T m_j)^T.   (20)

If we treat every vector √ℓ_j Λ_m^{-1/2} U_m^T m_j as a point, then N_b is the covariance of these J points. Equation (19) can also be viewed as a PCA problem on these J points.

In the above analysis, if we replace S_m by S_m + λI, we obtain that DLA corresponds to solving S_b W = (S_m + λI) W Λ_w, and further

S_b W = (S_w + λI) W Λ_w.   (21)
The eigenvectors corresponding to the largest eigenvalues are chosen as the DLA subspace basis.

Appendix C: Overview of Statistical Learning Theory

The first step in the development of learning theory is the assumption of the existence of a probability measure ρ on the product space Z = X × Y, from which the data are drawn. One way to define the expected error of f is the least squares error: ε(f) = ∫_Z (f(x) - y)^2 dρ for f : X → Y. We try to minimize this error by learning. Define f_ρ(x) = ∫_Y y dρ(y|x), where ρ(y|x) is the conditional probability measure on Y w.r.t. x. It has been shown [4] that the expected error ε(f_ρ) of f_ρ represents a lower bound on the error ε(f) for any f. Hence, at least in theory, the function f_ρ is the ideal one and so is often called the target function. However, since the measure ρ is unknown, f_ρ is unknown as well.

Starting from the training data z = {(x_i, y_i)}_{i=1}^{ℓ}, statistical learning theory as developed by Vapnik [22] builds on the so-called empirical risk minimization (ERM) induction principle. Given a hypothesis space H, one attempts to minimize

Σ_{i=1}^{ℓ} (f_z(x_i) - y_i)^2, where f_z ∈ H.   (22)

However, in general, without controlling the norm of the approximating functions, this problem is ill-posed. Generally speaking, a problem is called well-posed if its solution exists, is unique, and is stable (depends continuously on the data). A problem is ill-posed if it is not well-posed. Regularization theory [21] is a device that was developed to turn an ill-posed problem into a well-posed one. The crucial step is to replace the functional in (22) by the following regularized functional:

Σ_{i=1}^{ℓ} (f(x_i) - y_i)^2 + λ ||f||^2_K, f ∈ H,   (23)

where λ is a regularization term and ||f||^2_K is the norm of f in L^2_ρ(X). Minimizing (23) is now well-posed and can be solved by elementary linear algebra. Let f_{λ,z} be the minimizer of (23); the question then is: how good an approximation is f_{λ,z} to f_ρ, or how small is ε(f_{λ,z}) = ∫_X (f_{λ,z}(x) - f_ρ(x))^2 dρ_X? Further, what is the best choice of λ to minimize this error? In [5], the answers to these questions are given. They state: for each m ∈ N and δ ∈ [0, 1), there is a function E_{m,δ} = E : R^+ → R such that, for all λ > 0, ∫_X (f_{λ,z}(x) - f_ρ(x))^2 dρ_X ≤ E(λ) with confidence 1 - δ. And there is a unique minimizer of E(λ), found by a simple algorithm, that yields the best regularization parameter λ.

References

[1] M. Bartlett. Further aspects of the theory of multiple regression. In Proceedings of the Cambridge Philosophical Society, number 34, 1938.
[2] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.
[3] L. Chen, H. M. Liao, M. T. Ko, J. Lin, and G. Yu. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33:1713-1726, 2000.
[4] F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1-49, 2001.
[5] F. Cucker and S. Smale. Best choices for regularization parameters in learning theory: On the bias-variance problem. Foundations of Computational Mathematics, 2(4):413-428, 2002.
[6] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, New York, 1st edition, 1973.
[7] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1):1-50, 2000.
[8] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:178-188, 1936.
[9] J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165-175, 1989.
[10] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.
[11] G. Fung and O. L. Mangasarian. Proximal support vector machine classifiers. In F. Provost and R. Srikant, editors, Proceedings KDD-2001: Knowledge Discovery and Data Mining, August 26-29, 2001, San Francisco, CA, pages 77-86, New York, 2001.
[12] A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(3):55-67, 1970.
[13] J. Hertz, A. Krogh, and R. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.
[14] J. Yang, A. F. Frangi, J.-Y. Yang, D. Zhang, and Z. Jin. KPCA plus LDA: A complete kernel Fisher discriminant framework for feature extraction and recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 27(2):230-244, 2005.
[15] K. Liu, Y. Cheng, and J. Yang. A generalized optimal set of discriminant vectors. Pattern Recognition, 25(7):731-739, 1992.
[16] J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos. Face recognition using kernel direct discriminant analysis algorithms. IEEE Transactions on Neural Networks, 14(1):117-126, 2003.
[17] S. Mika. Kernel Fisher Discriminants. PhD thesis, University of Technology, Berlin, October 2002.
[18] T. Poggio and S. Smale. The mathematics of learning: Dealing with data. Notices of the American Mathematical Society, 50(5):537-544, 2003.
[19] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific Publishing Co., Singapore, 2002.
[20] D. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(8):831-836, 1996.
[21] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. John Wiley and Sons, Washington D.C., 1977.
[22] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[23] H. Yu and J. Yang. A direct LDA algorithm for high-dimensional data with application to face recognition. Pattern Recognition, 34:2067-2070, 2001.
[24] P. Zhang and J. Peng. SVM vs regularized least squares classification. In Proceedings of the IEEE International Conference on Pattern Recognition, volume 1, pages 176-179, 2004.


More information

Problem set 6 The Perron Frobenius theorem.

Problem set 6 The Perron Frobenius theorem. Probem set 6 The Perron Frobenius theorem. Math 22a4 Oct 2 204, Due Oct.28 In a future probem set I want to discuss some criteria which aow us to concude that that the ground state of a sef-adjoint operator

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

XSAT of linear CNF formulas

XSAT of linear CNF formulas XSAT of inear CN formuas Bernd R. Schuh Dr. Bernd Schuh, D-50968 Kön, Germany; bernd.schuh@netcoogne.de eywords: compexity, XSAT, exact inear formua, -reguarity, -uniformity, NPcompeteness Abstract. Open

More information

School of Electrical Engineering, University of Bath, Claverton Down, Bath BA2 7AY

School of Electrical Engineering, University of Bath, Claverton Down, Bath BA2 7AY The ogic of Booean matrices C. R. Edwards Schoo of Eectrica Engineering, Universit of Bath, Caverton Down, Bath BA2 7AY A Booean matrix agebra is described which enabes man ogica functions to be manipuated

More information

Sparse Semi-supervised Learning Using Conjugate Functions

Sparse Semi-supervised Learning Using Conjugate Functions Journa of Machine Learning Research (200) 2423-2455 Submitted 2/09; Pubished 9/0 Sparse Semi-supervised Learning Using Conjugate Functions Shiiang Sun Department of Computer Science and Technoogy East

More information

Some Properties of Regularized Kernel Methods

Some Properties of Regularized Kernel Methods Journa of Machine Learning Research 5 (2004) 1363 1390 Submitted 12/03; Revised 7/04; Pubished 10/04 Some Properties of Reguarized Kerne Methods Ernesto De Vito Dipartimento di Matematica Università di

More information

BP neural network-based sports performance prediction model applied research

BP neural network-based sports performance prediction model applied research Avaiabe onine www.jocpr.com Journa of Chemica and Pharmaceutica Research, 204, 6(7:93-936 Research Artice ISSN : 0975-7384 CODEN(USA : JCPRC5 BP neura networ-based sports performance prediction mode appied

More information

Research Article Numerical Range of Two Operators in Semi-Inner Product Spaces

Research Article Numerical Range of Two Operators in Semi-Inner Product Spaces Abstract and Appied Anaysis Voume 01, Artice ID 846396, 13 pages doi:10.1155/01/846396 Research Artice Numerica Range of Two Operators in Semi-Inner Product Spaces N. K. Sahu, 1 C. Nahak, 1 and S. Nanda

More information

Tracking Control of Multiple Mobile Robots

Tracking Control of Multiple Mobile Robots Proceedings of the 2001 IEEE Internationa Conference on Robotics & Automation Seou, Korea May 21-26, 2001 Tracking Contro of Mutipe Mobie Robots A Case Study of Inter-Robot Coision-Free Probem Jurachart

More information

Combining reaction kinetics to the multi-phase Gibbs energy calculation

Combining reaction kinetics to the multi-phase Gibbs energy calculation 7 th European Symposium on Computer Aided Process Engineering ESCAPE7 V. Pesu and P.S. Agachi (Editors) 2007 Esevier B.V. A rights reserved. Combining reaction inetics to the muti-phase Gibbs energy cacuation

More information

Approach to Identifying Raindrop Vibration Signal Detected by Optical Fiber

Approach to Identifying Raindrop Vibration Signal Detected by Optical Fiber Sensors & Transducers, o. 6, Issue, December 3, pp. 85-9 Sensors & Transducers 3 by IFSA http://www.sensorsporta.com Approach to Identifying Raindrop ibration Signa Detected by Optica Fiber ongquan QU,

More information

Identification of macro and micro parameters in solidification model

Identification of macro and micro parameters in solidification model BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vo. 55, No. 1, 27 Identification of macro and micro parameters in soidification mode B. MOCHNACKI 1 and E. MAJCHRZAK 2,1 1 Czestochowa University

More information

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete

Uniprocessor Feasibility of Sporadic Tasks with Constrained Deadlines is Strongly conp-complete Uniprocessor Feasibiity of Sporadic Tasks with Constrained Deadines is Strongy conp-compete Pontus Ekberg and Wang Yi Uppsaa University, Sweden Emai: {pontus.ekberg yi}@it.uu.se Abstract Deciding the feasibiity

More information

arxiv: v1 [cs.db] 1 Aug 2012

arxiv: v1 [cs.db] 1 Aug 2012 Functiona Mechanism: Regression Anaysis under Differentia Privacy arxiv:208.029v [cs.db] Aug 202 Jun Zhang Zhenjie Zhang 2 Xiaokui Xiao Yin Yang 2 Marianne Winsett 2,3 ABSTRACT Schoo of Computer Engineering

More information

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES SARAH DAY, JEAN-PHILIPPE LESSARD, AND KONSTANTIN MISCHAIKOW Abstract. One of the most efficient methods for determining the equiibria of a continuous parameterized

More information

pp in Backpropagation Convergence Via Deterministic Nonmonotone Perturbed O. L. Mangasarian & M. V. Solodov Madison, WI Abstract

pp in Backpropagation Convergence Via Deterministic Nonmonotone Perturbed O. L. Mangasarian & M. V. Solodov Madison, WI Abstract pp 383-390 in Advances in Neura Information Processing Systems 6 J.D. Cowan, G. Tesauro and J. Aspector (eds), Morgan Kaufmann Pubishers, San Francisco, CA, 1994 Backpropagation Convergence Via Deterministic

More information

Nonlinear Analysis of Spatial Trusses

Nonlinear Analysis of Spatial Trusses Noninear Anaysis of Spatia Trusses João Barrigó October 14 Abstract The present work addresses the noninear behavior of space trusses A formuation for geometrica noninear anaysis is presented, which incudes

More information

Introduction. Figure 1 W8LC Line Array, box and horn element. Highlighted section modelled.

Introduction. Figure 1 W8LC Line Array, box and horn element. Highlighted section modelled. imuation of the acoustic fied produced by cavities using the Boundary Eement Rayeigh Integra Method () and its appication to a horn oudspeaer. tephen Kirup East Lancashire Institute, Due treet, Bacburn,

More information

Distributed average consensus: Beyond the realm of linearity

Distributed average consensus: Beyond the realm of linearity Distributed average consensus: Beyond the ream of inearity Usman A. Khan, Soummya Kar, and José M. F. Moura Department of Eectrica and Computer Engineering Carnegie Meon University 5 Forbes Ave, Pittsburgh,

More information

Testing for the Existence of Clusters

Testing for the Existence of Clusters Testing for the Existence of Custers Caudio Fuentes and George Casea University of Forida November 13, 2008 Abstract The detection and determination of custers has been of specia interest, among researchers

More information

Trainable fusion rules. I. Large sample size case

Trainable fusion rules. I. Large sample size case Neura Networks 19 (2006) 1506 1516 www.esevier.com/ocate/neunet Trainabe fusion rues. I. Large sampe size case Šarūnas Raudys Institute of Mathematics and Informatics, Akademijos 4, Vinius 08633, Lithuania

More information

On a geometrical approach in contact mechanics

On a geometrical approach in contact mechanics Institut für Mechanik On a geometrica approach in contact mechanics Aexander Konyukhov, Kar Schweizerhof Universität Karsruhe, Institut für Mechanik Institut für Mechanik Kaiserstr. 12, Geb. 20.30 76128

More information

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES

VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES SIAM J. NUMER. ANAL. Vo. 0, No. 0, pp. 000 000 c 200X Society for Industria and Appied Mathematics VALIDATED CONTINUATION FOR EQUILIBRIA OF PDES SARAH DAY, JEAN-PHILIPPE LESSARD, AND KONSTANTIN MISCHAIKOW

More information