
Face Recognition

Lauren Barker

24th April 2011

Abstract

This report presents an exploration of the various techniques involved in attempting to solve the problem of face recognition. Focus is placed on feature extraction methods, with the aim of determining the optimal way to extract discriminating features from a set of face images. Initially Principal Component Analysis is investigated, which forms the basis of the Eigenface method. This approach is then improved upon by considering the concepts of Linear Discriminant Analysis and kernel methods, to establish more robust and elegant solutions to the problem.

Declaration

This piece of work is a result of my own work except where it forms an assessment based on group project work. In the case of a group project, the work has been prepared in collaboration with other members of the group. Material from the work of others not involved in the project has been acknowledged and quotations and paraphrases suitably indicated.

Contents

1 Introduction
   Motivation; The Problem of Face Recognition; A Brief History of Face Recognition; Organisation of Report; Notation

2 Image Pre-Processing
   Motivation; Defining an Image; Colour Images; Greyscale Images; Converting Between RGB and Greyscale; Histogram Equalisation; Description of Face Databases; Assumptions on Face Images

3 Eigenfaces
   Motivation; Background to Method; Principal Component Analysis; Finding the Principal Components; Decomposing Variance; Data Compression; Limitations of PCA; Finding the Eigenfaces; How many Eigenfaces?; The Feature Space; Face Recognition; Near the Face Space; Near a Face Class; Outcomes; Summary of Eigenface based Face Recognition; Learning to Recognise New Faces; Locating and Detecting Faces; Eigenface Implementation; The Training Set; The Testing Set; Eigenface Experimental Results; Euclidean Distance vs Mahalanobis Distance; The Number of Eigenfaces Used; faces94 vs faces95; Disadvantage of the Eigenface Method

4 Linear Discriminant Analysis
   Motivation; Two Class Case; Fisher's Idea; Optimality; Assumptions; Multiclass Case; Connection to the Least Squares Approach; Connection to the Bayesian Approach; Comparison of PCA and Fisher's LDA; Classification; Nearest Neighbour Discriminant Rule; Connection to Maximum Likelihood Discriminant Rule; Connection to the Bayes Discriminant Rule; Limitations of Fisher's LDA; Fisherfaces; The Fisherface Method; Fisherface Method Implementation; The Training Set; The Testing Set; Fisherface Method Experimental Results; Optimal Number of Fisher's Linear Discriminants; faces94 vs faces95; Comparison to the Eigenface Method; Disadvantage of the Fisherface Method

5 Kernel Methods
   Motivation; Introduction to Kernel Methods; The Kernel Trick; Kernel Functions; Kernel Fisher Discriminant; Issues with Kernel Fisher Discriminant; Generalised Kernel Fisher Discriminant Analysis; The Kernel Fisherface Method; KPCA; KPCA and LDA; Two Step KFD Algorithm; The Kernel Fisherface Method; Training Set; Testing Set; Connection to the Fisherface Method; Analysis of the Kernel Fisherface Method; Conclusion

6 Conclusion
   Future Work; Acknowledgements

A Symmetric Matrix Property One
B Symmetric Matrix Property Two
C Rayleigh Quotient Theorem
D Eigenface Code with Euclidean Distance
E Eigenface Code with Mahalanobis Distance
F Fisherface Code: The Training Set
G Fisherface Code: The Testing Set

7 Chapter 1 Introduction 1.1 Motivation Most people are unable to determine one tree from another or to identify a particular cow in a herd. How is it then that humans are so easily able to distinguish a familiar face from a crowd? Taking just a fraction of a second, the ability to recognise faces so effortlessly is one of the most remarkable attributes of human vision. Developed over the early years of childhood, facial recognition is an activity we all perform routinely and subconsciously in our everyday lives. Our capability to read the expressions of others with whom we interact is an imperative part of our social skills and, alongside associated abilities, has played an essential role on the evolution of the human race. How are humans able to conduct this activity so effortlessly, yet the task of face recognition still remains one of the most challenging computer vision problems to date? Research in face recognition is motivated not just by the fundamental challenge this recognition problem poses, but also the multitude of practical applications where human face identification is required. Face recognition is one of the primary biometric technologies, alongside finger printing, iris scanning and voice recognition. Owing to the rapid advances in technologies such as mobile devices, digital cameras and the Internet, as well as the ever increasing demands on security, the use of biometrics is becoming increasingly important within every aspect of modern society. Mug-shot matching, crowd control, user verification and enhanced human computer interaction [1] would all become possible if an effective face recognition system could be implemented. 1.2 The Problem of Face Recognition The problem of face recognition can be simply stated as: given an image of a scene identify or verify the identity of the face of one or more individuals in the scene from a database of known individuals [1]. The problem can be broken down into three distinct stages: 1. Face detection and image pre-processing: Faces need to be detected within a given scene and the image cropped to give a smaller image containing just the detected face. Early approaches to face detection focused on single face segmentation using methods such as whole-face templates and skin colour, whilst later developments have led to automatic computer based face detection procedures. Once segmented the face image needs to undergo certain image processing methods to be roughly aligned and normalised. This aims to take into account factors such as lighting, position, scale and rotation of the face within the image plane [20]. 2. Feature Extraction: Key facial features need to be extracted to enable identification of the face. Various approaches exist including holistic methods, such as Principal Component Analysis [5] and Linear Discriminant Analysis [20] which use the whole face and are based on the pixels 5

8 intensity values, and feature-based methods that identify local features of the face, such as eyes and mouth [3]. 3. Face Identification: Based on the features found from feature extraction, this step attempts to classify the unknown face image as one of the individuals known by the machine. Various approaches for identification have been developed including many eigenspace based approaches [21]. Figure 1.1: Three stages of the face recognition problem. This report shall focus on the second and third stage of the problem, namely that of feature extraction and then subsequent face identification based on these features. We shall assume that a face image has already been detected within a scene from a still image. Furthermore, this image has then been segmented to give a smaller image just containing the detected face which then undergoes specified image preprocessing. We shall impose strict conditions on the images under investigation in that they are all of the same dimension. So, the purpose of this report is the answer the question: Is the image of a known individual? If so, whom? In mathematical terms, we are given an initial training set of M face images, {xi }M i=1, containing g classes {C1,..., Cg } where each class represents a different individual. We aim to determine if an unknown face image xk (an image not in the training set) is one of these g known individuals or not. This can consequently be seen as an example of a basic pattern recognition problem, as we wish to assign an input value (face image) to one of a given set of classes (individuals in the training set). 1.3 A Brief History of Face Recognition The early attempts at the problem of face recognition began in the 1960 s. These were based on an intuitive approach to the problem of locating the major features of the face and then comparing these to the same features on other faces. These methods relied upon handmade marks on photographs to locate major features such as the mouth, eyes, nose and ears. Distances were then calculated from these marks to a common reference point and then compared to those of known individuals. In 1971, A. Goldstein et al. [7] developed a system which used 21 feature markers. However this approach proved hard to fully automate due to the subjective nature of the feature markers used. Attempts at more automated approaches to the problem begun with Fischler and Elschlager in 1973 [8]. Their approach used templates of the features of different pieces of the face. However after continued research it was found that these features do not contain enough significantly unique data to be able to uniquely represent a face. Using a similar notion, a further attempt at the problem was developed in the so called Connectionist approach. This seeks to classify a face using a combination of both range of gestures and a set of identifying markers, based of 2 dimensional pattern recognition and neural net principles. However, to achieve an acceptable degree of accuracy this approach requires 6

9 an extremely large training set, hence is computationally expensive to implement. The first fully automated recognition approaches were developed in the 1980 s, these were mainly statistical and based on the general concept of pattern recognition. These approaches relied upon greyscale intensity values and compared face images to a generic face model of expected features. Sirovich and Kirby [6] developed the concept of Eigenfaces using Principal Component Analysis to economically represent face images in 1987, at Brown University. Then Turk and Pentland [5] applied this concept to face recognition developing the Eigenface method. Since then, various different approaches have been developed to tackle the problem of face recognition. These include: Neural Networks; Fisher Linear Discriminant Analysis; Hidden Markov models; Probabilistic matching; Elastic Bunch graphs and Gabor Wavelets amongst the many others, see [1]. Current machine recognition systems have become ever more sophisticated, though they are still far away from the capability of the human face perception system. Subsequently the problem remains an extremely relevant research area with vast array of potential future developments in the field. 1.4 Organisation of Report This report is structured as follows: Chapter One introduces the problem of face recognition and its many applications in the modern day world. Chapter Two defines what is meant by an image and the details the image pre-processing techniques that aim to normalise and standardise images. The chapter also describes the key attributes of the face databases and the strict conditions placed on all face images that shall be used for the experimental requirements in implementing the face recognition methods developed in later chapters. Chapter Three introduces the method of Principal Component Analysis and its application with regards to face recognition, namely the method of Eigenfaces. The Eigenface method is then implemented with the results and subsequent limitations of the procedure discussed. Chapter Four introduces the general concept of Linear Discriminant Analysis, in particular Fisher s Linear Discriminant Analysis. These concepts are then used to formulate an improved approach to the problem of face recognition, namely the Fisherface method. The method is then implemented and the results subsequently discussed. Chapter Five expands upon the previous chapter and introduces the notion of non-linear discriminant analysis in the form of Kernel methods. Kernel Fisher s Discriminant analysis is outlined and an approach to the problem of face recognition developed in the form of a two step Kernel Fisherface method. Chapter Six then concludes this report and gives a discussion of areas of potential interest for future work. 7

1.5 Notation

We shall begin by defining certain notation that is used throughout the remainder of this report:

Symbol | Explanation
N | Dimension of an observation (face image) represented as a vector (N = n_1 × n_2).
m | Dimension of a projected observation in the feature space.
M | Number of observations in the training set.
M_i | Number of observations in the training set of the ith individual, i = 1, ..., g.
x_i | Observation in the N-dimensional image space.
x_k^i | The kth observation of the ith individual, k = 1, ..., M_i.
y_i | Projected observation in the feature space.
g | Number of different classes (individuals) in the initial training set.
C_k | Class k, k = 1, ..., g.
z_i | The class that observation x_i belongs to, z_i ∈ {C_1, ..., C_g}.
µ | Average observation over all i = 1, ..., M.
µ_i | Average observation for the ith class C_i, i = 1, ..., g.
ȳ_j | Average feature vector of the jth class.

Chapter 2 Image Pre-Processing

2.1 Motivation

The performance of face recognition methods is highly dependent on various factors that affect an image. These include: variations due to lighting; image quality; pose and facial expression. Most face recognition techniques incorporate a method to compare images with one another. Hence, to ensure the effectiveness of such techniques, it is preferable to remove all systematic differences between images. Thus we wish to exclude, as far as possible, the differences due to the aforementioned factors. In implementing image preprocessing, we aim to identify and remove these variations. However, before discussing these preprocessing methods we shall begin by defining exactly what is meant by an image.

2.2 Defining an Image

Each image shall be taken to be rectangular of dimension n_1 × n_2, where n_1 is the number of pixels horizontally and n_2 is the number of pixels vertically. Each image can be represented as an n_1 n_2 = N dimensional feature vector, treating each pixel as a single feature [17]; observe figure 2.1. So the kth image can be expressed as:

x_k = (x_{k,1}, x_{k,2}, ..., x_{k,N})^T    (2.1)

where the rows of pixels in the image are placed one after the other, concatenating the array, to form a vector of length N. The first n_1 elements (x_{k,1}, x_{k,2}, ..., x_{k,n_1}) will be the first row of the image, the next n_1 the second row and so on. The value x_{k,i} represents the intensity value of the ith pixel of the kth individual. The way in which this intensity value is defined depends on whether the image is in colour or in greyscale.

Figure 2.1: Concatenating the rows of an image array.
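As a concrete illustration of equation (2.1), the following minimal MATLAB sketch reads a greyscale image and concatenates its rows into a single column vector; the file name is a placeholder and not part of the databases described later.

% Flatten an n2-by-n1 greyscale image into an N = n1*n2 column vector, as in (2.1).
I = imread('face_image.jpg');        % placeholder file name; assumed to be greyscale
[n2, n1] = size(I);                  % n2 rows (vertical pixels), n1 columns (horizontal pixels)
x = reshape(double(I)', n1*n2, 1);   % transpose first so the rows are concatenated in order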

2.2.1 Colour Images

Colour images can be considered in terms of the RGB colour model, which describes a colour in terms of the primary colours red, green and blue. So each pixel intensity value is expressed as an RGB triplet (R, G, B). Using an 8-bit representation for each primary colour component, the triplet is given as 3 integer values each in the range of 0 to 255. See figure 2.2 for examples of various colours and their corresponding RGB triplets.

Colour | (R, G, B)
Black | (0, 0, 0)
White | (255, 255, 255)
Red | (255, 0, 0)
Green | (0, 255, 0)
Blue | (0, 0, 255)
Yellow | (255, 255, 0)
Cyan | (0, 255, 255)
Magenta | (255, 0, 255)

Figure 2.2: 8-bit RGB triplets for various colours.

2.2.2 Greyscale Images

In greyscale images each pixel can be considered in terms of a single intensity value. Still using an 8-bit representation, each pixel is given by a single integer value between 0 and 255, where black is given by 0 and white is given by 255. Note that greyscale images are distinct from black and white images, which can be represented in a 1-bit numerical form. This is because black and white images only contain two colours, black and white, whereas greyscale images contain 254 shades of grey in between these two extremes.

2.3 Converting Between RGB and Greyscale

This report shall focus on greyscale face images. Greyscale images offer the advantage that they contain two thirds less data than the same image in an RGB colour format. This is because in greyscale each pixel is given by a single integer intensity value as opposed to 3 integers per pixel for the colour version. Thus the conversion from colour to greyscale can be seen as a data compression, as it compresses a 24-bit, three-channel colour image to an 8-bit, single-channel greyscale image. Methods of face detection can rely upon colour for segmentation of face images [9]; however, this report shall focus on methods of face recognition which do not depend upon the colour of an image in their implementation. Hence, to simplify the problem and reduce the data storage requirements, we shall focus on greyscale images. Unless otherwise stated, we shall assume from here on that all images discussed throughout this report are in their greyscale format.

Figure 2.3: RGB to greyscale conversion of a face image from the faces94 database: (a) colour image, (b) greyscale image.

There are various methods used to convert an image from RGB to greyscale. One common strategy is based on matching the luminance of the greyscale image to the luminance of the colour image [10]. Using the luminosity function, which describes the sensitivity of the human eye to different wavelengths of light, we know that green light contributes the most to the intensity perceived whereas blue light contributes the least. We then use this to create a linear combination of the RGB components to describe the intensity value at a certain pixel:

Greyscale(x_{k,i}) = 0.2989 Red(x_{k,i}) + 0.5870 Green(x_{k,i}) + 0.1140 Blue(x_{k,i})

where x_{k,i} is the ith pixel of the image x_k. This linear combination is the weighted sum used in the rgb2gray function in MATLAB [11]. It shall be assumed that this is the linear combination used consistently throughout this report to convert colour images to greyscale images. Consider figure 2.3 for an example of a face image, from the faces94 database, converted from colour to greyscale.

2.4 Histogram Equalisation

Changes in illumination can dramatically affect the recognition performance of face recognition methods. Variations in lighting conditions can mean that some images might be of low contrast. Histogram equalisation is one method to adjust and correct the contrast of an image. This technique increases the global contrast of an image using the image's histogram. Histogram equalisation uses a non-linear monotonic mapping to reassign the greyscale values of pixels in the input image such that the output image contains a uniform distribution across all grey levels. This means that areas of lower local contrast gain a higher contrast [10].

Consider a greyscale image, described by the vector x_k, where each pixel in the image, x_{k,j} for j = 1, ..., N, is given by a discrete integer grey level in the range [0, 255], using an 8-bit representation. The histogram of such an image is defined to be the frequency density of each interval of greyscale values in the range of 0 to 255. Consider the probability that the pixel x_{k,j} is of grey level i. This then gives rise to the discrete Probability Distribution Function (PDF) of the image:

p(i) = \frac{1}{N} \sum_{j=1}^{N} P(x_{k,j} = i)    (2.2)

This gives the image's histogram for intensity grey level i, which has been normalised to [0, 1]. Then from the PDFs for each grey level we can define the Cumulative Distribution Function (CDF):

CDF(i) = \sum_{q=0}^{i} p(q)    (2.3)

This then gives the image's accumulated normalised histogram. Observe figure 2.4 to visualise the effect on an image when histogram equalisation is implemented. The aim of histogram equalisation is to transform the image's histogram to give a histogram containing a uniform distribution across all grey levels. Note that in figure 2.4 the output histogram does not look decisively uniform. However, the corresponding CDF is approximately linear, indicating that the image is indeed equalised. We want each level in [0, 255] to be as equally represented as possible in the transformed image. This means we look to apply a transformation to give a new image x̃_k such that the CDF of x̃_k is normalised across [0, 255]. Define CDF_min to be the CDF value at the minimum grey level of the original image x_k. Then the general histogram equalisation formulation, using the expression of the CDF, can be expressed as:

h(i) = round\left( \frac{CDF(i) - CDF_{min}}{N - CDF_{min}} (L - 1) \right)    (2.4)

where L is the number of grey levels used, so in this case L = 256.
Applying this transformation, the minimum grey level in the input image becomes 0 in the output image and the maximum grey level becomes 255. We shall use the histeq function in MATLAB [11] and from here on in assume that all images investigated have undergone histogram equalisation.
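The pre-processing steps of Sections 2.3 and 2.4 can be combined in a short MATLAB sketch using the rgb2gray and histeq functions cited above; the file name below is illustrative only.

% Convert a colour image to greyscale and equalise its histogram.
rgb     = imread('face_colour.jpg');   % placeholder file name for a 24-bit RGB image
grey    = rgb2gray(rgb);               % weighted sum 0.2989 R + 0.5870 G + 0.1140 B
grey_eq = histeq(grey);                % spread the grey levels more evenly across [0, 255]
% imshow(grey_eq), figure, imhist(grey_eq)   % optional visual check of the equalised image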

Figure 2.4: (a) The original image from the faces94 database, (c) its histogram quantised to 64 intervals and (e) its associated CDF. The image lacks contrast as the majority of the grey levels are bunched between 0 and 100. (b) The histogram-equalised image, (d) its histogram quantised to 64 intervals and (f) its CDF. The image's contrast has been adjusted and the grey levels are fairly evenly distributed between 0 and 255; the approximately linear CDF in (f) indicates an approximately uniform distribution.

2.5 Description of Face Databases

Two face databases shall be considered, with credit due to the University of Essex for production and distribution of the databases [12]. A summary of each database is shown in the table below:

Description | faces94 | faces95
Images per individual | 20 | 20
No. of individuals | 153 (male and female) | 72 (male and female)
Image resolution | 180 × 200 pixels | 180 × 200 pixels
Background | Plain green | Red curtain, lighting variation from shadowing as the individual moves
Head scale | None | Large head scale variations
Head turn, tilt, slant | Minor variations | Minor variations
Position of face | Minor changes | Some translation
Lighting variation | None | Major changes as the individual moves, due to artificial lighting arrangements
Expression variation | Some expression variation | Some expression variation

Note that the key difference between the databases lies in the variation between images due to head scale and lighting, with the faces95 database being more complex than the faces94 database. Observe figure 2.5 for examples of images from each database. Thus implementation of face recognition methods on these two sets will enable a comparison of how well each technique considered is able to cope with these sources of variation.

Figure 2.5: Images from the faces94 and faces95 databases [12]: (a) faces94 image 1, (b) faces94 image 2, (c) faces95 image 1, (d) faces95 image 2.

2.6 Assumptions on Face Images

Throughout this report we shall impose strict assumptions on the face images considered. We shall assume:

- Images contain front facing individuals, i.e. there is minimal pose variation due to rotation of the face within the image.
- Images have a constant background.
- All images have been segmented to be of the same dimension.
- All images contain a roughly centered face.

This simplifies the problem but places slightly unrealistic conditions on the images considered. It is unlikely that images taken in everyday environments will abide by these strict regulations. However, approaches based on images from these constrained environments would be suitable for, say, a household or office situation.

We will initially develop approaches to the face recognition problem based on face databases complying with these strict limitations. We then hope to adapt and extend these methods to achieve successful face recognition systems which enable these restrictions to be relaxed.

Chapter 3 Eigenfaces

3.1 Motivation

Faces tend to be similar in appearance and contain significant statistical similarities. This observation suggests that it is possible to remove these statistical redundancies and accurately represent a face image in a lower dimension. For example, consider an image of dimension 180 by 200 pixels, which can be represented as a vector of length 36000, or equivalently a point in 36000-dimensional space. A set of images will then map to a collection of points in this high dimensional image space. However, the aforementioned statistical similarities of faces mean that the set of images will not be randomly distributed in this image space and thus can be described by a relatively low dimensional subspace. This fact is used as the basis of all dimensionality reducing approaches to the problem of face recognition. All such methods seek to find an optimal projection of face images from the high dimensional image space to that of a significantly lower dimensional feature space. In particular, the Eigenface method uses a linear projection in order to project images into such a low dimensional feature space. The algorithm was motivated by the work produced by Sirovich and Kirby (1987) [6], who developed a technique for economically representing pictures of faces using Principal Component Analysis. They showed that any face can be approximately reconstructed using just a small number of principal components, called Eigenpictures, and the corresponding projections (coefficients) along each Eigenpicture. This was further extended by Turk and Pentland (1991) [5], who used the principal components as their feature vectors and the Euclidean distance to develop the face recognition method known as Eigenfaces. We will first give a brief overview of the ideas behind the method, explaining the concept of Principal Component Analysis, and then shall use this to formulate the Eigenface method.

3.2 Background to Method

The Eigenface method aims to reduce the dimensionality of the original image space by using Principal Component Analysis (PCA) to select a new set of uncorrelated variables. The aim is to choose these new variables in such a way that retains as much of the variation as possible from the original set of variables which define the original image space. This objective is then equivalent to finding the principal components of the image space. These principal components, also called Eigenfaces, can be thought of as a set of feature vectors that represent the characteristic features within the face set, and together this set of Eigenfaces characterises the variation between face images. The core concept of the method is based on the observation that each face image can be exactly represented as a linear combination of the Eigenfaces. Furthermore, each face image can also be approximated using only the best Eigenfaces; observe figure 3.1. These best Eigenfaces are defined to be those accounting for the most variation within the set of face images. The Eigenfaces can then be thought of as a set of key features describing a set of face images. We then select the m best Eigenfaces, where m << N, the dimension of each image, to create the m-dimensional feature space which best represents the set of images. Every individual's face can then be characterised by the weighted sum of the m Eigenfaces needed to approximately construct it in the feature space, these weights being stored in a feature vector of length m.

Figure 3.1: (a) The original image from a dummy training set of 20 images from the faces94 database; (b) five eigenfaces. The image in (a) is approximately reconstructed by a weighted linear combination of the five eigenfaces Face 1 to Face 5.

This means that any collection of face images can be classified by storing a feature vector for each face image and a small set of m Eigenfaces. This subsequently greatly reduces the data storage requirements for storing a database of face images. Furthermore, this suggests that we are able to compare and recognise face images by just comparing these feature vectors. This is the basis of the Eigenface method. Using these low dimensional feature vectors we can determine two key facts about a face image in question:

- Is the image a face at all? If the feature vector of the image in question differs too much from the feature vectors of known face images (i.e. images which we know are of a face), it is likely the image is not a face.
- Does the image contain a known individual or not? Similar faces possess similar features (Eigenfaces) to similar degrees (weights). If we extract the feature vectors from all the images available then the images could be grouped into clusters, with each cluster representing a certain individual. That is, all images having similar feature vectors are likely to be similar faces. We can then determine if the image in question belongs to one of these individuals by its proximity to the clusters.

The Eigenface method can subsequently be seen in two steps:

1. A training set of images is used to find the Eigenfaces and hence train a computer to recognise the individuals in these images.
2. A testing set of images, i.e. a set containing images not in the training set, is considered to determine if the identity of the individuals in the images is known or not.

We will begin by considering the training set and how to find the Eigenfaces using the concept of Principal Component Analysis.

3.3 Principal Component Analysis

As stated, Principal Component Analysis (PCA) aims to reduce the dimensionality of a data set consisting of a large number of potentially correlated variables, whilst retaining as much as possible of the total variation. This is achieved by an orthogonal transformation to a new set of uncorrelated variables, called the Principal Components (PCs) [15]. The first principal component is specified such that it accounts for as much of the variability in the original data variables as possible.

Then each succeeding component, in turn, is constructed to have the highest variance possible, under the constraint that it be orthogonal to the preceding components. PCA can thus be simply thought of as a coordinate rotation, aligning the transformed axes with the directions of maximal variance. This orthogonal transformation is shown in figure 3.2, which gives a plot of 50 observations on two highly correlated variables, x_1 and x_2, and a plot of the data transformed to the two PCs, z_1 and z_2. It is apparent that there is more variation in the direction of z_1 than in either of the original variables x_1 or x_2. Also note there is very little variation in the second PC z_2, which is orthogonal to z_1.

Figure 3.2: Orthogonal transformation to the principal components [15]: (a) plot of 50 observations on two variables x_1 and x_2; (b) plot of the 50 observations with respect to the two principal components z_1 and z_2.

We can generalise this trivial example to a data set with more than 2 variables. If the variables have substantial correlations among them then it is hoped that the first few PCs will account for most of the variation in the original variables. This then suggests a dimension reduction scheme by transforming the original variables via a linear projection onto these first few PCs. We shall begin by considering how to find these PCs.

3.3.1 Finding the Principal Components

One way to examine the aim of PCA is to identify the linear directions in which the original data set is best represented in a least squares sense. We shall begin by considering the first principal component, which is specified to be in the direction of maximal variability of the original variables. Thus, in a least squares sense, the first PC is defined to be in the direction such that it minimises the least squared error between the data points and their projections onto the line in this direction. Assume we have a data set consisting of M vectors each of length N. The mean vector of this data set is represented by µ and the covariance matrix is represented by Σ. The covariance matrix is defined to be the matrix whose (i, j)th element, Σ_{i,j}, is the covariance between the ith and jth elements of x when i ≠ j and the variance of the ith element when i = j. Then consider: if we were to approximate x with a single straight line, which one would be the best line? This is equivalent to selecting the direction in which there is maximal variability of the original data set. One approach to solve this problem is to look at it geometrically, following Pearson (1901) [13]: minimise the expected squared distance between the points x and their projections x̂ onto the line, E(‖x − x̂‖²). Thus, by Pythagoras, we aim to:

\underbrace{E(\|x - \hat{x}\|^2)}_{\text{Min}} = E(\|x - \mu\|^2) - \underbrace{E(\|\hat{x} - \mu\|^2)}_{\text{Max}}    (3.1)

where E(‖x − x̂‖²) denotes the expected squared length of the line segment connecting x and its projection x̂, i.e. ‖x − x̂‖² = (x − x̂)·(x − x̂). We take this expectation over all data points in the data set. Observe figure 3.3 for an illustration of this objective.

Figure 3.3: Aim to minimise the distance between all points x and their projections x̂. Image adapted from [13].

As the distance E(‖x − µ‖²) does not depend on the fitted line, minimising E(‖x − x̂‖²) is equivalent to maximising E(‖x̂ − µ‖²). We shall define γ to be the vector in the direction of the best line. Then, investigating this quantity further:

E(\|\hat{x} - \mu\|^2) = E\big((\gamma^T(x - \mu))^2\big) = E\big(\gamma^T(x - \mu)(x - \mu)^T\gamma\big) = Var(\gamma^T(x - \mu)) = \gamma^T Var(x - \mu)\,\gamma = \gamma^T \Sigma \gamma    (3.2)

Hence we need to maximise γ^T Σ γ. However, it is clear that this maximisation will not be achieved for finite γ. We then need to impose a normalisation constraint. We shall specify that γ^T γ = 1, that is, γ is of unit length. Note that other constraints can be used, but for simplicity in this derivation we shall use the unit vector constraint. This then gives a constrained maximisation problem which can be addressed through the use of a Lagrange multiplier. Define:

P(\gamma) = \gamma^T \Sigma \gamma - \lambda(\gamma^T \gamma - 1)    (3.3)

Then the stationary points of (3.3) are the maximum values of γ^T Σ γ. Then:

\frac{dP(\gamma)}{d\gamma} = 2\Sigma\gamma - 2\lambda\gamma    (3.4)

Setting this to zero yields:

\Sigma\gamma = \lambda\gamma    (3.5)

This is then an eigenvalue problem, thus the stationary values of P(γ) are given by the eigenvectors of the covariance matrix Σ. But which eigenvector is in the direction of maximal variability? We need to select the eigenvector that maximises P(γ). Note that Σ ∈ R^{N×N} is a symmetric matrix, so it has N eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_N with associated orthonormal eigenvectors {γ_1, ..., γ_N}. Observe that multiplying (3.5) by γ^T gives:

\gamma^T \Sigma \gamma = \lambda \gamma^T \gamma \implies Var(\gamma^T x) = \lambda    (3.6)

using the fact that γ^T γ = 1, as γ is an orthonormal eigenvector. The left hand side is now exactly what we want to maximise. Thus maximising Var(γ^T x) means that we need to choose the eigenvector γ_1 corresponding to the largest eigenvalue, namely λ_1. So γ_1^T x is defined to be the first PC of x = (x_1, ..., x_N)^T. Thus the projection of x into the one-dimensional space given by the first PC is in the direction of maximal variance. Observe figure 3.4, which plots the two PCs of the example data set. The second PC is specified such that it is in the direction of maximal variability subject to being orthogonal to the first PC. This is then given by the eigenvector with the second largest eigenvalue.

Figure 3.4: Two dimensional example with the two principal components plotted.

In general, the kth PC is given by γ_k^T x with variance Var(γ_k^T x) = λ_k, where λ_k is the kth largest eigenvalue of Σ and γ_k is the corresponding eigenvector. As the γ_i are the orthonormal eigenvectors of Σ, the PCs give a set of uncorrelated variables, as desired.

3.3.2 Decomposing Variance

The covariance matrix Σ is symmetric by its construction, hence it has a set of N orthogonal eigenvectors. From (3.5), and considering the jth eigenvector γ_j of Σ, we have:

\Sigma \gamma_j = \lambda_j \gamma_j, \quad j = 1, ..., N    (3.7)

For each j, either side of this equation is a vector of length N. Then, considering all N eigenvectors, we can bind these N vectors together to form an N × N matrix:

(\Sigma\gamma_1, ..., \Sigma\gamma_N) = (\lambda_1\gamma_1, ..., \lambda_N\gamma_N)    (3.8)

Expressed in matrix form:

\Sigma\,(\gamma_1, ..., \gamma_N) = (\gamma_1, ..., \gamma_N) \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_N \end{pmatrix}    (3.9)

Then further defining:

\Gamma = (\gamma_1, ..., \gamma_N), \qquad \Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_N \end{pmatrix}    (3.10)

where Γ is the orthogonal matrix containing the eigenvectors and Λ is a symmetric, diagonal matrix with the eigenvalues along the diagonal and zeros in all other positions. Hence we can express (3.8) as:

\Sigma\Gamma = \Gamma\Lambda \implies \Sigma = \Gamma\Lambda\Gamma^{-1} = \Gamma\Lambda\Gamma^T    (3.11)

This is the eigendecomposition of Σ, as given by the Theorem in Appendix B, which states that any symmetric matrix can be diagonalised by its orthonormal eigenvectors. Thus each λ_j provides some decomposition of the variance, as considering just the jth element gives:

\lambda_j = Var(\gamma_j^T x), \quad j = 1, ..., N    (3.12)
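The eigendecomposition above can be reproduced numerically. The following minimal MATLAB sketch, on synthetic two-dimensional data (all values illustrative), computes the covariance matrix, extracts its ordered eigenvectors and eigenvalues, and confirms that the variance along the first principal component equals the largest eigenvalue, as in (3.6).

% Principal components as eigenvectors of the covariance matrix (toy example).
X = randn(50, 2) * [2 1.5; 0 0.5];            % 50 correlated observations, one per row
Sigma = cov(X);                               % 2-by-2 sample covariance matrix
[Gamma, Lambda] = eig(Sigma);                 % Sigma = Gamma*Lambda*Gamma', as in (3.11)
[lambda, order] = sort(diag(Lambda), 'descend');
Gamma = Gamma(:, order);                      % eigenvectors ordered by decreasing eigenvalue
gamma1 = Gamma(:, 1);                         % direction of the first principal component
var_along_pc1 = var(X * gamma1);              % equals lambda(1), the largest eigenvalue (3.6)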

So, computing the sum over all j:

\lambda_1 + \dots + \lambda_N = Tr(\Lambda) = Tr(\Gamma\Lambda\Gamma^T) = Tr(\Sigma) = TV(x)    (3.13)

The trace of the covariance matrix is then equal to the total variance of the data set, TV(x). Therefore the proportion of the total variance explained by the jth principal component is given by:

\frac{\lambda_j}{\lambda_1 + \dots + \lambda_N} = \frac{Var(\gamma_j^T x)}{TV(x)}    (3.14)

3.3.3 Data Compression

We have an initial data set of M images, each given by a vector of length N: {x_1, ..., x_M}. The aim of the Eigenface method is to compress the data set to a new set of variables of a smaller dimension, say m << N. We aim to do this in such a way that retains as much as possible of the variability present in the original N variables. How should we select m? To select the number m it can be useful to look at a scree plot. This plots the variance explained by the jth principal component, which is equal to its eigenvalue as shown in (3.12), versus j. A widely used rule of thumb is to take m as the principal component with the largest eigenvalue before the scree plot levels off, i.e. the largest component which represents a significant amount of variability [13]. Then all the components beyond m will be dismissed.

Figure 3.5: Toy example of a scree plot based on 3 measurements for 30 insects. The first principal component accounts for 93.1% of the variation within the data set and the first two account for 99.5%.

Observe figure 3.5 of a scree plot for a toy example with M=50 observations each with N=3 variables. Clearly it is the case that the first principal component accounts for the majority of the variation in the data set. We can then compress the data by projecting all data points {x_1, ..., x_M} onto the m-dimensional subspace spanned by the m largest principal components, to give the feature vectors {y_1, ..., y_M}:

f : x_i \in \mathbb{R}^N \to y_i \in \mathbb{R}^m, \qquad x_i \mapsto (\gamma_1, ..., \gamma_m)^T (x_i - \mu)    (3.15)

Consider figure 3.6, which shows the projection of the two dimensional data set given in figure 3.4 onto the first principal component. Observe how the majority of the variation present in the data set is retained in this one dimensional representation.

Figure 3.6: Projection onto the first principal component of the two dimensional sample data set given in figure 3.4.
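Continuing the toy sketch above, the compression step (3.14)-(3.15) amounts to keeping enough components to cover a chosen fraction of the total variance and projecting the mean-centred data onto them; the 95% target below is an arbitrary illustrative choice, not a recommendation from the text.

% Choose m from the eigenvalue proportions and project the data, as in (3.15).
mu = mean(X, 1);                                   % 1-by-N mean vector of the toy data
explained = lambda / sum(lambda);                  % proportion of variance per PC (3.14)
m = find(cumsum(explained) >= 0.95, 1, 'first');   % smallest m retaining 95% of the variance
W = Gamma(:, 1:m);                                 % N-by-m projection matrix
Y = (X - repmat(mu, size(X, 1), 1)) * W;           % M-by-m matrix of feature vectors (3.15)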

23 3.3.4 Limitations of PCA PCA theoretically produces the optimal linear projection for compressing a high dimensional data set to that of a lower dimension whilst retaining maximal variability. We have seen this optimality in a least means square error sense. However the effectiveness of PCA is limited by the assumptions behind the method. We shall briefly summarise these limitations and hint as to when such assumptions might perform poorly. 1. Assumption the large variances are important: PCA can be viewed as a coordinate rotation to a new axis in the directions of maximal variability. We thus assume that directions of maximal variance are of importance with directions of low variation being made redundant. However this may not always be the case. In the context of face recognition, as stated by Adini. Y et al. (1997)[27], the variations between the images of the same face due to illumination and viewing direction are almost always larger than image variations due to change in face identity. Thus the projections using PCA may not be useful in determining the identity of an unknown face image. We shall address this issue using Linear Discriminant analysis, (Chapter 4), in which we separate up the sources of variation into those that are useful for discriminating between face images and those that are not. We then aim to find the directions that are optimal for classifying individuals. 2. Non parametric: PCA is a non parametric technique as it does not rely upon any assumptions of the probability distribution of the data set. This non parametric nature can be seen as an advantage but conversely is also an inherent weakness of the method. This is because the scheme does not allow for any prior knowledge of the data set to be utilised and hence data compression via PCA can often incur a loss of information. PCA thus does not include the label information of the data set, so doesn t account the class separability. This consequently means that the there is no guarantee the directions of maximum variance found by PCA will contain good features for discrimination. We shall address this issue using Linear Discriminant analysis, (Chapter 4), in which we take into account the class separability and make assumptions on the structure of each class in order to find the optimal feature vectors for discriminatory purposes. 3. Linearity: It is assumed that the observations in the data set are linear combinations of a certain basis. With regards to the context of face images this means we assume that each face image can be uniquely described by a linear combination of the pixels. Is this a valid assumption? Is a face image not also defined by the relationship between three of more pixels, such as that describing a curve or an edge? We shall seek to determine if this linearity assumption is valid by developing non-linear approaches to the problem of face recognition, such as those based on kernel methods, (Chapter 5). 3.4 Finding the Eigenfaces We will now apply the theory behind PCA to the problem of face recognition with the aim to find the set of principal components of the training face set. We are given a set of M training images {x 1,..., x M }. Each n 1 n 2 dimensional image can be expressed as a vector of length n 1 n 2 = N, where each pixel is treated as a single feature [17]. So the kth image can be expressed as: x k = (x k,1, x k,2,..., x k,n ) T (3.16) In the training set we assume there is at least one image of each individual. 
So each image belongs to one of the classes {C 1,..., C g }, where each class represents a different individual. 21

We shall define the mean image of the training set to be:

\mu = \frac{1}{M} \sum_{i=1}^{M} x_i    (3.17)

This mean gives the average greyscale intensity value across all M images at all N pixel locations. Observe figure 3.7 for the average face image calculated from the faces94 database.

Figure 3.7: Average face image using a subset of 500 images from the faces94 database.

We are interested in the variation within the training set, thus we wish to observe how much each image varies from this mean image. Hence we need to subtract this mean image µ ∈ R^N from every image in the training set to give a set of mean centered data points {x_i − µ}_{i=1}^{M}. Then we consider using a linear transformation, mapping the original N-dimensional mean centered image, x_k − µ, to a new m-dimensional feature vector, where m << N. This gives a new feature vector y_k ∈ R^m for each image:

y_k = W^T (x_k - \mu), \quad k = 1, ..., M    (3.18)

We consider such a projection using W ∈ R^{N×m}, which is the optimal projection matrix according to PCA. Considering the objectives of PCA, the optimal projection matrix W_opt is chosen so that the m new uncorrelated variables capture as much of the variability as possible of the N original variables. That is:

W_{opt} = [w_1, ..., w_m] \in \mathbb{R}^{N \times m}    (3.19)

Hence {w_1, ..., w_m} is the set of m eigenvectors of the data set's covariance matrix Σ corresponding to the largest eigenvalues, λ_1 ≥ λ_2 ≥ ... ≥ λ_m. These eigenvectors are of unit length and are of the same dimension as the original face images. They are referred to as Eigenfaces and collectively they give the face space. See figure 3.8 for a collection of 30 Eigenfaces calculated using the faces94 database.

However, we need to consider the computational requirements in calculating these principal components. The covariance matrix of the training set Σ is a matrix of size N × N. In the context of face images, where N represents the number of pixels in each image, this N × N matrix is potentially very large and unmanageable. For example, a training set consisting of images of size 180 × 200 pixels would create a covariance matrix of size 36000 × 36000. It is then not always practical to solve for the eigenvectors of Σ directly. However, there is a trick we can use to minimise the computational requirements. Define v_i = x_i − µ to be the mean centered image for all images i = 1, ..., M, which form the columns of the matrix V. Then we can write the covariance matrix in the form:

\Sigma = V V^T    (3.20)

To reduce the computational requirements we can instead consider solving for the eigenvalues and eigenvectors of the M × M dimensional matrix V^T V:

V^T V d_i = \alpha_i d_i    (3.21)

Figure 3.8: 30 Eigenfaces calculated from the faces94 database.

Note that by multiplying both sides by V we get:

V V^T (V d_i) = \alpha_i (V d_i) \implies \Sigma (V d_i) = \alpha_i (V d_i)    (3.22)

Hence the eigenvectors w_i and eigenvalues λ_i of Σ can be obtained by solving for the orthogonal eigenvectors and eigenvalues of the M × M symmetric matrix V^T V, which is computationally less expensive. Note that the rank of the matrix V is limited by the number of images in the training set, M. This is because it is only possible to sum up a finite number of image vectors, as there are only M images in the training set. Hence the rank of the matrix V^T V cannot exceed M − 1, with the −1 coming from the subtraction of the mean vector µ. This means there are at most M − 1 eigenvectors with non-zero eigenvalues. As an eigenvector with an eigenvalue of zero represents a direction in which there is zero variability, due to the expression in (3.12), we need only consider the M − 1 non-zero eigenvalues and corresponding eigenvectors that represent the directions of variability. Let d_i and α_i for i = 1, ..., M − 1 be the eigenvectors and corresponding non-zero eigenvalues of V^T V. Thus, from (3.22), Σ and V^T V have the same eigenvalues:

\alpha_i = \lambda_i    (3.23)

Then the eigenvectors are related by:

w_i = \frac{V d_i}{\|V d_i\|}    (3.24)

Note that we specified the eigenvectors to be of unit length, thus we need to normalise these eigenvectors so that ‖w_i‖ = 1. Subsequently, using the Theorem given in Appendix A, this collection of normalised eigenvectors, namely the Eigenfaces, gives an orthonormal basis for the face space.
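A minimal MATLAB sketch of this computation is given below. It assumes the training images have already been vectorised into the columns of an N × M matrix X (as in Chapter 2); it is not the full implementation of Appendix D, and the tolerance used to discard zero eigenvalues is an assumed value.

% Eigenfaces via the small M-by-M eigenproblem (3.21)-(3.24).
mu = mean(X, 2);                            % mean face image, as in (3.17)
V  = X - repmat(mu, 1, size(X, 2));         % mean centered images as columns
[D, A] = eig(V' * V);                       % eigenvectors/eigenvalues of V'V (3.21)
[alphaVals, order] = sort(diag(A), 'descend');
D = D(:, order);
keep = alphaVals > 1e-10;                   % assumed tolerance for non-zero eigenvalues
W = V * D(:, keep);                         % map back to the image space, as in (3.24)
W = W ./ repmat(sqrt(sum(W.^2, 1)), size(W, 1), 1);   % normalise each Eigenface to unit length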

3.4.1 How many Eigenfaces?

We can then select the first m Eigenfaces from the M − 1 ordered eigenvectors corresponding to the largest non-zero eigenvalues of the covariance matrix. Then we can use these to form the optimal projection matrix W ∈ R^{N×m}. But how many Eigenfaces should we take? That is, what value should we take m to be? To determine m, the number of Eigenfaces we should use, as detailed in [17], we first need to rank the eigenvalues in non-increasing order: λ_1 ≥ λ_2 ≥ ... ≥ λ_{M−1}. Using (3.12) we know that the residual mean square error from using only the first m < N eigenvectors is equal to the sum of the eigenvalues not used, i.e. \sum_{k=m+1}^{M-1} \lambda_k. Hence we select m such that the sum of the unused eigenvalues, as a fraction of the total variation in the original image space, is less than some predetermined value P. Hence m must satisfy:

\frac{\sum_{k=m+1}^{M-1} \lambda_k}{\sum_{k=1}^{N} \lambda_k} < P    (3.25)

P therefore represents the proportion of variation of the original variables that is not retained by the PCA transformation to the new set of m variables, i.e. the m Eigenfaces. We subsequently select a small value of P such that we obtain a good reduction in the dimension of the original feature vector set whilst retaining a large proportion of the variation in the original image space. In implementing this method we shall compare its performance using a differing number of Eigenfaces m and hence various values of P.

3.4.2 The Feature Space

Collectively this optimal set of m Eigenfaces describes the new lower dimensional feature space. Then every face image in the training set, {x_1, ..., x_M}, can be expressed as a weighted sum of these m Eigenfaces by linearly projecting onto this feature space. Therefore, linear projection of the ith face image in the training set onto each of the m Eigenfaces gives:

y_i = W^T (x_i - \mu)    (3.26)

So the original image x_i forms a new feature vector of length m:

y_i = (y_{i,1}, ..., y_{i,m})^T    (3.27)

As the training set can contain several face images for each of the g individuals, we expect the feature vectors of all images of one individual to be similar. Hence the feature vectors for different individuals will form clusters in the feature space. We can subsequently calculate the class vector for each individual. This is defined to be the average feature vector for each individual, found by averaging the feature vectors across every image of the individual. Assume there are M_k images of the kth individual in the training set, where \sum_{k=1}^{g} M_k = M, the kth individual thus having the set of associated feature vectors {y_1^k, ..., y_{M_k}^k}. Then the class vector is defined to be:

\bar{y}^k = \frac{1}{M_k} \sum_{i=1}^{M_k} y_i^k    (3.28)

This gives the set of class vectors {ȳ^1, ..., ȳ^g} representing each of the g individuals. Hence we have now exactly defined the feature space, as we have specified the Eigenfaces for our training set and found the corresponding feature vector that describes each of the M images in this feature space. This then completes the first step of the Eigenface method.

3.5 Face Recognition

We will now consider the second step of the Eigenface method, in which we have an image belonging to the testing set, i.e. an unknown image not present in the training set. We aim to determine if the individual in this image is known or not, that is, if it is an image of one of the g individuals in the training set. We first need to find the feature vector for the new image in question, x_j, by projecting it onto each of the m Eigenfaces:

y_j = W^T (x_j - \mu)    (3.29)

where µ is the mean face image from all images in the training set. This feature vector, y_j = (y_{j,1}, ..., y_{j,m})^T, is of length m and represents the weighted sum of the Eigenfaces needed to approximately reconstruct the face image x_j. Observe figure 3.9, which gives the input image and the approximated image using m = 40 Eigenfaces. Further, figure 3.10 illustrates the contribution of each of the 40 Eigenfaces in approximating the individual given in figure 3.9(b).
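Continuing the sketch above, the feature vectors (3.26), the class vectors (3.28) and the projection of a new test image (3.29) might be computed as follows; the variable names, the label vector and the test image x_test are assumptions carried over for illustration.

% Feature vectors, class vectors and the projection of a test image.
Wm = W(:, 1:m);                                    % keep the m best Eigenfaces
Y  = Wm' * (X - repmat(mu, 1, size(X, 2)));        % m-by-M training feature vectors (3.26)
g  = max(labels);                                  % 'labels' gives the class (1..g) of each column of X
classMeans = zeros(m, g);
for k = 1:g
    classMeans(:, k) = mean(Y(:, labels == k), 2); % class vector of individual k (3.28)
end
y_test = Wm' * (x_test - mu);                      % feature vector of the test image (3.29)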

Figure 3.9: (a) Original image from the faces94 database not used in the training set and (b) its projection into the face space using 40 eigenfaces.

Figure 3.10: The weights associated with each of the 40 eigenfaces in approximating the image in figure 3.9(b).

Once a new face image x_j has been projected into the face space and its feature vector y_j obtained, we can then determine:

1. If the image is of a face, whether known or unknown, i.e. is the image near the face space?
2. If the image is of a known or unknown individual, i.e. is the image near one of the defined face classes?

3.5.1 Near the Face Space

By projecting each face image into the low dimensional feature space it is likely that several images will project down onto the same feature vector. Not all of these images will necessarily look like a face. We thus need to use a measure of faceness to determine if the image is of a face or not. The distance between the image and the face space is a possible candidate for such a measure [5]. We shall consider using the Euclidean distance, which we will denote by ε. This is then just the distance between the mean adjusted image v_j = x_j − µ and its projection into the face space, v̂_j = \sum_{i=1}^{m} y_{j,i} w_i, given by:

\varepsilon = \|v_j - \hat{v}_j\|    (3.30)

We then need to specify a threshold θ to be the maximum allowable distance an image can be from the face space to still be classified as a face; that is, we require:

\varepsilon < \theta    (3.31)
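A sketch of this faceness test, continuing the previous example, is given below; the threshold value is purely illustrative, since in practice θ must be set by hand for the database in question.

% Distance between the test image and the face space (3.30)-(3.31).
v_test  = x_test - mu;              % mean adjusted test image
v_hat   = Wm * y_test;              % its reconstruction from the m Eigenfaces
epsilon = norm(v_test - v_hat);     % distance from the face space (3.30)
theta   = 1.0e4;                    % illustrative threshold only
isFace  = epsilon < theta;          % faceness test (3.31)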

Figure 3.11 illustrates the projection of a non-face image into the face space. The relative distance between this projection and the face space is considerably larger than that of any face image. We hence need to select the threshold to discriminate against these non-face images.

Figure 3.11: A non-face image and its projection into the face space. Note the difference between the relative distance of this projection from the face space and the relative distances of two example face images (e.g. 29.5). Experiments conducted by Turk and Pentland [5].

The threshold θ is a subjective measure set by hand. Its value depends on the data set of images in question and also the aim of the face recognition procedure. One way to set this threshold would be to observe the average distance between each mean adjusted face image and the average image over all images in the training set, as well as the maximal distance. The threshold can then be set by considering these two quantities alongside the aim of the procedure. That is, if our objective is to have a low rejection rate of face images, i.e. few face images are declared non-face images, we would then select a high θ. Conversely, if we require a high accuracy of classification of face images, that is, we wish to maximise the number of face images classified to the correct individual, we would then need to take a low θ value. However, note this objective would lead to a high rejection rate, in that many face images would be rejected as non-faces.

3.5.2 Near a Face Class

If the image in question is sufficiently close to the face space and hence classified as a face image, we can then look to determine if the image is of a known face. The feature vector for the image, y_j, can be used in a standard pattern recognition algorithm in order to determine which of the defined face classes (C_1, ..., C_g), if any, best describes the face. We have defined ȳ^k to be the average feature vector of class k, produced by averaging the M_k feature vectors {y_1^k, ..., y_{M_k}^k} obtained from the projection of the M_k images {x_1^k, ..., x_{M_k}^k} of individual k into the feature space. That is:

\bar{y}^k = \frac{1}{M_k} \sum_{i=1}^{M_k} y_i^k    (3.32)

This is consequently a classification problem, as we are interested in assigning the image to one of the known individuals. We thus need a similarity metric to compare the face image under investigation with each known individual. There are various such similarity metrics, but in implementing the Eigenface method we shall consider the nearest neighbour approach [5] using just two metrics:

- Euclidean distance
- Mahalanobis distance

Further, for interest's sake, we will then show that under certain simplifying assumptions the use of the Mahalanobis distance and Euclidean distance is equivalent to using a Bayes classifier.

In using a nearest neighbour approach, we are implicitly assuming a Gaussian distribution of each face class within the face space. It is difficult, if not impossible, to estimate the true distribution of each face class. Nonetheless, we are able to base our assumptions on the observation that face images of one individual are similar in appearance. This suggests that feature vectors from images of a certain individual are also likely to be similar. Consequently we expect that the feature vectors will be grouped into clusters in the feature space, with each cluster representing a different individual. It is then reasonable to assume we are able to adequately describe each face class using the mean feature vector of the class and a measure of spread. Subsequently a Gaussian distribution appears acceptable.

Euclidean Distance

The nearest neighbour approach uses a distance similarity metric to identify a new face image by searching for the most similar vector in the training set to this new image's feature vector. The simplest metric to use in such a method, as described in Turk and Pentland [5], is the Euclidean distance. This means we aim to find the face class k that minimises the Euclidean distance between the feature vector y_j of the image in question and the average feature vector ȳ^k of the kth individual. This distance is given by:

\varepsilon_k = d(y_j, \bar{y}^k) = \|y_j - \bar{y}^k\| = \sqrt{(y_j - \bar{y}^k)^T (y_j - \bar{y}^k)}    (3.33)

Formally this is nearest neighbour classification, as the image is classified by assigning it to the label of the closest point in the training set, measuring all distances in the feature space. The face is determined to belong to class k if the distance is less than some predetermined threshold θ_k, that is:

\varepsilon_k = \min_{i=1,...,g} \{ \varepsilon_i \mid \varepsilon_i < \theta_i \}    (3.34)

We need to specify a threshold θ_i for each of the i = 1, ..., g individuals. This then gives the maximum allowable distance the image in question can be from the average class vector of the ith class to still be classified as the ith individual. The image is classified as unknown if ε_i ≥ θ_i for all i = 1, ..., g. Otherwise the image is classified as class k, for which ε_k < θ_k and the Euclidean distance ε_k is the smallest distance over all possible distances, i.e. over all ε_i = ‖y_j − ȳ^i‖ for i = 1, ..., g.

Mahalanobis Distance

Another distance similarity measure that can be used in a nearest neighbour approach is the Mahalanobis distance. The Mahalanobis distance is defined to be the distance between two vectors that takes into account the correlations of the data set. The Mahalanobis distance between two vectors x ∈ R^m and z ∈ R^m is given by the expression:

d(x, z) = \sqrt{(x - z)^T \Sigma^{-1} (x - z)}    (3.35)

where Σ is the covariance matrix of the data set from which the observations x and z are taken. From a geometric point of view the Mahalanobis distance has a scaling effect on the input data space. This is because multiplication by the inverse covariance matrix means that directions in which there is greater variability are compressed whilst directions in which there is less variability are expanded. Hence the Mahalanobis distance offers an advantage over the Euclidean distance in that it takes into account the variability of the data set. So the Mahalanobis distance between the feature vector y_j of the image in question and the class vector ȳ^k of the kth individual is defined to be:

\varepsilon_k = d(y_j, \bar{y}^k) = \sqrt{(y_j - \bar{y}^k)^T (W^T \Sigma W)^{-1} (y_j - \bar{y}^k)}    (3.36)

where we have made the assumption that all classes have a common covariance matrix Σ.
Then the face image is determined to belong to class $k$ if the distance is less than the predetermined threshold $\theta_k$, that is:

$$\epsilon_k = \min_{i=1,\ldots,g}\{\epsilon_i \mid \epsilon_i < \theta_i\} \qquad (3.37)$$
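To make the decision rules (3.34) and (3.37) concrete, here is a minimal MATLAB sketch of nearest neighbour classification of a single test feature vector against the class averages, using either metric. It is illustrative only: the names y, Ybar, SigmaW and theta are assumptions, and SigmaW stands for the common covariance matrix of the feature vectors posited in (3.36).

```matlab
% Minimal sketch: nearest neighbour classification with Euclidean (3.33) or
% Mahalanobis (3.36) distances. Assumed inputs: y (m x 1 test feature vector),
% Ybar (m x g class averages), SigmaW (m x m common covariance), theta (1 x g).
D    = Ybar - y;                          % differences to each class average
epsE = sqrt(sum(D.^2, 1));                % Euclidean distances, 1 x g
epsM = sqrt(sum(D .* (SigmaW \ D), 1));   % Mahalanobis distances, 1 x g

eps_used = epsM;                          % choose one of the two metrics
valid = find(eps_used < theta);           % classes whose threshold is satisfied
if isempty(valid)
    disp('Unknown face: no class within its threshold');
else
    [eps_k, j] = min(eps_used(valid));
    k = valid(j);
    fprintf('Classified as individual %d (distance %.3f)\n', k, eps_k);
end
```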

Comparison of Euclidean and Mahalanobis Distance

We shall compare the differences between using the Euclidean distance and the Mahalanobis distance for classification in the feature space. In assigning an unknown image to one of the $g$ known individuals we use a nearest neighbour approach, in that we assume that the closer the feature vector $y_j$ is to the average feature vector $\bar{y}_k$ of a class, the more likely the unknown image is to be of the $k$th individual. However, it would also be useful to know whether the set of feature vectors belonging to a class is spread over a large or a small range, and thus whether a given distance from the average is noteworthy or not. Herein lies the advantage of the Mahalanobis distance over the Euclidean distance: it takes into account the variability of the data set.

The Mahalanobis distance gives the equation of an ellipsoid on which all points have the same Mahalanobis distance from the centre, which is the class average $\bar{y}_k$. The axes of the ellipsoid take into account the correlations in the feature space. In directions in which the ellipsoid has a small axis, feature vectors must be closer to the centre to be classified as this class, whereas in directions in which the ellipsoid has a large axis, feature vectors are allowed to be further from the centre. Thus the likelihood of the feature vector belonging to a class depends on both the distance to the centre and the direction. Contrast this with the Euclidean distance measure, which gives a sphere on which all feature vectors have the same Euclidean distance from the centre, and which therefore does not take into account the correlations of the data space. Observe figure 3.12.

Figure 3.12: Euclidean distance and Mahalanobis distance in a two dimensional space. (a) Sphere around the centre $\bar{y}_k$: points A and B have the same Euclidean distance from the centre. (b) Ellipsoid around the centre $\bar{y}_k$: points A and B have the same Mahalanobis distance from the centre.

A useful and interesting observation is that, when considering vectors in the PCA reduced space, the Mahalanobis distance can be shown to be equivalent to a scaled Euclidean distance, with each component weighted by the inverse of its corresponding eigenvalue [21]. The covariance matrix in the PCA reduced space is given by $W^T \Sigma W$, where $W$ is the matrix whose columns are formed from the principal components. Note that the covariance matrix $\Sigma$ is a real symmetric matrix, hence by the Theorem in Appendix B it is diagonalisable by an orthogonal matrix $\Gamma$ formed from the orthonormal eigenvectors of $\Sigma$. That is:

$$\Sigma = \Gamma \Lambda \Gamma^{-1} = \Gamma \Lambda \Gamma^T \qquad (3.38)$$

where $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_N\}$ is the matrix with the eigenvalues of $\Sigma$ along the diagonal. Then we can rewrite the covariance matrix in the PCA reduced space as:

$$W^T \Sigma W = W^T \Gamma \Lambda \Gamma^T W \qquad (3.39)$$

Note that both $\Gamma$ and $W$ are matrices formed from the orthonormal eigenvectors of $\Sigma$, as we have previously shown that the principal components correspond to these eigenvectors. Denote by $\gamma_j$ the $j$th column of $\Gamma$ and by $w_i$ the $i$th column of $W$.

Then, as the vectors are orthonormal:

$$w_i^T \gamma_j = \gamma_j^T w_i = \begin{cases} 1 & \text{if } w_i = \gamma_j \\ 0 & \text{if } w_i \neq \gamma_j \end{cases}$$

Hence $W^T \Gamma$ forms an $m \times N$ matrix and $\Gamma^T W$ forms an $N \times m$ matrix, in each of which there is at most one 1 in each column and at most one 1 in each row, with 0s elsewhere. These matrices can then be seen as left and right linear transformation matrices which, on multiplication, perform a reordering of the rows and columns of $\Lambda$, selecting those which contain the $m$ largest eigenvalues and placing them in descending order. Hence (3.39) can be further reduced to:

$$W^T \Gamma \Lambda \Gamma^T W = \Lambda_m \qquad (3.40)$$

where $\Lambda_m = \mathrm{diag}\{\lambda_1, \ldots, \lambda_m\}$ with $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_m$.

Thus, considering the Mahalanobis distance between two vectors $y_i$ and $y_j$ in the PCA reduced space, we can rewrite (3.36) as:

$$d(y_i, y_j) = \sqrt{(y_i - y_j)^T \Lambda_m^{-1} (y_i - y_j)} = \sqrt{\sum_{k=1}^{m} \frac{1}{\lambda_k}(y_{i,k} - y_{j,k})^2} \qquad (3.41)$$

Hence in this PCA transformed space the Mahalanobis distance is equivalent to a Euclidean distance with each component weighted by the inverse of its corresponding eigenvalue. An alternative way to perform this normalisation is to apply the standard PCA method and multiply the optimal projection matrix $W$ by $\Lambda_m^{-1/2}$. This transformation of PCA is called the Whitening Transformation [26], where:

$$\Lambda_m^{-1/2} = \mathrm{diag}\{\lambda_1^{-1/2}, \ldots, \lambda_m^{-1/2}\} \qquad (3.42)$$

Observe that the covariance matrix of the whitened vectors is the identity matrix, hence this is a decorrelation method. Using the Euclidean distance over the whitened PCA vectors then gives the same result as using the Mahalanobis distance.

So, using the Mahalanobis distance as the metric in a nearest neighbour approach, we calculate the distance between the feature vector in question $y_j$ and the average feature vector $\bar{y}_k$ using:

$$\epsilon_k = d(y_j, \bar{y}_k) = \sqrt{\sum_{q=1}^{m} \frac{1}{\lambda_q}(y_{j,q} - \bar{y}_{k,q})^2} \qquad (3.43)$$

Then, with a predetermined threshold $\theta_i$ specified for each class, an image $y_j$ is classified to class $k$ if:

$$\epsilon_k = \min_{i=1,\ldots,g}\{\epsilon_i \mid \epsilon_i < \theta_i\} \qquad (3.44)$$

That is, the Mahalanobis distance $\epsilon_k$ is the smallest distance over all the possible distances $\epsilon_i$, $i = 1, \ldots, g$.
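The equivalence between (3.41) and the whitening transformation (3.42) is easy to check numerically. The sketch below assumes Wm holds the first m Eigenfaces, lm the corresponding eigenvalues, mu the mean face and xi, xj two images; all names are illustrative rather than taken from the report's code.

```matlab
% Minimal sketch: Mahalanobis distance in the PCA reduced space (3.41) versus
% Euclidean distance after the whitening transformation (3.42).
% Assumed: Wm (N x m Eigenfaces), lm (m x 1 eigenvalues), mu (N x 1 mean face),
% xi, xj (N x 1 images).
Wwhite = Wm * diag(lm .^ (-1/2));         % scale each Eigenface by lambda^(-1/2)

yi = Wwhite' * (xi - mu);                 % whitened feature vectors
yj = Wwhite' * (xj - mu);
dWhite = norm(yi - yj);                   % plain Euclidean distance after whitening

zi = Wm' * (xi - mu);                     % ordinary PCA feature vectors
zj = Wm' * (xj - mu);
dMahal = sqrt(sum(((zi - zj).^2) ./ lm)); % eigenvalue-weighted Euclidean distance

% dWhite and dMahal agree (up to rounding), illustrating the equivalence.
```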

Connection to the Bayes Classifier

We shall take a slight detour and explore an interesting fact: under certain normality assumptions on the training set, Bayes classification is equivalent to nearest neighbour classification using the Mahalanobis distance, and under further assumptions it is also equivalent to using the Euclidean distance. It has been shown [18] that the Bayes classifier is, in a parametric sense, the best classifier for identifying which class an unknown face belongs to. This is because the Bayes classifier yields the minimum error when the underlying probability density functions (PDFs) of each group are known, this error being known as the Bayes error [16]. We will prove this optimality in Chapter 4.

Bayes classifiers assign the most likely class to a given observation as described by its feature vector. Naive Bayes classifiers greatly simplify the classification problem by assuming that the features are independent given the class, that is:

$$P(y \mid C_i) = \prod_{j=1}^{m} P(y_j \mid C_i) \qquad (3.45)$$

where $y = (y_1, \ldots, y_m)^T$ is a feature vector in the PCA reduced subspace and $C_i$ is the $i$th class, i.e. the $i$th individual. This enables the conditional density function for the observation given the class to be calculated. We can then use Bayes' Theorem to construct the classifier. Let $C_1, \ldots, C_g$ denote the classes representing each individual in the training set and $y$ the feature vector of an image in question in the PCA reduced subspace. We are then interested in the posterior probability of a class $C_i$ given an observation $y$. Bayes' Theorem states:

$$P(C_i \mid y) = \frac{P(C_i)\, P(y \mid C_i)}{P(y)} \qquad (3.46)$$

where $P(C_i)$ is the prior probability of class $C_i$ and $P(y \mid C_i)$ is the conditional probability density function of class $C_i$, which can be thought of as the likelihood of the observation $y$ given the class $C_i$. The prior probabilities encode our prior knowledge, if any, of the data set. Say, for example, we know a priori that we are more likely to obtain an observation from the $k$th class than from the $l$th class; then we would assign prior probabilities such that $P(C_k) > P(C_l)$. In contrast, if we have no prior knowledge of the data set, then we assume all classes are equally likely and set all of the prior probabilities to be equal. Hence (3.46) can be expressed in words as:

$$\text{Posterior} = \frac{\text{Prior} \times \text{Likelihood}}{\text{Evidence}}$$

Note that $P(y)$ is constant for all classes and, since the classes are mutually exclusive and exhaustive, the partition theorem gives $P(y) = \sum_{i=1}^{g} P(y \mid C_i)P(C_i)$. As we only need to compare the posterior probabilities across classes, we can ignore this constant. Consequently (3.46) can be thought of as:

$$P(C_i \mid y) \propto P(C_i)\, P(y \mid C_i) \qquad (3.47)$$

We then apply a Maximum A Posteriori (MAP) decision rule to classify an observation $y$ as one of the classes: we classify the observation $y$ as the class which gives the highest posterior probability. This gives the Bayes classifier: assign $y$ to the class $C_i$ for which

$$P(C_i)P(y \mid C_i) = \max_j\{P(C_j)P(y \mid C_j)\} \qquad (3.48)$$

Subsequently the face image $y$ is classified as the class $C_i$ with the largest posterior probability over all classes $i = 1, \ldots, g$. However, within the training set of face images there are not usually enough images of each individual to be able to estimate the conditional probability density function for each class. So how do we actually calculate these probabilities? We need to make a compromise and assume a particular density form for the within class density, which turns the problem into a parametric one. Using the aforementioned justification based on the similarity of feature vectors belonging to one individual, we will assume a Gaussian distribution for the within class densities in the PCA reduced space. This gives the multivariate normal:

$$P(y \mid C_i) = \frac{1}{(2\pi)^{m/2}\, |\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(y - \bar{y}_i)^T \Sigma_i^{-1} (y - \bar{y}_i)\right\} \qquad (3.49)$$

where $\bar{y}_i$ is the mean vector and $\Sigma_i$ is the covariance matrix of the $i$th class in the PCA reduced space. Note that a further simplification may be required due to the difficulty involved in calculating the covariance matrix $\Sigma_i$ for each class. This difficulty arises when there is a limited number of images within each class of the training set, i.e. when $M_i$ is small. From the derivation of the PCA scheme we know that $\Sigma$, the overall covariance matrix, is diagonalisable; however this is not necessarily the case for the within class covariance matrices $\Sigma_i$. If we assume all the class covariance matrices in the PCA reduced space to be equal and given by $\Sigma$, then (3.49) becomes:

$$P(y \mid C_i) = \frac{1}{(2\pi)^{m/2}\, |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(y - \bar{y}_i)^T \Sigma^{-1} (y - \bar{y}_i)\right\} \qquad (3.50)$$

Further, suppose we make one more simplifying assumption: that we have no prior knowledge of the data set, so that the prior probability of each class is equal. Then the Bayes classifier (3.48) is equivalent to selecting the class that maximises $P(y \mid C_i)$ over all $i = 1, \ldots, g$ classes, i.e. the class that maximises (3.50). Observe that the leading factor is constant for all classes, hence we just need to select the class that maximises the exponent. This is equivalent to minimising:

$$(y - \bar{y}_i)^T \Sigma^{-1} (y - \bar{y}_i) \qquad (3.51)$$

This is exactly equivalent to selecting the $i$th class such that the Mahalanobis distance between the feature vector in question and the average class vector is minimised. Thus, under the assumptions of Gaussian distributions, equal priors and equal covariance matrices for each class, the Bayes classifier is exactly equal to using a nearest neighbour approach with the Mahalanobis distance.

Alternatively, we could also assume the within class covariance matrices $\Sigma_i$, $i = 1, \ldots, g$, to be unit matrices, that is:

$$\Sigma_i = I_m \qquad (3.52)$$

Then under this additional simplifying assumption (3.50) becomes:

$$P(y \mid C_i) = \frac{1}{(2\pi)^{m/2}} \exp\left\{-\frac{1}{2}(y - \bar{y}_i)^T (y - \bar{y}_i)\right\} \qquad (3.53)$$

The MAP rule then corresponds to minimising the distance:

$$(y - \bar{y}_i)^T (y - \bar{y}_i) \qquad (3.54)$$

This gives a distance classifier which exactly corresponds to minimising the Euclidean distance in the nearest neighbour approach.

3.6 Outcomes

In summary, there are consequently 4 possible outcomes from considering a given face image and its associated feature vector:

1. Near the face space ($\epsilon < \theta$) and near a face class $\bar{y}_k$ ($\epsilon_k < \theta_k$). The image is recognised as a face and identified as individual $k$.
2. Near the face space ($\epsilon < \theta$) and distant from all face classes $\bar{y}_k$ ($\epsilon_k > \theta_k$ for all $k = 1, \ldots, g$). The image is recognised as an unknown individual and can optionally be added as a new face class.
3. Distant from the face space ($\epsilon > \theta$) and near a face class $\bar{y}_k$ ($\epsilon_k < \theta_k$). This is a false positive, as the image is not a face.
4. Distant from the face space ($\epsilon > \theta$) and distant from all face classes $\bar{y}_k$ ($\epsilon_k > \theta_k$). The image is not a face.
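The four outcomes can be written directly as a small decision procedure. The following MATLAB sketch is illustrative only; eps_face, eps_k, theta_face and theta_k are assumed to have been computed as described above (outcomes 3 and 4 both lead to rejection, since the image is judged not to be a face).

```matlab
% Minimal sketch of the decision logic for the four outcomes.
% Assumed: eps_face (distance to the face space), eps_k (1 x g class distances),
% theta_face (face space threshold), theta_k (1 x g class thresholds).
if eps_face < theta_face                       % near the face space
    [val, k] = min(eps_k);
    if val < theta_k(k)
        fprintf('Outcome 1: face image, identified as individual %d\n', k);
    else
        disp('Outcome 2: face image of an unknown individual');
    end
else                                           % distant from the face space
    disp('Outcomes 3/4: image rejected as a non-face');
end
```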

3.7 Summary of Eigenface based Face Recognition

To summarise, the following steps are involved in Eigenface based face recognition [5]:

1. Training set: Collect a training set of $M$ face images; ideally this should include several images of each individual. Calculate the mean face image:
$$\mu = \frac{1}{M}\sum_{i=1}^{M} x_i$$

2. Eigenvalues and eigenvectors: Consider the $N \times N$ covariance matrix $\Sigma = V V^T$, where the mean centred images $v_i = x_i - \mu$ form the columns of $V$. To reduce the computational requirements, calculate the eigenvalues and eigenvectors of the $M \times M$ matrix $V^T V$:
$$V^T V d_i = \alpha_i d_i$$
Then note that:
$$V V^T (V d_i) = \alpha_i (V d_i) \quad\Longrightarrow\quad \Sigma (V d_i) = \alpha_i (V d_i)$$
Hence the eigenvalues of $\Sigma$ are $\lambda_i = \alpha_i$ and the corresponding orthonormal eigenvectors are $w_i = V d_i / \|V d_i\|$. Select the $m$ eigenvectors $w_i$ with the largest eigenvalues $\lambda_i$; these eigenvectors correspond to the principal components.

3. Eigenfaces: The $m$ eigenvectors give the Eigenfaces and are used as the columns of the optimal projection matrix:
$$W = [w_1, \ldots, w_m]$$

4. Feature vectors: For each image in the training set calculate its feature vector $y_i$ by linearly projecting onto the $m$ Eigenfaces:
$$y_i = W^T(x_i - \mu)$$
For each of the $g$ individuals calculate their associated class feature vector:
$$\bar{y}_k = \frac{1}{M_k}\sum_{i=1}^{M_k} y^k_i$$
by averaging the feature vectors produced from all $M_k$ images of individual $k$, where $y^k_i$ is the $i$th feature vector belonging to individual $k$, $i = 1, \ldots, M_k$.

5. Thresholds: Specify the thresholds:
- $\theta$, which defines the maximum allowable distance from the face space, (3.31);
- $\theta_k$, which defines the maximum allowable distance from each face class $k = 1, \ldots, g$, (3.37).

6. Testing set: Given a new image of a face $x$, i.e. one not within the training set, calculate:
- Its feature vector by projection into the feature space, giving the $m$ dimensional vector $y = W^T(x - \mu)$.

- The distance $\epsilon = \|v - v_f\|$ between the mean centred image $v = x - \mu$ and its projection $v_f = W y$ onto the face space.
- $\epsilon_k = d(y, \bar{y}_k)$, using either the Euclidean distance or the Mahalanobis distance, to give a distance measure between the image's feature vector and each known face class.

Then classify the image as a face if $\epsilon < \theta$, or as a non-face otherwise. Further, use a nearest neighbour approach with the chosen distance measure to classify the image as the $k$th individual if $\epsilon_k = \min_{i=1,\ldots,g}\{\epsilon_i \mid \epsilon_i < \theta_i\}$, or otherwise as an unknown face image.

3.8 Learning to Recognise New Faces

If an image is sufficiently close to the face space but is distant from all known face classes, it is initially labelled as unknown. Can we learn anything from these unknown face images? If a collection of unknown feature vectors cluster together, this suggests the presence of a new but unidentified individual. The images corresponding to the feature vectors in this cluster can then be checked for similarity, by requiring that the distance from each image to the mean of the cluster's images is less than a predefined threshold value. If the similarity test is passed, this suggests all the images are of the same individual, and a new face class can be added to the original training set. The Eigenfaces can then be recalculated to include this new face class in addition to those in the initial training set, giving g + 1 individuals we are now able to recognise.

3.9 Locating and Detecting Faces

We have so far focused on the identification part of the face recognition problem when considering the Eigenface method. However, we can utilise our knowledge of the face space to locate faces in an image, thus addressing the face detection step of the problem. Face images do not tend to change much when projected into the face space, whereas images of non-faces do, as we have seen in figure 3.11 [5]. Hence we can segment an image into various subimages and then scan the image, at each location calculating the distance ε between the local subimage and the face space. This distance can then be used with a specified threshold to determine whether the subimage is sufficiently close to the face space, and thus whether a face is present in the subimage. Note, however, that this process of scanning the entire image is computationally extremely expensive.

3.10 Eigenface Implementation

This section focuses on the computational details of training a computer system to implement and test the Eigenface method. Two codes, written in MATLAB [19], are given in Appendix D (using the Euclidean distance) and Appendix E (using the Mahalanobis distance). The codes cover both steps of the Eigenface method: first training the computer using the training set of face images, and secondly the identification procedure of classifying a non-trained face image. Both codes are identical apart from the distance measure used in classification. We shall examine the code for each section and explain the computations in relation to the steps in the summary of the procedure given above.

The Training Set

Initially, a training set of face images is selected from a database to calibrate the machine. The method was implemented using both the faces94 and faces95 databases [12]; in each case a random selection of g = 50 individuals was taken from the database, and M_i = 10 images of each individual were randomly chosen to form a training set of M = 500 images. All images share the same pixel dimensions.
The images have all undergone image preprocessing of RGB to greyscale conversion and histogram equalisation using the MATLAB functions rgb2gray and histeq. The first step of the code involves reading all 500 training images into the computer. Each image is first transformed from an array into a vector of length N by concatenating the rows of the image array. We then use each of these vectors to form the N × M dimensional training matrix X, where each column X(:, i) gives the vector representing the ith image. A sketch of this preprocessing and vectorisation step is given below.
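The preprocessing and vectorisation just described might look as follows in MATLAB. The folder layout, file pattern and variable names are assumptions for illustration; the report's actual code is in Appendix D.

```matlab
% Minimal sketch: read the training images, convert to greyscale, equalise the
% histogram and store each image as a column of X (rows concatenated).
files = dir(fullfile('training', '*.jpg'));      % assumed folder and file type
M = numel(files);
for i = 1:M
    img = imread(fullfile('training', files(i).name));
    if size(img, 3) == 3
        img = rgb2gray(img);                     % RGB to greyscale
    end
    img = histeq(img);                           % histogram equalisation
    v = double(reshape(img', [], 1));            % concatenate the rows into a vector
    if i == 1
        X = zeros(numel(v), M);                  % preallocate once the size is known
    end
    X(:, i) = v;                                 % ith column represents the ith image
end
```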

Step One involves calculating the mean face image of the training set. We use the data matrix X to calculate the mean value at each pixel position, taking the mean across the ith row of X to give the mean pixel value at the ith position and so forming the mean vector of length N. Observe figure 3.13 for the two mean images from the two training sets considered.

Figure 3.13: Mean face images. (a) faces94. (b) faces95.

Step Two involves calculating the eigenvectors and eigenvalues of the covariance matrix Σ = V V^T. To reduce the amount of computation required of MATLAB in running the code, we calculate the eigenvalues of the M × M matrix V^T V (here 500 × 500) rather than those of Σ, which is of dimension N × N. We know the eigenvalues of Σ and V^T V are the same and that the eigenvectors are related by w_i = V d_i / ||V d_i||. We further reduce the number of computations required by only considering eigenvectors which correspond to eigenvalues greater than zero, by use of the sort function and by discarding eigenvalues equal to zero. We thus form the vector l, which contains all the non-zero eigenvalues (l_i > 0) listed in descending order. The matrix d is then formed, where the first column is the eigenvector of V^T V corresponding to the largest non-zero eigenvalue and the last column is the eigenvector corresponding to the smallest non-zero eigenvalue. We then form the matrix U, which contains the corresponding eigenvectors of Σ, by multiplying the matrix d by V. We specified that the eigenvectors are of unit length, so we normalise each column of U such that the eigenvectors are orthonormal, i.e. ||u_i|| = 1. We have thus calculated the Eigenfaces, given by the columns of the matrix U.

Step Three: We need to specify the number of Eigenfaces, m, we are investigating. For the experimental procedures the code was run for m = 10, 20, 30, 40 Eigenfaces. We select the first m columns of U to form the optimal N × m projection matrix W. We can then reshape each column of the matrix to visualise the Eigenfaces. Observe figure 3.14, which gives examples of two such Eigenfaces. Note that in the first Eigenface facial features such as the eyes, nose and hairline are visible, whereas in the 40th Eigenface there is much more noise.

Figure 3.14: Eigenfaces calculated from the faces94 training set. (a) 1st Eigenface. (b) 40th Eigenface.

A minimal sketch of these training-stage computations is given below.
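The following MATLAB sketch summarises Steps One to Three. It is a simplified illustration of the computations described above rather than the code of Appendix D; it assumes the matrix X from the previous sketch.

```matlab
% Minimal sketch of Steps One to Three: mean face, eigendecomposition via the
% small M x M matrix, and selection of the first m Eigenfaces.
mu = mean(X, 2);                          % Step One: mean face image
V  = X - mu;                              % mean-centred images as columns

[d, A] = eig(V' * V);                     % Step Two: eigenvectors of V'V (M x M)
[l, idx] = sort(diag(A), 'descend');      % eigenvalues in descending order
keep = l > 1e-10;                         % keep only (numerically) non-zero ones
l = l(keep);
d = d(:, idx(keep));

U = V * d;                                % corresponding eigenvectors of Sigma = V*V'
U = U ./ sqrt(sum(U.^2, 1));              % normalise each column to unit length

m = 20;                                   % Step Three: number of Eigenfaces retained
W = U(:, 1:m);                            % N x m projection matrix of Eigenfaces
```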

Step Four covers the projection of each image in the training set into the face space, giving their corresponding feature vectors y_i = W^T(x_i − µ). We form a new m × M matrix omega which has each feature vector as a column. We then need to calculate the average feature vector for each individual. There are 10 images per individual in the training set, so we take the average over each set of ten consecutive columns of the matrix omega. This gives a new m × 50 matrix Q, where the kth column gives the average feature vector for individual k, i.e.

$$\bar{y}_k = \frac{1}{10}\sum_{j=1}^{10} y^k_j$$

The Testing Set

The second half of the code covers the face recognition part of the problem, using a testing set to determine whether the images in this set are of known individuals or not. The code in Appendix D uses the Euclidean distance as the distance measure and Appendix E uses the Mahalanobis distance. We shall focus on investigating the distance ε_k of the untrained image to each of the 50 face classes, as opposed to the distance ε to the face space. This means the threshold θ, which gives the maximum allowable distance from the face space for an image to be classified as a face image, is assumed to be infinite. We therefore assume, a priori, that all images are of faces. We shall construct 3 testing sets consisting of untrained images of the individuals in the training set, with each set containing one randomly selected image of each of the g = 50 individuals; thus each testing set contains 50 face images. We consider one set of these testing images at a time and run the second part of the code using the Euclidean distance and then the Mahalanobis distance, given the predetermined thresholds and a specified number of Eigenfaces m.

Each image in the testing set is read in separately and the image's array is turned into a vector of length N by concatenating the rows of the array. Each image in the testing set is then used as a column vector to form the matrix S. The following procedure is the same for both the Euclidean distance and the Mahalanobis distance, apart from a difference in the final part of the code in how the distance between the feature vector in question and each of the face classes is computed.

Step Five considers one image in the testing set at a time. This step covers the projection of the new image onto the m Eigenfaces to give its associated feature vector y = W^T(x − µ). We can then reconstruct and reshape the corresponding approximation of the image built from these m Eigenfaces. Observe the approximations in figure 3.15 from the faces94 and faces95 databases.

Figure 3.15: (a) Input testing image and (b) approximated face image from the faces94 database; (c) input testing image and (d) approximated face image from the faces95 database.

We then use a similarity metric, either the Euclidean distance or the Mahalanobis distance, to measure the proximity of the image's feature vector to each face class, with the kth face class given by $\bar{y}_k$ = Q(:, k). Observe figure 3.16, which illustrates the Euclidean distance to each of the 50 face classes.

Figure 3.16: Euclidean distance between the image and the face classes of each of the 50 individuals in the training set. (a) faces94. (b) faces95.

A minimal sketch of this projection and classification step is given below.
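Steps Four and Five might be sketched as follows, for the Euclidean distance case. The file name, the variables carried over from the previous sketches (X, mu, W) and the assumption of 10 images per individual are illustrative; the report's own code is given in Appendices D and E.

```matlab
% Minimal sketch of Steps Four and Five: training feature vectors, class
% averages, then projection and Euclidean classification of one test image.
omega = W' * (X - mu);                        % m x M training feature vectors
g = 50;
Q = zeros(size(W, 2), g);
for k = 1:g
    cols = (k - 1) * 10 + (1:10);             % 10 consecutive images per individual
    Q(:, k) = mean(omega(:, cols), 2);        % class average feature vector
end

img = histeq(rgb2gray(imread('test_image.jpg')));   % assumed test image
x = double(reshape(img', [], 1));
y = W' * (x - mu);                            % feature vector of the test image

epsE = sqrt(sum((Q - y).^2, 1));              % Euclidean distance to each face class
[eps_k, k] = min(epsE);                       % nearest face class and its distance
```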

When considering the Mahalanobis distance we need to use the vector l, which contains the eigenvalues listed in descending order. The image is then taken to be of class k for which ε_k = min_i {ε_i | ε_i < θ_i}, with θ_i the given threshold for each class. This minimum value is identified and returned along with its position, which represents the face class of nearest proximity. A loop is written in the code so that this procedure is repeated for every image in the testing set; the number of correctly classified testing images then gives a measure of the accuracy of the procedure.

3.11 Eigenface Experimental Results

In implementing the Eigenface method there are several variables that can either be varied or need to be kept constant. In investigating this method we shall examine the effect of 3 key variables:

Number of Eigenfaces: We shall take m = 10, 20, 30, 40 and examine how this affects the recognition accuracy of the method. Note that it is useful to know the optimum value for m, as we seek the best balance between the recognition accuracy of the method (i.e. retaining enough of the variation of the original image space to be able to accurately represent an image) and the compression of the image space to reduce data storage and computational requirements.

Distance Measure: We shall investigate using the Euclidean distance and the Mahalanobis distance to measure the distance between the feature vector of each image in the testing set and each face class. We shall keep the threshold parameters θ_i, i = 1, ..., g, constant. We shall only consider testing sets in which the images are of individuals in the training set; we can thus assume a priori that all images are sufficiently close to the face space, and hence none should be rejected as non-face images. We are consequently only interested in the recognition accuracy of the method, so the threshold θ is taken to be infinite.

Image Database: We shall investigate using the faces94 and faces95 databases. These databases have different characteristics, in particular in terms of the amount of pose and lighting variation present between images. Implementation on these two databases will enable a comparison of how well the Eigenface method is able to cope with such sources of variation.

Throughout the experimental procedure the same two training sets of M = 500 face images of g = 50 randomly selected individuals shall be used, one created from each database. We will also keep the allowable thresholds θ_i, i = 1, ..., 50, constant. We shall experiment using the two face databases faces94 and faces95; each database has its own associated 500 image training set and 3 sets of testing images. We shall run the code three times on each testing set to ensure the reliability of the results.

We define the percentage recognition accuracy of the procedure to be the percentage of correctly classified individuals given a fixed threshold and specified distance measure. For a given number of Eigenfaces and distance measure we take the average of this recognition accuracy over the results from the 3 repeats for each of the 3 testing sets; this ensures the reliability of the final results. We further calculate the standard deviation (SD) of the recognition accuracy to give a measure of the reliability of the results produced by the method. This then enables an error bound to be calculated for the stated average recognition accuracy.

In experimenting we select the thresholds θ_i, i = 1, ..., 50, in such a way as to reduce the number of images not recognised as any of those within the database, i.e. to reduce the rejection rate of images. This is because the testing sets are constructed from images of known individuals, i.e. individuals within the training set, so we know a priori that all images in the set are of known individuals. We want to test whether we have trained the computer sufficiently, using the Eigenface method, to be able to recognise these individuals. We are subsequently not so interested in the rejection rate of the Eigenface procedure but in the recognition ability of the method. This means we need to select a high threshold, so that all images are classified as their nearest neighbour. Note, however, that this is likely to give more errors than a smaller threshold would. The thresholds are taken to be:

Euclidean distance: the same fixed value of θ_i for all i = 1, ..., 50
Mahalanobis distance: θ_i = 0.5 for all i = 1, ..., 50

These thresholds mean that no images are rejected as unknown and each image is classified as its nearest neighbour. Figure 3.17 gives a visual example of this nearest neighbour classification using the specified threshold for the Euclidean distance.

Figure 3.17: Given threshold for the Euclidean distance and the nearest neighbour classification rule.

We shall now examine the results produced from such an experiment:

Euclidean Distance:

No. Eigenfaces    faces94 Recognition Accuracy    faces94 SD    faces95 Recognition Accuracy    faces95 SD
10                %                               1.73%         53.33%                          17.52%
20                %                               1.00%         68.67%                          16.09%
30                %                               1.00%         68.89%                          17.23%
40                %                               2.65%         72.00%                          18.00%

Mahalanobis Distance:

No. Eigenfaces    faces94 Recognition Accuracy    faces94 SD    faces95 Recognition Accuracy    faces95 SD
10                %                               2.65%         63.33%                          14.53%
20                %                               1.00%         71.33%                          14.53%
30                %                               1.00%         68.67%                          21.66%
40                %                               1.00%         72.00%                          19.67%

Figure 3.18: Results using the Euclidean distance and the Mahalanobis distance. (a) Recognition accuracy and error bounds using the Euclidean distance. (b) Recognition accuracy and error bounds using the Mahalanobis distance.

Euclidean Distance vs Mahalanobis Distance

Observe that using the Mahalanobis distance generally gives a higher recognition accuracy across all numbers of Eigenfaces used, in comparison to that found using the Euclidean distance. This is clearly shown in the graph given in figure 3.19. It confirms the Mahalanobis distance to be a better distance metric than the Euclidean distance, in the sense that it produces a higher recognition accuracy. This is because the Mahalanobis distance takes into account the correlations of the original data set, and is thus a measure of both distance and spread, whereas the Euclidean distance does not take this spread into account, as previously discussed.

Figure 3.19: Recognition accuracy using the Euclidean distance and the Mahalanobis distance across both databases.

The Number of Eigenfaces Used

Consider the plot given in figure 3.19. Generally, across both databases and distance measures, there is a large increase in the recognition accuracy of the method between using 10 and 20 Eigenfaces. Then, as the number of Eigenfaces used increases beyond 20, the recognition accuracy tends to increase only very gradually. Further note that between 10 and 20 Eigenfaces there is generally a large decrease in the standard deviation of the results; the SD then tends to gradually increase again as the number of Eigenfaces used increases. The SD is a measure of the precision of the method. Hence we wish to select the number of Eigenfaces such that the SD is minimised whilst the recognition accuracy is maximised, in addition to minimising the computational requirements of the procedure. After 20 Eigenfaces there is a gradual increase in the recognition accuracy of the method, but the procedure also becomes less precise as the SD gradually increases. This suggests that the slight increase in recognition accuracy is not significant enough to warrant the additional computational requirements and the reduction in precision as the number of Eigenfaces increases beyond 20. This indicates that the optimal number of Eigenfaces is approximately 20, meaning that with 20 Eigenfaces we capture the majority of the variation necessary for classification.

To investigate this further, the experiment was repeated for 16, 18, 20, 22 and 24 Eigenfaces. The results are shown in the plot given in figure 3.20. Observing the graph, we can draw further conclusions about the optimal number of Eigenfaces with regard to the recognition accuracy of the method. We need to employ the principle of parsimony, as we wish to maximise the recognition accuracy whilst minimising the number of Eigenfaces used and hence the dimensionality of the new feature vectors. Note that, generally, as the number of Eigenfaces used increases above 20 the recognition accuracy actually tends to decrease slightly, so the method becomes less accurate; this can be observed for both databases and distance measures. Figure 3.20 illustrates this dip just after 20 Eigenfaces. This suggests that there is noise present in the data sets, that is, the principal components after number 20 represent unwanted sources of variation, i.e. those that are not useful for discriminating purposes. After this slight decrease at 20 Eigenfaces there is a gradual upwards trend. Hence, using the principle of parsimony to select the smallest number of Eigenfaces for an optimal recognition accuracy, the optimal number of Eigenfaces appears to be approximately 20 for both databases. This consequently means we are able to fairly accurately represent and recognise each face image within the face databases investigated using just the first 20 Eigenfaces.

Figure 3.20: Investigating the optimal number of Eigenfaces for recognition.

Another way to consider the optimal number of Eigenfaces is to look at the proportion of the total variation a certain subset of the Eigenfaces retains. This can be assessed through the use of scree plots, which plot the variation explained by each Eigenface. Observe the scree plots given for each database in figure 3.21.

Figure 3.21: Scree plots considering the amount of variation represented in the first 100 Eigenfaces. (a) Scree plot for faces94. (b) Scree plot for faces95.

These scree plots clearly show that, for both databases, the vast majority of the variation is accounted for by the first 20 Eigenfaces, after which the plot levels off. This means the inclusion of extra Eigenfaces beyond the 20th does not significantly increase the amount of variation retained. Hence, using the widely used rule of thumb discussed previously, we should take approximately the first 20 Eigenfaces.

The conclusions drawn from these scree plots are supported by the plots given in figure 3.22, which show the number of Eigenfaces used against the percentage of the total variation that number of Eigenfaces represents. The plots confirm that taking the first 20 Eigenfaces appears to be optimal in terms of retaining the majority of the variation present in the original data set whilst also minimising the number of Eigenfaces used. On both plots 20 Eigenfaces is approximately the turning point of the line, which means the inclusion of extra Eigenfaces beyond 20 does not significantly increase the proportion of variation explained. Consequently, this supports the conclusions drawn from the percentage recognition accuracy plot: at approximately 20 Eigenfaces we have retained the majority of the variation useful for classification, hence achieving a high recognition accuracy.

Figure 3.22: Percentage of variation retained in using the first m Eigenfaces. (a) faces94. (b) faces95.

A sketch of how these variance proportions can be computed from the eigenvalues is given below.
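The scree plot and cumulative variance curves of figures 3.21 and 3.22 can be produced directly from the eigenvalues. The sketch below assumes the vector l of non-zero eigenvalues from the training stage; it is illustrative rather than the report's plotting code.

```matlab
% Minimal sketch: proportion of variation per Eigenface (scree plot) and the
% cumulative percentage of variation retained by the first m Eigenfaces.
propVar = l / sum(l);                         % variation explained by each Eigenface
cumVar  = 100 * cumsum(propVar);              % cumulative percentage retained

figure;
subplot(1, 2, 1);
plot(propVar(1:min(100, numel(l))), '-o');    % scree plot over the first 100
xlabel('Eigenface number'); ylabel('Proportion of variation');
subplot(1, 2, 2);
plot(cumVar, '-');
xlabel('Number of Eigenfaces m'); ylabel('% of total variation retained');
```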

faces94 vs faces95

Across both distance measures and all numbers of Eigenfaces investigated, the faces95 database has significantly lower recognition accuracy percentages and higher SDs than the faces94 database. We know that the faces94 database contains very little variation due to lighting and pose, whereas the faces95 database contains significant variation from these sources; this lighting variation is caused by shadowing as the individual moves backwards and forwards. The lower recognition accuracies on the faces95 set suggest that the recognition method fails when there is significant within class variation in the set of images belonging to one individual, i.e. variation due to pose, lighting, etc. This is further confirmed by the significantly larger SDs for the faces95 set in comparison to the faces94 set: the larger SDs indicate a lack of precision in the results produced from this database, as there are large error bounds on the recognition accuracies. This indicates a major limitation of the Eigenface method, namely that it is unable to deal with large variations between images of the same individual, such as those due to pose and lighting.

To investigate this further, we created two more testing sets using images from the faces95 database of individuals present in the training set, but taking images of these individuals which were not used in the training set. Each testing set contains one image of each of the 50 individuals. The images in testing set one have been selected such that there is very limited variation due to lighting and pose in comparison to those of the training set, whilst the images in testing set two have been selected such that there is significant variation due to these sources.¹ Examples of such images are given in figure 3.23. The same experiment was run again and the average recognition accuracy results are given below:

No. Eigenfaces    Euclidean Set 1    Euclidean Set 2    Mahalanobis Set 1    Mahalanobis Set 2
10                %                  32%                74%                  44%
20                74%                48%                82%                  52%
30                79%                46%                86%                  40%
40                84%                48%                88%                  46%

¹ Note that the selection of the two testing sets based on the lighting and pose variation present is obviously subjective. However, the author has used their best judgment to select the testing sets based on the specified criteria.

Figure 3.23: Difference in lighting and pose of the image from set 2 compared with the image of the same individual in set 1. (a) Image used in the training set. (b) Image from set 1. (c) Image from set 2.

Observe that for test set 2, in which there is significant lighting and pose variation compared to the training set, the recognition accuracy is consistently lower than for test set 1. This is true across all numbers of Eigenfaces, and consequently indicates that the Eigenface method fails under large variations in lighting and pose.

Disadvantage of the Eigenface Method

We have seen how the Eigenface method is optimal in terms of its ability to represent and recognise face images under idealised conditions. However, the robustness of the method is limited by the fact that it fails for large variations in lighting and pose. A key disadvantage of the Eigenface approach is that the method does not distinguish between the different roles of the variation between images. That is, the method does not differentiate the between class variation, i.e. that due to a change of individual in the image, from the within class variation, i.e. variation that is not due to a change of individual, such as that caused by pose and lighting. The method treats both sources of variation equally [20, 21].

The Eigenface method is consequently an unsupervised technique, as it does not include any information about which class each observation in the data set belongs to. This is the second limitation, discussed previously, of using PCA for feature selection: we do not utilise the label information of the data set, which is equivalent to a loss of information. So, in using PCA to obtain the projection that maximises the total variation, the method selects the most expressive features (MEFs) [17]. This means the within class variation, for example that due to lighting and pose, is retained. Observe figure 3.24, and note the large difference between images of the same individual with the same facial expression and pose but different lighting.

Figure 3.24: Same individual from the faces95 database viewed under different lighting conditions. (a) Image one. (b) Image two.

These variations may be irrelevant for determining how the classes are divided [27]. This reflects the first limitation of PCA we discussed, namely the assumption that large variances are important, which is not always true. We have seen in the experiments conducted how the Eigenface method's recognition accuracy is greatly affected by variation due to lighting. This means that, whilst the Eigenface method is optimal in terms of representation and reconstruction of face images, it is not optimal for discriminating one face class from another.

Thus, in forming the low dimensional feature space, it would be useful to look for feature vectors that clearly discriminate between the classes, that is, to select the most discriminant features (MDFs). It would therefore be useful to classify the sources of variation according to whether they are due to a change of individual present in the image or due to other sources such as lighting and pose. This naturally leads us to the notion of discriminant analysis, in which we seek an improved dimensionality reduction method whose feature space best discriminates between images of different individuals.

Chapter 4

Linear Discriminant Analysis

4.1 Motivation

The objective of linear discriminant analysis (LDA) is to perform a dimensionality reduction whilst preserving as much class discriminatory information as possible. Using LDA, we wish to select a set of feature vectors, called discriminants, so that a set of face images can be represented in a low dimensional feature space in a way that best separates the classes [22]. The resulting feature space is called the discriminant space. Note the difference between LDA and PCA: PCA aims to find the subspace of feature vectors corresponding to the directions of maximal variance in the original space, taking no account of the class information, whereas LDA uses the class information of each observation to find the set of discriminants that best discriminates between the classes. So in using LDA it is hoped we will be able to better discriminate between individuals and hence improve our face recognition success.

The variation within classes lies in a linear subspace of the original image space [20]. We assume that the classes are convex and linearly separable. This is a valid assumption with regard to the problem of face recognition, as we take each individual within the training set to be distinct from every other. We can consequently use LDA to perform a dimensionality reducing linear projection to the low dimensional discriminant space, whilst still preserving this linear separability. LDA is thus called a class specific linear projection [20]. We hope that in this new discriminant space we will be able to better discriminate between classes and hence classify new images. We shall begin by considering the ideas and concepts behind LDA using the two class case, and shall later generalise to the multiclass situation.

4.2 Two Class Case

Assume we have a training set consisting of M images of just g = 2 individuals. Say we have $M_1$ images belonging to the first individual, $\{x_i\}_{i=1}^{M_1}$, and $M_2$ images belonging to the second individual, $\{x_i\}_{i=1}^{M_2}$, where $M_1 + M_2 = M$. We seek to obtain a new scalar $y$ by projecting each of the images in this training set onto a single line given by:

$$y = w^T x \qquad (4.1)$$

We wish to select this line such that it maximises the separability of the two classes in this new one dimensional space, i.e. such that it maximises the separability of the resulting scalars. But how do we select such a line? In order to find a good projection we need to specify exactly what defines a good measure of separation between the projections. A sensible suggestion would be to select the line that best separates the group means. Let $\mu_1$ and $\mu_2$ be the means of classes one and two respectively. Then we need to look for the direction $w$ that ensures that the means of the projected points, $\hat{\mu}_1 = w^T\mu_1$ and $\hat{\mu}_2 = w^T\mu_2$, are separated as much as possible.

That is, we wish to maximise the distance:

$$|\hat{\mu}_1 - \hat{\mu}_2| = |w^T(\mu_1 - \mu_2)| \qquad (4.2)$$

Observe the example of the two class case with two dimensional observations given in figure 4.1, in which we consider two possibilities for this line. As this figure shows, maximising the distance between the projected means is not always optimal for discrimination between the classes. This is because, in using this criterion, we do not take into account the variability within each class. We therefore need to specify an alternative criterion to define the optimal line. This leads to Fisher's idea.

Figure 4.1: The direction that maximises the distance between the projected means is not optimal for discrimination between classes [23].

Fisher's Idea

Fisher's suggestion [25] was to look for the linear projection $w^T x$ which maximises the between class variation whilst also minimising the within class variation. This means we need to search for the direction that maximises the distance between the projected means whilst also minimising the scatter within the classes. This is shown in figure 4.2, which illustrates how the projection found according to Fisher's criterion is much better for discriminating between the classes.

Figure 4.2: The direction that maximises the distance between the projected means whilst minimising the scatter within the groups [23].

We therefore aim to find the projection whereby observations from the same class are projected into the discriminant space in such a way that they are close to each other, whilst at the same time the projected means of different classes are as far apart as possible. The resulting method is called Fisher's LDA [16].

So, according to Fisher's suggestion, we need to maximise the distance between the projected class means, given by:

$$(\hat{\mu}_2 - \hat{\mu}_1)^2 = (w^T\mu_2 - w^T\mu_1)^2 = w^T(\mu_2 - \mu_1)(\mu_2 - \mu_1)^T w = w^T S_B w \qquad (4.3)$$

Hence we need to select $w$ to maximise this expression, where $S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$ is defined to be the between class scatter of the original image space.

Further, according to Fisher's suggestion, we also need to minimise the within class scatter. The within class covariance matrices of the original observations for classes one and two are denoted by $\Sigma_1$ and $\Sigma_2$ respectively, and the covariance matrices of the projected points by $\hat{\Sigma}_1$ and $\hat{\Sigma}_2$. The overall within class scatter is equal to the sum of the individual class scatters, hence we aim to minimise:

$$\hat{\Sigma}_1 + \hat{\Sigma}_2 = w^T\Sigma_1 w + w^T\Sigma_2 w = w^T(\Sigma_1 + \Sigma_2)w = w^T S_W w \qquad (4.4)$$

where $S_W = \Sigma_1 + \Sigma_2$ is defined to be the within class scatter over all classes in the original image space.

Figure 4.3: Illustration of Fisher's discriminant for two classes. We search for a direction $w$ such that the difference between the class means projected onto this direction, $\hat{\mu}_1$ and $\hat{\mu}_2$, is large, and the variance around these means, $\hat{\Sigma}_1$ and $\hat{\Sigma}_2$, is small.

Hence, maximising the distance between the projected class means whilst also minimising the within class variation is equivalent to maximising the ratio [16]:

$$\max_w J(w) = \max_w \frac{w^T S_B w}{w^T S_W w} \qquad (4.5)$$

Note that the denominator of the objective function $J(w)$ is unbounded, since we can select $w$ of any length. We therefore need to impose a constraint on the denominator. As we are only interested in the direction of $w$, its length is not important, and we impose the constraint $w^T S_W w = 1$. This turns the objective into:

$$\max_w\; w^T S_B w \qquad (4.6)$$

subject to:

$$w^T S_W w = 1 \qquad (4.7)$$

The objective is now in the form of a constrained optimisation problem, which we can address through the use of Lagrange multipliers:

$$L(w, \lambda) = w^T S_B w - \lambda(w^T S_W w - 1) \qquad (4.8)$$

Differentiating with respect to $w$ and setting the result equal to zero gives:

$$\frac{\partial L}{\partial w} = 2 S_B w - 2\lambda S_W w = 0 \quad\Longrightarrow\quad S_B w = \lambda S_W w \qquad (4.9)$$

This is now in the form of a generalised eigenvector problem. So, provided $S_W$ is non-singular, the optimal direction $w_{opt}$ is one of the eigenvectors of $S_W^{-1} S_B$. Further note that if we assume the data set to be homoscedastic, i.e. each class has a common covariance matrix, then $S_B$ is a symmetric matrix; subsequently $S_W^{-1} S_B$ will also be a symmetric matrix and hence will have a set of orthogonal eigenvectors. But which of these eigenvectors is optimal?

To consider which eigenvector is optimal we begin by differentiating (4.5) with respect to $w$, which yields:

$$(w^T S_W w)\, S_B w - (w^T S_B w)\, S_W w = 0 \quad\Longrightarrow\quad S_B w = \frac{w^T S_B w}{w^T S_W w}\, S_W w \qquad (4.10)$$

It then follows immediately from the above equation that $w$ must be a generalised eigenvector of (4.9). Further, the quantity $\frac{w^T S_B w}{w^T S_W w}$ in (4.10) is equal to the eigenvalue for the given eigenvector $w$. Thus the eigenvalues are equal to the objective function $J(w)$ given in (4.5), which we seek to maximise. That is, for a given eigenvector $w_i$ with eigenvalue $\lambda_i$:

$$J(w_i) = \frac{w_i^T S_B w_i}{w_i^T S_W w_i} = \lambda_i \qquad (4.11)$$

Hence the optimal direction corresponds to the eigenvector $w_i$ of $S_W^{-1} S_B$ with the largest eigenvalue. Subsequently we define $w_{opt}$ to be the eigenvector with the largest eigenvalue $\lambda$:

$$w_{opt} = \arg\max_i \left\{\frac{w_i^T S_B w_i}{w_i^T S_W w_i}\right\} \qquad (4.12)$$

This optimal direction is known as Fisher's Linear Discriminant. However, for the case of just two classes it is interesting to observe that there exists an even simpler solution [22]. Taking $a, b, c, d$ to be real constants, write the two class means and the direction $w$ as:

$$\mu_1 = \begin{pmatrix} a \\ b \end{pmatrix}, \qquad \mu_2 = \begin{pmatrix} c \\ d \end{pmatrix}, \qquad w = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$$

Then:

$$\mu_2 - \mu_1 = \begin{pmatrix} c - a \\ d - b \end{pmatrix}$$

Further:

$$S_B w = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T w = \begin{pmatrix} c - a \\ d - b \end{pmatrix}(c - a,\; d - b)\begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$$

Thus $S_B w \propto \mu_2 - \mu_1$; that is to say, $S_B w$ is in the same direction as $\mu_2 - \mu_1$. Then, since $S_W^{-1} S_B w = \lambda w$, we get that:

$$S_W^{-1}(\mu_2 - \mu_1) \propto w \qquad (4.13)$$

As we are not interested in the length of $w$ but only in its direction, we can then express the optimal Fisher's Linear Discriminant for the two class case as:

$$w_{opt} = \arg\max_i \left\{\frac{w_i^T S_B w_i}{w_i^T S_W w_i}\right\} \propto S_W^{-1}(\mu_2 - \mu_1) \qquad (4.14)$$

A minimal numerical sketch of this two class solution is given below.
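The sketch below illustrates the two class result (4.13)–(4.14). It assumes two data matrices X1 and X2 whose columns are the observations of each class; in the face recognition setting $S_W$ would be singular for raw images (N far exceeds the number of images), so the sketch should be read as operating on low dimensional or PCA-reduced vectors, not as the report's code.

```matlab
% Minimal sketch: Fisher's Linear Discriminant for two classes, w proportional
% to SW^(-1)(mu2 - mu1). Assumed inputs: X1 (d x M1) and X2 (d x M2), with SW
% invertible (e.g. PCA-reduced feature vectors rather than raw images).
mu1 = mean(X1, 2);
mu2 = mean(X2, 2);
SW  = (X1 - mu1) * (X1 - mu1)' + (X2 - mu2) * (X2 - mu2)';   % within class scatter

w = SW \ (mu2 - mu1);                     % Fisher's Linear Discriminant direction
w = w / norm(w);                          % only the direction matters

y1 = w' * X1;                             % one dimensional projections of class one
y2 = w' * X2;                             % and of class two
```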

Optimality

We shall now formally prove, via a proof by contradiction argument [28], that the eigenvector corresponding to the largest eigenvalue of (4.9) is indeed the optimal solution. Take $w_{opt}$ to be the optimal solution to (4.5), with corresponding eigenvalue $\lambda$. Assume that there is an eigenvector $\tilde{w}$ of (4.9), with corresponding eigenvalue $\tilde{\lambda}$, such that $J(w_{opt}) < \tilde{\lambda}$. Then evaluating the generalised eigenproblem (4.9) at $\tilde{w}$ and multiplying by $\tilde{w}^T$ gives:

$$S_B \tilde{w} = \tilde{\lambda} S_W \tilde{w} \;\Longrightarrow\; \tilde{w}^T S_B \tilde{w} = \tilde{\lambda}\, \tilde{w}^T S_W \tilde{w} \;\Longrightarrow\; \frac{\tilde{w}^T S_B \tilde{w}}{\tilde{w}^T S_W \tilde{w}} = \tilde{\lambda} = J(\tilde{w}) > J(w_{opt})$$

where the last inequality follows from our assumption that $J(w_{opt}) < \tilde{\lambda}$. This contradicts the assumption that $w_{opt}$ is the optimal solution to (4.5). Hence the optimal direction is the eigenvector corresponding to the largest eigenvalue.

4.3 Assumptions

In the formulation of Fisher's LDA, and in further investigation using it, we need to make several assumptions about the training set of face images we shall consider. These assumptions are:

1. Each image must be classified as a member of one of the 2 or more mutually exclusive classes.
2. The data set is homoscedastic.
3. The observations in each class have a unimodal Gaussian distribution and the classes are linearly separable.

Fisher's LDA is thus a parametric method, as we assume a certain distributional form for the data set. Observe that if the second and third assumptions do not hold then the LDA projection will not be able to preserve the structure of the input image space that might be required for classification purposes. Fisher's criterion assumes a homoscedastic unimodal Gaussian distribution of each class in its construction. In the context of a database of face images this Gaussian assumption appears to be valid: images of the same individual do not vary greatly and only experience slight variations due to factors such as lighting and pose. These variations are likely to be similar for every individual in the training set, so the assumption of homoscedasticity also seems reasonable. Hence, if we have a large enough number of images of each individual in the training set, we can assume a homoscedastic Gaussian distribution for each class. Provided these assumptions are satisfied, we can use Fisher's LDA to project the training set of face images to a new space that has maximal class separability.

4.4 Multiclass Case

We shall now generalise Fisher's LDA to the multiclass case in which we have $g$ classes. We now need to search for the linear projection such that the observations from the $g$ different classes are far away from each other and, at the same time, the observations of the same class are close together.

In the two class case the discriminant space is of dimension one, as we defined just one discriminant. However, we now need to generalise to $g$ classes. So how many discriminants are we able to define in this case? Note that $S_B$ is the sum of $g$ matrices, each with a rank of 1 or less, and that the class mean vectors are constrained by the overall mean:

$$\mu = \frac{1}{g}\sum_{i=1}^{g} \mu_i \qquad (4.15)$$

This means that $S_B$ will have rank at most $g - 1$. Subsequently there will be at most $g - 1$ non-zero eigenvalues $\lambda_i$ of $S_W^{-1} S_B$, and hence we can have up to $g - 1$ discriminants. This consequently means we can view Fisher's LDA as a dimensionality reduction method, as we are projecting down from the $N$ dimensional input space into the discriminant space, which is of dimension at most $g - 1$. We aim to do this in such a way as to best summarise the differences between the groups. Note that, compared to the two class example, the projection $y = W^T x$ is now not a scalar; it is of dimension at most $g - 1$. So we aim to find the optimal projection matrix $W_{opt}$ which maximises Fisher's criterion:

$$J(W) = \frac{|W^T S_B W|}{|W^T S_W W|} \qquad (4.16)$$

where $S_B$ is the between class variation:

$$S_B = \sum_{i=1}^{g} M_i (\mu_i - \mu)(\mu_i - \mu)^T \qquad (4.17)$$

and $S_W$ is the within class variation:

$$S_W = \sum_{i=1}^{g} \sum_{x_k \in C_i} (x_k - \mu_i)(x_k - \mu_i)^T \qquad (4.18)$$

with $\mu_i = \frac{1}{M_i}\sum_{x_k \in C_i} x_k$ the average of class $i$. The resulting discriminant space created from this projection will be such that all $g$ classes are best separated.

We now seek the optimal projection matrix $W_{opt}$ that maximises this criterion. Note that, as $S_W$ and $S_B$ are symmetric positive semi-definite matrices, the criterion is in the form of a generalised Rayleigh quotient [28]. We shall begin by considering just one column of $W$, given by the vector $w$. Examining the objective for this single vector gives rise to the same objective as in the two class case; thus, as we have shown, it can be turned into a standard eigenproblem. Further, the optimal vector $w$ that maximises this objective is the eigenvector corresponding to the largest eigenvalue of the generalised eigenproblem:

$$S_B w = \lambda S_W w \qquad (4.19)$$

So, provided $S_W$ is non-singular, the Rayleigh quotient is maximised at the largest eigenvalue of (4.19). This result is also given in a more formal manner in the Theorem found in Appendix C [16], which states that a Rayleigh quotient can always be turned into a standard eigenproblem and is thus maximised by the eigenvector corresponding to the largest eigenvalue. This then gives the first Fisher discriminant as the eigenvector of $S_W^{-1} S_B$ corresponding to the largest eigenvalue. As stated before, $S_W^{-1} S_B$ is a symmetric matrix and thus has orthogonal eigenvectors. The second Fisher discriminant is then defined to be in the direction that maximises Fisher's criterion subject to the constraint that it is orthogonal to the first Fisher discriminant; this corresponds to the eigenvector with the second largest eigenvalue. In general, the $k$th Fisher discriminant is in the direction that maximises Fisher's criterion subject to the constraint that it is orthogonal to all $k - 1$ previous discriminants, and thus corresponds to the eigenvector with the $k$th largest eigenvalue.

Hence $W_{opt}$ is the matrix whose columns are formed from the eigenvectors corresponding to the largest non-zero eigenvalues of the generalised eigenproblem:

$$W_{opt} = [w_{opt,1}, \ldots, w_{opt,g-1}] = \arg\max_W \left\{\frac{|W^T S_B W|}{|W^T S_W W|}\right\}, \qquad (S_W^{-1} S_B - \lambda_i I)\, w_{opt,i} = 0 \qquad (4.20)$$

Thus $W_{opt}$ can have at most $g - 1$ columns. Subsequently, the linear projection using the optimal projection matrix $W_{opt}$ can be viewed as a coordinate rotation of our original image space such that the new axes are aligned with the directions of maximal class separability.

4.5 Connection to the Least Squares Approach

We shall now take a slight detour to examine an interesting comparison: Fisher's LDA bears a strong connection to a least squares approach to discrimination [28]. Considering a least squares approach means we are looking for a linear discriminant function of the form:

$$f(x) = W^T x + b \qquad (4.21)$$

This linear discriminant function can be evaluated for an observation $x_i$ to give its discriminant score $f(x_i)$, which defines how the observation is described in the discriminant space. We then define a discriminant rule to classify an observation to one of the classes based on the value of its discriminant score. We aim to construct this classification rule in such a way as to minimise the sum of squares error between how the function classifies each observation in the training set and the actual class each one belongs to. We define $z_i \in (C_1, \ldots, C_g)$ for observation $x_i$ to denote the actual class the observation belongs to; for example, if the observation $x_i$ belongs to the first class then $z_i = C_1$. Hence this is a linear least squares approach in which we aim to minimise the sum of squares error over all data points:

$$E(w, b) = \sum_{i=1}^{M} (f(x_i) - z_i)^2 = \sum_{i=1}^{M} (W^T x_i + b - z_i)^2 \qquad (4.22)$$

To examine this further we return to the simple case of just two classes, $C_1$ and $C_2$, where there are $M = M_1 + M_2$ observations in total, $M_1$ belonging to class one and $M_2$ belonging to class two. We assign a label to each sample within the training set, with label $-1$ for class one and label $+1$ for class two. Hence the least squares problem $\min_{w,b} E(w, b)$ can be written in matrix form:

$$\min_{w,b}\left\| \begin{bmatrix} X_1^T & \mathbf{1}_1 \\ X_2^T & \mathbf{1}_2 \end{bmatrix}\begin{bmatrix} w \\ b \end{bmatrix} - \begin{bmatrix} -\mathbf{1}_1 \\ \mathbf{1}_2 \end{bmatrix} \right\|^2 \qquad (4.23)$$

where $X = [X_1, X_2]$ is a matrix containing all of the observations within the training set, partitioned according to the classes (labels $\mp 1$), and $\mathbf{1}_i$ is a vector of ones of the corresponding length.

Thus the solution to this least squares problem is of the form $\min_x \|Ax - b\|^2$ and can be computed using the generalised inverse of the matrix $A$: if $x^*$ is such a solution then $x^* = A^+ b = (A^T A)^{-1} A^T b$, assuming that $A^T A$ is non-singular. A necessary and sufficient condition for $x^*$ to be a solution of the least squares problem is that $(A^T A)x^* = A^T b$, the normal equations. Applying this to (4.23) gives:

$$\begin{bmatrix} X_1 & X_2 \\ \mathbf{1}_1^T & \mathbf{1}_2^T \end{bmatrix}\begin{bmatrix} X_1^T & \mathbf{1}_1 \\ X_2^T & \mathbf{1}_2 \end{bmatrix}\begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} X_1 & X_2 \\ \mathbf{1}_1^T & \mathbf{1}_2^T \end{bmatrix}\begin{bmatrix} -\mathbf{1}_1 \\ \mathbf{1}_2 \end{bmatrix} \qquad (4.24)$$

Multiplying out these matrices and using the definitions of the sample means and within class scatter given above:

$$\begin{bmatrix} S_W + M_1\mu_1\mu_1^T + M_2\mu_2\mu_2^T & M_1\mu_1 + M_2\mu_2 \\ (M_1\mu_1 + M_2\mu_2)^T & M_1 + M_2 \end{bmatrix}\begin{bmatrix} w \\ b \end{bmatrix} = \begin{bmatrix} M_2\mu_2 - M_1\mu_1 \\ M_2 - M_1 \end{bmatrix} \qquad (4.25)$$

Using the second row of the above to solve for $b$ yields:

$$b = \frac{(M_2 - M_1) - (M_1\mu_1 + M_2\mu_2)^T w}{M_1 + M_2} \qquad (4.26)$$

Then substituting this expression for $b$ into the first equation in (4.25) and algebraically manipulating yields:

$$\left(S_W + M_1\mu_1\mu_1^T + M_2\mu_2\mu_2^T - \frac{(M_1\mu_1 + M_2\mu_2)(M_1\mu_1 + M_2\mu_2)^T}{M_1 + M_2}\right)w = M_1\mu_1 - M_2\mu_2 - \frac{M_1 - M_2}{M_1 + M_2}(M_1\mu_1 + M_2\mu_2)$$

$$\left(S_W + \frac{M_1M_2}{M_1 + M_2}S_B\right)w - \frac{2M_1M_2}{M_1 + M_2}(\mu_1 - \mu_2) = 0 \qquad (4.27)$$

Note that $S_B w$ is always in the direction of $(\mu_1 - \mu_2)$, since $S_B w = \big((\mu_1 - \mu_2)^T w\big)(\mu_1 - \mu_2)$. There then exists a scalar $\alpha \in \mathbb{R}$ such that:

$$\frac{2M_1M_2}{M_1 + M_2}(\mu_1 - \mu_2) - \frac{M_1M_2}{M_1 + M_2}S_B w = \alpha(\mu_1 - \mu_2) \qquad (4.28)$$

Then substituting (4.28) into (4.27) yields:

$$S_W w = \alpha(\mu_1 - \mu_2) \quad\Longrightarrow\quad w = \alpha S_W^{-1}(\mu_1 - \mu_2) \qquad (4.29)$$

This consequently shows that the solution to the least squares problem is in the same direction as the solution to Fisher's linear discriminant analysis. The two solutions will in general be of different lengths but, as already noted, we are only interested in the direction of $w$; hence the two solutions are essentially identical.

4.6 Connection to the Bayesian Approach

Fisher's discriminants are also closely linked to the Bayesian approach to the discrimination problem using the Bayes classifier [18]. This is under the assumption that the two classes each have a Gaussian distribution and that the covariance matrices are equal, $\Sigma_1 = \Sigma_2 = \Sigma$. We shall examine this connection using the two class case, but it is easily generalised to the multiclass case. Define $z_i = 1$ to specify that the corresponding observation $x_i$, $i = 1,\ldots,M_1$, belongs to class one, and $z_j = -1$ to specify that the corresponding $x_j$, $j = 1,\ldots,M_2$, belongs to class two. Then the class conditional probability densities for each class are given by:

$$P(x \mid z=1) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu_1)^T\Sigma^{-1}(x - \mu_1)\right\} \qquad (4.30)$$

$$P(x \mid z=-1) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(x - \mu_2)^T\Sigma^{-1}(x - \mu_2)\right\} \qquad (4.31)$$

We shall then use Bayes' Theorem to calculate the posterior probability of a class given an observation $x$. Considering class one:

$$p(z=1 \mid x) = \frac{P(x \mid z=1)P(z=1)}{P(x \mid z=1)P(z=1) + P(x \mid z=-1)P(z=-1)} \qquad (4.32)$$

which we can rewrite in a more compact form:

$$p(z=1 \mid x) = \frac{1}{1 + \exp(-a)} \qquad (4.33)$$

where:

$$a = \log\left(\frac{P(x \mid z=1)P(z=1)}{P(x \mid z=-1)P(z=-1)}\right) \qquad (4.34)$$

Then substituting the class conditional probability densities (4.30) and (4.31) into the above gives:

$$\begin{aligned}
a &= \log\left(\frac{P(z=1)}{P(z=-1)}\right) - \frac{1}{2}(x - \mu_1)^T\Sigma^{-1}(x - \mu_1) + \frac{1}{2}(x - \mu_2)^T\Sigma^{-1}(x - \mu_2) \\
&= \log\left(\frac{P(z=1)}{P(z=-1)}\right) - \frac{1}{2}\left(x^T\Sigma^{-1}x - x^T\Sigma^{-1}\mu_1 - \mu_1^T\Sigma^{-1}x + \mu_1^T\Sigma^{-1}\mu_1\right) + \frac{1}{2}\left(x^T\Sigma^{-1}x - x^T\Sigma^{-1}\mu_2 - \mu_2^T\Sigma^{-1}x + \mu_2^T\Sigma^{-1}\mu_2\right) \\
&= \log\left(\frac{P(z=1)}{P(z=-1)}\right) + x^T\Sigma^{-1}\mu_1 - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 - x^T\Sigma^{-1}\mu_2 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 \\
&= w^T x + b
\end{aligned}$$

defining:

$$w = \Sigma^{-1}(\mu_1 - \mu_2) \qquad (4.35)$$

and

$$b = \log\left(\frac{P(z=1)}{P(z=-1)}\right) - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 \qquad (4.36)$$

The posterior probability therefore depends on $x$ only through the linear function $w^T x + b$, and this direction $w$ is the same, up to a scaling factor, as the optimal direction found using Fisher's LDA.

4.7 Comparison of PCA and Fisher's LDA

Both Fisher's LDA and PCA use a linear projection as a means of dimension reduction. However, due to the differing nature of their constructions, the resulting space produced by each projection can be very different [24]. We will examine a simple example to visualise the difference between PCA and Fisher's LDA for discriminatory purposes. Consider a training set with $g = 4$ classes, where each observation is in $N = 2$ dimensions. Assume that the 4 classes are linearly separable and that each can be represented by a homoscedastic unimodal Gaussian distribution; subsequently the assumptions of Fisher's LDA are satisfied and the training set can be seen to lie in a linear subspace. Applying PCA to such a training set will search for the vector with the largest associated variance, whereas applying Fisher's LDA will search for the vector that best discriminates between the classes. Consider the plot shown in figure 4.4, which gives the data set with the direction of the first Fisher's linear discriminant plotted. Then consider the linear projection of the training set from the two dimensional space to the one dimensional space defined by the feature vector found by PCA and by the discriminant vector found by Fisher's LDA. These projections are given in figure 4.5. Comparing the two projections one can see that PCA merges the 4 classes together, so they are no longer separable in the projected space; such a projection is not useful for discriminating between the 4 classes. However, Fisher's LDA achieves a better between class scatter, as the 4 classes are fairly well separated in the one dimensional discriminant space. Hence the projection produced according to Fisher's LDA is better for discrimination between the classes. This suggests we can use Fisher's LDA as a dimensionality reduction scheme to develop an improved approach to the problem of face recognition.
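The comparison described above is easy to reproduce on synthetic data. The following MATLAB sketch (using the Statistics and Machine Learning Toolbox for mvnrnd and pca) generates four homoscedastic Gaussian classes in two dimensions and computes both one-dimensional projections; the class means and covariance are assumptions chosen purely for illustration, not the data behind figures 4.4 and 4.5.

```matlab
% Sketch: first principal component vs first Fisher discriminant on
% synthetic 2D data with g = 4 homoscedastic Gaussian classes.
rng(1);
mu = [0 0; 2 1; 4 2; 6 3];                      % illustrative class means
X = []; labels = [];
for i = 1:4
    Xi = mvnrnd(mu(i,:), 0.05*eye(2), 50);      % 50 points per class, common covariance
    X = [X; Xi];  labels = [labels; i*ones(50,1)];
end
coeff = pca(X);  pc1 = coeff(:,1);              % largest-variance direction
m = mean(X)';  Sw = zeros(2);  Sb = zeros(2);
for i = 1:4
    Xi = X(labels == i, :);  mi = mean(Xi)';
    Sw = Sw + (Xi' - mi)*(Xi' - mi)';
    Sb = Sb + size(Xi,1)*(mi - m)*(mi - m)';
end
[V, D] = eig(Sb, Sw);  [~, k] = max(real(diag(D)));  fld1 = V(:,k);
projPCA = X*pc1;  projLDA = X*fld1;             % compare the class separation of the two projections
```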

Figure 4.4: Data set consisting of 4 classes, with the direction of the first Fisher's linear discriminant plotted.

Figure 4.5: Linear projection according to PCA and Fisher's LDA. (a) Projection onto the first principal component. (b) Projection onto the first Fisher's linear discriminant.

4.8 Classification

So we can use Fisher's LDA to find an optimal projection matrix with which to linearly project the training set of face images to a new lower dimensional space, defined such that there is maximal class separability. We shall call this space the discriminant space. In this discriminant space we can then seek to define a discriminant rule to classify a new image as one of the $g$ known individuals. This means we can see Fisher's LDA as a two step process:

1. First the Fisher's linear discriminant variables $w_i$ are found according to Fisher's criterion, thus describing the discriminant space.

2. Then, within this discriminant space, we apply a discriminant rule to allocate an unknown observation to one of the known classes.

We have so far completed the first step, as we have used Fisher's LDA to find the discriminants. We are then able to project our training set into the new discriminant space, defined such that there is maximal separability between classes. We now need to investigate the second step by considering an image not in this training set, for which we need to formulate a discriminant rule to classify the image as one of the $g$ known individuals. Fisher proposed the nearest neighbour classification rule [25] for such a discriminant rule.

4.8.1 Nearest Neighbour Discriminant Rule

The discriminant space is defined in such a way that the class means have maximal separation whilst all observations are of minimal distance from their corresponding class mean. Fisher therefore proposed [25] a sensible rule for classification: an observation $x$ is classified to the class for which the distance between the class mean and the observation in this discriminant space is minimised. The resulting discriminant rule is thus an example of nearest neighbour classification, which we shall refer to as the Nearest Neighbour discriminant rule. In the discriminant space the discriminants form an orthogonal set. Further, we have assumed homoscedasticity of the classes, thus all class covariance matrices in this discriminant space are equal and symmetric. Hence in this space comparing the Mahalanobis distance is equivalent to comparing the Euclidean distance, as for comparison purposes we can simply ignore the multiplication by the inverse of the common covariance matrix. Thus we can use the Euclidean distance measure along with the Nearest Neighbour discriminant rule to classify an image. This means an observation $x_k$ is allocated to the class whose projected mean is closest to its discriminant vector $W^T x_k$ in the discriminant space. That is to say, we classify the image $x_k$ as the $i$-th individual if:

$$\|W^T x_k - \hat{\mu}_i\| < \|W^T x_k - \hat{\mu}_j\| \quad \forall\, j \neq i \qquad (4.37)$$

where the distances are Euclidean distances and $\hat{\mu}_i$ is the mean of the $i$-th class in the discriminant space.

4.8.2 Connection to Maximum Likelihood Discriminant Rule

Fisher's LDA is based on the assumption that the $g$ classes have homoscedastic multivariate Gaussian distributions. This assumption provides the basis for the validity of using a nearest neighbour approach to classification in such a discriminant space. However, if we have a sufficiently large number of images of each individual within the training set then it is possible to estimate the parameters that define each of these distributions with a high degree of accuracy [16]. We are then able to use a Maximum Likelihood rule to classify a new image as one of the $g$ groups. We shall further show that this Maximum Likelihood rule, under certain assumptions, has a strong connection to the Nearest Neighbour discriminant rule previously described. The likelihood of the observation $x$ in class $C_j$ is just the probability of an observation $x$ given class $C_j$. Hence, to emphasise that we are thinking of the likelihood of an observation $x$ as a function of the class parameter $j$, we shall write the probability density function (pdf) of the $j$-th class as $f_j(x) = L_j(x)$. Then, using these likelihoods defined for all classes $j = 1,\ldots,g$, we are able to define a Maximum Likelihood discriminant rule to discriminate between the classes. The Maximum Likelihood discriminant rule (ML rule) for allocating an observation $x$ to one of the classes $C_1,\ldots,C_g$ is defined such that $x$ is allocated to the class which gives the largest likelihood to $x$. This means that we allocate an observation $x$ to the class $C_j$ which satisfies:

$$L_j(x) = \max_{i=1,\ldots,g} L_i(x) \qquad (4.38)$$

The construction of the Fisher's LDA approach was based on the key assumption that the classes are given by homoscedastic multivariate normal distributions.
So the likelihood of an observation belonging to the $i$-th class in the discriminant space is given by:

$$L_i(x) = \frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}\exp\left\{-\frac{1}{2}(W^T x - \hat{\mu}_i)^T\Sigma^{-1}(W^T x - \hat{\mu}_i)\right\} \qquad (4.39)$$

with $\Sigma$ being the common covariance matrix for all $g$ classes. We can then use the ML rule to allocate an observation to the class to which it has the greatest likelihood of belonging. The result of this classification is given by the following Theorem.

Theorem 1. Maximum Likelihood Discrimination Theorem
If $C_i$ is the class population with an $N$-dimensional multivariate normal distribution $N_N(\mu_i, \Sigma)$, $i = 1,\ldots,g$, and $\Sigma > 0$, then the maximum likelihood discriminant rule allocates $x$ to the class $C_j$, where $j$ is the value of $i$ that minimises the square of the Mahalanobis distance:

$$(x - \mu_i)^T\Sigma^{-1}(x - \mu_i)$$

When $g = 2$, the ML rule allocates $x$ to class $C_1$ if:

$$\alpha^T(x - \mu) > 0$$

where $\alpha = \Sigma^{-1}(\mu_1 - \mu_2)$ and $\mu = \frac{1}{2}(\mu_1 + \mu_2)$, and the observation is allocated to class $C_2$ otherwise.

Proof: As the observations are from a multivariate normal distribution, the likelihood of $x$ being in class $C_i$ is:

$$L_i(x) = \underbrace{\frac{1}{(2\pi)^{N/2}|\Sigma|^{1/2}}}_{\text{constant for all groups}}\exp\left\{-\frac{1}{2}(x - \mu_i)^T\Sigma^{-1}(x - \mu_i)\right\}$$

The ML rule maximises this likelihood over all possible classes. Hence $x$ is allocated to the class $C_i$ with the largest likelihood, which is equivalent to selecting the class $C_i$ for which the exponent is minimised. Hence we wish to minimise the Mahalanobis distance $(x - \mu_i)^T\Sigma^{-1}(x - \mu_i)$, thus proving the first part of the theorem. For the second part of the theorem, in which there are only two classes, first note that $L_1(x) > L_2(x)$ if and only if:

$$(x - \mu_1)^T\Sigma^{-1}(x - \mu_1) < (x - \mu_2)^T\Sigma^{-1}(x - \mu_2)$$

Then cancelling and simplifying this expression gives the result.

Observe that in the discriminant space the Theorem states that we must allocate an observation $x_k$ to the $i$-th class for which the Mahalanobis distance between this observation and the mean of the $i$-th class is minimised. Hence we wish to minimise:

$$(W^T x_k - \hat{\mu}_i)^T\Sigma^{-1}(W^T x_k - \hat{\mu}_i) \qquad (4.40)$$

We have made the assumption of homoscedasticity, that is, that all classes have a common covariance matrix $\Sigma$, which is defined in the PCA reduced space. It is then interesting to note that when the common covariance matrix $\Sigma$ is a scalar multiple of the identity matrix, the distance we are seeking to minimise is exactly the Euclidean distance between the observation and each class mean. In order to achieve such a covariance matrix we would need to impose further extreme assumptions, namely that in the PCA reduced space the variables are uncorrelated and each variable has a common variance, in addition to that of homoscedasticity. This is obviously fairly implausible in the context of face images; however, considering such an extreme case enables us to make an interesting comparison with the Nearest Neighbour discriminant rule. Selecting the class that minimises (4.40), when these strict assumptions hold, means we can just compare the Euclidean distance between the observation and each class mean, selecting the $i$-th class that minimises:

$$(W^T x_k - \hat{\mu}_i)^T(W^T x_k - \hat{\mu}_i) \qquad (4.41)$$

That is, we allocate the observation $x_k$ to the $i$-th class if:

$$(W^T x_k - \hat{\mu}_i)^T(W^T x_k - \hat{\mu}_i) < (W^T x_k - \hat{\mu}_j)^T(W^T x_k - \hat{\mu}_j) \quad j = 1,\ldots,g,\ j \neq i \qquad (4.42)$$

This ML rule is then equivalent to the Nearest Neighbour discriminant rule given by (4.37).
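To make the rule concrete, the following MATLAB fragment sketches the nearest-class-mean allocation of (4.37), with the Mahalanobis form (4.40) indicated as a comment; the variables W, classMeans, Sigma and x are assumed to be available and are not taken from the Appendix code.

```matlab
% Sketch of the Nearest Neighbour discriminant rule (4.37) in the discriminant space.
% Assumes W (projection matrix), classMeans (m x g matrix of projected class means)
% and x (a test image as a column vector) are already defined.
y = W' * x;                                   % discriminant vector of the test image
d = sum((classMeans - y).^2, 1);              % squared Euclidean distance to each class mean
[~, assignedClass] = min(d);                  % allocate to the nearest class mean
% Mahalanobis version of (4.40), with a common covariance Sigma in this space:
% d(i) = (classMeans(:,i) - y)' * (Sigma \ (classMeans(:,i) - y));
```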

4.8.3 Connection to the Bayes Discriminant Rule

The ML discriminant rule, as formulated above, is a special example of a Bayes discriminant rule in which we have assumed an equal prior probability of an observation belonging to each class. It can be shown that these Bayes discriminant rules are optimal for classification purposes. Subsequently, as we have shown that the Nearest Neighbour rule is equivalent to the ML rule, the Nearest Neighbour discriminant rule proposed by Fisher is also optimal [29]. We shall briefly take a slight detour to consider this optimality and the conditions that ensure it. In certain situations it is suitable to assume that the various class populations have prior probabilities. For example, in the context of face recognition we could be considering a population in which female faces are more likely than male faces; this prior knowledge can then be incorporated into the problem. This additional information can be included in the analysis by using a Bayes discriminant rule. The prior probability of class $i$ is denoted by $P(C_i)$ and the Bayes discriminant rule is: allocate observation $x_k$ to the class $C_i$, $i \in \{1,\ldots,g\}$, which maximises $P(C_i)L_i(x_k)$. The function to be maximised is proportional to the posterior probability of class $C_i$ given the observation. Note that the ML rule is the special case of the Bayes rule in which we assume that the prior probabilities of each class are equal. In the context of the face recognition experiments we are conducting we have assumed this to be valid, because we are considering a random set of individuals from each of the databases and hence can assume we are equally likely to have an image of one individual as of any other. However, if this were not the case, the known prior probabilities of each class could be incorporated to form an improved discriminant rule.

The Bayes discriminant rule, including the ML rule as a special case, has certain optimal properties [16]. To examine these optimal properties it is useful to consider the wider set of all possible discriminant rules. Within this wider set it is possible to average the result of any two discriminant rules; hence, by the definition of convexity, the set of all discriminant rules forms a convex set. To examine this wider set we first need to define a randomised discriminant rule. A randomised discriminant rule $d(x)$ assigns an observation $x$ to a class $C_j$ with probability $\Phi_j(x)$, where $\Phi_1,\ldots,\Phi_g$ are non-negative functions defined on $\mathbb{R}^m$, $m$ being the dimension of the observation $x$, satisfying $\sum_{i=1}^{g}\Phi_i(x) = 1$ for all $x$ [16]. The Bayes discriminant rule is defined by the probabilities:

$$\Phi_j(x) = \begin{cases} 1 & \text{if } P(C_j)L_j(x) = \max_i P(C_i)L_i(x) \\ 0 & \text{otherwise} \end{cases}$$

This then enables us to define a measure of performance for a discriminant rule in assigning observations to the correct class. The probability of allocating an individual to class $i$ when it actually belongs to class $j$ is given by:

$$P_{ij} = \int \Phi_i(x)L_j(x)\,dx \qquad (4.43)$$

Then if the observation $x$ belongs to class $i$, the probability of the discriminant rule correctly classifying this observation is $P_{ii}$ and the probability of incorrect classification is $1 - P_{ii}$. This gives us a way to measure the performance of a discriminant rule, as it is summarised by the values $P_{11},\ldots,P_{gg}$. This means we are able to compare various different rules and select the optimal one.
We say that a discriminant rule $d(x)$, with associated probabilities of correct allocation $\{P_{ii}\}_{i=1}^{g}$, is as good as another rule $d'(x)$, with associated probabilities $\{P'_{ii}\}_{i=1}^{g}$, if:

$$P_{ii} \geq P'_{ii} \quad i = 1,\ldots,g \qquad (4.44)$$

Further, we say that a rule $d(x)$ is better than $d'(x)$ if the inequality is strict for at least one $i = 1,\ldots,g$. If $d(x)$ is a rule for which there is no better rule, then $d(x)$ is said to be admissible. Note that it may not always be possible to compare two discriminant rules using this criterion; take for example two rules with $P_{11} > P'_{11}$ but $P_{22} < P'_{22}$. However, we are able to use this criterion to prove the second optimal property of Bayes rules, namely that they are admissible rules. Consider the following Theorem:

Theorem 2. Optimality of Bayes Rules
All Bayes discriminant rules, including the special case of the ML rule, are admissible.

This theorem states that, out of the set of all possible discriminant rules, the Bayes rules are optimal in terms of their classification performance.

Proof: Let $d^*(x)$ be the Bayes rule with prior probabilities $P(C_i)$, $i = 1,\ldots,g$, and allocation functions $\Phi^*_i$. Assume that there exists a rule, say $d(x)$ with allocation functions $\Phi_i$, that is better than $d^*(x)$. The probabilities of correct classification of $d^*(x)$ and of $d(x)$ are denoted $\{P^*_{ii}\}_{i=1}^{g}$ and $\{P_{ii}\}_{i=1}^{g}$ respectively. According to this assumption there must be at least one $i = 1,\ldots,g$ for which $P_{ii} > P^*_{ii}$. As both discriminant rules use the same prior probabilities for each class, this would give:

$$\sum_{i=1}^{g}P(C_i)P_{ii} > \sum_{i=1}^{g}P(C_i)P^*_{ii}$$

However, consider:

$$\sum_{i=1}^{g}P(C_i)P_{ii} = \sum_{i=1}^{g}\int \Phi_i(x)P(C_i)L_i(x)\,dx \leq \int \sum_{i=1}^{g}\Phi_i(x)\,\max_j P(C_j)L_j(x)\,dx = \int \max_j P(C_j)L_j(x)\,dx = \sum_{i=1}^{g}\int \Phi^*_i(x)P(C_i)L_i(x)\,dx = \sum_{i=1}^{g}P(C_i)P^*_{ii}$$

Hence this contradicts our assumption that the rule $d(x)$ is better than the Bayes rule. Consequently there is no rule that is better than the Bayes rule, and therefore the Bayes rule is admissible.

In summary, we have shown that Bayes discriminant rules are optimal. Further, the ML rule is a special case of a Bayes rule with equal priors, and we have shown that, under certain conditions, the Nearest Neighbour discriminant rule proposed by Fisher is equivalent to such an ML rule. Then, by the optimality of Bayes rules, the Nearest Neighbour discriminant rule is also optimal for classification in the discriminant space as described by Fisher's LDA.

4.8.4 Limitations of Fisher's LDA

In the construction of Fisher's LDA we have separated the sources of variation to select those that are optimal for discriminatory purposes, and we have also incorporated the class information of the observations. Thus we have excluded the first two limitations, as previously described, that inhibit any method based on PCA. However, the use of Fisher's LDA still has its limitations.

LDA can produce at most $g-1$ features, due to the constraint on the rank of $S_B$. If more features are required for classification then some additional method must be applied to provide these extra features.

Fisher's LDA is a parametric method, as it assumes homoscedastic unimodal Gaussian likelihoods. If the distributions are significantly non-Gaussian then the LDA projection will not be able to preserve the structure of the data needed for classification; see figure 4.6 for an example of such a situation.

Fisher's LDA implicitly assumes that the class means are the discriminating factor, not the variations within each class. This means it will fail if this is not the case.

To implement Fisher's LDA we require a training set with more than one observation per class. Observe that in the situation where there is just one observation per class, the total within class scatter for each class is equal to zero. Subsequently the denominator of the Fisher criterion in (4.16) goes to zero, and we are not able to construct a valid Fisher's criterion. This problem can be resolved by using tricks to generate multiple observations belonging to a particular class from a single observation of that class; examples of such tricks include proper geometric or grey-level transformations [30]. We shall assume that we always have a training set in which we have multiple observations for each class.

A major limitation of the procedure is its dependence on $S_W$ being non-singular and thus invertible. However, this is not always feasible, and when we do have such a singularity Fisher's LDA will fail. To guarantee $S_W$ is non-singular we require at least $N + g$ observations in the training set, where $N$ is the dimension of the original images and $g$ is the number of classes. The maximum rank of $S_W$ is otherwise $M - g$, where $M$ is the number of images in the training set. In the context of the face image training set, $N$ is normally very large, as it is the number of pixels in the image, in comparison to the number of images $M$; hence in most situations it is not possible to obtain a training set with at least $N + g$ images [20]. This means that it is possible to choose a matrix $W$ such that the within class scatter of the projected samples can be made exactly zero.

Figure 4.6: Non-Gaussian face classes, for which Fisher's LDA will fail.

4.9 Fisherfaces

In applying Fisher's LDA in the context of the face recognition problem we initially encounter a major obstacle: it is unlikely that our training set will contain $N + g$ or more images. $N$ is the dimension of each image and $g$ is the number of individuals we are considering in our training set, so $N + g$ has the potential to be an extremely large number. It is therefore highly possible for $S_W$, the within class scatter matrix, to be singular, in which case we are unable to implement Fisher's LDA. However, we are able to exploit the fact that the variation within classes lies in a linear subspace of the image space. Thus we know we can perform a dimensionality reduction using a linear projection whilst still preserving this separability. This observation is not only the motivation for Fisher's LDA but can also conveniently be utilised to avoid the problem of a singular $S_W$. That is, we can first use another linear dimensionality reducing projection, to project to a feature space of dimension $M - g$, in which we have ensured $S_W$ cannot be singular whilst still preserving the linear separability of the classes. We can then apply the standard Fisher's LDA in this new feature space without encountering any problems.

Belhumeur et al. (1997) [20] proposed the use of PCA as this first dimensionality reducing projection, thus formulating a two step procedure for the problem of face recognition: the method of Fisherfaces. The Fisherface approach can thus be viewed as a two step method:

1. PCA is first applied to reduce the dimensionality of the training set, projecting from the $N$ dimensional image space to a subspace of dimension $M - g$. This gives the feature vector:

$$u_k = W_{PCA}^T(x_k - \mu) \qquad (4.45)$$

The resulting $S_W$ in this $M - g$ dimensional feature space is non-singular.

2. Fisher's LDA is then applied in this feature space to provide a second linear projection to the discriminant space of dimension $m \leq g - 1$, such that there is optimal class separability. This gives the discriminant vector:

$$y_k = W_{LDA}^T u_k \qquad (4.46)$$

So the Fisherface method is given by the linear projection:

$$y_k = W_{opt}^T(x_k - \mu) \qquad (4.47)$$

where the optimal projection matrix is given by:

$$W_{opt} = W_{PCA}W_{LDA} \qquad (4.48)$$

Step One: PCA

We shall begin by examining the first step, in which PCA is used to reduce the dimensionality of the image space. That is, we seek the feature space of dimension $M - g$ given by:

$$W_{PCA} = \arg\max_{W}|W^T\Sigma W| \qquad (4.49)$$

This corresponds to the first $M - g$ principal components, which are given by the eigenvectors associated with the largest eigenvalues of $\Sigma$; hence we have disregarded the $g - 1$ smallest principal components. Here $\Sigma$ is the covariance matrix of the training set in the original image space. We know the within class scatter matrix $S_W$ has rank of at most $M - g$, so in this new space of dimension $M - g$ we can ensure that $S_W$ is non-singular.

Step Two: Fisher's LDA

Secondly, Fisher's LDA is applied in this new $M - g$ dimensional feature space. We select the matrix $W_{LDA}$ such that the between class variance of the dimensionally reduced observations is maximised and the within class variation is minimised. That is:

$$W_{LDA} = \arg\max_{W}\frac{|W^T W_{PCA}^T S_B W_{PCA} W|}{|W^T W_{PCA}^T S_W W_{PCA} W|} \qquad (4.50)$$

Thus we need to select the eigenvectors of:

$$(W_{PCA}^T S_W W_{PCA})^{-1}(W_{PCA}^T S_B W_{PCA}) \qquad (4.51)$$

corresponding to the largest eigenvalues. We can ensure $W_{PCA}^T S_W W_{PCA}$ is non-singular in this dimensionally reduced space, hence the inverse exists. The rank of $S_B$ is at most $g - 1$, hence there exist at most $g - 1$ non-zero eigenvalues. We can then select $m$, where $m \leq g - 1$, eigenvectors corresponding to the non-zero eigenvalues. These $m$ eigenvectors give the Fisher's linear discriminants, which form the projection matrix $W_{LDA}$ and thus describe the discriminant space.

Face recognition can then be achieved in this discriminant space by projecting an unknown face image into the space:

$$y_k = W^T(x_k - \mu), \qquad W \in \mathbb{R}^{N\times m} \qquad (4.52)$$

where $W = W_{PCA}W_{LDA}$ and $\mu$ is the mean face image in the original face space. Then, using Fisher's proposal of the Nearest Neighbour discriminant rule, we are able to determine whether the unknown image belongs to any of the $g$ classes. However, how many eigenvectors do we need to take to discriminate successfully in the discriminant space? We shall investigate this through an experimental application using our face databases.

4.9.1 The Fisherface Method

Preparing the Training Set

1. Using the initial training set of $M$ images $\{x_1,\ldots,x_M\}$, where each of the $g$ individuals has at least two images, calculate the mean face image: $\mu = \frac{1}{M}\sum_{i=1}^{M}x_i$.

2. Form the optimal PCA projection matrix $W_{PCA}$ whose columns are the first $M - g$ eigenvectors of the training set's covariance matrix $\Sigma$, corresponding to the largest eigenvalues (4.49). Linearly project each face image in the training set onto these eigenvectors to give the associated $M - g$ dimensional feature vectors: $u_k = W_{PCA}^T(x_k - \mu)$.

3. Form the optimal Fisher's LDA projection matrix $W_{LDA}$ according to Fisher's criterion (4.50), whose columns are formed from the first $m \leq g - 1$ eigenvectors of $(W_{PCA}^T S_W W_{PCA})^{-1}(W_{PCA}^T S_B W_{PCA})$ corresponding to the largest non-zero eigenvalues, where $S_W$ is the within class scatter matrix and $S_B$ the between class scatter matrix of the training set defined in the original image space. For each image, use its feature vector in the PCA reduced space to find its corresponding discriminant vector: $y_k = W_{LDA}^T u_k$.

4. Form the overall projection matrix $W_{opt} = W_{PCA}W_{LDA}$, which defines the discriminant space.

Identifying a New Image

Given an unknown test image $x_j$ that is not within the training set:

1. Project the image into the $m$ dimensional discriminant space to give its discriminant vector: $y_j = W_{opt}^T(x_j - \mu)$.

2. Use the Euclidean distance to compare the projected image with each of the $g$ known face classes $\{\bar{y}_1,\ldots,\bar{y}_g\}$. Using the Nearest Neighbour discriminant rule, with predefined thresholds $\theta_i$ for each class, the image $x_j$ is allocated to the $k$-th class if $\|y_j - \bar{y}_k\| = \epsilon_k < \theta_k$, where $\bar{y}_k$ is the mean of the projected training images for the $k$-th class, $k = 1,\ldots,g$, and $\epsilon_k$ is the smallest distance over all classes for which $\epsilon_i < \theta_i$. A minimal code sketch of the whole procedure is given below.
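The following MATLAB sketch summarises the training and identification steps above in one place. It is an illustrative outline under assumed inputs (a data matrix X with one image per column, a label vector, a test image xTest and thresholds theta), not the implementation given in Appendix F and Appendix G.

```matlab
% Minimal sketch of the Fisherface method (training followed by identification).
% Assumed inputs: X (N x M, one training image per column), labels (1 x M vector
% of class indices 1..g), xTest (N x 1 test image), theta (1 x g thresholds).
[N, M] = size(X);  g = max(labels);  m = g - 1;          % number of discriminants kept
mu = mean(X, 2);   A = X - mu;                           % mean face and centred data

% Step one: PCA down to an (M - g)-dimensional feature space.
[V, D] = eig(A' * A);                                    % small M x M eigenproblem
[~, order] = sort(diag(D), 'descend');
Wpca = A * V(:, order(1:M-g));                           % PCA projection matrix (unnormalised)
U = Wpca' * A;                                           % feature vectors u_k

% Step two: Fisher's LDA in the PCA-reduced space.
Sw = zeros(M-g);  Sb = zeros(M-g);  uMean = mean(U, 2);
for i = 1:g
    Ui = U(:, labels == i);  mi = mean(Ui, 2);
    Sw = Sw + (Ui - mi)*(Ui - mi)';
    Sb = Sb + size(Ui, 2)*(mi - uMean)*(mi - uMean)';
end
[Vf, Df] = eig(Sb, Sw);
[~, order] = sort(real(diag(Df)), 'descend');
Wlda = Vf(:, order(1:m));
Wopt = Wpca * Wlda;                                      % overall projection, as in (4.48)

% Class means in the discriminant space, then nearest-mean identification.
Y = Wopt' * A;
classMeans = zeros(m, g);
for i = 1:g, classMeans(:, i) = mean(Y(:, labels == i), 2); end
yTest = Wopt' * (xTest - mu);
[epsK, k] = min(sqrt(sum((classMeans - yTest).^2, 1)));  % allocate to class k if epsK < theta(k)
```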

4.10 Fisherface Method Implementation

This section shall focus on the computational details of training a computer system to implement and test the Fisherface method. The code, written in MATLAB [19], is given in Appendix F and Appendix G. Inspiration for the code is with thanks to the MATLAB Central File Exchange [14] and the first part of the Eigenface code previously compiled. The first part of the code, Appendix F, involves the training set of face images, which is used to calibrate the machine. The second part of the code, Appendix G, involves the classification stage of the Fisherface method, in which the machine aims to identify each individual in a testing set. Note that we shall use the same two training sets from the faces94 and faces95 databases as used before in the Eigenface method implementation, each containing 500 images of 50 randomly selected individuals. We will also use the same sets of 3 testing images for each database, with each set containing 50 images. We shall examine the code in relation to the steps of the Fisherface method as given above.

4.10.1 The Training Set

Steps One and Two

The initial part of the Fisherface code involves reading in a database of training images and using PCA to find the optimal projection matrix $W_{PCA}$. This optimal projection matrix is formed from the first $M - g$ eigenvectors corresponding to the largest eigenvalues of $\Sigma$. Note that this first section of the code is identical to the initial code written for the Eigenface method, as detailed above. So for details of steps one and two of the Fisherface method please refer to the details of the Eigenface method code, where we now always take the first $M - g$ principal components. After applying the linear projection using the optimal projection matrix $W_{PCA}$, each image in the training set is given by its feature vector $u_i = W_{PCA}^T(x_i - \mu)$ in the $M - g$ dimensional feature space. Collectively these feature vectors form the matrix ProjectedImagesPCA $\in \mathbb{R}^{(M-g)\times M}$.

Steps Three and Four

Next we need to find the optimal linear projection matrix $W_{LDA}$ according to Fisher's criterion, considering the feature vectors in the PCA reduced feature space. The code calculates the within class scatter matrix $W_{PCA}^T S_W W_{PCA}$ and the between class scatter matrix $W_{PCA}^T S_B W_{PCA}$ in this PCA reduced space. Then we need to consider the $g - 1$ non-zero eigenvalues and eigenvectors of:

$$(W_{PCA}^T S_W W_{PCA})^{-1}(W_{PCA}^T S_B W_{PCA})$$

We order these eigenvalues in descending order such that $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_{g-1}$. Then we select the first $m \leq g - 1$ eigenvectors, corresponding to the $m$ largest eigenvalues, to form the columns of the optimal projection matrix $W_{LDA}$. In experimentation we shall consider various values of $m$ in order to determine the optimal number of discriminants for discriminatory purposes. Next we project each PCA reduced feature vector into this Fisher's LDA reduced space to give a new feature vector $y_i = W_{LDA}^T u_i$. In the code, the training set is described in the discriminant space by the matrix ProjectedImagesFisher, where each column represents the discriminant vector for one of the $M$ images. We also need to form a new matrix PCAFQ, where each of the $g$ columns gives the average feature vector for one of the $g$ individuals. In the training set there are 10 images per individual, so we need to take the average of every 10 consecutive columns of the matrix ProjectedImagesFisher.
Hence the $k$-th column of PCAFQ gives the average Fisher's linear discriminant vector for the $k$-th individual:

$$\text{PCAFQ}(:,k) = \bar{y}_k = \frac{1}{10}\sum_{j=1}^{10}y_j^{(k)}$$

where $y_j^{(k)}$ denotes the discriminant vector of the $j$-th training image of the $k$-th individual.
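As an illustration of this averaging step, a short MATLAB sketch is given below; it assumes the columns of ProjectedImagesFisher are ordered so that the 10 images of each individual are consecutive, and it is not the code of Appendix F.

```matlab
% Sketch: average every 10 consecutive discriminant vectors to obtain one
% column per individual (images of each person assumed stored consecutively).
[m, M] = size(ProjectedImagesFisher);     % m discriminants, M = 10*g training images
g = M / 10;
PCAFQ = squeeze(mean(reshape(ProjectedImagesFisher, m, 10, g), 2));   % m x g class means
```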

4.10.2 The Testing Set

The second half of the code covers the face identification part of the face recognition problem. The aim is to classify, or reject, an unknown face image, i.e. an image not in the training set, as one of the 50 known individuals in the training set. When considering the Fisherface method the Euclidean distance will be used as the distance metric, with Fisher's Nearest Neighbour discriminant rule determining class membership. We then classify a test image as the individual with the smallest Euclidean distance between the feature vector in question and the class mean, provided that this distance is less than some predetermined threshold $\theta_k$. In experimenting we shall select the thresholds $\theta_i$, $i = 1,\ldots,50$, in such a way as to reduce the number of images not recognised as any of those within the database, i.e. to reduce the rejection rate. This is because we know the testing sets are constructed from images of known individuals, i.e. individuals within the training set, so we know, a priori, that all images in the set are of known individuals. We are consequently not so interested in the rejection rate of the Fisherface procedure but in the recognition abilities of the method. This means we need to select a high threshold so that all images are classified as their nearest neighbours, although this is likely to give more errors than a smaller threshold would.

We consider one set of testing images at a time and run the second part of the code. Each image in the testing set is read in separately and the image's array is transformed into a vector of length $N$ by concatenating the array's rows. Each image in the test set then forms one of the columns of the $N \times 50$ dimensional matrix $S$.

Steps One and Two

The next section of the code covers the linear projection of a testing image into the discriminant space as specified by Fisher's criterion. After this projection each image in the test set is given by its discriminant vector:

$$y_i = W_{LDA}^T W_{PCA}^T(x_i - \mu)$$

We then use the Euclidean distance to measure the proximity of the image's discriminant vector to each of the face classes, with the $k$-th face class being given by the average class feature vector $\bar{y}_k = \text{PCAFQ}(:,k)$. The image is then taken to be of class $k$ if $\epsilon_k = \min_i\{\epsilon_i \mid \epsilon_i < \theta_i\}$. This minimum value is identified and returned along with its position, which represents the face class of nearest proximity. A loop is written in the code so that this procedure is repeated for every image in the testing set. The number of correctly classified testing images then gives a measure of the accuracy of the procedure.

4.11 Fisherface Method Experimental Results

In implementing the Fisherface method there are several variables that can either be varied or need to be kept constant. In experimenting with this method we shall consider the effect of varying just one key variable: the number of Fisher's linear discriminants. When forming the projection matrix $W_{LDA}$ we can select up to $g - 1$ eigenvectors, thus $W_{LDA}$ can have a maximum of $g - 1$ columns. We aim to minimise the data storage and computing requirements of the method, so we wish to minimise the number of eigenvectors taken whilst maximising the recognition accuracy of the method. So we shall consider how many Fisher's linear discriminants we need to accurately discriminate between face images of different individuals. All other variables will be kept constant.
We shall consistently use the same training sets, thus calibrating the system using the same images each time. $W_{PCA}$ will also remain of the same dimension, and the thresholds $\theta_i$ shall be kept the same for each class. We shall experiment using two face databases, faces94 and faces95. Each database has its own associated 500 image training set and 3 sets of testing images. We shall run the code three times on

each testing set to ensure the reliability of the results. We then define the percentage recognition accuracy of the procedure to be the number of correctly classified individuals given a fixed threshold and specified distance measure. We shall take an average of this recognition accuracy for a given number of eigenvectors over the average values from the 3 repeats for each of the 3 testing sets; this will ensure the reliability of the final results. We can then calculate the standard deviation (SD) of the recognition accuracy to give a measure of the reliability of the results produced for the method, which enables an error bound to be calculated for every average recognition accuracy value stated.

In experimenting we shall select the thresholds $\theta_i$, $i = 1,\ldots,50$, in such a way as to reduce the number of images not recognised as any of those within the database, i.e. to reduce the rejection rate. This is because we know the testing sets are constructed from images of known individuals, i.e. individuals within the training set, so we know, a priori, that all images in the testing set are of known individuals. We are consequently not so interested in the rejection rate of the Fisherface procedure but in the recognition abilities of the method. This means we need to select a high threshold, so that all images will be classified as their nearest neighbour, although this is likely to give more errors than a smaller threshold would. The thresholds are taken to be:

$$\theta_i = 10 \quad i = 1,\ldots,50$$

These thresholds mean that no images are rejected as unknown and each is classified as its nearest neighbour. Observe the results below:

No. Eigenvectors    faces94 Recognition Accuracy    faces94 SD    faces95 Recognition Accuracy    faces95 SD
3                   —                               4.58%         58.00%                          9.00%
5                   —                               4.00%         66.67%                          7.21%
10                  —                               3.00%         67.33%                          6.56%
20                  —                               2.00%         69.33%                          6.56%
30                  —                               2.00%         69.33%                          6.56%
40                  —                               1.00%         70.00%                          6.00%
49                  —                               1.00%         70.00%                          6.00%

Figure 4.7: Recognition accuracy and error bounds of the results produced from the Fisherface method.
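For clarity, the averaging and error-bound calculation just described amounts to the following short MATLAB fragment; the 3-by-3 matrix of accuracies is a placeholder for the measured values, not actual experimental data.

```matlab
% Sketch: average the recognition accuracy over 3 repeats of each of the
% 3 testing sets, with the standard deviation reported as the error bound.
acc = zeros(3, 3);             % placeholder: 3 repeats (rows) x 3 testing sets (columns), in percent
meanAcc = mean(acc(:));        % average recognition accuracy for this number of eigenvectors
sdAcc   = std(acc(:));         % standard deviation used as the error bound
```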

4.11.1 Optimal Number of Fisher's Linear Discriminants

We can take up to $g - 1$ Fisher's linear discriminants to form the optimal projection matrix $W$. However, how many are optimal for discrimination using our two datasets? We wish to minimise the number of discriminants used, in order to reduce the data storage and computer processing requirements, whilst producing a discriminant space in which recognition accuracy is high. We can assess the optimal number by observing the plot in figure 4.7, which gives the percentage recognition accuracy versus the number of eigenvectors, and hence the number of Fisher's linear discriminants $m$, used in the method. The experiment was run using $m = 3, 5, 10, 20, 30, 40$ and $49$ eigenvectors. For both databases the plot clearly shows a large increase in the percentage recognition accuracy between using 3 and 5 eigenvectors. However, beyond the use of 5 eigenvectors there is only a gradual improvement in the accuracy of the method as the number of eigenvectors is increased. This indicates that approximately 5 Fisher's discriminants are optimal in producing a discriminant space in which the classes are best separated, and hence optimal for discriminatory purposes. This means that both our training sets can be optimally represented using just 5 Fisher's linear discriminants, given by the projection matrix $W \in \mathbb{R}^{N\times 5}$. Each image is then given by its discriminant vector, $y_i = W^T(x_i - \mu)$, of length 5. This representation of the datasets is thus optimal for discriminating between different individuals.

4.11.2 faces94 vs faces95

The major drawback of the Eigenface method is that it treats all sources of variation as equal; hence we saw that the method fails when there is significant variation due to lighting or pose, and it subsequently proved to be much less accurate on the faces95 database. In formulating the Fisherface method we separated the sources of variation, with the aim of improving the accuracy of recognition even when significant within class variation is present. We can use the experimental results to assess whether the method is indeed an improvement. Recall that the faces94 database has limited within class variation whilst faces95 has significant variation due to lighting and pose. Consider the plot given in figure 4.7. We can clearly see that the Fisherface method is less accurate on the faces95 database, as the percentage recognition accuracy is consistently lower than for faces94. Observe that, across all numbers of eigenvectors used, the difference between the recognition accuracy of the two databases is on average approximately 25%. However, implementation of the Eigenface method on these two datasets produced a difference in percentage recognition accuracy of approximately 30% on average across all numbers of Eigenfaces used. This indicates that, although the recognition abilities of the Fisherface method are still affected by significant within class variation, it is an improved, more robust technique in comparison to the Eigenface method.

We shall assess this further by considering again the two other testing sets which have been created using images from the faces95 database. The testing sets have been selected such that the images in testing set one have very limited variation due to lighting and pose in comparison to those of the training set.
In contrast, the images in testing set two have been selected such that there is significant variation due to the aforementioned sources. Observe the results produced from implementation of the Fisherface method using these two testing sets:

No. eigenvectors    Set 1    Set 2
3                   —        58%
5                   76%      67%
10                  92%      67%
20                  93%      69%
30                  93%      69%
40                  93%      70%

Figure 4.8: Comparison of the results produced using the Fisherface method on testing set one and testing set two.

These results have been plotted to give figure 4.8. For comparison, recall the results produced using the Eigenface method with the Mahalanobis distance:

No. Eigenfaces    Set 1    Set 2
10                —        44%
20                82%      52%
30                86%      40%
40                88%      46%

Examining the results produced when using the Fisherface method, we can clearly see that the method is decisively more accurate on the first image set. This observation implies that the method is still affected by variations in lighting and pose. However, observe that the difference between the recognition accuracy achieved on the two test sets is smaller using the Fisherface method than the corresponding difference achieved using the Eigenface method. This indicates that the Fisherface method is able to achieve better discrimination within datasets where there is significant within class variation, i.e. due to lighting and pose. Subsequently the method is more robust.

4.12 Comparison to the Eigenface Method

Both the Fisherface and Eigenface methods are dimensionality reduction techniques based on linearly projecting the image space to a lower dimensional subspace, and thus both have similar computational requirements. However, the Eigenface method suffers from several major limitations, because it does not include any class information and treats all sources of variation as equal; the Eigenface approach subsequently fails when there is significant within class variation, such as that due to lighting or pose. In the Fisherface method we sought to eliminate these limitations by including the class labels of observations and separating the sources of variation, with the aim of producing a linear subspace that best discriminates between individuals. Results show that the Fisherface method is indeed an improvement, as it selects projections that appear to perform well over a range of lighting conditions and pose variations.

4.13 Disadvantage of the Fisherface Method

Although the Fisherface method has proven to be more effective than the Eigenface method in terms of its face recognition abilities, the procedure is still limited by the fact that it is a linear subspace analysis method: it uses a linear transform to project the image space into the new lower dimensional discriminant space. This means it is unable to efficiently represent the nonlinear relations of the input space. Extreme lighting conditions or dramatic changes in pose result in a highly complex and nonlinear distribution of face images. Experimental results have shown that, although the method is an improvement on the Eigenface method in dealing with these complexities of face images, the procedure's accuracy is still affected by such sources of variation. Hence it is reasonable to assume that a better solution can be achieved by taking into account higher order statistical dependencies using nonlinear methods. One class of techniques which does just this is the so-called kernel methods [34].

Chapter 5
Kernel Methods

5.1 Motivation

Linear discriminant techniques rely upon the assumption that a linear discriminant function is sufficient to discriminate between classes within the data set; hence such techniques fail for nonlinear problems. PCA and Fisher's LDA use information based on second order dependencies, i.e. pixel-wise covariance among the pixels, and are thus insensitive to the dependencies among multiple pixels in an image [34]. Higher order dependencies in a face image include nonlinear relations among pixels, such as the relationship between three or more pixels in a curve or edge. Is it valid to assume that these higher order statistics are not important for recognition? For most real world classification problems, as is the case for the problem of face recognition, this linearity assumption is a major simplification, and hence linear discrimination is not always sufficiently flexible. Thus, to increase the accuracy of the discriminant, one could look for nonlinear methods. This leads to the notion of kernel methods. We shall begin by describing the basic idea behind such kernel methods. We shall then go on to suggest a proposal for their application to the face recognition problem, combining the class separability property of Fisher's LDA and the nonlinear properties of kernel methods to formulate an approach we shall call the Kernel Fisherface method.

5.2 Introduction to Kernel Methods

The basic idea of kernel methods in discriminant analysis is first to preprocess the data by some nonlinear mapping $\Phi$ to a feature space $F$, and then to apply a linear discriminant method in this space $F$. The idea is that, for appropriate $\Phi$, linear discrimination in the image space of $\Phi$ will be sufficient. Written more formally:

$$\Phi : \mathbb{R}^N \rightarrow F, \qquad x \mapsto \Phi(x) \qquad (5.1)$$

where the mapping $\Phi$ is applied to the training data set of $M$ images $\{x_1,\ldots,x_M\}$. We then consider our linear discriminant method in the space $F$ instead of the original image space, i.e. we are now working with the set $\{\Phi(x_1),\ldots,\Phi(x_M)\}$. The difference is that with kernel methods, for a suitably chosen mapping $\Phi$, we gain a method that has nonlinear properties but retains the simplicity and ease of calculation of the equivalent linear discriminant method. We aim to select the map $\Phi$ in such a way as to capture the nonlinear relations of the image space; for appropriate $\Phi$ these complex relations in images can then be easily detected for discriminatory purposes. See figure 5.1 for an example of a suitably chosen map.

5.2.1 The Kernel Trick

How do we define this nonlinear mapping? In certain situations we are able to design an appropriate nonlinear mapping $\Phi$ by hand, if we have sufficient knowledge about the training set. Then, if this function $\Phi$ is not too complex, it may be possible to explicitly apply this mapping to the training

Figure 5.1: Three separate approaches to the same two class problem. (a) A linear discriminant function applied in the original image space produces classification errors. (b) A better separation of the classes by a nonlinear discriminant function. (c) The nonlinear discriminant function corresponds to a linear discriminant function in the feature space produced by the nonlinear mapping $\Phi$; the data points are mapped from the input space to the feature space $F$ by $\Phi$ [28].

set and then carry out the discriminant analysis. However, we run into issues when we do not have sufficient knowledge to be able to linearise the problem, or when, even if we are able to design the mapping, it is not possible to calculate it explicitly. For example, consider when the space $F$ we wish to map to is of very high, or even infinite, dimension: clearly it would not be possible to calculate the mapping to such a space directly. Note, however, the key fact that we are able to compute scalar products in such high dimensional spaces. This leads to the idea of the kernel trick [29]. The kernel trick is to take the original linear discriminant method in the space $F$ and formulate it in such a way that we only use $\Phi(x)$ in scalar products. We are able to evaluate these kernel scalar products without needing to explicitly calculate the nonlinear mapping $\Phi$, and hence are still able to solve the discriminant problem in the feature space $F$.

5.2.2 Kernel Functions

We thus need to define a kernel function, $k(x_i, x_j)$, corresponding to the nonlinear map $\Phi$. The table below gives examples of some common kernel functions for $x_i, x_j \in \mathbb{R}^N$ [28].

Gaussian RBF:           $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{c}\right)$,  $c \in \mathbb{R}$
Polynomial:             $k(x_i, x_j) = ((x_i \cdot x_j) + \theta)^d$,  $d \in \mathbb{N}$, $\theta \in \mathbb{R}$
Sigmoidal:              $k(x_i, x_j) = \tanh(\kappa(x_i \cdot x_j) + \theta)$,  $\kappa, \theta \in \mathbb{R}$
Inverse multi-quadric:  $k(x_i, x_j) = \frac{1}{\sqrt{\|x_i - x_j\|^2 + c^2}}$,  $c \in \mathbb{R}_+$

We then select an appropriate kernel function for the map $\Phi$ such that:

$$(\Phi(x_i) \cdot \Phi(x_j)) = k(x_i, x_j) \qquad (5.2)$$

Thus choosing the kernel function is equivalent to selecting the appropriate nonlinear map $\Phi$ that best captures the nonlinear features of our image space. We can then reformulate our discrimination problem such that $\Phi(x)$ only appears in scalar products, replace each of these scalar products by the kernel function $k(x_i, x_j)$, and hence never need to explicitly calculate the mapping to the space $F$.

To examine the reformulation of a discrimination problem using a kernel function, consider the following example taken from [28]. Observe figure 5.2. In two dimensions it is clear that a linear discriminant function is not suitable; the figure suggests an ellipsoid shaped nonlinear boundary is needed to discriminate between the two classes. This indicates the appropriate nonlinear map for such a space to be:

$$\Phi : \mathbb{R}^2 \rightarrow \mathbb{R}^3, \qquad (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\,x_1x_2, x_2^2)$$

Observing figure 5.2 clearly illustrates that in the space $F$ produced by this nonlinear mapping a linear hyperplane is sufficient for separation of the two classes, so we will be able to successfully implement a linear discriminant technique in this space $F$.
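As a small illustration of how such kernel functions are used in practice, the following MATLAB sketch builds the kernel (Gram) matrix of a training set for the Gaussian RBF kernel from the table above; the data matrix X and the width parameter c are assumed inputs.

```matlab
% Sketch: Gram matrix K(i,j) = k(x_i, x_j) for the Gaussian RBF kernel,
% where the columns of X are the vectorised training images.
sq = sum(X.^2, 1);                        % squared norm of each column
D2 = sq' + sq - 2*(X'*X);                 % pairwise squared Euclidean distances
K  = exp(-D2 / c);                        % M x M kernel matrix
```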

Figure 5.2: A two dimensional example. Using the second order monomials $x_1^2$, $\sqrt{2}x_1x_2$ and $x_2^2$ as features, separation of the two classes is achieved by a linear hyperplane (right). In the original input space this construction corresponds to a nonlinear boundary separating the two classes (left).

Note that in this example the feature space $F$ could easily be calculated directly, by carrying out the mapping $\Phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)^T$ explicitly. However, consider if we had a training set consisting of images of dimension $16 \times 16$ pixels, giving vectors of length 256, and we chose the nonlinear map of all products of 5 vector entries. Then we would be mapping to a space that contains all 5th order products of the 256 possible vector positions, i.e. mapping to a space of dimension $\binom{256+5-1}{5} \approx 10^{10}$. It is clear that we would not be able to carry out this mapping explicitly. Hence we use the kernel trick and the key fact that we are able to compute scalar products in this high dimensional space.

Returning to our two dimensional example with the nonlinear map given above, given $\Phi$ we can reformulate the problem in terms of a kernel function $k$ corresponding to this mapping:

$$(\Phi(x_i) \cdot \Phi(x_j)) = (x_{i1}^2, \sqrt{2}x_{i1}x_{i2}, x_{i2}^2)(x_{j1}^2, \sqrt{2}x_{j1}x_{j2}, x_{j2}^2)^T = \big((x_{i1}, x_{i2})(x_{j1}, x_{j2})^T\big)^2 = (x_i \cdot x_j)^2 := k(x_i, x_j)$$

Hence we can formulate our linear discriminant method in the space $F$ so that we only use $\Phi$ in scalar products, and then simply replace any occurrence of $(x_i \cdot x_j)^2$ by the kernel function $k(x_i, x_j)$. This is an example of the polynomial kernel function with $d = 2$ (and $\theta = 0$): it computes the scalar product in the space of all second order products of the entries of $x_i$ and $x_j$. By using such a kernel function we have incorporated the nonlinear features of the original space into our analysis and thus will be able to better discriminate between the two classes. We shall use this notion of a kernel function and take Fisher's LDA to be our linear discrimination technique, to formulate a nonlinear generalisation. By combining the nonlinear properties of kernel methods with the optimal class separability of Fisher's LDA, we hope to attain a method in which we have further improved our ability to discriminate between different classes.
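The identity derived above is easy to check numerically. The short MATLAB sketch below compares the explicit feature map with the degree-2 polynomial kernel for two arbitrary two-dimensional points; the specific points are illustrative.

```matlab
% Sketch: verify that Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) reproduces the
% degree-2 polynomial kernel k(xi, xj) = (xi . xj)^2.
xi = [0.3; -1.2];  xj = [2.0; 0.7];                 % two arbitrary 2D points
Phi = @(x) [x(1)^2; sqrt(2)*x(1)*x(2); x(2)^2];     % explicit feature map
lhs = Phi(xi)' * Phi(xj);                           % scalar product in the feature space F
rhs = (xi' * xj)^2;                                 % kernel evaluated in R^2
% lhs and rhs agree up to floating point rounding.
```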

5.3 Kernel Fisher Discriminant

We shall now use the ideas outlined above to generate a nonlinear generalisation of Fisher's LDA using kernel functions. Fisher's LDA fails to perform well when nonlinear functions are required for discrimination, as it relies on the assumption of linearity. To overcome this weakness, Mika et al. [29] formulated the Kernel Fisher Discriminant (KFD). KFD is based on the key idea behind all kernel methods: using a nonlinear mapping $\Phi$ and then solving the problem of Fisher's LDA in the resulting feature space $F$. This gives a set of nonlinear discriminant vectors in the original input space. It is hoped that, for suitably chosen $\Phi$, we will have captured the nonlinear features of the space, and that by applying Fisher's LDA in the space $F$ we will have enhanced our discriminatory powers. Using the ideas above, we thus need a formulation of Fisher's discriminant that only uses $\Phi$ in scalar products. We can then replace these scalar products with a suitable kernel function and hence formulate a nonlinear discriminant method. Our aim in applying Fisher's LDA in the space $F$ is to maximise Fisher's criterion:

$$J^{\Phi}(\varphi) = \frac{\varphi^T S_B^{\Phi}\varphi}{\varphi^T S_W^{\Phi}\varphi} \qquad (5.3)$$

where $\varphi \in F$, $S_B^{\Phi}$ is the between class scatter matrix in $F$ and $S_W^{\Phi}$ is the within class scatter matrix in $F$. For simplicity, and to easily visualise the construction of the method, we shall begin by focusing on the case of just two classes. This two class case has been generalised to the multiclass case by Baudat [31] to form the Generalised Kernel Discriminant method (GDA), which we shall consider later. In the two class case the scatter matrices are defined as:

$$S_B^{\Phi} = (\mu_1^{\Phi} - \mu_2^{\Phi})(\mu_1^{\Phi} - \mu_2^{\Phi})^T \qquad (5.4)$$

$$S_W^{\Phi} = \sum_{i=1,2}\sum_{j=1}^{M_i}\big(\Phi(x_j^i) - \mu_i^{\Phi}\big)\big(\Phi(x_j^i) - \mu_i^{\Phi}\big)^T \qquad (5.5)$$

where we define $\mu_i^{\Phi} = \frac{1}{M_i}\sum_{j=1}^{M_i}\Phi(x_j^i)$ to be the mean of the $i$-th class in the space $F$. In this space $F$ we can then just apply the normal Fisher's LDA. In order to find the optimal discriminants we need to find the eigenvalues $\lambda$ and the corresponding eigenvectors $\varphi$ of the generalised eigenproblem:

$$\lambda S_W^{\Phi}\varphi = S_B^{\Phi}\varphi \qquad (5.6)$$

The optimal direction then corresponds to the eigenvector with the largest associated non-zero eigenvalue:

$$\varphi_{opt} = \arg\max_{\varphi}\frac{\varphi^T S_B^{\Phi}\varphi}{\varphi^T S_W^{\Phi}\varphi} \qquad (5.7)$$

Considering the argument proposed before in justifying the need for the kernel trick, if $F$ is of very high or possibly infinite dimension then it is not possible to solve this directly. However, we can use the kernel trick and seek a formulation of the problem purely in terms of dot products of the mapped vectors, $(\Phi(x_p)\cdot\Phi(x_q))$, which we can then replace by a kernel function. Consequently we can solve the problem without needing to explicitly calculate the nonlinear mapping $\Phi$ to the space $F$. We know that any solution $\varphi \in F$ must lie in the span of all training observations mapped into $F$ [29]. Hence we can express $\varphi$ as an expansion of the form:

$$\varphi = \sum_{i=1}^{M}\alpha_i\Phi(x_i) \qquad (5.8)$$

Then, using the expansion in (5.8) and the definition of $\mu_i^{\Phi}$, we can write:

$$\varphi^T\mu_i^{\Phi} = \frac{1}{M_i}\sum_{j=1}^{M}\sum_{k=1}^{M_i}\alpha_j\,k(x_j, x_k^i) = \alpha^T Q_i \qquad (5.9)$$

where we have replaced the dot products by the corresponding kernel function and defined $(Q_i)_j = \frac{1}{M_i}\sum_{k=1}^{M_i}k(x_j, x_k^i)$, with $x_k^i$ being the $k$-th sample of class $i$. Considering $x_r^t$, the $r$-th sample of class $t$, and $x_s^u$, the $s$-th sample of class $u$, we have defined the kernel function to be:

$$k(x_r^t, x_s^u) = \big(\Phi(x_r^t)\cdot\Phi(x_s^u)\big) \qquad (5.10)$$

Now return to Fisher's criterion in (5.3) and first consider the numerator containing $S_B^{\Phi}$. Using the definition of $S_B^{\Phi}$ and (5.9), we can rewrite the numerator in the form:

$$\varphi^T S_B^{\Phi}\varphi = \varphi^T\big(\mu_1^{\Phi}(\mu_1^{\Phi})^T - \mu_1^{\Phi}(\mu_2^{\Phi})^T - \mu_2^{\Phi}(\mu_1^{\Phi})^T + \mu_2^{\Phi}(\mu_2^{\Phi})^T\big)\varphi = \alpha^T Q_1Q_1^T\alpha - \alpha^T Q_1Q_2^T\alpha - \alpha^T Q_2Q_1^T\alpha + \alpha^T Q_2Q_2^T\alpha = \alpha^T Q\alpha$$

where:

$$Q = (Q_1 - Q_2)(Q_1 - Q_2)^T$$

Next we examine the denominator of Fisher's criterion (5.3) and apply a similar transformation. Using the definition of $\mu_i^{\Phi} = \frac{1}{M_i}\sum_{j=1}^{M_i}\Phi(x_j^i)$ and equation (5.8) we find that:

$$\varphi^T S_W^{\Phi}\varphi = \varphi^T\left(\sum_{i=1,2}\sum_{j=1}^{M_i}\big(\Phi(x_j^i) - \mu_i^{\Phi}\big)\big(\Phi(x_j^i) - \mu_i^{\Phi}\big)^T\right)\varphi = \alpha^T B\alpha$$

where:

$$B = \sum_{i=1,2}K_i(I - \mathbf{1}_{M_i})K_i^T$$

In this we define $K_i$ to be the $M\times M_i$ kernel matrix for class $i$, with entries $(K_i)_{pq} = k(x_p, x_q^i)$; $I$ is the identity matrix; and $\mathbf{1}_{M_i}$ is the $M_i\times M_i$ matrix with all entries equal to $\frac{1}{M_i}$. So, rewriting the numerator and denominator of (5.3) as above, Fisher's criterion can be expressed as:

$$J(\alpha) = \frac{\alpha^T Q\alpha}{\alpha^T B\alpha} \qquad (5.11)$$

Hence we can find the Fisher's linear discriminant in the space $F$ by maximising the above. The problem can now be solved in exactly the same way as Fisher's LDA in the original input space, but now in the feature space $F$: we need to find the leading direction $\alpha$, which is just the eigenvector corresponding to the largest eigenvalue of $B^{-1}Q$. Hence we can now rewrite (5.7) as:

$$\varphi_{opt} = \arg\max_{\varphi}\frac{\varphi^T S_B^{\Phi}\varphi}{\varphi^T S_W^{\Phi}\varphi} = \arg\max_{\alpha}\frac{\alpha^T Q\alpha}{\alpha^T B\alpha} \qquad (5.12)$$
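As an illustration of the two-class construction above, the following MATLAB sketch computes Q, B and the leading direction alpha for a small training set using the Gaussian RBF kernel; the data X1, X2, the kernel width c and the ridge term beta are assumptions, and the ridge anticipates the regularisation B_beta = B + beta*I discussed in the next subsection.

```matlab
% Sketch: two-class Kernel Fisher Discriminant with a Gaussian RBF kernel.
% X1, X2 hold the two classes' observations as columns (assumed inputs).
X  = [X1, X2];  M1 = size(X1, 2);  M2 = size(X2, 2);  M = M1 + M2;
rbf = @(A, B, c) exp(-(sum(A.^2,1)' + sum(B.^2,1) - 2*(A'*B)) / c);
c  = 1;  beta = 1e-3;                       % kernel width and ridge term (assumed)
K1 = rbf(X, X1, c);  K2 = rbf(X, X2, c);    % M x M1 and M x M2 kernel matrices
Q1 = mean(K1, 2);    Q2 = mean(K2, 2);      % the vectors Q_i of (5.9)
Q  = (Q1 - Q2) * (Q1 - Q2)';
B  = K1*(eye(M1) - ones(M1)/M1)*K1' + K2*(eye(M2) - ones(M2)/M2)*K2';
[V, D] = eig(Q, B + beta*eye(M));           % leading alpha maximises (5.11)
[~, idx] = max(real(diag(D)));  alpha = V(:, idx);
% Projection of a new point x onto the discriminant, as in (5.13):
% y = alpha' * rbf(X, x, c);
```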

The derivation above gives the KFD approach. So, given an observation x, its projection onto the optimal discriminant as described by KFD is given by:

ϕ · Φ(x) = Σ_{i=1}^{M} α_i (Φ(x_i) · Φ(x)) = Σ_{i=1}^{M} α_i k(x_i, x)    (5.13)

5.3.1 Issues with Kernel Fisher Discriminant

One problem does however arise with this method [29]. Namely, we cannot guarantee that B is nonsingular, so it is possible its inverse may not exist. One way to deal with this numerical problem is to replace B with B_β, where B_β is the matrix B with a multiple of the identity matrix added:

B_β := B + βI    (5.14)

where β is a very small constant and I is the identity matrix. This clearly makes the solution more numerically stable, as for appropriate β, B_β will become positive definite.

Further, as with Fisher's LDA, KFD cannot cope with the situation in which there is just one observation per class within the training data set. How do we resolve this issue? Observe that in this situation B will be equal to zero, and hence the denominator of Fisher's criterion in (5.11) goes to zero. Thus we are unable to construct a valid Fisher's criterion. This problem can be resolved by using tricks to generate multiple observations belonging to a particular class from a single observation of this class. We shall not cover such methods here, but with regards to a training set of images in the face recognition problem, tricks such as suitable geometric or grey-level transforms can be used to generate these multiple images, see [30].

5.4 Generalised Kernel Fisher Discriminant Analysis

In order to apply the Kernel Fisher Discriminant method to the problem of face recognition we need to consider the multiclass case, and hence use the Generalised Kernel Fisher Discriminant method [31]. We are considering the situation in which we have a training set with M images of g different individuals, and we require that each individual has more than one image within this training set. Since we have g different classes, we aim to find the discriminant vectors, which form the columns of the matrix Ψ, that maximise Fisher's criterion:

J^Φ(Ψ) = |Ψ^T S_B^Φ Ψ| / |Ψ^T S_W^Φ Ψ|    (5.15)

where the scatter matrices are defined as:

S_B^Φ = (1/M) Σ_{i=1}^{g} M_i (μ_i^Φ - μ^Φ)(μ_i^Φ - μ^Φ)^T    (5.16)

S_W^Φ = (1/M) Σ_{i=1}^{g} Σ_{j=1}^{M_i} (Φ(x_j^i) - μ_i^Φ)(Φ(x_j^i) - μ_i^Φ)^T    (5.17)

where μ^Φ = (1/M) Σ_{i=1}^{g} Σ_{j=1}^{M_i} Φ(x_j^i) is the average of all observations in the training set.

Then applying LDA in the space F, mapped to by Φ, determines the set of optimal discriminant vectors Ψ = [ϕ_1, ..., ϕ_m] which maximise Fisher's criterion (5.15). Then, using the same justification as in Fisher's LDA and the Theorem for Rayleigh Quotients given in Appendix C, the optimal discriminant vectors are the eigenvectors of the generalised eigenproblem:

λ_i S_W^Φ ϕ_i = S_B^Φ ϕ_i    (5.18)

We then take the first m eigenvectors, corresponding to the largest eigenvalues. Hence the matrix Ψ is formed from these optimal discriminant vectors:

Ψ = [ϕ_1, ..., ϕ_m] = arg max_Ψ |Ψ^T S_B^Φ Ψ| / |Ψ^T S_W^Φ Ψ|    (5.19)

Since we are able to express any of these eigenvectors as a linear combination of the observations in the feature space, we have:

ϕ_i = Σ_{j=1}^{M} a_{i,j} Φ(x_j) = Q α_i    (5.20)

where we have defined Q = [Φ(x_1), ..., Φ(x_M)] and α_i = (a_{i,1}, ..., a_{i,M})^T is the vector containing the coefficients for ϕ_i. Then substituting (5.20) into Fisher's criterion (5.15) and considering the ith eigenvector ϕ_i yields:

J_K(α_i) = (α_i^T K W K α_i) / (α_i^T K K α_i)    (5.21)

where the matrix K is defined as:

K = K̃ - 1_M K̃ - K̃ 1_M + 1_M K̃ 1_M    (5.22)

This is a centralised kernel matrix, where 1_M = (1/M)_{M×M} and K̃ = Q^T Q is the M × M kernel matrix whose elements are determined by:

(K̃)_ij = Φ(x_i)^T Φ(x_j) = (Φ(x_i) · Φ(x_j)) = k(x_i, x_j)    (5.23)

with k(x_i, x_j) being the kernel function corresponding to the nonlinear mapping Φ. Further, we have also defined W = diag(W_1, ..., W_g), where W_j is an M_j × M_j matrix with all entries equal to 1/M_j. Hence W is an M × M block diagonal matrix.

Now consider the eigen (spectral) decomposition of the centralised kernel matrix K. K is a symmetric matrix by its construction; subsequently, by the Theorem in Appendix B, it can be diagonalised by its orthonormal eigenvectors. Suppose that γ_1, ..., γ_m are the orthonormal eigenvectors of K corresponding to the m non-zero eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_m, where m is the rank of the matrix K. Then K can be expressed in its eigendecomposition as:

K = Γ Λ Γ^T    (5.24)

where Γ = (γ_1, ..., γ_m) is a column-orthogonal matrix (Γ^T Γ = I) formed from the orthonormal eigenvectors of K, and Λ = diag(λ_1, ..., λ_m) is a diagonal matrix with the eigenvalues of K along the diagonal and zeros elsewhere. Then substituting K = Γ Λ Γ^T into (5.21) gives:

J_K(α_i) = [(Λ^{1/2} Γ^T α_i)^T (Λ^{1/2} Γ^T W Γ Λ^{1/2}) (Λ^{1/2} Γ^T α_i)] / [(Λ^{1/2} Γ^T α_i)^T Λ (Λ^{1/2} Γ^T α_i)]    (5.25)

Then defining:

β_i = Λ^{1/2} Γ^T α_i    (5.26)

Fisher's criterion in (5.25) is transformed to:

J(β_i) = (β_i^T S_B^β β_i) / (β_i^T S_W^β β_i)    (5.27)

where S_B^β = Λ^{1/2} Γ^T W Γ Λ^{1/2} and S_W^β = Λ. We can easily see that S_W^β is positive definite and that S_B^β is positive semi-definite. So (5.27) is in the form of a standard generalised Rayleigh quotient. Thus, using the results discussed previously and the Theorem given in Appendix C, we obtain a set of optimal discriminant vectors β_1, β_2, ..., β_d which are the eigenvectors of (S_W^β)^{-1} S_B^β corresponding to the d largest eigenvalues, where d ≤ g - 1.
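As a concrete illustration of the quantities appearing in (5.22)-(5.27), the following MATLAB fragment sketches how the centralised kernel matrix K, the block matrix W and the transformed scatter matrices S_B^β and S_W^β might be assembled. It is a sketch under assumed variable names (Ktil for K̃, labels for the class labels) and an assumed class-by-class ordering of the training images; it is not code from the appendices of this report.

% Assumptions: Ktil is the M x M kernel matrix (K tilde) already computed,
% and labels is an M x 1 vector of class labels taking values 1..g.
M    = size(Ktil,1);
oneM = ones(M)/M;                              % the matrix 1_M of (5.22)
K    = Ktil - oneM*Ktil - Ktil*oneM + oneM*Ktil*oneM;   % centralised kernel matrix

% Block diagonal W = diag(W_1,...,W_g), with W_j an M_j x M_j matrix of 1/M_j.
g = max(labels);
W = zeros(M);
for j = 1:g
    idx = find(labels == j);
    W(idx,idx) = 1/numel(idx);
end

% Eigendecomposition K = Gamma*Lambda*Gamma', keeping the non-zero eigenvalues.
[Gamma, Lambda] = eig((K+K')/2);               % symmetrise to suppress round-off
[lam, order]    = sort(diag(Lambda), 'descend');
m      = sum(lam > 1e-8) - 1;                  % keep m strictly below rank(K)
Gamma  = Gamma(:, order(1:m));
Lambda = diag(lam(1:m));

% Transformed scatter matrices of (5.27).
SBbeta = sqrt(Lambda) * Gamma' * W * Gamma * sqrt(Lambda);
SWbeta = Lambda;

With these matrices in hand, the optimal β_i of (5.27) follow from a standard eigen-solver applied to (S_W^β)^{-1} S_B^β, exactly as in the linear Fisherface case.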

From (5.26) we know that, for a given β_i, there exists at least one α_i satisfying α_i = Γ Λ^{-1/2} β_i. Thus, after determining β_1, ..., β_d, we can obtain a set of optimal solutions with respect to Fisher's criterion given in (5.21), that is:

α_i = Γ Λ^{-1/2} β_i,   i = 1, ..., d    (5.28)

Then, using the above, we can formulate the optimal discriminant vectors for the original Fisher's criterion (5.15), defined on the space F, as:

ϕ_i = Q α_i = Q Γ Λ^{-1/2} β_i,   i = 1, ..., d    (5.29)

Hence the optimal discriminant vectors are given by:

Ψ = (ϕ_1, ..., ϕ_d) = (Q Γ Λ^{-1/2} β_1, ..., Q Γ Λ^{-1/2} β_d) = (Q Γ Λ^{-1/2})(β_1, ..., β_d)    (5.30)

The set of discriminant vectors subsequently describes the discriminant space in which we have optimal class separability and which also incorporates the nonlinear characteristics of the original space. So, given an observation x_i and its mapped image Φ(x_i), given by the suitable nonlinear map Φ, the corresponding discriminant vector y_i is obtained by the KFD transformation:

y_i = Ψ^T Φ(x_i)    (5.31)

5.5 The Kernel Fisherface Method

We can now use Kernel Fisher Discriminant analysis with regards to the problem of face recognition, to develop an approach which combines the class separability property of Fisher's LDA with the nonlinear property of kernel methods. Before we begin to formulate such a method, note that the application of the KFD algorithm is not as transparent or simple to implement as Fisher's LDA algorithm. In constructing our approaches to solve the problem of face recognition we wish to develop procedures which are as intuitive to apply as possible, with minimal computational requirements. Can we construct the KFD algorithm in a more transparent way? One approach would be to see if the method can be broken down into simpler steps. Conveniently, it can be shown that KFD is equivalent to Kernel Principal Component Analysis (KPCA) plus Fisher's Linear Discriminant Analysis [34]. This is useful as it leads to the suggestion of a simpler two phase algorithm, in which KPCA is first performed to reduce the dimension of the space F, and secondly LDA is performed for further feature extraction and discrimination in the KPCA transformed space. Note the similarities of this two phase method to that of the linear Fisherface method, in which PCA is first performed for dimension reduction and then secondly LDA is performed in this dimensionally reduced space. This subsequently leads to the suggestion of a Kernel Fisherface method.

We shall begin by investigating KPCA; that is, applying PCA in the feature space F mapped to by the nonlinear map Φ. We shall then use the results to specify a two step process for KFD, and then further use this to formulate a two step Kernel Fisherface method.

5.5.1 KPCA

We shall investigate the use of PCA in extracting nonlinear features [34]. That is, we wish to consider applying PCA in the space F. Assume that we have mapped our original training set of M face images, using the nonlinear function Φ, to the space F. Thus we are now working with the dataset {Φ(x_1), ..., Φ(x_M)}. Assume further that the mapped data is centered, that is Σ_{i=1}^{M} Φ(x_i) = 0. This assumption is necessary as we require centered data to perform principal component analysis. Note that we could relax this requirement and use the

centralised kernel matrix as given by (5.22). However, for clarity in the construction of KPCA we shall assume the data set is centralised and use the standard kernel matrix K̃. Then in this space we define the covariance matrix to be:

S^Φ = (1/M) Σ_{i=1}^{M} Φ(x_i) Φ(x_i)^T    (5.32)

We can then perform PCA in this space in the normal way. Hence, to find the principal components, we know that we just need to select the non-zero eigenvalues λ and the corresponding eigenvectors w^Φ ∈ F satisfying the eigenproblem:

λ w^Φ = S^Φ w^Φ    (5.33)

We then need to reformulate the method such that the Φ(x_i) only appear in scalar products; we can then use the kernel trick and hence implement PCA in the space F. So, substituting (5.32) into (5.33), we note that:

S^Φ w^Φ = (1/M) Σ_{i=1}^{M} (Φ(x_i) · w^Φ) Φ(x_i)    (5.34)

Hence all the solutions w^Φ with corresponding λ ≠ 0 must lie in the span of Φ(x_1), ..., Φ(x_M); thus there must exist coefficients γ_i, i = 1, ..., M, such that:

w^Φ = Σ_{i=1}^{M} γ_i Φ(x_i)    (5.35)

Hence we may consider solving the equivalent system:

λ (Φ(x_i) · w^Φ) = (Φ(x_i) · S^Φ w^Φ),   i = 1, ..., M    (5.36)

Then defining the M × M kernel matrix K̃, whose elements are given by (K̃)_ij = (Φ(x_i) · Φ(x_j)), and substituting (5.35) into (5.36) gives:

M λ K̃ γ = K̃^2 γ    (5.37)

where γ is defined to be the column vector γ = (γ_1, ..., γ_M)^T containing the coefficients. Hence, to find the coefficients of (5.37), and subsequently the solution w^Φ, we simply need to solve the eigenvalue problem:

M λ γ = K̃ γ    (5.38)

Thus we need to find the non-zero eigenvalues λ_k and corresponding eigenvectors γ_k = (γ_{k,1}, ..., γ_{k,M})^T, for k = 1, ..., m, where m is strictly less than the rank of K̃. Recalling our construction of PCA in the linear space, we specified the principal components to be of length one. Hence we require that the feature vectors w_k^Φ in F, corresponding to our solutions γ_k, are of unit length, i.e. (w_k^Φ · w_k^Φ) = 1. Thus:

(w_k^Φ · w_k^Φ) = 1 = Σ_{i=1}^{M} Σ_{j=1}^{M} γ_{k,i} γ_{k,j} (Φ(x_i) · Φ(x_j)) = (γ_k · K̃ γ_k) = λ_k (γ_k · γ_k)

where λ_k here denotes the kth eigenvalue of K̃. Equivalently, if the γ_k are taken to be unit-norm eigenvectors of K̃, the correct scaling is obtained by dividing by √λ_k. Hence we have our set of principal components, which we shall call kernel principal components, given by w_k^Φ for k = 1, ..., m. Then, considering an observation Φ(x) in the feature space F, we can find its feature vector by projecting onto the eigenvectors w_k^Φ in F:

(w_k^Φ · Φ(x)) = Σ_{i=1}^{M} (γ_{k,i} / √λ_k) (Φ(x_i) · Φ(x))    (5.39)

Note that neither (5.39) nor the definition of the kernel matrix require Φ(x_i) in explicit form; they are only needed in the dot product. Hence we can use the kernel trick, using a kernel function to compute these dot products without ever actually having to explicitly perform the map Φ:

(w_k^Φ · Φ(x)) = Σ_{i=1}^{M} (γ_{k,i} / √λ_k) k(x_i, x)    (5.40)

So the feature vector of the observation x_k in the KPCA space is given by:

y_k = [γ_1/√λ_1, ..., γ_m/√λ_m]^T (Φ(x_1), ..., Φ(x_M))^T Φ(x_k) = [γ_1/√λ_1, ..., γ_m/√λ_m]^T [k(x_1, x_k), ..., k(x_M, x_k)]^T    (5.41)

Observe figure 5.3, which illustrates the use of KPCA in selecting the nonlinear directions.

Figure 5.3: (a) Original image space; (b) KPCA transformed space. After KPCA is applied, the three classes are clearly distinguishable using just the first components. Image source: principal component analysis, date accessed: 01/12/10.

So we have formulated the method of PCA using the kernel trick in the space F. Note this then leads to the suggestion of a Kernel Eigenface method for face recognition based on KPCA, using the kernel principal components as features. However, it was our aim to include the class separability property of Fisher's LDA in our nonlinear kernel method. But is linear discriminant analysis needed when using kernel functions, or could we just use KPCA? We shall not explore this in this report, but see [34], which concludes that the Kernel Fisherface method offers superior performance over the Kernel Eigenface method. This is because KPCA still suffers from the first two limitations of PCA: although we have now included the nonlinear features, the method still selects the most expressive features, which may not be optimal for discriminatory purposes when classifying face images. It has also been shown [33] that in certain situations KPCA is not as good as Fisher's LDA. Hence our aim to use KFD, with regards to the problem of face recognition, appears valid.
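Before moving on, the KPCA feature extraction of (5.38)-(5.41) can be sketched in a few lines of MATLAB. The snippet below is illustrative only: it assumes a kernel function handle kfun and a training matrix X (columns are vectorised images), and it follows the simplifying assumption of Section 5.5.1 that the mapped data is centred in F; it is not the code used for the experiments in this report.

% KPCA sketch (illustrative): extract m nonlinear features for an observation x.
% Assumptions: X is d x M (training vectors as columns), kfun is a kernel
% function handle, and, as in the text, the mapped data is taken as centred
% in F; otherwise the centralised kernel matrix of (5.22) should be used.
M    = size(X,2);
Ktil = zeros(M,M);
for p = 1:M
    for q = 1:M
        Ktil(p,q) = kfun(X(:,p), X(:,q));    % (K tilde)_pq = k(x_p, x_q)
    end
end

[G, L]       = eig((Ktil+Ktil')/2);          % unit-norm eigenvectors of K tilde
[lam, order] = sort(diag(L), 'descend');
m   = sum(lam > 1e-8) - 1;                   % number of kernel principal components
G   = G(:, order(1:m));
lam = lam(1:m);

% Feature vector of an observation x via (5.40)/(5.41):
kx = arrayfun(@(i) kfun(X(:,i), x), 1:M)';   % [k(x_1,x), ..., k(x_M,x)]'
y  = (G' * kx) ./ sqrt(lam);                 % y_k = sum_i (gamma_{k,i}/sqrt(lam_k)) k(x_i,x)

Swapping kfun for a different kernel changes the nonlinear features extracted without any other modification to the procedure.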

5.5.2 KPCA and LDA

We shall now return to the KFD transformation and formulate the procedure as a two step process. Consider the Fisher kernel discriminant vector formed by the KFD projection:

y_i = Ψ^T Φ(x_i) = [(Q Γ Λ^{-1/2})(β_1, ..., β_d)]^T Φ(x_i)    (5.42)

This transformation can then be divided into two parts:

y_i = G^T u_i    (5.43)

where

u_i = (Q Γ Λ^{-1/2})^T Φ(x_i)    (5.44)

and G = (β_1, ..., β_d).

Begin by considering the transformation given by (5.44). Recall that we previously defined:

Q = [Φ(x_1), ..., Φ(x_M)]
Γ = (γ_1, ..., γ_m)
Λ = diag(λ_1, ..., λ_m)

Hence we can rewrite equation (5.44) as:

u_i = [γ_1/√λ_1, ..., γ_m/√λ_m]^T (Φ(x_1), ..., Φ(x_M))^T Φ(x_i) = [γ_1/√λ_1, ..., γ_m/√λ_m]^T [k(x_1, x_i), ..., k(x_M, x_i)]^T    (5.45)

where γ_1, ..., γ_m are the orthonormal eigenvectors of the centralised kernel matrix K with associated non-zero eigenvalues λ_1, ..., λ_m. Then (5.45) is exactly the KPCA transformation, which transforms the feature space F into the Euclidean space R^m, as given by (5.40), [32].

Now examine Fisher's criterion J(β) in (5.27), with the scatter matrices S_B^β and S_W^β, where:

S_B^β = Λ^{1/2} Γ^T W Γ Λ^{1/2}    (5.46)

S_W^β = Λ    (5.47)

It can clearly be seen that these are also the scatter matrices defined in the KPCA transformed feature space. Note that the construction of S_B^β is not so intuitive in the form given in (5.46), hence it is useful to give an alternative expression. We shall consider constructing this matrix directly in R^m, based on the KPCA transformed features:

S_B^β = (1/M) Σ_{i=1}^{g} M_i (μ_i - μ)(μ_i - μ)^T    (5.48)

where μ_i is the mean vector of the ith class in the KPCA reduced space and μ is the mean vector over all M observations in the KPCA reduced space. As a result, J(β) in (5.27) is actually the linear Fisher's criterion in the KPCA transformed space, with optimal linear Fisher discriminant vectors β_1, ..., β_d. Hence the transformation:

y_i = G^T u_i    (5.49)

is just the Fisher's linear discriminant transformation in the KPCA transformed space. Hence we are just projecting onto the optimal eigenvectors of (S_W^β)^{-1} S_B^β according to Fisher's criterion, defined in the KPCA transformed space, to give the Fisher kernel discriminant vectors.

Thus we have reformulated KFD as a two step procedure: first applying KPCA, and then secondly applying Fisher's LDA in the KPCA reduced space. Note that this is more intuitive and easier to understand than directly applying KFD. Further, this two step process reveals the structure of KFD and its relationship with KPCA.

5.5.3 Two Step KFD Algorithm

A two-phase algorithm using KPCA and Fisher's LDA to implement KFD analysis [32].

Step One: KPCA. Perform KPCA in the input space R^N.

1. Construct the centralised inner product kernel matrix:

   K = K̃ - 1_M K̃ - K̃ 1_M + 1_M K̃ 1_M

2. Calculate the orthonormal eigenvectors γ_1, ..., γ_m of the kernel matrix K corresponding to the m largest eigenvalues λ_1, ..., λ_m.

3. Calculate the feature vectors for each observation in the KPCA transformed feature space:

   u_i = [γ_1/√λ_1, ..., γ_m/√λ_m]^T [k(x_1, x_i), ..., k(x_M, x_i)]^T

Step Two: Fisher's LDA. Perform Fisher's LDA in the KPCA transformed space R^m.

1. Construct the:
   - between class scatter matrix S_B^β using (5.48);
   - within class scatter matrix S_W^β = diag(λ_1, ..., λ_m).

2. Calculate the eigenvectors β_1, ..., β_d of (S_W^β)^{-1} S_B^β corresponding to the d largest eigenvalues, where d ≤ g - 1.

3. Then carry out the Fisher's LDA projection to find the discriminant vectors:

   y_i = G^T u_i

   where G = (β_1, ..., β_d); this is the Fisher kernel discriminant vector.

Note that for numerical robustness of the two-step procedure we need to ensure that S_W^β is nonsingular. Thus in step one, m is selected to be an integer value strictly less than the rank of the matrix K. Generally the rank of K is M - 1, where M is the total number of observations in the training set.

5.6 The Kernel Fisherface Method

We shall use the two step KFD algorithm to formulate a Kernel Fisherface method for the face recognition problem. We are given an initial training set of M greyscale face images {x_i}_{i=1}^M. This set contains g individuals, with each individual having multiple face images within the set. We then aim to assign an unknown face image x_k, i.e. one not in the training set, to one of the g classes based on the kernel Fisher discriminants. The idea is that by incorporating the nonlinear features we should be able to better discriminate between individuals and hence be more successful in correctly identifying this unknown face image. We thus consider the following method, in which we first use the training set to find the kernel Fisher discriminants and then secondly use these discriminants to try to correctly classify an unknown image.

5.6.1 Training Set

Given a training set of M images of g individuals.

Step One: KPCA

1. Specify a kernel function k(x_i, x_j) = (Φ(x_i) · Φ(x_j)) = (K̃)_ij, corresponding to the nonlinear map Φ that best captures the nonlinear properties of our original image space. Then construct the centralised inner product kernel matrix:

   K = K̃ - 1_M K̃ - K̃ 1_M + 1_M K̃ 1_M

   where 1_M = (1/M)_{M×M}.

2. Calculate the non-zero eigenvalues of K, λ_1 ≥ λ_2 ≥ ... ≥ λ_{M-1}. Select the m eigenvectors γ_1, ..., γ_m corresponding to the m largest eigenvalues, where m is strictly less than the rank of the matrix K.

3. Carry out the KPCA transformation for each of the i = 1, ..., M images in the training set to give the feature vectors:

   u_i = [γ_1/√λ_1, ..., γ_m/√λ_m]^T [k(x_1, x_i), ..., k(x_M, x_i)]^T

Step Two: Fisher's LDA

1. Aim to maximise Fisher's criterion in the KPCA transformed space:

   J(β) = (β^T S_B^β β) / (β^T S_W^β β)

2. Construct the between class scatter matrix in the KPCA transformed space:

   S_B^β = (1/M) Σ_{i=1}^{g} M_i (μ_i - μ)(μ_i - μ)^T

   and the within class scatter matrix:

   S_W^β = diag(λ_1, ..., λ_m)

3. Calculate the eigenvectors β_1, ..., β_d of (S_W^β)^{-1} S_B^β, corresponding to the d largest eigenvalues, where d ≤ g - 1.

4. Carry out the Fisher's LDA projection for each image in the training set to give the Fisher kernel discriminant vector:

   y_i = G^T u_i

   where G = (β_1, ..., β_d).

5.6.2 Testing Set

Given an unknown image x_k, i.e. one not in the training set, we aim to determine whether it is of a known individual or not.

1. Calculate the image's feature vector in the KPCA space by projection onto the m kernel principal components:

   u_k = [γ_1/√λ_1, ..., γ_m/√λ_m]^T [k(x_1, x_k), ..., k(x_M, x_k)]^T

2. Calculate the Fisher kernel discriminant vector by projection of the feature vector onto the d kernel discriminants:

   y_k = G^T u_k

3. Use the nearest neighbour discriminant rule to classify the image as the jth individual if:

   ||y_k - ȳ_j|| < ||y_k - ȳ_i||,   i = 1, ..., g, i ≠ j

   where ȳ_i is the average Fisher kernel discriminant vector of the ith individual.
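The training and testing procedures above can be combined into a compact MATLAB sketch. The fragment below is a hedged illustration of the Kernel Fisherface pipeline under assumed names (Xtrain, labels, xtest, kfun) and assumed tolerances; it is not one of the implementations listed in the appendices, and it follows the steps exactly as stated above.

% Kernel Fisherface sketch (illustrative; not the appendix code).
% Xtrain: N x M matrix of vectorised training images (columns); labels: M x 1
% class labels in 1..g; xtest: N x 1 vectorised test image; kfun: kernel handle.
M = size(Xtrain,2);   g = max(labels);

% --- Step one: KPCA -----------------------------------------------------
Ktil = zeros(M,M);
for p = 1:M
    for q = 1:M
        Ktil(p,q) = kfun(Xtrain(:,p), Xtrain(:,q));
    end
end
oneM = ones(M)/M;
K    = Ktil - oneM*Ktil - Ktil*oneM + oneM*Ktil*oneM;    % centralised kernel matrix

[G, L]       = eig((K+K')/2);
[lam, order] = sort(diag(L), 'descend');
m   = sum(lam > 1e-8) - 1;                               % m strictly below rank(K)
G   = G(:, order(1:m));   lam = lam(1:m);

U = (G' * Ktil) ./ sqrt(lam);                            % m x M: columns are the u_i

% --- Step two: Fisher's LDA in the KPCA space ---------------------------
mu = mean(U, 2);                                         % overall mean
SB = zeros(m);
for j = 1:g
    idx  = (labels == j);
    mu_j = mean(U(:,idx), 2);
    SB   = SB + sum(idx) * (mu_j - mu)*(mu_j - mu)';
end
SB = SB / M;                                             % (5.48)
SW = diag(lam);                                          % S_W^beta = Lambda

d        = min(g - 1, m);                                % d <= g - 1
[V, D]   = eig(SW \ SB);
[~, ord] = sort(real(diag(D)), 'descend');
Gmat     = real(V(:, ord(1:d)));                         % G = (beta_1,...,beta_d)
Y        = Gmat' * U;                                    % training discriminant vectors

% --- Classification of an unknown image xtest ---------------------------
kx = arrayfun(@(i) kfun(Xtrain(:,i), xtest), 1:M)';      % kernel values against training set
uk = (G' * kx) ./ sqrt(lam);                             % KPCA feature vector
yk = Gmat' * uk;                                         % Fisher kernel discriminant vector

dist = zeros(g,1);
for j = 1:g
    ybar_j  = mean(Y(:, labels == j), 2);                % class average discriminant vector
    dist(j) = norm(yk - ybar_j);
end
[~, identity] = min(dist);                               % nearest neighbour rule

In practice the classification step would be repeated for every test image, and the recognition accuracy computed as in the earlier Eigenface and Fisherface experiments.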

5.7 Connection to the Fisherface Method

Observe the similarities between the two phase Kernel Fisherface approach and the linear two step Fisherface method. The methods are consistent in their structure, in that first principal component analysis is conducted (KPCA or PCA respectively) and secondly Fisher's LDA is applied. This consistency is easily interpreted. Consider using a linear mapping Φ, that is:

k(x_i, x_j) := (x_i · x_j)    (5.50)

which is a polynomial kernel function of degree 1. So we are considering a linear space F. Then KFD degenerates exactly to PCA plus LDA. Hence with this linear kernel function, the Kernel Fisherface method is exactly equal to the Fisherface method.

5.8 Analysis of the Kernel Fisherface Method

The Kernel Fisherface method combines the nonlinear properties of kernel methods with the discriminatory power of Fisher's LDA. Results produced from the implementation of this technique are widely available, see [34] and [33]. We shall analyse these results to determine whether the inclusion of the nonlinear characteristics does indeed improve our ability to distinguish between face images.

The results given in [34] illustrate how the Kernel Fisherface method achieves a higher recognition accuracy in comparison to that of the Fisherface and Eigenface methods. We subsequently infer that the technique is successful in extracting the nonlinear features of face images. Furthermore, we can conclude that these nonlinear features are indeed useful for discriminatory purposes. In particular, the method provides superior performance in situations in which there are extreme lighting variations. The limited success of the Fisherface method in these conditions suggests that nonlinear feature extraction is necessary when considering such variations, therefore hinting that our earlier assumption of linearity is not valid. Subsequently, the use of Kernel Fisherfaces is advantageous in extreme lighting conditions.

Additionally, the use of KFD has been proposed to model multiview face images [35]. This is supported by experimental results, given in [34], which illustrate the success of the Kernel Fisherface method when considering images containing significant pose variation. Consequently we are able to relax the strict restrictions on the face images we have so far been limited to considering, i.e. we are no longer restricted to just full frontal face images. Kernel Fisherfaces is thus an effective method for face recognition in many real-world applications. Subsequently the technique offers a robust approach to the problem.
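To see the degenerate case of Section 5.7 in code, one only needs to swap the kernel handle passed to the sketches above; the two handles below are illustrative assumptions rather than definitions taken from this report's appendices.

% With the linear (degree-1 polynomial) kernel the kernel matrix reduces to
% Ktil = Xtrain'*Xtrain, and the two-step procedure of Section 5.5.3 becomes
% ordinary PCA followed by Fisher's LDA, i.e. the Fisherface method.
k_linear   = @(a,b) a' * b;                             % k(x_i, x_j) = (x_i . x_j)
sigma      = 1e3;                                       % assumed kernel width
k_gaussian = @(a,b) exp(-sum((a-b).^2) / (2*sigma^2));  % a genuinely nonlinear choice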

Chapter 6

Conclusion

6.1 Conclusion

In this report we have focused our efforts on the identification stage of the face recognition problem. We have considered various approaches to determine the optimal way to extract discriminating features from a set of face images, and then to identify the individual in an image based on these extracted features.

As an intuitive first approach to this problem, we began by considering linear subspace analysis to attempt this feature extraction. The idea was that we would be able to make the high dimensional image space more compact and more useful for discriminatory purposes, by linearly projecting face images down into a new lower dimensional subspace, as defined by certain optimal features. We began by considering the Eigenface method, given in Chapter 3, in which we used PCA to define the low dimensional feature space. The idea of PCA is to select a set of orthonormal basis eigenvectors, called the principal components, with the aim of retaining as much variability from the original space as possible. We called these principal components Eigenfaces. We showed how any face image can be approximately reconstructed using just a small number of these Eigenfaces. Furthermore, considering face databases from the University of Essex [12], our experimental analysis showed that using just the first 20 Eigenfaces was optimal to maximise the recognition accuracy of the Eigenface method; this analysis is shown in the corresponding figure in Chapter 3.

However, the success of the Eigenface method is inhibited by three key limitations arising from the use of PCA. The first is the assumption that large variances are important. This means the Eigenface method selects the most expressive features, which could include those due to lighting and pose variations; we saw, through our experimental investigations, how such features are not optimal for discriminatory purposes. The second is due to the non-parametric nature of PCA: PCA does not make any assumptions on the structure of the data set, which leads to a loss of information as we do not account for class separability. The third limitation is due to the linear nature of the method. In using a linear projection we are unable to take into account any of the higher order pixel correlations present in face images; the Eigenface method consequently assumes that such nonlinear relations are not necessary to discriminate between different individuals.

We then used these limitations as the motivation behind all further methods we went on to consider, aiming to construct improved approaches in which one or more of these restrictions is removed. In Chapter 4 we began to consider the use of linear discriminant analysis, namely Fisher's LDA. We subsequently addressed the first two aforementioned limitations, as we sought to find the optimal discriminant space in which there is maximal group separability. Fisher's LDA thus utilises the class labels of each image, as it seeks to find the optimal linear transformation that maximises the between class scatter whilst at the same time minimising the within class scatter. We then constructed the two step Fisherface method, in which we first applied PCA to the image space and then secondly used Fisher's LDA to extract the optimal discriminant vectors. Implementation of the Fisherface method illustrated its superior performance in comparison to that of the Eigenface method, especially under

extreme pose and lighting variation.

Finally, in Chapter 5 we sought to address the third limitation, considering whether it is valid to assume that only linear features are required to discriminate between face images. We utilised the concept of kernel methods to capture the nonlinear characteristics of the image space using the kernel trick. The kernel trick is to first map our image space into the implicit feature space F, via a suitable nonlinear mapping, and then secondly to apply a linear discrimination technique within this new space. We subsequently formulated the powerful Kernel Fisher Discriminant method, called the Kernel Fisherface method, which combines this nonlinear property of kernel methods with the discriminatory powers of Fisher's LDA. We have illustrated the robustness of such a method, in terms of its success in situations in which we allow variation due to lighting and pose.

6.2 Future Work

There are still numerous issues which must be addressed in order to ensure the robustness of a working face recognition system based on the methods we have discussed. We shall briefly mention some key areas and hint towards potential solutions that could be explored in future work.

In all preceding analysis we have implicitly assumed a Gaussian distribution of the face space when considering a nearest neighbour approach to classification. We have discussed how a Gaussian distribution appears reasonable. However, it is difficult, if not impossible, to estimate the true distribution of face images. We subsequently have no a priori reason to assume any particular density function of face images. It would then be useful if we could develop an unsupervised method that enables us to learn about the distribution of face classes, to either confirm our Gaussian assumption or propose an alternative model. Nonlinear networks suggest a useful and promising way to learn about the face space distributions by example [36].

We have seen the robustness of the Kernel Fisherface method with regards to small variations in pose. However, are we able to further improve this robustness and extend the technique to recognise a face image in, say, profile view? [4] One potential suggestion would be to redefine each face class in terms of a set of characteristic views [5]. For example, an individual could be represented by a face class corresponding to a frontal face image, side views at ±45 degrees, and left and right profile images. It is then hoped that we would be able to recognise an image containing a face anywhere from frontal to profile view by approximating the real views by interpolation among the known fixed views; observe figure 6.1.

Figure 6.1: Various pose views of one individual [38].

Another possible extension, to further improve our knowledge of the face space, is to investigate the inclusion of prior knowledge into the discriminant problem. Throughout all of our analysis we have assumed an equal prior probability for each face class. But is it really valid to assume we are equally likely to be given an image of one individual over any other? It is reasonable to assume we will have some prior knowledge of the face set we are considering. We could then look to include this information in our discriminant problem, subsequently developing an improved Bayesian approach. It is indeed likely that we will have some knowledge of the context of the images we are considering. For example, consider the face recognition systems used for video surveillance in a casino.
We could assume, a priori, that any face image is more likely to be that of an adult than of a teenager or child. We could then go about deciding upon the prior probabilities of each age group. Subsequently we would be able to assign these probabilities to each of our face classes to include this knowledge in our analysis. This would then suggest an improved method of discrimination, based on a Bayesian probabilistic approach.

Throughout this report we have focused on the identification part of the face recognition problem. We have subsequently developed methods which enable us to determine whether the identity of an individual is known or not. However, we could also alter the face recognition problem and how we address it. For instance, instead of seeking to determine the identity, we could aim to distinguish some other feature. Possible suggestions include: gender; the presence of eye glasses; hairstyle; or even facial expression. Hence, instead of classifying the images based on the individual present, we could group images based on some other defining feature. Say, for example, we organised our images based on the facial expression the individual is portraying. We would need to assess whether there is sufficient variation between, say, a smile and a frown for them to be easily distinguishable, and thus whether the classification techniques we have considered are able to discriminate between the two. If so, we could then seek to train a machine to recognise emotions.

In a similar manner we could also alter our approach to the problem to develop a method for computer-based age estimation, and further age simulation [38]. Human age can be directly inferred from distinct patterns emerging in facial appearance. Consequently, if we were to organise the images into age groups, would it be possible to develop a classification technique to automatically assign an age to an individual's face? For example, humans can easily follow the ageing process and assign an age estimate to each of the face images of Albert Einstein given in figure 6.2. It would subsequently be interesting, and practically very useful, to determine whether we could train a machine to do the same. The use of the nearest neighbour classifier or Artificial Neural Networks has been suggested as a possible approach [38].

Figure 6.2: The ageing process shown on Albert Einstein's face [38].

This subsequently leads on to the prospect of face image synthesis, in which we could render an image aesthetically with natural ageing effects. This has a wide range of potential applications, from forensic art and security control to cosmetology.

6.3 Acknowledgements

I am particularly thankful to my supervisor, Dr Kasper Peeters, for his advice and guidance throughout this project. I am also indebted to the Face Recognition Homepage [2], both for enlightening me to the extremely relevant research area that is face recognition and for providing me with a starting point for my exploration into the field.

Acknowledgement is also due to the following software used during this project:

MATLAB: used for the coding requirements of this project.
LaTeX: used in the production of this report.
Microsoft Paint: used to assist in the production of figures used in this report.
Microsoft Excel: used to enable the analysis of data and the production of relevant plots.


More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

A Modular NMF Matching Algorithm for Radiation Spectra

A Modular NMF Matching Algorithm for Radiation Spectra A Modular NMF Matching Algorithm for Radiation Spectra Melissa L. Koudelka Sensor Exploitation Applications Sandia National Laboratories mlkoude@sandia.gov Daniel J. Dorsey Systems Technologies Sandia

More information

Data Preprocessing Tasks

Data Preprocessing Tasks Data Tasks 1 2 3 Data Reduction 4 We re here. 1 Dimensionality Reduction Dimensionality reduction is a commonly used approach for generating fewer features. Typically used because too many features can

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Facial Expression Recognition using Eigenfaces and SVM

Facial Expression Recognition using Eigenfaces and SVM Facial Expression Recognition using Eigenfaces and SVM Prof. Lalita B. Patil Assistant Professor Dept of Electronics and Telecommunication, MGMCET, Kamothe, Navi Mumbai (Maharashtra), INDIA. Prof.V.R.Bhosale

More information

Machine Learning and Pattern Recognition Density Estimation: Gaussians

Machine Learning and Pattern Recognition Density Estimation: Gaussians Machine Learning and Pattern Recognition Density Estimation: Gaussians Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh 10 Crichton

More information

Face Recognition. Lecture-14

Face Recognition. Lecture-14 Face Recognition Lecture-14 Face Recognition imple Approach Recognize faces mug shots) using gray levels appearance). Each image is mapped to a long vector of gray levels. everal views of each person are

More information

Karhunen-Loève Transform KLT. JanKees van der Poel D.Sc. Student, Mechanical Engineering

Karhunen-Loève Transform KLT. JanKees van der Poel D.Sc. Student, Mechanical Engineering Karhunen-Loève Transform KLT JanKees van der Poel D.Sc. Student, Mechanical Engineering Karhunen-Loève Transform Has many names cited in literature: Karhunen-Loève Transform (KLT); Karhunen-Loève Decomposition

More information

Dimensionality Reduction

Dimensionality Reduction Lecture 5 1 Outline 1. Overview a) What is? b) Why? 2. Principal Component Analysis (PCA) a) Objectives b) Explaining variability c) SVD 3. Related approaches a) ICA b) Autoencoders 2 Example 1: Sportsball

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, data types 3 Data sources and preparation Project 1 out 4

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, data types 3 Data sources and preparation Project 1 out 4 Lecture Topic Projects 1 Intro, schedule, and logistics 2 Applications of visual analytics, data types 3 Data sources and preparation Project 1 out 4 Data reduction, similarity & distance, data augmentation

More information

When Fisher meets Fukunaga-Koontz: A New Look at Linear Discriminants

When Fisher meets Fukunaga-Koontz: A New Look at Linear Discriminants When Fisher meets Fukunaga-Koontz: A New Look at Linear Discriminants Sheng Zhang erence Sim School of Computing, National University of Singapore 3 Science Drive 2, Singapore 7543 {zhangshe, tsim}@comp.nus.edu.sg

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Covariance and Principal Components

Covariance and Principal Components COMP3204/COMP6223: Computer Vision Covariance and Principal Components Jonathon Hare jsh2@ecs.soton.ac.uk Variance and Covariance Random Variables and Expected Values Mathematicians talk variance (and

More information

L26: Advanced dimensionality reduction

L26: Advanced dimensionality reduction L26: Advanced dimensionality reduction The snapshot CA approach Oriented rincipal Components Analysis Non-linear dimensionality reduction (manifold learning) ISOMA Locally Linear Embedding CSCE 666 attern

More information

Principal Components Analysis. Sargur Srihari University at Buffalo

Principal Components Analysis. Sargur Srihari University at Buffalo Principal Components Analysis Sargur Srihari University at Buffalo 1 Topics Projection Pursuit Methods Principal Components Examples of using PCA Graphical use of PCA Multidimensional Scaling Srihari 2

More information

Non-parametric Classification of Facial Features

Non-parametric Classification of Facial Features Non-parametric Classification of Facial Features Hyun Sung Chang Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology Problem statement In this project, I attempted

More information

Machine Learning 2nd Edition

Machine Learning 2nd Edition INTRODUCTION TO Lecture Slides for Machine Learning 2nd Edition ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/ The MIT Press, 2010

More information

Automatic Identity Verification Using Face Images

Automatic Identity Verification Using Face Images Automatic Identity Verification Using Face Images Sabry F. Saraya and John F. W. Zaki Computer & Systems Eng. Dept. Faculty of Engineering Mansoura University. Abstract This paper presents two types of

More information

Supervised locally linear embedding

Supervised locally linear embedding Supervised locally linear embedding Dick de Ridder 1, Olga Kouropteva 2, Oleg Okun 2, Matti Pietikäinen 2 and Robert P.W. Duin 1 1 Pattern Recognition Group, Department of Imaging Science and Technology,

More information