COVARIANCE REGULARIZATION FOR SUPERVISED LEARNING IN HIGH DIMENSIONS
DANIEL L. ELLIOTT (dane@cs.colostate.edu)
CHARLES W. ANDERSON
Department of Computer Science, Colorado State University, Fort Collins, Colorado, USA

MICHAEL KIRBY
Department of Mathematics, Colorado State University, Fort Collins, Colorado, USA

ABSTRACT

This paper studies the effect of covariance regularization on the classification of high-dimensional data. This is done by fitting a mixture of Gaussians with a regularized covariance matrix to each class. Three data sets from different domains are chosen to suggest that the results apply to any domain with high-dimensional data. The regularization needs of the data are also compared when the data are pre-processed using the dimensionality reduction techniques principal component analysis (PCA) and random projection. One observation is that a large amount of covariance regularization consistently provides classification accuracy as good as, if not better than, using little or no covariance regularization. The results also indicate that random projection complements covariance regularization.

1 INTRODUCTION

When classifying high-dimensional data, the mixture of Gaussians (MoG) model has been largely neglected in the literature in favor of approximations to a mixture of Gaussians. Another common solution is to reduce the dimension of the data prior to learning. PCA is the most popular method in these situations, but random projection is gaining attention (Candes & Tao 2006). Covariance regularization remains an active research topic (Robinson 2009). This paper applies a mixture of Gaussians model, learned via expectation-maximization (EM) with shrinkage covariance regularization, to three data sets from different domains. The effect of covariance regularization is examined in conjunction with both of these dimensionality reduction techniques. Section 2 begins by presenting several popular algorithms as MoG with covariance regularization. After presenting the experimental methodology in Section 3 and experimental results in Section 4, conclusions are summarized in Section 5.

2 BACKGROUND

This section summarizes MoG with covariance regularization and presents several long-standing algorithms as MoG with covariance regularization.
2.1 COVARIANCE REGULARIZATION

Covariance regularization is simply a method for reducing the number of free parameters in a model. With too little data to estimate the parameters' true values accurately, the result is over-fitting, which leads to poor generalization to unseen data. When using a MoG, too little data can also cause the learned covariance matrix to be singular. When a mixture of Gaussians is applied to high-dimensional data, it is therefore commonly constrained (e.g., using only the diagonal of the covariance matrix) to ensure non-singularity (Hammoud & Mohr 2000).

Many variations on the mixture of Gaussians model have been proposed to tackle these problems. Moghaddam and Pentland proposed the use of a mixture of Gaussians on PCA-projected data for object detection and tested it on human faces, human hands, and facial features such as eyes, nose, and mouth (Moghaddam & Pentland 1997). They show how to use principal component analysis to estimate a Gaussian distribution. Hinton et al. applied a mixture of factor analyzers (MFA) to estimate a MoG for a handwritten digit classification problem (Hinton et al. 1997). They proposed that factor analysis, performed locally on each cluster, could model the manifold formed through the application of small image transformations to each class prototype image. Each digit class may have multiple, separated, continuous manifolds; therefore each class is represented using a mixture of factor analyzers. Tipping and Bishop brought principal component analysis to the world of mixture of Gaussians through their mixture of probabilistic principal component analyzers (MPPCA) (Tipping & Bishop 1999). MPPCA is based upon MFA, but FA is replaced by PCA performed locally for each cluster.

Researchers have long focused on linear discriminant analysis (LDA) for classification problems. LDA and quadratic discriminant analysis (QDA) have roots in a mixture of Gaussians.
LDA assumes that each class is modeled using a single Gaussian constrained to share a covariance matrix with the other classes. QDA allows each class to be modeled using a single, unique Gaussian, resulting in a non-linear decision boundary (Bishop 2007). The LDA estimate can be regularized further, as done in PCA+LDA (Belhumeur et al. 1997). In recent years several variations on this theme have appeared in which the PCA null space (eigenvectors associated with zero eigenvalues) is exploited, or completely ignored for its lack of discriminatory information. Regularized discriminant analysis (RDA) is one such technique (Ye & Wang 2006).

Transformation methods such as MFA and MPPCA are presented not as covariance regularization methods but as manifold and subspace learning methods. These algorithms allow simultaneous classification and dimensionality reduction, an improvement upon previous efforts which created subspaces prior to or after clustering. The benefit of these algorithms was considered to be their invariance to small image transformations, and their relationship to covariance regularization has been largely ignored.

The data itself may also be modified to require less covariance to achieve a good fit. PCA is a popular method for data whitening. Random projection has also been shown to whiten the data (Dasgupta 2000, Deegalla & Bostrom 2006).

2.2 MIXTURE OF GAUSSIANS WITH COVARIANCE REGULARIZATION

MoG is historically the most used and researched mixture model. Its strength comes from its ability to approximate any density function. Shrinkage is the covariance regularization approach applied in these experiments; it has the benefit of being simple and efficient to calculate. A shrinkage estimate, Σ̂, usually involves a parameter, λ, that balances between the empirical covariance matrix, Σ, and some target matrix, T:

    Σ̂ = λT + (1 − λ)Σ    (1)

Two target matrices from (Schäfer & Strimmer 2005) are considered in this paper, computed using (2) or (3):

    T_ij = 1 if i = j, 0 if i ≠ j    (2)

    T_ij = σ_ii if i = j, 0 if i ≠ j    (3)

When λ = 1, (2) is equivalent to fuzzy k-means (Mitchell 1997), while (3) is equivalent to a MoG with only variance terms. These two versions will be referred to as FKM and DIAG, respectively.

3 METHODOLOGY

In these experiments, a MoG is fit to each class using EM (Bishop 2007) with the additional shrinkage step (1). For implementation details, see (Elliott 2009). Although a mixture of Gaussians is fit to each class in an unsupervised way, supervised learning is still performed by fitting a MoG to each class and then computing the Bayes optimal classification (Mitchell 1997):

    argmax_{c ∈ C} P(c) p(x | Θ_c)    (4)

Here, P(c) is simply the fraction of the training data that belongs to class c, and p(x | Θ_c) is the probability of a data point given the MoG model for class c.

3.1 DATA SETS

Three data sets are used for experimentation: mfeat, isolet, and a set of appearance-based image data. The appearance-based image data set consists of three different classes collected from the Internet: cat/dog faces, Christmas trees, and sunsets (93 images) (Elliott 2009). The cat/dog images have been manipulated to include only the face of the animal and are hand-registered using the eyes. The Christmas tree images were chosen to have a tree in the middle, and the sunset images were chosen to have a bright middle region and a dark lower region. The multiple features (mfeat) data set (Frank & Asuncion 2010) consists of 649 features extracted from 2,000 samples of handwritten digits. The isolet data set (Frank & Asuncion 2010) consists of 617 features extracted from 7,797 samples of 150 subjects speaking each letter of the alphabet twice.

3.2 DESCRIPTION OF EXPERIMENTS

LDA and logistic regression (LogReg) (Bishop 2007) results are included with each experiment for comparison.
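The shrinkage estimate (1) with the two targets (2) and (3) can be sketched in NumPy. This is an illustrative reconstruction, not the paper's implementation; the function name and toy problem sizes are invented for the example:

```python
import numpy as np

def shrinkage_covariance(X, lam, target="diag"):
    """Shrinkage estimate of Eq. (1): Sigma_hat = lam * T + (1 - lam) * Sigma.

    target="identity" uses the identity target of Eq. (2) (the FKM variant);
    target="diag" uses the diagonal target of Eq. (3) (the DIAG variant).
    """
    Sigma = np.cov(X, rowvar=False)      # empirical covariance of the data
    if target == "identity":
        T = np.eye(Sigma.shape[0])       # Eq. (2): T_ij = 1 if i == j, else 0
    else:
        T = np.diag(np.diag(Sigma))      # Eq. (3): T_ij = sigma_ii if i == j, else 0
    return lam * T + (1.0 - lam) * Sigma

# Toy high-dimensional setting: 10 samples in 50 dimensions, so the
# empirical covariance is singular, but the shrunk estimate is not.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
S = shrinkage_covariance(X, lam=0.9, target="diag")
print(np.all(np.linalg.eigvalsh(S) > 0))   # shrunk estimate is positive definite
```

Because the targets in (2) and (3) are positive definite whenever every feature has non-zero variance, any λ > 0 yields an invertible covariance even when the sample count is far below the dimension, which is the situation this paper studies.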
Supervised MoG with shrinkage has two experimental parameters: C ∈ {1, 2, 3}, the number of clusters per class, and λ ∈ [0, 1], the shrinkage parameter. The best combination of experimental parameter values is chosen using cross-validation. Algorithm comparison is summarized by the classification accuracy averaged over a number of random partitions of the data. LogReg and LDA have no experimental parameters. Random projection matrices are obtained via the QR decomposition of a random matrix. The PCA subspace is created from the training data, X.
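The random projection step described above, an orthonormal basis from the QR decomposition of a random matrix, can be sketched as follows. The function name and dimensions are invented for illustration:

```python
import numpy as np

def random_projection_basis(d_in, d_out, rng):
    """Orthonormal random basis via QR decomposition of a Gaussian matrix,
    as described for these experiments. Columns of Q span a random subspace."""
    A = rng.normal(size=(d_in, d_out))
    Q, _ = np.linalg.qr(A)               # Q: (d_in, d_out) with orthonormal columns
    return Q

rng = np.random.default_rng(0)
Q = random_projection_basis(617, 50, rng)   # e.g. isolet's 617 features down to 50
X = rng.normal(size=(100, 617))             # 100 hypothetical samples
X_proj = X @ Q                              # each row projected to 50 dimensions
print(X_proj.shape)                         # (100, 50)
```

Unlike the PCA basis, Q is independent of the training data, which is one reason the paper can compare how "data-specific" versus "data-agnostic" subspaces interact with the regularization parameter λ.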
4 RESULTS

Figure 1 shows the classification accuracy on the appearance-based image data. These algorithms performed similarly on the validation data and the testing data, indicating that cross-validation is a good method for choosing parameters. FKM, DIAG, and LogReg perform similarly until the dimension hits 2, at which point LogReg's performance begins to decline while DIAG and FKM enjoy their best performance. This disparity may be a result of the data being multi-modal in higher dimensions.

Figure 1: Classification accuracy on appearance-based image data for both projection methods: (a) PCA-projected data, (b) random-projected data.

As the dimension of the data increases, the performance of DIAG decreases to a level below that of LogReg on the PCA-projected data. DIAG is much more competitive when the data is pre-processed using random projection, beating LogReg in all but the smallest dimensions. DIAG's performance drop-off on the PCA-projected data occurs where the number of eigenvectors first exceeds the number of training samples. FKM performs consistently at or near the top when pre-processed using either projection method and, along with LogReg, has nearly identical classification accuracies for both projection methods. Unlike with the randomly-projected data, LDA is occasionally able to run without becoming numerically unstable on the PCA-projected data, because the PCA subspace has just enough variance in these dimensions that the covariance matrix used by LDA is non-singular.

Figure 3 shows the results of the mfeat and isolet experiments. The differences in performance between LogReg, DIAG, and FKM are much less pronounced with these two data sets for both projection methods. This is most likely a result of the two data sets being much larger and possibly uni-modal (which assists LogReg and LDA). However, the graphs of the three data sets' results have similar features.
Figure 3: Classification accuracy on the mfeat and isolet data sets for both the PCA and RND projection methods: (a) mfeat/PCA, (b) mfeat/RND, (c) isolet/PCA, (d) isolet/RND.

Compared with PCA, random projection shows a much more dramatic reduction in performance once too many dimensions are dropped. This is because the PCA subspace dimensions are sorted by how much variance they capture, while there is nothing special about the first random dimensions. Also, DIAG's performance eventually drops as an increasing number of noisy PCA subspace dimensions are retained, while random projection shows consistent performance across dimensions.

In addition to the dimension and projection method, the training set size, |X|, is also varied for these two data sets, but the dominance of FKM over DIAG and LogReg seen with the appearance-based image data is not replicated. In fact, LogReg, due to its much lower number of parameters, is unaffected by smaller |X|. Figure 5, which displays the results for DIAG and FKM for both projection methods for various λ values across dimension and |X| on the mfeat and isolet data sets, shows a similar relationship between λ and classification accuracy: higher λ values are best up to around 0.9, after which accuracy drops to such an extent that the largest λ values are among the worst performing.

Figure 5: Classification accuracy of various λ values for both covariance regularization methods, both projection methods, and the mfeat and isolet data sets: (a) DIAG/PCA/mfeat, (b) FKM/PCA/mfeat, (c) DIAG/RND/mfeat, (d) FKM/RND/mfeat, (e) DIAG/PCA/isolet, (f) FKM/PCA/isolet, (g) DIAG/RND/isolet, (h) FKM/RND/isolet.

Figure 2 shows how the classification accuracies of the two MoG versions are affected by the choice of λ for both projection schemes on the appearance-based data. Figures 2a and 2c show a preference toward higher λ values for FKM, with a notable exception at λ = 1. Figure 2b shows a less clear relationship for λ with DIAG on the PCA-projected data. Figure 2c shows the performance of FKM with random projection and λ = 1 rising as the dimension increases, and there is no reason to believe that it will not eventually draw even with the other λ values. Figure 2d reveals no apparent favorite value of λ for DIAG with random projection, while λ = 1 is inconsistent and less desirable than the other values on average.

Figure 2: Classification accuracy with respect to dimension for both DIAG and FKM versions with several λ values on test data for both projection methods on the appearance-based image data: (a) PCA with FKM, (b) PCA with DIAG, (c) random projection with FKM, (d) random projection with DIAG.

5 CONCLUSIONS

By a very small margin, FKM with a high λ value and random projection keeping many dimensions (little to no dimensionality reduction) gives the best performance. The difference in DIAG's performance between the two projection methods as dimension increases is most likely a result of the later PCA dimensions being uninformative noise. Recall that DIAG is unable to remove any variance; therefore, noisy dimensions have a more negative effect on it than on its FKM counterpart. This drop in performance is also present on the training data. At lower dimensions, random projection pre-processing performs worse than PCA. However, once the dimensionality reaches a certain level, the performance of FKM and LogReg remains steady for both projection methods.

Figure 4: Classification accuracy of various λ values with DIAG on the mfeat data set for the PCA projection method. The number beneath each plot is the percentage of the 2,000 data points used for training.

Figure 4 shows an odd relationship between the number of PCA dimensions kept and the classification accuracy of DIAG as |X| decreases. As expected, the dimension at which DIAG's performance drops off for all λ values decreases along with |X|. However, as |X| shrinks, the magnitude of this dip decreases as well. This may be explained by a smaller |X| decreasing the specificity of the PCA subspace to the training data. The dropped PCA dimensions for large |X| are almost pure noise, while the PCA subspace dimensions computed from a small |X| are more general. Therefore, keeping all PCA dimensions computed from a small |X| results in better performance than for a large |X| once the dimension exceeds the number of training samples, in part because these later dimensions are now more like random projection.

Most importantly, FKM with high λ is consistently at the top in these experiments. This tells us that some covariance information is necessary, but a little covariance goes a long way toward classifying this data and generalizing to unseen data, while adding additional covariance primarily increases the risk of over-fitting with little chance of improving classification accuracy. In addition to the promise of projection onto a high-dimensional random basis combined with FKM at high λ, our results show a simpler relationship between the choice of λ and the number of dimensions to keep when projecting with a random projection. By comparison, PCA pre-processing, still the most popular dimensionality reduction technique in many domains, is fussy, and its optimal number of dimensions differs for each data set. If a near-optimal number of subspace dimensions is not chosen, there is a great deal of variation in performance between λ values for DIAG. Another option is to apply FKM when pre-processing via PCA projection. However, computation of a random basis is much faster than PCA computation. Either way, it appears important to involve a large degree of covariance regularization, through either random projection or FKM with high λ values.

For all data sets, as the dimension of the randomly generated basis increases, the disparity in performance between the varying levels of covariance regularization decreases. If this trend were to continue, one could expect FKM with λ = 1 to become competitive with all λ values for FKM and DIAG, and k-means could replace MoG. In this case, projecting data into a higher dimension and using k-means would yield an algorithm with superior accuracy and improved computational complexity. Investigating this further is left to future work.

These observations span the three data sets used in this paper, which represent the use of raw data and of features computed from the data, and situations with both sufficient and insufficient training data. Isolet and mfeat may be uni-modal and are a good fit for LogReg when the training set size is diminished. Otherwise, the experimental results indicate that random projection with little dimensionality reduction, combined with FKM with high λ, is a safe bet for obtaining good classification accuracy.

References

Belhumeur, P., Hespanha, J. & Kriegman, D. (1997), Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7).
Bishop, C. M. (2007), Pattern Recognition and Machine Learning, Springer.
Candes, E. & Tao, T. (2006), Near-optimal signal recovery from random projections: Universal encoding strategies, IEEE Transactions on Information Theory 52.
Dasgupta, S. (2000), Experiments with random projections, in Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence.
Deegalla, S. & Bostrom, H. (2006), Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification, in ICMLA '06, IEEE Computer Society, Washington, DC, USA.
Elliott, D. L. (2009), Covariance regularization in mixture of Gaussians for high-dimensional image classification, Master's thesis, Colorado State University.
Frank, A. & Asuncion, A. (2010), UCI machine learning repository.
Hammoud, R. & Mohr, R. (2000), Mixture densities for video objects recognition, International Conference on Pattern Recognition.
Hinton, G., Dayan, P. & Revow, M. (1997), Modelling the manifolds of images of handwritten digits, IEEE Transactions on Neural Networks 8(1).
Mitchell, T. M. (1997), Machine Learning, McGraw-Hill, Boston.
Moghaddam, B. & Pentland, A. (1997), Probabilistic visual learning for object representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7).
Robinson, J. A. (2009), Covariance estimation in full- and reduced-dimensionality image classification, Image and Vision Computing 27(8).
Schäfer, J. & Strimmer, K. (2005), A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics, Statistical Applications in Genetics and Molecular Biology 4(1).
Tipping, M. & Bishop, C. (1999), Mixtures of probabilistic principal component analysers, Neural Computation 11(2).
Ye, J. & Wang, T. (2006), Regularized discriminant analysis for high dimensional, low sample size data, in KDD 2006, ACM, New York, NY, USA.
More informationSupervised locally linear embedding
Supervised locally linear embedding Dick de Ridder 1, Olga Kouropteva 2, Oleg Okun 2, Matti Pietikäinen 2 and Robert P.W. Duin 1 1 Pattern Recognition Group, Department of Imaging Science and Technology,
More informationDiscriminative Direction for Kernel Classifiers
Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering
More informationOn Improving the k-means Algorithm to Classify Unclassified Patterns
On Improving the k-means Algorithm to Classify Unclassified Patterns Mohamed M. Rizk 1, Safar Mohamed Safar Alghamdi 2 1 Mathematics & Statistics Department, Faculty of Science, Taif University, Taif,
More information5. Discriminant analysis
5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density
More informationMachine Learning 11. week
Machine Learning 11. week Feature Extraction-Selection Dimension reduction PCA LDA 1 Feature Extraction Any problem can be solved by machine learning methods in case of that the system must be appropriately
More informationCS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA)
CS68: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) Tim Roughgarden & Gregory Valiant April 0, 05 Introduction. Lecture Goal Principal components analysis
More informationW vs. QCD Jet Tagging at the Large Hadron Collider
W vs. QCD Jet Tagging at the Large Hadron Collider Bryan Anenberg: anenberg@stanford.edu; CS229 December 13, 2013 Problem Statement High energy collisions of protons at the Large Hadron Collider (LHC)
More informationMultiple Similarities Based Kernel Subspace Learning for Image Classification
Multiple Similarities Based Kernel Subspace Learning for Image Classification Wang Yan, Qingshan Liu, Hanqing Lu, and Songde Ma National Laboratory of Pattern Recognition, Institute of Automation, Chinese
More informationLecture 13 Visual recognition
Lecture 13 Visual recognition Announcements Silvio Savarese Lecture 13-20-Feb-14 Lecture 13 Visual recognition Object classification bag of words models Discriminative methods Generative methods Object
More informationIEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 31, NO. 5, MAY ASYMMETRIC PRINCIPAL COMPONENT ANALYSIS
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 31, NO. 5, MAY 2009 931 Short Papers Asymmetric Principal Component and Discriminant Analyses for Pattern Classification Xudong Jiang,
More informationLecture 17: Face Recogni2on
Lecture 17: Face Recogni2on Dr. Juan Carlos Niebles Stanford AI Lab Professor Fei-Fei Li Stanford Vision Lab Lecture 17-1! What we will learn today Introduc2on to face recogni2on Principal Component Analysis
More informationChemometrics: Classification of spectra
Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture
More informationHigh Dimensional Discriminant Analysis
High Dimensional Discriminant Analysis Charles Bouveyron LMC-IMAG & INRIA Rhône-Alpes Joint work with S. Girard and C. Schmid High Dimensional Discriminant Analysis - Lear seminar p.1/43 Introduction High
More informationLecture 24: Principal Component Analysis. Aykut Erdem May 2016 Hacettepe University
Lecture 4: Principal Component Analysis Aykut Erdem May 016 Hacettepe University This week Motivation PCA algorithms Applications PCA shortcomings Autoencoders Kernel PCA PCA Applications Data Visualization
More informationUnsupervised Learning with Permuted Data
Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University
More informationSTA 414/2104: Lecture 8
STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA
More informationAnalysis of Spectral Kernel Design based Semi-supervised Learning
Analysis of Spectral Kernel Design based Semi-supervised Learning Tong Zhang IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Rie Kubota Ando IBM T. J. Watson Research Center Yorktown Heights,
More informationManifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA
Manifold Learning for Signal and Visual Processing Lecture 9: Probabilistic PCA (PPCA), Factor Analysis, Mixtures of PPCA Radu Horaud INRIA Grenoble Rhone-Alpes, France Radu.Horaud@inria.fr http://perception.inrialpes.fr/
More informationLearning features by contrasting natural images with noise
Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,
More informationExpectation Maximization
Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationCS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)
CS4495/6495 Introduction to Computer Vision 8B-L2 Principle Component Analysis (and its use in Computer Vision) Wavelength 2 Wavelength 2 Principal Components Principal components are all about the directions
More informationChap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University
Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics
More informationLinear Subspace Models
Linear Subspace Models Goal: Explore linear models of a data set. Motivation: A central question in vision concerns how we represent a collection of data vectors. The data vectors may be rasterized images,
More informationMachine Learning Techniques for Computer Vision
Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM
More informationA Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier
A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe
More informationUNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014
UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2014 Exam policy: This exam allows two one-page, two-sided cheat sheets (i.e. 4 sides); No other materials. Time: 2 hours. Be sure to write
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationPrincipal Component Analysis -- PCA (also called Karhunen-Loeve transformation)
Principal Component Analysis -- PCA (also called Karhunen-Loeve transformation) PCA transforms the original input space into a lower dimensional space, by constructing dimensions that are linear combinations
More informationPCA and admixture models
PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1
More informationClassifier Complexity and Support Vector Classifiers
Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl
More informationSTA 414/2104: Lecture 8
STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks Delivered by Mark Ebden With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable
More informationSemi-Supervised Learning through Principal Directions Estimation
Semi-Supervised Learning through Principal Directions Estimation Olivier Chapelle, Bernhard Schölkopf, Jason Weston Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany {first.last}@tuebingen.mpg.de
More informationClassification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationBayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington
Bayesian Classifiers and Probability Estimation Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington 1 Data Space Suppose that we have a classification problem The
More informationLecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26
Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationCOS 429: COMPUTER VISON Face Recognition
COS 429: COMPUTER VISON Face Recognition Intro to recognition PCA and Eigenfaces LDA and Fisherfaces Face detection: Viola & Jones (Optional) generic object models for faces: the Constellation Model Reading:
More informationFace Recognition Using Multi-viewpoint Patterns for Robot Vision
11th International Symposium of Robotics Research (ISRR2003), pp.192-201, 2003 Face Recognition Using Multi-viewpoint Patterns for Robot Vision Kazuhiro Fukui and Osamu Yamaguchi Corporate Research and
More informationComputer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)
Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationRegularized Discriminant Analysis and Reduced-Rank LDA
Regularized Discriminant Analysis and Reduced-Rank LDA Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Regularized Discriminant Analysis A compromise between LDA and
More informationAlgorithm-Independent Learning Issues
Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning
More informationSubspace Methods for Visual Learning and Recognition
This is a shortened version of the tutorial given at the ECCV 2002, Copenhagen, and ICPR 2002, Quebec City. Copyright 2002 by Aleš Leonardis, University of Ljubljana, and Horst Bischof, Graz University
More informationA Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier
A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier Seiichi Ozawa 1, Shaoning Pang 2, and Nikola Kasabov 2 1 Graduate School of Science and Technology,
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationMotivating the Covariance Matrix
Motivating the Covariance Matrix Raúl Rojas Computer Science Department Freie Universität Berlin January 2009 Abstract This note reviews some interesting properties of the covariance matrix and its role
More informationLatent Variable Models and EM Algorithm
SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/
More informationProbabilistic & Unsupervised Learning
Probabilistic & Unsupervised Learning Week 2: Latent Variable Models Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College
More information