NONNEGATIVE MATRIX FACTORIZATION FOR PATTERN RECOGNITION

Oleg Okun
Machine Vision Group, Department of Electrical and Information Engineering, University of Oulu, FINLAND
email: oleg@ee.oulu.fi

Helen Priisalu
Tallinn University of Technology, ESTONIA

ABSTRACT
In this paper, linear and unsupervised dimensionality reduction via matrix factorization with nonnegativity constraints is studied as a feature extraction step followed by pattern recognition. Since the factorization is typically computed iteratively, convergence can be slow. To alleviate this problem, a significantly (more than 11 times) faster algorithm is proposed that does not cause severe degradation in classification accuracy when dimensionality reduction is followed by classification. These results are due to two modifications of the previous algorithms: feature scaling (normalization) prior to the start of the iterations, and an initialization of the iterations that combines two techniques for mapping unseen data.

KEY WORDS
pattern recognition, nonnegative matrix factorization

1 Introduction

Dimensionality reduction is typically used to cure, or at least to alleviate, the effect known as the curse of dimensionality, which negatively affects classification accuracy. The simplest way to reduce dimensionality is to linearly transform the original data. Given the original, high-dimensional data gathered in an n × m matrix V, a transformed or reduced matrix H, composed of m r-dimensional vectors (r < n and often r << n), is obtained from V according to the linear transformation V ≈ WH, where W is an n × r (basis) matrix.

Nonnegative matrix factorization (NMF) is an example of such a linear transformation, but unlike other transformations it imposes nonnegativity constraints on all the matrices involved. Thanks to this fact, it can generate a parts-based representation, since no subtractions are allowed. Lee and Seung [1] proposed a simple iterative algorithm for NMF and proved its convergence. The factorized matrices are initialized with positive random numbers before the matrix updates start. It is well known that initialization matters for any iterative algorithm: properly initialized, an algorithm converges faster. However, this issue has not yet been investigated for NMF.

Our contribution in this paper is two modifications that accelerate convergence: 1) feature scaling prior to NMF, and 2) a combination of two techniques for mapping unseen data, with a theoretical proof of why this combination leads to faster convergence. Because of its straightforward implementation, NMF has already been applied to object and pattern classification [2, 3, 4, 5, 6, 7, 8, 9]. Here we extend the application of NMF to bioinformatics: NMF coupled with a classifier is applied to protein fold recognition. Statistical analysis of the obtained error rates, by means of one-way analysis of variance (ANOVA) and multiple comparison tests, testifies to the absence of significant accuracy degradation when our modifications are introduced, compared to the standard NMF algorithm.

2 Nonnegative Matrix Factorization

Given nonnegative matrices V, W and H of sizes n × m, n × r and r × m, respectively, we aim at a factorization V ≈ WH. The value of r is selected according to the rule r < nm/(n+m) in order to obtain dimensionality reduction. Each column of W is a basis vector, while each column of H is the reduced representation of the corresponding column of V.
In other words, W can be seen as a basis that is optimized for the linear approximation of the data in V. NMF provides the following simple learning rules, which guarantee monotonic convergence to a local maximum without the need to set any adjustable parameters [1]:

W_{ia} \leftarrow W_{ia} \sum_{\mu} \frac{V_{i\mu}}{(WH)_{i\mu}} H_{a\mu},   (1)

W_{ia} \leftarrow \frac{W_{ia}}{\sum_{j} W_{ja}},   (2)

H_{a\mu} \leftarrow H_{a\mu} \sum_{i} W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}}.   (3)

The matrices W and H are initialized with positive random values. Eqs. (1)-(3) are iterated until convergence to a local maximum of the following objective function:

F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left( V_{i\mu} \log(WH)_{i\mu} - (WH)_{i\mu} \right).   (4)
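As an illustration only, the following NumPy sketch implements the multiplicative updates of Eqs. (1)-(3) and the objective of Eq. (4); the function name, the small constant eps added for numerical safety, and the use of the tol-based stopping rule described later in this section are our own assumptions rather than the authors' MATLAB implementation.

```python
import numpy as np

def nmf_divergence(V, r, tol=0.05, max_iter=5000, eps=1e-9, seed=0):
    """Hypothetical helper: multiplicative NMF updates of Eqs. (1)-(3),
    maximizing the objective F of Eq. (4), stopped when F increases by
    less than tol (the criterion described in Section 2)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps               # positive random initialization
    H = rng.random((r, m)) + eps
    F_old = -np.inf
    for _ in range(max_iter):
        WH = W @ H + eps
        W = W * ((V / WH) @ H.T)               # Eq. (1)
        W = W / W.sum(axis=0, keepdims=True)   # Eq. (2): column normalization
        WH = W @ H + eps
        H = H * (W.T @ (V / WH))               # Eq. (3)
        WH = W @ H + eps
        F_new = np.sum(V * np.log(WH) - WH)    # Eq. (4)
        if F_new - F_old < tol:                # stop when F barely increases
            break
        F_old = F_new
    return W, H
```

In the experiments reported below, the stopping threshold is tol = 0.05.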

In its original form, NMF can be slow to converge to a local maximum for large matrices and/or high data dimensionality. On the other hand, stopping after a predefined number of iterations, as is sometimes done, can be premature and yield a poor approximation. Introducing a parameter tol (0 < tol ≤ 1) to decide when to stop the iterations significantly speeds up convergence without negatively affecting the mean square error (MSE) measuring the approximation quality (it was observed in numerous experiments that the MSE decreases quickly during the first iterations, after which its rate of decrease slows down dramatically). That is, the iterations stop when F_new - F_old < tol.

After learning the NMF basis functions, i.e. the matrix W, new (previously unseen) data in the matrix V_new are mapped to the r-dimensional space by fixing W and using one of the following techniques:

1. randomly initializing H as described above and iterating Eq. (3) until convergence [3], or

2. taking the least squares solution of V_new = WH_new, i.e. H_new = (W^T W)^{-1} W^T V_new, e.g. as in [5], where V_new contains the new (unseen or test) data.

We will call the first technique iterative and the second direct, because the latter provides a straightforward non-iterative solution.

3 Our Contribution

We propose two modifications in order to accelerate convergence of the iterative NMF algorithm.

The first modification concerns feature scaling (normalization), which is linked to the initialization of the factorized matrices. Typically, these matrices are initialized with positive random numbers, say uniformly distributed between 0 and 1, in order to satisfy the nonnegativity constraints. Hence, the elements of V (the matrix of the original data) should also lie in the same range. Given that V_j is an n-dimensional feature vector, where j = 1, ..., m, its components are normalized as V_{ij} \leftarrow V_{ij}/V_{kj}, where k = \arg\max_l V_{lj}. In other words, the components of each feature vector are divided by the largest among them. As a result, the feature vectors consist of nonnegative components that do not exceed 1. Since all three matrices (V, W, H) now have entries between 0 and 1, the factorization V ≈ WH takes much less time to compute than if V kept its original (unnormalized) values, because the entries of the factorized matrices do not have to grow much before the stopping criterion on the objective function F in Eq. (4) is met. As an additional benefit, the MSE becomes much smaller, too, because the difference between the original (V_{ij}) and approximated ((WH)_{ij}) values becomes smaller, given that mn is fixed. Though this modification is simple, it brings a significant gain in convergence speed, as will be shown below, and to the best of our knowledge it has never been explicitly mentioned in the literature.

It is generally acknowledged that proper initialization significantly speeds up the convergence of an iterative algorithm. The second modification therefore concerns the initialization of the NMF iterations when mapping unseen data (generalization), i.e. after the basis matrix W has been learned. Since such a mapping involves only the matrix H (W is kept fixed), it is H that has to be initialized. We propose to initially set H to (W^T W)^{-1} W^T V_new, i.e. to the solution provided by the direct mapping technique of Section 2, because 1) it provides a better initial approximation of H_new than a random guess, and 2) it moves the start of the iterations closer to the final point, since the objective function F in Eq. (4) is increasing [1] and the inequality F_direct > F_iter always holds at initialization (the theorem below proves this fact), where F_direct and F_iter stand for the values of F when using the direct and iterative techniques, respectively.

Theorem 1. Let F_direct and F_iter be the values of the objective function in Eq. (4) when mapping unseen data with the direct and iterative techniques, respectively. Then F_direct - F_iter > 0 always holds at the start of the iterations of Eq. (3).

Proof. By definition, and treating the least squares solution of the direct technique as an exact factorization (WH = V),

F_iter = \sum_{ij} \left( V_{ij} \log(WH)_{ij} - (WH)_{ij} \right),   F_direct = \sum_{ij} \left( V_{ij} \log V_{ij} - V_{ij} \right).

The difference F_direct - F_iter is equal to

\sum_{ij} \left( V_{ij} \log V_{ij} - V_{ij} - V_{ij} \log(WH)_{ij} + (WH)_{ij} \right) = \sum_{ij} \left( V_{ij} \left( \log \frac{V_{ij}}{(WH)_{ij}} - 1 \right) + (WH)_{ij} \right) = \sum_{ij} \left( V_{ij} \log \frac{V_{ij}}{10 (WH)_{ij}} + (WH)_{ij} \right).

In the last expression we assumed the logarithm to base 10, though the natural logarithm or the logarithm to base 2 could be used as well without affecting the proof. Given that all three matrices involved are nonnegative, the last expression is positive if either of the following conditions is satisfied:

1. \log \frac{V_{ij}}{10 (WH)_{ij}} > 0,

2. (WH)_{ij} > V_{ij} \log \frac{V_{ij}}{10 (WH)_{ij}}.

Let us introduce a new variable t = V_{ij}/(WH)_{ij}. Then the above conditions can be written as

1. \log(t/10) > 0, i.e. \log t > 1,

2. 1 > t \log(t/10), i.e. \log t < (t+1)/t.

The first condition is satisfied if t > 10, whereas the second is satisfied if t < t_0 (t_0 ≈ 12). Therefore either t > 10 or t < 12 should hold for F_direct > F_iter. Since the union of these two conditions covers the whole interval [0, +∞), F_direct > F_iter holds independently of t, i.e. of whether V_{ij} > (WH)_{ij} or not. The proof remains valid for base-2 and natural logarithms, because the base (2 or approximately 2.71) that then replaces 10 above is still smaller than 12, so the whole interval [0, +∞) is again covered. Q.E.D.

Because our approach combines the direct and iterative techniques for mapping unseen data, we will call it iterative2: the whole procedure remains essentially iterative, but its initialization is modified.

4 Summary of Our Algorithm

Suppose that the whole data set is divided into a training set and a test (unseen) set. Our algorithm is summarized as follows:

1. Scale both the training and the test data and randomly initialize the factorized matrices as described in Section 3. Set the parameters tol and r.

2. Iterate Eqs. (1)-(3) until convergence to obtain the NMF basis matrix W and to map the training data to the NMF (reduced) space.

3. Given W, map the test data with the direct technique. Set negative values in the resulting matrix H_new^direct to zero.

4. Fix the basis matrix and iterate Eq. (3) until convergence, using H_new^direct as the initialization. The resulting matrix H_new^iterative2 provides the reduced representations of the test data in the NMF space.

Thus, the main distinction of our NMF algorithm from others is that it combines both techniques for mapping unseen data instead of applying only one of them.

5 Practical Application

As a challenging task, we selected protein fold recognition from bioinformatics. A protein is an amino acid sequence. One of the current trends in bioinformatics is to understand evolutionary relationships in terms of protein function. Two common approaches to identifying protein function are sequence analysis and structure analysis. Sequence analysis is based on a comparison between unknown sequences and those whose function is already known; it is argued to work well at high levels of sequence identity, but below 50% identity it becomes less reliable. Protein fold recognition is structure analysis that does not rely on sequence similarity. Proteins are said to have a common fold if they have the same major secondary structure (regions of local regularity within a fold) in the same arrangement and with the same topology, whether or not they have a common evolutionary origin. The structural similarities of proteins in the same fold arise from physical and chemical properties favoring certain arrangements and topologies, which means that various physicochemical features, such as fold compactness or hydrophobicity, can be utilized for recognition. As the gap widens between the number of known sequences and the number of experimentally determined protein structures (the ratio is more than 100 to 1, and sequence databases are doubling in size every year), the demand for automated fold recognition techniques grows rapidly.

6 Experiments

The experiments with NMF estimate the error rate obtained when NMF is combined with a classifier. Three techniques for generalization (mapping of the test data) are used: direct, iterative, and iterative2. The training data are mapped into the reduced space according to Eqs. (1)-(3), with simultaneous learning of the matrix W. We experimented with the real-world data set created by Ding and Dubchak [10], briefly described below.
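For concreteness, the following NumPy sketch spells out the feature scaling of Section 3 and the three generalization techniques compared below; it reuses the hypothetical nmf_divergence helper from the earlier sketch, and the function names, the eps safeguard, and the clipping of negative entries in the direct solution (Step 3 above) are assumptions for illustration, not the authors' code.

```python
import numpy as np

def scale_columns(V):
    """Feature scaling of Section 3: divide each column (feature vector)
    by its largest component, so that all entries lie in [0, 1]."""
    return V / V.max(axis=0, keepdims=True)

def map_direct(W, V_new):
    """Direct technique: H_new = (W^T W)^{-1} W^T V_new (least squares),
    with negative entries clipped to zero as in Step 3 of the summary."""
    H_new = np.linalg.solve(W.T @ W, W.T @ V_new)  # assumes W^T W is invertible
    return np.maximum(H_new, 0.0)

def map_iterative(W, V_new, H0, tol=0.05, max_iter=5000, eps=1e-9):
    """Iterative technique: keep W fixed and iterate Eq. (3) from H0 until
    the objective F of Eq. (4) increases by less than tol."""
    H = H0 + eps                             # eps keeps zero entries from sticking
    F_old = -np.inf
    for _ in range(max_iter):
        WH = W @ H + eps
        H = H * (W.T @ (V_new / WH))         # Eq. (3), W fixed
        WH = W @ H + eps
        F_new = np.sum(V_new * np.log(WH) - WH)
        if F_new - F_old < tol:
            break
        F_old = F_new
    return H

def map_iterative2(W, V_new, tol=0.05):
    """iterative2: start the iterations of Eq. (3) from the direct solution."""
    return map_iterative(W, V_new, map_direct(W, V_new), tol=tol)

# Illustrative usage of Steps 1-4 of the summary:
# W, H_train = nmf_divergence(scale_columns(V_train), r=50)
# H_test = map_iterative2(W, scale_columns(V_test))
```

Step 2 of the summary corresponds to running the full factorization on the scaled training matrix, and Steps 3-4 correspond to map_iterative2 applied to the scaled test matrix.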
Tests were repeated 10 times to collect the statistics necessary for comparing the three generalization techniques. Each time a different random initialization was used for learning the basis matrix W, but within each run the same learned basis matrix was used for all generalization techniques, in order to make the comparison as fair as possible. The values of r (the dimensionality of the reduced space) were set to 25, 50, 75, and 88 (88 is the largest value of r that still yields dimensionality reduction with NMF for the 125 × 313 training matrix used in the experiments), which constitutes 20%, 40%, 60%, and 70.4% of the original dimensionality, respectively, while tol was set to 0.05. In all reported statistical tests, α = 0.05. All algorithms were implemented in MATLAB running on a Pentium 4 (3 GHz CPU, 1 GB RAM).

6.1 Data

A challenging data set derived from the SCOP (Structural Classification of Proteins) database was used in the experiments. It is available online at http://crd.lbl.gov/~cding/protein/, and its detailed description can be found in [10]. The data set contains the 27 most populated folds, each represented by seven or more proteins. Ding and Dubchak already split it into training and test sets, which we use as other authors did. The feature vectors characterizing protein folds are 125-dimensional. The training set consists of 313 proteins in which any two proteins share no more than 35% sequence identity for aligned subsequences longer than 80 residues. The test set of 385 proteins is composed of sequences having less than 40% identity with each other and less than 35% identity with the proteins of the first set. In fact, 90% of the proteins in the test set have less than 25% sequence identity with the proteins of the training set. This, together with the many classes, several of which are sparsely represented, renders the task extremely difficult.

6.2 Classification Results

The K-Local Hyperplane Distance Nearest Neighbor (HKNN) classifier [11] was selected because it demonstrated competitive performance compared to the support vector machine (SVM) when both methods were applied to the abovementioned protein data set in the original space, i.e. without dimensionality reduction [12]. In addition, when applied to other data sets, the combination of NMF and HKNN showed very good results [13], rendering HKNN a natural choice to pair with NMF. HKNN computes the distance of each test point x to L local hyperplanes, where L is the number of classes. The lth hyperplane is spanned by the K nearest neighbors of x in the training set that belong to the lth class, and x is assigned to the class whose hyperplane is closest to it. HKNN has two parameters to predefine, K and λ (regularization). Normalized features were used, since feature normalization to zero mean and unit variance prior to HKNN dramatically increases the classification accuracy (on average by 6%). The optimal values of K and λ, determined via cross-validation, are 7 and 8, respectively.
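As an illustration of the classifier just described, here is a rough NumPy sketch of the penalized HKNN decision rule as we read it from [11]; the regularized least squares form of the hyperplane distance and the helper name are our assumptions, and K = 7 and λ = 8 are the values reported above (features are assumed to be already normalized to zero mean and unit variance).

```python
import numpy as np

def hknn_predict(x, X_train, y_train, K=7, lam=8.0):
    """Sketch of K-Local Hyperplane Distance Nearest Neighbor (HKNN) [11]:
    for each class, build a local hyperplane from the K nearest neighbors
    of x in that class and assign x to the class whose hyperplane is closest."""
    best_class, best_dist = None, np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        # K nearest neighbors of x within class c
        idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:K]
        N = Xc[idx]
        centroid = N.mean(axis=0)
        V = (N - centroid).T          # directions spanning the local hyperplane
        # penalized projection: minimize ||x - centroid - V a||^2 + lam * ||a||^2
        A = V.T @ V + lam * np.eye(V.shape[1])
        a = np.linalg.solve(A, V.T @ (x - centroid))
        dist = np.linalg.norm(x - centroid - V @ a)
        if dist < best_dist:
            best_class, best_dist = c, dist
    return best_class
```

In the experiments below, HKNN is applied either to the original 125-dimensional feature vectors or to their NMF representations.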
Table 1 shows the error rates obtained with HKNN, SVM, and various neural networks when classifying protein folds in the original, 125-dimensional space. A brief survey of the methods used for classifying this data set can be found in [12]. From Table 1 one can see that HKNN outperformed all but one of its competitors, even though some of them employed ensembles of classifiers; the exception is [14], where an ensemble of four-layer Discretized Interpretable Multi-Layer Perceptrons (DIMLP) was combined via bagging or arcing (a single DIMLP yields an error rate of 61.8% [14]).

Table 1. Error rates (%) when classifying protein folds in the original space with different methods.

Source  Classifier  Error rate
[14]    DIMLP-B     38.8
[14]    DIMLP-A     40.9
[15]    MLP         51.2
[15]    GRNN        55.8
[15]    RBFN        50.6
[15]    SVM         48.6
[16]    RBFN        48.8
[10]    SVM         46.1
[12]    HKNN        42.7

Let us now turn to the error rates in the reduced space. As mentioned above, we tried four values of r and three generalization techniques, with and without scaling prior to NMF. For each value of r, NMF followed by HKNN was repeated 10 times, and the results were assembled into a single column vector [e^{88}_1, ..., e^{88}_{10}, e^{75}_1, ..., e^{75}_{10}, e^{50}_1, ..., e^{50}_{10}, e^{25}_1, ..., e^{25}_{10}]^T, where the first ten error rates correspond to r = 88, the next ten to r = 75, and so on. As a result, a 40 × 6 matrix of error rates was generated (40 = 10 runs × 4 values of r; 6 = 3 generalization techniques × 2 scaling options for NMF). This matrix was then subjected to one-way analysis of variance (ANOVA) and multiple comparison tests in order to draw statistically grounded conclusions about all generalization techniques. Table 2 shows the identifiers used below for the six combinations of generalization technique and scaling option.

Table 2. Generalization techniques and their identifiers.

Identifier  Technique   Scaling prior to NMF
1           Direct      No
2           Iterative   No
3           Iterative2  No
4           Direct      Yes
5           Iterative   Yes
6           Iterative2  Yes

The one-way ANOVA test is applied first in order to find out whether the mean error rates of all six techniques are the same (null hypothesis H_0: µ_1 = µ_2 = ... = µ_6) or not. If the returned p-value is smaller than α = 0.05 (which happened in our case), the null hypothesis is rejected, which implies that the mean error rates are not all the same. The next step is to determine, by means of the multiple comparison test, which pairs of means are significantly different and which are not. Such a sequence of tests constitutes the standard statistical protocol when two or more means are to be compared. Table 3 contains the results of the multiple comparison test; they confirm that the direct technique stands apart from both iterative techniques.

Table 3. Results of the multiple comparison test. The numbers in the first two columns are technique identifiers. The third and fifth columns delimit the confidence interval for the estimated difference in means (fourth column) of the two techniques. If the confidence interval contains 0 (zero), the difference in means is not significant, i.e. H_0 is not rejected.

I  J  Lower  Difference  Upper  Decision
1  2   4.38    5.99       7.61  Reject H_0: µ_1 = µ_2
1  3   3.34    4.96       6.57  Reject H_0: µ_1 = µ_3
1  4  -2.07   -0.46       1.16  Accept H_0: µ_1 = µ_4
1  5   4.79    6.41       8.03  Reject H_0: µ_1 = µ_5
1  6   3.00    4.61       6.23  Reject H_0: µ_1 = µ_6
2  3  -2.66   -1.04       0.58  Accept H_0: µ_2 = µ_3
2  4  -8.07   -6.45      -4.83  Reject H_0: µ_2 = µ_4
2  5  -1.20    0.42       2.03  Accept H_0: µ_2 = µ_5
2  6  -3.00   -1.38       0.24  Accept H_0: µ_2 = µ_6
3  4  -7.03   -5.41      -3.79  Reject H_0: µ_3 = µ_4
3  5  -0.16    1.46       3.07  Accept H_0: µ_3 = µ_5
3  6  -1.97   -0.34       1.28  Accept H_0: µ_3 = µ_6
4  5   5.25    6.87       8.48  Reject H_0: µ_4 = µ_5
4  6   3.45    5.07       6.69  Reject H_0: µ_4 = µ_6
5  6  -3.42   -1.80      -0.18  Reject H_0: µ_5 = µ_6

The main conclusions from Table 3 are µ_1 = µ_4 and µ_2 = µ_3 = µ_5 = µ_6, i.e. there are two groups of techniques, and, based on Table 4, the mean of the first group is larger than that of the second. This implies that the direct technique tends to lead to higher error than either iterative technique when classifying protein folds. On the other hand, the multiple comparison test did not detect any significant difference between the mean error rates of the two iterative techniques, although the standard iterative technique, while slower than the iterative2 technique (see the next subsection), appears to be slightly more accurate on the data set used.

Table 4. Mean error rates (%) for all six generalization techniques (the standard error is 0.4 for all techniques).

Identifier  Mean error rate  Std. deviation
1           50.93            3.08
2           44.94            2.52
3           45.98            2.11
4           51.38            3.31
5           44.52            2.13
6           46.32            1.72

The last column of Table 4 points to a very interesting result: independently of whether features are scaled prior to NMF, the standard deviation of the error rate is lowest for our technique. This implies that our modifications of NMF lead to a visible reduction in the spread of the classification error. The reduction is caused by shrinking the search space of possible factorizations: our initialization eliminates some potentially erroneous solutions before the iterations even start, which leads to a more stable classification error. The open issue, however, is when to stop the iterations in order to achieve the highest accuracy. One can object that the error rates in the reduced space are larger than the error rate (42.7%) achieved in the original space. Nevertheless, we observed that the error in the reduced space can sometimes be lower than 42.7%: for example, the minimal error obtained with the iterative technique with no scaling before NMF and r = 50 is 39.22%, while the minimal error with the iterative2 technique under the same conditions is 42.34%. The varying errors can be attributed to the fact that the NMF factorization is not unique. The overall conclusion about the three generalization techniques is that, for the given data set, the direct technique, despite being the fastest of the three, lags far behind either iterative technique in classification accuracy when NMF is followed by HKNN. Among the iterative techniques, the conventional one provides slightly better accuracy than iterative2 (though the multiple comparison test did not detect a statistically significant difference), but it converges more slowly, as will be demonstrated in the next subsection.
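For completeness, here is a small Python sketch of the statistical protocol used in this subsection, one-way ANOVA followed by a pairwise multiple comparison on the 40 × 6 matrix of error rates; the paper used MATLAB's ANOVA and multiple comparison routines, so the SciPy/statsmodels calls, the Tukey HSD choice, and the placeholder file name are stand-in assumptions.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical input: 40 x 6 array of error rates, one column per technique
# identifier of Table 2 (rows: 10 runs x 4 values of r).
errors = np.loadtxt("error_rates.csv", delimiter=",")

# One-way ANOVA: are the six mean error rates equal (H0: mu_1 = ... = mu_6)?
F_stat, p_value = f_oneway(*[errors[:, j] for j in range(errors.shape[1])])

if p_value < 0.05:
    # Pairwise multiple comparison (Tukey HSD), analogous to Table 3
    flat = errors.ravel(order="F")                        # column by column
    groups = np.repeat(np.arange(1, 7), errors.shape[0])  # technique identifiers
    print(pairwise_tukeyhsd(flat, groups, alpha=0.05))
```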
6.3 Time

Since the iterative techniques showed better performance than the direct technique, we compare them in terms of time. Table 5 collects the gains resulting from our modifications for different dimensionalities of the reduced space. R1 is the ratio of the average times spent on learning the NMF basis and mapping the training data to the NMF space without and with scaling prior to NMF. R2 is the same ratio for generalization with the iterative technique, R3 for generalization with the iterative2 technique, R4 for learning and generalization together with the iterative technique, and R5 for learning and generalization together with the iterative2 technique. R6 is the gain in time obtained by using the iterative2 technique for generalization instead of the iterative one when no scaling is applied before NMF, whereas R7 is the same gain when scaling is applied. Overall, the average gain in time obtained with our modifications is more than 11 times.

Table 5. Gains in time resulting from our modifications of the conventional NMF algorithm.

r        R1    R2    R3    R4    R5    R6   R7
88       11.9  11.4  10.4  11.7  11.5  1.5  1.4
75       13.8  12.9  13.1  13.6  13.7  1.6  1.6
50       13.2  11.1  12.5  12.6  13.1  1.6  1.7
25        9.5   6.4   8.8   8.7   9.5  1.6  2.2
Average  12.1  10.4  11.2  11.6  11.9  1.6  1.8

7 Conclusion

The main contribution of this work is two modifications of the basic NMF algorithm and its practical application to the challenging real-world task of protein fold recognition. The first modification concerns feature scaling before NMF, while the second combines two known generalization techniques, which we call direct and iterative: the former is used as the starting point for the updates of the latter, leading to a new generalization technique called iterative2. We proved, both theoretically and experimentally, that at the start of generalization the objective function in Eq. (4) is always larger for the new technique than for the conventional iterative technique. Since this function is increasing, generalization with the iterative2 technique converges faster than with the ordinary iterative technique. When modifying NMF, we were also interested in the error rate obtained when combining NMF with a classifier (HKNN in this work). To this end, we tested all three generalization techniques. Statistical analysis of the results indicates that the mean error rate of the direct technique is higher than that of either iterative technique, while the two iterative techniques lead to statistically similar error rates. Since the iterative2 technique provides a faster mapping of unseen data, it is advantageous to apply it instead of the ordinary iterative one, especially in combination with feature scaling before NMF, because of the significant gain in time resulting from both modifications.

References

[1] D.D. Lee and H.S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, 1999.

[2] I. Buciu and I. Pitas. Application of non-negative and local non-negative matrix factorization to facial expression recognition. In Proc. 17th Int. Conf. on Pattern Recognition, Cambridge, UK, pages 288-291, 2004.

[3] D. Guillamet and J. Vitrià. Discriminant basis for object classification. In Proc. 11th Int. Conf. on Image Analysis and Processing, Palermo, Italy, pages 256-261, 2001.

[4] X. Chen, L. Gu, S.Z. Li, and H.-J. Zhang. Learning representative local features for face detection. In Proc. 2001 IEEE Conf. on Computer Vision and Pattern Recognition, Kauai, HI, pages 1126-1131, 2001.

[5] T. Feng, S.Z. Li, H.-Y. Shum, and H.-J. Zhang. Local non-negative matrix factorization as a visual representation. In Proc. 2nd Int. Conf. on Development and Learning, Cambridge, MA, pages 178-183, 2002.

[6] D. Guillamet and J. Vitrià. Evaluation of distance metrics for recognition based on non-negative matrix factorization. Pattern Recognition Letters, 24(9-10):1599-1605, 2003.

[7] D. Guillamet, J. Vitrià, and B. Schiele. Introducing a weighted non-negative matrix factorization for image classification. Pattern Recognition Letters, 24(14):2447-2454, 2003.

[8] Y. Wang, Y. Jia, C. Hu, and M. Turk. Fisher non-negative matrix factorization for learning local features. In Proc. 6th Asian Conf. on Computer Vision, Jeju Island, Korea, 2004.

[9] M. Rajapakse and L. Wyse. NMF vs ICA for face recognition. In Proc. 3rd Int. Symp. on Image and Signal Processing and Analysis, Rome, Italy, pages 605-611, 2003.

[10] C.H.Q. Ding and I. Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349-358, 2001.

[11] P. Vincent and Y. Bengio. K-local hyperplane and convex distance nearest neighbor algorithms. In T.G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 985-992. MIT Press, Cambridge, MA, 2002.

[12] O. Okun. Protein fold recognition with k-local hyperplane distance nearest neighbor algorithm. In Proc. 2nd European Workshop on Data Mining and Text Mining for Bioinformatics, Pisa, Italy, pages 47-53, 2004.

[13] O. Okun. Non-negative matrix factorization and classifiers: experimental study. In Proc. 4th IASTED Int. Conf. on Visualization, Imaging, and Image Processing, Marbella, Spain, pages 550-555, 2004.

[14] G. Bologna and R.D. Appel. A comparison study on protein fold recognition. In Proc. 9th Int. Conf. on Neural Information Processing, Singapore, pages 2492-2496, 2002.

[15] I.-F. Chung, C.-D. Huang, Y.-H. Shen, and C.-T. Lin. Recognition of structure classification of protein folding by NN and SVM hierarchical learning architecture. In O. Kaynak, E. Alpaydin, E. Oja, and L. Xu, editors, Artificial Neural Networks and Neural Information Processing - ICANN/ICONIP 2003, Istanbul, Turkey, volume 2714 of Lecture Notes in Computer Science, pages 1159-1167. Springer-Verlag, Berlin, 2003.

[16] N.R. Pal and D. Chakraborty. Some new features for protein fold recognition. In O. Kaynak, E. Alpaydin, E. Oja, and L. Xu, editors, Artificial Neural Networks and Neural Information Processing - ICANN/ICONIP 2003, Istanbul, Turkey, volume 2714 of Lecture Notes in Computer Science, pages 1176-1183. Springer-Verlag, Berlin, 2003.