Fusion of Dimensionality Reduction Methods: a Case Study in Microarray Classification
Sampath Deegalla, Dept. of Computer and Systems Sciences, Stockholm University, Sweden (si-sap@dsv.su.se)
Henrik Boström, Informatics Research Centre, University of Skövde, Sweden (henrik.bostrom@his.se)

Abstract

Dimensionality reduction has been demonstrated to improve the performance of the k-nearest neighbor (knn) classifier for high-dimensional data sets, such as microarrays. However, the effectiveness of different dimensionality reduction methods varies, and it has been shown that no single method consistently outperforms the others. Instead of using a single method, two approaches to fusing the results of applying dimensionality reduction methods are investigated: feature fusion and classifier fusion. It is shown that by fusing the output of multiple dimensionality reduction techniques, either by fusing the reduced features or by fusing the output of the resulting classifiers, both higher accuracy and higher robustness towards the choice of the number of dimensions are obtained.

Keywords: Nearest neighbor classification, dimensionality reduction, feature fusion, classifier fusion, microarrays.

1 Introduction

There is a strong need for accurate methods for analyzing microarray gene-expression data sets, since early and accurate diagnoses based on these analyses may lead to a proper choice of treatments and therapies [1, 2, 3]. However, the nature of these data sets (i.e., thousands of attributes but only a small number of instances) is a challenge for many learning algorithms, including the well-known k-nearest neighbor (knn) classifier [4]. The knn classifier follows a very simple strategy: instead of generating an explicit model, it keeps all training instances. A test instance is classified by measuring the distances from it to all training instances, most commonly using the Euclidean distance, and then assigning the majority class among the k nearest instances.
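The classification strategy just described can be sketched in a few lines of Python; this is a minimal illustration (the toy data and function names are ours, not from the paper):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, test_point, k=3):
    """train: list of (feature_vector, class_label) pairs."""
    # Sort all training instances by Euclidean distance to the test instance
    neighbors = sorted(train, key=lambda xy: dist(xy[0], test_point))[:k]
    # Assign the majority class among the k nearest instances
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy two-dimensional data with two classes
train = [((0.0, 0.0), 'normal'), ((0.1, 0.2), 'normal'),
         ((1.0, 1.0), 'tumor'),  ((0.9, 1.1), 'tumor')]
print(knn_predict(train, (0.2, 0.1), k=3))  # -> 'normal'
```

In a microarray setting each feature vector would instead have thousands of gene-expression values, which is exactly the regime where distances become unreliable and dimensionality reduction helps.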
This simple form of knn can, however, be both inefficient and ineffective for high-dimensional data sets due to the presence of irrelevant and redundant attributes; the classification accuracy of knn therefore often decreases as the dimensionality increases. One possible remedy that has earlier been shown to be successful is dimensionality reduction, i.e., projecting the original feature set onto a smaller number of features [5].

The use of knn has earlier been demonstrated to allow for successful classification of microarrays [2], and it has also been shown that dimensionality reduction can further improve the performance of knn for this task [5]. However, different dimensionality reduction methods may affect the performance of the knn classifier differently, and it has been shown that no single method always outperforms the others when used for microarray classification [6]. As an alternative to choosing a single method, this study considers applying a set of dimensionality reduction methods and fusing their output. Two fusion approaches are investigated: feature fusion, i.e., combining the reduced sets of features before learning with knn, and classifier fusion, i.e., combining the individual knn classifiers built from each feature reduction method.

The organization of the paper is as follows. In the next section, we briefly present the three dimensionality reduction methods that are considered in the investigation, together with the approaches for combining (or fusing) their output. In section 3, the details of the experimental setup are provided, and the results of the comparison on eight microarray data sets are given. Finally, we give some concluding remarks and outline directions for future work in section 4.

2 Dimensionality reduction

2.1 Principal Component Analysis (PCA)

PCA uses a linear transformation to obtain a simplified data set that retains the characteristics of the original data set.
Assume that the original matrix contains o dimensions and n observations, and that one wants to reduce the matrix to a d-dimensional subspace. Following [7], this transformation can be defined by

Y = E^T X    (1)

where E (o x d) is the projection matrix containing the d eigenvectors corresponding to the d largest eigenvalues, and X (o x n) is the mean-centered data matrix.

2.2 Partial Least Squares (PLS)

PLS was originally developed within the social sciences and has later been used extensively in chemometrics as a regression method [8]. It seeks a linear combination of attributes whose correlation with the class attribute is maximal. In PLS regression, the task is to build a linear model Ȳ = BX + E, where B is the matrix of regression coefficients and E is the matrix of error terms. In PLS, this is done via the factor score matrix Y = WX with an appropriate weight matrix W. The linear model Ȳ = QY + E is then considered, where Q is the matrix of regression coefficients for Y. Computation of Q yields Ȳ = BX + E, where B = WQ. Here, however, we are interested in dimensionality reduction using PLS, and therefore use the SIMPLS algorithm [9, 10]. In SIMPLS, the weights are calculated by maximizing the covariance between the score vectors y_a and ȳ_a, where a = 1, ..., d (d being the selected number of PLS components), under certain conditions. For more details of the method and its use, see [9, 11].

2.3 Information Gain (IG)

Information Gain (IG) can be used to measure the information content of a feature [12], and is commonly used for decision tree induction. Maximizing IG is equivalent to minimizing

Σ_{i=1}^{V} (n_i / N) Σ_{j=1}^{C} -(n_ij / n_i) log2 (n_ij / n_i)    (2)

where C is the number of classes, V is the number of values of the attribute, N is the total number of examples, n_i is the number of examples having the i-th value of the attribute, and n_ij is the number of those examples belonging to the j-th class. For feature reduction with IG, all features are ranked by decreasing information gain, and the first d features are selected. It is also necessary to consider how discretization of numerical features is to be done.
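As a concrete illustration of the ranking criterion in (2), the sketch below (toy data and function names are ours, not from the paper) computes the class-conditional entropy of each already-discretized feature and ranks the features by decreasing information gain:

```python
from collections import Counter, defaultdict
from math import log2

def cond_entropy(values, labels):
    """The quantity minimized in Eq. (2): H(class | attribute)."""
    N = len(values)
    by_value = defaultdict(list)
    for v, c in zip(values, labels):
        by_value[v].append(c)
    h = 0.0
    for group in by_value.values():           # one group per attribute value
        n_i = len(group)
        counts = Counter(group)               # n_ij for each class j
        h_i = -sum((n_ij / n_i) * log2(n_ij / n_i) for n_ij in counts.values())
        h += (n_i / N) * h_i
    return h

def entropy(labels):
    N = len(labels)
    return -sum((n / N) * log2(n / N) for n in Counter(labels).values())

def rank_by_ig(features, labels):
    """Rank feature columns by decreasing information gain H(C) - H(C|A)."""
    h_c = entropy(labels)
    gains = [(h_c - cond_entropy(col, labels), idx)
             for idx, col in enumerate(features)]
    return [idx for _, idx in sorted(gains, reverse=True)]

# Toy example: feature 0 perfectly separates the classes, feature 1 is noise.
labels = ['A', 'A', 'B', 'B']
features = [['lo', 'lo', 'hi', 'hi'],   # IG = 1 bit
            ['x', 'y', 'x', 'y']]       # IG = 0 bits
print(rank_by_ig(features, labels))     # -> [0, 1]
```

Selecting the first d indices of the returned ranking corresponds to the feature reduction step described above.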
Since such features are present in all the considered data sets, they have to be converted to categorical features to allow the above calculation of IG. We used WEKA's default configuration, i.e., Fayyad & Irani's Minimum Description Length (MDL) method [13], for discretization.

2.4 Feature fusion (FF)

Feature fusion concerns how to generate and select a single set of features for a set of objects with which several sets of features are associated [14]. In this study, we use a single data source together with different dimensionality reduction methods, which allows us to perform feature fusion by concatenating the features generated by the different methods. High dimensionality is not a problem here, since each transformed data set is small compared to the original size of the data. Therefore, a straightforward strategy of choosing an equal number of features from each reduced set is adopted. The selected total number of dimensions ranges from d = 3 to 99; for each d, the first d/3 reduced dimensions are chosen from the output of PLS, PCA and IG, respectively.

2.5 Classifier fusion (CF)

The focus of classifier fusion is either on generating a structure representing a set of combined classifiers or on combining classifier outputs [15]. We have considered the latter approach, i.e., combining the nearest neighbor predictions obtained with PLS, PCA and IG using unweighted voting. For multi-class problems, ties are resolved by randomly selecting one of the predictions.

3 Empirical study

3.1 Data sets

The following eight microarray data sets are used in this study:

Central Nervous System [16], which consists of 60 patient samples of survivors (39) and failures (21) after treatment of medulloblastoma tumors (data set C from [16]).

Colon Tumor [17], which consists of 40 tumor and 22 normal colon samples.

Leukemia [18], which contains 72 samples of two types of leukemia: 25 acute myeloid leukemia (AML) and 47 acute lymphoblastic leukemia (ALL).
Prostate [2], which consists of 52 prostate tumor and 50 normal specimens.

Brain [16], which contains 42 patient samples of five different brain tumor types: medulloblastomas (10), malignant gliomas (10), AT/RTs (10), PNETs (8) and normal cerebella (4) (data set A from [16]).

Lymphoma [19], which contains 42 samples of diffuse large B-cell lymphoma (DLBCL), 9 of follicular lymphoma (FL) and 11 of chronic lymphocytic leukemia (CLL).

NCI60 [20], which contains eight different tumor types: breast, central nervous system, colon, leukemia, melanoma, non-small cell lung carcinoma, ovarian and renal tumors.
SRBCT [3], which contains four diagnostic categories of small, round blue-cell tumors: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL) and the Ewing family of tumors (EWS).

Table 1: Description of the data sets, listing the number of attributes, instances and classes for each data set. [The numeric entries of the table are not recoverable from this copy.]

The first three data sets come from the Kent Ridge Biomedical Data Set Repository [21] and the remaining five from the supplementary materials of [22]. The data sets are summarized in Table 1.

3.2 Experimental setup

We have used Matlab to transform the raw attributes into both PLS and PCA components. The PCA transformation is performed using Matlab's Statistics Toolbox, whereas the PLS transformation is performed using the BDK-SOMPLS toolbox [23, 24], which uses the SIMPLS algorithm. The WEKA data mining toolkit [12] is used for the IG method, as well as for the nearest neighbor classification. Both PLS and IG are supervised methods which use class information for their transformations. Therefore, to generate the PLS components for a test set, for which the class labels are unknown, the weight matrix generated from the training set has to be used. For IG, the attributes in the training set are ranked by decreasing information gain, and the same attributes are selected for the test set. The number of selected dimensions was varied from one to approximately the number of examples in the current data set for all three methods. Stratified 10-fold cross-validation [12] is employed to obtain measures of accuracy, which has been chosen as the performance indicator in this study.

3.3 Experimental results

The results of using the original features, the three dimensionality reduction methods, and the feature fusion and classifier fusion methods are shown in Fig. 1 and Fig. 2. Table 2 summarizes the highest classification accuracies of the individual reduction and fusion methods, comparing their accuracy to that of the raw features.
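The two fusion strategies of sections 2.4 and 2.5 can be sketched as follows. This is a minimal Python illustration, not the WEKA/Matlab implementation used in the experiments; the toy lists stand in for the actual PLS, PCA and IG output:

```python
import random
from collections import Counter

def feature_fusion(reduced_sets, d):
    """FF: for each instance, concatenate the first d/3 dimensions from
    each of the three reduced representations (PLS, PCA, IG, in that order)."""
    per_method = d // len(reduced_sets)
    n_instances = len(reduced_sets[0])
    return [sum((r[i][:per_method] for r in reduced_sets), [])
            for i in range(n_instances)]

def classifier_fusion(predictions):
    """CF: unweighted vote over the per-method knn predictions for one
    test instance; ties are broken by random selection, as in the paper."""
    counts = Counter(predictions)
    top = max(counts.values())
    return random.choice([c for c, n in counts.items() if n == top])

# Toy reduced data: 2 instances, 3 reduced dimensions per method.
pls = [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]
pca = [[1.1, 1.2, 1.3], [1.9, 1.8, 1.7]]
ig  = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0]]

print(feature_fusion([pls, pca, ig], d=6))
# first instance -> [0.1, 0.2, 1.1, 1.2, 5.0, 6.0]
print(classifier_fusion(['tumor', 'tumor', 'normal']))  # -> 'tumor'
```

In FF a single knn classifier is then trained on the concatenated features, whereas in CF one knn classifier per reduction method votes on each test instance.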
The numbers inside the brackets in Table 2 denote the minimum number of dimensions required to reach the reported accuracy. It can be observed in Fig. 1 and Fig. 2 that both PLS and PCA obtain their best classification accuracies with relatively few dimensions, while more dimensions are required for IG. None of the single methods turns out to be a clear winner, except perhaps PLS on the binary classification tasks. However, all three methods outperform not using dimensionality reduction, and the difference in performance between the best and worst method can vary greatly for a particular data set. This leads to the conclusion that the choice of the dimensionality reduction method to be used in conjunction with knn for microarray classification can be of major importance, but also that no single method is suitable for all cases.

As shown in Table 2, the feature fusion method often performs well compared to each individual method, giving the overall best results in 7 out of 8 cases. Furthermore, its accuracy varies much less with the number of dimensions than that of the other methods, showing that using the combined features reduces the sensitivity to the choice of the number of dimensions. The classifier fusion method also performs on par with or better than the individual dimensionality reduction methods in 6 out of 8 cases, reaching the overall best accuracy in 3 cases. It should also be noted that classifier fusion reaches its best accuracy with fewer dimensions than the feature fusion method. For example, the FF method reaches its highest classification accuracy with 60 dimensions on the Central Nervous data set, whereas the CF method reaches its highest level using only 11 features.

4 Concluding remarks

Three dimensionality reduction methods were investigated in conjunction with the knn classifier, and two approaches to fusing the results of multiple dimensionality reduction methods were considered: feature fusion and classifier fusion.
An experiment with eight microarray data sets shows that dimensionality reduction indeed is effective for nearest neighbor classification, and that fusing the output of multiple reduction methods can further improve the classification accuracy compared to the individual methods. It is also observed that PCA and PLS perform best when few dimensions are chosen, and they sometimes even outperform the fusion methods. However, when comparing the best classification accuracies of the individual and fusion methods, the feature fusion method obtains the best classification accuracy in 7 out of 8 cases. In addition, classifier fusion reaches its best accuracy with relatively few dimensions compared to feature fusion, which on the other hand is even more robust to changes in the number of dimensions. It can therefore be concluded that either of the fusion approaches should be preferred to any of the single dimensionality reduction methods, since the former can be expected to lead to higher classification accuracy and to robustness with respect to the choice of the number of dimensions.
Table 2: Best classification accuracies for the raw features, PCA, PLS, IG, FF and CF on each data set, with the minimum number of dimensions required given in brackets. [Most numeric entries of the table are not recoverable from this copy.]

Figure 1: Predictive performance as the number of dimensions varies, using PCA, PLS, IG, feature fusion (FF) and classifier fusion (CF) with nearest neighbor, for the two-class microarray data sets.
Figure 2: Predictive performance as the number of dimensions varies, using PCA, PLS, IG, feature fusion (FF) and classifier fusion (CF) with nearest neighbor, for the multi-class microarray data sets.

There are a number of issues that need further exploration. First, fusion of classifiers from additional dimensionality reduction methods could be investigated, e.g., Random Projection [6]. Second, selecting different numbers of dimensions for the different dimensionality reduction methods, when fusing classifiers or features, could be investigated as an alternative strategy. Finally, more sophisticated voting strategies could be considered, see e.g. [25].

5 Acknowledgments

The first author gratefully acknowledges support through the SIDA/SAREC IT project. This work was supported by the Information Fusion Research Program at the University of Skövde, Sweden, in partnership with the Swedish Knowledge Foundation under grant 2003/0104.

References

[1] J. Quackenbush. Microarray analysis and tumor classification. The New England Journal of Medicine, 354(23).

[2] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D'Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub, and W. R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1.

[3] J. Khan, J. S. Wei, M. Ringnér, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 2001.
[4] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6:37-66.

[5] S. Deegalla and H. Boström. Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. In ICMLA '06: Proceedings of the 5th International Conference on Machine Learning and Applications, Washington, DC, USA. IEEE Computer Society.

[6] S. Deegalla and H. Boström. Classification of microarrays with knn: Comparison of dimensionality reduction methods. In Proceedings of the 8th International Conference on Intelligent Data Engineering and Automated Learning, Birmingham, UK.

[7] J. Shlens. A tutorial on principal component analysis. shlens/pub/notes/pca.pdf.

[8] H. Abdi. Partial least squares (PLS) regression.

[9] S. de Jong. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems.

[10] StatSoft Inc. Electronic statistics textbook. stathome.html.

[11] A. Boulesteix. PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology.

[12] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco.

[13] U. M. Fayyad and K. B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8:87-102.

[14] H. Boström. Feature vs. classifier fusion for predictive data mining - a case study in pesticide classification. In Proceedings of the 10th International Conference on Information Fusion.

[15] D. Ruta and B. Gabrys. An overview of classifier fusion methods. Computing and Information Systems, 7:1-10.

[16] S. L. Pomeroy, P. Tamayo, M. Gassenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. Kim, L. C. Goumnerova, P. M. Black, C. Lau, J. C. Allen, D. Zagzag, J. M. Olson, T. Curran, C. Wetmore, J. A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. N. Louis, J. P. Mesirov, E. S. Lander, and T. R. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, January.

[17] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, volume 96.

[18] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286.

[19] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson Jr, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, R. Levy, W. Wilson, M. R. Grever, J. C. Byrd, D. Botstein, P. O. Brown, and L. M. Staudt. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403.

[20] D. T. Ross, U. Scherf, M. B. Eisen, C. M. Perou, C. Rees, P. Spellman, V. Iyer, S. S. Jeffrey, M. V. de Rijn, M. Waltham, A. Pergamenschikov, J. C. F. Lee, D. Lashkari, D. Shalon, T. G. Myers, J. N. Weinstein, D. Botstein, and P. O. Brown. Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24(3).

[21] Kent Ridge Bio-medical Data Set Repository. krbd/index.html.

[22] R. Díaz-Uriarte and S. A. de Andrés. Gene selection and classification of microarray data using random forest. Bioinformatics, 7(3). randomforestvarsel.html.

[23] W. Melssen, R. Wehrens, and L. Buydens. Supervised Kohonen networks for classification problems. Chemometrics and Intelligent Laboratory Systems, 83:99-113.

[24] W. Melssen, B. Üstün, and L. Buydens. SOMPLS: a supervised self-organising map - partial least squares algorithm. Chemometrics and Intelligent Laboratory Systems, 86(1).

[25] H. Boström, R. Johansson, and A. Karlsson. On evidential combination rules for ensemble classifiers. In Proceedings of the 11th International Conference on Information Fusion, 2008.
More informationFACTORIZATION MACHINES AS A TOOL FOR HEALTHCARE CASE STUDY ON TYPE 2 DIABETES DETECTION
SunLab Enlighten the World FACTORIZATION MACHINES AS A TOOL FOR HEALTHCARE CASE STUDY ON TYPE 2 DIABETES DETECTION Ioakeim (Kimis) Perros and Jimeng Sun perros@gatech.edu, jsun@cc.gatech.edu COMPUTATIONAL
More informationVariable Selection and Dimension Reduction by Learning Gradients
Variable Selection and Dimension Reduction by Learning Gradients Qiang Wu and Sayan Mukherjee August 6, 2008 1 Introduction High dimension data analysis has become a challenging problem in modern sciences.
More informationSpectral clustering and its use in bioinformatics
Journal of Computational and Applied Mathematics 4 (7) 5 37 www.elsevier.com/locate/cam Spectral clustering and its use in bioinformatics Desmond J. Higham a,,, Gabriela Kalna a,, Milla Kibble b,3 a Department
More informationA Bayesian Mixture Model for Differential Gene Expression
A Bayesian Mixture Model for Differential Gene Expression Kim-Anh Do, Peter Müller and Feng Tang 1 Abstract We propose model-based inference for differential gene expression, using a non-parametric Bayesian
More informationRandom Projections as Regularizers: Learning a Linear Discriminant Ensemble from Fewer Observations than Dimensions
JMLR: Workshop and Conference Proceedings 29:17 32, 2013 ACML 2013 Random Projections as Regularizers: Learning a Linear Discriminant nsemble from Fewer Observations than Dimensions Robert J. Durrant bobd@waikato.ac.nz
More informationPackage SHIP. February 19, 2015
Type Package Package SHIP February 19, 2015 Title SHrinkage covariance Incorporating Prior knowledge Version 1.0.2 Author Maintainer Vincent Guillemot The SHIP-package allows
More informationA Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier
A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe
More informationA Bias Correction for the Minimum Error Rate in Cross-validation
A Bias Correction for the Minimum Error Rate in Cross-validation Ryan J. Tibshirani Robert Tibshirani Abstract Tuning parameters in supervised learning problems are often estimated by cross-validation.
More informationComparison of Discrimination Methods for High Dimensional Data
Comparison of Discrimination Methods for High Dimensional Data M.S.Srivastava 1 and T.Kubokawa University of Toronto and University of Tokyo Abstract In microarray experiments, the dimension p of the data
More informationUnivariate shrinkage in the Cox model for high dimensional data
Univariate shrinkage in the Cox model for high dimensional data Robert Tibshirani January 6, 2009 Abstract We propose a method for prediction in Cox s proportional model, when the number of features (regressors)
More informationhsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference
CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science
More informationGENE expression refers to the level of production of protein
IEEE/ACM TRANSACTIONS ON...,, VOL. TBD, NO. TBD, MONTH YEAR Feature Selection for Gene Expression using Model-based Entropy Shenghuo Zhu, Dingding Wang, Kai Yu, Tao Li and Yihong Gong Abstract Gene expression
More informationPackage IGG. R topics documented: April 9, 2018
Package IGG April 9, 2018 Type Package Title Inverse Gamma-Gamma Version 1.0 Date 2018-04-04 Author Ray Bai, Malay Ghosh Maintainer Ray Bai Description Implements Bayesian linear regression,
More informationFeature gene selection method based on logistic and correlation information entropy
Bio-Medical Materials and Engineering 26 (2015) S1953 S1959 DOI 10.3233/BME-151498 IOS Press S1953 Feature gene selection method based on logistic and correlation information entropy Jiucheng Xu a,b,,
More informationSparse Linear Discriminant Analysis with Applications to High Dimensional Low Sample Size Data
Sparse Linear Discriant Analysis with Applications to High Dimensional Low Sample Size Data Zhihua Qiao, Lan Zhou and Jianhua Z. Huang Abstract This paper develops a method for automatically incorporating
More informationKERNEL LOGISTIC REGRESSION-LINEAR FOR LEUKEMIA CLASSIFICATION USING HIGH DIMENSIONAL DATA
Rahayu, Kernel Logistic Regression-Linear for Leukemia Classification using High Dimensional Data KERNEL LOGISTIC REGRESSION-LINEAR FOR LEUKEMIA CLASSIFICATION USING HIGH DIMENSIONAL DATA S.P. Rahayu 1,2
More informationSimilarities of Ordered Gene Lists. User s Guide to the Bioconductor Package OrderedList
for Center Berlin Genome Based Bioinformatics Max Planck Institute for Molecular Genetics Computational Diagnostics Group @ Dept. Vingron Ihnestrasse 63-73, D-14195 Berlin, Germany http://compdiag.molgen.mpg.de/
More informationStatistical aspects of prediction models with high-dimensional data
Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by
More informationA Bayesian Approach to Variable Selection when the Number of Variables is Very Large
A Bayesian Approach to Variable Selection when the Number of Variables is Very Large Harri T. Kiiveri Abstract In this paper, we present a rapid Bayesian variable selection technique which can be used
More informationPenalized versus constrained generalized eigenvalue problems
Penalized versus constrained generalized eigenvalue problems Irina Gaynanova, James G. Booth and Martin T. Wells. arxiv:141.6131v3 [stat.co] 4 May 215 Abstract We investigate the difference between using
More informationStructured Statistical Learning with Support Vector Machine for Feature Selection and Prediction
Structured Statistical Learning with Support Vector Machine for Feature Selection and Prediction Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/ yklee Predictive
More informationData Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition
Data Mining Classification: Basic Concepts and Techniques Lecture Notes for Chapter 3 by Tan, Steinbach, Karpatne, Kumar 1 Classification: Definition Given a collection of records (training set ) Each
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationUncorrelated Patient Similarity Learning
Uncorrelated Patient Similarity Learning Mengdi Huai Chenglin Miao Qiuling Suo Yaliang Li Jing Gao Aidong Zhang Abstract Patient similarity learning aims to derive a clinically meaningful similarity metric
More informationAccepted author version posted online: 28 Nov 2012.
This article was downloaded by: [University of Minnesota Libraries, Twin Cities] On: 15 May 2013, At: 12:23 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954
More informationAn Efficient Algorithm for Computing the HHSVM and Its Generalizations
This article was downloaded by: [University of Minnesota Libraries, Twin Cities] On: 12 October 2013, At: 13:09 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number:
More informationGenerative MaxEnt Learning for Multiclass Classification
Generative Maximum Entropy Learning for Multiclass Classification A. Dukkipati, G. Pandey, D. Ghoshdastidar, P. Koley, D. M. V. S. Sriram Dept. of Computer Science and Automation Indian Institute of Science,
More informationPackage ShrinkCovMat
Type Package Package ShrinkCovMat Title Shrinkage Covariance Matrix Estimators Version 1.2.0 Author July 11, 2017 Maintainer Provides nonparametric Steinian shrinkage estimators
More informationAijun An and Nick Cercone. Department of Computer Science, University of Waterloo. methods in a context of learning classication rules.
Discretization of Continuous Attributes for Learning Classication Rules Aijun An and Nick Cercone Department of Computer Science, University of Waterloo Waterloo, Ontario N2L 3G1 Canada Abstract. We present
More informationInvestigating the Performance of a Linear Regression Combiner on Multi-class Data Sets
Investigating the Performance of a Linear Regression Combiner on Multi-class Data Sets Chun-Xia Zhang 1,2 Robert P.W. Duin 2 1 School of Science and State Key Laboratory for Manufacturing Systems Engineering,
More informationMathematics, Genomics, and Cancer
School of Informatics IUB April 6, 2009 Outline Introduction Class Comparison Class Discovery Class Prediction Example Biological states and state modulation Software Tools Research directions Math & Biology
More informationClassification 1: Linear regression of indicators, linear discriminant analysis
Classification 1: Linear regression of indicators, linear discriminant analysis Ryan Tibshirani Data Mining: 36-462/36-662 April 2 2013 Optional reading: ISL 4.1, 4.2, 4.4, ESL 4.1 4.3 1 Classification
More informationUncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization
Uncorrelated Multilinear Principal Component Analysis through Successive Variance Maximization Haiping Lu 1 K. N. Plataniotis 1 A. N. Venetsanopoulos 1,2 1 Department of Electrical & Computer Engineering,
More informationA Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier
A Modified Incremental Principal Component Analysis for On-Line Learning of Feature Space and Classifier Seiichi Ozawa 1, Shaoning Pang 2, and Nikola Kasabov 2 1 Graduate School of Science and Technology,
More informationThe miss rate for the analysis of gene expression data
Biostatistics (2005), 6, 1,pp. 111 117 doi: 10.1093/biostatistics/kxh021 The miss rate for the analysis of gene expression data JONATHAN TAYLOR Department of Statistics, Stanford University, Stanford,
More informationData Mining. 3.6 Regression Analysis. Fall Instructor: Dr. Masoud Yaghini. Numeric Prediction
Data Mining 3.6 Regression Analysis Fall 2008 Instructor: Dr. Masoud Yaghini Outline Introduction Straight-Line Linear Regression Multiple Linear Regression Other Regression Models References Introduction
More informationFace Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi
Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Overview Introduction Linear Methods for Dimensionality Reduction Nonlinear Methods and Manifold
More informationDATA MINING WITH DIFFERENT TYPES OF X-RAY DATA
DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA 315 C. K. Lowe-Ma, A. E. Chen, D. Scholl Physical & Environmental Sciences, Research and Advanced Engineering Ford Motor Company, Dearborn, Michigan, USA
More informationSHOTA KATAYAMA AND YUTAKA KANO. Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka , Japan
A New Test on High-Dimensional Mean Vector Without Any Assumption on Population Covariance Matrix SHOTA KATAYAMA AND YUTAKA KANO Graduate School of Engineering Science, Osaka University, 1-3 Machikaneyama,
More informationday month year documentname/initials 1
ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi
More informationInternational Journal of Pure and Applied Mathematics Volume 19 No , A NOTE ON BETWEEN-GROUP PCA
International Journal of Pure and Applied Mathematics Volume 19 No. 3 2005, 359-366 A NOTE ON BETWEEN-GROUP PCA Anne-Laure Boulesteix Department of Statistics University of Munich Akademiestrasse 1, Munich,
More informationPart I. Linear regression & LASSO. Linear Regression. Linear Regression. Week 10 Based in part on slides from textbook, slides of Susan Holmes
Week 10 Based in part on slides from textbook, slides of Susan Holmes Part I Linear regression & December 5, 2012 1 / 1 2 / 1 We ve talked mostly about classification, where the outcome categorical. If
More informationSelection of Classifiers based on Multiple Classifier Behaviour
Selection of Classifiers based on Multiple Classifier Behaviour Giorgio Giacinto, Fabio Roli, and Giorgio Fumera Dept. of Electrical and Electronic Eng. - University of Cagliari Piazza d Armi, 09123 Cagliari,
More informationA Graph-Based Representation of Gene Expression Profiles in DNA Microarrays
A Graph-Based Representation of Gene Expression Profiles in DNA Microarrays A. Benso, S. Di Carlo, G. Politano, L. Sterpone Abstract This paper proposes a new and very flexible data model, called Gene
More informationCorrelation Preserving Unsupervised Discretization. Outline
Correlation Preserving Unsupervised Discretization Jee Vang Outline Paper References What is discretization? Motivation Principal Component Analysis (PCA) Association Mining Correlation Preserving Discretization
More informationUncovering the common essence of complex networking
A simple and exact Laplacian clustering of complex networking phenomena: Application to gene expression profiles Choongrak Kim*, Mookyung Cheon, Minho Kang, and Iksoo Chang *Department of Statistics, National
More informationRegularized Discriminant Analysis and Its Application in Microarray
Regularized Discriminant Analysis and Its Application in Microarray Yaqian Guo, Trevor Hastie and Robert Tibshirani May 5, 2004 Abstract In this paper, we introduce a family of some modified versions of
More informationComparison of the Empirical Bayes and the Significance Analysis of Microarrays
Comparison of the Empirical Bayes and the Significance Analysis of Microarrays Holger Schwender, Andreas Krause, and Katja Ickstadt Abstract Microarrays enable to measure the expression levels of tens
More informationSTK Statistical Learning: Advanced Regression and Classification
STK4030 - Statistical Learning: Advanced Regression and Classification Riccardo De Bin debin@math.uio.no STK4030: lecture 1 1/ 42 Outline of the lecture Introduction Overview of supervised learning Variable
More informationAssignment 3. Introduction to Machine Learning Prof. B. Ravindran
Assignment 3 Introduction to Machine Learning Prof. B. Ravindran 1. In building a linear regression model for a particular data set, you observe the coefficient of one of the features having a relatively
More informationVariable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data
Biometrics 64, 440 448 June 2008 DOI: 10.1111/j.1541-0420.2007.00922.x Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data Sijian Wang 1 and Ji Zhu 2,
More informationClassification Ensemble That Maximizes the Area Under Receiver Operating Characteristic Curve (AUC)
Classification Ensemble That Maximizes the Area Under Receiver Operating Characteristic Curve (AUC) Eunsik Park 1 and Y-c Ivan Chang 2 1 Chonnam National University, Gwangju, Korea 2 Academia Sinica, Taipei,
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationInduction of Decision Trees
Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.
More information