Fusion of Dimensionality Reduction Methods: a Case Study in Microarray Classification

Sampath Deegalla
Dept. of Computer and Systems Sciences, Stockholm University, Sweden
si-sap@dsv.su.se

Henrik Boström
Informatics Research Centre, University of Skövde, Sweden
henrik.bostrom@his.se

Abstract

Dimensionality reduction has been demonstrated to improve the performance of the k-nearest neighbor (knn) classifier for high-dimensional data sets, such as microarrays. However, the effectiveness of different dimensionality reduction methods varies, and it has been shown that no single method consistently outperforms the others. In contrast to using a single method, two approaches to fusing the results of applying dimensionality reduction methods are investigated: feature fusion and classifier fusion. It is shown that by fusing the output of multiple dimensionality reduction techniques, either by fusing the reduced features or by fusing the output of the resulting classifiers, both higher accuracy and higher robustness towards the choice of the number of dimensions are obtained.

Keywords: Nearest neighbor classification, dimensionality reduction, feature fusion, classifier fusion, microarrays.

1 Introduction

There is a strong need for accurate methods for analyzing microarray gene-expression data sets, since early and accurate diagnoses based on these analyses may lead to a proper choice of treatments and therapies [1, 2, 3]. However, the nature of these data sets (i.e., thousands of attributes but only a small number of instances) is a challenge for many learning algorithms, including the well-known k-nearest neighbor (knn) classifier [4]. As a learner, knn follows a very simple strategy: instead of generating an explicit model, it keeps all training instances. A test instance is classified by measuring its distance to all training instances, most commonly using the Euclidean distance, and assigning it the majority class among the k nearest instances. This simple form of knn can, however, be both inefficient and ineffective for high-dimensional data sets due to the presence of irrelevant and redundant attributes. Therefore, the classification accuracy of knn often decreases as the dimensionality increases.
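As a minimal illustration of this classification strategy (a sketch only, not the WEKA-based implementation used in the experiments below), assume NumPy arrays X_train (instances by features) and y_train (class labels):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test instance by majority vote among its k nearest
    training instances under Euclidean distance (no model is built)."""
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)   # count their class labels
    return votes.most_common(1)[0][0]              # return the majority class
```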
One possible remedy to this problem that has earlier been shown to be successful is dimensionality reduction, i.e., projecting the original feature set onto a smaller number of features [5]. The use of knn has earlier been demonstrated to allow for successful classification of microarrays [2], and it has also been shown that dimensionality reduction can further improve the performance of knn for this task [5]. However, different dimensionality reduction methods may have different effects on the performance of the knn classifier, and it has been shown that no single method always outperforms the others when used for microarray classification [6]. As an alternative to choosing a single method, we will in this study consider the idea of applying a set of dimensionality reduction methods and fusing their output. Two fusion approaches are investigated: feature fusion, i.e., combining the reduced sets of features before learning with knn, and classifier fusion, i.e., combining the individual knn classifiers built from each feature reduction method.

The organization of the paper is as follows. In the next section, we briefly present the three dimensionality reduction methods that will be considered in the investigation, together with the approaches for combining (or fusing) their output. In section 3, details of the experimental setup are provided, and the results of the comparison on eight microarray data sets are given. Finally, we give some concluding remarks and outline directions for future work in section 4.

2 Dimensionality reduction

2.1 Principal Component Analysis (PCA)

PCA uses a linear transformation to obtain a simplified data set that retains the characteristics of the original data set. Assume that the original matrix contains o dimensions and n observations and that one wants to reduce the matrix to a d-dimensional subspace. Following [7], this transformation can be defined by:

Y = E^T X    (1)

where E (o x d) is the projection matrix containing the d eigenvectors corresponding to the d largest eigenvalues, and X (o x n) is the mean-centered data matrix.
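A compact NumPy sketch of this projection, with X arranged features-by-observations as above (illustrative only; the experiments reported below used Matlab's Statistics Toolbox):

```python
import numpy as np

def pca_project(X, d):
    """Project the o x n matrix X onto its top-d principal components,
    i.e., compute Y = E^T X of equation (1)."""
    Xc = X - X.mean(axis=1, keepdims=True)         # mean-centered data matrix
    C = (Xc @ Xc.T) / (X.shape[1] - 1)             # o x o covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)           # eigenvalues in ascending order
    E = eigvecs[:, np.argsort(eigvals)[::-1][:d]]  # d eigenvectors, largest eigenvalues
    return E.T @ Xc                                # Y has shape (d, n)
```

For microarray data, where o is in the thousands and far exceeds n, forming the o x o covariance matrix is impractical; the same projection is usually obtained from a singular value decomposition of the centered data instead.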

2.2 Partial Least Squares (PLS)

PLS was originally developed within the social sciences and has later been used extensively in chemometrics as a regression method [8]. It seeks a linear combination of attributes whose correlation with the class attribute is maximal. In PLS regression, the task is to build a linear model Ȳ = BX + E, where B is the matrix of regression coefficients and E is the matrix of error coefficients. In PLS, this is done via the factor score matrix Y = WX with an appropriate weight matrix W. One then considers the linear model Ȳ = QY + E, where Q is the matrix of regression coefficients for Y. Computing Q yields Ȳ = BX + E, where B = WQ. Since we are interested in using PLS for dimensionality reduction, we used the SIMPLS algorithm [9, 10]. In SIMPLS, the weights are calculated by maximizing the covariance between the score vectors y_a and ȳ_a, where a = 1, ..., d (d being the selected number of PLS components), under certain conditions. For more details of the method and its use, see [9, 11].

2.3 Information Gain (IG)

Information Gain (IG) can be used to measure the information content of a feature [12] and is commonly used for decision tree induction. Maximizing IG is equivalent to minimizing:

\sum_{i=1}^{V} \frac{n_i}{N} \sum_{j=1}^{C} -\frac{n_{ij}}{n_i} \log_2 \frac{n_{ij}}{n_i}    (2)

where C is the number of classes, V is the number of values of the attribute, N is the total number of examples, n_i is the number of examples having the ith value of the attribute, and n_ij is the number of examples in the latter group belonging to the jth class. For feature reduction with IG, all features are ranked in order of decreasing information gain, and the first d features are selected. It is also necessary to consider how numerical features are to be discretized. Since such features are present in all the considered data sets, they have to be converted into categorical features to allow the above calculation of IG. We used WEKA's default configuration for discretization, i.e., Fayyad & Irani's Minimum Description Length (MDL) method [13].

2.4 Feature fusion (FF)

Feature fusion concerns how to generate and select a single set of features for a set of objects with which several sets of features are associated [14]. In this study, we use a single data source together with different dimensionality reduction methods, which allows us to perform feature fusion by concatenating the features generated by the different methods. High dimensionality is not a problem here, since each transformed data set is small compared to the original size of the data. Therefore, a straightforward strategy of choosing an equal number of features from each reduced set is considered. The selected total number of dimensions ranges from d = 3 to d = 99; for each d, the first d/3 reduced dimensions are chosen from the output of PLS, PCA and IG, respectively.

2.5 Classifier fusion (CF)

The focus of classifier fusion is either on generating a structure representing a set of combined classifiers or on combining classifier outputs [15]. We have considered the latter approach, i.e., combining the nearest neighbor predictions obtained with PLS, PCA and IG using unweighted voting. For multi-class problems, ties are resolved by randomly selecting one of the tied predictions.
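The two fusion schemes can be sketched as follows, reusing knn_predict from the introduction. The per-method reduced representations (one instances-by-dimensions array each, in the fixed order PLS, PCA, IG) are assumed to be given; this is a hedged illustration, not the Matlab/WEKA pipeline used in the experiments:

```python
import random
import numpy as np
from collections import Counter

def feature_fusion(reduced_sets, d):
    """Section 2.4: concatenate the first d/3 dimensions from each method's
    output; reduced_sets is a list of instances x dims arrays, d a multiple of 3."""
    return np.hstack([R[:, : d // 3] for R in reduced_sets])

def classifier_fusion(train_sets, y_train, test_instances, k=3):
    """Section 2.5: unweighted vote over one knn classifier per reduction
    method; test_instances holds the same test example in each reduced space."""
    preds = [knn_predict(Xtr, y_train, xte, k)
             for Xtr, xte in zip(train_sets, test_instances)]
    counts = Counter(preds)
    best = max(counts.values())
    return random.choice([c for c, n in counts.items() if n == best])  # random tie-break
```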
3 Empirical study

3.1 Data sets

The following eight microarray data sets are used in this study:

Central Nervous System [16], which consists of 60 patient samples of survivors (39) and failures (21) after treatment of medulloblastoma tumors (data set C from [16]).

Colon Tumor [17], which consists of 40 tumor and 22 normal colon samples.

Leukemia [18], which contains 72 samples of two types of leukemia: 25 acute myeloid leukemia (AML) and 47 acute lymphoblastic leukemia (ALL).

Prostate [2], which consists of 52 prostate tumor and 50 normal specimens.

Brain [16], which contains 42 patient samples of five different brain tumor types: medulloblastomas (10), malignant gliomas (10), AT/RTs (10), PNETs (8) and normal cerebella (4) (data set A from [16]).

Lymphoma [19], which contains 42 samples of diffuse large B-cell lymphoma (DLBCL), 9 of follicular lymphoma (FL) and 11 of chronic lymphocytic leukemia (CLL).

NCI60 [20], which contains eight different tumor types: breast, central nervous system, colon, leukemia, melanoma, non-small cell lung carcinoma, ovarian and renal tumors.

SRBCT [3], which contains four diagnostic categories of small, round blue-cell tumors: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL) and the Ewing family of tumors (EWS).

Table 1: Description of data

Data set          Attributes   Instances   # of Classes
Central Nervous   …            60          2
Colon Tumor       …            62          2
Leukemia          …            72          2
Prostate          …            102         2
Brain             …            42          5
Lymphoma          …            62          3
NCI60             …            …           8
SRBCT             …            …           4

The first three data sets come from the Kent Ridge Biomedical Data Set Repository [21] and the remaining five from the supplementary materials of [22]. The data sets are summarized in Table 1.

3.2 Experimental setup

We have used Matlab to transform the raw attributes into both PLS and PCA components. The PCA transformation is performed using Matlab's Statistics Toolbox, whereas the PLS transformation is performed using the BDK-SOMPLS toolbox [23, 24], which uses the SIMPLS algorithm. The WEKA data mining toolkit [12] is used for the IG method, as well as for the nearest neighbor classification. Both PLS and IG are supervised methods that use class information for their transformations. Therefore, to generate the PLS components for a test set, for which the class labels are unknown, the weight matrix generated for the training set has to be used. For IG, the attributes in the training set are ranked in decreasing order of information gain, and the same attributes are selected for the test set. The number of selected dimensions was varied from one to approximately the number of examples in the current data set for all three methods. Stratified 10-fold cross-validation [12] is employed to obtain measures of accuracy, which has been chosen as the performance indicator in this study.
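Because PLS and IG see the class labels, this protocol matters: each reduction must be fitted on the training folds only and then applied unchanged to the test fold. A hedged sketch for the IG case, where ig_score is a hypothetical helper returning the information gain of one (discretized) attribute against the labels:

```python
import numpy as np

def ig_reduce(X_train, y_train, X_test, d, ig_score):
    """Rank attributes by information gain on the training fold only, then
    keep the same top-d attributes in both folds (no test labels are used)."""
    scores = np.array([ig_score(X_train[:, j], y_train)
                       for j in range(X_train.shape[1])])
    top = np.argsort(scores)[::-1][:d]     # indices of the d best attributes
    return X_train[:, top], X_test[:, top]
```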
3.3 Experimental results

The results of using the original features, the three dimensionality reduction methods, and the feature fusion and classifier fusion methods are shown in Fig. 1 and Fig. 2. Table 2 summarizes the highest classification accuracies of the individual reduction and fusion methods, comparing them to the accuracy obtained with the raw features. The numbers inside the brackets denote the minimum number of dimensions required to reach the corresponding accuracy.

It can be observed in Fig. 1 and Fig. 2 that both PLS and PCA obtain their best classification accuracies with relatively few dimensions, while more dimensions are required for IG. None of the single methods turns out to be a clear winner, except perhaps PLS on the binary classification tasks. However, all three methods outperform not using dimensionality reduction, and the difference in performance between the best and worst method can vary greatly for a particular data set. This leads to the conclusion that the choice of dimensionality reduction method to be used in conjunction with knn for microarray classification can be of major importance, but also that no single method is suitable for all cases.

As shown in Table 2, the feature fusion method often performs well compared to each individual method, giving the overall best results in 7 out of 8 cases. Furthermore, its accuracy varies to a much smaller extent with the number of dimensions than that of the other methods, showing that using the combined features reduces the sensitivity to the choice of the number of dimensions. In addition, the classifier fusion method performs on par with or better than the individual dimensionality reduction methods in 6 out of 8 cases, reaching the overall best accuracy in 3 cases. It should also be noted that classifier fusion yields its best accuracy with fewer dimensions than the feature fusion method. For example, the FF method reaches its highest classification accuracy with 60 dimensions for the Central Nervous data set, whereas the CF method reaches its highest level using only 11 features.

4 Concluding remarks

Three dimensionality reduction methods were investigated in conjunction with the knn classifier, and two approaches to fusing the results of multiple dimensionality reduction methods were considered: feature fusion and classifier fusion. An experiment with eight microarray data sets shows that dimensionality reduction indeed is effective for nearest neighbor classification and that fusing the output of these methods can further improve the classification accuracy compared to the individual dimensionality reduction methods. It is also observed that PCA and PLS perform best when choosing few dimensions, and they even sometimes outperform the fusion methods. However, if one compares the best classification accuracies of the individual and fusion methods, the feature fusion method obtains the best classification accuracy in 7 out of 8 cases. In addition, classifier fusion obtains its best accuracy with a relatively small number of dimensions compared to feature fusion, which on the other hand is even more robust to changes in the number of dimensions. Therefore, it can be concluded that choosing either of the fusion approaches should be preferred to choosing any single dimensionality reduction method, since the former can be expected to lead to higher classification accuracy and robustness with respect to the choice of the number of dimensions.

Table 2: Results on best classification accuracies

Data set          Raw   PCA     PLS        IG         FF         CF
Central Nervous   …     …(31)   73.33(26)  68.33(18)  73.33(60)  71.67(11)
Colon Tumor       …     …(10)   88.81(4)   84.52(14)  87.14(21)  87.38(7)
Leukemia          …     …(10)   95.00(5)   96.67(32)  …(33)      97.50(4)
Prostate          …     …(25)   92.36(15)  93.27(11)  95.18(87)  95.09(50)
Brain             …     …(4)    79.00(23)  81.00(34)  88.00(96)  86.00(4)
Lymphoma          …     …(2)    …(5)       …(15)      …(6)       …(2)
NCI60             …     …(6)    78.57(11)  68.81(24)  80.24(51)  80.24(6)
SRBCT             …     …(10)   …(4)       …(10)      …(12)      …(4)

Figure 1: Predictive performance with varying numbers of dimensions using PCA, PLS, IG, feature fusion (FF) and classifier fusion (CF) with nearest neighbor, for the two-class microarray data sets.

Figure 2: Predictive performance with varying numbers of dimensions using PCA, PLS, IG, feature fusion (FF) and classifier fusion (CF) with nearest neighbor, for the multi-class microarray data sets.

There are a number of issues that need further exploration. First, fusion of classifiers from additional dimensionality reduction methods could be investigated, e.g., Random Projection [6]. Second, selecting different numbers of dimensions for the different dimensionality reduction methods when fusing classifiers or features could be investigated as an alternative strategy. Finally, more sophisticated voting strategies could be considered, see e.g. [25].

5 Acknowledgments

The first author gratefully acknowledges support through the SIDA/SAREC IT project. This work was supported by the Information Fusion Research Program at the University of Skövde, Sweden, in partnership with the Swedish Knowledge Foundation under grant 2003/0104.

References

[1] J. Quackenbush. Microarray analysis and tumor classification. The New England Journal of Medicine, 354(23).

[2] D. Singh, P. G. Febbo, K. Ross, D. G. Jackson, J. Manola, C. Ladd, P. Tamayo, A. A. Renshaw, A. V. D'Amico, J. P. Richie, E. S. Lander, M. Loda, P. W. Kantoff, T. R. Golub, and W. R. Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1.

[3] J. Khan, J. S. Wei, M. Ringnér, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 2001.

[4] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6:37–66.

[5] S. Deegalla and H. Boström. Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. In ICMLA '06: Proceedings of the 5th International Conference on Machine Learning and Applications, Washington, DC, USA. IEEE Computer Society.

[6] S. Deegalla and H. Boström. Classification of microarrays with knn: Comparison of dimensionality reduction methods. In Proceedings of the 8th International Conference on Intelligent Data Engineering and Automated Learning, Birmingham, UK.

[7] J. Shlens. A tutorial on principal component analysis. shlens/pub/notes/pca.pdf.

[8] H. Abdi. Partial least squares (PLS) regression.

[9] S. de Jong. SIMPLS: An alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems.

[10] StatSoft Inc. Electronic statistics textbook. stathome.html.

[11] A. Boulesteix. PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology.

[12] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco.

[13] U. M. Fayyad and K. B. Irani. On the handling of continuous-valued attributes in decision tree generation. Machine Learning, 8:87–102.

[14] H. Boström. Feature vs. classifier fusion for predictive data mining: a case study in pesticide classification. In Proceedings of the 10th International Conference on Information Fusion.

[15] D. Ruta and B. Gabrys. An overview of classifier fusion methods. Computing and Information Systems, 7:1–10.

[16] S. L. Pomeroy, P. Tamayo, M. Gassenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. Kim, L. C. Goumnerova, P. M. Black, C. Lau, J. C. Allen, D. Zagzag, J. M. Olson, T. Curran, C. Wetmore, J. A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. N. Louis, J. P. Mesirov, E. S. Lander, and T. R. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, January.

[17] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, volume 96.

[18] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286.

[19] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, L. Yang, G. E. Marti, T. Moore, J. Hudson Jr, L. Lu, D. B. Lewis, R. Tibshirani, G. Sherlock, W. C. Chan, T. C. Greiner, D. D. Weisenburger, J. O. Armitage, R. Warnke, R. Levy, W. Wilson, M. R. Grever, J. C. Byrd, D. Botstein, P. O. Brown, and L. M. Staudt. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403.

[20] D. T. Ross, U. Scherf, M. B. Eisen, C. M. Perou, C. Rees, P. Spellman, V. Iyer, S. S. Jeffrey, M. V. de Rijn, M. Waltham, A. Pergamenschikov, J. C. F. Lee, D. Lashkari, D. Shalon, T. G. Myers, J. N. Weinstein, D. Botstein, and P. O. Brown. Systematic variation in gene expression patterns in human cancer cell lines.
Nature Genetics, 24(3).

[21] Kent Ridge Bio-medical Data Set Repository. krbd/index.html.

[22] R. Díaz-Uriarte and S. A. de Andrés. Gene selection and classification of microarray data using random forest. Bioinformatics, 7(3). randomforestvarsel.html.

[23] W. Melssen, R. Wehrens, and L. Buydens. Supervised Kohonen networks for classification problems. Chemometrics and Intelligent Laboratory Systems, 83:99–113.

[24] W. Melssen, B. Üstün, and L. Buydens. SOMPLS: a supervised self-organising map – partial least squares algorithm. Chemometrics and Intelligent Laboratory Systems, 86(1).

[25] H. Boström, R. Johansson, and A. Karlsson. On evidential combination rules for ensemble classifiers. In Proceedings of the 11th International Conference on Information Fusion, 2008.
