Graph Theoretic Latent Class Discovery

Graph Theoretic Latent Class Discovery Jeff Solka jsolka@gmu.edu NSWCDD/GMU GMU BINF Colloquium 2/24/04 p.1/28

Agenda
What is latent class discovery?
What are some approaches to the latent class discovery process?
The class cover catch digraph (CCCD) classifier.
Latent class discovery results on a gene expression data set.
Wrap-up and conclusions.

Acknowledgments
John Grefenstette
Office of Naval Research, through their ILIR Program, for funding this effort

What is Latent Class Discovery?
A latent class is a class of observations that resides undiscovered within a known class of observations.
Develop a general methodology for the discernment of latent class structure during discriminant analysis, for moderately large hyperdimensional data sets, during training or testing.
Explore applications of the developed methodologies to the analysis of data sets in the areas of hyperdimensional image analysis, artificial olfactory systems, computer security data, gene expression data, and text data mining.

Flow Chart
[Figure: flow chart with the boxes Multidimensional Scaling, Hyperdimensional Data, Graph Theoretic Discriminant Analysis, Latent Classes, Insights, Metric Space Adaptation, and Nonlinear Dimensionality Reduction]

Dominating Set
[Figure panels: two-class data and covering discs; dominating set]
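The slide shows covering discs and a dominating set only as pictures. As a rough illustration, the sketch below builds a class cover catch digraph for one target class and picks a greedy dominating set. It assumes the usual CCCD construction (each target point gets an open ball whose radius is the distance to the nearest point of the other class, and a ball "catches" every target point inside it); the function names and the toy data are hypothetical and are not taken from the talk.

```python
import math

def cccd_radii(target, other):
    """Radius of each target point's ball: distance to the nearest other-class point."""
    return [min(math.dist(x, z) for z in other) for x in target]

def cccd_covers(target, radii):
    """covers[i] = indices of target points caught by the open ball around point i.

    A point is always counted as covering itself, so the greedy loop below
    terminates even if a radius is zero.
    """
    return [
        {i} | {j for j, y in enumerate(target) if math.dist(x, y) < r}
        for i, (x, r) in enumerate(zip(target, radii))
    ]

def greedy_dominating_set(covers):
    """Repeatedly pick the ball that catches the most still-uncovered target points."""
    uncovered = set(range(len(covers)))
    chosen = []
    while uncovered:
        best = max(range(len(covers)), key=lambda i: len(covers[i] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen

# Hypothetical toy data: four target-class points and one point of the other class.
target = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
other = [(3.0, 3.0)]

radii = cccd_radii(target, other)
covers = cccd_covers(target, radii)
dominating = greedy_dominating_set(covers)
print(dominating)
```

The greedy choice is a heuristic: it returns a dominating set, but not necessarily one of minimum size, and ties are broken here by lowest index.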

A Brief Movie

CCCD-Based Latent Class Discovery
[Figure]

Quadratic Classifier-Based Latent Class Discovery

ALL/AML Leukemia Gene Expression Analysis
72 patients, 7129 genes
Classes shown in the legend: AML, ALL B cell, ALL T cell
Apply CCCD to ALL observations
Cluster CCCD solution based on radii
Ascertain significance of latent class structure
Examine clusters for latent class structure

Resubstitution Error Rate Estimate
For each cluster map dimension $k$, $\hat{L}_n(k)$ is an empirical risk (resubstitution error rate estimate), calculated as the fraction of the $n$ training observations $(X_i, Y_i)$ misclassified by the dimension-$k$ classifier $g_k$:
$$\hat{L}_n(k) = \frac{1}{n} \sum_{i=1}^{n} I\{ g_k(X_i) \neq Y_i \}$$
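A resubstitution error rate is the error of a classifier measured on the same observations that were used to build it. The sketch below computes that quantity in the standard way, as a misclassified fraction. The threshold classifier, the observations, and the labels are hypothetical stand-ins and are not data from the talk.

```python
def resubstitution_error(classifier, observations, labels):
    """Fraction of training observations that the classifier mislabels."""
    errors = sum(1 for x, y in zip(observations, labels) if classifier(x) != y)
    return errors / len(labels)

# Hypothetical one-dimensional example with a fixed threshold rule.
def threshold_rule(x):
    return "T" if x > 0.5 else "B"

observations = [0.1, 0.4, 0.6, 0.9]
labels = ["B", "B", "T", "B"]

error_rate = resubstitution_error(threshold_rule, observations, labels)
print(error_rate)
```

Because the same data are used for fitting and for scoring, this estimate tends to be optimistic, which is one reason to penalize model complexity when comparing classifiers.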

Classification Dimension
We proceed by defining the scale dimension $\hat{d}^*$ to be the cluster map dimension that minimizes a dimensionality-penalized empirical risk;
$$\hat{d}^* = \min\{ \arg\min_k [ \hat{L}_n(k) + \delta \cdot k ] \}$$
for some penalty coefficient $\delta$, where $\hat{L}_n(k)$ is the empirical risk at cluster map dimension $k$.
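As a small worked illustration of choosing a dimension by penalized empirical risk, the sketch below assumes an additive penalty that is linear in the dimension (risk plus coefficient times dimension) and takes the smallest dimension among any ties. The exact form of the penalty on the slide did not survive extraction, so that form is an assumption, and the risk values are hypothetical.

```python
def scale_dimension(risks, penalty):
    """Return the smallest dimension k minimizing risks[k] + penalty * k.

    risks maps each candidate cluster map dimension k to its empirical risk.
    The additive linear penalty is an assumption made for this sketch.
    """
    penalized = {k: risk + penalty * k for k, risk in risks.items()}
    best = min(penalized.values())
    return min(k for k, value in penalized.items() if value == best)

# Hypothetical empirical risks for dimensions 1 through 5.
risks = {1: 0.30, 2: 0.12, 3: 0.05, 4: 0.05, 5: 0.04}

chosen_with_penalty = scale_dimension(risks, penalty=0.02)
chosen_without_penalty = scale_dimension(risks, penalty=0.0)
print(chosen_with_penalty, chosen_without_penalty)
```

With no penalty the largest dimension wins because its raw risk is lowest; a modest penalty moves the choice to a lower dimension where the risk has already levelled off.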

ALL/AML Classification Dimension Plot

Gene Latent Class Discovery

ALL/AML MDS Plot

How Robust is the Methodology?
One other success story using artificial nose data.
What if we had used another dominating set in our analysis?
Is the discovered latent class structure independent of the dominating set used?

An Exhaustive Enumeration of All Possible Dominating Sets for the Gene Data
180 solutions, each with 21 nodes
16 of the nodes remain fixed across the solutions
14 greedy solutions
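The enumeration summarized above can be illustrated on a tiny digraph. The sketch below finds every dominating set of minimum size by brute force, then reports which nodes are fixed across all solutions and how many solutions each vertex appears in. It assumes each vertex's cover set includes the vertex itself; the three-vertex example is hypothetical, and brute force like this is only practical for small graphs.

```python
from collections import Counter
from itertools import combinations

def minimum_dominating_sets(covers):
    """Return all dominating sets of the smallest possible size.

    covers[i] is the set of vertices dominated by vertex i (including i itself).
    """
    n = len(covers)
    everything = set(range(n))
    for size in range(1, n + 1):
        found = [
            set(combo)
            for combo in combinations(range(n), size)
            if set().union(*(covers[i] for i in combo)) == everything
        ]
        if found:
            return found
    return []

# Hypothetical digraph: vertices 0 and 1 dominate each other, vertex 2 is isolated.
covers = [{0, 1}, {0, 1}, {2}]

solutions = minimum_dominating_sets(covers)
fixed_nodes = set.intersection(*solutions)           # nodes present in every solution
appearances = Counter(v for s in solutions for v in s)  # solutions per vertex
print(solutions, fixed_nodes, dict(appearances))
```

In this toy case vertex 2 must appear in every solution because nothing else dominates it, which mirrors the idea of nodes that remain fixed across solutions.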

Classification Space Curves for the 180 Solutions
[Figure]

Classification Dimension for the 180 Solutions (red o: Greedy Solutions, green *: Previous Solution)
[Figure]

Classification Dimension for the 180 Solutions
[Figure]

Number of Dominating Sets for Each Vertex
[Figure: axes labeled "# Dominating Sets" and "Vertex"; legend: T Cell, B Cell, In degree 0]

Digraph Analysis
[Figures; the caption text is not recoverable from the extraction]

Latent Class Discovery Figures of Merit
How can we be assured that all of the greedy dominating set solutions discover the same latent classes?
The previous greedy solution had 3 clusters that are pure B and 1 cluster that contained 8/9 of the T observations.
Figures of merit: the percentage of B points that are in pure B clusters, and the highest percentage of T points in any one cluster.
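The two figures of merit named on this slide can be computed directly from a clustering of labeled observations. The sketch below is one straightforward reading of them: the share of all B points that fall in clusters containing only B points, and the largest share of all T points found in any single cluster. The function name and the example clusters are hypothetical and are not the Golub data.

```python
def purity_figures(clusters):
    """Return (b_fraction, t_fraction) for clusters given as lists of 'B'/'T' labels.

    b_fraction: fraction of all B points that lie in clusters containing only B points.
    t_fraction: largest fraction of all T points found in any one cluster.
    """
    total_b = sum(cluster.count("B") for cluster in clusters)
    total_t = sum(cluster.count("T") for cluster in clusters)
    b_in_pure = sum(
        len(cluster)
        for cluster in clusters
        if cluster and all(label == "B" for label in cluster)
    )
    most_t = max(cluster.count("T") for cluster in clusters)
    return b_in_pure / total_b, most_t / total_t

# Hypothetical clustering: two pure B clusters, one mixed cluster, one stray T point.
clusters = [["B"] * 4, ["B"] * 3, ["B", "T", "T", "T"], ["T"]]

b_fraction, t_fraction = purity_figures(clusters)
print(b_fraction, t_fraction)
```

Both values lie between 0 and 1, and a solution that perfectly separates the latent classes would score 1 on each.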

Purity (Latent Class Discovery) for the Golub Gene Data (red triangles are the greedy solutions)
[Figure: axes labeled "tpercent" and "bpercent"]

Remaining Questions
Demonstrated similar latent class discovery among all of the greedy dominating set solutions.
Many of the 7129 variates (genes) are superfluous to the discriminant analysis problem.
Work is ongoing to examine the discovered latent classes based on subsets of the genes.
Various figures of merit have been used to choose the subsets of the genes.

Conclusions
Developed a new concept for latent class discovery during discriminant analysis.
Illustrated one graph theoretic methodology for the discovery of the latent classes.
Illustrated this methodology with a gene expression data set.
Presented some preliminary results examining the robustness of the discovery process to the CCCD process.

Readings
C. E. Priebe, J. L. Solka, D. J. Marchette, and B. T. Clark, 2003, Class Cover Catch Digraphs for Latent Class Discovery in Gene Expression Monitoring by DNA Microarrays, Computational Statistics and Data Analysis on Statistical, Vol. 43, pp. 621-632.
J. L. Solka, C. E. Priebe, and B. T. Clark, 2002, A Visualization Framework for the Analysis of Hyperdimensional Data, International Journal of Image and Graphics, Special Issue on Graphical Methods in Data Mining, pp. 145-161.
D. J. Marchette and C. E. Priebe, 2002, Characterizing the scale dimension of a high-dimensional classification problem, Pattern Recognition, Vol. 36, pp. 45-60.

Questions?