Graph Theoretic Latent Class Discovery

Graph Theoretic Latent Class Discovery Jeff Solka jsolka@gmu.edu NSWCDD/GMU GMU BINF Colloquium 2/24/04 p.1/28

Agenda
- What is latent class discovery?
- What are some approaches to the latent class discovery process?
- The class cover catch digraph (CCCD) classifier.
- Latent class discovery results on a gene expression data set.
- Wrap-up and conclusions.

Acknowledgments
- John Grefenstette
- Office of Naval Research, through their ILIR Program, for funding this effort

What is Latent Class Discovery?
- A latent class is a class of observations that resides undiscovered within a known class of observations.
- Develop a general methodology for the discernment of latent class structure during discriminant analysis.
  - Moderately large hyperdimensional data sets.
  - During training or testing.
- Explore applications of the developed methodologies to the analysis of data sets in the areas of hyperdimensional image analysis, artificial olfactory systems, computer security data, gene expression data, and text data mining.

Flow Chart
[Figure: flow chart with boxes labeled Hyperdimensional Data, Multidimensional Scaling, Nonlinear Dimensionality Reduction, Metric Space Adaptation, Graph Theoretic Discriminant Analysis, Latent Classes, and Insights]

Dominating Set
[Figure: two panels, "Two-class data and covering discs" and "Dominating set"]
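
The covering-disc and dominating-set idea can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: it assumes Euclidean distance, open balls whose radius is the distance to the nearest point of the other class, and the usual greedy heuristic that repeatedly picks the ball covering the most uncovered points. The function names `cccd_cover` and `greedy_dominating_set` and the toy data are hypothetical.

```python
import numpy as np

def cccd_cover(target, other):
    """Return (radii, covers) for a class cover catch digraph sketch.

    radii[i] is the distance from target[i] to the nearest point of `other`.
    covers[i, j] is True when target[j] lies strictly inside the open ball
    centred at target[i] with radius radii[i] (assumed definition).
    """
    target = np.asarray(target, dtype=float)
    other = np.asarray(other, dtype=float)
    d_other = np.linalg.norm(target[:, None, :] - other[None, :, :], axis=2)
    radii = d_other.min(axis=1)
    d_same = np.linalg.norm(target[:, None, :] - target[None, :, :], axis=2)
    covers = d_same < radii[:, None]
    return radii, covers

def greedy_dominating_set(covers):
    """Greedy heuristic: pick the ball that covers the most uncovered points."""
    uncovered = np.ones(covers.shape[0], dtype=bool)
    chosen = []
    while uncovered.any():
        gains = (covers & uncovered[None, :]).sum(axis=1)
        best = int(np.argmax(gains))
        if gains[best] == 0:
            # Happens only if a target point coincides with an other-class point.
            raise ValueError("some target points cannot be covered")
        chosen.append(best)
        uncovered &= ~covers[best]
    return chosen

# Toy one-dimensional layout embedded in the plane (illustrative data only).
target = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
other = np.array([[3.0, 0.0]])
radii, covers = cccd_cover(target, other)
dominating = greedy_dominating_set(covers)
print(radii, dominating)
```

The greedy choice is not guaranteed to give a minimum dominating set, and ties are broken here by lowest index, so different tie-breaking can yield different solutions.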

A Brief Movie

CCCD-Based Latent Class Discovery
[Figure: two-dimensional scatter plot]

Quadratic Classifier-Based Latent Class Discovery

ALL/AML Leukemia Gene Expression Analysis
- 72 patients, 7129 genes
- Apply CCCD to ALL observations
- Cluster CCCD solution based on radii
- Ascertain significance of latent class structure
- Examine clusters for latent class structure
[Figure legend: AML, ALL B cell, ALL T cell]

Resubstitution Error Rate Estimate
- [symbol lost] is an empirical risk (resubstitution error rate estimate)
- For each [symbol lost], calculated as [equation not recoverable from the source]
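
For reference, a resubstitution error rate estimate is commonly written as the fraction of training observations that a classifier mislabels. The notation below is an assumption and is not taken from the slide: $g_k$ stands for the classifier built from the $k$-dimensional cluster map, and $(x_i, y_i)$, $i = 1, \dots, n$, are the training observations and their class labels.

```latex
\hat{L}_k \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\, g_k(x_i) \neq y_i \,\}
```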

Classification Dimension
We proceed by defining the scale dimension to be the cluster map dimension that minimizes a dimensionality-penalized empirical risk, for some penalty coefficient.
[Equation not recoverable from the source]
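
One way to read "dimensionality-penalized empirical risk" is as the empirical risk plus a penalty proportional to the dimension. The sketch below assumes that additive form, which the slide text does not confirm; the function name `scale_dimension`, the parameter `penalty`, and the risk values are hypothetical.

```python
def scale_dimension(risks, penalty):
    """Pick the dimension minimizing risk + penalty * dimension.

    risks[k - 1] is the empirical risk for cluster map dimension k.
    The additive penalty is an assumed form. Ties go to the smallest dimension.
    """
    penalized = [r + penalty * k for k, r in enumerate(risks, start=1)]
    best = min(penalized)
    return penalized.index(best) + 1, penalized

# Illustrative risks for dimensions 1 through 4 (made-up numbers).
risks = [0.30, 0.10, 0.05, 0.05]
dim_small_penalty, _ = scale_dimension(risks, penalty=0.02)
dim_large_penalty, _ = scale_dimension(risks, penalty=0.10)
print(dim_small_penalty, dim_large_penalty)
```

A larger penalty coefficient favors lower dimensions, which is the trade-off the definition is meant to capture.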

ALL/AML Classification Dimension Plot

Gene Latent Class Discovery

ALL/AML MDS Plot

How Robust is the Methodology?
- One other success story using artificial nose data.
- What if we had used another dominating set in our analysis?
- Is the discovered latent class structure independent of the dominating set used?

An Exhaustive Enumeration of All Possible Dominating Sets for the Gene Data
- 180 21-node solutions
- 16 of the nodes remain fixed across the solutions
- 14 greedy solutions
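
An exhaustive enumeration of this kind can be illustrated on a toy digraph. The sketch below is a brute-force search, practical only for small graphs, and assumes the enumerated solutions are the minimum-size dominating sets. The function name `minimum_dominating_sets` and the three-vertex example are hypothetical and are not the gene data.

```python
from itertools import combinations

def minimum_dominating_sets(covers):
    """Return every dominating set of minimum size.

    covers[i] is the set of vertices covered by vertex i, including i itself.
    Brute force over subsets of increasing size; exponential in the worst case.
    """
    n = len(covers)
    everything = set(range(n))
    for size in range(1, n + 1):
        found = [
            subset
            for subset in combinations(range(n), size)
            if set().union(*(covers[i] for i in subset)) == everything
        ]
        if found:
            return found
    return []

# Toy digraph: vertices 0 and 1 cover each other, vertex 2 covers only itself.
covers = [{0, 1}, {0, 1}, {2}]
solutions = minimum_dominating_sets(covers)
# Vertices that appear in every minimum solution.
fixed = set.intersection(*(set(s) for s in solutions))
print(solutions, fixed)
```

Here vertex 2 is fixed across both solutions, a small analogue of nodes that remain fixed across all enumerated solutions.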

Classification Space Curves for the 180 Solutions
[Figure: curves for the 180 solutions; axis tick ranges 0.00 to 0.30 and 5 to 20]

Classification Dimension for the 180 Solutions (Red o: Greedy Solutions, Green *: Previous Solution)
[Figure: plot; axis tick ranges 2 to 7 and 0 to 180]

Classification Dimension for the 180 Solutions
[Figure: plot; axis tick ranges 0 to 60 and 2 to 7]

Number of Dominating Sets for Each Vertex
[Figure: "Number of Dominating sets for each vertex"; x-axis: Vertex (0 to 40); y-axis: # Dominating Sets (0 to 150); legend: T Cell, B Cell, In degree 0]

Digraph Analysis
[Slide text not recoverable from the source]

Latent Class Discovery Figures of Merit
- How can we be assured that all of the greedy dominating set solutions discover the same latent classes?
- The previous greedy solution had 3 clusters that are pure B and 1 cluster that contained 8/9 of the T observations.
- Figures of merit: the percentage of B points that are in pure B clusters, and the highest percentage of T points in any one cluster.
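
The two figures of merit can be computed directly from a cluster assignment and the true B/T labels. The sketch below is one reading of the definitions on this slide. The function name `purity_figures`, the label strings "B" and "T", and the toy assignment are assumptions, and the code expects at least one B and one T observation.

```python
from collections import Counter, defaultdict

def purity_figures(cluster_ids, labels):
    """Return (bpercent, tpercent) as fractions in [0, 1].

    bpercent: share of B points that fall in clusters containing only B points.
    tpercent: largest share of T points found in any single cluster.
    """
    by_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster][label] += 1
    counts = list(by_cluster.values())
    n_b = sum(c["B"] for c in counts)
    n_t = sum(c["T"] for c in counts)
    b_in_pure = sum(c["B"] for c in counts if c["B"] > 0 and c["T"] == 0)
    t_max = max(c["T"] for c in counts)
    return b_in_pure / n_b, t_max / n_t

# Toy assignment (made-up): cluster 0 is pure B, cluster 1 is mixed, cluster 2 is pure T.
cluster_ids = [0, 0, 1, 1, 1, 2]
labels = ["B", "B", "T", "T", "B", "T"]
bpercent, tpercent = purity_figures(cluster_ids, labels)
print(bpercent, tpercent)
```

In this toy case two of the three B points sit in a pure B cluster and two of the three T points share one cluster, so both figures equal 2/3.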

Purity (Latent Class Discovery) for the Golub Gene Data (Red Triangles: Greedy Solutions)
[Figure: scatter plot; x-axis: bpercent (0.4 to 0.9); y-axis: tpercent (0.80 to 1.00)]

Remaining Questions
- Demonstrated similar latent class discovery among all of the greedy dominating set solutions.
- Many of the 7129 variates (genes) are superfluous to the discriminant analysis problem.
- Work is ongoing to examine the discovered latent classes based on subsets of the genes.
- Various figures of merit have been used to choose the subsets of the genes.

Conclusions
- Developed a new concept for latent class discovery during discriminant analysis.
- Illustrated one graph theoretic methodology for the discovery of the latent classes.
- Illustrated this methodology with a gene expression data set.
- Presented some preliminary results examining the robustness of the discovery process to the CCCD process.

Readings
- C. E. Priebe, J. L. Solka, D. J. Marchette, and B. T. Clark, 2003, Class Cover Catch Digraphs for Latent Class Discovery in Gene Expression Monitoring by DNA Microarrays, Computational Statistics and Data Analysis on Statistical, Vol. 43, pp. 621-632.
- J. L. Solka, C. E. Priebe, and B. T. Clark, 2002, A Visualization Framework for the Analysis of Hyperdimensional Data, International Journal of Image and Graphics, Special Issue on Graphical Methods in Data Mining, pp. 145-161.
- D. J. Marchette and C. E. Priebe, 2002, Characterizing the Scale Dimension of a High-Dimensional Classification Problem, Pattern Recognition, Vol. 36, pp. 45-60.

Questions?