Investigating the structure of high dimensional pattern recognition problems


Investigating the structure of high dimensional pattern recognition problems. Carey E. Priebe <cep@jhu.edu>, Department of Mathematical Sciences, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218-2682. Inaugural Professorial Lecture, November 29, 2001. "The wealth of your practical experience with sane and interesting problems will give to mathematics a new direction and a new impetus." Leopold Kronecker to Hermann von Helmholtz

Investigating the structure of high dimensional pattern recognition problems. Statistical pattern recognition (classification, clustering, etc.) in high dimensions is notoriously difficult: the curse of dimensionality implies that enough data will never be available. Nevertheless, high dimensional pattern recognition applications such as hyperspectral target recognition and gene expression monitoring require methodologies for uncovering structure, generating hypotheses, and making decisions. This talk will discuss some of the challenges presented by high dimensional pattern recognition problems, and will introduce a statistical data mining methodology for investigating the structure of these problems. Applications from artificial olfactory analysis (the Tufts University artificial nose) and gene expression monitoring by DNA microarrays will be used to frame the discussion.


High dimensional pattern recognition problems: olfactory classification; gene expression analysis; multispectral imagery (mines & minefields); hyperspectral imagery (e.g., HyMap); functional brain imagery (e.g., NV vs. SZ); astronomy (e.g., Sloan Digital Sky Survey); face detection; financial data analysis; knowledge discovery from text.

Gene expression monitoring by DNA microarrays

DNA microarrays consist of a library of genes immobilized in a grid, usually on a glass slide. Each individual spot in the grid contains DNA from a single gene that will bind to the messenger RNA (mRNA) produced by the gene concerned. So by liquidizing a sample from a given tissue type, tagging its mRNAs with fluorescent dyes and then exposing the sample to the slide, it is possible to obtain an instant visual read-out revealing which genes were active. Jonathan Knight, Nature, Vol. 410, 19 April 2001

The WI/MIT CGR 1999 data set, produced by Affymetrix DNA microarrays, involves two general classes of leukemia, ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia). Each observation is a patient, with n_ALL = 47, n_AML = 25; n = n_ALL + n_AML = 72. Each observation is a point in 7129-dimensional Euclidean space; there are 6817 unique human genes monitored, augmented with some redundant probes and control elements. Golub, Slonim, et al., Science, 1999.

Microarray images (e.g., ALL 1, ALL 2, AML 1).

Goals: classify ALL vs. AML; cluster; latent class discovery; dimension reduction. Problem: 72 observations in 7129 dimensions.

Tufts University artificial nose chemical sensor. White, Kauer, Dickinson, and Walt, Nature, 1996; Priebe, IEEE PAMI, 2001.

Vapor-sensing and pattern recognition with the Tufts University artificial nose chemical sensor. The plot in the cartoon represents sensor/analyte signatures for three sensors within the bundled nineteen-sensor array. Signature patterns of fluorescence changes vs. time are used for subsequent analysis. Nature, 382: 697-700 (1996).

Observation Chloroform 07 02 (Clfm0702).

Olfactory Classification. Goal: detection of a distinguished analyte, trichloroethylene (TCE), at various concentrations in the presence of multiple confounders. Problem: multivariate function-valued data; no parametric model.

Statistical Data Mining in the sense of Edward J. Wegman: Data Mining is an extension of exploratory data analysis and has basically the same goals: the discovery of unknown and unanticipated structure in the data. The chief distinction between the two topics resides in the size and dimensionality of the data sets involved. Data mining in general deals with much more massive data sets for which highly interactive analysis is not fully feasible.

The curse of dimensionality: nonparametric density estimation. Choose n such that the relative mean squared error at 0 is small; n(d) is the sample size for which E[(f̂_n(0; d) − ϕ(0; d))^2 / ϕ(0; d)^2] = 0.1 (Silverman, 1986).

d    n(d)
1    4
2    19
3    67
4    223
5    768
6    2790
7    10700
8    43700
9    187000
10   842000

The curse of dimensionality: statistical pattern recognition. Consider class-conditional probability density functions f_j = Normal(µ_j, I_d), j = 0, 1, with equal priors. Let µ_0 = −µ_1 = [1, 2^(−1/2), 3^(−1/2), ..., d^(−1/2)]. Case I: µ_0 known. The Bayes optimal rule is available, and L(d) → 0 as d → ∞. Case II: µ_0 unknown. µ_0 must be estimated from training data, and L_n(d) → 1/2 as d → ∞ for fixed n. Trunk, 1979; Jain, et al., 2000.
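Trunk's phenomenon is easy to reproduce by simulation. The sketch below (Python with numpy; the sample sizes and the simple plug-in rule are my own illustrative choices, not taken from the talk) compares the linear rule that uses the known mean direction against the rule that estimates the mean from a small fixed training sample, as the dimension grows.

    import numpy as np

    rng = np.random.default_rng(0)

    def trunk_error(d, n_train=20, n_test=2000, use_true_mean=True):
        # Class 0 has mean +mu, class 1 has mean -mu, identity covariance,
        # with mu_i = 1/sqrt(i) as in Trunk's construction.
        mu = 1.0 / np.sqrt(np.arange(1, d + 1))
        def sample(n):
            y = rng.integers(0, 2, size=n)                 # equal-prior labels
            signs = np.where(y == 0, 1.0, -1.0)[:, None]
            return signs * mu + rng.standard_normal((n, d)), y
        x_tr, y_tr = sample(n_train)
        x_te, y_te = sample(n_test)
        if use_true_mean:
            w = mu                                         # Bayes direction (Case I)
        else:
            # Case II: plug in the mean direction estimated from the training data.
            w = (x_tr[y_tr == 0].mean(axis=0) - x_tr[y_tr == 1].mean(axis=0)) / 2.0
        y_hat = (x_te @ w <= 0).astype(int)                # classify as class 0 iff w'x > 0
        return float(np.mean(y_hat != y_te))

    for d in (1, 10, 100, 1000):
        print(d,
              round(trunk_error(d, use_true_mean=True), 3),    # error shrinks as d grows
              round(trunk_error(d, use_true_mean=False), 3))   # error degrades toward 1/2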

PEANUTS by Charles Schulz (three slides).

Statistical Data Mining in the sense of Brian Ripley: Data mining, also known as knowledge discovery in databases, is one of many terms for finding structure in large-scale datasets on the boundaries of statistics, engineering, machine learning and computer science. Statistical data mining concentrates on methods for finding the structure (as distinct from manipulating the databases).

Computational Statistics: A New Agenda for Statistical Theory and Practice. High dimensional data; computationally intensive methodologies; imprecise questions; weak assumptions; nonlinear error structures; distribution free models. Edward J. Wegman, 1988.

In his 1992 book, David W. Scott writes: Fortunately, it appears that in practical situations, the dimension of the structure seldom exceeds 4 or 5.

Tufts University artificial nose chemical sensor

We choose to focus on methodologies which depend upon only the interpoint distances. To build a classifier for an unlabeled exemplar Z, we restrict attention to the information contained in the training sample interpoint distance matrix D = [d_{ij}] = [d(X_i, X_j)] and the test vector D_Z = [d(Z, X_1), ..., d(Z, X_n)]. Alas, the trick then ("first, get a million dollars...") is to pick a good distance.
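In code, the only quantities such interpoint-distance methods get to see are the following (a minimal sketch in Python with numpy/scipy; the Euclidean metric here is just a placeholder for "a good distance", and the function name is mine):

    import numpy as np
    from scipy.spatial.distance import cdist, pdist, squareform

    def interpoint_inputs(X_train, z, metric="euclidean"):
        """n x n training distance matrix D and length-n test vector D_Z."""
        D = squareform(pdist(X_train, metric=metric))                   # D[i, j] = d(X_i, X_j)
        D_z = cdist(z.reshape(1, -1), X_train, metric=metric).ravel()   # d(Z, X_i), i = 1..n
        return D, D_z

    # toy usage: 5 "patients" in R^7129
    X = np.random.default_rng(1).standard_normal((5, 7129))
    D, D_z = interpoint_inputs(X, X[0])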

The trick then... is to pick a good distance. Integrated Sensing and Processing (DARPA ACMP ISP): sensor/processor parameter selection (e.g., dimensionality selection) based on explicit optimization of the exploitation objective (in the case of supervised classification, the probability of misclassification L).

Class Cover Catch Digraphs. A statistical data mining methodology for investigating the structure of high dimensional pattern recognition problems.

Class Cover Catch Digraphs. Class-conditional data: X_i | Y_i = j ~ f_j.

Class Cover Catch Digraphs. For X_i such that Y_i = j, define the ball B_i = {x : d(X_i, x) < r_{X_i}}, where r_{X_i} := min_{X_k : Y_k = 1−j} d(X_i, X_k).

Class Cover Catch Digraphs. V_j = {X_i : Y_i = j}. For i_1 ≠ i_2, (X_{i_1}, X_{i_2}) ∈ A_j if and only if X_{i_2} ∈ B_{i_1}.

Class Cover Catch Digraphs. D_j = (V_j, A_j).

Class Cover Catch Digraphs. Choose a (minimum) dominating set Ŝ_j for D_j.

Class Cover Catch Digraphs. Consider the balls {B_i : X_i ∈ Ŝ_j}.

Class Cover Catch Digraphs. Go and do likewise for class 1 − j.

Class Cover Catch Digraphs. The classifier: g(z) = arg min_j min_{X_i ∈ Ŝ_j} (d(z, X_i) / r_{X_i})^{T_{X_i}}.

Class Cover Catch Digraphs. Monte Carlo results: L̂(nearest neighbor) = 0.123; L̂(cccd) = 0.074; L(Bayes optimal) = 0.035. (L := P[g(X) ≠ Y] is the probability of misclassification.)
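To make the construction concrete, here is a minimal sketch of a two-class cccd classifier (Python with numpy/scipy; the greedy approximation to the minimum dominating set and the choice T_{X_i} = 1 are my assumptions, not necessarily the exact choices behind the Monte Carlo numbers above):

    import numpy as np
    from scipy.spatial.distance import cdist

    def cccd_fit(X, y, cls):
        """Greedy dominating set and radii for the catch digraph of class `cls`."""
        X_in, X_out = X[y == cls], X[y != cls]
        r = cdist(X_in, X_out).min(axis=1)          # r_i = distance to nearest point of the other class
        covers = cdist(X_in, X_in) < r[:, None]     # arc i -> k iff X_k lies in the ball B(X_i, r_i)
        S, uncovered = [], np.ones(len(X_in), dtype=bool)
        while uncovered.any():                      # greedy (approximate) minimum dominating set
            i = int((covers & uncovered).sum(axis=1).argmax())
            S.append(i)
            uncovered &= ~covers[i]
        return X_in[S], r[S]

    def cccd_classify(z, models):
        """models: {class label: (ball centers, radii)}; smallest scaled distance wins."""
        scores = {c: float((cdist(z.reshape(1, -1), C) / r).min()) for c, (C, r) in models.items()}
        return min(scores, key=scores.get)

    # toy usage on two well-separated Gaussian clouds
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
    y = np.array([0] * 30 + [1] * 30)
    models = {c: cccd_fit(X, y, c) for c in (0, 1)}
    print(cccd_classify(np.array([3.5, 4.2]), models))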

Theorem 1: Let Ŝ_j be dominating sets for the cccds D_j, and let g(z) = arg min_j min_{X_i ∈ Ŝ_j} (d(z, X_i) / r_{X_i})^{T_{X_i}} with T_{X_i} ≥ 1 for all i. Then the resubstitution error L̂_n^{(R)}(g) = n^{−1} Σ_{i=1}^{n} I{g(X_i) ≠ Y_i} = 0.

Theorem 2: Assume, in addition to the conditions of Theorem 1, that d is well-behaved (e.g., an L_p metric) and the class-conditional distributions F_j are strictly separable. Then g is consistent; that is, L_n(g) → L* := L(Bayes optimal).

Algorithmic extension: robustness (a) to contamination, (b) to outliers.

α = 0, β = 0: L̂ ≈ 0.21. α = 10, β = 5: L̂ ≈ 0.16.

Coastal Battlefield Reconnaissance and Analysis (COBRA)

Coastal Battlefield Reconnaissance and Analysis (COBRA). Class Cover Catch Digraph with α = 1, β = 4.

Coastal Battlefield Reconnaissance and Analysis (COBRA). Scatter plot: band #5 versus band #3, classes marked X and Y.

Coastal Battlefield Reconnaissance and Analysis (COBRA). Scatter plot: band #5 versus band #3. Leave-one-out (deleted) error estimate: L̂^{(D)}(g) = n^{−1} Σ_{i=1}^{n} I{g^{(−i)}(X_i) ≠ Y_i} = 0.205.

Complexity Reduction: 1. cccd & dominating set; 2. hierarchical complete-linkage clustering on the sizes (radii) of the proximity regions: cluster {r_{X_i} : X_i ∈ Ŝ_j}; 3. determine the scale dimension.

Define the scale dimension d* to be the cluster map dimension which minimizes a dimensionality-penalized misclassification rate: d*_δ := min{ arg min_k (L_k + δ k) } for some penalty coefficient δ ∈ [0, 1].
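A rough sketch of steps 2 and 3 of the complexity reduction (Python with scipy; how the cluster map yields the per-dimension misclassification rates L_k is specific to the methodology, so they are simply passed in here as a given vector):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def radius_clusters(radii, k):
        """Complete-linkage clustering of the dominating-set radii into k groups."""
        Z = linkage(np.asarray(radii, dtype=float).reshape(-1, 1), method="complete")
        return fcluster(Z, t=k, criterion="maxclust")   # cluster label for each radius

    def scale_dimension(L, delta):
        """d*_delta: smallest k minimizing the penalized rate L_k + delta * k, k = 1, 2, ..."""
        L = np.asarray(L, dtype=float)
        penalized = L + delta * np.arange(1, len(L) + 1)
        return int(np.flatnonzero(penalized == penalized.min()).min()) + 1

    # toy usage: suppose these are estimated misclassification rates for k = 1..20
    L_hat = np.linspace(0.4, 0.1, 20)
    print(scale_dimension(L_hat, delta=0.01))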

Scale dimension for the artificial nose data: d̂* = 7. (Plot: misclassification rate versus dimension, 1 through 20.)

Scale Dimension: d̂* = 7.

Return now to our example: the Tufts University artificial nose chemical sensor. Data: 80 observations of TCE + chloroform in air (at various concentrations); 40 observations of chloroform in air (at various concentrations).

Gene expression monitoring by DNA microarrays

Gene expression monitoring by DNA microarrays: ALL vs. AML.

Gene expression monitoring by DNA microarrays: B-cell ALL vs. T-cell ALL vs. AML.

Plot: d(X_i, AML) for the ALL observations, labeled B-cell vs. T-cell. H_0: median(F_{d(X,AML)} for X B-cell ALL) = median(F_{d(X,AML)} for X T-cell ALL). p-value = 0.0051 (exact Wilcoxon rank-sum test).
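A hedged sketch of a test of this kind (Python with scipy; taking d(X_i, AML) to be the median Euclidean distance from observation X_i to the AML observations is my assumption, since the slide does not spell out that summary):

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.stats import mannwhitneyu

    def distance_to_group(X, G):
        """Summary distance from each row of X to the group G (median distance, by assumption)."""
        return np.median(cdist(X, G), axis=1)

    def test_b_vs_t(X_b, X_t, X_aml):
        """Exact Wilcoxon rank-sum (Mann-Whitney) test of d(X, AML) for B-cell vs. T-cell ALL."""
        d_b = distance_to_group(X_b, X_aml)
        d_t = distance_to_group(X_t, X_aml)
        return mannwhitneyu(d_b, d_t, alternative="two-sided", method="exact")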

cccd classification on DNA microarray data: L̂ = 0.069. (Scatter plot: gene #2275 versus gene #6345, classes marked X and Y.)

Dimension Reduction: Principal Components. (Scree plots for the DNA microarray data: fraction of variance explained and cumulative variance explained versus principal component, 1 through 72.)

Dimension Reduction: Principal Components. (Plot: estimated misclassification rate L̂ versus the number of principal components k retained, up to 72.)
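A curve of this kind can be produced along the following lines (Python with scikit-learn; the 1-nearest-neighbor classifier and 5-fold cross-validation are my assumptions, since the slide reports only the curve):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    def error_vs_components(X, y, ks):
        """Estimated misclassification rate after projecting onto the first k principal components."""
        errs = []
        for k in ks:
            clf = make_pipeline(PCA(n_components=k), KNeighborsClassifier(n_neighbors=1))
            acc = cross_val_score(clf, X, y, cv=5).mean()   # cross-validated accuracy
            errs.append(1.0 - acc)
        return np.array(errs)

    # usage (X: 72 x 7129 expression matrix, y: ALL/AML labels):
    # errs = error_vs_components(X, y, ks=range(1, 56, 5))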

Dimension Reduction: ISP. (i*_1, ..., i*_k) = arg inf_{1 ≤ k ≤ d} inf_{(i_1, ..., i_k) ⊂ {1, ..., d}} L(g(·; i_1, ..., i_k)). (Note: C(7129, 10) ≈ 10^32.)
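The combinatorial count in the parenthetical note is easy to check:

    import math
    print(math.comb(7129, 10))   # about 9 x 10^31, i.e. on the order of 10^32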

Investigating the structure of high dimensional pattern recognition problems. Lenore Cowen, Jingdong Xie, Adam Cannon, Jason DeVinney, Diego Socolinsky, David Marchette, Jeff Solka, Dennis Healy, Anna Tsao, Wendy Martinez, Ed Wegman.
