Investigating the structure of high dimensional pattern recognition problems
Carey E. Priebe <cep@jhu.edu>
Department of Mathematical Sciences, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD 21218-2682
Inaugural Professorial Lecture, November 29, 2001

"The wealth of your practical experience with sane and interesting problems will give to mathematics a new direction and a new impetus." Leopold Kronecker to Hermann von Helmholtz
Investigating the structure of high dimensional pattern recognition problems

Statistical pattern recognition (classification, clustering, etc.) in high dimensions is notoriously difficult: the curse of dimensionality implies that enough data will never be available. Nevertheless, high dimensional pattern recognition applications such as hyperspectral target recognition and gene expression monitoring require methodologies for uncovering structure, generating hypotheses, and making decisions. This talk will discuss some of the challenges presented by high dimensional pattern recognition problems, and will introduce a statistical data mining methodology for investigating the structure of these problems. Applications from artificial olfactory analysis (the Tufts University artificial nose) and gene expression monitoring by DNA microarrays will be used to frame the discussion.
High dimensional pattern recognition problems: olfactory classification; gene expression analysis; multispectral imagery (mines & minefields); hyperspectral imagery (e.g., HyMap); functional brain imagery (e.g., NV vs. SZ); astronomy (e.g., Sloan Digital Sky Survey); face detection; financial data analysis; knowledge discovery from text
Gene expression monitoring by DNA microarrays
"DNA microarrays consist of a library of genes immobilized in a grid, usually on a glass slide. Each individual spot in the grid contains DNA from a single gene that will bind to the messenger RNA (mRNA) produced by the gene concerned. So by liquidizing a sample from a given tissue type, tagging its mRNAs with fluorescent dyes and then exposing the sample to the slide, it is possible to obtain an instant visual read-out revealing which genes were active." Jonathan Knight, Nature, Vol. 410, 19 April 2001
The WI/MIT CGR 1999 data set, produced by Affymetrix DNA microarrays, involves two general classes of leukemia: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia). Each observation is a patient, with n_ALL = 47, n_AML = 25; n = n_ALL + n_AML = 72. Each observation is a point in 7129-dimensional Euclidean space; there are 6817 unique human genes monitored, augmented with some redundant probes and control elements. Golub, Slonim, et al., Science, 1999.
Goals: classify ALL vs. AML; cluster; latent class discovery; dimension reduction. Problem: 72 observations in 7129 dimensions.
Tufts University artificial nose chemical sensor. White, Kauer, Dickinson, and Walt, Nature, 1996. Priebe, IEEE PAMI, 2001.

Vapor-sensing and pattern recognition with the Tufts University artificial nose chemical sensor. The plot in the cartoon represents sensor/analyte signatures for three sensors within the bundled nineteen-sensor array. Signature patterns of fluorescence changes vs. time are used for subsequent analysis. Nature, 382: 697-700 (1996).

Observation Chloroform 07 02 (Clfm0702).

Olfactory Classification. Goal: detection of a distinguished analyte (trichloroethylene (TCE)) at various concentrations in the presence of multiple confounders. Problem: multivariate function-valued data; no parametric model.

Statistical Data Mining in the sense of Edward J. Wegman: "Data mining is an extension of exploratory data analysis and has basically the same goals: the discovery of unknown and unanticipated structure in the data. The chief distinction between the two topics resides in the size and dimensionality of the data sets involved. Data mining in general deals with much more massive data sets for which highly interactive analysis is not fully feasible."
The curse of dimensionality: nonparametric density estimation.

Choose n such that the relative mean squared error at 0 is small: n(d) = min{n : E[(f̂_n(0; d) − φ(0; d))²] / φ(0; d)² ≤ 0.1}.

dimension d:       1    2    3    4    5    6     7      8      9       10
sample size n(d):  4    19   67   223  768  2790  10700  43700  187000  842000

Silverman, 1986
The curse of dimensionality: statistical pattern recognition.

Consider class-conditional probability density functions f_j = Normal(µ_j, I_d), j = 0, 1, with equal priors. Let µ_0 = −µ_1 = [1, 2^{−1/2}, 3^{−1/2}, …, d^{−1/2}].

Case I: µ_0 known. The Bayes optimal rule is available and L(d) → 0 as d → ∞.

Case II: µ_0 unknown. µ_0 must be estimated from training data, and L_n(d) → 1/2 as d → ∞ for fixed n.

Trunk, 1979; Jain, et al., 2000
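Trunk's phenomenon is easy to reproduce in simulation. The sketch below is my own illustration, not code from the lecture: it uses the linear rule sign(x · µ̂), and, as a simplified stand-in for Case II, estimates µ_0 from a small class-0 training sample.

```python
import random

random.seed(0)

def trunk_error(d, n_train=None, n_test=1000):
    """Monte Carlo misclassification rate for Trunk's (1979) example (a sketch).

    Classes j = 0, 1 are Normal(mu_j, I_d) with mu_0 = -mu_1 and
    mu_0[i] = (i + 1) ** -0.5.  If n_train is None the true mean is used
    (Case I); otherwise mu_0 is estimated from n_train class-0 training
    points (a simplified stand-in for Case II).
    """
    mu = [(i + 1) ** -0.5 for i in range(d)]
    if n_train is None:
        mu_hat = mu
    else:
        train = [[random.gauss(m, 1.0) for m in mu] for _ in range(n_train)]
        mu_hat = [sum(col) / n_train for col in zip(*train)]
    errors = 0
    for _ in range(n_test):
        y = random.choice((0, 1))
        sign = 1.0 if y == 0 else -1.0
        x = [random.gauss(sign * m, 1.0) for m in mu]
        score = sum(a * b for a, b in zip(x, mu_hat))  # linear rule: x . mu_hat
        errors += (0 if score > 0 else 1) != y
    return errors / n_test
```

With the mean known, the error shrinks as d grows (since ||µ_0||² diverges with d); with the mean estimated from a fixed-size sample, the error climbs back toward 1/2.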
PEANUTS by Charles Schulz
Statistical Data Mining in the sense of Brian Ripley: "Data mining, also known as knowledge discovery in databases, is one of many terms for finding structure in large-scale datasets on the boundaries of statistics, engineering, machine learning and computer science. Statistical data mining concentrates on methods for finding the structure (as distinct from manipulating the databases)."
Computational Statistics: A New Agenda for Statistical Theory and Practice: high dimensional data; computationally intensive methodologies; imprecise questions; weak assumptions; nonlinear error structures; distribution free models. Edward J. Wegman, 1988
In his 1992 book David W. Scott writes: "Fortunately, it appears that in practical situations, the dimension of the structure seldom exceeds 4 or 5."
Tufts University artificial nose chemical sensor
We choose to focus on methodologies which depend upon only the interpoint distances. To build a classifier for an unlabeled exemplar Z, we restrict attention to the information contained in the training sample interpoint distance matrix D = [d_{i,j}] = [d(X_i, X_j)] and the test vector D_Z = [d(Z, X_1), …, d(Z, X_n)]. Alas, the trick ("then first, get a million dollars") is to pick a good distance.
The trick then... is to pick a good distance. Integrated Sensing and Processing (ISP), DARPA ACMP: sensor/processor parameter selection (e.g., dimensionality selection) based on explicit optimization of the exploitation objective (in the case of supervised classification, the probability of misclassification L).
Class Cover Catch Digraphs: a statistical data mining methodology for investigating the structure of high dimensional pattern recognition problems.
Class Cover Catch Digraphs: class-conditional data X_i | Y_i = j ∼ f_j
Class Cover Catch Digraphs: for X_i s.t. Y_i = j, define the ball B_i = {x : d(X_i, x) < r_{X_i} := min_{X_k : Y_k = 1−j} d(X_i, X_k)}
Class Cover Catch Digraphs: V_j = {X_i : Y_i = j}; for i_1 ≠ i_2, (X_{i_1}, X_{i_2}) ∈ A_j ⇔ X_{i_2} ∈ B_{i_1}
Class Cover Catch Digraphs: D_j = (V_j, A_j)
Class Cover Catch Digraphs: choose a (minimum) dominating set Ŝ_j for D_j
Class Cover Catch Digraphs: consider the balls {B_i : X_i ∈ Ŝ_j}

Class Cover Catch Digraphs: go and do likewise for class 1 − j
Class Cover Catch Digraphs: g(z) = arg min_j min_{X_i ∈ Ŝ_j} (d(z, X_i) / r_{X_i})^{T_{X_i}}
Class Cover Catch Digraphs. Monte Carlo results: L̂(nearest neighbor) = 0.123; L̂(CCCD) = 0.074; L(Bayes optimal) = 0.035. (L := P[g(X) ≠ Y] is the probability of misclassification.)
Theorem 1: Let Ŝ_j be dominating sets for CCCDs D_j and g(z) = arg min_j min_{X_i ∈ Ŝ_j} (d(z, X_i) / r_{X_i})^{T_{X_i}} with T_{X_i} ≥ 1 for all i. Then L̂_n^{(R)}(g) = n^{−1} Σ_{i=1}^n I{g(X_i) ≠ Y_i} = 0.
Theorem 2: Assume, in addition to the conditions of Theorem 1, that d is well-behaved (e.g., L_p) and the class-conditional distributions F_j are strictly separable. Then g is consistent. That is, L_n(g) → L := L(Bayes optimal).
Algorithmic extension: robustness (a) to contamination, (b) to outliers
α = 0, β = 0: L̂ ≈ 0.21. α = 10, β = 5: L̂ ≈ 0.16.
Coastal Battlefield Reconnaissance and Analysis (COBRA)
Coastal Battlefield Reconnaissance and Analysis (COBRA): Class Cover Catch Digraph with α = 1, β = 4
Coastal Battlefield Reconnaissance and Analysis (COBRA). [Scatter plot: band #3 vs. band #5, classes X and Y.] L̂^{(D)}(g) = n^{−1} Σ_{i=1}^n I{g^{(−i)}(X_i) ≠ Y_i} = 0.205.
Complexity Reduction: 1. CCCD & dominating set; 2. hierarchical complete linkage clustering on the sizes (radii) of the proximity regions: cluster(r_{X_i} : X_i ∈ Ŝ_j); 3. determine the scale dimension.
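Step 2 of the recipe can be sketched as follows (my own illustration, not the lecture's code). Because the radii are scalars, the closest pair of clusters under complete linkage is always adjacent once the values are sorted, so only adjacent merges need be considered.

```python
def cluster_radii(radii, k):
    """Complete-linkage agglomerative clustering of ball radii into k clusters."""
    clusters = [[r] for r in sorted(radii)]
    while len(clusters) > k:
        # Complete linkage: inter-cluster distance is the max pairwise distance.
        d = [max(abs(a - b) for a in clusters[i] for b in clusters[i + 1])
             for i in range(len(clusters) - 1)]
        i = d.index(min(d))
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    return clusters
```

Each resulting cluster groups dominating-set balls of comparable scale; the number of clusters retained feeds into the scale-dimension selection below.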
Define the scale dimension d* to be the cluster map dimension which minimizes a dimensionality-penalized misclassification rate: d*_δ := min{arg min_k (L̂_k + δk)} for some penalty coefficient δ ∈ [0, 1].
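The definition translates directly into code; a minimal sketch (my own), where L_hat[k−1] is the estimated misclassification rate using k dimensions:

```python
def scale_dimension(L_hat, delta):
    """d*_delta = min{ arg min_k (L_hat[k] + delta * k) }, k = 1..len(L_hat)."""
    penalized = [L + delta * k for k, L in enumerate(L_hat, start=1)]
    best = min(penalized)
    # "min" of the arg-min set: the smallest k attaining the minimum
    return next(k for k, p in enumerate(penalized, start=1) if p == best)
```

With δ = 0 this is just the arg min of the misclassification rates (smallest such k if there are ties); increasing δ trades a slightly higher error rate for fewer dimensions.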
Scale dimension for the artificial nose data: d̂* = 7. [Plot: misclassification rate vs. dimension, 1 to 20.]
Return now to our example: Tufts University artificial nose chemical sensor. Data: 80 observations of TCE + Chloroform in Air (at various concentrations); 40 observations of Chloroform in Air (at various concentrations).
Gene expression monitoring by DNA microarrays
Gene expression monitoring by DNA microarrays: ALL vs. AML
Gene expression monitoring by DNA microarrays: B-cell ALL vs. T-cell ALL vs. AML
H_0: median(F_{d(X, AML)} for X B-cell ALL) = median(F_{d(X, AML)} for X T-cell ALL). p-value = 0.0051 (exact Wilcoxon rank-sum test).
[Scatter plot: gene #2275 vs. gene #6345, classes X and Y.] CCCD classification on DNA microarray data: L̂ = 0.069.
Dimension Reduction: Principal Components. [Scree plots for the DNA microarray data: fraction of variance explained and cumulative variance explained vs. principal component, 1 to 72.]
Dimension Reduction: Principal Components. [Plot: L̂ for classification using the first k principal components, k = 0 to 72.]
Dimension Reduction: ISP. (i*_1, …, i*_k) = arg inf_{1 ≤ k ≤ d} inf_{(i_1, …, i_k) ⊂ {1, …, d}} L(g(·; i_1, …, i_k)). (Note: C(7129, 10) ≈ 10^32.)
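The combinatorial count in the note is easy to verify with the standard library (my check, not part of the lecture):

```python
import math

# Number of ways to choose k = 10 of the d = 7129 genes
n_subsets = math.comb(7129, 10)
print(f"C(7129, 10) = {n_subsets:.3e}")  # on the order of 10^32
```

This is why the inner infimum over gene subsets cannot be taken by exhaustive search, and heuristic or greedy selection is required.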
Investigating the structure of high dimensional pattern recognition problems. Acknowledgments: Lenore Cowen, Jingdong Xie, Adam Cannon, Jason DeVinney, Diego Socolinsky, David Marchette, Jeff Solka, Dennis Healy, Anna Tsao, Wendy Martinez, Ed Wegman.