Investigating the structure of high dimensional pattern recognition problems


1 Investigating the structure of high dimensional pattern recognition problems Carey E. Priebe Department of Mathematical Sciences Whiting School of Engineering Johns Hopkins University, Baltimore, MD Inaugural Professorial Lecture November 29, 2001 "The wealth of your practical experience with sane and interesting problems will give to mathematics a new direction and a new impetus." Leopold Kronecker to Hermann von Helmholtz 1

2 Investigating the structure of high dimensional pattern recognition problems Statistical pattern recognition (classification, clustering, etc.) in high dimensions is notoriously difficult: the curse of dimensionality implies that enough data will never be available. Nevertheless, high dimensional pattern recognition applications such as hyperspectral target recognition and gene expression monitoring require methodologies for uncovering structure, generating hypotheses, and making decisions. This talk will discuss some of the challenges presented by high dimensional pattern recognition problems, and will introduce a statistical data mining methodology for investigating the structure of these problems. Applications from artificial olfactory analysis (the Tufts University "artificial nose") and gene expression monitoring by DNA microarrays will be used to frame the discussion. 2


7 High dimensional pattern recognition problems: olfactory classification; gene expression analysis; multispectral imagery: mines & minefields; hyperspectral imagery (e.g., HyMap); functional brain imagery (e.g., NV vs. SZ); astronomy (e.g., Sloan Digital Sky Survey); face detection; financial data analysis; knowledge discovery from text 7

8 Gene expression monitoring by DNA microarrays 8

9 DNA microarrays consist of a library of genes immobilized in a grid, usually on a glass slide. Each individual spot in the grid contains DNA from a single gene that will bind to the messenger RNA (mRNA) produced by the gene concerned. So by liquidizing a sample from a given tissue type, tagging its mRNAs with fluorescent dyes and then exposing the sample to the slide, it is possible to obtain an instant visual read-out revealing which genes were active. Jonathan Knight, Nature, Vol. 410, 19 April 2001. 9

10 [microarray image: ALL 1] The WI/MIT CGR 1999 data set, produced by Affymetrix DNA microarrays, involves two general classes of leukemia: ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia). Each observation is a patient, with n_ALL = 47, n_AML = 25; n = n_ALL + n_AML = 72. Each observation is a point in 7129-dimensional Euclidean space; there are 6817 unique human genes monitored, augmented with some redundant probes and control elements. Golub, Slonim, et al., Science, 1999. 10

11 [microarray images: unlabelled sample; ALL 1?] 11

12 [microarray images: ALL 2, ALL 1, AML 1] 12

13 [microarray image: ALL 1] Goals: classify ALL vs. AML; cluster; latent class discovery; dimension reduction. Problem: 72 observations in 7129 dimensions. 13

14 Tufts University artificial nose chemical sensor White, Kauer, Dickinson, and Walt, Nature, 1996. Priebe, IEEE PAMI, 2001. 14

15 Vapor-sensing and pattern recognition with the Tufts University artificial nose chemical sensor. The plot in the cartoon represents sensor/analyte signatures for three sensors within the bundled nineteen-sensor array. Signature patterns of fluorescence changes vs. time are used for subsequent analysis. Nature, 382 (1996). 15

16 Observation Chloroform (Clfm0702). 16

17 Olfactory Classification Goal: Detection of distinguished analyte (Trichloroethylene (TCE)) at various concentrations in the presence of multiple confounders. Problem: Multivariate function-valued data; no parametric model. 17

18 Statistical Data Mining in the sense of Edward J. Wegman: Data Mining is an extension of exploratory data analysis and has basically the same goals: the discovery of unknown and unanticipated structure in the data. The chief distinction between the two topics resides in the size and dimensionality of the data sets involved. Data mining in general deals with much more massive data sets for which highly interactive analysis is not fully feasible. 18

19 The curse of dimensionality: nonparametric density estimation. [plot: required sample size vs. dimension] Choose n such that the relative mean squared error at 0 is small: n(d) = min{ n : E[(f̂_n(0; d) − ϕ(0; d))² / ϕ(0; d)²] ≤ 0.1 }. Silverman, 1986. 19

20 The curse of dimensionality: statistical pattern recognition. Consider class-conditional probability density functions f_j = Normal(µ_j, I_d), j = 0, 1, with equal priors. Let µ_0 = −µ_1 = [1, 2^(−1/2), 3^(−1/2), …, d^(−1/2)]. Case I: µ_0 known. The Bayes optimal rule is available and L(d) → 0 as d → ∞. Case II: µ_0 unknown. µ_0 must be estimated from training data, and L_n(d) → 1/2 as d → ∞ for fixed n. Trunk, 1979; Jain et al., 2000. 20
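A minimal Monte Carlo sketch of Trunk's phenomenon (an illustration under assumed settings, not part of the original lecture): with class means estimated from a fixed number of training observations, the error of the plug-in linear rule climbs toward 1/2 as the dimension grows, even though the Bayes error tends to 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def trunk_error(d, n_train=20, n_test=2000):
    """Estimated misclassification rate in dimension d for Trunk's problem,
    using a linear rule built from estimated class means (Case II)."""
    mu = 1.0 / np.sqrt(np.arange(1, d + 1))          # class-0 mean; class 1 has mean -mu
    X0 = rng.normal(mu, 1.0, size=(n_train, d))      # training data, class 0
    X1 = rng.normal(-mu, 1.0, size=(n_train, d))     # training data, class 1
    mu0_hat, mu1_hat = X0.mean(axis=0), X1.mean(axis=0)
    w, mid = mu0_hat - mu1_hat, (mu0_hat + mu1_hat) / 2.0
    T0 = rng.normal(mu, 1.0, size=(n_test // 2, d))  # test data, class 0
    T1 = rng.normal(-mu, 1.0, size=(n_test // 2, d))
    err0 = np.mean((T0 - mid) @ w < 0)               # class-0 points called class 1
    err1 = np.mean((T1 - mid) @ w >= 0)              # class-1 points called class 0
    return (err0 + err1) / 2.0

for d in (1, 10, 100, 1000, 10000):
    print(d, trunk_error(d))   # error approaches 1/2 as d grows, n fixed
```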

21 PEANUTS by Charles Schulz 21

22 PEANUTS by Charles Schulz 22

23 PEANUTS by Charles Schulz 23

24 Statistical Data Mining in the sense of Brian Ripley: Data mining, also known as knowledge discovery in databases, is one of many terms for finding structure in large-scale datasets on the boundaries of statistics, engineering, machine learning and computer science. Statistical data mining concentrates on methods for finding the structure (as distinct from manipulating the databases). 24


26 Computational Statistics: A New Agenda for Statistical Theory and Practice: high dimensional data; computationally intensive methodologies; imprecise questions; weak assumptions; nonlinear error structures; distribution-free models. Edward J. Wegman, 1988. 26


29 In his 1992 book David W. Scott writes: Fortunately, it appears that in practical situations, the dimension of the structure seldom exceeds 4 or 5. 29

30 Tufts University artificial nose chemical sensor 30


32 We choose to focus on methodologies which depend upon only the interpoint distances. To build a classifier for an unlabelled exemplar Z, we restrict attention to information contained in the training sample interpoint distance matrix D = [d_ij = d(X_i, X_j)] and the test vector D_Z = [d(Z, X_1), …, d(Z, X_n)]. Alas, the trick then ("first, get a million dollars") is to pick a good distance. 32
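As a concrete illustration of this interpoint-distance restriction (a sketch assuming Euclidean distance and NumPy/SciPy; the array names and sizes are illustrative only):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
X = rng.normal(size=(72, 7129))   # training sample X_1, ..., X_n (illustrative)
Z = rng.normal(size=(7129,))      # an unlabelled exemplar

D = cdist(X, X)                   # D = [d_ij = d(X_i, X_j)], the n x n matrix
D_Z = cdist(Z[None, :], X)[0]     # D_Z = [d(Z, X_1), ..., d(Z, X_n)]
```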

33 The trick then... is to pick a good distance. Integrated Sensing and Processing (DARPA ACMP ISP): sensor/processor parameter selection (e.g., dimensionality selection) based on explicit optimization of the exploitation objective (in the case of supervised classification, the probability of misclassification L). 33

34 Class Cover Catch Digraphs A statistical data mining methodology for investigating the structure of high dimensional pattern recognition problems. 34

35 Class Cover Catch Digraphs. Class-conditional data: X_i | Y_i = j ∼ f_j. 35

36 Class Cover Catch Digraphs. For X_i s.t. Y_i = j, B_i = {x : d(X_i, x) < r_{X_i} := min_{X_k : Y_k = 1−j} d(X_i, X_k)}. 36

37 Class Cover Catch Digraphs. V_j = {X_i : Y_i = j}. For i_1 ≠ i_2, (X_{i_1}, X_{i_2}) ∈ A_j ⇔ X_{i_2} ∈ B_{i_1}. 37

38 Class Cover Catch Digraphs. D_j = (V_j, A_j). 38

39 Class Cover Catch Digraphs. Choose a (minimum) dominating set Ŝ_j for D_j. 39

40 Class Cover Catch Digraphs. Consider {B_i : X_i ∈ Ŝ_j}. 40

41 Class Cover Catch Digraphs. Go and do likewise for class 1 − j. 41

42 Class Cover Catch Digraphs. g(z) = arg min_j min_{X_i ∈ Ŝ_j} (d(z, X_i)/r_{X_i})^{T_{X_i}}. 42

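Pulling the preceding slides together, here is a compact sketch of the CCCD classifier: balls B_i with radii r_{X_i}, a greedy approximate minimum dominating set for each class, and the scaled-distance rule g(z). Euclidean distance and T_{X_i} = 1 are assumptions for illustration, not the lecture's choices.

```python
import numpy as np
from scipy.spatial.distance import cdist

def cccd_fit(X, y):
    """Fit class cover catch digraphs for a two-class sample (y in {0, 1}).
    Returns, per class, the greedy (approximately minimum) dominating-set
    centers and the corresponding ball radii r_{X_i}. Assumes the classes
    are separable, so every r_{X_i} > 0."""
    D = cdist(X, X)
    model = {}
    for j in (0, 1):
        own = np.flatnonzero(y == j)
        other = np.flatnonzero(y == 1 - j)
        r = D[own][:, other].min(axis=1)        # r_{X_i} = min dist to other class
        # catch digraph among class-j points: i1 -> i2 iff X_{i2} is in B_{i1}
        covers = D[np.ix_(own, own)] < r[:, None]
        S, uncovered = [], np.ones(len(own), dtype=bool)
        while uncovered.any():                  # greedy dominating set
            k = (covers & uncovered).sum(axis=1).argmax()
            S.append(k)
            uncovered &= ~covers[k]
        model[j] = (X[own[S]], r[S])
    return model

def cccd_classify(model, Z):
    """g(z) = argmin_j min_{X_i in S_j} d(z, X_i) / r_{X_i}  (T_{X_i} = 1).
    Z is an (m, d) array of exemplars; returns m predicted labels."""
    scores = []
    for j in (0, 1):
        C, r = model[j]
        scores.append((cdist(Z, C) / r).min(axis=1))
    return np.argmin(np.stack(scores), axis=0)
```

By construction the dominating sets cover the training data, so this sketch attains the zero resubstitution error of Theorem 1 below.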

44 Class Cover Catch Digraphs. Monte Carlo results: L̂(nearest neighbor) = …, L̂(cccd) = …, L(Bayes optimal) = …. (L := P[g(X) ≠ Y] is the probability of misclassification.) 44

45 Theorem 1: Let Ŝ_j be dominating sets for the cccds D_j and g(z) = arg min_j min_{X_i ∈ Ŝ_j} (d(z, X_i)/r_{X_i})^{T_{X_i}} with T_{X_i} ≥ 1 for all i. Then L̂_n^{(R)}(g) = n⁻¹ Σ_{i=1}^n I{g(X_i) ≠ Y_i} = 0. 45

46 Theorem 2: Assume, in addition to the conditions of Theorem 1, that d is well-behaved (e.g., an L_p metric) and the class-conditional distributions F_j are strictly separable. Then g is consistent. That is, L_n(g) → L* := L(Bayes optimal). 46

47 Algorithmic extension: robustness (a) to contamination (b) to outliers 47


49 [figures: α = 0, β = 0 (L̂ = 0.21) vs. α = 10, β = 5 (L̂ = …)] 49

50 Coastal Battlefield Reconnaissance and Analysis (COBRA) 50


54 Coastal Battlefield Reconnaissance and Analysis (COBRA). Class Cover Catch Digraph with α = 1, β = 4. 54

55 Coastal Battlefield Reconnaissance and Analysis (COBRA). [scatter plot: band # 5 vs. band # 3] 55

56 Coastal Battlefield Reconnaissance and Analysis (COBRA). [scatter plot: band # 5 vs. band # 3] Leave-one-out estimate: L̂^{(D)}(g) = n⁻¹ Σ_{i=1}^n I{g^{(−i)}(X_i) ≠ Y_i} = …. 56

57 Complexity Reduction: 1. cccd & dominating set; 2. hierarchical complete linkage clustering on the size (radii) of the proximity regions: cluster(r_{X_i} : X_i ∈ Ŝ_j); 3. determine scale dimension. 57
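A small sketch of step 2, assuming SciPy's hierarchical clustering; the radii below are illustrative stand-ins for {r_{X_i} : X_i ∈ Ŝ_j}, not the nose data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# illustrative dominating-set radii (the output of step 1)
r = np.array([0.02, 0.03, 0.05, 0.4, 0.6, 5.0, 7.0])

Z = linkage(r[:, None], method="complete")        # complete linkage on the radii
labels = fcluster(Z, t=3, criterion="maxclust")   # cut into k = 3 scale clusters
print(labels)                                     # balls grouped by scale
```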

58 58

59 Define the scale dimension d* to be the cluster map dimension which minimizes a dimensionality-penalized misclassification rate: d*_δ := min{ arg min_k (L_k + δk) } for some penalty coefficient δ ∈ [0, 1]. 59
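In code the definition reads as below (a sketch; the rates L_k are illustrative, not the lecture's values):

```python
import numpy as np

def scale_dimension(L, delta):
    """d*_delta = min{ argmin_k (L_k + delta * k) }, where L[k-1] is the
    misclassification rate of the k-dimensional cluster map."""
    k = np.arange(1, len(L) + 1)
    penalized = np.asarray(L) + delta * k
    return int(k[penalized == penalized.min()].min())

L = [0.40, 0.30, 0.22, 0.18, 0.15, 0.14, 0.12, 0.12, 0.12, 0.12]
print(scale_dimension(L, delta=0.005))   # -> 7 for these illustrative rates
```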

60 Scale Dimension d̂* = 7. [plot: misclassification rate vs. dimension] Scale dimension for artificial nose data: d* = 7. 60

61 Scale Dimension d̂* = 7. 61

62 Return now to our example: Tufts University artificial nose chemical sensor. Data: 80 observations of TCE + Chloroform in Air (at various concentrations); 40 observations of Chloroform in Air (at various concentrations). 62


65 Gene expression monitoring by DNA microarrays 65


67 Gene expression monitoring by DNA microarrays ALL vs. AML 67

68 [plot: T-cell samples marked "T"] Gene expression monitoring by DNA microarrays: B-cell ALL vs. T-cell ALL vs. AML. 68

69 [plot: distances d(X_i, AML), T-cell samples marked "T"] H_0: median(F_{d(X,AML)} for X B-cell ALL) = median(F_{d(X,AML)} for X T-cell ALL). p-value = … (exact Wilcoxon rank-sum test). 69
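The test on this slide can be reproduced in outline with SciPy's exact Mann-Whitney/Wilcoxon rank-sum test; the two samples below are random stand-ins for the distances d(X_i, AML), not the actual microarray values:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)
d_bcell = rng.normal(50.0, 5.0, size=38)   # stand-in: B-cell ALL distances to AML
d_tcell = rng.normal(60.0, 5.0, size=9)    # stand-in: T-cell ALL distances to AML

res = mannwhitneyu(d_bcell, d_tcell, alternative="two-sided", method="exact")
print(res.pvalue)                          # exact rank-sum p-value
```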

70 [scatter plot: gene # 2275 vs. gene # 6345] cccd classification on DNA microarray data: L̂ = …. 70

71 Dimension Reduction: Principal Components. [scree plots for the DNA microarray data: fraction of variance explained and cumulative variance explained vs. principal component] 71

72 Dimension Reduction: Principal Components. [plot: L̂ for classification on the first k principal components, vs. k] 72
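A hedged sketch of this dimension-reduction step with scikit-learn's PCA, on random stand-in data of the same shape as the expression matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(72, 7129))       # stand-in for the 72 x 7129 expression matrix

pca = PCA(n_components=71)            # at most n - 1 informative components
scores = pca.fit_transform(X)         # observations in principal-component space

frac = pca.explained_variance_ratio_  # scree: fraction of variance explained
cumulative = np.cumsum(frac)          # scree: cumulative variance explained

k = 10
X_k = scores[:, :k]                   # first k components, e.g. as cccd input
```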

73 Dimension Reduction: ISP. [microarray image: ALL 1] (i*_1, …, i*_k) = arg inf_{1 ≤ k ≤ d} inf_{(i_1, …, i_k) ⊂ {1, …, d}} L(g(·; i_1, …, i_k)). (Note: the search is over (d choose k) candidate subsets.) 73

74 Investigating the structure of high dimensional pattern recognition problems. Lenore Cowen, Jingdong Xie, Adam Cannon, Jason DeVinney, Diego Socolinsky, David Marchette, Jeff Solka, Dennis Healy, Anna Tsao, Wendy Martinez, Ed Wegman. 74

