Knowledge Discovery with Iterative Denoising

1 Knowledge Discovery with Iterative Denoising
Assistant Professor, Department of Statistics and Operations Research, Virginia Commonwealth University
Associate Staff Scientist, Human Language Technology Center for Excellence (HLTCOE), Johns Hopkins University

2 What is Data?
Object $x_i$ has q measurements: $x_i = (x_{i1}, x_{i2}, \ldots, x_{iq})^T \in \mathbb{R}^q$
* All n objects in the dataset can be expressed as an n × q data matrix
* known as the vector-space model
Examples:
1.) Text Mining: n documents, q weights or scores for particular words or phrases
2.) Image Analysis: n images, q pixel color or intensity values
3.) Computer Network Traffic: n application or protocol flows, q network traffic counts or scores
4.) DNA Expression Microarrays: n genes (nucleotide sequences), q cell samples
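To make the vector-space model concrete, here is a minimal sketch for the text-mining case. It assumes raw word counts as the q scores and whitespace tokenization; the function name `term_document_matrix` and the three toy documents are illustrative and do not come from the slides.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    # The q columns are the distinct words seen anywhere in the corpus.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Toy corpus (assumption): n = 3 documents.
docs = ["stars and galaxies", "galaxies and dark matter", "proofs and primes"]
vocab, X = term_document_matrix(docs)
n, q = len(X), len(X[0])
print(n, q)   # 3 7
print(vocab)
print(X[0])
```

Even in this tiny corpus q exceeds n, which is the q >> n situation the later slides are concerned with.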

3 Growth of (Complex) Data
Amount of Data is Increasing
* Figure: annotated collection of all available DNA sequences.
* PubMed: abstracts for 12 million research papers on life sciences topics. 40,000 new biomedical abstracts are added every month.
* Similar growth of data in other fields: computer network traffic, text documents, images.
Complexity of Data is Increasing
* n observations, q features/measurements/dimensions
* Data that we need to analyze is high-dimensional: q large
* In complex data, often q >> n

4 Concerns with High-Dimensional Data
High-dimensionality causes analysis algorithm performance problems
* scaling
* curse of dimensionality
* if q >> n, overfitting issues
Dimensionality Reduction!
* Most feature selection is done univariately, but complex data is multivariate - a variable that is useless by itself can provide a significant performance improvement when taken with others
Common procedure is to decouple feature selection from classification algorithm
* but concern is better classification performance, not necessarily what features are important
* difficult to do with high-dimensional data
* especially for new datasets, there is some concern about the loss of useful information when throwing away features
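The point that a variable can be useless by itself yet valuable together with others can be illustrated with a toy XOR dataset. This is a sketch under stated assumptions: the data, the function names, and the "best rule per feature" scoring are illustrative and not taken from the slides.

```python
from itertools import product

# Four points with binary features (x1, x2); the label is x1 XOR x2.
points = [((x1, x2), x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]


def best_single_feature_accuracy(data, j):
    """Best accuracy of any rule that predicts the label from feature j alone."""
    best = 0.0
    # mapping[v] is the label predicted when feature j takes value v.
    for mapping in product([0, 1], repeat=2):
        correct = sum(mapping[x[j]] == y for x, y in data)
        best = max(best, correct / len(data))
    return best


def joint_accuracy(data):
    """Accuracy of a rule that looks at both features together (here, XOR)."""
    return sum((x[0] ^ x[1]) == y for x, y in data) / len(data)


print(best_single_feature_accuracy(points, 0))  # 0.5
print(best_single_feature_accuracy(points, 1))  # 0.5
print(joint_accuracy(points))                   # 1.0
```

A univariate filter would discard both features here, since each one alone does no better than chance, although the pair determines the label exactly.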

5 Many (Statistical) Approaches
Class-conditional densities:
* known - Bayes Decision Theory
* unknown
  - supervised
  - unsupervised
  - parametric and nonparametric

6 Example 1: Find SPAM 6

7 Example 1: Find SPAM Supervised learning 7

8 Example 2: Find EVIL-DOERS Enron Figure: The New York Times Carey Priebe and Youngser Park, Johns Hopkins University 8

9 Example 2: Find EVIL-DOERS Enron Figure: The New York Times Carey Priebe and Youngser Park, Johns Hopkins University Unsupervised learning 9

10 Implementation Issues
* large databases
* distance
* high dimensionality
* overfitting
* missing/noisy data
* local structure
* user involvement

11 Desired Classification Methodology
Attributes:
1. Unsupervised
2. Nonparametric
3. Scalable (large n)
4. Curse-aware (high-dimensional + sparse)
5. Intuitive visualization interface
6. Framework flexible
7. Useful for data analyst
8. Data domain agnostic
9. Distinguish global similarity from local dissimilarity

12 The Iterative Denoising Methodology
In a nutshell: Process a set of high-dimensional data; perform a local structure-preserving projection into a low-dimensional space; provide a visualization and interaction interface; partition and iteratively denoise.
Motivated by:
* Priebe, Marchette, Healy, 2004, Integrated Sensing and Processing Decision Trees, IEEE PAMI.
* Priebe, et al., 2004, Iterative Denoising for Cross-Corpus Discovery, COMPSTAT.

13 The Concept: An Iterative Denoising Tree
[Figure: an iterative denoising tree]

14 Iterative Denoising Framework 14

15 Denoising Detail
* A proximity metric
* A low dim space
* Clusters
Nonlinear version uses Laplacian Eigenmaps

16 Laplacian Eigenmaps
Nonlinear dimensionality reduction technique that distorts geometry in such a way that enhances some types of clustering
* $L = D - A$ is large, sparse
* $L$ symmetric, positive semi-definite
* $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_d$
* Corresponding d eigenvectors --> Fiedler Space
* Eigenvectors corresponding to two smallest non-zero eigenvalues --> visualization
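A minimal sketch of the spectral step on a toy graph, assuming A is a symmetric 0/1 adjacency matrix with a zero diagonal, D is the diagonal matrix of node degrees, and the graph is connected. It uses the unnormalized Laplacian exactly as written on the slide. Shifted power iteration is swapped in here only to keep the example dependency-free; for a large sparse L a sparse eigensolver would be the usual choice. The function names and the six-node graph are illustrative.

```python
import math


def laplacian(adjacency):
    """Graph Laplacian L = D - A for a symmetric 0/1 adjacency matrix."""
    n = len(adjacency)
    return [[(sum(adjacency[i]) if i == j else 0) - adjacency[i][j]
             for j in range(n)] for i in range(n)]


def fiedler_vector(adjacency, iterations=500):
    """Eigenvector of L for its smallest non-zero eigenvalue (connected graph)."""
    n = len(adjacency)
    L = laplacian(adjacency)
    # All eigenvalues of L are at most twice the largest degree, so
    # (shift * I - L) is positive definite and its dominant eigenvectors
    # are the eigenvectors of L with the smallest eigenvalues.
    shift = 2 * max(L[i][i] for i in range(n)) + 1
    x = [float(i) for i in range(n)]  # fixed, non-constant start vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]     # project out the constant (eigenvalue 0) vector
        x = [shift * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    return x


# Toy graph (assumption): two triangles {0,1,2} and {3,4,5} joined by edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
f = fiedler_vector(A)
print([round(v, 3) for v in f])
```

The sign pattern of `f` separates the two triangles, which is the kind of cluster-enhancing distortion the slide describes. Repeating the procedure while also projecting out the vectors already found would give further coordinates of the Fiedler Space.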

17 Great, but does it work? 17

18 A Text Document Example: a Science News Corpus n = 1047 documents q = words (ngrams) 18

19 Science News corpus: a clustering hierarchy 1. Anthropology 2. Astronomy 3. Behavioral Sciences 4. Earth Sciences 5. Life Sciences 6. Math & CS 7. Medicine 8. Physics 19

21 Science News corpus, Fiedler Space projection Anthropology: yellow Astronomy: black Behavioral Sciences: magenta Earth Sciences: lightgray Life Sciences: orange Math & CS: red Medicine: green Physics: blue 21

25 Interesting Grouping
1. Math enthusiast wins Science Talent Search
2. Message in DNA tops Science Talent Search
3. Chinks in Digital Armor: exploiting faults to break smart-card cryptosystems
4. Motor City hosts top science fair winners
5. Science Talent Search winners shine bright
6. Neutrinos to buckyballs: 10 talents tower
7. Logic in the Blocks: simple puzzles can give computers an unexpected workout
8. How to trick other people's computers into solving your math problems
Anthropology: yellow Astronomy: black Behavioral Sciences: magenta Earth Sciences: lightgray Life Sciences: orange Math & CS: red Medicine: green Physics: blue

26 Stopping Criteria Variety of methods can be used for stopping the tree build: * max tree height * min observations/node * (supervised): min node purity * (supervised): entropy or divergence Here, we used min observations/node = 10d/k 26
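The unsupervised stopping rules above (max tree height, min observations/node) can be sketched as a recursive tree build. This is a sketch under stated assumptions: the slide does not define d and k in 10d/k, so `min_obs` is passed in as a plain number, and `median_split` is a hypothetical stand-in for the embed-then-cluster step of the actual methodology.

```python
def build_tree(items, split, max_height=5, min_obs=10, depth=0):
    """Recursively partition items, stopping on tree height or small nodes."""
    node = {"items": items, "children": []}
    if depth >= max_height:
        return node  # stopping rule: max tree height reached
    parts = [part for part in split(items) if part]
    if len(parts) < 2 or any(len(part) < min_obs for part in parts):
        return node  # stopping rule: a child would fall below min observations/node
    node["children"] = [
        build_tree(part, split, max_height, min_obs, depth + 1) for part in parts
    ]
    return node


def median_split(values):
    """Toy stand-in (assumption) for 'embed, then cluster': halve at the median."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    return [ordered[:mid], ordered[mid:]]


def leaves(node):
    """Collect the item lists held by the leaf nodes, left to right."""
    if not node["children"]:
        return [node["items"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]


tree = build_tree(list(range(40)), median_split, max_height=3, min_obs=10)
print([len(leaf) for leaf in leaves(tree)])  # [10, 10, 10, 10]
```

With 40 observations and `min_obs=10`, the build stops after two levels because any further split would create nodes with five observations.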

27 Semi-supervised Classification Performance 1.) Embed using all observations (labeled + unlabeled) 2.) Label Iterative Denoising Tree leaves using only labeled observations 3.) Calculate performance based on unlabeled observations, 10-fold cross validation See: M.W. Trosset, C.E. Priebe, Y. Park, and M.I. Miller, "Semisupervised Learning from Dissimilarity Data," Computational Statistics and Data Analysis, accepted for publication, February,
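Steps 2 and 3 of the evaluation can be sketched as follows, assuming each leaf takes the majority label of its labeled observations (the slide does not state the labeling rule) and that leaf membership is already known from the tree. The names `label_leaves` and `error_rate` and the toy data are illustrative; the 10-fold cross-validation loop is omitted.

```python
from collections import Counter


def label_leaves(leaf_of, known_labels):
    """Majority label per leaf, using only the labeled observations."""
    votes = {}
    for obs, label in known_labels.items():
        votes.setdefault(leaf_of[obs], Counter())[label] += 1
    return {leaf: counts.most_common(1)[0][0] for leaf, counts in votes.items()}


def error_rate(leaf_of, leaf_labels, held_out):
    """Fraction of held-out observations whose leaf label differs from the truth."""
    wrong = sum(
        leaf_labels.get(leaf_of[obs]) != truth for obs, truth in held_out.items()
    )
    return wrong / len(held_out)


# Toy data (assumption): six observations that fall into two leaves, A and B.
leaf_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
labeled = {0: "astronomy", 1: "astronomy", 3: "math"}
unlabeled_truth = {2: "astronomy", 4: "math", 5: "astronomy"}

leaf_labels = label_leaves(leaf_of, labeled)
err = error_rate(leaf_of, leaf_labels, unlabeled_truth)
print(leaf_labels)
print(err)
```

Observation 5 sits in the leaf labeled "math" but is truly "astronomy", so one of the three held-out observations is misclassified. A leaf with no labeled observations gets no label, and every held-out observation in it counts as an error.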

28 4-Class Science News: Iterative Denoising Astronomy: black Math & CS: red Medicine: green Physics: blue n = 579 documents, q = unique monograms 28

29 4-Class Science News Astronomy: black Math & CS: red Medicine: green Physics: blue 29

30 20-Newsgroups
* Collection of newsgroup documents partitioned into 20 different newsgroups.
* As a simple test, we chose three roughly disparate groups, and two groups on the same topic but with differing viewpoints.
1.) rec.autos
2.) talk.politics.misc
3.) comp.graphics
4.) soc.religion.christian
5.) alt.atheism
n = 2722 documents, with unique monograms
* NB: we did not do typical text pre-processing tasks such as stop-word removal, removal of infrequent words, etc.

31 20-Newsgroups: no natural class separation Image from FINE: Fisher Information Non-parametric Embedding by Kevin M. Carter, Raviv Raich, William G. Finn, Alfred O. Hero 31

32 20-Newsgroups: Iterative Denoising
K_knn = 20, m = 2, dim = 9: error = , sd =
Legend colors (in order): green, black, red, gray, blue
Groups (in order): rec.autos, talk.politics.misc, soc.religion.christian, comp.graphics, alt.atheism

33 20-Newsgroups
Legend colors (in order): green, black, red, gray, blue
Groups (in order): rec.autos, talk.politics.misc, soc.religion.christian, comp.graphics, alt.atheism

37 Global Similarity
"...It follows from this that God must give everyone the same revelation of truth, and thus anyone who comes to a different conclusion is intentionally choosing the wrong path..."
"...we will grant that a God exists, and uses revelation to communicate with humans..."
Legend colors (in order): green, black, red, gray, blue
Groups (in order): rec.autos, talk.politics.misc, soc.religion.christian, comp.graphics, alt.atheism

38 20-Newsgroups: Local Dissimilarity
"...What I see as arrogance and the problem I have with it is not a sense of personal certainty, but a lack of respect for others who come to differing conclusions..."
"...now, based on this, can this person be blamed for concluding, absent a personal revelation of their own, that there is almost certainly nothing to this 'revelation' thing?..."

39 Summary
Iterative Denoising Methodology
* Feature selection inherent in methodology
* Scalable and tailored for high-dimensionality
* Ability to visualize relationships in complex data
* Enhances global similarities and local dissimilarities
* Good performance on text datasets
* Also good performance on intrusion detection dataset
* Preliminary testing on backscatter, biological data, images
Attributes:
1. Unsupervised
2. Nonparametric
3. Scalable (large n)
4. Curse-aware (high-dimensional + sparse)
5. Intuitive visualization interface
6. Framework flexible
7. Useful for data analyst
8. Data domain agnostic
9. Distinguish global similarity from local dissimilarity

40 Also: Iterative Denoising,, Michael Trosset, David Marchette, Carey Priebe. Computational Statistics, Accepted for Publication, September
Collaborators:
* Dr. Carey Priebe, Department of Applied Mathematics and Statistics, Johns Hopkins University
* Dr. David Marchette, Senior Scientist, Naval Surface Warfare Center
* Dr. Michael Trosset, Department of Statistics, Indiana University
* Dr. Jeff Solka, Principal Scientist, Naval Surface Warfare Center
* Dr. Youngser Park, Senior Research Analyst, Johns Hopkins University

41 "We are drowning in information and starving for knowledge." -- Rutherford D. Rogers
