Adapted Feature Extraction and Its Applications

1. Adapted Feature Extraction and Its Applications
Naoki Saito
Department of Mathematics, University of California, Davis, CA
saito@math.ucdavis.edu
URL: saito/

2. Acknowledgment
- Ronald R. Coifman (Yale/FMAH)
- Fred Warner (Yale/FMAH)
- Frank Geshwind (Yale/FMAH)

3. Outline
- Problem Formulation
- A Dictionary/Library of Orthonormal Bases
- Local Discriminant Basis (LDB)
- Improved LDB with Empirical Probability Density Estimation
- Example 1: Synthetic Waveform Classification
- Example 2: Geophysical Acoustic Waveform Classification
- Conclusion
- Future Directions

4. A Strategy for Pattern Recognition
- Normalization of input patterns: scaling, translation, rotation
- Feature extraction and selection
- Classification: LDA, CART, k-NN, Neural Networks, ...
[Diagram: a signal x in X is mapped by the feature extractor f to f(x) in F, then by the classifier g to a label y in Y = {1, ..., C}]

5. Problem Formulation
Learning (or training): given a set of data $T = \{(x_i, y_i)\}_{i=1}^N \subset \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subset \mathbb{R}^n$ and $\mathcal{Y} = \{1, \ldots, K\}$ (class labels), find a map $d : \mathcal{X} \to \mathcal{Y}$ such that the empirical misclassification rate
$$R(d) = \frac{1}{N} \sum_{i=1}^{N} I\bigl(d(x_i) \neq y_i\bigr)$$
is small.
Prediction: apply the map $d$ obtained during training to a new dataset; then interpret the result and evaluate $d$.
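As a quick illustration of the empirical misclassification rate $R(d)$, here is a minimal sketch; the predictions and labels below are made up:

```python
# Empirical misclassification rate R(d) = (1/N) * sum of I(d(x_i) != y_i).
import numpy as np

def empirical_risk(d_of_x, y):
    """Fraction of training points that the map d misclassifies."""
    return float(np.mean(np.asarray(d_of_x) != np.asarray(y)))

print(empirical_risk([1, 2, 2, 3], [1, 2, 3, 3]))   # -> 0.25
```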

6. Problem Formulation ...
Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be a random sample from
$$P(X \in A, Y = y) = P(X \in A \mid Y = y)\, P(Y = y),$$
where $P(Y = y) = \pi_y = N_y / N$ in practice. Let us assume the class-conditional density $p(x \mid y)$ exists. The Bayes classifier (or rule) $d_B$ is
$$d_B(x) = \sum_k k\, I(x \in A_k), \qquad A_k = \bigl\{x \in \mathcal{X} : p(x \mid k)\, \pi_k = \max_{y \in \mathcal{Y}} p(x \mid y)\, \pi_y\bigr\}.$$
In other words, assign $x$ to class $k$ if $x \in A_k$. Note that $\bigcup_{y \in \mathcal{Y}} A_y = \mathcal{X}$.
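To make the rule concrete, here is a minimal sketch of $d_B$, assuming Gaussian class-conditional densities purely for illustration (the slides do not fix a density model, and all parameters below are hypothetical):

```python
# Bayes rule: assign x to the class k maximizing p(x|k) * pi_k.
import numpy as np
from scipy.stats import norm

def bayes_classify(x, priors, densities):
    """Return the class label (1..K) maximizing p(x|k) * pi_k."""
    scores = [pi * p(x) for pi, p in zip(priors, densities)]
    return int(np.argmax(scores)) + 1

# Two 1-D classes with equal priors (illustrative parameters):
priors = [0.5, 0.5]
densities = [lambda x: norm.pdf(x, loc=0.0, scale=1.0),
             lambda x: norm.pdf(x, loc=2.0, scale=1.0)]
print(bayes_classify(0.3, priors, densities))  # -> 1
print(bayes_classify(1.8, priors, densities))  # -> 2
```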

7. Problem Formulation ...
Difficulties of the Bayes classifier:
- We do not know $p(x \mid k)$, or
- $p(x \mid k)$ is too difficult to estimate computationally because of the high dimensionality of the problem (the curse of dimensionality).
=> We need dimensionality reduction without losing important information for classification.

8. Problem Formulation ...
Therefore, we restrict our attention to maps $d : \mathcal{X} \to \mathcal{Y}$ of the following form:
$$d = g \circ f = g \circ \Theta_m \circ \Psi^T,$$
where
- $f : \mathcal{X} \to \mathcal{F} \subset \mathbb{R}^m$ is called a feature extractor, consisting of:
  - $\Psi$: an $n$-dimensional orthogonal matrix selected from a dictionary or library of orthonormal bases;
  - $\Theta_m$: a selection rule that picks the most important $m$ ($\ll n$) coordinates out of the $n$ coordinates;
- $g : \mathcal{F} \to \mathcal{Y}$ is a conventional classifier, e.g., LDA, CART, ANN, etc.
A sketch of this composition follows below.
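A minimal sketch of $d = g \circ \Theta_m \circ \Psi^T$, assuming $\Psi$ is given as an orthogonal matrix, $\Theta_m$ keeps a fixed index set, and $g$ is an off-the-shelf LDA classifier; all concrete choices here (random basis, label rule, index set) are illustrative stand-ins:

```python
# Pipeline sketch: expand in the basis, select m coordinates, classify.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def feature_extractor(X, Psi, keep_idx):
    """f(x) = Theta_m(Psi^T x): expand each row of X, keep m coordinates."""
    return (X @ Psi)[:, keep_idx]

rng = np.random.default_rng(0)
n, N = 16, 200
Psi, _ = np.linalg.qr(rng.standard_normal((n, n)))   # stand-in orthogonal basis
X = rng.standard_normal((N, n))
y = (X @ Psi)[:, 0] > 0                              # labels depend on coordinate 0

keep_idx = [0, 1, 2, 3]   # Theta_m: indices chosen by a discriminant measure
g = LinearDiscriminantAnalysis().fit(feature_extractor(X, Psi, keep_idx), y)
print(g.score(feature_extractor(X, Psi, keep_idx), y))
```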

9. A Library of Orthonormal Bases
A library consists of dictionaries of orthonormal bases: each dictionary is a binary tree whose nodes are subspaces of $\Omega_{0,0} = \mathbb{R}^n$ with different time-frequency localization characteristics.
[Diagram: the binary tree of subspaces; $\Omega_{0,0}$ splits into $\Omega_{1,0}$ and $\Omega_{1,1}$, these into $\Omega_{2,0}, \ldots, \Omega_{2,3}$, and those into $\Omega_{3,0}, \ldots, \Omega_{3,7}$]

10. A Library of Orthonormal Bases ...
- Examples of dictionaries include wavelet packet bases, local trigonometric bases (LTB), and local Fourier bases (LFB).
- It costs $O(n[\log n]^p)$ to generate a dictionary for a signal of length $n$ ($p = 1$ for wavelet packets, $p = 2$ for LTB/LFB).
- Each dictionary may contain up to $n(1 + \log_2 n)$ basis vectors and more than $2^n$ possible orthonormal bases.
- How to select the best possible basis for the problem at hand is a key issue; a small dictionary-construction sketch follows below.
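As a concrete illustration, one wavelet packet dictionary can be generated with the PyWavelets package (using it is an assumption; the slides do not name an implementation). The 'coif2' filter below is a 12-tap Coiflet, the same filter length used later in Example 1:

```python
# Build one wavelet packet dictionary as a depth-3 binary tree of subspaces.
import numpy as np
import pywt

x = np.random.default_rng(0).standard_normal(128)   # a signal of length n = 128
wp = pywt.WaveletPacket(data=x, wavelet='coif2',    # 12-tap Coiflet filter
                        mode='periodization', maxlevel=3)

# Each node of the tree holds the coefficients of one subspace Omega_{j,k}.
for level in range(4):
    nodes = wp.get_level(level, order='natural')
    print(level, [node.path for node in nodes])
```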

11. Example of Local Basis Functions
[Figure: example basis functions from the Standard Basis, Haar Basis, Walsh Basis, C12 Wavelet Packet Basis, Local Sine Basis, and Discrete Sine Basis]

12. Time-Frequency Characteristics
[Figure: time-frequency tilings (time vs. frequency) for the Standard, Haar, Walsh, C12 Wavelet Packet, Local Sine, and Discrete Sine bases]

13. Local Discriminant Basis
- Let $D = \{w_i\}_{i=1}^{N_w}$ be a time-frequency dictionary, viewed as a list $D = \{B_j\}_{j=1}^{N_B}$ of all possible orthonormal bases, where $B_j = (w_{j_1}, \ldots, w_{j_n})$.
- Let $M^+(B_j)$ be a measure of the efficacy of $B_j$ for discrimination.
- Then the local discriminant basis (LDB) can be written as $\Psi = \arg\max_{B_j \in D} M^+(B_j)$.
- Proposition: there exists a fast divide-and-conquer algorithm ($O(n)$) to find $\Psi$ from each dictionary $D$ if $M^+$ is additive:
$$M^+(\emptyset) = 0, \qquad M^+(\{w_{j_1}, \ldots, w_{j_n}\}) = \sum_{i=1}^{n} M^+(w_{j_i}).$$
A sketch of the search follows below.
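The search compares each parent node with the best bases of its two children; here is a minimal sketch under the additivity assumption, with the tree stored as a dictionary of per-node discriminant powers (an illustrative layout, not the authors' code):

```python
# Bottom-up best-basis search for an additive measure M+.
def best_basis(tree, j=0, k=0):
    """Return (best cost, chosen nodes) for the subtree rooted at Omega_{j,k}.

    tree[(j, k)] lists the per-vector discriminant powers of that node.
    Because M+ is additive, a parent only needs to compare its own cost
    with the sum of its children's best costs."""
    own = sum(tree[(j, k)])
    if (j + 1, 2 * k) not in tree:                   # leaf node
        return own, [(j, k)]
    lc, lnodes = best_basis(tree, j + 1, 2 * k)      # left child
    rc, rnodes = best_basis(tree, j + 1, 2 * k + 1)  # right child
    if lc + rc > own:                                # children discriminate better
        return lc + rc, lnodes + rnodes
    return own, [(j, k)]

# Toy 2-level tree: the root is weak, its children are jointly stronger.
tree = {(0, 0): [1.0, 1.0], (1, 0): [3.0], (1, 1): [0.5]}
print(best_basis(tree))   # -> (3.5, [(1, 0), (1, 1)])
```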

14. Discriminant Measures
- For $w_i \in D$, consider the projection $Z_i = w_i^T X$ of an input random signal $X \in \mathcal{X}$.
- Let $\delta_i$ be the efficacy for discrimination (or discriminant power) of $w_i$.
- Some possibilities for measuring the efficacy of the basis $B_j$:
  - $M^+(B_j) = \sum_{i=1}^{n} \delta_{j_i}$;
  - $M^+(B_j) = \sum_{i=1}^{k} \delta_{(j_i)}$, where $\{\delta_{(j_i)}\}$ is the decreasing rearrangement of $\{\delta_{j_i}\}$ and $k < n$;
  - $M^+(B_j) = \sum_{i=1}^{n} \varepsilon_{j_i} \delta_{j_i}$, where $\varepsilon_{j_i} = 1$ if $E[Z_{j_i}^2] > \theta$, and $0$ otherwise.

15. Discriminant Measures ...
What are the basic quantities to use for the one-dimensional efficacy?
- Differences in normalized energies of $Z_i$ among classes:
$$V_i^{(y)} = \frac{E[Z_i^2 \mid Y = y]}{\sum_{i=1}^{n} E[Z_i^2 \mid Y = y]}, \qquad \hat{V}_i^{(y)} = \frac{\sum_{k=1}^{N_y} |w_i^T x_k^{(y)}|^2}{\sum_{k=1}^{N_y} \|x_k^{(y)}\|^2}.$$
- Differences in probability density functions (pdfs) of $Z_i$:
$$q_i^{(y)}(z) = \int_{w_i^T x = z} p(x \mid y)\, dx, \qquad \hat{q}_i^{(y)}(z) \text{ estimated via, e.g., ASH.}$$
A sketch of the energy estimate follows below.
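A small sketch of the empirical normalized-energy estimate $\hat{V}_i^{(y)}$, assuming the expansion coefficients of one class are stacked as rows of a matrix (the data below are synthetic and illustrative):

```python
# Per-coordinate energy of one class, normalized by its total energy.
import numpy as np

def normalized_energy(coeffs):
    """coeffs: (N_y, n) array of expansion coefficients w_i^T x_k of one class.

    Since Psi is orthogonal, the total coefficient energy equals the total
    signal energy, so the columns' sums of squares divided by the grand
    total give the n-vector of V-hat values."""
    energy = (coeffs ** 2).sum(axis=0)
    return energy / energy.sum()

rng = np.random.default_rng(1)
C1 = rng.standard_normal((100, 8)); C1[:, 0] *= 4.0   # class 1: strong coord 0
C2 = rng.standard_normal((100, 8))                    # class 2: flat spectrum
print(normalized_energy(C1).round(3))
print(normalized_energy(C2).round(3))
```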

16. Discriminant Measures ...
Some possibilities for discrepancy measures between two pdfs $p(x)$ and $q(x)$ (a discretized sketch follows below):
- Relative entropy (Kullback-Leibler divergence): $D_{KL}(p, q) = \int p(x) \log_2 \frac{p(x)}{q(x)}\, dx$
- Hellinger distance: $D_H(p, q) = \int \bigl(\sqrt{p(x)} - \sqrt{q(x)}\bigr)^2 dx$
- $L^2$ distance: $D_2(p, q) = \int \bigl(p(x) - q(x)\bigr)^2 dx$
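On binned pdf estimates these integrals reduce to weighted sums; a minimal sketch, where the bin width and the two densities are made up for illustration:

```python
# Discrepancy measures between two binned density estimates p and q.
import numpy as np

def kl(p, q, h):        # relative entropy, in bits as on the slide
    mask = (p > 0) & (q > 0)
    return h * np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def hellinger(p, q, h):
    return h * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def l2(p, q, h):
    return h * np.sum((p - q) ** 2)

# Two histogram-style densities on [0, 1) with 4 bins of width h = 0.25;
# each integrates to 1.
h = 0.25
p = np.array([2.0, 1.0, 0.5, 0.5])
q = np.array([0.5, 0.5, 1.0, 2.0])
print(kl(p, q, h), hellinger(p, q, h), l2(p, q, h))
```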

17. Discriminant Measures ...
- Using normalized energies:
$$\delta_i = \hat{V}_i^{(1)} \log_2 \frac{\hat{V}_i^{(1)}}{\hat{V}_i^{(2)}}, \qquad \Bigl(\sqrt{\hat{V}_i^{(1)}} - \sqrt{\hat{V}_i^{(2)}}\Bigr)^2, \qquad \text{or} \qquad \Bigl(\hat{V}_i^{(1)} - \hat{V}_i^{(2)}\Bigr)^2.$$
- Using empirical pdfs:
$$\delta_i = D_{KL}\bigl(\hat{q}_i^{(1)}, \hat{q}_i^{(2)}\bigr), \qquad D_H\bigl(\hat{q}_i^{(1)}, \hat{q}_i^{(2)}\bigr), \qquad \text{or} \qquad D_2\bigl(\hat{q}_i^{(1)}, \hat{q}_i^{(2)}\bigr).$$

18. Discriminant Measures ...
- It is important to consider two-dimensional projections as well.
[Figure: scatter plot of samples under a two-dimensional projection]
- This is applicable to the complex coefficients of local Fourier bases.

19. Example 1: Signal Shape Classification
Objective: classify synthetic signals of length 128 into one of three possible classes:
$$c(i) = (6 + \eta)\, \chi_{[a,b]}(i) + \epsilon(i) \quad \text{for "cylinder",}$$
$$b(i) = (6 + \eta)\, \chi_{[a,b]}(i)\, (i - a)/(b - a) + \epsilon(i) \quad \text{for "bell",}$$
$$f(i) = (6 + \eta)\, \chi_{[a,b]}(i)\, (b - i)/(b - a) + \epsilon(i) \quad \text{for "funnel",}$$
where $a \sim \mathrm{unif}[16, 32]$, $b - a \sim \mathrm{unif}[32, 96]$, $\eta \sim N(0, 1)$, and the $\epsilon(i)$ are i.i.d. $N(0, 1)$.
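These classes are easy to reproduce; the following sketch implements the formulas above directly (seeding and array layout are incidental choices):

```python
# Generate one "cylinder", "bell", or "funnel" signal of length n.
import numpy as np

def make_signal(kind, n=128, rng=np.random.default_rng()):
    a = rng.uniform(16, 32)
    b = a + rng.uniform(32, 96)
    i = np.arange(n)
    chi = ((i >= a) & (i <= b)).astype(float)   # indicator of [a, b]
    eta = rng.standard_normal()
    eps = rng.standard_normal(n)
    if kind == 'cylinder':
        shape = chi
    elif kind == 'bell':
        shape = chi * (i - a) / (b - a)
    elif kind == 'funnel':
        shape = chi * (b - i) / (b - a)
    return (6 + eta) * shape + eps

x = make_signal('bell')
print(x.shape, round(float(x.max()), 2))
```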

20. Example Waveforms
[Figure: sample "cylinder", "bell", and "funnel" signals]

21. Example 1: Signal Shape Classification ...
[Figure: three panels showing the Top 10 LDFs, the Top 10 LDBs, and the Selected Nodes by LDB]

22. Example 1: Signal Shape Classification ...
Table of misclassification rates (average over 10 simulations):

Method (Coordinates)   Training Data   Test Data
LDA on STD             0.83 %
CT on STD              2.83 %
LDA on Top 10 LDB      7.00 %          8.37 %
CT on Top 10 LDB       2.67 %          5.54 %

- 100 training signals and 1000 test signals were generated for each class per simulation.
- A 12-tap Coiflet filter was used for LDB.

23. Example 1: Signal Shape Classification ...
[Figure: the full classification tree on the top 10 LDB coordinates; internal splits are on the coordinates ldc.1, ldc.3, ldc.5, ldc.7, and ldc.9, and leaves are labeled "cylinder", "bell", or "funnel" with their misclassification counts]

24. Example 2: Classification of Geophysical Acoustic Waveforms
- 402 acoustic waveforms (256 time samples each) were recorded at a gas-producing well.
- The region consists of sand or shale layers.
- Sand layers contain either water or gas.
[Figure: the recorded waveforms plotted against time (sec)]

25. Sonic Waveform Measurement
[Figure: schematic of the borehole tool, with transmitter T and receiver R spanning Layer #1 and Layer #2]

26. Objectives
- Can we classify acoustic waveforms in terms of the mineral contents of the layers?
- Can we automate the classification process?
- If so, what features (or wave components) are important?
[Figure: pressure (Pa) vs. time (sec) for one waveform, with the P, S, and Stoneley wave arrivals marked]

27. Classification Procedure
- Adopt 10-fold cross-validation: split the waveforms randomly into training and test datasets, and repeat the experiments.
- Decompose the training waveforms into the local sine dictionary.
- Choose the LDB using various discrepancy measures.
- Choose the 25, 50, 75, and 100 most discriminant coordinates.
- Construct classifiers (LDA and a classification tree) using these coordinates.
- Decompose the test waveforms into the LDB and classify them.
- Compute the average misclassification rates.
A schematic of this loop follows below.
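Here is a sketch of that cross-validation loop, with the expansion and LDB-selection steps abstracted into stand-in callables choose_ldb and top_coordinates (hypothetical names; the real steps are the local sine expansion and the LDB search described earlier):

```python
# Schematic 10-fold CV for the LDB + LDA procedure above.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def run_cv(X, y, choose_ldb, top_coordinates, m=25, folds=10):
    rates = []
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    for train, test in skf.split(X, y):
        basis = choose_ldb(X[train], y[train])       # LDB from training data only
        idx = top_coordinates(X[train] @ basis, y[train], m)
        clf = LinearDiscriminantAnalysis().fit((X[train] @ basis)[:, idx], y[train])
        rates.append(1.0 - clf.score((X[test] @ basis)[:, idx], y[test]))
    return float(np.mean(rates))                     # average misclassification
```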

28. Misclassification Rates for Test Data
[Figure: misclassification rate (%) vs. number of LDB features; solid lines are LDA, dotted lines are CT, with one curve per discrepancy measure]

29. The Best LDB Vectors
[Figure: the best LDB vectors plotted against time (sec)]

30. The Best LDB Pattern
[Figure: the time-frequency pattern of the best LDB, plotted against time (sec)]

31. The Best LDB Vectors ...
[Figure: the top 10 most influential LDB vectors in LDA, plotted against time (sec)]

32. Observations
- LDA on LDB gave better results than CT on LDB => the discriminant features are oblique.
- The top 25 LDB features were clearly not sufficient.
- Supplying too many features degraded the performance.
- The most important LDB vectors are clustered around the P-wave components.

33. Improved LDB with Empirical Probability Density Estimation
- This can be viewed as a specific yet fast version of the Projection Pursuit algorithm (Friedman-Tukey, Huber).
- In contrast to the top-down strategy of PP, LDB is not greedy: it is a bottom-up approach.
- This approach also works for complex-valued expansion coefficients (e.g., local Fourier bases).
- How should the empirical pdf be estimated? Options include histograms, ASH (averaged shifted histograms), nearest-neighbor, and kernel-based methods; an ASH sketch follows below.
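For reference, here is a compact sketch of a univariate ASH density estimate, following Scott's averaged-shifted-histogram construction (a generic implementation, not the authors' code):

```python
# Averaged shifted histogram: average m histograms of bin width h whose
# origins are shifted by h/m; equivalent to triangular-weighting counts
# taken on a fine grid of width h/m.
import numpy as np

def ash(samples, lo, hi, nbins=32, m=8):
    """Return bin centers and the ASH density estimate on [lo, hi]."""
    h = (hi - lo) / nbins
    delta = h / m                                    # fine-grid width
    fine = np.zeros(nbins * m)
    idx = np.clip(((samples - lo) / delta).astype(int), 0, nbins * m - 1)
    np.add.at(fine, idx, 1)                          # fine-grid counts
    weights = (m - np.abs(np.arange(1 - m, m))) / m**2
    dens = np.convolve(fine, weights, mode='same') / (len(samples) * delta)
    centers = lo + (np.arange(nbins * m) + 0.5) * delta
    return centers, dens

z = np.random.default_rng(2).standard_normal(1000)
c, d = ash(z, -4, 4)
print(np.trapz(d, c))   # close to 1, as a density estimate should be
```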

34. Conclusion
- LDB functions can capture relevant local features in the data.
- Interpretation of the results becomes easier and more intuitive.
- The computational cost is at most $O(n[\log n]^2)$.
- These methods enhance both traditional and modern statistical methods.

35. Conclusion ...
Our algorithms are being applied and tested on:
- Geophysical signal/image classification at Schlumberger
- Noise reduction in hearing aids at Northwestern University
- Diagnostics of mammography at the University of Chicago Hospital
- Radar target discrimination at Lockheed Martin

36. Future Directions
- What is a good basis for clustering?
- Parameterization of low-dimensional structures in high-dimensional space.
- Explore the relationship with the David-Jones-Semmes geometric analysis.
- Develop a two-dimensional version for the complex coefficients provided by local Fourier bases or brushlets.
