Adapted Feature Extraction and Its Applications

Adapted Feature Extraction and Its Applications
Naoki Saito
Department of Mathematics, University of California, Davis, CA 95616
email: saito@math.ucdavis.edu
URL: http://www.math.ucdavis.edu/~saito/

Acknowledgment
Ronald R. Coifman (Yale/FMAH)
Fred Warner (Yale/FMAH)
Frank Geshwind (Yale/FMAH)

Outline
Problem Formulation
A Dictionary/Library of Orthonormal Bases
Local Discriminant Basis (LDB)
Improved LDB with Empirical Probability Density Estimation
Example 1: Synthetic Waveform Classification
Example 2: Geophysical Acoustic Waveform Classification
Conclusion
Future Directions

A Strategy for Pattern Recognition
Normalization of input patterns: scaling, translation, rotation
Feature extraction and selection
Classification: LDA, CART, k-NN, Neural Networks, ...
[Diagram: the classifier factors as $\mathcal{X} \xrightarrow{f} \mathcal{F} \xrightarrow{g} \mathcal{Y}$, i.e., $x \mapsto f(x) \mapsto y \in \{1, \dots, C\}$]

Problem Formulation
Learning (or Training): Given a set of data, $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^N \subset \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subset \mathbb{R}^n$ and $\mathcal{Y} = \{1, \dots, K\}$ (class labels), find a map $d : \mathcal{X} \to \mathcal{Y}$ such that $R(d) = \frac{1}{N} \sum_{i=1}^N I(d(x_i) \neq y_i)$ is small.
Prediction: Apply the map $d$ obtained during the training to a new dataset. Then interpret the result and evaluate $d$.
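To make the training criterion concrete, a minimal numpy sketch of the empirical misclassification rate $R(d)$ (function name hypothetical):

```python
import numpy as np

def empirical_risk(d, X, y):
    """R(d): fraction of pairs (x_i, y_i) in the training set with d(x_i) != y_i."""
    return float(np.mean([d(x) != label for x, label in zip(X, y)]))

# Toy usage: a classifier that always predicts class 1, on random labels.
X = np.random.randn(10, 4)
y = np.random.randint(1, 4, size=10)
print(empirical_risk(lambda x: 1, X, y))
```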

Problem Formulation ...
Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be a random sample from $P(X \in A, Y = y) = P(X \in A \mid Y = y)\, P(Y = y)$, where $P(Y = y) = \pi_y = N_y / N$ in practice.
Let us assume the density $p(x \mid y)$ exists.
The Bayes classifier (or rule) $d_B$ is: $d_B(x) = \sum_k k\, I(x \in A_k)$, where $A_k = \{x \in \mathcal{X} : p(x \mid k)\, \pi_k = \max_{y \in \mathcal{Y}} p(x \mid y)\, \pi_y\}$.
In other words, assign $x$ to class $k$ if $x \in A_k$. Note $\bigcup_{y \in \mathcal{Y}} A_y = \mathcal{X}$.
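A minimal sketch of the Bayes rule above, assuming the class-conditional densities and priors are given; the hand-picked 1-D Gaussians and all names are illustrative, not part of the talk:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_classifier(x, class_densities, priors):
    """Assign x to the class k maximizing p(x | k) * pi_k."""
    scores = {y: class_densities[y](x) * priors[y] for y in priors}
    return max(scores, key=scores.get)

# Toy usage: two 1-D Gaussian class-conditional densities with priors 0.6 / 0.4.
densities = {1: lambda x: gaussian_pdf(x, 0.0, 1.0),
             2: lambda x: gaussian_pdf(x, 3.0, 1.0)}
priors = {1: 0.6, 2: 0.4}
print(bayes_classifier(1.2, densities, priors))   # -> 1
```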

Problem Formulation ...
Difficulties of the Bayes classifier: we do not know $p(x \mid k)$, or it is too difficult to estimate $p(x \mid k)$ computationally because of the high dimensionality of the problem (curse of dimensionality).
$\Rightarrow$ We need dimensionality reduction without losing important information for classification.

Problem Formulation ...
Therefore, we restrict our attention to maps $d : \mathcal{X} \to \mathcal{Y}$ of the following form: $d = g \circ f = g \circ \Theta_m \circ \Psi^T$, where
$f : \mathcal{X} \to \mathcal{F} \subset \mathbb{R}^m$ is called a feature extractor, consisting of:
$\Psi$: an $n$-dimensional orthogonal matrix selected from a dictionary or library of orthonormal bases;
$\Theta_m$: a selection rule that selects the most important $m$ ($\ll n$) of the $n$ coordinates;
$g : \mathcal{F} \to \mathcal{Y}$ is a conventional classifier, e.g., LDA, CART, ANN, etc.
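A rough illustration of the composition $d = g \circ \Theta_m \circ \Psi^T$: a random orthonormal matrix stands in for the dictionary basis $\Psi$, a simple variance ranking stands in for $\Theta_m$ (the actual LDB selection is discriminant-based, described later), and scikit-learn's LDA plays the role of $g$. This assumes scikit-learn is available; all names are illustrative:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def extract_features(X, Psi, keep_idx):
    """f = Theta_m o Psi^T: expand each signal (row of X) in the orthonormal
    basis Psi (columns = basis vectors), then keep the selected m coordinates."""
    coeffs = X @ Psi            # n expansion coefficients per signal
    return coeffs[:, keep_idx]  # selection rule Theta_m

# Toy usage: keep the 5 highest-variance coordinates (illustrative rule only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 32))
y_train = rng.integers(1, 3, size=100)
Psi, _ = np.linalg.qr(rng.normal(size=(32, 32)))   # stand-in for an LDB
keep_idx = np.argsort(-np.var(X_train @ Psi, axis=0))[:5]
g = LinearDiscriminantAnalysis().fit(extract_features(X_train, Psi, keep_idx), y_train)
```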

A Library of Orthonormal Bases
A library consists of dictionaries of orthonormal bases: each dictionary is a binary tree whose nodes are subspaces of $\Omega_{0,0} = \mathbb{R}^n$ with different time-frequency localization characteristics.
[Diagram: binary tree of subspaces with root $\Omega_{0,0}$ splitting into $\Omega_{1,0}, \Omega_{1,1}$, then $\Omega_{2,0}, \dots, \Omega_{2,3}$, down to $\Omega_{3,0}, \dots, \Omega_{3,7}$]

A Library of Orthonormal Bases ...
Examples of dictionaries include wavelet packet bases, local trigonometric bases, and local Fourier bases.
It costs $O(n[\log n]^p)$ to generate a dictionary for a signal of length $n$ ($p = 1$ for wavelet packets, $p = 2$ for LTB/LFB).
Each dictionary may contain up to $n(1 + \log_2 n)$ basis vectors and more than $2^n$ possible orthonormal bases.
How to select the best possible basis for the problem at hand is a key issue.
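As a concrete example of such a dictionary, a wavelet packet tree can be generated with the PyWavelets package (an assumption of this sketch, not software named in the talk); 'coif2' is a 12-tap Coiflet filter:

```python
import numpy as np
import pywt  # PyWavelets

x = np.random.randn(128)                      # a signal of length n = 128
wp = pywt.WaveletPacket(data=x, wavelet='coif2', mode='periodization')
print(wp.maxlevel)                            # depth of the binary tree
# Each node of the tree holds the coefficients of one subspace Omega_{j,k}.
for node in wp.get_level(2, order='natural'):
    print(node.path, node.data.shape)
```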

Example of Local Basis Functions
[Figure: sample basis vectors of length 128 from six bases: Standard Basis, Haar Basis, Walsh Basis, C12 Wavelet Packet Basis, Local Sine Basis, Discrete Sine Basis]

Time-Frequency Characteristics
[Figure: time-frequency tilings (time 0-128, frequency 0-0.5) of the same six bases: Standard, Haar, Walsh, C12 Wavelet Packet, Local Sine, Discrete Sine]

Local Discriminant Basis
Let $\mathcal{D} = \{w_i\}_{i=1}^{N_w}$ be a time-frequency dictionary, and write $\mathcal{D} = \{B_j\}_{j=1}^{N_B}$ as a list of all possible orthonormal bases, where $B_j = (w_{j_1}, \dots, w_{j_n})$.
Let $M^+(B_j)$ be a measure of the efficacy of $B_j$ for discrimination.
Then the local discriminant basis (LDB) can be written as $\Psi = \arg\max_{B_j \in \mathcal{D}} M^+(B_j)$.
Proposition: There exists a fast algorithm (divide-and-conquer, $O(n)$) to find $\Psi$ from each dictionary $\mathcal{D}$ if $M^+$ is additive: $M^+(\emptyset) = 0$, $M^+(\{w_{j_1}, \dots, w_{j_n}\}) = \sum_{i=1}^n M^+(w_{j_i})$.
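A sketch of the divide-and-conquer best-basis search for an additive measure, on a toy dictionary tree indexed by $(j, k)$ as in the diagram above; the cost table and all names are hypothetical:

```python
import numpy as np

def best_basis(cost, node=(0, 0), max_level=3):
    """Divide-and-conquer search for the basis maximizing an additive M+.
    cost[(j, k)] is M+ evaluated on the coefficients in subspace Omega_{j,k}.
    Returns the chosen nodes and their total score."""
    j, k = node
    if j == max_level:                                # leaf: nothing to split
        return [node], cost[node]
    nodes_l, score_l = best_basis(cost, (j + 1, 2 * k), max_level)
    nodes_r, score_r = best_basis(cost, (j + 1, 2 * k + 1), max_level)
    if cost[node] >= score_l + score_r:               # parent beats best children
        return [node], cost[node]
    return nodes_l + nodes_r, score_l + score_r       # additivity of M+

# Toy usage: random per-node scores on a depth-3 dictionary tree.
rng = np.random.default_rng(1)
cost = {(j, k): float(rng.random() * 2 ** (3 - j))
        for j in range(4) for k in range(2 ** j)}
print(best_basis(cost))
```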

Discriminant Measures
For $w_i \in \mathcal{D}$, consider the projection $Z_i = w_i^T X$ of an input random signal $X \in \mathcal{X}$.
Let $\delta_i$ be the efficacy for discrimination (or discriminant power) of $w_i$.
Some possibilities for measuring the efficacy of the basis $B_j$:
$M^+(B_j) = \sum_{i=1}^n \delta_{j_i}$
$M^+(B_j) = \sum_{i=1}^k \delta_{(j_i)}$, where $\{\delta_{(j_i)}\}$ is the decreasing rearrangement of $\{\delta_{j_i}\}$ and $k < n$.
$M^+(B_j) = \sum_{i=1}^n \varepsilon_{j_i} \delta_{j_i}$, where $\varepsilon_{j_i} = 1$ if $E[Z_i^2] > \theta$, and $0$ otherwise.
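These three choices of $M^+$ might be coded as follows (a sketch; argument names are illustrative):

```python
import numpy as np

def efficacy_sum(delta, k=None, energies=None, theta=None):
    """M+(B) from the per-coordinate discriminant powers delta_i of a basis B:
    plain sum, sum of the k largest, or an energy-thresholded sum."""
    delta = np.asarray(delta, dtype=float)
    if k is not None:
        return np.sort(delta)[::-1][:k].sum()
    if theta is not None:
        return delta[np.asarray(energies) > theta].sum()
    return delta.sum()
```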

Discriminant Measures ...
What are the basic quantities to use for 1D efficacy?
Differences in normalized energies of $Z_i$ among classes: $V_i^{(y)} = \dfrac{E[Z_i^2 \mid Y = y]}{\sum_{i=1}^n E[Z_i^2 \mid Y = y]}$, estimated by $\hat{V}_i^{(y)} = \dfrac{\sum_{k=1}^{N_y} |w_i^T x_k^{(y)}|^2}{\sum_{k=1}^{N_y} \|x_k^{(y)}\|^2}$.
Differences in probability density functions (pdfs) of $Z_i$: $q_i^{(y)}(z) = \int_{w_i^T x = z} p(x \mid y)\, dx$, estimated by $\hat{q}_i^{(y)}(z)$ via, e.g., ASH.
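A sketch of the empirical normalized energies, assuming the signals are rows of X and the basis vectors are columns of an orthonormal matrix Psi (so the total coefficient energy equals the total signal energy in the denominator):

```python
import numpy as np

def normalized_energies(X, y, Psi):
    """hat-V_i^(y): per-class energy of each expansion coefficient, normalized
    by that class's total energy."""
    coeffs = X @ Psi   # coeffs[k, i] = w_i^T x_k
    return {label: (coeffs[y == label] ** 2).sum(axis=0)
                   / (coeffs[y == label] ** 2).sum()
            for label in np.unique(y)}
```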

Discriminant Measures ...
Some possibilities for discrepancy measures between two pdfs $p(x)$, $q(x)$:
Relative entropy (Kullback-Leibler divergence): $D_{KL}(p, q) = \int p(x) \log_2 \dfrac{p(x)}{q(x)}\, dx$
Hellinger distance: $D_H(p, q) = \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx$
$L^2$ distance: $D_2(p, q) = \int \left( p(x) - q(x) \right)^2 dx$
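A small sketch computing the three discrepancy measures between two pdfs discretized on a common grid (the eps guard against division by zero is an implementation detail, not part of the definitions):

```python
import numpy as np

def discrepancies(p, q, dz, eps=1e-12):
    """Relative entropy, Hellinger distance, and L2 distance between two pdfs
    sampled on a common grid with spacing dz."""
    kl = np.sum(p * np.log2((p + eps) / (q + eps))) * dz
    hellinger = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dz
    l2 = np.sum((p - q) ** 2) * dz
    return kl, hellinger, l2

# Toy usage: two Gaussian-shaped densities on [-5, 5].
z = np.linspace(-5, 5, 201)
dz = z[1] - z[0]
p = np.exp(-0.5 * z ** 2); p /= p.sum() * dz
q = np.exp(-0.5 * (z - 1.0) ** 2); q /= q.sum() * dz
print(discrepancies(p, q, dz))
```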

Discriminant Measures ...
Using normalized energy: $\delta_i = \hat{V}_i^{(1)} \log_2 \dfrac{\hat{V}_i^{(1)}}{\hat{V}_i^{(2)}}$, $\left( \sqrt{\hat{V}_i^{(1)}} - \sqrt{\hat{V}_i^{(2)}} \right)^2$, or $\left( \hat{V}_i^{(1)} - \hat{V}_i^{(2)} \right)^2$.
Using empirical pdfs: $\delta_i = D_{KL}(\hat{q}_i^{(1)}, \hat{q}_i^{(2)})$, $D_H(\hat{q}_i^{(1)}, \hat{q}_i^{(2)})$, or $D_2(\hat{q}_i^{(1)}, \hat{q}_i^{(2)})$.

Discriminant Measures ...
It is also important to consider the two-dimensional projection.
[Figure: two classes labeled 1 and 2 plotted in the $(x_1, x_2)$ plane]
Applicable to the complex coefficients of local Fourier bases.

Example 1: Signal Shape Classification
Objective: Classify synthetic signals of length 128 into three possible classes:
$c(i) = (6 + \eta)\, \chi_{[a,b]}(i) + \epsilon(i)$ for "cylinder",
$b(i) = (6 + \eta)\, \chi_{[a,b]}(i)\, (i - a)/(b - a) + \epsilon(i)$ for "bell",
$f(i) = (6 + \eta)\, \chi_{[a,b]}(i)\, (b - i)/(b - a) + \epsilon(i)$ for "funnel",
where $a \sim \mathrm{unif}[16, 32]$, $b - a \sim \mathrm{unif}[32, 96]$, $\eta \sim N(0, 1)$, and $\epsilon(i)$ are i.i.d. $N(0, 1)$.
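A sketch of a generator for these synthetic signals, following the formulas above (function and variable names are illustrative):

```python
import numpy as np

def cbf_signal(shape, n=128, rng=None):
    """One synthetic 'cylinder', 'bell', or 'funnel' signal of length n."""
    if rng is None:
        rng = np.random.default_rng()
    i = np.arange(1, n + 1, dtype=float)
    a = rng.uniform(16, 32)
    b = a + rng.uniform(32, 96)
    eta, eps = rng.normal(), rng.normal(size=n)
    chi = ((i >= a) & (i <= b)).astype(float)          # indicator of [a, b]
    ramp = {'cylinder': np.ones(n),
            'bell': (i - a) / (b - a),
            'funnel': (b - i) / (b - a)}[shape]
    return (6 + eta) * chi * ramp + eps

# 100 training signals per class, as in the experiment described below.
rng = np.random.default_rng(0)
train = {s: np.array([cbf_signal(s, rng=rng) for _ in range(100)])
         for s in ('cylinder', 'bell', 'funnel')}
```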

Example Waveforms
[Figure: sample "Cylinder", "Bell", and "Funnel" signals of length 128]

Example 1: Signal Shape Classification ...
[Figure: top 10 LDFs, top 10 LDB vectors, and the time-frequency nodes selected by LDB]

Example 1: Signal Shape Classification ...
Table of misclassification rates (averaged over 10 simulations):

  Method (Coordinates)     Training Data    Test Data
  LDA on STD                   0.83 %        12.31 %
  CT on STD                    2.83 %        11.28 %
  LDA on Top 10 LDB            7.00 %         8.37 %
  CT on Top 10 LDB             2.67 %         5.54 %

100 training signals and 1000 test signals were generated for each class per simulation.
A 12-tap Coiflet filter was used for LDB.

Example 1: Signal Shape Classification ...
[Figure: full classification tree on the top 10 LDB coordinates, splitting on coordinates ldc.1, ldc.3, ldc.5, ldc.7, and ldc.9; leaves are labeled cylinder, bell, or funnel with their misclassification counts]

Example 2: Classification of Geophysical Acoustic Waveforms
402 acoustic waveforms (256 time samples each) were recorded at a gas-producing well.
The region consists of sand or shale layers. Sand layers contain either water or gas.
[Figure: the 402 recorded waveforms displayed as an image, trace index versus time 0-0.0025 sec]

Sonic Waveform Measurement
[Figure: schematic of the borehole tool with transmitter (T) and receiver (R) spanning Layer #1 and Layer #2]

Objectives
Can we classify acoustic waveforms in terms of the mineral contents of layers?
Can we automate the classification process?
If so, what features (or wave components) are important?
[Figure: a single recorded waveform, pressure (Pa) versus time 0-0.005 sec, with the P, S, and Stoneley wave arrivals marked]

Classification Procedure
Adopt 10-fold cross-validation: split the waveforms randomly into training and test datasets, and repeat the experiments.
Decompose the training waveforms into the local sine dictionary.
Choose the LDB using various discrepancy measures.
Choose the 25, 50, 75, or 100 most discriminant coordinates.
Construct classifiers (LDA and Classification Tree) using these coordinates.
Decompose the test waveforms into the LDB and classify them.
Compute the average misclassification rates.
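A sketch of this evaluation loop using scikit-learn classifiers (an assumption of this sketch); select_basis is a hypothetical stand-in for the LDB search, which must be rerun on each training fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

def cv_error(X, y, select_basis, clf_factory, n_features=25, n_splits=10):
    """Average test misclassification rate over cross-validation folds.
    select_basis(X, y) -> (Psi, ranked_idx): hypothetical LDB search returning
    an orthonormal matrix (columns = basis vectors) and coordinates ranked by
    discriminant power; the basis is chosen on the training fold only."""
    errors = []
    for train, test in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        Psi, ranked_idx = select_basis(X[train], y[train])
        idx = ranked_idx[:n_features]
        F_train = (X[train] @ Psi)[:, idx]
        F_test = (X[test] @ Psi)[:, idx]
        clf = clf_factory().fit(F_train, y[train])
        errors.append(1.0 - clf.score(F_test, y[test]))
    return np.mean(errors)

# e.g.: cv_error(X, y, select_basis, LinearDiscriminantAnalysis, n_features=50)
#       cv_error(X, y, select_basis, DecisionTreeClassifier, n_features=50)
```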

Misclassification Rates for Test Data
[Figure: misclassification rate (%) versus number of LDB features (25, 50, 75, 100) for the different discrepancy measures; solid lines: LDA, dotted lines: CT]

The Best LDB Vectors
[Figure: the best LDB vectors plotted against time, 0-0.0025 sec]

The Best LDB Pattern
[Figure: the time-frequency pattern of the best LDB, plotted against time, 0-0.0025 sec]

The Best LDB Vectors ...
[Figure: the top 10 most influential LDB vectors in LDA, plotted against time, 0-0.0025 sec]

Observations
LDA on LDB gave better results than CT on LDB $\Rightarrow$ the features are oblique.
The top 25 LDB features were clearly not sufficient.
Supplying too many features degraded the performance.
The most important LDB vectors are clustered around the P-wave components.

Improved LDB with Empirical Probability Density Estimation
This can be viewed as a specific yet fast version of the Projection Pursuit algorithm (Friedman-Tukey, Huber).
Compared to the top-down strategy of PP, LDB is not greedy; it takes a bottom-up approach.
This approach also works for complex-valued expansion coefficients (e.g., local Fourier bases).
How to estimate the empirical pdf? Histograms, ASH (averaged shifted histograms), nearest neighbor, kernel-based methods, ...
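For instance, an averaged shifted histogram estimate of the pdf of one coordinate could look like this (a minimal sketch; the bin width h and number of shifts m are tuning knobs, not values from the talk):

```python
import numpy as np

def ash_density(samples, grid, h=0.5, m=8):
    """Averaged shifted histogram (ASH) estimate of a 1-D pdf on `grid`:
    average m ordinary histograms of bin width h, each shifted by h/m."""
    est = np.zeros_like(grid, dtype=float)
    for s in range(m):
        shift = (s / m - 0.5) * h
        edges = np.arange(grid.min() - h + shift, grid.max() + 2 * h + shift, h)
        hist, _ = np.histogram(samples, bins=edges, density=True)
        idx = np.clip(np.digitize(grid, edges) - 1, 0, len(hist) - 1)
        est += hist[idx]
    return est / m

# Toy usage: estimate the pdf of one coordinate for one class.
z = np.random.default_rng(2).normal(size=500)
grid = np.linspace(-4, 4, 161)
pdf_hat = ash_density(z, grid)
```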

Conclusion
LDB functions can capture relevant local features in data.
Interpretation of the results becomes easier and more intuitive.
The computational cost is at most $O(n[\log n]^2)$.
These methods enhance both traditional and modern statistical methods.

Conclusion ...
Our algorithms are being applied and tested on:
Geophysical signal/image classification at Schlumberger
Noise reduction in hearing aids at Northwestern University
Diagnostics of mammography at the University of Chicago Hospital
Radar target discrimination at Lockheed-Martin

Future Directions
What is a good basis for clustering?
Parameterization of low-dimensional structures in high-dimensional space
Explore the relationship with the David-Jones-Semmes geometric analysis
Develop a two-dimensional version for the complex coefficients provided by the local Fourier bases or brushlets