Adapted Feature Extraction and Its Applications

Adapted Feature Extraction and Its Applications
Naoki Saito
Department of Mathematics, University of California, Davis, CA 95616
email: saito@math.ucdavis.edu
URL: http://www.math.ucdavis.edu/~saito/

Acknowledgment
Ronald R. Coifman (Yale/FMAH)
Fred Warner (Yale/FMAH)
Frank Geshwind (Yale/FMAH)

Outline
Problem Formulation
A Dictionary/Library of Orthonormal Bases
Local Discriminant Basis (LDB)
Improved LDB with Empirical Probability Density Estimation
Example 1: Synthetic Waveform Classification
Example 2: Geophysical Acoustic Waveform Classification
Conclusion
Future Directions

A Strategy for Pattern Recognition
Normalization of input patterns: scaling, translation, rotation
Feature extraction and selection
Classification: LDA, CART, k-NN, Neural Networks, ...
[Diagram: the classifier factors as $\mathcal{X} \xrightarrow{f} \mathcal{F} \xrightarrow{g} \mathcal{Y}$, i.e., $x \mapsto f(x) \mapsto y \in \{1, \dots, C\}$]

Problem Formulation
Learning (or Training): Given a set of data, $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^N \subset \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subset \mathbb{R}^n$ and $\mathcal{Y} = \{1, \dots, K\}$ (class labels), find a map $d : \mathcal{X} \to \mathcal{Y}$ such that $R(d) = \frac{1}{N} \sum_{i=1}^N I(d(x_i) \neq y_i)$ is small.
Prediction: Apply the map $d$ obtained during the training to a new dataset. Then interpret the result and evaluate $d$.
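To make the training criterion concrete, a minimal numpy sketch of the empirical misclassification rate $R(d)$ (function name hypothetical):

```python
import numpy as np

def empirical_risk(d, X, y):
    """R(d): fraction of pairs (x_i, y_i) in the training set with d(x_i) != y_i."""
    return float(np.mean([d(x) != label for x, label in zip(X, y)]))

# Toy usage: a classifier that always predicts class 1, on random labels.
X = np.random.randn(10, 4)
y = np.random.randint(1, 4, size=10)
print(empirical_risk(lambda x: 1, X, y))
```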

Problem Formulation ...
Let $(X, Y) \in \mathcal{X} \times \mathcal{Y}$ be a random sample from $P(X \in A, Y = y) = P(X \in A \mid Y = y)\, P(Y = y)$, where $P(Y = y) = \pi_y = N_y / N$ in practice.
Let us assume the density $p(x \mid y)$ exists.
The Bayes classifier (or rule) $d_B$ is: $d_B(x) = \sum_k k\, I(x \in A_k)$, where $A_k = \{x \in \mathcal{X} : p(x \mid k)\, \pi_k = \max_{y \in \mathcal{Y}} p(x \mid y)\, \pi_y\}$.
In other words, assign $x$ to class $k$ if $x \in A_k$. Note $\bigcup_{y \in \mathcal{Y}} A_y = \mathcal{X}$.
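A minimal sketch of the Bayes rule above, assuming the class-conditional densities and priors are given; the hand-picked 1-D Gaussians and all names are illustrative, not part of the talk:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_classifier(x, class_densities, priors):
    """Assign x to the class k maximizing p(x | k) * pi_k."""
    scores = {y: class_densities[y](x) * priors[y] for y in priors}
    return max(scores, key=scores.get)

# Toy usage: two 1-D Gaussian class-conditional densities with priors 0.6 / 0.4.
densities = {1: lambda x: gaussian_pdf(x, 0.0, 1.0),
             2: lambda x: gaussian_pdf(x, 3.0, 1.0)}
priors = {1: 0.6, 2: 0.4}
print(bayes_classifier(1.2, densities, priors))   # -> 1
```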

Problem Formulation ...
Difficulties of the Bayes classifier: we do not know $p(x \mid k)$, or it is too difficult to estimate $p(x \mid k)$ computationally because of the high dimensionality of the problem (curse of dimensionality).
$\Rightarrow$ We need dimensionality reduction without losing important information for classification.

Problem Formulation ...
Therefore, we restrict our attention to maps $d : \mathcal{X} \to \mathcal{Y}$ of the following form: $d = g \circ f = g \circ \Theta_m \circ \Psi^T$, where
$f : \mathcal{X} \to \mathcal{F} \subset \mathbb{R}^m$ is called a feature extractor, consisting of:
$\Psi$: an $n$-dimensional orthogonal matrix selected from a dictionary or library of orthonormal bases;
$\Theta_m$: a selection rule that selects the most important $m$ ($\ll n$) of the $n$ coordinates;
$g : \mathcal{F} \to \mathcal{Y}$ is a conventional classifier, e.g., LDA, CART, ANN, etc.
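A rough illustration of the composition $d = g \circ \Theta_m \circ \Psi^T$: a random orthonormal matrix stands in for the dictionary basis $\Psi$, a simple variance ranking stands in for $\Theta_m$ (the actual LDB selection is discriminant-based, described later), and scikit-learn's LDA plays the role of $g$. This assumes scikit-learn is available; all names are illustrative:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def extract_features(X, Psi, keep_idx):
    """f = Theta_m o Psi^T: expand each signal (row of X) in the orthonormal
    basis Psi (columns = basis vectors), then keep the selected m coordinates."""
    coeffs = X @ Psi            # n expansion coefficients per signal
    return coeffs[:, keep_idx]  # selection rule Theta_m

# Toy usage: keep the 5 highest-variance coordinates (illustrative rule only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 32))
y_train = rng.integers(1, 3, size=100)
Psi, _ = np.linalg.qr(rng.normal(size=(32, 32)))   # stand-in for an LDB
keep_idx = np.argsort(-np.var(X_train @ Psi, axis=0))[:5]
g = LinearDiscriminantAnalysis().fit(extract_features(X_train, Psi, keep_idx), y_train)
```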

A Library of Orthonormal Bases
A library consists of dictionaries of orthonormal bases: each dictionary is a binary tree whose nodes are subspaces of $\Omega_{0,0} = \mathbb{R}^n$ with different time-frequency localization characteristics.
[Diagram: binary tree of subspaces with root $\Omega_{0,0}$ splitting into $\Omega_{1,0}, \Omega_{1,1}$, then $\Omega_{2,0}, \dots, \Omega_{2,3}$, down to $\Omega_{3,0}, \dots, \Omega_{3,7}$]

A Library of Orthonormal Bases ...
Examples of dictionaries include wavelet packet bases, local trigonometric bases, and local Fourier bases.
It costs $O(n[\log n]^p)$ to generate a dictionary for a signal of length $n$ ($p = 1$ for wavelet packets, $p = 2$ for LTB/LFB).
Each dictionary may contain up to $n(1 + \log_2 n)$ basis vectors and more than $2^n$ possible orthonormal bases.
How to select the best possible basis for the problem at hand is a key issue.
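As a concrete example of such a dictionary, a wavelet packet tree can be generated with the PyWavelets package (an assumption of this sketch, not software named in the talk); 'coif2' is a 12-tap Coiflet filter:

```python
import numpy as np
import pywt  # PyWavelets

x = np.random.randn(128)                      # a signal of length n = 128
wp = pywt.WaveletPacket(data=x, wavelet='coif2', mode='periodization')
print(wp.maxlevel)                            # depth of the binary tree
# Each node of the tree holds the coefficients of one subspace Omega_{j,k}.
for node in wp.get_level(2, order='natural'):
    print(node.path, node.data.shape)
```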

Example of Local Basis Functions
[Figure: sample basis vectors of length 128 from six bases: Standard Basis, Haar Basis, Walsh Basis, C12 Wavelet Packet Basis, Local Sine Basis, Discrete Sine Basis]

Time-Frequency Characteristics
[Figure: time-frequency tilings (time 0-128, frequency 0-0.5) of the same six bases: Standard, Haar, Walsh, C12 Wavelet Packet, Local Sine, Discrete Sine]

Local Discriminant Basis
Let $\mathcal{D} = \{w_i\}_{i=1}^{N_w}$ be a time-frequency dictionary, and write $\mathcal{D} = \{B_j\}_{j=1}^{N_B}$ as a list of all possible orthonormal bases, where $B_j = (w_{j_1}, \dots, w_{j_n})$.
Let $M^+(B_j)$ be a measure of the efficacy of $B_j$ for discrimination.
Then the local discriminant basis (LDB) can be written as $\Psi = \arg\max_{B_j \in \mathcal{D}} M^+(B_j)$.
Proposition: There exists a fast algorithm (divide-and-conquer, $O(n)$) to find $\Psi$ from each dictionary $\mathcal{D}$ if $M^+$ is additive: $M^+(\emptyset) = 0$, $M^+(\{w_{j_1}, \dots, w_{j_n}\}) = \sum_{i=1}^n M^+(w_{j_i})$.
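A sketch of the divide-and-conquer best-basis search for an additive measure, on a toy dictionary tree indexed by $(j, k)$ as in the diagram above; the cost table and all names are hypothetical:

```python
import numpy as np

def best_basis(cost, node=(0, 0), max_level=3):
    """Divide-and-conquer search for the basis maximizing an additive M+.
    cost[(j, k)] is M+ evaluated on the coefficients in subspace Omega_{j,k}.
    Returns the chosen nodes and their total score."""
    j, k = node
    if j == max_level:                                # leaf: nothing to split
        return [node], cost[node]
    nodes_l, score_l = best_basis(cost, (j + 1, 2 * k), max_level)
    nodes_r, score_r = best_basis(cost, (j + 1, 2 * k + 1), max_level)
    if cost[node] >= score_l + score_r:               # parent beats best children
        return [node], cost[node]
    return nodes_l + nodes_r, score_l + score_r       # additivity of M+

# Toy usage: random per-node scores on a depth-3 dictionary tree.
rng = np.random.default_rng(1)
cost = {(j, k): float(rng.random() * 2 ** (3 - j))
        for j in range(4) for k in range(2 ** j)}
print(best_basis(cost))
```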

Discriminant Measures
For $w_i \in \mathcal{D}$, consider the projection $Z_i = w_i^T X$ of an input random signal $X \in \mathcal{X}$.
Let $\delta_i$ be the efficacy for discrimination (or discriminant power) of $w_i$.
Some possibilities for measuring the efficacy of the basis $B_j$:
$M^+(B_j) = \sum_{i=1}^n \delta_{j_i}$
$M^+(B_j) = \sum_{i=1}^k \delta_{(j_i)}$, where $\{\delta_{(j_i)}\}$ is the decreasing rearrangement of $\{\delta_{j_i}\}$ and $k < n$.
$M^+(B_j) = \sum_{i=1}^n \varepsilon_{j_i} \delta_{j_i}$, where $\varepsilon_{j_i} = 1$ if $E[Z_i^2] > \theta$, and $0$ otherwise.
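These three choices of $M^+$ might be coded as follows (a sketch; argument names are illustrative):

```python
import numpy as np

def efficacy_sum(delta, k=None, energies=None, theta=None):
    """M+(B) from the per-coordinate discriminant powers delta_i of a basis B:
    plain sum, sum of the k largest, or an energy-thresholded sum."""
    delta = np.asarray(delta, dtype=float)
    if k is not None:
        return np.sort(delta)[::-1][:k].sum()
    if theta is not None:
        return delta[np.asarray(energies) > theta].sum()
    return delta.sum()
```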

Discriminant Measures ...
What are the basic quantities to use for 1D efficacy?
Differences in normalized energies of $Z_i$ among classes: $V_i^{(y)} = \dfrac{E[Z_i^2 \mid Y = y]}{\sum_{i=1}^n E[Z_i^2 \mid Y = y]}$, estimated by $\hat{V}_i^{(y)} = \dfrac{\sum_{k=1}^{N_y} |w_i^T x_k^{(y)}|^2}{\sum_{k=1}^{N_y} \|x_k^{(y)}\|^2}$.
Differences in probability density functions (pdfs) of $Z_i$: $q_i^{(y)}(z) = \int_{w_i^T x = z} p(x \mid y)\, dx$, estimated by $\hat{q}_i^{(y)}(z)$ via, e.g., ASH.
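A sketch of the empirical normalized energies, assuming the signals are rows of X and the basis vectors are columns of an orthonormal matrix Psi (so the total coefficient energy equals the total signal energy in the denominator):

```python
import numpy as np

def normalized_energies(X, y, Psi):
    """hat-V_i^(y): per-class energy of each expansion coefficient, normalized
    by that class's total energy."""
    coeffs = X @ Psi   # coeffs[k, i] = w_i^T x_k
    return {label: (coeffs[y == label] ** 2).sum(axis=0)
                   / (coeffs[y == label] ** 2).sum()
            for label in np.unique(y)}
```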

Discriminant Measures ...
Some possibilities for discrepancy measures between two pdfs $p(x)$, $q(x)$:
Relative entropy (Kullback-Leibler divergence): $D_{KL}(p, q) = \int p(x) \log_2 \dfrac{p(x)}{q(x)}\, dx$
Hellinger distance: $D_H(p, q) = \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx$
$L^2$ distance: $D_2(p, q) = \int \left( p(x) - q(x) \right)^2 dx$
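A small sketch computing the three discrepancy measures between two pdfs discretized on a common grid (the eps guard against division by zero is an implementation detail, not part of the definitions):

```python
import numpy as np

def discrepancies(p, q, dz, eps=1e-12):
    """Relative entropy, Hellinger distance, and L2 distance between two pdfs
    sampled on a common grid with spacing dz."""
    kl = np.sum(p * np.log2((p + eps) / (q + eps))) * dz
    hellinger = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dz
    l2 = np.sum((p - q) ** 2) * dz
    return kl, hellinger, l2

# Toy usage: two Gaussian-shaped densities on [-5, 5].
z = np.linspace(-5, 5, 201)
dz = z[1] - z[0]
p = np.exp(-0.5 * z ** 2); p /= p.sum() * dz
q = np.exp(-0.5 * (z - 1.0) ** 2); q /= q.sum() * dz
print(discrepancies(p, q, dz))
```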

Discriminant Measures ...
Using normalized energy: $\delta_i = \hat{V}_i^{(1)} \log_2 \dfrac{\hat{V}_i^{(1)}}{\hat{V}_i^{(2)}}$, $\left( \sqrt{\hat{V}_i^{(1)}} - \sqrt{\hat{V}_i^{(2)}} \right)^2$, or $\left( \hat{V}_i^{(1)} - \hat{V}_i^{(2)} \right)^2$.
Using empirical pdfs: $\delta_i = D_{KL}(\hat{q}_i^{(1)}, \hat{q}_i^{(2)})$, $D_H(\hat{q}_i^{(1)}, \hat{q}_i^{(2)})$, or $D_2(\hat{q}_i^{(1)}, \hat{q}_i^{(2)})$.

Discriminant Measures ...
It is also important to consider the two-dimensional projection.
[Figure: two classes labeled 1 and 2 plotted in the $(x_1, x_2)$ plane]
Applicable to the complex coefficients of local Fourier bases.

Example 1: Signal Shape Classification
Objective: Classify synthetic signals of length 128 into three possible classes:
$c(i) = (6 + \eta)\, \chi_{[a,b]}(i) + \epsilon(i)$ for "cylinder",
$b(i) = (6 + \eta)\, \chi_{[a,b]}(i)\, (i - a)/(b - a) + \epsilon(i)$ for "bell",
$f(i) = (6 + \eta)\, \chi_{[a,b]}(i)\, (b - i)/(b - a) + \epsilon(i)$ for "funnel",
where $a \sim \mathrm{unif}[16, 32]$, $b - a \sim \mathrm{unif}[32, 96]$, $\eta \sim N(0, 1)$, and $\epsilon(i)$ are i.i.d. $N(0, 1)$.
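A sketch of a generator for these synthetic signals, following the formulas above (function and variable names are illustrative):

```python
import numpy as np

def cbf_signal(shape, n=128, rng=None):
    """One synthetic 'cylinder', 'bell', or 'funnel' signal of length n."""
    if rng is None:
        rng = np.random.default_rng()
    i = np.arange(1, n + 1, dtype=float)
    a = rng.uniform(16, 32)
    b = a + rng.uniform(32, 96)
    eta, eps = rng.normal(), rng.normal(size=n)
    chi = ((i >= a) & (i <= b)).astype(float)          # indicator of [a, b]
    ramp = {'cylinder': np.ones(n),
            'bell': (i - a) / (b - a),
            'funnel': (b - i) / (b - a)}[shape]
    return (6 + eta) * chi * ramp + eps

# 100 training signals per class, as in the experiment described below.
rng = np.random.default_rng(0)
train = {s: np.array([cbf_signal(s, rng=rng) for _ in range(100)])
         for s in ('cylinder', 'bell', 'funnel')}
```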

Example Waveforms
[Figure: sample "Cylinder", "Bell", and "Funnel" signals of length 128]

Example 1: Signal Shape Classification ...
[Figure: top 10 LDFs, top 10 LDB vectors, and the time-frequency nodes selected by LDB]

Example 1: Signal Shape Classification ...
Table of misclassification rates (averaged over 10 simulations):

  Method (Coordinates)     Training Data    Test Data
  LDA on STD                   0.83 %        12.31 %
  CT on STD                    2.83 %        11.28 %
  LDA on Top 10 LDB            7.00 %         8.37 %
  CT on Top 10 LDB             2.67 %         5.54 %

100 training signals and 1000 test signals were generated for each class per simulation.
A 12-tap Coiflet filter was used for LDB.

Example 1: Signal Shape Classification ...
[Figure: full classification tree on the top 10 LDB coordinates, splitting on coordinates ldc.1, ldc.3, ldc.5, ldc.7, and ldc.9; leaves are labeled cylinder, bell, or funnel with their misclassification counts]

Example 2: Classification of Geophysical Acoustic Waveforms
402 acoustic waveforms (256 time samples each) were recorded at a gas-producing well.
The region consists of sand or shale layers. Sand layers contain either water or gas.
[Figure: the 402 recorded waveforms displayed as an image, trace index versus time 0-0.0025 sec]

Sonic Waveform Measurement
[Figure: schematic of the borehole tool with transmitter (T) and receiver (R) spanning Layer #1 and Layer #2]

Objectives
Can we classify acoustic waveforms in terms of the mineral contents of layers?
Can we automate the classification process?
If so, what features (or wave components) are important?
[Figure: a single recorded waveform, pressure (Pa) versus time 0-0.005 sec, with the P, S, and Stoneley wave arrivals marked]

Classification Procedure
Adopt 10-fold cross-validation: split the waveforms randomly into training and test datasets, and repeat the experiments.
Decompose the training waveforms into the local sine dictionary.
Choose the LDB using various discrepancy measures.
Choose the 25, 50, 75, or 100 most discriminant coordinates.
Construct classifiers (LDA and Classification Tree) using these coordinates.
Decompose the test waveforms into the LDB and classify them.
Compute the average misclassification rates.
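A sketch of this evaluation loop using scikit-learn classifiers (an assumption of this sketch); select_basis is a hypothetical stand-in for the LDB search, which must be rerun on each training fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

def cv_error(X, y, select_basis, clf_factory, n_features=25, n_splits=10):
    """Average test misclassification rate over cross-validation folds.
    select_basis(X, y) -> (Psi, ranked_idx): hypothetical LDB search returning
    an orthonormal matrix (columns = basis vectors) and coordinates ranked by
    discriminant power; the basis is chosen on the training fold only."""
    errors = []
    for train, test in StratifiedKFold(n_splits=n_splits, shuffle=True).split(X, y):
        Psi, ranked_idx = select_basis(X[train], y[train])
        idx = ranked_idx[:n_features]
        F_train = (X[train] @ Psi)[:, idx]
        F_test = (X[test] @ Psi)[:, idx]
        clf = clf_factory().fit(F_train, y[train])
        errors.append(1.0 - clf.score(F_test, y[test]))
    return np.mean(errors)

# e.g.: cv_error(X, y, select_basis, LinearDiscriminantAnalysis, n_features=50)
#       cv_error(X, y, select_basis, DecisionTreeClassifier, n_features=50)
```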

Misclassification Rates for Test Data
[Figure: misclassification rate (%) versus number of LDB features (25, 50, 75, 100) for the different discrepancy measures; solid lines: LDA, dotted lines: CT]

The Best LDB Vectors
[Figure: the best LDB vectors plotted against time, 0-0.0025 sec]

The Best LDB Pattern
[Figure: the time-frequency pattern of the best LDB, plotted against time, 0-0.0025 sec]

The Best LDB Vectors ...
[Figure: the top 10 most influential LDB vectors in LDA, plotted against time, 0-0.0025 sec]

Observations
LDA on LDB gave better results than CT on LDB $\Rightarrow$ the features are oblique.
The top 25 LDB features were clearly not sufficient.
Supplying too many features degraded the performance.
The most important LDB vectors are clustered around the P-wave components.

Improved LDB with Empirical Probability Density Estimation
This can be viewed as a specific yet fast version of the Projection Pursuit algorithm (Friedman-Tukey, Huber).
Compared to the top-down strategy of PP, LDB is not greedy; it takes a bottom-up approach.
This approach also works for complex-valued expansion coefficients (e.g., local Fourier bases).
How to estimate the empirical pdf? Histograms, ASH (averaged shifted histograms), nearest neighbor, kernel-based methods, ...
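For instance, an averaged shifted histogram estimate of the pdf of one coordinate could look like this (a minimal sketch; the bin width h and number of shifts m are tuning knobs, not values from the talk):

```python
import numpy as np

def ash_density(samples, grid, h=0.5, m=8):
    """Averaged shifted histogram (ASH) estimate of a 1-D pdf on `grid`:
    average m ordinary histograms of bin width h, each shifted by h/m."""
    est = np.zeros_like(grid, dtype=float)
    for s in range(m):
        shift = (s / m - 0.5) * h
        edges = np.arange(grid.min() - h + shift, grid.max() + 2 * h + shift, h)
        hist, _ = np.histogram(samples, bins=edges, density=True)
        idx = np.clip(np.digitize(grid, edges) - 1, 0, len(hist) - 1)
        est += hist[idx]
    return est / m

# Toy usage: estimate the pdf of one coordinate for one class.
z = np.random.default_rng(2).normal(size=500)
grid = np.linspace(-4, 4, 161)
pdf_hat = ash_density(z, grid)
```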

Conclusion
LDB functions can capture relevant local features in data.
Interpretation of the results becomes easier and more intuitive.
The computational cost is at most $O(n[\log n]^2)$.
These methods enhance both traditional and modern statistical methods.

Conclusion ...
Our algorithms are being applied and tested on:
Geophysical signal/image classification at Schlumberger
Noise reduction in hearing aids at Northwestern University
Diagnostics of mammography at the University of Chicago Hospital
Radar target discrimination at Lockheed-Martin

Future Directions
What is a good basis for clustering?
Parameterization of low-dimensional structures in high-dimensional space
Explore the relationship with the David-Jones-Semmes geometric analysis
Develop a two-dimensional version for the complex coefficients provided by the local Fourier bases or brushlets