A Nonlinear Extension of the MACE Filter


John W. Fisher III and Jose C. Principe
Computational NeuroEngineering Laboratory
Department of Electrical Engineering
University of Florida
fisher@synapse.ee.ufl.edu, principe@brain.ee.ufl.edu

Abstract - The minimum average correlation energy (MACE) filter, which is linear and shift invariant, has been used extensively in the area of automatic target detection and recognition (ATD/R). We present a nonlinear extension of the MACE filter, based on a statistical formulation of the optimization criterion, of which the linear MACE filter is a special case. A method by which nonlinear topologies can be incorporated into the filter design is presented and adaptation issues are discussed. In particular, we outline a method by which training exhaustively over the image plane is avoided, which leads to much shorter adaptation. Experimental results, using target chips from 35 GHz TABILS 24 inverse synthetic aperture radar (ISAR) data, are presented and performance comparisons are made between the MACE filter and this nonlinear extension.

Acknowledgement: This work was partially supported by ARPA grant N60921-93-C-A335.

Corresponding author: John W. Fisher III, 405 CSE, BLDG 42, University of Florida, Gainesville, FL 32611; (904) 392-9662; (904) 392-0044 (FAX); fisher@synapse.ee.ufl.edu

Keywords - correlation filters, MACE filter, ISAR, automatic target recognition.

1.0 Introduction

In the area of automatic target detection and recognition (ATD/R), it is not only desirable to recognize various targets, but to locate them with some degree of resolution. The minimum average correlation energy (MACE) filter (Mahalanobis et al, 1987) is of interest to the ATD/R problem due to its localization and discrimination properties. Correlation filters, of which the MACE is an example, have been widely used in optical pattern recognition in recent years. Our current interest is in the application of these types of filters to high-resolution synthetic aperture radar (SAR) imagery. Some recent articles have appeared showing experimental results using correlation filters on SAR data. Mahalanobis (Mahalanobis, Forman et al, 1994) used MACE filters in combination with distance classifier methods to discriminate five classes of vehicles and reject natural clutter as well as a confusion vehicle class, with fairly good results. Novak (Novak et al, 1994) presents results comparing the performance of several classifiers, including the MACE filter, on SAR data.

The MACE filter is a member of a family of correlation filters derived from the synthetic discriminant function (SDF) (Hester and Casasent, 1980). Other generalizations of the SDF include the minimum variance synthetic discriminant function (MVSDF) (Kumar et al, 1988), the MACE filter, and more recently the gaussian minimum average correlation energy (G-MACE) (Casasent et al, 1991) and the minimum noise and correlation energy (MINACE) (Ravichandran and Casasent, 1992) filters. All of these filters are linear and shift-invariant (we refer here to the signal processing definition of shift invariance, that is, an operator is said to be shift invariant if a shift in the input results in a corresponding shift in the output) and can be formulated as a quadratic optimization subject to a set of linear constraints in either the sample or spectral domain. The solution to these problems is obtained using the method of Lagrange multipliers. Kumar (Kumar, 1992) gives an excellent review of these filters.

The bulk of the research using these types of filters has concentrated on optical and infra-red (IR) imagery and on overcoming recognition problems in the presence of distortions associated with 3-D to 2-D mappings, i.e. scale and rotation.

Usually, several exemplars from the recognition class are used to represent the range of distortions over which the filter is to be used. Although the distortions in SAR imagery do not occur in the same way, that is, a change in target aspect does not manifest exactly as a rotation in the SAR image, exemplars may still be sufficient to model a single target class over a range of target aspects and relative depression angles.

Our focus is on the MACE filter and its variants because they are designed to produce a narrow, constrained-amplitude peak response when the filter mask is centered on a target in the recognition class, while minimizing the energy in the rest of the output plane. The filter can be modified to produce a low variance output for a designated rejection class as well. Another property of the MACE filter is that the constrained peak output is guaranteed, over the training exemplars, to be the maximum in the output image plane (Mahalanobis et al, 1987). Since the MACE filter is linear, it can only be used to realize linear discriminant functions. Along with its desirable properties, it has been shown to be limited in its ability to generalize to between-aspect exemplars that are in the recognition class (but not in the training set) while simultaneously rejecting out-of-class inputs (Ravichandran and Casasent, 1992)(Casasent and Ravichandran, 1992)(Casasent et al, 1991). The number of design exemplars can be increased in order to overcome generalization problems; however, the computation of the filter coefficients becomes computationally prohibitive and numerically unstable as the number of design exemplars is increased (Kumar, 1992). The MINACE and G-MACE variations have improved generalization properties with a slight degradation in the average output plane variance and sharpness of the central peak, respectively.

In the sample domain, the SDF family of correlation filters is equivalent to a cascade of a linear pre-processor followed by a linear correlator (Mahalanobis et al, 1987)(Kumar, 1992). Fisher and Principe (1994) showed that this is equivalent to a pre-processor followed by a linear associative memory (LAM), illustrated in figure 1 with vector operations. The pre-processor, in the case of the MACE filter, is a pre-whitening filter computed on the basis of the average power spectrum of the recognition class training exemplars.

Mahalanobis et al (Mahalanobis et al, 1987) use the term synthetic discriminant function (SDF) to refer to the LAM portion of the filter decomposition.

FIGURE 1. Decomposition of an SDF-type filter in the space domain, assuming the image and filter coefficients have been re-ordered into vectors. The input image vector, x, is pre-processed by the linear transformation y = Ax. The resulting vector is processed by a linear associative memory (LAM), y_out = h^T y.

We use the associative memory viewpoint for investigating extensions to the MACE filter. It is well known that nonlinear associative memory structures can outperform their linear counterparts on the basis of generalization and dynamic range (Kohonen, 1988)(Hinton and Anderson, 1981). In general, they are more difficult to design, as their parameters cannot be computed in closed form. The parameters for a large class of nonlinear associative memories can, however, be determined by gradient search techniques. In this paper we discuss a nonlinear extension of the MACE filter that shows promise in overcoming some of the problems described. In our development we show that the performance of the linear MACE filter can be improved upon in terms of generalization while maintaining its desirable properties, i.e. a sharp, constrained peak at the center of the output plane. We present experimental results using a simple nonlinear modification of the MACE filter: we replace the LAM portion of the filter with a nonlinear associative memory structure, specifically a feedforward multi-layer perceptron (MLP), which retains the shift invariance properties but yields improved performance via a nonlinear discriminant function. In section 2.0 we review the MACE filter formulation and its relationship to associative memories. In section 3.0 we develop a generalized statistical filter structure of which the linear MACE filter is a special case. Section 4.0 details experimental results using TABILS 24 inverse synthetic aperture radar (ISAR) imagery.

We compare the performance of the linear MACE filter to a nonlinear extension. We draw our conclusions and observations in section 5.0.

2.0 MACE Filter as an Associative Memory

In the original development, SDF type filters were formulated using correlation operations, although a convolutional approach can be easily adopted. The output, g(n_1, n_2), of a correlation filter is determined by

g(n_1, n_2) = \sum_{m_1=0}^{N_1-1} \sum_{m_2=0}^{N_2-1} x^*(n_1+m_1, n_2+m_2) h(m_1, m_2) = x^*(n_1, n_2) ∘ h(n_1, n_2),

where x^*(n_1, n_2) is the complex conjugate of the input image with N_1 × N_2 region of support and h(n_1, n_2) represents the filter coefficients. The MACE filter is formulated as follows (Mahalanobis et al, 1987). Given a set of image exemplars, {x_i ∈ R^{N_1 × N_2}; i = 1, ..., N_t}, we wish to find filter coefficients, h ∈ R^{N_1 × N_2}, such that the average correlation energy at the output of the filter,

E = \frac{1}{N_t} \sum_{i=1}^{N_t} \frac{1}{N_1 N_2} \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} |g_i(n_1, n_2)|^2,    (1)

is minimized subject to the constraints

g_i(0, 0) = \sum_{m_1=0}^{N_1-1} \sum_{m_2=0}^{N_2-1} x_i^*(m_1, m_2) h(m_1, m_2) = d_i ;  i = 1, ..., N_t.    (2)
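To make the criterion concrete, the following is a minimal numerical sketch of the quantities in equations (1) and (2), using circular correlation computed via the 2-D DFT. The NumPy implementation and function names are our own illustrative assumptions, not code from the original work, and real-valued imagery is assumed.

```python
import numpy as np

def correlation_plane(x, h):
    """Circular correlation output g(n1, n2) of a real image x with a real filter h, via the 2-D DFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(h))))

def average_correlation_energy(exemplars, h):
    """Criterion (1): average over the exemplars of the per-pixel output-plane energy."""
    N1, N2 = h.shape
    return np.mean([np.sum(correlation_plane(x, h) ** 2) / (N1 * N2) for x in exemplars])

def origin_responses(exemplars, h):
    """Constraint values (2): the correlation output at the origin, one per exemplar."""
    return np.array([correlation_plane(x, h)[0, 0] for x in exemplars])
```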

Mahalanobis (Mahalanobis et al, 1987) reformulates this as a vector optimization in the spectral domain using Parseval's theorem. Let X ∈ C^{N_1 N_2 × N_t} be a matrix whose columns contain the 2-D DFT coefficients of the exemplars {x_1, ..., x_{N_t}} reordered into column vectors. Let D_i ∈ R^{N_1 N_2 × N_1 N_2} be a diagonal matrix whose diagonal elements contain the magnitude squared of the 2-D DFT coefficients of the i-th exemplar. The diagonal elements of the matrix

D = \frac{1}{N_t} \sum_{i=1}^{N_t} D_i    (3)

are then the average power spectrum of the training exemplars. The solution to this optimization problem can be found using the method of Lagrange multipliers. In the spectral domain, the filter that satisfies the constraints of equation (2) and minimizes the criterion of equation (1) (Mahalanobis et al, 1987)(Kumar, 1992) is

H = (N_1 N_2) D^{-1} X (X^† D^{-1} X)^{-1} d,    (4)

where H ∈ C^{N_1 N_2 × 1} contains the 2-D DFT coefficients of the filter, assuming the nonunitary 2-D DFT as defined in (Oppenheim and Schafer, 1989), re-ordered into a column vector, X^† denotes the conjugate transpose of X, and d ∈ R^{N_t × 1} contains the desired outputs, d_i, one for each exemplar. This formulation can be easily cast as an associative memory.
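Equations (3) and (4) translate into a few lines of linear algebra. The sketch below is again an assumed NumPy illustration (function name, small epsilon safeguard, and data layout are ours) for real-valued exemplar chips under the nonunitary DFT convention of the text.

```python
import numpy as np

def mace_filter(exemplars, d, eps=1e-12):
    """Sketch of the closed-form MACE solution, equations (3) and (4), for real-valued exemplars.

    exemplars : sequence of Nt arrays, each of shape (N1, N2)
    d         : length-Nt vector of desired origin outputs
    Returns the space-domain filter coefficients h with shape (N1, N2).
    """
    N1, N2 = exemplars[0].shape
    # Columns of X hold the 2-D DFT coefficients of each exemplar, reordered into vectors.
    X = np.stack([np.fft.fft2(x).ravel() for x in exemplars], axis=1)      # (N1*N2, Nt)
    # Diagonal of D: average power spectrum of the training exemplars, equation (3).
    Ddiag = np.mean(np.abs(X) ** 2, axis=1) + eps                          # eps: numerical safeguard
    Dinv_X = X / Ddiag[:, None]                                            # D^{-1} X
    A = X.conj().T @ Dinv_X                                                # X^dagger D^{-1} X
    H = (N1 * N2) * Dinv_X @ np.linalg.solve(A, d.astype(complex))         # equation (4)
    return np.real(np.fft.ifft2(H.reshape(N1, N2)))
```

Under these conventions, the constraint values of equation (2) for the returned filter are recovered (up to numerical precision) by the origin_responses sketch given earlier.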

In general, associative memories are mechanisms by which patterns can be related to one another, typically in an input/output pair-wise fashion. From a signal processing perspective we view associative memories as projections (Kung, 1992), linear and nonlinear. The input patterns exist in a vector space and the associative memory projects them onto a new space. Kohonen's linear associative memory (Kohonen, 1988) is formulated exactly in this way. A simple form of the linear associative memory (the hetero-associative memory) maps vectors to scalars; that is, given a set of input/output vector/scalar pairs {x_i ∈ R^{N × 1}, d_i ∈ R, i = 1, ..., N_t}, find the linear projection, h, such that

h^T x = d^T    (5)

and, in the under-determined case, the product

h^T h    (6)

is minimized, while for the over-determined case h is found such that

(h^T x - d^T)(h^T x - d^T)^T    (7)

is minimized. The columns of the matrix x = [x_1 ... x_{N_t}] contain the input vectors and the elements of the vector d = [d_1 ... d_{N_t}]^T contain the associated desired output scalars. The optimal solution for the under-determined case, using the pseudo-inverse of x, is (Kohonen, 1988)

h = x (x^T x)^{-1} d.    (8)

As was shown in (Fisher and Principe, 1994), if we modify this linear associative memory model slightly by adding a pre-processing linear transformation matrix, A, and find h such that the under-determined system of equations

h^T (A x) = d^T    (9)

is satisfied while h^T h is minimized, we get the result

h = A x (x^T A^T A x)^{-1} d.    (10)

If the pre-processing transformation, A, is the space-domain equivalent of the MACE filter's spectral pre-whitening filter, then equation (10) combined with the pre-processing transformation yields exactly the space domain coefficients of the MACE filter when the input vectors, x, are the re-ordered elements of the original images.
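The closed forms (8) and (10) can be sketched as follows; the function names and the minimal NumPy formulation are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def lam_weights(x, d):
    """Minimum-norm solution of h^T x = d^T, equation (8): h = x (x^T x)^{-1} d."""
    return x @ np.linalg.solve(x.T @ x, d)

def preprocessed_lam_weights(A, x, d):
    """Equation (10): pre-process the exemplar matrix with A, then solve the LAM problem.
    With A chosen as the space-domain pre-whitener of the MACE filter, this cascade
    reproduces the space-domain MACE coefficients (as stated in the text)."""
    Ax = A @ x
    return Ax @ np.linalg.solve(Ax.T @ Ax, d)
```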

3.0 Nonlinear Extension of the MACE Filter

The MACE filter is the best linear system that minimizes the energy in the output correlation plane subject to a peak constraint at the origin. One of the advantages of linear systems is that we have the mathematical tools to use them in optimal operating conditions. Such optimality conditions, however, should not be confused with the best possible performance. In the case of the MACE filter one drawback is poor generalization. A possible approach to designing a nonlinear extension of the MACE filter that improves on the generalization properties is to simply substitute the linear processing elements of the LAM with nonlinear elements. Since such a system can be trained with error backpropagation, the issue would be simply to report on performance comparisons with the MACE. Such a methodology does not, however, lead to understanding of the role of the nonlinearity, and does not elucidate the tradeoffs in the design and in training.

Here we approach the problem from a different perspective. We seek to extend the optimality condition of the MACE to a nonlinear system, i.e. the energy in the output space is minimized while maintaining the peak constraint at the origin. Hence we will impose these constraints directly in the formulation, even knowing a priori that an analytical solution is very difficult or impossible to obtain. We reformulate the MACE filter from a statistical viewpoint and generalize it to arbitrary mapping functions, linear and nonlinear.

We begin with a random vector, x ∈ R^{N_1 N_2 × 1}, which is representative of the rejection class, and a set of observations of the random vector, placed in the N_1 N_2 × N_t matrix x_o ∈ R^{N_1 N_2 × N_t}, which represent the target sub-class. We wish to find the parameters, α, of a mapping, g(α, x): R^{N_1 N_2 × 1} -> R, such that we may discriminate target vectors from vectors in the general rejection class. In this sense the mapping function, g, constrains the discriminator topology. Towards this goal, we wish to minimize the objective function

J = E(g(α, x)^2)

over the mapping parameters, α, subject to the system of constraints

g(α, x_o) = d^T,    (11)

where d ∈ R^{N_t × 1} is a column vector of desired outputs. It is assumed that the mapping function is applied to each column of x_o, and E(·) is the expected value function. Using the method of Lagrange multipliers, we can augment the objective function as

J = E(g(α, x)^2) + (g(α, x_o) - d^T) λ,    (12)

where the mapping is again assumed to be applied to each column of x_o. Computing the gradient with respect to the mapping parameters yields

∂J/∂α = 2 E( g(α, x) ∂g(α, x)/∂α ) + (∂g(α, x_o)/∂α) λ.    (13)

Equation (13), along with the constraints of equation (11), can be used to solve for the optimal parameters, α_o, assuming our constraints form a consistent set of equations. This is, of course, dependent on the network topology. For arbitrary nonlinear mappings it will, in general, be very difficult to solve for globally optimal parameters analytically. Our initial goal, instead, is to develop topologies and adaptive training algorithms which are practical and yield improved generalization over the linear mappings.

It is interesting to verify that this formulation yields the MACE filter as a special case. If, for example, we choose the mapping to be a linear projection of the input image, that is,

g(α, x) = α^T x ;  α = [h_1 ... h_{N_1 N_2}]^T ∈ R^{N_1 N_2 × 1},

then equation (12) becomes, after simplification,

J = α^T E(x x^T) α + (α^T x_o - d^T) λ.    (14)

In order to solve for the mapping parameters, α, we are still left with the task of computing the term E(x x^T) which, in general, we can only estimate from observations of the random vector, x. Assuming that we have a suitable estimator, the well known solution to the minimum of equation (14) over the mapping parameters, subject to the constraints of equation (11), is

α = R̂_x^{-1} x_o (x_o^T R̂_x^{-1} x_o)^{-1} d,  where  R̂_x = estimate{E(x x^T)}.    (15)

Depending on the characterization of x, equation (15) describes various SDF-type filters (i.e. MACE, MVSDF, etc.). In the case of the MACE filter, the random vector, x, is characterized by all 2-D circular shifts of target class images away from the origin. Solving for the MACE filter coefficients is therefore equivalent to using the average circular autocorrelation sequence (or equivalently the average power spectrum in the frequency domain) over images in the target class as the estimator of the elements of the matrix E(x x^T). Sudharsanan et al (Sudharsanan et al, 1990) suggest a very similar methodology for improving the performance of the MACE filter. In that case the average linear autocorrelation sequence is estimated over the target class and this estimator of E(x x^T) is used to solve for linear projection coefficients in the space domain. The resulting filter is referred to as the SMACE (space-domain MACE) filter.

As stated, our goal is to find mappings, defined by a topology and a parameter set, which improve upon the performance of the MACE filter in terms of generalization while maintaining a sharp constrained peak in the center of the output plane for images in the recognition class. One approach, which leads to an adaptive algorithm, is to approximate the original objective function of equation (12) with the modified objective function

J = (1 - β) E(g(α, x)^2) + β [g(α, x_o) - d^T][g(α, x_o) - d^T]^T.    (16)

The principal advantage gained by using equation (16) over equation (12) is that we can solve adaptively for the parameters of the mapping function (assuming it is differentiable). The constraint equations, however, are no longer satisfied with equality over the training set. Varying β in the range [0, 1] controls the degree to which the average response to the rejection class is emphasized versus the variance about the desired output over the recognition class.
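As a sketch of how the modified objective (16) can be estimated from samples during adaptation (the paper gives no code; the helper name and the sample-based estimate of the expectation are our assumptions), the two terms are computed from the mapping's outputs on rejection-class samples and on the constrained exemplars:

```python
import numpy as np

def modified_objective(g_rejection, g_targets, d, beta):
    """Sample-based estimate of equation (16).

    g_rejection : mapping outputs on samples drawn from the rejection class
                  (e.g. pre-whitened shifts or, later, white-noise images)
    g_targets   : mapping outputs on the Nt recognition-class exemplars
    d           : desired outputs for the exemplars
    beta        : trade-off in [0, 1] between constraint error and rejection-class energy
    """
    rejection_energy = np.mean(np.asarray(g_rejection) ** 2)               # estimates E(g(alpha, x)^2)
    constraint_error = np.sum((np.asarray(g_targets) - np.asarray(d)) ** 2)
    return (1.0 - beta) * rejection_energy + beta * constraint_error
```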

In (Réfrégier and Figue, 1991) an optimal criterion trade-off method is presented. The authors show that the convex combination over the set of criteria describes a performance bound for the linear mapping. Mahalanobis (Mahalanobis, Kumar et al, 1994) extends this idea to unconstrained linear correlation filters. Further investigation will be required in order to explore the relationship and performance of these linear filters relative to the nonlinear mappings we are currently studying.

As in the linear case, we can only estimate the expected variance of the output due to the random vector input and its associated gradient. If, as in the MACE (or SMACE) filter formulation, x is characterized by all 2-D circular (or linear) shifts of the recognition class away from the origin, then this term can be estimated with a sampled average over the exemplars, x_o, for all such shifts. From an adaptive standpoint this leads to a gradient search method which trains exhaustively over the entire output plane. This becomes a computationally intensive problem for most nonlinear mappings. It is desirable, then, to find other equivalent characterizations of the rejection class which may alleviate the computational load without significantly impacting performance. This issue is addressed in later sections.

3.1 Architecture

A block diagram of the proposed nonlinear extension is shown in figure 2. In the pre-processor/LAM decomposition of the MACE filter, the LAM structure was replaced with a feed-forward multi-layer perceptron (MLP). The pre-processor remains a linear, shift-invariant pre-whitening transformation, R^{N_1 N_2 × 1} -> R^{N_1 N_2 × 1}, yielding a pre-whitened space domain image. The MLP has two hidden layers; the input layer has N_1 N_2 nodes, corresponding to an input mask with N_1 × N_2 support in the image domain. The first hidden layer has two nodes and can be implemented as two correlators followed by nonlinear elements. The outputs of these elements feed into four nodes on the second hidden layer, which nonlinearly combine the two features, followed by a single output node. The nonlinearity is the logistic function. Since the mapping is R^{N_1 N_2 × 1} -> R, we must, of course, apply the filter input mask to each location in the original input image in order to obtain an output image.

FIGURE 2. Experimental nonlinear MACE (NL-MACE) structure: input image, pre-processor A, and MLP producing a scalar output.

The specific architecture was chosen for several reasons. The linear MACE filter extracts the optimal feature over the design exemplars for a linear discriminant function. Any linear combination of additional linear features will yield an equivalent linear feature. This means that the MACE filter is the best linear feature extractor for the target exemplars. Only a nonlinear system can improve on this design. The MLP structure has the advantage of providing an efficient means of nonlinearly combining an optimal linear feature with others. It is well known that a single hidden layer MLP can realize any smooth discriminant function of its inputs. If we view the output of each node in the first hidden layer as an extracted feature of the input image, then the second layer gives the capability of realizing any smooth discriminant function of the first hidden layer output.
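A minimal sketch of the forward pass of this topology (two first-layer nodes, four second-layer nodes, a single output node, logistic nonlinearities) for one pre-whitened input vector follows; the function names and the NumPy formulation are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def nl_mace_forward(x, W1, theta1, W2, theta2, W3):
    """Forward pass of the mapping in equation (17) for a single pre-whitened input vector x.

    W1 : (2, N1*N2)   first hidden layer (two linear feature extractors)
    W2 : (4, 2)       second hidden layer
    W3 : (1, 4)       output node
    theta1, theta2 : bias vectors of the first and second hidden layers
    """
    z1 = logistic(W1 @ x + theta1)      # two nonlinear features
    z2 = logistic(W2 @ z1 + theta2)     # nonlinear combination of the two features
    return logistic(W3 @ z2)            # scalar output (shape (1,))
```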

This division into feature extraction and discriminant function is illustrated in figure 3, where the linear outputs plus bias terms, f_1 + θ_1 and f_2 + θ_1, of the first hidden layer are the features of interest, and f(·) is the nonlinear logistic function.

FIGURE 3. Division of the pre-processor/MLP into a feature extraction stage and a discriminant function stage.

The division of figure 3 will be useful in later analysis. If the performance of the linear MACE filter can be improved, the addition of a single feature should be sufficient to illustrate this improvement. It is for this reason that we set the number of nodes to two on the first hidden layer, although more hidden nodes may lead to even better performance. Finally, the MLP with backpropagation provides a simple means for adapting the NL-MACE, although a globally optimal solution is not guaranteed. The mapping function of the NL-MACE can be written

g(α, x) = f(W_3 f(W_2 f(W_1 x + θ_1) + θ_2)),
α = {W_1 ∈ R^{2 × N_1 N_2}, W_2 ∈ R^{4 × 2}, W_3 ∈ R^{1 × 4}, θ_1, θ_2}.    (17)

Implicit in equation (17) is that the terms θ_1 and θ_2 are constant bias matrices with the appropriate dimensionality. It is also assumed that if the argument to the nonlinear function f(·) is a matrix then the nonlinearity is applied to each element of the matrix.

We can rewrite the linear input term, W_1 x, which is the only term with dependency on the input image (reordered into a column vector), as

W_1 x = [h_1^T x, ..., h_{N_h1}^T x]^T = [f_1(x), ..., f_{N_h1}(x)]^T,    (18)

where N_h1 is the number of hidden nodes in the first layer of the MLP (two in our specific case) and h_1^T, ..., h_{N_h1}^T ∈ R^{1 × N_1 N_2} are the rows of the matrix W_1. The elements of the result, {f_1(x), ..., f_{N_h1}(x)} ∈ R^{1 × N_1 N_2}, are recognized as the outputs, in vector form (Kumar et al, 1988)(Mahalanobis et al, 1987), of N_h1 purely real linear correlation filters operating in parallel; therefore the elements of this term are shift-invariant. Rewriting equation (17) as a function of its shift-invariant terms,

g(x) = f( W_3 f( W_2 f( [f_1(x), ..., f_{N_h1}(x)]^T + θ_1 ) + θ_2 ) ),    (19)

we can see that the output is a static function of shift-invariant input terms. Any shift in the input image will be reflected as a corresponding shift in the output image. The mapping is, therefore, shift invariant.
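Because the only image-dependent terms are the two first-layer correlations, the entire output plane can be obtained by correlating the pre-whitened image with the two masks and applying the remaining layers point-wise. The sketch below assumes real-valued imagery, circular correlation, and illustrative function names; it is not the authors' code.

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def correlation_plane(image, mask):
    """Full circular correlation plane of a real image with a real filter mask (via the 2-D DFT)."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.conj(np.fft.fft2(mask))))

def nl_mace_output_plane(image, h1, h2, theta1, W2, theta2, W3):
    """Output image of the shift-invariant mapping of equation (19).

    h1, h2 : first-layer masks reshaped to the image support (the rows of W1);
    the remaining layers are static and are applied point-wise to the two planes.
    """
    N1, N2 = image.shape
    planes = np.stack([correlation_plane(image, h1).ravel(),
                       correlation_plane(image, h2).ravel()])          # (2, N1*N2)
    z1 = logistic(planes + theta1.reshape(2, 1))
    z2 = logistic(W2 @ z1 + theta2.reshape(4, 1))
    return logistic(W3 @ z2).reshape(N1, N2)
```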

3.2 Avoiding Exhaustive Training

Training becomes an issue once the associative memory structure takes a nonlinear form. The output variance of the linear MACE filter is minimized for the entire output plane over the training exemplars. Even when the coefficients of the MACE filter are computed iteratively, we need only consider the output point at the designated peak location (constraint) for each pre-whitened training exemplar (Fisher and Principe, 1994). This is due to the fact that, for the under-determined case, the linear projection which satisfies the system of constraints with equality and has minimum norm is also the linear projection which minimizes the response to images with a flat power spectrum. This solution is arrived at naturally via a gradient search only at the constraint location.

This is no longer the case when the mapping is nonlinear. Adapting the parameters via gradient search on pre-whitened exemplars only at the constraint location will not, in general, minimize the variance in the output image. In order to minimize the variance over the entire output plane we must consider the response of the filter to each location in the input image, not just the constraint location. The brute force approach would be to adapt the parameters over the entire output plane, which would require N_1 N_2 N_t image presentations per training epoch. If such exhaustive training is done, then the pre-whitening stage seems unnecessary. The pre-whitening stage and the input layer weights could be combined into a single equivalent linear transformation; however, pre-whitening separately enables us to greatly reduce the number of image presentations during training. This can be explained as follows: due to the statistical formulation, we are only reducing the response of the NL-MACE filter to images with the second order statistics of the rejection class. If the exemplars have been pre-whitened, then the rejection class can be represented with random white images. Minimizing the response to these images, in the average, minimizes the response to shifts of the exemplar images since they have the same second-order statistics. In this way we do not have to train over the entire output plane exhaustively, thereby reducing training times in proportion to the input image size, N_1 N_2. Experimentally, the difference in convergence time was approximately 2300 epochs of N_1 N_2 N_t image presentations for exhaustive training versus 1800 epochs of (N_t + 4) image presentations (training exemplars plus 4 white noise images) for noise training, with nearly the same performance in both cases. This is obviously a considerable speedup in training for even moderate image sizes. In both cases, the resulting filters exhibit improved performance over the linear MACE filter in terms of generalization and output variance.
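The noise-training scheme described above can be sketched as follows. The helper name, the use of unit-variance white noise, and the choice of desired output 0 for the noise images are illustrative assumptions; the text specifies only that the exemplars are augmented with a small number of white-noise images (four in the experiments reported here).

```python
import numpy as np

def noise_training_batch(prewhitened_exemplars, d, n_noise=4, rng=None):
    """Inputs and targets for one epoch of noise training: the Nt pre-whitened
    exemplars with their constrained desired outputs d, plus a few white-noise
    images standing in for the rejection class (desired output 0, since the
    response to the rejection class is being driven toward zero on average)."""
    rng = np.random.default_rng(rng)
    shape = prewhitened_exemplars[0].shape
    noise_images = [rng.standard_normal(shape) for _ in range(n_noise)]
    inputs = list(prewhitened_exemplars) + noise_images
    targets = list(d) + [0.0] * n_noise
    return inputs, targets
```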

3.3 Linear Versus Nonlinear Discriminant Functions

Several observations were made during our experiments. It became apparent that linear solutions were a strong attractor. Examination of the input layer showed that the columns of W_1 were highly correlated. When this condition is true, although a nonlinear system is being used, the mapping of the image space to the feature space is confined to a narrow strip. The net result is that a mapping similar to the linear MACE filter could be achieved with a single node on the first hidden layer, and we have achieved a linear discriminant function with a complicated topology. Even if the resulting linear discriminant function yields better performance, there are much better and well documented methods for finding linear discriminant functions (Réfrégier and Figue, 1991)(Mahalanobis, Kumar et al, 1994).

In order to find, with the MLP, a nonlinear discriminant function of the image space, modifications were made to the adaptation procedure. The presumption here is that better performance (in terms of discrimination, localization, and generalization) can be achieved using a nonlinear discriminant function. It is certainly possible that in some input spaces the best discrimination can be achieved with a linear projection, but in a space as rich as the one in which we are working we believe that this will rarely be the case. The modification to the adaptation was to enforce orthogonality on the columns of the input layer weight matrix,

W_1 W_1^T = \begin{bmatrix} h_1^T h_1 & h_1^T h_2 \\ h_2^T h_1 & h_2^T h_2 \end{bmatrix} = \begin{bmatrix} ||h_1||^2 & 0 \\ 0 & ||h_2||^2 \end{bmatrix},

via Gram-Schmidt orthogonalization, where {W_1, h_1, h_2} are as in equations (17) and (18). This has two consequences. First, it guarantees that the mapping to the feature space is not rank-deficient, although it does not ensure that the discriminant function derived through gradient search will utilize the additional feature.

The second consequence is that, assuming we have pre-whitened input images over the rejection class, the extracted linear features will also be orthogonal, in the statistical sense, over the rejection class. Mathematically, this can be shown as follows:

E(W_1 x x^T W_1^T) = W_1 E(x x^T) W_1^T = \begin{bmatrix} h_1^T E(x x^T) h_1 & h_1^T E(x x^T) h_2 \\ h_2^T E(x x^T) h_1 & h_2^T E(x x^T) h_2 \end{bmatrix}.    (20)

As a consequence of the pre-whitening, the term E(x x^T) is of the form σ^2 I_{N_1 N_2}, where σ^2 is a scalar and I_{N_1 N_2} is the N_1 N_2 × N_1 N_2 identity matrix. Substituting into equation (20) gives

E(W_1 x x^T W_1^T) = \begin{bmatrix} h_1^T (σ^2 I_{N_1 N_2}) h_1 & h_1^T (σ^2 I_{N_1 N_2}) h_2 \\ h_2^T (σ^2 I_{N_1 N_2}) h_1 & h_2^T (σ^2 I_{N_1 N_2}) h_2 \end{bmatrix} = \begin{bmatrix} σ^2 h_1^T h_1 & σ^2 h_1^T h_2 \\ σ^2 h_2^T h_1 & σ^2 h_2^T h_2 \end{bmatrix} = \begin{bmatrix} σ^2 ||h_1||^2 & 0 \\ 0 & σ^2 ||h_2||^2 \end{bmatrix}.    (21)

It is fairly straightforward to show that any affine transformation of these features will also be uncorrelated. Since the MLP is nonlinearly combining orthogonal features, it will yield, in general, a nonlinear discriminant function.
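A small sketch of the Gram-Schmidt step applied to the two first-layer masks, together with a numerical check that orthogonal masks yield approximately uncorrelated features on white inputs, mirroring equations (20) and (21); the names, mask length, and sample size are illustrative assumptions.

```python
import numpy as np

def gram_schmidt_masks(h1, h2):
    """Orthogonalize the second mask against the first after a weight update,
    so that W1 W1^T is diagonal as required above."""
    h2_orth = h2 - (np.dot(h1, h2) / np.dot(h1, h1)) * h1
    return h1, h2_orth

# Numerical check mirroring equations (20)-(21): orthogonal masks give
# (approximately) uncorrelated features when the inputs are white.
rng = np.random.default_rng(0)
n = 64
h1, h2 = gram_schmidt_masks(rng.standard_normal(n), rng.standard_normal(n))
X = rng.standard_normal((n, 20000))        # columns: white rejection-class samples
features = np.vstack([h1, h2]) @ X         # the two extracted linear features
print(np.cov(features))                    # off-diagonal terms are near zero
```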

4.0 Experimental Results

For these experiments we used vehicle data from the TABILS 24 ISAR data set. The radar used for the data collection is a fully polarimetric, Ka-band radar. The ISAR imagery was processed with a polarimetric whitening filter (PWF) (Novak et al, 1993) and then logarithmically scaled to units of dBsm (dB square meters) prior to being used for our experiments. The data used was collected at a depression angle of 20 degrees, that is, the radar antenna was directed 20 degrees down from the horizon. ISAR images were extracted in the range 5 to 85 degrees azimuth in increments of 0.8 degrees. This resulted in 100 ISAR images (50 training, 50 testing). Images within both the training and testing sets were separated by 1.6 degrees.

FIGURE 4. Examples of ISAR imagery. Down range increases from left to right. The target vehicle is shown at aspects of 5 (left), 45 (middle), and 85 (right) degrees.
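The preprocessing and the aspect-interleaved split described above might be sketched as follows; the dBsm conversion shown is only a schematic log scaling (the exact PWF-to-dBsm calibration is not given in the text) and the function names are hypothetical.

```python
import numpy as np

def to_dbsm(pwf_intensity, eps=1e-12):
    """Schematic logarithmic scaling of PWF intensity imagery to dB square meters (dBsm)."""
    return 10.0 * np.log10(np.maximum(pwf_intensity, eps))

def split_by_aspect(aspect_ordered_chips):
    """Interleave aspect-ordered chips (0.8-degree spacing) into training and testing sets,
    so that images within each set are separated by 1.6 degrees."""
    return aspect_ordered_chips[0::2], aspect_ordered_chips[1::2]
```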

4.1 Experiment 1

In the first experiment, straight backpropagation training was conducted with no modification other than weighting the quadratic penalty term associated with the constraints in (16) by β = 0.93 and the output variance term by (1 - β) = 0.07. The coefficients converged to a solution after approximately 1200 runs through the entire training set. Examination of the input layer (the feature extracting layer) revealed that the coefficients associated with the first feature (the first column of the matrix W_1) were highly correlated with the coefficients of the second feature. In effect the MLP converged to a linear discriminant function. At best, the MLP was equivalent to choosing a threshold for a linear filter.

The resulting discriminant function is illustrated in figure 5, which shows a contour plot of the discriminant function with respect to the linear outputs of the first hidden layer, f_1 and f_2 of figure 3. Although the discriminant function implemented is nonlinear, the features are so highly correlated that all inputs are projected onto a single curve in the feature space. Further adaptation continued to increase the correlation of the features.

FIGURE 5. NL-MACE discriminant function with respect to the extracted feature mapping. The cluster in the lower left is the mapping of noise (asterisk) with the same second order statistics as the rejection class. The cluster in the upper right is the mapping of testing (plus) and training (diamonds) exemplars. Since the features are highly correlated, inputs are mapped to a single curve in the feature space, and the overall filter is effectively a linear discriminant function of the input image.

4.2 Experiment 2

In light of the results of the first experiment (and several other experiments not described here for brevity), a modification was made to the training algorithm that yielded a nonlinear discriminant function. During training, orthogonality between the columns of the matrix W_1 was enforced via a Gram-Schmidt procedure at each training iteration. The approximate convergence time was nearly the same as in the first case, but the resulting discriminant function was no longer linear, indicating that the second feature was utilized by the filter. The new discriminant function is plotted in figure 6. The features are no longer correlated, so the target exemplars and noise (rejection class) no longer lie on a single curve in the feature space.

The resulting filter is utilizing the second feature and the discriminant is not equivalent to a linear discriminant function.

FIGURE 6. Comparison of the discriminant function with respect to the extracted feature mapping when orthogonality of the features was enforced. The cluster in the upper left is the mapping of noise (asterisk) with the same second-order statistics as the rejection class. The cluster in the lower right is the mapping of testing (plus) and training (diamonds) exemplars. The mapping is no longer confined to a single curve in the feature space and the discriminant function is a nonlinear function of both features.

4.3 Performance Comparison

At this point, we are satisfied that the nonlinear associative memory structure is doing more than applying a threshold to the linear discriminant function. We now compare the performance of the linear MACE filter to our nonlinear extension with orthogonal features.

FIGURE 7. Sample responses of the linear MACE filter (left) as compared to the output of the nonlinear filter (right) given the same input. The samples shown include one training exemplar (top) and the adjacent testing exemplar (bottom).

Sample responses of both filters, linear and nonlinear, are shown in figure 7. One training set exemplar and one testing set exemplar are shown for both the linear MACE filter and the nonlinear filter. It is evident from the figure that the nonlinear filter appears to reduce the variance in the output plane (the correlation energy in the linear case) as compared to the linear filter, while still maintaining a sharp peak near the center point. Recall that at no time during training were shifted exemplars presented to the network, although, as in the MACE filter, the projection must be computed at all positions in the input image in order to compute the output image.

This response was typical for all exemplars. Localized peak and low variance properties were retained.

FIGURE 8. Peak (center) response of the linear MACE filter (left) compared to the output of the nonlinear filter (right) over the entire training set (top) and testing set (bottom), plotted as a function of vehicle aspect angle.

In figure 8 we show the peak response of both the linear and nonlinear filters for both the training and testing sets. In the case of the training set for the linear filter, the designed value is, of course, met exactly at the center point. The peak response over the training set always occurred at the center point for the nonlinear filter.

In order to determine the peak response for the testing set, for both the linear and nonlinear filters, we simply chose the peak response in the output plane. In all cases this point occurred within a 5 x 5 pixel area centered in the output plane, but was not necessarily the center point for the test set. It can be seen in the plot that the nonlinear filter appears to have better generalization properties than the linear filter.

FIGURE 9. Filter output plane pdfs (excluding the 5 x 5 pixel center region), estimated over the training exemplars, for the linear MACE (solid line) and the NL-MACE (dotted line).

In figure 9 we show the probability distribution of the output plane response, estimated (via the Parzen window method) from the testing exemplars. The linear MACE filter clearly exhibits a more significant tail in the distribution than does the nonlinear filter.
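For figure 9, a Parzen-window estimate of the output-plane distribution can be sketched as below, pooling output-plane values outside the 5 x 5 center region; the Gaussian kernel, its bandwidth, and the function name are illustrative assumptions since the paper does not specify them.

```python
import numpy as np

def output_plane_pdf(output_planes, grid, bandwidth=0.02, exclude=5):
    """Parzen-window (Gaussian-kernel) estimate of the output-plane amplitude pdf,
    pooled over a set of output planes and excluding an exclude-by-exclude region
    around the constrained center peak."""
    samples = []
    for plane in output_planes:
        N1, N2 = plane.shape
        keep = np.ones((N1, N2), dtype=bool)
        c1, c2, half = N1 // 2, N2 // 2, exclude // 2
        keep[c1 - half:c1 + half + 1, c2 - half:c2 + half + 1] = False
        samples.append(plane[keep].ravel())
    samples = np.concatenate(samples)
    norm = len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    # Evaluate the kernel sum one grid point at a time to keep memory modest.
    return np.array([np.exp(-0.5 * ((g - samples) / bandwidth) ** 2).sum() / norm
                     for g in grid])
```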

5.0 Remarks and Conclusions

We have presented a method by which the MACE filter can be extended to nonlinear processing. Any extension of the MACE filter must necessarily consider the entire output image plane. In the case of the nonlinear extension to the MACE filter, the output image plane can no longer be characterized by the average power spectrum over the recognition class, and any iterative method for computing its parameters might have to train exhaustively over the entire output plane. Using a statistical treatment, however, we were able to develop a training method that did not require exhaustive output plane training, which drastically reduced the convergence time of our training algorithm and gave improved performance. Our training algorithm requires the generation of a small number of random sequences with the same second-order statistics as our recognition class. Pre-whitening of the input exemplars played an important role in the training algorithm because the random sequences could then be any white noise sequence, which, as a practical matter, is less difficult to generate during training.

Our results also show that it is not enough to simply train a multi-layer perceptron using backpropagation, the black-box approach. Careful analysis of the final solution is necessary to confirm reasonable results. In particular, the linear solution is a strong attractor and must be avoided, otherwise the solution would be equivalent (at best) to the linear MACE filter followed by a threshold. We used Gram-Schmidt orthogonalization on the input layer, which did result in a nonlinear discriminant function and improved performance. We are currently exploring other methods by which independent features will adapt naturally. In our experiments, better generalization and reduced variance in the output plane were demonstrated.

Our current interest is in the application of this filter structure to SAR imagery. We are in the process of testing with multiple targets in target-plus-clutter imagery and will report our results in the future. Future investigations will also explore the performance of and relationships to the class of unconstrained correlation filters of (Mahalanobis, Kumar et al, 1994).

6.0 References

Kumar, B. V. K. Vijaya (1992); Tutorial survey of composite filter designs for optical correlators, Appl. Opt. 31, no. 23, 4773-4801.

Ravichandran, G., and D. Casasent (1992); Minimum noise and correlation energy filters, Appl. Opt. 31, no. 11, 1823-1833.

Casasent, D., and G. Ravichandran (1992); Advanced distortion-invariant minimum average correlation energy (MACE) filters, Appl. Opt. 31, no. 8, 1109-1116.

Casasent, D., G. Ravichandran, and S. Bollapragada (1991); Gaussian minimum average correlation energy filters, Appl. Opt. 30, no. 35, 5176-5181.

Sudharsanan, S. I., A. Mahalanobis, and M. K. Sundareshan (1991); A unified framework for the synthesis of synthetic discriminant functions with reduced noise variance and sharp correlation structure, Appl. Opt. 30, no. 35, 5176-5181.

Hester, C. F., and D. Casasent (1980); Multivariant technique for multiclass pattern recognition, Appl. Opt. 19, 1758-1761.

Kumar, B. V. K. Vijaya, Z. Bahri, and A. Mahalanobis (1988); Constraint phase optimization in minimum variance synthetic discriminant functions, Appl. Opt. 27, no. 2, 409-413.

Mahalanobis, A., B. V. K. Vijaya Kumar, and D. Casasent (1987); Minimum average correlation energy filters, Appl. Opt. 26, no. 17, 3633-3640.

Kumar, B. V. K. Vijaya (1986); Minimum variance synthetic discriminant functions, J. Opt. Soc. Am. A 3, no. 10, 1579-1584.

Kohonen, T. (1988); Self-Organization and Associative Memory (1st ed.); Springer Series in Information Sciences, vol. 8; Springer-Verlag.

Réfrégier, Ph., and J. Figue (1991); Optimal trade-off filter for pattern recognition and their comparison with Wiener approach, Opt. Comp. Proc. 1, 3-10.

Mahalanobis, A., B. V. K. Vijaya Kumar, Sewoong Song, S. R. F. Sims, and J. F. Epperson (1994); Unconstrained correlation filters, Appl. Opt. 33, no. 33, 3751-3759.

Fisher, J., and J. C. Principe (1994); Formulation of the MACE Filter as a Linear Associative Memory, Proceedings of the IEEE International Conference on Neural Networks, Vol. 5, p. 2934.

Mahalanobis, A., A. V. Forman, N. Day, M. Bower, and R. Cherry (1994); Multi-class SAR ATR using shift-invariant correlation filters, Pattern Recognition 27, no. 4, 619-626.

Novak, L. M., G. Owirka, and C. Netishen (1994); Radar target identification using spatial matched filters, Pattern Recognition 27, no. 4, 607-617.

Hinton, G. E., and J. A. Anderson (Eds.) (1981); Parallel Models of Associative Memory, Lawrence Erlbaum Associates, Publishers.

Hertz, J., et al (1991); Introduction to the Theory of Neural Computation, Addison-Wesley Publishing Company.

Kung, S. Y. (1992); Digital Neural Networks, Prentice-Hall.

Amit, D. J. (1989); Modeling Brain Function: The World of Attractor Neural Networks, Cambridge University Press.

Novak, L. M., M. C. Burl, and W. W. Irving (1993); Optimal Polarimetric Processing for Enhanced Target Detection, IEEE Transactions on Aerospace and Electronic Systems, Vol. 29, p. 234.

Oppenheim, A. V., and R. W. Schafer (1989); Discrete-Time Signal Processing, Prentice-Hall, Inc.