Pose Estimation in SAR using an Information Theoretic Criterion


Jose C. Principe, Dongxin Xu, John W. Fisher III
Computational NeuroEngineering Laboratory, University of Florida
{principe,xu,fisher}@cnel.ufl.edu

Abstract

This paper describes a pose estimation algorithm based on an information theoretic formulation. We formulate pose estimation statistically and show that pose can be estimated from a low dimensional feature space obtained by maximizing the mutual information between the aspect angle and the output of the mapper. We use the Havrda-Charvat definition of entropy to implement a nonparametric estimator based on the Parzen window method. Results on the MSTAR data set are presented and show the good performance of the methodology.

1.0 Introduction

Knowing the relative position of a vehicle with respect to the sensor (normally called the aspect angle of the observation, or the pose) is an important piece of information for vehicle recognition. Since pattern classifiers are statistical machines, without the pose information the classifier has to be trained with all possible poses to become invariant to aspect angle during operation. This is the principle of classifiers based on the synthetic discriminant function (SDF) so widely used in optical correlators [1], and of template based classifiers [2]. Even if the classifier is built around Bayesian principles or neural networks, all possible aspect angles have to be included during training to describe the object reliably. In SAR this is not a simple task due to the enormous variability of the scattering phenomenology created by man-made objects. This argument suggests that one could instead divide the classification into two stages: first find the pose of the object, and then decide the class by selecting a classifier trained exclusively for that pose. Notice that this approach drastically reduces the complexity of classifier training. This is in fact the principle used in the MSTAR architecture [3], where classification is divided into an indexing module followed by search and match.

However, the approach utilized in MSTAR is based on the traditional method of a priori selecting landmarks on the vehicle and then comparing them for the best match against a database of features taken at different angles. This solution has several drawbacks. First, it is computationally expensive (the search has to be done on-line). Second, it is highly dependent on the quality of the landmarks. Edges have proved useful in optical images, but in SAR point scatterers are normally preferred due to the different image formation characteristics; the issue is that point scatterers vary abruptly with depression angle and pose, so the stability of the method is still under investigation. Third, the size of the problem space increases drastically with the number of objects and the precision required when local features are utilized.

Instead of assuming that the system complexity is intrinsic to the problem [4], we submit that the problem formulation also affects the complexity of the solution. If the landmarks are local, then it is obvious that the problem does not scale up well. Our approach is to extract optimal features directly from the data by training an adaptive system. The advantages are the following. First, the method is very fast: once the system is trained, a test image is presented and the output of the system is the pose estimate, i.e. we have created a content addressable memory (CAM). Any microprocessor can do this in real time. Second, the system is not sensitive to the detection of landmarks, which is a big advantage primarily when we do not know how much information is carried by the landmarks.

Until the information theoretic formulation proposed here, this optimal feature extraction could only be done using principal component analysis (PCA) or linear discriminant analysis. PCA provides only global (rough) information about the objects (second order statistics), and the information it provides may not be directly related to pose, which is just one aspect of the input image, so the results may be disappointing. Our method of mutual information maximization, however, uses the full information contained in the probability density function (pdf), so it can exploit local information if that is needed to solve the problem, and the model parameters are directed specifically at the pose, which is our only interest here.

This paper starts with a statistical formulation of the problem of pose estimation, describes a method of computing entropy from samples and how to construct a mutual information estimator, and presents preliminary results on the MSTAR data set.

2.0 A statistical formulation of pose estimation

Suppose that we have collected data in pairs $(x_i, a_i)$, $i = 1, \ldots, N$, where the image $x_i$ can be regarded as a vector in a high dimensional space, $x_i \in R^m$ ($m$ is usually in the thousands), and $a_i$ is a vector of ground truth information relative to the image contents. For the general case of pose estimation, $a_i$ is a six dimensional vector containing the translational and rotational information [5]. Here we will treat the one degree of freedom (1DOF) pose estimation problem, where $x_i$ is a SAR image of a land vehicle obtained at a given depression angle and $a_i$ is the azimuth (aspect) angle of the vehicle. The MSTAR data set [6] can be readily utilized to test the accuracy of 1DOF pose estimation algorithms.

In general, the estimation of the aspect angle (here called pose) given a particular image $x$ can be formulated as a MAP (maximum a posteriori probability) problem:

$$\hat{a} = \arg\max_a f_{A|X}(a \mid x) \qquad (1)$$

where $f_{A|X}(a \mid x)$ is the a posteriori probability density function (pdf) of the aspect angle $A$ given $x$. This formulation implies that the best estimate of the aspect angle given $x$ is the one which maximizes the a posteriori probability. Although the aspect angle $A$ is a continuous variable, we can discretize it for convenience, the possible values being $a_i$, $i = 1, \ldots, N$, i.e. all the angles in the training set. Since we have no a priori knowledge about the aspect angle, the uniform distribution is the most reasonable assumption for the probability density of $A$, in the sense that it is the direct result of the MaxEnt principle [7]. Under these conditions, the above MAP problem is equivalent to ML (maximum likelihood):

$$\hat{a} = \arg\max_i P(a_i \mid x) = \arg\max_i \frac{P(a_i)\, f_{X|A}(x \mid a_i)}{f_X(x)} = \arg\max_i f_{X|A}(x \mid a_i) \qquad (2)$$

where $P(a_i \mid x)$, $i = 1, \ldots, N$, is the a posteriori probability of the discrete variable $A$ given $x$; $P(a_i)$ is the a priori probability of $A$, which here is uniform, i.e. $P(a_i) = \text{const}$ for $i = 1, \ldots, N$; $f_{X|A}(x \mid a_i)$ is the conditional pdf of the image $x$ for a particular aspect angle $a_i$; and $f_X(x)$ is the marginal pdf of $x$.
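As a concrete illustration of the decision rule in (2), the following minimal Python sketch picks the training angle whose conditional density assigns a test image the highest likelihood. The helper names and the per-angle density callables are our hypothetical placeholders; as argued next, these densities cannot in practice be estimated directly in the m-dimensional image space, which motivates the feature extraction below.

```python
import numpy as np

def ml_pose_decision(x, angles, cond_pdfs):
    """Discretized ML decision of Eq. (2): pick the training aspect angle
    a_i whose conditional density f_{X|A}(x | a_i) is largest at the image x.

    x         : flattened image chip, shape (m,)
    angles    : training aspect angles a_i, shape (N,)
    cond_pdfs : N callables, cond_pdfs[i](x) ~ f_{X|A}(x | a_i) (placeholders)
    """
    likelihoods = np.array([pdf(x) for pdf in cond_pdfs])
    return angles[np.argmax(likelihoods)]
```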

Therefore, from (2), the problem becomes the estimation of the conditional pdf of $x$ for all the possible angles $a_i$, $i = 1, \ldots, N$. Since $x$ is a very high dimensional vector, and any assumption about the form of the pdf is inappropriate for realistic pose estimation in SAR, a non-parametric method should be used. However, nonparametric pdf estimation of $x$ becomes very unreliable because $x$ lies in a very high dimensional space and training data is limited. Dimensionality reduction, or feature extraction, is therefore necessary: instead of estimating the angle directly from the image $x$, we estimate it from a feature space of the image.

Generally, a feature is the output of a mapping. Let $y = h(x, w)$ be a feature set for $x$, where $h: R^m \rightarrow R^k$ is a mapping, also called the feature extractor, $y \in R^k$, $k \ll m$, and $w$ is the parameter set of the feature extractor. The problem according to (2) now becomes:

$$\hat{a} = \arg\max_i f_{Y|A}(y \mid a_i), \qquad y = h(x, w) \qquad (3)$$

In this framework, the key issue is how to choose the parameter set $w$. We propose to apply information theory [8]. From the information theoretic point of view, a mapping or feature extractor is an information transmission channel. The parameters of the mapping should be chosen so that it transmits as much information as possible. Here, the problem requires that the mapping transmit the most information about the aspect angle, i.e. the feature $y$ should best represent the aspect angle. According to information theory, the quantitative measure for this purpose is the mutual information between the feature $y$ and the aspect angle $a$. Parameter selection can thus be formulated as:

$$w_{opt} = \arg\max_w I(y = h(x, w), a) \qquad (4)$$

where $I(y, a)$ is the mutual information between $y$ and $a$; that is, the optimal parameter set is the one which maximizes the mutual information between the feature and the angle.
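For concreteness, here is a minimal sketch of the feature extractor $y = h(x, w)$ in (3), assuming the single-hidden-layer MLP that the paper uses later as the mapper; the tanh nonlinearity and the NumPy weight layout are our illustrative assumptions, not a specification from the paper.

```python
import numpy as np

def mlp_features(x, W1, b1, W2, b2):
    """Feature extractor y = h(x, w): maps the m-dimensional image vector x
    to a k-dimensional feature (k = 2 here). The parameter set w consists of
    W1 (hidden x m), b1 (hidden,), W2 (k x hidden), b2 (k,)."""
    hidden = np.tanh(W1 @ x + b1)      # nonlinear hidden layer
    return np.tanh(W2 @ hidden + b2)   # bounded 2-D feature output y
```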

The mutual information measure relies directly on pdfs. As mentioned above, non-parametric pdf estimation should be used, so the Parzen window method [9] is selected here. Unfortunately, Shannon's mutual information measure becomes too complex to be implementable with the Parzen window pdf estimate. In the next section we introduce our method of mutual information estimation based on the Havrda-Charvat entropy.

2.1 Pose estimation using the Havrda-Charvat entropy

Figure 1 shows the proposed block diagram for pose estimation: the image $x$ is passed through the mapper $y = f(x, w)$, the mutual information $I(Y, A)$ between the output $y = (y_1, y_2)$ and the angles $A$ is estimated, and this estimate is used to adapt the parameters $w$.

[Figure 1. Pose estimation with the MLP]

From information theory, the mutual information can be computed as the difference between the entropy and the conditional entropy:

$$I(Y, A) = H_2(Y) - H_2(Y \mid A) \qquad (5)$$

where $Y$ is the feature and $A$ is the aspect angle. For reasons connected to the estimation of entropy from samples, we utilize here the Havrda-Charvat definition of entropy [10]:

$$H_\alpha(Y) = \frac{1}{1-\alpha}\left(\int f_Y(y)^\alpha\, dy - 1\right) \qquad (6)$$

with $\alpha = 2$, which will also be called the quadratic entropy. For a more in-depth discussion of several definitions of entropy see [10]. $H_2(Y)$ is then the quadratic entropy of the output and $H_2(Y \mid A)$ is the conditional quadratic entropy. Since the MLP is a universal mapper [11], it is used in this application as the mapping function (here with a 6400x3x2 configuration, i.e. 6400 inputs, 3 hidden units, and 2 outputs). The problem can now be described as finding the parameters $w$ of the MLP such that the mutual information between the output of the MLP and the aspect angle is maximized, i.e. we let the output convey the most information about the aspect angle. We can think of this scheme as information filtering, as opposed to the more traditional image filtering so commonly utilized in image processing.

Suppose the training data are pairs $\{x_i, a_i\}$, where $x_i$ is a SAR image of a vehicle and $a_i$ is its true azimuth (aspect) angle. The feature $y_i = h(x_i, w)$ is a 2-dimensional vector $(y_{1i}, y_{2i})$, from which the aspect can be easily measured as the angle of the vector. We can discretize the angles uniformly around the curve described by the output vector, as shown in Figure 2, where a circumference is assumed for simplicity.

[Figure 2. Structure for the angle information: the discretized angles $a_0, a_1, \ldots, a_N$ placed around a circle in the $(y_1, y_2)$ output space.]

In our problem formulation, the pose is a random variable which must be described statistically.
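A short sketch of the discretization in Figure 2 follows, under the assumption (ours, for illustration) that the N training angles are spread uniformly around the full unit circumference in the $(y_1, y_2)$ output plane; the variable names are hypothetical.

```python
import numpy as np

# Place the N discretized training angles around the unit circle in the
# (y1, y2) output space, as sketched in Figure 2.
N = 53                                      # e.g. one chip every ~3.5 degrees
aspect_deg = np.linspace(0.0, 180.0, N)     # training aspect angles a_i
theta = 2.0 * np.pi * aspect_deg / 180.0    # assumed map of 0-180 deg onto the circle
circle_points = np.stack([np.cos(theta), np.sin(theta)], axis=1)  # shape (N, 2)
```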

We create a local structure by weighting the samples of adjacent angles:

samples:  $a_{i-L}, \ldots, a_{i-1}, a_i, a_{i+1}, \ldots, a_{i+L}$
weights:  $w_{-L}, \ldots, w_{-1}, w_0, w_1, \ldots, w_L$

The neighborhood size was experimentally set at $L = 2$ nearest neighbors, and the weighting was selected as a Gaussian decay. Effectively, this arrangement says that there is a fuzzy correspondence between several possible angles and each of the sampled points on the unit circumference.

The reason we selected the HC quadratic entropy is related to the Parzen window estimator presented in [12]. Let $y_i \in R^k$, $i = 1, \ldots, N$, be a set of samples from a random variable $Y \in R^k$ in $k$-dimensional feature space. One interesting question is what entropy is associated with this set of data points. One answer lies in the estimation of the data pdf by the Parzen window method using a Gaussian kernel:

$$f_Y(y) = \frac{1}{N} \sum_{i=1}^{N} G(y - y_i, \sigma^2) \qquad (7)$$

where $G(y, \sigma^2) = \frac{1}{(2\pi)^{k/2}\sigma^k} \exp\left(-\frac{y^T y}{2\sigma^2}\right)$ is the Gaussian kernel in $k$-dimensional space and $\sigma^2$ is the variance. When Shannon's entropy is used with this pdf estimate, the measure becomes very complex. Fortunately, the HC quadratic entropy of (6) leads to a simpler form, and we obtain the following entropy measure for a set of discrete data points $\{y_i\}$:

$$H(\{y_i\}) = H_2(Y \mid \{y_i\}) = 1 - \int f_Y(y)^2\, dy = 1 - V(\{y_i\})$$

$$V(\{y_i\}) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \int G(y - y_i, \sigma^2)\, G(y - y_j, \sigma^2)\, dy = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G(y_i - y_j, 2\sigma^2) \qquad (8)$$

With this estimator, the mutual information related to the quadratic HC entropy becomes

$$I(Y, A) = \frac{1}{N} \sum_{i} \int \left(\sum_{l=-L}^{L} w_l\, G(y - y_{i+l}, \sigma^2)\right)^2 dy \; - \; \int \left(\frac{1}{N} \sum_{i} G(y - y_i, \sigma^2)\right)^2 dy \qquad (9)$$

The second term estimates the entropy due to all the input images, while the first term estimates the conditional entropy.
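The following Python sketch implements the quadratic entropy estimator (8) through the pairwise-kernel identity, together with one plausible reading of the mutual information estimator (9) as the difference between a neighbor-weighted (conditional) information potential and the marginal one. The wrap-around neighbor indexing and the normalization of the weights are our assumptions, not details taken from the paper.

```python
import numpy as np

def gauss(d, var):
    """Isotropic Gaussian kernel G(d, var) in k dimensions, as in Eq. (7)."""
    k = d.shape[-1]
    return np.exp(-np.sum(d * d, axis=-1) / (2.0 * var)) / ((2.0 * np.pi * var) ** (k / 2))

def information_potential(Y, sigma2):
    """V({y_i}) of Eq. (8): (1/N^2) sum_ij G(y_i - y_j, 2*sigma^2).
    The quadratic entropy estimate is then H_2 = 1 - V."""
    diffs = Y[:, None, :] - Y[None, :, :]          # (N, N, k) pairwise differences
    return gauss(diffs, 2.0 * sigma2).mean()

def quadratic_mi(Y, weights, sigma2):
    """Hypothetical reading of Eq. (9): conditional minus marginal
    information potential. `weights` holds the Gaussian decay w_l over the
    2L+1 adjacent-angle neighbors of each sample."""
    N, L = len(Y), (len(weights) - 1) // 2
    v_marg = information_potential(Y, sigma2)
    v_cond = 0.0
    for i in range(N):
        idx = [(i + m) % N for m in range(-L, L + 1)]  # wrap around the circle
        d = Y[idx][:, None, :] - Y[idx][None, :, :]
        w = np.outer(weights, weights) / weights.sum() ** 2
        v_cond += (w * gauss(d, 2.0 * sigma2)).sum()
    return v_cond / N - v_marg
```

Note that the marginal term costs O(N^2) kernel evaluations, which is the practical bottleneck on training set size mentioned later in the paper.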

To train the MLP, we take the derivative of (9) with respect to the parameters and interpret it as an injected error for the back-propagation algorithm [12]. In this way, the feature extraction mapping for pose estimation is obtained. After training, a test image $x$ is presented to the MLP and its output $y$ is evaluated under the discrete conditional pdf estimate $f_{Y|A}(y \mid a_i)$ in the output feature space; the pose is then estimated using (3).
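In this circular geometry the pose readout of (3) reduces to taking the direction of the 2-D output and snapping it to the nearest discretized angle; a minimal sketch follows (our variable names, reusing the hypothetical `circle_points` from the earlier snippet).

```python
import numpy as np

def estimate_pose(y, circle_points, aspect_deg):
    """Pose readout after training: only the direction of the output feature
    y matters, so radial shrinkage on test chips leaves the estimate
    unchanged. Returns the training angle of the nearest circle point."""
    y_unit = y / (np.linalg.norm(y) + 1e-12)     # keep only the direction
    nearest = np.argmax(circle_points @ y_unit)  # maximum cosine similarity
    return aspect_deg[nearest]
```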

3.0 Experimental Results

This algorithm was validated on the MSTAR public release database [6]. We trained the pose estimator with the class BMP-2 vehicle, type sn-c21, at a depression angle of 15 degrees. We simply clipped the chips (128x128) from pixel 20 to 99 both vertically and horizontally (obtaining image chips of size 80x80) to preserve the image of the vehicle and its shadow. No fine centering of the vehicle was attempted. The training set was constructed from 53 chips taken approximately 3.5 degrees apart to cover angles from 0 to 180 degrees. The algorithm takes about 100 batch iterations to converge (very repeatable performance).

In Figure 3 the circle at left (diamonds) represents the training results in the feature space. Notice that the MLP trained with our criterion created an output that is almost a perfect circle. The circle can be interpreted as the best output distribution to maximize the mutual information between the input and the pose. This result is intuitively appealing, but notice that it was discovered automatically by our algorithm (i.e. we did not enforce the circle as a desired response). The triangles at the left show typical results on a test set (the chips for the same vehicle not used for training). It is interesting that the amplitude for the test set fluctuates considerably, but the outputs tend to move inwards along the radial direction, preserving the quality of the pose estimation. This means that the algorithm created an output metric that preserves angle relationships, as we expected. The plot at the right shows the true and estimated pose; the vertical axis is the angle and the horizontal axis is the exemplar index.

[Figure 3. BMP-2, sn-c21 (180 degree training): output feature space (left) and true versus estimated pose (right).]

The testing was conducted on the rest of the chips from the same vehicle and on two other vehicle types (sn-9563 and sn-9566) which represent different configurations (all at the same depression angle). We also tested the pose estimator on a different class, the T-72, using the type sn-s7. Table 1 quantifies the results.

Table 1: Testing with 0-180 degree training

class/type      error mean (degrees)    error std. dev. (degrees)
BMP2/sn-c21     3.45                    2.58
BMP2/sn-9563    4.99                    3.87
BMP2/sn-9566    4.99                    6.64
T72/sn-s7       6.98                    5.19

Notice that the pose estimation error when testing on the same vehicle type is basically the same as the angular resolution of the training set (3.5 degrees), which means that the accuracy of the estimator is very good. We therefore expect that more precise pose estimates are achievable by creating training sets with more images at finer resolution in pose.
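For reference, error statistics like those in Table 1 should be computed with proper angular wraparound; a small sketch (our hypothetical helper, assuming the 180 degree period of the training range):

```python
import numpy as np

def pose_error_stats(est_deg, true_deg, period=180.0):
    """Mean and standard deviation of the pose error in degrees, taking the
    shortest angular distance over the given period."""
    err = np.abs(np.asarray(est_deg) - np.asarray(true_deg)) % period
    err = np.minimum(err, period - err)   # wraparound-aware angular distance
    return err.mean(), err.std()
```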

Table 1 also shows that the algorithm generalizes very well both to other vehicle types and even to other vehicle classes. We notice a degradation in performance on the T-72, but it is a smooth roll-off. To obviate this degradation of performance with vehicle type, we should utilize more than one vehicle in the training set, which at the same time would obviate the resolution problem addressed above. However, we have to note that the algorithm for mutual information estimation is O(N^2), which means that there is in practice a limit on the number of exemplars N utilized in training.

In order to quantify the robustness of the algorithm to vehicle occlusion, we progressively replaced one vehicle image with background (an image of the BMP-2 not used in training). We observed that although the amplitude of the output feature decreased appreciably when the bright return of the vehicle was substituted by the darker background (the triangles in the left portion of Figure 4), the pose estimation held up remarkably well (right portion of Figure 4). In this case the pose was within +/- 5 degrees up to 50% occlusion and within +/- 10 degrees up to 95% occlusion (which occurs at increment 36 in the plot). In our opinion this smooth degradation is one of the advantages of using a distributed system as a mapper, and the same behavior has been extensively reported in the associative memory literature [13]. However, different occlusion directions may yield different performance (it all depends upon which portions of the image are occluded).

[Figure 4. Results of pose estimation with vehicle occlusion. Vehicle pose is 58 degrees.]

4.0 Conclusions

This paper reports on our present efforts to create a robust and easy-to-implement pose estimator for SAR imagery. The need for such an algorithm stems from our goal of creating accurate and robust classifiers. Knowing the pose of the vehicle will streamline the size and training of the classifier module, which should translate into better performance.

Our pose estimation framework is statistical in nature and utilizes information directly, through the manipulation of entropy estimated from examples. We address the enormous complexity of the input space by creating an adaptive system with optimal parameters. This is probably the best way to confront and conquer complexity: we project the input data onto a subspace such that some property of the input relevant to our problem is optimally preserved. This can be thought of as information filtering, as opposed to the more conventional signal filtering.

The central issue is the choice of the criterion for optimization. We were fully aware of the limitations of the second order methods traditionally utilized in pattern recognition, so we sought a method that would utilize the full information about the pdf of the input class. The mutual information between the feature and the pose becomes the criterion of choice. This criterion simply measures the uncertainty about pose remaining in the feature (the output of the mapper). By maximizing mutual information we decrease the uncertainty about pose in the feature, i.e. we transfer as much information as possible between the feature and the pose. There are also other reasons to use mutual information for classification, such as the decrease of the lower bound on the classification error according to Fano's inequality [14].

The big issue is the estimation of entropy from examples. In [12] we proposed a Parzen window estimate of the pdf, along with the mean squared difference between the uniform distribution and the estimated one, to manipulate the output entropy. The derivative of the criterion can be used as an injected error to adapt the parameters of our mapper (linear or nonlinear) using the backpropagation algorithm. In this paper we couple the entropy estimator with the quadratic entropy of Havrda-Charvat to arrive at an estimator of mutual information.

The preliminary results of our method are very promising. We successfully trained our pose estimator with MSTAR vehicles.

The accuracy on the test set is similar to that on the training set for the same vehicle, and the performance degrades gracefully for other vehicle types. Even with severe occlusion of the training vehicle (up to 95% occlusion) we obtain pose estimates within +/- 10 degrees.

Further testing of the algorithm is required, as well as further refinements of the theory. The image set is realistic, but still simple (1DOF). Extension to more degrees of freedom will be pursued next, as well as more vehicles. Our pose estimator is based on the fact that the angle is discrete; it is important for accuracy to treat the angle as a continuous variable, which will require a new estimator for the conditional entropy. It is also important to understand the algorithm better and to compare its performance with alternate approaches. One of the bottlenecks of the method is that the computation is O(N^2), which imposes a practical limit on the size of training sets.

Acknowledgments: This work was partially supported by DARPA-Air Force grant F33615-97-1019.

5.0 References

[1] Kumar B., Minimum variance synthetic discriminant functions, J. Opt. Soc. Am. A 3(1), 1579-1584, 1986.
[2] Duda R. and Hart P., Pattern Classification and Scene Analysis, Wiley, 1973.
[3] MSTAR Kickoff Meeting Proceedings, Washington, 1995.
[4] Minardi M., Moving & stationary target acquisition and recognition, WL talk, September 1997.
[5] Lowe D., Solving parameters of object models from image descriptions, in Proc. ARPA IU Workshop, pp. 121-127, 1980.
[6] MSTAR (Public) Targets, CD-ROM, Veda Inc., Ohio, 1997.
[7] Jaynes E., Information theory and statistical mechanics, Phys. Rev., vol. 106, pp. 620-630, 1957.
[8] Shannon C.E., A mathematical theory of communication, Bell Sys. Tech. J. 27, pp. 379-423, 623-653, 1948.
[9] Parzen E., On the estimation of a probability density function and the mode, Ann. Math. Stat. 33, p. 1065, 1962.

[10] Kapur J.N., Measures of Information and Their Applications, John Wiley & Sons, 1994.
[11] Haykin S., Neural Networks: A Comprehensive Foundation, Macmillan, 1994.
[12] Fisher J., Principe J., Entropy manipulation of arbitrary nonlinear mappings, in Proc. IEEE Workshop on Neural Networks for Signal Processing VII, pp. 14-23, 1997.
[13] Kohonen T., Self-Organization and Associative Memory, Springer-Verlag, 1987.
[14] Fisher J.W. III, Nonlinear Extensions to the Minimum Average Correlation Energy Filter, Ph.D. dissertation, Dept. of ECE, University of Florida, 1997.