
Pattern Recognition, Vol. 29, No. 1.

Robustizing Robust M-Estimation Using Deterministic Annealing

S. Z. Li
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

ABSTRACT

This paper presents a modified robust M-estimator, referred to as the annealing M-estimator (AM-estimator), designed to avoid problems with the M-estimator. The AM-estimator incorporates the annealing technique into the M-estimator. It has the following advantages: it gives the global solution regardless of the initialization; it involves neither a scale estimator nor free parameters, avoiding the unreliability therein; and it needs no order statistics such as the median, hence no sorting. Experimental results show that the AM-estimator is very stable and behaves gracefully with respect to the percentage of outliers and the noise variance.

Key words: Annealing, M-estimator, pattern recognition, robust statistics.

1 Introduction

Robust statistics methods provide tools for statistics problems in which underlying assumptions are inexact. A robust procedure should be insensitive to departures from the underlying assumptions: it should perform well when the assumptions hold, and its performance should degrade gracefully as the situation departs from them. One of the primary concerns of robustness is with noise distributions. In practice it is very common to assume Gaussian noise, but this assumption is often inaccurate. A typical situation is the contaminated Gaussian, a mixture of a Gaussian and some unknown distribution.

There are various types of robust estimators, and recent years have seen increasing interest in applications of robust techniques in computer vision. Kashyap and Eom [1] develop a robust algorithm for estimating parameters in an autoregressive image model where the noise is assumed to be a mixture of a Gaussian and an outlier process. Shulman and Herve [2] propose to use Huber's robust M-estimator [3] to compute optical flow involving discontinuities. Stevenson and Delp [4] use the same estimator for curve fitting. Besl et al. [5] propose a robust M window operator to prevent smoothing across discontinuities. Haralick et al. [6], Kumar and Hanson [7] and Zhuang et al. [8] use robust estimators to find pose parameters. Jolion and Meer [9] identify clusters in feature space based on the robust minimum volume ellipsoid estimator. More recently, Boyer et al. [10] present a procedure for surface parameterization using a robust M-estimator. Other recent advances in this area can be found in the proceedings of the international workshops on robust computer vision [11, 12]. A study on Markov random field (MRF) vision models [13] points out close relationships between robust M-estimators and discontinuity-adaptive MRF priors.

This paper concerns the following computational difficulties associated with the M-estimator. First, the M-estimator is not robust to the initialization; the choice of the initial estimate has a significant influence on the quality of the M-estimate [6, 14, 8, 15]. This is a problem common to nonlinear regression procedures [16], and it arises for the following reason: the M-estimate is defined as the global minimum of a non-convex energy function, yet the gradient-based algorithms commonly used to compute it can get stuck at a non-global solution. Second, the M-estimator depends on some scale estimate, such as the median of absolute deviation (MAD); the robustness of such estimates is itself questionable and deserves a devoted study. Third, there are free parameters which, together with the scale estimate, determine the threshold for rejecting outliers and have to be chosen on some basis. Fourth, the convergence of the M-estimator is not proved in most cases and often not guaranteed. Owing to these problems, the theoretical breakdown point can hardly be achieved.

The aim of this paper is to avoid the above problems. We present a modified robust M-estimator referred to as the annealing M-estimator (AM-estimator). The AM-estimator incorporates the annealing technique [17, 18, 19] into the M-estimation process. It involves no scale estimate such as the MAD and no free parameters, which avoids the unreliability of scale estimates and the need to select parameters. No order statistics are needed, hence no sorting. Most importantly, the AM-estimate is independent of the initialization.
The AM-estimate is defined as the minimum of a global energy function parameterized by a parameter $\gamma$. The annealing is performed by continuation in $\gamma$ from a very high value down to $0^+$: the sequence of global solutions is traced for decreasing values of $\gamma$, and the final solution is obtained in the zero-parameter limit $\gamma \to 0^+$. Experimental results show that the AM-estimator is very stable and behaves gracefully with respect to the percentage of outliers and the noise variance, in contrast to the M-estimator.

The rest of the paper is organized as follows. Section 2 presents the annealing robust estimator. Section 3 presents experimental results. Section 4 concludes the paper.

2 Robust Estimators and Annealing

The AM-estimator is very similar in form to the familiar M-estimator. We first briefly describe the M-estimator, then introduce the AM-estimator and point out the differences.

2.1 The M-Estimator

The essential form of the M-estimation problem is the following. Given a set of $m$ data samples $r = \{r_i \mid 1 \le i \le m\}$, where $r_i = f + \eta_i$, the problem is to estimate the location parameter $f$ under the noise $\eta_i$. The distribution of $\eta_i$ is not assumed to be known exactly; the only underlying assumption is that $\eta_1, \ldots, \eta_m$ obey a symmetric, independent, identical distribution (symmetric i.i.d.). A robust estimator has to deal with departures from this assumption. Let the residual errors be $\eta_i = r_i - f$ ($i = 1, \ldots, m$) and the potential (penalty) function be $g(\eta_i)$. The M-estimate $f^*$ is defined as the minimum of a global energy function,

  $f^* = \arg\min_f E(f)$   (1)

where

  $E(f) = \sum_i g(r_i - f)$   (2)

To minimize the above, it is necessary to solve the equation

  $\sum_i \psi(r_i - f) = 0$   (3)

where $\psi(\cdot) = g'(\cdot)$. This is based on gradient descent. When $g(\eta_i)$ is also a function of $\eta_i^2$, $\psi(\eta_i)$ takes the form

  $\psi(\eta_i) = 2\,\eta_i\, h(\eta_i) = 2\,(r_i - f)\, h(r_i - f)$   (4)

where $h(\cdot)$ is an even function. In this case the estimate $f$ can be expressed as the following weighted sum of the data samples

  $f = \frac{\sum_i h(\eta_i)\, r_i}{\sum_i h(\eta_i)}$   (5)

where $h$ acts as the interaction (weighting) function. This is a fixed-point equation because $\eta_i = r_i - f$. The function $h$ provides adaptive interaction: data points $r_i$ with larger errors $\eta_i$ have smaller interactions $h(\eta_i)$, and those with infinitely large errors have zero interaction.

The interaction function $h$ in M-estimation is usually defined piecewise. For example, Tukey's biweight function [20] is defined as

  $h(\eta_i) = \begin{cases} \left[1 - \eta_i^2/(cS)^2\right]^2 & \text{if } \eta_i^2/(cS)^2 < 1 \\ 0 & \text{otherwise} \end{cases}$   (6)

where

  $S = \mathrm{median}_i\{\eta_i\}$   (7)

is an estimate of spread, $c$ is a constant whose value is often set to 6 or 9, and $cS$ is the scale estimate. Some use the following as the scale estimate

  $cS = c\,\mathrm{median}_i\{|\eta_i - \mathrm{median}_j\{\eta_j\}|\}$   (8)

where the constant $c$ is usually taken as $c = 1.4826$ to be consistent with the Gaussian distribution. All M-estimators involve some scale estimate. The design of scale estimates is crucial, and finding good scale estimates is a topic in robust statistics; classical scale estimates such as the median and the MAD are not very robust.
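To make the fixed-point form (5) concrete, the following minimal Python sketch runs classical M-estimation of a scalar location with Tukey's biweight weights (6)-(7). The function names, the reading of $S$ as a median of absolute residuals, and the choice $c = 6$ are illustrative assumptions rather than the exact settings used in this paper's experiments.

```python
import numpy as np

def tukey_weights(eta, c=6.0):
    """Biweight interaction h(eta) of Eq. (6) with scale cS (spread S read as median |eta_i|)."""
    S = np.median(np.abs(eta))           # spread estimate, Eq. (7)
    cS = c * S + 1e-12                   # scale estimate; small constant guards against S = 0
    u2 = (eta / cS) ** 2
    return np.where(u2 < 1.0, (1.0 - u2) ** 2, 0.0)

def m_estimate_location(r, n_iter=50, c=6.0):
    """Fixed-point iteration of Eq. (5): f <- sum(h(eta) * r) / sum(h(eta))."""
    r = np.asarray(r, dtype=float)
    f = r.mean()                         # starting point; the final M-estimate depends on it
    for _ in range(n_iter):
        w = tukey_weights(r - f, c)
        if w.sum() == 0.0:               # every point rejected: keep the current estimate
            break
        f = np.sum(w * r) / np.sum(w)
    return f
```

Because the underlying energy (2) is non-convex for this potential, the result of the iteration depends on the starting point and on the scale estimate, which is precisely the difficulty the AM-estimator is designed to remove.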

2.2 The Annealing M-Estimator

The AM-estimator has the same form as the M-estimator, and Equations (1)-(5) apply to both. The major difference is that the scale estimate in the M-estimator is replaced in the AM-estimator by a parameter $\gamma$ whose value approaches $0^+$ during the estimation process. The AM-estimate under $\gamma$ is defined by

  $f_\gamma = \frac{\sum_i h_\gamma(\eta_i)\, r_i}{\sum_i h_\gamma(\eta_i)}$   (9)

where $h_\gamma(\cdot)$ is an adaptive interaction function parameterized by $\gamma$, to be defined in the next subsection. The AM-estimate is defined in the zero-parameter limit

  $f^* = \lim_{\gamma \to 0^+} f_\gamma$   (10)

Computationally, $\gamma$ is initially set to a high enough value and is decreased toward $0^+$ (a very small number); a sequence of solutions $\{f_\gamma\}$ is generated for the decreasing $\gamma$, and $f^*$ is the last one in the sequence. This is the annealing process. Given the form of the AM-estimator, it is the interaction function $h_\gamma$ and the annealing schedule that determine the final AM-estimate. The following two subsections define the AM-estimator.

2.3 Adaptive Interaction Functions

Definition 1. An adaptive interaction function (AIF) $h_\gamma$ parameterized by $\gamma > 0$ is a function which satisfies: (i) $h_\gamma$ is continuous; (ii) $h_\gamma(\eta) = h_\gamma(-\eta)$; (iii) $h_\gamma(\eta) > 0$; (iv) $h'_\gamma(\eta) < 0$ for all $\eta > 0$; and (v) $\lim_{\eta \to \infty} |\eta\, h_\gamma(\eta)| = C < \infty$. The class of AIFs is defined as the collection of all such $h_\gamma$ and is denoted by $\mathrm{H}_\gamma$. #

The continuity in (i) means that the interaction varies continuously with the error. The evenness in (ii) makes the interaction depend only on the error magnitude, regardless of its sign. The positive definiteness in (iii) keeps the weight positive. The monotonicity in (iv) makes the interaction decrease as the error magnitude increases. In (v), $C \ge 0$ is a constant whose value is the asymptote of $|\eta\, h_\gamma(\eta)|$. To satisfy this property, it is necessary that $\lim_{\eta \to \infty} h_\gamma(\eta) = 0$. This is essential for robust estimators: it assigns zero interaction to data points with infinitely large errors.

The above characterizes the properties an AIF $h_\gamma$ should possess rather than instantiating particular forms, so the definition is rather broad. The definition of the AIF has implications beyond robust estimation: it also describes how neighboring pixels of an image should interact for discontinuity-adaptive regularization (smoothing) [21]. The AM-estimator is defined by Eqs. (9)-(10) constrained by $h_\gamma \in \mathrm{H}_\gamma$. An AIF adaptively weights the importance of data points in computing the estimate.

2.4 Adaptive Potential Functions

Definition 2. The adaptive potential function (APF) corresponding to an $h_\gamma \in \mathrm{H}_\gamma$ is defined by $g_\gamma(\eta) = \int_0^\eta 2\eta'\, h_\gamma(\eta')\, d\eta'$. #

  AIF                                              APF                                                 Band
  $h_{1\gamma}(\eta) = \exp(-\eta^2/\gamma)$        $g_{1\gamma}(\eta) = -\gamma\exp(-\eta^2/\gamma)$    $B_1 = [-\sqrt{\gamma/2},\ \sqrt{\gamma/2}]$
  $h_{2\gamma}(\eta) = 1/[1+\eta^2/\gamma]^2$       $g_{2\gamma}(\eta) = -\gamma/[1+\eta^2/\gamma]$      $B_2 = [-\sqrt{\gamma/3},\ \sqrt{\gamma/3}]$
  $h_{3\gamma}(\eta) = 1/[1+\eta^2/\gamma]$         $g_{3\gamma}(\eta) = \gamma\log(1+\eta^2/\gamma)$    $B_3 = [-\sqrt{\gamma},\ \sqrt{\gamma}]$

Table 1: Three possible choices of $h_\gamma(\eta)$, the corresponding $g_\gamma(\eta)$ and the bands.

Figure 1: The qualitative shapes of the three APFs $g_\gamma(\eta)$ (bottom) and their derivative functions $g'_\gamma(\eta) = 2\eta\, h_\gamma(\eta)$ (top).

Basically, $g_\gamma$ is $C^1$ continuous; it is even, $g_\gamma(\eta) = g_\gamma(-\eta)$; and its derivative function is odd, $g'_\gamma(\eta) = -g'_\gamma(-\eta)$. However, it is not necessary for $g_\gamma(\infty)$ to be bounded. Furthermore, $g_\gamma$ is strictly monotonically increasing in the error magnitude $|\eta|$, because $g_\gamma(\eta) = g_\gamma(|\eta|)$ and $g'_\gamma(\eta) = 2\eta\, h_\gamma(\eta) > 0$ for $\eta > 0$. This means a larger $|\eta|$ leads to a larger potential (penalty) $g_\gamma(\eta)$, which conforms to the original spirit of the quadratic potential function $g_q(\eta) = \eta^2$. Most existing potential functions in M-estimation do not have this property: there the potential does not increase once the error $|\eta|$ grows beyond a certain value.

Of the two definitions, the former is the more important for defining the AM-estimator; it is $h_\gamma \in \mathrm{H}_\gamma$ that captures the essence of the robustness of the AM-estimate to distributional outliers. It is usually unnecessary to define the AM-estimator via the potential function $g_\gamma$. Nonetheless, knowing $g_\gamma$ is helpful for studying the convexity of the corresponding AM-estimator. For a given $g_\gamma(\eta)$, there exists a region of $\eta$ within which the function is convex:

  $B_\gamma = [b_L, b_H] = \{\eta \mid g''_\gamma(\eta) \ge 0\}$   (11)

We refer to this region $B_\gamma$ as the band. The lower and upper bounds $b_L, b_H$ correspond to the two extrema of $g'_\gamma(\eta)$, which can be obtained by setting $g''_\gamma(\eta) = 0$; we have $b_L = -b_H$ when $g_\gamma$ is even. When $b_L < \eta < b_H$, $g''_\gamma(\eta) > 0$ and thus $g_\gamma(\eta)$ is strictly convex. Table 1 lists three possible choices of AIFs, the corresponding APFs and the bands. Fig. 1 shows the qualitative shapes of the three APFs $g_\gamma(\eta)$ and their derivative functions $g'_\gamma(\eta) = 2\eta\, h_\gamma(\eta)$.
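The three AIF/APF pairs and their band bounds $b_H(\gamma)$ follow directly from Table 1 and Eq. (11). The small sketch below (NumPy, with a dictionary layout chosen here purely for illustration) encodes them and checks the derivative relation $g'_\gamma(\eta) = 2\eta\, h_\gamma(\eta)$ numerically.

```python
import numpy as np

# AIF h, APF g, and upper band bound b_H(gamma) for the three choices in Table 1.
AIFS = {
    "h1": (lambda e, g: np.exp(-e**2 / g),
           lambda e, g: -g * np.exp(-e**2 / g),
           lambda g: np.sqrt(g / 2.0)),
    "h2": (lambda e, g: 1.0 / (1.0 + e**2 / g) ** 2,
           lambda e, g: -g / (1.0 + e**2 / g),
           lambda g: np.sqrt(g / 3.0)),
    "h3": (lambda e, g: 1.0 / (1.0 + e**2 / g),
           lambda e, g: g * np.log(1.0 + e**2 / g),
           lambda g: np.sqrt(g)),
}

gamma = 4.0
eta = np.linspace(-10.0, 10.0, 4001)
for name, (h, gpot, b_H) in AIFS.items():
    # Definition 2: the APF is the integral of 2*eta'*h, so its derivative must equal 2*eta*h.
    dg = np.gradient(gpot(eta, gamma), eta)
    assert np.allclose(dg, 2.0 * eta * h(eta, gamma), atol=1e-3), name
    print(name, "band half-width b_H =", b_H(gamma))   # e.g. h3: sqrt(4) = 2
```

Within the band the APF is convex, which is what the annealing procedure exploits when choosing a sufficiently large initial $\gamma$.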

AM-Estimator
Begin Algorithm
  (1) set t = 1, f^(0) = f_MSE; choose initial gamma;
  (2) do {
  (3)   t <- t + 1;
  (4)   compute errors eta_i = r_i - f^(t-1), for all i;
  (5)   compute weighted sum f^(t) = sum_i h_gamma(eta_i) r_i / sum_i h_gamma(eta_i);
  (6)   lower temperature: gamma <- lower(gamma);
  (7) } until (gamma < gamma_min or |f^(t) - f^(t-1)| < epsilon)  /* converged */
  (8) f* <- f^(t);
End Algorithm

Figure 2: The AM-estimation algorithm.

Another interesting AIF is

  $h_{4\gamma}(\eta) = \frac{1}{1 + |\eta|/\gamma}$   (12)

It allows a bounded but non-zero contribution from errors $\eta_i = r_i - f \to \infty$, with $\lim_{\eta \to \infty} \eta\, h_{4\gamma}(\eta) = \gamma$. It is attractive because $g''_{4\gamma}(\eta) = [2\eta\, h_{4\gamma}(\eta)]' > 0$ for all $\eta$ and hence leads to a strictly convex minimization. Huber's function [3]

  $g_\gamma(\eta) = \min\{\eta^2,\ \gamma^2 + 2\gamma(|\eta| - \gamma)\}$   (13)

is also a convex function. Its first derivative is $g'_\gamma(\eta) = 2\eta\, h_\gamma(\eta) = 2\eta$ for $|\eta| \le \gamma$ and $g'_\gamma(\eta) = 2\gamma\eta/|\eta|$ for other $\eta$. Hence its AIF is $h_\gamma(\eta) = 1$ for $|\eta| \le \gamma$ and $h_\gamma(\eta) = \gamma/|\eta|$ for other $\eta$.

2.5 Annealing Procedure

When $g_\gamma(\eta)$ is non-convex, the direct method using the fixed-point iteration

  $f^{(t+1)} = \frac{\sum_i h_\gamma(r_i - f^{(t)})\, r_i}{\sum_i h_\gamma(r_i - f^{(t)})}$   (14)

can get stuck at a local minimum, because this equation is derived from gradient descent. The problem is particularly serious for small $\gamma$. To solve it, we combine the annealing technique into the iteration. Annealing, stochastic [17, 18, 19] or deterministic [22, 23, 24, 25], is a continuation technique for avoiding local optima. In the AM-estimator, the annealing is performed by gradually decreasing the parameter $\gamma$ toward $0^+$ during the estimation process. This significantly improves the quality of the estimate.

Initially, the parameter is set to a sufficiently large value $\gamma^{(0)}$ such that the APF $g_\gamma(\eta)$ is strictly convex. With such a $\gamma^{(0)}$, it is easy to find the unique minimum of the global energy function $E(f_\gamma)$ using the gradient descent method, regardless of the initial value $f^{(0)}$.

This minimum is then used as the initial value for the next phase of minimization under a lower $\gamma$, to obtain the next minimum. As $\gamma$ is lowered, $g_\gamma(\eta)$ may no longer be convex and local minima may appear. However, by tracking the global minima from high $\gamma$ down to $\gamma \to 0^+$, we can approximate the global minimum $f^*$ in the limit $\gamma \to 0^+$.

The AM-estimator algorithm is summarized in C-like pseudocode in Fig. 2. Initially, $f$ is set to the MSE estimate

  $f_{\mathrm{MSE}} = \frac{1}{m} \sum_{i=1}^m r_i$   (15)

The initial $\gamma$ is chosen to satisfy

  $|\eta_i| = |r_i - f_{\mathrm{MSE}}| < b_H(\gamma), \qquad \forall i$   (16)

This guarantees $g''_\gamma(\eta_i) > 0$ and hence the convexity of $g_\gamma$ over the data. In the above, $b_H(\gamma)$ ($= -b_L(\gamma)$) is the upper bound of the band in (11). The parameter $\gamma$ is lowered according to a schedule implemented by the function lower(gamma) in line (6). Convergence is judged by the conditions in line (7), where $\gamma_{\min}$ and $\epsilon$ are small positive numbers. We found that the location estimate converges ($|f^{(t)} - f^{(t-1)}| < \epsilon$) after dozens of iterations, and there is little change for smaller $\gamma$.

3 Experimental Results

The following experiment on location estimation compares the performance of the AM-estimator with that of the M-estimator using Tukey's biweight function. Simulated data points in 2D locations are generated. The data set is a mixture of true data points and outliers. First, $m$ true data points $\{(x_i, y_i) \mid i = 1, \ldots, m\}$ are randomly generated around $f = (10, 10)$; the values of $x_i$ and $y_i$ obey an identical, independent Gaussian distribution with a fixed mean of 10 and a variance $V$. After that, a percentage of the $m$ data points are replaced by random outlier values. The outliers are uniformly distributed in a square centered at $(b, b) \ne f$. Four parameters control the data generation, with the following values:

1. the number of data points $m \in \{50, 200\}$;
2. the noise variance $V \in \{0, 2, 5, 8, 12, 17, 23, 30\}$;
3. the percentage of outliers, from 0 to 70 in steps of 5; and
4. the outlier square center parameter $b = 22.5$ or $50$.

The experiments are done with different combinations of these parameter values. The AIF is chosen to be $h_\gamma(\eta) = h_{3\gamma}(\eta) = 1/(1 + \eta^2/\gamma)$. The schedule in lower(gamma) is $\gamma^{(t)} = 100 \times 1.5^{-(t-1)}$ for $t \ge 2$, so that $\gamma \to 0^+$ as $t \to \infty$. It takes about 50 iterations to converge for each of these data sets.

Two quantities are computed as performance measures of an estimator: (1) the mean error $\bar e$ versus the percentage of outliers (PO) and (2) the mean error $\bar e$ versus the noise variance (NV) $V$. The Euclidean error is $e = \|f^* - f\| = \sqrt{(x - 10)^2 + (y - 10)^2}$, where $f^* = (x, y)$ is the estimate and $f$ is the true location.

Figs. 3 and 4 show the performance of the AM-estimator and the M-estimator, respectively. A comparison of these results demonstrates that the AM-estimator is remarkably better than the M-estimator: it not only has lower estimation errors but also behaves in a more stable and graceful manner as the percentage of outliers and the noise variance increase.
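As a concrete illustration of the procedure in Fig. 2, here is a minimal Python sketch of the AM-estimator for a scalar location, using the AIF $h_{3\gamma}$ and a geometric cooling of $\gamma$. The cooling factor, stopping thresholds and the safety margin on the initial $\gamma$ are illustrative choices, not necessarily those used in the experiments above.

```python
import numpy as np

def h3(eta, gamma):
    """AIF h_3gamma(eta) = 1 / (1 + eta^2 / gamma) from Table 1."""
    return 1.0 / (1.0 + eta**2 / gamma)

def am_estimate_location(r, cool=1.5, gamma_min=1e-8, eps=1e-8, max_iter=200):
    """AM-estimation of a scalar location following Fig. 2."""
    r = np.asarray(r, dtype=float)
    f = r.mean()                                        # f^(0) = f_MSE, Eq. (15)
    # Initial gamma from Eq. (16): all residuals inside B_3 = [-sqrt(gamma), sqrt(gamma)].
    gamma = max(1.01 * np.max(np.abs(r - f)) ** 2, 1e-12)
    for _ in range(max_iter):
        w = h3(r - f, gamma)                            # adaptive interactions
        f_new = np.sum(w * r) / np.sum(w)               # weighted-sum update, Eq. (9)
        gamma /= cool                                   # lower the "temperature" gamma
        converged = gamma < gamma_min or abs(f_new - f) < eps
        f = f_new
        if converged:
            break
    return f
```

For the 2-D data of this section the same update can be applied to vector samples $r_i$, for example by feeding the residual magnitudes $\|r_i - f\|$ to $h_\gamma$; the sketch keeps to one dimension for brevity. Because the first iteration starts in the convex regime, the outcome does not hinge on the choice of starting point, consistent with the claim that the AM-estimate is independent of the initialization.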

Figure 3: Mean error of the AM-estimate. Mean error vs. percentage of outliers with m = 50 (row 1) and m = 200 (row 2). Mean error vs. noise variance with m = 50 (row 3) and m = 200 (row 4). Outliers are uniformly distributed in a square centered at b = 22.5 (left) or b = 50 (right).

Figure 4: Mean error of the M-estimate. Mean error vs. percentage of outliers with m = 50 (row 1) and m = 200 (row 2). Mean error vs. noise variance with m = 50 (row 3) and m = 200 (row 4). Outliers are uniformly distributed in a square centered at b = 22.5 (left) or b = 50 (right).

4 Conclusion

The AM-estimator has the advantages that it dispenses with scale estimates and free parameters and finds a good approximation to the global solution. Divergence is minimal in the AM-estimator because the initial estimate for the current $\gamma$ value is the convergence point obtained with the previous $\gamma$ value. Experimental results demonstrate significant improvements over the traditional M-estimator in accuracy, stability, and breakdown point as well. Each statistic is computed from 1000 random tests, and the data sets are exactly the same for the two methods, so the comparison of the two methods is sufficiently reliable.

References

[1] R. L. Kashyap and K. N. Eom. "Robust image modeling techniques with their applications". IEEE Transactions on Acoustics, Speech and Signal Processing, 36(8):1313-1325.

[2] D. Shulman and J. Y. Herve. "Regularization of discontinuous flow fields". In Proc. Workshop on Visual Motion, pages 81-86.

[3] P. Huber. Robust Statistics. Wiley.

[4] R. Stevenson and E. Delp. "Fitting curves with discontinuities". In Proceedings of the International Workshop on Robust Computer Vision, pages 127-136, Seattle, WA, October.

[5] P. J. Besl, J. B. Birch, and L. T. Watson. "Robust window operators". In Proceedings of the Second International Conference on Computer Vision, pages 591-600, Florida, December.

[6] R. M. Haralick, H. Joo, C. N. Lee, X. Zhuang, V. G. Vaidya, and M. B. Kim. "Pose estimation from corresponding point data". IEEE Transactions on Systems, Man and Cybernetics, 19:1426-1446.

[7] R. Kumar and A. R. Hanson. "Robust estimation of camera location and orientation from noisy data having outliers". In Proc. Workshop on Interpretation of Three-Dimensional Scenes, pages 52-60.

[8] X. Zhuang, T. Wang, and P. Zhang. "A highly robust estimator through partially likelihood function modeling and its application in computer vision". IEEE Transactions on Pattern Analysis and Machine Intelligence, 14.

[9] J. M. Jolion, P. Meer, and S. Bataouche. "Robust clustering with applications in computer vision". IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:791-802.

[10] K. L. Boyer, M. J. Mirza, and G. Ganguly. "The robust sequential estimator: A general approach and its application to surface organization in range data". IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(10), October.

[11] R. M. Haralick, editor. Proceedings of the International Workshop on Robust Computer Vision, Seattle, WA, October.

[12] W. Forstner and S. Ruwiedel, editors. Robust Computer Vision - Quality of Vision Algorithms (Proceedings of the 2nd International Workshop on Robust Computer Vision, Karlsruhe, Germany, March 10-12).

[13] S. Z. Li. Markov Random Field Modeling in Computer Vision. Springer-Verlag.

[14] P. Meer, D. Mintz, A. Rosenfeld, and D. Y. Kim. "Robust regression methods for computer vision: A review". International Journal of Computer Vision, 6:59-70.

[15] A. Stein and M. Werman. "Robust statistics in shape fitting". In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 540-546.

[16] Raymond H. Myers. Classical and Modern Regression with Applications. PWS-Kent Publishing Company.

[17] S. Kirkpatrick, C. D. Gellatt, and M. P. Vecchi. "Optimization by simulated annealing". Science, 220:671-680.

[18] S. Geman and D. Geman. "Stochastic relaxation, Gibbs distribution and the Bayesian restoration of images". IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721-741, November.

[19] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. "A learning algorithm for Boltzmann machines". Cognitive Science, 9:147-169.

[20] J. W. Tukey. Exploratory Data Analysis. Addison-Wesley, Reading, MA.

[21] S. Z. Li. "On discontinuity-adaptive smoothness priors in computer vision". IEEE Transactions on Pattern Analysis and Machine Intelligence, accepted for publication.

[22] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, Cambridge, MA.

[23] J. J. Hopfield. "Neurons with graded response have collective computational properties like those of two-state neurons". Proceedings of the National Academy of Sciences, USA, 81:3088-3092.

[24] C. Koch, J. Marroquin, and A. Yuille. "Analog `neuronal' networks in early vision". Proceedings of the National Academy of Sciences, USA, 83:4263-4267.

[25] A. Witkin, D. T. Terzopoulos, and M. Kass. "Signal matching through scale space". International Journal of Computer Vision, pages 133-144.

Biography

S. Z. Li received the B.Sc. degree from Hunan University, China, in 1982, the M.Sc. degree from the National University of Defense Technology, China, in 1985, and the Ph.D. degree from the University of Surrey, UK. All degrees are in EEE. He is currently a lecturer at Nanyang Technological University, Singapore. His research interests include computer vision, pattern recognition, image processing and optimization methods.

