IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 17, NO. 6, JUNE 2008

Maximum-Entropy Expectation-Maximization Algorithm for Image Reconstruction and Sensor Field Estimation

Hunsop Hong, Student Member, IEEE, and Dan Schonfeld, Senior Member, IEEE

Abstract: In this paper, we propose a maximum-entropy expectation-maximization (MEEM) algorithm. We use the proposed algorithm for density estimation. The maximum-entropy constraint is imposed for smoothness of the estimated density function. The derivation of the MEEM algorithm requires determination of the covariance matrix in the framework of the maximum-entropy likelihood function, which is difficult to solve analytically. We, therefore, derive the MEEM algorithm by optimizing a lower bound of the maximum-entropy likelihood function. We note that the classical expectation-maximization (EM) algorithm has been employed previously for 2-D density estimation. We propose to extend the use of the classical EM algorithm to image recovery from randomly sampled data and sensor field estimation from randomly scattered sensor networks. We further propose to use our approach in density estimation, image recovery, and sensor field estimation. Computer simulation experiments are used to demonstrate the superior performance of the proposed MEEM algorithm in comparison to existing methods.

Index Terms: Expectation-maximization (EM), Gaussian mixture model (GMM), image reconstruction, kernel density estimation, maximum entropy, Parzen density, sensor field estimation.

Manuscript received March 29, 2007; revised January 13, 2008. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gaurav Sharma. The authors are with the Multimedia Communications Laboratory, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL USA (e-mail: hhong6@uic.edu; dans@uic.edu).

I. INTRODUCTION

Estimating an unknown probability density function (pdf) given a finite set of observations is an important aspect of many image processing problems. The Parzen windows method [1] is one of the most popular methods which provides a nonparametric approximation of the pdf based on the underlying observations. It can be shown to converge to an arbitrary density function as the number of samples increases. The sample requirement, however, is extremely high and grows dramatically as the complexity of the underlying density function increases. Reducing the computational cost of the Parzen windows density estimation method is an active area of research. Girolami and He [2] present an excellent review of recent developments in the literature. There are three broad categories of methods adopted to reduce the computational cost of the Parzen windows density estimation for large sample sizes: a) approximate kernel decomposition methods [3], b) data reduction methods [4], and c) sparse functional approximation methods. Sparse functional approximation methods, such as support vector machines (SVM) [5], obtain a sparse representation in the approximation coefficients and, therefore, reduce the computational cost of evaluation on a test set. Excellent results are obtained using these methods. However, these methods scale poorly with the sample size, making them computationally expensive. The reduced set density estimator (RSDE) developed by Girolami and He [2] provides a superior sparse functional approximation method which is designed to minimize an integrated squared-error (ISE) cost
function. The RSDE formulates a quadratic programming problem and solves it for a reduced set of nonzero coefficients to arrive at an estimate of the pdf. Despite the computational efficiency of the RSDE in density estimation, it can be shown that this method suffers from some important limitations [6]. In particular, not only does the linear term in the ISE measure result in a sparse representation, but its optimization leads to assigning all the weights to zero with the exception of the sample point closest to the mode, as observed in [2] and [6]. As a result, the ISE-based approach to density estimation degenerates to a trivial solution, characterized by an impulse coefficient distribution resulting in a single kernel density function, as the number of data samples increases. The expectation-maximization (EM) algorithm [7], on the other hand, provides a very effective and popular alternative for estimating model parameters. It provides an iterative solution, which converges to a local maximum of the likelihood function. Although the solution to the EM algorithm provides the maximum likelihood estimate of the kernel model for the density function, the resulting estimate is not guaranteed to be smooth and may still preserve some of the sharpness of the ISE-based density estimation methods. A common method used in regularization theory to ensure smooth estimates is to impose a maximum-entropy constraint. There have been some attempts to combine the entropy criterion with the EM algorithm. Byrne [8] proposed an iterative image reconstruction algorithm based on cross-entropy minimization using the Kullback-Leibler (KL) divergence measure [9]. Benavent et al. [10] presented an entropy-based EM algorithm for the Gaussian mixture model in order to determine the optimal number of centers. However, despite the efforts to use maximum entropy to obtain smoother density estimates, thus far, there have been no successful attempts to expand the EM algorithm by incorporating a maximum-entropy penalty-based approach to estimating the optimal weights, means, and covariance matrices. In this paper, we introduce several novel methods for smooth kernel density estimation by relying on a maximum-entropy

penalty and use the proposed methods for the solution of important applications in image reconstruction and sensor field estimation.

The remainder of the paper is organized as follows. In Section II, we first introduce kernel density estimation and present the integrated squared-error (ISE) cost function. We subsequently introduce maximum-entropy ISE-based density estimation to ensure that the estimated density function is smooth and does not suffer from the degeneracy of ISE-based kernel density estimation. Determination of the maximum-entropy ISE-based cost function is a difficult task and generally requires the use of iterative optimization techniques. We propose the hierarchical maximum-entropy kernel density estimation (HMEKDE) method by using a hierarchical tree structure for the decomposition of the density estimation problem under the maximum-entropy constraint at multiple resolutions. We derive a closed-form solution to the hierarchical maximum-entropy kernel density estimate for implementation on binary trees. We also propose an iterative solution to penalty-based maximum-entropy density estimation by using Newton's method. The methods discussed in this section provide the optimal weights for kernel density estimates which rely on fixed kernels located at a few samples. In Section III, we propose the maximum-entropy expectation-maximization (MEEM) algorithm to provide the optimal estimates of the weight, mean, and covariance for kernel density estimation. We investigate the performance of the proposed MEEM algorithm for 2-D density estimation and provide computer simulation experiments comparing the various methods presented for the solution of maximum-entropy kernel density estimation in Section IV. We propose the application of both the EM and MEEM algorithms to image reconstruction from randomly sampled images and sensor field estimation from randomly scattered sensors in Section V. The basic EM algorithm estimates a complete data set from partial data sets, and, therefore, we propose to use the EM and MEEM algorithms in these image reconstruction and sensor network applications. We present computer simulations of the performance of the various methods for kernel density estimation in these applications and discuss their advantages and disadvantages. A discussion of the performance of the MEEM algorithm as the number of kernels varies is provided in Section VI. Finally, in Section VII, we provide a brief summary and discussion of our results.

II. KERNEL-BASED DENSITY ESTIMATION

A. Parzen Density Estimation

The Parzen density estimator using the Gaussian kernel is given in (1), following [11], where the sum is taken over the total number of observations and the isotropic Gaussian kernel is defined in (2). The main limitation of the Parzen windows density estimator is its very high computational cost, due to the very large number of kernels required for its representation.

B. Kernel Density Estimation

We seek an approximation to the true density of the form given in (3), where the components are Gaussian kernels as defined in (2). The weights must be determined such that the overall model remains a pdf, i.e., they must be nonnegative and sum to one, as stated in (4). Later in this paper, we will explore the simultaneous optimization of the means, variances, and weights of the Gaussian kernels; here, we focus exclusively on the weights. The variances and means of the Gaussian kernels are estimated by using the k-means algorithm in order to reduce the computational burden. Specifically, the centers of the kernels in (3) are determined by k-means clustering, and the variance of the kernels is set to the mean of the Euclidean distances between centers [12]. We assume that the number of observations is significantly greater than the number of kernels, since the Parzen method relies on delta functions at the sample data, which are represented by Gaussian functions with very narrow variance. The mixture of Gaussians model, on the other hand, relies on a few Gaussian kernels, and the variance of each Gaussian function is designed to capture many sample points. Therefore, only the coefficients are unknown. We rely on minimization of the error between the kernel model in (3) and the Parzen estimate in (1) using the ISE method. The ISE cost function is given in (5). Substituting (1) and (3) into (5), expanding, and exchanging the order of integration and summation, we can write the cost function in vector-matrix form as in (6), where the matrix and vector entries are defined in (7). Our goal is to minimize this function with respect to the weights under the conditions provided by (4). Equation (6) is a quadratic programming problem, which has a unique solution if the matrix in (7) is positive semi-definite [13].
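To make the ISE construction above concrete, the following is a minimal sketch in Python/NumPy. The names `Q`, `b`, `gauss_overlap`, and so on are chosen here for illustration; the paper's own notation for the quantities in (6)-(7) is not reproduced in this transcription. The sketch builds the quadratic ISE cost between a Parzen estimate and a small kernel model using the identity that the integral of a product of two Gaussians is a Gaussian evaluated at the difference of the means.

```python
import numpy as np

def gauss_overlap(m1, s1, m2, s2):
    """Integral over R^d of N(x; m1, s1^2 I) * N(x; m2, s2^2 I)
    = N(m1; m2, (s1^2 + s2^2) I)."""
    d = m1.shape[-1]
    v = s1 ** 2 + s2 ** 2
    diff = m1 - m2
    return np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / v) / (2 * np.pi * v) ** (d / 2)

def ise_quadratic_form(samples, centers, sigma_parzen, sigma_kernel):
    """Return (Q, b) such that ISE(w) = w^T Q w - 2 w^T b + const."""
    C = centers[:, None, :]                                   # (M, 1, d)
    Q = gauss_overlap(C, sigma_kernel, centers[None, :, :], sigma_kernel)
    S = samples[None, :, :]                                   # (1, N, d)
    b = gauss_overlap(C, sigma_kernel, S, sigma_parzen).mean(axis=1)
    return Q, b

# toy usage: 500 samples in 2-D, 9 kernels
rng = np.random.default_rng(0)
X = rng.normal(loc=0.5, scale=0.15, size=(500, 2))
# the paper places centers by k-means; a random subset of the data is used here for brevity
centers = X[rng.choice(len(X), 9, replace=False)]
dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
sigma_k = dist[dist > 0].mean()          # mean distance between distinct centers, cf. [12]
Q, b = ise_quadratic_form(X, centers, sigma_parzen=0.05, sigma_kernel=sigma_k)
# minimizing w^T Q w - 2 w^T b over the probability simplex is the quadratic program in (6)
```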

Therefore, the cost function can be simplified. In Appendix A, we prove that the solution of ISE-based kernel density estimation degenerates, as the number of observations increases, to a trivial solution that concentrates the estimated probability mass in a single kernel. This degeneracy leads to a sharp peak in the estimated density, which is characterized by the minimum-entropy solution.

C. Maximum-Entropy Kernel Density Estimation

Given observations from an unknown probability distribution, there may exist an infinity of probability distributions consistent with the observations and any given constraints [14]. The maximum-entropy principle states that under such circumstances we are required to be maximally uncertain about what we do not know, which corresponds to selecting the density with the highest entropy among all candidate solutions to the problem. In order to avoid degenerate solutions to (6), we maximize the entropy and minimize the divergence between the estimated distribution and the Parzen windows density estimate. Here, we use Renyi's quadratic entropy measure [11], defined in (8). Substituting (3) into (8), expanding the square, and interchanging the order of summation and integration, we obtain (9). Since the logarithm is a monotonic function, maximizing the logarithm of a function is equivalent to maximizing the function itself. Thus, the maximum-entropy solution can be reached by maximizing the quantity expressed in vector-matrix form in (10), subject to the constraints provided by (4).

1) Penalty-Based Approach Using Newton's Method: We adopt a penalty-based approach by introducing an arbitrary constant to balance the ISE and entropy cost functions. We, therefore, define a new cost function, given in (11), in which the penalty coefficient weights the entropy term; any term that is constant with respect to the weights is omitted. Newton's method for multiple variables is given in [15] and stated in (12), where the superscript denotes the iteration. We use the soft-max function for the weight constraint [16]: the weight of each center is expressed through the soft-max parameterization in (13), and the derivative of the weight with respect to the underlying parameter is given in (14). For convenience, we define the auxiliary variables in (15)-(18). By expanding the square and interchanging the order of summation and integration, we obtain (19), and we can then express (11) using (15)-(18) as in (20). The elements of the gradient of (20), and of the Hessian matrix for its two cases, are given in (21) and (22); the detailed derivations are presented in Appendix B. We assume that the Hessian matrix is positive definite. Finally, the gradient and Hessian required for the iteration in (12) can be generated using (21), (22), and (59).

2) Constrained-Based Approach Using a Hierarchical Binary Tree: Our preference is to avoid penalty-based methods and to derive the optimal weights as a constrained optimization problem. Specifically, we seek the maximum-entropy weights subject to a constraint on the ISE cost.
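Since the explicit gradient and Hessian expressions of (21)-(22) are developed in Appendix B and are not reproduced in this transcription, the following sketch illustrates the penalty-based idea numerically, reusing `Q` and `b` from the earlier ISE sketch and a hypothetical penalty coefficient `gamma`. The weights are parameterized through a soft-max so that the simplex constraint (4) holds automatically, and the combined cost (ISE plus an entropy penalty) is reduced by plain gradient steps rather than the damped Newton/Levenberg-Marquardt iteration used in the paper.

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def penalized_cost_and_grad(theta, Q, b, gamma):
    """Cost = ISE(w) + gamma * log(w^T Q w); the log term is the negative of
    Renyi's quadratic entropy of the mixture, so lowering it raises the entropy."""
    w = softmax(theta)
    Qw = Q @ w
    wQw = w @ Qw
    cost = wQw - 2 * w @ b + gamma * np.log(wQw)
    dJ_dw = 2 * Qw - 2 * b + gamma * 2 * Qw / wQw
    grad_theta = w * (dJ_dw - w @ dJ_dw)      # chain rule through the soft-max
    return cost, grad_theta

# gradient descent stands in for the damped Newton iteration of (12); step size is illustrative
theta = np.zeros(len(b))
for _ in range(500):
    _, g = penalized_cost_and_grad(theta, Q, b, gamma=0.05)
    theta -= 0.5 * g
weights = softmax(theta)                      # nonnegative and summing to one, as required by (4)
```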

The constraint in the maximum-entropy problem is defined such that the corresponding ISE cost function does not exceed the optimal ISE cost by more than a prespecified value. From (6), (25), and (26), we can determine the optimal ISE coefficients by minimizing the cost given in (27). We thus define the maximum-entropy coefficients by the constrained problem in (23); combining (6), (25), (26), and (28), the constraint takes the form given in (29).

Fig. 1. Binary tree structure for hierarchical density estimation.

A closed-form solution to this problem is difficult to obtain in general. However, we can obtain a closed-form solution when the number of centers is limited to two. Hence, we form an iterative process, where we assume that we only have two centers at each iteration. We represent this iterative process as a hierarchical model, which generates new centers at each iteration. We use a binary tree to illustrate the hierarchical model, where each node in the tree depicts a single kernel. Therefore, in the binary tree, each parent node has two children nodes, as seen in Fig. 1. The final density function corresponds to the kernels at the leaves of the tree. We now wish to determine the maximum-entropy kernel density estimate at each iteration of the hierarchical binary tree. We, therefore, seek the maximum-entropy coefficients of the two children at each node. Note that the sum of these coefficients is dictated by the coefficient of their parent node. This restriction ensures that the sum of the coefficients of all the leaf nodes (i.e., nodes with no children) is one, since we set the coefficient of the root node to 1. We simplify the notation by considering the coefficients of the two children nodes, whose sum equals the coefficient of their parent node; this implies that it is sufficient to characterize a single optimal coefficient at each split. The samples are divided into two groups using the k-means method at each node, and we adopt the notation of (24). Assuming, without loss of generality, a fixed ordering of the two children, the relevant constant follows from (6) and (7) as given in (25) and (26). From (10) and (25), we observe that the maximum-entropy coefficient is given by the constrained problem in (30). Therefore, from (30), we form the Lagrangian given in (31). Differentiating with respect to the coefficient and setting the result to zero yields (32). We now determine the Lagrange multiplier by satisfying the constraint: from (31) and (32), we obtain (33), and from (33) and (31), we obtain (34). Finally, imposing the required condition on (34) yields the closed-form expression for the optimal child coefficient in terms of the quantities defined above.
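The closed-form two-center solution derived above depends on (25)-(34), whose explicit forms are not reproduced in this transcription. As a structural sketch only, the following recursion shows how the binary tree of Fig. 1 is grown: each node splits its samples with 2-means, hands a share of its own weight to each child, and splitting stops when a leaf holds fewer than a fixed fraction of the samples, mirroring the stopping rule used in Section IV. The `split_weight` rule is an illustrative placeholder, not the paper's maximum-entropy coefficient.

```python
import numpy as np

def two_means(X, iters=20, rng=None):
    """Tiny 2-means: returns (labels, centers)."""
    rng = rng or np.random.default_rng(0)
    c = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2)
        lab = d.argmin(axis=1)
        for k in (0, 1):
            if np.any(lab == k):
                c[k] = X[lab == k].mean(axis=0)
    return lab, c

def split_weight(X_left, X_right):
    """Placeholder for the closed-form maximum-entropy coefficient of (34):
    here simply the fraction of samples falling in the left child."""
    return len(X_left) / (len(X_left) + len(X_right))

def grow_tree(X, weight=1.0, min_frac=0.05, n_total=None, leaves=None):
    """Recursively split the samples; each leaf contributes (center, weight)."""
    n_total = n_total or len(X)
    leaves = [] if leaves is None else leaves
    if len(X) < min_frac * n_total or len(X) < 4:
        leaves.append((X.mean(axis=0), weight))
        return leaves
    lab, _ = two_means(X)
    X_l, X_r = X[lab == 0], X[lab == 1]
    if len(X_l) == 0 or len(X_r) == 0:          # degenerate split: stop here
        leaves.append((X.mean(axis=0), weight))
        return leaves
    alpha = split_weight(X_l, X_r)
    grow_tree(X_l, weight * alpha, min_frac, n_total, leaves)
    grow_tree(X_r, weight * (1 - alpha), min_frac, n_total, leaves)
    return leaves

# usage: the leaves of the tree define the kernel centers and weights of (3)
X = np.random.default_rng(1).normal(size=(500, 2))
kernels = grow_tree(X)                           # list of (center, weight); weights sum to 1
```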

III. MAXIMUM-ENTROPY EXPECTATION-MAXIMIZATION ALGORITHM

As seen in the previous section, the ISE-based methods enable pdf estimation given a set of observations without information about the underlying density. However, the ISE-based solutions do not fully utilize the sample information as the number of samples increases. Moreover, ISE-based methods are generally used to determine only the optimal weights of the linear combination; selection of the means and variances of the kernel functions is accomplished by using the k-means algorithm, which can be viewed as a hard-limiting case of EM [7]. The EM algorithm offers an approximation of the pdf by iterative optimization under the maximum likelihood criterion. A probability density function can be approximated as the sum of Gaussian functions, as in (35), where each component has a center, a covariance matrix, and a weight subject to the conditions in (4); the Gaussian function is given in (36). From (35) and (36), the logarithm of the likelihood function of the observations, for the given Gaussian mixture parameters, can be written as in (37), where the parameter set consists of the weights, centers, and covariances to be estimated. An entropy term is added in order to make the estimated density function smooth and to prevent an impulse distribution. We extend Renyi's quadratic entropy measure [11] to incorporate the covariance matrices: substituting (35) into (8), expanding the square, and interchanging the order of summation and integration, we obtain (38). We, therefore, form an augmented likelihood function, parameterized by a positive scalar, in order to simultaneously maximize the entropy and the likelihood using (37) and (38); the augmented likelihood function is given in (39).

The expectation step of the EM algorithm can be separated into two terms: the expectation related to the likelihood, given in (40), and the expectation related to the entropy penalty, given in (41), where the superscript denotes the iteration number. Jensen's inequality is applied to find a new lower bound of the likelihood function using (40) and (41); the resulting lower bound is derived in (42). We now wish to obtain a lower bound for the entropy term as well. This bound cannot be derived using the method in (42), since the entropy term is not a concave function. To derive the lower bound, we, therefore, rely on a monotonically decreasing and concave surrogate function; the detailed derivation is provided in Appendix C. Notice that maximization of the entropy remains unchanged if we replace the function in (38) by this surrogate, since both are monotonically decreasing functions. We can now use Jensen's inequality to obtain the lower bound for the entropy. The lower bound which combines the two lower bounds is given in (43). Since we have the lower-bound function, the new estimates of the parameters are easily calculated by setting the derivatives of (43) with respect to each parameter to zero.
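The MEEM update equations (44)-(48) augment the classical EM updates, and their explicit forms are not reproduced in this transcription. The following is a minimal sketch of the classical EM baseline for a Gaussian mixture (the algorithm labeled "conventional EM" in the experiments), against which the entropy-augmented M-step is a modification; the function names are mine.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """N(x; mean, cov) evaluated row-wise on X of shape (N, d)."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(expo) / norm

def em_gmm(X, K, iters=50, seed=0):
    """Classical EM for a K-component Gaussian mixture (no entropy penalty)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    means = X[rng.choice(N, K, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    weights = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities (the likelihood-related expectation, cf. (40))
        resp = np.stack([w * gaussian_pdf(X, m, c)
                         for w, m, c in zip(weights, means, covs)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300
        # M-step: maximize the likelihood lower bound
        Nk = resp.sum(axis=0)
        weights = Nk / N
        means = (resp.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - means[k]
            covs[k] = (resp[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return weights, means, covs
```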

A. Mean

The new estimates for the mean vectors are obtained by taking the derivative of (43) with respect to each mean and setting it to zero, which yields (44).

B. Weight

For the weights, we once again use the soft-max function in (13) and (14). Thus, by setting the derivative of (43) with respect to the soft-max parameter to zero, the new estimated weight is given by (45).

C. Covariance

In order to update the covariance, the derivative of (43) with respect to the covariance is required. However, this derivative cannot be solved directly because of the inverse matrix which appears in it. We, therefore, introduce a new lower bound using the Cauchy-Schwartz inequality. The lower bound given by (43) can be rewritten as (46). The relevant term in (46) is bounded using the Cauchy-Schwartz inequality and the fact that the Gaussian function is nonnegative, as shown in (47). Using (47) and the symmetry of the Gaussian, we thus introduce a new lower bound for the covariance, given in (48). The new estimated covariance is then obtained by setting the derivative of this new lower bound with respect to the covariance to zero.

We note that the EM algorithm presented here relies on a simple extension of the lower-bound maximization method in [17]. In particular, we can use this method to prove that our algorithm converges to a local maximum of the bound generated by the Cauchy-Schwartz inequality, which serves as a lower bound on the augmented likelihood function. Moreover, we would have attained a local maximum of the augmented likelihood function itself had we not used the Cauchy-Schwartz inequality to obtain a lower bound for the sum of the covariances. Note that the Cauchy-Schwartz inequality is met with equality if and only if the covariance matrices of the different kernels are identical. Therefore, if the kernels are restricted to have the same covariance structure, the maximum-entropy expectation-maximization algorithm converges to a local maximum of the augmented likelihood function.

IV. TWO-DIMENSIONAL DENSITY ESTIMATION

We apply the MEEM method and other conventional methods to a 2-D density estimation problem. Fig. 2(a) shows the original 2-D density function, and Fig. 2(b) displays a scatter plot of 500 data samples drawn from the mixture in (49) on the interval [0,1]. Given the data, without knowledge of the underlying density function used to generate the observations, we must estimate the 2-D density function. Here, we use 500, 1000, 1500, and 2000 samples for the experiment. With the exception of the RSDE method, the other approaches cannot be used to determine the optimal number of centers, since it fluctuates based on variations in the problem (e.g., the initial conditions).
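A small evaluation harness for this kind of experiment is sketched below. The three-component mixture stands in for the generating density of (49), whose exact parameters are not reproduced in this transcription, and the SNR is computed with a standard power-ratio definition on a common grid; the paper's exact metric is not reproduced here. The sketch reuses `em_gmm` and `gaussian_pdf` from the EM sketch above.

```python
import numpy as np

# an illustrative 2-D, three-component mixture standing in for (49)
TRUE_W = np.array([0.4, 0.35, 0.25])
TRUE_M = np.array([[0.3, 0.3], [0.7, 0.6], [0.5, 0.8]])
TRUE_S = np.array([0.08, 0.06, 0.05])                  # isotropic standard deviations

def true_density(P):
    out = np.zeros(len(P))
    for w, m, s in zip(TRUE_W, TRUE_M, TRUE_S):
        out += w * np.exp(-np.sum((P - m) ** 2, axis=1) / (2 * s * s)) / (2 * np.pi * s * s)
    return out

def draw_samples(n, rng):
    comp = rng.choice(len(TRUE_W), size=n, p=TRUE_W)
    return TRUE_M[comp] + TRUE_S[comp, None] * rng.normal(size=(n, 2))

def snr_db(true_vals, est_vals):
    """10*log10(signal power / error power) on a common evaluation grid."""
    err = true_vals - est_vals
    return 10 * np.log10(np.sum(true_vals ** 2) / np.sum(err ** 2))

rng = np.random.default_rng(0)
X = draw_samples(500, rng)
g = np.linspace(0, 1, 64)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
# fit with the EM sketch above (or any estimator of Section II) and score it
w, m, c = em_gmm(X, K=5)
est = np.zeros(len(grid))
for wk, mk, ck in zip(w, m, c):
    est += wk * gaussian_pdf(grid, mk, ck)
print("SNR (dB):", snr_db(true_density(grid), est))
```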

We determine the number of centers experimentally such that we assign fewer than 100 samples per center for Newton's method, EM, and MEEM. For the HMEKDE method, we terminate the splitting of the hierarchical tree when a leaf holds less than 5% of the total number of samples.

Fig. 2. Comparison of 2-D density estimation from 500 samples. (a) Original density function; (b) 500 samples; (c) RSDE; (d) HMEKDE; (e) Newton's method; (f) conventional EM; (g) MEEM.

The results of RSDE are shown in Fig. 2(c). The RSDE method is a very powerful algorithm in that it requires no parameters for the estimation. However, the choice of the kernel width is crucial, since RSDE suffers from the degeneracy problem when the kernel width is large, and its reduction performance is diminished when the kernel width is small. The results of Newton's method and HMEKDE are given in Fig. 2(d) and (e), respectively. The major practical issue in implementing Newton's method is guaranteeing a local minimum, which can be sustained by positive definiteness of the Hessian matrix [15]; thus, we use the Levenberg-Marquardt algorithm [18], [19]. The threshold value in the HMEKDE method is chosen experimentally. The results of the conventional EM algorithm and the MEEM algorithm are shown in Fig. 2(f) and (g), respectively. The penalty parameter in the MEEM algorithm is chosen experimentally, and the result of MEEM is properly smoothed. In Fig. 3, the SNR improvement as a function of the iteration and of the penalty value is displayed using 300 samples. We choose the penalty value proportional to the number of samples; the parameter values shown in Fig. 3 are the multipliers of the number of samples (i.e., 0.05, 0.10, and 0.15). We observe the over-fitting problem of the EM algorithm in Fig. 3. The overall improvements in SNR are given in Table I.

Fig. 3. SNR improvement as a function of the iteration and the penalty parameter.

TABLE I. SNR comparison of the algorithms for 2-D density estimation.

V. IMAGE RECONSTRUCTION AND SENSOR FIELD ESTIMATION

The density estimation problem can easily be expanded to practical problems such as image reconstruction from random samples. For the experiment, we use the gray-level Pepper, Lena, and Barbara images, shown in Fig. 4(a)-(c). We take 50% samples of the Pepper image, 60% samples of the Lena image, and 70% samples of the Barbara image. We use the density function model of [20], in which the feature vector consists of the intensity value and the location of a pixel, and we estimate the density function of the given image from the samples.

Fig. 4. Three gray images used for the experiments: (a) Pepper, (b) Lena, and (c) Barbara; and two sensor fields used for sensor field estimation from randomly scattered sensors: (d) polynomial sensor field and (e) artificial sensor field.

For the reduction of the computational burden, 50% overlapped blocks are used for the experiment. Since the smoothness differs from block to block, we choose the smoothing parameter for each block experimentally. The initial center locations are equally spaced, and we use 3 x 3 centers for the experiment. Using the estimated density function, we can estimate the intensity value at a given location by taking the expectation of the conditional density function. The sampled image and the reconstruction results for Lena are shown in Fig. 5. We can also expand our approach to the estimation of a sensor field from randomly scattered sensors. In this experiment, we generate an arbitrary field using polynomials, shown in Fig. 4(d), and an artificial field, shown in Fig. 4(e). The original sensor field is randomly sampled; 2% of the samples are used for the polynomial field and 30% of the samples are used for the artificial field.
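The reconstruction step described above, intensity as the conditional expectation of the estimated joint density, can be sketched as follows for a Gaussian mixture fitted over a joint (row, column, intensity) feature space. `em_gmm` refers to the earlier EM sketch; the block and overlap machinery of the experiments is omitted, and the commented usage names (`sampled_pixels`, `missing_locations`) are hypothetical.

```python
import numpy as np

def conditional_expectation_intensity(pos, weights, means, covs):
    """E[intensity | position] under a GMM over (row, col, intensity).

    pos: (M, 2) query locations. For each component, the conditional mean of the
    intensity given the position is mu_I + S_Ip S_pp^{-1} (pos - mu_p), and the
    components are mixed by their posterior weight given the position.
    """
    num = np.zeros(len(pos))
    den = np.zeros(len(pos))
    for w, mu, S in zip(weights, means, covs):
        mu_p, mu_i = mu[:2], mu[2]
        S_pp, S_ip = S[:2, :2], S[2, :2]
        inv = np.linalg.inv(S_pp)
        diff = pos - mu_p
        # marginal density of the position under this component
        marg = np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)) \
               / (2 * np.pi * np.sqrt(np.linalg.det(S_pp)))
        cond_mean = mu_i + diff @ inv @ S_ip
        num += w * marg * cond_mean
        den += w * marg
    return num / np.maximum(den, 1e-300)

# usage sketch: fit a GMM to the sampled pixels (row, col, intensity), then fill in
# the missing pixels by conditional expectation
# w, m, c = em_gmm(sampled_pixels, K=9)
# recon = conditional_expectation_intensity(missing_locations, w, m, c)
```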
We use a density function model in which the feature vector consists of the sensor reading and the location of the sensor. To reduce the computational time, 50% overlapped blocks of two different sizes are used for the estimation of the polynomial sensor field and the artificial sensor field, respectively. We again choose the smoothing parameter for each block experimentally. The initial center locations are equally spaced, and we use 3 x 3 centers for each experiment. We estimate the density function of the given field from the sensor readings. For each algorithm except HMEKDE, we use equally spaced centers as the initial center locations. The sampled sensor field and the estimation results for the artificial field are given in Fig. 6. The signal-to-noise ratios of the results and the computational times are also given in Table II.

Fig. 5. Comparison of density estimation for image reconstruction from a randomly sampled image. (a) 60% sampled image; (b) RSDE; (c) HMEKDE; (d) Newton's method; (e) conventional EM; (f) MEEM.

Fig. 6. Comparison of density estimation for artificial sensor field estimation from randomly scattered sensors. (a) 30% sampled sensors; (b) RSDE; (c) HMEKDE; (d) Newton's method; (e) conventional EM; (f) MEEM.

TABLE II. SNR comparison of the density estimation algorithms for image reconstruction and sensor field estimation.

VI. DISCUSSION

In this section, we discuss the relationship between the number of centers and minimum/maximum entropy. Our experimental results indicate that, in most cases, the results under the maximum-entropy penalty are better than those of the conventional EM algorithm. However, in some limited cases, such as when we use a small number of centers, the results of a minimum-entropy penalty are better than those of both the conventional EM algorithm and the maximum-entropy penalty. This is due to the characteristics of maximum and minimum entropy, which are well described in [21]. The maximum-entropy solution provides a smooth solution: when the number of centers is relatively sufficient, each center can represent one Gaussian component piecewise, which means the resulting density function is described better under the maximum-entropy criterion. On the contrary, the minimum-entropy solution gives the least smooth distribution: when the number of centers is insufficient, each center has to represent a large number of samples, and thus the distribution described by a center should be the least smooth one, since each center can no longer be described as piecewise Gaussian. In general, however, the larger the number of centers used, the better the result.

VII. CONCLUSION

In this paper, we develop a new algorithm for density estimation using the EM algorithm with a maximum-entropy constraint. The proposed MEEM algorithm provides a recursive method to compute a smooth estimate of the maximum likelihood estimate. The MEEM algorithm is particularly suitable for tasks that require the estimation of a smooth function from limited or partial data, such as image reconstruction and sensor field estimation. We demonstrated the superior performance of the proposed MEEM algorithm in comparison to various methods (including the traditional EM algorithm) in application to 2-D density estimation, image reconstruction from randomly sampled data, and sensor field estimation from scattered sensor networks.

APPENDIX A
DEGENERACY OF THE KERNEL DENSITY ESTIMATION

This appendix illustrates the degeneracy of kernel density estimation discussed in [6]. We show that the ISE cost function converges asymptotically to its linear term as the number of data samples increases. Moreover, we show that optimization of the linear term leads to a trivial solution in which all of the coefficients are zero except one, which is consistent with the observation in [2]. We, therefore, establish that the minimal-ISE coefficients converge to an impulse coefficient distribution as the number of data samples increases.

In the following proposition, we prove that the ISE cost function in (6) decays asymptotically to its linear term as the number of data samples increases.

Proposition 1: The ratio of the quadratic and linear terms in (6) vanishes as the number of samples increases.

Proof: Expanding the ratio of the quadratic and linear terms in (6), we conclude that the quadratic term decays asymptotically at an exponential rate with an increasing number of data samples, and the quadratic programming minimization problem in (6) reduces to a linear programming problem defined by the linear term. Therefore, we can determine the minimal-ISE coefficients, as the number of data samples increases, from (6) by solving the linear programming problem in (50), subject to the constraints that the coefficients are nonnegative and sum to one.

In the following proposition, we show that the linear programming problem corresponding to the minimal-ISE cost function, as the number of data samples increases, degenerates to a trivial distribution of the coefficients which consists of an impulse function. In particular, we assume that the elements of the linear-term vector have a unique maximum element. This assumption generally corresponds to the case where the true density function has a distinct maximum, leading to a high-density region in the data samples. We show that the optimal distribution of the coefficients obtained from the solution of the linear programming problem in (50) is characterized by a spike at the maximum element and zeros for all other coefficients.

Proposition 2: The linear cost is minimized if and only if all of the weight is placed on the index of the maximum element.

Proof: If we place all of the weight on the maximum element on one side of (51) and apply the simplex constraint on the other, the inequality in (51) is met as an equality. We now prove the converse. Expanding the sum, canceling common terms, and grouping terms with like coefficients, we obtain (52). Since the grouped factors in (52) are positive, the remaining coefficients must be zero. This result can easily be extended to the case where the vector attains its maximum at several indices. This situation generally arises when the true density function has several nearly equal modes, leading to a few high-density regions in the data samples; in this case, we can show that the optimal weights are supported only on the maximizing indices. We now observe that the minimal-ISE coefficient distribution decays asymptotically to a Kronecker delta function as the number of data samples increases.

Corollary 1: The minimal-ISE coefficient vector converges to an impulse distribution as the number of samples increases.

Proof: The proof follows directly from Propositions 1 and 2.

This corollary implies that minimal-ISE kernel density estimation leads to the degenerate approximation in (53), which consists of a single kernel, as the number of samples increases [see (3)]. We now examine the entropy of the degenerate distribution given by (53), which has the lowest entropy among all possible kernel density estimates.

Proposition 3: The degenerate single-kernel estimate attains the lowest entropy among all kernel density estimates of the form (3).

Proof: We observe the elementary inequality in (54), which holds for all admissible weights. Taking logarithms on both sides and multiplying by -1, we obtain (55). We now compute the entropy of the degenerate distribution: from (2), (9), and (53), we obtain (56). Adding the appropriate term to both sides of (55) and using (9), we obtain the desired inequality. This completes the proofs.

From the propositions above, we observe that ISE-based kernel density estimation yields the lowest-entropy kernel density estimate; it results in a kernel density estimate that consists of a single kernel. This result presents a clear indication of the limitation of ISE-based cost functions.

APPENDIX B
GRADIENT AND HESSIAN IN NEWTON'S METHOD

In this appendix, we provide the detailed derivation of the gradient and the Hessian matrix used in (12). First, we present the gradient of (20), which requires the gradients of the auxiliary quantities defined in (15)-(18). From (16) and (17), the gradient of the first quantity can be expressed as in (57); similarly, from (18) and (19), the gradient of the second quantity can be expressed as in (58). Thus, the elements of the gradient can be expressed as in (59). The elements of the Hessian matrix can then be expressed, for the two cases, as in (60) and (61), and the remaining gradient required in these expressions is given in (62).
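The explicit expressions (57)-(62) are not reproduced in this transcription. When implementing them for the Newton iteration of (12), a practical safeguard is to compare the analytic gradient against central finite differences of the cost, as in the sketch below, which reuses the `penalized_cost_and_grad` function and the `Q`, `b` quantities from the earlier soft-max sketch; the tolerance is illustrative.

```python
import numpy as np

def finite_diff_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# check the analytic gradient of the penalized cost against finite differences
theta0 = np.random.default_rng(0).normal(size=len(b))
cost_fn = lambda t: penalized_cost_and_grad(t, Q, b, gamma=0.05)[0]
_, g_analytic = penalized_cost_and_grad(theta0, Q, b, gamma=0.05)
g_numeric = finite_diff_grad(cost_fn, theta0)
assert np.allclose(g_analytic, g_numeric, atol=1e-5), "gradient mismatch"
```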

APPENDIX C
CONCAVE FUNCTION INEQUALITY

Let us consider a monotonically decreasing and concave surrogate function. Since the surrogate is monotonically decreasing in the same sense as the original function, maximizing one is equivalent to maximizing the other; thus, the entropy term can be rewritten in terms of the surrogate. The argument of the function has a finite range, and within that range the function satisfies the inequality in (63), provided the stated conditions on its parameters hold. It can easily be shown that one of the bounding functions is convex while the other is concave; therefore, if the parameter conditions are met, the conditions required for (63) are satisfied.

REFERENCES

[1] E. Parzen, "On estimation of a probability density function and mode," Ann. Math. Statist., vol. 33, 1962.
[2] M. Girolami and C. He, "Probability density estimation from optimally condensed data samples," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, Oct. 2003.
[3] A. Izenmann, "Recent developments in nonparametric density estimation," J. Amer. Statist. Assoc., vol. 86, 1991.
[4] D. W. Scott and W. F. Szewczyk, "From kernels to mixtures," Technometrics, vol. 43, Aug. 2001.
[5] S. Mukherjee and V. Vapnik, Support Vector Method for Multivariate Density Estimation. Cambridge, MA: MIT Press, 2000.
[6] N. Balakrishnan and D. Schonfeld, "A maximum entropy kernel density estimator with applications to function interpolation and texture segmentation," presented at the SPIE Conf. Computational Imaging IV, San Jose, CA, 2006.
[7] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Statist. Soc., Ser. B, vol. 39, pp. 1-38, 1977.
[8] C. L. Byrne, "Iterative image reconstruction algorithms based on cross-entropy minimization," IEEE Trans. Image Process., vol. 2, no. 1, Jan. 1993.
[9] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statist., vol. 22, pp. 79-86, Mar. 1951.
[10] A. P. Benavent, F. E. Ruiz, and J. M. S. Martinez, "EBEM: An entropy-based EM algorithm for Gaussian mixture models," in Proc. 18th Int. Conf. Pattern Recognition, 2006, vol. 2.
[11] K. Torkkola, "Feature extraction by non-parametric mutual information maximization," J. Mach. Learn. Res., vol. 3.
[12] I. T. Nabney, Netlab: Algorithms for Pattern Recognition. New York: Springer, 2004.
[13] R. J. Vanderbei, Linear Programming: Foundations and Extensions, 2nd ed. Boston, MA: Kluwer, 2001.
[14] J. N. Kapur and H. K. Kesavan, Entropy Optimization With Applications. San Diego, CA: Academic, 1992.
[15] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 1999.
[16] C. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Oxford Univ. Press, 1995.
[17] R. Neal and G. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," in Learning in Graphical Models, M. I. Jordan, Ed. Norwell, MA: Kluwer, 1998.
[18] K. Levenberg, "A method for the solution of certain non-linear problems in least squares," Quart. Appl. Math., vol. 2, Jul. 1944.
[19] D. W. Marquardt, "An algorithm for the least-squares estimation of nonlinear parameters," SIAM J. Appl. Math., vol. 11, Jun. 1963.
[20] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, May 2002.
[21] Y. Lin and H. K. Kesavan, "Minimum entropy and information measure," IEEE Trans. Syst., Man, Cybern. C, vol. 28, no. 5, Aug. 1998.

Hunsop Hong (S'08) received the B.S. and M.S. degrees in electronic engineering from Yonsei University, Seoul, Korea, in 2000 and 2002,
respectively. He is currently pursuing the Ph.D. degree at the Department of Electrical and Computer Engineering, University of Illinois at Chicago. He was a Research Engineer at the Electronics and Telecommunications Research Institute (ETRI), Daejeon, Korea, until 2003. His research interests include image processing and density estimation.

Dan Schonfeld (M'90-SM'05) was born in Westchester, PA, in 1964. He received the B.S. degree in electrical engineering and computer science from the University of California, Berkeley, and the M.S. and Ph.D. degrees in electrical and computer engineering from the Johns Hopkins University, Baltimore, MD, in 1986, 1988, and 1990, respectively. In 1990, he joined the University of Illinois at Chicago, where he is currently an Associate Professor in the Department of Electrical and Computer Engineering. He has authored over 100 technical papers in various journals and conferences. His current research interests are in signal, image, and video processing; video communications; video retrieval; video networks; image analysis and computer vision; pattern recognition; and genomic signal processing. Dr. Schonfeld was coauthor of a paper that won the Best Student Paper Award in Visual Communication and Image Processing 2006. He was also coauthor of a paper that was a finalist for the Best Student Paper Award in Image and Video Communication and Processing 2005. He has served as an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING (Nonlinear Filtering) as well as an Associate Editor of the IEEE TRANSACTIONS ON SIGNAL PROCESSING (Multidimensional Signal Processing and Multimedia Signal Processing).


More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

Mixtures of Gaussians. Sargur Srihari

Mixtures of Gaussians. Sargur Srihari Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm

More information

Optimum Sampling Vectors for Wiener Filter Noise Reduction

Optimum Sampling Vectors for Wiener Filter Noise Reduction 58 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 50, NO. 1, JANUARY 2002 Optimum Sampling Vectors for Wiener Filter Noise Reduction Yukihiko Yamashita, Member, IEEE Absact Sampling is a very important and

More information

AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION

AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION AN ALTERNATING MINIMIZATION ALGORITHM FOR NON-NEGATIVE MATRIX APPROXIMATION JOEL A. TROPP Abstract. Matrix approximation problems with non-negativity constraints arise during the analysis of high-dimensional

More information

Asymptotic Achievability of the Cramér Rao Bound For Noisy Compressive Sampling

Asymptotic Achievability of the Cramér Rao Bound For Noisy Compressive Sampling Asymptotic Achievability of the Cramér Rao Bound For Noisy Compressive Sampling The Harvard community h made this article openly available. Plee share how this access benefits you. Your story matters Citation

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

USING multiple antennas has been shown to increase the

USING multiple antennas has been shown to increase the IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 55, NO. 1, JANUARY 2007 11 A Comparison of Time-Sharing, DPC, and Beamforming for MIMO Broadcast Channels With Many Users Masoud Sharif, Member, IEEE, and Babak

More information

On the Behavior of Information Theoretic Criteria for Model Order Selection

On the Behavior of Information Theoretic Criteria for Model Order Selection IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 49, NO. 8, AUGUST 2001 1689 On the Behavior of Information Theoretic Criteria for Model Order Selection Athanasios P. Liavas, Member, IEEE, and Phillip A. Regalia,

More information

Recursive Least Squares for an Entropy Regularized MSE Cost Function

Recursive Least Squares for an Entropy Regularized MSE Cost Function Recursive Least Squares for an Entropy Regularized MSE Cost Function Deniz Erdogmus, Yadunandana N. Rao, Jose C. Principe Oscar Fontenla-Romero, Amparo Alonso-Betanzos Electrical Eng. Dept., University

More information

On Mean Curvature Diusion in Nonlinear Image Filtering. Adel I. El-Fallah and Gary E. Ford. University of California, Davis. Davis, CA

On Mean Curvature Diusion in Nonlinear Image Filtering. Adel I. El-Fallah and Gary E. Ford. University of California, Davis. Davis, CA On Mean Curvature Diusion in Nonlinear Image Filtering Adel I. El-Fallah and Gary E. Ford CIPIC, Center for Image Processing and Integrated Computing University of California, Davis Davis, CA 95616 Abstract

More information

U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models

U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models U-Likelihood and U-Updating Algorithms: Statistical Inference in Latent Variable Models Jaemo Sung 1, Sung-Yang Bang 1, Seungjin Choi 1, and Zoubin Ghahramani 2 1 Department of Computer Science, POSTECH,

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 1215 A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing Da-Zheng Feng, Zheng Bao, Xian-Da Zhang

More information

OPTIMAL POWER FLOW (OPF) is a tool that has been

OPTIMAL POWER FLOW (OPF) is a tool that has been IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 20, NO. 2, MAY 2005 773 Cumulant-Based Probabilistic Optimal Power Flow (P-OPF) With Gaussian and Gamma Distributions Antony Schellenberg, William Rosehart, and

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

1162 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 5, SEPTEMBER The Evidence Framework Applied to Support Vector Machines

1162 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 5, SEPTEMBER The Evidence Framework Applied to Support Vector Machines 1162 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 5, SEPTEMBER 2000 Brief Papers The Evidence Framework Applied to Support Vector Machines James Tin-Yau Kwok Abstract In this paper, we show that

More information

On the Cross-Correlation of a p-ary m-sequence of Period p 2m 1 and Its Decimated

On the Cross-Correlation of a p-ary m-sequence of Period p 2m 1 and Its Decimated IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 58, NO 3, MARCH 01 1873 On the Cross-Correlation of a p-ary m-sequence of Period p m 1 Its Decimated Sequences by (p m +1) =(p +1) Sung-Tai Choi, Taehyung Lim,

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Forecasting Wind Ramps

Forecasting Wind Ramps Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators

More information

THE information capacity is one of the most important

THE information capacity is one of the most important 256 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 1, JANUARY 1998 Capacity of Two-Layer Feedforward Neural Networks with Binary Weights Chuanyi Ji, Member, IEEE, Demetri Psaltis, Senior Member,

More information

A minimalist s exposition of EM

A minimalist s exposition of EM A minimalist s exposition of EM Karl Stratos 1 What EM optimizes Let O, H be a random variables representing the space of samples. Let be the parameter of a generative model with an associated probability

More information

The Discrete Kalman Filtering of a Class of Dynamic Multiscale Systems

The Discrete Kalman Filtering of a Class of Dynamic Multiscale Systems 668 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL 49, NO 10, OCTOBER 2002 The Discrete Kalman Filtering of a Class of Dynamic Multiscale Systems Lei Zhang, Quan

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

Bayesian ensemble learning of generative models

Bayesian ensemble learning of generative models Chapter Bayesian ensemble learning of generative models Harri Valpola, Antti Honkela, Juha Karhunen, Tapani Raiko, Xavier Giannakopoulos, Alexander Ilin, Erkki Oja 65 66 Bayesian ensemble learning of generative

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

Distributed Coordinated Tracking With Reduced Interaction via a Variable Structure Approach Yongcan Cao, Member, IEEE, and Wei Ren, Member, IEEE

Distributed Coordinated Tracking With Reduced Interaction via a Variable Structure Approach Yongcan Cao, Member, IEEE, and Wei Ren, Member, IEEE IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 57, NO. 1, JANUARY 2012 33 Distributed Coordinated Tracking With Reduced Interaction via a Variable Structure Approach Yongcan Cao, Member, IEEE, and Wei Ren,

More information

Shape of Gaussians as Feature Descriptors

Shape of Gaussians as Feature Descriptors Shape of Gaussians as Feature Descriptors Liyu Gong, Tianjiang Wang and Fang Liu Intelligent and Distributed Computing Lab, School of Computer Science and Technology Huazhong University of Science and

More information

Classification of Hand-Written Digits Using Scattering Convolutional Network

Classification of Hand-Written Digits Using Scattering Convolutional Network Mid-year Progress Report Classification of Hand-Written Digits Using Scattering Convolutional Network Dongmian Zou Advisor: Professor Radu Balan Co-Advisor: Dr. Maneesh Singh (SRI) Background Overview

More information

Learning Binary Classifiers for Multi-Class Problem

Learning Binary Classifiers for Multi-Class Problem Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,

More information

Dimensional reduction of clustered data sets

Dimensional reduction of clustered data sets Dimensional reduction of clustered data sets Guido Sanguinetti 5th February 2007 Abstract We present a novel probabilistic latent variable model to perform linear dimensional reduction on data sets which

More information

Auxiliary signal design for failure detection in uncertain systems

Auxiliary signal design for failure detection in uncertain systems Auxiliary signal design for failure detection in uncertain systems R. Nikoukhah, S. L. Campbell and F. Delebecque Abstract An auxiliary signal is an input signal that enhances the identifiability of a

More information

Constrained Optimization and Support Vector Machines

Constrained Optimization and Support Vector Machines Constrained Optimization and Support Vector Machines Man-Wai MAK Dept. of Electronic and Information Engineering, The Hong Kong Polytechnic University enmwmak@polyu.edu.hk http://www.eie.polyu.edu.hk/

More information

Co-Prime Arrays and Difference Set Analysis

Co-Prime Arrays and Difference Set Analysis 7 5th European Signal Processing Conference (EUSIPCO Co-Prime Arrays and Difference Set Analysis Usham V. Dias and Seshan Srirangarajan Department of Electrical Engineering Bharti School of Telecommunication

More information

Entropy Manipulation of Arbitrary Non I inear Map pings

Entropy Manipulation of Arbitrary Non I inear Map pings Entropy Manipulation of Arbitrary Non I inear Map pings John W. Fisher I11 JosC C. Principe Computational NeuroEngineering Laboratory EB, #33, PO Box 116130 University of Floridaa Gainesville, FL 326 1

More information

SOLVING an optimization problem over two variables in a

SOLVING an optimization problem over two variables in a IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 3, MARCH 2009 1423 Adaptive Alternating Minimization Algorithms Urs Niesen, Student Member, IEEE, Devavrat Shah, and Gregory W. Wornell Abstract The

More information

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 2, FEBRUARY

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 54, NO. 2, FEBRUARY IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL 54, NO 2, FEBRUARY 2006 423 Underdetermined Blind Source Separation Based on Sparse Representation Yuanqing Li, Shun-Ichi Amari, Fellow, IEEE, Andrzej Cichocki,

More information