On Hyper-Parameter Estimation in Empirical Bayes: A Revisit of the MacKay Algorithm


Chune Li^1, Yongyi Mao^2, Richong Zhang^1 and Jinpeng Huai^1
1 School of Computer Science and Engineering, Beihang University, Beijing, P. R. China
2 School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, ON K1N 6N5, Canada

This work is supported partly by the China 973 program (No. 2014CB340305), by the National Natural Science Foundation of China (No. , ), and by the Beijing Advanced Innovation Center for Big Data and Brain Computing. Richong Zhang is the corresponding author of this work (email: zhangrc@act.buaa.edu.cn).

Abstract

An iterative procedure introduced in MacKay's evidence framework is often used for estimating the hyper-parameter in empirical Bayes. Despite its effectiveness, the procedure has stayed primarily as a heuristic to date. This paper formally investigates the mathematical nature of this procedure and justifies it as a well-principled algorithmic framework. This framework, which we call the MacKay algorithm, is shown to be closely related to the EM algorithm under a certain Gaussian assumption.

1 INTRODUCTION

As a bridge between full Bayesian models and completely frequentist models, the empirical Bayesian method (also known as empirical Bayes in short, or type-II maximum likelihood) has been applied to many learning, inference or prediction applications (see, e.g., [Schäfer and Strimmer, 2005, Efron, 2012, Heskes, 2000, Yang et al., 2004, Frost and Savarino, 1986, DuMouchel and Pregibon, 2001]). The generic setup of empirical Bayes consists of the observed data D, the model parameter z that parametrizes the data likelihood function p(D|z), and the prior distribution p(z) of the model parameter. In the parametric version of empirical Bayes (Figure 1), the prior distribution is parameterized by a certain hyper-parameter α, namely, as p(z|α), and the philosophy of empirical Bayes is to estimate the hyper-parameter α from the observed data D. As empirical Bayes treats the model parameter z as a latent random variable, the estimation of the hyper-parameter naturally fits in the framework of the EM algorithm [Dempster et al., 1977, Carlin and Louis, 1997], and EM-based solutions have been developed to solve this problem in various application domains (see, e.g., [Inoue and Tanaka, 2001, Clyde and George, 2000]).

Figure 1: The generic model of the empirical Bayesian method.

Among other approaches to this problem, a technique introduced by MacKay is also widely adopted in practice [Bishop, 1999, Tipping, 2001, Tipping et al., 2003, Wipf and Nagarajan, 2008, Tan and Févotte, 2009]. In his evidence framework [MacKay, 1992b, MacKay and Neal, 1994, MacKay, 1995], MacKay considers a hierarchical Bayesian model similar to that in Figure 1 but with one distinction: an additional hyper-prior p(α|H), which depends on the choice H of model, is placed on the hyper-parameter α. In this setting, the evidence framework addresses three levels of inference problems: 1) given the hyper-parameter α, inferring z, 2) given the model H, inferring α, and 3) evaluating the model H. MacKay shows [MacKay, 1995] that the three levels of inference may be combined for prediction and for automatic shrinkage of parameter spaces (namely, Automatic Relevance Determination, or ARD) for neural network regression models. The second-level inference in the evidence framework is closely related to empirical Bayes. In particular, when a flat hyper-prior p(α|H) is placed on α, the objective of the second-level inference coincides with the objective of empirical Bayes.
For the second-level inference, MacKay introduces a procedure that alternates between inferring z given α (first-level inference) and inferring α given z. This procedure, which is well appreciated in some classical papers (e.g., [Bishop, 1999]) and highly cited in the ARD-related literature (e.g., [Bishop, 1999, Tipping, 2001, Tan and Févotte, 2009]), is called the MacKay algorithm in this paper.

Since its birth, the MacKay algorithm has been applied to various empirical Bayes models and its performance is often compared with the EM algorithm. For example, the MacKay algorithm is applied to the Bayesian PCA model [Bishop, 1999] and a non-negative matrix factorization model [Tan and Févotte, 2009] for automated shrinkage of the latent-space dimensions. In [Tipping, 2001], the MacKay algorithm is applied to SVM regression models for promoting sparsity, and it is shown to converge faster than the EM algorithm. Despite its effectiveness, the mathematical principle and optimization objective of the MacKay algorithm are however not well characterized in the literature to date. In MacKay's original exposition [MacKay and Neal, 1994, MacKay, 1995], the second-level inference task is clearly stated, but the justification of the iterative procedure (i.e., the MacKay algorithm) is mainly heuristic. In addition, since the MacKay algorithm is often implemented with a particular update procedure, known as the fixed-point iteration [Solomon, 2015, Hyvärinen, 1999], or the MacKay update, the boundary between the framework of the MacKay algorithm and MacKay's fixed-point update rule is often blurred in the literature. This makes the MacKay algorithm often understood in a narrow sense as this specific fixed-point update rule, rather than as an algorithmic framework.

In this paper, under a generic formulation of the empirical Bayes model (Figure 1), we re-formulate the MacKay algorithm as a coordinate-ascent procedure for solving a well-defined optimization problem. This optimization problem shares some similarity with the optimization problem underlying the EM algorithm: its objective function is a lower bound of the true objective function defining the optimization objective of empirical Bayes. Also similar to the EM algorithm, this lower bound is not far from the true objective function, and one of the two update steps in the coordinate-ascent procedure guarantees that the lower bound meets the true objective function. This understanding justifies the MacKay algorithm (whether or not implemented with the MacKay update) as a well-principled algorithmic framework, juxtaposed on equal footing with the EM framework.

Under a specific linear regression model, it has been observed that the MacKay update and the EM algorithm are closely related [Murphy, 2012]. It is then curious to investigate the relationship between the two algorithms in a more general setting. To that end, we show that as long as the posterior distribution p(z|D, α) is a Gaussian distribution, the objective function for the MacKay algorithm is simply a restriction of the EM objective function where two of the three variables are restricted to a curve. Under this Gaussian condition, we show that the MacKay optimization problem is a relaxation of the original optimization problem in empirical Bayes, and that the EM optimization problem is a relaxation of the MacKay optimization problem. In addition, the three problems attain their optimum at the same configuration of the hyper-parameter. These understandings then help to explain why the MacKay algorithm converges faster than the EM algorithm.

The objective of this paper is to rigorously formulate the MacKay algorithm and to investigate its connection to the EM algorithm. We have made an effort to be pedagogical in our presentation. In particular, we use a linear regression model and the Bayesian PCA model as running examples throughout the paper.
2 SETUP

The generic model for empirical Bayes is given in Figure 1, where D is the observed data, z is the model parameter, and α is the hyper-parameter. We note that both z and α can be a scalar, a vector, a matrix, or of an arbitrary form. The model is specified by the likelihood function p(D|z) and the prior distribution p(z|α). The objective of empirical Bayes is then to estimate the hyper-parameter α from the data D. Let

l(α) := log p(D|α) = log ∫ p(D|z) p(z|α) dz    (1)

be the log-marginal likelihood or the log-evidence [MacKay, 1992b] of the hyper-parameter α. Then the estimation of α can be naturally formulated as solving the following optimization problem.

Opt-I: Find α that maximizes l(α).

We now use the examples of linear regression and Bayesian PCA [Bishop, 1999] to illustrate this. Throughout the paper, we will use N(x; µ, Λ) to denote the Gaussian density function with variable x, mean µ and covariance matrix Λ, and we will use I_d to denote the d × d identity matrix, Tr(·) to denote the trace operator, Det(·) to denote the determinant operator, ||·|| to denote the L2 norm, and E_q[·] to denote expectation under the distribution q.

Linear Regression Example 1. Let D := {(x^(i), y^(i)) : i = 1, 2, ..., n} be the observed data, where each x^(i) is a vector in R^d, and each y^(i) is a scalar in R. The dependency of y^(i) on x^(i) is modelled as y^(i) = z^T x^(i) + ε^(i). Here ε^(i) is a zero-mean Gaussian noise with variance σ², and z is the model parameter, which is modelled as a d-dimensional spherical Gaussian variable with zero mean and variance 1/α. For simplicity, we assume that the parameter σ² is known and the objective of empirical Bayes is to estimate the hyper-parameter α. Then the objective function in Opt-I is:

l(α) = log ∫ N(z; 0, (1/α) I_d) ∏_{i=1}^n N(y^(i); z^T x^(i), σ²) dz
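For this example, z can in fact be marginalized out in closed form: writing X := [x^(1), ..., x^(n)] ∈ R^{d×n} and Y := [y^(1), ..., y^(n)]^T ∈ R^n (the same notation used in Linear Regression Example 2 below), standard Gaussian marginalization gives l(α) = log N(Y; 0, (1/α) X^T X + σ² I_n). The following minimal numpy/scipy sketch of this evaluation is ours and not part of the original exposition.

    import numpy as np
    from scipy.stats import multivariate_normal

    def log_evidence_linreg(alpha, X, Y, sigma2):
        # l(alpha) for the linear regression example: marginalizing z out of
        # y^(i) = z^T x^(i) + eps^(i) gives Y ~ N(0, (1/alpha) X^T X + sigma2 I_n).
        n = X.shape[1]
        cov = X.T @ X / alpha + sigma2 * np.eye(n)
        return multivariate_normal.logpdf(Y, mean=np.zeros(n), cov=cov)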

Bayesian PCA Example 1. Following [Bishop, 1999], let D := {t^(i) : i = 1, 2, ..., n} be the observed data, where each t^(i) is a vector in R^m. Each observed vector t^(i) depends on a latent variable x^(i) ∈ R^d via t^(i) = z x^(i) + ε^(i). Here x^(i) is a zero-mean Gaussian variable with covariance I_d, ε^(i) is a spherical Gaussian noise with zero mean and known variance σ², and the parameter z ∈ R^{m×d} (d < m) is modelled as a random matrix whose kth column z_k is drawn from a spherical Gaussian distribution with zero mean and variance 1/α_k. Let α := {α_k : k = 1, 2, ..., d}. Then α is the hyper-parameter on z and Opt-I has the following objective function:

l(α) = log ∫ ( ∏_{k=1}^d N(z_k; 0, (1/α_k) I_m) ) ( ∏_{i=1}^n ∫ N(x^(i); 0, I_d) N(t^(i); z x^(i), σ² I_m) dx^(i) ) dz

The optimization problem Opt-I can sometimes be solved easily, for instance, in the above linear regression setting. In practice, however, this problem is usually difficult and requires special algorithmic techniques. The above Bayesian PCA setting is one such example. The EM approach to Opt-I is well known. In the remainder of this paper, we develop the MacKay algorithm for this problem. To compare and relate to EM, we also present the EM algorithm in parallel. The above linear regression and Bayesian PCA settings will be carried along our development as illustrative examples.

3 THE TWO ALGORITHMS

In this section, we will show that both the EM algorithm and the MacKay algorithm can be formulated as optimizing a lower bound of the objective function l(α) via coordinate ascent. While this is well known for the EM algorithm, it has been quite obscure for the MacKay algorithm.

3.1 THE EM ALGORITHM

The Expectation-Maximization (EM) algorithm [Dempster et al., 1977] is a classical method for maximizing the log-likelihood function or log-posterior density function in which certain latent variables have been integrated over. When applied to the optimization problem Opt-I, the EM algorithm implicitly constructs a lower bound F_EM of the objective function l(α):

F_EM(q, α) := E_q[ log ( p(D|z) p(z|α) / q(z) ) ] = ∫ q(z) log ( p(D|z) p(z|α) / q(z) ) dz    (2)

where q(·) is an arbitrary probability distribution on the space of z. By Jensen's inequality [Jensen, 1906], the following result is well known in the literature of the EM algorithm [Dempster et al., 1977].

Lemma 1. F_EM(q, α) ≤ l(α), where the equality is achieved if and only if q(z) is the posterior distribution p(z|D, α).

Instead of optimizing the original objective function l(α), we now define an alternative optimization problem.

OptEM: Find α and a distribution q that maximize F_EM(q, α).

The EM algorithm is then the coordinate-ascent solver for OptEM. More precisely, the update rule at the t-th iteration of the coordinate ascent is given below.

EM Algorithm
E-Step: q^(t) := argmax_q F_EM(q, α^(t)) = p(z|D, α^(t))
M-Step: α^(t+1) := argmax_α F_EM(q^(t), α) = argmax_α E_{q^(t)}[log p(z|α)]

We note that by Lemma 1, at the end of the E-Step,

F_EM(q^(t), α^(t)) = l(α^(t)).    (3)

This gives rise to the following lemma [Dempster et al., 1977].

Lemma 2. The iteration of the EM algorithm continuously increases the log-evidence function l(α) and therefore is guaranteed to converge.
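Viewed purely as the coordinate-ascent solver for OptEM, the framework fits in a few lines of generic code. The sketch below is ours; posterior and maximize_expected_log_prior are placeholders for the model-specific E-Step and M-Step computations.

    def em(alpha0, posterior, maximize_expected_log_prior, num_iters=100):
        # Coordinate ascent on F_EM(q, alpha): alternate the E-Step and the M-Step.
        alpha = alpha0
        for t in range(num_iters):
            q = posterior(alpha)                     # E-Step: q^(t) = p(z | D, alpha^(t))
            alpha = maximize_expected_log_prior(q)   # M-Step: argmax_alpha E_q[log p(z | alpha)]
        return alpha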

Linear Regression Example 2. For the linear regression model, let [x^(1), x^(2), ..., x^(n)] be denoted by a matrix X ∈ R^{d×n} and [y^(1), y^(2), ..., y^(n)]^T be denoted by a vector Y ∈ R^n. The objective function (2) is

F_EM(q, α) = (1/σ²) Σ_{i=1}^n y^(i) x^(i)T E_q[z] − (1/(2σ²)) E_q[ z^T ( Σ_{i=1}^n x^(i) x^(i)T + σ²α I_d ) z ] − E_q[log q(z)] − ((n+d)/2) log 2π − (n/2) log σ² + (d/2) log α − (1/(2σ²)) Σ_{i=1}^n (y^(i))²

It is easy to verify that the posterior distribution of z is also Gaussian and the E-Step update becomes

q^(t)(z) = N( z; µ_LR(α^(t)), K_LR(α^(t)) )    (4)

where

µ_LR(α) := (1/σ²) ( (1/σ²) X X^T + α I_d )^{-1} X Y,    (5)

and

K_LR(α) := ( (1/σ²) X X^T + α I_d )^{-1}.    (6)

For the M-Step update, noting that

F_EM(q^(t), α) = −(α/2) ( ||µ_LR(α^(t))||² + Tr(K_LR(α^(t))) ) + (d/2) log α + const,

it is possible to express the maximizing α for this function directly in terms of α^(t) as:

α^(t+1) = d / ( ||µ_LR(α^(t))||² + Tr(K_LR(α^(t))) )    (7)

That is, the updates in the E-Step and M-Step can be combined into the single update equation (7).

Bayesian PCA Example 2. For the BPCA model, the objective function (2) is

F_EM(q, α) = −(n/2) E_q[ log Det(z z^T + σ² I_m) ] − (1/2) Σ_{i=1}^n E_q[ t^(i)T (z z^T + σ² I_m)^{-1} t^(i) ] − Σ_{k=1}^d (α_k/2) E_q[ ||z_k||² ] − E_q[log q(z)] + (m/2) Σ_{k=1}^d log α_k + const,

where the constant collects terms that depend on neither q nor α. The M-Step update is then

α_k^(t+1) = m / E_{q^(t)}[ ||z_k||² ].    (8)

However, the E-Step update of q^(t) cannot be expressed in explicit form and one usually relies on various approximation techniques. For example, a sampling approach [Neal, 1993] may be used for this purpose. Later in this paper, we will discuss the approach that approximates the posterior as a Gaussian.

3.2 MACKAY ALGORITHM

In [MacKay, 1992a, MacKay, 1992c, MacKay, 1992b, MacKay, 1995], MacKay presented the influential evidence framework that addresses inference at three levels. A heuristic iterative procedure is introduced for the second-level inference, namely, for inferring the hyper-parameter α. Here, we re-formulate this procedure as a well-principled algorithmic framework, and call it the MacKay algorithm. To begin, note that the objective function l(α) can be expressed as

l(α) = log p(D, z|α) − log p(z|D, α), for any z (with p(z|D, α) non-zero).

Define

F_MacKay(z, α) := log p(D, z|α) − max_z log p(z|D, α).    (9)

The following lemma is easy to verify.

Lemma 3. F_MacKay(z, α) ≤ l(α), where the equality is achieved if and only if z = argmax_z log p(D, z|α).

We now introduce another optimization problem.

OptMacKay: Find α and z that maximize F_MacKay(z, α).

The MacKay algorithm is then defined as the following coordinate-ascent procedure for optimizing F_MacKay.

MacKay Algorithm
z-Step: z^(t) := argmax_z F_MacKay(z, α^(t)) = argmax_z log p(D, z|α^(t))
α-Step: α^(t+1) := argmax_α F_MacKay(z^(t), α)

At the end of the z-Step, by Lemma 3,

F_MacKay(z^(t), α^(t)) = l(α^(t)).    (10)

This property clearly parallels Equation (3) of the EM algorithm. That is, although the MacKay algorithm maximizes the lower bound F_MacKay of the true objective function l(α), the lower bound F_MacKay is in fact not far below l(α), and at the end of each z-Step update the lower bound meets l(α).
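The MacKay algorithm admits an equally short generic rendering, mirroring the EM sketch given earlier; again the sketch and the placeholder names (maximize_joint_in_z, alpha_step) are ours.

    def mackay(alpha0, maximize_joint_in_z, alpha_step, num_iters=100):
        # Coordinate ascent on F_MacKay(z, alpha): alternate the z-Step and the alpha-Step.
        alpha = alpha0
        for t in range(num_iters):
            z = maximize_joint_in_z(alpha)   # z-Step: argmax_z log p(D, z | alpha^(t))
            alpha = alpha_step(z)            # alpha-Step: argmax_alpha F_MacKay(z^(t), alpha)
        return alpha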

Then by the coordinate-ascent nature of the MacKay algorithm, we have the following lemma, parallel to Lemma 2 of the EM algorithm.

Lemma 4. The iteration of the MacKay algorithm continuously increases the log-evidence function l(α) and therefore is guaranteed to converge.

In the MacKay algorithm, it is worth noting that the update in the α-Step is usually performed with a fixed-point iteration procedure [Solomon, 2015, MacKay, 1992a, Bishop, 1999, Murphy, 2012], which we describe next for self-containedness.

Fixed-Point Iteration. Suppose that the equation ∂F_MacKay(z, α)/∂α = 0 can be reduced to the form α = h(α, z). The fixed-point iteration approach for the α-Step update in the MacKay algorithm is the following update rule:

α^(t+1) = h(α^(t), z^(t)).

Linear Regression Example 3. In the linear regression model, the lower bound F_MacKay is

F_MacKay(z, α) = (1/σ²) Σ_{i=1}^n y^(i) x^(i)T z − (1/(2σ²)) z^T ( Σ_{i=1}^n x^(i) x^(i)T + σ²α I_d ) z − (1/(2σ²)) Σ_{i=1}^n (y^(i))² − ((n+d)/2) log 2π − (n/2) log σ² + (d/2) log α + (1/2) log Det(2π K_LR(α)).

The z-Step turns out to be z^(t) = µ_LR(α^(t)). Note that

∂F_MacKay(z, α)/∂α = d/(2α) − (1/2) ||z||² − (1/2) Tr(K_LR(α)).

When setting ∂F_MacKay(z, α)/∂α = 0, we obtain

α = ( d − α Tr(K_LR(α)) ) / ||z||².

This gives rise to the fixed-point iteration of the α-Step:

α^(t+1) = ( d − α^(t) Tr(K_LR(α^(t))) ) / ||z^(t)||².    (11)

Bayesian PCA Example 3. For the BPCA model, the objective function F_MacKay is

F_MacKay(z, α) = −(n/2) log Det(z z^T + σ² I_m) − (1/2) Σ_{i=1}^n t^(i)T (z z^T + σ² I_m)^{-1} t^(i) − Σ_{k=1}^d (α_k/2) ||z_k||² − max_z log p(z|D, α) + (m/2) Σ_{k=1}^d log α_k + const.

The z-Step and α-Step updates then become

z^(t) = argmax_z p(z|D, α^(t)),
α^(t+1) = argmax_α [ − Σ_{k=1}^d (α_k/2) ||z_k^(t)||² − max_z log p(z|D, α) + (m/2) Σ_{k=1}^d log α_k ].

Since in general there does not exist a closed-form solution for the z-Step update, the two update equations cannot be made more explicit. In practice, a Gaussian approximation is applied to the posterior function p(z|D, α^(t)) in order to derive these update equations (see the next section).

It is perhaps worth noting that the z-Step update of the MacKay algorithm resembles the E-step update of an approximate version of the EM algorithm, known as Hard EM (or Viterbi EM in the context of hidden Markov models) [Allahverdyan and Galstyan, 2011]. However, the M-step of Hard EM/Viterbi EM is different from the α-Step of the MacKay algorithm, due to the fact that OptEM and OptMacKay have different objective functions. It is not clear whether there is a more direct connection between Hard EM and the MacKay algorithm bypassing the generic EM algorithm, although we suspect that the answer is no.
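To make Linear Regression Example 3 concrete, the following numpy sketch (ours, using the X, Y, σ² conventions of Example 2) alternates the z-Step z^(t) = µ_LR(α^(t)) with the fixed-point α-Step (11).

    import numpy as np

    def mackay_linreg(X, Y, sigma2, alpha0=1.0, num_iters=50):
        # MacKay algorithm for the linear regression example, Eqs. (5), (6) and (11).
        d = X.shape[0]
        alpha = alpha0
        for t in range(num_iters):
            K = np.linalg.inv(X @ X.T / sigma2 + alpha * np.eye(d))  # K_LR(alpha), Eq. (6)
            mu = K @ X @ Y / sigma2                                  # mu_LR(alpha), Eq. (5); z-Step
            alpha = (d - alpha * np.trace(K)) / (mu @ mu)            # fixed-point alpha-Step, Eq. (11)
        return alpha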

4 GAUSSIAN APPROXIMATION

As seen above, in both the EM algorithm and the MacKay algorithm, it is desirable to compute the posterior distribution of the model parameter, namely, p(z|D, α). In the case of EM, this is for updating q in the E-Step, and in the case of MacKay, this is for updating z in the z-Step. For some models, such as Bayesian PCA, it is difficult to carry out explicit computation of the posterior. A commonly used technique is to approximate the posterior as a Gaussian density function, namely,

p(z|D, α) ≈ N(z; µ, K), for some µ, K.    (12)

Clearly, the mean vector µ and the covariance matrix K of the Gaussian density depend on the hyper-parameter α and will be denoted by µ(α) and K(α) respectively. When p(z|D, α) is a continuous function of z, a common technique for obtaining such an approximation (12) is the following [MacKay, 1995]. Let ẑ maximize p(z|D, α). By Taylor-expanding log p(z|D, α) at z = ẑ, up to the second-order terms, it is easy to see that

log p(z|D, α) ≈ log p(ẑ|D, α) − (1/2) (z − ẑ)^T H(α) (z − ẑ),

where H(α) denotes the negative Hessian matrix of the function log p(z|D, α) at z = ẑ. This gives the customary approximation of p(z|D, α) as [MacKay, 1995]

p(z|D, α) ≈ N(z; ẑ, H(α)^{-1}).

Then µ(α) and K(α) in the Gaussian approximation (12) can be taken as

µ(α) = ẑ,  K(α) = H(α)^{-1}.    (13)

We note that in (12), the approximation is sometimes exact, namely, the strict equality is satisfied. In such cases, the Gaussian approximation as stated in (12) and (13) in fact holds precisely.

Linear Regression Example 4. As seen in (4), (5) and (6), the posterior distribution p(z|D, α) is indeed a Gaussian density function. That is, the Gaussian approximation (12) holds with equality, where µ(α) = µ_LR(α) and K(α) = K_LR(α) (defined in (5) and (6) respectively).

Bayesian PCA Example 4. Let Z be the vector representation of the matrix z, namely, Z is a length-dm vector obtained by stacking the columns of the matrix z. That is, Z := (z_1^T, z_2^T, ..., z_d^T)^T. Let Ẑ denote the maximizing configuration for the function p(Z|D, α), and similarly let H(α) denote the negative Hessian of log p(Z|D, α) at Z = Ẑ. The Gaussian approximation (12) then becomes

p(Z|D, α) ≈ N(Z; µ_BPCA(α), K_BPCA(α))    (14)

where µ_BPCA(α) := Ẑ and K_BPCA(α) := H(α)^{-1}. We note that in this case, (12) is only an approximation. In addition, since Ẑ and H(α)^{-1} are difficult to compute analytically, numerical solutions are usually sought.

In the remainder of this section, we assume that (12) holds with equality and further investigate the optimization problems in the EM and MacKay algorithms.
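When the posterior is not Gaussian, as in the Bayesian PCA example, µ(α) and K(α) in (13) are typically obtained numerically. The rough scipy sketch below is ours: it locates ẑ by minimizing −log p(z|D, α) and reuses the optimizer's BFGS approximation of the inverse Hessian as K(α); a more careful implementation would compute the exact Hessian at ẑ.

    from scipy.optimize import minimize

    def laplace_approximation(neg_log_posterior, z0):
        # Gaussian approximation (13): mu(alpha) = z_hat, K(alpha) = H(alpha)^{-1},
        # with the inverse Hessian approximated by BFGS rather than computed exactly.
        res = minimize(neg_log_posterior, z0, method="BFGS")
        return res.x, res.hess_inv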

4.1 EM

Recall that with the EM algorithm, the objective function in the optimization problem is F_EM in (2). Since the optimizing distribution q for any given α is the posterior p(z|D, α), the Gaussian assumption (12) on the posterior allows us to restrict q to the form N(z; u, S), parametrized by a mean vector u and a covariance S. The objective function F_EM(q, α) can then be re-expressed as F_EM(u, S, α). That is,

F_EM(u, S, α) = E_{N(z;u,S)}[ log ( p(D|z) p(z|α) / N(z; u, S) ) ] = E_{N(z;u,S)}[ log p(D|z) p(z|α) ] + (J/2) log 2π + (1/2) log Det(S) + J/2    (15)

where J is the length of the vector z. The following lemma is established by noting the following two-way factorization of p(D, z|α):

p(D, z|α) = p(D|z) p(z|α) = p(D|α) p(z|D, α).    (16)

Lemma 5. When the Gaussian approximation (12) holds with equality, the function F_EM can be re-expressed as

F_EM(u, S, α) = log N(u; µ(α), K(α)) − (1/2) Tr(K(α)^{-1} S) + log p(D|α) + (J/2) log 2π + (1/2) log Det(S) + J/2.

To derive the update rule for the EM algorithm under the Gaussian approximation (12), we prove the following results.

Lemma 6. For any given S and α, argmax_u F_EM(u, S, α) = µ(α).

Proof: Based on Lemma 5, we express F_EM(u, S, α) further:

F_EM(u, S, α) = −(J/2) log 2π − (1/2) log Det(K(α)) − (1/2) (u − µ(α))^T K(α)^{-1} (u − µ(α)) − (1/2) Tr(K(α)^{-1} S) + log p(D|α) + (J/2) log 2π + (1/2) log Det(S) + J/2.

Then

∂F_EM/∂u = −(1/2) ( K(α)^{-1} + (K(α)^{-1})^T ) (u − µ(α)).

By setting ∂F_EM/∂u to zero, we prove the result.

Lemma 7. For any u and α, argmax_S F_EM(u, S, α) = K(α).

Proof: By the expression of F_EM(u, S, α) in Lemma 5,

∂F_EM/∂S = −(1/2) ∂Tr(K(α)^{-1} S)/∂S + (1/2) ∂ log Det(S)/∂S.

By Tr(AB) = Σ_i Σ_j A_ij B_ji, we have Tr(K(α)^{-1} S) = Σ_i Σ_j (K(α)^{-1})_ij S_ji, so that

∂Tr(K(α)^{-1} S)/∂S_ij = (K(α)^{-1})_ji,  and hence  ∂Tr(K(α)^{-1} S)/∂S = (K(α)^{-1})^T.

On the other hand, since ∂Det(S)/∂S_ij = Det(S) (S^{-1})_ji, we have

∂ log Det(S)/∂S_ij = (S^{-1})_ji,  and hence  ∂ log Det(S)/∂S = (S^{-1})^T.

Then

∂F_EM/∂S = −(1/2) (K(α)^{-1})^T + (1/2) (S^{-1})^T.

The lemma is then proved by setting this derivative to zero.

As a corollary of Lemma 6, Lemma 7 and (15), the update rule of the EM algorithm can be established.

Lemma 8. When the Gaussian approximation (12) holds with equality, the EM algorithm becomes the following EM-Gauss Procedure.

EM-Gauss Procedure
E-Step: u^(t) := µ(α^(t)),  S^(t) := K(α^(t))
M-Step: α^(t+1) := argmax_α E_{N(z; u^(t), S^(t))}[ log p(z|α) ]

Linear Regression Example 5. In the linear regression model, the Gaussian assumption (12) holds true. The E-Step then reduces to (4), which can be integrated into the M-Step. The EM update can then be expressed as a single update equation (7), the same as that in Linear Regression Example 2.
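For comparison with the MacKay sketch given after Section 3, the EM-Gauss procedure for the same linear regression model collapses to the single update (7). The numpy sketch below is ours, under the same X, Y, σ² conventions.

    import numpy as np

    def em_linreg(X, Y, sigma2, alpha0=1.0, num_iters=50):
        # EM algorithm for the linear regression example, combined update Eq. (7).
        d = X.shape[0]
        alpha = alpha0
        for t in range(num_iters):
            K = np.linalg.inv(X @ X.T / sigma2 + alpha * np.eye(d))  # K_LR(alpha), Eq. (6)
            mu = K @ X @ Y / sigma2                                  # mu_LR(alpha), Eq. (5)
            alpha = d / (mu @ mu + np.trace(K))                      # Eq. (7)
        return alpha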

Bayesian PCA Example 5. Note that µ_BPCA(α) is a vector of length dm and K_BPCA(α) is a dm × dm matrix. Let µ_BPCA,k(α) denote the component of µ_BPCA(α) corresponding to the z_k component of Ẑ, and let K_BPCA,k(α) denote the sub-matrix of K_BPCA(α) that serves as the covariance matrix of z_k. With the Gaussian approximation (14) holding with equality, the update (8) of α becomes

α_k^(t+1) = m / ( ||µ_BPCA,k(α^(t))||² + Tr(K_BPCA,k(α^(t))) ).

4.2 MACKAY

Lemma 9. When the Gaussian approximation (12) holds with equality, the function F_MacKay in (9) becomes

F_MacKay(z, α) = log p(D, z|α) + (1/2) log Det(2π K(α)).

Proof: This lemma follows from max_z log p(z|D, α) = −(1/2) log Det(2π K(α)).

Lemma 10. When the Gaussian approximation (12) holds with equality, for any α, argmax_z log p(D, z|α) = µ(α).

The proof of this lemma follows the same line as that of Lemma 6. The MacKay algorithm under the Gaussian approximation (12) can then be established from Lemma 10 and Lemma 9.

Lemma 11. When the Gaussian approximation (12) holds with equality, the MacKay algorithm becomes the following MacKay-Gauss Procedure.

MacKay-Gauss Procedure
z-Step: z^(t) := µ(α^(t))
α-Step: α^(t+1) := argmax_α [ log p(z^(t)|α) + (1/2) log Det(K(α)) ]

Linear Regression Example 6. Since the Gaussian assumption (12) holds true in linear regression, the updates of the MacKay algorithm are those in Linear Regression Example 3.

Bayesian PCA Example 6. Assuming that the Gaussian approximation (12) holds with equality, the z-Step and α-Step updates become

Z^(t) = µ_BPCA(α^(t)),
α^(t+1) = argmax_α [ − Σ_{k=1}^d (α_k/2) ||z_k^(t)||² + (1/2) log Det(K_BPCA(α)) + (m/2) Σ_{k=1}^d log α_k ].

When setting ∂F_MacKay(z, α)/∂α_k = 0, we can obtain

α_k = h(α_k, z) = ( m − α_k Tr(K_BPCA,k(α)) ) / ||z_k||².

This gives rise to the fixed-point iteration of the α-Step:

α_k^(t+1) = ( m − α_k^(t) Tr(K_BPCA,k(α^(t))) ) / ||z_k^(t)||².

4.3 THE RELATIONSHIP BETWEEN MACKAY AND EM

As shown earlier, the MacKay and EM algorithms correspond to solving two different optimization problems. However, we will show next that when the Gaussian approximation (12) holds exactly, the two algorithms are closely related. First note that F_EM is a trivariate function whereas F_MacKay is a bivariate function. The theorem below suggests that if the Gaussian approximation of the posterior distribution p(z|D, α) is exact, then by setting the covariance variable S of F_EM to the covariance matrix of the posterior, the function F_EM reduces to F_MacKay.

Theorem 1. When the Gaussian approximation (12) holds with equality, F_EM(u, K(α), α) = F_MacKay(u, α), where K(α) is defined in (13).

Proof: Suppose that the Gaussian approximation (12) holds with equality. Invoking (16), we have

F_MacKay(z, α) = log p(D, z|α) + (1/2) log Det(2π K(α)) = log N(z; µ(α), K(α)) + log p(D|α) + (1/2) log Det(2π K(α)).

But by Lemma 5, we have

F_EM(z, K(α), α) = log N(z; µ(α), K(α)) − (1/2) Tr(K(α)^{-1} K(α)) + log p(D|α) + (J/2) log 2π + (1/2) log Det(K(α)) + J/2 = log N(z; µ(α), K(α)) + log p(D|α) + (1/2) log Det(2π K(α)) = F_MacKay(z, α).

This proves the theorem.

Theorem 1 suggests that the function F_MacKay is a restriction of the function F_EM. Denote by C the set of all (S, α) configurations with S = K(α). That is, C is the curve on the (S, α) plane specified by S = K(α). Under this notation, F_MacKay is the function F_EM with the variables (S, α) restricted to the curve C.

Theorem 2. When the Gaussian approximation (12) holds with equality,

F_MacKay(u, α) = max_S F_EM(u, S, α),    (17)
l(α) = max_u F_MacKay(u, α).    (18)

Proof: Denote S* := argmax_S F_EM(u, S, α) = K(α). Thus

max_S F_EM(u, S, α) = F_EM(u, S*, α) = F_EM(u, K(α), α).

Then equation (17) holds by Theorem 1. On the other hand, by Lemma 3, F_MacKay(z, α) ≤ l(α) and the equality can be achieved. We thus obtain equation (18).

The following result follows immediately from Theorem 2.

Corollary 1. The optimizing configurations for Opt-I, OptEM and OptMacKay are identical in α.

Theorem 2 and Corollary 1 essentially suggest that OptMacKay is a relaxation of Opt-I, that OptEM is a relaxation of OptMacKay, and that such successive relaxations do not alter the solution of the original problem Opt-I.

Lemma 12. The EM-Gauss Procedure is identical to the 3-way coordinate ascent on F_EM, namely, iterating over the following three steps:

u^(t) := argmax_u F_EM(u, S^(t−1), α^(t)),
S^(t) := argmax_S F_EM(u^(t), S, α^(t)),
α^(t+1) := argmax_α F_EM(u^(t), S^(t), α).

Proof: This follows from the fact that in the EM-Gauss Procedure, the update of u is independent of S and the update of S is independent of u.

Since OptEM is a relaxation of OptMacKay, it has higher degrees of freedom during optimization. This extra degree of freedom is fully explored in the three-way coordinate ascent of EM-Gauss, making its convergence slower than that of MacKay-Gauss. This slower convergence of EM-Gauss can also be understood from another perspective, in which MacKay-Gauss and EM-Gauss are both considered as optimizing the function F_EM.
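Lemma 12's reading of EM-Gauss as a three-way coordinate ascent can be written schematically as follows; the sketch is ours, with the three argmax routines left abstract. The two-way sweep performed by MacKay-Gauss is stated in the lemma that follows.

    def em_gauss_sweep(u, S, alpha, argmax_u, argmax_S, argmax_alpha):
        # One pass of the 3-way coordinate ascent on F_EM of Lemma 12.
        u = argmax_u(S, alpha)        # u^(t)       = argmax_u F_EM(u, S^(t-1), alpha^(t))
        S = argmax_S(u, alpha)        # S^(t)       = argmax_S F_EM(u^(t), S, alpha^(t))
        alpha = argmax_alpha(u, S)    # alpha^(t+1) = argmax_alpha F_EM(u^(t), S^(t), alpha)
        return u, S, alpha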

Lemma 13. The MacKay-Gauss Procedure is equivalent to the following two-way coordinate ascent on F_EM:

u^(t) := argmax_u F_EM(u, S^(t−1), α^(t−1)),
(S^(t), α^(t)) := argmax_{(S, α) ∈ C} F_EM(u^(t), S, α).

Following directly from Lemma 11 and Theorem 1, this lemma suggests that MacKay-Gauss can be viewed as optimizing the same objective function F_EM as EM-Gauss, but taking a particular coordinate-ascent path, namely, alternating between maximization over u and maximization over (S, α) along the curve C. This allows MacKay-Gauss to take a shorter path than EM-Gauss.

Figure 2: The convergence of both the EM algorithm and the MacKay algorithm from the initial configuration α = 1 to the final configuration α ≈ 0.1. (a) The (α, u)-trajectories of the EM and MacKay algorithms, where an arbitrary component (in this case, the second component of u(α)) of the vector u is taken as a representative for u. (b) The function F_EM evaluated along the EM trajectory and the function F_MacKay evaluated along the MacKay trajectory, together with the log-evidence function l(α) plotted using its closed-form expression.

4.4 Experiments

Experiments are performed to study the dynamics of the MacKay algorithm and the EM algorithm. We generate a simulated dataset D for a linear regression model according to the setup in Linear Regression Example 1 with n = 300, d = 200, σ² = 10, α = 0.1, where each x^(i) is drawn uniformly at random from the open interval (0, 1). Both the EM algorithm (in Linear Regression Example 2) and the MacKay algorithm (in Linear Regression Example 3) are used to estimate α from D. For both algorithms, α is initialized to 1 and σ² is treated as known. The optimization trajectories of the two algorithms in Figure 2(a) show that the MacKay algorithm converges faster towards the fixed point (top right corner) than the EM algorithm. In Figure 2(b), we see that both the MacKay algorithm and the EM algorithm increase their respective objective functions along their optimization paths, but MacKay achieves a higher value of the log-evidence function l(α) than EM at every iteration step.
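For readers who wish to approximately reproduce this experiment, the sketch below is ours: it generates data under our reading of the stated setup (each component of x^(i) drawn uniformly from the open interval (0, 1)) and runs the two linear regression sketches, mackay_linreg and em_linreg, given earlier; the random seed and variable names are our choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma2, alpha_true = 300, 200, 10.0, 0.1
    X = rng.uniform(0.0, 1.0, size=(d, n))                   # columns are the x^(i)
    z = rng.normal(0.0, 1.0 / np.sqrt(alpha_true), size=d)   # z ~ N(0, (1/alpha) I_d)
    Y = X.T @ z + rng.normal(0.0, np.sqrt(sigma2), size=n)   # y^(i) = z^T x^(i) + eps^(i)

    alpha_mackay = mackay_linreg(X, Y, sigma2, alpha0=1.0)
    alpha_em = em_linreg(X, Y, sigma2, alpha0=1.0)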
5 Concluding Remarks

In his influential evidence framework, MacKay presented practical Bayesian methods for inference at the parameter level, at the hyper-parameter level and at the model level. For inference at the hyper-parameter level, MacKay introduced a heuristic procedure that iterates between estimating the parameter for a given hyper-parameter setting and estimating the hyper-parameter for the previous parameter setting. Although this procedure is widely adopted in empirical Bayesian methods, its mathematical principle had not been carefully explored prior to this work. In this paper, we formulate this procedure as a well-principled algorithmic framework, and call it the MacKay algorithm. We show that the MacKay algorithm, like the EM algorithm, can be understood as a coordinate-ascent solution to optimizing a lower bound of the objective function in empirical Bayes. Although this lower bound is different from the lower bound that is optimized by the EM algorithm, we show that as long as the posterior distribution of the parameter is a Gaussian density function, the two algorithms are closely related. In particular, the EM optimization problem, the MacKay optimization problem, and the original empirical Bayes optimization problem all have the same solution. In addition, the MacKay problem is a relaxation of the original problem, and the EM problem is a relaxation of the MacKay problem. This understanding provides an intuitive explanation as to why the MacKay algorithm converges faster than the EM algorithm.

We believe that the close relationship between the MacKay algorithm and the EM algorithm revealed in this paper strongly depends on the Gaussian condition. Although this paper does not show the necessity of this condition, we believe that, in the case of a non-Gaussian posterior or non-Gaussian models, this relationship will break down, and the MacKay algorithm will diverge from the EM algorithm towards its own optimization objective and along its own optimization path. This makes the MacKay algorithm a framework in its own right. We hope that this paper inspires more applications of the MacKay algorithm to more general models, a territory that appears completely unexplored.

References

[Allahverdyan and Galstyan, 2011] Allahverdyan, A. and Galstyan, A. (2011). Comparative analysis of Viterbi training and maximum likelihood estimation for HMMs. In Advances in Neural Information Processing Systems.

[Bishop, 1999] Bishop, C. M. (1999). Bayesian PCA. In Advances in Neural Information Processing Systems.

[Carlin and Louis, 1997] Carlin, B. P. and Louis, T. A. (1997). Bayes and empirical Bayes methods for data analysis. Statistics and Computing, 7.

[Clyde and George, 2000] Clyde, M. and George, E. I. (2000). Flexible empirical Bayes estimation for wavelets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4).

[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological).

[DuMouchel and Pregibon, 2001] DuMouchel, W. and Pregibon, D. (2001). Empirical Bayes screening for multi-item associations. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.

[Efron, 2012] Efron, B. (2012). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, volume 1. Cambridge University Press.

[Frost and Savarino, 1986] Frost, P. A. and Savarino, J. E. (1986). An empirical Bayes approach to efficient portfolio selection. Journal of Financial and Quantitative Analysis, 21(3).

[Heskes, 2000] Heskes, T. (2000). Empirical Bayes for learning to learn. In ICML.

[Hyvärinen, 1999] Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3).

[Inoue and Tanaka, 2001] Inoue, J.-i. and Tanaka, K. (2001). Dynamics of the maximum marginal likelihood hyperparameter estimation in image restoration: Gradient descent versus expectation and maximization algorithm. Phys. Rev. E, 65.

[Jensen, 1906] Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30(1).

[MacKay, 1992a] MacKay, D. J. (1992a). Bayesian interpolation. Neural Computation, 4(3).

[MacKay, 1992b] MacKay, D. J. (1992b). The evidence framework applied to classification networks. Neural Computation, 4(5).

[MacKay, 1992c] MacKay, D. J. (1992c). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3).

[MacKay, 1995] MacKay, D. J. (1995). Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3).

[MacKay and Neal, 1994] MacKay, D. J. and Neal, R. M. (1994). Automatic relevance determination for neural networks. Technical report in preparation, Cambridge University.

[Murphy, 2012] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

[Neal, 1993] Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto.

[Schäfer and Strimmer, 2005] Schäfer, J. and Strimmer, K. (2005). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6).

[Solomon, 2015] Solomon, J. (2015). Numerical Algorithms: Methods for Computer Vision, Machine Learning, and Graphics. CRC Press.

[Tan and Févotte, 2009] Tan, V. Y. and Févotte, C. (2009). Automatic relevance determination in nonnegative matrix factorization. In SPARS'09 - Signal Processing with Adaptive Sparse Structured Representations.

[Tipping, 2001] Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1.

[Tipping et al., 2003] Tipping, M. E., Faul, A. C., et al. (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In AISTATS.

[Wipf and Nagarajan, 2008] Wipf, D. P. and Nagarajan, S. S. (2008). A new view of automatic relevance determination. In Advances in Neural Information Processing Systems.

[Yang et al., 2004] Yang, D., Zakharkin, S. O., Page, G. P., Brand, J. P., Edwards, J. W., Bartolucci, A. A., and Allison, D. B. (2004). Applications of Bayesian statistical methods in microarray data analysis. American Journal of Pharmacogenomics, 4(1):53-62.


More information

Support recovery in compressed sensing: An estimation theoretic approach

Support recovery in compressed sensing: An estimation theoretic approach Support recovery in copressed sensing: An estiation theoretic approach Ain Karbasi, Ali Horati, Soheil Mohajer, Martin Vetterli School of Coputer and Counication Sciences École Polytechnique Fédérale de

More information

paper prepared for the 1996 PTRC Conference, September 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL

paper prepared for the 1996 PTRC Conference, September 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL paper prepared for the 1996 PTRC Conference, Septeber 2-6, Brunel University, UK ON THE CALIBRATION OF THE GRAVITY MODEL Nanne J. van der Zijpp 1 Transportation and Traffic Engineering Section Delft University

More information

Bootstrapping Dependent Data

Bootstrapping Dependent Data Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly

More information

An l 1 Regularized Method for Numerical Differentiation Using Empirical Eigenfunctions

An l 1 Regularized Method for Numerical Differentiation Using Empirical Eigenfunctions Journal of Matheatical Research with Applications Jul., 207, Vol. 37, No. 4, pp. 496 504 DOI:0.3770/j.issn:2095-265.207.04.0 Http://jre.dlut.edu.cn An l Regularized Method for Nuerical Differentiation

More information

When Short Runs Beat Long Runs

When Short Runs Beat Long Runs When Short Runs Beat Long Runs Sean Luke George Mason University http://www.cs.gu.edu/ sean/ Abstract What will yield the best results: doing one run n generations long or doing runs n/ generations long

More information

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy

Nonmonotonic Networks. a. IRST, I Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I Povo (Trento) Italy Storage Capacity and Dynaics of Nononotonic Networks Bruno Crespi a and Ignazio Lazzizzera b a. IRST, I-38050 Povo (Trento) Italy, b. Univ. of Trento, Physics Dept., I-38050 Povo (Trento) Italy INFN Gruppo

More information

Bayesian Approach for Fatigue Life Prediction from Field Inspection

Bayesian Approach for Fatigue Life Prediction from Field Inspection Bayesian Approach for Fatigue Life Prediction fro Field Inspection Dawn An and Jooho Choi School of Aerospace & Mechanical Engineering, Korea Aerospace University, Goyang, Seoul, Korea Srira Pattabhiraan

More information

Data-Driven Imaging in Anisotropic Media

Data-Driven Imaging in Anisotropic Media 18 th World Conference on Non destructive Testing, 16- April 1, Durban, South Africa Data-Driven Iaging in Anisotropic Media Arno VOLKER 1 and Alan HUNTER 1 TNO Stieltjesweg 1, 6 AD, Delft, The Netherlands

More information

Research in Area of Longevity of Sylphon Scraies

Research in Area of Longevity of Sylphon Scraies IOP Conference Series: Earth and Environental Science PAPER OPEN ACCESS Research in Area of Longevity of Sylphon Scraies To cite this article: Natalia Y Golovina and Svetlana Y Krivosheeva 2018 IOP Conf.

More information

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup)

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup) Recovering Data fro Underdeterined Quadratic Measureents (CS 229a Project: Final Writeup) Mahdi Soltanolkotabi Deceber 16, 2011 1 Introduction Data that arises fro engineering applications often contains

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

A Smoothed Boosting Algorithm Using Probabilistic Output Codes A Soothed Boosting Algorith Using Probabilistic Output Codes Rong Jin rongjin@cse.su.edu Dept. of Coputer Science and Engineering, Michigan State University, MI 48824, USA Jian Zhang jian.zhang@cs.cu.edu

More information

AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS

AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS Statistica Sinica 6 016, 1709-178 doi:http://dx.doi.org/10.5705/ss.0014.0034 AN OPTIMAL SHRINKAGE FACTOR IN PREDICTION OF ORDERED RANDOM EFFECTS Nilabja Guha 1, Anindya Roy, Yaakov Malinovsky and Gauri

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis

Experimental Design For Model Discrimination And Precise Parameter Estimation In WDS Analysis City University of New York (CUNY) CUNY Acadeic Works International Conference on Hydroinforatics 8-1-2014 Experiental Design For Model Discriination And Precise Paraeter Estiation In WDS Analysis Giovanna

More information

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS A Thesis Presented to The Faculty of the Departent of Matheatics San Jose State University In Partial Fulfillent of the Requireents

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

Effective joint probabilistic data association using maximum a posteriori estimates of target states

Effective joint probabilistic data association using maximum a posteriori estimates of target states Effective joint probabilistic data association using axiu a posteriori estiates of target states 1 Viji Paul Panakkal, 2 Rajbabu Velurugan 1 Central Research Laboratory, Bharat Electronics Ltd., Bangalore,

More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

IN modern society that various systems have become more

IN modern society that various systems have become more Developent of Reliability Function in -Coponent Standby Redundant Syste with Priority Based on Maxiu Entropy Principle Ryosuke Hirata, Ikuo Arizono, Ryosuke Toohiro, Satoshi Oigawa, and Yasuhiko Takeoto

More information

arxiv: v2 [cs.lg] 30 Mar 2017

arxiv: v2 [cs.lg] 30 Mar 2017 Batch Renoralization: Towards Reducing Minibatch Dependence in Batch-Noralized Models Sergey Ioffe Google Inc., sioffe@google.co arxiv:1702.03275v2 [cs.lg] 30 Mar 2017 Abstract Batch Noralization is quite

More information

A Monte Carlo scheme for diffusion estimation

A Monte Carlo scheme for diffusion estimation A onte Carlo schee for diffusion estiation L. artino, J. Plata, F. Louzada Institute of atheatical Sciences Coputing, Universidade de São Paulo (ICC-USP), Brazil. Dep. of Electrical Engineering (ESAT-STADIUS),

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

PAC-Bayesian Learning of Linear Classifiers

PAC-Bayesian Learning of Linear Classifiers Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique

More information

A Theoretical Analysis of a Warm Start Technique

A Theoretical Analysis of a Warm Start Technique A Theoretical Analysis of a War Start Technique Martin A. Zinkevich Yahoo! Labs 701 First Avenue Sunnyvale, CA Abstract Batch gradient descent looks at every data point for every step, which is wasteful

More information

OPTIMIZATION in multi-agent networks has attracted

OPTIMIZATION in multi-agent networks has attracted Distributed constrained optiization and consensus in uncertain networks via proxial iniization Kostas Margellos, Alessandro Falsone, Sione Garatti and Maria Prandini arxiv:603.039v3 [ath.oc] 3 May 07 Abstract

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016/2017 Lessons 9 11 Jan 2017 Outline Artificial Neural networks Notation...2 Convolutional Neural Networks...3

More information

Ph 20.3 Numerical Solution of Ordinary Differential Equations

Ph 20.3 Numerical Solution of Ordinary Differential Equations Ph 20.3 Nuerical Solution of Ordinary Differential Equations Due: Week 5 -v20170314- This Assignent So far, your assignents have tried to failiarize you with the hardware and software in the Physics Coputing

More information