Data Augmentation for the Bayesian Analysis of Multinomial Logit Models

Steven L. Scott, University of Southern California
Bridge Hall 401-H, Los Angeles, CA

Key Words: Markov chain Monte Carlo, logistic regression, data augmentation

1. Introduction

This article introduces a Markov chain Monte Carlo (MCMC) method for sampling the parameters of a multinomial logit model from their posterior distribution. Let $y_i \in \{0,\dots,M\}$ denote the categorical response of subject $i$ with covariates $x_i = (x_{i1},\dots,x_{ip})^T$. Let $X = (x_1,\dots,x_n)^T$ denote the design matrix, and let $y = (y_1,\dots,y_n)^T$. Multinomial logit models relate $y_i$ to $x_i$ through
\[
p(y_i = m) \propto \exp\{g_m(x_i, \beta)\} = \lambda_{im}, \tag{1}
\]
where $g_m(x_i, \beta)$ is a linear function and $\beta$ is a parameter vector. Adding the same constant to each $g_m(x, \beta)$ leaves (1) unchanged, so one commonly assumes $g_0(x, \beta) = 0$ to preserve identifiability.

The general function notation masks subtleties in the linear predictor that distinguish several varieties of multinomial logit models. For example, equation (1) can be made to model either ordinal or nominal responses by suitably constraining the linear predictor. By extending $x$ through basis expansions, equation (1) includes generalized additive multinomial logit models (Abe, 1999).

Variants of multinomial logit models occur frequently in many areas of application. The models are especially important in econometrics (McFadden, 1974), where they are referred to as discrete choice models, and as a component of the partial credit models used in educational testing (Muraki, 1992).

Despite the obvious importance of multinomial logit models in applied research, Bayesian statisticians typically prefer to work with multinomial probit models instead. The preference for multinomial probit is largely due to a convenient Gibbs sampling algorithm introduced by McCulloch and Rossi (1994), which is a multivariate version of the probit regression algorithm of Albert and Chib (1993) (henceforth MRAC). The MRAC algorithm is a data augmentation scheme which alternates between simulating a latent multivariate Gaussian vector for each subject, given observed data and model parameters, and simulating model parameters given complete data. The MRAC algorithm is widely used, even though computationally superior methods have been developed (e.g. van Dyk and Meng, 2001; Liu and Wu, 1999). MRAC's appeal lies in its aesthetic simplicity, which derives from stochastically replacing the nonlinear probit likelihood with a complete data likelihood based on the identity link. The identity link allows model parameters to be simulated from closed form full conditional distributions. Consequently MRAC avoids the tuning constants required by most Metropolis-Hastings algorithms, which means that MRAC is a default method that can be implemented with little expertise on the part of the user. It is easy to program and easy to explain to clients. The analytically tractable complete data likelihood also makes it easy to embed probit regression models into more elaborate hierarchical or random effects models.

The sampler introduced in this article is the natural extension of MRAC to multinomial logit models. Unlike multinomial probit models, multinomial logit models do not admit a Gibbs sampler with closed form full conditional distributions. However, it is possible to simulate a set of latent variables with mean $g_m(x, \beta)$ and known, constant variance.
The latent variables can be combined with frequentist asymptotic theory for linear models, or with the Bayesian method of moments (Zellner, 1997), to produce a closed form surrogate distribution that approximates the full conditional distribution of the model parameters given complete data. The draw from the surrogate distribution is filtered using a Metropolis-Hastings probability (Metropolis et al., 1953; Hastings, 1970) to produce a draw from the desired posterior distribution $p(\beta \mid X, y)$.

The proposed sampler inherits MRAC's considerable aesthetic appeal, and several practical features as well. First, the method is simple to program, and it readily extends to complex stochastic systems which include multinomial logit models as embedded components. Second, the proposal distribution is tailored to the target distribution at each iteration of the sampler without invoking iterative root finding methods, which might fail for computational reasons at some unlucky draw of $\beta$.

Third, the sampler requires none of the tuning constants typically needed for random walk Metropolis-Hastings algorithms, so its burden on practitioners is minimal. Computationally, the sampler evaluates the complete data likelihood only once during each iteration. The complete data likelihood is simpler than the full multinomial logit likelihood because it avoids the multinomial logit normalizing constant. As a result the sampler is computationally faster than one-scalar-at-a-time sampling methods, such as adaptive rejection sampling, which require several likelihood evaluations per draw of $\beta$. The proposed method can handle both continuous and categorical covariates, so it is more flexible than methods based on the multinomial-Poisson transformation (Baker, 1994; Spiegelhalter, Thomas, Best, and Gilks, 1996; Chen and Kuo, 2001). Finally, the proposed sampler can be modified to work with Poisson regression.

The remainder of the article is structured as follows. Section 2 explains the latent exponential sampler in general terms, without reference to a specific form for $g_m(x_i, \beta)$. Section 3 reviews several subfamilies of multinomial logit models and explains how the sampler can be applied to each. Section 4 illustrates the algorithm on a real data set. Section 5 provides a concluding discussion.

2. The Latent Exponential Sampler

Let $\mathcal{E}(\lambda)$ denote the exponential distribution with rate $\lambda$, and let $Z = (z_{im})$ denote a matrix of independent exponential random variables with rows $z_i = (z_{i0},\dots,z_{iM})$, where $p(z_{im} \mid X, \beta) = \mathcal{E}(\lambda_{im})$ with $\lambda_{im}$ defined in (1). If $y_i = \arg\min_m (z_{im})$ then $p(y_i = m \mid X, \beta) \propto \lambda_{im}$, which is the multinomial logit model. Note that $y$ is a deterministic function of $Z$, so the complete data likelihood that would be obtained if $Z$ were observed is
\[
p(Z \mid X, \beta) = \prod_{i=1}^{n} \prod_{m=0}^{M} \exp\{ g_m(x_i, \beta_m) - z_{im} \exp\{g_m(x_i, \beta_m)\} \}, \tag{2}
\]
which does not involve $y$.

Equation (2) implies a convenient conditional independence property. Many variants of multinomial logit models are parameterized so that $g_m(x, \beta) = g(x; \delta, \beta_m)$, where $\beta_m$ and $\beta_{m'}$ are distinct for $m \neq m'$ and $\delta$ is a parameter shared by all response levels. If the $\{\beta_m\}$ are independent in the prior distribution $p(\beta \mid X, \delta)$ then they remain independent in $p(\beta \mid X, Z, \delta)$. This property is absent from $p(\beta \mid X, y, \delta)$ because of the normalizing constant in (1).

The latent exponential sampler cycles between three steps: sampling $Z$ from $p(Z \mid X, y, \beta)$, proposing a new value of $\beta$ from a surrogate for $p(\beta \mid X, Z)$ constructed using a transformation of $Z$, and promoting either the proposal or the current $\beta$ according to a Metropolis-Hastings probability.

2.1 Sampling Latent Data

Sampling $Z$ from $p(Z \mid X, y, \beta)$ is trivial because $z_1,\dots,z_n$ are conditionally independent given $(\beta, X, y)$. To draw $z_i$, first draw the minimal element $z_{iy_i}$ from $p(z_{iy_i} \mid X, y, \beta) = \mathcal{E}(\sum_m \lambda_{im})$. Then the memoryless property of the exponential distribution implies $z_{im} = z_{iy_i} + \tilde{z}_{im}$, with $\tilde{z}_{im} \sim \mathcal{E}(\lambda_{im})$ independently for $m \neq y_i$. If the identifiability constraint $g_0(x, \beta) = 0$ is imposed then $\beta$ is independent of $z_{i0}$ in (2). However $z_{i0}$ must still be sampled during the data augmentation step to maintain the scale of the imputed variables.
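To make the step concrete, here is a minimal Python/NumPy sketch of the draw of $Z$ given $(X, y, \beta)$. It is an illustration, not code from the article; the names draw_latent_Z and G (the $n \times (M+1)$ matrix of linear predictors $g_m(x_i, \beta)$) are conventions I have assumed.

```python
import numpy as np

def draw_latent_Z(G, y, rng):
    """Draw Z from p(Z | X, y, beta) as in Section 2.1.

    G : (n, M+1) array of linear predictors g_m(x_i, beta), so that
        lam = exp(G) holds the exponential rates lambda_im.
    y : (n,) array of observed responses in {0, ..., M}.
    Returns an (n, M+1) array Z whose row-wise argmin equals y.
    """
    n, _ = G.shape
    lam = np.exp(G)
    # The minimum of independent exponentials is exponential with the
    # summed rate, so z_{i y_i} ~ E(sum_m lambda_im).
    z_min = rng.exponential(1.0 / lam.sum(axis=1))
    # Memoryless property: the non-minimal z_im exceed z_min by
    # independent E(lambda_im) increments.
    Z = z_min[:, None] + rng.exponential(1.0 / lam)
    Z[np.arange(n), y] = z_min  # overwrite the winning category
    return Z
```

Taking the column argmin of fresh exponential draws with these rates reproduces $y$ with the probabilities in (1), which provides a quick check on the representation.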
2.2 Sampling β

Equation (2) fails to suggest a closed form full conditional distribution for $\beta$, but a closed form surrogate exists. Any exponential random variable $z \sim \mathcal{E}(\lambda)$ may be written $z = e/\lambda$, with $e \sim \mathcal{E}(1)$. Thus $\log(z) = \log(e) - \log(\lambda)$ has mean $\mu - \log(\lambda)$ and variance $\sigma^2$, where $\mu$ and $\sigma^2$ are the mean and variance of $\log(e)$. Because $\log(e)$ follows an extreme value distribution (Johnson et al., 1995) we have $\mu = -\gamma$, the negative of Euler's constant, and $\sigma^2 = \pi^2/6$. In particular, note that $\sigma^2$ does not depend on $\lambda$. Therefore
\[
u_{im} \equiv \mu - \log(z_{im}) \sim [\,g_m(x_i, \beta),\; \sigma^2\,], \tag{3}
\]
with square brackets denoting a random variable's mean and variance.

The random variables in $U = (u_{im})$ are independent observations with constant variance whose expected value is the linear predictor. Therefore, frequentist theory for linear models implies that $\hat{\beta}$, the least squares estimate of $\beta$ in a regression of $U$ on $X$, has limiting distribution $p(\hat{\beta} \mid X, \beta) = N(\beta, V)$, where $V$ is a known function of $X$ and $\sigma^2$. If the prior distribution for $\beta$ is Gaussian, say $\beta \sim N(\alpha, \Sigma)$, then a closed form surrogate for the full conditional distribution is
\[
p(\beta \mid X, \hat{\beta}) = N\left(\Omega(\Sigma^{-1}\alpha + V^{-1}\hat{\beta}),\; \Omega\right), \tag{4}
\]
where $\Omega^{-1} = \Sigma^{-1} + V^{-1}$. In many cases equation (4) is the full conditional distribution that would be obtained if the observations in (3) were Gaussian with the specified mean and variance.
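A companion sketch, under the Gaussian-prior specialization worked out in equation (6) of Section 3.1 (so that $V^{-1} = X^T X/\sigma^2$), takes one column of $U = \mu - \log Z$ and draws from the surrogate (4); propose_beta_m is a hypothetical name, not notation from the text.

```python
import numpy as np

MU = -np.euler_gamma          # mean of log(e) for e ~ E(1)
SIGMA2 = np.pi ** 2 / 6.0     # variance of log(e)

def propose_beta_m(X, u_m, alpha, Sigma_inv, rng):
    """Draw from the Gaussian surrogate (4) for one coefficient block.

    X : (n, p) design matrix; u_m : (n,) column of U = MU - log(Z);
    alpha : (p,) prior mean; Sigma_inv : (p, p) prior precision.
    """
    prec = Sigma_inv + X.T @ X / SIGMA2          # Omega^{-1}
    mean = np.linalg.solve(prec, Sigma_inv @ alpha + X.T @ u_m / SIGMA2)
    # Sample N(mean, prec^{-1}) via the Cholesky factor of the precision.
    L = np.linalg.cholesky(prec)                 # prec = L L'
    return mean + np.linalg.solve(L.T, rng.standard_normal(len(mean)))
```

Solving against the Cholesky factor of the precision avoids forming $\Omega$ explicitly, which is the natural way to exploit the closed form in (4).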

The same proposal distribution can be justified, without asymptotics, based on the maximum entropy principle using the Bayesian method of moments (Zellner, 1997).

A candidate $\tilde{\beta}$ is drawn from equation (4) and compared to the current $\beta^{(t)}$ through
\[
\alpha = \min\left\{ \frac{f(\tilde{\beta})/f(\beta^{(t)})}{q(\tilde{\beta})/q(\beta^{(t)})},\; 1 \right\}, \tag{5}
\]
where $f(\beta) = p(\beta \mid X, Z)$ and $q(\beta) = p(\beta \mid X, \hat{\beta})$. The candidate $\tilde{\beta}$ is promoted to $\beta^{(t+1)}$ with probability $\alpha$; otherwise $\beta^{(t+1)} = \beta^{(t)}$. (A sketch of this accept/reject step appears at the end of Section 2.3.) Note that the conditional independence properties of equation (2) are also present in (4).

2.3 Asymptotics

The proposal distribution violates the likelihood principle because $\hat{\beta}$ is not a sufficient statistic for $\beta$. The cost of replacing the full conditional distribution $p(\beta \mid X, Z)$ with its method of moments approximation $p(\beta \mid X, \hat{\beta})$ can be seen by examining the limiting behavior of the two distributions as $n \to \infty$. If $y$ truly follows a multinomial logit model with parameter $\beta_0$, and if the prior distribution $p(\beta)$ is such that it is eventually dominated by the likelihood, then both $p(\beta \mid X, \hat{\beta})$ and $p(\beta \mid X, Z)$ are asymptotically normal with mean $\beta_0$ (Le Cam and Yang, 2000). However, it is easy to show that the asymptotic variance of $p(\beta \mid X, \hat{\beta})$ is $\sigma^2$ times that of $p(\beta \mid X, Z)$.

Figure 1 contrasts the proposal and full conditional distributions for $\beta$ based on a simulated data set of 100 observations from the model $z_i \sim \mathcal{E}(\exp(0.2\,x_i))$, where $x_i \sim U(0,1)$. Figure 1 reminds us that the proposal and full conditional distributions can have different means in finite samples, even though the means would be the same in the limit. The inflated variance of the proposal distribution relative to the full conditional is readily apparent.

Upon viewing Figure 1, one is tempted to replace the variance of $p(\beta \mid X, \hat{\beta})$ with the asymptotic variance of $p(\beta \mid X, Z)$ by simply setting $\sigma^2 = 1$ in the computer code that fits the model. However, notice that doing so places much smaller posterior mass near the true $\beta_0 = 0.2$ in Figure 1 than either the proposal or the full conditional distribution. In fact, we know that Metropolis-Hastings algorithms with heavy tailed proposals have desirable mixing properties (see, e.g., Mengersen and Tweedie, 1996), so the increased variance of $p(\beta \mid X, \hat{\beta})$ relative to $p(\beta \mid X, Z)$ is actually something of a blessing. In practice one prefers to inflate the tails of (4) even further, for example by replacing it with a multivariate $T$ distribution with small (e.g. 3) degrees of freedom.

[Figure 1: Comparing $p(\beta \mid X, Z)$ ("Full Cond.") and $p(\beta \mid X, \hat{\beta})$ ("Proposal"), assuming the prior $p(\beta) \propto 1$. "Adj. Var." is the proposal distribution rescaled to have the same asymptotic variance as $p(\beta \mid X, Z)$.]
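The accept/reject step (5) can be sketched as follows for a single $\beta_m$, assuming the flat prior $p(\beta_m) \propto 1$ used in the data example of Section 4, so that $f$ reduces to the complete data likelihood (2); the function names are mine, and the sketch assumes $p \ge 2$ coefficients.

```python
import numpy as np
from scipy.stats import multivariate_normal

def complete_data_loglik(beta_m, X, z_m):
    """log p(z_m | X, beta_m) for one response level, from (2)."""
    g = X @ beta_m
    return np.sum(g - z_m * np.exp(g))

def mh_filter(beta_cur, X, z_m, prop_mean, prop_cov, rng):
    """One Metropolis-Hastings step implementing (5) under a flat prior,
    so f(beta) is the complete data likelihood alone."""
    q = multivariate_normal(mean=prop_mean, cov=prop_cov)
    beta_new = np.atleast_1d(q.rvs(random_state=rng))
    log_alpha = (complete_data_loglik(beta_new, X, z_m)
                 - complete_data_loglik(beta_cur, X, z_m)
                 - q.logpdf(beta_new) + q.logpdf(beta_cur))
    return beta_new if np.log(rng.uniform()) < log_alpha else beta_cur
```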
2.4 Motivation

The latent exponential sampler can be motivated in either of two primary ways. The first is a utility maximization argument which has been heavily used in the econometrics literature since its introduction by McFadden (1974); see Train (2003) for a recent review and bibliography. Conceptually, $u_{im}$ represents the perceived utility of choice $m$ for subject $i$, which is linearly related to $x_i$. Subject $i$ then chooses $y_i$ to maximize his perceived utility. The proposed sampler stochastically restores the unobservable utilities. Of course, the utilities need not physically exist for the sampler to use them as a computational device. That is, multinomial logit models apply equally well to physical systems which lack rational decision makers exercising free will.

The sampler may also be viewed as a multinomial-Poisson transformation, a name given to the dual relationship between the multinomial and Poisson likelihoods. Several authors (Baker, 1994; Chen and Kuo, 2001; Spiegelhalter et al., 1996) have used the multinomial-Poisson transformation to approximate $p(\beta \mid X, y)$ with a normal distribution based on a Poisson regression of $y$ on $X$ determined by iteratively reweighted least squares. However, the approximation involves an additional parameter for each distinct covariate pattern in $X$, which limits its usefulness when $X$ contains continuously measured variables.

The latent exponential sampler views the exponential likelihood as primitive, rather than the Poisson. The advantage is that one can achieve linearity on the scale of the parameters by taking the log of exponential data. The same cannot be said for Poisson data, which have a positive probability of being zero.

3. Multinomial Logit Models

To illustrate the variety of multinomial logit models with which the latent exponential sampler can be used, this Section reviews several versions and extensions of the model in equation (1) and explains how the sampler can be applied to each. The models discussed include multinomial logistic regression, discrete choice models, ordinal logit models, and additive multinomial logit models. More elaborate models, such as random effects models and partial credit models, can also be accommodated, but are not discussed here due to space constraints (but see Scott and Ip, 2002).

3.1 Multinomial Logistic Regression

Multinomial logistic regression sets $g_m(x_i, \beta) = \beta_m^T x_i$, where the $\{\beta_m\}$ are distinct across $m$, with $\beta_0 = 0$ for identifiability. If $u_m$ denotes column $m$ of $U = (u_{im})$, and if one assumes independent prior distributions $\beta_m \sim N(\alpha_m, \Sigma_m)$ for each $m$, then the proposal distribution for $\beta_m$ is
\[
p(\beta_m \mid X, \hat{\beta}) = N\left[\Omega_m(\Sigma_m^{-1}\alpha_m + X^T u_m / \sigma^2),\; \Omega_m\right], \tag{6}
\]
where $\Omega_m^{-1} = \Sigma_m^{-1} + X^T X / \sigma^2$. Note that $\Omega_m$ depends only on known quantities, so it only needs to be computed once, and it depends on $m$ only through the prior distribution. The $\{\beta_m\}$ can be treated separately because equation (2) implies their conditional independence given $Z$. That contrasts with the asymptotic variance obtained from the Hessian matrix of the observed multinomial logistic regression log likelihood,
\[
-\frac{\partial^2 \ell}{\partial \beta\, \partial \beta^T} = \sum_{i=1}^{n} \left(\mathrm{diag}(\pi_i) - \pi_i \pi_i^T\right) \otimes x_i x_i^T, \tag{7}
\]
where $\pi_i = (\pi_{i1},\dots,\pi_{iM})$ with $\pi_{im} = \lambda_{im} / \sum_{k=0}^{M} \lambda_{ik}$. The Hessian matrix in (7) has $M$ times as many rows and columns as the variance of the proposal distribution in (6). Thus, conditioning on $Z$ replaces one large matrix factorization with $M$ smaller ones differing only in the prior variance $\Sigma_m$.

3.2 Discrete Choice Models

Discrete choice models differ from ordinary multinomial logit models in that some covariates are response specific while others are subject specific. For example, if $y_i$ indicates the type of car purchased by customer $i$ then the car's gas mileage is response specific, whereas the customer's age is subject specific. Response specific covariates shift a subscript from the coefficient to the covariate, which substantially reduces the dimension of the parameter space. Let $x_i$ denote the $(p \times 1)$ vector of subject specific characteristics for observation $i$, and let $w_{im}$ denote the $(q \times 1)$ vector of response specific characteristics for potential response $m$ on observation $i$. Then one may write
\[
p(y_i = m) \propto \exp(\beta_m^T x_i + \delta^T w_{im}). \tag{8}
\]
Assuming $\beta_0 = 0$ for identifiability, a convenient algorithm for sampling from $p(\beta, \delta \mid X, y)$ is as follows: (1) generate $U = (u_{im})$ as in Section 2; (2) sample $\delta$ from $p(\delta \mid X, Z, \beta)$; (3) sample $\beta_m$ from $p(\beta_m \mid X, Z, \delta)$ for $m = 1,\dots,M$. (A sketch of step 2 appears at the end of this subsection.) Step 2 can be accomplished by defining $u^{(d)}_{im} = u_{im} - \beta_m^T x_i$, then stacking the $u^{(d)}_{im}$ into the $(nM \times 1)$ vector $u^{(d)}$, and the $w_{im}^T$ into the $(nM \times q)$ matrix $W$. Assuming independent Gaussian priors $\delta \sim N(\alpha_d, \Sigma_d)$ and $\beta_m \sim N(\alpha_m, \Sigma_m)$, the proposal distribution for $\delta$ is
\[
p(\delta \mid X, \beta, \hat{\delta}) = N\left[\Omega_d(\Sigma_d^{-1}\alpha_d + W^T u^{(d)} / \sigma^2),\; \Omega_d\right], \tag{9}
\]
where $\Omega_d^{-1} = \Sigma_d^{-1} + W^T W / \sigma^2$. Step 3 can be realized by creating $u^{(b)}_{im} = u_{im} - \delta^T w_{im}$, and then proceeding as in Section 3.1.
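The $\delta$ step can be sketched as follows. Unlike the text, which stacks the $nM$ rows for levels $m = 1,\dots,M$, this illustration stacks all $M+1$ levels for simplicity, and the argument names are hypothetical.

```python
import numpy as np

SIGMA2 = np.pi ** 2 / 6.0   # variance of log(e), as in (3)

def propose_delta(Xs, Wr, U, Beta, alpha_d, Sigma_d_inv, rng):
    """Proposal for delta from (9) in the discrete choice model (8).

    Xs : (n, p) subject specific covariates.
    Wr : (n, K, q) response specific covariates w_im, with K = M + 1.
    U  : (n, K) latent utilities from (3).
    Beta : (K, p) rows beta_m, row 0 fixed at zero for identifiability.
    """
    n, K, q = Wr.shape
    Ud = U - Xs @ Beta.T          # u_im^(d) = u_im - beta_m' x_i
    u_d = Ud.reshape(n * K)       # stacked response vector u^(d)
    W = Wr.reshape(n * K, q)      # stacked design matrix W
    prec = Sigma_d_inv + W.T @ W / SIGMA2
    mean = np.linalg.solve(prec, Sigma_d_inv @ alpha_d + W.T @ u_d / SIGMA2)
    L = np.linalg.cholesky(prec)  # draw N(mean, prec^{-1})
    return mean + np.linalg.solve(L.T, rng.standard_normal(q))
```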
3.3 Generalized Additive Multinomial Logit Models

Generalized additive multinomial logit models (Abe, 1999) extend discrete choice models by assuming
\[
\log \lambda_{im} = \sum_p s_{mp}(x_{ip}; \beta_m) + \sum_q s_q(w_{imq}; \delta),
\]
where $s_{mp}$ and $s_q$ are scalar functions of scalar arguments to be estimated by a spline or some other smoother indexed by the parameters $\beta_m$ and $\delta$. Hastie and Tibshirani (2000) describe a Bayesian backfitting algorithm for fitting additive models under the linear link. Their advice for fitting generalized additive models under other link functions is to use the Metropolis algorithm, but they offer no guidance on how to create appropriate proposal distributions.

By conditioning on $Z$, the latent exponential sampler splits $\log \lambda_{im}$ into $M+1$ conditionally independent additive models under the identity link. Thus sampling $\delta$ and $\beta$ can proceed as in Section 3.2, where a proposal distribution for $\delta$ or $\beta_m$ (for each $m$) is attained by one iteration of Hastie and Tibshirani's Bayesian backfitting algorithm. Note that if $s_{mp}$ and $s_q$ are splines then the Bayesian backfitting algorithm simply applies the methods of Section 3.2 after a basis expansion of $x_i$ and $w_{im}$, albeit with computational tricks to capitalize on the structure of the spline basis functions.

3.4 Ordinal Logit Models

Ordinal logit models, which are described by McCullagh and Nelder (1989, Chapter 5.2.3) and Agresti (1990, Chapter 6), assume $g_m(x_i) = \eta_m + s_m \gamma^T x_i$, where $\gamma$ and $\eta_m$ are parameters and $s_m$ is a known, real-valued score assigned to level $m$. In practice one often sets $s_m = m$ unless a better alternative presents itself. These ordinal logit models are distinct from cumulative logit models, in which the support of an unobserved logistic distribution is partitioned into regions corresponding to the levels of the observed response (Johnson and Albert, 1999).

Assuming the identifiability constraint $s_0 = \eta_0 = 0$, one may construct a proposal distribution for ordinal logit models as follows (a sketch of the design construction follows this paragraph). Let $s = (s_1,\dots,s_M)^T$, let $X_i = (s x_i^T, I)$ where $I$ is the $M \times M$ identity matrix, and set $\beta = (\gamma^T, \eta^T)^T$. Then $g_m(x_i, \beta)$ is row $m$ of the vector $X_i \beta$. Form the design matrix by stacking the $X_i$, so that $X = (X_1^T,\dots,X_n^T)^T$, and form the response vector $u$ analogously. Then the proposal distribution is obtained by regressing $u$ on $X$ with parameter $\beta$, as in (6).
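A small sketch of this design construction; ordinal_design is a hypothetical helper, not notation from the text.

```python
import numpy as np

def ordinal_design(x_i, s):
    """Build X_i = (s x_i', I) from Section 3.4: row m of X_i @ beta
    gives s_m * gamma'x_i + eta_m for beta = (gamma', eta')'."""
    M = len(s)
    return np.hstack([np.outer(s, x_i), np.eye(M)])   # (M, p + M)

def stacked_ordinal_design(X, s):
    """Stack the X_i over observations into the (n*M, p + M) design,
    ready for the regression of u on X as in (6)."""
    return np.vstack([ordinal_design(x_i, s) for x_i in X])
```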
4. Data Example

In lieu of a simulation study, this Section considers a traditional multinomial logit problem where likelihood methods perform adequately. It compares the latent exponential sampler to several random walk Metropolis samplers using data describing automobile preferences for $n = 263$ customers (Foster et al., 1998). The example assumes the multinomial logistic regression model with flat priors $p(\beta_m) \propto 1$ for all $m$. The outcome variable is the type of car purchased (Family = 0, Sporty = 1, or Work = 2). The covariates used are Age (in years), Sex (1 if Female, 0 if Male), and Marital Status (1 if Single, 0 if Married); all are subject specific variables. Note that this example precludes comparisons with the multinomial-Poisson transformation because the data include Age, a continuous covariate.

Table 1 and Figure 2 compare MCMC output for the latent exponential method to samplers labeled $RW_k$, where the subscript $k$ describes how many parameters are simulated in each Metropolis proposal. The $RW_k$ samplers propose $\beta^{(t+1)} \sim N[\beta^{(t)}, \tau^2 I]$, where $I$ is the $k \times k$ identity matrix. Thus $RW_{MP}$ proposes all elements of $\beta$ in a single draw and accepts or rejects the entire parameter vector, $RW_M$ proposes, accepts, or rejects each $\beta_m$ vector individually, and $RW_1$ performs the Metropolis algorithm on each scalar element of $\beta$. (A sketch of the $RW_{MP}$ update appears at the end of this Section's text.) Table 1 records the time required for each sampler to produce 10,000 iterations, the fraction of proposals that were accepted, and the estimated posterior means and standard deviations of each component of $\beta$. Figure 2 displays the corresponding MCMC sample paths for the coefficient of Sex on Sporty cars (plots of other coefficients are similar). The latent exponential sampler converges almost immediately, accepts a high fraction of proposed deviates, and closely agrees with maximum likelihood point estimates and standard errors.

The computational speed of the latent exponential sampler compares favorably with the fastest random walk Metropolis algorithms. This is because most of the computational effort in Metropolis-Hastings samplers comes from evaluating the likelihoods required to compute the acceptance probability. The latent exponential and $RW_{MP}$ samplers each have only one such evaluation per iteration.

The $RW_k$ samplers fare poorly in Table 1 and Figure 2. None have converged by the end of 10,000 iterations, which causes them to produce misleading estimates of posterior means and standard errors. The $RW_{MP}$ samplers are computationally fast, but mix poorly because most of their proposals are rejected. To obtain an acceptance rate competitive with the latent exponential method, $RW_k$ must either choose $\tau$ to be small or $k$ to be small, which slows the algorithm with respect to mixing, computational time, or both.
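For contrast, here is a minimal sketch of the $RW_{MP}$ comparison sampler under the example's flat prior. It is my own illustration; note that, unlike the complete data likelihood (2), each evaluation must compute the multinomial logit normalizing constant.

```python
import numpy as np
from scipy.special import logsumexp

def observed_loglik(Beta, X, y):
    """Observed data multinomial logit log likelihood; unlike (2) it
    must evaluate the normalizing constant over all M + 1 levels."""
    G = X @ Beta.T                                   # (n, M+1) predictors
    return np.sum(G[np.arange(len(y)), y] - logsumexp(G, axis=1))

def rw_mp_step(Beta, X, y, tau, rng):
    """One RW_MP update: perturb every free element of beta at once
    and accept with probability min{f(prop)/f(cur), 1} (flat prior)."""
    prop = Beta.copy()
    prop[1:] += tau * rng.standard_normal(prop[1:].shape)  # beta_0 = 0
    log_alpha = observed_loglik(prop, X, y) - observed_loglik(Beta, X, y)
    return prop if np.log(rng.uniform()) < log_alpha else Beta
```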

Table 1: MCMC output summaries for several samplers, each run for 10,000 iterations on the auto choice data. Estimated posterior means are shown with estimated posterior standard errors in parentheses; only the standard errors, reproduced below, are legible in this copy, and the Time (s) and Accept % columns are not. The first line of coefficients in each group is for sporty cars, the second is for work cars. Family cars are the baseline.

Sampler               Intercept   Age      Sex      Mar.Stat.
RW_1  (τ = .05)       (0.61)      (0.02)   (0.09)   (0.35)
                      (0.24)      (0.01)   (0.21)   (0.18)
RW_1  (τ = .01)       (0.43)      (0.02)   (0.15)   (0.55)
                      (0.28)      (0.01)   (0.20)   (0.16)
RW_M  (τ = .05)       (1.24)      (0.04)   (0.29)   (0.36)
                      (0.44)      (0.02)   (0.31)   (0.30)
RW_M  (τ = .01)       (0.32)      (0.02)   (0.18)   (0.47)
                      (0.36)      (0.01)   (0.25)   (0.24)
RW_MP (τ = .05)       (1.14)      (0.04)   (0.31)   (0.31)
                      (1.12)      (0.03)   (0.43)   (0.42)
RW_MP (τ = .01)       (0.51)      (0.02)   (0.24)   (0.36)
                      (0.42)      (0.01)   (0.26)   (0.34)
Latent Exponential    (0.92)      (0.03)   (0.31)   (0.32)
                      (0.95)      (0.03)   (0.34)   (0.39)
MLE                   (0.95)      (0.03)   (0.31)   (0.32)
                      (0.97)      (0.03)   (0.35)   (0.38)

[Figure 2: Sample paths for the coefficient of Sex on Sporty cars. Panels: (a) RW_1 (τ = .05), (b) RW_M (τ = .05), (c) RW_MP (τ = .05), (d) RW_1 (τ = .01), (e) RW_M (τ = .01), (f) RW_MP (τ = .01), (g) latent exponential. The reference lines represent the maximum likelihood estimate computed from SAS, and ±2 and ±3 standard errors.]

5. Discussion

The latent exponential sampler is a convenient method for sampling the posterior distribution of general multinomial logit models. The sampler requires simulating a collection of exponential random variables $Z$, a minor computational burden. Conditioning on $Z$ allows $\beta$ to be sampled from approximate full conditional proposal distributions which require no tuning, without resorting to iterative root finding methods. The latent exponential method accommodates both categorical and continuous covariates, and it allows parameters of $(g_m)$ which are distinct across $m$ to be independently sampled.

There are several ways to motivate the latent exponential sampler. It is the natural generalization of Albert and Chib (1993) to multilevel categorical responses under the logit link. The utility restoration argument in Section 2 will be familiar to econometricians. One can also view the method as a twist on the well established relationship between the multinomial and Poisson likelihoods, known at least since Birch (1965). By exploiting the duality between the Poisson and exponential distributions, the latent exponential method is able to replace the log link for the Poisson likelihood with the identity link. Conditioning on $Z$ allows the modeler to work with the identity link by transforming imputed data instead of functions of model parameters, which removes a layer of complexity from the model.

Finally, this article has focused on the fact that the latent exponential sampler is a convenient method for sampling from the posterior distribution of multinomial logit models. It may be the key to finding a rapidly mixing sampler as well. The parameter expansion methods described by van Dyk and Meng (2001) and Liu and Wu (1999) have produced much more rapidly mixing samplers in binomial probit regression models. I am currently investigating whether similar results can be obtained for multinomial logit models.

References

Abe, M. (1999). A generalized additive model for discrete-choice data. Journal of Business and Economic Statistics 17.

Agresti, A. (1990). Categorical Data Analysis. Wiley.

Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88.

Baker, S. G. (1994). The multinomial-Poisson transformation. The Statistician 43.

Birch, M. W. (1965). The detection of partial association, II: The general case. Journal of the Royal Statistical Society, Series B 27.

Chen, Z. and Kuo, L. (2001). A note on the estimation of multinomial logit models with random effects. The American Statistician 55.

Foster, D. P., Stine, R. A., and Waterman, R. P. (1998). Business Analysis Using Regression. Springer.

Hastie, T. and Tibshirani, R. (2000). Bayesian backfitting (with discussion). Statistical Science 15.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57.

Johnson, N. L., Kotz, S., and Balakrishnan, N. (1995). Continuous Univariate Distributions, vol. 2, 2nd edn. Wiley Interscience, Somerset, NJ.

Johnson, V. E. and Albert, J. H. (1999). Ordinal Data Modeling. Springer-Verlag.

Le Cam, L. M. and Yang, G. L. (2000). Asymptotics in Statistics: Some Basic Concepts. Springer-Verlag.

Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for data augmentation. Journal of the American Statistical Association 94.

McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd edn. Chapman & Hall.

McCulloch, R. and Rossi, P. E. (1994). An exact likelihood analysis of the multinomial probit model. Journal of Econometrics 64.

McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka, ed., Frontiers in Econometrics. Academic Press.

Mengersen, K. L. and Tweedie, R. L. (1996). Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics 24.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics 21.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement 16.
Scott, S. L. and Ip, E. H. (2002). Empirical Bayes and item clustering effects in a latent variable hierarchical model: A case study from the National Assessment of Educational Progress. Journal of the American Statistical Association 97(458).

Spiegelhalter, D. J., Thomas, A., Best, N. G., and Gilks, W. R. (1996). BUGS: Bayesian inference Using Gibbs Sampling, Version 0.5 (version ii).

Train, K. E. (2003). Discrete Choice Methods with Simulation. Cambridge University Press, New York.

van Dyk, D. A. and Meng, X.-L. (2001). The art of data augmentation (with discussion). Journal of Computational and Graphical Statistics 10.

Zellner, A. (1997). The Bayesian method of moments (BMOM): Theory and applications. Advances in Econometrics 12.
