arxiv: v1 [stat.me] 3 Apr 2018


Grouped Heterogeneous Mixture Modeling for Clustered Data

Shonosuke Sugasawa
Center of Spatial Information Science, The University of Tokyo

arXiv: v1 [stat.ME] 3 Apr 2018

Abstract

Clustered data, which have a grouping structure (e.g. postal area, school, individual, species), appear in a variety of scientific fields. The goal of statistical analysis of clustered data is to model the response as a function of covariates while accounting for heterogeneity among clusters. For this purpose, we consider estimating cluster-wise conditional distributions by mixtures of latent conditional distributions common to all the clusters, with mixing proportions that differ across clusters. For modeling the mixing proportions, we propose a structure in which clusters are divided into a finite number of groups and mixing proportions are assumed to be the same within the same group. The proposed model is interpretable and the maximum likelihood estimator is easy to compute via the generalized EM algorithm. In the setting where the cluster sizes grow with, but much more slowly than, the number of clusters, some asymptotic properties of the maximum likelihood estimator are presented. Furthermore, we propose an information criterion for selecting two tuning parameters, the number of groups and the number of latent conditional distributions. Numerical studies demonstrate that the proposed model outperforms some other existing methods.

Key words: Clustered data; EM algorithm; Maximum likelihood; Mixture of regressions

1 Introduction

Clustered data, which have a grouping structure (e.g. postal area, school, individual, species), appear in a variety of scientific fields. The goal of statistical analysis of clustered data is to model the response as a function of covariates while accounting for heterogeneity among clusters. The standard approach for analyzing clustered data is to use mixed models or random effects models (e.g. Demidenko, 2013; Jiang, 2007). However, since this approach essentially aims at modeling conditional means in each cluster, it would not be plausible when the data distribution is skewed or multimodal. As an alternative modeling strategy, the finite mixture model (McLachlan and Peel, 2000) has been extensively applied for its flexibility in capturing the distributional relationship between response and covariates. For independent (non-clustered) data, the mixture model with covariates was originally proposed in Jacobs et al. (1991), known as mixture-of-experts, and a large body of literature has been concerned with flexible modeling of the conditional distribution of non-clustered data, e.g. Jordan and Jacobs (1994); Geweke and Keane (2007); Chung and Dunson (2009); Villani et al. (2009, 2012); Nguyen and McLachlan (2016). However, these methods cannot be directly imported into the context of analyzing clustered data. One possible direct adaptation of these mixture models to modeling cluster-wise conditional distributions is to apply the models to the whole dataset without taking account of cluster information (which we call global mixture modeling), which produces the result that all the estimated cluster-wise distributions are the same. Such a result would be inappropriate, and the relationship between response and covariates might be misunderstood. Another possible adaptation is to fit the mixture models separately to each cluster (which we call local mixture modeling), so that the cluster-wise heterogeneity would be addressed.
However, since cluster sizes are often not large in practice, the estimation might be unstable. For modeling cluster-wise conditional distributions, Rubin and Wu (1997) and Ng and McLachlan (2014) proposed a mixture of random effects models, but it would be computationally burdensome since the estimation of a single random effects

model still presents a major challenge for non-normal responses (e.g. Hui et al., 2017b). Moreover, the result from a mixture of mixed models would be difficult to interpret. An alternative approach is latent mixture modeling (Sugasawa et al., 2017), which can be regarded as a compromise between global and local mixture models, and in which the cluster-wise conditional distributions are expressed as the following mixture:

$$f_i(y \mid x) = \sum_{k=1}^{L} \pi_{ik}\, h_k(y \mid x; \phi_k), \quad i = 1, \ldots, m, \tag{1}$$

where $m$ is the number of clusters, $h_k(y \mid x; \phi_k)$ is a latent conditional distribution common to all the clusters, which typically comes from a generalized linear model with parameter $\phi_k$, and $\pi_i = (\pi_{i1}, \ldots, \pi_{iL})$ is the vector of cluster-wise mixing proportions. In the above model, cluster-wise heterogeneity is captured by the mixing proportions $\pi_i$, so that the model would be easy to interpret. For modeling $\pi_i$, Sun et al. (2007) proposed the use of a logistic mixed model while Sugasawa et al. (2017) proposed the use of a Dirichlet distribution. In both models, due to the additional stochastic structure, parameter estimation requires numerically intensive methods such as Markov chain Monte Carlo or the Monte Carlo EM algorithm, as used in Sun et al. (2007) and Sugasawa et al. (2017), respectively. Apart from parametric modeling of cluster-wise conditional distributions, Rosen et al. (2000) and Tang and Qu (2016) proposed mixture modeling based on the generalized estimating equation, but the primary interest in these works is estimation of the component distributions in the mixture by taking account of correlations within clusters. Moreover, in the nonparametric Bayesian approach, hierarchical Dirichlet processes (Teh et al., 2006) might have a similar philosophy to latent mixture modeling in that both methods assume that mixing components are shared among clusters, and Rodriguez et al. (2008) is also concerned with modeling cluster-wise conditional distributions in terms of Bayesian nonparametrics.
Hence, most modeling strategies for cluster-wise conditional distributions are computationally intensive or have poor interpretability. In other words, approaches to estimating cluster-wise conditional distributions while accounting for cluster-wise heterogeneity are still limited.
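
To make the latent mixture model (1) concrete, the following minimal sketch evaluates a cluster-wise density when the latent distributions are normal linear regressions. The function names and all parameter values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    z = (y - mean) / sd
    return np.exp(-0.5 * z**2) / (sd * np.sqrt(2.0 * np.pi))

def clusterwise_density(y, x, pi_i, betas, sigmas):
    """f_i(y | x) = sum_k pi_ik * h_k(y | x; phi_k), normal regression components."""
    means = betas @ x                      # (L,) component means x^t beta_k
    comps = normal_pdf(y, means, sigmas)   # (L,) values h_k(y | x; phi_k)
    return float(pi_i @ comps)

# Hypothetical parameters for L = 2 latent components (illustrative values only).
betas = np.array([[-1.0, 2.0], [1.0, -2.0]])   # rows: beta_k = (intercept, slope)
sigmas = np.array([1.0, 1.5])                   # component standard deviations
x = np.array([1.0, 0.3])                        # covariate vector including intercept

# Two clusters share the same components but weight them differently.
f_a = clusterwise_density(0.0, x, np.array([0.8, 0.2]), betas, sigmas)
f_b = clusterwise_density(0.0, x, np.array([0.2, 0.8]), betas, sigmas)
```

The two clusters differ only through their mixing proportions, which is how the model expresses heterogeneity while keeping the component parameters common to all clusters.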

The purpose of this paper is to introduce an effective mixture modeling approach, called grouped heterogeneous mixture (GHM) modeling, that is easy to fit and interpret. We develop our method in the framework of the latent mixture model (1), and introduce a novel structure for the mixing proportions $\pi_i$: the vectors $\pi_1, \ldots, \pi_m$ can be divided into a finite number of groups, and the $\pi_i$'s within the same group are equal. Specifically, let $g_i \in \{1, \ldots, G\}$, $i = 1, \ldots, m$, be a group membership variable, where $G$ is the number of groups, and let $\pi_g = (\pi_{g1}, \ldots, \pi_{gL})$ be the unknown mixing proportions in the $g$th group. Then the proposed structure of $\pi_i$ is $\pi_i = \pi_{g_i}$. Since we do not know how the clusters are divided into $G$ groups, we treat the group membership variable $g_i$ as an unknown variable (parameter), and estimate it from the data. Owing to the simple structure of $\pi_i$, the likelihood function of the GHM model can be easily obtained, and the maximization of the function can be easily carried out by the generalized EM algorithm (Meng and Rubin, 1993), as will be described in Section 2. Note that the estimation of the group membership $g_i$ is closely related to the k-means algorithm (Forgy, 1965), and the idea of grouping parameters has recently been considered in the context of panel data analysis (e.g. Hahn and Moon, 2010; Bonhomme and Manresa, 2015; Ando and Bai, 2016), but this paper is the first to use the idea for modeling cluster-wise mixing proportions. Since the number of group membership variables $g_i$ grows with the number of clusters, the cluster size should also grow with the number of clusters to consistently estimate $g_i$. Nevertheless, owing to the grouping structure, the estimators of the model parameters including $g_i$ are shown to be consistent under the setting where the cluster size grows considerably more slowly than the number of clusters.
Specifically, for $n_1$ being the minimum cluster size among the $m$ clusters, our asymptotic framework is $m \to \infty$ and $m/n_1^{\nu} \to 0$ for some $\nu > 0$, which would be met in various applications. The rest of the paper is organized as follows. In Section 2, we introduce the GHM modeling, describe the generalized EM algorithm, and propose a BIC-type information criterion for selecting the number of latent distributions, $L$, and the number of groups, $G$. In Section 3, we present the asymptotic properties of the estimator of the model parameters. In Section 4, we carry out numerical experiments to demonstrate the performance of the

GHM modeling relative to global and local mixture models or random effects models, based on simulated data as well as real-world data.

2 Grouped Heterogeneous Mixture Modeling

2.1 The model

Suppose that we have clustered observations $y_{ij}$, $i = 1, \ldots, m$, $j = 1, \ldots, n_i$, with an associated $p$-dimensional vector of covariates $x_{ij}$. Let $f_i(y \mid x)$ be the density or probability mass function of $y_{ij}$ given $x_{ij}$, which is assumed to be the same within clusters but different across clusters. Our primary interest is estimating the cluster-wise conditional density $f_i(y \mid x)$ from the data. To this end, we propose the following grouped heterogeneous mixture (GHM) model:

$$f_i(y \mid x) = \sum_{k=1}^{L} \pi_{g_i k}\, h_k(y \mid x; \phi_k), \quad i = 1, \ldots, m, \tag{2}$$

where the $h_k$'s are latent conditional densities common to all the clusters, $g_i \in \{1, \ldots, G\}$ denotes the group membership variable and $\phi_k$ is a vector of unknown parameters in the $k$th latent density. As mentioned in Section 1, the key notion in the GHM model (2) is to assume that the $m$ clusters can be divided into $G$ groups (sub-clusters) and that the same mixing proportions (and hence the same conditional density) hold within the same group. The latent model $h_k$ should be specified by the user depending on the type of the response variable $y_{ij}$, and generalized linear models would be a typical choice. For instance, we may use a normal density with mean $x^t \beta_k$ and variance $\sigma_k^2$ when the response takes continuous values, or a Bernoulli distribution with success probability $\exp(x^t \beta_k)/\{1 + \exp(x^t \beta_k)\}$ when the response is binary. We treat the group membership variable $g_i$ as an unknown parameter and estimate it from the data, which means that clusters are adaptively grouped by reflecting information in the data. Moreover, in the GHM model (2), the heterogeneity among clusters is expressed by the mixing proportions $\pi_{g_i k}$, which makes the GHM model flexible yet parsimonious, so that the estimated results would

be interpretable. There are three types of unknown parameters in (2): $\pi = \{\pi_{gk},\ g = 1, \ldots, G,\ k = 1, \ldots, L\}$, the mixing proportions in each group; $\gamma = \{g_1, \ldots, g_m\}$, the set of group membership variables; and $\phi = \{\phi_k,\ k = 1, \ldots, L\}$, the structural parameters in the latent densities.

2.2 Parameter estimation via generalized EM algorithm

Let $\Theta = (\gamma, \pi, \phi)$ be the set of unknown parameters. We first consider estimating $\Theta$ under known numbers of groups $G$ and latent components $L$. The selection of $G$ and $L$ is discussed at the end of this section. Given the data, the log-likelihood function of $\Theta$ is obtained in the following closed form:

$$Q(\Theta) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \log\Big( \sum_{k=1}^{L} \pi_{g_i k}\, h_k(y_{ij} \mid x_{ij}; \phi_k) \Big). \tag{3}$$

The maximum likelihood estimator $\hat{\Theta}$ of $\Theta$ is defined as the maximizer of $Q(\Theta)$. As is often the case in mixture modeling, we develop an iterative method for maximizing $Q(\Theta)$ with respect to $\Theta$. Specifically, we use the generalized EM (GEM) algorithm (Meng and Rubin, 1993). To this end, we consider the following hierarchical expression of the model (2):

$$y_{ij} \mid (z_{ij} = k) \sim h_k(y_{ij} \mid x_{ij}; \phi_k), \qquad P(z_{ij} = k) = \pi_{g_i k}, \tag{4}$$

where the $z_{ij}$'s are latent variables representing the component membership of the $y_{ij}$'s. Given the $z_{ij}$'s, the complete log-likelihood is given by

$$L^c(\Theta) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \sum_{k=1}^{L} I(z_{ij} = k) \big\{ \log h_k(y_{ij} \mid x_{ij}; \phi_k) + \log \pi_{g_i k} \big\}.$$

In the E-step, we need to compute the expectations of the latent variables $z_{ij}$. Under the hierarchical expression (4), the conditional (posterior) probability of $z_{ij} = k$ given

observations $Y = \{y_{ij},\ j = 1, \ldots, n_i,\ i = 1, \ldots, m\}$ evaluated at $\Theta = \Theta^{(r)}$ is given by

$$P(z_{ij} = k \mid Y; \Theta^{(r)}) \equiv w_{ijk}^{(r)} = \frac{\pi^{(r)}_{g_i k}\, h_k(y_{ij} \mid x_{ij}; \phi_k^{(r)})}{\sum_{l=1}^{L} \pi^{(r)}_{g_i l}\, h_l(y_{ij} \mid x_{ij}; \phi_l^{(r)})},$$

so that the objective function to be maximized in the M-step is obtained as

$$Q(\Theta \mid \Theta^{(r)}) = \sum_{k=1}^{L} \sum_{i=1}^{m} \sum_{j=1}^{n_i} w_{ijk}^{(r)} \log h_k(y_{ij} \mid x_{ij}; \phi_k) + \sum_{i=1}^{m} \sum_{k=1}^{L} \log \pi_{g_i k} \sum_{j=1}^{n_i} w_{ijk}^{(r)} \equiv Q_1(\phi \mid \Theta^{(r)}) + Q_2(\gamma, \pi \mid \Theta^{(r)}). \tag{5}$$

The maximization of $Q_1(\phi \mid \Theta^{(r)})$ with respect to $\phi$ can be divided into $L$ maximization problems, so that $\phi_1, \ldots, \phi_L$ can be updated separately. On the other hand, maximizing $Q_2(\gamma, \pi \mid \Theta^{(r)})$ includes discrete optimization of $\gamma$ over the space $\{1, \ldots, G\}^m$, which would be slightly complicated. Hence, instead of simultaneously maximizing with respect to $\gamma$ and $\pi$, we first maximize $Q_2(\gamma^{(r)}, \pi \mid \Theta^{(r)})$ with respect to $\pi$ and set $\pi^{(r+1)}$ to the maximizer, and then maximize $Q_2(\gamma, \pi^{(r+1)} \mid \Theta^{(r)})$ with respect to $\gamma$ to get $\gamma^{(r+1)}$. This updating process guarantees a monotone increase of the objective function, that is, $Q_2(\gamma^{(r)}, \pi^{(r)} \mid \Theta^{(r)}) \le Q_2(\gamma^{(r+1)}, \pi^{(r+1)} \mid \Theta^{(r)})$. The proposed GEM algorithm is summarized in what follows.

GEM algorithm

(1) Set the initial values $\Theta^{(0)}$ and $r = 0$.

(2) (E-step) Compute the following weights:

$$w_{ijk}^{(r)} = \frac{\pi^{(r)}_{g_i k}\, h_k(y_{ij} \mid x_{ij}; \phi_k^{(r)})}{\sum_{l=1}^{L} \pi^{(r)}_{g_i l}\, h_l(y_{ij} \mid x_{ij}; \phi_l^{(r)})}.$$

(3) (M-step-1) Solve the following maximization problem and set $\phi_k^{(r+1)}$ to the maximizer:

$$\phi_k^{(r+1)} = \mathop{\mathrm{argmax}}_{\phi_k} \Big\{ \sum_{i=1}^{m} \sum_{j=1}^{n_i} w_{ijk}^{(r)} \log h_k(y_{ij} \mid x_{ij}; \phi_k) \Big\}, \quad k = 1, \ldots, L.$$

(4) (M-step-2) Update $\pi_{gk}$ and $g_i$ as follows:

$$\pi_{gk}^{(r+1)} = \Big( \sum_{i: g_i^{(r)} = g} n_i \Big)^{-1} \sum_{i: g_i^{(r)} = g} \sum_{j=1}^{n_i} w_{ijk}^{(r)}, \quad g = 1, \ldots, G, \ k = 1, \ldots, L,$$

$$g_i^{(r+1)} = \mathop{\mathrm{argmax}}_{g = 1, \ldots, G} \Big\{ \sum_{k=1}^{L} \log \pi_{gk}^{(r+1)} \sum_{j=1}^{n_i} w_{ijk}^{(r)} \Big\}, \quad i = 1, \ldots, m.$$

(5) If the algorithm has converged, it is terminated. Otherwise, set $r = r + 1$ and go back to Step (2).

Note that updating $\phi_k$ requires maximizing the weighted log-likelihood of the $k$th latent distribution, which can be carried out easily and efficiently as long as the latent density $h_k$ is of a familiar form, e.g. a generalized linear model. Moreover, for updating $g_i$, we simply compute the values of the objective function for $g = 1, \ldots, G$ and select the maximizer. Therefore, all the M-steps in the GEM algorithm are easy to execute, and the GEM algorithm as a whole is computationally easy to carry out. Finally, for selecting the values of $G$ (number of groups) and $L$ (number of latent components), we adopt the following BIC-type criterion:

$$\mathrm{BIC}(G, L) = -2 Q(\hat{\Theta}) + \log N \,\{ qL + G(L - 1) + m \},$$

where $N = \sum_{i=1}^{m} n_i$ is the total sample size and $q = \dim(\phi_k)$. The suitable values of $G$ and $L$ are set to the minimizer of $\mathrm{BIC}(G, L)$.

3 Asymptotic Properties

We investigate the large-sample properties of the maximum likelihood estimator $\hat{\Theta}$ when the cluster sizes grow with the number of clusters. Without loss of generality, we suppose that the cluster labels are assigned such that the cluster size increases with the cluster index, that is, $n_1 \le n_2 \le \cdots \le n_m$. Moreover, we assume that $m \to \infty$ as well as $n_1 \to \infty$, as considered in the literature (e.g. Vonesh et al., 2002; Hui et al., 2017a) on the estimation of random effects models. However, we consider

asymptotics under the setting $m \to \infty$ and $m/n_1^{\nu} \to 0$ for some $\nu > 0$, that is, the cluster sizes are allowed to grow much more slowly than $m$, which would be met in various applications. In what follows, we further assume that the number of groups $G$ and the number of latent conditional distributions $L$ are known. Under this setting, we define $\gamma^0$ and $\psi^0$ to be the true values of $\gamma$ and $\psi = (\pi, \phi)$. We now present asymptotic properties of the maximum likelihood estimator, whose proof is given in the Appendix.

Theorem 1. Under the Assumptions given in the Appendix and $m \to \infty$ and $m/n_1^{\nu} \to 0$ for some $\nu > 0$, it holds that

$$\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{L} \big( \hat{\pi}_{\hat{g}_i k} - \pi^0_{g_i^0 k} \big)^2 = O_p\Big( \frac{1}{m} \Big), \tag{6}$$

$$\sqrt{m}\, F_m(\hat{\Theta})^{1/2} (\hat{\psi} - \psi^0) \to N(0, I_{\dim(\psi)}), \tag{7}$$

where $F_m(\Theta)$ is the Fisher information matrix defined in the Appendix.

Although the number of cluster-wise mixing proportions $\pi_i = \pi_{g_i}$ grows with the number of clusters $m$, (6) shows that they are consistently estimated and that the convergence rate depends on $m$. Moreover, (7) shows that the estimators of the structural parameters $\phi$ in the latent distributions and of the group-wise mixing proportions $\pi$, whose dimensions are fixed regardless of $m$ and $n_1$, are $\sqrt{m}$-consistent.

4 Numerical Studies

4.1 Simulation study: continuous response

We investigate the finite sample performance of the proposed method together with some existing methods. We consider $m = 40$ clusters and assume that each cluster $i$ ($i = 1, \ldots, m$) has the following cluster-wise conditional distribution:

$$f_i(y \mid x_1, x_2) = \pi_i\, \phi(y; \alpha_{i0} + \alpha_{i1} x_1 + \alpha_{i2} x_2, 1) + (1 - \pi_i)\, \phi(y; \beta_{i0} + \beta_{i1} x_1 + \beta_{i2} x_2, (1.5)^2),$$

where $\pi_i$ is a mixing proportion and $\phi(\cdot; a, b)$ denotes a normal density function with mean $a$ and variance $b$. We generated $x_1$ from the uniform distribution on $(-0.2, 0.8)$ and $x_2$ from the Bernoulli distribution with success probability 0.5. We consider the following two cases of the mixing proportion $\pi_i$:

(A1) $\pi_i \sim \mathrm{Beta}(2, 1)$,
(A2) $\pi_i \sim \mathrm{TP}(0.2, 0.5, 0.8)$,

where $\mathrm{TP}(a_1, a_2, a_3)$ denotes a uniform three-point distribution on $a_1$, $a_2$ and $a_3$. The mixing proportion $\pi_i$ varies across all the clusters in case (A1), while in case (A2) the $\pi_i$'s are divided into three groups and clusters within the same group have the same mixing proportion. Moreover, we consider the following two cases of the structural parameters $\Theta_i = (\alpha_{i0}, \alpha_{i1}, \alpha_{i2}, \beta_{i0}, \beta_{i1}, \beta_{i2})$:

(B1) $\alpha_{i0} = \alpha_{i2} = -1$, $\alpha_{i1} = 2$, $\beta_{i0} = \beta_{i2} = 1$, $\beta_{i1} = -2$,
(B2) $\alpha_{i0}, \alpha_{i2} \sim N(-1, (0.2)^2)$, $\beta_{i0}, \beta_{i2} \sim N(1, (0.2)^2)$, $\alpha_{i1} \sim N(2, (0.2)^2)$, $\beta_{i1} \sim N(-2, (0.2)^2)$.

Note that the structural parameters $\Theta_i$ in the mixing distributions are the same over all the clusters in case (B1), but the mixing distributions differ across clusters in case (B2). We consider the following 4 scenarios produced by all the combinations of the cases of $\pi_i$ and $\Theta_i$:

scenario          1     2     3     4
case ($\pi_i$)     (A1)  (A2)  (A1)  (A2)
case ($\Theta_i$)  (B1)  (B1)  (B2)  (B2)

From the conditional distribution, we generated $n$ samples in each cluster, and we considered $n = 80$ and $n = 160$. To the simulated dataset, we applied the proposed grouped heterogeneous mixture (GHM) model as well as the global mixture (GM), local mixture (LM) and random effects (RE) models. In the GHM model, normal regression models are used for the latent conditional distributions and the suitable numbers of groups

and latent components are selected by BIC among combinations of $G \in \{2, \ldots, 8\}$ and $L = 2, 3$. In the GM and LM models, we consider mixtures of normal regressions and the number of components is selected by BIC from $L \in \{1, \ldots, 4\}$. For the RE model, only a random intercept term is included. For fitting GM, LM and RE, we use the maximum likelihood estimators of the model parameters. Based on these methods, we construct an estimator $\hat{f}_i(y \mid x_1, x_2)$ of the cluster-wise conditional densities $f_i(y \mid x_1, x_2)$, and compute the following mean integrated squared error (MISE):

$$\mathrm{MISE}(x_2) = \frac{1}{m} \sum_{i=1}^{m} \int\!\!\int \big\{ \hat{f}_i(y \mid x_1, x_2) - f_i(y \mid x_1, x_2) \big\}^2 \, dy \, dx_1,$$

which is approximated by a finite sum over 51 equally spaced grid points from $-5$ to $5$ for $y$ and from $-0.2$ to $0.8$ for $x_1$. We repeated simulating a dataset and computing the MISE 500 times. Since the results for MISE(1) and MISE(0) are quite similar, we only report the boxplots of MISE(1) in Figure 1, which shows that GHM clearly outperforms the other methods in all the scenarios. It should be noted that LM does not work well in almost all the scenarios in spite of its flexibility. This is because fitting a mixture model based on moderate cluster sizes like $n = 80$ or $n = 160$ would be unstable.

4.2 Simulation study: binary response

We next investigate the performance when the response is binary. To this end, we consider the following cluster-wise conditional distributions:

$$f_i(y \mid x_1, x_2) = \pi_i\, \mathrm{Ber}(y; \mu_{1i}(x_1, x_2)) + (1 - \pi_i)\, \mathrm{Ber}(y; \mu_{2i}(x_1, x_2)),$$

where $\mathrm{Ber}(\cdot; p)$ denotes a Bernoulli probability function with success probability $p$, $\mathrm{logit}\{\mu_{1i}(x_1, x_2)\} = \alpha_{i0} + \alpha_{i1} x_1 + \alpha_{i2} x_2$ and $\mathrm{logit}\{\mu_{2i}(x_1, x_2)\} = \beta_{i0} + \beta_{i1} x_1 + \beta_{i2} x_2$. The two covariates $x_1$ and $x_2$ are generated in the same way as in the previous section, and we considered the same four scenarios regarding the settings of the mixing proportions and distributions. Based on the above cluster-wise conditional distribution,

[Figure 1 panels: scenarios 1 to 4 with n = 80 (upper row) and scenarios 1 to 4 with n = 160 (lower row).]

Figure 1: Boxplots of the logarithm of MISE(1) (log-MISE) averaged over $x_1$ in four scenarios with $n = 80$ (upper) and $n = 160$ (lower) when the response is Gaussian.
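
The data-generating design of Section 4.1 can be sketched as follows for scenario 2, i.e. cases (A2) and (B1). This is a minimal sketch under stated assumptions, not the code used in the paper: the function name `simulate_scenario2` is hypothetical, and the negative signs of the coefficients and of the lower bound of $x_1$ are assumptions, since minus signs are not reliably preserved in this transcription.

```python
import numpy as np

def simulate_scenario2(m=40, n=80, seed=0):
    """Draw clustered data from the two-component normal mixture, cases (A2) and (B1)."""
    rng = np.random.default_rng(seed)
    # (A2): mixing proportion drawn uniformly from three points, one value per cluster.
    pi = rng.choice([0.2, 0.5, 0.8], size=m)
    # (B1): structural parameters common to all clusters (signs are assumed).
    alpha = np.array([-1.0, 2.0, -1.0])   # (alpha_0, alpha_1, alpha_2)
    beta = np.array([1.0, -2.0, 1.0])     # (beta_0, beta_1, beta_2)
    x1 = rng.uniform(-0.2, 0.8, size=(m, n))   # assumed lower bound -0.2
    x2 = rng.binomial(1, 0.5, size=(m, n))
    # Latent indicator: True selects the first component with probability pi_i.
    z = rng.random((m, n)) < pi[:, None]
    mean1 = alpha[0] + alpha[1] * x1 + alpha[2] * x2
    mean2 = beta[0] + beta[1] * x1 + beta[2] * x2
    # First component has variance 1, second has variance 1.5^2.
    y = np.where(z, rng.normal(mean1, 1.0), rng.normal(mean2, 1.5))
    return y, x1, x2, pi

y, x1, x2, pi = simulate_scenario2()
```

Each row of the returned arrays corresponds to one cluster, so cluster-wise quantities can be obtained by reducing over the second axis.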

we generated $n = 80$ and $n = 160$ samples in each cluster. To the simulated dataset, we applied the same four methods as considered in Section 4.1. We used a binomial regression for the latent or mixing distributions in GHM, GM and LM, and a logistic mixed regression model as RE. Similarly to Section 4.1, we computed the following mean integrated squared error (MISE):

$$\mathrm{MISE}(x_2) = \frac{1}{m} \sum_{i=1}^{m} \sum_{y=0}^{1} \int \big\{ \hat{f}_i(y \mid x_1, x_2) - f_i(y \mid x_1, x_2) \big\}^2 \, dx_1,$$

where the integral is approximated by a finite sum over 51 equally spaced grid points from $-0.2$ to $0.8$. In Figure 2, we show boxplots of MISE(1) based on 500 replications in the four scenarios. We observe similar results: GHM outperforms the other methods in all the scenarios.

4.3 Butterfly data

We consider an application of the proposed method to the Butterfly data (Oliver et al., 2006). The dataset consists of the abundance of butterflies at $S = 66$ sites in Boulder County Open Space in the years 1999 and ... For each site, geographical information (longitude and latitude), habitat characteristics (grassland type and quality) and landscape context (percentage of surrounding urbanization) are available, and we use this information as covariates. Covariates taking continuous values were scaled to have mean 0 and variance 1. After omitting species present in fewer than three sites, there were $m = 33$ species. We let $y_{is}$ be the binary outcome representing presence ($y_{is} = 1$) or absence ($y_{is} = 0$) of species $i$ at site $s$, with $i = 1, \ldots, m$ and $s = 1, \ldots, S$. To the dataset, we applied the following GHM model:

$$y_{is} \sim \sum_{k=1}^{L} \pi_{g_i k}\, \mathrm{Ber}\big( \mathrm{logit}^{-1}(\alpha_k + x_s^t \beta_k) \big),$$

where $g_i \in \{1, \ldots, G\}$, $\mathrm{Ber}(p)$ denotes a Bernoulli distribution with success probability $p$ and $x_s$ is a vector of covariates at site $s$. The number of groups $G$ and latent

[Figure 2 panels: scenarios 1 to 4, each with n = 80 and n = 160.]

Figure 2: Boxplots of the logarithm of MISE(1) (log-MISE) averaged over $x_1$ in four scenarios with $n = 80$ (upper) and $n = 160$ (lower) when the response is binary.
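
The finite-sum approximation of the MISE for the binary response in Section 4.2 can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the integration limits are passed as arguments rather than fixed, and the parameter values in the usage lines are made up for the example.

```python
import numpy as np

def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-t))

def mixture_prob(y, x1, x2, pi, alpha, beta):
    """pi * Ber(y; mu_1) + (1 - pi) * Ber(y; mu_2) with logit-linear mu_1, mu_2."""
    mu1 = expit(alpha[0] + alpha[1] * x1 + alpha[2] * x2)
    mu2 = expit(beta[0] + beta[1] * x1 + beta[2] * x2)
    p1 = pi * mu1 + (1.0 - pi) * mu2          # P(y = 1 | x1, x2)
    return p1 if y == 1 else 1.0 - p1

def ise_binary(true_par, est_par, x2, lo, hi, n_grid=51):
    """Integrated squared error for one cluster: sum over y in {0, 1}, grid sum over x1."""
    grid = np.linspace(lo, hi, n_grid)
    step = grid[1] - grid[0]
    total = 0.0
    for y in (0, 1):
        diff = mixture_prob(y, grid, x2, *est_par) - mixture_prob(y, grid, x2, *true_par)
        total += float(np.sum(diff**2) * step)
    return total

def mise_binary(true_pars, est_pars, x2, lo, hi, n_grid=51):
    """Average of the cluster-wise integrated squared errors."""
    return float(np.mean([ise_binary(t, e, x2, lo, hi, n_grid)
                          for t, e in zip(true_pars, est_pars)]))

# Hypothetical true and estimated parameters (pi, alpha, beta) for two clusters.
a, b = np.array([-1.0, 2.0, -1.0]), np.array([1.0, -2.0, 1.0])
true_pars = [(0.2, a, b), (0.8, a, b)]
est_pars = [(0.3, a, b), (0.7, a, b)]
value = mise_binary(true_pars, est_pars, x2=1, lo=-0.2, hi=0.8)
```

The continuous-response version in Section 4.1 has the same structure, with the sum over $y \in \{0, 1\}$ replaced by a second grid sum over $y$.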

Table 1: Estimates of regression coefficients in the $L = 3$ latent logistic regression models. Columns: Latent 1, Latent 2, Latent 3. Rows: Intercept, Latitude, Longitude, Percentage of building, Percentage of urban vegetation, Habitat (mixed), Habitat (short), Habitat (tall).

Table 2: Estimates of mixing proportions and number of species included in the $G = 5$ groups. Rows: Group, #(species), Latent 1, Latent 2, Latent 3.

conditional distributions $L$ were selected by the BIC-type criterion over $G = 2, \ldots, 8$ and $L = 2, 3, 4$, and $G = 5$ and $L = 3$ were selected. This means that the 33 species can be divided into 5 groups and that each group-wise conditional distribution is expressed as a mixture of three logistic regression models. In Table 1, the estimates of the regression coefficients in the 3 latent logistic regression models are reported, which shows that the estimates are quite different across the models. Moreover, in Table 2, the estimates of the mixing proportions in the 5 groups are shown. Although the estimates of the mixing proportions are somewhat similar in group 2 and group 5, the mixing proportions are overall quite different across groups, so that the differences among groups are represented by the mixing proportions. For comparison of predictive performance, 5 or 10 sites are randomly omitted to act as test data. Then, we applied the four methods, GHM, GM, LM and RE, to the remaining data (training data) to construct a predictive model for the test data. Similarly to Section 4.2, we considered a mixture of logistic regression models for

GM and LM, and a logistic random intercept model as RE. The numbers of mixture components in LM and GM were selected by BIC. To assess the predictive performance, we computed the likelihood of the test data (which we call the test likelihood) based on the estimated model. This procedure was repeated 100 times, so that the test likelihood is computed for each replication and method. We found that the test likelihoods of LM were very small compared with the other methods, possibly because the cluster size ($S = 66$) is not so large and there are 7 covariates. Moreover, proportions of presence are not small in some species, which may lead to unstable results when fitting a mixture model to each species. For the other methods, we show the test likelihood in each replication in Figure 3. We observe that the test likelihood of GHM is substantially larger than those of the other methods in almost all the replications, which shows the better predictive performance of GHM compared with the other methods.

[Figure 3 panels: #(omitted sites) = 5 and #(omitted sites) = 10; horizontal axis: Replication; vertical axis: Test likelihood; methods shown: GHM, GM, RE.]

Figure 3: Plots of likelihoods of test data for 100 replications in the two cases of the number of omitted sites, 5 (left) and 10 (right).
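
The GEM algorithm and the BIC-type criterion of Section 2.2, on which the numerical studies above rely, can be sketched as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function name `fit_ghm_normal` is hypothetical, the latent distributions are assumed to be normal linear regressions (so that M-step-1 is a weighted least squares fit), initial values are random, a fixed number of iterations replaces the convergence check in Step (5), and $q = \dim(\phi_k)$ is taken to be the number of regression coefficients plus one for the variance.

```python
import numpy as np

def fit_ghm_normal(y, X, cluster, G, L, n_iter=100, seed=0):
    """GEM sketch for the GHM model with normal linear regression components.

    y: (N,) responses; X: (N, p) design matrix; cluster: (N,) integer labels 0..m-1.
    """
    rng = np.random.default_rng(seed)
    N, p = X.shape
    m = int(cluster.max()) + 1
    n_i = np.bincount(cluster, minlength=m)      # cluster sizes
    g = rng.integers(G, size=m)                  # initial group memberships
    pi = np.full((G, L), 1.0 / L)                # initial mixing proportions
    beta = rng.normal(size=(L, p))               # initial regression coefficients
    sigma = np.full(L, y.std() + 1e-8)           # initial standard deviations

    def comp_dens():
        # (N, L) matrix of h_k(y_ij | x_ij; phi_k) at the current beta and sigma
        mu = X @ beta.T
        return np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    for _ in range(n_iter):
        # E-step: w_ijk proportional to pi_{g_i k} h_k(y_ij | x_ij; phi_k)
        num = pi[g[cluster]] * comp_dens() + 1e-300
        w = num / num.sum(axis=1, keepdims=True)
        # M-step-1: weighted least squares and weighted variance per component
        for k in range(L):
            sw = np.sqrt(w[:, k])
            beta[k] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
            res = y - X @ beta[k]
            sigma[k] = np.sqrt((w[:, k] * res**2).sum() / w[:, k].sum()) + 1e-8
        # Cluster-level sums of weights: W[i, k] = sum_j w_ijk
        W = np.zeros((m, L))
        np.add.at(W, cluster, w)
        # M-step-2: update pi given the current groups, then groups given the new pi
        for gg in range(G):
            members = g == gg
            if members.any():                    # an empty group keeps its previous pi
                pi[gg] = W[members].sum(axis=0) / n_i[members].sum()
        pi = np.clip(pi, 1e-10, None)            # numerical guard before taking logs
        pi /= pi.sum(axis=1, keepdims=True)
        g = np.argmax(W @ np.log(pi).T, axis=1)

    loglik = float(np.log((pi[g[cluster]] * comp_dens()).sum(axis=1) + 1e-300).sum())
    q = p + 1                                    # assumed dim(phi_k): coefficients + variance
    bic = -2.0 * loglik + np.log(N) * (q * L + G * (L - 1) + m)
    return {"g": g, "pi": pi, "beta": beta, "sigma": sigma, "loglik": loglik, "bic": bic}
```

In use, the function would be called for each candidate pair $(G, L)$ and the pair with the smallest returned `bic` retained. Because the initial values are random, running several seeds and keeping the fit with the largest log-likelihood is a common precaution against local maxima.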

5 Final remarks

Estimation of cluster-wise conditional distributions is a powerful tool for revealing the distributional relationship between response and covariates while accounting for heterogeneity among clusters. For this purpose, we proposed a novel mixture modeling approach called grouped heterogeneous mixture (GHM) modeling for estimating cluster-wise conditional distributions. The estimation of the GHM model is easily carried out by using the generalized EM algorithm. Under the setting where the cluster size grows with, but considerably more slowly than, the number of clusters, we established asymptotic properties of the maximum likelihood estimator. In this work, we were not concerned with problems caused by the dimension of the structural parameters $\phi$ in the latent distributions. Such problems are typically related to the number of covariates, and the proposed method might break down when the number of covariates is very large. In this case, variable selection or regularization is needed. In the framework of mixture modeling, there are several works concerned with variable selection via regularization (penalization) methods (e.g. Hui et al., 2015; Khalili and Chen, 2007). The adaptation of these methods to our framework would be a valuable future study.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 16H

Appendix: Proof of Theorem 1

We require the following regularity conditions:

A1. The density (probability) function $g(Y \mid X; \gamma, \psi) \equiv \prod_{i=1}^{m} \prod_{j=1}^{n_i} f_i(y_{ij} \mid x_{ij}; g_i, \psi)$ is identifiable in $\Theta$ up to permutation of the component and grouping labels.

A2. The Fisher information matrix

$$F_m(\Theta) = E\bigg[ \Big( \frac{\partial}{\partial \psi} \log g(Y \mid X; \gamma, \psi) \Big) \Big( \frac{\partial}{\partial \psi} \log g(Y \mid X; \gamma, \psi) \Big)^{t} \bigg]$$

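is finite and positive definite at $\psi = \psi^0$ and $\gamma = \gamma^0$.

A3. There exists an open subset $\omega$ of $\Omega$ containing the true parameter $\psi^0$ on which there exist functions $M_k(x, y)$, $k = 1, 2, 3$, with

$$\bigg| \frac{f_i(y \mid x; \gamma, \psi)}{f_i(y \mid x; \gamma', \psi)} \bigg| < M_1(x, y), \quad \bigg| \frac{\partial}{\partial \psi_s} \log f_i(y \mid x; \gamma, \psi) \bigg| < M_{2s}(x, y), \quad \bigg| \frac{\partial^2}{\partial \psi_r \partial \psi_s} \log f_i(y \mid x; \gamma, \psi) \bigg| < M_{3rs}(x, y),$$

for arbitrary $\gamma$ and $\gamma'$. Moreover, $E[M_1(x, y)^2] < \infty$, $E[M_{2s}(x, y)^2] < \infty$ and $E[M_{3rs}(x, y)^2] < \infty$.

A4. $\psi^0$ is an interior point of the compact set $\Omega \subset \mathbb{R}^{\dim(\psi)}$.

We define a neighborhood set of $\psi^0$ as $N_\eta = \{\psi^0 + \eta u;\ \|u\| = 1\}$, where $\eta$ is a scalar, $u \in \mathbb{R}^{\dim(\psi)}$, and $\|\cdot\|$ denotes the $L_2$-norm. We first show the following lemma.

Lemma 1. For any $\delta > 0$ and $\psi \in N_\eta$, it holds that

$$\frac{1}{m} \sum_{i=1}^{m} I\big( \hat{g}_i(\psi) \neq g_i^0 \big) = o_p(n_1^{-\delta}).$$

Proof. From Markov's inequality, it follows that

$$P\bigg( \frac{1}{m} \sum_{i=1}^{m} I\big( \hat{g}_i(\psi) \neq g_i^0 \big) > \varepsilon n_1^{-\delta} \bigg) \le \frac{n_1^{\delta}}{m \varepsilon} \sum_{i=1}^{m} P\big( \hat{g}_i(\psi) \neq g_i^0 \big) \tag{8}$$

for arbitrary $\varepsilon > 0$ and $\psi \in N_\eta$. Note that

$$P\big( \hat{g}_i(\psi) \neq g_i^0 \big) \le \sum_{g \neq g_i^0} P\bigg( \sum_{j=1}^{n_i} \log f_i(y_{ij}; g, \psi) \ge \sum_{j=1}^{n_i} \log f_i(y_{ij}; g_i^0, \psi) \bigg) = \sum_{g \neq g_i^0} P\bigg( \sum_{j=1}^{n_i} \log \frac{f_i(y_{ij}; g, \psi)}{f_i(y_{ij}; g_i^0, \psi)} \ge 0 \bigg).$$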

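Moreover, from Markov's inequality, it holds that

$$P\bigg( \sum_{j=1}^{n_i} \log \frac{f_i(y_{ij}; g, \psi)}{f_i(y_{ij}; g_i^0, \psi)} \ge 0 \bigg) \le \bigg\{ E\bigg[ \exp\bigg( \frac{1}{2} \log \frac{f_i(y_{ij}; g, \psi)}{f_i(y_{ij}; g_i^0, \psi)} \bigg) \bigg] \bigg\}^{n_i} = \exp\bigg\{ n_i \log E\bigg[ \sqrt{\frac{f_i(y_{ij}; g, \psi)}{f_i(y_{ij}; g_i^0, \psi)}} \bigg] \bigg\}.$$

Note that

$$\frac{1}{\sqrt{f_i(y_{ij}; g_i^0, \psi)}} = \frac{1}{\sqrt{f_i(y_{ij}; g_i^0, \psi^0)}} - \frac{1}{2} f_i(y_{ij}; g_i^0, \psi^*)^{-3/2} (\psi - \psi^0)^t \frac{\partial}{\partial \psi} f_i(y_{ij}; g_i^0, \psi) \Big|_{\psi = \psi^*}$$

$$= \frac{1}{\sqrt{f_i(y_{ij}; g_i^0, \psi^0)}} - \frac{1}{2} f_i(y_{ij}; g_i^0, \psi^*)^{-1/2} (\psi - \psi^0)^t \frac{\partial}{\partial \psi} \log f_i(y_{ij}; g_i^0, \psi) \Big|_{\psi = \psi^*},$$

where $\psi^*$ lies on the line segment joining $\psi$ and $\psi^0$. Therefore, under Assumption A3, we have

$$E\bigg[ \sqrt{\frac{f_i(y_{ij}; g, \psi)}{f_i(y_{ij}; g_i^0, \psi)}} \bigg] = \int \sqrt{\frac{f_i(y_{ij}; g, \psi)}{f_i(y_{ij}; g_i^0, \psi)}}\, f_i(y_{ij}; g_i^0, \psi^0) \, dy_{ij}$$

$$= \int \sqrt{f_i(y_{ij}; g, \psi)\, f_i(y_{ij}; g_i^0, \psi^0)} \, dy_{ij} - \frac{1}{2} (\psi - \psi^0)^t E\bigg[ \Big( \frac{\partial}{\partial \psi} \log f_i(y_{ij}; g_i^0, \psi) \Big|_{\psi = \psi^*} \Big) \sqrt{\frac{f_i(y_{ij}; g, \psi)}{f_i(y_{ij}; g_i^0, \psi^*)}} \bigg]$$

$$\le 1 - H\big( f_i(\cdot; g, \psi), f_i(\cdot; g_i^0, \psi^0) \big) + C\eta,$$

where $H(f_0, f_1)$ denotes the Hellinger distance between $f_0$ and $f_1$ and $C$ is a constant. Under Assumption A1, $\inf_{\psi \in N_\eta} H\big( f_i(\cdot; g, \psi), f_i(\cdot; g_i^0, \psi^0) \big) > 0$ when $g \neq g_i^0$, so that for sufficiently small $\eta$, there exists a constant $c > 0$ such that

$$\sup_{\psi \in N_\eta} \log\Big\{ 1 - H\big( f_i(\cdot; g, \psi), f_i(\cdot; g_i^0, \psi^0) \big) + C\eta \Big\} < -c.$$

Hence, we have

$$P\bigg( \sum_{j=1}^{n_i} \log \frac{f_i(y_{ij}; g, \psi)}{f_i(y_{ij}; g_i^0, \psi)} \ge 0 \bigg) \le \exp(-c n_i) = o(n_1^{-\delta}),$$

so that the right side of (8) is $o(1)$, which completes the proof.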

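We next show the consistency of $\hat{\Theta}$, as given in the following lemma.

Lemma 2. As $m \to \infty$ and $n_1 \to \infty$, it holds that $\hat{\Theta} \to \Theta^0$.

Proof. Define $u \in \mathbb{R}^{\dim(\psi)}$ such that $\|u\| = 1$ and $\gamma_s = (g_{1s}, \ldots, g_{ms}) \in \{1, \ldots, G\}^m$ such that $d(\gamma_s, \gamma^0) \equiv \sum_{i=1}^{m} I(g_{is} \neq g_i^0) = s$. Further, we define

$$R(\psi, \gamma) \equiv \frac{1}{N} \sum_{i=1}^{m} R_i(\psi, g_i) = \frac{1}{N} Q(\Theta).$$

Note that

$$R(\psi^0 + \eta u, \gamma_s) - R(\psi^0, \gamma^0) = \big\{ R(\psi^0 + \eta u, \gamma^0) - R(\psi^0, \gamma^0) \big\} + \big\{ R(\psi^0 + \eta u, \gamma_s) - R(\psi^0 + \eta u, \gamma^0) \big\} \equiv I_1 + I_2.$$

Under Assumptions A2 and A3, using an argument similar to that given in the proof of Theorem 1 in Hui et al. (2015), it holds that $I_1 = o(\eta)$. Regarding $I_2$, it is noted that

$$|I_2| \le \frac{1}{N} \sum_{i=1}^{m} \big| R_i(\psi^0 + \eta u, g_{is}) - R_i(\psi^0 + \eta u, g_i^0) \big| = \frac{1}{N} \sum_{i=1}^{m} I(g_{is} \neq g_i^0) \big| R_i(\psi^0 + \eta u, g_{is}) - R_i(\psi^0 + \eta u, g_i^0) \big| \le sC,$$

for some $C > 0$. Therefore, for any $\varepsilon$, there exists a local maximum inside the set $\{\psi^0 + \eta u;\ \|u\| = 1\} \times \{\gamma;\ d(\gamma, \gamma^0) < s\}$, so that the consistency follows.

We define the following objective functions:

$$\hat{Q}(\psi) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \log\Big( \sum_{k=1}^{L} \pi_{\hat{g}_i(\psi) k}\, h_k(y_{ij} \mid x_{ij}; \phi_k) \Big), \qquad \tilde{Q}(\psi) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \log\Big( \sum_{k=1}^{L} \pi_{g_i^0 k}\, h_k(y_{ij} \mid x_{ij}; \phi_k) \Big),$$

and we define $\hat{\psi}$ and $\tilde{\psi}$ as the maximizers of $\hat{Q}(\psi)$ and $\tilde{Q}(\psi)$, respectively. Note that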

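$\sqrt{m}\, F_m(\tilde{\psi})^{1/2} (\tilde{\psi} - \psi^0) \to N(0, I_{\dim(\psi)})$. Note that, from Lemma 1, it holds that

$$\sup_{\psi \in N_\eta} \big| \hat{Q}(\psi) - \tilde{Q}(\psi) \big| = o_p(n_1^{-\delta}). \tag{9}$$

Then, from the definitions of $\hat{\psi}$ and $\tilde{\psi}$, we have

$$0 \le \tilde{Q}(\tilde{\psi}) - \tilde{Q}(\hat{\psi}) = \big\{ \tilde{Q}(\tilde{\psi}) - \hat{Q}(\tilde{\psi}) \big\} + \big\{ \hat{Q}(\tilde{\psi}) - \tilde{Q}(\hat{\psi}) \big\} \le \big\{ \tilde{Q}(\tilde{\psi}) - \hat{Q}(\tilde{\psi}) \big\} + \big\{ \hat{Q}(\hat{\psi}) - \tilde{Q}(\hat{\psi}) \big\} = o_p(n_1^{-\delta}),$$

where the last equality follows from (9) and the fact that $\hat{\psi}, \tilde{\psi} \in N_\eta$ for large $n_i$, since $\hat{\psi}$ and $\tilde{\psi}$ are consistent. It follows that $\hat{\psi} - \tilde{\psi} = o_p(n_1^{-\delta})$. Hence,

$$\sqrt{m}(\hat{\psi} - \psi^0) = \sqrt{m}(\tilde{\psi} - \psi^0) + \sqrt{m}(\hat{\psi} - \tilde{\psi}) = \sqrt{m}(\tilde{\psi} - \psi^0) + o_p(\sqrt{m}\, n_1^{-\delta}) = \sqrt{m}(\tilde{\psi} - \psi^0) + o_p(1)$$

under $m/n_1^{\nu} \to 0$ for some $\nu > 0$, which establishes (7). Moreover, from Lemma 1 and the asymptotic property of $\tilde{\psi}$, it holds that

$$\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{L} \big( \hat{\pi}_{\hat{g}_i k} - \hat{\pi}_{g_i^0 k} \big)^2 = o_p(n_1^{-\delta}), \quad \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{L} \big( \tilde{\pi}_{g_i^0 k} - \pi^0_{g_i^0 k} \big)^2 = O_p\Big( \frac{1}{m} \Big), \quad \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{L} \big( \hat{\pi}_{g_i^0 k} - \tilde{\pi}_{g_i^0 k} \big)^2 = o_p(n_1^{-\delta}),$$

so that (6) follows under $m/n_1^{\nu} \to 0$.

References

Ando, T. and Bai, J. (2016). Panel data models with grouped factor structure under unknown group membership. Journal of Applied Econometrics, 31,

Bonhomme, S. and Manresa, E. (2015). Grouped pattern of heterogeneity in panel data. Econometrica, 83,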

Chung, Y. and Dunson, D. B. (2009). Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association, 104.

Demidenko, E. (2013). Mixed Models: Theory and Applications with R, Wiley Series in Probability and Statistics, New York, Wiley.

Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21.

Geweke, J. and Keane, M. (2007). Smoothly mixing regressions. Journal of Econometrics, 138.

Khalili, A. and Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102.

Hahn, J. and Moon, H. (2010). Panel data models with finite number of multiple equilibria. Econometric Theory, 26.

Hui, F. K. C., Muller, S. and Welsh, A. H. (2017). Joint selection in mixed models using regularized PQL. Journal of the American Statistical Association, 112.

Hui, F. K. C., Warton, D. I. and Foster, A. D. (2015). Multi-species distribution modeling using penalized mixture of regressions. The Annals of Applied Statistics, 9.

Hui, F. K. C., Warton, D. I., Ormerod, J. T., Haapaniemi, V. and Taskinen, S. (2017). Variational approximations for generalized linear latent variable models. Journal of Computational and Graphical Statistics, 26.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J. and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3.

Jiang, J. (2007). Linear and Generalized Linear Mixed Models and Their Applications, Springer Series in Statistics, New York, Springer.

Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 214.

McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models, New York, Wiley.

Meng, X. L. and Rubin, D. B. (1993). Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80.

Ng, S. K. and McLachlan, G. J. (2007). Extension of mixture-of-experts networks for binary classification of hierarchical data. Artificial Intelligence in Medicine, 41.

Ng, S. K. and McLachlan, G. J. (2014). Mixture models for clustering multilevel growth trajectories. Computational Statistics & Data Analysis, 71.

Nguyen, H. D. and McLachlan, G. J. (2016). Laplace mixture of linear experts. Computational Statistics & Data Analysis, 93.

Oliver, J. C., Prudic, K. L. and Collinge, S. K. (2006). Boulder County open space butterfly diversity and abundance. Ecology, 87.

Rodriguez, A., Dunson, D. B. and Gelfand, A. E. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103.

Rosen, O., Jiang, W. and Tanner, M. A. (2000). Mixtures of marginal models. Biometrika, 87.

Rubin, D. B. and Wu, Y. (1997). Modeling schizophrenic behavior using general mixture components. Biometrics, 53.

Sugasawa, S., Kobayashi, G. and Kawakubo, Y. (2016). Latent mixture modeling for clustered data. arXiv preprint.

Sun, Z., Rosen, O. and Sampson, A. R. (2007). Multivariate Bernoulli mixture models with application to postmortem tissue studies in schizophrenia. Biometrics, 63.

Tang, X. and Qu, A. (2016). Mixture modeling for longitudinal data. Journal of Computational and Graphical Statistics, 25.

Teh, Y. W., Jordan, M. I., Beal, M. J. and Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101.

Villani, M., Kohn, R. and Giordani, P. (2009). Regression density estimation using smooth adaptive Gaussian mixtures. Journal of Econometrics, 153.

Villani, M., Kohn, R. and Nott, D. J. (2012). Generalized smooth finite mixtures. Journal of Econometrics, 171.

Vonesh, E. F., Wang, H., Nie, L. and Majumdar, D. (2002). Conditional second-order generalized estimating equations for generalized linear and nonlinear mixed effects models. Journal of the American Statistical Association, 97.


More information

Dependent hierarchical processes for multi armed bandits

Dependent hierarchical processes for multi armed bandits Dependent hierarchical processes for multi armed bandits Federico Camerlenghi University of Bologna, BIDSA & Collegio Carlo Alberto First Italian meeting on Probability and Mathematical Statistics, Torino

More information

Notes on pseudo-marginal methods, variational Bayes and ABC

Notes on pseudo-marginal methods, variational Bayes and ABC Notes on pseudo-marginal methods, variational Bayes and ABC Christian Andersson Naesseth October 3, 2016 The Pseudo-Marginal Framework Assume we are interested in sampling from the posterior distribution

More information

Bayesian Analysis of Latent Variable Models using Mplus

Bayesian Analysis of Latent Variable Models using Mplus Bayesian Analysis of Latent Variable Models using Mplus Tihomir Asparouhov and Bengt Muthén Version 2 June 29, 2010 1 1 Introduction In this paper we describe some of the modeling possibilities that are

More information

Probability and Information Theory. Sargur N. Srihari

Probability and Information Theory. Sargur N. Srihari Probability and Information Theory Sargur N. srihari@cedar.buffalo.edu 1 Topics in Probability and Information Theory Overview 1. Why Probability? 2. Random Variables 3. Probability Distributions 4. Marginal

More information

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models Optimum Design for Mixed Effects Non-Linear and generalized Linear Models Cambridge, August 9-12, 2011 Non-maximum likelihood estimation and statistical inference for linear and nonlinear mixed models

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

Variational Approximations for Generalized Linear. Latent Variable Models

Variational Approximations for Generalized Linear. Latent Variable Models 1 2 Variational Approximations for Generalized Linear Latent Variable Models 3 4 Francis K.C. Hui 1, David I. Warton 2,3, John T. Ormerod 4,5, Viivi Haapaniemi 6, and Sara Taskinen 6 5 6 7 8 9 10 11 12

More information

Collapsed Variational Dirichlet Process Mixture Models

Collapsed Variational Dirichlet Process Mixture Models Collapsed Variational Dirichlet Process Mixture Models Kenichi Kurihara Dept. of Computer Science Tokyo Institute of Technology, Japan kurihara@mi.cs.titech.ac.jp Max Welling Dept. of Computer Science

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing

More information

Bayesian Nonparametric Models

Bayesian Nonparametric Models Bayesian Nonparametric Models David M. Blei Columbia University December 15, 2015 Introduction We have been looking at models that posit latent structure in high dimensional data. We use the posterior

More information

Outline. Clustering. Capturing Unobserved Heterogeneity in the Austrian Labor Market Using Finite Mixtures of Markov Chain Models

Outline. Clustering. Capturing Unobserved Heterogeneity in the Austrian Labor Market Using Finite Mixtures of Markov Chain Models Capturing Unobserved Heterogeneity in the Austrian Labor Market Using Finite Mixtures of Markov Chain Models Collaboration with Rudolf Winter-Ebmer, Department of Economics, Johannes Kepler University

More information

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q) Supplementary information S7 Testing for association at imputed SPs puted SPs Score tests A Score Test needs calculations of the observed data score and information matrix only under the null hypothesis,

More information

CS Lecture 19. Exponential Families & Expectation Propagation

CS Lecture 19. Exponential Families & Expectation Propagation CS 6347 Lecture 19 Exponential Families & Expectation Propagation Discrete State Spaces We have been focusing on the case of MRFs over discrete state spaces Probability distributions over discrete spaces

More information

Smoothness of conditional independence models for discrete data

Smoothness of conditional independence models for discrete data Smoothness of conditional independence models for discrete data A. Forcina, Dipartimento di Economia, Finanza e Statistica, University of Perugia, Italy December 6, 2010 Abstract We investigate the family

More information

Bayesian mixture modeling for spectral density estimation

Bayesian mixture modeling for spectral density estimation Bayesian mixture modeling for spectral density estimation Annalisa Cadonna a,, Athanasios Kottas a, Raquel Prado a a Department of Applied Mathematics and Statistics, University of California at Santa

More information

39th Annual ISMS Marketing Science Conference University of Southern California, June 8, 2017

39th Annual ISMS Marketing Science Conference University of Southern California, June 8, 2017 Permuted and IROM Department, McCombs School of Business The University of Texas at Austin 39th Annual ISMS Marketing Science Conference University of Southern California, June 8, 2017 1 / 36 Joint work

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

Three Skewed Matrix Variate Distributions

Three Skewed Matrix Variate Distributions Three Skewed Matrix Variate Distributions ariv:174.531v5 [stat.me] 13 Aug 18 Michael P.B. Gallaugher and Paul D. McNicholas Dept. of Mathematics & Statistics, McMaster University, Hamilton, Ontario, Canada.

More information