Linear mixed models and penalized least squares


Douglas M. Bates
Department of Statistics, University of Wisconsin–Madison

Saikat DebRoy
Department of Biostatistics, Harvard School of Public Health

Abstract

Linear mixed-effects models are an important class of statistical models that are not only used directly in many fields of applications but also used as iterative steps in fitting other types of mixed-effects models, such as generalized linear mixed models. The parameters in these models are typically estimated by maximum likelihood (ML) or restricted maximum likelihood (REML). In general there is no closed-form solution for these estimates and they must be determined by iterative algorithms such as EM iterations or general nonlinear optimization. Many of the intermediate calculations for such iterations have been expressed as generalized least squares problems. We show that an alternative representation as a penalized least squares problem has many advantageous computational properties, including the ability to evaluate explicitly a profiled log-likelihood or log-restricted likelihood, the gradient and Hessian of this profiled objective, and an ECME update to refine this objective.

Key words: REML, gradient, Hessian, EM algorithm, ECME algorithm, maximum likelihood, profile likelihood, multilevel models

Preprint submitted to Journal of Multivariate Analysis, 25 September 2003

1 Introduction

We will establish some results for the penalized least squares representation of a general form of a linear mixed-effects model, then show how these results specialize for particular models. The general form we consider is

  y = Xβ + Zb + ε,   ε ~ N(0, σ²I),   b ~ N(0, σ²Ω⁻¹),   ε ⊥ b   (1)

where y is the n-dimensional response vector, X is an n × p model matrix for the p-dimensional fixed-effects vector β, Z is the n × q model matrix for the q-dimensional random-effects vector b, which has a Gaussian distribution with mean 0 and relative precision matrix Ω (i.e., Ω is the precision of b relative to the precision of ε), and ε is the random noise, assumed to have a spherical Gaussian distribution. The symbol ⊥ indicates independence of random variables. We assume that X has full column rank and that Ω is positive definite. (If X is rank deficient or if Ω is singular then the model can be transformed to an alternative model that fulfills the desired conditions.)

A relative precision factor, Δ, is any q × q matrix that satisfies Ω = Δ'Δ. One possible Δ is the Cholesky factor of Ω but others can be used. Because Ω is positive definite, any Δ will be non-singular. In general Δ (and hence Ω) depend on a k-dimensional parameter vector θ. Typically Z, Ω, and Δ are very large and sparse (mostly zeros) while k, the dimension of θ, is small.

The likelihood for the parameters β, θ, and σ², given y, in model (1) is

  L(β, θ, σ² | y) = |Ω|^{1/2} / (2πσ²)^{(n+q)/2} ∫ exp( −(‖y − Xβ − Zb‖² + b'Ωb) / (2σ²) ) db.   (2)

The restricted (or residual) maximum likelihood (REML) estimates of θ and σ² optimize a related criterion that can be written

  L_R(θ, σ²) = ∫ L(β, θ, σ²) dβ.   (3)

This integral is not the typical way to derive or to justify REML estimates but, as we shall see, it provides a convenient form in which to evaluate L_R.

In the next section we show that the integrals in (2) and (3) can be expressed succinctly using the solution to a penalized least squares problem. In particular, we derive a profiled log-likelihood and a profiled log-restricted-likelihood that depend on θ only. Maximizing the profiled log-likelihood (or profiled log-restricted-likelihood) is generally a much smaller and more stable optimization problem than attempting to optimize the log-likelihood for the full parameter vector. In section 3 we derive new expressions for the gradient and the Hessian of these profiled log-likelihoods. Using these derivatives, the profiled log-likelihood can be rapidly optimized, say by Newton steps, once the neighbourhood of the optimum has been determined. Some heuristics provide reasonable starting estimates for θ but these may not be sufficiently close to the optimum to ensure stable Newton steps. The expectation-maximization (EM) algorithm is a robust algorithm that approaches the neighbourhood of the optimum quickly but tends to converge slowly once it is in the neighbourhood. In section 4 we show how the penalized least squares results provide the update for a related algorithm called expectation-conditional maximization either (ECME). This allows the starting estimates to be updated by a moderate number of ECME iterations before starting the Newton steps.
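As a concrete illustration of model (1), the following R sketch simulates one small data set. All dimensions, parameter values, and object names (n, p, q, Omega, beta, sigma) are hypothetical choices made for illustration only; the objects created here (y, X, Z, Omega) are reused in the later sketches.

```r
## Simulate from model (1): y = X beta + Z b + e, e ~ N(0, sigma^2 I),
## b ~ N(0, sigma^2 Omega^{-1}).  All values below are illustrative.
set.seed(1)
n <- 150; p <- 2; q <- 3
sigma <- 0.5
X <- cbind(1, rnorm(n))                    # n x p fixed-effects model matrix
Z <- matrix(rnorm(n * q), n, q)            # n x q random-effects model matrix
Omega <- diag(q)                           # relative precision matrix of b
beta  <- c(1, -2)
L <- t(chol(solve(Omega)))                 # L L' = Omega^{-1}
b <- drop(sigma * L %*% rnorm(q))          # one draw of the random effects
y <- drop(X %*% beta + Z %*% b + rnorm(n, sd = sigma))
```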

These results are derived for the general model (1). In sections 5 and 6 we consider several common special cases. Our implementation of some of these methods is discussed in section 7.

2 A Penalized Least-Squares Problem

For a fixed value of θ we consider the penalized least squares problem defined by the augmented model matrix Φ(θ) and the augmented response vector ỹ:

  min_{b,β} ‖ ỹ − Φ(θ) [b; β] ‖²   where   Φ(θ) = [ Z, X; Δ(θ), 0 ]   and   ỹ = [ y; 0 ].   (4)

One way to solve problem (4) is to form Φ_e = [Φ, ỹ] and let R_e be the Cholesky decomposition of Φ_e'Φ_e,

  Φ_e'Φ_e = [ Z'Z + Ω, Z'X, Z'y; X'Z, X'X, X'y; y'Z, y'X, y'y ] = R_e'R_e   where   R_e = [ R_ZZ, R_ZX, r_Zy; 0, R_XX, r_Xy; 0, 0, r_yy ].   (5)

The matrices R_ZZ and R_XX are upper triangular of dimension q × q and p × p respectively. The corresponding vectors, r_Zy and r_Xy, are of dimension q and p, and r_yy is a scalar. The conditions that Ω be positive definite and X have full column rank ensure that Φ has full column rank, and hence that R_ZZ and R_XX are nonsingular.

Representation (5) is a particular form of the mixed model equations described in Henderson (1984) [1]. We write the blocks in the opposite order from which they are typically written because of computational advantages associated with this order.
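On a small problem the decomposition (5) can be reproduced with dense linear algebra, which is convenient for checking a sparse implementation. This sketch continues the simulated objects from the previous listing; Delta is one possible value of Δ(θ), here simply the Cholesky factor of Omega, and the block names are ours.

```r
## Form Phi_e = [Phi, ytilde] and take the Cholesky factor R_e of its
## crossproduct as in (5); then pick out the blocks named in the text.
Delta <- chol(Omega)                          # t(Delta) %*% Delta == Omega
Phi   <- rbind(cbind(Z, X), cbind(Delta, matrix(0, q, p)))
ytld  <- c(y, rep(0, q))
Re    <- chol(crossprod(cbind(Phi, ytld)))    # upper triangular R_e
RZZ <- Re[1:q, 1:q];             RZX <- Re[1:q, q + (1:p)]
rZy <- Re[1:q, q + p + 1];       RXX <- Re[q + (1:p), q + (1:p)]
rXy <- Re[q + (1:p), q + p + 1]; ryy <- Re[q + p + 1, q + p + 1]
```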

Using a' = [ −b', −β', 1 ] we can write the numerator of the exponent in the integral in (2) as

  ‖y − Xβ − Zb‖² + b'Ωb = a'Φ_e'Φ_e a = a'R_e'R_e a = ‖R_e a‖²
   = ‖r_Zy − R_ZX β − R_ZZ b‖² + ‖r_Xy − R_XX β‖² + r_yy²   (6)

and a simple change of variable allows us to evaluate

  (2πσ²)^{−q/2} ∫ exp( −‖r_Zy − R_ZX β − R_ZZ b‖² / (2σ²) ) db = 1 / abs|R_ZZ| = |Z'Z + Ω|^{−1/2}.   (7)

Combining (2), (6), and (7) and taking the logarithm produces the log-likelihood ℓ(β, θ, σ²) = log L(β, θ, σ²), which, for convenience, we write in the form of a deviance

  −2ℓ(β, θ, σ²) = log( |Z'Z + Ω| / |Ω| ) + n log(2πσ²) + ( r_yy² + ‖r_Xy − R_XX β‖² ) / σ²   (8)

leading to the following results for the maximum likelihood estimates (mle's):

(1) The conditional mle of the fixed effects, β̂(θ), satisfies

  R_XX β̂(θ) = r_Xy.   (9)

(2) The conditional mle of the variance, σ̂²(θ), is r_yy²/n.

(3) The profiled log-likelihood, ℓ̃(θ), a function of θ only, is given by

  −2ℓ̃(θ) = −2ℓ( β̂(θ), θ, σ̂²(θ) ) = log( |Z'Z + Ω| / |Ω| ) + n [ 1 + log( 2π r_yy² / n ) ].   (10)

(4) The conditional expected value of b, which we write as b̂(β, θ), satisfies

  R_ZZ b̂(β, θ) = r_Zy − R_ZX β.   (11)

Typically we evaluate b̂(β̂(θ), θ), which we write as b̂(θ).

(5) The conditional distribution of b is

  b | y, β, θ, σ² ~ N( b̂(β, θ), σ² (Z'Z + Ω)⁻¹ ).   (12)
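The conditional estimates (9) and (11) and the profiled deviance (10) follow directly from those blocks by triangular solves. A minimal sketch, continuing the previous listing (the object names are ours):

```r
## Evaluate (9)-(11) and the profiled deviance (10) from the blocks of R_e.
betahat   <- backsolve(RXX, rXy)                          # (9): R_XX betahat = r_Xy
sigma2hat <- ryy^2 / n                                    # conditional MLE of sigma^2
bhat      <- drop(backsolve(RZZ, rZy - RZX %*% betahat))  # (11)
logdet.ratio <- 2 * sum(log(abs(diag(RZZ)))) -            # log|Z'Z + Omega| ...
  as.numeric(determinant(Omega)$modulus)                  # ... minus log|Omega|
prof.dev <- logdet.ratio + n * (1 + log(2 * pi * ryy^2 / n))  # -2 profiled log-lik, (10)
```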

2.1 REML results

As in (7), we can use a simple change of variable to obtain

  (2πσ²)^{−p/2} ∫ exp( −‖r_Xy − R_XX β‖² / (2σ²) ) dβ = 1 / abs|R_XX|   (13)

providing the log-restricted-likelihood, ℓ_R(θ, σ²) = log L_R(θ, σ²), as

  −2ℓ_R(θ, σ²) = log( |Z'Z + Ω| |R_XX|² / |Ω| ) + (n − p) log(2πσ²) + r_yy² / σ².   (14)

Noting that |Z'Z + Ω| |R_XX|² = |Φ'Φ| we have the following results:

(1) The conditional REML estimate of the variance, σ̂_R²(θ), is r_yy²/(n − p).

(2) The profiled log-restricted-likelihood is given by

  −2ℓ̃_R(θ) = log( |Φ'Φ| / |Ω| ) + (n − p) [ 1 + log( 2π r_yy² / (n − p) ) ].   (15)

(3) The conditional distribution of b is

  b | y, θ, σ² ~ N( b̂(θ), σ² V_b )   (16)

where V_b is the upper-left q × q submatrix of (Φ'Φ)⁻¹,

  V_b = R_ZZ⁻¹ ( I + R_ZX R_XX⁻¹ R_XX⁻ᵀ R_ZX' ) R_ZZ⁻ᵀ.   (17)

2.2 A preliminary decomposition

Because we wish to evaluate the log-likelihood or log-restricted-likelihood for many different values of θ, we form a preliminary decomposition of the crossproducts of the model matrices and the response,

  [ Z'Z, Z'X, Z'y; X'Z, X'X, X'y; y'Z, y'X, y'y ] = T'T   where   T = [ T_ZZ, T_ZX, t_Zy; 0, T_XX, t_Xy; 0, 0, t_yy ]   (18)

so that the evaluation of (5) is equivalent to forming the orthogonal-triangular decomposition

  [ Δ(θ), 0, 0; T_ZZ, T_ZX, t_Zy; 0, T_XX, t_Xy; 0, 0, t_yy ] = Q [ R_ZZ, R_ZX, r_Zy; 0, R_XX, r_Xy; 0, 0, r_yy; 0, 0, 0 ].   (19)

To ensure well-defined estimates for the parameters we should have n ≥ q + p, which means that (18) produces a reduction, sometimes a substantial reduction, in the amount of data that must be stored and manipulated for each evaluation of the log-likelihood.
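A sketch of how (18) and (19) split the work on the dense simulated example: the triangular factor T is formed once from the data, and for each new θ only a small QR decomposition of the Δ(θ)-augmented matrix is needed. The helper eval.Re below is a hypothetical name introduced for this illustration.

```r
## Preliminary decomposition (18): one (q+p+1) x (q+p+1) triangular factor T.
TT <- chol(crossprod(cbind(Z, X, y)))
## Re-evaluation (19) for a new theta: append Delta(theta) and take a QR.
eval.Re <- function(Delta) {
  aug <- rbind(cbind(Delta, matrix(0, nrow(Delta), ncol(TT) - nrow(Delta))), TT)
  qr.R(qr(aug))            # equals R_e of (5) up to sign changes of its rows
}
Re2 <- eval.Re(chol(Omega))
```

The sign ambiguity in the QR factor is harmless here because (10) and (15) involve only the absolute values of the diagonal of R_ZZ and R_XX and the square r_yy².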

Furthermore, the matrices Z'Z and T_ZZ are sparse and we store and manipulate these matrices taking the sparsity into account (see sections 5 and 6 for specific examples). We will assume that Z'X, T_ZX, X'X, and T_XX are dense and are stored accordingly.

Equation (19) shows that t_yy² is a lower bound for r_yy². Similarly, the residual sum of squares from regressing y on X is an upper bound on r_yy². Thus the terms in (10) and (15) involving r_yy² are bounded. The ratio |Z'Z + Ω| / |Ω| is bounded below by unity (and approaches this bound as Ω⁻¹ → 0) so the profiled deviance (10) is bounded below by n[ 1 + log( 2π t_yy²/n ) ] and the profiled restricted deviance (15) is bounded below by (n − p)[ 1 + log( 2π t_yy²/(n − p) ) ].

As Ω approaches singularity (say the minimum eigenvalue of Ω approaches zero while the other eigenvalues are bounded above), |Z'Z + Ω| / |Ω| → ∞. Thus we know that the ML or REML estimates will not occur on the boundary of the set of positive definite Ω. It is possible that finite ML or REML estimates of Ω will not exist. The minimum profiled deviance may correspond to an infinite precision (unbounded Ω), which is to say that the ML or REML estimates of σ²Ω⁻¹, the variance-covariance of b, are singular.

3 Derivatives of the profiled log-likelihood

Expressions (10) and (15) provide extremely efficient ways to determine ML or REML estimates for a linear mixed-effects model because we can optimize these expressions as a function of θ only, instead of as a function of the complete parameter vector (θ, β, σ²). The reduction in the dimension of the parameter over which we are optimizing helps to improve the performance and reliability of numerical optimization routines.

Another way in which we can improve performance and reliability in numerical optimization is by providing analytic derivatives of the objective, which we do using results from Golub and Pereyra [2].

As in [2] we will use the operator D to indicate the Frechet derivative of an array or a scalar. For example, D Φ(θ) is an array of dimension (n + q) × (q + p) × k, which we treat as k matrices of size (n + q) × (q + p) when writing expressions involving matrices and arrays. When we need to indicate partial derivatives

with respect to particular parameters we will use the notation D_i Φ(θ) for the (n + q) × (q + p) matrix ∂Φ/∂θ_i.

Golub and Pereyra [2] provide derivatives of: the projection orthogonal to the column space of Φ, which, because Φ'Φ is nonsingular, we can write P = I − Φ(Φ'Φ)⁻¹Φ'; the pseudo-inverse of Φ, which is Φ⁺ = (Φ'Φ)⁻¹Φ'; and the residual sum of squares, r_yy²(θ) = ‖Pỹ‖². These derivatives are

  D P = −P DΦ Φ⁺ − ( P DΦ Φ⁺ )',   (20)
  ½ D r_yy²(θ) = −ỹ' P DΦ Φ⁺ ỹ,   (21)
  D Φ⁺ = −Φ⁺ DΦ Φ⁺ + Φ⁺ (Φ⁺)' DΦ' P.   (22)

(Equation (22) is derived from equation (4.2) in [2], which has another term. However, that term is identically zero when Φ has full column rank.)

For A a square, nonsingular matrix we have

  D( log|A(θ)| ) = tr[ D(A) A⁻¹ ]   (23)
  D( A⁻¹ ) = −A⁻¹ D(A) A⁻¹   (24)

where tr denotes the trace of a matrix. Notice that (24) is a special case of (22).
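Identity (23) is easy to check numerically. The sketch below compares a central finite difference of log|A(θ)| with tr[D(A)A⁻¹] for the illustrative one-parameter family A(θ) = Z'Z + θΩ built from the simulated objects above; the family itself is an assumption made for the check, not something used in the paper.

```r
## Finite-difference check of (23): d/dtheta log|A(theta)| = tr(DA %*% solve(A)).
A.fun <- function(theta) crossprod(Z) + theta * Omega
DA    <- Omega                                   # dA/dtheta for this family
th <- 2; h <- 1e-6
fd <- (as.numeric(determinant(A.fun(th + h))$modulus) -
       as.numeric(determinant(A.fun(th - h))$modulus)) / (2 * h)
an <- sum(diag(DA %*% solve(A.fun(th))))         # tr(DA A^{-1})
all.equal(fd, an, tolerance = 1e-4)              # should be TRUE
```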

In Appendix A we show that the gradient and Hessian terms for the penalized residual sum-of-squares, r_yy²(θ), are

  D r_yy²(θ) = b̂' DΩ b̂ = tr( DΩ b̂b̂' )   (25)
  D_j D_i r_yy²(θ) = b̂' [ D_j D_i Ω − 2 D_jΩ V_b D_iΩ ] b̂   (26)

providing the gradients

  D( −2ℓ̃ ) = tr{ DΩ [ (Z'Z + Ω)⁻¹ − Ω⁻¹ + (b̂/σ̂)(b̂/σ̂)' ] }   (27)
  D( −2ℓ̃_R ) = tr{ DΩ [ V_b − Ω⁻¹ + (b̂/σ̂_R)(b̂/σ̂_R)' ] }   (28)

and Hessian terms of the form

  D_j D_i ( −2ℓ̃ ) = tr{ D_j D_i Ω [ (Z'Z + Ω)⁻¹ − Ω⁻¹ + (b̂/σ̂)(b̂/σ̂)' ] }
   − tr[ D_jΩ (Z'Z + Ω)⁻¹ D_iΩ (Z'Z + Ω)⁻¹ ] + tr( D_jΩ Ω⁻¹ D_iΩ Ω⁻¹ )
   − 2 (b̂/σ̂)' D_jΩ V_b D_iΩ (b̂/σ̂) − (1/n) [ (b̂/σ̂)' D_jΩ (b̂/σ̂) ] [ (b̂/σ̂)' D_iΩ (b̂/σ̂) ]   (29)

  D_j D_i ( −2ℓ̃_R ) = tr{ D_j D_i Ω [ V_b − Ω⁻¹ + (b̂/σ̂_R)(b̂/σ̂_R)' ] }
   − tr[ D_jΩ V_b D_iΩ V_b ] + tr( D_jΩ Ω⁻¹ D_iΩ Ω⁻¹ )
   − 2 (b̂/σ̂_R)' D_jΩ V_b D_iΩ (b̂/σ̂_R) − (1/(n − p)) [ (b̂/σ̂_R)' D_jΩ (b̂/σ̂_R) ] [ (b̂/σ̂_R)' D_iΩ (b̂/σ̂_R) ].   (30)

4 An ECME algorithm

The EM algorithm [3] is a general iterative algorithm for computing maximum likelihood estimates in the presence of missing data. For linear mixed-effects models we formulate an EM algorithm by considering the random effects b to be unobserved data. In the terminology of the EM algorithm, we call the observed data, y, the incomplete data, and y augmented by b, the complete data. The EM algorithm has two steps: in the E step we compute Q, the expected log-likelihood (or deviance) for the complete data, and, in the M step, we maximize the expected log-likelihood (or minimize the expected deviance) with respect to the parameters in the model.

Liu and Rubin [4] derived the EM algorithm for linear mixed-effects models using b as the missing data. In the same paper they introduced expectation conditional maximization either (ECME) algorithms, which are an extension of the EM algorithm. In ECME algorithms the M step is broken down into a number of conditional maximization steps and in each conditional maximization step either the original log-likelihood, ℓ, or its conditional expectation, Q, is maximized. The maximization in each step is done by placing constraints on the parameters in such a way that the collection of all the maximization steps

is with respect to the full parameter space.

In describing an EM algorithm we must distinguish between current values of parameters and updated values. We denote the current values of the parameters by β₀, σ₀² and θ₀. These are either starting values or values obtained from the last E and M steps. The parameter estimates to be obtained after an E and an M step are β₁, σ₁² and θ₁. The log-likelihood for the complete data is

  −2ℓ(β, σ², θ | y, b) = (n + q) log(2πσ²) − log|Ω| + ( ‖y − Xβ − Zb‖² + b'Ωb ) / σ².   (31)

Because we can easily calculate σ̂²(θ) and β̂(θ) we define an update step in an ECME algorithm to be:

(1) Given θ₀, set β₁ = β̂(θ₀) and σ₁² = σ̂²(θ₀). Then the conditional distribution of b is N( b̂(θ₀), σ₁² (Z'Z + Ω(θ₀))⁻¹ ).

(2) Choose θ₁ to minimize the conditional expectation of −2ℓ,

  Q(θ | y, σ₁², β₁, θ₀) = E_{b|θ₀}[ −2ℓ(β₁, σ₁², θ | y, b) ]
   = E_{b|θ₀}[ c − log|Ω| + b'Ωb/σ₁² ]
   = c − log|Ω| + b̂'Ωb̂/σ₁² + tr[ Ω (Z'Z + Ω(θ₀))⁻¹ ].   (32)

Thus θ₁ satisfies

  ∇_θ Q = tr{ DΩ [ (b̂(θ₀)/σ₁)(b̂(θ₀)/σ₁)' + (Z'Z + Ω(θ₀))⁻¹ − Ω(θ₁)⁻¹ ] } = 0.   (33)

A similar derivation shows that, for the REML criterion, θ₁ satisfies

  ∇_θ Q_R = tr{ DΩ [ (b̂(θ₀)/σ̂_R)(b̂(θ₀)/σ̂_R)' + V_b(θ₀) − Ω(θ₁)⁻¹ ] } = 0.   (34)

From the similarity of (33) to (27) (and of (34) to (28)) we can see that a stationary point of this ECME algorithm will be a critical value of the profiled log-likelihood.

5 Computational methods for a single grouping factor

The results in the previous sections provide concise expressions for the profiled log-likelihood (10) and the profiled log-restricted-likelihood (15); the ECME

increments (33 or 34); and the gradient (27 or 28) and Hessian (29 or 30) of the profiled objective functions. All these expressions depend on being able to evaluate the initial decomposition (18) and, for several different θ, the decomposition (19), which can be a formidable computational problem because the matrices Z and Ω can be very large. However, these matrices generally are sparse and, by exploiting the sparsity, we can provide computationally feasible methods for all these results.

The sparsity in Z and Ω occurs when the random effects vector b is divided into small components associated with one or more factors that group the observations. In the simplest situation there is one grouping factor, or one set of experimental units, and the model can be written

  y_i = X_i β + Z_i b_i + ε_i,   b_i ~ N(0, σ²Ω₁⁻¹),   ε_i ~ N(0, σ²I),   i = 1, ..., m₁,
  ε_i ⊥ ε_j,  b_i ⊥ b_j,  i ≠ j;   ε_i ⊥ b_j,  all i, j   (35)

where y_i is the vector of length n_i of responses for unit i; X_i is the n_i × p fixed-effects model matrix for unit i; and Z_i is the n_i × q₁ model matrix for unit i and the random effects b_i corresponding to that unit. Because we only have one grouping factor in this model we say that we have one level of random effects. However, this terminology is not universal. In particular, this model is called the two-level model in the multilevel modeling literature (e.g. [5]) because it has two levels of random variation.

To convert model (35) to the form (1) we would set

  b = [ b₁; b₂; ...; b_{m₁} ],   y = [ y₁; y₂; ...; y_{m₁} ],   X = [ X₁; X₂; ...; X_{m₁} ],   Z = diag( Z₁, Z₂, ..., Z_{m₁} ),   (36)

and

  Ω = diag( Ω₁, Ω₁, ..., Ω₁ ) = I_{m₁} ⊗ Ω₁   (37)

where I_{m₁} is the m₁ × m₁ identity matrix and ⊗ is the Kronecker product. That is, Ω is a block-diagonal matrix whose diagonal is m₁ copies of Ω₁. Similarly, a relative precision factor is Δ = I_{m₁} ⊗ Δ₁ where Δ₁ is any q₁ × q₁ matrix satisfying Δ₁'Δ₁ = Ω₁.
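The Kronecker structure (37) is easy to verify directly for a toy case; all dimensions and the matrix Omega1 below are hypothetical values chosen for illustration.

```r
## Check the structure (37): Omega = I_{m1} (x) Omega1, Delta = I_{m1} (x) Delta1.
m1 <- 4; q1 <- 2
Omega1 <- matrix(c(2, 0.5, 0.5, 1), q1, q1)     # a small positive-definite Omega1
Delta1 <- chol(Omega1)                          # t(Delta1) %*% Delta1 == Omega1
Omega.big <- kronecker(diag(m1), Omega1)        # block diagonal: m1 copies of Omega1
Delta.big <- kronecker(diag(m1), Delta1)
all.equal(crossprod(Delta.big), Omega.big)      # TRUE: Delta'Delta = Omega
```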

The calculation of R_XX, of

  R_ZZ = diag( R_ZZ(1), R_ZZ(2), ..., R_ZZ(m₁) ),   R_ZX = [ R_ZX(1); R_ZX(2); ...; R_ZX(m₁) ],   (38)

and of r_Zy = [ r_Zy(1)', r_Zy(2)', ..., r_Zy(m₁)' ]' can be performed in blocks, using a series of QR decompositions based on the corresponding blocks in T,

  [ Δ₁; T_ZZ(i) ] = Q_i [ R_ZZ(i); 0 ]   followed by   Q_i' [ 0, 0; T_ZX(i), t_Zy(i) ] = [ R_ZX(i), r_Zy(i); R_XX(i), r_Xy(i) ],   (39)

and

  [ R_XX(1), r_Xy(1); ...; R_XX(m₁), r_Xy(m₁); T_XX, t_Xy; 0, t_yy ] = Q [ R_XX, r_Xy; 0, r_yy; 0, 0 ].   (40)

Noting that |Ω| = |Ω₁|^{m₁} = |Δ₁|^{2m₁} and that |Z'Z + Ω| = ∏_{i=1}^{m₁} |R_ZZ(i)|², we can evaluate the profiled log-likelihood (10) or the profiled log-restricted-likelihood (15).
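The per-group step in (39) amounts to a small QR decomposition for each level of the grouping factor. The function below is a hypothetical helper, written for this illustration, that operates on group-level blocks T_ZZ(i), T_ZX(i), t_Zy(i) of the preliminary decomposition; its rest component is what gets stacked into the final QR (40).

```r
## One block of (39): append Delta1 above T_ZZ(i), take its QR, and apply Q_i'
## to the remaining columns of that group's rows.
group.step <- function(TZZ.i, TZX.i, tZy.i, Delta1) {
  q1    <- nrow(Delta1)
  qr.i  <- qr(rbind(Delta1, TZZ.i))                  # left-hand QR in (39)
  rot   <- qr.qty(qr.i, rbind(matrix(0, q1, ncol(TZX.i) + 1),
                              cbind(TZX.i, tZy.i)))  # Q_i' [0 0; T_ZX(i) t_Zy(i)]
  list(RZZ  = qr.R(qr.i),                            # R_ZZ(i), q1 x q1 upper triangular
       RZX  = rot[1:q1, -ncol(rot), drop = FALSE],   # R_ZX(i)
       rZy  = rot[1:q1, ncol(rot)],                  # r_Zy(i)
       rest = rot[-(1:q1), , drop = FALSE])          # [R_XX(i) r_Xy(i)], feeds (40)
}
```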

To evaluate the ECME increments we note that tr[ D_iΩ Ω⁻¹ ] = m₁ tr[ D_iΩ₁ Ω₁⁻¹ ] = m₁ tr[ Δ₁⁻ᵀ D_iΩ₁ Δ₁⁻¹ ] and that

  tr{ D_iΩ [ (b̂(θ₀)/σ̂)(b̂(θ₀)/σ̂)' + (Z'Z + Ω(θ₀))⁻¹ ] } = tr[ A D_iΩ₁ A' ]
  tr{ D_iΩ [ (b̂(θ₀)/σ̂_R)(b̂(θ₀)/σ̂_R)' + V_b ] } = tr[ A_R D_iΩ₁ A_R' ]   (41)

where the matrices A and A_R are obtained from the QR decompositions

  [ b̂₁'/σ̂; R_ZZ(1)⁻ᵀ; ...; b̂_{m₁}'/σ̂; R_ZZ(m₁)⁻ᵀ ] = U [ A; 0 ]
  [ b̂₁'/σ̂_R; R_ZZ(1)⁻ᵀ; R_XX⁻ᵀ R_ZX(1)' R_ZZ(1)⁻ᵀ; ...; b̂_{m₁}'/σ̂_R; R_ZZ(m₁)⁻ᵀ; R_XX⁻ᵀ R_ZX(m₁)' R_ZZ(m₁)⁻ᵀ ] = U_R [ A_R; 0 ].   (42)

The matrix Ω₁ must be positive definite and symmetric. If no further structure is imposed on it then k = q₁(q₁ + 1)/2 and a suitable value of Δ₁(θ₁) can be calculated as Δ₁ = √m₁ A⁻ᵀ for ML estimation and Δ₁ = √m₁ A_R⁻ᵀ for REML, so that Ω₁(θ₁) = m₁ (A'A)⁻¹ or m₁ (A_R'A_R)⁻¹. If further structure is imposed on Ω₁(θ), so that k < q₁(q₁ + 1)/2, then the gradient equations, (33) or (34), must be solved for the ECME update.

Evaluation of the gradients (27 or 28) of the profiled objective functions can be simplified in the same way as the ECME increment is, as can the first term in the Hessian (29 or 30). The other terms in the Hessian can be simplified in various ways. Because D_iΩ = I_{m₁} ⊗ D_iΩ₁ and Ω⁻¹ = I_{m₁} ⊗ Ω₁⁻¹,

  tr( D_jΩ Ω⁻¹ D_iΩ Ω⁻¹ ) = m₁ tr( D_jΩ₁ Ω₁⁻¹ D_iΩ₁ Ω₁⁻¹ ).   (43)

From the vectors c_{li}, d_{li}, and g_{li}, l = 1, ..., m₁; i = 1, ..., k, defined as

  c_{li} = D_iΩ₁ b̂_l,   d_{li} = R_ZZ(l)⁻ᵀ c_{li},   g_{li} = R_XX⁻ᵀ R_ZX(l)' d_{li},   (44)

we can evaluate

  ( b̂' D_jΩ b̂ ) ( b̂' D_iΩ b̂ ) = ( Σ_{l=1}^{m₁} b̂_l' c_{lj} ) ( Σ_{l=1}^{m₁} b̂_l' c_{li} )
  b̂' D_jΩ V_b D_iΩ b̂ = Σ_{l=1}^{m₁} d_{lj}' d_{li} + ( Σ_{l=1}^{m₁} g_{lj} )' ( Σ_{l=1}^{m₁} g_{li} ).   (45)
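For an unstructured Ω₁ the ML form of the ECME update described above has a closed form: accumulate S = A'A of (42) group by group and set Ω₁(θ₁) = m₁ S⁻¹. The sketch below assumes hypothetical lists bhat.list and RZZ.list holding the b̂_l vectors and the R_ZZ(l) blocks; it is an illustration of the update, not the package implementation.

```r
## Closed-form ML ECME update for an unstructured Omega1 (sketch).
ecme.update <- function(bhat.list, RZZ.list, sigma2) {
  m1 <- length(bhat.list)
  q1 <- length(bhat.list[[1]])
  S  <- matrix(0, q1, q1)
  for (l in seq_len(m1)) {
    Rinv <- backsolve(RZZ.list[[l]], diag(q1))          # R_ZZ(l)^{-1}
    S <- S + tcrossprod(bhat.list[[l]]) / sigma2 +      # bhat_l bhat_l' / sigma^2
             tcrossprod(Rinv)                           # R_ZZ(l)^{-1} R_ZZ(l)^{-T}
  }
  m1 * solve(S)          # new Omega1; a new Delta1 is then chol() of this value
}
```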

The remaining terms in the Hessian expressions are evaluated as

  tr[ D_jΩ (Z'Z + Ω)⁻¹ D_iΩ (Z'Z + Ω)⁻¹ ] = Σ_{l=1}^{m₁} tr[ R_ZZ(l)⁻ᵀ D_jΩ₁ R_ZZ(l)⁻¹ R_ZZ(l)⁻ᵀ D_iΩ₁ R_ZZ(l)⁻¹ ]   (46)

and

  tr[ D_jΩ V_b D_iΩ V_b ] = Σ_{l=1}^{m₁} tr[ R_ZZ(l)⁻ᵀ D_jΩ₁ R_ZZ(l)⁻¹ R_ZZ(l)⁻ᵀ D_iΩ₁ R_ZZ(l)⁻¹ ]
   + 2 Σ_{l=1}^{m₁} tr[ R_ZZ(l)⁻ᵀ D_jΩ₁ R_ZZ(l)⁻¹ R_ZX(l) R_XX⁻¹ R_XX⁻ᵀ R_ZX(l)' R_ZZ(l)⁻ᵀ D_iΩ₁ R_ZZ(l)⁻¹ ]
   + tr[ ( Σ_{l=1}^{m₁} R_XX⁻ᵀ R_ZX(l)' R_ZZ(l)⁻ᵀ D_jΩ₁ R_ZZ(l)⁻¹ R_ZX(l) R_XX⁻¹ ) ( Σ_{l=1}^{m₁} R_XX⁻ᵀ R_ZX(l)' R_ZZ(l)⁻ᵀ D_iΩ₁ R_ZZ(l)⁻¹ R_ZX(l) R_XX⁻¹ ) ].   (47)

By working with the components of Ω₁ instead of the whole matrix Ω we save a considerable amount of storage and computation because q₁ is typically very small (values of one or two are common) while m₁ can be very large. We have worked with cases where m₁ is in the millions. In most cases p > q₁ but it is unusual for p to exceed, say, one hundred. After the pre-decomposition the storage required is approximately m₁ q₁ (q₁ + p) locations and the number of floating point operations (FLOPs) per iteration for the function evaluation is on the order of m₁ q₁³ for the decompositions and m₁ p q₁² for the multiplications in (39), followed by m₁ q₁ p² for (40). Evaluation of the determinants of R_ZZ(i), i = 1, ..., m₁, and R_XX is trivial because these matrices are triangular. The dominant term in the FLOP count is m₁ q₁ (q₁ + p)². This is also the order of the computation for the ECME update, the gradient calculation and the Hessian calculation.

We can simplify some of the expressions for the Hessian by noting that, for arbitrary q₁ × q₁ matrices U and V,

  tr[ D_iΩ₁ U D_jΩ₁ V ] = Σ_{k₁ k₂ k₃ k₄} d^{(i)}_{k₁k₂} d^{(j)}_{k₃k₄} tr[ e_{k₁} e_{k₂}' U e_{k₃} e_{k₄}' V ]
   = Σ_{k₁ k₂ k₃ k₄} d^{(i)}_{k₁k₂} d^{(j)}_{k₃k₄} ( e_{k₂}' U e_{k₃} ) ( e_{k₄}' V e_{k₁} )
   = Σ_{k₁ k₂ k₃ k₄} d^{(i)}_{k₁k₂} d^{(j)}_{k₃k₄} u_{k₂k₃} v_{k₄k₁}

where d^{(i)}_{jk} is the (j, k)th element of D_iΩ₁, e_k is the k-th column of the identity matrix of size q₁ × q₁, and u_{ij} and v_{ij} are the (i, j)th elements of U and V, respectively. For the one-level model, all the terms involved in the Hessian can be rewritten using this principle and so the full Hessian can be written as

  Σ_{k₁ k₂ k₃ k₄} d^{(i)}_{k₁k₂} d^{(j)}_{k₃k₄} H_{k₁k₂k₃k₄}

where H is an array of size q₁ × q₁ × q₁ × q₁.

6 Models with multiple grouping factors

We will consider models with multiple grouping factors where the random effects associated with each grouping factor are independent between groups and are i.i.d. within groups. That is, the matrix Ω consists of s blocks of the form I_{m_j} ⊗ Ω_j, j = 1, ..., s. We will give details for s = 2, in which case

  Ω = [ I_{m₁} ⊗ Ω₁, 0; 0, I_{m₂} ⊗ Ω₂ ].   (48)

Extensions to more than two grouping factors follow naturally.

The block-diagonal structure in (48) will also be present in the factor, Δ, and in all the derivatives with respect to components of θ. That is, Δ consists of s blocks of the form I_{m_j} ⊗ Δ_j, j = 1, ..., s, where Δ_j is of size q_j × q_j. We assume that each component of θ determines only one block in Ω and we designate the block associated with component j as b(j) where 1 ≤ b(j) ≤ s, j = 1, ..., k. Then D_jΩ is zero except for the b(j)th diagonal block, which is of the form I_{m_{b(j)}} ⊗ D_jΩ_{b(j)}. Notice that the second derivative, D_jD_iΩ, will be zero if b(i) ≠ b(j).

We divide the random effects vector b into s = 2 blocks and subdivide the jth block into m_j components of length q_j denoted by b_{ji}, j = 1, ..., s; i = 1, ..., m_j. If Z'Z is split into blocks corresponding to the blocks in Ω then the diagonal blocks in Z'Z are themselves block diagonal but, unlike the situation with a single grouping factor, Z'Z and T_ZZ will have non-zero off-diagonal blocks. These off-diagonal blocks can be sparse or dense according to whether the grouping factors are nested or crossed. We distinguish three cases: completely crossed, partially crossed, and strictly nested.

Completely crossed grouping factors usually occur in designed experiments. For example, biological assays are often conducted by measuring the optical density of liquid samples in wells arranged in a grid on a plate. A common arrangement is 96 wells in a grid of 8 rows by 12 columns. If we assigned one set of random effects to the rows and another set of random effects to the columns then each row would occur with each column, resulting in completely crossed random effects.

An example of nested grouping factors would be a longitudinal study, say records of annual achievement test scores, of students in several schools. For the student factor to be strictly nested within the school factor we require that each student attend only one school during the period of the study. In most large studies this will not be the case. We expect some students will attend more than one school but we do not expect every student to attend every school. In such a case the grouping factors are neither strictly nested nor completely crossed. We describe this situation as partially crossed grouping factors.

If we divide Z = [ Z₁, Z₂ ] according to the two levels of random effects and correspondingly divide

  T_ZZ = [ T_ZZ11, T_ZZ12; 0, T_ZZ22 ]   and   R_ZZ = [ R_ZZ11, R_ZZ12; 0, R_ZZ22 ]   (49)

then the block-diagonal structure of Z₁'Z₁ is also present in T_ZZ11 and we can store and manipulate it accordingly. Even though Z₂'Z₂ is block diagonal, there can be non-zero off-diagonals induced in T_ZZ22 from T_ZZ12 when taking the Cholesky decomposition. This is called fill-in [6].

If we consider the matrix T_ZZ12 to consist of m₁ × m₂ blocks of size q₁ × q₂ then the (i, j)th such block will necessarily be zero if the ith level of the first grouping factor does not occur in combination with the jth level of the second grouping factor. For strictly nested grouping factors there will be only one non-zero block in each of the m₁ sets of q₁ adjacent rows. Furthermore, this pattern guarantees that there will be no fill-in of T_ZZ22 and that it can be stored as a block-diagonal matrix consisting of m₂ blocks of size q₂ × q₂. The matrix R_ZZ has the same pattern of sparsity so T_ZZ and R_ZZ only require storage of size (m₁q₁ + m₂q₂)(q₁ + q₂ + p) and computing R_ZZ is of order (m₁q₁ + m₂q₂)(q₁ + q₂ + p)² FLOPs. The same sparsity pattern will be present in R_ZZ⁻¹, which we can calculate explicitly, allowing for the evaluation of all the formulas for the ECME update, the gradient, and the Hessian. These results generalize to an arbitrary number of strictly nested levels of grouping factors (and are

incorporated this way in our software). For fully crossed grouping factors T_ZZ12 and T_ZZ22 are both dense and would need to be stored and manipulated as such. Because we can condense the calculation for the first group of random effects only, we choose the order of the grouping factors so that m₁q₁ ≥ m₂q₂. Once the order of the grouping factors is established then the calculations involving T_ZZ12 and T_ZZ22 are essentially the same as those involving T_ZX and T_XX except that producing R_ZZ22 from T_ZZ22 also involves Δ₂. The amount of storage required for fully crossed grouping factors is on the order of (m₁q₁ + m₂q₂)(q₁ + m₂q₂ + p) locations and producing R_ZZ requires on the order of (m₁q₁ + m₂q₂)(q₁ + m₂q₂ + p)² FLOPs, which, obviously, can be considerably more than (m₁q₁ + m₂q₂)(q₁ + q₂ + p)². However, fully crossed grouping factors result in n ≥ m₁m₂ and usually come from designed experiments so we do not expect m₂q₂ to be extremely large.

The most interesting case is partially crossed grouping factors where T_ZZ12 is sparse but does not obey the pattern that there is only one non-zero q₁ × q₂ block in each of the m₁ sets of q₁ adjacent rows. Having more than one non-zero block in such a set of rows (say because a student attended more than one school during the course of the study) does not generate fill-in in T_ZZ12 but does generate fill-in in T_ZZ22 and in R_ZZ22. If the extent of the crossing is moderate (i.e. an individual student may attend more than one school during the study but no student attends a large proportion of all the schools in the study, so that most of the q₁ × q₂ blocks in any one of the m₁ sets of q₁ adjacent rows are zero), then it will be advantageous to use sparse matrix representations of T_ZZ12 (and R_ZZ12) and to generate R_ZZ22 using methods for the Cholesky decomposition of sparse semi-definite matrices. This particular calculation has been extensively studied because it is important in the implementation of interior-point methods in mathematical programming (Wright (1997) [6]).

A critical part of algorithms for the Cholesky decomposition of sparse semidefinite matrices is choosing an ordering of the columns of Z₂ so as to minimize fill-in in R_ZZ22. This only needs to be done once and can be based on the pattern of crossing of the grouping factors. The amount of storage and computation required to generate R_ZZ for partially crossed grouping factors will fall between that for strictly nested grouping factors and that for fully crossed grouping factors. It is unlikely that R_ZZ can be inverted in place when the grouping factors are partially crossed. We expect that it will be feasible to evaluate the objective function, the ECME update, and the gradient but that it may not be feasible to evaluate the Hessian.
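The distinction between the three cases can be read off the incidence pattern of the two grouping factors, which is also what determines the sparsity of T_ZZ12. The helper below is hypothetical, written for this illustration only.

```r
## Classify a pair of grouping factors from their incidence pattern.
crossing.type <- function(f1, f2) {
  pat <- table(f1, f2) > 0          # which (level of f1, level of f2) pairs occur
  if (all(rowSums(pat) == 1)) {
    "strictly nested: one non-zero q1 x q2 block per set of q1 rows, no fill-in"
  } else if (all(pat)) {
    "completely crossed: T_ZZ12 is dense"
  } else {
    "partially crossed: T_ZZ12 is sparse but fill-in occurs in R_ZZ22"
  }
}
## e.g. crossing.type(student, school) for the longitudinal example above
```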

7 Implementation

We have implemented the computational methods for evaluating the profiled objective function (either log-likelihood or log-restricted-likelihood), the ECME increment and the gradient in the lme4 package for R [7]. Some comparisons of the speed and stability of this implementation versus our previous implementation in the nlme package for R are available in DebRoy (2003) [8]. Future versions of this software will allow for crossed random effects and will provide the Hessian, at least for the completely crossed and strictly nested cases.
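As a usage note (not taken from the paper), the model-fitting interface of the current lme4 package looks like the sketch below. The sleepstudy data set ships with the package; the interface has evolved since the implementation described here, so treat this as an illustration rather than the paper's own example.

```r
## Fit a one-grouping-factor model by REML and by ML with lme4's lmer().
library(lme4)
fm.reml <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fm.ml   <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = FALSE)
## beta and sigma^2 are profiled out internally, as in (10) and (15).
```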

8 Acknowledgements

This work was supported by U.S. Army Medical Research and Materiel Command under Contract No. DAMD7-02-C-09. The views, opinions and/or findings contained in this report are those of the authors and should not be construed as an official Department of the Army position, policy or decision unless so designated by other documentation. We thank Deepayan Sarkar for helpful discussions.

A Differentiating the penalized RSS

Using (21) and the relationships

  D Φ = [ 0, 0; DΔ, 0 ],   Φ⁺ỹ = [ b̂; β̂ ],   and   P ỹ = [ y − Xβ̂ − Zb̂; −Δb̂ ]   (A.1)

we obtain

  D r_yy²(θ) = 2 b̂' Δ' DΔ b̂ = b̂' DΩ b̂.   (A.2)

The (j, i)th element of the Hessian of r_yy²(θ), namely D_jD_i r_yy², is obtained by differentiating (21) again:

  ½ D_j D_i r_yy² = −D_j( ỹ' P D_iΦ Φ⁺ ỹ )
   = −ỹ'( D_jP D_iΦ Φ⁺ + P D_jD_iΦ Φ⁺ + P D_iΦ D_jΦ⁺ ) ỹ
   = ỹ'[ P D_jΦ Φ⁺ + (Φ⁺)' D_jΦ' P ] D_iΦ Φ⁺ ỹ + b̂' Δ' (D_jD_iΔ) b̂ + ỹ' P D_iΦ [ Φ⁺ D_jΦ Φ⁺ − Φ⁺ (Φ⁺)' D_jΦ' P ] ỹ.   (A.3)

Writing

  w_i = D_iΦ Φ⁺ ỹ = [ 0; D_iΔ b̂ ],   (A.4)
  u_i = (R')⁻¹ Φ' D_iΦ Φ⁺ ỹ = (R')⁻¹ [ Δ' D_iΔ b̂; 0 ],   (A.5)
  v_i = −(R')⁻¹ D_iΦ' P ỹ = (R')⁻¹ [ D_iΔ' Δ b̂; 0 ],   (A.6)

where R is the Cholesky factor of Φ'Φ = R'R, expression (A.3) becomes

  ½ D_j D_i r_yy²(θ) = b̂' Δ' (D_jD_iΔ) b̂ + w_j' w_i − (u_j + v_j)'(u_i + v_i).   (A.7)

Noting that u_i + v_i = (R')⁻¹ [ D_iΩ b̂; 0 ], we have

  D_j D_i r_yy²(θ) = b̂' [ D_jD_iΩ − D_jΩ V_b D_iΩ − D_iΩ V_b D_jΩ ] b̂.   (A.8)

References

[1] C. Henderson, Applications of Linear Models in Animal Breeding, University of Guelph, 1984.

[2] G. H. Golub, V. Pereyra, The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate, SIAM Journal on Numerical Analysis 10 (2) (1973).

[3] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39 (1977) 1–22.

[4] C. Liu, D. B. Rubin, The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence, Biometrika 81 (1994).

[5] S. W. Raudenbush, A. S. Bryk, Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd Edition, Sage, 2002.

[6] S. J. Wright, Primal-Dual Interior-Point Methods, SIAM, 1997.

[7] R. Ihaka, R. Gentleman, R: A language for data analysis and graphics, Journal of Computational and Graphical Statistics 5 (3) (1996).

[8] S. DebRoy, Computational methods for mixed-effects models, Ph.D. thesis, University of Wisconsin–Madison (2003).


More information

Regression. Oscar García

Regression. Oscar García Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is

More information

Homework 5. Convex Optimization /36-725

Homework 5. Convex Optimization /36-725 Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

The Conjugate Gradient Method

The Conjugate Gradient Method The Conjugate Gradient Method Classical Iterations We have a problem, We assume that the matrix comes from a discretization of a PDE. The best and most popular model problem is, The matrix will be as large

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Multivariate Gaussians Mark Schmidt University of British Columbia Winter 2019 Last Time: Multivariate Gaussian http://personal.kenyon.edu/hartlaub/mellonproject/bivariate2.html

More information

Likelihood Methods. 1 Likelihood Functions. The multivariate normal distribution likelihood function is

Likelihood Methods. 1 Likelihood Functions. The multivariate normal distribution likelihood function is Likelihood Methods 1 Likelihood Functions The multivariate normal distribution likelihood function is The log of the likelihood, say L 1 is Ly = π.5n V.5 exp.5y Xb V 1 y Xb. L 1 = 0.5[N lnπ + ln V +y Xb

More information

Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma

Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma Venkataramanan (Ragu) Balakrishnan School of ECE, Purdue University 8 September 2003 European Union RTN Summer School on Multi-Agent

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maximum Likelihood Estimation Prof. C. F. Jeff Wu ISyE 8813 Section 1 Motivation What is parameter estimation? A modeler proposes a model M(θ) for explaining some observed phenomenon θ are the parameters

More information

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

On Optimal Frame Conditioners

On Optimal Frame Conditioners On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,

More information

Numerical Methods in Matrix Computations

Numerical Methods in Matrix Computations Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices

More information

A direct formulation for sparse PCA using semidefinite programming

A direct formulation for sparse PCA using semidefinite programming A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon

More information

ECE 275B Homework #2 Due Thursday MIDTERM is Scheduled for Tuesday, February 21, 2012

ECE 275B Homework #2 Due Thursday MIDTERM is Scheduled for Tuesday, February 21, 2012 Reading ECE 275B Homework #2 Due Thursday 2-16-12 MIDTERM is Scheduled for Tuesday, February 21, 2012 Read and understand the Newton-Raphson and Method of Scores MLE procedures given in Kay, Example 7.11,

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Homework 4. Convex Optimization /36-725

Homework 4. Convex Optimization /36-725 Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation

1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation 1 Outline. 1. Motivation 2. SUR model 3. Simultaneous equations 4. Estimation 2 Motivation. In this chapter, we will study simultaneous systems of econometric equations. Systems of simultaneous equations

More information