Linear mixed models and penalized least squares
Douglas M. Bates
Department of Statistics, University of Wisconsin-Madison

Saikat DebRoy
Department of Biostatistics, Harvard School of Public Health

Preprint submitted to Journal of Multivariate Analysis, 25 September 2003

Abstract

Linear mixed-effects models are an important class of statistical models that are used directly in many fields of application and also as iterative steps in fitting other types of mixed-effects models, such as generalized linear mixed models. The parameters in these models are typically estimated by maximum likelihood (ML) or restricted maximum likelihood (REML). In general there is no closed-form solution for these estimates and they must be determined by iterative algorithms such as EM iterations or general nonlinear optimization. Many of the intermediate calculations for such iterations have been expressed as generalized least squares problems. We show that an alternative representation as a penalized least squares problem has many advantageous computational properties, including the ability to evaluate explicitly a profiled log-likelihood or log-restricted likelihood, the gradient and Hessian of this profiled objective, and an ECME update to refine this objective.

Key words: REML, gradient, Hessian, EM algorithm, ECME algorithm, maximum likelihood, profile likelihood, multilevel models

1 Introduction

We will establish some results for the penalized least squares representation of a general form of a linear mixed-effects model, then show how these results specialize for particular models. The general form we consider is

\[
y = X\beta + Zb + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I), \quad b \sim \mathcal{N}(0, \sigma^2 \Omega^{-1}), \quad \epsilon \perp b \tag{1}
\]

where $y$ is the $n$-dimensional response vector, $X$ is an $n \times p$ model matrix for the $p$-dimensional fixed-effects vector $\beta$, $Z$ is the $n \times q$ model matrix for
the $q$-dimensional random-effects vector $b$ that has a Gaussian distribution with mean $0$ and relative precision matrix $\Omega$ (i.e., $\Omega$ is the precision of $b$ relative to the precision of $\epsilon$), and $\epsilon$ is the random noise, assumed to have a spherical Gaussian distribution. The symbol $\perp$ indicates independence of random variables. We assume that $X$ has full column rank and that $\Omega$ is positive definite. (If $X$ is rank deficient or if $\Omega$ is singular then the model can be transformed to an alternative model that fulfills the desired conditions.)

A relative precision factor, $\Delta$, is any $q \times q$ matrix that satisfies $\Omega = \Delta^\top \Delta$. One possible $\Delta$ is the Cholesky factor of $\Omega$ but others can be used. Because $\Omega$ is positive definite, any $\Delta$ will be non-singular. In general $\Delta$ (and hence $\Omega$) depends on a $k$-dimensional parameter vector $\theta$. Typically $Z$, $\Omega$, and $\Delta$ are very large and sparse (mostly zeros) while $k$, the dimension of $\theta$, is small.

The likelihood for the parameters $\beta$, $\theta$, and $\sigma^2$, given $y$, in model (1) is

\[
L(\beta, \theta, \sigma^2 \mid y) = \frac{\sqrt{|\Omega|}}{(2\pi\sigma^2)^{(n+q)/2}} \int \exp\left( -\frac{\|y - X\beta - Zb\|^2 + b^\top \Omega b}{2\sigma^2} \right) db. \tag{2}
\]

The restricted (or residual) maximum likelihood (REML) estimates of $\theta$ and $\sigma^2$ optimize a related criterion that can be written

\[
L_R(\theta, \sigma^2 \mid y) = \int L(\beta, \theta, \sigma^2 \mid y) \, d\beta. \tag{3}
\]

This integral is not the typical way to derive or to justify REML estimates but, as we shall see, it provides a convenient form in which to evaluate $L_R$.

In the next section we show that the integrals in (2) and (3) can be expressed succinctly using the solution to a penalized least squares problem. In particular, we derive a profiled log-likelihood and a profiled log-restricted-likelihood that depend on $\theta$ only. Maximizing the profiled log-likelihood (or profiled log-restricted-likelihood) is generally a much smaller and more stable optimization problem than attempting to optimize the log-likelihood for the full parameter vector. In section 3 we derive new expressions for the gradient and the Hessian of these profiled log-likelihoods.
Using these derivatives, the profiled log-likelihood can be rapidly optimized, say by Newton steps, once the neighbourhood of $\hat\theta$ has been determined. Some heuristics provide reasonable starting estimates for $\theta$ but these may not be sufficiently close to $\hat\theta$ to ensure stable Newton steps. The expectation-maximization (EM) algorithm is a robust algorithm that approaches the neighbourhood of $\hat\theta$ quickly but tends to converge slowly once it is in the neighbourhood. In section 4 we show how the penalized least squares results provide the update for a related algorithm called expectation-conditional maximization either (ECME). This allows the starting estimates to be updated by a moderate number of ECME iterations before starting the Newton steps.
These results are derived for the general model (1). In sections 5 and 6 we consider several common special cases. Our implementation of some of these methods is discussed in section 7.

2 A Penalized Least-Squares Problem

For a fixed value of $\theta$ we consider the penalized least squares problem defined by the augmented model matrix $\Phi(\theta)$ and the augmented response vector $\tilde y$,

\[
\min_{b,\beta} \left\| \tilde y - \Phi(\theta) \begin{bmatrix} b \\ \beta \end{bmatrix} \right\|^2
\quad \text{where} \quad
\Phi(\theta) = \begin{bmatrix} Z & X \\ \Delta(\theta) & 0 \end{bmatrix}
\quad \text{and} \quad
\tilde y = \begin{bmatrix} y \\ 0 \end{bmatrix}. \tag{4}
\]

One way to solve problem (4) is to form $\Phi_e = [\Phi, \tilde y]$ and let $R_e$ be the Cholesky decomposition of $\Phi_e^\top \Phi_e$,

\[
\Phi_e^\top \Phi_e =
\begin{bmatrix}
Z^\top Z + \Omega & Z^\top X & Z^\top y \\
X^\top Z & X^\top X & X^\top y \\
y^\top Z & y^\top X & y^\top y
\end{bmatrix}
= R_e^\top R_e
\quad \text{where} \quad
R_e = \begin{bmatrix}
R_{ZZ} & R_{ZX} & r_{Zy} \\
0 & R_{XX} & r_{Xy} \\
0 & 0 & r_{yy}
\end{bmatrix}. \tag{5}
\]

The matrices $R_{ZZ}$ and $R_{XX}$ are upper triangular of dimension $q \times q$ and $p \times p$ respectively. The corresponding vectors, $r_{Zy}$ and $r_{Xy}$, are of dimension $q$ and $p$, and $r_{yy}$ is a scalar. The conditions that $\Omega$ be positive definite and $X$ have full column rank ensure that $\Phi$ has full column rank, and hence that $R_{ZZ}$ and $R_{XX}$ are nonsingular.

Representation (5) is a particular form of the mixed model equations described in Henderson (1984) [1]. We write the blocks in the opposite order from which they are typically written because of computational advantages associated with this order.

Using $a = [-b^\top, -\beta^\top, 1]^\top$ we can write the numerator of the exponent in the integral in (2) as

\[
\begin{aligned}
\|y - X\beta - Zb\|^2 + b^\top \Omega b
&= a^\top \Phi_e^\top \Phi_e a = a^\top R_e^\top R_e a = \|R_e a\|^2 \\
&= \|r_{Zy} - R_{ZX}\beta - R_{ZZ} b\|^2 + \|r_{Xy} - R_{XX}\beta\|^2 + r_{yy}^2
\end{aligned} \tag{6}
\]
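and a simple change of variable allows us to evaluate

\[
\frac{1}{(2\pi\sigma^2)^{q/2}} \int \exp\left( -\frac{\|r_{Zy} - R_{ZX}\beta - R_{ZZ} b\|^2}{2\sigma^2} \right) db
= \frac{1}{\operatorname{abs}|R_{ZZ}|} = \frac{1}{\sqrt{|Z^\top Z + \Omega|}}. \tag{7}
\]

Combining (2), (6), and (7) and taking the logarithm produces the log-likelihood $\ell(\beta, \theta, \sigma^2) = \log L(\beta, \theta, \sigma^2)$, which, for convenience, we write in the form of a deviance

\[
-2\ell(\beta, \theta, \sigma^2) = \log\frac{|Z^\top Z + \Omega|}{|\Omega|} + n \log(2\pi\sigma^2) + \frac{r_{yy}^2 + \|r_{Xy} - R_{XX}\beta\|^2}{\sigma^2} \tag{8}
\]

leading to the following results for the maximum likelihood estimates (mle's):

(1) The conditional mle of the fixed effects, $\hat\beta(\theta)$, satisfies
\[
R_{XX}\,\hat\beta(\theta) = r_{Xy}. \tag{9}
\]
(2) The conditional mle of the variance, $\hat\sigma^2(\theta)$, is $r_{yy}^2/n$.
(3) The profiled log-likelihood, $\tilde\ell(\theta)$, a function of $\theta$ only, is given by
\[
-2\tilde\ell(\theta) = -2\ell\big(\hat\beta(\theta), \theta, \hat\sigma^2(\theta)\big)
= \log\frac{|Z^\top Z + \Omega|}{|\Omega|} + n\left[ 1 + \log\left( \frac{2\pi r_{yy}^2}{n} \right) \right]. \tag{10}
\]
(4) The conditional expected value of $b$, which we write as $\hat b(\beta, \theta)$, satisfies
\[
R_{ZZ}\,\hat b(\beta, \theta) = r_{Zy} - R_{ZX}\beta. \tag{11}
\]
Typically we evaluate $\hat b(\hat\beta(\theta), \theta)$, which we write as $\hat b(\theta)$.
(5) The conditional distribution of $b$ is
\[
b \mid y, \beta, \theta, \sigma^2 \sim \mathcal{N}\big( \hat b(\beta, \theta),\ \sigma^2 (Z^\top Z + \Omega)^{-1} \big). \tag{12}
\]

2.1 REML results

As in (7), we can use a simple change of variable to obtain

\[
\frac{1}{(2\pi\sigma^2)^{p/2}} \int \exp\left( -\frac{\|r_{Xy} - R_{XX}\beta\|^2}{2\sigma^2} \right) d\beta = \frac{1}{\operatorname{abs}|R_{XX}|} \tag{13}
\]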
providing the log-restricted-likelihood, $\ell_R(\theta, \sigma^2) = \log L_R(\theta, \sigma^2)$, as

\[
-2\ell_R(\theta, \sigma^2) = \log\frac{|Z^\top Z + \Omega|\,|R_{XX}|^2}{|\Omega|} + (n - p)\log(2\pi\sigma^2) + \frac{r_{yy}^2}{\sigma^2}. \tag{14}
\]

Noting that $|Z^\top Z + \Omega|\,|R_{XX}|^2 = |\Phi^\top \Phi|$ we have the following results:

(1) The conditional REML estimate of the variance, $\hat\sigma_R^2(\theta)$, is $r_{yy}^2/(n - p)$.
(2) The profiled log-restricted-likelihood is given by
\[
-2\tilde\ell_R(\theta) = \log\frac{|\Phi^\top \Phi|}{|\Omega|} + (n - p)\left[ 1 + \log\left( \frac{2\pi r_{yy}^2}{n - p} \right) \right]. \tag{15}
\]
(3) The conditional distribution of $b$ is
\[
b \mid y, \theta, \sigma^2 \sim \mathcal{N}\big( \hat b(\theta),\ \sigma^2 V_b \big) \tag{16}
\]
where $V_b$ is the upper-left $q \times q$ submatrix of $(\Phi^\top \Phi)^{-1}$,
\[
V_b = R_{ZZ}^{-1} \left( I + R_{ZX} R_{XX}^{-1} R_{XX}^{-\top} R_{ZX}^\top \right) R_{ZZ}^{-\top}. \tag{17}
\]

2.2 A preliminary decomposition

Because we wish to evaluate the log-likelihood or log-restricted-likelihood for many different values of $\theta$, we form a preliminary decomposition of the crossproducts of the model matrices and the response,

\[
\begin{bmatrix}
Z^\top Z & Z^\top X & Z^\top y \\
X^\top Z & X^\top X & X^\top y \\
y^\top Z & y^\top X & y^\top y
\end{bmatrix}
= T^\top T
\quad \text{where} \quad
T = \begin{bmatrix}
T_{ZZ} & T_{ZX} & t_{Zy} \\
0 & T_{XX} & t_{Xy} \\
0 & 0 & t_{yy}
\end{bmatrix} \tag{18}
\]

so that the evaluation of (5) is equivalent to forming the orthogonal-triangular decomposition

\[
\begin{bmatrix}
\Delta & 0 & 0 \\
T_{ZZ} & T_{ZX} & t_{Zy} \\
0 & T_{XX} & t_{Xy} \\
0 & 0 & t_{yy}
\end{bmatrix}
= Q
\begin{bmatrix}
R_{ZZ} & R_{ZX} & r_{Zy} \\
0 & R_{XX} & r_{Xy} \\
0 & 0 & r_{yy} \\
0 & 0 & 0
\end{bmatrix}. \tag{19}
\]

To ensure well-defined estimates for the parameters we should have $n \ge q + p$, which means that (18) produces a reduction, sometimes a substantial reduction, in the amount of data that must be stored and manipulated for each evaluation of the log-likelihood. Furthermore, the matrices $Z^\top Z$ and $T_{ZZ}$ are sparse and we store and manipulate these matrices taking the sparsity into account (see sections 5 and 6 for specific examples). We will assume that $Z^\top X$, $T_{ZX}$, $X^\top X$, and $T_{XX}$ are dense and are stored accordingly.

Equation (19) shows that $t_{yy}^2$ is a lower bound for $r_{yy}^2$. Similarly, the residual sum of squares from regressing $y$ on $X$ is an upper bound on $r_{yy}^2$. Thus the terms in (10) and (15) involving $r_{yy}^2$ are bounded. The ratio $|Z^\top Z + \Omega| / |\Omega|$ is bounded below by unity (and approaches this bound as $\Omega^{-1} \to 0$) so the profiled deviance (10) is bounded below by $n\left[1 + \log\left(2\pi t_{yy}^2/n\right)\right]$ and the profiled restricted deviance (15) is bounded below by $(n - p)\left[1 + \log\left(2\pi t_{yy}^2/(n - p)\right)\right]$.

As $\Omega$ approaches singularity (say the minimum eigenvalue of $\Omega$ approaches zero while the other eigenvalues are bounded above), $|Z^\top Z + \Omega| / |\Omega| \to \infty$. Thus we know that the ML or REML estimates will not occur on the boundary of the set of positive definite $\Omega$.

It is possible that finite ML or REML estimates of $\Omega$ will not exist. The minimum profiled deviance may correspond to an infinite precision (unbounded $\Omega$), which is to say that the ML or REML estimates of $\sigma^2 \Omega^{-1}$, the variance-covariance of $b$, are singular.

3 Derivatives of the profiled log-likelihood

Expressions (10) and (15) provide extremely efficient ways to determine ML or REML estimates for a linear mixed-effects model because we can optimize these expressions as a function of $\theta$ only, instead of as a function of the complete parameter vector $(\theta, \beta, \sigma^2)$. The reduction in the dimension of the parameter over which we are optimizing helps to improve the performance and reliability of numerical optimization routines. Another way in which we can improve performance and reliability in numerical optimization is by providing analytic derivatives of the objective, which we do using results from Golub and Pereyra [2].
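To make the evaluation of (10) and (15) concrete, the following is a minimal dense NumPy sketch under stated assumptions: it forms $\Phi_e^\top \Phi_e$ as in (5), takes its Cholesky factor, and reads off $R_{ZZ}$, $R_{XX}$ and $r_{yy}$. The function name `profiled_deviance` and its arguments are ours, chosen for illustration; the sketch ignores the sparsity and the preliminary decomposition (18) that a practical implementation would exploit.

```python
import numpy as np

def profiled_deviance(y, X, Z, Omega, reml=False):
    """Profiled deviance (10), or profiled restricted deviance (15) if reml=True."""
    n, p = X.shape
    q = Z.shape[1]
    # Cross-product of [Z, X, y]; adding Omega to the Z'Z block gives (5).
    M = np.column_stack([Z, X, y])
    A = M.T @ M
    A[:q, :q] += Omega
    # NumPy returns the lower factor L with A = L L'; R_e = L' is upper triangular.
    Re = np.linalg.cholesky(A).T
    RZZ = Re[:q, :q]
    RXX = Re[q:q + p, q:q + p]
    ryy = Re[-1, -1]
    # log(|Z'Z + Omega| / |Omega|), using the triangular factor for the numerator.
    logdet = 2.0 * np.sum(np.log(np.diag(RZZ))) - np.linalg.slogdet(Omega)[1]
    dof = n
    if reml:
        # |Phi'Phi| = |Z'Z + Omega| |R_XX|^2, and n is replaced by n - p.
        logdet += 2.0 * np.sum(np.log(np.diag(RXX)))
        dof = n - p
    return logdet + dof * (1.0 + np.log(2.0 * np.pi * ryy**2 / dof))
```

Because the returned value depends on the data only through the cross-products, an optimizer over $\theta$ would call such a function repeatedly with different $\Omega(\theta)$ while the cross-product matrix stays fixed.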
As in [2] we will use the operator $D$ to indicate the Fréchet derivative of an array or a scalar. For example, $D\Phi(\theta)$ is an array of dimension $(n + q) \times (q + p) \times k$, which we treat as $k$ matrices of size $(n + q) \times (q + p)$ when writing expressions involving matrices and arrays. When we need to indicate partial derivatives
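with respect to particular parameters we will use the notation $D_i\Phi(\theta)$ for the $(n + q) \times (q + p)$ matrix $\partial\Phi/\partial\theta_i$.

Golub and Pereyra [2] provide derivatives of the projection orthogonal to the column space of $\Phi$, which, because $\Phi^\top \Phi$ is nonsingular, we can write $P = I - \Phi(\Phi^\top \Phi)^{-1}\Phi^\top$; the pseudo-inverse of $\Phi$, which is $\Phi^+ = (\Phi^\top \Phi)^{-1}\Phi^\top$; and the residual sum of squares, $r_{yy}^2(\theta) = \|P\tilde y\|^2$. These derivatives are

\[
DP = -P\,D\Phi\,\Phi^+ - \left( P\,D\Phi\,\Phi^+ \right)^\top, \tag{20}
\]
\[
\tfrac{1}{2}\nabla r_{yy}^2(\theta) = -\tilde y^\top P\,D\Phi\,\Phi^+ \tilde y, \tag{21}
\]
\[
D\Phi^+ = -\Phi^+ D\Phi\,\Phi^+ + \Phi^+ (\Phi^+)^\top D\Phi^\top P. \tag{22}
\]

(Equation (22) is derived from equation (4.12) in [2], which has another term. However, that term is identically zero when $\Phi$ has full column rank.)

For $A$ a square, nonsingular matrix we have

\[
\nabla\left( \log|A(\theta)| \right) = \operatorname{tr}\left[ D(A)\,A^{-1} \right] \tag{23}
\]
\[
D\left( A^{-1} \right) = -A^{-1} D(A)\,A^{-1} \tag{24}
\]

where $\operatorname{tr}$ denotes the trace of a matrix. Notice that (24) is a special case of (22).

In Appendix A we show that the gradient and Hessian terms for the penalized residual sum-of-squares, $r_{yy}^2(\theta)$, are

\[
\nabla r_{yy}^2(\theta) = \hat b^\top D\Omega\,\hat b = \operatorname{tr}\left( D\Omega\,\hat b \hat b^\top \right) \tag{25}
\]
\[
D_j D_i\, r_{yy}^2(\theta) = \hat b^\top \left[ D_j(D_i\Omega) - 2 D_j\Omega\,V_b\,D_i\Omega \right] \hat b \tag{26}
\]

providing the gradients

\[
\nabla(-2\tilde\ell) = \operatorname{tr}\left[ D\Omega \left( (Z^\top Z + \Omega)^{-1} - \Omega^{-1} + \frac{\hat b}{\hat\sigma}\frac{\hat b^\top}{\hat\sigma} \right) \right] \tag{27}
\]
\[
\nabla(-2\tilde\ell_R) = \operatorname{tr}\left[ D\Omega \left( V_b - \Omega^{-1} + \frac{\hat b}{\hat\sigma_R}\frac{\hat b^\top}{\hat\sigma_R} \right) \right] \tag{28}
\]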
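and Hessian terms of the form

\[
\begin{aligned}
D_j D_i(-2\tilde\ell) ={}& \operatorname{tr}\left[ D_j(D_i\Omega) \left( (Z^\top Z + \Omega)^{-1} - \Omega^{-1} + \frac{\hat b}{\hat\sigma}\frac{\hat b^\top}{\hat\sigma} \right) \right] \\
& - \operatorname{tr}\left[ D_j\Omega\,(Z^\top Z + \Omega)^{-1} D_i\Omega\,(Z^\top Z + \Omega)^{-1} \right]
+ \operatorname{tr}\left( D_j\Omega\,\Omega^{-1} D_i\Omega\,\Omega^{-1} \right) \\
& - 2\,\frac{\hat b^\top}{\hat\sigma} D_j\Omega\,V_b\,D_i\Omega\,\frac{\hat b}{\hat\sigma}
- \frac{1}{n} \left( \frac{\hat b^\top}{\hat\sigma} D_j\Omega\,\frac{\hat b}{\hat\sigma} \right) \left( \frac{\hat b^\top}{\hat\sigma} D_i\Omega\,\frac{\hat b}{\hat\sigma} \right)
\end{aligned} \tag{29}
\]

\[
\begin{aligned}
D_j D_i(-2\tilde\ell_R) ={}& \operatorname{tr}\left[ D_j(D_i\Omega) \left( V_b - \Omega^{-1} + \frac{\hat b}{\hat\sigma_R}\frac{\hat b^\top}{\hat\sigma_R} \right) \right] \\
& - \operatorname{tr}\left[ D_j\Omega\,V_b\,D_i\Omega\,V_b \right]
+ \operatorname{tr}\left( D_j\Omega\,\Omega^{-1} D_i\Omega\,\Omega^{-1} \right) \\
& - 2\,\frac{\hat b^\top}{\hat\sigma_R} D_j\Omega\,V_b\,D_i\Omega\,\frac{\hat b}{\hat\sigma_R}
- \frac{1}{n - p} \left( \frac{\hat b^\top}{\hat\sigma_R} D_j\Omega\,\frac{\hat b}{\hat\sigma_R} \right) \left( \frac{\hat b^\top}{\hat\sigma_R} D_i\Omega\,\frac{\hat b}{\hat\sigma_R} \right)
\end{aligned} \tag{30}
\]

4 An ECME algorithm

The EM algorithm [3] is a general iterative algorithm for computing maximum likelihood estimates in the presence of missing data. For linear mixed-effects models we formulate an EM algorithm by considering the random effects $b$ to be unobserved data. In the terminology of the EM algorithm, we call the observed data, $y$, the incomplete data, and $y$ augmented by $b$, the complete data. The EM algorithm has two steps: in the E step we compute $Q$, the expected log-likelihood (or deviance) for the complete data, and, in the M step, we maximize the expected log-likelihood (or minimize the expected deviance) with respect to the parameters in the model.

Liu and Rubin [4] derived the EM algorithm for linear mixed-effects models using $b$ as the missing data. In the same paper they introduced expectation conditional maximization either (ECME) algorithms, which are an extension of the EM algorithm. In ECME algorithms the M step is broken down into a number of conditional maximization steps and in each conditional maximization step either the original log-likelihood, $\ell$, or its conditional expectation, $Q$, is maximized. The maximization in each step is done by placing constraints on the parameters in such a way that the collection of all the maximization steps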
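is with respect to the full parameter space.

In describing an EM algorithm we must distinguish between current values of parameters and updated values. We denote the current values of the parameters by $\beta_0$, $\sigma_0^2$ and $\theta_0$. These are either starting values or values obtained from the last E and M steps. The parameter estimates to be obtained after an E and an M step are $\beta_1$, $\sigma_1^2$ and $\theta_1$.

The log-likelihood for the complete data is

\[
-2\ell(\beta, \sigma^2, \theta \mid y, b) = (n + q)\log(2\pi\sigma^2) - \log|\Omega| + \frac{\|y - X\beta - Zb\|^2 + b^\top \Omega b}{\sigma^2}. \tag{31}
\]

Because we can easily calculate $\hat\sigma^2(\theta)$ and $\hat\beta(\theta)$ we define an update step in an ECME algorithm to be:

(1) Given $\theta_0$, set $\beta_1 = \hat\beta(\theta_0)$ and $\sigma_1^2 = \hat\sigma^2(\theta_0)$. Then the conditional distribution of $b$ is $\mathcal{N}\big( \hat b(\theta_0),\ \sigma_1^2 (Z^\top Z + \Omega(\theta_0))^{-1} \big)$.
(2) Choose $\theta_1$ to minimize the conditional expectation of $-2\ell$,
\[
\begin{aligned}
Q(\theta \mid y, \sigma_1^2, \beta_1, \theta_0)
&= E_{b \mid \theta_0}\left[ -2\ell(\beta_1, \sigma_1^2, \theta \mid y, b) \right] \\
&= E_{b \mid \theta_0}\left[ c - \log|\Omega| + b^\top \Omega b / \sigma_1^2 \right] \\
&= c - \log|\Omega| + \frac{\hat b^\top \Omega\,\hat b}{\sigma_1^2} + \operatorname{tr}\left[ \Omega \left( Z^\top Z + \Omega(\theta_0) \right)^{-1} \right]
\end{aligned} \tag{32}
\]
where $c$ collects the terms that do not depend on $\theta$.

Thus $\theta_1$ satisfies

\[
\nabla_\theta Q = \operatorname{tr}\left[ D\Omega \left( \frac{\hat b(\theta_0)}{\hat\sigma}\frac{\hat b(\theta_0)^\top}{\hat\sigma} + \left( Z^\top Z + \Omega(\theta_0) \right)^{-1} - \Omega(\theta_1)^{-1} \right) \right] = 0. \tag{33}
\]

A similar derivation shows that, for the REML criterion, $\theta_1$ satisfies

\[
\nabla_\theta Q_R = \operatorname{tr}\left[ D\Omega \left( \frac{\hat b(\theta_0)}{\hat\sigma_R}\frac{\hat b(\theta_0)^\top}{\hat\sigma_R} + V_b(\theta_0) - \Omega(\theta_1)^{-1} \right) \right] = 0. \tag{34}
\]

From the similarity of (33) to (27) (and of (34) to (28)) we can see that a stationary point of this ECME algorithm will be a critical value of the profiled log-likelihood.

5 Computational methods for a single grouping factor

The results in the previous sections provide concise expressions for the profiled log-likelihood (10), the profiled log-restricted-likelihood (15); the ECME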
increments (33 or 34); and the gradient (27 or 28) and Hessian (29 or 30) of the profiled objective functions. All these expressions depend on being able to evaluate the initial decomposition (18) and, for several different $\theta$, the decomposition (19), which can be a formidable computational problem because the matrices $Z$ and $\Omega$ can be very large. However, these matrices generally are sparse and, by exploiting the sparsity, we can provide computationally feasible methods for all these results.

The sparsity in $Z$ and $\Omega$ occurs when the random effects vector $b$ is divided into small components associated with one or more factors that group the observations. In the simplest situation there is one grouping factor, or one set of experimental units, and the model can be written

\[
\begin{gathered}
y_i = X_i\beta + Z_i b_i + \epsilon_i, \quad b_i \sim \mathcal{N}(0, \sigma^2 \Omega_1^{-1}), \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2 I), \quad i = 1, \dots, m_1, \\
\epsilon_i \perp \epsilon_j,\ b_i \perp b_j,\ i \ne j; \qquad \epsilon_i \perp b_j,\ \text{all } i, j
\end{gathered} \tag{35}
\]

where $y_i$ is the vector of length $n_i$ of responses for unit $i$; $X_i$ is the $n_i \times p$ fixed-effects model matrix for unit $i$; and $Z_i$ is the $n_i \times q_1$ model matrix for unit $i$ and the random effects $b_i$ corresponding to that unit. Because we only have one grouping factor in this model we say that we have one level of random effects. However, this terminology is not universal. In particular, this model is called the two-level model in the multilevel modeling literature (e.g. [5]) because it has two levels of random variation.

To convert model (35) to the form (1) we would set

\[
b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_{m_1} \end{bmatrix}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{m_1} \end{bmatrix}, \quad
X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_{m_1} \end{bmatrix}, \quad
Z = \begin{bmatrix}
Z_1 & 0 & \cdots & 0 \\
0 & Z_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & Z_{m_1}
\end{bmatrix}, \tag{36}
\]

and

\[
\Omega = \begin{bmatrix}
\Omega_1 & 0 & \cdots & 0 \\
0 & \Omega_1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \Omega_1
\end{bmatrix}
= I_{m_1} \otimes \Omega_1 \tag{37}
\]

where $I_{m_1}$ is the $m_1 \times m_1$ identity matrix and $\otimes$ is the Kronecker product. That is, $\Omega$ is a block-diagonal matrix whose diagonal is $m_1$ copies of $\Omega_1$. Similarly, a relative precision factor is $\Delta = I_{m_1} \otimes \Delta_1$ where $\Delta_1$ is any $q_1 \times q_1$ matrix satisfying $\Delta_1^\top \Delta_1 = \Omega_1$.
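As a small numerical illustration of the Kronecker structure in (37), the sketch below builds $\Omega$ and $\Delta$ from a single $q_1 \times q_1$ block and checks that $\Delta^\top \Delta = \Omega$ and that $\log|\Omega|$ can be obtained from the small block alone. The sizes and the entries of `Delta1` are arbitrary values chosen for the example, and the dense `np.kron` products are formed only to check the identities; they are exactly what one avoids storing in practice.

```python
import numpy as np

m1, q1 = 4, 2                         # illustrative sizes: 4 units, 2 random effects each
Delta1 = np.array([[1.5, 0.3],
                   [0.0, 0.8]])       # any nonsingular q1 x q1 factor
Omega1 = Delta1.T @ Delta1            # Omega_1 = Delta_1' Delta_1

Omega = np.kron(np.eye(m1), Omega1)   # (37): m1 copies of Omega_1 on the diagonal
Delta = np.kron(np.eye(m1), Delta1)   # relative precision factor of the full model

# Delta is a valid factor of Omega.
factor_ok = np.allclose(Delta.T @ Delta, Omega)

# log|Omega| needs only the small block, since |Omega| = |Omega_1|^{m1}.
logdet_full = np.linalg.slogdet(Omega)[1]
logdet_block = m1 * np.linalg.slogdet(Omega1)[1]
```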
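The calculation of $R_{XX}$,

\[
R_{ZZ} = \begin{bmatrix}
R_{ZZ(1)} & 0 & \cdots & 0 \\
0 & R_{ZZ(2)} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & R_{ZZ(m_1)}
\end{bmatrix}, \qquad
R_{ZX} = \begin{bmatrix} R_{ZX(1)} \\ R_{ZX(2)} \\ \vdots \\ R_{ZX(m_1)} \end{bmatrix} \tag{38}
\]

and $r_{Zy} = \left[ r_{Zy(1)}^\top, r_{Zy(2)}^\top, \dots, r_{Zy(m_1)}^\top \right]^\top$ can be performed in blocks, using a series of QR decompositions based on the corresponding blocks in $T$,

\[
\begin{bmatrix} \Delta_1 \\ T_{ZZ(i)} \end{bmatrix} = Q_i \begin{bmatrix} R_{ZZ(i)} \\ 0 \end{bmatrix}
\quad \text{followed by} \quad
Q_i^\top \begin{bmatrix} 0 & 0 \\ T_{ZX(i)} & t_{Zy(i)} \end{bmatrix}
= \begin{bmatrix} R_{ZX(i)} & r_{Zy(i)} \\ R_{XX(i)} & r_{Xy(i)} \end{bmatrix}, \tag{39}
\]

and

\[
\begin{bmatrix}
R_{XX(1)} & r_{Xy(1)} \\
\vdots & \vdots \\
R_{XX(m_1)} & r_{Xy(m_1)} \\
T_{XX} & t_{Xy} \\
0 & t_{yy}
\end{bmatrix}
= Q \begin{bmatrix} R_{XX} & r_{Xy} \\ 0 & r_{yy} \\ 0 & 0 \end{bmatrix}. \tag{40}
\]

Noting that $|\Omega| = |\Omega_1|^{m_1} = |\Delta_1|^{2 m_1}$ and that $|Z^\top Z + \Omega| = \prod_{i=1}^{m_1} |R_{ZZ(i)}|^2$, we can evaluate the profiled log-likelihood (10) or the profiled log-restricted-likelihood (15).

To evaluate the ECME increments we note that

\[
\begin{aligned}
\operatorname{tr}\left[ D_i(\Omega)\,\Omega^{-1} \right]
&= m_1 \operatorname{tr}\left[ D_i(\Omega_1)\,\Omega_1^{-1} \right]
= m_1 \operatorname{tr}\left[ \Delta_1^{-\top} D_i(\Omega_1)\,\Delta_1^{-1} \right] \\
\operatorname{tr}\left[ D_i(\Omega) \left( \frac{\hat b(\theta_0)}{\hat\sigma}\frac{\hat b(\theta_0)^\top}{\hat\sigma} + \left( Z^\top Z + \Omega(\theta_0) \right)^{-1} \right) \right]
&= \operatorname{tr}\left[ A\,D_i(\Omega_1)\,A^\top \right] \\
\operatorname{tr}\left[ D_i(\Omega) \left( \frac{\hat b(\theta_0)}{\hat\sigma_R}\frac{\hat b(\theta_0)^\top}{\hat\sigma_R} + V_b \right) \right]
&= \operatorname{tr}\left[ A_R\,D_i(\Omega_1)\,A_R^\top \right]
\end{aligned} \tag{41}
\]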
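where the matrices $A$ and $A_R$ are obtained from the QR decompositions

\[
U A = \begin{bmatrix}
\hat b_1^\top / \hat\sigma \\
R_{ZZ(1)}^{-\top} \\
\vdots \\
\hat b_{m_1}^\top / \hat\sigma \\
R_{ZZ(m_1)}^{-\top}
\end{bmatrix}
\quad \text{and} \quad
U_R A_R = \begin{bmatrix}
\hat b_1^\top / \hat\sigma_R \\
R_{ZZ(1)}^{-\top} \\
R_{XX}^{-\top} R_{ZX(1)}^\top R_{ZZ(1)}^{-\top} \\
\vdots \\
\hat b_{m_1}^\top / \hat\sigma_R \\
R_{ZZ(m_1)}^{-\top} \\
R_{XX}^{-\top} R_{ZX(m_1)}^\top R_{ZZ(m_1)}^{-\top}
\end{bmatrix}. \tag{42}
\]

The matrix $\Omega_1$ must be positive definite and symmetric. If no further structure is imposed on it then $k = q_1(q_1 + 1)/2$ and (33) reduces to $m_1 \Omega_1^{-1} = A^\top A$, so a suitable value of $\Delta_1(\theta_1)$ can be calculated as $\Delta_1 = \sqrt{m_1}\,A^{-\top}$ for ML estimation and $\Delta_{1,R} = \sqrt{m_1}\,A_R^{-\top}$ for REML. If further structure is imposed on $\Omega_1(\theta)$, so that $k < q_1(q_1 + 1)/2$, then the gradient equations, (33) or (34), must be solved for the ECME update.

Evaluation of the gradients (27 or 28) of the profiled objective functions can be simplified in the same way as the ECME increment is, as can the first term in the Hessian (29 or 30). The other terms in the Hessian can be simplified in various ways. Because $D_i\Omega = I_{m_1} \otimes D_i\Omega_1$ and $\Omega^{-1} = I_{m_1} \otimes \Omega_1^{-1}$,

\[
\operatorname{tr}\left( D_j\Omega\,\Omega^{-1} D_i\Omega\,\Omega^{-1} \right)
= m_1 \operatorname{tr}\left( D_j\Omega_1\,\Omega_1^{-1} D_i\Omega_1\,\Omega_1^{-1} \right)
= m_1 \operatorname{tr}\left( \Delta_1^{-\top} D_j\Omega_1\,\Delta_1^{-1} \Delta_1^{-\top} D_i\Omega_1\,\Delta_1^{-1} \right). \tag{43}
\]

From the vectors $c_{li}$, $d_{li}$, and $g_{li}$, $l = 1, \dots, m_1$; $i = 1, \dots, k$, defined as

\[
c_{li} = D_i\Omega_1\,\hat b_l, \qquad
d_{li} = R_{ZZ(l)}^{-\top} c_{li}, \qquad
g_{li} = R_{XX}^{-\top} R_{ZX(l)}^\top d_{li} \tag{44}
\]

we can evaluate

\[
\begin{aligned}
\left( \hat b^\top D_j\Omega\,\hat b \right) \left( \hat b^\top D_i\Omega\,\hat b \right)
&= \left( \sum_{l=1}^{m_1} \hat b_l^\top c_{lj} \right) \left( \sum_{l=1}^{m_1} \hat b_l^\top c_{li} \right) \\
\hat b^\top D_j\Omega\,V_b\,D_i\Omega\,\hat b
&= \sum_{l=1}^{m_1} d_{lj}^\top d_{li} + \left( \sum_{l=1}^{m_1} g_{lj} \right)^\top \left( \sum_{l=1}^{m_1} g_{li} \right)
\end{aligned} \tag{45}
\]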
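The ECME update can be sketched for the simplest single-grouping-factor case, a scalar random effect per unit ($q_1 = 1$), so that $\Omega = \omega I$ and $k = 1$. Under that assumption (33) has the closed-form solution $\omega_1 = m_1 / \big( \|\hat b(\theta_0)\|^2 / \hat\sigma^2 + \operatorname{tr}[(Z^\top Z + \omega_0 I)^{-1}] \big)$. The function names `pls_solve`, `deviance` and `ecme_step` are ours, and the code uses dense least squares rather than the blocked decompositions (39) and (40); it is an illustration of the update, not the implementation discussed in section 7.

```python
import numpy as np

def pls_solve(y, X, Z, omega):
    """Solve the penalized least squares problem (4) for Omega = omega * I."""
    p = X.shape[1]
    q = Z.shape[1]
    # Augmented model matrix Phi and response y~; Delta = sqrt(omega) * I.
    Phi = np.block([[Z, X],
                    [np.sqrt(omega) * np.eye(q), np.zeros((q, p))]])
    y_aug = np.concatenate([y, np.zeros(q)])
    coef, *_ = np.linalg.lstsq(Phi, y_aug, rcond=None)
    b_hat = coef[:q]
    ryy2 = np.sum((y_aug - Phi @ coef) ** 2)   # penalized residual sum of squares
    return b_hat, ryy2

def deviance(y, X, Z, omega):
    """Profiled deviance (10) for Omega = omega * I."""
    n = len(y)
    q = Z.shape[1]
    _, ryy2 = pls_solve(y, X, Z, omega)
    logdet = np.linalg.slogdet(Z.T @ Z + omega * np.eye(q))[1] - q * np.log(omega)
    return logdet + n * (1.0 + np.log(2.0 * np.pi * ryy2 / n))

def ecme_step(y, X, Z, omega0):
    """One ECME update of omega: the closed-form root of (33) when q_1 = 1."""
    n = len(y)
    q = Z.shape[1]                             # here q equals m_1
    b_hat, ryy2 = pls_solve(y, X, Z, omega0)
    sigma2 = ryy2 / n                          # conditional mle of sigma^2 at omega0
    trace_term = np.trace(np.linalg.inv(Z.T @ Z + omega0 * np.eye(q)))
    return q / (b_hat @ b_hat / sigma2 + trace_term)
```

Iterating `omega = ecme_step(y, X, Z, omega)` from a positive starting value should not increase `deviance`, which is the property that makes a few such steps a reasonable warm-up before Newton iterations.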
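The remaining terms in the Hessian expressions are evaluated as

\[
\operatorname{tr}\left[ D_j\Omega\,(Z^\top Z + \Omega)^{-1} D_i\Omega\,(Z^\top Z + \Omega)^{-1} \right]
= \sum_{l=1}^{m_1} \operatorname{tr}\left[ R_{ZZ(l)}^{-\top} D_j\Omega_1\,R_{ZZ(l)}^{-1} R_{ZZ(l)}^{-\top} D_i\Omega_1\,R_{ZZ(l)}^{-1} \right] \tag{46}
\]

or

\[
\begin{aligned}
\operatorname{tr}\left[ D_j\Omega\,V_b\,D_i\Omega\,V_b \right]
={}& \sum_{l=1}^{m_1} \operatorname{tr}\left[ R_{ZZ(l)}^{-\top} D_j\Omega_1\,R_{ZZ(l)}^{-1} R_{ZZ(l)}^{-\top} D_i\Omega_1\,R_{ZZ(l)}^{-1} \right] \\
& + 2 \sum_{l=1}^{m_1} \operatorname{tr}\left[ R_{ZZ(l)}^{-\top} D_j\Omega_1\,R_{ZZ(l)}^{-1} R_{ZX(l)} R_{XX}^{-1} R_{XX}^{-\top} R_{ZX(l)}^\top R_{ZZ(l)}^{-\top} D_i\Omega_1\,R_{ZZ(l)}^{-1} \right] \\
& + \operatorname{tr}\left[ \left( \sum_{l=1}^{m_1} R_{XX}^{-\top} R_{ZX(l)}^\top R_{ZZ(l)}^{-\top} D_j\Omega_1\,R_{ZZ(l)}^{-1} R_{ZX(l)} R_{XX}^{-1} \right)
\left( \sum_{l=1}^{m_1} R_{XX}^{-\top} R_{ZX(l)}^\top R_{ZZ(l)}^{-\top} D_i\Omega_1\,R_{ZZ(l)}^{-1} R_{ZX(l)} R_{XX}^{-1} \right) \right].
\end{aligned} \tag{47}
\]

By working with the components of $\Omega$ instead of the whole matrix we save a considerable amount of storage and computation because $q_1$ is typically very small (values of one or two are common) while $m_1$ can be very large. We have worked with cases where $m_1$ is in the millions. In most cases $p \ge q_1$ but it is unusual for $p$ to exceed, say, one hundred. After the pre-decomposition the storage required is approximately $m_1 q_1 (q_1 + p)$ locations and the number of floating point operations (FLOPs) per iteration for the function evaluation is on the order of $m_1 q_1^3$ for the decompositions and $m_1 p q_1^2$ for the multiplications in (39), followed by $m_1 q_1 p^2$ for (40). Evaluation of the determinants of $R_{ZZ(i)}$, $i = 1, \dots, m_1$, and $R_{XX}$ is trivial because these matrices are triangular. The dominant term in the FLOP count is $m_1 q_1 (q_1 + p)^2$. This is also the order of the computation for the ECME update, the gradient calculation and the Hessian calculation.

We can simplify some of the expressions for the Hessian by noting that, for arbitrary $q_1 \times q_1$ matrices $U$ and $V$,

\[
\begin{aligned}
\operatorname{tr}\left[ D_i\Omega_1\,U\,D_j\Omega_1\,V \right]
&= \sum_{k_1} \sum_{k_2} \sum_{k_3} \sum_{k_4} d^{(i)}_{k_1 k_2} d^{(j)}_{k_3 k_4} \operatorname{tr}\left[ e_{k_1} e_{k_2}^\top U e_{k_3} e_{k_4}^\top V \right] \\
&= \sum_{k_1} \sum_{k_2} \sum_{k_3} \sum_{k_4} d^{(i)}_{k_1 k_2} d^{(j)}_{k_3 k_4} \left( e_{k_2}^\top U e_{k_3} \right) \left( e_{k_4}^\top V e_{k_1} \right) \\
&= \sum_{k_1} \sum_{k_2} \sum_{k_3} \sum_{k_4} d^{(i)}_{k_1 k_2} d^{(j)}_{k_3 k_4}\, u_{k_2 k_3} v_{k_4 k_1}
\end{aligned}
\]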
where $d^{(i)}_{jk}$ is the $(j, k)$th element of $D_i\Omega_1$, $e_k$ is the $k$-th column of the identity matrix of size $q_1 \times q_1$, and $u_{ij}$ and $v_{ij}$ are the $(i, j)$th elements of $U$ and $V$, respectively. For the one-level model, all the terms involved in the Hessian can be rewritten using this principle and so the full Hessian can be written as

\[
\sum_{k_1} \sum_{k_2} \sum_{k_3} \sum_{k_4} d^{(i)}_{k_1 k_2} d^{(j)}_{k_3 k_4} H_{k_1 k_2 k_3 k_4}
\]

where $H$ is an array of size $q_1 \times q_1 \times q_1 \times q_1$.

6 Models with multiple grouping factors

We will consider models with multiple grouping factors where the random effects associated with each grouping factor are independent between groups and are i.i.d. within groups. That is, the matrix $\Omega$ consists of $s$ blocks of the form $I_{m_j} \otimes \Omega_j$, $j = 1, \dots, s$. We will give details for $s = 2$, in which case

\[
\Omega = \begin{bmatrix} I_{m_1} \otimes \Omega_1 & 0 \\ 0 & I_{m_2} \otimes \Omega_2 \end{bmatrix}. \tag{48}
\]

Extensions to more than two grouping factors follow naturally.

The block-diagonal structure in (48) will also be present in the factor, $\Delta$, and in all the derivatives with respect to components of $\theta$. That is, $\Delta$ consists of $s$ blocks of the form $I_{m_j} \otimes \Delta_j$, $j = 1, \dots, s$, where $\Delta_j$ is of size $q_j \times q_j$. We assume that each component of $\theta$ determines only one block in $\Omega$ and we designate the block associated with component $j$ as $b(j)$ where $1 \le b(j) \le s$, $j = 1, \dots, k$. Then $D_j\Omega$ is zero except for the $b(j)$th diagonal block, which is of the form $I_{m_{b(j)}} \otimes D_j\Omega_{b(j)}$. Notice that the second derivative, $D_j D_i\Omega$, will be zero if $b(i) \ne b(j)$.

We divide the random effects vector $b$ into $s = 2$ blocks and subdivide the $j$th block into $m_j$ components of length $q_j$ denoted by $b_{ji}$, $j = 1, \dots, s$; $i = 1, \dots, m_j$. If $Z^\top Z$ is split into blocks corresponding to the blocks in $\Omega$ then the diagonal blocks in $Z^\top Z$ are themselves block diagonal but, unlike the situation with a single grouping factor, $Z^\top Z$ and $T_{ZZ}$ will have non-zero off-diagonal blocks. These off-diagonal blocks can be sparse or dense according to whether the grouping factors are nested or crossed. We distinguish three cases: completely crossed, partially crossed, and strictly nested.
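The three cases can be told apart from the co-occurrence pattern of the levels of the two factors, which is also the block sparsity pattern of the off-diagonal block $Z_1^\top Z_2$. The sketch below is an illustration under our own conventions: the function name `crossing_pattern` is hypothetical, levels are assumed to be coded as integers $0, 1, \dots$ with every level observed, and the first factor is taken to be the finer one (for example, student within school).

```python
import numpy as np

def crossing_pattern(f1, f2):
    """Classify two grouping factors from their integer level codes."""
    f1 = np.asarray(f1)
    f2 = np.asarray(f2)
    m1, m2 = f1.max() + 1, f2.max() + 1
    # counts[i, j] > 0 exactly when level i of factor 1 occurs with level j of
    # factor 2, i.e. when the (i, j) block of Z_1' Z_2 can be non-zero.
    counts = np.zeros((m1, m2), dtype=int)
    np.add.at(counts, (f1, f2), 1)
    occupied = counts > 0
    if np.all(occupied.sum(axis=1) == 1):
        return "strictly nested"       # one non-zero block per block row
    if np.all(occupied):
        return "completely crossed"    # every combination of levels occurs
    return "partially crossed"
```

For instance, the row and column labels of wells on a plate laid out as a full grid would be classified as completely crossed, while student and school labels in which one student changes school would be classified as partially crossed.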
Completely crossed grouping factors usually occur in designed experiments. For example, biological assays are often conducted by measuring the optical density of liquid samples in wells arranged in a grid on a plate. A common arrangement is 96 wells in a grid of 8 rows by 12 columns. If we assigned one set of random effects to the rows and another set of random effects to the columns then each row would occur with each column, resulting in completely crossed random effects.

An example of nested grouping factors would be a longitudinal study, say records of annual achievement test scores, of students in several schools. For the student factor to be strictly nested within the school factor we require that each student attend only one school during the period of the study. In most large studies this will not be the case. We expect some students will attend more than one school but we do not expect every student to attend every school. In such a case the grouping factors are neither strictly nested nor completely crossed. We describe this situation as partially crossed grouping factors.

If we divide $Z = [Z_1\ Z_2]$ according to the two levels of random effects and correspondingly divide

\[
T_{ZZ} = \begin{bmatrix} T_{ZZ11} & T_{ZZ12} \\ 0 & T_{ZZ22} \end{bmatrix}
\quad \text{and} \quad
R_{ZZ} = \begin{bmatrix} R_{ZZ11} & R_{ZZ12} \\ 0 & R_{ZZ22} \end{bmatrix} \tag{49}
\]

then the block diagonal structure of $Z_1^\top Z_1$ is also present in $T_{ZZ11}$ and we can store and manipulate it accordingly. Even though $Z_2^\top Z_2$ is block diagonal there can be non-zero off-diagonals induced in $T_{ZZ22}$ from $T_{ZZ12}$ when taking the Cholesky decomposition. This is called fill-in [6].

If we consider the matrix $T_{ZZ12}$ to consist of $m_1 \times m_2$ blocks of size $q_1 \times q_2$ then the $(i, j)$th such block will necessarily be zero if the $i$th level of the first grouping factor does not occur in combination with the $j$th level of the second grouping factor. For strictly nested grouping factors there will be only one non-zero block in each of the $m_1$ sets of $q_1$ adjacent rows.
Furthermore, this pattern guarantees that there will be no fill-in of $T_{ZZ22}$ and that it can be stored as a block-diagonal matrix consisting of $m_2$ blocks of size $q_2 \times q_2$. The matrix $R_{ZZ}$ has the same pattern of sparsity so $T_{ZZ}$ and $R_{ZZ}$ only require storage of size $(m_1 q_1 + m_2 q_2)(q_1 + q_2 + p)$ and computing $R_{ZZ}$ is of order $(m_1 q_1 + m_2 q_2)(q_1 + q_2 + p)^2$ FLOPs. The same sparsity pattern will be present in $R_{ZZ}^{-1}$, which we can calculate explicitly, allowing for the evaluation of all the formulas for the ECME update, the gradient, and the Hessian. These results generalize to an arbitrary number of strictly nested levels of grouping factors (and are
incorporated this way in our software).

For fully crossed grouping factors $T_{ZZ12}$ and $T_{ZZ22}$ are both dense and would need to be stored and manipulated as such. Because we can condense the calculation for the first group of random effects only, we choose the order of the grouping factors so that $m_1 q_1 \ge m_2 q_2$. Once the order of the grouping factors is established, the calculations involving $T_{ZZ12}$ and $T_{ZZ22}$ are essentially the same as those involving $T_{ZX}$ and $T_{XX}$ except that producing $R_{ZZ22}$ from $T_{ZZ22}$ also involves $\Delta_2$. The amount of storage required for fully crossed grouping factors is on the order of $(m_1 q_1 + m_2 q_2)(q_1 + m_2 q_2 + p)$ locations and producing $R_{ZZ}$ requires on the order of $(m_1 q_1 + m_2 q_2)(q_1 + m_2 q_2 + p)^2$ FLOPs, which can be considerably more than $(m_1 q_1 + m_2 q_2)(q_1 + q_2 + p)^2$. However, fully crossed grouping factors result in $n \ge m_1 m_2$ and usually come from designed experiments so we do not expect $m_2 q_2$ to be extremely large.

The most interesting case is partially crossed grouping factors where $T_{ZZ12}$ is sparse but does not obey the pattern that there is only one non-zero $q_1 \times q_2$ block in each of the $m_1$ sets of $q_1$ adjacent rows. Having more than one non-zero block in such a set of rows (say because a student attended more than one school during the course of the study) does not generate fill-in in $T_{ZZ12}$ but does generate fill-in in $T_{ZZ22}$ and in $R_{ZZ22}$. If the extent of the crossing is moderate (i.e. an individual student may attend more than one school during the study but no student attends a large proportion of all the schools in the study, so that most of the $q_1 \times q_2$ blocks in any one of the $m_1$ sets of $q_1$ adjacent rows are zero), then it will be advantageous to use sparse matrix representations of $T_{ZZ12}$ (and $R_{ZZ12}$) and to generate $R_{ZZ22}$ using methods for the Cholesky decomposition of sparse semi-definite matrices.
This particular calculation has been extensively studied because it is important in the implementation of interior-point methods in mathematical programming (Wright (1997) [6]).

A critical part of algorithms for the Cholesky decomposition of sparse semi-definite matrices is choosing an ordering of the columns of $Z_2$ so as to minimize fill-in in $R_{ZZ22}$. This only needs to be done once and can be based on the pattern of crossing of the grouping factors.

The amount of storage and computation required to generate $R_{ZZ}$ for partially crossed grouping factors will fall between that for strictly nested grouping factors and that for fully crossed grouping factors. It is unlikely that $R_{ZZ}$ can be inverted in place when the grouping factors are partially crossed. We expect that it will be feasible to evaluate the objective function, the ECME update, and the gradient but that it may not be feasible to evaluate the Hessian.
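Fill-in of the kind described in this section can be seen in a small dense experiment. The sketch below assumes scalar random effects for both factors ($q_1 = q_2 = 1$) and $\Omega = \omega I$, forms $Z^\top Z + \Omega$ with $Z = [Z_1\ Z_2]$, and counts the off-diagonal non-zeros in the $R_{ZZ22}$ block of its upper Cholesky factor. The function name `offdiag_fill` and the tolerance are ours; a real implementation would use a sparse Cholesky routine with a fill-reducing ordering rather than a dense factorization.

```python
import numpy as np

def offdiag_fill(f1, f2, omega=1.0):
    """Count off-diagonal non-zeros in R_ZZ22 for scalar random effects."""
    f1, f2 = np.asarray(f1), np.asarray(f2)
    m1, m2 = f1.max() + 1, f2.max() + 1
    # Indicator model matrices for the two grouping factors, Z = [Z_1, Z_2].
    Z = np.hstack([np.eye(m1)[f1], np.eye(m2)[f2]])
    A = Z.T @ Z + omega * np.eye(m1 + m2)
    R = np.linalg.cholesky(A).T          # upper triangular factor, A = R'R
    R22 = R[m1:, m1:]                    # block for the second grouping factor
    off = R22 - np.diag(np.diag(R22))
    return int(np.sum(np.abs(off) > 1e-12))
```

With strictly nested labels the count should be zero, whereas letting a single student appear in two schools couples those two schools and produces a non-zero off-diagonal entry.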
7 Implementation

We have implemented the computational methods for evaluating the profiled objective function (either log-likelihood or log-restricted-likelihood), the ECME increment and the gradient in the lme4 package for R [7]. Some comparisons of the speed and stability of this implementation versus our previous implementation in the nlme package for R are available in DebRoy (2003) [8].

Future versions of this software will allow for crossed random effects and will provide the Hessian, at least for the completely crossed and strictly nested cases.

8 Acknowledgements

This work was supported by U.S. Army Medical Research and Materiel Command under Contract No. DAMD17-02-C-0119. The views, opinions and/or findings contained in this report are those of the authors and should not be construed as an official Department of the Army position, policy or decision unless so designated by other documentation.

We thank Deepayan Sarkar for helpful discussions.

A Differentiating the penalized RSS

Using (21) and the relationships

\[
D\Phi = \begin{bmatrix} 0 \\ I \end{bmatrix} D\Delta \begin{bmatrix} I & 0 \end{bmatrix}, \qquad
\Phi^+ \tilde y = \begin{bmatrix} \hat b \\ \hat\beta \end{bmatrix}, \qquad \text{and} \qquad
P \tilde y = \begin{bmatrix} y - X\hat\beta - Z\hat b \\ -\Delta \hat b \end{bmatrix} \tag{A.1}
\]

we obtain

\[
\nabla r_{yy}^2(\theta) = 2\,\hat b^\top \Delta^\top D\Delta\,\hat b = \hat b^\top D\Omega\,\hat b. \tag{A.2}
\]
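The gradient formula (A.2), equivalently (25), is easy to check numerically. The sketch below is a dense illustration with names of our choosing (`pls`, `grad_ryy2`): it solves the normal equations for (4), evaluates the penalized residual sum of squares, and returns the analytic directional derivative $\hat b^\top D\Omega\,\hat b$ for a supplied symmetric direction `dOmega`, which can then be compared with a finite difference of $r_{yy}^2$.

```python
import numpy as np

def pls(y, X, Z, Omega):
    """Return (b_hat, r_yy^2) for the penalized least squares problem (4)."""
    q = Z.shape[1]
    # Normal equations Phi'Phi [b; beta] = Phi'y~ for the augmented problem.
    A = np.block([[Z.T @ Z + Omega, Z.T @ X],
                  [X.T @ Z, X.T @ X]])
    rhs = np.concatenate([Z.T @ y, X.T @ y])
    coef = np.linalg.solve(A, rhs)
    b, beta = coef[:q], coef[q:]
    resid = y - X @ beta - Z @ b
    # Penalized RSS: ||y - X beta - Z b||^2 + b' Omega b.
    return b, resid @ resid + b @ Omega @ b

def grad_ryy2(y, X, Z, Omega, dOmega):
    """Analytic derivative (25) of r_yy^2 in the direction dOmega."""
    b, _ = pls(y, X, Z, Omega)
    return b @ dOmega @ b
```

A central difference such as `(pls(y, X, Z, Omega + h * D)[1] - pls(y, X, Z, Omega - h * D)[1]) / (2 * h)` for a small `h` should agree with `grad_ryy2(y, X, Z, Omega, D)` to several digits.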
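The $(j, i)$th element of the Hessian $\nabla^2 r_{yy}^2(\theta) = D_j D_i\, r_{yy}^2$ is

\[
\begin{aligned}
\tfrac{1}{2} D_j D_i\, r_{yy}^2
={}& -D_j\left( \tilde y^\top P\,D_i\Phi\,\Phi^+ \tilde y \right) \\
={}& -\tilde y^\top \left( D_j P\,D_i\Phi\,\Phi^+ + P\,D_j D_i\Phi\,\Phi^+ + P\,D_i\Phi\,D_j\Phi^+ \right) \tilde y \\
={}& \tilde y^\top \left[ P\,D_j\Phi\,\Phi^+ + (\Phi^+)^\top D_j\Phi^\top P \right] D_i\Phi\,\Phi^+ \tilde y
+ \hat b^\top \Delta^\top D_j D_i\Delta\,\hat b \\
& + \tilde y^\top P\,D_i\Phi \left[ \Phi^+ D_j\Phi\,\Phi^+ - (\Phi^\top \Phi)^{-1} D_j\Phi^\top P \right] \tilde y \\
={}& \tilde y^\top (\Phi^+)^\top D_j\Phi^\top D_i\Phi\,\Phi^+ \tilde y
+ \hat b^\top \Delta^\top D_j D_i\Delta\,\hat b \\
& - \tilde y^\top (\Phi^+)^\top D_j\Phi^\top \Phi (\Phi^\top \Phi)^{-1} \Phi^\top D_i\Phi\,\Phi^+ \tilde y \\
& + \tilde y^\top P\,D_j\Phi\,(\Phi^\top \Phi)^{-1} \Phi^\top D_i\Phi\,\Phi^+ \tilde y
+ \tilde y^\top P\,D_i\Phi\,(\Phi^\top \Phi)^{-1} \Phi^\top D_j\Phi\,\Phi^+ \tilde y \\
& - \tilde y^\top P\,D_i\Phi\,(\Phi^\top \Phi)^{-1} D_j\Phi^\top P \tilde y.
\end{aligned} \tag{A.3}
\]

Writing

\[
w_i = D_i\Phi\,\Phi^+ \tilde y = \begin{bmatrix} 0 \\ D_i\Delta\,\hat b \end{bmatrix}, \tag{A.4}
\]
\[
u_i = \left( R^\top \right)^{-1} \Phi^\top D_i\Phi\,\Phi^+ \tilde y = \left( R^\top \right)^{-1} \begin{bmatrix} \Delta^\top D_i\Delta\,\hat b \\ 0 \end{bmatrix}, \tag{A.5}
\]
and
\[
v_i = -\left( R^\top \right)^{-1} D_i\Phi^\top P \tilde y = \left( R^\top \right)^{-1} \begin{bmatrix} D_i\Delta^\top \Delta\,\hat b \\ 0 \end{bmatrix}, \tag{A.6}
\]

where $R$ is the Cholesky decomposition of $\Phi^\top \Phi = R^\top R$, expression (A.3) becomes

\[
\tfrac{1}{2} D_j D_i\, r_{yy}^2(\theta) = \hat b^\top \Delta^\top D_j D_i\Delta\,\hat b + w_j^\top w_i - (u_j + v_j)^\top (u_i + v_i). \tag{A.7}
\]

Noting that $u_i + v_i = \left( R^\top \right)^{-1} \begin{bmatrix} D_i\Omega\,\hat b \\ 0 \end{bmatrix}$, that $w_j^\top w_i = \hat b^\top D_j\Delta^\top D_i\Delta\,\hat b$, and that the upper-left $q \times q$ block of $R^{-1} R^{-\top}$ is $V_b$, we have

\[
D_j D_i\, r_{yy}^2(\theta) = \hat b^\top \left[ D_j D_i\Omega - D_j\Omega\,V_b\,D_i\Omega - D_i\Omega\,V_b\,D_j\Omega \right] \hat b. \tag{A.8}
\]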
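The Hessian formula (A.8), equivalently (26), can be checked in the same spirit. The sketch below assumes a parameterization in which $\Omega$ is linear in $\theta$, so that $D_j D_i\Omega = 0$ and only the $-2\,\hat b^\top D_j\Omega\,V_b\,D_i\Omega\,\hat b$ term remains. The function names `fit` and `hessian_ryy2` are ours, and the explicit inverse of $\Phi^\top \Phi$ is used only because the example is tiny.

```python
import numpy as np

def fit(y, X, Z, Omega):
    """Return b_hat and V_b, the upper-left q x q block of (Phi'Phi)^{-1}."""
    q = Z.shape[1]
    A = np.block([[Z.T @ Z + Omega, Z.T @ X],
                  [X.T @ Z, X.T @ X]])
    Ainv = np.linalg.inv(A)
    coef = Ainv @ np.concatenate([Z.T @ y, X.T @ y])
    return coef[:q], Ainv[:q, :q]

def hessian_ryy2(y, X, Z, Omega, dOmegas):
    """Hessian (26) of r_yy^2 when Omega is linear in theta (D_j D_i Omega = 0)."""
    b, Vb = fit(y, X, Z, Omega)
    k = len(dOmegas)
    H = np.empty((k, k))
    for j in range(k):
        for i in range(k):
            # -2 b' (D_j Omega) V_b (D_i Omega) b; symmetric in (i, j).
            H[j, i] = -2.0 * b @ dOmegas[j] @ Vb @ dOmegas[i] @ b
    return H
```

Differencing the analytic gradient $\hat b^\top D_i\Omega\,\hat b$ with respect to each $\theta_j$ gives an independent estimate of the same matrix.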
References

[1] C. Henderson, Applications of Linear Models in Animal Breeding, University of Guelph, 1984.
[2] G. H. Golub, V. Pereyra, The differentiation of pseudo-inverses and nonlinear least squares problems whose variables separate, SIAM J. of Numerical Analysis 10 (2) (1973).
[3] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. of the Royal Statistical Society, Series B 39 (1977) 1-22.
[4] C. Liu, D. B. Rubin, The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence, Biometrika 81 (1994).
[5] S. W. Raudenbush, A. S. Bryk, Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd Edition, Sage.
[6] S. J. Wright, Primal-Dual Interior Point Methods, SIAM, 1997.
[7] R. Ihaka, R. Gentleman, R: A language for data analysis and graphics, Journal of Computational and Graphical Statistics 5 (3) (1996).
[8] S. DebRoy, Computational methods for mixed-effects models, Ph.D. thesis, U. of Wisconsin-Madison (2003).
More informationProperties of Matrices and Operations on Matrices
Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,
More informationSparse Covariance Selection using Semidefinite Programming
Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support
More informationBiostat 2065 Analysis of Incomplete Data
Biostat 2065 Analysis of Incomplete Data Gong Tang Dept of Biostatistics University of Pittsburgh October 20, 2005 1. Large-sample inference based on ML Let θ is the MLE, then the large-sample theory implies
More informationFundamentals of Unconstrained Optimization
dalmau@cimat.mx Centro de Investigación en Matemáticas CIMAT A.C. Mexico Enero 2016 Outline Introduction 1 Introduction 2 3 4 Optimization Problem min f (x) x Ω where f (x) is a real-valued function The
More information(17) (18)
Module 4 : Solving Linear Algebraic Equations Section 3 : Direct Solution Techniques 3 Direct Solution Techniques Methods for solving linear algebraic equations can be categorized as direct and iterative
More informationTechnical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models
Technical Vignette 5: Understanding intrinsic Gaussian Markov random field spatial models, including intrinsic conditional autoregressive models Christopher Paciorek, Department of Statistics, University
More informationLinear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations.
POLI 7 - Mathematical and Statistical Foundations Prof S Saiegh Fall Lecture Notes - Class 4 October 4, Linear Algebra The analysis of many models in the social sciences reduces to the study of systems
More informationB553 Lecture 5: Matrix Algebra Review
B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations
More informationNumerical Methods I Non-Square and Sparse Linear Systems
Numerical Methods I Non-Square and Sparse Linear Systems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 September 25th, 2014 A. Donev (Courant
More informationChapter 17: Undirected Graphical Models
Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)
More informationLecture 9: Numerical Linear Algebra Primer (February 11st)
10-725/36-725: Convex Optimization Spring 2015 Lecture 9: Numerical Linear Algebra Primer (February 11st) Lecturer: Ryan Tibshirani Scribes: Avinash Siravuru, Guofan Wu, Maosheng Liu Note: LaTeX template
More informationMatrix Factorizations
1 Stat 540, Matrix Factorizations Matrix Factorizations LU Factorization Definition... Given a square k k matrix S, the LU factorization (or decomposition) represents S as the product of two triangular
More informationMixed models in R using the lme4 package Part 5: Generalized linear mixed models
Mixed models in R using the lme4 package Part 5: Generalized linear mixed models Douglas Bates Madison January 11, 2011 Contents 1 Definition 1 2 Links 2 3 Example 7 4 Model building 9 5 Conclusions 14
More informationRestricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model
Restricted Maximum Likelihood in Linear Regression and Linear Mixed-Effects Model Xiuming Zhang zhangxiuming@u.nus.edu A*STAR-NUS Clinical Imaging Research Center October, 015 Summary This report derives
More informationApplications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices
Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices Vahid Dehdari and Clayton V. Deutsch Geostatistical modeling involves many variables and many locations.
More informationPh.D. Qualifying Exam Friday Saturday, January 3 4, 2014
Ph.D. Qualifying Exam Friday Saturday, January 3 4, 2014 Put your solution to each problem on a separate sheet of paper. Problem 1. (5166) Assume that two random samples {x i } and {y i } are independently
More informationInverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1
Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1
MA 575 Linear Models: Cedric E Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1 1 Within-group Correlation Let us recall the simple two-level hierarchical
More informationEstimating Variances and Covariances in a Non-stationary Multivariate Time Series Using the K-matrix
Estimating Variances and Covariances in a Non-stationary Multivariate ime Series Using the K-matrix Stephen P Smith, January 019 Abstract. A second order time series model is described, and generalized
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences)
AMS526: Numerical Analysis I (Numerical Linear Algebra for Computational and Data Sciences) Lecture 1: Course Overview; Matrix Multiplication Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical
More informationLINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationNumerically Stable Cointegration Analysis
Numerically Stable Cointegration Analysis Jurgen A. Doornik Nuffield College, University of Oxford, Oxford OX1 1NF R.J. O Brien Department of Economics University of Southampton November 3, 2001 Abstract
More informationMixed models in R using the lme4 package Part 5: Generalized linear mixed models
Mixed models in R using the lme4 package Part 5: Generalized linear mixed models Douglas Bates 2011-03-16 Contents 1 Generalized Linear Mixed Models Generalized Linear Mixed Models When using linear mixed
More informationFixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility
American Economic Review: Papers & Proceedings 2016, 106(5): 400 404 http://dx.doi.org/10.1257/aer.p20161082 Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility By Gary Chamberlain*
More information5.1 Banded Storage. u = temperature. The five-point difference operator. uh (x, y + h) 2u h (x, y)+u h (x, y h) uh (x + h, y) 2u h (x, y)+u h (x h, y)
5.1 Banded Storage u = temperature u= u h temperature at gridpoints u h = 1 u= Laplace s equation u= h u = u h = grid size u=1 The five-point difference operator 1 u h =1 uh (x + h, y) 2u h (x, y)+u h
More informationOptimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.
Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may
More informationMixed models in R using the lme4 package Part 7: Generalized linear mixed models
Mixed models in R using the lme4 package Part 7: Generalized linear mixed models Douglas Bates University of Wisconsin - Madison and R Development Core Team University of
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationLinear Models Review
Linear Models Review Vectors in IR n will be written as ordered n-tuples which are understood to be column vectors, or n 1 matrices. A vector variable will be indicted with bold face, and the prime sign
More informationOutline. Mixed models in R using the lme4 package Part 5: Generalized linear mixed models. Parts of LMMs carried over to GLMMs
Outline Mixed models in R using the lme4 package Part 5: Generalized linear mixed models Douglas Bates University of Wisconsin - Madison and R Development Core Team UseR!2009,
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationSTAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method.
STAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method. Rebecca Barter May 5, 2015 Linear Regression Review Linear Regression Review
More informationLecture Note 7: Iterative methods for solving linear systems. Xiaoqun Zhang Shanghai Jiao Tong University
Lecture Note 7: Iterative methods for solving linear systems Xiaoqun Zhang Shanghai Jiao Tong University Last updated: December 24, 2014 1.1 Review on linear algebra Norms of vectors and matrices vector
More informationLikelihood-Based Methods
Likelihood-Based Methods Handbook of Spatial Statistics, Chapter 4 Susheela Singh September 22, 2016 OVERVIEW INTRODUCTION MAXIMUM LIKELIHOOD ESTIMATION (ML) RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION (REML)
More informationOR MSc Maths Revision Course
OR MSc Maths Revision Course Tom Byrne School of Mathematics University of Edinburgh t.m.byrne@sms.ed.ac.uk 15 September 2017 General Information Today JCMB Lecture Theatre A, 09:30-12:30 Mathematics revision
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationNearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns
More informationAMS526: Numerical Analysis I (Numerical Linear Algebra)
AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 3: Positive-Definite Systems; Cholesky Factorization Xiangmin Jiao Stony Brook University Xiangmin Jiao Numerical Analysis I 1 / 11 Symmetric
More informationIncomplete Cholesky preconditioners that exploit the low-rank property
anapov@ulb.ac.be ; http://homepages.ulb.ac.be/ anapov/ 1 / 35 Incomplete Cholesky preconditioners that exploit the low-rank property (theory and practice) Artem Napov Service de Métrologie Nucléaire, Université
More informationCSC 576: Linear System
CSC 576: Linear System Ji Liu Department of Computer Science, University of Rochester September 3, 206 Linear Equations Consider solving linear equations where A R m n and b R n m and n could be extremely
More informationAnalysis of Incomplete Climate Data: Estimation of Mean Values and Covariance Matrices and Imputation of Missing Values
1MARCH 001 SCHNEIDER 853 Analysis of Incomplete Climate Data: Estimation of Mean Values and Covariance Matrices and Imputation of Missing Values APIO SCHNEIDER Atmospheric and Oceanic Sciences Program,
More informationSparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices
Sparsity-Preserving Difference of Positive Semidefinite Matrix Representation of Indefinite Matrices Jaehyun Park June 1 2016 Abstract We consider the problem of writing an arbitrary symmetric matrix as
More informationIntroduction to gradient descent
6-1: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction to gradient descent Derivation and intuitions Hessian 6-2: Introduction to gradient descent Prof. J.C. Kao, UCLA Introduction Our
More informationScientific Computing
Scientific Computing Direct solution methods Martin van Gijzen Delft University of Technology October 3, 2018 1 Program October 3 Matrix norms LU decomposition Basic algorithm Cost Stability Pivoting Pivoting
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationLecture Notes. Introduction
5/3/016 Lecture Notes R. Rekaya June 1-10, 016 Introduction Variance components play major role in animal breeding and genetic (estimation of BVs) It has been an active area of research since early 1950
More informationNumerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725
Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple
More informationChapter 12 REML and ML Estimation
Chapter 12 REML and ML Estimation C. R. Henderson 1984 - Guelph 1 Iterative MIVQUE The restricted maximum likelihood estimator (REML) of Patterson and Thompson (1971) can be obtained by iterating on MIVQUE,
More informationWiley. Methods and Applications of Linear Models. Regression and the Analysis. of Variance. Third Edition. Ishpeming, Michigan RONALD R.
Methods and Applications of Linear Models Regression and the Analysis of Variance Third Edition RONALD R. HOCKING PenHock Statistical Consultants Ishpeming, Michigan Wiley Contents Preface to the Third
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More informationECE133A Applied Numerical Computing Additional Lecture Notes
Winter Quarter 2018 ECE133A Applied Numerical Computing Additional Lecture Notes L. Vandenberghe ii Contents 1 LU factorization 1 1.1 Definition................................. 1 1.2 Nonsingular sets
More informationRegression. Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning)
Linear Regression Regression Goal: Learn a mapping from observations (features) to continuous labels given a training set (supervised learning) Example: Height, Gender, Weight Shoe Size Audio features
More information14.2 QR Factorization with Column Pivoting
page 531 Chapter 14 Special Topics Background Material Needed Vector and Matrix Norms (Section 25) Rounding Errors in Basic Floating Point Operations (Section 33 37) Forward Elimination and Back Substitution
More informationExpectation Maximization
Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /
More informationIterative Methods. Splitting Methods
Iterative Methods Splitting Methods 1 Direct Methods Solving Ax = b using direct methods. Gaussian elimination (using LU decomposition) Variants of LU, including Crout and Doolittle Other decomposition
More informationMixture Models and EM
Mixture Models and EM Goal: Introduction to probabilistic mixture models and the expectationmaximization (EM) algorithm. Motivation: simultaneous fitting of multiple model instances unsupervised clustering
More information8. Diagonalization.
8. Diagonalization 8.1. Matrix Representations of Linear Transformations Matrix of A Linear Operator with Respect to A Basis We know that every linear transformation T: R n R m has an associated standard
More informationBare minimum on matrix algebra. Psychology 588: Covariance structure and factor models
Bare minimum on matrix algebra Psychology 588: Covariance structure and factor models Matrix multiplication 2 Consider three notations for linear combinations y11 y1 m x11 x 1p b11 b 1m y y x x b b n1
More information1 Data Arrays and Decompositions
1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is
More informationNumerical Linear Algebra
Numerical Linear Algebra Decompositions, numerical aspects Gerard Sleijpen and Martin van Gijzen September 27, 2017 1 Delft University of Technology Program Lecture 2 LU-decomposition Basic algorithm Cost
More informationMultivariate Statistical Analysis
Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 4 for Applied Multivariate Analysis Outline 1 Eigen values and eigen vectors Characteristic equation Some properties of eigendecompositions
More informationProgram Lecture 2. Numerical Linear Algebra. Gaussian elimination (2) Gaussian elimination. Decompositions, numerical aspects
Numerical Linear Algebra Decompositions, numerical aspects Program Lecture 2 LU-decomposition Basic algorithm Cost Stability Pivoting Cholesky decomposition Sparse matrices and reorderings Gerard Sleijpen
More informationLinear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,
Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,
More informationDELFT UNIVERSITY OF TECHNOLOGY
DELFT UNIVERSITY OF TECHNOLOGY REPORT -09 Computational and Sensitivity Aspects of Eigenvalue-Based Methods for the Large-Scale Trust-Region Subproblem Marielba Rojas, Bjørn H. Fotland, and Trond Steihaug
More information10-725/36-725: Convex Optimization Prerequisite Topics
10-725/36-725: Convex Optimization Prerequisite Topics February 3, 2015 This is meant to be a brief, informal refresher of some topics that will form building blocks in this course. The content of the
More informationChapter 5 Prediction of Random Variables
Chapter 5 Prediction of Random Variables C R Henderson 1984 - Guelph We have discussed estimation of β, regarded as fixed Now we shall consider a rather different problem, prediction of random variables,
More informationOptimization Problems
Optimization Problems The goal in an optimization problem is to find the point at which the minimum (or maximum) of a real, scalar function f occurs and, usually, to find the value of the function at that
More informationLecture 2 Part 1 Optimization
Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss
More informationGeneralized Linear Models
Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n
More informationLinear Methods for Prediction
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationA Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression
A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent
More informationRegression. Oscar García
Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is
More informationHomework 5. Convex Optimization /36-725
Homework 5 Convex Optimization 10-725/36-725 Due Tuesday November 22 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationThe Conjugate Gradient Method
The Conjugate Gradient Method Classical Iterations We have a problem, We assume that the matrix comes from a discretization of a PDE. The best and most popular model problem is, The matrix will be as large
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Multivariate Gaussians Mark Schmidt University of British Columbia Winter 2019 Last Time: Multivariate Gaussian http://personal.kenyon.edu/hartlaub/mellonproject/bivariate2.html
More informationLikelihood Methods. 1 Likelihood Functions. The multivariate normal distribution likelihood function is
Likelihood Methods 1 Likelihood Functions The multivariate normal distribution likelihood function is The log of the likelihood, say L 1 is Ly = π.5n V.5 exp.5y Xb V 1 y Xb. L 1 = 0.5[N lnπ + ln V +y Xb
More informationFast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma
Fast Algorithms for SDPs derived from the Kalman-Yakubovich-Popov Lemma Venkataramanan (Ragu) Balakrishnan School of ECE, Purdue University 8 September 2003 European Union RTN Summer School on Multi-Agent
More informationMaximum Likelihood Estimation
Maximum Likelihood Estimation Prof. C. F. Jeff Wu ISyE 8813 Section 1 Motivation What is parameter estimation? A modeler proposes a model M(θ) for explaining some observed phenomenon θ are the parameters
More informationLecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices
Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is
More informationOutline of GLMs. Definitions
Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density
More informationTesting Restrictions and Comparing Models
Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by
More informationOn Optimal Frame Conditioners
On Optimal Frame Conditioners Chae A. Clark Department of Mathematics University of Maryland, College Park Email: cclark18@math.umd.edu Kasso A. Okoudjou Department of Mathematics University of Maryland,
More informationNumerical Methods in Matrix Computations
Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices
More informationA direct formulation for sparse PCA using semidefinite programming
A direct formulation for sparse PCA using semidefinite programming A. d Aspremont, L. El Ghaoui, M. Jordan, G. Lanckriet ORFE, Princeton University & EECS, U.C. Berkeley Available online at www.princeton.edu/~aspremon
More informationECE 275B Homework #2 Due Thursday MIDTERM is Scheduled for Tuesday, February 21, 2012
Reading ECE 275B Homework #2 Due Thursday 2-16-12 MIDTERM is Scheduled for Tuesday, February 21, 2012 Read and understand the Newton-Raphson and Method of Scores MLE procedures given in Kay, Example 7.11,
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationHomework 4. Convex Optimization /36-725
Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More information1 Outline. 1. Motivation. 2. SUR model. 3. Simultaneous equations. 4. Estimation
1 Outline. 1. Motivation 2. SUR model 3. Simultaneous equations 4. Estimation 2 Motivation. In this chapter, we will study simultaneous systems of econometric equations. Systems of simultaneous equations
More information