Sparse Gaussian conditional random fields

Matt Wytock, J. Zico Kolter
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
{mwytock, zkolter}@cs.cmu.edu

Abstract

We propose sparse Gaussian conditional random fields for multivariate regression. These models are a discriminative analogue of the sparse Gaussian Markov random field, a model that has seen a great deal of work in recent years, particularly using $\ell_1$ methods to learn high-dimensional sparse graphical representations. Our CRF model exploits sparsity between output variables and between inputs and outputs, combining the benefits of sparse inverse covariance estimation and $\ell_1$-regularized regression. Fitting parameters in this model is more challenging than in previous formulations, and we propose a new optimization algorithm based upon the alternating direction method of multipliers (ADMM). Finally, we present experimental results both on synthetic data and on the task of forecasting electrical demand in the Pennsylvania grid; on this latter task, we improve upon the state-of-the-art system currently deployed by the regional electricity operator.

1 Introduction

Sparse inverse covariance estimation using $\ell_1$ methods [1], also known as the graphical lasso [4], enables convex learning of high-dimensional undirected graphical models. These methods estimate the inverse covariance of a zero-mean Gaussian distribution while penalizing the $\ell_1$ norm of its off-diagonal entries; since the entries of the inverse covariance correspond to edges in a Gaussian Markov random field, the method effectively learns a sparsely connected graphical model. In recent years, many algorithms have been proposed for solving this problem, including projected gradient methods [3], smoothed optimization [5], and alternating linearization methods [7].

However, when we have a prediction task in which we predict output variables from input variables, we may not want to model all correlations in the input data. This is the familiar generative/discriminative contrast in machine learning [6], in which it has been repeatedly observed that discriminative approaches can be superior for prediction tasks [9]. In this work, we propose the sparse Gaussian conditional random field, a discriminative model that combines the benefits of discriminative learning and sparse inverse covariance estimation. Specifically, we formulate an $\ell_1$-penalized log-linear model in which we jointly model the covariance structure of the output variables along with their dependence on the input variables. As such, our work can be seen as a generalization of both sparse Gaussian MRFs and $\ell_1$-penalized multivariate regression. Unfortunately, in the discriminative formulation the optimization problem becomes substantially more complex. Here we propose a new optimization technique for such problems based upon ADMM, a general approach that performs well on the standard inverse covariance estimation problem.

As an example of such models, we consider the task of multivariate time series prediction. In particular, we are interested in forecasting electricity demand over multiple locations and time horizons. This problem has huge importance in energy planning and conservation, and a number of forecasting methods are employed ubiquitously in the power industry [8]. On this task, we show improvement over the state-of-the-art deployed solution in the Pennsylvania power grid [10].
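For context (this display is an editorial addition, not part of the original text), the graphical lasso estimator referenced above is commonly written as the convex program below, where $X$ denotes the inverse covariance to be estimated, $S$ the sample covariance, and $\lambda \geq 0$ the regularization weight:

    \min_{X \succ 0} \; -\log\det X + \mathrm{tr}(S X) + \lambda \sum_{i \neq j} |X_{ij}|

The sparse Gaussian CRF introduced next keeps the same $\ell_1$ sparsity-inducing structure but replaces this purely generative objective with a conditional one.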

2 The sparse Gaussian CRF model

Let $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^p$ denote the input and output variables for our prediction task. We formulate the Gaussian CRF as a log-linear model with

    p(y \mid x; \Lambda, \Theta) = \frac{1}{Z(x)} \exp\left\{ -\tfrac{1}{2} y^T \Lambda y - x^T \Theta y \right\}    (1)

where the quadratic term models the conditional dependencies of $y$ and the linear term models the dependence of $y$ on $x$. The model is parametrized by $\Lambda \in \mathbb{R}^{p \times p}$, which corresponds to the inverse covariance matrix, and $\Theta \in \mathbb{R}^{n \times p}$, which maps the inputs to the outputs; an illustration of the model is shown in Figure 1. Since the CRF represents a Gaussian distribution with mean $\mu = -\Lambda^{-1} \Theta^T x$, the partition function is given by

    Z(x) = c\,|\Lambda|^{-1/2} \exp\left\{ \tfrac{1}{2} x^T \Theta \Lambda^{-1} \Theta^T x \right\}.    (2)

Figure 1: Illustration of the sparse Gaussian CRF model (output variables $y_1, \ldots, y_p$ and input variables $x_1, \ldots, x_n$).

In this model, the maximum likelihood estimator is given by the optimization problem

    \underset{\Lambda, \Theta}{\mathrm{minimize}} \;\; f(\Lambda, \Theta) = -\log|\Lambda| + \mathrm{tr}(\Lambda S_{yy}) + 2\,\mathrm{tr}(\Theta S_{yx}) + \mathrm{tr}(\Theta \Lambda^{-1} \Theta^T S_{xx})    (3)

where

    S = E\left( \begin{bmatrix} y \\ x \end{bmatrix} \begin{bmatrix} y \\ x \end{bmatrix}^T \right) = \begin{bmatrix} S_{yy} & S_{yx} \\ S_{xy} & S_{xx} \end{bmatrix}    (4)

which in practice we estimate from the data (below, we denote the number of samples by $m$). We can add $\ell_2$ regularization by adding $\lambda_2$ to the diagonal elements of $S$ (formally, this corresponds to a normal-Wishart prior on $\Lambda$ and the columns of $\Theta$). This is a convex problem, but the total number of parameters is $np + p(p+1)/2$. By applying $\ell_1$ regularization to both $\Theta$ and the off-diagonal elements of $\Lambda$, we encourage both parameters to be sparse and combine the benefits of sparse inverse covariance estimation and $\ell_1$-penalized linear regression.

3 Optimization

The $\mathrm{tr}(\Theta \Lambda^{-1} \Theta^T S_{xx})$ term in (3) makes existing $\ell_1$ methods for sparse inverse covariance estimation no longer applicable. Furthermore, in practice we have found that this term poses significant challenges for gradient-descent methods, such as proximal gradient, due to step-size considerations. Thus, we formulate the optimization problem in an alternative manner using the Schur complement and develop a custom alternating direction method of multipliers (ADMM) algorithm.

First, we note that if we write

    Z = \begin{bmatrix} \Lambda & \Theta^T \\ \Theta & \Psi \end{bmatrix}

then we can formulate an upper bound on (3) as

    \underset{Z \succeq 0}{\mathrm{minimize}} \;\; \bar{f}(Z) = -\log|Z_{11}| + \mathrm{tr}(SZ)    (5)

where $Z \succeq 0$ restricts $Z$ to the positive semidefinite cone. This forces the Schur complement of $Z$, $\Psi - \Theta \Lambda^{-1} \Theta^T$, to be positive semidefinite, and thus, at any feasible point, we have $\bar{f}(Z) \geq f(\Lambda, \Theta)$. Since this bound is always achievable, both optimization problems attain the same value. This formulation also makes clear the close relationship between the Gaussian CRF and MRF: in the case of the CRF, the log det term is restricted to the $(1,1)$ submatrix, which corresponds to our modeling the conditional dependencies between the output variables $y$ only.
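To illustrate the relationship between (3) and the Schur-complement bound (5) numerically, the following short NumPy check is included as an editorial sketch (not code from the paper; the toy data, sizes, and names are arbitrary): the bound is tight when $\Psi = \Theta \Lambda^{-1} \Theta^T$ and is otherwise an upper bound.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, m = 6, 4, 500
    X = rng.standard_normal((m, n))
    Y = rng.standard_normal((m, p))

    # Empirical second-moment blocks from (4), output block first.
    S_yy = Y.T @ Y / m; S_yx = Y.T @ X / m; S_xx = X.T @ X / m
    S = np.block([[S_yy, S_yx], [S_yx.T, S_xx]])

    def f(Lam, Theta):
        """Objective (3): the (scaled) negative conditional log-likelihood."""
        LamInv = np.linalg.inv(Lam)
        return (-np.linalg.slogdet(Lam)[1]
                + np.trace(Lam @ S_yy)
                + 2 * np.trace(Theta @ S_yx)
                + np.trace(Theta @ LamInv @ Theta.T @ S_xx))

    def f_bar(Z):
        """Objective (5): the Schur-complement upper bound."""
        return -np.linalg.slogdet(Z[:p, :p])[1] + np.trace(S @ Z)

    # Any positive definite Lambda and any Theta.
    Lam = np.eye(p) + 0.1 * np.ones((p, p))
    Theta = rng.standard_normal((n, p))
    Psi_min = Theta @ np.linalg.inv(Lam) @ Theta.T   # smallest feasible Psi block

    Z_tight = np.block([[Lam, Theta.T], [Theta, Psi_min]])
    Z_loose = np.block([[Lam, Theta.T], [Theta, Psi_min + np.eye(n)]])

    assert np.isclose(f_bar(Z_tight), f(Lam, Theta))  # tight at Psi = Theta Lam^{-1} Theta^T
    assert f_bar(Z_loose) >= f(Lam, Theta)            # upper bound at other feasible points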

3.1 Alternating direction method of multipliers

ADMM techniques separate the objective into terms involving different variables and add constraints that these variables be equal. The algorithm then alternates between minimizing the augmented Lagrangian over each variable and taking a gradient step on the scaled dual variables; see [2] for a detailed description. In our setting, the ADMM-style optimization problem takes the form

    \underset{Z_1, Z_2, Z_3}{\mathrm{minimize}} \;\; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \lambda \|Z_2\|_{1,*} + \mathbf{1}\{Z_3 \succeq 0\}
    \text{subject to} \;\; Z_1 = Z_2, \; Z_2 = Z_3, \; Z_3 = Z_1    (6)

where $\|Z\|_{1,*} = \|\Lambda\|_{1,\mathrm{off}} + 2\|\Theta\|_1$, the $\ell_1$ norm applied elementwise to the off-diagonal elements of $\Lambda$ and twice to $\Theta$. Space constraints preclude a full description (see Appendix A), but the resulting algorithm is given by the iteration

    Z_1 := T_\rho\big( (Z_2 + Z_3 + V_1)/2 - S/\rho \big)
    Z_2 := \bar{S}_{\lambda/\rho}\big( (Z_1 + Z_3 + V_2)/2 \big)
    Z_3 := P_{\succeq 0}\big( (Z_1 + Z_2 - V_1 - V_2)/2 \big)
    V_1 := V_1 - 2 Z_1 + Z_2 + Z_3
    V_2 := V_2 + Z_1 - 2 Z_2 + Z_3    (7)

where $\bar{S}_{\lambda/\rho}$ is the soft-thresholding operator associated with the $\|\cdot\|_{1,*}$ norm, $P_{\succeq 0}(\cdot)$ is the projection onto the semidefinite cone, and $T_\rho$ is given by

    T_\rho\left( \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \right) = \begin{bmatrix} Q \tilde{D} Q^T & B \\ B^T & C \end{bmatrix}    (8)

where $Q D Q^T = \rho A$ is the eigendecomposition of $\rho A$ and $\tilde{D}$ is a diagonal matrix with

    \tilde{D}_{ii} = \frac{D_{ii} + \sqrt{D_{ii}^2 + 4\rho}}{2\rho}.    (9)

The update for the $(1,1)$ submatrix is similar to other ADMM-style algorithms that achieve state-of-the-art results for sparse inverse covariance selection; see for example [7] and [2].
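The following is a minimal, self-contained NumPy sketch of iteration (7), using the closed-form operators derived in Appendix A.2. It is an editorial rendering rather than the paper's implementation: the function names, the identity initialization, the fixed penalty parameter rho, and the fixed iteration count are assumptions, and details such as stopping criteria may differ in practice.

    import numpy as np

    def T_rho(M, p, rho):
        """Closed-form Z1 update, equations (8)-(9): transform the (1,1) block
        of M by the eigenvalue formula and pass the remaining blocks through."""
        Z = M.copy()
        A = (M[:p, :p] + M[:p, :p].T) / 2          # symmetrize for numerical safety
        D, Q = np.linalg.eigh(rho * A)             # rho * A = Q diag(D) Q^T
        D_new = (D + np.sqrt(D ** 2 + 4 * rho)) / (2 * rho)
        Z[:p, :p] = Q @ np.diag(D_new) @ Q.T
        return Z

    def soft_threshold_star(M, nu, p):
        """Z2 update: elementwise soft-thresholding of the off-diagonal of the
        Lambda block and of the Theta blocks; diag(Lambda) and Psi untouched."""
        T = np.sign(M) * np.maximum(np.abs(M) - nu, 0.0)
        Z = T.copy()
        np.fill_diagonal(Z[:p, :p], np.diag(M[:p, :p]))   # keep diag(Lambda)
        Z[p:, p:] = M[p:, p:]                             # keep the Psi block
        return Z

    def proj_psd(M):
        """Z3 update: projection onto the positive semidefinite cone."""
        D, Q = np.linalg.eigh((M + M.T) / 2)
        return Q @ np.diag(np.maximum(D, 0.0)) @ Q.T

    def sgcrf_admm(S, p, lam, rho=1.0, iters=500):
        """Schematic ADMM loop for the Schur-complement formulation (5)-(7)."""
        d = S.shape[0]
        Z1, Z2, Z3 = np.eye(d), np.eye(d), np.eye(d)
        V1, V2 = np.zeros((d, d)), np.zeros((d, d))
        for _ in range(iters):
            Z1 = T_rho((Z2 + Z3 + V1) / 2 - S / rho, p, rho)
            Z2 = soft_threshold_star((Z1 + Z3 + V2) / 2, lam / rho, p)
            Z3 = proj_psd((Z1 + Z2 - V1 - V2) / 2)
            V1 = V1 - 2 * Z1 + Z2 + Z3
            V2 = V2 + Z1 - 2 * Z2 + Z3
        Lam = Z2[:p, :p]        # estimated output precision Lambda (p x p)
        Theta = Z2[p:, :p]      # estimated input-to-output weights Theta (n x p)
        return Lam, Theta

Here S is the (p+n) x (p+n) sample second-moment matrix of (y, x) with the output block first, matching (4), and the sparse parameters are read off the soft-thresholded iterate Z2; this last choice is one reasonable convention rather than something specified in the text.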

4 Experiments

Synthetic data. Here, we evaluate the performance of our model in a setting with few examples and high-dimensional input and output. We fix $n = p = 50$ and generate the parameter $\Theta \in \mathbb{R}^{n \times p}$ by creating a sparse matrix with a small percentage of nonzero values sampled from $N(0, 1)$. To generate $\Lambda \in \mathbb{R}^{p \times p}$, we create another sparse matrix $A \in \mathbb{R}^{p \times p}$ in the same way and then take $\Lambda = (1 + \alpha) I + A + A^T$, where $\alpha \geq 0$ is sufficient to ensure that $\Lambda$ is positive definite. Given these parameters, we randomly sample Gaussian input features and generate the outputs according to $y \sim N(-\Lambda^{-1} \Theta^T x, \Lambda^{-1})$. In Figure 2 (left), we see the effect of varying $\lambda$ with a fixed sample size $m$: as $\lambda$ goes to zero, the parameters approach the unregularized maximum likelihood estimator and the model overfits. In Figure 2 (right), we compare $\ell_1$ and $\ell_2$ regularization while varying the size of the training set. For each choice of $m$, we choose the optimal $\lambda$ and $\lambda_2$ using cross-validation and evaluate on a test set. Here we see that $\ell_1$ regularization dramatically outperforms $\ell_2$ regularization in the regime with fewer training examples than parameters. (A code sketch of this generation procedure is given after this section.)

Figure 2: Performance on synthetic data. (left) The effect of varying $\ell_1$ regularization. (right) Comparison of $\ell_1$ and $\ell_2$ regularization.

Power demand forecasting. Next, we consider forecasting power demand, the task of PJM, a regional electrical transmission operator in the Eastern United States (http://www.pjm.com). Following PJM's methodology, we make predictions every 6 hours for hourly power demand over the next 24 hours. We jointly predict across 6 locations for a total output dimension of $p = 50$. For input features, we use the previous demand as well as temporal features and PJM's own forecasts. Note that by using PJM's own forecasts as input, our model builds upon an already strong baseline. The input has dimension $n = 30$, and we train the model on one week of data and then predict the next week. In Figure 3 (left), we see that as we vary the $\ell_1$ regularization, our model outperforms the strong PJM baseline; however, as $\lambda$ goes to zero, the model overfits. In Figure 3 (right), we look at a single location and observe that the model closely fits the observed demand, perhaps degrading slightly toward the 24 hour mark. Finally, in Figure 4, we see that the recovered parameters exhibit a high degree of sparsity, with strong conditional dependencies between adjacent time points and weaker conditional dependencies across locations.

Figure 3: Performance on power demand forecasting. (left) The effect of varying $\ell_1$ regularization. (right) A single prediction task for a 24 hour time horizon.

Figure 4: $\Lambda$ (left) and $\Theta$ (right) for the power demand forecasting model.
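For concreteness, the NumPy sketch below reproduces the synthetic data-generation procedure described above. It is an editorial illustration: the nonzero density, the sample size, and the way alpha is chosen are assumptions, since the transcription does not preserve the exact values used in the experiments.

    import numpy as np

    def generate_synthetic(n=50, p=50, m=100, density=0.05, seed=0):
        """Synthetic sparse Gaussian CRF data: sparse Theta and Lambda, then
        y ~ N(-Lambda^{-1} Theta^T x, Lambda^{-1}) for Gaussian inputs x."""
        rng = np.random.default_rng(seed)
        # Sparse Theta with standard normal nonzero entries.
        Theta = rng.standard_normal((n, p)) * (rng.random((n, p)) < density)
        # Sparse A, then Lambda = (1 + alpha) I + A + A^T with alpha large
        # enough that Lambda is positive definite.
        A = rng.standard_normal((p, p)) * (rng.random((p, p)) < density)
        sym = A + A.T
        alpha = max(0.0, -np.linalg.eigvalsh(sym).min())
        Lam = (1.0 + alpha) * np.eye(p) + sym
        # Sample Gaussian inputs and the corresponding outputs.
        X = rng.standard_normal((m, n))
        Sigma = np.linalg.inv(Lam)                 # conditional covariance
        means = -X @ Theta @ Sigma                 # row i equals -Lambda^{-1} Theta^T x_i
        Y = means + rng.multivariate_normal(np.zeros(p), Sigma, size=m)
        return X, Y, Lam, Theta

The returned (X, Y) pair can be fed to the estimation routines sketched earlier, with the true (Lambda, Theta) kept aside for evaluating recovery.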

References

[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data, 2008.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[3] J. C. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2008.
[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.
[5] Z. Lu. Smooth optimization approaches for sparse inverse covariance selection. SIAM Journal on Optimization, 19(4):1807-1827, 2009.
[6] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Neural Information Processing Systems, 2001.
[7] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In Neural Information Processing Systems, 2010.
[8] S. A. Soliman and A. M. Al-Kandari. Electrical Load Forecasting: Modeling and Model Construction. Elsevier, 2010.
[9] C. Sutton and A. McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267-373, 2012.
[10] Various. PJM Manual 19: Load Forecasting and Analysis. PJM, 2012. Available at: http://www.pjm.com/planning/resource-adequacy-planning/~/media/documents/manuals/m19.ashx.

A Derivation of the optimization method

To reiterate, in the unregularized case the MLE problem we are trying to solve is the optimization problem (3),

    \underset{\Lambda, \Theta}{\mathrm{minimize}} \;\; f(\Lambda, \Theta) = -\log|\Lambda| + \mathrm{tr}(\Lambda S_{yy}) + 2\,\mathrm{tr}(\Theta S_{yx}) + \mathrm{tr}(\Theta \Lambda^{-1} \Theta^T S_{xx}).    (10)

The gradient of this function is given by

    \nabla_\Theta f(\Lambda, \Theta) = 2 S_{xy} + 2 S_{xx} \Theta \Lambda^{-1}
    \nabla_\Lambda f(\Lambda, \Theta) = -\Lambda^{-1} + S_{yy} - \Lambda^{-1} \Theta^T S_{xx} \Theta \Lambda^{-1}    (11)

and thus in the case without regularization we can find an analytical solution, which is just the least squares estimate

    \Theta = -S_{xx}^{-1} S_{xy} \Lambda, \qquad \Lambda^{-1} = S_{yy} - S_{yx} S_{xx}^{-1} S_{xy}.    (12)

Unfortunately, when we add $\ell_1$ regularization, the problem can no longer be solved analytically. Furthermore, the form of the gradient above (specifically, the fact that the gradients are ill-conditioned as the eigenvalues of $\Lambda$ go to zero) makes many gradient-descent methods, like proximal gradient, perform poorly in practice. This motivates our Schur complement formulation of the optimization problem, and here we derive an ADMM algorithm for this formulation.

A.1 Derivation of the ADMM update

Our presentation in this section uses much of the same terminology as that in [2]. The Schur complement form of the optimization problem, now adding the regularization term, is

    \underset{Z \succeq 0}{\mathrm{minimize}} \;\; -\log|Z_{11}| + \mathrm{tr}(SZ) + \lambda \|Z\|_{1,*}    (13)

where $\|Z\|_{1,*} = \|\Lambda\|_{1,\mathrm{off}} + 2\|\Theta\|_1$, the $\ell_1$ norm applied elementwise to the off-diagonal elements of $\Lambda$ and twice to $\Theta$. To solve this via an ADMM method, we split each term in the objective into separate terms and add the additional constraint that they all be equal (adding three redundant constraints results in symmetric update equations):

    \underset{Z_1, Z_2, Z_3}{\mathrm{minimize}} \;\; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \lambda \|Z_2\|_{1,*} + \mathbf{1}\{Z_3 \succeq 0\}
    \text{subject to} \;\; Z_1 = Z_2, \; Z_2 = Z_3, \; Z_3 = Z_1.    (14)

The augmented Lagrangian then takes the form

    L_\rho(Z_1, Z_2, Z_3, Y_1, Y_2, Y_3) = -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \lambda \|Z_2\|_{1,*} + \mathbf{1}\{Z_3 \succeq 0\}
        + \mathrm{tr}\,Y_1^T (Z_1 - Z_2) + \mathrm{tr}\,Y_2^T (Z_2 - Z_3) + \mathrm{tr}\,Y_3^T (Z_3 - Z_1)
        + \tfrac{\rho}{4} \|Z_1 - Z_2\|_F^2 + \tfrac{\rho}{4} \|Z_2 - Z_3\|_F^2 + \tfrac{\rho}{4} \|Z_3 - Z_1\|_F^2    (15)

where $\|\cdot\|_F$ signifies the Frobenius norm. ADMM proceeds by alternating minimization over the $Z$ variables and a gradient step on the dual $Y$ variables; by introducing the scaled variables $U_i = \tfrac{1}{\rho} Y_i$, we can simplify the equations to the sequence of updates

    Z_1 := \arg\min_{Z_1} \; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \tfrac{\rho}{4} \|Z_1 - Z_2^k + U_1^k\|_F^2 + \tfrac{\rho}{4} \|Z_1 - Z_3^k - U_3^k\|_F^2
    Z_2 := \arg\min_{Z_2} \; \lambda \|Z_2\|_{1,*} + \tfrac{\rho}{4} \|Z_1 - Z_2 + U_1^k\|_F^2 + \tfrac{\rho}{4} \|Z_2 - Z_3^k + U_2^k\|_F^2
    Z_3 := \arg\min_{Z_3} \; \mathbf{1}\{Z_3 \succeq 0\} + \tfrac{\rho}{4} \|Z_2 - Z_3 + U_2^k\|_F^2 + \tfrac{\rho}{4} \|Z_3 - Z_1 + U_3^k\|_F^2
    U_1 := U_1^k + Z_1 - Z_2
    U_2 := U_2^k + Z_2 - Z_3
    U_3 := U_3^k + Z_3 - Z_1.    (16)
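As a quick check of (11) and (12) (an editorial addition, not from the paper; the sizes, names, and data are arbitrary), the following snippet builds a random dataset, forms the closed-form estimate (12), and verifies numerically that the gradients (11) vanish there.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, m = 8, 5, 2000
    X = rng.standard_normal((m, n))
    B = rng.standard_normal((n, p))
    Y = X @ B + 0.1 * rng.standard_normal((m, p))

    S_yy = Y.T @ Y / m; S_yx = Y.T @ X / m
    S_xy = S_yx.T;      S_xx = X.T @ X / m

    # Closed-form estimate (12): Lambda^{-1} is the residual covariance of the
    # least-squares regression of y on x, and Theta = -S_xx^{-1} S_xy Lambda.
    Lam = np.linalg.inv(S_yy - S_yx @ np.linalg.inv(S_xx) @ S_xy)
    Theta = -np.linalg.inv(S_xx) @ S_xy @ Lam

    # The stationarity conditions (11) should hold at this point.
    LamInv = np.linalg.inv(Lam)
    g_Theta = 2 * S_xy + 2 * S_xx @ Theta @ LamInv
    g_Lam = -LamInv + S_yy - LamInv @ Theta.T @ S_xx @ Theta @ LamInv
    assert np.allclose(g_Theta, 0, atol=1e-7)
    assert np.allclose(g_Lam, 0, atol=1e-7)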

Noting that

    \arg\min_X \; f(X) + \|X - A\|_F^2 + \|X - B\|_F^2 = \arg\min_X \; f(X) + 2 \|X - (A + B)/2\|_F^2    (17)

we can rewrite these equations as involving a single norm penalization:

    Z_1 := \arg\min_{Z_1} \; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \tfrac{\rho}{2} \|Z_1 - (Z_2^k - U_1^k + Z_3^k + U_3^k)/2\|_F^2
    Z_2 := \arg\min_{Z_2} \; \lambda \|Z_2\|_{1,*} + \tfrac{\rho}{2} \|Z_2 - (Z_1 + U_1^k + Z_3^k - U_2^k)/2\|_F^2
    Z_3 := \arg\min_{Z_3} \; \mathbf{1}\{Z_3 \succeq 0\} + \tfrac{\rho}{2} \|Z_3 - (Z_2 + U_2^k + Z_1 - U_3^k)/2\|_F^2.    (18)

To simplify the equations, we define

    V_1^k = U_3^k - U_1^k, \qquad V_2^k = U_1^k - U_2^k, \qquad V_3^k = U_2^k - U_3^k    (19)

where the corresponding updates are then given by

    V_1 := V_1^k - 2 Z_1 + Z_2 + Z_3, \qquad V_2 := V_2^k + Z_1 - 2 Z_2 + Z_3, \qquad V_3 := V_3^k + Z_1 + Z_2 - 2 Z_3.    (20)

Finally, noting that we need only compute the first two $V$'s, since $V_1^k + V_2^k + V_3^k = 0$, our final ADMM update equations take the form

    Z_1 := \arg\min_{Z_1} \; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \tfrac{\rho}{2} \|Z_1 - (Z_2^k + Z_3^k + V_1^k)/2\|_F^2
    Z_2 := \arg\min_{Z_2} \; \lambda \|Z_2\|_{1,*} + \tfrac{\rho}{2} \|Z_2 - (Z_1 + Z_3^k + V_2^k)/2\|_F^2
    Z_3 := \arg\min_{Z_3} \; \mathbf{1}\{Z_3 \succeq 0\} + \tfrac{\rho}{2} \|Z_3 - (Z_1 + Z_2 - V_1^k - V_2^k)/2\|_F^2
    V_1 := V_1^k - 2 Z_1 + Z_2 + Z_3
    V_2 := V_2^k + Z_1 - 2 Z_2 + Z_3.    (21)

A.2 Analytical solution of the ADMM steps

Next, we derive the analytical form for the updates in (21):

    Z_1 := T_\rho\big( (Z_2 + Z_3 + V_1)/2 - S/\rho \big)
    Z_2 := \bar{S}_{\lambda/\rho}\big( (Z_1 + Z_3 + V_2)/2 \big)
    Z_3 := P_{\succeq 0}\big( (Z_1 + Z_2 - V_1 - V_2)/2 \big).    (22)

Starting with the update for $Z_1$, we note that if we set the gradient of $-\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \tfrac{\rho}{2}\|Z_1 - X\|_F^2$ to zero, we have

    -(Z_1)_{11}^{-1} + S_{11} + \rho\big( (Z_1)_{11} - X_{11} \big) = 0
    S_{ij} + \rho\big( (Z_1)_{ij} - X_{ij} \big) = 0 \quad \text{for } (i,j) \text{ outside the } (1,1) \text{ block}.    (23)

The second set of equations implies, for $(i,j)$ outside the $(1,1)$ block,

    (Z_1)_{ij} = X_{ij} - \tfrac{1}{\rho} S_{ij},    (24)

while for the $(1,1)$ submatrix we must find a matrix that satisfies

    \rho (Z_1)_{11} - (Z_1)_{11}^{-1} = \rho X_{11} - S_{11}.    (25)

As derived in [7] (used there for the typical sparse inverse covariance estimation problem), this is solved by taking the eigendecomposition $Q D Q^T = \rho X_{11} - S_{11}$ and multiplying the left-hand side by $Q^T$ and $Q$, so that we have

    \rho \tilde{Z} - \tilde{Z}^{-1} = D    (26)

where $\tilde{Z} = Q^T (Z_1)_{11} Q$. Finally, the solution is given by the quadratic formula

    \tilde{Z}_{ii} = \frac{D_{ii} + \sqrt{D_{ii}^2 + 4\rho}}{2\rho}.    (27)

Combining these blockwise updates into a single operation, we have

    \arg\min_{Z_1} \; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \tfrac{\rho}{2} \|Z_1 - (Z_2^k + Z_3^k + V_1^k)/2\|_F^2 = T_\rho\big( (Z_2^k + Z_3^k + V_1^k)/2 - S/\rho \big)    (28)

where $T_\rho$ is given by

    T_\rho\left( \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \right) = \begin{bmatrix} Q \tilde{D} Q^T & B \\ B^T & C \end{bmatrix}    (29)

where $Q D Q^T = \rho A$ is the eigendecomposition of $\rho A$ and $\tilde{D}$ is a diagonal matrix with

    \tilde{D}_{ii} = \frac{D_{ii} + \sqrt{D_{ii}^2 + 4\rho}}{2\rho}.    (30)

The update for $Z_2$ is accomplished by elementwise soft-thresholding of $\Theta$ and the off-diagonal elements of $\Lambda$. In particular, we exploit the fact that the solution to the optimization problem

    \arg\min_x \; \tfrac{1}{2}(x - z)^2 + \nu |x|    (31)

is given by the soft-thresholding operator

    S_\nu(z) = \begin{cases} z - \nu & z \geq \nu \\ 0 & -\nu \leq z \leq \nu \\ z + \nu & z \leq -\nu. \end{cases}    (32)

We define the matrix operator $\bar{S}_\nu$ as

    \bar{S}_\nu\left( \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \right) = \begin{bmatrix} S_\nu(A - \mathrm{diag}(A)) + \mathrm{diag}(A) & S_\nu(B) \\ S_\nu(B)^T & C \end{bmatrix}    (33)

where the soft-thresholding is applied elementwise. Applied to the minimization problem (21) for $Z_2$, this gives the update above. Finally, the update for $Z_3$ is given simply by projection onto the semidefinite cone,

    P_{\succeq 0}(Z) = Q \tilde{D} Q^T    (34)

where $Q D Q^T = Z$ is the eigendecomposition and $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \max(D_{ii}, 0)$.
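As a small numerical sanity check of the scalar formulas above (an editorial addition; the constants are arbitrary), the snippet below verifies that the root in (27)/(30) solves rho*x - 1/x = d and that the operator in (32) minimizes (31).

    import numpy as np

    rho, d = 2.5, -0.7                                    # arbitrary test values
    x = (d + np.sqrt(d ** 2 + 4 * rho)) / (2 * rho)
    assert x > 0 and abs(rho * x - 1.0 / x - d) < 1e-9    # x solves rho*x - 1/x = d

    def soft_threshold(z, nu):
        """Scalar soft-thresholding operator S_nu from (32)."""
        return np.sign(z) * max(abs(z) - nu, 0.0)

    # S_nu(z) minimizes 0.5*(x - z)**2 + nu*|x|; compare against a grid search.
    z, nu = 1.3, 0.4
    grid = np.linspace(-3, 3, 200001)
    best = grid[np.argmin(0.5 * (grid - z) ** 2 + nu * np.abs(grid))]
    assert abs(soft_threshold(z, nu) - best) < 1e-4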

where = Q T Q. Finally, the solution is given by the quadratic formula ( ) ii = D ii + Dii + 4ρ. (7) ρ Combining these blockwise updates into a single operation, we have arg min log +tr S + ρ (k +3 k +V k )/ F = T ρ (( k +3 k +V k )/ S/ρ) (8) where T ρ is given by T ρ ([ A B B T C ]) [ ] Q T DQ B = B T C and QDQ T = ρa is the eigendecomposition of ρa and D is a diagonal matrix with (9) D ii = D ii + D ii + 4ρ ρ (30) The update for is accomplished by elementwise soft-thresholding on, and the offdiagonal elements of. In particular, we exploit the fact that the solution to the optimization problem arg min (x z) + ν x (3) is given by the soft-thresholding operator { z ν z ν S ν (z) = 0 ν z ν (3) z + ν z ν x We define the Sν operator as ([ ]) [ ] A B Sν Sν (A diag(a)) + diag(a) S B T = ν (B) C S ν (B) T C where the soft-thresholding is applied elementwise. Applied to the minimization problem () for, this gives the update above. Finally, the update for 3 is given simply by projection onto the semidefinite cone (33) P 0 () = Q DQ T (34) where QDQ T = is the eigendecomposition and is a diagonal matrix with D ii = max(d ii, 0). 8