Sparse Gaussian conditional random fields

Matt Wytock, J. Zico Kolter
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
{mwytock, zkolter}@cs.cmu.edu

Abstract

We propose sparse Gaussian conditional random fields for multivariate regression. These models are a discriminative analogue of the sparse Gaussian Markov random field, a model that has seen a great deal of work in recent years, particularly using $\ell_1$ methods to learn high-dimensional sparse graphical representations. Our CRF model exploits sparsity between output variables and between inputs and outputs, combining the benefits of sparse inverse covariance estimation and $\ell_1$-regularized regression. Fitting parameters in this model is more challenging than in previous formulations, and we propose a new optimization algorithm based upon the alternating direction method of multipliers (ADMM). Finally, we present experimental results both on synthetic data and on the task of forecasting electrical demand in the Pennsylvania grid; on this latter task, we improve upon the state-of-the-art system currently deployed by the regional electricity operator.

1 Introduction

Sparse inverse covariance estimation using $\ell_1$ methods [1], also known as the graphical lasso [4], enables convex learning of high-dimensional undirected graphical models. These methods estimate the inverse covariance of a zero-mean Gaussian distribution while penalizing the $\ell_1$ norm of its off-diagonal entries; since the entries of the inverse covariance correspond to edges in a Gaussian Markov random field, the method effectively learns a sparsely connected graphical model. In recent years, many algorithms have been proposed for solving this problem, including projected gradient methods [3], smoothed optimization [5], and alternating linearization methods [7].

However, when we have a prediction task in which we predict output variables from input variables, we may not want to model all correlations in the input data. This is the familiar generative/discriminative contrast in machine learning [6], in which it has been repeatedly observed that discriminative approaches can be superior for prediction tasks [9]. In this work, we propose the sparse Gaussian conditional random field, a discriminative model that combines the benefits of discriminative learning and sparse inverse covariance estimation. Specifically, we formulate an $\ell_1$-penalized log-linear model in which we jointly model the covariance structure of the output variables along with their dependence on the input variables. As such, our work can be seen as a generalization of both sparse Gaussian MRFs and $\ell_1$-penalized multivariate regression. Unfortunately, in the discriminative formulation the optimization problem becomes substantially more complex. Here we propose a new optimization technique for such problems based upon ADMM, a general approach that performs well on the standard inverse covariance estimation problem.

As an example of such models, we consider the task of multivariate time series prediction. In particular, we are interested in forecasting electricity demand over multiple locations and time horizons. This problem has huge importance in energy planning and conservation, and a number of forecasting methods are employed ubiquitously in the power industry [8]. On this task, we show improvement over the state-of-the-art deployed solution in the Pennsylvania power grid [10].
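For context (this display is an editorial addition, not part of the original text), the graphical lasso estimator referenced above is commonly written as the convex program below, where $X$ denotes the inverse covariance to be estimated, $S$ the sample covariance, and $\lambda \geq 0$ the regularization weight:

    \min_{X \succ 0} \; -\log\det X + \mathrm{tr}(S X) + \lambda \sum_{i \neq j} |X_{ij}|

The sparse Gaussian CRF introduced next keeps the same $\ell_1$ sparsity-inducing structure but replaces this purely generative objective with a conditional one.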

2 The sparse Gaussian CRF model

Let $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^p$ denote the input and output variables for our prediction task. We formulate the Gaussian CRF as a log-linear model with

    p(y \mid x; \Lambda, \Theta) = \frac{1}{Z(x)} \exp\left\{ -\tfrac{1}{2} y^T \Lambda y - x^T \Theta y \right\}    (1)

where the quadratic term models the conditional dependencies of $y$ and the linear term models the dependence of $y$ on $x$. The model is parametrized by $\Lambda \in \mathbb{R}^{p \times p}$, which corresponds to the inverse covariance matrix, and $\Theta \in \mathbb{R}^{n \times p}$, which maps the inputs to the outputs; an illustration of the model is shown in Figure 1. Since the CRF represents a Gaussian distribution with mean $\mu = -\Lambda^{-1} \Theta^T x$, the partition function is given by

    Z(x) = c\,|\Lambda|^{-1/2} \exp\left\{ \tfrac{1}{2} x^T \Theta \Lambda^{-1} \Theta^T x \right\}.    (2)

Figure 1: Illustration of the sparse Gaussian CRF model (output variables $y_1, \ldots, y_p$ and input variables $x_1, \ldots, x_n$).

In this model, the maximum likelihood estimator is given by the optimization problem

    \underset{\Lambda, \Theta}{\mathrm{minimize}} \;\; f(\Lambda, \Theta) = -\log|\Lambda| + \mathrm{tr}(\Lambda S_{yy}) + 2\,\mathrm{tr}(\Theta S_{yx}) + \mathrm{tr}(\Theta \Lambda^{-1} \Theta^T S_{xx})    (3)

where

    S = E\left( \begin{bmatrix} y \\ x \end{bmatrix} \begin{bmatrix} y \\ x \end{bmatrix}^T \right) = \begin{bmatrix} S_{yy} & S_{yx} \\ S_{xy} & S_{xx} \end{bmatrix}    (4)

which in practice we estimate from the data (below, we denote the number of samples by $m$). We can add $\ell_2$ regularization by adding $\lambda_2$ to the diagonal elements of $S$ (formally, this corresponds to a normal-Wishart prior on $\Lambda$ and the columns of $\Theta$). This is a convex problem, but the total number of parameters is $np + p(p+1)/2$. By applying $\ell_1$ regularization to both $\Theta$ and the off-diagonal elements of $\Lambda$, we encourage both parameters to be sparse and combine the benefits of sparse inverse covariance estimation and $\ell_1$-penalized linear regression.

3 Optimization

The $\mathrm{tr}(\Theta \Lambda^{-1} \Theta^T S_{xx})$ term in (3) makes existing $\ell_1$ methods for sparse inverse covariance estimation no longer applicable. Furthermore, in practice we have found that this term poses significant challenges for gradient-descent methods, such as proximal gradient, due to step-size considerations. Thus, we formulate the optimization problem in an alternative manner using the Schur complement and develop a custom alternating direction method of multipliers (ADMM) algorithm.

First, we note that if we write

    Z = \begin{bmatrix} \Lambda & \Theta^T \\ \Theta & \Psi \end{bmatrix}

then we can formulate an upper bound on (3) as

    \underset{Z \succeq 0}{\mathrm{minimize}} \;\; \bar{f}(Z) = -\log|Z_{11}| + \mathrm{tr}(SZ)    (5)

where $Z \succeq 0$ restricts $Z$ to the positive semidefinite cone. This forces the Schur complement of $Z$, $\Psi - \Theta \Lambda^{-1} \Theta^T$, to be positive semidefinite, and thus, at any feasible point, we have $\bar{f}(Z) \geq f(\Lambda, \Theta)$. Since this bound is always achievable, both optimization problems attain the same value. This formulation also makes clear the close relationship between the Gaussian CRF and MRF: in the case of the CRF, the log det term is restricted to the $(1,1)$ submatrix, which corresponds to our modeling the conditional dependencies between the output variables $y$ only.
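To illustrate the relationship between (3) and the Schur-complement bound (5) numerically, the following short NumPy check is included as an editorial sketch (not code from the paper; the toy data, sizes, and names are arbitrary): the bound is tight when $\Psi = \Theta \Lambda^{-1} \Theta^T$ and is otherwise an upper bound.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, m = 6, 4, 500
    X = rng.standard_normal((m, n))
    Y = rng.standard_normal((m, p))

    # Empirical second-moment blocks from (4), output block first.
    S_yy = Y.T @ Y / m; S_yx = Y.T @ X / m; S_xx = X.T @ X / m
    S = np.block([[S_yy, S_yx], [S_yx.T, S_xx]])

    def f(Lam, Theta):
        """Objective (3): the (scaled) negative conditional log-likelihood."""
        LamInv = np.linalg.inv(Lam)
        return (-np.linalg.slogdet(Lam)[1]
                + np.trace(Lam @ S_yy)
                + 2 * np.trace(Theta @ S_yx)
                + np.trace(Theta @ LamInv @ Theta.T @ S_xx))

    def f_bar(Z):
        """Objective (5): the Schur-complement upper bound."""
        return -np.linalg.slogdet(Z[:p, :p])[1] + np.trace(S @ Z)

    # Any positive definite Lambda and any Theta.
    Lam = np.eye(p) + 0.1 * np.ones((p, p))
    Theta = rng.standard_normal((n, p))
    Psi_min = Theta @ np.linalg.inv(Lam) @ Theta.T   # smallest feasible Psi block

    Z_tight = np.block([[Lam, Theta.T], [Theta, Psi_min]])
    Z_loose = np.block([[Lam, Theta.T], [Theta, Psi_min + np.eye(n)]])

    assert np.isclose(f_bar(Z_tight), f(Lam, Theta))  # tight at Psi = Theta Lam^{-1} Theta^T
    assert f_bar(Z_loose) >= f(Lam, Theta)            # upper bound at other feasible points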

3.1 Alternating direction method of multipliers

ADMM techniques separate the objective into terms involving different variables and add constraints that these variables be equal. The algorithm then alternates between minimizing the augmented Lagrangian over each variable and taking a gradient step on the scaled dual variables; see [2] for a detailed description. In our setting, the ADMM-style optimization problem takes the form

    \underset{Z_1, Z_2, Z_3}{\mathrm{minimize}} \;\; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \lambda \|Z_2\|_{1,*} + \mathbf{1}\{Z_3 \succeq 0\}
    \text{subject to} \;\; Z_1 = Z_2, \; Z_2 = Z_3, \; Z_3 = Z_1    (6)

where $\|Z\|_{1,*} = \|\Lambda\|_{1,\mathrm{off}} + 2\|\Theta\|_1$, the $\ell_1$ norm applied elementwise to the off-diagonal elements of $\Lambda$ and twice to $\Theta$. Space constraints preclude a full description (see Appendix A), but the resulting algorithm is given by the iteration

    Z_1 := T_\rho\big( (Z_2 + Z_3 + V_1)/2 - S/\rho \big)
    Z_2 := \bar{S}_{\lambda/\rho}\big( (Z_1 + Z_3 + V_2)/2 \big)
    Z_3 := P_{\succeq 0}\big( (Z_1 + Z_2 - V_1 - V_2)/2 \big)
    V_1 := V_1 - 2 Z_1 + Z_2 + Z_3
    V_2 := V_2 + Z_1 - 2 Z_2 + Z_3    (7)

where $\bar{S}_{\lambda/\rho}$ is the soft-thresholding operator associated with the $\|\cdot\|_{1,*}$ norm, $P_{\succeq 0}(\cdot)$ is the projection onto the semidefinite cone, and $T_\rho$ is given by

    T_\rho\left( \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \right) = \begin{bmatrix} Q \tilde{D} Q^T & B \\ B^T & C \end{bmatrix}    (8)

where $Q D Q^T = \rho A$ is the eigendecomposition of $\rho A$ and $\tilde{D}$ is a diagonal matrix with

    \tilde{D}_{ii} = \frac{D_{ii} + \sqrt{D_{ii}^2 + 4\rho}}{2\rho}.    (9)

The update for the $(1,1)$ submatrix is similar to other ADMM-style algorithms that achieve state-of-the-art results for sparse inverse covariance selection; see for example [7] and [2].
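The following is a minimal, self-contained NumPy sketch of iteration (7), using the closed-form operators derived in Appendix A.2. It is an editorial rendering rather than the paper's implementation: the function names, the identity initialization, the fixed penalty parameter rho, and the fixed iteration count are assumptions, and details such as stopping criteria may differ in practice.

    import numpy as np

    def T_rho(M, p, rho):
        """Closed-form Z1 update, equations (8)-(9): transform the (1,1) block
        of M by the eigenvalue formula and pass the remaining blocks through."""
        Z = M.copy()
        A = (M[:p, :p] + M[:p, :p].T) / 2          # symmetrize for numerical safety
        D, Q = np.linalg.eigh(rho * A)             # rho * A = Q diag(D) Q^T
        D_new = (D + np.sqrt(D ** 2 + 4 * rho)) / (2 * rho)
        Z[:p, :p] = Q @ np.diag(D_new) @ Q.T
        return Z

    def soft_threshold_star(M, nu, p):
        """Z2 update: elementwise soft-thresholding of the off-diagonal of the
        Lambda block and of the Theta blocks; diag(Lambda) and Psi untouched."""
        T = np.sign(M) * np.maximum(np.abs(M) - nu, 0.0)
        Z = T.copy()
        np.fill_diagonal(Z[:p, :p], np.diag(M[:p, :p]))   # keep diag(Lambda)
        Z[p:, p:] = M[p:, p:]                             # keep the Psi block
        return Z

    def proj_psd(M):
        """Z3 update: projection onto the positive semidefinite cone."""
        D, Q = np.linalg.eigh((M + M.T) / 2)
        return Q @ np.diag(np.maximum(D, 0.0)) @ Q.T

    def sgcrf_admm(S, p, lam, rho=1.0, iters=500):
        """Schematic ADMM loop for the Schur-complement formulation (5)-(7)."""
        d = S.shape[0]
        Z1, Z2, Z3 = np.eye(d), np.eye(d), np.eye(d)
        V1, V2 = np.zeros((d, d)), np.zeros((d, d))
        for _ in range(iters):
            Z1 = T_rho((Z2 + Z3 + V1) / 2 - S / rho, p, rho)
            Z2 = soft_threshold_star((Z1 + Z3 + V2) / 2, lam / rho, p)
            Z3 = proj_psd((Z1 + Z2 - V1 - V2) / 2)
            V1 = V1 - 2 * Z1 + Z2 + Z3
            V2 = V2 + Z1 - 2 * Z2 + Z3
        Lam = Z2[:p, :p]        # estimated output precision Lambda (p x p)
        Theta = Z2[p:, :p]      # estimated input-to-output weights Theta (n x p)
        return Lam, Theta

Here S is the (p+n) x (p+n) sample second-moment matrix of (y, x) with the output block first, matching (4), and the sparse parameters are read off the soft-thresholded iterate Z2; this last choice is one reasonable convention rather than something specified in the text.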

4 Experiments

Synthetic data. Here, we evaluate the performance of our model in a setting with few examples and high-dimensional input and output. We fix $n = p = 50$ and generate the parameter $\Theta \in \mathbb{R}^{n \times p}$ by creating a sparse matrix with a small percentage of nonzero values sampled from $N(0, 1)$. To generate $\Lambda \in \mathbb{R}^{p \times p}$, we create another sparse matrix $A \in \mathbb{R}^{p \times p}$ in the same way and then take $\Lambda = (1 + \alpha) I + A + A^T$, where $\alpha \geq 0$ is sufficient to ensure that $\Lambda$ is positive definite. Given these parameters, we randomly sample Gaussian input features and generate the outputs according to $y \sim N(-\Lambda^{-1} \Theta^T x, \Lambda^{-1})$. In Figure 2 (left), we see the effect of varying $\lambda$ with a fixed sample size $m$: as $\lambda$ goes to zero, the parameters approach the unregularized maximum likelihood estimator and the model overfits. In Figure 2 (right), we compare $\ell_1$ and $\ell_2$ regularization while varying the size of the training set. For each choice of $m$, we choose the optimal $\lambda$ and $\lambda_2$ using cross-validation and evaluate on a test set. Here we see that $\ell_1$ regularization dramatically outperforms $\ell_2$ regularization in the regime with fewer training examples than parameters. (A code sketch of this generation procedure is given after this section.)

Figure 2: Performance on synthetic data. (left) The effect of varying $\ell_1$ regularization. (right) Comparison of $\ell_1$ and $\ell_2$ regularization.

Power demand forecasting. Next, we consider forecasting power demand, the task of PJM, a regional electrical transmission operator in the Eastern United States (http://www.pjm.com). Following PJM's methodology, we make predictions every 6 hours for hourly power demand over the next 24 hours. We jointly predict across 6 locations for a total output dimension of $p = 50$. For input features, we use the previous demand as well as temporal features and PJM's own forecasts. Note that by using PJM's own forecasts as input, our model builds upon an already strong baseline. The input has dimension $n = 30$, and we train the model on one week of data and then predict the next week. In Figure 3 (left), we see that as we vary the $\ell_1$ regularization, our model outperforms the strong PJM baseline; however, as $\lambda$ goes to zero, the model overfits. In Figure 3 (right), we look at a single location and observe that the model closely fits the observed demand, perhaps degrading slightly toward the 24 hour mark. Finally, in Figure 4, we see that the recovered parameters exhibit a high degree of sparsity, with strong conditional dependencies between adjacent time points and weaker conditional dependencies across locations.

Figure 3: Performance on power demand forecasting. (left) The effect of varying $\ell_1$ regularization. (right) A single prediction task for a 24 hour time horizon.

Figure 4: $\Lambda$ (left) and $\Theta$ (right) for the power demand forecasting model.
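For concreteness, the NumPy sketch below reproduces the synthetic data-generation procedure described above. It is an editorial illustration: the nonzero density, the sample size, and the way alpha is chosen are assumptions, since the transcription does not preserve the exact values used in the experiments.

    import numpy as np

    def generate_synthetic(n=50, p=50, m=100, density=0.05, seed=0):
        """Synthetic sparse Gaussian CRF data: sparse Theta and Lambda, then
        y ~ N(-Lambda^{-1} Theta^T x, Lambda^{-1}) for Gaussian inputs x."""
        rng = np.random.default_rng(seed)
        # Sparse Theta with standard normal nonzero entries.
        Theta = rng.standard_normal((n, p)) * (rng.random((n, p)) < density)
        # Sparse A, then Lambda = (1 + alpha) I + A + A^T with alpha large
        # enough that Lambda is positive definite.
        A = rng.standard_normal((p, p)) * (rng.random((p, p)) < density)
        sym = A + A.T
        alpha = max(0.0, -np.linalg.eigvalsh(sym).min())
        Lam = (1.0 + alpha) * np.eye(p) + sym
        # Sample Gaussian inputs and the corresponding outputs.
        X = rng.standard_normal((m, n))
        Sigma = np.linalg.inv(Lam)                 # conditional covariance
        means = -X @ Theta @ Sigma                 # row i equals -Lambda^{-1} Theta^T x_i
        Y = means + rng.multivariate_normal(np.zeros(p), Sigma, size=m)
        return X, Y, Lam, Theta

The returned (X, Y) pair can be fed to the estimation routines sketched earlier, with the true (Lambda, Theta) kept aside for evaluating recovery.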

References

[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data, 2008.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[3] J. C. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2008.
[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.
[5] Z. Lu. Smooth optimization approaches for sparse inverse covariance selection. SIAM Journal on Optimization, 19(4):1807-1827, 2009.
[6] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Neural Information Processing Systems, 2001.
[7] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In Neural Information Processing Systems, 2010.
[8] S. A. Soliman and A. M. Al-Kandari. Electrical Load Forecasting: Modeling and Model Construction. Elsevier, 2010.
[9] C. Sutton and A. McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267-373, 2012.
[10] Various. PJM Manual 19: Load Forecasting and Analysis. PJM, 2012. Available at: http://www.pjm.com/planning/resource-adequacy-planning/~/media/documents/manuals/m19.ashx.

A Derivation of the optimization method

To reiterate, in the unregularized case the MLE problem we are trying to solve is the optimization problem (3),

    \underset{\Lambda, \Theta}{\mathrm{minimize}} \;\; f(\Lambda, \Theta) = -\log|\Lambda| + \mathrm{tr}(\Lambda S_{yy}) + 2\,\mathrm{tr}(\Theta S_{yx}) + \mathrm{tr}(\Theta \Lambda^{-1} \Theta^T S_{xx}).    (10)

The gradient of this function is given by

    \nabla_\Theta f(\Lambda, \Theta) = 2 S_{xy} + 2 S_{xx} \Theta \Lambda^{-1}
    \nabla_\Lambda f(\Lambda, \Theta) = -\Lambda^{-1} + S_{yy} - \Lambda^{-1} \Theta^T S_{xx} \Theta \Lambda^{-1}    (11)

and thus in the case without regularization we can find an analytical solution, which is just the least squares estimate

    \Theta = -S_{xx}^{-1} S_{xy} \Lambda, \qquad \Lambda^{-1} = S_{yy} - S_{yx} S_{xx}^{-1} S_{xy}.    (12)

Unfortunately, when we add $\ell_1$ regularization, the problem can no longer be solved analytically. Furthermore, the form of the gradient above (specifically, the fact that the gradients are ill-conditioned as the eigenvalues of $\Lambda$ go to zero) makes many gradient-descent methods, like proximal gradient, perform poorly in practice. This motivates our Schur complement formulation of the optimization problem, and here we derive an ADMM algorithm for this formulation.

A.1 Derivation of the ADMM update

Our presentation in this section uses much of the same terminology as that in [2]. The Schur complement form of the optimization problem, now adding the regularization term, is

    \underset{Z \succeq 0}{\mathrm{minimize}} \;\; -\log|Z_{11}| + \mathrm{tr}(SZ) + \lambda \|Z\|_{1,*}    (13)

where $\|Z\|_{1,*} = \|\Lambda\|_{1,\mathrm{off}} + 2\|\Theta\|_1$, the $\ell_1$ norm applied elementwise to the off-diagonal elements of $\Lambda$ and twice to $\Theta$. To solve this via an ADMM method, we split each term in the objective into separate terms and add the additional constraint that they all be equal (adding three redundant constraints results in symmetric update equations):

    \underset{Z_1, Z_2, Z_3}{\mathrm{minimize}} \;\; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \lambda \|Z_2\|_{1,*} + \mathbf{1}\{Z_3 \succeq 0\}
    \text{subject to} \;\; Z_1 = Z_2, \; Z_2 = Z_3, \; Z_3 = Z_1.    (14)

The augmented Lagrangian then takes the form

    L_\rho(Z_1, Z_2, Z_3, Y_1, Y_2, Y_3) = -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \lambda \|Z_2\|_{1,*} + \mathbf{1}\{Z_3 \succeq 0\}
        + \mathrm{tr}\,Y_1^T (Z_1 - Z_2) + \mathrm{tr}\,Y_2^T (Z_2 - Z_3) + \mathrm{tr}\,Y_3^T (Z_3 - Z_1)
        + \tfrac{\rho}{4} \|Z_1 - Z_2\|_F^2 + \tfrac{\rho}{4} \|Z_2 - Z_3\|_F^2 + \tfrac{\rho}{4} \|Z_3 - Z_1\|_F^2    (15)

where $\|\cdot\|_F$ signifies the Frobenius norm. ADMM proceeds by alternating minimization over the $Z$ variables and a gradient step on the dual $Y$ variables; by introducing the scaled variables $U_i = \tfrac{1}{\rho} Y_i$, we can simplify the equations to the sequence of updates

    Z_1 := \arg\min_{Z_1} \; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \tfrac{\rho}{4} \|Z_1 - Z_2^k + U_1^k\|_F^2 + \tfrac{\rho}{4} \|Z_1 - Z_3^k - U_3^k\|_F^2
    Z_2 := \arg\min_{Z_2} \; \lambda \|Z_2\|_{1,*} + \tfrac{\rho}{4} \|Z_1 - Z_2 + U_1^k\|_F^2 + \tfrac{\rho}{4} \|Z_2 - Z_3^k + U_2^k\|_F^2
    Z_3 := \arg\min_{Z_3} \; \mathbf{1}\{Z_3 \succeq 0\} + \tfrac{\rho}{4} \|Z_2 - Z_3 + U_2^k\|_F^2 + \tfrac{\rho}{4} \|Z_3 - Z_1 + U_3^k\|_F^2
    U_1 := U_1^k + Z_1 - Z_2
    U_2 := U_2^k + Z_2 - Z_3
    U_3 := U_3^k + Z_3 - Z_1.    (16)
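As a quick check of (11) and (12) (an editorial addition, not from the paper; the sizes, names, and data are arbitrary), the following snippet builds a random dataset, forms the closed-form estimate (12), and verifies numerically that the gradients (11) vanish there.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, m = 8, 5, 2000
    X = rng.standard_normal((m, n))
    B = rng.standard_normal((n, p))
    Y = X @ B + 0.1 * rng.standard_normal((m, p))

    S_yy = Y.T @ Y / m; S_yx = Y.T @ X / m
    S_xy = S_yx.T;      S_xx = X.T @ X / m

    # Closed-form estimate (12): Lambda^{-1} is the residual covariance of the
    # least-squares regression of y on x, and Theta = -S_xx^{-1} S_xy Lambda.
    Lam = np.linalg.inv(S_yy - S_yx @ np.linalg.inv(S_xx) @ S_xy)
    Theta = -np.linalg.inv(S_xx) @ S_xy @ Lam

    # The stationarity conditions (11) should hold at this point.
    LamInv = np.linalg.inv(Lam)
    g_Theta = 2 * S_xy + 2 * S_xx @ Theta @ LamInv
    g_Lam = -LamInv + S_yy - LamInv @ Theta.T @ S_xx @ Theta @ LamInv
    assert np.allclose(g_Theta, 0, atol=1e-7)
    assert np.allclose(g_Lam, 0, atol=1e-7)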

Noting that

    \arg\min_X \; f(X) + \|X - A\|_F^2 + \|X - B\|_F^2 = \arg\min_X \; f(X) + 2 \|X - (A + B)/2\|_F^2    (17)

we can rewrite these equations as involving a single norm penalization:

    Z_1 := \arg\min_{Z_1} \; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \tfrac{\rho}{2} \|Z_1 - (Z_2^k - U_1^k + Z_3^k + U_3^k)/2\|_F^2
    Z_2 := \arg\min_{Z_2} \; \lambda \|Z_2\|_{1,*} + \tfrac{\rho}{2} \|Z_2 - (Z_1 + U_1^k + Z_3^k - U_2^k)/2\|_F^2
    Z_3 := \arg\min_{Z_3} \; \mathbf{1}\{Z_3 \succeq 0\} + \tfrac{\rho}{2} \|Z_3 - (Z_2 + U_2^k + Z_1 - U_3^k)/2\|_F^2.    (18)

To simplify the equations, we define

    V_1^k = U_3^k - U_1^k, \qquad V_2^k = U_1^k - U_2^k, \qquad V_3^k = U_2^k - U_3^k    (19)

where the corresponding updates are then given by

    V_1 := V_1^k - 2 Z_1 + Z_2 + Z_3, \qquad V_2 := V_2^k + Z_1 - 2 Z_2 + Z_3, \qquad V_3 := V_3^k + Z_1 + Z_2 - 2 Z_3.    (20)

Finally, noting that we need only compute the first two $V$'s, since $V_1^k + V_2^k + V_3^k = 0$, our final ADMM update equations take the form

    Z_1 := \arg\min_{Z_1} \; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \tfrac{\rho}{2} \|Z_1 - (Z_2^k + Z_3^k + V_1^k)/2\|_F^2
    Z_2 := \arg\min_{Z_2} \; \lambda \|Z_2\|_{1,*} + \tfrac{\rho}{2} \|Z_2 - (Z_1 + Z_3^k + V_2^k)/2\|_F^2
    Z_3 := \arg\min_{Z_3} \; \mathbf{1}\{Z_3 \succeq 0\} + \tfrac{\rho}{2} \|Z_3 - (Z_1 + Z_2 - V_1^k - V_2^k)/2\|_F^2
    V_1 := V_1^k - 2 Z_1 + Z_2 + Z_3
    V_2 := V_2^k + Z_1 - 2 Z_2 + Z_3.    (21)

A.2 Analytical solution of the ADMM steps

Next, we derive the analytical form for the updates in (21):

    Z_1 := T_\rho\big( (Z_2 + Z_3 + V_1)/2 - S/\rho \big)
    Z_2 := \bar{S}_{\lambda/\rho}\big( (Z_1 + Z_3 + V_2)/2 \big)
    Z_3 := P_{\succeq 0}\big( (Z_1 + Z_2 - V_1 - V_2)/2 \big).    (22)

Starting with the update for $Z_1$, we note that if we set the gradient of $-\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \tfrac{\rho}{2}\|Z_1 - X\|_F^2$ to zero, we have

    -(Z_1)_{11}^{-1} + S_{11} + \rho\big( (Z_1)_{11} - X_{11} \big) = 0
    S_{ij} + \rho\big( (Z_1)_{ij} - X_{ij} \big) = 0 \quad \text{for } (i,j) \text{ outside the } (1,1) \text{ block}.    (23)

The second set of equations implies, for $(i,j)$ outside the $(1,1)$ block,

    (Z_1)_{ij} = X_{ij} - \tfrac{1}{\rho} S_{ij},    (24)

while for the $(1,1)$ submatrix we must find a matrix that satisfies

    \rho (Z_1)_{11} - (Z_1)_{11}^{-1} = \rho X_{11} - S_{11}.    (25)

As derived in [7] (used there for the typical sparse inverse covariance estimation problem), this is solved by taking the eigendecomposition $Q D Q^T = \rho X_{11} - S_{11}$ and multiplying the left-hand side by $Q^T$ and $Q$, so that we have

    \rho \tilde{Z} - \tilde{Z}^{-1} = D    (26)

where $\tilde{Z} = Q^T (Z_1)_{11} Q$. Finally, the solution is given by the quadratic formula

    \tilde{Z}_{ii} = \frac{D_{ii} + \sqrt{D_{ii}^2 + 4\rho}}{2\rho}.    (27)

Combining these blockwise updates into a single operation, we have

    \arg\min_{Z_1} \; -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \tfrac{\rho}{2} \|Z_1 - (Z_2^k + Z_3^k + V_1^k)/2\|_F^2 = T_\rho\big( (Z_2^k + Z_3^k + V_1^k)/2 - S/\rho \big)    (28)

where $T_\rho$ is given by

    T_\rho\left( \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \right) = \begin{bmatrix} Q \tilde{D} Q^T & B \\ B^T & C \end{bmatrix}    (29)

where $Q D Q^T = \rho A$ is the eigendecomposition of $\rho A$ and $\tilde{D}$ is a diagonal matrix with

    \tilde{D}_{ii} = \frac{D_{ii} + \sqrt{D_{ii}^2 + 4\rho}}{2\rho}.    (30)

The update for $Z_2$ is accomplished by elementwise soft-thresholding of $\Theta$ and the off-diagonal elements of $\Lambda$. In particular, we exploit the fact that the solution to the optimization problem

    \arg\min_x \; \tfrac{1}{2}(x - z)^2 + \nu |x|    (31)

is given by the soft-thresholding operator

    S_\nu(z) = \begin{cases} z - \nu & z \geq \nu \\ 0 & -\nu \leq z \leq \nu \\ z + \nu & z \leq -\nu. \end{cases}    (32)

We define the matrix operator $\bar{S}_\nu$ as

    \bar{S}_\nu\left( \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \right) = \begin{bmatrix} S_\nu(A - \mathrm{diag}(A)) + \mathrm{diag}(A) & S_\nu(B) \\ S_\nu(B)^T & C \end{bmatrix}    (33)

where the soft-thresholding is applied elementwise. Applied to the minimization problem (21) for $Z_2$, this gives the update above. Finally, the update for $Z_3$ is given simply by projection onto the semidefinite cone,

    P_{\succeq 0}(Z) = Q \tilde{D} Q^T    (34)

where $Q D Q^T = Z$ is the eigendecomposition and $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \max(D_{ii}, 0)$.
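As a small numerical sanity check of the scalar formulas above (an editorial addition; the constants are arbitrary), the snippet below verifies that the root in (27)/(30) solves rho*x - 1/x = d and that the operator in (32) minimizes (31).

    import numpy as np

    rho, d = 2.5, -0.7                                    # arbitrary test values
    x = (d + np.sqrt(d ** 2 + 4 * rho)) / (2 * rho)
    assert x > 0 and abs(rho * x - 1.0 / x - d) < 1e-9    # x solves rho*x - 1/x = d

    def soft_threshold(z, nu):
        """Scalar soft-thresholding operator S_nu from (32)."""
        return np.sign(z) * max(abs(z) - nu, 0.0)

    # S_nu(z) minimizes 0.5*(x - z)**2 + nu*|x|; compare against a grid search.
    z, nu = 1.3, 0.4
    grid = np.linspace(-3, 3, 200001)
    best = grid[np.argmin(0.5 * (grid - z) ** 2 + nu * np.abs(grid))]
    assert abs(soft_threshold(z, nu) - best) < 1e-4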

where = Q T Q. Finally, the solution is given by the quadratic formula ( ) ii = D ii + Dii + 4ρ. (7) ρ Combining these blockwise updates into a single operation, we have arg min log +tr S + ρ (k +3 k +V k )/ F = T ρ (( k +3 k +V k )/ S/ρ) (8) where T ρ is given by T ρ ([ A B B T C ]) [ ] Q T DQ B = B T C and QDQ T = ρa is the eigendecomposition of ρa and D is a diagonal matrix with (9) D ii = D ii + D ii + 4ρ ρ (30) The update for is accomplished by elementwise soft-thresholding on, and the offdiagonal elements of. In particular, we exploit the fact that the solution to the optimization problem arg min (x z) + ν x (3) is given by the soft-thresholding operator { z ν z ν S ν (z) = 0 ν z ν (3) z + ν z ν x We define the Sν operator as ([ ]) [ ] A B Sν Sν (A diag(a)) + diag(a) S B T = ν (B) C S ν (B) T C where the soft-thresholding is applied elementwise. Applied to the minimization problem () for, this gives the update above. Finally, the update for 3 is given simply by projection onto the semidefinite cone (33) P 0 () = Q DQ T (34) where QDQ T = is the eigendecomposition and is a diagonal matrix with D ii = max(d ii, 0). 8