Sparse Gaussian conditional random fields


Matt Wytock, J. Zico Kolter
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
{mwytock, ...}

Abstract

We propose sparse Gaussian conditional random fields for multivariate regression. These models are a discriminative analogue of the sparse Gaussian Markov random field, a model that has seen a great deal of work in recent years, particularly using $\ell_1$ methods to learn high-dimensional sparse graphical representations. Our CRF model exploits sparsity between output variables and between inputs and outputs, combining the benefits of sparse inverse covariance estimation and $\ell_1$-regularized regression. Fitting parameters in this model is more challenging than in previous formulations, and we propose a new optimization algorithm based upon the alternating direction method of multipliers (ADMM). Finally, we present experimental results both on synthetic data and on the task of forecasting electrical demand in the Pennsylvania grid; on this latter task, we improve upon the state-of-the-art system currently deployed by the regional electricity operator.

1 Introduction

Sparse inverse covariance estimation using $\ell_1$ methods [1], also known as the graphical lasso [4], enables convex learning of high-dimensional undirected graphical models. These methods estimate the inverse covariance of a zero-mean Gaussian distribution while penalizing the $\ell_1$ norm of the off-diagonal entries; since the entries in the inverse covariance correspond to edges in a Gaussian Markov random field, this method effectively learns a sparsely connected graphical model. In recent years, many algorithms have been proposed for solving this problem, including projected gradient methods [3], smoothed optimization [5], and alternating linearization methods [7].

However, when we have a prediction task in which we predict output variables from input variables, we may not want to model all correlations in the input data. This is the familiar generative/discriminative contrast in machine learning [6], in which it has been repeatedly observed that for prediction tasks, discriminative approaches can be superior [9]. In this work, we propose the sparse Gaussian conditional random field, a discriminative model that combines the benefits of discriminative learning and sparse inverse covariance estimation. Specifically, we formulate an $\ell_1$-penalized log-linear model in which we jointly model the covariance structure of the output variables along with their dependence on the input variables. As such, our work can be seen as a generalization of both sparse Gaussian MRFs and $\ell_1$-penalized multivariate regression. Unfortunately, in the discriminative formulation the optimization problem becomes substantially more complex. Here we propose a new optimization technique for such problems based upon ADMM, a general approach that performs well in the standard inverse covariance estimation problem.

As an example of such models, we consider the task of multivariate time series prediction. In particular, we are interested in forecasting electricity demand over multiple locations and time horizons. This problem is of major importance in energy planning and conservation, and a number of forecasting methods are employed ubiquitously in the power industry [8]. On this task, we show improvement over the state-of-the-art deployed solution in the Pennsylvania power grid [10].

Figure 1: Illustration of the sparse Gaussian CRF model, with output variables $y_1, \ldots, y_p$ and input variables $x_1, \ldots, x_n$.

2 The sparse Gaussian CRF model

Let $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^p$ denote the input and output variables for our prediction task. We formulate the Gaussian CRF as a log-linear model with

    p(y \mid x; \Lambda, \Theta) = \frac{1}{Z(x)} \exp\left\{ -\tfrac{1}{2} y^T \Lambda y - x^T \Theta y \right\}    (1)

where the quadratic term models the conditional dependencies of $y$ and the linear term models the dependence of $y$ on $x$. The model is parametrized by $\Lambda \in \mathbb{R}^{p \times p}$, which corresponds to the inverse covariance matrix, and $\Theta \in \mathbb{R}^{n \times p}$, which maps the inputs to the outputs; an illustration of the model is shown in Figure 1. Since the CRF represents a Gaussian distribution with mean $\mu = -\Lambda^{-1} \Theta^T x$, the partition function is given by

    Z(x) = c \, |\Lambda|^{-1/2} \exp\left\{ \tfrac{1}{2} x^T \Theta \Lambda^{-1} \Theta^T x \right\}.    (2)

In this model, the maximum likelihood estimator is given by the optimization problem

    minimize_{\Lambda, \Theta}  f(\Lambda, \Theta) = -\log|\Lambda| + \mathrm{tr}(\Lambda S_{yy}) + 2\,\mathrm{tr}(\Theta S_{yx}) + \mathrm{tr}(\Theta \Lambda^{-1} \Theta^T S_{xx})    (3)

where

    S = E\left( \begin{bmatrix} y \\ x \end{bmatrix} \begin{bmatrix} y \\ x \end{bmatrix}^T \right) = \begin{bmatrix} S_{yy} & S_{yx} \\ S_{xy} & S_{xx} \end{bmatrix}    (4)

which in practice we estimate from the data (below, we denote the number of samples by $m$). We can add $\ell_2$ regularization by adding $\lambda_2$ to the diagonal elements of $S$ (formally, this corresponds to a normal-Wishart prior on $\Lambda$ and the columns of $\Theta$). This is a convex problem, but the total number of parameters is $np + p(p+1)/2$. By regularizing both $\Theta$ and the off-diagonal elements of $\Lambda$ with an $\ell_1$ penalty, we encourage both parameters to be sparse and combine the benefits of sparse inverse covariance estimation and $\ell_1$-penalized linear regression.

3 Optimization

The $\mathrm{tr}(\Theta \Lambda^{-1} \Theta^T S_{xx})$ term in (3) makes existing $\ell_1$ methods for sparse inverse covariance estimation no longer applicable. Furthermore, in practice we have found that this term poses significant challenges for gradient-descent methods, such as proximal gradient, due to step-size considerations. Thus, we formulate the optimization problem in an alternative manner using the Schur complement and develop a custom alternating direction method of multipliers (ADMM) algorithm.

First, we note that if we write

    Z = \begin{bmatrix} \Lambda & \Theta^T \\ \Theta & \Psi \end{bmatrix}

then we can formulate an upper bound on (3) as

    minimize_{Z \succeq 0}  f(Z) = -\log|Z_{11}| + \mathrm{tr}(SZ)    (5)

where $Z \succeq 0$ restricts $Z$ to the positive semidefinite cone. This forces the Schur complement of $Z$, namely $\Psi - \Theta \Lambda^{-1} \Theta^T$, to be positive semidefinite and thus, at any feasible point, we have $f(Z) \geq f(\Lambda, \Theta)$. Since this bound is always achievable, both optimization problems achieve the same optimal value. This formulation also makes clear the close relationship between the Gaussian CRF and MRF: in the case of the CRF, the log det term is restricted to the $(1,1)$ submatrix, which corresponds to modeling the conditional dependencies between the output variables $y$ only.
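To make the objective concrete, the following is a minimal numpy sketch of the sample version of (3)-(4) and of the lifted bound (5) evaluated at the tight point $\Psi = \Theta \Lambda^{-1} \Theta^T$, where the two coincide; the helper names are our own choices, not notation from the paper.

    import numpy as np

    def moment_blocks(Y, X):
        # Empirical second-moment blocks from (4); Y is m x p, X is m x n,
        # with samples stacked row-wise.
        m = Y.shape[0]
        return Y.T @ Y / m, Y.T @ X / m, X.T @ X / m        # S_yy, S_yx, S_xx

    def f_lam_theta(Lam, Theta, S_yy, S_yx, S_xx):
        # Objective (3): twice the average negative log-likelihood, up to a constant.
        _, logdet = np.linalg.slogdet(Lam)
        B = np.linalg.solve(Lam, Theta.T)                   # Lam^{-1} Theta^T, no explicit inverse
        return (-logdet + np.trace(Lam @ S_yy)
                + 2.0 * np.trace(Theta @ S_yx)
                + np.trace(Theta @ B @ S_xx))

    def f_lifted(Lam, Theta, S_yy, S_yx, S_xx):
        # Bound (5), evaluated at the feasible point with Psi = Theta Lam^{-1} Theta^T;
        # at this point it equals f_lam_theta exactly.
        p = Lam.shape[0]
        Psi = Theta @ np.linalg.solve(Lam, Theta.T)
        Z = np.block([[Lam, Theta.T], [Theta, Psi]])
        S = np.block([[S_yy, S_yx], [S_yx.T, S_xx]])
        _, logdet = np.linalg.slogdet(Z[:p, :p])
        return -logdet + np.trace(S @ Z)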

Figure 2: Performance on synthetic data. (Left) The effect of varying the $\ell_1$ regularization: train and test MSE and log loss as a function of $\lambda$. (Right) Comparison of $\ell_1$ and $\ell_2$ regularization as a function of the number of training examples $m$.

3.1 Alternating direction method of multipliers

ADMM techniques separate the objective into terms involving different variables and add constraints that these variables be equal. The algorithm then alternates between minimizing the augmented Lagrangian and taking a gradient step on the scaled dual variables; see [2] for a detailed description. In our setting, the ADMM-style optimization problem takes the form

    minimize_{Z_1, Z_2, Z_3}  -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \lambda \|Z_2\|_{1,2} + I\{Z_3 \succeq 0\}
    subject to  Z_1 = Z_2, \; Z_2 = Z_3, \; Z_3 = Z_1    (6)

where $\|Z\|_{1,2} = \|\Lambda\|_1 + 2\|\Theta\|_1$, the $\ell_1$ norm applied elementwise to the off-diagonal elements of $\Lambda$ and twice to $\Theta$. Space constraints preclude a full description (see Appendix A), but the resulting algorithm is given by the iteration

    Z_1 := T_\rho\big( (Z_2 + Z_3 + V_1)/2 - S/\rho \big)
    Z_2 := \bar{S}_{\lambda/\rho}\big( (Z_1 + Z_3 + V_2)/2 \big)
    Z_3 := P_{\succeq 0}\big( (Z_1 + Z_2 - V_1 - V_2)/2 \big)    (7)
    V_1 := V_1 - 2 Z_1 + Z_2 + Z_3
    V_2 := V_2 + Z_1 - 2 Z_2 + Z_3

where $\bar{S}_{\lambda/\rho}$ is the soft-thresholding operator applied to the $\|\cdot\|_{1,2}$ norm (made explicit in Appendix A), $P_{\succeq 0}(Z)$ is the projection onto the positive semidefinite cone, and $T_\rho$ is given by

    T_\rho\left( \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \right) = \begin{bmatrix} Q \tilde{D} Q^T & B \\ B^T & C \end{bmatrix}    (8)

where $Q D Q^T = \rho A$ is the eigendecomposition of $\rho A$ and $\tilde{D}$ is a diagonal matrix with

    \tilde{D}_{ii} = \frac{ D_{ii} + \sqrt{ D_{ii}^2 + 4\rho } }{ 2\rho }.    (9)

The update for the $(1,1)$ submatrix is similar to those in other ADMM-style algorithms that achieve state-of-the-art results for sparse inverse covariance selection; see for example [7] and [2].
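For concreteness, here is a minimal numpy sketch of the three operators appearing in (7)-(9), storing $Z$ as a dense $(p+n) \times (p+n)$ array; the function names are ours, and this is a sketch rather than the authors' implementation.

    import numpy as np

    def T_rho(W, p, rho):
        # Operator T_rho from (8)-(9): transform only the leading p x p block,
        # pass the remaining blocks through unchanged.
        Z = W.copy()
        d, Q = np.linalg.eigh(rho * W[:p, :p])              # Q diag(d) Q^T = rho * A
        d_new = (d + np.sqrt(d ** 2 + 4.0 * rho)) / (2.0 * rho)
        Z[:p, :p] = (Q * d_new) @ Q.T
        return Z

    def soft_threshold(A, nu):
        # Elementwise soft-thresholding S_nu, the solution of (31)-(32) in Appendix A.
        return np.sign(A) * np.maximum(np.abs(A) - nu, 0.0)

    def proj_psd(Z):
        # Projection onto the positive semidefinite cone, as in (34).
        d, Q = np.linalg.eigh((Z + Z.T) / 2.0)
        return (Q * np.maximum(d, 0.0)) @ Q.T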

Figure 3: Performance on power demand forecasting. (Left) The effect of varying the $\ell_1$ regularization: train and test MSE versus $\lambda$, compared with the PJM forecast. (Right) A single prediction task over a 24 hour horizon: actual versus predicted demand (MW) by hours in the future.

Figure 4: $\Lambda$ (left) and $\Theta$ (right) for the power demand forecasting model.

4 Experiments

Synthetic data. Here, we evaluate the performance of our model in a setting with few examples and high-dimensional input and output. We fix $n = p = 50$ and generate the parameter $\Theta \in \mathbb{R}^{n \times p}$ by creating a sparse matrix with a small percentage of nonzero values sampled from $\mathcal{N}(0, 1)$. To generate $\Lambda \in \mathbb{R}^{p \times p}$, we create another sparse matrix $A \in \mathbb{R}^{p \times p}$ in the same way and then take $\Lambda = (1 + \alpha) I + A + A^T$, where $\alpha \geq 0$ is sufficient to ensure that $\Lambda$ is positive definite. Given these parameters, we randomly sample Gaussian input features and generate the outputs according to $y \sim \mathcal{N}(-\Lambda^{-1} \Theta^T x, \Lambda^{-1})$.

In Figure 2 (left), we see the effect of varying $\lambda$ with the sample size $m$ held fixed: as $\lambda$ goes to zero, the parameters approach the unregularized maximum likelihood estimator and the model overfits. In Figure 2 (right), we compare $\ell_1$ and $\ell_2$ regularization while varying the size of the training set. For each choice of $m$, we choose the optimal $\lambda$ and $\lambda_2$ using cross-validation and evaluate on a test set. Here we see that $\ell_1$ regularization dramatically outperforms $\ell_2$ regularization in the regime with fewer training examples than parameters.

Power demand forecasting. Next, we consider forecasting power demand, the task of PJM, a regional electrical transmission operator in the Eastern United States (http://www.pjm.com). Following PJM's methodology, we make predictions every 6 hours for hourly power demand over the next 24 hours. We jointly predict across 6 locations for a total output dimension of $p = 50$. For input features, we use the previous demand as well as temporal features and PJM's own forecasts; note that by using PJM's own forecasts as input, our model builds upon an already strong baseline. The input has dimension $n = 30$, and we train the model on one week of data and then predict the next week.

In Figure 3 (left), we see that as we vary the $\ell_1$ regularization, our model outperforms the strong PJM baseline; however, as $\lambda$ goes to zero, the model overfits. In Figure 3 (right), we look at a single location and observe that the model closely fits the observed demand, perhaps degrading slightly toward the 24 hour mark. Finally, in Figure 4, we see that the recovered parameters exhibit a high degree of sparsity, with strong conditional dependencies between immediate time points and weaker conditional dependencies across locations.
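For reference, the synthetic generation process described above can be sketched in a few lines of numpy; the sparsity level and sample count below are placeholders rather than the exact values used in the experiments.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, m = 50, 50, 100           # input/output dimensions; m is a placeholder
    density = 0.01                  # fraction of nonzero entries (placeholder)

    # Sparse Theta and a sparse symmetric base for Lambda, nonzeros drawn from N(0, 1)
    Theta = rng.standard_normal((n, p)) * (rng.random((n, p)) < density)
    A = rng.standard_normal((p, p)) * (rng.random((p, p)) < density)
    M = A + A.T

    # Lambda = (1 + alpha) I + A + A^T, with alpha just large enough for positive definiteness
    alpha = max(0.0, -np.linalg.eigvalsh(M).min())
    Lam = (1.0 + alpha) * np.eye(p) + M

    # Sample x ~ N(0, I) and y ~ N(-Lam^{-1} Theta^T x, Lam^{-1})
    Sigma = np.linalg.inv(Lam)
    X = rng.standard_normal((m, n))
    Y = -X @ Theta @ Sigma + rng.multivariate_normal(np.zeros(p), Sigma, size=m)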

References

[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.

[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.

[3] J. C. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2008.

[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.

[5] Z. Lu. Smooth optimization approaches for sparse inverse covariance selection. SIAM Journal on Optimization, 19(4):1807-1827, 2009.

[6] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Neural Information Processing Systems, 2002.

[7] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In Neural Information Processing Systems, 2010.

[8] S. A. Soliman and A. M. Al-Kandari. Electrical Load Forecasting: Modeling and Model Construction. Elsevier, 2010.

[9] C. Sutton and A. McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267-373, 2012.

[10] Various. PJM Manual 19: Load Forecasting and Analysis. PJM, 2012. Available at: /media/documents/manuals/m19.ashx.

A Derivation of the optimization method

To reiterate, in the unregularized case the MLE problem we are trying to solve is the optimization problem (3),

    minimize_{\Lambda, \Theta}  f(\Lambda, \Theta) = -\log|\Lambda| + \mathrm{tr}(\Lambda S_{yy}) + 2\,\mathrm{tr}(\Theta S_{yx}) + \mathrm{tr}(\Theta \Lambda^{-1} \Theta^T S_{xx}).    (10)

The gradients of this function are given by

    \nabla_\Theta f(\Lambda, \Theta) = 2 S_{xy} + 2 S_{xx} \Theta \Lambda^{-1}
    \nabla_\Lambda f(\Lambda, \Theta) = -\Lambda^{-1} + S_{yy} - \Lambda^{-1} \Theta^T S_{xx} \Theta \Lambda^{-1}    (11)

and thus in the case without regularization we can find an analytical solution, which is just the least squares estimate

    \Theta = -S_{xx}^{-1} S_{xy} \Lambda
    \Lambda^{-1} = S_{yy} - S_{yx} S_{xx}^{-1} S_{xy}.    (12)

Unfortunately, when we add $\ell_1$ regularization, the problem can no longer be solved analytically. Furthermore, the form of the gradients above (specifically, the fact that the gradients are ill-conditioned as the eigenvalues of $\Lambda$ go to zero) makes many gradient-descent methods, like proximal gradient, perform poorly in practice. This motivates our Schur complement formulation of the optimization problem, and here we derive an ADMM algorithm for this formulation.
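Before deriving the ADMM updates, note that the closed form (12) is easy to check numerically; the sketch below (helper names our own) computes the gradients (11) and the unregularized estimate (12), and plugging the latter into the former should return matrices that are numerically zero.

    import numpy as np

    def gradients(Lam, Theta, S_yy, S_yx, S_xx):
        # Gradients from (11); S_xy = S_yx^T.
        Lam_inv = np.linalg.inv(Lam)
        g_Theta = 2.0 * S_yx.T + 2.0 * S_xx @ Theta @ Lam_inv
        g_Lam = -Lam_inv + S_yy - Lam_inv @ Theta.T @ S_xx @ Theta @ Lam_inv
        return g_Lam, g_Theta

    def unregularized_mle(S_yy, S_yx, S_xx):
        # Closed-form stationary point from (12), the least-squares estimate;
        # assumes S_xx is invertible (roughly, more samples than input dimensions).
        S_xy = S_yx.T
        cond_cov = S_yy - S_yx @ np.linalg.solve(S_xx, S_xy)     # = Lam^{-1}
        Lam = np.linalg.inv(cond_cov)
        Theta = -np.linalg.solve(S_xx, S_xy) @ Lam                # = -S_xx^{-1} S_xy Lam
        return Lam, Theta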

A.1 Derivation of the ADMM updates

Our presentation in this section uses much of the same terminology as [2]. The Schur complement form of the optimization problem, now adding the regularization term, is

    minimize_{Z \succeq 0}  -\log|Z_{11}| + \mathrm{tr}(SZ) + \lambda \|Z\|_{1,2}    (13)

where $\|Z\|_{1,2} = \|\Lambda\|_1 + 2\|\Theta\|_1$, the $\ell_1$ norm applied elementwise to the off-diagonal elements of $\Lambda$ and twice to $\Theta$. To solve this via an ADMM method, we split each term of the objective onto a separate copy of the variable and add the constraint that all copies be equal (using three constraints, one of them redundant, results in symmetric update equations):

    minimize_{Z_1, Z_2, Z_3}  -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \lambda \|Z_2\|_{1,2} + I\{Z_3 \succeq 0\}
    subject to  Z_1 = Z_2, \; Z_2 = Z_3, \; Z_3 = Z_1.    (14)

The augmented Lagrangian then takes the form

    L_\rho(Z_1, Z_2, Z_3, Y_1, Y_2, Y_3) = -\log|(Z_1)_{11}| + \mathrm{tr}(S Z_1) + \lambda \|Z_2\|_{1,2} + I\{Z_3 \succeq 0\}
        + \mathrm{tr}\, Y_1^T (Z_1 - Z_2) + \mathrm{tr}\, Y_2^T (Z_2 - Z_3) + \mathrm{tr}\, Y_3^T (Z_3 - Z_1)
        + \tfrac{\rho}{4} \|Z_1 - Z_2\|_F^2 + \tfrac{\rho}{4} \|Z_2 - Z_3\|_F^2 + \tfrac{\rho}{4} \|Z_3 - Z_1\|_F^2    (15)

where $\|\cdot\|_F$ signifies the Frobenius norm. ADMM proceeds by alternating minimization over the primal variables and gradient steps on the dual $Y$ variables; by introducing the scaled variables $U_i = \tfrac{2}{\rho} Y_i$, we can simplify the equations to the sequence of updates

    Z_1 := \arg\min_Z  -\log|Z_{11}| + \mathrm{tr}(SZ) + \tfrac{\rho}{4} \|Z - Z_2^k + U_1^k\|_F^2 + \tfrac{\rho}{4} \|Z_3^k - Z + U_3^k\|_F^2
    Z_2 := \arg\min_Z  \lambda \|Z\|_{1,2} + \tfrac{\rho}{4} \|Z_1 - Z + U_1^k\|_F^2 + \tfrac{\rho}{4} \|Z - Z_3^k + U_2^k\|_F^2
    Z_3 := \arg\min_Z  I\{Z \succeq 0\} + \tfrac{\rho}{4} \|Z_2 - Z + U_2^k\|_F^2 + \tfrac{\rho}{4} \|Z - Z_1 + U_3^k\|_F^2    (16)
    U_1 := U_1^k + Z_1 - Z_2
    U_2 := U_2^k + Z_2 - Z_3
    U_3 := U_3^k + Z_3 - Z_1.

Noting that

    \arg\min_X  f(X) + \tfrac{1}{2}\|X - A\|_F^2 + \tfrac{1}{2}\|X - B\|_F^2 = \arg\min_X  f(X) + \|X - (A + B)/2\|_F^2    (17)

we can rewrite these equations as involving a single norm penalization:

    Z_1 := \arg\min_Z  -\log|Z_{11}| + \mathrm{tr}(SZ) + \tfrac{\rho}{2} \|Z - (Z_2^k - U_1^k + Z_3^k + U_3^k)/2\|_F^2
    Z_2 := \arg\min_Z  \lambda \|Z\|_{1,2} + \tfrac{\rho}{2} \|Z - (Z_1 + U_1^k + Z_3^k - U_2^k)/2\|_F^2
    Z_3 := \arg\min_Z  I\{Z \succeq 0\} + \tfrac{\rho}{2} \|Z - (Z_2 + U_2^k + Z_1 - U_3^k)/2\|_F^2.    (18)

To simplify the equations, we define

    V_1^k = U_3^k - U_1^k,    V_2^k = U_1^k - U_2^k,    V_3^k = U_2^k - U_3^k    (19)

for which the dual updates in (16) become

    V_1 := V_1^k - 2 Z_1 + Z_2 + Z_3
    V_2 := V_2^k + Z_1 - 2 Z_2 + Z_3
    V_3 := V_3^k + Z_1 + Z_2 - 2 Z_3.    (20)

Finally, noting that we need only compute the first two $V$'s, since $V_1^k + V_2^k + V_3^k = 0$, our final ADMM update equations take the form

    Z_1 := \arg\min_Z  -\log|Z_{11}| + \mathrm{tr}(SZ) + \tfrac{\rho}{2} \|Z - (Z_2^k + Z_3^k + V_1^k)/2\|_F^2
    Z_2 := \arg\min_Z  \lambda \|Z\|_{1,2} + \tfrac{\rho}{2} \|Z - (Z_1 + Z_3^k + V_2^k)/2\|_F^2
    Z_3 := \arg\min_Z  I\{Z \succeq 0\} + \tfrac{\rho}{2} \|Z - (Z_1 + Z_2 - V_1^k - V_2^k)/2\|_F^2    (21)
    V_1 := V_1^k - 2 Z_1 + Z_2 + Z_3
    V_2 := V_2^k + Z_1 - 2 Z_2 + Z_3.
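Putting (21) into code, one iteration can be sketched as follows, reusing T_rho, soft_threshold, and proj_psd from the sketch in Section 3; the blockwise thresholding helper mirrors the operator made explicit in (33) below, and all names are our own.

    import numpy as np

    def soft_threshold_blockwise(Z, nu, p):
        # Blockwise thresholding matching (33): soft-threshold the off-diagonal
        # entries of the (1,1) block and the off-diagonal blocks at level nu;
        # leave the diagonal of the (1,1) block and the (2,2) block untouched.
        W = Z.copy()
        A = Z[:p, :p]
        A_off = A - np.diag(np.diag(A))
        W[:p, :p] = soft_threshold(A_off, nu) + np.diag(np.diag(A))
        W[:p, p:] = soft_threshold(Z[:p, p:], nu)
        W[p:, :p] = W[:p, p:].T
        return W

    def admm_step(Z2, Z3, V1, V2, S, p, lam, rho):
        # One pass of the updates in (21): the three proximal steps followed by
        # the dual (V) updates derived from (16) and (19).
        Z1 = T_rho((Z2 + Z3 + V1) / 2.0 - S / rho, p, rho)
        Z2 = soft_threshold_blockwise((Z1 + Z3 + V2) / 2.0, lam / rho, p)
        Z3 = proj_psd((Z1 + Z2 - V1 - V2) / 2.0)
        V1_new = V1 - 2.0 * Z1 + Z2 + Z3
        V2_new = V2 + Z1 - 2.0 * Z2 + Z3
        return Z1, Z2, Z3, V1_new, V2_new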

A.2 Analytical solution of the ADMM steps

Next, we derive the analytical form for the updates in (21):

    Z_1 := T_\rho\big( (Z_2 + Z_3 + V_1)/2 - S/\rho \big)
    Z_2 := \bar{S}_{\lambda/\rho}\big( (Z_1 + Z_3 + V_2)/2 \big)
    Z_3 := P_{\succeq 0}\big( (Z_1 + Z_2 - V_1 - V_2)/2 \big).    (22)

Starting with the update for $Z_1$, writing $X = (Z_2^k + Z_3^k + V_1^k)/2$ and setting the gradient of $-\log|Z_{11}| + \mathrm{tr}(SZ) + \tfrac{\rho}{2}\|Z - X\|_F^2$ to zero, we have

    -Z_{11}^{-1} + S_{11} + \rho (Z_{11} - X_{11}) = 0
    S_{ij} + \rho (Z_{ij} - X_{ij}) = 0  for (i, j) outside the (1,1) submatrix.    (23)

The second set of equations implies, for $(i,j)$ outside the $(1,1)$ submatrix,

    Z_{ij} = X_{ij} - \tfrac{1}{\rho} S_{ij},    (24)

while for the $(1,1)$ submatrix we must find a matrix that satisfies

    \rho Z_{11} - Z_{11}^{-1} = \rho X_{11} - S_{11}.    (25)

As derived in [7] (used there for the typical sparse inverse covariance estimation problem), this is solved by taking the eigendecomposition $Q D Q^T = \rho X_{11} - S_{11}$ and then multiplying on the left and right by $Q^T$ and $Q$, so that we have

    \rho \tilde{Z}_{11} - \tilde{Z}_{11}^{-1} = D    (26)

where $\tilde{Z}_{11} = Q^T Z_{11} Q$. Finally, the solution is given by the quadratic formula

    (\tilde{Z}_{11})_{ii} = \frac{ D_{ii} + \sqrt{ D_{ii}^2 + 4\rho } }{ 2\rho }.    (27)

Combining these blockwise updates into a single operation, we have

    \arg\min_Z  -\log|Z_{11}| + \mathrm{tr}(SZ) + \tfrac{\rho}{2} \|Z - (Z_2^k + Z_3^k + V_1^k)/2\|_F^2 = T_\rho\big( (Z_2^k + Z_3^k + V_1^k)/2 - S/\rho \big)    (28)

where $T_\rho$ is given by

    T_\rho\left( \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \right) = \begin{bmatrix} Q \tilde{D} Q^T & B \\ B^T & C \end{bmatrix}    (29)

with $Q D Q^T = \rho A$ the eigendecomposition of $\rho A$ and $\tilde{D}$ a diagonal matrix with

    \tilde{D}_{ii} = \frac{ D_{ii} + \sqrt{ D_{ii}^2 + 4\rho } }{ 2\rho }.    (30)

The update for $Z_2$ is accomplished by elementwise soft-thresholding on $Z_{12}$, $Z_{21}$, and the off-diagonal elements of $Z_{11}$. In particular, we exploit the fact that the solution to the optimization problem

    \arg\min_x  \tfrac{1}{2} (x - z)^2 + \nu |x|    (31)

is given by the soft-thresholding operator

    S_\nu(z) = \begin{cases} z - \nu & z \geq \nu \\ 0 & -\nu \leq z \leq \nu \\ z + \nu & z \leq -\nu. \end{cases}    (32)

We define the blockwise operator $\bar{S}_\nu$ as

    \bar{S}_\nu\left( \begin{bmatrix} A & B \\ B^T & C \end{bmatrix} \right) = \begin{bmatrix} S_\nu(A - \mathrm{diag}(A)) + \mathrm{diag}(A) & S_\nu(B) \\ S_\nu(B)^T & C \end{bmatrix}    (33)

where the soft-thresholding is applied elementwise. Applied to the minimization problem for $Z_2$ in (21), this gives the update above.

Finally, the update for $Z_3$ is given simply by projection onto the positive semidefinite cone,

    P_{\succeq 0}(Z) = Q \tilde{D} Q^T    (34)

where $Q D Q^T = Z$ is the eigendecomposition and $\tilde{D}$ is a diagonal matrix with $\tilde{D}_{ii} = \max(D_{ii}, 0)$.
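To close the loop, here is a bare-bones driver, assuming the sketches above (moment_blocks, T_rho, soft_threshold, proj_psd, soft_threshold_blockwise, admm_step) are in scope; it uses a fixed penalty parameter $\rho$ and a fixed iteration count in place of a proper stopping criterion.

    import numpy as np

    def sparse_gcrf_admm(Y, X, lam, rho=1.0, iters=500):
        # Fit the sparse Gaussian CRF by iterating the ADMM updates (21)/(22).
        p, n = Y.shape[1], X.shape[1]
        S_yy, S_yx, S_xx = moment_blocks(Y, X)
        S = np.block([[S_yy, S_yx], [S_yx.T, S_xx]])       # joint second-moment matrix (4)
        d = p + n
        Z2, Z3 = np.eye(d), np.eye(d)
        V1, V2 = np.zeros((d, d)), np.zeros((d, d))
        for _ in range(iters):
            Z1, Z2, Z3, V1, V2 = admm_step(Z2, Z3, V1, V2, S, p, lam, rho)
        # Read the parameters off the sparse iterate Z_2 = [[Lam, Theta^T], [Theta, Psi]]
        Lam, Theta = Z2[:p, :p], Z2[p:, :p]
        return Lam, Theta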
