Sparse Gaussian conditional random fields
Matt Wytock, J. Zico Kolter
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
{mwytock, zkolter}@cs.cmu.edu

Abstract

We propose sparse Gaussian conditional random fields for multivariate regression. These models are a discriminative analogue of the sparse Gaussian Markov random field, a model that has seen a great deal of work in past years, particularly using ℓ_1 methods to learn high-dimensional sparse graphical representations. Our CRF model exploits sparsity between output variables and between inputs and outputs, combining the benefits of sparse inverse covariance estimation and ℓ_1-regularized regression. Fitting parameters in this model is more challenging than in previous formulations, and we propose a new optimization algorithm based upon the Alternating Direction Method of Multipliers (ADMM) technique. Finally, we present experimental results both on synthetic data and on the task of forecasting electrical demand in the Pennsylvania grid; on this latter task, we improve upon the state-of-the-art system currently deployed by the regional electricity operator.

1 Introduction

Sparse inverse covariance estimation using ℓ_1 methods [1], also known as the graphical lasso [4], enables convex learning of high-dimensional undirected graphical models. These methods estimate the inverse covariance of a zero-mean Gaussian distribution while penalizing the ℓ_1 norm of the off-diagonal entries; since the entries in the inverse covariance correspond to edges in a Gaussian Markov random field, this method effectively learns a sparsely connected graphical model. In recent years, many algorithms have been proposed for solving this problem, including projected gradient methods [3], smoothed optimization [5], and alternating linearization methods [7].

However, when we have a prediction task in which we predict output variables from input variables, we may not want to model all correlations in the input data. This is the familiar generative/discriminative contrast in machine learning [6], in which it has been repeatedly observed that discriminative approaches can be superior for prediction tasks [9]. In this work, we propose the sparse Gaussian conditional random field, a discriminative model that combines the benefits of discriminative learning and sparse inverse covariance estimation. Specifically, we formulate an ℓ_1-penalized log-linear model in which we jointly model the covariance structure of the output variables along with their dependence on the input variables. As such, our work can be seen as a generalization of sparse Gaussian MRFs and ℓ_1-penalized multivariate regression. Unfortunately, in the discriminative formulation the optimization problem becomes substantially more complex. Here we propose a new optimization technique for such problems based upon ADMM, a general approach that performs well on the standard inverse covariance estimation problem.

As an example of such models, we consider the task of multivariate time series prediction. In particular, we are interested in forecasting electricity demand over multiple locations and time horizons. This problem has huge importance in energy planning and conservation, and there are a number of forecasting methods employed ubiquitously in the power industry [8]. On this task, we show improvement over the state-of-the-art deployed solution in the Pennsylvania power grid [10].
[Figure 1: Illustration of the sparse Gaussian CRF model, with output variables y_1, ..., y_p and input variables x_1, ..., x_n.]

2 The sparse Gaussian CRF model

Let x ∈ R^n and y ∈ R^p denote input and output variables for our prediction task. We formulate the Gaussian CRF as a log-linear model with

p(y | x; Λ, Θ) = (1/Z(x)) exp{ -(1/2) y^T Λ y - x^T Θ y }    (1)

where the quadratic term models the conditional dependencies of y and the linear term models the dependence of y on x. The model is parametrized by Λ ∈ R^{p×p}, which corresponds to the inverse covariance matrix, and Θ ∈ R^{n×p}, which maps the inputs to the outputs; an illustration of the model is shown in Figure 1. Since the CRF represents a Gaussian distribution with mean µ = -Λ^{-1} Θ^T x, the partition function is given by

Z(x) = c |Λ|^{-1/2} exp{ (1/2) x^T Θ Λ^{-1} Θ^T x }.    (2)

In this model, the maximum likelihood estimator is given by the optimization problem

minimize_{Λ,Θ} f(Λ, Θ) = -log|Λ| + tr(Λ S_yy) + 2 tr(Θ S_yx) + tr(Θ Λ^{-1} Θ^T S_xx)    (3)

where

S = E( [y; x] [y; x]^T ) = [ S_yy  S_yx ; S_xy  S_xx ]    (4)

which in practice we estimate from the data (below, we denote the number of samples by m). We can add ℓ_2 regularization by adding λ_2 to the diagonal elements of S (formally, this corresponds to a normal-Wishart prior on Λ and the columns of Θ). This is a convex problem, but the total number of parameters is np + p(p+1)/2. By regularizing both Θ and the off-diagonal elements of Λ with an ℓ_1 penalty, we encourage both parameters to be sparse and combine the benefits of sparse inverse covariance estimation and ℓ_1-penalized linear regression.

3 Optimization

The tr(Θ Λ^{-1} Θ^T S_xx) term in (3) makes existing ℓ_1 methods for sparse inverse covariance estimation no longer applicable. Furthermore, in practice we have found that this term poses significant challenges for gradient-descent methods, such as proximal gradient, due to step-size considerations. Thus, we formulate the optimization problem in an alternative manner using the Schur complement and develop a custom alternating direction method of multipliers (ADMM) algorithm.

First, we note that if we write

Z = [ Λ  Θ^T ; Θ  Ψ ],

then we can formulate an upper bound on (3) as

minimize_{Z ⪰ 0} f̄(Z) = -log|Λ| + tr(SZ)    (5)

where the constraint Z ⪰ 0 restricts Z to the positive semidefinite cone. This forces the Schur complement of Z, Ψ - Θ Λ^{-1} Θ^T, to be positive semidefinite, and thus at any feasible point we have f̄(Z) ≥ f(Λ, Θ). Since this bound is always achievable (take Ψ = Θ Λ^{-1} Θ^T), both optimization problems attain the same value. This formulation also makes clear the close relationship between the Gaussian CRF and MRF: in the case of the CRF, the log det term is restricted to the (1,1) submatrix, which corresponds to our modeling the conditional dependencies between the output variables y only.
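For concreteness, the following is a minimal NumPy sketch of evaluating the objective (3) and the predictive mean from the moment blocks in (4); it is an illustrative reconstruction rather than the implementation used in the experiments, and all function names are ours.

import numpy as np

def second_moments(X, Y):
    # Empirical blocks of (4) from m samples stored as rows of X (m x n) and Y (m x p).
    m = X.shape[0]
    return Y.T @ Y / m, Y.T @ X / m, X.T @ X / m   # S_yy, S_yx, S_xx

def neg_log_likelihood(Lam, Theta, S_yy, S_yx, S_xx):
    # f(Lam, Theta) = -log|Lam| + tr(Lam S_yy) + 2 tr(Theta S_yx) + tr(Theta Lam^{-1} Theta^T S_xx)
    sign, logdet = np.linalg.slogdet(Lam)
    assert sign > 0, "Lam must be positive definite"
    LamInvThetaT = np.linalg.solve(Lam, Theta.T)   # Lam^{-1} Theta^T, without forming the inverse
    return (-logdet
            + np.trace(Lam @ S_yy)
            + 2 * np.trace(Theta @ S_yx)
            + np.trace(Theta @ LamInvThetaT @ S_xx))

def predict_mean(Lam, Theta, x):
    # Predictive mean mu = -Lam^{-1} Theta^T x of p(y | x).
    return -np.linalg.solve(Lam, Theta.T @ x)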
[Figure 2: Performance on synthetic data. (Left) The effect of varying ℓ_1 regularization, showing train and test MSE and log loss as functions of λ. (Right) Comparison of ℓ_1 and ℓ_2 regularization as the number of training examples m varies.]

3.1 Alternating direction method of multipliers

ADMM techniques separate the objective into terms over different variables and add constraints that these variables be equal. The algorithm then alternates between minimizing the augmented Lagrangian over the primal variables and taking a gradient step on the scaled dual variables; see [2] for a detailed description. In our setting, the ADMM-style optimization problem takes the form

minimize_{Z_1, Z_2, Z_3} -log|(Z_1)_(1,1)| + tr(S Z_1) + λ‖Z_2‖_{1,*} + I{Z_3 ⪰ 0}
subject to Z_1 = Z_2, Z_2 = Z_3, Z_3 = Z_1    (6)

where ‖Z‖_{1,*} = ‖Λ‖_1 + 2‖Θ‖_1 denotes the ℓ_1 norm applied elementwise to the off-diagonal elements of Λ and twice to Θ, and I{Z_3 ⪰ 0} is the indicator of the positive semidefinite cone (zero when Z_3 ⪰ 0 and +∞ otherwise). Space constraints preclude a full description (see Appendix A), but the resulting algorithm is given by the iteration

Z_1 := T_ρ( (Z_2 + Z_3 + V_1)/2 - S/ρ )
Z_2 := S*_{λ/ρ}( (Z_1 + Z_3 + V_2)/2 )
Z_3 := P_{⪰0}( (Z_1 + Z_2 - V_1 - V_2)/2 )
V_1 := V_1 + Z_2 + Z_3 - 2 Z_1
V_2 := V_2 + Z_1 + Z_3 - 2 Z_2    (7)

where S*_{λ/ρ} is the soft-thresholding operator associated with the ‖·‖_{1,*} norm, P_{⪰0}(Z) is the projection onto the semidefinite cone, and T_ρ is given by

T_ρ([ A  B ; B^T  C ]) = [ Q D̃ Q^T  B ; B^T  C ]    (8)

where Q D Q^T = ρA is the eigendecomposition of ρA and D̃ is a diagonal matrix with

D̃_ii = ( D_ii + √(D_ii^2 + 4ρ) ) / (2ρ).    (9)

The update for the (1,1) submatrix is similar to other ADMM-style algorithms which achieve state-of-the-art results for sparse inverse covariance selection; see for example [7] and [2].
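The iteration (7) can be sketched directly in code. The following is a minimal, unoptimized NumPy version written from (7)-(9) and the operators derived in Appendix A; the function names, the value of ρ, and the fixed iteration count are illustrative choices, and convergence tests are omitted.

import numpy as np

def t_rho(M, p, rho):
    # T_rho of (8): rebuild the (1,1) block from the eigendecomposition
    # Q D Q^T = rho * M[:p, :p], with Dtilde_ii = (D_ii + sqrt(D_ii^2 + 4 rho)) / (2 rho).
    Z = M.copy()
    D, Q = np.linalg.eigh(rho * M[:p, :p])
    Dt = (D + np.sqrt(D**2 + 4 * rho)) / (2 * rho)
    Z[:p, :p] = (Q * Dt) @ Q.T
    return Z

def soft_threshold_star(M, p, nu):
    # S*_nu of (33): soft-threshold the off-diagonal of the (1,1) block and the
    # Theta blocks elementwise; the diagonal and the (2,2) block are left unchanged.
    st = lambda W: np.sign(W) * np.maximum(np.abs(W) - nu, 0.0)
    Z = M.copy()
    A = M[:p, :p]
    Z[:p, :p] = st(A - np.diag(np.diag(A))) + np.diag(np.diag(A))
    Z[:p, p:] = st(M[:p, p:])
    Z[p:, :p] = Z[:p, p:].T
    return Z

def psd_project(M):
    # P_{>=0} of (34): clip negative eigenvalues to zero.
    D, Q = np.linalg.eigh((M + M.T) / 2)
    return (Q * np.maximum(D, 0)) @ Q.T

def sparse_gcrf_admm(S, p, lam, rho=1.0, iters=500):
    # ADMM iteration (7). S is the (p+n) x (p+n) moment matrix of (4), ordered with
    # the output (y) block first; rho and iters are untuned illustrative defaults.
    d = S.shape[0]
    Z1, Z2, Z3 = np.eye(d), np.eye(d), np.eye(d)
    V1, V2 = np.zeros((d, d)), np.zeros((d, d))
    for _ in range(iters):
        Z1 = t_rho((Z2 + Z3 + V1) / 2 - S / rho, p, rho)
        Z2 = soft_threshold_star((Z1 + Z3 + V2) / 2, p, lam / rho)
        Z3 = psd_project((Z1 + Z2 - V1 - V2) / 2)
        V1 = V1 + Z2 + Z3 - 2 * Z1
        V2 = V2 + Z1 + Z3 - 2 * Z2
    return Z2[:p, :p], Z2[p:, :p]   # (Lam, Theta): Z2 carries the exact zeros

The sparse iterate Z_2 is the natural one to report, since the soft-thresholding step produces exact zeros; a practical implementation would also monitor primal and dual residuals rather than running a fixed number of iterations.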
4 Experiments

Synthetic data. Here, we evaluate the performance of our model in a setting with few examples and high-dimensional input and output. We fix n = p = 50 and generate the parameter Θ ∈ R^{n×p} by creating a sparse matrix with a small fraction of nonzero values sampled from N(0, 1). To generate Λ ∈ R^{p×p}, we create another sparse matrix A ∈ R^{p×p} by sampling from N(0, 1) and then take Λ = (1 + α)I + A + A^T, where α ≥ 0 is sufficient to ensure that Λ is positive definite. Given these parameters, we randomly sample Gaussian input features and generate the outputs according to y ∼ N(-Λ^{-1} Θ^T x, Λ^{-1}). In Figure 2 (left), we see the effects of varying λ with a fixed sample size m: as λ goes to zero, the parameters approach the unregularized maximum likelihood estimator and the model overfits. In Figure 2 (right), we compare ℓ_1 and ℓ_2 regularization while varying the size of the training set. For each choice of m, we choose the optimal λ_1 and λ_2 using cross-validation and evaluate on a test set. Here we see that ℓ_1 regularization dramatically outperforms ℓ_2 regularization in the regime with fewer training examples than parameters.

Power demand forecasting. Next, we consider forecasting power demand, the task of PJM, a regional electrical transmission operator in the Eastern United States (http://www.pjm.com). Following PJM's methodology, we make predictions every 6 hours for hourly power demand over the next 24 hours. We jointly predict across 6 locations for a total output dimension of p = 50. For input features, we use the previous demand as well as temporal features and PJM's own forecasts. Note that by using PJM's own forecasts as input, our model builds upon an already strong baseline. The input has dimension n = 30, and we train the model on one week of data and then predict the next week. In Figure 3 (left), we see that as we vary the ℓ_1 regularization, our model outperforms the strong PJM baseline; however, as λ goes to zero, the model overfits. In Figure 3 (right), we look at a single location and observe that the model closely fits the observed demand, perhaps degrading slightly toward the 24 hour mark. Finally, in Figure 4, we see that the recovered parameters exhibit a high degree of sparsity, as well as strong conditional dependencies between nearby time points and weaker conditional dependencies across locations.

[Figure 3: Performance on power demand forecasting. (Left) The effect of varying ℓ_1 regularization on train and test MSE, compared to the PJM forecast. (Right) A single prediction task over a 24 hour horizon, showing actual versus predicted demand in MW.]

[Figure 4: The estimated Λ (left) and Θ (right) for the power demand forecasting model.]
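Returning to the synthetic experiment, the generation process described above can be sketched as follows. This is a minimal NumPy version provided for illustration; the sparsity fraction and the way α is chosen are placeholders rather than the exact settings used in the experiments.

import numpy as np

rng = np.random.default_rng(0)
n = p = 50
density = 0.01                   # illustrative sparsity level

# Sparse Theta with Gaussian nonzeros.
Theta = rng.standard_normal((n, p)) * (rng.random((n, p)) < density)

# Lam = (1 + alpha) I + A + A^T, with alpha just large enough for positive definiteness.
A = rng.standard_normal((p, p)) * (rng.random((p, p)) < density)
sym = A + A.T
alpha = max(0.0, -np.linalg.eigvalsh(sym).min())
Lam = (1 + alpha) * np.eye(p) + sym

# Sample inputs and outputs: y ~ N(-Lam^{-1} Theta^T x, Lam^{-1}).
m = 100                          # illustrative training-set size
X = rng.standard_normal((m, n))
L = np.linalg.cholesky(np.linalg.inv(Lam))          # Lam^{-1} = L L^T
Y = -np.linalg.solve(Lam, Theta.T @ X.T).T + (L @ rng.standard_normal((p, m))).T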
References

[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.
[2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[3] J. C. Duchi, S. Gould, and D. Koller. Projected subgradient methods for learning sparse Gaussians. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2008.
[4] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.
[5] Z. Lu. Smooth optimization approaches for sparse inverse covariance selection. SIAM Journal on Optimization, 19(4):1807-1827, 2009.
[6] A. Y. Ng and M. I. Jordan. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Neural Information Processing Systems, 2002.
[7] K. Scheinberg, S. Ma, and D. Goldfarb. Sparse inverse covariance selection via alternating linearization methods. In Neural Information Processing Systems, 2010.
[8] S. A. Soliman and A. M. Al-Kandari. Electrical Load Forecasting: Modeling and Model Construction. Elsevier, 2010.
[9] C. Sutton and A. McCallum. An introduction to conditional random fields. Foundations and Trends in Machine Learning, 4(4):267-373, 2012.
[10] Various. PJM Manual 19: Load Forecasting and Analysis. PJM, 2012. Available at: http://www.pjm.com/~/media/documents/manuals/m19.ashx.
A Derivation of the optimization method

To reiterate, in the unregularized case the MLE problem we are trying to solve is the optimization problem (3):

minimize_{Λ,Θ} f(Λ, Θ) = -log|Λ| + tr(Λ S_yy) + 2 tr(Θ S_yx) + tr(Θ Λ^{-1} Θ^T S_xx).    (10)

The gradients of this function are given by

∇_Θ f(Λ, Θ) = 2 S_xy + 2 S_xx Θ Λ^{-1}
∇_Λ f(Λ, Θ) = -Λ^{-1} + S_yy - Λ^{-1} Θ^T S_xx Θ Λ^{-1}    (11)

and thus in the case without regularization we can find an analytical solution, which is just the least squares estimate

Θ = -S_xx^{-1} S_xy Λ
Λ^{-1} = S_yy - S_yx S_xx^{-1} S_xy.    (12)

Unfortunately, when we add ℓ_1 regularization, the problem can no longer be solved analytically. Furthermore, the form of the gradients above (specifically, the fact that the gradients are ill-conditioned as the eigenvalues of Λ go to zero) makes many gradient-descent methods, like proximal gradient, perform poorly in practice. This motivates our Schur complement formulation of the optimization problem, and here we derive an ADMM algorithm for this formulation.

A.1 Derivation of the ADMM updates

Our presentation in this section uses much of the same terminology as [2]. The Schur complement form of the optimization problem, now adding the regularization term, is

minimize_{Z ⪰ 0} -log|Z_(1,1)| + tr(SZ) + λ‖Z‖_{1,*}    (13)

where Z_(1,1) = Λ denotes the (1,1) submatrix of Z and ‖Z‖_{1,*} = ‖Λ‖_1 + 2‖Θ‖_1 is the ℓ_1 norm applied elementwise to the off-diagonal elements of Λ and twice to Θ. To solve this via an ADMM method, we split each term in the objective into a separate variable and add the constraint that they all be equal (adding the third, redundant constraint results in symmetric update equations):

minimize_{Z_1, Z_2, Z_3} -log|(Z_1)_(1,1)| + tr(S Z_1) + λ‖Z_2‖_{1,*} + I{Z_3 ⪰ 0}
subject to Z_1 = Z_2, Z_2 = Z_3, Z_3 = Z_1.    (14)

The augmented Lagrangian then takes the form

L_ρ(Z_1, Z_2, Z_3, Y_1, Y_2, Y_3) = -log|(Z_1)_(1,1)| + tr(S Z_1) + λ‖Z_2‖_{1,*} + I{Z_3 ⪰ 0}
  + tr(Y_1^T (Z_1 - Z_2)) + tr(Y_2^T (Z_2 - Z_3)) + tr(Y_3^T (Z_3 - Z_1))
  + (ρ/4)‖Z_1 - Z_2‖_F^2 + (ρ/4)‖Z_2 - Z_3‖_F^2 + (ρ/4)‖Z_3 - Z_1‖_F^2    (15)

where ‖·‖_F signifies the Frobenius norm. ADMM proceeds by alternating minimization over the primal variables and a gradient ascent step on the dual Y variables; by introducing the scaled variables U_i = (2/ρ) Y_i, we can simplify the equations to the sequence of updates

Z_1 := argmin_Z -log|Z_(1,1)| + tr(SZ) + (ρ/4)‖Z - Z_2^k + U_1^k‖_F^2 + (ρ/4)‖Z - Z_3^k - U_3^k‖_F^2
Z_2 := argmin_Z λ‖Z‖_{1,*} + (ρ/4)‖Z_1 - Z + U_1^k‖_F^2 + (ρ/4)‖Z - Z_3^k + U_2^k‖_F^2
Z_3 := argmin_Z I{Z ⪰ 0} + (ρ/4)‖Z_2 - Z + U_2^k‖_F^2 + (ρ/4)‖Z - Z_1 + U_3^k‖_F^2
U_1 := U_1^k + Z_1 - Z_2
U_2 := U_2^k + Z_2 - Z_3
U_3 := U_3^k + Z_3 - Z_1    (16)
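As a numerical sanity check of (11) and (12), the closed-form unregularized estimate can be verified to zero the gradients. The sketch below is our own illustration; the data-generating choices are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n, p, m = 8, 5, 2000

# Sample a joint distribution and form the moment blocks of (4).
X = rng.standard_normal((m, n))
Y = X @ rng.standard_normal((n, p)) + 0.1 * rng.standard_normal((m, p))
S_yy, S_yx, S_xx = Y.T @ Y / m, Y.T @ X / m, X.T @ X / m
S_xy = S_yx.T

# Closed-form solution (12).
Lam = np.linalg.inv(S_yy - S_yx @ np.linalg.solve(S_xx, S_xy))
Theta = -np.linalg.solve(S_xx, S_xy) @ Lam

# Gradients (11); both should vanish (up to round-off) at the MLE.
LamInv = np.linalg.inv(Lam)
grad_Theta = 2 * S_xy + 2 * S_xx @ Theta @ LamInv
grad_Lam = -LamInv + S_yy - LamInv @ Theta.T @ S_xx @ Theta @ LamInv
print(np.abs(grad_Theta).max(), np.abs(grad_Lam).max())   # both ~ 1e-12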
Noting that

argmin_X f(X) + (ρ/4)‖X - A‖_F^2 + (ρ/4)‖X - B‖_F^2 = argmin_X f(X) + (ρ/2)‖X - (A + B)/2‖_F^2,    (17)

we can rewrite these updates as involving a single norm penalization:

Z_1 := argmin_Z -log|Z_(1,1)| + tr(SZ) + (ρ/2)‖Z - (Z_2^k - U_1^k + Z_3^k + U_3^k)/2‖_F^2
Z_2 := argmin_Z λ‖Z‖_{1,*} + (ρ/2)‖Z - (Z_1 + U_1^k + Z_3^k - U_2^k)/2‖_F^2
Z_3 := argmin_Z I{Z ⪰ 0} + (ρ/2)‖Z - (Z_1 + Z_2 + U_2^k - U_3^k)/2‖_F^2    (18)

To simplify the equations, we define

V_1^k = U_3^k - U_1^k,  V_2^k = U_1^k - U_2^k,  V_3^k = U_2^k - U_3^k    (19)

whose updates are then given by

V_1 := V_1^k + Z_2 + Z_3 - 2 Z_1
V_2 := V_2^k + Z_1 + Z_3 - 2 Z_2
V_3 := V_3^k + Z_1 + Z_2 - 2 Z_3    (20)

Finally, noting that we need only compute the first two V's, since V_1^k + V_2^k + V_3^k = 0, our final ADMM update equations take the form

Z_1 := argmin_Z -log|Z_(1,1)| + tr(SZ) + (ρ/2)‖Z - (Z_2^k + Z_3^k + V_1^k)/2‖_F^2
Z_2 := argmin_Z λ‖Z‖_{1,*} + (ρ/2)‖Z - (Z_1 + Z_3^k + V_2^k)/2‖_F^2
Z_3 := argmin_Z I{Z ⪰ 0} + (ρ/2)‖Z - (Z_1 + Z_2 - V_1^k - V_2^k)/2‖_F^2
V_1 := V_1^k + Z_2 + Z_3 - 2 Z_1
V_2 := V_2^k + Z_1 + Z_3 - 2 Z_2    (21)

A.2 Analytical solution of the ADMM steps

Next, we derive the analytical form for the updates in (21):

Z_1 := T_ρ( (Z_2 + Z_3 + V_1)/2 - S/ρ )
Z_2 := S*_{λ/ρ}( (Z_1 + Z_3 + V_2)/2 )
Z_3 := P_{⪰0}( (Z_1 + Z_2 - V_1 - V_2)/2 )    (22)

Starting with the update for Z_1, we note that if we set the gradient of -log|Z_(1,1)| + tr(SZ) + (ρ/2)‖Z - X‖_F^2 to zero, we have

-Λ^{-1} + S_yy + ρ(Λ - X_(1,1)) = 0
S_ij + ρ(Z_ij - X_ij) = 0 for (i,j) outside the (1,1) block    (23)

where Λ denotes the (1,1) block of Z. The second set of equations implies, for entries outside the (1,1) block,

Z_ij = X_ij - (1/ρ) S_ij,    (24)

while for the (1,1) submatrix we must find a matrix Λ that satisfies

ρΛ - Λ^{-1} = ρ X_(1,1) - S_yy.    (25)

As derived in [7] (used there for the typical sparse inverse covariance estimation problem), the solution is obtained by taking the eigendecomposition Q D Q^T = ρ X_(1,1) - S_yy and then multiplying on the left by Q^T and on the right by Q, so that we have

ρ Λ̃ - Λ̃^{-1} = D    (26)
where Λ̃ = Q^T Λ Q. Finally, the solution is given by the quadratic formula

(Λ̃)_ii = ( D_ii + √(D_ii^2 + 4ρ) ) / (2ρ).    (27)

Combining these blockwise updates into a single operation, we have

argmin_Z -log|Z_(1,1)| + tr(SZ) + (ρ/2)‖Z - (Z_2^k + Z_3^k + V_1^k)/2‖_F^2 = T_ρ( (Z_2^k + Z_3^k + V_1^k)/2 - S/ρ )    (28)

where T_ρ is given by

T_ρ([ A  B ; B^T  C ]) = [ Q D̃ Q^T  B ; B^T  C ]    (29)

and Q D Q^T = ρA is the eigendecomposition of ρA and D̃ is a diagonal matrix with

D̃_ii = ( D_ii + √(D_ii^2 + 4ρ) ) / (2ρ).    (30)

The update for Z_2 is accomplished by elementwise soft-thresholding of Θ and the off-diagonal elements of Λ. In particular, we exploit the fact that the solution to the optimization problem

argmin_x (1/2)(x - z)^2 + ν|x|    (31)

is given by the soft-thresholding operator

S_ν(z) = { z - ν,  z ≥ ν;  0,  -ν ≤ z ≤ ν;  z + ν,  z ≤ -ν }.    (32)

We define the S*_ν operator as

S*_ν([ A  B ; B^T  C ]) = [ S_ν(A - diag(A)) + diag(A)  S_ν(B) ; S_ν(B)^T  C ]    (33)

where the soft-thresholding is applied elementwise. Applied to the minimization problem (21) for Z_2, this gives the update above. Finally, the update for Z_3 is given simply by the projection onto the semidefinite cone,

P_{⪰0}(Z) = Q D̃ Q^T    (34)

where Q D Q^T = Z is the eigendecomposition of Z and D̃ is a diagonal matrix with D̃_ii = max(D_ii, 0).
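The three analytical steps can be checked numerically. The sketch below, our own illustration, verifies the quadratic-formula solution (27), the soft-thresholding solution of (31), and the eigenvalue-clipping form of the projection (34).

import numpy as np

rng = np.random.default_rng(1)
rho = 2.0

# Check (27): lam = (d + sqrt(d^2 + 4 rho)) / (2 rho) solves rho*lam - 1/lam = d, with lam > 0.
d = rng.standard_normal(10)
lam = (d + np.sqrt(d**2 + 4 * rho)) / (2 * rho)
assert np.allclose(rho * lam - 1 / lam, d)
assert (lam > 0).all()   # the reconstructed (1,1) block stays positive definite

# Check the scalar soft-threshold (32) against brute-force minimization of (31).
nu, z = 0.3, 0.8
s = np.sign(z) * max(abs(z) - nu, 0.0)
grid = np.linspace(-2, 2, 400001)
obj = 0.5 * (grid - z)**2 + nu * np.abs(grid)
assert abs(grid[obj.argmin()] - s) < 1e-4

# Check that eigenvalue clipping (34) returns a positive semidefinite matrix.
A = rng.standard_normal((5, 5)); A = (A + A.T) / 2
w, Q = np.linalg.eigh(A)
P = (Q * np.maximum(w, 0)) @ Q.T
assert np.linalg.eigvalsh(P).min() >= -1e-10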