Divide-and-combine Strategies in Statistical Modeling for Massive Data


1 Divide-and-combine Strategies in Statistical Modeling for Massive Data

Liqun Yu, Washington University in St. Louis
March 30, 2017

2 Introduction

Many statistical problems can be formulated in the following form:

$$\min_{\theta \in \mathbb{R}^p} L(\{x_i, y_i\}_{i=1}^n, \theta) + \lambda P(\theta), \qquad (1)$$

where $L(\{x_i, y_i\}_{i=1}^n, \theta)$ is a loss function, a negative log-likelihood, or some other criterion function (e.g., for an M-estimator), and $P(\theta)$ is a regularization term on $\theta$.

Examples:
- Linear regression: $\hat\beta = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$ (or other penalties)
- Logistic regression: $\hat\beta = \arg\min_\beta \underbrace{\textstyle\sum_{i=1}^n \log\big(1 + e^{x_i^\top \beta}\big) - \sum_{i=1}^n y_i x_i^\top \beta}_{\text{negative log-likelihood}} + \lambda \|\beta\|_1$

3 When is Divide-and-Combine Needed?

Notation:

$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{pmatrix} \in \mathbb{R}^{n \times p} \quad \text{and} \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{pmatrix} \in \mathbb{R}^{n},$$

where each $X_k \in \mathbb{R}^{n_k \times p}$, $y_k \in \mathbb{R}^{n_k}$ is a subset of the data $(X, y)$.

Two scenarios:
1. The subsets $(X_k, y_k)$ are collected and stored separately at different locations (e.g., by different organizations), and transferring data is prohibitive due to communication cost or security/privacy reasons.
2. The data $(X, y)$ are too big to be stored or processed on a single computer, e.g., petabytes of data that cannot fit into a single computer/server.

4 A Trivial Example: Simple Linear Regression

Simple linear regression is embarrassingly parallel:

$$\hat\beta = (X^\top X)^{-1} X^\top y = \Big(\sum_{k=1}^K X_k^\top X_k\Big)^{-1} \sum_{k=1}^K X_k^\top y_k.$$

Compute the $X_k^\top X_k$'s and $X_k^\top y_k$'s in parallel, then aggregate (see the sketch below).

Simple linear regression happens to have a closed-form solution that is computable in parallel. But what about more general cases? For example,

$$\hat\beta_{\text{LASSO}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$

There is no closed-form solution, and it is not straightforward to compute $\hat\beta_{\text{LASSO}}$ in parallel.
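As a concrete illustration of the parallel OLS computation above, here is a minimal NumPy sketch; the four-way split, the simulated data, and the function name `ols_divide_and_combine` are only illustrative.

```python
import numpy as np

def ols_divide_and_combine(X_blocks, y_blocks):
    """OLS via per-subset sufficient statistics: aggregate X_k'X_k and X_k'y_k."""
    p = X_blocks[0].shape[1]
    XtX = np.zeros((p, p))
    Xty = np.zeros(p)
    for Xk, yk in zip(X_blocks, y_blocks):   # each term could be computed on a separate machine
        XtX += Xk.T @ Xk
        Xty += Xk.T @ yk
    return np.linalg.solve(XtX, Xty)

# quick check against the all-data fit
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
beta = np.arange(1, 6)
y = X @ beta + rng.normal(size=1000)
X_blocks, y_blocks = np.array_split(X, 4), np.array_split(y, 4)
print(np.allclose(ols_divide_and_combine(X_blocks, y_blocks),
                  np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```

Because only the $p \times p$ and $p \times 1$ sufficient statistics leave each subset, the communication cost is independent of the subset sample sizes.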

5 Two Approaches (not including subsampling)

There are generally two approaches for solving (1) in parallel.

1. Distributed numerical optimization algorithms that solve (1) in parallel:
   - Distributed coordinate descent, for example [1].
   - Lagrangian primal-dual algorithms, including ADMM [2] and CoCoA [3].
   - CSL/EDSL [4, 5]: quadratic approximation with a local Hessian.

2. Divide-and-combine statistical aggregation: aggregating subset results.

   $\min_\theta L(X_1, y_1; \theta)\ (+\,\lambda P(\theta)) \;\Rightarrow\; \hat\theta_1$
   $\min_\theta L(X_2, y_2; \theta)\ (+\,\lambda P(\theta)) \;\Rightarrow\; \hat\theta_2$
   $\quad\vdots$
   $\min_\theta L(X_K, y_K; \theta)\ (+\,\lambda P(\theta)) \;\Rightarrow\; \hat\theta_K$
   $\Rightarrow$ Aggregate (how?) $\Rightarrow \hat\theta_{\text{GLOBAL}}$

6 Part I: Distributed optimization algorithms, focus on ADMM (CSL/EDSL if time permits)

7 ADMM

The ADMM solves the following problem,

$$\min_{x,z} \{f(x) + g(z)\} \quad \text{s.t.} \quad Ax + Bz = c, \qquad (2)$$

where $x$ and $z$ are the parameters of interest and $A$, $B$, $c$ are constants. Many statistical problems can be formulated in this form, e.g., linear regression with regularization,

$$\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \quad\Longleftrightarrow\quad \min_{\beta, r} \|r\|_2^2 + \lambda \|\beta\|_1 \quad \text{s.t.} \quad r = y - X\beta. \qquad (3)$$

The ADMM solves (2) by iteratively minimizing its augmented Lagrangian

$$L_\rho(x, z, u) := f(x) + g(z) + u^\top(Ax + Bz - c) + \frac{\rho}{2}\|Ax + Bz - c\|_2^2$$

in the primal variables $x$ and $z$ and updating the dual variable $u$ via dual ascent, where $\rho$ is the tunable augmentation parameter. Specifically, the ADMM carries out the following updates at iteration $t$:

$$\begin{aligned}
x^{t+1} &:= \arg\min_x\, f(x) + \frac{\rho}{2}\|Ax + Bz^t - c + u^t\|_2^2, \\
z^{t+1} &:= \arg\min_z\, g(z) + \frac{\rho}{2}\|Ax^{t+1} + Bz - c + u^t\|_2^2, \\
u^{t+1} &:= u^t + (Ax^{t+1} + Bz^{t+1} - c). \qquad (4)
\end{aligned}$$
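To make the updates in (4) concrete, here is a minimal sketch of ADMM for the lasso using the common splitting $f(\beta) = \tfrac12\|X\beta - y\|_2^2$, $g(z) = \lambda\|z\|_1$ with constraint $\beta - z = 0$ (a special case of (2), not the residual formulation (3)); the choices of $\rho$ and the iteration count are illustrative, not tuned.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of kappa * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_admm(X, y, lam, rho=1.0, n_iter=200):
    """ADMM for min_b 0.5*||X b - y||^2 + lam*||b||_1, split as f(b) + g(z), b = z."""
    p = X.shape[1]
    Xty = X.T @ y
    L = np.linalg.cholesky(X.T @ X + rho * np.eye(p))   # factor once, reuse every iteration
    b, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        rhs = Xty + rho * (z - u)
        b = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # x-update: ridge-type linear solve
        z = soft_threshold(b + u, lam / rho)               # z-update: proximal (shrinkage) step
        u = u + b - z                                      # dual update (scaled form)
    return z
```

The $x$-update is a linear solve because $f$ is quadratic, and the $z$-update has the closed-form soft-thresholding solution; this pattern of "one easy subproblem per block plus a dual correction" is exactly what (4) describes.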

8 Parallelize the ADMM

The problem

$$\min_{\theta \in \mathbb{R}^p} L(X, y; \theta) + \lambda P(\theta), \qquad X = \begin{pmatrix} X_1 \\ \vdots \\ X_K \end{pmatrix}, \; y = \begin{pmatrix} y_1 \\ \vdots \\ y_K \end{pmatrix},$$

is equivalent to (a generic formulation)

$$\min_{\theta_1, \ldots, \theta_K, \theta} \sum_{k=1}^K L(X_k, y_k; \theta_k) + \lambda P(\theta), \quad \text{s.t.} \quad \theta_k = \theta, \;\forall k.$$

Apply ADMM:

$$\begin{aligned}
\theta_k^{t+1} &:= \arg\min_{\theta_k}\, L(X_k, y_k; \theta_k) + \frac{\rho}{2}\|\theta_k - \theta^t + u_k^t\|_2^2, &&\text{(typically easy to solve)} \\
\theta^{t+1} &:= \arg\min_{\theta}\, \lambda P(\theta) + \frac{\rho}{2}\|\theta - \bar\theta^{t+1} - \bar u^t\|_2^2, &&\text{(closed-form solution)} \\
u_k^{t+1} &:= u_k^t + (\theta_k^{t+1} - \theta^{t+1}),
\end{aligned}$$

where $\bar\theta^{t+1} = \big(\sum_{k=1}^K \theta_k^{t+1}\big)/K$ and $\bar u^{t+1} = \big(\sum_{k=1}^K u_k^{t+1}\big)/K$.

There can be other formulations that result in easier subproblems, depending on the specific form of the problem, e.g., (3). (A sketch of this consensus scheme for the lasso follows below.)
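A minimal sketch of the consensus scheme above for the lasso, assuming a squared-error loss $\tfrac12\|X_k\theta_k - y_k\|_2^2$ on each subset so that the local updates have closed form; the block splitting, $\rho$, and iteration count are illustrative.

```python
import numpy as np

def soft_threshold(v, kappa):
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def consensus_lasso_admm(X_blocks, y_blocks, lam, rho=1.0, n_iter=200):
    """Consensus ADMM: each subset solves a small ridge-type problem; the subset
    solutions are averaged and the L1 penalty is applied in the global update."""
    K, p = len(X_blocks), X_blocks[0].shape[1]
    theta = np.zeros(p)                       # global (consensus) variable
    theta_k = np.zeros((K, p))                # local variables, one per subset
    u_k = np.zeros((K, p))                    # scaled dual variables
    # pre-factor each local system (X_k'X_k + rho I); done once per subset
    local_inv = [np.linalg.inv(Xk.T @ Xk + rho * np.eye(p)) for Xk in X_blocks]
    for _ in range(n_iter):
        for k, (Xk, yk) in enumerate(zip(X_blocks, y_blocks)):       # parallelizable loop
            theta_k[k] = local_inv[k] @ (Xk.T @ yk + rho * (theta - u_k[k]))
        theta_bar, u_bar = theta_k.mean(axis=0), u_k.mean(axis=0)
        theta = soft_threshold(theta_bar + u_bar, lam / (rho * K))   # global update
        u_k += theta_k - theta                                       # dual updates
    return theta
```

Only the local estimates $\theta_k$ and duals $u_k$ are exchanged with the central node at each iteration, which is why the communication cost per iteration is $O(Kp)$ regardless of the subset sample sizes.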

9 ADMM: An Application

The model:

$$Y = X_6 + X_{12} + X_{15} + X_{\cdot} + X_1\,\epsilon, \qquad (5)$$

where $\epsilon \overset{\text{i.i.d.}}{\sim} N(0, 1)$.

[Two-panel figure: estimation accuracy ($\ell_1$ accuracy vs. iteration) and time performance (time in seconds vs. iteration), with curves for $M = 1, 10, 100$.]

Figure: ADMM applied to non-convex (SCAD) penalized quantile regression for (5) with $\tau = 0.3$. Sample size $n = 30{,}000$, dimension $p = 100$, $M$ is the number of subsets.

10 ADMM

Pros:
- General purpose, minimal assumptions. No approximation, no further statistical analysis required.
- Very flexible parallelization; the convergence rate is insensitive to the number of partitions $K$.
- Example: split along both $n$ and $p$, useful when both $n$ and $p$ are large [8]:

$$X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1N} \\ X_{21} & X_{22} & \cdots & X_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ X_{M1} & X_{M2} & \cdots & X_{MN} \end{pmatrix}$$

Cons:
- Iterative; convergence is slow and communication is expensive.

11 CSL/EDSL

Fast convergence, but stronger assumptions on the problems it solves [4, 5].

[Figure: convergence of CSL-type divide-and-conquer quantile regression, estimation error vs. iteration, with curves for $K = 5, 10, 20$ and the global estimator.]

Figure: CSL applied to penalized quantile regression with $\tau = 0.3$ (a surrogate Hessian is used). Sample size $n = 500$, dimension $p = 15$, $K$ is the number of subsets.

12 Part II: Divide-and-combine statistical aggregation

$\min_\theta L(X_1, y_1; \theta)\ (+\,\lambda P(\theta)) \;\Rightarrow\; \hat\theta_1$
$\min_\theta L(X_2, y_2; \theta)\ (+\,\lambda P(\theta)) \;\Rightarrow\; \hat\theta_2$
$\quad\vdots$
$\min_\theta L(X_K, y_K; \theta)\ (+\,\lambda P(\theta)) \;\Rightarrow\; \hat\theta_K$
$\Rightarrow$ Aggregate (how?) $\Rightarrow \hat\theta_{\text{GLOBAL}}$

Consider simple cases with $\lambda = 0$. Examples for $\lambda \neq 0$: [6] and the de-biased Lasso in [7, 10], among others.

13 Naive Approach: Simple Average

$$\hat\theta_{\text{GLOBAL}} = \frac{\sum_{k=1}^K \hat\theta_k}{K}$$

This performs poorly in general, especially when the underlying data-generating model is non-linear.

14 One-step Further

In [9], a one-step estimator built from the subset average is considered, in the general context of M-estimation, where $L(\cdot)$ is the criterion function.

1. First, take the average
$$\hat\theta^{(0)} = \frac{\sum_{k=1}^K \hat\theta_k}{K}.$$

2. Then, compute the one-step estimator
$$\hat\theta^{(1)} = \hat\theta^{(0)} - \big[\nabla^2 L(X, y; \hat\theta^{(0)})\big]^{-1} \big[\nabla L(X, y; \hat\theta^{(0)})\big],$$
where
$$\nabla^2 L(X, y; \hat\theta^{(0)}) = \sum_{k=1}^K \nabla^2 L(X_k, y_k; \hat\theta^{(0)}) \quad \text{and} \quad \nabla L(X, y; \hat\theta^{(0)}) = \sum_{k=1}^K \nabla L(X_k, y_k; \hat\theta^{(0)}).$$

(A sketch of these two steps for a logistic-regression loss follows below.)
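A minimal sketch of the two steps for a logistic-regression negative log-likelihood; `subset_fits` is assumed to hold the $K$ subset estimates computed elsewhere, and all names are illustrative.

```python
import numpy as np

def logistic_grad_hess(X, y, theta):
    """Gradient and Hessian of the logistic negative log-likelihood on one subset."""
    mu = 1.0 / (1.0 + np.exp(-X @ theta))            # fitted probabilities
    grad = X.T @ (mu - y)
    hess = (X * (mu * (1.0 - mu))[:, None]).T @ X
    return grad, hess

def one_step_estimator(X_blocks, y_blocks, subset_fits):
    """Average the K subset estimates, then take one Newton step using the
    full-data gradient and Hessian assembled from per-subset pieces."""
    theta0 = np.mean(subset_fits, axis=0)            # step 1: simple average
    d = theta0.shape[0]
    grad, hess = np.zeros(d), np.zeros((d, d))
    for Xk, yk in zip(X_blocks, y_blocks):           # step 2: each term computable locally
        gk, hk = logistic_grad_hess(Xk, yk, theta0)
        grad += gk
        hess += hk
    return theta0 - np.linalg.solve(hess, grad)      # one Newton correction
```

The correction needs only one extra round of communication: each subset sends its gradient and Hessian evaluated at the common point $\hat\theta^{(0)}$.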

15 One-step Further

Theorem 1 (Huang and Huo (2015))
Denote by $\theta_0$ the true parameter. Under some mild conditions, the one-step estimator $\hat\theta^{(1)}$ satisfies
$$\sqrt{n}\,(\hat\theta^{(1)} - \theta_0) \rightarrow N(0, \Sigma)$$
as long as $K = O(\sqrt{n})$, where
$$\Sigma = E\big(\big[\nabla^2 L(X, y; \hat\theta^{(0)})\big]^{-1}\big)\, E\big(\big[\nabla L(X, y; \hat\theta^{(0)})\big]\big[\nabla L(X, y; \hat\theta^{(0)})\big]^\top\big)\, E\big(\big[\nabla^2 L(X, y; \hat\theta^{(0)})\big]^{-1}\big).$$

That is, the aggregated estimate $\hat\theta^{(1)}$ is asymptotically equivalent to the global estimator computed on all of the data, as long as the number of subsets does not grow too fast. In the MLE case, for example, $\Sigma$ reduces to the inverse Fisher information.

16 Aggregated Estimating Equation (AEE)

Take a closer look at simple linear regression:

$$\hat\beta = (X^\top X)^{-1} X^\top y = \Big(\sum_{k=1}^K X_k^\top X_k\Big)^{-1} \sum_{k=1}^K X_k^\top y_k = \Big(\sum_{k=1}^K X_k^\top X_k\Big)^{-1} \sum_{k=1}^K X_k^\top X_k\, \hat\beta_k,$$

a weighted average of the subset estimates.

Idea: use the local curvature of the loss function, or the gradient of the estimating equation, as the weights:

$$X^\top (y - X\beta) = \sum_{k=1}^K X_k^\top (y_k - X_k \beta) \qquad \text{(OLS estimating equation)}.$$

17 Aggregated Estimating Equation (AEE)

Lin and Xi [11] generalize the idea to estimating-equation estimation,

$$M(\theta) = \sum_{i=1}^n \phi(x_i, y_i; \theta) = 0 \qquad \Big(\text{or } \sum_{i=1}^n \nabla L(x_i, y_i; \theta) = 0\Big).$$

- Use the gradient of $M_k(\theta) = \sum_{i \in S_k} \phi(x_i, y_i; \theta)$ as the weight for subset $k$, i.e.,
$$A_k = \frac{\partial}{\partial \theta} \sum_{i \in S_k} \phi(x_i, y_i; \theta)\,\Big|_{\theta = \hat\theta_k}.$$
- Then calculate the AEE estimator as
$$\hat\theta_{\text{AEE}} = \Big(\sum_{k=1}^K A_k\Big)^{-1} \sum_{k=1}^K A_k \hat\theta_k.$$

It is proved in [11] that the AEE estimator is equivalent to the global estimator under some mild conditions, as long as $K = O(n^\gamma)$ for some $0 < \gamma < 1$. (A sketch of the combination step follows below.)
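A minimal sketch of the combination step, assuming the subset estimates $\hat\theta_k$ and weight matrices $A_k$ (e.g., $X_k^\top X_k$ for OLS, or a numerically evaluated Jacobian of $M_k$ at $\hat\theta_k$) have already been computed on each subset; the function name is illustrative.

```python
import numpy as np

def aee_combine(subset_fits, subset_weights):
    """AEE aggregation: weight each subset estimate theta_k by its matrix A_k,
    then solve the aggregated linear system (sum_k A_k) theta = sum_k A_k theta_k."""
    d = subset_fits[0].shape[0]
    A_sum = np.zeros((d, d))
    weighted_sum = np.zeros(d)
    for A_k, theta_k in zip(subset_weights, subset_fits):
        A_sum += A_k
        weighted_sum += A_k @ theta_k
    return np.linalg.solve(A_sum, weighted_sum)
```

With $A_k = X_k^\top X_k$ and $\hat\theta_k$ the per-subset OLS fits, this reproduces the global least-squares solution exactly; for general estimating equations it is only asymptotically equivalent, as stated above.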

18 Summary

Another approach: Song et al. [12] propose a D&C method that combines local confidence distributions or confidence inference functions.

Distributed optimization algorithms vs. statistical aggregation:
- A distributed optimization algorithm is generally iterative, while statistical aggregation is non-iterative. Statistical aggregation methods are therefore communication-efficient.
- A distributed optimization algorithm solves the original problem and finds the global estimate; a statistical aggregation estimate is not identical to the global estimate, but an ideal aggregation method is asymptotically equivalent to the global solution, as if the estimation had been carried out on the entire data.

We did not discuss inference, but:
- For bootstrap-based inference, distributed optimization algorithms help parallelize and speed up the computation.
- For inference based on asymptotic results, the asymptotic covariance matrices can often be computed in a distributed fashion.

19 Q&A

Thank you!

20 References I

[1] Peter Richtárik and Martin Takáč (2016). Distributed Coordinate Descent Method for Learning with Big Data. Journal of Machine Learning Research.

[2] Stephen Boyd et al. (2011). Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning.

[3] Virginia Smith et al. (2016). CoCoA: A General Framework for Communication-efficient Distributed Optimization. arXiv preprint.

[4] Michael I. Jordan et al. (2016). Communication-Efficient Distributed Statistical Inference. arXiv preprint.

21 References II

[5] Jialei Wang et al. (2016). Efficient Distributed Learning with Sparsity. arXiv preprint.

[6] Xueying Chen and Min-ge Xie (2014). A Split-and-conquer Approach for Analysis of Extraordinarily Large Data. Statistica Sinica.

[7] Adel Javanmard and Andrea Montanari (2014). Confidence Intervals and Hypothesis Testing for High-dimensional Regression. Journal of Machine Learning Research.

[8] Neal Parikh and Stephen Boyd (2014). Block Splitting for Distributed Optimization. Mathematical Programming Computation.

[9] Cheng Huang and Xiaoming Huo (2015). A Distributed One-Step Estimator. arXiv preprint.

22 References III

[10] Jason Lee et al. (2015). Communication-efficient Sparse Regression: A One-shot Approach. Journal of Machine Learning Research.

[11] Nan Lin and Ruibin Xi (2011). Aggregated Estimating Equation Estimation. Statistics and Its Interface.

[12] Peter X.K. Song et al. (2016). Confidence Distributions and Confidence Inference Functions: General Data Integration Methods for Big Complex Data. Personal communication.
