Divide-and-combine Strategies in Statistical Modeling for Massive Data

Liqun Yu, Washington University in St. Louis
March 30, 2017

Introduction

Many statistical problems can be formulated as

\min_{\theta \in \mathbb{R}^p} \; L(\{x_i, y_i\}_{i=1}^n, \theta) + \lambda P(\theta), \qquad (1)

where L({x_i, y_i}_{i=1}^n, θ) is a loss function, a negative log-likelihood, or some other criterion function (e.g., for an M-estimator), and P(θ) is a regularization term on θ.

Examples:
- Linear regression:
  \hat\beta = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \quad \text{(or other penalties)}
- Logistic regression:
  \hat\beta = \arg\min_\beta \underbrace{\sum_{i=1}^n \log\big(1 + e^{x_i^T \beta}\big) - \sum_{i=1}^n y_i x_i^T \beta}_{\text{negative log-likelihood}} + \lambda \|\beta\|_1
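To make formulation (1) concrete, here is a minimal NumPy sketch of the two example objectives written as plain functions of β; the function names are illustrative choices, not from the talk.

```python
import numpy as np

def lasso_objective(beta, X, y, lam):
    # ||y - X beta||_2^2 + lam * ||beta||_1
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))

def logistic_lasso_objective(beta, X, y, lam):
    # sum_i log(1 + exp(x_i' beta)) - sum_i y_i * x_i' beta  +  lam * ||beta||_1
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta)) - np.sum(y * eta) + lam * np.sum(np.abs(beta))
```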

When is Divide-and-Combine needed?

Notation:

X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{pmatrix} \in \mathbb{R}^{n \times p} \quad \text{and} \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{pmatrix} \in \mathbb{R}^n,

where each (X_k, y_k), with X_k ∈ R^{n_k × p} and y_k ∈ R^{n_k}, is a subset of the data (X, y).

Two scenarios:
1. The subsets (X_k, y_k) are collected and stored separately at different locations (e.g., by different organizations), and transferring data is prohibitive due to communication cost or security/privacy reasons.
2. The data (X, y) are too big to be stored or processed on a single computer, e.g., petabytes of data that cannot fit on a single computer/server.

A Trivial Example: Simple Linear Regression

Simple linear regression is embarrassingly parallel:

\hat\beta = (X^T X)^{-1} X^T y = \Big(\sum_{k=1}^K X_k^T X_k\Big)^{-1} \sum_{k=1}^K X_k^T y_k.

Compute the X_k^T X_k and X_k^T y_k in parallel, then aggregate.

Simple linear regression happens to have a closed-form solution that is computable in parallel. But what about more general cases? For example, the lasso,

\hat\beta_{LASSO} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1,

has no closed-form solution, and it is not straightforward to compute \hat\beta_{LASSO} in parallel.
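As a minimal sketch of this map-reduce pattern (not from the talk): each subset contributes only its p-by-p Gram matrix X_k^T X_k and p-vector X_k^T y_k, which are summed and solved centrally. The sequential loop stands in for a parallel map step, and the helper names are illustrative assumptions.

```python
import numpy as np

def block_suffstats(Xk, yk):
    # Per-block sufficient statistics: X_k^T X_k and X_k^T y_k.
    return Xk.T @ Xk, Xk.T @ yk

def distributed_ols(blocks):
    # blocks: list of (X_k, y_k) pairs, e.g. held on different machines.
    # Only O(p^2) numbers per block need to be communicated.
    p = blocks[0][0].shape[1]
    G, v = np.zeros((p, p)), np.zeros(p)
    for Xk, yk in blocks:                 # map step; run in parallel in practice
        Gk, vk = block_suffstats(Xk, yk)
        G += Gk                           # reduce step: sum the statistics
        v += vk
    return np.linalg.solve(G, v)          # beta_hat = (sum X_k'X_k)^{-1} sum X_k'y_k

# quick check against the all-data solution
rng = np.random.default_rng(0)
X, beta = rng.normal(size=(3000, 5)), np.array([1.0, 0.0, -2.0, 0.5, 3.0])
y = X @ beta + rng.normal(size=3000)
blocks = [(X[i::3], y[i::3]) for i in range(3)]   # 3 subsets
print(np.allclose(distributed_ols(blocks), np.linalg.solve(X.T @ X, X.T @ y)))
```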

Two Approaches (not including subsampling)

There are generally two approaches to solving (1) in parallel:

1. Distributed numerical optimization algorithms that solve (1) in parallel:
   - Distributed coordinate descent, e.g., [1].
   - Lagrangian primal-dual algorithms, including ADMM [2] and CoCoA [3].
   - CSL/EDSL [4, 5]: quadratic approximation with a local Hessian.
2. Divide-and-combine statistical aggregation: aggregate the subset results.

   min_θ L(X_1, y_1; θ) (+ λP(θ))  →  θ̂_1
   min_θ L(X_2, y_2; θ) (+ λP(θ))  →  θ̂_2
   ...
   min_θ L(X_K, y_K; θ) (+ λP(θ))  →  θ̂_K
                    Aggregate?  →  θ̂_GLOBAL

Part I: Distributed optimization algorithms, with a focus on ADMM (CSL/EDSL if time permits)

ADMM

ADMM solves problems of the form

\min_{x,z} \; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c, \qquad (2)

where x and z are the parameters of interest and A, B, c are constants. Many statistical problems can be formulated in this form, e.g., regularized linear regression:

\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \;\Longleftrightarrow\; \min_{\beta, r} \|r\|_2^2 + \lambda \|\beta\|_1 \quad \text{s.t.} \quad r = y - X\beta. \qquad (3)

ADMM solves (2) by iteratively minimizing its augmented Lagrangian

L_\rho(x, z, u) := f(x) + g(z) + u^T(Ax + Bz - c) + \tfrac{\rho}{2} \|Ax + Bz - c\|_2^2

over the primal variables x and z and updating the dual variable u via dual ascent, where ρ is a tunable augmentation parameter. Specifically, at iteration t ADMM carries out the updates (written in scaled dual form, with u denoting the scaled dual variable)

x^{t+1} := \arg\min_x \; f(x) + \tfrac{\rho}{2} \|Ax + Bz^t - c + u^t\|_2^2,
z^{t+1} := \arg\min_z \; g(z) + \tfrac{\rho}{2} \|Ax^{t+1} + Bz - c + u^t\|_2^2,
u^{t+1} := u^t + (Ax^{t+1} + Bz^{t+1} - c). \qquad (4)
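For concreteness, here is a hedged NumPy sketch of ADMM for the lasso, using the split f(β) = ||y − Xβ||_2^2, g(z) = λ||z||_1 with the constraint β − z = 0, a standard special case of (2) written in scaled dual form. The function names, fixed ρ, and iteration count are illustrative choices, not from the talk.

```python
import numpy as np

def soft_threshold(v, kappa):
    # Proximal operator of kappa * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_admm(X, y, lam, rho=1.0, n_iter=200):
    # ADMM for min_b ||y - X b||_2^2 + lam * ||b||_1 with the split b = z.
    p = X.shape[1]
    Xty = X.T @ y
    # The b-update solves a ridge-type system; factor it once and reuse.
    L = np.linalg.cholesky(2.0 * X.T @ X + rho * np.eye(p))
    z, u = np.zeros(p), np.zeros(p)
    for _ in range(n_iter):
        rhs = 2.0 * Xty + rho * (z - u)
        b = np.linalg.solve(L.T, np.linalg.solve(L, rhs))   # b-update
        z = soft_threshold(b + u, lam / rho)                # z-update (l1 prox)
        u = u + b - z                                       # scaled dual update
    return z
```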

Parallelize the ADMM

The problem

\min_{\theta \in \mathbb{R}^p} \; L(X, y; \theta) + \lambda P(\theta), \quad \text{with} \quad X = \begin{pmatrix} X_1 \\ \vdots \\ X_K \end{pmatrix}, \; y = \begin{pmatrix} y_1 \\ \vdots \\ y_K \end{pmatrix},

is equivalent to the (generic) consensus formulation

\min_{\theta_1, \ldots, \theta_K, \theta} \; \sum_{k=1}^K L(X_k, y_k; \theta_k) + \lambda P(\theta) \quad \text{s.t.} \quad \theta_k = \theta, \; k = 1, \ldots, K.

Applying ADMM gives

\theta_k^{t+1} := \arg\min_{\theta_k} \; L(X_k, y_k; \theta_k) + \tfrac{\rho}{2} \|\theta_k - \theta^t + u_k^t\|_2^2 \quad \text{(typically easy to solve; one per subset, in parallel)},
\theta^{t+1} := \arg\min_\theta \; \lambda P(\theta) + \tfrac{K\rho}{2} \|\theta - \bar\theta^{t+1} - \bar u^t\|_2^2 \quad \text{(closed-form solution)},
u_k^{t+1} := u_k^t + (\theta_k^{t+1} - \theta^{t+1}),

where \bar\theta^{t+1} = \tfrac{1}{K} \sum_{k=1}^K \theta_k^{t+1} and \bar u^{t+1} = \tfrac{1}{K} \sum_{k=1}^K u_k^{t+1}.

Other formulations that yield easier subproblems are possible, depending on the specific form of the problem, e.g., (3).
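Below is a minimal sketch of this consensus scheme for the lasso case (squared-error loss on each subset, ℓ1 penalty on the global variable); the local updates inside the loop are the pieces that would run in parallel, one per subset. The helper names, fixed ρ, and iteration count are illustrative assumptions, not from the talk.

```python
import numpy as np

def soft_threshold(v, kappa):
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def consensus_lasso_admm(blocks, lam, rho=1.0, n_iter=300):
    # Consensus ADMM for min_theta sum_k ||y_k - X_k theta||^2 + lam * ||theta||_1.
    # Each subset keeps a local copy theta_k, tied to the global theta by theta_k = theta.
    K = len(blocks)
    p = blocks[0][0].shape[1]
    theta = np.zeros(p)
    Theta = np.zeros((K, p))    # local copies theta_k
    U = np.zeros((K, p))        # scaled dual variables u_k
    # Factor each local ridge-type system once (this is the per-subset "map" work).
    factors = [np.linalg.cholesky(2.0 * Xk.T @ Xk + rho * np.eye(p)) for Xk, _ in blocks]
    rhs0 = [2.0 * Xk.T @ yk for Xk, yk in blocks]
    for _ in range(n_iter):
        for k, (Lk, bk) in enumerate(zip(factors, rhs0)):   # local updates (parallelizable)
            rhs = bk + rho * (theta - U[k])
            Theta[k] = np.linalg.solve(Lk.T, np.linalg.solve(Lk, rhs))
        theta_bar, u_bar = Theta.mean(axis=0), U.mean(axis=0)
        theta = soft_threshold(theta_bar + u_bar, lam / (K * rho))  # global update
        U += Theta - theta                                          # dual updates
    return theta
```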

ADMM: An Application

The model:

Y = X_6 + X_{12} + X_{15} + X_{20} + 0.7\, X_1 \epsilon, \qquad (5)

where the ε are i.i.d. N(0, 1).

[Figure: two panels, "Estimation accuracy" (ℓ1 accuracy vs. iteration) and "Time performance" (time in seconds vs. iteration), with curves for M = 1, 10, 100.]

Figure caption: ADMM applied to non-convex (SCAD-) penalized quantile regression for model (5) with τ = 0.3. Sample size n = 30,000, dimension p = 100; M is the number of subsets.

ADMM

Pros:
- General purpose, minimal assumptions. No approximation, no further statistical analysis needed.
- Very flexible parallelization; the convergence rate is insensitive to the number of partitions K.
- Example: split along both n and p, useful when both n and p are large [8]:

X = \begin{pmatrix} X_{11} & X_{12} & \cdots & X_{1N} \\ X_{21} & X_{22} & \cdots & X_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ X_{M1} & X_{M2} & \cdots & X_{MN} \end{pmatrix}.

Cons:
- Iterative; convergence is slow and communication is expensive.

CSL/EDSL

Fast convergence, but stronger assumptions on the problems it solves [4, 5].

[Figure: "Convergence rate of CSL-type D&C QR": estimation error vs. iteration, with curves for K = 5, 10, 20 and the global estimator.]

Figure caption: CSL applied to penalized quantile regression with τ = 0.3 (a surrogate Hessian is used). Sample size n = 500, dimension p = 15; K is the number of subsets.

Part II: Divide-and-combine statistical aggregation

min_θ L(X_1, y_1; θ) (+ λP(θ))  →  θ̂_1
min_θ L(X_2, y_2; θ) (+ λP(θ))  →  θ̂_2
...
min_θ L(X_K, y_K; θ) (+ λP(θ))  →  θ̂_K
                 Aggregate?  →  θ̂_GLOBAL

We consider the simple case λ = 0. For examples with λ ≠ 0, see [6] and the de-biased lasso in [7, 10], among others.

Naive Approach: Simple Average

\hat\theta_{GLOBAL} = \frac{1}{K} \sum_{k=1}^K \hat\theta_k.

This performs poorly in general, especially when the underlying data-generating model is non-linear.

One-step Further

In [9], a one-step estimator built from the average of the subset estimates is considered, in the general context of M-estimation, where L(·) is the criterion function.

1. First, take the average

   \hat\theta^{(0)} = \frac{1}{K} \sum_{k=1}^K \hat\theta_k.

2. Then, compute the one-step estimator

   \hat\theta^{(1)} = \hat\theta^{(0)} - [\nabla^2 L(X, y; \hat\theta^{(0)})]^{-1} [\nabla L(X, y; \hat\theta^{(0)})],

   where

   \nabla^2 L(X, y; \hat\theta^{(0)}) = \sum_{k=1}^K \nabla^2 L(X_k, y_k; \hat\theta^{(0)}) \quad \text{and} \quad \nabla L(X, y; \hat\theta^{(0)}) = \sum_{k=1}^K \nabla L(X_k, y_k; \hat\theta^{(0)}).
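A minimal sketch of this two-step recipe for logistic regression (loss = negative log-likelihood), assuming the subset estimates θ̂_k have already been computed locally; only the averaged estimate and the aggregated gradient/Hessian need to be communicated. Function names are illustrative, not from [9].

```python
import numpy as np

def logistic_grad_hess(Xk, yk, theta):
    # Gradient and Hessian of the negative log-likelihood on one subset.
    p = 1.0 / (1.0 + np.exp(-Xk @ theta))
    grad = Xk.T @ (p - yk)
    hess = (Xk * (p * (1 - p))[:, None]).T @ Xk
    return grad, hess

def one_step_estimator(blocks, subset_estimates):
    # Step 1: simple average of the subset estimates theta_k_hat.
    theta0 = np.mean(subset_estimates, axis=0)
    # Step 2: one Newton step using the gradient/Hessian aggregated over subsets.
    d = theta0.shape[0]
    grad, hess = np.zeros(d), np.zeros((d, d))
    for Xk, yk in blocks:                  # each term is computable locally
        gk, hk = logistic_grad_hess(Xk, yk, theta0)
        grad += gk
        hess += hk
    return theta0 - np.linalg.solve(hess, grad)
```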

One-step Further

Theorem 1 (Huang and Huo (2015)). Let θ_0 denote the true parameter. Under some mild conditions, the one-step estimator θ̂^{(1)} satisfies

\sqrt{n}\,(\hat\theta^{(1)} - \theta_0) \rightarrow N(0, \Sigma)

as long as K = O(\sqrt{n}), where

\Sigma = E\big([\nabla^2 L(X, y; \hat\theta^{(0)})]^{-1}\big)\, E\big([\nabla L(X, y; \hat\theta^{(0)})][\nabla L(X, y; \hat\theta^{(0)})]^T\big)\, E\big([\nabla^2 L(X, y; \hat\theta^{(0)})]^{-1}\big).

That is, the aggregated estimator θ̂^{(1)} is asymptotically equivalent to the global estimator, as if the estimation had been made with all the data, provided the number of subsets does not grow too fast. For example, in the MLE case, Σ reduces to the inverse Fisher information.

Aggregated Estimating Equation (AEE)

Take a closer look at simple linear regression:

\hat\beta = (X^T X)^{-1} X^T y = \Big(\sum_{k=1}^K X_k^T X_k\Big)^{-1} \sum_{k=1}^K X_k^T y_k = \Big(\sum_{k=1}^K X_k^T X_k\Big)^{-1} \sum_{k=1}^K X_k^T X_k \hat\beta_k,

a weighted average of the subset estimates. The weights are the local curvatures of the loss function, or equivalently the gradients of the estimating equation:

X^T (y - X\beta) = \sum_{k=1}^K X_k^T (y_k - X_k \beta) \quad \text{(OLS estimating equation)}.

Aggregated Estimating Equation (AEE)

Lin and Xi [11] generalize the idea to estimating-equation estimation, where the estimator solves

M(\theta) = \sum_{i=1}^n \phi(x_i, y_i; \theta) = 0 \quad \Big(\text{or } \sum_{i=1}^n \nabla L(x_i, y_i; \theta) = 0\Big).

- Use the gradient of M_k(\theta) = \sum_{i \in S_k} \phi(x_i, y_i; \theta) as the weight for subset k, i.e.,

  A_k = \sum_{i \in S_k} \frac{\partial \phi(x_i, y_i; \hat\theta_k)}{\partial \theta}.

- Then compute the AEE estimator as

  \hat\theta_{AEE} = \Big(\sum_{k=1}^K A_k\Big)^{-1} \sum_{k=1}^K A_k \hat\theta_k.

It is proved in [11] that the AEE estimator is asymptotically equivalent to the global estimator under some mild conditions, as long as K = O(n^\gamma) for some 0 < γ < 1.
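A minimal sketch of the AEE recipe for logistic regression, where the estimating equation is the score equation and A_k is taken as the observed information at θ̂_k (the overall sign of the Jacobian cancels in the weighted average). The local Newton solver and all function names are illustrative assumptions, not from [11].

```python
import numpy as np

def logistic_grad_hess(Xk, yk, theta):
    # Score and observed information of the logistic negative log-likelihood.
    p = 1.0 / (1.0 + np.exp(-Xk @ theta))
    return Xk.T @ (p - yk), (Xk * (p * (1 - p))[:, None]).T @ Xk

def logistic_subset_mle(Xk, yk, n_newton=25):
    # Local MLE on one subset via Newton's method (illustration only).
    theta = np.zeros(Xk.shape[1])
    for _ in range(n_newton):
        grad, hess = logistic_grad_hess(Xk, yk, theta)
        theta -= np.linalg.solve(hess, grad)
    return theta

def aee_estimator(blocks):
    # theta_AEE = (sum_k A_k)^{-1} (sum_k A_k theta_k_hat), where A_k is the
    # Jacobian of subset k's estimating equation evaluated at theta_k_hat
    # (here the observed information; an overall sign cancels in the average).
    d = blocks[0][0].shape[1]
    A_sum, weighted = np.zeros((d, d)), np.zeros(d)
    for Xk, yk in blocks:
        theta_k = logistic_subset_mle(Xk, yk)
        _, A_k = logistic_grad_hess(Xk, yk, theta_k)
        A_sum += A_k
        weighted += A_k @ theta_k
    return np.linalg.solve(A_sum, weighted)
```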

Summary

Another approach: Song et al. [12] proposed a D&C method that combines local confidence distributions or confidence inference functions.

Distributed optimization algorithms vs. statistical aggregation:
- A distributed optimization algorithm is generally iterative, while statistical aggregation is non-iterative; statistical aggregation methods are therefore communication-efficient.
- A distributed optimization algorithm solves the original problem and finds the global estimate. A statistical aggregation estimate is not equal to the global estimate, but an ideal statistical aggregation method should be asymptotically equivalent to the global solution, as if the estimation had been made on the entire data set.

Inference was not discussed, but:
- For bootstrap-based inference, distributed optimization algorithms help parallelize and speed up the computation.
- For inference based on asymptotic results, the asymptotic covariance matrices are often computable in a distributed fashion.

Q&A

Thank you!

References I

[1] Peter Richtárik and Martin Takáč (2016). Distributed Coordinate Descent Method for Learning with Big Data. Journal of Machine Learning Research.
[2] Stephen Boyd et al. (2011). Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning.
[3] Virginia Smith et al. (2016). CoCoA: A General Framework for Communication-Efficient Distributed Optimization. arXiv preprint arXiv:1611.02189.
[4] Michael I. Jordan et al. (2016). Communication-Efficient Distributed Statistical Inference. arXiv:1605.07689v3.

References II

[5] Jialei Wang et al. (2016). Efficient Distributed Learning with Sparsity. arXiv:1605.07991v1.
[6] Xueying Chen and Min-ge Xie (2014). A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data. Statistica Sinica.
[7] Adel Javanmard and Andrea Montanari (2014). Confidence Intervals and Hypothesis Testing for High-Dimensional Regression. Journal of Machine Learning Research.
[8] Neal Parikh and Stephen Boyd (2014). Block Splitting for Distributed Optimization. Mathematical Programming Computation.
[9] Cheng Huang and Xiaoming Huo (2015). A Distributed One-Step Estimator. arXiv:1511.01443v2.

References III

[10] Jason Lee et al. (2015). Communication-Efficient Sparse Regression: A One-Shot Approach. Journal of Machine Learning Research.
[11] Nan Lin and Ruibin Xi (2011). Aggregated Estimating Equation Estimation. Statistics and Its Interface.
[12] Peter X. K. Song et al. (2016). Confidence Distributions and Confidence Inference Functions: General Data Integration Methods for Big Complex Data. Personal communication.