Divide-and-combine Strategies in Statistical Modeling for Massive Data

Liqun Yu
Washington University in St. Louis
March 30, 2017
Introduction

Many statistical problems can be formulated in the following form,

    \min_{\theta \in \mathbb{R}^p} L(\{x_i, y_i\}_{i=1}^n, \theta) + \lambda P(\theta),    (1)

where L(\{x_i, y_i\}_{i=1}^n, \theta) is a loss function, a negative log-likelihood, or some other criterion function (e.g., for an M-estimator), and P(\theta) is some regularization on \theta.

Examples
- Linear regression:
    \hat\beta = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1  (or other penalties)
- Logistic regression:
    \hat\beta = \arg\min_\beta \underbrace{\sum_{i=1}^n \log(1 + e^{x_i^T \beta}) - \sum_{i=1}^n y_i x_i^T \beta}_{\text{negative log-likelihood}} + \lambda \|\beta\|_1
When is Divide-and-Combine Needed?

Notation:

    X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{pmatrix} \in \mathbb{R}^{n \times p} \quad \text{and} \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{pmatrix} \in \mathbb{R}^n,

where each X_k \in \mathbb{R}^{n_k \times p}, y_k \in \mathbb{R}^{n_k} is a subset of the data (X, y).

Two scenarios:
1. The subsets (X_k, y_k) are collected and stored separately at different locations (e.g., by different organizations), and transferring the data is prohibitive due to communication cost or security/privacy reasons.
2. The data (X, y) are too big to be stored or processed on a single machine, e.g., petabytes of data that cannot fit into a single computer/server.
A Trivial Example: Simple Linear Regression

Simple linear regression is embarrassingly parallel:

    \hat\beta = (X^T X)^{-1} X^T y = \Big( \sum_{k=1}^K X_k^T X_k \Big)^{-1} \sum_{k=1}^K X_k^T y_k.

Compute the X_k^T X_k's and X_k^T y_k's in parallel, then aggregate.

Simple linear regression happens to have a closed-form solution that happens to be computable in parallel. But what about more general cases? For example,

    \hat\beta_{\mathrm{LASSO}} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.

There is no closed-form solution, and it is not straightforward to compute \hat\beta_{\mathrm{LASSO}} in parallel.
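As a concrete illustration, the aggregation above can be sketched in a few lines of NumPy. This is a minimal single-process simulation of the distributed setting; the synthetic data, the number of subsets K, and the even split are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 1200, 5, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# In a real deployment, each worker k would hold (X_k, y_k) and return
# only its p x p matrix X_k^T X_k and its p-vector X_k^T y_k.
XtX = np.zeros((p, p))
Xty = np.zeros(p)
for X_k, y_k in zip(np.split(X, K), np.split(y, K)):
    XtX += X_k.T @ X_k
    Xty += X_k.T @ y_k

beta_hat = np.linalg.solve(XtX, Xty)  # aggregate on the master node
```

Because only the sufficient statistics are summed, the aggregated solution matches the full-data OLS fit exactly (up to floating point).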
Two Approaches (not including subsampling)

There are generally two approaches for solving (1) in parallel.

1. Distributed numerical optimization algorithms that solve (1) in parallel.
   - Distributed coordinate descent, for example, [1].
   - Lagrangian primal-dual algorithms, including ADMM [2] and CoCoA [3].
   - CSL/EDSL by [4, 5]: quadratic approximation with a local Hessian.
2. Divide-and-combine statistical aggregation: aggregating subset results.

    \min_\theta L(X_1, y_1; \theta) \ (+\lambda P(\theta)) \to \hat\theta_1
    \min_\theta L(X_2, y_2; \theta) \ (+\lambda P(\theta)) \to \hat\theta_2
    \vdots
    \min_\theta L(X_K, y_K; \theta) \ (+\lambda P(\theta)) \to \hat\theta_K
    \xrightarrow{\text{aggregate?}} \hat\theta_{\mathrm{GLOBAL}}
Part I: Distributed optimization algorithms, with a focus on ADMM (CSL/EDSL if time permits)
ADMM

The ADMM solves the following problem,

    \min_{x,z} \{ f(x) + g(z) \} \quad \text{s.t.} \quad Ax + Bz = c,    (2)

where x and z are the parameters of interest and A, B, c are constants. Many statistical problems can be formulated in this form, e.g., linear regression with regularization,

    \min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \iff \min_{\beta, r} \|r\|_2^2 + \lambda \|\beta\|_1 \quad \text{s.t.} \quad r = y - X\beta.    (3)

The ADMM solves (2) by iteratively minimizing its augmented Lagrangian

    L_\rho(x, z, u) := f(x) + g(z) + u^T (Ax + Bz - c) + \frac{\rho}{2} \|Ax + Bz - c\|_2^2

in the primal variables x and z, and updating the dual variable via dual ascent, where \rho is the tunable augmentation parameter. Specifically, in terms of the scaled dual variable u (the dual variable divided by \rho), the ADMM carries out the following updates at iteration t,

    x^{t+1} := \arg\min_x f(x) + \frac{\rho}{2} \|Ax + Bz^t - c + u^t\|_2^2,
    z^{t+1} := \arg\min_z g(z) + \frac{\rho}{2} \|Ax^{t+1} + Bz - c + u^t\|_2^2,
    u^{t+1} := u^t + (Ax^{t+1} + Bz^{t+1} - c).    (4)
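To make the updates concrete, here is a minimal NumPy sketch of the ADMM for the LASSO, using the standard splitting f(\beta) = \|y - X\beta\|_2^2, g(z) = \lambda\|z\|_1 with constraint \beta - z = 0 (a different, simpler splitting than the one in (3); the fixed \rho and iteration count are illustrative choices):

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_lasso(X, y, lam, rho=1.0, iters=500):
    """ADMM for min_beta ||y - X beta||_2^2 + lam * ||beta||_1,
    split as f(beta) + g(z) s.t. beta - z = 0, with scaled dual u."""
    p = X.shape[1]
    # the factor 2 comes from the gradient of ||y - X beta||_2^2
    A = 2.0 * X.T @ X + rho * np.eye(p)
    b = 2.0 * X.T @ y
    z = np.zeros(p)
    u = np.zeros(p)
    for _ in range(iters):
        beta = np.linalg.solve(A, b + rho * (z - u))   # x-update
        z = soft_threshold(beta + u, lam / rho)        # z-update
        u += beta - z                                  # dual ascent
    return z
```

A convenient sanity check: on a design with orthonormal columns the LASSO solution is soft-thresholded OLS, S_{\lambda/2}(X^T y), which the iterates converge to.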
Parallelize the ADMM

    \min_{\theta \in \mathbb{R}^p} L(X, y; \theta) + \lambda P(\theta), \quad X = \begin{pmatrix} X_1 \\ \vdots \\ X_K \end{pmatrix}, \ y = \begin{pmatrix} y_1 \\ \vdots \\ y_K \end{pmatrix}

is equivalent to (a generic formulation)

    \min_{\theta_1, \dots, \theta_K, \theta} \sum_{k=1}^K L(X_k, y_k; \theta_k) + \lambda P(\theta), \quad \text{s.t.} \quad \theta_k = \theta, \ \forall k.

Applying the ADMM,

    \theta_k^{t+1} := \arg\min_{\theta_k} L(X_k, y_k; \theta_k) + \frac{\rho}{2} \|\theta_k - \theta^t + u_k^t\|_2^2,  (typically easy to solve)
    \theta^{t+1} := \arg\min_\theta \lambda P(\theta) + \frac{K\rho}{2} \|\theta - \bar\theta^{t+1} - \bar u^t\|_2^2,  (closed-form solution)
    u_k^{t+1} := u_k^t + (\theta_k^{t+1} - \theta^{t+1}),

where \bar\theta^{t+1} = \sum_{k=1}^K \theta_k^{t+1} / K and \bar u^t = \sum_{k=1}^K u_k^t / K.

There can be other formulations that result in easier subproblems, depending on the specific form of the problem, e.g., (3).
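For the case \lambda = 0 with squared-error loss, all the consensus updates above have closed forms, and the scheme can be sketched as follows (a minimal single-process simulation; \rho, the iteration count, and the synthetic data are illustrative assumptions):

```python
import numpy as np

def consensus_admm_ols(X_parts, y_parts, rho=100.0, iters=1000):
    """Consensus ADMM for min_theta sum_k ||y_k - X_k theta||_2^2 (no penalty)."""
    K = len(X_parts)
    p = X_parts[0].shape[1]
    theta = np.zeros(p)
    Theta = np.zeros((K, p))    # local copies theta_k
    U = np.zeros((K, p))        # scaled duals u_k
    for _ in range(iters):
        for k, (X_k, y_k) in enumerate(zip(X_parts, y_parts)):
            # theta_k-update: closed form for the quadratic local loss
            A = 2.0 * X_k.T @ X_k + rho * np.eye(p)
            Theta[k] = np.linalg.solve(A, 2.0 * X_k.T @ y_k + rho * (theta - U[k]))
        # theta-update with P = 0: average of theta_k + u_k
        theta = (Theta + U).mean(axis=0)
        U += Theta - theta      # dual updates
    return theta

rng = np.random.default_rng(0)
n, p, K = 400, 5, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
theta_hat = consensus_admm_ols(np.split(X, K), np.split(y, K))
```

Since the consensus problem is equivalent to the original one, the iterates converge to the global least-squares solution.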
ADMM: An Application

The model:

    Y = X_6 + X_{12} + X_{15} + X_{20} + 0.7 X_1 \epsilon,    (5)

where \epsilon \overset{\text{i.i.d.}}{\sim} N(0, 1).

[Figure: two panels showing estimation accuracy (\ell_1 error) and time performance (seconds) over 200 iterations, for M = 1, 10, 100.]

Figure: ADMM applied to non-convex (SCAD) penalized quantile regression for (5) with \tau = 0.3. Sample size n = 30{,}000, dimension p = 100; M is the number of subsets.
ADMM

Pros:
- General purpose, minimal assumptions. No approximation, no further statistical analysis needed.
- Very flexible parallelization; the convergence rate is insensitive to the number of partitions K.
- Example: split along both n and p, useful when both n and p are large [8]:

    X = \begin{pmatrix} X_{11} & X_{12} & \dots & X_{1N} \\ X_{21} & X_{22} & \dots & X_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ X_{M1} & X_{M2} & \dots & X_{MN} \end{pmatrix}

Cons:
- Iterative; convergence is slow and communication is expensive.
CSL/EDSL

Fast convergence, but stronger assumptions on the problems it solves [4, 5].

[Figure: estimation error of CSL-type divide-and-combine quantile regression over 20 iterations, for K = 5, 10, 20, compared with the global estimator.]

Figure: CSL applied to penalized quantile regression with \tau = 0.3 (a surrogate Hessian is used). Sample size n = 500, dimension p = 15; K is the number of subsets.
Part II: Divide-and-combine statistical aggregation

    \min_\theta L(X_1, y_1; \theta) \ (+\lambda P(\theta)) \to \hat\theta_1
    \min_\theta L(X_2, y_2; \theta) \ (+\lambda P(\theta)) \to \hat\theta_2
    \vdots
    \min_\theta L(X_K, y_K; \theta) \ (+\lambda P(\theta)) \to \hat\theta_K
    \xrightarrow{\text{aggregate?}} \hat\theta_{\mathrm{GLOBAL}}

Consider the simple case with \lambda = 0. For examples with \lambda \neq 0, see [6] and the de-biased LASSO in [7, 10], among others.
Naive Approach: Simple Average

    \hat\theta_{\mathrm{GLOBAL}} = \frac{1}{K} \sum_{k=1}^K \hat\theta_k

This performs poorly in general, especially when the underlying data-generating model is non-linear.
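A small numerical illustration of why averaging fails under non-linearity (the toy model here, estimating exp(\mu) from Gaussian draws by the plug-in estimator exp(\bar x), is a made-up example, not from the slides' references): because exp is strictly convex, Jensen's inequality makes the average of subset estimates systematically overshoot the full-data estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=10_000)   # estimate lambda = exp(mu)
K = 100

subset_means = np.array([x_k.mean() for x_k in np.split(x, K)])
global_est = np.exp(x.mean())               # plug-in estimate on all data
averaged_est = np.exp(subset_means).mean()  # naive divide-and-average
```

By Jensen's inequality, averaged_est >= global_est, with equality only if all subset means coincide; the naive average therefore carries a systematic upward bias here.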
One Step Further

In [9], a one-step estimator starting from the average of the subset estimators is considered, in the general context of M-estimation where L(\cdot) is the criterion function.

1. First, take the average

    \hat\theta^{(0)} = \frac{1}{K} \sum_{k=1}^K \hat\theta_k.

2. Then, compute the one-step (Newton) estimator

    \hat\theta^{(1)} = \hat\theta^{(0)} - [\nabla^2 L(X, y; \hat\theta^{(0)})]^{-1} [\nabla L(X, y; \hat\theta^{(0)})],

where

    \nabla^2 L(X, y; \hat\theta^{(0)}) = \sum_{k=1}^K \nabla^2 L(X_k, y_k; \hat\theta^{(0)}) \quad \text{and} \quad \nabla L(X, y; \hat\theta^{(0)}) = \sum_{k=1}^K \nabla L(X_k, y_k; \hat\theta^{(0)}).
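A sketch of the procedure for logistic regression (the data-generating model, the subset count, and the plain Newton subset solver are illustrative assumptions, not details from [9]):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad(X, y, theta):
    # gradient of the negative log-likelihood
    return X.T @ (sigmoid(X @ theta) - y)

def hessian(X, theta):
    w = sigmoid(X @ theta) * (1.0 - sigmoid(X @ theta))
    return (X * w[:, None]).T @ X

def newton_mle(X, y, iters=25):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta = theta - np.linalg.solve(hessian(X, theta), grad(X, y, theta))
    return theta

rng = np.random.default_rng(0)
n, p, K = 2000, 3, 4
theta_star = np.array([1.0, -0.5, 0.25])
X = rng.normal(size=(n, p))
y = rng.binomial(1, sigmoid(X @ theta_star)).astype(float)

# Step 1: average the subset MLEs
X_parts, y_parts = np.split(X, K), np.split(y, K)
theta0 = np.mean([newton_mle(X_k, y_k) for X_k, y_k in zip(X_parts, y_parts)], axis=0)

# Step 2: one Newton step using the aggregated full-data gradient and Hessian,
# both of which are sums of subset quantities and hence distributedly computable
H = sum(hessian(X_k, theta0) for X_k in X_parts)
g = sum(grad(X_k, y_k, theta0) for X_k, y_k in zip(X_parts, y_parts))
theta1 = theta0 - np.linalg.solve(H, g)
```

For a quadratic loss the one-step estimator recovers the global solution exactly; for a genuinely non-linear loss such as this one, a single Newton step from the average typically moves the estimate much closer to the global optimum.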
One Step Further

Theorem 1 (Huang and Huo (2015))
Denote by \theta_0 the true parameter. Under some mild conditions, the one-step estimator \hat\theta^{(1)} satisfies

    \sqrt{n} (\hat\theta^{(1)} - \theta_0) \to N(0, \Sigma)

as long as K = O(\sqrt{n}), where

    \Sigma = E\big( [\nabla^2 L(X, y; \hat\theta^{(0)})]^{-1} \big) \, E\big( [\nabla L(X, y; \hat\theta^{(0)})] [\nabla L(X, y; \hat\theta^{(0)})]^T \big) \, E\big( [\nabla^2 L(X, y; \hat\theta^{(0)})]^{-1} \big).

That is, the aggregated estimator \hat\theta^{(1)} is asymptotically equivalent to the global estimator, as if the estimation were made with all the data, as long as the number of subsets does not grow too fast. For example, in the MLE case, \Sigma reduces to the inverse Fisher information.
Aggregated Estimating Equation (AEE)

Take a closer look at simple linear regression:

    \hat\beta = (X^T X)^{-1} X^T y = \Big( \sum_{k=1}^K X_k^T X_k \Big)^{-1} \sum_{k=1}^K X_k^T y_k = \Big( \sum_{k=1}^K X_k^T X_k \Big)^{-1} \sum_{k=1}^K X_k^T X_k \hat\beta_k,

a weighted average of the subset estimators. The weights use the local curvature of the loss function, or the gradient of the estimating equation,

    X^T (y - X\beta) = \sum_{k=1}^K X_k^T (y_k - X_k \beta)    (the OLS estimating equation).
Aggregated Estimating Equation (AEE)

Lin and Xi [11] generalize the idea to estimating-equation estimation,

    M(\theta) = \sum_{i=1}^n \phi(x_i, y_i; \theta) = 0 \quad \Big( \text{or } \sum_{i=1}^n \nabla L(x_i, y_i; \theta) = 0 \Big).

- Use the gradient of M_k(\theta) = \sum_{i \in S_k} \phi(x_i, y_i; \theta) as the weight for subset k, i.e.,

    A_k = \sum_{i \in S_k} \frac{\partial \phi(x_i, y_i; \hat\theta_k)}{\partial \theta}.

- Then calculate the AEE estimator as

    \hat\theta_{\mathrm{AEE}} = \Big( \sum_{k=1}^K A_k \Big)^{-1} \sum_{k=1}^K A_k \hat\theta_k.

It is proved in [11] that the AEE estimator is equivalent to the global estimator under some mild conditions, as long as K = O(n^\gamma) for some 0 < \gamma < 1.
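For linear regression the AEE weights are A_k = X_k^T X_k (the gradient of the OLS estimating equation, up to sign, which cancels in the weighted average), and the aggregation reproduces the global OLS fit exactly. A sketch, with synthetic data and an even K-way split as assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 1000, 4, 5
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=n)

# subset OLS estimates and their weights A_k = X_k^T X_k
betas, weights = [], []
for X_k, y_k in zip(np.split(X, K), np.split(y, K)):
    A_k = X_k.T @ X_k
    betas.append(np.linalg.solve(A_k, X_k.T @ y_k))
    weights.append(A_k)

# AEE: weighted average of the subset estimators
W = sum(weights)
beta_aee = np.linalg.solve(W, sum(A @ b for A, b in zip(weights, betas)))
```

For non-linear estimating equations the identity is no longer exact, which is why [11] needs the asymptotic argument; here it holds by simple algebra.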
Summary

Another approach: Song et al. proposed a D&C method that combines local confidence distributions or confidence inference functions [12].

Distributed optimization algorithms vs. statistical aggregation:
- A distributed optimization algorithm is generally iterative, while statistical aggregation is non-iterative; statistical aggregation methods are communication-efficient.
- A distributed optimization algorithm solves the original problem and finds the global estimate. A statistical aggregation estimate is not identical to the global estimate, but an ideal aggregation method is asymptotically equivalent to the global solution, as if the estimation were made on the entire data.

We did not talk about inference, but:
- For bootstrap-based inference, distributed optimization algorithms help parallelize and speed up the computation.
- For inference based on asymptotic results, the asymptotic covariance matrices are often computable in a distributed fashion.
Q&A

Thank you!
References I

[1] Peter Richtárik and Martin Takáč (2016). Distributed Coordinate Descent Method for Learning with Big Data. Journal of Machine Learning Research.
[2] Stephen Boyd et al. (2011). Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning.
[3] Virginia Smith et al. (2016). CoCoA: A General Framework for Communication-Efficient Distributed Optimization. arXiv:1611.02189.
[4] Michael I. Jordan et al. (2016). Communication-Efficient Distributed Statistical Inference. arXiv:1605.07689v3.
References II

[5] Jialei Wang et al. (2016). Efficient Distributed Learning with Sparsity. arXiv:1605.07991v1.
[6] Xueying Chen and Min-ge Xie (2014). A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data. Statistica Sinica.
[7] Adel Javanmard and Andrea Montanari (2014). Confidence Intervals and Hypothesis Testing for High-Dimensional Regression. Journal of Machine Learning Research.
[8] Neal Parikh and Stephen Boyd (2014). Block Splitting for Distributed Optimization. Mathematical Programming Computation.
[9] Cheng Huang and Xiaoming Huo (2015). A Distributed One-Step Estimator. arXiv:1511.01443v2.
References III

[10] Jason Lee et al. (2015). Communication-Efficient Sparse Regression: A One-Shot Approach. Journal of Machine Learning Research.
[11] Nan Lin and Ruibin Xi (2011). Aggregated Estimating Equation Estimation. Statistics and Its Interface.
[12] Peter X.K. Song et al. (2016). Confidence Distributions and Confidence Inference Functions: General Data Integration Methods for Big Complex Data. Personal communication.