A Sequential Split-Conquer-Combine Approach for Analysis of Big Spatial Data
1 A Sequential Split-Conquer-Combine Approach for Analysis of Big Spatial Data
Min-ge Xie
Department of Statistics & Biostatistics, Rutgers, The State University of New Jersey
In collaboration with Xuying Chen, Ying Hung and Chengrui Li
SAMSI Workshop on Distributed and Parallel Data Analysis (DPDA), Raleigh, North Carolina; September 2016
Research supported in part by grants from NSF, DHS and IBM
2 Introduction - era of data science/big data
Today, the integration of computer technology into science and daily life has enabled the collection of big data across all fields.
- Extremely large data: big n, big p, big t, and growing
- Difficulties/challenges in the analysis of extremely large data:
  - Memory/storage issues: too large to fit into a single computer or a single site
  - Computing issues: too expensive to perform any computationally intensive analysis
  - Statistical issues: heterogeneity (in designs, information, & more), sparsity, non-conventional (e.g., image/voice/text/network) data, missing data, random/stochastic components, etc.
- But big data also leads to opportunities in many fields, including statistics!
3 Introduction - split-conquer-combine approach
The split-conquer-combine (SCC) method is natural, effective, and generally applicable.
Recent publications (partial list): Battey et al. (2015), Chen & Xie (2012/14), Dai et al. (2015), Huang & Huo (2015), Song & Liang (2015), Wang & Dunson (2013), Zhang et al. (2013), Zhao et al. (2014), ...
Versus parallel computing in CS: the emphasis here is on statistical performance/properties, seeking the best possible outcomes.
4 Introduction - our first SCC project
Port-of-entry inspection project (DHS/DIMACS/CCICADA)
Background: standardized shipping containers transport 95% of U.S. imports by tonnage; they are highly vulnerable to smuggling of nuclear/radiological and other illegal materials.
Inspection goal: find shipments of high risk and identify important influencers (variables).
Approach: analyze historical data and search for patterns.
Problem: a huge amount of data (available data: two weeks of shipments; big n, big p)
(Chen 2012, Chen and Xie 2012/14)
5 Introduction - our first SCC project
Setting: penalized regression/variable selection in GLMs
- Feasibility: the result obtained using the SCC method is the same as / asymptotically equivalent to the result of analyzing the entire data all at once
- Computational efficiency: the SCC method can substantially reduce computing time and computer memory requirements if a computationally intensive algorithm is used
- Better error bounds: as a byproduct, the SCC method is more resistant to false model selections caused by spurious correlations
(Figure: mean computing time in seconds ± 2SD, for all data and K = 2, ..., against n, the number of observations.)
(Chen 2012, Chen and Xie 2012/14)
6 New (second) project - motivating example of spatial data
IBM data center thermal management project
A data center is an integrated facility housing multiple-unit servers, providing application services or management for data processing.
Goal: design a data center with an efficient heat removal mechanism.
Computational Fluid Dynamics (CFD) simulation (n = 26820; p = 9)
Figure: Heat map for IBM T. J. Watson Data Center
7 Gaussian process model
Gaussian process (GP) model: y = Xβ + Z(x), where
- y: n×1 vector of observations (e.g., room temperatures)
- X: n×p design matrix
- β: p×1 unknown parameters
- Z(x): a GP with mean 0 and covariance σ²Σ(θ)
- Σ(θ): n×n correlation matrix with correlation parameters θ
Correlation matrix: the ij-th element of Σ(θ) is defined by a power exponential function,
  corr(Z(x_i), Z(x_j)) = exp(-θ'|x_i - x_j|) = exp(-Σ_{k=1}^p θ_k |x_{ik} - x_{jk}|).
Remark: σ² is assumed known for simplicity in this talk.
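The power exponential correlation above takes only a few lines of numpy. This is a minimal sketch: the helper name `power_exp_corr` is ours, and the θ values are borrowed from the simulation study later in the talk.

```python
import numpy as np

def power_exp_corr(X, theta):
    """Power-exponential correlation matrix for a GP.

    Entry (i, j) is exp(-sum_k theta_k * |x_ik - x_jk|), matching
    the slide's formula.  X is n x p, theta has length p.
    """
    # |x_ik - x_jk| for every pair (i, j) and every coordinate k
    diffs = np.abs(X[:, None, :] - X[None, :, :])   # shape n x n x p
    return np.exp(-(diffs * theta).sum(axis=2))     # shape n x n

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 4))
theta = np.array([15.0, 1.5, 2.0, 3.0])             # values from the simulation study
Sigma = power_exp_corr(X, theta)
```

The resulting matrix is symmetric with unit diagonal, as a correlation matrix must be; building it densely like this is exactly what becomes infeasible for large n, which motivates the rest of the talk.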
8 Estimation and prediction
Likelihood inference:
  l(β, θ, σ²) = -(1/(2σ²)) (y - Xβ)' Σ(θ)^{-1} (y - Xβ) - (1/2) log|Σ(θ)| - (n/2) log(σ²)
So,
  β̂_θ = argmax_β l(β | θ, σ²) = (X' Σ(θ)^{-1} X)^{-1} X' Σ(θ)^{-1} y
  θ̂_{β,σ²} = argmax_θ l(θ | β, σ²)
The GP prediction, say y_0, at a new point x_0, given parameters (β, θ), follows a normal distribution with mean p_0(β, θ) and variance m_0(β, θ), where
  p_0(β, θ) = x_0'β + γ(θ)' Σ(θ)^{-1} (y - Xβ)
  m_0(β, θ) = σ² (1 - γ(θ)' Σ(θ)^{-1} γ(θ)),
and γ(θ) is an n×1 vector whose i-th element equals φ(|x_i - x_0|; θ).
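The profile estimate β̂_θ and the plug-in predictor can be written directly with numpy. A minimal sketch under our own naming: `gls_beta` and `gp_plugin_predict` are hypothetical helpers, and the correlation vector γ is passed in rather than computed from a specific φ.

```python
import numpy as np

def gls_beta(X, y, Sigma):
    """Profile MLE given theta: (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y."""
    Si_X = np.linalg.solve(Sigma, X)            # Sigma^{-1} X, without forming the inverse
    return np.linalg.solve(X.T @ Si_X, Si_X.T @ y)

def gp_plugin_predict(x0, X, y, Sigma, gamma, beta, sigma2=1.0):
    """Plug-in GP predictor at x0: mean p0 and variance m0 from the slide."""
    p0 = x0 @ beta + gamma @ np.linalg.solve(Sigma, y - X @ beta)
    m0 = sigma2 * (1.0 - gamma @ np.linalg.solve(Sigma, gamma))
    return p0, m0

# Sanity check with Sigma = I: GLS reduces to ordinary least squares
rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=n)
beta_hat = gls_beta(X, y, np.eye(n))
p0, m0 = gp_plugin_predict(np.zeros(p), X, y, np.eye(n), np.zeros(n), beta_hat)
```

Each `np.linalg.solve` on the full Σ costs O(n³), which is the computational bottleneck discussed on the next slide.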
9 Two challenges in GP modeling
Computational issue: estimation and prediction involve Σ(θ)^{-1} and |Σ(θ)|, at a cost of order O(n³): not feasible when n is large.
Uncertainty quantification of the GP predictor: the plug-in predictive distribution is widely used (i.e., unknown parameters in the formula are replaced by their point estimates), but it underestimates the uncertainty (cf. Santner et al. 2003).
10 Existing methods/efforts
For the computational issue:
- Change the model to one that is computationally convenient: Rue and Held (2005), Cressie and Johannesson (2008)
- Approximate the likelihood function: Stein et al. (2004), Furrer et al. (2006), Fuentes (2007), Kaufman et al. (2008)
- These do not focus on uncertainty quantification / bring in additional uncertainty, and are still not feasible when n is super large
For uncertainty quantification of GP prediction:
- Bayesian predictive distribution: Kennedy and O'Hagan (2001)
- Bootstrap approach: Sjostedt-de Luna (2003), Santner et al. (2003)
- Computationally intensive
11 Our solution
Can we solve both problems within a unified framework?
Yes - we develop a sequential split-conquer-combine method, based on our understanding from the previously developed split-conquer-combine method (for independent data) and confidence distribution developments.
13 Introduction to confidence distribution (CD)
Statistical inference (parameter estimation): point estimate, interval estimate, distribution estimate (e.g., confidence distribution)
Example: X_1, ..., X_n i.i.d. following N(μ, 1)
- Point estimate: x̄_n = (1/n) Σ_{i=1}^n x_i
- Interval estimate: (x̄_n - 1.96/√n, x̄_n + 1.96/√n)
- Distribution estimate: N(x̄_n, 1/n)
The idea of the CD approach is to use a sample-dependent distribution (or density) function to estimate the parameter of interest. It is to provide "simple and interpretable summaries of what can reasonably be learned from data" (Cox 2013) and also meaningful answers for all inference questions (Xie and Singh 2013).
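The three kinds of estimates in the normal example can be computed with the Python standard library; here `statistics.NormalDist` plays the role of the CD N(x̄_n, 1/n). A small sketch, with simulated data:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(5)
mu_true, n = 1.0, 400
x = rng.normal(mu_true, 1.0, size=n)

xbar = x.mean()                                            # point estimate
ci = (xbar - 1.96 / np.sqrt(n), xbar + 1.96 / np.sqrt(n))  # interval estimate
cd = NormalDist(mu=xbar, sigma=1.0 / np.sqrt(n))           # distribution estimate N(xbar, 1/n)

# Any confidence interval can be read off the CD by inverting its CDF
ci_from_cd = (cd.inv_cdf(0.025), cd.inv_cdf(0.975))
```

This illustrates the point of the slide: the distribution estimate contains the point estimate (its center) and every confidence interval (its quantiles) as special cases.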
15 Inference using CD: Point estimators, Intervals, p-values & more
Statistical Dream: Bayesian, Fiducial and Frequentist = BFF = Best Friends Forever
16 Efficient estimation using CD and Split-Conquer-Combine
CD is useful in combining information (meta-analysis, fusion learning, ...).
Chen and Xie (2012/14) considered a computationally efficient approach using Split-Conquer-Combine and CD.
(Diagram: the entire dataset is split into subsets D_1, D_2, ..., D_m; each is conquered to produce an estimate; the estimates are combined into a single combined estimate.)
The result is asymptotically equivalent to the one using the entire dataset.
But if SCC is directly applied to dependent data, it will lose information.
Question: how to extend SCC to big data with dependent structures, such as GP?
17 Sequential Split-Conquer-Combine for GP
Our solution:
Figure: Sequential Split-Conquer-Combine Approach
18 Ingredients
- Divide the entire dataset into (correlated) subsets
- Perform a transformation sequentially to create independent subsets
- Combine estimators using CD
- Quantify prediction uncertainty using CD
Assumption (weak): compact-support correlation in one dimension (i.e., taper Σ to Σ_t); tapering is needed only on one variable (say X_1). After sorting the index according to the X_1 values, Σ_t is block tridiagonal (n × n):
  Σ_t = [ Σ_11        Σ_12        O    ...  O
          Σ_21        Σ_22        ...       O
          ...
          O    ...  Σ_(m-1)(m-1)  Σ_(m-1)m
          O    ...  Σ_m(m-1)      Σ_mm     ]
19 Sequentially update data
Divide the entire dataset into subsets y = {y_a}, a = 1, ..., m. Transform y to y* by sequentially updating
  y*_a = y_a - L_{a(a-1)} y*_{a-1},
where L_{(a+1)a} = Σ_{t,(a+1)a} D_a^{-1} and D_a = Σ_{t,aa} - L_{a(a-1)} D_{a-1} L_{a(a-1)}'.
The y*_a are independent, with covariances D_a. Also, with
  D = blockdiag(D_1, D_2, ..., D_m) and L the unit lower block-triangular matrix with sub-diagonal blocks L_21, ..., L_m(m-1),
the compact-support correlation matrix factors as Σ_t = L D L'.
Sequential updates are computationally efficient.
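The recursion on this slide is a block-LDL' factorization of the tapered covariance, applied on the fly to the data blocks. The sketch below, under our own naming (`sequential_decorrelate`) and with a toy 3-block tridiagonal covariance, is an assumed minimal implementation of that idea:

```python
import numpy as np

def sequential_decorrelate(y_blocks, Sig_diag, Sig_sub):
    """Sequentially transform correlated blocks into independent ones.

    For a block-tridiagonal Sigma_t = L D L' (slide 19):
      D_1 = Sigma_11,
      L_{a,a-1} = Sigma_{a,a-1} D_{a-1}^{-1},
      D_a = Sigma_aa - L_{a,a-1} D_{a-1} L_{a,a-1}',
      y*_a = y_a - L_{a,a-1} y*_{a-1}.
    Returns the updated blocks y*_a and their covariances D_a.
    """
    y_star, D = [y_blocks[0]], [Sig_diag[0]]
    for a in range(1, len(y_blocks)):
        L_sub = Sig_sub[a - 1] @ np.linalg.inv(D[a - 1])
        D.append(Sig_diag[a] - L_sub @ D[a - 1] @ L_sub.T)
        y_star.append(y_blocks[a] - L_sub @ y_star[a - 1])
    return y_star, D

# Toy block-tridiagonal SPD covariance: 3 blocks of size 2
m_blocks, b = 3, 2
n = m_blocks * b
dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
Sigma_t = np.where(dist == 0, 2.0, np.where(dist == 1, 0.5, np.where(dist == 2, 0.2, 0.0)))
Sig_diag = [Sigma_t[a*b:(a+1)*b, a*b:(a+1)*b] for a in range(m_blocks)]
Sig_sub = [Sigma_t[(a+1)*b:(a+2)*b, a*b:(a+1)*b] for a in range(m_blocks - 1)]
rng = np.random.default_rng(2)
y = rng.normal(size=n)
y_star, D = sequential_decorrelate([y[a*b:(a+1)*b] for a in range(m_blocks)], Sig_diag, Sig_sub)
```

The key point is that each step touches only one block and its neighbor, so the cost is driven by the block sizes rather than by n.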
21 Estimation from each subset
(Assume θ is fixed/known.) MLE from the a-th updated subset:
  β̂_a = argmax_β l_a(β) = (C_a' D_a^{-1} C_a)^{-1} C_a' D_a^{-1} y*_a
Significant computational reduction, because D_a is much smaller than the original covariance matrix.
An individual CD (confidence distribution) for the a-th updated subset is (cf. Xie and Singh 2013):
  H_a(β) = Φ(S_a^{-1/2} (β - β̂_a)),
where S_a = Cov(β̂_a) and Φ is the CDF of the p-dimensional multivariate standard normal.
23 CD combining
Following Singh, Xie and Strawderman (2005), Liu, Liu and Xie (2014) and Yang et al. (2014), a combined CD H_c(β) is
  H_c(β) = Φ(S_SSCC^{-1/2} (β - β̂_SSCC)),
where β̂_SSCC = (Σ_a W_a)^{-1} (Σ_a W_a β̂_a) with W_a = C_a' D_a^{-1} C_a, and S_SSCC = Cov(β̂_SSCC).
A similar framework can be applied to all the parameters (β, θ): iteratively loop between β | θ and θ | β.
Theorem 1. Under some regularity assumptions, when the number of blocks m = o(n^{1/2}) and n → ∞, the SSCC estimator λ̂_c = (β̂_SSCC, θ̂_SSCC) is asymptotically as efficient as the MLE λ̂_MLE = (β̂_MLE, θ̂_MLE).
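The combining step is inverse-variance weighting of the block estimates, with W_a read as the precision C_a' D_a^{-1} C_a. A sketch under our own naming (`cd_combine`), with the sanity check that when the blocks are independent and D_a = I the combined estimator coincides with the full-data least-squares fit:

```python
import numpy as np

def cd_combine(beta_hats, precisions):
    """CD combination: beta_SSCC = (sum_a W_a)^{-1} sum_a W_a beta_hat_a."""
    W_sum = sum(precisions)
    return np.linalg.solve(W_sum, sum(W @ b for W, b in zip(precisions, beta_hats)))

rng = np.random.default_rng(3)
p, m, b = 3, 4, 50
beta_true = np.array([1.0, -2.0, 0.5])
beta_hats, Ws, X_all, y_all = [], [], [], []
for a in range(m):
    C = rng.normal(size=(b, p))                 # block design matrix C_a
    D = np.eye(b)                               # block covariance after the sequential update
    y_a = C @ beta_true + rng.normal(size=b)
    W = C.T @ np.linalg.solve(D, C)             # W_a = C_a' D_a^{-1} C_a
    beta_hats.append(np.linalg.solve(W, C.T @ np.linalg.solve(D, y_a)))
    Ws.append(W)
    X_all.append(C)
    y_all.append(y_a)
beta_sscc = cd_combine(beta_hats, Ws)
```

Because Σ_a W_a β̂_a = Σ_a C_a' D_a^{-1} y*_a, the combination is exact algebra, not an approximation, for fixed θ; the asymptotics in Theorem 1 are needed only once θ is also estimated.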
24 GP predictive distribution
The GP prediction y_0 at a new point x_0, given parameters (β, θ), follows a normal distribution with mean p_0(β, θ) and variance m_0(β, θ), where
  p_0(β, θ) = x_0'β + γ(θ)' Σ(θ)^{-1} (y - Xβ)
  m_0(β, θ) = σ² (1 - γ(θ)' Σ(θ)^{-1} γ(θ)),
and γ(θ) is an n×1 vector whose i-th element equals φ(|x_i - x_0|; θ).
Drawbacks:
- Computational issue again!
- The conventional plug-in predictive distribution underestimates the uncertainty.
25 Quantify GP prediction uncertainty using SSCC and CD
An easy-to-compute GP prediction:
  p_1(β, θ) = x_0'β + Σ_{a=1}^m γ*_a(θ)' D_a^{-1}(θ) (y*_a - C_a(θ)β)
  m_1(β, θ) = σ² (1 - Σ_{a=1}^m γ*_a(θ)' D_a^{-1}(θ) γ*_a(θ))
A better way to quantify uncertainty: use a CD-based predictive distribution function,
  Q(y_0; y) = ∫_{λ∈Θ} G_λ(y_0) dH_c(λ; y),
where G_λ(·) is the CDF of N(p_1(λ), m_1(λ)), and H_c(λ; y) is the joint CD of λ = (β, θ) obtained from our SSCC approach.
26 Theoretical result on prediction
Theorem 2. Under some regularity assumptions, when the number of blocks m = o(n^{1/2}) and n → ∞,
  p_1(β, θ) → p_0(β, θ) and m_1(β, θ) → m_0(β, θ).
The easy-to-compute GP prediction is asymptotically equivalent to the original prediction.
Compared with the plug-in predictive distribution, the CD-based predictive distribution function has smaller average Kullback-Leibler distance to the true predictive distribution (Shen, Liu, and Xie 2016).
27 Quantify GP prediction uncertainty using SSCC and CD
Monte-Carlo algorithm: repeat the following two steps T times to obtain T copies of simulated y_0 from Q(·; y), denoted by y_0^(t), t = 1, ..., T:
  1. Generate ξ_t | y ~ H_c(λ; y);
  2. Generate y_0^(t) | ξ_t ~ N(p_1(ξ_t), m_1(ξ_t)).
These T copies of y_0^(t) can be used to make predictive inference for y_0.
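The two-step algorithm can be sketched generically. Everything problem-specific is abstracted away here: the CD draw is approximated by a normal centered at the parameter estimate, and `p1`/`m1` are hypothetical callables standing in for the formulas on slide 25.

```python
import numpy as np

def cd_predictive_draws(rng, T, lam_hat, lam_cov, p1, m1):
    """Monte-Carlo draws from the CD-based predictive distribution.

    Step 1: xi_t ~ H_c(lambda; y), approximated here by N(lam_hat, lam_cov);
    Step 2: y0_t | xi_t ~ N(p1(xi_t), m1(xi_t)).
    """
    draws = np.empty(T)
    for t in range(T):
        xi = rng.multivariate_normal(lam_hat, lam_cov)   # step 1: parameter draw from the CD
        draws[t] = rng.normal(p1(xi), np.sqrt(m1(xi)))   # step 2: prediction given xi
    return draws

# Toy illustration with a scalar parameter and made-up p1, m1
rng = np.random.default_rng(4)
draws = cd_predictive_draws(
    rng, T=5000,
    lam_hat=np.array([2.0]), lam_cov=np.array([[0.04]]),
    p1=lambda xi: xi[0],     # hypothetical predictive mean: the parameter itself
    m1=lambda xi: 1.0,       # hypothetical constant predictive variance
)
```

In this toy setup the spread of the draws is close to 1.04 rather than the plug-in value 1.0: the extra 0.04 is exactly the parameter uncertainty that the plug-in predictive distribution ignores.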
28 Simulation study: setup
GP model: y = Xβ + Z(x)
Sample sizes: n = 1000, 1500, 2000; four covariates (p = 4)
True parameter values: β = (3, 2, 1, 2, 1.5), θ = (15, 1.5, 2, 3), and σ² = 1
The ij-th element of Σ(θ) is computed by σ_ij = exp(-Σ_{k=1}^p θ_k |x_{ik} - x_{jk}|)
The number of splits: m = 5
29 Simulation study: analysis results
Columns: MLE / Taper / SSCC, for n = 1000; n = 1500; n = 2000. Entries are mean (SD).
β̂: (0.33) 1.96(0.33) 1.96(0.33); 2.00(0.30) 2.00(0.30) 2.00(0.30); 2.08(0.23) 2.08(0.23) 2.08(0.23)
β̂: (0.43) 3.02(0.43) 3.02(0.43); 3.01(0.36) 3.01(0.36) 3.01(0.36); 2.90(0.28) 2.90(0.28) 2.90(0.28)
β̂: (0.23) 1.00(0.23) 1.00(0.23); 0.99(0.21) 0.99(0.21) 0.99(0.21); 1.04(0.20) 1.04(0.20) 1.04(0.20)
β̂: (0.24) 2.04(0.24) 2.04(0.24); 1.97(0.25) 1.97(0.25) 1.97(0.25); 1.98(0.24) 1.98(0.24) 1.98(0.24)
β̂: (0.25) 1.53(0.25) 1.53(0.25); 1.52(0.28) 1.52(0.28) 1.52(0.28); 1.49(0.18) 1.49(0.18) 1.49(0.18)
θ̂: (0.42) 14.72(0.42) 14.80(0.43); 14.65(0.39) 14.65(0.39) 14.65(0.49); 14.69(0.45) 14.74(0.45) 14.79(0.46)
θ̂: (0.06) 1.49(0.06) 1.51(0.06); 1.49(0.06) 1.49(0.06) 1.50(0.10); 1.49(0.06) 1.49(0.06) 1.49(0.10)
θ̂: (0.06) 2.00(0.06) 2.01(0.06); 1.98(0.06) 1.98(0.06) 2.00(0.06); 1.99(0.07) 1.99(0.07) 2.00(0.10)
θ̂: (0.06) 3.01(0.06) 2.99(0.06); 3.00(0.06) 3.00(0.06) 3.00(0.08); 3.01(0.06) 3.01(0.06) 3.00(0.07)
CT: 32.46(1.29) 30.61(1.34) 4.54(0.11); 99.66(3.83) 95.90(5.44) 10.39(0.53); (6.96) (9.33) 20.32(0.95)
Table: Simulation results for n = 1000, 1500, 2000 with 100 replications
30 Simulation study: computing time
(Figure: computing time in seconds vs sample size n, for MLE, Taper and SSCC.)
Remark: reduction of computational complexity from O(n³) to O(Σ_a l_a³), where l_a is the size of the a-th subset and Σ_a l_a = n.
31 Real data example: IBM data center thermal management project
Variable | β̂_mle | β̂_taper | β̂_scc | β̂_sscc
X1 CRAC unit 1 flow rate | (0.101) | (0.101) | (0.561) | (0.101)
X2 CRAC unit 2 flow rate | (0.099) | (0.099) | (0.581) | (0.099)
X3 CRAC unit 3 flow rate | (0.101) | (0.101) | (0.839) | (0.101)
X4 CRAC unit 4 flow rate | (0.113) | (0.113) | (1.406) | (0.113)
X5 Room temperature | (0.138) | (0.138) | (3.116) | (0.138)
X6 Tile open area percentage | (0.106) | (0.106) | (0.894) | (0.106)
X7 Location in x-axis | (0.054) | (0.054) | (0.086) | (0.054)
X8 Location in y-axis | 1.672(0.080) | 1.681(0.080) | 1.783(0.484) | 1.681(0.080)
X9 Height | (0.194) | (0.194) | (6.437) | (0.194)
Computing time (in seconds)
Table: Using a subsample of n = 9000; only estimating β, with θ given (fixed at the values estimated in Chen et al. 2014).
32 Real data example: IBM data center thermal management project
Figure: Computing time. Using a subsample of n = 9000; only estimating β, with θ given (fixed at the values estimated in Chen et al. 2014).
33 Real data example: IBM data center thermal management project
Columns: MLE / Taper / SSCC, for n = 1800; n = 3600; n = 26820.
CRAC unit 1 flow rate: β̂1 -8.28(0.10) -8.29(0.10) -8.29(0.10); (0.09); (0.08)
  θ̂1 0.84(0.01) 0.84(0.01) 0.86(0.01); (0.01); (0.01)
CRAC unit 2 flow rate: β̂2 -9.00(0.10) -8.99(0.10) -8.99(0.10); (0.09); (0.08)
  θ̂2 0.76(0.01) 0.76(0.01) 0.79(0.01); (0.01); (0.01)
CRAC unit 3 flow rate: β̂3 -6.44(0.10) -6.45(0.10) -6.45(0.10); (0.09); (0.09)
  θ̂3 1.20(0.01) 1.20(0.01) 1.19(0.01); (0.01); (0.01)
CRAC unit 4 flow rate: β̂4 -5.42(0.11) -5.41(0.11) -5.41(0.11); (0.10); (0.10)
  θ̂4 1.90(0.01) 1.90(0.01) 1.80(0.01); (0.01); (0.01)
Room temperature: β̂5 -0.08(0.13) -0.07(0.13) -0.07(0.13); (0.13); (0.13)
  θ̂5 3.50(0.01) 3.50(0.01) 3.40(0.01); (0.01); (0.01)
Tile open area percentage: β̂6 -1.98(0.10) -1.97(0.10) -1.97(0.10); (0.10); (0.09)
  θ̂6 1.29(0.01) 1.29(0.01) 1.20(0.01); (0.01); (0.01)
Location in x-axis: β̂7 -3.39(0.06) -3.41(0.06) -3.41(0.06); (0.04); (0.03)
  θ̂7 0.20(0.01) 0.20(0.01) 0.17(0.01); (0.01); (0.01)
Location in y-axis: β̂8 2.80(0.08) 2.80(0.08) 2.80(0.08); (0.07); (0.06)
  θ̂8 0.60(0.01) 0.60(0.01) 0.50(0.01); (0.01); (0.01)
Location in z-axis: β̂9 22.35(0.18) 22.35(0.18) 22.35(0.18); (0.18); (0.18)
  θ̂9 21.90(0.03) 21.90(0.03) 21.45(0.03); (0.03); (0.02)
Computing Time (in seconds)
Table: CFD data results with n = 1800, 3600 and 26820; estimating both β and θ.
34 Summary
We introduce a unified SSCC framework for GP modeling that can dramatically reduce computation and also provide a better quantification of the prediction uncertainty:
- A Sequential Split-Conquer-Combine (SSCC) procedure for big spatial data
- CD combining
- Predictive distribution based on CDs
It provides asymptotically equivalent estimation, and reduces the computational complexity from O(n³) to O(Σ_a l_a³), where l_a is the size of the a-th subset and Σ_a l_a = n.
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationBayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling
Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling Jon Wakefield Departments of Statistics and Biostatistics University of Washington 1 / 37 Lecture Content Motivation
More informationSTAT 135 Lab 5 Bootstrapping and Hypothesis Testing
STAT 135 Lab 5 Bootstrapping and Hypothesis Testing Rebecca Barter March 2, 2015 The Bootstrap Bootstrap Suppose that we are interested in estimating a parameter θ from some population with members x 1,...,
More informationEM Algorithm II. September 11, 2018
EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationSupplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles
Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles John Novembre and Montgomery Slatkin Supplementary Methods To
More informationBayesian Dynamic Linear Modelling for. Complex Computer Models
Bayesian Dynamic Linear Modelling for Complex Computer Models Fei Liu, Liang Zhang, Mike West Abstract Computer models may have functional outputs. With no loss of generality, we assume that a single computer
More informationStatistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation
Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence
More informationQuasi-likelihood Scan Statistics for Detection of
for Quasi-likelihood for Division of Biostatistics and Bioinformatics, National Health Research Institutes & Department of Mathematics, National Chung Cheng University 17 December 2011 1 / 25 Outline for
More informationMonitoring Wafer Geometric Quality using Additive Gaussian Process
Monitoring Wafer Geometric Quality using Additive Gaussian Process Linmiao Zhang 1 Kaibo Wang 2 Nan Chen 1 1 Department of Industrial and Systems Engineering, National University of Singapore 2 Department
More informationTime Series and Dynamic Models
Time Series and Dynamic Models Section 1 Intro to Bayesian Inference Carlos M. Carvalho The University of Texas at Austin 1 Outline 1 1. Foundations of Bayesian Statistics 2. Bayesian Estimation 3. The
More informationVariable selection for model-based clustering
Variable selection for model-based clustering Matthieu Marbac (Ensai - Crest) Joint works with: M. Sedki (Univ. Paris-sud) and V. Vandewalle (Univ. Lille 2) The problem Objective: Estimation of a partition
More informationRegression with correlation for the Sales Data
Regression with correlation for the Sales Data Scatter with Loess Curve Time Series Plot Sales 30 35 40 45 Sales 30 35 40 45 0 10 20 30 40 50 Week 0 10 20 30 40 50 Week Sales Data What is our goal with
More informationModification and Improvement of Empirical Likelihood for Missing Response Problem
UW Biostatistics Working Paper Series 12-30-2010 Modification and Improvement of Empirical Likelihood for Missing Response Problem Kwun Chuen Gary Chan University of Washington - Seattle Campus, kcgchan@u.washington.edu
More informationKriging and Alternatives in Computer Experiments
Kriging and Alternatives in Computer Experiments C. F. Jeff Wu ISyE, Georgia Institute of Technology Use kriging to build meta models in computer experiments, a brief review Numerical problems with kriging
More informationLabel Switching and Its Simple Solutions for Frequentist Mixture Models
Label Switching and Its Simple Solutions for Frequentist Mixture Models Weixin Yao Department of Statistics, Kansas State University, Manhattan, Kansas 66506, U.S.A. wxyao@ksu.edu Abstract The label switching
More informationWhen using physical experimental data to adjust, or calibrate, computer simulation models, two general
A Preposterior Analysis to Predict Identifiability in Experimental Calibration of Computer Models Paul D. Arendt Northwestern University, Department of Mechanical Engineering 2145 Sheridan Road Room B214
More informationHierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets
Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geo-statistical Datasets Abhirup Datta 1 Sudipto Banerjee 1 Andrew O. Finley 2 Alan E. Gelfand 3 1 University of Minnesota, Minneapolis,
More informationMultistate Modeling and Applications
Multistate Modeling and Applications Yang Yang Department of Statistics University of Michigan, Ann Arbor IBM Research Graduate Student Workshop: Statistics for a Smarter Planet Yang Yang (UM, Ann Arbor)
More informationEcon 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines
Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the
More informationUnivariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation
Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical
More informationMaster s Written Examination
Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth
More informationLecture 7 Introduction to Statistical Decision Theory
Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7
More informationSome Curiosities Arising in Objective Bayesian Analysis
. Some Curiosities Arising in Objective Bayesian Analysis Jim Berger Duke University Statistical and Applied Mathematical Institute Yale University May 15, 2009 1 Three vignettes related to John s work
More informationFast Likelihood-Free Inference via Bayesian Optimization
Fast Likelihood-Free Inference via Bayesian Optimization Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationGeneralized Linear Models
Generalized Linear Models Advanced Methods for Data Analysis (36-402/36-608 Spring 2014 1 Generalized linear models 1.1 Introduction: two regressions So far we ve seen two canonical settings for regression.
More informationKarhunen-Loeve Expansion and Optimal Low-Rank Model for Spatial Processes
TTU, October 26, 2012 p. 1/3 Karhunen-Loeve Expansion and Optimal Low-Rank Model for Spatial Processes Hao Zhang Department of Statistics Department of Forestry and Natural Resources Purdue University
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationLawrence D. Brown* and Daniel McCarthy*
Comments on the paper, An adaptive resampling test for detecting the presence of significant predictors by I. W. McKeague and M. Qian Lawrence D. Brown* and Daniel McCarthy* ABSTRACT: This commentary deals
More informationOptimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.
Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may
More informationPart 8: GLMs and Hierarchical LMs and GLMs
Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course
More informationEstimating subgroup specific treatment effects via concave fusion
Estimating subgroup specific treatment effects via concave fusion Jian Huang University of Iowa April 6, 2016 Outline 1 Motivation and the problem 2 The proposed model and approach Concave pairwise fusion
More informationPrimer on statistics:
Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing ryan.reece@gmail.com http://rreece.github.io/ Insight Data Science - AI Fellows Workshop Feb 16, 018 Outline 1. Maximum likelihood
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More information