Smooth additive models for large datasets


1 Smooth additive models for large datasets. Simon Wood, University of Bristol¹, U.K., in collaboration with Zheyuan Li, Gavin Shaddick and Nicole Augustin (University of Bath). Funded by EPSRC and the University of Bath.
¹ Currently hiring Statistics faculty!

2 London smog, Dec. 1952: thousands of premature deaths. Black smoke (particulates) and sulphur from domestic coal fires. Clean Air Act 1956. Monitoring from 1961.

3 Black smoke monitoring. [Figure: map of the black smoke monitoring network, easting against northing (km).] Four decades of daily black smoke monitoring at a variable subset of the stations shown. Started in 1961 to monitor air pollution (then mostly from coal), in the wake of the 1950s smog deaths. Epidemiological studies need estimates of daily exposure away from stations. $O(10^7)$ measurements, and suitable smooth latent Gaussian models have $O(10^4)$ coefficients, plus variance parameters.

4 Daily BS data. [Figure: panels of daily log(bs) observations, plotted against day (0 to 15000), against year, and against station location (east/north, km).]

5 Black smoke: previous work. Shaddick & Zidek² modelled annual average black smoke:
$$\log(b_i) = f_1(e_i, n_i) + f_2(e_i, n_i)\,y_i + f_3(e_i, n_i)\,y_i^2 + \epsilon_i$$
using INLA³ for inference. Modelling reduces the impact of preferential sampling. But annual averages fail to capture the acute pollution events, on the time scale of days, that accelerate deaths and are likely to be of long-term health significance. A daily model for all the $10^7$ data is appealing, but potentially expensive.
² Shaddick & Zidek, 2014, Spatial Statistics. ³ Rue, Martino & Chopin, 2009.

6 Daily black smoke modelling and existing methods. We had a 24-core, 128GB computer available. Trials with INLA suggested that we had no chance of fitting daily models within 128GB, nor in acceptable time. Using R package mgcv, acceptable daily models would have required 1TB of storage just to form the model matrix using the default methods⁴. mgcv's big additive model methods (bam⁵) have an acceptable memory footprint, obtained by factoring the model matrix block-wise, but experiments suggested weeks of computing time to fit. Perhaps parallelization of the bam methods is the solution?
⁴ Wood, 2011, JRSSB. ⁵ Wood, Goude & Shaw, 2015, JRSSC.

7 The model class. Concentrate on models of the form
$$y_i \sim \mathrm{EF}(\mu_i, \phi), \qquad g(\mu_i) = A_i\theta + \sum_j f_j(x_{ji}) + Z_i b$$
where A and Z are model matrices, the $f_j$ are smooth functions, $\theta$ is a parameter vector, b contains independent Gaussian random effects, and g is a known link function. Each $x_j$ may be vector valued. Represent the smooth functions, f, using spline basis expansions with coefficients $\beta$,
$$f(x) = \sum_{k=1}^K \beta_k b_k(x),$$
and define a quadratic smoothing penalty, e.g. $\int f''(x)^2\,dx = \beta^T S\beta$ (asymptotic considerations suggest $K = O(n^{1/9})$ to $O(n^{1/5})$).
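For concreteness (an illustration added here, not from the talk), mgcv's smoothCon exposes exactly this basis-plus-penalty representation:

```r
## Extract a spline basis and its quadratic penalty with mgcv::smoothCon().
library(mgcv)
set.seed(1)
x  <- runif(200)
sm <- smoothCon(s(x, k = 10, bs = "ps"), data = data.frame(x = x))[[1]]
X  <- sm$X       # 200 x 10 matrix of evaluated basis functions b_k(x)
S  <- sm$S[[1]]  # 10 x 10 penalty matrix S, so the penalty is beta' S beta
```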

8 Simple 1D basis-penalty example. [Figure: data y against x, the basis functions, and fits under increasing values of the second-difference penalty $\sum_i(\beta_{i-1} - 2\beta_i + \beta_{i+1})^2$; one panel has penalty value 16.4.]

9 Wide range of basis-penalty smoothers available. [Figure: example smooths, including a time smooth s(times,10.9), a 2D surface f(x,z), and a spatial smooth s(region,63.04) over lat/lon.]

10 Estimation of model coefficients. Given bases for the smooths, the model can be written as
$$y_i \sim \mathrm{EF}(\mu_i, \phi), \qquad g(\mu_i) = X_i\beta$$
where X ($n \times p$) contains A, Z and the evaluated basis functions, and $\beta$ is the combined coefficient vector. The combined penalty can be written $\sum_j \lambda_j \beta^T S_j \beta$, where the $\lambda_j$ are smoothing parameters. Then
$$\hat\beta = \underset{\beta}{\arg\max}\; l(\beta) - \sum_j \lambda_j \beta^T S_j \beta / 2.$$
This penalized log likelihood can be given a Bayesian motivation using the prior⁶ $\beta \sim N\{0, (\sum_j \lambda_j S_j)^-\}$.
⁶ which also covers simple Gaussian random effects.
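Continuing the smoothCon sketch above: for Gaussian data and a fixed $\lambda$, the penalized estimate is a single linear solve (illustrative only, not mgcv's code path):

```r
## Penalized least squares for a fixed lambda, reusing x, X and S from the
## sketch above (Gaussian case, identity link).
y        <- sin(2 * pi * x) + rnorm(200, sd = 0.3)  # toy response
lambda   <- 1
beta.hat <- solve(crossprod(X) + lambda * S, crossprod(X, y))
f.hat    <- X %*% beta.hat                          # estimated smooth at the data
```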

11 Full fitting algorithm. The GLM-type likelihood means that $\beta$ can be estimated by penalized iteratively re-weighted least squares: we iteratively use penalized least squares to fit a working linear model
$$z = X\beta + \epsilon, \qquad \mathrm{cov}(\epsilon) = W^{-1}\phi, \qquad E(\epsilon) = 0$$
where $z_i = g'(\hat\mu_i)(y_i - \hat\mu_i) + \hat\eta_i$, $W_{ii}^{-1} = V(\hat\mu_i)g'(\hat\mu_i)^2$ and $\eta_i = g(\mu_i)$. V is a known function, determined by the EF. Or we use the prior on $\beta$ and iteratively estimate
$$z = X\beta + \epsilon, \qquad \beta \sim N(0, S_\lambda^-), \qquad \epsilon \sim N(0, W^{-1}\phi),$$
including $\lambda$, as a linear mixed model (this is PQL⁷), where $S_\lambda = \sum \lambda_j S_j$. $\lambda$ estimation uses REML.
⁷ Breslow & Clayton, 1993, JASA.
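In outline, the working-model iteration looks like the following minimal sketch (Poisson response, log link, smoothing parameter held fixed; not mgcv's implementation):

```r
## Penalized IRLS sketch: Poisson response, log link, fixed lambda.
pirls <- function(X, y, S, lambda, maxit = 50, tol = 1e-8) {
  eta <- log(pmax(y, 0.5))              # crude starting linear predictor
  for (it in 1:maxit) {
    mu <- exp(eta)                      # inverse link
    z  <- (y - mu) / mu + eta           # working response g'(mu)(y - mu) + eta
    w  <- mu                            # W_ii = 1/{V(mu) g'(mu)^2} = mu here
    beta <- solve(crossprod(sqrt(w) * X) + lambda * S, t(X) %*% (w * z))
    eta.old <- eta
    eta <- X %*% beta
    if (max(abs(eta - eta.old)) < tol * (max(abs(eta)) + 1)) break
  }
  beta
}
```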

12 Justifying the REML step, and a big-data method. Let $QR = \sqrt{W}X$ and $f = Q^T\sqrt{W}z$. Assume $\phi = 1$. Then $f = R\beta + e$, $\beta \sim N(0, S_\lambda^-)$ and, by the CLT, $e \sim N(0, I)$. If $\hat\beta_\lambda = \arg\min_\beta \|f - R\beta\|^2 + \beta^T S_\lambda \beta$ then
$$-2 l_r = \|f - R\hat\beta_\lambda\|^2 + \hat\beta_\lambda^T S_\lambda \hat\beta_\lambda + \log|R^TR + S_\lambda| - \log|S_\lambda|_+ + \mathrm{const}.$$
But to within an additive constant this is identical to $-2 l_r$ for the working model $z = X\beta + \epsilon$, $\beta \sim N(0, S_\lambda^-)$, $\epsilon \sim N(0, W^{-1})$. A similar argument applies if $\phi \neq 1$, using the Pearson estimator. This justification also gives a low-memory-footprint big-data algorithm: iterative blockwise QR accumulates R and f⁸. The log-determinant terms require eigen and QR decompositions for stable computation.
⁸ Wood, Goude and Shaw, 2015, JRSSC.
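In code, the blockwise accumulation might look like the following sketch, where p, n.blocks and get_block() (returning one block's rows of $\sqrt{W}X$ and $\sqrt{W}z$) are hypothetical:

```r
## Blockwise QR sketch: R (p x p) and f (length p) summarize all n rows
## without the full model matrix ever being stored. get_block(b) is an
## assumed supplier of list(wX = , wz = ) for block b.
R <- matrix(0, 0, p)
f <- numeric(0)
for (b in 1:n.blocks) {
  blk <- get_block(b)
  qrx <- qr(rbind(R, blk$wX))            # QR of [previous R; new rows]
  R   <- qr.R(qrx)                       # updated p x p triangular factor
  f   <- qr.qty(qrx, c(f, blk$wz))[1:p]  # updated f = Q'[f; wz], first p rows
}
## ||f - R beta||^2 + beta' S_lambda beta now replaces the full-data problem.
```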

13 Parallelizing the GAM fit method. Aim: parallelize the preceding big additive model methods. Hope: 24 cores turns one day of computing into one hour. Reality: here is what happened when we parallelized all steps of flop cost $O(p^3)$ in the preceding (mgcv::bam) method... [Figure: bam scaling, 1/time against number of cores, showing far less than linear speed-up.]

14 The messy realities of parallel computing.
1. Hyper-threading can make parallel execution slower than serial. [Diagram: two hyper-threads contending for one core's FPU and integer units.]
2. Dynamic core clock-speed management for power efficiency can make the thread with the lightest workload take the most time. [Diagram: a lightly loaded core clocks down and finishes later than a heavily loaded core.]
3. Thermal limits: n cores are not n times faster than 1 core; run everything flat out and the CPU would fry itself.
4. A floating point operation (flop) may take one or two CPU cycles; retrieving a number from memory takes around 10 times that. Numerical computation is memory-bandwidth limited.

15 Memory bandwidth, cache, and block algorithms. Cache is small, fast-access memory between the CPU and main memory. There is a big speed-up if most flops involve data already in cache. Consider two $10^6$-flop computations:
1. C is a $1000 \times 1000$ matrix and y a 1000-vector: compute Cy. Each of the $10^6$ elements of C is read once, with no re-use.
2. A and B are both $100 \times 100$ matrices: form AB. This repeatedly revisits the elements of A and B.
Provided A and B fit in cache, 2 is much faster. So structure algorithms around cache-friendly blocks! e.g.
$$\begin{bmatrix} A_{11} & A_{12}\\ A_{21} & A_{22}\end{bmatrix}\begin{bmatrix} B_{11} & B_{12}\\ B_{21} & B_{22}\end{bmatrix} = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22}\\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}\end{bmatrix}$$
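A quick numerical check of the block-product identity (illustrative, small matrices):

```r
## Verify the 2x2 block-product identity on a small random example.
set.seed(1)
A  <- matrix(rnorm(16), 4, 4)
B  <- matrix(rnorm(16), 4, 4)
i1 <- 1:2; i2 <- 3:4                     # block index sets
C11 <- A[i1, i1] %*% B[i1, i1] + A[i1, i2] %*% B[i2, i1]
max(abs((A %*% B)[i1, i1] - C11))        # ~ 1e-16: the blocks agree
```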

16 Parallel block pivoted QR & Cholesky. There are parallelizable block pivoted QR⁹ and Cholesky¹⁰ algorithms published. Here is how they scale with OpenMP parallelization... [Figure: QR scaling and Cholesky scaling, 1/t against number of cores.] Pivoted QR has a lot of unavoidable matrix-vector operations (instead of matrix-matrix)... this caused the poor scaling of the GAM fitting method.
⁹ Quintana-Ortí, Sun and Bishop, 1998, SIAM J. Sci. Comp. ¹⁰ Lucas, 2004, LAPACK working paper.

17 Scalable fitting: Cholesky only? The (restricted) marginal likelihood used for $\lambda$ estimation,
$$\mathcal{V}(\lambda) = \frac{\|\sqrt{W}(z - X\hat\beta_\lambda)\|^2 + \hat\beta_\lambda^T S_\lambda \hat\beta_\lambda}{2\phi} + k + \frac{n - M_p}{2}\log(2\pi\phi) + \frac{\log|X^TWX + S_\lambda| - \log|S_\lambda|_+}{2}$$
(k a constant, $M_p$ the penalty null-space dimension), is difficult because of the log-determinant terms: naive computation can be equivalent to taking logs of numerical zeroes! Stable evaluation of the log determinants requires pivoted QR (& eigen) decompositions, but we would prefer Cholesky only, to get scalability. Simple idea: avoid QR and eigen-decompositions by avoiding the evaluation of log determinants during optimization.

18 Gradient-only Newton optimization. The Newton step, $\Delta$, uses only first and second derivatives of $\mathcal{V}$. Function values are only required for step control: halve $\Delta$ if $\mathcal{V}(\lambda + \Delta) > \mathcal{V}(\lambda)$. Instead, halve if $\nabla\mathcal{V}(\lambda + \Delta)^T \Delta > 0$, i.e. if the step ends walking uphill. [Figure: $\mathcal{V}(\lambda)$ against $\lambda$, illustrating the two step-control rules.] This eliminates the log(0) problem, because only derivatives of the log determinants are ever needed: e.g., with $\rho_j = \log\lambda_j$,
$$\frac{\partial \log|X^TWX + \sum_j \lambda_j S_j|}{\partial \rho_j} = \lambda_j\,\mathrm{tr}\left\{\left(X^TWX + \sum_j \lambda_j S_j\right)^{-1} S_j\right\}.$$
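A minimal sketch of such an iteration, assuming user-supplied grad() and hess() functions for $\mathcal{V}$ (my outline, not the mgcv implementation):

```r
## Gradient-only Newton sketch for minimizing V(rho), rho = log(lambda);
## V itself is never evaluated. grad() and hess() are assumed supplied,
## with hess() positive definite near the optimum.
newton_grad_only <- function(rho, grad, hess, maxit = 200, tol = 1e-6) {
  for (it in 1:maxit) {
    g <- grad(rho)
    if (max(abs(g)) < tol) break                    # converged: gradient ~ 0
    step <- -solve(hess(rho), g)                    # Newton direction
    for (h in 1:30) {                               # step control without V values
      if (sum(grad(rho + step) * step) <= 0) break  # no longer walking uphill
      step <- step / 2                              # overshot the minimum: halve
    }
    rho <- rho + step
  }
  rho
}
```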

19 Simple idea 2.
1. Conventional PQL finds the best-fit variance/smoothing parameters for each working model.
2. It is a waste of effort to find the best fit when the working model will anyway be discarded at the next iteration!
3. So just find improved smoothing parameters, by taking one Newton step each iteration.
4. With care we can still apply step control properly (and can even prove convergence for some cases).
5. The actual iteration is omitted for reasons of tediousness...

20 Simple idea 3: discrete covariate methods. The methods so far scale like the pivoted Cholesky (rather than the QR) and turn months of computing into days to weeks. Formation of $X^TWX$ is the leading-order cost: $O(np^2)$. Lang et al.¹¹ point out that for a single 1D smooth, $f(x)$, the product $X^TWX$ is very efficiently computable if x takes only $m \ll n$ discrete values. As statisticians we should be prepared to discretise x into $m = O(\sqrt{n})$ bins. It is possible to find (novel) efficient computational methods for the multiple discretised covariate case, both for multiple 1D smooths and for tensor product smooths of multiple covariates.
¹¹ Lang, Umlauf, Wechselberger, Harttgen & Kneib, 2014, Statistics & Computing.

21 Simple discrete method example. For a single smooth, its $n \times p_j$ model matrix becomes
$$X_j(i, l) = \bar X_j(k_j(i), l)$$
where $\bar X_j$ is an $m_j \times p_j$ matrix evaluating the smooth at the corresponding gridded values. Then, for example,
$$X_j^T y = \bar X_j^T \bar y \qquad \text{where} \qquad \bar y_l = \sum_{k_j(i) = l} y_i.$$
Cost: $O(n) + O(m_j p_j)$; for $m_j \ll n$ this is a factor-$p_j$ saving. In general, all the required (cross)products are a factor of $p_j$ more efficient, where $p_j$ is the largest (marginal) basis dimension involved in the term.
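A tiny numerical check of this identity (illustrative; the full X is formed here only for verification):

```r
## Check X_j' y = Xbar_j' ybar, where ybar sums y within bins.
set.seed(1)
n <- 1000; m <- 20; p <- 5
k    <- sample(rep(1:m, length.out = n))  # bin index k_j(i); every bin occurs
Xbar <- matrix(rnorm(m * p), m, p)        # basis evaluated at the m grid values
X    <- Xbar[k, ]                         # implied full n x p model matrix
y    <- rnorm(n)
ybar <- rowsum(y, k)                      # ybar_l = sum of y_i over {i: k(i) = l}
max(abs(crossprod(X, y) - crossprod(Xbar, ybar)))  # ~ 1e-13: identical
```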

22 Black smoke modelling. The method based on the parallel Cholesky, determinant-free iteration and covariate discretization is implemented in the mgcv function bam(..., discrete=TRUE). The current best daily black smoke model is
$$\log(bs_i) = f_1(y_i) + f_2(doy_i) + f_3(dow_i) + f_4(y_i, doy_i) + f_5(y_i, dow_i) + f_6(doy_i, dow_i) + f_7(n_i, e_i) + f_8(n_i, e_i, y_i) + f_9(n_i, e_i, doy_i) + f_{10}(n_i, e_i, dow_i) + f_{11}(h_i) + f_{12}(T^0_i, T^1_i) + f_{13}(T^1_i, T^2_i) + f_{14}(r_i) + \alpha_{k(i)} + b_{id(i)} + e_i.$$
The model has around $10^4$ coefficients and $r^2 = \ldots$. With the new parallel discretised methods the fit time is < 1 hour; we estimate that previous methods would have required > 1 month. The memory footprint is about 15GB.
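As a usage sketch (variable names, basis choices and most terms here are illustrative guesses, not the authors' actual call):

```r
## Illustrative bam() call in the spirit of the black smoke model; bs.data
## and all names are assumptions, and most model terms are omitted.
library(mgcv)
fit <- bam(log.bs ~ s(year) + s(doy, bs = "cc") + s(dow, k = 7) +
             ti(year, doy) + s(n, e, bs = "ds", k = 200) +
             te(n, e, year, d = c(2, 1), k = c(50, 10)),
           data = bs.data, discrete = TRUE, nthreads = 8)
```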

23 Daily black smoke model - 10 day timestep

24 BS residuals in time. [Figure: ACF of the residuals, and residuals against day (0 to 15000).]

25 BS variograms in space. [Figure: panels of semivariance against distance (km).] Black solid is raw data; open is model residuals. Lines are permutation-based reference intervals. Top row: day 40. Bottom row: day 200.

26 Aside: penalty choice matters - a bad model movie

27 Derived maps: posterior exceedance probabilities. [Figure: four maps, km east against km north, one per year.] The maps show the average daily probability of exceeding the current EU daily limit, for 4 years in the 1960s. Notice how the switch from coal to gas for domestic heating cleans up the London area quite quickly; the northern industrial city areas take longer.

28 Conclusions. It is possible to obtain a three-orders-of-magnitude computational speed-up in large additive model fitting by combining:
1. A new iterative algorithm for REML-based additive model estimation, requiring only Cholesky decomposition and avoiding the evaluation of unstable log determinants.
2. OpenMP parallelization of a modern pivoted block Cholesky.
3. New methods to efficiently compute all the model matrix products required in fitting, based on discretization of covariates (including for tensor product smooths).
The methods facilitate the first daily models of UK black smoke densities, based on the multi-decade daily data from the UK black smoke monitoring network. See Wood, Li, Shaddick and Augustin (2017), JASA. Faculty job available in Bristol now!

29 Sparse methods for big nested models: a taster. A crossed-factor repeated measures model:
$$y_i = \alpha + f_1(x_{1i}) + f_2(x_{2i}) + f_{a(i)}(z_{1i}) + f_{b(i)}(z_{2i}) + f_{s(i)}(z_{3i}) + \epsilon_i$$
The f are smooth functions and the $x_j$ and $z_j$ are covariates; $s(i)$ indicates the subject of the i-th observation. Each subject has one level of each of the crossed factors a and b, and $a(i)$ and $b(i)$ indicate the levels of a and b for observation i. There are ... subjects; 30 and 20 levels for a and b; 90 measurements per subject. 10 coefficients per smooth implies ... coefficients, and there are ... observations.
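Such a model can be written down in mgcv with factor-smooth ("fs") bases; a sketch under assumed names (the talk's own approach, on the next slide, instead uses sparse matrix methods directly):

```r
## Sketch of the crossed/nested formula in mgcv: "fs" bases give one smooth
## of the covariate per factor level. dat and all variable names are assumed.
library(mgcv)
fit <- bam(y ~ s(x1) + s(x2) +
             s(z1, a, bs = "fs", k = 10) +       # a smooth of z1 per level of a
             s(z2, b, bs = "fs", k = 10) +       # a smooth of z2 per level of b
             s(z3, subject, bs = "fs", k = 10),  # a smooth of z3 per subject
           data = dat, discrete = TRUE)
```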

30 Estimation. As usual, we can write the model as $y = X\beta + \epsilon$, with fit objective $\|y - X\beta\|^2 + \sum_j \lambda_j \beta^T S_j \beta$. Formally, $\hat\beta = (X^TX + S_\lambda)^{-1}X^Ty$. Sparse Cholesky: $L^TL = P(X^TX + S_\lambda)P^T$, where P pivots; then $\hat\beta = P^TL^{-1}L^{-T}PX^Ty$. Extended Fellner-Schall smoothing parameter update¹²:
$$\lambda_j^* = \sigma^2\,\frac{\mathrm{tr}(S_\lambda^- S_j) - \mathrm{tr}\{(X^TX + S_\lambda)^{-1} S_j\}}{\hat\beta^T S_j \hat\beta}\,\lambda_j,$$
with $\mathrm{tr}\{(X^TX + S_\lambda)^{-1} S_j\} = \|B\|_F^2$, where $B = L^{-T}PD_j$ and $D_jD_j^T = S_j$.
¹² Wood and Fasiolo, 2016, arXiv.
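A minimal sketch of the sparse coefficient solve using the Matrix package (toy sizes; stand-in X and penalty, not the talk's code):

```r
## Sparse pivoted Cholesky solve of (X'X + S_lambda) beta = X'y with Matrix.
library(Matrix)
set.seed(2)
n <- 500; p <- 40
X        <- rsparsematrix(n, p, density = 0.05)  # stand-in sparse model matrix
y        <- rnorm(n)
S.lambda <- 0.1 * Diagonal(p)                    # stand-in penalty S_lambda
A        <- forceSymmetric(crossprod(X) + S.lambda)
ch       <- Cholesky(A, perm = TRUE)             # sparse, fill-reducing pivoted
beta.hat <- solve(ch, crossprod(X, y))           # solves A beta = X'y
```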

31 Results. Used the Matrix library in R, on this laptop. Fit in < 5 mins and < 3GB RAM. Generalization to the GLM case is trivial. [Figure: estimated smooths $f_1(x_1)$, $f_2(x_2)$, $f_a(z_1)$, $f_b(z_2)$, $f_s(z_3)$, and fitted $\hat\mu$ against true $\mu$.]
