Smooth additive models for large datasets


1 Smooth additive models for large datasets. Simon Wood, University of Bristol¹, U.K., in collaboration with Zheyuan Li, Gavin Shaddick and Nicole Augustin (University of Bath). Funded by EPSRC and the University of Bath.
¹ Currently hiring Statistics faculty!

2 London smog, Dec. 1952: thousands of premature deaths. Black smoke (particulates) and sulphur from domestic coal fires. Clean Air Act 1956. Monitoring from 1961.

3 Black smoke monitoring. [Figure: map of the black smoke monitoring network, easting against northing (km).] Four decades of daily black smoke monitoring at a variable subset of the stations shown. Started in 1961 to monitor air pollution (then mostly from coal), in the wake of the 1950s smog deaths. Epidemiological studies need estimates of daily exposure away from stations. $O(10^7)$ measurements, and suitable smooth latent Gaussian models have $O(10^4)$ coefficients, plus variance parameters.

4 Daily BS data. [Figure: panels of daily log(bs) observations, plotted against day (0 to 15000), against year, and against station location (east/north, km).]

5 Black smoke: previous work. Shaddick & Zidek² modelled annual average black smoke:
$$\log(b_i) = f_1(e_i, n_i) + f_2(e_i, n_i)\,y_i + f_3(e_i, n_i)\,y_i^2 + \epsilon_i$$
using INLA³ for inference. Modelling reduces the impact of preferential sampling. But annual averages fail to capture the acute pollution events, on the time scale of days, that accelerate deaths and are likely to be of long-term health significance. A daily model for all the $10^7$ data is appealing, but potentially expensive.
² Shaddick & Zidek, 2014, Spatial Statistics. ³ Rue, Martino & Chopin, 2009.

6 Daily black smoke modelling and existing methods. We had a 24-core, 128GB computer available. Trials with INLA suggested that we had no chance of fitting daily models within 128GB, nor in acceptable time. Using R package mgcv, acceptable daily models would have required 1TB of storage just to form the model matrix using the default methods⁴. mgcv's big additive model methods (bam⁵) have an acceptable memory footprint, obtained by factoring the model matrix block-wise, but experiments suggested weeks of computing time to fit. Perhaps parallelization of the bam methods is the solution?
⁴ Wood, 2011, JRSSB. ⁵ Wood, Goude & Shaw, 2015, JRSSC.

7 The model class. Concentrate on models of the form
$$y_i \sim \mathrm{EF}(\mu_i, \phi), \qquad g(\mu_i) = A_i\theta + \sum_j f_j(x_{ji}) + Z_i b$$
where A and Z are model matrices, the $f_j$ are smooth functions, $\theta$ is a parameter vector, b contains independent Gaussian random effects, and g is a known link function. Each $x_j$ may be vector valued. Represent the smooth functions, f, using spline basis expansions with coefficients $\beta$,
$$f(x) = \sum_{k=1}^K \beta_k b_k(x),$$
and define a quadratic smoothing penalty, e.g. $\int f''(x)^2\,dx = \beta^T S\beta$ (asymptotic considerations suggest $K = O(n^{1/9})$ to $O(n^{1/5})$).
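For concreteness (an illustration added here, not from the talk), mgcv's smoothCon exposes exactly this basis-plus-penalty representation:

```r
## Extract a spline basis and its quadratic penalty with mgcv::smoothCon().
library(mgcv)
set.seed(1)
x  <- runif(200)
sm <- smoothCon(s(x, k = 10, bs = "ps"), data = data.frame(x = x))[[1]]
X  <- sm$X       # 200 x 10 matrix of evaluated basis functions b_k(x)
S  <- sm$S[[1]]  # 10 x 10 penalty matrix S, so the penalty is beta' S beta
```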

8 Simple 1D basis-penalty example. [Figure: data y against x, the basis functions, and fits under increasing values of the second-difference penalty $\sum_i(\beta_{i-1} - 2\beta_i + \beta_{i+1})^2$; one panel has penalty value 16.4.]

9 Wide range of basis-penalty smoothers available. [Figure: example smooths, including a time smooth s(times,10.9), a 2D surface f(x,z), and a spatial smooth s(region,63.04) over lat/lon.]

10 Estimation of model coefficients. Given bases for the smooths, the model can be written as
$$y_i \sim \mathrm{EF}(\mu_i, \phi), \qquad g(\mu_i) = X_i\beta$$
where X ($n \times p$) contains A, Z and the evaluated basis functions, and $\beta$ is the combined coefficient vector. The combined penalty can be written $\sum_j \lambda_j \beta^T S_j \beta$, where the $\lambda_j$ are smoothing parameters. Then
$$\hat\beta = \underset{\beta}{\arg\max}\; l(\beta) - \sum_j \lambda_j \beta^T S_j \beta / 2.$$
This penalized log likelihood can be given a Bayesian motivation using the prior⁶ $\beta \sim N\{0, (\sum_j \lambda_j S_j)^-\}$.
⁶ which also covers simple Gaussian random effects.
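Continuing the smoothCon sketch above: for Gaussian data and a fixed $\lambda$, the penalized estimate is a single linear solve (illustrative only, not mgcv's code path):

```r
## Penalized least squares for a fixed lambda, reusing x, X and S from the
## sketch above (Gaussian case, identity link).
y        <- sin(2 * pi * x) + rnorm(200, sd = 0.3)  # toy response
lambda   <- 1
beta.hat <- solve(crossprod(X) + lambda * S, crossprod(X, y))
f.hat    <- X %*% beta.hat                          # estimated smooth at the data
```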

11 Full fitting algorithm. The GLM-type likelihood means that $\beta$ can be estimated by penalized iteratively re-weighted least squares: we iteratively use penalized least squares to fit a working linear model
$$z = X\beta + \epsilon, \qquad \mathrm{cov}(\epsilon) = W^{-1}\phi, \qquad E(\epsilon) = 0$$
where $z_i = g'(\hat\mu_i)(y_i - \hat\mu_i) + \hat\eta_i$, $W_{ii}^{-1} = V(\hat\mu_i)g'(\hat\mu_i)^2$ and $\eta_i = g(\mu_i)$. V is a known function, determined by the EF. Or we use the prior on $\beta$ and iteratively estimate
$$z = X\beta + \epsilon, \qquad \beta \sim N(0, S_\lambda^-), \qquad \epsilon \sim N(0, W^{-1}\phi),$$
including $\lambda$, as a linear mixed model (this is PQL⁷), where $S_\lambda = \sum \lambda_j S_j$. $\lambda$ estimation uses REML.
⁷ Breslow & Clayton, 1993, JASA.
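In outline, the working-model iteration looks like the following minimal sketch (Poisson response, log link, smoothing parameter held fixed; not mgcv's implementation):

```r
## Penalized IRLS sketch: Poisson response, log link, fixed lambda.
pirls <- function(X, y, S, lambda, maxit = 50, tol = 1e-8) {
  eta <- log(pmax(y, 0.5))              # crude starting linear predictor
  for (it in 1:maxit) {
    mu <- exp(eta)                      # inverse link
    z  <- (y - mu) / mu + eta           # working response g'(mu)(y - mu) + eta
    w  <- mu                            # W_ii = 1/{V(mu) g'(mu)^2} = mu here
    beta <- solve(crossprod(sqrt(w) * X) + lambda * S, t(X) %*% (w * z))
    eta.old <- eta
    eta <- X %*% beta
    if (max(abs(eta - eta.old)) < tol * (max(abs(eta)) + 1)) break
  }
  beta
}
```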

12 Justifying the REML step, and a big-data method. Let $QR = \sqrt{W}X$ and $f = Q^T\sqrt{W}z$. Assume $\phi = 1$. Then $f = R\beta + e$, $\beta \sim N(0, S_\lambda^-)$ and, by the CLT, $e \sim N(0, I)$. If $\hat\beta_\lambda = \arg\min_\beta \|f - R\beta\|^2 + \beta^T S_\lambda \beta$ then
$$-2 l_r = \|f - R\hat\beta_\lambda\|^2 + \hat\beta_\lambda^T S_\lambda \hat\beta_\lambda + \log|R^TR + S_\lambda| - \log|S_\lambda|_+ + \mathrm{const}.$$
But to within an additive constant this is identical to $-2 l_r$ for the working model $z = X\beta + \epsilon$, $\beta \sim N(0, S_\lambda^-)$, $\epsilon \sim N(0, W^{-1})$. A similar argument applies if $\phi \neq 1$, using the Pearson estimator. This justification also gives a low-memory-footprint big-data algorithm: iterative blockwise QR accumulates R and f⁸. The log-determinant terms require eigen and QR decompositions for stable computation.
⁸ Wood, Goude and Shaw, 2015, JRSSC.
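In code, the blockwise accumulation might look like the following sketch, where p, n.blocks and get_block() (returning one block's rows of $\sqrt{W}X$ and $\sqrt{W}z$) are hypothetical:

```r
## Blockwise QR sketch: R (p x p) and f (length p) summarize all n rows
## without the full model matrix ever being stored. get_block(b) is an
## assumed supplier of list(wX = , wz = ) for block b.
R <- matrix(0, 0, p)
f <- numeric(0)
for (b in 1:n.blocks) {
  blk <- get_block(b)
  qrx <- qr(rbind(R, blk$wX))            # QR of [previous R; new rows]
  R   <- qr.R(qrx)                       # updated p x p triangular factor
  f   <- qr.qty(qrx, c(f, blk$wz))[1:p]  # updated f = Q'[f; wz], first p rows
}
## ||f - R beta||^2 + beta' S_lambda beta now replaces the full-data problem.
```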

13 Parallelizing the GAM fit method. Aim: parallelize the preceding big additive model methods. Hope: 24 cores turns one day of computing into one hour. Reality: here is what happened when we parallelized all steps of flop cost $O(p^3)$ in the preceding (mgcv::bam) method... [Figure: bam scaling, 1/time against number of cores, showing far less than linear speed-up.]

14 The messy realities of parallel computing.
1. Hyper-threading can make parallel execution slower than serial. [Diagram: two hyper-threads contending for one core's FPU and integer units.]
2. Dynamic core clock-speed management for power efficiency can make the thread with the lightest workload take the most time. [Diagram: a lightly loaded core clocks down and finishes later than a heavily loaded core.]
3. Thermal limits: n cores are not n times faster than 1 core; run everything flat out and the CPU would fry itself.
4. A floating point operation (flop) may take one or two CPU cycles; retrieving a number from memory takes around 10 times that. Numerical computation is memory-bandwidth limited.

15 Memory bandwidth, cache, and block algorithms. Cache is small, fast-access memory between the CPU and main memory. There is a big speed-up if most flops involve data already in cache. Consider two $10^6$-flop computations:
1. C is a $1000 \times 1000$ matrix and y a 1000-vector: compute Cy. Each of the $10^6$ elements of C is read once, with no re-use.
2. A and B are both $100 \times 100$ matrices: form AB. This repeatedly revisits the elements of A and B.
Provided A and B fit in cache, 2 is much faster. So structure algorithms around cache-friendly blocks! e.g.
$$\begin{bmatrix} A_{11} & A_{12}\\ A_{21} & A_{22}\end{bmatrix}\begin{bmatrix} B_{11} & B_{12}\\ B_{21} & B_{22}\end{bmatrix} = \begin{bmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22}\\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}\end{bmatrix}$$
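A quick numerical check of the block-product identity (illustrative, small matrices):

```r
## Verify the 2x2 block-product identity on a small random example.
set.seed(1)
A  <- matrix(rnorm(16), 4, 4)
B  <- matrix(rnorm(16), 4, 4)
i1 <- 1:2; i2 <- 3:4                     # block index sets
C11 <- A[i1, i1] %*% B[i1, i1] + A[i1, i2] %*% B[i2, i1]
max(abs((A %*% B)[i1, i1] - C11))        # ~ 1e-16: the blocks agree
```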

16 Parallel block pivoted QR & Cholesky. There are parallelizable block pivoted QR⁹ and Cholesky¹⁰ algorithms published. Here is how they scale with OpenMP parallelization... [Figure: QR scaling and Cholesky scaling, 1/t against number of cores.] Pivoted QR has a lot of unavoidable matrix-vector operations (instead of matrix-matrix)... this caused the poor scaling of the GAM fitting method.
⁹ Quintana-Ortí, Sun and Bishop, 1998, SIAM J. Sci. Comp. ¹⁰ Lucas, 2004, LAPACK working paper.

17 Scalable fitting: Cholesky only? The (restricted) marginal likelihood used for $\lambda$ estimation,
$$\mathcal{V}(\lambda) = \frac{\|\sqrt{W}(z - X\hat\beta_\lambda)\|^2 + \hat\beta_\lambda^T S_\lambda \hat\beta_\lambda}{2\phi} + k + \frac{n - M_p}{2}\log(2\pi\phi) + \frac{\log|X^TWX + S_\lambda| - \log|S_\lambda|_+}{2}$$
(k a constant, $M_p$ the penalty null-space dimension), is difficult because of the log-determinant terms: naive computation can be equivalent to taking logs of numerical zeroes! Stable evaluation of the log determinants requires pivoted QR (& eigen) decompositions, but we would prefer Cholesky only, to get scalability. Simple idea: avoid QR and eigen-decompositions by avoiding the evaluation of log determinants during optimization.

18 Gradient-only Newton optimization. The Newton step, $\Delta$, uses only first and second derivatives of $\mathcal{V}$. Function values are only required for step control: halve $\Delta$ if $\mathcal{V}(\lambda + \Delta) > \mathcal{V}(\lambda)$. Instead, halve if $\nabla\mathcal{V}(\lambda + \Delta)^T \Delta > 0$, i.e. if the step ends walking uphill. [Figure: $\mathcal{V}(\lambda)$ against $\lambda$, illustrating the two step-control rules.] This eliminates the log(0) problem, because only derivatives of the log determinants are ever needed: e.g., with $\rho_j = \log\lambda_j$,
$$\frac{\partial \log|X^TWX + \sum_j \lambda_j S_j|}{\partial \rho_j} = \lambda_j\,\mathrm{tr}\left\{\left(X^TWX + \sum_j \lambda_j S_j\right)^{-1} S_j\right\}.$$
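A minimal sketch of such an iteration, assuming user-supplied grad() and hess() functions for $\mathcal{V}$ (my outline, not the mgcv implementation):

```r
## Gradient-only Newton sketch for minimizing V(rho), rho = log(lambda);
## V itself is never evaluated. grad() and hess() are assumed supplied,
## with hess() positive definite near the optimum.
newton_grad_only <- function(rho, grad, hess, maxit = 200, tol = 1e-6) {
  for (it in 1:maxit) {
    g <- grad(rho)
    if (max(abs(g)) < tol) break                    # converged: gradient ~ 0
    step <- -solve(hess(rho), g)                    # Newton direction
    for (h in 1:30) {                               # step control without V values
      if (sum(grad(rho + step) * step) <= 0) break  # no longer walking uphill
      step <- step / 2                              # overshot the minimum: halve
    }
    rho <- rho + step
  }
  rho
}
```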

19 Simple idea 2.
1. Conventional PQL finds the best-fit variance/smoothing parameters for each working model.
2. It is a waste of effort to find the best fit when the working model will anyway be discarded at the next iteration!
3. So just find improved smoothing parameters, by taking one Newton step each iteration.
4. With care we can still apply step control properly (and can even prove convergence for some cases).
5. The actual iteration is omitted for reasons of tediousness...

20 Simple idea 3: discrete covariate methods. The methods so far scale like the pivoted Cholesky (rather than the QR) and turn months of computing into days to weeks. Formation of $X^TWX$ is the leading-order cost: $O(np^2)$. Lang et al.¹¹ point out that for a single 1D smooth, $f(x)$, the product $X^TWX$ is very efficiently computable if x takes only $m \ll n$ discrete values. As statisticians we should be prepared to discretise x into $m = O(\sqrt{n})$ bins. It is possible to find (novel) efficient computational methods for the multiple discretised covariate case, both for multiple 1D smooths and for tensor product smooths of multiple covariates.
¹¹ Lang, Umlauf, Wechselberger, Harttgen & Kneib, 2014, Statistics & Computing.

21 Simple discrete method example. For a single smooth, its $n \times p_j$ model matrix becomes
$$X_j(i, l) = \bar X_j(k_j(i), l)$$
where $\bar X_j$ is an $m_j \times p_j$ matrix evaluating the smooth at the corresponding gridded values. Then, for example,
$$X_j^T y = \bar X_j^T \bar y \qquad \text{where} \qquad \bar y_l = \sum_{k_j(i) = l} y_i.$$
Cost: $O(n) + O(m_j p_j)$; for $m_j \ll n$ this is a factor-$p_j$ saving. In general, all the required (cross)products are a factor of $p_j$ more efficient, where $p_j$ is the largest (marginal) basis dimension involved in the term.
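A tiny numerical check of this identity (illustrative; the full X is formed here only for verification):

```r
## Check X_j' y = Xbar_j' ybar, where ybar sums y within bins.
set.seed(1)
n <- 1000; m <- 20; p <- 5
k    <- sample(rep(1:m, length.out = n))  # bin index k_j(i); every bin occurs
Xbar <- matrix(rnorm(m * p), m, p)        # basis evaluated at the m grid values
X    <- Xbar[k, ]                         # implied full n x p model matrix
y    <- rnorm(n)
ybar <- rowsum(y, k)                      # ybar_l = sum of y_i over {i: k(i) = l}
max(abs(crossprod(X, y) - crossprod(Xbar, ybar)))  # ~ 1e-13: identical
```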

22 Black smoke modelling. The method based on the parallel Cholesky, determinant-free iteration and covariate discretization is implemented in the mgcv function bam(..., discrete=TRUE). The current best daily black smoke model is
$$\log(bs_i) = f_1(y_i) + f_2(doy_i) + f_3(dow_i) + f_4(y_i, doy_i) + f_5(y_i, dow_i) + f_6(doy_i, dow_i) + f_7(n_i, e_i) + f_8(n_i, e_i, y_i) + f_9(n_i, e_i, doy_i) + f_{10}(n_i, e_i, dow_i) + f_{11}(h_i) + f_{12}(T^0_i, T^1_i) + f_{13}(T^1_i, T^2_i) + f_{14}(r_i) + \alpha_{k(i)} + b_{id(i)} + e_i.$$
The model has around $10^4$ coefficients and $r^2 = \ldots$. With the new parallel discretised methods the fit time is < 1 hour; we estimate that previous methods would have required > 1 month. The memory footprint is about 15GB.
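As a usage sketch (variable names, basis choices and most terms here are illustrative guesses, not the authors' actual call):

```r
## Illustrative bam() call in the spirit of the black smoke model; bs.data
## and all names are assumptions, and most model terms are omitted.
library(mgcv)
fit <- bam(log.bs ~ s(year) + s(doy, bs = "cc") + s(dow, k = 7) +
             ti(year, doy) + s(n, e, bs = "ds", k = 200) +
             te(n, e, year, d = c(2, 1), k = c(50, 10)),
           data = bs.data, discrete = TRUE, nthreads = 8)
```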

23 Daily black smoke model - 10 day timestep

24 BS residuals in time. [Figure: ACF of the residuals, and residuals against day (0 to 15000).]

25 BS variograms in space. [Figure: panels of semivariance against distance (km).] Black solid is raw data; open is model residuals. Lines are permutation-based reference intervals. Top row: day 40. Bottom row: day 200.

26 Aside: penalty choice matters - a bad model movie

27 Derived maps: posterior exceedance probabilities. [Figure: four maps, km east against km north, one per year.] The maps show the average daily probability of exceeding the current EU daily limit, for 4 years in the 1960s. Notice how the switch from coal to gas for domestic heating cleans up the London area quite quickly; the northern industrial city areas take longer.

28 Conclusions. It is possible to obtain a three-orders-of-magnitude computational speed-up in large additive model fitting by combining:
1. A new iterative algorithm for REML-based additive model estimation, requiring only Cholesky decomposition and avoiding the evaluation of unstable log determinants.
2. OpenMP parallelization of a modern pivoted block Cholesky.
3. New methods to efficiently compute all the model matrix products required in fitting, based on discretization of covariates (including for tensor product smooths).
The methods facilitate the first daily models of UK black smoke densities, based on the multi-decade daily data from the UK black smoke monitoring network. See Wood, Li, Shaddick and Augustin (2017), JASA. Faculty job available in Bristol now!

29 Sparse methods for big nested models: a taster. A crossed-factor repeated measures model:
$$y_i = \alpha + f_1(x_{1i}) + f_2(x_{2i}) + f_{a(i)}(z_{1i}) + f_{b(i)}(z_{2i}) + f_{s(i)}(z_{3i}) + \epsilon_i$$
The f are smooth functions and the $x_j$ and $z_j$ are covariates; $s(i)$ indicates the subject of the i-th observation. Each subject has one level of each of the crossed factors a and b, and $a(i)$ and $b(i)$ indicate the levels of a and b for observation i. There are ... subjects; 30 and 20 levels for a and b; 90 measurements per subject. 10 coefficients per smooth implies ... coefficients, and there are ... observations.
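Such a model can be written down in mgcv with factor-smooth ("fs") bases; a sketch under assumed names (the talk's own approach, on the next slide, instead uses sparse matrix methods directly):

```r
## Sketch of the crossed/nested formula in mgcv: "fs" bases give one smooth
## of the covariate per factor level. dat and all variable names are assumed.
library(mgcv)
fit <- bam(y ~ s(x1) + s(x2) +
             s(z1, a, bs = "fs", k = 10) +       # a smooth of z1 per level of a
             s(z2, b, bs = "fs", k = 10) +       # a smooth of z2 per level of b
             s(z3, subject, bs = "fs", k = 10),  # a smooth of z3 per subject
           data = dat, discrete = TRUE)
```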

30 Estimation. As usual, we can write the model as $y = X\beta + \epsilon$, with fit objective $\|y - X\beta\|^2 + \sum_j \lambda_j \beta^T S_j \beta$. Formally, $\hat\beta = (X^TX + S_\lambda)^{-1}X^Ty$. Sparse Cholesky: $L^TL = P(X^TX + S_\lambda)P^T$, where P pivots; then $\hat\beta = P^TL^{-1}L^{-T}PX^Ty$. Extended Fellner-Schall smoothing parameter update¹²:
$$\lambda_j^* = \sigma^2\,\frac{\mathrm{tr}(S_\lambda^- S_j) - \mathrm{tr}\{(X^TX + S_\lambda)^{-1} S_j\}}{\hat\beta^T S_j \hat\beta}\,\lambda_j,$$
with $\mathrm{tr}\{(X^TX + S_\lambda)^{-1} S_j\} = \|B\|_F^2$, where $B = L^{-T}PD_j$ and $D_jD_j^T = S_j$.
¹² Wood and Fasiolo, 2016, arXiv.
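A minimal sketch of the sparse coefficient solve using the Matrix package (toy sizes; stand-in X and penalty, not the talk's code):

```r
## Sparse pivoted Cholesky solve of (X'X + S_lambda) beta = X'y with Matrix.
library(Matrix)
set.seed(2)
n <- 500; p <- 40
X        <- rsparsematrix(n, p, density = 0.05)  # stand-in sparse model matrix
y        <- rnorm(n)
S.lambda <- 0.1 * Diagonal(p)                    # stand-in penalty S_lambda
A        <- forceSymmetric(crossprod(X) + S.lambda)
ch       <- Cholesky(A, perm = TRUE)             # sparse, fill-reducing pivoted
beta.hat <- solve(ch, crossprod(X, y))           # solves A beta = X'y
```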

31 Results. Used the Matrix library in R, on this laptop. Fit in < 5 mins and < 3GB RAM. Generalization to the GLM case is trivial. [Figure: estimated smooths $f_1(x_1)$, $f_2(x_2)$, $f_a(z_1)$, $f_b(z_2)$, $f_s(z_3)$, and fitted $\hat\mu$ against true $\mu$.]
