Multidimensional Density Smoothing with P-splines

Paul H.C. Eilers (1), Brian D. Marx (2)

(1) Department of Medical Statistics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands (p.eilers@lumc.nl)
(2) Department of Experimental Statistics, Louisiana State University, Baton Rouge, Louisiana 70803, USA (bmarx@lsu.edu)

Abstract: We propose a simple and effective multidimensional density estimator. Our approach is essentially penalized Poisson regression using a rich tensor product B-spline basis, where the Poisson counts come from processing the data into a multidimensional histogram, often consisting of thousands of bins. The penalty enforces smoothness of the B-spline coefficients, specifically within the rows, columns, and layers, depending on the dimension. In this paper we focus on how a one-dimensional P-spline density estimator can be extended to two dimensions and beyond. In higher dimensions we provide a hint of how efficient grid algorithms can be implemented using array regression. Our approach optimizes the penalty weight parameter(s) using information criteria, specifically AIC. Two examples illustrate our method in two dimensions.

Keywords: AIC; Effective Dimension; Tensor Product.

1 Introduction

Density estimation is a core activity in data exploration and applied statistics. Surprisingly, many statisticians approach it in a quite unsophisticated way: they are happy with off-the-shelf kernel smoothers, as can be seen in many papers on computational Bayesian modeling. An attractive alternative is (penalized) likelihood smoothing, modeling the logarithm of the density with splines. Kooperberg and Stone (1991) use adaptive-knot B-splines, while Eilers and Marx (1996) combine fixed-knot B-splines with a roughness penalty (P-splines), simplifying the scheme of O'Sullivan (1988). In the P-spline approach, the raw data are processed and reduced to a histogram, and density estimation is then achieved through (penalized) Poisson regression.
This is attractive for high-volume one-dimensional data, because one deals with, e.g., 100 or 200 histogram bins instead of many thousands of raw observations. In two or more dimensions the histogram approach is even more attractive, when combined with a recently developed algorithm for fast weighted smoothing on grids (Eilers et al., 2006).
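The reduction of raw data to a fine histogram is a one-off preprocessing pass. As a minimal sketch in Python/numpy (the sample data and bin counts are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Illustrative bivariate sample; any raw (x1, x2) data would do.
rng = np.random.default_rng(1)
x1 = rng.normal(2.0, 0.5, size=10_000)
x2 = rng.normal(70.0, 8.0, size=10_000)

# Reduce the raw data to a 100 x 100 histogram: Y holds the Poisson
# counts, u1 and u2 the bin midpoints along each dimension.
Y, e1, e2 = np.histogram2d(x1, x2, bins=(100, 100))
u1 = 0.5 * (e1[:-1] + e1[1:])
u2 = 0.5 * (e2[:-1] + e2[1:])

# The smoother now works with 10,000 bins instead of the raw
# observations; the total count is preserved exactly.
print(Y.shape, int(Y.sum()))
```

From here on, only `Y`, `u1`, and `u2` are needed; the raw observations can be discarded.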

A critical issue is optimization of the amount of smoothing. We use Akaike's Information Criterion (AIC). It is easy to compute, because an effective model dimension can be defined, and it can be obtained with relatively little effort.

2 P-spline overview: one-dimensional densities

Let (y_i, u_i) denote the Poisson count and bin midpoint pairs from a (narrowly binned) histogram, i = 1, ..., n. The vector of counts is denoted y. We model the expected values of the counts as

    \mu_i = E(y_i) = \exp\Bigl(\sum_{j=1}^{c} b_j(u_i)\,\theta_j\Bigr),  (1)

or, in matrix terms, \mu = \exp(B\theta), where B = [b_j(u_i)] is an (n \times c) B-spline basis built along the indexing axis u of the density. A rich basis is used (c sufficiently large) and the knots for the basis are equally spaced. Apart from a constant, the Poisson log-likelihood is

    l = \sum_{i=1}^{n} \log\bigl(\mu_i^{y_i} e^{-\mu_i}\bigr) = \sum_{i=1}^{n} (y_i \log \mu_i - \mu_i).  (2)

A penalty on the d-th differences of \theta is subtracted from the log-likelihood, to tune smoothness. The number of basis functions in B is chosen large, to give ample flexibility. The penalized log-likelihood, l^*, then is

    l^* = l - \lambda \sum_{j=d+1}^{c} (\Delta^d \theta_j)^2.  (3)

Setting the gradient of (3) equal to zero gives

    B'(y - \mu) = \lambda D'D\theta,  (4)

where D is a matrix of contrasts such that D\theta = \Delta^d \theta. Linearization of (4) leads to

    (B'WB + \lambda D'D)\theta = B'Wz,  (5)

where z = \eta + W^{-1}(y - \mu) is the working variable and \eta = B\theta. The matrix W = \mathrm{diag}(\mu), and \theta, \mu are approximations to the solution of (5), iterated until convergence. One easily recognizes the familiar equations for fitting a GLM, modified by the term \lambda D'D that stems from the penalty. At convergence, one can interpret (5) as linear smoothing of the working variable z. We have

    \hat z = B\hat\theta = B(B'\hat W B + \lambda D'D)^{-1} B'\hat W z = Hz,  (6)

where H is the smoothing or hat matrix. The effective dimension of the smoother is approximated by trace(H). We use it to define Akaike's information criterion (AIC) as

    \mathrm{AIC} = 2 \sum_{i=1}^{n} y_i \log(y_i / \hat\mu_i) + 2\,\mathrm{trace}(H).  (7)

Essentially this is all old hat. A more detailed description of density smoothing with P-splines is given by Eilers and Marx (1996). We use the presentation above as a springboard for density smoothing with tensor product P-splines in two or more dimensions.

3 Two-dimensional P-spline densities

Let Y_{ih} be the counts in a two-dimensional (narrowly binned) n_1 \times n_2 histogram, forming the array Y of Poisson counts with expectation M. The bin midpoints are now indexed by the n_1 n_2 pairs (u_1 \otimes 1_{n_2}, 1_{n_1} \otimes u_2), where u_1 (u_2) are the midpoint locations along the first (second) dimension. We model the expected values, now using tensor product B-splines on a rich c_1 \times c_2 rectangular grid (equally spaced along each dimension), as

    \mu_{ih} = E(Y_{ih}) = \exp\Bigl(\sum_{j=1}^{c_1} \sum_{k=1}^{c_2} b_j(u_{1i})\, b_k(u_{2h})\, \theta_{jk}\Bigr).  (8)

The n_1 n_2-vectorized forms of the Poisson counts and corresponding expected values are denoted by y = \mathrm{vec}(Y) and \mu = \mathrm{vec}(M), respectively. Ignoring penalties for the moment, (8) is also a GLM, and can be rewritten as

    \mu = E(y) = \exp\bigl((B_2 \otimes B_1)\theta\bigr) = \exp(B\theta).  (9)

We now have the c_1 \times c_2 array of coefficients \Theta, with vectorized form \theta = \mathrm{vec}(\Theta). The matrix B, with n_1 n_2 rows and c_1 c_2 columns, collects the tensor product basis functions. As outlined in Durbán et al. (2004), the penalty can be written as

    \lambda_1 \|D_1 \Theta\|_F^2 + \lambda_2 \|\Theta D_2'\|_F^2 = \theta'(\lambda_1 I_{c_2} \otimes D_1'D_1 + \lambda_2 D_2'D_2 \otimes I_{c_1})\theta = \theta' P \theta,

where \|\cdot\|_F^2 indicates the squared Frobenius norm, the sum of the squares of all elements of a matrix, and D_1 (D_2) is the proper contrast matrix for the rows (columns) of \Theta. Such a penalty enforces smoothness of the tensor product B-spline coefficients within any one row or column, but breaks the linkage of penalization across rows or columns.
As seen in the penalty above, the weight and order of the penalty are the same from row to row and from column to column, but can differ between the rows and the columns. In this scheme, with \eta = B\theta, the equivalent of (5) becomes

    (B'WB + P)\theta = B'W\bigl(\eta + W^{-1}(y - \mu)\bigr).  (10)
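For concreteness, the penalized IRLS scheme above can be sketched in a few lines of Python/numpy for the one-dimensional case. This is our own minimal illustration, not the authors' code; the function names, starting values, and convergence settings are arbitrary choices:

```python
import numpy as np

def bspline_basis(x, xlo, xhi, nseg, deg=3):
    """Equally spaced B-spline basis (Cox-de Boor recursion); nseg + deg columns."""
    dx = (xhi - xlo) / nseg
    knots = xlo + dx * np.arange(-deg, nseg + deg + 1)
    B = ((x[:, None] >= knots[:-1]) & (x[:, None] < knots[1:])).astype(float)
    for d in range(1, deg + 1):
        left = (x[:, None] - knots[:-d - 1]) / (knots[d:-1] - knots[:-d - 1])
        right = (knots[d + 1:] - x[:, None]) / (knots[d + 1:] - knots[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def pspline_density_1d(y, B, lam, d=3, max_iter=50, tol=1e-8):
    """Penalized Poisson regression of histogram counts y on basis B.

    Returns the fitted means mu and AIC = deviance + 2 * effective dimension."""
    n, c = B.shape
    D = np.diff(np.eye(c), n=d, axis=0)            # d-th order difference matrix
    P = lam * D.T @ D
    theta = np.full(c, np.log(y.mean() + 1e-9))    # flat start
    for _ in range(max_iter):
        mu = np.exp(B @ theta)
        BtWB = (B * mu[:, None]).T @ B             # B' diag(mu) B
        rhs = B.T @ (mu * (B @ theta) + (y - mu))  # B' W z
        theta_new = np.linalg.solve(BtWB + P, rhs)
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:
            break
    mu = np.exp(B @ theta)
    BtWB = (B * mu[:, None]).T @ B
    ed = np.trace(np.linalg.solve(BtWB + P, BtWB))  # trace of the hat matrix
    pos = y > 0
    dev = 2.0 * np.sum(y[pos] * np.log(y[pos] / mu[pos]))
    return mu, dev + 2.0 * ed

# Usage: histogram a univariate sample, then smooth the counts.
rng = np.random.default_rng(7)
y, edges = np.histogram(rng.normal(size=2000), bins=80)
u = 0.5 * (edges[:-1] + edges[1:])
B = bspline_basis(u, edges[0], edges[-1], nseg=13)
mu, aic = pspline_density_1d(y.astype(float), B, lam=1.0)
```

With any difference penalty (d >= 1), the fitted counts sum to the observed total at convergence, since B-splines sum to one at every point and D annihilates constant vectors.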

3.1 Optimizing the penalty

To optimize the amount of smoothing, AIC can be computed similarly to (7). Upon convergence AIC is now computed as

    \mathrm{AIC} = 2 \sum_{i=1}^{n_1} \sum_{h=1}^{n_2} Y_{ih} \log(Y_{ih}/\hat\mu_{ih}) + 2\,\mathrm{trace}\{B'\hat W B(B'\hat W B + P)^{-1}\},  (11)

borrowing the result that the trace is invariant under cyclic permutation of matrices. One varies the logarithms of \lambda_1 and \lambda_2 on a grid, searching for the minimum. In many cases, or for a first exploration, isotropic smoothing (\lambda_1 = \lambda_2 = \lambda) will be sufficient and the grid search can be one-dimensional.

3.2 Toward higher dimensions

Although this scheme will work, it can be wasteful with memory and computation time, especially in higher dimensions. For example, in three dimensions (with common n and c), the basis would have n^3 rows and c^3 columns, and most practical applications would not fit in the memory of the average PC. To avoid this bottleneck, we use the grid algorithms of Eilers et al. (2006). An example is the computation of the mean \mu = \mathrm{vec}(M) in two dimensions. Instead of computing it as a vector, as in (9), we can use the following equivalence:

    \mu = E(y) = \exp\bigl((B_2 \otimes B_1)\theta\bigr) \iff M = E(Y) = \exp(B_1 \Theta B_2').  (12)

This avoids the construction of the large Kronecker product basis, saving space and time. This scheme can be extended to three dimensions or more, by recursive re-arrangement of multidimensional arrays as matrices and pre-multiplication by the one-dimensional bases. In a similar way the multidimensional inner products B'WB can be obtained efficiently. The key observation is that in a four-dimensional representation the elements of B'WB can be written as

    f_{jkj'k'} = \sum_h \sum_i w_{ih}\, b_{ij} b_{hk} b_{ij'} b_{hk'},  (13)

where b_{ij} = b_j(u_{1i}) and b_{hk} = b_k(u_{2h}), which can be rearranged as

    f_{jkj'k'} = \sum_h b_{hk} b_{hk'} \sum_i w_{ih}\, b_{ij} b_{ij'}.  (14)

By clever use of row-wise tensor products and switching between four- and two-dimensional representations of matrices and arrays, one can completely avoid the construction of the large tensor-product basis B.
In addition to saving memory, computations are sped up by orders of magnitude. There is no room for details here; see Eilers et al. (2006) and Currie et al. (2006).
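Both identities — the mean computed via B_1 \Theta B_2' instead of via the Kronecker basis, and the rearranged inner product for B'WB — are easy to verify numerically. A small Python/numpy sketch with arbitrary dimensions (our illustration, not the GLAM implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, c1, c2 = 8, 7, 4, 3
B1 = rng.normal(size=(n1, c1))
B2 = rng.normal(size=(n2, c2))
Theta = rng.normal(size=(c1, c2))
theta = Theta.ravel(order='F')              # column-major vec(Theta)

# Identity 1: (B2 kron B1) vec(Theta) == vec(B1 Theta B2'), so the big
# Kronecker basis never needs to be built to evaluate the linear predictor.
eta_kron = np.kron(B2, B1) @ theta
eta_glam = (B1 @ Theta @ B2.T).ravel(order='F')
assert np.allclose(eta_kron, eta_glam)

# Identity 2: B'WB assembled from the small bases, following the rearranged
# sum f_{jkj'k'} = sum_h b_{hk} b_{hk'} sum_i w_{ih} b_{ij} b_{ij'}.
W = rng.uniform(size=(n1, n2))              # weights w_{ih} on the grid
Bbig = np.kron(B2, B1)
F_naive = Bbig.T @ (Bbig * W.ravel(order='F')[:, None])
G = np.einsum('ij,ih,iJ->jJh', B1, W, B1)   # inner sums over i, one per h
F4 = np.einsum('hk,jJh,hK->jkJK', B2, G, B2)
F_glam = F4.transpose(1, 0, 3, 2).reshape(c1 * c2, c1 * c2)
assert np.allclose(F_naive, F_glam)
```

The einsum route touches only the small matrices B1 (n1 x c1) and B2 (n2 x c2); the explicit Kronecker product is formed here solely to check the result.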

FIGURE 1. Anisotropic smoothing of the Old Faithful data. Left: contours of AIC over log10(lambda_r) and log10(lambda_c); right: contours of the smooth density estimate (waiting time versus duration). The density was normalized so that the largest value equals 1. The contour levels are 0.9, 0.7, 0.5, 0.3, 0.2, 0.1, 0.07, 0.05, 0.03, 0.02, 0.01.

FIGURE 2. Isotropic smoothing of the Old Faithful data. Left: AIC profile against log10(lambda_c); right: contours of the smooth density estimate. Normalization and contour levels as in Figure 1.

4 Illustrative examples

Figure 1 shows two-dimensional density contours for the well-known Old Faithful data. The Poisson counts were collected on a 100 x 100 histogram grid, and a 13 x 13 grid of tensor products of cubic B-splines was used to construct the two-dimensional density. The order of the difference penalty is 3. We see that AIC is minimized with relatively light smoothing (lambda_1 ≈ 0.003) along the rows (waiting times) and with much heavier smoothing (lambda_2 ≈ 1000) along the columns (duration). Figure 2 shows the result of isotropic smoothing, yielding a compromise value for the single lambda.

The large difference between the two lambdas for anisotropic smoothing might look surprising. The reason is that the conditional distribution of waiting time given (the previous) duration is always unimodal, not much different from a normal distribution. This allows a large weight for a third-order penalty without destroying a proper fit to the data. On the other hand, the conditional distribution of the (previous) duration given the waiting time is bimodal, or rather skewed, on part of the data domain. The weight of the horizontal penalty can therefore not be large.

As a second example, Figure 3 displays a scatterplot of gene expressions (on a log10 scale) as measured by two microarrays. The data pairs are plotted using a simple dot symbol. Only a random selection of 1000 observations is shown, to prevent a large part of the cloud of points from becoming completely black. As discussed by Eilers and Goeman (2004), scatterplots with many dots are difficult to judge, because symbols at the boundaries attract too much attention from the observer, making the cloud of points look wider than it really is. Eilers and Goeman presented a very fast but simple smoother, with the goal of an attractive display of dense scatterplots. The density smoother proposed here can be used as an improvement, especially when lambda is optimized by AIC.

FIGURE 3. Presentation of a microarray data scatterplot as a density. Upper left: scatterplot of a selection of the raw data; upper right: AIC profile for the isotropic penalty parameter. Lower left: image of the optimal density; lower right: square root of the density (for improved visualization). The axes show the expressions on the two arrays (log10 scale).

The results presented were also obtained with a 100 x 100 histogram grid and 13 x 13 tensor products of cubic B-splines, with the penalty weight for isotropic smoothing optimized by AIC. The result of the smoother is presented as a gray-scale image of the height of the density. Depending on the medium used (computer screen, projector, or printer), the effective dynamic range can be quite small. Therefore we also present a gray-scale plot of the square root of the density.

5 Discussion

We have shown that tensor products of P-splines and Poisson regression of histogram data lead to an effective density estimator. As discussed in Eilers and Marx (1996), the position and number of bins (i.e. the location and width of the histogram grid) make little difference to the final fit, provided that the grid is chosen sufficiently fine. There is no room here to illustrate that the same holds for two-dimensional smoothing. It is remarkable that a very sparse two-dimensional histogram leads to a very good density estimate. The Old Faithful data set has 272 observations; the average count per 2-D histogram bin is thus less than 0.03! As a practical guideline in two dimensions, we recommend that users (who are not concerned with computation time) start with 100 x 100 histogram bins, 20 x 20 B-splines, and a second- or third-order penalty.
Some nice features of one-dimensional P-spline density estimation also carry over to higher dimensions. For one, there is conservation of moments, for any lambda, within each row, column, layer, and so forth (depending on the dimension of the histogram). For example, within any row, a penalty of order d = 1 yields equality of the sum of the observed and the sum of the fitted counts. When d = 2, e.g. within any row, the previous condition holds and the mean of the observed data equals that of the fitted counts. Further, when d = 3, the previous two conditions hold, with an additional preservation of the variance between observed and fitted data. The P-spline density estimator is also not affected by the unpleasant boundary effects so familiar from kernel smoothers; in fact, sharply specified boundaries are readily accommodated by the P-spline approach. Especially for isotropic smoothing, optimization of AIC, using a simple grid search, is efficient. Computation for one value of lambda (in Matlab) takes less than a second on a 1000 MHz PIII computer. The time to compute the histogram is negligible, even for many observations.
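The conservation of moments described above rests on the null space of the difference penalty: the d-th order difference matrix annihilates polynomials up to degree d - 1, so those components of the fit are left unpenalized. A quick numerical check of this property (Python/numpy; our illustration):

```python
import numpy as np

c = 10
x = np.arange(c, dtype=float)
for d in (1, 2, 3):
    D = np.diff(np.eye(c), n=d, axis=0)   # d-th order difference matrix
    # D annihilates 1, x, ..., x^(d-1): constants for d = 1, linear trends
    # as well for d = 2, quadratics in addition for d = 3.
    for p in range(d):
        assert np.allclose(D @ x**p, 0.0)
```

This mirrors the pattern in the text: each extra order of the penalty leaves one more moment of the data untouched by the smoothing.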

Minimization of AIC is not the only possible approach to optimal smoothing. One can interpret models with penalties as mixed models and attempt to estimate the variance of the mixing distribution (Ruppert, Wand and Carroll, 2003). Durbán et al. (2006) investigated this approach to multidimensional density estimation. Alternatively, a purely Bayesian approach is also possible. The penalty is then seen as the logarithm of a prior density on differences of the coefficients of the B-spline tensor products. Efficient simulation, using an optimized Langevin-Hastings algorithm, is possible; see Lambert and Eilers (2006), extending their 2005 work.

References

Currie, I.D., Durbán, M. and Eilers, P.H.C. (2006). Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society, Series B, 68, 259-280.

Durbán, M., Currie, I.D. and Eilers, P.H.C. (2004). Smoothing and forecasting mortality rates. Statistical Modelling, 4, 279-298.

Durbán, M., Currie, I.D. and Eilers, P.H.C. (2006). Multi-dimensional density smoothing with P-splines: a mixed model approach. Proceedings of the International Workshop on Statistical Modelling, Galway.

Eilers, P.H.C., Currie, I.D. and Durbán, M. (2006). Fast and compact smoothing on large multidimensional grids. Computational Statistics and Data Analysis, 50, 61-76.

Eilers, P.H.C. and Goeman, J. (2004). Enhancing scatterplots with smoothed densities. Bioinformatics, 20, 623-628.

Eilers, P.H.C. and Marx, B.D. (1996). Flexible smoothing with B-splines and penalties (with Comments and Rejoinder). Statistical Science, 11(2), 89-121.

Kooperberg, C. and Stone, C.J. (1991). A study of logspline density estimation. Computational Statistics and Data Analysis, 12, 327-347.

Lambert, P. and Eilers, P.H.C. (2005). Bayesian proportional hazards model with time-varying coefficients: a penalized Poisson regression approach. Statistics in Medicine, 24, 3977-3989.

Lambert, P. and Eilers, P.H.C. (2006). Bayesian multidimensional density smoothing. Proceedings of the International Workshop on Statistical Modelling, Galway.

O'Sullivan, F. (1988). Fast computation of fully automated log-density and log-hazard estimators. SIAM Journal on Scientific and Statistical Computing, 9, 363-379.

Ruppert, D., Wand, M.P. and Carroll, R.J. (2003). Semiparametric Regression. Cambridge University Press.