Multidimensional Density Smoothing with P-splines

Paul H.C. Eilers (1), Brian D. Marx (2)

(1) Department of Medical Statistics, Leiden University Medical Center, 2300 RC Leiden, The Netherlands (p.eilers@lumc.nl)
(2) Department of Experimental Statistics, Louisiana State University, Baton Rouge, Louisiana 70803, USA (bmarx@lsu.edu)

Abstract: We propose a simple and effective multidimensional density estimator. Our approach is essentially penalized Poisson regression using a rich tensor product B-spline basis, where the Poisson counts come from processing the data into a multidimensional histogram, often consisting of thousands of bins. The penalty enforces smoothness of the B-spline coefficients, specifically within the rows, columns, and layers, depending on the dimension. In this paper we focus on how a one-dimensional P-spline density estimator can be extended to two dimensions and beyond. In higher dimensions we provide a hint of how efficient grid algorithms can be implemented using array regression. Our approach optimizes the penalty weight parameter(s) using information criteria, specifically AIC. Two examples illustrate our method in two dimensions.

Keywords: AIC; Effective Dimension; Tensor Product.

1 Introduction

Density estimation is a core activity in data exploration and applied statistics. Surprisingly, many statisticians approach it in a quite unsophisticated way: they are happy with off-the-shelf kernel smoothers, as can be seen in many papers on computational Bayesian modeling. An attractive alternative is (penalized) likelihood smoothing, modeling the logarithm of the density with splines. Kooperberg and Stone (1991) use adaptive-knot B-splines, while Eilers and Marx (1996) combine fixed-knot B-splines with a roughness penalty (P-splines), simplifying the scheme of O'Sullivan (1988). In the P-spline approach, the raw data are processed and reduced to a histogram, and density estimation is then achieved through (penalized) Poisson regression.
This is attractive for high-volume one-dimensional data, because one deals with, e.g., 100 or 200 histogram bins instead of many thousands of raw observations. In two or more dimensions the histogram approach is even more attractive, when combined with a recently developed algorithm for fast weighted smoothing on grids (Eilers et al., 2006).
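The reduction of raw data to a fine histogram is a one-off preprocessing pass. As a minimal sketch in Python/numpy (the sample data and bin counts are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Illustrative bivariate sample; any raw (x1, x2) data would do.
rng = np.random.default_rng(1)
x1 = rng.normal(2.0, 0.5, size=10_000)
x2 = rng.normal(70.0, 8.0, size=10_000)

# Reduce the raw data to a 100 x 100 histogram: Y holds the Poisson
# counts, u1 and u2 the bin midpoints along each dimension.
Y, e1, e2 = np.histogram2d(x1, x2, bins=(100, 100))
u1 = 0.5 * (e1[:-1] + e1[1:])
u2 = 0.5 * (e2[:-1] + e2[1:])

# The smoother now works with 10,000 bins instead of the raw
# observations; the total count is preserved exactly.
print(Y.shape, int(Y.sum()))
```

From here on, only `Y`, `u1`, and `u2` are needed; the raw observations can be discarded.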

A critical issue is optimization of the amount of smoothing. We use Akaike's Information Criterion (AIC). It is easy to compute, because an effective model dimension can be defined, and it can be obtained with relatively little effort.

2 P-spline overview: one-dimensional densities

Let (y_i, u_i) denote the Poisson count and bin midpoint pairs from a (narrowly binned) histogram, i = 1, ..., n. The vector of counts is denoted y. We model the expected values of the counts as

    \mu_i = E(y_i) = \exp\Bigl(\sum_{j=1}^{c} b_j(u_i)\,\theta_j\Bigr),  (1)

or, in matrix terms, \mu = \exp(B\theta), where B = [b_j(u_i)] is an (n \times c) B-spline basis built along the indexing axis u of the density. A rich basis is used (c sufficiently large) and the knots for the basis are equally spaced. Apart from a constant, the Poisson log-likelihood is

    l = \sum_{i=1}^{n} \log\bigl(\mu_i^{y_i} e^{-\mu_i}\bigr) = \sum_{i=1}^{n} (y_i \log \mu_i - \mu_i).  (2)

A penalty on the d-th differences of \theta is subtracted from the log-likelihood, to tune smoothness. The number of basis functions in B is chosen large, to give ample flexibility. The penalized log-likelihood, l^*, then is

    l^* = l - \lambda \sum_{j=d+1}^{c} (\Delta^d \theta_j)^2.  (3)

Setting the gradient of (3) equal to zero gives

    B'(y - \mu) = \lambda D'D\theta,  (4)

where D is a matrix of contrasts such that D\theta = \Delta^d \theta. Linearization of (4) leads to

    (B'WB + \lambda D'D)\theta = B'Wz,  (5)

where z = \eta + W^{-1}(y - \mu) is the working variable and \eta = B\theta. The matrix W = \mathrm{diag}(\mu), and \theta, \mu are approximations to the solution of (5), iterated until convergence. One easily recognizes the familiar equations for fitting a GLM, modified by the term \lambda D'D that stems from the penalty. At convergence, one can interpret (5) as linear smoothing of the working variable z. We have

    \hat z = B\hat\theta = B(B'\hat W B + \lambda D'D)^{-1} B'\hat W z = Hz,  (6)

where H is the smoothing or hat matrix. The effective dimension of the smoother is approximated by trace(H). We use it to define Akaike's information criterion (AIC) as

    \mathrm{AIC} = 2 \sum_{i=1}^{n} y_i \log(y_i / \hat\mu_i) + 2\,\mathrm{trace}(H).  (7)

Essentially this is all old hat. A more detailed description of density smoothing with P-splines is given by Eilers and Marx (1996). We use the presentation above as a springboard for density smoothing with tensor product P-splines in two or more dimensions.

3 Two-dimensional P-spline densities

Let Y_{ih} be the counts in a two-dimensional (narrowly binned) n_1 \times n_2 histogram, forming the array Y of Poisson counts with expectation M. The bin midpoints are now indexed by the n_1 n_2 pairs (u_1 \otimes 1_{n_2}, 1_{n_1} \otimes u_2), where u_1 (u_2) are the midpoint locations along the first (second) dimension. We model the expected values, now using tensor product B-splines on a rich c_1 \times c_2 rectangular grid (equally spaced along each dimension), as

    \mu_{ih} = E(Y_{ih}) = \exp\Bigl(\sum_{j=1}^{c_1} \sum_{k=1}^{c_2} b_j(u_{1i})\, b_k(u_{2h})\, \theta_{jk}\Bigr).  (8)

The n_1 n_2-vectorized forms of the Poisson counts and corresponding expected values are denoted by y = \mathrm{vec}(Y) and \mu = \mathrm{vec}(M), respectively. Ignoring penalties for the moment, (8) is also a GLM, and can be rewritten as

    \mu = E(y) = \exp\bigl((B_2 \otimes B_1)\theta\bigr) = \exp(B\theta).  (9)

We now have the c_1 \times c_2 array of coefficients \Theta, with vectorized form \theta = \mathrm{vec}(\Theta). The matrix B, with n_1 n_2 rows and c_1 c_2 columns, collects the tensor product basis functions. As outlined in Durbán et al. (2004), the penalty can be written as

    \lambda_1 \|D_1 \Theta\|_F^2 + \lambda_2 \|\Theta D_2'\|_F^2 = \theta'(\lambda_1 I_{c_2} \otimes D_1'D_1 + \lambda_2 D_2'D_2 \otimes I_{c_1})\theta = \theta' P \theta,

where \|\cdot\|_F^2 indicates the squared Frobenius norm, the sum of the squares of all elements of a matrix, and D_1 (D_2) is the proper contrast matrix for the rows (columns) of \Theta. Such a penalty enforces smoothness of the tensor product B-spline coefficients within any one row or column, but breaks the linkage of penalization across rows or columns.
As seen in the penalty above, the weight and order of the penalty are the same from row to row and from column to column, but can differ between the rows and the columns. In this scheme, with \eta = B\theta, the equivalent of (5) becomes

    (B'WB + P)\theta = B'W\bigl(\eta + W^{-1}(y - \mu)\bigr).  (10)
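For concreteness, the penalized IRLS scheme above can be sketched in a few lines of Python/numpy for the one-dimensional case. This is our own minimal illustration, not the authors' code; the function names, starting values, and convergence settings are arbitrary choices:

```python
import numpy as np

def bspline_basis(x, xlo, xhi, nseg, deg=3):
    """Equally spaced B-spline basis (Cox-de Boor recursion); nseg + deg columns."""
    dx = (xhi - xlo) / nseg
    knots = xlo + dx * np.arange(-deg, nseg + deg + 1)
    B = ((x[:, None] >= knots[:-1]) & (x[:, None] < knots[1:])).astype(float)
    for d in range(1, deg + 1):
        left = (x[:, None] - knots[:-d - 1]) / (knots[d:-1] - knots[:-d - 1])
        right = (knots[d + 1:] - x[:, None]) / (knots[d + 1:] - knots[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def pspline_density_1d(y, B, lam, d=3, max_iter=50, tol=1e-8):
    """Penalized Poisson regression of histogram counts y on basis B.

    Returns the fitted means mu and AIC = deviance + 2 * effective dimension."""
    n, c = B.shape
    D = np.diff(np.eye(c), n=d, axis=0)            # d-th order difference matrix
    P = lam * D.T @ D
    theta = np.full(c, np.log(y.mean() + 1e-9))    # flat start
    for _ in range(max_iter):
        mu = np.exp(B @ theta)
        BtWB = (B * mu[:, None]).T @ B             # B' diag(mu) B
        rhs = B.T @ (mu * (B @ theta) + (y - mu))  # B' W z
        theta_new = np.linalg.solve(BtWB + P, rhs)
        converged = np.max(np.abs(theta_new - theta)) < tol
        theta = theta_new
        if converged:
            break
    mu = np.exp(B @ theta)
    BtWB = (B * mu[:, None]).T @ B
    ed = np.trace(np.linalg.solve(BtWB + P, BtWB))  # trace of the hat matrix
    pos = y > 0
    dev = 2.0 * np.sum(y[pos] * np.log(y[pos] / mu[pos]))
    return mu, dev + 2.0 * ed

# Usage: histogram a univariate sample, then smooth the counts.
rng = np.random.default_rng(7)
y, edges = np.histogram(rng.normal(size=2000), bins=80)
u = 0.5 * (edges[:-1] + edges[1:])
B = bspline_basis(u, edges[0], edges[-1], nseg=13)
mu, aic = pspline_density_1d(y.astype(float), B, lam=1.0)
```

With any difference penalty (d >= 1), the fitted counts sum to the observed total at convergence, since B-splines sum to one at every point and D annihilates constant vectors.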

3.1 Optimizing the penalty

To optimize the amount of smoothing, AIC can be computed similarly to (7). Upon convergence AIC is now computed as

    \mathrm{AIC} = 2 \sum_{i=1}^{n_1} \sum_{h=1}^{n_2} Y_{ih} \log(Y_{ih}/\hat\mu_{ih}) + 2\,\mathrm{trace}\{B'\hat W B(B'\hat W B + P)^{-1}\},  (11)

borrowing the result that the trace is invariant under cyclic permutation of matrices. One varies the logarithms of \lambda_1 and \lambda_2 on a grid, searching for the minimum. In many cases, or for a first exploration, isotropic smoothing (\lambda_1 = \lambda_2 = \lambda) will be sufficient and the grid search can be one-dimensional.

3.2 Toward higher dimensions

Although this scheme will work, it can be wasteful with memory and computation time, especially in higher dimensions. For example, in three dimensions (with common n and c), the basis would have n^3 rows and c^3 columns, and most practical applications would not fit in the memory of the average PC. To avoid this bottleneck, we use the grid algorithms of Eilers et al. (2006). An example is the computation of the mean \mu = \mathrm{vec}(M) in two dimensions. Instead of computing it as a vector, as in (9), we can use the following equivalence:

    \mu = E(y) = \exp\bigl((B_2 \otimes B_1)\theta\bigr) \iff M = E(Y) = \exp(B_1 \Theta B_2').  (12)

This avoids the construction of the large Kronecker product basis, saving space and time. This scheme can be extended to three dimensions or more, by recursive re-arrangement of multidimensional arrays as matrices and pre-multiplication by the one-dimensional bases. In a similar way the multidimensional inner products B'WB can be obtained efficiently. The key observation is that in a four-dimensional representation the elements of B'WB can be written as

    f_{jkj'k'} = \sum_h \sum_i w_{ih}\, b_{ij} b_{hk} b_{ij'} b_{hk'},  (13)

where b_{ij} = b_j(u_{1i}) and b_{hk} = b_k(u_{2h}), which can be rearranged as

    f_{jkj'k'} = \sum_h b_{hk} b_{hk'} \sum_i w_{ih}\, b_{ij} b_{ij'}.  (14)

By clever use of row-wise tensor products and switching between four- and two-dimensional representations of matrices and arrays, one can completely avoid the construction of the large tensor-product basis B.
In addition to saving memory, computations are sped up by orders of magnitude. There is no room for details here; see Eilers et al. (2006) and Currie et al. (2006).
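Both identities — the mean computed via B_1 \Theta B_2' instead of via the Kronecker basis, and the rearranged inner product for B'WB — are easy to verify numerically. A small Python/numpy sketch with arbitrary dimensions (our illustration, not the GLAM implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, c1, c2 = 8, 7, 4, 3
B1 = rng.normal(size=(n1, c1))
B2 = rng.normal(size=(n2, c2))
Theta = rng.normal(size=(c1, c2))
theta = Theta.ravel(order='F')              # column-major vec(Theta)

# Identity 1: (B2 kron B1) vec(Theta) == vec(B1 Theta B2'), so the big
# Kronecker basis never needs to be built to evaluate the linear predictor.
eta_kron = np.kron(B2, B1) @ theta
eta_glam = (B1 @ Theta @ B2.T).ravel(order='F')
assert np.allclose(eta_kron, eta_glam)

# Identity 2: B'WB assembled from the small bases, following the rearranged
# sum f_{jkj'k'} = sum_h b_{hk} b_{hk'} sum_i w_{ih} b_{ij} b_{ij'}.
W = rng.uniform(size=(n1, n2))              # weights w_{ih} on the grid
Bbig = np.kron(B2, B1)
F_naive = Bbig.T @ (Bbig * W.ravel(order='F')[:, None])
G = np.einsum('ij,ih,iJ->jJh', B1, W, B1)   # inner sums over i, one per h
F4 = np.einsum('hk,jJh,hK->jkJK', B2, G, B2)
F_glam = F4.transpose(1, 0, 3, 2).reshape(c1 * c2, c1 * c2)
assert np.allclose(F_naive, F_glam)
```

The einsum route touches only the small matrices B1 (n1 x c1) and B2 (n2 x c2); the explicit Kronecker product is formed here solely to check the result.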

FIGURE 1. Anisotropic smoothing of the Old Faithful data. Left: contours of AIC over log10(lambda_r) and log10(lambda_c); right: contours of the smooth density estimate (waiting time versus duration). The density was normalized so that the largest value equals 1. The contour levels are 0.9, 0.7, 0.5, 0.3, 0.2, 0.1, 0.07, 0.05, 0.03, 0.02, 0.01.

FIGURE 2. Isotropic smoothing of the Old Faithful data. Left: AIC profile against log10(lambda_c); right: contours of the smooth density estimate. Normalization and contour levels as in Figure 1.

4 Illustrative examples

Figure 1 shows two-dimensional density contours for the well-known Old Faithful data. The Poisson counts were collected on a 100 x 100 histogram grid, and a 13 x 13 grid of tensor products of cubic B-splines was used to construct the two-dimensional density. The order of the difference penalty is 3. We see that AIC is minimized with relatively light smoothing (lambda_1 ≈ 0.003) along the rows (waiting times) and with much heavier smoothing (lambda_2 ≈ 1000) along the columns (duration). Figure 2 shows the result of isotropic smoothing, yielding a compromise value for the single lambda.

The large difference between the two lambdas for anisotropic smoothing might look surprising. The reason is that the conditional distribution of waiting time given (the previous) duration is always unimodal, not much different from a normal distribution. This allows a large weight for a third-order penalty without destroying a proper fit to the data. On the other hand, the conditional distribution of the (previous) duration given the waiting time is bimodal, or rather skewed, on part of the data domain. The weight of the horizontal penalty can therefore not be large.

As a second example, Figure 3 displays a scatterplot of gene expressions (on a log10 scale) as measured by two microarrays. The data pairs are plotted using a simple dot symbol. Only a random selection of 1000 observations is shown, to prevent a large part of the cloud of points from becoming completely black. As discussed by Eilers and Goeman (2004), scatterplots with many dots are difficult to judge, because symbols at the boundaries attract too much attention from the observer, making the cloud of points look wider than it really is. Eilers and Goeman presented a very fast but simple smoother, with the goal of an attractive display of dense scatterplots. The density smoother proposed here can be used as an improvement, especially when lambda is optimized by AIC.

FIGURE 3. Presentation of a microarray data scatterplot as a density. Upper left: scatterplot of a selection of the raw data; upper right: AIC profile for the isotropic penalty parameter. Lower left: image of the optimal density; lower right: square root of the density (for improved visualization). The axes show the expressions on the two arrays (log10 scale).

The results presented were also obtained with a 100 x 100 histogram grid and 13 x 13 tensor products of cubic B-splines, with the penalty weight for isotropic smoothing optimized by AIC. The result of the smoother is presented as a gray-scale image of the height of the density. Depending on the medium used (computer screen, projector, or printer), the effective dynamic range can be quite small. Therefore we also present a gray-scale plot of the square root of the density.

5 Discussion

We have shown that tensor products of P-splines and Poisson regression of histogram data lead to an effective density estimator. As discussed in Eilers and Marx (1996), the position and number of bins (i.e. the location and width of the histogram grid) make little difference to the final fit, provided that the grid is chosen sufficiently fine. There is no room here to illustrate that the same holds for two-dimensional smoothing. It is remarkable that a very sparse two-dimensional histogram leads to a very good density estimate. The Old Faithful data set has 272 observations; the average count per 2-D histogram bin is thus less than 0.03! As a practical guideline in two dimensions, we recommend that users (who are not concerned with computation time) start with 100 x 100 histogram bins, 20 x 20 B-splines, and a second- or third-order penalty.
Some nice features of one-dimensional P-spline density estimation also carry over to higher dimensions. For one, there is conservation of moments, for any lambda, within each row, column, layer, and so forth (depending on the dimension of the histogram). For example, within any row, a penalty of order d = 1 yields equality of the sum of the observed and the sum of the fitted counts. When d = 2, e.g. within any row, the previous condition holds and the mean of the observed data equals that of the fitted counts. Further, when d = 3, the previous two conditions hold, with an additional preservation of the variance between observed and fitted data. The P-spline density estimator is also not affected by the unpleasant boundary effects so familiar from kernel smoothers; in fact, sharply specified boundaries are readily accommodated by the P-spline approach. Especially for isotropic smoothing, optimization of AIC, using a simple grid search, is efficient. Computation for one value of lambda (in Matlab) takes less than a second on a 1000 MHz PIII computer. The time to compute the histogram is negligible, even for many observations.
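The conservation of moments described above rests on the null space of the difference penalty: the d-th order difference matrix annihilates polynomials up to degree d - 1, so those components of the fit are left unpenalized. A quick numerical check of this property (Python/numpy; our illustration):

```python
import numpy as np

c = 10
x = np.arange(c, dtype=float)
for d in (1, 2, 3):
    D = np.diff(np.eye(c), n=d, axis=0)   # d-th order difference matrix
    # D annihilates 1, x, ..., x^(d-1): constants for d = 1, linear trends
    # as well for d = 2, quadratics in addition for d = 3.
    for p in range(d):
        assert np.allclose(D @ x**p, 0.0)
```

This mirrors the pattern in the text: each extra order of the penalty leaves one more moment of the data untouched by the smoothing.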

Minimization of AIC is not the only possible approach to optimal smoothing. One can interpret models with penalties as mixed models and attempt to estimate the variance of the mixing distribution (Ruppert, Wand and Carroll, 2003). Durbán et al. (2006) investigated this approach to multidimensional density estimation. Alternatively, a purely Bayesian approach is also possible. The penalty is then seen as the logarithm of a prior density on differences of the coefficients of the B-spline tensor products. Efficient simulation, using an optimized Langevin-Hastings algorithm, is possible; see Lambert and Eilers (2006), extending their 2005 work.

References

Currie, I.D., Durbán, M. and Eilers, P.H.C. (2006). Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society, Series B, 68, 259-280.

Durbán, M., Currie, I.D. and Eilers, P.H.C. (2004). Smoothing and forecasting mortality rates. Statistical Modelling, 4, 279-298.

Durbán, M., Currie, I.D. and Eilers, P.H.C. (2006). Multi-dimensional density smoothing with P-splines: a mixed model approach. Proceedings of the International Workshop on Statistical Modelling, Galway.

Eilers, P.H.C., Currie, I.D. and Durbán, M. (2006). Fast and compact smoothing on large multidimensional grids. Computational Statistics and Data Analysis, 50, 61-76.

Eilers, P.H.C. and Goeman, J. (2004). Enhancing scatterplots with smoothed densities. Bioinformatics, 20, 623-628.

Eilers, P.H.C. and Marx, B.D. (1996). Flexible smoothing with B-splines and penalties (with Comments and Rejoinder). Statistical Science, 11(2), 89-121.

Kooperberg, C. and Stone, C.J. (1991). A study of logspline density estimation. Computational Statistics and Data Analysis, 12, 327-347.

Lambert, P. and Eilers, P.H.C. (2005). Bayesian proportional hazards model with time-varying coefficients: a penalized Poisson regression approach. Statistics in Medicine, 24, 3977-3989.

Lambert, P. and Eilers, P.H.C. (2006). Bayesian multidimensional density smoothing. Proceedings of the International Workshop on Statistical Modelling, Galway.

O'Sullivan, F. (1988). Fast computation of fully automated log-density and log-hazard estimators. SIAM Journal on Scientific and Statistical Computing, 9, 363-379.

Ruppert, D., Wand, M.P. and Carroll, R.J. (2003). Semiparametric Regression. Cambridge University Press.