LASSO-Type Penalization in the Framework of Generalized Additive Models for Location, Scale and Shape
LASSO-Type Penalization in the Framework of Generalized Additive Models for Location, Scale and Shape
Nikolaus Umlauf
Overview
Joint work with Andreas Groll, Julien Hambuckers and Thomas Kneib.
1 Introduction
2 Model specification
3 Model fitting
4 L1-type penalization
5 A simulation study
6 An application on the Munich rent data
European Meeting of Statisticians, Helsinki, 2017/07/27
Introduction
Helsinki daily maximum temperature data (1993/ /06): T ~ N(µ, σ²).

Introduction
Helsinki daily maximum temperature data (1993/ /06): T ~ N(µ = f(T_{t−1}), log(σ²) = β₀).

Introduction
Helsinki daily maximum temperature data (1993/ /06): T ~ N(µ = f(T_{t−1}), log(σ²) = f(T_{t−1})).
Model specification
Any parameter of a population distribution D may be modeled by explanatory variables:
y ~ D(h₁(θ₁) = η₁, h₂(θ₂) = η₂, …, h_K(θ_K) = η_K).
Each parameter is linked to a structured additive predictor
h_k(θ_k) = η_k = η_k(x; β_k) = f_{1k}(x; β_{1k}) + … + f_{J_k k}(x; β_{J_k k}),
with j = 1, …, J_k and k = 1, …, K, where the h_k(·) are known monotonic link functions.
Vector of function evaluations:
f_{jk} = (f_{jk}(x₁; β_{jk}), …, f_{jk}(x_n; β_{jk}))′ = f_{jk}(X_{jk}; β_{jk}).
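As a toy illustration (not the bamlss implementation), the structured additive predictor above is just a sum of per-term design-matrix/coefficient products. A minimal Python sketch, assuming each term f_jk is represented by a design matrix X_jk and a coefficient vector β_jk (all names here are mine):

```python
import numpy as np

def eval_predictor(design_mats, coefs):
    """Evaluate eta_k = f_1k + ... + f_{J_k,k}, each term being X_jk @ beta_jk."""
    eta = np.zeros(design_mats[0].shape[0])
    for X, beta in zip(design_mats, coefs):
        eta += X @ beta
    return eta

# Two terms for n = 3 observations: an intercept and one linear effect.
X1 = np.ones((3, 1))
X2 = np.array([[0.0], [1.0], [2.0]])
eta = eval_predictor([X1, X2], [np.array([0.5]), np.array([2.0])])
# The parameter theta_k is then obtained via the inverse link, e.g. exp(eta).
```
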
Model specification (figure slide; image not reproduced)
Model specification
For simple linear effects X_{jk}β_{jk}: p_{jk}(β_{jk}) ∝ const.
For the smooth terms:
p_{jk}(β_{jk}; τ_{jk}, α_{jk}) = d_{β_{jk}}(β_{jk} | τ_{jk}; α_{β_{jk}}) · d_{τ_{jk}}(τ_{jk} | α_{τ_{jk}}).
Using a basis function approach, a common choice is
d_{β_{jk}}(β_{jk} | τ_{jk}, α_{β_{jk}}) ∝ |P_{jk}(τ_{jk})|^{1/2} exp(−½ β′_{jk} P_{jk}(τ_{jk}) β_{jk}),
with precision matrix P_{jk}(τ_{jk}) derived from prespecified penalty matrices α_{β_{jk}} = {K_{1jk}, …, K_{Ljk}}.
The variance parameters τ_{jk} are equivalent to the inverse smoothing parameters in a frequentist approach.
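For intuition, the Gaussian smoothing prior above is cheap to evaluate. A hedged sketch (function name mine; the normalizing constant ½ log|P| is dropped, and a single penalty matrix K with precision P(τ) = K/τ is assumed):

```python
import numpy as np

def log_prior_smooth(beta, K, tau):
    # log d(beta | tau) up to an additive constant: a (possibly rank-deficient)
    # Gaussian with precision matrix P(tau) = K / tau.
    P = K / tau
    return -0.5 * beta @ P @ beta

# First-order random-walk penalty matrix for two coefficients.
K = np.array([[1.0, -1.0], [-1.0, 1.0]])
val = log_prior_smooth(np.array([1.0, -1.0]), K, tau=1.0)
```

Large τ flattens the prior (weak smoothing), small τ pulls adjacent coefficients together, mirroring the inverse-smoothing-parameter interpretation above.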
Regularization in the GAMLSS framework
- A gradient boosting approach is provided by Mayr et al. (2012).
- Allows for variable selection within the GAMLSS framework.
- Corresponding R package gamboostLSS (Hofner et al., 2016).
- Provides a large number of pre-specified distributions.
- New: an alternative gradient boosting approach is implemented in the R package bamlss (Umlauf et al., 2017b), which
  - embeds many different approaches suggested in the literature and software,
  - serves as a unified conceptual Lego toolbox for complex regression models.
Regularization in the GAMLSS framework
New model terms f_{jk}(x; β_{jk}) with LASSO-type penalties. (figure slides; images not reproduced)
Model fitting
The main building block of regression model algorithms is the probability density function d_y(y | θ₁, …, θ_K). Estimation typically requires evaluating the log-likelihood
ℓ(β; y, X) = Σ_{i=1}^n log d_y(y_i; θ_{i1} = h₁^{−1}(η_{i1}(x_i, β₁)), …, θ_{iK} = h_K^{−1}(η_{iK}(x_i, β_K))),
with β = (β′₁, …, β′_K)′ and X = (X₁, …, X_K).
The log-posterior is
log π(β, τ; y, X, α) ∝ ℓ(β; y, X) + Σ_{k=1}^K Σ_{j=1}^{J_k} log p_{jk}(β_{jk}; τ_{jk}, α_{jk}),
where τ = (τ′₁, …, τ′_K)′ = (τ₁₁, …, τ_{J₁ 1}, …, τ_{1K}, …, τ_{J_K K})′ (in frequentist terms, a penalized log-likelihood).
Model fitting
Posterior mode estimation; fortunately, partitioned updating is possible:
β₁^{(t+1)} = U₁(β₁^{(t)}, β₂^{(t)}, …, β_K^{(t)})
β₂^{(t+1)} = U₂(β₁^{(t+1)}, β₂^{(t)}, …, β_K^{(t)})
⋮
β_K^{(t+1)} = U_K(β₁^{(t+1)}, β₂^{(t+1)}, …, β_K^{(t)}).
E.g., Newton-Raphson type updating:
β_k^{(t+1)} = U_k(β_k^{(t)}, ·) = β_k^{(t)} − H_{kk}(β_k^{(t)})^{−1} s(β_k^{(t)}).
This can be further partitioned for each function within parameter block k. Moreover, using a basis function approach yields IWLS updates
β_{jk}^{(t+1)} = (X′_{jk} W_{kk} X_{jk} + G_{jk}(τ_{jk}))^{−1} X′_{jk} W_{kk} (z_k − η_{k,−j}^{(t)}).
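The IWLS update above is a single weighted, penalized least-squares solve. A minimal numpy sketch (illustrative only; W is taken as a vector of diagonal working weights, G plays the role of the penalty matrix G_jk(τ_jk)):

```python
import numpy as np

def iwls_update(X, w, z, eta_rest, G):
    """One IWLS step: (X'WX + G)^{-1} X'W (z - eta_rest)."""
    XtW = X.T * w                      # scale columns by the diagonal weights
    return np.linalg.solve(XtW @ X + G, XtW @ (z - eta_rest))

# Unweighted, unpenalized intercept-only example: the update is the mean of z.
X = np.ones((3, 1))
beta_new = iwls_update(X, np.ones(3), np.array([1.0, 2.0, 3.0]),
                       np.zeros(3), np.zeros((1, 1)))
```
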
Model fitting
A simple generic algorithm for distributional regression models:

while(eps > ε & t < maxit) {
  for(k in 1:K) {
    for(j in 1:J[k]) {
      Compute η_{k,−j} = η_k − f_{jk}.
      Obtain new (β_{jk}, τ_{jk}) = U_{jk}(X_{jk}, y, η, β_{jk}^{[t]}, τ_{jk}^{[t]}).
      Update η_k.
    }
  }
  t = t + 1
  Compute new eps.
}

The functions U_{jk}(·) could either return updates from an optimizing algorithm or proposals from an MCMC sampler.
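The generic loop can be mimicked in a few lines for a single Gaussian mean predictor. A sketch under simplifying assumptions (least-squares updates, no smoothing parameters, a single parameter block; all names are mine):

```python
import numpy as np

def backfit(terms, y, maxit=100, tol=1e-8):
    """Cyclically update each term f_jk against its partial residual,
    as in the inner j-loop of the algorithm above."""
    eta = sum(X @ beta for X, beta in terms)
    for _ in range(maxit):
        eps = 0.0
        for j, (X, beta) in enumerate(terms):
            eta_rest = eta - X @ beta                       # eta_k minus f_jk
            beta_new = np.linalg.lstsq(X, y - eta_rest, rcond=None)[0]
            eps = max(eps, float(np.max(np.abs(beta_new - beta))))
            eta = eta_rest + X @ beta_new                   # update eta_k
            terms[j] = (X, beta_new)
        if eps < tol:                                       # convergence check
            break
    return terms, eta

# Recover y = 1 + 2*x with an intercept term and a linear term.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
terms = [(np.ones((4, 1)), np.zeros(1)), (x[:, None], np.zeros(1))]
terms, eta = backfit(terms, y)
```

In the talk's setting U_jk would instead be an IWLS step or an MCMC proposal, but the control flow is the same.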
L1-type penalization
Idea: depending on the type of covariate effects, subtract a combination of (parts of) the following penalty terms τ^{−1} J(β) from the log-likelihood.
Classical LASSO (Tibshirani, 1996): for a metric covariate x_{jk} use
J_m(β_{jk}) = |β_{jk}|.
Group LASSO (Meier et al., 2008): for a (dummy-encoded) categorical covariate x_{jk} use
J_g(β_{jk}) = ‖β_{jk}‖₂,
with the vector β_{jk} collecting all corresponding coefficients.
L1-type penalization
Alternatively, for categorical covariates, clustering of categories with implicit factor selection is often desirable.
Fused LASSO (Gertheiss and Tutz, 2010): depending on the nominal (left) or ordinal (right) scale level of the covariate, use
J_f(β_{jk}) = Σ_{l>m} w_{lm}^{(jk)} |β_{jkl} − β_{jkm}|   or   J_f(β_{jk}) = Σ_{l=1}^{c_{jk}} w_l^{(jk)} |β_{jkl} − β_{jk,l−1}|,
where c_{jk} is the number of levels of the categorical predictor x_{jk} and w_{lm}^{(jk)}, w_l^{(jk)} denote suitable weights. Choosing l = 0 as the reference, β_{jk0} = 0 is fixed.
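The three penalty variants are easy to write down directly. An illustrative Python sketch with unit weights by default (function names are mine), keeping the reference level's effect fixed at 0 as above:

```python
import numpy as np

def lasso_pen(beta):
    """Classical LASSO for a metric covariate: |beta|."""
    return float(np.abs(beta).sum())

def group_pen(beta):
    """Group LASSO: Euclidean norm of one dummy group's coefficients."""
    return float(np.sqrt(np.sum(np.asarray(beta) ** 2)))

def fused_pen_ordinal(beta, w=None):
    """Ordinal fused LASSO: weighted absolute differences of adjacent
    levels, with the reference effect beta_0 = 0 prepended."""
    b = np.concatenate(([0.0], beta))
    d = np.abs(np.diff(b))
    w = np.ones_like(d) if w is None else np.asarray(w)
    return float(w @ d)

def fused_pen_nominal(beta, w=None):
    """Nominal fused LASSO: all pairwise absolute differences (l > m)."""
    b = np.concatenate(([0.0], beta))
    d = np.array([abs(b[l] - b[m]) for l in range(len(b)) for m in range(l)])
    w = np.ones_like(d) if w is None else np.asarray(w)
    return float(w @ d)

pen = fused_pen_ordinal(np.array([1.0, 1.0, 2.0]))  # |1-0| + |1-1| + |2-1|
```

The middle difference being exactly zero is the mechanism behind category fusion: levels 1 and 2 are merged.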
L1-type penalization
Quadratic approximations of the penalties (compare Oelker & Tutz, 2017):
J_{jk}(β_{jk}) ≈ J_{jk}(β_{jk}^{(t)}) + ½ (β′_{jk} P_{jk}(β_{jk}^{(t)}) β_{jk} − (β_{jk}^{(t)})′ P_{jk}(β_{jk}^{(t)}) β_{jk}^{(t)}),
with a precision matrix of the form
P_{jk}(β_{jk}^{(t)}) = Σ_q N′_{jk}(|a′_q β_{jk}^{(t)}|) / |a′_q β_{jk}^{(t)}| · a_q a′_q,
where the a_q are the (difference) vectors defining the individual penalty terms. E.g., |β| is approximated by √(β² + c); hence, IWLS-based updating functions U_{jk}(·) are relatively easy to implement.
L1-type penalization
Example of the approximation of the L₁ norm: usually, setting the constant to c = 10⁻⁵ works well. (figure not reproduced)
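The surrogate |β| ≈ √(β² + c) can be checked numerically; a small sketch showing that with c = 10⁻⁵ the worst-case error (at β = 0) is about √c ≈ 0.003:

```python
import numpy as np

def abs_approx(beta, c=1e-5):
    # Differentiable surrogate for |beta|; exact up to O(sqrt(c)) near zero.
    return np.sqrt(beta ** 2 + c)

b = np.linspace(-1.0, 1.0, 101)
max_err = float(np.max(np.abs(abs_approx(b) - np.abs(b))))  # worst at beta = 0
```
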
R package bamlss
The package is available at

Development version, in R simply type

R> install.packages("bamlss",
+   repos = "
R package bamlss
Building blocks: formula, family, data → model.frame → optimizer / transformer / sampler → results → summary, plot, select, predict.
In principle, the setup is not restricted to any specific type of engine (Bayesian or frequentist).
R package bamlss

Type         Function
Parser       bamlss.frame()
Transformer  bamlss.engine.setup(), randomize()
Optimizer    bfit(), opt(), cox.mode(), jm.mode(), boost(), lasso()
Sampler      GMCMC(), JAGS(), STAN(), BayesX(), cox.mcmc(), jm.mcmc()
Results      results.bamlss.default()

To implement new engines, only the building block functions have to be exchanged.
R package bamlss
Work in progress...

Function               Distribution
beta_bamlss()          Beta distribution
binomial_bamlss()      Binomial distribution
cnorm_bamlss()         Censored normal distribution
cox_bamlss()           Continuous time Cox model
gaussian_bamlss()      Gaussian distribution
gamma_bamlss()         Gamma distribution
gpareto_bamlss()       Generalized Pareto distribution
jm_bamlss()            Continuous time joint model
multinomial_bamlss()   Multinomial distribution
mvn_bamlss()           Multivariate normal distribution
poisson_bamlss()       Poisson distribution
...

New families only require density, distribution, random number generator, quantile, score and hess functions.
R package bamlss
Wrapper function:

R> f <- list(y ~ la(id, fuse = 2), sigma ~ la(id, fuse = 1))
R> b <- bamlss(f, family = "gaussian", sampler = FALSE,
+     optimizer = lasso, criterion = "BIC", multiple = TRUE)

Standard extractor and plotting functions: summary(), plot(), fitted(), residuals(), predict(), coef(), logLik(), DIC(), samples(), ...
Simulation setting
- Y ~ N(µ(x), σ(x)²)
- µ(x) = β₀µ + x′₁β₁µ + … + x′₄β₄µ,  σ(x) = exp(β₀σ + x′₁β₁σ + … + x′₄β₄σ)
- x₁, x₂ nominal factors; x₃, x₄ ordinal factors.
- Intercepts: β₀µ = 1, β₀σ = 0.5.
- β₁µ = (0, 0.5, 0.5, 0.5, 0.5, 0.2, 0.2), β₂µ = (0, 1, 1)
- β₃µ = (0, 0.5, 0.5, 1, 1, 2, 2), β₄µ = (0, 0.3, 0.3)
- β₁σ = (0, 0.5, 0.4, 0, 0.5, 0.4, 0), β₂σ = (0.4, 0, 0.4)
- β₃σ = (0, 0, 0.4, 0.4, 0.4, 0.8, 0.8), β₄σ = (0, 0.5, 0.5)
- Additional noise variables: x₅, x₆ nominal factors; x₇, x₈ ordinal factors.
- n_train = 1000, with 100 simulation runs.
Model comparison
The following models are compared w.r.t. goodness-of-fit:
- (Unregularized) maximum likelihood.
- Gradient boosting with m_stop determined by BIC (bamlss).
- Gradient boosting with m_stop determined by out-of-sample prediction error (glmboostLSS / gamboostLSS).
- Fused LASSO with a global λ determined by BIC.
- Fused LASSO with λ_µ, λ_σ determined by BIC.
- Fused LASSO with separate λs for each predictor term.
- Combination of gradient boosting and (fused) LASSO.
Goodness-of-fit measures
The different models are compared (amongst others) w.r.t. the following criteria:
- MSE_βµ = ‖β_µ − β̂_µ‖², MSE_βσ = ‖β_σ − β̂_σ‖²
- False positives of differences.
- False positives of pure noise variables.
- False negatives of non-noise variables.
Results: MSE_βµ (boxplots comparing the methods; figure not reproduced)
Results: MSE_βσ (boxplots comparing the methods; figure not reproduced)
Results: false positives of differences, per parameter (µ, σ) and scale level (nominal, ordinal) (figure not reproduced)
Results: false positives of pure noise variables, per parameter and scale level (figure not reproduced)
Results: false negatives of non-noise variables, per parameter and scale level (figure not reproduced)
The Munich rent data 2007
- Munich rent standard data: used as a reference for the average rent of a flat, depending on its characteristics and spatial features.
- n = 3015 households.
- Response: monthly rent per square meter (in €).
- Covariates: out of a large set we incorporate a selection of 9 factors (ordered and nominal/binary; similar to Gertheiss and Tutz, 2010).
- Model: Gaussian GAMLSS, using for both distribution parameters, i.e. µ and σ, a combination of the two different fused LASSO penalties.
- Optimal tuning parameters: via BIC on a 2-dimensional grid.
The Munich rent data 2007
Marginal BIC curves for µ and σ (in each case fixing the other tuning parameter at its respective BIC minimum). (figure not reproduced)

The Munich rent data 2007
Ordinal fused coefficient paths for the year of construction, for parameters µ (left) and σ (right). (figure not reproduced)

The Munich rent data 2007
Nominal fused coefficient paths for the district effect, for parameters µ (left) and σ (right). (figure not reproduced)
Summary & Conclusions
- Different LASSO-type penalties for the GAMLSS framework have been proposed.
- In particular, the fused LASSO yields good results when clustering of categories is desirable/necessary.
- Boosting methods turned out to be problematic when categories are clustered.
- Reasonable results for the application on the Munich rent data.
- Implementation available in the R package bamlss.
- Outlook: further investigation of the combination of boosting and LASSO.
References & Software
Gertheiss, J. & Tutz, G. (2010). Sparse modeling of categorial explanatory variables. The Annals of Applied Statistics, 4(4).
Hofner, B., Mayr, A., & Schmid, M. (2016). gamboostLSS: An R package for model building and variable selection in the GAMLSS framework. Journal of Statistical Software, 74(1).
Mayr, A., Fenske, N., Hofner, B., Kneib, T., & Schmid, M. (2012). Generalized additive models for location, scale and shape for high-dimensional data: a flexible approach based on boosting. Journal of the Royal Statistical Society, Series C, 61(3).
Meier, L., van de Geer, S., & Bühlmann, P. (2008). The group lasso for logistic regression. Journal of the Royal Statistical Society, Series B, 70(1).
Oelker, M.-R. & Tutz, G. (2017). A uniform framework for the combination of penalties in generalized structured models. Advances in Data Analysis and Classification, 11(1).
Rigby, R. A. & Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society, Series C (Applied Statistics), 54(3).
Stasinopoulos, D. M. & Rigby, R. A. (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7).
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B.
Umlauf, N., Klein, N., & Zeileis, A. (2017a). BAMLSS: Bayesian additive models for location, scale and shape (and beyond). Working Paper, Faculty of Economics and Statistics, University of Innsbruck.
Umlauf, N., Klein, N., Zeileis, A., & Köhler, M. (2017b). bamlss: Bayesian additive models for location, scale and shape (and beyond). R package version 0.1-2.
Thank you for your attention!
Nikolaus Umlauf
Appendix: Weights for fused LASSO
Along the lines of Gertheiss and Tutz (2010), for the fused LASSO the following weights are used.
For nominal factors:
w_{lm}^{(jk)} = 2 / ((c_{jk} + 1) c_{jk}) · |β̂_{jkl}^{ML} − β̂_{jkm}^{ML}|^{−1} · √((n_l^{(jk)} + n_m^{(jk)}) / n).
For ordinal factors:
w_l^{(jk)} = (1 / c_{jk}) · |β̂_{jkl}^{ML} − β̂_{jk,l−1}^{ML}|^{−1} · √((n_l^{(jk)} + n_{l−1}^{(jk)}) / n).
Here, n_l^{(jk)} denotes the number of observations on level l of predictor x_{jk}.
More informationLOGISTIC REGRESSION Joseph M. Hilbe
LOGISTIC REGRESSION Joseph M. Hilbe Arizona State University Logistic regression is the most common method used to model binary response data. When the response is binary, it typically takes the form of
More informationRonald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California
Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University
More informationRegression I: Mean Squared Error and Measuring Quality of Fit
Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving
More informationBayesian non-parametric model to longitudinally predict churn
Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics
More informationTreatment Effects Beyond the Mean: A Practical Guide. Using Distributional Regression
Treatment Effects Beyond the Mean: A Practical Guide Using Distributional Regression Maike Hohberg 1, Peter Pütz 1, and Thomas Kneib 1 1 University of Goettingen October 24, 2017 Abstract This paper introduces
More informationGeneralized Linear Models. Kurt Hornik
Generalized Linear Models Kurt Hornik Motivation Assuming normality, the linear model y = Xβ + e has y = β + ε, ε N(0, σ 2 ) such that y N(μ, σ 2 ), E(y ) = μ = β. Various generalizations, including general
More informationBayesian Linear Regression
Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective
More informationLikelihood-Based Methods
Likelihood-Based Methods Handbook of Spatial Statistics, Chapter 4 Susheela Singh September 22, 2016 OVERVIEW INTRODUCTION MAXIMUM LIKELIHOOD ESTIMATION (ML) RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION (REML)
More informationLinear Regression Models. Based on Chapter 3 of Hastie, Tibshirani and Friedman
Linear Regression Models Based on Chapter 3 of Hastie, ibshirani and Friedman Linear Regression Models Here the X s might be: p f ( X = " + " 0 j= 1 X j Raw predictor variables (continuous or coded-categorical
More informationBayesian Grouped Horseshoe Regression with Application to Additive Models
Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne
More informationParameter Norm Penalties. Sargur N. Srihari
Parameter Norm Penalties Sargur N. srihari@cedar.buffalo.edu 1 Regularization Strategies 1. Parameter Norm Penalties 2. Norm Penalties as Constrained Optimization 3. Regularization and Underconstrained
More informationBiostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences
Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per
More informationPackage effectfusion
Package November 29, 2016 Title Bayesian Effect Fusion for Categorical Predictors Version 1.0 Date 2016-11-21 Author Daniela Pauger [aut, cre], Helga Wagner [aut], Gertraud Malsiner-Walli [aut] Maintainer
More informationThe OSCAR for Generalized Linear Models
Sebastian Petry & Gerhard Tutz The OSCAR for Generalized Linear Models Technical Report Number 112, 2011 Department of Statistics University of Munich http://www.stat.uni-muenchen.de The OSCAR for Generalized
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationLinear Regression. CSL603 - Fall 2017 Narayanan C Krishnan
Linear Regression CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis Regularization
More informationGradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates
DOI 1.17/s11-17-975- Gradient boosting for distributional regression: faster tuning and improved variable selection via non updates Janek Thomas 1 Andreas Mayr,3 Bernd Bischl 1 Matthias Schmid 3 Adam Smith
More informationClassification. Chapter Introduction. 6.2 The Bayes classifier
Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode
More informationLinear Regression. CSL465/603 - Fall 2016 Narayanan C Krishnan
Linear Regression CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Univariate regression Multivariate regression Probabilistic view of regression Loss functions Bias-Variance analysis
More informationLINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,
More informationAn Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models
Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong
More informationEstimating Sparse High Dimensional Linear Models using Global-Local Shrinkage
Estimating Sparse High Dimensional Linear Models using Global-Local Shrinkage Daniel F. Schmidt Centre for Biostatistics and Epidemiology The University of Melbourne Monash University May 11, 2017 Outline
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationChapter 3. Linear Models for Regression
Chapter 3. Linear Models for Regression Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Linear
More informationGeneralized Bayesian Inference with Sets of Conjugate Priors for Dealing with Prior-Data Conflict
Generalized Bayesian Inference with Sets of Conjugate Priors for Dealing with Prior-Data Conflict Gero Walter Lund University, 15.12.2015 This document is a step-by-step guide on how sets of priors can
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a
More informationCross-sectional space-time modeling using ARNN(p, n) processes
Cross-sectional space-time modeling using ARNN(p, n) processes W. Polasek K. Kakamu September, 006 Abstract We suggest a new class of cross-sectional space-time models based on local AR models and nearest
More informationLogistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20
Logistic regression 11 Nov 2010 Logistic regression (EPFL) Applied Statistics 11 Nov 2010 1 / 20 Modeling overview Want to capture important features of the relationship between a (set of) variable(s)
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample
More informationA Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models
A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los
More informationCOS513 LECTURE 8 STATISTICAL CONCEPTS
COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions
More informationA New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables
A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables Qi Tang (Joint work with Kam-Wah Tsui and Sijian Wang) Department of Statistics University of Wisconsin-Madison Feb. 8,
More information