Using the package hypergsplines: some examples

Daniel Sabanés Bové

21st November 2013

This short vignette introduces the usage of the package hypergsplines. For more information on the methodology, see the technical report (Sabanés Bové, Held, and Kauermann, 2011). If you have any questions or critique concerning the package, write an email to me: daniel.sabanesbove@ifspm.uzh.ch. Many thanks in advance!

0.1 Pima Indians diabetes data

First, we will have a look at the Pima Indians diabetes data, which is available in the package MASS:

> library(MASS)
> pima <- rbind(Pima.tr, Pima.te)
> pima$hasDiabetes <- as.numeric(pima$type == "Yes")
> pima.nObs <- nrow(pima)

Setup

For n = 532 women of Pima Indian heritage, seven possible predictors for the presence of diabetes are recorded. We would like to investigate with an additive logistic regression which of them are relevant, and what form the statistical association has: is it a linear effect, or rather a nonlinear effect? We will use a fully automatic Bayesian approach for this; see the technical report (Sabanés Bové et al., 2011) for more details.

The first step is to define a model data object, where we input the response vector y, the matrix with the covariates, the spline type (here we use cubic O'Sullivan splines with 4 inner knots) and the exponential family (here a canonical binomial model, i.e. we want logistic regression):

> library(hypergsplines)
> modelData.pima <- with(pima,
+                        glmModelData(y=hasDiabetes,
+                                     X=
+                                     cbind(npreg,
+                                           glu,
+                                           bp,
+                                           skin,
+                                           bmi,
+                                           ped,
+                                           age),
+                                     splineType="cubic",
+                                     nKnots=4L,
+                                     family=binomial))

Stochastic model search

> chainlength.pima <- 100
> computation.pima <- getComputation(useOpenMP=FALSE,
+                                    higherOrderCorrection=FALSE)

Next, we will do a stochastic search on the (very large) model space to find good models. Here we have to decide on the model prior, and in this example we use the dependent type, which corrects for the implicit multiplicity of testing. We use a chain length of 100, which is very small but enough for illustration purposes (usually one should use at least 100 000), and save all models (in general nModels is the number of models which are saved from all visited models). Finally, we decide that we want neither OpenMP acceleration nor the higher order correction for the Laplace approximations. In order to be able to reproduce the analysis, it is advisable to set a seed for the random number generator before starting the stochastic search.

> set.seed(93)
> time.pima <-
+     system.time(models.pima <-
+                 stochSearch(modelData=modelData.pima,
+                             modelPrior="dependent",
+                             chainlength=chainlength.pima,
+                             nModels=chainlength.pima,
+                             computation=computation.pima))
----25---50---75---100
---- ---- ---- ----
====================
Number of non-identifiable model proposals: 0
Number of total cached models: 77
Number of returned models: 77

We see that the search took 11 seconds, and 77 models were found. The models list element of models.pima gives the table of the found models, with their degrees of freedom for every covariate, the log marginal likelihood, the log prior probability, the posterior probability and the number of times that the sampler encountered that model:

> head(models.pima$models)
  npreg glu bp skin bmi ped age logMargLik  logPrior       post hits
1     3   1  0    0   4   2   4  -242.2493 -14.08276 0.28527824    2
2     3   1  0    0   4   2   3  -242.4728 -14.08276 0.22813320    2
3     3   1  0    0   3   2   3  -242.5401 -14.08276 0.21329269    2
4     1   1  0    0   2   2   2  -244.2944 -14.08276 0.03690369    5
5     3   1  1    0   4   2   4  -243.6871 -14.77591 0.03386998    3
6     0   1  0    3   4   2   3  -244.6538 -14.08276 0.02576411    0

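The posterior probability column is simply the product of marginal likelihood and prior model probability, normalised over the models found by the search. As an illustrative sanity check (assuming that post is indeed normalised over all returned models), one could recompute it from the two log columns:

> ## Illustrative sanity check (assumption: "post" is exp(logMargLik + logPrior)
> ## normalised over all models returned by the stochastic search):
> with(models.pima$models,
+      {
+          logPost <- logMargLik + logPrior
+          unnorm <- exp(logPost - max(logPost))
+          round(head(unnorm / sum(unnorm)), 5)
+      })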
> map.pima <- models.pima$models[1, 1:7]

We have saved the degrees of freedom vector of the estimated MAP model in map.pima.

Inclusion probabilities

The estimated marginal inclusion probabilities (probabilities for exclusion, linear inclusion and nonlinear inclusion) for all covariates are also saved:

> round(models.pima$inclusionProbs, 2)
             npreg  glu   bp skin  bmi  ped age
not included  0.03 0.00 0.91 0.86 0.00 0.00   0
linear        0.07 0.96 0.09 0.09 0.04 0.01   0
non-linear    0.90 0.04 0.00 0.05 0.96 0.99   1
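For a quick visual summary of these probabilities one could, for example, stack them in a barplot; this is plain base R rather than a function of hypergsplines, and it assumes that inclusionProbs is a numeric matrix, as its print output suggests:

> ## Illustrative only (base R, not part of hypergsplines): stacked barplot of
> ## the exclusion / linear / non-linear probabilities per covariate.
> barplot(as.matrix(models.pima$inclusionProbs),
+         legend.text=rownames(models.pima$inclusionProbs),
+         ylab="posterior probability",
+         main="Marginal inclusion probabilities")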
Sampling model parameters

If we now want to look at the estimated covariate effects in the estimated MAP model, which has configuration (3, 1, 0, 0, 4, 2, 4), then we first need to generate parameter samples from that model:

> mcmc.pima <- getMcmc(samples=500L,
+                      burnin=100L,
+                      step=1L,
+                      nIwlsIterations=2L)
> set.seed(634)
> map.samples.pima <- glmGetSamples(config=map.pima,
+                                   modelData=modelData.pima,
+                                   mcmc=mcmc.pima,
+                                   computation=computation.pima)
----25---50---75---100
---- ---- ---- ----
====================

With the function getMcmc, we have defined a list of MCMC settings, comprising the number of samples we would like to have in the end, the length of the burn-in, the thinning step (here no thinning) and the number of IWLS iterations used (with 2 steps you get a higher acceptance rate than with 1 step; here the acceptance rate was 0.52). The result map.samples.pima has the following structure:

> str(map.samples.pima)
List of 3
 $ samples   :List of 5
  ..$ t          : num [1:500] 0.992 0.992 0.992 0.992 0.997 ...
  ..$ intercept  : num [1:500(1d)] -0.946 -0.946 -0.946 -0.946 -0.905 ...
  ..$ linearCoefs:List of 5
  .. ..$ npreg: num [1:500(1d)] -0.354 -0.354 -0.354 -0.354 1.223 ...
  .. ..$ glu  : num [1:500(1d)] 5.81 5.81 5.81 5.81 4.86 ...
  .. ..$ bmi  : num [1:500(1d)] 4.15 4.15 4.15 4.15 3.93 ...
  .. ..$ ped  : num [1:500(1d)] 4.09 4.09 4.09 4.09 2.46 ...
  .. ..$ age  : num [1:500(1d)] 3.67 3.67 3.67 3.67 2.01 ...
  ..$ splineCoefs:List of 4
  .. ..$ npreg: num [1:6, 1:500] 10.55 -8.13 11.9 -12.16 1.97 ...
  .. ..$ bmi  : num [1:6, 1:500] 12.13 -17 66.08 -3.97 -4.98 ...
  .. ..$ ped  : num [1:6, 1:500] -7.12 7.86 1.71 -6.35 -7.14 ...
  .. ..$ age  : num [1:6, 1:500] 46.16 26.28 4.27 6.25 -18.3 ...
  ..$ z          : num [1:500] 4.87 4.87 4.87 4.87 5.98 ...
 $ mcmc      :List of 2
  ..$ nAccepted      : num 309
  ..$ acceptanceRatio: num 0.515
 $ logMargLik:List of 4
  ..$ ilaEstimate    : num -242
  ..$ mcmcEstimate   : num -242
  ..$ mcmcSe         : num 0.114
  ..$ margApproxZdens:List of 2
  .. ..$ args: num [1:100] -52.0976 -29.1796 -0.8346 -0.0443 0.4362 ...
  .. ..$ dens: num [1:100] 0.00 1.41e-51 7.72e-30 3.36e-23 1.09e-18 ...

It is a list with the samples, two diagnostics for the MCMC, and estimates for the log marginal likelihood (logMargLik). The latter contains the original ILA estimate, the MCMC estimate of the log marginal likelihood with its standard error, and the coordinates of the posterior density of z = log(g).

Curve estimates

Now we can use the samples to plot the estimated effects of the MAP model covariates, with the plotCurveEstimate function. For example:

> plotCurveEstimate(covName="age",
+                   samples=map.samples.pima$samples,
+                   modelData=modelData.pima)

[Figure: estimated curve f(age) plotted against the covariate age.]

Post-processing

If we want to have estimates of the degrees of freedom on a continuous scale instead of the fixed grid (0, 1, 2, 3, 4), we can optimise the marginal likelihood with respect to the degrees of freedom of the MAP covariates:

> optim.map.pima <- postOptimize(modelData=modelData.pima,
+                                modelConfig=map.pima,
+                                computation=computation.pima)
> optim.map.pima
     npreg glu bp skin      bmi      ped      age
1 2.276746   1  0    0 3.555901 2.103911 3.653702

For that model, we could again produce samples and plot curve estimates, for example as sketched below.

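A hypothetical sketch of this follow-up step (assuming that glmGetSamples and plotCurveEstimate also accept the continuously optimised degrees-of-freedom configuration returned by postOptimize, as the text suggests):

> ## Hypothetical sketch: re-sample from the model with the optimised degrees of
> ## freedom and plot the resulting curve estimate for age (assumes that
> ## glmGetSamples accepts the configuration returned by postOptimize).
> set.seed(635)
> optim.samples.pima <- glmGetSamples(config=optim.map.pima,
+                                     modelData=modelData.pima,
+                                     mcmc=mcmc.pima,
+                                     computation=computation.pima)
> plotCurveEstimate(covName="age",
+                   samples=optim.samples.pima$samples,
+                   modelData=modelData.pima)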
Prediction samples

If we would like to get prediction samples for new covariate values, this is also very easy via the getFitSamples function. Here we get posterior predictive samples, because we input a covariate matrix which is part of the original covariate matrix used to fit the MAP model. Because the getFitSamples function produces samples on the linear predictor scale, we have to apply the appropriate response function (here the logistic distribution function plogis) to get samples on the observation scale.

> fit.samples.pima <- getFitSamples(X=modelData.pima$origX[1:10, ],
+                                   samples=map.samples.pima$samples,
+                                   modelData=modelData.pima)
> obs.samples.pima <- plogis(fit.samples.pima)

The posterior predictive means are thus:

> rowMeans(obs.samples.pima)
 [1] 0.04672958 0.63051441 0.08926869 0.69338988 0.03227731 0.29412741
 [7] 0.06373851 0.58590940 0.28167756 0.68851588

and could be compared to the actual observations:

> modelData.pima$y[1:10]
 [1] 0 1 0 0 0 1 0 0 0 1

Model averaging

Model averaging works in principle similarly to sampling from a single model, but multiple model configurations are supplied together with their respective log posterior probabilities. For example, if we wanted to average the top ten models found, we would do the following:

> average.samples.pima <-
+     with(models.pima,
+          getBmaSamples(config=models[1:10, ],
+                        logPostProbs=models$logMargLik[1:10] +
+                            models$logPrior[1:10],
+                        nSamples=500L,
+                        modelData=modelData.pima,
+                        mcmc=mcmc.pima,
+                        computation=computation.pima))
----25---50---75---100
---- ---- ---- ----
====================

Internally, the models are sampled first, and for each sampled model as many parameter samples are drawn as determined by the frequency of that model in the model average sample. On this sample object, the functions presented above can again be applied (e.g. plotCurveEstimate), as sketched below.

To be continued...
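For instance, a hypothetical sketch of such a reuse (assuming that the samples component of the model-averaged object has the same structure as the single-model samples used above):

> ## Hypothetical sketch (assumption: average.samples.pima$samples has the same
> ## structure as the single-model samples object used above):
> plotCurveEstimate(covName="age",
+                   samples=average.samples.pima$samples,
+                   modelData=modelData.pima)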

Bibliography

D. Sabanés Bové, L. Held, and G. Kauermann. Hyper-g priors for generalised additive model selection with penalised splines. Technical report, University of Zurich and Bielefeld University, 2011. URL http://arxiv.org/abs/1108.3520.