Additive Terms. Flexible Regression and Smoothing. Mikis Stasinopoulos, Bob Rigby

1 Flexible Regression and Smoothing Mikis Stasinopoulos and Bob Rigby, STORM, London Metropolitan University. XXV SIMPOSIO INTERNACIONAL DE ESTADÍSTICA, Armenia, Colombia, August 2015

2 Outline 1 Linear Additive Smoothers 2 Centiles The Dutch boys data The problem The methods The LMS model and its extensions GAMLSS functions for centile estimation 3 Conclusions

3 Linear Additive

4 Linear Additive model
η = b0 + b1 x1 + b2 x2 + b3 if(f = 2) + b4 if(f = 3) + b5 (x1 x2) + b6 x1 if(f = 2) + b7 x1 if(f = 3) + s1(x3) + s2(x4) + s3(x1) x4 + s4(x1) if(f = 2) + s5(x1) if(f = 3) + s6(x3, x5)
R code
m1 <- gamlss(y~x1+x2+f+x1:x2+f:x1+pb(x3)+pb(x4)+
             pvc(x1, by=f)+ga(~s(x3,x5)))

5 Linear Additive : CD4 data information

6 Linear Additive : CD4 data example

7 Linear Additive : polynomials
Model: h(x) = β0 + β1 x + β2 x^2 + β3 x^3 + ... + βp x^p
R code
m0 <- gamlss(y~x+I(x^2)+I(x^3))
m1 <- gamlss(y~poly(x,3))

8 Linear Additive : polynomials

9 Linear Additive : CD4 data example, polynomials degree 7

10 Linear Additive : fractional polynomials
h(x) = β0 + β1 x^p1 + β2 x^p2 + β3 x^p3
where pj, for j = 1, 2, 3, takes values in (−2, −1, −0.5, 0, 0.5, 1, 2, 3), with the value 0 interpreted as the function log(x) (rather than x^0).
R code
m0 <- gamlss(y~fp(x, 3))

11 Linear Additive : fractional polynomials

12 Linear Additive : CD4 data example, fractional polynomials with 1, 2 and 3 degrees

13 Linear Additive : Piecewise Polynomials, Linear + Linear
h(x) = β00 + β01 x + β11 (x − b) H(x > b)
R code
f1 <- gamlss(cd4~fk(age, degree=1, start=2), data=CD4)

14 Linear Additive : Piecewise Polynomials, Linear + Linear

15 Linear Additive : CD4 data example, piecewise polynomials

16 Linear Additive : Piecewise Polynomials, Quadratic + Quadratic
h(x) = β00 + β01 x + β02 x^2 + [ β10 + β11 (x − b) + β12 (x − b)^2 ] H(x > b)

17 Linear Additive : Piecewise Polynomials, Quadratic + Quadratic

18 Linear Additive : Piecewise Polynomials
h(x) = Σ_{j=0}^{D} β0j x^j + Σ_{k=1}^{K} Σ_{j=0}^{D} βkj (x − bk)^j H(x > bk)

19 Linear Additive : Piecewise Polynomials

20 Linear Additive : Piecewise Polynomials, B-splines

21 Linear Additive : Piecewise Polynomials, B-splines

22 Linear Additive : piecewise polynomial regression R code m2b <- gamlss(cd4 ~ bs(age), data = CD4) m3b <- gamlss(cd4 ~ bs(age, df = 3), data = CD4) m4b <- gamlss(cd4 ~ bs(age, df = 4), data = CD4) m5b <- gamlss(cd4 ~ bs(age, df = 5), data = CD4) m6b <- gamlss(cd4 ~ bs(age, df = 6), data = CD4) m7b <- gamlss(cd4 ~ bs(age, df = 7), data = CD4) m8b <- gamlss(cd4 ~ bs(age, df = 8), data = CD4)

23 Linear Additive : CD4 data example, piecewise polynomials 5, 7 knots

24 Linear Additive CD4 count data: P-spline fits

25 Smoothers Smoothers

26 Smoothers Smoothers: reported crimes

27 Smoothers Smoothers: local linear

28 Smoothers Smoothers: penalised
Q = (y − Zγ)' W (y − Zγ) + λ γ' G γ
with solution γ̂ = (Z'WZ + λG)^(−1) Z'Wy and fitted values
ŷ = Z (Z'WZ + λG)^(−1) Z'Wy = Sy
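The solution quoted on the slide follows from standard penalised least squares; a short derivation (not on the original slide) differentiates Q with respect to γ and sets the gradient to zero:

```latex
\frac{\partial Q}{\partial \gamma}
  = -2 Z^\top W (y - Z\gamma) + 2\lambda G \gamma = 0
\;\Longrightarrow\;
(Z^\top W Z + \lambda G)\,\hat{\gamma} = Z^\top W y
\;\Longrightarrow\;
\hat{\gamma} = (Z^\top W Z + \lambda G)^{-1} Z^\top W y .
```

Substituting γ̂ into ŷ = Zγ̂ gives the smoother (hat) matrix S = Z (Z'WZ + λG)^(−1) Z'W, which is linear in y.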

29 Smoothers Smoothers: Demos 1 demo.bsplines() 2 demo.randomwalk() 3 demo.interpolatesmo() 4 demo.histsmo() 5 demo.psplines()

30 Smoothers : P-splines pb()
R code
p1 <- gamlss(bmi~pb(age, method="ML"), data=dbbmi)
p2 <- gamlss(bmi~pb(age, method="GCV"), data=dbbmi)
p3 <- gamlss(bmi~pb(age, method="GAIC", k=2), data=dbbmi)
p4 <- gamlss(bmi~pb(age, method="GAIC", k=log(length(dbbmi$bmi))), data=dbbmi)

31 Smoothers Smooth functions: P-splines pb()

32 Smoothers : Monotonic P-splines pbm()
R code
set.seed(1334)
x <- seq(0, 1, length = 1000)
p <- 0.4
y <- sin(2 * pi * p * x) + rnorm(1000) * 0.1
m1 <- gamlss(y~pbm(x), trace=FALSE)
yy <- -y
m2 <- gamlss(yy~pbm(x, mono="down"), trace=FALSE)

33 Smoothers Smooth functions: Monotonic P-splines

34 Smoothers : cycle P-splines cy()
R code
set.seed(555)
x <- seq(0, 1, length = 1000)
y <- cos(1 * x * 2 * pi + pi / 2) + rnorm(length(x)) * 0.2
m1 <- gamlss(y~cy(x), trace=FALSE)

35 Smoothers Smooth functions:cycle P-splines

36 Smoothers : cubic splines cs()
R code
# fitting cubic splines with fixed degrees of freedom
rcs1 <- gamlss(R~cs(Fl)+cs(A), data=rent, family=GA)
term.plot(rcs1, pages=1)

37 Smoothers Smooth functions: cubic splines

38 Smoothers : ridge and lasso ri()
R code
# standardise the data
X <- with(usair, cbind(x1,x2,x3,x4,x5,x6))
sX <- scale(X)
m0 <- gamlss(y~sX, data=usair)            # least squares
m1 <- gamlss(y~ri(sX), data=usair)        # ridge
m2 <- gamlss(y~ri(sX, Lp=1), data=usair)  # lasso
m3 <- gamlss(y~ri(sX, Lp=0), data=usair)  # best subset
AIC(m0,m1,m2,m3)
##          df      AIC
## m2 5.336309 341.2492
## m0 8.000000 344.7232
## m1 5.884452 345.6097
## m3 2.838310 350.1807
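The four fits above differ only in the penalty applied to the standardised coefficients; a sketch of the common criterion (not on the original slide; p corresponds to the Lp argument of ri()):

```latex
\hat{\beta} = \arg\min_{\beta}\;
(y - X\beta)^\top (y - X\beta) + \lambda \sum_{j} |\beta_j|^{p},
\qquad
p = 2 \text{ (ridge)},\quad
p = 1 \text{ (lasso)},\quad
p \to 0 \text{ (best subset)}.
```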

39 Smoothers Smooth functions: ridge and lasso

40 Smoothers : varying coefficients pvc()
R code
m2 <- gamlss(R~pb(Fl)+pb(A)+pvc(A, by=Fl), data=rent, family=GA)
term.plot(m2, pages=1)

41 Smoothers Smooth functions: varying coefficients

42 Smoothers : varying coefficients pvc()
R code
g1 <- gamlss(R~pb(Fl)+pb(A)+pvc(Fl, by=loc), data=rent, family=GA)
term.plot(g1, pages=1)

43 Smoothers Smooth functions: varying coefficients

44 Smoothers : interface to gam()
R code
ga4 <- gam(R~s(Fl,A), method="REML", data=rent, family=Gamma(log))
gn4 <- gamlss(R~ga(~s(Fl,A), method="REML"), data=rent, family=GA)
term.plot(gn4)
vis.gam(getSmo(gn4))

45 Smoothers Smoothers: interface to gam()

46 Smoothers Smooth functions: interface to gam()

47 Smoothers : neural network interface to nnet()
R code
mr3 <- gamlss(R~nn(~Fl+A+H+B+loc, size=5, decay=0.01), data=rent, family=GA)
summary(getSmo(mr3))
## a 6-5-1 network with 41 weights
## options were - linear output units  decay=0.01
##  b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 ...
##  -0.13  -0.86   0.05   0.84   0.00  -1.43   2.07
##   b->o  h1->o  h2->o  h3->o  h4->o  h5->o
##  -0.28  -0.68  -0.24  -2.39   3.64   1.33
plot(getSmo(mr3), y.lab=expression(g[1](mu)))

48 Smoothers Smooth functions: interface to nnet()

49 Smoothers : decision trees interface to rpart()
R code
r2 <- gamlss(R ~ tr(~Fl+A+H+B+loc), sigma.fo=~tr(~Fl+A+H+B+loc),
             data=rent, family=GA, gd.tol=100, c.crit=0.1)
term.plot(r2, parameter="mu", pages=1)

50 Smoothers Smooth functions: decision trees interface to rpart()

51 Centiles The Dutch boys data The Dutch boys data
head: the head circumference of 7040 boys
age: the age in years
Source: van Buuren and Fredriks (2001)

52 Centiles The Dutch boys data The Dutch boys data

53 Centiles The problem The Dutch boys data with centiles

54 Centiles The problem The Dutch boys data: centiles

55 Centiles The methods The methods
i) the non-parametric quantile regression approach (Koenker, 2005; Koenker and Bassett, 1978)
ii) the parametric LMS approach of Cole (1988) and Cole and Green (1992), and its extensions by Rigby and Stasinopoulos (2004, 2006, 2007).

56 Centiles The LMS model and its extensions Centiles: the LMS method
Y ~ f_Y(y | μ, σ, ν, τ), where f_Y() is any distribution; Y = head circumference and x = age^ξ
μ = s(x, df_μ)
log(σ) = s(x, df_σ)
ν = s(x, df_ν)
log(τ) = s(x, df_τ)

57 Centiles The LMS model and its extensions The LMS method and extensions
Let Y be a random variable with range Y > 0, defined through the transformed variable Z given by:
Z = (1/(σν)) [ (Y/μ)^ν − 1 ],  if ν ≠ 0
Z = (1/σ) log(Y/μ),            if ν = 0.
1 if Z ~ N(0, 1) then Y ~ BCCG(μ, σ, ν) = LMS method
2 if Z ~ t_τ then Y ~ BCT(μ, σ, ν, τ) = LMST method
3 if Z ~ PE(0, 1, τ) then Y ~ BCPE(μ, σ, ν, τ) = LMSP method, adopted by WHO
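Once the model is fitted, the 100α centile curve is obtained by inverting the Z transformation at each x (a standard LMS-type back-transformation, not spelled out on the slide; z_α denotes the α-quantile of the assumed distribution of Z):

```latex
y_{\alpha}(x) =
\begin{cases}
\mu \left( 1 + \sigma \nu \, z_{\alpha} \right)^{1/\nu}, & \nu \neq 0,\\[4pt]
\mu \exp(\sigma z_{\alpha}), & \nu = 0,
\end{cases}
```

where μ, σ, ν (and τ, through z_α) are the fitted smooth functions of x.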

58 Centiles The LMS model and its extensions Centiles: estimation of the smoothing parameters
We need to select the five values df_μ, df_σ, df_ν, df_τ, ξ:
by trial and error
minimize the generalized Akaike information criterion, GAIC(k)
minimize the validation global deviance, VGD
using local selection criteria, i.e. CV, ML
Diagnostics should be used in all the above cases

59 Centiles GAMLSS functions for centile estimation GAMLSS functions for centile estimation
lms() automatic selection of an appropriate LMS method
centiles() plots centile curves against an x-variable
calibration() adjusts the sample quantiles to create centiles
centiles.com() compares centile curves for more than one object
centiles.split() as for centiles(), but splits the plot at specified values of x
centiles.fan() fan plot, as in centiles()
centiles.pred() predicts and plots centile curves for new x-values
fitted.plot() plots fitted values for all the parameters against an x-variable

60 Centiles GAMLSS functions for centile estimation The head circumference data: centiles()

61 Centiles GAMLSS functions for centile estimation The head circumference data: centiles.fan() [Fan plot: centile curves using BCT; head circumference (y, 35–65) against age (x, 0–20)]

62 Centiles GAMLSS functions for centile estimation The head circumference data: comparison of the centiles, centiles.com()

63 Centiles GAMLSS functions for centile estimation The head circumference data: centiles.split()

64 Centiles GAMLSS functions for centile estimation The head circumference data: fitted.plot()

65 Centiles GAMLSS functions for centile estimation The head circumference data: The Q statistics
Age range               Z1     Z2     Z3     Z4  Agostino     N
0.02500 to 0.24499    0.09   0.85   1.21   0.02      1.47   477
0.24499 to 0.75499   -0.48  -1.18  -0.58  -0.25      0.40   473
0.75499 to 1.255      0.88  -2.27  -0.24  -1.37      1.94   467
1.255 to 1.895        0.01   2.62   1.96   2.76     11.47   460
1.895 to 2.945       -0.29   0.72  -1.70  -0.01      2.90   473
2.945 to 5.535       -0.35  -0.40  -0.72   0.88      1.30   466
...
15.835 to 17.185      0.45   0.00  -0.59  -0.08      0.35   466
17.185 to 18.675      0.09   0.32   1.01   1.48      3.23   472
18.675 to 21.685     -0.54  -0.65  -0.88  -0.42      0.96   467
TOTAL Q stats         2.85  16.90  15.43  18.04     33.46  7040
df for Q stats        1.95  12.01  12.35  13.00     25.35     0
p-val for Q stats     0.23   0.15   0.24   0.16      0.13     0

66 Centiles GAMLSS functions for centile estimation The head circumference data: Q.stats() [Visual plot of the Q statistics Z1–Z4 for the 15 age intervals from 0.025 to 21.685]

67 Centiles GAMLSS functions for centile estimation Comparing GAMLSS and QR: fitted centiles

68 Conclusions Conclusions: GAMLSS
Unified framework for univariate regression-type models
Allows any distribution for the response variable Y
Models all the parameters of the distribution of Y
Allows a variety of additive terms in the models for the distribution parameters
The fitting algorithm is modular, so different components can be added easily
Models can be fitted easily and fast
Exploratory tool for finding an appropriate set of models
Deals with overdispersion, skewness and kurtosis

69 Conclusions End For more information see www.gamlss.org