Generalized Additive Models

Similar documents
Solutions to obligatorisk oppgave 2, STK2100

Classification. Chapter Introduction. 6.2 The Bayes classifier

Linear Regression Models P8111

Logistic Regressions. Stat 430

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.

Exercise 5.4 Solution

Generalized linear models

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

Inversion Base Height. Daggot Pressure Gradient Visibility (miles)

Outline of GLMs. Definitions

Chapter 10. Additive Models, Trees, and Related Methods

Generalized Additive Models (GAMs)

Logistic Regression - problem 6.14

How to deal with non-linear count data? Macro-invertebrates in wetlands

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

3 Joint Distributions 71

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Generalized Linear Models. stat 557 Heike Hofmann

UNIVERSITY OF TORONTO Faculty of Arts and Science

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Generalized Linear Models (GLZ)

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Subject-specific observed profiles of log(fev1) vs age First 50 subjects in Six Cities Study

Generalized Linear Models 1

Cohen s s Kappa and Log-linear Models

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Generalized Linear Models

R code and output of examples in text. Contents. De Jong and Heller GLMs for Insurance Data R code and output. 1 Poisson regression 2

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

Introduction to Linear regression analysis. Part 2. Model comparisons

STA 450/4000 S: January

Exam details. Final Review Session. Things to Review

A Non-parametric bootstrap for multilevel models

Logistic Regression 21/05

A strategy for modelling count data which may have extra zeros

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Recap. HW due Thursday by 5 pm Next HW coming on Thursday Logistic regression: Pr(G = k X) linear on the logit scale Linear discriminant analysis:

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

Modelling Survival Data using Generalized Additive Models with Flexible Link

Simultaneous Confidence Bands for the Coefficient Function in Functional Regression

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model

Checking, Selecting & Predicting with GAMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

Subject CS1 Actuarial Statistics 1 Core Principles

7.1, 7.3, 7.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1/ 31

ssh tap sas913, sas

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game.

Consider fitting a model using ordinary least squares (OLS) regression:

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46

Introduction to Regression

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

STA102 Class Notes Chapter Logistic Regression

Wrap-up. The General Linear Model is a special case of the Generalized Linear Model. Consequently, we can carry out any GLM as a GzLM.

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

Statistics of Contingency Tables - Extension to I x J. stat 557 Heike Hofmann

Final Exam. Name: Solution:

Likelihood Ratio Tests. that Certain Variance Components Are Zero. Ciprian M. Crainiceanu. Department of Statistical Science

STA6938-Logistic Regression Model

Simple logistic regression

Introduction to Regression

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

Administration. Homework 1 on web page, due Feb 11 NSERC summer undergraduate award applications due Feb 5 Some helpful books

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

Contents. Acknowledgments. xix

Age 55 (x = 1) Age < 55 (x = 0)

COMPARING PARAMETRIC AND SEMIPARAMETRIC ERROR CORRECTION MODELS FOR ESTIMATION OF LONG RUN EQUILIBRIUM BETWEEN EXPORTS AND IMPORTS

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

Estimating prediction error in mixed models

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

Sections 4.1, 4.2, 4.3

On the Behavior of Marginal and Conditional Akaike Information Criteria in Linear Mixed Models

Mixed models in R using the lme4 package Part 7: Generalized linear mixed models

Outline. Topic 20 - Diagnostics and Remedies. Residuals. Overview. Diagnostics Plots Residual checks Formal Tests. STAT Fall 2013

Model checking overview. Checking & Selecting GAMs. Residual checking. Distribution checking

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Scatter plot of data from the study. Linear Regression

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20

Model comparison and selection

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response)

Outline. Mixed models in R using the lme4 package Part 5: Generalized linear mixed models. Parts of LMMs carried over to GLMMs

Model selection and comparison

Introduction to Regression

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course.

Experimental Design and Statistical Methods. Workshop LOGISTIC REGRESSION. Jesús Piedrafita Arilla.

Data Mining Stat 588

BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours

Introduction to Regression

ECONOMETRICS II (ECO 2401S) University of Toronto. Department of Economics. Spring 2013 Instructor: Victor Aguirregabiria

Introducing Generalized Linear Models: Logistic Regression

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Transcription:

Generalized Additive Models

The Model The GLM is: g( µ) = ß 0 + ß 1 x 1 + ß 2 x 2 +... + ß k x k The generalization to the GAM is: g(µ) = ß 0 + f 1 (x 1 ) + f 2 (x 2 ) +... + f k (x k ) where the functions f i (x i ) are arbitrary functions defined by the data.

Simple Additive Models Y = ß 0 + f 1 (x 1 ) + f 2 (x 2 ) +... + f k (x k ) + e e is independent of x i and E(e) = 0 Y is continuous var(y) = sigma 2, for all observations (homoscedasticity)

Assumptions Statistical independence of observations Variance function is specified correctly Correct link function Specific observations do not influence fit

Smoothers Available in Splus Loess (locally weighted regression) Splines (regression, cubic, natural)

Properties of Smoothers Most smoothers are local, in the sense that they use adjacent data points (neighbourhoods) to estimate "predicted" values for each data point One must balance bias with precision The choice of how much smoothing to do is key in this decision process

This is no different than when you specify different parametric forms for an explanatory variable, in that one is trying to specify the correct functional form. NB: A linear variable is equivalent to an infinitely smoothed function!

The functions f i (x k ) can be very general and can include: # splines or LOESS» interactions such as lo(x 1,x 2 ) can be fit and these produce smooth two-dimensional surfaces in three dimensions # parametric forms, such as E(Y) = ß 0 + ßx, where x can be continuous, ordinal, nominal

Estimation in GAMs Estimation is through a combination of backfitting and iteratively reweighted least squares The method is not maximum likelihood but is based on similar types of principals. The functions f i and ß k are determined empirically according to the data and the assumed model. The deviance is calculated from the model, just as in GLMs.

Fitting Usual linear model is fit with least squares and there is an exact solution (no iterations). Backfitting algorithm used for GAMs, and it requires >1 iteration.

Backfitting Algorithm Y = ß 0 + f 1 (x 1 ) + f 2 (x 2 ) +... + f k (x k ) + e 1) Set ß 0 = mean(y) 2) Initialize f j (x j ) = f j (x j ) 0 3) Iterate and cycle over the k variables f j = S j (Y - ß 0 - Σ k=j f k x j ) until the f j do not change

Goodness-of-fit Deviance is defined in the same way as in GLMs. The comparison of nested GAMs by substracting deviances does not necessarily follow a χ 2 distribution, even asymptotically. However, one can use the chi-square distribution as sort of a reference for assessing fits.

However, approximately, E(Deviance) ~ residual df * phi. χ For nested models, E(D 1, D 2 ) ~ df 1 - df 2, implying that the 2 distribution on df 1 - df 2 degrees of freedom can be used. In practice, we use instead a penalized version of the deviance for comparing both nested and non-nested models. The penalty is proportional to the number of df used.

Aikaike Information Criterion AIC = Deviance + 2 * df model * phi This statistic accounts for the number of degrees of freedom used by the smoothers. Usually, a lower AIC implies that the model fits better than another. There is no specific statistical test associated with comparing AICs. NB: must have the same number of observations in the two models.

Confidence Intervals Pointwise standard errors of the functions f j are also calculated. Calculation of the confidence interval between two values of x is more difficult. For example, the 95% CI for the odds ratio between x=x 1 and x=x 2 in a model logit(y) = ß 0 + f(x) must be obtained using the bootstrap.

Generalized Additive Model Age at menopause > preg4.gam.1 _ gam(yvar~lo(age)+lo(eductn)+lo(age.fst)+lo(mg13.fir)+lo(m18to21.) +FAM.HIST+MG34.ALC+MG23.MAL,preg4,family=binomial,subset=(AGE>50), Age at menarche na.action=na.omit,x=t,y=t) > summary(preg4.gam.1) Call: gam(formula = YVAR ~ lo(age) + lo(eductn) + lo(age.fst) +. Deviance Residuals: Min 1Q Median 3Q Max -2.192961-0.9900211 0.4602939 0.9577022 2.138613 (Dispersion Parameter for Binomial family taken to be 1 ) Null Deviance: 1007.108 on 726 degrees of freedom Residual Deviance: 837.6898 on 704.0405 degrees of freedom Number of Local Scoring Iterations: 4 DF for Terms and Chi-squares for Nonparametric Effects Df Npar Df Npar Chisq P(Chi) (Intercept) 1 lo(age) 1 2.4 2.53375 0.3475057 lo(eductn) 1 2.3 4.85548 0.1171361 lo(age.fst) 1 2.9 4.53878 0.1975979 lo(mg13.fir) 1 2.8 2.56048 0.4273571 lo(m18to21.) 1 2.6 26.30268 0.0000045 FAM.HIST 1 MG34.ALC 2 MG23.MAL 1 Npar Chisq : Score test to evaluate the nonlinear contribution to the nonparametric functions.

Adjusted GAM Model

Age at Menarche Ordinate ~ log(odds scale) log(0) = α + f(x) at x = 10, f(x) ~ 0.5 at x = 17, f(x) ~ 0.1 f(x) ~ -0.6 => odds ratio ~ e -0.6 ~ 0.55 Linear Model : log(0) = α + βx β = -0.84, x = 7 OR ~ e -0.84*7 ~ 0.55

Age at 1 st birth Loess (span = 50%)

Age at Menopause

Education (in years)

Age at 1 st pregnancy Age at Menarche Age at Menopause Previous Breast Disease