SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

Similar documents
SCHOOL OF MATHEMATICS AND STATISTICS

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) ST3241 Categorical Data Analysis. (Semester II: )

Exam Applied Statistical Regression. Good Luck!

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours

SCHOOL OF MATHEMATICS AND STATISTICS Autumn Semester

SCHOOL OF MATHEMATICS AND STATISTICS

Generalized linear models

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches

Linear Regression Models P8111

Stat 5102 Final Exam May 14, 2015

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Generalized Linear Models 1

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46

Introduction to General and Generalized Linear Models

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

MAS361. MAS361 1 Turn Over SCHOOL OF MATHEMATICS AND STATISTICS. Medical Statistics

Poisson Regression. Gelman & Hill Chapter 6. February 6, 2017

12 Modelling Binomial Response Data

Generalized Linear Models. stat 557 Heike Hofmann

INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y

MATH 644: Regression Analysis Methods

Lecture 6 Multiple Linear Regression, cont.

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Lecture 4 Multiple linear regression

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Today. HW 1: due February 4, pm. Aspects of Design CD Chapter 2. Continue with Chapter 2 of ELM. In the News:

Statistics 203 Introduction to Regression Models and ANOVA Practice Exam

UNIVERSITY OF TORONTO Faculty of Arts and Science

ˆπ(x) = exp(ˆα + ˆβ T x) 1 + exp(ˆα + ˆβ T.

Stat 579: Generalized Linear Models and Extensions

LOGISTIC REGRESSION Joseph M. Hilbe

Generalised linear models. Response variable can take a number of different formats

Chapter 1 Statistical Inference

9 Generalized Linear Models

Outline of GLMs. Definitions

36-463/663: Multilevel & Hierarchical Models

STA216: Generalized Linear Models. Lecture 1. Review and Introduction

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B

Generalized linear models

WISE MA/PhD Programs Econometrics Instructor: Brett Graham Spring Semester, Academic Year Exam Version: A

Chapter 22: Log-linear regression for Poisson counts

Truck prices - linear model? Truck prices - log transform of the response variable. Interpreting models with log transformation

Logistic Regression - problem 6.14

UNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017

McGill University. Faculty of Science. Department of Mathematics and Statistics. Statistics Part A Comprehensive Exam Methodology Paper

Chapter 4: Generalized Linear Models-II

Lecture 12: Effect modification, and confounding in logistic regression

Regression so far... Lecture 21 - Logistic Regression. Odds. Recap of what you should know how to do... At this point we have covered: Sta102 / BME102

STAT 525 Fall Final exam. Tuesday December 14, 2010

Generalized Estimating Equations

Regression Models - Introduction

ST430 Exam 2 Solutions

Generalized Linear Models. Kurt Hornik

Lecture 2. The Simple Linear Regression Model: Matrix Approach

Generalized Linear Models

Statistics 135 Fall 2008 Final Exam

Lecture 18: Simple Linear Regression

ST430 Exam 1 with Answers

Introduction to General and Generalized Linear Models

Ma 3/103: Lecture 24 Linear Regression I: Estimation

Classification. Chapter Introduction. 6.2 The Bayes classifier

Linear Regression Model. Badr Missaoui

SB1a Applied Statistics Lectures 9-10

Generalized Linear Models Introduction

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002

Single-level Models for Binary Responses

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Exercise 5.4 Solution

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

Sections 4.1, 4.2, 4.3

Inference for Regression

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model

Introduction to the Generalized Linear Model: Logistic regression and Poisson regression

Section 4.6 Simple Linear Regression

8 Nominal and Ordinal Logistic Regression

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game.

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Mixed models in R using the lme4 package Part 7: Generalized linear mixed models

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random

Introduction to logistic regression

STAT5044: Regression and Anova. Inyoung Kim

Likelihoods for Generalized Linear Models

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

WISE International Masters

CAS MA575 Linear Models

11. Generalized Linear Models: An Introduction

Answer Key for STAT 200B HW No. 7

MSH3 Generalized linear model

Applied Statistics Preliminary Examination Theory of Linear Models August 2017

STAT 7030: Categorical Data Analysis

AMS-207: Bayesian Statistics

Transcription:

SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION Candidates may bring to the examination lecture notes and associated lecture material (but no textbooks) plus a calculator that conforms to University regulations. There are 60 marks available on the paper. Please leave this exam paper on your desk Do not remove it from the hall Registration number from U-Card (9 digits) to be completed by student MAS467 1 Turn Over

Blank MAS467 2 Continued

1 (i) A car manufacturer conducts a study to investigate the proportion of cars manufactured requiring maintenance over time. It is suggested that the proportion of cars requiring a certain amount of maintenance increases with the age of the car. A sample of 100 cars was taken at each age point and the table below summarises the proportion requiring maintenance (NB two samples of 8 year old cars are included). age (years) 1 4 8 8 10 15 maintenance (maint, in %) 5 8 20 23 30 45 The statistician at the company decided to fit a quadratic linear model to this data set, with the maintenance proportion as the response variable and age (in years) as the explanatory variable. The model is: maint = β 0 +β 1 age i +β 2 age 2 i +ǫ i, where ǫ i follows the normal distribution N(0,σ 2 ), for some variance σ 2 and ǫ i is independent of ǫ j, for i j. In answering this question you may find the following quantiles useful: t 3,0.95 = 2.35, t 3,0.995 = 5.84, t 4,0.995 = 4.60. (a) Write down the design matrix X of this model. (1 mark) (b) The model was fitted in R and gave the following output: Call: lm(formula = maint ~ age + I(age^2)) Residuals: 1 2 3 4 5 6 1.38-2.58-1.61 1.39 2.13-0.71 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 1.55292 2.85323 0.544 0.6241 age 2.00786 0.77705 2.584 0.0815. I(age^2) 0.06239 0.04686 1.331 0.2753 --- Multiple R-squared: 0.9833,Adjusted R-squared: 0.9721 F-statistic: 88.11 on 2 and 3 DF, p-value: 0.002166 Write down the maximum likelihood estimates of β 0, β 1, β 2 and σ 2. MAS467 3 Question 1 continued on next page

1 (continued) (c) Given (X T X) 1 = 1.34740071 0.314622963 0.0157196852 0.31462296 0.099935973 0.0057645042 0.01571969 0.005764504 0.0003635087, calculate the Pearson correlation coefficient between ˆβ 1 and ˆβ 2 (the maximum likelihood estimators of β 1 and β 2, respectively). (2 marks) (d) Find 99% confidence intervals for β 1 and β 2. (2 marks) (e) Based on the output above, comment on how well the quadratic model fits these data. (ii) Consider the linear model y = Xβ +ǫ, where y is a vector of n observations, X is a n p design matrix, β is a vector of p coefficients and ǫ is the usual error vector satisfying the Gauss- Markov conditions. Let the residual vector be e and define 1 n to be the column vector with n units, i.e. e = e 1 e 2. e n 1 and 1 n = 1. 1. Assume that the first column of X is 1 n and that X has rank p. (a) State the distribution of e; find the distribution of 1 T ne. (2 marks) (b) Given the general result tr [ X(X T X) 1 X T 1 n 1 T n] = n, where tr(a) denotes the trace of square matrix A, show Var(1 T ne) = 0. (c) Hence show n e i = 0. i=1 (1 mark) MAS467 4 Continued

2 (i) Suppose that data y 1,y 2,...,y n are independently generated by the exponential family of distributions [ ] y i θ b(θ) f(y i ;θ i,φ) = exp w i +c(y i,φ), φ where θ is the natural parameter, b(θ) a function of θ, which is assumed to be twice differentiable, φ the dispersion parameter, w i the weights and c(y i,φ) a function which depends on y i and φ, but not on θ. (a) Write down the log-likelihood of θ based on data y 1,...,y n. (1 mark) (b) If the canonical link is used and only the null model (comprising of the intercept only) is considered for the linear predictor, then show that the fitted values of y i are all the same and equal to ŷ i = n i=1 w iy i n i=1 w. i MAS467 5 Question 2 continued on next page

2 (continued) (ii) Consider that Y i follows a Poisson distribution with rate λ i and probability mass function P(Y i = y i ) = exp( λ i ) λy i i y i!, (1) for λ i > 0 and y i = 0,1,2,... We consider a generalised linear model with response model (1), with the canonical link and the linear predictor η i = α+βx i, where α and β are subject to estimation and x i is a known covariate, for i = 1,2,...,n. (a) Write down the log-likelihood of(α,β) T, based on datay 1,y 2,...,y n. (3 marks) (b) Write down the partial derivatives of the log-likelihood with respect toαandβ and show that the maximum likelihood estimates (MLEs) ˆα and ˆβ of α and β satisfy the following equations exp(ˆα) = n i=1 y i n i=1 exp(ˆβx i ) n exp(ˆα) x i exp(ˆβx i ) = i=1 n x i y i i=1 (3 marks) (c) If n = 2, x 1 = 1, x 2 = 1, y 1 = 1 and y 2 = 3, then show that the equations of part (b) above can be solved analytically and the MLEs are given by ˆα = 1 2 log3 and ˆβ = 1 2 log3 (5 marks) (d) Using (c) calculate the fitted values of y 1 and y 2. Show that these fitted values indicate a perfect fit. Discuss how the above model is related to the saturated model. MAS467 6 Continued

3 Data are collected on 80 individuals in a study assessing the effect of age and smoking status on lung cancer risk. The variables recorded are: X 1 - smoking status (X 1 = 1 for current smokers and X 1 = 0 for ex-smokers or non-smokers) abbreviated to smoke in the R analysis. X 2 - age. Y - lung cancer status (Y = 1 for people with lung cancer and Y = 0 for people without diagnosed lung cancer) abbreviated to cancer in the R analysis. Various generalised linear models, with a logit link, are fitted where the binary variable Y is the response and X 1 and X 2 are the explanatory variables. Let η i be the linear predictor for the i-th person. The four fitted models are: Model 1: η i = β 0 +β 1 X 1i Model 2: η i = β 0 +β 2 X 2i Model 3: η i = β 0 +β 1 X 1i +β 2 X 2i Model 4: η i = β 0 +β 1 X 1i +β 2 X 2i +β 3 X 1i X 2i In answering this question, you may find the following quantiles useful: χ 2 1,0.95 = 3.84, χ 2 2,0.95 = 5.99, χ 2 3,0.95 = 7.81, χ 2 76,0.95 = 97.35 χ 2 77,0.95 = 98.48 (i) (ii) What is the relationship between E(Y i ) and η i for the logit link? What would be the relationship if a probit link were used instead? (2 marks) The residual deviances for the four models are given in Table 1. By considering the changes in residual deviance determine the relationship between lung cancer risk, age and smoking status. For any hypothesis tests that you do, state clearly the null hypothesis and the degrees of freedom of the relevant χ 2 distribution used to perform the test. (6 marks) Model Residual deviance Model 1 59.36 Model 2 80.41 Model 3 38.58 Model 4 32.26 Table 1: Residual deviances for Models 1 to 4. (iii) Comment on the fit of the model you selected in part (ii) based on an appropriate χ 2 distribution. (2 marks) MAS467 7 Question 3 continued on next page

3 (continued) (iv) Using R, the following output is obtained for Model 4: Call: glm(formula = cancer ~ smoke * age, family = binomial) Coefficients: (Intercept) smoke age smoke:age -121.456 118.886 1.854-1.777 Degrees of Freedom: 79 Total (i.e. Null); 76 Residual Null Deviance: 104.8 Residual Deviance: 32.26 AIC: 40.26 AIC: 20.36 Using this output, calculate the odds of lung cancer for a 70 year old current smoker. (2 marks) (v) (vi) Based on the output from (iv), at what age would the estimated odds of lung cancer for smokers and non-smokers be the same? (2 marks) For Model 1 write down an expression for the log-likelihood (l) in terms of β 0, β 1 and X 1i and hence calculate 2 l 2 β in terms of η i. How might 2 l 2 0 β be 0 useful when making inferences in a generalised linear model? (6 marks) End of Question Paper MAS467 8