Applied linear statistical models: An overview

Gunnar Stefansson
Dept. of Mathematics, Univ. Iceland
August 27, 2010

Some basics
Course: Applied linear statistical models.
This lecture: a description of the material to be covered during the course and of the framework (where the material is, etc.).
Course material:
- Content and quizzes: http://tutor-web.net and http://vr3pc109.rhi.hi.is/
- The book
- Other handouts (weekly homework, 3-4 projects (40%), copies from Scheffe, etc.)
- Instructor's home page: http://www.hi.is/ gunnar (data sets, links, this file)
- The Univ. Iceland web system (announcements/email)
Important: Read your e-mail for announcements!
Warning: The tutor-web material is under development!

The homework
Weekly homework, on handouts (possibly within tutorials):
- on-line quizzes
- exercises from the book
- other exercises
- modify Wikipedia
- write examples etc. for tutor-web
3-4 projects (30%).
First handout: tutor-web, STAT 645 smplreg (do not print yet).
R material: tutor-web, STAT 150 R (do not print yet).
Math material: tutor-web, MATH 612 ccas.

Fitting a line to data: STATS 310 smplreg
We have n pairs $(x_1, y_1), \ldots, (x_n, y_n)$ and want the best-fitting line.
Estimation (OLS): minimize $S(\alpha, \beta) = \sum_i (y_i - (\alpha + \beta x_i))^2$ over $\alpha, \beta$ to get
$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\bar{x}.$$
Example: cod growth vs capelin biomass.
[Figure: cod growth plotted against capelin biomass.]
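
A minimal R sketch of these formulas. The numbers below are made up for illustration; the cod/capelin data are not reproduced here.

    x <- c(1100, 1500, 2000, 2600, 3100, 3700, 4000)   # stand-in for capelin biomass
    y <- c(420, 510, 640, 700, 810, 880, 950)          # stand-in for cod growth
    b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
    a <- mean(y) - b * mean(x)                                       # intercept
    c(a = a, b = b)
    coef(lm(y ~ x))    # lm() gives the same least-squares estimates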

A formal statistical model
Fixed numbers: $x_i$. Random variables: $Y_i \sim n(\alpha + \beta x_i, \sigma^2)$, or $Y_i = \alpha + \beta x_i + \epsilon_i$ with $\epsilon_i \sim n(0, \sigma^2)$ independent and identically distributed (i.i.d.).
The data: $y_i = \alpha + \beta x_i + e_i$.
So the $y_i$-values are outcomes of the random variables $Y_i$, but the $x_i$-values are constants. Usual assumptions.
[Figure: scatterplot of y against x.]
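
A short R sketch of this distinction, with arbitrarily chosen parameter values: the $x_i$ are fixed, and each run of the simulation produces new outcomes $y_i$ of the random $Y_i$.

    set.seed(1)
    alpha <- 1; beta <- 0.8; sigma <- 0.5
    x <- seq(0, 5, length.out = 20)              # fixed constants, no error
    y <- alpha + beta * x + rnorm(20, 0, sigma)  # one realisation of Y_1, ..., Y_n
    plot(x, y); abline(alpha, beta)              # data scattered around the true line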

The assumptions
The assumptions (which may all fail) are:
- x-values are constants (no error)
- linearity
- constant variance
- Gaussian
- independence
Will test these and modify accordingly.
Examples: fish growth (nonlinear); bird counts (non-normal); fuel consumption (heteroscedastic); stock prices (autocorrelated).

Some distributions
Univariate Gaussian density:
$$f(y) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(y-\mu)^2}{2\sigma^2}}, \quad y \in \mathbb{R}.$$
Product of univariate Gaussian densities, for i.i.d. Gaussians:
$$f(y_1, \ldots, y_n) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(y_i-\mu)^2}{2\sigma^2}} = \frac{1}{(2\pi)^{n/2}\sigma^n} e^{-\frac{\sum_i (y_i-\mu)^2}{2\sigma^2}}, \quad y_1, \ldots, y_n \in \mathbb{R}.$$
The general multivariate Gaussian case:
$$f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}, \quad \mathbf{x} \in \mathbb{R}^n$$
(a density on $\mathbb{R}^n$ if $\Sigma > 0$ and $\boldsymbol{\mu} \in \mathbb{R}^n$).
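
A small R check of the multivariate density, coded directly from the formula above; the mean vector and covariance matrix are arbitrary choices for illustration.

    dmvn <- function(x, mu, Sigma) {
      n <- length(mu)
      d <- x - mu
      drop(exp(-0.5 * t(d) %*% solve(Sigma) %*% d)) / ((2 * pi)^(n / 2) * sqrt(det(Sigma)))
    }
    mu    <- c(0, 0)
    Sigma <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
    dmvn(c(1, 1), mu, Sigma)
    # With n = 1 and Sigma = sigma^2 this reduces to the univariate density dnorm(y, mu, sigma)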

Linear combinations of Gaussian random variables
Linear combinations ($aX + bY$, $\mathbf{c}'\mathbf{Y}$, etc.) of Gaussian random variables (independent or jointly multivariate Gaussian) are also Gaussian.
$$E[aX + bY] = a\mu_X + b\mu_Y$$
$$V[aX + bY] = a^2\sigma_X^2 + b^2\sigma_Y^2 \quad \text{if independent}$$
$$V[aX + bY] = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\mathrm{Cov}(X, Y) \quad \text{in general}$$
$$E[\mathbf{c}'\mathbf{Y}] = \mathbf{c}'\boldsymbol{\mu}, \quad V[\mathbf{c}'\mathbf{Y}] = \mathbf{c}'\Sigma_Y\mathbf{c}, \quad V[A\mathbf{Y}] = A\Sigma_Y A'.$$
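
A quick simulation check of $V[\mathbf{c}'\mathbf{Y}] = \mathbf{c}'\Sigma_Y\mathbf{c}$ in R; the covariance matrix and the vector c are arbitrary.

    set.seed(2)
    Sigma <- matrix(c(2, 1, 1, 3), 2, 2)
    cvec  <- c(1, -2)
    U <- chol(Sigma)                              # Sigma = U'U
    Y <- matrix(rnorm(2 * 1e5), ncol = 2) %*% U   # rows are draws from n(0, Sigma)
    var(drop(Y %*% cvec))                         # simulated variance of c'Y
    t(cvec) %*% Sigma %*% cvec                    # theoretical value c' Sigma c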

Distributions of estimates
The number $b$ should be viewed as the outcome of the random variable
$$\hat{\beta} = \frac{\sum_i (x_i - \bar{x}) Y_i}{\sum_i (x_i - \bar{x})^2}$$
(note the rewrite from the earlier formula). So $\hat{\beta}$ is a linear combination of the Gaussian $Y_1, \ldots, Y_n$, and hence $\hat{\beta}$ is Gaussian. We will show that $E[\hat{\beta}] = \beta$, find $V[\hat{\beta}] = \sigma^2_{\hat{\beta}} = \ldots$ and that
$$\frac{\hat{\beta} - \beta}{\hat{\sigma}_{\hat{\beta}}} \sim t_{n-2}.$$
Inference: hypothesis testing and confidence intervals. Example: test whether the slope is zero ($H_0: \beta = 0$), i.e. whether there is some relationship between x and y.
Never be content with a mathematical or conceptual model: insist that it is justified by data!
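
In R, lm() and summary() carry out exactly this t-test of the slope; a sketch on simulated data (the true parameters below are arbitrary).

    set.seed(3)
    x <- runif(30, 0, 10)
    y <- 2 + 0.5 * x + rnorm(30)
    fm <- lm(y ~ x)
    summary(fm)$coefficients   # beta-hat, its standard error, t = beta-hat/se on n - 2 df, p-value
    confint(fm)                # confidence intervals for alpha and beta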

Goodness of fit and diagnostics: STATS 310 xxx
Verifying the SLR assumptions: we will derive tests for nonlinearity (lack-of-fit), normality, homoscedasticity, independence, outliers, etc. This is (mainly) based on the residuals $e_i = y_i - \hat{y}_i$ or variations thereof.
Some concepts: standardized residuals, studentized residuals, deleted residuals, etc.
Simplest diagnostics: plot the residuals in all possible ways.
Note: it is never enough to fit a model or use it for predictions. One must always also verify whether the model is adequate.
[Figure: data with fitted line, and resid(fm) plotted against x.]
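
A sketch of the simplest residual diagnostics in R, reusing a simulated fit (the data are not from the course examples); resid(fm) is the plot shown on the slide.

    set.seed(3)
    x  <- runif(30, 0, 10); y <- 2 + 0.5 * x + rnorm(30)
    fm <- lm(y ~ x)
    plot(fitted(fm), resid(fm))              # curvature or funnel shapes indicate trouble
    qqnorm(resid(fm)); qqline(resid(fm))     # normality check
    rstandard(fm)                            # standardized residuals
    rstudent(fm)                             # studentized (deleted) residuals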

SLR in matrix form
$\mathbf{y} \in \mathbb{R}^n$ is the vector of measurements and
$$X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$$
is the X-matrix. Minimizing $\sum_i (y_i - (\alpha + \beta x_i))^2$ is equivalent to finding
$$\boldsymbol{\beta} = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$
to minimize $\|\mathbf{y} - X\boldsymbol{\beta}\|^2$. Notation: $\mathbf{y} = X\boldsymbol{\beta} + \mathbf{e}$.
Point estimate as a projection.

Multiple regression: STATS 310 mulreg
The model: $\mathbf{y} = X\boldsymbol{\beta} + \mathbf{e}$, where $X$ is an $n \times p$ matrix.
Example: the interaction model $y_i = \alpha + \beta x_i + \gamma w_i + \delta x_i w_i$. Defining $x_{i1} = 1$, $x_{i2} = x_i$, $x_{i3} = w_i$, $x_{i4} = x_i w_i$, this becomes a multiple linear model.
More examples: estimate a single intercept and many slopes; test whether multiple lines are all parallel; ...
Used in all fields of biology (fisheries, genetics, ...), economics, psychology, sociology, ...
Need to develop point estimates, methods of validation and testing.
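
A sketch of the interaction example in R with simulated x and w (true coefficients chosen arbitrarily); lm() builds the four columns $x_{i1}, \ldots, x_{i4}$ automatically.

    set.seed(4)
    x <- runif(30, 0, 10)
    w <- rbinom(30, 1, 0.5)
    y <- 1 + 0.5 * x + 2 * w - 0.3 * x * w + rnorm(30)
    fm <- lm(y ~ x * w)          # expands to 1 + x + w + x:w
    head(model.matrix(fm), 3)    # the X-matrix: columns (Intercept), x, w, x:w
    coef(fm)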

Matrix solution
The point estimate is
$$\hat{\boldsymbol{\beta}} = (X'X)^{-1}X'\mathbf{Y},$$
which is always unbiased (if the mean of the $Y$-s is correct),
$$E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta},$$
has variance-covariance matrix
$$V[\hat{\boldsymbol{\beta}}] = \sigma^2 (X'X)^{-1}$$
(if the variance assumptions are correct), and is multivariate Gaussian (if the $Y$-values are Gaussian).
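
A sketch of these matrix formulas in R, computed directly on simulated data and compared with lm().

    set.seed(3)
    x <- runif(30, 0, 10); y <- 2 + 0.5 * x + rnorm(30)
    X <- cbind(1, x)                         # the n x p design matrix
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)
    drop(beta_hat)
    coef(lm(y ~ x))                          # same values
    s2 <- sum((y - X %*% beta_hat)^2) / (nrow(X) - ncol(X))
    s2 * solve(t(X) %*% X)                   # estimated variance-covariance matrix of beta-hat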

Testing
Statistical tests (of model reductions) can use projections in $\mathbb{R}^n$, where orthonormal bases give the SSEs and degrees of freedom:
$$\mathbf{y} = \hat{\zeta}_1 \mathbf{u}_1 + \ldots + \hat{\zeta}_q \mathbf{u}_q + \hat{\zeta}_{q+1} \mathbf{u}_{q+1} + \ldots + \hat{\zeta}_r \mathbf{u}_r + \hat{\zeta}_{r+1} \mathbf{u}_{r+1} + \ldots + \hat{\zeta}_n \mathbf{u}_n$$
$$\mathrm{SSE}(F) = \|\mathbf{y} - X\hat{\boldsymbol{\beta}}\|^2 = \sum_{i=r+1}^{n} \hat{\zeta}_i^2$$
$$\mathrm{SSE}(R) - \mathrm{SSE}(F) = \|Z\hat{\boldsymbol{\gamma}} - X\hat{\boldsymbol{\beta}}\|^2 = \sum_{i=q+1}^{r} \hat{\zeta}_i^2$$
$$\mathrm{SSE}(R) = \|\mathbf{y} - Z\hat{\boldsymbol{\gamma}}\|^2 = \sum_{i=q+1}^{n} \hat{\zeta}_i^2$$
(here the reduced model $Z$ spans the first $q$ basis vectors and the full model $X$ the first $r$).
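
In R, anova() on a reduced and a full model carries out this comparison; a sketch using the simulated interaction example from above.

    set.seed(4)
    x <- runif(30, 0, 10); w <- rbinom(30, 1, 0.5)
    y <- 1 + 0.5 * x + 2 * w - 0.3 * x * w + rnorm(30)
    reduced <- lm(y ~ x + w)      # the reduced model (Z)
    full    <- lm(y ~ x * w)      # the full model (X)
    anova(reduced, full)          # F = ((SSE(R) - SSE(F)) / df difference) / (SSE(F) / (n - r))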

Estimable functions: STATS 310 xxx
In the one-way layout not all parameter combinations can be estimated. Those linear combinations of parameters, $\psi = \mathbf{c}'\boldsymbol{\beta}$, which have an unbiased estimate that is a linear combination of the y-values are termed estimable functions.
$$\underset{n \times 1}{\mathbf{y}} \sim n(\underset{n \times p}{X}\,\underset{p \times 1}{\boldsymbol{\beta}},\ \sigma^2 \underset{n \times n}{I})$$
$$\psi_i = \mathbf{c}_i'\boldsymbol{\beta}; \quad C = (\mathbf{c}_1, \ldots, \mathbf{c}_q)'; \quad \boldsymbol{\psi} = C\boldsymbol{\beta}$$
$$\hat{\boldsymbol{\psi}} = A\mathbf{y} = C\hat{\boldsymbol{\beta}} \sim n(C\boldsymbol{\beta}, \sigma^2 AA')$$
Theorem: $\hat{\boldsymbol{\psi}} \sim n(\boldsymbol{\psi}, \Sigma_{\hat{\psi}})$, $\frac{\|\mathbf{y} - X\hat{\boldsymbol{\beta}}\|^2}{\sigma^2} \sim \chi^2_{n-r}$, and these two quantities are independent.

Model selection
Many x-variables? We need to choose a subset for inclusion. Look at all subsets? How should the quality of fit be measured? Forward and backward stepwise selection.
Example: What drives recruitment to fish stocks? We have a series of e.g. 50 years of data, but several dozen possible x-variables.
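
A sketch of stepwise selection in R using step(); the predictors below are simulated, not the fish-stock data.

    set.seed(5)
    dat <- as.data.frame(matrix(rnorm(50 * 6), ncol = 6))
    names(dat) <- paste0("x", 1:6)
    dat$y <- 1 + 2 * dat$x1 - dat$x3 + rnorm(50)
    full <- lm(y ~ ., data = dat)
    step(full, direction = "both")   # AIC-based forward/backward search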

Model validation
Validate mainly using various residuals. Investigate the assumptions, but now also investigate influential observations: DFFITS etc. Hat matrix $H$, particularly the leverage values $h_{ii}$:
$$H = X(X'X)^{-1}X'$$
projects $\mathbf{y}$ to $\hat{\mathbf{y}}$. Investigate collinearity.
[Figure: examples of problem cases.]
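
A sketch of these quantities in R on simulated data; hatvalues(), dffits() and kappa() are base-R functions, and the 2*mean(h) cutoff is only a common rule of thumb.

    set.seed(5)
    dat <- as.data.frame(matrix(rnorm(50 * 3), ncol = 3)); names(dat) <- c("x1", "x2", "x3")
    dat$y <- 1 + 2 * dat$x1 + rnorm(50)
    fm <- lm(y ~ ., data = dat)
    h  <- hatvalues(fm)                       # leverages: diagonal of H = X (X'X)^{-1} X'
    which(h > 2 * mean(h))                    # flag points with large leverage
    dffits(fm)                                # influence of each point on its own fitted value
    kappa(model.matrix(fm))                   # condition number, a rough collinearity check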

Analysis of variance: STATS 310 anova
One-way layout: $y_{ij} = \mu + \alpha_i + e_{ij}$. This is a particular linear model (a multiple model), but with special properties! We will develop expressions for solutions, sums of squares, etc.
Example: breast feeding and IQ.

Group   n     mean IQ   s      Description
I       90    92.8      15.2   not breast fed
IIa     17    94.8      19.0   failed breast feeding
IIb     193   103.7     15.3   breast fed

More examples: fertilizers and crops; differential gene expression; ...
Two-way layout, etc.
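
A one-way layout fitted as a linear model in R; the numbers below are made up for illustration and are not the breast-feeding data.

    grp <- factor(rep(c("I", "IIa", "IIb"), each = 5))
    iq  <- c(95, 88, 101, 90, 93,  96, 99, 91, 94, 97,  104, 108, 99, 102, 106)
    fm  <- lm(iq ~ grp)
    anova(fm)                 # one-way ANOVA table
    tapply(iq, grp, mean)     # group means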

Multiple comparisons procedures: Scheffe
Theorem: Under the above assumptions and definitions,
$$\frac{(\hat{\boldsymbol{\psi}} - \boldsymbol{\psi})' B^{-1} (\hat{\boldsymbol{\psi}} - \boldsymbol{\psi}) / q}{\|\mathbf{y} - X\hat{\boldsymbol{\beta}}\|^2 / (n - r)} \sim F_{q, n-r}$$
$$P_{\boldsymbol{\beta}}\left[ (\hat{\boldsymbol{\psi}} - \boldsymbol{\psi})' B^{-1} (\hat{\boldsymbol{\psi}} - \boldsymbol{\psi}) \le q s^2 F_{q, n-r, 1-\alpha} \right] = 1 - \alpha$$
If $\Psi = \{\psi = k_1 \psi_1 + \ldots + k_q \psi_q\}$, then
$$P\left[ \hat{\psi} - \sqrt{qF}\,\hat{\sigma}_{\hat{\psi}} \le \psi \le \hat{\psi} + \sqrt{qF}\,\hat{\sigma}_{\hat{\psi}} \ \ \forall\, \psi \in \Psi \right] = 1 - \alpha$$
(with $F = F_{q, n-r, 1-\alpha}$), and we are therefore allowed to search among all estimable functions within the set to find significant effects. Can now do legal data-snooping!
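
A sketch of a Scheffe-type simultaneous interval for one contrast in a one-way layout, reusing the made-up data from the ANOVA sketch above.

    grp <- factor(rep(c("I", "IIa", "IIb"), each = 5))
    iq  <- c(95, 88, 101, 90, 93,  96, 99, 91, 94, 97,  104, 108, 99, 102, 106)
    fm  <- lm(iq ~ grp)
    ni    <- table(grp); means <- tapply(iq, grp, mean)
    s2    <- sum(resid(fm)^2) / fm$df.residual
    cvec  <- c(1, -1, 0)                                # contrast: group I minus group IIa
    psi   <- sum(cvec * means)
    se    <- sqrt(s2 * sum(cvec^2 / ni))
    q     <- nlevels(grp) - 1
    crit  <- sqrt(q * qf(0.95, q, fm$df.residual))      # Scheffe multiplier sqrt(q F)
    psi + c(-1, 1) * crit * se                          # holds simultaneously over all contrasts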

Generalizations and other topics
- The generalized linear model (GLM) drops the assumption of normality and defines a link function (a sketch follows below).
- The generalized additive model (GAM) allows smoothing functions in place of linear combinations.
- Random effects models
- Nonlinear models
- Correlated observations
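
For example, a Poisson GLM with a log link replaces the Gaussian assumption for count data; a minimal sketch in R on simulated counts (not a course dataset).

    set.seed(6)
    x <- runif(30, 0, 10)
    counts <- rpois(30, lambda = exp(0.2 + 0.1 * x))    # count data, not Gaussian
    gfm <- glm(counts ~ x, family = poisson(link = "log"))
    summary(gfm)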