Applied linear statistical models: An overview
Gunnar Stefansson, Dept. of Mathematics, Univ. Iceland
August 27, 2010
Outline
Some basics

Course: Applied linear statistical models.
This lecture: a description of the material to be covered during the course and of the framework (where the material is, etc.).

Course material:
- Content + quizzes: http://tutor-web.net and http://vr3pc109.rhi.hi.is/
- The book
- Other handouts (weekly homework, 3-4 projects (40%), copies from Scheffe, etc.)
- Instructor's home page: http://www.hi.is/~gunnar (data sets, links, this file)
- The Univ. Icel. web system (announcements/email)

Important: Read your e-mail for announcements!
Warning: The tutor-web material is under development!
The homework

Weekly homework, on handouts (possibly within tutorials):
- on-line quizzes from the book
- other exercises
- modify Wikipedia
- write examples etc. for tutor-web
3-4 projects (30%)
First handout: tutor-web, STAT 645 smplreg (do not print yet)
R stuff: tutor-web, STAT 150 R (do not print yet)
Math stuff: tutor-web, MATH 612 ccas
Fitting a line to data: STATS 310 smplreg

Have n pairs (x_1, y_1), ..., (x_n, y_n) and want the best-fitting line.
Estimation (OLS): S(α, β) = Σ_i (y_i − (α + βx_i))²
Minimize S over α and β to get

b = Σ_i (x_i − x̄)(y_i − ȳ) / Σ_i (x_i − x̄)²,  a = ȳ − b x̄

Example: cod growth vs capelin biomass.
[Figure: cod growth plotted against capelin biomass]
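The two OLS formulas can be computed directly; a minimal NumPy sketch with made-up (x, y) pairs (illustrative only, not the cod/capelin data):

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

xbar, ybar = x.mean(), y.mean()
# Slope: b = sum((x_i - xbar)(y_i - ybar)) / sum((x_i - xbar)^2)
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
# Intercept: a = ybar - b * xbar
a = ybar - b * xbar
```

The same fit is returned by `np.polyfit(x, y, 1)`, which minimizes the same sum of squares.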
A formal statistical model

Fixed numbers: x_i.
Random variables: Y_i ~ n(α + βx_i, σ²), or equivalently Y_i = α + βx_i + ε_i with ε_i ~ n(0, σ²) independent and identically distributed (i.i.d.).
The data: y_i = α + βx_i + e_i.
So the y_i-values are outcomes of the random variables Y_i, but the x_i-values are constants. Usual assumptions.
[Figure: simulated data scattered around the line y = α + βx]
The assumptions

The assumptions (which may all fail) are:
- x-values are constants (no error)
- linearity
- constant variance
- Gaussian errors
- independence
Will test these and modify accordingly.
Examples: fish growth (nonlinear); bird counts (non-normal); fuel consumption (heteroscedastic); stock prices (autocorrelated).
Some distributions

Univariate Gaussian density:
f(y) = (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)},  y ∈ R

Product of univariate Gaussian densities, for i.i.d. Gaussians:
f(y_1, ..., y_n) = ∏_i (1/(√(2π) σ)) e^{−(y_i−µ)²/(2σ²)} = (2π)^{−n/2} σ^{−n} e^{−Σ_i (y_i−µ)²/(2σ²)},  y_1, ..., y_n ∈ R

The general multivariate Gaussian case:
f(x) = (2π)^{−n/2} |Σ|^{−1/2} e^{−(1/2)(x−µ)'Σ^{−1}(x−µ)},  x ∈ Rⁿ
(a density on Rⁿ if Σ > 0 and µ ∈ Rⁿ)
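A quick numerical check of the middle identity: for i.i.d. coordinates the product of univariate densities equals the multivariate density with Σ = σ²I. A minimal sketch (the function names are our own):

```python
import numpy as np

def univariate_pdf(y, mu, sigma):
    # f(y) = exp(-(y-mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def multivariate_pdf(x, mu, Sigma):
    # f(x) = (2 pi)^{-n/2} |Sigma|^{-1/2} exp(-(1/2)(x-mu)' Sigma^{-1} (x-mu))
    n = len(x)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))

y = np.array([0.5, -1.2, 2.0])
mu, sigma = 0.3, 1.5
prod = np.prod(univariate_pdf(y, mu, sigma))                    # product of univariates
joint = multivariate_pdf(y, mu * np.ones(3), sigma**2 * np.eye(3))  # Sigma = sigma^2 I
```

The two numbers agree to machine precision, as the algebra on the slide predicts.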
Linear combinations of Gaussian random variables

Linear combinations (aX + bY, c'Y, etc.) of Gaussian random variables (independent or jointly multivariate Gaussian) are also Gaussian.
E[aX + bY] = aµ_X + bµ_Y
V[aX + bY] = a²σ²_X + b²σ²_Y if independent
V[aX + bY] = a²σ²_X + b²σ²_Y + 2ab Cov(X, Y) in general
E[c'Y] = c'µ
V[c'Y] = c'Σ_Y c
V[AY] = AΣ_Y A'
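The scalar variance formula and the matrix form c'Σc are the same computation; a small check with an assumed 2×2 covariance matrix (values chosen for illustration):

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # Var(X)=2, Var(Y)=1, Cov(X, Y)=0.5
a, b = 3.0, -2.0

# V[aX + bY] = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)
v_scalar = a**2 * Sigma[0, 0] + b**2 * Sigma[1, 1] + 2 * a * b * Sigma[0, 1]

# Same quantity as c' Sigma c with c = (a, b)'
c = np.array([a, b])
v_matrix = c @ Sigma @ c

# Matrix version: V[AY] = A Sigma A' gives the covariance of several combinations at once
A = np.array([[1.0, 1.0],    # X + Y
              [1.0, -1.0]])  # X - Y
V_AY = A @ Sigma @ A.T
```

Here `V_AY[0, 0]` is Var(X + Y) = 2 + 1 + 2(0.5) = 4 and `V_AY[1, 1]` is Var(X − Y) = 2 + 1 − 2(0.5) = 2.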
Distributions of estimates

The number b should be viewed as the outcome of the random variable
β̂ = Σ_i (x_i − x̄) Y_i / Σ_i (x_i − x̄)²
(note the rewrite from the earlier formula). So β̂ is a linear combination of the Gaussian Y_1, ..., Y_n, and hence β̂ is Gaussian. Will show that E[β̂] = β, find V[β̂] = σ²_β̂ = ... and
(β̂ − β) / σ̂_β̂ ~ t_{n−2}
Inference: hypothesis testing and confidence intervals.
Example: test whether the slope is zero (H_0: β = 0), i.e. whether there is some relationship between x and y.
Never be content with a mathematical or conceptual model: insist that it is justified by data!
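The slope test can be carried out by hand from these formulas; a sketch with hypothetical data (SciPy's t distribution supplies the p-value):

```python
import numpy as np
from scipy import stats

# Hypothetical data, for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.1, 2.8, 4.2, 4.8, 6.1])
n = len(x)

xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
bhat = np.sum((x - xbar) * y) / Sxx        # beta-hat, a linear combination of the y_i
ahat = y.mean() - bhat * xbar
resid = y - (ahat + bhat * x)
s2 = np.sum(resid ** 2) / (n - 2)          # estimate of sigma^2
se_b = np.sqrt(s2 / Sxx)                   # estimated sd of beta-hat
t = bhat / se_b                            # test statistic for H0: beta = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)       # two-sided p-value on n-2 df
```

With this data the slope is clearly nonzero; `stats.linregress(x, y)` gives the same slope and p-value.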
Goodness of fit and diagnostics: STATS 310 xxx

Verifying the SLR assumptions... Will derive tests for nonlinearity (lack-of-fit), normality, homoscedasticity, independence, outliers, etc.
This is (mainly) based on the residuals e_i = y_i − ŷ_i or variations thereof.
Some concepts: standardized residuals, studentized residuals, deleted residuals, etc.
Simplest diagnostics: plot the residuals in all possible ways.
[Figures: data with fitted line; residuals resid(fm) plotted against x]
Note: it is never enough to fit a model or use it for predictions. One must always also verify whether the model is adequate.
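The raw and standardized residuals are easy to compute once the line is fitted; a sketch with invented data (two classical OLS identities, Σe_i = 0 and Σx_i e_i = 0, make a handy sanity check):

```python
import numpy as np

# Hypothetical data; fit a line and examine the residuals
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.2, 1.1, 1.9, 3.3, 3.8, 5.1])

b, a = np.polyfit(x, y, 1)            # slope, intercept (least squares)
e = y - (a + b * x)                   # raw residuals e_i = y_i - yhat_i
s = np.sqrt(np.sum(e ** 2) / (len(x) - 2))
std_resid = e / s                     # crude standardized residuals
```

Plotting `std_resid` against x, against fitted values, and in a normal QQ-plot are the "all possible ways" the slide alludes to.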
SLR in matrix form

y ∈ Rⁿ = vector of measurements

X = [ 1  x_1 ]
    [ .   .  ]   (the X-matrix)
    [ 1  x_n ]

min Σ (y_i − (α + βx_i))² is equivalent to finding β = (α, β)' to minimize ‖y − Xβ‖².
Matrix notation: y = Xβ + e.
Point estimate as a projection.
Multiple regression: STATS 310 mulreg

The model: y = Xβ + e, where X is an n × p matrix.
Example: interaction model: y_i = α + βx_i + γw_i + δx_i w_i. Defining x_i1 = 1, x_i2 = x_i, x_i3 = w_i, x_i4 = x_i w_i, this becomes a multiple linear model.
More examples: estimate a single intercept, many slopes; test whether multiple lines are all parallel; ...
Used in all fields of biology (fisheries, genetics, ...), economics, psychology, sociology, ...
Need to develop point estimates, methods of validation and testing.
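The interaction example amounts to stacking the columns 1, x, w, xw into a design matrix; a noise-free sketch with assumed covariates and coefficients (so OLS recovers them exactly):

```python
import numpy as np

# Hypothetical covariates: a continuous x and a binary w
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# Columns 1, x, w, x*w: the interaction model as a multiple linear model
X = np.column_stack([np.ones_like(x), x, w, x * w])   # n x p design matrix, p = 4

# Simulated response with known (alpha, beta, gamma, delta); no noise for clarity
beta_true = np.array([1.0, 2.0, 0.5, -0.3])
y = X @ beta_true

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)      # OLS fit
```

Because y was built from X exactly, `beta_hat` reproduces `beta_true` to machine precision.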
Matrix solution

The point estimate is
β̂ = (X'X)^{−1} X'Y
which is always unbiased (if the mean of the Y's is correct),
E[β̂] = β,
has variance-covariance matrix
V[β̂] = σ² (X'X)^{−1}
(if the variance assumptions are correct), and is multivariate Gaussian (if the Y-values are Gaussian).
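A sketch of the normal-equations solution on simulated data (solving the linear system X'Xβ̂ = X'y is numerically preferable to forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
# Simulated design matrix with an intercept column, and a known beta
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)

# beta-hat = (X'X)^{-1} X'y, computed via a linear solve
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes give the same β̂; the estimated variance-covariance matrix would then be s²(X'X)^{−1} with s² the residual mean square.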
Testing

Statistical tests (of model reductions) can use projections in Rⁿ, where orthonormal bases give the SSEs and degrees of freedom:
y = ζ̂_1 u_1 + ... + ζ̂_q u_q + ζ̂_{q+1} u_{q+1} + ... + ζ̂_r u_r + ζ̂_{r+1} u_{r+1} + ... + ζ̂_n u_n
SSE(F) = ‖y − Xβ̂‖² = Σ_{i=r+1}^{n} ζ̂_i²
SSE(R) = ‖y − Zγ̂‖² = Σ_{i=q+1}^{n} ζ̂_i²
SSE(R) − SSE(F) = ‖Zγ̂ − Xβ̂‖² = Σ_{i=q+1}^{r} ζ̂_i²
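The resulting F-test for a model reduction can be sketched numerically: fit the full model X and the reduced model Z, compare their SSEs. Simulated data, with a covariate that truly has no effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # x2 has no true effect

Z = np.column_stack([np.ones(n), x1])        # reduced model, q = 2 columns
X = np.column_stack([np.ones(n), x1, x2])    # full model, r = 3 columns

def sse(M, y):
    """Residual sum of squares from an OLS fit of y on M."""
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    res = y - M @ coef
    return res @ res

sse_r, sse_f = sse(Z, y), sse(X, y)
q, r = Z.shape[1], X.shape[1]
# F = ((SSE(R) - SSE(F)) / (r - q)) / (SSE(F) / (n - r))
F = ((sse_r - sse_f) / (r - q)) / (sse_f / (n - r))
p = stats.f.sf(F, r - q, n - r)
```

SSE(F) can never exceed SSE(R), since the full model's column space contains the reduced one's.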
Estimable functions: STATS 310 xxx

In the one-way layout not all parameter combinations can be estimated. Those linear combinations of the parameters, ψ = c'β, which have an unbiased estimate that is a linear combination of the y-values are termed estimable functions.
y ~ n(Xβ, σ²I), where y is n × 1, X is n × p, β is p × 1 and I is n × n
ψ_i = c_i'β;  C = (c_1, ..., c_q)';  ψ = Cβ
ψ̂ = Ay = Cβ̂ ~ n(Cβ, σ²AA')
Theorem: ψ̂ ~ n(ψ, Σ_ψ̂), ‖y − Xβ̂‖²/σ² ~ χ²_{n−r}, and these two quantities are independent.
Model selection

Many x-variables? Need to choose a subset for inclusion.
Look at all subsets? How should quality of fit be measured?
Forward and backward stepwise selection.
Example: what drives recruitment to fish stocks? Have a series of e.g. 50 years of data, but several dozen possible x-variables.
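A minimal sketch of forward stepwise selection, using drop in SSE as the (simplistic) quality-of-fit measure; the data are simulated, with only two of six candidate variables truly active:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 60, 6
Xall = rng.normal(size=(n, p))
# Only columns 0 and 2 actually drive y
y = 3.0 * Xall[:, 0] - 2.0 * Xall[:, 2] + rng.normal(size=n)

def sse(cols):
    """SSE of the OLS fit of y on an intercept plus the chosen columns."""
    M = np.column_stack([np.ones(n)] + [Xall[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    res = y - M @ coef
    return res @ res

selected, remaining = [], list(range(p))
for _ in range(2):   # greedily add the variable giving the largest SSE drop
    best = min(remaining, key=lambda j: sse(selected + [j]))
    selected.append(best)
    remaining.remove(best)
```

Real stepwise procedures use an entry/exit criterion (F-to-enter, AIC, ...) rather than a fixed number of steps; this sketch only shows the greedy mechanism.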
Model validation

Validate mainly using various residuals.
Investigate assumptions, but now also investigate influential observations: DFFITS etc.
Hat matrix H, particularly the leverage values h_ii:
H = X(X'X)^{−1}X'
projects y to ŷ.
Investigate collinearity.
[Figures: examples of problem cases]
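The hat matrix and its diagonal are a one-liner; a sketch with simulated x-values in which one deliberately extreme point picks up the largest leverage (trace(H) equals the number of columns of X):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 15
# 14 ordinary x-values plus one extreme point (high leverage by construction)
x = np.concatenate([rng.uniform(0, 1, n - 1), [5.0]])
X = np.column_stack([np.ones(n), x])

# H = X (X'X)^{-1} X' projects y onto the fitted values yhat
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)            # leverage values h_ii
```

A common rule of thumb flags observations with h_ii above 2p/n for closer inspection.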
Analysis of variance: STATS 310 anova

One-way layout: y_ij = µ + α_i + e_ij
This is a particular linear model (a multiple regression model), but with special properties! Will develop expressions for solutions, sums of squares, etc.
Example: breast feeding and IQ

Group   n     mean IQ   s
I       90    92.8      15.2    not breast fed
IIa     17    94.8      19.0    failed breast feeding
IIb     193   103.7     15.3    breast fed

More examples: fertilizers, crops; differential gene expressions; ...
Two-way layout etc.
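With only summary statistics per group (as in the table), the one-way ANOVA F-test can still be computed, since the between- and within-group sums of squares depend only on n_i, the group means, and the group standard deviations:

```python
import numpy as np
from scipy import stats

# Summary statistics from the breast-feeding/IQ table
n_i    = np.array([90, 17, 193])
mean_i = np.array([92.8, 94.8, 103.7])
sd_i   = np.array([15.2, 19.0, 15.3])

N = n_i.sum()
k = len(n_i)
grand = np.sum(n_i * mean_i) / N                 # grand mean
ssb = np.sum(n_i * (mean_i - grand) ** 2)        # between-groups SS, k-1 df
ssw = np.sum((n_i - 1) * sd_i ** 2)              # within-groups SS, N-k df
F = (ssb / (k - 1)) / (ssw / (N - k))
p = stats.f.sf(F, k - 1, N - k)
```

Here the F statistic is large and the p-value very small, consistent with the group means in the table differing by more than chance alone would suggest.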
Multiple comparisons procedures: Scheffé

Theorem: under the above assumptions and definitions,
[(ψ̂ − ψ)' B^{−1} (ψ̂ − ψ) / q] / [‖y − Xβ̂‖² / (n − r)] ~ F_{q,n−r}
P_β[(ψ̂ − ψ)' B^{−1} (ψ̂ − ψ) ≤ q s² F_{q,n−r,1−α}] = 1 − α
If Ψ = {ψ = k_1ψ_1 + ... + k_qψ_q}, then
P[ψ̂ − √(qF) σ̂_ψ̂ ≤ ψ ≤ ψ̂ + √(qF) σ̂_ψ̂ for all ψ ∈ Ψ] = 1 − α
and we are therefore allowed to search among all estimable functions within the set to find significant effects. Can now do legal data-snooping!
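The price of this legal data-snooping is a wider interval: the Scheffé multiplier √(q F_{q,n−r,1−α}) exceeds the ordinary t multiplier whenever q > 1. A sketch with assumed dimensions (q, n, r chosen for illustration):

```python
import numpy as np
from scipy import stats

q, n, r = 3, 30, 4       # q-dimensional family of estimable functions; rank(X) = r
alpha = 0.05

# Scheffé multiplier: sqrt(q * F_{q, n-r, 1-alpha})
scheffe = np.sqrt(q * stats.f.ppf(1 - alpha, q, n - r))

# Ordinary single-comparison t multiplier on n-r df
single = stats.t.ppf(1 - alpha / 2, n - r)
```

Each simultaneous interval is ψ̂ ± scheffe · σ̂_ψ̂, versus ψ̂ ± single · σ̂_ψ̂ for one pre-specified comparison; the gap between the two multipliers is the cost of searching the whole family Ψ.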
Generalizations and other topics

- The generalized linear model (GLM) drops the assumption of normality and defines a link function.
- The generalized additive model (GAM) allows smoothing functions in place of linear combinations.
- Random effects models
- Nonlinear models
- Correlated observations