Statistics 203: Introduction to Regression and Analysis of Variance
Course review
Jonathan Taylor
Today
- Review / overview of what we learned.
General themes in regression models
- Specifying regression models. What is the joint (conditional) distribution of all outcomes given all covariates? Are the outcomes independent (conditional on the covariates)? If not, what is an appropriate model?
- Fitting the models. Once a model is specified, how do we estimate its parameters? Is there an algorithm or existing software to fit the model?
- Comparing regression models. Inference for coefficients in the model: are some zero (i.e., is a smaller model better)? If there are two competing models for the data, why would one be preferable to the other? If there are many candidate models, how do we compare them?
Simple linear regression model
- Only one covariate:
    Y_i = β_0 + β_1 X_i + ε_i,  1 ≤ i ≤ n.
- Errors ε_i are independent N(0, σ²).
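As a quick illustration (not from the original slides), a minimal Python/numpy sketch that simulates data from this model and recovers the least-squares estimates; the true values β_0 = 1, β_1 = 2, σ = 1.5 are arbitrary assumptions.

    import numpy as np

    # Simulate Y_i = b0 + b1*X_i + eps_i (assumed truth: b0 = 1, b1 = 2, sigma = 1.5).
    rng = np.random.default_rng(0)
    n = 50
    X = rng.uniform(0, 10, size=n)
    Y = 1.0 + 2.0 * X + rng.normal(scale=1.5, size=n)

    # Least-squares fit via the design matrix [1, X].
    Xmat = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
    resid = Y - Xmat @ beta_hat
    sigma2_hat = resid @ resid / (n - 2)  # unbiased estimate of sigma^2
    print(beta_hat, sigma2_hat)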
Multiple linear regression model
- Model:
    Y_i = β_0 + β_1 X_{i1} + ... + β_{p−1} X_{i,p−1} + ε_i,  1 ≤ i ≤ n.
- Errors ε_i are independent N(0, σ²).
- The β_j's are (partial) regression coefficients.
- Special cases: polynomial / spline regression models, where the extra columns are functions of one covariate.
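A hedged sketch of the polynomial special case: the extra design columns are powers of a single covariate (the quadratic truth below is an arbitrary assumption).

    import numpy as np

    # Quadratic regression: design columns are 1, x, x^2.
    rng = np.random.default_rng(1)
    n = 50
    x = np.linspace(-2, 2, n)
    y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.5, size=n)

    X = np.vander(x, N=3, increasing=True)  # columns [1, x, x^2]
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta_hat)  # estimates of (beta_0, beta_1, beta_2)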
ANOVA & categorical variables
- Generalization of two-sample tests.
- One-way (fixed):
    Y_{ij} = µ + α_i + ε_{ij},  1 ≤ i ≤ r, 1 ≤ j ≤ n.
  The α's are constants to be estimated; errors ε_{ij} are independent N(0, σ²).
- Two-way (fixed):
    Y_{ijk} = µ + α_i + β_j + (αβ)_{ij} + ε_{ijk},  1 ≤ i ≤ r, 1 ≤ j ≤ m, 1 ≤ k ≤ n.
  The α's, β's and (αβ)'s are constants to be estimated; errors ε_{ijk} are independent N(0, σ²).
- Experimental design: when balanced layouts are impossible, which design is best?
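A sketch (with assumed group effects, not from the slides) computing the classical one-way ANOVA F statistic by hand with numpy/scipy.

    import numpy as np
    from scipy import stats

    # One-way layout: r = 3 groups, n = 10 observations per group;
    # the alpha values are arbitrary assumptions for illustration.
    rng = np.random.default_rng(2)
    r, n = 3, 10
    alpha = np.array([0.0, 1.0, -1.0])
    y = 10.0 + np.repeat(alpha, n) + rng.normal(size=r * n)
    groups = np.repeat(np.arange(r), n)

    group_means = np.array([y[groups == i].mean() for i in range(r)])
    ss_between = n * np.sum((group_means - y.mean()) ** 2)
    ss_within = np.sum((y - group_means[groups]) ** 2)
    F = (ss_between / (r - 1)) / (ss_within / (r * n - r))
    print(F, stats.f.sf(F, r - 1, r * n - r))  # F statistic and p-value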
Generalized linear models
- Non-Gaussian errors.
- Binary outcomes: logistic regression (or probit).
- Count outcomes: Poisson regression.
- A link and a variance function determine a GLM.
- Link:
    g(E(Y_i)) = g(µ_i) = η_i = x_i'β = X_{i0} β_0 + ... + X_{i,p−1} β_{p−1}.
- Variance function: Var(Y_i) = V(µ_i).
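A sketch of logistic regression as a GLM, assuming statsmodels is available; the coefficient values are arbitrary assumptions.

    import numpy as np
    import statsmodels.api as sm

    # Logistic regression: logit link g(mu) = log(mu/(1-mu)),
    # variance function V(mu) = mu(1-mu).
    rng = np.random.default_rng(3)
    n = 200
    x = rng.normal(size=n)
    eta = -0.5 + 1.5 * x             # linear predictor eta_i
    mu = 1 / (1 + np.exp(-eta))      # inverse link: mu_i = g^{-1}(eta_i)
    y = rng.binomial(1, mu)

    X = sm.add_constant(x)
    fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()  # fit by IRLS
    print(fit.params)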
Nonlinear regression models
- Regression function depends on the parameters in a nonlinear fashion:
    Y_i = f(X_{i1}, ..., X_{ip}; θ_1, ..., θ_q) + ε_i,  1 ≤ i ≤ n.
- Errors ε_i are independent N(0, σ²).
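A sketch using scipy's curve_fit; the exponential-decay mean function is an arbitrary example of a nonlinear f, not one singled out by the course.

    import numpy as np
    from scipy.optimize import curve_fit

    # Nonlinear mean function f(x; theta) = theta0 * exp(-theta1 * x).
    def f(x, theta0, theta1):
        return theta0 * np.exp(-theta1 * x)

    rng = np.random.default_rng(4)
    x = np.linspace(0, 5, 60)
    y = f(x, 3.0, 0.8) + rng.normal(scale=0.1, size=x.size)

    theta_hat, _ = curve_fit(f, x, y, p0=[1.0, 1.0])  # iterative least squares
    print(theta_hat)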
Robust regression
- Suppose we have additive noise that is not Gaussian.
- Likelihood of the form
    L(β | Y, X_0, ..., X_{p−1}) ∝ exp( −Σ_{i=1}^n ρ( (Y_i − Σ_{j=0}^{p−1} X_{ij} β_j) / s ) ).
- Leads to robust regression: minimize
    Σ_{i=1}^n ρ( (Y_i − Σ_{j=0}^{p−1} X_{ij} β_j) / s ).
- Appropriate choices of ρ downweight large residuals, accommodating noise with heavier tails than normal random variables.
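A sketch using statsmodels' RLM with Huber's ρ (an assumption: the slides do not fix a particular ρ); heavy-tailed t noise stands in for the non-Gaussian errors.

    import numpy as np
    import statsmodels.api as sm

    # Compare OLS with a Huber M-estimator fit by IRLS on heavy-tailed noise.
    rng = np.random.default_rng(5)
    n = 100
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=n)  # heavier tails than normal

    X = sm.add_constant(x)
    robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()  # IRLS, psi = rho'
    ols_fit = sm.OLS(y, X).fit()
    print(robust_fit.params, ols_fit.params)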
Random & mixed effects ANOVA
- When the levels of the categorical variables in an ANOVA are a sample from a population, effects should be treated as random.
- One-way (random):
    Y_{ij} = µ + α_i + ε_{ij},  1 ≤ i ≤ r, 1 ≤ j ≤ n,
  where the α_i ∼ N(0, σ²_α) are random, independent of the errors ε_{ij}, which are independent N(0, σ²).
- Introduces correlation in the Y's:
    Cov(Y_{ij}, Y_{i'j'}) = δ_{ii'} (σ²_α + δ_{jj'} σ²).
Mixed linear models
- Essentially a model of covariance between observations based on subject effects.
- General form:
    Y_{n×1} = X_{n×p} β_{p×1} + Z_{n×q} γ_{q×1} + ε_{n×1},
  where ε ∼ N(0, σ²I) and γ ∼ N(0, D) for some covariance D.
- In this model, Y ∼ N(Xβ, Z D Z^t + σ²I).
- Covariance is modelled through the random-effect design matrix Z and the covariance D.
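A sketch that instantiates the marginal covariance Z D Z^t + σ²I for the one-way random-effects layout of the previous slide; the variance components are arbitrary assumptions.

    import numpy as np

    # One-way random effects: r = 3 groups, n = 2 observations per group.
    r, n = 3, 2
    sigma2, sigma2_alpha = 1.0, 0.5  # assumed variance components

    Z = np.kron(np.eye(r), np.ones((n, 1)))  # maps group effects to observations
    D = sigma2_alpha * np.eye(r)             # Cov(gamma)
    Sigma = Z @ D @ Z.T + sigma2 * np.eye(r * n)
    print(Sigma)  # sigma2_alpha within each group's block, +sigma2 on the diagonal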
Time series regression models
- Another model of covariance between observations, based on dependence in time:
    Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}.
- In these models, ε ∼ N(0, Σ), where the covariance Σ depends on which time series model is used (i.e., which ARMA(p, q) model).
- Example: if ε is AR(1) with parameter ρ, then Σ_{ij} = σ² ρ^{|i−j|}.
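A sketch building the AR(1) covariance and drawing one correlated error vector from it; ρ and σ² are arbitrary assumptions.

    import numpy as np

    # AR(1) covariance Sigma_ij = sigma^2 * rho^|i-j|.
    n, sigma2, rho = 20, 1.0, 0.6
    idx = np.arange(n)
    Sigma = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

    # One draw of eps ~ N(0, Sigma) via the Cholesky factor.
    L = np.linalg.cholesky(Sigma)
    eps = L @ np.random.default_rng(6).normal(size=n)
    print(eps[:5])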
Functional linear model
- We talked about a functional two-sample t-test.
- General form:
    Y_{i,t} = β_{0,t} + Σ_{j=1}^{p−1} X_{ij} β_{j,t} + ε_{i,t},
  where the noise ε_{i,t} is a random function, independent across observations (curves) Y_{i,·}.
- Parameter estimates are curves: leads to nice inference problems for smooth random curves.
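A sketch (one possible estimator, not necessarily the course's) fitting the coefficient curves by pointwise OLS on a grid of t values; the sinusoidal truth is an arbitrary assumption.

    import numpy as np

    # Functional response: n curves observed on a grid of T time points.
    rng = np.random.default_rng(7)
    n, T = 30, 100
    t = np.linspace(0, 1, T)
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta_t = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])  # true curves
    Y = X @ beta_t + rng.normal(scale=0.3, size=(n, T))

    # One OLS fit per time point, vectorized over the grid.
    beta_hat_t, *_ = np.linalg.lstsq(X, Y, rcond=None)
    print(beta_hat_t.shape)  # (2, T): estimated curves beta_0(t), beta_1(t)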
Least squares
- Multiple linear regression (OLS):
    β̂ = (X^t X)^{−1} X^t Y.
- Non-constant variance but independent errors (WLS):
    β̂ = (X^t W X)^{−1} X^t W Y,  W_i = 1/Var(Y_i).
- General correlation (GLS):
    β̂ = (X^t Σ^{−1} X)^{−1} X^t Σ^{−1} Y,  Cov(Y) = σ²Σ.
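A sketch computing all three estimators with numpy; the weights and the AR(1) Σ are treated as known, an assumption for illustration.

    import numpy as np

    rng = np.random.default_rng(8)
    n, p = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

    # OLS: (X'X)^{-1} X'Y
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # WLS with known weights W_i = 1/Var(Y_i)
    w = rng.uniform(0.5, 2.0, size=n)
    beta_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

    # GLS with a known AR(1) covariance Sigma
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    Si = np.linalg.inv(Sigma)
    beta_gls = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ y)
    print(beta_ols, beta_wls, beta_gls)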
Maximum likelihood
- In the Gaussian setting with Σ known, least squares is the MLE.
- In other cases, we needed iterative techniques to solve for the MLE:
  - nonlinear regression: iterative projections onto the tangent space;
  - robust regression: IRLS with weights determined by ψ = ρ';
  - generalized linear models: IRLS, Fisher scoring;
  - time series regression models: two-stage procedure approximates the MLE (can iterate further, though);
  - mixed models: similar techniques (though we skipped the details).
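To make the GLM bullet concrete, a sketch writing out IRLS / Fisher scoring for logistic regression explicitly; this is the same computation packaged software performs internally, with the true coefficients assumed for simulation.

    import numpy as np

    rng = np.random.default_rng(9)
    n = 200
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    beta_true = np.array([-0.5, 1.5])       # assumed truth for illustration
    y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

    beta = np.zeros(2)
    for _ in range(25):
        eta = X @ beta
        mu = 1 / (1 + np.exp(-eta))         # inverse logit link
        w = mu * (1 - mu)                   # variance function V(mu)
        z = eta + (y - mu) / w              # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < 1e-8:
            beta = beta_new
            break
        beta = beta_new
    print(beta)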
Diagnostics: influence and outliers
- Diagnostic plots: added-variable plots; QQ plot; residuals vs. fitted; standardized residuals vs. fitted.
- Measures of influence: Cook's distance; DFFITS; DFBETAS.
- Outlier test with Bonferroni correction.
- Techniques are most developed for the multiple linear regression model, but some can be generalized (using whitened residuals).
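A sketch pulling Cook's distance, DFFITS and DFBETAS from an OLS fit, assuming statsmodels; one planted outlier shows up in the influence measures.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(10)
    n = 40
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)
    y[0] += 8.0                           # plant one outlier

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    infl = fit.get_influence()
    cooks_d, _ = infl.cooks_distance      # Cook's distance per observation
    dffits, _ = infl.dffits
    dfbetas = infl.dfbetas                # one row per observation
    print(np.argmax(cooks_d))             # observation 0 should stand out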
Penalized regression
- We looked at ridge regression, too: a generic example of the bias-variance tradeoff in statistics.
- Minimize
    SSE_λ(β) = Σ_{i=1}^n ( Y_i − Σ_{j=1}^{p−1} X_{ij} β_j )² + λ Σ_{j=1}^{p−1} β_j².
- Other penalties are possible: the basic idea is that the penalty is a measure of the complexity of the model.
- Smoothing spline: ridge regression for scatterplot smoothers.
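A sketch of the closed-form ridge estimate β̂_λ = (X^t X + λI)^{−1} X^t Y; λ = 1 and the sparse truth are arbitrary assumptions, and the intercept is omitted for brevity (assume centered data).

    import numpy as np

    rng = np.random.default_rng(11)
    n, p = 60, 5
    X = rng.normal(size=(n, p))           # assume centered/scaled columns
    y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(size=n)

    lam = 1.0
    beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_ridge)                     # shrunk toward zero relative to OLS
    print(beta_ols)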
Hypothesis tests: multiple linear regression
- Multiple linear regression: model R has j fewer coefficients than model F; equivalently, there are j linear constraints on the β's.
- F statistic:
    F = [(SSE(R) − SSE(F)) / j] / [SSE(F) / (n − p)] ∼ F_{j, n−p}  (if H_0 is true).
- Reject H_0 ("R is true") at level α if F > F_{1−α, j, n−p}.
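A sketch computing this F statistic for dropping j = 2 coefficients; the data are simulated so the reduced model is true.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(12)
    n = 80
    X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # p = 4
    y = X_full @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)
    X_red = X_full[:, :2]                                            # j = 2 constraints

    def sse(Xm, ym):
        beta, *_ = np.linalg.lstsq(Xm, ym, rcond=None)
        r = ym - Xm @ beta
        return r @ r

    p = X_full.shape[1]
    j = p - X_red.shape[1]
    F = ((sse(X_red, y) - sse(X_full, y)) / j) / (sse(X_full, y) / (n - p))
    print(F, stats.f.sf(F, j, n - p))  # compare with F_{1-alpha, j, n-p}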
Hypothesis tests: general case
- Other models: DEV(M) = −2 log L(M) replaces SSE(M).
- Difference: D = DEV(R) − DEV(F) ∼ χ²_j (asymptotically).
- The denominator in the F statistic is usually either known or based on something like Pearson's X²:
    φ̂ = (1/(n − p)) Σ_{i=1}^n r_i(F)².
- In general, residuals are whitened: r(M) = Σ(M)^{−1/2} (Y − X β̂(M)).
- Reject H_0 ("R is true") at level α if D > χ²_{1−α, j}.
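A sketch of the deviance version for nested logistic models, assuming statsmodels; the reduced model is true by construction, so D should look like a χ²_2 draw.

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(13)
    n = 300
    X = sm.add_constant(rng.normal(size=(n, 3)))
    eta = X @ np.array([-0.5, 1.0, 0.0, 0.0])    # last two effects are zero
    y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

    full = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    reduced = sm.GLM(y, X[:, :2], family=sm.families.Binomial()).fit()
    D = reduced.deviance - full.deviance         # DEV(R) - DEV(F)
    print(D, stats.chi2.sf(D, 2))                # asymptotic chi^2_2 p-value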
Model selection: AIC, BIC, stepwise
- Best subsets regression (leaps): adjusted R², C_p.
- Akaike Information Criterion (AIC):
    AIC(M) = −2 log L(M) + 2 p(M),
  where L(M) is the likelihood of model M evaluated at the MLE (maximum likelihood estimates).
- Schwarz's Bayesian Information Criterion (BIC):
    BIC(M) = −2 log L(M) + p(M) log n.
- Penalized regression can be thought of as model selection as well: choose the best penalty parameter on the basis of
    GCV(M) = Σ_{i=1}^n ( (Y_i − Ŷ_i(M)) / (1 − Tr(S(M))/n) )²,
  where Ŷ(M) = S(M) Y.
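A sketch comparing nested Gaussian linear models by AIC and BIC, using the profile Gaussian log-likelihood −2 log L = n log(SSE/n) up to an additive constant; counting only the regression coefficients in p(M) is a simplifying assumption.

    import numpy as np

    rng = np.random.default_rng(14)
    n = 100
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
    y = X[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

    def aic_bic(Xm, ym):
        p = Xm.shape[1]
        beta, *_ = np.linalg.lstsq(Xm, ym, rcond=None)
        sse = np.sum((ym - Xm @ beta) ** 2)
        m2loglik = n * np.log(sse / n)   # -2 log L up to a constant
        return m2loglik + 2 * p, m2loglik + p * np.log(n)

    for cols in (3, 5):                  # true model vs. overfit model
        print(cols, aic_bic(X[:, :cols], y))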
That's it! Thanks!