Regression: Part II
Linear Regression

  Data model:       y ~ N(Xβ, σ²)
  Process model:    β, σ²
  Parameter model:  β ~ N(B0, Vβ),  σ² ~ IG(s1, s2)
Assumptions of Linear Model
- Homoskedasticity
- No error in X variables
- Error in Y variables is measurement error
- Normally distributed error
- Observations are independent
- No missing data
Heteroskedasticity
Solutions
1) Transform the data
   - Pro: no additional parameters
   - Cons: no longer modeling the original data; the likelihood & process model take on a different meaning; back-transformation is non-trivial (Jensen's Inequality), as the sketch below illustrates
2) Model the variance
   - Pro: working with the original data and model, no transformation
   - Con: additional process model and parameters (and priors)
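As a quick illustration of the Jensen's Inequality issue, here is a minimal R sketch (the simulated numbers are illustrative, not from the slides): for a model fit on log(y), naively exponentiating the mean underestimates the mean of y.

set.seed(1)
logy <- rnorm(1e5, mean = 2, sd = 1)  ## pretend the model was fit on the log scale
exp(mean(logy))                       ## naive back-transformation: ~ exp(2)      = 7.39
mean(exp(logy))                       ## actual mean of y:  ~ exp(2 + 1/2) = 12.18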
Heteroskedasticity

  Data model:       y ~ N(β1 + β2·x, (α1 + α2·x)²)
  Process model:    β, α
  Parameter model:  β ~ N(B0, Vβ),  α ~ N(A0, Vα)
Example: Linearly varying SD

  y ~ N(β1 + β2·x, (α1 + α2·x)²)

Likelihood (R):

LnL = function(theta, x, y){
  beta  = theta[1:2]    ## parameters of the mean
  alpha = theta[3:4]    ## parameters of the SD
  -sum(dnorm(y, beta[1] + beta[2]*x, alpha[1] + alpha[2]*x, log = TRUE))
}

Bayes (WinBUGS):

model{
  for(i in 1:2) { beta[i]  ~ dnorm(0, 0.001) }   ## priors on mean parameters
  for(i in 1:2) { alpha[i] ~ dlnorm(0, 0.001) }  ## priors on SD parameters (kept positive)
  for(i in 1:n){
    prec[i] <- 1/pow(alpha[1] + alpha[2]*x[i], 2)  ## BUGS parameterizes the Normal by precision
    mu[i]   <- beta[1] + beta[2]*x[i]
    y[i]    ~ dnorm(mu[i], prec[i])
  }
}
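A usage sketch for the likelihood version, fitting by numerical optimization in R (the simulated data, starting values, and bounds are assumptions for illustration; the bounds keep the SD positive over the range of x):

set.seed(42)
x <- runif(200, 0, 10)
y <- rnorm(200, 2 + 0.5*x, 0.5 + 0.2*x)      ## simulate data with linearly varying SD
fit <- optim(c(0, 0, 1, 0.1), LnL, x = x, y = y,
             method = "L-BFGS-B",
             lower = c(-Inf, -Inf, 0.01, 0)) ## alpha constrained so the SD stays > 0
fit$par                                      ## estimates of beta[1:2], alpha[1:2]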
[Figures: "Likelihood" and "Bayes" fits of the linearly varying SD model]
Additional thoughts on modeling variance
- Need not be linear
- Can model in terms of SD, variance, or precision
- Can vary with treatments/factors or categorical variables, e.g. can relax the ANOVA assumption of equal variance among treatments (see the sketch below)
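For instance, relaxing the equal-variance ANOVA assumption might look like this in R (a hedged sketch; `group` is a hypothetical factor, and the parameter layout is one choice among many):

## negative log-likelihood with a separate SD for each treatment group
LnL.group = function(theta, x, group, y){
  mu   = theta[1] + theta[2]*x            ## common mean model
  sd.g = theta[2 + as.integer(group)]     ## theta[3], theta[4], ... = one SD per group level
  -sum(dnorm(y, mu, sd.g, log = TRUE))
}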
Assumptions of Linear Model
- Homoskedasticity → model the variance
- No error in X variables
- Error in Y variables is measurement error
- Normally distributed error
- Observations are independent
- No missing data
Errors in Variables
- The regression model assumes all the error is in the Y
- Often we know there is non-negligible error in the measurement of X
Errors in Variables

  Process model:      µ = β1 + β2·x
  Data model for y:   y ~ N(µ, σ²)
  Data model for x:   x(o) ~ N(x, τ²)
  Prior for beta:     β ~ N(B0, VB)
  Prior for sigma:    σ² ~ IG(s1, s2)
  Prior for tau:      τ² ~ IG(t1, t2)
  Prior for X:        x ~ N(X0, VX)
Errors in Variables

  Data model:       y ~ N(Xβ, σ²),  x(o) ~ N(x, τ²)
  Process model:    X, β, σ
  Parameter model:  x ~ N(X0, Vx),  β ~ N(B0, Vβ),  σ² ~ IG(s1, s2)
Full Posterior

p(β, σ², τ², x | y, x(o)) ∝ N(y | β1 + β2·x, σ²) · N(x(o) | x, τ²) · N(β | B0, VB)
                            · IG(σ² | s1, s2) · IG(τ² | t1, t2) · N(x | X0, VX)

Conditionals

p(β | …)  ∝ N(y | β1 + β2·x, σ²) · N(β | B0, VB)
p(σ² | …) ∝ N(y | β1 + β2·x, σ²) · IG(σ² | s1, s2)
p(τ² | …) ∝ N(x(o) | x, τ²) · IG(τ² | t1, t2)
p(x | …)  ∝ N(y | β1 + β2·x, σ²) · N(x(o) | x, τ²) · N(x | X0, VX)
model {
  ## priors
  for(i in 1:2) { beta[i] ~ dnorm(0, 0.001) }
  sigma ~ dgamma(0.1, 0.1)                 ## precision of the y data model
  tau   ~ dgamma(0.1, 0.1)                 ## precision of the x data model
  for(i in 1:n) { xt[i] ~ dunif(0, 10) }   ## prior on the true (latent) x

  for(i in 1:n){
    x[i]  ~ dnorm(xt[i], tau)              ## data model for the observed x
    mu[i] <- beta[1] + beta[2]*xt[i]       ## process model uses the latent x
    y[i]  ~ dnorm(mu[i], sigma)            ## data model for y
  }
}
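One way to run this model from R is via the rjags package (a sketch; the file name "eiv.txt", data objects, and MCMC settings are assumptions, not from the slides):

library(rjags)
data <- list(x = x, y = y, n = length(y))   ## observed x and y
j.model <- jags.model(file = "eiv.txt",     ## text file containing the model above
                      data = data, n.chains = 3)
out <- coda.samples(j.model,
                    variable.names = c("beta", "sigma", "tau"),
                    n.iter = 10000)
summary(out)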
Conceptually, within the MCMC:
- Update the regression model given the current values of X
- Update the observation error in X based on the difference between the current and observed values of X
- Update the values of X based on the observed values of X and the regression model
- Overall: integrate over the possible values of X
Additional Thoughts on EIV
- Errors in X's need not be Normal:  x(o) ~ g(x)
- Errors need not be additive
- Can account for known biases, e.g.  x(o) ~ N(γ0 + γ1·x, τ²)
- Observed data can be a different type (proxy)
- Very useful to have informative priors
Latent Variables
- Variables that are not directly observed
- Values are inferred from the model
- Parameter model: prior on the value
- Data and process models provide the constraint:
    p(x | …) ∝ N(y | β1 + β2·x, σ²) · N(x(o) | x, τ²) · N(x | X0, VX)
- MCMC integrates over (by sampling) the values the unobserved variable could take on
- Latent variables contribute to uncertainty in the parameters and the model
- Ignoring this variability can lead to falsely overconfident conclusions
Assumptions of Linear Model
- Homoskedasticity → model the variance
- No error in X variables → errors in variables
- Error in Y variables is measurement error
- Normally distributed error
- Observations are independent
- No missing data
Missing data models

  y ~ N(Xβ, σ²)

- Let's assume a standard multiple regression model (homoskedastic, no error in X)
- If some of the y's are missing: we can just predict the distribution of those values using the model (a prediction interval, PI); see the sketch below
- What if some of the X's are missing? The observed y is more likely to have come from some values of X than others
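For the missing-y case, the predictive distribution can be sampled directly (a minimal R sketch; the parameter values and x.new are hypothetical, and parameter uncertainty is ignored for simplicity):

beta  <- c(10, -0.4)   ## hypothetical fitted coefficients
sigma <- 1.2           ## hypothetical residual SD
x.new <- 5             ## covariate value for a row with missing y
y.pred <- rnorm(10000, beta[1] + beta[2]*x.new, sigma)
quantile(y.pred, c(0.025, 0.975))   ## 95% prediction interval (PI)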
[Figure: candidate values of a missing X along the fitted regression line, labeled "Less Likely" and "More Likely" given the observed y]
Missing Data

  Process model:        µ = Xβ
  Data model for y:     y ~ N(µ, σ²)
  Prior for beta:       β ~ N(B0, VB)
  Prior for sigma:      σ² ~ IG(s1, s2)
  Prior for missing X:  x_mis ~ N(X0, VX)

Conditional for the missing X:

  p(x_mis | …) ∝ N(y | Xβ, σ²) · N(x_mis | X0, VX)
Missing Data Model

  Data model:       y ~ N(Xβ, σ²)
  Process model:    X_mis, β, σ
  Parameter model:  X_mis ~ N(X0, Vx),  β ~ N(B0, Vβ),  σ² ~ IG(s1, s2)
Conceptually, within the MCMC:
- Update the regression model based on ALL the rows of data, conditioned on the current values of the missing data
- Update the missing data based on the current regression model and the values that all the other covariates take on
- Overall: integrate over the uncertainty in the missing X's
- Model uncertainty increases, but less so than if whole rows of data were dropped (partial information is retained)
ASSUMPTION!!
- Missing data models assume that the data are missing at random
- If data are missing SYSTEMATICALLY, the missing values cannot be estimated
BUGS example: Simple Regression

model{
  ## priors
  for(i in 1:2) { beta[i] ~ dnorm(0, 0.001) }
  sigma ~ dgamma(0.1, 0.1)
  for(i in mis) { x[i] ~ dunif(0, 10) }   ## mis = vector giving indices of missing X values

  for(i in 1:n){
    mu[i] <- beta[1] + beta[2]*x[i]
    y[i]  ~ dnorm(mu[i], sigma)
  }
}

Data:

    X      Y
  4.68   8.46
  2.95   8.55
  9.09   7.01
  8.15   9.06
  1.76  11.38
  4.23   9.12
  7.73   7.30
  2.43   8.02
  6.46   8.45
  4.06   8.95
  2.42   9.62
  0.60   9.15
  8.17   7.51
  0.22  10.80
  4.93   9.78
  2.99  10.71
  8.36   8.89
  6.40   8.21
  8.17   6.22
  6.46   5.40
  1.82  10.05
  9.52   7.96
  2.44   9.63
  6.84   7.05
  7.42   8.73
    NA   7.50
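The `mis` vector and the data list might be built in R as follows (a sketch; assumes `x` and `y` hold the columns of the table above, with NA marking the missing X):

mis  <- which(is.na(x))    ## indices of the missing X values
data <- list(x = x, y = y, n = length(y), mis = mis)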
Example

[Figure]
Assumptions of Linear Model
- Homoskedasticity → model the variance
- No error in X variables → errors in variables
- No missing data → missing data model
- Error in Y variables is measurement error
- Normally distributed error
- Observations are independent
Generalized Linear Models
- Retains the linear function
- Allows alternate PDFs to be used in the likelihood
- However, with many non-normal PDFs the range of the model parameters does not allow a linear function to be used safely:
    Pois(λ):      λ > 0
    Binom(n, θ):  0 < θ < 1
- Typically a link function is used to relate the linear model to the PDF
Canonical Link Functions

  Distribution           Link Name   Link Function       Mean Function
  Normal                 Identity    Xβ = µ              µ = Xβ
  Exponential, Gamma     Inverse     Xβ = µ⁻¹            µ = (Xβ)⁻¹
  Poisson                Log         Xβ = ln(µ)          µ = exp(Xβ)
  Binomial, Multinomial  Logit       Xβ = ln(µ/(1−µ))    µ = exp(Xβ)/(1 + exp(Xβ))

- Can use most any function as a link function, but it may only be valid over a restricted range
- Many are technically nonlinear functions
Logit

  Xβ = ln(θ/(1−θ))

- Interpretation: log of the ODDS, θ/(1−θ)
- logit(0.5) = 0
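In R, both directions of this link are one-liners (a sketch; `ilogit` is a hand-rolled helper, equivalent to the built-in plogis, and is reused by the homebrew likelihood on the next slide):

logit  <- function(theta) log(theta/(1 - theta))   ## log odds
ilogit <- function(z) 1/(1 + exp(-z))              ## inverse link, same as plogis(z)
logit(0.5)   ## 0: even odds
ilogit(0)    ## 0.5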
Logistic Regression
- Common model for the analysis of Boolean data (0/1, True/False, Present/Absent)
- Assumes a Bernoulli likelihood: Bern(θ) = Binom(1, θ)

Likelihood specification:

  Data model:     y ~ Bern(θ)
  Process model:  logit(θ) = Xβ
  Bayesian parameter model:  β ~ N(B0, VB)
Logistic Regression

  Data model:       y ~ Binom(1, logit⁻¹(Xβ))
  Process model:    β
  Parameter model:  β ~ N(B0, Vβ)
Logistic Regression in R

Option 1: built-in function

glm(y ~ x, family = binomial(link = "logit"))

Option 2: homebrew

lnl = function(beta){   ## negative log-likelihood; ilogit as defined above
  -sum(dbinom(y, 1, ilogit(beta[1] + beta[2]*x), log = TRUE))
}
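A quick check that the two options agree (a sketch with simulated data; the coefficients and sample size are illustrative assumptions, and `ilogit` is the helper defined earlier):

set.seed(1)
x <- runif(250, 0, 10)
y <- rbinom(250, 1, ilogit(-4 + 0.75*x))             ## simulate Bernoulli data
coef(glm(y ~ x, family = binomial(link = "logit")))  ## option 1
optim(c(0, 0), lnl)$par                              ## option 2: similar estimates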
Call:
glm(formula = y ~ x, family = binomial())

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.3138  -0.6560  -0.2362   0.6169   2.4143

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.85078    0.48091  -8.007 1.17e-15 ***
x            0.73874    0.08779   8.415  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 345.79  on 249  degrees of freedom
Residual deviance: 209.40  on 248  degrees of freedom
AIC: 213.40
Alternative link functions
- probit: Normal CDF
- cauchit: Cauchy CDF
- log:  θ = exp(Xβ)
- cloglog (complementary log-log):  θ = 1 − exp(−exp(Xβ))
  Asymmetric, often used for high or low probabilities (see the sketch below)
- If you code it yourself: any function that maps the real line to (0, 1) will work
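The asymmetry of the complementary log-log link is easy to see numerically (a minimal R sketch):

cloglog.inv <- function(z) 1 - exp(-exp(z))   ## inverse cloglog link, maps R to (0,1)
cloglog.inv(0)                ## 0.632, not 0.5: unlike the logit, not symmetric about 0
plot(cloglog.inv, -5, 5)      ## approaches 0 and 1 at different rates on the two tails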
Coming next...
- GLMs: Bayesian logistic regression, Poisson regression, multinomial
- Continuing our exploration of relaxing the assumptions of linear models