Stat 740 General Linear Model
Introduction, Classes of Linear Models and Estimation

An aim of scientific enquiry: to describe or to discover relationships among events (variables), either under controlled (laboratory) conditions or in the real world. The underlying purpose may be to
- develop understanding of the underlying phenomena,
- predict future events (outcomes),
- test some specified hypotheses,
- control the outcome of future events.

A MODEL (a mathematical equation) involving deterministic (controlled) variables, stochastic variables, as well as unknown parameters, is useful in performing this task.

Note: Assumptions about the probability distribution of the stochastic variables may be made. These are considered part of the model. The unknown parameters in the model can be estimated (learned) from the available training data.

Proposition: Suppose that y denotes a quantity in the real world (the response) about which we want to learn. We assume that there exists a finite, possibly large, collection of variables {x_1, x_2, ..., x_r}, called explanatory variables, and a function g, such that y = g(x_1, x_2, ..., x_r). Thus, y and {x_1, x_2, ..., x_r} are assumed to be functionally related. We are not implying that these variables are known and observable, and/or that the function g is known. When this relationship is not exact, it may still be a good and useful approximation to the exact model. This is called a statistical model.
Signal plus noise model: a popular class of linear models

The explanatory variables are known and observable, but g may be unknown, and one is willing to assume that

    g(x_1, x_2, ..., x_k) = µ(x_1, x_2, ..., x_k) + ε,

such that the signal µ(x_1, x_2, ..., x_k) is known up to a set of unknown parameters and the additive error ε acts as a random (uncontrollable or unexplainable) noise.

Sometimes not all the x_i's that determine the response y may be known. However, one can assume that

    g(x_1, x_2, ..., x_k) = µ(x_1, x_2, ..., x_p) + η(x_{p+1}, x_{p+2}, ..., x_k),

where, conditional on the values of the key explanatory variables x_1, x_2, ..., x_p, the quantities x_{p+1}, x_{p+2}, ..., x_k vary so that η(x_{p+1}, x_{p+2}, ..., x_k) behaves like a random error ε. This error is called the equation error or the specification error.

In certain other situations, the underlying response, y*, itself may not be observable exactly. Instead, we measure y = y* + ε, where ε denotes the measurement error.

For simplicity, one can write the model for the measurement (observable quantity) Y as

    Y = µ(x_1, x_2, ..., x_r) + ε.    (P)

Thus, the random noise ε could be either the specification error, or the measurement error, or a mixture of both. If the errors are not additive, these models are sometimes called generalized linear models (GLIM), e.g., logistic regression models, etc.
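The signal-plus-noise idea can be sketched with a short simulation. In this illustrative snippet (the linear form of µ, its coefficients 2.0 and 3.0, and the noise standard deviation 0.5 are arbitrary choices for the example, not from the notes), each observation is generated as Y = µ(x) + ε with mean-zero Gaussian noise:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def mu(x):
    # "Signal": known functional form, up to the unknown parameters
    # (here illustratively set to 2.0 and 3.0).
    return 2.0 + 3.0 * x

n = 5
xs = [i / n for i in range(n)]                       # deterministic x values
ys = [mu(x) + random.gauss(0.0, 0.5) for x in xs]    # Y = mu(x) + eps

for x, y in zip(xs, ys):
    print(f"x = {x:.2f}  signal = {mu(x):.2f}  observed Y = {y:.2f}")
```

In practice the direction is reversed: only the (x, Y) pairs are observed, and the parameters of µ must be estimated from them.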
General Linear Model (GLM): The Population Model

Model (P) holds, where Y and ε are random variables, x_1, x_2, ..., x_r are deterministic variables, and the mean response function E{Y | x} = µ(x) is linear in the unknown parameters {β_1, β_2, ..., β_p}, i.e., for all (x_1, x_2, ..., x_r) ∈ X,

    µ(x) = Σ_{j=1}^{p} β_j f_j(x_1, x_2, ..., x_r).

Here, the features f_j's are assumed to be completely known functions of x_1, x_2, ..., x_r. In engineering, one talks about the feature extraction process, which searches for appropriate features to describe the response. In statistical science, this is called model selection.

The variable Y is called the response variable, or an endogenous variable. The variables x_1, x_2, ..., x_r [or the features f_j's] are called the independent variables, the explanatory variables, the predictor variables, or the exogenous variables.

Broad Classification of General Linear Models:

Linear Regression Models: (Y, X_1, X_2, ..., X_r) are a set of jointly distributed random variables, such that E[Y | X_1, X_2, ..., X_r] = µ(x) = Σ_{j=1}^{p} β_j f_j(x_1, x_2, ..., x_r), and expression (P) above holds, e.g., simple linear regression or multiple linear regression. For analysis purposes, we treat the regression models as a particular case of the GLM. Here, we are segmenting (stratifying) the whole population based on the values of the variables (X_1, X_2, ..., X_r) and studying the conditional expectation of the response variable as a function of these variables. [Why do we choose the conditional mean, not some quantile?]
Experimental Design Models: Each explanatory variable in the GLM is quantitative or represents categorical levels of certain factors or traits under study. The categorical levels are represented as {0, 1} or {-1, 1}, indicating presence or absence of traits, and the GLM is called an experimental design or ANOVA model. Basically, we are interested in understanding the major causes of variability in the response variable. Classically, the analysis of such experiments could be simplified quite a bit due to the underlying structure in the set of explanatory variables. These days, research effort is devoted to finding optimal designs that allow optimal estimation of a specified set of parametric functions, based on some optimality criterion. These models are called fixed effects models. Examples include one-way, two-way, cross-classified multifactor experiments, and nested designs. When the experiment includes some design variables as well as some continuous explanatory variables, these models are called Analysis of Covariance (ANCOVA) models.

Variance Component Models (Random Effects Models): In many experiments, the levels of a factor are assumed to be randomly drawn from a population of levels. The effects of this factor are (unobserved) random variables, following a distribution with mean zero and unknown variance. These are called random effects models.

Mixed Effects Models: Mixed effects models have some factors with fixed effects and some with random effects.

Remark: Models for functionally related variables, when all the variables are subject to measurement errors, are called errors-in-variables models. These should not be treated as a particular case of the GLM.

Learning from Data: We wish to learn (make inference: estimate, test hypotheses) about the unknown parameters β_1, β_2, ..., β_p based on a Training Set (Sample). Start with the sample model for the observations in the training set.
But it is assumed that the population model is valid, to enable us to relate the response y to x_1, x_2, ..., x_r for unobserved or out-of-sample units in the population.
Training Sample Model: Given n observations [(Y_i, x_i), x_i = (x_i1, ..., x_ir)], i = 1, ..., n, the sample model can be expressed as

    Y_i = µ(x_i1, x_i2, ..., x_ir) + ε_i,  i = 1, 2, ..., n,    (1)

where ε_i, i = 1, 2, ..., n, denote the noise (random errors), each with mean zero and equal variance σ². Clearly,

    E[Y_i | x_i] = µ(x_i1, x_i2, ..., x_ir),  i = 1, 2, ..., n.    (2)

From now on, we denote the features f_j, j = 1, 2, ..., p, themselves as coded predictor variables x_1, x_2, ..., x_p. In the simplest setting, the random errors are also assumed to be uncorrelated. Thus the sample GLM can be expressed as

    Y_i = Σ_{j=1}^{p} β_j x_ij + ε_i,  E[ε_i] = 0,  Var[ε_i] = σ²,  Cov(ε_i, ε_k) = 0 for i ≠ k,    (3)

    E[Y_i] = Σ_{j=1}^{p} β_j x_ij.    (4)

Examples: simple linear regression model, multiple linear regression model, polynomial regression model, one-way fixed-effects ANOVA model, one-way random-effects ANOVA model.

In order to use this model for prediction of a future response given a set of predictor values, the unknown parameter vector needs to be estimated (learned) from the training sample. For any reasonable estimate β of the vector β, the estimated errors (residuals) in the response,

    e_i = Y_i − Σ_{j=1}^{p} β_j x_ij,

should be as small as possible. The choice of β is based on solving an optimization problem: minimize a loss function l(β), an implicit function of the estimated errors {e_1, e_2, ..., e_n}, that tends to keep the errors as small as possible. For example,

    l_1(β) = Σ_{i=1}^{n} |e_i|,  l_2(β) = Σ_{i=1}^{n} e_i²,  l_w(β) = Σ_{i=1}^{n} w_i e_i²,

where
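The three loss functions above can be sketched directly. In this illustrative snippet the residuals e_i and the weights w_i are toy numbers chosen only to show the computation:

```python
def l1(errors):
    """Absolute error (L1) loss: sum of |e_i|."""
    return sum(abs(e) for e in errors)

def l2(errors):
    """Ordinary least squares (L2) loss: sum of e_i^2."""
    return sum(e ** 2 for e in errors)

def lw(errors, weights):
    """Weighted least squares loss: sum of w_i * e_i^2."""
    return sum(w * e ** 2 for w, e in zip(weights, errors))

e = [1.0, -2.0, 0.5]   # toy residuals e_i = Y_i - sum_j beta_j x_ij
w = [1.0, 0.5, 2.0]    # toy weights

print(l1(e), l2(e), lw(e, w))  # → 3.5 5.25 3.5
```

Note how l2 penalizes the large residual (-2.0) more heavily than l1 does, which is one reason the two criteria can lead to different estimates.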
l_1: the absolute error (L1 loss) criterion, l_2: the ordinary least squares (L2 loss) criterion, and l_w: the weighted least squares criterion.

Ordinary Least Squares (OLS) Criterion: Find the estimated coefficient vector β̂ = arg min_{β ∈ R^p} l_2(β) that minimizes the sum of squared errors, i.e.,

    min_{β ∈ R^p} l_2(β) = min_{β ∈ R^p} Σ_{i=1}^{n} (Y_i − Σ_{j=1}^{p} β_j x_ij)².

Historically, the LS criterion has been popular, since one can find its solution analytically as well as geometrically; therefore, its statistical properties can be studied easily. The absolute error loss criterion requires solving a linear programming problem, so it was difficult to derive its statistical properties analytically. Nowadays, regularized versions (minimization subject to some upper bound on the size of the vector β) of both these criteria are popular in data mining applications. For example:

Ridge regression: minimize the squared error loss subject to an upper bound on the L2-norm of the coefficient vector.

LASSO: minimize the squared error loss subject to an upper bound on the L1-norm of the coefficient vector.

Note that without concise notation, it is tedious to express these quantities. Vector/matrix notation for the response, predictor variables, error terms and the unknown coefficients:

    Y = (Y_1, Y_2, ..., Y_n)',  X = [x_1, x_2, ..., x_p] (an n × p matrix), where x_j = (x_1j, x_2j, ..., x_nj)',  β = (β_1, β_2, ..., β_p)',  ε = (ε_1, ε_2, ..., ε_n)'.
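The vector/matrix notation can be made concrete with plain Python lists. For a simple linear regression with an intercept, the design matrix X has a column of ones (the feature f_1(x) = 1) and a column of x values (the feature f_2(x) = x); the data below are illustrative:

```python
# Response vector Y (length n) and design matrix X (n x p), built from
# illustrative data for simple linear regression with an intercept.
x_values = [1.0, 2.0, 3.0, 4.0]
Y = [2.1, 3.9, 6.2, 8.1]

# Row i of X is (x_i1, x_i2) = (f_1(x_i), f_2(x_i)) = (1, x_i).
X = [[1.0, x] for x in x_values]

n, p = len(X), len(X[0])
print(n, p)   # → 4 2
print(X[0])   # → [1.0, 1.0]
```

Each row of X collects the feature values for one observation, so the sample GLM reads row-by-row as Y_i = Σ_j β_j x_ij + ε_i.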
Given the response vector Y and the design matrix X, the sample GLM can be written in matrix notation as

    Y = Xβ + ε,  E[ε] = 0,  Cov[ε] = ((cov(ε_i, ε_j))) = σ²I.    (5)

For this model,

    E[Y] = Xβ,  Cov(Y) = σ²I.    (6)

Thus the OLS criterion is equivalent to minimizing the residual sum of squares, i.e.,

    min_{β ∈ R^p} e'e = min_{β ∈ R^p} (Y − Xβ)'(Y − Xβ).

Expanding,

    S(β) = (Y − Xβ)'(Y − Xβ) = Y'Y − Y'Xβ − β'X'Y + β'X'Xβ.

In order to be able to write these models in compact notation, we need some background in linear algebra. In the next few lectures, we will review some of these tools.
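Setting the gradient of S(β) to zero gives the normal equations X'Xβ = X'Y. For a two-parameter model the resulting 2 × 2 system can be solved by hand; the following sketch in plain Python does exactly that for illustrative data (an intercept-plus-slope simple linear regression):

```python
# OLS for simple linear regression via the normal equations X'X b = X'Y.
x_values = [1.0, 2.0, 3.0, 4.0]
Y = [2.1, 3.9, 6.2, 8.1]          # illustrative responses
X = [[1.0, x] for x in x_values]  # n x 2 design matrix: columns (1, x)

# X'X (2x2) and X'Y (2-vector), computed entry by entry.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
xty = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(2)]

# Invert the 2x2 matrix X'X analytically, then beta_hat = (X'X)^{-1} X'Y.
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
inv = [[ xtx[1][1] / det, -xtx[0][1] / det],
       [-xtx[1][0] / det,  xtx[0][0] / det]]
beta_hat = [sum(inv[i][j] * xty[j] for j in range(2)) for i in range(2)]

# Residuals e = Y - X beta_hat; for OLS they are orthogonal to the columns of X.
resid = [y - sum(b * xij for b, xij in zip(beta_hat, row))
         for y, row in zip(Y, X)]

print("beta_hat:", [round(b, 3) for b in beta_hat])
```

For larger p one would solve the normal equations numerically (or via a QR decomposition) rather than inverting X'X explicitly; the 2 × 2 case is shown only to make the matrix algebra above concrete.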