Introduction: why use models?


Chapter 1

Introduction: why use models?

This course introduces models, and more specifically linear models. Accordingly, the first question should be: why the hell would I need models for my research? To answer that question, we will provide two examples that, we hope, will clarify the reasons why models are needed (and, as a side effect, might worry you a little bit...).

1.1 The stones problem

Imagine that you suffer from kidney stones, and you need to be treated to relieve the pain. Two techniques are available: ultra-sound stone dislocation (US) or open surgery (OS). Before choosing one of the options, you consult a specialist (or search on the internet...) to find out the probability of success of each technique. Calling a "success" the absence of recurrence of the problem within the 5 years following the treatment, you find the following results:

        Success   Failure   Total
US
Open
Total

Legend: experimental results on 2 cohorts of kidney stone patients.

Since the success rate is higher for US than for OS, you are likely to choose the ultra-sound treatment (and probably happy to avoid surgery!). So you will report your choice to your doctor. At this point, as a good practitioner, he will probably point out that the two treatments are not used in the same situations: when the stones are big, using US might not be a good solution, and surgery might be more efficient in that case. So you go back to the previous experimental results to see whether information on the size of the stones is available for the 700 patients. It turns out that this information can also be obtained, and so, to get a confirmation of the doctor's advice, you split the previous table into 2 tables, according to the size ("big" or "small") of the stone (your doctor has given you the rationale to classify a stone as big or small). And, using the same patients, you obtain the following tables:

Small stones
        Success   Failure   Total
US
Open
Total

Legend: results of kidney stone patients with small stones.

Big stones
        Success   Failure   Total
US
Open
Total

Legend: results of kidney stone patients with big stones.

Summing the corresponding cells of these two tables brings back the first table. Now, two things can be observed in these tables:

- many more patients are treated using US than using OS for small stones, and the reverse is true for big stones;
- more troubling, the rates of success are higher for OS in both tables...

And so we have come to this paradoxical situation where the global ("marginal") table shows that US is better than OS, while both partial (conditional on the stone size) tables show the reverse! This paradox is called Simpson's paradox, and it is one of the pitfalls met when using a simple model (i.e. the first table) that does not account for potential hidden variables (in our problem, the stone size). So, using an oversimplified model might lead to wrong decisions, which illustrates why the capacity to handle more complex models is strongly needed.

1.2 The diets problem

As another example, imagine that two sets of individuals have been selected to test a treatment allowing a faster recovery in case of bone fracture. The experiment is double-blind: neither the individuals nor the experimenter know who is receiving what (the treatment or a placebo). In this preliminary trial, only a few individuals have been used, because the goal is to set up the experimental protocol and to get a first idea of what can be expected. A bone densitometry is performed on both groups after a 6-month treatment. The results are presented (in a convenient unit) in the following table:

Placebo   Treatment
100       160
100       150
120       190
170       200
180       225

Legend: results of the bone density experiment on 10 subjects.

Simple tests such as an ANOVA or a Student t-test allow us to test the null hypothesis of the absence of effect of the treatment. Let's use an ANOVA. An example using R is provided below.

# First, define the data vectors
> Placebo <- c(100, 100, 120, 170, 180)
> Treatment <- c(160, 150, 190, 200, 225)
# Transform into densitometry observations and factor
> dens <- c(Placebo, Treatment)
> treat <- factor(c(rep("P", 5), rep("T", 5)))
# Build the model (ANOVA)
> m <- lm(dens ~ treat)
# Obtain the ANOVA table
> anova(m)
Analysis of Variance Table

Response: dens
          Df Sum Sq Mean Sq F value Pr(>F)
treat      1 6502.5  6502.5   5.407 0.0485
Residuals  8 9620.0  1202.5

This type of analysis (and the R code) will be detailed later on, but the important point is that the p-value associated with this test is p = 0.0485: such results are not expected (in the sense that the p-value is smaller than the usual 0.05 threshold) if the null hypothesis is true. Consequently, we have good reason to reject the null hypothesis (no difference between the groups), and the treatment will be deemed apparently efficient to increase bone density.

Again, as in the previous example, a careful practitioner will come up with questions about the sampling process. As is well known, bone density tends to decrease as individuals get older, and a relevant question is whether the two samples are balanced with respect to age (in other words, do the individuals in both cohorts have roughly the same age?). So, again, we come to a question about the effect of a hidden variable: the age of the patients. Assume that this information is available. The completed table would become, after adding the ages between parentheses:

Placebo     Treatment
100 (28)    160 (24)
100 (28)    150 (22)
120 (24)    190 (20)
170 (23)    200 (18)
180 (18)    225 (16)

Legend: results of the densitometry study on 10 experimental subjects, along with their ages (between parentheses).

A graphical representation of this dataset is provided below (Figure 1.1).

Figure 1.1: Treated (in blue) and placebo (in red) densitometries as a function of the patients' age.

The graph shows that:

- the age impacts the density, with a marked decay as the age increases;
- when treated and placebo patients have the same age, the densitometric value in the treated individuals seems to be higher than in the placebo group, possibly indicating that the treatment has a positive effect on the densitometry, as found using the simple approach given above.

Actually, after correcting for the age effect (using methods that will be developed in this course), the p-value for the test of the null hypothesis of no difference between treated and placebo patients is no longer significant (if you are too impatient to know how we have come to that result, you can check exercise 2 ("Densitometry"), which solves this problem using the linear model methodology...): according to this result, the null hypothesis is accepted: there seems to be no difference between treated and placebo!

So, as in the first example, the conclusions are really different when the hidden variable is taken into account in the model. And we come to the same conclusion: to obtain correct conclusions in our analyses, we have to be able to use more detailed models, encompassing potential confounding variables, instead of the usual simple (and often oversimplified) models.
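To give a first, concrete idea of what "correcting for the age effect" can look like, here is a minimal R sketch using the densitometry values and ages of the table above; it simply adds age as a covariate in the linear model, which is one possible way to do the correction (the exact method used in exercise 2 may differ, and the variable names are ours).

# Densitometry values and ages from the table above
dens  <- c(100, 100, 120, 170, 180, 160, 150, 190, 200, 225)
age   <- c( 28,  28,  24,  23,  18,  24,  22,  20,  18,  16)
treat <- factor(c(rep("Placebo", 5), rep("Treatment", 5)))

# Naive analysis, ignoring age (the ANOVA used above)
anova(lm(dens ~ treat))

# Analysis correcting for age: the treatment effect is now tested
# after accounting for the age differences between the two groups
anova(lm(dens ~ age + treat))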

Chapter 2

Linear models

2.1 Introduction

Linear models are a very important class of models in statistics. They embrace several basic techniques, such as simple and multiple regression or one-way and multi-way analyses of variance, and extend them to other, more general situations, allowing powerful and versatile modelling of many biological processes.

The idea at the basis of linear models is that many models rely on the assumption of a linear dependence between the measures taken in experiments (the observations, often noted y_i, with i varying from 1 to n) and the parameters of interest in these experiments. We will illustrate this relationship using two well-known situations (that will be further developed later in this text).

Modelling growth

Imagine that we are interested in some specific feed. One animal fed a diet based on this feed is followed for a few weeks, and its weight is recorded every week. This protocol would allow us to ask questions such as how the weight evolves with time (maybe to compare it with another diet). If the weight increases progressively as the weeks go by, we could want to model the growth using a model such as:

    y_i = β_0 + β_1·w_i

In words, it means that we assume that the weight y_i increases linearly: at the beginning of the experiment, the weight is β_0 = y_0; then, after 1 week (w_i = 1), the weight is y_1 = β_0 + β_1, after 2 weeks y_2 = β_0 + 2·β_1, etc. In this linear relationship, β_0 is thus equal to the weight at the beginning of the experiment and β_1 is the weekly weight gain. It is also the slope of the straight line representing the weight as a function of the week. Our collected data will allow us to obtain an estimation of the values of β_0 and β_1, which will be referred to as the parameters of the model.

If the experimenter identifies that the relationship between the weight and the week is not really linear, but rather curvilinear, he could try to model this relationship using (for example) a parabolic relationship rather than a linear one. The model would be this time:

    y_i = β_0 + β_1·w_i + β_2·w_i²
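Before moving on, here is a small R sketch of how these two growth models could be fitted with lm(); the weekly weights below are made-up values, used only to illustrate the calls.

# Made-up weekly weights, for illustration only
week   <- 0:6
weight <- c(12.1, 13.0, 14.2, 15.1, 15.8, 16.2, 16.5)

m.lin  <- lm(weight ~ week)              # linear growth: weight = b0 + b1*week
m.quad <- lm(weight ~ week + I(week^2))  # parabolic growth: adds a b2*week^2 term

coef(m.lin)
coef(m.quad)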

Note that, although the relationship between the observations y_i and the weeks w_i is now quadratic, the relationship between the observations and the parameters (β_0, β_1 and β_2) is still linear.

Another point is that these representations (as well as others that we could try to use) are only models of the reality: it should not be expected that they perfectly match the observations. In other words, in most cases, there will be (hopefully small) discrepancies between the predictions of the model (i.e. the weight after w_i weeks is β_0 + β_1·w_i) and the observed values (i.e. y_i). These discrepancies, noted e_i, have thus to be included in the model to better represent the data (see Figure 2.1). So the models above would become:

    y_i = β_0 + β_1·w_i + e_i

and

    y_i = β_0 + β_1·w_i + β_2·w_i² + e_i

Figure 2.1: a linear regression example. The red dots correspond to the observations, the black crosses to the values estimated using the linear regression. The residual (i.e. the difference between the observed value y and the predicted value ŷ) is shown for one of the points in the experiment.

These are well-known models: the first one is the classical linear regression model, and the second one is a quadratic regression model. We use the term "regression" to refer to situations where the dependent variable Y is assumed to depend on continuous independent variables (W, or W and W², are supposed continuous in the two models given above). The next example will provide another situation, where the independent variable is categorical instead of continuous.

As mentioned earlier, we will focus in the following sections of this chapter on the use of this type of models, involving a linear relationship between a continuous dependent variable and a set of parameters. We will learn how to estimate the parameters of the models and test hypotheses of interest about these parameters.

Remark: another situation, not covered in this text, occurs when the dependent variable is categorical, leading to an even broader class of models, known as generalized linear models.

Breed and blood pressure

In this example, we are interested in knowing whether blood pressure is under genetic control in a given species. Assuming that different breeds correspond to different genetic backgrounds, a simple experiment is to compare the blood pressures of random samples taken in different breeds. Of course, the samples would need to be balanced with respect to potential confounding factors such as age or gender, as explained in the preceding chapter. In this case, an easy model of the situation would be:

    y_ij = µ + α_i

In this expression, y_ij is the blood pressure measured on individual j of breed i, µ is the average blood pressure in the species, and α_i is the difference between the average blood pressure in breed i and µ. The parameters of this model would be µ and the α_i, i = 1, ..., B, where B is the number of breeds. Again, the relationship between the dependent variable Y and the parameters is linear.

And our goal is again to estimate these parameters and, more importantly, to test whether the α_i are significantly different from each other. If the answer to this last question turned out to be yes, we would have a confirmation that the blood pressure differs between breeds.

As in the preceding example, the simple model given above will not perfectly match the observations: with the current formula, it would mean that we consider that each individual in the same breed has the same blood pressure, which is of course not true. So, we again introduce a supplementary term (called the "residual") in the equation in order to model the individual differences with the breed average:

    y_ij = µ + α_i + e_ij

where e_ij is the difference between the blood pressure of individual j in breed i and the average blood pressure in this breed.

The point to note about this example is that we have again a representation of our data where a continuous dependent variable is modelled as a linear function of parameters of interest: we have built a linear model. The difference with the first example is that the independent variable is categorical ("the breed effect") in this second example, whereas it was continuous ("the week effect") in the first one. Incidentally, you might have identified this second example as an example of a one-factor analysis of variance. And so, as mentioned in the introduction, these classical models (linear and curvilinear regressions, analysis of variance) are instances of linear models, along with many other models. The good news is that the general theory of linear models will allow us to obtain generic procedures to estimate the parameters of all these models and test many hypotheses, as we will see.

To reach that ambitious goal, we need an efficient mathematical tool: the matrix notation. With a few definitions, matrices allow us to represent various models in a unified framework.

We will thus continue this chapter by introducing a few fundamental concepts on matrices, and then show how this notation allows us to represent various well-known situations. Some theory is then presented to explain testing in linear models, and examples are given to show the concepts in action.

2.2 Matrices

Many statistical models can be written in a compact form using matrices. Matrices are rectangular tables of coefficients. Some of their characteristics are provided hereafter.

Matrices properties

Dimensions

A matrix M with dimensions n and m is a rectangular table of coefficients with n lines and m columns. Matrices are often represented using a bold letter subscripted with the dimensions. For example, M_3×5 represents a matrix M with 3 lines and 5 columns. Matrices where one of the dimensions is 1 are called vectors. For example, V_1×3 is a line vector with 3 elements, and W_4×1 is a column vector with 4 elements. When both dimensions are equal to 1, the matrix reduces to a scalar coefficient.

Coefficients

Elements inside the matrix are called coefficients. They can be numbers or parameters of the model (this will be explained later). Each individual coefficient can be accessed by specifying its coordinates (line and column numbers). For example, M(2, 3) specifies the coefficient placed on the second line and in the third column of matrix M. This definition allows one to represent a matrix explicitly. Suppose N is a 2×3 matrix; it can be represented as:

    N = ( N(1,1)  N(1,2)  N(1,3) )
        ( N(2,1)  N(2,2)  N(2,3) )

Matrix operations

Equality

Two matrices are equal when they have the same dimensions and when corresponding elements are equal. So, N_a×b = M_c×d means that:

1. a = c and b = d,
2. N(i, j) = M(i, j) for all i = 1, ..., a and j = 1, ..., b.

Transposition

Transposing a matrix consists in "flipping" the matrix over its diagonal: the i-th line becomes the i-th column, and the j-th column becomes the j-th line.

The transposed matrix of N is noted N'. For example, for the 2×3 matrix N given above, we would write:

    N' = ( N(1,1)  N(2,1) )
         ( N(1,2)  N(2,2) )
         ( N(1,3)  N(2,3) )

Addition

Adding two (or more) matrices is straightforward. First, the matrices to be added must have the same dimensions. Second, the resulting matrix is obtained simply by adding the corresponding terms (i.e. the terms with the same coordinates i and j) of the matrices involved in the sum. As an illustration, if C = A + B, then C(i, j) = A(i, j) + B(i, j) for every i and j.

Subtraction

Subtraction is defined similarly: if S = A - B, the coefficients of S are obtained by subtracting the corresponding coefficients of B from the corresponding coefficients of A, i.e. S(i, j) = A(i, j) - B(i, j).

Multiplication

Multiplication is a bit more complicated. Actually, two different types of multiplication coexist.

Scalar multiplication

In some situations, all elements of a matrix are multiplied by a scalar coefficient. For example, we could write:

    V = ( σ²    ρ·σ² )  =  σ² · ( 1  ρ )  =  σ²·R
        ( ρ·σ²  σ²   )          ( ρ  1 )

Matrix multiplication

Matrix multiplication is more tricky. It is only defined in some situations, which we will detail now. The main constraint is that the number of columns of the first matrix (the one to the left of the * sign) must match the number of rows of the second matrix (the one to the right of the * sign). In other words, the multiplication of A_n×m and B_p×q is only defined if m = p.

A second characteristic is that the resulting matrix has the same number of rows as the first matrix and the same number of columns as the second matrix:

    A_m×n · B_n×q = C_m×q

Finally, the rule to obtain the coefficients of the product matrix is as follows:

    C(i, j) = Σ_{k=1}^{n} A(i, k)·B(k, j)

Let's show a few examples. We start with the product of two vectors. Suppose we want to multiply:

    v = ( 1  2  3 )   and   w = ( 4 )
                                ( 5 )
                                ( 6 )

The multiplication v_1×3 · w_3×1 is well defined since the condition on the matrix dimensions is fulfilled (3 columns in v and 3 lines in w). The result of the operation will have dimensions (1×1), which means that we will obtain a scalar number. Using the rule described above, we obtain:

    z = v·w = 1·4 + 2·5 + 3·6 = 32

At this point, it should be noted that the multiplication w·v is also allowed, since the number of columns of w (1) matches the number of lines of v. The result will be a (3×3) matrix, which shows that, for matrices, in general, w·v ≠ v·w. It can be computed that:

    w·v = ( 4   8  12 )
          ( 5  10  15 )
          ( 6  12  18 )

This property of matrix multiplications is referred to as the non-commutativity of the product.

Identity matrix

A square (i.e. with the same number of lines and columns) matrix I with I(i, j) = 1 when i = j and I(i, j) = 0 when i ≠ j is called an identity matrix. To see where this name comes from, let's do the following multiplication: I·A, where A is any matrix (of course, the dimension of I must be chosen such that the multiplication is possible). To show the multiplication, we use a (3×4) matrix A, but other dimensions would not change the conclusions:

    I_3×3 · A_3×4 = ( 1 0 0 )   ( a_11 a_12 a_13 a_14 )   ( a_11 a_12 a_13 a_14 )
                    ( 0 1 0 ) · ( a_21 a_22 a_23 a_24 ) = ( a_21 a_22 a_23 a_24 ) = A
                    ( 0 0 1 )   ( a_31 a_32 a_33 a_34 )   ( a_31 a_32 a_33 a_34 )
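All of the operations described so far can be checked directly in R, where matrix multiplication is written %*%. The sketch below reuses the small vectors of the example above; the 3×4 matrix is an arbitrary choice made only for illustration.

# The line and column vectors of the example above
v <- matrix(c(1, 2, 3), nrow = 1)
w <- matrix(c(4, 5, 6), ncol = 1)
v %*% w                    # a 1x1 matrix containing 32
w %*% v                    # a 3x3 matrix: the product is not commutative

# An arbitrary 3x4 matrix to illustrate the other operations
A <- matrix(1:12, nrow = 3, byrow = TRUE)
dim(A)                     # 3 lines, 4 columns
t(A)                       # transposition: a 4x3 matrix
A + A                      # addition, term by term
A - A                      # subtraction, term by term
2 * A                      # scalar multiplication
diag(3) %*% A              # diag(3) is the 3x3 identity matrix: the result is A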

It could be similarly shown that:

    A_3×4 · I_4×4 = A_3×4

The point is thus that multiplying any matrix A by the identity matrix I yields the same matrix A.

Inversion

Matrix inversion is a more complex operation. The idea behind this operation is given in its definition: a matrix, noted A⁻¹, is called the inverse of A if A·A⁻¹ = A⁻¹·A = I. For example, for a 2×2 matrix we have:

    ( a  b )⁻¹  =  1/(a·d - b·c) · (  d  -b )
    ( c  d )                       ( -c   a )

as can be verified by multiplication. Two remarks have to be made at this stage:

1. inverse matrices are only defined for square matrices,
2. not all square matrices have an inverse matrix.

As another example of matrices that are easy to invert, we can mention diagonal matrices. Diagonal matrices are square matrices for which A(i, j) = 0 when i ≠ j and A(i, j) ≠ 0 when i = j. The inverse is then simply obtained by taking a diagonal matrix with the same dimensions as the original matrix, and with diagonal elements equal to the reciprocals 1/A(i, i) of the corresponding elements of the original matrix, as can easily be verified.

A condition for a matrix to have an inverse is that no line (column) of the matrix is a linear combination of other lines (columns) of the matrix. Without loss of generality, we will explain what this means using the lines of the matrix. Note that a line of the matrix can be seen as a vector. Then, a line l_i is a linear combination of the other lines of the matrix if it can be written as l_i = Σ_{j≠i} λ_j·l_j for some coefficients λ_j. Let's provide an example. Suppose we have the following matrix:

    A = ( 1  3  2 )
        ( 2  2  1 )
        ( 5  7  4 )

The third line of that matrix can be written as:

    l_3 = 1·l_1 + 2·l_2

So, it is a linear combination of the 2 first lines and, accordingly, no inverse exists for A. Such a matrix is called a singular matrix. Of course, it is far from obvious to see whether such relationships exist, especially when the matrix is large. Fortunately, a global measure exists to tell us whether the matrix can be inverted or not. This measure is called the determinant of the matrix. Square matrices for which the determinant is equal to zero are singular. Square matrices for which the determinant is different from zero have a unique inverse.

Computing a determinant is generally cumbersome, especially for large matrices (the number of needed operations is equal to the factorial of the dimension of the matrix, so that for a 10×10 matrix this number is 10! = 3,628,800), but most matrix and statistical software can compute this determinant. We'll give an example using the R software.

# Create the A matrix
A <- matrix(c(1, 3, 2, 2, 2, 1, 5, 7, 4), byrow = T, nr = 3)
# Compute the determinant of A
det(A)     # of the order of e-16, i.e. equal to 0 (except for machine precision)
# Now, we change A(3,3). This breaks the linear relationship
# between the lines of the matrix
A[3, 3] <- 5
# We recompute the determinant of the modified A
det(A)
[1] -4
# We can compute the inverse of this modified A matrix
# using the solve() function of R
solve(A)
      [,1]  [,2]  [,3]
[1,] -0.75  0.25  0.25
[2,]  1.25  1.25 -0.75
[3,] -1.00 -2.00  1.00
# Let's check that this is the inverse of A
# using the matrix multiply operator '%*%'
A %*% solve(A)     # gives the 3x3 identity matrix, up to machine precision

Generalized inverses

In situations where the inverse cannot be computed (i.e. when the matrix is not square, or when the determinant of a square matrix is equal to 0), generalized inverse matrices can sometimes be used. A generalized inverse of A, noted A⁻ or G, is any matrix for which:

    A = A·G·A

Note that the classical inverse respects that definition. It is easy to build a generalized inverse G_1 for the singular matrix A of the previous example; note however that G_1 is not unique: (infinitely many) other generalized inverses G_2, G_3, ... exist for A.

Solving a system of equations using matrices

As a first illustration of the use of matrices, we will show how they can be used to solve systems of equations of arbitrary sizes.

We'll show that on a small example. Assume the following system has to be solved:

    x + 3y + 2z = 13
    2x + 2y + z = 9
    5x + 7y + 5z = 34

This system can also be presented using matrices. Let's define the unknowns vector x = (x, y, z)' and the right-hand side vector b = (13, 9, 34)'. We also define the system coefficients matrix:

    A = ( 1  3  2 )
        ( 2  2  1 )
        ( 5  7  5 )

It can be seen from these definitions that the system can be written as:

    A·x = b

using the matrix multiplication to compute A·x and the equality to obtain the 3 equations. Next, we compute the inverse of A, and pre-multiply each side of the matrix equality by it, which yields:

    A⁻¹·A·x = A⁻¹·b
    I·x = A⁻¹·b
    x = A⁻¹·b

and we obtain all solutions at once! Let's again do that with R. This is how it goes:

# Provide the right-hand side
> b <- c(13, 9, 34)
# Provide the coefficients matrix
> A <- matrix(c(1, 3, 2, 2, 2, 1, 5, 7, 5), byrow = T, nr = 3)
# Compute x
> x <- solve(A) %*% b
# Show x
> x
     [,1]
[1,]    1
[2,]    2
[3,]    3

The nice thing is that this procedure works whatever the size of the system. The hard part is the matrix inversion, but using software hides this tricky part behind a simple function call.

Incidentally, this example with a system of equations allows one to understand the problem arising when inverting matrices with linear dependencies between lines (or columns). Suppose that the system to be solved is:

    x + 3y + 2z = 13
    2x + 2y + z = 9
    5x + 7y + 4z = 31

The system coefficients matrix is thus:

    A = ( 1  3  2 )
        ( 2  2  1 )
        ( 5  7  4 )

and this matrix has been found to be singular. Accordingly, no inverse exists and the procedure described above cannot be used to obtain a solution. Now, it should be realized that, since the third line of the coefficients matrix is a combination of the first two, the same is true for the third equation. Let's consider the left-hand sides:

    1·(x + 3y + 2z) + 2·(2x + 2y + z) = 5x + 7y + 4z

Since the same can be said for the right-hand sides (1·13 + 2·9 = 31), the third equation is simply a combination of the first two and brings nothing new to the system. Consequently, the system has only 2 distinct equations and 3 unknowns, which leads to an infinite set of solutions (remark: if the relationship between the left-hand sides does not hold for the right-hand sides, then the system is said to be inconsistent and has no solution). This is why we could find an infinite set of generalized inverses for A, each leading to one of the possible solutions. We can obtain two different solutions using two different generalized inverses G_1 and G_2 of A: x_1 = G_1·b and x_2 = G_2·b. It can be verified that both these solutions (and infinitely many others) are valid solutions to the system. One of the generalized inverses, called the Moore-Penrose (MP) inverse, can be obtained using the R software, as shown in the following code snippet:

# Load the package containing the code for the 'ginv' function
> library(MASS)
# Provide a singular matrix
> A <- matrix(c(1, 3, 2, 2, 2, 1, 5, 7, 4), byrow = T, nr = 3)
# Obtain a generalized inverse
> G <- ginv(A)
# Show G %*% A (note that it is not the identity matrix)
> G %*% A
# Show that G is a generalized inverse: A %*% G %*% A reproduces A
> A %*% G %*% A
# Now for a rectangular matrix
> B <- matrix(c(1, 2, 3, 5, 4, 6), byrow = T, nr = 2)
# Obtain the MP inverse
> H <- ginv(B)
# Show B %*% H (here it is the 2x2 identity matrix, up to rounding)
> B %*% H
# Show that B %*% H %*% B reproduces B
> B %*% H %*% B
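As a complement, here is a small sketch showing, for the singular system above, that two different generalized inverses lead to two different but equally valid solutions. The two particular inverses below are our own choices for illustration (the first is built by inverting the non-singular top-left 2×2 block of A, a classical construction; the second is the Moore-Penrose inverse); they are not necessarily the G_1 and G_2 referred to in the text.

# The singular (but consistent) system discussed above
A <- matrix(c(1, 3, 2, 2, 2, 1, 5, 7, 4), byrow = TRUE, nrow = 3)
b <- c(13, 9, 31)

# A first generalized inverse: invert the non-singular top-left 2x2 block
# of A and pad the rest with zeros
G1 <- matrix(0, 3, 3)
G1[1:2, 1:2] <- solve(A[1:2, 1:2])
A %*% G1 %*% A               # reproduces A: G1 is a generalized inverse

# A second generalized inverse: the Moore-Penrose inverse
library(MASS)
G2 <- ginv(A)

# Two different solutions of the singular system
x1 <- G1 %*% b               # here (0.25, 4.25, 0)
x2 <- G2 %*% b               # a different solution
A %*% x1                     # reproduces b
A %*% x2                     # reproduces b (up to rounding)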

2.3 Linear models

Now that some material on matrices has been provided, we can start looking at a generic description of linear models. As the name clearly indicates, we are going to play with a model: we do not necessarily mean that the actual phenomenon has an exactly linear behaviour, but rather that the way the system behaves can be approximated using a linear assumption. In some situations, experimental results will indicate that this linear assumption should not be made, and linear models will then be of little help. But in many cases, they will be a useful starting assumption, possibly followed by refinements if necessary.

Linear models are thus models for which the relationship between the dependent variable (represented by an observations vector) and the parameters of the model (represented by a parameters vector) takes a linear form. More explicitly:

    y = X·β + e

In this expression:

1. y = (y_1, y_2, ..., y_n)' is a vector with the n observations made on the dependent variable,
2. β = (β_1, β_2, ..., β_m)' is a vector with the m unknown parameters that specify the model,
3. X_n×m is a matrix linking the parameters to the observations,
4. e = (e_1, e_2, ..., e_n)' is a vector with the n residuals of the model (i.e. the differences between the values predicted by the model and the observed values).

Unless otherwise stated, we will assume that the error terms are uncorrelated and all have the same variance (but the theory could be generalized to situations where these two assumptions are not valid).
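To see what the matrix X looks like in practice, here is a small R sketch for the two examples of section 2.1. The values used (weeks 1 to 5, three breeds with two individuals each) are arbitrary choices meant only to show the structure of X, and model.matrix() uses R's default, non-singular coding of a categorical effect, which differs slightly from the µ + α_i parameterization written earlier.

# Growth example: an intercept column of 1s and a column with the weeks
week <- 1:5
model.matrix(~ week)

# Breed example: 2 individuals in each of 3 breeds
breed <- factor(rep(c("A", "B", "C"), each = 2))
model.matrix(~ breed)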

Two main questions have to be tackled:

1. how do we estimate the parameters of the model?
2. how do we perform statistical tests on the parameters?

These questions will be addressed in the following sections.

Estimation of the parameters

If the relationship between the dependent (y) and independent (β) variables were exactly linear, it could be written as y = X·β and the residual term would be of no use. This term is added to acknowledge the fact that the observations do not behave exactly linearly, and randomly depart from the exact linear behaviour. Of course, the smaller the residuals, the closer the model sticks to the linear form, and we should choose model parameters that make the residuals as small as possible. This suggests a method to estimate the parameters of the model: let's use as estimates of β the values, noted β̂, that minimize the residual sum of squares. This can be written as:

    β̂ = argmin_β Σ_{i=1}^{n} e_i² = argmin_β e'e

This estimator is logically called the least squares estimator. It can be computed by first writing e'e as a function of β, then differentiating the obtained expression with respect to β and equating the derivative to zero:

    e'e = (y - X·β)'(y - X·β) = y'y - 2·β'X'y + β'X'X·β

Differentiating this expression with respect to the vector β and taking for β the value β̂ that makes the derivative equal to 0, we get:

    X'X·β̂ = X'y

When X'X is a non-singular matrix, the solution is thus obtained as:

    β̂ = (X'X)⁻¹·X'y

When X'X is a singular matrix, generalized inverses can be used to obtain (infinitely many) solutions of the form:

    β̂_0 = G_0·X'y

where the subscript makes clear that the obtained solution depends on the chosen generalized inverse.

Expectation and variances

The estimators we have obtained in the previous section have been constructed on the basis of the observations and of the design of the model, as is testified by the formula:

    β̂ = (X'X)⁻¹·X'y

Since the observations would vary from sample to sample, the estimators would also vary. So, new questions arise for these estimators: what is the expected value of the estimators? And what is their variance?

We start with the expected value, and we will assume that the design of the experiment remains fixed: accordingly, the X matrix (sometimes called the "design matrix") will be assumed fixed. Then, the expectation of the estimator is:

    E[β̂] = E[(X'X)⁻¹·X'y] = (X'X)⁻¹·X'·E[y]

Since y = X·β + e,

    E[y] = E[X·β + e] = E[X·β] + E[e]

In this expression, X is fixed, and thus constant, β is an unknown constant vector, and the expectation of e is 0. We can thus write that:

    E[β̂] = (X'X)⁻¹·X'X·β = β

Consequently, the estimator is unbiased.

Next, we compute the variances. First, we have to define the variance of a vector x of dimension n as a matrix of dimensions n×n, where the diagonal elements are the variances of the corresponding elements of the vector, and the off-diagonal elements are the covariances between the corresponding elements of the vector. Also, it is not difficult to demonstrate that V(K·x) = K·V(x)·K'. Coming back to our problem, we can use these properties as follows:

    V[β̂] = V[(X'X)⁻¹·X'y] = (X'X)⁻¹·X'·V(y)·X·(X'X)⁻¹

Since X·β is constant, we have that:

    V(y) = V(e) = R

where R is the variance-covariance matrix of the residuals. In many cases (but not always), the residuals are assumed homoscedastic (V(e_1) = V(e_2) = ... = V(e_n)) and independent (Cov(e_i, e_j) = 0 for any i = 1, ..., n, j = 1, ..., n and i ≠ j). Consequently, R = I·σ_e². In that case, the variance of the estimators becomes:

    V[β̂] = (X'X)⁻¹·X'X·(X'X)⁻¹·σ_e² = (X'X)⁻¹·σ_e²

Note that σ_e² is also a parameter that needs to be estimated. Accordingly, we can most of the time only get an estimator of the variance-covariance matrix, similarly to what is done in simple tests where the estimated variance is plugged into the formula, replacing a Gaussian variable z by a Student t one.
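These two results (unbiasedness and the form of the variance-covariance matrix) can be illustrated with a small simulation; the design matrix, true parameter values and error variance below are arbitrary choices made only for this illustration.

set.seed(1)
X    <- cbind(1, 1:5)          # a fixed design matrix (intercept + slope)
beta <- c(4, 0.3)              # "true" parameter values
s2e  <- 0.01                   # "true" residual variance

# Simulate many datasets and estimate beta each time
est <- replicate(10000, {
  y <- X %*% beta + rnorm(nrow(X), sd = sqrt(s2e))
  drop(solve(t(X) %*% X) %*% t(X) %*% y)
})

rowMeans(est)                  # close to beta: the estimator is unbiased
var(t(est))                    # close to the theoretical variance below
solve(t(X) %*% X) * s2e        # (X'X)^-1 * sigma_e^2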

Testing hypotheses

After obtaining a reasonable solution for the estimators of the parameters of our model, our next task will be to use these estimates to test hypotheses. Our tests will target the true parameter values (i.e. the β_i) while we only have estimates (i.e. the β̂_i) of these values. Accordingly, we will need to take the random variations of the estimators into account to draw conclusions. Since we want to have quite general procedures, we will use the most general linear hypothesis about the parameters as a basis for future tests. In other words, the null hypothesis we have in mind is:

    H_0: L·β = c

where β is the (m×1) (true) parameters vector, L is a (k×m) known matrix and c is a (k×1) known vector. We will show in the examples that this general linear hypothesis allows us to obtain the usual tests and many others that might be useful. Since the derivation of the statistic to be used is beyond the scope of this course (and of no particular interest for the students...), we will jump straight to the result of the derivation:

    F(q, n - r(X)) = (L·β̂ - c)'·(L·G·L')⁻¹·(L·β̂ - c) · (n - r(X)) / [ (y - X·β̂)'·(y - X·β̂) · q ]

or, equivalently and maybe more informatively:

    F(q, n - r(X)) = [ (L·β̂ - c)'·(L·G·L')⁻¹·(L·β̂ - c) / q ] / [ (y - X·β̂)'·(y - X·β̂) / (n - r(X)) ]

where:

1. q is the number of independent rows (i.e. hypotheses) of L,
2. r(X) is the rank of X, representing the number of independent columns of X.

The rationale of the F computation generalizes the usual computations used in simple models such as linear regressions or ANOVA:

1. under the null hypothesis, the numerator and the denominator in the latter formula both provide estimates of the error variance σ_e². This is particularly visible in the denominator, since (y - X·β̂) = e, and thus the denominator is e'e/(n - r(X)), i.e. the sum of the squared residuals divided by the number of degrees of freedom of this sum of squares;
2. under the alternative hypothesis, the denominator still provides an unbiased estimate of σ_e² (it does not depend on the hypothesis), while the numerator will be inflated.

To obtain this result, it has been assumed that the random residuals of the model (the e_i) are independent, normally distributed and homoscedastic. Although this extravagant formula might admittedly be complicated to apprehend, the few examples below will help in understanding its meaning. The point is that we can obtain F values to test any linear hypothesis of interest (unless these hypotheses are not testable, a situation we will not consider here).
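As an illustration, the formula can be translated almost literally into a small R helper function. This function is our own sketch (not a built-in R function); it assumes the hypothesis L·β = c is testable, and it uses a generalized inverse of X'X so that it also works with singular designs.

# A direct translation of the general F test (our own helper, for illustration)
general.F.test <- function(y, X, L, c0) {
  L <- matrix(L, ncol = ncol(X))                 # one hypothesis per row of L
  G <- MASS::ginv(t(X) %*% X)                    # a generalized inverse of X'X
  b <- G %*% t(X) %*% y                          # a solution of X'X b = X'y
  q <- qr(L)$rank                                # number of independent hypotheses
  r <- qr(X)$rank                                # rank of X
  num  <- t(L %*% b - c0) %*% solve(L %*% G %*% t(L)) %*% (L %*% b - c0) / q
  den  <- t(y - X %*% b) %*% (y - X %*% b) / (length(y) - r)
  Fval <- drop(num / den)
  c(F = Fval, df1 = q, df2 = length(y) - r,
    p = pf(Fval, q, length(y) - r, lower.tail = FALSE))
}

For the regression example of the next section, calling general.F.test(Y, X, L = c(0, 1), c0 = 0), with X the design matrix and Y the observations defined there, should reproduce the F and p values computed below.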

2.4 Some simple models

As claimed at the start of the chapter, we now present 3 analyses that will provide (hopefully) illuminating examples of the use of linear models.

A simple linear regression

At the start of this chapter, we imagined an analysis where an individual, fed with a new diet, would be followed for a few weeks to analyse how its weight changes over time. The idea was thus to model the weight as a function of time (expressed in weeks), and to be able to test hypotheses on the parameters of the model (such as "is the weekly weight gain equal to 0?", or other similar hypotheses). We will now address this problem using the tools developed in the preceding sections.

Consider the following dataset, where the individual's weight has been measured during the 5 first weeks of a new diet. Here are the results:

Weeks     1     2     3     4     5
Weights   4.1   4.6   4.9   5.2   5.4

Plotting these data and adding a trend line gives the following figure:

Figure 2.2: Experimental points (in blue) and trend (in red).
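A plot like Figure 2.2 can be produced with a couple of R commands (the colours below are our choice, made to match the caption):

weeks   <- 1:5
weights <- c(4.1, 4.6, 4.9, 5.2, 5.4)
plot(weeks, weights, col = "blue", pch = 19)   # experimental points
abline(lm(weights ~ weeks), col = "red")       # fitted trend line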

Figure 2.2 clearly shows that the experimental points almost lie on a straight line, which suggests using linear regression as a model. So, the idea is to fit to this data set an equation linking the weights Y_i linearly to the corresponding weeks X_i. The equation is:

    Y_i = β_0 + β_1·X_i + e_i,   i = 1, ..., n

where the couples (X_i, Y_i), i = 1, ..., n, are the observations, the e_i represent the residuals, and β_0 and β_1 are the 2 parameters to be estimated. This can also be written using matrix notation as:

    y = X·β + e

with y = (Y_1, Y_2, ..., Y_n)', e = (e_1, e_2, ..., e_n)', β = (β_0, β_1)' and:

    X = ( 1  X_1 )   ( 1  1 )
        ( 1  X_2 ) = ( 1  2 )
        ( ...    )   ( ... )
        ( 1  X_n )   ( 1  5 )

Estimation of the parameters

We can use the general procedure described above in order to find the least squares estimates of the parameters of the model (β_0 and β_1). We have seen that the solution is provided by:

    ( β̂_0 )  =  (X'X)⁻¹·X'y
    ( β̂_1 )

Using the matrices given above, this leads to:

    ( β̂_0 )  =  ( 5   15 )⁻¹ ( 24.2 )
    ( β̂_1 )     ( 15  55 )   ( 75.8 )

The inversion of a 2×2 matrix has been seen above, and so we can obtain:

    ( β̂_0 )  =  (  1.1  -0.3 ) ( 24.2 )  =  ( 3.88 )
    ( β̂_1 )     ( -0.3   0.1 ) ( 75.8 )     ( 0.32 )

We have thus obtained the estimators of the intercept (β̂_0) and of the slope (β̂_1). This can also easily be done using R:

# Create the incidence matrix
> X <- matrix(c(rep(1, 5), 1:5), byrow = F, nr = 5)
> Y <- c(4.1, 4.6, 4.9, 5.2, 5.4)
# Next, compute the estimates
> b <- solve(t(X) %*% X) %*% t(X) %*% Y
> b
     [,1]
[1,] 3.88
[2,] 0.32

Of course, this code makes it necessary to manipulate the dataset explicitly (and to remember the formulae...). Fortunately, a more implicit writing is possible using the lm function (which stands for "linear model"). The code is:

# First, define the data vectors
> X <- 1:5
> Y <- c(4.1, 4.6, 4.9, 5.2, 5.4)
# Next compute the parameters of the model
# The '~' sign means 'is a linear function of'
> lm(Y ~ X)

Call:
lm(formula = Y ~ X)

Coefficients:
(Intercept)            X
       3.88         0.32

In this simple case, we can show that the matrix computations lead to the classical formulae of linear regression. We can start by computing X'X:

    X'X = ( n      Σ X_i  )
          ( Σ X_i  Σ X_i² )

as can easily be obtained. Note that this matrix is always invertible, and we don't need, in this case, to resort to generalized inverse matrices. Now it can be verified that:

    (X'X)⁻¹ = 1 / [ n·Σ X_i² - (Σ X_i)² ] · (  Σ X_i²   -Σ X_i )
                                            ( -Σ X_i      n    )

Although simply formulated in matrix format, this is a bit more difficult to read when developed under this form. It can nevertheless be simplified using the following result. Noting the sample averages of the X_i (Y_i) as X̄ (Ȳ), we have that:

    Σ_{i=1}^{n} x_i² = Σ (X_i - X̄)²
                     = Σ (X_i² + X̄² - 2·X_i·X̄)
                     = Σ X_i² + n·X̄² - 2·X̄·Σ X_i
                     = Σ X_i² - (Σ X_i)²/n

Similarly, it can be shown that:

    Σ_{i=1}^{n} x_i·y_i = Σ (X_i - X̄)·(Y_i - Ȳ) = Σ X_i·Y_i - (Σ X_i)·(Σ Y_i)/n

This allows us to rewrite the inverse matrix as:

    (X'X)⁻¹ = 1 / (n·Σ x_i²) · (  Σ X_i²   -Σ X_i )
                               ( -Σ X_i      n    )

Computing X'y is easy:

    X'y = ( Σ Y_i     )
          ( Σ X_i·Y_i )

Accordingly, we obtain the following well-known estimates for the parameters:

    β̂_1 = Σ x_i·y_i / Σ x_i²

and:

    β̂_0 = Ȳ - β̂_1·X̄

Coming back to our example, we can very easily compute these two estimators. Using R, this is how one could do it:

# First, define the data vectors
> X <- 1:5
> Y <- c(4.1, 4.6, 4.9, 5.2, 5.4)
# Next, compute the x and y (centred) vectors
> x <- X - mean(X)
> y <- Y - mean(Y)
# Compute b1
> b1 <- sum(x * y) / sum(x * x)
> b1
[1] 0.32
# Compute b0
> b0 <- mean(Y) - b1 * mean(X)
> b0
[1] 3.88

Testing an hypothesis

After estimating the parameters, the next challenge is generally to test hypotheses of interest about the true values of the parameters. Remember that we only obtained estimators, and not the true values, of the parameters of our model. In this simple example (and, in general, in linear regression problems), we are most often interested in knowing whether the dependent variable varies (linearly) when the independent variable changes: does the weight change (linearly) week after week? If this is not the case, the weight would remain stable and the straight line passing more or less through the points would be horizontal, meaning that its slope would be null. Accordingly, testing whether the slope is equal to 0 corresponds to testing whether the weight does not linearly depend on the time (expressed in weeks). So, this null hypothesis can be written:

    H_0: β_1 = 0

Using the general formulation discussed above, this means that:

1. L = ( 0  1 ),
2. c = 0.

It is easy to see that these choices of L and c lead to the hypothesis we are interested in:

    L·β = ( 0  1 )·( β_0 )  =  β_1  =  c  =  0
                   ( β_1 )

Furthermore:

1. G = (X'X)⁻¹ (since no dependency exists between the lines of X'X, the inverse exists and is unique),
2. q = 1,
3. r(X) = 2 (the 2 columns of X are linearly independent).

Plugging all these results into the formula given to obtain the F value, we get:

    F(1, 3) = β̂_1² · Σ_{i=1}^{n} x_i² · (3/1) / Q

where Q = (y - X·β̂)'·(y - X·β̂) = Σ_{i=1}^{n} e_i² is the error sum of squares. It is visible from this formula that we have obtained the classical test of the linear regression. So the complicated test formula boils down to the classical linear regression test formula in this simple setting. The following code shows how the test can be carried out using R. Let's start with an explicit matrix computation of the test:

# Data
X <- matrix(c(rep(1, 5), 1:5), byrow = FALSE, nr = 5)
y <- c(4.1, 4.6, 4.9, 5.2, 5.4)
# Hypothesis
L <- c(0, 1)
c <- 0
# Solve
G <- solve(t(X) %*% X)
b <- G %*% t(X) %*% y
# Build the test
LGLi  <- solve(L %*% G %*% L)
numer <- (L %*% b - c) %*% LGLi %*% t(L %*% b - c)
numer <- numer * (length(y) - qr(X)$rank)
denom <- t(y - X %*% b) %*% (y - X %*% b)
F <- numer / denom
# Show F
> F
         [,1]
[1,] 109.7143
# Compute p-value
> pf(F, 1, 3, lower.tail = FALSE)
         [,1]
[1,] 0.001858

We can also use the simplified formula we just derived to obtain the same results:

# Create the incidence matrix
> X <- matrix(c(rep(1, 5), 1:5), byrow = FALSE, nr = 5)
> Y <- c(4.1, 4.6, 4.9, 5.2, 5.4)
# Compute the estimates
> b <- solve(t(X) %*% X) %*% t(X) %*% Y
# Compute Q
> Q <- t(Y - X %*% b) %*% (Y - X %*% b)
# Compute the sum of the squared x
> x <- X[, 2] - mean(X[, 2])
> sx2 <- sum(x * x)

# Compute F
> F <- b[2] * b[2] * sx2 * 3 / Q
# Show F
> F
         [,1]
[1,] 109.7143
# Compute p-value
> pf(F, 1, 3, lower.tail = FALSE)
         [,1]
[1,] 0.001858

The pf(...) function is used to compute the p-value associated with the test of the null hypothesis. Its 3 first arguments are the value of F and the degrees of freedom of the numerator (q, which is equal to 1 in the example) and of the denominator (n - rank(X) = 3 in our example). The last argument indicates that we want the probability of exceeding the given value of F (by default, the function returns the lower tail, i.e. the probability of being lower than F). This p-value is often reported in publications, with values conventionally reported as "significant" when lower than 0.05, "very significant" when lower than 0.01 and "extremely significant" when lower than 0.001 (but these conventions might differ from paper to paper).

Of course, again, if we don't want to use this overly complicated approach in this simple situation, we can still use the embedded facilities of the lm call we showed in the previous paragraph:

# First, define the data vectors
> X <- 1:5
> Y <- c(4.1, 4.6, 4.9, 5.2, 5.4)
# Next compute the parameters of the model
# The result of the call to lm is an R object stored in lr
> lr <- lm(Y ~ X)
# Various features of the object can be requested.
> summary(lr)

Call:
lm(formula = Y ~ X)

Residuals:
    1     2     3     4     5
-0.10  0.08  0.06  0.04 -0.08

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.8800     0.1013   38.29 3.92e-05 ***
X             0.3200     0.0306   10.47  0.00186 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.0966 on 3 degrees of freedom
Multiple R-squared: 0.9734,    Adjusted R-squared: 0.9645
F-statistic: 109.7 on 1 and 3 DF,  p-value: 0.001858

You can check that the same results are obtained using the 3 methods, with of course much less hassle in the last approach... These last results deserve more comments. First, it should be noted that we have created a model object in R, called lr, that contains information on the model. The function summary allows us to extract part of the information present in lr.

1. The line following "Call:" gives the code leading to the model object.
2. The lines following "Residuals:" provide the residuals e_i obtained after fitting the model. For example, for the first observation, e_1 = y_1 - β̂_0 - β̂_1·w_1 = 4.1 - 3.88 - 0.32 = -0.1.

3. The lines after "Coefficients:" show the estimators of the parameters of the model, along with their standard errors. The null hypotheses H_0: β_0 = 0 and H_0: β_1 = 0 are then tested, and the results of the classical t-tests are provided in the last two columns: the value of t first, then the probability of exceeding the corresponding t value if the null hypothesis is true. This probability is often referred to as the p-value. As can be seen for β_1, and as could easily be obtained for β_0 (try it...!), the p-values of the t tests are the same as the ones obtained using our general F tests (the two tests are actually equivalent). The standard errors are obtained as square roots of the corresponding variance terms of the variance-covariance matrix of β̂, obtained above:

   (a) an estimator s_e² of the error variance σ_e² is first obtained as s_e² = Σ_{i=1}^{n} e_i² / [n - rank(X)]. In our example, Σ e_i² = 0.028, n = 5 and the rank of X is 2. So, the estimation of the error variance is s_e² = 0.028/3 = 0.00933 and the residual standard error is s_e = 0.0966;

   (b) estimators of the variances of β̂_0 and β̂_1 are then easily obtained using the coefficients of the (X'X)⁻¹ matrix derived above. The variances are s²(β̂_0) = 1.1·s_e² and s²(β̂_1) = 0.1·s_e², and the corresponding standard errors are √1.1·s_e = 0.1013 and √0.1·s_e = 0.0306, respectively. The t values are then computed as t = β̂/se(β̂), yielding the values 38.29 and 10.47, respectively. Checking the probability of exceeding such values of t in the t distribution with n - rank(X) = 3 degrees of freedom leads to the p-values shown in the last column. In R, it is easy to compute such probabilities using the pt(...) function, as shown below:

# (X and b are the design matrix and the estimates from the matrix-based code above)
> e <- Y - X %*% b
> s2e <- t(e) %*% e / 3
> se0 <- sqrt(solve(t(X) %*% X)[1, 1] * s2e)
> p.uni <- pt(b[1] / se0, df = 3, lower.tail = FALSE)
> p.bi <- 2 * p.uni
> p.bi
[1] 3.92e-05
> se1 <- sqrt(solve(t(X) %*% X)[2, 2] * s2e)
> p.uni <- pt(b[2] / se1, df = 3, lower.tail = FALSE)
> p.bi <- 2 * p.uni
> p.bi
[1] 0.001858

   So, the function pt(...) returns the probability that a t value with 3 degrees of freedom ("df=3") is larger than the observed value ("b[1]/se0" or "b[2]/se1"); the option "lower.tail=FALSE" asks for the probability of being larger than the observed value, while "lower.tail=TRUE" would return the probability of being lower. Finally, since we want the (bilateral) probability of being either larger than b/se or smaller than -b/se, and since these two probabilities are equal due to the symmetry of the t distributions, the probability is multiplied by 2.

4. The R-squared values provide information on the fit of the model: values close to 1 indicate that the model sticks very well to the data, while values close to 0 mean that the model poorly explains the observed variation.

Since R² always increases when more factors are added to the model, even when those factors are not relevant, a correction has been introduced to take the number of independent variables into account in the computation of R², leading to the adjusted R-squared value.

A last note on the model object generated by a call to the function lm(...): other features are available besides the ones shown in the previous code. A list is provided through the names(...) command, as shown below:

# Obtain a list of the features available in the object lr
> names(lr)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
# Obtain one of these features, for example:
> lr$fitted.values
   1    2    3    4    5
4.20 4.52 4.84 5.16 5.48

A comparison of groups

In this section, we will consider the following problem, similar to the blood pressure problem in the introduction: the expression of a gene has been measured on horses from three different breeds (Ardennes, Warmblood, Half-blood) in order to determine whether the breed has an impact on the activity of this gene. Measures have been calibrated (i.e. made relative to a constantly expressed gene) and standardized. Here are the obtained results:

Ardennes    Warm    Half

Questions of interest could be: is there a significant difference between the breeds with respect to the expression of this gene? And is it reasonable to consider that the expression in half-blood horses is average when compared to the two other breeds?

These questions can again be addressed using a linear model. As explained in the introduction, the observations can be represented as linear functions of the (categorical) breed effect, using the following model:

    y_ij = µ + α_i + e_ij

where:

1. y_ij is the observation made on the j-th individual in breed i,
2. µ is the average (over the horse population) value of the gene expression,
3. α_i is the average impact of breed i, i = 1, ..., 3, on the gene expression,
4. e_ij is the departure of animal j in breed i from its breed average.
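As a preview of how these two questions could be addressed in R, here is a short sketch; the expression values below are made-up placeholders chosen only to illustrate the workflow, not the actual measurements of the table above.

# Made-up expression values, for illustration only (4 horses per breed)
expr  <- c(1.10, 0.95, 1.20, 1.05,    # Ardennes
           0.70, 0.85, 0.60, 0.75,    # Warmblood
           0.90, 1.00, 0.85, 0.95)    # Half-blood
breed <- factor(rep(c("Ardennes", "Warm", "Half"), each = 4))

# First question: is there a breed effect at all?
m <- lm(expr ~ breed)
anova(m)                    # overall F test of the breed effect

# Second question: a first look at the three breed averages
tapply(expr, breed, mean)

A formal answer to the second question would test the contrast α_Half - (α_Ardennes + α_Warm)/2 = 0, which can be done with the general F test introduced earlier.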


9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

Your schedule of coming weeks. One-way ANOVA, II. Review from last time. Review from last time /22/2004. Create ANOVA table

Your schedule of coming weeks. One-way ANOVA, II. Review from last time. Review from last time /22/2004. Create ANOVA table Your schedule of coming weeks One-way ANOVA, II 9.07 //00 Today: One-way ANOVA, part II Next week: Two-way ANOVA, parts I and II. One-way ANOVA HW due Thursday Week of May Teacher out of town all week

More information

Chapter 27 Summary Inferences for Regression

Chapter 27 Summary Inferences for Regression Chapter 7 Summary Inferences for Regression What have we learned? We have now applied inference to regression models. Like in all inference situations, there are conditions that we must check. We can test

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Statistical methods research done as science rather than math: Estimates on the boundary in random regressions

Statistical methods research done as science rather than math: Estimates on the boundary in random regressions Statistical methods research done as science rather than math: Estimates on the boundary in random regressions This lecture is about how we study statistical methods. It uses as an example a problem that

More information

ACCUPLACER MATH 0311 OR MATH 0120

ACCUPLACER MATH 0311 OR MATH 0120 The University of Teas at El Paso Tutoring and Learning Center ACCUPLACER MATH 0 OR MATH 00 http://www.academics.utep.edu/tlc MATH 0 OR MATH 00 Page Factoring Factoring Eercises 8 Factoring Answer to Eercises

More information

Getting Started with Communications Engineering. Rows first, columns second. Remember that. R then C. 1

Getting Started with Communications Engineering. Rows first, columns second. Remember that. R then C. 1 1 Rows first, columns second. Remember that. R then C. 1 A matrix is a set of real or complex numbers arranged in a rectangular array. They can be any size and shape (provided they are rectangular). A

More information

Finite Mathematics : A Business Approach

Finite Mathematics : A Business Approach Finite Mathematics : A Business Approach Dr. Brian Travers and Prof. James Lampes Second Edition Cover Art by Stephanie Oxenford Additional Editing by John Gambino Contents What You Should Already Know

More information

Lecture 13: Simple Linear Regression in Matrix Format

Lecture 13: Simple Linear Regression in Matrix Format See updates and corrections at http://www.stat.cmu.edu/~cshalizi/mreg/ Lecture 13: Simple Linear Regression in Matrix Format 36-401, Section B, Fall 2015 13 October 2015 Contents 1 Least Squares in Matrix

More information

A primer on matrices

A primer on matrices A primer on matrices Stephen Boyd August 4, 2007 These notes describe the notation of matrices, the mechanics of matrix manipulation, and how to use matrices to formulate and solve sets of simultaneous

More information

M. Matrices and Linear Algebra

M. Matrices and Linear Algebra M. Matrices and Linear Algebra. Matrix algebra. In section D we calculated the determinants of square arrays of numbers. Such arrays are important in mathematics and its applications; they are called matrices.

More information

Example: Poisondata. 22s:152 Applied Linear Regression. Chapter 8: ANOVA

Example: Poisondata. 22s:152 Applied Linear Regression. Chapter 8: ANOVA s:5 Applied Linear Regression Chapter 8: ANOVA Two-way ANOVA Used to compare populations means when the populations are classified by two factors (or categorical variables) For example sex and occupation

More information

R 2 and F -Tests and ANOVA

R 2 and F -Tests and ANOVA R 2 and F -Tests and ANOVA December 6, 2018 1 Partition of Sums of Squares The distance from any point y i in a collection of data, to the mean of the data ȳ, is the deviation, written as y i ȳ. Definition.

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College 1-Way ANOVA MATH 143 Department of Mathematics and Statistics Calvin College Spring 2010 The basic ANOVA situation Two variables: 1 Categorical, 1 Quantitative Main Question: Do the (means of) the quantitative

More information

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras

Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Lecture - 39 Regression Analysis Hello and welcome to the course on Biostatistics

More information

LS.1 Review of Linear Algebra

LS.1 Review of Linear Algebra LS. LINEAR SYSTEMS LS.1 Review of Linear Algebra In these notes, we will investigate a way of handling a linear system of ODE s directly, instead of using elimination to reduce it to a single higher-order

More information

Correlation & Simple Regression

Correlation & Simple Regression Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.

More information

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878 Contingency Tables I. Definition & Examples. A) Contingency tables are tables where we are looking at two (or more - but we won t cover three or more way tables, it s way too complicated) factors, each

More information

22s:152 Applied Linear Regression. Take random samples from each of m populations.

22s:152 Applied Linear Regression. Take random samples from each of m populations. 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

Introduction to Linear Regression

Introduction to Linear Regression Introduction to Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Introduction to Linear Regression 1 / 46

More information

Confidence Intervals. - simply, an interval for which we have a certain confidence.

Confidence Intervals. - simply, an interval for which we have a certain confidence. Confidence Intervals I. What are confidence intervals? - simply, an interval for which we have a certain confidence. - for example, we are 90% certain that an interval contains the true value of something

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Sleep data, two drugs Ch13.xls

Sleep data, two drugs Ch13.xls Model Based Statistics in Biology. Part IV. The General Linear Mixed Model.. Chapter 13.3 Fixed*Random Effects (Paired t-test) ReCap. Part I (Chapters 1,2,3,4), Part II (Ch 5, 6, 7) ReCap Part III (Ch

More information

Section 3: Simple Linear Regression

Section 3: Simple Linear Regression Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction

More information

LECTURE 15: SIMPLE LINEAR REGRESSION I

LECTURE 15: SIMPLE LINEAR REGRESSION I David Youngberg BSAD 20 Montgomery College LECTURE 5: SIMPLE LINEAR REGRESSION I I. From Correlation to Regression a. Recall last class when we discussed two basic types of correlation (positive and negative).

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

Eigenvectors and Hermitian Operators

Eigenvectors and Hermitian Operators 7 71 Eigenvalues and Eigenvectors Basic Definitions Let L be a linear operator on some given vector space V A scalar λ and a nonzero vector v are referred to, respectively, as an eigenvalue and corresponding

More information

Page 52. Lecture 3: Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 2008/10/03 Date Given: 2008/10/03

Page 52. Lecture 3: Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 2008/10/03 Date Given: 2008/10/03 Page 5 Lecture : Inner Product Spaces Dual Spaces, Dirac Notation, and Adjoints Date Revised: 008/10/0 Date Given: 008/10/0 Inner Product Spaces: Definitions Section. Mathematical Preliminaries: Inner

More information

Inference for Regression

Inference for Regression Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu

More information

MITOCW ocw f99-lec01_300k

MITOCW ocw f99-lec01_300k MITOCW ocw-18.06-f99-lec01_300k Hi. This is the first lecture in MIT's course 18.06, linear algebra, and I'm Gilbert Strang. The text for the course is this book, Introduction to Linear Algebra. And the

More information

16.400/453J Human Factors Engineering. Design of Experiments II

16.400/453J Human Factors Engineering. Design of Experiments II J Human Factors Engineering Design of Experiments II Review Experiment Design and Descriptive Statistics Research question, independent and dependent variables, histograms, box plots, etc. Inferential

More information

Review from Bootcamp: Linear Algebra

Review from Bootcamp: Linear Algebra Review from Bootcamp: Linear Algebra D. Alex Hughes October 27, 2014 1 Properties of Estimators 2 Linear Algebra Addition and Subtraction Transpose Multiplication Cross Product Trace 3 Special Matrices

More information

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.]

[Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty.] Math 43 Review Notes [Disclaimer: This is not a complete list of everything you need to know, just some of the topics that gave people difficulty Dot Product If v (v, v, v 3 and w (w, w, w 3, then the

More information

A Introduction to Matrix Algebra and the Multivariate Normal Distribution

A Introduction to Matrix Algebra and the Multivariate Normal Distribution A Introduction to Matrix Algebra and the Multivariate Normal Distribution PRE 905: Multivariate Analysis Spring 2014 Lecture 6 PRE 905: Lecture 7 Matrix Algebra and the MVN Distribution Today s Class An

More information

Solution Set 7, Fall '12

Solution Set 7, Fall '12 Solution Set 7, 18.06 Fall '12 1. Do Problem 26 from 5.1. (It might take a while but when you see it, it's easy) Solution. Let n 3, and let A be an n n matrix whose i, j entry is i + j. To show that det

More information

Elementary maths for GMT

Elementary maths for GMT Elementary maths for GMT Linear Algebra Part 2: Matrices, Elimination and Determinant m n matrices The system of m linear equations in n variables x 1, x 2,, x n a 11 x 1 + a 12 x 2 + + a 1n x n = b 1

More information

STK4900/ Lecture 3. Program

STK4900/ Lecture 3. Program STK4900/9900 - Lecture 3 Program 1. Multiple regression: Data structure and basic questions 2. The multiple linear regression model 3. Categorical predictors 4. Planned experiments and observational studies

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

12. Perturbed Matrices

12. Perturbed Matrices MAT334 : Applied Linear Algebra Mike Newman, winter 208 2. Perturbed Matrices motivation We want to solve a system Ax = b in a context where A and b are not known exactly. There might be experimental errors,

More information

Chapter 2: simple regression model

Chapter 2: simple regression model Chapter 2: simple regression model Goal: understand how to estimate and more importantly interpret the simple regression Reading: chapter 2 of the textbook Advice: this chapter is foundation of econometrics.

More information

Systems of equation and matrices

Systems of equation and matrices Systems of equation and matrices Jean-Luc Bouchot jean-luc.bouchot@drexel.edu February 23, 2013 Warning This is a work in progress. I can not ensure it to be mistake free at the moment. It is also lacking

More information

Ordinary Differential Equations Prof. A. K. Nandakumaran Department of Mathematics Indian Institute of Science Bangalore

Ordinary Differential Equations Prof. A. K. Nandakumaran Department of Mathematics Indian Institute of Science Bangalore Ordinary Differential Equations Prof. A. K. Nandakumaran Department of Mathematics Indian Institute of Science Bangalore Module - 3 Lecture - 10 First Order Linear Equations (Refer Slide Time: 00:33) Welcome

More information

POL 213: Research Methods

POL 213: Research Methods Brad 1 1 Department of Political Science University of California, Davis April 15, 2008 Some Matrix Basics What is a matrix? A rectangular array of elements arranged in rows and columns. 55 900 0 67 1112

More information

Inferential statistics

Inferential statistics Inferential statistics Inference involves making a Generalization about a larger group of individuals on the basis of a subset or sample. Ahmed-Refat-ZU Null and alternative hypotheses In hypotheses testing,

More information

Chapter 4: Regression Models

Chapter 4: Regression Models Sales volume of company 1 Textbook: pp. 129-164 Chapter 4: Regression Models Money spent on advertising 2 Learning Objectives After completing this chapter, students will be able to: Identify variables,

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

MITOCW ocw nov2005-pt1-220k_512kb.mp4

MITOCW ocw nov2005-pt1-220k_512kb.mp4 MITOCW ocw-3.60-03nov2005-pt1-220k_512kb.mp4 PROFESSOR: All right, I would like to then get back to a discussion of some of the basic relations that we have been discussing. We didn't get terribly far,

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

1 Multiple Regression

1 Multiple Regression 1 Multiple Regression In this section, we extend the linear model to the case of several quantitative explanatory variables. There are many issues involved in this problem and this section serves only

More information

22s:152 Applied Linear Regression. 1-way ANOVA visual:

22s:152 Applied Linear Regression. 1-way ANOVA visual: 22s:152 Applied Linear Regression 1-way ANOVA visual: Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 Y We now consider an analysis

More information

Warm-up Using the given data Create a scatterplot Find the regression line

Warm-up Using the given data Create a scatterplot Find the regression line Time at the lunch table Caloric intake 21.4 472 30.8 498 37.7 335 32.8 423 39.5 437 22.8 508 34.1 431 33.9 479 43.8 454 42.4 450 43.1 410 29.2 504 31.3 437 28.6 489 32.9 436 30.6 480 35.1 439 33.0 444

More information

The Derivative of a Function

The Derivative of a Function The Derivative of a Function James K Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University March 1, 2017 Outline A Basic Evolutionary Model The Next Generation

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression In simple linear regression we are concerned about the relationship between two variables, X and Y. There are two components to such a relationship. 1. The strength of the relationship.

More information

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b).

Note that we are looking at the true mean, μ, not y. The problem for us is that we need to find the endpoints of our interval (a, b). Confidence Intervals 1) What are confidence intervals? Simply, an interval for which we have a certain confidence. For example, we are 90% certain that an interval contains the true value of something

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

ANOVA (Analysis of Variance) output RLS 11/20/2016

ANOVA (Analysis of Variance) output RLS 11/20/2016 ANOVA (Analysis of Variance) output RLS 11/20/2016 1. Analysis of Variance (ANOVA) The goal of ANOVA is to see if the variation in the data can explain enough to see if there are differences in the means.

More information

Correlation Analysis

Correlation Analysis Simple Regression Correlation Analysis Correlation analysis is used to measure strength of the association (linear relationship) between two variables Correlation is only concerned with strength of the

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

Homogeneous Linear Systems and Their General Solutions

Homogeneous Linear Systems and Their General Solutions 37 Homogeneous Linear Systems and Their General Solutions We are now going to restrict our attention further to the standard first-order systems of differential equations that are linear, with particular

More information

The Integers. Peter J. Kahn

The Integers. Peter J. Kahn Math 3040: Spring 2009 The Integers Peter J. Kahn Contents 1. The Basic Construction 1 2. Adding integers 6 3. Ordering integers 16 4. Multiplying integers 18 Before we begin the mathematics of this section,

More information

An Introduction to Matrix Algebra

An Introduction to Matrix Algebra An Introduction to Matrix Algebra EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #8 EPSY 905: Matrix Algebra In This Lecture An introduction to matrix algebra Ø Scalars, vectors, and matrices

More information

Questions 3.83, 6.11, 6.12, 6.17, 6.25, 6.29, 6.33, 6.35, 6.50, 6.51, 6.53, 6.55, 6.59, 6.60, 6.65, 6.69, 6.70, 6.77, 6.79, 6.89, 6.

Questions 3.83, 6.11, 6.12, 6.17, 6.25, 6.29, 6.33, 6.35, 6.50, 6.51, 6.53, 6.55, 6.59, 6.60, 6.65, 6.69, 6.70, 6.77, 6.79, 6.89, 6. Chapter 7 Reading 7.1, 7.2 Questions 3.83, 6.11, 6.12, 6.17, 6.25, 6.29, 6.33, 6.35, 6.50, 6.51, 6.53, 6.55, 6.59, 6.60, 6.65, 6.69, 6.70, 6.77, 6.79, 6.89, 6.112 Introduction In Chapter 5 and 6, we emphasized

More information

1 Matrices and matrix algebra

1 Matrices and matrix algebra 1 Matrices and matrix algebra 1.1 Examples of matrices A matrix is a rectangular array of numbers and/or variables. For instance 4 2 0 3 1 A = 5 1.2 0.7 x 3 π 3 4 6 27 is a matrix with 3 rows and 5 columns

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur

Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institution of Technology, Kharagpur Lecture No. # 36 Sampling Distribution and Parameter Estimation

More information

POL 681 Lecture Notes: Statistical Interactions

POL 681 Lecture Notes: Statistical Interactions POL 681 Lecture Notes: Statistical Interactions 1 Preliminaries To this point, the linear models we have considered have all been interpreted in terms of additive relationships. That is, the relationship

More information

Chapter 1 Review of Equations and Inequalities

Chapter 1 Review of Equations and Inequalities Chapter 1 Review of Equations and Inequalities Part I Review of Basic Equations Recall that an equation is an expression with an equal sign in the middle. Also recall that, if a question asks you to solve

More information