School of Education, Culture and Communication Division of Applied Mathematics


School of Education, Culture and Communication
Division of Applied Mathematics

MASTER THESIS IN MATHEMATICS / APPLIED MATHEMATICS

Estimation and Testing the Quotient of Two Models

by

Marko Dimitrov

Masterarbete i matematik / tillämpad matematik

DIVISION OF APPLIED MATHEMATICS
MÄLARDALEN UNIVERSITY
SE VÄSTERÅS, SWEDEN

School of Education, Culture and Communication
Division of Applied Mathematics

Master thesis in mathematics / applied mathematics

Date:
Project name: Estimation and Testing the Quotient of Two Models
Author: Marko Dimitrov
Supervisor(s): Christopher Engström
Reviewer: Milica Rančíć
Examiner: Sergei Silvestrov
Comprising: 15 ECTS credits

Abstract

In this thesis, we introduce linear regression models such as Simple Linear Regression, Multiple Regression, and Polynomial Regression. We explain the basic methods of estimating the model parameters, Ordinary Least Squares (OLS) and Maximum Likelihood Estimation (MLE). We give the properties of the estimates and the assumptions the model must satisfy for the estimates to be the Best Linear Unbiased Estimates (BLUE). The basic bootstrap methods are introduced. A real-world problem is simulated in order to see how measurement error affects the quotient of two estimated models.

Acknowledgments

I would like to thank my supervisor, Senior Lecturer Christopher Engström of the School of Education, Culture and Communication at Mälardalen University. Prof. Engström consistently allowed this paper to be my own work but steered me in the right direction whenever he thought I needed it. I would also like to thank Prof. Dr. Miodrag Đorđević, who was involved in the validation survey for this master thesis. Without his participation and input, the validation survey could not have been successfully conducted. I would also like to acknowledge Senior Lecturer Milica Rančíć, School of Education, Culture and Communication at Mälardalen University, as the reviewer, and I am gratefully indebted to her for her very valuable comments on this thesis. The data used in the master thesis comes from ship log data gathered at Qtagg AB, from one ship over roughly half a month, and I wish to acknowledge Qtagg AB for providing the data. Finally, I must express my very profound gratitude to my friends and girlfriend for providing me with unfailing support and continuous encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Thank you.

Author: Marko Dimitrov

Contents

List of Figures
List of Tables
Introduction
1 Simple Linear Regression
  1.1 The Model
  1.2 Estimation of the Model Parameters
    1.2.1 Ordinary Least Squares
    1.2.2 Properties of the Ordinary Least Squares Estimators
    1.2.3 An Estimator of the Variance and Estimated Variances
  1.3 Hypothesis Testing, Confidence Intervals and t-test
  1.4 The Coefficient of Determination
  1.5 Maximum Likelihood Estimation
2 Multiple Regression
  2.1 The Model
  2.2 Estimation of the Model Parameters
    2.2.1 Ordinary Least Squares
    2.2.2 Properties of the Ordinary Least Squares Estimators
    2.2.3 An Estimator of the Variance and Estimated Variance
  2.3 Maximum Likelihood Estimation
    2.3.1 Properties of the Maximum Likelihood Estimators
  2.4 Polynomial Regression
    2.4.1 Orthogonal Polynomials
3 The Bootstrap
  3.1 Introduction
    3.1.1 Statistics
  3.2 The Bootstrap Estimates
  3.3 Parametric Simulation
    3.3.1 Approximations
  3.4 Non-parametric Simulation
  3.5 Confidence Intervals
4 Simulation and Evaluation
  4.1 Mathematical Description of the Problem
  4.2 An Analogy With the Real World
  4.3 Parameter Estimation
  4.4 Confidence Intervals
  4.5 True Values of the Quotient
  4.6 Evaluation of the Results
5 Discussion
  5.1 Conclusions
  5.2 Future Work
  5.3 Fulfillment of Thesis Objectives
A Definitions
  A.1 Linear Algebra
  A.2 Matrix Calculus
  A.3 Statistics
B Probability Distributions
  B.1 Binomial Distribution
  B.2 Uniform Distribution
  B.3 Generalized Pareto Distribution
  B.4 Normal Distribution
  B.5 Log-normal Distribution
  B.6 Gamma Distribution
  B.7 Student Distribution
  B.8 Chi-Square Distribution
Bibliography
Index

List of Figures

4.1 The data (velocity) without measurement errors
4.2 Case 1 - The data (fuel efficiency) without measurement errors
4.3 Case 2 - The data (fuel efficiency) without measurement errors
4.4 Case 3 - The data (fuel efficiency) without measurement errors
4.5 A sample of data taken from the Uniform Distribution
4.6 A sample of data taken from the Generalized Pareto Distribution
4.7 A sample of data taken from the Normal Distribution
4.8 A sample of data taken from the Log-normal Distribution
4.9 A sample of data taken from the Gamma Distribution
4.10 A sample of data taken from the Student's t-distribution
4.11 A sample of data taken from the Gamma Distribution

List of Tables

4.1 Table of the confidence intervals for the mean of the quotient

Abbreviations and Acronyms

SS_R   Sum of Squares due to Regression
SS_T   Total Sum of Squares
BLUE   Best Linear Unbiased Estimates
MLE    Maximum Likelihood Estimation
OLS    Ordinary Least Squares
RSS    Residual Sum of Squares
CDF    Cumulative Distribution Function
EDF    Empirical Distribution Function
PDF    Probability Density Function
i.i.d. Independent and Identically Distributed
df     Degrees of Freedom

Introduction

Regression analysis is a statistical technique used for analyzing data and for finding a relationship between two or more variables. Behind regression analysis lies elegant mathematics and statistical theory. It can be used in many fields: engineering, economics, biology, medicine, etc. The book by Dougherty [4] gives good examples of where regression can be used and how to use it. We explain Simple Linear Regression; for much more information, proofs, theorems, and examples the author refers the reader to the book by Weisberg [10]. There are also good examples of Simple Linear Regression in the book by Dougherty [4]. Multiple Regression analysis is, perhaps, more important than Simple Linear Regression. There are a lot of results and books about Multiple Regression, starting with Rencher and Schaalje [6], to which the author refers the reader. I would also like to mention the books by Wasserman [9], Montgomery et al. [5], Seber and Lee [7], Casella and Berger [1], and Weisberg [10], which contain much more information. Besides simple linear regression and multiple regression, books such as Casella and Berger [1] and Weisberg [10] contain other linear regression models as well as nonlinear models. For a better understanding of Polynomial Regression, the author refers the reader to the book by Wasserman [9] (Chapter 7), which explains both the strengths and the weaknesses of Regression Analysis. The problem we mention in the thesis, ill-conditioning, is well explained and solved in Chapter 7 of Wasserman [9]. An excellent introduction to bootstrap methods and confidence intervals is given in Davis [2] and Davison and Hinkley [3]. However, in the book by Van Der Vaart and Wellner [8] (Sections 3.6 to 3.9), they introduce the bootstrap empirical process and take the bootstrap method to the next level.

Our goal is to simulate a real-world problem and use the methods we mention. We will make some assumptions, estimate two models (Simple Linear, Multiple or Polynomial regression models), and look at the quotient of the two models. One could search for the distribution of the quotient of the two models, but that could be really complicated, which is why we introduce the bootstrap method. By computing confidence intervals for the mean of the quotient, we get the results. We introduce different types of measurement errors to the data and see how that affects the quotient.

Formulation of Problem and Goal of the Thesis

This project is inspired by my supervisor Christopher Engström. The formulation of the problem studied in the project and the goal of the project, given by the supervisor, are below.

When creating a new control system or new hardware for a vehicle or some other machine, there is also a need to test it in practice, for example, if you want to evaluate whether one method is more fuel efficient than another. The standard way to do this is by doing the testing in a controlled environment where you can limit the number of outside influences on the system. However, doing the tests in a controlled environment is not always possible, either because of cost considerations or because the thing you want to test is something that is hard to achieve in a controlled environment.

The goal of the thesis is to evaluate how well a quotient between two models, for example the fuel efficiency of two engines, behaves when the data is taken in a non-controlled environment where the two engines cannot be tested simultaneously and the effects of outside factors are large. Mathematically, the problem can be described as follows:

1) Given two sets of data, a model is constructed for each in order to predict one of the variables given the others (regression problem);
2) From these two models, try to predict the quotient between the two predicted variables if they were given the same input parameters, for example by computing confidence intervals;
3) Introduce bias or different types of errors with known distributions into the data and determine how this affects the randomness in the result;
4) The project should be made using a mixed theoretical and experimental approach but may lean towards one or the other.

Chapter 1

Simple Linear Regression

1.1 The Model

Regression is a method of finding the relationship between two variables Y and X. The variable Y is called a response variable and the variable X is called a covariate. The variable X is also called a predictor variable or a feature. In simple linear regression we have only one covariate but, as we will see later, there can be more covariates. Let us assume that we have a set of data D = {(y_i, x_i)}_{i=1}^N. To find the relationship between Y and X we estimate the regression function

    r(x) = E(Y | X = x) = ∫ y f(y | x) dy.    (1.1)

The simplest choice is to assume that the regression function is a linear function:

    r(x) = θ_0 + θ_1 x,

where x is a scalar (not a vector). Besides the regression function (mean function), the simple linear regression model consists of another function, the variance function Var(Y | X = x) = σ². By changing the parameters θ_0 and θ_1 we can get every possible line. To us the parameters are unknown, and we have to estimate them using the data D. Since the variance σ² is positive, the observed value will in general not be the same as the expected value. To account for the difference between those values, we look at the error

    ξ_i = y_i − (θ_0 + θ_1 x_i)

for every i ∈ {1, 2, ..., N}. The errors depend on the parameters and are not observable, therefore they are random variables. We can write the simple linear regression model as

    y_i = θ_0 + θ_1 x_i + ξ_i,   i = 1, 2, ..., N.    (1.2)

The model is called simple because there is only one feature used to predict the response variable, and the linear part means that the model (1.2) is linear in the parameters θ_0 and θ_1, or, to be precise, that the regression function (1.1) is assumed to be linear. Considering that the ξ_i are random variables, the y_i are random variables as well. For the model to be complete, we have to make the following assumptions about the errors ξ_i, i = 1, 2, ..., N:

1. E(ξ_i | x_i) = 0 for all i = 1, 2, ..., N;
2. Var(ξ_i | x_i) = σ² for all i = 1, 2, ..., N;
3. Cov(ξ_i, ξ_j | x_i) = 0 for all i ≠ j, i, j = 1, 2, ..., N.

The first assumption guarantees that the model (1.2) is well defined. It is equivalent to E(y_i | x_i) = θ_0 + θ_1 x_i, which means that y_i depends only on x_i and all other factors are random, contained in ξ_i. The second assumption implies Var(y_i | x_i) = σ²: the variance is constant and does not depend on the values of x_i. The third assumption is equivalent to Cov(y_i, y_j | x_i) = 0. The errors, as well as the variables y_i, are uncorrelated with each other. Under the assumption of normality, this would mean that the errors are independent.

1.2 Estimation of the Model Parameters

1.2.1 Ordinary Least Squares

One of many methods to estimate the unknown parameters θ_0 and θ_1 in (1.2) is the Ordinary Least Squares (OLS) method. Let ˆθ_0 and ˆθ_1 be the estimates of θ_0 and θ_1. We define the fitted line by

    ˆr(x) = ˆθ_0 + ˆθ_1 x,

the fitted values as

    ŷ_i = ˆr(x_i),

the residuals as

    ˆξ_i = y_i − ŷ_i = y_i − (ˆθ_0 + ˆθ_1 x_i),

and the residual sum of squares (RSS) by

    RSS = Σ_{i=1}^N ˆξ_i².    (1.3)

By minimizing the residual sum of squares we get the estimates ˆθ_0 and ˆθ_1. Those estimates are called the least squares estimates. The function we want to minimize is

    RSS(θ_0, θ_1) = Σ_{i=1}^N (y_i − (θ_0 + θ_1 x_i))²    (1.4)

and by solving the linear system

    ∂RSS(θ_0, θ_1)/∂θ_0 = 0,
    ∂RSS(θ_0, θ_1)/∂θ_1 = 0    (1.5)

we get ˆθ_0 and ˆθ_1. When we differentiate, the linear system (1.5) becomes

    −2 Σ_{i=1}^N (y_i − (θ_0 + θ_1 x_i)) = 0,
    −2 Σ_{i=1}^N (y_i − (θ_0 + θ_1 x_i)) x_i = 0.

Solving the linear system we get the least squares estimates

    ˆθ_0 = ȳ − ˆθ_1 x̄,
    ˆθ_1 = (Σ_{i=1}^N x_i y_i − N x̄ ȳ) / (Σ_{i=1}^N x_i² − N x̄²) = Σ_{i=1}^N (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^N (x_i − x̄)²,    (1.6)

where ȳ = (1/N) Σ_{i=1}^N y_i and x̄ = (1/N) Σ_{i=1}^N x_i.

The estimates given in (1.6) will be the estimates which minimize the function (1.4) if we show that the second derivatives are positive. We can also notice that the function (1.4) has no maximum, therefore the estimates are the minimum.

1.2.2 Properties of the Ordinary Least Squares Estimators

To estimate the parameters θ_0 and θ_1, the three assumptions in Section 1.1 were not used. Even if the assumption E(ξ_i | x_i) = 0 for all i = 1, 2, ..., N does not hold, we can define ŷ_i = θ_0 + θ_1 x_i to fit the data D = {(y_i, x_i)}_{i=1}^N. The estimates ˆθ_0 and ˆθ_1 are also random variables, because they depend on the statistical errors. If the assumptions of Section 1.1 hold, then by the Gauss-Markov theorem (see Theorem 1 in Chapter 2) the estimators ˆθ_0 and ˆθ_1 are unbiased and have the minimum variance among all linear unbiased estimators of the parameters θ_0 and θ_1,

    E(ˆθ_0 | X) = θ_0,   E(ˆθ_1 | X) = θ_1,

and the variances of the estimates are

    Var(ˆθ_0 | X) = σ² [1/N + x̄² / Σ_{i=1}^N (x_i − x̄)²],
    Var(ˆθ_1 | X) = σ² / Σ_{i=1}^N (x_i − x̄)².    (1.7)
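To make the formulas concrete, the short sketch below computes the estimates (1.6) in MATLAB (the language used for the simulations in Chapter 4). The data and numbers are artificial placeholders, not the thesis data; the snippet only illustrates the closed-form expressions.

% Illustrative sketch (not thesis code): OLS estimates (1.6) for simple
% linear regression, using artificial data.
N = 50;
x = linspace(0, 10, N)';            % covariate
y = 2 + 0.5*x + 0.3*randn(N, 1);    % response; true theta_0 = 2, theta_1 = 0.5

xbar = mean(x);
ybar = mean(y);
theta1_hat = sum((x - xbar).*(y - ybar)) / sum((x - xbar).^2);
theta0_hat = ybar - theta1_hat*xbar;

yhat = theta0_hat + theta1_hat*x;   % fitted values
RSS  = sum((y - yhat).^2);          % residual sum of squares (1.3)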

Since ˆθ_0 depends on ˆθ_1, it is clear that the estimates are correlated, with

    Cov(ˆθ_0, ˆθ_1 | X) = −σ² x̄ / Σ_{i=1}^N (x_i − x̄)².

The estimates ˆθ_0 and ˆθ_1 are called Best Linear Unbiased Estimates (BLUE).

1.2.3 An Estimator of the Variance and Estimated Variances

Ordinary Least Squares does not yield an estimate of the variance. Naturally, an estimate ˆσ² should be obtained by averaging the squared residuals, because σ² = E[(y_i − E(y_i | x_i))² | x_i]. From the second assumption in Section 1.1, we have the constant variance σ² for every y_i, i = 1, 2, ..., N. Also, we use ŷ_i to estimate E(y_i | x_i). To get an unbiased estimate ˆσ² of σ², we divide the RSS in (1.3) by its degrees of freedom (df), where the residual df is the number of cases in the data D (that is, N) minus the number of parameters, which is 2. The estimate is

    ˆσ² = RSS / (N − 2),    (1.8)

and this quantity is called the residual mean square. To estimate the variances of ˆθ_0 and ˆθ_1 we simply replace σ² with ˆσ² in (1.7). Therefore,

    ˆVar(ˆθ_0 | X) = ˆσ² [1/N + x̄² / Σ_{i=1}^N (x_i − x̄)²],
    ˆVar(ˆθ_1 | X) = ˆσ² / Σ_{i=1}^N (x_i − x̄)²

are the estimated variances.

1.3 Hypothesis Testing, Confidence Intervals and t-test

Until now, we did not need to make any assumptions about the distribution of the errors besides the three assumptions in Section 1.1. Suppose that we add the following assumption:

    ξ_i | x_i ∼ N(0, σ²),   i = 1, 2, ..., N.

Since the predictions are linear combinations of the errors, we have

    y_i | x_i ∼ N(θ_0 + θ_1 x_i, σ²),   i = 1, 2, ..., N.

With this assumption we can construct confidence intervals for the model parameters and test hypotheses. Perhaps we are most interested in hypotheses about θ_1, since by testing

    H_0: θ_1 = 0,   H_1: θ_1 ≠ 0    (1.9)

we can determine whether there actually is a linear relationship between X and Y.

In general, we can test the hypothesis

    H_0: θ_1 = c,   H_1: θ_1 ≠ c,    (1.10)

where c is an arbitrary constant. Depending on what we need to determine, we choose the constant c. Before we examine the hypotheses (1.9) and (1.10) we need the following properties:

    ˆθ_1 ∼ N(θ_1, σ² / Σ_{i=1}^N (x_i − x̄)²),
    (N − 2) ˆσ² / σ² ∼ χ²(N − 2),
    ˆθ_1 and ˆσ² are independent random variables,

where ˆσ² is given by (1.8). Using these properties, the hypothesis test of (1.10) is based on the t-statistic

    t = (ˆθ_1 − c) / √(ˆVar(ˆθ_1 | X)),    (1.11)

where √(ˆVar(ˆθ_1 | X)) is the estimated standard deviation. The t-statistic given by (1.11) has distribution t(N − 2, δ). The non-centrality parameter δ is given by

    δ = E(ˆθ_1 | X) / √(Var(ˆθ_1 | X)) = θ_1 / (σ √(1 / Σ_{i=1}^N (x_i − x̄)²)).    (1.12)

The hypothesis (1.9) is just a special case of the hypothesis (1.10), which means the t-statistic for (1.9) is

    t = ˆθ_1 / √(ˆσ² / Σ_{i=1}^N (x_i − x̄)²),

where t is distributed as t(N − 2), because from (1.12), if H_0: θ_1 = 0 holds, then δ = 0. For the two-sided alternative hypothesis given in (1.9), we reject the null hypothesis H_0 at significance level α when |t| ≥ t_{α/2, N−2}, where t_{α/2, N−2} is the upper α/2 percentage point of the central Student's t-distribution. The probability p that corresponds to the absolute value of the observed t (as the inverse of the distribution function) is called the p-value. Considering that

    p > α  ⟹  p/2 > α/2  ⟹  |t| < t_{α/2, N−2},

we accept the null hypothesis H_0 when p > α. Alternatively, if p ≤ α we reject H_0. Finally, to get the confidence interval, starting with

    P(|t| ≤ t_{α/2, N−2}) = 1 − α

and using transformations, a 100(1 − α)% confidence interval for θ_1 is given by

    ˆθ_1 − t_{α/2, N−2} √(ˆσ² / Σ_{i=1}^N (x_i − x̄)²) ≤ θ_1 ≤ ˆθ_1 + t_{α/2, N−2} √(ˆσ² / Σ_{i=1}^N (x_i − x̄)²).
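Continuing the earlier sketch, the t-test for H_0: θ_1 = 0 and the confidence interval above can be computed as follows. This is only an illustration and assumes the Statistics and Machine Learning Toolbox is available for tinv and tcdf.

% Illustrative sketch: t-test for H0: theta_1 = 0 and a 95% confidence
% interval for theta_1, continuing from the previous snippet.
sigma2_hat = RSS / (N - 2);                      % residual mean square (1.8)
se1 = sqrt(sigma2_hat / sum((x - xbar).^2));     % estimated std. deviation of theta1_hat

t     = theta1_hat / se1;                        % t-statistic (1.11) with c = 0
alpha = 0.05;
tq    = tinv(1 - alpha/2, N - 2);                % upper alpha/2 point of t(N-2)
pval  = 2*(1 - tcdf(abs(t), N - 2));             % two-sided p-value

CI = [theta1_hat - tq*se1, theta1_hat + tq*se1]; % 100(1-alpha)% interval for theta_1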

1.4 The Coefficient of Determination

We define the coefficient of determination as

    R² = SS_R / SS_T,

where SS_R = Σ_{i=1}^N (ŷ_i − ȳ)² is the sum of squares due to regression and SS_T = Σ_{i=1}^N (y_i − ȳ)² is the total sum of squares. Since it can be proved that SS_T = RSS + SS_R, the total sum of squares is in fact the total amount of variation in the y_i. Considering this, we have

    1 = SS_T / SS_T = (RSS + SS_R) / SS_T = RSS / SS_T + R²,

which means that R² is the proportion of the variation that is explained by the model (by the regression). From 0 ≤ RSS ≤ SS_T, it follows that R² ∈ [0, 1]. The bigger R² is, the more of the variability of Y is explained by the model. We can always add more variables to the model and the coefficient will not decrease, but that does not mean that the new model is better. The error sum of squares should be reduced to get a better model. Some computer packages use the adjusted coefficient of determination given by

    R²_adj = 1 − (RSS / df) / (SS_T / (N − 1)).

1.5 Maximum Likelihood Estimation

While the OLS method does not require assumptions about the errors to estimate the parameters, the maximum likelihood estimation (MLE) method can be used if the distribution of the errors is known. For the set of data D = {(y_i, x_i)}_{i=1}^N, if we assume that the errors in the simple regression model are normally distributed,

    ξ_i | x_i ∼ N(0, σ²),   i = 1, 2, ..., N,

then

    y_i | x_i ∼ N(θ_0 + θ_1 x_i, σ²),   i = 1, 2, ..., N.

Since the parameters θ_0, θ_1 and σ² are unknown, the likelihood function is given by

    L(y_i, x_i; θ_0, θ_1, σ²) = Π_{i=1}^N (2πσ²)^{−1/2} exp{−(y_i − θ_0 − θ_1 x_i)² / (2σ²)}
                              = (2πσ²)^{−N/2} exp{−(1/(2σ²)) Σ_{i=1}^N (y_i − θ_0 − θ_1 x_i)²}.    (1.13)

The values ˆθ_0, ˆθ_1 and ˆσ² that maximize the function (1.13) are called the maximum likelihood estimates. Finding the maximum of the function (1.13) is the same as finding the maximum of its natural logarithm,

    ln L(y_i, x_i; θ_0, θ_1, σ²) = −(N/2) ln(2π) − (N/2) ln σ² − (1/(2σ²)) Σ_{i=1}^N (y_i − θ_0 − θ_1 x_i)².    (1.14)

To find the maximum of (1.14) we solve the system

    ∂ ln L(θ_0, θ_1, σ²)/∂θ_0 = 0,
    ∂ ln L(θ_0, θ_1, σ²)/∂θ_1 = 0,
    ∂ ln L(θ_0, θ_1, σ²)/∂σ² = 0,

or, equivalently,

    (1/σ²) Σ_{i=1}^N (y_i − θ_0 − θ_1 x_i) = 0,
    (1/σ²) Σ_{i=1}^N (y_i − θ_0 − θ_1 x_i) x_i = 0,
    −N/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^N (y_i − θ_0 − θ_1 x_i)² = 0.    (1.15)

The solution of (1.15) gives us the maximum likelihood estimates

    ˆθ_0 = ȳ − ˆθ_1 x̄,
    ˆθ_1 = Σ_{i=1}^N (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^N (x_i − x̄)²,
    ˆσ² = (1/N) Σ_{i=1}^N (y_i − ˆθ_0 − ˆθ_1 x_i)²,    (1.16)

where ˆθ_0 and ˆθ_1 are the same as the estimates in (1.6), which we obtained using the OLS method. From ˆσ² we can get an unbiased estimator of the parameter σ², and ˆσ² is asymptotically unbiased itself. Since the MLE method requires more assumptions, it is natural that with more assumptions come better properties: the estimators have the minimum variance among all unbiased estimators. Therefore, under the assumption of normality, the maximum likelihood estimates are the same as the OLS estimates.
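As a quick numerical check of this equivalence, one can maximize the log-likelihood (1.14) directly. The sketch below continues the earlier illustrative snippet, uses fminsearch, and parameterizes σ² through its logarithm purely for numerical convenience; it is not the thesis code.

% Illustrative sketch: maximize the log-likelihood (1.14) numerically and
% compare with the OLS estimates; p = [theta_0, theta_1, log(sigma^2)].
negloglik = @(p) (N/2)*log(2*pi) + (N/2)*p(3) ...
               + sum((y - p(1) - p(2)*x).^2) / (2*exp(p(3)));

pmle = fminsearch(negloglik, [0; 0; 0]);   % crude starting point
theta_mle  = pmle(1:2);                    % should agree with theta0_hat, theta1_hat
sigma2_mle = exp(pmle(3));                 % should agree with RSS/N (the biased MLE)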

Chapter 2

Multiple Regression

In this chapter, we generalize the methods for estimating parameters from Chapter 1. Namely, we want to predict the variable Y using several features X_1, X_2, ..., X_k, k ∈ N. Basically, we add features to explain parts of Y that have not been explained by the other features.

2.1 The Model

The regression function (1.1), under the assumption of linearity, for this problem becomes

    r(x) = E(Y | X = x) = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_k x_k,

where x is a vector, x = (x_1, x_2, ..., x_k). Therefore, the multiple regression model can be written as

    y = θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_k x_k + ξ.

In order to estimate the parameters θ_0, θ_1, ..., θ_k in the model, we need N observations, a data set D. Suppose that we have data D = {(y_i, x_i)}_{i=1}^N, where x_i is a vector, x_i = (x_i1, x_i2, ..., x_ik), k ∈ N, k is the number of feature variables, i = 1, 2, ..., N. Hence, we can write the model for the i-th observation as

    y_i = θ_0 + θ_1 x_i1 + θ_2 x_i2 + ... + θ_k x_ik + ξ_i,   i = 1, 2, ..., N.    (2.1)

By a linear model, we mean a model linear in the parameters. There are many examples where a model is not linear in the x_ij's but is linear in the θ_i's. For k = 1, we get the simple regression model, so it is not a surprise that the three assumptions from Section 1.1 should hold for multiple regression as well, i.e.,

1. E(ξ_i | x_i) = 0 for all i = 1, 2, ..., N;
2. Var(ξ_i | x_i) = σ² for all i = 1, 2, ..., N;
3. Cov(ξ_i, ξ_j | x_i, x_j) = 0 for all i ≠ j, i, j = 1, 2, ..., N.

The interpretation of these assumptions is similar to the interpretation given for the simple linear regression model. For k = 2, the mean function E(Y | X) = θ_0 + θ_1 X_1 + θ_2 X_2 is a plane in 3 dimensions. If k > 2, we get a hyperplane; we cannot imagine or draw a k-dimensional plane for k > 2. Notice that the mean function given above means that we are conditioning on all values of the covariates. For easier interpretation of the results, we would like to write the model (2.1) in matrix form. Start by writing (2.1) as

    y_1 = θ_0 + θ_1 x_11 + θ_2 x_12 + ... + θ_k x_1k + ξ_1,
    y_2 = θ_0 + θ_1 x_21 + θ_2 x_22 + ... + θ_k x_2k + ξ_2,
    ...
    y_N = θ_0 + θ_1 x_N1 + θ_2 x_N2 + ... + θ_k x_Nk + ξ_N,

which gives us a clear view of how to write the model in matrix form. If we denote

    y = [y_1, y_2, ..., y_N]',
    X = [1 x_11 x_12 ... x_1k;
         1 x_21 x_22 ... x_2k;
         ...
         1 x_N1 x_N2 ... x_Nk],
    θ = [θ_0, θ_1, ..., θ_k]',
    ξ = [ξ_1, ξ_2, ..., ξ_N]',    (2.2)

the model becomes

    y = Xθ + ξ.

The three assumptions can be expressed as

    E(ξ | X) = 0,   Cov(ξ | X) = σ² I,

where Var(ξ_i | x_i) = σ² and the Cov(ξ_i, ξ_j | x_i, x_j) are contained in Cov(ξ | X) = σ² I. X is an N × (k + 1) matrix. We require full column rank of the matrix, which means that N has to be greater than the number of columns (k + 1); otherwise it could happen that one of the columns is a linear combination of the other columns. Throughout this chapter N will be greater than k + 1 and the rank of the matrix X will be k + 1, rank(X) = k + 1. The parameters θ are called regression coefficients.

2.2 Estimation of the Model Parameters

Our goal is to estimate the unknown parameters θ and σ² from the data D. Depending on the distribution of the errors, we can use different methods to estimate the parameters.

2.2.1 Ordinary Least Squares

The method that does not require any assumptions about the exact distribution of the errors is Ordinary Least Squares. The fitted values ŷ_i are given by

    ŷ_i = ˆθ_0 + ˆθ_1 x_i1 + ˆθ_2 x_i2 + ... + ˆθ_k x_ik,   i = 1, 2, ..., N.

To obtain the OLS estimators of the parameters in θ, we seek ˆθ_0, ˆθ_1, ˆθ_2, ..., ˆθ_k that minimize

    Σ_{i=1}^N ˆξ_i² = Σ_{i=1}^N (y_i − ŷ_i)² = Σ_{i=1}^N (y_i − (ˆθ_0 + ˆθ_1 x_i1 + ˆθ_2 x_i2 + ... + ˆθ_k x_ik))².    (2.3)

One way to minimize the function is to find the partial derivatives with respect to each ˆθ_j, j = 0, 1, ..., k, set the results equal to zero and solve the k + 1 equations to find the estimates. One of the reasons why we wrote the model in matrix form is to simplify these calculations. Therefore, since we assumed rank(X) = k + 1 < N, the following procedure holds. Firstly, (2.3) can be written in matrix form as

    ˆξ'ˆξ = Σ_{i=1}^N (y_i − x_i'ˆθ)²,

where

    ˆξ = [ˆξ_1, ˆξ_2, ..., ˆξ_N]',   ˆθ = [ˆθ_0, ˆθ_1, ..., ˆθ_k]',   x_i = [1, x_i1, ..., x_ik]'.

So, the function (2.3) we want to minimize now becomes

    Σ_{i=1}^N ˆξ_i² = ˆξ'ˆξ = (y − Xˆθ)'(y − Xˆθ)
                    = y'y − (Xˆθ)'y − y'Xˆθ + (Xˆθ)'Xˆθ
                    = y'y − 2y'Xˆθ + ˆθ'X'Xˆθ,

where we used basic matrix operations. Now we use matrix calculus to obtain the estimates. Differentiating ˆξ'ˆξ with respect to ˆθ and setting the result equal to zero, we get

    −2X'y + 2X'Xˆθ = 0,

from which we have

    X'Xˆθ = X'y.

From the assumption rank(X) = k + 1, the matrix X'X is a positive-definite matrix, therefore the matrix is nonsingular, so (X'X)^{−1} exists. Now we have the solution

    ˆθ = (X'X)^{−1} X'y.    (2.4)

Checking that the Hessian of ˆξ'ˆξ is a positive-definite matrix shows that ˆθ is actually the minimum. The Hessian is 2X'X, which is a positive-definite matrix because of the assumption on the rank. Since ˆθ minimizes the sum of squares, we call it the ordinary least squares estimator.

2.2.2 Properties of the Ordinary Least Squares Estimators

Even without the three (two) assumptions we could obtain the OLS estimators, but their properties would not be as nice as with the assumptions. Let us assume that E(y | X) = Xθ. The following holds:

    E(ˆθ | X) = E((X'X)^{−1} X'y | X) = (X'X)^{−1} X' E(y | X) = (X'X)^{−1} X'Xθ = θ,    (2.5)

which means that ˆθ is an unbiased estimator of θ. Let us now assume that Cov(ξ | X) = σ² I. Under this assumption we can find the covariance matrix of ˆθ:

    Cov(ˆθ | X) = Cov((X'X)^{−1} X'y | X)
                = (X'X)^{−1} X' Cov(y | X) ((X'X)^{−1} X')'
                = (X'X)^{−1} X' σ² I X (X'X)^{−1}
                = σ² (X'X)^{−1} X'X (X'X)^{−1}
                = σ² (X'X)^{−1}.    (2.6)

Using these two properties, we can prove one of the most important theorems, also known as the Gauss-Markov theorem.

Theorem 1 (Gauss-Markov Theorem). If y = Xθ + ξ, E(ξ | X) = 0, Cov(ξ | X) = σ² I and rank(X) = k + 1, then the ordinary least squares estimator given by (2.4) is the Best Linear Unbiased Estimator (BLUE), i.e. it has minimum variance among all linear unbiased estimators.

Proof. The linearity of the estimator is easy to see from (2.4). The proof that the estimator is unbiased is given in (2.5). Let us now prove that the variance σ² (X'X)^{−1} of the least squares estimator is the minimum among all linear unbiased estimators.

Assume that we have a linear estimator ˆβ = B_1 y of θ. Without loss of generality, there exists a matrix B such that B_1 = (X'X)^{−1} X' + B. Besides linearity, the estimator ˆβ should also be unbiased, so the following holds:

    E(ˆβ | X) = θ,

and also

    E(ˆβ | X) = E(B_1 y | X) = B_1 E(y | X)
              = ((X'X)^{−1} X' + B) E(Xθ + ξ | X)
              = ((X'X)^{−1} X' + B) Xθ
              = (X'X)^{−1} X'Xθ + BXθ
              = (I + BX)θ,

which implies that BX = 0. The estimator was arbitrary; let us prove that its variance is greater than or equal to the variance of the OLS estimator. If we prove that Cov(ˆβ | X) ≥ Cov(ˆθ | X), this will imply that the variances of the ˆθ_i are the minimum among all others, because the diagonal elements of the matrices are the variances of the estimators. Note that the above means that Cov(ˆβ | X) − Cov(ˆθ | X) is a positive semi-definite matrix. The following holds:

    Cov(ˆβ | X) = Cov(B_1 y | X) = B_1 Cov(y | X) B_1' = σ² B_1 B_1'
                = σ² ((X'X)^{−1} X' + B)((X'X)^{−1} X' + B)'
                = σ² ((X'X)^{−1} X'X (X'X)^{−1} + (X'X)^{−1} X'B' + BX (X'X)^{−1} + BB')
                = σ² ((X'X)^{−1} + BB'),

where we used that BX = 0 (and hence X'B' = 0). From (2.6) we have Cov(ˆθ | X) = σ² (X'X)^{−1}, so

    Cov(ˆβ | X) − Cov(ˆθ | X) = σ² BB' ≥ 0

is a positive semi-definite matrix. Considering the comment above, the OLS estimator is BLUE.

2.2.3 An Estimator of the Variance and Estimated Variance

Under the assumptions in Section 2.1, the variance of y_i is constant for all i = 1, 2, ..., N. Therefore,

    Var(y_i | x_i) = σ² = E[(y_i − E(y_i | x_i))² | x_i],

and also E(y_i | x_i) = x_i'θ. Naturally, based on the data D = {(y_i, x_i)}_{i=1}^N, we estimate the variance as

    ˆσ² = (1/(N − k − 1)) Σ_{i=1}^N (y_i − x_i'ˆθ)²

or, in matrix form,

    ˆσ² = RSS / (N − k − 1),    (2.7)

where RSS = (y − Xˆθ)'(y − Xˆθ) is the Residual Sum of Squares. The statistic (2.7), under the assumptions in Section 2.1, is an unbiased estimator of the parameter σ², i.e. E(ˆσ² | X) = σ². Using (2.6) and (2.7), the unbiased estimator of Cov(ˆθ) is

    ˆCov(ˆθ) = ˆσ² (X'X)^{−1}.

If we add one assumption to the Gauss-Markov theorem, namely E(ξ_i⁴ | x_i) = 3σ⁴, then the estimated variance (2.7) has minimum variance among all quadratic unbiased estimators, which can be proven; see Theorem 7.3g in [6].

2.3 Maximum Likelihood Estimation

So far, no assumptions were made about the distribution of the errors. To obtain the Maximum Likelihood Estimator, we need to make such assumptions. In this section, we will assume normality of the random variable ξ. So, let ξ ∼ N_N(0, σ² I), where N_N stands for the N-dimensional normal distribution. From the covariance matrix we have that the errors are uncorrelated, which, under the assumption of normality, means that they are independent as well. The random variable y is normally distributed with expectation Xθ and covariance matrix σ² I, which implies that the joint probability density function, which we denote by ϕ(y, X; θ, σ²), is

    ϕ(y, X; θ, σ²) = Π_{i=1}^N ϕ(y_i; x_i, θ, σ²),

because the y_i are independent random variables. Equivalently, from the definition of the density of the multivariate normal distribution, we can write it as

    ϕ(y, X; θ, σ²) = (2π)^{−N/2} |σ² I|^{−1/2} exp(−(1/2) (y − Xθ)' (σ² I)^{−1} (y − Xθ)).

When y and X are known, the density function is treated as a function of the parameters θ and σ², and in this case we call it the likelihood function, denoted by

    L(y, X; θ, σ²) = (2π)^{−N/2} |σ² I|^{−1/2} exp(−(1/2) (y − Xθ)' (σ² I)^{−1} (y − Xθ)).    (2.8)

By maximizing the function (2.8) for given y and X we obtain the maximum likelihood estimators of θ and σ². Maximizing the logarithm of the function (2.8) is the same as maximizing the likelihood function, so, for easier calculation, we maximize the logarithm of the likelihood function. The log-likelihood function is

    ln L(y, X; θ, σ²) = −(N/2) ln(2π) − (N/2) ln σ² − (1/(2σ²)) (y − Xθ)'(y − Xθ).    (2.9)
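The matrix formulas (2.4) and (2.7) translate almost literally into code. The sketch below is an illustration with artificial data (k = 2 features), not the thesis code, and uses the backslash operator rather than an explicit inverse for numerical stability.

% Illustrative sketch: OLS in matrix form for multiple regression.
N = 100;  k = 2;
X = [ones(N, 1), randn(N, k)];          % design matrix with an intercept column
theta_true = [1; 2; -0.5];
y = X*theta_true + 0.2*randn(N, 1);

theta_hat  = (X'*X) \ (X'*y);           % OLS estimator (2.4)
RSS        = sum((y - X*theta_hat).^2);
sigma2_hat = RSS / (N - k - 1);         % unbiased variance estimator (2.7)
Cov_hat    = sigma2_hat * inv(X'*X);    % estimated covariance matrix of theta_hat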

When we find the gradient of the function ln L(y, X; θ, σ²) with respect to θ and set it equal to 0, we get the Maximum Likelihood Estimator ˆθ, which is given by

    ˆθ = (X'X)^{−1} X'y,

and it is the same as the estimator obtained with the OLS method. The biased estimator of the variance σ² which we get from this is given by

    ˆσ²_b = (1/N) (y − Xˆθ)'(y − Xˆθ).

The unbiased estimator of the variance is

    ˆσ² = (1/(N − k − 1)) (y − Xˆθ)'(y − Xˆθ).

To verify that the estimator ˆθ actually maximizes the function (2.9), we calculate the Hessian matrix of the function (2.9) with respect to θ and show that it is a negative-definite matrix. Since this Hessian is proportional to −X'X and, under the assumption from the beginning of the chapter that rank(X) = k + 1, we have X'X > 0, the claim follows.

2.3.1 Properties of the Maximum Likelihood Estimators

The following properties of the estimators hold under the assumption of normality of the distribution of the errors. We state them without proofs.

- ˆθ ∼ N_{k+1}(θ, σ² (X'X)^{−1});
- (N − k − 1) ˆσ² / σ² ∼ χ²(N − k − 1);
- ˆθ and ˆσ² are independent;
- ˆθ and ˆσ² are jointly sufficient statistics for θ and σ²;
- the estimators ˆθ and ˆσ² have minimum variance among all unbiased estimators.

2.4 Polynomial Regression

In this section we introduce the Polynomial Regression model as a special case of the Multiple Regression model, with only a few properties and short descriptions. You can read more about Polynomial Regression in [7, Chapter 8] and in [10, Section 5.3]. Namely, if we set x_ij = x_i^j in (2.1), j = 1, 2, ..., k, k ∈ N, we get

    y_i = θ_0 + θ_1 x_i + θ_2 x_i² + ... + θ_k x_i^k + ξ_i,   i = 1, 2, ..., N,    (2.10)

which is the k-th degree, or (k + 1)-th order, polynomial regression model.

The inspiration for such a model arises from the Weierstrass approximation theorem (see [2, Chapter VI]), which states that every continuous function on a finite interval can be uniformly approximated as closely as desired by a polynomial function. Although this seems like a great solution, a better approximation requires a higher polynomial degree, which means more unknown parameters to estimate. Theoretically k can go up to N − 1, but when k is greater than approximately 6, the matrix X'X becomes ill-conditioned and other problems arise. The matrix X now becomes

    X = [1 x_1 x_1² x_1³ ... x_1^k;
         1 x_2 x_2² x_2³ ... x_2^k;
         ...
         1 x_N x_N² x_N³ ... x_N^k]    (2.11)

and the matrices y, θ, and ξ are the same as before. The model (2.10) can be written as

    y = Xθ + ξ.    (2.12)

Even though the problem of finding the unknown parameters in Polynomial Regression is similar to the problem in Multiple Regression, Polynomial Regression has special features. The model (2.10) is the k-th order polynomial model in one variable. When k = 2 the model is called quadratic, when k = 3 the model is called cubic, and so on. The model can also be in two or more variables; for example, a second-order polynomial in two variables is given by

    y = θ_0 + θ_1 x_1 + θ_2 x_2 + θ_11 x_1² + θ_22 x_2² + θ_12 x_1 x_2 + ξ,

which is known as the response surface. For our purposes, we will study only Polynomial Regression in one variable. We want to keep the order of the model as low as possible. By fitting a higher-order polynomial we will most likely overfit the model, which means that such a model will not be a good predictor, nor will it enhance understanding of the unknown function. Given our assumption of rank(X) = k + 1 (full column rank), when we increase the order of the polynomial in a polynomial regression model, the matrix X'X becomes ill-conditioned, as we mentioned. This implies that the parameters will be estimated with error, because (X'X)^{−1} might not be computed accurately.

2.4.1 Orthogonal Polynomials

Before computers were available, people had problems calculating the powers x^0, x^1, ..., x^k manually (by hand), but in order to fit the Polynomial Regression this is necessary. Assume we fit the Simple Linear Regression model to some data and then want to increase the order of the model without starting from the very beginning. What we want is to create a situation where adding an extra term merely refines the previous model. We can achieve that using a system of orthogonal polynomials. Now, with computers, this has less use.

The system of orthogonal polynomials can be obtained mathematically using the Gram-Schmidt method. The k-th orthogonal polynomial has degree k. As we mentioned, ill-conditioning is a problem as well. In the polynomial regression model, the assumption that all independent variables are independent is not satisfied. This issue can also be addressed by orthogonal polynomials. There are continuous orthogonal polynomials and discrete orthogonal polynomials. The continuous orthogonal polynomials are the classical orthogonal polynomials such as the Hermite, Laguerre, and Jacobi polynomials. We use discrete orthogonal polynomials, where the orthogonality relation involves summation. The columns of the matrix X in the model (2.12) are not orthogonal. So, if we want to add another term θ_{k+1} x_i^{k+1}, the matrix (X'X)^{−1} will change (we need to calculate it again). Also, the lower-order parameter estimates ˆθ_i, i = 0, 1, ..., k, will change. Let us instead fit the model

    y_i = θ_0 P_0(x_i) + θ_1 P_1(x_i) + θ_2 P_2(x_i) + ... + θ_k P_k(x_i) + ξ_i,   i = 1, 2, ..., N,    (2.13)

where the P_j(x_i) are orthogonal polynomials, P_j(x_i) being a j-th order polynomial, j = 0, 1, ..., k, with P_0(x_i) = 1. From orthogonality we have

    Σ_{i=1}^N P_m(x_i) P_n(x_i) = 0,   m ≠ n,   m, n = 0, 1, ..., k.

The model (2.13) can be written in matrix form as y = Xθ + ξ, where the matrix X is now

    X = [P_0(x_1) P_1(x_1) P_2(x_1) ... P_k(x_1);
         P_0(x_2) P_1(x_2) P_2(x_2) ... P_k(x_2);
         ...
         P_0(x_N) P_1(x_N) P_2(x_N) ... P_k(x_N)]

and, from orthogonality, the following holds:

    X'X = diag(Σ_{i=1}^N P_0²(x_i), Σ_{i=1}^N P_1²(x_i), ..., Σ_{i=1}^N P_k²(x_i)).

We know that the ordinary least squares estimator is given by ˆθ = (X'X)^{−1} X'y, or equivalently

    ˆθ_j = Σ_{i=1}^N P_j(x_i) y_i / Σ_{i=1}^N P_j²(x_i),   j = 0, 1, 2, ..., k.

From (2.6) we have the variance, or equivalently

    Var(ˆθ_j | x) = σ² / Σ_{i=1}^N P_j²(x_i).

It is interesting to notice that

    ˆθ_0 = Σ_{i=1}^N P_0(x_i) y_i / Σ_{i=1}^N P_0²(x_i) = Σ_{i=1}^N y_i / N = ȳ.

Perhaps we want to add a term θ_{k+1} P_{k+1}(x_i) to the model (2.13); then the estimator of θ_{k+1} will be

    ˆθ_{k+1} = Σ_{i=1}^N P_{k+1}(x_i) y_i / Σ_{i=1}^N P_{k+1}²(x_i).

To obtain this estimator, we did not change the other terms in the model; we only look at the newly added term. Because of the orthogonality, there is no need to find (X'X)^{−1} or any of the other estimators again. This is a way to easily fit higher-order polynomial regression models. We can terminate the process when we find the model that is optimal for our purpose.
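The sketch below illustrates the idea with a discrete orthogonal polynomial basis built by Gram-Schmidt. It is only a minimal example with artificial data, not the construction used for the thesis results, but it shows that each coefficient is estimated separately and does not change when a higher-order term is added.

% Illustrative sketch: fitting the model (2.13) with discrete orthogonal
% polynomials obtained by Gram-Schmidt on the columns x.^0, x.^1, ..., x.^k.
N = 60;  kmax = 4;
x = linspace(-1, 1, N)';
y = 1 + 2*x - 3*x.^3 + 0.1*randn(N, 1);   % artificial data

P = zeros(N, kmax + 1);
for j = 0:kmax
    p = x.^j;
    for m = 0:j-1
        % remove the component along each lower-order polynomial
        p = p - (P(:, m+1)'*p) / (P(:, m+1)'*P(:, m+1)) * P(:, m+1);
    end
    P(:, j+1) = p;
end

theta_hat = zeros(kmax + 1, 1);
for j = 0:kmax
    theta_hat(j+1) = (P(:, j+1)'*y) / (P(:, j+1)'*P(:, j+1));   % estimate of theta_j
end
yhat = P*theta_hat;   % fitted values; theta_hat(1) equals mean(y)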

Chapter 3

The Bootstrap

3.1 Introduction

Let x_1, x_2, ..., x_N be a homogeneous sample of data, which can be viewed as the outcomes of independent and identically distributed (i.i.d.) random variables X_1, X_2, ..., X_N with Probability Density Function (PDF) f and Cumulative Distribution Function (CDF) F. Using the sample, we can make inferences about a parameter θ (a population characteristic). To do that, we need a statistic S, which we assume we have already chosen and which is an estimate of θ (a scalar). For our needs, we are focused on how to calculate confidence intervals for the parameter θ using the PDF of the statistic S. We could also be interested in its bias, standard error, or its quantiles. There are two situations: the one we are interested in, the non-parametric, and the parametric. Statistical methods based on a mathematical model with a known parameter τ that fully determines the PDF f are called parametric methods, and the model is called a parametric model. In this case, the parameter θ is a function of the parameter τ. Statistical methods where we use only the fact that the random variables are i.i.d. are non-parametric methods, and the models are called non-parametric models. For the non-parametric analysis, the empirical distribution is important. The empirical distribution assigns equal probability 1/N to each element x_i, i = 1, 2, ..., N, of the sample. The Empirical Distribution Function (EDF) ˆF, as an estimate of the CDF F, is defined as

    ˆF(x) = (number of elements in the sample ≤ x) / N.

The function ˆF can also be written as

    ˆF(x) = (1/N) Σ_{i=1}^N I_{A_i},    (3.1)

where I_{A_i} is the indicator of the event A_i and A_i = {ω : X_i(ω) ≤ x}.

Because of the importance of the EDF, we will define it more formally. Define the function ν(x) as

    ν(x) = #{ j : X_j ≤ x, j = 1, 2, ..., N },   x ∈ R,

where # denotes the cardinality of a set. Now we can define the EDF as

    ˆF(x) = ν(x) / N,   x ∈ R.    (3.2)

The random variable ˆF(x) is a statistic with values in the set {0, 1/N, 2/N, ..., (N−1)/N, 1}. The distribution of this random variable is

    P(ˆF(x) = k/N) = P(ν(x) = k) = C(N, k) F(x)^k (1 − F(x))^{N−k},   k = 0, 1, 2, ..., N,

where F is the CDF, which means that ν(x) = N ˆF(x) follows a Binomial Distribution with parameters N and p = P(X ≤ x) = F(x), x ∈ R. Considering the fact that E(I_{A_i}) = F(x) for x ∈ R, ˆF_N → F almost surely, i.e. P(ˆF_N → F) = 1, which can be proven by Borel's law of large numbers.

3.1.1 Statistics

Many statistics can be represented as a property of the EDF. For example, x̄ = N^{−1} Σ_{i=1}^N x_i (the sample average) is the mean of the EDF. Generally, the statistic s is a function of x_1, x_2, ..., x_N and is not affected by reordering the data, which implies that the statistic s depends on the data only through the EDF ˆF. So, the statistic s can be written as a function of ˆF, s = s(ˆF). The statistical function s(·) can be perceived as a recipe for computing the statistic s from the function ˆF. This function is useful in the non-parametric case, since the parameter θ is defined by the function as s(F) = θ. The mean and the variance can be written as statistical functions:

    s(F) = ∫ x dF(x),
    s(F) = ∫ x² dF(x) − (∫ x dF(x))².

For the parametric methods we often define θ as a function of the model parameter τ, but the same definition works for them too.
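A direct implementation of the EDF (3.2) is a one-liner; the snippet below is a small illustration with artificial data, not part of the thesis code.

% Illustrative sketch: the empirical distribution function (3.2),
% Fhat(x) = (number of sample points <= x) / N.
xsample = randn(100, 1);            % artificial i.i.d. sample
Fhat = @(x) mean(xsample <= x);     % EDF at a single point x

t = linspace(-3, 3, 200);
F = arrayfun(Fhat, t);              % EDF evaluated on a grid, e.g. for plotting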

The notation S = s(·) will be used for the function, and the notation s for the estimate of θ based on the data x_1, x_2, ..., x_N. The estimate can usually be expressed as s = s(ˆF), which mirrors the relation s(F) = θ between the parameter θ and the CDF F. From the definition (3.1), ˆF_N → F, as we mentioned before; then, if s(·) is continuous, S converges to θ as N → ∞ (consistency). We will not go into more detail; applying the bootstrap does not require such formality. We will assume that S = s(ˆF).

3.2 The Bootstrap Estimates

Finding the distribution of the statistic S can help us make inferences about the parameter θ. For example, if we want to obtain a 100(1 − 2α)% confidence interval for θ, we could possibly show that the statistic S has approximately a normal distribution with mean θ + β and standard deviation σ, where β is the bias of S. Under the assumption that the bias and the variance are known,

    P(S ≤ s | F) ≈ Φ((s − (θ + β)) / σ),

where the function Φ is

    Φ(z) = (1/√(2π)) ∫_{−∞}^{z} e^{−t²/2} dt,   z ∈ R.

If the α quantile of the standard normal distribution is z_α = Φ^{−1}(α), then a 100(1 − 2α)% confidence interval for θ is

    s − β − σ z_{1−α} ≤ θ ≤ s − β − σ z_α,    (3.3)

which we obtained from

    P(β + σ z_α ≤ S − θ ≤ β + σ z_{1−α}) ≈ 1 − 2α.

However, the bias and the variance will almost never be known. Therefore, we need to estimate them. Express β and σ as

    β = b(F) = E(S | F) − s(F),   σ² = v(F) = Var(S | F),

where we note that S | F means that the random variables from which S is calculated have distribution F (X_1, X_2, ..., X_N are i.i.d. with CDF F). Assume that ˆF is an estimate of the function F; then we can obtain the estimates of β and σ² as

    B = b(ˆF) = E(S | ˆF) − s(ˆF),
    V = v(ˆF) = Var(S | ˆF).    (3.4)

These estimates are called the bootstrap estimates.

3.3 Parametric Simulation

The bootstrap idea has two steps: first estimating the parameters, and then approximating them using simulation. We do that because sometimes we cannot simply express a formula for calculating the parameter estimates. The practical alternative is re-sampling the data from a fitted parametric model and calculating the properties of S which we need. Let F_τ be the CDF and f_τ the PDF. Suppose that we have data x_1, x_2, ..., x_N and a parametric model for the distribution of the data. Let ˆF(x) = F_ˆτ(x) be the CDF of a fitted model which we get when we estimate τ (usually) by Maximum Likelihood Estimation with ˆτ. Denote by X* a random variable distributed according to ˆF.

3.3.1 Approximations

Suppose now that the calculation is for some reason too complicated. As we mentioned, the alternative is to simulate data sets (re-sample) and estimate the properties. Let X*_1, ..., X*_N be an i.i.d. data set from the distribution ˆF. Denote by S* the statistic calculated from the simulated data set. By repeating the process R times, we obtain R values S*_1, S*_2, ..., S*_R. The estimator of the bias now becomes

    B = b(ˆF) = E(S | ˆF) − s = E*(S*) − s,

and this is estimated by

    B_R = (1/R) Σ_{r=1}^R S*_r − s = S̄* − s.

Here, s is the parameter value for the fitted model, so S* − s is the analogue of S − θ. Similarly, the estimator of the variance of S is

    V_R = (1/(R − 1)) Σ_{r=1}^R (S*_r − S̄*)².

As R increases, by the law of large numbers, B_R converges to B (the exact value under the fitted model), and likewise V_R to V.

3.4 Non-parametric Simulation

Suppose that we have X_1, X_2, ..., X_N for which it is sensible to assume that they are i.i.d. from an unknown distribution F. Using the EDF ˆF we estimate the CDF F, and we use ˆF as we would use a fitted parametric model. First we check whether we can calculate the required properties directly; if not, we simulate data sets (re-sample) and approximate the properties we require empirically. Simulation using the EDF is based on the fact that the EDF puts equal probabilities on each value of the data set x_1, x_2, ..., x_N. So, every simulated sample (re-sample) X*_1, X*_2, ..., X*_N is taken at random, with replacement, from the data. This re-sampling method is called the non-parametric bootstrap.
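The non-parametric bootstrap estimates B_R and V_R can be written in a few lines. The sketch below is only an illustration, not the thesis code: it uses the (biased) sample variance as the statistic S, because its bias is known and therefore easy to compare with, and draws the re-samples with randi.

% Illustrative sketch: non-parametric bootstrap estimates of bias and variance.
stat = @(z) mean((z - mean(z)).^2);   % statistic S: the biased sample variance

xsample = randn(100, 1);              % observed data (artificial)
N = numel(xsample);
s = stat(xsample);                    % statistic computed from the data

R = 2000;
Sstar = zeros(R, 1);
for r = 1:R
    idx      = randi(N, N, 1);        % indices drawn with replacement
    Sstar(r) = stat(xsample(idx));    % bootstrap replicate S*_r
end

B_R = mean(Sstar) - s;                % bootstrap estimate of the bias
V_R = var(Sstar);                     % bootstrap estimate of Var(S)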

3.5 Confidence Intervals

The distribution of S can be used to calculate confidence intervals, which is the main goal of the bootstrap for our needs. There are multiple ways to use bootstrap simulation; we will describe two methods. We could use the normal approximation of the distribution of S. This means that we estimate the limits in (3.3) using the bootstrap estimates of the bias and the variance. Using the bootstrap method, we can also estimate the quantiles of S − θ directly: assuming that (R + 1)p is a whole number, the p quantile of S − θ is estimated by the (R + 1)p-th ordered value of the s*_r − s, that is, s*_{((R+1)p)} − s. So, a 100(1 − 2α)% confidence interval is

    2s − s*_{((R+1)(1−α))} ≤ θ ≤ 2s − s*_{((R+1)α)},    (3.5)

which can be obtained from

    P(a ≤ S − θ ≤ b) = 1 − 2α  ⟹  P(S − b ≤ θ ≤ S − a) = 1 − 2α.

The interval (3.5) is called the basic bootstrap confidence interval. The bigger R is, the more accurate the confidence interval will be. Typically, one takes R > 1000, but there are more factors that the accuracy depends on; for more details you can check the books mentioned in the Bibliography. When the distribution of S − θ depends on unknowns, we try to mimic the Student's t statistic, and therefore define a studentized version of S − θ as

    Z = (S − θ) / √V,

where V is an estimate of Var(S | F). With this, we eliminate the unknown standard deviation when making inferences about the normal mean. The Student-t 100(1 − 2α)% confidence interval for the mean is

    x̄ − ˆσ t_{N−1}(1 − α) ≤ θ ≤ x̄ − ˆσ t_{N−1}(α),

where ˆσ is the estimated standard deviation of the mean, and t_N(α) is the α quantile of the Student-t distribution with N degrees of freedom. We can obtain a 100(1 − 2α)% confidence interval for θ analogously, for the distribution of Z, as follows:

    s − ˆσ z_{1−α} ≤ θ ≤ s − ˆσ z_α,

where z_p is the p quantile of Z. To estimate the quantiles of Z, we use replicates of the studentized bootstrap statistic

    Z* = (S* − s) / √V*,

where the values are obtained from the re-samples X*_1, X*_2, ..., X*_N. When we use the simulated values z*_1, z*_2, ..., z*_R to estimate z_α, we obtain the studentized bootstrap confidence interval for θ:

    s − ˆσ z*_{((R+1)(1−α))} ≤ θ ≤ s − ˆσ z*_{((R+1)α)}.    (3.6)

The studentized bootstrap method is the one used to obtain the confidence intervals in our non-parametric problem.
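Continuing from the replicates Sstar in the earlier sketch, the basic bootstrap interval (3.5) only requires sorting the replicates. Rounding is used below when (R + 1)p is not a whole number; again, this is only an illustration, not the thesis code.

% Illustrative sketch: the basic bootstrap confidence interval (3.5).
alpha   = 0.025;                                  % for a 95% interval
Ssorted = sort(Sstar);
lo = Ssorted(max(1, round((R + 1)*alpha)));       % ~ (R+1)*alpha-th ordered value
hi = Ssorted(min(R, round((R + 1)*(1 - alpha)))); % ~ (R+1)*(1-alpha)-th ordered value

CI_basic = [2*s - hi, 2*s - lo];                  % basic bootstrap interval for theta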

Chapter 4

Simulation and Evaluation

In this chapter, we simulate a real-world problem using data created in the programming language MATLAB. The goal is to estimate two models and test the quotient between them. In order to do that, we assume that we know the true relationship between the variables we observe - we know the true models. The data from which we need to estimate a model has measurement errors. Therefore, suppose that we know both the real data and the data with measurement errors. We want to see how the assumption about the distribution of the measurement errors affects the quotient.

4.1 Mathematical Description of the Problem

Let D̃_1 = {(ỹ_i1, x̃_i1)}_{i=1}^{N_1} and D̃_2 = {(ỹ_j2, x̃_j2)}_{j=1}^{N_2} be two sets of data. We assume that the data x̃_i1, x̃_j2, ỹ_i1 and ỹ_j2 are known to us, and that a measurement error is made when obtaining the data, which means

    x̃_i1 = x_i1 + ξ_i1,   i = 1, 2, ..., N_1;    x̃_j2 = x_j2 + ξ_j2,   j = 1, 2, ..., N_2,
    ỹ_i1 = y_i1 + ε_i1,   i = 1, 2, ..., N_1;    ỹ_j2 = y_j2 + ε_j2,   j = 1, 2, ..., N_2,

where the ξ_i1 and ξ_j2, as well as the ε_i1 and ε_j2, follow the same distribution for every i = 1, 2, ..., N_1, j = 1, 2, ..., N_2. For the purpose of the problem, we will assume that we also know the true values of the data, x_i1, x_j2, y_i1 and y_j2. We know the true relationship between the observed variables Y and X, which means that we know the true models that fit the data. Let D_1 = {(y_i1, x_i1)}_{i=1}^{N_1} and D_2 = {(y_j2, x_j2)}_{j=1}^{N_2}. Depending on the problem, we will use either Simple Linear Regression or Polynomial Regression (the OLS method) to create the models. We will set the parameters in one of the two true models to be 5% smaller than the parameters in the other model. From the data D̃_1 and D̃_2 we estimate the parameters with the OLS method.

This gives us two estimated models y_1(x) and y_2(x), and we are interested in the quotient

    y_1(x) / y_2(x).    (4.1)

Since the goal is to see how and whether the assumption about the distribution of the measurement errors affects the quotient (4.1), we will repeat this process for different errors drawn from the same distribution and obtain a confidence interval for the quotient using the bootstrap method. We know the true quotient, the true ratio between the models, which we will use to check whether it belongs to the confidence interval we obtain.

4.2 An Analogy With the Real World

We want to simulate the relationship between velocity and fuel efficiency in ships. True data always comes with measurement errors due to many factors, which is the reason why we add measurement errors to our data. The assumption that we know the true models can help us understand the relationship between velocity and fuel consumption. Suppose, for example, that we want to see which of two engines is better and by how much (does it use more or less fuel). This can be hard because of the measurement errors; the results might lead us in the wrong direction. By testing the quotient of the two models, assuming different errors, we can see whether the assumption of a particular error should affect our decision about which of the two engines uses less fuel, and how certain we can be in our decision.

4.3 Parameter Estimation

The velocity and fuel consumption data we use comes from ship log data gathered at Qtagg AB, from one ship over roughly half a month. Our real sets of data D_1 and D_2 consist of velocity measured in knots and fuel efficiency measured in liters per hour. The plotted data for the real velocity (without measurement errors) is given in figure (4.1).

Figure 4.1: The data (velocity) without measurement errors

Our assumption of knowing the true models gives us insight into how the fuel efficiency data should look. We consider only three cases.

1. The true models are the following:

       y_1(x) = θ_11 x,
       y_2(x) = θ_12 x,    (4.2)

   where we choose a value for θ_11 and set the other one to be 5% bigger, θ_12 = 1.05 θ_11. As we mentioned, we will always set the parameters to be 5% bigger in one model. Using the models, we can obtain the true data y_i1 and y_j2. The data is given in figure (4.2).

2. The true models are the following:

       y_1(x) = θ_31 x³,
       y_2(x) = θ_32 x³,    (4.3)

   where θ_32 is again 5% bigger than θ_31. The true data y_i1 and y_j2 obtained using those models are given in figure (4.3).

3. The true models are the following:

       y_1(x) = θ_01 + θ_11 x + θ_21 x² + θ_31 x³,
       y_2(x) = θ_02 + θ_12 x + θ_22 x² + θ_32 x³,    (4.4)

   where θ_1 = [ , , , 0.9254] and θ_2 has coefficients 5% greater than θ_1. The data is given in figure (4.4).
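To make the setup concrete, the sketch below generates Case 1 data and adds measurement errors. The velocity values and the coefficient θ_11 used here are placeholders (the thesis uses the ship-log velocity data and its own coefficient values); the uniform errors correspond to the first assumption made in Section 4.4.

% Illustrative sketch of the Case 1 setup (4.2) with placeholder values;
% in the thesis, x is the measured ship velocity in knots and theta_12 is
% 5% larger than theta_11.
x  = 8 + 6*rand(500, 1);                 % placeholder velocity data
theta11 = 1.00;                          % placeholder coefficient (not the thesis value)
theta12 = 1.05*theta11;                  % 5% larger, as in the thesis

y1 = theta11*x;                          % true model 1 (fuel efficiency)
y2 = theta12*x;                          % true model 2

% measurement errors, here Uniform(-0.3, 1) as assumed in Section 4.4
xt1 = x  + (-0.3 + 1.3*rand(size(x)));
yt1 = y1 + (-0.3 + 1.3*rand(size(y1)));
xt2 = x  + (-0.3 + 1.3*rand(size(x)));
yt2 = y2 + (-0.3 + 1.3*rand(size(y2)));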

Figure 4.2: Case 1 - The data (fuel efficiency) without measurement errors

The data in all three cases are without any measurement errors. In the next sections, we add errors to these data. When we add errors, we get new data from which we create the corresponding models. We estimate the parameters using the OLS method described earlier. Depending on the case, we estimate the parameters so as to get the same type of model as the true model. We now describe how to do that. Assume that we have data D̃_1 and D̃_2; using the results for the OLS method, we obtain estimates for each case as follows.

1. In this case we have only one parameter to estimate for each model. Let

       X̃_i = [x̃_1i, x̃_2i, ..., x̃_{N_i,i}]',   Ỹ_i = [ỹ_1i, ỹ_2i, ..., ỹ_{N_i,i}]',   i = 1, 2.

   The OLS estimates are

       ˆθ_1i = (X̃_i' X̃_i)^{−1} X̃_i' Ỹ_i,   i = 1, 2.

2. This case does not differ much from the first case. Using the OLS method and the results on polynomial regression, we get

       ˆθ_3i = (X̃_i' X̃_i)^{−1} X̃_i' Ỹ_i,   i = 1, 2,

Figure 4.3: Case 2 - The data (fuel efficiency) without measurement errors

   where now

       X̃_i = [x̃_1i³, x̃_2i³, ..., x̃_{N_i,i}³]',   Ỹ_i = [ỹ_1i, ỹ_2i, ..., ỹ_{N_i,i}]',   i = 1, 2.

3. This case is classic polynomial regression, so there is no need to describe it further.

Now that we know how to estimate the parameters, we will simply state the results in the next sections.

4.4 Confidence Intervals

In order to obtain confidence intervals for the mean of the quotient, we need data. In the first and second cases, we actually need to obtain a confidence interval for the mean of the quotient of the coefficients. To simulate the data, we do the following.

For the first and second cases we do the same. First, we add errors from the same distribution to the real data sets. From those two sets, we estimate two models as explained above. After the estimation, we save the quotient of the estimated parameters - because it is actually the quotient of the two models. In order to obtain more of the quotients, we repeat the process. Denote the data set of the quotients by Q.

The third case is quite different. Here, the quotient is not just a simple ratio of two parameters. Of course, we first estimate the models using D̃_1 and D̃_2.

Figure 4.4: Case 3 - The data (fuel efficiency) without measurement errors

After the estimation, using the same input parameters x ∈ {x_i1, x_j2 : i = 1, 2, ..., N_1, j = 1, 2, ..., N_2} in (4.1), we obtain a set of data S_1. This set of data is not enough to see how the error affects the quotient. In the same way as we obtained the set S_1, we repeat the process and obtain sets S_1, S_2, ..., S_9000 (we decided that 9000 sets should be enough). Now we compute the mean of every data set S_i, i = 1, 2, ..., 9000, and the collection of these means is the data set of quotients Q.

In both cases, in order to obtain the confidence interval, we use the bootstrap method on the data set Q. We want to see how the assumption that the measurement error has a particular distribution affects the quotient - that is, how the confidence intervals of the mean behave. For each assumed distribution, we state the confidence intervals.

Assuming the random variables ξ_i1, ξ_j2, ε_i1, ε_j2 have the Uniform Distribution on the interval (−0.3, 1) (see (B.2)),

    ξ_i1, ξ_j2, ε_i1, ε_j2 ∼ U(−0.3, 1),

we get the following confidence intervals for the mean of the quotient:

1. in the first case, when the true models are given in (4.2), we get: [ , ]

2. when the true models are given in (4.3): [ , ]

3. and when the true models are given in (4.4): [0.9136, ]

A sample of data taken from the Uniform Distribution is given in figure (4.5).

Figure 4.5: A sample of data taken from the Uniform Distribution

Assuming the random variables ξ_i1, ξ_j2, ε_i1, ε_j2 have the Generalized Pareto Distribution with parameters ξ = 0.1, μ = 0.2 and σ = 0.2 (see (B.3)), we get the following confidence intervals for the mean of the quotient:

1. in the first case, when the true models are given in (4.2), we get: [ , ]

2. when the true models are given in (4.3): [ , ]

3. and when the true models are given in (4.4): [ , ]

A sample of data taken from the Generalized Pareto Distribution is given in figure (4.6).

Figure 4.6: A sample of data taken from the Generalized Pareto Distribution

Assuming the random variables ξ_i1, ξ_j2, ε_i1, ε_j2 have the Normal Distribution with parameters μ = 0, σ² = 0.4 (see (B.4)), we get the following confidence intervals for the mean of the quotient:

1. in the first case, when the true models are given in (4.2), we get: [ , ]

2. when the true models are given in (4.3): [ , ]

3. and when the true models are given in (4.4): [0.5428, ]

A sample of data taken from the Normal Distribution is given in figure (4.7).

Figure 4.7: A sample of data taken from the Normal Distribution

Assuming the random variables ξ_i1, ξ_j2, ε_i1, ε_j2 have the Log-normal Distribution with parameters μ = 0, σ² = 0.1 (see (B.5)), we get the following confidence intervals for the mean of the quotient:

1. in the first case, when the true models are given in (4.2), we get: [ , ]

2. when the true models are given in (4.3): [ , ]

3. and when the true models are given in (4.4): [0.9204, ]

A sample of data taken from the Log-normal Distribution is given in figure (4.8).

Figure 4.8: A sample of data taken from the Log-normal Distribution

Assuming the random variables ξ_i1, ξ_j2, ε_i1, ε_j2 have the Gamma Distribution with parameters α = 21, β = 0.02 (see (B.6)), we get the following confidence intervals for the mean of the quotient:

1. in the first case, when the true models are given in (4.2), we get: [ , ]

2. when the true models are given in (4.3): [ , ]

3. and when the true models are given in (4.4): [0.9467, ]

A sample of data taken from the Gamma Distribution is given in figure (4.9).

Figure 4.9: A sample of data taken from the Gamma Distribution

Assuming the random variables ξ_i1, ξ_j2, ε_i1, ε_j2 have the Student's t-distribution with df = 15 degrees of freedom (see (B.7)), we get the following confidence intervals for the mean of the quotient:

1. in the first case, when the true models are given in (4.2), we get: [ , ]

2. when the true models are given in (4.3): [ , ]

3. and when the true models are given in (4.4): [0.6981, ]

A sample of data taken from the Student's t-distribution is given in figure (4.10).

Figure 4.10: A sample of data taken from the Student's t-distribution

Assuming the random variables ξ_{i1}, ξ_{j2}, ε_{i1}, ε_{j2} have the Chi-Square Distribution with df = 0.8 degrees of freedom (see (B.8)), we get the following confidence intervals of the mean of the quotient:

1. in the first case, when the true models are given in (4.2), we get: [ , ]
2. when the true models are given in (4.3): [ , ]
3. when the true models are given in (4.4): [0.9467, ]

A sample of data taken from the Chi-Square Distribution is given in Figure 4.11.
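For reference, samples such as those shown in Figures 4.5-4.11 can be drawn with SciPy using the parameters stated above. The sketch below is only illustrative: it assumes SciPy's parameterizations (in particular, β is treated as a scale parameter in the Gamma case and σ² is converted to a standard deviation), which need not coincide exactly with the conventions of Appendix B.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000  # sample size for illustration, e.g. for histograms

# Frozen error distributions with the parameters used in this section.
# SciPy's conventions are assumed; compare with Appendix B before use.
error_distributions = {
    "uniform":     stats.uniform(loc=-0.3, scale=1.3),          # U(-0.3, 1)
    "gen_pareto":  stats.genpareto(c=0.1, loc=0.2, scale=0.2),  # xi, mu, sigma
    "normal":      stats.norm(loc=0.0, scale=np.sqrt(0.4)),     # mu = 0, sigma^2 = 0.4
    "log_normal":  stats.lognorm(s=np.sqrt(0.1), scale=np.exp(0.0)),
    "gamma":       stats.gamma(a=21, scale=0.02),               # alpha = 21, beta = 0.02
    "student_t":   stats.t(df=15),
    "chi_square":  stats.chi2(df=0.8),
}

samples = {name: dist.rvs(size=n, random_state=rng)
           for name, dist in error_distributions.items()}

for name, sample in samples.items():
    print(f"{name:11s} sample mean = {sample.mean(): .4f}")

Any of these frozen distributions could replace the uniform draws in the earlier simulation sketch to reproduce the seven cases above.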

Figure 4.11: A sample of data taken from the Chi-Square Distribution

4.5 True Values of the Quotient

From (4.2) and (4.3) we know the true quotient of the models:

θ_{11} = θ = , θ_{31} = θ = ,

which is obviously the same, because we chose the parameters ourselves. Similarly, the true quotient of the model (4.4) is the same:

θ_true = .

4.6 Evaluation of the Results

For easier interpretation of the results, see Table 4.1.
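Since the true quotient is known by construction, one simple way to read Table 4.1 is to check which of the bootstrap intervals above cover it. A minimal helper for that check could look as follows; the interval endpoints and θ_true themselves are the values reported in this chapter and are not reproduced here.

from typing import Dict, Tuple

def covers(interval: Tuple[float, float], theta_true: float) -> bool:
    # True if the (lower, upper) bootstrap confidence interval contains theta_true.
    lo, hi = interval
    return lo <= theta_true <= hi

def coverage_table(intervals: Dict[str, Tuple[float, float]],
                   theta_true: float) -> Dict[str, bool]:
    # Map each error-distribution label to whether its interval covers the true quotient.
    return {name: covers(ci, theta_true) for name, ci in intervals.items()}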
