Revista INGENIERÍA UC, ISSN 1316-6832, Universidad de Carabobo, Venezuela
Vega-González, Cristóbal E. "Nota Técnica: Selección de modelos en regresión lineal" [Technical Note: Model selection in linear regression]. Revista INGENIERÍA UC, vol. 20, no. 1, January-April 2013. Universidad de Carabobo, Valencia, Venezuela. Available through redalyc.org (Sistema de Información Científica, Red de Revistas Científicas de América Latina, el Caribe, España y Portugal), a non-profit academic project developed under the open-access initiative.
REVISTA INGENIERÍA UC, Vol. 20, No. 1, Abril 2013

Nota Técnica: Selección de modelos en regresión lineal
(Technical Note: Model selection in linear regression)

Cristóbal E. Vega González
Instituto de Matemática y Cálculo Aplicado (IMYCA), Facultad de Ingeniería, Universidad de Carabobo, Valencia, Venezuela. Correo-e: cvega@uc.edu.ve

Abstract.- Motivated by reading several scientific papers with serious flaws in their ordinary least squares fits, the author set out to write this note. The note reviews and recommends several references on ordinary least squares. It then discusses model selection procedures, developing and recommending model selection by the minimum description length (MDL) principle.

Keywords: Ordinary least squares, Model selection criteria, MDL principle

Received: January 2013. Accepted: April 2013.

1. Introduction

Recently, several scientific papers came to our attention in which the authors fit a curve to a dataset by ordinary least squares (OLS). These papers share the following characteristics:

- The researchers fit a calibration curve by OLS to a dataset in which, for l concentration values C_1, ..., C_l, s instrumental-amplitude replicates {A_{j,1}, ..., A_{j,s}} are measured at each concentration (an l × s point dataset). However, instead of using all l · s pairs, they use only the l points (C_j, Ā_j), where Ā_j = Σ_k A_{j,k} / s. The resulting model underfits the data, and the reported t values and residuals are misleading.
- The researchers apply OLS to small datasets, with fewer than 30 points.
- The researchers favor high-degree polynomial fits by OLS, for instance degree 12 with only 20 points.
- To select among the candidate polynomials, the authors use the minimum mean squared error (MSE).

These practices strongly caught our attention, so this note was written with a purely educational intent.

Example 1: An instance of the first observation is a dataset of 30 points: for the concentration values {1, 2, 3, 4, 5, 6}, the authors take 5 amplitude replicates of each. The instrumental calibration curve is the OLS fit of Table 1 and Figure 1. But the authors reported instead an OLS fit to only the six averaged points (C_j, Ā_j) (see Table 2 and Figure 2), and in this second fit they ignore the heteroscedasticity of the residuals.

Table 1: OLS with observations 1-30, dependent variable A. (Coefficient, standard deviation, t value and p value for C; mean and standard deviation of the dependent variable; regression sum of squares; R²; corrected R²; F(1, 28) with p value 1.55e-18.)

Figura 1: Linear fit to the full dataset.

Table 2: OLS with observations 1-6, dependent variable A. (Coefficient, standard deviation, t value and p value for C; summary statistics; F test with p value 6.73e-07.)

Figura 2: Linear fit to the averaged dataset.

Example 2: A critical case is fitting a linear regression to the dataset

A = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (4, 1), (4, 2), (4, 3), (4, 4), (4, 5), (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5)}

and then comparing the result with a linear regression fit to the averaged points

Ā = {(1, 3), (2, 3), (3, 3), (4, 3), (5, 3), (6, 3)}.

This note first describes ordinary least squares, the assumptions OLS requires, the need to estimate t values, and the validation tests on the residuals. It then describes model selection criteria, including a development of the MDL principle.

2. Ordinary least squares

Ordinary least squares (OLS) is a method for estimating the unknown parameters of a linear regression. It minimizes the sum of squared vertical distances between the responses observed in the dataset and the responses predicted by the linear approximation. The resulting estimator can be expressed by a simple formula [1].

The OLS estimator is consistent when the regressors are exogenous and there is no perfect multicollinearity, and it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed, OLS is the maximum likelihood estimator. OLS is used in economics (econometrics), political science and electrical engineering (control theory and signal processing), among many other areas of application. This consistency, however, cannot be demonstrated with a small dataset.

The first clear and concise exposition of the method of least squares was published by Legendre in 1805 [2]. The technique is described as an algebraic procedure for fitting linear equations to data, and Legendre demonstrates the new method by analyzing the same data as Laplace for the shape of the earth. The value of Legendre's method of least squares was immediately recognized by the leading astronomers and geodesists of the time.

With more than one explanatory variable, the method is called multiple linear regression. In multiple linear regression there are several independent variables, or functions of independent variables, x_1, x_2, ..., x_n (potential predictors). Associate with each predictor x_j a binary variable γ_j, and consider the models given by

    y = Σ_{j : γ_j = 1} β_j x_j + ε.    (1)

Replacing each term x_j by the power x^j turns the regression into a polynomial regression [3].

2.1. Estimation of the t values

A priority task in OLS estimation is the estimation of the t values. These indicate whether the calculated coefficients are significantly different from zero.
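The averaging problem of Examples 1 and 2 can be sketched in a few lines of Python (an illustration only; the original analyses were done with statistical packages, and this closed-form code handles just the single-regressor case):

```python
import math

# Example 2 dataset: every combination of C in {1..6} with A in {1..5}.
full = [(c, a) for c in range(1, 7) for a in range(1, 6)]
# Averaging the 5 replicates at each concentration collapses it to 6 points.
avg = [(c, sum(a for cc, a in full if cc == c) / 5) for c in range(1, 7)]

def ols_line(data):
    """Simple-linear-regression OLS: returns slope, intercept, residual sum
    of squares, and the t value of the slope (None when RSS is zero)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in data)
    df = n - 2
    if df <= 0 or rss == 0:
        return b1, b0, rss, None
    se_b1 = math.sqrt(rss / df / sxx)
    return b1, b0, rss, b1 / se_b1

slope_f, icpt_f, rss_f, t_f = ols_line(full)
slope_a, icpt_a, rss_a, t_a = ols_line(avg)

# The full 30 points give a flat line with a large residual sum of squares;
# the 6 averaged points lie exactly on that line, so RSS = 0 and the fit
# looks spuriously perfect (the slope's t statistic is not even defined).
print(slope_f, rss_f)
print(slope_a, rss_a)
```

The averaged fit hides all of the replicate variability, which is exactly why the t values and residuals reported from it are misleading.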
Suppose one is fitting a linear model. It is desired to test the null hypothesis that a parameter equals 0, in which case the hypothesis is that the corresponding variable and the response are unrelated. If a scientific paper does not report the t values, the estimation is worthless.

2.2. Residuals study

After the model is fitted, a study of the residuals is necessary (see, for example, [4]). This study should show that the residuals are independent and identically distributed random variables with mean zero and constant variance. For this, some of the following tests are necessary:

(a) A t test, with null hypothesis: mean = 0.0; alternative: not equal.
(b) A sign test, with null hypothesis: median = 0.0; alternative: not equal.
(c) A signed rank test, with null hypothesis: median = 0.0; alternative: not equal.
(d) A normality test, such as the Shapiro-Wilk test.

If the data come from a stochastic process, such as a time series, the following randomness tests on the residuals are also necessary:

(e) Runs above and below the median.
(f) Runs up and down.
(g) The Box-Pierce test for autocorrelations.
(h) A comparative t test of the means of the first and second halves of the data.
(i) A comparative F test of the variances of the first and second halves of the data.

These residual tests lend rigor to the work.

2.3. Sample size

For the residual tests to work, more than 30 data points are needed; for this reason the dataset size should exceed 30. The OLS fit is a matrix method, which requires the design matrix to be non-singular. When the objective is to estimate n parameters, the dataset size should be greater than 2n + 1. It is therefore not statistically plausible to estimate a polynomial of degree nine or higher from a dataset of only twenty points.
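Two of the residual checks listed in Section 2.2 can be sketched in Python (a minimal illustration, not the paper's implementation; real work would use a statistics package): the one-sample t statistic for zero mean, and the runs-above/below-the-median count, whose expectation under randomness is 2·n1·n2/(n1 + n2) + 1.

```python
import math

def t_statistic_mean_zero(res):
    """t statistic for H0: mean(residuals) = 0."""
    n = len(res)
    m = sum(res) / n
    s2 = sum((r - m) ** 2 for r in res) / (n - 1)  # sample variance
    return m / math.sqrt(s2 / n)

def runs_above_below_median(res):
    """Observed and expected number of runs above/below the median."""
    med = sorted(res)[len(res) // 2]  # simple median (ties are dropped below)
    signs = [r > med for r in res if r != med]
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n1 = sum(signs)
    n2 = len(signs) - n1
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    return runs, expected

# A strongly patterned residual sequence (invented for illustration):
# the mean is near zero, but the run count is far below its expectation,
# so the residuals fail the randomness check.
res = [-1.0, -1.2, -0.8, -1.1, 1.0, 1.2, 0.9, 1.1]
runs, expected = runs_above_below_median(res)
print(runs, expected)  # far fewer runs than expected under randomness
```

A full study would add the sign, signed-rank, Shapiro-Wilk and Box-Pierce tests from library implementations rather than hand-rolled code.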
3. Model selection criteria

3.1. Minimum mean squared error

A terminological distinction arises around the expression "mean squared error" (MSE). The mean squared error of a regression is a number computed from the sum of squares of the computed residuals, not of the unobservable errors. If that sum of squares is divided by n, the number of observations, the result is the mean of the squared residuals. Since this is a biased estimate of the variance of the unobserved errors, the bias is removed by multiplying the mean of the squared residuals by n/df, where df is the number of degrees of freedom (n minus the number of parameters being estimated). This latter formula serves as an unbiased estimate of the variance of the unobserved errors and is called the mean squared error [5].

The OLS method minimizes the sum of squared vertical distances between the observed responses in the dataset and the responses predicted by the linear approximation; that is, OLS minimizes the quadratic error, and with it the mean squared error. Selecting models by minimum MSE therefore amounts to measuring the quality of the tool with the tool itself.

What, then, should a model selection criterion look like? There are several criteria for model selection, namely the Akaike information criterion (AIC) [6], [7], the Bayesian information criterion (BIC) of Schwarz [8], the Hannan-Quinn information criterion (HQC) [9], and Rissanen's minimum description length (MDL) principle [10], [11], [12], [13]. The author of a research paper should choose one of them for model selection. AIC, BIC and HQC are well developed in the literature, and they are implemented in several software packages such as Matlab, Scilab, R (CRAN) and GRETL. Attention here will focus on Rissanen's MDL principle.

3.2. MDL principle

Rissanen [10] condenses the principle of parsimony into his MDL principle: opt for the model that gives the shortest description of the dataset. In this context a parsimonious model is one that is easy to describe.
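The distinction drawn in Section 3.1, the mean of the squared residuals versus the degrees-of-freedom-corrected MSE, can be sketched numerically (residual values invented for illustration):

```python
# Sum of squared residuals from some fitted model with p estimated parameters.
residuals = [0.5, -0.3, 0.1, -0.4, 0.2, -0.1]
p = 2                 # e.g. intercept and slope
n = len(residuals)
ssr = sum(r * r for r in residuals)

biased = ssr / n      # mean of squared residuals: biased for the error variance
mse = ssr / (n - p)   # df-corrected MSE, i.e. (ssr / n) * (n / (n - p))

print(biased, mse)    # the corrected estimate is always the larger one
```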
Accordingly, a good fit implies that the model captures or describes the important characteristics evident in the data.

Let A be a finite set and C a set of codewords; a code C on A is a simple mapping C : A → C. Usually binary codes are considered, so that each codeword is a string of 0's and 1's. In the discrete case, A is a finite set and Q denotes a probability distribution on A. The fundamental premise of the MDL principle is that −log₂ Q can be viewed as the code length of a binary code for the elements, or symbols, of A.

A linear code of length n and rank k is a linear subspace B of dimension k of the vector space F_q^n, where F_q is the finite field with q elements. Such a code is called a q-ary code. If q = 2 or q = 3, the code is described as a binary code or a ternary code, respectively. The vectors in C are called codewords. The size of a code is the number of codewords and equals q^k. The weight of a codeword is the number of its elements that are non-zero, and the distance between two codewords is the Hamming distance between them, that is, the number of elements in which they differ. The distance d of a linear code is the minimum weight of its non-zero codewords, or equivalently, the minimum distance between distinct codewords. A linear code of length n, dimension k, and distance d is called an [n, k, d] code.

In the coding context, the MDL principle suggests choosing the model that provides the shortest description of the dataset. Describing data is formally equivalent to coding it. Thus, in implementing MDL, the focus is on statistical modeling as a means of generating codes, and the resulting code lengths provide a metric by which competing models are compared. As a broad principle, MDL has rich connections with more traditional frameworks for statistical estimation. In classical parametric statistics, for example, we want to estimate the parameter θ of a given model (class) M = { f(x^n | θ); θ ∈ Θ ⊆ R^k } based on observations x^n = (x_1, ..., x_n).
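The premise that −log₂ Q(a) behaves like a code length can be checked on a toy distribution (the symbol probabilities here are invented for illustration):

```python
import math

# Hypothetical distribution Q on the alphabet A = {a, b, c, d}.
Q = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Ideal binary code lengths: -log2 Q(x). Dyadic probabilities give
# integer lengths, matching a Huffman-style prefix code exactly.
lengths = {sym: -math.log2(p) for sym, p in Q.items()}
print(lengths)  # a: 1 bit, b: 2 bits, c and d: 3 bits each

# Kraft inequality: a decodable code satisfies sum of 2^(-length) <= 1;
# equality means the code wastes no capacity.
kraft = sum(2 ** -L for L in lengths.values())
print(kraft)
```

Frequent symbols receive short codewords and rare ones long codewords, which is precisely why a model assigning high probability to the observed data yields a short description of it.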
From a coding perspective, assume that both sender and receiver know which member f_θ of the parametric family M generated a data string x^n (equivalently, both sides know θ). Then Shannon's source coding theorem states that the best description length of x^n (in an average sense) is simply −log f_θ(x^n), because on average the code based on f_θ achieves the entropy lower bound. Adding the cost L(θ) of describing θ itself, one arrives at a code length −log f_θ(x^n) + L(θ) for the data string x^n.

The binary vector γ = (γ_1, ..., γ_M) ∈ {0, 1}^M serves as a simple index for the 2^M possible models given by Equation 1. Let β_γ and X_γ denote the vector of coefficients and the design matrix associated with those variables x_j for which γ_j = 1. Applying the MDL principle to the problem of model selection is then equivalent to identifying one or more vectors γ that yield the best, or nearly best, models for y in Equation 1. In many cases not all of the 2^M possibilities make sense, so the search may be confined to a subset of the index vectors γ.

Forms of minimum description length for regression

For regression, MDL criteria can be written as a sum of two code lengths,

    L(γ) + L(y | X_γ, γ).    (2)

The first summand in Equation 2 penalizes the model description length, and the second the residual energy. The residual energy under OLS is given by

    L(y | X_γ, γ) = (n/2) log( Σ (ε̂_γ)² ),

where the ε̂_γ are the residuals of the estimated model γ. The model description length is given by

    L(γ) = (η k / 2) log n,

where k is the number of non-null parameters and η ∈ {2, 2.5, 3} is a weight that prevents spurious regressions. In applications to regression models, the value η = 2.5 reported in [13] will be used.

4. A polynomial example

Figura 3: Example dataset.

For an illustrative example, take the dataset in Figure 3.
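Before examining the candidate models, the two terms of Equation 2 can be combined into a computable score; a minimal sketch with η = 2.5 as in the text (the residual values below are synthetic, not the paper's data):

```python
import math

def mdl_score(residuals, k, n, eta=2.5):
    """MDL criterion of Equation 2: (n/2)*log(sum of squared residuals)
    plus the penalty (eta*k/2)*log(n), with k non-null parameters."""
    ssr = sum(r * r for r in residuals)
    return (n / 2) * math.log(ssr) + (eta * k / 2) * math.log(n)

# Two competing fits to the same n = 20 points: a small model with a
# modest fit, and a larger model whose extra parameters buy only a tiny
# reduction in residual energy (residuals invented for illustration).
n = 20
res_small = [0.30] * n   # k = 2 parameters
res_large = [0.29] * n   # k = 6 parameters, slightly smaller residuals

score_small = mdl_score(res_small, k=2, n=n)
score_large = mdl_score(res_large, k=6, n=n)
print(score_small, score_large)  # the smaller model has the lower score
```

The larger model always has the smaller SSR, but its penalty term grows with k, so MDL prefers the parsimonious model unless the fit improvement is substantial. This is exactly the behavior exploited in the example below.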
This figure also shows the fitted linear regression y = c₀ + 0.026055 t, together with the t values of its parameters (constant and slope). A simple inspection of Figure 3 shows that the dataset does not follow a linear fit. For this reason, the following alternative polynomial models are determined:

M1 Linear, Figure 3.
M2 Quadratic, Figure 4, Table 3.
M3 y = c₀ + c₁t + c₂t² + c₃t³, Table 4.
M4 y = c₀ + c₁t + c₃t³, Figure 5, Table 5.
M5 y = c₂t² + c₃t³ + c₄t⁴, Table 6.
M6 y = c₁t + c₄t⁴, Figure 6, Table 7.
M7 y = c₂t² + c₃t³ + c₅t⁵, Table 8.

Figura 4: Example quadratic fit.
Figura 5: Example y = c₀ + c₁t + c₃t³ fit.
Figura 6: Example y = c₁t + c₄t⁴ (M6) fit.

Table 3: Example M2 fit (y = c₀ + c₁t + c₂t²).
Table 4: Example M3 fit (y = c₀ + c₁t + c₂t² + c₃t³).
Table 5: Example M4 fit (y = c₀ + c₁t + c₃t³).
Table 6: Example M5 fit (y = c₂t² + c₃t³ + c₄t⁴).
Table 7: Example M6 fit (y = c₁t + c₄t⁴).
Table 8: Example M7 fit (y = c₂t² + c₃t³ + c₅t⁵).

The M3 fit reported in Table 4 is not statistically significant: some of its t values are very small, so their p values exceed any established significance level, say 5 %. This model cannot be reported in a scientific paper.

The models M1, M2, M4, M5, M6 and M7, on the other hand, are statistically significant. How should the model that best fits the dataset be selected?

Table 9: MSE and DL for the example model fits (models M1, M2, M4, M5, M6, M7).

Table 9 shows the MSE and the description length (DL) of the fitted models; with this information a model selection is possible. Model M5 (y = c₂t² + c₃t³ + c₄t⁴) has the minimum MSE, while model M6 (y = c₁t + c₄t⁴, with fitted c₁ = 0.1329) has the minimum description length. M5 has more parameters, which guarantees that it reduces the residual energy, and therefore the MSE; but the number of model parameters must be penalized, which is achieved by the description length. For this reason, the model that best fits the data is M6. Additionally, among the statistically significant models obtained, M6 is the only one that passes the residual randomness tests. We must remember that this example comes from simulated data, and M6 corresponds to the polynomial used as the simulation source.

5. Conclusion

The authors hope that this technical note clears up doubts regarding dataset size and the model selection criterion. For any further inquiry, please contact the Stochastic Processes Laboratory (Laboratorio de Procesos Estocásticos) at the Instituto de Matemática y Cálculo Aplicado (IMYCA), Facultad de Ingeniería, Universidad de Carabobo.

Acknowledgements

This work was partially financed by the project for the creation of the Stochastic Processes Laboratory (FONACIT) and by the Research Direction of the Faculty of Engineering at the Universidad de Carabobo. The author particularly thanks Jhoseth Rodríguez for reviewing the proofs.

Referencias

[1] Burnham, Kenneth P.; Anderson, David (2002). Model Selection and Multi-Model Inference (2nd ed.). Springer.
[2] Legendre, Adrien-Marie (1805). Nouvelles méthodes pour la détermination des orbites des comètes [New Methods for the Determination of the Orbits of Comets] (in French). Paris: F. Didot.
[3] Hazewinkel, Michiel, ed. (2001). "Regression analysis", Encyclopedia of Mathematics. Springer.
[4] DeWayne R.
Derryberry (2014). Basic Data Analysis for Time Series with R. Wiley.
[5] Steel, Robert G. D.; Torrie, James H. (1960). Principles and Procedures of Statistics, with Special Reference to the Biological Sciences. McGraw-Hill.
[6] Akaike, Hirotugu (1974). "A new look at the statistical model identification". IEEE Transactions on Automatic Control 19 (6).
[7] Akaike, Hirotugu (1980). "Likelihood and the Bayes procedure", in Bernardo, J. M.; et al., Bayesian Statistics. Valencia: University Press.
[8] Schwarz, Gideon E. (1978). "Estimating the dimension of a model". Annals of Statistics 6 (2).
[9] Hannan, E. J.; Quinn, B. G. (1979). "The Determination of the Order of an Autoregression". Journal of the Royal Statistical Society, Series B, 41.
[10] Rissanen, J. (1978). "Modeling by shortest data description". Automatica 14 (5).
[11] Hansen, Mark H.; Yu, Bin (2001). "Model Selection and the Principle of Minimum Description Length". Journal of the American Statistical Association 96 (454).
[12] Vega, Cristóbal (2003). Aplicación de las técnicas wavelets a series temporales. Tesis Doctoral, Universidad de Granada, Granada, España.
[13] Rissanen, J. (2007). Information and Complexity in Statistical Modeling. Springer.
Introduction to Econometrics T H I R D E D I T I O N Global Edition James H. Stock Harvard University Mark W. Watson Princeton University Boston Columbus Indianapolis New York San Francisco Upper Saddle
More informationG. S. Maddala Kajal Lahiri. WILEY A John Wiley and Sons, Ltd., Publication
G. S. Maddala Kajal Lahiri WILEY A John Wiley and Sons, Ltd., Publication TEMT Foreword Preface to the Fourth Edition xvii xix Part I Introduction and the Linear Regression Model 1 CHAPTER 1 What is Econometrics?
More informationThe Simple Regression Model. Part II. The Simple Regression Model
Part II The Simple Regression Model As of Sep 22, 2015 Definition 1 The Simple Regression Model Definition Estimation of the model, OLS OLS Statistics Algebraic properties Goodness-of-Fit, the R-square
More informationBristol Business School
Bristol Business School Academic Year: 10/11 Examination Period: January Module Leader: Module Code: Title of Module: John Paul Dunne Econometrics UMEN3P-15-M Examination Date: 12 January 2011 Examination
More informationMeasures of Fit from AR(p)
Measures of Fit from AR(p) Residual Sum of Squared Errors Residual Mean Squared Error Root MSE (Standard Error of Regression) R-squared R-bar-squared = = T t e t SSR 1 2 ˆ = = T t e t p T s 1 2 2 ˆ 1 1
More informationDynamic Time Series Regression: A Panacea for Spurious Correlations
International Journal of Scientific and Research Publications, Volume 6, Issue 10, October 2016 337 Dynamic Time Series Regression: A Panacea for Spurious Correlations Emmanuel Alphonsus Akpan *, Imoh
More informationThe Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models
The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health
More informationLinear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77
Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical
More informationA NEW INFORMATION THEORETIC APPROACH TO ORDER ESTIMATION PROBLEM. Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
A EW IFORMATIO THEORETIC APPROACH TO ORDER ESTIMATIO PROBLEM Soosan Beheshti Munther A. Dahleh Massachusetts Institute of Technology, Cambridge, MA 0239, U.S.A. Abstract: We introduce a new method of model
More informationAutoregressive Moving Average (ARMA) Models and their Practical Applications
Autoregressive Moving Average (ARMA) Models and their Practical Applications Massimo Guidolin February 2018 1 Essential Concepts in Time Series Analysis 1.1 Time Series and Their Properties Time series:
More information9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures
FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models
More informationBusiness Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM
Subject Business Economics Paper No and Title Module No and Title Module Tag 8, Fundamentals of Econometrics 3, The gauss Markov theorem BSE_P8_M3 1 TABLE OF CONTENTS 1. INTRODUCTION 2. ASSUMPTIONS OF
More informationOn Autoregressive Order Selection Criteria
On Autoregressive Order Selection Criteria Venus Khim-Sen Liew Faculty of Economics and Management, Universiti Putra Malaysia, 43400 UPM, Serdang, Malaysia This version: 1 March 2004. Abstract This study
More informationStatistics 262: Intermediate Biostatistics Model selection
Statistics 262: Intermediate Biostatistics Model selection Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Today s class Model selection. Strategies for model selection.
More informationFinancial Econometrics
Financial Econometrics Multivariate Time Series Analysis: VAR Gerald P. Dwyer Trinity College, Dublin January 2013 GPD (TCD) VAR 01/13 1 / 25 Structural equations Suppose have simultaneous system for supply
More informationApplied Econometrics. Professor Bernard Fingleton
Applied Econometrics Professor Bernard Fingleton Regression A quick summary of some key issues Some key issues Text book JH Stock & MW Watson Introduction to Econometrics 2nd Edition Software Gretl Gretl.sourceforge.net
More informationLecture#17. Time series III
Lecture#17 Time series III 1 Dynamic causal effects Think of macroeconomic data. Difficult to think of an RCT. Substitute: different treatments to the same (observation unit) at different points in time.
More informationAn Introduction to Econometrics. A Self-contained Approach. Frank Westhoff. The MIT Press Cambridge, Massachusetts London, England
An Introduction to Econometrics A Self-contained Approach Frank Westhoff The MIT Press Cambridge, Massachusetts London, England How to Use This Book xvii 1 Descriptive Statistics 1 Chapter 1 Prep Questions
More informationEstimating multilevel models for categorical data via generalized least squares
Revista Colombiana de Estadística Volumen 28 N o 1. pp. 63 a 76. Junio 2005 Estimating multilevel models for categorical data via generalized least squares Minerva Montero Díaz * Valia Guerra Ones ** Resumen
More informationAdditive Outlier Detection in Seasonal ARIMA Models by a Modified Bayesian Information Criterion
13 Additive Outlier Detection in Seasonal ARIMA Models by a Modified Bayesian Information Criterion Pedro Galeano and Daniel Peña CONTENTS 13.1 Introduction... 317 13.2 Formulation of the Outlier Detection
More informationIntroduction to Regression Analysis. Dr. Devlina Chatterjee 11 th August, 2017
Introduction to Regression Analysis Dr. Devlina Chatterjee 11 th August, 2017 What is regression analysis? Regression analysis is a statistical technique for studying linear relationships. One dependent
More informationEconometric Forecasting Overview
Econometric Forecasting Overview April 30, 2014 Econometric Forecasting Econometric models attempt to quantify the relationship between the parameter of interest (dependent variable) and a number of factors
More informationMinimum Message Length Autoregressive Model Order Selection
Minimum Message Length Autoregressive Model Order Selection Leigh J. Fitzgibbon School of Computer Science and Software Engineering, Monash University Clayton, Victoria 38, Australia leighf@csse.monash.edu.au
More informationTESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST
Econometrics Working Paper EWP0402 ISSN 1485-6441 Department of Economics TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Lauren Bin Dong & David E. A. Giles Department
More informationLinear Model Selection and Regularization
Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In
More informationREVISTA INVESTIGACION OPERACIONAL VOL. 38, NO. 3, , 2017
REVISTA INVESTIGACION OPERACIONAL VOL. 38, NO. 3, 247-251, 2017 LINEAR REGRESSION: AN ALTERNATIVE TO LOGISTIC REGRESSION THROUGH THE NON- PARAMETRIC REGRESSION Ernesto P. Menéndez*, Julia A. Montano**
More informationUNIVERSIDAD CARLOS III DE MADRID ECONOMETRICS Academic year 2009/10 FINAL EXAM (2nd Call) June, 25, 2010
UNIVERSIDAD CARLOS III DE MADRID ECONOMETRICS Academic year 2009/10 FINAL EXAM (2nd Call) June, 25, 2010 Very important: Take into account that: 1. Each question, unless otherwise stated, requires a complete
More informationBootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions
JKAU: Sci., Vol. 21 No. 2, pp: 197-212 (2009 A.D. / 1430 A.H.); DOI: 10.4197 / Sci. 21-2.2 Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions Ali Hussein Al-Marshadi
More informationMultiple Regression. Peerapat Wongchaiwat, Ph.D.
Peerapat Wongchaiwat, Ph.D. wongchaiwat@hotmail.com The Multiple Regression Model Examine the linear relationship between 1 dependent (Y) & 2 or more independent variables (X i ) Multiple Regression Model
More information1. The Multivariate Classical Linear Regression Model
Business School, Brunel University MSc. EC550/5509 Modelling Financial Decisions and Markets/Introduction to Quantitative Methods Prof. Menelaos Karanasos (Room SS69, Tel. 08956584) Lecture Notes 5. The
More information10. Time series regression and forecasting
10. Time series regression and forecasting Key feature of this section: Analysis of data on a single entity observed at multiple points in time (time series data) Typical research questions: What is the
More informationHeteroskedasticity. Part VII. Heteroskedasticity
Part VII Heteroskedasticity As of Oct 15, 2015 1 Heteroskedasticity Consequences Heteroskedasticity-robust inference Testing for Heteroskedasticity Weighted Least Squares (WLS) Feasible generalized Least
More informationTesting and Model Selection
Testing and Model Selection This is another digression on general statistics: see PE App C.8.4. The EViews output for least squares, probit and logit includes some statistics relevant to testing hypotheses
More informationEstimating AR/MA models
September 17, 2009 Goals The likelihood estimation of AR/MA models AR(1) MA(1) Inference Model specification for a given dataset Why MLE? Traditional linear statistics is one methodology of estimating
More informationIntroduction to Eco n o m et rics
2008 AGI-Information Management Consultants May be used for personal purporses only or by libraries associated to dandelon.com network. Introduction to Eco n o m et rics Third Edition G.S. Maddala Formerly
More informationAnswer all questions from part I. Answer two question from part II.a, and one question from part II.b.
B203: Quantitative Methods Answer all questions from part I. Answer two question from part II.a, and one question from part II.b. Part I: Compulsory Questions. Answer all questions. Each question carries
More informationLeast Squares Estimation-Finite-Sample Properties
Least Squares Estimation-Finite-Sample Properties Ping Yu School of Economics and Finance The University of Hong Kong Ping Yu (HKU) Finite-Sample 1 / 29 Terminology and Assumptions 1 Terminology and Assumptions
More informationThe Simple Linear Regression Model
The Simple Linear Regression Model Lesson 3 Ryan Safner 1 1 Department of Economics Hood College ECON 480 - Econometrics Fall 2017 Ryan Safner (Hood College) ECON 480 - Lesson 3 Fall 2017 1 / 77 Bivariate
More informationHow the mean changes depends on the other variable. Plots can show what s happening...
Chapter 8 (continued) Section 8.2: Interaction models An interaction model includes one or several cross-product terms. Example: two predictors Y i = β 0 + β 1 x i1 + β 2 x i2 + β 12 x i1 x i2 + ɛ i. How
More informationA MULTIVARIATE MODEL FOR COMPARISON OF TWO DATASETS AND ITS APPLICATION TO FMRI ANALYSIS
A MULTIVARIATE MODEL FOR COMPARISON OF TWO DATASETS AND ITS APPLICATION TO FMRI ANALYSIS Yi-Ou Li and Tülay Adalı University of Maryland Baltimore County Baltimore, MD Vince D. Calhoun The MIND Institute
More informationEcon 510 B. Brown Spring 2014 Final Exam Answers
Econ 510 B. Brown Spring 2014 Final Exam Answers Answer five of the following questions. You must answer question 7. The question are weighted equally. You have 2.5 hours. You may use a calculator. Brevity
More information7. Estimation and hypothesis testing. Objective. Recommended reading
7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing
More informationECON 4160, Spring term Lecture 12
ECON 4160, Spring term 2013. Lecture 12 Non-stationarity and co-integration 2/2 Ragnar Nymoen Department of Economics 13 Nov 2013 1 / 53 Introduction I So far we have considered: Stationary VAR, with deterministic
More informationFinal Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58
Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple
More informationECON 4160, Lecture 11 and 12
ECON 4160, 2016. Lecture 11 and 12 Co-integration Ragnar Nymoen Department of Economics 9 November 2017 1 / 43 Introduction I So far we have considered: Stationary VAR ( no unit roots ) Standard inference
More informationSTAT 100C: Linear models
STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 21 Model selection Choosing the best model among a collection of models {M 1, M 2..., M N }. What is a good model? 1. fits the data well (model
More informationReview of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley
Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate
More information