Total Least Squares Approach in Regression Methods

WDS'08 Proceedings of Contributed Papers, Part I, 88–93, 2008. ISBN 978-80-7378-065-4, MATFYZPRESS

M. Pešta
Charles University, Faculty of Mathematics and Physics, Prague, Czech Republic

Abstract. Total least squares (TLS) is a data modelling technique which can be used for many types of statistical analysis, e.g. regression. In the regression setup, both the dependent and the independent variables are considered to be measured with errors. The TLS approach in statistics is therefore sometimes called errors-in-variables (EIV) modelling and, moreover, this type of regression is usually known as orthogonal regression. We consider an EIV regression model. The necessary algebraic tools are introduced in order to construct the TLS estimator, and a comparison with the classical ordinary least squares estimator is illustrated. Subsequently, the existence and uniqueness of the TLS estimator are discussed. Finally, we show the large sample properties of the TLS estimator, i.e. strong and weak consistency and an asymptotic distribution.

Introduction

Observing several characteristics, which may be thought of as variables, immediately raises a natural question: what is the relationship between these measured characteristics? One possible attitude is that some of the characteristics might be explained by a (functional) dependence on the other characteristics. We therefore regard the former variables as dependent or response variables and the latter as independent or explanatory variables. Our proposed model of dependence contains errors in the response variable (we consider only one dependent variable) as well as in the explanatory variables. First, however, we simply try to find an appropriate fit for a set of points in Euclidean space using a hyperplane, i.e. we approximate several incompatible linear relations. Afterwards, assumptions on the measurement errors are added and, hence, several asymptotic statistical properties are developed.

Overdetermined System

Let us consider the overdetermined system of linear relations

$$ y \approx X\beta, \quad y \in \mathbb{R}^{n}, \ X \in \mathbb{R}^{n \times m}, \ n > m. \qquad (1) $$

The relations in (1) are deliberately not written as equations, because in many cases an exact solution need not exist; only an approximation can be found. Hence one can speak about the best solution of the overdetermined system (1). But best in which sense?

Singular Value Decomposition

Before inquiring into an appropriate solution of (1), we introduce some very important tools for the further exploration.

Theorem (Singular Value Decomposition, SVD). If $A \in \mathbb{R}^{n \times m}$, then there exist orthonormal matrices $U = [u_1, \ldots, u_n] \in \mathbb{R}^{n \times n}$ and $V = [v_1, \ldots, v_m] \in \mathbb{R}^{m \times m}$ such that

$$ U^\top A V = \Sigma = \operatorname{diag}\{\sigma_1, \ldots, \sigma_p\} \in \mathbb{R}^{n \times m}, \quad \sigma_1 \ge \cdots \ge \sigma_p \ge 0, \quad p = \min\{n, m\}. \qquad (2) $$

Proof. See Golub and Van Loan [1996].

In the SVD, the diagonal matrix $\Sigma$ is uniquely determined by $A$ (though the matrices $U$ and $V$ are not). This powerful matrix decomposition allows us to define a cutting point $r$ for a given matrix $A \in \mathbb{R}^{n \times m}$ using its singular values $\sigma_i$:

$$ \sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_p = 0, \quad p = \min\{n, m\}. $$
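As a quick numerical illustration of (2) and of the cutting point $r$ (an editorial addition, not part of the original paper), the following NumPy sketch computes the SVD of a small rank-deficient matrix, verifies $U^\top A V = \Sigma$, and reads off $r$ from the singular values. The example matrix and the tolerance `tol` are arbitrary illustrative choices.

```python
import numpy as np

# Small 5x3 matrix of rank 2: the third column is a linear combination of
# the first two, so the last singular value should vanish up to rounding.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 2))
A = np.column_stack([B[:, 0], B[:, 1], B[:, 0] + 2.0 * B[:, 1]])

# Full SVD: U (5x5) and V (3x3) with orthonormal columns, sigma decreasing.
U, sigma, Vt = np.linalg.svd(A, full_matrices=True)

# Check U^T A V = Sigma, the n x m diagonal matrix from (2).
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, sigma)
print("||U^T A V - Sigma|| =", np.linalg.norm(U.T @ A @ Vt.T - Sigma))

# Cutting point r = number of nonzero singular values (an explicit
# tolerance is needed in floating-point arithmetic).
tol = 1e-10
r = int(np.sum(sigma > tol))
print("singular values:", sigma, " r =", r, " rank =", np.linalg.matrix_rank(A))
```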

Since the matrices $U$ and $V$ in (2) are orthonormal, it follows that $\operatorname{rank}(A) = r$, and one obtains a dyadic decomposition (expansion) of the matrix $A$:

$$ A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top. \qquad (3) $$

A suitable matrix norm is also required; hence the Frobenius norm of a matrix $A = (a_{ij})_{i,j=1}^{n,m}$ is defined as

$$ \|A\|_F := \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij}^2} = \sqrt{\operatorname{tr}(A^\top A)} = \sqrt{\sum_{i=1}^{p} \sigma_i^2} = \sqrt{\sum_{i=1}^{r} \sigma_i^2}, \quad p = \min\{n, m\}. \qquad (4) $$

Furthermore, the following approximation theorem, in which a matrix is approximated by another one of lower rank, plays the main role in the forthcoming derivation.

Theorem (Eckart–Young–Mirsky Matrix Approximation). Let the SVD of $A \in \mathbb{R}^{n \times m}$ be given by $A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$ with $\operatorname{rank}(A) = r$. If $k < r$ and $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$, then

$$ \min_{\operatorname{rank}(B) = k} \|A - B\|_F = \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}. \qquad (5) $$

Proof. See Eckart and Young [1936] and Mirsky [1960].

Above all, one more technical property needs to be incorporated.

Theorem (Sturm Interlacing Property). Let $n \ge m$ and let the singular values of $A \in \mathbb{R}^{n \times m}$ be $\sigma_1 \ge \cdots \ge \sigma_m$. If $B$ results from $A$ by deleting one column of $A$ and $B$ has singular values $\bar{\sigma}_1 \ge \cdots \ge \bar{\sigma}_{m-1}$, then

$$ \sigma_1 \ge \bar{\sigma}_1 \ge \sigma_2 \ge \bar{\sigma}_2 \ge \cdots \ge \bar{\sigma}_{m-1} \ge \sigma_m \ge 0. \qquad (6) $$

Proof. See Thompson [1972].

Total Least Squares Solution

Now, three basic ways of approximating the overdetermined system (1) are suggested. The traditional approach penalizes only the misfit in the dependent-variable part,

$$ \min_{\epsilon \in \mathbb{R}^n, \, \beta \in \mathbb{R}^m} \|\epsilon\|_2 \quad \text{s.t.} \quad y + \epsilon = X\beta, \qquad (7) $$

and is called ordinary least squares (OLS). Here, the data matrix $X$ is assumed to be known exactly and errors occur only in the vector $y$. The opposite case to OLS is represented by data least squares (DLS), which allows corrections only in the explanatory variables (the independent input data):

$$ \min_{\Theta \in \mathbb{R}^{n \times m}, \, \beta \in \mathbb{R}^m} \|\Theta\|_F \quad \text{s.t.} \quad y = (X + \Theta)\beta. \qquad (8) $$

Finally, we concentrate on the total least squares approach, which minimizes the squared errors in the values of both the dependent and the independent variables:

$$ \min_{[\varepsilon, \Xi] \in \mathbb{R}^{n \times (m+1)}, \, \beta \in \mathbb{R}^m} \|[\varepsilon, \Xi]\|_F \quad \text{s.t.} \quad y + \varepsilon = (X + \Xi)\beta. \qquad (9) $$

A graphical illustration of the three previous cases can be found in Figure 1. One may notice that TLS searches for the orthogonal projection of the observed data onto the unknown approximating hyperplane corresponding to a TLS solution. Once a minimizing $[\hat{\varepsilon}, \hat{\Xi}]$ of the TLS problem (9) is found, any $\beta$ satisfying $y + \hat{\varepsilon} = (X + \hat{\Xi})\beta$ is called a TLS solution. The basic form of the TLS solution was investigated for the first time by Golub and Van Loan [1980].
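Before turning to the TLS solution, here is a brief numerical check (an editorial addition) of the Frobenius-norm identity (4) and the Eckart–Young–Mirsky bound (5): the truncated SVD $A_k$ attains an approximation error equal to the square root of the sum of the discarded squared singular values. The matrix and the rank $k$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 8, 5, 2
A = rng.standard_normal((n, m))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Frobenius-norm identity (4): ||A||_F equals the root sum of squared singular values.
print(np.linalg.norm(A, "fro"), np.sqrt(np.sum(sigma ** 2)))

# Best rank-k approximation A_k = sum_{i<=k} sigma_i u_i v_i^T, cf. (5).
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt(np.sum(sigma[k:] ** 2)))   # the two numbers coincide
```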

Figure 1. Various least squares fits (ordinary, data, and total LS) for the same three data points in the two-dimensional plane, corresponding to the regression setup with one response and one explanatory variable. [Figure omitted in this transcription.]

Theorem (TLS Solution of $y \approx X\beta$). Let the SVD of $X \in \mathbb{R}^{n \times m}$ be given by $X = \sum_{i=1}^{m} \sigma'_i u'_i v'^\top_i$ and the SVD of $[y, X]$ by $[y, X] = \sum_{i=1}^{m+1} \sigma_i u_i v_i^\top$. If $\sigma'_m > \sigma_{m+1}$, then

$$ [\hat{y}, \hat{X}] := [y + \hat{\varepsilon}, X + \hat{\Xi}] = U \hat{\Sigma} V^\top, \quad \hat{\Sigma} = \operatorname{diag}\{\sigma_1, \ldots, \sigma_m, 0\}, \qquad (10) $$

with the corresponding TLS correction matrix

$$ [\hat{\varepsilon}, \hat{\Xi}] = -\sigma_{m+1} u_{m+1} v_{m+1}^\top, \qquad (11) $$

solves the TLS problem, and

$$ \hat{\beta} = -\frac{1}{e_1^\top v_{m+1}} \, [v_{2,m+1}, \ldots, v_{m+1,m+1}]^\top \qquad (12) $$

exists and is the unique solution to $\hat{y} = \hat{X}\beta$.

Proof. By contradiction, we first show that $e_1^\top v_{m+1} \neq 0$. Suppose $v_{1,m+1} = 0$; then there exists a unit vector $w \in \mathbb{R}^m$, $w \neq 0$, such that $v_{m+1} = [0, w^\top]^\top$ and

$$ [0, w^\top] \, [y, X]^\top [y, X] \, [0, w^\top]^\top = \sigma_{m+1}^2, $$

which yields $w^\top X^\top X w = \sigma_{m+1}^2$. But this contradicts the assumption $\sigma'_m > \sigma_{m+1}$, since $\sigma'^2_m$ is the smallest eigenvalue of $X^\top X$.

The Sturm interlacing theorem (6) and the assumption $\sigma'_m > \sigma_{m+1}$ yield $\sigma_m \ge \sigma'_m > \sigma_{m+1}$. Therefore, $\sigma_{m+1}$ is not a repeated singular value of $[y, X]$, and $\sigma'_m > 0$.

If $\sigma_{m+1} \neq 0$, then $\operatorname{rank}[y, X] = m + 1$. We want to find $[\hat{y}, \hat{X}]$ such that $\|[y, X] - [\hat{y}, \hat{X}]\|_F$ is minimal and $[\hat{y}, \hat{X}][1, -\beta^\top]^\top = 0$ for some $\beta$. Therefore, $\operatorname{rank}([\hat{y}, \hat{X}]) = m$, and applying the Eckart–Young–Mirsky theorem (5), one easily obtains the SVD of $[\hat{y}, \hat{X}]$ in (10) and the TLS correction matrix (11), which must have rank one. Now it is clear that the TLS solution is given by the last column of $V$. Finally, since $\dim \operatorname{Ker}([\hat{y}, \hat{X}]) = 1$, the TLS solution (12) must be unique.

If $\sigma_{m+1} = 0$, then $v_{m+1} \in \operatorname{Ker}([y, X])$ and $[y, X][1, -\beta^\top]^\top = 0$. Hence no approximation is needed, the overdetermined system (1) is compatible, and the exact TLS solution is given by (12). Uniqueness of this TLS solution follows from the fact that $[1, -\beta^\top]^\top$ spans $\operatorname{Ker}([y, X])$, which is the orthogonal complement of $\operatorname{Range}([y, X]^\top)$ and has dimension one.

A closed-form expression for the TLS solution (12) can be derived. If $\sigma'_m > \sigma_{m+1}$, the existence and uniqueness of the TLS solution have already been shown. Since the singular vectors $v_i$ from (10) are eigenvectors of $[y, X]^\top [y, X]$, the solution $\hat{\beta}$ also satisfies

$$ [y, X]^\top [y, X] \begin{bmatrix} 1 \\ -\hat{\beta} \end{bmatrix} = \begin{bmatrix} y^\top y & y^\top X \\ X^\top y & X^\top X \end{bmatrix} \begin{bmatrix} 1 \\ -\hat{\beta} \end{bmatrix} = \sigma_{m+1}^2 \begin{bmatrix} 1 \\ -\hat{\beta} \end{bmatrix}, $$

and hence

$$ \hat{\beta} = (X^\top X - \sigma_{m+1}^2 I_m)^{-1} X^\top y. \qquad (13) $$

The previous equation reminds us of an estimator in the ridge regression setup (note the subtracted term $\sigma_{m+1}^2 I_m$). Therefore, owing to the correspondence between ridge regression and TLS orthogonal regression, one may expect to avoid the multicollinearity problems of classical OLS regression (7). Expression (13) looks almost like the OLS estimator $\tilde{\beta}$ of (7), except for the term containing $\sigma_{m+1}^2$; this term is missing in the well-known OLS estimator, which for a full-rank regression matrix is given (cf. the Gauss–Markov theorem) as the solution of the so-called normal equations $X^\top X \tilde{\beta} = X^\top y$.

From a statistical point of view, the situation $\sigma_m = \sigma_{m+1}$ is unlikely to occur for real data and is also quite irrelevant. Nevertheless, Van Huffel and Vandewalle [1991] investigated this case and arrived at the following summary. Suppose $\sigma_q > \sigma_{q+1} = \cdots = \sigma_{m+1}$, $q \le m$, and denote $Q := [v_{q+1}, \ldots, v_{m+1}]$. Then:

- $\sigma_m > \sigma_{m+1}$: the unique TLS solution (12) exists;
- $\sigma_m = \sigma_{m+1}$ and $e_1^\top Q \neq 0^\top$: infinitely many TLS solutions of (9) exist, and one may pick the one with the smallest norm;
- $\sigma_m = \sigma_{m+1}$ and $e_1^\top Q = 0^\top$: no solution of (9) exists, and one needs to define another ("more restrictive") TLS problem.

The more restrictive TLS problem mentioned above is called a nongeneric TLS problem. Simply put, the additional restriction $[\varepsilon, \Xi] Q = 0$ added to the constraints in (9) tries to project out unimportant or redundant data from the original TLS problem (9).

Errors-in-Variables Model

One should pay attention not only to the existence or form of the TLS solution, but also to its properties, e.g. statistical ones. In statistics, the TLS problem (9) corresponds to a so-called errors-in-variables setup. Here, the unobservable true values $y_0$ and $X_0$ satisfy a single linear relationship

$$ y_0 = \alpha 1_n + X_0 \beta, \qquad (14) $$

and the unknown parameters $\alpha$ (intercept) and $\beta$ (regression coefficients) need to be estimated. The observations $y$ and $X$ measure $y_0$ and $X_0$ with additive errors $\varepsilon$ and $\Xi$:

$$ y = y_0 + \varepsilon, \qquad (15) $$
$$ X = X_0 + \Xi. \qquad (16) $$

The rows of the error matrix $[\varepsilon, \Xi]$ are iid with common zero mean and covariance matrix $\sigma_\nu^2 I_{m+1}$, where $\sigma_\nu^2 > 0$ is unknown.

TLS Estimator

For simplicity, we suppose that the condition $\sigma'_m > \sigma_{m+1}$ is satisfied. For practical purposes, let us denote $G := I_n - \frac{1}{n} 1_n 1_n^\top$ with $1_n := [1, \ldots, 1]^\top$. Then we define the estimate of the coefficients $\beta$ as the TLS solution $\hat{\beta}$, and the estimate of the intercept $\alpha$ as

$$ \hat{\alpha} := \bar{y} - [\bar{x}_1, \ldots, \bar{x}_m] \, \hat{\beta}, \qquad (17) $$

where $\bar{x}_i$ denotes the average of the elements of the $i$th column of the matrix $X$. Finally, the variance term $\sigma_\nu^2$ is estimated using the singular values as $\hat{\sigma}_\nu^2 := \frac{1}{n} \sigma_{m+1}^2$.
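To make the construction concrete, the following NumPy sketch (an editorial addition, not part of the original paper) computes the TLS solution (12) from the last right singular vector of $[y, X]$, checks it against the closed form (13), and adds the intercept estimate (17). It applies the SVD to the centred data (multiplication by $G$), which is one natural reading of the estimator with an intercept; the data, noise level, and variable names are illustrative assumptions, and the generic condition $\sigma'_m > \sigma_{m+1}$ is simply assumed to hold for the simulated sample.

```python
import numpy as np

def tls(y, X):
    """TLS solution of y ~ X beta via the SVD of [y, X], cf. (10)-(12)."""
    Z = np.column_stack([y, X])
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    v = Vt[-1, :]                        # right singular vector for sigma_{m+1}
    beta = -v[1:] / v[0]                 # (12); requires v[0] != 0 (generic case)
    return beta, sigma[-1]

rng = np.random.default_rng(2)
n, m = 200, 3
X0 = rng.standard_normal((n, m))
beta_true, alpha_true = np.array([1.0, -2.0, 0.5]), 3.0
y0 = alpha_true + X0 @ beta_true

# Errors-in-variables data: both y and X observed with iid noise, cf. (15)-(16).
sigma_nu = 0.1
X = X0 + sigma_nu * rng.standard_normal((n, m))
y = y0 + sigma_nu * rng.standard_normal(n)

# Centre the data (multiplication by G = I - (1/n) 1 1^T) and estimate beta.
Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_tls, sig_last = tls(yc, Xc)

# Closed form (13), intercept (17), and the variance estimate from the text.
beta_closed = np.linalg.solve(Xc.T @ Xc - sig_last**2 * np.eye(m), Xc.T @ yc)
alpha_hat = y.mean() - X.mean(axis=0) @ beta_tls
sigma_nu_hat2 = sig_last**2 / n

print("beta_tls     ", beta_tls)
print("beta_closed  ", beta_closed)      # agrees with beta_tls
print("alpha_hat    ", alpha_hat)
print("sigma_nu^2   ", sigma_nu_hat2, " true:", sigma_nu**2)
```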

Large Sample Properties

The asymptotic behaviour of an estimator is one of its basic characteristics. The asymptotic properties can provide some information about the quality (i.e. efficiency) of the estimator.

Consistency

First, we provide a theorem showing the strong consistency of the TLS estimator.

Theorem (Strong Consistency). If $\lim_{n \to \infty} \frac{1}{n} X_0^\top X_0$ exists, then

$$ \lim_{n \to \infty} \hat{\sigma}_\nu^2 \overset{a.s.}{=} \sigma_\nu^2. \qquad (18) $$

Moreover, if $\lim_{n \to \infty} \frac{1}{n} X_0^\top G X_0 > 0$ (positive definite), then

$$ \lim_{n \to \infty} \hat{\beta} \overset{a.s.}{=} \beta, \qquad (19) $$
$$ \lim_{n \to \infty} \hat{\alpha} \overset{a.s.}{=} \alpha. \qquad (20) $$

Proof. See Gleser [1981].

The assumptions in the previous theorem are somewhat restrictive and need not be satisfied, e.g. in a univariate errors-in-variables model in which the values of the independent variable vary linearly with the sample size. Therefore, these assumptions need to be weakened, yielding the following theorem.

Theorem (Weak Consistency). Suppose that the distribution of the rows of $[\varepsilon, \Xi]$ possesses finite fourth moments. Denote $\tilde{X}_0 := [1_n, X_0]$. If

$$ \lambda_{\min}\!\left(\tilde{X}_0^\top \tilde{X}_0\right) \to \infty \quad \text{and} \quad \frac{\lambda_{\min}^2\!\left(\tilde{X}_0^\top \tilde{X}_0\right)}{\lambda_{\max}\!\left(\tilde{X}_0^\top \tilde{X}_0\right)} \to \infty, \quad n \to \infty, \qquad (21) $$

then

$$ \begin{bmatrix} \hat{\alpha} \\ \hat{\beta} \end{bmatrix} \overset{P}{\to} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \quad n \to \infty. $$

Proof. Can be easily derived using Theorem 2 of Gallo [1982a].

The notation $\lambda_{\min}$ (respectively, $\lambda_{\max}$) denotes the minimal (respectively, maximal) eigenvalue. Regarding the finiteness of the fourth moments of the rows of $[\varepsilon, \Xi]$, this mathematically means that for all $i \in \{1, \ldots, n\}$,

$$ \mathbb{E} \prod_{j} \omega_{ij}^{r_j} < \infty, \quad \omega_{ij} \in \{\varepsilon_i, \Xi_{i,1}, \ldots, \Xi_{i,m}\}, \quad r_j \in \mathbb{N}, \quad \sum_{j} r_j = 4. \qquad (22) $$

The assumptions in the previous theorems ensure that the values of the independent variables spread out fast enough. Gallo [1982a] proved that these intermediate assumptions are implied by the assumptions of the theorem on strong consistency.

Asymptotic Distributions

Finally, an asymptotic distribution for further statistical inference has to be shown.

Theorem (Asymptotic Normality). Suppose that the distribution of the rows of $[\varepsilon, \Xi]$ possesses finite fourth moments. If $\lim_{n \to \infty} \frac{1}{n} X_0^\top X_0 > 0$, then

$$ \sqrt{n} \begin{bmatrix} \hat{\alpha} - \alpha \\ \hat{\beta} - \beta \end{bmatrix} $$

has an asymptotic zero-mean multivariate normal distribution as $n \to \infty$.

Proof. See Gallo [1982b].

The covariance matrix of the multivariate normal distribution in the previous theorem is not shown here due to its complicated form; the formula may be found in Gallo [1982b].
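As an illustrative (editorial) complement to the consistency theorems, the small simulation below contrasts the TLS estimator with OLS in a univariate errors-in-variables model as the sample size grows: with equal error variances in both variables, OLS suffers from attenuation bias while the TLS slope approaches the true value, in line with (19). The data-generating choices (uniform true regressor, noise level) are arbitrary assumptions satisfying the spread conditions of the theorems.

```python
import numpy as np

def tls_slope(y, x):
    """Univariate TLS slope from the last right singular vector of the centred [y, x]."""
    Z = np.column_stack([y - y.mean(), x - x.mean()])
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    v = Vt[-1, :]
    return -v[1] / v[0]

rng = np.random.default_rng(3)
beta, sigma_nu = 2.0, 0.5

for n in [100, 1000, 10000, 100000]:
    x0 = rng.uniform(0.0, 5.0, size=n)            # true regressor values
    x = x0 + sigma_nu * rng.standard_normal(n)    # observed with error
    y = beta * x0 + sigma_nu * rng.standard_normal(n)
    ols = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # attenuated towards zero
    print(f"n={n:6d}  OLS={ols:.3f}  TLS={tls_slope(y, x):.3f}  true={beta}")
```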

Discussion and Conclusions

In this paper, the TLS problem is summarized from an algebraic point of view and its connection with the errors-in-variables statistical model is shown. A unification of algebraic and numerical results with statistical ones is demonstrated. The TLS optimization problem is defined here together with its OLS and DLS alternatives. Its solution is found using the spectral information of the system, and the existence and uniqueness of this solution are discussed. The errors-in-variables model, corresponding to orthogonal regression, is introduced. Moreover, a comparison of the classical regression approach with the errors-in-variables setup is shown. Finally, large sample properties of the TLS estimator, i.e. of an estimator in the errors-in-variables model, namely strong and weak consistency and the asymptotic distribution, are recapitulated.

For further research, one may be interested in extending the TLS approach to nonlinear regression or, beyond that, to nonparametric regression. Amemiya [1997] proposed a first-order linearization of the nonlinear relations. Computational stability could be improved using the Golub–Kahan bidiagonalization, which was connected with the TLS problem by Paige and Strakoš [2006]. This approach needs to be studied from the statistical point of view as well.

Acknowledgments. The present work was supported by the Grant Agency of the Czech Republic (grant 201/05/H007).

References

Amemiya, Y., Generalization of the TLS approach in the errors-in-variables problem, in Proceedings of the Second International Workshop on Total Least Squares and Errors-in-Variables Modeling, edited by S. Van Huffel, pp. 77–86, 1997.
Eckart, C. and Young, G., The approximation of one matrix by another of lower rank, Psychometrika, 1, 211–218, 1936.
Gallo, P. P., Consistency of regression estimates when some variables are subject to error, Communications in Statistics: Theory and Methods, 11, 973–983, 1982a.
Gallo, P. P., Properties of Estimators in Errors-in-Variables Models, Ph.D. thesis, Institute of Statistics Mimeo Series, University of North Carolina, Chapel Hill, NC, 1982b.
Gleser, L. J., Estimation in a multivariate errors-in-variables regression model: Large sample results, Annals of Statistics, 9, 24–44, 1981.
Golub, G. H. and Van Loan, C. F., An analysis of the total least squares problem, SIAM Journal on Numerical Analysis, 17, 883–893, 1980.
Golub, G. H. and Van Loan, C. F., Matrix Computations, Johns Hopkins University Press, Baltimore, MD, 3rd edn., 1996.
Mirsky, L., Symmetric gauge functions and unitarily invariant norms, Quarterly Journal of Mathematics Oxford, 11, 50–59, 1960.
Paige, C. C. and Strakoš, Z., Core problems in linear algebraic systems, SIAM Journal on Matrix Analysis and Applications, 27, 861–875, 2006.
Thompson, R. C., Principal submatrices IX: Interlacing inequalities for singular values of submatrices, Linear Algebra and its Applications, 5, 1–12, 1972.
Van Huffel, S. and Vandewalle, J., The Total Least Squares Problem: Computational Aspects and Analysis, SIAM, Philadelphia, PA, 1991.