Total Least Squares Approach in Regression Methods

WDS'08 Proceedings of Contributed Papers, Part I, 88–93, 2008. ISBN 978-80-7378-065-4, MATFYZPRESS

M. Pešta
Charles University, Faculty of Mathematics and Physics, Prague, Czech Republic

Abstract. Total least squares (TLS) is a data modelling technique which can be used for many types of statistical analysis, e.g. regression. In the regression setup, both the dependent and the independent variables are considered to be measured with errors. The TLS approach in statistics is therefore sometimes called errors-in-variables (EIV) modelling and, moreover, this type of regression is usually known as orthogonal regression. We consider an EIV regression model. The necessary algebraic tools are introduced in order to construct the TLS estimator, and a comparison with the classical ordinary least squares estimator is illustrated. Subsequently, the existence and uniqueness of the TLS estimator are discussed. Finally, we show the large sample properties of the TLS estimator, i.e. strong and weak consistency and an asymptotic distribution.

Introduction

Observing several characteristics, which may be thought of as variables, immediately raises a natural question: what is the relationship between these measured characteristics? One possible attitude is that some of the characteristics might be explained by a (functional) dependence on the other characteristics. We therefore regard the former variables as dependent or response variables and the latter as independent or explanatory variables. Our proposed model of dependence contains errors in the response variable (we consider only one dependent variable) as well as in the explanatory variables. First, however, we simply try to find an appropriate fit for a set of points in Euclidean space using a hyperplane, i.e. we approximate several incompatible linear relations. Afterwards, assumptions on the measurement errors are added and, hence, several asymptotic statistical properties are developed.

Overdetermined System

Let us consider the overdetermined system of linear relations

$$ y \approx X\beta, \quad y \in \mathbb{R}^{n}, \ X \in \mathbb{R}^{n \times m}, \ n > m. \qquad (1) $$

The relations in (1) are deliberately not written as equations, because in many cases an exact solution need not exist; only an approximation can be found. Hence one can speak about the best solution of the overdetermined system (1). But best in which sense?

Singular Value Decomposition

Before inquiring into an appropriate solution of (1), we introduce some very important tools for the further exploration.

Theorem (Singular Value Decomposition, SVD). If $A \in \mathbb{R}^{n \times m}$, then there exist orthonormal matrices $U = [u_1, \ldots, u_n] \in \mathbb{R}^{n \times n}$ and $V = [v_1, \ldots, v_m] \in \mathbb{R}^{m \times m}$ such that

$$ U^\top A V = \Sigma = \operatorname{diag}\{\sigma_1, \ldots, \sigma_p\} \in \mathbb{R}^{n \times m}, \quad \sigma_1 \ge \cdots \ge \sigma_p \ge 0, \quad p = \min\{n, m\}. \qquad (2) $$

Proof. See Golub and Van Loan [1996].

In the SVD, the diagonal matrix $\Sigma$ is uniquely determined by $A$ (though the matrices $U$ and $V$ are not). This powerful matrix decomposition allows us to define a cutting point $r$ for a given matrix $A \in \mathbb{R}^{n \times m}$ using its singular values $\sigma_i$:

$$ \sigma_1 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_p = 0, \quad p = \min\{n, m\}. $$
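As a quick numerical illustration of (2) and of the cutting point $r$ (an editorial addition, not part of the original paper), the following NumPy sketch computes the SVD of a small rank-deficient matrix, verifies $U^\top A V = \Sigma$, and reads off $r$ from the singular values. The example matrix and the tolerance `tol` are arbitrary illustrative choices.

```python
import numpy as np

# Small 5x3 matrix of rank 2: the third column is a linear combination of
# the first two, so the last singular value should vanish up to rounding.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 2))
A = np.column_stack([B[:, 0], B[:, 1], B[:, 0] + 2.0 * B[:, 1]])

# Full SVD: U (5x5) and V (3x3) with orthonormal columns, sigma decreasing.
U, sigma, Vt = np.linalg.svd(A, full_matrices=True)

# Check U^T A V = Sigma, the n x m diagonal matrix from (2).
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, sigma)
print("||U^T A V - Sigma|| =", np.linalg.norm(U.T @ A @ Vt.T - Sigma))

# Cutting point r = number of nonzero singular values (an explicit
# tolerance is needed in floating-point arithmetic).
tol = 1e-10
r = int(np.sum(sigma > tol))
print("singular values:", sigma, " r =", r, " rank =", np.linalg.matrix_rank(A))
```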

Since the matrices $U$ and $V$ in (2) are orthonormal, it follows that $\operatorname{rank}(A) = r$, and one obtains a dyadic decomposition (expansion) of the matrix $A$:

$$ A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top. \qquad (3) $$

A suitable matrix norm is also required; hence the Frobenius norm of a matrix $A = (a_{ij})_{i,j=1}^{n,m}$ is defined as

$$ \|A\|_F := \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij}^2} = \sqrt{\operatorname{tr}(A^\top A)} = \sqrt{\sum_{i=1}^{p} \sigma_i^2} = \sqrt{\sum_{i=1}^{r} \sigma_i^2}, \quad p = \min\{n, m\}. \qquad (4) $$

Furthermore, the following approximation theorem, in which a matrix is approximated by another one of lower rank, plays the main role in the forthcoming derivation.

Theorem (Eckart–Young–Mirsky Matrix Approximation). Let the SVD of $A \in \mathbb{R}^{n \times m}$ be given by $A = \sum_{i=1}^{r} \sigma_i u_i v_i^\top$ with $\operatorname{rank}(A) = r$. If $k < r$ and $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^\top$, then

$$ \min_{\operatorname{rank}(B) = k} \|A - B\|_F = \|A - A_k\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}. \qquad (5) $$

Proof. See Eckart and Young [1936] and Mirsky [1960].

Above all, one more technical property needs to be incorporated.

Theorem (Sturm Interlacing Property). Let $n \ge m$ and let the singular values of $A \in \mathbb{R}^{n \times m}$ be $\sigma_1 \ge \cdots \ge \sigma_m$. If $B$ results from $A$ by deleting one column of $A$ and $B$ has singular values $\bar{\sigma}_1 \ge \cdots \ge \bar{\sigma}_{m-1}$, then

$$ \sigma_1 \ge \bar{\sigma}_1 \ge \sigma_2 \ge \bar{\sigma}_2 \ge \cdots \ge \bar{\sigma}_{m-1} \ge \sigma_m \ge 0. \qquad (6) $$

Proof. See Thompson [1972].

Total Least Squares Solution

Now, three basic ways of approximating the overdetermined system (1) are suggested. The traditional approach penalizes only the misfit in the dependent-variable part,

$$ \min_{\epsilon \in \mathbb{R}^n, \, \beta \in \mathbb{R}^m} \|\epsilon\|_2 \quad \text{s.t.} \quad y + \epsilon = X\beta, \qquad (7) $$

and is called ordinary least squares (OLS). Here, the data matrix $X$ is assumed to be known exactly and errors occur only in the vector $y$. The opposite case to OLS is represented by data least squares (DLS), which allows corrections only in the explanatory variables (the independent input data):

$$ \min_{\Theta \in \mathbb{R}^{n \times m}, \, \beta \in \mathbb{R}^m} \|\Theta\|_F \quad \text{s.t.} \quad y = (X + \Theta)\beta. \qquad (8) $$

Finally, we concentrate on the total least squares approach, which minimizes the squared errors in the values of both the dependent and the independent variables:

$$ \min_{[\varepsilon, \Xi] \in \mathbb{R}^{n \times (m+1)}, \, \beta \in \mathbb{R}^m} \|[\varepsilon, \Xi]\|_F \quad \text{s.t.} \quad y + \varepsilon = (X + \Xi)\beta. \qquad (9) $$

A graphical illustration of the three previous cases can be found in Figure 1. One may notice that TLS searches for the orthogonal projection of the observed data onto the unknown approximating hyperplane corresponding to a TLS solution. Once a minimizing $[\hat{\varepsilon}, \hat{\Xi}]$ of the TLS problem (9) is found, any $\beta$ satisfying $y + \hat{\varepsilon} = (X + \hat{\Xi})\beta$ is called a TLS solution. The basic form of the TLS solution was investigated for the first time by Golub and Van Loan [1980].
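Before turning to the TLS solution, here is a brief numerical check (an editorial addition) of the Frobenius-norm identity (4) and the Eckart–Young–Mirsky bound (5): the truncated SVD $A_k$ attains an approximation error equal to the square root of the sum of the discarded squared singular values. The matrix and the rank $k$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 8, 5, 2
A = rng.standard_normal((n, m))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Frobenius-norm identity (4): ||A||_F equals the root sum of squared singular values.
print(np.linalg.norm(A, "fro"), np.sqrt(np.sum(sigma ** 2)))

# Best rank-k approximation A_k = sum_{i<=k} sigma_i u_i v_i^T, cf. (5).
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt(np.sum(sigma[k:] ** 2)))   # the two numbers coincide
```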

Figure 1. Various least squares fits (ordinary, data, and total LS) for the same three data points in the two-dimensional plane, corresponding to the regression setup with one response and one explanatory variable. [Figure omitted in this transcription.]

Theorem (TLS Solution of $y \approx X\beta$). Let the SVD of $X \in \mathbb{R}^{n \times m}$ be given by $X = \sum_{i=1}^{m} \sigma'_i u'_i v'^\top_i$ and the SVD of $[y, X]$ by $[y, X] = \sum_{i=1}^{m+1} \sigma_i u_i v_i^\top$. If $\sigma'_m > \sigma_{m+1}$, then

$$ [\hat{y}, \hat{X}] := [y + \hat{\varepsilon}, X + \hat{\Xi}] = U \hat{\Sigma} V^\top, \quad \hat{\Sigma} = \operatorname{diag}\{\sigma_1, \ldots, \sigma_m, 0\}, \qquad (10) $$

with the corresponding TLS correction matrix

$$ [\hat{\varepsilon}, \hat{\Xi}] = -\sigma_{m+1} u_{m+1} v_{m+1}^\top, \qquad (11) $$

solves the TLS problem, and

$$ \hat{\beta} = -\frac{1}{e_1^\top v_{m+1}} \, [v_{2,m+1}, \ldots, v_{m+1,m+1}]^\top \qquad (12) $$

exists and is the unique solution to $\hat{y} = \hat{X}\beta$.

Proof. By contradiction, we first show that $e_1^\top v_{m+1} \neq 0$. Suppose $v_{1,m+1} = 0$; then there exists a unit vector $w \in \mathbb{R}^m$, $w \neq 0$, such that $v_{m+1} = [0, w^\top]^\top$ and

$$ [0, w^\top] \, [y, X]^\top [y, X] \, [0, w^\top]^\top = \sigma_{m+1}^2, $$

which yields $w^\top X^\top X w = \sigma_{m+1}^2$. But this contradicts the assumption $\sigma'_m > \sigma_{m+1}$, since $\sigma'^2_m$ is the smallest eigenvalue of $X^\top X$.

The Sturm interlacing theorem (6) and the assumption $\sigma'_m > \sigma_{m+1}$ yield $\sigma_m \ge \sigma'_m > \sigma_{m+1}$. Therefore, $\sigma_{m+1}$ is not a repeated singular value of $[y, X]$, and $\sigma'_m > 0$.

If $\sigma_{m+1} \neq 0$, then $\operatorname{rank}[y, X] = m + 1$. We want to find $[\hat{y}, \hat{X}]$ such that $\|[y, X] - [\hat{y}, \hat{X}]\|_F$ is minimal and $[\hat{y}, \hat{X}][1, -\beta^\top]^\top = 0$ for some $\beta$. Therefore, $\operatorname{rank}([\hat{y}, \hat{X}]) = m$, and applying the Eckart–Young–Mirsky theorem (5), one easily obtains the SVD of $[\hat{y}, \hat{X}]$ in (10) and the TLS correction matrix (11), which must have rank one. Now it is clear that the TLS solution is given by the last column of $V$. Finally, since $\dim \operatorname{Ker}([\hat{y}, \hat{X}]) = 1$, the TLS solution (12) must be unique.

If $\sigma_{m+1} = 0$, then $v_{m+1} \in \operatorname{Ker}([y, X])$ and $[y, X][1, -\beta^\top]^\top = 0$. Hence no approximation is needed, the overdetermined system (1) is compatible, and the exact TLS solution is given by (12). Uniqueness of this TLS solution follows from the fact that $[1, -\beta^\top]^\top$ spans $\operatorname{Ker}([y, X])$, which is the orthogonal complement of $\operatorname{Range}([y, X]^\top)$ and has dimension one.

A closed-form expression for the TLS solution (12) can be derived. If $\sigma'_m > \sigma_{m+1}$, the existence and uniqueness of the TLS solution have already been shown. Since the singular vectors $v_i$ from (10) are eigenvectors of $[y, X]^\top [y, X]$, the solution $\hat{\beta}$ also satisfies

$$ [y, X]^\top [y, X] \begin{bmatrix} 1 \\ -\hat{\beta} \end{bmatrix} = \begin{bmatrix} y^\top y & y^\top X \\ X^\top y & X^\top X \end{bmatrix} \begin{bmatrix} 1 \\ -\hat{\beta} \end{bmatrix} = \sigma_{m+1}^2 \begin{bmatrix} 1 \\ -\hat{\beta} \end{bmatrix}, $$

and hence

$$ \hat{\beta} = (X^\top X - \sigma_{m+1}^2 I_m)^{-1} X^\top y. \qquad (13) $$

The previous equation reminds us of an estimator in the ridge regression setup (note the subtracted term $\sigma_{m+1}^2 I_m$). Therefore, owing to the correspondence between ridge regression and TLS orthogonal regression, one may expect to avoid the multicollinearity problems of classical OLS regression (7). Expression (13) looks almost like the OLS estimator $\tilde{\beta}$ of (7), except for the term containing $\sigma_{m+1}^2$; this term is missing in the well-known OLS estimator, which for a full-rank regression matrix is given (cf. the Gauss–Markov theorem) as the solution of the so-called normal equations $X^\top X \tilde{\beta} = X^\top y$.

From a statistical point of view, the situation $\sigma_m = \sigma_{m+1}$ is unlikely to occur for real data and is also quite irrelevant. Nevertheless, Van Huffel and Vandewalle [1991] investigated this case and arrived at the following summary. Suppose $\sigma_q > \sigma_{q+1} = \cdots = \sigma_{m+1}$, $q \le m$, and denote $Q := [v_{q+1}, \ldots, v_{m+1}]$. Then:

- $\sigma_m > \sigma_{m+1}$: the unique TLS solution (12) exists;
- $\sigma_m = \sigma_{m+1}$ and $e_1^\top Q \neq 0^\top$: infinitely many TLS solutions of (9) exist, and one may pick the one with the smallest norm;
- $\sigma_m = \sigma_{m+1}$ and $e_1^\top Q = 0^\top$: no solution of (9) exists, and one needs to define another ("more restrictive") TLS problem.

The more restrictive TLS problem mentioned above is called a nongeneric TLS problem. Simply put, the additional restriction $[\varepsilon, \Xi] Q = 0$ added to the constraints in (9) tries to project out unimportant or redundant data from the original TLS problem (9).

Errors-in-Variables Model

One should pay attention not only to the existence or form of the TLS solution, but also to its properties, e.g. statistical ones. In statistics, the TLS problem (9) corresponds to a so-called errors-in-variables setup. Here, the unobservable true values $y_0$ and $X_0$ satisfy a single linear relationship

$$ y_0 = \alpha 1_n + X_0 \beta, \qquad (14) $$

and the unknown parameters $\alpha$ (intercept) and $\beta$ (regression coefficients) need to be estimated. The observations $y$ and $X$ measure $y_0$ and $X_0$ with additive errors $\varepsilon$ and $\Xi$:

$$ y = y_0 + \varepsilon, \qquad (15) $$
$$ X = X_0 + \Xi. \qquad (16) $$

The rows of the error matrix $[\varepsilon, \Xi]$ are iid with common zero mean and covariance matrix $\sigma_\nu^2 I_{m+1}$, where $\sigma_\nu^2 > 0$ is unknown.

TLS Estimator

For simplicity, we suppose that the condition $\sigma'_m > \sigma_{m+1}$ is satisfied. For practical purposes, let us denote $G := I_n - \frac{1}{n} 1_n 1_n^\top$ with $1_n := [1, \ldots, 1]^\top$. Then we define the estimate of the coefficients $\beta$ as the TLS solution $\hat{\beta}$, and the estimate of the intercept $\alpha$ as

$$ \hat{\alpha} := \bar{y} - [\bar{x}_1, \ldots, \bar{x}_m] \, \hat{\beta}, \qquad (17) $$

where $\bar{x}_i$ denotes the average of the elements of the $i$th column of the matrix $X$. Finally, the variance term $\sigma_\nu^2$ is estimated using the singular values as $\hat{\sigma}_\nu^2 := \frac{1}{n} \sigma_{m+1}^2$.
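To make the construction concrete, the following NumPy sketch (an editorial addition, not part of the original paper) computes the TLS solution (12) from the last right singular vector of $[y, X]$, checks it against the closed form (13), and adds the intercept estimate (17). It applies the SVD to the centred data (multiplication by $G$), which is one natural reading of the estimator with an intercept; the data, noise level, and variable names are illustrative assumptions, and the generic condition $\sigma'_m > \sigma_{m+1}$ is simply assumed to hold for the simulated sample.

```python
import numpy as np

def tls(y, X):
    """TLS solution of y ~ X beta via the SVD of [y, X], cf. (10)-(12)."""
    Z = np.column_stack([y, X])
    U, sigma, Vt = np.linalg.svd(Z, full_matrices=False)
    v = Vt[-1, :]                        # right singular vector for sigma_{m+1}
    beta = -v[1:] / v[0]                 # (12); requires v[0] != 0 (generic case)
    return beta, sigma[-1]

rng = np.random.default_rng(2)
n, m = 200, 3
X0 = rng.standard_normal((n, m))
beta_true, alpha_true = np.array([1.0, -2.0, 0.5]), 3.0
y0 = alpha_true + X0 @ beta_true

# Errors-in-variables data: both y and X observed with iid noise, cf. (15)-(16).
sigma_nu = 0.1
X = X0 + sigma_nu * rng.standard_normal((n, m))
y = y0 + sigma_nu * rng.standard_normal(n)

# Centre the data (multiplication by G = I - (1/n) 1 1^T) and estimate beta.
Xc, yc = X - X.mean(axis=0), y - y.mean()
beta_tls, sig_last = tls(yc, Xc)

# Closed form (13), intercept (17), and the variance estimate from the text.
beta_closed = np.linalg.solve(Xc.T @ Xc - sig_last**2 * np.eye(m), Xc.T @ yc)
alpha_hat = y.mean() - X.mean(axis=0) @ beta_tls
sigma_nu_hat2 = sig_last**2 / n

print("beta_tls     ", beta_tls)
print("beta_closed  ", beta_closed)      # agrees with beta_tls
print("alpha_hat    ", alpha_hat)
print("sigma_nu^2   ", sigma_nu_hat2, " true:", sigma_nu**2)
```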

Large Sample Properties

The asymptotic behaviour of an estimator is one of its basic characteristics. The asymptotic properties can provide some information about the quality (i.e. efficiency) of the estimator.

Consistency

First, we provide a theorem showing the strong consistency of the TLS estimator.

Theorem (Strong Consistency). If $\lim_{n \to \infty} \frac{1}{n} X_0^\top X_0$ exists, then

$$ \lim_{n \to \infty} \hat{\sigma}_\nu^2 \overset{a.s.}{=} \sigma_\nu^2. \qquad (18) $$

Moreover, if $\lim_{n \to \infty} \frac{1}{n} X_0^\top G X_0 > 0$ (positive definite), then

$$ \lim_{n \to \infty} \hat{\beta} \overset{a.s.}{=} \beta, \qquad (19) $$
$$ \lim_{n \to \infty} \hat{\alpha} \overset{a.s.}{=} \alpha. \qquad (20) $$

Proof. See Gleser [1981].

The assumptions in the previous theorem are somewhat restrictive and need not be satisfied, e.g. in a univariate errors-in-variables model in which the values of the independent variable vary linearly with the sample size. Therefore, these assumptions need to be weakened, yielding the following theorem.

Theorem (Weak Consistency). Suppose that the distribution of the rows of $[\varepsilon, \Xi]$ possesses finite fourth moments. Denote $\tilde{X}_0 := [1_n, X_0]$. If

$$ \lambda_{\min}\!\left(\tilde{X}_0^\top \tilde{X}_0\right) \to \infty \quad \text{and} \quad \frac{\lambda_{\min}^2\!\left(\tilde{X}_0^\top \tilde{X}_0\right)}{\lambda_{\max}\!\left(\tilde{X}_0^\top \tilde{X}_0\right)} \to \infty, \quad n \to \infty, \qquad (21) $$

then

$$ \begin{bmatrix} \hat{\alpha} \\ \hat{\beta} \end{bmatrix} \overset{P}{\to} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \quad n \to \infty. $$

Proof. Can be easily derived using Theorem 2 of Gallo [1982a].

The notation $\lambda_{\min}$ (respectively, $\lambda_{\max}$) denotes the minimal (respectively, maximal) eigenvalue. Regarding the finiteness of the fourth moments of the rows of $[\varepsilon, \Xi]$, this mathematically means that for all $i \in \{1, \ldots, n\}$,

$$ \mathbb{E} \prod_{j} \omega_{ij}^{r_j} < \infty, \quad \omega_{ij} \in \{\varepsilon_i, \Xi_{i,1}, \ldots, \Xi_{i,m}\}, \quad r_j \in \mathbb{N}, \quad \sum_{j} r_j = 4. \qquad (22) $$

The assumptions in the previous theorems ensure that the values of the independent variables spread out fast enough. Gallo [1982a] proved that these intermediate assumptions are implied by the assumptions of the theorem on strong consistency.

Asymptotic Distributions

Finally, an asymptotic distribution for further statistical inference has to be shown.

Theorem (Asymptotic Normality). Suppose that the distribution of the rows of $[\varepsilon, \Xi]$ possesses finite fourth moments. If $\lim_{n \to \infty} \frac{1}{n} X_0^\top X_0 > 0$, then

$$ \sqrt{n} \begin{bmatrix} \hat{\alpha} - \alpha \\ \hat{\beta} - \beta \end{bmatrix} $$

has an asymptotic zero-mean multivariate normal distribution as $n \to \infty$.

Proof. See Gallo [1982b].

The covariance matrix of the multivariate normal distribution in the previous theorem is not shown here due to its complicated form; the formula may be found in Gallo [1982b].
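As an illustrative (editorial) complement to the consistency theorems, the small simulation below contrasts the TLS estimator with OLS in a univariate errors-in-variables model as the sample size grows: with equal error variances in both variables, OLS suffers from attenuation bias while the TLS slope approaches the true value, in line with (19). The data-generating choices (uniform true regressor, noise level) are arbitrary assumptions satisfying the spread conditions of the theorems.

```python
import numpy as np

def tls_slope(y, x):
    """Univariate TLS slope from the last right singular vector of the centred [y, x]."""
    Z = np.column_stack([y - y.mean(), x - x.mean()])
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    v = Vt[-1, :]
    return -v[1] / v[0]

rng = np.random.default_rng(3)
beta, sigma_nu = 2.0, 0.5

for n in [100, 1000, 10000, 100000]:
    x0 = rng.uniform(0.0, 5.0, size=n)            # true regressor values
    x = x0 + sigma_nu * rng.standard_normal(n)    # observed with error
    y = beta * x0 + sigma_nu * rng.standard_normal(n)
    ols = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # attenuated towards zero
    print(f"n={n:6d}  OLS={ols:.3f}  TLS={tls_slope(y, x):.3f}  true={beta}")
```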

Discussion and Conclusions

In this paper, the TLS problem is summarized from an algebraic point of view and its connection with the errors-in-variables statistical model is shown. A unification of algebraic and numerical results with statistical ones is demonstrated. The TLS optimization problem is defined here together with its OLS and DLS alternatives. Its solution is found using the spectral information of the system, and the existence and uniqueness of this solution are discussed. The errors-in-variables model, corresponding to orthogonal regression, is introduced. Moreover, a comparison of the classical regression approach with the errors-in-variables setup is shown. Finally, large sample properties of the TLS estimator, i.e. of an estimator in the errors-in-variables model, namely strong and weak consistency and the asymptotic distribution, are recapitulated.

For further research, one may be interested in extending the TLS approach to nonlinear regression or, beyond that, to nonparametric regression. Amemiya [1997] proposed a first-order linearization of the nonlinear relations. Computational stability could be improved using the Golub–Kahan bidiagonalization, which was connected with the TLS problem by Paige and Strakoš [2006]. This approach needs to be studied from the statistical point of view as well.

Acknowledgments. The present work was supported by the Grant Agency of the Czech Republic (grant 201/05/H007).

References

Amemiya, Y., Generalization of the TLS approach in the errors-in-variables problem, in Proceedings of the Second International Workshop on Total Least Squares and Errors-in-Variables Modeling, edited by S. Van Huffel, pp. 77–86, 1997.
Eckart, C. and Young, G., The approximation of one matrix by another of lower rank, Psychometrika, 1, 211–218, 1936.
Gallo, P. P., Consistency of regression estimates when some variables are subject to error, Communications in Statistics: Theory and Methods, 11, 973–983, 1982a.
Gallo, P. P., Properties of Estimators in Errors-in-Variables Models, Ph.D. thesis, Institute of Statistics Mimeo Series, University of North Carolina, Chapel Hill, NC, 1982b.
Gleser, L. J., Estimation in a multivariate errors-in-variables regression model: Large sample results, Annals of Statistics, 9, 24–44, 1981.
Golub, G. H. and Van Loan, C. F., An analysis of the total least squares problem, SIAM Journal on Numerical Analysis, 17, 883–893, 1980.
Golub, G. H. and Van Loan, C. F., Matrix Computations, Johns Hopkins University Press, Baltimore, MD, 3rd edn., 1996.
Mirsky, L., Symmetric gauge functions and unitarily invariant norms, Quarterly Journal of Mathematics Oxford, 11, 50–59, 1960.
Paige, C. C. and Strakoš, Z., Core problems in linear algebraic systems, SIAM Journal on Matrix Analysis and Applications, 27, 861–875, 2006.
Thompson, R. C., Principal submatrices IX: Interlacing inequalities for singular values of submatrices, Linear Algebra and its Applications, 5, 1–12, 1972.
Van Huffel, S. and Vandewalle, J., The Total Least Squares Problem: Computational Aspects and Analysis, SIAM, Philadelphia, PA, 1991.