SUPPLEMENTAL NOTES FOR ROBUST REGULARIZED SINGULAR VALUE DECOMPOSITION WITH APPLICATION TO MORTALITY DATA


By Lingsong Zhang, Haipeng Shen and Jianhua Z. Huang

Purdue University, University of North Carolina, and Texas A&M University

Correspondence to Lingsong Zhang. Shen's work was partially supported by NIH/NIDA (grant 1 RC1 DA) and NSF (CMMI and DMS grants). Huang's work was partially supported by NCI (CA57030), NSF (DMS grants), and Award No. KUS-C, made by King Abdullah University of Science and Technology (KAUST).

1. Derivation of the Criteria for Penalty Parameter Selection. We now derive the leave-one-row/column-out cross-validation and generalized cross-validation criteria for penalty parameter selection. The derivation extends the previous work of Huang, Shen and Buja (2009).

When estimating $v$ given $u$, we can select the penalty parameter $\lambda_v$ by using leave-one-column-out cross-validation. Let $\tilde{v}^{(-j)}$ be the estimate of $v$ obtained by leaving out the $j$th column of $X$, and let $\hat{Y}^{(-j)} = U\tilde{v}^{(-j)}$ be the fitted value of $Y$ using $\tilde{v}^{(-j)}$. Let $Y_j$ and $\hat{Y}^{(-j)}_j$ denote the $j$th blocks of $Y$ and $\hat{Y}^{(-j)}$, respectively, corresponding to the $j$th column of $X$. The leave-one-column-out cross-validation score for updating $v$ given $u$ is defined as

(1)   $\mathrm{CV}(\lambda_v \mid \lambda_u) = \frac{1}{n}\sum_{j=1}^{n} (u\tilde{v}^{(-j)}_j - x_j)^T W_j (u\tilde{v}^{(-j)}_j - x_j) = \frac{1}{n}\sum_{j=1}^{n} (Y_j - \hat{Y}^{(-j)}_j)^T W_j (Y_j - \hat{Y}^{(-j)}_j),$

where $W_j$ is the $j$th diagonal block of $W$. We now derive a shortcut formula for the cross-validation score so that there is no need to actually perform the leave-one-column-out operation when implementing the method, which would be computationally expensive.

Let $\tilde{v}$ denote the estimate of $v$ given in Equation (7) of the main paper, with $j$th element $\tilde{v}_j$. Let $X^*$ be the same as $X$, except that its $j$th column is replaced by $\tilde{v}^{(-j)}_j u$, and let $Y^* = S\,\mathrm{vec}(X^*)$. Note that this means the $j$th blocks (each consisting of $m$ elements) of $Y$ and $Y^*$ differ, while the remaining blocks are the same.

By the definition of $H$, we have that

$\hat{Y}^{(-j)} = U\tilde{v}^{(-j)} = HY^*.$

Thus,

(2)   $Y - \hat{Y}^{(-j)} = Y - HY^* = Y - HY - H(Y^* - Y).$

Denote by $H_{jj}$ the $j$th diagonal block of $H$ (an $m \times m$ square matrix). Since only the $j$th block of $Y^* - Y$ is nonzero, inspecting the $j$th block of both sides of equation (2) gives

$Y_j - \hat{Y}^{(-j)}_j = Y_j - \hat{Y}_j + H_{jj}(Y_j - \hat{Y}^{(-j)}_j).$

Rearranging this equation leads to the following leave-one-out lemma.

Lemma 1. The leave-$j$th-column-out cross-validated residual for estimating $v$ given $u$ is

$Y_j - \hat{Y}^{(-j)}_j = (I - H_{jj})^{-1}(Y_j - \hat{Y}_j).$

Next, we use Lemma 1 to give a simple expression for the leave-one-column-out cross-validation score. Let $O = (U^T W U + 2\Omega_{v|u})^{-1}$. We can show that $H_{jj} = O_{jj}\, u u^T W_j$, where $O_{jj}$ is the $(j,j)$th entry of the matrix $O$. This expression and Lemma 1 can be used to derive the following result, whose proof is given at the end of this section.

Lemma 2. The weighted $j$th leave-one-column-out cross-validation error sum of squares $(u\tilde{v}^{(-j)}_j - x_j)^T W_j (u\tilde{v}^{(-j)}_j - x_j)$ can be written as

(3)   $x_j^T W_j x_j - \dfrac{(x_j^T W_j u)^2}{u^T W_j u} + u^T W_j u\, \dfrac{(\tilde{v}_j - x_j^T W_j u / u^T W_j u)^2}{(1 - O_{jj}\, u^T W_j u)^2}.$

By definition, the leave-one-column-out cross-validation score is the average over $j$ of the expression given in (3). Since the first two terms do not depend on $\lambda_v$ when conditioning on $u$, the cross-validation score can be equivalently expressed as the average over $j$ of the last term in (3), which is (excluding the irrelevant factor $u^T W_j u$)

$\mathrm{CV}(\lambda_v \mid \lambda_u) = \frac{1}{n}\sum_{j=1}^{n} \dfrac{(\tilde{v}_j - x_j^T W_j u / u^T W_j u)^2}{(1 - O_{jj}\, u^T W_j u)^2}.$

Note that $O_{jj}\, u^T W_j u$ is a scalar, so $O_{jj}\, u^T W_j u = \mathrm{tr}(O_{jj}\, u u^T W_j) = \mathrm{tr}(H_{jj})$. Consequently, we obtain the following expression for the leave-one-column-out cross-validation score:

(4)   $\mathrm{CV}(\lambda_v \mid \lambda_u) = \frac{1}{n}\sum_{j=1}^{n} \dfrac{(\tilde{v}_j - x_j^T W_j u / u^T W_j u)^2}{(1 - \mathrm{tr}(H_{jj}))^2}.$

Replacing $\mathrm{tr}(H_{jj})$ in $\mathrm{CV}(\lambda_v \mid \lambda_u)$ with its average over $j$, $\mathrm{tr}(H)/n$, leads to the GCV criterion

(5)   $\mathrm{GCV}(\lambda_v \mid \lambda_u) = \frac{1}{n}\sum_{j=1}^{n} \dfrac{(\tilde{v}_j - x_j^T W_j u / u^T W_j u)^2}{(1 - \mathrm{tr}(H)/n)^2}.$

It can be shown that the $j$th component of $\hat{v} = (U^T W U)^{-1} U^T W Y$, the unregularized update of $v$, is $\hat{v}_j = x_j^T W_j u / u^T W_j u$. Thus, the GCV formula can also be written as

$\mathrm{GCV}(\lambda_v \mid \lambda_u) = \dfrac{\|\tilde{v} - \hat{v}\|^2 / n}{(1 - \mathrm{tr}(H)/n)^2}.$

The GCV formula for selecting the penalty parameter $\lambda_u$ when updating $u$ given $v$ can be derived in a similar manner.

1.1. Proof of Lemma 2. By the result in Lemma 1, the weighted $j$th leave-one-column-out cross-validation error sum of squares (denoted by $r_j$) is

(6)   $r_j = (u\tilde{v}^{(-j)}_j - x_j)^T W_j (u\tilde{v}^{(-j)}_j - x_j) = \{(I - H_{jj})^{-1}(u\tilde{v}_j - x_j)\}^T W_j \{(I - H_{jj})^{-1}(u\tilde{v}_j - x_j)\}.$

Since $H_{jj} = O_{jj}\, u u^T W_j$, using the Sherman--Morrison formula (see (A.2.1) on page 210 of Cook and Weisberg, 1982), we obtain

$(I - H_{jj})^{-1} = I + \dfrac{O_{jj}\, u u^T W_j}{1 - O_{jj}\, u^T W_j u}.$

Thus, letting $z = u\tilde{v}_j - x_j$, we have that

(7)   $(I - H_{jj})^{-1}(u\tilde{v}_j - x_j) = z + \dfrac{O_{jj}\, u u^T W_j z}{1 - O_{jj}\, u^T W_j u}.$

Moreover,

(8)   $z^T W_j z - \dfrac{(u^T W_j z)^2}{u^T W_j u} = x_j^T W_j x_j - \dfrac{(u^T W_j x_j)^2}{u^T W_j u}.$

Plugging (7) into (6), expanding the quadratic form, and applying some additional algebra, we obtain

$r_j = z^T W_j z - \dfrac{(u^T W_j z)^2}{u^T W_j u} + u^T W_j u\, \dfrac{(u^T W_j z / u^T W_j u)^2}{(1 - O_{jj}\, u^T W_j u)^2}.$

The desired result then follows from (8) and the definition of $z$.
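As an illustration of the shortcut, the GCV criterion (5) can be evaluated without any leave-one-out refitting. The following Python sketch assumes cellwise robustness weights, so that $W_j = \mathrm{diag}(w_{1j}, \dots, w_{mj})$, and an effective penalty matrix $2\lambda_v\Omega$ for updating $v$ given $u$; these assumptions, and all variable names, are illustrative rather than the authors' implementation.

    import numpy as np

    def gcv_for_v(X, u, w, Omega, lam_v):
        """GCV criterion (5) for selecting lambda_v given u (schematic sketch).

        X     : m x n data matrix
        u     : current left vector (length m)
        w     : m x n matrix of nonnegative robustness weights (assumption)
        Omega : n x n roughness penalty matrix for v (assumption)
        """
        n = X.shape[1]
        D = np.array([u @ (w[:, j] * u) for j in range(n)])        # u^T W_j u
        c = np.array([u @ (w[:, j] * X[:, j]) for j in range(n)])  # u^T W_j x_j
        v_hat = c / D                                    # unregularized update of v
        O = np.linalg.inv(np.diag(D) + 2.0 * lam_v * Omega)        # matrix O
        v_til = O @ c                                    # regularized update of v
        tr_H = np.sum(np.diag(O) * D)                    # tr(H) = sum_j tr(H_jj)
        return np.sum((v_til - v_hat) ** 2) / n / (1.0 - tr_H / n) ** 2

In practice one would evaluate this function over a grid of $\lambda_v$ values and select the minimizer; the criterion for $\lambda_u$ is handled analogously.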

2. Dealing with Missing Data via the MM Algorithm. We show that the missing-value handling approach described in Section 2.5 of the main paper is an application of the MM algorithm (Hunter and Lange, 2004) and has desirable properties. Let $I_o$ be the set of indices of all $(i, j)$ pairs for which $x_{ij}$ is observable; the indices outside $I_o$ correspond to missing observations. In the presence of missing observations, the RobRSVD criterion function to be minimized is

(9)   $R(u, v) = \sum_{(i,j)\in I_o} \rho\!\left(\dfrac{x_{ij} - u_i v_j}{\sigma}\right) + P_\lambda(u, v),$

where $P_\lambda(u, v)$ is the penalty function defined in Equation (4) of the main paper. Suppose $u^0$ and $v^0$ are some initial guesses of $u$ and $v$. Define $\tilde{X} = (\tilde{x}_{ij})$ with

$\tilde{x}_{ij} = \begin{cases} x_{ij}, & (i,j) \in I_o, \\ u^0_i v^0_j, & (i,j) \notin I_o. \end{cases}$

Define the surrogate criterion function

$\tilde{R}(u, v; u^0, v^0) = \sum_{i,j} \rho\!\left(\dfrac{\tilde{x}_{ij} - u_i v_j}{\sigma}\right) + P_\lambda(u, v).$

It is easily seen that $\tilde{R}(u, v; u^0, v^0) \ge R(u, v)$, with equality when $u = u^0$ and $v = v^0$, and so $\tilde{R}(u, v; u^0, v^0)$ is a majorizing function of $R(u, v)$. The MM algorithm starts from some initial guesses $u^0$ and $v^0$, minimizes the surrogate function $\tilde{R}(u, v; u^0, v^0)$ over $u$ and $v$, updates the initial guesses with the current minimizers, and iterates until convergence. Let $(u^m, v^m)$, $m = 0, 1, 2, \dots$, be the sequence of minimizers generated by the algorithm. We have that

$R(u^{m+1}, v^{m+1}) \le \tilde{R}(u^{m+1}, v^{m+1}; u^m, v^m) \le \tilde{R}(u^m, v^m; u^m, v^m) = R(u^m, v^m).$

Thus the criterion value decreases with the number of iterations, and the algorithm is guaranteed to converge to a local minimum.
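For illustration, the iteration just described can be sketched as follows. The rank-one robust fit is abstracted as a user-supplied function fit_rank_one, a hypothetical stand-in for the RobRSVD update of the main paper; only the fill-in/refit structure follows the text.

    import numpy as np

    def mm_impute(X, fit_rank_one, n_iter=50, tol=1e-6):
        """MM iteration for missing cells (schematic sketch of Section 2).

        X            : m x n array with np.nan marking cells outside I_o
        fit_rank_one : callable mapping a complete matrix to (u, v); a
                       hypothetical stand-in for the RobRSVD rank-one fit
        """
        obs = ~np.isnan(X)                          # observed index set I_o
        u, v = fit_rank_one(np.where(obs, X, 0.0))  # crude initial fill-in
        for _ in range(n_iter):
            X_tilde = np.where(obs, X, np.outer(u, v))  # x~_ij = u_i v_j off I_o
            u_new, v_new = fit_rank_one(X_tilde)        # minimize the surrogate
            done = np.linalg.norm(np.outer(u_new, v_new) - np.outer(u, v)) < tol
            u, v = u_new, v_new
            if done:
                break
        return u, v

Because each pass minimizes the majorizing surrogate $\tilde{R}(u, v; u^m, v^m)$, the criterion $R(u, v)$ in (9) cannot increase between iterations.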

3. Additional Simulation Studies.

3.1. Rank One Signal Matrix with Missing Values. This section provides detailed results for the simulation study reported in Section 3.2 of the main paper. Our motivating Spanish mortality data contain both outliers and missing values, which leads us to investigate the performance of RobRSVD in the presence of missing values. For each simulated data set in Section 3.1 of the main paper, we randomly select and delete 100 cells to form a new data set with missing values. We then use the imputation method described in Section 2.5 of the main paper to estimate $u$ and $v$ for SVD, RSVD and RobRSVD. Figures 1 and 2 compare the boxplots of the $L_2$ distances between the estimates and the truth for $u$ and $v$, respectively. We can clearly see that RobRSVD remains the winner across all the settings considered.

Fig 1: Rank One Simulation with Missing Values: Boxplots of the $L_2$ distance between $\hat{u}$ and $u$.
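The cell-deletion step used to create the missing-value data sets can be sketched as follows; the NaN encoding of missing cells and the random seed are illustrative assumptions, not the authors' code.

    import numpy as np

    def delete_cells(X, n_missing=100, seed=0):
        """Randomly delete n_missing cells from X, marking them as NaN."""
        rng = np.random.default_rng(seed)
        X_miss = X.astype(float).copy()
        idx = rng.choice(X.size, size=n_missing, replace=False)
        X_miss[np.unravel_index(idx, X.shape)] = np.nan
        return X_miss

The resulting matrix can then be passed to the MM imputation procedure of Section 2.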

Fig 2: Rank One Simulation with Missing Values: Boxplots of the $L_2$ distance between $\hat{v}$ and $v$.

3.2. Rank Two Signal Matrix. This section provides detailed results for the simulation study reported in Section 3.3 of the main paper. We now study the situation where the true signal matrix has rank two, using a simulation setting similar to the one considered by Huang, Shen and Buja (2009). In particular, let $U^*_1(y) = \sin(2\pi y)$, $U^*_2(y) = \sin(2\pi(y - 0.25))$, $V^*_1(z) = \exp(-4(z - 0.25)^2)$, $V^*_2(z) = \exp(-4(z - 0.75)^2)$, and consider the following true rank-two two-way functional model:

(10)   $X(y, z) = 100\, U_1(y) V_1(z) + 50\, U_2(y) V_2(z) + \epsilon(y, z),$

with $y \in [0, 1]$ and $z \in [0, 1]$. Here $U_k$ and $V_k$ are the normalized versions of $U^*_k$ and $V^*_k$. To simulate the functional data matrix with no outliers, we consider 100 equally spaced grid points in each direction and sample $\epsilon(y, z)$ independently from a mean-zero normal distribution. The simulation is again repeated 100 times. Similar to the rank-one setting in Section 3.1 of the main paper, we consider the following simulation scenarios:

1. No outliers: The benchmark setting to see how RobRSVD compares with the non-robust methods when there are no outliers.

2. Outlying cells: We randomly select 25 cells in the data and replace their entries with outlying values, in particular values that are randomly simulated from $U[C_1, 2C_1]$, the uniform distribution with support $[C_1, 2C_1]$, with $C_1$ defined similarly as in Section 3.1 of the main paper.

3. Outlying curves: We randomly select five rows and replace them by five outlying curves, defined as $Y_k \sin(4\pi z)$ plus noise, where $Y_k$ is a random number generated from $U[C_1, 2C_1]$.

4. Outlying block: We randomly select a contiguous square block of cells, at most a quarter of the whole matrix in size, and add to the cell entries a random number generated from $U[C_1, 2C_1]$.

5. Outlying diagonal: We replace the diagonal entries with numbers generated from $U[C_1, 2C_1]$.

As pointed out by Huang, Shen and Buja (2009), the defining decomposition in (10) is not in SVD form, as the components are not orthogonal. Hence, below we compare how the three methods estimate the true underlying rank-two signal and the corresponding rank-two subspaces spanned by the first two left and right singular vectors, respectively. We use the following two criteria to gauge the performance of the three methods. The first criterion is the Frobenius distance between the true rank-two signal matrix $X_0$ and the estimated best rank-two matrix $\hat{X}_0$, i.e., the Frobenius norm of the approximation error matrix $\hat{X}_0 - X_0$. Figure 3 summarizes the comparison of the Frobenius distances for the three methods. In all cases with outliers, RobRSVD performs the best, having the smallest average distance and variability. When the data have no outliers, RobRSVD and RSVD perform similarly and both are better than SVD.
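To make the setting concrete, the following sketch generates one data matrix from model (10) and applies the outlying-cells contamination of scenario 2. The grid size follows the text, while the noise standard deviation and the constant $C_1$ are placeholder values (the actual values are given in the main paper), so the sketch is illustrative only.

    import numpy as np

    def simulate_rank_two(n_grid=100, sigma=1.0, C1=50.0, n_outlier_cells=25, seed=0):
        """Simulate one matrix from model (10), optionally with outlying cells.

        sigma and C1 are placeholder values (assumptions); the values actually
        used are specified in the main paper.
        """
        rng = np.random.default_rng(seed)
        y = np.linspace(0.0, 1.0, n_grid)
        z = np.linspace(0.0, 1.0, n_grid)
        U1, U2 = np.sin(2 * np.pi * y), np.sin(2 * np.pi * (y - 0.25))
        V1, V2 = np.exp(-4 * (z - 0.25) ** 2), np.exp(-4 * (z - 0.75) ** 2)
        # U_k, V_k in (10) are the normalized versions of U*_k, V*_k
        U1, U2 = U1 / np.linalg.norm(U1), U2 / np.linalg.norm(U2)
        V1, V2 = V1 / np.linalg.norm(V1), V2 / np.linalg.norm(V2)
        X = 100 * np.outer(U1, V1) + 50 * np.outer(U2, V2)
        X = X + rng.normal(scale=sigma, size=X.shape)
        if n_outlier_cells > 0:                      # scenario 2: outlying cells
            idx = rng.choice(X.size, size=n_outlier_cells, replace=False)
            X.flat[idx] = rng.uniform(C1, 2 * C1, size=n_outlier_cells)
        return X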

Fig 3: Rank Two Simulation: The Frobenius norm of the difference matrices for the rank-two approximations by the different methods.

Fig 4: Rank Two Simulation: The principal angle between $\hat{\mathcal{U}}$ and $\mathcal{U}$.

As our second measure, we use the largest principal angle (Golub and Van Loan, 1996) between the true subspace and the subspace spanned by the corresponding singular vector estimates, which measures the closeness of the two subspaces. Specifically, let $\mathcal{U} = \mathrm{span}(U_1, U_2)$ denote the linear subspace spanned by $U_1(y)$ and $U_2(y)$ evaluated at the grid points, and let $\hat{\mathcal{U}}$ be the corresponding estimate of this subspace. The largest principal angle between $\mathcal{U}$ and $\hat{\mathcal{U}}$ can be computed as $\cos^{-1}(\rho) \times 180/\pi$, where $\rho$ is the minimum singular value of the matrix $Q_{\hat{U}}^T Q_U$, and $Q_{\hat{U}}$ and $Q_U$ are orthonormal basis matrices obtained from the QR decompositions of $\hat{U}$ and $U$, respectively. Similarly, we can define $\mathcal{V} = \mathrm{span}(V_1, V_2)$ and its estimate $\hat{\mathcal{V}}$, and calculate the principal angle between these two subspaces. Figures 4 and 5 compare the boxplots of the principal angles between $\mathcal{U}$ and its estimate $\hat{\mathcal{U}}$ and between $\mathcal{V}$ and its estimate $\hat{\mathcal{V}}$, respectively, obtained from the 100 simulation runs. When the data have no outliers, RSVD performs the best when estimating $\mathcal{U}$, and RSVD and RobRSVD perform similarly, both better than SVD, when estimating $\mathcal{V}$. For all the outlying settings, RobRSVD has the best performance.
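The largest principal angle described above can be computed directly from QR and singular value decompositions; the following is a minimal sketch, not the authors' code.

    import numpy as np

    def largest_principal_angle(A, B):
        """Largest principal angle (in degrees) between the column spaces of A and B.

        A, B : matrices whose columns span the two subspaces, e.g. the true and
               estimated first two (left or right) singular vectors on the grid.
        """
        QA, _ = np.linalg.qr(A)                   # orthonormal basis for span(A)
        QB, _ = np.linalg.qr(B)                   # orthonormal basis for span(B)
        s = np.linalg.svd(QA.T @ QB, compute_uv=False)
        rho = np.clip(s.min(), -1.0, 1.0)         # cosine of the largest angle
        return np.degrees(np.arccos(rho))

This is the quantity summarized in Figures 4 and 5 for the left and right subspaces, respectively.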

Fig 5: Rank Two Simulation: The principal angle between $\hat{\mathcal{V}}$ and $\mathcal{V}$.

4. Additional Analysis on the Spanish Mortality Data. In addition to the individual pairs of singular vectors, we also compare the cumulative approximation performance of the three methods, as well as the corresponding approximation errors, in Figure 6. The top row shows three-dimensional surface plots of the best rank-two approximations, where we focus on ages between 11 and 50 to highlight the comparison. Again, one can clearly see in the SVD and RSVD approximations the two outlying strips, including the one around 1918. As a comparison, the RobRSVD approximation depicts a smooth two-way pattern that is much less affected by the outliers. The color scales suggest that the range of the SVD and RSVD approximations is much wider than that of RobRSVD. The bottom row plots the corresponding approximation error surfaces. Note that the residuals in the outlier periods appear much larger in the RobRSVD plot.

Fig 6: Comparison of the Cumulative Rank-Two Approximations (top row) and the Corresponding Approximation Errors (bottom row) for the three methods; the axes are Year and Age.

References.

Cook, R. D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall, New York.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations, 3rd ed. The Johns Hopkins University Press.

Huang, J. Z., Shen, H. and Buja, A. (2009). The analysis of two-way functional data using two-way regularized singular value decompositions. Journal of the American Statistical Association.

Hunter, D. R. and Lange, K. (2004). A tutorial on MM algorithms. The American Statistician.

Department of Statistics, Purdue University, 150 N. University St., West Lafayette, IN

Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC

Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX
