
Several Flawed Approaches to Penalized SVDs

A supplementary note to "The analysis of two-way functional data using two-way regularized singular value decompositions"

Jianhua Z. Huang, Haipeng Shen, Andreas Buja

Author note: Jianhua Z. Huang is Professor (Email: jianhua@stat.tamu.edu), Department of Statistics, Texas A&M University, College Station, TX 77843. Haipeng Shen (Email: haipeng@email.unc.edu) is Assistant Professor, Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599. Andreas Buja (Email: buja@wharton.upenn.edu) is Liem Sioe Liong/First Pacific Company Professor, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104. Jianhua Z. Huang's work was partially supported by NSF grant DMS-0606580, NCI grant CA57030, and Award Number KUS-CI-016-04, made by King Abdullah University of Science and Technology (KAUST). Haipeng Shen's work was partially supported by NSF grants DMS-0606577 and CMMI-0800575, and a UNC-CH R. J. Reynolds Fund Award for Junior Faculty Development.

Abstract. The application of two-way penalization to SVDs is not a trivial matter. In this note we show that several natural approaches to penalized SVDs do not work and explain why. We present these flawed approaches to spare readers a fruitless search down dead ends.

1 Outline

Section 2 presents several approaches to penalized SVD obtained by generalizing penalized regression and then discusses their flaws. Sections 3 and 4 provide a deeper discussion by comparing the criteria in terms of bi-Rayleigh quotients and of alternating optimization algorithms, respectively.

2 Several approaches to penalized SVD

We consider an $n \times m$ data matrix $X$ whose row and column indexes both live in structured domains, and our goal is to find best rank-one approximations of $X$ in a sense that reflects the domain structure. The calculations generalize to any rank-$q$ approximation. We write rank-one approximations as $uv^T$, where $u$ and $v$ are $n$- and $m$-vectors, respectively. We will not assume that either is normalized; hence they are unique only up to a scale factor that can be shifted between them:

    $u \to cu, \qquad v \to v/c.$   (1)

It will turn out that these scale transformations play an essential role in the design of a desirable form of regularized SVD. As a reminder, the criterion for unregularized standard SVD is

    $C_0(u, v) = \|X - uv^T\|_F^2,$   (2)

where $\|\cdot\|_F^2$ denotes the squared Frobenius norm of a matrix. A plausible approach to invariance under (1) would be to require normalization, $\|u\| = \|v\| = 1$, and to introduce a slope factor $s$ such that $C_0(u, v) = \|X - s\, uv^T\|_F^2$. We prefer, however, to deal with invariance directly, because normalization requirements create extraneous difficulties rather than clarifications.

We now require $u$ and $v$ to satisfy domain-specific conditions, typically a form of smoothness. Specifically, let $\Omega_u$ and $\Omega_v$ be $n \times n$ and $m \times m$ non-negative definite matrices, respectively, such that smaller values of the quadratic forms $u^T \Omega_u u$ and $v^T \Omega_v v$ are more desirable than larger ones in terms of the respective index domain. These quadratic forms will be chosen to be roughness penalties if $u$ and $v$ are required to be smooth on their domains. For example, if the index set of $u$ is $I = \{1, \ldots, n\}$, corresponding to equi-spaced time points, then a crude roughness penalty is $u^T \Omega_u u = \sum_i (u_{i+1} - 2u_i + u_{i-1})^2$.

For future use we recall that the solution of the penalized regression problem $\|y - f\|^2 + f^T \Omega f = \min_f$ ($f, y \in \mathbb{R}^n$) is $f = (I + \Omega)^{-1} y$; that is, the matrix $S = (I + \Omega)^{-1}$ is the smoothing matrix that corresponds to the hat matrix of linear regression. It is, however, not an orthogonal projection but a symmetric matrix with eigenvalues between zero and one (Hastie and Tibshirani, 1990).
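To make the smoother concrete, here is a small NumPy sketch (ours, not part of the original note) that builds the crude second-difference penalty above as $\Omega = D^T D$ for equi-spaced points and forms the smoothing matrix $S = (I + \Omega)^{-1}$; the helper name second_diff_penalty and the test signal are illustrative choices.

    import numpy as np

    def second_diff_penalty(n):
        # Omega = D^T D, where (D u)_i = u_{i+1} - 2 u_i + u_{i-1}.
        D = np.zeros((n - 2, n))
        for i in range(n - 2):
            D[i, i:i + 3] = [1.0, -2.0, 1.0]
        return D.T @ D  # non-negative definite: u^T Omega u = sum of squared second differences

    n = 50
    Omega = second_diff_penalty(n)
    S = np.linalg.inv(np.eye(n) + Omega)  # smoothing matrix (I + Omega)^{-1}
    print(np.linalg.eigvalsh(S).min(), np.linalg.eigvalsh(S).max())  # eigenvalues lie in (0, 1]
    # Penalized regression: f_hat = S y minimizes ||y - f||^2 + f^T Omega f.
    t = np.linspace(0.0, 1.0, n)
    y = np.sin(2 * np.pi * t) + 0.3 * np.random.default_rng(0).standard_normal(n)
    f_hat = S @ y  # a smoothed version of y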

The goal is to balance the penalties $u^T \Omega_u u$ and $v^T \Omega_v v$ against goodness of fit as measured by the residual sum of squares $\|X - uv^T\|_F^2$. Following practice in regression, it would be natural to try to achieve such a balance by minimizing the sum

    $C_1(u, v) = \|X - uv^T\|_F^2 + u^T \Omega_u u + v^T \Omega_v v.$   (3)

This criterion has, however, the undesirable property that it is not invariant under the scale transformations (1). While the goodness-of-fit term $\|X - uv^T\|_F^2$ remains unchanged, the penalty terms change to $c^2\, u^T \Omega_u u$ and $c^{-2}\, v^T \Omega_v v$, respectively. It appears therefore that this additive combination of the penalties imposes specific scales on $u$ and $v$ relative to each other, while the goodness-of-fit term has nothing to say about these relative scales. Indeed, we will see in Section 3 below that minimization of (3) forces the two penalties to attain identical values.
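The non-invariance of $C_1$ under (1) is easy to see numerically. The following sketch (ours, with arbitrary diagonal penalty matrices chosen purely for illustration) evaluates the fit term and $C_1$ along the orbit $u \to cu$, $v \to v/c$:

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 8, 6
    X = rng.standard_normal((n, m))
    u, v = rng.standard_normal(n), rng.standard_normal(m)
    Omega_u = np.diag(rng.uniform(0.0, 1.0, n))  # illustrative non-negative definite penalties
    Omega_v = np.diag(rng.uniform(0.0, 1.0, m))

    def fit_term(u, v):
        return np.linalg.norm(X - np.outer(u, v)) ** 2  # ||X - u v^T||_F^2

    def C1(u, v):
        return fit_term(u, v) + u @ Omega_u @ u + v @ Omega_v @ v

    for c in (0.5, 1.0, 2.0):
        print(c, fit_term(c * u, v / c), C1(c * u, v / c))
    # The fit term is the same for every c; C_1 is not.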

If the obvious approach is deficient, what would be a desirable way to balance goodness of fit and penalties? A heuristic pointer can be found by expanding the goodness-of-fit criterion $C_0$ with some trace algebra:

    $C_0(u, v) = \|X - uv^T\|_F^2 = \|X\|_F^2 - 2\, u^T X v + \|u\|^2 \|v\|^2.$   (4)

Of note is that the rightmost term is bi-quadratic. This matters because the penalties act as modifiers of this term and should be of comparable functional form. It is therefore natural to search for bi-quadratic forms of combined penalties, and the simplest such form is the product, as opposed to the sum, of the penalties:

    $C_2(u, v) = \|X - uv^T\|_F^2 + u^T \Omega_u u \; v^T \Omega_v v.$   (5)

This criterion, while satisfying invariance under the scale transformations (1), has another deficiency: it does not specialize to a one-way regularization method in $v$ when $\Omega_u = 0$, say. (This specialization requirement would appear to be satisfied by $C_1$, but appearances can be misleading, as is the case here; see Section 3.) The search for criteria that satisfy both requirements leads us to the following proposal:

    $C_3(u, v) = \|X - uv^T\|_F^2 + u^T \Omega_u u\, \|v\|^2 + \|u\|^2\, v^T \Omega_v v.$   (6)

The idea is to combine smoothing in $u$ (caused by $u^T \Omega_u u$) with shrinkage in $v$ (caused by $\|v\|^2$), and vice versa. The criterion is invariant under (1), and it specializes to a version of one-way regularized SVD when one of the penalties vanishes (Huang, Shen and Buja, 2008). We will see, however, that this criterion has a coupling problem (as do $C_1$ and $C_2$): the natural alternating algorithm that optimizes $u$ and $v$ in turn amounts to alternating smoothing in which the amount of smoothing for $u$ depends on $v$, and vice versa (Section 4). While this defect may make only a weak heuristic argument compared to the scale-invariance and specialization requirements, it takes on more weight with hindsight once the conceptually cleanest solution is found. This solution turns out to be the most complex of all, the sum of both types of bi-quadratic penalties tried so far:

    $C_4(u, v) = \|X - uv^T\|_F^2 + u^T \Omega_u u\, \|v\|^2 + \|u\|^2\, v^T \Omega_v v + u^T \Omega_u u \; v^T \Omega_v v.$   (7)

While this criterion is derived from an axiomatic argument in our paper (Huang, Shen and Buja, 2009), one can get a glimpse of why it succeeds by substituting the expansion (4) of the goodness-of-fit criterion into (7) and simplifying the algebra:

    $C_4(u, v) = \|X\|_F^2 - 2\, u^T X v + u^T (I + \Omega_u) u \; v^T (I + \Omega_v) v.$   (8)

This kind of factorization is absent from the previous combined penalties. The matrices $I + \Omega_u$ and $I + \Omega_v$ are the inverses of the smoother matrices $S_u$ and $S_v$, a fact that will result in an intuitive alternating algorithm (Section 4). Comparing the expanded form (4) of $C_0$ with (8), we see that the Euclidean squared norms $\|u\|^2$ and $\|v\|^2$ are replaced by the quadratic forms $u^T (I + \Omega_u) u$ and $v^T (I + \Omega_v) v$, respectively. This hints that hierarchies of regularized singular vectors based on $C_4$ will be orthogonal in the sense that $u_1^T (I + \Omega_u) u_2 = 0$ and $v_1^T (I + \Omega_v) v_2 = 0$, as opposed to $u_1^T u_2 = 0$ and $v_1^T v_2 = 0$.
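The agreement between the definition (7) and the factored form (8) is easy to check numerically. The sketch below (ours, with randomly generated non-negative definite penalties; all names are illustrative) compares the two expressions for $C_4$:

    import numpy as np

    rng = np.random.default_rng(2)
    n, m = 7, 5
    X = rng.standard_normal((n, m))
    u, v = rng.standard_normal(n), rng.standard_normal(m)
    A_u, A_v = rng.standard_normal((n, n)), rng.standard_normal((m, m))
    Omega_u, Omega_v = A_u @ A_u.T, A_v @ A_v.T  # non-negative definite penalties

    pu, pv = u @ Omega_u @ u, v @ Omega_v @ v
    C4_def = (np.linalg.norm(X - np.outer(u, v)) ** 2            # definition (7)
              + pu * (v @ v) + (u @ u) * pv + pu * pv)
    C4_fact = (np.linalg.norm(X) ** 2 - 2 * (u @ X @ v)          # expansion (8)
               + (u @ (np.eye(n) + Omega_u) @ u) * (v @ (np.eye(m) + Omega_v) @ v))
    print(np.isclose(C4_def, C4_fact))  # True: the bi-quadratic part factors as in (8)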

3 Comparison of criteria in terms of bi-Rayleigh quotients

In this section we compare the penalized LS criteria in terms of scale-invariant ratios that are a form of bi-Rayleigh quotients, and in the next section in terms of stationary equations that suggest alternating algorithms.

$C_0(u, v) = \|X - uv^T\|_F^2$:
    $\min_{s,t} C_0(su, tv) = \|X\|_F^2 - (u^T X v)^2 / (\|u\|^2 \|v\|^2)$
    $u = c_v\, X v$, where $c_v = \|v\|^{-2}$
    $v = c_u\, X^T u$, where $c_u = \|u\|^{-2}$

$C_1(u, v) = \|X - uv^T\|_F^2 + u^T \Omega_u u + v^T \Omega_v v$:
    $\min_{s,t} C_1(su, tv) = \|X\|_F^2 - \big(u^T X v - (u^T \Omega_u u \; v^T \Omega_v v)^{1/2}\big)^2 / (\|u\|^2 \|v\|^2)$
    $u = c_v\, S_u(\lambda_v)\, X v$, where $\lambda_v = c_v = \|v\|^{-2}$
    $v = c_u\, S_v(\lambda_u)\, X^T u$, where $\lambda_u = c_u = \|u\|^{-2}$

$C_2(u, v) = \|X - uv^T\|_F^2 + u^T \Omega_u u \; v^T \Omega_v v$:
    $\min_{s,t} C_2(su, tv) = \|X\|_F^2 - (u^T X v)^2 / (\|u\|^2 \|v\|^2 + u^T \Omega_u u \; v^T \Omega_v v)$
    $u = c_v\, S_u(\lambda_v)\, X v$, where $\lambda_v = R_v(v)$, $c_v = \|v\|^{-2}$
    $v = c_u\, S_v(\lambda_u)\, X^T u$, where $\lambda_u = R_u(u)$, $c_u = \|u\|^{-2}$

$C_3(u, v) = \|X - uv^T\|_F^2 + u^T \Omega_u u\, \|v\|^2 + \|u\|^2\, v^T \Omega_v v$:
    $\min_{s,t} C_3(su, tv) = \|X\|_F^2 - (u^T X v)^2 / (\|u\|^2 \|v\|^2 + u^T \Omega_u u\, \|v\|^2 + \|u\|^2\, v^T \Omega_v v)$
    $u = c_v\, S_u(\lambda_v)\, X v$, where $\lambda_v = (1 + R_v(v))^{-1}$, $c_v = \|v\|^{-2} \lambda_v$
    $v = c_u\, S_v(\lambda_u)\, X^T u$, where $\lambda_u = (1 + R_u(u))^{-1}$, $c_u = \|u\|^{-2} \lambda_u$

$C_4(u, v) = \|X - uv^T\|_F^2 + u^T \Omega_u u\, \|v\|^2 + \|u\|^2\, v^T \Omega_v v + u^T \Omega_u u \; v^T \Omega_v v$:
    $\min_{s,t} C_4(su, tv) = \|X\|_F^2 - (u^T X v)^2 / \big(u^T (I + \Omega_u) u \; v^T (I + \Omega_v) v\big)$
    $u = c_v\, S_u(1)\, X v$, where $c_v = \|v\|^{-2} (1 + R_v(v))^{-1}$
    $v = c_u\, S_v(1)\, X^T u$, where $c_u = \|u\|^{-2} (1 + R_u(u))^{-1}$

Table 1: The five (penalized) LS criteria, their bi-Rayleigh quotients, and the stationary equations for alternating algorithms; $S_u$, $S_v$ are the smoother matrices and $R_u$, $R_v$ the plain Rayleigh quotients of $\Omega_u$, $\Omega_v$ [see (9) and (10)].
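As a small sanity check on the first box of the table (ours, not from the note), iterating the unpenalized stationary equations is nothing but alternating least squares, and $uv^T$ converges to the best rank-one approximation of $X$:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.standard_normal((20, 12))
    v = rng.standard_normal(12)
    for _ in range(200):
        u = X @ v / (v @ v)      # u = c_v X v,   c_v = ||v||^{-2}
        v = X.T @ u / (u @ u)    # v = c_u X^T u, c_u = ||u||^{-2}

    U, s, Vt = np.linalg.svd(X)
    print(np.allclose(np.outer(u, v), s[0] * np.outer(U[:, 0], Vt[0])))  # True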

We first show how one arrives at bi-Rayleigh quotients, so called here because they are the analogs, for singular value problems, of the usual Rayleigh quotients in eigen problems. One arrives at them by minimizing scalar factors in quadratic loss functions. We exemplify with the unpenalized LS criterion (4), into which we introduce two scale factors, one each for $u$ and $v$:

    $C_0(su, tv) = \|X - st\, uv^T\|_F^2 = \|X\|_F^2 - 2\, st\, u^T X v + s^2 t^2 \|u\|^2 \|v\|^2 = \|X\|_F^2 - 2\, r\, u^T X v + r^2 \|u\|^2 \|v\|^2,$

where $r = st$. As a simple quadratic in $r$ of the form $C^2 - 2Br + A^2 r^2$, the minimum is attained at $r = B/A^2$ and the value of the minimum is $C^2 - B^2/A^2$. In this case we obtain

    $\min_{s,t} C_0(su, tv) = \|X\|_F^2 - (u^T X v)^2 / (\|u\|^2 \|v\|^2).$

We call the rightmost term a bi-Rayleigh quotient, as it is the ratio of two bi-quadratics, in analogy to the usual Rayleigh quotient, which is the ratio of two quadratics. Maximization of the bi-Rayleigh quotient is equivalent to minimization of the LS criterion. The stationary solutions of a bi-Rayleigh quotient are the pairs of left and right singular vectors of a singular value problem, just as the stationary solutions of a plain Rayleigh quotient are the eigenvectors of an eigen problem.
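For the unpenalized case this can be verified directly: the bi-Rayleigh quotient $(u^T X v)^2 / (\|u\|^2 \|v\|^2)$ is maximized by the leading singular pair, where it equals $\sigma_1^2$. A small sketch of this check (ours, with an arbitrary random matrix):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.standard_normal((15, 10))
    U, s, Vt = np.linalg.svd(X)

    def bi_rayleigh(u, v):
        return (u @ X @ v) ** 2 / ((u @ u) * (v @ v))

    print(np.isclose(bi_rayleigh(U[:, 0], Vt[0]), s[0] ** 2))  # True at the top singular pair
    vals = [bi_rayleigh(rng.standard_normal(15), rng.standard_normal(10)) for _ in range(1000)]
    print(max(vals) <= s[0] ** 2 + 1e-12)  # random pairs never exceed sigma_1^2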

The above exercise of scale optimization can just as easily be carried out for the penalized criteria $C_2$, $C_3$ and $C_4$ of Section 2. Their quadratic functions $C^2 - 2Br + A^2 r^2$ differ only in the coefficient $A^2$, which in each case is the sum of $\|u\|^2 \|v\|^2$ and the penalties. The results are shown on the second line of each box of Table 1.

The naive penalized LS criterion $C_1(u, v) = \|X - uv^T\|_F^2 + u^T \Omega_u u + v^T \Omega_v v$ requires separate treatment, as it is the only one that is not scale invariant, so the slopes $s$ and $t$ do not coalesce into a product $r$. Here are its stationary equations:

    $\partial C_1(su, tv)/\partial s = -2t\, u^T X v + 2 s t^2 \|u\|^2 \|v\|^2 + 2 s\, u^T \Omega_u u = 0,$
    $\partial C_1(su, tv)/\partial t = -2s\, u^T X v + 2 s^2 t\, \|u\|^2 \|v\|^2 + 2 t\, v^T \Omega_v v = 0.$

Multiplying the first equation by $s$ and the second by $t$, one recognizes immediately that $s^2\, u^T \Omega_u u = t^2\, v^T \Omega_v v$, forcing the two penalties to be equal at the minimum. Using this identity, one can express the combined penalty symmetrically as follows:

    $s^2\, u^T \Omega_u u + t^2\, v^T \Omega_v v = 2\, st\, (u^T \Omega_u u \; v^T \Omega_v v)^{1/2}.$

Thus, after scale minimization, the combined penalty becomes invariant under the scale changes (1). Using this fact one obtains the scale-minimized solution without difficulty; the result is shown in the second box of Table 1. These calculations show that, counter to appearances, $C_1$ does not specialize to a form of one-way regularized SVD when one of the penalties vanishes.

Comparing the bi-Rayleigh quotients across the penalized criteria, we see that the naive criterion $C_1$ is somewhat reminiscent of the regularized PCA proposal of Rice and Silverman (1991), in that it penalizes by subtracting from the numerator rather than adding to the denominator. Criteria $C_2$, $C_3$ and $C_4$ share the property that they penalize the denominator, and in this regard they are all reminiscent of the regularized PCA proposal of Silverman (1996). We also note that only the last, $C_4$, has a denominator that factors in the same way as in the unpenalized case $C_0$, providing a heuristic argument in its favor. By comparison, the numerator of the bi-Rayleigh quotient for criterion $C_1$ is not of the simple bi-quadratic type, and the denominators for $C_2$ and $C_3$ do not factorize into a simple product of quadratics.

4 Comparison of criteria in terms of alternating algorithms

Table 1 also shows the stationary equations of the (penalized) LS criteria, solved for $u$ and $v$, respectively, cast in terms of smoother matrices and (plain) Rayleigh quotients:

    $S_u(\lambda) = (I + \lambda \Omega_u)^{-1}, \qquad R_u(u) = u^T \Omega_u u / \|u\|^2,$   (9)
    $S_v(\lambda) = (I + \lambda \Omega_v)^{-1}, \qquad R_v(v) = v^T \Omega_v v / \|v\|^2;$   (10)

see the third and fourth lines in each box of Table 1. The bandwidth parameters $\lambda$ are necessary to accommodate criteria $C_1$, $C_2$ and $C_3$, whose stationary equations apparently require variable bandwidths: the bandwidth for updating $u$ depends on $v$, and vice versa. Such coupling of the bandwidths is a heuristic argument against these criteria. Among the penalized criteria, $C_4$ is the only one in whose stationary equations the bandwidths are fixed. The stationary equations immediately suggest alternating algorithms for $C_4$: update $u$ from $v$, then update $v$ from $u$.
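The note stops short of spelling out the iteration, so the following sketch (ours) implements one plausible version of the alternating updates from the last box of Table 1, with the fixed-bandwidth smoothers $S_u(1)$ and $S_v(1)$. The function name, starting value, stopping rule, and the ridge-type penalties in the example are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def c4_rank_one(X, Omega_u, Omega_v, n_iter=200, tol=1e-10):
        n, m = X.shape
        S_u = np.linalg.inv(np.eye(n) + Omega_u)  # fixed-bandwidth smoother S_u(1)
        S_v = np.linalg.inv(np.eye(m) + Omega_v)  # fixed-bandwidth smoother S_v(1)
        v = X[0, :].copy()                        # crude starting value
        for _ in range(n_iter):
            u = S_u @ (X @ v) / (v @ v + v @ Omega_v @ v)        # u = c_v S_u(1) X v
            v_new = S_v @ (X.T @ u) / (u @ u + u @ Omega_u @ u)  # v = c_u S_v(1) X^T u
            if np.linalg.norm(v_new - v) < tol * np.linalg.norm(v_new):
                v = v_new
                break
            v = v_new
        return u, v

    rng = np.random.default_rng(4)
    X = rng.standard_normal((30, 20))
    u, v = c4_rank_one(X, 0.1 * np.eye(30), 0.1 * np.eye(20))  # ridge-type penalties, purely illustrative
    # With Omega_u = Omega_v = 0 the updates reduce to the unpenalized alternating LS for C_0;
    # roughness penalties such as the second-difference Omega of Section 2 would be used in practice.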

References

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall, London.

Huang, J. Z., Shen, H. and Buja, A. (2008). Functional principal components analysis via penalized rank one approximation. Electronic Journal of Statistics, 2, 678–695.

Huang, J. Z., Shen, H. and Buja, A. (2009). The analysis of two-way functional data using two-way regularized singular value decompositions. Journal of the American Statistical Association, revised.

Rice, J. A. and Silverman, B. W. (1991). Estimating the mean and covariance structure nonparametrically when the data are curves. Journal of the Royal Statistical Society, Series B, 53, 233–243.

Silverman, B. W. (1996). Smoothed functional principal components analysis by choice of norm. The Annals of Statistics, 24, 1–24.