THE GRADIENT PROJECTION ALGORITHM FOR ORTHOGONAL ROTATION

1 The problem

Let $M$ be the manifold of all $k$ by $m$ column-wise orthonormal matrices and let $f$ be a function defined on arbitrary $k$ by $m$ matrices. The general orthogonal rotation problem is to minimize $f$ restricted to $M$. The most common problem is rotation to simple loadings in factor analysis. There $f(T) = Q(AT)$, where $Q$ is a rotation criterion, for example varimax, and $A$ is an initial loading matrix. The object is to minimize $f(T)$ over all orthogonal matrices $T$; in this case $m = k$. A variety of other applications may be found in Jennrich (2001).
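
To make the factor-analysis instance concrete, here is a minimal numerical sketch (not part of the note) of $f(T) = Q(AT)$ with a varimax-style criterion $Q$, written in Python with numpy. The function names and the particular scaling of the criterion are this sketch's own choices; varimax conventions differ across sources by constant factors.

```python
import numpy as np

def varimax_criterion(L):
    # Negative sum over factors of the variance of the squared loadings,
    # so that smaller values correspond to "simpler" loadings.
    return -np.sum((L**2).var(axis=0))

def varimax_gradient(L):
    # Gradient of varimax_criterion with respect to the loadings L.
    p = L.shape[0]
    return -(4.0 / p) * (L**3 - L * (L**2).mean(axis=0))

def f_and_gradient(A, T):
    # f(T) = Q(A T) and its gradient G with respect to T, by the chain rule.
    L = A @ T
    return varimax_criterion(L), A.T @ varimax_gradient(L)
```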

2 The gradient projection algorithm

The gradient projection (GP) algorithm to be discussed is essentially the singular value decomposition (SVD) algorithm of Jennrich (2001). It was not recognized, however, that the SVD algorithm was in fact a GP algorithm until a remark to that effect appeared in a paper on oblique gradient projection algorithms (Jennrich, 2002). The SVD algorithm was derived and studied as a majorization algorithm. It now seems simpler and more natural to view and study it as a GP algorithm, and that is the purpose of this note. This approach will also emphasize its similarity to the GP algorithm for oblique rotation.

Let $\|B\|$ denote the Frobenius norm of a matrix $B$. This is defined by $\|B\|^2 = \mathrm{tr}(B'B)$. With respect to this norm the gradient $G$ of $f$ at $T$ is the matrix of partial derivatives of $f(T)$ with respect to the components of $T$. Viewing $T$ as a current value, the idea of the GP algorithm is to move in the negative gradient direction an amount $\alpha > 0$, that is, from $T$ to $T - \alpha G$. In general $T - \alpha G$ will not be in $M$. To deal with this it is projected onto $M$. This is easy to do because the projection of an arbitrary $k$ by $m$ matrix $X$ onto $M$ is given by $\rho(X) = UV'$, where $U$ and $V$ are matrices from a singular value decomposition $X = UDV'$ of $X$. This result may be found in Jennrich (2002). It may also be derived easily using target rotation.

The simplest form of the GP algorithm is: Choose $\alpha > 0$ and an initial $T$ in $M$.

(a) Compute the gradient $G$ of $f$ at $T$.

(b) Replace $T$ by $\rho(T - \alpha G)$ and go to (a) or stop.

This simple algorithm can and often does work, but the stopping rule is a bit vague and the algorithm's success may depend on the choice of $\alpha$.
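
The projection $\rho$ and the simple fixed-step iteration above can be sketched as follows, again assuming numpy and the hypothetical f_and_gradient helper from the earlier sketch (any $f$ returning a value and a gradient would do).

```python
import numpy as np

def project(X):
    # rho(X) = U V' from a singular value decomposition X = U D V'
    # projects an arbitrary k x m matrix X onto the manifold M.
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def simple_gp(A, T, alpha=1.0, n_iter=100):
    # Simplest form of the GP algorithm: fixed step length, fixed iteration count.
    for _ in range(n_iter):
        _, G = f_and_gradient(A, T)
        T = project(T - alpha * G)
    return T
```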

To address the stopping rule problem, it is shown in Jennrich (2001) that $T$ is a stationary point of $f$ restricted to $M$ if and only if

$G = TS$  (1)

for some symmetric matrix $S$. It is also shown that this condition may be expressed more explicitly as $s = 0$, where

$s^2 = \|\mathrm{skm}(T'G)\|^2 + \|(I - TT')G\|^2$  (2)

Motivated by this, we will stop the GP algorithm when $s \le \epsilon$ for some small $\epsilon > 0$.

Let $T$ be a differentiable mapping into $M$. Then $T'T = I$ and $dT'\,T + T'\,dT = 0$, from which it follows that $T'dT$ is skew-symmetric. Let $\mathrm{skm}(B)$ denote the skew-symmetric part of $B$.

To address the choice of $\alpha$ problem, we begin by finding the differential of $\rho$ at an arbitrary $T$ in $M$.

Lemma 1: $d\rho_T(dX) = T\,\mathrm{skm}(T'dX) + (I - TT')\,dX$

Proof: Let $f(T) = \|X - T\|^2/2$. Its gradient at $T$ is $G = -(X - T)$. If $T = \rho(X)$, then $T$ minimizes and hence is a stationary point of $f$ restricted to $M$. Using (1), $X - T = TS$ for some symmetric matrix $S$, and hence $X = TS$ for some, but not the same, symmetric matrix $S$. Differentiating,

$dX = dT\,S + T\,dS$

At $X = T$, $S = I$ and

$dX = dT + T\,dS$  (3)

and

$T'dX = T'dT + dS$

Using the fact that $T'dT$ is skew-symmetric and $dS$ is symmetric,

$\mathrm{skm}(T'dX) = T'dT$

Thus

$T\,\mathrm{skm}(T'dX) = TT'\,dT$

From (3),

$(I - TT')\,dX = (I - TT')\,dT$

Thus

$T\,\mathrm{skm}(T'dX) + (I - TT')\,dX = dT$

But $dT = d\rho_T(dX)$, which completes the proof.
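
As an illustration (not from the note), Lemma 1 can be spot-checked numerically by comparing a finite-difference approximation of $d\rho_T(dX)$ with the formula, using the hypothetical project helper sketched earlier.

```python
import numpy as np

def skm(B):
    # Skew-symmetric part of B.
    return (B - B.T) / 2.0

rng = np.random.default_rng(0)
k, m = 6, 3
T = project(rng.standard_normal((k, m)))   # a point on M, since rho maps into M
dX = rng.standard_normal((k, m))           # an arbitrary direction
t = 1e-6

finite_diff = (project(T + t * dX) - project(T)) / t
lemma_value = T @ skm(T.T @ dX) + (np.eye(k) - T @ T.T) @ dX
print(np.max(np.abs(finite_diff - lemma_value)))   # small, of order t
```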

We turn next to the basic theorem for choosing $\alpha$.

Theorem 1: If $T$ is in $M$, if $G$ is the gradient of $f$ at $T$, and if $h(\alpha) = f(\rho(T - \alpha G))$, then

$h'(0) = -\|\mathrm{skm}(T'G)\|^2 - \|(I - TT')G\|^2$  (4)

Proof: Using Lemma 1,

$h'(0) = (G, d\rho_T(-G)) = -(G,\, T\,\mathrm{skm}(T'G) + (I - TT')G) = -\|\mathrm{skm}(T'G)\|^2 - \|(I - TT')G\|^2$

This completes the proof.

Note that the right-hand side of (4) is $-s^2$. Thus if $T$ is not a stationary point, $h'(0)$ is negative and hence $f(\rho(T - \alpha G)) < f(T)$ when $\alpha$ is sufficiently small. We can use this to choose $\alpha$ so the GP algorithm is monotone decreasing. For example, consider the algorithm: Choose $\alpha_0 > 0$ and $T$ in $M$.

(a) Compute $s$ as defined by (2). If $s < \epsilon$, stop.

(b) Compute the gradient $G$ of $f$ at $T$ and set $\alpha = \alpha_0$.

(c) Compute $\tilde T = \rho(T - \alpha G)$.

(d) If $f(\tilde T) \ge f(T)$, replace $\alpha$ by $\alpha/2$ and go to (c).

(e) If $f(\tilde T) < f(T)$, replace $T$ by $\tilde T$ and go to (a).

Note that step (a) stops the algorithm when it is sufficiently close to a stationary point.
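
A sketch of this monotone algorithm in code, assuming numpy and the hypothetical project, f_and_gradient, and skm helpers from the earlier sketches:

```python
import numpy as np

def gp_rotate(A, T, alpha0=1.0, eps=1e-5, max_iter=500):
    k = T.shape[0]
    for _ in range(max_iter):
        fT, G = f_and_gradient(A, T)
        # (a) stopping criterion s defined by (2)
        s = np.sqrt(np.linalg.norm(skm(T.T @ G))**2
                    + np.linalg.norm((np.eye(k) - T @ T.T) @ G)**2)
        if s < eps:
            break
        # (b) start each outer iteration from the initial step length
        alpha = alpha0
        # (c)-(d) halve alpha until f strictly decreases; Theorem 1 guarantees
        # only finitely many halvings are needed, the cap is a numerical safeguard
        for _ in range(60):
            T_new = project(T - alpha * G)
            f_new, _ = f_and_gradient(A, T_new)
            if f_new < fT:
                break
            alpha /= 2.0
        else:
            break                      # no numerical decrease found; give up
        # (e) accept the strictly better point
        T = T_new
    return T
```

For example, with an initial loading matrix A one might call gp_rotate(A, np.eye(A.shape[1])) to reproduce the default starting value $T = I$ mentioned below.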

Note that because of Theorem 1, the return to step (c) from step (d) can only occur a finite number of times. Note also that the algorithm is strictly monotone: the value of $f$ at a new value of $T$ is strictly less than the value of $f$ at the previous value of $T$. While not guaranteed by this alone, strictly monotone algorithms almost invariably converge to stationary points. In the author's experience this has never failed to happen, not even in simulations. Moreover, because stationary points that are not local minima are not points of attraction of a monotone algorithm, strictly monotone algorithms tend to converge to stationary points that are local minima.

In our experience $\alpha_0 = 1$ and $\epsilon = 10^{-5}$ are reasonable choices for these parameters. We use an initial $T = I$, except for some simulations where random $T$ were used.

A number of modifications of this algorithm are possible. For example, in step (b) one can replace $\alpha$ by $2\alpha$ rather than setting it equal to $\alpha_0$. This tends to speed the algorithm when $\alpha_0$ is too large or too small. One can also use Theorem 1 to produce a fancier partial step in step (d). This may be done by approximating $h(\alpha)$ in Theorem 1 by a quadratic that equals $f(T)$ at zero, equals $f(\tilde T)$ at $\alpha$, and has slope $-s^2$ at zero. The minimizer

$\tilde\alpha = \frac{s^2 \alpha^2}{2\,(f(\tilde T) - f(T) + s^2 \alpha)}$

of this approximation is an approximate minimizer of $h$ and may be used instead of $\alpha/2$ to replace $\alpha$ in step (d). The author has tried these modifications but does not use them, because the speed of the unmodified algorithm has never been a problem, even in simulations.
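
The quadratic partial step described above amounts to the following one-line helper (a sketch under the same assumptions as the earlier sketches); its output can replace the simple halving $\alpha/2$ in step (d).

```python
def quadratic_step(f_T, f_Tnew, s, alpha):
    # Minimizer of the quadratic that equals f_T at 0, f_Tnew at alpha,
    # and has slope -s**2 at 0.  The denominator is positive whenever
    # step (d) is reached, i.e. when f_Tnew >= f_T.
    return (s**2 * alpha**2) / (2.0 * (f_Tnew - f_T + s**2 * alpha))
```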

3 References

Jennrich, R. I. (2001). A simple general procedure for orthogonal rotation. Psychometrika, 66, 289-306.

Jennrich, R. I. (2002). A simple general method for oblique rotation. Psychometrika, 67, in press.