Subset Selection. Ilse Ipsen. North Carolina State University, USA

Subset Selection

Given: real or complex matrix $A$, integer $k$.
Determine a permutation matrix $P$ so that $AP = (\underbrace{A_1}_{k}\ \ A_2)$.

Important columns $A_1$: the columns of $A_1$ are very linearly independent (wannabe basis vectors).
Redundant columns $A_2$: the columns of $A_2$ are well represented by $A_1$.

Outline
- Two applications
- Mathematical formulation
- Bounds
- Algorithms (two norm)
- Redundant columns (Frobenius norm)
- Randomized algorithms

First Application

Joint work with Tim Kelley.
Solution of nonlinear least squares problems with the Levenberg-Marquardt trust-region algorithm.
Difficulties: ill-conditioned or rank-deficient Jacobians; errors in residuals and Jacobians.
Improve conditioning with subset selection.
Damped driven harmonic oscillators allow us to construct different scenarios.

Harmonic Oscillators

$m\ddot{x} + c\dot{x} + kx = 0$: displacement $x(t)$, mass $m$, damping constant $c$, spring stiffness $k$.
Given: displacement measurements $x_j$ at times $t_j$.
Want: parameters $p = (m, c, k)$.
Nonlinear least squares problem: $\min_p \sum_j \big(x(t_j, p) - x_j\big)^2$
[Image: http://www.relisoft.com/science/physics/images/oscillator.gif]
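
A minimal sketch of this fit in SciPy, with synthetic "measurements" and assumed initial conditions $x(0)=1$, $\dot{x}(0)=0$; the parameter values, time grid, and starting guess are illustrative, not from the talk.

    import numpy as np
    from scipy.integrate import solve_ivp
    from scipy.optimize import least_squares

    t_j = np.linspace(0.0, 10.0, 50)                 # measurement times (assumed)

    def displacement(p, t):
        """Solve m*x'' + c*x' + k*x = 0 with x(0)=1, x'(0)=0 and sample x at t."""
        m, c, k = p
        rhs = lambda _t, y: [y[1], -(c * y[1] + k * y[0]) / m]
        return solve_ivp(rhs, (t[0], t[-1]), [1.0, 0.0], t_eval=t, rtol=1e-8).y[0]

    x_j = displacement([1.0, 0.4, 2.0], t_j)         # synthetic data, "true" p = (1, 0.4, 2)

    # Residual R(p)_j = x(t_j, p) - x_j; least_squares minimizes (1/2)*||R(p)||^2.
    fit = least_squares(lambda p: displacement(p, t_j) - x_j, x0=[0.5, 0.5, 0.5])
    print(fit.x, fit.cost)   # only the ratios c/m and k/m are identifiable (see next slides)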

Nonlinear Least Squares Problem

Residual: $R(p) = \big(x(t_1,p) - x_1,\ \ x(t_2,p) - x_2,\ \ \dots\big)^T$
Nonlinear least squares problem: $\min_p f(p)$ where $f(p) = \tfrac{1}{2} R(p)^T R(p)$.
At a minimizer $p_*$: $\nabla f(p_*) = 0$.
Gradient: $\nabla f(p) = R'(p)^T R(p)$, with Jacobian $R'(p)$.
Solve $\nabla f(p) = 0$ by the Levenberg-Marquardt algorithm:
$p_{\mathrm{new}} = p - \big(\nu I + R'(p)^T R'(p)\big)^{-1} R'(p)^T R(p)$
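
The update above, written out in NumPy for a generic residual $R$ and Jacobian $R'$ (both assumed to be supplied by the caller); the toy problem at the end is only there to show the call.

    import numpy as np

    def lm_step(p, R, Jac, nu):
        """One Levenberg-Marquardt step: p - (nu*I + J^T J)^{-1} J^T R(p), with J = R'(p)."""
        r, J = R(p), Jac(p)
        return p - np.linalg.solve(nu * np.eye(p.size) + J.T @ J, J.T @ r)

    # Toy residual R(p) = [p0 + p1 - 3, p0 - p1 - 1], whose minimizer is p = (2, 1).
    R = lambda p: np.array([p[0] + p[1] - 3.0, p[0] - p[1] - 1.0])
    Jac = lambda p: np.array([[1.0, 1.0], [1.0, -1.0]])
    p = np.zeros(2)
    for _ in range(20):
        p = lm_step(p, R, Jac, nu=1e-3)
    print(p)   # close to (2, 1)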

Jacobian

Levenberg-Marquardt algorithm: $p_{\mathrm{new}} = p - \big(\nu I + R'(p)^T R'(p)\big)^{-1} R'(p)^T R(p)$
But the Jacobian $R'(p)$ does not have full column rank: $m\ddot{x} + c\dot{x} + kx = 0$ for infinitely many $p = (m, c, k)$, since
$\ddot{x} + \tfrac{c}{m}\dot{x} + \tfrac{k}{m}x = 0$, $\quad \tfrac{m}{c}\ddot{x} + \dot{x} + \tfrac{k}{c}x = 0$, $\quad \tfrac{m}{k}\ddot{x} + \tfrac{c}{k}\dot{x} + x = 0$.
Which parameter should be kept fixed? Which column in the Jacobian is redundant? Subset selection!

Second Application

Scott Pope's Ph.D. thesis (NC State, 2009): cardiovascular and respiratory modeling.
Identify parameters that are important for predicting blood flow and pressure.
Solve a nonlinear least squares problem combined with subset selection.
Parameters identified as important: total systemic resistance, cerebrovascular resistance, arterial compliance, time of peak systolic ventricular pressure.

Mathematical Formulation of Subset Selection

Given: real or complex matrix $A$ with $n$ columns, integer $k$.
Determine a permutation matrix $P$ so that $AP = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k})$.

Important columns $A_1$: the columns of $A_1$ are very linearly independent; the smallest singular value of $A_1$ is large.
Redundant columns $A_2$: the columns of $A_2$ are well represented by $A_1$; $\min_Z \|A_1 Z - A_2\|_2$ is small (two norm).

Singular Value Decomposition (SVD)

$m \times n$ matrix $A$, $m \ge n$: $\ AP = U\,\Sigma\,V$
Singular vectors: $U^T U = I_m$, $\ V^T V = V V^T = I_n$
Singular values: $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$ with $\sigma_1 \ge \dots \ge \sigma_n \ge 0$
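
A quick NumPy illustration of these conventions (NumPy's `Vh` plays the role of $V$ here: its rows are the right singular vectors); the test matrix is arbitrary.

    import numpy as np

    A = np.random.default_rng(0).standard_normal((6, 4))   # m >= n
    U, s, Vh = np.linalg.svd(A)                            # full U, so U.T @ U = I_m
    print(np.allclose(U.T @ U, np.eye(6)), np.allclose(Vh @ Vh.T, np.eye(4)))
    print(np.all(np.diff(s) <= 0) and s.min() >= 0)        # sigma_1 >= ... >= sigma_n >= 0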

Ideal Matrices for Subset Selection

Singular value decomposition: $AP = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k}) = U \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix} V$

If there exists a permutation $P$ so that $V = I$, then:
Important columns = large singular values of $A$: $\ A_1 = U \begin{pmatrix} \Sigma_1 \\ 0 \end{pmatrix}$ and $\sigma_i(A_1) = \sigma_i(A)$, $1 \le i \le k$.
Redundant columns = small singular values of $A$: $\ A_2 = U \begin{pmatrix} 0 \\ \Sigma_2 \end{pmatrix}$ and $\min_Z \|A_1 Z - A_2\| = \|A_2\| = \sigma_{k+1}(A)$.

Subset Selection Requirements

Important columns $A_1$: the $k$ columns of $A_1$ should be very linearly independent; the smallest singular value $\sigma_k(A_1)$ should be large:
$\sigma_k(A)/\gamma \le \sigma_k(A_1) \le \sigma_k(A)$ for some $\gamma$.

Redundant columns $A_2$: the columns of $A_2$ should be well represented by $A_1$; $\min_Z \|A_1 Z - A_2\|$ should be small (two norm):
$\sigma_{k+1}(A) \le \min_Z \|A_1 Z - A_2\| \le \gamma\,\sigma_{k+1}(A)$ for some $\gamma$.

Bounds for Subset Selection

Singular value decomposition: $AP = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k}) = U \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix} \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix}$

Important columns $A_1$: $\ \sigma_i(A)/\|V_{11}^{-1}\| \le \sigma_i(A_1) \le \sigma_i(A)$ for all $1 \le i \le k$.
Redundant columns $A_2$: $\ \sigma_{k+1}(A) \le \min_Z \|A_1 Z - A_2\| \le \|V_{11}^{-1}\|\,\sigma_{k+1}(A)$.

How Small Can $\|V_{11}^{-1}\|$ Be?

The matrix $(V_{11}\ \ V_{12})$ is $k \times n$ with orthonormal rows.

If $V_{11}$ is nonsingular then $\|V_{11}^{-1}\| \le \sqrt{1 + \|V_{11}^{-1} V_{12}\|^2}$
(follows from $I = V_{11} V_{11}^T + V_{12} V_{12}^T$).

If $\det(V_{11})$ is maximal then $\|V_{11}^{-1} V_{12}\| \le \sqrt{k(n-k)}$
(follows from Cramer's rule: $|(V_{11}^{-1} V_{12})_{ij}| \le 1$).

There exists a permutation such that $\|V_{11}^{-1}\| \le \sqrt{1 + k(n-k)}$.

Bounds for Subset Selection

$AP = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k}) = U \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix} \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix}$

Permute the right singular vectors so that $\det(V_{11})$ is maximal.

Important columns $A_1$: $\ \sigma_i(A)/\sqrt{1 + k(n-k)} \le \sigma_i(A_1) \le \sigma_i(A)$ for all $1 \le i \le k$.
Redundant columns $A_2$: $\ \sigma_{k+1}(A) \le \min_Z \|A_1 Z - A_2\| \le \sqrt{1 + k(n-k)}\,\sigma_{k+1}(A)$.

Algorithms for Subset Selection

QR decomposition with column pivoting: $AP = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$ where $Q^T Q = I$.

Important columns: $\ Q \begin{pmatrix} R_{11} \\ 0 \end{pmatrix} = A_1$ and $\sigma_i(A_1) = \sigma_i(R_{11})$, $1 \le i \le k$.
Redundant columns: $\ Q \begin{pmatrix} R_{12} \\ R_{22} \end{pmatrix} = A_2$ and $\min_Z \|A_1 Z - A_2\| = \|R_{22}\|$.
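
A small SciPy sketch of this selection (SciPy's pivoted QR is Businger-Golub column pivoting, not a strong RRQR); the test matrix and $k$ are arbitrary.

    import numpy as np
    from scipy.linalg import qr

    rng = np.random.default_rng(1)
    A = rng.standard_normal((8, 6)) @ np.diag([1, 1, 1, 1e-4, 1e-5, 1e-6])  # numerical rank ~ 3
    k = 3

    Q, R, piv = qr(A, pivoting=True)         # A[:, piv] = Q @ R
    A1, A2 = A[:, piv[:k]], A[:, piv[k:]]    # important / redundant columns
    R11, R22 = R[:k, :k], R[k:, k:]

    sv = np.linalg.svd(A, compute_uv=False)
    print(np.linalg.svd(R11, compute_uv=False), sv[:k])   # sigma_i(A1) vs k largest sigma_i(A)
    print(np.linalg.norm(R22, 2), sv[k])                  # ||R22||_2 vs sigma_{k+1}(A)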

QR Decomposition with Column Pivoting

Determines a permutation matrix $P$ so that $AP = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$
where $R_{11}$ is well-conditioned and $R_{22}$ is small.
The numerical rank of $A$ is $k$: the QR decomposition reveals the rank.

Rank Revealing QR (RRQR) Decompositions
- Businger & Golub (1965): QR with column pivoting
- Faddeev, Kublanovskaya & Faddeeva (1968)
- Golub, Klema & Stewart (1976)
- Gragg & Stewart (1976)
- Stewart (1984)
- Foster (1986)
- T. Chan (1987)
- Hong & Pan (1992)
- Chandrasekaran & Ipsen (1994)
- Gu & Eisenstat (1996): strong RRQR

Strong RRQR (Gu & Eisenstat 1996)

Input: $m \times n$ matrix $A$, $m \ge n$, integer $k$
Output: $AP = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$

$R_{11}$ ($k \times k$) is well conditioned: $\ \sigma_i(A)/\sqrt{1 + k(n-k)} \le \sigma_i(R_{11}) \le \sigma_i(A)$, $1 \le i \le k$.
$R_{22}$ is small: $\ \sigma_{k+j}(A) \le \sigma_j(R_{22}) \le \sqrt{1 + k(n-k)}\,\sigma_{k+j}(A)$.
Off-diagonal block not too large: $\ |(R_{11}^{-1} R_{12})_{ij}| \le 1$.

Strong RRQR Algorithm

1. Compute some QR decomposition with column pivoting: $AP_{\mathrm{initial}} = Q \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}$
2. Repeat: exchange a column of $\begin{pmatrix} R_{11} \\ 0 \end{pmatrix}$ with a column of $\begin{pmatrix} R_{12} \\ R_{22} \end{pmatrix}$, update the permutation $P$, and retriangularize; until $\det(R_{11})$ stops increasing.
3. Output: $AP_{\mathrm{final}} = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k})$

Operation count: $O(mn^2)$ when the iteration stops once $\det(R_{11})$ no longer increases by a factor of $\sqrt{n}$.
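
A deliberately naive sketch of the exchange step: try swapping columns between the two blocks and accept a swap whenever it increases $|\det(R_{11})|$, recomputing a small QR factorization each time. This only makes the idea concrete; Gu and Eisenstat instead update the factorization and bound the number of exchanges.

    import numpy as np
    from scipy.linalg import qr

    def strong_rrqr_sketch(A, k):
        _, _, piv = qr(A, pivoting=True)                 # step 1: initial pivoted QR
        piv = list(piv)
        def vol(cols):                                   # |det(R11)| for a candidate A1
            R = qr(A[:, cols], mode='economic')[1]
            return abs(np.prod(np.diag(R)))
        improved = True
        while improved:                                  # step 2: greedy column exchanges
            improved = False
            for i in range(k):
                for j in range(k, A.shape[1]):
                    trial = piv.copy()
                    trial[i], trial[j] = trial[j], trial[i]
                    if vol(trial[:k]) > vol(piv[:k]):
                        piv, improved = trial, True
        return np.array(piv)                             # step 3: columns piv[:k] form A1

    A = np.random.default_rng(2).standard_normal((10, 6))
    print(strong_rrqr_sketch(A, k=3))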

Another Strong RRQR Algorithm

In the spirit of Golub, Klema and Stewart (1976):

1. Compute the SVD: $A = U \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix} \begin{pmatrix} V_1 \\ V_2 \end{pmatrix}$
2. Apply strong RRQR to $V_1$: $\ V_1 P = (\underbrace{V_{11}}_{k}\ \ \underbrace{V_{12}}_{n-k})$ with $1/\sqrt{1 + k(n-k)} \le \sigma_i(V_{11}) \le 1$, $1 \le i \le k$.
3. Output: $AP = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k})$
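
A sketch of this SVD-based variant, with plain pivoted QR standing in for the strong RRQR of step 2; the matrix and $k$ are arbitrary.

    import numpy as np
    from scipy.linalg import qr

    def subset_via_svd(A, k):
        _, _, Vh = np.linalg.svd(A, full_matrices=False)
        V1 = Vh[:k, :]                        # k x n, orthonormal rows
        _, _, piv = qr(V1, pivoting=True)     # pivoted QR on the leading right singular vectors
        return piv[:k], piv[k:]               # important / redundant column indices of A

    A = np.random.default_rng(3).standard_normal((10, 6))
    keep, drop = subset_via_svd(A, k=3)
    print(np.linalg.svd(A[:, keep], compute_uv=False))   # compare with the 3 largest sigma_i(A)
    print(np.linalg.svd(A, compute_uv=False)[:3])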

Summary: Deterministic Subset Selection

$m \times n$ matrix $A$, $m \ge n$: $\ AP = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k})$

Important columns $A_1$: $\ \sigma_k(A)/p(k,n) \le \sigma_k(A_1) \le \sigma_k(A)$
Redundant columns $A_2$: $\ \sigma_{k+1}(A) \le \min_Z \|A_1 Z - A_2\| \le p(k,n)\,\sigma_{k+1}(A)$

$p(k,n)$ depends on the leading $k$ right singular vectors; best known value: $p(k,n) = \sqrt{1 + k(n-k)}$.
Algorithms: rank revealing QR decompositions, SVD.
Operation count: $O(mn^2)$ for $p(k,n) = \sqrt{1 + n\,k(n-k)}$.

Redundant Columns

$AP = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k})$; RRQR: $\ \min_Z \|A_1 Z - A_2\| \le p(k,n)\,\sigma_{k+1}(A)$

Orthogonal projector onto $\mathrm{range}(A_1)$: $A_1 A_1^{\dagger}$, so $\min_Z \|A_1 Z - A_2\| = \|(I - A_1 A_1^{\dagger})\,A_2\|$.
Largest among all small singular values: $\sigma_{k+1}(A) = \|\Sigma_2\|$.
RRQR: $\ \|(I - A_1 A_1^{\dagger})\,A_2\| \le p(k,n)\,\|\Sigma_2\|$
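
A numerical check of the projector identity above: the minimizing $Z$ is the least-squares solution $A_1^{\dagger} A_2$, so the minimum equals the norm of the residual $(I - A_1 A_1^{\dagger})A_2$. The split of the test matrix is arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((8, 6))
    A1, A2 = A[:, :3], A[:, 3:]                     # any split into k and n-k columns

    Z = np.linalg.lstsq(A1, A2, rcond=None)[0]      # Z = A1^+ A2
    residual = A2 - A1 @ Z                          # = (I - A1 A1^+) A2
    print(np.linalg.norm(residual, 2))              # = min_Z ||A1 Z - A2||_2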

Subset Selection for Redundant Columns

$AP = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k})$

Among all $\binom{n}{k}$ choices, find the $A_1$ that minimizes $\|(I - A_1 A_1^{\dagger})\,A_2\|_{\xi}$.

Best bounds:
Two norm: $\ \|(I - A_1 A_1^{\dagger})\,A_2\|_2 \le \sqrt{1 + k(n-k)}\,\|\Sigma_2\|_2$
Frobenius norm: $\ \|(I - A_1 A_1^{\dagger})\,A_2\|_F \le \sqrt{k+1}\,\|\Sigma_2\|_F$

Frobenius Norm

There exist $k$ columns $A_1$ so that $\|(I - A_1 A_1^{\dagger})\,A_2\|_F^2 \le (k+1) \sum_{j \ge k+1} \sigma_j(A)^2$.

Idea: volume sampling (Deshpande, Rademacher, Vempala & Wang 2006)

$\|(I - A_1 A_1^{\dagger})\,A_2\|_F^2 = \sum_{j \ge k+1} \|(I - A_1 A_1^{\dagger})\,a_j\|_2^2$

Volume: $\mathrm{Vol}(a_j) = \|a_j\|_2$, $\quad \mathrm{Vol}(A_1\ a_j) = \mathrm{Vol}(A_1)\,\|(I - A_1 A_1^{\dagger})\,a_j\|_2$

Volume and singular values: $\ \sum_{i_1 < \dots < i_k} \mathrm{Vol}(A_{i_1 \dots i_k})^2 = \sum_{i_1 < \dots < i_k} \sigma_{i_1}(A)^2 \cdots \sigma_{i_k}(A)^2$

Maximizing Volumes Is Really Hard

Given: matrix $A$ with $n$ columns of unit norm, integer $k$, real number $\nu \in [0, 1]$.
Finding $k$ columns $A_1$ of $A$ such that $\mathrm{Vol}(A_1) \ge \nu$ is NP-hard.
There is no polynomial time approximation scheme. [Civril & Magdon-Ismail, 2007]

Randomized Subset Selection
- Frieze, Kannan & Vempala 2004
- Deshpande, Rademacher, Vempala & Wang 2006
- Liberty, Woolfe, Martinsson, Rokhlin & Tygert 2007
- Drineas, Mahoney & Muthukrishnan 2006, 2008
- Boutsidis, Mahoney & Drineas 2009
- Civril & Magdon-Ismail 2009

Applications: statistical data analysis (feature selection, principal component analysis); pass-efficient algorithms for large data sets.

2-Phase Randomized Algorithm

Boutsidis, Mahoney & Drineas 2009

1. Randomized phase: sample a small number ($\approx k \log k$) of columns.
2. Deterministic phase: apply rank revealing QR to the sampled columns.

With 70% probability:
Two norm: $\ \min_Z \|A_1 Z - A_2\|_2 \le O\big(k^{3/4} \log^{1/2} k\ (n-k)^{1/4}\big)\,\|\Sigma_2\|_2$
Frobenius norm: $\ \min_Z \|A_1 Z - A_2\|_F \le O\big(k \log^{1/2} k\big)\,\|\Sigma_2\|_F$

2-Phase Randomized Algorithm

1. Compute the SVD: $A = U \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix} \begin{pmatrix} V_1 \\ V_2 \end{pmatrix}$
2. Randomized phase: scale $V_1 \mapsto V_1 D$, then sample $c$ columns: $(V_1 D) P_s = (\underbrace{V_s}_{\hat{c}}\ \ \cdot\,)$
3. Deterministic phase: apply RRQR to $V_s$: $\ V_s P_d = (\underbrace{V_d}_{k}\ \ \cdot\,)$
4. Output: $A P_s P_d = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k})$

Randomized Phase (Frobenius Norm)

Sampling: column $i$ of $V_1$ is sampled with probability $p_i = c \left( \dfrac{\|(V_1)_i\|_2}{\|V_1\|_F} \right)^2$, $\ 1 \le i \le n$.

Scaling: scaling matrix $D = \mathrm{diag}\big(1/\sqrt{p_1}, \dots, 1/\sqrt{p_n}\big)$; in the scaled matrix $V_1 D$ all columns have the same norm and columns are sampled with probability $1/n$.

Purpose of the scaling: it makes the sampling uniform and makes expected values easier to compute.
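
A sketch of phases 1-3 with these probabilities (capped at 1), using plain pivoted QR in place of the RRQR of the deterministic phase; the constant $c$ and the test matrix are arbitrary, and the function returns None when too few columns are sampled ($\hat{c} < k$, one of the issues listed below).

    import numpy as np
    from scipy.linalg import qr

    def two_phase_subset(A, k, c, rng=np.random.default_rng()):
        _, _, Vh = np.linalg.svd(A, full_matrices=False)
        V1 = Vh[:k, :]                                        # k x n, ||V1||_F^2 = k
        p = np.minimum(1.0, c * np.sum(V1**2, axis=0) / k)    # sampling probabilities
        sampled = np.flatnonzero(rng.random(A.shape[1]) < p)  # expected ~ c columns
        if sampled.size < k:
            return None                                       # c-hat < k: sampling failed
        Vs = V1[:, sampled] / np.sqrt(p[sampled])             # sampled columns of V1*D
        _, _, piv = qr(Vs, pivoting=True)                     # deterministic phase
        return sampled[piv[:k]]                               # column indices of A1 in A

    A = np.random.default_rng(5).standard_normal((50, 30))
    print(two_phase_subset(A, k=4, c=20))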

Analysis of 2-Phase Algorithm

1. Sample $c$ columns: $\ (VD) P_s = \begin{pmatrix} V_{1s} D_s \\ V_{2s} D_s \end{pmatrix}$
2. RRQR selects $k$ columns: $\ (V_{1s} D_s) P_d = (V_d\ \ \cdot\,)$

Perturbation theory and RRQR: $\ \min_Z \|A_1 Z - A_2\|_F \le \dfrac{\|\Sigma_2\|_F + \|\Sigma_2 V_{2s} D_s\|_F}{\sigma_k(V_d)}$

With high probability:
$\sigma_k(V_{1s} D_s) \ge 1/2$ for $c \approx k \log k$
$\sigma_k(V_d) \ge \dfrac{\sigma_k(V_{1s} D_s)}{\sqrt{1 + k(\hat{c} - k)}}$
$\|\Sigma_2 V_{2s} D_s\|_F \le 4\,\|\Sigma_2\|_F$

Hence $\ \min_Z \|A_1 Z - A_2\|_F \le O\big(k \log^{1/2} k\big)\,\|\Sigma_2\|_F$.

Expected Values of Frobenius Norms

If $D_{ii} = 1/\sqrt{p_i}$ (with probability $p_i$, and $D_{ii} = 0$ otherwise) then $\ E\big(\|XD\|_F^2\big) = \|X\|_F^2$.

Frobenius norm: $\ \|XD\|_F^2 = \mathrm{trace}(X D^2 X^T)$
Linearity: $\ E\big[\|XD\|_F^2\big] = \mathrm{trace}\big(X\,\underbrace{E[D^2]}_{I}\,X^T\big)$
Scaling: $\ E\big[D_{ii}^2\big] = p_i \cdot \tfrac{1}{p_i} + (1 - p_i) \cdot 0 = 1$
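
A quick Monte Carlo check of this identity for the random sampling-and-scaling matrix $D$ ($D_{ii} = 1/\sqrt{p_i}$ with probability $p_i$, else $0$); the matrix $X$ and the probabilities are arbitrary.

    import numpy as np

    rng = np.random.default_rng(6)
    X = rng.standard_normal((5, 8))
    p = rng.uniform(0.2, 1.0, size=8)                    # arbitrary sampling probabilities

    vals = []
    for _ in range(20000):
        d = (rng.random(8) < p) / np.sqrt(p)             # realization of diag(D)
        vals.append(np.linalg.norm(X * d, 'fro')**2)     # X * d scales column j by d_j
    print(np.mean(vals), np.linalg.norm(X, 'fro')**2)    # the two numbers should be close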

From Expected Values to Probability

$E\big(\|XD\|_F^2\big) = \|X\|_F^2$

Markov's inequality: with $x = \|XD\|_F^2$, $\ \mathrm{Prob}(x \ge a) \le E(x)/a$; take $a = 10\,E(x)$.

With probability at most $1/10$: $\ \|XD\|_F^2 \ge 10\,\|X\|_F^2$
With probability at least $9/10$: $\ \|XD\|_F^2 \le 10\,\|X\|_F^2$

Issues with Randomized Algorithms
- How to choose $c$: $10^3\,k\log k$, $k\log k$, $17\,k\log k$, ...?
- We do not know the number of sampled columns $\hat{c}$.
- The number of sampled columns can be too small: $\hat{c} < k$.
- No information about the singular values of the important columns.
- How often does one have to run the algorithm to get a good result?
- How accurately do the singular vectors and singular values have to be computed?
- How sensitive is the algorithm to the choice of probabilities?
- How does the randomized algorithm compare to the deterministic algorithms in accuracy and run time?

Summary

Given: real or complex matrix $A$, integer $k$.
Want: $AP = (\underbrace{A_1}_{k}\ \ \underbrace{A_2}_{n-k})$

Important columns $A_1$: singular values close to the $k$ largest singular values of $A$.
Redundant columns $A_2$: $\|(I - A_1 A_1^{\dagger})\,A_2\|_{2,F}$ close to the smallest singular values of $A$.

Bounds depend on the dominant $k$ right singular vectors.
Deterministic algorithms: RRQR, SVD.
Randomized algorithm: two phases: 1. randomized sampling, 2. RRQR on the samples.
Exact subset selection is hard.