Subset Selection: Deterministic vs. Randomized
Ilse Ipsen, North Carolina State University
Joint work with: Stan Eisenstat (Yale), Mary Beth Broadbent, Martin Brown, Kevin Penner

Subset Selection
Given: real or complex matrix A, integer k
Determine a permutation matrix P so that AP = ( A_1  A_2 ), where A_1 has k columns
Important columns A_1: the columns of A_1 are very linearly independent
Redundant columns A_2: the columns of A_2 are well represented by A_1

Subset Selection: Requirements
Important columns A_1: the smallest singular value σ_k(A_1) should be large,
    σ_k(A)/γ ≤ σ_k(A_1) ≤ σ_k(A)   for some γ
Redundant columns A_2: min_Z ‖A_1 Z − A_2‖_2 should be small (two norm),
    σ_{k+1}(A) ≤ min_Z ‖A_1 Z − A_2‖_2 ≤ γ σ_{k+1}(A)   for some γ

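Both requirements are easy to evaluate numerically for a given permutation. Below is a minimal numpy sketch (not part of the talk; the helper name subset_quality and its interface are mine): σ_k(A_1) comes from an SVD, and the two-norm residual min_Z ‖A_1 Z − A_2‖_2 equals the norm of A_2 after projecting out range(A_1).

```python
import numpy as np

def subset_quality(A, perm, k):
    """Illustrative helper (not from the talk): evaluate a candidate subset.

    Returns sigma_k(A1) and min_Z ||A1 Z - A2||_2, where A1 = A[:, perm[:k]].
    """
    A1, A2 = A[:, perm[:k]], A[:, perm[k:]]
    sigma_k = np.linalg.svd(A1, compute_uv=False)[-1]    # smallest singular value of A1
    Q1, _ = np.linalg.qr(A1)                             # orthonormal basis of range(A1)
    residual = np.linalg.norm(A2 - Q1 @ (Q1.T @ A2), 2)  # = ||(I - Q1 Q1^T) A2||_2
    return sigma_k, residual
```
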
Outline
- Deterministic algorithms: strong RRQR, SVD
- Randomized 2-phase algorithm
- Perturbation analysis of the randomized algorithm
- Numerical experiments: randomized vs. deterministic
- New deterministic algorithm

Deterministic Subset Selection
- Businger & Golub (1965): QR with column pivoting
- Faddeev, Kublanovskaya & Faddeeva (1968)
- Golub, Klema & Stewart (1976)
- Gragg & Stewart (1976)
- Stewart (1984)
- Foster (1986)
- T. Chan (1987)
- Hong & Pan (1992)
- Chandrasekaran & Ipsen (1994)
- Gu & Eisenstat (1996): strong RRQR

First Deterministic Algorithm
Rank revealing QR decomposition
    AP = Q [ R_11  R_12 ; 0  R_22 ],   Q^T Q = I,   R_11 is k × k
Important columns:
    A_1 = Q [ R_11 ; 0 ]   and   σ_i(A_1) = σ_i(R_11),   1 ≤ i ≤ k
Redundant columns:
    A_2 = Q [ R_12 ; R_22 ]   and   min_Z ‖A_1 Z − A_2‖_2 = ‖R_22‖_2

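As a small illustration (my own, with arbitrary test data), SciPy's pivoted QR produces such a factorization, and ‖R_22‖_2 gives the residual directly; note that plain column pivoting is weaker than the strong RRQR discussed next.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
A, k = rng.standard_normal((100, 30)), 5

Q, R, piv = qr(A, pivoting=True)   # A[:, piv] == Q @ R, columns permuted by pivoting
A1, A2 = A[:, piv[:k]], A[:, piv[k:]]
R22 = R[k:, k:]                    # trailing block: min_Z ||A1 Z - A2||_2 = ||R22||_2
print(np.linalg.norm(R22, 2))
```
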
Strong RRQR (Gu & Eisenstat 1996)
Input: m × n matrix A, m ≥ n, integer k
Output: AP = Q [ R_11  R_12 ; 0  R_22 ],   R_11 is k × k
R_11 is well conditioned:
    σ_i(A)/√(1 + k(n−k)) ≤ σ_i(R_11) ≤ σ_i(A),   1 ≤ i ≤ k
R_22 is small:
    σ_{k+j}(A) ≤ σ_j(R_22) ≤ √(1 + k(n−k)) σ_{k+j}(A)
Offdiagonal block not too large:
    |(R_11^{-1} R_12)_{ij}| ≤ 1

Strong RRQR Algorithm
1. Compute some QR decomposition with column pivoting
       AP_initial = Q [ R_11  R_12 ; 0  R_22 ]
2. Repeat
       exchange a column of [ R_11 ; 0 ] with a column of [ R_12 ; R_22 ]
       update the permutation P, retriangularize
   until |det(R_11)| stops increasing
3. Output: AP_final = ( A_1  A_2 ),   A_1 with k columns, A_2 with n−k columns

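A deliberately naive sketch of the exchange loop (my reconstruction, not the Gu-Eisenstat implementation: it recomputes a small QR factorization for every trial swap instead of updating, so it only illustrates the logic):

```python
import numpy as np
from scipy.linalg import qr

def greedy_exchange(A, k, f=1.05, max_sweeps=50):
    """Naive illustration of the strong RRQR exchange step: swap a leading
    column with a trailing one while |det(R11)| grows by a factor > f
    (f slightly above 1 guarantees termination). Recomputes factorizations
    per trial swap; Gu & Eisenstat update them efficiently instead."""
    piv = list(qr(A, pivoting=True)[2])   # start from QR with column pivoting

    def logdet(p):
        # log |det(R11)| for the leading k columns under permutation p
        R = np.linalg.qr(A[:, p[:k]], mode='r')
        return np.sum(np.log(np.abs(np.diag(R))))

    current = logdet(piv)
    for _ in range(max_sweeps):
        best, swap = current + np.log(f), None
        for i in range(k):
            for j in range(k, A.shape[1]):
                p = piv.copy()
                p[i], p[j] = p[j], p[i]
                val = logdet(p)
                if val > best:
                    best, swap = val, (i, j)
        if swap is None:                  # |det(R11)| stopped increasing
            return np.array(piv)
        i, j = swap
        piv[i], piv[j] = piv[j], piv[i]
        current = logdet(piv)
    return np.array(piv)
```
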
Second Deterministic Algorithm
Singular value decomposition
    AP = ( A_1  A_2 ) = U [ Σ_1  0 ; 0  Σ_2 ] [ V_11  V_12 ; V_21  V_22 ]
Important columns A_1:
    σ_i(A)/‖V_11^{-1}‖_2 ≤ σ_i(A_1) ≤ σ_i(A)   for all 1 ≤ i ≤ k
Redundant columns A_2:
    σ_{k+1}(A) ≤ min_Z ‖A_1 Z − A_2‖_2 ≤ ‖V_11^{-1}‖_2 σ_{k+1}(A)
[Hong & Pan 1992]

Almost Strong RRQR Algorithm
In the spirit of Golub, Klema and Stewart (1976)
1. Compute the SVD
       A = U [ Σ_1  0 ; 0  Σ_2 ] [ V_1 ; V_2 ]
2. Apply strong RRQR to V_1:
       V_1 P = ( V_11  V_12 ),   V_11 with k columns, V_12 with n−k columns
       1/√(1 + k(n−k)) ≤ σ_i(V_11) ≤ 1,   1 ≤ i ≤ k
3. Output: AP = ( A_1  A_2 ),   A_1 with k columns, A_2 with n−k columns

Almost Strong RRQR
Produces a permutation P so that AP = ( A_1  A_2 ), A_1 with k columns, where
Important columns A_1:
    σ_i(A)/√(1 + k(n−k)) ≤ σ_i(A_1) ≤ σ_i(A),   1 ≤ i ≤ k
Redundant columns A_2:
    σ_{k+1}(A) ≤ min_Z ‖A_1 Z − A_2‖_2 ≤ √(1 + k(n−k)) σ_{k+1}(A)

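A compact sketch of this SVD-based approach (mine; ordinary pivoted QR stands in for strong RRQR, so the √(1 + k(n−k)) guarantees above are not strictly certified by this code):

```python
import numpy as np
from scipy.linalg import qr

def svd_subset(A, k):
    """SVD-based subset selection sketch: choose the permutation by running
    a pivoted QR (stand-in for strong RRQR) on the k dominant right
    singular vectors V1."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V1 = Vt[:k, :]                       # k dominant right singular vectors, k x n
    _, _, piv = qr(V1, pivoting=True)    # permutation chosen from V1's columns
    return piv                           # piv[:k] indexes A1, piv[k:] indexes A2
```
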
Deterministic Subset Selection
Algorithms: strong RRQR, SVD
Permuting columns of A corresponds to permuting the right singular vector matrix V
Perturbation bounds in terms of V:
    σ_{k+1}(A) ≤ min_Z ‖A_1 Z − A_2‖_2 ≤ ‖V_11^{-1}‖_2 σ_{k+1}(A)
Operation count for an m × n matrix, m ≥ n:
    min_Z ‖A_1 Z − A_2‖_2 ≤ √(1 + f^2 k(n−k)) σ_{k+1}(A)   in O(mn^2 + n^3 log_f n) flops

Randomized Algorithms
- Frieze, Kannan & Vempala 1998, 2004
- Drineas, Kannan & Mahoney 2006
- Deshpande, Rademacher, Vempala & Wang 2006
- Rudelson & Vershynin 2007
- Liberty, Woolfe, Martinsson, Rokhlin & Tygert 2007
- Drineas, Mahoney & Muthukrishnan 2006, 2008
- Boutsidis, Mahoney & Drineas 2008, 2009
- Civril & Magdon-Ismail 2009
- Survey paper: Halko, Martinsson & Tropp 2009

2-Phase Randomized Algorithm
Boutsidis, Mahoney & Drineas 2009
1. Randomized phase: sample a small number (≈ k log k) of columns
2. Deterministic phase: apply a rank revealing QR to the sampled columns
With 70% probability:
    Two norm:        min_Z ‖A_1 Z − A_2‖_2 ≤ O( k^{3/4} log^{1/2} k (n−k)^{1/4} ) ‖Σ_2‖_2
    Frobenius norm:  min_Z ‖A_1 Z − A_2‖_F ≤ O( k log^{1/2} k ) ‖Σ_2‖_F

Deterministic vs. Randomized Algorithms
Want: permutation P so that AP = ( A_1  A_2 ), A_1 with k columns
Compute the SVD
    A = U [ Σ_1  0 ; 0  Σ_2 ] [ V_1 ; V_2 ]
Obtain P from the k dominant right singular vectors V_1
Deterministic: apply RRQR to all columns of the matrix V_1
Randomized: apply RRQR to a subset of columns of the scaled matrix V_1 D

2-Phase Randomized Algorithm
1. Compute the SVD
       A = U [ Σ_1  0 ; 0  Σ_2 ] [ V_1 ; V_2 ]
2. Randomized phase:
       scale: V_1 → V_1 D
       sample c columns (in expectation): (V_1 D) P_s = ( V_1s D_s  … ),   V_1s D_s with ĉ columns
3. Deterministic phase:
       apply RRQR to V_1s D_s: (V_1s D_s) P_d = ( V_11 D_1  … ),   V_11 D_1 with k columns
4. Output: A P_s P_d = ( A_1  A_2 ),   A_1 with k columns, A_2 with n−k columns

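Putting the four steps together, a hedged end-to-end sketch (mine, after Boutsidis, Mahoney & Drineas 2009) using the Frobenius-norm sampling probabilities q_i = ‖(V_1)_i‖_2^2 / k defined on a later slide, with pivoted QR standing in for RRQR:

```python
import numpy as np
from scipy.linalg import qr

def two_phase_randomized(A, k, c, rng=None):
    """Sketch of the 2-phase randomized algorithm. Returns the indices of the
    k columns of A selected for A1."""
    rng = np.random.default_rng() if rng is None else rng
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V1 = Vt[:k, :]                           # k dominant right singular vectors

    q = np.sum(V1**2, axis=0) / k            # Frobenius-norm distribution, sums to 1
    p = np.minimum(1.0, c * q)               # expected number of sampled columns <= c
    idx = np.flatnonzero(rng.random(V1.shape[1]) < p)   # randomized phase
    V1s_Ds = V1[:, idx] / np.sqrt(p[idx])    # sampled and scaled columns

    _, _, piv = qr(V1s_Ds, pivoting=True)    # deterministic phase (RRQR stand-in)
    return idx[piv[:k]]
```
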
SVD: Perturbation Bounds
    AP = U [ Σ_1  0 ; 0  Σ_2 ] [ V_11  V_12 ; V_21  V_22 ],   V_11 is k × k
For deterministic algorithms:
    min_Z ‖A_1 Z − A_2‖ ≤ ‖Σ_2‖ / σ_k(V_11)
or
    min_Z ‖A_1 Z − A_2‖ ≤ ‖Σ_2‖ + ‖Σ_2 V_21‖ / σ_k(V_11)
For the randomized 2-phase algorithm:
    min_Z ‖A_1 Z − A_2‖ ≤ ‖Σ_2‖ + ‖Σ_2 V_21 D_1‖ / σ_k(V_11 D_1)
for any nonsingular matrix D_1

Perturbation Bounds for Randomized Algorithm
D_1 is the scaling matrix
    min_Z ‖A_1 Z − A_2‖ ≤ ‖Σ_2‖ + ‖Σ_2 V_21 D_1‖ / σ_k(V_11 D_1)
V_11 D_1 comes from RRQR: (V_1s D_s) P_d = ( V_11 D_1  … ),   ĉ columns reduced to k
    min_Z ‖A_1 Z − A_2‖ ≤ ‖Σ_2‖ + √(1 + k(ĉ−k)) ‖Σ_2 V_21 D_1‖ / σ_k(V_1s D_s)
We need, with high probability:
    ‖Σ_2 V_21 D_1‖ not much larger than ‖Σ_2‖,   and   σ_k(V_1s D_s) bounded away from 0

Probabilistic Bounds: Frobenius Norm
Column i of X is sampled with probability p_i
Scaling matrix: D_ii = 1/√p_i with probability p_i, and D_ii = 0 otherwise
Frobenius norm:  ‖XD‖_F^2 = trace(X D^2 X^T)
Linearity:       E[ ‖XD‖_F^2 ] = trace(X E[D^2] X^T),   with E[D^2] = I
Scaling:         E[ D_ii^2 ] = p_i · (1/p_i) + (1 − p_i) · 0 = 1
Expected value:  E[ ‖XD‖_F^2 ] = ‖X‖_F^2
Markov's inequality:
    Prob[ ‖XD‖_F^2 ≤ α ‖X‖_F^2 ] ≥ 1 − 1/α

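The expectation identity E[‖XD‖_F^2] = ‖X‖_F^2 is easy to sanity-check by simulation (an illustrative experiment of mine, with arbitrary X and p_i):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 50))
p = rng.uniform(0.1, 1.0, size=50)          # arbitrary sampling probabilities

vals = []
for _ in range(20000):
    # D_ii = 1/sqrt(p_i) with probability p_i, else 0; X * d scales the columns
    d = np.where(rng.random(50) < p, 1.0 / np.sqrt(p), 0.0)
    vals.append(np.linalg.norm(X * d, 'fro')**2)

print(np.mean(vals), np.linalg.norm(X, 'fro')**2)   # the two should nearly agree
```
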
Randomized Subset Selection: Frobenius Norm
Perturbation bound:
    min_Z ‖A_1 Z − A_2‖_F ≤ ‖Σ_2‖_F + √(1 + k(ĉ−k)) ‖Σ_2 V_21 D_1‖_F / σ_k(V_1s D_s)
With probability 1 − 1/α:
    ‖Σ_2 V_21 D_1‖_F ≤ √α ‖Σ_2 V_2‖_F = √α ‖Σ_2‖_F
Holds for any probability distribution
Still to show: σ_k(V_1s D_s) bounded away from 0 with high probability

Sampling: Frobenius Norm
When is σ_k(V_1s D_s) bounded away from 0 with high probability?
Expected number of sampled columns: c
Column i of V_1 is sampled with probability p_i = min{1, c q_i}, where q_i = ‖(V_1)_i‖_2^2 / k
Try to sample columns with large norm
Scaling matrix: D = diag( 1/√p_1, …, 1/√p_n )
If c = Θ(k log k) then, with probability ≥ 0.9,   σ_k(V_1s D_s) ≥ 1/2
[Boutsidis, Mahoney & Drineas 2009]

Sampling: Frobenius Norm
How many columns should V_1s actually have?
Expected number of sampled columns from V_1: c
Actual number of columns in V_1s: ĉ, with E[ĉ] ≤ c
If c q_i ≤ 1 for all i, and ĉ ≥ k, then
    σ_k(V_1s D_s) ≤ √(ĉ/c)
If ĉ < c/4 then σ_k(V_1s D_s) < 1/2
Make sure that enough columns are actually sampled

Randomized Subset Selection: Two Norm
    min_Z ‖A_1 Z − A_2‖_2 ≤ ‖Σ_2‖_2 + √(1 + k(ĉ−k)) ‖Σ_2 V_21 D_1‖_2 / σ_k(V_1s D_s)
Probability distribution: p_i = min{1, c q_i}, where
    q_i = (1/2) ‖(V_1)_i‖_2^2 / k + (1/2) ( ‖(Σ_2 V_2)_i‖_2 / ‖Σ_2 V_2‖_F )^2
With probability ≥ 0.9:
    ‖Σ_2 V_21 D_1‖_2 ≤ γ ( (n−k+1) log c / c )^{1/4} ‖Σ_2‖_2
[Boutsidis, Mahoney & Drineas 2009]

Numerical Experiments
Compare strong RRQR and the randomized 2-phase algorithm
Subset selection in the 2 norm
Matrix orders n = 500, 2000;   0 ≤ k ≤ 240
Matrices: Kahan, random, scaled random, triangular, numerical rank k
Randomized algorithm: run 40 times
Iterative determination of c:
    c = 2k
    while σ_k(V_1s D_s) < 1/2 do c = 2c

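The doubling strategy for c might be coded as follows (a sketch under the same assumptions as the earlier randomized-phase code; the helper name sample_until_wellconditioned is mine):

```python
import numpy as np

def sample_until_wellconditioned(V1, k, rng=None):
    """Start with c = 2k and double it until the sampled, scaled block
    satisfies sigma_k(V1s Ds) >= 1/2 (and at least k columns were drawn)."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.sum(V1**2, axis=0) / k
    c = 2 * k
    while True:
        p = np.minimum(1.0, c * q)
        idx = np.flatnonzero(rng.random(V1.shape[1]) < p)
        V1s_Ds = V1[:, idx] / np.sqrt(p[idx])
        if idx.size >= k and np.linalg.svd(V1s_Ds, compute_uv=False)[k - 1] >= 0.5:
            return c, idx, V1s_Ds
        c *= 2                            # resample with a larger expected count
```
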
Accuracy
Residuals min_Z ‖A_1 Z − A_2‖_2 for n = 500 and k = 20

                          Randomized
Matrix        RRQR     max      min       mean
Kahan        7·10^0   2·10^0   5·10^−1   5·10^−1
num. rank k  7·10^0   2·10^1   8·10^0    1·10^1
triangular   3·10^0   1·10^0   1·10^0    1·10^0
random       3·10^1   3·10^1   3·10^1    3·10^1
scaled rand  4·10^1   5·10^1   4·10^1    5·10^1

No significant difference in accuracy between deterministic and randomized algorithms

Number of Sampled Columns
Values of c for n = 500 and k = 20

Matrix        c values tried          most frequent   mean ĉ
Kahan         40, 80, 160, 320, 640   2k = 40         56
num. rank k   40, 80, 160             4k = 80         83
triangular    40, 80, 160             4k = 80         97
random        40, 80, 160             4k = 80         93
scaled rand   40, 80, 160             4k = 80         97

c = 4k seems to be a good value

Different Probability Distributions
Two norm:
    q_i = (1/2) ‖(V_1)_i‖_2^2 / k + (1/2) ( ‖(Σ_2 V_2)_i‖_2 / ‖Σ_2 V_2‖_F )^2
        expensive
    q_i = (1/2) ‖(V_1)_i‖_2^2 / k + (1/2) ( ‖(A)_i‖_2^2 − ‖(A V_1^T V_1)_i‖_2^2 ) / ( ‖A‖_F^2 − ‖A V_1^T V_1‖_F^2 )
        numerically unstable (can be negative)
Frobenius norm:
    q_i = ‖(V_1)_i‖_2^2 / k
        numerically stable and cheap

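In code, the stable, cheap Frobenius-norm distribution and the unstable two-norm variant (which forms ‖(Σ_2 V_2)_i‖_2^2 as the difference shown above) might look like this (function names are mine):

```python
import numpy as np

def q_frobenius(V1, k):
    """Frobenius-norm distribution: cheap and numerically stable."""
    return np.sum(V1**2, axis=0) / k

def q_two_norm_unstable(A, V1, k):
    """Two-norm distribution via ||(Sigma2 V2)_i||^2 = ||A_i||^2 - ||(A V1^T V1)_i||^2.
    In floating point the differences can come out negative -- the instability
    noted on the slide."""
    B = A @ V1.T @ V1                                   # best rank-k approximation of A
    num = np.sum(A**2, axis=0) - np.sum(B**2, axis=0)   # per-column difference
    return 0.5 * np.sum(V1**2, axis=0) / k + 0.5 * num / np.sum(num)
```
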
Different Probability Distributions
Residuals min_Z ‖A_1 Z − A_2‖_2 for n = 500 and k = 20

Matrix        P    max       min       mean      ĉ
num. rank k   2    2·10^1    8·10^0    1·10^1    83
              F    1·10^1    7·10^0    1·10^1    88
triangular    2    1·10^0    1·10^0    1·10^0    97
              F    1·10^0    8·10^−1   1·10^0    91
random        2    3·10^1    3·10^1    3·10^1    93
              F    3·10^1    3·10^1    3·10^1    87
scaled rand   2    5·10^1    4·10^1    5·10^1    97
              F    5·10^1    4·10^1    5·10^1    93
Kahan         2    2·10^0    5·10^−1   5·10^−1   56
              F    7·10^−1   5·10^−1   5·10^−1   42

No difference in accuracy between the 2-norm and Frobenius-norm probability distributions

Ideas from the Randomized Algorithm
    AP = U [ Σ_1  0 ; 0  Σ_2 ] [ V_1 ; V_2 ],   V_1 = ( V_11  V_12 ),   V_11 with k columns
Order the columns of V_1 in order of decreasing norms
Apply strong RRQR to the columns of largest norm
Intuition:
    ‖V_11‖_F^2 = k − ‖V_12‖_F^2   This means: ‖V_11‖_F large implies ‖V_12‖_F small
    σ_k(V_11)^2 = 1 − ‖V_12‖_2^2   This means: ‖V_12‖_2 small implies σ_k(V_11) large

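Both identities follow from V_1 having orthonormal rows (V_1 V_1^T = I_k), and are quick to verify numerically (an illustrative check of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 4, 12
V1 = np.linalg.qr(rng.standard_normal((n, k)))[0].T   # k x n with orthonormal rows
V11, V12 = V1[:, :k], V1[:, k:]

# ||V11||_F^2 = k - ||V12||_F^2
print(np.linalg.norm(V11, 'fro')**2, k - np.linalg.norm(V12, 'fro')**2)
# sigma_k(V11)^2 = 1 - ||V12||_2^2
print(np.linalg.svd(V11, compute_uv=False)[-1]**2, 1 - np.linalg.norm(V12, 2)**2)
```
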
New 2-Phase Deterministic Algorithm
1. Compute the SVD
       A = U [ Σ_1  0 ; 0  Σ_2 ] [ V_1 ; V_2 ]
2. Ordering phase:
       order according to decreasing column norms: V_1 → V_1 P_O
       select the leading 4k columns: A P_O = ( A_S  … ),   A_S with 4k columns
3. Rank revealing phase:
       apply strong RRQR to A_S: A_S P_R = ( A_1  … ),   A_1 with k columns

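A sketch of the new algorithm (mine; pivoted QR again stands in for strong RRQR in the rank revealing phase):

```python
import numpy as np
from scipy.linalg import qr

def two_phase_deterministic(A, k):
    """Ordering phase: keep the 4k columns of A whose V1-columns have largest
    norm. Rank revealing phase: pivoted QR (RRQR stand-in) on those columns."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V1 = Vt[:k, :]
    order = np.argsort(-np.sum(V1**2, axis=0))    # decreasing column norms of V1
    S = order[:min(4 * k, A.shape[1])]            # ordering phase: leading 4k columns
    _, _, piv = qr(A[:, S], pivoting=True)        # rank revealing phase on A_S
    return S[piv[:k]]                             # column indices of A1
```
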
Results for New Deterministic Algorithm
n = 2000 and k = 40
Residuals min_Z ‖A_1 Z − A_2‖_2
Time ratio TR = time(new algorithm) / time(RRQR)

              Residuals             σ_k(A_1)
Matrix      RRQR     new         RRQR      new        TR
Kahan       4·10^0   4·10^0      8·10^−2   8·10^−2    0.11
random      9·10^1   9·10^1      1·10^1    1·10^1     0.03
s. rand     1·10^2   1·10^2      2·10^1    2·10^1     0.07
triang      4·10^0   3·10^−1     3·10^−1   3·10^−1    0.02

The new algorithm appears to be as accurate as RRQR and possibly faster

Subset Selection: Summary
Given: real or complex matrix A, integer k
Want: AP = ( A_1  A_2 ), A_1 with k columns, such that
    σ_k(A_1) ≈ σ_k(A)   and   min_Z ‖A_1 Z − A_2‖ ≈ σ_{k+1}(A)
Deterministic algorithms: strong RRQR, SVD
Randomized 2-phase algorithm
Randomized algorithm: no more accurate than strong RRQR for matrices of order ≤ 2000
Numerical issues with the randomized algorithm
New deterministic algorithm: as accurate as strong RRQR and perhaps faster