Computer Sciences Department

Computing the Singular Value Decomposition of 3 × 3 matrices with minimal branching and elementary floating point operations

Technical Report #1690
May 2011

Aleka McAdams (1,2), Andrew Selle (1), Rasmus Tamstorf (1), Joseph Teran (2,1), Eftychios Sifakis (3,1)

(1) Walt Disney Animation Studios; (2) University of California, Los Angeles; (3) University of Wisconsin, Madison

Abstract

A numerical method for the computation of the Singular Value Decomposition of 3 × 3 matrices is presented. The proposed methodology robustly handles rank-deficient matrices and guarantees orthonormality of the computed rotational factors. The algorithm is tailored to the characteristics of SIMD or vector processors. In particular, it does not require any explicit branching beyond simple conditional assignments (as in the C++ ternary operator ?:, or the SSE4.1/AVX blend instruction VBLENDPS), enabling trivial data-level parallelism for any number of operations. Furthermore, no trigonometric or other expensive operations are required; the only floating point operations utilized are addition, multiplication, and an inexact (yet fast) reciprocal square root which is broadly available on current SIMD/vector architectures. The performance observed approaches the limit of making the 3 × 3 SVD a memory-bound (as opposed to CPU-bound) operation on current SMP platforms.

Keywords: singular value decomposition, Jacobi eigenvalue algorithm

1 Method overview

Let $A$ be a real-valued, 3 × 3 matrix. A factorization of $A$ as $A = U\Sigma V^T$ is guaranteed to exist, where $U$ and $V$ are 3 × 3 real orthogonal matrices and $\Sigma$ is a 3 × 3 diagonal matrix with real and nonnegative diagonal entries. Since the matrix product $U\Sigma V^T$ remains invariant if the same permutation is applied to the columns of $U$, $V$ and to the diagonal entries of $\Sigma$, a common convention is to choose $\Sigma$ so that its diagonal entries appear in non-increasing order. The exact convention followed in our method is slightly different, specifically:

- The orthogonal factors $U$ and $V$ will be true rotation matrices by construction (i.e. $\det(U) = \det(V) = 1$). This is contrasted to the possibility of $U$ or $V$ having a determinant of $-1$, which corresponds to a rotation combined with a reflection.
- The diagonal entries of $\Sigma$ will be sorted in decreasing order of magnitude, but will not necessarily be non-negative (relaxing the non-negativity constraint is necessary to allow $U$ and $V$ to be true rotations, since the determinant of $A$ could be either positive or negative). More specifically, the singular value with the smallest magnitude ($\sigma_3$, or $\Sigma_{33}$) will have the same sign as $\det(A)$, while the two larger singular values $\sigma_1$, $\sigma_2$ will be non-negative.

These conventions are motivated by applications in graphics which require the orthogonal matrices $U$ and $V$ to correspond to real 3D spatial rotations. In any case, different conventions can be enforced as a post process with simple manipulations such as negating and/or permuting singular values and vectors.

The algorithm first determines the factor $V$ by computing the eigenanalysis of the matrix $A^T A = V\Sigma^2 V^T$, which is symmetric and positive semi-definite. This is accomplished via a modified Jacobi iteration where the Jacobi factors are approximated using inexpensive, elementary arithmetic operations as described in section 2. Since the symmetric eigenanalysis also produces $\Sigma^2$, and consequently $\Sigma$ itself, the remaining factor $U$ can theoretically be obtained as $U = AV\Sigma^{-1}$; however this process is not applicable when $A$ (and as a result, $\Sigma$) is singular, and can also lead to substantial loss of orthogonality in $U$ for an ill-conditioned, yet nonsingular, matrix $A$. Another possibility is to form $AV = U\Sigma$ and observe that this matrix contains the columns of $U$, each scaled by the respective diagonal entry of $\Sigma$. The Gram-Schmidt process could generate $U$ as the orthonormal basis for $AV$, yet this method would still suffer from instability in the case of near-zero singular values. Our approach is based on the QR factorization of the matrix $AV$, using Givens rotations. With proper attention to some special cases, as described in section 4, the Givens QR procedure is guaranteed to produce a factor $Q$ ($= U$) which is exactly orthogonal by construction, while the upper-triangular factor $R$ will be shown to be in fact diagonal and identical to $\Sigma$ up to sign flips of the diagonal entries.

The novel contribution of our approach lies in the computation of the Jacobi/Givens rotations without the use of expensive arithmetic such as trigonometric functions, square roots or even division; the only necessary operations are addition, multiplication and an inexact reciprocal square root function (which is available in many architectures and is often faster not only than square root, but standard division as well). The method is robust for any real matrix, and converges to a given accuracy within a fixed small number of iterations.

2 Symmetric eigenanalysis

The first step in computing the decomposition $A = U\Sigma V^T$ is to compute the eigenanalysis of the symmetric, positive semidefinite matrix $S = A^T A = V\Sigma^2 V^T$. This will be performed by a variant of the Jacobi eigenvalue algorithm [Golub and van Loan 1989].

2.1 Jacobi iteration

We provide a summary presentation of the classical Jacobi eigenvalue algorithm; for a more detailed exposition the reader is referred to [Golub and van Loan 1989]. The Jacobi process constructs a sequence of similarity transforms $S^{(k+1)} = [Q^{(k)}]^T S^{(k)} Q^{(k)}$. Each matrix $Q^{(k)}$ is constructed as a Givens rotation, of the form

$$Q(p,q,\theta) = \begin{pmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & -s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{pmatrix} \begin{matrix} \\ \\ \leftarrow \text{row } p \\ \\ \leftarrow \text{row } q \\ \\ \\ \end{matrix}$$

where $c = \cos(\theta)$ and $s = \sin(\theta)$. The objective of every conjugation with a Givens matrix $Q^{(k)} = Q(p,q,\theta)$ is to bring the next iterate $S^{(k+1)}$ closer to a diagonal form, by eliminating the off-diagonal entries $s^{(k)}_{pq}$ and $s^{(k)}_{qp}$ (i.e. by enforcing that $s^{(k+1)}_{pq} = s^{(k+1)}_{qp} = 0$). It can be shown that

$$\sum_{i \neq j} \left[s^{(k+1)}_{ij}\right]^2 = \sum_{i \neq j} \left[s^{(k)}_{ij}\right]^2 - 2\left[s^{(k)}_{pq}\right]^2 .$$

Therefore, after each Jacobi iteration, the sum of squared off-diagonal entries of $S$ is reduced by the sum of squares of the two annihilated off-diagonal entries. There is a natural choice of the entries to be eliminated that can easily be seen to generate a convergent process: if we annihilate the maximum off-diagonal entry of a 3 × 3 matrix we are guaranteed to annihilate at least 1/3 of the off-diagonal sum-of-squares (in fact, at each iteration after the first, this reduction will be at least 1/2 of the previous sum of squares, since previous iterations leave only one other non-zero off-diagonal pair).

The Jacobi iteration rapidly drives the iterates $S^{(k)}$ to a diagonal form. For 3 × 3 matrices, a small number of iterations is typically sufficient to diagonalize $S$ to within single-precision roundoff error. Although the previous argument suggests at least linear convergence, the order becomes in fact quadratic once $S^{(k)}$ has been brought somewhat close to a diagonal form. Additionally, it can be shown that the same asymptotic order of convergence is attained even if the off-diagonal elements to be eliminated are selected in a fixed, cyclic order, i.e. $(p,q) = (1,2), (1,3), (2,3), (1,2), (1,3), \ldots$ This alleviates the need for conditional execution based on the magnitude of the off-diagonal entries.

Note: Cyclic Jacobi requires that we always pick $|\theta| < \pi/4$ to ensure convergence, an option that is always possible as illustrated next.

The rest of this section addresses the computation of the trigonometric factors $c = \cos(\theta)$ and $s = \sin(\theta)$. Once the indices $(p,q)$ of the off-diagonal pair to be annihilated have been selected, the values of $c$ and $s$ depend only on the 2 × 2 submatrix at the intersection of rows and columns $p$ and $q$ (note that $s_{pq} = s_{qp}$ due to symmetry):

$$\begin{pmatrix} s_{pp} & s_{pq} \\ s_{pq} & s_{qq} \end{pmatrix}$$

Thus, the determination of $c$ and $s$ is equivalent to the problem of diagonalizing a 2 × 2 symmetric matrix.

2.2 The classical Givens diagonalization

Let $A$ be a 2 × 2 symmetric matrix. We seek trigonometric factors $c = \cos(\theta)$ and $s = \sin(\theta)$ such that the result of the conjugation

$$B = Q^T A Q = \begin{pmatrix} c & s \\ -s & c \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{pmatrix} \begin{pmatrix} c & -s \\ s & c \end{pmatrix} = \begin{pmatrix} c^2 a_{11} + 2cs\,a_{12} + s^2 a_{22} & cs(a_{22} - a_{11}) + (c^2 - s^2)a_{12} \\ cs(a_{22} - a_{11}) + (c^2 - s^2)a_{12} & s^2 a_{11} - 2cs\,a_{12} + c^2 a_{22} \end{pmatrix}$$

is a diagonal matrix. Therefore, we need to enforce that $b_{12} = b_{21} = cs(a_{22} - a_{11}) + (c^2 - s^2)a_{12} = 0$. If $a_{11} = a_{22}$, we can simply select $\theta = \pi/4$ (or equivalently, $c = s = 1/\sqrt{2}$). Otherwise, the previous condition can be rewritten as:

$$\frac{cs}{c^2 - s^2} = \frac{a_{12}}{a_{11} - a_{22}} \;\Rightarrow\; \frac{\cos\theta \sin\theta}{\cos^2\theta - \sin^2\theta} = \frac{\tan\theta}{1 - \tan^2\theta} = \frac{a_{12}}{a_{11} - a_{22}} \qquad (1)$$

$$\tan(2\theta) = \frac{2a_{12}}{a_{11} - a_{22}} \qquad (2)$$

From equation (2) we have $\theta = \frac{1}{2}\arctan\left(2a_{12}/(a_{11} - a_{22})\right)$. If the arc-tangent function returns a value in the interval $(-\pi/2, \pi/2)$, this expression will guarantee $|\theta| < \pi/4$ as required for convergence of Cyclic Jacobi. Naturally, the use of (forward and inverse) trigonometric functions will significantly increase the cost of this computation. An alternative technique is described in [Golub and van Loan 1989], where the quadratic equation (1) is solved to yield $\tan\theta$ directly, from which the sine and cosine are computed algebraically. This approach still requires a minimum of two reciprocals, one square root and one exact reciprocal-of-square-root in addition to any multiply/add operations (an extra reciprocal plus square root will be needed if branching instructions are to be avoided).

Our proposed optimization quickly computes an approximate form of the Givens rotation, where inaccuracy may be introduced both in the angle computation and as a constant scaling of the rotation matrix. However, the nature of this approximation is such that the Jacobi procedure is only impacted by a minimal deceleration, while the accuracy of the converged result is not compromised. Our methodology requires only multiply/add operations, plus a single inexpensive, inexact reciprocal-of-square-root evaluation (even large relative errors are well acceptable). No branching is required, with the exception of conditional assignments. Additional computational savings arise from the compact representation of rotations as quaternions, instead of explicit matrices.

2.3 Approximating the trigonometric factors

In this section, we introduce a first approximation to the trigonometric factors $c$, $s$ in the Givens matrices, which does not require evaluating trigonometric functions or solving a quadratic. Our approach stems from an asymptotic approximation when the rotation angle $\theta$ is small. Equation (2) suggests that this would be the case for example when the Jacobi iteration is close to convergence and $\Sigma$ does not have repeated singular values; nevertheless, our process is designed to guarantee reasonable progress regardless of any such conditions. Let us temporarily assume that $\theta$ is small. Under this assumption, we can approximate $\tan(2\theta) \approx 2\tan\theta$. In fact, let us denote with

$\varphi$ the angle that satisfies the equation $\tan(2\theta) = 2\tan\varphi$ as an exact identity; we can then equivalently state that when $\theta$ is small, we will have $\varphi \approx \theta$. We summarize these approximations, in conjunction with equation (2), as follows:

$$2\tan\varphi = \tan(2\theta) = \frac{2a_{12}}{a_{11} - a_{22}} \qquad (3)$$

This expression can be rewritten to provide the following expression for $\cos(\varphi)$ and $\sin(\varphi)$ (which, in turn, approximate the trigonometric factors $c = \cos(\theta)$ and $s = \sin(\theta)$, respectively):

$$\tan\varphi = \frac{\sin\varphi}{\cos\varphi} = \frac{a_{12}}{a_{11} - a_{22}} \;\Rightarrow\; \begin{cases} \sin\varphi = \omega\, a_{12} \\ \cos\varphi = \omega\,(a_{11} - a_{22}) \\ \omega = 1/\sqrt{a_{12}^2 + (a_{11} - a_{22})^2} \end{cases} \qquad (4)$$

Figure 1: Approximating the trigonometric Givens factors using $\tan(2\theta) \approx 2\tan\theta$ (or by setting a fixed angle $\theta = \pi/4$). The off-diagonal magnitude reduction fraction $|b_{12}/a_{12}|$ is plotted on the vertical axis, as a function of the optimal Givens angle.

What is the quality of this approximation, however, specifically for our purposes of generating a diagonal matrix $B = Q^T A Q$? This can be quantified by looking at the off-diagonal element $b_{12}$ generated after the conjugation with these approximate Givens rotations:

$$\left|\frac{b_{12}}{a_{12}}\right| = \left|\frac{\cos\varphi\sin\varphi\,(a_{22} - a_{11}) + (\cos^2\varphi - \sin^2\varphi)\,a_{12}}{a_{12}}\right| = \left|-\sin(2\varphi)\,\frac{a_{11} - a_{22}}{2a_{12}} + \cos(2\varphi)\right| \overset{\text{Eq.(2)}}{=} \left|-\frac{\sin(2\varphi)}{\tan(2\theta)} + \cos(2\varphi)\right|$$

$$= \left|\frac{-\sin(2\varphi)\cos(2\theta) + \cos(2\varphi)\sin(2\theta)}{\sin(2\theta)}\right| = \left|\frac{\sin(2\varphi - 2\theta)}{\sin(2\theta)}\right| \overset{\text{Eq.(3)}}{=} \left|\frac{\sin\left(2\arctan(\tan(2\theta)/2) - 2\theta\right)}{\sin(2\theta)}\right| \qquad (5)$$

Equation (5) provides a concise expression for the reduction of the magnitude of the off-diagonal element, as a function of the optimal Givens rotation angle (which is, in turn, a function of the matrix entries). Figure 1 (solid line) illustrates the magnitude reduction fraction as a function of the angle $\theta$. We observe that for small values of $\theta$ the quality of the approximation is excellent, effectively leading to annihilation of the off-diagonal element. However, for larger values of $\theta$, the reduction becomes smaller, and we actually obtain no reduction at all for values $\theta \approx \pi/4$.

The poor performance of the previous approximation when $\theta \approx \pi/4$ will be addressed by considering yet another choice for the Givens angle $\varphi$. Equation (4) reveals that this approximate $\varphi$ may lie outside the interval $(-\pi/4, \pi/4)$, which in contrast could have been guaranteed for the optimal angle $\theta$. This restriction was important in ensuring convergence of the Cyclic Jacobi method. Thus, we consider the possibility of truncating the approximate value to the value $\varphi = \pi/4$, at the very least in the case when the computed value lies outside that interval. For this fixed choice of the Givens angle, the off-diagonal element $b_{12}$ after the conjugation becomes:

$$b_{12} = \cos\tfrac{\pi}{4}\sin\tfrac{\pi}{4}\,(a_{22} - a_{11}) + \left(\cos^2\tfrac{\pi}{4} - \sin^2\tfrac{\pi}{4}\right)a_{12} = \frac{a_{22} - a_{11}}{2} \qquad (6)$$

$$\left|\frac{b_{12}}{a_{12}}\right| = \frac{|a_{11} - a_{22}|}{2|a_{12}|} = \frac{1}{|\tan(2\theta)|} \qquad (7)$$

This reduction fraction is also plotted in figure 1 (dashed line). Note that both magnitude reduction fractions are even, and periodic ($T = \pi/2$) functions, so it is sufficient to study them in the interval $[0, \pi/4]$. Notably, if we were able to pick the best of the two proposed approximations (in terms of the magnitude reduction they produce) we see that a reduction fraction significantly smaller than 1 can be guaranteed. In fact, it is possible to formalize this selection between the two approximations; from equations (5) and (7) we can solve for the intersection point of the two curves in figure 1 as $\theta_0 = \arctan(2)/2$ (we omit the relevant trigonometric manipulations). The magnitude reduction ratio $|b_{12}/a_{12}|$ at the intersection point is equal to $1/\tan(2\theta_0) = 0.5$. Thus, by selecting the best of the two approximations we are guaranteed at least a 50% reduction in the magnitude of the off-diagonal element. Finally, the choice about which approximation is the best one to use can be made without resorting to the obvious angle criterion ($\theta < \theta_0$), by observing that the fixed angle $\varphi = \pi/4$ should be selected only when it yields a magnitude reduction by a factor no larger than 0.5:

$$\left|\frac{b_{12}}{a_{12}}\right| = \frac{|a_{11} - a_{22}|}{2|a_{12}|} \le \frac{1}{2} \iff (a_{11} - a_{22})^2 \le a_{12}^2$$

These results suggest the following algorithm:

Algorithm 1 Non-trigonometric approximation of the Givens angle

    function APPROXGIVENS($a_{11}$, $a_{12}$, $a_{22}$)    (returns $(c, s)$)
        $b \leftarrow [\,a_{12}^2 < (a_{11} - a_{22})^2\,]$    ($b$ is boolean)
        $\omega \leftarrow 1/\sqrt{a_{12}^2 + (a_{11} - a_{22})^2}$
        $s \leftarrow b \,?\, \omega a_{12} : \sqrt{0.5}$    ($\sqrt{0.5} = \sin(\pi/4)$)
        $c \leftarrow b \,?\, \omega(a_{11} - a_{22}) : \sqrt{0.5}$    ($\sqrt{0.5} = \cos(\pi/4)$)
        return $(c, s)$
    end function

2.4 Approximate Givens rotation using quaternions

A 3 × 3 rotation matrix can be equivalently encoded as a quaternion $(a, b, c, d) = (\cos(\theta/2), \sin(\theta/2)\,\mathbf{v})$, where $\theta$ is the angle of rotation, and $\mathbf{v} = (v_x, v_y, v_z)$ is the normalized axis of rotation. In particular, a 3 × 3 Givens rotation with $(p,q) = (1,2)$, i.e. a rotation of the top-leftmost 2 × 2 submatrix, has the matrix

$$\begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

and the equivalent quaternion representation $(\cos(\theta/2), 0, 0, \sin(\theta/2))$. The added benefit of this representation is that the quaternion does not need to be normalized, i.e. the quaternion $(\gamma\cos(\theta/2), 0, 0, \gamma\sin(\theta/2))$, where $\gamma \in \mathbb{R}$, is just as valid as a representation of this rotation. This suggests that we could mimic the asymptotic approximation $\tan(2\theta) \approx 2\tan\theta$ of the previous section to obtain $\tan(2\theta) \approx 4\tan(\theta/2)$. This suggests the following expression for the approximate Givens angle $\varphi$:

$$4\tan(\varphi/2) = \tan(2\theta) = \frac{2a_{12}}{a_{11} - a_{22}} \qquad (8)$$

Consequently,

$$\tan(\varphi/2) = \frac{\sin(\varphi/2)}{\cos(\varphi/2)} = \frac{a_{12}}{2(a_{11} - a_{22})} \;\Rightarrow\; \begin{cases} \sin(\varphi/2) = \omega\, a_{12} \\ \cos(\varphi/2) = 2\omega\,(a_{11} - a_{22}) \\ \omega = 1/\sqrt{a_{12}^2 + [2(a_{11} - a_{22})]^2} \end{cases} \qquad (9)$$

Thus, we can represent this rotation with the quaternion

$$\left(2\omega(a_{11} - a_{22}),\; 0,\; 0,\; \omega a_{12}\right) \qquad (10)$$

We note that this quaternion representation is equally acceptable and accurate, regardless of the value of the scale factor $\omega$. In fact, even without any scaling at all, the quaternion $(2(a_{11} - a_{22}), 0, 0, a_{12})$ would, from a theoretical standpoint, be a perfectly accurate representation of this rotation. The quaternions of subsequent Jacobi iterations can be multiplicatively combined without normalization, and even the conjugation $Q^T A Q$ can be computed using un-normalized quaternions, yielding a result that is identical to using explicit orthogonal matrices, up to a global scaling of the resulting matrix. This scaling would need to be corrected just once, at the end of the sequence of Jacobi iterations, by normalizing just once the quaternion that combines all Jacobi rotations, and repeating the conjugation one last time. In our case, we would not even perform this last conjugation, since we do not expect to obtain the matrix $\Sigma$ from the solution of the symmetric eigenproblem, but instead compute $\Sigma$ directly from the Givens QR factorization of $AV = U\Sigma$ (the product $AV$ can be computed using the once-normalized quaternion corresponding to matrix $V$).

In practice, we do have a motive to perform some normalization of the Givens quaternion, to avoid the risk of overflow or underflow after a large number of Jacobi iterations. We only ever perform a small, fixed number of Jacobi iterations, so even if each quaternion is scaled only to within a modest constant factor of a normalized quaternion, any risk of overflow or underflow is eliminated. We found that a natural (and very inexpensive) way to perform this normalization is to compute the scale factor $\omega$ in equation (9) using the inexact Reciprocal-Square-Root function that is built-in and very efficient on most modern processors. For example, the SSE RSQRTPS instruction has a small relative error (Intel documents a bound of $1.5 \cdot 2^{-12}$) while having a latency that is only a small multiple (up to about 3x) of that of a standard packed multiply or add, which is much less than an exact x87 square root, or even a reciprocal computation. For reasons already explained, the accuracy of the symmetric eigenanalysis is in no way affected by the inaccuracy of this operation, and even a higher relative error would not matter, as long as overflow and underflow are averted.

For the purposes of evaluating the magnitude reduction factor of the off-diagonal element, it is appropriate to assume exact normalization, since any residual scaling would simply affect the entire matrix and would be corrected once at the end of the process. Once again, we have:

$$\left|\frac{b_{12}}{a_{12}}\right| = \left|\frac{\sin(2\varphi - 2\theta)}{\sin(2\theta)}\right| \overset{\text{Eq.(8)}}{=} \left|\frac{\sin\left(4\arctan(\tan(2\theta)/4) - 2\theta\right)}{\sin(2\theta)}\right| \qquad (11)$$

Figure 2: Approximating the trigonometric Givens factors using $\tan(2\theta) \approx 4\tan(\theta/2)$ (or by setting a fixed angle $\theta = \pi/4$). The off-diagonal magnitude reduction fraction $|b_{12}/a_{12}|$ is plotted on the vertical axis, as a function of the optimal Givens angle.

Figure 2 compares the off-diagonal magnitude reduction fraction obtained by the quaternion approximation of equation (9) with the previously discussed fixed choice of $\varphi = \pi/4$. Equating expressions (7) and (11) we obtain that the two curves intersect at $\theta = \arctan(4\tan(\pi/8))/2 \approx 0.51388$ (this is the leftmost intersection in figure 2), while the off-diagonal magnitude reduction

fraction at this point is $\cot(\pi/8)/4 \approx 0.60355$. Therefore, by choosing the best option between the approximate Givens quaternion (10), or the fixed angle $\varphi = \pi/4$, the off-diagonal magnitude is guaranteed to drop to about 60% of its previous value or less, a slightly weaker guarantee than the 50% of the previous section. Nevertheless, this is strictly the worst-case scenario, and both approximations become much more accurate after just a few Jacobi iterations once the matrix is brought closer to a diagonal form. As in section 2.3, the choice whether to use the asymptotic approximation, or the fixed angle $\varphi = \pi/4$ can be made without checking angles or trigonometric quantities; the fixed angle should be used when it achieves a better residual reduction than the maximum value $\cot(\pi/8)/4 \approx 0.60355$:

$$\left|\frac{b_{12}}{a_{12}}\right| = \frac{|a_{11} - a_{22}|}{2|a_{12}|} \le \frac{\cot(\pi/8)}{4} = \frac{1 + \sqrt{2}}{4} \iff [2(a_{11} - a_{22})]^2 \le (3 + 2\sqrt{2})\,a_{12}^2$$

The final algorithm becomes:

Algorithm 2 Computation of approximate Givens quaternion

    const $\gamma \leftarrow 3 + 2\sqrt{2}$
    const $c_* \leftarrow \cos(\pi/8)$
    const $s_* \leftarrow \sin(\pi/8)$
    function APPROXGIVENSQUATERNION($a_{11}$, $a_{12}$, $a_{22}$)
        $c_h \leftarrow 2(a_{11} - a_{22})$    ($c_h \approx \cos(\theta/2)$, up to scale)
        $s_h \leftarrow a_{12}$    ($s_h \approx \sin(\theta/2)$, up to scale)
        $b \leftarrow [\,\gamma s_h^2 < c_h^2\,]$    ($b$ is boolean)
        $\omega \leftarrow \text{RSQRT}(c_h^2 + s_h^2)$    ($\text{RSQRT}(x) \approx 1/\sqrt{x}$)
        $c_h \leftarrow b \,?\, \omega c_h : c_*$
        $s_h \leftarrow b \,?\, \omega s_h : s_*$
        return $(c_h, 0, 0, s_h)$    (returns a quaternion)
    end function

Note that algorithm 2 corresponds to a Jacobi rotation with $(p,q) = (1,2)$. In order to rotate another pair, the inputs and the ordering of the quaternion elements are adjusted accordingly.

We finally address one implementation detail: it may be more efficient (from an implementation standpoint) to compute the elements of the actual rotation matrix $Q$ before performing the actual conjugation, rather than using the quaternion itself. The (unscaled) corresponding rotation matrix is:

$$Q_{\text{unscaled}} = \begin{pmatrix} c_h^2 - s_h^2 & -2s_h c_h & 0 \\ 2s_h c_h & c_h^2 - s_h^2 & 0 \\ 0 & 0 & c_h^2 + s_h^2 \end{pmatrix} = (c_h^2 + s_h^2)\begin{pmatrix} \cos\varphi & -\sin\varphi & 0 \\ \sin\varphi & \cos\varphi & 0 \\ 0 & 0 & 1 \end{pmatrix} = (c_h^2 + s_h^2)\,Q \qquad (12)$$

3 Sorting the singular values

Once the orthogonal factor $V$ has been computed, we can obtain an expression for the product of $U$ and $\Sigma$ as

$$B := U\Sigma = U\Sigma V^T V = AV .$$

Note that the last expression is the one actually used to evaluate $B$. Since $B = U\Sigma$, this matrix is simply the result of scaling each column of the orthogonal factor $U$ with the respective singular value (i.e. the respective diagonal element of $\Sigma$). Consequently, the magnitude of each singular value in $\Sigma$ can be computed by simply evaluating the 2-norm of the respective column of $B$.

We previously stated that our algorithm will be required to produce a diagonal matrix $\Sigma$ where the singular values along the diagonal are sorted in decreasing order of magnitude. This ordering is not merely an arbitrary convention, but will also benefit the QR factorization explained later in section 4. We shall enforce this property by reordering the columns of $B$ in decreasing order of their 2-norm (which will induce the same ordering in the diagonal entries of $\Sigma$, as discussed) and also apply the same permutation to the columns of $V$ at the same time. In order to prove that such a transformation is allowed, consider the individual columns of $B = [\,b_1\; b_2\; b_3\,]$ and $V = [\,v_1\; v_2\; v_3\,]$ respectively. Since $A = BV^T$, we have:

$$A = \sum_{i=1}^{3} b_i v_i^T$$

Thus, if the same permutation is applied to the columns of matrices $B$ and $V$, the matrix $A$ reconstructed as their product remains unaffected. Note that it is also possible to simultaneously negate a corresponding pair of columns $b_i$ and $v_i$ without affecting the validity of the decomposition. We can therefore sort the singular values by swapping pairs of columns $(b_i, b_j)$ along with their counterparts $(v_i, v_j)$ in the fashion of a bubblesort method, until the columns of $B$ appear in decreasing order of their 2-norm. Note that simply swapping two columns of $V$ will flip the sign of its determinant, violating the property that $V$ is a true rotation matrix; instead, we also negate one of the two columns being swapped (both for $V$ and the respective column in $B$), which will keep $V$ as a true rotation. The entire process is summarized in the following pseudocode:

Algorithm 3 Singular value sort in decreasing magnitude order

    procedure CONDSWAP($c$, $X$, $Y$)    ($c$ is boolean)
        $Z \leftarrow X$    ($Z$ is a temporary variable)
        $X \leftarrow c \,?\, Y : X$
        $Y \leftarrow c \,?\, Z : Y$
    end procedure

    procedure CONDNEGSWAP($c$, $X$, $Y$)    ($c$ is boolean)
        $Z \leftarrow -X$    ($Z$ is a temporary variable)
        $X \leftarrow c \,?\, Y : X$
        $Y \leftarrow c \,?\, Z : Y$
    end procedure

    procedure SORTSINGULARVALUES($b_1$, $b_2$, $b_3$, $v_1$, $v_2$, $v_3$)
        $\rho_1 \leftarrow \|b_1\|_2^2$, $\rho_2 \leftarrow \|b_2\|_2^2$, $\rho_3 \leftarrow \|b_3\|_2^2$
        $c \leftarrow [\,\rho_1 < \rho_2\,]$    ($c$ is boolean)
        CONDNEGSWAP($c$, $b_1$, $b_2$); CONDNEGSWAP($c$, $v_1$, $v_2$)
        CONDSWAP($c$, $\rho_1$, $\rho_2$)
        $c \leftarrow [\,\rho_1 < \rho_3\,]$
        CONDNEGSWAP($c$, $b_1$, $b_3$); CONDNEGSWAP($c$, $v_1$, $v_3$)
        CONDSWAP($c$, $\rho_1$, $\rho_3$)
        $c \leftarrow [\,\rho_2 < \rho_3\,]$
        CONDNEGSWAP($c$, $b_2$, $b_3$); CONDNEGSWAP($c$, $v_2$, $v_3$)
    end procedure

Lastly, we recall that in section 2 the rotation matrix $V$ was in fact constructed as a quaternion $q_V = (s, x, y, z)$. For the purposes of the current section, we could either convert this representation to an explicit 3 × 3 matrix, or simply compute the matrix $B = AV$ by rotating each row vector of $A$ with the conjugate quaternion $q_V^*$. However, if we need to produce $V$ in quaternion form at the end of the SVD algorithm, it would be inconvenient to convert back and forth between matrix and quaternion representations only so that the previously defined procedure CONDNEGSWAP could be

applied to a matrix representation of $V$. Fortunately, this operation can also be expressed by a simple quaternion. In particular, note that a function call CONDNEGSWAP(true, $v_1$, $v_2$) is equivalent to replacing $V$ with $VR$, where

$$R = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

which is a rotation matrix, with corresponding (un-normalized) quaternion $q_R = (1, 0, 0, 1)$. In the case we want to make the call to CONDNEGSWAP conditional on the variable $c$, the permutation quaternion is simply $q_R = (1, 0, 0, c)$, assuming that $c$ takes a binary value of either 0 or 1. The quaternion corresponding to the product $VR$ will then simply be $q_V q_R$ (which, notably, requires only 4 additions or subtractions). The same logic can be followed to emulate the action of CONDNEGSWAP on other pairs of columns of $V$, while operating purely on its quaternion representation.

4 Computation of the factors U and Σ

We previously constructed the matrix $B = AV$ and explained that it is equal to the product $U\Sigma$ of the two remaining unknown components of the SVD. In the last phase of our algorithm we will compute the individual factors $U$ and $\Sigma$ from the matrix $B$.

4.1 Extracting U and Σ via QR decomposition

The matrix $B = U\Sigma$ is essentially a column scaling of $U$ by the respective diagonal entries of $\Sigma$. Thus, a seemingly straightforward method for computing the orthogonal matrix $U$ would be to simply rescale each column vector so that it has unit norm. However, this procedure cannot be used in the case of a zero singular value; moreover, even when a singular value is nonzero yet orders of magnitude smaller than the other values, this normalization may produce a matrix $U$ that is far from orthogonal. Intuitively, this loss of orthogonality is due in part to the fact that, when a column with very small entries is multiplied with a large number to convert it to a unit vector, any numerical errors will be greatly amplified. These issues are exacerbated in the case where more than one of the singular values is equal to zero.

Our approach guarantees the orthogonality of the computed matrix $U$ and is based on the QR factorization of $B$ using Givens rotations. We start by showing the following lemma, for a general dimension of the SVD (i.e. potentially larger than the 3 × 3 case):

Lemma 1. Let $U$ be an orthonormal $n \times n$ matrix and $\Sigma$ a diagonal matrix of the same dimensions. Let $QR = U\Sigma$ be the (not necessarily unique) QR-factorization of the product $U\Sigma$, where $Q$ is orthogonal and $R$ is upper triangular. If the nonzero diagonal elements of $\Sigma$ appear before any zero entries (i.e. if $\Sigma$ has $k$ nonzero entries and $[\Sigma]_{ii} \neq 0$ for $1 \le i \le k$, while $[\Sigma]_{ii} = 0$ for $k+1 \le i \le n$), then the following statements hold true:

Part 1. If $Q = [\,q_1\; q_2\; \cdots\; q_n\,]$, $U = [\,u_1\; u_2\; \cdots\; u_n\,]$, $r_{ij} := [R]_{ij}$, and $\Sigma = \text{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$, the following statements are true when $i \in [1, k]$:
- $q_i = \pm u_i$
- $r_{ii} = \pm\sigma_i$ (with the same sign as the identity above)
- $r_{ij} = 0$ for any $j \neq i$

Part 2. The factor $R$ is in fact diagonal.

Proof. (Part 1) The $i$-th column of the matrix equation $U\Sigma = QR$ is written as follows:

$$\sigma_i u_i = \sum_{k=1}^{i} r_{ki}\, q_k$$

We will prove the combination of the 3 properties by induction on $i$. For $i = 1$, we have:

$$\sigma_1 u_1 = r_{11} q_1 \;\Rightarrow\; \|\sigma_1 u_1\|_2 = \|r_{11} q_1\|_2 \;\Rightarrow\; |\sigma_1| = |r_{11}| \;\Rightarrow\; r_{11} = \pm\sigma_1$$

and, from the first equation (since $\sigma_1 \neq 0$), we also have $q_1 = \pm u_1$ (with the same sign as in the identity $r_{11} = \pm\sigma_1$). Also, let $j > 1$. We have:

$$\sigma_j u_j = \sum_{k=1}^{j} r_{kj}\, q_k \;\Rightarrow\; u_1^T(\sigma_j u_j) = u_1^T\left(\sum_{k=1}^{j} r_{kj}\, q_k\right) \;\Rightarrow\; \sigma_j\, u_1^T u_j = \sum_{k=1}^{j} r_{kj}\, u_1^T q_k$$

Since $j \neq 1$, we have $u_1^T u_j = 0$. Also, we previously showed that $q_1 = \pm u_1$, thus $u_1^T q_k = \pm\delta_{1k}$ ($\delta_{ij}$ is the Kronecker delta). Combining these results with the last equation we get:

$$0 = \sum_{k=1}^{j} r_{kj}(\pm\delta_{1k}) = \pm r_{1j}$$

For the induction step $i \to i+1$ we have:

$$\sigma_{i+1} u_{i+1} = \sum_{k=1}^{i+1} r_{k,i+1}\, q_k = \underbrace{\sum_{k=1}^{i} r_{k,i+1}\, q_k}_{=\,0 \text{ (induction)}} + r_{i+1,i+1}\, q_{i+1} = r_{i+1,i+1}\, q_{i+1}$$

Taking the 2-norm of this equation yields, as before, $r_{i+1,i+1} = \pm\sigma_{i+1}$ and $q_{i+1} = \pm u_{i+1}$. Similarly, for $j > i+1$ we have

$$\sigma_j u_j = \sum_{k=1}^{j} r_{kj}\, q_k \;\Rightarrow\; u_{i+1}^T(\sigma_j u_j) = u_{i+1}^T\left(\sum_{k=1}^{j} r_{kj}\, q_k\right) \;\Rightarrow\; \sigma_j\, u_{i+1}^T u_j = \sum_{k=1}^{j} r_{kj}\, u_{i+1}^T q_k \;\Rightarrow\; 0 = \sum_{k=1}^{j} r_{kj}(\pm\delta_{i+1,k}) = \pm r_{i+1,j}$$

which completes our proof of Part 1.

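(Part 2) According to the properties proven in Part 1, the matrix $R$ has the structure

$$R = \begin{pmatrix} D & 0 \\ 0 & \hat{R} \end{pmatrix}, \quad \text{where } D = \begin{pmatrix} \pm\sigma_1 & & \\ & \ddots & \\ & & \pm\sigma_k \end{pmatrix}$$

and $\hat{R}$ is an upper triangular matrix of size $(n-k) \times (n-k)$. Therefore, the system $U\Sigma = QR$ is written as

$$U\Sigma = Q\begin{pmatrix} D & 0 \\ 0 & \hat{R} \end{pmatrix}$$

Since $\sigma_{k+1} = \cdots = \sigma_n = 0$, the last $(n-k)$ columns of this matrix equation are written as:

$$0 = Q\begin{pmatrix} 0 \\ \hat{R} \end{pmatrix}$$

The matrix $Q$ is nonsingular, thus the last equation implies that $\hat{R} = 0$, so that $R$ is the purely diagonal matrix $\begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix}$.

This lemma indicates that the QR decomposition can be used to factorize $B$ into an orthogonal matrix (taken as the factor $U$) and a diagonal matrix which will play the role of $\Sigma$. The condition that nonzero singular values need to precede those equal to zero (we achieve this in our case by the sorting process in section 3) is absolutely essential. Consider, for instance, the counter example of a 2 × 2 system $U\Sigma = QR$ with the following values:

$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$$

The factorization on the right is a perfectly valid QR decomposition, yet $R$ is neither diagonal, nor does it approximate $\Sigma$ in any way. By performing a strict sort, rather than a simple separation of zero/nonzero singular values, our methodology is robust to situations where a singular value (respectively, the norm of a column of $B$) is nonzero, yet much smaller than the magnitude of some other, larger singular value. Finally, we note that this general theory does not guarantee any particular sign for the diagonal elements in the factor $R$; the convention presented in section 1 will be a consequence of the methodology (Givens rotations) which we employ to compute the QR decomposition.

4.2 Givens QR factorization

We shall use the method of Givens rotations to compute the QR factorization, due to the simplicity of its fundamental operations and the fact that it is guaranteed to produce a true rotation matrix $Q$. In contrast, a Gram-Schmidt procedure would require significant attention to produce a true rotation matrix, especially in the presence of small (or zero) singular values. The Householder scheme would also be an option, albeit one that requires more complex steps, and care needs to be taken due to the fact that it operates by constructing orthogonal reflections rather than true rotations.

For a general $n \times n$ matrix $B$, the method of Givens rotations constructs the triangular factor $R$ by annihilating the elements below the diagonal one-by-one, in column-major lexicographical order, i.e. $(2,1), (3,1), \ldots, (n,1), (3,2), (4,2), \ldots, (n,n-1)$. The $(i,j)$ element is annihilated by left-multiplying the result of the previous operations with a Givens matrix $Q(i,j,\theta_{ij})^T$, as follows:

$$Q(n,n-1,\theta_{n,n-1})^T \cdots Q(3,1,\theta_{31})^T Q(2,1,\theta_{21})^T B = R \;\Rightarrow\; Q^T B = R \;\Rightarrow\; B = QR$$

where $Q = Q(2,1,\theta_{21})\,Q(3,1,\theta_{31}) \cdots Q(n,n-1,\theta_{n,n-1})$. Due to the specific order in which the elements below the diagonal of $B$ are being annihilated, every Givens rotation in this sequence will not change any of the zeroes that were introduced by the Givens rotations applied before it. Schematically, when the Givens rotation intended to annihilate element $(p,q)$, with $p > q$, is ready to be applied, the following transformation takes place:

$$Q(p,q,\theta_{pq})^T \begin{pmatrix} \ddots & \vdots & \vdots & & \vdots \\ & a_{qq} & a_{q,q+1} & \cdots & a_{qn} \\ & 0 & a_{q+1,q+1} & \cdots & a_{q+1,n} \\ & \vdots & \vdots & & \vdots \\ & 0 & a_{p-1,q+1} & \cdots & a_{p-1,n} \\ & a_{pq} & a_{p,q+1} & \cdots & a_{pn} \\ & a_{p+1,q} & a_{p+1,q+1} & \cdots & a_{p+1,n} \\ & \vdots & \vdots & & \vdots \\ & a_{nq} & a_{n,q+1} & \cdots & a_{nn} \end{pmatrix} = \begin{pmatrix} \ddots & \vdots & \vdots & & \vdots \\ & a'_{qq} & a'_{q,q+1} & \cdots & a'_{qn} \\ & 0 & a_{q+1,q+1} & \cdots & a_{q+1,n} \\ & \vdots & \vdots & & \vdots \\ & 0 & a_{p-1,q+1} & \cdots & a_{p-1,n} \\ & 0 & a'_{p,q+1} & \cdots & a'_{pn} \\ & a_{p+1,q} & a_{p+1,q+1} & \cdots & a_{p+1,n} \\ & \vdots & \vdots & & \vdots \\ & a_{nq} & a_{n,q+1} & \cdots & a_{nn} \end{pmatrix}$$

As seen in the last equation, only rows $p$ and $q$ are affected, and only from the $q$-th column onwards. We can also see that this Givens rotation will succeed in annihilating the element $a_{pq}$ if and only if

$$\begin{pmatrix} \cos\theta_{pq} & \sin\theta_{pq} \\ -\sin\theta_{pq} & \cos\theta_{pq} \end{pmatrix}\begin{pmatrix} a_{qq} \\ a_{pq} \end{pmatrix} = \begin{pmatrix} a'_{qq} \\ 0 \end{pmatrix}$$

This property can be enforced by simply selecting:

$$\cos\theta_{pq} = \frac{a_{qq}}{\sqrt{a_{qq}^2 + a_{pq}^2}}, \qquad \sin\theta_{pq} = \frac{a_{pq}}{\sqrt{a_{qq}^2 + a_{pq}^2}} \qquad (13)$$

We also observe that after applying this rotation, the value $a'_{qq} = \sqrt{a_{qq}^2 + a_{pq}^2}$ will be non-negative. As a consequence, if the Givens rotations are constructed in this fashion, at the end of the sequence of rotations all diagonal elements of the resulting matrix $R$, with the exception of the very last one, will be non-negative. This property satisfies the last convention we had adopted in section 1 for the sign of the diagonal elements of $\Sigma$.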
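The column-major Givens sweep and the choice of equation (13) can be illustrated with a minimal Python sketch. It is not the authors' SIMD implementation: the function names `givens_qr` and `rotation_zx` are hypothetical, plain lists stand in for matrices, and a pair that is exactly zero is simply skipped, which is an assumption made here for brevity.

```python
import math

def givens_qr(B):
    """Return (Q, R) with B = Q R, annihilating sub-diagonal entries in
    column-major order with rotations chosen as in equation (13)."""
    n = len(B)
    R = [list(row) for row in B]
    Q = [[float(i == j) for j in range(n)] for i in range(n)]
    for q in range(n - 1):              # column currently being cleared
        for p in range(q + 1, n):       # row of the entry a_pq to annihilate
            norm = math.hypot(R[q][q], R[p][q])
            if norm == 0.0:
                continue                # assumption: skip an exactly-zero pair
            c = R[q][q] / norm          # cos(theta_pq), equation (13)
            s = R[p][q] / norm          # sin(theta_pq), equation (13)
            for j in range(n):
                # Left-multiply R by the transposed rotation (rows q and p).
                rq, rp = R[q][j], R[p][j]
                R[q][j] = c * rq + s * rp
                R[p][j] = c * rp - s * rq
                # Right-multiply Q by the rotation (columns q and p).
                uq, up = Q[j][q], Q[j][p]
                Q[j][q] = c * uq + s * up
                Q[j][p] = c * up - s * uq
    return Q, R

def rotation_zx(alpha, beta):
    """Rz(alpha) * Rx(beta): a true rotation, used only to build sample input."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    return [[ca, -sa * cb, sa * sb],
            [sa, ca * cb, -ca * sb],
            [0.0, sb, cb]]

# Sample input: B = U * Sigma with singular values sorted by magnitude.
U = rotation_zx(0.3, 1.1)
sigma = [3.0, 2.0, -0.5]
B = [[U[i][j] * sigma[j] for j in range(3)] for i in range(3)]
Q, R = givens_qr(B)
```

For this input, `R` should come out diagonal with the entries of `sigma` (non-negative except possibly the last one) and `Q` should reproduce `U`, in line with Lemma 1 and the sign convention of section 1.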

A special case that needs to be addressed occurs when both of $a_{qq}$ and $a_{pq}$ are either zero, or extremely small. In this case, the normalization required to obtain the trigonometric factors $\cos\theta_{pq}$ and $\sin\theta_{pq}$ can lead to division by zero (or significant loss of accuracy, at the very least). We detect this case by checking if $a_{qq}^2 + a_{pq}^2 < \epsilon$ for a specified threshold $\epsilon$ (in the same order of magnitude as our tolerance for errors in the singular values). When this special case is detected, we set instead:

$$\cos\theta_{pq} = \text{signum}(a_{qq}), \qquad \sin\theta_{pq} = 0$$

These values will still guarantee that $a'_{qq} \ge 0$ and that, ultimately, the first $n-1$ singular values in $\Sigma$ will be non-negative.

4.3 Quaternion implementation of Givens QR

We conclude by illustrating a methodology that generates the Givens rotations directly in quaternion form; we would utilize this approach if, for the purposes of a given application, it is preferable to compute the rotations $U$ and $V$ as quaternions. Although it is certainly possible to convert the 3 × 3 rotation matrix $U$ to a quaternion as a post-process, it is preferable to construct the rotations as quaternions in the first place. Doing so will avoid the explicit matrix-to-quaternion conversion, a procedure that needs to consider a number of different cases, and is not optimally structured for aggressive SSE optimizations.

We will describe the methodology in the context of the first matrix $Q(1,2,\theta_{12})$ from the sequence of Givens rotations used to compute the QR factorization; the remaining rotations will be constructed in an analogous fashion. The matrix representation of this rotation is:

$$Q(1,2,\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

where we dropped the subscripts in the angle $\theta_{12}$ for simplicity. In order for the operation $Q(1,2,\theta)^T B$ to annihilate element $b_{21}$ the following condition must hold, based on equation (13):

$$-\sin\theta\; b_{11} + \cos\theta\; b_{21} = 0$$

or, more generally, for the Givens rotation designed to annihilate element $b_{pq}$ we will require:

$$-\sin\theta\; a_1 + \cos\theta\; a_2 = 0 \qquad (14)$$

where $a_1$ denotes the pivot element on the diagonal (this is element $b_{qq}$ of the matrix being rotated), and $a_2$ is the matrix entry to be eliminated (or, $b_{pq}$). The same rotation can alternatively be represented by an (unnormalized) quaternion $q$:

$$q = (c_h, 0, 0, s_h) = \left(\gamma\cos\tfrac{\theta}{2},\, 0,\, 0,\, \gamma\sin\tfrac{\theta}{2}\right)$$

where $\gamma$ is an arbitrary scaling factor. From equation (14) we get:

$$\frac{\sin\theta}{\cos\theta} = \tan\theta = \frac{2\tan\frac{\theta}{2}}{1 - \tan^2\frac{\theta}{2}} = \frac{a_2}{a_1} \qquad (15)$$

Equation (15) is essentially a quadratic equation in $\tan\frac{\theta}{2}$. The two solutions of this quadratic are:

$$\frac{s_h}{c_h} = \tan\frac{\theta}{2} = \frac{-a_1 \pm \sqrt{a_1^2 + a_2^2}}{a_2} \qquad (16)$$

Since the quaternion scale factor $\gamma$ is irrelevant, we are free to simply choose $c_h = a_2$ and $s_h = -a_1 \pm \sqrt{a_1^2 + a_2^2}$ (with either sign). Regardless of the sign chosen in the formula for $s_h$, in theory both of these values will generate a Givens rotation that successfully annihilates the intended matrix entry. However, we need to pay attention to the following issues:

- One of the choices for $s_h$ may be prone to catastrophic cancellation and loss of accuracy. For example, if $a_1 > 0$ is much larger than $|a_2|$, the operation $-a_1 + \sqrt{a_1^2 + a_2^2}$ will lose accuracy, as it is subtracting the finite precision representations of near-identical quantities.
- We need to ensure that after the Givens rotation, the resulting pivot element $a'_{qq}$ is non-negative, per our convention in section 1.

With a simple case study (which will be omitted here, in the interest of terseness) the best choices for $c_h$ and $s_h$ are determined to be:

- If $a_1 < 0$, then $c_h = a_2$ and $s_h = -a_1 + \sqrt{a_1^2 + a_2^2}$.
- If $a_1 > 0$, then $c_h = a_1 + \sqrt{a_1^2 + a_2^2}$ and $s_h = a_2$.

For the case $a_1 > 0$ it may be initially unclear how these values relate to the solution of equation (15). However, we know that one of the admissible solutions is:

$$\frac{s_h}{c_h} = \frac{-a_1 + \sqrt{a_1^2 + a_2^2}}{a_2} = \frac{\left(-a_1 + \sqrt{a_1^2 + a_2^2}\right)\left(a_1 + \sqrt{a_1^2 + a_2^2}\right)}{a_2\left(a_1 + \sqrt{a_1^2 + a_2^2}\right)} = \frac{a_2^2}{a_2\left(a_1 + \sqrt{a_1^2 + a_2^2}\right)} = \frac{a_2}{a_1 + \sqrt{a_1^2 + a_2^2}}$$

from which the formulas for the case $a_1 > 0$ are derived. Finally, we need to establish that after the constructed Givens rotation has been applied, the value of $a'_{qq}$ will be positive. Noting that $s_h = \gamma\sin(\theta/2)$ and $c_h = \gamma\cos(\theta/2)$, we define:

$$c := c_h^2 - s_h^2 = \gamma^2\left(\cos^2\tfrac{\theta}{2} - \sin^2\tfrac{\theta}{2}\right) = \gamma^2\cos\theta$$

$$s := 2 s_h c_h = 2\gamma^2\sin\tfrac{\theta}{2}\cos\tfrac{\theta}{2} = \gamma^2\sin\theta$$

Thus, we can obtain the sine and cosine of the Givens angle $\theta$ by normalizing:

$$\cos\theta = \frac{c}{\sqrt{c^2 + s^2}}, \qquad \sin\theta = \frac{s}{\sqrt{c^2 + s^2}}$$

With the assistance of these formulas, we can verify that the values chosen for $c_h$, $s_h$, either for $a_1 > 0$ or $a_1 < 0$, will ultimately yield (omitting the necessary, yet tedious algebraic reductions):

$$\cos\theta = \frac{a_1}{\sqrt{a_1^2 + a_2^2}}, \qquad \sin\theta = \frac{a_2}{\sqrt{a_1^2 + a_2^2}}$$

As a consequence,

$$a_{qq} \leftarrow \cos\theta\, a_1 + \sin\theta\, a_2 = \sqrt{a_1^2 + a_2^2} \ge 0$$

In contrast, some of the roots of equation (5) which were not used would have produced $\cos\theta = -a_1/\sqrt{a_1^2 + a_2^2}$ and $\sin\theta = -a_2/\sqrt{a_1^2 + a_2^2}$. These values would also have eliminated element $a_{pq}$, but would have produced a nonpositive diagonal element $a_{qq}$ instead. As in the Jacobi procedure described earlier, the quaternion can be converted to an (un-normalized) version of the corresponding $3 \times 3$ rotation matrix, if such a representation of the Givens rotation is desired. The entire procedure is summarized in Algorithm 4; note that the additional checks in lines 3 and 4 of the pseudocode are designed to safeguard against division by (near-)zero when both elements $a_1$ and $a_2$ are extremely small. The threshold value $\epsilon$ is set to our tolerance for the magnitude of the elements remaining below the diagonal of $R$ after the Givens procedure is concluded.

Algorithm 4 Computation of the Givens quaternion for QR factorization
 1: function QRGIVENSQUATERNION(a1, a2)
 2:     ρ ← sqrt(a1² + a2²)
 3:     s_h ← [ρ > ε] ? a2 : 0
 4:     c_h ← |a1| + max(ρ, ε)
 5:     b ← [a1 < 0]                  // b is boolean
 6:     CONDSWAP(b, s_h, c_h)         // CONDSWAP defined in Alg. 3
 7:     ω ← RSQRT(c_h² + s_h²)        // RSQRT(x) ≈ 1/sqrt(x)
 8:     c_h ← ω c_h
 9:     s_h ← ω s_h
10:     return (c_h, 0, 0, s_h)       // returns a quaternion
11: end function

Note: The quaternion representation of the rotational factors $U$ and $V$ has limited our need for an exact square root (or reciprocal square root) operation. However, such an exact normalization will be needed at least once, at the end of the SVD algorithm, to remove any accumulated scaling. In addition, Algorithm 4 calls for an exact square root in line 2. For these purposes, we found it sufficient to improve the accuracy of RSQRTPS by performing one iteration of Newton's method for the equation $f(y) = 1/y^2 - x = 0$ (the solution of this equation is exactly $1/\sqrt{x}$), as detailed in [Lomont 2003]. The resulting, more accurate versions of the square root function (and its reciprocal) are summarized in pseudocode as follows:

Algorithm 5 Improved accuracy SQRT and RSQRT
1: function ACCURATERSQRT(x)
2:     y ← RSQRT(x)
3:     y ← y (3 − x y²) / 2
4:     return y
5: end function
6: function ACCURATESQRT(x)
7:     return x · ACCURATERSQRT(x)
8: end function

5 Results and performance

We have implemented and tested a SIMD, multithreaded version of our algorithm, using explicit SSE intrinsics. The following performance measurements were captured on a 12-core/24-thread (hyperthreading enabled) 2.66GHz Intel Xeon X5650 server, using the Intel C++ compiler for Linux, version 3. We benchmarked our code on the computation of $2^{24} \approx 16.7$M decompositions of matrices with uniformly random elements, normalized such that the Frobenius norm of each input matrix is equal to one. For the purposes of this benchmark, we fixed the number of Jacobi sweeps (using our approximate, quaternion-based formulation) to a constant number of 4 iterations. Naturally, various degrees of accuracy can be obtained by using a different count of Jacobi iterations; however, for our 16.7M uniformly random, unit-normalized matrices, this number of iterations resulted in:

- The maximum magnitude among off-diagonal entries after the symmetric eigenanalysis was 4
- 99.9% of input matrices achieved a maximum off-diagonal magnitude of less than 5
- The average maximum off-diagonal magnitude across all input matrices was 3 6

This level of accuracy was deemed well appropriate for the purposes of the accompanying submission [McAdams et al. 2011].

5.1 Performance and scalability

Figure 3 illustrates the total runtime of our SVD algorithm on the sample input of $2^{24} \approx 16.7$M random matrices (of course, since our algorithm has completely fixed control flow, computation time is input-independent). We generally observed near-linear speedup between single-core and 12-core performance. Observed deviations include:

- We observed an additional 5% performance boost when moving from a 12-core/12-thread to a 12-core/24-thread setup, leveraging the hyperthreading capability of the processor. We attribute this additional acceleration to the hiding of instruction latency of our dense, explicitly vectorized code, achieved in the hyperthreading setting.
- Executions with just a single core per socket take advantage of the frequency boost of single-threaded runs, native in the Nehalem architecture.

Figure 3: Execution times (in seconds), and speedup relative to the single-thread baseline performance of our SVD algorithm, on a dual socket Intel Xeon X5650 server. Benchmark includes a total of $2^{24}$ decompositions. McNt denotes an M-core, N-thread execution.
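The accuracy side of these measurements depends on the refined reciprocal square root of Algorithm 5. The Newton step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the report's SSE code: `approx_rsqrt` is a hypothetical stand-in for the hardware RSQRTPS estimate, modeled here as the exact value with an artificial relative error of $10^{-3}$, and all function names are chosen for this example.

```python
import math


def approx_rsqrt(x):
    """Hypothetical stand-in for a fast, inexact hardware reciprocal square root.

    Assumption: the estimate carries a relative error of 1e-3.
    """
    return (1.0 / math.sqrt(x)) * (1.0 + 1e-3)


def accurate_rsqrt(x):
    """One Newton iteration for f(y) = 1/y^2 - x, starting from the estimate."""
    y = approx_rsqrt(x)
    return 0.5 * y * (3.0 - x * y * y)


def accurate_sqrt(x):
    """sqrt(x) obtained as x * (1/sqrt(x)); valid for x > 0."""
    return x * accurate_rsqrt(x)
```

If the initial estimate has relative error $e$, one Newton step leaves a relative error of roughly $1.5\,e^2$, so a single iteration is enough to take the stand-in estimate from about $10^{-3}$ to about $10^{-6}$.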

5.2 Comparison with other eigenanalysis methods

Figure 4 provides a comparison between our method and popular alternatives for solving variants of eigenvalue problems. The methods being compared include:

- Our method, with all the necessary computation overhead required to compute the rotational factors of the SVD in quaternion form. Explicitly SIMD vectorized.
- Our method, with the SVD factors being computed only in matrix (not quaternion) form. (Note that [McAdams et al. 2011] requires a slightly more expensive variant of these two options, requiring both a quaternion representation of the rotation $R = UV^T$, as well as an explicit matrix form of $V$ itself.) Explicitly SIMD vectorized.
- The symmetric eigenanalysis component only of our method. A constant four modified Jacobi sweeps are used. Explicitly SIMD vectorized.
- The symmetric eigenanalysis component only of the Polar Decomposition in [Rivers and James 2007]. No warm starts have been used; the number of Jacobi sweeps is fixed to three, which produces an average accuracy comparable to 4 sweeps of our modified Jacobi procedure. Scalar implementation only (multithreading used without vectorization).
- Computation of eigenvalues of a symmetric $3 \times 3$ matrix, using a closed-form solution [Smith 1961]. Scalar implementation only.
- A quaternion-based implementation of the Jacobi procedure for the $3 \times 3$ symmetric eigenanalysis. Scalar implementation only.

[Figure 4 table. Columns: Method; Time per decomposition (ns) for 1 core, 1 thread; Time per decomposition (ns) for 12 core, 24 thread. Rows: Complete SVD computation with rotations in quaternion form (our method, 4-wide SIMD); Complete SVD computation with rotations in matrix form (our method, 4-wide SIMD); Symmetric eigenanalysis only (our method, 4-wide SIMD); Symmetric eigenanalysis only ([Rivers and James 2007], scalar); Closed form eigenvalue computation only (scalar); Computation of diagonalizing quaternion (scalar).]

Figure 4: Comparison of various algorithms for $3 \times 3$ eigenanalysis tasks. Single-threaded and 12-core/24-thread times are given, normalized to the time required for every individual decomposition. Note that some of the methods may not be directly comparable; refer to the text for a discussion of differences and assumptions.

It should be
noted that these performance numbers cannot be taken as absolute and definitive measures of the superiority of an individual algorithm, since a number of factors have to be considered before accepting these figures as commensurate. Namely:

- Many variants only address the symmetric eigenanalysis problem, instead of the entire SVD (note that for our algorithm we need not only the polar decomposition, but the factor $V$ of the SVD as well). In order to allow for a fairer comparison with these methods, we conducted comparisons with a prefix of our method that stops when the symmetric eigenanalysis has been computed.
- Instead of relying on published performance figures, we reran the best implementations of these techniques we could find, with the same machine/compiler/optimization settings used for our code. Also, we multithreaded many of these algorithms to give them the same benefit of parallel execution (including the latency-hiding features of hyperthreading, when available).
- For some of these alternative algorithms (or parts thereof) we have a reasonable expectation of SIMD potential. When comparing our approach to these methods, one should normalize to the same vector width. Note however that this SIMD potential may often NOT apply to the entire SVD process, but only to a fraction of it (e.g. the symmetric eigenanalysis).
- Stopping criteria. Some alternative algorithms iterate until a certain criterion has been satisfied (e.g. the maximum off-diagonal element has been reduced below a certain threshold). Instead, in our approach we chose to implement a fixed number of Jacobi sweeps. The reason for this choice is that when using SSE/SIMD the iteration cannot be conveniently stopped for only some of the decompositions that are packed into a SIMD sequence. We previously explained why our choice of 4 sweeps is a reasonable one.

Perhaps the most important differentiating factor is the following: when attempting to implement alternative SVD methods as part of an end-to-end system such as [McAdams et al. 2011], we realized that certain alternatives were simply not acceptable for the purposes of specific applications. A simple example is the
FastLSM-type decomposition [Rivers and James 2007], which computes the factor $S$ of $F = RS$ using a Jacobi symmetric eigenanalysis, and then forms $R$ as $R = FS^{-1}$. The way $S$ is constructed, it is always a positive definite matrix; thus, in the presence of inversion, where often $\det(F) < 0$, the produced polar decomposition will yield a factor $R$ that contains a reflection (i.e. $\det(R) = -1$). In cases with near-zero singular values, the produced factor $R$ may severely lack orthogonality as well. See [McAdams et al. 2011] for further discussion of this issue.

References

GOLUB, G., AND VAN LOAN, C. 1989. Matrix Computations. The Johns Hopkins University Press.

LOMONT, C. 2003. Fast inverse square root. Purdue University.

MCADAMS, A., ZHU, Y., SELLE, A., EMPEY, M., TAMSTORF, R., TERAN, J., AND SIFAKIS, E. 2011. Efficient elasticity for character skinning with contact and collisions. ACM Trans. Graph.

RIVERS, A., AND JAMES, D. 2007. FastLSM: fast lattice shape matching for robust real-time deformation. ACM Trans. Graph. (SIGGRAPH Proc.) 26, 3.

SMITH, O. K. 1961. Eigenvalues of a symmetric 3x3 matrix. Commun. ACM 4 (April), 168.
