Matrix Recipes

Javier R. Movellan

December 28, 2006

Copyright © 2004 Javier R. Movellan
1 Definitions

Let x, y be matrices of order m × n and p × q respectively, i.e.,

x = \begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{m1} & \cdots & x_{mn} \end{pmatrix}, \qquad
y = \begin{pmatrix} y_{11} & \cdots & y_{1q} \\ \vdots & & \vdots \\ y_{p1} & \cdots & y_{pq} \end{pmatrix}   (1)

We refer to a matrix of order 1 × 1 as a scalar, and to a matrix of order m × 1 with m > 1 as a vector. We let f be a generic function of the form y = f(x). If x is a vector and y is a scalar we say that f is a scalar function of a vector. If x is a matrix and y a vector, we say that f is a vector function of a matrix, and so on.

General Definition: Derivative of a matrix function of a matrix

\frac{dy}{dx} \overset{\text{def}}{=}
\begin{pmatrix} \frac{\partial y}{\partial x_{11}} & \cdots & \frac{\partial y}{\partial x_{1n}} \\ \vdots & & \vdots \\ \frac{\partial y}{\partial x_{m1}} & \cdots & \frac{\partial y}{\partial x_{mn}} \end{pmatrix}   (2)

where

\frac{\partial y}{\partial x_{ij}} \overset{\text{def}}{=}
\begin{pmatrix} \frac{\partial y_{11}}{\partial x_{ij}} & \cdots & \frac{\partial y_{1q}}{\partial x_{ij}} \\ \vdots & & \vdots \\ \frac{\partial y_{p1}}{\partial x_{ij}} & \cdots & \frac{\partial y_{pq}}{\partial x_{ij}} \end{pmatrix}   (3)

Thus the derivative of a p × q matrix y with respect to an m × n matrix x is an (mp) × (nq) matrix of the following form

\frac{dy}{dx} =
\begin{pmatrix}
\frac{\partial y_{11}}{\partial x_{11}} & \cdots & \frac{\partial y_{1q}}{\partial x_{11}} & \cdots & \frac{\partial y_{11}}{\partial x_{1n}} & \cdots & \frac{\partial y_{1q}}{\partial x_{1n}} \\
\vdots & & \vdots & & \vdots & & \vdots \\
\frac{\partial y_{p1}}{\partial x_{11}} & \cdots & \frac{\partial y_{pq}}{\partial x_{11}} & \cdots & \frac{\partial y_{p1}}{\partial x_{1n}} & \cdots & \frac{\partial y_{pq}}{\partial x_{1n}} \\
\vdots & & \vdots & & \vdots & & \vdots \\
\frac{\partial y_{11}}{\partial x_{m1}} & \cdots & \frac{\partial y_{1q}}{\partial x_{m1}} & \cdots & \frac{\partial y_{11}}{\partial x_{mn}} & \cdots & \frac{\partial y_{1q}}{\partial x_{mn}} \\
\vdots & & \vdots & & \vdots & & \vdots \\
\frac{\partial y_{p1}}{\partial x_{m1}} & \cdots & \frac{\partial y_{pq}}{\partial x_{m1}} & \cdots & \frac{\partial y_{p1}}{\partial x_{mn}} & \cdots & \frac{\partial y_{pq}}{\partial x_{mn}}
\end{pmatrix}   (4)

From this definition we can derive a variety of important special cases.
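The block layout in (4) can be checked numerically. The following sketch is my own illustration, not part of the original text; it assumes NumPy, and the helper name matrix_derivative and the test function y = a x b are hypothetical choices. It approximates each block ∂y/∂x_ij by central differences and places it at block position (i, j):

import numpy as np

def matrix_derivative(f, x, eps=1e-6):
    # Central-difference approximation of dy/dx for y = f(x), laid out as in (4):
    # an m-by-n grid of p-by-q blocks, where block (i, j) is dy/dx_ij.
    m, n = x.shape
    p, q = f(x).shape
    d = np.zeros((m * p, n * q))
    for i in range(m):
        for j in range(n):
            dx = np.zeros_like(x)
            dx[i, j] = eps
            block = (f(x + dx) - f(x - dx)) / (2 * eps)
            d[i * p:(i + 1) * p, j * q:(j + 1) * q] = block
    return d

# Example: y = a x b for fixed a, b; here dy/dx_ij is the outer product a[:, i] b[j, :]
a, b = np.random.randn(3, 2), np.random.randn(4, 5)
x = np.random.randn(2, 4)
d = matrix_derivative(lambda x: a @ x @ b, x)
print(d.shape)   # (6, 20), i.e. (mp) x (nq)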
Derivative of a scalar function of a vector. We think of a vector as a matrix with a single column and a scalar as a matrix with a single cell. Applying the general definition we get

\frac{dy}{dx} = \begin{pmatrix} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_m} \end{pmatrix}   (5)

Gradient of a scalar function of a vector. We represent gradients using the \nabla sign. For the case of gradients of scalar functions of a vector, we define the gradient to equal the derivative (note this will not be the case for vector functions of vectors):

\nabla_x y \overset{\text{def}}{=} \frac{dy}{dx}   (6)

Derivative of a vector function of a vector. If we think of y and x as p × 1 and m × 1 matrices, then the derivative would be a vector with mp dimensions,

\frac{dy}{dx} = \left( \frac{\partial y_1}{\partial x_1}, \cdots, \frac{\partial y_p}{\partial x_1}, \cdots, \frac{\partial y_1}{\partial x_m}, \cdots, \frac{\partial y_p}{\partial x_m} \right)^T   (7)

Jacobian of a vector function of a vector. A more useful representation can be obtained by working with the derivative of y with respect to the transpose of x. This results in a p × m matrix, known as the Jacobian of y with respect to x,

J_x y \overset{\text{def}}{=} \frac{dy}{dx^T} =
\begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_m} \\ \vdots & & \vdots \\ \frac{\partial y_p}{\partial x_1} & \cdots & \frac{\partial y_p}{\partial x_m} \end{pmatrix}   (8)

The Jacobian is very useful for linearizing vector functions of vectors,

f(x) \approx f(x_0) + J_x f \, (x - x_0)   (9)

where the Jacobian is evaluated at x_0.

Gradient of a vector function of a vector. Note that if y is a scalar then J_x y = \frac{dy}{dx^T} = (\nabla_x y)^T, i.e., the Jacobian of a scalar function is the transpose of the gradient. With this in mind we define the gradient of a vector function of a vector as the transpose of the Jacobian,

\nabla_x y \overset{\text{def}}{=} (J_x y)^T = \left( \frac{dy}{dx^T} \right)^T = \frac{dy^T}{dx}   (10)

Hessian of a scalar function of a vector. Let y = f(x), y \in R, x \in R^n. The matrix

H_x y \overset{\text{def}}{=} \nabla^2_x y \overset{\text{def}}{=} \nabla_x \nabla_x y =
\begin{pmatrix} \frac{\partial^2 y}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 y}{\partial x_1 \partial x_n} \\ \vdots & & \vdots \\ \frac{\partial^2 y}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 y}{\partial x_n \partial x_n} \end{pmatrix}
= \frac{d}{dx} \frac{dy}{dx^T}   (11)

is called the Hessian matrix of y with respect to x.
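As a quick illustration of (8) and (9) (again a sketch of mine, not from the original document, assuming NumPy), the Jacobian of a vector function can be approximated column by column with finite differences and then used to linearize the function around a point:

import numpy as np

def jacobian(f, x, eps=1e-6):
    # Finite-difference Jacobian J_x f, a p-by-m matrix as in (8).
    p, m = f(x).size, x.size
    J = np.zeros((p, m))
    for k in range(m):
        dx = np.zeros(m)
        dx[k] = eps
        J[:, k] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
x0 = np.array([1.0, 2.0])
J = jacobian(f, x0)
x = x0 + np.array([0.01, -0.02])
print(f(x))                   # exact value
print(f(x0) + J @ (x - x0))   # linearization (9), close to the exact value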
2 Summary of Definitions

For x, y vectors:

Gradient:  \nabla_x y \overset{\text{def}}{=} \frac{dy^T}{dx}   (12)

Jacobian:  J_x y \overset{\text{def}}{=} (\nabla_x y)^T \overset{\text{def}}{=} \frac{dy}{dx^T}   (13)

Hessian:  H_x y \overset{\text{def}}{=} \nabla^2_x y \overset{\text{def}}{=} \nabla_x \nabla_x y   (14)

3 Chain Rules

1. Vector functions of vectors: If z = h(y) is a vector function of a vector, and y = f(x) is a vector function of a vector, then

J_x z = (J_y z)(J_x y)   (15)

or equivalently

\nabla_x z = (\nabla_x y)(\nabla_y z)   (16)

Example 1: Consider the quadratic function

z = y^T a y   (17)

where y is an n-dimensional vector and a is an n × n matrix. It is easy to show that

\frac{\partial}{\partial x_k} \, x^T a x = \sum_j x_j (a_{kj} + a_{jk})   (18)

thus

J_x \, x^T a x = x^T (a + a^T)   (19)

and

\nabla_x \, x^T a x = (a + a^T) x   (20)

Moreover, if

y = b x   (21)

where x is an m-dimensional vector and b is an n × m matrix, then

J_x y = b   (22)

\nabla_x y = b^T   (23)
Combining these results, for any fixed vector c,

J_x (bx - c)^T a (bx - c) = \left( J_{bx-c} \, (bx - c)^T a (bx - c) \right) \left( J_x (bx - c) \right)   (24)
= (bx - c)^T (a + a^T) b   (25)

and

\nabla_x (bx - c)^T a (bx - c) = b^T (a + a^T)(bx - c)   (26)

Example 2: Let x, y be vectors. Then

\nabla_x \, e^{x^T y} y = \left( \nabla_x \, e^{x^T y} \right) \left( \nabla_{e^{x^T y}} \, e^{x^T y} y \right)   (27)

where

\nabla_x \, e^{x^T y} = \left( \nabla_x \, x^T y \right) \left( \nabla_{x^T y} \, e^{x^T y} \right) = y \, e^{x^T y}   (28)

\nabla_{e^{x^T y}} \, e^{x^T y} y = \left( J_{e^{x^T y}} \, e^{x^T y} y \right)^T = y^T   (29)

Thus

\nabla_x \, e^{x^T y} y = y \, e^{x^T y} y^T = e^{x^T y} \, y y^T   (30)

2. Scalar functions of matrices: If z = h(y) is a scalar function of a matrix and y = f(x) is a matrix function of a scalar, then

\frac{dz}{dx} = \text{trace}\left[ \left( \frac{dz}{dy} \right)^T \frac{dy}{dx} \right] = \text{trace}\left[ \left( \frac{dy}{dx} \right)^T \frac{dz}{dy} \right]   (31)
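A numerical sanity check of these chain-rule results can be useful. The sketch below is my own addition, not part of the original text; it assumes NumPy, and the helper grad and the random test data are hypothetical. It compares (26) and (30) against finite differences:

import numpy as np

def grad(f, x, eps=1e-6):
    # Finite-difference gradient of a scalar function of a vector.
    g = np.zeros_like(x)
    for k in range(x.size):
        dx = np.zeros_like(x)
        dx[k] = eps
        g[k] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
a, b = rng.normal(size=(3, 3)), rng.normal(size=(3, 4))
c, x, y = rng.normal(size=3), rng.normal(size=4), rng.normal(size=4)

# Equation (26): gradient of (bx - c)^T a (bx - c)
f = lambda x: (b @ x - c) @ a @ (b @ x - c)
print(np.allclose(grad(f, x), b.T @ (a + a.T) @ (b @ x - c)))    # True

# Equation (30): gradient of exp(x^T y) y; column j is the gradient of component j
g = lambda x: np.exp(x @ y) * y
num = np.column_stack([grad(lambda x, j=j: g(x)[j], x) for j in range(y.size)])
print(np.allclose(num, np.exp(x @ y) * np.outer(y, y)))          # True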
4 Useful Formulae

(I need to check whether the notation is consistent with other parts of the document.) Let a, b, c, d be matrices and x, y, z vectors.

4.1 Linear and Quadratic Functions

\nabla_x \, x^T c = c   (32)

\nabla_x \, c x = \frac{d}{dx}(c x)^T = c^T   (33)

J_x \, c x = (\nabla_x \, c x)^T = c   (34)

\nabla_x \, (a x + b)^T c \, (d x + e) = a^T c \, (d x + e) + d^T c^T (a x + b)   (35)

\nabla_x \, (a x + b)^T c \, (a x + b) = a^T (c + c^T)(a x + b)   (36)

\nabla_a \, (a x + b)^T c \, (a x + b) = (c + c^T)(a x + b) \, x^T   (37)

\nabla_a \, x^T a^T b a y = b^T a \, x y^T + b a \, y x^T   (38)

\nabla_a \, x^T a y = x y^T   (39)

\nabla_a \, x^T e^{t a} y = t \, e^{t a} x y^T   (40)
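Formulas such as (38) and (39), where the argument is a matrix, can be verified the same way. This sketch is mine, not part of the original text; it assumes NumPy and perturbs one entry of a at a time:

import numpy as np

def grad_wrt_matrix(f, a, eps=1e-6):
    # Finite-difference gradient of a scalar function with respect to the matrix a.
    g = np.zeros_like(a)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            da = np.zeros_like(a)
            da[i, j] = eps
            g[i, j] = (f(a + da) - f(a - da)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
n = 4
a, b = rng.normal(size=(n, n)), rng.normal(size=(n, n))
x, y = rng.normal(size=n), rng.normal(size=n)

# Equation (39): gradient of x^T a y with respect to a
print(np.allclose(grad_wrt_matrix(lambda a: x @ a @ y, a), np.outer(x, y)))

# Equation (38): gradient of x^T a^T b a y with respect to a
num = grad_wrt_matrix(lambda a: x @ a.T @ b @ a @ y, a)
print(np.allclose(num, b.T @ a @ np.outer(x, y) + b @ a @ np.outer(y, x)))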
4.2 Traces

\nabla_a \, \text{trace}[a] = I   (41)

\nabla_b \, \text{trace}[a b c] = \nabla_b \, \text{trace}\left[ c^T b^T a^T \right] = a^T c^T   (42)

\nabla_b \, \text{trace}\left[ a b^n \right] = \sum_{i=0}^{n-1} \left( b^i a \, b^{n-i-1} \right)^T   (43)

\nabla_e \, \text{trace}\left[ a e b e^T c \right] = a^T c^T e \, b^T + c a e \, b   (44)

\nabla_a \, \text{trace}\left[ a^T b a \right] = (b + b^T) a   (45)

\nabla_a \, \text{trace}\left[ a^{-1} b \right] = -\left( a^{-1} b \, a^{-1} \right)^T   (46)

4.3 Determinants

\nabla_a \, \det[a] = \det[a] \, (a^{-1})^T   (47)

4.4 Kronecker and Vecs

\frac{d \, \text{vec}[a b c]}{d \, \text{vec}[b]} = c \otimes a^T   (48)
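The trace and determinant derivatives above lend themselves to the same kind of finite-difference verification. The code below is my own hedged sketch (NumPy assumed), checking (42), (45) and (47):

import numpy as np

def grad_wrt_matrix(f, m, eps=1e-6):
    # Finite-difference gradient of a scalar function with respect to the matrix m.
    g = np.zeros_like(m)
    for idx in np.ndindex(*m.shape):
        dm = np.zeros_like(m)
        dm[idx] = eps
        g[idx] = (f(m + dm) - f(m - dm)) / (2 * eps)
    return g

rng = np.random.default_rng(2)
n = 4
a, b, c = rng.normal(size=(n, n)), rng.normal(size=(n, n)), rng.normal(size=(n, n))

# (42): gradient of trace(abc) with respect to b
print(np.allclose(grad_wrt_matrix(lambda b: np.trace(a @ b @ c), b), a.T @ c.T))

# (45): gradient of trace(a^T b a) with respect to a
print(np.allclose(grad_wrt_matrix(lambda a: np.trace(a.T @ b @ a), a), (b + b.T) @ a))

# (47): gradient of det(a) with respect to a
print(np.allclose(grad_wrt_matrix(np.linalg.det, a),
                  np.linalg.det(a) * np.linalg.inv(a).T))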
5 Matrix Differential Equations

\frac{d}{dt}(a_t + b_t) = \frac{d a_t}{dt} + \frac{d b_t}{dt}   (49)

\frac{d}{dt}(a_t b_t) = \frac{d a_t}{dt} b_t + a_t \frac{d b_t}{dt}   (50)

\frac{d}{dt}(a_t^2) = \frac{d a_t}{dt} a_t + a_t \frac{d a_t}{dt}   (51)

\frac{d}{dt}\left( a_t \, a_t^{-1} \right) = 0 = \frac{d a_t}{dt} a_t^{-1} + a_t \frac{d a_t^{-1}}{dt}   (52)

\frac{d a_t^{-1}}{dt} = -a_t^{-1} \frac{d a_t}{dt} a_t^{-1}   (53)

It is only possible to expect \frac{d}{dt} e^{a_t} = \frac{d a_t}{dt} e^{a_t} under conditions of full commutativity, i.e., when a_t and \frac{d a_t}{dt} commute.

Types of linear matrix differential equations:

\frac{d a_t}{dt} = b \, a_t \, c   (56)

with special cases when b or c are the identity matrix. The following tricks allow moving the coefficients to the left or to the right: if

\frac{d a_t}{dt} = a_t c   (58)

then

\frac{d a_t^T}{dt} = c^T a_t^T   (59)

and if

\frac{d a_t}{dt} = b \, a_t   (61)

then

\frac{d a_t^{-1}}{dt} = -a_t^{-1} \frac{d a_t}{dt} a_t^{-1} = -a_t^{-1} b \, a_t \, a_t^{-1} = -a_t^{-1} b   (62)

It can be shown that if a_0 is non-singular, then the solution a_t is non-singular for all t.
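Equation (53), the derivative of a matrix inverse, is easy to confirm numerically along a one-parameter family a_t = a_0 + t v. The following is my own sketch, assuming NumPy and assuming a_t stays non-singular near the chosen t:

import numpy as np

# Check equation (53): d(a_t^{-1})/dt = -a_t^{-1} (da_t/dt) a_t^{-1},
# along the path a_t = a0 + t v, for which da_t/dt = v.
rng = np.random.default_rng(3)
n = 4
a0 = rng.normal(size=(n, n)) + n * np.eye(n)      # shifted to keep a_t well conditioned
v = rng.normal(size=(n, n))
a = lambda t: a0 + t * v

t, eps = 0.5, 1e-6
numeric = (np.linalg.inv(a(t + eps)) - np.linalg.inv(a(t - eps))) / (2 * eps)
analytic = -np.linalg.inv(a(t)) @ v @ np.linalg.inv(a(t))
print(np.allclose(numeric, analytic))             # True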
6 Determinants

\left| I + x y^T \right| = 1 + x^T y, \qquad x, y \in R^n

\left| e^a \right| = e^{\text{trace}(a)}

7 Trace

trace(a + b) = trace(a) + trace(b)

trace(a) = trace(a^T)

trace(abc) = trace(cab) = trace(bca)

trace(x y^T a) = y^T a x, \qquad x \in R^m, \; y \in R^n

trace(a b^T r^T) = a^T r b, for vectors a, b and a matrix r (author's note: confirm that r does not need to be transposed)

If a is m × n and b is n × m then trace(ab) = trace(ba) = trace(a^T b^T).

8 Matrix Exponentials

\left| e^a \right| = e^{\text{trace}(a)}

If a = p \Lambda p^{-1} then e^a = p \, e^{\Lambda} p^{-1}.

\frac{d e^{t a}}{dt} = a \, e^{t a}

\nabla_a \, x^T e^{t a} y = t \, e^{t a} x y^T

8.1 Proofs:

\frac{\partial}{\partial a_{ij}} \, x^T e^{t a} y
= \frac{\partial}{\partial \delta} \left. x^T e^{t(a + \delta 1_i 1_j^T)} y \right|_{\delta = 0}
\approx \frac{\partial}{\partial \delta} \left. x^T e^{t a} e^{t \delta 1_i 1_j^T} y \right|_{\delta = 0}
= \left. x^T e^{t a} \, t \, 1_i 1_j^T e^{t \delta 1_i 1_j^T} y \right|_{\delta = 0}
= t \left( e^{t a} x \right)_i y_j   (64)

where the second step uses the approximation e^{t(a + \delta 1_i 1_j^T)} \approx e^{t a} e^{t \delta 1_i 1_j^T}. Thus

\nabla_a \, x^T e^{t a} y = t \, e^{t a} x y^T   (65)

9 Matrix Logarithms

Let a be a real or complex square matrix of order n with positive eigenvalues. Then there is a unique matrix b such that (1) a = e^b and (2) the imaginary part of the eigenvalues of b is in [-\pi, \pi]. We call b the principal logarithm of a.

If a = p \Lambda p^{-1} then \log(a) = p \log(\Lambda) p^{-1}.
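For diagonalizable matrices, the eigendecomposition route to e^a and to the principal logarithm described above can be exercised directly. The sketch below is my own illustration with a symmetric test matrix (so the decomposition is real); it assumes NumPy:

import numpy as np

# Eigendecomposition route to e^a and its principal logarithm (Sections 8 and 9).
rng = np.random.default_rng(4)
n = 4
s = rng.normal(size=(n, n))
a = (s + s.T) / 2                                 # symmetric, hence diagonalizable

lam, p = np.linalg.eigh(a)                        # a = p diag(lam) p^T
expa = p @ np.diag(np.exp(lam)) @ p.T             # e^a = p e^Lambda p^{-1}
print(np.isclose(np.linalg.det(expa), np.exp(np.trace(a))))   # |e^a| = e^{trace(a)}

mu, q = np.linalg.eigh(expa)                      # e^a is symmetric with positive eigenvalues
loga = q @ np.diag(np.log(mu)) @ q.T              # principal log via p log(Lambda) p^{-1}
print(np.allclose(loga, a))                       # the principal log recovers a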
10 Kronecker and Vec

These provide a way to deal with derivatives of matrix functions without having to use a cubix or quartix (i.e., matrices with 3 or 4 dimensions). Instead of working with matrix functions we work with vectorized versions of matrix functions. This gives rise to the Kronecker product, or tensor product.

Definition: Kronecker product

a \otimes b = \begin{pmatrix} a_{11} b & \cdots & a_{1n} b \\ \vdots & & \vdots \\ a_{m1} b & \cdots & a_{mn} b \end{pmatrix}   (66)

Definition: Vec operator

\text{vec}[a] = \left( a_{11}, \ldots, a_{m1}, a_{12}, \ldots, a_{m2}, \ldots, a_{1n}, \ldots, a_{mn} \right)^T   (67)

10.1 Properties

1. a \otimes b \otimes c = (a \otimes b) \otimes c = a \otimes (b \otimes c)   (68)
   provided the dimensions of the matrices allow all the expressions to exist.

2. (a + b) \otimes (c + d) = a \otimes c + a \otimes d + b \otimes c + b \otimes d   (69)

3. (a \otimes b)(c \otimes d) = (a c) \otimes (b d)   (70)

4. (a \otimes b)^T = a^T \otimes b^T   (72)
5. (a \otimes b)^{-1} = a^{-1} \otimes b^{-1}   (73)

6. \text{vec}\left[ a b^T \right] = b \otimes a   (74)

7. \text{vec}[a b c] = (c^T \otimes a) \, \text{vec}[b]   (75)

8. If \{\lambda_i, u_i\} are eigenvalue/eigenvector pairs of a and \{\delta_j, v_j\} are eigenvalue/eigenvector pairs of b, then \{\lambda_i \delta_j, u_i \otimes v_j\} are eigenvalue/eigenvector pairs of a \otimes b.

9. \det[a \otimes b] = \det[a]^m \det[b]^n   (76)
   where a, b are of order n × n and m × m respectively.

10. trace(a \otimes b) = trace[a] \, trace[b]   (77)

11. \nabla^2_{\text{vec}[x]} \, \text{trace}\left( x^T a x b \right) = \nabla^2_{\text{vec}[x]} \left( \text{vec}[x]^T (b^T \otimes a) \, \text{vec}[x] \right) = b^T \otimes a + b \otimes a^T   (78)
    where a, b, x are matrices.
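Several of the Kronecker/vec identities above are easy to confirm numerically with np.kron. This is my own sketch, not from the original text (NumPy assumed; the vec helper follows definition (67)):

import numpy as np

# Checks of some Kronecker / vec identities.
rng = np.random.default_rng(5)
a, b, c = rng.normal(size=(3, 4)), rng.normal(size=(4, 2)), rng.normal(size=(2, 5))
vec = lambda m: m.reshape(-1, order="F")          # stack the columns, as in (67)

# (75): vec[abc] = (c^T kron a) vec[b]
print(np.allclose(vec(a @ b @ c), np.kron(c.T, a) @ vec(b)))

# (70) and (72): mixed-product and transpose rules
d, e = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
print(np.allclose(np.kron(a, b) @ np.kron(d, e), np.kron(a @ d, b @ e)))
print(np.allclose(np.kron(a, b).T, np.kron(a.T, b.T)))

# (77): trace(a kron b) = trace(a) trace(b), for square a, b
sa, sb = rng.normal(size=(3, 3)), rng.normal(size=(4, 4))
print(np.isclose(np.trace(np.kron(sa, sb)), np.trace(sa) * np.trace(sb)))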
11 Optimization of Quadratic Functions

This is arguably the most useful optimization problem in applied mathematics. Its solution is behind a large variety of useful algorithms including Multivariate Linear Regression, the Kalman Filter, Linear Quadratic Controllers, etc. Let

\rho(x) = E\left[ (b x - C)^T a (b x - C) \right] + x^T d \, x   (79)

where a and d are symmetric positive definite matrices, b is a matrix, x a vector and C a random vector. Taking the Jacobian with respect to x and applying the chain rule (writing c = E[C]) we have

J_x \rho = J_{bx - c}\left[ (b x - c)^T a (b x - c) \right] J_x (b x - c) + J_x \, x^T d \, x   (80)
= 2 (b x - c)^T a b + 2 x^T d   (81)

\nabla_x \rho = (J_x \rho)^T = 2 b^T a (b x - c) + 2 d \, x   (82)

Setting the gradient to zero we get

(b^T a b + d) \, x = b^T a c   (83)

This is commonly known as the Normal Equation. Thus the value \hat{x} that minimizes \rho is

\hat{x} = h c   (84)

where

h = (b^T a b + d)^{-1} b^T a   (85)

Moreover

\rho(\hat{x}) = (b h c - c)^T a (b h c - c) + c^T h^T d \, h c   (86)
= c^T h^T b^T a b h c - 2 c^T h^T b^T a c + c^T a c + c^T h^T d \, h c   (87)

Now note

c^T h^T b^T a b h c + c^T h^T d \, h c = c^T h^T (b^T a b + d) h c   (88)
= c^T a^T b (b^T a b + d)^{-1} (b^T a b + d)(b^T a b + d)^{-1} b^T a c   (89)
= c^T a^T b (b^T a b + d)^{-1} b^T a c   (90)
= c^T h^T b^T a c   (91)

Thus

\rho(\hat{x}) = c^T a c - c^T h^T b^T a c = c^T k c   (92)

where

k = a - h^T b^T a = a - a^T b (b^T a b + d)^{-1} b^T a   (93)

This is known as the Riccati Equation, which is found in a variety of stochastic filtering and control problems.

Example Application: Ridge Regression. Let y \in R^n represent a set of observations on a variable we want to predict, let b be an n × p matrix where each row is a set of observations on p variables used to predict y, and let x \in R^p be the weights given to each variable to predict y. A useful measure of error is

\rho(x) = (b x - y)^T (b x - y) + \lambda \, x^T x   (94)

where \lambda \geq 0 is a constant that penalizes the use of large values of x. This is a special case of (79) with a = I, d = \lambda I_p and c = y, so the solution is

\hat{x} = (b^T b + \lambda I_p)^{-1} b^T y   (95)

where I_p is the p × p identity matrix.
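The ridge regression solution (95) follows from solving the normal equation (83). The sketch below is my own illustration (NumPy assumed, synthetic data of my choosing); it solves the linear system rather than forming the inverse, and checks that the gradient (82) vanishes at the solution:

import numpy as np

# Ridge regression as in (94)-(95): solve (b^T b + lambda I) x = b^T y.
rng = np.random.default_rng(6)
n, p, lam = 50, 5, 0.1
b = rng.normal(size=(n, p))
y = b @ rng.normal(size=p) + 0.01 * rng.normal(size=n)

x_hat = np.linalg.solve(b.T @ b + lam * np.eye(p), b.T @ y)

# The gradient (82), with a = I and d = lambda * I, vanishes at x_hat.
grad = 2 * b.T @ (b @ x_hat - y) + 2 * lam * x_hat
print(np.allclose(grad, 0))                       # True up to numerical precision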
12 Optimization Methods

12.1 Newton-Raphson Method

Let y = f(x), for y \in R, x \in R^n. The Newton-Raphson algorithm is an iterative method for optimizing y. We start the process at an arbitrary point x_0 \in R^n. Let x_t \in R^n represent the state of the algorithm at iteration t. We approximate the function f using the linear and quadratic terms of the Taylor expansion of f around x_t,

\hat{f}_t(x) = f(x_t) + \nabla_x f(x_t)^T (x - x_t) + \frac{1}{2} (x - x_t)^T \left( \nabla_x \nabla_x f(x_t) \right) (x - x_t)   (96)

and we then find the extremum of \hat{f}_t with respect to x and move directly to that extremum. To do so note that

\nabla_x \hat{f}_t(x) = \nabla_x f(x_t) + \left( \nabla_x \nabla_x f(x_t) \right)(x - x_t)   (97)

We let x_{t+1} be the value of x for which \nabla_x \hat{f}_t(x) = 0,

x_{t+1} = x_t - \left( \nabla_x \nabla_x f(x_t) \right)^{-1} \nabla_x f(x_t)   (98)

It is useful to compare the Newton-Raphson method with the standard method of gradient ascent. The gradient ascent iteration is defined as follows,

x_{t+1} = x_t + \epsilon I_n \nabla_x f(x_t)   (99)

where \epsilon is a small positive constant. Thus gradient ascent can be seen as a Newton-Raphson method in which the Hessian matrix is approximated by -\frac{1}{\epsilon} I_n.

12.2 Gauss-Newton Method

Let f(x) = \sum_{i=1}^n r_i(x)^2 for r_i : R^n \to R. We start the process with an arbitrary point x_0 \in R^n. Let x_t \in R^n represent the state of the algorithm at iteration t. We approximate the functions r_i using the linear term of their Taylor expansion around x_t,

\hat{r}_i(x) = r_i(x_t) + \left( \nabla_x r_i(x_t) \right)^T (x - x_t)   (100)

\hat{f}_t(x) = \sum_{i=1}^n \left( \hat{r}_i(x) \right)^2 = \sum_{i=1}^n \left( r_i(x_t) - \left( \nabla_x r_i(x_t) \right)^T x_t + \left( \nabla_x r_i(x_t) \right)^T x \right)^2   (101)

Minimizing \hat{f}_t(x) is a linear least squares problem with a well known solution. If we let y_i = \left( \nabla_x r_i(x_t) \right)^T x_t - r_i(x_t) and u_i = \nabla_x r_i(x_t), then

x_{t+1} = \left( \sum_{i=1}^n u_i u_i^T \right)^{-1} \left( \sum_{i=1}^n u_i y_i \right)   (103)

Note this is equivalent to Newton-Raphson with the Hessian approximated by \sum_{i=1}^n u_i u_i^T (up to a factor of 2).
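A minimal Gauss-Newton iteration following (100)-(103) might look as follows. This is my own sketch, not the author's implementation; the exponential-fit example, the residual function r and its Jacobian jac are hypothetical choices, and NumPy is assumed:

import numpy as np

def gauss_newton(r, jac, x0, iters=20):
    # Gauss-Newton iteration (103): x_{t+1} = (sum u_i u_i^T)^{-1} (sum u_i y_i),
    # with u_i = grad r_i(x_t) (the rows of jac) and y_i = u_i^T x_t - r_i(x_t).
    x = x0
    for _ in range(iters):
        u = jac(x)                                # each row is u_i^T
        y = u @ x - r(x)                          # the y_i defined in the text
        x = np.linalg.solve(u.T @ u, u.T @ y)     # equation (103)
    return x

# Hypothetical example: fit z ~ x0 * exp(x1 * t) to noiseless data generated
# with x0 = 2, x1 = -1.5.
t = np.linspace(0.0, 1.0, 30)
z = 2.0 * np.exp(-1.5 * t)
r = lambda x: x[0] * np.exp(x[1] * t) - z
jac = lambda x: np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])
print(gauss_newton(r, jac, np.array([1.0, 0.0])))   # approximately [ 2.  -1.5]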
History

1. The first version of this document was written by Javier R. Movellan in May 2004, based on an Appendix from the Multivariate Logistic Regression Primer at the Kolmogorov Project. The original document was 7 pages long.