APPENDIX A

Background Mathematics

A.1 Linear Algebra

A.1.1 Vector algebra

Let x denote the n-dimensional column vector with components

    x = (x_1, x_2, ..., x_n)^T

Definition 16 (scalar product). The scalar product w · x is defined as

    w · x ≡ ∑_{i=1}^{n} w_i x_i = w^T x    (A.1.1)

and has a natural geometric interpretation as

    w · x = |w| |x| cos(θ)    (A.1.2)

where θ is the angle between the two vectors. Thus if the lengths of two vectors are fixed, their inner product is largest when θ = 0, whereupon one vector is a constant multiple of the other. If the scalar product x^T y = 0, then x and y are orthogonal (they are at right angles to each other). The length of a vector is denoted |x|; the squared length is given by

    |x|^2 = x^T x = x_1^2 + x_2^2 + ... + x_n^2    (A.1.3)

A unit vector x has x^T x = 1.

Definition 17 (Linear dependence). A set of vectors x_1, ..., x_n is linearly dependent if there exists a vector x_j that can be expressed as a linear combination of the other vectors. Conversely, if the only solution to

    ∑_{i=1}^{n} α_i x_i = 0    (A.1.4)

is α_i = 0 for all i = 1, ..., n, then the vectors x_1, ..., x_n are linearly independent.
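As an illustrative aside (not part of the original text), the following NumPy sketch checks the scalar product identities (A.1.1)-(A.1.3) numerically and tests linear dependence via the matrix rank; the particular vectors are arbitrary examples.

import numpy as np

w = np.array([1.0, 2.0, 3.0])
x = np.array([4.0, 0.0, -1.0])

# Scalar product (A.1.1) and its geometric form (A.1.2)
dot = w @ x
cos_theta = dot / (np.linalg.norm(w) * np.linalg.norm(x))
print(dot, np.linalg.norm(w) * np.linalg.norm(x) * cos_theta)   # identical values

# Squared length (A.1.3)
print(np.allclose(x @ x, np.sum(x**2)))

# Linear (in)dependence: stack vectors as columns; full column rank means
# the only solution to sum_i alpha_i x_i = 0 is alpha = 0.
X = np.column_stack([w, x, w + 2 * x])    # third column depends on the first two
print(np.linalg.matrix_rank(X))           # 2 < 3, so the set is linearly dependent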

Figure A.1: Resolving a vector a into components along the orthogonal directions e and e*. The projections of a onto these two directions have lengths α and β along the directions e and e* respectively.

A.1.2 The scalar product as a projection

Suppose that we wish to resolve the vector a into its components along the orthogonal directions specified by the unit vectors e and e*. That is, e · e = 1, e* · e* = 1 and e · e* = 0. This is depicted in fig(A.1). We are required to find the scalar values α and β such that

    a = α e + β e*    (A.1.5)

From this we obtain

    a · e = α e · e + β e* · e,        a · e* = α e · e* + β e* · e*    (A.1.6)

From the orthogonality and unit lengths of the vectors e and e*, this becomes simply

    a · e = α,        a · e* = β    (A.1.7)

A set of vectors is orthonormal if they are mutually orthogonal and have unit length. This means that we can write the vector a in terms of the orthonormal components e and e* as

    a = (a · e) e + (a · e*) e*    (A.1.8)

One can see therefore that the scalar product between a and e projects the vector a onto the (unit) direction e. The projection of a vector a onto a direction specified by a general vector f is therefore

    (a · f / |f|^2) f    (A.1.9)

A.1.3 Lines in space

A line in 2 (or more) dimensions can be specified as follows. The vector of any point along the line is given, for some s, by the equation

    p = a + s u,    s ∈ ℝ    (A.1.10)

where u is parallel to the line, and the line passes through the point a, see fig(A.2). This is called the parametric representation of the line. An alternative specification can be given by realising that all vectors along the line are orthogonal to the normal of the line, n (u and n are orthonormal). That is,

    (p − a) · n = 0,  that is  p · n = a · n    (A.1.11)

If the vector n is of unit length, the right hand side of the above represents the shortest distance from the origin to the line, drawn by the dashed line in fig(A.2) (since this is the projection of a onto the normal direction).

Figure A.2: A line can be specified by some position vector on the line, a, and a unit vector along the direction of the line, u. In 2 dimensions, there is a unique direction, n, perpendicular to the line. In three dimensions, the vectors perpendicular to the direction of the line lie in a plane, whose normal vector is in the direction of the line, u.
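A small illustrative sketch of equations (A.1.7)-(A.1.9): resolving a vector along an orthonormal pair and projecting it onto a general direction f. The vectors below are made up for the example.

import numpy as np

a = np.array([3.0, 1.0])
e = np.array([1.0, 1.0]) / np.sqrt(2)          # unit vector
e_star = np.array([-1.0, 1.0]) / np.sqrt(2)    # orthogonal unit vector

# Orthonormal resolution (A.1.8): a = (a.e) e + (a.e*) e*
alpha, beta = a @ e, a @ e_star
print(np.allclose(alpha * e + beta * e_star, a))

# Projection onto a general (non-unit) direction f, equation (A.1.9)
f = np.array([2.0, 0.0])
proj = (a @ f) / (f @ f) * f
print(proj)    # the component of a along f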

Figure A.3: A plane can be specified by a point in the plane, a, and two non-parallel directions in the plane, u and v. The normal to the plane is unique, and in the same direction as the directed line from the origin to the nearest point on the plane.

A.1.4 Planes and hyperplanes

A line is a one-dimensional hyperplane. To define a two-dimensional plane (in arbitrary dimensional space) one may specify two vectors u and v that lie in the plane (they need not be mutually orthogonal), and a position vector a in the plane, see fig(A.3). Any vector p in the plane can then be written as

    p = a + s u + t v,    (s, t) ∈ ℝ    (A.1.12)

An alternative definition is given by considering that any vector within the plane must be orthogonal to the normal of the plane, n:

    (p − a) · n = 0,  that is  p · n = a · n    (A.1.13)

The right hand side of the above represents the shortest distance from the origin to the plane, drawn by the dashed line in fig(A.3). The advantage of this representation is that it has the same form as that of a line. Indeed, this representation of (hyper)planes is independent of the dimension of the space. In addition, only two vectors need to be defined: a point in the plane, a, and the normal to the plane, n.

A.1.5 Matrices

An m × n matrix A is a collection of m × n scalar values arranged in a rectangle of m rows and n columns. A vector can be considered an n × 1 matrix. If the element of the i-th row and j-th column is A_ij, then A^T denotes the matrix that has A_ji there instead, the transpose of A. For example, a matrix A and its transpose are:

    A = ( 2  3  4          A^T = ( 2  4  6
          4  5  9                  3  5  7
          6  7 10 )                4  9 10 )    (A.1.14)

The i, j element of matrix A can be written A_ij or, in cases where more clarity is required, [A]_ij.

Definition 18 (transpose). The transpose B^T of the n by m matrix B is the m by n matrix with components

    [B^T]_kj = B_jk ;    k = 1, ..., m,  j = 1, ..., n.    (A.1.15)

We have (B^T)^T = B and (AB)^T = B^T A^T. If the shapes of the matrices A, B and C are such that it makes sense to calculate the product ABC, then

    (ABC)^T = C^T B^T A^T    (A.1.16)

A square matrix A is symmetric if A^T = A. A square matrix is called Hermitian if

    A = (A*)^T    (A.1.17)

where * denotes the complex conjugate operator. For Hermitian matrices, the eigenvectors form an orthogonal set, with real eigenvalues.
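The normal form (A.1.13) gives the shortest distance from the origin to the (hyper)plane as a · n for a unit normal n. The following sketch, using an arbitrary plane in three dimensions, cross-checks this against a least squares computation and also spot-checks the transpose identity of Definition 18.

import numpy as np

# Plane through a with (non-parallel) in-plane directions u and v
a = np.array([1.0, 0.0, 2.0])
u = np.array([1.0, 1.0, 0.0])
v = np.array([0.0, 1.0, 1.0])
n = np.cross(u, v)
n /= np.linalg.norm(n)                       # unit normal to the plane

# Shortest distance from the origin to the plane: |a . n| (right hand side of A.1.13)
print(abs(a @ n))

# Cross-check: find (s, t) minimising |a + s u + t v| by least squares
UV = np.column_stack([u, v])
st, *_ = np.linalg.lstsq(UV, -a, rcond=None)
print(np.linalg.norm(a + UV @ st))           # equals |a . n| up to rounding

# Transpose identity from Definition 18: (AB)^T = B^T A^T
A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
print(np.allclose((A @ B).T, B.T @ A.T))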

Definition 19 (Matrix addition). For two matrices A and B of the same size,

    [A + B]_ij = [A]_ij + [B]_ij    (A.1.18)

Definition 20 (Matrix multiplication). For an l by n matrix A and an n by m matrix B, the product AB is the l by m matrix with elements

    [AB]_ik = ∑_{j=1}^{n} [A]_ij [B]_jk ;    i = 1, ..., l,  k = 1, ..., m.    (A.1.19)

For example,

    ( a_11  a_12 ) ( x_1 )   =   ( a_11 x_1 + a_12 x_2 )
    ( a_21  a_22 ) ( x_2 )       ( a_21 x_1 + a_22 x_2 )    (A.1.20)

Note that even if BA is defined as well, that is if m = l, generally BA is not equal to AB (when they are equal we say the matrices commute). The matrix I is the identity matrix, necessarily square, with 1s on the diagonal and 0s everywhere else. For clarity we may also write I_m for the square m × m identity matrix. Then for an m × n matrix A,

    I_m A = A I_n = A    (A.1.21)

The identity matrix has elements [I]_ij = δ_ij given by the Kronecker delta:

    δ_ij = 1 if i = j,  0 if i ≠ j    (A.1.22)

Definition 21 (Trace).

    trace(A) = ∑_i A_ii = ∑_i λ_i    (A.1.23)

where λ_i are the eigenvalues of A.

A.1.6 Linear transformations

Rotations

If we assume that rotation of a two-dimensional vector x = (x, y)^T can be accomplished by matrix multiplication Rx then, since matrix multiplication is distributive, we only need to work out how the axes unit vectors i = (1, 0)^T and j = (0, 1)^T transform, since

    Rx = x Ri + y Rj    (A.1.24)

The unit vectors i and j under rotation by θ degrees transform to the vectors

    Ri = (r_11, r_21)^T = (cos θ, sin θ)^T,        Rj = (r_12, r_22)^T = (−sin θ, cos θ)^T    (A.1.25)

From this, one can simply read off the values for the elements of R:

    R = ( cos θ  −sin θ
          sin θ   cos θ )    (A.1.26)
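A brief numerical sketch (assuming only NumPy) of the matrix product definition (A.1.19), non-commutativity, the trace identity (A.1.23) and the rotation matrix (A.1.26):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
B = rng.standard_normal((2, 2))

# Matrix multiplication generally does not commute
print(np.allclose(A @ B, B @ A))             # typically False

# trace(A) equals the sum of the eigenvalues (A.1.23)
print(np.trace(A), np.sum(np.linalg.eigvals(A)).real)   # agree to numerical precision

# 2D rotation matrix (A.1.26): R i = (cos t, sin t), R j = (-sin t, cos t)
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
i_vec, j_vec = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(R @ i_vec, R @ j_vec)                  # the rotated axis vectors of (A.1.25)
print(np.allclose(R @ R.T, np.eye(2)))       # rotations are orthogonal matrices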

A.1.7 Determinants

Definition 22 (Determinant). For a square matrix A, the determinant is the volume of the transformation of the matrix A (up to a sign change). That is, we take a hypercube of unit volume and map each vertex under the transformation; the volume of the resulting object is defined as the determinant. Writing [A]_ij = a_ij,

    det ( a_11  a_12 )  =  a_11 a_22 − a_21 a_12    (A.1.27)
        ( a_21  a_22 )

    det ( a_11  a_12  a_13 )
        ( a_21  a_22  a_23 )  =  a_11 (a_22 a_33 − a_23 a_32) − a_12 (a_21 a_33 − a_31 a_23) + a_13 (a_21 a_32 − a_31 a_22)    (A.1.28)
        ( a_31  a_32  a_33 )

The determinant in the (3 × 3) case has the form

    a_11 det ( a_22  a_23 )  −  a_12 det ( a_21  a_23 )  +  a_13 det ( a_21  a_22 )    (A.1.29)
             ( a_32  a_33 )              ( a_31  a_33 )               ( a_31  a_32 )

The determinant of the (3 × 3) matrix A is therefore given by the sum of terms (−1)^(i+1) a_1i det(A_i), where A_i is the (2 × 2) matrix formed from A by removing the first row and the i-th column. This form of the determinant generalises to any dimension. That is, we can define the determinant recursively as an expansion along the top row of determinants of reduced matrices. The absolute value of the determinant is the volume of the transformation.

    det(A^T) = det(A)    (A.1.30)

For square matrices A and B of equal dimensions,

    det(AB) = det(A) det(B),        det(I) = 1  ⇒  det(A^{-1}) = 1/det(A)    (A.1.31)

For any matrix A which collapses dimensions, the volume of the transformation is zero, and so is the determinant. If the determinant is zero, the matrix cannot be invertible: given a vector y obtained as a projection y = Ax, we cannot uniquely compute which vector x was projected to y; there will in general be an infinite number of solutions.

Definition 23 (Orthogonal matrix). A square matrix A is orthogonal if AA^T = I = A^T A. From the properties of the determinant, we see therefore that an orthogonal matrix has determinant ±1 and hence corresponds to a volume preserving transformation, i.e. a rotation.

Definition 24 (Matrix rank). For an m × n matrix X with n columns, each written as an m-vector,

    X = ( x_1, ..., x_n )    (A.1.32)

the rank of X is the maximum number of linearly independent columns (or equivalently rows). An n × n square matrix is full rank if the rank is n, and the matrix is then non-singular. Otherwise the matrix is reduced rank and is singular.
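The determinant properties (A.1.30)-(A.1.31) and the connection between collapsing dimensions, zero determinant and reduced rank can be checked numerically; the matrices below are arbitrary test cases.

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# det(A^T) = det(A), det(AB) = det(A) det(B), det(A^{-1}) = 1/det(A)
print(np.isclose(np.linalg.det(A.T), np.linalg.det(A)))
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))
print(np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A)))

# A matrix that collapses a dimension is singular: zero determinant, reduced rank
C = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],      # a multiple of the first row
              [0.0, 1.0, 1.0]])
print(np.linalg.det(C), np.linalg.matrix_rank(C))   # approximately 0, and rank 2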

A.1.8 Matrix inversion

Definition 25 (Matrix inversion). For a square matrix A, its inverse satisfies

    A^{-1} A = I = A A^{-1}    (A.1.33)

It is not always possible to find a matrix A^{-1} such that A^{-1} A = I. In that case, we call the matrix A singular. Geometrically, singular matrices correspond to projections: if we were to take the transform of each of the vertices v of a binary hypercube, Av, the volume of the transformed hypercube would be zero. If you are given a vector y and a singular transformation A, one cannot uniquely identify a vector x for which y = Ax; typically there will be a whole space of possibilities. Provided the inverse matrices exist,

    (AB)^{-1} = B^{-1} A^{-1}    (A.1.34)

For a non-square matrix A such that AA^T is invertible, the pseudo inverse, defined as

    A^† = A^T (A A^T)^{-1}    (A.1.35)

satisfies A A^† = I.

A.1.9 Computing the matrix inverse

For a 2 × 2 matrix, it is straightforward to work out the explicit form of the inverse of a general matrix. If the matrix whose inverse we wish to find is

    A = ( a  b )
        ( c  d )

then the condition for the inverse is

    ( a  b ) ( e  f )   =   ( 1  0 )
    ( c  d ) ( g  h )       ( 0  1 )    (A.1.36)

Multiplying out the left hand side, we obtain the four conditions

    ae + bg = 1,    af + bh = 0,    ce + dg = 0,    cf + dh = 1    (A.1.37)

It is readily verified that the solution to this set of four linear equations is given by

    ( e  f )   =   1/(ad − bc) (  d  −b )   =   A^{-1}    (A.1.38)
    ( g  h )                   ( −c   a )

The quantity ad − bc is the determinant of A. There are many ways to compute the inverse of a general matrix, and we refer the reader to more specialised texts. Note that if one wants only to solve a linear system, although the solution can be obtained through matrix inversion, this should not be done by explicitly inverting the matrix. Often one needs to solve huge dimensional linear systems of equations, and speed becomes an issue. Such systems can be solved much more accurately and quickly using elimination techniques such as Gaussian elimination.
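An illustrative check of the explicit 2 × 2 inverse (A.1.38) and of the pseudo inverse (A.1.35), together with the closing recommendation to solve linear systems by elimination-based routines rather than by forming the inverse explicitly:

import numpy as np

# Explicit 2x2 inverse (A.1.38)
a, b, c, d = 2.0, 1.0, 5.0, 3.0
A = np.array([[a, b], [c, d]])
A_inv = np.array([[d, -b], [-c, a]]) / (a * d - b * c)
print(np.allclose(A_inv, np.linalg.inv(A)))

# Pseudo inverse of a wide matrix: A_dag = A^T (A A^T)^{-1}, so A A_dag = I
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
M_dag = M.T @ np.linalg.inv(M @ M.T)
print(np.allclose(M @ M_dag, np.eye(2)))

# Prefer an elimination-based solver over explicit inversion for linear systems
y = np.array([1.0, 4.0])
print(np.linalg.solve(A, y), np.linalg.inv(A) @ y)   # same answer; solve is preferred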

A.1.10 Eigenvalues and eigenvectors

The eigenvectors of a matrix correspond to the natural coordinate system in which the geometric transformation represented by A can be most easily understood.

Definition 26 (Eigenvalues and Eigenvectors). For a square matrix A, e is an eigenvector of A with eigenvalue λ if

    A e = λ e    (A.1.39)

    det(A) = ∏_{i=1}^{n} λ_i    (A.1.40)

Hence a matrix is singular if it has a zero eigenvalue. The trace of a matrix can be expressed as

    trace(A) = ∑_i λ_i    (A.1.41)

For an (n × n) dimensional matrix, there are (including repetitions) n eigenvalues, each with a corresponding eigenvector. We can rewrite equation (A.1.39) as

    (A − λ I) e = 0    (A.1.42)

This is a linear equation, for which the eigenvector e and eigenvalue λ is a solution. We can write equation (A.1.42) as Be = 0, where B ≡ A − λ I. If B has an inverse, then the only solution is e = B^{-1} 0 = 0, which trivially satisfies the eigen-equation. For any non-trivial solution to the problem Be = 0, we therefore need B to be non-invertible. This is equivalent to the condition that B has zero determinant. Hence λ is an eigenvalue of A if

    det(A − λ I) = 0    (A.1.43)

This is known as the characteristic equation. This determinant equation is a polynomial of degree n in λ, known as the characteristic polynomial. Once we have found an eigenvalue, the corresponding eigenvector can be found by substituting this value for λ in equation (A.1.39) and solving the linear equations for e. It may be that for an eigenvalue the eigenvector is not unique and there is a space of corresponding vectors.

Geometrically, the eigenvectors are special directions such that the effect of the transformation A along a direction e is simply to scale the vector e. For a rotation matrix R, in general there will be no direction preserved under the rotation, so that the eigenvalues and eigenvectors are complex valued (which is why the Fourier representation, which corresponds to a representation in a rotated basis, is necessarily complex).

Remark (Orthogonality of eigenvectors of symmetric matrices). For a real symmetric matrix A = A^T, two of its eigenvectors e^i and e^j are orthogonal, (e^i)^T e^j = 0, if the eigenvalues λ_i and λ_j are different. This can be shown by considering

    A e^i = λ_i e^i  ⇒  (e^j)^T A e^i = λ_i (e^j)^T e^i    (A.1.44)

Since A is symmetric, the left hand side is equivalent to

    ((e^j)^T A) e^i = (A e^j)^T e^i = λ_j (e^j)^T e^i  ⇒  λ_i (e^j)^T e^i = λ_j (e^j)^T e^i    (A.1.45)

If λ_i ≠ λ_j, this condition can be satisfied only if (e^j)^T e^i = 0, namely that the eigenvectors are orthogonal.
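A sketch of Definition 26 and of the remark above: the eigenvalues satisfy the characteristic equation (A.1.43), the determinant is their product (A.1.40), and the eigenvectors of a real symmetric matrix are mutually orthogonal. The matrix is an arbitrary symmetric example.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])                  # real symmetric

lam, E = np.linalg.eigh(A)                  # eigenvalues and orthonormal eigenvectors
print(lam)

# Each pair satisfies A e = lambda e (A.1.39)
for l, e in zip(lam, E.T):
    print(np.allclose(A @ e, l * e))

# det(A) is the product of the eigenvalues (A.1.40);
# each eigenvalue solves det(A - lambda I) = 0 (A.1.43)
print(np.isclose(np.linalg.det(A), np.prod(lam)))
print([np.isclose(np.linalg.det(A - l * np.eye(2)), 0.0) for l in lam])

# Eigenvectors of a symmetric matrix are mutually orthogonal
print(np.allclose(E.T @ E, np.eye(2)))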

A.1.11 Matrix decompositions

The observation that the eigenvectors of a symmetric matrix are orthogonal leads directly to the spectral decomposition formula below.

Definition 27 (Spectral decomposition). A symmetric matrix A has an eigen-decomposition

    A = ∑_{i=1}^{n} λ_i e^i (e^i)^T    (A.1.46)

where λ_i is the eigenvalue of eigenvector e^i and the eigenvectors form an orthogonal set,

    (e^i)^T e^j = δ_ij (e^i)^T e^i    (A.1.47)

In matrix notation,

    A = E Λ E^T    (A.1.48)

where E is the matrix of eigenvectors and Λ the corresponding diagonal eigenvalue matrix. More generally, for a square non-symmetric non-singular A we can write

    A = E Λ E^{-1}    (A.1.49)

Definition 28 (Singular Value Decomposition). The SVD decomposition of an n × p matrix X is

    X = U S V^T    (A.1.50)

where dim U = n × n with U^T U = I_n. Also dim V = p × p with V^T V = I_p. The matrix S has dim S = n × p with zeros everywhere except on the diagonal entries. The singular values are the diagonal entries [S]_ii and are positive. The singular values are ordered so that the upper left diagonal element of S contains the largest singular value.

Quadratic forms

Definition 29 (Quadratic form).

    x^T A x + x^T b    (A.1.51)

Definition 30 (Positive definite matrix). A symmetric matrix A with the property that x^T A x ≥ 0 for any vector x is called nonnegative definite. A symmetric matrix A with the property that x^T A x > 0 for any vector x ≠ 0 is called positive definite. A positive definite matrix has full rank and is thus invertible. Using the eigen-decomposition of A,

    x^T A x = ∑_i λ_i x^T e^i (e^i)^T x = ∑_i λ_i (x^T e^i)^2    (A.1.52)

which is greater than zero if and only if all the eigenvalues are positive. Hence A is positive definite if and only if all its eigenvalues are positive.
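A numerical sketch of the spectral decomposition (A.1.46)/(A.1.48), the singular value decomposition (A.1.50) and the eigenvalue criterion for positive definiteness of Definition 30:

import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
A = B @ B.T + 3 * np.eye(3)                 # symmetric positive definite by construction

# Spectral decomposition: A = E Lambda E^T = sum_i lambda_i e_i e_i^T
lam, E = np.linalg.eigh(A)
print(np.allclose(E @ np.diag(lam) @ E.T, A))
print(np.allclose(sum(l * np.outer(e, e) for l, e in zip(lam, E.T)), A))

# Positive definite if and only if all eigenvalues are positive
print(np.all(lam > 0))

# Singular value decomposition of a rectangular matrix: X = U S V^T
X = rng.standard_normal((4, 2))
U, s, Vt = np.linalg.svd(X)
S = np.zeros((4, 2))
S[:2, :2] = np.diag(s)
print(np.allclose(U @ S @ Vt, X))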

A.2 Matrix Identities

Definition 31 (Trace-Log formula). For a positive definite matrix A,

    trace(log A) ≡ log det(A)    (A.2.1)

Note that the logarithm of a matrix above is not the element-wise logarithm. In MATLAB the required function is logm. In general, for an analytic function f(x), f(M) is defined via the power-series expansion of the function. On the right, since det(A) is a scalar, the logarithm is the standard logarithm of a scalar.

Definition 32 (Matrix Inversion Lemma (Woodbury formula)). Provided the appropriate inverses exist,

    (A + U V^T)^{-1} = A^{-1} − A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}    (A.2.2)

Eigenfunctions

    ∫_x K(x', x) φ_a(x) = λ_a φ_a(x')    (A.2.3)

By an argument analogous to the one that proves the corresponding result of linear algebra above, the eigenfunctions of a real symmetric kernel, K(x', x) = K(x, x'), are orthogonal:

    ∫_x φ_a(x) φ*_b(x) = δ_ab    (A.2.4)

where φ*(x) is the complex conjugate of φ(x). This definition of the inner product is useful, and particularly natural in the context of translation invariant kernels. We are free to define the inner product, but this conjugate form is often the most useful.

From the previous results, we know that a symmetric real matrix K must have a decomposition in terms of eigenvectors with positive, real eigenvalues. Since this is to be true for any dimension of matrix, it suggests that we need the (real symmetric) kernel function itself to have a decomposition (provided the eigenvalues are countable)

    K(x^i, x^j) = ∑_μ λ_μ φ_μ(x^i) φ*_μ(x^j)    (A.2.5)

since then

    ∑_{i,j} y_i K(x^i, x^j) y_j = ∑_{i,j,μ} λ_μ y_i φ_μ(x^i) φ*_μ(x^j) y_j = ∑_μ λ_μ ( ∑_i y_i φ_μ(x^i) ) ( ∑_i y_i φ_μ(x^i) )* = ∑_μ λ_μ z_μ z*_μ    (A.2.6)

which is greater than zero if the eigenvalues are all positive (since for complex z, z z* ≥ 0). If the eigenvalues are uncountable (which happens when the domain of the kernel is unbounded), the appropriate decomposition is

    K(x^i, x^j) = ∫ λ(s) φ(x^i, s) φ*(x^j, s) ds    (A.2.7)
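The Trace-Log formula (A.2.1) and the Woodbury identity (A.2.2) can be verified numerically; the sketch below assumes SciPy is available for the matrix logarithm logm (the MATLAB function mentioned in the text), and uses arbitrary random matrices.

import numpy as np
from scipy.linalg import logm    # matrix logarithm (not element-wise)

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)                  # positive definite

# trace(log A) = log det(A)  (A.2.1)
sign, logdet = np.linalg.slogdet(A)
print(np.isclose(np.trace(logm(A)).real, logdet))

# Woodbury: (A + U V^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}
U = rng.standard_normal((4, 2))
V = rng.standard_normal((4, 2))
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ V.T)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.eye(2) + V.T @ Ainv @ U) @ V.T @ Ainv
print(np.allclose(lhs, rhs))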

A.3 Multivariate Calculus

Figure A.4: Interpreting the gradient. The ellipses are contours of constant function value, f = const. At any point x, the gradient vector ∇f(x) points along the direction of maximal increase of the function.

Definition 33 (Partial derivative). Consider a function of n variables, f(x_1, x_2, ..., x_n), or f(x). The partial derivative of f with respect to x_1 at x is defined as the following limit (when it exists):

    ∂f/∂x_1 |_x = lim_{h→0} [ f(x_1 + h, x_2, ..., x_n) − f(x) ] / h    (A.3.1)

The gradient vector of f will be denoted by ∇f or g:

    ∇f(x) ≡ g(x) ≡ ( ∂f/∂x_1, ..., ∂f/∂x_n )^T    (A.3.2)

A.3.1 Interpreting the gradient vector

Consider a function f(x) that depends on a vector x. We are interested in how the function changes when the vector x changes by a small amount: x → x + δ, where δ is a vector whose length is very small. According to a Taylor expansion, the function will change to

    f(x + δ) = f(x) + ∑_i δ_i ∂f/∂x_i + O(|δ|^2)    (A.3.3)

We can interpret the summation above as the scalar product between the vector ∇f with components [∇f]_i = ∂f/∂x_i and δ:

    f(x + δ) = f(x) + (∇f)^T δ + O(|δ|^2)    (A.3.4)

The gradient points along the direction in which the function increases most rapidly. Why? Consider a direction p̂ (a unit length vector). Then a displacement of δ units along this direction changes the function value to

    f(x + δ p̂) ≈ f(x) + δ ∇f(x) · p̂    (A.3.5)

The direction p̂ for which the function has the largest change is that which maximises the overlap

    ∇f(x) · p̂ = |∇f(x)| |p̂| cos θ = |∇f(x)| cos θ    (A.3.6)

where θ is the angle between p̂ and ∇f(x). The overlap is maximised when θ = 0, giving p̂ = ∇f(x)/|∇f(x)|. Hence, the direction along which the function changes most rapidly is along ∇f(x).
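An illustrative sketch of section A.3.1: a finite difference approximation of the gradient (Definition 33), and a check that among unit directions the normalised gradient gives the largest local increase. The quadratic test function is made up for the example.

import numpy as np

def f(x):
    return x[0]**2 + 3 * x[1]**2 + x[0] * x[1]

def grad_fd(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient (Definition 33)."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = h
        g[i] = (f(x + dx) - f(x - dx)) / (2 * h)
    return g

x = np.array([1.0, -2.0])
g = grad_fd(f, x)
print(g)    # approximately the analytic gradient (2x + y, 6y + x) = (0, -11)

# Steepest-ascent check: among random unit directions, the gradient direction wins
rng = np.random.default_rng(4)
dirs = rng.standard_normal((1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
eps = 1e-3
best = dirs[np.argmax([f(x + eps * p) for p in dirs])]
print(best, g / np.linalg.norm(g))    # nearly the same direction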

Multivariate Calculus which is usually written @ 2 f @x i, i 6 j @ 2 f @x i 2, i j (A.3.8) If the partial derivatives /@x i, / and @ 2 f/@x i are continuous, then @ 2 f/@x i exists and @ 2 f/@x i @ 2 f/ @x i. (A.3.9) These n 2 second partial derivatives are represented by a square, symmetric matrix called the Hessian matrix of f(x). 0 @ 2 f @x 2... A.3.3 H f (x) B @ Chain rule. @ 2 f @x @x n... @ 2 f @x @x n. @ 2 f @x n 2 C A (A.3.0) Consider f(x,...,x n ). Now let each x j be parameterized by u,...,u m, i.e. x j x j (u,...,u m ). What is /@u? So f f(x + x,...,x n + x n ) f(x,...,x n ) x j f Therefore mx nx @u mx j u + higher order terms nx j x j + higher order terms @u u + higher order terms (A.3.) Definition 34 (Chain rule). @u nx j @u (A.3.2) or in vector notation @ f(x(u)) rf T (x(u)) @x(u) @u @u (A.3.3) Definition 35 (Directional derivative). Assume f is di erentiable. We define the scalar directional derivative (D v f)(x ) of f in a direction v at a point x. Let x x + hv, Then (D v f)(x ) d dh f(x + hv) h0 X j v j xx rf T v (A.3.4) A.3.4 Matrix calculus DRAFT March 9, 200 553

A.3.4 Matrix calculus

Definition 36 (Derivative of a matrix trace). For matrices A and B,

    ∂/∂A trace(AB) = B^T    (A.3.15)

Definition 37 (Derivative of log det(A)).

    ∂ log det(A) = ∂ trace(log A) = trace(A^{-1} ∂A)    (A.3.16)

so that

    ∂/∂A log det(A) = (A^{-1})^T    (A.3.17)

Definition 38 (Derivative of a matrix inverse). For an invertible matrix A,

    ∂A^{-1} = −A^{-1} (∂A) A^{-1}    (A.3.18)

A.4 Inequalities

A.4.1 Convexity

Definition 39 (Convex function). A function f(x) is defined as convex if for any x, y and 0 ≤ λ ≤ 1,

    f(λ x + (1 − λ) y) ≤ λ f(x) + (1 − λ) f(y)    (A.4.1)

If −f(x) is convex, f(x) is called concave.

An intuitive picture of a convex function is obtained by considering first the quantity λ x + (1 − λ) y. As we vary λ from 1 to 0, this traces points between x (λ = 1) and y (λ = 0). Hence for λ = 1 we start at the point x, f(x) and, as λ decreases, we trace a straight line towards the point y, f(y) at λ = 0. Convexity states that the function f always lies below this straight line. Geometrically this means that the gradient of the function f(x) is never decreasing. Hence if d^2 f(x)/dx^2 > 0 the function is convex. As an example, the function log x is concave since its second derivative is negative:

    d/dx log x = 1/x,        d^2/dx^2 log x = −1/x^2    (A.4.2)

A.4.2 Jensen's inequality

For a convex function f(x), it follows directly from the definition of convexity that

    f( ⟨x⟩_{p(x)} ) ≤ ⟨ f(x) ⟩_{p(x)}    (A.4.3)

for any distribution p(x).
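Finally, an illustrative numerical check (not from the original text) of the log det derivative (A.3.17) and of Jensen's inequality (A.4.3) applied to the concave function log x, for which log⟨x⟩ ≥ ⟨log x⟩; the matrix and the sample distribution are arbitrary choices.

import numpy as np

rng = np.random.default_rng(5)

# d/dA log det(A) = A^{-T}  (A.3.17), checked entrywise by central finite differences
A = 3 * np.eye(3) + rng.standard_normal((3, 3))    # well-conditioned, non-symmetric
h = 1e-6
grad_fd = np.zeros_like(A)
for i in range(3):
    for j in range(3):
        dA = np.zeros_like(A)
        dA[i, j] = h
        grad_fd[i, j] = (np.linalg.slogdet(A + dA)[1]
                         - np.linalg.slogdet(A - dA)[1]) / (2 * h)
print(np.allclose(grad_fd, np.linalg.inv(A).T, atol=1e-5))

# Jensen's inequality for the concave log: log E[x] >= E[log x]
x = rng.gamma(shape=2.0, scale=1.0, size=100000)
print(np.log(x.mean()) >= np.log(x).mean())        # True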