Large Scale Data Analysis Using Deep Learning
Linear Algebra
U Kang
Seoul National University
In This Lecture
- Overview of linear algebra (not a comprehensive survey)
- Focused on the subset most relevant to deep learning
- For details on linear algebra, refer to Linear Algebra and Its Applications by Gilbert Strang
Scalar
- A scalar is a single number: an integer, a real number, a rational number, etc.
- We denote it with an italic font: a, n, x
Vectors
- A vector is a 1-D array of numbers
- Entries can be real, binary, integer, etc.
- Notation for type and size: x ∈ ℝ^n denotes a real vector with n entries
Matrices
- A matrix is a 2-D array of numbers
- Example notation for type and shape: A ∈ ℝ^{m×n}, a real matrix with m rows and n columns
Tensors
- A tensor is an array of numbers that may have:
  - Zero dimensions (a scalar)
  - One dimension (a vector)
  - Two dimensions (a matrix)
  - Three dimensions or more, e.g., a matrix over time, or a knowledge base of (subject, verb, object) triples
Matrix Transpose
- (A^T)_{i,j} = A_{j,i}
- (AB)^T = B^T A^T
Matrix Product
- C = AB, where C_{i,j} = Σ_k A_{i,k} B_{k,j}
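A minimal NumPy sketch of the matrix product and the transpose identity from the previous slide; the concrete 2 x 2 entries are illustrative, not from the lecture:

```python
import numpy as np

# Illustrative 2 x 2 matrices (made up for this sketch).
A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[5., 6.],
              [7., 8.]])

C = A @ B                        # C[i, j] = sum_k A[i, k] * B[k, j]

# Transpose identity: (AB)^T = B^T A^T
assert np.allclose(C.T, B.T @ A.T)
```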
Matrix Product
- Matrix product as a sum of outer products: AB = Σ_i A_{:,i} B_{i,:}
- Product with a diagonal matrix scales each component: diag(v) x = v ⊙ x
Identity Matrix
- The identity matrix preserves every vector: I_n x = x
- Example identity matrix: I_3, the 3 x 3 matrix with ones on the diagonal and zeros elsewhere
Systems of Equations
- Ax = b expands to:
  A_{1,1} x_1 + A_{1,2} x_2 + ... + A_{1,n} x_n = b_1
  ...
  A_{m,1} x_1 + A_{m,2} x_2 + ... + A_{m,n} x_n = b_m
Solving Systems of Equations
- A linear system of equations can have:
  - No solution: 3x + 2y = 6, 3x + 2y = 12
  - Many solutions: 3x + 2y = 6, 6x + 4y = 12
  - Exactly one solution: this means multiplication by the matrix is an invertible function; Ax = b implies x = A^{-1} b
Matrix Inversion
- Matrix inverse: an inverse A^{-1} of an n x n matrix A satisfies A^{-1} A = A A^{-1} = I_n
- Solving a system using an inverse: x = A^{-1} b
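A short sketch of solving an invertible system in NumPy; the matrix and right-hand side are made up for illustration. In practice, np.linalg.solve is preferred over forming A^{-1} explicitly:

```python
import numpy as np

# Illustrative invertible system (values are made up).
A = np.array([[3., 2.],
              [1., -1.]])
b = np.array([6., 1.])

x_inv = np.linalg.inv(A) @ b       # textbook route: x = A^{-1} b
x_solve = np.linalg.solve(A, b)    # numerically preferred: no explicit inverse

assert np.allclose(x_inv, x_solve)
assert np.allclose(A @ x_solve, b)
```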
Invertibility
- Matrix A cannot be inverted if it has:
  - More rows than columns
  - More columns than rows
  - Redundant rows/columns (linearly dependent; low rank, i.e., not full rank)
  - The number 0 as an eigenvalue
- A non-invertible matrix is called a singular matrix
- An invertible matrix is called non-singular
Linear Dependence and Span
- Linear combination of vectors {v^(1), ..., v^(n)}: Σ_i c_i v^(i)
- The span of a set of vectors is the set of all points obtainable by linear combination of the original vectors
- The matrix-vector product Ax can be viewed as a linear combination of the column vectors of A: Ax = Σ_i x_i A_{:,i}
- The span of the columns of A is called the column space or range of A
- A set of vectors is linearly independent if no vector in the set is a linear combination of the other vectors; otherwise it is linearly dependent
Rank of a Matrix
- Rank of a matrix A: the number of linearly independent columns (or rows) of A
- For example: what is the rank of a given matrix A? Why?
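A quick NumPy check of the rank, using an assumed example matrix whose third column is the sum of the first two:

```python
import numpy as np

# Illustrative matrix: column 3 = column 1 + column 2,
# so only two columns are linearly independent.
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [2., 3., 5.]])

print(np.linalg.matrix_rank(A))   # 2
```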
Norms
- Functions that measure how large a vector is
- Similar to a distance between zero and the point represented by the vector
- Formally, a norm is any function f that satisfies:
  - f(x) = 0 implies x = 0
  - f(x + y) ≤ f(x) + f(y) (triangle inequality)
  - f(αx) = |α| f(x) for all scalars α
Norms
- L^p norm: ||x||_p = (Σ_i |x_i|^p)^{1/p}
- Most popular norm: L^2 norm (p = 2), the Euclidean distance; for a matrix/tensor, the Frobenius norm (L_F) plays the same role
- L^1 norm (p = 1): called the Manhattan distance
- Max norm (p = ∞): ||x||_∞ = max_i |x_i|
- Dot product in terms of norms: x^T y = ||x||_2 ||y||_2 cos θ, where θ is the angle between x and y
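A sketch of the norms above via np.linalg.norm; the vector and matrix values are illustrative:

```python
import numpy as np

x = np.array([3., -4.])                  # made-up vector

l2   = np.linalg.norm(x)                 # Euclidean: sqrt(9 + 16) = 5.0
l1   = np.linalg.norm(x, ord=1)          # Manhattan: |3| + |-4| = 7.0
lmax = np.linalg.norm(x, ord=np.inf)     # max norm: max(|3|, |-4|) = 4.0

A = np.array([[1., 2.], [3., 4.]])       # made-up matrix
fro = np.linalg.norm(A, ord='fro')       # Frobenius: sqrt(1 + 4 + 9 + 16)
```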
Special Matrices and Vectors
- Diagonal matrix, 2 x 2 example: [a 0; 0 b]
- We use diag(v) to denote the diagonal matrix where v is the vector containing the diagonal elements
- The inverse of a diagonal matrix is computed easily: diag(v)^{-1} = diag([1/v_1, ..., 1/v_n])
Special Matrices and Vectors
- Unit vector: a vector with unit norm, ||x||_2 = 1
- Symmetric matrix: A = A^T
- Orthogonal matrix: a square matrix with A^T A = A A^T = I, i.e., A^{-1} = A^T
Eigenvector and Eigenvalue
- Eigenvector and eigenvalue of A: Av = λv for a nonzero vector v
- Eigenvalue/eigenvector pairs are defined only for a square matrix A ∈ ℝ^{n×n}
- Number of eigenvalues: n (there may be duplicates)
- Eigenvalues and eigenvectors can contain real or complex numbers
Intuition
- A as a vector transformation:
  Ax = [2 1; 1 3] [1; 0] = [2; 1]
Intuition
- By definition, eigenvectors remain parallel to themselves ("fixed points" of the transformation):
  A v_1 = [2 1; 1 3] [0.52; 0.85] = λ_1 v_1 ≈ 3.62 [0.52; 0.85]
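A sketch verifying this eigenpair numerically with NumPy:

```python
import numpy as np

# The matrix from the intuition slides.
A = np.array([[2., 1.],
              [1., 3.]])

vals, vecs = np.linalg.eig(A)     # columns of `vecs` are unit eigenvectors
i = np.argmax(vals)               # largest eigenvalue, about 3.62

v1 = vecs[:, i]                   # about [0.52, 0.85] up to sign
assert np.allclose(A @ v1, vals[i] * v1)   # A v = lambda v
```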
Eigendecomposition
- Eigendecomposition of a matrix:
- Let A be a square (n x n) matrix with n linearly independent eigenvectors (i.e., diagonalizable)
- Then A can be factorized as A = V diag(λ) V^{-1}, where V is an (n x n) matrix whose i-th column is the i-th eigenvector of A, and λ is the vector of eigenvalues
Eigendecomposition
- Every real symmetric matrix has a real, orthogonal eigendecomposition: A = Q Λ Q^T
- A real orthogonal matrix is a square matrix with real entries whose columns and rows are orthogonal unit vectors (orthonormal vectors):
  Q^T Q = Q Q^T = I, so Q^T = Q^{-1}
Eigendecomposition
- Interpreting matrix-vector multiplication Ax using the eigendecomposition: with A = Q Λ Q^T, computing Ax rotates x into the eigenvector basis, scales coordinate i by λ_i, and rotates back
Eigendecomposition
- Understanding the optimization of f(x) = x^T A x, subject to ||x||_2 = 1, using eigenvalues and eigenvectors
- Whenever x is a unit eigenvector of A, f(x) equals the corresponding eigenvalue; the maximum of f is the largest eigenvalue and the minimum is the smallest eigenvalue
Eigendecomposition
- Positive definite matrix: a real symmetric matrix whose eigenvalues are all positive
- Positive semidefinite matrix: all eigenvalues are non-negative
- Negative definite matrix: all eigenvalues are negative
- Negative semidefinite matrix: all eigenvalues are non-positive
- For a positive semidefinite matrix A: for all x, x^T A x ≥ 0
- For a positive definite matrix A: for all x ≠ 0, x^T A x > 0
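A sketch of checking definiteness through the eigenvalues, reusing the symmetric matrix from the intuition slides:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 3.]])        # symmetric matrix from earlier slides

vals = np.linalg.eigvalsh(A)    # eigenvalues of a symmetric matrix, ascending
print(vals)                     # about [1.38, 3.62] -> all positive

if np.all(vals > 0):
    print("positive definite")  # hence x^T A x > 0 for every x != 0
```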
Singular Value Decomposition (SVD)
- Similar to eigendecomposition
- More general: the matrix need not be square
- A = U Λ V^T
SVD - Definition
- A [n x m] = U [n x r] Λ [r x r] (V [m x r])^T
- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts); left singular vectors
- Λ: r x r diagonal matrix (strength of each "concept"); r is the rank of the matrix
- V: m x r matrix (m terms, r concepts); right singular vectors
SVD - Definition
- A [n x m] = U [n x r] Λ [r x r] (V [m x r])^T
- [Diagram: A (n x m) equals U (n x r) times Λ (r x r) times V^T (r x m)]
SVD - Properties
- THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Λ V^T, where:
  - U, Λ, V: unique (*)
  - U, V: column orthonormal (i.e., columns are unit vectors, orthogonal to each other): U^T U = I; V^T V = I (I: identity matrix)
  - Λ: diagonal; the singular values are positive and sorted in decreasing order
SVD - Properties
- A = U Λ V^T = Σ_i σ_i (u_i v_i^T)
- Truncating the sum at k terms gives the best rank-k approximation of A in L_F (Frobenius norm)
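A sketch of the rank-k truncation in NumPy, on an assumed random matrix; by the Eckart-Young theorem, the Frobenius error of the best rank-k approximation equals the norm of the discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))                 # illustrative random data

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) Vt

k = 2                                           # keep the top-2 "concepts"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # sum of top-k sigma_i u_i v_i^T

err = np.linalg.norm(A - A_k, ord='fro')
print(err, np.sqrt(np.sum(s[k:] ** 2)))         # the two values agree
```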
SVD and Eigendecomposition
- The left-singular vectors of A are the eigenvectors of AA^T
- The right-singular vectors of A are the eigenvectors of A^T A
- The non-zero singular values of A are the square roots of the eigenvalues of A^T A
Moore-Penrose Pseudoinverse
- Assume we want to solve Ax = y for x
- What if A is not invertible? What if A is not square?
- We can still find the "best" x by using the pseudoinverse
- For an (n x m) matrix A, the pseudoinverse A^+ is an (m x n) matrix
Moore-Penrose Pseudoinverse
- If the equation has:
  - Exactly one solution: this is the same as the inverse
  - No solution: this gives us the x with the smallest error ||Ax - y||_2 (over-specified case)
  - Many solutions: this gives us the solution with the smallest norm ||x||_2 (under-specified case)
Pseudoinverse: Over-specified Case
- No solution: the pseudoinverse gives the x with the smallest error ||Ax - y||_2
- Example: [3 2]^T [x] = [1 2]^T (i.e., 3x = 1, 2x = 2)
- [Figure: the desired point y = (1, 2) vs. the reachable points (3x, 2x)]
- ([3 2]^T)^+ [1 2]^T = [3/13 2/13] [1 2]^T = 7/13
- This method is called least squares
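The same over-specified example in NumPy (sketch):

```python
import numpy as np

# Over-specified system: 3x = 1, 2x = 2 has no exact solution.
A = np.array([[3.],
              [2.]])            # shape (2, 1): two equations, one unknown
y = np.array([1., 2.])

x = np.linalg.pinv(A) @ y       # least-squares solution
print(x)                        # [7/13], about [0.5385]
```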
Pseudoinverse: Under-specified Case
- Many solutions: the pseudoinverse gives the solution with the smallest norm ||x||_2
- Example: [1 2] [w z]^T = 4 (i.e., 1w + 2z = 4)
- [Figure: the line of all possible solutions in the (w, z) plane, and the shortest-length solution]
- [1 2]^+ · 4 = [1/5 2/5]^T · 4 = [4/5 8/5]^T
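And the under-specified example (sketch):

```python
import numpy as np

# Under-specified system: 1w + 2z = 4 has infinitely many solutions.
A = np.array([[1., 2.]])        # shape (1, 2): one equation, two unknowns
y = np.array([4.])

x = np.linalg.pinv(A) @ y       # minimum-norm solution
print(x)                        # [0.8, 1.6] = [4/5, 8/5]
assert np.isclose(A @ x, 4.)    # it really solves the equation
```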
Computing the Pseudoinverse
- The SVD allows the computation of the pseudoinverse: A^+ = V Λ^+ U^T, where Λ^+ is obtained by taking the reciprocal of each non-zero singular value of Λ and transposing the result
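A sketch of the SVD route, checked against np.linalg.pinv; it assumes a full-rank A so every singular value is non-zero:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))             # illustrative full-rank matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_plus = Vt.T @ np.diag(1.0 / s) @ U.T      # reciprocal of each sigma
                                            # (assumes all sigmas non-zero)

assert np.allclose(A_plus, np.linalg.pinv(A))
```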
Trace
- Tr(A) = Σ_i A_{i,i}
- Tr(ABC) = Tr(CAB) = Tr(BCA)
- Tr(A) = Tr(A^T)
- ||A||_F = √(Tr(AA^T))
- For a scalar a, Tr(a) = a
- The trace of a matrix also equals the sum of all its eigenvalues
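A numerical check of these identities on assumed random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))

assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))      # cyclic
assert np.isclose(np.trace(A), np.trace(A.T))                    # transpose
assert np.isclose(np.linalg.norm(A, 'fro'),
                  np.sqrt(np.trace(A @ A.T)))                    # Frobenius
assert np.isclose(np.trace(A),
                  np.sum(np.linalg.eigvals(A)).real)             # eigenvalues
```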
What you need to know
- Linear algebra is crucial for understanding the notation and mechanisms of many ML/deep learning methods
- Important concepts: matrix product, identity, invertibility, norm, eigendecomposition, singular value decomposition, pseudoinverse, and trace
Questions?