Linear Algebra. Lecture slides for Chapter 2 of Deep Learning Ian Goodfellow


Linear Algebra. Lecture slides for Chapter 2 of Deep Learning. Ian Goodfellow. 2016-06-24

About this chapter. Not a comprehensive survey of all of linear algebra. Focused on the subset most relevant to deep learning. For a larger subset see, e.g., Linear Algebra by Georgi Shilov.

Scalars. A scalar is a single number: integers, real numbers, rational numbers, etc. We denote it with italic font: $a$, $n$, $x$.

Vectors. A vector is a 1-D array of numbers:

$x = [x_1, x_2, \ldots, x_n]^\top$ (2.1)

Can be real, binary, integer, etc. Example notation for type and size: $x \in \mathbb{R}^n$.

We can index a vector with a set of indices and write the set as a subscript: defining $S = \{1, 3, 6\}$, $x_S$ contains the elements $x_1$, $x_3$ and $x_6$. We use the $-$ sign to index the complement of a set: $x_{-1}$ is the vector containing all elements of $x$ except for $x_1$, and $x_{-S}$ contains all elements of $x$ except $x_1$, $x_3$ and $x_6$.

Matrices. A matrix is a 2-D array of numbers, so each element is identified by two indices: the first gives the row, the second the column. We usually give matrices upper-case variable names, such as $A$. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square brackets:

$A = \begin{bmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{bmatrix}$ (2.2)

Example notation for type and shape: if a real-valued matrix $A$ has $m$ rows and $n$ columns, we say that $A \in \mathbb{R}^{m \times n}$. Sometimes we need to index matrix-valued expressions that are not just a single letter; in this case we use subscripts after the expression without converting anything to lower case, e.g. $f(A)_{i,j}$ gives element $(i, j)$ of the matrix computed by applying the function $f$ to $A$.

Tensors. In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. A tensor is an array of numbers that may have: zero dimensions, and be a scalar; one dimension, and be a vector; two dimensions, and be a matrix; or more dimensions. We identify the element of a tensor $\mathsf{A}$ at coordinates $(i, j, k)$ by writing $\mathsf{A}_{i,j,k}$.
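To make these definitions concrete, here is a minimal NumPy sketch (not part of the original slides) showing a scalar, a vector, a matrix, and a 3-axis tensor as arrays of increasing dimensionality:

```python
import numpy as np

s = np.array(5.0)                       # scalar: zero dimensions
v = np.array([1.0, 2.0, 3.0])           # vector: one dimension
M = np.array([[1.0, 2.0],
              [3.0, 4.0]])              # matrix: two dimensions
T = np.zeros((2, 3, 4))                 # tensor: three dimensions

for name, a in [("scalar", s), ("vector", v), ("matrix", M), ("tensor", T)]:
    print(name, "ndim =", a.ndim, "shape =", a.shape)
```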

Matrix Transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right starting from its upper left corner (Fig. 2.1). We denote the transpose of $A$ as $A^\top$, and it is defined such that

$(A^\top)_{i,j} = A_{j,i}$ (2.3)

Matrix operations have many useful properties that make mathematical analysis more convenient. Matrix multiplication is distributive and associative:

$A(B + C) = AB + AC$ (2.6)
$A(BC) = (AB)C$ (2.7)

It is not commutative (the condition $AB = BA$ does not always hold), unlike scalar multiplication. However, the dot product between two vectors is commutative:

$x^\top y = y^\top x$ (2.8)

The transpose of a matrix product has a simple form:

$(AB)^\top = B^\top A^\top$ (2.9)

Vectors can be thought of as matrices that contain only one column; the transpose of a vector is therefore a matrix with only one row.

Matrix (Dot) Product.

$C = AB$ (2.4)

is defined by

$C_{i,j} = \sum_k A_{i,k} B_{k,j}$ (2.5)

The inner shapes must match: an $m \times n$ matrix times an $n \times p$ matrix yields an $m \times p$ matrix.
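As a sanity check, a minimal NumPy sketch (not from the slides) that implements Eq. 2.5 directly with explicit summation and compares it against the built-in matrix product:

```python
import numpy as np

def matmul_from_definition(A, B):
    """Matrix product via Eq. 2.5: C[i, j] = sum_k A[i, k] * B[k, j]."""
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must match"
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(3, 4)   # m x n
B = np.random.rand(4, 2)   # n x p
assert np.allclose(matmul_from_definition(A, B), A @ B)   # result is m x p
```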

Identity Matrix. Matrix inversion is a powerful tool that allows us to analytically solve $Ax = b$ for many values of $A$. To describe matrix inversion, we first need to define the concept of an identity matrix: a matrix that does not change any vector when we multiply that vector by it. We denote the identity matrix that preserves $n$-dimensional vectors as $I_n \in \mathbb{R}^{n \times n}$. Formally,

$\forall x \in \mathbb{R}^n, \; I_n x = x$ (2.20)

The structure of the identity matrix is simple: all of the entries along the main diagonal are 1, while all of the other entries are zero, e.g. $I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$ (Fig. 2.2). Matrix-vector product notation provides a more compact representation for systems of equations of the form

$A_{1,1} x_1 + A_{1,2} x_2 + \cdots + A_{1,n} x_n = b_1$
$\vdots$
$A_{m,1} x_1 + A_{m,2} x_2 + \cdots + A_{m,n} x_n = b_m$

Systems of Equations.

$Ax = b$ (2.11)

where $A \in \mathbb{R}^{m \times n}$ is a known matrix, $b \in \mathbb{R}^m$ is a known vector, and $x \in \mathbb{R}^n$ is a vector of unknown variables. This expands to

$A_{1,:} x = b_1$ (2.12)
$A_{2,:} x = b_2$ (2.13)
$\cdots$ (2.14)
$A_{m,:} x = b_m$ (2.15)

Solving Systems of Equations. A linear system of equations can have: no solution; many solutions; or exactly one solution, which means multiplication by the matrix is an invertible function.

Matrix Inversion. The matrix inverse of $A$ is denoted as $A^{-1}$, and it is defined as the matrix such that

$A^{-1} A = I_n$ (2.21)

Solving a system using an inverse, we can solve Eq. 2.11 by the following steps:

$Ax = b$ (2.22)
$A^{-1} A x = A^{-1} b$ (2.23)
$I_n x = A^{-1} b$ (2.24)
$x = A^{-1} b$

Numerically unstable, but useful for abstract analysis.
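A minimal NumPy sketch (not from the slides) contrasting the two routes; `np.linalg.solve` factorizes $A$ rather than forming $A^{-1}$ explicitly, which is why it is preferred in numerical practice:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Textbook route: form the inverse explicitly (fine for abstract
# analysis, but less stable and more expensive numerically).
x_via_inverse = np.linalg.inv(A) @ b

# Preferred route: solve the system directly via factorization.
x_via_solve = np.linalg.solve(A, b)

assert np.allclose(x_via_inverse, x_via_solve)
assert np.allclose(A @ x_via_solve, b)   # check that Ax = b holds
```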

Invertibility. A matrix can't be inverted if it has: more rows than columns; more columns than rows; or redundant rows/columns ("linearly dependent", "low rank").
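A small illustration, assuming NumPy, of detecting a rank-deficient (and therefore non-invertible) matrix before attempting inversion:

```python
import numpy as np

# Square but rank-deficient: the second row is twice the first,
# so the rows are linearly dependent and no inverse exists.
A_singular = np.array([[1.0, 2.0],
                       [2.0, 4.0]])
print(np.linalg.matrix_rank(A_singular))   # 1 < 2, i.e. low rank

try:
    np.linalg.inv(A_singular)
except np.linalg.LinAlgError as e:
    print("not invertible:", e)

# Non-square matrices (more rows than columns, or vice versa)
# have no two-sided inverse at all.
```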

Norms. Functions that measure how large a vector is. Similar to a distance between zero and the point represented by the vector. A norm is any function $f$ satisfying:

$f(x) = 0 \Rightarrow x = 0$
$f(x + y) \le f(x) + f(y)$ (the triangle inequality)
$\forall \alpha \in \mathbb{R}, \; f(\alpha x) = |\alpha| f(x)$

Norms. The $L^p$ norm is given by

$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$ (2.30)

for $p \in \mathbb{R}$, $p \ge 1$. Most popular norm: the $L^2$ norm, $p = 2$. The $L^1$ norm, $p = 1$, simplifies to

$\|x\|_1 = \sum_i |x_i|$ (2.31)

The $L^1$ norm is commonly used in machine learning when the difference between zero and nonzero elements is very important: every time an element of $x$ moves away from 0 by $\epsilon$, the $L^1$ norm increases by $\epsilon$. The max norm, infinite $p$, simplifies to the absolute value of the element with the largest magnitude in the vector:

$\|x\|_\infty = \max_i |x_i|$ (2.32)

We sometimes measure the size of a vector by counting its number of nonzero elements. Some authors refer to this function as the "$L^0$ norm", but this is incorrect terminology: the number of nonzero entries is not a norm, because scaling the vector by $\alpha$ does not change the number of nonzero entries. The $L^1$ norm is often used as a substitute for the number of nonzero entries. Sometimes we may also wish to measure the size of a matrix; in the context of deep learning, the most common way to do this is with the Frobenius norm, $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$.
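A minimal NumPy sketch (not from the slides) computing these norms for a concrete vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l1 = np.sum(np.abs(x))            # L1 norm: 7.0
l2 = np.sqrt(np.sum(x ** 2))      # L2 norm: 5.0
max_norm = np.max(np.abs(x))      # max norm: 4.0

# Same values via np.linalg.norm:
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(max_norm, np.linalg.norm(x, np.inf))

# "L0" (not actually a norm): number of nonzero entries.
print(np.count_nonzero(x))        # 2
```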

Special Matrices and Vectors. A unit vector is a vector with unit norm:

$\|x\|_2 = 1$ (2.36)

A vector $x$ and a vector $y$ are orthogonal to each other if $x^\top y = 0$. If both vectors have nonzero norm, this means they are at a 90 degree angle to each other. In $\mathbb{R}^n$, at most $n$ vectors may be mutually orthogonal with nonzero norm. If the vectors are not only orthogonal but also have unit norm, we call them orthonormal. A symmetric matrix is any matrix that is equal to its own transpose:

$A = A^\top$ (2.35)

Symmetric matrices often arise when the entries are generated by some function of two arguments that does not depend on the order of the arguments. For example, in a matrix of distance measurements, with $A_{i,j}$ giving the distance from point $i$ to point $j$, we have $A_{i,j} = A_{j,i}$ because distance functions are symmetric. An orthogonal matrix is a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

$A^\top A = A A^\top = I$ (2.37)

This implies that

$A^{-1} = A^\top$ (2.38)

so orthogonal matrices are of interest because their inverse is very cheap to compute. Non-square diagonal matrices do not have inverses, but it is still possible to multiply by them cheaply: for a non-square diagonal matrix $D$, computing $Dx$ involves scaling each element of $x$, and either concatenating some zeros to the result if $D$ is taller than it is wide, or discarding some of the last elements if $D$ is wider than it is tall.
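A small NumPy check (not from the slides) of these definitions, using a rotation matrix as the example orthogonal matrix:

```python
import numpy as np

theta = np.pi / 4
# 2-D rotation matrices are a classic example of orthogonal matrices.
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(Q.T @ Q, np.eye(2))     # columns are orthonormal
assert np.allclose(Q @ Q.T, np.eye(2))     # rows are orthonormal
assert np.allclose(np.linalg.inv(Q), Q.T)  # the inverse is just the transpose

# A symmetric matrix equals its own transpose.
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])
assert np.allclose(S, S.T)
```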

Trace.

$\mathrm{Tr}(A) = \sum_i A_{i,i}$ (2.48)

The trace is invariant under cyclic permutation of the factors:

$\mathrm{Tr}(ABC) = \mathrm{Tr}(CAB) = \mathrm{Tr}(BCA)$ (2.51)
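A quick numerical check of the cyclic property, assuming NumPy; note that the three products have different shapes, yet the traces agree:

```python
import numpy as np

A = np.random.rand(2, 3)
B = np.random.rand(3, 4)
C = np.random.rand(4, 2)

# Cyclic permutations leave the trace unchanged even though the
# intermediate shapes differ: ABC is 2x2, CAB is 4x4, BCA is 3x3.
t1 = np.trace(A @ B @ C)
t2 = np.trace(C @ A @ B)
t3 = np.trace(B @ C @ A)
assert np.isclose(t1, t2) and np.isclose(t2, t3)
```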

Learning linear algebra. Do a lot of practice problems. Start out with lots of summation signs and indexing into individual entries. Eventually you will be able to mostly use matrix and vector product notation quickly and easily.