[POLS 8500] Review of Linear Algebra, Probability and Information Theory Professor Jason Anastasopoulos ljanastas@uga.edu January 12, 2017
For today... Basic linear algebra. Basic probability. Programming in R.
Linear Algebra Basic linear algebra is one of the most important branches of mathematics to know when it comes to machine learning. Used in parameter estimation, dimensionality reduction, text analysis, etc.
Scalars a, n, x Scalars are single numbers, drawn from sets such as the integers $\mathbb{Z}$, the reals $\mathbb{R}$, the rationals $\mathbb{Q}$, etc. Generally denoted with italics.
Vectors

$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad (1)$$

A one-dimensional array of scalars. Dimension usually denoted by $\mathbb{R}^n$. More generally we would refer to a vector like this as $X \in \mathbb{R}^n$.
Matrices

$$X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix} \quad (2)$$

A matrix is a two-dimensional array of numbers with m rows and n columns. Dimension usually denoted by $\mathbb{R}^{m \times n}$. In this case we would say $X \in \mathbb{R}^{2 \times 2}$ if we wanted to refer to all 2x2 matrices containing real numbers.
Some examples... Precinct-level election data from 2016. The vote count by precinct is a vector $v \in \mathbb{R}^{54182}$. The entire data set could be considered a matrix $V \in \mathbb{R}^{8 \times 54182}$.
Tensors A tensor is just an array of arbitrary dimension, generalizing vectors, matrices and higher-dimensional arrays. Tensors have applications in deep learning, especially in image analysis.
A single color image is represented as a 3-d tensor. Three channels, $c = \{R, G, B\}$. Three stacked layers of pixel intensities: one matrix for each channel, $A_c \in \mathbb{N}^{m \times n}$, with values ranging over [0, 255].
Matrix operations: transpose

$$X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix} \qquad X^T = \begin{bmatrix} x_{11} & x_{21} & x_{31} \\ x_{12} & x_{22} & x_{32} \end{bmatrix}$$

Transposing a matrix essentially involves switching indices so that rows are now columns and vice versa.
Matrix operations: transpose properties For any matrices X and Y of conformable dimensions and any scalar a:

$$(X^T)^T = X \qquad (X + Y)^T = X^T + Y^T \qquad (aX)^T = aX^T \qquad (XY)^T = Y^T X^T$$
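These properties are easy to verify numerically. A small illustrative sketch in Python/NumPy (the course itself uses R, and the matrices below are arbitrary random examples, not from the slides):

```python
import numpy as np

# Illustrative check of the transpose properties on random matrices.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))
Y = rng.normal(size=(2, 4))
W = rng.normal(size=(3, 2))  # same shape as X, for the sum rule
a = 2.5

assert np.allclose(X.T.T, X)              # (X^T)^T = X
assert np.allclose((X + W).T, X.T + W.T)  # (X + W)^T = X^T + W^T
assert np.allclose((a * X).T, a * X.T)    # (aX)^T = a X^T
assert np.allclose((X @ Y).T, Y.T @ X.T)  # (XY)^T = Y^T X^T
```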
Matrix operations: multiplication For any two matrices $X \in \mathbb{R}^{m \times n}$ and $Y \in \mathbb{R}^{n \times p}$:

$$Z = XY, \qquad Z_{i,j} = \sum_{k=1}^{n} X_{i,k} Y_{k,j}, \qquad Z \in \mathbb{R}^{m \times p}$$
Matrix operations: multiplication For any two matrices X R mxn and Y R nxp. Distributive : X(Y + Z) = XY + XZ Associative : X(YZ) = (XY)Z
Vector operations: dot product The dot product has an interesting geometric interpretation. Given two vectors x and y, the dot product is the product of the $L^2$ norms of the vectors and the cosine of the angle between them:

$$x \cdot y = \|x\|_2 \|y\|_2 \cos(\theta)$$

This reduces to:

$$x \cdot y = \sum_{i=1}^{n} x_i y_i$$
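A quick illustrative check of the geometric interpretation, using two hypothetical 2-d vectors that sit 45 degrees apart:

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([1.0, 1.0])

dot = np.dot(x, y)  # sum_i x_i y_i
# Recover the angle from the dot-product identity
cos_theta = dot / (np.linalg.norm(x) * np.linalg.norm(y))
theta = np.arccos(cos_theta)

assert np.isclose(dot, 1.0)
assert np.isclose(theta, np.pi / 4)  # 45 degrees between the vectors
```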
Identity matrix

$$I_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

The identity matrix is a matrix that has 1s along the diagonal and 0s elsewhere. It has nice properties that can make programming more efficient.
Identity matrix For $X \in \mathbb{R}^{n \times n}$:

$$XI_n = I_nX = X \qquad X + YX + ZX = (I + Y + Z)X$$

For example, matrix multiplication is significantly more computationally intensive than matrix addition. In the equation above, the left-hand side (two multiplications) will be slower to evaluate than the right-hand side (one multiplication) as the matrix dimensions increase.
Systems of equations $X \in \mathbb{R}^{m \times n}$; $b \in \mathbb{R}^n$; $y \in \mathbb{R}^m$: $Xb = y$

$$\begin{aligned} x_{11}b_1 + x_{12}b_2 + \cdots + x_{1n}b_n &= y_1 \\ x_{21}b_1 + x_{22}b_2 + \cdots + x_{2n}b_n &= y_2 \\ &\;\;\vdots \\ x_{m1}b_1 + x_{m2}b_2 + \cdots + x_{mn}b_n &= y_m \end{aligned}$$

It is computationally efficient to solve systems of linear equations simultaneously using matrices. In the equations above, b are the unknown parameter values that we would like to solve for.
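Solving such a system in practice is one line. A sketch with a hypothetical 2x2 system (note that `np.linalg.solve` is preferred over forming the inverse explicitly):

```python
import numpy as np

# Hypothetical 2x2 system Xb = y
X = np.array([[2.0, 1.0],
              [1.0, 3.0]])
y = np.array([3.0, 5.0])

b = np.linalg.solve(X, y)  # solves Xb = y without explicitly inverting X
assert np.allclose(X @ b, y)
```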
Linear dependence, span and solutions As it turns out, for $X^{-1}$ to exist and, hence, to obtain a solution for this equation system, there must be exactly one solution for every value of y. No solutions, or infinitely many solutions, for a given y are also possible. It is not possible to have $1 < s < \infty$, where s is the number of solutions.
Linear dependence, span and solutions

$$Xb = \sum_{i=1}^{n} b_i X_{:,i} \quad (3)$$

There are two conditions involving the span and linear dependence of X which must be met to ensure that $X^{-1}$ exists, so let's explore these. Equation 3, which is a generalization of our system of equations, is known as a linear combination. A linear combination of a set of vectors $\{v_1, \dots, v_n\}$ in general is $\sum_{i=1}^{n} c_i v_i$.
Linear dependence, span and solutions As it turns out, the set of all $\sum_{i=1}^{n} c_i v_i$ defines the span of the columns of X, since it represents all of their possible linear combinations. In order to have a solution to $Xb = y$, y must be in the span of the columns of X. This then implies the first necessary, but not sufficient, condition for a solution: X must have at least m columns, i.e. $n \geq m$.
Linear dependence, span and solutions As mentioned, $n \geq m$ is a necessary but not sufficient condition for a solution. For example, you can have a case where n = m but the columns are identical, say a 3x3 matrix whose columns are all the same vector. This redundancy reduces the column space to $\mathbb{R}^1$, so effectively only one column counts and the condition above is not met. This is formally known as linear dependence.
Linear dependence, span and solutions

$$2 \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix}$$

Vectors are linearly independent if no vector is a linear combination of the others. The two vectors above, for example, are linearly dependent.
Linear dependence, span and solutions

$$\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$

Whereas these two vectors would be linearly independent.
Conditions for inversion/solution $Xb = y$. Taking all this together, solving this system of linear equations via matrix inversion requires that: 1. X be square (m = n) and; 2. X have linearly independent columns. A square matrix with linearly dependent columns is said to be singular. Systems of equations for matrices that are not square or are singular can still be solved, just not with matrix inversion.
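A NumPy sketch of singularity versus invertibility, using made-up matrices (the first has two identical columns, like the linearly dependent example above):

```python
import numpy as np

# A 3x3 matrix with two identical columns is singular: rank < 3, det = 0.
A = np.array([[1.0, 1.0, 2.0],
              [2.0, 2.0, 1.0],
              [3.0, 3.0, 0.0]])
assert np.linalg.matrix_rank(A) < 3
assert np.isclose(np.linalg.det(A), 0.0)

# A matrix with linearly independent columns is invertible.
B = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
assert np.allclose(B @ np.linalg.inv(B), np.eye(3))
```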
Norms

$$L^p = \|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p} \quad (4)$$

In machine learning, it is often useful to know the length of a vector, especially when computing loss functions. The norm is essentially a measure of the length of a vector. The $L^p$ norm is the general form of a vector norm.
Norms

$$L^1 = \|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|$$

$$L^2 = \|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$

The $L^2$ norm is used most often in machine learning and is sometimes written simply as $\|x\|$. The $L^2$ norm is often used to measure the Euclidean distance between two vectors or matrices.
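Both norms are one call in NumPy. An illustrative sketch with a classic 3-4-5 example vector:

```python
import numpy as np

x = np.array([3.0, -4.0])
l1 = np.linalg.norm(x, 1)  # L1: |3| + |-4| = 7
l2 = np.linalg.norm(x)     # L2 (the default): sqrt(9 + 16) = 5

assert np.isclose(l1, 7.0)
assert np.isclose(l2, 5.0)
```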
Eigendecompositions Decompositions of functions and numbers can tell us a lot about their properties. Eg) integers into prime factors: $12 = 2 \times 2 \times 3$ tells us that any multiple of 12 will also be divisible by 2 and 3. Matrix decompositions serve a similar function. Eigendecompositions break matrices up into eigenvectors and eigenvalues.
Eigenvectors and eigenvalues

$$Xv = \lambda v \quad (5)$$

An eigenvector of a square matrix X is a non-zero vector v such that multiplying it by X only alters the scale of v. $\lambda$ is known as the eigenvalue corresponding to the eigenvector. From a geometric standpoint, eigenvectors are like the core of a matrix that is invariant to rotational transformations.
Eigendecomposition

$$X = Q \Lambda Q^T \quad (6)$$

Every real symmetric matrix can be decomposed into an expression using eigenvectors and eigenvalues. $\Lambda$ is a diagonal matrix of the eigenvalues corresponding to the eigenvectors contained in the columns of Q. Can be useful in dimensionality reduction.
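A sketch of the decomposition on a hypothetical symmetric 2x2 matrix, reconstructing X from its parts:

```python
import numpy as np

# Real symmetric matrix: eigh returns eigenvalues and an orthonormal Q
X = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, Q = np.linalg.eigh(X)  # eigenvalues ascending; columns of Q are eigenvectors
Lam = np.diag(eigvals)

assert np.allclose(Q @ Lam @ Q.T, X)    # X = Q Lambda Q^T
assert np.allclose(Q.T @ Q, np.eye(2))  # Q is orthogonal
```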
Singular value decomposition

$$X = UDV^T \quad (7)$$

A more general method for decomposing any real matrix. Also used in dimensionality reduction. Example: measuring legislative influence.
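Unlike the eigendecomposition, the SVD works on any real matrix, square or not. An illustrative sketch on a random rectangular matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))  # any real matrix, square or not

# Thin SVD: U is 4x3, d holds the singular values, Vt is 3x3
U, d, Vt = np.linalg.svd(X, full_matrices=False)

assert np.allclose(U @ np.diag(d) @ Vt, X)  # X = U D V^T
assert np.all(d[:-1] >= d[1:])              # singular values sorted descending
```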
Motivation Probability is at the heart of many machine learning algorithms. Most machine learning methods involve the use of random variables and probability distributions. Many unsupervised methods use discrete probability distributions and Bayesian analysis.
Random variables Random variables are variables which take on values according to a probability distribution (randomly). For example, define $X = \{x_1, x_2, x_3, \dots, x_n\}$. Each $x_i \sim N(\mu, \sigma)$ can be thought of as a random draw from some distribution (in this case the Gaussian) with some parameters $\mu$ and $\sigma$.
Random variables Random variables can be discrete, taking on a finite or countably infinite number of values, or continuous, taking on an uncountably infinite number of values.
Probability distributions A probability distribution describes the likelihood that a random variable will take on certain values. Discrete random variables are described by probability mass functions (PMFs), which map numerical values to probabilities. Continuous random variables are described by probability density functions (PDFs).
Properties of PMFs A PMF is a function P that must satisfy:

$$0 \leq P(x) \leq 1 \quad \forall x \qquad \sum_x P(x) = 1$$
Properties of PDFs A PDF is a function p that must satisfy:

$$p(x) \geq 0 \qquad \int_{-\infty}^{+\infty} p(x)\,dx = 1$$
Marginal Probability When given two random variables, X and Y, we may be interested only in the distribution of one. We can accomplish this by marginalizing the distribution through summation or integration.

$$P(X = x) = \sum_y P(X = x, Y = y) = \sum_y P(X = x \mid Y = y)\,P(Y = y)$$
Marginal Probability In the continuous case, marginalization is done by integration:

$$p(x) = \int p(x, y)\,dy = \int p(x \mid y)\,p(y)\,dy$$
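In the discrete case, marginalization is literally summing a joint probability table along one dimension. A sketch with a small hypothetical joint PMF:

```python
import numpy as np

# Hypothetical joint PMF P(X, Y): rows index 2 values of X, columns 3 values of Y
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.05, 0.30]])
assert np.isclose(joint.sum(), 1.0)  # valid joint distribution

# Marginalize out Y: P(X = x) = sum_y P(X = x, Y = y)
p_x = joint.sum(axis=1)
assert np.allclose(p_x, [0.40, 0.60])
```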
Conditional Probability Often we're interested in the likelihood of an event occurring, given that another event has occurred.

$$P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}$$
Chain Rule of Conditional Probability Joint distributions can be decomposed into a series of conditional distributions. This comes in handy especially when trying to understand some of the nuances of Bayesian machine learning methods.

$$P(X_1, \dots, X_n) = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \dots, X_{i-1})$$
Chain Rule of Conditional Probability For two variables:

$$P(X, Y) = P(X)\,P(Y \mid X) = P(Y)\,P(X \mid Y)$$
Independence and Conditional Independence Independence implies $P(X, Y) = P(X)P(Y)$. Conditional independence implies $P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z)$.
Expectation, Variance and Covariance The expected value of a random variable X is its average value under its distribution.

Discrete: $E[X] = \sum_{i} x_i p_i$. Continuous: $E[X] = \int x f(x)\,dx$, where f is the PDF.
Expectation, Variance and Covariance Variance is a measure of how much variation exists within a random variable.

$$Var(X) = E\left[(X - E[X])^2\right]$$
Expectation, Variance and Covariance Covariance measures the extent to which one random variable increases or decreases with another random variable. Important for measuring properties of models and relationships between variables, eg) the correlation coefficient.

$$Cov(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$$
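All three quantities have natural sample analogues. An illustrative Monte Carlo sketch with made-up parameters, where y is constructed to covary positively with x:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=200_000)  # E[X] = 2, Var(X) = 1
y = 3.0 * x + rng.normal(size=200_000)            # Cov(X, Y) = 3 * Var(X) = 3

# Sample analogues of E[X], Var(X), Cov(X, Y)
mean_x = x.mean()
var_x = ((x - mean_x) ** 2).mean()
cov_xy = ((x - mean_x) * (y - y.mean())).mean()

assert abs(mean_x - 2.0) < 0.05
assert abs(var_x - 1.0) < 0.05
assert abs(cov_xy - 3.0) < 0.1
```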
Common probability distributions Bernoulli: models the probability of success for a single event (eg. a coin flip). Gaussian or normal distribution: models a very wide variety of phenomena: distributions of populations, sampling distributions, etc.
Bernoulli distribution

$$P(X = x) = \theta^x (1 - \theta)^{1-x} \qquad E[x] = \theta \qquad Var(x) = \theta(1 - \theta)$$

Imagine a biased coin in which the probability of heads is $\theta = 0.8$. Then $P(X = \text{tails} = 0) = 0.8^0 (1 - 0.8)^{1-0} = 0.2$, $E[x] = 0.8$, and $Var[x] = 0.2 \times 0.8 = 0.16$.
Bernoulli distribution

$$P(X = x) = \theta^x (1 - \theta)^{1-x} \qquad E[x] = \theta \qquad Var(x) = \theta(1 - \theta)$$

Controlled by one parameter $\theta \in [0, 1]$. Models the probability of two outcomes, 0 and 1.
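The biased-coin example can be checked by simulation. A sketch using the fact that a Bernoulli draw is a Binomial(1, θ) draw:

```python
import numpy as np

theta = 0.8  # probability of heads, as in the biased-coin example
rng = np.random.default_rng(4)
draws = rng.binomial(n=1, p=theta, size=100_000)  # Bernoulli = Binomial(1, theta)

assert abs(draws.mean() - theta) < 0.01               # E[x] = theta
assert abs(draws.var() - theta * (1 - theta)) < 0.01  # Var(x) = theta(1 - theta)
```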
Bernoulli distribution [figure: plot of the Bernoulli PMF]
Gaussian (normal) distribution

$$N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \qquad E[x] = \mu \qquad Var[x] = \sigma^2$$

Parameterized by the mean $\mu$, which determines the location of the distribution, and the variance $\sigma^2$, which controls how wide or narrow the distribution is.
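The density formula above can be written out directly and checked numerically; the sketch below verifies (on a standard normal, a choice made for illustration) that it integrates to 1 and peaks at the mean:

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Gaussian density N(x; mu, sigma^2), written out from the formula."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Numerical check: the density integrates to ~1 over a wide grid
grid = np.linspace(-10, 10, 100_001)
dx = grid[1] - grid[0]
area = normal_pdf(grid, mu=0.0, sigma2=1.0).sum() * dx
assert np.isclose(area, 1.0, atol=1e-4)

# The peak sits at the mean, with height 1/sqrt(2*pi*sigma^2)
assert np.isclose(normal_pdf(0.0, 0.0, 1.0), 1 / np.sqrt(2 * np.pi))
```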
Gaussian (normal) distribution [figure: plot of the Gaussian PDF]
Central limit theorem The sum of many independent random variables, suitably standardized, converges in distribution to the Gaussian. Several different proofs of the CLT exist. It is the basis of sampling theory.
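The CLT is easy to see by simulation. A sketch summing uniform draws (the sample sizes here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Sum of 200 independent Uniform(0, 1) draws, repeated 20,000 times
n, reps = 200, 20_000
sums = rng.random((reps, n)).sum(axis=1)

# CLT: the standardized sums are approximately N(0, 1)
z = (sums - n * 0.5) / np.sqrt(n / 12.0)  # Uniform(0,1) has mean 1/2, variance 1/12

assert abs(z.mean()) < 0.05
assert abs(z.std() - 1.0) < 0.05
# Roughly 95% of the mass lies within about 1.96 standard deviations
assert abs((np.abs(z) < 1.96).mean() - 0.95) < 0.01
```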