[POLS 8500] Review of Linear Algebra, Probability and Information Theory


[POLS 8500] Review of Linear Algebra, Probability and Information Theory
Professor Jason Anastasopoulos ljanastas@uga.edu
January 12, 2017

For today... Basic linear algebra. Basic probability. Programming in R.

Linear Algebra Basic linear algebra is one of the most important branches of mathematics to know when it comes to machine learning. Used in parameter estimation, dimensionality reduction, text analysis, etc.

Scalars a, n, x. Scalars are single numbers and include elements of Z, R, Q, etc. Generally denoted with italics.

Vectors
$$X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \qquad (1)$$
A one-dimensional array of scalars. Dimension usually denoted by $\mathbb{R}^n$. More generally we would refer to a vector like this as $X \in \mathbb{R}^n$.

Matrices
$$X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \end{bmatrix} \qquad (2)$$
A matrix is a two-dimensional array of numbers with m rows and n columns. Dimension usually denoted by $\mathbb{R}^{m \times n}$. In this case we would say $X \in \mathbb{R}^{2 \times 2}$ if we wanted to refer to all 2x2 matrices containing real numbers.

Some examples... Precinct-level election data from 2016. The vote count by precinct is a vector $v \in \mathbb{R}^{54182}$. The entire data set could be considered a matrix $V \in \mathbb{R}^{8 \times 54182}$.
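
Since the course will use R, here is a minimal sketch of how data like this maps onto R objects. The numbers below are made up for illustration, not the actual 2016 precinct data.

```r
# Hypothetical small example: a vote-count vector and a precinct-by-variable matrix.
v <- c(120, 85, 240, 310)                # vote counts for 4 precincts; v is in R^4
V <- matrix(rpois(4 * 3, lambda = 100),  # a 4 x 3 matrix of made-up counts; V is in R^(4x3)
            nrow = 4, ncol = 3)
length(v)   # 4
dim(V)      # 4 3
```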

Tensors A tensor is a general array that covers vectors, matrices, and higher-dimensional arrays. Tensors have applications in deep learning, especially in image analysis.

A single color image is represented as a 3D tensor. Three channels, c = {R, G, B}. Three stacked layers of pixel intensities. One matrix for each channel, with each matrix $A_c \in \mathbb{N}^{m \times n}$. Values range over [0, 255].

Matrix operations: transpose
$$X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix}, \qquad X^T = \begin{bmatrix} x_{11} & x_{21} & x_{31} \\ x_{12} & x_{22} & x_{32} \end{bmatrix}$$
Transposing a matrix essentially involves switching indices so that rows are now columns and vice versa.

Matrix operations: transpose properties For matrices X and Y of compatible dimensions and a scalar a:
$$(X^T)^T = X$$
$$(X + Y)^T = X^T + Y^T$$
$$(aX)^T = aX^T$$
$$(XY)^T = Y^T X^T$$

Matrix operations: multiplication For any two matrices $X \in \mathbb{R}^{m \times n}$ and $Y \in \mathbb{R}^{n \times p}$:
$$Z = XY, \qquad Z_{i,j} = \sum_{k=1}^{n} X_{i,k} Y_{k,j}, \qquad Z \in \mathbb{R}^{m \times p}$$

Matrix operations: multiplication For matrices X, Y, and Z of compatible dimensions:
Distributive: $X(Y + Z) = XY + XZ$
Associative: $X(YZ) = (XY)Z$
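
A quick R sketch of these rules, using small matrices with arbitrary random entries, checking the transpose product rule numerically:

```r
set.seed(1)
X <- matrix(rnorm(6), nrow = 2)    # X in R^(2x3)
Y <- matrix(rnorm(12), nrow = 3)   # Y in R^(3x4)
Z <- X %*% Y                       # matrix product; Z[i,j] = sum_k X[i,k] * Y[k,j]
dim(Z)                             # 2 4, i.e. Z is in R^(2x4)
all.equal(t(Z), t(Y) %*% t(X))     # TRUE: (XY)^T = Y^T X^T
```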

Vector operations: dot product The dot product has an interesting geometrical interpretation. Given two vectors x and y, the dot product is the product of the vectors' $L_2$ norms and the cosine of the angle between them:
$$x \cdot y = \|x\|_2 \, \|y\|_2 \cos(\theta)$$
In coordinates this reduces to:
$$x \cdot y = \sum_{i=1}^{n} x_i y_i$$
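
As a sketch, the same relationship in R, using the dot product of two arbitrary vectors to recover the angle between them:

```r
x <- c(1, 2, 3)
y <- c(4, 0, -1)
dot <- sum(x * y)                      # sum_i x_i y_i
l2  <- function(v) sqrt(sum(v^2))      # L2 norm
theta <- acos(dot / (l2(x) * l2(y)))   # angle (radians), from x . y = ||x|| ||y|| cos(theta)
theta
```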

Identity matrix
$$I_4 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
The identity matrix is a matrix that has 1s along the diagonal and 0s elsewhere. It has nice properties that can make programming more efficient.

Identity matrix For $X \in \mathbb{R}^{n \times n}$:
$$XI_n = I_n X = X$$
$$X + YX + ZX = (I + Y + Z)X$$
For example, matrix multiplication is significantly more computationally intensive than matrix addition. In the equation above, the left-hand side will be slower to evaluate than the right-hand side as the matrix dimensions increase.
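
A small numerical check of these identities in R (random matrices, purely illustrative):

```r
set.seed(2)
n <- 3
X <- matrix(rnorm(n^2), n); Y <- matrix(rnorm(n^2), n); Z <- matrix(rnorm(n^2), n)
I <- diag(n)                                          # n x n identity matrix
all.equal(X %*% I, X)                                 # TRUE: X I = X
all.equal(X + Y %*% X + Z %*% X, (I + Y + Z) %*% X)   # TRUE: factoring saves a multiplication
```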

Systems of equations $X \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^n$, $y \in \mathbb{R}^m$:
$$Xb = y$$
$$x_{11}b_1 + x_{12}b_2 + \cdots + x_{1n}b_n = y_1$$
$$x_{21}b_1 + x_{22}b_2 + \cdots + x_{2n}b_n = y_2$$
$$\vdots$$
$$x_{m1}b_1 + x_{m2}b_2 + \cdots + x_{mn}b_n = y_m$$
It is computationally efficient to simultaneously solve systems of linear equations using matrices. In the equations above, b contains the unknown parameter values that we would like to solve for.
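
A minimal sketch of solving such a system in R with solve(), assuming a small square, invertible X:

```r
X <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)  # coefficient matrix
y <- c(5, 10)
b <- solve(X, y)    # solves X b = y for the unknown b
X %*% b             # recovers y
```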

Linear dependence, span and solutions As it turns out, for $X^{-1}$ to exist and, hence, to obtain a solution for this equation system, there must be exactly one solution for every value of y. No solutions, or infinitely many solutions, for a given y are also possible. It is not possible to have $1 < s < \infty$ solutions, where s is the number of possible solutions.

Linear dependence, span and solutions
$$Xb = \sum_{i=1}^{n} b_i X_{:,i} \qquad (3)$$
There are two conditions involving the span and linear dependence of X which must be met to ensure that $X^{-1}$ exists, so let's explore these. Equation 3, which is a generalization of our system of equations, is known as a linear combination. A linear combination of a set of vectors $\{v_1, \ldots, v_n\}$ in general is $\sum_{i=1}^{n} c_i v_i$.

Linear dependence, span and solutions As it turns out, the set of all such combinations $\sum_{i=1}^{n} c_i v_i$ of the columns of X defines the span of the columns of X, since it represents all of the possible linear combinations of those columns. In order to have a solution to $Xb = y$, y must be in the span of the columns of X. This then implies the first necessary, but not sufficient, condition for a solution, which is that X must have at least m columns, i.e. $n \geq m$.

Linear dependence, span and solutions As mentioned, $n \geq m$ is a necessary but not sufficient condition to have a solution. For example, you can have a case where n = m, say a 3x3 matrix, but where the columns are identical. This redundancy reduces the column space to $\mathbb{R}^1$, so that effectively only one distinct column remains and the condition above is not met. This is formally known as linear dependence.

Linear dependence, span and solutions
$$\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \qquad \begin{bmatrix} 2 \\ 2 \\ 2 \end{bmatrix} = 2 \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$$
Vectors are linearly independent if no vector is a linear combination of the others. The two vectors above, for example, are linearly dependent.

Linear dependence, span and solutions
$$\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \qquad \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$$
Whereas these two vectors would be linearly independent.

Conditions for inversion/solution $Xb = y$. Taking all this together, the conditions for solving this system of linear equations using matrix inversion require that: (1) X be square, m = n, and (2) X have linearly independent columns. A square matrix with linearly dependent columns is said to be singular. Systems of equations for matrices that are not square or are singular can still be solved, just not with matrix inversion.
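
A sketch of how linear dependence shows up numerically in R: the matrix below has a second column equal to twice the first, so it is singular and cannot be inverted.

```r
X_sing <- matrix(c(1, 2, 1,
                   1, 2, 2,
                   1, 2, 3), nrow = 3, byrow = TRUE)
qr(X_sing)$rank   # 2 < 3: columns are linearly dependent
det(X_sing)       # 0 (up to floating point); solve(X_sing) would throw an error
```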

Norms
$$L_p = \|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p} \qquad (4)$$
In machine learning, it is often useful to know the length of a vector, especially when it involves the computation of loss functions. The norm is essentially a measure of the length of a vector. The $L_p$ norm is the general form of a vector norm.

Norms
$$L_1 = \|x\|_1 = |x_1| + |x_2| + \cdots + |x_n|$$
$$L_2 = \|x\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
The $L_2$ norm is used most often in machine learning and is sometimes written simply as $\|x\|$. The $L_2$ norm is often used to measure the Euclidean distance between two vectors or matrices.
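
For instance, computed directly in R:

```r
x <- c(3, -4)
sum(abs(x))            # L1 norm: |3| + |-4| = 7
sqrt(sum(x^2))         # L2 norm: sqrt(9 + 16) = 5
y <- c(0, 0)
sqrt(sum((x - y)^2))   # Euclidean (L2) distance between x and y: 5
```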

Eigendecompositions Decompositions of functions and numbers can tell us a lot about their properties. E.g. decomposing an integer into prime factors, 12 = 2 x 2 x 3, tells us that any multiple of 12 will also be divisible by 2 and 3. Matrix decompositions serve a similar function. Eigendecomposition is the breaking up of a matrix into eigenvectors and eigenvalues.

Eigenvectors and eigenvalues
$$Xv = \lambda v \qquad (5)$$
An eigenvector of a square matrix X is a non-zero vector v such that multiplying it by X only alters the scale of v. λ is known as the eigenvalue corresponding to the eigenvector. From a geometric standpoint, eigenvectors are the directions that X merely stretches or shrinks rather than rotates; in that sense they capture the core of the matrix.

Eigendecomposition
$$X = Q \Lambda Q^T \qquad (6)$$
Can be useful in dimensionality reduction. Every real symmetric matrix can be decomposed into an expression using its eigenvectors and eigenvalues. Λ is a diagonal matrix of the eigenvalues corresponding to the eigenvectors in the columns of Q.
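
A sketch of this decomposition in R with eigen(), using an arbitrary symmetric matrix:

```r
X <- matrix(c(4, 1,
              1, 3), nrow = 2, byrow = TRUE)   # real symmetric matrix
e <- eigen(X)
Q      <- e$vectors                            # eigenvectors in the columns
Lambda <- diag(e$values)                       # diagonal matrix of eigenvalues
all.equal(X, Q %*% Lambda %*% t(Q))            # TRUE: X = Q Lambda Q^T
```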

Singular value decomposition
$$X = UDV^T \qquad (7)$$
A more general method of decomposing any real matrix. Also used in dimensionality reduction. Example: measuring legislative influence.
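
And the SVD of an arbitrary rectangular matrix via svd() (a sketch, not the legislative-influence application itself):

```r
set.seed(3)
X <- matrix(rnorm(12), nrow = 4, ncol = 3)    # any real 4 x 3 matrix
s <- svd(X)
all.equal(X, s$u %*% diag(s$d) %*% t(s$v))    # TRUE: X = U D V^T
s$d                                           # singular values, largest first
```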

Motivation Probability is at the heart of many machine learning algorithms. Most machine learning methods involve the use of random variables and probability distributions. Many unsupervised methods use discrete probability distributions and Bayesian analysis.

Random variables Random variables are variables which take on values according to a probability distribution (randomly). For example, define $X = \{x_1, x_2, x_3, \ldots, x_n\}$. Each $x_i \sim N(\mu, \sigma^2)$ can be thought of as a random draw from some distribution (in this case the Gaussian) with parameters µ and σ².

Random variables Random variables can be discrete, taking on a finite or countably infinite number of values, or continuous, taking on an uncountably infinite number of values.

Probability distributions A probability distribution describes the likelihood that a random variable will take on certain values. Discrete random variables are described by probability mass functions (PMFs), which map numerical values to probabilities. Continuous random variables are described by a probability density function (PDF).

Properties of PMFs A PMF is a function P that must satisfy:
$$0 \leq P(x) \leq 1 \ \text{for all}\ x$$
$$\sum_x P(x) = 1$$

Properties of PDFs A PDF is a function p that must satisfy:
$$p(x) \geq 0$$
$$\int_{-\infty}^{+\infty} p(x)\,dx = 1$$

Marginal Probability When given two random variables, X and Y, we may be interested only in the distribution of one. We can accomplish this by marginalizing the joint distribution through summation or integration. For discrete random variables:
$$P(X = x) = \sum_y P(X = x, Y = y) = \sum_y P(X = x \mid Y = y)\,P(Y = y)$$

Marginal Probability For continuous random variables, marginalization is done by integration:
$$p(x) = \int_y p(x, y)\,dy = \int_y p(x \mid y)\,p(y)\,dy$$
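
A sketch of marginalization for a small, made-up discrete joint PMF in R:

```r
# Hypothetical joint PMF P(X, Y) with X in {1, 2, 3} and Y in {a, b}.
joint <- matrix(c(0.10, 0.20,
                  0.30, 0.15,
                  0.05, 0.20),
                nrow = 3, byrow = TRUE,
                dimnames = list(X = 1:3, Y = c("a", "b")))
sum(joint)        # 1: a valid joint PMF
rowSums(joint)    # marginal P(X = x): 0.30 0.45 0.25
colSums(joint)    # marginal P(Y = y): 0.45 0.55
```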

Conditional Probability Often we're interested in the likelihood of an event occurring given that another event has occurred.
$$P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}$$

Chain Rule of Conditional Probability Joint distributions can be decomposed into a series of conditional distributions. This comes in handy especially when trying to understand some of the nuances of Bayesian machine learning methods.
$$P(X_1, \ldots, X_n) = P(X_1) \prod_{i=2}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$$

Chain Rule of Conditional Probability For two random variables, this reduces to:
$$P(X, Y) = P(X)P(Y \mid X) = P(Y)P(X \mid Y)$$

Independence and Conditional Independence Independence implies:
$$P(X, Y) = P(X)P(Y)$$
Conditional independence implies:
$$P(X, Y \mid Z) = P(X \mid Z)P(Y \mid Z)$$

Expectation, Variance and Covariance The expected value of a random variable X with PDF f(x) is the average value of the random variable under that distribution.
Discrete: $E[X] = \sum_{i} x_i p_i$
Continuous: $E[X] = \int x f(x)\,dx$

Expectation, Variance and Covariance Variance is a measure of how much variation exists within a random variable.
$$\mathrm{Var}(X) = E\left[(X - E[X])^2\right]$$

Expectation, Variance and Covariance Covariance measures the extent to which one random variable increases or decreases with another random variable. Important for measuring properties of models and relationships between variables, e.g. the correlation coefficient.
$$\mathrm{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$$
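
A quick simulation sketch in R, checking these definitions against sample estimates (the data-generating values below are arbitrary):

```r
set.seed(4)
x <- rnorm(1e5, mean = 2, sd = 1)        # E[X] = 2, Var(X) = 1
y <- 0.5 * x + rnorm(1e5, sd = 0.5)      # Cov(X, Y) = 0.5 * Var(X) = 0.5
mean(x)      # approx 2
var(x)       # approx 1
cov(x, y)    # approx 0.5
cor(x, y)    # the correlation coefficient, Cov(X, Y) / (sd(X) sd(Y))
```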

Common probability distributions Bernoulli: models the probability of success for a single binary event (e.g. a coin flip). Gaussian or normal distribution: models a very wide variety of phenomena, e.g. distributions of populations, sampling distributions, etc.

Bernoulli distribution
$$P(X = x) = \theta^x (1 - \theta)^{1 - x}, \qquad E[X] = \theta, \qquad \mathrm{Var}(X) = \theta(1 - \theta)$$
Imagine a biased coin in which the probability of heads is θ = 0.8.
$$P(X = \text{Tails}) = P(X = 0) = 0.8^0 (1 - 0.8)^1 = 0.2$$
$$E[X] = 0.8, \qquad \mathrm{Var}(X) = 0.8 \times 0.2 = 0.16$$

Bernoulli distribution
$$P(X = x) = \theta^x (1 - \theta)^{1 - x}, \qquad E[X] = \theta, \qquad \mathrm{Var}(X) = \theta(1 - \theta)$$
Controlled by one parameter $\theta \in [0, 1]$. Models the probability of two outcomes, 0 and 1.

[Figure: Bernoulli distribution]
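
A sketch of the biased-coin example in R, where a Bernoulli(θ) draw is a binomial draw with size = 1:

```r
theta <- 0.8
dbinom(0, size = 1, prob = theta)              # P(X = 0) = 0.2
draws <- rbinom(10000, size = 1, prob = theta)
mean(draws)                                    # approx E[X] = 0.8
var(draws)                                     # approx Var(X) = 0.16
```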

Gaussian (normal) distribution
$$N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
$$E[X] = \mu, \qquad \mathrm{Var}(X) = \sigma^2$$
Parameterized by the mean µ, which determines the location of the distribution, and the variance σ², which controls how wide or narrow the distribution is.

[Figure: Gaussian (normal) distribution]
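
A short R sketch of the Gaussian density and its two parameters (µ and σ chosen arbitrarily):

```r
mu <- 0; sigma <- 1.5
dnorm(0, mean = mu, sd = sigma)                            # density N(0; mu, sigma^2)
x <- rnorm(1e5, mean = mu, sd = sigma)
c(mean(x), var(x))                                         # approx mu = 0 and sigma^2 = 2.25
curve(dnorm(x, mean = mu, sd = sigma), from = -5, to = 5)  # plot of the pdf
```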

Central limit theorem The sum of many independent random variables converges in distribution to the Gaussian (after appropriate centering and scaling). Several different forms of proof exist for the CLT. It is the basis of sampling theory.
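
A sketch of the CLT by simulation in R: sums of 50 independent Uniform(0, 1) draws are approximately Gaussian with matching mean and variance.

```r
set.seed(5)
sums <- replicate(10000, sum(runif(50)))           # each entry is a sum of 50 iid uniforms
hist(sums, breaks = 50, freq = FALSE)
curve(dnorm(x, mean = 50 * 0.5, sd = sqrt(50 / 12)),
      add = TRUE, lwd = 2)                         # Gaussian with the same mean and sd
```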