Recitation. Shuang Li, Amir Afsharinejad, Kaushik Patnaik. Machine Learning CS 7641, CSE/ISYE 6740, Fall 2014

Probability and Statistics

Basic Probability Concepts A sample space S is the set of all possible outcomes of a conceptual or physical, repeatable experiment (S can be finite or infinite). E.g., S may be the set of all possible outcomes of a dice roll: S = {1, 2, 3, 4, 5, 6}. E.g., S may be the set of all possible nucleotides of a DNA site: S = {A, C, G, T}. E.g., S may be the set of all possible time-space positions of an aircraft on a radar screen. An event A is any subset of S: seeing "1" or "6" in a dice roll; observing a "G" at a site; UA007 in a given space-time interval.

Probability An event space E is the set of possible worlds in which the outcomes can happen: all dice rolls, reading a genome, monitoring the radar signal. A probability P(A) is a function that maps an event A onto the interval [0, 1]. P(A) is also called the probability measure or probability mass of A. (Picture the sample space of all possible worlds as a region of area 1; P(A) is the area of the oval containing the worlds in which A is true, with the worlds in which A is false outside it.)

Kolmogorov Axioms For any events A, B ⊆ S: 0 ≤ P(A) ≤ 1; P(S) = 1; P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Random Variable A random variable is a function that associates a unique numerical value (a token) with every outcome of an experiment. (The value of the r.v. will vary as the experiment is repeated.) RV distributions: Continuous RV: the outcome of observing the measured location of an aircraft; the outcome of recording the true location of an aircraft. Discrete RV: the outcome of a dice roll; the outcome of a coin toss.

Discrete Prob. Distribution A probability distribution P defined on a discrete sample space S is an assignment of a non-negative real number P(s) to each sample s ∈ S. Probability Mass Function (PMF): p_X(x_i) = P[X = x_i]. Cumulative Mass Function (CMF): F_X(x_i) = P[X ≤ x_i] = Σ_{j: x_j ≤ x_i} p_X(x_j). Properties: p_X(x_i) ≥ 0 and Σ_i p_X(x_i) = 1. Examples: Bernoulli distribution: p_X(x) = 1 − p for x = 0 and p for x = 1. Binomial distribution: P(X = k) = (n choose k) p^k (1 − p)^(n−k).
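
As a quick numeric illustration of these PMFs, here is a minimal Python sketch (assuming NumPy and SciPy are available; the values of p and n are arbitrary):

    import numpy as np
    from scipy.stats import binom

    p, n = 0.3, 10

    # Bernoulli PMF from the slide: P(X=0) = 1-p, P(X=1) = p
    bernoulli_pmf = {0: 1 - p, 1: p}
    assert abs(sum(bernoulli_pmf.values()) - 1.0) < 1e-12

    # Binomial PMF: P(X=k) = (n choose k) p^k (1-p)^(n-k), via scipy.stats.binom
    k = np.arange(n + 1)
    pmf = binom.pmf(k, n, p)
    print(pmf.sum())           # ~1.0, the PMF sums to one
    print(binom.pmf(3, n, p))  # P(X = 3)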

Continuous Prob. Distribution A continuous random variable X is defined on a continuous sample space: an interval on the real line, a region in a high-dimensional space, etc. It is meaningless to talk about the probability of the random variable assuming a particular value: P(x) = 0. Instead, we talk about the probability of the random variable assuming a value within a given interval, half interval, or arbitrary Boolean combination of basic propositions. Cumulative Distribution Function (CDF): F_X(x) = P[X ≤ x]. Probability Density Function (PDF): F_X(x) = ∫_{−∞}^{x} f_X(t) dt, or equivalently f_X(x) = dF_X(x)/dx. Properties: f_X(x) ≥ 0 and ∫ f_X(x) dx = 1. Interpretation: f_X(x) ≈ P[X ∈ [x, x + Δ]] / Δ for small Δ.

Continuous Prob. Distribution Examples: Uniform density function: f_X(x) = 1/(b − a) for a ≤ x ≤ b, 0 otherwise. Exponential density function: f_X(x) = (1/μ) e^{−x/μ} for x ≥ 0, with CDF F_X(x) = 1 − e^{−x/μ} for x ≥ 0. Gaussian (Normal) density function: f_X(x) = 1/(√(2π) σ) e^{−(x − μ)² / (2σ²)}.
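
A small sketch evaluating these three densities with SciPy (the parameter values a, b, μ, σ below are just illustrative choices):

    import numpy as np
    from scipy.stats import uniform, expon, norm

    a, b, mu, sigma = 0.0, 2.0, 1.5, 0.8
    x = np.linspace(-1, 5, 7)

    f_unif = uniform.pdf(x, loc=a, scale=b - a)   # 1/(b-a) on [a, b], 0 elsewhere
    f_expo = expon.pdf(x, scale=mu)               # (1/mu) exp(-x/mu) for x >= 0
    f_gauss = norm.pdf(x, loc=mu, scale=sigma)    # Gaussian density

    # each density integrates to 1 (checked numerically on a fine grid)
    grid = np.linspace(-10, 20, 200001)
    dx = grid[1] - grid[0]
    print((norm.pdf(grid, mu, sigma) * dx).sum())  # ~1.0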

Continuous Prob. Distribution Gaussian distribution: if Z ~ N(0, 1), F_Z(x) = Φ(x) = ∫_{−∞}^{x} f_Z(z) dz = (1/√(2π)) ∫_{−∞}^{x} e^{−z²/2} dz. This has no closed-form expression, but it is built into most software packages (e.g., normcdf in the MATLAB stats toolbox).
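
In Python, the analogue of MATLAB's normcdf is scipy.stats.norm.cdf; a minimal sketch:

    from scipy.stats import norm

    # Phi(x) for a standard normal Z ~ N(0, 1)
    print(norm.cdf(0.0))     # 0.5
    print(norm.cdf(1.96))    # ~0.975

    # for X ~ N(mu, sigma^2), P(X <= x) = Phi((x - mu) / sigma)
    print(norm.cdf(2.0, loc=1.0, scale=0.5))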

Characterizations Expectation: the mean value, center of mass, first moment: E_X[g(X)] = ∫ g(x) p_X(x) dx. N-th moment: g(x) = x^n; N-th central moment: g(x) = (x − μ)^n. Mean: E_X[X] = ∫ x p_X(x) dx = μ; E[αX] = αE[X]; E[α + X] = α + E[X]. Variance (second central moment): Var(X) = E_X[(X − E_X[X])²] = E_X[X²] − E_X[X]²; Var(αX) = α² Var(X); Var(α + X) = Var(X).

Central Limit Theorem If X_1, X_2, ..., X_n are i.i.d. random variables and X̄ = (1/n) Σ_{i=1}^{n} X_i, the CLT states that the distribution of X̄ approaches a Gaussian with mean E[X_i] and variance Var(X_i)/n as n → ∞. Somewhat of a justification for assuming Gaussian noise.
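
A simulation sketch of the statement (assuming NumPy; uniform draws chosen just for illustration): averages of n i.i.d. uniforms concentrate around E[X_i] with variance Var(X_i)/n, and their histogram looks increasingly Gaussian.

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 50, 100_000

    # X_i ~ Uniform(0, 1): E[X_i] = 0.5, Var(X_i) = 1/12
    xbar = rng.random((reps, n)).mean(axis=1)

    print(xbar.mean())   # ~0.5
    print(xbar.var())    # ~(1/12)/n = 0.00167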

Joint RVs and Marginal Densities Joint densities: F_{X,Y}(x, y) = P[X ≤ x, Y ≤ y] = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{X,Y}(α, β) dβ dα. Marginal densities: f_X(x) = ∫ f_{X,Y}(x, β) dβ; p_X(x_i) = Σ_j p_{X,Y}(x_i, y_j). Expectation and covariance: E[X + Y] = E[X] + E[Y]; cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]; Var(X + Y) = Var(X) + 2 cov(X, Y) + Var(Y).

Conditional Probability P(X|Y) = fraction of the worlds in which X is true given that Y is also true. For example: H = having a headache, F = coming down with flu. P(Headache|Flu) = fraction of flu-inflicted worlds in which you have a headache. How to calculate? Definition: P(X|Y) = P(X ∩ Y) / P(Y). Corollary: P(X|Y) = P(Y|X) P(X) / P(Y), i.e. P(X|Y) P(Y) = P(Y|X) P(X). This is called Bayes Rule.

The Bayes Rule P(Headache|Flu) = P(Headache, Flu) / P(Flu) = P(Flu|Headache) P(Headache) / P(Flu). Other cases: P(Y|X) = P(X|Y) P(Y) / [P(X|Y) P(Y) + P(X|¬Y) P(¬Y)]; P(Y = y_i | X) = P(X | Y = y_i) P(Y = y_i) / Σ_{y_j ∈ S} P(X | Y = y_j) P(Y = y_j); P(Y|X, Z) = P(X|Y, Z) P(Y, Z) / P(X, Z) = P(X|Y, Z) P(Y, Z) / [P(X|Y, Z) P(Y, Z) + P(X|¬Y, Z) P(¬Y, Z)].
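
A numeric sketch of the flu/headache example (the probabilities below are made-up illustrations, not values from the slides):

    # assumed illustrative numbers
    p_flu = 0.02                 # P(Flu)
    p_head_given_flu = 0.9       # P(Headache | Flu)
    p_head_given_noflu = 0.1     # P(Headache | not Flu)

    # total probability: P(Headache) = P(H|F)P(F) + P(H|~F)P(~F)
    p_head = p_head_given_flu * p_flu + p_head_given_noflu * (1 - p_flu)

    # Bayes Rule: P(Flu | Headache) = P(H|F)P(F) / P(H)
    p_flu_given_head = p_head_given_flu * p_flu / p_head
    print(p_flu_given_head)      # ~0.155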

Rules of Independence Recall that for events E and H, the probability of E given H, written as P(E|H), is P(E|H) = P(E ∩ H) / P(H). E and H are (statistically) independent if P(E ∩ H) = P(E) P(H), or equivalently P(E) = P(E|H). That means the probability that E is true does not depend on whether H is true. E and F are conditionally independent given H if P(E|H, F) = P(E|H), or equivalently P(E, F|H) = P(E|H) P(F|H).

Rules of Independence Examples: P(Virus | DrinkBeer) = P(Virus) iff Virus is independent of DrinkBeer. P(Flu | Virus, DrinkBeer) = P(Flu | Virus) iff Flu is independent of DrinkBeer, given Virus. P(Headache | Flu, Virus, DrinkBeer) = P(Headache | Flu, DrinkBeer) iff Headache is independent of Virus, given Flu and DrinkBeer. Assuming the above independences, we obtain: P(Headache, Flu, Virus, DrinkBeer) = P(Headache | Flu, Virus, DrinkBeer) P(Flu | Virus, DrinkBeer) P(Virus | DrinkBeer) P(DrinkBeer) = P(Headache | Flu, DrinkBeer) P(Flu | Virus) P(Virus) P(DrinkBeer).

Multivariate Gaussian p(x | μ, Σ) = 1 / ((2π)^{n/2} |Σ|^{1/2}) exp{−(1/2)(x − μ)^T Σ^{−1} (x − μ)}. Moment parameterization: μ = E[X], Σ = Cov(X) = E[(X − μ)(X − μ)^T]. Mahalanobis distance: Δ² = (x − μ)^T Σ^{−1} (x − μ). Canonical parameterization: p(x | η, Λ) = exp{a + η^T x − (1/2) x^T Λ x}, where Λ = Σ^{−1}, η = Σ^{−1} μ, and a = −(1/2)(n log 2π − log|Λ| + η^T Λ^{−1} η). Tons of applications (MoG, FA, PPCA, Kalman filter, ...).
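
A sketch evaluating this density with scipy.stats.multivariate_normal and checking it against the explicit formula (the example mean and covariance are assumptions for illustration):

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    x = np.array([1.0, 0.0])

    # library evaluation
    p_lib = multivariate_normal.pdf(x, mean=mu, cov=Sigma)

    # explicit formula: (2 pi)^(-n/2) |Sigma|^(-1/2) exp(-0.5 * Mahalanobis^2)
    n = len(mu)
    d = x - mu
    maha2 = d @ np.linalg.inv(Sigma) @ d
    p_form = np.exp(-0.5 * maha2) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

    print(np.allclose(p_lib, p_form))   # True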

Multivariate Gaussian Joint Gaussian P(X_1, X_2) with μ = [μ_1; μ_2] and Σ = [Σ_11, Σ_12; Σ_21, Σ_22]. Marginal Gaussian P(X_2): μ_2^m = μ_2, Σ_2^m = Σ_22. Conditional Gaussian P(X_1 | X_2 = x_2): μ_{1|2} = μ_1 + Σ_12 Σ_22^{−1} (x_2 − μ_2), Σ_{1|2} = Σ_11 − Σ_12 Σ_22^{−1} Σ_21.
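
A small sketch of the conditional formulas (the block partition and numbers below are assumed, one-dimensional blocks just to keep it readable):

    import numpy as np

    mu1, mu2 = np.array([0.0]), np.array([1.0])
    S11 = np.array([[2.0]]); S12 = np.array([[0.8]])
    S21 = np.array([[0.8]]); S22 = np.array([[1.0]])
    x2 = np.array([2.0])

    # conditional P(X1 | X2 = x2)
    mu_1g2 = mu1 + S12 @ np.linalg.inv(S22) @ (x2 - mu2)
    S_1g2 = S11 - S12 @ np.linalg.inv(S22) @ S21
    print(mu_1g2, S_1g2)   # mean shifts toward x2 (0.8); variance shrinks from 2.0 to 1.36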

Operations on Gaussian R.V. The linear transform of a Gaussian r.v. is a Gaussian. Remember that no matter how X is distributed, E[AX + b] = A E[X] + b and Cov(AX + b) = A Cov(X) A^T; this means that for Gaussian distributed quantities, X ~ N(μ, Σ) implies AX + b ~ N(Aμ + b, AΣA^T). The sum of two independent Gaussian r.v.s is a Gaussian: Y = X_1 + X_2 with X_1 ⊥ X_2 gives μ_Y = μ_1 + μ_2, Σ_Y = Σ_1 + Σ_2. The multiplication of two Gaussian functions is another Gaussian function (although no longer normalized): N(a, A) N(b, B) ∝ N(c, C), where C = (A^{−1} + B^{−1})^{−1} and c = C A^{−1} a + C B^{−1} b.

MLE Example: toss a coin. Objective function: ℓ(θ; D) = log P(D | θ) = log [θ^n (1 − θ)^{N−n}] = n log θ + (N − n) log(1 − θ), where the data D consists of n heads in N tosses. We need to maximize this w.r.t. θ. Taking the derivative w.r.t. θ: dℓ/dθ = n/θ − (N − n)/(1 − θ) = 0, which gives θ_MLE = n/N.
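
A sketch of the coin-toss MLE: the closed form n/N, checked against a brute-force maximization of the log-likelihood (the counts below are assumed; the grid search is only for illustration):

    import numpy as np

    N, n = 100, 37                        # assumed counts: 37 heads out of 100 tosses

    theta_mle = n / N                     # closed-form solution from the derivative

    # brute-force check: maximize n log(theta) + (N - n) log(1 - theta)
    thetas = np.linspace(1e-6, 1 - 1e-6, 100001)
    loglik = n * np.log(thetas) + (N - n) * np.log(1 - thetas)
    print(theta_mle, thetas[np.argmax(loglik)])   # both ~0.37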

Linear Algebra

Outline Motivating example: Eigenfaces. Basics: dot and vector products; identity, diagonal and orthogonal matrices; trace; norms; rank and linear independence; range and null space; column and row space; determinant and inverse of a matrix. Eigenvalues and eigenvectors. Singular value decomposition. Matrix calculus.

Motivating Example - EigenFaces Consider the task of recognizing faces from images. Given images of size 512x512, each image contains 262,144 dimensions or features. Not all dimensions are equally important in classifying faces. Solution: use ideas from linear algebra, especially eigenvectors, to form a new, reduced set of features.

Linear Algebra Basics - I Linear algebra provides a way of compactly representing and operating on sets of linear equations. For example, 4x_1 − 5x_2 = −13, −2x_1 + 3x_2 = 9 can be written in the form Ax = b with A = [4, −5; −2, 3] and b = [−13; 9]. A ∈ R^{m×n} denotes a matrix with m rows and n columns, whose elements are real numbers. x ∈ R^n denotes a vector with n real entries. By convention, an n-dimensional vector is often thought of as a matrix with n rows and 1 column.
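
The same system solved with NumPy, a minimal sketch (the matrix matches the example as reconstructed above):

    import numpy as np

    A = np.array([[ 4.0, -5.0],
                  [-2.0,  3.0]])
    b = np.array([-13.0, 9.0])

    x = np.linalg.solve(A, b)   # solves Ax = b
    print(x)                    # [3. 5.]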

Linear Algebra Basics - II The transpose of a matrix results from flipping the rows and columns. Given A ∈ R^{m×n}, its transpose is A^T ∈ R^{n×m}. For each element of the matrix, (A^T)_ij = A_ji. The following properties of transposes are easily verified: (A^T)^T = A; (AB)^T = B^T A^T; (A + B)^T = A^T + B^T. A square matrix A ∈ R^{n×n} is symmetric if A = A^T and anti-symmetric if A = −A^T. Thus every square matrix can be written as the sum of a symmetric and an anti-symmetric matrix.

Vector and Matrix Multiplication - I The product of two matrices A ∈ R^{m×n} and B ∈ R^{n×p} is the matrix C = AB ∈ R^{m×p}, where C_ij = Σ_{k=1}^{n} A_ik B_kj. Given two vectors x, y ∈ R^n, the term x^T y (also x·y) is called the inner product or dot product of the vectors, and is the real number Σ_{i=1}^{n} x_i y_i. Given two vectors x ∈ R^n, y ∈ R^m, the term xy^T is called the outer product of the vectors, and is the n×m matrix with entries (xy^T)_ij = x_i y_j. For example,

Vector and Matrix Multiplication - II xy^T = [x_1; x_2; x_3] [y_1, y_2, y_3] = [x_1 y_1, x_1 y_2, x_1 y_3; x_2 y_1, x_2 y_2, x_2 y_3; x_3 y_1, x_3 y_2, x_3 y_3]. The dot product also has a geometric interpretation: for vectors x, y ∈ R^2 with angle θ between them, x·y = ||x|| ||y|| cos θ, which leads to the use of the dot product for testing orthogonality, computing the Euclidean norm of a vector, and scalar projections.
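
A quick NumPy sketch of the inner product, the outer product, and the angle formula (the vectors are arbitrary examples):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, -1.0])

    inner = x @ y              # scalar: sum_i x_i y_i
    outer = np.outer(x, y)     # 3x3 matrix with entries x_i y_j

    # geometric interpretation: x.y = ||x|| ||y|| cos(theta)
    cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(y))
    print(inner, outer.shape, cos_theta)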

Trace of a Matrix The trace of a matrix A ∈ R^{n×n}, denoted tr(A), is the sum of the diagonal elements of the matrix: tr(A) = Σ_{i=1}^{n} A_ii. The trace has the following properties: for A ∈ R^{n×n}, tr(A) = tr(A^T); for A, B ∈ R^{n×n}, tr(A + B) = tr(A) + tr(B); for A ∈ R^{n×n} and t ∈ R, tr(tA) = t·tr(A); for A, B, C such that ABC is a square matrix, tr(ABC) = tr(BCA) = tr(CAB). The trace of a matrix helps us easily compute norms and eigenvalues of matrices, as we will see later.

Linear Independence and Rank A set of vectors x_1, x_2, ..., x_n ∈ R^m is said to be (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. That is, if x_n = Σ_{i=1}^{n−1} α_i x_i for some scalar values α_1, α_2, ..., α_{n−1} ∈ R, then we say that the vectors are linearly dependent; otherwise the vectors are linearly independent. The column rank of a matrix A ∈ R^{m×n} is the size of the largest subset of columns of A that constitutes a linearly independent set. The row rank of a matrix is defined similarly for the rows of a matrix. For any matrix, row rank = column rank.
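
A sketch checking linear (in)dependence of columns via the rank (assuming NumPy; the matrix is constructed so the third column equals the sum of the first two):

    import numpy as np

    # third column = first column + second column  ->  columns are linearly dependent
    A = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [2.0, 3.0, 5.0]])

    print(np.linalg.matrix_rank(A))   # 2, not 3: column rank = row rank = 2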

Range and Null Space The span of a set of vectors x_1, x_2, ..., x_n is the set of all vectors that can be expressed as a linear combination of the set: span({x_1, ..., x_n}) = {v : v = Σ_{i=1}^{n} α_i x_i, α_i ∈ R}. If x_1, x_2, ..., x_n ∈ R^n is a set of n linearly independent vectors, then span({x_1, ..., x_n}) = R^n. The range of a matrix A ∈ R^{m×n}, denoted R(A), is the span of the columns of A. The nullspace of a matrix A ∈ R^{m×n}, denoted N(A), is the set of all vectors that equal 0 when multiplied by A: N(A) = {x ∈ R^n : Ax = 0}.

Column and Row Space The row space and column space are the linear subspaces generated by the row and column vectors of a matrix. A linear subspace is a vector space that is a subset of some other higher-dimensional vector space. For a matrix A ∈ R^{m×n}: colspace(A) = span(columns of A), and rank(A) = dim(rowspace(A)) = dim(colspace(A)).

Norms - I The norm of a vector x is, informally, a measure of its length. More formally, a norm is any function f : R^n → R that satisfies: for all x ∈ R^n, f(x) ≥ 0 (non-negativity); f(x) = 0 if and only if x = 0 (definiteness); for x ∈ R^n, t ∈ R, f(tx) = |t| f(x) (homogeneity); for all x, y ∈ R^n, f(x + y) ≤ f(x) + f(y) (triangle inequality). A common norm used in machine learning is the l2 norm: ||x||_2 = √(Σ_{i=1}^{n} x_i²).

Norms - II l1 norm: ||x||_1 = Σ_{i=1}^{n} |x_i|. l∞ norm: ||x||_∞ = max_i |x_i|. All norms presented so far are examples of the family of lp norms, which are parameterized by a real number p ≥ 1: ||x||_p = (Σ_{i=1}^{n} |x_i|^p)^{1/p}. Norms can also be defined for matrices, such as the Frobenius norm: ||A||_F = √(Σ_{i=1}^{m} Σ_{j=1}^{n} A_ij²) = √(tr(A^T A)).
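
The same norms computed with NumPy, a minimal sketch (the vector and matrix are arbitrary examples):

    import numpy as np

    x = np.array([3.0, -4.0, 0.0])
    A = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

    print(np.linalg.norm(x, 2))         # l2 norm: 5.0
    print(np.linalg.norm(x, 1))         # l1 norm: 7.0
    print(np.linalg.norm(x, np.inf))    # l-infinity norm: 4.0
    print(np.linalg.norm(A, 'fro'))     # Frobenius norm = sqrt(tr(A^T A))
    print(np.sqrt(np.trace(A.T @ A)))   # same value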

Identity, Diagonal and Orthogonal Matrices The identity matrix, denoted I ∈ R^{n×n}, is a square matrix with ones on the diagonal and zeros everywhere else. A diagonal matrix is a matrix where all non-diagonal entries are 0; this is typically denoted D = diag(d_1, d_2, d_3, ..., d_n). Two vectors x, y ∈ R^n are orthogonal if x·y = 0. A square matrix U ∈ R^{n×n} is orthogonal if all its columns are orthogonal to each other and are normalized. It follows from orthogonality and normality that U^T U = I = U U^T and ||Ux||_2 = ||x||_2.

Determinant and Inverse of a Matrix The determinant of a square matrix A ∈ R^{n×n} is a function f: R^{n×n} → R, denoted |A| or det A, and can be calculated as |A| = Σ_{i=1}^{n} (−1)^{i+j} a_ij |A_{\i,\j}| (for any j ∈ {1, 2, ..., n}), where A_{\i,\j} denotes A with row i and column j removed. The inverse of a square matrix A ∈ R^{n×n}, denoted A^{−1}, is the unique matrix such that A^{−1} A = I = A A^{−1}. For some square matrices A^{−1} may not exist, and we say that A is singular or non-invertible; in order for A to have an inverse, A must be full rank. For non-square matrices the pseudo-inverse, denoted A^+, is given by A^+ = (A^T A)^{−1} A^T.
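
A sketch of determinant, inverse, and pseudo-inverse in NumPy (the matrices are examples; B is tall with full column rank so the formula above applies):

    import numpy as np

    A = np.array([[ 4.0, -5.0],
                  [-2.0,  3.0]])

    print(np.linalg.det(A))         # 2.0
    print(np.linalg.inv(A) @ A)     # ~identity

    # tall, full-column-rank matrix: pseudo-inverse (A^T A)^{-1} A^T
    B = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
    print(np.allclose(np.linalg.pinv(B),
                      np.linalg.inv(B.T @ B) @ B.T))   # True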

Eigenvalues and Eigenvectors - I Given a square matrix A ∈ R^{n×n}, we say that λ ∈ C is an eigenvalue of A and x ∈ C^n is an eigenvector if Ax = λx, x ≠ 0. Intuitively this means that upon multiplying the matrix A with the vector x, we get back the same vector, but scaled by the parameter λ. Geometrically, the eigenvectors give a set of basis directions in which the action of A is simply a scaling by the corresponding eigenvalues λ.

Eigenvalues and Eigenvectors - II (Figure; source: Wikipedia.)

Eigenvalues and Eigenvectors - III All the eigenvectors can be written together as AX = XΛ, where the columns of X are the eigenvectors of A, and Λ is a diagonal matrix whose elements are the eigenvalues of A. If the eigenvector matrix X is invertible, then A = XΛX^{−1}. There are several properties of eigenvalues and eigenvectors: tr(A) = Σ_{i=1}^{n} λ_i; |A| = Π_{i=1}^{n} λ_i; the rank of A is the number of non-zero eigenvalues of A; if A is non-singular then 1/λ_i are the eigenvalues of A^{−1}; the eigenvalues of a diagonal matrix are the diagonal elements of the matrix itself!

Eigenvalues and Eigenvectors - IV For a symmetric matrix A, it can be shown that the eigenvalues are real and the eigenvectors are orthonormal, so A can be represented as UΛU^T. Considering the quadratic form of A, x^T A x = x^T UΛU^T x = y^T Λ y = Σ_{i=1}^{n} λ_i y_i² (where y = U^T x). Since y_i² is always non-negative, the sign of the expression depends only on the λ_i: if all λ_i > 0 then the matrix A is positive definite; if all λ_i ≥ 0 then the matrix A is positive semidefinite. For a multivariate Gaussian, the variances of x and y alone do not fully describe the distribution; the eigenvectors of the covariance matrix capture the directions of highest variance, and the eigenvalues give the variance along those directions.

Eigenvalues and Eigenvectors - V We can rewrite the original equation in the following manner: Ax = λx, x ≠ 0, i.e. (λI − A)x = 0, x ≠ 0. This is only possible if λI − A is singular, that is |λI − A| = 0. Thus, eigenvalues and eigenvectors can be computed: compute the determinant of A − λI, which results in a polynomial of degree n; find the roots of the polynomial by equating it to zero, the n roots being the n eigenvalues of A (they make A − λI singular); for each eigenvalue λ, solve (A − λI)x = 0 to find an eigenvector x.
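
In practice one uses a library routine rather than the characteristic polynomial; a sketch with NumPy that also checks the trace and determinant properties from the earlier slide (the matrix is an arbitrary symmetric example):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 3.0]])

    vals, vecs = np.linalg.eig(A)   # columns of vecs are eigenvectors

    print(np.allclose(A @ vecs[:, 0], vals[0] * vecs[:, 0]))   # A x = lambda x
    print(np.allclose(vals.sum(), np.trace(A)))                # sum of eigenvalues = trace
    print(np.allclose(vals.prod(), np.linalg.det(A)))          # product of eigenvalues = determinant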

Singular Value Decomposition Singular value decomposition, known as SVD, is a factorization of a real matrix with applications in calculating the pseudo-inverse, rank, solving linear equations, and many others. For a matrix M ∈ R^{m×n} (assume n ≤ m), M = UΣV^T, where U ∈ R^{m×m}, V ∈ R^{n×n}, Σ ∈ R^{m×n}. The m columns of U and the n columns of V are called the left and right singular vectors of M. The diagonal elements Σ_ii of Σ are known as the singular values of M. Let v_i be the i-th column of V, u_i the i-th column of U, and σ_i the i-th diagonal element of Σ; then Mv_i = σ_i u_i and M^T u_i = σ_i v_i.

Singular Value Decomposition - II Written out, M = [u_1 u_2 ... u_m] Σ [v_1 v_2 ... v_n]^T: the columns of U give the principal directions, the singular values in Σ are the scaling factors, and V^T gives the projections onto the principal directions. Singular value decomposition is related to eigenvalue decomposition. Suppose X ∈ R^{m×n} is a data matrix whose rows are the samples x_1^T, x_2^T, ..., x_m^T; then the covariance matrix is C = (1/m) X^T X. Starting from a singular vector pair of M: M^T u = σv and Mv = σu imply M M^T u = σ M v = σ² u, and likewise M^T M v = σ² v, so the singular vectors are eigenvectors of M M^T and M^T M with eigenvalues σ². Applied to the data matrix, the eigenvector equation Cu = λu is solved by the right singular vectors of X, with λ = σ²/m.
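
A sketch relating the SVD to the eigendecomposition of M^T M (a random matrix is used purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    M = rng.normal(size=(5, 3))

    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    # M v_i = sigma_i u_i and M^T u_i = sigma_i v_i
    print(np.allclose(M @ Vt[0], s[0] * U[:, 0]))
    print(np.allclose(M.T @ U[:, 0], s[0] * Vt[0]))

    # squared singular values are the eigenvalues of M^T M
    eigvals = np.linalg.eigvalsh(M.T @ M)[::-1]   # descending order
    print(np.allclose(s**2, eigvals))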

Singular Value Decomposition - III (Figure; source: Wikipedia.)

Matrix Calculus For vectors x, b ∈ R^n, let f(x) = b^T x; then ∂f(x)/∂x_k = ∂/∂x_k Σ_{i=1}^{n} b_i x_i = b_k, so ∇_x b^T x = b. Now for a quadratic function f(x) = x^T A x with A ∈ S^n: ∂f(x)/∂x_k = ∂/∂x_k Σ_{i=1}^{n} Σ_{j=1}^{n} A_ij x_i x_j = Σ_{i≠k} A_ik x_i + Σ_{j≠k} A_kj x_j + 2 A_kk x_k = 2 Σ_{i=1}^{n} A_ki x_i (using the symmetry of A), so ∇_x f(x) = 2Ax. Let f(X) = X^{−1}; then ∂(X^{−1}) = −X^{−1} (∂X) X^{−1}.
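
A numerical sketch checking the first two gradient identities by central finite differences (random b, A, x are assumptions for the check):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    b = rng.normal(size=n)
    A = rng.normal(size=(n, n)); A = (A + A.T) / 2   # make A symmetric
    x = rng.normal(size=n)

    def num_grad(f, x, eps=1e-6):
        # central-difference approximation of the gradient of f at x
        g = np.zeros_like(x)
        for k in range(len(x)):
            e = np.zeros_like(x); e[k] = eps
            g[k] = (f(x + e) - f(x - e)) / (2 * eps)
        return g

    print(np.allclose(num_grad(lambda z: b @ z, x), b))                          # grad b^T x = b
    print(np.allclose(num_grad(lambda z: z @ A @ z, x), 2 * A @ x, atol=1e-5))   # grad x^T A x = 2Ax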

References for self study Best resources for reviewing the material: Linear Algebra Review and Reference by Zico Kolter; The Matrix Cookbook by K. B. Petersen and M. S. Pedersen.