A Quick Tour of Linear Algebra and Optimization for Machine Learning

Masoud Farivar
January 8, 2015

Outline of Part I: Review of Basic Linear Algebra

Matrices and Vectors
Matrix Multiplication
Operators and Properties
Special Types of Matrices
Vector Norms
Linear Independence and Rank
Matrix Inversion
Range and Nullspace of a Matrix
Determinant
Quadratic Forms and Positive Semidefinite Matrices
Eigenvalues and Eigenvectors
Matrix Eigendecomposition

Outline of Part II: Review of Basic Optimization

The Gradient
The Hessian
Least Squares Problem
Gradient Descent
Stochastic Gradient Descent
Convex Optimization
Special Classes of Convex Problems
Example of Convex Problems in Machine Learning
Convex Optimization Tools

Matrices and Vectors

Matrix: a rectangular array of numbers, e.g., A ∈ R^{m×n}:

A = [ a_11  a_12  ...  a_1n
      a_21  a_22  ...  a_2n
       ...
      a_m1  a_m2  ...  a_mn ]

Vector: a matrix with only one column (the default) or one row, e.g., x ∈ R^n:

x = (x_1, x_2, ..., x_n)^T

Matrix Multiplication

If A ∈ R^{m×n}, B ∈ R^{n×p}, and C = AB, then C ∈ R^{m×p} with

C_ij = Σ_{k=1}^{n} A_ik B_kj

Properties of matrix multiplication:
Associative: (AB)C = A(BC)
Distributive: A(B + C) = AB + AC
Non-commutative: in general, AB ≠ BA
Block multiplication: if A = [A_ik] and B = [B_kj], where the A_ik and B_kj are matrix blocks and the number of columns of A_ik equals the number of rows of B_kj, then C = AB = [C_ij] with C_ij = Σ_k A_ik B_kj
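As an aside (not part of the original slides), a small NumPy sketch can confirm the shape rule and the associative and distributive properties on random matrices; the dimensions chosen here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 3, 4, 2                       # arbitrary dimensions for the demo
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, p))
D = rng.standard_normal((n, p))
E = rng.standard_normal((p, p))

C = A @ B                               # C_ij = sum_k A_ik * B_kj
assert C.shape == (m, p)
assert np.allclose((A @ B) @ E, A @ (B @ E))    # associative
assert np.allclose(A @ (B + D), A @ B + A @ D)  # distributive
```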

Operators and Properties

Transpose: if A ∈ R^{m×n}, then A^T ∈ R^{n×m} with (A^T)_ij = A_ji
Properties:
(A^T)^T = A
(AB)^T = B^T A^T
(A + B)^T = A^T + B^T

Trace: if A ∈ R^{n×n}, then tr(A) = Σ_{i=1}^{n} A_ii
Properties:
tr(A) = tr(A^T)
tr(A + B) = tr(A) + tr(B)
tr(λA) = λ tr(A)
If AB is a square matrix, tr(AB) = tr(BA)
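A quick numerical check of the transpose and trace identities above (again a NumPy illustration, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))

assert np.allclose((A @ B).T, B.T @ A.T)              # (AB)^T = B^T A^T
assert np.isclose(np.trace(A @ B), np.trace(B @ A))   # tr(AB) = tr(BA); both products are square
```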

Special Types of Matrices

Identity matrix: I = I_n ∈ R^{n×n}, with I_ij = 1 if i = j and 0 otherwise. For any A ∈ R^{m×n}: A I_n = I_m A = A
Diagonal matrix: D = diag(d_1, d_2, ..., d_n), with D_ij = d_i if i = j and 0 otherwise
Symmetric matrices: A ∈ R^{n×n} is symmetric if A = A^T
Orthogonal matrices: U ∈ R^{n×n} is orthogonal if UU^T = U^T U = I

Vector Norms

A norm of a vector x is a measure of its length or magnitude. The most common is the Euclidean or ℓ_2 norm.

ℓ_p norm: ‖x‖_p = (Σ_{i=1}^{n} |x_i|^p)^{1/p}

ℓ_2 norm: ‖x‖_2 = (Σ_{i=1}^{n} x_i^2)^{1/2}
used in ridge regression: ‖y − Xβ‖_2^2 + λ‖β‖_2^2

ℓ_1 norm: ‖x‖_1 = Σ_{i=1}^{n} |x_i|
used in ℓ_1-penalized regression: ‖y − Xβ‖_2^2 + λ‖β‖_1

ℓ_∞ norm: ‖x‖_∞ = max_i |x_i|
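These norms map directly onto numpy.linalg.norm; a brief sketch (not from the slides):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x, 1))        # l1 norm: 7.0
print(np.linalg.norm(x, 2))        # l2 (Euclidean) norm: 5.0
print(np.linalg.norm(x, np.inf))   # l-infinity norm: 4.0
p = 3                              # any p >= 1; computed from the definition
print(np.sum(np.abs(x) ** p) ** (1.0 / p))
```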

Linear Independence and Rank

A set of vectors {x_1, ..., x_n} is linearly independent if the only coefficients {α_1, ..., α_n} satisfying Σ_{i=1}^{n} α_i x_i = 0 are α_1 = ... = α_n = 0.

Rank: for A ∈ R^{m×n}, rank(A) is the maximum number of linearly independent columns (or, equivalently, rows).
Properties:
rank(A) ≤ min{m, n}
rank(A) = rank(A^T)
rank(AB) ≤ min{rank(A), rank(B)}
rank(A + B) ≤ rank(A) + rank(B)
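For illustration, NumPy's matrix_rank reports the number of linearly independent columns (a sketch, not from the slides):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],   # twice the first row, so the rows are dependent
              [0.0, 1.0, 1.0]])
print(np.linalg.matrix_rank(A))    # 2
print(np.linalg.matrix_rank(A.T))  # also 2: rank(A) = rank(A^T)
```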

Matrix Inversion

If A ∈ R^{n×n} and rank(A) = n, then the inverse of A, denoted A^{-1}, is the matrix such that AA^{-1} = A^{-1}A = I.
Properties:
(A^{-1})^{-1} = A
(AB)^{-1} = B^{-1} A^{-1}
(A^{-1})^T = (A^T)^{-1}
The inverse of an orthogonal matrix is its transpose.
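A short NumPy illustration of the inverse and its properties (an assumed environment, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))            # a random square matrix is invertible with probability 1
A_inv = np.linalg.inv(A)
assert np.allclose(A @ A_inv, np.eye(4))
assert np.allclose(np.linalg.inv(A.T), A_inv.T)    # (A^T)^{-1} = (A^{-1})^T

Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # QR factorization gives an orthogonal Q
assert np.allclose(np.linalg.inv(Q), Q.T)          # inverse of an orthogonal matrix is its transpose
```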

Range and Nullspace of a Matrix

Span: span({x_1, ..., x_n}) = { Σ_{i=1}^{n} α_i x_i : α_i ∈ R }
Projection: Proj(y; {x_i}_{1≤i≤n}) = argmin_{v ∈ span({x_i}_{1≤i≤n})} ‖y − v‖_2
Range: for A ∈ R^{m×n}, R(A) = {Ax : x ∈ R^n} is the span of the columns of A
Proj(y; A) = A(A^T A)^{-1} A^T y
Nullspace: null(A) = {x ∈ R^n : Ax = 0}
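The projection formula Proj(y; A) = A(A^T A)^{-1} A^T y can be checked numerically when A has full column rank; a minimal sketch (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))                 # tall matrix, full column rank almost surely
y = rng.standard_normal(6)

proj = A @ np.linalg.solve(A.T @ A, A.T @ y)    # A (A^T A)^{-1} A^T y, without forming the inverse
assert np.allclose(A.T @ (y - proj), 0)         # the residual is orthogonal to the range of A
```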

Determinant

For A ∈ R^{n×n} with rows a_1, ..., a_n, det(A) is (up to sign) the volume of the set S = { Σ_{i=1}^{n} α_i a_i : 0 ≤ α_i ≤ 1 }.
Properties:
det(I) = 1
det(λA) = λ^n det(A)
det(A^T) = det(A)
det(AB) = det(A) det(B)
det(A) ≠ 0 if and only if A is invertible
If A is invertible, then det(A^{-1}) = det(A)^{-1}

Quadratic Forms and Positive Semidefinite Matrices

For A ∈ R^{n×n} and x ∈ R^n, x^T Ax is called a quadratic form:

x^T Ax = Σ_{1≤i,j≤n} A_ij x_i x_j

A is positive definite if x^T Ax > 0 for all x ∈ R^n, x ≠ 0
A is positive semidefinite if x^T Ax ≥ 0 for all x ∈ R^n
A is negative definite if x^T Ax < 0 for all x ∈ R^n, x ≠ 0
A is negative semidefinite if x^T Ax ≤ 0 for all x ∈ R^n
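One practical way to test positive semidefiniteness of a symmetric matrix is to inspect its eigenvalues; a small sketch under the assumption that NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
A = B.T @ B                                      # B^T B is always symmetric positive semidefinite
assert np.all(np.linalg.eigvalsh(A) >= -1e-10)   # eigenvalues are (numerically) nonnegative

x = rng.standard_normal(4)
assert x @ A @ x >= -1e-10                       # the quadratic form x^T A x is nonnegative
```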

Eigenvalues and Eigenvectors

For A ∈ R^{n×n}, λ ∈ C is an eigenvalue of A with corresponding eigenvector x ∈ C^n (x ≠ 0) if

Ax = λx

Eigenvalues: the n (possibly complex) roots of the polynomial equation det(A − λI) = 0, denoted λ_1, ..., λ_n
Properties:
tr(A) = Σ_{i=1}^{n} λ_i
det(A) = Π_{i=1}^{n} λ_i
rank(A) = |{ 1 ≤ i ≤ n : λ_i ≠ 0 }| (when A is diagonalizable)
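A quick numerical check of the trace and determinant identities (illustrative only, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
eigvals, eigvecs = np.linalg.eig(A)

assert np.isclose(np.trace(A), eigvals.sum())         # tr(A) = sum of eigenvalues
assert np.isclose(np.linalg.det(A), eigvals.prod())   # det(A) = product of eigenvalues
assert np.allclose(A @ eigvecs, eigvecs * eigvals)    # each column satisfies A x = lambda x
```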

Matrix Eigendecomposition

Let A ∈ R^{n×n} have eigenvalues λ_1, ..., λ_n and eigenvectors x_1, ..., x_n. With X = [x_1 x_2 ... x_n] and Λ = diag(λ_1, ..., λ_n), we have AX = XΛ.
A is called diagonalizable if X is invertible: A = XΛX^{-1}
If A is symmetric, then all eigenvalues are real and X can be taken orthogonal (hence denoted U = [u_1 u_2 ... u_n]):

A = UΛU^T = Σ_{i=1}^{n} λ_i u_i u_i^T

This is a special case of the Singular Value Decomposition.
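For a symmetric matrix, numpy.linalg.eigh returns real eigenvalues and an orthogonal U, so the decomposition above can be reproduced directly (a sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                                  # symmetrize to obtain a symmetric matrix

lam, U = np.linalg.eigh(A)                         # real eigenvalues, orthonormal eigenvectors
assert np.allclose(U @ U.T, np.eye(4))             # U is orthogonal
assert np.allclose(A, U @ np.diag(lam) @ U.T)      # A = U Lambda U^T
assert np.allclose(A, sum(lam[i] * np.outer(U[:, i], U[:, i]) for i in range(4)))  # rank-one sum
```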

The Gradient

Suppose f : R^{m×n} → R is a function that takes as input a matrix A and returns a real value. Then the gradient of f with respect to A is the m×n matrix of partial derivatives:

(∇_A f(A))_ij = ∂f(A) / ∂A_ij

Note that the size of this matrix is always the same as the size of A. In particular, if A is the vector x ∈ R^n,

∇_x f(x) = ( ∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n )^T

The Hessian

Suppose f : R^n → R is a function that takes a vector in R^n and returns a real number. Then the Hessian matrix of f with respect to x is the n×n matrix of second partial derivatives:

(∇_x^2 f(x))_ij = ∂^2 f(x) / ∂x_i ∂x_j

Gradient and Hessian of quadratic and linear functions (for b ∈ R^n and symmetric A ∈ R^{n×n}):
∇_x b^T x = b
∇_x x^T Ax = 2Ax
∇_x^2 x^T Ax = 2A
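The quadratic-function formulas above can be sanity-checked against finite differences; the step size and random data below are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                        # symmetric A
b = rng.standard_normal(n)
x = rng.standard_normal(n)

f = lambda v: v @ A @ v + b @ v          # f(x) = x^T A x + b^T x
grad_analytic = 2 * A @ x + b            # from the formulas above
eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
# The Hessian of x^T A x + b^T x is the constant matrix 2A.
```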

Least Squares Problem

Solve the following minimization problem:

min_x ‖Ax − b‖_2^2

Note that

‖Ax − b‖_2^2 = (Ax − b)^T (Ax − b) = x^T A^T Ax − 2 b^T Ax + b^T b

Taking the gradient with respect to x (and using the properties above):

∇_x ‖Ax − b‖_2^2 = 2 A^T Ax − 2 A^T b

Setting this to zero and solving for x gives the following closed-form solution (the pseudo-inverse):

x = (A^T A)^{-1} A^T b
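The closed-form solution agrees with a library least-squares solver; a minimal NumPy sketch (the data are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)

x_closed = np.linalg.solve(A.T @ A, A.T @ b)       # (A^T A)^{-1} A^T b via a linear solve
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # NumPy's least-squares routine
assert np.allclose(x_closed, x_lstsq)
```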

Gradient Descent

Gradient Descent (GD) takes steps proportional to the negative of the gradient (a first-order method).
Advantage: very general (we'll see it many times).
Disadvantages:
Local minima (sensitive to the starting point)
The step size must be chosen: not too large, not too small
Common choices: fixed, or decreasing linearly with the iteration (we may want the step size to decrease with the iteration)
More advanced methods exist (e.g., Newton's method)

Gradient Descent

A typical machine learning problem aims to minimize Error (loss) + Regularizer (penalty):

min_w F(w) = f(w; y, x) + g(w)

Gradient Descent (GD):
choose an initial w^(0)
repeat w^(t+1) = w^(t) − η_t ∇F(w^(t))
until ‖w^(t+1) − w^(t)‖ ≤ ε or ‖∇F(w^(t))‖ ≤ ε
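A minimal sketch of this loop for the least squares objective F(w) = ‖Xw − y‖_2^2 (no regularizer); the fixed step size, tolerance, and synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(9)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

def grad_F(w):                        # gradient of F(w) = ||Xw - y||^2
    return 2 * X.T @ (X @ w - y)

w = np.zeros(3)                       # initial w^(0)
eta, eps = 1e-3, 1e-8                 # fixed step size and stopping tolerance
for t in range(10_000):
    w_next = w - eta * grad_F(w)      # w^(t+1) = w^(t) - eta_t * grad F(w^(t))
    if np.linalg.norm(w_next - w) <= eps:
        break
    w = w_next
print(w)                              # close to the true coefficients [1, -2, 0.5]
```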

Stochastic (Online) Gradient Descent

Use updates based on individual data points chosen at random.
Applicable when the objective function to be minimized is a sum of differentiable functions:

f(w; y, x) = (1/n) Σ_{i=1}^{n} f(w; y_i, x_i)

Suppose we receive a stream of samples (y_t, x_t) from the distribution; the idea of SGD is:

w^(t+1) = w^(t) − η_t ∇_w f(w^(t); y_t, x_t)

In practice, we typically shuffle the data points in the training set randomly and use them one by one for the updates.
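A corresponding SGD sketch for the same squared-loss problem, with a shuffled pass over the data each epoch and a decreasing step size (all constants are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.standard_normal((1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(1000)

w = np.zeros(3)
step = 0
for epoch in range(20):
    for i in rng.permutation(len(y)):               # shuffle the training set each epoch
        eta = 0.02 / (1 + step / 1000)              # decreasing step size eta_t
        w = w - eta * 2 * (X[i] @ w - y[i]) * X[i]  # gradient of the single-sample squared loss
        step += 1
print(w)                                            # approaches [1, -2, 0.5]
```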

Stochastic (Online) Gradient Descent

The objective does not always decrease at each step.
Compared to GD, SGD needs more steps, but each step is cheaper.
Mini-batches (say, picking 100 samples and averaging) can potentially accelerate convergence.

Convex Optimization

A set of points S is convex if, for any x, y ∈ S and for any 0 ≤ θ ≤ 1,

θx + (1 − θ)y ∈ S

A function f : S → R is convex if its domain S is a convex set and

f(θx + (1 − θ)y) ≤ θ f(x) + (1 − θ) f(y)

for all x, y ∈ S and 0 ≤ θ ≤ 1.
Convex functions can be efficiently minimized.

Convex Optimization

A convex optimization problem is an optimization problem of the form

minimize f(x)  subject to  x ∈ C

where f is a convex function, C is a convex set, and x is the optimization variable. Or, equivalently,

minimize f(x)  subject to  g_i(x) ≤ 0, i = 1, ..., m,  and  h_i(x) = 0, i = 1, ..., p

where the g_i are convex functions and the h_i are affine functions.

Theorem: all locally optimal points of a convex optimization problem are globally optimal.
Note that the optimal solution is not necessarily unique.

Special Classes of Convex Problems

Linear Programming (LP): minimize a linear objective c^T x subject to affine constraints Ax ≤ b.
Quadratic Programming (QP): minimize a convex quadratic objective (1/2) x^T P x + c^T x (with P positive semidefinite) subject to affine constraints.
Quadratically Constrained Quadratic Programming (QCQP): a QP in which the inequality constraints are also convex quadratic functions.
Semidefinite Programming (SDP): minimize a linear objective subject to affine constraints and the constraint that a symmetric matrix variable be positive semidefinite.
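As a concrete LP instance, here is a small sketch using SciPy's linprog (SciPy is not among the tools listed later in this review; it is only assumed here for the sake of a short example):

```python
import numpy as np
from scipy.optimize import linprog

# minimize  x1 + 2*x2   subject to  x1 + x2 >= 1,  x1 >= 0,  x2 >= 0
c = np.array([1.0, 2.0])
A_ub = np.array([[-1.0, -1.0]])    # -x1 - x2 <= -1  encodes  x1 + x2 >= 1
b_ub = np.array([-1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x)                        # optimal point: [1, 0]
```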

Example of Convex Problems in Machine Learning

Support Vector Machine (SVM) classifier:

minimize_{ω, b, ξ}  (1/2) ‖ω‖_2^2 + C Σ_{i=1}^{m} ξ_i
subject to  y^(i) (ω^T x^(i) + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., m

This is a quadratic program with optimization variables ω ∈ R^n, ξ ∈ R^m, b ∈ R, input data x^(i), y^(i), i = 1, ..., m, and parameter C ∈ R.
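For illustration, a QP of this kind is what scikit-learn's SVC solves internally (through LIBSVM, one of the packages listed on the next slide); scikit-learn itself is an assumption here, not something mentioned in the slides:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(11)
# two Gaussian blobs as a toy binary classification set
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0)   # C plays the role of the penalty parameter above
clf.fit(X, y)
print(clf.coef_, clf.intercept_)    # the learned omega and b
```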

Convex Optimization Tools

In many applications, we can write an optimization problem in a convex form. We can then use one of several software packages for convex optimization to solve these problems efficiently. These convex optimization engines include:
MATLAB-based: CVX, SeDuMi, MATLAB Optimization Toolbox (linprog, quadprog)
Machine learning: Weka (Java)
Libraries: CVXOPT (Python), GLPK (C), COIN-OR (C)
SVMs: LIBSVM, SVM-light
Commercial packages: CPLEX, MOSEK

References

The sources of this review are the following:
Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.
Course notes from CMU's 10-701.
Course notes from Stanford's CS229 and CS224W.
Course notes from UCI's CS273a.