Machine Learning for Data Science (CS 4786)

Lecture 2 & 3: Principal Component Analysis

The text in black outlines high-level ideas. The text in blue provides simple mathematical details to derive or get to the algorithm or method. The text in red gives mathematical details for those who are interested.

1 Dimensionality Reduction and Linear Projection

We are provided with data points x_1, ..., x_n, where each x_t in R^d is a d-dimensional vector. The goal in dimensionality reduction is to compress these points into vectors y_1, ..., y_n in R^K, where K is smaller than d. In this lecture we will consider dimensionality reduction through linear transformations, meaning that the low-dimensional representation y_t for each data point x_t is obtained by setting

    y_t = W^T x_t,

where W defines the linear transformation given by a d x K matrix. Notice that one can represent the matrix W as W = [w_1, ..., w_K], where w_1, ..., w_K are each d-dimensional vectors.

2 PCA to One Dimension (K = 1)

For the case when K = 1, y_t[1] = w_1^T x_t. Now note that arbitrarily scaling w_1 to, say, αw_1 for some α in R simply leads to the y_t[1]'s being scaled by α. Such scaling does not really affect our projections, however. Hence, without loss of generality, we can simply set ||w_1|| = 1, that is, find a vector of unit norm.

The figure below illustrates the basic idea when we consider the projection down to only one dimension. The points in blue are the original x_1, ..., x_n. The first direction of projection, w_1, is illustrated in the figure by the red line. The points in green represent the reconstructions ˆx_1, ..., ˆx_n. The one-dimensional representations y_1[1], ..., y_n[1] are illustrated by the lengths in the figure.
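
Below is a minimal numpy sketch (not part of the original notes; the data and the direction are made up for illustration) of this K = 1 picture: projecting points onto a unit-norm direction w_1 gives the one-dimensional representations y_t[1] = w_1^T x_t, and multiplying back by w_1 gives the reconstructions.

import numpy as np

# Project 2-D points onto a single unit-norm direction (the K = 1 case above).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # rows are the points x_1, ..., x_n (d = 2)

w1 = np.array([3.0, 1.0])
w1 = w1 / np.linalg.norm(w1)           # enforce the unit-norm constraint ||w_1|| = 1

y = X @ w1                             # y_t[1] = w_1^T x_t, one number per point
X_hat = np.outer(y, w1)                # reconstructions ˆx_t = y_t[1] w_1 (the green points)
print(y.shape, X_hat.shape)            # (100,) (100, 2)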

Now the main question at hand boils down to: how do we pick the right linear transformation W so as to retain as much information about the original data points as possible?

2.1 Maximizing Spread (variance)

Basic idea: Pick the directions along which the data is maximally spread (or variance is high). As an example, in the illustration below we would like to pick the second option, as the spread of the points along the chosen direction is larger in the second figure.

How do we formalize this (for K = 1)? Let us first consider the first direction to pick, w_1. We want to pick the direction along which the variance of y_1[1], ..., y_n[1] is largest. That is, we want to pick the direction w_1 (a unit-norm vector) that maximizes the sample variance in the projected direction, which is given by

    \frac{1}{n}\sum_{t=1}^{n}\left( y_t[1] - \frac{1}{n}\sum_{s=1}^{n} y_s[1] \right)^2
    = \frac{1}{n}\sum_{t=1}^{n}\left( w_1^T x_t - \frac{1}{n}\sum_{s=1}^{n} w_1^T x_s \right)^2 .

Thus the solution w_1 is given by

    w_1 = \arg\max_{w : \|w\| = 1} \frac{1}{n}\sum_{t=1}^{n}\left( w^T x_t - \frac{1}{n}\sum_{s=1}^{n} w^T x_s \right)^2 .
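
As a quick aside (my own sketch, not from the notes), this objective is easy to evaluate numerically for any candidate direction; the helper name and the synthetic data below are made up for illustration.

import numpy as np

# Evaluate the objective above: the sample variance of the projections w^T x_t.
def projected_variance(X, w):
    w = w / np.linalg.norm(w)            # keep the unit-norm constraint
    y = X @ w                            # y_t[1] = w^T x_t
    return np.mean((y - y.mean()) ** 2)  # (1/n) sum_t (y_t[1] - mean)^2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])
print(projected_variance(X, np.array([1.0, 0.0, 0.0])))  # large: most spread along the first axis
print(projected_variance(X, np.array([0.0, 0.0, 1.0])))  # small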

Let µ = \frac{1}{n}\sum_{t=1}^{n} x_t denote the mean of the data points. Now let us simplify the variance expression above further:

    w_1 = \arg\max_{w : \|w\| = 1} \frac{1}{n}\sum_{t=1}^{n}\left( w^T x_t - \frac{1}{n}\sum_{s=1}^{n} w^T x_s \right)^2
        = \arg\max_{w : \|w\| = 1} \frac{1}{n}\sum_{t=1}^{n}\left( w^T \Big( x_t - \frac{1}{n}\sum_{s=1}^{n} x_s \Big) \right)^2
        = \arg\max_{w : \|w\| = 1} \frac{1}{n}\sum_{t=1}^{n}\left( w^T (x_t - \mu) \right)^2
        = \arg\max_{w : \|w\| = 1} w^T \left( \frac{1}{n}\sum_{t=1}^{n} (x_t - \mu)(x_t - \mu)^T \right) w
        = \arg\max_{w : \|w\| = 1} w^T \Sigma w,

where Σ = \frac{1}{n}\sum_{t=1}^{n} (x_t - \mu)(x_t - \mu)^T is the sample covariance matrix. The first direction we pick will be the unit vector w_1 that maximizes w^T Σ w.

(Roughly speaking) Whenever we want to maximize (or minimize) a function subject to a constraint, we can use the idea of Lagrange multipliers. What the result says is that there exists λ in R such that the solution w_1 can alternatively be written as

    w_1 = \arg\max_{w \in R^d} \; w^T \Sigma w - \lambda \|w\|^2 .

To optimize the above we simply take the derivative and equate it to 0. This gives us

    \Sigma w_1 - \lambda w_1 = 0 .

So the direction w_1 we obtain by maximizing the variance in that direction is some unit vector that satisfies

    \Sigma w_1 = \lambda w_1 .

But this is exactly the definition of an eigenvector of the matrix Σ. In Dutch the word "eigen" means self or own: an eigenvector of a matrix, multiplied by the matrix, results in a vector that is just a scaled version of the eigenvector itself. So we see that w_1 is an eigenvector of Σ.

The next question is which eigenvector to choose. To this end, note that we want to maximize w_1^T Σ w_1, and we just saw that Σ w_1 = λ w_1. Hence

    w_1^T \Sigma w_1 = \lambda \, w_1^T w_1 = \lambda \|w_1\|^2 = \lambda .

Since we want to maximize the above quantity, it stands to reason that we pick w_1 to be the eigenvector corresponding to the largest eigenvalue of Σ.
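
A hedged numerical check of this conclusion (synthetic data, my own variable names): the top eigenvector of the sample covariance attains the largest projected variance, and the value it attains is the largest eigenvalue λ.

import numpy as np

# Compare w_1^T Σ w_1 for the top eigenvector against the largest eigenvalue of Σ.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.2],
                                          [0.0, 0.0, 0.3]])
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / len(X)   # sample covariance (1/n) sum_t (x_t - µ)(x_t - µ)^T

evals, evecs = np.linalg.eigh(Sigma)     # Σ is symmetric; eigenvalues come back in ascending order
w1 = evecs[:, -1]                        # eigenvector of the largest eigenvalue
print(w1 @ Sigma @ w1, evals[-1])        # both print λ_max, the maximal projected variance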

3 What about K > 1?

For simplicity, for this section we shall assume that \frac{1}{n}\sum_{t=1}^{n} x_t = 0, because if not, we can simply center the points by subtracting the mean from each one.

A simple fact from linear algebra is that if we consider any orthonormal basis of R^d given by w_1, ..., w_d in R^d, then any vector in R^d can be represented as a linear combination of these d basis vectors. Recall that orthonormal vectors are vectors w_1, ..., w_d such that each vector is of unit length, that is,

    \forall i \le d, \quad \|w_i\|^2 = \sum_{j=1}^{d} w_i[j]^2 = 1,

and the vectors are orthogonal to each other, that is,

    \forall i, j \text{ s.t. } i \ne j, \quad w_i^T w_j = 0 .

The key idea we are going to use to produce the K-dimensional representation of the points is that we shall first find an orthonormal basis w_1, ..., w_d in which to represent each point x_t. However, we shall pick only K of the d basis vectors w_1, ..., w_d and approximate the data points in this chosen K-dimensional subspace spanned by w_1, ..., w_K. Thus the matrix W will be obtained by considering only these K basis vectors.

Since we can write any vector in d dimensions as a linear combination of the orthonormal basis, let us write each

    x_t = \sum_{j=1}^{d} y_t[j] \, w_j,        (1)

where for each x_t, y_t[j] represents the coefficient on the j-th basis vector w_j. Now the K-dimensional representation of the point x_t is given by the K numbers y_t[1], ..., y_t[K]. What this means is that we can view the data point x_t as being approximated by the reconstruction

    \hat{x}_t = \sum_{j=1}^{K} y_t[j] \, w_j .        (2)

Now further note that, since x_t = \sum_{j=1}^{d} y_t[j] w_j and since w_1, ..., w_d are orthonormal, if we consider W = [w_1, ..., w_K], then

    W^T x_t = W^T \sum_{j=1}^{d} y_t[j] w_j = \begin{bmatrix} y_t[1] \\ \vdots \\ y_t[K] \end{bmatrix},        (3)

which coincides with our definition of the linear transformation in the first section.
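
A small numpy sketch of Eqs. (1)-(3) (with a randomly generated orthonormal basis; none of this is in the original notes): with the full basis the coefficients y_t[j] = w_j^T x_t reconstruct x_t exactly, and keeping only the first K columns gives the K-dimensional representation W^T x_t.

import numpy as np

rng = np.random.default_rng(3)
d, K = 4, 2
W_full, _ = np.linalg.qr(rng.normal(size=(d, d)))   # columns form an orthonormal basis of R^d
x_t = rng.normal(size=d)

y_all = W_full.T @ x_t                    # all d coefficients y_t[j] = w_j^T x_t
print(np.allclose(W_full @ y_all, x_t))   # True: x_t = sum_j y_t[j] w_j    (Eq. 1)

W = W_full[:, :K]                         # keep only K basis vectors
y_t = W.T @ x_t                           # K-dimensional representation    (Eq. 3)
x_hat = W @ y_t                           # reconstruction ˆx_t              (Eq. 2)
print(np.linalg.norm(x_t - x_hat))        # reconstruction error for this single point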

3.1 View I: Maximize Total Variance

Basic idea: Pick the orthogonal directions along which the data is maximally spread in each of the coordinates.

How do we formalize this? We want to maximize the sum of the variances of the y_t's. That is, we want to maximize

    \sum_{j=1}^{K} \frac{1}{n}\sum_{t=1}^{n}\left( y_t[j] - \frac{1}{n}\sum_{s=1}^{n} y_s[j] \right)^2
    = \sum_{j=1}^{K} \frac{1}{n}\sum_{t=1}^{n}\left( w_j^T x_t - \frac{1}{n}\sum_{s=1}^{n} w_j^T x_s \right)^2 .

That is, we want to solve the optimization problem

    (w_1, ..., w_K) = \arg\max_{\text{orthonormal } W} \sum_{j=1}^{K} \frac{1}{n}\sum_{t=1}^{n}\left( w_j^T x_t - \frac{1}{n}\sum_{s=1}^{n} w_j^T x_s \right)^2
                    = \arg\max_{\text{orthonormal } W} \sum_{j=1}^{K} w_j^T \Sigma w_j .

To solve the above, first we note that we can solve for w_1 just as in the K = 1 case, which yields the top eigenvector as the solution for w_1. Next we can solve for w_2 subject to it being orthogonal to w_1, and this yields exactly the second largest eigenvector. Next, w_3 is the third largest, and so forth. Thus the solution we get is the first K largest eigenvectors.

3.2 View II: Minimizing Reconstruction Error

Basic idea: Another way of thinking about which orthonormal basis to choose is to pick the basis such that the reconstruction error is minimized. That is, pick the orthonormal basis w_1, ..., w_K such that

    \frac{1}{n}\sum_{t=1}^{n} \|x_t - \hat{x}_t\|^2

is minimized. (See the figure in Section 2.)

Let us simplify the above term:

    \frac{1}{n}\sum_{t=1}^{n} \|\hat{x}_t - x_t\|^2
    = \frac{1}{n}\sum_{t=1}^{n} \Big\| \sum_{j=1}^{K} y_t[j] w_j - x_t \Big\|^2
    = \frac{1}{n}\sum_{t=1}^{n} \Big\| \sum_{j=1}^{K} y_t[j] w_j - \sum_{j=1}^{d} y_t[j] w_j \Big\|^2
    = \frac{1}{n}\sum_{t=1}^{n} \Big\| \sum_{j=K+1}^{d} y_t[j] w_j \Big\|^2
    = \frac{1}{n}\sum_{t=1}^{n} \Big\| \sum_{j=K+1}^{d} (w_j^T x_t) w_j \Big\|^2

(using the fact that y_t[j] = w_j^T x_t from Eq. (3)).

Going from the above to the next equation is not hard but just takes a bit of staring. Note that for any vector v, \|v\|^2 = v^T v. Expanding the above and noticing that the w_j's are orthonormal yields the below. It's ok if this step seems hard; just take it as given.

    = \frac{1}{n}\sum_{t=1}^{n} \sum_{j=K+1}^{d} (w_j^T x_t)^2
    = \frac{1}{n}\sum_{t=1}^{n} \sum_{j=K+1}^{d} w_j^T x_t x_t^T w_j
    = \sum_{j=K+1}^{d} w_j^T \left( \frac{1}{n}\sum_{t=1}^{n} x_t x_t^T \right) w_j
    = \sum_{j=K+1}^{d} w_j^T \Sigma w_j        (since the x_t's are centered)

The orthonormal basis w_1, ..., w_d that we shall pick is the one that minimizes the reconstruction error, and is hence the orthonormal basis that minimizes \sum_{j=K+1}^{d} w_j^T \Sigma w_j.

We again use Lagrange multipliers to rewrite the constrained minimization problem (with the unit-norm constraints) as an unconstrained minimization problem. Specifically, we see that there exist λ_{K+1}, ..., λ_d such that the orthonormal basis w_1, ..., w_d is the one that minimizes

    \sum_{j=K+1}^{d} \left( w_j^T \Sigma w_j - \lambda_j \|w_j\|^2 \right) .

Taking the derivative and equating it to 0, we find that for any index K+1 \le j \le d,

    \Sigma w_j - \lambda_j w_j = 0 .

Thus again we find that the w_j's are eigenvectors. To minimize the reconstruction error we simply pick the eigenbasis of Σ and retain K of them while throwing away the remaining ones. Now, since each w_j is an eigenvector, we have that Σ w_j = λ_j w_j, where λ_j is the corresponding eigenvalue. Now the question remains: which eigenvectors do we keep and which do we throw away? To this end, recall that we want to minimize

    \sum_{j=K+1}^{d} w_j^T \Sigma w_j = \sum_{j=K+1}^{d} \lambda_j \, w_j^T w_j = \sum_{j=K+1}^{d} \lambda_j .

Thus it stands to reason that to minimize the above we pick w_{K+1}, ..., w_d to be the eigenvectors with the d - K smallest eigenvalues.
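
A hedged numerical check of both views (synthetic data, my own names, not from the notes): with the top-K eigenvectors of Σ, the total projected variance equals the sum of the K largest eigenvalues, and the average reconstruction error equals the sum of the d - K smallest ones.

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(2000, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)                      # center the points
Sigma = Xc.T @ Xc / len(Xc)                  # sample covariance

evals, evecs = np.linalg.eigh(Sigma)         # eigenvalues in ascending order
K = 2
W = evecs[:, -K:]                            # top-K eigenvectors as columns

total_variance = np.trace(W.T @ Sigma @ W)   # sum_j w_j^T Σ w_j over the kept directions
X_hat = Xc @ W @ W.T                         # reconstructions of the centered points
recon_error = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))

print(total_variance, evals[-K:].sum())      # equal (up to float error): View I
print(recon_error, evals[:-K].sum())         # equal (up to float error): View II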

4 PCA Algorithm

What both views tell us is that the matrix W we shall use for PCA is obtained by taking the K eigenvectors corresponding to the top K eigenvalues. As for the PCA algorithm, one way to implement it is to first compute the covariance matrix given the data. This can be done by calculating the mean vector and the covariance matrix as

    \mu = \frac{1}{n}\sum_{t=1}^{n} x_t, \qquad \Sigma = \frac{1}{n}\sum_{t=1}^{n} (x_t - \mu)(x_t - \mu)^T .

Next we perform an eigendecomposition of this matrix, take the top K eigenvectors, and set

    W = [w_1, ..., w_K] = \text{eigs}(\Sigma, K) .

4.1 Projection to Lower Dimension

The projection is simply given by

    y_t = W^T (x_t - \mu) .

The above is the version where we center the x_t's, which is represented by the fact that µ is subtracted from each x_t.

4.2 Reconstruction

Reconstruction of the data points based on the low-dimensional representation is given by

    \hat{x}_t = W y_t + \mu .
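
A compact numpy sketch of this algorithm (the function names are my own; eigs(Σ, K) is realized here with np.linalg.eigh plus a column slice, which is one reasonable choice rather than a prescribed implementation from the notes):

import numpy as np

def pca_fit(X, K):
    """Return the d x K matrix W of top-K eigenvectors and the mean vector µ."""
    mu = X.mean(axis=0)                              # mean vector µ
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)                       # covariance matrix Σ
    evals, evecs = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :K]                        # top-K eigenvectors as columns of W
    return W, mu

def pca_project(X, W, mu):
    return (X - mu) @ W                              # y_t = W^T (x_t - µ)

def pca_reconstruct(Y, W, mu):
    return Y @ W.T + mu                              # ˆx_t = W y_t + µ

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 10))
W, mu = pca_fit(X, K=3)
Y = pca_project(X, W, mu)
X_hat = pca_reconstruct(Y, W, mu)
print(Y.shape, X_hat.shape)                          # (300, 3) (300, 10)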