Machine Learning for Data Science (CS4786) Lecture 9

Machine Learning for Data Science (CS4786) Lecture 9: Principal Component Analysis. Course Webpage: http://www.cs.cornell.edu/courses/cs4786/2017fa/

DIM REDUCTION: LINEAR TRANSFORMATION
Pick a low dimensional subspace and project linearly onto it, so that the subspace retains as much information as possible:
$y_t^\top = x_t^\top W$, i.e. $Y = XW$, where $W$ is a $d \times K$ matrix with $K < d$.
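As a quick illustration (not from the slides), here is a minimal NumPy sketch of this linear projection, assuming a data matrix `X` of shape (n, d) and an arbitrary d x K matrix `W`:

```python
import numpy as np

# Hypothetical example: n = 100 points in d = 5 dimensions, reduced to K = 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # data matrix, one row per point x_t
W = np.linalg.qr(rng.normal(size=(5, 2)))[0]   # some d x K matrix (orthonormal columns)

Y = X @ W                                      # projection: y_t^T = x_t^T W for every row
print(Y.shape)                                 # (100, 2): each point is now K numbers
```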

Example: Students in a classroom [3D scatter plot with axes x, y, z]

PCA: VARIANCE MAXIMIZATION [scatter plot of 2D data points]

DIM REDUCTION: LINEAR TRANSFORMATION
Prelude: reducing to 1 dimension. Pick a low dimensional subspace, i.e. a single direction $w$, and project linearly onto it:
$y_t = w^\top x_t = \|x_t\| \cos(\angle(w, x_t))$ (for unit-norm $w$).
The subspace should retain as much information as possible. [figure: points $x_1, \dots, x_4$ projected onto the line spanned by $w$]

PCA: VARIANCE MAXIMIZATION
Pick directions along which the data varies the most. First principal component:
$\text{Variance} = \frac{1}{n}\sum_{t=1}^{n}\Big(y_t - \frac{1}{n}\sum_{s=1}^{n} y_s\Big)^2 = \frac{1}{n}\sum_{t=1}^{n}\Big(w^\top x_t - w^\top \frac{1}{n}\sum_{s=1}^{n} x_s\Big)^2 = \frac{1}{n}\sum_{t=1}^{n}\big(w^\top (x_t - \mu)\big)^2$
= average squared inner product of $w$ with the centered data points.
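A minimal numerical check of this identity (my own illustration, on random data), showing that the variance of the 1-D projections equals the average squared inner product with the centered points:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w = rng.normal(size=3)
w /= np.linalg.norm(w)                       # unit-norm direction

y = X @ w                                    # projections y_t = w^T x_t
mu = X.mean(axis=0)                          # sample mean

variance_of_projections = np.mean((y - y.mean()) ** 2)
avg_sq_inner_product = np.mean(((X - mu) @ w) ** 2)
print(np.allclose(variance_of_projections, avg_sq_inner_product))  # True
```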

WHICH DIRECTION? [scatter plot: one candidate direction is on average parallel to the centered points, the other on average orthogonal]
$\frac{1}{n}\sum_{t=1}^{n}\big(w^\top (x_t - \mu)\big)^2 = \frac{1}{n}\sum_{t=1}^{n}\|x_t - \mu\|^2 \cos^2\!\big(\angle(w, x_t - \mu)\big)$

PCA: VARIANCE MAXIMIZATION
Pick directions along which the data varies the most. First principal component:
$w_1 = \arg\max_{w:\|w\|_2=1} \frac{1}{n}\sum_{t=1}^{n}\big(w^\top x_t - w^\top\mu\big)^2 = \arg\max_{w:\|w\|_2=1} \frac{1}{n}\sum_{t=1}^{n}\big(w^\top(x_t-\mu)\big)^2 = \arg\max_{w:\|w\|_2=1} w^\top\Big(\frac{1}{n}\sum_{t=1}^{n}(x_t-\mu)(x_t-\mu)^\top\Big) w = \arg\max_{w:\|w\|_2=1} w^\top \Sigma\, w$
where $\Sigma$ is the covariance matrix.

Review: covariance. Review: eigenvectors.

PCA: VARIANCE MAXIMIZATION
Covariance matrix: $\Sigma = \frac{1}{n}\sum_{t=1}^{n}(x_t-\mu)(x_t-\mu)^\top$.
It is a $d \times d$ matrix; $\Sigma[i,j]$ measures the covariance of features $i$ and $j$:
$\Sigma[i,j] = \frac{1}{n}\sum_{t=1}^{n}\big(x_t[i]-\mu[i]\big)\big(x_t[j]-\mu[j]\big)$
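A minimal sketch of this computation (assumed setup: `X` is an n x d data matrix; the comparison with NumPy's built-in estimator is my own addition):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))

mu = X.mean(axis=0)
Xc = X - mu                                  # centered data
Sigma = (Xc.T @ Xc) / X.shape[0]             # d x d covariance matrix, 1/n normalization

# Sigma[i, j] is the covariance of features i and j; bias=True makes np.cov
# use the same 1/n normalization as the formula above.
print(np.allclose(Sigma, np.cov(X, rowvar=False, bias=True)))  # True
```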

What are Eigenvectors? [2D figure: points $x$ and their images under the map $x \mapsto Ax$]
An eigenvector of $A$ is a direction that the map only stretches: $Ax = \lambda x$ for some scalar $\lambda$ (the eigenvalue).
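A small illustration of the eigenvector property (my own example matrix, not from the lecture):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # hypothetical symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)     # eigh: for symmetric matrices, ascending eigenvalues
v = eigvecs[:, -1]                       # eigenvector with the largest eigenvalue
lam = eigvals[-1]

print(np.allclose(A @ v, lam * v))       # True: A v = lambda v, the map only rescales v
```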

WHICH DIRECTION? [scatter plot: on average parallel vs. on average orthogonal directions]
The top eigenvector of the covariance matrix.

What if we want more than one number for each data point? That is, we want to reduce to $K > 1$ dimensions. [3D scatter plot with axes x, y, z]

PCA: VARIANCE MAXIMIZATION
How do we find the $K$ components? Answer: maximize the sum of the spread in the $K$ directions. We are looking for orthogonal directions that maximize the total spread in each direction.
Find orthonormal $W$ that maximizes $\frac{1}{n}\sum_{t=1}^{n}\sum_{j=1}^{K} y_t[j]^2$, where $y_t[j] = w_j^\top x_t$, so that $\sum_{j=1}^{K} y_t[j]^2 = \sum_{j=1}^{K} w_j^\top x_t\, x_t^\top w_j$. Orthonormality means $\sum_{k=1}^{d} w_i[k]\,w_j[k] = 0$ for $i \ne j$ and $\sum_{k=1}^{d} w_i[k]^2 = 1$.
This solution is given by $W$ = the top $K$ eigenvectors of $\Sigma$.

PRINCIPAL COMPONENT ANALYSIS
Eigenvectors of the covariance matrix are the principal components. The top $K$ principal components are the eigenvectors with the $K$ largest eigenvalues.
1. $\Sigma = \mathrm{cov}(X)$
2. $W = \mathrm{eigs}(\Sigma, K)$
3. $Y = (X - \mu)\,W$  (Projection = Data $\times$ top $K$ eigenvectors; Reconstruction = Projection $\times$ transpose of the top $K$ eigenvectors)
Independently discovered by Pearson in 1901 and Hotelling in 1933.
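A minimal NumPy sketch of steps 1-3 (illustrative only; the function and variable names are my own):

```python
import numpy as np

def pca(X, K):
    """Top-K principal directions W, projections Y, and mean mu of data matrix X.

    A minimal sketch of steps 1-3 above: covariance, eigendecomposition,
    projection of the centered data onto the top-K eigenvectors.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = (Xc.T @ Xc) / X.shape[0]            # 1. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
    W = eigvecs[:, ::-1][:, :K]                 # 2. top-K eigenvectors (largest eigenvalues)
    Y = Xc @ W                                  # 3. projection Y = (X - mu) W
    return W, Y, mu

# Example usage on random data.
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 6))
W, Y, mu = pca(X, K=2)
print(W.shape, Y.shape)                         # (6, 2) (300, 2)
```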

An Alternative View of PCA

PCA: MINIMIZING RECONSTRUCTION ERROR [figure: a data point $x$ and its reconstruction $\hat{x}$ on the subspace]
Minimize $\frac{1}{n}\sum_{t=1}^{n}\|\hat{x}_t - x_t\|^2$

Maximize Spread $\Leftrightarrow$ Minimize Reconstruction Error

ORTHONORMAL PROJECTIONS
Think of $w_1, \dots, w_K$ as a coordinate system for PCA (in a $K$ dimensional subspace); the $y$ values provide the coefficients in this system.
Without loss of generality, $w_1, \dots, w_K$ can be orthonormal, i.e. $w_i \perp w_j$ (so $\sum_{k=1}^{d} w_i[k]\,w_j[k] = 0$) and $\|w_i\|_2^2 = \sum_{k=1}^{d} w_i[k]^2 = 1$.
Reconstruction: $\hat{x}_t = \sum_{j=1}^{K} y_t[j]\, w_j$. If we take all of $w_1, \dots, w_d$, then $x_t = \sum_{j=1}^{d} y_t[j]\, w_j$. To reduce dimensionality we only consider the first $K$ vectors of the basis.
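A short sketch of this idea (my own illustration): expand a vector in an orthonormal basis, then reconstruct it from only the first K coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K = 5, 2
basis = np.linalg.qr(rng.normal(size=(d, d)))[0]   # columns w_1, ..., w_d, orthonormal
x = rng.normal(size=d)

y = basis.T @ x                         # coefficients y[j] = w_j^T x
x_full = basis @ y                      # using all d basis vectors recovers x exactly
x_hat = basis[:, :K] @ y[:K]            # using only the first K gives an approximation

print(np.allclose(x_full, x))           # True
print(np.linalg.norm(x_hat - x))        # reconstruction error from truncating the basis
```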

CENTERING DATA [figure: data points away from the origin] Compressing these data points ...

CENTERING DATA [figure: the same points centered at the origin] ... is the same as compressing these.

ORTHONORMAL PROJECTIONS
(Centered) Data points as a linear combination of some orthonormal basis, i.e.
$x_t = \mu + \sum_{j=1}^{d} y_t[j]\, w_j$
where $w_1, \dots, w_d \in \mathbb{R}^d$ form the orthonormal basis and $\mu = \frac{1}{n}\sum_{t=1}^{n} x_t$.
Represent the data as a linear combination of just $K$ orthonormal basis vectors:
$\hat{x}_t = \mu + \sum_{j=1}^{K} y_t[j]\, w_j$

PCA: MINIMIZING RECONSTRUCTION ERROR
Goal: find the basis that minimizes the reconstruction error,
$\frac{1}{n}\sum_{t=1}^{n}\|\hat{x}_t - x_t\|_2^2 = \frac{1}{n}\sum_{t=1}^{n}\Big\|\sum_{j=1}^{K} y_t[j]\, w_j + \mu - x_t\Big\|_2^2$
$= \frac{1}{n}\sum_{t=1}^{n}\Big\|\sum_{j=1}^{K} y_t[j]\, w_j + \mu - \Big(\sum_{j=1}^{d} y_t[j]\, w_j + \mu\Big)\Big\|_2^2$
$= \frac{1}{n}\sum_{t=1}^{n}\Big\|\sum_{j=K+1}^{d} y_t[j]\, w_j\Big\|_2^2$
$= \frac{1}{n}\sum_{t=1}^{n}\Big(\sum_{j=K+1}^{d} y_t[j]^2 \|w_j\|_2^2 + 2\sum_{j=K+1}^{d}\sum_{i=j+1}^{d} y_t[j]\, y_t[i]\, w_j^\top w_i\Big)$  (but $\|a+b\|_2^2 = \|a\|_2^2 + \|b\|_2^2 + 2\, a^\top b$)
$= \frac{1}{n}\sum_{t=1}^{n}\sum_{j=K+1}^{d} y_t[j]^2 \|w_j\|_2^2$  (last step because $w_j \perp w_i$)

PCA: MINIMIZING RECONSTRUCTION ERROR
$\frac{1}{n}\sum_{t=1}^{n}\|\hat{x}_t - x_t\|_2^2 = \frac{1}{n}\sum_{t=1}^{n}\sum_{j=K+1}^{d} y_t[j]^2 \|w_j\|_2^2$  (but $\|w_j\| = 1$)
$= \frac{1}{n}\sum_{t=1}^{n}\sum_{j=K+1}^{d} y_t[j]^2$  (now $y_t[j] = w_j^\top (x_t - \mu)$)
$= \frac{1}{n}\sum_{t=1}^{n}\sum_{j=K+1}^{d} \big(w_j^\top (x_t - \mu)\big)^2$
$= \sum_{j=K+1}^{d} w_j^\top \Big(\frac{1}{n}\sum_{t=1}^{n}(x_t-\mu)(x_t-\mu)^\top\Big) w_j$
$= \sum_{j=K+1}^{d} w_j^\top \Sigma\, w_j$
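A small numerical check of this identity (my own illustration): with $W$ set to the top-$K$ eigenvectors, the average reconstruction error equals the sum of $w_j^\top \Sigma\, w_j$ over the discarded directions, i.e. the sum of the $d-K$ smallest eigenvalues of $\Sigma$.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))   # correlated random data
mu = X.mean(axis=0)
Xc = X - mu
Sigma = (Xc.T @ Xc) / X.shape[0]

eigvals, eigvecs = np.linalg.eigh(Sigma)                  # ascending eigenvalues
K = 2
W = eigvecs[:, ::-1][:, :K]                               # top-K eigenvectors
X_hat = (Xc @ W) @ W.T + mu                               # reconstruction

avg_error = np.mean(np.sum((X_hat - X) ** 2, axis=1))     # (1/n) sum_t ||x_hat_t - x_t||^2
discarded = eigvals[: Sigma.shape[0] - K].sum()           # sum of the d - K smallest eigenvalues
print(np.allclose(avg_error, discarded))                  # True
```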

PCA: MINIMIZING RECONSTRUCTION ERROR
Minimize with respect to $w_{K+1}, \dots, w_d$'s that are orthonormal:
$\arg\min_{\forall j,\ \|w_j\|_2 = 1} \sum_{j=K+1}^{d} w_j^\top \Sigma\, w_j$
Solution: the (discarded) $w_{K+1}, \dots, w_d$ are the bottom $d-K$ eigenvectors. Hence $w_1, \dots, w_K$ are the top $K$ eigenvectors.

PRINCIPAL COMPONENT ANALYSIS (recap)
Eigenvectors of the covariance matrix are the principal components; the top $K$ principal components are the eigenvectors with the $K$ largest eigenvalues.
1. $\Sigma = \mathrm{cov}(X)$
2. $W = \mathrm{eigs}(\Sigma, K)$
3. $Y = (X - \mu)\,W$
Independently discovered by Pearson in 1901 and Hotelling in 1933.

RECONSTRUCTION
4. $\hat{X} = Y W^\top + \mu$
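Step 4 as code (a minimal sketch continuing the earlier illustrations; names are my own):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False, bias=True)
W = np.linalg.eigh(Sigma)[1][:, ::-1][:, :2]   # step 2: top-2 eigenvectors
Y = (X - mu) @ W                               # step 3: projection

X_hat = Y @ W.T + mu                           # step 4: reconstruction
print(X_hat.shape)                             # (100, 5), same shape as X
```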