CS540 Machine learning Lecture 5


Last time: basis functions for linear regression; the normal equations; QR; the SVD (briefly).

This time: the geometry of least squares (again); the SVD in more detail; LMS; ridge regression.

Geometry of least squares. The columns of X define a d-dimensional linear subspace of n-dimensional space, and ŷ is the projection of y onto that subspace. (Figure: n = 3, d = 2, with unit-norm column vectors.)

Projection of y onto X (orthogonal projection). Let r = y − ŷ be the residual. The residual must be orthogonal to the columns of X: X^T(y − Xŵ) = 0, hence ŵ = (X^T X)^{-1} X^T y. The prediction on the training set is ŷ = Xŵ = X(X^T X)^{-1} X^T y = Hy, where H = X(X^T X)^{-1} X^T is the hat matrix. One can check that X^T r = 0, i.e. the residual is orthogonal to X.
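To make the projection concrete, here is a minimal MATLAB sketch (not from the original slides; the data and sizes are random placeholders) that solves the normal equations and forms the hat matrix:

% Minimal sketch: least squares as orthogonal projection.
n = 3; d = 2;                  % placeholder sizes, as in the figure
X = randn(n, d);               % columns of X span a d-dimensional subspace
y = randn(n, 1);
w = (X' * X) \ (X' * y);       % normal equations: X'(y - X*w) = 0
yhat = X * w;                  % projection of y onto the column space of X
H = X * ((X' * X) \ X');       % hat matrix, so yhat = H * y
r = y - yhat;                  % residual
disp(norm(X' * r))             % ~0: residual is orthogonal to the columns of X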

This time: the geometry of least squares (again); the SVD in more detail; LMS; ridge regression. (Next up: the SVD.)

Eigenvector decomposition (EVD). For a square matrix A, we say λ is an eigenvalue and u its eigenvector if Au = λu. Stacking up all the eigenvectors/eigenvalues gives AU = UΛ. If the eigenvectors are linearly independent, then A = UΛU^{-1} (diagonalization).

EVD of symmetric matrices. If A is symmetric, all its eigenvalues are real and its eigenvectors are orthonormal, u_i^T u_j = δ_ij. Hence U^T U = UU^T = I and A = UΛU^T = Σ_j λ_j u_j u_j^T.
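A quick numerical check of these facts, as a MATLAB sketch (illustrative only, using a random symmetric matrix):

% Minimal sketch: EVD of a symmetric matrix.
B = randn(4); A = (B + B') / 2;     % construct a symmetric matrix
[U, Lambda] = eig(A);               % A*U = U*Lambda
disp(norm(U' * U - eye(4)))         % ~0: eigenvectors are orthonormal
disp(norm(U * Lambda * U' - A))     % ~0: A = U*Lambda*U'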

SVD. Any real n × d matrix X can be decomposed as X = U D V^T, where the columns of U and V are orthonormal and D is diagonal with non-negative singular values.

Truncated SVD. Keeping only the k largest singular values and the corresponding singular vectors gives the best rank-k approximation to a matrix, X ≈ U_k D_k V_k^T. This is equivalent to PCA.
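A minimal MATLAB sketch of the rank-k approximation (the matrix and k are placeholders):

% Minimal sketch: best rank-k approximation via the truncated SVD.
A = randn(20, 10); k = 3;
[U, D, V] = svd(A, 'econ');
Ak = U(:, 1:k) * D(1:k, 1:k) * V(:, 1:k)';   % keep the top-k singular triplets
disp(norm(A - Ak))                           % equals the (k+1)-th singular value
disp(D(k+1, k+1))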


SVD and EVD. If A is symmetric positive definite, then svals(A) = evals(A), and leftsvecs(A) = rightsvecs(A) = evecs(A), modulo sign changes.

SVD and EVD. For an arbitrary real matrix A: leftsvecs(A) = evecs(AA^T), rightsvecs(A) = evecs(A^T A), and svals(A)^2 = evals(A^T A) = evals(AA^T).
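These relations are easy to verify numerically; a MATLAB sketch (random matrix, illustrative only):

% Minimal sketch: squared singular values equal the eigenvalues of A'*A.
A = randn(6, 4);
s = svd(A);
ev = sort(eig(A' * A), 'descend');
disp([s.^2, ev])    % the two columns agree up to numerical error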

SVD for least squares. Substituting X = U D V^T into the normal equations, we have ŵ = (X^T X)^{-1} X^T y = V D^{-1} U^T y. What if some D_j = 0 (so the rank of X is less than d)?

Pseudo-inverse. If D_j = 0, use the pseudo-inverse: replace 1/D_j by 0, giving ŵ = V D^+ U^T y, where D^+_jj = 1/D_jj if D_jj > 0 and 0 otherwise. Of all solutions w that minimize ||Xw − y||, the pseudo-inverse solution also minimizes ||w||.
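A sketch of the pseudo-inverse solution via the SVD, with the (near-)zero singular values handled explicitly; the rank-deficient X below is made up for illustration:

% Minimal sketch: minimum-norm least-squares solution via the pseudo-inverse.
X = randn(10, 3); X(:, 3) = X(:, 1) + X(:, 2);   % rank-deficient design matrix
y = randn(10, 1);
[U, D, V] = svd(X, 'econ');
dvals = diag(D); tol = 1e-10;
dinv = zeros(size(dvals));
dinv(dvals > tol) = 1 ./ dvals(dvals > tol);     % use 1/D_j where D_j > 0, else 0
w = V * diag(dinv) * (U' * y);
disp(norm(w - pinv(X) * y))                      % ~0: matches MATLAB's pinv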

This time: the geometry of least squares (again); the SVD in more detail; LMS; ridge regression. (Next up: LMS.)

Gradient descent. QR and SVD take O(d^3) time. Alternatively, we can find the MLE by following the gradient: w_{k+1} = w_k − η g_k, where g_k = X^T(Xw_k − y). Each step is cheap (O(nd) for the full gradient, O(d) per example), but many steps may be needed. (Figure: convergence for step sizes η = 0.1 and η = 0.6, and with exact line search.)
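A minimal sketch of batch gradient descent for the least-squares objective (step size, data, and iteration count are arbitrary choices for illustration):

% Minimal sketch: gradient descent for least squares.
n = 100; d = 5;
X = randn(n, d); y = X * ones(d, 1) + 0.1 * randn(n, 1);
w = zeros(d, 1);
eta = 1e-3;                          % small fixed step size
for k = 1:500
    g = X' * (X * w - y);            % gradient of 0.5*||y - X*w||^2
    w = w - eta * g;
end
disp(norm(w - (X \ y)))              % close to the exact least-squares solution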

Stochastic gradient descent. Approximate the gradient using a single data case: w ← w + η (y_i − w^T x_i) x_i. This is the Least Mean Squares (LMS) algorithm, also known as the Widrow-Hoff or delta rule. It can be used to learn online.
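A sketch of the LMS / delta-rule update, sweeping through the data one example at a time (learning rate, data, and number of epochs are placeholders):

% Minimal sketch: LMS (Widrow-Hoff) updates, one example at a time.
n = 200; d = 3;
X = randn(n, d); y = X * [1; -2; 0.5] + 0.1 * randn(n, 1);
w = zeros(d, 1);
eta = 0.01;
for epoch = 1:20
    for i = 1:n
        err = y(i) - X(i, :) * w;        % prediction error on example i
        w = w + eta * err * X(i, :)';    % delta rule: O(d) per update
    end
end
disp([w, X \ y])                         % LMS estimate vs. batch least squares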

This time: the geometry of least squares (again); the SVD in more detail; LMS; ridge regression. (Next up: ridge regression.)

Ridge regression. Minimize the penalized negative log-likelihood l(w) + λ ||w||_2^2. This goes by many names: weight decay, shrinkage, L2 regularization, ridge regression.

Regularization. (Figure: degree D = 14 polynomial fits with λ = 0, λ = 10^-5, and λ = 10^-3.)

Why it works. Coefficients when λ = 0 (the MLE): -0.18, 10.57, -110.28, -245.63, 1664.41, 2647.81, …, 27669.94, 19319.66, -41625.65, -16626.90, 31483.81, … Coefficients when λ = 10^-3: -1.54, 5.52, 3.66, 17.04, -2.63, -23.06, -0.37, -8.49, 7.92, 5.40, 8.29, 7.75, 1.78, 2.03, -8.42, … Small weights mean the curve is almost linear (the same is true for the sigmoid function).

Ridge regression. The objective function is
ŵ = argmin_w Σ_{i=1}^n (y_i − w_0 − x_i^T w)^2 + λ Σ_{j=1}^d w_j^2.
We do not shrink w_0, and we should standardize the inputs first. Constrained formulation:
ŵ = argmin_w Σ_{i=1}^n (y_i − w_0 − x_i^T w)^2 subject to Σ_{j=1}^d w_j^2 ≤ t.
To find the penalized MLE, write J(w) = (y − Xw)^T(y − Xw) + λ w^T w, which gives ŵ = (X^T X + λI)^{-1} X^T y (see book).
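A sketch of the closed-form ridge solution, standardizing the inputs and handling the unpenalized intercept by centering (data and λ are made up):

% Minimal sketch: ridge regression in closed form, without shrinking w0.
n = 50; d = 10; lambda = 1e-3;
X = randn(n, d); y = X * randn(d, 1) + 2 + 0.1 * randn(n, 1);
Xs = (X - mean(X)) ./ std(X);                    % standardize the inputs
yc = y - mean(y);                                % center the targets
w  = (Xs' * Xs + lambda * eye(d)) \ (Xs' * yc);  % penalized MLE
w0 = mean(y);                                    % intercept, not shrunk
yhat = Xs * w + w0;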

QR. Recall ŵ = (X^T X + λI)^{-1} X^T y. Augmented data:
X̃ = [X ; √λ I_d],  ỹ = [y ; 0_{d×1}].
Then J(w) = (ỹ − X̃w)^T(ỹ − X̃w) = (y − Xw)^T(y − Xw) + λ w^T w, so ŵ_ridge = X̃ \ ỹ (solved via QR).
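A sketch of the augmented-data trick: appending √λ I_d rows to X and zeros to y, so that plain least squares (solved via QR by backslash) returns the ridge solution (data and λ are placeholders):

% Minimal sketch: ridge regression via data augmentation and backslash (QR).
n = 50; d = 10; lambda = 1e-3;
X = randn(n, d); y = randn(n, 1);
Xtil = [X; sqrt(lambda) * eye(d)];               % augmented design matrix
ytil = [y; zeros(d, 1)];                         % augmented targets
w_ridge = Xtil \ ytil;                           % ordinary least squares on augmented data
w_check = (X' * X + lambda * eye(d)) \ (X' * y);
disp(norm(w_ridge - w_check))                    % ~0: both give the ridge solution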

SVD. Recall ŵ = (X^T X + λI)^{-1} X^T y. Homework: let X = U D V^T and show that ŵ = V (D^2 + λI)^{-1} D U^T y. This is cheap to compute for many values of λ (the regularization path), which is useful for cross-validation.
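A sketch of computing the whole regularization path from a single SVD of X; after the one-time factorization, each additional λ needs no new matrix factorization (data and the λ grid are placeholders):

% Minimal sketch: ridge regularization path from one SVD of X.
n = 50; d = 10;
X = randn(n, d); y = randn(n, 1);
[U, D, V] = svd(X, 'econ');
dj = diag(D); Uty = U' * y;
lambdas = logspace(-5, 2, 20);
W = zeros(d, numel(lambdas));
for k = 1:numel(lambdas)
    W(:, k) = V * ((dj ./ (dj.^2 + lambdas(k))) .* Uty);  % w = V*(D^2+lambda*I)^{-1}*D*U'*y
end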

Ridge and PCA. We have
ŷ = X ŵ_ridge = U D V^T V (D^2 + λI)^{-1} D U^T y = U D̃ U^T y = Σ_{j=1}^d u_j D̃_jj u_j^T y,
where D̃_jj := [D (D^2 + λI)^{-1} D]_jj = d_j^2 / (d_j^2 + λ). Hence
ŷ = X ŵ_ridge = Σ_{j=1}^d u_j [d_j^2 / (d_j^2 + λ)] u_j^T y.
For comparison, least squares gives
ŷ = X ŵ_ls = (U D V^T)(V D^{-1} U^T y) = U U^T y = Σ_{j=1}^d u_j u_j^T y.
The ratios d_j^2 / (d_j^2 + λ) ≤ 1 are called filter factors.

Ridge and PCA. The d_j^2 are the eigenvalues of the empirical covariance matrix X^T X. Small d_j correspond to directions with small variance; these are the most ill-determined, so they get shrunk the most in
ŷ = X ŵ_ridge = Σ_{j=1}^d u_j [d_j^2 / (d_j^2 + λ)] u_j^T y.

Principal components regression. We can set Z = PCA(X, K) and then regress y on Z (e.g. using a pcatransformer object). PCR sets the (transformed) dimensions K+1, …, d to zero, whereas ridge uses all of the dimensions but down-weights them; ridge predictions are usually more accurate. Feature selection (see later) sets (original) dimensions to zero; again ridge is usually more accurate, but it may be less interpretable.
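A sketch of PCR via the SVD of the centered design matrix (K, the data, and the variable names are made up; this is one possible implementation, not the course's pcatransformer code):

% Minimal sketch: principal components regression (PCR) via the SVD.
n = 100; d = 10; K = 3;
X = randn(n, d); y = randn(n, 1);
Xc = X - mean(X); yc = y - mean(y);     % center the data
[U, D, V] = svd(Xc, 'econ');
Z = Xc * V(:, 1:K);                     % scores on the top K principal directions
wz = (Z' * Z) \ (Z' * yc);              % regress y on the K components
w_pcr = V(:, 1:K) * wz;                 % coefficients in the original coordinates
% PCR zeroes out directions K+1..d; ridge keeps all of them but shrinks each
% by the filter factor d_j^2 / (d_j^2 + lambda).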

Degrees of freedom. The fits with λ = 0, λ = 10^-5, and λ = 10^-3 all have D = 14, but they clearly differ in their effective complexity. For a linear smoother ŷ = S(X) y, define df(S) := trace(S). For ridge regression this gives
df(λ) = Σ_{j=1}^d d_j^2 / (d_j^2 + λ).
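A sketch checking the two expressions for the effective degrees of freedom against each other (data and λ are placeholders):

% Minimal sketch: effective degrees of freedom of ridge regression.
n = 50; d = 14; lambda = 1e-3;
X = randn(n, d);
S = X * ((X' * X + lambda * eye(d)) \ X');   % smoother matrix, yhat = S*y
dj = svd(X);
df_trace = trace(S);
df_svd = sum(dj.^2 ./ (dj.^2 + lambda));
disp([df_trace, df_svd])                     % the two values agree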

Tikhonov regularization.
min_f (1/2) ∫_0^1 (f(x) − y(x))^2 dx + (λ/2) ∫_0^1 [f'(x)]^2 dx

Discretization.
min_f (1/2) ∫_0^1 (f(x) − y(x))^2 dx + (λ/2) ∫_0^1 [f'(x)]^2 dx
≈ min_f (1/2) Σ_{i=1}^n (f_i − y_i)^2 + (λ/2) Σ_{i=1}^{n−1} (f_{i+1} − f_i)^2
= min_f (1/2) Σ_{i=1}^n (f_i − y_i)^2 + (λ/4) Σ_{i=1}^n [(f_i − f_{i−1})^2 + (f_i − f_{i+1})^2],
with boundary conditions f_0 = f_1 and f_{n+1} = f_n.

Matrix form. The discretized objective
(1/2) Σ_{i=1}^n (f_i − y_i)^2 + (λ/4) Σ_{i=1}^n [(f_i − f_{i−1})^2 + (f_i − f_{i+1})^2]
can be written as J(w) = ||y − w||^2 + λ ||Dw||^2, where D is the (n−1) × n first-difference matrix

D = [ -1  1            ]
    [    -1  1         ]
    [        ...  ...  ]
    [            -1  1 ]

so that ||Dw||^2 = w^T (D^T D) w = Σ_{i=1}^{n−1} (w_{i+1} − w_i)^2, and D^T D is the tridiagonal second-difference matrix

D^T D = [  1 -1  0  ...  0  0 ]
        [ -1  2 -1  ...  0  0 ]
        [  0 -1  2  ...  0  0 ]
        [      ...   ...      ]
        [  0  0  ... -1  2 -1 ]
        [  0  0  ...  0 -1  1 ]

QR.
min_w || [I_n ; √λ D] w − [y ; 0] ||_2^2

Listing 1:
D = spdiags(ones(n-1,1)*[-1 1], [0 1], n-1, n);
A = [speye(n); sqrt(lambda)*D];
b = [y; zeros(n-1,1)];
w = A \ b;