CS540 Machine learning Lecture 5

Julius Gordon
5 years ago
2 Last time: basis functions for linear regression; normal equations; QR; SVD (briefly).

3 This time: geometry of least squares (again); SVD, more slowly; LMS; ridge regression.

4 Geometry of least squares. The columns of X span a d-dimensional linear subspace of $\mathbb{R}^n$ (assuming they are linearly independent). $\hat{y}$ is the orthogonal projection of y onto that subspace. In the figure, n = 3 and d = 2 (figure label: unit norm).

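5 Projection of y onto X. Orthogonal projection: let $r = y - \hat{y}$, where $\hat{y} = X\hat{w}$. The residual must be orthogonal to every column of X, i.e. $X^T (y - X\hat{w}) = 0$. Hence $\hat{w} = (X^T X)^{-1} X^T y$. Prediction on the training set: $\hat{y} = X\hat{w} = X (X^T X)^{-1} X^T y = P y$, where $P = X (X^T X)^{-1} X^T$ is the hat matrix. The residual is orthogonal: $X^T r = 0$.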
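The orthogonality of the residual can be checked numerically on a tiny problem with n = 3 and d = 2, matching the dimensions of the geometry slide. This is a minimal sketch in plain Python; the data values and the helper names (`transpose`, `matmul`, `solve_2x2`) are made up for illustration and are not from the lecture.

```python
# Least squares as an orthogonal projection, for n = 3 cases and d = 2 columns.
# The data values are hypothetical.

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve_2x2(A, b):
    """Solve A w = b for a 2x2 matrix A using Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]  # columns: constant term, input
y = [1.0, 3.0, 4.0]

Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = [sum(xi * yi for xi, yi in zip(col, y)) for col in Xt]

# Normal equations: (X^T X) w = X^T y.
w = solve_2x2(XtX, Xty)

# Projection y_hat = X w and residual r = y - y_hat.
y_hat = [sum(xij * wj for xij, wj in zip(row, w)) for row in X]
r = [yi - yh for yi, yh in zip(y, y_hat)]

# X^T r should be (numerically) zero: r is orthogonal to each column of X.
Xt_r = [sum(xi * ri for xi, ri in zip(col, r)) for col in Xt]
print("w =", w)
print("X^T r =", Xt_r)
```

Solving the normal equations directly is fine at this size; for larger or ill-conditioned problems the QR and SVD routes discussed in the lecture are usually preferred.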

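7 Eigenvector decomposition (EVD). For any square matrix A, we say λ is an eigenvalue (eval) and u is its eigenvector (evec) if $A u = \lambda u$. Stacking up all evecs and evals gives $A U = U \Lambda$, where the columns of U are the evecs and Λ is the diagonal matrix of evals. If the evecs are linearly independent, U is invertible and we get the diagonalization $A = U \Lambda U^{-1}$.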

8 EVD of symmetric matrices. If A is symmetric, all its evals are real, and its evecs can be chosen orthonormal: $u_i^T u_j = \delta_{ij}$. Hence $U^T U = U U^T = I$ and $A = U \Lambda U^T = \sum_i \lambda_i u_i u_i^T$.

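9 SVD. For any real n × d matrix X (with n ≥ d), $X = U D V^T$, where U is n × d with orthonormal columns ($U^T U = I$), V is d × d and orthogonal ($V^T V = V V^T = I$), and D is d × d and diagonal, holding the singular values $d_1 \ge d_2 \ge \dots \ge d_d \ge 0$.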

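10 Truncated SVD. Keeping only the k largest singular values (and the matching singular vectors) gives a rank-k approximation to a matrix. This is equivalent to PCA when the columns of the matrix are centered.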

12 SVD and EVD. If A is symmetric positive definite, then svals(A) = evals(A) and leftsvecs(A) = rightsvecs(A) = evecs(A), modulo sign changes.

13 SVD and EVD. For an arbitrary real matrix A: leftsvecs(A) = evecs($A A^T$), rightsvecs(A) = evecs($A^T A$), and svals(A)$^2$ = evals($A^T A$) = evals($A A^T$) (the nonzero ones).
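These identities follow in one line from the SVD. The short derivation below is a sketch that assumes the notation $A = U D V^T$ with $U^T U = I$ and $V^T V = I$; the symbol names are an assumption, since the slide itself only states the result.

```latex
\begin{align*}
A^T A &= (U D V^T)^T (U D V^T) = V D\, U^T U\, D V^T = V D^2 V^T
  &&\Rightarrow\quad (A^T A)\, V = V D^2, \\
A A^T &= (U D V^T)(U D V^T)^T = U D\, V^T V\, D U^T = U D^2 U^T
  &&\Rightarrow\quad (A A^T)\, U = U D^2 .
\end{align*}
```

So the columns of V are eigenvectors of $A^T A$, the columns of U are eigenvectors of $A A^T$, and in both cases the eigenvalues are the squared singular values.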

14 SVD for least squares. With $X = U D V^T$ we have $\hat{w} = (X^T X)^{-1} X^T y = V D^{-1} U^T y$. What if $D_j = 0$ for some j (so the rank of X is less than d)?

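15 Pseudo inverse. If $D_j = 0$, replace $1/D_j$ by 0, i.e. use $X^{\dagger} = V D^{\dagger} U^T$ with $D^{\dagger}_{jj} = 1/D_j$ if $D_j > 0$ and 0 otherwise. Of all solutions w that minimize $\|Xw - y\|$, the pinv solution $\hat{w} = X^{\dagger} y$ also minimizes $\|w\|$.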
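The minimum-norm property can be seen directly from the SVD. The argument below is a sketch that assumes n ≥ d, $X = U D V^T$ with V a d × d orthogonal matrix whose columns are $v_j$, and singular values $d_j$; the coefficients $\alpha_j$ are introduced here only for illustration.

```latex
\begin{align*}
w_{\text{pinv}} &= \sum_{j:\, d_j > 0} \frac{u_j^T y}{d_j}\, v_j, \\
\text{any minimizer:}\quad
w &= w_{\text{pinv}} + \sum_{j:\, d_j = 0} \alpha_j v_j
  \qquad (\text{since } X v_j = 0 \text{ whenever } d_j = 0), \\
\|w\|^2 &= \|w_{\text{pinv}}\|^2 + \sum_{j:\, d_j = 0} \alpha_j^2
  \;\ge\; \|w_{\text{pinv}}\|^2 .
\end{align*}
```

The cross terms vanish because the $v_j$ are orthonormal, so the norm is smallest when every $\alpha_j = 0$, which is exactly the pinv solution.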

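17 Gradient descent. QR and SVD take $O(d^3)$ time (more precisely $O(n d^2)$ for an n × d matrix with n ≥ d). Alternatively, we can find the MLE by following the gradient: $w_{k+1} = w_k - \eta \nabla J(w_k)$. Each step is cheap (O(d) per data case), but many steps may be needed. Figure panels: η = 0.1; η = 0.6; exact line search.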
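A minimal sketch of fixed-step gradient descent for least squares, in plain Python. The data, the objective scaling $J(w) = \frac{1}{2}\|Xw - y\|^2$, and the function names are assumptions made for illustration; the step sizes in the slide's figure refer to the lecture's own example, not to this data.

```python
# Fixed-step gradient descent on J(w) = 0.5 * ||X w - y||^2.
# Data values are hypothetical.

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 4.0]

def gradient(w):
    """Gradient of J at w, i.e. X^T (X w - y)."""
    residual = [sum(xj * wj for xj, wj in zip(row, w)) - yi
                for row, yi in zip(X, y)]
    return [sum(row[j] * r for row, r in zip(X, residual))
            for j in range(len(w))]

def gradient_descent(eta, num_steps):
    """Run num_steps updates w <- w - eta * gradient(w), starting from zero."""
    w = [0.0, 0.0]
    for _ in range(num_steps):
        g = gradient(w)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

w = gradient_descent(eta=0.1, num_steps=500)
print(w)  # approaches the least squares solution of this toy problem
```

For this particular data the largest eigenvalue of $X^T X$ is $4 + \sqrt{10}$, so a fixed step converges only when η is below roughly 0.28; larger steps diverge, which is one motivation for a line search.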

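18 Stochastic gradient descent. Approximate the gradient by looking at a single data case $(x_i, y_i)$: $w_{k+1} = w_k + \eta\,(y_i - w_k^T x_i)\,x_i$. This is known as the least mean squares (LMS) algorithm, the Widrow-Hoff rule, or the delta rule. It can be used to learn online.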
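The single-case update can be sketched in a few lines of plain Python. The function name `lms`, the noise-free toy data generated from y = 2 + 3x, and the step size are assumptions chosen so the behaviour is easy to check; they are not taken from the lecture.

```python
# LMS (Widrow-Hoff) updates: one data case at a time.

def lms(stream, eta, d):
    """Consume (x, y) pairs one at a time and return the final weights."""
    w = [0.0] * d
    for x, y in stream:
        error = y - sum(wj * xj for wj, xj in zip(w, x))     # y_i - w^T x_i
        w = [wj + eta * error * xj for wj, xj in zip(w, x)]  # delta rule
    return w

# Hypothetical noise-free data from y = 2 + 3 x, with features [1, x].
inputs = [0.0, 0.5, 1.0, 1.5, 2.0]
data = [([1.0, x], 2.0 + 3.0 * x) for x in inputs]

# Simulate an online stream by cycling through the data 500 times.
stream = (data[t % len(data)] for t in range(500 * len(data)))
w = lms(stream, eta=0.1, d=2)
print(w)  # close to [2.0, 3.0] on this noise-free data
```

Because `lms` only ever sees one pair at a time, the same function works on a genuine stream. With noisy data a constant η leaves the weights jittering around the solution, so a decaying step size is commonly used.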

20 Ridge regression. Minimize the penalized negative log likelihood $\ell(w) + \lambda \|w\|_2^2$. Also known as weight decay, shrinkage, or L2 regularization.

21 Regularization. Figure panels, all with D = 14: λ = 0, λ = $10^{-5}$, λ = …

22 Why it works. Coefficients if λ = 0 (MLE): -0.18, 10.57, …, …54. Coefficients if λ = …: …, 5.52, 3.66, 17.04, -2.63, …, -0.37, …, 5.40, 8.29, 7.75, 1.78, 2.03, -8.42, … Small weights mean the curve is almost linear (the same is true for the sigmoid function).

23 Ridge regression. The objective function is
$w^* = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w - w_0)^2 + \lambda \sum_{j=1}^d w_j^2$.
We don't shrink $w_0$. We should standardize first. Constrained formulation:
$w^* = \arg\min_w \sum_{i=1}^n (y_i - x_i^T w - w_0)^2$ s.t. $\sum_{j=1}^d w_j^2 \le t$.
Find the penalized MLE: $J(w) = (y - Xw)^T (y - Xw) + \lambda w^T w$, which gives $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$. See book.

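24 QR. Recall $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$. Expanded data:
$\tilde{X} = \begin{pmatrix} X \\ \sqrt{\lambda}\, I_d \end{pmatrix}, \quad \tilde{y} = \begin{pmatrix} y \\ 0_{d \times 1} \end{pmatrix}$.
Then $J(w) = (\tilde{y} - \tilde{X} w)^T (\tilde{y} - \tilde{X} w) = (y - Xw)^T (y - Xw) + \lambda w^T w$, so $\hat{w}_{ridge} = \tilde{X} \backslash \tilde{y}$.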
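The expanded-data trick can be checked numerically. The sketch below uses plain Python with made-up data (d = 2, no offset term) and hypothetical helper names; for brevity it solves the 2 × 2 normal equations with Cramer's rule instead of using QR as the slide does, so it illustrates the equivalence rather than the numerically preferred solver.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve_2x2(A, b):
    """Solve A w = b for a 2x2 matrix A using Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]

# Hypothetical centered data: n = 3 cases, d = 2 features.
X = [[-1.0, 0.5], [0.0, -1.0], [1.0, 0.5]]
y = [-1.0, 0.2, 0.8]
lam = 0.5

# Direct ridge solution: (X^T X + lam I) w = X^T y.
Xt = transpose(X)
XtX = matmul(Xt, X)
ridge_lhs = [[XtX[i][j] + (lam if i == j else 0.0) for j in range(2)]
             for i in range(2)]
w_ridge = solve_2x2(ridge_lhs, matvec(Xt, y))

# Expanded data: sqrt(lam) * I_d stacked under X, and d zeros under y.
s = math.sqrt(lam)
X_aug = X + [[s, 0.0], [0.0, s]]
y_aug = y + [0.0, 0.0]

# Ordinary least squares on the expanded data.
Xt_aug = transpose(X_aug)
aug_lhs = matmul(Xt_aug, X_aug)
w_aug = solve_2x2(aug_lhs, matvec(Xt_aug, y_aug))

print(w_ridge, w_aug)  # the two solutions agree up to rounding
```

The extra rows contribute exactly $\lambda I$ to $\tilde{X}^T \tilde{X}$ and nothing to $\tilde{X}^T \tilde{y}$, which is why an unmodified least squares solver returns the ridge estimate.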

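25 SVD. Recall $\hat{w} = (X^T X + \lambda I)^{-1} X^T y$. Homework: let $X = U D V^T$; show that $\hat{w} = V (D^2 + \lambda I)^{-1} D U^T y$. This is cheap to compute for many values of λ (the regularization path), which is useful for cross-validation (CV).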
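One way to obtain the SVD form is sketched below. It assumes n ≥ d, so that V is a square orthogonal matrix and $V V^T = I$; that assumption is not stated on the slide.

```latex
\begin{align*}
X^T X + \lambda I
  &= V D\, U^T U\, D V^T + \lambda V V^T
   = V (D^2 + \lambda I) V^T, \\
(X^T X + \lambda I)^{-1}
  &= V (D^2 + \lambda I)^{-1} V^T, \\
\hat{w}
  &= V (D^2 + \lambda I)^{-1} V^T \, V D U^T y
   = V (D^2 + \lambda I)^{-1} D U^T y .
\end{align*}
```

Since $D^2 + \lambda I$ is diagonal, changing λ only changes d scalar divisions once the SVD has been computed, which is what makes the regularization path cheap.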

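26 Ridge and PCA. We have
$\hat{y} = X \hat{w}_{ridge} = U D V^T V (D^2 + \lambda I)^{-1} D U^T y = U \tilde{D} U^T y = \sum_{j=1}^d u_j \tilde{D}_{jj} u_j^T y$,
where $\tilde{D}_{jj} \triangleq [D (D^2 + \lambda I)^{-1} D]_{jj} = \frac{d_j^2}{d_j^2 + \lambda}$. Hence
$\hat{y} = X \hat{w}_{ridge} = \sum_{j=1}^d u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y$,
whereas for least squares
$\hat{y} = X \hat{w}_{ls} = (U D V^T)(V D^{-1} U^T y) = U U^T y = \sum_{j=1}^d u_j u_j^T y$.
The weights $d_j^2 / (d_j^2 + \lambda) \le 1$ are called filter factors.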

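27 Ridge and PCA. The $d_j^2$ are the eigenvalues of $X^T X$, which is proportional to the empirical covariance matrix when the columns of X are centered. Small $d_j$ correspond to directions j with small variance: these get shrunk the most, since they are the most ill-determined.
$\hat{y} = X \hat{w}_{ridge} = \sum_{j=1}^d u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y$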

28 Principal components regression (PCR). Can set Z = PCA(X, K), then w = regress(X, y) using a pcaTransformer object. PCR sets the (transformed) dimensions K+1, …, d to zero, whereas ridge uses all dimensions, each down-weighted. Ridge predictions are usually more accurate. Feature selection (see later) sets the (original) dimensions K+1, …, d to zero. Ridge is usually more accurate, but may be less interpretable.

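29 Degrees of freedom. The fits with λ = 0, λ = $10^{-5}$ and λ = $10^{-3}$ all have D = 14, but clearly differ in their effective complexity. For a linear smoother $\hat{y} = S(X)\, y$, define $df(S) \triangleq \mathrm{trace}(S)$. For ridge regression this gives $df(\lambda) = \sum_{j=1}^d \frac{d_j^2}{d_j^2 + \lambda}$.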
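The degrees-of-freedom formula is just the sum of the filter factors, so it is easy to evaluate once the singular values are known. In the sketch below the singular values are made-up numbers and the function names are hypothetical.

```python
def filter_factors(singular_values, lam):
    """Return d_j^2 / (d_j^2 + lam) for each singular value d_j.

    Assumes lam > 0, or all singular values nonzero when lam == 0.
    """
    return [d * d / (d * d + lam) for d in singular_values]

def degrees_of_freedom(singular_values, lam):
    """Effective degrees of freedom of ridge regression: sum of filter factors."""
    return sum(filter_factors(singular_values, lam))

svals = [2.0, 1.0, 0.1]  # hypothetical singular values of X

for lam in [0.0, 1e-3, 1.0, 100.0]:
    print(lam, degrees_of_freedom(svals, lam))
```

With λ = 0 every factor is 1 and df equals d, the ordinary parameter count; as λ grows, the directions with small $d_j$ are switched off first and df decreases towards 0.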

30 Tikhonov regularization. $\min_f \frac{1}{2} \int_0^1 (f(x) - y(x))^2\, dx + \frac{\lambda}{2} \int_0^1 [f'(x)]^2\, dx$

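31 Discretization.
$\min_f \frac{1}{2} \int_0^1 (f(x) - y(x))^2\, dx + \frac{\lambda}{2} \int_0^1 [f'(x)]^2\, dx$
$\approx \min_f \frac{1}{2} \sum_{i=1}^n (f_i - y_i)^2 + \frac{\lambda}{2} \sum_{i=1}^{n-1} (f_{i+1} - f_i)^2$
$= \min_f \frac{1}{2} \sum_{i=1}^n (f_i - y_i)^2 + \frac{\lambda}{4} \sum_{i=1}^n \left[ (f_i - f_{i-1})^2 + (f_i - f_{i+1})^2 \right]$
Boundary conditions: $f_0 = f_1$, $f_{n+1} = f_n$.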

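32 Matrix form. Writing w for f and dropping the overall factor of 1/2, the objective
$\min_f \frac{1}{2} \sum_{i=1}^n (f_i - y_i)^2 + \frac{\lambda}{4} \sum_{i=1}^n \left[ (f_i - f_{i-1})^2 + (f_i - f_{i+1})^2 \right]$
becomes $J(w) = \|y - w\|^2 + \lambda \|Dw\|^2$, where D is the (n-1) × n first-difference matrix
$D = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix}$,
so that $\|Dw\|^2 = w^T (D^T D) w = \sum_{i=1}^{n-1} (w_{i+1} - w_i)^2$, with
$D^T D = \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}$.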

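33 QR. Solve $\min_w \left\| \begin{pmatrix} I_n \\ \sqrt{\lambda}\, D \end{pmatrix} w - \begin{pmatrix} y \\ 0 \end{pmatrix} \right\|^2$.
Listing 1:
    N = length(y);                                    % number of points; assumes y is a column vector
    D = spdiags(ones(N-1,1)*[-1 1], [0 1], N-1, N);   % sparse (N-1) x N first-difference matrix
    A = [speye(N); sqrt(lambda)*D];                   % stacked design matrix [I; sqrt(lambda) D]
    b = [y; zeros(N-1,1)];                            % targets padded with N-1 zeros
    w = A \ b;                                        % sparse least squares solve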
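The same smoothing problem can be sketched in plain Python. Instead of the sparse QR solve used in the listing, this version solves the equivalent normal equations $(I + \lambda D^T D)\, w = y$, which are tridiagonal, using the Thomas algorithm. That substitution, the function names, and the test signal are assumptions made for illustration.

```python
def smooth(y, lam):
    """Minimize ||y - w||^2 + lam * ||D w||^2, D being the first-difference matrix.

    Solves the tridiagonal system (I + lam * D^T D) w = y with the Thomas algorithm.
    """
    n = len(y)
    if n == 1:
        return list(y)
    # Diagonal of I + lam * D^T D: 1 + lam at the two ends, 1 + 2 * lam inside.
    diag = [1.0 + lam * (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
    off = -lam  # every sub- and super-diagonal entry
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = off / diag[0]
    d[0] = y[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (y[i] - off * d[i - 1]) / denom
    w = [0.0] * n
    w[-1] = d[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        w[i] = d[i] - c[i] * w[i + 1]
    return w

def roughness(w):
    """Sum of squared first differences, i.e. ||D w||^2."""
    return sum((b - a) ** 2 for a, b in zip(w, w[1:]))

y = [0.0, 1.0, 0.0, 1.0, 0.0]  # hypothetical zig-zag signal
w = smooth(y, lam=2.0)
print(w, roughness(y), roughness(w))
```

Larger λ pulls neighbouring values together, and because the constant vector lies in the null space of D, the smoothed signal keeps the same sum (and mean) as y.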
