Assignment 2 (Sol.)
Introduction to Machine Learning
Prof. B. Ravindran


1. Let A be an m × n matrix of real numbers. The matrix AA^T has an eigenvector x with eigenvalue b. Then the eigenvector y of A^T A which has eigenvalue b is equal to

(a) x^T A
(b) A^T x
(c) x
(d) Cannot be described in terms of x

Sol. (b)

(AA^T) x = b x

Multiplying by A^T on both sides and rearranging,

A^T (AA^T x) = A^T (b x)
(A^T A)(A^T x) = b (A^T x)

Hence, A^T x is an eigenvector of A^T A, with eigenvalue b.

2. Let A be an n × n row-stochastic matrix - in other words, all elements are non-negative and the sum of the elements in every row is 1. Let b be an eigenvalue of A. Which of the following is true?

(a) b > 1
(b) b ≤ 1
(c) b ≥ 1
(d) b < 1

Sol. (b)

Note that Ax = bx, where x is an eigenvector and b is an eigenvalue. Let x_max be the largest element of x, let A_ij denote the element in the ith row and jth column of A, and let x_j denote the jth element of x. Now,

Σ_{j=1}^{n} A_ij x_j = b x_i

Consider the case where x_i corresponds to the maximum element of x. The RHS is then equal to b x_max. Since Σ_{j=1}^{n} A_ij = 1 and A_ij ≥ 0, the LHS is less than or equal to x_max. Hence,

b x_max ≤ x_max

Hence, b ≤ 1.
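
As a sanity check (not part of the original solution), the claims in Q1 and Q2 are easy to verify numerically. The numpy sketch below, with arbitrarily chosen matrix sizes and random seed, maps an eigenvector of AA^T through A^T, and also confirms that a random row-stochastic matrix has no eigenvalue of magnitude greater than 1.

import numpy as np

rng = np.random.default_rng(0)

# Q1: an eigenvector x of A A^T maps to an eigenvector A^T x of A^T A.
A = rng.standard_normal((4, 6))
vals, vecs = np.linalg.eigh(A @ A.T)        # A A^T is symmetric
b, x = vals[-1], vecs[:, -1]                # largest eigenvalue and its eigenvector
y = A.T @ x
print(np.allclose((A.T @ A) @ y, b * y))    # True: (A^T A)(A^T x) = b (A^T x)

# Q2: eigenvalues of a row-stochastic matrix are at most 1 in magnitude.
M = rng.random((5, 5))
M = M / M.sum(axis=1, keepdims=True)        # non-negative entries, rows sum to 1
print(np.max(np.abs(np.linalg.eigvals(M))) <= 1 + 1e-12)    # True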

3. Let u be an n × 1 vector such that u^T u = 1, and let I be the n × n identity matrix. The n × n matrix A is given by (I - k uu^T), where k is a real constant. u itself is an eigenvector of A, with eigenvalue -1. What is the value of k?

(a) -2
(b) -1
(c) 2
(d) 0

Sol. (c)

(I - k uu^T) u = -u
u - k u (u^T u) = -u
2u - k u = 0

Hence, k = 2.

4. Which of the following are true for any m × n matrix A of real numbers?

(a) The rowspace of A is the same as the columnspace of A^T
(b) The rowspace of A is the same as the rowspace of A^T
(c) The eigenvectors of AA^T are the same as the eigenvectors of A^T A
(d) The eigenvalues of AA^T are the same as the eigenvalues of A^T A

Sol. (a), (d)

Since the rows of A are the same as the columns of A^T, the rowspace of A is the same as the columnspace of A^T. The eigenvalues of AA^T are the same as the eigenvalues of A^T A, because if AA^T x = λx, then A^T A (A^T x) = λ (A^T x). (b) is clearly not necessarily true. (c) need not hold either: although the eigenvalues are the same, the corresponding eigenvectors have a factor of A^T multiplying them.
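
Q3 and Q4 can be checked the same way. With k = 2, the matrix I - 2uu^T is a reflection that sends u to -u, and the nonzero eigenvalues of AA^T and A^T A coincide. The numpy sketch below (sizes and seed arbitrary, not part of the original solution) illustrates both.

import numpy as np

rng = np.random.default_rng(1)

# Q3: with k = 2, A = I - 2 u u^T sends u to -u.
n = 5
u = rng.standard_normal(n)
u = u / np.linalg.norm(u)                   # enforce u^T u = 1
A = np.eye(n) - 2.0 * np.outer(u, u)
print(np.allclose(A @ u, -u))               # True: u has eigenvalue -1

# Q4 (d): AA^T and A^T A share their nonzero eigenvalues.
B = rng.standard_normal((3, 5))
ev_small = np.linalg.eigvalsh(B @ B.T)      # 3 eigenvalues, ascending
ev_large = np.linalg.eigvalsh(B.T @ B)[-3:] # top 3 of 5 (the other 2 are ~0)
print(np.allclose(ev_small, ev_large))      # True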

5. Consider the following 4 training examples:

x = -1, y = 0.0319
x = 0, y = 0.8692
x = 1, y = 1.9566
x = 2, y = 3.0343

We want to learn a function f(x) = ax + b which is parametrized by (a, b). Using squared error as the loss function, which of the following parameters would you use to model this function?

(a) (1, 1)
(b) (1, 2)
(c) (2, 1)
(d) (2, 2)

Sol. (a)

The line y = x + 1 is the one with minimum squared error out of the four proposed.

6. You are given the following five training instances:

x_1 = 2, x_2 = 1, y = 4
x_1 = 6, x_2 = 3, y = 2
x_1 = 2, x_2 = 5, y = 2
x_1 = 6, x_2 = 7, y = 3
x_1 = 10, x_2 = 7, y = 3

We want to model this function using the K-nearest neighbor regressor model. When we want to predict the value of y corresponding to (x_1, x_2) = (3, 6):

(a) For K = 2, y = 3
(b) For K = 2, y = 2.5
(c) For K = 3, y = 2.33
(d) For K = 3, y = 2.666

Sol. (b), (c)

Points 3 and 4 are the two closest points to (x_1, x_2) = (3, 6). Hence, the 2-nearest-neighbor prediction is the average of their y values, which is 2.5. The third closest point is point 2; adding it gives an average y value of 2.33.
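
The computations in Q5 and Q6 can be reproduced in a few lines of numpy (an illustrative sketch, not part of the original assignment):

import numpy as np

# Q5: the least-squares fit of f(x) = a*x + b is close to (a, b) = (1, 1).
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([0.0319, 0.8692, 1.9566, 3.0343])
a, b = np.polyfit(x, y, deg=1)              # degree-1 least-squares fit
print(round(a, 2), round(b, 2))             # approximately 1.01 and 0.97

# Q6: K-nearest-neighbor regression at the query point (3, 6).
X = np.array([[2, 1], [6, 3], [2, 5], [6, 7], [10, 7]], dtype=float)
t = np.array([4, 2, 2, 3, 3], dtype=float)
q = np.array([3.0, 6.0])
order = np.argsort(np.linalg.norm(X - q, axis=1))   # neighbors sorted by distance
print(t[order[:2]].mean())                  # 2.5  (K = 2)
print(round(t[order[:3]].mean(), 2))        # 2.33 (K = 3)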

7. Bias and variance can be visualized using a classic example of a dart game. We can think of the true value of the parameters as the bull's-eye on a target, and the arrow's value as the estimated value from each sample. Consider the following situations (see Figure 1), and select the correct option(s).

(a) Player 1 has low variance compared to player 4
(b) Player 1 has higher variance compared to player 4
(c) Bias exhibited by player 2 is more than that exhibited by player 3

Figure 1: Figure for Q7 (dart throws for players 1-4)

Sol. (b)

We can think of the true value of the population parameter as the bull's-eye on a target, and of the sample statistic as an arrow fired at the bull's-eye. Bias and variability describe what happens when a player fires many arrows at the target. Bias means that the aim is off and the arrows land consistently off the bull's-eye in the same direction: the sample values do not center about the population value. Large variability means that repeated shots are widely scattered on the target: repeated samples do not give similar results but differ widely among themselves.

Player 1 shows high bias and high variance.
Player 2 shows low bias and high variance.
Player 3 shows low bias and low variance.
Player 4 shows high bias and low variance.

8. Choose the correct option(s) from the following.

(a) When working with a small dataset, one should prefer low bias/high variance classifiers over high bias/low variance classifiers.
(b) When working with a small dataset, one should prefer high bias/low variance classifiers over low bias/high variance classifiers.
(c) When working with a large dataset, one should prefer high bias/low variance classifiers over low bias/high variance classifiers.
(d) When working with a large dataset, one should prefer low bias/high variance classifiers over high bias/low variance classifiers.

Sol. (b), (d)

On smaller datasets, variance is a concern, since even small changes in the training set may change the optimal parameters significantly. Hence, a high bias/low variance classifier would be preferred. On the other hand, with a large dataset, since we have sufficient points to represent the data distribution accurately, variance is not of much concern. Hence, one would go for the classifier with low bias even though it has higher variance.
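
The intuition in Q8 can also be seen in a toy simulation. The sketch below is purely illustrative (the sine "true" function, noise level, and polynomial degrees are arbitrary choices, not from the assignment): it compares a high bias/low variance model (a degree-1 polynomial) with a low bias/high variance model (a degree-9 polynomial). On small training sets the simple model usually has lower test error, while on large training sets the flexible model wins.

import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: np.sin(2 * np.pi * x)    # assumed "true" function for the toy example

def avg_test_mse(n_train, degree, trials=200):
    # Average test MSE of a polynomial fit of the given degree, over many
    # independently drawn training sets of size n_train.
    x_test = np.linspace(0, 1, 200)
    errs = []
    for _ in range(trials):
        x = rng.random(n_train)
        y = true_f(x) + 0.3 * rng.standard_normal(n_train)
        coeffs = np.polyfit(x, y, degree)
        errs.append(np.mean((np.polyval(coeffs, x_test) - true_f(x_test)) ** 2))
    return np.mean(errs)

# Small dataset: the high bias/low variance model (degree 1) usually does better.
print(avg_test_mse(15, degree=1), avg_test_mse(15, degree=9))
# Large dataset: the low bias/high variance model (degree 9) does better.
print(avg_test_mse(500, degree=1), avg_test_mse(500, degree=9))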

9. Consider a modified k-NN method in which, once the k nearest neighbors to the query point are identified, you do a linear regression fit on them and output the fitted value for the query point. Which of the following is/are true regarding this method?

(a) This method makes an assumption that the data is locally linear.
(b) In order to perform well, this method would need densely distributed training data.
(c) This method has higher bias compared to k-NN.
(d) This method has higher variance compared to k-NN.

Sol. (a), (b), (d)

Since we do a linear fit in the k-neighborhood, we are making an assumption of local linearity; hence, (a) holds. The method would need densely distributed training data to perform well: if the training data is sparse, the k-neighborhood ends up being quite spread out (not really local anymore), and the assumption of local linearity would not give good results; hence, (b) holds. The method has higher variance, since we now have two parameters (slope and intercept) instead of one in the case of conventional k-NN. (In the conventional case, we just try to fit a constant, and the average happens to be the constant which minimizes the squared error.)

10. The Singular Value Decomposition (SVD) of a matrix R is given by USV^T. Consider an orthogonal matrix Q and A = QR. The SVD of A is given by U_1 S_1 V_1^T. Which of the following is/are true? Note: there can be more than one correct option.

(a) U = U_1
(b) S = S_1
(c) V = V_1

Sol. (b), (c)

The matrix V_1 represents the eigenvectors of A^T A. Now,

A^T A = (R^T Q^T)(QR)

Since Q is orthogonal, Q^T Q = I. Therefore,

A^T A = R^T I R = R^T R

Since these matrices are equal, their eigenvectors will be the same. V represents the eigenvectors of R^T R and V_1 the eigenvectors of A^T A; hence, V = V_1. Also, the diagonal entries of S are the square roots of the eigenvalues of R^T R, and those of S_1 are the square roots of the eigenvalues of A^T A (equivalently of AA^T, which has the same set of eigenvalues), so S = S_1. However, U need not be equal to U_1, since AA^T = (QR)(R^T Q^T) = Q(RR^T)Q^T.
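
Q10 can be verified numerically as well. In the sketch below (sizes and seed arbitrary, not part of the original solution) a random orthogonal Q is obtained from a QR factorization; note that numpy's SVD is only determined up to a sign per singular vector, so V and V_1 are compared column-wise up to sign.

import numpy as np

rng = np.random.default_rng(3)

# Q10: compare the SVD of R with the SVD of A = Q R for an orthogonal Q.
n = 4
R = rng.standard_normal((n, n))
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))     # random orthogonal matrix
A = Q @ R

U, S, Vt = np.linalg.svd(R)
U1, S1, Vt1 = np.linalg.svd(A)

print(np.allclose(S, S1))                            # True: S = S_1
# Right singular vectors agree up to a sign per column:
print(np.allclose(np.abs(np.diag(Vt1 @ Vt.T)), 1.0)) # True: V = V_1 (up to sign)
# Left singular vectors do not agree in general (they are rotated by Q):
print(np.allclose(np.abs(np.diag(U1.T @ U)), 1.0))   # typically False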

11. Assume that the feature vectors defining the training data are not all linearly independent. What happens if we apply the standard linear regression formulation considering all feature vectors?

(a) The coefficients β̂ are not uniquely defined.
(b) ŷ = Xβ̂ is no longer the projection of y onto the column space of X.
(c) X is full rank.
(d) X^T X is singular.

Sol. (a), (d)

Since the columns of X are not linearly independent, X does not have full rank. As a result, X^T X also does not have full rank and is therefore singular. The coefficients are not uniquely defined, since there are multiple feasible solutions as a result of X^T X being singular and hence non-invertible.
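
Finally, the situation in Q11 is easy to reproduce: duplicating (a multiple of) a column of X makes X^T X singular and leaves the coefficients under-determined. The sketch below is illustrative only; the design matrix and coefficient vectors are made up for the example.

import numpy as np

rng = np.random.default_rng(4)

# Q11: make the columns of X linearly dependent (third column = 2 * second).
n = 20
x1 = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x1, 2 * x1])

print(np.linalg.matrix_rank(X.T @ X))               # 2, not 3: X^T X is singular

# Two different coefficient vectors that produce exactly the same fitted values,
# so the least-squares coefficients cannot be uniquely defined:
beta1 = np.array([3.0, 1.5, 0.0])
beta2 = np.array([3.0, 0.0, 0.75])
print(np.allclose(X @ beta1, X @ beta2))            # True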