Machine Learning & Data Mining
Caltech CS/CNS/EE 155
Set 2
January 12th, 2018

Policies: Due 9 PM, January 9th, via Moodle. You are free to collaborate on all of the problems, subject to the collaboration policy stated in the syllabus. You should submit all code used in the homework.

1 Effects of Regularization

For this problem, you are required to implement everything by yourself and submit code.

Question A: In order to prevent over-fitting in the least-squares linear regression problem, we add a regularization penalty term. Can adding the penalty term decrease the training (in-sample) error? Will adding a penalty term always decrease the out-of-sample error? Please justify your answers.

Question B: l1 regularization is sometimes favored over l2 regularization due to its ability to generate a sparse w (more zero weights). In fact, l0 regularization (using the l0 norm instead of the l1 or l2 norm) can generate an even sparser w, which seems favorable in high-dimensional problems. However, it is rarely used. Could you please explain why?

Implementation of l2 regularization: We are going to experiment with linear regression for the Red Wine Quality Rating data set. The data set is uploaded onto Moodle, and you can read more about it at the link provided. Note that the original data set has three classes, but one was removed to make this a binary classification problem.

Download the data for training and testing. There are two training data sets, wine_training1.txt and wine_training2.txt (where the second data set is a smaller, complete subset of the first), and one testing data set, wine_testing.txt. You will use the wine_testing.txt data set to validate your models. The data in each data set represent how different factors (the last columns) affect the wine type (the first column). Each column of data represents a different factor, and they are all continuous. You can read about their significance on the website provided above.

Now, we will use logistic regression instead of linear regression, so our error function will be different. When evaluating the training error (E_in) and the testing error (E_out), use the logistic error

    E = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i),

where

    p(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-w^\top x_i}}, \qquad p(y_i = -1 \mid x_i) = \frac{1}{1 + e^{w^\top x_i}}.

Implement the l2-regularized logistic regression that minimizes

    E = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid x_i) + \frac{\lambda}{N} w^\top w
      = \frac{1}{N} \sum_{i=1}^{N} \log\left(1 + e^{-y_i w^\top x_i}\right) + \frac{\lambda}{N} w^\top w.
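For concreteness, the quantities above can be written in a few lines of NumPy. This is only a sketch under stated assumptions: the labels y_i are taken to be in {-1, +1}, and the gradient shown is intended for a plain gradient-descent minimizer, which the assignment does not prescribe.

import numpy as np

def logistic_error(w, X, y):
    # E = (1/N) sum_i log(1 + exp(-y_i w^T x_i)), with y_i in {-1, +1};
    # np.logaddexp(0, -z) = log(1 + exp(-z)) computed in a numerically stable way
    z = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -z))

def regularized_objective(w, X, y, lam):
    # logistic error plus the (lambda / N) * w^T w penalty from the objective above
    return logistic_error(w, X, y) + (lam / X.shape[0]) * (w @ w)

def objective_gradient(w, X, y, lam):
    # gradient of the regularized objective; s_i = 1 / (1 + exp(y_i w^T x_i))
    N = X.shape[0]
    s = np.exp(-np.logaddexp(0.0, y * (X @ w)))
    return -(X.T @ (y * s)) / N + (2.0 * lam / N) * w

Running a gradient-descent loop over this gradient, once per choice of λ, is one straightforward way to produce the E_in and E_out curves asked for below.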

Train the model with 20 different choices of λ, starting with λ_0 = 0.0001 and increasing by a factor of 5 each time (λ_0 = 0.0001, λ_1 = 0.0005, λ_2 = 0.0025, and so on). You may run into numerical instability issues (overflow or underflow). One way to deal with these issues is by normalizing the input data X. Given a column X_i for a single feature in the training set, you can normalize it by setting

    X_{ij} = \frac{X_{ij} - \bar{X}_i}{\sigma(X_i)},

where σ(X_i) is the standard deviation of the column's entries and \bar{X}_i is the mean of the column's entries. Normalization may change the optimal choice of λ. If you normalize the input data, simply plot enough choices of λ to see the trends.

Question C: Do the following for both training data sets and attach your plots to the homework submission (Hint: use a semi-log plot):

i. Plot the in-sample error (E_in) versus the different λs.
ii. Plot the out-of-sample error (E_out) versus the different λs.
iii. Plot the l2 norm of w versus the different λs.

Question D: Considering that the data in wine_training2.txt is a subset of the data in wine_training1.txt, compare the errors (training and validation) resulting from training with wine_training1.txt (the full training set) versus wine_training2.txt (the smaller subset). Please briefly explain the difference.

Question E: Briefly explain the qualitative behavior (i.e., over-fitting and under-fitting) of the training and validation errors with different λs while training with the data in wine_training2.txt.

Question F: Briefly explain the qualitative behavior of the norm of w with different λs while training with the data in wine_training2.txt.

Question G: If the model were trained with wine_training2.txt, which λ would you choose to train your final model? Why?

2 Lasso (l1) vs. Ridge (l2) Regularization

For this problem, you are allowed to use packages in Python, MATLAB, or any other language instead of implementing lasso and ridge regularization yourself.

Many datasets we now encounter in regression problems are very high dimensional. One way to handle this is to encourage the weights to be sparse, allowing only a small number of features to have non-zero weight. The direct way of encouraging a sparse weight vector is to use l0 regularization and penalize the l0 norm, which is the number of non-zero elements in the vector. However, the l0 norm is far from smooth, and as a result it is hard to optimize. Two related methods are Lasso (l1) regression and Ridge (l2) regression. Although both result in shrinkage estimators, only Lasso regression results in sparse weight vectors. This problem compares the behavior of the two methods.
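As a quick illustration of the sparsity contrast just described, here is a small sketch on synthetic data; the data dimensions, noise level, and regularization strengths are arbitrary choices, not part of the assignment.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 50))        # 50 features, only 5 truly relevant
w_true = np.zeros(50)
w_true[:5] = [4.0, -3.0, 2.0, -1.0, 1.0]
y = X @ w_true + 0.1 * rng.standard_normal(200)

lasso = Lasso(alpha=0.05).fit(X, y)
ridge = Ridge(alpha=0.05).fit(X, y)
print("Lasso weights exactly zero:", int(np.sum(lasso.coef_ == 0.0)))   # many
print("Ridge weights exactly zero:", int(np.sum(ridge.coef_ == 0.0)))   # typically none

Both fits shrink the weights, but only the Lasso fit drives a large fraction of them to exactly zero, which is the behavior Question B below asks you to observe on the provided dataset.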

Question A: Let D be a set of data points, and let w be a D-dimensional weight vector. Both Lasso and Ridge regression can be formulated as a maximum a posteriori (MAP) estimate of p(w | D) with different priors p(w | λ), where λ is a parameter controlling the shape of the prior.

i. In the case of Lasso regression, the prior is that the weights are independent and identically distributed (i.i.d.) zero-mean Laplacian random variables,

    p(w \mid \lambda) = \prod_{j=1}^{D} \mathrm{Lap}(w_j \mid 0, 1/\lambda) = \prod_{j=1}^{D} \frac{\lambda}{2} e^{-\lambda |w_j|}.

Show that under this prior,

    \hat{w} = \arg\max_w p(w \mid D) = \arg\min_w \big( -\log p(D \mid w) + \lambda \|w\|_1 \big).

ii. In the case of Ridge regression, the prior is that the weights are i.i.d. zero-mean Normal random variables,

    p(w \mid \lambda) = \prod_{j=1}^{D} \mathcal{N}(w_j \mid 0, 1/\lambda) = \prod_{j=1}^{D} \sqrt{\frac{\lambda}{2\pi}}\, e^{-\frac{\lambda}{2} w_j^2}.

Show that under this prior,

    \hat{w} = \arg\max_w p(w \mid D) = \arg\min_w \left( -\log p(D \mid w) + \frac{\lambda}{2} \|w\|_2^2 \right).

iii. Suppose that y ~ N(Xw, σ^2 I) and that D contains X and y. Show that

    \hat{w} = \arg\max_w p(D \mid w) = \arg\min_w \|y - Xw\|_2^2.

This means that least-squares linear regression is a maximum likelihood estimator when there is white Gaussian noise.
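A hedged hint for parts i and ii above, sketching only the common starting point rather than the full derivation the question asks for: by Bayes' rule, and because p(D) does not depend on w and log is monotone,

    \hat{w} = \arg\max_w p(w \mid D)
            = \arg\max_w \frac{p(D \mid w)\, p(w \mid \lambda)}{p(D)}
            = \arg\min_w \big( -\log p(D \mid w) - \log p(w \mid \lambda) \big),

so each part reduces to computing -\log p(w \mid \lambda) for the stated prior and dropping additive constants.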

Question B: This section compares the behavior of Lasso and Ridge regression on a synthetic dataset. The dataset consists of 1000 independent samples of a 9-dimensional feature vector x = [x_1, ..., x_9] drawn from a uniform distribution on the interval [-1, 1], along with the response

    y = -4x_1 - 3x_2 - 2x_3 - x_4 + 0x_5 + x_6 + 2x_7 + 3x_8 + 4x_9 + n = w^\top x + n,

where n is a standard Normal random variable. The file questiondata.txt consists of 1000 lines of 10 tab-delimited values. The first 9 columns represent x_1, ..., x_9, and the last column represents y. Each row consists of one sample.

Using Python and numpy, you may load the data as follows:

>>> data = numpy.loadtxt('./data/questiondata.txt', delimiter='\t')
>>> X = data[:, 0:9]
>>> Y = data[:, 9]

Using MATLAB, you may load the data as follows:

>>> data = dlmread('./data/questiondata.txt', '\t');
>>> X = data(:, 1:9);
>>> Y = data(:, 10);

You may also generate the dataset using the process specified above. This problem explores the behavior of the estimated weights as the strength of the regularization (λ) varies.

i. Estimate the weights using linear regression with Lasso regularization for various choices of λ. For each of the weights, plot the weight as a function of λ (start with λ = 0 and increase λ until all weights are small). Using a linear scale for λ will allow the plot to be easily interpreted. Note that in MATLAB you should not standardize the variables:

>>> lasso(X, Y, 'Lambda', 0:0.01:3, 'Standardize', false);

In Python, you may use sklearn:

>>> from sklearn.linear_model import Lasso
>>> clf = Lasso(alpha=Lambda)

Then Python's sklearn and MATLAB's lasso plots should be identical.

ii. Estimate the weights using linear regression with Ridge regularization for various choices of λ. For each of the weights, plot the weight as a function of λ (start with λ = 0 and increase λ until all weights are small). Note that in MATLAB you should use:

>>> Xn = [ones(length(Y), 1), X];
>>> coef = (Xn'*Xn + diag([0; Lambda(i) * ones(9, 1)])) \ (Xn'*Y);

You may also use

>>> coef = ridge(Y, X, Lambda(i), 0);

but this will produce a scaled version of the plot generated by Python's sklearn, due to differences in normalization between MATLAB's ridge and sklearn's Ridge. In Python, you may use:

>>> from sklearn.linear_model import Ridge
>>> clf = Ridge(alpha=Lambda)

iii. As the regularization parameter increases, what happens to the number of estimated weights that are exactly zero with Lasso regression? What happens to the number of estimated weights that are exactly zero with Ridge regression? (A short end-to-end plotting sketch for this question appears after Question C below.)

Question C: For general choices of p(D | w), an analytic solution for regularized linear regression may not exist. However, when p(D | w) has a standard normal distribution (corresponding to linear regression), an analytic solution exists for 1-dimensional Lasso regression and for Ridge regression in all dimensions.

i. Solve for \arg\min_w \|y - w^\top x\|_2^2 + \lambda \|w\|_1 in the case of a 1-dimensional feature space. This is linear regression with Lasso regularization.

ii. Suppose that when λ = 0, ŵ ≠ 0. Does there exist a value of λ such that ŵ = 0? If so, what is the smallest such value?

iii. Solve for \arg\min_w \|y - w^\top x\|_2^2 + \lambda \|w\|_2^2 for an arbitrary number of dimensions. This is linear regression with Ridge regularization.

iv. Suppose that when λ = 0, ŵ_i ≠ 0. Does there exist a value of λ > 0 such that ŵ_i = 0? If so, what is the smallest such value?
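For Question B, here is a minimal end-to-end sketch of the λ sweep and coefficient-path plots in Python, using sklearn as suggested above. The file path matches the loading snippet earlier; the λ grids are assumptions (the Lasso grid mirrors the MATLAB call above, and the Ridge grid simply extends far enough for the weights to become small).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, Ridge

data = np.loadtxt('./data/questiondata.txt', delimiter='\t')
X, Y = data[:, 0:9], data[:, 9]

# Lasso path: linear grid of lambdas (alpha = 0 is skipped, since plain least
# squares is better solved directly than by the Lasso coordinate-descent solver).
lasso_lams = np.arange(0.01, 3.0, 0.01)
lasso_path = np.array([Lasso(alpha=lam).fit(X, Y).coef_ for lam in lasso_lams])

# Ridge path: a much wider grid is needed before the weights become small.
ridge_lams = np.linspace(0.0, 1e5, 200)
ridge_path = np.array([Ridge(alpha=lam).fit(X, Y).coef_ for lam in ridge_lams])

for name, lams, path in [('Lasso', lasso_lams, lasso_path),
                         ('Ridge', ridge_lams, ridge_path)]:
    plt.figure()
    for j in range(path.shape[1]):
        plt.plot(lams, path[:, j], label='w%d' % (j + 1))
    plt.xlabel('lambda')
    plt.ylabel('estimated weight')
    plt.title(name + ' coefficient paths')
    plt.legend()
plt.show()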
