MATH 680 Fall 2018, November 27, Homework 3

MATH 680 Fall 2018, November 27, 2018

Homework 3

This homework is due on December 9 at 11:59pm. Provide both pdf and R files. Make an individual R file with proper comments for each sub-problem.

1 Subgradients and Proximal Operators

1. Recall that the subgradient can be viewed as a generalization of the gradient to general functions. Let $f$ be a function from $\mathbb{R}^n$ to $\mathbb{R}$. The subdifferential of $f$ at $x$ is defined as
$$\partial f(x) = \{ g \in \mathbb{R}^n : g \text{ is a subgradient of } f \text{ at } x \}.$$

(i) Show that $\partial f(x)$ is a convex and closed set.

(ii) Show that $\partial f(x) \subseteq N_{\{y : f(y) \le f(x)\}}(x)$, where recall that $N_C(x)$ denotes the normal cone to a set $C$ at a point $x$.

(iii) Let $f(x) = \|x\|_2$. Show that
$$\partial f(x) = \begin{cases} \{ x / \|x\|_2 \}, & x \ne 0, \\ \{ z : \|z\|_2 \le 1 \}, & x = 0. \end{cases}$$

(iv) More generally, let $p, q > 0$ be such that $1/p + 1/q = 1$. Consider the function $f(x) = \|x\|_p$, where $\|x\|_p$ is defined as
$$\|x\|_p = \max_{\|z\|_q \le 1} z^T x.$$
Based on this definition of $\|x\|_p$, show that for all $x, y$:
$$x^T y \le \|x\|_p \|y\|_q.$$
The above inequality is known as Hölder's inequality. Hint: you may use the dual representation of the $\ell_p$ norm, namely $\|x\|_p = \max_{\|z\|_q \le 1} z^T x$.

(v) Use Hölder's inequality to show that for $f(x) = \|x\|_p$, its subdifferential is
$$\partial f(x) = \arg\max_{\|z\|_q \le 1} z^T x.$$
(You are not allowed to use the rule for the subdifferential of a max of functions for this problem.)
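As a quick numerical illustration of part (iii) (not part of the assignment), the candidate subgradient $g = x/\|x\|_2$ at a nonzero $x$ must satisfy the defining inequality $f(y) \ge f(x) + g^T(y - x)$ for every $y$. A minimal R check, with an arbitrary point $x$ and random test points:

```r
# Check the subgradient inequality f(y) >= f(x) + g^T (y - x)
# for f(x) = ||x||_2 with candidate subgradient g = x / ||x||_2 (x != 0).
set.seed(1)
f <- function(v) sqrt(sum(v^2))   # f(v) = ||v||_2
x <- c(2, -1, 3)                  # arbitrary nonzero point
g <- x / f(x)                     # candidate subgradient from part (iii)
violations <- sum(replicate(1000, {
  y <- rnorm(3)                   # random test point
  f(y) < f(x) + sum(g * (y - x)) - 1e-12
}))
violations                        # should be 0
```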

2. The proximal operator for a function $h : \mathbb{R}^n \to \mathbb{R}$ and $t > 0$ is defined as
$$\mathrm{prox}_{h,t}(x) = \arg\min_z \frac{1}{2}\|z - x\|_2^2 + t\,h(z).$$
Compute the proximal operators $\mathrm{prox}_{h,t}(x)$ for the following functions.

(i) $h(z) = \frac{1}{2} z^T A z + b^T z + c$, where $A \in S^n_+$.

(ii) $h(z) = \sum_{i=1}^n z_i \log z_i$, where $z \in \mathbb{R}^n_{++}$. Hint: you may refer to the Lambert $W$-function when solving for the proximal operator.

(iii) $h(z) = \|z\|_2$.

(iv) $h(z) = \|z\|_0$, where $\|z\|_0$ is defined as $\|z\|_0 = |\{ z_i : z_i \ne 0,\ i = 1, \ldots, n \}|$.

(Bonus) $h(z) = \sum_{i=1}^n \lambda_i |z|_{(i)}$, where $z \in \mathbb{R}^n$, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$, and $|z|_{(1)} \ge |z|_{(2)} \ge \cdots \ge |z|_{(n)}$ are the ordered absolute values of the coordinates of $z$. This is called the sorted-$\ell_1$ norm of $z$. Hint: you may consider the relation between the signs of $x_i$ and $z_i$, and sort the entries in $x$ and consider their correspondence with the sorted entries in $z$.

2 Properties of Proximal Mappings and Subgradients

(b) Show that if $f : \mathbb{R}^n \to \mathbb{R}$ is a convex function, then the following property holds:
$$(x - y)^T (u - v) \ge 0 \qquad \forall x, y \in \mathbb{R}^n,\ u \in \partial f(x),\ v \in \partial f(y). \tag{1}$$

(d) Recall the definition of the proximal mapping: for a function $h$, the proximal mapping $\mathrm{prox}_t$ is defined as
$$\mathrm{prox}_t(x) = \arg\min_u \frac{1}{2t}\|x - u\|_2^2 + h(u). \tag{2}$$
Show that
$$\mathrm{prox}_t(x) = u \iff h(y) \ge h(u) + \frac{1}{t}(x - u)^T (y - u) \quad \forall y.$$

(e) Prove that the $\mathrm{prox}_t$ mapping is non-expansive, that is,
$$\|\mathrm{prox}_t(x) - \mathrm{prox}_t(y)\|_2 \le \|x - y\|_2 \qquad \forall x, y. \tag{3}$$

3 Convergence Rate for Proximal Gradient Descent

In this problem, you will show the sublinear convergence of gradient descent and proximal gradient descent, as presented in class. To be clear, we assume that the objective $f(x)$ can be written as $f(x) = g(x) + h(x)$, where

(A1) $g$ is convex, differentiable, and $\mathrm{dom}(g) = \mathbb{R}^n$.

(A2) $\nabla g$ is Lipschitz, with constant $L > 0$.

(A3) $h$ is convex, not necessarily differentiable, and we take $\mathrm{dom}(h) = \mathbb{R}^n$ for simplicity.
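As a concrete instance of assumptions (A1)-(A2), which also matters for choosing step sizes in Problem 4 below, the least-squares loss $g(\beta) = \frac{1}{2N}\|X\beta - y\|_2^2$ has $\nabla g(\beta) = \frac{1}{N}X^T(X\beta - y)$, whose Lipschitz constant is $L = \lambda_{\max}(X^T X)/N$. A minimal R sketch of this computation (on simulated data, not the Parkinsons dataset):

```r
# Lipschitz constant of grad g for g(beta) = 1/(2N) * ||X %*% beta - y||_2^2:
# grad g(beta) = (1/N) * t(X) %*% (X %*% beta - y), so L = lambda_max(t(X) %*% X) / N.
set.seed(1)
N <- 100; p <- 5
X <- matrix(rnorm(N * p), N, p)   # simulated design matrix
L <- max(eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values) / N
t_step <- 1 / L                   # the fixed step size t = 1/L used in the analysis below
```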

(a) We begin with the simple case $f(x) = g(x)$; that is, $h(x) = 0$ and can be ignored. We will prove that gradient descent converges sublinearly in this case. As a reminder, the iterates of gradient descent are computed by
$$x^+ = x - t\,\nabla g(x), \tag{4}$$
where $x^+$ is the iterate succeeding $x$. Henceforth, we will set $t = 1/L$ for simplicity.

(i) Show that
$$g(x^+) \le g(x) - \frac{1}{2L}\|\nabla g(x)\|_2^2.$$
That is, the objective value is monotonically decreasing with each update. This is why gradient descent is called a descent method.

(ii) Using convexity of $g$, show the following helpful inequality:
$$g(x^+) \le g(z) + \nabla g(x)^T (x - z) - \frac{1}{2L}\|\nabla g(x)\|_2^2 \qquad \forall z \in \mathbb{R}^n.$$

(iii) Show that
$$g(x^+) - g(x^*) \le \frac{L}{2}\left( \|x - x^*\|_2^2 - \|x^+ - x^*\|_2^2 \right),$$
where $x^*$ is the minimizer of $g$, assuming $g(x^*)$ is finite.

(iv) Now, aggregating the last inequality over all steps $i = 0, \ldots, k-1$, show that the accuracy of gradient descent at iteration $k$ is $O(1/k)$, i.e.,
$$g(x^{(k)}) - g(x^*) \le \frac{L}{2k}\|x^{(0)} - x^*\|_2^2.$$
Put differently, for an $\epsilon$-level accuracy, you need to run at most $O(1/\epsilon)$ iterations.

(b) Now consider the general $h$ in assumption (A3). We will prove that proximal gradient descent converges sublinearly under these assumptions. Specifically, the iterates of proximal gradient descent are computed by
$$x^+ = \mathrm{prox}_{th}\left( x - t\,\nabla g(x) \right), \tag{5}$$
where again we will set $t = 1/L$ for simplicity. Further, we define the useful notation
$$G(x) = \frac{1}{t}\left( x - x^+ \right).$$
We will see (in the following proofs) that $G(x)$ behaves like $\nabla g(x)$ in gradient descent.

(i) Show that
$$g(x^+) \le g(x) - \frac{1}{L}\nabla g(x)^T G(x) + \frac{1}{2L}\|G(x)\|_2^2.$$

(ii) Show that
$$f(x^+) \le f(z) + G(x)^T (x - z) - \frac{1}{2L}\|G(x)\|_2^2 \qquad \forall z \in \mathbb{R}^n.$$
Note that setting $z := x$ verifies that proximal gradient descent is a descent method. (Hint: look back at what you did in Q2 part (b) and add the missing $h$ to (i).)

(iii) Show that
$$f(x^+) - f(x^*) \le \frac{L}{2}\left( \|x - x^*\|_2^2 - \|x^+ - x^*\|_2^2 \right),$$
where $x^*$ is the minimizer of $f$. Then show that
$$f(x^{(k)}) - f(x^*) \le \frac{L}{2k}\|x^{(0)} - x^*\|_2^2.$$
That is, proximal gradient descent achieves $O(1/k)$ accuracy at the $k$-th iteration.

4 Stochastic and Proximal Gradient Descent for Group Lasso

Suppose the predictors (columns of the design matrix $X \in \mathbb{R}^{n \times (p+1)}$) in a regression problem split up into $J$ groups:
$$X = \left[\, \mathbf{1} \;\; X^{(1)} \;\; X^{(2)} \;\; \cdots \;\; X^{(J)} \,\right], \tag{6}$$
where $\mathbf{1} = (1, \ldots, 1)^T \in \mathbb{R}^n$. To achieve sparsity over non-overlapping groups rather than individual predictors, we may write $\beta = (\beta_0, \beta^{(1)}, \ldots, \beta^{(J)})$, where $\beta_0$ is an intercept term and each $\beta^{(j)}$ is the coefficient block of $\beta$ corresponding to $X^{(j)}$, and solve the regularized regression problem:
$$\min_{\beta \in \mathbb{R}^{p+1}} g(\beta) + h(\beta). \tag{7}$$

In the following problems, we will use linear regression to predict the Parkinson's disease (PD) symptom score on the Parkinsons dataset. The PD symptom score is measured on the unified Parkinson's disease rating scale (UPDRS). The data contain 5,785 observations, 18 predictors (in X_train.csv), and an outcome, the total UPDRS (in y_train.csv). The data were collected at the University of Oxford, in collaboration with 10 medical centers in the US and Intel Corporation. The 18 columns in the predictor matrix have the following groupings (in column order):

- age: Subject age in years
- sex: Subject gender, 0 = male, 1 = female
- Jitter(%), Jitter(Abs), Jitter:RAP, Jitter:PPQ5, Jitter:DDP: Several measures of variation in fundamental frequency of voice
- Shimmer, Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, Shimmer:APQ11, Shimmer:DDA: Several measures of variation in amplitude of voice
- NHR, HNR: Two measures of the ratio of noise to tonal components in the voice
- RPDE: A nonlinear dynamical complexity measure
- DFA: Signal fractal scaling exponent
- PPE: A nonlinear measure of fundamental frequency variation
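For the implementations below, it is convenient to encode the eight groups listed above as blocks of column indices of the predictor matrix. A possible R sketch, assuming the 18 predictor columns of X_train.csv appear in exactly the order listed (the names `groups`, `p_j`, and `w` are illustrative choices, not required by the assignment):

```r
# Column indices of the 18 predictors (in the order listed above), one entry per group.
# If a ones column is prepended for the intercept, shift every index by one.
groups <- list(
  age     = 1,
  sex     = 2,
  jitter  = 3:7,     # Jitter(%), Jitter(Abs), Jitter:RAP, Jitter:PPQ5, Jitter:DDP
  shimmer = 8:13,    # Shimmer, Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, Shimmer:APQ11, Shimmer:DDA
  noise   = 14:15,   # NHR, HNR
  rpde    = 16,
  dfa     = 17,
  ppe     = 18
)
p_j <- sapply(groups, length)   # group sizes p_j
w   <- sqrt(p_j)                # group weights w_j = sqrt(p_j), used in Problem 2 below
```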

1. We first consider the ridge regression problem, where $h(\beta) = \frac{\lambda}{2}\|\beta\|_2^2$:
$$\min_{\beta \in \mathbb{R}^{p+1}} \frac{1}{2N}\|X\beta - y\|_2^2 + \frac{\lambda}{2}\|\beta\|_2^2, \tag{8}$$
where $N$ is the number of samples. Note: in your implementation for this problem, if you added a ones vector to $X$ (i.e. $X = [\, \mathbf{1} \;\; X^{(1)} \;\; X^{(2)} \;\; \cdots \;\; X^{(J)} \,]$), you should not include the bias term $\beta_0$ associated with the ones vector in the penalty.

(a) Derive the stochastic gradient update with respect to a batch size $B$ and a step size $t$. Hint: you will need a separate update for $\beta_0$, since it should not be penalized.

(b) Implement the stochastic gradient descent algorithm to solve the ridge regression problem (8). Initialize $\beta$ with random normal values. Fit the model parameters on the training data (X_train.csv, y_train.csv) and evaluate the objective function after each epoch (you will need to plot these values later). Set $\lambda = 1$. Try different batch sizes from $\{10, 20, 50, 100\}$ and different step sizes from $\{10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$. Train for 500 epochs (an epoch is one pass through the dataset).

(c) Plot $f_k - f^*$ versus $k$ ($k = 1, \ldots, 500$) on a semi-log scale (i.e. where the y-axis is in log scale) for all setting combinations, where $f_k$ denotes the objective value averaged over all samples at epoch $k$, and the optimal objective value is $f^* = 57.040$. What do you find? How do the different step sizes and batch sizes affect the learning curves (i.e. convergence rate, final convergence value, etc.)?

2. Next, we consider the least squares group LASSO problem, where $h(\beta) = \lambda \sum_j w_j \|\beta^{(j)}\|_2$:
$$\min_{\beta \in \mathbb{R}^{p+1}} \frac{1}{2N}\|X\beta - y\|_2^2 + \lambda \sum_j w_j \|\beta^{(j)}\|_2. \tag{9}$$
A common choice for the weight on group $j$ is $w_j = \sqrt{p_j}$, where $p_j$ is the number of predictors that belong to the $j$th group, to adjust for the group sizes. We will solve the problem using the proximal gradient descent algorithm (over the whole dataset).

(a) Derive the proximal operator $\mathrm{prox}_{h,t}(x)$ for the non-smooth component $h(\beta) = \lambda \sum_{j=1}^J w_j \|\beta^{(j)}\|_2$.

(b) Derive the proximal gradient update for the objective.

(c) Implement proximal gradient descent to solve the least squares group lasso problem on the Parkinsons dataset. Set $\lambda = 0.02$. Use a fixed step size $t = 0.005$ and run for 10000 steps.

(d) Plot $f_k - f^*$ versus $k$ for the first 10000 iterations ($k = 1, \ldots, 10000$) on a semi-log scale (i.e. where the y-axis is in log scale) for both train and test data, where $f_k$ denotes the objective value averaged over all samples at step $k$, and the optimal objective value is $f^* = 49.9649$. Print the components of the solution numerically. What are the selected groups?

(e) Now implement the LASSO (hint: you shouldn't have to do any additional coding), with fixed step size $t = 0.005$ and $\lambda = 0.02$. Run accelerated proximal gradient descent for 10000 steps. Compare the LASSO solution with your group lasso solution.

(f) (Bonus) Implement accelerated proximal gradient descent with a fixed step size under the same setting as in part (c). Hint: be sure to exclude the bias term $\beta_0$ from the proximal update; just use a regular accelerated gradient update for it. Plot $f_k - f^*$ versus $k$ for both methods (unaccelerated and accelerated proximal gradient) for $k = 1, \ldots, 10000$ on a semi-log scale and compare the selected groups. What do you find?

5 Computation of Gaussian mixture model using EM algorithm

In this problem you will fit a Gaussian mixture model and use it to cluster $n$ observations $x_1, \ldots, x_n$, where $x_i \in \mathbb{R}^p$. The model assumes that $x_1, \ldots, x_n$ are a realization of the sequence of independent random vectors $X_1, \ldots, X_n$ defined in the following way. For each $X_i$ there is an unobservable random variable $Y_i$, and
$$X_i \mid Y_i \sim N(\mu_{Y_i}, \Sigma_{Y_i}); \qquad Y_i \sim \mathrm{Multinomial}(r_1, \ldots, r_c),\quad p(Y_i = j) = r_j,\ j = 1, \ldots, c,\quad \sum_{j=1}^{c} r_j = 1;$$
with $(X_i, Y_i)$, $i = 1, \ldots, n$, independent samples. To fit this model, we estimate $\theta = (r_1, \ldots, r_c, \mu_1, \ldots, \mu_c, \Sigma_1, \ldots, \Sigma_c)$. We know that
$$f(x \mid \theta) = \int f_{X,Y}(x, y \mid \theta)\,dy = \int f_{X \mid Y}(x \mid y, \theta) f_Y(y)\,dy = \sum_{j=1}^{c} r_j\, N(x \mid \mu_j, \Sigma_j).$$
So to estimate $\theta$ with a realization $x_1, \ldots, x_n$ of $n$ independent copies, we maximize the log-likelihood function, i.e. we compute a local maximizer of
$$\log L(\theta \mid X) = \sum_{i=1}^{n} \log \sum_{j=1}^{c} r_j\, N(x_i \mid \mu_j, \Sigma_j),$$
$$\hat{\theta}_{\mathrm{mle}} = \arg\max_{\theta} \log L(\theta \mid X). \tag{10}$$

1. Study the R function gmmfit for computing (10), as defined in gmmfit.r posted on the website. At the end of that file, some code is provided that calls the gmmfit function on simulated data.

(a) Derive an EM algorithm to compute a local maximizer of
$$\log L(\theta \mid X) = \sum_{i=1}^{n} \log \sum_{j=1}^{c} r_j\, N(x_i \mid \mu_j, \Sigma),$$

where
$$\hat{\theta}_{\mathrm{mle}} = \arg\max_{\theta} \log L(\theta \mid X), \tag{11}$$
$$\theta = (r_1, \ldots, r_c, \mu_1, \ldots, \mu_c, \Sigma).$$
Therefore, in this problem the covariance matrix is the same for all $j \in \{1, \ldots, c\}$. Write the corresponding R function to compute this local maximizer.

(b) Find a classification dataset from the UCI machine learning data repository for which the Gaussian mixture model assumptions are not terribly unreasonable (pretending that the responses/class labels are unobserved). This repository is at https://archive.ics.uci.edu/ml/datasets.html

Test the R functions created in parts (a) and (b) on these data by pretending that the responses/class labels $y_1, \ldots, y_n$ are unobserved. Set $c$ equal to the number of possible categories for a response/class label. For a fitted Gaussian mixture model, assign a class label $\hat{y}_i$ to the $i$th observation $x_i$ with
$$\hat{y}_i = \arg\max_{j \in \{1, \ldots, c\}} \hat{P}(y_i = j \mid X = x_i),$$
where $\hat{P}(y_i = j \mid X = x_i)$ is an estimate of $P(y_i = j \mid X = x_i)$ from the Gaussian mixture model fit. Using the fit from the R function created in part (a), report
$$\frac{1}{n}\sum_{i=1}^{n} I(\hat{y}_i = y_i).$$
Also do this for the fit from the R function created in part (b).
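For orientation only, the EM updates for the shared-covariance model in part (a) alternate an E-step that computes the responsibilities $\gamma_{ij} = \hat{P}(Y_i = j \mid x_i, \theta)$ and an M-step that updates $r_j$, $\mu_j$, and a pooled $\Sigma$. The R sketch below is a minimal illustration of that structure; the function and variable names (gmm_shared_em, gamma, and so on) are illustrative, it does not refer to the posted gmmfit.r, and you should still derive the update formulas yourself in part (a).

```r
# Minimal EM sketch for a Gaussian mixture with a common covariance matrix Sigma.
# x: n x p data matrix; n_comp: number of components (c in the problem); iters: EM iterations.
gmm_shared_em <- function(x, n_comp, iters = 100) {
  n <- nrow(x); p <- ncol(x)
  r     <- rep(1 / n_comp, n_comp)                 # mixing proportions r_j
  mu    <- x[sample(n, n_comp), , drop = FALSE]    # initialize means at random data points
  Sigma <- cov(x)                                  # initialize shared covariance
  log_dens <- function(mu_j, Sigma) {              # log N(x_i | mu_j, Sigma) for all i
    d <- sweep(x, 2, mu_j)
    -0.5 * (p * log(2 * pi) + as.numeric(determinant(Sigma)$modulus) +
              rowSums((d %*% solve(Sigma)) * d))
  }
  for (it in seq_len(iters)) {
    # E-step: responsibilities gamma[i, j] proportional to r_j * N(x_i | mu_j, Sigma)
    logw  <- sapply(seq_len(n_comp), function(j) log(r[j]) + log_dens(mu[j, ], Sigma))
    logw  <- logw - apply(logw, 1, max)            # stabilize before exponentiating
    gamma <- exp(logw) / rowSums(exp(logw))
    # M-step: update r, mu, and the pooled covariance Sigma
    nk <- colSums(gamma)
    r  <- nk / n
    mu <- t(sapply(seq_len(n_comp), function(j) colSums(gamma[, j] * x) / nk[j]))
    Sigma <- Reduce(`+`, lapply(seq_len(n_comp), function(j) {
      d <- sweep(x, 2, mu[j, ])
      t(d * gamma[, j]) %*% d
    })) / n
  }
  list(r = r, mu = mu, Sigma = Sigma, gamma = gamma)
}

# Hard cluster assignment and raw agreement with the true labels y (as in part (b));
# note that component labels are arbitrary, so a relabeling may be needed before comparing.
# fit  <- gmm_shared_em(as.matrix(x), n_comp = length(unique(y)))
# yhat <- apply(fit$gamma, 1, which.max)
# mean(yhat == y)
```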