MATH 680 Fall 2018, November 27, Homework 3

MATH 680 Fall 2018, November 27, 2018

Homework 3

This homework is due on December 9 at 11:59pm. Provide both pdf and R files. Make an individual R file with proper comments for each sub-problem.

1 Subgradients and Proximal Operators

1. Recall that the subgradient can be viewed as a generalization of the gradient to general functions. Let $f$ be a function from $\mathbb{R}^n$ to $\mathbb{R}$. The subdifferential of $f$ at $x$ is defined as
$$\partial f(x) = \{ g \in \mathbb{R}^n : g \text{ is a subgradient of } f \text{ at } x \}.$$

(i) Show that $\partial f(x)$ is a convex and closed set.

(ii) Show that $\partial f(x) \subseteq N_{\{y : f(y) \le f(x)\}}(x)$, where recall that $N_C(x)$ denotes the normal cone to a set $C$ at a point $x$.

(iii) Let $f(x) = \|x\|_2$. Show that
$$\partial f(x) = \begin{cases} \{ x / \|x\|_2 \}, & x \ne 0, \\ \{ z : \|z\|_2 \le 1 \}, & x = 0. \end{cases}$$

(iv) More generally, let $p, q > 0$ be such that $1/p + 1/q = 1$. Consider the function $f(x) = \|x\|_p$, where $\|x\|_p$ is defined as
$$\|x\|_p = \max_{\|z\|_q \le 1} z^T x.$$
Based on this definition of $\|x\|_p$, show that for all $x, y$:
$$x^T y \le \|x\|_p \|y\|_q.$$
The above inequality is known as Hölder's inequality. Hint: you may use the dual representation of the $\ell_p$ norm, namely $\|x\|_p = \max_{\|z\|_q \le 1} z^T x$.

(v) Use Hölder's inequality to show that for $f(x) = \|x\|_p$, its subdifferential is
$$\partial f(x) = \arg\max_{\|z\|_q \le 1} z^T x.$$
(You are not allowed to use the rule for the subdifferential of a max of functions for this problem.)
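As a quick numerical illustration of part (iii) (not part of the assignment), the candidate subgradient $g = x/\|x\|_2$ at a nonzero $x$ must satisfy the defining inequality $f(y) \ge f(x) + g^T(y - x)$ for every $y$. A minimal R check, with an arbitrary point $x$ and random test points:

```r
# Check the subgradient inequality f(y) >= f(x) + g^T (y - x)
# for f(x) = ||x||_2 with candidate subgradient g = x / ||x||_2 (x != 0).
set.seed(1)
f <- function(v) sqrt(sum(v^2))   # f(v) = ||v||_2
x <- c(2, -1, 3)                  # arbitrary nonzero point
g <- x / f(x)                     # candidate subgradient from part (iii)
violations <- sum(replicate(1000, {
  y <- rnorm(3)                   # random test point
  f(y) < f(x) + sum(g * (y - x)) - 1e-12
}))
violations                        # should be 0
```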

2. The proximal operator for a function $h : \mathbb{R}^n \to \mathbb{R}$ and $t > 0$ is defined as
$$\mathrm{prox}_{h,t}(x) = \arg\min_z \frac{1}{2}\|z - x\|_2^2 + t\,h(z).$$
Compute the proximal operators $\mathrm{prox}_{h,t}(x)$ for the following functions.

(i) $h(z) = \frac{1}{2} z^T A z + b^T z + c$, where $A \in S^n_+$.

(ii) $h(z) = \sum_{i=1}^n z_i \log z_i$, where $z \in \mathbb{R}^n_{++}$. Hint: you may refer to the Lambert $W$-function when solving for the proximal operator.

(iii) $h(z) = \|z\|_2$.

(iv) $h(z) = \|z\|_0$, where $\|z\|_0$ is defined as $\|z\|_0 = |\{ z_i : z_i \ne 0,\ i = 1, \ldots, n \}|$.

(Bonus) $h(z) = \sum_{i=1}^n \lambda_i |z|_{(i)}$, where $z \in \mathbb{R}^n$, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$, and $|z|_{(1)} \ge |z|_{(2)} \ge \cdots \ge |z|_{(n)}$ are the ordered absolute values of the coordinates of $z$. This is called the sorted-$\ell_1$ norm of $z$. Hint: you may consider the relation between the signs of $x_i$ and $z_i$, and sort the entries in $x$ and consider their correspondence with the sorted entries in $z$.

2 Properties of Proximal Mappings and Subgradients

(b) Show that if $f : \mathbb{R}^n \to \mathbb{R}$ is a convex function, then the following property holds:
$$(x - y)^T (u - v) \ge 0 \qquad \forall x, y \in \mathbb{R}^n,\ u \in \partial f(x),\ v \in \partial f(y). \tag{1}$$

(d) Recall the definition of the proximal mapping: for a function $h$, the proximal mapping $\mathrm{prox}_t$ is defined as
$$\mathrm{prox}_t(x) = \arg\min_u \frac{1}{2t}\|x - u\|_2^2 + h(u). \tag{2}$$
Show that
$$\mathrm{prox}_t(x) = u \iff h(y) \ge h(u) + \frac{1}{t}(x - u)^T (y - u) \quad \forall y.$$

(e) Prove that the $\mathrm{prox}_t$ mapping is non-expansive, that is,
$$\|\mathrm{prox}_t(x) - \mathrm{prox}_t(y)\|_2 \le \|x - y\|_2 \qquad \forall x, y. \tag{3}$$

3 Convergence Rate for Proximal Gradient Descent

In this problem, you will show the sublinear convergence of gradient descent and proximal gradient descent, as presented in class. To be clear, we assume that the objective $f(x)$ can be written as $f(x) = g(x) + h(x)$, where

(A1) $g$ is convex, differentiable, and $\mathrm{dom}(g) = \mathbb{R}^n$.

(A2) $\nabla g$ is Lipschitz, with constant $L > 0$.

(A3) $h$ is convex, not necessarily differentiable, and we take $\mathrm{dom}(h) = \mathbb{R}^n$ for simplicity.
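As a concrete instance of assumptions (A1)-(A2), which also matters for choosing step sizes in Problem 4 below, the least-squares loss $g(\beta) = \frac{1}{2N}\|X\beta - y\|_2^2$ has $\nabla g(\beta) = \frac{1}{N}X^T(X\beta - y)$, whose Lipschitz constant is $L = \lambda_{\max}(X^T X)/N$. A minimal R sketch of this computation (on simulated data, not the Parkinsons dataset):

```r
# Lipschitz constant of grad g for g(beta) = 1/(2N) * ||X %*% beta - y||_2^2:
# grad g(beta) = (1/N) * t(X) %*% (X %*% beta - y), so L = lambda_max(t(X) %*% X) / N.
set.seed(1)
N <- 100; p <- 5
X <- matrix(rnorm(N * p), N, p)   # simulated design matrix
L <- max(eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values) / N
t_step <- 1 / L                   # the fixed step size t = 1/L used in the analysis below
```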

(a) We begin with the simple case $f(x) = g(x)$; that is, $h(x) = 0$ and can be ignored. We will prove that gradient descent converges sublinearly in this case. As a reminder, the iterates of gradient descent are computed by
$$x^+ = x - t\,\nabla g(x), \tag{4}$$
where $x^+$ is the iterate succeeding $x$. Henceforth, we will set $t = 1/L$ for simplicity.

(i) Show that
$$g(x^+) \le g(x) - \frac{1}{2L}\|\nabla g(x)\|_2^2.$$
That is, the objective value is monotonically decreasing with each update. This is why gradient descent is called a descent method.

(ii) Using convexity of $g$, show the following helpful inequality:
$$g(x^+) \le g(z) + \nabla g(x)^T (x - z) - \frac{1}{2L}\|\nabla g(x)\|_2^2 \qquad \forall z \in \mathbb{R}^n.$$

(iii) Show that
$$g(x^+) - g(x^*) \le \frac{L}{2}\left( \|x - x^*\|_2^2 - \|x^+ - x^*\|_2^2 \right),$$
where $x^*$ is the minimizer of $g$, assuming $g(x^*)$ is finite.

(iv) Now, aggregating the last inequality over all steps $i = 0, \ldots, k-1$, show that the accuracy of gradient descent at iteration $k$ is $O(1/k)$, i.e.,
$$g(x^{(k)}) - g(x^*) \le \frac{L}{2k}\|x^{(0)} - x^*\|_2^2.$$
Put differently, for an $\epsilon$-level accuracy, you need to run at most $O(1/\epsilon)$ iterations.

(b) Now consider the general $h$ in assumption (A3). We will prove that proximal gradient descent converges sublinearly under these assumptions. Specifically, the iterates of proximal gradient descent are computed by
$$x^+ = \mathrm{prox}_{th}\left( x - t\,\nabla g(x) \right), \tag{5}$$
where again we will set $t = 1/L$ for simplicity. Further, we define the useful notation
$$G(x) = \frac{1}{t}\left( x - x^+ \right).$$
We will see (in the following proofs) that $G(x)$ behaves like $\nabla g(x)$ in gradient descent.

(i) Show that
$$g(x^+) \le g(x) - \frac{1}{L}\nabla g(x)^T G(x) + \frac{1}{2L}\|G(x)\|_2^2.$$

(ii) Show that
$$f(x^+) \le f(z) + G(x)^T (x - z) - \frac{1}{2L}\|G(x)\|_2^2 \qquad \forall z \in \mathbb{R}^n.$$
Note that setting $z := x$ verifies that proximal gradient descent is a descent method. (Hint: look back at what you did in Q2 part (b) and add the missing $h$ to (i).)

(iii) Show that
$$f(x^+) - f(x^*) \le \frac{L}{2}\left( \|x - x^*\|_2^2 - \|x^+ - x^*\|_2^2 \right),$$
where $x^*$ is the minimizer of $f$. Then show that
$$f(x^{(k)}) - f(x^*) \le \frac{L}{2k}\|x^{(0)} - x^*\|_2^2.$$
That is, proximal gradient descent achieves $O(1/k)$ accuracy at the $k$-th iteration.

4 Stochastic and Proximal Gradient Descent for Group Lasso

Suppose the predictors (columns of the design matrix $X \in \mathbb{R}^{n \times (p+1)}$) in a regression problem split up into $J$ groups:
$$X = \left[\, \mathbf{1} \;\; X^{(1)} \;\; X^{(2)} \;\; \cdots \;\; X^{(J)} \,\right], \tag{6}$$
where $\mathbf{1} = (1, \ldots, 1)^T \in \mathbb{R}^n$. To achieve sparsity over non-overlapping groups rather than individual predictors, we may write $\beta = (\beta_0, \beta^{(1)}, \ldots, \beta^{(J)})$, where $\beta_0$ is an intercept term and each $\beta^{(j)}$ is the coefficient block of $\beta$ corresponding to $X^{(j)}$, and solve the regularized regression problem:
$$\min_{\beta \in \mathbb{R}^{p+1}} g(\beta) + h(\beta). \tag{7}$$

In the following problems, we will use linear regression to predict the Parkinson's disease (PD) symptom score on the Parkinsons dataset. The PD symptom score is measured on the unified Parkinson's disease rating scale (UPDRS). The data contain 5,785 observations, 18 predictors (in X_train.csv), and an outcome, the total UPDRS (in y_train.csv). The data were collected at the University of Oxford, in collaboration with 10 medical centers in the US and Intel Corporation. The 18 columns in the predictor matrix have the following groupings (in column order):

- age: Subject age in years
- sex: Subject gender, 0 = male, 1 = female
- Jitter(%), Jitter(Abs), Jitter:RAP, Jitter:PPQ5, Jitter:DDP: Several measures of variation in fundamental frequency of voice
- Shimmer, Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, Shimmer:APQ11, Shimmer:DDA: Several measures of variation in amplitude of voice
- NHR, HNR: Two measures of the ratio of noise to tonal components in the voice
- RPDE: A nonlinear dynamical complexity measure
- DFA: Signal fractal scaling exponent
- PPE: A nonlinear measure of fundamental frequency variation
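For the implementations below, it is convenient to encode the eight groups listed above as blocks of column indices of the predictor matrix. A possible R sketch, assuming the 18 predictor columns of X_train.csv appear in exactly the order listed (the names `groups`, `p_j`, and `w` are illustrative choices, not required by the assignment):

```r
# Column indices of the 18 predictors (in the order listed above), one entry per group.
# If a ones column is prepended for the intercept, shift every index by one.
groups <- list(
  age     = 1,
  sex     = 2,
  jitter  = 3:7,     # Jitter(%), Jitter(Abs), Jitter:RAP, Jitter:PPQ5, Jitter:DDP
  shimmer = 8:13,    # Shimmer, Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, Shimmer:APQ11, Shimmer:DDA
  noise   = 14:15,   # NHR, HNR
  rpde    = 16,
  dfa     = 17,
  ppe     = 18
)
p_j <- sapply(groups, length)   # group sizes p_j
w   <- sqrt(p_j)                # group weights w_j = sqrt(p_j), used in Problem 2 below
```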

1. We first consider the ridge regression problem, where $h(\beta) = \frac{\lambda}{2}\|\beta\|_2^2$:
$$\min_{\beta \in \mathbb{R}^{p+1}} \frac{1}{2N}\|X\beta - y\|_2^2 + \frac{\lambda}{2}\|\beta\|_2^2, \tag{8}$$
where $N$ is the number of samples. Note: in your implementation for this problem, if you added a ones vector to $X$ (i.e. $X = [\, \mathbf{1} \;\; X^{(1)} \;\; X^{(2)} \;\; \cdots \;\; X^{(J)} \,]$), you should not include the bias term $\beta_0$ associated with the ones vector in the penalty.

(a) Derive the stochastic gradient update with respect to a batch size $B$ and a step size $t$. Hint: you will need a separate update for $\beta_0$, since it should not be penalized.

(b) Implement the stochastic gradient descent algorithm to solve the ridge regression problem (8). Initialize $\beta$ with random normal values. Fit the model parameters on the training data (X_train.csv, y_train.csv) and evaluate the objective function after each epoch (you will need to plot these values later). Set $\lambda = 1$. Try different batch sizes from $\{10, 20, 50, 100\}$ and different step sizes from $\{10^{-2}, 10^{-3}, 10^{-4}, 10^{-5}\}$. Train for 500 epochs (an epoch is one pass through the dataset).

(c) Plot $f_k - f^*$ versus $k$ ($k = 1, \ldots, 500$) on a semi-log scale (i.e. where the y-axis is in log scale) for all setting combinations, where $f_k$ denotes the objective value averaged over all samples at epoch $k$, and the optimal objective value is $f^* = 57.040$. What do you find? How do the different step sizes and batch sizes affect the learning curves (i.e. convergence rate, final convergence value, etc.)?

2. Next, we consider the least squares group LASSO problem, where $h(\beta) = \lambda \sum_j w_j \|\beta^{(j)}\|_2$:
$$\min_{\beta \in \mathbb{R}^{p+1}} \frac{1}{2N}\|X\beta - y\|_2^2 + \lambda \sum_j w_j \|\beta^{(j)}\|_2. \tag{9}$$
A common choice for the weight on group $j$ is $w_j = \sqrt{p_j}$, where $p_j$ is the number of predictors that belong to the $j$th group, to adjust for the group sizes. We will solve the problem using the proximal gradient descent algorithm (over the whole dataset).

(a) Derive the proximal operator $\mathrm{prox}_{h,t}(x)$ for the non-smooth component $h(\beta) = \lambda \sum_{j=1}^J w_j \|\beta^{(j)}\|_2$.

(b) Derive the proximal gradient update for the objective.

(c) Implement proximal gradient descent to solve the least squares group lasso problem on the Parkinsons dataset. Set $\lambda = 0.02$. Use a fixed step size $t = 0.005$ and run for 10000 steps.

(d) Plot $f_k - f^*$ versus $k$ for the first 10000 iterations ($k = 1, \ldots, 10000$) on a semi-log scale (i.e. where the y-axis is in log scale) for both train and test data, where $f_k$ denotes the objective value averaged over all samples at step $k$, and the optimal objective value is $f^* = 49.9649$. Print the components of the solution numerically. What are the selected groups?

(e) Now implement the LASSO (hint: you shouldn't have to do any additional coding), with fixed step size $t = 0.005$ and $\lambda = 0.02$. Run accelerated proximal gradient descent for 10000 steps. Compare the LASSO solution with your group lasso solution.

(f) (Bonus) Implement accelerated proximal gradient descent with a fixed step size under the same setting as in part (c). Hint: be sure to exclude the bias term $\beta_0$ from the proximal update; just use a regular accelerated gradient update for it. Plot $f_k - f^*$ versus $k$ for both methods (unaccelerated and accelerated proximal gradient) for $k = 1, \ldots, 10000$ on a semi-log scale and compare the selected groups. What do you find?

5 Computation of Gaussian mixture model using EM algorithm

In this problem you will fit a Gaussian mixture model and use it to cluster $n$ observations $x_1, \ldots, x_n$, where $x_i \in \mathbb{R}^p$. The model assumes that $x_1, \ldots, x_n$ are a realization of the sequence of independent random vectors $X_1, \ldots, X_n$ defined in the following way. For each $X_i$ there is an unobservable random variable $Y_i$, and
$$X_i \mid Y_i \sim N(\mu_{Y_i}, \Sigma_{Y_i}); \qquad Y_i \sim \mathrm{Multinomial}(r_1, \ldots, r_c),\quad p(Y_i = j) = r_j,\ j = 1, \ldots, c,\quad \sum_{j=1}^{c} r_j = 1;$$
with $(X_i, Y_i)$, $i = 1, \ldots, n$, independent samples. To fit this model, we estimate $\theta = (r_1, \ldots, r_c, \mu_1, \ldots, \mu_c, \Sigma_1, \ldots, \Sigma_c)$. We know that
$$f(x \mid \theta) = \int f_{X,Y}(x, y \mid \theta)\,dy = \int f_{X \mid Y}(x \mid y, \theta) f_Y(y)\,dy = \sum_{j=1}^{c} r_j\, N(x \mid \mu_j, \Sigma_j).$$
So to estimate $\theta$ with a realization $x_1, \ldots, x_n$ of $n$ independent copies, we maximize the log-likelihood function, i.e. we compute a local maximizer of
$$\log L(\theta \mid X) = \sum_{i=1}^{n} \log \sum_{j=1}^{c} r_j\, N(x_i \mid \mu_j, \Sigma_j),$$
$$\hat{\theta}_{\mathrm{mle}} = \arg\max_{\theta} \log L(\theta \mid X). \tag{10}$$

1. Study the R function gmmfit for computing (10), as defined in gmmfit.r posted on the website. At the end of that file, some code is provided that calls the gmmfit function on simulated data.

(a) Derive an EM algorithm to compute a local maximizer of
$$\log L(\theta \mid X) = \sum_{i=1}^{n} \log \sum_{j=1}^{c} r_j\, N(x_i \mid \mu_j, \Sigma),$$

where
$$\hat{\theta}_{\mathrm{mle}} = \arg\max_{\theta} \log L(\theta \mid X), \tag{11}$$
$$\theta = (r_1, \ldots, r_c, \mu_1, \ldots, \mu_c, \Sigma).$$
Therefore, in this problem the covariance matrix is the same for all $j \in \{1, \ldots, c\}$. Write the corresponding R function to compute this local maximizer.

(b) Find a classification dataset from the UCI machine learning data repository for which the Gaussian mixture model assumptions are not terribly unreasonable (pretending that the responses/class labels are unobserved). This repository is at https://archive.ics.uci.edu/ml/datasets.html

Test the R functions created in parts (a) and (b) on these data by pretending that the responses/class labels $y_1, \ldots, y_n$ are unobserved. Set $c$ equal to the number of possible categories for a response/class label. For a fitted Gaussian mixture model, assign a class label $\hat{y}_i$ to the $i$th observation $x_i$ with
$$\hat{y}_i = \arg\max_{j \in \{1, \ldots, c\}} \hat{P}(y_i = j \mid X = x_i),$$
where $\hat{P}(y_i = j \mid X = x_i)$ is an estimate of $P(y_i = j \mid X = x_i)$ from the Gaussian mixture model fit. Using the fit from the R function created in part (a), report
$$\frac{1}{n}\sum_{i=1}^{n} I(\hat{y}_i = y_i).$$
Also do this for the fit from the R function created in part (b).
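For orientation only, the EM updates for the shared-covariance model in part (a) alternate an E-step that computes the responsibilities $\gamma_{ij} = \hat{P}(Y_i = j \mid x_i, \theta)$ and an M-step that updates $r_j$, $\mu_j$, and a pooled $\Sigma$. The R sketch below is a minimal illustration of that structure; the function and variable names (gmm_shared_em, gamma, and so on) are illustrative, it does not refer to the posted gmmfit.r, and you should still derive the update formulas yourself in part (a).

```r
# Minimal EM sketch for a Gaussian mixture with a common covariance matrix Sigma.
# x: n x p data matrix; n_comp: number of components (c in the problem); iters: EM iterations.
gmm_shared_em <- function(x, n_comp, iters = 100) {
  n <- nrow(x); p <- ncol(x)
  r     <- rep(1 / n_comp, n_comp)                 # mixing proportions r_j
  mu    <- x[sample(n, n_comp), , drop = FALSE]    # initialize means at random data points
  Sigma <- cov(x)                                  # initialize shared covariance
  log_dens <- function(mu_j, Sigma) {              # log N(x_i | mu_j, Sigma) for all i
    d <- sweep(x, 2, mu_j)
    -0.5 * (p * log(2 * pi) + as.numeric(determinant(Sigma)$modulus) +
              rowSums((d %*% solve(Sigma)) * d))
  }
  for (it in seq_len(iters)) {
    # E-step: responsibilities gamma[i, j] proportional to r_j * N(x_i | mu_j, Sigma)
    logw  <- sapply(seq_len(n_comp), function(j) log(r[j]) + log_dens(mu[j, ], Sigma))
    logw  <- logw - apply(logw, 1, max)            # stabilize before exponentiating
    gamma <- exp(logw) / rowSums(exp(logw))
    # M-step: update r, mu, and the pooled covariance Sigma
    nk <- colSums(gamma)
    r  <- nk / n
    mu <- t(sapply(seq_len(n_comp), function(j) colSums(gamma[, j] * x) / nk[j]))
    Sigma <- Reduce(`+`, lapply(seq_len(n_comp), function(j) {
      d <- sweep(x, 2, mu[j, ])
      t(d * gamma[, j]) %*% d
    })) / n
  }
  list(r = r, mu = mu, Sigma = Sigma, gamma = gamma)
}

# Hard cluster assignment and raw agreement with the true labels y (as in part (b));
# note that component labels are arbitrary, so a relabeling may be needed before comparing.
# fit  <- gmm_shared_em(as.matrix(x), n_comp = length(unique(y)))
# yhat <- apply(fit$gamma, 1, which.max)
# mean(yhat == y)
```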