Ridge Regression
Mohammad Emtiyaz Khan
EPFL, Oct 2015
Motivation
Linear models can be too limited and often underfit. One remedy is to use nonlinear basis functions instead.

Nonlinear basis functions
In general, we can use basis functions that are known, fixed, nonlinear transformations applied to the input variables. Given a D-dimensional input vector $x$, suppose we have M basis functions $\phi_j(x)$, indexed by $j$. Examples of basis functions: polynomials, splines (see Wikipedia), Gaussians, sigmoids, Fourier bases, wavelets.

[Figure: examples of basis functions, taken from Bishop, Chapter 3.]
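To make this concrete, here is a minimal NumPy sketch (not part of the notes; the function names and the width parameter are illustrative) of polynomial and Gaussian basis expansions for scalar inputs:

```python
import numpy as np

def polynomial_basis(x, M):
    """Map scalar inputs x (shape (N,)) to [x, x^2, ..., x^M]."""
    return np.column_stack([x ** j for j in range(1, M + 1)])

def gaussian_basis(x, centers, width):
    """Gaussian bumps phi_j(x) = exp(-(x - mu_j)^2 / (2 * width^2))."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))

x = np.linspace(0, 1, 50)
Phi_poly = polynomial_basis(x, M=5)                              # shape (50, 5)
Phi_gauss = gaussian_basis(x, np.linspace(0, 1, 5), width=0.2)   # shape (50, 5)
```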
Linear basis function model
The model is given as follows:
$$ y_n = \beta_0 + \sum_{j=1}^{M} \beta_j \phi_j(x_n) = \phi(x_n)^T \beta, $$
where $\phi(x_n)^T = [1, \phi_1(x_n), \phi_2(x_n), \ldots, \phi_M(x_n)]$.

This model is linear in $\beta$ but nonlinear in $x$. Note that the dimensionality is now M, not D.

Defining the matrix $\Phi$ with $\phi(x_n)^T$ as rows,
$$ \Phi := \begin{bmatrix} 1 & \phi_1(x_1) & \phi_2(x_1) & \ldots & \phi_M(x_1) \\ 1 & \phi_1(x_2) & \phi_2(x_2) & \ldots & \phi_M(x_2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(x_N) & \phi_2(x_N) & \ldots & \phi_M(x_N) \end{bmatrix} $$
the least-squares solution is
$$ \beta_{lse} = (\Phi^T \Phi)^{-1} \Phi^T y. $$

Two issues
1. The model can potentially overfit.
2. The Gram matrix $\Phi^T \Phi$ can be ill-conditioned.
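As an illustration (a sketch only: the data-generating function and the polynomial degree are made up), building $\Phi$ with a leading column of ones and solving the normal equations might look like this:

```python
import numpy as np

def build_design_matrix(x, M):
    """Rows are phi(x_n)^T = [1, x_n, x_n^2, ..., x_n^M]."""
    return np.column_stack([x ** j for j in range(M + 1)])

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

Phi = build_design_matrix(x, M=9)
# beta_lse = (Phi^T Phi)^{-1} Phi^T y, solved without forming the inverse.
# With M large relative to N, Phi^T Phi is nearly singular (issue 2),
# and the fitted polynomial can chase the noise (issue 1).
beta_lse = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
```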
Regularization and ridge regression
Through regularization, we can penalize complex models and favor simpler ones:
$$ \min_{\beta} \; L(\beta) + \frac{\lambda}{2N} \sum_{j=1}^{M} \beta_j^2 $$
The second term is a regularizer. The main point is that an input variable weighted by a small $\beta_j$ will have less influence on the output. Note that $\beta_0$ is not penalized. Setting $\lambda = 0$, we get back to least squares when $L$ is the MSE.

Ridge regression
When $L$ is the MSE, this is called ridge regression:
$$ \min_{\beta} \; \frac{1}{2N} \sum_{n=1}^{N} \left[ y_n - \phi(x_n)^T \beta \right]^2 + \frac{\lambda}{2N} \sum_{j=1}^{M} \beta_j^2 $$
Differentiating and setting to zero:
$$ \beta_{ridge} = (\Phi^T \Phi + \lambda I_M)^{-1} \Phi^T y $$
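Continuing the sketch above, the only change for ridge regression is the linear system being solved. Following the notes' convention that $\beta_0$ is not penalized, the intercept's entry in the penalty is zeroed:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """beta_ridge = (Phi^T Phi + lam * I)^{-1} Phi^T y.
    The first column of Phi is the intercept, which the notes
    leave unpenalized, so its penalty entry is set to zero."""
    I = np.eye(Phi.shape[1])
    I[0, 0] = 0.0                     # do not penalize beta_0
    return np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ y)

beta_ridge = ridge_fit(Phi, y, lam=0.1)   # Phi, y from the previous sketch
# lam = 0 recovers the least-squares solution.
```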
Ridge regression to fight ill-conditioning
The eigenvalues of $(\Phi^T \Phi + \lambda I_M)$ are all at least $\lambda$. This is also referred to as "lifting" the eigenvalues.

Proof: Write the eigenvalue decomposition of $\Phi^T \Phi$ as $Q S Q^T$ and use the fact that $Q Q^T = I_M$, so that $\Phi^T \Phi + \lambda I_M = Q (S + \lambda I_M) Q^T$, whose eigenvalues are $s_j + \lambda \ge \lambda$.

Therefore ridge regression improves the condition number of the Gram matrix.

Ridge regression as a MAP estimator
Assume $\beta_0 = 0$ for this discussion. Recall that least squares can be interpreted as the maximum-likelihood estimator:
$$ \beta_{lse} = \arg\max_{\beta} \left[ \sum_{n=1}^{N} \log \mathcal{N}(y_n \,|\, x_n^T \beta, \sigma^2) \right] $$
Ridge regression has a very similar interpretation:
$$ \beta_{ridge} = \arg\max_{\beta} \left[ \sum_{n=1}^{N} \log \mathcal{N}(y_n \,|\, x_n^T \beta, \lambda) + \log \mathcal{N}(\beta \,|\, 0, I) \right] $$
This is called a maximum a posteriori (MAP) estimate.
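A quick numerical check of the lifting claim, reusing `Phi` from the sketch above (the value $\lambda = 0.1$ is arbitrary):

```python
import numpy as np

A = Phi.T @ Phi                            # Gram matrix from the earlier sketch
lam = 0.1
A_lifted = A + lam * np.eye(A.shape[0])

print("smallest eigenvalue before:", np.linalg.eigvalsh(A).min())
print("smallest eigenvalue after :", np.linalg.eigvalsh(A_lifted).min())  # >= lam
print("condition number before   :", np.linalg.cond(A))
print("condition number after    :", np.linalg.cond(A_lifted))
```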
Additional Notes

Other types of regularization
Popular methods such as shrinkage, weight decay (in the context of neural networks), and early stopping are all different forms of regularization.

Another view of regularization: the regularized optimization problem is similar to the following constrained optimization (for some $\tau > 0$):
$$ \min_{\beta} \; \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - \phi(x_n)^T \beta \right)^2, \quad \text{such that } \beta^T \beta \le \tau \quad (1) $$
The following picture illustrates this.

[Figure 1: Geometric interpretation of ridge regression — contours of the least-squares objective and the circular constraint region in the weight plane $(w_1, w_2)$.]
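Numerically, each $\lambda$ in the penalized problem implies a radius $\tau$ in the constrained problem (namely $\tau = \beta(\lambda)^T \beta(\lambda)$). A sketch of this correspondence, reusing `ridge_fit` from the earlier sketch:

```python
# For each lambda, the penalized solution beta(lambda) also solves the
# constrained problem (1) with tau = beta(lambda)^T beta(lambda);
# larger lambda shrinks beta and hence corresponds to a smaller tau.
for lam in [0.01, 0.1, 1.0, 10.0]:
    beta = ridge_fit(Phi, y, lam)          # from the earlier sketch
    print(f"lambda = {lam:5.2f}  ->  implied tau = {beta @ beta:.4f}")
```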
Another popular regularizer is the L1 regularizer (also known as the lasso):
$$ \min_{\beta} \; \frac{1}{2N} \sum_{n=1}^{N} \left( y_n - \phi(x_n)^T \beta \right)^2, \quad \text{such that } \sum_{i=1}^{M} |\beta_i| \le \tau \quad (2) $$
This forces some of the elements of $\beta$ to be exactly 0 and therefore induces sparsity in the model (some features are not used, since their coefficients are zero). Why does the L1 regularizer force sparsity? Hint: draw a picture similar to the one above for the lasso and look at the optimal solution. Why is it good to have sparsity in the model? Is it going to be better than least squares? When and why? (A small numerical comparison follows at the end of this section.)

MAP estimate and the Bayes rule
In general, a MAP estimate maximizes the product of the likelihood and the prior, as opposed to an ML estimate, which maximizes only the likelihood:
$$ \beta_{lik} = \arg\max_{\beta} \; p(y \,|\, X, \beta, \lambda) \quad (3) $$
$$ \beta_{map} = \arg\max_{\beta} \; p(y \,|\, X, \beta, \lambda) \, p(\beta \,|\, \theta) \quad (4) $$
Here $p(y \,|\, X, \beta, \lambda)$ defines the likelihood of observing the data $y$ given $X$, $\beta$, and some likelihood parameter $\lambda$. Similarly, $p(\beta \,|\, \theta)$ is the prior distribution. This incorporates our prior knowledge about $\beta$.

Using the Bayes rule,
$$ p(A, B) = p(A \,|\, B)\, p(B) = p(B \,|\, A)\, p(A) \quad (5) $$
we can rewrite the MAP estimate as follows:
$$ \beta_{map} = \arg\max_{\beta} \; p(\beta \,|\, y, X, \lambda, \theta) \quad (6) $$
Therefore, the MAP estimate is the maximum of the posterior distribution, which explains the name.
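Coming back to the sparsity questions, here is a small comparison using scikit-learn (an assumption on my part: the notes do not prescribe any library, and the data and penalty strengths below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
beta_true = np.zeros(10)
beta_true[:2] = [3.0, -2.0]                # only 2 of 10 features are active
y = X @ beta_true + 0.1 * rng.standard_normal(100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge:", np.round(ridge.coef_, 3))  # all coefficients small but nonzero
print("lasso:", np.round(lasso.coef_, 3))  # irrelevant coefficients exactly 0
```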
To do
1. Read Sections 3.1 and 3.3 of Bishop (Section 3.1.1 might be confusing and can be skipped at a first reading).
2. Implement ridge regression and understand how it solves the two problems.
3. (Difficult) Derive the update rule for ridge regression using the Gaussian formulas (Eqs. 2.113–2.117 in Bishop's book). This is detailed in Section 3.3.1 of Bishop's book.