L1 methods

H.J. Kappen
Donders Institute for Neuroscience
Radboud University, Nijmegen, the Netherlands

December 5, 2011
Outline

- George and McCulloch's model (spike and slab)
- The variational Garrote
Linear regression

Given data $x_i^\mu, y^\mu$, $\mu = 1, \ldots, p$, find weights $w_i$ that best describe the relation

    $y^\mu = \sum_{i=1}^n w_i x_i^\mu + \xi^\mu$

Ordinary least squares (OLS) minimizes

    $E_{OLS} = \sum_\mu \left( y^\mu - \sum_{i=1}^n w_i x_i^\mu \right)^2$

The solution is given by

    $w = \chi^{-1} b$,   $\chi_{ij} = \frac{1}{p} \sum_\mu x_i^\mu x_j^\mu$,   $b_i = \frac{1}{p} \sum_\mu x_i^\mu y^\mu$

Problems: low prediction accuracy due to overfitting ($p < n$) and poor interpretability: the OLS solution is not sparse.
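As a concrete illustration of the closed-form OLS solution $w = \chi^{-1} b$, a minimal numpy sketch (the toy data and all variable names are my own):

```python
import numpy as np

# Toy data: p samples, n inputs, one relevant weight plus Gaussian noise.
rng = np.random.default_rng(0)
p, n = 50, 10
X = rng.normal(size=(p, n))              # X[mu, i] = x_i^mu
w_true = np.zeros(n); w_true[0] = 1.0
y = X @ w_true + rng.normal(size=p)

chi = X.T @ X / p                        # chi_ij = (1/p) sum_mu x_i^mu x_j^mu
b = X.T @ y / p                          # b_i    = (1/p) sum_mu x_i^mu y^mu
w_ols = np.linalg.solve(chi, b)          # w = chi^{-1} b
```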
Ridge regression

Add a regularization term:

    $E_{Ridge} = E_{OLS} + \lambda \sum_i w_i^2$,   $\lambda > 0$

    $w = (\chi + \lambda I)^{-1} b$

Ridge regression
- improves the prediction accuracy
- $\chi + \lambda I$ has maximal rank
- the solution is not sparse
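Continuing the numpy sketch above (it reuses chi, b and n from there), the ridge solution only adds a constant to the diagonal before solving; the value of lam is an arbitrary choice:

```python
lam = 0.5                                            # arbitrary regularization strength
w_ridge = np.linalg.solve(chi + lam * np.eye(n), b)  # w = (chi + lambda I)^{-1} b
```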
Lasso

Solve the OLS problem under the linear constraint $\sum_i |w_i| \le t$. Equivalently, add a regularization term:

    $E_{Lasso} = E_{OLS} + \lambda \sum_i |w_i|$,   $\lambda > 0$

There exist efficient methods to solve this quadratic programming problem. The solution tends to be sparse and improves both the prediction accuracy and the interpretability.

Both ridge regression and Lasso are shrinkage methods: they find a solution that is biased towards smaller $w$.
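For comparison, a small scikit-learn sketch (library choice and alpha values are my own) showing that Lasso returns a sparse solution while ridge regression does not:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
p, n = 50, 100                               # fewer samples than features (p < n)
X = rng.normal(size=(p, n))
w_true = np.zeros(n); w_true[0] = 1.0
y = X @ w_true + rng.normal(size=p)

ridge = Ridge(alpha=0.5).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print((np.abs(ridge.coef_) > 1e-8).sum())    # essentially all n coefficients non-zero
print((np.abs(lasso.coef_) > 1e-8).sum())    # only a handful of non-zero coefficients
```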
Spike and slab

Introduce a prior distribution over $w_i$:

    $p(w_i | s_i, \beta_\pm) = (1 - s_i) N(w_i | 0, \sigma_{spike}) + s_i N(w_i | 0, \sigma_{slab})$,   $s_i = 0, 1$

    $p(s_i | \gamma) \propto \exp(\gamma s_i)$,   $\gamma < 0$

    $p(w_i) = \sum_{s_i = 0, 1} p(w_i | s_i) p(s_i | \gamma)$

with $1/\beta_- = \sigma_{spike}$ and $1/\beta_+ = \sigma_{slab}$.

Likelihood:

    $p(y | x, w, \beta) = \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2} \left( y - \sum_{i=1}^n w_i x_i \right)^2 \right)$

    $p(D | w, \beta) = \prod_\mu p(y^\mu | x^\mu, w, \beta)$

George and McCulloch in addition assume priors over $\beta_\pm$, $\beta$ (and $\gamma$).
Posterior

The posterior becomes

    $p(w, s | D, \beta, \beta_\pm, \gamma) = \frac{ p(D | w, \beta)\, p(w | s, \beta_\pm)\, p(s | \gamma) }{ p(D) }$

where $D$ is the data and

    $p(D | w, \beta) = \prod_{\mu=1}^p p(y^\mu | x^\mu, w, \beta) \propto \exp\left( -\frac{\beta}{2} \sum_\mu \left( y^\mu - \sum_i w_i x_i^\mu \right)^2 \right) \propto \exp\left( -\frac{\beta p}{2} \left( \sum_{ij} w_i w_j \chi_{ij} - 2 \sum_i w_i b_i \right) \right)$

    $p(w | s, \beta_\pm) \propto \exp\left( -\frac{1}{2} \sum_{i=1}^n \left( s_i \beta_+ w_i^2 + (1 - s_i) \beta_- w_i^2 \right) \right)$
    $p(s | \gamma) = \prod_i \frac{\exp(\gamma s_i)}{1 + \exp(\gamma)}$

For given $s$, the posterior distribution is Gaussian in $w$:

    $p(w | s) \propto \exp\left( -\frac{1}{2} \sum_{ij} (w_i - w_i^*) A_{ij}(s) (w_j - w_j^*) \right)$

    $A_{ij}(s) = \beta p \chi_{ij} + (s_i \beta_+ + (1 - s_i) \beta_-) \delta_{ij}$,   $\sum_j A_{ij} w_j^* = \beta p b_i$

For given $w$, the posterior factorizes over the $s_i$:

    $p(s | w) \propto \exp\left( \gamma \sum_i s_i - \frac{1}{2} \sum_{i=1}^n \left( s_i \beta_+ w_i^2 + (1 - s_i) \beta_- w_i^2 \right) \right)$
Gibbs sampling

Sample $w$ conditioned on $s$:

    $w \sim N\left( w^*(s), A(s)^{-1} \right)$

Sample the $s_i$ independently:

    $p(s_i = 1) = \frac{ \exp\left( \gamma - \frac{1}{2} \beta_+ w_i^2 \right) }{ \exp\left( \gamma - \frac{1}{2} \beta_+ w_i^2 \right) + \exp\left( -\frac{1}{2} \beta_- w_i^2 \right) }$
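A minimal sketch of this Gibbs sampler (the toy data and all hyperparameter values are my own choices; the noise precision beta is treated as known for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data
p, n = 50, 10
X = rng.normal(size=(p, n))
w_true = np.zeros(n); w_true[:2] = 1.0
y = X @ w_true + 0.1 * rng.normal(size=p)

chi = X.T @ X / p
b = X.T @ y / p

beta = 1.0 / 0.01                    # output noise precision (assumed known)
beta_plus, beta_minus = 1.0, 1e4     # slab and spike precisions
gamma = -2.0                         # sparsity parameter

s = rng.integers(0, 2, size=n)
w = np.zeros(n)
for it in range(1000):
    # w | s  ~  N(w*, A^{-1})
    A = beta * p * chi + np.diag(s * beta_plus + (1 - s) * beta_minus)
    w_star = np.linalg.solve(A, beta * p * b)
    w = rng.multivariate_normal(w_star, np.linalg.inv(A))
    # s_i | w, sampled independently with p(s_i = 1) = sigmoid(log-odds)
    log_odds = gamma - 0.5 * beta_plus * w**2 + 0.5 * beta_minus * w**2
    s = (rng.random(n) < 1.0 / (1.0 + np.exp(-np.clip(log_odds, -500, 500)))).astype(int)
```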
Spike and slab

The advantage of the spike-and-slab model is that it does not shrink $w$. However, MCMC is complex and time-consuming.
The Garrote vil
The variational Garrote

Introduce binary variables $s_i = 0, 1$ that select features. The regression model becomes

    $y^\mu = \sum_{i=1}^n w_i s_i x_i^\mu + \xi^\mu$,   $s_i = 0, 1$

Optimizing the $s_i$ is equivalent to finding the optimal subset of relevant features. Since the number of subsets is exponential in $n$, one has to resort to heuristic methods to find a good subset of features. Here we propose a variational approximation.
The variational Garrote

The likelihood term is given by

    $p(y | x, s, w, \beta) = \sqrt{\frac{\beta}{2\pi}} \exp\left( -\frac{\beta}{2} \left( y - \sum_{i=1}^n w_i s_i x_i \right)^2 \right)$

    $p(D | s, w, \beta) = \prod_\mu p(y^\mu | x^\mu, s, w, \beta) = \left( \frac{\beta}{2\pi} \right)^{p/2} \exp\left( -\frac{\beta p}{2} \left( \sum_{i,j=1}^n s_i s_j w_i w_j \chi_{ij} - 2 \sum_{i=1}^n w_i s_i b_i + \sigma_y^2 \right) \right)$

with $b_i$, $\chi_{ij}$ as before and $\sigma_y^2 = \frac{1}{p} \sum_\mu (y^\mu)^2$.
The variational Garrote

For concreteness, we assume that the prior over $s$ factorizes:

    $p(s | \gamma) = \prod_{i=1}^n p(s_i | \gamma)$,   $p(s_i | \gamma) = \frac{\exp(\gamma s_i)}{1 + \exp(\gamma)}$

with $\gamma$ given; it specifies the sparsity of the solution. We further assume a prior $p(\beta, w)$.
The variational Garrote

The posterior becomes

    $p(s, w, \beta | D, \gamma) = \frac{ p(w, \beta)\, p(s | \gamma)\, p(D | s, w, \beta) }{ p(D | \gamma) }$

The posterior is intractable. Possible approaches:
- MCMC
- Variational Bayes
- Variational MAP
- BP, CVM, ...

Here we compute a variational MAP estimate: we approximate the marginal posterior $p(w, \beta | D, \gamma) = \sum_s p(s, w, \beta | D, \gamma)$ and compute the MAP solution with respect to $w, \beta$.
Breiman's Garrote method

The proposed model is similar to Breiman's Garrote method:

    $y^\mu = \sum_{i=1}^n w_i s_i x_i^\mu + \xi^\mu$

which assumes $s_i \ge 0$ instead of binary. It computes the $w_i$ using OLS and then finds the $s_i$ by minimizing

    $\sum_\mu \left( y^\mu - \sum_{i=1}^n x_i^\mu w_i s_i \right)^2$   subject to   $s_i \ge 0$, $\sum_i s_i \le t$

We refer to our method as the Binary Garrote (BG).
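A sketch of this two-step procedure (OLS weights first, then non-negative shrinkage factors). The budget constraint $\sum_i s_i \le t$ is omitted here: scipy's nnls only enforces $s_i \ge 0$, so this shows the non-negativity part only; data and values are my own:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
p, n = 100, 5
X = rng.normal(size=(p, n))
w_true = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
y = X @ w_true + 0.5 * rng.normal(size=p)

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # step 1: OLS weights
A = X * w_ols                                   # column i scaled by w_ols[i]
s, _ = nnls(A, y)                               # step 2: s >= 0 minimizing ||y - A s||
w_garrote = s * w_ols
```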
The variational approximation

We compute the variational approximation using Jensen's inequality:

    $-\log \sum_s p(s | \gamma)\, p(D | s, w, \beta) \le \sum_s q(s) \log \frac{ q(s) }{ p(s | \gamma)\, p(D | s, w, \beta) } = F(q, w, \beta)$

The optimal $q(s)$ is found by minimizing $F(q, w, \beta)$ with respect to $q(s)$. We consider the simplest, fully factorized case

    $q(s) = \prod_{i=1}^n q_i(s_i)$,   $q_i(s_i) = m_i s_i + (1 - m_i)(1 - s_i)$

so $q(s)$ is parametrized by $m$.
The variational approximation

The expectation values with respect to $q$ can now be easily evaluated and the result is

    $F = -\frac{p}{2} \log \frac{\beta}{2\pi} + \frac{\beta p}{2} \left( \sum_{i,j} v_i v_j \chi_{ij} + \sum_i \frac{1 - m_i}{m_i} v_i^2 \chi_{ii} - 2 \sum_{i=1}^n v_i b_i + \sigma_y^2 \right) - \gamma \sum_{i=1}^n m_i + n \log(1 + \exp(\gamma)) + \sum_{i=1}^n \left( m_i \log m_i + (1 - m_i) \log(1 - m_i) \right)$

where we have defined $v_i = m_i w_i$.
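A small helper that evaluates this free energy term by term (a sketch; the argument names are my own):

```python
import numpy as np

def free_energy(m, v, beta, chi, b, sigma_y2, gamma, p):
    """Variational free energy F(m, v, beta) as written above."""
    n = len(m)
    quad = (v @ chi @ v
            + np.sum((1 - m) / m * v**2 * np.diag(chi))
            - 2 * v @ b + sigma_y2)
    entropy = np.sum(m * np.log(m) + (1 - m) * np.log(1 - m))
    return (-0.5 * p * np.log(beta / (2 * np.pi))
            + 0.5 * beta * p * quad
            - gamma * np.sum(m)
            + n * np.log(1 + np.exp(gamma))
            + entropy)
```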
The variational approximation

The approximate marginal posterior is then

    $p(w, \beta | D, \gamma) \propto p(w, \beta) \sum_s p(s | \gamma)\, p(D | s, w, \beta) \approx p(w, \beta) \exp(-F(m, w, \beta, \gamma)) = \exp(-G(m, w, \beta, \gamma))$

    $G(m, w, \beta, \gamma) = F(m, w, \beta, \gamma) - \log p(w, \beta)$

We can compute the variational approximation $m$ for given $w, \beta, \gamma$ by minimizing $F$ with respect to $m$. In addition, $p(w, \beta | D, \gamma)$ needs to be maximized with respect to $w, \beta$.
The variational approximation

Taking the derivatives of $G$ with respect to $m$, $v$, $\beta$ and setting them equal to zero gives the following set of fixed point equations:

    $m_i = \sigma\left( \gamma + \frac{\beta p}{2} \frac{v_i^2 \chi_{ii}}{m_i^2} \right)$,   with $\sigma(x) = (1 + \exp(-x))^{-1}$

    $v = (\chi')^{-1} b$,   $\chi'_{ij} = \chi_{ij} + \frac{1 - m_i}{m_i} \chi_{ii} \delta_{ij}$

    $\frac{1}{\beta} = -\sum_{i=1}^n v_i b_i + \sigma_y^2$
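A sketch of the resulting fixed-point iteration for a given $\gamma$ (the damping, clipping and number of sweeps are my own choices, not part of the slides):

```python
import numpy as np

def variational_garrote(X, y, gamma, n_iter=200, eta=0.5):
    p, n = X.shape
    chi = X.T @ X / p
    b = X.T @ y / p
    sigma_y2 = np.mean(y**2)
    chi_ii = np.diag(chi)

    m = np.full(n, 0.5)                                   # variational marginals q(s_i = 1)
    for _ in range(n_iter):
        chi_p = chi + np.diag((1 - m) / m * chi_ii)       # chi'
        v = np.linalg.solve(chi_p, b)                     # v = (chi')^{-1} b
        beta = 1.0 / (sigma_y2 - v @ b)                   # 1/beta = sigma_y^2 - sum_i v_i b_i
        arg = gamma + 0.5 * beta * p * v**2 * chi_ii / m**2
        m_new = 1.0 / (1.0 + np.exp(-np.clip(arg, -500, 500)))
        m = (1 - eta) * m + eta * np.clip(m_new, 1e-12, 1 - 1e-12)   # damped update
    w = v / m                                             # since v_i = m_i w_i
    return w, m, v, beta
```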
Comments

The variational approximation is not simply the substitution $w_i s_i \to w_i m_i$. If this were the case, the substitution $v_i = w_i m_i$ would remove $m_i$ from the equations and the OLS problem would be recovered. The reason is that $\langle s_i s_j \rangle = m_i m_j$ for $i \ne j$, but $\langle s_i^2 \rangle = \langle s_i \rangle = m_i$.

$\chi'$ differs from $\chi$ by adding a positive diagonal to it, making $\chi'$ automatically of maximal rank when $m_i < 1$. Roughly speaking, if $\chi$ has rank $p < n$, $\chi'$ can still be of rank $n$ when no more than $p$ of the $m_i = 1$, the remaining $n - p$ of the $m_i < 1$ making up for the rank deficiency.
Independent inputs

When the inputs are uncorrelated, $\chi_{ij} = \delta_{ij}$:

    $w_i = b_i = \langle x_i y \rangle$

    $m_i = \sigma\left( \gamma + \frac{\beta p}{2} b_i^2 \right)$

    $1/\beta = \sigma_y^2 - \sum_i b_i^2 m_i$

$\sum_i b_i^2 m_i$ is the explained variance. The Garrote solution with $m_i < 1$ has reduced explained variance with (hopefully) a better prediction accuracy and interpretability.
Univariate case

In the 1-dimensional case these equations become

    $m = \sigma\left( \gamma + \frac{p}{2} \frac{\rho}{1 - \rho m} \right) = f(m)$,   $\frac{1}{\beta} = \sigma_y^2 (1 - m \rho)$

with $\rho = b^2 / \sigma_y^2$ the squared correlation coefficient.
Univariate case

$f(m)$ is an increasing function of $m$ and crosses the line $f(m) = m$ either once or three times, depending on the values of $p$, $\gamma$, $\rho$.

[Figure: $f(m)$ versus $m$. Left: different lines correspond to different values of $0 < \rho < 1$. Right: a case with three solutions for $m$.]

The solutions close to $m = 0, 1$ correspond to local minima of $F$. The intermediate solution corresponds to a local maximum of $F$.
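To see the multiple solutions numerically, one can simply iterate the map $m \to f(m)$ from different starting points; the values of $p$, $\gamma$, $\rho$ below are my own choices in the bistable regime:

```python
import numpy as np

def f(m, p, gamma, rho):
    return 1.0 / (1.0 + np.exp(-(gamma + 0.5 * p * rho / (1.0 - rho * m))))

p, gamma, rho = 100, -35.0, 0.5
for m0 in (0.01, 0.99):
    m = m0
    for _ in range(500):
        m = f(m, p, gamma, rho)
    print(m0, "->", m)   # the two starts converge to different local minima of F
```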
Univariate case

One can compute the critical value $p^*$ for which multiple solutions occur; $p^*$ is a decreasing function of $\rho$. For $p > p^*$ we find two solutions for $m$; for $p < p^*$ we find one solution for $m$.

[Figure. Left: phase plot in the $(\rho, \gamma)$ plane; the dotted line is the solution for $\gamma$ when $m = 1/2$. Right: $m$ versus $\rho$ for two settings of $\gamma$ and $p$ (top and bottom).]
Transfer function

Suppose that data are generated from the model $y = wx + \xi$, with $\langle \xi^2 \rangle = \langle x^2 \rangle = 1$.

[Figure: estimated $w$ versus true $w$ for the Binary Garrote (VG), ridge regression, the Garrote and the Lasso. Ridge regression with $\lambda = 0.5$, Garrote with $\gamma = 1/4$, Lasso with $\gamma = 0.5$.]
Numerical examples

Inputs are generated from a mean-zero multivariate Gaussian distribution with a specified covariance structure. We generate outputs $y^\mu = \sum_i \hat{w}_i x_i^\mu + \xi^\mu$ with $\xi^\mu \sim N(0, \hat{\sigma})$.

For each example we generate a training set, a validation set and a test set (of sizes $p / p_v / p_t$). For each value of the hyperparameter ($\gamma$ in the case of the BG, $\lambda$ in the case of ridge regression and Lasso), we optimize the model parameters on the training set. We optimize the hyperparameters on the validation set.
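A sketch of this train/validation protocol for the BG, reusing the variational_garrote() function sketched earlier; the $\gamma$ grid is an arbitrary choice:

```python
import numpy as np

def mse(X, y, w_eff):
    return np.mean((y - X @ w_eff) ** 2)

def select_gamma(X_tr, y_tr, X_val, y_val, gammas):
    """Fit on the training set for each gamma, keep the best validation error."""
    best = None
    for gamma in gammas:
        w, m, v, beta = variational_garrote(X_tr, y_tr, gamma)
        err = mse(X_val, y_val, v)          # effective coefficients v_i = m_i w_i
        if best is None or err < best[0]:
            best = (err, gamma, v)
    return best

# usage: err_val, gamma_star, v_star = select_gamma(X_tr, y_tr, X_val, y_val, np.linspace(-30, 0, 31))
```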
Example 1

$x_i^\mu \sim N(0, 1)$ independently, $\hat{w} = (1, 0, \ldots, 0)$, additive noise $\xi^\mu \sim N(0, \hat{\sigma})$; training, validation and test sets of sizes $p/p_v/p_t$.

[Figure: Binary Garrote. Free energy $F$ (forward and backward sweeps), training and validation error, and the solution $v_1$, $v_{2:n}$, as a function of $\gamma$.]
Example 1

Same setup as the previous slide.

[Figure: Lasso (top row) and ridge regression (bottom row). Training and validation error, and the solution $w_1$, $w_{2:n}$, as a function of $\lambda$.]
Example 1

Same setup as the previous slides.

         Train        Val          Test         # non-zero   $\delta w$    $\delta w^2$
Ridge    .44 ± .29    .75 ± .3     .79 ± .8     --           4.4 ± .8      .8 ± .5
Lasso    .8 ± .22     .6 ± .24     .5 ± .23     3.8 ± 2.4    .64 ± .39     .6 ± .6
BG       .83 ± .8     .89 ± .9     .2 ± .9      .76 ± .36    .28 ± .26     .4 ± .4
True     .93 ± .4     .87 ± .2     .98 ± .4     1

Results on random instances.
Example 2

$x^\mu \sim N(0, \Sigma)$ with $\Sigma_{ij} = \delta^{|i-j|}$, $\delta = 0.5$. $\hat{w}_i = 1$ for five of the components and all other $\hat{w}_i = 0$; $n$, $\hat{\sigma}$ and $p/p_v/p_t$ as in Example 1.

         Train        Val          Test         # non-zero   $\delta w$    $\delta w^2$
Lasso    .78 ± .47    .4 ± .3      .49 ± .23    .2 ± 3.2     2.9 ± .77     .55 ± .22
BG       .8 ± .2      .8 ± .2      .2 ± .6      5.5 ± .7     . ± .46       .32 ± .37
True     . ± .8       .97 ± .9     .99 ± .7     5

Results on random instances.

[Figure. Left: estimated coefficients for Lasso and BG. Right: scatter plot of Lasso versus BG coefficients.]
Dependence on noise

Data as in Example 1.

[Figure: test error, gap, and $\delta w$ as a function of the noise level $\sigma$, for Lasso and VG.]

All results are averages over repeated runs.
Implementation issues for high-dimensional problems

For large $n$, the most expensive part of the computation is the inversion of $\chi'$. Note that the free energy can also be written as

    $F = -\frac{p}{2} \log \frac{\beta}{2\pi} + \frac{\beta p}{2} \left( \frac{1}{p} \sum_\mu (z^\mu)^2 + \sum_i \frac{1 - m_i}{m_i} v_i^2 \chi_{ii} - 2 \sum_{i=1}^n v_i b_i + \sigma_y^2 \right) - \gamma \sum_{i=1}^n m_i + n \log(1 + \exp(\gamma)) + \sum_{i=1}^n \left( m_i \log m_i + (1 - m_i) \log(1 - m_i) \right)$

with $z^\mu = \sum_i x_i^\mu v_i$. We can thus minimize $F$ with respect to $v, z$ under these linear constraints without the need to compute the covariance matrix $\chi$. This is a quadratic optimization problem.
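The slides solve this as a QP with MOSEK. As an alternative illustration of avoiding $\chi$, here is a matrix-free conjugate-gradient sketch for the $v$-update, which only needs matrix-vector products with $X$; the solver choice is my own, not the slides':

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_v(X, b, m, chi_ii):
    """Solve (chi + diag((1-m)/m * chi_ii)) v = b without ever forming chi."""
    p, n = X.shape
    d = (1 - m) / m * chi_ii
    matvec = lambda v: X.T @ (X @ v) / p + d * v   # chi' v via two mat-vec products
    A = LinearOperator((n, n), matvec=matvec, dtype=float)
    v, info = cg(A, b, atol=1e-10)
    return v
```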
Implementation issues for high-dimensional problems

The quadratic program can be solved in time that scales linearly with $n$.

[Table: CPU times in seconds for solving $v$ by matrix inversion ("Regression") and by solving the QP problem using MOSEK, for increasing $n$. Problem as described in Example 1.]

[Table: error, $\delta w^2$ and CPU time (sec) for Lasso and the BG, for increasing $n$.]
Discussion

Local minima:
- appear for few and noisy data
- seem modest for (very) sparse problems
- increasing $\gamma$ increases $\beta$ and works as an annealing schedule

Extensions:
- MAP: TAP, BP, CVM
- Full Bayes: MCMC, VB, ...
- Use of priors (on $\gamma$) instead of cross-validation

Applications:
- Finding the structure of networks, both static and dynamic
- Finding genes in GWAS
- ...

arxiv.org/abs/1109.0486