MCMC convergence diagnosis using geometry of Bayesian LASSO A. Dermoune, D.Ounaissi, N.Rahmania Abstract arxiv:151.01366v1 [math.st] 4 Dec 015 Using posterior distribution of Bayesian LASSO we construct a semi-norm on the parameter space. We show that the partition function depends on the ratio of the l 1 and l norms and present three regimes. We derive the concentration of Bayesian LASSO, and present MCMC convergence diagnosis. keyword: LASSO, Bayes, MCMC, log-concave, geometry, incomplete Gamma function 1 Introduction Let p n be two positive integers, y R n and A be an n p matrix with real numbers entries. Bayesian LASSO c(x = 1 ( Z exp Ax y x 1 is a typically posterior distribution used in the linear regression Here y = Ax + w. ( Z = exp Ax y x 1 dx ( R p is the partition function, and 1 are respectively the Euclidean and the l 1 norms. The vector y R n are the observations, x R p is the unknown signal to recover, w R n is the standard Gaussian noise, and A is a known matrix which maps the signal domain R p into the observation domain R n. If we suppose that x is drawn from Laplace distribution i.e. the distribution proportional to (1 exp( x 1, (3 then the posterior of x known y is drawn from the distribution c (1. The mode { Ax y arg min + x 1 : x R p} (4 1
of c was first introduced in [14] and called LASSO. It is also called Basis Pursuit De-Noising method [4]. In our work we select the term LASSO and keep it for the rest of the article. In general LASSO is not a singleton, i.e. the mode of the distribution c is not unique. In this case LASSO is a set and we will denote by lasso any element of this set. A large number of theoretical results has been provided for LASSO. See [5], [6], [8], [1] and the references herein. The most popular algorithms to find LASSO are LARS algorithm [7], ISTA and FISTA algorithms see e.g. [] and the review article [10]. The aim of this work is to study geometry of bayesian LASSO and to derive MCMC convergence diagnosis. Polar integration Using polar coordinates, the partition function ( Z = J p (θdθ, (5 where dθ denotes the surface measure on the unit sphere S, and Here where J p (θ = 0 S exp( g(r, θr p 1 dr. (6 g(r, θ = 1 (r Aθ + r Aθ β + y, (7 β := θ 1 Aθ y s, (8 and s denotes the cosine of the angle (Aθ, y i.e. cos((aθ, y. Using known estimate θ θ 1, we observe that β is bounded below by 1 A y, (9 and β + as Aθ 0. Here A is the square root of the largest eigenvalue of A A. Observe that c(xdx = 1 Z exp( g(r, θrp 1 drdθ. Hence, we can sample from Bayesian LASSO c (1 as following. We draw uniformly θ := x x from the unit sphere, and then draw the norm x following the distribution µ θ (r := 1 exp( ϕ(r, θ, (10 J p (θ
where ϕ(r, θ := g(r, θ (p 1 ln(r, r > 0. (11 Moreover, observe that the modes {x lasso = r lasso θ lasso : g(r lasso, θ lasso = min r 0,θ S g(r, θ} and {(r, θ : ϕ(r, θ = min r 0,θ S ϕ(r, θ} respectively of the distributions c(xdx and 1 Z exp( g(r, θrp 1 drdθ are different. We will show that (r, θ contains more information than x lasso. 3 Geometric interpretation of the partition function The volume (Lebesgue measure of the set K(A, y := {x R p : J p (x 1} is vol(k(a, y = 1 p Z. Observe that J 1 p p is a norm on the null-space N(A of A. A general result [1] tells us that if f is even, log-concave and integrable on an Euclidean space E, then x E ( 0 f(rxr p 1 dr 1 p is a norm on E. It follows that in the case E = N(A or E = {x R p : Ax, y = 0}, the map is a norm on E. The map x E J 1 p p (x x R p J 1 p p (x := x LASSO has nearly all the properties of a norm. Only the evenness is missing. The set K(A, y = {x R p : x LASSO 1} is convex, compact and contains the origin. See [9] for more details. 3.1 Necessary and sufficient condition to have LASS0 = {0} If β 0, then r [0, + g(r, θ is increasing, its minimizer is equal to r = 0, and its smallest value is y. If β < 0, then its minimizer is equal β to r = Aθ, and its smallest value is less than y. If the set {β < 0} is empty, then LASSO = {0}, if not β l LASSO = {l = θ l : Aθ l β l 0, s.t. βl = sup β }. (1 β 0 As an illustration we consider the case n = 4, p = 7 and the entries of the matrix A B(± 1 n are a realization of i.i.d. Bernoulli random variables 3
x 1 x x 3 x 4 x 5 x 6 x 7 LASSO F IST A 1.7744 0.6019-0.383 0 0-1.0050 0 LASSO P OLAR 0.999 0.3890-1.3980 0.0769-0.0070-0.8699-0.0379 Table 1: N = 10 5, p = 7, n = 4. with the values ± 1 n. We draw uniformly N = 10 5 vectors from the sphere S and estimate LASSO using Formula (1. Table 1 gives the value of LASSO using respectively FISTA algorithm and Formula (1. Observe that necessary and sufficient condition for β 0 for all θ is θ 1 y inf{ Aθ s : θ S}. Using known estimate θ θ 1, we obtain y 1 A as a sufficient condition for β 0 for all θ. 4 Closed form of the partition function We introduce for a R and for a couple p, r 1 of integers, the notations (a r = (a 1... (a r, (13 c(p, r = p 1 ( p 1 ( 1 p 1 k ( k + 1 k r. (14 k=0 Now, we can announce the following result. Proposition 4.1. 1 If β = +, then where If β 0, then θ p 1 J p(θ = (p 1! exp( y. θ p 1 J p(θ := Φ(β, Φ(β = exp( y (β + s y p p 1 k=0 k 1 exp( β k+1 Γ(, β ( p 1 k ( β p 1 k, (15 4
Here Γ(a, x = x exp( tt a 1 dt, a > 0, x 0, is the upper incomplete Gamma function. 3 If β < 0, then θ p 1 J p(θ = exp( y (β + s y p p 1 k=0 k 1 exp( β ( p 1 k ( β p 1 k k 1 (Γ( k+1 + ( 1k γ( k+1, β. (16 Here γ(a, x = x 0 exp( tta 1 dt is the lower incomplete Gamma function. where 4 If β = 0, then θ p 1 J p(θ = exp( y p y p sp Γ( p, 0. 5 If β > 0 then for M p + 1, θ p 1 J p(θ := Φ(β, M + R(β, M, Φ(β, M = (p 1! exp( y M 1 + p 1 exp( y ( 1 + y s pc(p, ( β r β and the remainder term r=p ( R(β, M 1 + y s p y exp( c(p, M β p 1 ( β M (p 1 (18 Proof 4.. Only the first part of assertions and 5 needs the proof. Let us prove the first part of. From the equality J p (θ = exp( y + β and the change of the variable 0 τ = Aθ r + β, exp( ( Aθ r + β r p 1 dr, p 1 r(17 we obtain 0 ( exp ( Aθ r + β r p 1 dr = = 1 Aθ p exp( τ β (τ βp 1 dτ 1 p 1 ( p 1 + ( β Aθ p p 1 k exp( τ k β τ k dτ. k=0 5
As β > 0, then the change of variable implies β ω = τ, exp( τ τ k dτ = k 1 k + 1 Γ(, β. The equality β = θ 1 Aθ y s achieves the proof of. Now we prove Assertion 5. We extend the incomplete Gamma function as following Γ(a, x = and we use known estimate see [3] page 14 x exp( tt a 1 dt, x > 0, a R, M 1 Γ(a, x = exp( xx a 1 + (a r exp( xx a 1 r + R(a, x, M, (19 r=1 where M > a 1, x > 0, a R, and the remainder term R(a, x, M (a M 1 exp( xx a 1 M. If β > 0, then from β = θ 1 Aθ y s, we have J p (θ = ( p 1 θ p exp( y 1 + y s β 1 ( exp( β β p ( β p 1 p 1 k=0 k 1 Γ( k+1, β. Using the expansion (19, and the fact that J p (θ (p 1! θ p 1 β +, we obtain for r < p 1 c(p, r = 0, 1 r < p 1, c(p, p 1 = It follows the following expansion: ( 1 + y s β ( p 1 k ( 1 p 1 k (p 1! p 1. p θ M 1 p 1 exp( y J p(θ = (p 1! + p 1 where the remainder term which achieves the proof. R(β, M p 1 c(p, M M (p 1, ( β r=p ( β exp( y as c(p, r r (p 1 + R(β, M, 6
4.1 Numerical calculations As an illustration we consider n = 4, p = 7, A B(± 1 n, and y = 0. The choice M = 17 corresponds to the relative error R(β,M Φ(β,M 10 4 for β 7.5. Figure 1: Curves of Φ(β and Φ(β, 17, β = [6, 45], p = 7, n = 4. Numerical calculations show that the function Φ(β (15 explodes for β > 13.8. To compass these explosions we use the expansion (19, and then we use the function Φ(, M (17. In Fig.(1 and Fig.( dashed and black curves represent respectively the function Φ(, M and Φ. In Fig. (1 we plot Φ(, M and Φ for β (6, 45. By zooming on β (1.09, 7.5, β (7.5, 13.8, and β (13.8, 15 we show that the behavior of Φ becomes abnormal from β β Φ = 13.8 and obtain Fig.(. 4. The case LASS0={ 0 }: Partition function estimate and concentration inequality The following is a consequence of [9] Lemma.1. Proposition 4.3. 1 The function r (0, + ϕ(r, θ is convex and its unique critical point is the mode of (10. r(θ = β + β + 4(p 1 Aθ (0 7
( By denoting M(θ = exp ϕ(r(θ, θ, we obtain M(θr(θ p J p (θ M(θr(θ(p 1! exp(p 1 (p 1 p. qr(θ 3 We have for q > 0, exp( ϕ(r, θdr pγ(p, (p 1q exp(p 1 (p 1 p 0 exp( ϕ(r, θdr, and Z min := S inf θ S M(θr(θ p Z S (p 1! exp(p 1 (p 1 p sup θ S M(θr(θ := Z max, where S denotes the surface of the unit sphere S. 4 If x is drawn from Bayesian LASSO distribution c (1, then for q > 0, x qr(θ with the probability at least equal to P (q, p := 1 pγ(p,(p 1q exp(p 1 (p 1 p exp( (p 1.. In particular for q = 5 we have pγ(p,(p 1q exp(p 1 (p 1 p Figure : Curves of Φ(β and Φ(β, 17, p = 7 and n = 4. Remark 4.4. If y = 0, then LASS0=0 and the mode (0 becomes r(θ θ 1 = β( β + β + 4(p 1. 8
Figure 3: Curve of r(θ θ 1, β min := 1 A = 0.4987, β max = 44.5, min(r(θ θ 1 = 1.1035, p = 7 and n = 4. In Fig.(3 we plot β [0.4987, 44.5] β( β+ β +4(p 1. The mode of the distribution of 1 Z exp( ϕ(r, θdrdθ is equal to arg min ϕ(r(θ, θ = r>0,θ S (r(θ, θ. As an illustration we consider p = 7, n = 4, A B(± 1 n. We draw uniformly N = 10 5 sample θ i S from the unit sphere S. For each i, we calculate ϕ(r(θ i, θ i, and we derive θ. Notice that β = θ 1 Aθ = 14.01 β Φ is nearly equal to the beginning of abnormality of Φ. Using Formula (5 and Monte Carlo method, we obtain Z.14, Z min 0.0058 and Z max 10.3654. If we draw N = 10 5 vectors using Laplace distribution (3 and calculate the value of Z using Formula ( and Monte Carlo method, then we obtain Z 0.0036 < Z min. Hence Monte Carlo method using Formula (5 wins against Monte Carlo method using Formula (. 5 The case 0 / LASSO If 0 / LASSO, then the assertions of Proposition (4.3 are no longer valid. But we are going to show that these assertions becomes valid if we work 9
around LASSO. We consider for l LASSO, h(x = A(x + l y x + l 1, h(x := h(x h(0, f(x = exp ( h(x. (1 Contrary to the map x c(x, the map x f(x attains its supremum at the origin. Observe that f(x l c(x = R f(xdx, p Z = exp(h(0z f := exp(h(0 f(xdx. R p If x is drawn from c, then x l is drawn from f Z f. Moreover where Z f = f(xdx = R p J p (θ, l := 0 S J p (θ, ldθ, ( f(rθr p 1 dr. (3 The map x R p J 1 p p (x, l := x A,y,l is nearly a norm (only the eveness is missing. The set K(A, y, l = {x R p : x A,y,l 1} (4 is convex, compact and contains the origin. The volume V ol(k(a, y, l = Z f p. If x is drawn from c, then x l is drawn from f Z f, or equivalently if x is drawn from f Z f, then x + l is drawn from c. To draw x from f Z f, we draw θ = uniformely on S, and then we draw x from x x µ θ,l (r := f(rθ J p (θ, l. (5 We have from [9] Lemma.1 and Remarks page 14 the following result. 10
Proposition 5.1. 1 The function r (0, + ϕ(r, θ, l := A(rθ + l y + rθ + l 1 (p 1 ln(r + h(0 is convex and its unique critical point ( r(θ, l is the mode of (5. By denoting M(θ, l = exp ϕ(r(θ, l, θ, l, we obtain M(θ, lr(θ, l p J p (θ, l M(θ, lr(θ, l(p 1! exp(p 1 (p 1 p, and for q > 0, exp( ϕ(r, θ, ldr qr(θ,l pγ(p, (p 1q exp(p 1 (p 1 p 0 exp( ϕ(r, θ, ldr. 3 We have S inf θ S M(θ, lr(θ, l p Z f S (p 1! exp(p 1 (p 1 p sup M(θ, lr(θ, l. θ S 4 if x is drawn from the distribution c (1, then x l qr(θ, l with the probability at least equal to 1 pγ(p,(p 1q exp(p 1 (p 1 p. 5.1 Calculation of the mode of (5 and the partition function (3 Now, we are going to calculate the mode r(θ, l, and the partition function J p (θ, l. The calculations are similar to the case LASSO={0}, but we need new notations. The vector y l = y Al, s l = cos(θ l where θ l denotes the angle (Aθ, y l, b l = y l s l. The components of the vector l LASSO are denoted by l 1,..., l p. For θ R p, we set S 0 = {i {1,..., p} : θ i = 0}, S + = {i {1,..., p} : θ i 0, θ i l i 0}, S = {i {1,..., p} : θ i 0, θ i l i < 0}. The cardinality of S is denoted by S, and the order statistic of the sequence l i θ i, for i S is denoted by lθ(0 := 0 lθ(1 := l θ (1... lθ( S := l θ ( S lθ( S + 1 := +. Using these new notations, we obtain rθ + l 1 = l i + θ i (r + l i + θ i r l i θ i θ i. i S 0 i S + i S 11
If lθ(k r < lθ(k + 1, then where c k := i S 0 l i + θ 1,k := rθ + l 1 = θ 1,k r + c k, i S + θ i + i S + l i k k i=0,i S l(i + i=0,i S θ(i Observe that θ 1, S = θ 1, and if Aθ = 0 then J p (θ, l = exp( y l Now, we have the following. S lθ(k+1 k=0 lθ(k S i=k+1 S l(i, i=k+1,i S θ(i. exp( θ 1,k rr p 1 dr. Proposition 5.. If Aθ = 0, then exp( y l J p (θ, l = exp( c k ((lθ(k + 1 p (lθ(k p + p k I 1 (θ ( exp( c k Γ(p, lθ(k Γ(p, lθ(k + 1, (6 where k I (θ θ p 1,k I 1 (θ = {k {0,..., S }, such that θ 1,k = 0}, I (θ = {k {0,..., S }, such that θ 1,k > 0}. Now, we are going to give the closed form of J p (θ, l when Aθ 0. We observe for lθ(k r < lθ(k + 1 that where raθ + Al y + rθ + l 1 = α k + ( Aθ r + β k, β k = θ 1,k Aθ b l, α k = Al y β k + c k. 1
Moreover if k I 1 (θ, then β k = b l. Observe also that β S is bounded below by y l sup(s l : θ S. It follows that J p (θ, l = exp( α k J p,k (θ, l + exp( α k J p,k (θ, l, where k I 1 (θ J p,k (θ, l = The calculation of lθ(k+1 lθ(k lθ(k+1 lθ(k k I (θ exp( ( Aθ r + β k r p 1 dr. exp( ( Aθ r + β k r p 1 dr is similar to Proposition (4.1, and depends on the sign of x k = Aθ lθ(k + β k, y k = Aθ lθ(k + 1 + β k. Let k 0 = max(k : x k < 0 and k 1 = min(k : y k > 0. Observe that k 0 + 1 k 1, and then y k 0 for k < k 1 k 0 + 1. It follows for k < k 1 that J p,k (θ, l = 1 p 1 Aθ p j=0 j 1 ( p 1 If k 1 < k 0 + 1, then x k1 < 0 < y k1 J p,k1 (θ, l = and for k > k 1, J p,k (θ, l = 1 Aθ p p 1 j=0 1 Aθ p j 1 p 1 j=0 p 1 j=0 j j 1 ( p 1 j j 1 ( β k p 1( γ( j + 1, x k γ(j + 1, y k. and ( p 1 j ( β k p 1 j γ( j + 1, y k + 1 Aθ p ( β k p 1 γ( j + 1, x k, ( p 1 If k 1 = k 0 + 1, then 0 x k1 < y k1, and for k k 1, J p,k (θ, l = 1 p 1 Aθ p j=0 j 1 ( p 1 j j ( β k p 1( γ( j + 1, y k γ(j + 1, x k. ( β k p 1( γ( j + 1, y k γ(j + 1, x k. Now we can show that J p (θ, l converges to (6 as Aθ 0, and we obtain an approximation similar to (19 for J p,k (θ, l as Aθ 0 for each k I 1 (θ. 13
6 MCMC diagnosis Here we take p = 7, n = 4, A B(± 1 n and for simplicity we consider y = 0. We sample from the distribution c (1 using Hastings-Metropolis algorithm (x (t and propose the test x (t qr(θ (t as a criterion for the convergence. Here θ (t := x(t. We recall that if x is drawn from the x (t target distribution c, then x qr(θ with the probability at least equal to P (q, p. Table gives the values of the probability P (q, p. Note that for q.5 the criterion x (t qr(θ (t is satisfied with a large probability. q.5 3 3.5 4 4.5 5 P (q, p 0.667 0.9446 0.994 0.9991 0.9999 1.0000 1.0000 Table : Values of the probability P (q, p for p = 7. 6.1 Independent sampler (IS The proposal distribution Q(x, x 1 = p(x = 1 p exp( x 1, x 1, x. The ratio c(x p(x p Z, x. It s known that MCMC (x (t with the target distribution c and the proposal distribution p is uniformly ergodic [11]: sup P(x (t A x (0 c(xdx (1 Z A B(R p A p t. Here Z.14 and then (1 Z p = 0.987. Figure 4(a shows respectively the plot of t 5r(θ (t and t x (t. 6. Random-walk (RW Metropolis algorithm We do not know if the target distribution c satisfies the curvature condition in [13] Section 6. Here we propose to analyse the convergence of the Random walk Metropolis algorithm (x (t using the criterion x (t qr(θ (t. Figure 4(b shows respectively the plot of t 5r(θ (t and t x (t. Figures 4 show that contrary to independent sampler algorithm, the random walk (RW algorithm satisfies early the criterion x (t 5r(θ. More precisely 14
1 N 1 N 1 the independent sampler (IS algorithm begins to satisfy the criterion x (t 5r(θ (t at t = 8 10 5 iteration. The RW algorithm begins to satisfy the criterion x (t 3.5r(θ (t at t = 939065 iteration, but the IS algorithm never satisfies the criterion x (t 3.5r(θ (t. We finally compare IS and RW algorithms using the fact that R xc(xdx = p 0. The best algorithm will furnish the best approximation of the integral R xc(xdx. Table 3 gives the estimators 1 N p N t=1 x(t IS R xc(xdx and p N t=1 x(t RW R xc(xdx. It follows that 1 N p N t=1 x(t IS = 0.0187 and N t=1 x(t RW = 0.0041. We conclude that the random walk algorithm wins for both criteria against independent sampler algorithm. x 1 x x 3 x 4 x 5 x 6 x 7 ˆx IS -0.0005-0.0037 0.0016 0.0164 0.0050 0.001-0.0058 ˆx RW 0.0005-0.0019-0.000 0.001-0.0005 0.0031-0.0011 Table 3: N = 10 6, p = 7, n = 4 and q = 5. Figure 4: (a: Test of convergence of MCMC algorithm with proposal distribution p(x. (b: Test of convergence of MCMC algorithm with N (0, 0.5I p proposal distribution. N = 10 6 iterations, p = 7, n = 4, q = 5, A B(± 1 n, y = 0 and l = 0. 15
7 Conclusion We studied the geometry of bayesian LASSO using polar coordinates and calculated the partition function. We obtained a concentration inequality and derived MCMC convergence diagnosis for the convergence of hasting metropolis algorithm. We showed that the random walk MCMC with the variance 0.5 wins again the independent sampler with the Laplace proposal distribution. 8 References References [1] K. Ball, Logarithmically concave functions and sections of convex sets in R n, Studia Mathematica, T. LXXXVIII (1988 70 84. [] A. Beck, M. Teboulle, A Fast Iterative Shrinkage-thresholding Algorithm for linear Inverse Problem, SIAM J. Imaging Sci. (009 183 0. [3] E.T. Copson, Asymptotic Expansions, Cambridge at the university press (1965. [4] S. Chen, D. L. Donoho, M. Saunders. Atomic decomposition by basis pursuit, SIAM J. Sci. Computing, Vol. 0 No. 1 (1998 33 61. [5] I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Communications on Pure and Applied Mathematics Vol. LVII (004, 1413 1457. [6] A. Dermoune, D. Ounaissi, N. Rahmania, Oscillation of Metropolis- Hasting and simulated annealing algorithms around penalized least squares estimator, Math. Comput.Simulation (015. [7] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Ann. Statist. 37 (004 407 499. [8] G. Fort, S. Le Corff, E. Moulines, A. Schreck, A shrinkage-thresholding Metropolis adjusted Langevin algorithm for Bayesian variable selection, arxiv:131.5658 (015. [9] B. Klartag, V.D. Milman, Geometry of log-concave functions and measures, Geometriae Dedicata, Volume 11 Issue 1 (005 169 18. [10] N. Parikh, S. Boyed, Proximal algorithms, Foundation and Trends in Optimization, 1 (3 (003 13 31. 16
[11] K. Mengersen, R.L Tweedie, Rates of convergence of the Hastings and Metropolis algorithms, The Annals of Statistics (1994 4:101 11. [1] M. Pereyra, Proximal Markov chain Monte Carlo algorithms, Stat. Comput. (015 1 16. [13] G.0. Roberts, A.L. Tweedie, Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms, Biometrika 83 1 (1996 95 110. [14] R. Tibshirani, Regression shrinkage and selection via Lasso, Journal of the Royal Statistical Society. Series B. Methodological 58 1 (1996 67 88. [15] R. Tibshirani, The Lasso problem and uniqueness, Electron. J. Stat. 7 (013 1456 1490. 17