arxiv: v1 [math.st] 22 Feb 2018


Bayesian Lasso: Concentration and MCMC Diagnosis

arXiv:18.857v1 [math.st]

D. Ounaissi, N. Rahmania

Laboratoire LAMA, UFR de Mathématiques, Université de Paris 12, France.
Laboratoire Paul Painlevé, UFR de Mathématiques, Université de Lille, France.

Email addresses: daoud.ounaissi@u-pec.fr (D. Ounaissi), nadji.rahmania@univ-lille1.fr (N. Rahmania)
1 Corresponding author
Preprint submitted to Elsevier November 8, 18

Abstract

Using the posterior distribution of Bayesian LASSO we construct a semi-norm on the parameter space. We show that the partition function depends on the ratio of the $l_1$ and $l_2$ norms and present three regimes. We derive the concentration of Bayesian LASSO, and present MCMC convergence diagnosis.

Keywords: LASSO, Bayes, MCMC, log-concave, geometry, incomplete Gamma function

1. Introduction

Let $p \ge n$ be two positive integers, $y \in \mathbb{R}^n$, and let $A$ be an $n \times p$ matrix with real entries. The Bayesian LASSO

$$c(x) = \frac{1}{Z} \exp\Big(-\frac{\|Ax - y\|_2^2}{2} - \|x\|_1\Big) \tag{1}$$

is a posterior distribution typically used in the linear regression

$$y = Ax + w.$$

Here

$$Z = \int_{\mathbb{R}^p} \exp\Big(-\frac{\|Ax - y\|_2^2}{2} - \|x\|_1\Big)\,dx \tag{2}$$

is the partition function, and $\|\cdot\|_2$ and $\|\cdot\|_1$ are respectively the Euclidean and the $l_1$ norms. The vector $y \in \mathbb{R}^n$ contains the observations, $x \in \mathbb{R}^p$ is the unknown signal to recover, $w \in \mathbb{R}^n$ is standard Gaussian noise, and $A$ is a known matrix which maps the signal domain $\mathbb{R}^p$ into the observation domain $\mathbb{R}^n$. If we suppose that $x$ is drawn from the Laplace distribution, i.e. the distribution proportional to

$$\exp(-\|x\|_1), \tag{3}$$

then the posterior of $x$ given $y$ is the distribution $c$ of (1). The mode

$$\arg\min\Big\{ \frac{\|Ax - y\|_2^2}{2} + \|x\|_1 : x \in \mathbb{R}^p \Big\} \tag{4}$$

of $c$ was first introduced in [18] and called LASSO. It is also called the Basis Pursuit De-Noising method [4]. In our work we select the term LASSO and keep it for the rest of the article.

In general LASSO is not a singleton, i.e. the mode of the distribution $c$ is not unique. In this case LASSO is a set, and we denote by lasso any element of this set. A large number of theoretical results have been provided for LASSO; see [5], [6], [9], [12], [15] and the references therein. The most popular algorithms to find LASSO are the LARS algorithm [8] and the ISTA and FISTA algorithms; see e.g. [2] and the review article [14].

The aim of this work is to study the geometry of Bayesian LASSO and to derive MCMC convergence diagnosis.

2. Polar integration

Using polar coordinates $s = x/\|x\| \in S$, $r = \|x\|$, the partition function (2) becomes

$$Z_p = \int_S Z_p(s)\,ds, \tag{5}$$

where $\|\cdot\|$ denotes one of the $l_2$ or $l_1$ norms in $\mathbb{R}^p$, $ds$ denotes the surface measure on the unit sphere $S$ of the norm $\|\cdot\|$, and

$$Z_p(s) = \int_0^\infty \exp\{-f(rs)\}\, r^{p-1}\,dr, \tag{6}$$

where

$$f(x) := \frac{\|Ax - y\|_2^2}{2} + \|x\|_1, \quad x \in \mathbb{R}^p.$$

We express the partition function (6) using the parabolic cylinder function. We also give a concentration inequality and a geometric interpretation of the partition function $Z_p$.
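The radial integral (6) is one-dimensional, so it can be approximated by ordinary quadrature. The following is a minimal sketch, not the authors' code: the function names `f` and `radial_partition`, the truncation radius `r_max` and the number of Simpson panels `n` are illustrative assumptions. It evaluates $Z_p(s) = \int_0^\infty \exp\{-f(rs)\} r^{p-1} dr$ with a composite Simpson rule on $[0, r_{max}]$.

```python
import math


def f(A, y, x):
    """f(x) = ||Ax - y||_2^2 / 2 + ||x||_1, with A given as a list of rows."""
    resid = [sum(a * xi for a, xi in zip(row, x)) - yi for row, yi in zip(A, y)]
    return 0.5 * sum(t * t for t in resid) + sum(abs(v) for v in x)


def radial_partition(A, y, s, r_max=60.0, n=6000):
    """Composite Simpson approximation of the radial integral (6).

    Assumption: the integrand is negligible beyond r_max, which holds when
    ||s||_1 is of order one because exp(-f(r s)) <= exp(-r ||s||_1).
    n must be even.
    """
    p = len(s)
    h = r_max / n

    def integrand(r):
        return math.exp(-f(A, y, [r * v for v in s])) * r ** (p - 1)

    total = integrand(0.0) + integrand(r_max)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0
```

When $As = 0$ and $y = 0$ the integrand reduces to $e^{-r} r^{p-1}$ on the $l_1$ unit sphere, so the routine should return a value close to $(p-1)!$, which gives a quick sanity check.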

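3. Parabolic cylinder function and partition function

We extend the function $s \in S \mapsto Z_p(s)$ to

$$x \in \mathbb{R}^p \mapsto Z_p(x) = \int_0^\infty \exp\{-f(rx)\}\, r^{p-1}\,dr. \tag{7}$$

This extension is homogeneous of order $-p$. If $Ax = 0$, then $f(rx) = \|y\|_2^2/2 + r\|x\|_1$, and moreover if $x \ne 0$, then

$$Z_p(x) = \frac{(p-1)!}{\|x\|_1^p} \exp\Big(-\frac{\|y\|_2^2}{2}\Big).$$

If $Ax \ne 0$, then we express $Z_p(x)$ using the parabolic cylinder function. We recall that for $b \in \mathbb{R}$ the parabolic cylinder function satisfies

$$D_b(z) = z^b \exp\Big(-\frac{z^2}{4}\Big)\,[1 + O(z^{-2})], \tag{8}$$

when $z \to +\infty$ [16]. We also recall the integral representation of Erdélyi [7] for the parabolic cylinder function:

$$\exp\Big(\frac{z^2}{4}\Big)\,\Gamma(\nu)\,D_{-\nu}(z) = \int_0^\infty \exp\Big(-\frac{r^2}{2} - zr\Big)\, r^{\nu-1}\,dr, \quad \nu > 0,$$

where $\Gamma(\nu) = \int_0^\infty \exp(-t)\, t^{\nu-1}\,dt$ is the Gamma function.

Proposition 3.1. The variable

$$\omega_{lasso} := \frac{\|x\|_1 - \langle Ax, y\rangle}{\|Ax\|_2} \tag{9}$$

will play an important role. It depends only on $s = x/\|x\|_1 \in S_{p-1,1}$, and the function $s \in S_{p-1,1} \mapsto \omega_{lasso}(s)$ is bounded below by

$$\lambda_{1,2} := \min\Big\{ \frac{1 - \langle As, y\rangle}{\|As\|_2} : s \in S_{p-1,1} \Big\}.$$

Now we can state the following result.

Proposition 3.2. We have, for $Ax \ne 0$,

$$Z_p(x) = (p-1)!\, \exp\Big(-\frac{\|y\|_2^2}{2}\Big)\, \|Ax\|_2^{-p}\, \exp\Big(\frac{\omega_{lasso}^2}{4}\Big)\, D_{-p}(\omega_{lasso}).$$

If $Ax \to 0$, then $\omega_{lasso} \to +\infty$ and

$$Z_p(x) = (p-1)!\, \exp\Big(-\frac{\|y\|_2^2}{2}\Big)\, \|x\|_1^{-p}\, [1 + O(\omega_{lasso}^{-2})].$$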
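The closed form for $Ax \ne 0$ follows from the Erdélyi representation after a single change of variable. The derivation below is a worked sketch of that step, written with $f(x) = \|Ax - y\|_2^2/2 + \|x\|_1$ and $\omega_{lasso} = (\|x\|_1 - \langle Ax, y\rangle)/\|Ax\|_2$; it is offered as a reading aid rather than as the authors' own proof.

```latex
\begin{align*}
f(rx) &= \tfrac12\|rAx - y\|_2^2 + r\|x\|_1
       = \tfrac12\|y\|_2^2 + \tfrac12 r^2\|Ax\|_2^2
         + r\big(\|x\|_1 - \langle Ax, y\rangle\big), \\
Z_p(x) &= e^{-\|y\|_2^2/2}\int_0^\infty
          \exp\Big\{-\tfrac12 r^2\|Ax\|_2^2 - r\,\omega_{lasso}\|Ax\|_2\Big\}\,
          r^{p-1}\,dr \\
       &= e^{-\|y\|_2^2/2}\,\|Ax\|_2^{-p}\int_0^\infty
          \exp\Big\{-\tfrac12 t^2 - \omega_{lasso}\,t\Big\}\,t^{p-1}\,dt
          \qquad (t = r\|Ax\|_2) \\
       &= (p-1)!\;e^{-\|y\|_2^2/2}\,\|Ax\|_2^{-p}\,
          e^{\omega_{lasso}^2/4}\,D_{-p}(\omega_{lasso}).
\end{align*}
```

The last line uses the integral representation with $\nu = p$, $z = \omega_{lasso}$ and $\Gamma(p) = (p-1)!$.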

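Corollary 3.3. If $y = 0$ then $\omega_{lasso} = 1/\|As\|_2$ is bounded below by $1/\lambda_{1,2}$, where $\lambda_{1,2} = \max(\|As\|_2 : s \in S_{p-1,1})$ is the norm of the operator $A : (\mathbb{R}^p, \|\cdot\|_1) \to (\mathbb{R}^n, \|\cdot\|_2)$. The partition function

$$Z_p(s) = (p-1)!\, \omega_{lasso}^p\, \exp\Big(\frac{\omega_{lasso}^2}{4}\Big)\, D_{-p}(\omega_{lasso}),$$

viewed as a function of $\|As\|_2^2$, is convex and decreasing.

Proof 3.4. It suffices to remark that

$$Z_p(s) = \int_0^\infty \exp\Big\{-\frac{\|As\|_2^2\, r^2}{2} - r\Big\}\, r^{p-1}\,dr.$$

4. Geometric interpretation of the partition function

First we represent $f(rx)$ for $Ax \ne 0$ in the form

$$f(rx) = \frac{\|y\|_2^2}{2} - \frac{\omega_{lasso}^2}{2} + \frac{(r\|Ax\|_2 + \omega_{lasso})^2}{2}. \tag{10}$$

The function $x \in \mathbb{R}^p \mapsto \exp\{-f(x)\}$ is log-concave and integrable on $\mathbb{R}^p$. Observe that $Z_p^{-1/p}$ is a norm on the null space $\mathrm{Ker}(A)$ of $A$. A general result [1] tells us that

$$x \in \mathbb{R}^p \mapsto Z_p^{-1/p}(x) := \|x\|_c$$

is a quasi-norm on $\mathbb{R}^p$. The unit ball of $\|\cdot\|_c$ is defined by

$$B(A, y) := \{x \in \mathbb{R}^p : \|x\|_c \le 1\} = \{x \in \mathbb{R}^p : Z_p(x) \ge 1\} = \{x = rs \in \mathbb{R}^p : Z_p(s) \ge r^p\}$$
$$= \Big\{x = rs \in \mathbb{R}^p : (p-1)!\, \exp\Big(-\frac{\|y\|_2^2}{2}\Big)\, \|As\|_2^{-p}\, \exp\Big(\frac{\omega_{lasso}^2}{4}\Big)\, D_{-p}(\omega_{lasso}) \ge r^p\Big\}.$$

Its contour is equal to

$$C(A, y) := \{x \in \mathbb{R}^p : \|x\|_c = 1\} = \{x \in \mathbb{R}^p : Z_p(x) = 1\} = \{x = rs \in \mathbb{R}^p : Z_p(s) = r^p\}$$
$$= \Big\{x = rs \in \mathbb{R}^p : (p-1)!\, \exp\Big(-\frac{\|y\|_2^2}{2}\Big)\, \|As\|_2^{-p}\, \exp\Big(\frac{\omega_{lasso}^2}{4}\Big)\, D_{-p}(\omega_{lasso}) = r^p\Big\}.$$

We summarize our results in the following proposition.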

Proposition 4.1.
1) For each $s \in S_{p-1,1}$, the longest segment $[0, r]s$ contained in $B(A, y)$ is obtained for $r = r_{max}(s)$, the solution of the equation
$$r^p = (p-1)!\, \exp\Big(-\frac{\|y\|_2^2}{2}\Big)\, \|As\|_2^{-p}\, \exp\Big(\frac{\omega_{lasso}^2}{4}\Big)\, D_{-p}(\omega_{lasso}).$$
2) The ball is $B(A, y) = \bigcup_{s \in S_{p-1,1}} [0, r_{max}(s)]\,s$, and its contour is equal to
$$C(A, y) = \{r_{max}(s)\,s : s \in S_{p-1,1}\}.$$
3) The volume of $B(A, y)$ is
$$|B(A, y)| = \int_{S_{p-1,1}} \frac{r_{max}(s)^p}{p}\,ds = \frac{Z_p}{p}.$$

5. Necessary and sufficient condition to have lasso equal zero

Now we can give the necessary and sufficient condition to have lasso $= \{0\}$.

Proposition 5.1. The following assertions are equivalent.
1) $0 =$ lasso.
2) $\omega_{lasso}(s) \ge 0$ for all $s \in S_{p-1,1}$.
3) $\|A^\top y\|_\infty \le 1$.

6. Concentration around the lasso

6.1. The case lasso null

The polar coordinate formula tells us that we can draw a vector $x$ from $c(x)\,dx$ by drawing its angle $s$ uniformly, and then simulating its distance $r$ to the origin from

$$c(r, s)\,dr = \frac{1}{Z_p(s)} \exp\{-f(rs)\}\, r^{p-1}\,dr. \tag{11}$$
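The equivalence stated in Proposition 5.1 can be checked with a finite computation. The sketch below assumes that the third condition reads $\|A^\top y\|_\infty \le 1$, which is the standard optimality condition for $0$ to minimize $\|Ax - y\|_2^2/2 + \|x\|_1$. The helper names `at_y`, `zero_is_lasso` and `min_omega_numerator` are hypothetical, chosen here for illustration.

```python
def at_y(A, y):
    """Return the vector A^T y for a matrix A given as a list of rows."""
    n, p = len(A), len(A[0])
    return [sum(A[i][j] * y[i] for i in range(n)) for j in range(p)]


def zero_is_lasso(A, y):
    """True when ||A^T y||_inf <= 1, i.e. when 0 is a mode of the posterior."""
    return max(abs(v) for v in at_y(A, y)) <= 1.0


def min_omega_numerator(A, y):
    """Minimum over the l1 unit sphere of ||s||_1 - <As, y>.

    On the sphere ||s||_1 = 1 this equals 1 - max_s <s, A^T y>, and the
    maximum of a linear form over the l1 ball is attained at a signed basis
    vector, so the minimum is 1 - ||A^T y||_inf. Its sign is the sign of
    min_s omega_lasso(s), because ||As||_2 > 0 wherever omega_lasso is defined.
    """
    return 1.0 - max(abs(v) for v in at_y(A, y))
```

With this reading, `zero_is_lasso(A, y)` holds exactly when `min_omega_numerator(A, y) >= 0`, which mirrors the equivalence between the second and third assertions.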

Now let us estimate, for $r > 0$, the probability

$$P(\|x\| > r) = \int_S \int_r^\infty c(\rho, s)\,d\rho\,\frac{ds}{|S|},$$

where $|S|$ denotes the Lebesgue measure of $S$. We introduce for each pair $a, b \in \mathbb{R}$ and each $p$ the function

$$g_{a,b,p}(r) := g_{a,b}(r) - (p-1)\ln(r), \quad r > 0, \tag{12}$$

where $g_{a,b}(r) := (ar + b)^2/2$ is the $r$-dependent part of (10). In the following

$$a = \|As\|_2, \quad b = \omega_{lasso}.$$

The function $r \mapsto g_{a,b}(r)$ is increasing (because $b := \omega_{lasso} \ge 0$). The function $r \mapsto g_{a,b,p}(r)$ is convex and attains its minimum at the point $r(s)$, the solution of the equation

$$\|As\|_2\,(r\|As\|_2 + \omega_{lasso}) = \frac{p-1}{r}.$$

The positive root is given by

$$r(s) = \frac{-\omega_{lasso} + \sqrt{\omega_{lasso}^2 + 4(p-1)}}{2\|As\|_2}. \tag{13}$$

On one hand

$$\int_0^\infty \exp\{-g_{a,b,p}(r)\}\,dr \ge \exp\{-g_{a,b}(r(s))\} \int_0^{r(s)} r^{p-1}\,dr = \exp\{-g_{a,b}(r(s))\}\, \frac{r(s)^p}{p}.$$

On the other hand, by the convexity of $r \mapsto g_{a,b}(r)$, we have for all $r > 0$,

$$g_{a,b}(r) \ge g_{a,b}(r(s)) + \frac{(p-1)(r - r(s))}{r(s)},$$

because $\partial_r g_{a,b}(r(s)) = \frac{p-1}{r(s)}$. We deduce for $q > 0$,

$$\int_{q r(s)}^\infty \exp\{-g_{a,b,p}(r)\}\,dr \le \exp\{-g_{a,b}(r(s)) + p - 1\} \int_{q r(s)}^\infty \exp\Big\{-\frac{p-1}{r(s)}\, r\Big\}\, r^{p-1}\,dr$$
$$= \exp\{-g_{a,b}(r(s)) + p - 1\}\, \frac{r(s)^p}{(p-1)^p} \int_{q(p-1)}^\infty \exp(-r)\, r^{p-1}\,dr$$
$$= \exp\{-g_{a,b}(r(s)) + p - 1\}\, \frac{r(s)^p}{(p-1)^p}\, \Gamma(p, q(p-1)),$$

where $\Gamma(\nu, x) = \int_x^\infty \exp(-t)\, t^{\nu-1}\,dt$ is the upper incomplete gamma function. Dividing the second bound by the first, we finally get the following result.

Proposition 6.1. We have for all $q > 0$,

$$P(\|x\| \ge q\,r(s)) \le \frac{p \exp(p-1)}{(p-1)^p}\, \Gamma(p, q(p-1)) := P(q, p). \tag{14}$$

Using the following estimate [?]

$$x^{a-1} \exp(-x) < \Gamma(a, x) < B\,x^{a-1} \exp(-x), \quad a > 1,\ B > 1,\ x > \frac{B(a-1)}{B-1},$$

we get for $q > 1$,

$$\Gamma(p, q(p-1)) \approx q^{p-1} (p-1)^{p-1} \exp(-q(p-1)).$$

Therefore the quantity

$$P(q, p) \approx \frac{p\,q^{p-1}}{p-1} \exp\{-(q-1)(p-1)\}.$$

Summary. If $x$ is drawn from the density $c$, then $x/r(\theta) \in B(0, q)$ with a probability at least equal to $1 - P(q, p)$.

In Figure 1 we plot, for $p = 2$, $n = 1$, $A = (1\ 1)$ and $y = 0$, the density $c(r, s)\,dr$ for a fixed value of $s$. We notice that the mode of $c(r, s)$, equal to .6, is very close to the value $r(s) = $ .69 given by (13) for the same fixed $s$.

6.2. The general case

We take the vector $l \in$ lasso. We will study the concentration of $c$ around $l$. The variable of interest is $u = x - l$. The law of $u$ has density

$$c(u + l)\,du = \frac{1}{Z_p} \exp\{-f(u + l, A, y)\}\,du.$$

The change of variables formula gives, for each norm $\|\cdot\|$,

$$c(u + l)\,du = \frac{1}{Z_p} \exp\{-f(r\theta + l)\}\, r^{p-1}\,dr\,d\theta, \quad r > 0,\ \theta \in S.$$
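The bound $P(q, p)$ that appears in the concentration inequality (14), and again in the general case, is cheap to evaluate. The following sketch uses the standard closed form of the upper incomplete gamma function at an integer first argument, $\Gamma(p, x) = (p-1)!\, e^{-x} \sum_{k=0}^{p-1} x^k/k!$, so no special-function library is needed. The function names are illustrative assumptions, and $p \ge 2$ is required because of the factor $(p-1)^{-p}$.

```python
import math


def upper_incomplete_gamma_int(p, x):
    """Gamma(p, x) for an integer p >= 1, via the finite-sum closed form."""
    term, total = 1.0, 1.0
    for k in range(1, p):
        term *= x / k          # term is now x^k / k!
        total += term
    return math.factorial(p - 1) * math.exp(-x) * total


def tail_bound(q, p):
    """P(q, p) = p * exp(p - 1) / (p - 1)^p * Gamma(p, q * (p - 1)), p >= 2."""
    prefactor = p * math.exp(p - 1) / (p - 1) ** p
    return prefactor * upper_incomplete_gamma_int(p, q * (p - 1))


def tail_bound_approx(q, p):
    """Leading-order form p * q^(p-1) / (p-1) * exp(-(q-1)(p-1)) for q > 1."""
    return p * q ** (p - 1) / (p - 1) * math.exp(-(q - 1) * (p - 1))
```

Because $\Gamma(a, x) > x^{a-1} e^{-x}$ for $a > 1$, the approximate expression sits below the exact bound, and for small $q$ the bound exceeds 1 and carries no information.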

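Figure 1: The density $c(r, s)$ for $r \in [.1; 1]$.

By definition of $l$, for any direction $s$ the convex function $r \mapsto f(rs + l)$ reaches its minimum at the point $r = 0$. Therefore $r \mapsto f(rs + l)$ is increasing. The function

$$f(rs + l, p) := f(rs + l) - (p-1)\ln(r), \quad r > 0, \tag{15}$$

is strictly convex. Its critical point $r_l(s)$ is the solution of the equation

$$\partial_r f(rs + l) = \frac{p-1}{r}.$$

By a proof similar to that of Proposition 6.1 we have the following result.

Proposition 6.2. If $x$ is drawn from the density $c$, and $s = \frac{x - l}{\|x - l\|}$, then for all $q > 0$,

$$P(\|x - l\| \ge q\,r_l(s)) \le \frac{p \exp(p-1)}{(p-1)^p}\, \Gamma(p, q(p-1)) := P(q, p). \tag{16}$$

7. Applications

7.1. The contour in the case p = 2, n = 1

Let $A := (a_1, a_2)$ be a matrix of order $1 \times 2$. Its null space is $\mathrm{Ker}(A) = \{(x_1, x_2) : a_1 x_1 + a_2 x_2 = 0\}$. We have that $B(a_1, a_2, y)$ contains

$$\mathrm{Ker}(A) \cap B_{2,1}.$$

This intersection is a symmetric segment, denoted

$$[(x_1(a_1, a_2), x_2(a_1, a_2)),\ (-x_1(a_1, a_2), -x_2(a_1, a_2))].$$

To determine the other points of the set $B(a_1, a_2, y)$, we calculate $Z_2(s)$ directly. A simple calculation gives

$$Z_2(s) = \exp\Big(-\frac{\|y\|_2^2}{2} + \frac{\omega_{lasso}^2}{2}\Big) \int_0^\infty \exp\Big\{-\frac{(\|As\|_2\, r + \omega_{lasso})^2}{2}\Big\}\, r\,dr,$$

and

$$\|As\|_2^2 \int_0^\infty \exp\Big\{-\frac{(\|As\|_2\, r + \omega_{lasso})^2}{2}\Big\}\, r\,dr + \omega_{lasso}\|As\|_2 \int_0^\infty \exp\Big\{-\frac{(\|As\|_2\, r + \omega_{lasso})^2}{2}\Big\}\,dr = \exp\Big(-\frac{\omega_{lasso}^2}{2}\Big).$$

Finally we have the following proposition.

Proposition 7.1.
1) If $As \ne 0$, then
$$Z_2(s) = \exp\Big(-\frac{\|y\|_2^2}{2}\Big)\, \frac{1}{\|As\|_2^2}\, \Big\{1 - \omega_{lasso} \exp\Big(\frac{\omega_{lasso}^2}{2}\Big) \sqrt{2\pi}\,(1 - F(\omega_{lasso}))\Big\},$$
where $F$ is the distribution function of the standard normal law.
2) If $As \ne 0$ and $y = 0$, then
$$Z_2(s) = \omega_{lasso}^2\, \Big\{1 - \omega_{lasso} \exp\Big(\frac{\omega_{lasso}^2}{2}\Big) \sqrt{2\pi}\,(1 - F(\omega_{lasso}))\Big\}.$$
3) If $s \in S_{1,1}$, $As \ne 0$ and $y = 0$, then the function
$$b^2 \mapsto z_2(b^2) = \frac{1}{b^2}\, \Big\{1 - \frac{1}{b} \exp\Big(\frac{1}{2b^2}\Big) \sqrt{2\pi}\,\Big(1 - F\Big(\frac{1}{b}\Big)\Big)\Big\},$$
defined on $(0, \lambda_{1,2}^2]$, is convex and decreasing, where $\lambda_{1,2} = \max_{s \in S_{1,1}} |As|$.
4) We have, for $s \in S_{1,1}$,
$$Z_2(s) = z_2\Big(\frac{1}{\omega_{lasso}^2}\Big).$$
The ball
$$B(A, 0) = \{rs : Z_2(s) \ge r^2\}$$
is contained in the unit disk $\|x\|_1 \le 1$ for the $l_1$ norm. The contour is defined by the equation $Z_2(s) = r^2$.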
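For $p = 2$ the radial integral can be compared against a closed form built from the Gaussian tail. The sketch below is an independent check, not the authors' code. It takes $Z_2(s) = e^{-\|y\|_2^2/2}\, \|As\|_2^{-2}\, \{1 - \omega M(\omega)\}$ with the Mills ratio $M(\omega) = \sqrt{2\pi}\, e^{\omega^2/2} (1 - F(\omega))$, an expression obtained here by integrating $\exp\{-(\|As\|_2 r + \omega)^2/2\}\, r$ directly, and compares it with Simpson quadrature. The function names and the quadrature parameters are assumptions.

```python
import math


def z2_closed_form(a, omega, y_norm_sq=0.0):
    """Z_2(s) for a = ||As||_2 > 0, expressed through the Mills ratio.

    Uses 1 - F(w) = erfc(w / sqrt(2)) / 2, so that
    sqrt(2 pi) * (1 - F(w)) = sqrt(pi / 2) * erfc(w / sqrt(2)).
    """
    mills = math.sqrt(math.pi / 2) * math.exp(0.5 * omega * omega) \
        * math.erfc(omega / math.sqrt(2))
    return math.exp(-0.5 * y_norm_sq) * (1.0 - omega * mills) / (a * a)


def z2_quadrature(a, omega, y_norm_sq=0.0, r_max=60.0, n=20000):
    """Simpson approximation of int_0^inf exp(-f(rs)) r dr, with f(rs) written
    as ||y||^2/2 - omega^2/2 + (a r + omega)^2/2. n must be even."""
    h = r_max / n

    def integrand(r):
        expo = -0.5 * y_norm_sq + 0.5 * omega ** 2 - 0.5 * (a * r + omega) ** 2
        return math.exp(expo) * r

    total = integrand(0.0) + integrand(r_max)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3.0
```

For example, $A = (1, 1)$, $s = (1/2, 1/2)$ and $y = 0$ give $\|As\|_2 = 1$ and $\omega_{lasso} = 1$, and the two routines should then agree to many digits. The closed form calls `math.exp(0.5 * omega**2)`, which overflows for large $\omega_{lasso}$, so it is only suitable for moderate values.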

The norm of the linear operator $A : (\mathbb{R}^2, \|\cdot\|_1) \to (\mathbb{R}, |\cdot|)$ is defined by

$$\lambda_{1,2} = \sup_{s : \|s\|_1 = 1} |As|.$$

The function $s \mapsto Z_2(s) = z_2(\lambda_{1,2}^2)$ is constant on

$$\Omega_{1,2} = \{s : \|s\|_1 = 1,\ |As| = \lambda_{1,2}\}.$$

If $A = (1, 1)$ then

$$\Omega_{1,2} = \{s : \|s\|_1 = 1,\ |As| = \lambda_{1,2}\} = [(1, 0), (0, 1)] \cup [(-1, 0), (0, -1)].$$

If $A = (a_1, a_2)$ with $|a_1| < |a_2|$, then

$$\Omega_{1,2} = \{s : \|s\|_1 = 1,\ |As| = \lambda_{1,2}\} = \{(0, \mathrm{sgn}(a_2)),\ (0, -\mathrm{sgn}(a_2))\}.$$

In both cases

$$\{z_2(\lambda_{1,2}^2)\}^{1/2}\, \Omega_{1,2}$$

is part of the contour. The other points of the contour are deduced from the equation

$$z_2(b^2) = a^2, \quad b \in (0, \lambda_{1,2}).$$

Each pair $(a, b)$ generates four points of $B((a_1, a_2), 0)$ of the form $as$, where

$$|s_1| + |s_2| = 1, \quad |a_1 s_1 + a_2 s_2| = b.$$

We plot in Figure 2 the contour of $B(a_1, a_2, 0)$ for different choices of the matrix $(a_1, a_2)$. We notice that the area of $B((a_1, a_2), 0)$ is a decreasing function of the norm $\lambda_{1,2}$ of the matrix $A$.

Remark 7.2. The numerics show that $z_2(\omega_{lasso})$ blows up for large values of $\omega_{lasso}$, that is, when $s$ is close to the null space of $A$. To eliminate this blow-up we need to estimate the tail of the Gaussian density. Using the Gordon estimate [10]

$$\frac{x}{x^2 + 1} \exp\Big(-\frac{x^2}{2}\Big) \le \sqrt{2\pi}\,(1 - F(x)) \le \frac{1}{x} \exp\Big(-\frac{x^2}{2}\Big), \quad x > 0,$$

with $x = 1/b$, we have the following approximation:

$$0 \le z_2(b^2) \le \frac{1}{1 + b^2}, \quad b \text{ near } 0. \tag{17}$$
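Gordon's inequality brackets the Mills ratio $M(x) = \sqrt{2\pi}\, e^{x^2/2} (1 - F(x))$ between $x/(x^2+1)$ and $1/x$, and this can be verified numerically at moderate arguments. The sketch below assumes the form $z_2(b^2) = b^{-2}\{1 - M(1/b)/b\}$, for which the bracket yields $0 \le z_2(b^2) \le 1/(1+b^2)$. All function names are illustrative.

```python
import math


def mills_ratio(x):
    """sqrt(2 pi) * exp(x^2 / 2) * (1 - F(x)) for the standard normal cdf F.

    math.exp(0.5 * x * x) overflows for x above roughly 37, which is the
    numerical blow-up that motivates replacing the tail by its bounds.
    """
    return math.sqrt(math.pi / 2) * math.exp(0.5 * x * x) * math.erfc(x / math.sqrt(2))


def gordon_bounds(x):
    """Lower and upper Gordon bounds on the Mills ratio, valid for x > 0."""
    return x / (x * x + 1.0), 1.0 / x


def z2_of_b(b):
    """z_2(b^2) = (1 / b^2) * (1 - mills_ratio(1 / b) / b), for moderate 1/b."""
    x = 1.0 / b
    return x * x * (1.0 - x * mills_ratio(x))


def z2_upper_bound(b):
    """Upper bound 1 / (1 + b^2) obtained from the lower Gordon bound."""
    return 1.0 / (1.0 + b * b)
```

For very small $b$ the direct evaluation in `z2_of_b` loses accuracy through cancellation and eventually overflows, whereas `z2_upper_bound` remains well behaved, which is the practical reason for using the bounds near $0$.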

Figure 2: Contours of $B(a_1, a_2, 0)$ for different matrices $A$, $p = 2$ and $n = 1$.

8. MCMC diagnosis

Here we take $p = 7$, $n = 4$, $A \sim \mathcal{B}(\pm 1/\sqrt{n})$, and for simplicity we consider $y = 0$. We sample from the distribution $c$ of (1) using the Hastings-Metropolis algorithm $(x^{(t)})$ and propose the test $\|x^{(t)}\| \le q\,r(\theta^{(t)})$ as a criterion for convergence. Here $\theta^{(t)} := x^{(t)}/\|x^{(t)}\|$. We recall that if $x$ is drawn from the target distribution $c$, then $\|x\| \le q\,r(\theta)$ with probability at least $1 - P(q, p)$. Table 1 gives the values of the probability $P(q, p)$. Note that for $q \ge$ .5 the criterion $\|x^{(t)}\| \le q\,r(\theta^{(t)})$ is satisfied with a large probability.

Table 1: Values of the probability $P(q, p)$, as a function of $q$, for $p = 7$.

8.1. Independent sampler (IS)

The proposal distribution is

$$Q(x_1, x_2) = p(x_2) = \frac{1}{2^p} \exp(-\|x_2\|_1), \quad x_1, x_2 \in \mathbb{R}^p.$$

The ratio satisfies

$$\frac{c(x)}{p(x)} \le \frac{2^p}{Z}, \quad x \in \mathbb{R}^p.$$

It is known that the MCMC $(x^{(t)})$ with target distribution $c$ and proposal distribution $p$ is uniformly ergodic [13]:

$$\sup_{A \in \mathcal{B}(\mathbb{R}^p)} \Big| P(x^{(t)} \in A \mid x^{(0)}) - \int_A c(x)\,dx \Big| \le \Big(1 - \frac{Z}{2^p}\Big)^t.$$

Here $Z \approx$ .14 and then $(1 - Z/2^p) = $ .987. Figure 3(a) shows the plots of $t \mapsto 5\,r(\theta^{(t)})$ and $t \mapsto \|x^{(t)}\|$.

8.2. Random-walk (RW) Metropolis algorithm

We do not know whether the target distribution $c$ satisfies the curvature condition in [17], Section 6. Here we propose to analyse the convergence of the random-walk Metropolis algorithm $(x^{(t)})$ using the criterion $\|x^{(t)}\| \le q\,r(\theta^{(t)})$. Figure 3(b) shows the plots of $t \mapsto 5\,r(\theta^{(t)})$ and $t \mapsto \|x^{(t)}\|$.

Figure 3 shows that, contrary to the independent sampler algorithm, the random walk (RW) algorithm satisfies the criterion $\|x^{(t)}\| \le 5\,r(\theta)$ early. More precisely:
1) The independent sampler (IS) algorithm begins to satisfy the criterion $\|x^{(t)}\| \le 5\,r(\theta^{(t)})$ at t = iteration.
2) The RW algorithm begins to satisfy the criterion $\|x^{(t)}\| \le 3.5\,r(\theta^{(t)})$ at t = iteration, but the IS algorithm never satisfies the criterion $\|x^{(t)}\| \le 3.5\,r(\theta^{(t)})$.

We finally compare the IS and RW algorithms using the fact that $\int_{\mathbb{R}^p} x\,c(x)\,dx = 0$. The best algorithm will furnish the best approximation of the integral $\int_{\mathbb{R}^p} x\,c(x)\,dx$. Table 2 gives the estimators

$$\frac{1}{N} \sum_{t=1}^N x_{IS}^{(t)} \approx \int_{\mathbb{R}^p} x\,c(x)\,dx \quad \text{and} \quad \frac{1}{N} \sum_{t=1}^N x_{RW}^{(t)} \approx \int_{\mathbb{R}^p} x\,c(x)\,dx.$$

It follows that $\big\|\frac{1}{N} \sum_{t=1}^N x_{IS}^{(t)}\big\| = $ .187 and $\big\|\frac{1}{N} \sum_{t=1}^N x_{RW}^{(t)}\big\| = $ .41. We conclude that the random walk algorithm wins against the independent sampler algorithm on both criteria.
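The two samplers and the diagnostic ratio $\|x^{(t)}\| / r(\theta^{(t)})$ can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: $y = 0$, the $l_1$ norm is used for $\|x\|$ and $\theta = x/\|x\|_1$, the entries of $A$ are i.i.d. signs scaled by $1/\sqrt{n}$, the random-walk proposal variance `sigma2` is taken as 0.5, and all function names are hypothetical. With $y = 0$ and $\|\theta\|_1 = 1$, the radius $r(\theta)$ of (13) can be rewritten as $2(p-1)/(1 + \sqrt{1 + 4(p-1)\|A\theta\|_2^2})$, which avoids dividing by $\|A\theta\|_2$.

```python
import math
import random


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def l1(x):
    return sum(abs(v) for v in x)


def sq2(v):
    return sum(t * t for t in v)


def log_target(A, x):
    """log c(x) up to a constant, for y = 0."""
    return -0.5 * sq2(matvec(A, x)) - l1(x)


def r_of_theta(A, x, p):
    """Radius r(theta) for theta = x / ||x||_1 and y = 0 (stable rewrite of (13))."""
    norm = l1(x)
    a2 = sq2(matvec(A, [v / norm for v in x]))
    return 2.0 * (p - 1) / (1.0 + math.sqrt(1.0 + 4.0 * (p - 1) * a2))


def laplace_vector(p, rng):
    """p i.i.d. standard Laplace draws, i.e. density 2^-p exp(-||x||_1)."""
    return [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(p)]


def random_sign_matrix(n, p, rng):
    return [[rng.choice((-1.0, 1.0)) / math.sqrt(n) for _ in range(p)] for _ in range(n)]


def run_chain(A, p, n_iter, kind, rng, sigma2=0.5):
    """Return the ratios ||x_t||_1 / r(theta_t) along an IS or RW chain."""
    x = laplace_vector(p, rng)
    ratios = []
    for _ in range(n_iter):
        if kind == "IS":
            prop = laplace_vector(p, rng)
            # c(x)/p(x) is proportional to exp(-||Ax||^2 / 2) for a Laplace proposal.
            log_alpha = -0.5 * sq2(matvec(A, prop)) + 0.5 * sq2(matvec(A, x))
        else:
            prop = [v + rng.gauss(0.0, math.sqrt(sigma2)) for v in x]
            log_alpha = log_target(A, prop) - log_target(A, x)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = prop
        ratios.append(l1(x) / r_of_theta(A, x, p))
    return ratios
```

A chain satisfies the criterion with constant $q$ at iteration $t$ when `ratios[t] <= q`, so the first index from which this holds can be read off the returned list. A possible usage is `run_chain(random_sign_matrix(4, 7, rng), 7, 10000, "RW", rng)` with `rng = random.Random(0)`.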

Table 2: IS and RW estimators $\hat{x}_{IS}$ and $\hat{x}_{RW}$ of the coordinates $x_1, \dots, x_7$, using $N = 10^6$ iterations.

Figure 3: (a) Test of convergence of the MCMC algorithm with proposal distribution $p(x_2)$. (b) Test of convergence of the MCMC algorithm with $N(0, .5\,I_p)$ proposal distribution. $N = 10^6$ iterations.

9. Conclusion

We studied the geometry of Bayesian LASSO using polar coordinates and calculated the partition function. We obtained a concentration inequality and derived an MCMC convergence diagnosis for the Hastings-Metropolis algorithm. We showed that the random walk MCMC with variance .5 outperforms the independent sampler with Laplace proposal distribution.

References

[1] K. Ball, Logarithmically concave functions and sections of convex sets in R^n, Stud. Math. T. LXXXVIII (1988).
[2] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imag. Sci. 2 (2009).
[3] E.T. Copson, Asymptotic Expansions, Cambridge University Press.
[4] S. Chen, D.L. Donoho, M. Saunders, Atomic decomposition by basis pursuit, SIAM J. Sci. Comput. 20 (1) (1998).
[5] I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Comm. Pure Appl. Math. 57 (11) (2004).
[6] A. Dermoune, D. Ounaissi, N. Rahmania, Oscillation of Metropolis-Hastings and simulated annealing algorithms around LASSO estimator, Math. Comput. Simulation (2015).
[7] A. Erdélyi, Higher Transcendental Functions, California Institute of Technology, Bateman Manuscript Project, vol. 2, p. 119 (1953).
[8] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, Least angle regression, Ann. Statist. 32 (2) (2004).
[9] G. Fort, S. Le Corff, E. Moulines, A. Schreck, A shrinkage-thresholding Metropolis adjusted Langevin algorithm for Bayesian variable selection, arXiv: v3 [math.ST] (2015).
[10] R.D. Gordon, Values of Mills' ratio of area to bounding ordinate of the normal probability integral for large values of the argument, Annals of Mathematical Statistics, vol. 12 (1941).
[11] B. Klartag, V.D. Milman, Geometry of log-concave functions and measures, Geom. Dedicata 112 (1) (2005).
[12] M. Mendoza, A. Allegra, T.P. Coleman, Bayesian Lasso Posterior Sampling Via Parallelized Measure Transport, arXiv:181.16v1 [stat.CO] (2018).
[13] K. Mengersen, R.L. Tweedie, Rates of convergence of the Hastings and Metropolis algorithms, Ann. Statist. 24 (1996).
[14] N. Parikh, S. Boyd, Proximal algorithms, Found. Trends Optim. 1 (3) (2013).
[15] M. Pereyra, Proximal Markov chain Monte Carlo algorithms, Stat. Comput. (2015).
[16] N.M. Temme, Numerical and asymptotic aspects of parabolic cylinder functions, Journal of Computational and Applied Mathematics (2000).
[17] G.O. Roberts, R.L. Tweedie, Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms, Biometrika 83 (1) (1996).
[18] R. Tibshirani, Regression shrinkage and selection via the Lasso, J. R. Stat. Soc. Ser. B Stat. Methodol. 58 (1) (1996).
[19] R. Tibshirani, The Lasso problem and uniqueness, Electron. J. Stat. 7 (2013).
