Regularization with variance-mean mixtures


1 Regularization with variance-mean mixtures. Nick Polson (University of Chicago) and James Scott (University of Texas at Austin). Workshop on Sensing and Analysis of High-Dimensional Data, Duke University, July 2011.

2 Models: robust regression, logit models, multinomial logit models, extreme-value models, support vector machines, topic models, restricted Boltzmann machines, neural networks, autologistic models, penalized additive models. Priors/penalties: Student t, lasso/ridge/bridge, MC+, group lasso, normal/exponential-gamma, normal/gamma, generalized double Pareto, normal/inverted-beta, normal/inverse-Gaussian, meridian filter.

3 Models: robust regression, logit models, multinomial logit models, extreme-value models, support vector machines, topic models, restricted Boltzmann machines, neural networks, autologistic models, penalized additive models. Priors/penalties: Student t, lasso/ridge/bridge, MC+, group lasso, normal/exponential-gamma, normal/gamma, generalized double Pareto, normal/inverted-beta, normal/inverse-Gaussian, meridian filter. Our approach works for arbitrary combinations of likelihood with prior. No matrix inversion; no numerical derivatives. Fully parallelizable block updates.

5 But I can solve all of these with an $\ell_1$ constraint. True enough. But consider an example: p = 25, n = 500, 1000 simulated data sets. $y_i = 1\{z_i > 0\}$ for $i = 1, \dots, n$, $z \sim N(X\beta, I)$, and $\beta$ has five entries equal to 5 and twenty equal to 0. (Results table: median and mean SSE for MLE, Lasso-UT, Lasso-CV, HS.)

6 But I can solve all of these with an $\ell_1$ constraint. True enough. But consider an example: p = 25, n = 500, 1000 simulated data sets. $y_i = 1\{z_i > 0\}$ for $i = 1, \dots, n$, $z \sim N(X\beta, I)$, and $\beta$ has five entries equal to 5 and twenty equal to 0 (an r-spike signal). (Results table: median and mean SSE for MLE, Lasso-UT, Lasso-CV, HS.)
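
A minimal sketch of one draw from this simulation design in Python/numpy. The five-versus-twenty spike pattern, the thresholding rule, and the sample sizes are taken from the slide; the standard-normal design matrix and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 25
beta = np.concatenate([np.full(5, 5.0), np.zeros(20)])   # r-spike signal: 5 (x5), 0 (x20)

X = rng.standard_normal((n, p))          # design matrix (assumed standard normal)
z = X @ beta + rng.standard_normal(n)    # z ~ N(X beta, I)
y = (z > 0).astype(int)                  # y_i = 1{z_i > 0}
```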

7

8

9 There seems to be a precise separation of the computationally nice penalties/priors (e.g. the lasso) from the statistically nice priors (e.g. Cauchy, horseshoe). Our contribution: an algorithm for many priors in this second class, when used in conjunction with many common non-Gaussian likelihoods.

10 There seems to be a precise separation of the computationally nice penalties/priors (e.g. the lasso) from the statistically nice priors (e.g. Cauchy, horseshoe; PS, 2011; Fan and Li, 2001; Pericchi and Smith, 1992; Masreliez, 1975; Brown, 1971; etc.). Our contribution: an algorithm for many priors in this second class, when used in conjunction with many common non-Gaussian likelihoods.

11 A teaser example: logit with a bridge penalty.
$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \log(1 + \exp\{-y_i x_i^T \beta\}) + \sum_{j=1}^p |\beta_j/(\tau s_j)|^\alpha$

12 A teaser example: logit with a bridge penalty.
$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n \log(1 + \exp\{-y_i x_i^T \beta\}) + \sum_{j=1}^p |\beta_j/(\tau s_j)|^\alpha$
ouch.

13 To find the MAP, just iterate three steps:
$\beta^{(g+1)} = \left(\tau^{-2} S^{-1} \hat\Lambda^{-1(g)} + X^T \hat\Omega^{-1(g)} X\right)^{-1} \tfrac{1}{2} X^T 1$
$\hat\omega_i^{-1(g+1)} = \frac{1}{z_i^{(g)}}\left(\frac{e^{z_i^{(g)}}}{1 + e^{z_i^{(g)}}} - \frac{1}{2}\right)$
$\hat\lambda_j^{-1(g+1)} = \alpha (\tau s_j)^{2-\alpha}\, |\beta_j^{(g)}|^{\alpha - 2}$

14 To find the MAP, just iterate three steps. (Don't actually do this.)
$\beta^{(g+1)} = \left(\tau^{-2} S^{-1} \hat\Lambda^{-1(g)} + X^T \hat\Omega^{-1(g)} X\right)^{-1} \tfrac{1}{2} X^T 1$
$\hat\omega_i^{-1(g+1)} = \frac{1}{z_i^{(g)}}\left(\frac{e^{z_i^{(g)}}}{1 + e^{z_i^{(g)}}} - \frac{1}{2}\right)$
$\hat\lambda_j^{-1(g+1)} = \alpha (\tau s_j)^{2-\alpha}\, |\beta_j^{(g)}|^{\alpha - 2}$
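
A literal numpy transcription of the three updates for the logit/bridge teaser, with s_j = 1, a small floor to keep the moments finite near zero, and toy data; all of those choices are assumptions, and, as the slide warns, forming and solving the dense p x p system directly is exactly what the conjugate-gradient version later avoids.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(1)
n, p, tau, alpha = 200, 10, 1.0, 0.5
X = rng.standard_normal((n, p))
y = np.where(X[:, 0] - X[:, 1] + rng.standard_normal(n) > 0, 1.0, -1.0)   # labels in {-1, +1}

Xt = y[:, None] * X              # rows y_i x_i, so z_i = y_i x_i' beta = (Xt @ beta)_i
beta = np.full(p, 0.01)
eps = 1e-8

for _ in range(200):
    z = Xt @ beta
    zs = np.where(np.abs(z) < eps, eps, z)
    # E step: conditional moments of the latent precisions
    omega_inv = (expit(zs) - 0.5) / zs                                          # hat{omega}_i^{-1}
    lam_inv = alpha * tau**(2 - alpha) * np.maximum(np.abs(beta), eps)**(alpha - 2)  # hat{lambda}_j^{-1}
    # M step: beta = (tau^{-2} Lambda^{-1} + Xt' Omega^{-1} Xt)^{-1} (1/2) Xt' 1
    A = np.diag(lam_inv / tau**2) + Xt.T @ (omega_inv[:, None] * Xt)
    beta = np.linalg.solve(A, 0.5 * Xt.T @ np.ones(n))
```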

15 Example: covariate-dependent disease networks. ~112m patient records comprising ICD-9 codes + prescriptions + covariates. (ICD-9 excerpt shown on the slide: ISCHEMIC HEART DISEASE ( ); 410 Acute myocardial infarction: of anterolateral wall, of other anterior wall, of inferolateral wall, of inferoposterior wall, of other inferior wall, of other lateral wall, true posterior wall infarction, subendocardial infarction, of other specified sites, unspecified site.) Our approach: a tree of graphs. Tree splits on covariates, mainly demographics and geography. Each terminal node is a disease network for a subgroup of the population.

16 Example: covariate-dependent disease networks. ~112m patient records comprising ICD-9 codes + prescriptions + covariates. (ICD-9 excerpt shown on the slide: ISCHEMIC HEART DISEASE ( ); 410 Acute myocardial infarction: of anterolateral wall, of other anterior wall, of inferolateral wall, of inferoposterior wall, of other inferior wall, of other lateral wall, true posterior wall infarction, subendocardial infarction, of other specified sites, unspecified site.) Our approach: a tree of graphs. Goh et al., PNAS (2007). Tree splits on covariates, mainly demographics and geography. Each terminal node is a disease network for a subgroup of the population.

17 The big picture. We use normal variance-mean mixtures to represent a wide class of objective functions commonly encountered in high-dimensional problems. By modern Bayesian standards, these are pretty simple. But they are ubiquitous, useful, and often the only tractable approach in their respective domains for data sets beyond a certain size.

18 This class is surprisingly broad. For example:
$p(\alpha_{1:K}, \beta_{1:D} \mid W, \mu, \Sigma) \propto \left[\prod_{d=1}^{D} p(\beta_d \mid \mu, \Sigma) \prod_{n=1}^{N_d} \sum_{k=1}^{K} \frac{e^{\beta_{dk}}}{\sum_{l=1}^{K} e^{\beta_{dl}}}\, \alpha_{k, w_n}\right] p(\alpha_{1:K})$
We get an exact MAP estimate. No variational approximation.

19 This class is surprisingly broad. For example (e.g. Blei and Lafferty, 2009):
$p(\alpha_{1:K}, \beta_{1:D} \mid W, \mu, \Sigma) \propto \left[\prod_{d=1}^{D} p(\beta_d \mid \mu, \Sigma) \prod_{n=1}^{N_d} \sum_{k=1}^{K} \frac{e^{\beta_{dk}}}{\sum_{l=1}^{K} e^{\beta_{dl}}}\, \alpha_{k, w_n}\right] p(\alpha_{1:K})$
We get an exact MAP estimate. No variational approximation.

20 Connection with previous work.
Penalties/priors corresponding to scale mixtures: lasso (Tibshirani, 1996; Park and Casella, 2008; Hans, 2009); bridge estimators (West, 1987; Huang et al., 2008); relevance vector machines (Tipping, 2001); normal/Jeffreys (Figueiredo, 2003; Bae and Mallick, 2004); normal/exponential-gamma (Griffin and Brown, 2005); normal/inverse-Gaussian (Caron and Doucet, 2008); normal/gamma (Griffin and Brown, 2010); horseshoe/inverted-beta prior (CPS, 2010; Polson and Scott, 2011); double Pareto (Armagan et al., 2010).
Algorithms for regularized regression: LARS (Efron et al., 2004); LQA (Fan and Li, 2001) and LLA (Zou and Li, 2008); EM/ECME (Dempster et al., 1977; Meng and Rubin, 1993; many others); MM (Hunter and Lange, 2000; Taddy, 2010); MCMC for support vector machines (Polson and Steve Scott, 2011); MCMC for logistic regression (Gramacy and Polson; Holmes and Held; Steve Scott; SFS; others).
Distributional theory based on variance-mean mixtures: Z distributions (Fisher, 1923); generalized inverse-Gaussian distributions (Barndorff-Nielsen, 1977); penalties, priors, and Lévy processes (Polson and Scott, 2011).

21 The standard scale-mixture trick: $p(z) = \int_0^\infty \phi(z \mid 0, v)\, p(v)\, dv$

22 $\hat\beta = \arg\min_\beta \|y - X\beta\|^2 + \nu \sum_{j=1}^p g(\beta_j)$
$(y \mid \beta) \sim N(X\beta, \sigma^2 I)$
$(\beta_i \mid \tau^2, \lambda_i^2) \sim N(0, \tau^2 \lambda_i^2)$
$\lambda_i^2 \sim \pi(\lambda_i^2), \qquad (\tau^2, \sigma^2) \sim \pi(\tau^2, \sigma^2)$
$(\beta \mid y, \tau^2, \Lambda, \sigma^2) \sim N(\hat\beta, \hat\Sigma_\beta), \qquad \hat\beta = (\tau^{-2}\Lambda^{-1} + X^T X)^{-1} X^T y$

23 $\hat\beta = \arg\min_\beta \|y - X\beta\|^2 + \nu \sum_{j=1}^p g(\beta_j)$
$(y \mid \beta) \sim N(X\beta, \sigma^2 I)$
$(\beta_i \mid \tau^2, \lambda_i^2) \sim N(0, \tau^2 \lambda_i^2)$
$\lambda_i^2 \sim \pi(\lambda_i^2), \qquad (\tau^2, \sigma^2) \sim \pi(\tau^2, \sigma^2)$
Exponential → lasso; inverted-beta → horseshoe; gamma → normal/gamma (Andrews and Mallows; West).
$(\beta \mid y, \tau^2, \Lambda, \sigma^2) \sim N(\hat\beta, \hat\Sigma_\beta), \qquad \hat\beta = (\tau^{-2}\Lambda^{-1} + X^T X)^{-1} X^T y$
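
The "Exponential → lasso" line can be checked numerically: mixing N(0, v) over v ~ Exp(rate a²/2) returns the Laplace density (a/2)e^{-a|β|}. The rate parametrization and the test points below are assumptions chosen so the identity holds exactly.

```python
import numpy as np
from scipy import integrate, stats

a = 1.5                                  # Laplace rate (assumed for the check)
rate = a**2 / 2                          # exponential mixing rate on the variance

def mixture_density(beta):
    # integrate N(beta | 0, v) * Exp(v; rate) over v > 0
    f = lambda v: stats.norm.pdf(beta, 0.0, np.sqrt(v)) * rate * np.exp(-rate * v)
    return integrate.quad(f, 0.0, np.inf)[0]

for b in (-2.0, -0.5, 0.5, 1.0, 3.0):
    print(f"beta={b:+.1f}  mixture={mixture_density(b):.6f}  laplace={(a/2)*np.exp(-a*abs(b)):.6f}")
```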

24 The standard scale-mixture trick: $p(z) = \int_0^\infty \phi(z \mid 0, v)\, p(v)\, dv$

25 The standard scale-mixture trick: $p(z) = \int_0^\infty \phi(z \mid 0, v)\, p(v)\, dv$
Two ways of generalizing this: (i) the Lévy representation; (ii) variance-mean mixtures.

26 1) The Lévy representation. Let $\psi(t)$, $t > 0$, be a nonnegative real-valued, totally monotone function such that $\lim_{t \to 0} \psi(t) = 0$.
Part A: Suppose that these conditions are met for $t = f(\beta_j)$. Then the prior distribution $p(\beta_j \mid s) \propto \exp\{-s\,\psi[f(\beta_j)]\}$, where $s > 0$, is the moment-generating function of a subordinator $T(s)$, evaluated at $f(\beta_j)$, whose Lévy measure satisfies
$\psi(t) = \int_0^\infty \{1 - \exp(-tx)\}\, \mu(dx). \quad (1)$
Part B: Suppose that these conditions are met for $t = \beta_j^2/2$. Then $p(\beta_j \mid s) \propto \exp\{-s\,\psi(\beta_j^2/2)\}$, where $s > 0$, is a mixture of normals given by
$p(\beta_j \mid s) \propto \int_0^\infty N(\beta_j \mid 0, T^{-1})\, T^{-1/2}\, p(T)\, dT$,
where $p(T)$ is the density of the subordinator $T$, observed at time $s$, whose Lévy measure $\mu(dx)$ satisfies (1).

27 1) The Lévy representation. Let $\psi(t)$, $t > 0$, be a nonnegative real-valued, totally monotone function such that $\lim_{t \to 0} \psi(t) = 0$.
Part A: Suppose that these conditions are met for $t = f(\beta_j)$. Then the prior distribution $p(\beta_j \mid s) \propto \exp\{-s\,\psi[f(\beta_j)]\}$, where $s > 0$, is the moment-generating function of a subordinator $T(s)$, evaluated at $f(\beta_j)$, whose Lévy measure satisfies
$\psi(t) = \int_0^\infty \{1 - \exp(-tx)\}\, \mu(dx). \quad (1)$
Part B: Suppose that these conditions are met for $t = \beta_j^2/2$. Then $p(\beta_j \mid s) \propto \exp\{-s\,\psi(\beta_j^2/2)\}$, where $s > 0$, is a mixture of normals given by
$p(\beta_j \mid s) \propto \int_0^\infty N(\beta_j \mid 0, T^{-1})\, T^{-1/2}\, p(T)\, dT$,
where $p(T)$ is the density of the subordinator $T$, observed at time $s$, whose Lévy measure $\mu(dx)$ satisfies (1).
(Super useful for block-wise penalties, e.g. the group lasso.)

28 2) Mean-variance mixtures:
$p(z) = \int_0^\infty \phi(z \mid 0, v)\, p(v)\, dv$
$p(z) = \int_0^\infty \phi(z \mid \mu + \kappa v, v)\, p(v)\, dv$
(like Brownian motion with a drift; useful for non-Gaussian likelihoods)

29 The generic regularization problem:
$Q(\beta) = \sum_{i=1}^n f(y_i, x_i^T\beta) + \sum_{j=1}^p g\!\left(\frac{\beta_j}{\tau s_j}\right)$
We consider properties of the posterior distribution
$p(\beta \mid \tau, y) \propto e^{-Q(\beta)} = \left\{\prod_{i=1}^n p(z_i \mid x_i^T\beta)\right\} \left\{\prod_{j=1}^p p(\beta_j \mid \tau)\right\} = p(z \mid \beta)\, p(\beta \mid \tau)$,
where $p(z_i \mid x_i^T\beta) \propto \exp\{-f(y_i, x_i^T\beta)\}$, $p(\beta_j \mid \tau) \propto \exp\{-g(\beta_j/\tau s_j)\}$, and $z_i = y_i - x_i^T\beta$ for regression, or $z_i = y_i x_i^T\beta$ for classification.

30 The generic regularization problem. Suppose that both the likelihood and prior/penalty can be represented as normal variance-mean mixtures:
$p(z_i \mid \beta) = \int_0^\infty \phi(z_i \mid \mu_z + \kappa_z \omega_i,\, \sigma^2 \omega_i)\, dP(\omega_i)$
$p(\beta_j \mid \tau) = \int_0^\infty \phi(\beta_j \mid \mu_\beta + \kappa_\beta \lambda_j,\, \tau^2 s_j^2 \lambda_j)\, dP(\lambda_j)$.

31 Then we have an easy EM.
E step: compute the expected value of the log posterior, given the current parameter estimate:
$Q(\beta \mid \beta^{(g)}) = \int \log p(\beta \mid \omega, \lambda, \tau, y)\, p(\omega, \lambda \mid \beta^{(g)}, \tau, y)\, d\omega\, d\lambda$.
M step: maximize the complete-data posterior to update the parameter estimate:
$\beta^{(g+1)} = \arg\max_\beta Q(\beta \mid \beta^{(g)})$.

32 The E step. The observed-data posterior is
$p(\beta \mid \tau, y) = \int \pi(\beta \mid \omega, \lambda, y)\, p(\omega, \lambda \mid y, \tau)\, d\omega\, d\lambda$.
Exploiting the variance-mean mixture representation, the complete-data log posterior is
$\log p(\beta \mid \omega, \lambda, \tau, y) = c_0(\omega, \lambda, y, \tau) - \frac{1}{2}\sum_{i=1}^n \frac{\omega_i^{-1}}{\sigma^2}\left(z_i - \mu_z - \kappa_z \omega_i\right)^2 - \frac{1}{2}\sum_{j=1}^p \frac{\lambda_j^{-1}}{s_j^2 \tau^2}\left(\beta_j - \mu_\beta - \kappa_\beta \lambda_j\right)^2$.
This can be shown to depend linearly upon the conditional moments $\{\hat\omega_i^{-1}\}$ and $\{\hat\lambda_j^{-1}\}$. Therefore the E step is simply to plug these conditional expected values into the complete-data log posterior.

33 The E step. Theorem: these conditional moments are
$(\beta_j - \mu_\beta)\,\hat\lambda_j^{-1(g)} = \kappa_\beta + \tau^2 s_j^2\, g'(\beta_j \mid \tau)$,
$(z_i - \mu_z)\,\hat\omega_i^{-1(g)} = \kappa_z + \sigma^2 f'(z_i \mid \beta)$.
Key fact: we don't need the conditional posterior for the latent variances. We only need the functional form of the likelihood (f) and prior (g). These are pre-specified.

34 Tilted, iteratively re-weighted least squares.
E step: given a current estimate $\beta = \beta^{(g)}$, compute the conditional moments of the latent variances as
$(\beta_j^{(g)} - \mu_\beta)\,\hat\lambda_j^{-1(g)} = \kappa_\beta + \tau^2 s_j^2\, g'(\beta_j^{(g)} \mid \tau)$,
$(z_i^{(g)} - \mu_z)\,\hat\omega_i^{-1(g)} = \kappa_z + \sigma^2 f'(z_i^{(g)} \mid \beta)$.
M step: for regression, compute $\beta^{(g+1)}$ as
$\beta^{(g+1)} = \left(\tau^{-2} S^{-1} \hat\Lambda^{-1(g)} + X^T \hat\Omega^{-1(g)} X\right)^{-1} X^T \left\{\hat\Omega^{-1(g)}(y - \mu_z 1) - \kappa_z 1\right\}$.
For classification, compute $\beta^{(g+1)}$ as
$\beta^{(g+1)} = \left(\tau^{-2} S^{-1} \hat\Lambda^{-1(g)} + X^T \hat\Omega^{-1(g)} X\right)^{-1} X^T \left\{\hat\Omega^{-1(g)} \mu_z 1 + \kappa_z 1\right\}$.
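
A concrete instance of the regression branch: a Gaussian likelihood written as ½‖y − Xβ‖² (so μ_z = κ_z = 0 and ω̂_i^{-1} ≡ 1) paired with a lasso penalty Σ_j |β_j|/τ, for which the prior-side weight τ^{-2}λ̂_j^{-1} reduces to 1/(τ|β_j|). The synthetic data, s_j = 1, and the small floor on |β_j| are assumptions; this is the familiar normal/exponential EM for the lasso MAP written in the slide's notation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, tau = 100, 8, 0.5
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(n)

XtX, Xty, eps = X.T @ X, X.T @ y, 1e-8
beta = np.linalg.lstsq(X, y, rcond=None)[0]      # start at least squares

for _ in range(500):
    # E step: prior weight tau^{-2} * hat{lambda}_j^{-1} = 1 / (tau * |beta_j|)
    prior_prec = 1.0 / (tau * np.maximum(np.abs(beta), eps))
    # M step: beta = (tau^{-2} Lambda^{-1} + X'X)^{-1} X'y, since hat{omega}_i^{-1} = 1
    beta = np.linalg.solve(np.diag(prior_prec) + XtX, Xty)

print(np.round(beta, 3))
```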

35 This works for a broad class of models.
Likelihoods (f(z_i | β), κ_z, μ_z, p(ω_i)):
  Squared-error: z_i², 0, 0, ω_i ≡ 1
  Absolute-error: |z_i|, 0, 0, exponential
  Check loss: |z_i| + (2q − 1)z_i, 1 − 2q, 0, GIG
  SVM: max(1 − z_i, 0), 1, 1, GIG
  Logistic: log(1 + e^{−z_i}), 1/2, 0, Polya
Penalty functions (g(β_j | τ), κ_β, μ_β, p(λ_j)):
  Ridge: (β_j/τ)², 0, 0, λ_j ≡ 1
  Lasso: |β_j|/τ, 0, 0, exponential
  Bridge: |β_j/τ|^α, 0, 0, stable
  Gen. double Pareto: {(1 + α)/τ} log(1 + |β_j|/ατ), 0, 0, exponential-gamma
Plus multinomial logit, autologistic, topic models, RBMs, extreme-value, ...

36 Two key identities:
$\frac{\alpha^2 - \kappa^2}{2\alpha}\, e^{-\alpha|\theta - \mu| + \kappa(\theta - \mu)} = \int_0^\infty \phi(\theta \mid \mu + \kappa v,\, v)\, p_{\rm gig}\!\left(v \mid 1, 0, \alpha^2 - \kappa^2\right) dv$
$\frac{1}{B(\alpha, \kappa)}\, \frac{e^{\alpha(\theta - \mu)}}{(1 + e^{\theta - \mu})^{2(\alpha - \kappa)}} = \int_0^\infty \phi(\theta \mid \mu + \kappa v,\, v)\, p_{\rm pol}(v \mid \alpha, \alpha - 2\kappa)\, dv$.
Improper versions (treat purely as integral identities):
$a^{-1} \exp\{-2c^{-1}\max(au, 0)\} = \int_0^\infty \phi(u \mid -av,\, cv)\, dv$
$c^{-1} \exp\{-2c^{-1}\rho_q(u)\} = \int_0^\infty \phi(u \mid (1 - 2q)v,\, cv)\, e^{-2q(1-q)v}\, dv$
$(1 + \exp\{u - \mu\})^{-1} = \int_0^\infty \phi(u \mid \mu - \tfrac{1}{2}v,\, v)\, p_{\rm pol}(v \mid 0, 1)\, dv$
where $\rho_q(u) = \tfrac{1}{2}|u| + (q - \tfrac{1}{2})u$ is the check-loss function.
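
The first identity can be checked by quadrature. Reading p_gig(v | 1, 0, α² − κ²) as the exponential density with rate (α² − κ²)/2, which is a parametrization assumption, the two sides agree numerically:

```python
import numpy as np
from scipy import integrate, stats

alpha, kappa, mu = 2.0, 0.7, 0.3
rate = (alpha**2 - kappa**2) / 2          # assumed exponential reading of p_gig(v | 1, 0, alpha^2 - kappa^2)

def lhs(theta):
    return (alpha**2 - kappa**2) / (2 * alpha) * np.exp(-alpha * abs(theta - mu) + kappa * (theta - mu))

def rhs(theta):
    f = lambda v: stats.norm.pdf(theta, mu + kappa * v, np.sqrt(v)) * rate * np.exp(-rate * v)
    return integrate.quad(f, 0.0, np.inf)[0]

for t in (-1.0, 0.0, 0.5, 2.0):
    print(f"theta={t:+.1f}  lhs={lhs(t):.6f}  rhs={rhs(t):.6f}")
```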

37 Back to the teaser example: logit models. A Polya mixing distribution:
$p_{\rm pol}(v \mid \alpha, \alpha - 2\kappa) = \sum_{k=0}^\infty w_k\, e^{-a_k v}$.
The terms in this sum are built from $b = \tfrac{1}{2}(\alpha - \kappa)$ and $\delta = \tfrac{1}{2}(\alpha + \kappa)$: the rates $a_k$ involve $(\delta + k)$, the weights satisfy $w_k = \prod_{j \neq k} \frac{a_j}{a_j - a_k}$, and the coefficients $(-1)^k \frac{(2\delta)(2\delta + 1)\cdots(2\delta + k - 1)}{k!}$ enter along with the normalizing constant $B(\delta + b, \delta - b)$.

38 Back to the teaser example: logit models. A Polya mixing distribution leads to a Z-distribution marginal:
$p_Z(\theta \mid \mu, \alpha, \kappa) = \frac{1}{B(\alpha, \kappa)}\, \frac{(e^{\theta - \mu})^\alpha}{(1 + e^{\theta - \mu})^{2(\alpha - \kappa)}}$.
This family includes the logistic regression model as a limiting, improper case: $(\alpha, \kappa, \mu) = (1, 1/2, 0)$. Our representation theorem gives the relevant conditional moment as
$\hat\omega_i^{-1} = \frac{1}{z_i}\left(\frac{e^{z_i}}{1 + e^{z_i}} - \frac{1}{2}\right), \qquad z_i = y_i x_i^T \beta$.

39 Multinomial logit. The probability that observation $i$ falls into class $k$ is
$\theta_{ki} = P(y_i = k) = \frac{\exp(x_i^T \beta_k)}{\sum_{l=1}^K \exp(x_i^T \beta_l)}$.
To represent the multinomial logit model in our framework, let
$\eta_{ki} = \frac{\exp(x_i^T \beta_k - c_{ki})}{1 + \exp(x_i^T \beta_k - c_{ki})}, \qquad c_{ki}(\beta_{(-k)}) = \log \sum_{l \neq k} \exp\{x_i^T \beta_l\}$
(Holmes and Held, 2006).

40 Multinomial logit. Write the conditional likelihood for category $k$ as
$L(\beta_k \mid \beta_{(-k)}, y) = \prod_{i=1}^n \prod_{l=1}^K \theta_{li}^{\tilde y_{li}} \propto \prod_{i=1}^n \eta_{ki}^{\tilde y_{ki}}\, \{w_i(1 - \eta_{ki})\}^{1 - \tilde y_{ki}} \propto \prod_{i=1}^n \eta_{ki}^{\tilde y_{ki}}\, (1 - \eta_{ki})^{1 - \tilde y_{ki}}$
$= \prod_{i=1}^n \frac{\exp(\gamma_{ki} x_i^T \beta_k - \gamma_{ki} c_{ki})}{1 + \exp(\gamma_{ki} x_i^T \beta_k - \gamma_{ki} c_{ki})} = \prod_{i=1}^n \int_0^\infty \phi(z_{ki} \mid \mu_{ki} + \kappa \xi_{ki},\, \xi_{ki})\, p_{\rm PY}(\xi_{ki} \mid 1, 0)\, d\xi_{ki}$,
where $\kappa = 1/2$, $z_{ki} = \gamma_{ki} x_i^T \beta_k$, $\mu_{ki} = \gamma_{ki} c_{ki}$, and $p_{\rm PY}(\xi_{ki} \mid 1, 0)$ is a function of $\xi_{ki}$ that is the limit of a Polya density as $b \to 0$.

41 An exact ECM. For iteration $t = 1, 2, \dots$: for $k = 2, \dots, K$, cycle through the following steps.
Update $\Omega_k$:
$z_{ki}^{(t)} := \gamma_{ki}\, x_i^T \beta_k^{(t)}$
$\mu_{ki}^{(t)} := \gamma_{ki} \log \sum_{l \neq k} \exp(x_i^T \beta_l^{(t)})$
$\omega_{ki}^{(t)} := \frac{1}{z_{ki}^{(t)} - \mu_{ki}^{(t)}}\left\{\frac{\exp(z_{ki}^{(t)} - \mu_{ki}^{(t)})}{1 + \exp(z_{ki}^{(t)} - \mu_{ki}^{(t)})} - \frac{1}{2}\right\}$
$\Omega_k^{(t)} := \mathrm{diag}(\omega_{k1}^{(t)}, \dots, \omega_{kn}^{(t)})$,
where $\beta_k^{(t)}$ is the current estimate for the $k$th block of coefficients, and $\gamma_{ki} = \pm 1$ is an indicator of whether $y_i = k$.

42 An exact ECM. For iteration $t = 1, 2, \dots$: for $k = 2, \dots, K$, cycle through the following steps.
Update $\Lambda_k$:
$\lambda_{kj}^{(t)} := \tau^2\, \psi'(\beta_{kj}^{(t)} \mid \tau)\, /\, \beta_{kj}^{(t)}$
$\Lambda_k^{(t)} := \mathrm{diag}(\lambda_{k1}^{(t)}, \dots, \lambda_{kp}^{(t)})$,
where $\psi'$ is the derivative of the penalty function $\psi(\beta_{kj} \mid \tau)$ with respect to $\beta_{kj}$.

43 An exact ECM. For iteration $t = 1, 2, \dots$: for $k = 2, \dots, K$, cycle through the following steps.
Update $\beta_k$: solve the linear system $A_k^{(t)} \beta_k^{(t+1)} = b_k^{(t)}$ for $\beta_k^{(t+1)}$, with
$A_k^{(t)} := \tau^{-2} \Lambda_k^{(t)} + \tilde X_k^T \Omega_k^{(t)} \tilde X_k$
$b_k^{(t)} := \tilde X_k^T \left(\Omega_k^{(t)} \mu_k^{(t)} + \tfrac{1}{2} 1\right)$,
where $\tilde X_k$ is the $n \times p$ matrix having rows $\tilde x_i = \gamma_{ki} x_i$, $1$ is a column vector of ones, and $\mu_k^{(t)} = (\mu_{k1}^{(t)}, \dots, \mu_{kn}^{(t)})^T$.

44 An exact ECM. For iteration $t = 1, 2, \dots$: for $k = 2, \dots, K$, cycle through the following steps.
Update $\beta_k$: solve the linear system $A_k^{(t)} \beta_k^{(t+1)} = b_k^{(t)}$ for $\beta_k^{(t+1)}$, with
$A_k^{(t)} := \tau^{-2} \Lambda_k^{(t)} + \tilde X_k^T \Omega_k^{(t)} \tilde X_k$
$b_k^{(t)} := \tilde X_k^T \left(\Omega_k^{(t)} \mu_k^{(t)} + \tfrac{1}{2} 1\right)$,
where $\tilde X_k$ is the $n \times p$ matrix having rows $\tilde x_i = \gamma_{ki} x_i$, $1$ is a column vector of ones, and $\mu_k^{(t)} = (\mu_{k1}^{(t)}, \dots, \mu_{kn}^{(t)})^T$. Don't solve this exactly.
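
A sketch of one full Ω_k / Λ_k / β_k cycle for a single block, using a ridge penalty ψ(β | τ) = (β/τ)² so that the Λ update is constant (λ_kj = 2). The toy data, the ridge choice, and the direct linear solve, which this slide warns against, are assumptions for illustration.

```python
import numpy as np
from scipy.special import expit, logsumexp

rng = np.random.default_rng(3)
n, p, K, tau = 300, 5, 3, 1.0
X = rng.standard_normal((n, p))
y = rng.integers(0, K, size=n)                      # toy class labels 0, ..., K-1
beta = np.zeros((K, p))                             # class 0 kept at zero as the reference
eps = 1e-8

def update_block(k, beta):
    mask = np.arange(K) != k
    c = logsumexp(X @ beta[mask].T, axis=1)         # c_ki = log sum_{l != k} exp(x_i' beta_l)
    gamma = np.where(y == k, 1.0, -1.0)             # gamma_ki = +/-1 indicator of y_i = k
    z = gamma * (X @ beta[k])
    mu = gamma * c
    d = z - mu
    d = np.where(np.abs(d) < eps, eps, d)
    omega = (expit(d) - 0.5) / d                    # Omega_k weights
    lam = np.full(p, 2.0)                           # ridge: tau^2 psi'(beta | tau) / beta = 2
    Xk = gamma[:, None] * X                         # rows gamma_ki * x_i
    A = np.diag(lam / tau**2) + Xk.T @ (omega[:, None] * Xk)
    b = Xk.T @ (omega * mu + 0.5)
    return np.linalg.solve(A, b)

for t in range(20):                                 # a few ECM sweeps
    for k in range(1, K):
        beta[k] = update_block(k, beta)
```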

45 Tilted, iteratively re-weighted conjugate gradient. In the update step for $\beta$, don't solve the system exactly. Instead, while $\|\Delta_{(t,l)}\| > \delta_{\min}$, increment $l$ and set
$\beta_k^{(t,l)} := \beta_k^{(t,l-1)} + \Delta_{(t,l-1)}$
$r_{(t,l)} := r_{(t,l-1)} - \alpha_{(t,l-1)}\, c_{(t,l-1)}$
$\gamma_{(t,l)} := \frac{r_{(t,l)}^T r_{(t,l)}}{r_{(t,l-1)}^T r_{(t,l-1)}}$
$d_{(t,l)} := r_{(t,l)} + \gamma_{(t,l)}\, d_{(t,l-1)}$
$c_{(t,l)} := A_k^{(t)} d_{(t,l)}$
$\alpha_{(t,l)} := \frac{r_{(t,l)}^T r_{(t,l)}}{d_{(t,l)}^T c_{(t,l)}}$
$\Delta_{(t,l)} := \alpha_{(t,l)}\, d_{(t,l)}$.

46 Tilted, iteratively re-weighted conjugate gradient. In the update step for $\beta$, don't solve the system exactly. Instead, while $\|\Delta_{(t,l)}\| > \delta_{\min}$, increment $l$ and set
$\beta_k^{(t,l)} := \beta_k^{(t,l-1)} + \Delta_{(t,l-1)}$
$r_{(t,l)} := r_{(t,l-1)} - \alpha_{(t,l-1)}\, c_{(t,l-1)}$
$\gamma_{(t,l)} := \frac{r_{(t,l)}^T r_{(t,l)}}{r_{(t,l-1)}^T r_{(t,l-1)}}$
$d_{(t,l)} := r_{(t,l)} + \gamma_{(t,l)}\, d_{(t,l-1)}$
$c_{(t,l)} := A_k^{(t)} d_{(t,l)}$
$\alpha_{(t,l)} := \frac{r_{(t,l)}^T r_{(t,l)}}{d_{(t,l)}^T c_{(t,l)}}$
$\Delta_{(t,l)} := \alpha_{(t,l)}\, d_{(t,l)}$.
Parallelize this.
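
A plain conjugate-gradient loop for A_k β = b_k built only from matrix-vector products: the same recursions as above, written in the usual order. The stopping tolerance and the dense test matrix are assumptions; in the intended use A_k would never be formed explicitly.

```python
import numpy as np

def conjugate_gradient(A_mv, b, beta0, delta_min=1e-10, max_iter=1000):
    """Solve A beta = b given only the matrix-vector product A_mv(v)."""
    beta, r = beta0.copy(), b - A_mv(beta0)
    d = r.copy()
    for _ in range(max_iter):
        c = A_mv(d)                                   # c_(t,l) = A d_(t,l)
        step_size = (r @ r) / (d @ c)                 # alpha_(t,l)
        step = step_size * d                          # Delta_(t,l)
        beta = beta + step
        if np.linalg.norm(step) <= delta_min:
            break
        r_new = r - step_size * c
        gamma = (r_new @ r_new) / (r @ r)
        d = r_new + gamma * d
        r = r_new
    return beta

# toy check on a random symmetric positive-definite system
rng = np.random.default_rng(4)
M = rng.standard_normal((50, 50))
A = M.T @ M + 50 * np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(lambda v: A @ v, b, np.zeros(50))
print(np.allclose(A @ x, b, atol=1e-6))
```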

47 Other methods. You can solve many of these problems using tailored methods, but not all of them, and rarely this simply or efficiently. LARS: efficient only in Gaussian models. Coordinate descent: can get stuck; may require numerical derivatives; irredeemably serial. LLA/LQA: exact only in Gaussian models (where it's a special case of our method). Variational methods: never exact; sometimes inconsistent; poor in cases of interesting a posteriori dependence.

48 How would you do MCMC in this class? Some people have worked out special cases. Here's an interesting fact:
$p_{\rm pol(\frac{1}{2},\frac{1}{2})}(\lambda) = \sum_{n=0}^\infty (-1)^n \left(n + \tfrac{1}{2}\right) e^{-\frac{1}{2}(n + \frac{1}{2})^2 \lambda}$
$C \stackrel{D}{=} \frac{1}{4\pi^2 \lambda}, \qquad p(C) = 4\pi^2 \sum_{n=0}^\infty (-1)^n \left(n + \tfrac{1}{2}\right) e^{-2(n + \frac{1}{2})^2 \pi^2 C}$
$C \stackrel{D}{=} \frac{1}{2\pi^2} \sum_{n=1}^\infty \frac{\epsilon_n}{(n - \frac{1}{2})^2}, \qquad \epsilon_n \sim \mathrm{Exp}(1)$.

49 How would you do MCMC in this class? There is a duality between the scale-mixture and MGF representations for $C$:
$E\!\left(e^{-\frac{x^2}{2} C}\right) = E\!\left\{(2\pi C)^{-1/2}\, e^{-\frac{x^2}{8\pi^2 C}}\right\} = \frac{1}{\cosh\frac{x}{2}}$.
A logit-type likelihood would thus look like
$\frac{2^{a+b}\, e^{ax}}{(1 + e^x)^{a+b}} = e^{\frac{1}{2}(a - b)x} \left(\frac{1}{\cosh\frac{x}{2}}\right)^{a+b} = e^{\frac{1}{2}(a - b)x} \left\{E\!\left(e^{-\frac{x^2}{2} C}\right)\right\}^{a+b}$.

50 How would you do MCMC in this class? Therefore we can simulate from $C$ using Polya-Gamma distributions:
$(C \mid x) \stackrel{D}{=} \sum_{n=1}^\infty a_n\, g_n, \qquad g_n \sim \mathrm{Ga}(a + b,\, 1), \qquad a_n = \left[2\pi^2 \left(n - \tfrac{1}{2}\right)^2 \left\{1 + \frac{x^2}{4(n - \frac{1}{2})^2 \pi^2}\right\}\right]^{-1}$.
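
A sketch of drawing (C | x) by truncating the sum at N terms, under the reconstruction above. The weights a_n and the Ga(a + b, 1) innovations follow the Polya-Gamma sum representation, while the truncation level and the identification with the slide's C are assumptions; exact samplers exist and would be preferred in practice.

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_C_given_x(x, a, b, n_terms=200):
    """Truncated draw from the infinite-sum representation of (C | x)."""
    n = np.arange(1, n_terms + 1)
    a_n = 1.0 / (2 * np.pi**2 * (n - 0.5)**2 * (1 + x**2 / (4 * (n - 0.5)**2 * np.pi**2)))
    g_n = rng.gamma(shape=a + b, scale=1.0, size=n_terms)     # Ga(a + b, 1) terms
    return float(np.sum(a_n * g_n))

x = 1.3
draws = [draw_C_given_x(x, a=1.0, b=0.0) for _ in range(2000)]
# compare with the Polya-Gamma mean (a + b) * tanh(x/2) / (2x)
print(np.mean(draws), np.tanh(x / 2) / (2 * x))
```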

51 Models: robust regression, logit models, multinomial logit models, extreme-value models, support vector machines, topic models, restricted Boltzmann machines, neural networks, autologistic models, penalized additive models. Priors/penalties: Student t, ridge, lasso, bridge, normal/exponential-gamma, normal/gamma, generalized-t (double Pareto), normal/inverted-beta, normal/inverse-Gaussian, meridian filter. The theory of variance-mean mixtures allows MAP estimation for arbitrary combinations of model (left) with prior or penalty (right). With the conjugate-gradient version, there are no matrix inverses and no numerical derivatives.

52 Three papers: Sparse Bayes estimation in non-Gaussian models via data augmentation; Sparse multinomial logistic regression via nonconcave penalized likelihood; Exact MAP estimation in logistic-normal topic models.
