Bayesian learning of sparse factor loadings
1: Magnus Rattray, School of Computer Science, University of Manchester. Bayesian Research Kitchen, Ambleside, September 6th 2008.
2: Talk Outline
- Brief overview of popular sparsity priors
- Example application: sparse Bayesian factor analysis for network inference
- Theory for average-case performance with the mixture prior
- Theory and MCMC results for different data distributions
- Comparison with the L1 prior
3-7: Sparsity priors tend to be convenient rather than realistic, e.g.
- L1: p(w_i) = (λ/2) exp(-λ|w_i|)
- ARD: p(w_i) = N(w_i | 0, λ_i^{-1})
- Mixture: p(w_i) = (1 - C) δ(w_i) + C N(w_i | 0, λ^{-1})
Is it OK not to worry about whether these actually fit the data?
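The contrast between these priors is easy to illustrate numerically. The sketch below (plain NumPy; the values of λ and C and the Gamma hyper-prior on the ARD precisions are illustrative assumptions, not values from the talk) draws samples from each prior. Only the spike-and-slab mixture puts positive probability on exactly zero weights.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, C = 4.0, 0.3
n = 100_000

# L1 (Laplace) prior: p(w) = (lam/2) * exp(-lam * |w|)
# NumPy's Laplace uses scale b with density exp(-|x|/b)/(2b), so b = 1/lam
w_l1 = rng.laplace(loc=0.0, scale=1.0 / lam, size=n)

# ARD prior: each weight is Gaussian with its own precision lam_i
# (here the precisions are drawn from a hypothetical Gamma hyper-prior)
lam_i = rng.gamma(shape=2.0, scale=2.0, size=n)
w_ard = rng.normal(0.0, 1.0 / np.sqrt(lam_i))

# Spike-and-slab mixture: exact zeros with probability 1 - C
slab = rng.random(n) < C
w_mix = np.where(slab, rng.normal(0.0, 1.0 / np.sqrt(lam), size=n), 0.0)

# Fraction of exact zeros: ~0 for L1, ~1 - C for the mixture
print((w_l1 == 0).mean(), (w_mix == 0).mean())
```

The L1 prior concentrates mass near zero but almost surely produces no exact zeros; sparsity under it appears only at the posterior mode, whereas the mixture prior is sparse in the prior itself.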
8: Example application: factor analysis for network inference

Y ~ N(WZ + μ, Ψ)
- Y = [y_in]: log-expression of gene i in sample n
- Z = [z_jn]: log-concentration (or "activity") of TF j in sample n
- W = [w_ij]: factor loading, the effect of TF j on gene i

Model Z as a latent variable, since mRNA data may not capture TF protein level/activity, and TFs may be too weakly expressed. W has one row per gene and one column per TF, so for e.g. yeast it is large. Various methods exist for inferring a sparse W from Y (reviewed by Pournara and Wernisch, BMC Bioinformatics 2007).
9: The mixture prior leads to a tractable Gibbs sampler:

p(w_ij) = (1 - C_ij) δ(w_ij) + C_ij N(w_ij | 0, λ^{-1})

Hyper-parameters C_ij ∈ [0, 1] can be obtained from e.g. ChIP-chip data or DNA motifs (Sabatti and James, Bioinformatics 2006), or we can estimate (grouped) hyper-parameters by MCMC.
10-11: Gibbs sampler

Write w_ij = x_ij b_ij where x_ij ∈ {0, 1} and b_ij ~ N(0, λ^{-1}). Sample in turn:

x_j ~ p(x_j | X \ x_j, Z, Y)   (1)
B ~ p(B | X, Z, Y)             (2)
Z ~ p(Z | X, B, Y)             (3)

- Integrate out B before sampling X in (1)
- Steps (2) and (3) are more efficient when X is typically sparse
- Can also sample the hyper-parameters C_ij and λ if required
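A minimal single-factor version of such a sampler can be sketched as follows. This is an illustrative reconstruction, not the talk's implementation: the noise level psi, the block update of all indicators given z, and the toy-data settings are simplifying assumptions. As on the slide, the indicators x_i are sampled with the slab weights b_i integrated out.

```python
import numpy as np

def gibbs_sparse_factor(Y, C=0.1, lam=0.25, psi=1.0, n_iter=300, seed=0):
    """One-factor spike-and-slab Gibbs sampler for Y ~ N(w z^T, psi I),
    with w_i = x_i * b_i, x_i ~ Bern(C), b_i ~ N(0, 1/lam), z_n ~ N(0, 1).
    Rows are conditionally independent given z, so all x_i are updated
    jointly in one valid Gibbs step."""
    rng = np.random.default_rng(seed)
    N, M = Y.shape
    z = rng.normal(size=M)
    for _ in range(n_iter):
        # Sample x_i | z, Y with b_i integrated out (Bayes-factor form)
        s = z @ z / psi + lam                  # posterior precision of b_i
        m = (Y @ z) / psi                      # b_i posterior mean is m / s
        log_bf = 0.5 * np.log(lam / s) + m**2 / (2 * s)
        odds = C / (1.0 - C) * np.exp(np.clip(log_bf, -30, 30))
        x = rng.random(N) < odds / (1.0 + odds)
        # Sample b_i | x_i, z, Y (a prior draw where x_i = 0)
        b = np.where(x, m / s + rng.normal(size=N) / np.sqrt(s),
                     rng.normal(0.0, 1.0 / np.sqrt(lam), size=N))
        w = x * b
        # Sample z_n | w, Y with prior z_n ~ N(0, 1)
        sz = w @ w / psi + 1.0
        z = (Y.T @ w) / psi / sz + rng.normal(size=M) / np.sqrt(sz)
    return w, z

# Toy data: 10 of 100 "genes" load on a single factor
rng = np.random.default_rng(1)
w_true = np.zeros(100)
w_true[:10] = 2.0
Y = np.outer(w_true, rng.normal(size=50)) + rng.normal(size=(100, 50))
w_hat, z_hat = gibbs_sparse_factor(Y)
```

Note the usual sign indeterminacy of factor models: (w, z) and (-w, -z) give the same likelihood, so only |w_hat| is compared against the true support.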
12-16: Theory for sparse Bayesian PCA

y_n ~ N(0, σ² I + w w^T)
p(w | C, λ) = Π_{i=1}^N [(1 - C) δ(w_i) + C N(w_i | 0, λ^{-1})]

We study average behaviour over datasets D = {y_1, y_2, ..., y_M} produced by a teacher distribution. The teacher is identical except for a different factorized parameter distribution. We consider two cases:

(1) Same form: p(w_i^t) = (1 - C_t) δ(w_i^t) + C_t N(w_i^t | 0, λ_t^{-1})
(2) Different form: p(w_i^t) = (1 - C_t) δ(w_i^t) + C_t δ(w_i^t - λ_t^{-1/2})
17-20: Compute the marginal likelihood ⟨log p(D | C, λ)⟩_D in the limit N → ∞ with α = M/N held constant, using the replica method. Also compute functions of the mean posterior parameter w*(D):

ρ(w*) = w* · w^t / (‖w*‖ ‖w^t‖)
L(w*) = ⟨log p(y | w*, C, λ)⟩_{y|w^t}

Good agreement with simulations in the most relevant case of small α (the so-called "large p, small n" regime, where the dimension greatly exceeds the number of samples).
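As a concrete check of the ρ definition, one can generate teacher data and measure the overlap of an estimate with w^t. The sketch below is an assumption-laden illustration: it uses the leading PCA eigenvector, suitably scaled, as a stand-in for the posterior mean w*(D), and picks N, α, C_t, λ_t loosely in the spirit of the talk's settings rather than reproducing them.

```python
import numpy as np

rng = np.random.default_rng(2)
N, alpha = 1000, 2.0            # dimension and sample ratio alpha = M/N
M = int(alpha * N)
C_t, lam_t = 0.2, N / 100       # teacher sparsity and slab precision

# Teacher vector drawn from the mixture distribution (case 1)
w_t = np.where(rng.random(N) < C_t,
               rng.normal(0.0, 1.0 / np.sqrt(lam_t), size=N), 0.0)

# Data y_n ~ N(0, I + w_t w_t^T), generated as y_n = z_n w_t + noise
Z = rng.normal(size=M)
Y = np.outer(Z, w_t) + rng.normal(size=(M, N))

# Stand-in estimate w*: leading principal component scaled to the
# teacher norm (the talk uses the posterior mean instead)
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
w_star = Vt[0] * np.linalg.norm(w_t)

# Overlap rho(w*) = w* . w^t / (||w*|| ||w^t||), sign-fixed
rho = abs(w_star @ w_t) / (np.linalg.norm(w_star) * np.linalg.norm(w_t))
print(round(float(rho), 3))
```

In this strong-signal regime the overlap is close to 1; shrinking α or C_t pushes it down towards the transition described on the next slides.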
21-22: Similar replica calculation to Uda and Kabashima (J. Phys. Soc. Japan 74, 2005):

Z(D) = p(D | C, λ) = ∫ dw p(w | C, λ) Π_{n=1}^M p(y_n | w)

(1/N) ⟨log Z(D)⟩_{D,w^t} = (1/N) lim_{n→0} ∂/∂n ⟨Z^n(D)⟩_{D,w^t}
  = α ⟨log p(y | w*(D), C, λ)⟩_{y,D,w^t} + entropic terms

The average case becomes typical for large N due to self-averaging.
23-25: Standard PCA result (C = 1). Learning exhibits phase transitions, e.g. (for α > 1):

ρ(w*) = θ(αT² - 1) √((αT² - 1)/(αT² + T))

where θ(x) is the step function and T = ‖w^t‖² = N C_t λ_t^{-1}. This is consistent with the result for Bayesian PCA with a spherical prior p(w) ∝ δ(‖w‖ - 1) (Reimann et al., J. Phys. A 1996). The only new feature is a 1st-order transition with increasing λ.
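The predicted overlap is easy to tabulate. The sketch below assumes the spiked-covariance form ρ² = (αT² - 1)/(αT² + T) above the retarded-learning transition αT² = 1 and zero below it; this is a reconstruction consistent with the standard PCA result of Reimann et al. (1996), and should be checked against the original slides before being relied on.

```python
import numpy as np

def rho_theory(alpha, T):
    """Overlap between estimate and teacher: zero below the
    retarded-learning transition alpha * T**2 = 1, and
    sqrt((alpha*T^2 - 1) / (alpha*T^2 + T)) above it."""
    above = alpha * T**2 > 1.0
    r2 = np.where(above, (alpha * T**2 - 1.0) / (alpha * T**2 + T), 0.0)
    return np.sqrt(np.clip(r2, 0.0, None))

# Sweep the sample ratio alpha at fixed signal strength T
T = 20.0
for alpha in (0.001, 0.0025, 0.01, 0.1, 1.0):
    print(alpha, round(float(rho_theory(alpha, T)), 3))
```

Below the transition the posterior retains no information about the teacher direction; above it the overlap grows continuously towards 1 as α increases, matching the θ-function structure of the formula.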
26-27: Standard PCA result (C = 1): λ_t = N, N = 5000, M = 25000 (α = 5).

[Figure: theory vs. MCMC for ρ(w*) as a function of λ/λ_t, averaged over 10 data sets.]

Here we will only consider learning away from phase transitions.
28-32: Results for sparse PCA (C < 1). We consider two types of data set distribution:

(1) Same form as prior: p(w_i^t) = (1 - C_t) δ(w_i^t) + C_t N(w_i^t | 0, λ_t^{-1})
(2) Different form: p(w_i^t) = (1 - C_t) δ(w_i^t) + C_t δ(w_i^t - λ_t^{-1/2})

Both give identical performance for standard PCA, and both give identical performance if the sparsity is known.
33-34: Results for data distribution (1): ρ(w*) and L(w*)

p(w_i^t) = (1 - C_t) δ(w_i^t) + C_t N(w_i^t | 0, λ_t^{-1}), with C_t = 0.2, λ = λ_t = N/100, M = 200, N = M/α.

[Figures: theory versus MCMC for ρ(w*) and L(w*) as functions of C, for several values of α (0.15, 0.1, 0.05, ...); shown for a single dataset and averaged over 10 datasets.]
35: Results for data distribution (1): ρ(w*) and L(w*), with C_t = 0.2, λ_t = 20, M = 200, N = 2000 (α = 0.1).

[Figures: ρ(w*) and L(w*) as functions of C and λ/λ_t.]
36: Results for data distribution (1): p(D | C, λ), with C_t = 0.2, λ_t = 20, M = 200, N = 2000 (α = 0.1).

[Figures: theory for log p(D | C, λ), and MCMC for p(C, λ | D) averaged over 20 data sets, as functions of C and λ/λ_t.]
37: Results for data distribution (2): ρ(w*)

p(w_i^t) = (1 - C_t) δ(w_i^t) + C_t δ(w_i^t - λ_t^{-1/2}), with C_t = 0.2, λ = λ_t = N/100, M = 200, N = M/α.

[Figures: delta-mixture versus Gaussian-mixture theory, and theory versus MCMC averaged over 10 datasets, for ρ(w*) as a function of C at α = 0.2, 0.15, 0.1, ...]
38: Results for data distribution (2): ρ(w*) and L(w*), with C_t = 0.2, λ_t = 10, M = 200, N = 1000 (α = 0.2).

[Figures: ρ(w*) and L(w*) as functions of C and λ/λ_t.]
39: Results for data distribution (2): p(D | C, λ), with C_t = 0.2, λ_t = 10, M = 200, N = 1000 (α = 0.2).

[Figures: theory for log p(D | C, λ), and MCMC for p(C, λ | D) averaged over 20 data sets, as functions of C and λ/λ_t.]
40-42: Other priors? Is this problem specific to the mixture prior? Consider the L1 prior (with an additional L2 term):

p(w_i) ∝ exp(-λ_2 w_i²/2 - λ_1 |w_i|)

together with data distribution (2): p(w_i^t) = (1 - C_t) δ(w_i^t) + C_t δ(w_i^t - λ_t^{-1/2}).
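The sparsifying effect of this combined L1+L2 prior is easiest to see at the mode: the proximal map of the negative log-prior is a soft-threshold followed by a shrink, so MAP estimates are exactly sparse even though the prior density has no point mass at zero. A short sketch (illustrative only; the talk studies the full posterior rather than MAP, and the λ_1, λ_2 values here are arbitrary):

```python
import numpy as np

def neg_log_prior(w, lam1, lam2):
    """Negative log of the unnormalised L1+L2 prior,
    p(w_i) proportional to exp(-lam2 * w_i**2 / 2 - lam1 * |w_i|)."""
    return lam2 * w**2 / 2.0 + lam1 * np.abs(w)

def prox(v, lam1, lam2, step=1.0):
    """Proximal map of the penalty: soft-threshold at step*lam1,
    then shrink by 1 / (1 + step*lam2). Small inputs map to exact zero."""
    return (np.sign(v) * np.maximum(np.abs(v) - step * lam1, 0.0)
            / (1.0 + step * lam2))

v = np.array([-2.0, -0.3, 0.0, 0.4, 3.0])
print(prox(v, lam1=0.5, lam2=1.0))
```

This is the elastic-net-style geometry: the L1 term creates the zeros, while the L2 term keeps the objective strongly convex and shrinks the surviving coefficients.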
43: L1 prior results: ρ(w*), with C_t = 0.2, λ_t = N/100, α = M/N.

[Figures: L1 prior theory (ρ(w*) as a function of λ) and mixture prior theory (ρ(w*) as a function of C), for α = 0.2, 0.15, 0.1, ...]
44-45: L1 prior results: p(D | λ_1, λ_2) versus L(w*) and ρ(w*), with C_t = 0.2, λ_t = N/100, α = 0.2.

[Figures: L(w*), ρ(w*) and p(D | λ_1, λ_2) as functions of λ_1 and λ_2/λ_t.]
46-53: Conclusions
- The mixture prior works as expected when well-matched to the data
- The marginal likelihood for the mixture prior can be misleading
- The marginal likelihood seems more effective for the L1 prior... although L1 didn't really perform well (preliminary)
- Future work should look at multiple factors
- Assessment of metrics using the full posterior
- Comparison with MAP and ML approaches
- And better priors!
More informationA new Hierarchical Bayes approach to ensemble-variational data assimilation
A new Hierarchical Bayes approach to ensemble-variational data assimilation Michael Tsyrulnikov and Alexander Rakitko HydroMetCenter of Russia College Park, 20 Oct 2014 Michael Tsyrulnikov and Alexander
More informationBayesian Quadrature: Model-based Approximate Integration. David Duvenaud University of Cambridge
Bayesian Quadrature: Model-based Approimate Integration David Duvenaud University of Cambridge The Quadrature Problem ˆ We want to estimate an integral Z = f ()p()d ˆ Most computational problems in inference
More informationMCMC and Gibbs Sampling. Kayhan Batmanghelich
MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction
More informationBayes methods for categorical data. April 25, 2017
Bayes methods for categorical data April 25, 2017 Motivation for joint probability models Increasing interest in high-dimensional data in broad applications Focus may be on prediction, variable selection,
More informationChapter 16. Structured Probabilistic Models for Deep Learning
Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe
More informationApproximate Inference using MCMC
Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov
More informationMotivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University
Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined
More informationLearning Bayesian network : Given structure and completely observed data
Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationGraphical Models and Kernel Methods
Graphical Models and Kernel Methods Jerry Zhu Department of Computer Sciences University of Wisconsin Madison, USA MLSS June 17, 2014 1 / 123 Outline Graphical Models Probabilistic Inference Directed vs.
More informationProbabilistic Reasoning in Deep Learning
Probabilistic Reasoning in Deep Learning Dr Konstantina Palla, PhD palla@stats.ox.ac.uk September 2017 Deep Learning Indaba, Johannesburgh Konstantina Palla 1 / 39 OVERVIEW OF THE TALK Basics of Bayesian
More informationLearning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling
Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Introduction. Basic Probability and Bayes Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 26/9/2011 (EPFL) Graphical Models 26/9/2011 1 / 28 Outline
More informationLinear Dynamical Systems
Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations
More informationExpectation maximization tutorial
Expectation maximization tutorial Octavian Ganea November 18, 2016 1/1 Today Expectation - maximization algorithm Topic modelling 2/1 ML & MAP Observed data: X = {x 1, x 2... x N } 3/1 ML & MAP Observed
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More informationHub Gene Selection Methods for the Reconstruction of Transcription Networks
for the Reconstruction of Transcription Networks José Miguel Hernández-Lobato (1) and Tjeerd. M. H. Dijkstra (2) (1) Computer Science Department, Universidad Autónoma de Madrid, Spain (2) Institute for
More informationNearest Neighbor Gaussian Processes for Large Spatial Data
Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns
More informationVariational Autoencoders
Variational Autoencoders Recap: Story so far A classification MLP actually comprises two components A feature extraction network that converts the inputs into linearly separable features Or nearly linearly
More informationDefault Priors and Efficient Posterior Computation in Bayesian Factor Analysis
Default Priors and Efficient Posterior Computation in Bayesian Factor Analysis Joyee Ghosh Institute of Statistics and Decision Sciences, Duke University Box 90251, Durham, NC 27708 joyee@stat.duke.edu
More informationMaking rating curves - the Bayesian approach
Making rating curves - the Bayesian approach Rating curves what is wanted? A best estimate of the relationship between stage and discharge at a given place in a river. The relationship should be on the
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More informationContrastive Divergence
Contrastive Divergence Training Products of Experts by Minimizing CD Hinton, 2002 Helmut Puhr Institute for Theoretical Computer Science TU Graz June 9, 2010 Contents 1 Theory 2 Argument 3 Contrastive
More informationQuantifying Fingerprint Evidence using Bayesian Alignment
Quantifying Fingerprint Evidence using Bayesian Alignment Peter Forbes Joint work with Steffen Lauritzen and Jesper Møller Department of Statistics University of Oxford UCL CSML Lunch Talk 14 February
More informationTopic Models. Charles Elkan November 20, 2008
Topic Models Charles Elan elan@cs.ucsd.edu November 20, 2008 Suppose that we have a collection of documents, and we want to find an organization for these, i.e. we want to do unsupervised learning. One
More informationGaussian Processes for Machine Learning
Gaussian Processes for Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics Tübingen, Germany carl@tuebingen.mpg.de Carlos III, Madrid, May 2006 The actual science of
More informationBayesian Analysis for Natural Language Processing Lecture 2
Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013 Administrativia The class has a mailing list: coms-e6998-11@cs.columbia.edu Need two volunteers for leading a discussion
More informationWeek 3: The EM algorithm
Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent
More information