Bayesian simultaneous regression and dimension reduction

1 Bayesian simultaneous regression and dimension reduction MCMski II Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University January 10, 2008

2 Table of contents: 1 Statistical principles; 2 Simulated data, Digits; 3 Pathways and gene sets, Progression in prostate cancer; 4

3 Motivation and related work Data generated by measuring thousands of variables often lies on or near a low-dimensional manifold, or exhibits strong dependencies between variables.

4 Motivation and related work Data generated by measuring thousands of variables often lies on or near a low-dimensional manifold, or exhibits strong dependencies between variables. Manifold learning: LLE, ISOMAP, Laplacian Eigenmaps, Hessian Eigenmaps.

5 Motivation and related work Data generated by measuring thousands of variables often lies on or near a low-dimensional manifold, or exhibits strong dependencies between variables. Manifold learning: LLE, ISOMAP, Laplacian Eigenmaps, Hessian Eigenmaps. Simultaneous dimensionality reduction and regression: SIR, MAVE, SAVE.

6 Generative vs. predictive modelling Given data $= \{Z_i = (x_i, y_i)\}_{i=1}^n$ with $Z_i \overset{iid}{\sim} \rho(X, Y)$, where $X \in \mathcal{X} \subseteq \mathbb{R}^p$, $Y \in \mathbb{R}$, and $p \gg n$.

7 Generative vs. predictive modelling Given data $= \{Z_i = (x_i, y_i)\}_{i=1}^n$ with $Z_i \overset{iid}{\sim} \rho(X, Y)$, where $X \in \mathcal{X} \subseteq \mathbb{R}^p$, $Y \in \mathbb{R}$, and $p \gg n$. Two options: 1 discriminative or regression, $Y \mid X$; 2 generative, $X \mid Y$ (sometimes called inverse regression).

8 Regression Statistical principles Given $X \in \mathcal{X} \subseteq \mathbb{R}^p$, $Y \in \mathbb{R}$, $p \gg n$, and $\rho(X, Y)$, we want $Y \mid X$. A natural idea: $f_r = \arg\min_f \operatorname{Var}(f) = \arg\min_f \mathbb{E}_Y (Y - f(X))^2$, and $f_r(x) = \mathbb{E}_Y[Y \mid x]$ provides a summary of $Y \mid X$.

9 Inverse regression Statistical principles Given $X \in \mathcal{X} \subseteq \mathbb{R}^p$, $Y \in \mathbb{R}$, $p \gg n$, and $\rho(X, Y)$, we want $X \mid Y$. $\Omega = \operatorname{cov}(X \mid Y)$ provides a summary of $X \mid Y$.

10 Inverse regression Statistical principles Given $X \in \mathcal{X} \subseteq \mathbb{R}^p$, $Y \in \mathbb{R}$, $p \gg n$, and $\rho(X, Y)$, we want $X \mid Y$. $\Omega = \operatorname{cov}(X \mid Y)$ provides a summary of $X \mid Y$. 1 $\Omega_{ii}$: relevance of variable $i$ with respect to the label; 2 $\Omega_{ij}$: covariation of variables $i$ and $j$ with respect to the label.

11 Statistical principles Model simultaneously $f_r(x)$ and $\nabla f_r = \big(\tfrac{\partial f_r}{\partial x_1}, \ldots, \tfrac{\partial f_r}{\partial x_p}\big)^T$.

12 Statistical principles Model simultaneously $f_r(x)$ and $\nabla f_r = \big(\tfrac{\partial f_r}{\partial x_1}, \ldots, \tfrac{\partial f_r}{\partial x_p}\big)^T$. 1 regression: $f_r(x)$; 2 inverse regression: the gradient outer product (GOP) $\Gamma = \mathbb{E}[\nabla f_r \, \nabla f_r^T]$, i.e. $\Gamma_{ij} = \big\langle \tfrac{\partial f_r}{\partial x_i}, \tfrac{\partial f_r}{\partial x_j} \big\rangle$.
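A minimal Monte Carlo sketch of this definition, assuming a known toy regression function and a standard normal $\rho_X$ (both stand-ins, not part of the talk):

```python
# Hypothetical example: Monte Carlo approximation of the GOP for a known f.
import numpy as np

rng = np.random.default_rng(0)
p, n_mc = 5, 100_000

def grad_f(x):
    """Gradient of f(x) = sin(x_1) + x_2**2; coordinates 3..p are irrelevant."""
    g = np.zeros_like(x)
    g[:, 0] = np.cos(x[:, 0])
    g[:, 1] = 2.0 * x[:, 1]
    return g

X = rng.normal(size=(n_mc, p))      # draws from rho_X (here standard normal)
G = grad_f(X)                       # n_mc x p matrix of gradients
Gamma = G.T @ G / n_mc              # Monte Carlo estimate of E[grad f grad f^T]

# Only the 2 x 2 block for the relevant coordinates is essentially non-zero,
# so Gamma has numerical rank 2: the predictive subspace is spanned by e_1, e_2.
print(np.round(Gamma, 3))
```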

13 Linear case Statistical principles We start with the linear case $y = w \cdot x + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$, with $\Sigma_X = \operatorname{cov}(X)$ and $\sigma_Y^2 = \operatorname{var}(Y)$. Then $\Gamma = \sigma_Y^2 \big(1 - \tfrac{\sigma^2}{\sigma_Y^2}\big)^2 \Sigma_X^{-1} \Omega \Sigma_X^{-1} \approx \sigma_Y^2 \, \Sigma_X^{-1} \Omega \Sigma_X^{-1}$. $\Gamma$ and $\Omega$ are equivalent modulo rotation and scale.

14 Nonlinear case Statistical principles For a smooth nonlinear model $y = f(x) + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$, the meaning of $\Omega = \operatorname{cov}(X \mid Y)$ is not so clear.

15 Nonlinear case Statistical principles Partition into sections and compute local quantities: $\mathcal{X} = \bigcup_{i=1}^I \chi_i$.

16 Nonlinear case Statistical principles Partition into sections and compute local quantities: $\mathcal{X} = \bigcup_{i=1}^I \chi_i$, $\Omega_i = \operatorname{cov}(X_{\chi_i} \mid Y_{\chi_i})$.

17 Nonlinear case Statistical principles Partition into sections and compute local quantities: $\mathcal{X} = \bigcup_{i=1}^I \chi_i$, $\Omega_i = \operatorname{cov}(X_{\chi_i} \mid Y_{\chi_i})$, $\Sigma_i = \operatorname{cov}(X_{\chi_i})$.

18 Nonlinear case Statistical principles Partition into sections and compute local quantities: $\mathcal{X} = \bigcup_{i=1}^I \chi_i$, $\Omega_i = \operatorname{cov}(X_{\chi_i} \mid Y_{\chi_i})$, $\Sigma_i = \operatorname{cov}(X_{\chi_i})$, $\sigma_i^2 = \operatorname{var}(Y_{\chi_i})$.

19 Nonlinear case Statistical principles Partition into sections and compute local quantities: $\mathcal{X} = \bigcup_{i=1}^I \chi_i$, $\Omega_i = \operatorname{cov}(X_{\chi_i} \mid Y_{\chi_i})$, $\Sigma_i = \operatorname{cov}(X_{\chi_i})$, $\sigma_i^2 = \operatorname{var}(Y_{\chi_i})$, $m_i = \rho_X(\chi_i)$.

20 Nonlinear case Statistical principles Partition into sections and compute local quantities: $\mathcal{X} = \bigcup_{i=1}^I \chi_i$, $\Omega_i = \operatorname{cov}(X_{\chi_i} \mid Y_{\chi_i})$, $\Sigma_i = \operatorname{cov}(X_{\chi_i})$, $\sigma_i^2 = \operatorname{var}(Y_{\chi_i})$, $m_i = \rho_X(\chi_i)$. Then $\Gamma \approx \sum_{i=1}^I m_i \sigma_i^2 \, \Sigma_i^{-1} \Omega_i \Sigma_i^{-1}$.
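A rough numerical sketch of this local construction, assuming k-means cells stand in for the partition $\chi_i$ and a sliced-mean plug-in stands in for $\Omega_i$; these choices are illustrative, not the estimator developed next:

```python
# Hypothetical sketch: assemble Gamma from local covariances over a partition.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, p, I = 2000, 4, 10
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

labels = KMeans(n_clusters=I, n_init=10, random_state=0).fit_predict(X)

Gamma = np.zeros((p, p))
for i in range(I):
    idx = labels == i
    Xi, yi = X[idx], y[idx]
    m_i = idx.mean()                                   # rho_X(chi_i)
    sigma2_i = yi.var()                                # var(Y | chi_i)
    Sigma_i = np.cov(Xi, rowvar=False) + 1e-6 * np.eye(p)
    # crude plug-in for Omega_i on chi_i: covariance of slice means of X,
    # with slices defined by quantiles of y within the cell
    bins = np.quantile(yi, np.linspace(0, 1, 6))
    slice_id = np.clip(np.digitize(yi, bins[1:-1]), 0, 4)
    means = np.array([Xi[slice_id == s].mean(axis=0) for s in range(5)
                      if np.any(slice_id == s)])
    Omega_i = np.cov(means, rowvar=False)
    Si = np.linalg.inv(Sigma_i)
    Gamma += m_i * sigma2_i * Si @ Omega_i @ Si

print(np.round(Gamma, 2))   # mass should concentrate on the first two coordinates
```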

21 Gradient estimate for regression Taylor expanding $f(x)$ around the data should give $\big(f(x_j) - f(x_i) - \nabla f(x_i) \cdot (x_j - x_i)\big)^2 \approx 0$ for $x_i \approx x_j$.

22 Gradient estimate for regression Taylor expanding $f(x)$ around the data should give $\big(f(x_j) - f(x_i) - \nabla f(x_i) \cdot (x_j - x_i)\big)^2 \approx 0$ for $x_i \approx x_j$. $L(f, \nabla f, \text{data}) = \sum_{ij} w_{ij} \big(y_j - f(x_i) - \nabla f(x_i) \cdot (x_j - x_i)\big)^2$.

23 Gradient estimate for regression Taylor expanding $f(x)$ around the data should give $\big(f(x_j) - f(x_i) - \nabla f(x_i) \cdot (x_j - x_i)\big)^2 \approx 0$ for $x_i \approx x_j$. $L(f, \nabla f, \text{data}) = \sum_{ij} w_{ij} \big(y_j - f(x_i) - \nabla f(x_i) \cdot (x_j - x_i)\big)^2$. A similar idea holds for classification, using a link function.
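A small sketch of evaluating this loss for candidate values of $f$ and $\nabla f$ at the data, assuming Gaussian locality weights $w_{ij}$ and toy data as stand-ins:

```python
# Hypothetical sketch: the weighted first-order Taylor loss L(f, grad f, data).
import numpy as np
from scipy.spatial.distance import cdist

def taylor_loss(X, y, f_vals, grads, bandwidth):
    """L = sum_ij w_ij (y_j - f(x_i) - grad f(x_i) . (x_j - x_i))^2."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * bandwidth ** 2))  # w_ij
    # D[i, j] = grad f(x_i) . (x_j - x_i)
    D = grads @ X.T - np.sum(grads * X, axis=1, keepdims=True)
    R = y[None, :] - f_vals[:, None] - D                            # residuals
    return np.sum(W * R ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=30)
# candidate f and its gradient evaluated at the data (here the ground truth)
print(taylor_loss(X, y, X[:, 0] ** 2, np.c_[2 * X[:, 0], np.zeros((30, 2))], 1.0))
```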

24 Gradient estimate Statistical principles Optimization problem: $(f_D, \nabla f_D) = \arg\min_{(f, \nabla f) \in \mathcal{H}_K^{p+1}} \big\{ L(f, \nabla f, \text{data}) + \lambda_1 \|f\|_K^2 + \lambda_2 \|\nabla f\|_K^2 \big\}$, where $\nabla f$ is the vector of gradient functions, $\lambda_1, \lambda_2$ are regularization parameters, and $L(\cdot)$ is the empirical error using a convex loss function.

25 Gradient estimate Statistical principles Optimization problem: $(f_D, \nabla f_D) = \arg\min_{(f, \nabla f) \in \mathcal{H}_K^{p+1}} \big\{ L(f, \nabla f, \text{data}) + \lambda_1 \|f\|_K^2 + \lambda_2 \|\nabla f\|_K^2 \big\}$. Representer form: $f_D(x) = \sum_{i=1}^n a_{i,D} K(x_i, x)$ and $\nabla f_D(x) = \sum_{i=1}^n c_{i,D} K(x_i, x)$, with $a_D = (a_{1,D}, \ldots, a_{n,D}) \in \mathbb{R}^n$ and $c_D = (c_{1,D}, \ldots, c_{n,D})^T \in \mathbb{R}^{n \times p}$.
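A minimal numerical sketch of this optimization, assuming a Gaussian kernel and a generic quasi-Newton solver over the representer coefficients $(a_D, c_D)$ rather than the closed-form linear system one would use in practice:

```python
# Hypothetical sketch: regularized gradient learning via the representer form.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)       # only the first coordinate matters

s = np.median(cdist(X, X))
K = np.exp(-cdist(X, X, "sqeuclidean") / (2 * s ** 2))          # kernel matrix
W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * (s / 2) ** 2))    # locality weights w_ij
lam1, lam2 = 1e-3, 1e-3

def objective(theta):
    a, C = theta[:n], theta[n:].reshape(n, p)
    f_vals = K @ a                       # f_D(x_i) = sum_k a_k K(x_k, x_i)
    G = K @ C                            # row i: grad f_D(x_i)
    D = G @ X.T - np.sum(G * X, axis=1, keepdims=True)
    R = y[None, :] - f_vals[:, None] - D
    emp = np.sum(W * R ** 2) / n ** 2
    return emp + lam1 * a @ K @ a + lam2 * np.trace(C.T @ K @ C)

res = minimize(objective, np.zeros(n * (p + 1)), method="L-BFGS-B")
C_hat = res.x[n:].reshape(n, p)
Gamma_hat = C_hat.T @ K @ C_hat          # plug-in GOP estimate (next slide)
print(np.round(np.diag(Gamma_hat), 4))   # should be largest for the first coordinate
```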

26 Gradient Outer Product (GOP) A central quantity in this talk will be the GOP. Definition (GOP): $\hat\Gamma = \langle \nabla f_D, \nabla f_D \rangle = c_D^T K c_D \approx \mathbb{E}(\nabla f \, \nabla f^T)$.

27 Dimension reduction Statistical principles Proposition: the eigenvectors corresponding to the $d$ non-zero eigenvalues of $\Gamma$ span the subspace relevant to prediction. Gradients provide information on the predictive directions $b_i$, $i = 1, \ldots, d$.
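A short sketch of the resulting supervised dimension reduction step, assuming a GOP estimate $\hat\Gamma$ (for example from the sketch above) is already available:

```python
# Hypothetical sketch: predictive directions from the eigenvectors of Gamma.
import numpy as np

def predictive_directions(Gamma_hat, d):
    """Return the top-d eigenvectors (and eigenvalues) of a symmetric GOP estimate."""
    evals, evecs = np.linalg.eigh(Gamma_hat)        # ascending eigenvalues
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:d]], evals[order[:d]]

# toy GOP with a rank-1 predictive subspace along (1, 1, 0)/sqrt(2)
b = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
B, lams = predictive_directions(np.outer(b, b) + 1e-8 * np.eye(3), d=1)
print(np.round(B.ravel(), 3), np.round(lams, 3))

# project data onto the estimated predictive subspace: X_reduced = X @ B
```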

28 Gaussian Markov distributions over graphs Given a multivariate normal distribution with covariance matrix $\Sigma$, the matrix $P = \Sigma^{-1}$ is the conditional independence matrix: $P_{ij}$ captures the dependence of variables $i$ and $j$ given all other variables.

29 Gaussian Markov distributions over graphs Given a multivariate normal distribution with covariance matrix $\Sigma$, the matrix $P = \Sigma^{-1}$ is the conditional independence matrix: $P_{ij}$ captures the dependence of variables $i$ and $j$ given all other variables. Note that by construction $\hat\Gamma$ is a covariance matrix of a Gaussian process.

30 Gaussian Markov distributions over graphs Given a multivariate normal distribution with covariance matrix $\Sigma$, the matrix $P = \Sigma^{-1}$ is the conditional independence matrix: $P_{ij}$ captures the dependence of variables $i$ and $j$ given all other variables. Note that by construction $\hat\Gamma$ is a covariance matrix of a Gaussian process. $J = \hat\Gamma^{-1}$ is the inferred conditional independence matrix.
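A sketch of reading conditional (in)dependence off $\hat\Gamma$; since a GOP estimate is typically low rank, the sketch assumes a pseudo-inverse with a small ridge, which is a stand-in rather than anything prescribed in the talk:

```python
# Hypothetical sketch: conditional independence matrix from a GOP estimate.
import numpy as np

def conditional_independence(Gamma_hat, ridge=1e-6):
    """J = inv(Gamma): small off-diagonal |J_ij| suggests variables i and j are
    conditionally independent given the rest (treating Gamma as a covariance)."""
    p = Gamma_hat.shape[0]
    return np.linalg.pinv(Gamma_hat + ridge * np.eye(p))

Gamma_hat = np.array([[2.0, 0.8, 0.0],
                      [0.8, 1.0, 0.1],
                      [0.0, 0.1, 0.5]])
print(np.round(conditional_independence(Gamma_hat), 2))
```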

31 Restriction to a manifold Assume the data is concentrated on a manifold $\mathcal{M} \subset \mathbb{R}^p$ with $\dim \mathcal{M} = d$ and there exists an isometric embedding $\varphi: \mathcal{M} \to \mathbb{R}^p$.

32 Restriction to a manifold Assume the data is concentrated on a manifold $\mathcal{M} \subset \mathbb{R}^p$ with $\dim \mathcal{M} = d$ and there exists an isometric embedding $\varphi: \mathcal{M} \to \mathbb{R}^p$. Theorem: under mild regularity conditions on the distribution and corresponding density, with probability $1 - \delta$, $\big\| (d\varphi)^* \nabla f_D - \nabla_{\mathcal{M}} f_\rho \big\|_{\rho_X} \le C \log\big(\tfrac{1}{\delta}\big)\, n^{-1/d}$, where $(d\varphi)^*$ is the dual of the map $d\varphi$.

33 Bayesian kernel model for regression $y_i = f(x_i) + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$, with $f(x) = \int_{\mathcal{X}} K(x, u)\, Z(du)$, where $Z(du) \in \mathcal{M}(\mathcal{X})$ is a signed measure on $\mathcal{X}$.

34 Bayesian kernel model for regression $y_i = f(x_i) + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$, with $f(x) = \int_{\mathcal{X}} K(x, u)\, Z(du)$, where $Z(du) \in \mathcal{M}(\mathcal{X})$ is a signed measure on $\mathcal{X}$. A prior $\pi(Z)$ with $\pi(Z \mid \text{data}) \propto L(\text{data} \mid Z)\, \pi(Z)$ implies a posterior on $f$.

35 Priors and integral operators Integral operator $L_K: \Gamma \to \mathcal{G}$, $\mathcal{G} = \big\{ f \mid f(x) := L_K[\gamma](x) = \int_{\mathcal{X}} K(x, u)\, d\gamma(u), \; \gamma \in \Gamma \big\}$, with $\Gamma \subseteq \mathcal{B}(\mathcal{X})$.

36 Priors and integral operators Integral operator $L_K: \Gamma \to \mathcal{G}$, $\mathcal{G} = \big\{ f \mid f(x) := L_K[\gamma](x) = \int_{\mathcal{X}} K(x, u)\, d\gamma(u), \; \gamma \in \Gamma \big\}$, with $\Gamma \subseteq \mathcal{B}(\mathcal{X})$. A prior on $\Gamma$ implies a prior on $\mathcal{G}$.

37 Equivalence with RKHS For what $\Gamma$ is $\mathcal{H}_K = \operatorname{span}(\mathcal{G})$? What is $L_K^{-1}(\mathcal{H}_K)$? This is hard to characterize.

38 Equivalence with RKHS For what $\Gamma$ is $\mathcal{H}_K = \operatorname{span}(\mathcal{G})$? What is $L_K^{-1}(\mathcal{H}_K)$? This is hard to characterize. An appropriate choice for $\Gamma$ is the union of integrable functions and discrete measures.

39 Signed measures are (almost) just right Nonsingular measures: $\mathcal{M} = L^1(\mathcal{X}) \oplus \mathcal{M}_D$. Proposition: $L_K(\mathcal{M})$ is dense in $\mathcal{H}_K$ with respect to the RKHS norm.

40 Signed measures are (almost) just right Nonsingular measures: $\mathcal{M} = L^1(\mathcal{X}) \oplus \mathcal{M}_D$. Proposition: $L_K(\mathcal{M})$ is dense in $\mathcal{H}_K$ with respect to the RKHS norm. Proposition: $\mathcal{B}(\mathcal{X}) \subset L_K^{-1}(\mathcal{H}_K(\mathcal{X}))$.

41 The implication Statistical principles Take-home message: we need priors on signed measures. This gives a function-theoretic foundation for random signed measures such as Gaussian, Dirichlet, and Lévy process priors.

42 Bayesian kernel model $y_i = f(x_i) + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$, with $f(x) = \int_{\mathcal{X}} K(x, u)\, Z(du)$, where $Z(du) \in \mathcal{M}(\mathcal{X})$ is a signed measure on $\mathcal{X}$.

43 Bayesian kernel model $y_i = f(x_i) + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$, with $f(x) = \int_{\mathcal{X}} K(x, u)\, Z(du)$, where $Z(du) \in \mathcal{M}(\mathcal{X})$ is a signed measure on $\mathcal{X}$. A prior $\pi(Z)$ with $\pi(Z \mid \text{data}) \propto L(\text{data} \mid Z)\, \pi(Z)$ implies a posterior on $f$.

44 Dirichlet process prior $f(x) = \int_{\mathcal{X}} K(x, u)\, Z(du) = \int_{\mathcal{X}} K(x, u)\, w(u)\, F(du)$, where $F(du)$ is a distribution and $w(u)$ a coefficient function.

45 Dirichlet process prior $f(x) = \int_{\mathcal{X}} K(x, u)\, Z(du) = \int_{\mathcal{X}} K(x, u)\, w(u)\, F(du)$, where $F(du)$ is a distribution and $w(u)$ a coefficient function. Model $F$ using a Dirichlet process prior: $F \sim \mathrm{DP}(\alpha, F_0)$.

46 Bayesian representer form Given $X^n = (x_1, \ldots, x_n) \overset{iid}{\sim} F$: $F \mid X^n \sim \mathrm{DP}(\alpha + n, F_n)$ with $F_n = \big(\alpha F_0 + \sum_{i=1}^n \delta_{x_i}\big)/(\alpha + n)$, and $\mathbb{E}[f \mid X^n] = a_n \int K(x, u)\, w(u)\, dF_0(u) + n^{-1}(1 - a_n) \sum_{i=1}^n w(x_i)\, K(x, x_i)$, where $a_n = \alpha/(\alpha + n)$.

47 Bayesian representer form Taking $\lim_{\alpha \to 0}$ to represent a non-informative prior: Proposition (Bayesian representer form) $\hat f_n(x) = \sum_{i=1}^n w_i K(x, x_i)$, $w_i = w(x_i)/n$.
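A tiny sketch of evaluating this limiting representer form, assuming a Gaussian kernel and an arbitrary coefficient function $w$ as stand-ins:

```python
# Hypothetical sketch: the alpha -> 0 Bayesian representer form.
import numpy as np

def f_hat(x, X, w, bandwidth):
    """f_hat(x) = sum_i (w(x_i)/n) K(x, x_i) with a Gaussian kernel."""
    n = len(X)
    k = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
    return np.sum(w(X) / n * k)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
w = lambda X: X[:, 0]                    # an arbitrary coefficient function w(u)
print(f_hat(np.zeros(2), X, w, bandwidth=1.0))
```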

48 Bayesian kernel model for gradient estimates By Taylor expansion, $y_i = f(x_j) + \nabla f(x_i) \cdot (x_i - x_j) + \varepsilon_{x_i, x_j}$.

49 Bayesian kernel model for gradient estimates By Taylor expansion, $y_i = f(x_j) + \nabla f(x_i) \cdot (x_i - x_j) + \varepsilon_{x_i, x_j}$. By the representer form, $y_i = \alpha_0 + K\alpha + (\iota x_i - X)\, C K_i + \varepsilon_i$, where $\iota = (1, \ldots, 1)$, $\alpha = (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$, $C = (c_1, \ldots, c_n) \in \mathbb{R}^{p \times n}$, $X$ is the $n \times p$ data matrix, and $K_i$ is the $i$th column of $K$.

50 Likelihood: error term and spatial statistics Intuition: consider a spatial model (similarity matrix) $w_{ij} := \theta \exp\{-\phi \|x_i - x_j\|\}$, where $\theta$ and $\phi$ are parameters of the spatial model.

51 Likelihood: error term and spatial statistics Intuition: consider a spatial model (similarity matrix) $w_{ij} := \theta \exp\{-\phi \|x_i - x_j\|\}$, where $\theta$ and $\phi$ are parameters of the spatial model. A natural modeling assumption is $\operatorname{var}(\varepsilon_{x_i, x_j}) \propto w_{ij}^{-1}$.

52 Likelihood: error term and spatial statistics Intuition: consider a spatial model (similarity matrix) $w_{ij} := \theta \exp\{-\phi \|x_i - x_j\|\}$, where $\theta$ and $\phi$ are parameters of the spatial model. A natural modeling assumption is $\operatorname{var}(\varepsilon_{x_i, x_j}) \propto w_{ij}^{-1}$. Given this spatial structure, $\varepsilon_i \sim \mathrm{No}_n(0, W_i^{-1})$, where $W_i = \mathrm{diag}(w_{x_i, x_1}, \ldots, w_{x_i, x_n})$.
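A sketch of building these weights and the induced error covariances for a toy design, assuming arbitrary values for $\theta$ and $\phi$:

```python
# Hypothetical sketch: spatial similarity weights and the induced error model.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 3))
theta, phi = 1.0, 2.0

w = theta * np.exp(-phi * cdist(X, X))        # w_ij = theta * exp(-phi ||x_i - x_j||)

# For expansion point x_i the errors are eps_i ~ No_n(0, W_i^{-1}) with
# W_i = diag(w_{i1}, ..., w_{in}): far-away points get large error variance,
# so the Taylor expansion is trusted only locally.
i = 0
W_i = np.diag(w[i])
eps_i = rng.multivariate_normal(np.zeros(len(X)), np.linalg.inv(W_i))
print(np.round(np.sqrt(1.0 / w[i])[:5], 2))   # marginal error sds for the first 5 points
```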

53 Likelihood Statistical principles Given the error model, the likelihood is $L(\text{data} \mid f, \nabla f) \propto \big(\prod_{ij} w_{ij}\big)^{1/2} \exp\big\{-\tfrac{1}{2} \sum_i e_i^T W_i e_i\big\}$

54 Likelihood Statistical principles Given the error model, the likelihood is $L(\text{data} \mid f, \nabla f) \propto \big(\prod_{ij} w_{ij}\big)^{1/2} \exp\big\{-\tfrac{1}{2} \sum_i e_i^T W_i e_i\big\}$ with $K = F \Lambda F^T$, $\Lambda := \mathrm{diag}(\lambda_1^2, \ldots, \lambda_n^2)$, $\alpha = F^{-1}\beta$, and $e_i = y_i - \alpha_0 - F\beta - (\iota x_i - X)\, C K_i$.

55 Prior specification Statistical principles $\pi(\alpha_0, \theta) \propto 1/\theta$

56 Prior specification Statistical principles $\pi(\alpha_0, \theta) \propto 1/\theta$, $\beta \sim \mathrm{No}(0, T)$

57 Prior specification Statistical principles $\pi(\alpha_0, \theta) \propto 1/\theta$, $\beta \sim \mathrm{No}(0, T)$, $T := \mathrm{diag}(\tau_1, \ldots, \tau_n)$, $\tau_i^{-1} \sim \mathrm{Ga}(a_\tau/2, b_\tau/2)$

58 Prior specification Statistical principles $\pi(\alpha_0, \theta) \propto 1/\theta$, $\beta \sim \mathrm{No}(0, T)$, $T := \mathrm{diag}(\tau_1, \ldots, \tau_n)$, $\tau_i^{-1} \sim \mathrm{Ga}(a_\tau/2, b_\tau/2)$, $C_{kj} \sim (1 - \pi_k)\delta_0 + \pi_k \mathrm{No}(0, \phi_k^{-1})$

59 Prior specification Statistical principles $\pi(\alpha_0, \theta) \propto 1/\theta$, $\beta \sim \mathrm{No}(0, T)$, $T := \mathrm{diag}(\tau_1, \ldots, \tau_n)$, $\tau_i^{-1} \sim \mathrm{Ga}(a_\tau/2, b_\tau/2)$, $C_{kj} \sim (1 - \pi_k)\delta_0 + \pi_k \mathrm{No}(0, \phi_k^{-1})$, $\phi_k \sim \mathrm{Ga}(\alpha_c/2, \beta_c/2)$

60 Prior specification Statistical principles $\pi(\alpha_0, \theta) \propto 1/\theta$, $\beta \sim \mathrm{No}(0, T)$, $T := \mathrm{diag}(\tau_1, \ldots, \tau_n)$, $\tau_i^{-1} \sim \mathrm{Ga}(a_\tau/2, b_\tau/2)$, $C_{kj} \sim (1 - \pi_k)\delta_0 + \pi_k \mathrm{No}(0, \phi_k^{-1})$, $\phi_k \sim \mathrm{Ga}(\alpha_c/2, \beta_c/2)$, $\pi_k \sim \mathrm{Beta}(\alpha_\pi, \beta_\pi)$

61 Prior specification Statistical principles $\pi(\alpha_0, \theta) \propto 1/\theta$, $\beta \sim \mathrm{No}(0, T)$, $T := \mathrm{diag}(\tau_1, \ldots, \tau_n)$, $\tau_i^{-1} \sim \mathrm{Ga}(a_\tau/2, b_\tau/2)$, $C_{kj} \sim (1 - \pi_k)\delta_0 + \pi_k \mathrm{No}(0, \phi_k^{-1})$, $\phi_k \sim \mathrm{Ga}(\alpha_c/2, \beta_c/2)$, $\pi_k \sim \mathrm{Beta}(\alpha_\pi, \beta_\pi)$, $\phi \sim \mathrm{Ga}(a_\phi/2, b_\phi/2)$

62 Prior specification Statistical principles $\pi(\alpha_0, \theta) \propto 1/\theta$, $\beta \sim \mathrm{No}(0, T)$, $T := \mathrm{diag}(\tau_1, \ldots, \tau_n)$, $\tau_i^{-1} \sim \mathrm{Ga}(a_\tau/2, b_\tau/2)$, $C_{kj} \sim (1 - \pi_k)\delta_0 + \pi_k \mathrm{No}(0, \phi_k^{-1})$, $\phi_k \sim \mathrm{Ga}(\alpha_c/2, \beta_c/2)$, $\pi_k \sim \mathrm{Beta}(\alpha_\pi, \beta_\pi)$, $\phi \sim \mathrm{Ga}(a_\phi/2, b_\phi/2)$. A standard Gibbs sampler simulates $p(\alpha, \alpha_0, C, \phi, \theta \mid \text{data})$.
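A sketch of forward-simulating the spike-and-slab part of this hierarchy for the gradient coefficients $C$, assuming placeholder hyperparameter values:

```python
# Hypothetical sketch: draws of C from the spike-and-slab prior hierarchy.
import numpy as np

rng = np.random.default_rng(6)
p, n = 8, 30
a_pi, b_pi = 1.0, 4.0          # Beta hyperparameters (placeholders)
a_c, b_c = 2.0, 2.0            # Gamma hyperparameters for phi_k (placeholders)

pi_k = rng.beta(a_pi, b_pi, size=p)             # pi_k ~ Beta(a_pi, b_pi)
phi_k = rng.gamma(a_c / 2, 2.0 / b_c, size=p)   # phi_k ~ Ga(a_c/2, b_c/2), rate b_c/2

# C_kj ~ (1 - pi_k) delta_0 + pi_k No(0, 1/phi_k): row k is sparse when pi_k is small
nonzero = rng.random((p, n)) < pi_k[:, None]
C = nonzero * rng.normal(0.0, 1.0 / np.sqrt(phi_k)[:, None], size=(p, n))

print("nonzero entries per variable:", nonzero.sum(axis=1))
```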

63 Linear example Statistical principles Simulated data Digits (figure; axes: Dimensions, RKHS norm, samples)

64 Linear example Statistical principles Simulated data Digits (figure; axes: Dimensions vs. Dimensions)

65 Nonlinear example Statistical principles Simulated data Digits (figure; axis labels: Dimension, Feature, Feature 1)

66 Digit classification Statistical principles Simulated data Digits Input: MNIST handwritten digits database, $X_i \in \mathbb{R}^{784}$ (a 28 × 28 gray-scale pixel image).

67 Digit classification Statistical principles Simulated data Digits Input: MNIST handwritten digits database, $X_i \in \mathbb{R}^{784}$ (a 28 × 28 gray-scale pixel image). Formulation: Problem 1, 3 vs. 8 with 50 3s and 50 8s; Problem 2, 5 vs. 8 with 50 5s and 50 8s.

68 Digit classification Statistical principles Simulated data Digits Input: MNIST handwritten digits database, $X_i \in \mathbb{R}^{784}$ (a 28 × 28 gray-scale pixel image). Formulation: Problem 1, 3 vs. 8 with 50 3s and 50 8s; Problem 2, 5 vs. 8 with 50 5s and 50 8s. Goal: learn features for a predictive model, 3 vs. 8 and 5 vs. 8.
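A sketch of assembling the 3 vs. 8 subsample, assuming the data are pulled through scikit-learn's fetch_openml as a stand-in for however they were actually obtained (requires network access):

```python
# Hypothetical sketch: a 50 + 50 subsample of MNIST 3s and 8s.
import numpy as np
from sklearn.datasets import fetch_openml

X_all, y_all = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
rng = np.random.default_rng(7)

def subsample(digit, size=50):
    idx = np.flatnonzero(y_all == str(digit))
    return X_all[rng.choice(idx, size=size, replace=False)]

X3, X8 = subsample(3), subsample(8)
X = np.vstack([X3, X8]) / 255.0                  # each row in R^784 (28 x 28 image)
y = np.r_[np.ones(50), -np.ones(50)]             # +1 for 3, -1 for 8
print(X.shape, y.shape)
```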

69 3, 5, 8 Classification problem Simulated data Digits

70 Top features: 3 vs 8 Statistical principles Simulated data Digits

71 Top features: 5 vs 8 Statistical principles Simulated data Digits

72 Genes don't do things Pathways and gene sets Progression in prostate cancer

73 Diabetes Oxphos Statistical principles Pathways and gene sets Progression in prostate cancer

74 Gender Statistical principles Pathways and gene sets Progression in prostate cancer

75 Gene set database Statistical principles Pathways and gene sets Progression in prostate cancer The gene sets in the database are defined by 1 Positional gene sets: cytogenetic bands, 3 megabase windows;

76 Gene set database Statistical principles Pathways and gene sets Progression in prostate cancer The gene sets in the database are defined by 1 Positional gene sets: cytogenetic bands, 3 megabase windows; 2 Motif gene sets: TRANSFAC motifs, Representative motifs;

77 Gene set database Statistical principles Pathways and gene sets Progression in prostate cancer The gene sets in the database are defined by 1 Positional gene sets: cytogenetic bands, 3 megabase windows; 2 Motif gene sets: TRANSFAC motifs, Representative motifs; 3 Curated gene sets: Pathways, Literature reviews, Animal models, Clinical phenotypes, Expert curations, Chemical or genetic perturbations.

78 Progression of prostate cancer Pathways and gene sets Progression in prostate cancer Gene expression from 22,283 genes. 71 people: 22 benign (b) prostate epithelium, 32 primary (p) prostate cancer, 17 metastatic (m) prostate cancer.

79 Progression of prostate cancer Pathways and gene sets Progression in prostate cancer Gene expression from 22,283 genes. 71 people: 22 benign (b) prostate epithelium, 32 primary (p) prostate cancer, 17 metastatic (m) prostate cancer. Progression: {b → p → m}.

80 Progression of prostate cancer Pathways and gene sets Progression in prostate cancer Gene expression from 22,283 genes. 71 people: 22 benign (b) prostate epithelium, 32 primary (p) prostate cancer, 17 metastatic (m) prostate cancer. Progression: {b → p → m}. 523 pathway-defined gene sets.

81 Progression of prostate cancer Pathways and gene sets Progression in prostate cancer Gene expression from 22,283 genes. 71 people: 22 benign (b) prostate epithelium, 32 primary (p) prostate cancer, 17 metastatic (m) prostate cancer. Progression: {b → p → m}. 523 pathway-defined gene sets. 1 Which pathways are involved in all or some stages of progression?

82 Progression of prostate cancer Pathways and gene sets Progression in prostate cancer Gene expression from 22,283 genes. 71 people: 22 benign (b) prostate epithelium, 32 primary (p) prostate cancer, 17 metastatic (m) prostate cancer. Progression: {b → p → m}. 523 pathway-defined gene sets. 1 Which pathways are involved in all or some stages of progression? 2 What are the pathway dependencies (inferring pathway networks)?

83 Progression of prostate cancer Pathways and gene sets Progression in prostate cancer Gene expression from 22,283 genes. 71 people: 22 benign (b) prostate epithelium, 32 primary (p) prostate cancer, 17 metastatic (m) prostate cancer. Progression: {b → p → m}. 523 pathway-defined gene sets. 1 Which pathways are involved in all or some stages of progression? 2 What are the pathway dependencies (inferring pathway networks)? 3 For each relevant pathway, infer the gene network for that pathway.

84 Pathways relevant in progression Pathways and gene sets Progression in prostate cancer (figure: pathways relevant in the b → p and p → m transitions; panels A, B, C showing gene sets such as TRANS, CCC, GHD, KREB, HORM, GLY with weights, e.g. 0.8 and 0.2)

85 Pathways and gene sets Progression in prostate cancer Pathway dependencies: benign to primary (figure: inferred pathway dependency network, panels A and B)

86 Refinement of gene sets Pathways and gene sets Progression in prostate cancer 1 Not all genes in a gene set are relevant in the specific context studied.

87 Refinement of gene sets Pathways and gene sets Progression in prostate cancer 1 Not all genes in a gene set are relevant in the specific context studied. 2 Genes not included in the gene set may be relevant to the specific context studied.

88 Gene network for ERK pathway Pathways and gene sets Progression in prostate cancer (figure: inferred gene network for the ERK pathway, with nodes including NGF, PTPR, ELK1, SOS1, NGFB, DPM2, GRB2, PPP2CA, GNB1, MKNK2, MKNK1, EGFR, RPS, RAF1, SHC1, STAT, TGFB, MYC, RPS6KAS, MAPK1, MAP2K2, PDG, MAP2K1, GNAS)

89 Relevant papers
Learning Coordinate Covariances via Gradients. S. Mukherjee, D-X. Zhou; Journal of Machine Learning Research, 7(Mar).
Estimation of Gradients and Coordinate Covariation in Classification. S. Mukherjee, Q. Wu; Journal of Machine Learning Research, 7(Nov).
Characterizing the Function Space for Bayesian Kernel Models. N. Pillai, Q. Wu, F. Liang, S. Mukherjee, R.L. Wolpert; Journal of Machine Learning Research, 8(Aug).
Non-parametric Bayesian kernel models. F. Liang, K. Mao, M. Liao, S. Mukherjee, M. West; Biometrika, in submission.
Learning Gradients: predictive models that infer geometry and dependence. Qiang Wu, Justin Guinney, Mauro Maggioni; Journal of Machine Learning Research, submitted.
Modeling Cancer Progression via Pathway Dependencies. E. Edelman, J. Guinney, J-T. Chi, P.G. Febbo, S. Mukherjee; PLoS Computational Biology, in press.
Bayesian simultaneous dimension reduction and regression. K. Mao, F. Liang, S. Mukherjee, Q. Wu; in preparation.

90 Acknowledgements People that did the work: Gradients Q Wu, D-X Zhou, K Mao, J Guinney

91 Acknowledgements People that did the work: Gradients Q Wu, D-X Zhou, K Mao, J Guinney Computational biology E Edelman, J Guinney, P Febbo, J-T Chi

92 Acknowledgements People that did the work: Gradients Q Wu, D-X Zhou, K Mao, J Guinney Computational biology E Edelman, J Guinney, P Febbo, J-T Chi Bayesian modeling N Pillai, K Mao, F Liang, M West, R Wolpert

93 Acknowledgements People that did the work: Gradients Q Wu, D-X Zhou, K Mao, J Guinney Computational biology E Edelman, J Guinney, P Febbo, J-T Chi Bayesian modeling N Pillai, K Mao, F Liang, M West, R Wolpert Funding: IGSP Center for Systems Biology at Duke NSF DMS
