Nonparametric Bayesian Models for Supervised Dimension Reduction

Nonparametric Bayesian Models for Supervised Dimension Reduction
Department of Statistical Science, Duke University
December 2, 2009

Outline
1. Background: Dimension Reduction; Supervised Dimension Reduction
2. Inverse Regression Methods; Bayesian Mixture Inverse Modeling
3. Motivation; Error Covariance and Learning the Gradient

Dimension Reduction
Dimension reduction: find a low-dimensional representation of high-dimensional data while preserving information in a certain sense; the intrinsic dimension is assumed to be low. Formally, given samples $x_1, \dots, x_n \in \mathcal{X} \subset \mathbb{R}^p$, produce lower-dimensional features or factors $\tilde{x}_1, \dots, \tilde{x}_n \in \tilde{\mathcal{X}} \subset \mathbb{R}^d$ with $d < p$, according to criteria that ensure information is preserved in a certain sense; there exists a map $D$ such that $D(x_i) = \tilde{x}_i$. Uses: disclosing underlying data structure, graphical visualization, overcoming the curse of dimensionality, facilitating the use of other statistical methods, saving data storage cost, etc.

Dimension Reduction Methods
Linear methods (find a linear subspace): principal component analysis (PCA), factor models, ...
Nonlinear methods: kernel PCA, multidimensional scaling, ...
Manifold learning methods (explore local structure): ISOMAP, Laplacian eigenmaps, locally linear embedding, ...
These are all unsupervised.

Supervised Dimension Reduction (SDR)
Predictor variable $X \in \mathbb{R}^p$, response variable $Y \in \mathbb{R}$. Goal: find for $X$ a low-dimensional subspace (or manifold) that contains all the information needed to predict $Y$. Ignoring $Y$ can be problematic (e.g., PCA only maximizes variation in $X$).

SDR Formulation
1. $Y = g(b_1^\top X, \dots, b_d^\top X, \varepsilon)$, where $d < p$, $g$ is some unknown function and $\varepsilon$ is an error term. With $B = (b_1, \dots, b_d) \in \mathbb{R}^{p \times d}$, $\mathcal{B} = \mathrm{span}(B)$ is the dimension reduction (d.r.) subspace, and $\mathcal{B} \in \mathcal{G}_{(d,p)}$, the Grassmann manifold, i.e., the set of all $d$-dimensional linear subspaces of $\mathbb{R}^p$.
2. $Y \perp X \mid P_{\mathcal{B}} X$, where $P_{\mathcal{B}} X$ is the orthogonal projection of $X$ onto $\mathcal{B}$.
3. $Y \mid X = Y \mid P_{\mathcal{B}} X$, i.e., the conditional distribution of $Y$ given $X$ depends on $X$ only through $P_{\mathcal{B}} X$.

SDR Methods
Forward regression category: directly model $Y \mid X$ or $g$. Examples: projection pursuit regression (Friedman and Stuetzle, 1981); Bayesian sufficient dimension reduction (Tokdar et al., 2008); etc.
Inverse regression category: model $X \mid Y$. Examples: sliced inverse regression (Li, 1991); reduced-rank linear discriminant analysis; the principal fitted component model (Cook, 2007); etc.
Gradient learning category: learn $\nabla f$, the gradient of the regression function $f = E(Y \mid X)$. Examples: Mukherjee et al. (2006); Wu et al. (2007); Xia et al. (2002); etc.

Reduced-Rank Linear Discriminant Analysis (LDA)
Idea: find a subspace that maximizes between-class variation while controlling for within-class variation.

LDA Limitation
When there are multiple clusters within a class, the class mean/center cannot represent the whole class.

Sliced Inverse Regression (SIR)
SIR: slice the data in terms of $Y$ into $H$ bins, treat the bins as classes, and then proceed as in LDA.
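
To make the slicing idea concrete, here is a minimal numpy sketch of classical SIR (Li, 1991), not the Bayesian model developed later in the talk; the function name, the equal-count slicing, and the whitening details are illustrative assumptions.

```python
import numpy as np

def sir_directions(X, y, H=10, d=2):
    """Minimal sliced inverse regression: whiten X, slice y into H equal-count
    bins, average the whitened X within each slice, and take the top
    eigenvectors of the between-slice covariance of those means."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    W = np.linalg.cholesky(np.linalg.inv(Sigma))   # whitening: W W' = Sigma^{-1}
    Z = (X - mu) @ W
    slices = np.array_split(np.argsort(y), H)      # equal-count slices of y
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    _, evecs = np.linalg.eigh(M)
    B = W @ evecs[:, ::-1][:, :d]                  # map directions back to the X scale
    return B / np.linalg.norm(B, axis=0)           # columns estimate the d.r. space
```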

SIR Limitations
1. As with LDA, degeneracy may occur.
2. The slicing procedure is rigid, so the information in $Y$ is not well utilized.
3. It is not probabilistic, so the uncertainty of the estimate cannot be evaluated.

LSIR and MDA
Localized SIR (LSIR) (Wu et al., 2008) uses local means: a compromise between LDA/SIR and PCA.
Mixture discriminant analysis (MDA) (Hastie and Tibshirani, 1996) fits finite Gaussian mixtures within each class.
Finite mixtures are not flexible; it is natural to go nonparametric.

Probabilistic Model for SIR
Principal fitted component (PFC) model (Cook, 2007):
$X \mid (Y = y) = \mu + A \nu_y + \varepsilon$
where $\mu \in \mathbb{R}^p$ is the intercept, $A \in \mathbb{R}^{p \times d}$ ($d < p$), $\nu_y \in \mathbb{R}^d$, and $\varepsilon \sim N(0, \Sigma)$. The columns of $\Sigma^{-1} A$ span the d.r. subspace.

Semi-parametric Bayesian Mixture Model
Extending to a mixture modeling setting:
$X \mid (Y = y) = \mu + A \nu_{yx} + \varepsilon$, with $\nu_{yx} \sim G_y$ (an unknown random distribution).
SIR corresponds to $G_y = \delta_{\nu_y}$, a point mass for samples in the same bin; more generally, $G_y$ can be a mixture, and when $y$ is continuous $G_y$ can change smoothly in $y$.

d.r. Subspace
Proposition: for this model the d.r. subspace is the span of $B = \Sigma^{-1} A$, i.e., $Y \mid X = Y \mid (\Sigma^{-1} A)^\top X$.

Dirichlet Processes (DP)
When $Y$ is discrete, i.e., $Y = c$, $c = 1, \dots, C$, a Dirichlet process (DP) prior on $G_c$ leads to a mixture model with the number of mixture components determined automatically:
$G_c \overset{\text{i.i.d.}}{\sim} DP(\alpha_0, G_0)$
$G_c = \sum_{h=1}^{\infty} \pi_h \delta_{\nu_h^*}$ (stick-breaking representation), with $\pi_h = V_h \prod_{l<h}(1 - V_l)$, $V_h \sim \mathrm{Beta}(1, \alpha_0)$, $\nu_h^* \sim G_0$.
For each sample $i$, $\nu_i \mid (y_i = c, \nu_{-i}) \sim \sum_{j \neq i,\, y_j = c} \delta_{\nu_j} + \alpha_0 G_0(\cdot)$ (up to normalization).
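
As a concrete illustration of the stick-breaking construction above, here is a small numpy sketch that draws a truncated DP random measure; the truncation level H, the function name, and the base-measure example are assumptions, not part of the talk.

```python
import numpy as np

def draw_dp_stick_breaking(alpha0, base_draw, H=50, rng=None):
    """Truncated stick-breaking draw from DP(alpha0, G0): returns atoms
    nu*_h ~ G0 and weights pi_h = V_h * prod_{l<h}(1 - V_l)."""
    rng = np.random.default_rng(rng)
    V = rng.beta(1.0, alpha0, size=H)
    V[-1] = 1.0                          # close the stick at the truncation level
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    atoms = np.stack([base_draw(rng) for _ in range(H)])
    return pi, atoms

# e.g., base measure G0 = N(0, I_d) with d = 2, as on the later slides
pi, atoms = draw_dp_stick_breaking(1.0, lambda rng: rng.standard_normal(2), H=50)
```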

Dependent Dirichlet Processes (DDP)
When $Y$ is continuous we want $G_y$ to change smoothly with $y$, i.e., $G_{y_1}$ and $G_{y_2}$ should have stronger dependence the closer $y_1$ and $y_2$ are. The dependent Dirichlet process (DDP, MacEachern 1998) provides a natural framework for this:
$G_y = \sum_{h=1}^{\infty} \pi_{yh} \delta_{\nu_{yh}^*}$, with $\pi_{yh} = V_{yh} \prod_{l<h}(1 - V_{yl})$.
There are different constructions for the DDP (Gelfand et al., 2005; Iorio et al., 2004; Griffin and Steel, 2006).

Kernel Stick-breaking Process
Kernel stick-breaking process (Dunson and Park, 2008):
$G_y = \sum_{h=1}^{\infty} U(y; V_h, L_h) \prod_{l<h}\big(1 - U(y; V_l, L_l)\big)\, \delta_{\nu_h^*}$, with $U(y; V_h, L_h) = V_h K(y, L_h)$.
Here $V_h$ is a probability weight, $L_h$ is a random location with the same domain as $Y$, and $K(y, L_h)$ is a pre-specified kernel function measuring the similarity between $y$ and $L_h$, e.g., $K(y, L_h) = \mathbf{1}_{\{|y - L_h| < \phi\}}$ or $K(y, L_h) = \exp(-\phi |y - L_h|^2)$. The dependence of the weights $U(y; V_h, L_h)$ on $y$ implies dependence among the $G_y$'s.
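
A minimal numpy sketch of how the kernel stick-breaking weights vary with $y$, assuming the Gaussian kernel above; the helper name ksb_weights and the toy parameter values are illustrative.

```python
import numpy as np

def ksb_weights(y, V, L, phi):
    """Kernel stick-breaking weights at a response value y:
    U_h = V_h * K(y, L_h) with a Gaussian kernel, pi_h = U_h * prod_{l<h}(1 - U_l)."""
    K = np.exp(-phi * (y - L) ** 2)      # Gaussian kernel K(y, L_h)
    U = V * K
    stick = np.concatenate(([1.0], np.cumprod(1.0 - U[:-1])))
    return U * stick                     # weights of G_y (sum <= 1 under truncation)

# toy check: weights at two nearby y values are similar, so G_y varies smoothly in y
rng = np.random.default_rng(0)
H = 25
V, L = rng.beta(1, 1, H), rng.uniform(0, 1, H)
w1, w2 = ksb_weights(0.30, V, L, phi=10.0), ksb_weights(0.32, V, L, phi=10.0)
```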

The Loading Matrix A
Recall $X \mid (Y = y, A, \nu_{yx}, \mu, \Sigma) \sim N(\mu + A\nu_{yx}, \Sigma)$, with $A \in \mathbb{R}^{p \times d}$ and $\mathcal{B} = \mathrm{span}(B) = \mathrm{span}(\Sigma^{-1} A) \in \mathcal{G}_{(d,p)}$. Constrain $A$ to a standard form in which the upper $d \times d$ block is lower triangular, the identifiability structure of Lopes and West (2004):
$A = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ a_{d1} & a_{d2} & \cdots & a_{dd} \\ \vdots & & & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pd} \end{pmatrix}$
Then $\nu_{yx}$ must have unit variance.

Likelihood
$p(\text{data} \mid A, \Sigma, \nu, \mu) \propto \det(\Sigma^{-1})^{n/2} \exp\Big\{ -\frac{1}{2} \sum_{i=1}^{n} (x_i - \mu - A\nu_i)^\top \Sigma^{-1} (x_i - \mu - A\nu_i) \Big\}$
Normal priors on $\mu$ and $A$, a normal base measure for the $\nu_i$, and a Wishart prior for $\Sigma^{-1}$ imply conjugacy.

Priors for A and $\Sigma^{-1}$
Prior for $A$: $a_{lj} \sim N(0, \phi_a^{-1})$, $l \geq j$, $l = 1, \dots, p$; the full conditional is normal.
Prior for $\Sigma^{-1}$: $\mathrm{Wishart}(\mathrm{df}, V_D)$.

Sampling $\nu_i$ in Classification
Marginal approach and conditional approach (Ishwaran and James, 2001). Pólya-urn representation of the prior for $\nu_i$:
$\nu_i \mid (y_i = c, \nu_{-i}) \sim \sum_{j \neq i,\, y_j = c} \delta_{\nu_j} + \alpha_0 G_0(\nu_i)$ (up to normalization)
A convenient choice for $G_0$ is $N(0, I_d)$.

Sampling $\nu_i$ in Classification
Full conditional (Escobar and West, 1995):
$\nu_i \mid (\text{data}, y_i = c, \nu_{-i}, A, \Sigma) \sim \sum_{j \neq i,\, y_j = c} q_{i,j}\, \delta_{\nu_j} + q_{i,0}\, G_i(\nu_i)$
$G_i(\nu_i) = N\big(V_\nu A^\top \Sigma^{-1}(x_i - \mu),\; V_\nu\big)$
$q_{i,j} \propto \exp\big\{ -\tfrac{1}{2}(x_i - \mu - A\nu_j)^\top \Sigma^{-1}(x_i - \mu - A\nu_j) \big\}$
$q_{i,0} \propto \alpha_0 |V_\nu|^{1/2} \exp\big\{ -\tfrac{1}{2}(x_i - \mu)^\top\big(\Sigma^{-1} - \Sigma^{-1} A V_\nu A^\top \Sigma^{-1}\big)(x_i - \mu) \big\}$
$V_\nu = (A^\top \Sigma^{-1} A + I_d)^{-1}$

Prior for $\nu_i$ in Regression
Truncation approximation:
$G_y = \sum_{h=1}^{H} \Big( U(y; V_h, L_h) \prod_{l<h}\big(1 - U(y; V_l, L_l)\big) \Big)\, \delta_{\nu_h^*}$, with $U(y; V_h, L_h) = V_h K(y, L_h)$,
$V_h \sim \mathrm{Beta}(a_h, b_h)$, $L_h \sim \mathrm{Unif}(\min(y_i), \max(y_i))$, $\nu_h^* \sim N(0, I_d)$.
Introduce latent variables (Dunson and Park, 2008): $K_i$ is the mixture label, i.e., $K_i = h$ means sample $i$ is assigned to the $h$-th mixture component; $A_{ih} \sim \mathrm{Ber}(V_h)$, $B_{ih} \sim \mathrm{Ber}(K(y_i, L_h))$.

Sampling $\nu_i$ in Regression
$P(K_i = h \mid \cdots) \propto \exp\big\{ -\tfrac{1}{2}(x_i - \mu - A\nu_h^*)^\top \Sigma^{-1}(x_i - \mu - A\nu_h^*) \big\}\, U(y_i; V_h, L_h) \prod_{l<h}\big(1 - U(y_i; V_l, L_l)\big)$ (multinomial). If the sampled index is $h$, set $\nu_i = \nu_h^*$.
$\nu_h^* \sim N\Big( (n_h A^\top \Sigma^{-1} A + I_d)^{-1} A^\top \Sigma^{-1} \sum_{i \in C_h} (x_i - \mu),\; (n_h A^\top \Sigma^{-1} A + I_d)^{-1} \Big)$, where $C_h$ denotes the index set of the $h$-th cluster and $n_h = \#(C_h)$.
$V_h \sim \mathrm{Beta}\big(a_h + \sum_{i: K_i \geq h} A_{ih},\; b_h + \sum_{i: K_i \geq h} (1 - A_{ih})\big)$ for $h < H$, and $V_H \equiv 1$.

Sampling $\nu_i$ in Regression
$A_{ih} = B_{ih} = 1$ for $h = K_i$; for $h < K_i$:
$P(A_{ih} = 1, B_{ih} = 0) \propto \dfrac{V_h\,(1 - K(y_i, L_h))}{1 - V_h K(y_i, L_h)}$, $\quad P(A_{ih} = 0, B_{ih} = 1) \propto \dfrac{(1 - V_h)\, K(y_i, L_h)}{1 - V_h K(y_i, L_h)}$, $\quad P(A_{ih} = 0, B_{ih} = 0) \propto \dfrac{(1 - V_h)(1 - K(y_i, L_h))}{1 - V_h K(y_i, L_h)}$.

Sampling $\nu_i$ in Regression
Metropolis-Hastings step for sampling $L_h$, with proposal distribution
$L_h \sim \mathrm{Unif}\big(\min_{i: B_{ih}=1}(y_i),\; \max_{i: B_{ih}=1}(y_i)\big)$ when $\{i : B_{ih} = 1\} \neq \emptyset$, and $L_h \sim \mathrm{Unif}(\min(y_i), \max(y_i))$ otherwise.
With $K(y, L_h) = \exp(-\phi |y - L_h|^2)$, $\phi$ controls the intensity of borrowing information across $Y$.

Posterior for the d.r. Subspace
Recall $\mathcal{B} = \mathrm{span}(B) = \mathrm{span}(\Sigma^{-1} A) \in \mathcal{G}_{(d,p)}$; the Grassmann manifold $\mathcal{G}_{(d,p)}$ is the parameter space for the d.r. subspace. Posterior samples of the d.r. subspace, denoted $\{\mathcal{B}^{(1)}, \dots, \mathcal{B}^{(T)}\}$, lie on $\mathcal{G}_{(d,p)}$. [figure]

Posterior for the d.r. Subspace
The Bayes estimate of the posterior mean should be taken with respect to the distance metric on $\mathcal{G}_{(d,p)}$:
$\mathcal{B}_{\text{Bayes}} = \arg\min_{\mathcal{B} \in \mathcal{G}_{(d,p)}} \sum_{t=1}^{T} \mathrm{dist}^2(\mathcal{B}^{(t)}, \mathcal{B})$,
the Karcher mean (Karcher, 1977). A standard deviation measure:
$\mathrm{std}(\{\mathcal{B}^{(1)}, \dots, \mathcal{B}^{(T)}\}) = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \mathrm{dist}^2(\mathcal{B}^{(t)}, \mathcal{B}_{\text{Bayes}})}$

Distance on the Grassmann Manifold
Given two subspaces $\mathcal{W}_1$ and $\mathcal{W}_2$ spanned by orthonormal bases $W_1$ and $W_2$ respectively, the geodesic distance between $\mathcal{W}_1$ and $\mathcal{W}_2$ is computed as (Karcher, 1977; Kendall, 1990):
$\big(I - W_1(W_1^\top W_1)^{-1} W_1^\top\big) W_2 (W_1^\top W_2)^{-1} = U \Sigma V^\top$ (SVD), $\quad \Theta = \mathrm{atan}(\Sigma)$, $\quad \mathrm{dist}(\mathcal{W}_1, \mathcal{W}_2) = \sqrt{\mathrm{Trace}(\Theta^2)}$.
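
A small numpy sketch of the Grassmann geodesic distance; it uses the equivalent principal-angle formulation (arccos of the singular values of $W_1^\top W_2$) rather than the atan construction written on the slide, and the function name is an illustrative assumption.

```python
import numpy as np

def grassmann_distance(W1, W2):
    """Geodesic distance between span(W1) and span(W2) via principal angles.
    W1, W2: p x d matrices with orthonormal columns."""
    s = np.linalg.svd(W1.T @ W2, compute_uv=False)   # cosines of the principal angles
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    return np.sqrt(np.sum(theta ** 2))

# example: distance between two random 2-dimensional subspaces of R^5
rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.standard_normal((5, 2)))
Q2, _ = np.linalg.qr(rng.standard_normal((5, 2)))
print(grassmann_distance(Q1, Q2))
```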

d.r. Dimension: d
Model comparison: $\mathrm{BF}(d_1, d_2) = p(\text{data} \mid d_1)\,/\,p(\text{data} \mid d_2)$, with marginal likelihood $p(\text{data} \mid d) = \int_\theta p(\text{data} \mid d, \theta)\, p_{\text{prior}}(\theta)\, d\theta$.
Out-of-sample validation.

Large p
When $p \gg n$, preprocess with PCA: constrain $\mu_{yx} - \mu = A\nu_{yx}$ and $\Sigma^{-1}$ to lie in $\mathrm{span}(x_1, \dots, x_n)$.
SVD: $X = U_X D_X V_X^\top$, $V_X \in \mathbb{R}^{p \times p'}$ with $p' \leq \min(p, n)$. Then $A = V_X \tilde{A}$ ($\tilde{A} \in \mathbb{R}^{p' \times d}$) and $\Sigma = V_X \tilde{\Sigma} V_X^\top$ ($\tilde{\Sigma} \in \mathbb{R}^{p' \times p'}$).
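
A minimal numpy sketch of this SVD/PCA preprocessing step; the function name and the option to keep only the leading components are assumptions for illustration.

```python
import numpy as np

def pca_preprocess(X, n_components=None):
    """Represent centered samples in the span of the data (the p >> n case):
    X is n x p; returns the reduced-space scores and the basis V."""
    Xc = X - X.mean(axis=0)
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U diag(D) Vt
    k = n_components if n_components is not None else len(D)
    V = Vt[:k].T                  # p x k basis of the span of the centered samples
    return Xc @ V, V              # scores are n x k

# any loading A_tilde (k x d) estimated in the reduced space maps back as A = V @ A_tilde
```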

Swiss Roll
Data generated in $\mathbb{R}^{10}$ (Wu et al., 2008) with structure $X_1 = t\cos(t)$, $X_2 = h$, $X_3 = t\sin(t)$, where $t = \frac{3\pi}{2}(1 + 2\theta)$, $\theta \sim \mathrm{Unif}(0,1)$, $h \sim \mathrm{Unif}(0,1)$ (Swiss roll); the remaining 7 dimensions are independent Gaussian noise. Response: $Y = \sin(5\pi\theta) + h^2 + \varepsilon$, $\varepsilon \sim N(0, 0.01)$. [Figure: (a) $Y$ vs. $X_1, X_2, X_3$; (b) $X_3$ vs. $X_1$]
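
A small numpy sketch generating this Swiss-roll regression data; the scale of the 7 noise dimensions is not stated on the slide, so standard normal noise is assumed, and the function name is illustrative.

```python
import numpy as np

def swiss_roll_data(n, rng=None):
    """Generate the Swiss-roll regression data described on this slide."""
    rng = np.random.default_rng(rng)
    theta = rng.uniform(0, 1, n)
    h = rng.uniform(0, 1, n)
    t = 1.5 * np.pi * (1 + 2 * theta)
    X = np.column_stack([t * np.cos(t), h, t * np.sin(t),
                         rng.standard_normal((n, 7))])    # 7 noise dimensions (assumed N(0,1))
    y = np.sin(5 * np.pi * theta) + h ** 2 + rng.normal(0, 0.1, n)   # noise variance 0.01
    return X, y

X, y = swiss_roll_data(500)
```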

Swiss Roll: Accuracy Measure
The true d.r. subspace $\mathcal{B}$ is spanned by the first three coordinate dimensions. Metric to measure estimation accuracy: let $\hat{B} = (\hat{\beta}_1, \dots, \hat{\beta}_d)$ denote the estimate of $B$. Accuracy:
$\frac{1}{d}\sum_{i=1}^{d} \|P_{\mathcal{B}}\, \hat{\beta}_i\|^2 = \frac{1}{d}\sum_{i=1}^{d} \|(B B^\top)\, \hat{\beta}_i\|^2$,
where $P_{\mathcal{B}}$ denotes the orthogonal projection onto the column space of $B$.
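
The accuracy metric above in a few lines of numpy, assuming the columns of the true basis are orthonormal and the estimated directions are unit norm; dr_accuracy is an illustrative name.

```python
import numpy as np

def dr_accuracy(B_true, B_hat):
    """Average squared norm of the projections of the estimated directions
    onto the true d.r. subspace (1 = perfect recovery, 0 = orthogonal)."""
    P = B_true @ B_true.T                  # orthogonal projector onto span(B_true)
    proj = P @ B_hat
    return np.mean(np.sum(proj ** 2, axis=0))

# Swiss roll example: true subspace = first three coordinate axes of R^10
B_true = np.eye(10)[:, :3]
```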

Swiss Roll: Accuracy [figure]

Swiss Roll: Uncertainty
Standard deviation of the posterior samples, and the distance between the posterior mean and the true d.r. subspace. [figure]

Swiss Roll: Mixture Component Labels [figure]

Swiss Roll: Out-of-sample Mean-Square Error vs. d [figure]

Iris Background
The data consist of 3 classes with 50 instances each. Each class refers to a type of iris plant (Setosa, Virginica and Versicolour) and has 4 predictors describing the length and width of the sepal and petal. We merge Setosa and Virginica into a single class as in Sugiyama (2007).

Iris Embedding (Sugiyama, 2007) [figure]

Iris Embedding (BMI) [figure]

Handwritten Digit Data [figure]

Digit d.r. [Figure: (c) posterior mean of 3 vs 8; (d) posterior mean of 5 vs 8]

Gradient
Additive error regression model: $Y = f(X) + \epsilon = g(\beta_1^\top X, \dots, \beta_d^\top X) + \epsilon$. The gradient of the regression function lies in $\mathcal{B} = \mathrm{span}(\beta_1, \dots, \beta_d)$:
$\nabla f = \Big(\frac{\partial f}{\partial x^1}, \dots, \frac{\partial f}{\partial x^p}\Big)^\top$

Gradient Outer Product
Gradient outer product (GOP) matrix (Mukherjee et al., 2006):
$\Gamma = E_X\big[ (\nabla f)(\nabla f)^\top \big] \in \mathbb{R}^{p \times p}$
$\Gamma$ has rank at most $d$; if $\{v_1, \dots, v_d\}$ are the eigenvectors associated with the nonzero eigenvalues, then $\mathcal{B} = \mathrm{span}(v_1, \dots, v_d)$. In addition, $\Gamma$ can be viewed as a covariance matrix (Wu et al., 2007), since $\Gamma_{ij} = E\big(\frac{\partial f}{\partial x^i}\,\frac{\partial f}{\partial x^j}\big)$ indicates the covariation of predictors $x^i$ and $x^j$ relevant to prediction.
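
A minimal numpy sketch of how the d.r. subspace would be read off an empirical GOP built from per-sample gradient estimates (here assumed given, e.g., from the kernel model introduced later); the function name and the 1/n scaling are illustrative assumptions.

```python
import numpy as np

def gop_subspace(grads, d):
    """Empirical gradient outer product and its top-d eigenvectors.
    grads: n x p array whose i-th row estimates grad f(x_i)."""
    n = grads.shape[0]
    Gamma = grads.T @ grads / n                  # (1/n) sum_i grad f(x_i) grad f(x_i)'
    evals, evecs = np.linalg.eigh(Gamma)         # eigenvalues in ascending order
    top = evecs[:, np.argsort(evals)[::-1][:d]]  # columns span the estimated d.r. space
    return Gamma, top
```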

Graphical Models
A graphical model can be used to infer the conditional dependence structure of the predictors that are predictive. For a multivariate Gaussian $p(x) \propto \exp(-\tfrac{1}{2} x^\top J x + h^\top x)$, the precision matrix $J$ is the conditional independence matrix. The partial correlation matrix $R$ has elements
$r_{ij} = \frac{\mathrm{cov}(x^i, x^j \mid S_{/ij})}{\sqrt{\mathrm{var}(x^i \mid S_{/ij})\,\mathrm{var}(x^j \mid S_{/ij})}} = -\frac{J_{ij}}{\sqrt{J_{ii}\, J_{jj}}}$,
a measure of dependence between variables $i$ and $j$ conditioned on all the other variables $S_{/ij}$ ($i \neq j$). Set $\Gamma^{-1} = J$.
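
A short numpy sketch of the partial-correlation computation, using a pseudo-inverse of the GOP as mentioned later in the talk; the helper name and the unit-diagonal convention are assumptions.

```python
import numpy as np

def partial_correlations(Gamma):
    """Partial correlations from J = Gamma^{-1}: r_ij = -J_ij / sqrt(J_ii J_jj).
    A pseudo-inverse is used since the GOP can be rank deficient."""
    J = np.linalg.pinv(Gamma)
    d = np.sqrt(np.diag(J))
    R = -J / np.outer(d, d)
    np.fill_diagonal(R, 1.0)      # convention: unit diagonal
    return R
```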

Modeling Errors
For the regression case,
$y_i = f(x_i) + \varepsilon_i = f(x_j) + \nabla f(x_i)^\top (x_i - x_j) + O(\|x_i - x_j\|^2) + \varepsilon_i$, $\quad i = 1, \dots, n$.
The non-random term $O(\|x_i - x_j\|^2)$ has an absolute magnitude positively associated with the distance $d_{x_i, x_j}$; we model it as a stochastic term with mean 0 and variance positively associated with $d_{x_i, x_j}$:
$y_i = f(x_j) + \nabla f(x_i)^\top (x_i - x_j) + \varepsilon_{ij}$,
where $\varepsilon_{ij}$ has mean 0 and variance positively associated with $d_{x_i, x_j}$.

Error Structure
Ideally one would specify a full covariance structure for the $\varepsilon_{ij}$'s, respecting the fact that $\varepsilon_{ij}$ and $\varepsilon_{ij'}$ should covary when $j'$ is close to $j$. This involves specifying a covariance structure of size $n^2 \times n^2$, where $n$ is the sample size, which is extremely expensive computationally. To simplify, we assume the $\varepsilon_{ij}$'s are independently $N(0, \sigma^2_{ij})$ with $\sigma^2_{ij} = \sigma^2 / w_{ij}$, where $\sigma^2$ is a (scale) variance parameter and the $w_{ij}$'s are weights, for instance Gaussian weights $w_{ij} = \exp(-d^2_{x_i, x_j} / 2\sigma_w^2)$.
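
A one-function numpy sketch of the Gaussian weights $w_{ij}$ used in this simplification; the function name is an assumption.

```python
import numpy as np

def gaussian_weights(X, sigma_w=1.0):
    """Weights w_ij = exp(-||x_i - x_j||^2 / (2 sigma_w^2)); the per-pair error
    variance is sigma^2 / w_ij, so distant pairs get larger error variance."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma_w ** 2))
```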

Regularization Methods
We need to model $f$ and $\nabla f$. The regularization framework for learning a function $f$: minimize Loss + Penalty,
$\hat{f} = \arg\min_{f \in \mathcal{H}_k} \big[ L(f, \text{data}) + \lambda \|f\|^2_{\mathcal{H}_k} \big]$.
$\mathcal{H}_k$ is the reproducing kernel Hilbert space (RKHS) generated by a positive semi-definite function (kernel) $k(x, u)$, $x, u \in \mathcal{X} \subset \mathbb{R}^p$, e.g., $k(x, u) = \exp\{-\frac{1}{2\sigma_k^2}\|x - u\|^2\}$; $\mathcal{H}_k = \overline{\mathrm{span}}(k_{u_1}, k_{u_2}, \dots)$, where $k_{u_i}(\cdot) = k(\cdot, u_i)$, $u_i \in \mathbb{R}^p$. Example: support vector machines, $\hat{f}(x) = \arg\min_{f \in \mathcal{H}_k} \big[ \sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda \|f\|^2_{\mathcal{H}_k} \big]$.

The Representer Theorem
The representer theorem of Kimeldorf and Wahba (1971):
$\hat{f}(x) = \sum_{i=1}^n \hat{w}_i k(x, x_i) = \sum_{i=1}^n \hat{w}_i k_{x_i}(x)$,
where $x_1, \dots, x_n \in \mathbb{R}^p$ are the samples. This reduces the problem from infinite-dimensional to the $n$-dimensional $\mathrm{span}(k_{x_1}, \dots, k_{x_n})$. This form is achieved purely by regularization; we want to justify it through a proper Bayesian prior specification.
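
To illustrate the representer form, here is a minimal numpy sketch of a kernel estimator $\hat f(x) = \sum_i \hat w_i k(x, x_i)$ obtained with squared-error loss (kernel ridge regression); this is only a concrete instance of the regularization framework, not the Bayesian kernel model the talk builds, and the names and default parameters are assumptions.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma_k=1.0):
    """Gaussian kernel matrix k(x, z) = exp(-||x - z||^2 / (2 sigma_k^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma_k ** 2))

def kernel_ridge(X, y, lam=1e-2, sigma_k=1.0):
    """Fit f(x) = sum_i w_i k(x, x_i) with squared-error loss and RKHS penalty."""
    K = gaussian_kernel(X, X, sigma_k)
    w = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Xnew: gaussian_kernel(Xnew, X, sigma_k) @ w

# usage: f_hat = kernel_ridge(X, y); predictions = f_hat(X_test)
```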

Eigen-decomposition of the Kernel
Let $\{\lambda_j\}$ and $\{\phi_j(x)\}$, $j = 1, \dots$, be the eigenvalues and eigenfunctions of $k$, i.e., $\lambda_j \phi_j(x) = \int_{\mathcal{X}} k(x, u)\, \phi_j(u)\, d\mu(u)$. Then
$\mathcal{H}_k = \big\{ f \,\big|\, f(x) = \sum_{j=1}^{\infty} c_j \phi_j(x) \text{ s.t. } \sum_{j=1}^{\infty} c_j^2 / \lambda_j < \infty \big\}$.
A prior over $\{(c_j)_{j=1}^{\infty} : \sum_j c_j^2/\lambda_j < \infty\}$ implies a prior on $\mathcal{H}_k$. Problems: sampling from an infinite-dimensional space; the constraint; the eigenfunctions are not computable; etc.

Prior on the Function Space
$\mathcal{H}_k$ is equivalent to $\mathcal{G} = \big\{ g \,\big|\, g(x) = \int k(x, u)\, d\gamma(u), \; \gamma \in \Gamma_0 \big\}$, where $\Gamma_0$ is a subset of the space of signed Borel measures (Pillai et al., 2006). We model $\gamma$ by a random probability distribution $G(u)$ and a random coefficient function $w(u)$, so that
$\mathcal{G} = \big\{ g \,\big|\, g(x) = \int k(x, u)\, w(u)\, dG(u) \big\}$.
Take $G$ to be the marginal distribution of $X$ and place a Dirichlet process (DP) prior on $G$, i.e., $G \sim DP(\alpha, G_0)$.

Posterior: Finite Representation
Posterior DP (Schervish, 1995): given a sample $(x_1, \dots, x_n)$ from $G \sim DP(\alpha, G_0)$,
$G \mid (x_1, \dots, x_n) \sim DP(\alpha + n, G_n)$, with $G_n = \frac{1}{\alpha + n}\big(\alpha G_0 + \sum_{i=1}^{n} \delta_{x_i}\big)$.
$E\big(g \mid (x_1, \dots, x_n)\big) = \int k(x, u)\, w(u)\, d\big(E(G(u) \mid X^n)\big) = \int k(x, u)\, w(u)\, dG_n(u) = \frac{\alpha}{\alpha + n} \int k(x, u)\, w(u)\, dG_0(u) + \frac{1}{\alpha + n} \sum_{i=1}^{n} w(x_i)\, k(x, x_i)$,
which as $\alpha \to 0$ becomes proportional to $\sum_{i=1}^{n} w(x_i)\, k(x, x_i)$.

Represent the Gradient
We have justified $f = \sum_{i=1}^{n} \alpha_i\, k(x, x_i)$ and $(\nabla f)^{(j)} = \sum_{i=1}^{n} c_{ji}\, k(x, x_i)$, $j = 1, \dots, p$. Denote $\alpha = (\alpha_1, \dots, \alpha_n)^\top \in \mathbb{R}^n$ and $C = (c_{ji}) \in \mathbb{R}^{p \times n}$. An estimate of the GOP is then
$\hat{\Gamma} \propto \sum_{i=1}^{n} \nabla f(x_i)\, \nabla f(x_i)^\top = C K^2 C^\top$,
where $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$.

Likelihood
For each $i = 1, \dots, n$,
$y_i \mathbf{1} = K\alpha + D_i C K_i + \varepsilon_i$,
where $\mathbf{1} = (1, \dots, 1)^\top \in \mathbb{R}^n$, $D_i = \mathbf{1} x_i^\top - (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times p}$, $K_i$ is the $i$-th column of $K$, and $\varepsilon_i = (\varepsilon_{i1}, \dots, \varepsilon_{in})^\top$.
There are too many parameters (especially when $p \gg n$). West (2003) developed a strategy using empirical factor analysis for similar large-$p$, small-$n$ regressions; the idea is to apply a singular value decomposition (SVD) to the data matrix.

SVD Decomposition
$K = F_K \Lambda F_K^\top$, hence $K\alpha = F\beta$ with $F = F_K \Lambda$ and $\beta = F_K^\top \alpha$; we keep the first $m$ ($m < n$) columns of $F$ corresponding to the large singular values.
Let $M_X = (x_1 - x_n, \dots, x_{n-1} - x_n)$ with SVD $M_X = V \Lambda_M U^\top$, $V \in \mathbb{R}^{p \times s}$ and $s \leq \min(n-1, p)$. Then for each $i$ there is $D_i^* \in \mathbb{R}^{n \times s}$ such that $D_i = D_i^* V^\top$. Put $C^* (\in \mathbb{R}^{s \times n}) = V^\top C$, so $D_i C = D_i^* C^*$. This reduces the dimension from $p \times n$ to $s \times n$, and independent priors can conveniently be placed on $\beta$ and $C^*$.

Likelihood
Since the $\varepsilon_{ij}$'s are independently $N(0, \sigma^2_{ij})$ with $\sigma^2_{ij} = \sigma^2/w_{ij}$ and $\phi^{-1} = \sigma^2$, the likelihood is
$\propto \phi^{n^2/2} \exp\Big\{ -\frac{\phi}{2} \sum_{i=1}^{n} [y_i \mathbf{1} - F\beta - D_i^* C^* K_i]^\top W_i\, [y_i \mathbf{1} - F\beta - D_i^* C^* K_i] \Big\}$,
where $W_i = \mathrm{diag}(w_{i1}, \dots, w_{in})$.

Prior and Sampling: $\beta$
$\beta (\in \mathbb{R}^{m \times 1})$. Prior: $\beta \sim N(0, \Psi^{-1})$, $\Psi = \mathrm{diag}(\psi_1, \dots, \psi_m)$, $\psi_i \sim \mathrm{Gamma}(a_\psi/2, b_\psi/2)$.
Full conditional: $\beta \mid y, \dots \sim N(\hat{\beta}, \hat{V}_\beta)$, where
$\hat{V}_\beta = \big(F^\top(\textstyle\sum_i \phi W_i)F + \Psi\big)^{-1}$, $\quad \hat{\beta} = \phi\, \hat{V}_\beta\, F^\top \sum_i W_i a_i$, $\quad$ with $a_i = y_i \mathbf{1} - D_i^* C^* K_i$.

Prior and Sampling: $C^*$
$C^* = (c^*_1, \dots, c^*_n) (\in \mathbb{R}^{s \times n})$. Prior: $c^*_j \sim N(0, \Phi^{-1})$, $\Phi = \mathrm{diag}(\varphi_1, \dots, \varphi_s)$, $\varphi_i \sim \mathrm{Gamma}(a_\varphi/2, b_\varphi/2)$.
The likelihood part for $c^*_j$ is $N(\mu_j, V_j)$, where $V_j = \big(\phi \sum_i K_{ij}^2\, D_i^{*\top} W_i D_i^*\big)^{-1}$ and $\mu_j = \phi\, V_j \sum_i K_{ij}\, D_i^{*\top} W_i b_i^j$, with $b_i^j = y_i \mathbf{1} - F\beta - D_i^* \sum_{k \neq j} c^*_k K_{ik}$.
Full conditional: $c^*_j \mid \dots \sim N(\mu, V)$, where $V = \big(V_j^{-1} + \mathrm{diag}(\varphi_1, \dots, \varphi_s)\big)^{-1}$ and $\mu = V(V_j^{-1}\mu_j)$.

Prior and Sampling: $\phi$ and Other Parameters
With the improper prior $1/\phi$,
$\phi \mid y, \dots \sim \mathrm{Gamma}\Big(\frac{n^2}{2},\; \frac{\sum_i [y_i \mathbf{1} - F\beta - D_i^* C^* K_i]^\top W_i\, [y_i \mathbf{1} - F\beta - D_i^* C^* K_i]}{2}\Big)$;
$\psi_i \mid \dots \sim \mathrm{Gamma}\big(\frac{a_\psi + 1}{2},\; \frac{b_\psi + \beta_i^2}{2}\big)$, $\quad \varphi_i \mid \dots \sim \mathrm{Gamma}\big(\frac{a_\varphi + n}{2},\; \frac{b_\varphi + \sum_{k=1}^{n} c_{ik}^{*2}}{2}\big)$.

Binary Classification
The responses are $y_i \in \{0, 1\}$, $i = 1, \dots, n$. Probit model: let $p_i = P(y_i = 1)$ with link $\Phi^{-1}(p_i) = \mu_i$, where $\Phi$ is the standard normal cdf and $\mu_i$ is a predictor. Introduce $z_i \sim N(\mu_i, 1)$ with $z_i > 0 \Leftrightarrow y_i = 1$. The sampling schemes above for $\beta$ and $C^*$ stay the same with every $y_i$ replaced by $z_i$. Each $z_i$ has a truncated normal full conditional:
$z_i \mid \dots \sim N^+(\hat{z}_i, 1)$ if $y_i = 1$, and $z_i \mid \dots \sim N^-(\hat{z}_i, 1)$ if $y_i = 0$,
where $N^+$ and $N^-$ denote a normal truncated to the positive and negative half-line, respectively, and $(\hat{z}_1, \dots, \hat{z}_n)^\top = F\beta$.
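
A short sketch of the truncated-normal draw for the probit latent variables, via the inverse-cdf method with scipy; the function name is an assumption, and numerical safeguards for extreme values of $\hat z_i$ are omitted.

```python
import numpy as np
from scipy import stats

def sample_probit_latents(z_hat, y, rng=None):
    """Draw z_i ~ N(z_hat_i, 1) truncated to (0, inf) when y_i = 1 and to
    (-inf, 0) when y_i = 0, using the inverse-cdf method."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=len(y))
    lo = np.where(y == 1, stats.norm.cdf(0.0 - z_hat), 0.0)   # standardized lower bound cdf
    hi = np.where(y == 1, 1.0, stats.norm.cdf(0.0 - z_hat))   # standardized upper bound cdf
    return z_hat + stats.norm.ppf(lo + u * (hi - lo))
```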

Posterior for the GOP
Given draws $\{C^{*(t)}\}_{t=1}^{T}$ we compute $\{C^{(t)}\}_{t=1}^{T}$ from the relation $C = VC^*$; the posterior draws of the GOP are then $\hat{\Gamma}^{(t)} = C^{(t)} K^2 (C^{(t)})^\top$. We then compute the posterior mean GOP matrix as well as a variance estimate,
$\hat{\mu}_{\hat{\Gamma}} = \frac{1}{T}\sum_{t=1}^{T} \hat{\Gamma}^{(t)}$, $\quad \hat{\sigma}^2_{\hat{\Gamma}} = \frac{1}{T}\sum_{t=1}^{T} \big(\hat{\Gamma}^{(t)} - \hat{\mu}_{\hat{\Gamma}}\big)^2_e$,
where $(\cdot)^2_e$ denotes the element-wise square.

Posterior for the d.r. Subspace
Recall the d.r. subspace $\mathcal{B} = \mathrm{span}(v_1, \dots, v_d)$, where $\{v_1, \dots, v_d\}$ are the eigenvectors associated with the largest $d$ eigenvalues of the GOP; the d.r. subspace lies on the manifold $\mathcal{G}_{(d,p)}$. A spectral decomposition of $\hat{\Gamma}^{(t)}$ then provides a posterior draw of the d.r. subspace $\mathcal{B}^{(t)}$, and
$\mathcal{B}_{\text{Bayes}} = \arg\min_{\mathcal{B} \in \mathcal{G}_{(d,p)}} \sum_{t=1}^{T} \mathrm{dist}^2(\mathcal{B}^{(t)}, \mathcal{B})$, $\quad \mathrm{std}(\{\mathcal{B}^{(1)}, \dots, \mathcal{B}^{(T)}\}) = \sqrt{\frac{1}{T}\sum_{t=1}^{T} \mathrm{dist}^2(\mathcal{B}^{(t)}, \mathcal{B}_{\text{Bayes}})}$

Posterior for the Conditional Independence
Conditional independence and partial correlations:
$J^{(t)} = (\hat{\Gamma}^{(t)})^{-1}$ (using a pseudo-inverse), $\quad R^{(t)}_{ij} = -\frac{J^{(t)}_{ij}}{\sqrt{J^{(t)}_{ii}\, J^{(t)}_{jj}}}$.
The mean and variance of the posterior estimates of the partial correlations are
$\hat{\mu}_R = \frac{1}{T}\sum_{t=1}^{T} R^{(t)}$, $\quad \hat{\sigma}^2_R = \frac{1}{T}\sum_{t=1}^{T} \big(R^{(t)} - \hat{\mu}_R\big)^2_e$.
These quantities can be used to infer a graphical model while evaluating the uncertainty of the correlation structure.

Linear Simulation 1
20 samples from class 0 were drawn with $x_j \sim N(1.5, 1)$ for $j = 1, \dots, 10$, $x_j \sim N(-1.5, 1)$ for $j = 11, \dots, 20$, and $x_j \sim N(0, 0.1)$ for $j = 21, \dots, 80$. Samples from class 1 were drawn with $x_j \sim N(1.5, 1)$ for $j = 41, \dots, 50$, $x_j \sim N(-1.5, 1)$ for $j = 51, \dots, 60$, and $x_j \sim N(0, 0.1)$ for $j = 1, \dots, 40, 61, \dots, 80$.

Linear Simulation 1 [Figure: (e) posterior mean of GOP; (f) top d.r. direction]

Swiss Roll Accuracy [figure]

Iris Embedding [figure]

Digits [Figure: (g) posterior mean of 3 vs 8; (h) posterior mean of 5 vs 8]

Linear Simulation 2
The predictor variables correspond to a five-dimensional random vector drawn from the model $X_1 = \theta_1$, $X_2 = \theta_1 + \theta_2$, $X_3 = \theta_3 + \theta_4$, $X_4 = \theta_4$, $X_5 = \theta_5 - \theta_4$, where the $\theta_i \sim N(0, 1)$. The regression model is $Y = X_1 + X_3 + X_5 + \varepsilon$, where $\varepsilon \sim N(0, 0.25)$. $X_1$, $X_3$ and $X_5$ are negatively correlated with respect to variation in the response, while $X_2$ and $X_4$ are not correlated with respect to variation in the response.

Linear Simulation 2 [Figures (i) and (j)]

Pathway Association
Genetic perturbations reflected by the altered expression of gene sets or pathways have been implicated in driving a normal cell to a malignant state, so it is necessary to study the relationship between pathways and the cell state (benign or malignant). Edelman et al. (2008) consider 54 prostate samples (22 benign and 32 malignant) and 522 pathways. For visualization we pick the 16 most significant pathways to build an interaction network.

Pathway Association [figure]

Development of distribution theory and proposal distributions on the Grassmann manifold.
Local dimension reduction: in Chen et al. (2009) a factor model with mixtures on the loading matrix is proposed, $X \sim N(\mu + A_X w, \phi^{-1} I)$.
Probabilistic nonlinear dimension reduction via kernels: $x_{:,j} \mid K \sim N(0, K)$, where $K \in \mathbb{R}^{n \times n}$ (Lawrence, 2005).

References

Cook, R. (2007). Fisher lecture: Dimension reduction in regression. Statistical Science 22(1).
Cook, R. and S. Weisberg (1991). Discussion of "Sliced inverse regression for dimension reduction". J. Amer. Statist. Assoc. 86.
Dunson, D. B. and J. Park (2008). Kernel stick-breaking processes. Biometrika 89.
Edelman, E., J. Guinney, J. Chi, P. Febbo, and S. Mukherjee (2008). Modeling cancer progression via pathway dependencies. PLoS Comp. Bio 4(2).
Escobar, M. and M. West (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc.
Friedman, J. H. and W. Stuetzle (1981). Projection pursuit regression. J. Amer. Statist. Assoc.
Gelfand, A., A. Kottas, and S. N. MacEachern (2005). Bayesian nonparametric spatial modeling with Dirichlet process mixing. J. Amer. Statist. Assoc. (471).
Griffin, J. and M. Steel (2006). Order-based dependent Dirichlet processes. J. Amer. Statist. Assoc.
Hastie, T. and R. Tibshirani (1996). Discriminant analysis by Gaussian mixtures. J. Roy. Statist. Soc. Ser. B 58(1).
Iorio, M. D., P. Müller, G. L. Rosner, and S. N. MacEachern (2004). An ANOVA model for dependent random measures. J. Amer. Statist. Assoc.
Ishwaran, H. and L. James (2001). Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc. (453).
Karcher, H. (1977). Riemannian center of mass and mollifier smoothing. Comm. Pure Appl. Math. (5).
Kendall, W. S. (1990). Probability, convexity and harmonic maps with small image. I. Uniqueness and fine existence. Proc. London Math. Soc. (2).
Kimeldorf, G. and G. Wahba (1971). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist. 41(2).
Li, K. (1991). Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 86.
Lopes, H. F. and M. West (2004). Bayesian model assessment in factor analysis. Statistica Sinica 14.
Mukherjee, S., Q. Wu, and D. Zhou (2006). Gradient learning and feature selection on manifolds. Technical report, ISDS Discussion Paper, Duke University.
Pillai, N., Q. Wu, F. Liang, S. Mukherjee, and R. Wolpert (2006). Characterizing the function space for Bayesian kernel models. J. Mach. Learn. Res. Under review.
Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. J. Mach. Learn. Res. 8.
Tokdar, S., Y. Zhu, and J. Ghosh (2008). A Bayesian implementation of sufficient dimension reduction in regression. Technical report, Purdue University.
West, M. (2003). Bayesian factor regression models in the large p, small n paradigm. In J. B. et al. (Ed.), Bayesian Statistics 7. Oxford.
Wu, Q., J. Guinney, M. Maggioni, and S. Mukherjee (2007). Learning gradients: Predictive models that infer geometry and dependence. J. Mach. Learn. Res.
Wu, Q., F. Liang, and S. Mukherjee (2008). Localized sliced inverse regression. Technical report.
Xia, Y., H. Tong, W. Li, and L.-X. Zhu (2002). An adaptive estimation of dimension reduction space. J. Roy. Statist. Soc. Ser. B 64(3).


More information

Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi

Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Overview Introduction Linear Methods for Dimensionality Reduction Nonlinear Methods and Manifold

More information

Nonparametric Bayes Inference on Manifolds with Applications

Nonparametric Bayes Inference on Manifolds with Applications Nonparametric Bayes Inference on Manifolds with Applications Abhishek Bhattacharya Indian Statistical Institute Based on the book Nonparametric Statistics On Manifolds With Applications To Shape Spaces

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

Latent Variable Models and EM Algorithm

Latent Variable Models and EM Algorithm SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

A Fully Nonparametric Modeling Approach to. BNP Binary Regression

A Fully Nonparametric Modeling Approach to. BNP Binary Regression A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 02-01-2018 Biomedical data are usually high-dimensional Number of samples (n) is relatively small whereas number of features (p) can be large Sometimes p>>n Problems

More information

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Machine Learning 2nd Edition

Machine Learning 2nd Edition INTRODUCTION TO Lecture Slides for Machine Learning 2nd Edition ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/ The MIT Press, 2010

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Unsupervised Learning

Unsupervised Learning 2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and

More information

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Machine Learning Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 1 / 47 Table of contents 1 Introduction

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 6 1 / 22 Overview

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Manifold Learning: Theory and Applications to HRI

Manifold Learning: Theory and Applications to HRI Manifold Learning: Theory and Applications to HRI Seungjin Choi Department of Computer Science Pohang University of Science and Technology, Korea seungjin@postech.ac.kr August 19, 2008 1 / 46 Greek Philosopher

More information

Sufficient Dimension Reduction using Support Vector Machine and it s variants

Sufficient Dimension Reduction using Support Vector Machine and it s variants Sufficient Dimension Reduction using Support Vector Machine and it s variants Andreas Artemiou School of Mathematics, Cardiff University @AG DANK/BCS Meeting 2013 SDR PSVM Real Data Current Research and

More information

Sayan Mukherjee. June 15, 2007

Sayan Mukherjee. June 15, 2007 Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University June 15, 2007 To Tommy Poggio This talk is dedicated to my advisor Tommy Poggio as

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

Nonlinear Dimensionality Reduction

Nonlinear Dimensionality Reduction Nonlinear Dimensionality Reduction Piyush Rai CS5350/6350: Machine Learning October 25, 2011 Recap: Linear Dimensionality Reduction Linear Dimensionality Reduction: Based on a linear projection of the

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Recap from previous lecture

Recap from previous lecture Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience

More information

Introduction to Graphical Models

Introduction to Graphical Models Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

A Least Squares Formulation for Canonical Correlation Analysis

A Least Squares Formulation for Canonical Correlation Analysis A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation

More information

Unsupervised dimensionality reduction

Unsupervised dimensionality reduction Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional

More information

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1). Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes

More information

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures

More information

Approximate Kernel PCA with Random Features

Approximate Kernel PCA with Random Features Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

High Dimensional Discriminant Analysis

High Dimensional Discriminant Analysis High Dimensional Discriminant Analysis Charles Bouveyron 1,2, Stéphane Girard 1, and Cordelia Schmid 2 1 LMC IMAG, BP 53, Université Grenoble 1, 38041 Grenoble cedex 9 France (e-mail: charles.bouveyron@imag.fr,

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

A Study of Relative Efficiency and Robustness of Classification Methods

A Study of Relative Efficiency and Robustness of Classification Methods A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Supervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing

Supervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing Supervised Learning Unsupervised learning: To extract structure and postulate hypotheses about data generating process from observations x 1,...,x n. Visualize, summarize and compress data. We have seen

More information