Nonparametric Bayesian Models for Supervised Dimension Reduction

Nonparametric Bayesian Models for Supervised Dimension Reduction
Department of Statistical Science, Duke University
December 2, 2009

Outline
1. Background: Dimension Reduction; Supervised Dimension Reduction
2. Inverse Regression Methods; Bayesian Mixture Inverse Modeling
3. Motivation; Error Covariance and Learning the Gradient

Dimension Reduction
Dimension reduction: find a low-dimensional representation of high-dimensional data while preserving information in a certain sense; the intrinsic dimension is assumed to be low. Formally, given samples $x_1, \dots, x_n \in \mathcal{X} \subset \mathbb{R}^p$, produce lower-dimensional features or factors $\tilde{x}_1, \dots, \tilde{x}_n \in \tilde{\mathcal{X}} \subset \mathbb{R}^d$ with $d < p$, according to criteria that ensure information is preserved in a certain sense; there exists a map $D$ such that $D(x_i) = \tilde{x}_i$. Uses: disclosing underlying data structure, graphical visualization, overcoming the curse of dimensionality, facilitating the use of other statistical methods, saving data storage cost, etc.

Dimension Reduction Methods
Linear methods (find a linear subspace): principal component analysis (PCA), factor models, ...
Nonlinear methods: kernel PCA, multidimensional scaling, ...
Manifold learning methods (explore local structure): ISOMAP, Laplacian eigenmaps, locally linear embedding, ...
These are all unsupervised.

Supervised Dimension Reduction (SDR)
Predictor variable $X \in \mathbb{R}^p$, response variable $Y \in \mathbb{R}$. Goal: find for $X$ a low-dimensional subspace (or manifold) that contains all the information needed to predict $Y$. Ignoring $Y$ can be problematic (e.g., PCA only maximizes variation in $X$).

SDR Formulation
1. $Y = g(b_1^\top X, \dots, b_d^\top X, \varepsilon)$, where $d < p$, $g$ is some unknown function and $\varepsilon$ is an error term. With $B = (b_1, \dots, b_d) \in \mathbb{R}^{p \times d}$, $\mathcal{B} = \mathrm{span}(B)$ is the dimension reduction (d.r.) subspace, and $\mathcal{B} \in \mathcal{G}_{(d,p)}$, the Grassmann manifold, i.e., the set of all $d$-dimensional linear subspaces of $\mathbb{R}^p$.
2. $Y \perp X \mid P_{\mathcal{B}} X$, where $P_{\mathcal{B}} X$ is the orthogonal projection of $X$ onto $\mathcal{B}$.
3. $Y \mid X = Y \mid P_{\mathcal{B}} X$, i.e., the conditional distribution of $Y$ given $X$ depends on $X$ only through $P_{\mathcal{B}} X$.

SDR Methods
Forward regression category: directly model $Y \mid X$ or $g$. Examples: projection pursuit regression (Friedman and Stuetzle, 1981); Bayesian sufficient dimension reduction (Tokdar et al., 2008); etc.
Inverse regression category: model $X \mid Y$. Examples: sliced inverse regression (Li, 1991); reduced-rank linear discriminant analysis; the principal fitted component model (Cook, 2007); etc.
Gradient learning category: learn $\nabla f$, the gradient of the regression function $f = E(Y \mid X)$. Examples: Mukherjee et al. (2006); Wu et al. (2007); Xia et al. (2002); etc.

Reduced-Rank Linear Discriminant Analysis (LDA)
Idea: find a subspace that maximizes between-class variation while controlling for within-class variation.

LDA Limitation
When there are multiple clusters within a class, the class mean/center cannot represent the whole class.

Sliced Inverse Regression (SIR)
SIR: slice the data in terms of $Y$ into $H$ bins, treat the bins as classes, and then proceed as in LDA.
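
To make the slicing idea concrete, here is a minimal numpy sketch of classical SIR (Li, 1991), not the Bayesian model developed later in the talk; the function name, the equal-count slicing, and the whitening details are illustrative assumptions.

```python
import numpy as np

def sir_directions(X, y, H=10, d=2):
    """Minimal sliced inverse regression: whiten X, slice y into H equal-count
    bins, average the whitened X within each slice, and take the top
    eigenvectors of the between-slice covariance of those means."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    W = np.linalg.cholesky(np.linalg.inv(Sigma))   # whitening: W W' = Sigma^{-1}
    Z = (X - mu) @ W
    slices = np.array_split(np.argsort(y), H)      # equal-count slices of y
    M = np.zeros((p, p))
    for idx in slices:
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    _, evecs = np.linalg.eigh(M)
    B = W @ evecs[:, ::-1][:, :d]                  # map directions back to the X scale
    return B / np.linalg.norm(B, axis=0)           # columns estimate the d.r. space
```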

SIR Limitations
1. As with LDA, degeneracy may occur.
2. The slicing procedure is rigid, so the information in $Y$ is not well utilized.
3. It is not probabilistic, so the uncertainty of the estimate cannot be evaluated.

LSIR and MDA
Localized SIR (LSIR) (Wu et al., 2008) uses local means: a compromise between LDA/SIR and PCA.
Mixture discriminant analysis (MDA) (Hastie and Tibshirani, 1996) fits finite Gaussian mixtures within each class.
Finite mixtures are not flexible; it is natural to go nonparametric.

Probabilistic Model for SIR
Principal fitted component (PFC) model (Cook, 2007):
$X \mid (Y = y) = \mu + A \nu_y + \varepsilon$
where $\mu \in \mathbb{R}^p$ is the intercept, $A \in \mathbb{R}^{p \times d}$ ($d < p$), $\nu_y \in \mathbb{R}^d$, and $\varepsilon \sim N(0, \Sigma)$. The columns of $\Sigma^{-1} A$ span the d.r. subspace.

Semi-parametric Bayesian Mixture Model
Extending to a mixture modeling setting:
$X \mid (Y = y) = \mu + A \nu_{yx} + \varepsilon$, with $\nu_{yx} \sim G_y$ (an unknown random distribution).
SIR corresponds to $G_y = \delta_{\nu_y}$, a point mass for samples in the same bin; more generally, $G_y$ can be a mixture, and when $y$ is continuous $G_y$ can change smoothly in $y$.

d.r. Subspace
Proposition: for this model the d.r. subspace is the span of $B = \Sigma^{-1} A$, i.e., $Y \mid X = Y \mid (\Sigma^{-1} A)^\top X$.

Dirichlet Processes (DP)
When $Y$ is discrete, i.e., $Y = c$, $c = 1, \dots, C$, a Dirichlet process (DP) prior on $G_c$ leads to a mixture model with the number of mixture components determined automatically:
$G_c \overset{\text{i.i.d.}}{\sim} DP(\alpha_0, G_0)$
$G_c = \sum_{h=1}^{\infty} \pi_h \delta_{\nu_h^*}$ (stick-breaking representation), with $\pi_h = V_h \prod_{l<h}(1 - V_l)$, $V_h \sim \mathrm{Beta}(1, \alpha_0)$, $\nu_h^* \sim G_0$.
For each sample $i$, $\nu_i \mid (y_i = c, \nu_{-i}) \sim \sum_{j \neq i,\, y_j = c} \delta_{\nu_j} + \alpha_0 G_0(\cdot)$ (up to normalization).
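
As a concrete illustration of the stick-breaking construction above, here is a small numpy sketch that draws a truncated DP random measure; the truncation level H, the function name, and the base-measure example are assumptions, not part of the talk.

```python
import numpy as np

def draw_dp_stick_breaking(alpha0, base_draw, H=50, rng=None):
    """Truncated stick-breaking draw from DP(alpha0, G0): returns atoms
    nu*_h ~ G0 and weights pi_h = V_h * prod_{l<h}(1 - V_l)."""
    rng = np.random.default_rng(rng)
    V = rng.beta(1.0, alpha0, size=H)
    V[-1] = 1.0                          # close the stick at the truncation level
    pi = V * np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    atoms = np.stack([base_draw(rng) for _ in range(H)])
    return pi, atoms

# e.g., base measure G0 = N(0, I_d) with d = 2, as on the later slides
pi, atoms = draw_dp_stick_breaking(1.0, lambda rng: rng.standard_normal(2), H=50)
```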

Dependent Dirichlet Processes (DDP)
When $Y$ is continuous we want $G_y$ to change smoothly with $y$, i.e., $G_{y_1}$ and $G_{y_2}$ should have stronger dependence the closer $y_1$ and $y_2$ are. The dependent Dirichlet process (DDP, MacEachern 1998) provides a natural framework for this:
$G_y = \sum_{h=1}^{\infty} \pi_{yh} \delta_{\nu_{yh}^*}$, with $\pi_{yh} = V_{yh} \prod_{l<h}(1 - V_{yl})$.
There are different constructions for the DDP (Gelfand et al., 2005; Iorio et al., 2004; Griffin and Steel, 2006).

Kernel Stick-breaking Process
Kernel stick-breaking process (Dunson and Park, 2008):
$G_y = \sum_{h=1}^{\infty} U(y; V_h, L_h) \prod_{l<h}\big(1 - U(y; V_l, L_l)\big)\, \delta_{\nu_h^*}$, with $U(y; V_h, L_h) = V_h K(y, L_h)$.
Here $V_h$ is a probability weight, $L_h$ is a random location with the same domain as $Y$, and $K(y, L_h)$ is a pre-specified kernel function measuring the similarity between $y$ and $L_h$, e.g., $K(y, L_h) = \mathbf{1}_{\{|y - L_h| < \phi\}}$ or $K(y, L_h) = \exp(-\phi |y - L_h|^2)$. The dependence of the weights $U(y; V_h, L_h)$ on $y$ implies dependence among the $G_y$'s.
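
A minimal numpy sketch of how the kernel stick-breaking weights vary with $y$, assuming the Gaussian kernel above; the helper name ksb_weights and the toy parameter values are illustrative.

```python
import numpy as np

def ksb_weights(y, V, L, phi):
    """Kernel stick-breaking weights at a response value y:
    U_h = V_h * K(y, L_h) with a Gaussian kernel, pi_h = U_h * prod_{l<h}(1 - U_l)."""
    K = np.exp(-phi * (y - L) ** 2)      # Gaussian kernel K(y, L_h)
    U = V * K
    stick = np.concatenate(([1.0], np.cumprod(1.0 - U[:-1])))
    return U * stick                     # weights of G_y (sum <= 1 under truncation)

# toy check: weights at two nearby y values are similar, so G_y varies smoothly in y
rng = np.random.default_rng(0)
H = 25
V, L = rng.beta(1, 1, H), rng.uniform(0, 1, H)
w1, w2 = ksb_weights(0.30, V, L, phi=10.0), ksb_weights(0.32, V, L, phi=10.0)
```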

The Loading Matrix A
Recall $X \mid (Y = y, A, \nu_{yx}, \mu, \Sigma) \sim N(\mu + A\nu_{yx}, \Sigma)$, with $A \in \mathbb{R}^{p \times d}$ and $\mathcal{B} = \mathrm{span}(B) = \mathrm{span}(\Sigma^{-1} A) \in \mathcal{G}_{(d,p)}$. Constrain $A$ to a standard form in which the upper $d \times d$ block is lower triangular, the identifiability structure of Lopes and West (2004):
$A = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ a_{d1} & a_{d2} & \cdots & a_{dd} \\ \vdots & & & \vdots \\ a_{p1} & a_{p2} & \cdots & a_{pd} \end{pmatrix}$
Then $\nu_{yx}$ must have unit variance.

Likelihood
$p(\text{data} \mid A, \Sigma, \nu, \mu) \propto \det(\Sigma^{-1})^{n/2} \exp\Big\{ -\frac{1}{2} \sum_{i=1}^{n} (x_i - \mu - A\nu_i)^\top \Sigma^{-1} (x_i - \mu - A\nu_i) \Big\}$
Normal priors on $\mu$ and $A$, a normal base measure for the $\nu_i$, and a Wishart prior for $\Sigma^{-1}$ imply conjugacy.

Priors for A and $\Sigma^{-1}$
Prior for $A$: $a_{lj} \sim N(0, \phi_a^{-1})$, $l \geq j$, $l = 1, \dots, p$; the full conditional is normal.
Prior for $\Sigma^{-1}$: $\mathrm{Wishart}(\mathrm{df}, V_D)$.

Sampling $\nu_i$ in Classification
Marginal approach and conditional approach (Ishwaran and James, 2001). Pólya-urn representation of the prior for $\nu_i$:
$\nu_i \mid (y_i = c, \nu_{-i}) \sim \sum_{j \neq i,\, y_j = c} \delta_{\nu_j} + \alpha_0 G_0(\nu_i)$ (up to normalization)
A convenient choice for $G_0$ is $N(0, I_d)$.

Sampling $\nu_i$ in Classification
Full conditional (Escobar and West, 1995):
$\nu_i \mid (\text{data}, y_i = c, \nu_{-i}, A, \Sigma) \sim \sum_{j \neq i,\, y_j = c} q_{i,j}\, \delta_{\nu_j} + q_{i,0}\, G_i(\nu_i)$
$G_i(\nu_i) = N\big(V_\nu A^\top \Sigma^{-1}(x_i - \mu),\; V_\nu\big)$
$q_{i,j} \propto \exp\big\{ -\tfrac{1}{2}(x_i - \mu - A\nu_j)^\top \Sigma^{-1}(x_i - \mu - A\nu_j) \big\}$
$q_{i,0} \propto \alpha_0 |V_\nu|^{1/2} \exp\big\{ -\tfrac{1}{2}(x_i - \mu)^\top\big(\Sigma^{-1} - \Sigma^{-1} A V_\nu A^\top \Sigma^{-1}\big)(x_i - \mu) \big\}$
$V_\nu = (A^\top \Sigma^{-1} A + I_d)^{-1}$

Prior for $\nu_i$ in Regression
Truncation approximation:
$G_y = \sum_{h=1}^{H} \Big( U(y; V_h, L_h) \prod_{l<h}\big(1 - U(y; V_l, L_l)\big) \Big)\, \delta_{\nu_h^*}$, with $U(y; V_h, L_h) = V_h K(y, L_h)$,
$V_h \sim \mathrm{Beta}(a_h, b_h)$, $L_h \sim \mathrm{Unif}(\min(y_i), \max(y_i))$, $\nu_h^* \sim N(0, I_d)$.
Introduce latent variables (Dunson and Park, 2008): $K_i$ is the mixture label, i.e., $K_i = h$ means sample $i$ is assigned to the $h$-th mixture component; $A_{ih} \sim \mathrm{Ber}(V_h)$, $B_{ih} \sim \mathrm{Ber}(K(y_i, L_h))$.

Sampling $\nu_i$ in Regression
$P(K_i = h \mid \cdots) \propto \exp\big\{ -\tfrac{1}{2}(x_i - \mu - A\nu_h^*)^\top \Sigma^{-1}(x_i - \mu - A\nu_h^*) \big\}\, U(y_i; V_h, L_h) \prod_{l<h}\big(1 - U(y_i; V_l, L_l)\big)$ (multinomial). If the sampled index is $h$, set $\nu_i = \nu_h^*$.
$\nu_h^* \sim N\Big( (n_h A^\top \Sigma^{-1} A + I_d)^{-1} A^\top \Sigma^{-1} \sum_{i \in C_h} (x_i - \mu),\; (n_h A^\top \Sigma^{-1} A + I_d)^{-1} \Big)$, where $C_h$ denotes the index set of the $h$-th cluster and $n_h = \#(C_h)$.
$V_h \sim \mathrm{Beta}\big(a_h + \sum_{i: K_i \geq h} A_{ih},\; b_h + \sum_{i: K_i \geq h} (1 - A_{ih})\big)$ for $h < H$, and $V_H \equiv 1$.

Sampling $\nu_i$ in Regression
$A_{ih} = B_{ih} = 1$ for $h = K_i$; for $h < K_i$:
$P(A_{ih} = 1, B_{ih} = 0) \propto \dfrac{V_h\,(1 - K(y_i, L_h))}{1 - V_h K(y_i, L_h)}$, $\quad P(A_{ih} = 0, B_{ih} = 1) \propto \dfrac{(1 - V_h)\, K(y_i, L_h)}{1 - V_h K(y_i, L_h)}$, $\quad P(A_{ih} = 0, B_{ih} = 0) \propto \dfrac{(1 - V_h)(1 - K(y_i, L_h))}{1 - V_h K(y_i, L_h)}$.

Sampling $\nu_i$ in Regression
Metropolis-Hastings step for sampling $L_h$, with proposal distribution
$L_h \sim \mathrm{Unif}\big(\min_{i: B_{ih}=1}(y_i),\; \max_{i: B_{ih}=1}(y_i)\big)$ when $\{i : B_{ih} = 1\} \neq \emptyset$, and $L_h \sim \mathrm{Unif}(\min(y_i), \max(y_i))$ otherwise.
With $K(y, L_h) = \exp(-\phi |y - L_h|^2)$, $\phi$ controls the intensity of borrowing information across $Y$.

Posterior for the d.r. Subspace
Recall $\mathcal{B} = \mathrm{span}(B) = \mathrm{span}(\Sigma^{-1} A) \in \mathcal{G}_{(d,p)}$; the Grassmann manifold $\mathcal{G}_{(d,p)}$ is the parameter space for the d.r. subspace. Posterior samples of the d.r. subspace, denoted $\{\mathcal{B}^{(1)}, \dots, \mathcal{B}^{(T)}\}$, lie on $\mathcal{G}_{(d,p)}$. [figure]

Posterior for the d.r. Subspace
The Bayes estimate of the posterior mean should be taken with respect to the distance metric on $\mathcal{G}_{(d,p)}$:
$\mathcal{B}_{\text{Bayes}} = \arg\min_{\mathcal{B} \in \mathcal{G}_{(d,p)}} \sum_{t=1}^{T} \mathrm{dist}^2(\mathcal{B}^{(t)}, \mathcal{B})$,
the Karcher mean (Karcher, 1977). A standard deviation measure:
$\mathrm{std}(\{\mathcal{B}^{(1)}, \dots, \mathcal{B}^{(T)}\}) = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \mathrm{dist}^2(\mathcal{B}^{(t)}, \mathcal{B}_{\text{Bayes}})}$

Distance on the Grassmann Manifold
Given two subspaces $\mathcal{W}_1$ and $\mathcal{W}_2$ spanned by orthonormal bases $W_1$ and $W_2$ respectively, the geodesic distance between $\mathcal{W}_1$ and $\mathcal{W}_2$ is computed as (Karcher, 1977; Kendall, 1990):
$\big(I - W_1(W_1^\top W_1)^{-1} W_1^\top\big) W_2 (W_1^\top W_2)^{-1} = U \Sigma V^\top$ (SVD), $\quad \Theta = \mathrm{atan}(\Sigma)$, $\quad \mathrm{dist}(\mathcal{W}_1, \mathcal{W}_2) = \sqrt{\mathrm{Trace}(\Theta^2)}$.
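
A small numpy sketch of the Grassmann geodesic distance; it uses the equivalent principal-angle formulation (arccos of the singular values of $W_1^\top W_2$) rather than the atan construction written on the slide, and the function name is an illustrative assumption.

```python
import numpy as np

def grassmann_distance(W1, W2):
    """Geodesic distance between span(W1) and span(W2) via principal angles.
    W1, W2: p x d matrices with orthonormal columns."""
    s = np.linalg.svd(W1.T @ W2, compute_uv=False)   # cosines of the principal angles
    theta = np.arccos(np.clip(s, -1.0, 1.0))
    return np.sqrt(np.sum(theta ** 2))

# example: distance between two random 2-dimensional subspaces of R^5
rng = np.random.default_rng(1)
Q1, _ = np.linalg.qr(rng.standard_normal((5, 2)))
Q2, _ = np.linalg.qr(rng.standard_normal((5, 2)))
print(grassmann_distance(Q1, Q2))
```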

d.r. Dimension: d
Model comparison: $\mathrm{BF}(d_1, d_2) = p(\text{data} \mid d_1)\,/\,p(\text{data} \mid d_2)$, with marginal likelihood $p(\text{data} \mid d) = \int_\theta p(\text{data} \mid d, \theta)\, p_{\text{prior}}(\theta)\, d\theta$.
Out-of-sample validation.

Large p
When $p \gg n$, preprocess with PCA: constrain $\mu_{yx} - \mu = A\nu_{yx}$ and $\Sigma^{-1}$ to lie in $\mathrm{span}(x_1, \dots, x_n)$.
SVD: $X = U_X D_X V_X^\top$, $V_X \in \mathbb{R}^{p \times p'}$ with $p' \leq \min(p, n)$. Then $A = V_X \tilde{A}$ ($\tilde{A} \in \mathbb{R}^{p' \times d}$) and $\Sigma = V_X \tilde{\Sigma} V_X^\top$ ($\tilde{\Sigma} \in \mathbb{R}^{p' \times p'}$).
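
A minimal numpy sketch of this SVD/PCA preprocessing step; the function name and the option to keep only the leading components are assumptions for illustration.

```python
import numpy as np

def pca_preprocess(X, n_components=None):
    """Represent centered samples in the span of the data (the p >> n case):
    X is n x p; returns the reduced-space scores and the basis V."""
    Xc = X - X.mean(axis=0)
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U diag(D) Vt
    k = n_components if n_components is not None else len(D)
    V = Vt[:k].T                  # p x k basis of the span of the centered samples
    return Xc @ V, V              # scores are n x k

# any loading A_tilde (k x d) estimated in the reduced space maps back as A = V @ A_tilde
```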

Swiss Roll
Data generated in $\mathbb{R}^{10}$ (Wu et al., 2008) with structure $X_1 = t\cos(t)$, $X_2 = h$, $X_3 = t\sin(t)$, where $t = \frac{3\pi}{2}(1 + 2\theta)$, $\theta \sim \mathrm{Unif}(0,1)$, $h \sim \mathrm{Unif}(0,1)$ (Swiss roll); the remaining 7 dimensions are independent Gaussian noise. Response: $Y = \sin(5\pi\theta) + h^2 + \varepsilon$, $\varepsilon \sim N(0, 0.01)$. [Figure: (a) $Y$ vs. $X_1, X_2, X_3$; (b) $X_3$ vs. $X_1$]
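
A small numpy sketch generating this Swiss-roll regression data; the scale of the 7 noise dimensions is not stated on the slide, so standard normal noise is assumed, and the function name is illustrative.

```python
import numpy as np

def swiss_roll_data(n, rng=None):
    """Generate the Swiss-roll regression data described on this slide."""
    rng = np.random.default_rng(rng)
    theta = rng.uniform(0, 1, n)
    h = rng.uniform(0, 1, n)
    t = 1.5 * np.pi * (1 + 2 * theta)
    X = np.column_stack([t * np.cos(t), h, t * np.sin(t),
                         rng.standard_normal((n, 7))])    # 7 noise dimensions (assumed N(0,1))
    y = np.sin(5 * np.pi * theta) + h ** 2 + rng.normal(0, 0.1, n)   # noise variance 0.01
    return X, y

X, y = swiss_roll_data(500)
```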

Swiss Roll: Accuracy Measure
The true d.r. subspace $\mathcal{B}$ is spanned by the first three coordinate dimensions. Metric to measure estimation accuracy: let $\hat{B} = (\hat{\beta}_1, \dots, \hat{\beta}_d)$ denote the estimate of $B$. Accuracy:
$\frac{1}{d}\sum_{i=1}^{d} \|P_{\mathcal{B}}\, \hat{\beta}_i\|^2 = \frac{1}{d}\sum_{i=1}^{d} \|(B B^\top)\, \hat{\beta}_i\|^2$,
where $P_{\mathcal{B}}$ denotes the orthogonal projection onto the column space of $B$.
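
The accuracy metric above in a few lines of numpy, assuming the columns of the true basis are orthonormal and the estimated directions are unit norm; dr_accuracy is an illustrative name.

```python
import numpy as np

def dr_accuracy(B_true, B_hat):
    """Average squared norm of the projections of the estimated directions
    onto the true d.r. subspace (1 = perfect recovery, 0 = orthogonal)."""
    P = B_true @ B_true.T                  # orthogonal projector onto span(B_true)
    proj = P @ B_hat
    return np.mean(np.sum(proj ** 2, axis=0))

# Swiss roll example: true subspace = first three coordinate axes of R^10
B_true = np.eye(10)[:, :3]
```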

Swiss Roll: Accuracy [figure]

Swiss Roll: Uncertainty
Standard deviation of the posterior samples, and the distance between the posterior mean and the true d.r. subspace. [figure]

Swiss Roll: Mixture Component Labels [figure]

Swiss Roll: Out-of-sample Mean-Square Error vs. d [figure]

Iris Background
The data consist of 3 classes with 50 instances each. Each class refers to a type of iris plant (Setosa, Virginica and Versicolour) and has 4 predictors describing the length and width of the sepal and petal. We merge Setosa and Virginica into a single class as in Sugiyama (2007).

Iris Embedding (Sugiyama, 2007) [figure]

Iris Embedding (BMI) [figure]

Handwritten Digit Data [figure]

Digit d.r. [Figure: (c) posterior mean of 3 vs 8; (d) posterior mean of 5 vs 8]

Gradient
Additive error regression model: $Y = f(X) + \epsilon = g(\beta_1^\top X, \dots, \beta_d^\top X) + \epsilon$. The gradient of the regression function lies in $\mathcal{B} = \mathrm{span}(\beta_1, \dots, \beta_d)$:
$\nabla f = \Big(\frac{\partial f}{\partial x^1}, \dots, \frac{\partial f}{\partial x^p}\Big)^\top$

Gradient Outer Product
Gradient outer product (GOP) matrix (Mukherjee et al., 2006):
$\Gamma = E_X\big[ (\nabla f)(\nabla f)^\top \big] \in \mathbb{R}^{p \times p}$
$\Gamma$ has rank at most $d$; if $\{v_1, \dots, v_d\}$ are the eigenvectors associated with the nonzero eigenvalues, then $\mathcal{B} = \mathrm{span}(v_1, \dots, v_d)$. In addition, $\Gamma$ can be viewed as a covariance matrix (Wu et al., 2007), since $\Gamma_{ij} = E\big(\frac{\partial f}{\partial x^i}\,\frac{\partial f}{\partial x^j}\big)$ indicates the covariation of predictors $x^i$ and $x^j$ relevant to prediction.
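
A minimal numpy sketch of how the d.r. subspace would be read off an empirical GOP built from per-sample gradient estimates (here assumed given, e.g., from the kernel model introduced later); the function name and the 1/n scaling are illustrative assumptions.

```python
import numpy as np

def gop_subspace(grads, d):
    """Empirical gradient outer product and its top-d eigenvectors.
    grads: n x p array whose i-th row estimates grad f(x_i)."""
    n = grads.shape[0]
    Gamma = grads.T @ grads / n                  # (1/n) sum_i grad f(x_i) grad f(x_i)'
    evals, evecs = np.linalg.eigh(Gamma)         # eigenvalues in ascending order
    top = evecs[:, np.argsort(evals)[::-1][:d]]  # columns span the estimated d.r. space
    return Gamma, top
```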

Graphical Models
A graphical model can be used to infer the conditional dependence structure of the predictors that are predictive. For a multivariate Gaussian $p(x) \propto \exp(-\tfrac{1}{2} x^\top J x + h^\top x)$, the precision matrix $J$ is the conditional independence matrix. The partial correlation matrix $R$ has elements
$r_{ij} = \frac{\mathrm{cov}(x^i, x^j \mid S_{/ij})}{\sqrt{\mathrm{var}(x^i \mid S_{/ij})\,\mathrm{var}(x^j \mid S_{/ij})}} = -\frac{J_{ij}}{\sqrt{J_{ii}\, J_{jj}}}$,
a measure of dependence between variables $i$ and $j$ conditioned on all the other variables $S_{/ij}$ ($i \neq j$). Set $\Gamma^{-1} = J$.
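
A short numpy sketch of the partial-correlation computation, using a pseudo-inverse of the GOP as mentioned later in the talk; the helper name and the unit-diagonal convention are assumptions.

```python
import numpy as np

def partial_correlations(Gamma):
    """Partial correlations from J = Gamma^{-1}: r_ij = -J_ij / sqrt(J_ii J_jj).
    A pseudo-inverse is used since the GOP can be rank deficient."""
    J = np.linalg.pinv(Gamma)
    d = np.sqrt(np.diag(J))
    R = -J / np.outer(d, d)
    np.fill_diagonal(R, 1.0)      # convention: unit diagonal
    return R
```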

Modeling Errors
For the regression case,
$y_i = f(x_i) + \varepsilon_i = f(x_j) + \nabla f(x_i)^\top (x_i - x_j) + O(\|x_i - x_j\|^2) + \varepsilon_i$, $\quad i = 1, \dots, n$.
The non-random term $O(\|x_i - x_j\|^2)$ has an absolute magnitude positively associated with the distance $d_{x_i, x_j}$; we model it as a stochastic term with mean 0 and variance positively associated with $d_{x_i, x_j}$:
$y_i = f(x_j) + \nabla f(x_i)^\top (x_i - x_j) + \varepsilon_{ij}$,
where $\varepsilon_{ij}$ has mean 0 and variance positively associated with $d_{x_i, x_j}$.

Error Structure
Ideally one would specify a full covariance structure for the $\varepsilon_{ij}$'s, respecting the fact that $\varepsilon_{ij}$ and $\varepsilon_{ij'}$ should covary when $j'$ is close to $j$. This involves specifying a covariance structure of size $n^2 \times n^2$, where $n$ is the sample size, which is extremely expensive computationally. To simplify, we assume the $\varepsilon_{ij}$'s are independently $N(0, \sigma^2_{ij})$ with $\sigma^2_{ij} = \sigma^2 / w_{ij}$, where $\sigma^2$ is a (scale) variance parameter and the $w_{ij}$'s are weights, for instance Gaussian weights $w_{ij} = \exp(-d^2_{x_i, x_j} / 2\sigma_w^2)$.
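
A one-function numpy sketch of the Gaussian weights $w_{ij}$ used in this simplification; the function name is an assumption.

```python
import numpy as np

def gaussian_weights(X, sigma_w=1.0):
    """Weights w_ij = exp(-||x_i - x_j||^2 / (2 sigma_w^2)); the per-pair error
    variance is sigma^2 / w_ij, so distant pairs get larger error variance."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma_w ** 2))
```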

Regularization Methods
We need to model $f$ and $\nabla f$. The regularization framework for learning a function $f$: minimize Loss + Penalty,
$\hat{f} = \arg\min_{f \in \mathcal{H}_k} \big[ L(f, \text{data}) + \lambda \|f\|^2_{\mathcal{H}_k} \big]$.
$\mathcal{H}_k$ is the reproducing kernel Hilbert space (RKHS) generated by a positive semi-definite function (kernel) $k(x, u)$, $x, u \in \mathcal{X} \subset \mathbb{R}^p$, e.g., $k(x, u) = \exp\{-\frac{1}{2\sigma_k^2}\|x - u\|^2\}$; $\mathcal{H}_k = \overline{\mathrm{span}}(k_{u_1}, k_{u_2}, \dots)$, where $k_{u_i}(\cdot) = k(\cdot, u_i)$, $u_i \in \mathbb{R}^p$. Example: support vector machines, $\hat{f}(x) = \arg\min_{f \in \mathcal{H}_k} \big[ \sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda \|f\|^2_{\mathcal{H}_k} \big]$.

The Representer Theorem
The representer theorem of Kimeldorf and Wahba (1971):
$\hat{f}(x) = \sum_{i=1}^n \hat{w}_i k(x, x_i) = \sum_{i=1}^n \hat{w}_i k_{x_i}(x)$,
where $x_1, \dots, x_n \in \mathbb{R}^p$ are the samples. This reduces the problem from infinite-dimensional to the $n$-dimensional $\mathrm{span}(k_{x_1}, \dots, k_{x_n})$. This form is achieved purely by regularization; we want to justify it through a proper Bayesian prior specification.
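
To illustrate the representer form, here is a minimal numpy sketch of a kernel estimator $\hat f(x) = \sum_i \hat w_i k(x, x_i)$ obtained with squared-error loss (kernel ridge regression); this is only a concrete instance of the regularization framework, not the Bayesian kernel model the talk builds, and the names and default parameters are assumptions.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma_k=1.0):
    """Gaussian kernel matrix k(x, z) = exp(-||x - z||^2 / (2 sigma_k^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma_k ** 2))

def kernel_ridge(X, y, lam=1e-2, sigma_k=1.0):
    """Fit f(x) = sum_i w_i k(x, x_i) with squared-error loss and RKHS penalty."""
    K = gaussian_kernel(X, X, sigma_k)
    w = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return lambda Xnew: gaussian_kernel(Xnew, X, sigma_k) @ w

# usage: f_hat = kernel_ridge(X, y); predictions = f_hat(X_test)
```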

Eigen-decomposition of the Kernel
Let $\{\lambda_j\}$ and $\{\phi_j(x)\}$, $j = 1, \dots$, be the eigenvalues and eigenfunctions of $k$, i.e., $\lambda_j \phi_j(x) = \int_{\mathcal{X}} k(x, u)\, \phi_j(u)\, d\mu(u)$. Then
$\mathcal{H}_k = \big\{ f \,\big|\, f(x) = \sum_{j=1}^{\infty} c_j \phi_j(x) \text{ s.t. } \sum_{j=1}^{\infty} c_j^2 / \lambda_j < \infty \big\}$.
A prior over $\{(c_j)_{j=1}^{\infty} : \sum_j c_j^2/\lambda_j < \infty\}$ implies a prior on $\mathcal{H}_k$. Problems: sampling from an infinite-dimensional space; the constraint; the eigenfunctions are not computable; etc.

Prior on the Function Space
$\mathcal{H}_k$ is equivalent to $\mathcal{G} = \big\{ g \,\big|\, g(x) = \int k(x, u)\, d\gamma(u), \; \gamma \in \Gamma_0 \big\}$, where $\Gamma_0$ is a subset of the space of signed Borel measures (Pillai et al., 2006). We model $\gamma$ by a random probability distribution $G(u)$ and a random coefficient function $w(u)$, so that
$\mathcal{G} = \big\{ g \,\big|\, g(x) = \int k(x, u)\, w(u)\, dG(u) \big\}$.
Take $G$ to be the marginal distribution of $X$ and place a Dirichlet process (DP) prior on $G$, i.e., $G \sim DP(\alpha, G_0)$.

Posterior: Finite Representation
Posterior DP (Schervish, 1995): given a sample $(x_1, \dots, x_n)$ from $G \sim DP(\alpha, G_0)$,
$G \mid (x_1, \dots, x_n) \sim DP(\alpha + n, G_n)$, with $G_n = \frac{1}{\alpha + n}\big(\alpha G_0 + \sum_{i=1}^{n} \delta_{x_i}\big)$.
$E\big(g \mid (x_1, \dots, x_n)\big) = \int k(x, u)\, w(u)\, d\big(E(G(u) \mid X^n)\big) = \int k(x, u)\, w(u)\, dG_n(u) = \frac{\alpha}{\alpha + n} \int k(x, u)\, w(u)\, dG_0(u) + \frac{1}{\alpha + n} \sum_{i=1}^{n} w(x_i)\, k(x, x_i)$,
which as $\alpha \to 0$ becomes proportional to $\sum_{i=1}^{n} w(x_i)\, k(x, x_i)$.

Represent the Gradient
We have justified $f = \sum_{i=1}^{n} \alpha_i\, k(x, x_i)$ and $(\nabla f)^{(j)} = \sum_{i=1}^{n} c_{ji}\, k(x, x_i)$, $j = 1, \dots, p$. Denote $\alpha = (\alpha_1, \dots, \alpha_n)^\top \in \mathbb{R}^n$ and $C = (c_{ji}) \in \mathbb{R}^{p \times n}$. An estimate of the GOP is then
$\hat{\Gamma} \propto \sum_{i=1}^{n} \nabla f(x_i)\, \nabla f(x_i)^\top = C K^2 C^\top$,
where $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$.

Likelihood
For each $i = 1, \dots, n$,
$y_i \mathbf{1} = K\alpha + D_i C K_i + \varepsilon_i$,
where $\mathbf{1} = (1, \dots, 1)^\top \in \mathbb{R}^n$, $D_i = \mathbf{1} x_i^\top - (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times p}$, $K_i$ is the $i$-th column of $K$, and $\varepsilon_i = (\varepsilon_{i1}, \dots, \varepsilon_{in})^\top$.
There are too many parameters (especially when $p \gg n$). West (2003) developed a strategy using empirical factor analysis for similar large-$p$, small-$n$ regressions; the idea is to apply a singular value decomposition (SVD) to the data matrix.

SVD Decomposition
$K = F_K \Lambda F_K^\top$, hence $K\alpha = F\beta$ with $F = F_K \Lambda$ and $\beta = F_K^\top \alpha$; we keep the first $m$ ($m < n$) columns of $F$ corresponding to the large singular values.
Let $M_X = (x_1 - x_n, \dots, x_{n-1} - x_n)$ with SVD $M_X = V \Lambda_M U^\top$, $V \in \mathbb{R}^{p \times s}$ and $s \leq \min(n-1, p)$. Then for each $i$ there is $D_i^* \in \mathbb{R}^{n \times s}$ such that $D_i = D_i^* V^\top$. Put $C^* (\in \mathbb{R}^{s \times n}) = V^\top C$, so $D_i C = D_i^* C^*$. This reduces the dimension from $p \times n$ to $s \times n$, and independent priors can conveniently be placed on $\beta$ and $C^*$.

Likelihood
Since the $\varepsilon_{ij}$'s are independently $N(0, \sigma^2_{ij})$ with $\sigma^2_{ij} = \sigma^2/w_{ij}$ and $\phi^{-1} = \sigma^2$, the likelihood is
$\propto \phi^{n^2/2} \exp\Big\{ -\frac{\phi}{2} \sum_{i=1}^{n} [y_i \mathbf{1} - F\beta - D_i^* C^* K_i]^\top W_i\, [y_i \mathbf{1} - F\beta - D_i^* C^* K_i] \Big\}$,
where $W_i = \mathrm{diag}(w_{i1}, \dots, w_{in})$.

Prior and Sampling: $\beta$
$\beta (\in \mathbb{R}^{m \times 1})$. Prior: $\beta \sim N(0, \Psi^{-1})$, $\Psi = \mathrm{diag}(\psi_1, \dots, \psi_m)$, $\psi_i \sim \mathrm{Gamma}(a_\psi/2, b_\psi/2)$.
Full conditional: $\beta \mid y, \dots \sim N(\hat{\beta}, \hat{V}_\beta)$, where
$\hat{V}_\beta = \big(F^\top(\textstyle\sum_i \phi W_i)F + \Psi\big)^{-1}$, $\quad \hat{\beta} = \phi\, \hat{V}_\beta\, F^\top \sum_i W_i a_i$, $\quad$ with $a_i = y_i \mathbf{1} - D_i^* C^* K_i$.

Prior and Sampling: $C^*$
$C^* = (c^*_1, \dots, c^*_n) (\in \mathbb{R}^{s \times n})$. Prior: $c^*_j \sim N(0, \Phi^{-1})$, $\Phi = \mathrm{diag}(\varphi_1, \dots, \varphi_s)$, $\varphi_i \sim \mathrm{Gamma}(a_\varphi/2, b_\varphi/2)$.
The likelihood part for $c^*_j$ is $N(\mu_j, V_j)$, where $V_j = \big(\phi \sum_i K_{ij}^2\, D_i^{*\top} W_i D_i^*\big)^{-1}$ and $\mu_j = \phi\, V_j \sum_i K_{ij}\, D_i^{*\top} W_i b_i^j$, with $b_i^j = y_i \mathbf{1} - F\beta - D_i^* \sum_{k \neq j} c^*_k K_{ik}$.
Full conditional: $c^*_j \mid \dots \sim N(\mu, V)$, where $V = \big(V_j^{-1} + \mathrm{diag}(\varphi_1, \dots, \varphi_s)\big)^{-1}$ and $\mu = V(V_j^{-1}\mu_j)$.

Prior and Sampling: $\phi$ and Other Parameters
With the improper prior $1/\phi$,
$\phi \mid y, \dots \sim \mathrm{Gamma}\Big(\frac{n^2}{2},\; \frac{\sum_i [y_i \mathbf{1} - F\beta - D_i^* C^* K_i]^\top W_i\, [y_i \mathbf{1} - F\beta - D_i^* C^* K_i]}{2}\Big)$;
$\psi_i \mid \dots \sim \mathrm{Gamma}\big(\frac{a_\psi + 1}{2},\; \frac{b_\psi + \beta_i^2}{2}\big)$, $\quad \varphi_i \mid \dots \sim \mathrm{Gamma}\big(\frac{a_\varphi + n}{2},\; \frac{b_\varphi + \sum_{k=1}^{n} c_{ik}^{*2}}{2}\big)$.

Binary Classification
The responses are $y_i \in \{0, 1\}$, $i = 1, \dots, n$. Probit model: let $p_i = P(y_i = 1)$ with link $\Phi^{-1}(p_i) = \mu_i$, where $\Phi$ is the standard normal cdf and $\mu_i$ is a predictor. Introduce $z_i \sim N(\mu_i, 1)$ with $z_i > 0 \Leftrightarrow y_i = 1$. The sampling schemes above for $\beta$ and $C^*$ stay the same with every $y_i$ replaced by $z_i$. Each $z_i$ has a truncated normal full conditional:
$z_i \mid \dots \sim N^+(\hat{z}_i, 1)$ if $y_i = 1$, and $z_i \mid \dots \sim N^-(\hat{z}_i, 1)$ if $y_i = 0$,
where $N^+$ and $N^-$ denote a normal truncated to the positive and negative half-line, respectively, and $(\hat{z}_1, \dots, \hat{z}_n)^\top = F\beta$.
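
A short sketch of the truncated-normal draw for the probit latent variables, via the inverse-cdf method with scipy; the function name is an assumption, and numerical safeguards for extreme values of $\hat z_i$ are omitted.

```python
import numpy as np
from scipy import stats

def sample_probit_latents(z_hat, y, rng=None):
    """Draw z_i ~ N(z_hat_i, 1) truncated to (0, inf) when y_i = 1 and to
    (-inf, 0) when y_i = 0, using the inverse-cdf method."""
    rng = np.random.default_rng(rng)
    u = rng.uniform(size=len(y))
    lo = np.where(y == 1, stats.norm.cdf(0.0 - z_hat), 0.0)   # standardized lower bound cdf
    hi = np.where(y == 1, 1.0, stats.norm.cdf(0.0 - z_hat))   # standardized upper bound cdf
    return z_hat + stats.norm.ppf(lo + u * (hi - lo))
```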

Posterior for the GOP
Given draws $\{C^{*(t)}\}_{t=1}^{T}$ we compute $\{C^{(t)}\}_{t=1}^{T}$ from the relation $C = VC^*$; the posterior draws of the GOP are then $\hat{\Gamma}^{(t)} = C^{(t)} K^2 (C^{(t)})^\top$. We then compute the posterior mean GOP matrix as well as a variance estimate,
$\hat{\mu}_{\hat{\Gamma}} = \frac{1}{T}\sum_{t=1}^{T} \hat{\Gamma}^{(t)}$, $\quad \hat{\sigma}^2_{\hat{\Gamma}} = \frac{1}{T}\sum_{t=1}^{T} \big(\hat{\Gamma}^{(t)} - \hat{\mu}_{\hat{\Gamma}}\big)^2_e$,
where $(\cdot)^2_e$ denotes the element-wise square.

Posterior for the d.r. Subspace
Recall the d.r. subspace $\mathcal{B} = \mathrm{span}(v_1, \dots, v_d)$, where $\{v_1, \dots, v_d\}$ are the eigenvectors associated with the largest $d$ eigenvalues of the GOP; the d.r. subspace lies on the manifold $\mathcal{G}_{(d,p)}$. A spectral decomposition of $\hat{\Gamma}^{(t)}$ then provides a posterior draw of the d.r. subspace $\mathcal{B}^{(t)}$, and
$\mathcal{B}_{\text{Bayes}} = \arg\min_{\mathcal{B} \in \mathcal{G}_{(d,p)}} \sum_{t=1}^{T} \mathrm{dist}^2(\mathcal{B}^{(t)}, \mathcal{B})$, $\quad \mathrm{std}(\{\mathcal{B}^{(1)}, \dots, \mathcal{B}^{(T)}\}) = \sqrt{\frac{1}{T}\sum_{t=1}^{T} \mathrm{dist}^2(\mathcal{B}^{(t)}, \mathcal{B}_{\text{Bayes}})}$

Posterior for the Conditional Independence
Conditional independence and partial correlations:
$J^{(t)} = (\hat{\Gamma}^{(t)})^{-1}$ (using a pseudo-inverse), $\quad R^{(t)}_{ij} = -\frac{J^{(t)}_{ij}}{\sqrt{J^{(t)}_{ii}\, J^{(t)}_{jj}}}$.
The mean and variance of the posterior estimates of the partial correlations are
$\hat{\mu}_R = \frac{1}{T}\sum_{t=1}^{T} R^{(t)}$, $\quad \hat{\sigma}^2_R = \frac{1}{T}\sum_{t=1}^{T} \big(R^{(t)} - \hat{\mu}_R\big)^2_e$.
These quantities can be used to infer a graphical model while evaluating the uncertainty of the correlation structure.

Linear Simulation 1
20 samples from class 0 were drawn with $x_j \sim N(1.5, 1)$ for $j = 1, \dots, 10$, $x_j \sim N(-1.5, 1)$ for $j = 11, \dots, 20$, and $x_j \sim N(0, 0.1)$ for $j = 21, \dots, 80$. Samples from class 1 were drawn with $x_j \sim N(1.5, 1)$ for $j = 41, \dots, 50$, $x_j \sim N(-1.5, 1)$ for $j = 51, \dots, 60$, and $x_j \sim N(0, 0.1)$ for $j = 1, \dots, 40, 61, \dots, 80$.

Linear Simulation 1 [Figure: (e) posterior mean of GOP; (f) top d.r. direction]

Swiss Roll Accuracy [figure]

Iris Embedding [figure]

Digits [Figure: (g) posterior mean of 3 vs 8; (h) posterior mean of 5 vs 8]

Linear Simulation 2
The predictor variables correspond to a five-dimensional random vector drawn from the model $X_1 = \theta_1$, $X_2 = \theta_1 + \theta_2$, $X_3 = \theta_3 + \theta_4$, $X_4 = \theta_4$, $X_5 = \theta_5 - \theta_4$, where the $\theta_i \sim N(0, 1)$. The regression model is $Y = X_1 + X_3 + X_5 + \varepsilon$, where $\varepsilon \sim N(0, 0.25)$. $X_1$, $X_3$ and $X_5$ are negatively correlated with respect to variation in the response, while $X_2$ and $X_4$ are not correlated with respect to variation in the response.

Linear Simulation 2 [Figures (i) and (j)]

Pathway Association
Genetic perturbations reflected by the altered expression of gene sets or pathways have been implicated in driving a normal cell to a malignant state, so it is necessary to study the relationship between pathways and the cell state (benign or malignant). Edelman et al. (2008) consider 54 prostate samples (22 benign and 32 malignant) and 522 pathways. For visualization we pick the 16 most significant pathways to build an interaction network.

Pathway Association [figure]

Development of distribution theory and proposal distributions on the Grassmann manifold.
Local dimension reduction: in Chen et al. (2009) a factor model with mixtures on the loading matrix is proposed, $X \sim N(\mu + A_X w, \phi^{-1} I)$.
Probabilistic nonlinear dimension reduction via kernels: $x_{:,j} \mid K \sim N(0, K)$, where $K \in \mathbb{R}^{n \times n}$ (Lawrence, 2005).

References

Cook, R. (2007). Fisher lecture: Dimension reduction in regression. Statistical Science 22(1).
Cook, R. and S. Weisberg (1991). Discussion of "Sliced inverse regression for dimension reduction". J. Amer. Statist. Assoc. 86.
Dunson, D. B. and J. Park (2008). Kernel stick-breaking processes. Biometrika 89.
Edelman, E., J. Guinney, J. Chi, P. Febbo, and S. Mukherjee (2008). Modeling cancer progression via pathway dependencies. PLoS Comp. Bio 4(2).
Escobar, M. and M. West (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc.
Friedman, J. H. and W. Stuetzle (1981). Projection pursuit regression. J. Amer. Statist. Assoc.
Gelfand, A., A. Kottas, and S. N. MacEachern (2005). Bayesian nonparametric spatial modeling with Dirichlet process mixing. J. Amer. Statist. Assoc. (471).
Griffin, J. and M. Steel (2006). Order-based dependent Dirichlet processes. J. Amer. Statist. Assoc.
Hastie, T. and R. Tibshirani (1996). Discriminant analysis by Gaussian mixtures. J. Roy. Statist. Soc. Ser. B 58(1).
Iorio, M. D., P. Müller, G. L. Rosner, and S. N. MacEachern (2004). An ANOVA model for dependent random measures. J. Amer. Statist. Assoc.
Ishwaran, H. and L. James (2001). Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc. (453).
Karcher, H. (1977). Riemannian center of mass and mollifier smoothing. Comm. Pure Appl. Math. (5).
Kendall, W. S. (1990). Probability, convexity and harmonic maps with small image. I. Uniqueness and fine existence. Proc. London Math. Soc. (2).
Kimeldorf, G. and G. Wahba (1971). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist. 41(2).
Li, K. (1991). Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 86.
Lopes, H. F. and M. West (2004). Bayesian model assessment in factor analysis. Statistica Sinica 14.
Mukherjee, S., Q. Wu, and D. Zhou (2006). Gradient learning and feature selection on manifolds. Technical report, ISDS Discussion Paper, Duke University.
Pillai, N., Q. Wu, F. Liang, S. Mukherjee, and R. Wolpert (2006). Characterizing the function space for Bayesian kernel models. J. Mach. Learn. Res. Under review.
Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. J. Mach. Learn. Res. 8.
Tokdar, S., Y. Zhu, and J. Ghosh (2008). A Bayesian implementation of sufficient dimension reduction in regression. Technical report, Purdue University.
West, M. (2003). Bayesian factor regression models in the large p, small n paradigm. In J. B. et al. (Ed.), Bayesian Statistics 7. Oxford.
Wu, Q., J. Guinney, M. Maggioni, and S. Mukherjee (2007). Learning gradients: Predictive models that infer geometry and dependence. J. Mach. Learn. Res.
Wu, Q., F. Liang, and S. Mukherjee (2008). Localized sliced inverse regression. Technical report.
Xia, Y., H. Tong, W. Li, and L.-X. Zhu (2002). An adaptive estimation of dimension reduction space. J. Roy. Statist. Soc. Ser. B 64(3).


More information

Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi

Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Face Recognition Using Laplacianfaces He et al. (IEEE Trans PAMI, 2005) presented by Hassan A. Kingravi Overview Introduction Linear Methods for Dimensionality Reduction Nonlinear Methods and Manifold

More information

Nonparametric Bayes Inference on Manifolds with Applications

Nonparametric Bayes Inference on Manifolds with Applications Nonparametric Bayes Inference on Manifolds with Applications Abhishek Bhattacharya Indian Statistical Institute Based on the book Nonparametric Statistics On Manifolds With Applications To Shape Spaces

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

Latent Variable Models and EM Algorithm

Latent Variable Models and EM Algorithm SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

A Fully Nonparametric Modeling Approach to. BNP Binary Regression

A Fully Nonparametric Modeling Approach to. BNP Binary Regression A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 02-01-2018 Biomedical data are usually high-dimensional Number of samples (n) is relatively small whereas number of features (p) can be large Sometimes p>>n Problems

More information

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Machine Learning 2nd Edition

Machine Learning 2nd Edition INTRODUCTION TO Lecture Slides for Machine Learning 2nd Edition ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/ The MIT Press, 2010

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Unsupervised Learning

Unsupervised Learning 2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and

More information

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Machine Learning. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Machine Learning Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1395 1 / 47 Table of contents 1 Introduction

More information

EECS 275 Matrix Computation

EECS 275 Matrix Computation EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 6 1 / 22 Overview

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Manifold Learning: Theory and Applications to HRI

Manifold Learning: Theory and Applications to HRI Manifold Learning: Theory and Applications to HRI Seungjin Choi Department of Computer Science Pohang University of Science and Technology, Korea seungjin@postech.ac.kr August 19, 2008 1 / 46 Greek Philosopher

More information

Sufficient Dimension Reduction using Support Vector Machine and it s variants

Sufficient Dimension Reduction using Support Vector Machine and it s variants Sufficient Dimension Reduction using Support Vector Machine and it s variants Andreas Artemiou School of Mathematics, Cardiff University @AG DANK/BCS Meeting 2013 SDR PSVM Real Data Current Research and

More information

Sayan Mukherjee. June 15, 2007

Sayan Mukherjee. June 15, 2007 Department of Statistical Science Institute for Genome Sciences & Policy Department of Computer Science Duke University June 15, 2007 To Tommy Poggio This talk is dedicated to my advisor Tommy Poggio as

More information

Bayesian estimation of the discrepancy with misspecified parametric models

Bayesian estimation of the discrepancy with misspecified parametric models Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012

More information

Nonlinear Dimensionality Reduction

Nonlinear Dimensionality Reduction Nonlinear Dimensionality Reduction Piyush Rai CS5350/6350: Machine Learning October 25, 2011 Recap: Linear Dimensionality Reduction Linear Dimensionality Reduction: Based on a linear projection of the

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Lecture 3: Statistical Decision Theory (Part II)

Lecture 3: Statistical Decision Theory (Part II) Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical

More information

Recap from previous lecture

Recap from previous lecture Recap from previous lecture Learning is using past experience to improve future performance. Different types of learning: supervised unsupervised reinforcement active online... For a machine, experience

More information

Introduction to Graphical Models

Introduction to Graphical Models Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Manifold Regularization

Manifold Regularization 9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

A Least Squares Formulation for Canonical Correlation Analysis

A Least Squares Formulation for Canonical Correlation Analysis A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation

More information

Unsupervised dimensionality reduction

Unsupervised dimensionality reduction Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional

More information

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).

Classification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1). Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes

More information

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA

Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures

More information

Approximate Kernel PCA with Random Features

Approximate Kernel PCA with Random Features Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

More information

High Dimensional Discriminant Analysis

High Dimensional Discriminant Analysis High Dimensional Discriminant Analysis Charles Bouveyron 1,2, Stéphane Girard 1, and Cordelia Schmid 2 1 LMC IMAG, BP 53, Université Grenoble 1, 38041 Grenoble cedex 9 France (e-mail: charles.bouveyron@imag.fr,

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

A Study of Relative Efficiency and Robustness of Classification Methods

A Study of Relative Efficiency and Robustness of Classification Methods A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Supervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing

Supervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing Supervised Learning Unsupervised learning: To extract structure and postulate hypotheses about data generating process from observations x 1,...,x n. Visualize, summarize and compress data. We have seen

More information