Nonparametric Bayesian Models for Supervised Dimension Reduction
1 Nonparametric Bayesian Models for Supervised Dimension Reduction. Department of Statistical Science, Duke University. December 2, 2009.
3 Outline. 1. Background: Dimension Reduction; Supervised Dimension Reduction. 2. Inverse Regression: Inverse Regression Methods; Bayesian Mixture Inverse Modeling. 3. Gradient Learning: Motivation; Error Covariance and Learning the Gradient.
4 Dimension Reduction. Dimension reduction: find a low-dimensional representation of high-dimensional data while preserving information in a certain sense; the assumption is that the intrinsic dimension is low. Formally, given samples x_1, ..., x_n ∈ X ⊂ R^p, produce lower-dimensional features (factors) x̃_1, ..., x̃_n ∈ X̃ ⊂ R^d with d < p, according to criteria chosen to ensure that information is preserved; there exists a map D s.t. D(x_i) = x̃_i. Uses: disclose underlying data structure, enable graphical visualization, overcome the curse of dimensionality, facilitate the use of other statistical methods, reduce data storage costs, etc.
7 Dimension Reduction Methods. Linear methods (find a linear subspace): principal component analysis (PCA), factor models, ... Nonlinear methods: kernel PCA, multidimensional scaling, ... Manifold learning methods (exploit local structure): ISOMAP, Laplacian eigenmaps, locally linear embedding, ... These are all unsupervised.
9 Supervised Dimension Reduction (SDR). Predictor variable X ∈ R^p, response variable Y ∈ R. Goal: find for X a low-dimensional subspace (or manifold) that contains all the information needed to predict Y. Ignoring Y can be problematic (e.g., PCA only maximizes variation in X).
10 SDR Formulation. 1. Y = g(b_1'X, ..., b_d'X, ε), where d < p, g is an unknown function and ε is an error term. B = (b_1, ..., b_d) ∈ R^{p×d}; B = span(B) is the dimension reduction (d.r.) subspace, and B ∈ G(d,p), the Grassmann manifold of all d-dimensional linear subspaces of R^p. 2. Equivalently, Y is conditionally independent of X given P_B X, where P_B X is the orthogonal projection of X onto B. 3. Equivalently, Y | X and Y | P_B X have the same distribution.
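As a concrete illustration of formulation 1, here is a minimal NumPy sketch; the specific g, B, and noise level are made up for illustration:

```python
import numpy as np

# A sketch of formulation 1: Y = g(b1'X, b2'X) + noise with p = 5, d = 2.
# g, B, and the noise level are made-up illustrations.
rng = np.random.default_rng(0)
p, d, n = 5, 2, 500
B = np.zeros((p, d))
B[0, 0] = 1.0          # b_1 = e_1
B[1, 1] = 1.0          # b_2 = e_2
X = rng.normal(size=(n, p))
Z = X @ B              # the d.r. coordinates b_1'X, b_2'X
Y = np.sin(Z[:, 0]) + Z[:, 1] ** 2 + 0.1 * rng.normal(size=n)
# Y depends on X only through P_B X; coordinates 3-5 of X are irrelevant.
```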
11 SDR Methods. Forward regression: directly model Y | X or g. Examples: projection pursuit regression (Friedman and Stuetzle, 1981); Bayesian sufficient dimension reduction (Tokdar et al., 2008), etc. Inverse regression: model X | Y. Examples: sliced inverse regression (Li, 1991); reduced-rank linear discriminant analysis; principal fitted components (Cook, 2007), etc. Gradient learning: learn ∇f, the gradient of the regression function f = E(Y | X). Examples: Mukherjee et al. (2006); Wu et al. (2007); Xia et al. (2002), etc.
14 Reduced-Rank Linear Discriminant Analysis (LDA). Idea: find a subspace that maximizes between-class variation while controlling for within-class variation.
15 LDA Limitation. When a class contains multiple clusters, the class mean/center cannot represent the whole class.
16 Sliced Inverse Regression (SIR). SIR: slice the data into H bins according to Y, treat the bins as classes, then proceed as in LDA.
17 SIR Limitations. 1. As with LDA, degeneracy may occur. 2. The slicing procedure is rigid, and the information in Y is not well utilized. 3. It is not probabilistic, so the uncertainty of the estimate cannot be evaluated.
18 LSIR and MDA. Localized SIR (LSIR) (Wu et al., 2008) utilizes local means, a compromise between LDA/SIR and PCA. Mixture discriminant analysis (MDA) (Hastie and Tibshirani, 1996) fits finite Gaussian mixtures within each class. Finite mixtures are not flexible enough, so it is natural to go nonparametric.
20 Probabilistic Model for SIR. Principal fitted components (PFC) model (Cook, 2007): X | (Y = y) = μ + A ν_y + ε, where μ ∈ R^p is the intercept, A ∈ R^{p×d} (d < p), ν_y ∈ R^d, and ε ~ N(0, Σ). The columns of Σ^{-1}A span the d.r. subspace.
21 Semi-parametric Bayesian Mixture Model. Extending to the mixture modeling setting: X | (Y = y) = μ + A ν_{yx} + ε, with ν_{yx} ~ G_y, an unknown random distribution. SIR corresponds to G_y = δ_{ν_y}, a point mass shared by all samples in the same bin; more generally G_y can be a mixture; and when y is continuous, G_y can change smoothly in y.
25 The d.r. Subspace. Proposition: for this model the d.r. subspace is the span of B = Σ^{-1}A, i.e., Y | X and Y | (Σ^{-1}A)'X have the same distribution.
26 Dirichlet Processes (DP). When Y is discrete, i.e., Y = c, c = 1, ..., C, a Dirichlet process (DP) prior on G_c leads to a mixture model in which the number of mixture components is determined automatically. G_c ~ DP(α_0, G_0) i.i.d., with stick-breaking representation G_c = ∑_{h=1}^∞ π_h δ_{ν*_h}, where π_h = V_h ∏_{l<h}(1 − V_l), V_h ~ Beta(1, α_0), ν*_h ~ G_0. For each sample i, ν_i | (y_i = c, ν_{-i}) ~ ∑_{j≠i, y_j=c} δ_{ν_j} + α_0 G_0(·), up to normalization.
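The stick-breaking construction above is easy to simulate. A minimal sketch (the truncation level H, α_0, and d are illustrative choices):

```python
import numpy as np

# Truncated stick-breaking draw from DP(alpha0, G0) with G0 = N(0, I_d).
# The truncation level H, alpha0, and d are illustrative choices.
def stick_breaking(alpha0, d, H, rng):
    V = rng.beta(1.0, alpha0, size=H)      # V_h ~ Beta(1, alpha0)
    V[-1] = 1.0                            # truncation: weights sum to one
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])
    pi = V * remaining                     # pi_h = V_h * prod_{l<h} (1 - V_l)
    atoms = rng.normal(size=(H, d))        # nu*_h ~ G0 = N(0, I_d)
    return pi, atoms

rng = np.random.default_rng(1)
pi, atoms = stick_breaking(alpha0=1.0, d=2, H=50, rng=rng)
```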
27 Dependent Dirichlet Processes (DDP). When Y is continuous we want G_y to change smoothly with y, i.e., G_{y_1} and G_{y_2} should be more strongly dependent the closer y_1 and y_2 are. The dependent Dirichlet process (DDP; MacEachern, 1998) provides a natural framework: G_y = ∑_{h=1}^∞ π_{yh} δ_{ν*_{yh}}, with π_{yh} = V_{yh} ∏_{l<h}(1 − V_{yl}). Different constructions for the DDP exist (Gelfand et al., 2005; Iorio et al., 2004; Griffin and Steel, 2006).
29 Kernel Stick-Breaking Process. Kernel stick-breaking process (Dunson and Park, 2008): G_y = ∑_{h=1}^∞ U(y; V_h, L_h) ∏_{l<h}(1 − U(y; V_l, L_l)) δ_{ν*_h}, with U(y; V_h, L_h) = V_h K(y, L_h). Here V_h is a probability weight, L_h is a random location on the domain of Y, and K(y, L_h) is a pre-specified kernel measuring the similarity between y and L_h, e.g., K(y, L_h) = 1{|y − L_h| < φ} or K(y, L_h) = exp(−φ|y − L_h|²). The dependence of the weights U(y; V_h, L_h) on y induces dependence among the G_y's.
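The kernel stick-breaking weights above can be sketched directly from the definition, with the Gaussian kernel; all parameter values here are illustrative:

```python
import numpy as np

# Kernel stick-breaking weights pi_h(y) = U_h(y) * prod_{l<h} (1 - U_l(y)),
# with U_h(y) = V_h * K(y, L_h) and a Gaussian kernel. All parameter values
# are illustrative.
def ksbp_weights(y, V, L, phi):
    K = np.exp(-phi * (y - L) ** 2)        # K(y, L_h) = exp(-phi |y - L_h|^2)
    U = V * K
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - U[:-1])])
    return U * remaining

rng = np.random.default_rng(2)
H = 30
V = rng.beta(1.0, 1.0, size=H)
L = rng.uniform(0.0, 1.0, size=H)
w = ksbp_weights(0.3, V, L, phi=50.0)
# Weight vectors at nearby y values are typically more similar than at
# distant ones, which is what induces dependence among the G_y's.
```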
31 The Loading Matrix A. Recall X | (Y = y, A, ν_{yx}, μ, Σ) ~ N(μ + A ν_{yx}, Σ), with A ∈ R^{p×d} and B = span(B) = span(Σ^{-1}A) ∈ G(d,p). Constrain A to a standard form in which the upper triangle of its first d rows is zero:

A = [ a_11  0    ...  0
      a_21  a_22 ...  0
      ...
      a_d1  ...  a_dd
      ...
      a_p1  ...  a_pd ]

This is the identifiability structure of Lopes and West (2004); ν_{yx} must then have unit variance.
34 Likelihood. The likelihood is p(data | A, Σ, ν, μ) ∝ det(Σ^{-1})^{n/2} exp{ −(1/2) ∑_{i=1}^n (x_i − μ − A ν_i)' Σ^{-1} (x_i − μ − A ν_i) }. Normal priors on μ and A, a normal base measure for the ν_i, and a Wishart prior on Σ^{-1} give conjugacy.
35 Priors for A and Σ^{-1}. Prior for A: a_lj ~ N(0, φ_a^{-1}) for l ≥ j, l = 1, ..., p; the full conditional is normal. Prior for Σ^{-1}: Wishart(df, V_D).
36 Sampling ν_i in Classification. Marginal and conditional approaches (Ishwaran and James, 2001). Pólya-urn representation of the prior for ν_i: ν_i | (y_i = c, ν_{-i}) ~ ∑_{j≠i, y_j=c} δ_{ν_j} + α_0 G_0(ν_i), up to normalization. A convenient choice for G_0 is N(0, I_d).
38 Sampling ν_i in Classification. Full conditional (Escobar and West, 1995):
ν_i | (data, y_i = c, ν_{-i}, A, Σ) ~ ∑_{j≠i, y_j=c} q_{i,j} δ_{ν_j} + q_{i,0} G_i(ν_i),
G_i(ν_i) = N( V_ν A'Σ^{-1}(x_i − μ), V_ν ),
q_{i,j} ∝ exp{ −(1/2)(x_i − μ − A ν_j)'Σ^{-1}(x_i − μ − A ν_j) },
q_{i,0} ∝ α_0 |V_ν|^{1/2} exp{ −(1/2)(x_i − μ)'(Σ^{-1} − Σ^{-1}A V_ν A'Σ^{-1})(x_i − μ) },
V_ν = (A'Σ^{-1}A + I_d)^{-1}.
39 Prior for ν_i in Regression. Truncation approximation:
G_y = ∑_{h=1}^H ( U(y; V_h, L_h) ∏_{l<h}(1 − U(y; V_l, L_l)) ) δ_{ν*_h},
U(y; V_h, L_h) = V_h K(y, L_h),
V_h ~ Beta(a_h, b_h), L_h ~ Unif(min_i(y_i), max_i(y_i)), ν*_h ~ N(0, I_d).
Introduce latent variables (Dunson and Park, 2008): K_i is the mixture label, i.e., K_i = h means sample i is assigned to the h-th mixture component; A_ih ~ Ber(V_h), B_ih ~ Ber(K(y_i, L_h)).
42 Sampling ν_i in Regression.
P(K_i = h | ...) ∝ exp{ −(1/2)(x_i − μ − A ν*_h)'Σ^{-1}(x_i − μ − A ν*_h) } · U(y_i; V_h, L_h) ∏_{l<h}(1 − U(y_i; V_l, L_l)) (a multinomial draw). If the sampled index is h, set ν_i = ν*_h.
ν*_h | ... ~ N( (n_h A'Σ^{-1}A + I_d)^{-1} A'Σ^{-1} ∑_{i∈C_h}(x_i − μ), (n_h A'Σ^{-1}A + I_d)^{-1} ), where C_h denotes the index set of the h-th cluster and n_h = #(C_h).
V_h | ... ~ Beta( a_h + ∑_{i: K_i ≥ h} A_ih, b_h + ∑_{i: K_i ≥ h}(1 − A_ih) ) for h < H, and V_H ≡ 1.
43 Sampling ν_i in Regression. A_ih = B_ih = 1 for h = K_i; for h < K_i:
P(A_ih = 1, B_ih = 0) = V_h(1 − K(y_i, L_h)) / (1 − V_h K(y_i, L_h)),
P(A_ih = 0, B_ih = 1) = (1 − V_h)K(y_i, L_h) / (1 − V_h K(y_i, L_h)),
P(A_ih = 0, B_ih = 0) = (1 − V_h)(1 − K(y_i, L_h)) / (1 − V_h K(y_i, L_h)).
44 Sampling ν_i in Regression. Metropolis-Hastings step for L_h, with proposal
L_h ~ Unif( min_{i: B_ih=1}(y_i), max_{i: B_ih=1}(y_i) ) when {i : B_ih = 1} ≠ ∅, and L_h ~ Unif(min_i(y_i), max_i(y_i)) otherwise.
With K(y, L_h) = exp(−φ|y − L_h|²), φ controls the intensity of borrowing information across Y.
45 Posterior for the d.r. Subspace. Recall B = span(B) = span(Σ^{-1}A) ∈ G(d,p): the Grassmann manifold G(d,p) is the parameter space for the d.r. subspace. Posterior samples of the d.r. subspace, denoted {B^(1), ..., B^(T)}, lie on G(d,p). [Figure: posterior samples of the d.r. subspace.]
46 Posterior for the d.r. Subspace. The Bayes estimate of the posterior mean should be taken with respect to the distance metric on G(d,p):
B_Bayes = argmin_{B ∈ G(d,p)} ∑_{t=1}^T dist²(B^(t), B),
called the Karcher mean (Karcher, 1977). A standard deviation measure:
std({B^(1), ..., B^(T)}) = sqrt( (1/T) ∑_{t=1}^T dist²(B^(t), B_Bayes) ).
48 Distance on the Grassmann Manifold. Given two subspaces W_1 and W_2 spanned by orthonormal bases W_1 and W_2 respectively, the geodesic distance between W_1 and W_2 is obtained as follows (Karcher, 1977; Kendall, 1990): compute the SVD
(I − W_1(W_1'W_1)^{-1}W_1') W_2 (W_1'W_2)^{-1} = U S V',
set Θ = atan(S); then dist(W_1, W_2) = sqrt(Trace(Θ²)).
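This distance can equivalently be computed from the principal angles: the SVD of W_1'W_2 gives cos(θ_i), and the distance is the norm of the angle vector. A minimal sketch:

```python
import numpy as np

# Geodesic distance on the Grassmann manifold via principal angles: the SVD
# of W1'W2 gives cos(theta_i), and dist = sqrt(sum theta_i^2). This is
# equivalent to the atan-based formula above for orthonormal bases.
def grassmann_dist(W1, W2):
    s = np.linalg.svd(W1.T @ W2, compute_uv=False)
    theta = np.arccos(np.clip(s, -1.0, 1.0))   # principal angles
    return np.sqrt(np.sum(theta ** 2))

# two 1-dimensional subspaces of R^2 at angle pi/4
W1 = np.array([[1.0], [0.0]])
W2 = np.array([[1.0], [1.0]]) / np.sqrt(2.0)
```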
49 The d.r. Dimension d. Model comparison via Bayes factors: BF(d_1, d_2) = p(data | d_1) / p(data | d_2), with marginal likelihood p(data | d) = ∫ p(data | d, θ) p_prior(θ) dθ. Alternatively, use out-of-sample validation.
51 Large p. When p ≫ n, preprocess with PCA: constrain μ_{yx} − μ = A ν_{yx} and Σ^{-1} to span(x_1, ..., x_n). SVD: X = U_X D_X V_X', V_X ∈ R^{p×p̃} (p̃ ≤ min(p, n)). Then A = V_X Ã (Ã ∈ R^{p̃×d}) and Σ = V_X Σ̃ V_X' (Σ̃ ∈ R^{p̃×p̃}).
52 Swiss Roll (Wu et al., 2008). Data generated in R^10 with structure X_1 = t cos(t), X_2 = h, X_3 = t sin(t), where t = (3π/2)(1 + 2θ), θ ~ Unif(0, 1), h ~ Unif(0, 1) (a Swiss roll). The remaining 7 dimensions are independent Gaussian noise. Response: Y = sin(5πθ) + h² + ε, ε ~ N(0, 0.01). [Figure: (a) Y vs. X_1, X_2, X_3; (b) X_3 vs. X_1.]
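The simulation described above can be sketched as follows (the sample size and seed are arbitrary):

```python
import numpy as np

# The Swiss-roll simulation described above; n and the seed are arbitrary.
rng = np.random.default_rng(0)
n, p = 200, 10
theta = rng.uniform(0.0, 1.0, size=n)
h = rng.uniform(0.0, 1.0, size=n)
t = 1.5 * np.pi * (1.0 + 2.0 * theta)
X = rng.normal(size=(n, p))              # dims 4-10 stay as Gaussian noise
X[:, 0] = t * np.cos(t)
X[:, 1] = h
X[:, 2] = t * np.sin(t)
Y = np.sin(5.0 * np.pi * theta) + h ** 2 + rng.normal(scale=0.1, size=n)
```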
53 Swiss Roll: Accuracy Measure. The true d.r. subspace B is spanned by the first three coordinate directions. Let B̂ = (β̂_1, ..., β̂_d) denote the estimate of B. Accuracy: (1/d) ∑_{i=1}^d ‖P_B β̂_i‖² = (1/d) ∑_{i=1}^d ‖(BB')β̂_i‖², where P_B denotes the orthogonal projection onto the column space of B.
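A minimal sketch of this accuracy metric (assuming the estimated directions are unit vectors; the example bases are made up):

```python
import numpy as np

# The accuracy metric (1/d) * sum_i ||P_B beta_hat_i||^2, assuming the
# estimated directions beta_hat_i are unit vectors.
def dr_accuracy(B, B_hat):
    P = B @ np.linalg.inv(B.T @ B) @ B.T   # orthogonal projection onto span(B)
    return np.mean(np.sum((P @ B_hat) ** 2, axis=0))

B = np.eye(5)[:, :3]                       # true d.r. = first 3 coordinates
B_hat_good = np.eye(5)[:, :3]              # perfect estimate
B_hat_bad = np.eye(5)[:, [3, 4, 0]]        # only one direction is correct
```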
54 Swiss Roll: Accuracy. [Figure slide.]
55 Swiss Roll: Uncertainty. [Figure slide reporting the standard deviation and the distance between the posterior mean and the true d.r. subspace.]
56 Swiss Roll: Mixture Component Labels. [Figure slide.]
57 Swiss Roll: Out-of-sample Mean-Squared Error vs. d. [Figure slide.]
58 Iris: Background. The data consist of 3 classes with 50 instances each. Each class is a type of iris plant (Setosa, Virginica, Versicolour), with 4 predictors describing the length and width of the sepal and petal. We merge Setosa and Virginica into a single class, as in Sugiyama (2007).
59 Iris: Embedding (Sugiyama, 2007). [Figure slide.]
60 Iris: Embedding (BMI). [Figure slide.]
61 Handwritten Digit Data. [Figure slide.]
62 Digit d.r. [Figure: (c) posterior mean of 3 vs. 8; (d) posterior mean of 5 vs. 8.]
63 Gradient. Additive-error regression model: Y = f(X) + ε = g(β_1'X, ..., β_d'X) + ε. The gradient of the regression function, ∇f = (∂f/∂x_1, ..., ∂f/∂x_p)', lies in B = span(β_1, ..., β_d).
64 Gradient Outer Product. The gradient outer product (GOP) matrix (Mukherjee et al., 2006) is Γ = E_X( (∇f)(∇f)' ) ∈ R^{p×p}. Γ has rank at most d; if {v_1, ..., v_d} are the eigenvectors associated with the nonzero eigenvalues, then B = span(v_1, ..., v_d). In addition, Γ can be viewed as a covariance matrix (Wu et al., 2007), since Γ_ij = E( (∂f/∂x_i)(∂f/∂x_j) ) indicates the covariation of predictors relevant to prediction.
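A toy illustration of the GOP idea, using the true gradient of a known single-index function (everything here is synthetic):

```python
import numpy as np

# Toy GOP illustration: f(x) = sin(b'x) has grad f(x) = cos(b'x) b, so
# Gamma = E[(grad f)(grad f)'] is rank 1 and its top eigenvector recovers b.
# Everything here is synthetic.
rng = np.random.default_rng(0)
p, n = 6, 2000
b = np.zeros(p)
b[0] = 1.0
X = rng.normal(size=(n, p))
grads = np.cos(X @ b)[:, None] * b[None, :]   # exact gradients of f
Gamma = grads.T @ grads / n                   # Monte Carlo estimate of the GOP
eigvals, eigvecs = np.linalg.eigh(Gamma)
v_top = eigvecs[:, -1]                        # eigenvector of largest eigenvalue
```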
66 Graphical Models. One could use a graphical model to infer the conditional dependence of the predictive predictors: for a multivariate Gaussian p(x) ∝ exp(−(1/2)x'Jx + h'x), the precision matrix J encodes conditional independence. The partial correlation matrix R has entries
r_ij = cov(x_i, x_j | S_{/ij}) / sqrt( var(x_i | S_{/ij}) var(x_j | S_{/ij}) ) = −J_ij / sqrt(J_ii J_jj),
a measure of dependence between variables i and j conditioned on all other variables S_{/ij} (i ≠ j). Set Γ^{-1} = J.
68 Modeling Errors. For the regression case, y_i = f(x_i) + ε_i, i = 1, ..., n, and a Taylor expansion gives y_i = f(x_j) + ∇f(x_i)'(x_i − x_j) + O(‖x_i − x_j‖²) + ε_i. The non-random term O(‖x_i − x_j‖²) has an absolute magnitude positively associated with the distance d(x_i, x_j); we model it as a stochastic term with mean 0 and variance positively associated with d(x_i, x_j): y_i = f(x_j) + ∇f(x_i)'(x_i − x_j) + ε_ij, where ε_ij has mean 0 and variance positively associated with d(x_i, x_j).
70 Error Structure. Ideally one would specify a full covariance structure for the ε_ij's, respecting the fact that ε_ij and ε_ij' should covary when j' is close to j. But this requires a covariance structure of size n² × n², where n is the sample size, which is computationally prohibitive. To simplify, we assume the ε_ij's are independent N(0, σ²_ij) with σ²_ij = σ²/w_ij, where σ² is a (scale) variance parameter and the w_ij's are weights, for instance Gaussian weights w_ij = exp(−d(x_i, x_j)²/(2σ²_w)).
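The simplified error structure can be sketched directly (σ² and σ_w are illustrative values):

```python
import numpy as np

# The simplified error structure: independent eps_ij ~ N(0, sigma^2 / w_ij)
# with Gaussian weights w_ij = exp(-d(x_i, x_j)^2 / (2 sigma_w^2)).
# sigma^2 and sigma_w are illustrative values.
def gaussian_weights(X, sigma_w):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma_w ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = gaussian_weights(X, sigma_w=1.0)
sigma2 = 0.5
var_eps = sigma2 / W    # variance of eps_ij grows with the distance d(x_i, x_j)
```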
73 Regularization Methods. We need to model f and ∇f. The regularization framework for learning a function f minimizes loss + penalty: f̂ = argmin_{f ∈ H_k} [ L(f, data) + λ‖f‖²_{H_k} ]. Here H_k is the reproducing kernel Hilbert space (RKHS) generated by a positive semi-definite kernel k(x, u), x, u ∈ X ⊂ R^p, e.g., k(x, u) = exp(−‖x − u‖²/(2σ²_k)); H_k = span(k_{u_1}, k_{u_2}, ...), where k_{u_i}(·) = k(·, u_i), u_i ∈ R^p. Example: support vector machines, f̂ = argmin_{f ∈ H_k} [ ∑_{i=1}^n (1 − y_i f(x_i))_+ + λ‖f‖²_{H_k} ].
77 The Representer Theorem. By the representer theorem (Kimeldorf and Wahba, 1971), f̂(x) = ∑_{i=1}^n ŵ_i k(x, x_i) = ∑_{i=1}^n ŵ_i k_{x_i}(x), where x_1, ..., x_n ∈ R^p are the samples. This reduces the problem from an infinite-dimensional space to the n-dimensional span(k_{x_1}, ..., k_{x_n}). This form is achieved purely by regularization; we want to justify it through a proper Bayesian prior specification.
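Kernel ridge regression is one standard estimator whose solution takes exactly this representer form; a minimal sketch (the data, kernel bandwidth, and λ are illustrative, not from the talk):

```python
import numpy as np

# Kernel ridge regression: its minimizer has the representer form
# f_hat(x) = sum_i w_i k(x, x_i). Data, bandwidth, and lambda are illustrative.
def rbf(A, B, sigma_k=1.0):
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma_k ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(40, 1))
y = np.sin(3.0 * X[:, 0]) + 0.05 * rng.normal(size=40)

lam = 1e-3
K = rbf(X, X)
w_hat = np.linalg.solve(K + lam * np.eye(40), y)   # coefficients w_i

def f_hat(X_new):
    return rbf(X_new, X) @ w_hat                   # sum_i w_i k(x, x_i)
```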
79 Eigen-decomposition of the Kernel. In terms of the eigen-decomposition of the kernel: let {λ_j} and {φ_j(x)}, j = 1, 2, ..., be the eigenvalues and eigenfunctions of k, i.e., λ_j φ_j(x) = ∫_X k(x, u) φ_j(u) dμ(u). Then H_k = { f : f(x) = ∑_{j=1}^∞ c_j φ_j(x) s.t. ∑_{j=1}^∞ c_j²/λ_j < ∞ }. A prior over {(c_j) : ∑_j c_j²/λ_j < ∞} implies a prior on H_k. Problems: sampling from an infinite-dimensional space; the constraint; the eigenfunctions are not computable; etc.
81 Prior on the Function Space. H_k is equivalent to G = { g : g(x) = ∫ k(x, u) dγ(u), γ ∈ Γ_0 }, where Γ_0 is a subset of the space of signed Borel measures (Pillai et al., 2006). Model γ by a random probability distribution G(u) and a random coefficient function w(u), so that G = { g : g(x) = ∫ k(x, u) w(u) dG(u) }. Take G to be the marginal distribution of X and place a Dirichlet process (DP) prior on G, i.e., G ~ DP(α, G_0).
83 Posterior: Finite Representation. Posterior of a DP (Schervish, 1995): given a sample (x_1, ..., x_n) from G ~ DP(α, G_0), G | (x_1, ..., x_n) ~ DP(α + n, G_n), with G_n = (1/(α+n)) (αG_0 + ∑_{i=1}^n δ_{x_i}). Then
E(g | x_1, ..., x_n) = ∫ k(x, u) w(u) d E(G(u) | X_n) = ∫ k(x, u) w(u) dG_n(u)
= (α/(α+n)) ∫ k(x, u) w(u) dG_0(u) + (1/(α+n)) ∑_{i=1}^n w(x_i) k(x, x_i),
which is proportional to ∑_{i=1}^n w(x_i) k(x, x_i) when α → 0.
85 Representing the Gradient. We have justified the forms f = ∑_{i=1}^n α_i k(x, x_i) and ∂f/∂x_j = ∑_{i=1}^n c_ji k(x, x_i), j = 1, ..., p. Denote α = (α_1, ..., α_n)' ∈ R^n and C = (c_ji) ∈ R^{p×n}. An estimate of the GOP is then Γ̂ = ∑_{i=1}^n ∇f(x_i) ∇f(x_i)' = C K² C', where K ∈ R^{n×n} with K_ij = k(x_i, x_j).
86 Likelihood. For each i = 1, ..., n: y_i 1 = K α + D_i C K_i + ε_i, where 1 = (1, ..., 1)' ∈ R^n, D_i = 1 x_i' − (x_1, ..., x_n)' ∈ R^{n×p}, K_i is the i-th column of K, and ε_i = (ε_i1, ..., ε_in)'. There are too many parameters (especially when p ≫ n). West (2003) developed a strategy using empirical factor analysis for similar large-p, small-n regression problems; the idea is to apply the singular value decomposition (SVD) to the data matrix.
88 SVD Decomposition
K = F_K \Lambda F_K^T, hence K\alpha = F\beta with F = F_K \Lambda, \beta = F_K^T \alpha. We pick the first m (m < n) columns of F, corresponding to the largest singular values.
Let M_X = (x_1 - x_n, \dots, x_{n-1} - x_n), with SVD M_X = V \Lambda_M U^T; V \in R^{p \times s} with s \le \min(n-1, p). Then for each i there exists \tilde{D}_i \in R^{n \times s} such that D_i = \tilde{D}_i V^T. Put \tilde{C} (\in R^{s \times n}) = V^T C; then D_i C = \tilde{D}_i \tilde{C}.
This reduces the dimension from p \times n to s \times n, and independent priors can conveniently be placed on \beta and \tilde{C}.
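The two reductions above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and the kernel reduction is shown in its exact form (m = n) so that K\alpha = F\beta can be checked directly.

```python
import numpy as np

def reduce_kernel(K, alpha, m):
    """K = F_K Lam F_K^T, so K alpha = F beta with F = F_K Lam, beta = F_K^T alpha.

    Keeps the m columns with the largest eigenvalues."""
    lam, F_K = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1][:m]
    F = F_K[:, order] * lam[order]       # n x m
    beta = F_K[:, order].T @ alpha       # m-vector
    return F, beta

def difference_basis(X, s=None):
    """M_X = (x_1 - x_n, ..., x_{n-1} - x_n) = V Lam_M U^T.

    Every row of D_i = 1 x_i^T - X is a difference of columns of M_X, hence
    lies in span(V); so D_i = D_i_tilde V^T with D_i_tilde = D_i V (n x s)."""
    M = (X[:-1] - X[-1]).T               # p x (n-1)
    V, _, _ = np.linalg.svd(M, full_matrices=False)
    return V if s is None else V[:, :s]
```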
91 Likelihood Since the \varepsilon_{ij} are independently N(0, (\phi w_{ij})^{-1}) with \phi^{-1} = \sigma^2, the likelihood is
\phi^{n^2/2} \exp\Big\{ -\frac{\phi}{2} \sum_{i=1}^n [y_i \mathbf{1} - F\beta - \tilde{D}_i \tilde{C} K_i]^T W_i [y_i \mathbf{1} - F\beta - \tilde{D}_i \tilde{C} K_i] \Big\}
where W_i = diag(w_{i1}, \dots, w_{in}).
92 Prior and Sampling \beta (\in R^{m \times 1}): Prior \beta \sim N(0, \Psi^{-1}), \Psi = diag(\psi_1, \dots, \psi_m), \psi_i \sim Gamma(a_\psi/2, b_\psi/2).
Full conditional: \beta \mid y, \dots \sim N(\hat{\beta}, \hat{V}_\beta), where
\hat{V}_\beta = \Big(F^T \Big(\sum_i \phi W_i\Big) F + \Psi\Big)^{-1}, \qquad \hat{\beta} = \phi \hat{V}_\beta F^T \sum_i W_i a_i, \quad a_i = y_i \mathbf{1} - \tilde{D}_i \tilde{C} K_i
93 Prior and Sampling \tilde{C} = (\tilde{c}_1, \dots, \tilde{c}_n) (\in R^{s \times n}): Prior \tilde{c}_j \sim N(0, \Phi^{-1}), \Phi = diag(\varphi_1, \dots, \varphi_s), \varphi_i \sim Gamma(a_\varphi/2, b_\varphi/2).
The likelihood part for \tilde{c}_j is N_s(\mu_j, V_j), where
V_j = \Big(\phi \sum_i K_{ij}^2\, \tilde{D}_i^T W_i \tilde{D}_i\Big)^{-1}, \qquad \mu_j = \phi V_j \sum_i K_{ij}\, \tilde{D}_i^T W_i b_i^j, \quad b_i^j = y_i \mathbf{1} - F\beta - \tilde{D}_i \sum_{k \ne j} \tilde{c}_k K_{ik}
Full conditional: \tilde{c}_j \mid \dots \sim N(\mu^*, V^*), where V^* = (V_j^{-1} + diag(\varphi_1, \dots, \varphi_s))^{-1} and \mu^* = V^* V_j^{-1} \mu_j.
94 Prior and Sampling \phi and other parameters:
\phi \mid y, \dots \sim Gamma\Big( \frac{n^2}{2},\ \frac{1}{2} \sum_i [y_i \mathbf{1} - F\beta - \tilde{D}_i \tilde{C} K_i]^T W_i [y_i \mathbf{1} - F\beta - \tilde{D}_i \tilde{C} K_i] \Big) for the improper prior 1/\phi;
\psi_i \mid \dots \sim Gamma\Big( \frac{a_\psi + 1}{2},\ \frac{b_\psi + \beta_i^2}{2} \Big);
\varphi_i \mid \dots \sim Gamma\Big( \frac{a_\varphi + n}{2},\ \frac{b_\varphi + \sum_{k=1}^n \tilde{c}_{ik}^2}{2} \Big)
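The full conditional for \beta can be sketched directly from the formulas above. This is an illustrative sketch under the stated model, not the authors' implementation; the function name and argument layout are assumptions.

```python
import numpy as np

def draw_beta(y, F, Dt_list, C_tilde, K, W_list, phi, psi, rng=None):
    """One Gibbs draw of beta | ... ~ N(beta_hat, V_beta) with
    V_beta   = (F^T (sum_i phi W_i) F + Psi)^{-1}
    beta_hat = phi V_beta F^T sum_i W_i a_i,  a_i = y_i 1 - D_i_tilde C_tilde K_i.
    """
    rng = np.random.default_rng(rng)
    n, m = K.shape[0], F.shape[1]
    one = np.ones(n)
    prec = np.diag(np.asarray(psi, dtype=float))   # prior precision Psi
    rhs = np.zeros(m)
    for i in range(n):
        a_i = y[i] * one - Dt_list[i] @ (C_tilde @ K[:, i])
        prec = prec + phi * F.T @ W_list[i] @ F
        rhs = rhs + phi * F.T @ (W_list[i] @ a_i)
    V_beta = np.linalg.inv(prec)
    return rng.multivariate_normal(V_beta @ rhs, V_beta)
```

The draws for \tilde{c}_j and \phi follow the same pattern from their full conditionals.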
95 Binary Classification The responses are y_i \in \{0, 1\}, i = 1, \dots, n. Probit model: let p_i = P(y_i = 1), with link \Phi^{-1}(p_i) = \mu_i, where \Phi is the standard normal cdf and \mu_i is some predictor. Introduce z_i \sim N(\mu_i, 1) with z_i > 0 \Leftrightarrow y_i = 1. The sampling schemes above for \beta and \tilde{C} are the same with all previous y_i replaced by z_i. Each z_i has a truncated normal full conditional:
z_i \mid \dots \sim N^+(\hat{z}_i, 1) \text{ if } y_i = 1, \qquad N^-(\hat{z}_i, 1) \text{ if } y_i = 0
where N^+ and N^- denote a normal truncated to the positive and the negative half-line, respectively, and (\hat{z}_1, \dots, \hat{z}_n)^T = F\beta.
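The truncated normal draws for the latent z_i can be sketched with simple rejection sampling. This is an illustrative sketch (Albert–Chib-style augmentation), not the authors' code; the function name is hypothetical, and rejection sampling is only reasonable when |z_hat_i| is moderate.

```python
import numpy as np

def draw_latent_z(y, z_hat, rng=None):
    """Sample z_i ~ N(z_hat_i, 1) truncated to (0, inf) when y_i = 1
    and to (-inf, 0) when y_i = 0, by rejection from the untruncated normal."""
    rng = np.random.default_rng(rng)
    z = np.empty(len(y), dtype=float)
    for i, (yi, mi) in enumerate(zip(y, z_hat)):
        while True:
            draw = rng.normal(mi, 1.0)
            if (draw > 0) == (yi == 1):   # sign must match the observed label
                z[i] = draw
                break
    return z
```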
97 Posterior for GOP Given draws \{\tilde{C}^{(t)}\}_{t=1}^T we compute \{C^{(t)}\}_{t=1}^T from the relation C = V\tilde{C}, and then the posterior draws of the GOP
\hat{\Gamma}^{(t)} = C^{(t)} K^2 (C^{(t)})^T
We then compute the posterior mean GOP matrix as well as a variance estimate
\hat{\mu}_{\hat{\Gamma}} = \frac{1}{T} \sum_{t=1}^T \hat{\Gamma}^{(t)}, \qquad \hat{\sigma}^2_{\hat{\Gamma}} = \frac{1}{T} \sum_{t=1}^T \big(\hat{\Gamma}^{(t)} - \hat{\mu}_{\hat{\Gamma}}\big)^2_e
where (\cdot)^2_e denotes the element-wise square.
98 Posterior for the d.r. Subspace Recall the d.r. space B = span(v_1, \dots, v_d), where \{v_1, \dots, v_d\} are the eigenvectors associated with the largest d eigenvalues of the GOP. The d.r. subspace lies on the Grassmann manifold G(d, p). A spectral decomposition of \hat{\Gamma}^{(t)} then provides a posterior draw B^{(t)} of the d.r. subspace, and
B_{Bayes} = \arg\min_{B \in G(d,p)} \sum_{t=1}^T dist^2(B^{(t)}, B)
std(\{B^{(1)}, \dots, B^{(T)}\}) = \Big[ \frac{1}{T} \sum_{t=1}^T dist^2(B^{(t)}, B_{Bayes}) \Big]^{1/2}
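The subspace extraction and the distance on G(d, p) can be sketched as follows. The slides do not specify which Grassmann distance is used; the projection (Frobenius) distance below is one common choice, and the function names are hypothetical.

```python
import numpy as np

def dr_basis(Gamma, d):
    """Orthonormal basis of the estimated d.r. subspace: eigenvectors of the
    (symmetric) GOP draw associated with the d largest eigenvalues."""
    vals, vecs = np.linalg.eigh(Gamma)
    return vecs[:, np.argsort(vals)[::-1][:d]]     # p x d

def subspace_dist(B1, B2):
    """Projection (Frobenius) distance on G(d, p): ||B1 B1^T - B2 B2^T||_F.

    Depends only on the spans, not on the particular orthonormal bases."""
    return np.linalg.norm(B1 @ B1.T - B2 @ B2.T)
```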
99 Posterior for the Conditional Independence The conditional independence structure and partial correlations:
J^{(t)} = (\hat{\Gamma}^{(t)})^{-1} \text{ (using a pseudo-inverse)}, \qquad R^{(t)}_{ij} = -\frac{J^{(t)}_{ij}}{\sqrt{J^{(t)}_{ii} J^{(t)}_{jj}}}
The mean and variance of the posterior estimates of the partial correlations are
\hat{\mu}_R = \frac{1}{T} \sum_{t=1}^T R^{(t)}, \qquad \hat{\sigma}^2_R = \frac{1}{T} \sum_{t=1}^T \big(R^{(t)} - \hat{\mu}_R\big)^2_e
These quantities can be used to infer a graphical model with the capability to evaluate the uncertainty of the correlation structure.
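The partial correlations for one posterior draw can be sketched directly. This is an illustrative sketch, not the authors' code; the function name is hypothetical, and the sign convention shown is the standard one for partial correlations derived from a precision matrix.

```python
import numpy as np

def partial_correlations(Gamma):
    """J = pseudo-inverse of a GOP draw; R_ij = -J_ij / sqrt(J_ii J_jj).

    Near-zero off-diagonal entries suggest conditional independence."""
    J = np.linalg.pinv(Gamma)
    d = np.sqrt(np.diag(J))
    R = -J / np.outer(d, d)
    np.fill_diagonal(R, 1.0)
    return R
```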
100 Linear Simulation 1 20 samples from class 0 were drawn with x_j \sim N(1.5, 1) for j = 1, \dots, 10; x_j \sim N(-1.5, 1) for j = 11, \dots, 20; x_j \sim N(0, 0.1) for j = 21, \dots, 80. 20 samples from class 1 were drawn with x_j \sim N(1.5, 1) for j = 41, \dots, 50; x_j \sim N(-1.5, 1) for j = 51, \dots, 60; x_j \sim N(0, 0.1) for j = 1, \dots, 40, 61, \dots, 80.
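The simulated data can be regenerated as follows. This is an illustrative sketch: N(0, 0.1) is read as a variance (so the noise standard deviation is sqrt(0.1)), and the function name is an assumption.

```python
import numpy as np

def simulate_linear1(rng=None):
    """Two-class data: 20 samples per class in p = 80 dimensions, with the
    stated blocks of signal coordinates and N(0, 0.1)-variance noise elsewhere."""
    rng = np.random.default_rng(rng)
    def draw_class(pos, neg):
        Xc = rng.normal(0.0, np.sqrt(0.1), size=(20, 80))
        Xc[:, pos] = rng.normal(1.5, 1.0, size=(20, len(pos)))
        Xc[:, neg] = rng.normal(-1.5, 1.0, size=(20, len(neg)))
        return Xc
    X0 = draw_class(range(0, 10), range(10, 20))    # class 0: dims 1-10, 11-20
    X1 = draw_class(range(40, 50), range(50, 60))   # class 1: dims 41-50, 51-60
    return np.vstack([X0, X1]), np.repeat([0, 1], 20)
```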
101 Linear Simulation 1 (Figure: (e) posterior mean of the GOP; (f) top d.r. direction)
102 Swiss Roll Accuracy (figure)
103 Iris Embedding (figure)
104 Digits (Figure: (g) posterior mean of 3 vs 8; (h) posterior mean of 5 vs 8)
105 Linear Simulation 2 The predictors correspond to a five-dimensional random vector drawn from the following model:
X_1 = \theta_1, \quad X_2 = \theta_1 + \theta_2, \quad X_3 = \theta_3 + \theta_4, \quad X_4 = \theta_4, \quad X_5 = \theta_5 - \theta_4, \qquad \theta_i \sim N(0, 1)
The regression model is Y = X_1 + X_3 + X_5 + \varepsilon, where \varepsilon \sim N(0, 0.25). X_1, X_3, X_5 are negatively correlated with respect to variation in the response, and X_2 and X_4 are not correlated with respect to variation in the response.
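This simulation can be regenerated directly from the stated model. An illustrative sketch: N(0, 0.25) is read as a variance (noise standard deviation 0.5), the regression is taken as Y = X_1 + X_3 + X_5 + eps, and the function name is an assumption.

```python
import numpy as np

def simulate_linear2(n, rng=None):
    """theta has five i.i.d. N(0,1) coordinates; eps ~ N(0, 0.25)."""
    rng = np.random.default_rng(rng)
    th = rng.standard_normal((n, 5))
    X = np.column_stack([
        th[:, 0],                # X1 = theta1
        th[:, 0] + th[:, 1],     # X2 = theta1 + theta2
        th[:, 2] + th[:, 3],     # X3 = theta3 + theta4
        th[:, 3],                # X4 = theta4
        th[:, 4] - th[:, 3],     # X5 = theta5 - theta4
    ])
    y = X[:, 0] + X[:, 2] + X[:, 4] + rng.normal(0.0, 0.5, size=n)
    return X, y
```

Note that X_3 + X_5 = theta_3 + theta_5, so the response depends on the predictors only through theta_1 + theta_3 + theta_5.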
107 Linear Simulation 2 (figure)
108 Linear Simulation 2 (Figure: panels (i) and (j))
109 Pathway Association Genetic perturbations, reflected by the altered expression of gene sets or pathways, have been implicated in driving a normal cell to a malignant state. It is therefore natural to study the relationship between pathways and the cell state (benign or malignant). Edelman et al. (2008) considered 54 prostate samples (22 benign and 32 malignant) and 522 pathways. For visualization we pick the 16 most significant pathways to build an interaction network.
110 Pathway Association (figure)
111 Development of distribution theory and proposal distributions on the Grassmann manifold. Local dimension reduction: in Chen et al. (2009) a factor model with mixtures on the loading matrix is proposed,
X \sim N(\mu + A_x w, \phi^{-1} I)
Probabilistic nonlinear dimension reduction via kernels:
x_{:,j} \mid K \sim N(0, K), \quad K \in R^{n \times n} \quad \text{(Lawrence 2005)}
113 Cook, R. (2007). Fisher lecture: Dimension reduction in regression. Statistical Science 22(1). Cook, R. and S. Weisberg (1991). Discussion of sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 86. Dunson, D. B. and J. Park (2008). Kernel stick-breaking processes. Biometrika 89. Edelman, E., J. Guinney, J. Chi, P. Febbo, and S. Mukherjee (2008). Modeling cancer progression via pathway dependencies. PLoS Comp. Bio. 4(2). Escobar, M. and M. West (1995). Bayesian density estimation and inference using mixtures. J. Amer. Statist. Assoc. Friedman, J. H. and W. Stuetzle (1981). Projection pursuit regression. J. Amer. Statist. Assoc.
114 Gelfand, A., A. Kottas, and S. N. MacEachern (2005). Bayesian nonparametric spatial modeling with Dirichlet process mixing. J. Amer. Statist. Assoc. (471). Griffin, J. and M. Steel (2006). Order-based dependent Dirichlet processes. J. Amer. Statist. Assoc. Hastie, T. and R. Tibshirani (1996). Discriminant analysis by Gaussian mixtures. J. Roy. Statist. Soc. Ser. B 58(1). Iorio, M. D., P. Müller, G. L. Rosner, and S. N. MacEachern (2004). An ANOVA model for dependent random measures. J. Amer. Statist. Assoc. Ishwaran, H. and L. James (2001). Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc. (453). Karcher, H. (1977). Riemannian center of mass and mollifier smoothing. Comm. Pure Appl. Math. (5).
115 Kendall, W. S. (1990). Probability, convexity and harmonic maps with small image. I. Uniqueness and fine existence. Proc. London Math. Soc. (2). Kimeldorf, G. and G. Wahba (1971). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann. Math. Statist. 41(2). Li, K. (1991). Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 86. Lopes, H. F. and M. West (2004). Bayesian model assessment in factor analysis. Statist. Sinica 14. Mukherjee, S., Q. Wu, and D. Zhou (2006). Gradient learning and feature selection on manifolds. Technical report, ISDS Discussion Paper, Duke University.
116 Pillai, N., Q. Wu, F. Liang, S. Mukherjee, and R. Wolpert (2006). Characterizing the function space for Bayesian kernel models. J. Mach. Learn. Res. Under review. Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis. J. Mach. Learn. Res. 8. Tokdar, S., Y. Zhu, and J. Ghosh (2008). A Bayesian implementation of sufficient dimension reduction in regression. Technical report, Purdue Univ. West, M. (2003). Bayesian factor regression models in the large p, small n paradigm. In J. B. et al. (Ed.), Bayesian Statistics 7. Oxford. Wu, Q., J. Guinney, M. Maggioni, and S. Mukherjee (2007). Learning gradients: Predictive models that infer geometry and dependence. J. Mach. Learn. Res.
117 Wu, Q., F. Liang, and S. Mukherjee (2008). Localized sliced inverse regression. Technical report. Xia, Y., H. Tong, W. Li, and L.-X. Zhu (2002). An adaptive estimation of dimension reduction space. J. Roy. Statist. Soc. Ser. B 64(3).
More informationSTA 414/2104: Lecture 8
STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationBayesian non-parametric model to longitudinally predict churn
Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics
More informationManifold Regularization
9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationA Least Squares Formulation for Canonical Correlation Analysis
A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation
More informationUnsupervised dimensionality reduction
Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional
More informationUniversity of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians
Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationClassification. The goal: map from input X to a label Y. Y has a discrete set of possible values. We focused on binary Y (values 0 or 1).
Regression and PCA Classification The goal: map from input X to a label Y. Y has a discrete set of possible values We focused on binary Y (values 0 or 1). But we also discussed larger number of classes
More informationLearning Eigenfunctions: Links with Spectral Clustering and Kernel PCA
Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures
More informationApproximate Kernel PCA with Random Features
Approximate Kernel PCA with Random Features (Computational vs. Statistical Tradeoff) Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Journées de Statistique Paris May 28,
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationHigh Dimensional Discriminant Analysis
High Dimensional Discriminant Analysis Charles Bouveyron 1,2, Stéphane Girard 1, and Cordelia Schmid 2 1 LMC IMAG, BP 53, Université Grenoble 1, 38041 Grenoble cedex 9 France (e-mail: charles.bouveyron@imag.fr,
More informationNon-Parametric Bayes
Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian
More informationA Study of Relative Efficiency and Robustness of Classification Methods
A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More informationSupervised Learning. Regression Example: Boston Housing. Regression Example: Boston Housing
Supervised Learning Unsupervised learning: To extract structure and postulate hypotheses about data generating process from observations x 1,...,x n. Visualize, summarize and compress data. We have seen
More information