Kernel-Based Contrast Functions for Sufficient Dimension Reduction


Slide 1: Kernel-Based Contrast Functions for Sufficient Dimension Reduction
Michael I. Jordan, Departments of Statistics and EECS, University of California, Berkeley
Joint work with Kenji Fukumizu and Francis Bach

Slide 2: Outline
- Introduction: dimension reduction and conditional independence
- Conditional covariance operators on RKHS
- Kernel Dimensionality Reduction for regression
- Manifold KDR
- Summary

Slide 3: Sufficient Dimension Reduction
- Regression setting: observe (X, Y) pairs, where the covariate X is high-dimensional
- Find a (hopefully small) subspace S of the covariate space that retains the information pertinent to the response Y
- Semiparametric formulation: treat the conditional distribution p(y | X) nonparametrically, and estimate the parameter S

Slide 4: Perspectives
- Classically the covariate vector X has been treated as ancillary in regression
- The sufficient dimension reduction (SDR) literature has aimed at making use of the randomness in X (in settings where this is reasonable)
- This has generally been achieved via inverse regression, at the cost of introducing strong assumptions on the distribution of the covariate X
- We'll make use of the randomness in X without employing inverse regression

Slide 5: Dimension Reduction for Regression
- Regression: p(Y | X), where Y is the response variable and X = (X_1, ..., X_m) is an m-dimensional covariate
- Goal: find the central subspace, which is defined via
  p(Y | X) = p~(Y | B^T X) = p~(Y | b_1^T X, ..., b_d^T X)

Slide 6: Example
- Y ~ N(0, exp(X_1) + 1): the conditional law of Y depends on X_1 only
- Central subspace = the X_1 axis
- [Figure: scatter plots of Y against X_1 and against X_2]
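
A minimal numpy sketch of a synthetic data set in the spirit of this slide (an illustration only: the model is read as Y | X ~ N(0, exp(X_1) + 1) with the second argument taken as a variance, and the sample size and covariate range are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.uniform(-1.5, 1.5, size=(N, 2))              # 2-dimensional covariate (X_1, X_2)
Y = rng.normal(0.0, np.sqrt(np.exp(X[:, 0]) + 1.0))   # noise level driven by X_1 only
# p(Y | X) depends on X only through X_1, so the central subspace
# is spanned by B = (1, 0)^T, i.e., the X_1 axis.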

Slide 7: Some Existing Methods
- Sliced Inverse Regression (SIR; Li, 1991): PCA of E[X | Y], using slices of Y; elliptic assumption on the distribution of X
- Principal Hessian Directions (pHd; Li, 1992): the average Hessian Σ_yxx = E[(Y − E[Y])(X − E[X])(X − E[X])^T] is used; if X is Gaussian, its eigenvectors give the central subspace; Gaussian assumption on X, and Y must be one-dimensional
- Projection pursuit approach (e.g., Friedman et al., 1981): the additive model E[Y | X] = g_1(b_1^T X) + ... + g_d(b_d^T X) is used
- Canonical Correlation Analysis (CCA) / Partial Least Squares (PLS): linear assumption on the regression
- Contour Regression (Li, Zha & Chiaromonte, 2004): elliptic assumption on the distribution of X

Slide 8: Dimension Reduction and Conditional Independence
- (U, V) = (B^T X, C^T X), where C is m x (m − d) with columns orthogonal to B; B gives the projector onto the central subspace
- p_{Y|X}(y | x) = p_{Y|U}(y | B^T x)  ⇔  p_{Y|U,V}(y | u, v) = p_{Y|U}(y | u) for all y, u, v  ⇔  Y ⊥ V | U (conditional independence)
- Our approach: characterize conditional independence

Slide 9: Outline
- Introduction: dimension reduction and conditional independence
- Conditional covariance operators on RKHS
- Kernel Dimensionality Reduction for regression
- Manifold KDR
- Summary

Slide 10: Reproducing Kernel Hilbert Spaces
- Kernel methods: RKHSs have generally been used to provide basis expansions for regression and classification (e.g., the support vector machine)
- Kernelization: map data into the RKHS and apply linear or second-order methods in the RKHS
- But RKHSs can also be used to characterize independence and conditional independence
- [Figure: feature maps Φ_X: Ω_X → H_X and Φ_Y: Ω_Y → H_Y from the input spaces into the two RKHSs]

Slide 11: Positive Definite Kernels and RKHS
- Positive definite (p.d.) kernel k: Ω x Ω → R: k is positive definite if k(x, y) = k(y, x) and, for any n ∈ N and x_1, ..., x_n ∈ Ω, the Gram matrix ( k(x_i, x_j) )_{i,j} is positive semidefinite
- Example: Gaussian RBF kernel k(x, y) = exp( −||x − y||^2 / σ^2 )
- Reproducing kernel Hilbert space (RKHS): for a p.d. kernel k on Ω, H is the RKHS such that
  1) k(·, x) ∈ H for all x ∈ Ω
  2) span{ k(·, x) : x ∈ Ω } is dense in H
  3) ⟨ k(·, x), f ⟩_H = f(x) (reproducing property)
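
As a quick numerical illustration of the definition above (not from the slides; the kernel width and sample points are arbitrary), the Gaussian RBF Gram matrix of any finite point set should have no negative eigenvalues:

import numpy as np

def gaussian_gram(X, sigma):
    # Gram matrix K_ij = exp(-||x_i - x_j||^2 / sigma^2)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / sigma**2)

X = np.random.default_rng(0).standard_normal((50, 3))
K = gaussian_gram(X, sigma=1.0)
print(np.linalg.eigvalsh(K).min())   # >= 0 up to round-off: the Gram matrix is PSD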

Slide 12: Functional Data
- Feature map Φ: Ω → H, x ↦ k(·, x), i.e., Φ(x) = k(·, x)
- Data X_1, ..., X_N become functional data Φ_X(X_1), ..., Φ_X(X_N)
- Why RKHS? By the reproducing property, computing the inner product on the RKHS is easy: ⟨ Φ(x), Φ(y) ⟩ = k(x, y)
- For f = ∑_{i=1}^N a_i Φ(x_i) = ∑_i a_i k(·, x_i) and g = ∑_{j=1}^N b_j Φ(x_j) = ∑_j b_j k(·, x_j),
  ⟨ f, g ⟩ = ∑_i ∑_j a_i b_j k(x_i, x_j)
- The computational cost essentially depends on the sample size; advantageous for high-dimensional data with small sample size
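
A small sketch of the last identity (coefficients and points chosen arbitrarily): the inner product of two expansions reduces to the quadratic form a^T K b in the Gram matrix, so explicit feature vectors are never needed.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 5))                  # points x_1, ..., x_10 in R^5
a = rng.standard_normal(10)                       # coefficients of f
b = rng.standard_normal(10)                       # coefficients of g

sq = np.sum(X**2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T))   # Gaussian Gram matrix, sigma = 1

# <f, g>_H for f = sum_i a_i k(., x_i) and g = sum_j b_j k(., x_j)
print(a @ K @ b)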

Slide 13: Covariance Operators on RKHS
- X, Y: random variables on Ω_X and Ω_Y, respectively
- Prepare RKHSs (H_X, k_X) and (H_Y, k_Y) defined on Ω_X and Ω_Y, respectively
- Define random variables in the RKHSs H_X and H_Y by Φ_X(X) = k_X(·, X) and Φ_Y(Y) = k_Y(·, Y)
- Define the covariance operator Σ_YX:
  Σ_YX = E[ Φ_Y(Y) ⟨ Φ_X(X), · ⟩ ] − E[ Φ_Y(Y) ] E[ ⟨ Φ_X(X), · ⟩ ]
- [Figure: Σ_YX maps H_X to H_Y, with Φ_X and Φ_Y the feature maps from Ω_X and Ω_Y]

Slide 14: Covariance Operators on RKHS
- Definition: Σ_YX = E[ Φ_Y(Y) ⟨ Φ_X(X), · ⟩ ] − E[ Φ_Y(Y) ] E[ ⟨ Φ_X(X), · ⟩ ]
- Σ_YX is an operator from H_X to H_Y such that
  ⟨ g, Σ_YX f ⟩ = E[ g(Y) f(X) ] − E[ g(Y) ] E[ f(X) ]  ( = Cov[ f(X), g(Y) ] )  for all f ∈ H_X, g ∈ H_Y
- cf. the Euclidean case: the covariance matrix V_YX = E[Y X^T] − E[Y] E[X]^T satisfies ( b, V_YX a ) = Cov[ (b, Y), (a, X) ]

Slide 15: Characterization of Independence
- Independence and cross-covariance operators: if the RKHSs are rich enough,
  X ⊥ Y  ⇔  Σ_XY = O
- The implication "X ⊥ Y ⟹ Σ_XY = O" is always true; the converse requires an assumption on the kernel (universality), e.g., Gaussian RBF kernels k(x, y) = exp( −||x − y||^2 / σ^2 ) are universal
- Σ_XY = O means Cov[ f(X), g(Y) ] = 0, i.e., E[ g(Y) f(X) ] = E[ g(Y) ] E[ f(X) ], for all f ∈ H_X, g ∈ H_Y
- cf. for Gaussian variables, X ⊥ Y  ⇔  V_XY = O, i.e., uncorrelated
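
To see numerically why the RKHS criterion catches what correlation misses, here is a rough check that is not on the slide: for Y = X^2 the ordinary covariance is essentially zero, while a Hilbert-Schmidt-style statistic Tr(G_X G_Y)/N^2 built from centered Gaussian Gram matrices (one common empirical surrogate for testing Σ_XY ≠ O) is clearly nonzero.

import numpy as np

def centered_gram(z, sigma=1.0):
    z = z.reshape(-1, 1)
    K = np.exp(-((z - z.T)**2) / sigma**2)
    N = len(z)
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

rng = np.random.default_rng(0)
X = rng.standard_normal(500)
Y = X**2                                   # dependent on X, but uncorrelated with X

print(np.cov(X, Y)[0, 1])                  # ~ 0: linear covariance misses the dependence
Gx, Gy = centered_gram(X), centered_gram(Y)
print(np.trace(Gx @ Gy) / 500**2)          # clearly > 0: the kernel statistic detects it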

Slide 16: Independence and Characteristic Functions
- Random variables X and Y are independent  ⇔  E_XY[ e^{iω^T X} e^{iη^T Y} ] = E_X[ e^{iω^T X} ] E_Y[ e^{iη^T Y} ] for all ω and η; i.e., e^{iω^T x} and e^{iη^T y} work as test functions
- RKHS characterization: random variables X on Ω_X and Y on Ω_Y are independent  ⇔  E_XY[ f(X) g(Y) ] = E_X[ f(X) ] E_Y[ g(Y) ] for all f ∈ H_X, g ∈ H_Y
- The RKHS approach is a generalization of the characteristic-function approach

Slide 17: RKHS and Conditional Independence
- Conditional covariance operator: X and Y are random vectors; H_X, H_Y are RKHSs with kernels k_X, k_Y, respectively
- Definition: Σ_YY|X := Σ_YY − Σ_YX Σ_XX^{−1} Σ_XY (conditional covariance operator)
- Under a universality assumption on the kernel, ⟨ g, Σ_YY|X g ⟩ = E[ Var[ g(Y) | X ] ]
- cf. for Gaussian variables, Var[ a^T Y | X = x ] = a^T ( V_YY − V_YX V_XX^{−1} V_XY ) a
- Monotonicity of conditional covariance operators: for X = (U, V), Σ_YY|U ⪰ Σ_YY|X in the sense of self-adjoint operators

Slide 18: Conditional Independence
- Theorem: X = (U, V) and Y are random vectors; H_X, H_U, H_Y are RKHSs with Gaussian kernels k_X, k_U, k_Y, respectively. Then
  Y ⊥ V | U  ⇔  Σ_YY|U = Σ_YY|X
- This theorem provides a new methodology for solving the sufficient dimension reduction problem

Slide 19: Outline
- Introduction: dimension reduction and conditional independence
- Conditional covariance operators on RKHS
- Kernel Dimensionality Reduction for regression
- Manifold KDR
- Summary

Slide 20: Kernel Dimension Reduction
- Use a universal kernel for B^T X and Y
- Σ_YY|B^T X ⪰ Σ_YY|X (⪰: the partial order of self-adjoint operators)
- Σ_YY|B^T X = Σ_YY|X  ⇔  Y ⊥ X | B^T X
- KDR objective function: min_{B: B^T B = I_d} Tr[ Σ_YY|B^T X ], which is an optimization over the Stiefel manifold

Slide 21: Estimator
- Empirical cross-covariance operator:
  Σ̂_YX^(N) = (1/N) ∑_{i=1}^N { k_Y(·, Y_i) − m̂_Y } ⟨ k_X(·, X_i) − m̂_X, · ⟩,
  where m̂_X = (1/N) ∑_{i=1}^N k_X(·, X_i) and m̂_Y = (1/N) ∑_{i=1}^N k_Y(·, Y_i)
- Σ̂_YX^(N) gives the empirical covariance:
  ⟨ g, Σ̂_YX^(N) f ⟩ = (1/N) ∑_{i=1}^N f(X_i) g(Y_i) − ( (1/N) ∑_{i=1}^N f(X_i) ) ( (1/N) ∑_{i=1}^N g(Y_i) )
- Empirical conditional covariance operator:
  Σ̂_YY|X^(N) = Σ̂_YY^(N) − Σ̂_YX^(N) ( Σ̂_XX^(N) + ε_N I )^{−1} Σ̂_XY^(N),
  where ε_N is a regularization coefficient

Slide 22: Estimating Function for KDR
- With U = B^T X,
  Tr[ Σ̂_YY|U^(N) ] = Tr[ Σ̂_YY^(N) − Σ̂_YU^(N) ( Σ̂_UU^(N) + ε_N I )^{−1} Σ̂_UY^(N) ]
                    = Tr[ G_Y − G_Y G_U ( G_U + N ε_N I_N )^{−1} ],
  where G_U = ( I_N − (1/N) 1_N 1_N^T ) K_U ( I_N − (1/N) 1_N 1_N^T ), with K_U = ( k(B^T X_i, B^T X_j) ), is the centered Gram matrix
- Optimization problem:
  min_{B: B^T B = I_d} Tr[ G_Y ( G_{B^T X} + N ε_N I_N )^{−1} ]
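
A minimal numpy sketch of this optimization, offered as an illustration rather than the authors' implementation: it uses a crude finite-difference gradient with a QR retraction onto the Stiefel manifold, and the kernel width sigma, regularization eps, step size, and iteration count are arbitrary choices.

import numpy as np

def centered_gram(Z, sigma):
    # centered Gaussian Gram matrix G = H K H, with H = I - (1/N) 1 1^T
    sq = np.sum(Z**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T) / sigma**2)
    N = Z.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def kdr_contrast(B, X, G_y, eps, sigma):
    # Tr[ G_Y (G_{B^T X} + N*eps*I)^{-1} ]
    N = X.shape[0]
    G_u = centered_gram(X @ B, sigma)
    return np.trace(G_y @ np.linalg.inv(G_u + N * eps * np.eye(N)))

def kdr_fit(X, Y, d, sigma=1.0, eps=1e-3, step=1e-2, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, m = X.shape
    G_y = centered_gram(Y.reshape(N, -1), sigma)
    B, _ = np.linalg.qr(rng.standard_normal((m, d)))     # random point on the Stiefel manifold
    for _ in range(iters):
        f0, h = kdr_contrast(B, X, G_y, eps, sigma), 1e-5
        grad = np.zeros_like(B)
        for i in range(m):                               # crude finite-difference gradient
            for j in range(d):
                Bp = B.copy(); Bp[i, j] += h
                grad[i, j] = (kdr_contrast(Bp, X, G_y, eps, sigma) - f0) / h
        B, _ = np.linalg.qr(B - step * grad)             # descent step + retraction (B^T B = I_d)
    return B

# toy usage on the earlier synthetic example: the estimate should align with (1, 0)^T
rng = np.random.default_rng(0)
X = rng.uniform(-1.5, 1.5, size=(100, 2))
Y = rng.normal(0.0, np.sqrt(np.exp(X[:, 0]) + 1.0))
print(kdr_fit(X, Y, d=1))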

Slide 23: Experiments with KDR
- Wine data: 13-dimensional, 178 data points, 3 classes; 2-dimensional projection
- Gaussian kernel k(z_1, z_2) = exp( −||z_1 − z_2||^2 / σ^2 )
- [Figure: 2-dimensional projections found by KDR, CCA, Partial Least Squares, and Sliced Inverse Regression]

Slide 24: Consistency of KDR
- Theorem: Suppose k_d is bounded and continuous, and ε_N → 0, N^{1/2} ε_N → ∞ (N → ∞). Let S_0 be the set of optimal parameters:
  S_0 = { B : B^T B = I_d, Tr[ Σ_YY|X^B ] = min_{B'} Tr[ Σ_YY|X^{B'} ] }.
  Then, under some conditions, for any open set U ⊇ S_0,
  Pr( B̂^(N) ∈ U ) → 1 (N → ∞).

Slide 25: Lemma
- Suppose k_d is bounded and continuous, and ε_N → 0, N^{1/2} ε_N → ∞ (N → ∞). Then, under some conditions,
  sup_{B: B^T B = I_d} | Tr[ Σ̂_YY|X^{B(N)} ] − Tr[ Σ_YY|X^B ] | → 0 (N → ∞)
  in probability.

Slide 26: Conclusions
- Introduction: dimension reduction and conditional independence
- Conditional covariance operators on RKHS
- Kernel Dimensionality Reduction for regression
- Technical report available at:

Slide 27: Regression on Manifolds using Kernel Dimension Reduction
Jens Nilsson (Centre for Mathematical Sciences, Lund University), Fei Sha (Computer Science Division, University of California, Berkeley), Michael I. Jordan (Computer Science Division and Department of Statistics, University of California, Berkeley)
June 28, 2008

Slide 28: Dimensionality Reduction in Regression
- Find a lower-dimensional subspace Z ⊂ X and a mapping X ∋ x_i ↦ z_i ∈ Z, i = 1, ..., m, such that {z_i} retains maximal predictive power w.r.t. {y_i}
- Supervised dimensionality reduction

Slide 29: Sufficient Dimension Reduction
- Parameterize Z by B ∈ R^{D x d} where B^T B = I
- Find B such that Y ⊥ X | B^T X   (1)
- Under weak conditions, the intersection of all such Z_B defines the central subspace S

Slide 30: Kernel Dimension Reduction
- Measure conditional independence in the RKHS: map X and Y to reproducing kernel Hilbert spaces H_X, H_Y, with functions f ∈ H_X of X and g ∈ H_Y of Y
- The cross-covariance C_fg between f and g can be represented by an operator Σ_YX: H_X → H_Y such that
  ⟨ g, Σ_YX f ⟩_{H_Y} = C_fg for all f, g   (2)
- Conditional covariance operator:
  Σ_YY|X = Σ_YY − Σ_YX Σ_XX^{−1} Σ_XY   (3)

Slide 31: KDR Theorem
1. Σ_YY|X ⪯ Σ_YY|B^T X
2. Σ_YY|X = Σ_YY|B^T X  ⇔  Y ⊥ X | B^T X
- The central subspace can be found by minimizing Σ_YY|B^T X

Slide 32: KDR Algorithm
- The minimization of Σ_YY|B^T X can be formulated as
  min Tr[ K^c_Y ( K^c_{B^T X} + NɛI )^{−1} ] such that B^T B = I   (4)
  where K^c_Y and K^c_{B^T X} are centered Gram matrices

Slide 33: Reduction to Manifolds?
- Large literature on manifold learning: the goal is to uncover the intrinsic geometry underlying a data set, often for visualization
- This is usually done without taking into account a response variable
- As before, we're motivated to find a way to estimate manifolds while taking a response into account; e.g., this can help provide guidance for visualization
- We'll combine (normalized) graph Laplacian technology with KDR

Slide 34: Eigenvectors of the Graph Laplacian
- [Figure: first four (non-constant) eigenvectors of the graph Laplacian on a torus]
- Harmonics on the manifold; reflect intrinsic coordinates
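
A compact sketch of one way to compute such eigenvectors (the k-nearest-neighbor graph, median-heuristic Gaussian edge weights, and dense eigendecomposition are assumed choices, not the authors' code):

import numpy as np

def laplacian_eigenvectors(X, n_neighbors=10, n_vectors=4):
    # pairwise squared distances and Gaussian edge weights
    D2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    W = np.exp(-D2 / np.median(D2))
    # symmetrized k-nearest-neighbor graph
    idx = np.argsort(D2, axis=1)[:, 1:n_neighbors + 1]
    M = np.zeros_like(W)
    M[np.repeat(np.arange(len(X)), n_neighbors), idx.ravel()] = 1.0
    W = W * np.maximum(M, M.T)
    # normalized graph Laplacian L = I - D^{-1/2} W D^{-1/2}
    d = W.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(X)) - Dinv @ W @ Dinv
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:n_vectors + 1]      # drop the (near-)constant leading eigenvector

# points sampled from a torus embedded in R^3
rng = np.random.default_rng(0)
theta, phi = rng.uniform(0, 2 * np.pi, (2, 500))
X = np.column_stack([(2 + np.cos(phi)) * np.cos(theta),
                     (2 + np.cos(phi)) * np.sin(theta),
                     np.sin(phi)])
V = laplacian_eigenvectors(X)            # columns approximate harmonics in (theta, phi)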

Slide 35: mKDR Formulation (Nilsson, Sha, and Jordan, 2007)
- Compute the eigenvectors v_i, i = 1, ..., M, of the normalized graph Laplacian
- Define an RKHS explicitly as the span of these eigenvectors
- Approximate the image of the central subspace with a linear transformation ΦV^T

Slide 36: mKDR Algorithm
- mKDR minimization problem:
  min Tr[ K^c_Y ( VΩV^T + NɛI )^{−1} ] such that Ω ⪰ 0, Tr(Ω) = 1   (5)
- Φ is recovered from Ω (in this formulation the linear kernel on ΦV^T gives the Gram matrix VΩV^T with Ω = Φ^T Φ)
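
A rough numpy sketch of the two ingredients of problem (5), again only an illustration under assumptions: V would come from a Laplacian computation like the sketch above, and the feasibility step below is a simple eigenvalue-clipping heuristic rather than the exact projection or the solver used in the paper.

import numpy as np

def mkdr_objective(Omega, V, G_y, eps):
    # Tr[ K^c_Y (V Omega V^T + N*eps*I)^{-1} ]
    N = V.shape[0]
    return np.trace(G_y @ np.linalg.inv(V @ Omega @ V.T + N * eps * np.eye(N)))

def make_feasible(Omega):
    # crude feasibility step for {Omega >= 0, Tr(Omega) = 1}:
    # symmetrize, clip negative eigenvalues, rescale the trace
    w, U = np.linalg.eigh((Omega + Omega.T) / 2.0)
    w = np.clip(w, 0.0, None)
    P = (U * w) @ U.T
    return P / np.trace(P)

Wrapping a projected-gradient (or semidefinite-programming) loop around these two pieces yields an Ω whose near-rank-one structure, Ω ≈ aa^T, is what the following slides exploit.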

Slide 37: Regression on a Torus
- {x_i} have intrinsic coordinates [θ_i, φ_i] ∈ S^1 x S^1
- y is a logistic function of (θ, φ)

Slide 39: mKDR Finds the Central Subspace
- Ω is nearly rank 1: Ω ≈ aa^T; project onto a
- [Figures: y plotted against Va — left: uniform grid sampling; right: uniformly random sampling with additive noise]

Slide 40: mKDR Can Be Used to Guide Visualization
- Map {x_i} onto the eigenvectors {v_i} with largest weight in Φ
- Predictive eigenvectors, in contrast to the principal eigenvectors used in, e.g., Laplacian eigenmaps
- [Figures: embedding onto the principal eigenvectors (v_1, v_2, v_3) vs. the predictive eigenvectors (v_1, v_3, v_5)]

Slide 41: Global Temperature Data
- {y_i} are satellite measurements of atmospheric temperatures around the globe
- The observation points {x_i} lie on a spheroid in R^3
- Regress the temperature y on x

Slide 42: Regression Model of Temperature Distribution
- Compute the central space ΦV^T and use linear regression to model E[ Y | ΦV^T ]
- [Figures: predicted temperature; central space coordinate; prediction error]

Slide 43: Visualization of an Image Data Manifold
- {x_i} are a set of 1000 grayscale images with 4 degrees of freedom: rotation angle, tilt angle, and translations in the image plane
- The data lie on a 4-dimensional manifold in the high-dimensional pixel space
- Create a lower-dimensional embedding that captures the variation in rotation angle

Slide 44: Unsupervised Embedding
- Project onto the principal eigenvectors, i.e., Laplacian Eigenmaps
- [Figures: principal eigenvectors, colored by tilt angle; principal eigenvectors, colored by rotation angle]

Slide 45: Predictive Embedding Guided by mKDR
- Apply mKDR with the rotation angle as response
- Map the data onto the predictive eigenvectors of the graph Laplacian
- [Figure: predictive eigenvectors]

Slide 46: Summary
- mKDR discovers manifolds that optimally preserve predictive power w.r.t. response variables
- mKDR enables flexible regression modeling and supervised exploration of nonlinear data manifolds
- mKDR extends sufficient dimension reduction to nonlinear manifolds, and manifold learning to the supervised setting

More information