Learning gradients: prescriptive models


1 Department of Statistical Science, Institute for Genome Sciences & Policy, and Department of Computer Science, Duke University. May 11, 2007

2 Relevant papers
Learning Coordinate Covariances via Gradients. Sayan Mukherjee, Ding-Xuan Zhou; Journal of Machine Learning Research, 7(Mar), 2006.
Estimation of Gradients and Coordinate Covariation in Classification. Sayan Mukherjee, Qiang Wu; Journal of Machine Learning Research, 7(Nov), 2006.
Learning Gradients and Feature Selection on Manifolds. Sayan Mukherjee, Qiang Wu, Ding-Xuan Zhou; Annals of Statistics, submitted.
Learning Gradients: simultaneous regression and inverse regression. Mauro Maggioni, Sayan Mukherjee, Qiang Wu; Journal of Machine Learning Research, in preparation.

3 Table of contents
1. Regression and inverse regression; learning gradients as a justification for inverse regression
2. Nonparametric kernel model; convergence of the estimate

4-5 Generative vs. predictive modelling
Given data D = {Z_i = (x_i, y_i)}_{i=1}^n with Z_i drawn i.i.d. from ρ(X, Y), where X ∈ 𝒳 ⊆ ℝ^p, Y ∈ ℝ, and p ≫ n.
Two options:
1. discriminative or regression: model Y | X
2. generative: model X | Y (sometimes called inverse regression)

6-8 Motivation and related work
Data generated by measuring thousands of variables lies on or near a low-dimensional manifold.
Manifold learning: LLE, ISOMAP, Laplacian Eigenmaps, Hessian Eigenmaps.
Simultaneous dimensionality reduction and regression: SIR, MAVE, SAVE.

9 Regression
Given X ∈ 𝒳 ⊆ ℝ^p, Y ∈ ℝ, p ≫ n, and ρ(X, Y), we want to model Y | X. A natural idea:
f_r = arg min_f Var(f) = arg min_f E(Y − f(X))²,
and f_r(x) = E[Y | X = x] provides a summary of Y | X.
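A minimal illustration of the regression function as a conditional expectation: a Nadaraya-Watson (kernel-smoothing) estimate of E[Y | X = x]. This is not the estimator used later in the talk; the Gaussian weight and the bandwidth h are illustrative assumptions.

    import numpy as np

    def nw_regression(x_query, X, y, h=1.0):
        """Nadaraya-Watson estimate of f_r(x) = E[Y | X = x] with a Gaussian kernel."""
        # squared distances from the query point to every sample
        d2 = np.sum((X - x_query) ** 2, axis=1)
        w = np.exp(-d2 / (2 * h ** 2))        # kernel weights
        return np.dot(w, y) / np.sum(w)       # locally weighted average of the labels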

10-11 Inverse regression
Given X ∈ 𝒳 ⊆ ℝ^p, Y ∈ ℝ, p ≫ n, and ρ(X, Y), we want to model X | Y. The matrix Ω = cov(X | Y) provides a summary of X | Y:
1. Ω_ii: relevance of variable i with respect to the label
2. Ω_ij: covariation of variables i and j with respect to the label

12-13 Learning gradients
Given data D = {Z_i = (x_i, y_i)}_{i=1}^n with Z_i drawn i.i.d. from ρ(X, Y), we will simultaneously estimate f_r(x) and its gradient
∇f_r = (∂f_r/∂x_1, …, ∂f_r/∂x_p)^T.
1. regression: f_r(x)
2. inverse regression: the outer product of gradients Γ = E[∇f_r ∇f_r^T], that is, Γ_ij = ⟨∂f_r/∂x_i, ∂f_r/∂x_j⟩.
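A small sketch of the empirical gradient outer product, assuming estimates of the gradient at the sample points are already available (for instance from the estimator introduced later in the talk); the function name is illustrative.

    import numpy as np

    def gradient_outer_product(grads):
        """Empirical Gamma = (1/n) sum_i grad_i grad_i^T from an (n, p) array of gradients."""
        n = grads.shape[0]
        return grads.T @ grads / n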

14-15 Linear case
We start with the linear case
y = w · x + ε, ε ~ N(0, σ²) i.i.d.
With Σ_X = cov(X) and σ_Y² = var(Y),
Ω = cov(X | Y) = (1/σ_Y²) Σ_X Γ Σ_X.
Γ and Ω are equivalent modulo rotation and scale.

16 Nonlinear case
For smooth f(x),
y = f(x) + ε, ε ~ N(0, σ²) i.i.d.,
the matrix Ω = cov(X | Y) is not so clear.

17-18 Nonlinear case
Partition 𝒳 into sections, 𝒳 = ∪_{i=1}^I χ_i, and compute local quantities
Ω_i = cov(X_{χ_i} | Y_{χ_i}), Σ_i = cov(X_{χ_i}), σ_i² = var(Y_{χ_i}).
Then
Γ = Σ_{i=1}^I σ_i² Σ_i^{-1} Ω_i Σ_i^{-1}.

19 Taylor expansion and gradients
Given D = (Z_1, …, Z_n), the variance of f may be approximated as
Var_n(f) = Σ_{i,j=1}^n w_{ij} [y_i − f(x_j) − ∇f(x_j) · (x_i − x_j)]²,
where the weight w_{ij} ensures locality, x_i ≈ x_j.
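A direct sketch of this empirical quantity for a given function and gradient field. The Gaussian locality weight w_{ij} = exp(−‖x_i − x_j‖²/2s²) and the scale s are illustrative assumptions; the slide does not fix the form of w_{ij}.

    import numpy as np

    def var_n(X, y, f_vals, grads, s=1.0):
        """Weighted first-order Taylor loss Var_n(f).

        X: (n, p) inputs, y: (n,) labels,
        f_vals: (n,) values f(x_j), grads: (n, p) gradients at x_j.
        """
        diff = X[:, None, :] - X[None, :, :]                  # x_i - x_j, shape (n, n, p)
        w = np.exp(-np.sum(diff ** 2, axis=2) / (2 * s**2))   # locality weights w_ij
        # first-order Taylor prediction of y_i from the expansion around x_j
        pred = f_vals[None, :] + np.einsum('jp,ijp->ij', grads, diff)
        return np.sum(w * (y[:, None] - pred) ** 2)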

20-22 Penalized loss estimator
f̂(x) = arg min_{f ∈ H_K} [ error on data + smoothness of function ] = arg min_{f ∈ H_K} [ L(f, data) + λ ‖f‖_K² ]
error on data: L(f, data) = Var_n(f)
smoothness of function: ‖f‖_K² (a derivative penalty, e.g. ∫ ‖∇f(x)‖² dx for a suitable K)
big function space: the reproducing kernel Hilbert space H_K
The kernel: K : 𝒳 × 𝒳 → ℝ, e.g. K(u, v) = e^{−‖u−v‖²}. The RKHS:
H_K = { f : f(x) = Σ_{i=1}^l α_i K(x, x_i), x_i ∈ 𝒳, α_i ∈ ℝ, l ∈ ℕ }.
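A short sketch of the Gaussian kernel above and of evaluating a function in the span of the representers; the coefficient vector alpha is an arbitrary illustrative input, not something produced on this slide.

    import numpy as np

    def gaussian_kernel(U, V):
        """K(u, v) = exp(-||u - v||^2) evaluated for all pairs of rows of U and V."""
        d2 = np.sum(U**2, 1)[:, None] + np.sum(V**2, 1)[None, :] - 2 * U @ V.T
        return np.exp(-np.maximum(d2, 0.0))

    def rkhs_eval(x, centers, alpha):
        """Evaluate f(x) = sum_i alpha_i K(x, x_i), an element of H_K."""
        return gaussian_kernel(x[None, :], centers)[0] @ alpha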

23 Gradient estimate
Nonparametric model:
(f_D, f⃗_D) := arg min_{(f, f⃗) ∈ H_K^{p+1}} [ Σ_{i,j=1}^n w_{ij} (y_j − f(x_i) − f⃗(x_i) · (x_j − x_i))² + λ_1 ‖f‖_K² + λ_2 ‖f⃗‖_K² ],
where H_K^p is the space of vector-valued functions f⃗ = (f_1, …, f_p) with each f_i ∈ H_K, ‖f⃗‖_K² = Σ_{i=1}^p ‖f_i‖_K², and λ_1, λ_2 > 0.
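The sketch below is not the RKHS estimator above; it is a cruder, pointwise version of the same first-order Taylor idea that may help fix intuition: at each sample point, fit the local value and gradient by weighted least squares, with a small ridge term standing in for the RKHS penalties. The Gaussian weight scale s and the ridge lam are illustrative parameters.

    import numpy as np

    def local_gradient_estimates(X, y, s=1.0, lam=1e-3):
        """For each x_i, fit (f(x_i), grad f(x_i)) by locally weighted least squares.

        Minimizes sum_j w_ij (y_j - b - g.(x_j - x_i))^2 + lam (b^2 + ||g||^2)
        and returns f_vals (n,) and grads (n, p).
        """
        n, p = X.shape
        f_vals = np.zeros(n)
        grads = np.zeros((n, p))
        for i in range(n):
            diff = X - X[i]                                    # x_j - x_i
            w = np.exp(-np.sum(diff**2, axis=1) / (2 * s**2))  # locality weights
            Z = np.hstack([np.ones((n, 1)), diff])             # design: intercept + linear term
            A = Z.T @ (w[:, None] * Z) + lam * np.eye(p + 1)
            beta = np.linalg.solve(A, Z.T @ (w * y))
            f_vals[i], grads[i] = beta[0], beta[1:]
        return f_vals, grads

The returned grads array can be passed to the gradient_outer_product sketch above to form an empirical estimate of Γ.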

24 Computational efficiency
The computation requires fewer than n² parameters and is O(n^6) time and O(pn) memory:
f_D(x) = Σ_{i=1}^n a_{i,D} K(x_i, x), f⃗_D(x) = Σ_{i=1}^n c_{i,D} K(x_i, x),
with a_D = (a_{1,D}, …, a_{n,D}) ∈ ℝ^n and c_D = (c_{1,D}, …, c_{n,D})^T ∈ ℝ^{np}.

25 Consistency
Theorem. Under mild regularity conditions on the distribution and corresponding density, with probability 1 − δ,
‖f_D − f_r‖_{ρ_X} ≤ C log(1/δ) n^{−1/p} and ‖f⃗_D − ∇f_r‖_{ρ_X} ≤ C log(1/δ) n^{−1/p}.

26 Linear example
Samples from class −1 were drawn from:
x_j ~ N(1.5, 1) for j = 1, …, 10,
x_j ~ N(−3, 1) for j = 11, …, 20,
x_j ~ N(0, σ_noise) for j = 21, …, 80.
Samples from class +1 were drawn from:
x_j ~ N(1.5, 1) for j = 41, …, 50,
x_j ~ N(−3, 1) for j = 51, …, 60,
x_j ~ N(0, σ_noise) for j = 1, …, 40, 61, …, 80.
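A sketch that generates data following this recipe. The sample size per class, the value of σ_noise, and treating the second argument of N(·, ·) as a standard deviation are assumptions not fixed on the slide.

    import numpy as np

    def linear_example(n_per_class=30, sigma_noise=0.5, seed=0):
        """Synthetic 80-dimensional two-class data as described on the slide."""
        rng = np.random.default_rng(seed)
        p = 80
        # class -1: signal in coordinates 1-10 and 11-20 (1-indexed), noise elsewhere
        Xm = rng.normal(0.0, sigma_noise, (n_per_class, p))
        Xm[:, 0:10] = rng.normal(1.5, 1.0, (n_per_class, 10))
        Xm[:, 10:20] = rng.normal(-3.0, 1.0, (n_per_class, 10))
        # class +1: signal in coordinates 41-50 and 51-60, noise elsewhere
        Xp = rng.normal(0.0, sigma_noise, (n_per_class, p))
        Xp[:, 40:50] = rng.normal(1.5, 1.0, (n_per_class, 10))
        Xp[:, 50:60] = rng.normal(-3.0, 1.0, (n_per_class, 10))
        X = np.vstack([Xm, Xp])
        y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
        return X, y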

27-28 Linear example
[Figures: the data samples across the 80 dimensions, the RKHS norm of each component of the gradient estimate versus dimension, and the estimated class probability for each sample.]

29 Nonlinear example
Samples from class +1 were drawn from:
(x_1, x_2) = (r sin(θ), r cos(θ)), where r ~ U[0, 1] and θ ~ U[0, 2π],
x_j ~ N(0.0, 0.2) for j = 3, …, 200.
Samples from class −1 were drawn from:
(x_1, x_2) = (r sin(θ), r cos(θ)), where r ~ U[2, 3] and θ ~ U[0, 2π],
x_j ~ N(0.0, 0.2) for j = 3, …, 200.
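A sketch that generates this annulus data. The sample size per class and treating 0.2 as a standard deviation (rather than a variance) are assumptions.

    import numpy as np

    def nonlinear_example(n_per_class=50, seed=0):
        """Two concentric annuli in the first two of 200 coordinates, noise elsewhere."""
        rng = np.random.default_rng(seed)
        p = 200

        def annulus(r_lo, r_hi, n):
            r = rng.uniform(r_lo, r_hi, n)
            theta = rng.uniform(0.0, 2 * np.pi, n)
            X = rng.normal(0.0, 0.2, (n, p))        # noise coordinates 3..200
            X[:, 0] = r * np.sin(theta)             # x_1
            X[:, 1] = r * np.cos(theta)             # x_2
            return X

        Xp = annulus(0.0, 1.0, n_per_class)         # class +1: inner disk
        Xm = annulus(2.0, 3.0, n_per_class)         # class -1: outer ring
        X = np.vstack([Xp, Xm])
        y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
        return X, y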

30-32 Nonlinear example
[Figures: the two classes in the first two dimensions, the RKHS norm of each gradient component versus dimension, and the estimated class probability for each sample.]

33-35 Restriction to a manifold
Assume the data is concentrated on a manifold M ⊆ ℝ^p with intrinsic dimension d (d ≪ p) and that there exists an isometric embedding ϕ : M → ℝ^p.
Given a smooth orthonormal vector field {e_1, …, e_d} we can define the gradient on the manifold,
∇_M f = (e_1 f, …, e_d f).
For q ∈ U ⊆ M a chart u : U → ℝ^d satisfying (∂/∂u_i)(q) = e_i(q) exists. The Taylor expansion on the manifold around q:
f(q') ≈ f(q) + ∇_M f(q) · (u(q') − u(q)) for q' ≈ q.

36-37 A problem with all manifold methods
The Taylor expansion on the manifold is around q:
f(q') ≈ f(q) + ∇_M f(q) · (u(q') − u(q)) for q' ≈ q.
But {(q_i, y_i)}_{i=1}^n ⊂ M × Y are drawn from the manifold, and neither M nor a local expression of it is given. Also, we only have the image x_i = ϕ(q_i) ∈ ℝ^p.

38-40 The standard solution
Instead of the Taylor expansion on the manifold around q,
f(q') ≈ f(q) + ∇_M f(q) · (u(q') − u(q)) for q' ≈ q,
use the Taylor expansion in ℝ^p in terms of f ∘ ϕ^{-1}:
(f ∘ ϕ^{-1})(u) ≈ (f ∘ ϕ^{-1})(x) + ∇(f ∘ ϕ^{-1})(x) · (u − x) for u ≈ x.
Due to this equivalence, f⃗_D ≈ dϕ(∇_M f).

41 Improved rate of convergence
Theorem. Under mild regularity conditions on the distribution and corresponding density, with probability 1 − δ,
‖f_D − f_r‖_{ρ_X} ≤ C log(1/δ) n^{−1/d} and ‖(dϕ)* f⃗_D − ∇_M f_r‖_{ρ_X} ≤ C log(1/δ) n^{−1/d},
where (dϕ)* is the dual of the map dϕ.

42 Sensitive features
Proposition. Let f be a smooth function on ℝ^p with gradient ∇f. The sensitivity of f along a (unit normalized) direction u is ‖u · ∇f‖_K. The d most sensitive features are those {u_1, …, u_d} that are orthogonal and maximize ‖u_i · ∇f‖_K.
A spectral decomposition of Γ is used to compute these features: the eigenvectors corresponding to the d top eigenvalues.
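A minimal sketch of this spectral step, assuming an empirical p × p estimate of Γ has already been formed; eigh is appropriate because the matrix is symmetric positive semidefinite.

    import numpy as np

    def sensitive_features(Gamma_hat, d):
        """Top-d eigenvectors (and eigenvalues) of the gradient outer product matrix."""
        evals, evecs = np.linalg.eigh(Gamma_hat)    # ascending eigenvalues
        order = np.argsort(evals)[::-1][:d]         # take the d largest
        return evecs[:, order], evals[order]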

43-44 Projection of data
The matrix
Γ̂ = f⃗_D f⃗_D^T = c_D^T K c_D
is an empirical estimate of Γ. Geometry is preserved by projecting the data onto the top k eigenvectors. There is no need to compute the p × p matrix: the method is O(n²p + n³) time and O(pn) memory.
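A sketch of how the projection can be organized so that only n × n problems are solved, in line with the stated O(n²p + n³) cost. It assumes c_D is stored as an (n, p) array of representer coefficients, K is the (n, n) kernel matrix on the training points, and X is the (n, p) data matrix; the function name and these shapes are assumptions.

    import numpy as np

    def project_data(X, K, c_D, k):
        """Project data onto the top-k eigenvectors of Gamma_hat = c_D^T K c_D
        without ever forming the p x p matrix."""
        # symmetric square root of the PSD kernel matrix K
        w, U = np.linalg.eigh(K)
        R = (U * np.sqrt(np.clip(w, 0, None))) @ U.T     # K = R R, R symmetric
        B = R @ c_D                                      # (n, p); Gamma_hat = B^T B
        lam, Uc = np.linalg.eigh(B @ B.T)                # small n x n eigenproblem
        order = np.argsort(lam)[::-1][:k]
        lam, Uc = lam[order], Uc[:, order]
        V = B.T @ Uc / np.sqrt(np.maximum(lam, 1e-12))   # top-k eigenvectors of Gamma_hat (p, k)
        return X @ V                                     # projected data (n, k)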

45-47 Linear example
[Figures: the eigenvalue spectrum of Γ̂, the samples across dimensions, and the data projected onto the top feature.]

48-51 Nonlinear example
[Figures: the data projected onto the top features, the eigenvalue spectrum of Γ̂, and the recovered dimensions.]

52-54 Digits: 6 vs. 9
[Figures: sample digit images, the norm of the gradient components and the eigenvalue spectrum across dimensions, and the data projected onto the top two features.]

55 Leukemia
48 samples of AML, 25 samples of ALL, p = 7,129. The dataset is split into a training set of 38 samples and a test set of 35 samples.

56-57 Leukemia
[Figures: projections of the samples, the eigenvalue spectrum of Γ̂, and the distance to the separating hyperplane for each sample.]

58-59 Gauss-Markov graphical models
Given a multivariate normal distribution with covariance matrix C, the matrix P = C^{-1} is the conditional independence matrix:
P_ij = dependence of i on j given all other variables.
Set C = Γ̂. However, many values are near zero, and it is unclear how to truncate.
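A small sketch of this step: invert a regularized Γ̂ and rescale to partial correlations. The ridge eps added for invertibility is an assumption, not something specified on the slide.

    import numpy as np

    def conditional_independence_matrix(Gamma_hat, eps=1e-6):
        """P = C^{-1} with C = Gamma_hat (+ a small ridge), rescaled to partial correlations."""
        p = Gamma_hat.shape[0]
        P = np.linalg.inv(Gamma_hat + eps * np.eye(p))
        d = np.sqrt(np.diag(P))
        partial_corr = -P / np.outer(d, d)       # standard sign convention
        np.fill_diagonal(partial_corr, 1.0)
        return P, partial_corr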

60 Multiscale graphical models
Assume for now that Γ̂ has all positive entries and is normalized to be a Markov matrix:
T = Λ^{-1/2} Γ̂ Λ^{-1/2},
where Λ is the eigenvalue matrix for Γ̂.

61-62 Multiscale graphical models
Note the harmonic expansion
(I − T)^{-1} = Σ_{k≥0} T^k = Π_{k≥0} (I + T^{2^k}),
where k is path length in the first equality: T^k_ij = probability of reaching j from i in path length k. The product of the (I + T^{2^k}) terms factorizes (I − T)^{-1} into low-rank matrices with fewer entries. This is the idea of diffusion wavelets.
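A quick numerical check of the dyadic factorization on a small sub-stochastic T (spectral radius below one, so that (I − T) is invertible); the matrix size and the 0.9 scaling are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((6, 6))
    T = 0.9 * A / A.sum(axis=1, keepdims=True)      # sub-stochastic: spectral radius <= 0.9

    I = np.eye(6)
    neumann = np.linalg.inv(I - T)                  # sum_{k>=0} T^k

    prod, P = I.copy(), T.copy()                    # P holds T^(2^k)
    for _ in range(8):                              # (I+T)(I+T^2)(I+T^4)...
        prod = prod @ (I + P)
        P = P @ P

    print(np.max(np.abs(neumann - prod)))           # agrees to high precision (~1e-11)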

63 Summary and toy example
Key quantities: C corresponds to T and C^{-1} to (I − T)^{-1}, so the product of the (I + T^{2^k}) terms factorizes C^{-1}.

64-65 Summary and toy example
[Figures: a toy data matrix X on variables a, b, c, d, the empirical Γ̂, and the resulting multiscale grouping of the variables.]

66-67 General covariance matrices
The math goes through as above; the interpretation is slightly different. Path coefficients (Wright 1921) and path regression (Tukey 1954):
X_1 = c_12 X_2 + c_13 X_3 + … + c_1m X_m,
where c_12 is the contribution of X_2 to X_1. The matrix Γ can be thought of as a matrix of path coefficients, and powers of the matrix compute path coefficients of length k, as in the Markov case.

68 An early graphical model
[Figure: Wright's path diagram, with nodes for weight at 33 days, gain 0-33 days, external conditions, weight at birth, heredity, rate of growth, condition of dam, gestation period, size of litter, and heredity of dam.]

69-73 Discussion
Lots of work left:
Semi-supervised setting.
Multi-task setting.
Bayesian formulation.
Nonlinear projections via diffusion maps.
Noise on and off the manifold.
