Department of Statistical Science
Institute for Genome Sciences & Policy
Department of Computer Science
Duke University

May 11, 2007
Relevant papers

- Learning Coordinate Covariances via Gradients. Sayan Mukherjee, Ding-Xuan Zhou; Journal of Machine Learning Research, 7(Mar):519–549, 2006.
- Estimation of Gradients and Coordinate Covariation in Classification. Sayan Mukherjee, Qiang Wu; Journal of Machine Learning Research, 7(Nov):2481–2514, 2006.
- Learning Gradients and Feature Selection on Manifolds. Sayan Mukherjee, Qiang Wu, Ding-Xuan Zhou; Annals of Statistics, submitted.
- Learning Gradients: simultaneous regression and inverse regression. Mauro Maggioni, Qiang Wu; Journal of Machine Learning Research, in preparation.
Table of contents

1. Regression and inverse regression
   - Learning gradients: a justification for inverse regression
2. Learning gradients
   - Nonparametric kernel model
   - Convergence of estimate
3. Linear and nonlinear examples
4. Restriction to a manifold
5. Sensitive features and projection of data
6. Applications: digits and leukemia
7. Graphical models
8. Discussion
Generative vs. predictive modelling

Given data $D = \{Z_i = (X_i, Y_i)\}_{i=1}^n$ with $Z_i \overset{iid}{\sim} \rho(X, Y)$, where $X \in \mathcal{X} \subset \mathbb{R}^p$, $Y \in \mathbb{R}$, and $p \gg n$.

Two options:
1. discriminative or regression: $Y \mid X$
2. generative: $X \mid Y$ (sometimes called inverse regression)
Motivation and related work

Data generated by measuring thousands of variables lies on or near a low-dimensional manifold.

- Manifold learning: LLE, ISOMAP, Laplacian Eigenmaps, Hessian Eigenmaps.
- Simultaneous dimensionality reduction and regression: SIR, MAVE, SAVE.
Regression

Given $X \in \mathcal{X} \subset \mathbb{R}^p$, $Y \in \mathbb{R}$, $p \gg n$, and $\rho(X, Y)$, we want $Y \mid X$. A natural idea:
$$f_r = \arg\min_f \mathrm{Var}(f) = \arg\min_f \mathbb{E}\,(Y - f(X))^2,$$
and $f_r(x) = \mathbb{E}[Y \mid X = x]$ provides a summary of $Y \mid X$.
Inverse regression

Given $X \in \mathcal{X} \subset \mathbb{R}^p$, $Y \in \mathbb{R}$, $p \gg n$, and $\rho(X, Y)$, we want $X \mid Y$. The covariance of the inverse regression function,
$$\Omega = \mathrm{cov}\big(\mathbb{E}[X \mid Y]\big),$$
provides a summary of $X \mid Y$:
1. $\Omega_{ii}$: relevance of variable $i$ with respect to the label
2. $\Omega_{ij}$: covariation of variables $i$ and $j$ with respect to the label
Learning gradients

Given data $D = \{Z_i = (X_i, Y_i)\}_{i=1}^n$ with $Z_i \overset{iid}{\sim} \rho(X, Y)$, we will simultaneously estimate $f_r(x)$ and its gradient
$$\nabla f_r = \left(\frac{\partial f_r}{\partial x^1}, \ldots, \frac{\partial f_r}{\partial x^p}\right)^T.$$

1. regression: $f_r(x)$
2. inverse regression: the outer product of gradients $\Gamma = \mathbb{E}\big[\nabla f_r \, (\nabla f_r)^T\big]$, i.e. $\Gamma_{ij} = \left\langle \frac{\partial f_r}{\partial x^i}, \frac{\partial f_r}{\partial x^j} \right\rangle$.
Linear case

We start with the linear case
$$y = w \cdot x + \varepsilon, \qquad \varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2),$$
with $\Sigma_X = \mathrm{cov}(X)$ and $\sigma_Y^2 = \mathrm{var}(Y)$. Since $\nabla f_r = w$, here $\Gamma = w w^T$ and
$$\Omega = \mathrm{cov}\big(\mathbb{E}[X \mid Y]\big) = \frac{1}{\sigma_Y^2}\, \Sigma_X\, \Gamma\, \Sigma_X.$$
Thus $\Gamma$ and $\Omega$ are equivalent modulo rotation and scale.
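A quick numerical sanity check of this identity (a sketch: the slice-based estimate of $\mathrm{cov}(\mathbb{E}[X \mid Y])$ below is the SIR estimator, and the choice of $w$, covariance, and noise level is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200_000, 3
w = np.array([2.0, -1.0, 0.0])

# X ~ No(0, Sigma_X) with a non-trivial covariance, y = w . x + eps
A = rng.normal(size=(p, p))
Sigma_X = A @ A.T / p + np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)
y = X @ w + rng.normal(scale=0.5, size=n)

# theory: Omega = cov(E[X|Y]) = Sigma_X w w^T Sigma_X / var(Y)
Gamma = np.outer(w, w)                 # gradient outer product, grad f_r = w
Omega_theory = Sigma_X @ Gamma @ Sigma_X / y.var()

# Monte Carlo: slice on y (as in SIR) and take the covariance of slice means
slices = np.array_split(np.argsort(y), 50)
means = np.array([X[idx].mean(axis=0) for idx in slices])
Omega_mc = np.cov(means.T, bias=True)  # equal-size slices, so a plain covariance

print(np.round(Omega_theory, 2))
print(np.round(Omega_mc, 2))           # agrees up to Monte Carlo error
```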
Nonlinear case

For a smooth $f(x)$ with
$$y = f(x) + \varepsilon, \qquad \varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2),$$
the relation between $\Omega = \mathrm{cov}(\mathbb{E}[X \mid Y])$ and $\Gamma$ is not so clear.
Nonlinear case

Partition $\mathcal{X}$ into sections, $\mathcal{X} = \bigcup_{i=1}^{I} \chi_i$, and compute local quantities
$$\Omega_i = \mathrm{cov}\big(\mathbb{E}[X_{\chi_i} \mid Y_{\chi_i}]\big), \qquad \Sigma_i = \mathrm{cov}(X_{\chi_i}), \qquad \sigma_i^2 = \mathrm{var}(Y_{\chi_i}).$$
Then
$$\Gamma = \sum_{i=1}^{I} \sigma_i^2\, \Sigma_i^{-1} \Omega_i \Sigma_i^{-1}.$$
Taylor expansion and gradients

Given $D = (Z_1, \ldots, Z_n)$, the variance of $f$ may be approximated as
$$\mathrm{Var}_n(f) = \frac{1}{n^2} \sum_{i,j=1}^{n} w_{ij}\, \big[ y_i - f(x_j) - \nabla f(x_j) \cdot (x_i - x_j) \big]^2,$$
where the weight $w_{ij}$ ensures the locality of $x_i - x_j$.
Penalized loss estimator

$$\hat f(x) = \arg\min_{f \in \text{big function space}} \big[\, \text{error on data} + \lambda \cdot \text{smoothness of function} \,\big]$$

- error on data: $L(f, \text{data}) = \mathrm{Var}_n(f)$
- smoothness of function: $\|f\|_K^2$ (heuristically, like $\int \|\nabla f(x)\|^2\, dx$)
- big function space: a reproducing kernel Hilbert space $\mathcal{H}_K$

$$\hat f(x) = \arg\min_{f \in \mathcal{H}_K} \big[\, L(f, \text{data}) + \lambda \|f\|_K^2 \,\big]$$

The kernel: $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, e.g. $K(u, v) = e^{-\|u - v\|^2}$. The RKHS:
$$\mathcal{H}_K = \overline{\mathrm{span}}\left\{ f \;\middle|\; f(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i),\ x_i \in \mathcal{X},\ \alpha_i \in \mathbb{R},\ l \in \mathbb{N} \right\}.$$
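As a concrete instance of this template, here is a minimal sketch with squared error in place of $\mathrm{Var}_n$, i.e. ordinary kernel ridge regression; the bandwidth and $\lambda$ below are arbitrary choices:

```python
import numpy as np

def gauss_kernel(A, B, width=1.0):
    """Gaussian kernel matrix K(u, v) = exp(-|u - v|^2 / (2 width^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width**2))

def krr_fit(X, y, lam=1e-3, width=1.0):
    """Penalized loss estimator with squared-error loss:
    f(x) = sum_i alpha_i K(x_i, x) with alpha = (K + lam n I)^{-1} y."""
    n = len(y)
    K = gauss_kernel(X, X, width)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, alpha, X_new, width=1.0):
    return gauss_kernel(X_new, X_train, width) @ alpha

# toy usage: fit a 1-d nonlinear function
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
alpha = krr_fit(X, y)
print(krr_predict(X, alpha, np.array([[0.0], [1.5]])))  # roughly sin(0), sin(1.5)
```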
Gradient estimate

Nonparametric model:
$$(f_D, \vec f_D) := \arg\min_{(f, \vec f\,) \in \mathcal{H}_K^{p+1}} \left[ \sum_{i,j=1}^{n} w_{ij} \big( y_j - f(x_i) - \vec f(x_i) \cdot (x_j - x_i) \big)^2 + \lambda_1 \|f\|_K^2 + \lambda_2 \|\vec f\,\|_K^2 \right],$$
where $\mathcal{H}_K^p$ is the space of $p$-tuples of functions $\vec f = (f^1, \ldots, f^p)$ with $f^i \in \mathcal{H}_K$, $\|\vec f\,\|_K^2 = \sum_{i=1}^p \|f^i\|_K^2$, and $\lambda_1, \lambda_2 > 0$.
Computational efficiency

The computation requires fewer than $n^2$ parameters and is $O(n^6)$ time and $O(pn)$ memory:
$$f_D(x) = \sum_{i=1}^{n} a_{i,D} K(x_i, x), \qquad \vec f_D(x) = \sum_{i=1}^{n} c_{i,D} K(x_i, x),$$
with $a_D = (a_{1,D}, \ldots, a_{n,D})^T \in \mathbb{R}^n$ and $c_D = (c_{1,D}, \ldots, c_{n,D})^T \in \mathbb{R}^{n \times p}$.
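The full simultaneous $(f_D, \vec f_D)$ solve is a coupled linear system in $(a_D, c_D)$; below is a minimal numpy sketch of a simplified gradient-only variant (replace $f(x_j)$ in the Taylor residual by $y_j$, in the spirit of the Mukherjee–Zhou formulation), solved naively in $O((np)^3)$ rather than with the $n^2$ reduction mentioned above. The Gaussian weights, bandwidths, and $\lambda$ are illustrative choices:

```python
import numpy as np

def gauss(A, B, width):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width**2))

def learn_gradient(X, y, s=1.0, kwidth=1.0, lam=1e-3):
    """Minimize sum_ij w_ij (y_i - y_j - fvec(x_j).(x_i - x_j))^2 + lam ||fvec||_K^2
    over fvec(x) = sum_k c_k K(x_k, x) with c in R^{n x p} (representer form)."""
    n, p = X.shape
    K = gauss(X, X, kwidth)                 # kernel matrix
    W = gauss(X, X, s)                      # locality weights w_ij
    dX = X[:, None, :] - X[None, :, :]      # dX[i, j] = x_i - x_j
    sw = np.sqrt(W)
    # least-squares design: row (i,j), column (k,l) has coefficient
    # sqrt(w_ij) * K(x_j, x_k) * (x_i - x_j)_l
    A = sw[:, :, None, None] * K[None, :, :, None] * dX[:, :, None, :]
    A = A.reshape(n * n, n * p)
    b = (sw * (y[:, None] - y[None, :])).reshape(n * n)
    # normal equations; the RKHS penalty is vec(c)^T (K kron I_p) vec(c)
    R = np.kron(K, np.eye(p))
    c = np.linalg.solve(A.T @ A + lam * R, A.T @ b).reshape(n, p)
    return c, K

# toy usage: y = 3 x_1 - 2 x_2, so the gradient is (3, -2, 0, 0) everywhere
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.05 * rng.normal(size=40)
c, K = learn_gradient(X, y, s=2.0, kwidth=2.0)
print((K @ c).mean(axis=0))  # gradient estimates at the x_i, roughly (3, -2, 0, 0)
```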
Consistency

Theorem. Under mild regularity conditions on the distribution and corresponding density, with probability $1 - \delta$,
$$\|f_D - f_r\|_{\rho_X} \le C \log\!\left(\frac{1}{\delta}\right) n^{-1/p}, \qquad \|\vec f_D - \nabla f_r\|_{\rho_X} \le C \log\!\left(\frac{1}{\delta}\right) n^{-1/p}.$$
Linear example

Samples from class $-1$ were drawn from
- $x_j \sim \mathrm{No}(1.5, 1)$ for $j = 1, \ldots, 10$,
- $x_j \sim \mathrm{No}(-3, 1)$ for $j = 11, \ldots, 20$,
- $x_j \sim \mathrm{No}(0, \sigma_{noise})$ for $j = 21, \ldots, 80$.

Samples from class $+1$ were drawn from
- $x_j \sim \mathrm{No}(1.5, 1)$ for $j = 41, \ldots, 50$,
- $x_j \sim \mathrm{No}(-3, 1)$ for $j = 51, \ldots, 60$,
- $x_j \sim \mathrm{No}(0, \sigma_{noise})$ for $j = 1, \ldots, 40, 61, \ldots, 80$.
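A sketch of this generative process ($\sigma_{noise}$ is not specified on the slide, so the 0.5 below is an assumption); the output can be fed to the gradient estimator sketched earlier:

```python
import numpy as np

def linear_example(n_per_class=20, p=80, sigma_noise=0.5, seed=0):
    """Each class carries signal in 20 of the 80 dimensions;
    the remaining dimensions are No(0, sigma_noise) noise."""
    rng = np.random.default_rng(seed)
    Xm = rng.normal(0, sigma_noise, size=(n_per_class, p))   # class -1
    Xm[:, 0:10] = rng.normal(1.5, 1, size=(n_per_class, 10))
    Xm[:, 10:20] = rng.normal(-3, 1, size=(n_per_class, 10))
    Xp = rng.normal(0, sigma_noise, size=(n_per_class, p))   # class +1
    Xp[:, 40:50] = rng.normal(1.5, 1, size=(n_per_class, 10))
    Xp[:, 50:60] = rng.normal(-3, 1, size=(n_per_class, 10))
    X = np.vstack([Xm, Xp])
    y = np.r_[-np.ones(n_per_class), np.ones(n_per_class)]
    return X, y

X, y = linear_example()
```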
Linear example

[Figure: heatmap of the data matrix (80 dimensions x 40 samples); RKHS norm of each estimated gradient component across the 80 dimensions]

[Figure: estimated covariation matrix over the 80 dimensions; predicted probability of y = +1 for each of the 40 samples]
Nonlinear example

Samples from class $+1$ were drawn from
- $(x_1, x_2) = (r \sin\theta, r \cos\theta)$, where $r \sim U[0, 1]$ and $\theta \sim U[0, 2\pi]$,
- $x_j \sim \mathrm{No}(0.0, 0.2)$ for $j = 3, \ldots, 200$.

Samples from class $-1$ were drawn from
- $(x_1, x_2) = (r \sin\theta, r \cos\theta)$, where $r \sim U[2, 3]$ and $\theta \sim U[0, 2\pi]$,
- $x_j \sim \mathrm{No}(0.0, 0.2)$ for $j = 3, \ldots, 200$.
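The analogous sketch for the annulus data, with radial and angular sampling exactly as specified above (the class sizes are assumptions):

```python
import numpy as np

def nonlinear_example(n_per_class=30, p=200, seed=0):
    """Class +1 on a disc and class -1 on a surrounding annulus in
    dimensions 1-2; the remaining dimensions are No(0, 0.2) noise."""
    rng = np.random.default_rng(seed)
    def ring(r_lo, r_hi, n):
        r = rng.uniform(r_lo, r_hi, n)
        th = rng.uniform(0, 2 * np.pi, n)
        Z = rng.normal(0.0, 0.2, size=(n, p))
        Z[:, 0], Z[:, 1] = r * np.sin(th), r * np.cos(th)
        return Z
    X = np.vstack([ring(0, 1, n_per_class), ring(2, 3, n_per_class)])
    y = np.r_[np.ones(n_per_class), -np.ones(n_per_class)]
    return X, y
```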
Nonlinear example

[Figure: scatter plot of class +1 and class -1 in the first two dimensions; estimated covariation matrix over the first 10 dimensions]

[Figure: RKHS norm of each estimated gradient component, over all dimensions and zoomed to the first 10]

[Figure: predicted probability of y = +1 for each sample, two panels]
Restriction to a manifold

Assume the data is concentrated on a $d$-dimensional manifold $\mathcal{M}$ with $d \ll p$, and there exists an isometric embedding $\varphi : \mathcal{M} \to \mathbb{R}^p$.

Given a smooth orthonormal vector field $\{e_1, \ldots, e_d\}$ we can define the gradient on the manifold,
$$\nabla_{\mathcal{M}} f = (e_1 f, \ldots, e_d f).$$

For $q \in U \subset \mathcal{M}$ there exists a chart $u : U \to \mathbb{R}^d$ satisfying $\frac{\partial}{\partial u^i}(q) = e_i(q)$. The Taylor expansion on the manifold around $q$:
$$f(q') \approx f(q) + \nabla_{\mathcal{M}} f(q) \cdot \big(u(q') - u(q)\big) \quad \text{for } q' \approx q.$$
A problem with all manifold methods

The Taylor expansion on the manifold around $q$:
$$f(q') \approx f(q) + \nabla_{\mathcal{M}} f(q) \cdot \big(u(q') - u(q)\big) \quad \text{for } q' \approx q.$$
However, $\{(q_i, y_i)\}_{i=1}^n \subset \mathcal{M} \times Y$ are drawn from the manifold, but neither $\mathcal{M}$ nor a local expression of $\mathcal{M}$ is given. We observe only the image $x_i = \varphi(q_i) \in \mathbb{R}^p$.
The standard solution

The Taylor expansion on the manifold around $q$:
$$f(q') \approx f(q) + \nabla_{\mathcal{M}} f(q) \cdot \big(u(q') - u(q)\big) \quad \text{for } q' \approx q.$$
The Taylor expansion in $\mathbb{R}^p$ around $x = \varphi(q)$, in terms of $f \circ \varphi^{-1}$:
$$(f \circ \varphi^{-1})(u) \approx (f \circ \varphi^{-1})(x) + \nabla(f \circ \varphi^{-1})(x) \cdot (u - x) \quad \text{for } u \approx x.$$
Due to this equivalence,
$$\vec f_D \approx d\varphi\big(\nabla_{\mathcal{M}} f\big).$$
Improved rate of convergence

Theorem. Under mild regularity conditions on the distribution and corresponding density, with probability $1 - \delta$,
$$\|(d\varphi)^* \vec f_D - \nabla_{\mathcal{M}} f_r\|_{\rho_X} \le C \log\!\left(\frac{1}{\delta}\right) n^{-1/d}, \qquad \|f_D - f_r\|_{\rho_X} \le C \log\!\left(\frac{1}{\delta}\right) n^{-1/d},$$
where $(d\varphi)^*$ is the dual of the map $d\varphi$. The rate now depends on the intrinsic dimension $d$ rather than the ambient dimension $p$.
Sensitive features

Proposition. Let $f$ be a smooth function on $\mathbb{R}^p$ with gradient $\nabla f$. The sensitivity of the function $f$ along a (unit normalized) direction $u$ is $\|u \cdot \nabla f\|_K$. The $d$ most sensitive features are those $\{u_1, \ldots, u_d\}$ that are orthogonal and maximize $\|u_i \cdot \nabla f\|_K$.

A spectral decomposition of $\Gamma$ is used to compute these features: they are the eigenvectors corresponding to the top $d$ eigenvalues.
Projection of data

The matrix
$$\hat\Gamma = \langle \vec f_D,\, \vec f_D^{\,T} \rangle_K = c_D^T K c_D$$
is an empirical estimate of $\Gamma$. Geometry is preserved by projection onto the top $k$ eigenvectors. There is no need to compute the $p \times p$ matrix; the method is $O(n^2 p + n^3)$ time and $O(pn)$ memory.
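Continuing the earlier sketches (`c` and `K` are the coefficient and kernel matrices returned by the `learn_gradient` sketch; for small $p$ one can simply form the $p \times p$ matrix, whereas the complexity claim above comes from working with $n \times n$ matrices instead):

```python
import numpy as np

def sensitive_features(c, K, d=2):
    """Top-d eigenvectors of Gamma_hat = c^T K c: the most sensitive directions."""
    Gamma_hat = c.T @ K @ c                   # p x p empirical gradient outer product
    evals, evecs = np.linalg.eigh(Gamma_hat)  # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1][:d]
    return evals[order], evecs[:, order]

# usage with the linear example and the estimator sketched above
X, y = linear_example()
c, K = learn_gradient(X, y, s=10.0, kwidth=10.0)
evals, U = sensitive_features(c, K, d=2)
Z = X @ U                                     # project the data onto the features
```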
Linear example

[Figure: heatmap of the data matrix (100 dimensions x 40 samples); eigenvalue spectrum of the estimated Gamma]

[Figure: loadings of the top eigenvector across the 100 dimensions; projection of the samples onto the first feature]

[Figure: heatmap of the data matrix and eigenvalue spectrum for a second version of the example]
Nonlinear example

[Figure: the data in dimensions 1-2 and its projection onto features 1-2]

[Figure: eigenvalue spectrum of the estimated Gamma; loadings of a top eigenvector across the 200 dimensions]

[Figure: loadings across the dimensions; projection of the samples onto the first feature]

[Figure: loadings of the top two eigenvectors across the 200 dimensions]
Digits: 6 vs. 9

[Figure: sample images of the digits 6 and 9]

[Figure: eigenvalue spectrum of the estimated Gamma; norm of each gradient component across the pixel dimensions]

[Figure: projection of the digit images onto the top two features, two panels]
Leukemia

48 samples of AML, 25 samples of ALL, $p = 7{,}129$. The dataset is split into a training set of 38 samples and a test set of 35 samples.
Leukemia

[Figure: projection of the samples onto the top features, two 3-d views]

[Figure: eigenvalue spectrum of the estimated Gamma; distance to the separating hyperplane for each sample]
Gauss-Markov graphical models

Given a multivariate normal distribution with covariance matrix $C$, the matrix $P = C^{-1}$ is the conditional independence matrix:
$$P_{ij} = \text{dependence of } i \text{ on } j \text{ given all other variables}.$$
Set $C = \hat\Gamma$. However, many values are near zero, and it is unclear how to truncate.
Multiscale graphical models

Assume for now that $\hat\Gamma$ has all positive entries and normalize it to a Markov matrix:
$$T = \Lambda^{-1/2}\, \hat\Gamma\, \Lambda^{-1/2},$$
where $\Lambda$ is the diagonal matrix of row sums of $\hat\Gamma$ (so $T$ is conjugate to the Markov matrix $\Lambda^{-1} \hat\Gamma$).
Multiscale graphical models

Note the harmonic expansion
$$(I - T)^{-1} = \sum_{k=0}^{\infty} T^k = \prod_{k=0}^{\infty} \left( I + T^{2^k} \right),$$
where $k$ is path length in the first equality: $T^k_{ij}$ is the probability of going from $i$ to $j$ along paths of length $k$. The product $\prod (I + T^{2^k})$ factorizes $(I - T)^{-1}$ into low-rank matrices with fewer entries. This is the idea of diffusion wavelets.
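A numerical check of the dyadic product identity (a sketch; $T$ here is just a random substochastic matrix with spectral radius below 1, and seven factors cover all path lengths under 128):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.uniform(size=(6, 6))
T = 0.8 * A / A.sum(axis=1, keepdims=True)  # row sums 0.8 -> spectral radius < 1

inv = np.linalg.inv(np.eye(6) - T)

# truncated product: prod_{k=0}^{K-1} (I + T^(2^k)) = sum_{j < 2^K} T^j
P = np.eye(6)
Tk = T.copy()
for _ in range(7):
    P = P @ (np.eye(6) + Tk)
    Tk = Tk @ Tk              # repeated squaring: T^(2^k) -> T^(2^(k+1))
print(np.abs(P - inv).max())  # tiny: the factorization reproduces (I - T)^{-1}
```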
Summary and toy example

Key quantities:
$$C \to T, \qquad C^{-1} \to (I - T)^{-1},$$
so $\prod (I + T^{2^k})$ factorizes $C^{-1}$.
Summary and toy example

$X = (a, b, c, d)$,
$$\hat\Gamma = \begin{pmatrix} 5 & 3 & 1.5 & 0 \\ 3 & 5 & 0.5 & 2 \\ 1.5 & 0.5 & 5 & 2.5 \\ 0 & 2 & 2.5 & 5 \end{pmatrix}.$$

[Figure: graphs over the variables a, b, c, d recovered at different scales]
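A sketch on this toy matrix, contrasting the two readings (the row-sum normalization is my reading of the Markov normalization above):

```python
import numpy as np

G = np.array([[5.0, 3.0, 1.5, 0.0],
              [3.0, 5.0, 0.5, 2.0],
              [1.5, 0.5, 5.0, 2.5],
              [0.0, 2.0, 2.5, 5.0]])

# Gauss-Markov reading: near-zero entries of the precision matrix P = G^{-1}
# indicate conditional independence given all other variables
P = np.linalg.inv(G)
print(np.round(P, 3))

# multiscale reading: normalize and inspect powers T^(2^k);
# longer-range paths between variables appear at coarser scales
lam = np.diag(1.0 / np.sqrt(G.sum(axis=1)))
T = lam @ G @ lam
for k in range(3):
    print(f"scale {k}:")
    print(np.round(np.linalg.matrix_power(T, 2**k), 2))
```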
General covariance matrices

The math goes through as above; the interpretation is slightly different. Path coefficients (Wright 1921) and path regression (Tukey 1954):
$$X_1 = c_{12} X_2 + c_{13} X_3 + \cdots + c_{1m} X_m,$$
where $c_{12}$ is the contribution of $X_2$ to $X_1$.

The matrix $\Gamma$ can be thought of as a matrix of path coefficients. Powers $T^k$ compute path coefficients for paths of length $k$, as in the Markov case.
An early graphical model

[Figure: path diagram with nodes: weight at 33 days, gain 0-33 days, weight at birth, rate of growth, external conditions, heredity, condition of dam, gestation period, size of litter, heredity of dam]
Discussion

Lots of work left:
- Semi-supervised setting.
- Multi-task setting.
- Bayesian formulation.
- Nonlinear projections: diffusion maps.
- Noise on and off manifolds.