Sayan Mukherjee. June 15, 2007


1 Sayan Mukherjee. Department of Statistical Science, Institute for Genome Sciences & Policy, Department of Computer Science, Duke University. June 15, 2007

2 To Tommy Poggio. This talk is dedicated to my advisor Tommy Poggio as part of a journey through computation. I hope I don't tarnish his name too much.

3 Relevant papers
Considerations of Models of Movement Detection. T. Poggio and W. Reichardt. Kybernetik, 13, 1973.
The Volterra representation and the Wiener expansion: validity and pitfalls. G. Palm and T. Poggio. SIAM J. Appl. Math., 33(2), 1977.
Stochastic identification methods for nonlinear systems: an extension of Wiener theory. G. Palm and T. Poggio. SIAM J. Appl. Math., 34(3), 1978.
Regularization algorithms for learning that are equivalent to multilayer networks. T. Poggio and F. Girosi. Science, 247, 1990.
Characterizing the function space for Bayesian kernel models. N. Pillai, Q. Wu, F. Liang, S. Mukherjee, and R. Wolpert. Journal of Machine Learning Research, in press, 2007.

4 Table of contents
1 Integral operators and priors: Wiener theory; radial basis function networks; Bayesian kernel model
2

6 Linear systems theory
Define the input signal $x(t)$ and the output signal $y(t)$. A system maps an input signal to an output signal via a system operator $S$ defined by an impulse response $K$:
$$y(t) = S\{x(t-\tau)\} = \int_{\mathcal X} K(t-\tau)\,x(\tau)\,d\tau.$$
Characterize $S$ by test functions: sinusoidal inputs.

7 Nonlinear systems theory
$$y(t) = S\{x(t-\tau)\}.$$
Wiener theory: characterize $S$ by test functions: Brownian motion. The method of Wiener provides a canonical series representation
$$Sx = \sum_{n,k,i} G_{nki}\,x \quad\text{where}\quad G_{nki}\,x = \int\!\cdots\!\int g_{nki}(\tau_1,\ldots,\tau_i)\,dx(\tau_1)\cdots dx(\tau_i).$$

9 Nonlinear systems theory
$$y(t) = S\{x(t-\tau)\}.$$
Lee-Schetzen: characterize $S$ by test functions: white noise. The response of $S$ to $\dot x$ is
$$S\dot x = \sum_{n,k,i} G_{nki}\,\dot x \quad\text{where}\quad G_{nki}\,\dot x = \int\!\cdots\!\int g_{nki}(\tau_1,\ldots,\tau_i)\,\dot x(\tau_1)\cdots\dot x(\tau_i)\,d\tau_1\cdots d\tau_i.$$
A question: for what stochastic inputs does this hold?

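The Lee-Schetzen idea can be illustrated numerically. Below is a minimal NumPy sketch (not from the talk; the test system and all names are illustrative) that recovers the first-order kernel of a linear system by cross-correlating the output with a white-noise input, $g(\tau) \approx \mathbb E[y(t)\,\dot x(t-\tau)]/\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# An "unknown" linear system: the first-order kernel we will try to recover.
T = 64
taus = np.arange(T)
g_true = np.exp(-taus / 10.0) * np.sin(taus / 3.0)

# White-noise probe input and the system response y(t) = sum_tau g(tau) x(t - tau).
n, sigma2 = 200_000, 1.0
x = rng.normal(0.0, np.sqrt(sigma2), n)
y = np.convolve(x, g_true)[:n]

# Lee-Schetzen: the first-order kernel is the input-output cross-correlation,
# g(tau) ~ E[y(t) x(t - tau)] / sigma^2.
g_hat = np.array([np.mean(y[tau:] * x[:n - tau]) for tau in taus]) / sigma2

print("max abs error:", np.max(np.abs(g_hat - g_true)))
```
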
10 Regression learning as hypersurface reconstruction
data $= \{L_i = (x_i, y_i)\}_{i=1}^n$ with $L_i \overset{iid}{\sim} \rho(X, Y)$, $X \in \mathcal X \subset \mathbb R^p$, $Y \in \mathbb R$, and $p \gg n$.
A natural idea: $f_r(x) = \mathbb E_Y[Y \mid x]$.

11 Regularization theory, RBF style
Minimization of the following functional was proposed in Poggio and Girosi:
$$H[f] = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda\,\|Pf\|^2,$$
where $P$ is a differential operator.

12 Regularization theory, RBF style
Using Green's functions (operator theory), in matrix-vector form:
$$f(x) = \sum_{i=1}^N c_i\,G(x; x_i), \qquad (G + \lambda I)\,c = y, \quad\text{where } G_{ij} = G(x_i; x_j).$$

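A minimal NumPy sketch of this solution, assuming a Gaussian Green's function and synthetic data (the names and data here are illustrative, not the talk's code): solve $(G+\lambda I)c = y$ and evaluate $f(x) = \sum_i c_i G(x; x_i)$.

```python
import numpy as np

def gaussian_green(A, B, width=1.0):
    """Gaussian radial basis function G(x; x') = exp(-||x - x'||^2 / width^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / width ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))               # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)    # noisy targets

lam = 1e-2
G = gaussian_green(X, X)
c = np.linalg.solve(G + lam * np.eye(len(X)), y)   # (G + lambda I) c = y

def f(X_new):
    """Evaluate f(x) = sum_i c_i G(x; x_i)."""
    return gaussian_green(X_new, X) @ c

print(f(np.array([[0.5], [1.5]])))
```
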
14 Regularization theory, RKHS style
$$\hat f = \arg\min_{f\in\text{big space}}\ [\text{error on data} + \text{smoothness of function}]$$
error on data $= L(f, \text{data}) = \sum (f(x) - y)^2$
smoothness of function $= \|f\|_K^2$
big function space = reproducing kernel Hilbert space $= \mathcal H_K$

15 Regularization theory, RKHS style
$$\hat f = \arg\min_{f\in\mathcal H_K}\Big[\,L(f,\text{data}) + \lambda\,\|f\|_K^2\,\Big]$$
The kernel: $K:\mathcal X\times\mathcal X\to\mathbb R$, e.g. $K(u,v) = e^{-\|u-v\|^2}$.
The RKHS:
$$\mathcal H_K = \Big\{ f \;\Big|\; f(x) = \sum_{i=1}^{\ell}\alpha_i K(x, x_i),\ x_i\in\mathcal X,\ \alpha_i\in\mathbb R,\ \ell\in\mathbb N \Big\}.$$

16 Representer theorem
$$\hat f = \arg\min_{f\in\mathcal H_K}\Big[\,L(f,\text{data}) + \lambda\,\|f\|_K^2\,\Big]
\;\Longrightarrow\;
\hat f(x) = \sum_{i=1}^n c_i K(x, x_i) = \sum_{i=1}^n c_i G(x; x_i).$$
Great when $p \gg n$.

17 Very popular and useful
Theo, Massi, and Tommy:
1. Support vector machines: $\hat f = \arg\min_{f\in\mathcal H_K}\Big[\sum_{i=1}^n \big(1 - y_i f(x_i)\big)_+ + \lambda\|f\|_K^2\Big]$
2. Regularized kernel regression: $\hat f = \arg\min_{f\in\mathcal H_K}\Big[\sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda\|f\|_K^2\Big]$
3. Regularized logistic regression: $\hat f = \arg\min_{f\in\mathcal H_K}\Big[\sum_{i=1}^n \ln\big(1 + e^{-y_i f(x_i)}\big) + \lambda\|f\|_K^2\Big]$

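For concreteness, a small NumPy sketch (illustrative, not from the talk) that evaluates the three penalized empirical risks for a function written in representer form, $f(x_i) = (Kc)_i$ with $\|f\|_K^2 = c^\top K c$; the data and names are hypothetical.

```python
import numpy as np

def objectives(K, c, y, lam):
    """The three penalized empirical risks for f(x_i) = (K c)_i, ||f||_K^2 = c' K c."""
    f = K @ c
    penalty = lam * (c @ K @ c)
    hinge = np.maximum(0.0, 1.0 - y * f).sum() + penalty        # support vector machine
    square = ((y - f) ** 2).sum() + penalty                     # regularized kernel regression
    logistic = np.log1p(np.exp(-y * f)).sum() + penalty         # regularized logistic regression
    return hinge, square, logistic

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = np.sign(X[:, 0])                                            # labels in {-1, +1}
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))     # Gaussian kernel matrix
print(objectives(K, 0.1 * rng.normal(size=20), y, lam=0.5))
```
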
18 Why go Bayes? Uncertainty.

19 Bayesian interpretation
Poggio and Girosi:
$$H[f] = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda\,\|Pf\|^2,$$
$$\exp(-H[f]) = \exp\Big(-\sum_{i=1}^N (y_i - f(x_i))^2\Big)\,\exp\big(-\lambda\,\|Pf\|^2\big)$$
$$\Pr(f\mid\text{data}) \propto \mathrm{Lik}(\text{data}\mid f)\,\pi(f).$$

22 Priors via spectral expansion
$$\mathcal H_K = \Big\{ f \;\Big|\; f(x) = \sum_{i=1}^\infty a_i\phi_i(x) \text{ with } \sum_{i=1}^\infty a_i^2/\lambda_i < \infty \Big\},$$
where $\phi_i(x)$ and $\lambda_i \ge 0$ are eigenfunctions and eigenvalues of $K$:
$$\lambda_i\,\phi_i(x) = \int_{\mathcal X} K(x,u)\,\phi_i(u)\,d\gamma(u).$$
Specify a prior on $\mathcal H_K$ via a prior on
$$A = \Big\{ (a_k)_{k=1}^\infty \;\Big|\; \sum_k a_k^2/\lambda_k < \infty \Big\}.$$
Hard to sample, and relies on computation of eigenvalues and eigenfunctions.

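A minimal sketch of what such a spectral-expansion prior looks like in practice, under assumptions not in the talk: discretize $\mathcal X$ on a grid, eigendecompose the kernel matrix, and draw coefficients from one common Gaussian choice, $a_i\sim\mathrm{No}(0,\lambda_i)$, which reproduces a Gaussian process with covariance $K$ on the grid.

```python
import numpy as np

rng = np.random.default_rng(3)

# Grid over X = [0, 1] and a Gaussian kernel evaluated on it.
x = np.linspace(0.0, 1.0, 200)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.05)

# Eigenvalues/eigenfunctions of the discretized integral operator
# (columns of phi, scaled by sqrt(len(x)), approximate the phi_i on the grid).
lam, phi = np.linalg.eigh(K / len(x))
lam = np.clip(lam, 0.0, None)

# One common Gaussian choice of prior on the coefficients: a_i ~ No(0, lambda_i),
# so that f(x) = sum_i a_i phi_i(x) has covariance K on the grid.
a = rng.normal(size=lam.shape) * np.sqrt(lam)
f = phi @ a * np.sqrt(len(x))

print(f[:5])
```
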
24 Priors via duality
The duality between Gaussian processes and RKHSs implies the following construction, where $K$ is given by the kernel:
$$f(\cdot) \sim \mathcal{GP}(\mu_f, K),$$
but $f(\cdot)\notin\mathcal H_K$ almost surely.

26 Integral operators
Integral operator $L_K:\Gamma\to\mathcal G$,
$$\mathcal G = \Big\{ f \;\Big|\; f(x) := L_K[\gamma](x) = \int_{\mathcal X} K(x,u)\,d\gamma(u),\ \gamma\in\Gamma \Big\}, \quad\text{with } \Gamma\subseteq\mathcal B(\mathcal X).$$
A prior on $\Gamma$ implies a prior on $\mathcal G$.

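For a discrete signed measure $\gamma=\sum_i c_i\delta_{u_i}$ the integral operator reduces to a kernel expansion, $L_K[\gamma](x)=\sum_i c_i K(x,u_i)$. A tiny NumPy sketch (illustrative names, Gaussian kernel assumed):

```python
import numpy as np

def L_K(x, atoms, weights, width=1.0):
    """L_K[gamma](x) = int K(x, u) dgamma(u) for the discrete signed measure
    gamma = sum_i weights[i] * delta_{atoms[i]}, with a Gaussian kernel."""
    d2 = ((x[:, None, :] - atoms[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / width ** 2) @ weights

atoms = np.array([[-1.0], [0.0], [2.0]])
weights = np.array([0.5, -1.0, 0.25])          # signed weights
xs = np.linspace(-3, 3, 7)[:, None]
print(L_K(xs, atoms, weights))
```
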
28 Equivalence with RKHS
For what $\Gamma$ is $\mathcal H_K = \mathrm{span}(\mathcal G)$? For what class of test functions does $\mathrm{span}(L_K[\Gamma]) = \mathcal H_K$? This is hard to characterize.
The candidate test functions for $\Gamma$ will be:
1. square integrable functions
2. integrable functions
3. discrete measures
4. the union of integrable functions and discrete measures.

30 Square integrable functions are too small
Proposition. For every $\gamma\in L^2(\mathcal X)$, $L_K[\gamma]\in\mathcal H_K$. Consequently, $L^2(\mathcal X)\subset L_K^{-1}(\mathcal H_K)$.
Corollary. If $\Lambda = \{k : \lambda_k > 0\}$ is a finite set, then $L_K(L^2(\mathcal X)) = \mathcal H_K$; otherwise $L_K(L^2(\mathcal X))\subsetneq\mathcal H_K$. The latter occurs when the kernel $K$ is strictly positive definite and the RKHS is infinite-dimensional.

32 Signed measures are (almost) just right
Measures: the class of functions $L^1(\mathcal X)$ are signed measures.
Proposition. For every $\gamma\in L^1(\mathcal X)$, $L_K[\gamma]\in\mathcal H_K$. Consequently, $L^1(\mathcal X)\subset L_K^{-1}(\mathcal H_K)$.
Discrete measures:
$$M_D = \Big\{ \mu = \sum_{i=1}^n c_i\,\delta_{x_i} : \sum_{i=1}^n |c_i| < \infty,\ x_i\in\mathcal X,\ n\in\mathbb N \Big\}.$$
Proposition. Given the set of finite discrete measures, $M_D\subset L_K^{-1}(\mathcal H_K)$.

34 Signed measures are (almost) just right
Nonsingular measures: $M = L^1(\mathcal X)\cup M_D$.
Proposition. $L_K(M)$ is dense in $\mathcal H_K$ with respect to the RKHS norm.
Proposition. $\mathcal B(\mathcal X)\subset L_K^{-1}(\mathcal H_K(\mathcal X))$.

35 The implication
Take-home message: we need priors on signed measures. A function-theoretic foundation for random signed measures such as Gaussian, Dirichlet, and Lévy process priors.

40 Discussion
Lots of work left:
- Further refinement of integral operators and priors in terms of Sobolev spaces.
- Semi-supervised setting: relation of kernel model and priors with Laplace-Beltrami and graph Laplacian operators.
- Semi-supervised setting: duality between diffusion processes on manifolds and Markov chains.
- Bayesian variable selection: efficient sampling and search in high-dimensional space.
- Numeric stability and statistical robustness: work with Tommy.

42 Generative vs. predictive modelling
Given data $= \{Z_i = (x_i, y_i)\}_{i=1}^n$ with $Z_i\overset{iid}{\sim}\rho(X,Y)$, $X\in\mathcal X\subset\mathbb R^p$, $Y\in\mathbb R$, and $p\gg n$.
Two options:
1. discriminative or regression: $Y\mid X$
2. generative: $X\mid Y$ (sometimes called inverse regression)

43 Regression
Given $X\in\mathcal X\subset\mathbb R^p$, $Y\in\mathbb R$, $p\gg n$, and $\rho(X,Y)$, we want $Y\mid X$. A natural idea:
$$f_r(x) = \arg\min_f\,\mathbb E_Y\,(Y - f(X))^2,$$
and $f_r(x) = \mathbb E_Y[Y\mid x]$ provides a summary of $Y\mid X$.

45 Inverse regression
Given $X\in\mathcal X\subset\mathbb R^p$, $Y\in\mathbb R$, $p\gg n$, and $\rho(X,Y)$, we want $X\mid Y$.
$\Omega = \mathrm{cov}(X\mid Y)$ provides a summary of $X\mid Y$:
1. $\Omega_{ii}$: relevance of a variable with respect to the label
2. $\Omega_{ij}$: covariation with respect to the label

47 Learning gradients
Given data $= \{Z_i = (x_i,y_i)\}_{i=1}^n$ with $Z_i\overset{iid}{\sim}\rho(X,Y)$. We will simultaneously estimate $f_r(x)$ and
$$\nabla f_r = \Big(\frac{\partial f_r}{\partial x^1},\ldots,\frac{\partial f_r}{\partial x^p}\Big)^{\!T}.$$
1. regression: $f_r(x)$
2. inverse regression: the gradient outer product (GOP)
$$\Gamma = \mathbb E\big[\nabla f_r\,\nabla f_r^{\,T}\big] \quad\text{or}\quad \Gamma_{ij} = \Big\langle\frac{\partial f_r}{\partial x^i},\frac{\partial f_r}{\partial x^j}\Big\rangle.$$

48 Linear case
We start with the linear case: $y = w\cdot x + \varepsilon$, $\varepsilon\overset{iid}{\sim}\mathrm{No}(0,\sigma^2)$. With $\Sigma_X = \mathrm{cov}(X)$ and $\sigma_Y^2 = \mathrm{var}(Y)$,
$$\Gamma = \sigma_Y^2\,\Sigma_X^{-1}\,\Omega\,\Sigma_X^{-1}.$$
$\Gamma$ and $\Omega$ are equivalent modulo rotation and scale.

49 Nonlinear case
For a smooth $f(x)$ with $y = f(x) + \varepsilon$, $\varepsilon\overset{iid}{\sim}\mathrm{No}(0,\sigma^2)$, the summary $\Omega = \mathrm{cov}(X\mid Y)$ is not so clear.

51 Nonlinear case
Partition $\mathcal X$ into sections and compute local quantities:
$$\mathcal X = \bigcup_{i=1}^I \chi_i,\quad \Omega_i = \mathrm{cov}(X_{\chi_i}\mid Y_{\chi_i}),\quad \Sigma_i = \mathrm{cov}(X_{\chi_i}),\quad \sigma_i^2 = \mathrm{var}(Y_{\chi_i}),\quad m_i = \rho_X(\chi_i).$$
$$\Gamma = \sum_{i=1}^I m_i\,\sigma_i^2\,\Sigma_i^{-1}\,\Omega_i\,\Sigma_i^{-1}.$$

52 Taylor expansion and gradients
Given $D = (Z_1,\ldots,Z_n)$, the error of $f$ may be approximated as
$$L(f, D) = \sum_{i,j=1}^n w_{ij}\,\big[y_i - f(x_j) - \nabla f(x_j)\cdot(x_i - x_j)\big]^2,$$
where $w_{ij}$ ensures the locality of $x_i - x_j$.

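As a concrete illustration, here is a small NumPy sketch (not the talk's implementation; the Gaussian locality weights and the toy data are assumptions) that evaluates this first-order Taylor loss for a candidate $f$ and $\nabla f$ supplied as callables.

```python
import numpy as np

def taylor_loss(f, grad_f, X, y, s2=0.5):
    """L(f, D) = sum_ij w_ij [ y_i - f(x_j) - grad_f(x_j) . (x_i - x_j) ]^2
    with locality weights w_ij = exp(-||x_i - x_j||^2 / (2 s2))."""
    n = len(X)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            w = np.exp(-np.sum((X[i] - X[j]) ** 2) / (2.0 * s2))
            resid = y[i] - f(X[j]) - grad_f(X[j]) @ (X[i] - X[j])
            loss += w * resid ** 2
    return loss

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=30)

# Candidate f and gradient: here the true quadratic, purely for illustration.
f = lambda x: x[0] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 0.0])
print(taylor_loss(f, grad_f, X, y))
```
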
53 Gradient estimate - regularization
Radial basis function model:
$$(f_D, \vec f_D) := \arg\min_{(f,\vec f)\in\mathcal H_K^{p+1}}\Big[\,L(f, D) + \lambda_1\|f\|_K^2 + \lambda_2\|\vec f\|_K^2\,\Big],$$
where $\mathcal H_K^p$ is the space of $p$ functions $\vec f = (f_1,\ldots,f_p)$ with $f_i\in\mathcal H_K$, $\|\vec f\|_K^2 = \sum_{i=1}^p\|f_i\|_K^2$, and $\lambda_1,\lambda_2 > 0$.

54 Computational efficiency
The computation requires fewer than $n^2$ parameters and is $O(n^6)$ time and $O(pn)$ memory:
$$f_D(x) = \sum_{i=1}^n a_i K(x_i, x), \qquad \vec f_D(x) = \sum_{i=1}^n c_i K(x_i, x),$$
with $a_D = (a_1,\ldots,a_n)\in\mathbb R^n$ and $c_D = (c_1,\ldots,c_n)^T\in\mathbb R^{np}$.

55 Linear example [figure; axes: RKHS norm, samples, dimensions]

56 Linear example [figure; axes: probability of y, samples, dimensions]

57 Nonlinear example [figure: the two classes across dimensions]

58 Sensitive features
Proposition. Let $f$ be a smooth function on $\mathbb R^p$ with gradient $\nabla f$. The sensitivity of the function $f$ along a (unit normalized) direction $u$ is $\|u\cdot\nabla f\|_K$. The $d$ most sensitive features are those $\{u_1,\ldots,u_d\}$ that are orthogonal and maximize $\|u_i\cdot\nabla f\|_K$.
A spectral decomposition of $\Gamma$ is used to compute these features: the eigenvectors corresponding to the $d$ top eigenvalues.

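A minimal sketch of this spectral step (illustrative: the analytic gradient of a toy function stands in for the estimated $\vec f_D$): form the empirical gradient outer product and take its top-$d$ eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, d = 200, 10, 2
X = rng.normal(size=(n, p))

# Gradients of a toy regression function f(x) = x1^2 + x2 that depends
# on only two directions of R^p (standing in for the estimated gradients).
grads = np.zeros((n, p))
grads[:, 0] = 2.0 * X[:, 0]
grads[:, 1] = 1.0

# Empirical gradient outer product Gamma = (1/n) sum_i grad f(x_i) grad f(x_i)^T.
Gamma = grads.T @ grads / n

# The d most sensitive (orthogonal) directions: top-d eigenvectors of Gamma.
evals, evecs = np.linalg.eigh(Gamma)
top = evecs[:, np.argsort(evals)[::-1][:d]]
print(np.round(top, 2))
```
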
60 Projection of data
The matrix
$$\hat\Gamma = \vec f_D\,\vec f_D^{\,T} = c_D^{\,T} K\, c_D$$
is an empirical estimate of $\Gamma$.
Geometry is preserved by projection onto the top $k$ eigenvectors. There is no need to compute the $p\times p$ matrix; the method is $O(n^2 p + n^3)$ time and $O(pn)$ memory.

61 Linear example [figure; axes: dimensions, samples, index of eigenvalues]

62 Linear example [figure; axes: feature 1, dimensions]

63 Nonlinear example [figure; axes: feature 1, dimension]

64 Nonlinear example [figure; axes: index of eigenvalues, dimensions]

65 Nonlinear example [figure; axes: dimensions]

66 Digits: 6 vs. 9 [figure]

67 Digits: 6 vs. 9 [figure; axes: norm, index of eigenvalues, dimensions]

68 Nonlinear example [figure; axes: dimensions]

69 Digits: 6 vs. 9 [figure; axes: feature 1, feature 2]

70 An early graphical model
[path diagram with nodes: weight at 33 days; gain 0-33 days; external conditions; weight at birth; heredity; rate of growth; condition of dam; gestation period; size of litter; heredity of dam]

72 Gauss-Markov graphical models
Given a multivariate normal distribution with covariance matrix $C$, the matrix $P = C^{-1}$ is the conditional independence matrix:
$P_{ij}$ = dependence of $i$ on $j$ given all other variables.
Note that by construction $\hat\Gamma$ is a covariance matrix of a Gaussian process.

73 Define $J = \hat\Gamma^{-1}$ and the partial correlation coefficients
$$r_{ij} = \frac{-J_{ij}}{\sqrt{J_{ii}\,J_{jj}}}.$$
Define $R_{ij} = r_{ij}$ if $i\ne j$ and $0$ otherwise, and $D_{ii} = J_{ii}$. Then $J = D^{1/2}(I - R)D^{1/2}$, so
$$\hat\Gamma = D^{-1/2}(I - R)^{-1} D^{-1/2}.$$

74 The following expansions hold:
$$\hat\Gamma = D^{-1/2}\Big(\sum_{k=0}^{\infty} R^k\Big) D^{-1/2}, \qquad
\hat\Gamma = D^{-1/2}\Big(\prod_{k=0}^{\infty}\big(I + R^{2^k}\big)\Big) D^{-1/2}.$$
Here $k$ is path length in the first equality. The second equality factorizes the covariance matrix into low-rank matrices with fewer entries.

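These identities are easy to check numerically. A small NumPy sketch (with an arbitrary well-conditioned stand-in for $\hat\Gamma$, so that the series converges) verifies the partial-correlation decomposition and a truncated path-length expansion.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 6
B = 0.1 * rng.normal(size=(p, p))
Gamma = np.eye(p) + (B + B.T) / 2.0            # well-conditioned stand-in for Gamma-hat

J = np.linalg.inv(Gamma)                       # precision matrix
Ds = np.sqrt(np.diag(np.diag(J)))              # D^{1/2}
R = -J / np.sqrt(np.outer(np.diag(J), np.diag(J)))
np.fill_diagonal(R, 0.0)                       # partial correlations, zero diagonal

# Exact identity: Gamma = D^{-1/2} (I - R)^{-1} D^{-1/2}.
lhs = np.linalg.inv(Ds) @ np.linalg.inv(np.eye(p) - R) @ np.linalg.inv(Ds)
print("identity error:", np.max(np.abs(lhs - Gamma)))

# Truncated path-length expansion sum_k R^k (valid when the spectral radius of R is < 1).
S = sum(np.linalg.matrix_power(R, k) for k in range(50))
print("series error:", np.max(np.abs(np.linalg.inv(Ds) @ S @ np.linalg.inv(Ds) - Gamma)))
```
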
75 Toy example [matrix: $\hat\Gamma^{-1}$ over variables a, b, c, d]

76 Toy example [graph on variables a, b, c, d read off from $\hat\Gamma^{-1}$]

77 Real data: progression of prostate cancer [figure]

82 Discussion
Lots of work left:
- Semi-supervised setting.
- Multi-task setting.
- Bayesian formulation.
- Nonlinear projections: diffusion maps.
- Noise on and off manifolds.

83 Relevant papers
Learning Coordinate Covariances via Gradients. Sayan Mukherjee, Ding-Xuan Zhou. Journal of Machine Learning Research, 7(Mar), 2006.
Estimation of Gradients and Coordinate Covariation in Classification. Sayan Mukherjee, Qiang Wu. Journal of Machine Learning Research, 7(Nov), 2006.
Learning Gradients and Feature Selection on Manifolds. Sayan Mukherjee, Qiang Wu, Ding-Xuan Zhou. Annals of Statistics, submitted.
Learning Gradients: simultaneous regression and inverse regression. Mauro Maggioni, Sayan Mukherjee, Qiang Wu. Journal of Machine Learning Research, in preparation.

84 Thanks Tommy, and Alessandro, Gadi, and Federico.

86 Bayesian kernel model
$$y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i\overset{iid}{\sim}\mathrm{No}(0,\sigma^2), \qquad
f(x) = \int_{\mathcal X} K(x,u)\,Z(du),$$
where $Z(du)\in\mathcal M(\mathcal X)$ is a signed measure on $\mathcal X$. Then
$$\pi(Z\mid\text{data}) \propto L(\text{data}\mid Z)\,\pi(Z),$$
and this implies a posterior on $f$.

87 Lévy processes
A stochastic process $Z := \{Z_u\in\mathbb R : u\in\mathcal X\}$ is called a Lévy process if it satisfies the following conditions:
1. $Z_0 = 0$ almost surely.
2. For any choice of $m\ge 1$ and $0\le u_0 < u_1 < \ldots < u_m$, the random variables $Z_{u_0}, Z_{u_1}-Z_{u_0},\ldots,Z_{u_m}-Z_{u_{m-1}}$ are independent (independent increments property).
3. The distribution of $Z_{s+u} - Z_s$ does not depend on $s$ (temporal homogeneity or stationary increments property).
4. $Z$ has càdlàg paths almost surely.

88 Lévy processes
Theorem (Lévy-Khintchine). If $Z$ is a Lévy process, then the characteristic function of $Z_u$, $u\ge 0$, has the following form:
$$\mathbb E\big[e^{i\lambda Z_u}\big] = \exp\Big( u\Big[\, i\lambda a - \tfrac12\sigma^2\lambda^2 + \int_{\mathbb R\setminus\{0\}}\big(e^{i\lambda w} - 1 - i\lambda w\,\mathbf 1_{\{w:|w|<1\}}(w)\big)\,\nu(dw) \Big]\Big),$$
where $a\in\mathbb R$, $\sigma^2\ge 0$, and $\nu$ is a nonnegative measure on $\mathbb R$ with $\int_{\mathbb R}(1\wedge w^2)\,\nu(dw) < \infty$.

89 Lévy processes
Drift term $a$; variance of Brownian motion $\sigma^2$; $\nu(dw)$: the jump process or Lévy measure.
$$\exp\big\{ u\big[\, i\lambda a - \tfrac12\sigma^2\lambda^2 \big]\big\}\;
\exp\Big\{ u\int_{\mathbb R\setminus\{0\}}\big[e^{i\lambda w} - 1 - i\lambda w\,\mathbf 1_{\{w:|w|<1\}}(w)\big]\nu(dw)\Big\}$$

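A minimal simulation sketch of these three ingredients (drift, Brownian part, jumps), under assumptions not in the talk: a finite jump measure, approximated by at most one compound-Poisson jump per time step.

```python
import numpy as np

rng = np.random.default_rng(7)

a, sigma = 0.5, 0.3                 # drift and standard deviation of the Brownian part
jump_rate, jump_scale = 2.0, 1.0    # finite Levy measure: Poisson(rate) jumps, N(0, scale^2) sizes

T, n = 5.0, 5000
dt = T / n

increments = (
    a * dt                                                            # drift term
    + sigma * np.sqrt(dt) * rng.normal(size=n)                        # Brownian motion
    + rng.normal(0.0, jump_scale, n) * rng.poisson(jump_rate * dt, n) # jumps (at most ~1 per step)
)
Z = np.concatenate([[0.0], np.cumsum(increments)])                    # Z_0 = 0, path on the time grid
print(Z[-1])
```
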
90 Two approaches to Gaussian processes
Two modelling approaches:
1. a prior directly on the space of functions, by sampling from paths of the Gaussian process defined by $K$;
2. a Gaussian process prior on $Z(du)$, which implies a prior on function space via the integral operator.

91 Prior on random measure
A Gaussian process prior on $Z(du)$ is a prior on signed measures, so $\mathrm{span}(\mathcal G)\subset\mathcal H_K$.

92 Direct prior elicitation
Theorem (Kallianpur). If $\{Z_u, u\in\mathcal X\}$ is a Gaussian process with covariance $K$ and mean $m\in\mathcal H_K$, and $\mathcal H_K$ is infinite dimensional, then $P(Z\in\mathcal H_K) = 0$.
The sample paths are almost surely outside $\mathcal H_K$.

93 A bigger RKHS
Theorem (Lukić and Beder). Given two kernel functions $R$ and $K$, $R$ dominates $K$ ($R\succeq K$) if $\mathcal H_K\subset\mathcal H_R$. Let $R\succeq K$. Then $\|g\|_R\le\|g\|_K$ for all $g\in\mathcal H_K$. There exists a unique linear operator $L:\mathcal H_R\to\mathcal H_R$ whose range is contained in $\mathcal H_K$ such that
$$\langle f, g\rangle_R = \langle Lf, g\rangle_K, \qquad f\in\mathcal H_R,\ g\in\mathcal H_K.$$
In particular $LR_u = K_u$, $u\in\mathcal X$. As an operator into $\mathcal H_R$, $L$ is bounded, symmetric, and positive. Conversely, let $L:\mathcal H_R\to\mathcal H_R$ be a positive, continuous, self-adjoint operator; then $K(s,t) = \langle LR_s, R_t\rangle_R$, $s,t\in\mathcal X$, defines a reproducing kernel on $\mathcal X$ such that $K\preceq R$.

96 A bigger RKHS
If $L$ is nuclear (an operator that is compact with finite trace, independent of basis choice), then we have nuclear dominance of $R$ over $K$.
Theorem (Lukić and Beder). Let $K$ and $R$ be two reproducing kernels, and assume that the RKHS $\mathcal H_R$ is separable. A necessary and sufficient condition for the existence of a Gaussian process with covariance $K$ and mean $m\in\mathcal H_R$ and with trajectories in $\mathcal H_R$ with probability 1 is that $R$ nuclearly dominates $K$.
Characterize $\mathcal H_R$ by $L_K^{-1}(\mathcal H_K)$.

98 Dirichlet process prior
$$f(x) = \int_{\mathcal X} K(x,u)\,Z(du) = \int_{\mathcal X} K(x,u)\,w(u)\,F(du),$$
where $F(du)$ is a distribution and $w(u)$ a coefficient function.
Model $F$ using a Dirichlet process prior: $F\sim\mathrm{DP}(\alpha, F_0)$.

99 Bayesian representer form
Given $X^n = (x_1,\ldots,x_n)\overset{iid}{\sim} F$,
$$F\mid X^n \sim \mathrm{DP}(\alpha + n,\ F_n), \qquad F_n = \Big(\alpha F_0 + \sum_{i=1}^n\delta_{x_i}\Big)\big/(\alpha + n).$$
$$\mathbb E[f\mid X^n] = a_n\int K(x,u)\,w(u)\,dF_0(u) + n^{-1}(1 - a_n)\sum_{i=1}^n w(x_i)\,K(x, x_i), \qquad a_n = \alpha/(\alpha + n).$$

100 Bayesian representer form
Taking $\lim_{\alpha\to 0}$ to represent a non-informative prior:
Proposition (Bayesian representer theorem).
$$\hat f_n(x) = \sum_{i=1}^n w_i\,K(x, x_i), \qquad w_i = w(x_i)/n.$$

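A tiny NumPy sketch of evaluating this representer form (the coefficient function $w$ and the Gaussian kernel here are illustrative assumptions):

```python
import numpy as np

def f_hat(x_new, X, w_fn, width=1.0):
    """Bayesian representer form: f_n(x) = sum_i (w(x_i) / n) K(x, x_i)."""
    n = len(X)
    K = np.exp(-((x_new[:, None, :] - X[None, :, :]) ** 2).sum(-1) / width ** 2)
    return K @ (w_fn(X) / n)

rng = np.random.default_rng(8)
X = rng.uniform(-2, 2, size=(40, 1))
w_fn = lambda X: np.sin(X[:, 0])               # an illustrative coefficient function w(u)
print(f_hat(np.array([[0.0], [1.0]]), X, w_fn))
```
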
101 Likelihood
$$y_i = f(x_i) + \varepsilon_i = w_0 + \sum_{j=1}^n w_j\,K(x_i, x_j) + \varepsilon_i, \qquad i = 1,\ldots,n,$$
where $\varepsilon_i\sim\mathrm{No}(0,\sigma^2)$. Hence
$$Y\sim\mathrm{No}(w_0\iota + Kw,\ \sigma^2 I),$$
where $\iota = (1,\ldots,1)^T$.

104 Prior specification
Factor: $K = F\Delta F^{T}$ with $\Delta := \mathrm{diag}(\lambda_1^2,\ldots,\lambda_n^2)$, and $w = F^{-1}\beta$.
$$\pi(w_0,\sigma^2)\propto 1/\sigma^2, \qquad
\tau_i^{-1}\sim\mathrm{Ga}(a_\tau/2,\ b_\tau/2), \qquad T := \mathrm{diag}(\tau_1,\ldots,\tau_n),$$
$$\beta\sim\mathrm{No}(0,\ \Delta T), \qquad
w\mid K, T\sim\mathrm{No}\big(0,\ F^{-1}\Delta T\,(F^{-1})^{T}\big).$$
A standard Gibbs sampler simulates $p(w, w_0, \sigma^2\mid\text{data})$.

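To make the Gibbs step concrete, here is a deliberately simplified sketch (my own stand-in, not the talk's prior: it replaces the factored $\Delta$, $T$ structure with a single ridge-type Gaussian prior on $(w_0, w)$ and a conjugate inverse-gamma prior on $\sigma^2$) showing the two conditional updates alternating.

```python
import numpy as np

rng = np.random.default_rng(9)

# Toy data and a Gaussian kernel matrix.
n = 40
x = np.sort(rng.uniform(-2, 2, n))
y = np.sin(2 * x) + 0.2 * rng.normal(size=n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

# Simplified conjugate setup: theta = (w_0, w) ~ No(0, tau2 I), 1/sigma^2 ~ Ga(a0, b0).
Xd = np.column_stack([np.ones(n), K])
tau2, a0, b0 = 10.0, 1.0, 1.0
sigma2 = 1.0

samples = []
for it in range(2000):
    # theta | sigma^2, y: Gaussian with precision Xd'Xd / sigma^2 + I / tau2.
    cov = np.linalg.inv(Xd.T @ Xd / sigma2 + np.eye(n + 1) / tau2)
    mean = cov @ Xd.T @ y / sigma2
    theta = rng.multivariate_normal(mean, cov)
    # sigma^2 | theta, y: inverse gamma.
    rss = np.sum((y - Xd @ theta) ** 2)
    sigma2 = 1.0 / rng.gamma(a0 + n / 2.0, 1.0 / (b0 + rss / 2.0))
    if it >= 1000:
        samples.append(theta)

print("posterior mean of w_0:", np.mean([s[0] for s in samples]))
```
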
105 Kernel model extension
$K_\nu(x, u) = K(\sqrt{\nu}\odot x,\ \sqrt{\nu}\odot u)$ with $\nu = \{\nu_1,\ldots,\nu_p\}$, $\nu_k\in[0,\infty)$ a scale parameter. For example:
$$k_\nu(x,u) = \sum_{k=1}^p\nu_k\,x_k u_k, \qquad
k_\nu(x,u) = \Big(1 + \sum_{k=1}^p\nu_k\,x_k u_k\Big)^{d}, \qquad
k_\nu(x,u) = \exp\Big(-\sum_{k=1}^p\nu_k\,(x_k - u_k)^2\Big).$$

107 Prior specification
$$\nu_k\sim(1-\gamma)\,\delta_0 + \gamma\,\mathrm{Ga}(a_\nu,\ a_\nu s), \quad k = 1,\ldots,p, \qquad
s\sim\mathrm{Exp}(a_s), \qquad \gamma\sim\mathrm{Be}(a_\gamma, b_\gamma).$$
A standard Gibbs sampler does not work: use Metropolis-Hastings.

110 GRBF and HyperBF networks, a reprise
Suppose
$$f_m(x) = \sum_{\alpha=1}^m c_\alpha\,G(x;\,t_\alpha,\sigma_\alpha),$$
with coefficients $c_\alpha$, knots $t_\alpha$, and scales $\sigma_\alpha$. Minimize
$$H[f_m] = \sum_{i=1}^N (y_i - f_m(x_i))^2 + \lambda\,\|Pf_m\|^2,$$
where $P$ is a differential operator. Solve for $c_\alpha$, $t_\alpha$, and $\sigma_\alpha$ via stochastic differential equations, e.g.
$$\dot c_\alpha = -\omega\,\frac{\partial H[f_m]}{\partial c_\alpha} + \eta_\alpha(t), \qquad \alpha = 1,\ldots,m,$$
and similarly for $t_\alpha$ and $\sigma_\alpha$.

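A toy NumPy sketch of this idea (my own discretization, not the talk's: plain squared error without the smoothness penalty, and noisy gradient steps on $c_\alpha$, $t_\alpha$, $\sigma_\alpha$ for 1-D Gaussian units).

```python
import numpy as np

rng = np.random.default_rng(10)

# Toy 1-D data and m Gaussian units with movable coefficients, knots, and scales.
x = np.linspace(-3, 3, 100)
y = np.sin(x)
m = 8
c = 0.1 * rng.normal(size=m)
t = np.linspace(-3, 3, m)
s = np.ones(m)

omega, noise = 1e-4, 1e-3
for step in range(5000):
    G = np.exp(-(x[:, None] - t[None, :]) ** 2 / s ** 2)   # unit activations, N x m
    r = y - G @ c                                          # residuals
    # Gradients of the squared-error part of H (smoothness penalty omitted in this sketch).
    dc = -2.0 * (G.T @ r)
    dt = -2.0 * c * ((G * (2.0 * (x[:, None] - t) / s ** 2)).T @ r)
    ds = -2.0 * c * ((G * (2.0 * (x[:, None] - t) ** 2 / s ** 3)).T @ r)
    # Discretized stochastic updates: a step down the gradient plus a small noise term.
    c += -omega * dc + noise * rng.normal(size=m)
    t += -omega * dt + noise * rng.normal(size=m)
    s = np.maximum(s - omega * ds + noise * rng.normal(size=m), 0.1)

G = np.exp(-(x[:, None] - t[None, :]) ** 2 / s ** 2)
print("final squared error:", np.sum((y - G @ c) ** 2))
```
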
112 Problem setup
Labelled data: $(Y^p, X^p) = \{(y_i^p, x_i^p);\ i = 1{:}n_p\}\overset{iid}{\sim}\rho(Y, X\mid\phi,\theta)$.
Unlabelled data: $X^m = \{x_i^m,\ i = 1{:}n_m\}\overset{iid}{\sim}\rho(X\mid\theta)$.
How can the unlabelled data help our predictive model? With data $= \{Y, X, X^m\}$,
$$p(\phi,\theta\mid\text{data}) \propto \pi(\phi,\theta)\,p(Y\mid X,\phi)\,p(X\mid\theta)\,p(X^m\mid\theta).$$
We need very strong dependence between $\theta$ and $\phi$.

114 Bayesian kernel model
Result of the DP prior: $\hat f_n(x) = \sum_{i=1}^{n_p + n_m} w_i\,K(x, x_i)$.
Same form as in Belkin and Niyogi, but without
$$\min_{f\in\mathcal H_K}\big[\,L(f,\text{data}) + \lambda_1\|f\|_K^2 + \lambda_2\|f\|_I^2\,\big].$$

115 Bayesian kernel model
Result of the DP prior: $\hat f_n(x) = \sum_{i=1}^{n_p + n_m} w_i\,K(x, x_i)$.
1. $\theta = F(\cdot)$, so that $p(x\mid\theta)\,dx = dF(x)$: the parameter is the full distribution function itself;
2. $p(y\mid x,\phi)$ depends intimately on $\theta = F$; in fact $\theta\subset\phi$ in this case, and the dependence of $\theta$ and $\phi$ is central to the model.

117 Simulated data, semi-supervised [figures]

120 Cancer classification, semi-supervised [figure: average (hard) error rate vs. percentage of labelled examples, with and without unlabelled data]

121 MNIST digits feature selection

122 MNIST digits feature selection [figure: with variable selection vs. without variable selection]
