Sayan Mukherjee. June 15, 2007


1 Sayan Mukherjee. Department of Statistical Science, Institute for Genome Sciences & Policy, Department of Computer Science, Duke University. June 15, 2007

2 To Tommy Poggio. This talk is dedicated to my advisor Tommy Poggio as part of a journey through computation. I hope I don't tarnish his name too much.

3 Relevant papers
Considerations of Models of Movement Detection. T. Poggio and W. Reichardt. Kybernetik, 13, 1973.
The Volterra representation and the Wiener expansion: validity and pitfalls. G. Palm and T. Poggio. SIAM J. Appl. Math., 33(2), 1977.
Stochastic identification methods for nonlinear systems: an extension of Wiener theory. G. Palm and T. Poggio. SIAM J. Appl. Math., 34(3), 1978.
Regularization algorithms for learning that are equivalent to multilayer networks. T. Poggio and F. Girosi. Science, 247, 1990.
Characterizing the function space for Bayesian kernel models. N. Pillai, Q. Wu, F. Liang, S. Mukherjee, and R. Wolpert. Journal of Machine Learning Research, in press, 2007.

4 Table of contents
1 Integral operators and priors: Wiener theory; radial basis function networks; Bayesian kernel model
2

6 Linear systems theory
Define the input signal $x(t)$ and the output signal $y(t)$. A system maps an input signal to an output signal via a system operator $S$ defined by an impulse response $K$:
$$y(t) = S\{x(t-\tau)\} = \int_{\mathcal X} K(t-\tau)\,x(\tau)\,d\tau.$$
Characterize $S$ by test functions: sinusoidal inputs.

7 Nonlinear systems theory
$$y(t) = S\{x(t-\tau)\}.$$
Wiener theory: characterize $S$ by test functions: Brownian motion. The method of Wiener provides a canonical series representation
$$Sx = \sum_{n,k,i} G_{nki}\,x \quad\text{where}\quad G_{nki}\,x = \int\!\cdots\!\int g_{nki}(\tau_1,\ldots,\tau_i)\,dx(\tau_1)\cdots dx(\tau_i).$$

9 Nonlinear systems theory
$$y(t) = S\{x(t-\tau)\}.$$
Lee-Schetzen: characterize $S$ by test functions: white noise. The response of $S$ to $\dot x$ is
$$S\dot x = \sum_{n,k,i} G_{nki}\,\dot x \quad\text{where}\quad G_{nki}\,\dot x = \int\!\cdots\!\int g_{nki}(\tau_1,\ldots,\tau_i)\,\dot x(\tau_1)\cdots\dot x(\tau_i)\,d\tau_1\cdots d\tau_i.$$
A question: for what stochastic inputs does this hold?

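The Lee-Schetzen idea can be illustrated numerically. Below is a minimal NumPy sketch (not from the talk; the test system and all names are illustrative) that recovers the first-order kernel of a linear system by cross-correlating the output with a white-noise input, $g(\tau) \approx \mathbb E[y(t)\,\dot x(t-\tau)]/\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# An "unknown" linear system: the first-order kernel we will try to recover.
T = 64
taus = np.arange(T)
g_true = np.exp(-taus / 10.0) * np.sin(taus / 3.0)

# White-noise probe input and the system response y(t) = sum_tau g(tau) x(t - tau).
n, sigma2 = 200_000, 1.0
x = rng.normal(0.0, np.sqrt(sigma2), n)
y = np.convolve(x, g_true)[:n]

# Lee-Schetzen: the first-order kernel is the input-output cross-correlation,
# g(tau) ~ E[y(t) x(t - tau)] / sigma^2.
g_hat = np.array([np.mean(y[tau:] * x[:n - tau]) for tau in taus]) / sigma2

print("max abs error:", np.max(np.abs(g_hat - g_true)))
```
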
10 Regression learning as hypersurface reconstruction
data $= \{L_i = (x_i, y_i)\}_{i=1}^n$ with $L_i \overset{iid}{\sim} \rho(X, Y)$, $X \in \mathcal X \subset \mathbb R^p$, $Y \in \mathbb R$, and $p \gg n$.
A natural idea: $f_r(x) = \mathbb E_Y[Y \mid x]$.

11 Regularization theory, RBF style
Minimization of the following functional was proposed in Poggio and Girosi:
$$H[f] = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda\,\|Pf\|^2,$$
where $P$ is a differential operator.

12 Regularization theory, RBF style
Using Green's functions (operator theory), in matrix-vector form:
$$f(x) = \sum_{i=1}^N c_i\,G(x; x_i), \qquad (G + \lambda I)\,c = y, \quad\text{where } G_{ij} = G(x_i; x_j).$$

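A minimal NumPy sketch of this solution, assuming a Gaussian Green's function and synthetic data (the names and data here are illustrative, not the talk's code): solve $(G+\lambda I)c = y$ and evaluate $f(x) = \sum_i c_i G(x; x_i)$.

```python
import numpy as np

def gaussian_green(A, B, width=1.0):
    """Gaussian radial basis function G(x; x') = exp(-||x - x'||^2 / width^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / width ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(50, 1))               # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)    # noisy targets

lam = 1e-2
G = gaussian_green(X, X)
c = np.linalg.solve(G + lam * np.eye(len(X)), y)   # (G + lambda I) c = y

def f(X_new):
    """Evaluate f(x) = sum_i c_i G(x; x_i)."""
    return gaussian_green(X_new, X) @ c

print(f(np.array([[0.5], [1.5]])))
```
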
14 Regularization theory, RKHS style
$$\hat f = \arg\min_{f\in\text{big space}}\ [\text{error on data} + \text{smoothness of function}]$$
error on data $= L(f, \text{data}) = \sum (f(x) - y)^2$
smoothness of function $= \|f\|_K^2$
big function space = reproducing kernel Hilbert space $= \mathcal H_K$

15 Regularization theory, RKHS style
$$\hat f = \arg\min_{f\in\mathcal H_K}\Big[\,L(f,\text{data}) + \lambda\,\|f\|_K^2\,\Big]$$
The kernel: $K:\mathcal X\times\mathcal X\to\mathbb R$, e.g. $K(u,v) = e^{-\|u-v\|^2}$.
The RKHS:
$$\mathcal H_K = \Big\{ f \;\Big|\; f(x) = \sum_{i=1}^{\ell}\alpha_i K(x, x_i),\ x_i\in\mathcal X,\ \alpha_i\in\mathbb R,\ \ell\in\mathbb N \Big\}.$$

16 Representer theorem
$$\hat f = \arg\min_{f\in\mathcal H_K}\Big[\,L(f,\text{data}) + \lambda\,\|f\|_K^2\,\Big]
\;\Longrightarrow\;
\hat f(x) = \sum_{i=1}^n c_i K(x, x_i) = \sum_{i=1}^n c_i G(x; x_i).$$
Great when $p \gg n$.

17 Very popular and useful
Theo, Massi, and Tommy:
1. Support vector machines: $\hat f = \arg\min_{f\in\mathcal H_K}\Big[\sum_{i=1}^n \big(1 - y_i f(x_i)\big)_+ + \lambda\|f\|_K^2\Big]$
2. Regularized kernel regression: $\hat f = \arg\min_{f\in\mathcal H_K}\Big[\sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda\|f\|_K^2\Big]$
3. Regularized logistic regression: $\hat f = \arg\min_{f\in\mathcal H_K}\Big[\sum_{i=1}^n \ln\big(1 + e^{-y_i f(x_i)}\big) + \lambda\|f\|_K^2\Big]$

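For concreteness, a small NumPy sketch (illustrative, not from the talk) that evaluates the three penalized empirical risks for a function written in representer form, $f(x_i) = (Kc)_i$ with $\|f\|_K^2 = c^\top K c$; the data and names are hypothetical.

```python
import numpy as np

def objectives(K, c, y, lam):
    """The three penalized empirical risks for f(x_i) = (K c)_i, ||f||_K^2 = c' K c."""
    f = K @ c
    penalty = lam * (c @ K @ c)
    hinge = np.maximum(0.0, 1.0 - y * f).sum() + penalty        # support vector machine
    square = ((y - f) ** 2).sum() + penalty                     # regularized kernel regression
    logistic = np.log1p(np.exp(-y * f)).sum() + penalty         # regularized logistic regression
    return hinge, square, logistic

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
y = np.sign(X[:, 0])                                            # labels in {-1, +1}
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))     # Gaussian kernel matrix
print(objectives(K, 0.1 * rng.normal(size=20), y, lam=0.5))
```
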
18 Why go Bayes? Uncertainty.

19 Bayesian interpretation
Poggio and Girosi:
$$H[f] = \sum_{i=1}^N (y_i - f(x_i))^2 + \lambda\,\|Pf\|^2,$$
$$\exp(-H[f]) = \exp\Big(-\sum_{i=1}^N (y_i - f(x_i))^2\Big)\,\exp\big(-\lambda\,\|Pf\|^2\big)$$
$$\Pr(f\mid\text{data}) \propto \mathrm{Lik}(\text{data}\mid f)\,\pi(f).$$

22 Priors via spectral expansion
$$\mathcal H_K = \Big\{ f \;\Big|\; f(x) = \sum_{i=1}^\infty a_i\phi_i(x) \text{ with } \sum_{i=1}^\infty a_i^2/\lambda_i < \infty \Big\},$$
where $\phi_i(x)$ and $\lambda_i \ge 0$ are eigenfunctions and eigenvalues of $K$:
$$\lambda_i\,\phi_i(x) = \int_{\mathcal X} K(x,u)\,\phi_i(u)\,d\gamma(u).$$
Specify a prior on $\mathcal H_K$ via a prior on
$$A = \Big\{ (a_k)_{k=1}^\infty \;\Big|\; \sum_k a_k^2/\lambda_k < \infty \Big\}.$$
Hard to sample, and relies on computation of eigenvalues and eigenfunctions.

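A minimal sketch of what such a spectral-expansion prior looks like in practice, under assumptions not in the talk: discretize $\mathcal X$ on a grid, eigendecompose the kernel matrix, and draw coefficients from one common Gaussian choice, $a_i\sim\mathrm{No}(0,\lambda_i)$, which reproduces a Gaussian process with covariance $K$ on the grid.

```python
import numpy as np

rng = np.random.default_rng(3)

# Grid over X = [0, 1] and a Gaussian kernel evaluated on it.
x = np.linspace(0.0, 1.0, 200)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.05)

# Eigenvalues/eigenfunctions of the discretized integral operator
# (columns of phi, scaled by sqrt(len(x)), approximate the phi_i on the grid).
lam, phi = np.linalg.eigh(K / len(x))
lam = np.clip(lam, 0.0, None)

# One common Gaussian choice of prior on the coefficients: a_i ~ No(0, lambda_i),
# so that f(x) = sum_i a_i phi_i(x) has covariance K on the grid.
a = rng.normal(size=lam.shape) * np.sqrt(lam)
f = phi @ a * np.sqrt(len(x))

print(f[:5])
```
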
24 Priors via duality
The duality between Gaussian processes and RKHSs implies the following construction, where $K$ is given by the kernel:
$$f(\cdot) \sim \mathcal{GP}(\mu_f, K),$$
but $f(\cdot)\notin\mathcal H_K$ almost surely.

26 Integral operators
Integral operator $L_K:\Gamma\to\mathcal G$,
$$\mathcal G = \Big\{ f \;\Big|\; f(x) := L_K[\gamma](x) = \int_{\mathcal X} K(x,u)\,d\gamma(u),\ \gamma\in\Gamma \Big\}, \quad\text{with } \Gamma\subseteq\mathcal B(\mathcal X).$$
A prior on $\Gamma$ implies a prior on $\mathcal G$.

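For a discrete signed measure $\gamma=\sum_i c_i\delta_{u_i}$ the integral operator reduces to a kernel expansion, $L_K[\gamma](x)=\sum_i c_i K(x,u_i)$. A tiny NumPy sketch (illustrative names, Gaussian kernel assumed):

```python
import numpy as np

def L_K(x, atoms, weights, width=1.0):
    """L_K[gamma](x) = int K(x, u) dgamma(u) for the discrete signed measure
    gamma = sum_i weights[i] * delta_{atoms[i]}, with a Gaussian kernel."""
    d2 = ((x[:, None, :] - atoms[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / width ** 2) @ weights

atoms = np.array([[-1.0], [0.0], [2.0]])
weights = np.array([0.5, -1.0, 0.25])          # signed weights
xs = np.linspace(-3, 3, 7)[:, None]
print(L_K(xs, atoms, weights))
```
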
28 Equivalence with RKHS
For what $\Gamma$ is $\mathcal H_K = \mathrm{span}(\mathcal G)$? For what class of test functions does $\mathrm{span}(L_K[\Gamma]) = \mathcal H_K$? This is hard to characterize.
The candidate test functions for $\Gamma$ will be:
1. square integrable functions
2. integrable functions
3. discrete measures
4. the union of integrable functions and discrete measures.

30 Square integrable functions are too small
Proposition. For every $\gamma\in L^2(\mathcal X)$, $L_K[\gamma]\in\mathcal H_K$. Consequently, $L^2(\mathcal X)\subset L_K^{-1}(\mathcal H_K)$.
Corollary. If $\Lambda = \{k : \lambda_k > 0\}$ is a finite set, then $L_K(L^2(\mathcal X)) = \mathcal H_K$; otherwise $L_K(L^2(\mathcal X))\subsetneq\mathcal H_K$. The latter occurs when the kernel $K$ is strictly positive definite and the RKHS is infinite-dimensional.

32 Signed measures are (almost) just right
Measures: the class of functions $L^1(\mathcal X)$ are signed measures.
Proposition. For every $\gamma\in L^1(\mathcal X)$, $L_K[\gamma]\in\mathcal H_K$. Consequently, $L^1(\mathcal X)\subset L_K^{-1}(\mathcal H_K)$.
Discrete measures:
$$M_D = \Big\{ \mu = \sum_{i=1}^n c_i\,\delta_{x_i} : \sum_{i=1}^n |c_i| < \infty,\ x_i\in\mathcal X,\ n\in\mathbb N \Big\}.$$
Proposition. Given the set of finite discrete measures, $M_D\subset L_K^{-1}(\mathcal H_K)$.

34 Signed measures are (almost) just right
Nonsingular measures: $M = L^1(\mathcal X)\cup M_D$.
Proposition. $L_K(M)$ is dense in $\mathcal H_K$ with respect to the RKHS norm.
Proposition. $\mathcal B(\mathcal X)\subset L_K^{-1}(\mathcal H_K(\mathcal X))$.

35 The implication
Take-home message: we need priors on signed measures. A function-theoretic foundation for random signed measures such as Gaussian, Dirichlet, and Lévy process priors.

40 Discussion
Lots of work left:
- Further refinement of integral operators and priors in terms of Sobolev spaces.
- Semi-supervised setting: relation of kernel model and priors with Laplace-Beltrami and graph Laplacian operators.
- Semi-supervised setting: duality between diffusion processes on manifolds and Markov chains.
- Bayesian variable selection: efficient sampling and search in high-dimensional space.
- Numeric stability and statistical robustness: work with Tommy.

42 Generative vs. predictive modelling
Given data $= \{Z_i = (x_i, y_i)\}_{i=1}^n$ with $Z_i\overset{iid}{\sim}\rho(X,Y)$, $X\in\mathcal X\subset\mathbb R^p$, $Y\in\mathbb R$, and $p\gg n$.
Two options:
1. discriminative or regression: $Y\mid X$
2. generative: $X\mid Y$ (sometimes called inverse regression)

43 Regression
Given $X\in\mathcal X\subset\mathbb R^p$, $Y\in\mathbb R$, $p\gg n$, and $\rho(X,Y)$, we want $Y\mid X$. A natural idea:
$$f_r(x) = \arg\min_f\,\mathbb E_Y\,(Y - f(X))^2,$$
and $f_r(x) = \mathbb E_Y[Y\mid x]$ provides a summary of $Y\mid X$.

45 Inverse regression
Given $X\in\mathcal X\subset\mathbb R^p$, $Y\in\mathbb R$, $p\gg n$, and $\rho(X,Y)$, we want $X\mid Y$.
$\Omega = \mathrm{cov}(X\mid Y)$ provides a summary of $X\mid Y$:
1. $\Omega_{ii}$: relevance of a variable with respect to the label
2. $\Omega_{ij}$: covariation with respect to the label

47 Learning gradients
Given data $= \{Z_i = (x_i,y_i)\}_{i=1}^n$ with $Z_i\overset{iid}{\sim}\rho(X,Y)$. We will simultaneously estimate $f_r(x)$ and
$$\nabla f_r = \Big(\frac{\partial f_r}{\partial x^1},\ldots,\frac{\partial f_r}{\partial x^p}\Big)^{\!T}.$$
1. regression: $f_r(x)$
2. inverse regression: the gradient outer product (GOP)
$$\Gamma = \mathbb E\big[\nabla f_r\,\nabla f_r^{\,T}\big] \quad\text{or}\quad \Gamma_{ij} = \Big\langle\frac{\partial f_r}{\partial x^i},\frac{\partial f_r}{\partial x^j}\Big\rangle.$$

48 Linear case
We start with the linear case: $y = w\cdot x + \varepsilon$, $\varepsilon\overset{iid}{\sim}\mathrm{No}(0,\sigma^2)$. With $\Sigma_X = \mathrm{cov}(X)$ and $\sigma_Y^2 = \mathrm{var}(Y)$,
$$\Gamma = \sigma_Y^2\,\Sigma_X^{-1}\,\Omega\,\Sigma_X^{-1}.$$
$\Gamma$ and $\Omega$ are equivalent modulo rotation and scale.

49 Nonlinear case
For a smooth $f(x)$ with $y = f(x) + \varepsilon$, $\varepsilon\overset{iid}{\sim}\mathrm{No}(0,\sigma^2)$, the summary $\Omega = \mathrm{cov}(X\mid Y)$ is not so clear.

51 Nonlinear case
Partition $\mathcal X$ into sections and compute local quantities:
$$\mathcal X = \bigcup_{i=1}^I \chi_i,\quad \Omega_i = \mathrm{cov}(X_{\chi_i}\mid Y_{\chi_i}),\quad \Sigma_i = \mathrm{cov}(X_{\chi_i}),\quad \sigma_i^2 = \mathrm{var}(Y_{\chi_i}),\quad m_i = \rho_X(\chi_i).$$
$$\Gamma = \sum_{i=1}^I m_i\,\sigma_i^2\,\Sigma_i^{-1}\,\Omega_i\,\Sigma_i^{-1}.$$

52 Taylor expansion and gradients
Given $D = (Z_1,\ldots,Z_n)$, the error of $f$ may be approximated as
$$L(f, D) = \sum_{i,j=1}^n w_{ij}\,\big[y_i - f(x_j) - \nabla f(x_j)\cdot(x_i - x_j)\big]^2,$$
where $w_{ij}$ ensures the locality of $x_i - x_j$.

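As a concrete illustration, here is a small NumPy sketch (not the talk's implementation; the Gaussian locality weights and the toy data are assumptions) that evaluates this first-order Taylor loss for a candidate $f$ and $\nabla f$ supplied as callables.

```python
import numpy as np

def taylor_loss(f, grad_f, X, y, s2=0.5):
    """L(f, D) = sum_ij w_ij [ y_i - f(x_j) - grad_f(x_j) . (x_i - x_j) ]^2
    with locality weights w_ij = exp(-||x_i - x_j||^2 / (2 s2))."""
    n = len(X)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            w = np.exp(-np.sum((X[i] - X[j]) ** 2) / (2.0 * s2))
            resid = y[i] - f(X[j]) - grad_f(X[j]) @ (X[i] - X[j])
            loss += w * resid ** 2
    return loss

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=30)

# Candidate f and gradient: here the true quadratic, purely for illustration.
f = lambda x: x[0] ** 2
grad_f = lambda x: np.array([2.0 * x[0], 0.0])
print(taylor_loss(f, grad_f, X, y))
```
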
53 Gradient estimate - regularization
Radial basis function model:
$$(f_D, \vec f_D) := \arg\min_{(f,\vec f)\in\mathcal H_K^{p+1}}\Big[\,L(f, D) + \lambda_1\|f\|_K^2 + \lambda_2\|\vec f\|_K^2\,\Big],$$
where $\mathcal H_K^p$ is the space of $p$ functions $\vec f = (f_1,\ldots,f_p)$ with $f_i\in\mathcal H_K$, $\|\vec f\|_K^2 = \sum_{i=1}^p\|f_i\|_K^2$, and $\lambda_1,\lambda_2 > 0$.

54 Computational efficiency
The computation requires fewer than $n^2$ parameters and is $O(n^6)$ time and $O(pn)$ memory:
$$f_D(x) = \sum_{i=1}^n a_i K(x_i, x), \qquad \vec f_D(x) = \sum_{i=1}^n c_i K(x_i, x),$$
with $a_D = (a_1,\ldots,a_n)\in\mathbb R^n$ and $c_D = (c_1,\ldots,c_n)^T\in\mathbb R^{np}$.

55 Linear example [figure; axes: RKHS norm, samples, dimensions]

56 Linear example [figure; axes: probability of y, samples, dimensions]

57 Nonlinear example [figure: the two classes across dimensions]

58 Sensitive features
Proposition. Let $f$ be a smooth function on $\mathbb R^p$ with gradient $\nabla f$. The sensitivity of the function $f$ along a (unit normalized) direction $u$ is $\|u\cdot\nabla f\|_K$. The $d$ most sensitive features are those $\{u_1,\ldots,u_d\}$ that are orthogonal and maximize $\|u_i\cdot\nabla f\|_K$.
A spectral decomposition of $\Gamma$ is used to compute these features: the eigenvectors corresponding to the $d$ top eigenvalues.

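A minimal sketch of this spectral step (illustrative: the analytic gradient of a toy function stands in for the estimated $\vec f_D$): form the empirical gradient outer product and take its top-$d$ eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, d = 200, 10, 2
X = rng.normal(size=(n, p))

# Gradients of a toy regression function f(x) = x1^2 + x2 that depends
# on only two directions of R^p (standing in for the estimated gradients).
grads = np.zeros((n, p))
grads[:, 0] = 2.0 * X[:, 0]
grads[:, 1] = 1.0

# Empirical gradient outer product Gamma = (1/n) sum_i grad f(x_i) grad f(x_i)^T.
Gamma = grads.T @ grads / n

# The d most sensitive (orthogonal) directions: top-d eigenvectors of Gamma.
evals, evecs = np.linalg.eigh(Gamma)
top = evecs[:, np.argsort(evals)[::-1][:d]]
print(np.round(top, 2))
```
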
60 Projection of data
The matrix
$$\hat\Gamma = \vec f_D\,\vec f_D^{\,T} = c_D^{\,T} K\, c_D$$
is an empirical estimate of $\Gamma$.
Geometry is preserved by projection onto the top $k$ eigenvectors. There is no need to compute the $p\times p$ matrix; the method is $O(n^2 p + n^3)$ time and $O(pn)$ memory.

61 Linear example [figure; axes: dimensions, samples, index of eigenvalues]

62 Linear example [figure; axes: feature 1, dimensions]

63 Nonlinear example [figure; axes: feature 1, dimension]

64 Nonlinear example [figure; axes: index of eigenvalues, dimensions]

65 Nonlinear example [figure; axes: dimensions]

66 Digits: 6 vs. 9 [figure]

67 Digits: 6 vs. 9 [figure; axes: norm, index of eigenvalues, dimensions]

68 Nonlinear example [figure; axes: dimensions]

69 Digits: 6 vs. 9 [figure; axes: feature 1, feature 2]

70 An early graphical model
[path diagram with nodes: weight at 33 days; gain 0-33 days; external conditions; weight at birth; heredity; rate of growth; condition of dam; gestation period; size of litter; heredity of dam]

72 Gauss-Markov graphical models
Given a multivariate normal distribution with covariance matrix $C$, the matrix $P = C^{-1}$ is the conditional independence matrix:
$P_{ij}$ = dependence of $i$ on $j$ given all other variables.
Note that by construction $\hat\Gamma$ is a covariance matrix of a Gaussian process.

73 Define $J = \hat\Gamma^{-1}$ and the partial correlation coefficients
$$r_{ij} = \frac{-J_{ij}}{\sqrt{J_{ii}\,J_{jj}}}.$$
Define $R_{ij} = r_{ij}$ if $i\ne j$ and $0$ otherwise, and $D_{ii} = J_{ii}$. Then $J = D^{1/2}(I - R)D^{1/2}$, so
$$\hat\Gamma = D^{-1/2}(I - R)^{-1} D^{-1/2}.$$

74 The following expansions hold:
$$\hat\Gamma = D^{-1/2}\Big(\sum_{k=0}^{\infty} R^k\Big) D^{-1/2}, \qquad
\hat\Gamma = D^{-1/2}\Big(\prod_{k=0}^{\infty}\big(I + R^{2^k}\big)\Big) D^{-1/2}.$$
Here $k$ is path length in the first equality. The second equality factorizes the covariance matrix into low-rank matrices with fewer entries.

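These identities are easy to check numerically. A small NumPy sketch (with an arbitrary well-conditioned stand-in for $\hat\Gamma$, so that the series converges) verifies the partial-correlation decomposition and a truncated path-length expansion.

```python
import numpy as np

rng = np.random.default_rng(6)
p = 6
B = 0.1 * rng.normal(size=(p, p))
Gamma = np.eye(p) + (B + B.T) / 2.0            # well-conditioned stand-in for Gamma-hat

J = np.linalg.inv(Gamma)                       # precision matrix
Ds = np.sqrt(np.diag(np.diag(J)))              # D^{1/2}
R = -J / np.sqrt(np.outer(np.diag(J), np.diag(J)))
np.fill_diagonal(R, 0.0)                       # partial correlations, zero diagonal

# Exact identity: Gamma = D^{-1/2} (I - R)^{-1} D^{-1/2}.
lhs = np.linalg.inv(Ds) @ np.linalg.inv(np.eye(p) - R) @ np.linalg.inv(Ds)
print("identity error:", np.max(np.abs(lhs - Gamma)))

# Truncated path-length expansion sum_k R^k (valid when the spectral radius of R is < 1).
S = sum(np.linalg.matrix_power(R, k) for k in range(50))
print("series error:", np.max(np.abs(np.linalg.inv(Ds) @ S @ np.linalg.inv(Ds) - Gamma)))
```
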
75 Toy example [matrix: $\hat\Gamma^{-1}$ over variables a, b, c, d]

76 Toy example [graph on variables a, b, c, d read off from $\hat\Gamma^{-1}$]

77 Real data: progression of prostate cancer [figure]

82 Discussion
Lots of work left:
- Semi-supervised setting.
- Multi-task setting.
- Bayesian formulation.
- Nonlinear projections: diffusion maps.
- Noise on and off manifolds.

83 Relevant papers
Learning Coordinate Covariances via Gradients. Sayan Mukherjee, Ding-Xuan Zhou. Journal of Machine Learning Research, 7(Mar), 2006.
Estimation of Gradients and Coordinate Covariation in Classification. Sayan Mukherjee, Qiang Wu. Journal of Machine Learning Research, 7(Nov), 2006.
Learning Gradients and Feature Selection on Manifolds. Sayan Mukherjee, Qiang Wu, Ding-Xuan Zhou. Annals of Statistics, submitted.
Learning Gradients: simultaneous regression and inverse regression. Mauro Maggioni, Sayan Mukherjee, Qiang Wu. Journal of Machine Learning Research, in preparation.

84 Thanks Tommy, and Alessandro, Gadi, and Federico.

86 Bayesian kernel model
$$y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i\overset{iid}{\sim}\mathrm{No}(0,\sigma^2), \qquad
f(x) = \int_{\mathcal X} K(x,u)\,Z(du),$$
where $Z(du)\in\mathcal M(\mathcal X)$ is a signed measure on $\mathcal X$. Then
$$\pi(Z\mid\text{data}) \propto L(\text{data}\mid Z)\,\pi(Z),$$
and this implies a posterior on $f$.

87 Lévy processes
A stochastic process $Z := \{Z_u\in\mathbb R : u\in\mathcal X\}$ is called a Lévy process if it satisfies the following conditions:
1. $Z_0 = 0$ almost surely.
2. For any choice of $m\ge 1$ and $0\le u_0 < u_1 < \ldots < u_m$, the random variables $Z_{u_0}, Z_{u_1}-Z_{u_0},\ldots,Z_{u_m}-Z_{u_{m-1}}$ are independent (independent increments property).
3. The distribution of $Z_{s+u} - Z_s$ does not depend on $s$ (temporal homogeneity or stationary increments property).
4. $Z$ has càdlàg paths almost surely.

88 Lévy processes
Theorem (Lévy-Khintchine). If $Z$ is a Lévy process, then the characteristic function of $Z_u$, $u\ge 0$, has the following form:
$$\mathbb E\big[e^{i\lambda Z_u}\big] = \exp\Big( u\Big[\, i\lambda a - \tfrac12\sigma^2\lambda^2 + \int_{\mathbb R\setminus\{0\}}\big(e^{i\lambda w} - 1 - i\lambda w\,\mathbf 1_{\{w:|w|<1\}}(w)\big)\,\nu(dw) \Big]\Big),$$
where $a\in\mathbb R$, $\sigma^2\ge 0$, and $\nu$ is a nonnegative measure on $\mathbb R$ with $\int_{\mathbb R}(1\wedge w^2)\,\nu(dw) < \infty$.

89 Lévy processes
Drift term $a$; variance of Brownian motion $\sigma^2$; $\nu(dw)$: the jump process or Lévy measure.
$$\exp\big\{ u\big[\, i\lambda a - \tfrac12\sigma^2\lambda^2 \big]\big\}\;
\exp\Big\{ u\int_{\mathbb R\setminus\{0\}}\big[e^{i\lambda w} - 1 - i\lambda w\,\mathbf 1_{\{w:|w|<1\}}(w)\big]\nu(dw)\Big\}$$

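A minimal simulation sketch of these three ingredients (drift, Brownian part, jumps), under assumptions not in the talk: a finite jump measure, approximated by at most one compound-Poisson jump per time step.

```python
import numpy as np

rng = np.random.default_rng(7)

a, sigma = 0.5, 0.3                 # drift and standard deviation of the Brownian part
jump_rate, jump_scale = 2.0, 1.0    # finite Levy measure: Poisson(rate) jumps, N(0, scale^2) sizes

T, n = 5.0, 5000
dt = T / n

increments = (
    a * dt                                                            # drift term
    + sigma * np.sqrt(dt) * rng.normal(size=n)                        # Brownian motion
    + rng.normal(0.0, jump_scale, n) * rng.poisson(jump_rate * dt, n) # jumps (at most ~1 per step)
)
Z = np.concatenate([[0.0], np.cumsum(increments)])                    # Z_0 = 0, path on the time grid
print(Z[-1])
```
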
90 Two approaches to Gaussian processes
Two modelling approaches:
1. a prior directly on the space of functions, by sampling from paths of the Gaussian process defined by $K$;
2. a Gaussian process prior on $Z(du)$, which implies a prior on function space via the integral operator.

91 Prior on random measure
A Gaussian process prior on $Z(du)$ is a prior on signed measures, so $\mathrm{span}(\mathcal G)\subset\mathcal H_K$.

92 Direct prior elicitation
Theorem (Kallianpur). If $\{Z_u, u\in\mathcal X\}$ is a Gaussian process with covariance $K$ and mean $m\in\mathcal H_K$, and $\mathcal H_K$ is infinite dimensional, then $P(Z\in\mathcal H_K) = 0$.
The sample paths are almost surely outside $\mathcal H_K$.

93 A bigger RKHS
Theorem (Lukić and Beder). Given two kernel functions $R$ and $K$, $R$ dominates $K$ ($R\succeq K$) if $\mathcal H_K\subset\mathcal H_R$. Let $R\succeq K$. Then $\|g\|_R\le\|g\|_K$ for all $g\in\mathcal H_K$. There exists a unique linear operator $L:\mathcal H_R\to\mathcal H_R$ whose range is contained in $\mathcal H_K$ such that
$$\langle f, g\rangle_R = \langle Lf, g\rangle_K, \qquad f\in\mathcal H_R,\ g\in\mathcal H_K.$$
In particular $LR_u = K_u$, $u\in\mathcal X$. As an operator into $\mathcal H_R$, $L$ is bounded, symmetric, and positive. Conversely, let $L:\mathcal H_R\to\mathcal H_R$ be a positive, continuous, self-adjoint operator; then $K(s,t) = \langle LR_s, R_t\rangle_R$, $s,t\in\mathcal X$, defines a reproducing kernel on $\mathcal X$ such that $K\preceq R$.

96 A bigger RKHS
If $L$ is nuclear (an operator that is compact with finite trace, independent of basis choice), then we have nuclear dominance of $R$ over $K$.
Theorem (Lukić and Beder). Let $K$ and $R$ be two reproducing kernels, and assume that the RKHS $\mathcal H_R$ is separable. A necessary and sufficient condition for the existence of a Gaussian process with covariance $K$ and mean $m\in\mathcal H_R$ and with trajectories in $\mathcal H_R$ with probability 1 is that $R$ nuclearly dominates $K$.
Characterize $\mathcal H_R$ by $L_K^{-1}(\mathcal H_K)$.

98 Dirichlet process prior
$$f(x) = \int_{\mathcal X} K(x,u)\,Z(du) = \int_{\mathcal X} K(x,u)\,w(u)\,F(du),$$
where $F(du)$ is a distribution and $w(u)$ a coefficient function.
Model $F$ using a Dirichlet process prior: $F\sim\mathrm{DP}(\alpha, F_0)$.

99 Bayesian representer form
Given $X^n = (x_1,\ldots,x_n)\overset{iid}{\sim} F$,
$$F\mid X^n \sim \mathrm{DP}(\alpha + n,\ F_n), \qquad F_n = \Big(\alpha F_0 + \sum_{i=1}^n\delta_{x_i}\Big)\big/(\alpha + n).$$
$$\mathbb E[f\mid X^n] = a_n\int K(x,u)\,w(u)\,dF_0(u) + n^{-1}(1 - a_n)\sum_{i=1}^n w(x_i)\,K(x, x_i), \qquad a_n = \alpha/(\alpha + n).$$

100 Bayesian representer form
Taking $\lim_{\alpha\to 0}$ to represent a non-informative prior:
Proposition (Bayesian representer theorem).
$$\hat f_n(x) = \sum_{i=1}^n w_i\,K(x, x_i), \qquad w_i = w(x_i)/n.$$

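A tiny NumPy sketch of evaluating this representer form (the coefficient function $w$ and the Gaussian kernel here are illustrative assumptions):

```python
import numpy as np

def f_hat(x_new, X, w_fn, width=1.0):
    """Bayesian representer form: f_n(x) = sum_i (w(x_i) / n) K(x, x_i)."""
    n = len(X)
    K = np.exp(-((x_new[:, None, :] - X[None, :, :]) ** 2).sum(-1) / width ** 2)
    return K @ (w_fn(X) / n)

rng = np.random.default_rng(8)
X = rng.uniform(-2, 2, size=(40, 1))
w_fn = lambda X: np.sin(X[:, 0])               # an illustrative coefficient function w(u)
print(f_hat(np.array([[0.0], [1.0]]), X, w_fn))
```
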
101 Likelihood
$$y_i = f(x_i) + \varepsilon_i = w_0 + \sum_{j=1}^n w_j\,K(x_i, x_j) + \varepsilon_i, \qquad i = 1,\ldots,n,$$
where $\varepsilon_i\sim\mathrm{No}(0,\sigma^2)$. Hence
$$Y\sim\mathrm{No}(w_0\iota + Kw,\ \sigma^2 I),$$
where $\iota = (1,\ldots,1)^T$.

104 Prior specification
Factor: $K = F\Delta F^{T}$ with $\Delta := \mathrm{diag}(\lambda_1^2,\ldots,\lambda_n^2)$, and $w = F^{-1}\beta$.
$$\pi(w_0,\sigma^2)\propto 1/\sigma^2, \qquad
\tau_i^{-1}\sim\mathrm{Ga}(a_\tau/2,\ b_\tau/2), \qquad T := \mathrm{diag}(\tau_1,\ldots,\tau_n),$$
$$\beta\sim\mathrm{No}(0,\ \Delta T), \qquad
w\mid K, T\sim\mathrm{No}\big(0,\ F^{-1}\Delta T\,(F^{-1})^{T}\big).$$
A standard Gibbs sampler simulates $p(w, w_0, \sigma^2\mid\text{data})$.

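To make the Gibbs step concrete, here is a deliberately simplified sketch (my own stand-in, not the talk's prior: it replaces the factored $\Delta$, $T$ structure with a single ridge-type Gaussian prior on $(w_0, w)$ and a conjugate inverse-gamma prior on $\sigma^2$) showing the two conditional updates alternating.

```python
import numpy as np

rng = np.random.default_rng(9)

# Toy data and a Gaussian kernel matrix.
n = 40
x = np.sort(rng.uniform(-2, 2, n))
y = np.sin(2 * x) + 0.2 * rng.normal(size=n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

# Simplified conjugate setup: theta = (w_0, w) ~ No(0, tau2 I), 1/sigma^2 ~ Ga(a0, b0).
Xd = np.column_stack([np.ones(n), K])
tau2, a0, b0 = 10.0, 1.0, 1.0
sigma2 = 1.0

samples = []
for it in range(2000):
    # theta | sigma^2, y: Gaussian with precision Xd'Xd / sigma^2 + I / tau2.
    cov = np.linalg.inv(Xd.T @ Xd / sigma2 + np.eye(n + 1) / tau2)
    mean = cov @ Xd.T @ y / sigma2
    theta = rng.multivariate_normal(mean, cov)
    # sigma^2 | theta, y: inverse gamma.
    rss = np.sum((y - Xd @ theta) ** 2)
    sigma2 = 1.0 / rng.gamma(a0 + n / 2.0, 1.0 / (b0 + rss / 2.0))
    if it >= 1000:
        samples.append(theta)

print("posterior mean of w_0:", np.mean([s[0] for s in samples]))
```
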
105 Kernel model extension
$K_\nu(x, u) = K(\sqrt{\nu}\odot x,\ \sqrt{\nu}\odot u)$ with $\nu = \{\nu_1,\ldots,\nu_p\}$, $\nu_k\in[0,\infty)$ a scale parameter. For example:
$$k_\nu(x,u) = \sum_{k=1}^p\nu_k\,x_k u_k, \qquad
k_\nu(x,u) = \Big(1 + \sum_{k=1}^p\nu_k\,x_k u_k\Big)^{d}, \qquad
k_\nu(x,u) = \exp\Big(-\sum_{k=1}^p\nu_k\,(x_k - u_k)^2\Big).$$

107 Prior specification
$$\nu_k\sim(1-\gamma)\,\delta_0 + \gamma\,\mathrm{Ga}(a_\nu,\ a_\nu s), \quad k = 1,\ldots,p, \qquad
s\sim\mathrm{Exp}(a_s), \qquad \gamma\sim\mathrm{Be}(a_\gamma, b_\gamma).$$
A standard Gibbs sampler does not work: use Metropolis-Hastings.

110 GRBF and HyperBF networks, a reprise
Suppose
$$f_m(x) = \sum_{\alpha=1}^m c_\alpha\,G(x;\,t_\alpha,\sigma_\alpha),$$
with coefficients $c_\alpha$, knots $t_\alpha$, and scales $\sigma_\alpha$. Minimize
$$H[f_m] = \sum_{i=1}^N (y_i - f_m(x_i))^2 + \lambda\,\|Pf_m\|^2,$$
where $P$ is a differential operator. Solve for $c_\alpha$, $t_\alpha$, and $\sigma_\alpha$ via stochastic differential equations, e.g.
$$\dot c_\alpha = -\omega\,\frac{\partial H[f_m]}{\partial c_\alpha} + \eta_\alpha(t), \qquad \alpha = 1,\ldots,m,$$
and similarly for $t_\alpha$ and $\sigma_\alpha$.

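A toy NumPy sketch of this idea (my own discretization, not the talk's: plain squared error without the smoothness penalty, and noisy gradient steps on $c_\alpha$, $t_\alpha$, $\sigma_\alpha$ for 1-D Gaussian units).

```python
import numpy as np

rng = np.random.default_rng(10)

# Toy 1-D data and m Gaussian units with movable coefficients, knots, and scales.
x = np.linspace(-3, 3, 100)
y = np.sin(x)
m = 8
c = 0.1 * rng.normal(size=m)
t = np.linspace(-3, 3, m)
s = np.ones(m)

omega, noise = 1e-4, 1e-3
for step in range(5000):
    G = np.exp(-(x[:, None] - t[None, :]) ** 2 / s ** 2)   # unit activations, N x m
    r = y - G @ c                                          # residuals
    # Gradients of the squared-error part of H (smoothness penalty omitted in this sketch).
    dc = -2.0 * (G.T @ r)
    dt = -2.0 * c * ((G * (2.0 * (x[:, None] - t) / s ** 2)).T @ r)
    ds = -2.0 * c * ((G * (2.0 * (x[:, None] - t) ** 2 / s ** 3)).T @ r)
    # Discretized stochastic updates: a step down the gradient plus a small noise term.
    c += -omega * dc + noise * rng.normal(size=m)
    t += -omega * dt + noise * rng.normal(size=m)
    s = np.maximum(s - omega * ds + noise * rng.normal(size=m), 0.1)

G = np.exp(-(x[:, None] - t[None, :]) ** 2 / s ** 2)
print("final squared error:", np.sum((y - G @ c) ** 2))
```
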
112 Problem setup
Labelled data: $(Y^p, X^p) = \{(y_i^p, x_i^p);\ i = 1{:}n_p\}\overset{iid}{\sim}\rho(Y, X\mid\phi,\theta)$.
Unlabelled data: $X^m = \{x_i^m,\ i = 1{:}n_m\}\overset{iid}{\sim}\rho(X\mid\theta)$.
How can the unlabelled data help our predictive model? With data $= \{Y, X, X^m\}$,
$$p(\phi,\theta\mid\text{data}) \propto \pi(\phi,\theta)\,p(Y\mid X,\phi)\,p(X\mid\theta)\,p(X^m\mid\theta).$$
We need very strong dependence between $\theta$ and $\phi$.

114 Bayesian kernel model
Result of the DP prior: $\hat f_n(x) = \sum_{i=1}^{n_p + n_m} w_i\,K(x, x_i)$.
Same form as in Belkin and Niyogi, but without
$$\min_{f\in\mathcal H_K}\big[\,L(f,\text{data}) + \lambda_1\|f\|_K^2 + \lambda_2\|f\|_I^2\,\big].$$

115 Bayesian kernel model
Result of the DP prior: $\hat f_n(x) = \sum_{i=1}^{n_p + n_m} w_i\,K(x, x_i)$.
1. $\theta = F(\cdot)$, so that $p(x\mid\theta)\,dx = dF(x)$: the parameter is the full distribution function itself;
2. $p(y\mid x,\phi)$ depends intimately on $\theta = F$; in fact $\theta\subset\phi$ in this case, and the dependence of $\theta$ and $\phi$ is central to the model.

117 Simulated data, semi-supervised [figures]

120 Cancer classification, semi-supervised [figure: average (hard) error rate vs. percentage of labelled examples, with and without unlabelled data]

121 MNIST digits feature selection

122 MNIST digits feature selection [figure: with variable selection vs. without variable selection]
