Sayan Mukherjee. June 15, 2007
1 Department of Statistical Science, Institute for Genome Sciences & Policy, Department of Computer Science, Duke University. June 15, 2007.
2 To Tommy Poggio. This talk is dedicated to my advisor Tommy Poggio as part of a journey through computation. I hope I don't tarnish his name too much.
3 Relevant papers: Considerations of Models of Movement Detection. T. Poggio and W. Reichardt. Kybernetik 13, 1973. The Volterra representation and the Wiener expansion: validity and pitfalls. G. Palm and T. Poggio. SIAM J. Appl. Math., 33(2), 1977. Stochastic identification methods for nonlinear systems: an extension of Wiener theory. G. Palm and T. Poggio. SIAM J. Appl. Math., 34(3), 1978. Regularization algorithms for learning that are equivalent to multilayer networks. T. Poggio and F. Girosi. Science, 247, 1990. Characterizing the function space for Bayesian kernel models. N. Pillai, Q. Wu, F. Liang, S. Mukherjee, and R. Wolpert. Journal of Machine Learning Research, in press, 2007.
4 Table of contents: 1 Integral operators and priors (Wiener theory; radial basis function networks; Bayesian kernel model); 2
5 Linear systems theory. Define the input signal x(t) and the output signal y(t). A system maps an input signal to an output signal via a system operator S defined by an impulse response K: y(t) = S{x(t)} = ∫ K(t − τ) x(τ) dτ. Characterize S by test functions: sinusoidal inputs.
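The convolution relation above has a direct discrete analogue; a minimal numpy sketch (the sinusoidal input and decaying-exponential impulse response are illustrative choices, not from the talk):

```python
import numpy as np

# Discrete analogue of y(t) = ∫ K(t - τ) x(τ) dτ: a linear
# time-invariant system is fully described by its impulse response K,
# and the output is the convolution of K with the input.
t = np.arange(0, 200)
x = np.sin(2 * np.pi * t / 50.0)            # input signal
K = np.exp(-t / 10.0)                        # impulse response (decaying)
y = np.convolve(x, K)[: len(t)]              # y = K * x, truncated

# Linearity check: S{a·x1 + b·x2} = a·S{x1} + b·S{x2}
x2 = np.cos(2 * np.pi * t / 30.0)
lhs = np.convolve(2 * x + 3 * x2, K)[: len(t)]
rhs = 2 * np.convolve(x, K)[: len(t)] + 3 * np.convolve(x2, K)[: len(t)]
```

The linearity check is the property that makes characterization by sinusoidal test functions possible.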
7 Nonlinear systems theory: y(t) = S{x(t)}. Wiener theory: characterize S by test functions: Brownian motion. The method of Wiener provides a canonical series representation Sx = Σ_{nki} G_{nki} x, where G_{nki} x = ∫ g_{nki}(τ₁, …, τ_i) dx(τ₁) ⋯ dx(τ_i).
8 Nonlinear systems theory: y(t) = S{x(t)}. Lee–Schetzen: characterize S by test functions: white noise. The response of S to ẋ: Sẋ = Σ_{nki} G_{nki} ẋ, where G_{nki} ẋ = ∫ g_{nki}(τ₁, …, τ_i) ẋ(τ₁) ⋯ ẋ(τ_i) dτ₁ ⋯ dτ_i. A question: for what stochastic inputs does this hold?
10 Regression learning as hypersurface reconstruction: data = {L_i = (x_i, y_i)}_{i=1}^n with L_i iid ∼ ρ(X, Y), X ∈ X ⊂ ℝ^p, Y ∈ ℝ, and p ≫ n. A natural idea: f_r(x) = E_Y[Y | x].
11 Regularization theory, RBF style. Minimization of the following functional was proposed in Poggio and Girosi: H[f] = Σ_{i=1}^N (y_i − f(x_i))² + λ‖Pf‖², where P is a differential operator.
12 Regularization theory, RBF style. Using Green's functions (operator theory), f(x) = Σ_{i=1}^N c_i G(x; x_i), where in matrix–vector form (G + λI)c = y, with G_ij = G(x_i; x_j).
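A minimal sketch of the matrix–vector solution above, assuming a Gaussian radial basis function for G and synthetic 1-D data (the bandwidth, regularization parameter, and data are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 1))                  # centers x_i
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=30)   # noisy targets

def green(u, v, sigma=0.5):
    # Gaussian radial basis function G(u; v)
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

G = np.array([[green(a, b) for b in X] for a in X])   # G_ij = G(x_i; x_j)
lam = 1e-2
c = np.linalg.solve(G + lam * np.eye(len(X)), y)      # (G + λI)c = y

def f(x):
    # f(x) = Σ_i c_i G(x; x_i)
    return sum(c_i * green(x, x_i) for c_i, x_i in zip(c, X))

pred = np.array([f(a) for a in X])                    # fit at the centers
```

The regularizer λ trades fidelity to the data against smoothness of the resulting expansion.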
13 Regularization theory, RKHS style. f̂ = arg min over f in a big function space of [error on data + smoothness of function], where: error on data = L(f, data) = Σ_i (f(x_i) − y_i)²; smoothness of function = ‖f‖²_K; big function space = reproducing kernel Hilbert space H_K.
15 Regularization theory, RKHS style. f̂ = arg min_{f ∈ H_K} [L(f, data) + λ‖f‖²_K]. The kernel: K : X × X → ℝ, e.g. K(u, v) = exp(−‖u − v‖²). The RKHS: H_K = { f | f(x) = Σ_{i=1}^ℓ α_i K(x, x_i), x_i ∈ X, α_i ∈ ℝ, ℓ ∈ ℕ }.
16 Representer theorem. f̂ = arg min_{f ∈ H_K} [L(f, data) + λ‖f‖²_K] implies f̂(x) = Σ_{i=1}^n c_i K(x, x_i) = Σ_{i=1}^n c_i G(x; x_i). Great when p ≫ n.
17 Very popular and useful (Theo, Massi, and Tommy): 1 Support vector machines: f̂ = arg min_{f ∈ H_K} [Σ_{i=1}^n (1 − y_i f(x_i))₊ + λ‖f‖²_K]. 2 Regularized kernel regression: f̂ = arg min_{f ∈ H_K} [Σ_{i=1}^n (y_i − f(x_i))² + λ‖f‖²_K]. 3 Regularized logistic regression: f̂ = arg min_{f ∈ H_K} [Σ_{i=1}^n ln(1 + e^{−y_i f(x_i)}) + λ‖f‖²_K].
18 Why go Bayes? Uncertainty.
19 Bayesian interpretation (Poggio and Girosi). With H[f] = Σ_{i=1}^N (y_i − f(x_i))² + λ‖Pf‖², exp(−H[f]) = exp(−Σ_{i=1}^N (y_i − f(x_i))²) · exp(−λ‖Pf‖²), i.e. Pr(f | data) ∝ Lik(data | f) π(f).
20 Priors via spectral expansion. H_K = { f | f(x) = Σ_{i=1}^∞ a_i φ_i(x) with Σ_{i=1}^∞ a_i²/λ_i < ∞ }, where φ_i(x) and λ_i ≥ 0 are the eigenfunctions and eigenvalues of K: λ_i φ_i(x) = ∫_X K(x, u) φ_i(u) dγ(u). Specify a prior on H_K via a prior on A = { (a_k)_{k=1}^∞ : Σ_k a_k²/λ_k < ∞ }. This is hard to sample and relies on computation of eigenvalues and eigenfunctions.
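The eigenvalues and eigenfunctions of the integral operator can be approximated from samples of γ by eigendecomposing the kernel matrix (the Nyström approximation); a sketch with an illustrative Gaussian kernel and uniform sampling measure:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(-1, 1, size=200)                 # samples from γ
Kmat = np.exp(-(u[:, None] - u[None, :]) ** 2)   # Gaussian kernel matrix

# Nyström: eigenvalues of K/m approximate the λ_i of the integral
# operator; the corresponding eigenvectors approximate the φ_i
# evaluated at the sample points.
evals, evecs = np.linalg.eigh(Kmat / len(u))
evals, evecs = evals[::-1], evecs[:, ::-1]       # sort descending
```

The rapid decay of the approximate spectrum is exactly what makes truncating the expansion reasonable, and also what makes sampling the coefficients a_k delicate.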
23 Priors via duality. The duality between Gaussian processes and RKHSs suggests the following construction, where the covariance K is given by the kernel: f(·) ∼ GP(μ_f, K). But f(·) ∉ H_K almost surely.
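Sampling f(·) ∼ GP(0, K) at a finite grid reduces to drawing a multivariate normal with the kernel matrix as covariance; a minimal sketch (grid, bandwidth, and jitter value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 100)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)   # kernel matrix
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))     # jitter for stability
f = L @ rng.normal(size=(len(x), 3))                   # three draws from GP(0, K)
```

Each column of f is one sample path; the theorem above says these paths, as functions, lie outside H_K almost surely even though every finite-dimensional marginal is governed by K.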
25 Integral operators. Define the integral operator L_K : Γ → G, with G = { f | f(x) := L_K[γ](x) = ∫_X K(x, u) dγ(u), γ ∈ Γ } and Γ ⊂ B(X). A prior on Γ implies a prior on G.
27 Equivalence with RKHS. For what Γ is H_K = span(G)? That is, for what class of test functions does span(L_K[Γ]) = H_K? This is hard to characterize. The candidate classes of test functions for Γ will be: 1 square integrable functions; 2 integrable functions; 3 discrete measures; 4 the union of integrable functions and discrete measures.
29 Square integrable functions are too small. Proposition: for every γ ∈ L²(X), L_K[γ] ∈ H_K; consequently, L²(X) ⊂ L_K^{−1}(H_K). Corollary: if Λ = {k : λ_k > 0} is a finite set, then L_K(L²(X)) = H_K; otherwise L_K(L²(X)) ⊊ H_K. The latter occurs when the kernel K is strictly positive definite and the RKHS is infinite-dimensional.
31 Signed measures are (almost) just right. Measures: the functions in L¹(X) correspond to densities of signed measures. Proposition: for every γ ∈ L¹(X), L_K[γ] ∈ H_K; consequently, L¹(X) ⊂ L_K^{−1}(H_K). Discrete measures: M_D = { μ = Σ_{i=1}^n c_i δ_{x_i} : Σ_{i=1}^n |c_i| < ∞, x_i ∈ X, n ∈ ℕ }. Proposition: given the set of finite discrete measures, M_D ⊂ L_K^{−1}(H_K).
33 Signed measures are (almost) just right. Nonsingular measures: M = L¹(X) ∪ M_D. Proposition: L_K(M) is dense in H_K with respect to the RKHS norm. Proposition: B(X) ⊂ L_K^{−1}(H_K(X)).
35 The implication. Take-home message: we need priors on signed measures. This gives a function-theoretic foundation for random signed measures such as Gaussian, Dirichlet, and Lévy process priors.
40 Discussion. Lots of work left: further refinement of integral operators and priors in terms of Sobolev spaces; semi-supervised setting: relation of the kernel model and priors to Laplace–Beltrami and graph Laplacian operators; semi-supervised setting: duality between diffusion processes on manifolds and Markov chains; Bayesian variable selection: efficient sampling and search in high-dimensional spaces; numeric stability and statistical robustness (work with Tommy).
41 Generative vs. predictive modelling. Given data = {Z_i = (x_i, y_i)}_{i=1}^n with Z_i iid ∼ ρ(X, Y), X ∈ X ⊂ ℝ^p, Y ∈ ℝ, and p ≫ n. Two options: 1 discriminative or regression, Y | X; 2 generative, X | Y (sometimes called inverse regression).
43 Regression. Given X ∈ X ⊂ ℝ^p, Y ∈ ℝ, p ≫ n, and ρ(X, Y), we want Y | X. A natural idea: f_r = arg min_f E(Y − f(X))², and f_r(x) = E_Y[Y | x] provides a summary of Y | X.
44 Inverse regression. Given X ∈ X ⊂ ℝ^p, Y ∈ ℝ, p ≫ n, and ρ(X, Y), we want X | Y. Ω = cov(X | Y) provides a summary of X | Y: 1 Ω_ii is the relevance of a variable with respect to the label; 2 Ω_ij is the covariation of variables with respect to the label.
46 Learning gradients. Given data = {Z_i = (x_i, y_i)}_{i=1}^n with Z_i iid ∼ ρ(X, Y), we will simultaneously estimate f_r(x) and its gradient ∇f_r = (∂f_r/∂x¹, …, ∂f_r/∂x^p)^T. 1 Regression: f_r(x). 2 Inverse regression: the gradient outer product (GOP) Γ = E[∇f_r ∇f_r^T], or Γ_ij = ⟨∂f_r/∂x_i, ∂f_r/∂x_j⟩.
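The gradient outer product can be illustrated on a toy function whose gradient is known: the top eigenvector of Γ recovers the single relevant direction. This sketch uses oracle finite-difference gradients of an illustrative function, not the RKHS estimator described in the talk:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 5, 2000
X = rng.normal(size=(n, p))

def f(x):
    # toy regression function depending only on x[0] + x[1]
    return np.sin(x[..., 0] + x[..., 1])

# central finite-difference gradients at each sample point
eps = 1e-5
grads = np.stack([
    (f(X + eps * np.eye(p)[k]) - f(X - eps * np.eye(p)[k])) / (2 * eps)
    for k in range(p)
], axis=1)                                   # shape (n, p)

Gamma = grads.T @ grads / n                  # empirical GOP  E[∇f ∇f^T]
w, V = np.linalg.eigh(Gamma)
top = V[:, -1]                               # top eigenvector
```

Since ∇f is proportional to (1, 1, 0, 0, 0) everywhere, Γ is rank one and its top eigenvector points along the relevant direction.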
48 Linear case. We start with the linear case: y = w·x + ε, ε iid ∼ N(0, σ²). With Σ_X = cov(X) and σ²_Y = var(Y), Γ = σ²_Y Σ_X^{−1} Ω Σ_X^{−1}. Γ and Ω are equivalent modulo rotation and scale.
49 Nonlinear case. For smooth f(x) with y = f(x) + ε, ε iid ∼ N(0, σ²), the meaning of Ω = cov(X | Y) is not so clear.
50 Nonlinear case. Partition into sections and compute local quantities: X = ∪_{i=1}^I χ_i, Ω_i = cov(X_{χ_i} | Y_{χ_i}), Σ_i = cov(X_{χ_i}), σ_i² = var(Y_{χ_i}), m_i = ρ_X(χ_i). Then Γ = Σ_{i=1}^I m_i σ_i² Σ_i^{−1} Ω_i Σ_i^{−1}.
52 Taylor expansion and gradients. Given D = (Z_1, …, Z_n), the error of f may be approximated as L(f, D) = Σ_{i,j=1}^n w_ij [y_i − f(x_j) − ∇f(x_j)·(x_i − x_j)]², where the weight w_ij enforces locality, x_i ≈ x_j.
53 Gradient estimate via regularization, radial basis function model: (f_D, ∇f_D) := arg min_{f, ∇f ∈ H_K^{p+1}} [L(f, D) + λ₁‖f‖²_K + λ₂‖∇f‖²_K], where H_K^p is the space of p-vectors of functions ∇f = (f₁, …, f_p) with f_i ∈ H_K, ‖∇f‖²_K = Σ_{i=1}^p ‖f_i‖²_K, and λ₁, λ₂ > 0.
54 Computational efficiency. The computation requires fewer than n² parameters and is O(n⁶) time and O(pn) memory: f_D(x) = Σ_{i=1}^n a_i K(x_i, x), ∇f_D(x) = Σ_{i=1}^n c_i K(x_i, x), with a_D = (a_1, …, a_n) ∈ ℝⁿ and c_D = (c_1, …, c_n)^T ∈ ℝ^{np}.
55 Linear example [figure: RKHS norm across dimensions and samples]
56 Linear example [figure: probability of y = 1 across dimensions and samples]
57 Nonlinear example [figure: two classes across dimensions]
58 Sensitive features. Proposition: let f be a smooth function on ℝ^p with gradient ∇f. The sensitivity of f along a (unit-normalized) direction u is ‖u·∇f‖_K. The d most sensitive features are the orthogonal directions {u_1, …, u_d} that maximize ‖u_i·∇f‖_K. A spectral decomposition of Γ is used to compute these features: the eigenvectors corresponding to the d top eigenvalues.
59 Projection of data. The matrix Γ̂ = ⟨∇f_D, ∇f_D⟩ = c_D^T K c_D is an empirical estimate of Γ. Geometry is preserved by projection onto the top k eigenvectors. There is no need to compute the p × p matrix; the method is O(n²p + n³) time and O(pn) memory.
61 Linear example [figure: eigenvalue spectrum across samples and dimensions]
62 Linear example [figure: data projected onto feature 1]
63 Nonlinear example [figures: data projected onto features 1 and 2]
64 Nonlinear example [figure: eigenvalue spectrum across dimensions]
65 Nonlinear example [figure: estimated matrix across dimensions]
66 Digits: 6 vs. 9 [figure: sample digit images]
67 Digits: 6 vs. 9 [figure: eigenvalue spectrum and norms across dimensions]
68 Nonlinear example [figure: estimated matrix across dimensions]
69 Digits: 6 vs. 9 [figure: projection onto features 1 and 2]
70 An early graphical model [path diagram: birth weight, gain 0–33 days, weight at 33 days, external conditions, heredity, rate of growth, condition of dam, gestation period, size of litter, heredity of dam]
71 Gauss–Markov graphical models. Given a multivariate normal distribution with covariance matrix C, the matrix P = C^{−1} is the conditional independence matrix: P_ij = dependence of i on j given all other variables. Note that by construction Γ̂ is a covariance matrix of a Gaussian process.
73 Define J = Γ̂^{−1} and the partial correlation coefficients r_ij = −J_ij / √(J_ii J_jj). Define R by R_ij = r_ij if i ≠ j and 0 otherwise, and D = diag(J_11, …, J_nn). Then Γ̂ = D^{−1/2} (I − R)^{−1} D^{−1/2}.
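The identities above can be checked numerically; a sketch with a small hypothetical covariance standing in for Γ̂:

```python
import numpy as np

# Hypothetical covariance (stands in for the estimated GOP Γ̂)
Gamma = np.array([[2.0, 0.8, 0.1],
                  [0.8, 1.5, 0.3],
                  [0.1, 0.3, 1.0]])
J = np.linalg.inv(Gamma)                    # J = Γ̂⁻¹

d = np.sqrt(np.diag(J))
r = -J / np.outer(d, d)                     # r_ij = -J_ij / sqrt(J_ii J_jj)
R = r - np.diag(np.diag(r))                 # zero the diagonal

# Reconstruction: Γ̂ = D^{-1/2} (I - R)^{-1} D^{-1/2}
Dinv_sqrt = np.diag(1.0 / d)
recon = Dinv_sqrt @ np.linalg.inv(np.eye(3) - R) @ Dinv_sqrt
```

The off-diagonal entries of R are exactly the partial correlations, so recovering Γ̂ from R and D confirms that the precision matrix encodes the conditional dependence structure.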
74 The following expansions hold: Γ̂ = D^{−1/2} (Σ_{k=0}^∞ R^k) D^{−1/2} and Γ̂ = D^{−1/2} (Π_{k=0}^∞ (I + R^{2^k})) D^{−1/2}. Here k is path length in the first equality. The second equality factorizes the covariance matrix into low-rank matrices with fewer entries.
75 Toy example [data X on variables a, b, c, d and the estimated precision matrix Γ̂^{−1}]
76 Toy example [graph recovered from Γ̂^{−1}: edges among a, b, c, d]
77 Real data: progression of prostate cancer.
82 Discussion. Lots of work left: semi-supervised setting; multi-task setting; Bayesian formulation; nonlinear projections (diffusion maps); noise on and off manifolds.
83 Relevant papers: Learning Coordinate Covariances via Gradients. Sayan Mukherjee, Ding-Xuan Zhou; Journal of Machine Learning Research, 7(Mar), 2006. Estimation of Gradients and Coordinate Covariation in Classification. Sayan Mukherjee, Qiang Wu; Journal of Machine Learning Research, 7(Nov), 2006. Learning Gradients and Feature Selection on Manifolds. Sayan Mukherjee, Qiang Wu, Ding-Xuan Zhou; Annals of Statistics, submitted. Learning Gradients: simultaneous regression and inverse regression. Mauro Maggioni, Sayan Mukherjee, Qiang Wu; Journal of Machine Learning Research, in preparation.
84 Thanks Tommy, and Alessandro, Gadi, and Federico.
85 Bayesian kernel model. y_i = f(x_i) + ε, ε iid ∼ N(0, σ²), with f(x) = ∫_X K(x, u) Z(du), where Z(du) ∈ M(X) is a signed measure on X. Then π(Z | data) ∝ L(data | Z) π(Z), and this implies a posterior on f.
87 Lévy processes. A stochastic process Z := {Z_u ∈ ℝ : u ∈ X} is called a Lévy process if it satisfies the following conditions: 1 Z_0 = 0 almost surely. 2 For any choice of m ≥ 1 and 0 ≤ u_0 < u_1 < … < u_m, the random variables Z_{u_0}, Z_{u_1} − Z_{u_0}, …, Z_{u_m} − Z_{u_{m−1}} are independent (independent increments property). 3 The distribution of Z_{s+u} − Z_s is independent of s (temporal homogeneity, or stationary increments property). 4 Z has càdlàg paths almost surely.
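One concrete Lévy process combining all three ingredients of the representation that follows (drift, Brownian part, compound-Poisson jumps) can be simulated on a grid; the rates and scales below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
T, nsteps = 1.0, 1000
dt = T / nsteps
a, sigma = 0.5, 1.0                  # drift and Brownian scale
jump_rate, jump_scale = 3.0, 0.5     # compound-Poisson jump part

# Z_u = a·u + σ·B_u + Σ jumps: simulate by accumulating iid increments,
# which is exactly the stationary-independent-increments property.
dB = rng.normal(0.0, np.sqrt(dt), nsteps)
njumps = rng.poisson(jump_rate * dt, nsteps)
jumps = np.array([rng.normal(0, jump_scale, k).sum() for k in njumps])
increments = a * dt + sigma * dB + jumps
Z = np.concatenate([[0.0], np.cumsum(increments)])   # Z_0 = 0
```

Because the increments are drawn iid, the simulated path automatically satisfies conditions 1–3 of the definition.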
88 Lévy processes. Theorem (Lévy–Khintchine): if Z is a Lévy process, then the characteristic function of Z_u, u ≥ 0, has the form E[e^{iλZ_u}] = exp( u [ iλa − σ²λ²/2 + ∫_{ℝ∖{0}} (e^{iλw} − 1 − iλw 1_{{w : |w| < 1}}(w)) ν(dw) ] ), where a ∈ ℝ, σ² ≥ 0, and ν is a nonnegative measure on ℝ with ∫_ℝ min(1, w²) ν(dw) < ∞.
89 Lévy processes: a is the drift term, σ² the variance of the Brownian motion, and ν(dw) the jump process (Lévy measure). The characteristic function factors as exp{ u [ iλa − σ²λ²/2 ] } · exp{ u ∫_{ℝ∖{0}} [ e^{iλw} − 1 − iλw 1_{{w : |w| < 1}}(w) ] ν(dw) }.
90 Two approaches to Gaussian processes. Two modelling approaches: 1 a prior directly on the space of functions, by sampling from paths of the Gaussian process defined by K; 2 a Gaussian process prior on Z(du), which implies a prior on function space via the integral operator.
91 Prior on a random measure. Under a Gaussian process prior, Z(du) is a signed measure, so span(G) ⊂ H_K.
92 Direct prior elicitation. Theorem (Kallianpur): if {Z_u, u ∈ X} is a Gaussian process with covariance K and mean m ∈ H_K, and H_K is infinite-dimensional, then P(Z ∈ H_K) = 0. The sample paths are almost surely outside H_K.
93 A bigger RKHS. Theorem (Lukić and Beder): given two kernel functions R and K, R dominates K (written R ≫ K) if H_K ⊂ H_R. Let R ≫ K. Then ‖g‖_R ≤ ‖g‖_K for all g ∈ H_K, and there exists a unique linear operator L : H_R → H_R whose range is contained in H_K such that ⟨f, g⟩_R = ⟨Lf, g⟩_K for all f ∈ H_R, g ∈ H_K; in particular, L R_u = K_u for u ∈ X. As an operator on H_R, L is bounded, symmetric, and positive. Conversely, let L : H_R → H_R be a positive, continuous, self-adjoint operator; then K(s, t) = ⟨L R_s, R_t⟩_R, s, t ∈ X, defines a reproducing kernel on X such that K ≪ R.
94 A bigger RKHS. If L is nuclear (a compact operator with finite trace, independent of the choice of basis), then we have nuclear dominance of R over K. Theorem (Lukić and Beder): let K and R be two reproducing kernels, and assume the RKHS H_R is separable. A necessary and sufficient condition for the existence of a Gaussian process with covariance K, mean m ∈ H_R, and trajectories in H_R with probability 1 is that R nuclearly dominates K. Characterize H_R by L_K^{−1}(H_K).
97 Dirichlet process prior. f(x) = ∫_X K(x, u) Z(du) = ∫_X K(x, u) w(u) F(du), where F(du) is a distribution and w(u) a coefficient function. Model F using a Dirichlet process prior: F ∼ DP(α, F_0).
99 Bayesian representer form. Given X^n = (x_1, …, x_n) iid ∼ F, the posterior is F | X^n ∼ DP(α + n, F_n) with F_n = (αF_0 + Σ_{i=1}^n δ_{x_i})/(α + n). Then E[f | X^n] = a_n ∫ K(x, u) w(u) dF_0(u) + n^{−1}(1 − a_n) Σ_{i=1}^n w(x_i) K(x, x_i), where a_n = α/(α + n).
100 Bayesian representer form. Taking lim α → 0 to represent a non-informative prior: Proposition (Bayesian representer theorem): f̂_n(x) = Σ_{i=1}^n w_i K(x, x_i), with w_i = w(x_i)/n.
101 Likelihood. y_i = f(x_i) + ε_i = w_0 + Σ_{j=1}^n w_j K(x_i, x_j) + ε_i, i = 1, …, n, where ε_i ∼ N(0, σ²). In vector form, Y ∼ N(w_0 ι + Kw, σ²I), where ι = (1, …, 1)^T.
102 Prior specification. Factor K = FΛF^T with Λ := diag(λ₁², …, λₙ²), and set w = F^{−1}β. Priors: π(w_0, σ²) ∝ 1/σ²; τ_i^{−1} ∼ Ga(a_τ/2, b_τ/2) with T := diag(τ₁, …, τₙ); β ∼ N(0, T); hence w | K, T ∼ N(0, F^{−1} T F^{−T}). A standard Gibbs sampler simulates p(w, w_0, σ² | data).
105 Kernel model extension. K_ν(x, u) = K(√ν ∘ x, √ν ∘ u) (elementwise), where ν = {ν₁, …, ν_p} with ν_k ∈ [0, ∞) a scale parameter for coordinate k. Examples: k_ν(x, u) = Σ_{k=1}^p ν_k x_k u_k; k_ν(x, u) = (1 + Σ_{k=1}^p ν_k x_k u_k)^d; k_ν(x, u) = exp(−Σ_{k=1}^p ν_k (x_k − u_k)²).
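A minimal sketch of the scaled Gaussian kernel k_ν, showing that setting ν_k = 0 removes variable k from the model (the function name and test points are illustrative):

```python
import numpy as np

def gaussian_kernel_nu(x, u, nu):
    # k_ν(x, u) = exp(-Σ_k ν_k (x_k - u_k)²); ν_k = 0 drops variable k
    nu = np.asarray(nu, dtype=float)
    diff = np.asarray(x, dtype=float) - np.asarray(u, dtype=float)
    return np.exp(-np.sum(nu * diff ** 2))

x = np.array([1.0, 2.0, 3.0])
u = np.array([1.5, 2.0, -1.0])

k_all = gaussian_kernel_nu(x, u, [1.0, 1.0, 1.0])    # all variables active
k_drop = gaussian_kernel_nu(x, u, [1.0, 1.0, 0.0])   # variable 3 excluded
```

This is why the spike-and-slab prior on ν in the next slide performs variable selection: a draw of ν_k = 0 from the point mass at zero deletes coordinate k from the kernel.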
106 Prior specification. ν_k ∼ (1 − γ)δ_0 + γ Ga(a_ν, a_ν s), k = 1, …, p; s ∼ Exp(a_s); γ ∼ Be(a_γ, b_γ). A standard Gibbs sampler does not work here: use Metropolis–Hastings.
108 GRBF and HyperBF networks, a reprise. Suppose f_m(x) = Σ_{α=1}^m c_α G(x; t_α, σ_α), with coefficients c_α, knots t_α, and scales σ_α. Minimize H[f_m] = Σ_{i=1}^N (y_i − f_m(x_i))² + λ‖Pf_m‖², where P is a differential operator. Solve for c_α, t_α, and σ_α via stochastic differential equations, e.g. ċ_α = −ω ∂H[f_m]/∂c_α + η_α(t), α = 1, …, m, and similarly for t_α and σ_α.
111 Problem setup. Labelled data: (Y^p, X^p) = {(y_i^p, x_i^p); i = 1:n_p} iid ∼ ρ(Y, X | φ, θ). Unlabelled data: X^m = {x_i^m, i = 1:n_m} iid ∼ ρ(X | θ). How can the unlabelled data help our predictive model? With data = {Y, X, X^m}, p(φ, θ | data) ∝ π(φ, θ) p(Y | X, φ) p(X | θ) p(X^m | θ). We need very strong dependence between θ and φ.
114 Bayesian kernel model. The result of the DP prior: f̂_n(x) = Σ_{i=1}^{n_p+n_m} w_i K(x, x_i). This is the same form as in Belkin and Niyogi, but without minimizing [L(f, data) + λ₁‖f‖²_K + λ₂‖f‖²_I] over f ∈ H_K.
115 Bayesian kernel model. The result of the DP prior: f̂_n(x) = Σ_{i=1}^{n_p+n_m} w_i K(x, x_i). 1 θ = F(·), so that p(x | θ)dx = dF(x): the parameter is the full distribution function itself. 2 p(y | x, φ) depends intimately on θ = F; in fact θ appears in φ in this case, and the dependence of θ and φ is central to the model.
117–119 Simulated data, semi-supervised [figures]
120 Cancer classification, semi-supervised [figure: average (hard) error rate vs. percentage of labelled examples, with and without unlabelled data]
121 MNIST digits, feature selection [figure]
122 MNIST digits, feature selection [figure: with and without variable selection]
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationNonparametric Drift Estimation for Stochastic Differential Equations
Nonparametric Drift Estimation for Stochastic Differential Equations Gareth Roberts 1 Department of Statistics University of Warwick Brazilian Bayesian meeting, March 2010 Joint work with O. Papaspiliopoulos,
More information9.520: Class 20. Bayesian Interpretations. Tomaso Poggio and Sayan Mukherjee
9.520: Class 20 Bayesian Interpretations Tomaso Poggio and Sayan Mukherjee Plan Bayesian interpretation of Regularization Bayesian interpretation of the regularizer Bayesian interpretation of quadratic
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 305 Part VII
More informationBeyond the Point Cloud: From Transductive to Semi-Supervised Learning
Beyond the Point Cloud: From Transductive to Semi-Supervised Learning Vikas Sindhwani, Partha Niyogi, Mikhail Belkin Andrew B. Goldberg goldberg@cs.wisc.edu Department of Computer Sciences University of
More informationDiffeomorphic Warping. Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel)
Diffeomorphic Warping Ben Recht August 17, 2006 Joint work with Ali Rahimi (Intel) What Manifold Learning Isn t Common features of Manifold Learning Algorithms: 1-1 charting Dense sampling Geometric Assumptions
More informationLearning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking p. 1/31
Learning from Labeled and Unlabeled Data: Semi-supervised Learning and Ranking Dengyong Zhou zhou@tuebingen.mpg.de Dept. Schölkopf, Max Planck Institute for Biological Cybernetics, Germany Learning from
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More information3. Some tools for the analysis of sequential strategies based on a Gaussian process prior
3. Some tools for the analysis of sequential strategies based on a Gaussian process prior E. Vazquez Computer experiments June 21-22, 2010, Paris 21 / 34 Function approximation with a Gaussian prior Aim:
More informationA Least Squares Formulation for Canonical Correlation Analysis
A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation
More informationRepresentations of Gaussian measures that are equivalent to Wiener measure
Representations of Gaussian measures that are equivalent to Wiener measure Patrick Cheridito Departement für Mathematik, ETHZ, 89 Zürich, Switzerland. E-mail: dito@math.ethz.ch Summary. We summarize results
More informationKernel-Based Contrast Functions for Sufficient Dimension Reduction
Kernel-Based Contrast Functions for Sufficient Dimension Reduction Michael I. Jordan Departments of Statistics and EECS University of California, Berkeley Joint work with Kenji Fukumizu and Francis Bach
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationReproducing Kernel Hilbert Spaces Class 03, 15 February 2006 Andrea Caponnetto
Reproducing Kernel Hilbert Spaces 9.520 Class 03, 15 February 2006 Andrea Caponnetto About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert
More informationELEMENTS OF PROBABILITY THEORY
ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationMachine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling
Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationConvergence of Square Root Ensemble Kalman Filters in the Large Ensemble Limit
Convergence of Square Root Ensemble Kalman Filters in the Large Ensemble Limit Evan Kwiatkowski, Jan Mandel University of Colorado Denver December 11, 2014 OUTLINE 2 Data Assimilation Bayesian Estimation
More informationData dependent operators for the spatial-spectral fusion problem
Data dependent operators for the spatial-spectral fusion problem Wien, December 3, 2012 Joint work with: University of Maryland: J. J. Benedetto, J. A. Dobrosotskaya, T. Doster, K. W. Duke, M. Ehler, A.
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationLearning on Graphs and Manifolds. CMPSCI 689 Sridhar Mahadevan U.Mass Amherst
Learning on Graphs and Manifolds CMPSCI 689 Sridhar Mahadevan U.Mass Amherst Outline Manifold learning is a relatively new area of machine learning (2000-now). Main idea Model the underlying geometry of
More informationNonparametric Bayesian Methods
Nonparametric Bayesian Methods Debdeep Pati Florida State University October 2, 2014 Large spatial datasets (Problem of big n) Large observational and computer-generated datasets: Often have spatial and
More informationTUM 2016 Class 1 Statistical learning theory
TUM 2016 Class 1 Statistical learning theory Lorenzo Rosasco UNIGE-MIT-IIT July 25, 2016 Machine learning applications Texts Images Data: (x 1, y 1 ),..., (x n, y n ) Note: x i s huge dimensional! All
More informationManifold Regularization
9.520: Statistical Learning Theory and Applications arch 3rd, 200 anifold Regularization Lecturer: Lorenzo Rosasco Scribe: Hooyoung Chung Introduction In this lecture we introduce a class of learning algorithms,
More informationGeometry on Probability Spaces
Geometry on Probability Spaces Steve Smale Toyota Technological Institute at Chicago 427 East 60th Street, Chicago, IL 60637, USA E-mail: smale@math.berkeley.edu Ding-Xuan Zhou Department of Mathematics,
More informationBayesian inverse problems with Laplacian noise
Bayesian inverse problems with Laplacian noise Remo Kretschmann Faculty of Mathematics, University of Duisburg-Essen Applied Inverse Problems 2017, M27 Hangzhou, 1 June 2017 1 / 33 Outline 1 Inverse heat
More informationMCMC Sampling for Bayesian Inference using L1-type Priors
MÜNSTER MCMC Sampling for Bayesian Inference using L1-type Priors (what I do whenever the ill-posedness of EEG/MEG is just not frustrating enough!) AG Imaging Seminar Felix Lucka 26.06.2012 , MÜNSTER Sampling
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationLévy Processes and Infinitely Divisible Measures in the Dual of afebruary Nuclear2017 Space 1 / 32
Lévy Processes and Infinitely Divisible Measures in the Dual of a Nuclear Space David Applebaum School of Mathematics and Statistics, University of Sheffield, UK Talk at "Workshop on Infinite Dimensional
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationCOMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017
COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS
More informationApril 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning
for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions
More informationSlide05 Haykin Chapter 5: Radial-Basis Function Networks
Slide5 Haykin Chapter 5: Radial-Basis Function Networks CPSC 636-6 Instructor: Yoonsuck Choe Spring Learning in MLP Supervised learning in multilayer perceptrons: Recursive technique of stochastic approximation,
More information10-701/ Recitation : Kernels
10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer
More informationUnsupervised Learning Techniques Class 07, 1 March 2006 Andrea Caponnetto
Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto About this class Goal To introduce some methods for unsupervised learning: Gaussian Mixtures, K-Means, ISOMAP, HLLE, Laplacian
More informationBayesian Regularization
Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors
More informationBrownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539
Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory
More informationHow to learn from very few examples?
How to learn from very few examples? Dengyong Zhou Department of Empirical Inference Max Planck Institute for Biological Cybernetics Spemannstr. 38, 72076 Tuebingen, Germany Outline Introduction Part A
More informationFrom graph to manifold Laplacian: The convergence rate
Appl. Comput. Harmon. Anal. 2 (2006) 28 34 www.elsevier.com/locate/acha Letter to the Editor From graph to manifold Laplacian: The convergence rate A. Singer Department of athematics, Yale University,
More informationLabor-Supply Shifts and Economic Fluctuations. Technical Appendix
Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January
More informationPoint spread function reconstruction from the image of a sharp edge
DOE/NV/5946--49 Point spread function reconstruction from the image of a sharp edge John Bardsley, Kevin Joyce, Aaron Luttman The University of Montana National Security Technologies LLC Montana Uncertainty
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationStein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm
Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm Qiang Liu and Dilin Wang NIPS 2016 Discussion by Yunchen Pu March 17, 2017 March 17, 2017 1 / 8 Introduction Let x R d
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov
More informationSpatial Statistics with Image Analysis. Outline. A Statistical Approach. Johan Lindström 1. Lund October 6, 2016
Spatial Statistics Spatial Examples More Spatial Statistics with Image Analysis Johan Lindström 1 1 Mathematical Statistics Centre for Mathematical Sciences Lund University Lund October 6, 2016 Johan Lindström
More informationThe Lévy-Itô decomposition and the Lévy-Khintchine formula in31 themarch dual of 2014 a nuclear 1 space. / 20
The Lévy-Itô decomposition and the Lévy-Khintchine formula in the dual of a nuclear space. Christian Fonseca-Mora School of Mathematics and Statistics, University of Sheffield, UK Talk at "Stochastic Processes
More informationLearning Eigenfunctions: Links with Spectral Clustering and Kernel PCA
Learning Eigenfunctions: Links with Spectral Clustering and Kernel PCA Yoshua Bengio Pascal Vincent Jean-François Paiement University of Montreal April 2, Snowbird Learning 2003 Learning Modal Structures
More informationPattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods
Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter
More informationHilbert Space Methods for Reduced-Rank Gaussian Process Regression
Hilbert Space Methods for Reduced-Rank Gaussian Process Regression Arno Solin and Simo Särkkä Aalto University, Finland Workshop on Gaussian Process Approximation Copenhagen, Denmark, May 2015 Solin &
More informationSpectral Regularization
Spectral Regularization Lorenzo Rosasco 9.520 Class 07 February 27, 2008 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,
More informationOutline Lecture 2 2(32)
Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationComputer Vision Group Prof. Daniel Cremers. 3. Regression
Prof. Daniel Cremers 3. Regression Categories of Learning (Rep.) Learnin g Unsupervise d Learning Clustering, density estimation Supervised Learning learning from a training data set, inference on the
More informationKernels A Machine Learning Overview
Kernels A Machine Learning Overview S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola, Stéphane Canu, Mike Jordan and Peter
More informationClustering using Mixture Models
Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior
More informationCSci 8980: Advanced Topics in Graphical Models Gaussian Processes
CSci 8980: Advanced Topics in Graphical Models Gaussian Processes Instructor: Arindam Banerjee November 15, 2007 Gaussian Processes Outline Gaussian Processes Outline Parametric Bayesian Regression Gaussian
More informationECE 636: Systems identification
ECE 636: Systems identification Lectures 3 4 Random variables/signals (continued) Random/stochastic vectors Random signals and linear systems Random signals in the frequency domain υ ε x S z + y Experimental
More informationI. Bayesian econometrics
I. Bayesian econometrics A. Introduction B. Bayesian inference in the univariate regression model C. Statistical decision theory D. Large sample results E. Diffuse priors F. Numerical Bayesian methods
More informationManifold Regularization
Manifold Regularization Vikas Sindhwani Department of Computer Science University of Chicago Joint Work with Mikhail Belkin and Partha Niyogi TTI-C Talk September 14, 24 p.1 The Problem of Learning is
More informationOnline Gradient Descent Learning Algorithms
DISI, Genova, December 2006 Online Gradient Descent Learning Algorithms Yiming Ying (joint work with Massimiliano Pontil) Department of Computer Science, University College London Introduction Outline
More informationModern Discrete Probability Spectral Techniques
Modern Discrete Probability VI - Spectral Techniques Background Sébastien Roch UW Madison Mathematics December 22, 2014 1 Review 2 3 4 Mixing time I Theorem (Convergence to stationarity) Consider a finite
More informationRobust MCMC Sampling with Non-Gaussian and Hierarchical Priors
Division of Engineering & Applied Science Robust MCMC Sampling with Non-Gaussian and Hierarchical Priors IPAM, UCLA, November 14, 2017 Matt Dunlop Victor Chen (Caltech) Omiros Papaspiliopoulos (ICREA,
More informationJoint distribution optimal transportation for domain adaptation
Joint distribution optimal transportation for domain adaptation Changhuang Wan Mechanical and Aerospace Engineering Department The Ohio State University March 8 th, 2018 Joint distribution optimal transportation
More informationBayesian Nonparametrics
Bayesian Nonparametrics Peter Orbanz Columbia University PARAMETERS AND PATTERNS Parameters P(X θ) = Probability[data pattern] 3 2 1 0 1 2 3 5 0 5 Inference idea data = underlying pattern + independent
More informationReproducing Kernel Hilbert Spaces
9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we
More informationRegularization via Spectral Filtering
Regularization via Spectral Filtering Lorenzo Rosasco MIT, 9.520 Class 7 About this class Goal To discuss how a class of regularization methods originally designed for solving ill-posed inverse problems,
More informationMarch 13, Paper: R.R. Coifman, S. Lafon, Diffusion maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University
Kernels March 13, 2008 Paper: R.R. Coifman, S. Lafon, maps ([Coifman06]) Seminar: Learning with Graphs, Prof. Hein, Saarland University Kernels Figure: Example Application from [LafonWWW] meaningful geometric
More informationUnsupervised dimensionality reduction
Unsupervised dimensionality reduction Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 Guillaume Obozinski Unsupervised dimensionality reduction 1/30 Outline 1 PCA 2 Kernel PCA 3 Multidimensional
More information