Sparse grids and optimisation
1 Sparse grids and optimisation
Lecture 1: Challenges and structure of multidimensional problems
Markus Hegland (1), ANU
July, MATRIX workshop on approximation and optimisation
(1) supported by ARC DP and LP and by Technical University of Munich and DFG
1 / 64

2 Outline
1 HPC: Tackling the multi challenge; The new computational science
2 Examples of multidimensional problems: Interpolation; Density estimation; Partial differential equations
3 Sparse Grids: Grids; The combination technique
4 Data distributions: Bayes and MAP; Minimising KL divergence
5 Conclusions

4 Challenges of numerical analysis
- numerical techniques are a major driver of innovation in industrial societies and are indispensable for the design of aeroplanes, weather forecasting, environmental monitoring, medical diagnostics, robotics and image processing
- the fundamental techniques and their foundations are well established
- techniques include finite elements, finite differences and finite volumes; they are widely available but do not work for
  - ill-posed problems
  - high-dimensional problems
- in both cases one requires specially adapted techniques; the theory and practice of these techniques are an active area of research
- recent developments in High Performance Computing (HPC) and new algorithms allow the solution of new multidimensional problems but introduce new challenges

5 Computational challenges in HPC: it's getting tougher
- from simulation to optimisation
- from assumed parameters to estimation and identification
- user interaction
- energy costs
- combining data and PDE models
- computational faults
- multiphysics

6 Examples: PDEs, data, parameters
PDEs: $u = \operatorname{argmin}_{v \in V} J(v)$
- elliptic PDEs: $J(v) = \frac{1}{2} a(v, v) - f(v)$
- least squares solution: $J(v) = \int (Lv(x) - f(x))^2 \, dx$
- eigenvalues: $J(v) = a(v, v) / b(v, v)$ (Rayleigh quotient)
fitting data: $u = \operatorname{argmin}_{v \in V} L(v)$
- penalised least squares: $L(v) = \frac{1}{N} \sum_{i=1}^{N} (v(x_i) - y_i)^2 + a(v, v)$
- MAP for density $p(x) = \exp u(x)$: $L(v) = -\frac{1}{N} \sum_{i=1}^{N} v(x_i) + \log \int \exp(v(x)) \, dx + a(v, v)$
parametric problems combine PDEs and data fitting
- $u = \operatorname{argmin} \{ L(u(\mu); \mu) \mid v = u(\mu),\ \mu \in M \}$ with PDE constraint $u(\mu) = \operatorname{argmin}_v J(v; \mu)$
quantities of interest $q = s(u)$
- target of approximation, e.g. energy, moments, likelihood, cost, risk

7 Integrating multiplicities
HPC tackles a multi-challenge:
- multi-disciplinary domains and education
- multi-physics models
- multi-scale models
- multi-dimensional numerics
- multi-level numerics
- multi-core systems
The prevailing paradigm in modern computational science and HPC combines multiple resources and approaches with a wide range of different properties to gain new insights into immensely complex systems in the natural, engineering and social sciences. This reflects the multi-skilled and multi-cultural societies in which modern science is developed.

9 Interpolation
- evaluation of the function is expensive: compute some values, interpolate the rest
- logarithm tables by H. Briggs, 1617
- properties of the piecewise linear interpolant: fast evaluation, reasonable accuracy, stable, positive
[Figure: logarithm table (from: Wikipedia)]
[Figure: piecewise linear interpolant of log(x) against x]

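10 A more accurate and flexible approach
$f_I(x) = c_1 \phi(\|x - x_1\|) + \cdots + c_m \phi(\|x - x_m\|)$
interpolation function $\phi$:
- piecewise linear
- Gaussian: $\phi(r) = e^{-r^2/\gamma}$
[Figure: $\phi(r)$ against $r$ for the piecewise linear and the Gaussian case]
interpolation equations:
$$\begin{pmatrix} \phi(0) & \phi(\|x_1 - x_2\|) & \cdots & \phi(\|x_1 - x_m\|) \\ \phi(\|x_2 - x_1\|) & \phi(0) & \cdots & \phi(\|x_2 - x_m\|) \\ \vdots & \vdots & \ddots & \vdots \\ \phi(\|x_m - x_1\|) & \phi(\|x_m - x_2\|) & \cdots & \phi(0) \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix} = \begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_m) \end{pmatrix}$$
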
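
The interpolation equations can be sketched in a few lines of Python. This is a minimal one-dimensional illustration, not the lecture's implementation: the node positions, the width `gamma = 0.5`, the test function `log` and the helper names `solve` and `rbf_interpolant` are all assumptions chosen for the example.

```python
import math

def gaussian_phi(r, gamma=0.5):
    """Gaussian radial function phi(r) = exp(-r^2 / gamma)."""
    return math.exp(-r * r / gamma)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def rbf_interpolant(points, values, gamma=0.5):
    """Return f_I(x) = sum_i c_i phi(|x - x_i|) matching values at points."""
    A = [[gaussian_phi(abs(xi - xj), gamma) for xj in points] for xi in points]
    c = solve(A, values)
    return lambda x: sum(ci * gaussian_phi(abs(x - xi), gamma)
                         for ci, xi in zip(c, points))

# Hypothetical data: five samples of log(x), echoing the logarithm-table example.
nodes = [1.0, 1.5, 2.0, 2.5, 3.0]
f_I = rbf_interpolant(nodes, [math.log(x) for x in nodes])
```

By construction `f_I` reproduces the data at the nodes; the accuracy between nodes depends on `gamma` and on the node spacing.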
11 Multidimensional interpolation
- use $\phi(\|x - x_i\|)$ where $\|x - x_i\|$ is the Euclidean distance between $x$ and $x_i$
- $x_i$ on a regular grid: error $O(m^{-2/d})$
- random points $x_i$: error $O(m^{-1/2})$
- up to $d = 4$ dimensions and for smooth functions the regular grid is competitive
- for higher dimensions random interpolation points are better
- the theory for random points uses the law of large numbers

12 The concentration of measure
- in high dimensions any two random points have nearly the same distance
- consequently the interpolant is close to constant with high probability
- example: the interpolation matrix $[\phi(\|x_i - x_j\|)]_{i,j=1,\ldots,n}$ in high dimension $d$
[Matrix: numerical entries and the value of $d$ not preserved]
other instances of concentration of measure:
- most of the volume of a sphere (Earth) is close to the surface
- law of large numbers, statistical convergence theory
[Lévy, 20s, Milman 70s, Gromov, Talagrand 90s+]

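
The concentration of distances is easy to observe numerically. The sketch below is an illustration under assumptions not stated on the slide: points are drawn uniformly from the unit cube, 30 points are used, and `relative_spread` is a hypothetical helper that reports the standard deviation of the pairwise distances divided by their mean.

```python
import math
import random
import statistics

def relative_spread(d, m=30, seed=0):
    """Std/mean of pairwise Euclidean distances of m uniform points in [0,1]^d."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(m)]
    dists = []
    for i in range(m):
        for j in range(i + 1, m):
            sq = sum((a - b) ** 2 for a, b in zip(pts[i], pts[j]))
            dists.append(math.sqrt(sq))
    return statistics.pstdev(dists) / statistics.mean(dists)

spread_low = relative_spread(2)      # distances vary a lot in two dimensions
spread_high = relative_spread(1000)  # distances are nearly equal in 1000 dimensions
```

For uniform points the relative spread decays roughly like $1/\sqrt{d}$, so in high dimensions every off-diagonal entry $\phi(\|x_i - x_j\|)$ of the interpolation matrix is nearly the same number.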
13 When concentration is not a problem for interpolation
- when the points of interest are on a low-dimensional sub-manifold
- when the function to be interpolated has a known simple structure, e.g., is linear or additive, $f(x_1, \ldots, x_d) = \sum_{i=1}^{d} f_i(x_i)$, or is close to such a function
- when the function only depends on a few dimensions: $f(x_1, \ldots, x_d) = g(x_1, x_2, x_3)$
- dimension is not only a curse: in high dimensions any non-empty neighbourhood contains large numbers of points which can be used for error reduction by averaging [Anderssen, H. 1999]

15 Density estimation
- large data sets, queries expensive
- data set = probability measure over feature space
- histogram = piecewise constant approximation of the measure
- extract relevant information fast from the histogram
applications:
- mean, variance, moments
- number and location of modes
- skewness and tail behaviour
"All modern theories of statistical inference take as their starting point the idea of the probability density function of the observations." E. Parzen (1961) in An Approach to Time Series Analysis
[Figure: histogram, Count against Age]

16 A more accurate and flexible approach
$p_K(x) = \frac{\phi(\|x - x_1\|/\sigma)}{m\sigma} + \cdots + \frac{\phi(\|x - x_m\|/\sigma)}{m\sigma}$
kernel density estimator with kernel $\phi$:
- piecewise constant
- Gaussian: $\phi(r) = e^{-r^2/2} / \sqrt{2\pi}$
[Figure: $\phi(r)$ against $r$ for the piecewise constant and the Gaussian kernel]
- more accurate representation of smooth densities
- control smoothness with the width parameter $\sigma$, which can depend on $x$
- no need to solve a linear system of equations
- more flexible: also for multidimensional distributions

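
A minimal sketch of the kernel density estimator with the Gaussian kernel, in one dimension where the normalisation $1/(m\sigma)$ applies directly. The sample values and the width `sigma = 0.5` are made up for illustration.

```python
import math

def gaussian_kernel(r):
    """Standard Gaussian kernel phi(r) = exp(-r^2 / 2) / sqrt(2 pi)."""
    return math.exp(-r * r / 2.0) / math.sqrt(2.0 * math.pi)

def kde(x, data, sigma):
    """Kernel density estimate p_K(x) = sum_i phi(|x - x_i| / sigma) / (m sigma)."""
    m = len(data)
    return sum(gaussian_kernel(abs(x - xi) / sigma) for xi in data) / (m * sigma)

# Hypothetical one-dimensional sample.
data = [1.2, 1.9, 2.1, 2.4, 3.3, 4.8]
sigma = 0.5

# Riemann sum over a wide interval as a sanity check that p_K integrates to one.
step = 0.01
total_mass = sum(kde(-10.0 + k * step, data, sigma) * step for k in range(3000))
```

No linear system is solved here, in contrast to interpolation: every data point simply contributes one kernel bump.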
17 Challenges of multidimensional density estimation
- low dimensional case: for any $x$ only the $\phi(\|x - x_i\|/\sigma)$ for neighbouring $x_i$ are nonzero, which gives an efficient estimator as every point has only few neighbours:
  $p_K(x) = \sum_{x_i \in N(x)} \phi(\|x - x_i\|/\sigma) / (m\sigma)$
- high dimensional case: all $x_i$ are neighbours, so all data points need to be considered to evaluate the density:
  $p_K(x) = \sum_{i=1}^{m} \phi(\|x - x_i\|/\sigma) / (m\sigma)$
- very high dimensional case: if $x, x_1, \ldots, x_m$ are i.i.d. then all components $\phi(\|x - x_i\|/\sigma)/(m\sigma)$ are of the same size and the density is asymptotically uniform:
  $p_K(x) \approx E(\phi(\|X_i - X_j\|/\sigma)) / \sigma$

18 When concentration is not a problem for density estimation
- when the points of interest are on a low-dimensional submanifold
- when the unknown $p$ has a known simple structure, e.g. $p(x_1, \ldots, x_d) = \prod_{i=1}^{d} p_i(x_i)$
- more generally, when the density is described by a graphical model, which leads to a factorisation as in $p(x_1, x_2, x_3) = \frac{p(x_1, x_2)\, p(x_2, x_3)}{p(x_2)}$
- mixture model: $p(x) = \sum_{i=1}^{K} p_i(x) \pi_i$ where $p_i(x) = p(x \mid x \in \Omega_i)$ has some known form and $\pi_i = p(\Omega_i)$

20 Partial differential equations
Partial differential equations are a very widely used tool in computational science.
examples of equations:
- Schrödinger equation, quantum chemistry: $i\hbar \frac{\partial \psi}{\partial t} = -\frac{\hbar^2}{2m} \Delta \psi + V \psi$
- Chemical master equation, molecular biology: $\frac{dp}{dt} = \sum_z (S_z - I) \Lambda_z p$
- Vlasov equation, plasma physics: $\frac{\partial f}{\partial t} + v^T \nabla_x f + q (E + v \times B)^T \nabla_p f = 0$
dimensionality:
- $\psi$ and $p$ can depend on hundreds of variables, $f$ depends on five variables

21 Controlling the function values
Sobolev norms are an important tool for PDE theory:
- $\|u\|_k^2 = (-1)^k \int_\Omega u(x) \, \Delta^k u(x) \, dx$ for $u \in C_0^\infty(\Omega)$ and completion
- embedding for $k > d/2$: $|u(x)| \le C \|u\|_k$
- $k = 2$ from regularity theory
bounded solutions for $d \le 3$:
- PDE regularity theory: $\|u\|_2 < \infty$
- Sobolev embedding: $|u(x)| \le C \|u\|_2$
case $d > 3$:
- mixed norms: $\|u\|_{\mathrm{mix}}^2 = \int \left| \frac{\partial^d u(x)}{\partial x_1 \cdots \partial x_d} \right|^2 dx$
- and so $|u(x)| \le C_d \|u\|_{\mathrm{mix}}$

23 The grid
the challenge: curse of dimension
- approximate an unknown function $u(x, y)$
- compute only values $u(x_i, y_j)$ on discrete grid points
- interpolate values $u(x, y)$ for other points $(x, y)$
- regular isotropic grid: $x_i = ih$ and $y_j = jh$
In two dimensions there are $1/h^2$ grid points and in $d$ dimensions $1/h^d$ grid points, but the accuracy is only proportional to $h^2$.

24 Anisotropic grids
more general regular grids:
- choose a fine grid when $u(x, y)$ has large gradients
- choose a coarse grid when $u(x, y)$ is smooth
- gradients may be different in different directions
- choose an anisotropic grid when $u(x, y)$ varies differently in different directions
- with anisotropic grids one can approximate a multi-dimensional $u(x_1, \ldots, x_d)$ if $u$ is very smooth in most $x_k$

25 Grids and sampling
- the full grid captures all scales, a subgrid captures fewer scales
- evaluation of $u(x, y)$ on the grid corresponds to sampling $u$ at the grid points
- sampling on a fine grid captures high frequencies, i.e. small scale fluctuations (Nyquist/Shannon)
- with anisotropic grids one can capture small scales in one dimension and different scales in another

26 Sparse grid = union of regular anisotropic grids
[Figure: a simple sparse grid]
[Figure: the sparse grid in frequency / scale space]
- captures fine scales in both dimensions but not joint fine scales

27 Another sparse grid
[Figure: sparse grid points]
[Figure: sparse grid frequency diagram]
- the frequency diagram displays 1/4 of a hyperbolic cross

28 Sparse grids and the curse of dimension
[Figure: error against number of grid points for the isotropic grid and the sparse grid, four dimensional case]
- only asymptotic error rates are given here
- constants and preasymptotics also depend on the dimension
- practical experience: with sparse grids up to 10 dimensions
- Zenger 1991
asymptotic rates:
- regular isotropic grids: number of points $h^{-d}$, error $h^2$
- sparse grids: number of points $h^{-1} |\log_2 h|^{d-1}$, error $h^2 |\log_2 h|^{d-1}$

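
The difference in the number of points can be made concrete by counting. The sketch below relies on a counting convention that is an assumption of this example, not something stated on the slide: interior points only, mesh width $h = 2^{-n}$, and a sparse grid built from hierarchical increments with level vectors $l$ satisfying $l_1 + \cdots + l_d \le n + d - 1$, where level $l_k$ contributes $2^{l_k - 1}$ new points in direction $k$.

```python
from itertools import product

def full_grid_points(n, d):
    """Interior points of the isotropic grid with h = 2^-n in d dimensions."""
    return (2 ** n - 1) ** d

def sparse_grid_points(n, d):
    """Interior points of the level-n sparse grid in d dimensions."""
    total = 0
    for levels in product(range(1, n + 1), repeat=d):
        if sum(levels) <= n + d - 1:
            count = 1
            for l in levels:
                count *= 2 ** (l - 1)  # new points introduced at level l
            total += count
    return total

# Rows (n, full grid size, sparse grid size) for the four dimensional case.
table_d4 = [(n, full_grid_points(n, 4), sparse_grid_points(n, 4))
            for n in range(1, 7)]
```

In one dimension both counts agree, which matches the aim of recovering the $d = 1$ behaviour up to logarithmic factors.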
29 The big plan: dimension independence
problem of sparse grids: exponential $d$-dependence of time and error through
- factors $|\log_2(h)|^{d-1}$
- factors of the form $C^d$
aim: remove all exponential $d$-dependence so that
- error $\sim h^2$
- time $\sim 1/h$
as in the case $d = 1$
ideas:
- parallel solution on subgrids (see next section) gives $1/h$ time
- stronger (energy) sparse grids give $h^2$ error
- weighted mixed norms and special basis functions to deal with the $C^d$ dependence

30 Spatially adaptive (sparse) grids
choosing a sparse sub-grid of the sparse grid:
- adaptively choose the necessary sparse grid points and the corresponding (hierarchical) basis functions
- requires an error indicator function
- grid points are inserted only where necessary
- acts as extra regularisation (like smoothing) for machine learning applications
- modified basis functions for the boundary to remove the $C^d$ factor
- implemented in the SG++ software package by Dirk Pflüger (Universität Stuttgart)

32 Combining two regular grids
[Figure: a combined grid shown as the sum of two regular grids; union and intersection grids]
- the solution on the combined grid is approximated as a linear combination of the solutions on the regular component grids
- the components include the maximal generators and all intersections

33 Weak solutions of boundary value problems
boundary value problem:
- $-\Delta u(x) = f(x)$, $x \in \Omega$
- $u(x) = 0$, $x \in \partial\Omega$
weak solution:
- $a(u, v) = \langle f, v \rangle$ for all $v \in H_0^1(\Omega)$
- where $a(u, v) = \int_\Omega \nabla u(x)^T \nabla v(x) \, dx$ and $\langle f, v \rangle = \int_\Omega f(x) v(x) \, dx$
approximate solution $u_h \in V_h$:
- $a(u_h, v_h) = \langle f, v_h \rangle$ for all $v_h \in V_h$
The approximate solution can be viewed as a projection $u_h = P_h u$ which is orthogonal with respect to the energy inner product $a(\cdot, \cdot)$.

34 Combination approximations
regular grid approximation:
- regular grid $G_h$, function space $V_h$
- Galerkin equations for $u_h$: $a(u_h, v_h) = \langle f, v_h \rangle$ for all $v_h \in V_h$
sparse grid approximation:
- sparse grid $G_{SG} = \bigcup_h G_h$, function space $V_{SG} = \sum_h V_h$
- Galerkin equations for $u_{SG}$: $a(u_{SG}, v_{SG}) = \langle f, v_{SG} \rangle$ for all $v_{SG} \in V_{SG}$
combination technique (where HPC comes in):
- compute all $u_h$ in parallel and combine the solutions using a parallel reduction: $u_C = \sum_h c_h u_h$
Big question: when is $u_C \approx u_{SG}$?

35 Sparse grid combination technique
[Figure: sparse grid points]
[Figure: sparse grid frequency diagram]
combination formula:
$u_C = u_{1,16} + u_{2,8} + u_{4,4} + u_{8,2} + u_{16,1} - u_{1,8} - u_{2,4} - u_{4,2} - u_{8,1}$

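
The combination formula can be tried out on a simpler problem than a PDE. In the sketch below each component $u_{a,b}$ is taken to be the piecewise bilinear interpolant of a given function on a grid with $a \times b$ cells on the unit square; this substitution of interpolation for the Galerkin solution, as well as the helper names `bilinear` and `combination`, are assumptions made for illustration.

```python
import math

def bilinear(f, nx, ny):
    """Piecewise bilinear interpolant of f on a grid with nx x ny cells on [0,1]^2."""
    hx, hy = 1.0 / nx, 1.0 / ny
    def u(x, y):
        i = min(int(x * nx), nx - 1)   # cell index in x
        j = min(int(y * ny), ny - 1)   # cell index in y
        wx = x * nx - i                # local coordinate in [0, 1]
        wy = y * ny - j
        return ((1 - wx) * (1 - wy) * f(i * hx, j * hy)
                + wx * (1 - wy) * f((i + 1) * hx, j * hy)
                + (1 - wx) * wy * f(i * hx, (j + 1) * hy)
                + wx * wy * f((i + 1) * hx, (j + 1) * hy))
    return u

def combination(f, n):
    """u_C = sum_{i+j=n} u_{2^i,2^j} - sum_{i+j=n-1} u_{2^i,2^j}."""
    plus = [bilinear(f, 2 ** i, 2 ** (n - i)) for i in range(n + 1)]
    minus = [bilinear(f, 2 ** i, 2 ** (n - 1 - i)) for i in range(n)]
    return lambda x, y: (sum(u(x, y) for u in plus)
                         - sum(u(x, y) for u in minus))

# n = 4 uses exactly the nine component grids of the combination formula.
additive = lambda x, y: x * x + math.sin(y)
smooth = lambda x, y: math.sin(math.pi * x) * math.exp(y)
u_C_additive = combination(additive, 4)
u_C_smooth = combination(smooth, 4)
full_additive = bilinear(additive, 16, 16)
```

For an additive function the combination reproduces the full 16 x 16 interpolant, and for a general function it reproduces the function values at sparse grid points such as (1/8, 1/2).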
36 Inclusion / exclusion principle
- in combinatorics, for the cardinality of the set $A \cup B$: $|A \cup B| = |A| + |B| - |A \cap B|$
- more generally, for additive $\alpha$: $\alpha(A \cup B) = \alpha(A) + \alpha(B) - \alpha(A \cap B)$
Theorem (de Moivre): If $A_1, \ldots, A_m$ form an intersection structure then
$\alpha\left( \bigcup_{i=1}^{m} A_i \right) = \sum_{i=1}^{m} c_i \, \alpha(A_i)$
for some $c_i \in \mathbb{Z}$.

37 When the combination approximation is the sparse grid solution
Lemma:
- if the grids $G_h$ and the spaces $V_h$ form an intersection structure, and
- if the Galerkin projections $P_h$ commute, i.e., $P_h P_{h'} = P_{h'} P_h$ for all $h, h'$,
then $u_C = u_{SG}$, i.e., the combination technique provides the sparse grid solution.
Proof. This is a consequence of the inclusion-exclusion principle, as it follows from the commutativity that $P_h$ is additive.

38 Tensor products: the classical sparse grid
- hierarchy of function spaces of one variable: $V_1 \subset V_2 \subset \cdots \subset V_m \subset V$
- classical sparse grid space: $V_{SG} = \sum_{i+j=n} V_i \otimes V_j$
- tensor product function space $V_i \otimes V_j$: space of functions generated by products $(u_i \otimes u_j)(x_1, x_2) = u_i(x_1) \, u_j(x_2)$ where $u_i \in V_i$ and $u_j \in V_j$
- the $V_i \otimes V_j$ form an intersection structure as $(V_{i_1} \otimes V_{j_1}) \cap (V_{i_2} \otimes V_{j_2}) = V_{\min(i_1, i_2)} \otimes V_{\min(j_1, j_2)}$
- combination coefficients: $c_{ij} = 1$ if $i + j = n$, $c_{ij} = -1$ if $i + j = n - 1$, and $c_{ij} = 0$ else
- the combination formula is exact if $a(u_1 \otimes u_2, v_1 \otimes v_2) = a(u_1, v_1) \, a(u_2, v_2)$
[Griebel, Schneider, Zenger 1992]

39 Extrapolation
assumption (error model): the error of the approximation in $V_{ij} = V_i \otimes V_j$ is of the form
$e_{ij} = e_i^{(1)} + e_j^{(2)} + r_{ij}$
which is a type of ANOVA decomposition for the error
consequence: the error of the combination technique is
$e_h = \sum_{i+j \le n} c_{ij} e_{ij} = e_n^{(1)} + e_n^{(2)} + \sum_{i+j \le n} c_{ij} r_{ij}$
- if the last term is very small then $e_h \approx e_{n,n}$, i.e., the combination technique approximation using only components in $V_i \otimes V_j$ with $i + j \le n$ achieves a similar approximation order as the one in $V_{nn}$
[Bungartz et al 1994, Pflaum and Zhou 1999, Liem, Lu, Shih 1995 (splitting extrapolation)]

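
The cancellation behind the error formula can be written out as a short worked step. It assumes the classical combination coefficients ($c_{ij} = 1$ for $i + j = n$, $c_{ij} = -1$ for $i + j = n - 1$, zero otherwise) and levels $i, j \ge 0$; the index range is an assumption of this sketch.

```latex
\begin{align*}
\sum_{i+j \le n} c_{ij}\, e_i^{(1)}
  &= \sum_{i+j = n} e_i^{(1)} - \sum_{i+j = n-1} e_i^{(1)}
   = \sum_{i=0}^{n} e_i^{(1)} - \sum_{i=0}^{n-1} e_i^{(1)}
   = e_n^{(1)}, \\
\sum_{i+j \le n} c_{ij}\, e_j^{(2)}
  &= \sum_{j=0}^{n} e_j^{(2)} - \sum_{j=0}^{n-1} e_j^{(2)}
   = e_n^{(2)}.
\end{align*}
```

All one-dimensional error terms on coarser levels cancel in pairs, so only the finest-level terms and the mixed remainders $r_{ij}$ survive.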
40 Breakdown of the combination technique
regression problem: minimise
$\frac{1}{M} \sum_{i=1}^{M} (u(x_i) - y_i)^2 + \lambda \|\nabla u\|^2$
[Figure: results with $\lambda = 10^{-4}$ (left) and $\lambda = 10^{-6}$ (right)]
- the combination approximation is not necessarily better for finer grids
[Garcke 2004, H. 2003, H., Garcke, Challis 2007]

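41 Opticom
Optimal combination technique: choose the coefficients $c_i$ such that $J$ is optimised, with
$J(c_1, \ldots, c_m) = \left\| u - \sum_{i=1}^{m} c_i u_i \right\|_E^2 = \sum_{i,j=1}^{m} c_i c_j \, a(u_i, u_j) - 2 \sum_{i=1}^{m} c_i \|u_i\|_E^2 + \|u\|_E^2$
normal equations:
$$\begin{pmatrix} \|u_1\|_E^2 & a(u_1, u_2) & \cdots & a(u_1, u_m) \\ a(u_2, u_1) & \|u_2\|_E^2 & \cdots & a(u_2, u_m) \\ \vdots & \vdots & \ddots & \vdots \\ a(u_m, u_1) & a(u_m, u_2) & \cdots & \|u_m\|_E^2 \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix} = \begin{pmatrix} \|u_1\|_E^2 \\ \|u_2\|_E^2 \\ \vdots \\ \|u_m\|_E^2 \end{pmatrix}$$
Solving the system costs at most $O(m^3)$ operations, but one also needs to determine the entries $a(u_i, u_j)$ of the matrix.
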
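
A toy instance of the normal equations may help to see what the coefficients do. In this sketch the Euclidean dot product in $\mathbb{R}^3$ stands in for the energy inner product $a(\cdot, \cdot)$ and two coordinate subspaces stand in for the component spaces; both choices, and the helper names, are assumptions for illustration only.

```python
import math

def dot(a, b):
    """Stand-in for the energy inner product a(., .)."""
    return sum(x * y for x, y in zip(a, b))

def project(u, kept_axes):
    """Orthogonal projection onto the coordinate subspace spanned by kept_axes."""
    return [x if k in kept_axes else 0.0 for k, x in enumerate(u)]

def opticom_two(u1, u2):
    """Solve the 2 x 2 normal equations by Cramer's rule."""
    a11, a12, a22 = dot(u1, u1), dot(u1, u2), dot(u2, u2)
    det = a11 * a22 - a12 * a12
    c1 = (a11 * a22 - a12 * a22) / det
    c2 = (a11 * a22 - a12 * a11) / det
    return c1, c2

def error(u, v):
    """Norm of u - v in the stand-in energy norm."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

u = [1.0, 2.0, 3.0]            # hypothetical exact solution
u1 = project(u, {0, 1})        # component solution in V_1
u2 = project(u, {1, 2})        # component solution in V_2
c1, c2 = opticom_two(u1, u2)
u_opt = [c1 * a + c2 * b for a, b in zip(u1, u2)]
err_opt, err1, err2 = error(u, u_opt), error(u, u1), error(u, u2)
```

Here the right-hand side uses $a(u, u_i) = \|u_i\|_E^2$, which holds because each $u_i$ is the orthogonal projection of $u$; the optimised combination has an error of 6/7, below the error 1 of the better component.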
42 Opticom is better than the sub-grid solutions
Let $V_{SG} = \sum_{i=1}^{n} V_i \subset V$, let $a(\cdot, \cdot)$ be a $V$-elliptic and bounded symmetric bilinear form, let $u_i \in V_i$ be defined by $a(u_i, v_i) = a(u, v_i)$ for all $v_i \in V_i$, and let $c_i$ be the Opticom coefficients. Then the error in the energy norm satisfies
$\|u - u_{SG}\|_E \le \left\| u - \sum_{i=1}^{n} c_i u_i \right\|_E \le \min_{i=1,\ldots,n} \|u - u_i\|_E$
The standard combination technique does not have this property.

43 The optimality of Opticom
Proposition: Let $V_{SG} = \sum_{i=1}^{n} V_i \subset V$, let $a(\cdot, \cdot)$ be a $V$-elliptic and bounded bilinear form, let $u_i \in V_i$ be defined by $a(u_i, v_i) = a(u, v_i)$ for all $v_i \in V_i$, and let $c_i$ be the Opticom coefficients. Then for some $\kappa > 0$ one has
$\left\| u - \sum_{i=1}^{n} c_i u_i \right\|_V \le \kappa \left\| u - \sum_{i=1}^{n} \tilde{c}_i u_i \right\|_V$
for any $\tilde{c}_i \in \mathbb{R}$.
Proof. This is a direct application of Céa's Lemma.

44 Norm reduction with Opticom
Proposition: Let $V_{SG} = \sum_{i=1}^{n} V_i \subset V$, let $a(\cdot, \cdot)$ be a $V$-elliptic and bounded symmetric bilinear form, let $u_i \in V_i$ be defined by $a(u_i, v_i) = a(u, v_i)$ for all $v_i \in V_i$, and let $c_i$ be the Opticom coefficients. Then one has for the energy norm defined by $a(\cdot, \cdot)$ the bound
$\left\| u - \sum_{i=1}^{n} c_i u_i \right\|_E \le \|u\|_E$
and either $\left\| u - \sum_{i=1}^{n} c_i u_i \right\|_E < \|u\|_E$ or $u \perp V_i$ and thus $u_i = 0$ for $i = 1, \ldots, n$.

45 Proof.
$\|u\|_E^2 = \left\| u - \sum_{i=1}^{n} c_i u_i \right\|_E^2 + \left\| \sum_{i=1}^{n} c_i u_i \right\|_E^2$
If the best approximation is zero then $u$ has to be orthogonal to all $u_i$. As $u - u_i$ is orthogonal to $V_i$, it follows that all the $v \in V_i$ which are orthogonal to $u_i$ are also orthogonal to $u$, and hence that $u$ is orthogonal to $V_i$.

46 An iterative method
Opticom iterative refinement:
- $u^{(0)} = 0$
- $a(u_i^{(k+1)}, v_i) = a(u - u^{(k)}, v_i)$ for all $v_i \in V_i$
- choose $c_i^{(k+1)}$ such that $\left\| (u - u^{(k)}) - \sum_{i=1}^{n} c_i^{(k+1)} u_i^{(k+1)} \right\|_E$ is minimal
- $u^{(k+1)} = u^{(k)} + \sum_{i=1}^{n} c_i^{(k+1)} u_i^{(k+1)}$
properties:
- the algorithm converges to the sparse grid solution
- variant of parallel subspace correction [Xu 1992]
- can also be combined with Newton [Griebel, H. 2010]

47 Data distributions: the machine learning problem
given data $x_1, \ldots, x_N$ in $\mathbb{R}^d$, find a density $f(x)$ such that
$f \approx \frac{1}{N} \sum_{k=1}^{N} \delta_{x_k}$
[Tapia & Thompson 1978, Silverman 1986, Scott 1992]

48 From data smoothing to sums of simpler approximations
function approximation: RBF $\Rightarrow$ sums of products
$f(x) = \sum_{i=1}^{N} c_i \, \kappa(x - x_i) \;\Rightarrow\; f(x) = \sum_{k=1}^{K} c_k \prod_{j=1}^{d} f_{j,k}(\xi_j)$
where $x = (\xi_1, \ldots, \xi_d)$ and the $x_i$ are data points
density estimation: kernel density estimators $\Rightarrow$ mixture models
$f(x) = \frac{1}{N} \sum_{i=1}^{N} \kappa(x - x_i) \;\Rightarrow\; f(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, C_k)$
[McLachlan and Peel, 2000]

50 Bayesian inference
given data and models:
- data, e.g. $x = (x_1, \ldots, x_N)$, typically $x \in \mathbb{R}^{Nd}$
- likelihood of the data: $p(x \mid z)$
- prior of the parameters $z$: $p(z)$ models reasonable assumptions about $z$
Bayes rule (how to adapt $p(z)$ in light of the evidence):
- $p(z \mid x) = \frac{p(x \mid z) \, p(z)}{p(x)}$ where $p(x) = \int p(x \mid z) \, p(z) \, dz$
- notation: $p(x)$ is $p_X(X = x)$, $p(x \mid z)$ is $p_{X \mid Z}(X = x \mid Z = z)$, etc.
what to do with the posterior:
- expectations: $E(Y) = \iint y \, p(y \mid z) \, p(z \mid x) \, dz \, dy$
- probability distributions: $p(y) = \int p(y \mid z) \, p(z \mid x) \, dz$
- maximum: $z_{\max} = \operatorname{argmax}_z p(z \mid x) = \operatorname{argmax}_z p(x \mid z) \, p(z)$
tractability:
- the computation of $p(x)$, $p(y)$ and $E(Y)$ requires in general high-dimensional integrals, hence approximation

51 Density estimation
- data: $x = (x_1, \ldots, x_N)$ drawn randomly from some unknown probability distribution
- probability density model: $f(x_k) = p(x_k \mid u) = \exp(u(x_k) - \gamma(u))$ where $\gamma(u)$ is such that $\int p(x_k \mid u) \, dx_k = 1$
- estimation problem: for given data $x_1, \ldots, x_N$ find $\hat{u}$ such that $p(x_k \mid \hat{u})$ approximates the underlying density
- likelihood: $p(x \mid u) = \exp\left( \sum_{i=1}^{n} u(x_i) - n \, \gamma(u) \right)$; choose $\hat{u}$ such that the likelihood is large
- parametric case: maximum likelihood method
- the problem is underdetermined in the nonparametric case

52 MAP with Gaussian process priors
- prior for $u$: a Gaussian probability measure $\nu$ over a space of functions, i.e. a Gaussian process prior
- we consider the covariance $C = C_1 \otimes \cdots \otimes C_d$
- posterior based on the likelihood $\rho(u) = p(x \mid u)$: $d\mu = \rho \, d\nu$ is a well defined measure if $\rho \in L_1(\nu)$
- maximum a-posteriori (MAP) method: estimate $u$ as the mode of the posterior
- Laplace approximation of the posterior: Gaussian process with expectation $u$

53 A variational problem
characterisation of the mode $u$ of $\mu$:
$\rho(u) \ge \frac{d\lambda_v}{d\lambda}(u) \, \rho(u + v)$ for all $v \in H$, where $\lambda_v(A) = \lambda(v + A)$
this leads to the minimisation of the functional
$j(u) = \frac{1}{n} \|u\|_{CM}^2 - \frac{1}{n} \sum_{i=1}^{n} u(x_i) + \log \int_X \exp(u(x)) \, dx$
where $\|\cdot\|_{CM}$ is the Cameron-Martin norm defined by the prior
[H. 2007, Griebel, H. 2010]

54 Newton-Galerkin Opticom method
- Newton-Galerkin: $u_{n+1} = u_n + \Delta u_n$, where $\Delta u_n$ minimises $J(\Delta u) = \frac{1}{2} H_{u_n}(\Delta u, \Delta u) + (F(u_n), \Delta u)_H$
- sparse grid space: $V = \sum_j V^{(j)}$
- sparse grid combination technique: $\Delta u_n = \sum_{j=1}^{k} c_j \, \Delta u_n^{(j)}$ where the components $\Delta u_n^{(j)}$ minimise $J(\Delta u)$ over $V^{(j)}$
- Opticom method: choose the combination coefficients $c_j$ to minimise $J\left( \sum_{j=1}^{k} c_j \, \Delta u_n^{(j)} \right)$
- descent method, converges to the sparse grid solution, not to some combination approximation
- inexact Newton method [Deuflhard, Weiser 1996, Deuflhard 2004]
- alternative: nonlinear additive Schwarz [Dryja, Hackbusch 1997]
[H., Griebel 2007]

55 Errors of sparse grid approximation for the 2D case
approximation of the normal distribution: maximum likelihood projection and our estimator
[Table: by level $l$, the errors $e^{(1)}_{1,l}$, $e^{(1)}_{2,l}$, $e^{(3)}_{1,l}$, $e^{(3)}_{2,l}$ and their ratios to the corresponding level $l+1$ errors; numerical values not preserved]

56 3D density
[Figure: 3D density] [Griebel, H. 2010]

58 Two variational problems
- method: Ritz-Galerkin for $u_h$ (left); variational Bayes for $q$ (right)
- error, left: energy norm $\|u - u_h\|_E = \sqrt{a(u - u_h, u - u_h)}$
- error, right: KL-divergence $\mathrm{KL}(q \,\|\, p(\cdot \mid x)) = -\int q(z) \log\left( \frac{p(z \mid x)}{q(z)} \right) dz$
- optimisation problem, left: minimise $J(u_h) = \frac{1}{2} a(u_h, u_h) - \langle f, u_h \rangle$
- optimisation problem, right: maximise $L(q) = \int q(z) \log\left( \frac{p(x, z)}{q(z)} \right) dz$
- property: $V$-ellipticity (left); convexity (right)
- right column: use $\mathrm{KL}(q \,\|\, p(\cdot \mid x)) + L(q) = \log p(x)$
- data: $f$ on the left and $p(x, z)$ on the right
[Beal 2003, MacKay 2003, Bishop 2006]

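
The relation between the KL-divergence and $L(q)$ follows in two lines from $p(x, z) = p(z \mid x) \, p(x)$ and $\int q(z) \, dz = 1$. The worked step below uses the standard sign convention for the KL-divergence, in which it is non-negative.

```latex
\begin{align*}
L(q) &= \int q(z)\,\log\frac{p(x,z)}{q(z)}\,dz
      = \int q(z)\,\log\frac{p(z\mid x)\,p(x)}{q(z)}\,dz \\
     &= \int q(z)\,\log\frac{p(z\mid x)}{q(z)}\,dz + \log p(x)\int q(z)\,dz
      = -\,\mathrm{KL}\bigl(q \,\|\, p(\cdot\mid x)\bigr) + \log p(x).
\end{align*}
```

Since $\log p(x)$ does not depend on $q$, maximising $L(q)$ is the same as minimising the KL-divergence to the posterior.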
59 Fixed point characterisation of the best product
Proposition (characterisation): If $q = \prod_{i=1}^{m} q_i$ is the best approximant then
$u_j(z_j) = \int \log(p(x, z)) \prod_{i \ne j} q_i(z_i) \, dz_i$
$q_j(z_j) = \frac{\exp(u_j(z_j))}{\int \exp(u_j(w_j)) \, dw_j}$
Proof.
$L(q) = \int \prod_{i=1}^{m} q_i(z_i) \left( \log(p(x, z)) - \sum_{i=1}^{m} \log q_i(z_i) \right) dz$
$\phantom{L(q)} = \int \left( u_j(z_j) - \log q_j(z_j) \right) q_j(z_j) \, dz_j + F(\{q_i\}_{i \ne j})$

60 Iterative solver
start with $q_1^{(0)}, \ldots, q_m^{(0)}$
for $n = 1, 2, \ldots$ and $j = 1, \ldots, m$:
$u_j^{(n)}(z_j) = \int \log p(x, z) \prod_{i=1}^{j-1} q_i^{(n)}(z_i) \, dz_i \prod_{i=j+1}^{m} q_i^{(n-1)}(z_i) \, dz_i$
$q_j^{(n)}(z_j) = \frac{\exp u_j^{(n)}(z_j)}{\int \exp u_j^{(n)}(w_j) \, dw_j}$
convergence, as the KL-divergence is convex in $u_j$

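
For discrete latent variables the integrals become sums and the sweep can be written out directly. The sketch below treats two latent variables with two states each; the table `p_joint` of values $p(x, z_1, z_2)$ for fixed data $x$ is made up, and the function names are hypothetical.

```python
import math

def normalise_exp(u):
    """Return exp(u_j) / sum_k exp(u_k), computed stably."""
    m = max(u)
    w = [math.exp(v - m) for v in u]
    s = sum(w)
    return [v / s for v in w]

def lower_bound(p, q1, q2):
    """L(q) = sum_z q(z) log(p(x, z) / q(z)) for q = q1 * q2."""
    total = 0.0
    for i, a in enumerate(q1):
        for j, b in enumerate(q2):
            if a > 0.0 and b > 0.0:
                total += a * b * (math.log(p[i][j]) - math.log(a) - math.log(b))
    return total

def mean_field(p, sweeps=20):
    """Gauss-Seidel sweeps over the factors q1 and q2."""
    n1, n2 = len(p), len(p[0])
    q1, q2 = [1.0 / n1] * n1, [1.0 / n2] * n2
    history = [lower_bound(p, q1, q2)]
    for _ in range(sweeps):
        # u_1(z_1) = sum over z_2 of log p(x, z) q_2(z_2), then renormalise
        u1 = [sum(q2[j] * math.log(p[i][j]) for j in range(n2)) for i in range(n1)]
        q1 = normalise_exp(u1)
        # u_2(z_2) uses the freshly updated q_1
        u2 = [sum(q1[i] * math.log(p[i][j]) for i in range(n1)) for j in range(n2)]
        q2 = normalise_exp(u2)
        history.append(lower_bound(p, q1, q2))
    return q1, q2, history

# Hypothetical joint values p(x, z1, z2); they sum to p(x) = 0.7.
p_joint = [[0.30, 0.05], [0.10, 0.25]]
q1, q2, history = mean_field(p_joint)
log_evidence = math.log(sum(sum(row) for row in p_joint))
```

Each factor update maximises $L(q)$ over that factor with the other held fixed, so the recorded values in `history` never decrease and stay below $\log p(x)$.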
61 Mixture models
- probability distribution for the $n$-th observation: $p(x_n \mid u) = \sum_{k=1}^{K} \pi_k \, p_k(x_n \mid u_k)$
- the components $p_k$ are separable, so $p$ is a sum of separable functions
- problem: estimation of $p$ given data $x = (x_1, \ldots, x_N)$
- difficulty: while each $p_k$ has a product structure which is adapted to KL minimisation, the sum is a problem
- idea: introduce latent (or hidden) variables $z_1, \ldots, z_N$ which are binary vectors indicating the class $k$ of observation $n$
- thus $p(x_n \mid u) = \sum_{k=1}^{K} p(x_n \mid u_k, z_n = e_k) \, p(z_n = e_k) = \sum_{z_n} p(x_n, z_n \mid u)$ is interpreted as a marginal distribution, with $u = (u_1, \ldots, u_K)$

62 Likelihood, priors and posterior
- likelihood of $x = (x_1, \ldots, x_N)$: $p(x \mid z, u) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(x_n \mid u_k)^{z_{nk}}$ (the sum has disappeared)
- prior for the latent variables $z_n$: $p(z \mid \pi) = \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}}$
- priors for $\pi$ and $u$: $p(\pi)$ and $p(u)$
- posterior distribution: $p(z, \pi, u \mid x) = p(x, z, \pi, u) / p(x)$ where $p(x, z, \pi, u) = p(x \mid z, u) \, p(z \mid \pi) \, p(\pi) \, p(u)$

63 Variational Bayes for mixture models
- aim: $q(z, \pi, u)$ as approximation of the posterior $p(z, \pi, u \mid x)$
- product Ansatz: $q(z, \pi, u) = q(z) \, q(\pi, u)$
- fixed point formulation from minimal KL divergence:
  $q(z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}$
  $q(\pi, u) = C \, p(\pi) \prod_{k=1}^{K} p(u_k) \prod_{n=1}^{N} \prod_{k=1}^{K} (\pi_k \, p(x_n \mid u_k))^{r_{nk}}$
  where $r_{nk} = \rho_{nk} / \sum_{k'} \rho_{nk'}$ and $\log \rho_{nk} = E_\pi[\log \pi_k] + E_u[\log p(x_n \mid u_k)]$
- approximate $p(x_n \mid u_k)$ as a product to get tractability

64 Conclusions
- with the wide availability of new computational resources, high performance computing ideas now enter mainstream computational science
- HPC is getting increasingly complex, with a shift towards new problems and approaches characterised by the multi-challenge
- an increasingly important challenge originates from the multi and high dimensionality of many models in physics, chemistry, biology, statistics and engineering
- new theory and algorithms are needed
- the sparse grid combination technique deals with dimensionality and is ideally suited for HPC
- the Opticom method is able to overcome a stability issue of the original combination technique
- next: high-dimensional inverse problems?
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationCS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning
CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy
More informationVariational Mixture of Gaussians. Sargur Srihari
Variational Mixture of Gaussians Sargur srihari@cedar.buffalo.edu 1 Objective Apply variational inference machinery to Gaussian Mixture Models Demonstrates how Bayesian treatment elegantly resolves difficulties
More informationProbabilistic Graphical Models for Image Analysis - Lecture 4
Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationBAYESIAN DECISION THEORY
Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will
More informationFully Bayesian Deep Gaussian Processes for Uncertainty Quantification
Fully Bayesian Deep Gaussian Processes for Uncertainty Quantification N. Zabaras 1 S. Atkinson 1 Center for Informatics and Computational Science Department of Aerospace and Mechanical Engineering University
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory
More informationAlgorithmisches Lernen/Machine Learning
Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines
More informationThe Variational Gaussian Approximation Revisited
The Variational Gaussian Approximation Revisited Manfred Opper Cédric Archambeau March 16, 2009 Abstract The variational approximation of posterior distributions by multivariate Gaussians has been much
More informationStat 451 Lecture Notes Numerical Integration
Stat 451 Lecture Notes 03 12 Numerical Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 5 in Givens & Hoeting, and Chapters 4 & 18 of Lange 2 Updated: February 11, 2016 1 / 29
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 4 Occam s Razor, Model Construction, and Directed Graphical Models https://people.orie.cornell.edu/andrew/orie6741 Cornell University September
More informationBayesian Paradigm. Maximum A Posteriori Estimation
Bayesian Paradigm Maximum A Posteriori Estimation Simple acquisition model noise + degradation Constraint minimization or Equivalent formulation Constraint minimization Lagrangian (unconstraint minimization)
More informationData Mining Techniques
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 6 Jan-Willem van de Meent (credit: Yijun Zhao, Chris Bishop, Andrew Moore, Hastie et al.) Project Project Deadlines 3 Feb: Form teams of
More informationMachine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io
Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory
More informationThe Bayesian approach to inverse problems
The Bayesian approach to inverse problems Youssef Marzouk Department of Aeronautics and Astronautics Center for Computational Engineering Massachusetts Institute of Technology ymarz@mit.edu, http://uqgroup.mit.edu
More informationECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering
ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Siwei Guo: s9guo@eng.ucsd.edu Anwesan Pal:
More informationGaussian with mean ( µ ) and standard deviation ( σ)
Slide from Pieter Abbeel Gaussian with mean ( µ ) and standard deviation ( σ) 10/6/16 CSE-571: Robotics X ~ N( µ, σ ) Y ~ N( aµ + b, a σ ) Y = ax + b + + + + 1 1 1 1 1 1 1 1 1 1, ~ ) ( ) ( ), ( ~ ), (
More informationMachine Learning 2017
Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section
More informationGWAS V: Gaussian processes
GWAS V: Gaussian processes Dr. Oliver Stegle Christoh Lippert Prof. Dr. Karsten Borgwardt Max-Planck-Institutes Tübingen, Germany Tübingen Summer 2011 Oliver Stegle GWAS V: Gaussian processes Summer 2011
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationCh 4. Linear Models for Classification
Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationOn Multigrid for Phase Field
On Multigrid for Phase Field Carsten Gräser (FU Berlin), Ralf Kornhuber (FU Berlin), Rolf Krause (Uni Bonn), and Vanessa Styles (University of Sussex) Interphase 04 Rome, September, 13-16, 2004 Synopsis
More informationVariational Inference. Sargur Srihari
Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of Discussion Functionals Calculus of Variations Maximizing a Functional Finding Approximation to a Posterior Minimizing K-L divergence Factorized
More informationLECTURE # 0 BASIC NOTATIONS AND CONCEPTS IN THE THEORY OF PARTIAL DIFFERENTIAL EQUATIONS (PDES)
LECTURE # 0 BASIC NOTATIONS AND CONCEPTS IN THE THEORY OF PARTIAL DIFFERENTIAL EQUATIONS (PDES) RAYTCHO LAZAROV 1 Notations and Basic Functional Spaces Scalar function in R d, d 1 will be denoted by u,
More informationThe Expectation Maximization or EM algorithm
The Expectation Maximization or EM algorithm Carl Edward Rasmussen November 15th, 2017 Carl Edward Rasmussen The EM algorithm November 15th, 2017 1 / 11 Contents notation, objective the lower bound functional,
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationA variational radial basis function approximation for diffusion processes
A variational radial basis function approximation for diffusion processes Michail D. Vrettas, Dan Cornford and Yuan Shen Aston University - Neural Computing Research Group Aston Triangle, Birmingham B4
More informationDeep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem
000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050
More informationLecture 3. Linear Regression
Lecture 3. Linear Regression COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne Weeks 2 to 8 inclusive Lecturer: Andrey Kan MS: Moscow, PhD:
More informationMachine Learning - MT & 5. Basis Expansion, Regularization, Validation
Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationLecture 1b: Linear Models for Regression
Lecture 1b: Linear Models for Regression Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced
More informationChris Bishop s PRML Ch. 8: Graphical Models
Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular
More informationExpectation Propagation Algorithm
Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,
More informationScientific Computing I
Scientific Computing I Module 8: An Introduction to Finite Element Methods Tobias Neckel Winter 2013/2014 Module 8: An Introduction to Finite Element Methods, Winter 2013/2014 1 Part I: Introduction to
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Introduction. Basic Probability and Bayes Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 26/9/2011 (EPFL) Graphical Models 26/9/2011 1 / 28 Outline
More informationProbabilistic Reasoning in Deep Learning
Probabilistic Reasoning in Deep Learning Dr Konstantina Palla, PhD palla@stats.ox.ac.uk September 2017 Deep Learning Indaba, Johannesburgh Konstantina Palla 1 / 39 OVERVIEW OF THE TALK Basics of Bayesian
More informationVariational Inference (11/04/13)
STA561: Probabilistic machine learning Variational Inference (11/04/13) Lecturer: Barbara Engelhardt Scribes: Matt Dickenson, Alireza Samany, Tracy Schifeling 1 Introduction In this lecture we will further
More informationProbabilistic and Bayesian Machine Learning
Probabilistic and Bayesian Machine Learning Day 4: Expectation and Belief Propagation Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/
More informationChapter 1: The Finite Element Method
Chapter 1: The Finite Element Method Michael Hanke Read: Strang, p 428 436 A Model Problem Mathematical Models, Analysis and Simulation, Part Applications: u = fx), < x < 1 u) = u1) = D) axial deformation
More informationIntroduction to Machine Learning
Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin
More informationVariational Principal Components
Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationGaussian Process Approximations of Stochastic Differential Equations
Gaussian Process Approximations of Stochastic Differential Equations Cédric Archambeau Dan Cawford Manfred Opper John Shawe-Taylor May, 2006 1 Introduction Some of the most complex models routinely run
More informationCSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection
CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationMultivariate Distributions
IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate
More informationSchwarz Preconditioner for the Stochastic Finite Element Method
Schwarz Preconditioner for the Stochastic Finite Element Method Waad Subber 1 and Sébastien Loisel 2 Preprint submitted to DD22 conference 1 Introduction The intrusive polynomial chaos approach for uncertainty
More informationSparse Kernel Density Estimation Technique Based on Zero-Norm Constraint
Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint Xia Hong 1, Sheng Chen 2, Chris J. Harris 2 1 School of Systems Engineering University of Reading, Reading RG6 6AY, UK E-mail: x.hong@reading.ac.uk
More informationK-Means and Gaussian Mixture Models
K-Means and Gaussian Mixture Models David Rosenberg New York University October 29, 2016 David Rosenberg (New York University) DS-GA 1003 October 29, 2016 1 / 42 K-Means Clustering K-Means Clustering David
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationIntegrated Non-Factorized Variational Inference
Integrated Non-Factorized Variational Inference Shaobo Han, Xuejun Liao and Lawrence Carin Duke University February 27, 2014 S. Han et al. Integrated Non-Factorized Variational Inference February 27, 2014
More informationA minimalist s exposition of EM
A minimalist s exposition of EM Karl Stratos 1 What EM optimizes Let O, H be a random variables representing the space of samples. Let be the parameter of a generative model with an associated probability
More informationThe Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision
The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that
More informationBayesian Machine Learning
Bayesian Machine Learning Andrew Gordon Wilson ORIE 6741 Lecture 3 Stochastic Gradients, Bayesian Inference, and Occam s Razor https://people.orie.cornell.edu/andrew/orie6741 Cornell University August
More informationOn the convergence of the combination technique
Wegelerstraße 6 53115 Bonn Germany phone +49 228 73-3427 fax +49 228 73-7527 www.ins.uni-bonn.de M. Griebel, H. Harbrecht On the convergence of the combination technique INS Preprint No. 1304 2013 On
More informationDensity Estimation. Seungjin Choi
Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationINTRODUCTION TO FINITE ELEMENT METHODS ON ELLIPTIC EQUATIONS LONG CHEN
INTROUCTION TO FINITE ELEMENT METHOS ON ELLIPTIC EQUATIONS LONG CHEN CONTENTS 1. Poisson Equation 1 2. Outline of Topics 3 2.1. Finite ifference Method 3 2.2. Finite Element Method 3 2.3. Finite Volume
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationLinear Models for Regression
Linear Models for Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationLecture 3: Pattern Classification
EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More informationVariable selection for model-based clustering
Variable selection for model-based clustering Matthieu Marbac (Ensai - Crest) Joint works with: M. Sedki (Univ. Paris-sud) and V. Vandewalle (Univ. Lille 2) The problem Objective: Estimation of a partition
More informationClustering. CSL465/603 - Fall 2016 Narayanan C Krishnan
Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification
More informationExpectation Propagation in Dynamical Systems
Expectation Propagation in Dynamical Systems Marc Peter Deisenroth Joint Work with Shakir Mohamed (UBC) August 10, 2012 Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1 Motivation Figure : Complex
More information