Sparse grids and optimisation

1 Sparse grids and optimisation. Lecture 1: Challenges and structure of multidimensional problems. Markus Hegland, ANU, July, MATRIX workshop on approximation and optimisation. Supported by ARC DP and LP and by the Technical University of Munich and DFG. 1 / 64

2 Outline 1 HPC Tackling the multi challenge The new computational science 2 Examples of multidimensional problems Interpolation Density estimation Partial differential equations 3 Sparse Grids Grids The combination technique 4 Data distributions Bayes and MAP Minimising KL divergence 5 Conclusions 2 / 64

3 HPC Tackling the multi challenge The new computational science Outline 1 HPC Tackling the multi challenge The new computational science 2 Examples of multidimensional problems Interpolation Density estimation Partial differential equations 3 Sparse Grids Grids The combination technique 4 Data distributions Bayes and MAP Minimising KL divergence 5 Conclusions 3 / 64

4 HPC Tackling the multi challenge The new computational science Challenges of numerical analysis: numerical techniques are a major driver of innovation in industrial societies and indispensable for the design of aeroplanes, weather forecasting, environmental monitoring, medical diagnostics, robotics and image processing; fundamental techniques and their foundations are well established; the techniques include finite elements, finite differences and finite volumes, which are widely available but do not work for ill-posed problems or high-dimensional problems; in both cases one requires specially adapted techniques, and the theory and practice of these techniques are an active area of research; recent developments in High Performance Computing (HPC) and new algorithms allow the solution of new multidimensional problems but introduce new challenges. 4 / 64

5 HPC Tackling the multi challenge The new computational science Computational challenges in HPC: from simulation to optimisation; from assumed parameters to estimation and identification; user interaction; energy costs; it's getting tougher: combine data and PDE models, computational faults, multiphysics. 5 / 64

6 HPC Tackling the multi challenge The new computational science Examples: PDEs, data, parameters. PDEs: $u = \arg\min_{v \in V} J(v)$; elliptic PDEs $J(v) = \frac{1}{2} a(v, v) - f(v)$; least squares solution $J(v) = \int (Lv(x) - f(x))^2 \, dx$; eigenvalues $J(v) = \frac{a(v,v)}{b(v,v)}$ (Rayleigh quotient). Fitting data: $u = \arg\min_{v \in V} L(v)$; penalised least squares $L(v) = \frac{1}{N} \sum_{i=1}^N (v(x_i) - y_i)^2 + a(v, v)$; MAP for the density $p(x) = \exp u(x)$: $L(v) = -\frac{1}{N} \sum_{i=1}^N v(x_i) + \log \int \exp(v(x)) \, dx + a(v, v)$. Parametric problems combine PDEs and data fitting: $u = \arg\min\{ L(u(\mu); \mu) : v = u(\mu),\ \mu \in M \}$ with the PDE constraint $u(\mu) = \arg\min_v J(v; \mu)$. Quantities of interest $q = s(u)$ are the target of approximation, e.g. energy, moments, likelihood, cost, risk. 6 / 64

7 HPC Tackling the multi challenge The new computational science Integrating multiplicities HPC tackles a multi-challenge multi-disciplinary domains and education multi-physics models multi-scale models multi-dimensional numerics multi-level numerics multi-core systems The prevailing paradigm in modern computational science and HPC combines multiple resources and approaches with a wide range of different properties to gain new insights into immensely complex systems in the natural, engineering and social sciences. This reflects the multi-skilled and multi-cultural societies in which modern science is developed. 7 / 64

8 Examples of multidimensional problems Interpolation Outline 1 HPC Tackling the multi challenge The new computational science 2 Examples of multidimensional problems Interpolation Density estimation Partial differential equations 3 Sparse Grids Grids The combination technique 4 Data distributions Bayes and MAP Minimising KL divergence 5 Conclusions 8 / 64

9 Examples of multidimensional problems Interpolation Interpolation: evaluation of a function is expensive, so compute some values and interpolate; logarithm tables by H. Briggs 1617. (Figures: logarithm table and piecewise linear interpolant of $\log(x)$, from Wikipedia.) Fast evaluation, reasonable accuracy, stable, positive. 9 / 64

10 Examples of multidimensional problems Interpolation A more accurate and flexible approach: $f_I(x) = c_1 \varphi(\|x - x_1\|) + \cdots + c_m \varphi(\|x - x_m\|)$. (Figures: piecewise linear interpolation function and Gaussian $\varphi(r) = e^{-r^2/\gamma}$.) Interpolation equations: $$\begin{pmatrix} \varphi(0) & \varphi(\|x_1 - x_2\|) & \cdots & \varphi(\|x_1 - x_m\|) \\ \varphi(\|x_2 - x_1\|) & \varphi(0) & \cdots & \varphi(\|x_2 - x_m\|) \\ \vdots & & \ddots & \vdots \\ \varphi(\|x_m - x_1\|) & \varphi(\|x_m - x_2\|) & \cdots & \varphi(0) \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix} = \begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_m) \end{pmatrix}$$ 10 / 64
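
The following sketch illustrates the interpolation equations above in one dimension with the Gaussian kernel $\varphi(r) = e^{-r^2/\gamma}$; the nodes, the target function $\log(x)$ and the value of $\gamma$ are illustrative choices, not taken from the slides.

```python
# Minimal sketch of radial basis function interpolation with a Gaussian kernel.
# Nodes, target function and gamma are illustrative assumptions.
import numpy as np

def rbf_interpolate(x_nodes, f_values, gamma):
    """Solve the interpolation equations Phi c = f for the coefficients c."""
    r = np.abs(x_nodes[:, None] - x_nodes[None, :])   # pairwise distances |x_i - x_j|
    Phi = np.exp(-r**2 / gamma)                       # phi(r) = exp(-r^2 / gamma)
    return np.linalg.solve(Phi, f_values)             # coefficients c_1, ..., c_m

def rbf_evaluate(x, x_nodes, c, gamma):
    """Evaluate f_I(x) = sum_i c_i phi(|x - x_i|)."""
    r = np.abs(np.atleast_1d(x)[:, None] - x_nodes[None, :])
    return np.exp(-r**2 / gamma) @ c

# Interpolate log(x) from 10 nodes on [1, 10], echoing Briggs' logarithm tables.
x_nodes = np.linspace(1.0, 10.0, 10)
c = rbf_interpolate(x_nodes, np.log(x_nodes), gamma=4.0)
print(rbf_evaluate([2.5, 7.3], x_nodes, c, gamma=4.0), np.log([2.5, 7.3]))
```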

11 Examples of multidimensional problems Interpolation Multidimensional interpolation: use $\varphi(\|x - x_i\|)$ where $\|x - x_i\|$ is the Euclidean distance of $x$ and $x_i$; for $x_i$ on a regular grid the error is $O(m^{-2/d})$; for random points $x_i$ the error is $O(m^{-1/2})$; up to $d = 4$ dimensions and for smooth functions the regular grid is competitive; for higher dimensions random interpolation points are better; the theory for random points uses the law of large numbers. 11 / 64

12 Examples of multidimensional problems Interpolation The concentration of measure in high dimensions: any pair of random points has (nearly) the same distance; consequently the interpolant is close to constant with high probability; the interpolation matrix $[\varphi(\|x_i - x_j\|)]_{i,j=1,\dots,n}$ is then nearly constant off the diagonal. Other instances of concentration of measure: most of the volume of a sphere (the Earth) is close to the surface; the law of large numbers and statistical convergence theory. [Lévy 1920s, Milman 1970s, Gromov, Talagrand 1990s+] 12 / 64

13 Examples of multidimensional problems Interpolation When concentration is not a problem for interpolation: when the points of interest are on a low-dimensional sub-manifold; when the function which is to be interpolated has known simple structure, e.g., is linear or additive, $f(x_1, \dots, x_d) = \sum_{i=1}^d f_i(x_i)$, or is close to such a function; when the function only depends on few dimensions, $f(x_1, \dots, x_d) = g(x_1, x_2, x_3)$. Dimension is not only a curse: in high dimensions any non-empty neighbourhood contains large numbers of points which can be used for error reduction by averaging [Anderssen, H. 1999]. 13 / 64

14 Examples of multidimensional problems Density estimation Outline 1 HPC Tackling the multi challenge The new computational science 2 Examples of multidimensional problems Interpolation Density estimation Partial differential equations 3 Sparse Grids Grids The combination technique 4 Data distributions Bayes and MAP Minimising KL divergence 5 Conclusions 14 / 64

15 Examples of multidimensional problems Density estimation Density estimation: large data sets, queries expensive; data set = probability measure over feature space; histogram = piecewise constant approximation of the measure; extract relevant information fast from the histogram; applications: mean, variance, moments, number and location of modes, skewness and tail behaviour. "All modern theories of statistical inference take as their starting point the idea of the probability density function of the observations." E. Parzen (1961) in An Approach to Time Series Analysis. (Figure: histogram of counts by Age.) 15 / 64

16 Examples of multidimensional problems Density estimation A more accurate and flexible approach: the kernel density estimator $p_K(x) = \frac{\varphi(\|x - x_1\|/\sigma)}{m\sigma} + \cdots + \frac{\varphi(\|x - x_m\|/\sigma)}{m\sigma}$. (Figures: piecewise constant estimator and Gaussian $\varphi(r) = \frac{e^{-r^2/2}}{\sqrt{2\pi}}$.) More accurate representation of smooth densities; control smoothness with the width parameter $\sigma$, which can depend on $x$; no need to solve a linear system of equations; more flexible: also for multidimensional distributions. 16 / 64
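
A minimal sketch of the one-dimensional Gaussian kernel density estimator $p_K$ defined above; the synthetic sample and the bandwidth $\sigma$ are assumptions made for illustration.

```python
# Minimal sketch of a 1D Gaussian kernel density estimator p_K.
# The data (a synthetic normal sample) and sigma are illustrative assumptions.
import numpy as np

def kde(x, data, sigma):
    """p_K(x) = (1/(m*sigma)) * sum_i phi((x - x_i)/sigma) with a Gaussian phi."""
    x = np.atleast_1d(x)
    z = (x[:, None] - data[None, :]) / sigma                 # (x - x_i) / sigma
    phi = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)         # standard normal kernel
    return phi.sum(axis=1) / (len(data) * sigma)

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)
print(kde([0.0, 1.0, 2.0], data, sigma=0.3))                 # close to the N(0,1) density
```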

17 Examples of multidimensional problems Density estimation Challenges of multidimensional density estimation. Low dimensional case: for any $x$ only the $\varphi(\|x - x_i\|/\sigma)$ for neighbouring $x_i$ are nonzero, which gives an efficient estimator as every point has only few neighbours: $p_K(x) = \sum_{x_i \in N(x)} \varphi(\|x - x_i\|/\sigma)/(m\sigma)$. High dimensional case: all $x_i$ are neighbours, so one needs to consider all data points to evaluate the density: $p_K(x) = \sum_{i=1}^m \varphi(\|x - x_i\|/\sigma)/(m\sigma)$. Very high dimensional case: if $x, x_1, \dots, x_m$ are i.i.d. then all components $\varphi(\|x - x_i\|/\sigma)/(m\sigma)$ are of the same size and the density is asymptotically uniform: $p_K(x) \approx E(\varphi(\|X_i - X_j\|/\sigma))/\sigma$. 17 / 64

18 Examples of multidimensional problems Density estimation When concentration is not a problem for density estimation: when the points of interest are on a low-dimensional submanifold; when the unknown $p$ has known simple structure, e.g. $p(x_1, \dots, x_d) = \prod_{i=1}^d p_i(x_i)$; more generally, when the density is described by a graphical model which leads to a factorisation as in $p(x_1, x_2, x_3) = \frac{p(x_1, x_2)\, p(x_2, x_3)}{p(x_2)}$; mixture models $p(x) = \sum_{i=1}^K p_i(x)\, \pi_i$ where $p_i(x) = p(x \mid x \in \Omega_i)$ has some known form and $\pi_i = p(\Omega_i)$. 18 / 64

19 Examples of multidimensional problems Partial differential equations Outline 1 HPC Tackling the multi challenge The new computational science 2 Examples of multidimensional problems Interpolation Density estimation Partial differential equations 3 Sparse Grids Grids The combination technique 4 Data distributions Bayes and MAP Minimising KL divergence 5 Conclusions 19 / 64

20 Examples of multidimensional problems Partial differential equations Partial differential equations are a very widely used tool in computational science. Examples of equations: $i\hbar \frac{\partial \psi}{\partial t} = -\frac{\hbar^2}{2m} \Delta \psi + V \psi$ (Schrödinger equation, quantum chemistry); $\frac{dp}{dt} = \sum_z (S_z - I) \Lambda_z p$ (chemical master equation, molecular biology); $\frac{\partial f}{\partial t} + v^T \nabla_x f + q(E + v \times B) \cdot \nabla_p f = 0$ (Vlasov equation, plasma physics). Dimensionality: $\psi$ and $p$ can depend on hundreds of variables, $f$ depends on five variables. 20 / 64

21 Examples of multidimensional problems Partial differential equations Controlling the function values. Sobolev norms are an important tool for PDE theory: $\|u\|_k^2 = \int_\Omega (-1)^k u(x)\, \Delta^k u(x)\, dx$ for $u \in C_0^\infty(\Omega)$, extended by completion. Sobolev embedding: $|u(x)| \le C \|u\|_k$ for $k > d/2$; $k = 2$ comes from regularity theory, giving bounded solutions for $d \le 3$. Case $d > 3$: PDE regularity theory still gives $\|u\|_2 < \infty$, but the Sobolev embedding $|u(x)| \le C \|u\|_2$ no longer applies. Mixed norms: $\|u\|_{\mathrm{mix}}^2 = \int \Big( \frac{\partial^d u(x)}{\partial x_1 \cdots \partial x_d} \Big)^2 dx$, and so $|u(x)| \le C^d \|u\|_{\mathrm{mix}}$. 21 / 64

22 Sparse Grids Grids Outline 1 HPC Tackling the multi challenge The new computational science 2 Examples of multidimensional problems Interpolation Density estimation Partial differential equations 3 Sparse Grids Grids The combination technique 4 Data distributions Bayes and MAP Minimising KL divergence 5 Conclusions 22 / 64

23 Sparse Grids Grids The grid. The challenge: the curse of dimension. Approximate an unknown function $u(x, y)$: compute only the values $u(x_i, y_j)$ on discrete grid points and interpolate the values $u(x, y)$ for other points $(x, y)$. Regular isotropic grid: $x_i = ih$ and $y_j = jh$. In two dimensions $1/h^2$ grid points, in $d$ dimensions $1/h^d$ grid points, but accuracy proportional to $h^2$. 23 / 64

24 Sparse Grids Grids Anisotropic grids more general regular grids choose fine grid when u(x, y) has large gradients choose coarse grid when u(x, y) is smooth gradients may be different in different directions choose anisotropic grid when u(x, y) varies differently in different directions with anisotropic grids one can approximate multi-dimensional u(x 1,..., x d ) if u very smooth in most x k 24 / 64

25 Sparse Grids Grids Grids and sampling: the full grid captures all scales, a subgrid captures fewer scales; evaluation of $u(x, y)$ on the grid corresponds to sampling $u$ on the grid points; sampling on a fine grid captures high frequencies, i.e. small scale fluctuations (Nyquist/Shannon); with anisotropic grids one can capture small scales in one dimension and different scales in another. 25 / 64

26 Sparse Grids Grids Sparse grid = union of regular anisotropic grids. (Figures: a simple sparse grid, and the same sparse grid in frequency / scale space.) It captures fine scales in both dimensions, but not joint fine scales. 26 / 64
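
As a sketch of this construction, the snippet below lists the level vectors of the regular anisotropic grids whose union forms a simple sparse grid of level n; the choice n = 4, d = 2 is only an example.

```python
# Sketch: the anisotropic component grids of a simple sparse grid of level n are
# the regular grids with level vectors l = (l_1, ..., l_d), l_1 + ... + l_d = n,
# i.e. mesh width 2^{-l_i} in direction i. The values of n and d are illustrative.
from itertools import product

def component_levels(n, d):
    """All level vectors with |l|_1 = n."""
    return [l for l in product(range(n + 1), repeat=d) if sum(l) == n]

for l in component_levels(n=4, d=2):
    print("levels", l, "-> points per direction", [2**lk + 1 for lk in l])
```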

27 Sparse Grids Grids Another sparse grid. (Figures: sparse grid points and sparse grid frequency diagram.) The frequency diagram displays 1/4 of a hyperbolic cross. 27 / 64

28 Sparse Grids Grids Sparse grids and the curse of dimension. (Figure: error versus number of grid points for the four dimensional case, isotropic grid and sparse grid.) Only asymptotic error rates are given here; constants and preasymptotics also depend on the dimension; practical experience: sparse grids work up to about 10 dimensions [Zenger 1991]. Number of points and error: regular isotropic grids have $h^{-d}$ points and error $h^2$; sparse grids have $h^{-1} |\log_2 h|^{d-1}$ points and error $h^2 |\log_2 h|^{d-1}$. 28 / 64
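
To make these counts concrete, the sketch below counts the points of the union of all component grids with $|l|_1 = n$ and compares it with the isotropic full grid of the same finest mesh width $h = 2^{-n}$, here for $d = 4$; the values of $n$ are illustrative.

```python
# Sketch: count the points of a simple sparse grid (union of the anisotropic
# grids with |l|_1 = n) versus the full isotropic grid with mesh width h = 2^{-n}.
from itertools import product

def sparse_grid_size(n, d):
    points = set()
    for l in product(range(n + 1), repeat=d):
        if sum(l) != n:
            continue
        # coordinates i * 2^{-l_k}, stored as the integers i * 2^{n - l_k}
        axes = [range(0, 2**n + 1, 2**(n - lk)) for lk in l]
        points.update(product(*axes))
    return len(points)

d = 4
for n in (2, 4, 6):
    full = (2**n + 1) ** d
    print(f"n = {n}: full grid {full:>9d} points, sparse grid {sparse_grid_size(n, d):>6d} points")
```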

29 Sparse Grids Grids The big plan: dimension independence. Problem of sparse grids: exponential $d$-dependence of time and error through factors $|\log_2 h|^{d-1}$ and factors of the form $C^d$. Aim: remove all exponential $d$-dependence so that the error is $\sim h^2$ and the time is $\sim 1/h$, as in the case $d = 1$. Ideas: parallel solution on subgrids (see next section) gives $1/h$ time; stronger (energy) sparse grids give $h^2$ error; weighted mixed norms and special basis functions deal with the $C^d$ dependence. 29 / 64

30 Sparse Grids Grids Spatially adaptive (sparse) grids: choosing a sparse sub-grid of the sparse grid; adaptively choose the necessary sparse grid points and the corresponding (hierarchical) basis functions; requires an error indicator function; grid points are inserted only where necessary; acts as extra regularisation (like smoothing) for machine learning applications; modified basis functions near the boundary remove the $C^d$; implemented in the SG++ software package by Dirk Pflüger (Universität Stuttgart). 30 / 64

31 Sparse Grids The combination technique Outline 1 HPC Tackling the multi challenge The new computational science 2 Examples of multidimensional problems Interpolation Density estimation Partial differential equations 3 Sparse Grids Grids The combination technique 4 Data distributions Bayes and MAP Minimising KL divergence 5 Conclusions 31 / 64

32 Sparse Grids The combination technique Combining two regular grids. (Figure: sparse grid = union of two regular grids.) The solution on the combined grid is approximated as a linear combination of the solutions on the regular component grids; the components include the maximal generators and all intersections (union and intersection grids). 32 / 64

33 Sparse Grids The combination technique Weak solutions of boundary value problems. Boundary value problem: $-\Delta u(x) = f(x)$ for $x \in \Omega$, $u(x) = 0$ for $x \in \partial\Omega$. Weak solution: $a(u, v) = \langle f, v \rangle$ for all $v \in H_0^1(\Omega)$, where $a(u, v) = \int_\Omega \nabla u(x)^T \nabla v(x)\, dx$ and $\langle f, v \rangle = \int_\Omega f(x) v(x)\, dx$. Approximate solution $u_h \in V_h$: $a(u_h, v_h) = \langle f, v_h \rangle$ for all $v_h \in V_h$. The approximate solution can be viewed as a projection $u_h = P_h u$ which is orthogonal with respect to the energy norm. 33 / 64

34 Sparse Grids The combination technique Combination approximations. Regular grid approximation: regular grid $G_h$, function space $V_h$, Galerkin equations for $u_h$: $a(u_h, v_h) = \langle f, v_h \rangle$ for all $v_h \in V_h$. Sparse grid approximation: sparse grid $G_{SG} = \bigcup_h G_h$, function space $V_{SG} = \sum_h V_h$, Galerkin equations for $u_{SG}$: $a(u_{SG}, v_{SG}) = \langle f, v_{SG} \rangle$ for all $v_{SG} \in V_{SG}$. Combination technique, where HPC comes in: compute all $u_h$ in parallel and combine the solutions using a parallel reduction, $u_C = \sum_h c_h u_h$. Big question: when is $u_C = u_{SG}$? 34 / 64

35 Sparse Grids The combination technique Sparse grid combination technique. (Figures: sparse grid points and sparse grid frequency diagram.) Combination formula: $u_C = u_{1,16} + u_{2,8} + u_{4,4} + u_{8,2} + u_{16,1} - u_{1,8} - u_{2,4} - u_{4,2} - u_{8,1}$. 35 / 64
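
The coefficients behind this formula can be generated for general dimension; below is a sketch of the classical coefficient formula of Griebel, Schneider and Zenger (1992), $u_C = \sum_{q=0}^{d-1} (-1)^q \binom{d-1}{q} \sum_{|l|_1 = n-q} u_l$, which in 2D gives exactly the +1 / -1 pattern of the formula above.

```python
# Sketch of the classical combination-technique coefficients in d dimensions:
# c_l = (-1)^q * binom(d-1, q) for |l|_1 = n - q, q = 0, ..., d-1, and 0 otherwise.
from itertools import product
from math import comb

def combination_coefficients(n, d):
    """Return {level vector l: coefficient c_l} for the level-n combination."""
    coeffs = {}
    for q in range(d):
        for l in product(range(n - q + 1), repeat=d):
            if sum(l) == n - q:
                coeffs[l] = (-1) ** q * comb(d - 1, q)
    return coeffs

# In 2D: coefficient +1 on the diagonal i + j = n and -1 on i + j = n - 1.
print(combination_coefficients(n=4, d=2))
```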

36 Sparse Grids The combination technique Inclusion / exclusion principle. In combinatorics, for the cardinality of the set $A \cup B$: $|A \cup B| = |A| + |B| - |A \cap B|$; more generally, for additive $\alpha$: $\alpha(A \cup B) = \alpha(A) + \alpha(B) - \alpha(A \cap B)$. Theorem (de Moivre). If $A_1, \dots, A_m$ form an intersection structure then $\alpha\big( \bigcup_{i=1}^m A_i \big) = \sum_{i=1}^m c_i\, \alpha(A_i)$ for some $c_i \in \mathbb{Z}$. 36 / 64

37 Sparse Grids The combination technique When the combination approximation is the sparse grid solution. Lemma. If the grids $G_h$ and the spaces $V_h$ form an intersection structure, and if the Galerkin projections $P_h$ commute, i.e. $P_h P_{h'} = P_{h'} P_h$ for all $h, h'$, then $u_C = u_{SG}$, i.e. the combination technique provides the sparse grid solution. Proof. This is a consequence of the inclusion-exclusion principle, as it follows from the commutativity that $P_h$ is additive. 37 / 64

38 Sparse Grids The combination technique Tensor products: the classical sparse grid. Hierarchy of spaces of functions of one variable: $V_1 \subset V_2 \subset \cdots \subset V_m \subset V$. Classical sparse grid space: $V_{SG} = \sum_{i+j=n} V_i \otimes V_j$, where the tensor product function space $V_i \otimes V_j$ is the space of functions generated by products $u_1 \otimes u_2(x_1, x_2) = u_i(x_1) u_j(x_2)$ with $u_i \in V_i$ and $u_j \in V_j$. The $V_i \otimes V_j$ form an intersection structure, as $(V_{i_1} \otimes V_{j_1}) \cap (V_{i_2} \otimes V_{j_2}) = V_{\min(i_1,i_2)} \otimes V_{\min(j_1,j_2)}$. Combination coefficients: $c_{ij} = 1$ if $i + j = n$, $c_{ij} = -1$ if $i + j = n - 1$, and $c_{ij} = 0$ else. The combination formula is exact if $a(u_1 \otimes u_2, v_1 \otimes v_2) = a(u_1, v_1)\, a(u_2, v_2)$. [Griebel, Schneider, Zenger 1992] 38 / 64

39 Sparse Grids The combination technique Extrapolation. Assumption (error model): the error of the approximation in $V_{ij} = V_i \otimes V_j$ is of the form $e_{ij} = e_i^{(1)} + e_j^{(2)} + r_{ij}$, a type of ANOVA decomposition of the error. Consequence: the error of the combination technique is $e_h = \sum_{i+j \le n} c_{ij} e_{ij} = e_n^{(1)} + e_n^{(2)} + \sum_{i+j \le n} c_{ij} r_{ij}$. If the last term is very small then $e_h \approx e_{n,n}$, i.e. the combination technique approximation, using only components in $V_i \otimes V_j$ with $i + j \le n$, gets a similar approximation order as the one in $V_{nn}$. [Bungartz et al. 1994, Pflaum and Zhou 1999, Liem, Lu, Shih 1995 (splitting extrapolation)] 39 / 64

40 Sparse Grids The combination technique Breakdown of the combination technique. Regression problem: minimise $\frac{1}{M} \sum_{i=1}^M (u(x_i) - y_i)^2 + \lambda \|u\|^2$ with $\lambda = 10^{-4}$ (left) and $\lambda = 10^{-6}$ (right). (Figure: combination approximations for the two values of $\lambda$.) The combination approximation is not necessarily better for finer grids. [Garcke 2004, H. 2003, H., Garcke, Challis 2007] 40 / 64

41 Sparse Grids The combination technique Opticom. Optimal combination technique: choose the coefficients $c_i$ such that $J$ is optimised, with $J(c_1, \dots, c_m) = \big\| u - \sum_{i=1}^m c_i u_i \big\|_E^2 = \sum_{i,j=1}^m c_i c_j\, a(u_i, u_j) - 2 \sum_{i=1}^m c_i \|u_i\|_E^2 + \|u\|_E^2$. Normal equations: $$\begin{pmatrix} \|u_1\|_E^2 & a(u_1, u_2) & \cdots & a(u_1, u_m) \\ a(u_2, u_1) & \|u_2\|_E^2 & \cdots & a(u_2, u_m) \\ \vdots & & \ddots & \vdots \\ a(u_m, u_1) & a(u_m, u_2) & \cdots & \|u_m\|_E^2 \end{pmatrix} \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix} = \begin{pmatrix} \|u_1\|_E^2 \\ \|u_2\|_E^2 \\ \vdots \\ \|u_m\|_E^2 \end{pmatrix}$$ Solution of the system is of order $O(m^3)$ at most, but one also needs to determine the inner products $a(u_i, u_j)$. 41 / 64
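
Given the energy inner products $a(u_i, u_j)$ of the component solutions, the Opticom coefficients follow from a small dense solve of the normal equations above; the Gram matrix below is a hypothetical placeholder, not data from the slides.

```python
# Sketch of the Opticom normal equations: G c = diag(G), where G_ij = a(u_i, u_j)
# and the right-hand side holds the energy norms ||u_i||_E^2.
import numpy as np

def opticom_coefficients(gram):
    """Solve the normal equations for the optimal combination coefficients."""
    return np.linalg.solve(gram, np.diag(gram).copy())

# Hypothetical symmetric positive definite Gram matrix of three component solutions.
G = np.array([[1.0, 0.4, 0.3],
              [0.4, 0.8, 0.2],
              [0.3, 0.2, 0.6]])
print("Opticom coefficients:", opticom_coefficients(G))
```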

42 Sparse Grids The combination technique Opticom is better than the sub-grid solutions. Let $V_{SG} = \sum_{i=1}^n V_i \subset V$, let $a(\cdot,\cdot)$ be a V-elliptic and bounded symmetric bilinear form, let $u_i \in V_i$ be defined by $a(u_i, v_i) = a(u, v_i)$ for all $v_i \in V_i$, and let the $c_i$ be the Opticom coefficients. Then the error in the energy norm satisfies $\| u - u_{SG} \|_E \le \big\| u - \sum_{i=1}^n c_i u_i \big\|_E \le \min_{i=1,\dots,n} \| u - u_i \|_E$. The standard combination technique does not have this property. 42 / 64

43 Sparse Grids The combination technique The optimality of Opticom. Proposition. Let $V_{SG} = \sum_{i=1}^n V_i \subset V$, let $a(\cdot,\cdot)$ be a V-elliptic and bounded bilinear form, let $u_i \in V_i$ be defined by $a(u_i, v_i) = a(u, v_i)$ for all $v_i \in V_i$, and let the $c_i$ be the Opticom coefficients. Then for some $\kappa > 0$ one has $\big\| u - \sum_{i=1}^n c_i u_i \big\|_V \le \kappa\, \big\| u - \sum_{i=1}^n \tilde{c}_i u_i \big\|_V$ for any $\tilde{c}_i \in \mathbb{R}$. Proof. This is a direct application of Céa's Lemma. 43 / 64

44 Sparse Grids The combination technique Norm reduction with Opticom. Proposition. Let $V_{SG} = \sum_{i=1}^n V_i \subset V$, let $a(\cdot,\cdot)$ be a V-elliptic and bounded symmetric bilinear form, let $u_i \in V_i$ be defined by $a(u_i, v_i) = a(u, v_i)$ for all $v_i \in V_i$, and let the $c_i$ be the Opticom coefficients. Then one has, for the energy norm defined by $a(\cdot,\cdot)$, the bound $\big\| u - \sum_{i=1}^n c_i u_i \big\|_E \le \| u \|_E$, and either $\big\| u - \sum_{i=1}^n c_i u_i \big\|_E < \| u \|_E$ or $f \perp V_h$ and thus $u_i = 0$ for $i = 1, \dots, n$. 44 / 64

45 Sparse Grids The combination technique Proof. $\| u \|_E^2 = \big\| u - \sum_{i=1}^n c_i u_i \big\|_E^2 + \big\| \sum_{i=1}^n c_i u_i \big\|_E^2$. If the best approximation is zero then $u$ has to be orthogonal to all $u_i$. As $u - u_i$ is orthogonal to $V_h$, it follows that all the $v$ which are orthogonal to $u_i$ are also orthogonal to $u$, and hence that $u$ is orthogonal to $V_i$. 45 / 64

46 Sparse Grids The combination technique An iterative method: Opticom iterative refinement. Set $u^{(0)} = 0$; solve $a(u_i^{(k+1)}, v_i) = a(u - u^{(k)}, v_i)$ for all $v_i \in V_i$; choose the $c_i^{(k+1)}$ such that $\big\| (u - u^{(k)}) - \sum_{i=1}^n c_i^{(k+1)} u_i^{(k+1)} \big\|_E$ is minimal; set $u^{(k+1)} = u^{(k)} + \sum_{i=1}^n c_i^{(k+1)} u_i^{(k+1)}$. The algorithm converges to the sparse grid solution; it is a variant of parallel subspace correction [Xu 1992]; it can also be combined with Newton [Griebel, H. 2010]. 46 / 64

47 Data distributions The machine learning problem: given data $x_1, \dots, x_N$ in $\mathbb{R}^d$, find a density $f(x)$ such that $f \approx \frac{1}{N} \sum_{k=1}^N \delta_{x_k}$. [Tapia & Thompson 1978, Silverman 1986, Scott 1992] 47 / 64

48 Data distributions From data smoothing to sums of simpler approximations. Function approximation: RBF = sums of products, $f(x) = \sum_{i=1}^N c_i\, \kappa(x - x_i) \;\Rightarrow\; f(x) = \sum_{k=1}^K c_k \prod_{j=1}^d f_{j,k}(\xi_j)$, where $x = (\xi_1, \dots, \xi_d)$ and the $x_i$ are data points. Density estimation: kernel density estimators = mixture models, $f(x) = \frac{1}{N} \sum_{i=1}^N \kappa(x - x_i) \;\Rightarrow\; f(x) = \sum_{k=1}^K \pi_k\, N(x \mid \mu_k, C_k)$. [McLachlan and Peel, 2000] 48 / 64

49 Data distributions Bayes and MAP Outline 1 HPC Tackling the multi challenge The new computational science 2 Examples of multidimensional problems Interpolation Density estimation Partial differential equations 3 Sparse Grids Grids The combination technique 4 Data distributions Bayes and MAP Minimising KL divergence 5 Conclusions 49 / 64

50 Data distributions Bayes and MAP Bayesian inference. Given data and models: data, e.g. $x = (x_1, \dots, x_N)$, typically $x \in \mathbb{R}^{Nd}$; likelihood of the data $p(x \mid z)$; prior of the parameters $z$: $p(z)$, which models reasonable assumptions about $z$. Bayes rule tells us how to adapt $p(z)$ in light of the evidence: $p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}$ where $p(x) = \int p(x \mid z)\, p(z)\, dz$; here $p(x)$ is shorthand for $p_X(X = x)$, $p(z \mid x)$ for $p_{Z \mid X}(Z = z \mid X = x)$, etc. What to do with the posterior: expectations $E(Y) = \int\!\!\int y\, p(y \mid z)\, p(z \mid x)\, dz\, dy$; probability distributions $p(y) = \int p(y \mid z)\, p(z \mid x)\, dz$; maximum $z_{\max} = \arg\max_z p(z \mid x) = \arg\max_z p(x \mid z)\, p(z)$. Tractability: the computation of $p(x)$, $p(y)$ and $E(Y)$ requires in general high-dimensional integrals, hence approximation. 50 / 64

51 Data distributions Bayes and MAP Density estimation. Data: $x = (x_1, \dots, x_N)$ drawn randomly from some unknown probability distribution. Probability density model: $f(x_k) = p(x_k \mid u) = \exp(u(x_k) - \gamma(u))$, where $\gamma(u)$ is such that $\int p(x_k \mid u)\, dx_k = 1$. Estimation problem: for given data $x_1, \dots, x_N$ find $\hat{u}$ such that $p(x_k \mid \hat{u})$ approximates the underlying density. Likelihood: $p(x \mid u) = \exp\big( \sum_{i=1}^n u(x_i) - n\, \gamma(u) \big)$; choose $\hat{u}$ such that the likelihood is large. Parametric case: maximum likelihood method; in the nonparametric case the problem is underdetermined. 51 / 64

52 Data distributions Bayes and MAP MAP with Gaussian process priors. Prior for $u$: a Gaussian probability measure $\nu$ over a space of functions, i.e. a Gaussian process prior; we consider covariances $C = C_1 \otimes \cdots \otimes C_d$. Posterior based on the likelihood $\rho(u) = p(x \mid u)$: $d\mu = \rho\, d\nu$ is a well defined measure if $\rho \in L_1(\nu)$. Maximum a-posteriori (MAP) method: estimate $u$ as the mode of the posterior. Laplace approximation of the posterior: a Gaussian process with expectation $u$. 52 / 64

53 Data distributions Bayes and MAP A variational problem. Characterisation of the mode $u$ of $\mu$: $\rho(u)\, \frac{d\lambda_v}{d\lambda}(u) \ge \rho(u + v)$ for all $v \in H$, where $\lambda_v(A) = \lambda(v + A)$. This leads to the minimisation of the functional $j(u) = \frac{1}{n} \|u\|_{CM}^2 - \frac{1}{n} \sum_{i=1}^n u(x_i) + \log \int_X \exp(u(x))\, dx$, where $\|\cdot\|_{CM}$ is the Cameron-Martin norm defined by the prior. [H. 2007, Griebel, H. 2010] 53 / 64
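
A minimal sketch of minimising a discretised version of $j(u)$ on a one-dimensional grid: the Cameron-Martin penalty is replaced by a simple gradient penalty, and the data, grid and regularisation weight are illustrative assumptions.

```python
# Sketch: penalised maximum-likelihood / MAP density estimate on a 1D grid by
# minimising j(u) ~ lam*||u'||^2 - (1/n) sum_i u(x_i) + log(int exp(u) dx).
# The gradient penalty stands in for the Cameron-Martin norm (an assumption).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=200)                 # synthetic sample
grid = np.linspace(-4.0, 4.0, 81)
h = grid[1] - grid[0]
idx = np.clip(np.round((data - grid[0]) / h).astype(int), 0, len(grid) - 1)

def j(u, lam=1e-2):
    penalty = lam * np.sum(np.diff(u) ** 2) / h       # discrete ||u'||^2
    loglik = u[idx].mean()                            # (1/n) sum_i u(x_i)
    lognorm = np.log(np.sum(np.exp(u)) * h)           # log of the normalising integral
    return penalty - loglik + lognorm

res = minimize(j, x0=np.zeros(len(grid)), method="L-BFGS-B")
u_hat = res.x
density = np.exp(u_hat) / (np.sum(np.exp(u_hat)) * h)   # p = exp(u - gamma(u))
print("estimated mean:", np.sum(grid * density) * h)
```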

54 Data distributions Bayes and MAP Newton-Galerkin Opticom method. Newton-Galerkin: $u_{n+1} = u_n + \Delta u_n$, where $\Delta u_n$ minimises $J(\Delta u) = \frac{1}{2} H_{u_n}(\Delta u, \Delta u) + (F(u_n), \Delta u)$. Sparse grid space $V = \sum_j V^{(j)}$. Sparse grid combination technique: $\Delta u_n = \sum_{j=1}^k c_j \Delta u_n^{(j)}$, where the components $\Delta u_n^{(j)}$ minimise $J(\Delta u)$ over $V^{(j)}$. Opticom method: choose the combination coefficients $c_j$ to minimise $J\big( \sum_{j=1}^k c_j \Delta u_n^{(j)} \big)$. This is a descent method and converges to the sparse grid solution, not to some combination approximation; it is an inexact Newton method [Deuflhard, Weiser 1996, Deuflhard 2004]; an alternative is nonlinear additive Schwarz [Dryja, Hackbusch 1997]. [H., Griebel 2007] 54 / 64

55 Data distributions Bayes and MAP Errors of the sparse grid approximation for the 2D case: approximation of the normal distribution, maximum likelihood projection and our estimator. (Table: errors $e^{(1)}_{1,l}$, $e^{(1)}_{2,l}$, $e^{(3)}_{1,l}$, $e^{(3)}_{2,l}$ as functions of the level $l$.) 55 / 64

56 Data distributions Bayes and MAP 3D density [Griebel, H. 2010] 56 / 64

57 Data distributions Minimising KL divergence Outline 1 HPC Tackling the multi challenge The new computational science 2 Examples of multidimensional problems Interpolation Density estimation Partial differential equations 3 Sparse Grids Grids The combination technique 4 Data distributions Bayes and MAP Minimising KL divergence 5 Conclusions 57 / 64

58 Data distributions Minimising KL divergence Two variational problems. Method: Ritz-Galerkin for $u_h$ vs. variational Bayes for $q$. Error: energy norm $\| u - u_h \|_E = \sqrt{a(u - u_h, u - u_h)}$ vs. KL divergence $KL(q \,\|\, p(\cdot \mid x)) = -\int q(z) \log\frac{p(z \mid x)}{q(z)}\, dz$. Optimisation: minimise $J(u_h) = \frac{1}{2} a(u_h, u_h) - \langle f, u_h \rangle$ vs. maximise $L(q) = \int q(z) \log\frac{p(x, z)}{q(z)}\, dz$. Property: V-ellipticity vs. convexity. In the right column one uses $KL(q \,\|\, p(\cdot \mid x)) + L(q) = \log p(x)$. Data: $f$ on the left and $p(x, z)$ on the right. [Beal 2003, MacKay 2003, Bishop 2006] 58 / 64
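
The identity $\log p(x) = L(q) + KL(q \,\|\, p(\cdot \mid x))$ used in the right column can be checked on a tiny discrete example; the joint $p(x, z)$ and the trial factor $q(z)$ below are hypothetical numbers.

```python
# Sketch verifying log p(x) = L(q) + KL(q || p(.|x)) for a discrete latent z.
import numpy as np

p_xz = np.array([0.10, 0.25, 0.05, 0.20])      # hypothetical joint p(x, z) at the observed x
q = np.array([0.4, 0.3, 0.2, 0.1])             # an arbitrary approximate posterior q(z)

p_x = p_xz.sum()                               # evidence p(x)
post = p_xz / p_x                              # exact posterior p(z | x)
L = np.sum(q * np.log(p_xz / q))               # lower bound L(q)
KL = np.sum(q * np.log(q / post))              # KL(q || p(. | x))
print(np.log(p_x), L + KL)                     # equal up to round-off
```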

59 Data distributions Minimising KL divergence Fixed point characterisation of the best product. Proposition (characterisation). If $q = \prod_{i=1}^m q_i$ is the best approximant then $u_j(z_j) = \int \log p(x, z) \prod_{i \ne j} q_i(z_i)\, dz_{i \ne j}$ and $q_j(z_j) = \frac{\exp(u_j(z_j))}{\int \exp(u_j(w_j))\, dw_j}$. Proof. $L(q) = \int \prod_{i=1}^m q_i(z_i) \Big( \log p(x, z) - \sum_{i=1}^m \log q_i(z_i) \Big)\, dz = \int \big( u_j(z_j) - \log q_j(z_j) \big)\, q_j(z_j)\, dz_j + F(\{q_i\}_{i \ne j})$. 59 / 64

60 Data distributions Minimising KL divergence Iterative solver: start with $q_1^{(0)}, \dots, q_m^{(0)}$; for $n = 1, 2, \dots$ and $j = 1, \dots, m$ set $u_j^{(n)}(z_j) = \int \log p(x, z) \prod_{i=1}^{j-1} q_i^{(n)}(z_i) \prod_{i=j+1}^{m} q_i^{(n-1)}(z_i)\, dz_{i \ne j}$ and $q_j^{(n)}(z_j) = \frac{\exp u_j^{(n)}(z_j)}{\int \exp u_j^{(n)}(w_j)\, dw_j}$. Convergence follows as the KL divergence is convex in $u_j$. 60 / 64
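
The sketch below runs this coordinate-wise fixed-point iteration for two factors on a discretised joint density; the correlated Gaussian $\log p(z_1, z_2)$ and the grid are illustrative choices.

```python
# Sketch of the mean-field fixed-point iteration q_j ~ exp(E_{q_{-j}}[log p])
# for two factors on a discretised (unnormalised) joint density p(z1, z2).
import numpy as np

z = np.linspace(-3.0, 3.0, 61)
Z1, Z2 = np.meshgrid(z, z, indexing="ij")
logp = -(Z1**2 - 1.6 * Z1 * Z2 + Z2**2)        # hypothetical correlated Gaussian, unnormalised
q1 = np.full(len(z), 1.0 / len(z))             # initial factors q1^(0), q2^(0)
q2 = q1.copy()

for _ in range(50):
    u1 = logp @ q2                             # u_1(z1) = E_{q2}[log p]
    q1 = np.exp(u1 - u1.max()); q1 /= q1.sum()
    u2 = logp.T @ q1                           # u_2(z2) = E_{q1}[log p]
    q2 = np.exp(u2 - u2.max()); q2 /= q2.sum()

var1 = q1 @ z**2 - (q1 @ z) ** 2               # the factorised q underestimates the
var2 = q2 @ z**2 - (q2 @ z) ** 2               # marginal variances of the correlated p
print("mean-field variances:", var1, var2)
```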

61 Data distributions Minimising KL divergence Mixture models. Probability distribution for the $n$-th observation: $p(x_n \mid u) = \sum_{k=1}^K \pi_k\, p_k(x_n \mid u_k)$, where the components $p_k$ are separable or sums of separable functions. Problem: estimation of $p$ given the data $x = (x_1, \dots, x_N)$. Difficulty: while each $p_k$ has a product structure which is adapted to KL minimisation, the sum is a problem. Idea: introduce latent (or hidden) variables $z_1, \dots, z_N$ which are binary vectors indicating the class $k$ of observation $n$; thus $p(x_n \mid u) = \sum_{k=1}^K p(x_n \mid u_k, z_n = e_k)\, p(z_n = e_k) = \sum_{z_n} p(x_n, z_n \mid u)$ is interpreted as a marginal distribution, and $u = (u_1, \dots, u_K)$. 61 / 64

62 Data distributions Minimising KL divergence Likelihood, priors and posterior. Likelihood of $x = (x_1, \dots, x_N)$: $p(x \mid z, u) = \prod_{n=1}^N \prod_{k=1}^K p(x_n \mid u_k)^{z_{nk}}$ (the sum has disappeared). Prior for the latent variables $z_n$: $p(z \mid \pi) = \prod_{n=1}^N \prod_{k=1}^K \pi_k^{z_{nk}}$. Priors for $\pi$ and $u$: $p(\pi)$ and $p(u)$. Posterior distribution: $p(z, \pi, u \mid x) = p(x, z, \pi, u)/p(x)$, where $p(x, z, \pi, u) = p(x \mid z, u)\, p(z \mid \pi)\, p(\pi)\, p(u)$. 62 / 64

63 Data distributions Minimising KL divergence Variational Bayes for mixture models. Aim: $q(z, \pi, u)$ as an approximation of the posterior $p(z, \pi, u \mid x)$. Product Ansatz: $q(z, \pi, u) = q(z)\, q(\pi, u)$. Fixed point formulation from minimal KL divergence: $q(z) = \prod_{n=1}^N \prod_{k=1}^K r_{nk}^{z_{nk}}$ and $q(\pi, u) = C\, p(\pi) \prod_{k=1}^K p(u_k) \prod_{n=1}^N \prod_{k=1}^K \big( \pi_k\, p(x_n \mid u_k) \big)^{r_{nk}}$, where $r_{nk} = \rho_{nk} / \sum_{k'} \rho_{nk'}$ and $\log \rho_{nk} = E_\pi[\log \pi_k] + E_u[\log p(x_n \mid u_k)]$; approximate $p(x_n \mid u_k)$ as a product to obtain tractability. 63 / 64
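
As a sketch of the responsibility update, the snippet below evaluates $r_{nk}$ for a two-component Gaussian mixture with fixed (hypothetical) expected log mixing weights and component parameters, in place of the full variational updates of $q(\pi, u)$.

```python
# Sketch of the responsibility update r_nk = rho_nk / sum_k' rho_nk' with
# log rho_nk = E[log pi_k] + E[log p(x_n | u_k)]; the expectations here are
# fixed, hypothetical values rather than the result of the full variational loop.
import numpy as np
from scipy.stats import norm

x = np.array([-2.1, -1.8, 0.1, 1.9, 2.2])              # hypothetical 1D observations
E_log_pi = np.log(np.array([0.5, 0.5]))                # assumed E[log pi_k]
means, sigmas = np.array([-2.0, 2.0]), np.array([1.0, 1.0])

log_rho = E_log_pi[None, :] + norm.logpdf(x[:, None], means[None, :], sigmas[None, :])
log_rho -= log_rho.max(axis=1, keepdims=True)          # stabilise before exponentiating
r = np.exp(log_rho)
r /= r.sum(axis=1, keepdims=True)                      # responsibilities r_nk
print(np.round(r, 3))
```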

64 Conclusions Conclusions: with the wide availability of new computational resources, high performance computing ideas now enter mainstream computational science; HPC is getting increasingly complex, with a shift towards new problems and approaches characterised by the multi-challenge; an increasingly important challenge originates from the multi- and high dimensionality of many models in physics, chemistry, biology, statistics and engineering, and new theory and algorithms are needed; the sparse grid combination technique deals with dimensionality and is ideally suited for HPC; the Opticom method is able to overcome a stability issue of the original combination technique; next: high-dimensional inverse problems? 64 / 64
