Sparse grids and optimisation

1 Sparse grids and optimisation
Lecture 1: Challenges and structure of multidimensional problems
Markus Hegland, ANU
July, MATRIX workshop on approximation and optimisation
Supported by ARC DP and LP, and by Technical University of Munich and DFG

2 Outline
1 HPC: Tackling the multi challenge; The new computational science
2 Examples of multidimensional problems: Interpolation; Density estimation; Partial differential equations
3 Sparse Grids: Grids; The combination technique
4 Data distributions: Bayes and MAP; Minimising KL divergence
5 Conclusions

4 Challenges of numerical analysis
Numerical techniques are a major driver of innovation in industrial societies and are indispensable for the design of aeroplanes, weather forecasting, environmental monitoring, medical diagnostics, robotics and image processing.
The fundamental techniques and their foundations are well established. They include finite elements, finite differences and finite volumes, and are widely available, but they do not work for ill-posed problems or for high-dimensional problems.
In both cases one requires specially adapted techniques, and the theory and practice of these techniques are an active area of research.
Recent developments in High Performance Computing (HPC) and new algorithms allow the solution of new multidimensional problems but introduce new challenges.

5 Computational challenges in HPC
It's getting tougher:
  from simulation to optimisation
  from assumed parameters to estimation and identification
  user interaction
  energy costs
  combining data and PDE models
  computational faults
  multiphysics

6 Examples: PDEs, data, parameters
PDEs: $u = \operatorname{argmin}_{v \in V} J(v)$
  elliptic PDEs: $J(v) = \tfrac{1}{2} a(v,v) - f(v)$
  least squares solution: $J(v) = \int (Lv(x) - f(x))^2 \, dx$
  eigenvalues: $J(v) = a(v,v) / b(v,v)$ (Rayleigh quotient)
Fitting data: $u = \operatorname{argmin}_{v \in V} L(v)$
  penalised least squares: $L(v) = \frac{1}{N} \sum_{i=1}^N (v(x_i) - y_i)^2 + a(v,v)$
  MAP for density $p(x) = \exp u(x)$: $L(v) = -\frac{1}{N} \sum_{i=1}^N v(x_i) + \log \int \exp(v(x)) \, dx + a(v,v)$
Parametric problems combine PDEs and data fitting:
\[ u = \operatorname{argmin} \{ L(u(\mu); \mu) \mid v = u(\mu),\ \mu \in M \} \]
with PDE constraint $u(\mu) = \operatorname{argmin}_v J(v; \mu)$.
Quantities of interest: $q = s(u)$ is the target of approximation, e.g. energy, moments, likelihood, cost, risk.

7 Integrating multiplicities
HPC tackles a multi-challenge:
  multi-disciplinary domains and education
  multi-physics models
  multi-scale models
  multi-dimensional numerics
  multi-level numerics
  multi-core systems
The prevailing paradigm in modern computational science and HPC combines multiple resources and approaches with a wide range of different properties to gain new insights into immensely complex systems in the natural, engineering and social sciences. This reflects the multi-skilled and multi-cultural societies in which modern science is developed.

9 Interpolation
Evaluation of the function is expensive: compute some values, interpolate the rest.
Example: the logarithm tables by H. Briggs, 1617.
[Figures: logarithm table (from: Wikipedia); piecewise linear interpolant of log(x) against x]
Properties: fast evaluation, reasonable accuracy, stable, positive.

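10 A more accurate and flexible approach
\[ f_I(x) = c_1 \phi(\|x - x_1\|) + \cdots + c_m \phi(\|x - x_m\|) \]
Interpolation function $\phi$: piecewise linear, or Gaussian $\phi(r) = e^{-r^2/\gamma}$.
[Figures: plots of the piecewise linear and the Gaussian $\phi(r)$ against $r$]
Interpolation equations:
\[
\begin{pmatrix}
\phi(0) & \phi(\|x_1 - x_2\|) & \cdots & \phi(\|x_1 - x_m\|) \\
\phi(\|x_2 - x_1\|) & \phi(0) & \cdots & \phi(\|x_2 - x_m\|) \\
\vdots & \vdots & \ddots & \vdots \\
\phi(\|x_m - x_1\|) & \phi(\|x_m - x_2\|) & \cdots & \phi(0)
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix}
=
\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_m) \end{pmatrix}
\]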
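The interpolation equations can be tried out on a small one-dimensional example. The following is a minimal sketch, not the lecture's code: the helper names (`gaussian`, `solve`, `rbf_interpolant`), the value $\gamma = 0.1$, the five equispaced points and the test function $\sin(2\pi x)$ are all assumptions made for illustration.

```python
import math

def gaussian(r, gamma=0.1):
    """Gaussian radial function phi(r) = exp(-r^2 / gamma); gamma is an assumed value."""
    return math.exp(-r * r / gamma)

def solve(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            t = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= t * M[k][j]
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * c[j] for j in range(i + 1, n))
        c[i] = (M[i][n] - s) / M[i][i]
    return c

def rbf_interpolant(points, values, phi=gaussian):
    """Set up the matrix [phi(|x_i - x_j|)], solve for c, and return f_I."""
    A = [[phi(abs(xi - xj)) for xj in points] for xi in points]
    c = solve(A, values)
    return lambda x: sum(ci * phi(abs(x - xi)) for ci, xi in zip(c, points))

# Hypothetical data: five equispaced points and a smooth test function.
points = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [math.sin(2 * math.pi * x) for x in points]
f_I = rbf_interpolant(points, values)
print(f_I(0.6), math.sin(2 * math.pi * 0.6))
```

The interpolant reproduces the data at the points $x_i$; between them its accuracy depends on $\gamma$ and on the point spacing.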

11 Multidimensional interpolation
Use $\phi(\|x - x_i\|)$, where $\|x - x_i\|$ is the Euclidean distance of $x$ and $x_i$.
  $x_i$ on a regular grid: error $O(m^{-2/d})$
  random points $x_i$: error $O(m^{-1/2})$
Up to $d = 4$ dimensions and for smooth functions the regular grid is competitive.
For higher dimensions random interpolation points are better.
The theory for random points uses the law of large numbers.

12 The concentration of measure in high dimensions
Any two random points have nearly the same distance.
Consequently the interpolant is close to constant with high probability.
[Example: the interpolation matrix $[\phi(\|x_i - x_j\|)]_{i,j=1,\dots,n}$ for one value of $d$; the numerical values are missing from this transcript]
Other instances of concentration of measure:
  most of the volume of a sphere (Earth) is close to the surface
  law of large numbers, statistical convergence theory
[Lévy, 20s, Milman 70s, Gromov, Talagrand 90s+]
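The concentration of distances is easy to observe numerically. The sketch below is an illustration under assumptions of my own choosing: points are drawn uniformly from the unit cube $[0,1]^d$, and `relative_spread` is a hypothetical helper that reports the standard deviation of all pairwise distances divided by their mean.

```python
import math
import random

def relative_spread(d, n_points=30, seed=0):
    """Standard deviation / mean of pairwise Euclidean distances of
    n_points uniform random points in the unit cube [0, 1]^d."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    mean = sum(dists) / len(dists)
    var = sum((t - mean) ** 2 for t in dists) / len(dists)
    return math.sqrt(var) / mean

# The relative spread shrinks as the dimension grows.
for d in (2, 20, 200, 2000):
    print(d, round(relative_spread(d), 3))
```

As $d$ grows the ratio shrinks, so all off-diagonal entries $\phi(\|x_i - x_j\|)$ of the interpolation matrix become nearly equal.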

13 When concentration is not a problem for interpolation
  when the points of interest are on a low-dimensional sub-manifold
  when the function which is to be interpolated has a known simple structure, e.g., is linear or additive,
\[ f(x_1,\dots,x_d) = \sum_{i=1}^d f_i(x_i), \]
  or is close to such a function
  when the function only depends on a few dimensions:
\[ f(x_1,\dots,x_d) = g(x_1, x_2, x_3) \]
Dimension is not only a curse: in high dimensions any non-empty neighbourhood contains large numbers of points, which can be used for error reduction by averaging [Anderssen, H. 1999].

15 Density estimation
  large data sets, queries expensive
  data set = probability measure over feature space
  histogram = piecewise constant approximation of measure
  extract relevant information fast from histogram
Applications: mean, variance, moments; number and location of modes; skewness and tail behaviour.
"All modern theories of statistical inference take as their starting point the idea of the probability density function of the observations." E. Parzen (1961) in An Approach to Time Series Analysis
[Figure: histogram of Count against Age]

16 A more accurate and flexible approach
\[ p_K(x) = \frac{\phi(\|x - x_1\|/\sigma)}{m\sigma} + \cdots + \frac{\phi(\|x - x_m\|/\sigma)}{m\sigma} \]
Kernel density estimator with kernel $\phi$: piecewise constant, or Gaussian $\phi(r) = e^{-r^2/2}/\sqrt{2\pi}$.
[Figures: plots of the piecewise constant and the Gaussian $\phi(r)$ against $r$]
  more accurate representation of smooth densities
  control smoothness with the width parameter $\sigma$, which can depend on $x$
  no need to solve a linear system of equations
  more flexible: also for multidimensional distributions
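A one-dimensional version of the kernel density estimator fits in a few lines. This is a minimal sketch under assumptions: the data values, the width $\sigma = 0.5$ and the helper names `kde` and `integrate` are made up for illustration, and the trapezoidal rule is only there to check that the estimate integrates to one.

```python
import math

def phi(r):
    """Gaussian kernel phi(r) = exp(-r^2 / 2) / sqrt(2 pi)."""
    return math.exp(-r * r / 2.0) / math.sqrt(2.0 * math.pi)

def kde(data, sigma):
    """Return p_K(x) = sum_i phi(|x - x_i| / sigma) / (m sigma) for 1D data."""
    m = len(data)
    return lambda x: sum(phi(abs(x - xi) / sigma) for xi in data) / (m * sigma)

def integrate(f, a, b, n=3000):
    """Composite trapezoidal rule, used here only as a sanity check."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))

# Hypothetical sample and width parameter.
data = [1.1, 1.9, 2.0, 2.3, 3.7, 4.0, 4.2]
p_K = kde(data, sigma=0.5)
print(p_K(2.0), integrate(p_K, -5.0, 10.0))
```

No linear system is solved: every evaluation is a direct sum over the data, which is cheap here but costs $O(m)$ per query.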

17 Challenges of multidimensional density estimation
Low-dimensional case: for any $x$ only the $\phi(\|x - x_i\|/\sigma)$ for neighbouring $x_i$ are nonzero. This gives an efficient estimator, as every point has only few neighbours:
\[ p_K(x) = \sum_{x_i \in N(x)} \phi(\|x - x_i\|/\sigma) / (m\sigma) \]
High-dimensional case: all $x_i$ are neighbours, so one needs to consider all data points to evaluate the density:
\[ p_K(x) = \sum_{i=1}^m \phi(\|x - x_i\|/\sigma) / (m\sigma) \]
Very high-dimensional case: if $x, x_1, \dots, x_m$ are i.i.d. then all components $\phi(\|x - x_i\|/\sigma)/(m\sigma)$ are of the same size and the density is asymptotically uniform:
\[ p_K(x) \approx E(\phi(\|X_i - X_j\|/\sigma)) / \sigma \]

18 When concentration is not a problem for density estimation
  when the points of interest are on a low-dimensional submanifold
  when the unknown $p$ has a known simple structure, e.g.
\[ p(x_1,\dots,x_d) = \prod_{i=1}^d p_i(x_i) \]
  more generally, when the density is described by a graphical model, which leads to a factorisation as in
\[ p(x_1, x_2, x_3) = \frac{p(x_1, x_2)\, p(x_2, x_3)}{p(x_2)} \]
  mixture model
\[ p(x) = \sum_{i=1}^K p_i(x)\, \pi_i \]
  where $p_i(x) = p(x \mid x \in \Omega_i)$ has some known form and $\pi_i = p(\Omega_i)$

20 Partial differential equations
Partial differential equations are a very widely used tool in computational science.
Examples of equations:
  Schrödinger equation, quantum chemistry:
\[ i\hbar \frac{\partial \psi}{\partial t} = -\frac{\hbar^2}{2m} \Delta \psi + V \psi \]
  Chemical master equation, molecular biology:
\[ \frac{dp}{dt} = \sum_z (S_z - I) \Lambda_z p \]
  Vlasov equation, plasma physics:
\[ \frac{\partial f}{\partial t} + v^T \nabla_x f + q\,(E + v \times B) \cdot \nabla_p f = 0 \]
Dimensionality: $\psi$ and $p$ can depend on hundreds of variables, $f$ depends on five variables.

21 Controlling the function values
Sobolev norms are an important tool for PDE theory:
\[ \|u\|_k^2 = (-1)^k \int_\Omega u(x)\, \Delta^k u(x) \, dx \]
for $u \in C_0^\infty(\Omega)$ and completion.
Embedding for $k > d/2$: $|u(x)| \le C \|u\|_k$.
$k = 2$ from regularity theory gives bounded solutions for $d \le 3$: PDE regularity theory yields $\|u\|_2 < \infty$ and the Sobolev embedding yields $|u(x)| \le C \|u\|_2$.
Case $d > 3$: mixed norms
\[ \|u\|_{\mathrm{mix}}^2 = \int_\Omega \left| \frac{\partial^d u(x)}{\partial x_1 \cdots \partial x_d} \right|^2 dx \]
and so $|u(x)| \le C^d \|u\|_{\mathrm{mix}}$.

23 The grid
The challenge: curse of dimension.
  approximate an unknown function $u(x, y)$
  compute only the values $u(x_i, y_j)$ on discrete grid points
  interpolate the values $u(x, y)$ for other points $(x, y)$
  regular isotropic grid: $x_i = ih$ and $y_j = jh$
In two dimensions there are $1/h^2$ grid points and in $d$ dimensions $1/h^d$ grid points, but the accuracy is proportional to $h^2$.

24 Anisotropic grids
More general regular grids:
  choose a fine grid when $u(x, y)$ has large gradients
  choose a coarse grid when $u(x, y)$ is smooth
  gradients may be different in different directions
  choose an anisotropic grid when $u(x, y)$ varies differently in different directions
With anisotropic grids one can approximate multi-dimensional $u(x_1, \dots, x_d)$ if $u$ is very smooth in most $x_k$.

25 Grids and sampling
The full grid captures all scales; a subgrid captures fewer scales.
Evaluation of $u(x, y)$ on the grid corresponds to sampling $u$ on the grid points.
Sampling on a fine grid captures high frequencies, i.e. small-scale fluctuations (Nyquist/Shannon).
With anisotropic grids one can capture small scales in one dimension and different scales in another.

26 Sparse grid = union of regular anisotropic grids
[Figure: a simple sparse grid, drawn as the union of regular anisotropic grids]
A sparse grid in frequency / scale space captures fine scales in both dimensions but not joint fine scales.

27 Another sparse grid
[Figures: sparse grid points; sparse grid frequency diagram]
The frequency diagram displays 1/4 of a hyperbolic cross.

28 Sparse grids and the curse of dimension
[Figure: four-dimensional case, error against number of grid points for the isotropic grid and the sparse grid]
Only asymptotic error rates are given here; constants and preasymptotics also depend on the dimension.
Practical experience: with sparse grids up to 10 dimensions.
Zenger 1991:
  regular isotropic grids: number of points $h^{-d}$, error $h^2$
  sparse grids: number of points $h^{-1} |\log_2 h|^{d-1}$, error $h^2 |\log_2 h|^{d-1}$
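The difference in point counts can be made concrete by counting. The sketch below assumes one common convention that the slides do not spell out: interior points only, mesh width $h = 2^{-n}$, and a sparse grid built from hierarchical levels $l = (l_1, \dots, l_d)$, $l_i \ge 1$, with $l_1 + \cdots + l_d \le n + d - 1$, where level $l$ contributes $\prod_i 2^{l_i - 1}$ points. The function names are hypothetical.

```python
from itertools import product

def full_grid_points(n, d):
    """Interior points of the regular isotropic grid with h = 2^-n in d dimensions."""
    return (2 ** n - 1) ** d

def sparse_grid_points(n, d):
    """Interior points of the level-n sparse grid (assumed convention:
    hierarchical levels l_i >= 1 with l_1 + ... + l_d <= n + d - 1)."""
    total = 0
    for levels in product(range(1, n + 1), repeat=d):
        if sum(levels) <= n + d - 1:
            total += 2 ** (sum(levels) - d)   # prod_i 2^(l_i - 1)
    return total

# Four-dimensional case, as on the slide.
for n in range(1, 7):
    print(n, full_grid_points(n, 4), sparse_grid_points(n, 4))
```

In one dimension both counts agree, and the gap widens quickly as $d$ and $n$ grow.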

29 The big plan: dimension independence
Problem of sparse grids: exponential $d$-dependence of time and error through
  factors $|\log_2(h)|^{d-1}$
  factors of the form $C^d$
Aim: remove all exponential $d$-dependence so that error $\sim h^2$ and time $\sim 1/h$, as in the case $d = 1$.
Ideas:
  parallel solution on subgrids (see next section) gives $1/h$ time
  stronger (energy) sparse grids give $h^2$ error
  weighted mixed norms and special basis functions to deal with the $C^d$ dependence

30 Spatially adaptive (sparse) grids
Choosing a sparse sub-grid of the sparse grid:
  adaptively choose necessary sparse grid points and corresponding (hierarchical) basis functions
  requires an error indicator function
  grid points are inserted only where necessary
  acts as extra regularisation (like smoothing) for machine learning applications
  modified basis functions for the boundary to remove the $C^d$
Implemented in the SG++ software package by Dirk Pflüger (Universität Stuttgart).

32 Combining two regular grids
[Figure: the combined grid shown as the sum of two regular grids]
The solution on the combined grid is approximated as a linear combination of the solutions on the regular component grids.
The components include the maximal generators and all intersections.
[Figure: union and intersection grids]

33 Weak solutions of boundary value problems
Boundary value problem:
\[ -\Delta u(x) = f(x), \quad x \in \Omega, \qquad u(x) = 0, \quad x \in \partial\Omega \]
Weak solution:
\[ a(u, v) = \langle f, v \rangle \quad \text{for all } v \in H_0^1(\Omega), \]
where
\[ a(u, v) = \int_\Omega \nabla u(x)^T \nabla v(x) \, dx, \qquad \langle f, v \rangle = \int_\Omega f(x)\, v(x) \, dx \]
Approximate solution $u_h \in V_h$:
\[ a(u_h, v_h) = \langle f, v_h \rangle \quad \text{for all } v_h \in V_h \]
The approximate solution can be viewed as a projection $u_h = P_h u$ which is orthogonal with respect to the energy norm.
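For intuition, the Galerkin equations can be written out in one dimension. The following sketch is an illustration and not taken from the lecture: it assumes $\Omega = (0,1)$, piecewise linear hat functions on a uniform grid with $n$ intervals, and a load vector approximated by $h f(x_i)$, which is exact for constant $f$. The function name `solve_poisson_1d` is hypothetical.

```python
def solve_poisson_1d(f, n):
    """Galerkin solution of -u'' = f on (0, 1), u(0) = u(1) = 0, with
    piecewise linear elements on n equal intervals. Returns nodal values."""
    h = 1.0 / n
    m = n - 1                                  # number of interior nodes
    diag = [2.0 / h] * m                       # a(phi_i, phi_i)
    off = -1.0 / h                             # a(phi_i, phi_{i+1})
    rhs = [h * f((i + 1) * h) for i in range(m)]   # approximates <f, phi_i>
    # Thomas algorithm for the tridiagonal system.
    for i in range(1, m):
        w = off / diag[i - 1]
        diag[i] -= w * off
        rhs[i] -= w * rhs[i - 1]
    u = [0.0] * m
    u[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):
        u[i] = (rhs[i] - off * u[i + 1]) / diag[i]
    return [0.0] + u + [0.0]

# For f = 1 the exact solution is u(x) = x (1 - x) / 2.
u_h = solve_poisson_1d(lambda x: 1.0, 8)
print(u_h)
```

For $f = 1$ the computed nodal values coincide with the exact solution $x(1-x)/2$ at the nodes, a special property of linear elements in one dimension.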

34 Combination approximations
Regular grid approximation:
  regular grid $G_h$
  function space $V_h$
  Galerkin equations for $u_h$: $a(u_h, v_h) = \langle f, v_h \rangle$ for all $v_h \in V_h$
Sparse grid approximation:
  sparse grid $G_{SG} = \bigcup_h G_h$
  function space $V_{SG} = \sum_h V_h$
  Galerkin equations for $u_{SG}$: $a(u_{SG}, v_{SG}) = \langle f, v_{SG} \rangle$ for all $v_{SG} \in V_{SG}$
Combination technique, where HPC comes in: compute all $u_h$ in parallel and combine the solutions using parallel reduction,
\[ u_C = \sum_h c_h u_h \]
Big question: when is $u_C \approx u_{SG}$?

35 Sparse grid combination technique
[Figures: sparse grid points; sparse grid frequency diagram]
Combination formula:
\[ u_C = u_{1,16} + u_{2,8} + u_{4,4} + u_{8,2} + u_{16,1} - u_{1,8} - u_{2,4} - u_{4,2} - u_{8,1} \]
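The combination formula can be tried with interpolation in place of a PDE solver. In this sketch, which is an assumption-laden illustration, $u_{a,b}$ is taken to be the piecewise bilinear interpolant of a given function on the grid with $a \times b$ intervals on the unit square; with `n = 4` the grids are the nine grids of the formula above. The helper names are hypothetical.

```python
import math

def weights_1d(x, n):
    """(index, weight) pairs for linear interpolation on n equal intervals of [0, 1]."""
    t = min(int(x * n), n - 1)
    w = x * n - t
    return ((t, 1.0 - w), (t + 1, w))

def grid_interpolant(f, nx, ny):
    """Piecewise bilinear interpolant of f on the grid with nx x ny intervals."""
    vals = [[f(i / nx, j / ny) for j in range(ny + 1)] for i in range(nx + 1)]
    def u(x, y):
        return sum(wx * wy * vals[i][j]
                   for i, wx in weights_1d(x, nx)
                   for j, wy in weights_1d(y, ny))
    return u

def combination(f, n):
    """u_C = sum over 2^i x 2^(n-i) grids minus sum over 2^i x 2^(n-1-i) grids."""
    plus = [grid_interpolant(f, 2 ** i, 2 ** (n - i)) for i in range(n + 1)]
    minus = [grid_interpolant(f, 2 ** i, 2 ** (n - 1 - i)) for i in range(n)]
    return lambda x, y: sum(u(x, y) for u in plus) - sum(u(x, y) for u in minus)

f = lambda x, y: math.exp(x * y)     # hypothetical test function
u_C = combination(f, 4)
print(u_C(0.3, 0.7), f(0.3, 0.7))
```

For an additive function $g(x) + h(y)$ all the coarse contributions telescope and $u_C$ coincides with the interpolant on the full $16 \times 16$ grid, which is the simplest case of the error model behind the technique.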

36 Inclusion / exclusion principle
In combinatorics, for the cardinality of sets:
\[ |A \cup B| = |A| + |B| - |A \cap B| \]
More generally, for additive $\alpha$:
\[ \alpha(A \cup B) = \alpha(A) + \alpha(B) - \alpha(A \cap B) \]
Theorem (de Moivre). If $A_1, \dots, A_m$ form an intersection structure then
\[ \alpha\Big( \bigcup_{i=1}^m A_i \Big) = \sum_{i=1}^m c_i\, \alpha(A_i) \]
for some $c_i \in \mathbb{Z}$.
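The integer coefficients $c_i$ can be computed by expanding inclusion-exclusion over the maximal sets and collecting equal intersections. The sketch below does this for grids; it is an illustration under assumptions of my own: grids are stored as sets of integer coordinates on the finest $2^n \times 2^n$ grid, and the helper names `grid` and `combination_coefficients` are hypothetical.

```python
from itertools import combinations

def grid(i, j, n):
    """Points of the grid with 2^i x 2^j intervals, in integer
    coordinates of the finest 2^n x 2^n grid."""
    sx, sy = 2 ** (n - i), 2 ** (n - j)
    return frozenset((a * sx, b * sy)
                     for a in range(2 ** i + 1) for b in range(2 ** j + 1))

def combination_coefficients(generators):
    """Expand inclusion-exclusion over all non-empty subsets of the
    generators and add up the signs for each distinct intersection."""
    coeffs = {}
    for r in range(1, len(generators) + 1):
        for subset in combinations(generators, r):
            inter = frozenset.intersection(*subset)
            coeffs[inter] = coeffs.get(inter, 0) + (-1) ** (r + 1)
    return coeffs

n = 2
generators = [grid(i, n - i, n) for i in range(n + 1)]
coeffs = combination_coefficients(generators)
union = frozenset.union(*generators)
print(len(union), sum(c * len(A) for A, c in coeffs.items()))
```

For these grids the nonzero coefficients are $+1$ on the grids with $i + j = n$ and $-1$ on those with $i + j = n - 1$; the coarser intersections cancel.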

37 When the combination approximation is the sparse grid solution
Lemma. If the grids $G_h$ and the spaces $V_h$ form an intersection structure, and if the Galerkin projections $P_h$ commute, i.e.,
\[ P_h P_{h'} = P_{h'} P_h \quad \text{for all } h, h', \]
then $u_C = u_{SG}$, i.e., the combination technique provides the sparse grid solution.
Proof. This is a consequence of the inclusion-exclusion principle, as it follows from the commutativity that $P_h$ is additive.

38 Tensor products: the classical sparse grid
$V_1 \subset V_2 \subset \cdots \subset V_m \subset V$: hierarchy of spaces of functions of one variable.
Classical sparse grid space:
\[ V_{SG} = \sum_{i+j=n} V_i \otimes V_j \]
Tensor product function space $V_i \otimes V_j$: the space of functions generated by products $(u_i \otimes u_j)(x_1, x_2) = u_i(x_1)\, u_j(x_2)$, where $u_i \in V_i$ and $u_j \in V_j$.
The $V_i \otimes V_j$ form an intersection structure, as
\[ (V_{i_1} \otimes V_{j_1}) \cap (V_{i_2} \otimes V_{j_2}) = V_{\min(i_1,i_2)} \otimes V_{\min(j_1,j_2)} \]
Combination coefficients:
\[ c_{ij} = \begin{cases} 1 & i + j = n \\ -1 & i + j = n - 1 \\ 0 & \text{else} \end{cases} \]
The combination formula is exact if $a(u_1 \otimes u_2, v_1 \otimes v_2) = a(u_1, v_1)\, a(u_2, v_2)$ [Griebel, Schneider, Zenger 1992].

39 Extrapolation
Assumption (error model): the error of the approximation in $V_{ij} = V_i \otimes V_j$ is of the form
\[ e_{ij} = e_i^{(1)} + e_j^{(2)} + r_{ij}, \]
which is a type of ANOVA decomposition for the error.
Consequence: the error of the combination technique is
\[ e_h = \sum_{i+j \le n} c_{ij}\, e_{ij} = e_n^{(1)} + e_n^{(2)} + \sum_{i+j \le n} c_{ij}\, r_{ij} \]
If the last term is very small then $e_h \approx e_{n,n}$, i.e., the combination technique approximation, using only components in $V_i \otimes V_j$ with $i + j \le n$, attains a similar approximation order to the one in $V_{nn}$.
[Bungartz et al 1994, Pflaum and Zhou 1999, Liem, Lu, Shih 1995 (splitting extrapolation)]

40 Breakdown of the combination technique
Regression problem: minimise
\[ \frac{1}{M} \sum_{i=1}^M (u(x_i) - y_i)^2 + \lambda \|u\|^2 \]
[Figures: results with $\lambda = 10^{-4}$ (left) and $\lambda = 10^{-6}$ (right)]
The combination approximation is not necessarily better for finer grids.
[Garcke 2004, H. 2003, H., Garcke, Challis 2007]

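41 Opticom
Optimal combination technique: choose the coefficients $c_i$ such that $J$ is optimised, with
\[ J(c_1, \dots, c_m) = \Big\| u - \sum_{i=1}^m c_i u_i \Big\|_E^2 = \sum_{i,j=1}^m c_i c_j\, a(u_i, u_j) - 2 \sum_{i=1}^m c_i \|u_i\|_E^2 + \|u\|_E^2 \]
Normal equations:
\[
\begin{pmatrix}
\|u_1\|_E^2 & a(u_1, u_2) & \cdots & a(u_1, u_m) \\
a(u_2, u_1) & \|u_2\|_E^2 & \cdots & a(u_2, u_m) \\
\vdots & \vdots & \ddots & \vdots \\
a(u_m, u_1) & a(u_m, u_2) & \cdots & \|u_m\|_E^2
\end{pmatrix}
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix}
=
\begin{pmatrix} \|u_1\|_E^2 \\ \|u_2\|_E^2 \\ \vdots \\ \|u_m\|_E^2 \end{pmatrix}
\]
The solution of the system costs at most $O(m^3)$ operations, but one first needs to determine the scalar products $a(u_i, u_j)$.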
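A toy version of the normal equations shows the mechanics. The sketch below is a finite-dimensional stand-in chosen for illustration, not the lecture's setting: $V = \mathbb{R}^3$, the energy inner product $a(\cdot,\cdot)$ is the Euclidean dot product, and each subspace $V_i$ is spanned by a single vector, so $u_i$ is the orthogonal projection of $u$ onto that vector. All names are hypothetical.

```python
import math

def dot(a, b):
    """Stand-in for the energy inner product a(., .)."""
    return sum(x * y for x, y in zip(a, b))

def project(u, w):
    """Galerkin projection of u onto span{w}."""
    s = dot(u, w) / dot(w, w)
    return [s * x for x in w]

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            t = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= t * M[k][j]
    c = [0.0] * n
    for i in reversed(range(n)):
        c[i] = (M[i][n] - sum(M[i][j] * c[j] for j in range(i + 1, n))) / M[i][i]
    return c

def opticom(components):
    """Solve the normal equations G c = b with G_ij = a(u_i, u_j), b_i = ||u_i||^2."""
    G = [[dot(ui, uj) for uj in components] for ui in components]
    rhs = [dot(ui, ui) for ui in components]
    return solve(G, rhs)

def error(u, components, coeffs):
    """Energy norm of u - sum_i c_i u_i."""
    approx = [sum(c * ui[k] for c, ui in zip(coeffs, components))
              for k in range(len(u))]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, approx)))

u = [1.0, 2.0, 3.0]
components = [project(u, [1.0, 0.0, 0.0]), project(u, [1.0, 1.0, 0.0])]
c = opticom(components)
err_opticom = error(u, components, c)
err_single = [error(u, [ui], [1.0]) for ui in components]
print(c, err_opticom, err_single)
```

In this example the optimal combination reproduces the projection of $u$ onto the sum of the two subspaces, and its error is no larger than that of either component alone.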

42 Opticom is better than the sub-grid solutions
Let $V_{SG} = \sum_{i=1}^n V_i \subset V$, let $a(\cdot,\cdot)$ be a $V$-elliptic and bounded symmetric bilinear form, let $u_i \in V_i$ be defined by $a(u_i, v_i) = a(u, v_i)$ for all $v_i \in V_i$, and let $c_i$ be the Opticom coefficients. Then the error in the energy norm satisfies
\[ \|u - u_{SG}\|_E \le \Big\| u - \sum_{i=1}^n c_i u_i \Big\|_E \le \min_{i=1,\dots,n} \|u - u_i\|_E \]
The standard combination technique does not have this property.

43 The optimality of Opticom
Proposition. Let $V_{SG} = \sum_{i=1}^n V_i \subset V$, let $a(\cdot,\cdot)$ be a $V$-elliptic and bounded bilinear form, let $u_i \in V_i$ be defined by $a(u_i, v_i) = a(u, v_i)$ for all $v_i \in V_i$, and let $c_i$ be the Opticom coefficients. Then for some $\kappa > 0$ one has
\[ \Big\| u - \sum_{i=1}^n c_i u_i \Big\|_V \le \kappa\, \Big\| u - \sum_{i=1}^n \tilde c_i u_i \Big\|_V \]
for any $\tilde c_i \in \mathbb{R}$.
Proof. This is a direct application of Céa's Lemma.

44 Norm reduction with Opticom
Proposition. Let $V_{SG} = \sum_{i=1}^n V_i \subset V$, let $a(\cdot,\cdot)$ be a $V$-elliptic and bounded symmetric bilinear form, let $u_i \in V_i$ be defined by $a(u_i, v_i) = a(u, v_i)$ for all $v_i \in V_i$, and let $c_i$ be the Opticom coefficients. Then one has for the energy norm defined by $a(\cdot,\cdot)$ the bound
\[ \Big\| u - \sum_{i=1}^n c_i u_i \Big\|_E \le \|u\|_E, \]
and either $\| u - \sum_{i=1}^n c_i u_i \|_E < \|u\|_E$, or $f \perp V_h$ and thus $u_i = 0$, $i = 1, \dots, n$.

45 Proof.
\[ \|u\|_E^2 = \Big\| u - \sum_{i=1}^n c_i u_i \Big\|_E^2 + \Big\| \sum_{i=1}^n c_i u_i \Big\|_E^2 \]
If the best approximation is zero then $u$ has to be orthogonal to all $u_i$. As $u - u_i$ is orthogonal to $V_i$, it follows that all the $v$ which are orthogonal to $u_i$ are also orthogonal to $u$, and it follows that $u$ is orthogonal to $V_i$.

46 An iterative method
Opticom iterative refinement:
  $u^{(0)} = 0$
  $a(u_i^{(k+1)}, v_i) = a(u - u^{(k)}, v_i)$ for all $v_i \in V_i$
  $c_i^{(k+1)}$ such that $\big\| (u - u^{(k)}) - \sum_{i=1}^n c_i^{(k+1)} u_i^{(k+1)} \big\|_E$ is minimal
  $u^{(k+1)} = u^{(k)} + \sum_{i=1}^n c_i^{(k+1)} u_i^{(k+1)}$
The algorithm converges to the sparse grid solution.
It is a variant of parallel subspace correction [Xu 1992].
It can also be combined with Newton [Griebel, H. 2010].

47 The machine learning problem
Given data $x_1, \dots, x_N$ in $\mathbb{R}^d$, find a density $f(x)$ such that
\[ f \approx \frac{1}{N} \sum_{k=1}^N \delta_{x_k} \]
[Tapia & Thompson 1978, Silverman 1986, Scott 1992]

48 From data smoothing to sums of simpler approximations
Function approximation: RBF $\longrightarrow$ sums of products
\[ f(x) = \sum_{i=1}^N c_i\, \kappa(x - x_i) \quad \longrightarrow \quad f(x) = \sum_{k=1}^K c_k \prod_{j=1}^d f_{j,k}(\xi_j), \]
where $x = (\xi_1, \dots, \xi_d)$ and the $x_i$ are data points.
Density estimation: kernel density estimators $\longrightarrow$ mixture models
\[ f(x) = \frac{1}{N} \sum_{i=1}^N \kappa(x - x_i) \quad \longrightarrow \quad f(x) = \sum_{k=1}^K \pi_k\, N(x \mid \mu_k, C_k) \]
[McLachlan and Peel, 2000]

50 Bayesian inference
Given data and models:
  data, e.g. $x = (x_1, \dots, x_N)$, typically $x \in \mathbb{R}^{Nd}$
  likelihood of data: $p(x|z)$
  prior of parameters $z$: $p(z)$, which models reasonable assumptions about $z$
Bayes rule, i.e. how to adapt $p(z)$ in light of the evidence:
\[ p(z|x) = \frac{p(x|z)\, p(z)}{p(x)}, \quad \text{where } p(x) = \int p(x|z)\, p(z) \, dz \]
Here $p(x)$ is $p_X(X = x)$, $p(x|z)$ is $p_{X|Z}(X = x \mid Z = z)$, etc.
What to do with the posterior:
  expectations: $E(Y) = \iint y\, p(y|z)\, p(z|x) \, dz \, dy$
  probability distributions: $p(y) = \int p(y|z)\, p(z|x) \, dz$
  maximum: $z_{\max} = \operatorname{argmax}_z p(z|x) = \operatorname{argmax}_z p(x|z)\, p(z)$
Tractability: the computation of $p(x)$, $p(y)$ and $E(Y)$ requires in general high-dimensional integrals, hence approximation.
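In one dimension the integrals in Bayes rule can simply be replaced by sums on a grid. The sketch below is a hypothetical example, not from the lecture: $z$ is the unknown bias of a coin, the data are 7 heads in 10 tosses, the prior is uniform on a grid of 101 values, and `posterior_on_grid` is an assumed helper name.

```python
def posterior_on_grid(heads, tosses, grid, prior):
    """Discrete Bayes rule: posterior weights p(z|x) on the grid points."""
    like = [z ** heads * (1 - z) ** (tosses - heads) for z in grid]   # p(x|z)
    unnorm = [l * p for l, p in zip(like, prior)]                     # p(x|z) p(z)
    evidence = sum(unnorm)      # discrete stand-in for p(x) = integral of p(x|z) p(z) dz
    return [w / evidence for w in unnorm]

grid = [k / 100 for k in range(101)]
prior = [1.0 / len(grid)] * len(grid)          # uniform prior (assumption)
post = posterior_on_grid(7, 10, grid, prior)

z_map = grid[max(range(len(grid)), key=lambda k: post[k])]   # posterior maximum
z_mean = sum(z * w for z, w in zip(grid, post))              # posterior expectation
print(z_map, z_mean)
```

The same brute-force summation needs $101^d$ terms for a $d$-dimensional parameter, which is the tractability problem named on the slide.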

51 Density estimation
Data: $x = (x_1, \dots, x_N)$ drawn randomly from some unknown probability distribution.
Probability density model: $f(x_k) = p(x_k|u) = \exp(u(x_k) - \gamma(u))$, where $\gamma(u)$ is such that $\int p(x_k|u) \, dx_k = 1$.
Estimation problem: for given data $x_1, \dots, x_N$ find $\hat u$ such that $p(x_k|\hat u)$ approximates the underlying density.
Likelihood:
\[ p(x|u) = \exp\Big( \sum_{i=1}^N u(x_i) - N \gamma(u) \Big) \]
Choose $\hat u$ such that the likelihood is large.
Parametric case: maximum likelihood method.
The problem is underdetermined in the nonparametric case.

52 MAP with Gaussian process priors
Prior for $u$: Gaussian probability measure $\nu$ over a space of functions = Gaussian process prior.
We consider the covariance $C = C_1 \otimes \cdots \otimes C_d$.
Posterior based on the likelihood $\rho(u) = p(x|u)$: $d\mu = \rho \, d\nu$ is a well-defined measure if $\rho \in L_1(\nu)$.
Maximum a-posteriori (MAP) method: estimate $u$ as the mode of the posterior.
Laplace approximation of the posterior: Gaussian process with expectation $u$.

53 A variational problem
Characterisation of the mode $u$ of $\mu$:
\[ \rho(u) \ge \frac{d\lambda_v}{d\lambda}(u)\, \rho(u + v) \quad \text{for all } v \in H, \]
where $\lambda_v(A) = \lambda(v + A)$.
This leads to the minimisation of the functional
\[ j(u) = \frac{1}{n} \|u\|_{CM}^2 - \frac{1}{n} \sum_{i=1}^n u(x_i) + \log \int_X \exp(u(x)) \, dx, \]
where $\|\cdot\|_{CM}$ is the Cameron-Martin norm defined by the prior.
[H. 2007, Griebel, H. 2010]

54 Newton-Galerkin Opticom method
Newton-Galerkin: $u_{n+1} = u_n + \Delta u_n$, where $\Delta u_n$ minimises
\[ J(\Delta u) = \tfrac{1}{2} H_{u_n}(\Delta u, \Delta u) + (F(u_n), \Delta u)_H \]
Sparse grid space: $V = \sum_j V^{(j)}$.
Sparse grid combination technique:
\[ \Delta u_n = \sum_{j=1}^k c_j\, \Delta u_n^{(j)}, \]
where the components $\Delta u_n^{(j)}$ minimise $J(\Delta u)$ over $V^{(j)}$.
Opticom method: choose the combination coefficients $c_j$ to minimise $J\big( \sum_{j=1}^k c_j\, \Delta u_n^{(j)} \big)$.
  descent method, converges to the sparse grid solution, not some combination approximation
  inexact Newton method [Deuflhard, Weiser 1996, Deuflhard 2004]
  alternative: nonlinear additive Schwarz [Dryja, Hackbusch 1997]
[H., Griebel 2007]

55 Errors of sparse grid approximation for the 2D case
Approximation of the normal distribution: maximum likelihood projection and our estimator.
[Table: for each level $l$, the errors $e^{(1)}_{1,l}$, $e^{(1)}_{2,l}$, $e^{(3)}_{1,l}$, $e^{(3)}_{2,l}$ and the ratios of successive errors such as $e^{(1)}_{1,l} / e^{(1)}_{1,l+1}$; the numerical values are missing from this transcript]

56 3D density
[Figure: 3D density, from Griebel, H. 2010]

58 Two variational problems
Ritz-Galerkin for $u_h$:
  error: energy norm $\|u - u_h\|_E = \sqrt{a(u - u_h, u - u_h)}$
  optimisation problem: minimise $J(u_h) = \tfrac{1}{2} a(u_h, u_h) - \langle f, u_h \rangle$
  property: $V$-ellipticity
  data: $f$
Variational Bayes for $q$:
  error: KL-divergence $KL(q \,\|\, p(\cdot|x)) = -\int q(z) \log \frac{p(z|x)}{q(z)} \, dz$
  optimisation problem: maximise $L(q) = \int q(z) \log \frac{p(x,z)}{q(z)} \, dz$
  property: convexity
  data: $p(x, z)$
  use $KL(q \,\|\, p(\cdot|x)) + L(q) = \log p(x)$
[Beal 2003, MacKay 2003, Bishop 2006]

59 Fixed point characterisation of the best product
Proposition (characterisation). If $q = \prod_{i=1}^m q_i$ is the best approximant then
\[ u_j(z_j) = \int \log(p(x, z)) \prod_{i \ne j} q_i(z_i) \, dz_i, \qquad q_j(z_j) = \frac{\exp(u_j(z_j))}{\int \exp(u_j(w_j)) \, dw_j} \]
Proof.
\[ L(q) = \int \prod_{i=1}^m q_i(z_i) \Big( \log p(x, z) - \sum_{i=1}^m \log q_i(z_i) \Big) dz = \int \big( u_j(z_j) - \log q_j(z_j) \big)\, q_j(z_j) \, dz_j + F(\{q_i\}_{i \ne j}) \]

60 Iterative solver
Start with $q_1^{(0)}, \dots, q_m^{(0)}$.
For $n = 1, 2, \dots$ and $j = 1, \dots, m$:
\[ u_j^{(n)}(z_j) = \int \log p(x, z) \prod_{i=1}^{j-1} q_i^{(n)}(z_i) \, dz_i \prod_{i=j+1}^{m} q_i^{(n-1)}(z_i) \, dz_i \]
\[ q_j^{(n)}(z_j) = \frac{\exp u_j^{(n)}(z_j)}{\int \exp u_j^{(n)}(w_j) \, dw_j} \]
Convergence, as the KL-divergence is convex in $u_j$.
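For discrete variables the integrals become sums and the sweep over the factors can be written down directly. The sketch below handles the case $m = 2$; it is an illustration under assumptions: `p[a][b]` holds strictly positive (possibly unnormalised) values of $p(x, z)$ for $z = (a, b)$, both factors start uniform, and the function names are hypothetical.

```python
import math

def normalise_exp(u):
    """Return exp(u) / sum(exp(u)), shifted by max(u) for numerical stability."""
    m = max(u)
    w = [math.exp(v - m) for v in u]
    s = sum(w)
    return [x / s for x in w]

def mean_field(p, sweeps=50):
    """Product approximation q1(a) q2(b) of a positive table p[a][b]
    by alternating fixed point updates (Gauss-Seidel order)."""
    n1, n2 = len(p), len(p[0])
    logp = [[math.log(v) for v in row] for row in p]
    q1 = [1.0 / n1] * n1
    q2 = [1.0 / n2] * n2
    for _ in range(sweeps):
        u1 = [sum(q2[b] * logp[a][b] for b in range(n2)) for a in range(n1)]
        q1 = normalise_exp(u1)          # uses the previous q2
        u2 = [sum(q1[a] * logp[a][b] for a in range(n1)) for b in range(n2)]
        q2 = normalise_exp(u2)          # uses the updated q1
    return q1, q2

# Hypothetical table with some dependence between the two variables.
p = [[0.30, 0.10, 0.05],
     [0.05, 0.20, 0.30]]
print(mean_field(p))
```

When the table is itself a product $r(a)\, s(b)$ the iteration returns $r$ and $s$ after the first sweep; otherwise it settles on a product that ignores the dependence between the variables.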

61 Mixture models
Probability distribution for the $n$-th observation:
\[ p(x_n|u) = \sum_{k=1}^K \pi_k\, p_k(x_n|u_k) \]
The components $p_k$ are separable: sums of separable functions.
Problem: estimation of $p$ given data $x = (x_1, \dots, x_N)$.
Difficulty: while each $p_k$ has a product structure which is adapted to KL minimisation, the sum is a problem.
Idea: introduce latent (or hidden) variables $z_1, \dots, z_N$, which are binary vectors indicating the class $k$ of observation $n$. Thus
\[ p(x_n|u) = \sum_{k=1}^K p(x_n \mid u_k, z_n = e_k)\, p(z_n = e_k) = \sum_{z_n} p(x_n, z_n \mid u) \]
is interpreted as a marginal distribution, with $u = (u_1, \dots, u_K)$.

62 Likelihood, priors and posterior
Likelihood of $x = (x_1, \dots, x_N)$ (the sum has disappeared):
\[ p(x \mid z, u) = \prod_{n=1}^N \prod_{k=1}^K p(x_n|u_k)^{z_{nk}} \]
Prior for the latent variables $z_n$:
\[ p(z|\pi) = \prod_{n=1}^N \prod_{k=1}^K \pi_k^{z_{nk}} \]
Priors for $\pi$ and $u$: $p(\pi)$ and $p(u)$.
Posterior distribution:
\[ p(z, \pi, u \mid x) = p(x, z, \pi, u) / p(x), \quad \text{where } p(x, z, \pi, u) = p(x \mid z, u)\, p(z|\pi)\, p(\pi)\, p(u) \]

63 Variational Bayes for mixture models
Aim: $q(z, \pi, u)$ as approximation of the posterior $p(z, \pi, u \mid x)$.
Product Ansatz: $q(z, \pi, u) = q(z)\, q(\pi, u)$.
Fixed point formulation from minimal KL divergence:
\[ q(z) = \prod_{n=1}^N \prod_{k=1}^K r_{nk}^{z_{nk}} \]
\[ q(\pi, u) = C\, p(\pi) \prod_{k=1}^K p(u_k) \prod_{n=1}^N \prod_{k=1}^K \big( \pi_k\, p(x_n|u_k) \big)^{r_{nk}}, \]
where $r_{nk} = \rho_{nk} / \sum_k \rho_{nk}$ and $\log \rho_{nk} = E_\pi[\log \pi_k] + E_u[\log p(x_n|u_k)]$.
Approximate $p(x_n|u_k)$ as a product to get tractability.
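The normalisation of the responsibilities $r_{nk}$ is best done in the log domain. The sketch below assumes, as in the standard treatment (e.g. Bishop 2006), that the sum of the two expectations gives $\log \rho_{nk}$; the function name and the input values are hypothetical.

```python
import math

def responsibilities(log_rho):
    """Turn a table log_rho[n][k] into r[n][k] = rho_nk / sum_k rho_nk.
    Subtracting the row maximum avoids underflow for very negative entries."""
    r = []
    for row in log_rho:
        m = max(row)
        w = [math.exp(v - m) for v in row]
        s = sum(w)
        r.append([x / s for x in w])
    return r

# Hypothetical values of E[log pi_k] + E[log p(x_n | u_k)] for two
# observations and two components.
log_rho = [[math.log(1.0), math.log(3.0)],
           [-1000.0, -1001.0]]
print(responsibilities(log_rho))
```

Each row sums to one, and the second row shows that the shift keeps the result finite even when every $\rho_{nk}$ would underflow to zero.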

64 Conclusions
With the wide availability of new computational resources, high performance computing ideas now enter mainstream computational science.
HPC is getting increasingly complex, with a shift towards new problems and approaches characterised by the multi-challenge.
An increasingly important challenge originates from the multi- and high dimensionality of many models in physics, chemistry, biology, statistics and engineering; new theory and algorithms are needed.
The sparse grid combination technique deals with dimensionality and is ideally suited for HPC.
The Opticom method is able to overcome a stability issue of the original combination technique.
Next: high-dimensional inverse problems?
