Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points

1 Breaking Computational Barriers: Multi-GPU High-Order RBF Kernel Problems with Millions of Points
Michael Griebel, Christian Rieger, Peter Zaspel
Institute for Numerical Simulation, Rheinische Friedrich-Wilhelms-Universität Bonn
GPU Technology Conference 2014, March 24-27, 2014, San José, CA, USA

2 Outline
1. Motivation
2. Radial basis function interpolation
3. Preconditioning for large-scale kernel interpolation problems
4. Summary

3 Motivation
Meshfree interpolation:
interpolation: reconstruction of a continuous function from point evaluations;
meshfree: evaluation points at arbitrary locations, no mesh.
Fields of application:
classical applications: computer graphics, signal processing, computer vision, ...;
large-scale data analysis: Big Data, data mining, machine learning, ...;
solving PDEs by collocation methods: computational fluid dynamics, stochastic collocation, ...

4 My personal motivation: Uncertainty quantification in CFD (1)
Current standard in computational fluid dynamics: simulations for fixed and known input parameters.
Problem:
natural phenomena: input data not known exactly;
engineering: constructions / measurements always subject to perturbations.

5 My personal motivation: Uncertainty quantification in CFD (2)
Algorithmic idea:
1. sampling of stochastic input parameters according to some distribution
2. computation of hundreds or thousands of stochastic realizations (high-resolution simulations)
3. extraction of averaged data (expectation value), variance information, ... as post-processing step
[Figure: workflow from sampling of the stochastic space via a config. file generator to multiple CFD solver runs, followed by a stochastics tool producing E[u], Var[u], Cov[u], K.-L.]
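
The three steps of the algorithmic idea can be sketched as a plain Monte Carlo loop. This is only an illustration: `toy_solver` is a hypothetical stand-in for one CFD run, and the uniform input distribution on [0.5, 1.5] is an assumption, not something taken from the slides.

```python
import random
import statistics


def toy_solver(viscosity):
    """Hypothetical stand-in for one CFD run; returns a scalar quantity of interest."""
    return 1.0 / (1.0 + viscosity)


def monte_carlo_uq(num_samples, seed=0):
    rng = random.Random(seed)
    # Step 1: sample the uncertain input (assumed uniform on [0.5, 1.5]).
    samples = [rng.uniform(0.5, 1.5) for _ in range(num_samples)]
    # Step 2: one independent "simulation" per sample.
    outputs = [toy_solver(nu) for nu in samples]
    # Step 3: post-process the realizations into mean and variance estimates.
    return statistics.fmean(outputs), statistics.variance(outputs)


mean_u, var_u = monte_carlo_uq(2000)
```

Because the realizations in step 2 are independent, they can run in parallel; only the post-processing in step 3 combines them.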

6 Outline
1. Motivation
2. Radial basis function interpolation
3. Preconditioning for large-scale kernel interpolation problems
4. Summary

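7 Basic facts
Interpolation problem. Given: a function $f : \Omega \to \mathbb{R}$ and sampling points $X := \{y_1, \dots, y_N\} \subset \Omega$, $\Omega \subset \mathbb{R}^d$. Target: a function $s_{f,X} : \Omega \to \mathbb{R}$ such that
\[ s_{f,X}(y_j) = f(y_j) \quad \text{for all } j = 1, \dots, N. \]
Radial basis functions.
Gaussian: $\varphi_j(y) := e^{-\epsilon^2 \|y - y_j\|^2}$
Matérn function: $\varphi_j(y) := \dfrac{K_{\beta - d/2}(\|y - y_j\|)\, \|y - y_j\|^{\beta - d/2}}{2^{\beta - 1}\, \Gamma(\beta)}$, $\beta > \frac{d}{2}$
Kernel functions. Functions $k$ of type $k : \Omega \times \Omega \to \mathbb{R}$; radial basis function case: $k(y, y_j) := \psi(\|y - y_j\|)$, e.g. $\psi(r) = e^{-\epsilon^2 r^2}$.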

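8 Interpolation with kernel functions (1)
Kernel-based interpolation problem. $\mathcal{F}$ Hilbert function space, $f : \Omega \to \mathbb{R}$, points with function evaluations: $X := \{y_1, \dots, y_N\} \subset \Omega$, $f_j := f(y_j)$, $j = 1, \dots, N$. For a kernel function $k : \Omega \times \Omega \to \mathbb{R}$, looking for $s_{f,X} \in \mathcal{F}$ with
\[ s_{f,X}(y) := \sum_{j=1}^{N} \alpha_j\, k(y, y_j), \quad y \in \Omega. \]
Solution of the interpolation problem $s_{f,X}(y_j) = f_j$, $1 \le j \le N$:
\[ A_{k,X}\, \alpha = f, \quad A_{k,X} = \begin{pmatrix} k(y_1, y_1) & \cdots & k(y_1, y_N) \\ \vdots & \ddots & \vdots \\ k(y_N, y_1) & \cdots & k(y_N, y_N) \end{pmatrix}, \quad f = \begin{pmatrix} f(y_1) \\ \vdots \\ f(y_N) \end{pmatrix} \]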
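
As a small illustration of the system $A_{k,X}\alpha = f$, the following sketch assembles the kernel matrix for a Gaussian kernel, solves for the coefficients with a textbook dense elimination, and evaluates the interpolant. The shape parameter `eps = 4.0`, the sample points, the test function and the helper names (`solve_dense`, `fit`, `evaluate`) are all assumptions chosen for the example.

```python
import math


def gaussian_kernel(x, y, eps=4.0):
    """k(x, y) = exp(-eps^2 * ||x - y||^2); eps is an assumed shape parameter."""
    r2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-(eps ** 2) * r2)


def solve_dense(A, b):
    """Gaussian elimination with partial pivoting, O(N^3); returns x with A x = b."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented copy of [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit(points, values, kernel):
    """Assemble A_{k,X} and solve A alpha = f for the coefficients alpha."""
    A = [[kernel(yi, yj) for yj in points] for yi in points]
    return solve_dense(A, values)


def evaluate(y, points, alpha, kernel):
    """s_{f,X}(y) = sum_j alpha_j k(y, y_j)."""
    return sum(a * kernel(y, yj) for a, yj in zip(alpha, points))


points = [(i / 4.0,) for i in range(5)]
values = [math.sin(2.0 * p[0]) for p in points]
alpha = fit(points, values, gaussian_kernel)
```

Evaluating `evaluate(p, points, alpha, gaussian_kernel)` at any sampling point `p` reproduces the corresponding entry of `values` up to rounding.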

9 Interpolation with kernel functions (2)
Interpolation by Lagrange basis:
\[ s_{f,X}(y) = \sum_{i=1}^{N} f(y_i)\, L_i(y), \quad L_i(y) = \sum_{j=1}^{N} \alpha^i_j\, k(y, y_j) \]
$\{L_i\}_{i=1}^{N}$, $L_i : \Gamma \to \mathbb{R}$, with
\[ L_i(y_j) = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases} \]
Construction of Lagrange basis: $A_L$ matrix of coefficients,
\[ A_L := \left(\alpha^i_j\right)_{j,i=1}^{N}, \quad A_L = A_{k,X}^{-1} \]
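
The identity $A_L = A_{k,X}^{-1}$ follows directly from the Lagrange condition. Written out with the symbols of this slide (the column vector $\alpha^i$ collects the coefficients $\alpha^i_j$, $e_i$ is the $i$-th unit vector and $I$ the identity matrix):

```latex
\[
\begin{aligned}
L_i(y_j) &= \sum_{m=1}^{N} \alpha^i_m \, k(y_j, y_m)
          = \bigl(A_{k,X}\, \alpha^i\bigr)_j = \delta_{ij},
          \qquad j = 1, \dots, N \\
&\Longrightarrow\; A_{k,X}\, \alpha^i = e_i, \qquad i = 1, \dots, N \\
&\Longrightarrow\; A_{k,X}\, \bigl[\alpha^1 \,\cdots\, \alpha^N\bigr] = I
 \;\Longrightarrow\; A_L = \bigl[\alpha^1 \,\cdots\, \alpha^N\bigr] = A_{k,X}^{-1}.
\end{aligned}
\]
```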

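10 Error estimates (in native spaces)
Requirement: $f \in \mathcal{N}_{k_\epsilon}(\Omega)$, $\Omega$ cube in $\mathbb{R}^s$; function $f$ interpolated by Lagrange interpolation with collocation points $X$.
Definition (Fill distance):
\[ h_{X,\Omega} := \sup_{y \in \Omega}\, \min_{y_j \in X} \|y - y_j\|_2 \]
Theorem (Gaussian kernel $k_\epsilon(y_i, y_j) = e^{-\epsilon^2 \|y_i - y_j\|^2}$):
\[ \|f - s_{f,X}\|_{L_\infty(\Omega)} \le e^{\,c \log(h_{X,\Omega}) / h_{X,\Omega}}\, \|f\|_{\mathcal{N}_{k_\epsilon}(\Omega)} \]
Theorem (Matérn kernels):
\[ \left| D^\alpha f(y) - D^\alpha s_{f,X}(y) \right| \le C\, h_{X,\Omega}^{\beta - d/2 - |\alpha|}\, \|f\|_{\mathcal{N}_{k_{d,k}}(\Omega)} \qquad (1) \]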
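
The fill distance can be approximated numerically by replacing the supremum over $\Omega$ with a maximum over a finite set of candidate locations. The sketch below does this; the candidate grid is an assumption of the example, so the result is a lower bound on the true $h_{X,\Omega}$ that becomes exact only if the grid contains a maximizer.

```python
import math


def fill_distance(points, candidates):
    """Largest distance from any candidate location to its nearest sampling point."""
    return max(min(math.dist(y, x) for x in points) for y in candidates)


# Three sampling points on [0, 1]; the largest gap midpoint is 0.25 away from X.
points = [(0.0,), (0.5,), (1.0,)]
candidates = [(i / 100.0,) for i in range(101)]
h = fill_distance(points, candidates)
```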

11 Outline
1. Motivation
2. Radial basis function interpolation
3. Preconditioning for large-scale kernel interpolation problems
4. Summary

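12 Preconditioning motivation (1)
Objective: solution of special dense linear systems with $N$ unknowns.
Standard approach:
\[ A_{k,X}\, \alpha = f, \quad A_{k,X} = \begin{pmatrix} k(y_1, y_1) & \cdots & k(y_1, y_N) \\ \vdots & \ddots & \vdots \\ k(y_N, y_1) & \cdots & k(y_N, y_N) \end{pmatrix}, \quad f = \begin{pmatrix} f(y_1) \\ \vdots \\ f(y_N) \end{pmatrix} \]
solution of dense linear system by direct factorization; complexity $O(N^3)$.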

13 Preconditioning motivation (2)
Iterative approach: Krylov iterative linear solver such as CG or BiCG; for dense matrices still complexity $O(N^3)$, but...
Preconditioned iterative approach:
use of Krylov iterative solver for dense linear system;
preconditioner based on localized Lagrange basis functions;
often few or even a constant number of iterations possible;
optional: fast multipole method (FMM) for dense MatVec product;
final complexity in optimal case: with FMM $O(N \log N)$, without FMM $O(N^2)$.
[Beatson, Cherrie, Mouat, 1999], [Faul, Powell, 2000], [Gumerov, Duraiswami, 2007], [Fuselier, Hangelbroek, Narcowich, Ward, Wright, 2012]
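
To make the cost argument concrete: each Krylov iteration needs one matrix-vector product ($O(N^2)$ for a dense matrix without FMM), so a constant iteration count gives $O(N^2)$ overall. The following is a minimal preconditioned CG sketch that takes the matrix and the preconditioner only as callbacks. It assumes a symmetric positive definite system and preconditioner, and uses a Jacobi (diagonal) preconditioner on a 2x2 system purely as a placeholder, not the localized Lagrange preconditioner of the talk.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def preconditioned_cg(matvec, b, precond, tol=1e-10, max_iter=200):
    """Solve A x = b for SPD A; A and the preconditioner are given as callbacks."""
    x = [0.0] * len(b)
    r = b[:]  # residual for the zero initial guess
    z = precond(r)
    p = z[:]
    rz = dot(r, z)
    for iteration in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol:
            return x, iteration
        Ap = matvec(p)  # the only place the matrix is touched
        step = rz / dot(p, Ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        z = precond(r)
        rz_next = dot(r, z)
        p = [zi + (rz_next / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_next
    return x, max_iter


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]


def matvec(v):
    return [dot(row, v) for row in A]


def jacobi(r):
    # Placeholder preconditioner: divide by the diagonal of A.
    return [ri / A[i][i] for i, ri in enumerate(r)]


x, iterations = preconditioned_cg(matvec, b, jacobi)
```

Passing `matvec` as a callback keeps the solver independent of how the product is computed, which is the same separation that allows a black-box or FMM-based product to be swapped in.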

14 Preconditioning idea
Algorithm:
for each point: find local point neighborhood / subset;
solve interpolation problem on local neighborhood;
construct local Lagrange basis;
use local solutions as preconditioner for full system.
Properties: many small problems; very local.
source: [Fuselier et al., 2012]

15 Choice of local subsets
Subsets by radius: $\tilde{X}_i$ = points within a given radius of $y_i$; non-fixed size of subset; clear geometric view.
Subsets by nearest neighbors: $\tilde{X}_i$ = the $n$ nearest neighbors of $y_i$; fixed size of subset; geometric view unclear; $n = \kappa \log(N)^2$.
source: [Fuselier et al., 2012]
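
A brute-force version of the nearest-neighbor choice can be sketched as follows. The natural logarithm and the rounding down in `subset_size` are assumptions; with $\kappa = 2$ and $N = 10^6$ they give 381, which agrees with the subset size reported for 1M points on the performance slide. The function names are illustrative only.

```python
import math


def subset_size(num_points, kappa):
    """n = kappa * log(N)^2, assuming the natural logarithm and rounding down."""
    return int(kappa * math.log(num_points) ** 2)


def nearest_neighbors(points, i, n):
    """Indices of the n points closest to points[i] (brute force, includes i itself)."""
    order = sorted(range(len(points)), key=lambda j: math.dist(points[i], points[j]))
    return order[:n]


points = [(0.0,), (1.0,), (3.0,), (7.0,)]
local_subset = nearest_neighbors(points, 1, 2)
n_for_one_million = subset_size(10**6, 2)
```

Sorting all distances costs $O(N \log N)$ per point; this is fine for a sketch, while a large-scale implementation would use a dedicated kNN search.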

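16 Approximated / localized Lagrange basis
\[ s_{f,X}(y) = \sum_{i=1}^{N} f(y_i)\, L_i(y), \quad L_i(y) = \sum_{j=1}^{N} \alpha^i_j\, k(y, y_j) \]
$\{L_i\}_{i=1}^{N}$, $L_i : \Gamma \to \mathbb{R}$, with
\[ L_i(y_j) = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases} \]
Local subsets $\tilde{X}_i$: $\tilde{X}_i \subset X$, $\tilde{X}_i := \{y_{i_1}, \dots, y_{i_{N_i}}\}$, such that $y_i \in \tilde{X}_i$, $i \in \{1, \dots, N\}$.
Approximate / localized Lagrange basis $\{\tilde{L}_i\}_{i=1}^{N}$:
\[ \tilde{L}_i(y) = \begin{cases} 1 & \text{if } y = y_i \\ 0 & \text{if } y \in \tilde{X}_i \setminus \{y_i\} \end{cases}, \quad \tilde{L}_i(y) := \sum_{j=1}^{N_i} \tilde{\alpha}^i_j\, k(y, y_{i_j}), \quad \tilde{L}_i \in \mathcal{N}_k(\Omega) \]
\[ A_{k,\tilde{X}_i}\, \tilde{\alpha}^i = e_i, \quad i = 1, \dots, N \]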

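17 Preconditioning by localized Lagrange basis
\[ \tilde{L}_i(y) = \begin{cases} 1 & \text{if } y = y_i \\ 0 & \text{if } y \in \tilde{X}_i \setminus \{y_i\} \end{cases}, \quad \tilde{L}_i(y) := \sum_{j=1}^{N_i} \tilde{\alpha}^i_j\, k(y, y_{i_j}) \]
Coefficient matrix describing the localized Lagrange basis:
\[ \tilde{A}_L := (a_{ik})_{i,k=1}^{N}, \quad \text{with } a_{i,k} := \begin{cases} \tilde{\alpha}^i_j & \text{if } k = i_j \\ 0 & \text{otherwise} \end{cases} \]
that is, $\tilde{A}_L$ has the rows $\tilde{\alpha}^1, \dots, \tilde{\alpha}^N$, each zero-padded to length $N$; maximum of $|\tilde{X}_i|$ non-zero entries in row $i$.
Preconditioned linear system:
\[ \tilde{A}_L A_{k,X}\, \alpha = \tilde{A}_L f \qquad \left( A_L = A_{k,X}^{-1} \;\Rightarrow\; \tilde{A}_L \approx A_{k,X}^{-1} \right) \]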
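
The construction of the sparse rows of $\tilde{A}_L$ can be sketched as below: for every point, pick the nearest neighbors, solve the small local system $A_{k,\tilde{X}_i}\tilde{\alpha}^i = e_i$, and store the coefficients under their global column indices. The Gaussian kernel with `eps = 4.0`, the fixed local size `n_local = 4`, the dictionary-based sparse rows and all function names are assumptions of this example, not the implementation described in the talk.

```python
import math


def kernel(x, y, eps=4.0):
    """Gaussian kernel, used here only as an example kernel."""
    return math.exp(-(eps ** 2) * math.dist(x, y) ** 2)


def solve_dense(A, b):
    """Gaussian elimination with partial pivoting for one small local system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def local_lagrange_rows(points, n_local):
    """One sparse row per point: {global column index: coefficient}."""
    rows = []
    for i, yi in enumerate(points):
        # Local subset: the n_local nearest neighbors of y_i (contains i itself).
        idx = sorted(range(len(points)), key=lambda j: math.dist(yi, points[j]))[:n_local]
        A_loc = [[kernel(points[p], points[q]) for q in idx] for p in idx]
        e_i = [1.0 if p == i else 0.0 for p in idx]
        coeffs = solve_dense(A_loc, e_i)
        rows.append(dict(zip(idx, coeffs)))
    return rows


def apply_preconditioner(rows, v):
    """Sparse product of the localized coefficient matrix with a vector v."""
    return [sum(c * v[j] for j, c in row.items()) for row in rows]


points = [(i / 9.0,) for i in range(10)]
rows = local_lagrange_rows(points, n_local=4)
```

Each local solve touches only its own subset, so the loop over `i` has no data dependencies between iterations; this is the property that makes the setup attractive for batched or parallel execution.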

18 Properties for Exascale systems
Locality of preconditioner: local point set used for each localized Lagrange basis function; application of preconditioner local by construction; no energy-wasting global transfer operations.
Parallelism of preconditioner construction: many similar / equally sized small problems solved in parallel without communication; optimal for fine-grained parallelism and deep memory hierarchies.
Error resilience of overall method: iterative methods are error resilient by construction.

19 Multi-GPU implementation (1)
Iterative solver for dense linear systems:
parla: in-house multi-GPU parallel library;
CUDA-aware MPI and overlapping of computation and communication for optimal scaling;
iterative solvers implemented with variable preconditioner and black-box or memory-based matrix-vector product;
currently available solvers: CG, CGS, Lanczos (EV problems);
on-the-fly application of matrix-vector product in CUDA kernel with no additional memory consumption (besides the points);
libraries: CUB, CUDA 5.0, CUBLAS 5.0, OpenMPI.
Domain decomposition of points: currently a preprocessing step with a clustering algorithm in an external tool, plus a ghost point layer.
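
The idea behind the on-the-fly matrix-vector product is that each entry $k(y_i, y_j)$ is recomputed when needed, so only the points and the vectors are stored. The sketch below shows this in plain Python rather than in a CUDA kernel; the Gaussian kernel with `eps = 4.0` and the function names are assumptions of the example.

```python
import math


def kernel(x, y, eps=4.0):
    """Gaussian kernel, used here only as an example kernel."""
    return math.exp(-(eps ** 2) * math.dist(x, y) ** 2)


def matvec_on_the_fly(points, v):
    """(A v)_i = sum_j k(y_i, y_j) v_j, computed without ever storing A."""
    return [sum(kernel(yi, yj) * vj for yj, vj in zip(points, v)) for yi in points]


points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]
first_column = matvec_on_the_fly(points, [1.0, 0.0, 0.0])
```

Memory use stays $O(N)$ instead of $O(N^2)$, at the price of re-evaluating the kernel in every product; each output entry is independent of the others, which maps naturally to one thread or thread block per row.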

20 Multi-GPU implementation (2)
Preconditioner setup (under development / optimization):
setup within stochastic collocation tool;
brute-force GPU-based kNN search (arbitrary dimension, no library);
for now: small systems solved by LU decomposition from CULA;
currently developed for Fermi-type GPUs (sm_20), no concurrent kernel execution;
cublas<t>gemmBatched()? (performance for > ?);
more libraries providing batched / device kernel functions? (CUB is great, but not yet feature complete / too low-level);
libraries: Thrust, CUBLAS, CULA, OpenMPI.
Multi-GPU preconditioner setup & application: purely local, no parallel communication.

21 First performance results
[Figure: strong scaling for interpolation points; runtime in min over number of GPUs; curves: precond., solver, small LUs, perfect scaling]
Parameters: $\kappa = 2$; $k$: Matérn kernel; solver stops at r <
intp. points: subset size 311, iteration count 45, overall time on 32 proc.: 8.73 min
1M intp. points: subset size 381, iteration count 67, overall time on 32 proc.: min

22 Summary
Meshfree interpolation / collocation needs scaling solvers.
RBF kernel methods with higher-order convergence.
Strong preconditioning techniques with high potential for Exascale systems.
Thank you!
