Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization

Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization
Martin Takáč, The University of Edinburgh
Based on:
P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function, ERGO Tech. report 11-011, 2011.
P. Richtárik and M. Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design, ERGO Tech. report 11-012, 2011.
October 06, 2011. 1 / 43

Outline
1 Randomized Coordinate Descent Method: theory, GPU implementation, numerical examples
2 Facing the Multicore-Challenge II (a few insights from a conference)
3 Discussion: StarSs, Intel Cilk Plus, SIMD and auto-vectorization
2 / 43

Randomized Coordinate Descent Method 3 / 43

Problem Formulation
\[ \min_{x \in \mathbb{R}^N} F(x) := f(x) + \Psi(x), \qquad (P) \]
where f is convex and smooth, Ψ is convex, simple and block-separable, and N is huge (N ≥ 10^6).

Ψ(x)                          meaning
0                             smooth minimization
λ||x||_1                      sparsity-inducing term
indicator function of a set   block-separable constraints

General Iterative Algorithm
1 choose an initial point x := x_0 ∈ R^N
2 iterate: choose a direction d := d(x), choose a step length α(d, x), set x := x + αd

More sophisticated direction ⇒ fewer iterations.
More sophisticated direction ⇒ more work per iteration.
4 / 43

Why Coordinate Descent?
A very old optimization method (B. O'Donnell, 1967) that can move only along the directions {e_1, e_2, ..., e_N}, where e_i = (0, 0, ..., 0, 1, 0, ..., 0)^T.

Methods for choosing the direction:
cyclic (or essentially cyclic)
coordinate with the maximal directional derivative (greedy)
RANDOM (why?)

Applications: sparse regression (LASSO), truss topology design, (sparse) group LASSO, elastic net, L1-regularized linear support vector machines (SVM) with various (smooth) loss functions, ...
5 / 43

Randomized Coordinate Descent: a 2D example
\[ f(x) = \tfrac{1}{2}\, x^T \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} x - \begin{pmatrix} 1.5 & 1.5 \end{pmatrix} x, \qquad x_0 = (0,\ 0)^T. \]
What do we need? A way to generate a random coordinate (an n-dimensional coin). Luckily, in our case n = 2.
6 / 43

Randomized Coordinate Descent: a 2D example (continued)
A sequence of slides plots the iterates produced by successive random coordinate steps on the quadratic above; only the figures change from slide to slide.
7 / 43
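As a concrete illustration, here is a minimal sketch (not from the slides; the seed and iteration count are illustrative assumptions) of randomized coordinate descent with exact coordinate minimization on this quadratic; the iterates approach the minimizer x* = (1, 1).

C++ code (illustrative sketch)
    // Randomized coordinate descent with exact coordinate minimization on
    //   f(x) = 0.5*x'Qx - b'x,  Q = [1 0.5; 0.5 1],  b = (1.5, 1.5).
    #include <cstdio>
    #include <random>

    int main() {
        const int n = 2;
        double Q[2][2] = {{1.0, 0.5}, {0.5, 1.0}};
        double b[2]    = {1.5, 1.5};
        double x[2]    = {0.0, 0.0};          // x_0 = (0, 0)^T

        std::mt19937 gen(42);
        std::uniform_int_distribution<int> coin(0, n - 1);   // the "n-dimensional coin"

        for (int k = 0; k < 50; ++k) {
            int i = coin(gen);                // pick a coordinate uniformly at random
            double grad_i = Q[i][0] * x[0] + Q[i][1] * x[1] - b[i];   // i-th partial derivative
            x[i] -= grad_i / Q[i][i];         // exact minimization along e_i (here L_i = Q_ii)
        }
        std::printf("x = (%f, %f)\n", x[0], x[1]);   // should approach (1, 1)
        return 0;
    }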

Block Structure of R^N
Any vector x ∈ R^N can be written uniquely as
\[ x = \sum_{i=1}^n U_i x^{(i)}, \qquad x^{(i)} = U_i^T x \in \mathbb{R}_i := \mathbb{R}^{N_i}, \]
where U_1, ..., U_n are the column submatrices of the N × N identity matrix corresponding to the n blocks.
8 / 43

Block Norms and Induced Norms
Block norms (norms in R_i): we equip R_i with a pair of conjugate Euclidean norms
\[ \|t\|_{(i)} = \langle B_i t, t\rangle^{1/2}, \qquad \|t\|_{(i)}^* = \langle B_i^{-1} t, t\rangle^{1/2}, \qquad t \in \mathbb{R}_i, \]
where B_i ∈ R^{N_i × N_i} is a positive definite matrix.

Induced norms (norms in R^N): for fixed positive scalars w_1, ..., w_n and x ∈ R^N we define
\[ \|x\|_W = \Big[\sum_{i=1}^n w_i \|x^{(i)}\|_{(i)}^2\Big]^{1/2}, \qquad W = \mathrm{Diag}(w_1, \dots, w_n). \]
The conjugate norm is defined by
\[ \|y\|_W^* = \max_{\|x\|_W \le 1} \langle y, x\rangle = \Big[\sum_{i=1}^n w_i^{-1}\big(\|y^{(i)}\|_{(i)}^*\big)^2\Big]^{1/2}. \]
9 / 43

Assumptions on f and Ψ
The gradient of f is block coordinate-wise Lipschitz with positive constants L_1, ..., L_n:
\[ \|\nabla_i f(x + U_i t) - \nabla_i f(x)\|_{(i)}^* \le L_i \|t\|_{(i)}, \qquad x \in \mathbb{R}^N,\ t \in \mathbb{R}_i,\ i = 1, \dots, n, \tag{1} \]
where ∇_i f(x) := (∇f(x))^{(i)} = U_i^T ∇f(x) ∈ R_i. An important consequence of (1) is the inequality
\[ f(x + U_i t) \le f(x) + \langle \nabla_i f(x), t\rangle + \frac{L_i}{2}\|t\|_{(i)}^2. \tag{2} \]
We denote L = Diag(L_1, ..., L_n).

Ψ is block-separable:
\[ \Psi(x) = \sum_{i=1}^n \Psi_i(x^{(i)}). \]
10 / 43
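Inequality (2) follows from (1) by the standard argument (a sketch, not on the slide):
\[ f(x + U_i t) - f(x) - \langle \nabla_i f(x), t\rangle = \int_0^1 \langle \nabla_i f(x + s U_i t) - \nabla_i f(x),\, t\rangle\, ds \le \int_0^1 L_i\, s\, \|t\|_{(i)}^2\, ds = \frac{L_i}{2}\|t\|_{(i)}^2, \]
using the conjugate-norm bound ⟨y, t⟩ ≤ ||y||*_{(i)} ||t||_{(i)} and then (1) with st in place of t.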

UCDC: Uniform Coordinate Descent for Composite Functions – Algorithm
An upper bound on F(x + U_i t) is readily available:
\[ F(x + U_i t) = f(x + U_i t) + \Psi(x + U_i t) \overset{(2)}{\le} \underbrace{f(x) + \langle \nabla_i f(x), t\rangle + \tfrac{L_i}{2}\|t\|_{(i)}^2 + \Psi_i(x^{(i)} + t)}_{=:\, V_i(x, t)} + \sum_{j \ne i} \Psi_j(x^{(j)}). \]

Algorithm 1 UCDC(x_0)
for k = 0, 1, 2, ... do
  choose block coordinate i ∈ {1, 2, ..., n} with probability 1/n
  T^{(i)}(x_k) = arg min { V_i(x_k, t) : t ∈ R_i = R^{N_i} }
  x_{k+1} = x_k + U_i T^{(i)}(x_k)
end for

In each step we minimize a function of N_i variables. We assume Ψ is simple: T^{(i)}(x_k) can be computed in closed form. If Ψ ≡ 0, then T^{(i)}(x_k) = -(1/L_i) B_i^{-1} ∇_i f(x_k).
11 / 43
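When Ψ ≡ 0, the closed form quoted above follows in one line (a sketch, not on the slide): V_i(x, t) is then the quadratic f(x) + ⟨∇_i f(x), t⟩ + (L_i/2)⟨B_i t, t⟩, and setting its gradient in t to zero gives
\[ \nabla_i f(x) + L_i B_i t = 0 \quad\Longrightarrow\quad T^{(i)}(x) = -\tfrac{1}{L_i} B_i^{-1}\nabla_i f(x). \]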

The Main Result
Theorem 1. Choose an initial point x_0 and a confidence level ρ ∈ (0, 1). Further, let the target accuracy ε > 0 and the iteration counter k be chosen in any of the following two ways:
(i) ε < F(x_0) - F* and
\[ k \ge \frac{2n \max\{R_L^2(x_0),\, F(x_0) - F^*\}}{\epsilon}\Big(1 + \log\frac{1}{\rho}\Big) + 2 - \frac{2n \max\{R_L^2(x_0),\, F(x_0) - F^*\}}{F(x_0) - F^*}, \]
(ii) ε < min{R_L^2(x_0), F(x_0) - F*} and
\[ k \ge \frac{2n R_L^2(x_0)}{\epsilon} \log\frac{F(x_0) - F^*}{\epsilon\rho}. \]
If x_k is the random point generated by UCDC(x_0) as applied to F, then
\[ \mathbb{P}(F(x_k) - F^* \le \epsilon) \ge 1 - \rho, \]
where
\[ R_W^2(x_0) = \max_x \big\{ \max_{x^* \in X^*} \|x - x^*\|_W^2 : F(x) \le F(x_0) \big\}. \]
12 / 43
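For a sense of scale, a hedged numeric instantiation of case (ii) (the numbers are illustrative assumptions, not from the slides): with n = 10^6, R_L^2(x_0) = 1, F(x_0) - F* = 1, ε = 10^{-3} and ρ = 10^{-2},
\[ k \ge \frac{2nR_L^2(x_0)}{\epsilon}\log\frac{F(x_0)-F^*}{\epsilon\rho} = \frac{2\cdot 10^6}{10^{-3}}\,\log\frac{1}{10^{-5}} \approx 2\cdot 10^{9} \cdot 11.5 \approx 2.3\cdot 10^{10}, \]
i.e. about k/n ≈ 2.3·10^4 passes over the coordinates; the experiments reported below reach comparable accuracy in far fewer passes, so the bound is conservative.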

Performance on Strongly Convex Functions
Theorem 2. Let F be strongly convex with respect to ||·||_L with convexity parameter μ > 0, and choose an accuracy level ε > 0, a confidence level 0 < ρ < 1, and
\[ k \ge \frac{n}{\gamma_\mu} \log\frac{F(x_0) - F^*}{\rho\epsilon}, \]
where γ_μ is given by
\[ \gamma_\mu = \begin{cases} \mu/4, & \text{if } \mu \le 2, \\ 1 - 1/\mu, & \text{otherwise.} \end{cases} \tag{3} \]
If x_k is the random point generated by UCDC(x_0), then P(F(x_k) - F* ≤ ε) ≥ 1 - ρ.
13 / 43

Comparison with other coordinate descent methods for composite minimization

          Lip       Ψ           block   coordinate choice   iterations for E[F(x_k)] - F* ≤ ε     cost of 1 iteration
[YT09]    L(∇f)     separable   Yes     greedy              O( n L(∇f) ||x_0 - x*||_2^2 / ε )     O(n^2)
[SST09]   L_i = β   ||x||_1     No      uniform (1/n)       O( n β ||x_0 - x*||_2^2 / ε )         cheap
[ST10]    L(∇f)     ||x||_1     No      cyclic              O( n L(∇f) ||x_0 - x*||_2^2 / ε )     cheap
Alg 1     L_i       separable   Yes     uniform (1/n)       O( n ||x_0 - x*||_L^2 / ε )           cheap

[YT09] S. Yun and P. Tseng. A block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications, 140:513–535, 2009.
[SST09] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l1-regularized loss minimization. In: ICML 2009.
[ST10] A. Saha and A. Tewari. On the finite time convergence of cyclic coordinate descent methods. arXiv:1005.2146, 2010.
14 / 43

Comparison with Nesterov's randomized coordinate descent
... for achieving P(F(x_k) - F* ≤ ε) ≥ 1 - ρ in the general case (F not strongly convex):

Alg     Ψ           p_i       norms     complexity                                                               applied to objective
[N10]   0           L_i/S     Euclid.   (2n + 8SR_I^2(x_0)/ε) log( 4(f(x_0) - f*) / (ερ) )                       f(x) + (ε / (8R_I^2(x_0))) ||x - x_0||_I^2
[N10]   0           1/n       Euclid.   (8nR_L^2(x_0)/ε) log( 4(f(x_0) - f*) / (ερ) )                            f(x) + (ε / (8R_L^2(x_0))) ||x - x_0||_L^2
Alg 1   0           > 0       general   (2R_{LP^{-1}}^2(x_0)/ε)(1 + log(1/ρ)) - 2                                f
Alg 2   separable   1/n       Euclid.   (2nM/ε)(1 + log(1/ρ)), resp. (2nR_L^2(x_0)/ε) log( (F(x_0) - F*) / (ερ) )   F

Symbols used in the table: S = Σ_i L_i, M = max{R_L^2(x_0), F(x_0) - F*}, P = Diag(p_1, ..., p_n).

[N10] Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Paper #2010/2, 2010.
15 / 43

Application: Sparse Regression
Problem formulation & Lipschitz constants:
\[ \min_{x \in \mathbb{R}^N} \tfrac{1}{2}\|Ax - b\|_2^2 + \gamma\|x\|_1, \qquad L_i = a_i^T a_i \ \text{(easy to compute)}. \]
Note: the Lipschitz constant of ∇f is λ_max(A^T A) (hard to compute).

(Recall Algorithm 1 UCDC(x_0): for k = 0, 1, 2, ..., choose block coordinate i, compute T^{(i)}(x_k), set x_{k+1} = x_k + U_i T^{(i)}(x_k).)

UCDC for Sparse Regression
Choose x_0 = 0 and set g_0 = Ax_0 - b = -b
for k = 0, 1, 2, ... do
  choose i_k = i ∈ {1, 2, ..., N} uniformly
  ∇_i f(x_k) = a_i^T g_k
\[ T^{(i)}(x_k) = \begin{cases} -\dfrac{\nabla_i f(x_k) + \gamma}{L_i}, & \text{if } x_k^{(i)} - \dfrac{\nabla_i f(x_k) + \gamma}{L_i} > 0, \\[2mm] -\dfrac{\nabla_i f(x_k) - \gamma}{L_i}, & \text{if } x_k^{(i)} - \dfrac{\nabla_i f(x_k) - \gamma}{L_i} < 0, \\[2mm] -x_k^{(i)}, & \text{otherwise,} \end{cases} \]
  x_{k+1} = x_k + e_i T^{(i)}(x_k),  g_{k+1} = g_k + a_i T^{(i)}(x_k)
end for
16 / 43
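A compact runnable sketch of this loop (the tiny 3×2 instance, names and iteration count are illustrative assumptions, not from the slides); it maintains the residual g = Ax - b exactly as above, so each coordinate update touches only the nonzeros of a_i.

C++ code (illustrative sketch)
    // UCDC update for  min_x 0.5*||Ax - b||_2^2 + gamma*||x||_1,
    // with A stored column-wise (CSC) so that a_i^T g and g += a_i*T are cheap.
    #include <cstdio>
    #include <random>
    #include <vector>

    struct CscMatrix {                 // column-compressed storage: column i = a_i
        int rows, cols;
        std::vector<int>    col_ptr;   // size cols+1
        std::vector<int>    row_idx;   // size nnz
        std::vector<double> val;       // size nnz
    };

    int main() {
        // Toy instance: A = [1 0; 0 2; 1 1], b = (1, 2, 2)^T, gamma = 0.1
        CscMatrix A{3, 2, {0, 2, 4}, {0, 2, 1, 2}, {1.0, 1.0, 2.0, 1.0}};
        std::vector<double> b = {1.0, 2.0, 2.0};
        double gamma = 0.1;

        std::vector<double> x(A.cols, 0.0);
        std::vector<double> g(b.size());                    // residual g = Ax - b
        for (size_t j = 0; j < b.size(); ++j) g[j] = -b[j]; // x_0 = 0  =>  g_0 = -b

        std::vector<double> L(A.cols);                      // L_i = a_i^T a_i
        for (int i = 0; i < A.cols; ++i)
            for (int p = A.col_ptr[i]; p < A.col_ptr[i + 1]; ++p)
                L[i] += A.val[p] * A.val[p];

        std::mt19937 gen(0);
        std::uniform_int_distribution<int> pick(0, A.cols - 1);

        for (int k = 0; k < 2000; ++k) {
            int i = pick(gen);                              // uniform random coordinate
            double grad = 0.0;                              // grad = a_i^T g = nabla_i f(x)
            for (int p = A.col_ptr[i]; p < A.col_ptr[i + 1]; ++p)
                grad += A.val[p] * g[A.row_idx[p]];

            double T;                                       // soft-thresholded step from the slide
            if (x[i] - (grad + gamma) / L[i] > 0)       T = -(grad + gamma) / L[i];
            else if (x[i] - (grad - gamma) / L[i] < 0)  T = -(grad - gamma) / L[i];
            else                                        T = -x[i];

            x[i] += T;                                      // x_{k+1} = x_k + e_i * T
            for (int p = A.col_ptr[i]; p < A.col_ptr[i + 1]; ++p)
                g[A.row_idx[p]] += A.val[p] * T;            // g_{k+1} = g_k + a_i * T
        }
        std::printf("x = (%f, %f)\n", x[0], x[1]);
        return 0;
    }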

Instance 1 run on CPU: A ∈ R^{10^7 × 10^6}, ||A||_0 = 10^8

k/n      F(x_k) - F*   ||x_k||_0   time [s]
0.0010   < 10^16       857         0.01
12.72    < 10^11       999,246     54.48
15.23    < 10^10       997,944     65.19
18.61    < 10^9        990,923     79.69
20.61    < 10^8        978,761     88.25
23.38    < 10^7        926,167     100.08
25.91    < 10^6        763,314     110.94
28.22    < 10^5        366,835     120.84
30.66    < 10^4        57,991      131.25
32.83    < 10^3        9,701       140.55
35.05    < 10^2        2,538       150.02
37.01    < 10^1        1,722       158.39
38.26    < 10^0        1,633       163.75
40.98    < 10^-1       1,604       175.38
42.71    < 10^-3       1,600       182.75
42.71    < 10^-4       1,600       182.77
43.28    < 10^-5       1,600       185.21
44.86    < 10^-6       1,600       191.94
17 / 43

Instance 2 run on CPU: A ∈ R^{10^9 × 10^8}, ||A||_0 = 2·10^9

k/n     F(x_k) - F*   ||x_k||_0    time [s]
0.01    < 10^18       18,486       1.32
9.35    < 10^14       99,837,255   1294.72
11.97   < 10^13       99,567,891   1657.32
14.78   < 10^12       98,630,735   2045.53
17.12   < 10^11       96,305,090   2370.07
20.09   < 10^10       86,242,708   2781.11
22.60   < 10^9        58,157,883   3128.49
24.97   < 10^8        19,926,459   3455.80
28.62   < 10^7        747,104      3960.96
31.47   < 10^6        266,180      4325.60
34.47   < 10^5        175,981      4693.44
36.84   < 10^4        163,297      5004.24
39.39   < 10^3        160,516      5347.71
41.08   < 10^2        160,138      5577.22
43.88   < 10^1        160,011      5941.72
45.94   < 10^0        160,002      6218.82
46.19   < 10^-1       160,001      6252.20
46.25   < 10^-2       160,000      6260.20
46.89   < 10^-3       160,000      6344.31
46.91   < 10^-4       160,000      6346.99
46.93   < 10^-5       160,000      6349.69
18 / 43

Acceleration on a Graphical Processing Unit (GPU)

         CPU        GPU
cores    1 core     448 cores
clock    3.00 GHz   1.15 GHz
peak     3 GFlops   1 TFlop

GPU (Graphical Processing Unit): performs a single operation in parallel on multiple data (e.g. vector addition).
19 / 43

GPU Architecture
Tesla C2050: 14 multiprocessors, each with 32 cores (each at 1.15 GHz). All cores in one multiprocessor must perform the same operation on different data = SIMD (Single Instruction, Multiple Data).
20 / 43

Huge and Sparse Data = Small Chance of Collision 21 / 43

Race Conditions and Atomic Operation 22 / 43
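These two slides are figures; the underlying issue (as the titles suggest) is that on the GPU many threads update the shared residual vector g concurrently, so colliding updates, which are rare because the data is sparse, must be applied atomically (e.g. with CUDA's atomic operations). Below is a minimal CPU-thread sketch of the same idea using a compare-and-swap loop; the thread count and names are illustrative assumptions, not from the talk.

C++ code (illustrative sketch)
    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Atomically add 'delta' to the double stored in 'target' (pre-C++20 idiom).
    void atomic_add(std::atomic<double>& target, double delta) {
        double old = target.load();
        while (!target.compare_exchange_weak(old, old + delta)) {
            // 'old' is refreshed on failure; recompute old + delta and retry.
        }
    }

    int main() {
        std::atomic<double> g{0.0};          // one entry of the shared residual vector
        std::vector<std::thread> workers;
        for (int t = 0; t < 8; ++t)
            workers.emplace_back([&g] {
                for (int k = 0; k < 100000; ++k) atomic_add(g, 1.0);  // concurrent updates
            });
        for (auto& w : workers) w.join();
        std::printf("g = %f (expected 800000)\n", g.load());
        return 0;
    }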

CPU (C) vs GPU (CUDA): step-by-step comparison, n = 10^8. 24 / 43

Acceleration by GPU 25 / 43

Naive Implementation
Each thread chooses its coordinate independently; the matrix A is naturally stored in the CSC sparse format.
26 / 43

Naive Implementation
Data access is NOT coalesced (not sequential at all) and is misaligned, which decreases performance.
27 / 43

Misaligned copy 28 / 43

Better Implementation
Memory access improvement: all threads in ONE WARP generate the same random number, which points to the beginning of a memory bank (from where the data will be read); each thread then computes its own coordinate i. One warp therefore works on ALIGNED and SEQUENTIAL coordinates.
29 / 43

Pipelining

C++ code
    int offset = ....;
    for (int i = offset; i < N; i++) {
        A[i] = A[i + offset] * s;
    }

Left: offset is a constant. Right: offset is a variable, so the compiler cannot optimize (pipeline) the loop as well as before.
30 / 43

Facing the Multicore-Challenge II (a few insights from a conference) 31 / 43

SIMD and Auto-Vectorization
CPU: 2.2 GHz, 4 flops/cycle, 12 cores = 105.6 GFlops (AMD Magny-Cours)
GPU: 1030.4 GFlops (Tesla C2050/C2070 GPU Computing Processor)
"...the memory subsystem (or at least the cache) must be able to sustain sufficient bandwidth to keep all units busy..."
32 / 43

Intel Compiler 12+
-guide=n: analyzes the code and gives some recommendations
-vec-report n: reports on loop vectorization; shows which loops were vectorized and which were not
#pragma ivdep: vectorize, ignoring assumed data dependences
-ax: auto-vectorization

C++ code
    int C[N];
    ....
    for (int j = 0; j < C[7]; j++) {   // C[7] = 2034
        A[j] = 2 * B[j];
    }

Without the pragma, the compiler must assume that the store to A[j] may alias the loop bound C[7], so it does not vectorize the loop; #pragma ivdep tells it to ignore such assumed dependences:

C++ code
    int C[N];
    ....
    #pragma ivdep
    for (int j = 0; j < C[7]; j++) {   // C[7] = 2034
        A[j] = 2 * B[j];
    }
33–34 / 43

StarSs - Introduction 35 / 43

StarSs - Renaming 36 / 43

StarSs - GPU 37 / 43

StarSs - GPU 38 / 43

Intel Cilk Plus Write parallel programs using a simple model: With only three keywords to learn, C and C++ developers move quickly into the parallel programming domain. Utilize data parallelism by simple array notations that include elemental function capabilities. Leverage existing serial tools: The serial semantics of Intel Cilk Plus allows you to debug in a familiar serial debugger. Scale for the future: The runtime system operates smoothly on systems with hundreds of cores. 39 / 43

Cilk - Arrays

    __declspec(vector) float saxpy_elemental(float a, float x, float y)
    {
        return (a * x + y);
    }

    // Here is how to invoke such vectorized elemental
    // functions using array notation:
    z[:] = saxpy_elemental(a, b[:], c[:]);
    a[:] = sin(b[:]);       // vector math library function
    a[:] = pow(b[:], c);    // b[:]^c via a vector math library function

40 / 43

Cilk - Fibonacci - Create a New Thread

    cilk int fib(int n)
    {
        if (n < 2) return n;
        else
        {
            int x, y;

            x = spawn fib(n - 1);
            y = spawn fib(n - 2);

            sync;

            return (x + y);
        }
    }

41 / 43

Any Questions? Thank you for your attention 42 / 43

References
Georg Hager and Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers.
Jesus Labarta, Rosa Badia, Eduard Ayguade, Marta Garcia, Vladimir Marjanovic: Hybrid Parallel Programming with MPI+SMPSs.
43 / 43