THE STATE OF CONTEMPORARY COMPUTING SUBSTRATES FOR OPTIMIZATION METHODS. Benjamin Recht UC Berkeley

Size: px

Start display at page:

Download "THE STATE OF CONTEMPORARY COMPUTING SUBSTRATES FOR OPTIMIZATION METHODS. Benjamin Recht UC Berkeley"

Lynette Bryan
5 years ago
Views:

1 THE STATE OF CONTEMPORARY COMPUTING SUBSTRATES FOR OPTIMIZATION METHODS Benjamin Recht UC Berkeley

2 MY QUIXOTIC QUEST FOR SUPERLINEAR ALGORITHMS Benjamin Recht UC Berkeley

3 Collaborators Slides extracted by torturing Eric Jonas, Vaishaal Shankar, Stephen Tu, Shivaram Venkataraman, and Ashia Wilson

4 The holy grail of optimization Fix a positive definite, n x n matrix A Find x such that Ax = b minimize 1 2 xt Ax b T x Linear regression/classification Interior Point Methods Sum of Squares Quadratic Programming Sequential Quadratic Programming Newton s Method Hard case for Nesterov Dense case ~ bigger sparse cases

5 Iterative Solvers Fix a positive definite, n x n matrix A Find x such that Ax = b complexity = (number of iterations) x (flops per iteration) L O µ log(1/ ) O(n 2 ) x k+1 = x k s(ax k b) Steepest Descent

6 Iterative Solvers Fix a positive definite, n x n matrix A Find x such that Ax = b complexity = (number of iterations) x (flops per iteration) O n L max µ log(1/ ) O(n) x (i) k+1 = x(i) k A 1 ii (at i x k b i ) Coordinate Descent (Gauss-Seidel)

7 Iterative Solvers Fix a positive definite, n x n matrix A Find x such that Ax = b complexity = (number of iterations) x (flops per iteration) O n 2 K(A) log(1/ ) Not sublinear in dimension. direct solve complexity = O(n 3 )

8 Iterative or Direct? Fix a positive definite, n x n matrix A n=60,000 Find x such that Ax = b b 60,000 x 10 iterative complexity O n 2 K(A) log(1/ ) direct solve complexity = O(n 3 ) direct solve wall clock = 111s iterative wall clock = 127s kx x? k F =0.58 kx? k F Are we selling ourselves short with these crummy iterative methods?

9 Why is direct faster? iterative complexity O n 2 K(A) log(1/ ) direct solve complexity = O(n 3 ) n=60,000 b 60,000 x 10 direct solve wall clock = 111s iterative wall clock = 127s kx x? k F =0.58 kx? k F Are we selling ourselves short with these crummy iterative methods? To achieve maximum speed, data shuffling is required Memory bandwidth is critical Maybe we just need more RAM?

10 I NEED MOAR RAM Superlinear scaling is the worst. n RAM Time Memory demand growing quadratically means more than 60,000 30GB 111s one computer is needed for anything larger than 400K. 350K 1TB 6 hours Cubic time scaling means a lot 1.2M 12TB 10 days of cores are needed to minimize time. Time extrapolation based on 64 cores, 2TB RAM Where can I get all of this compute?

11 THE CLOUD

12 Cloud Computing CHOICEs t2.nano, t2.micro, t2.small m4.large, m4.xlarge, m4.2xlarge, m4.4xlarge, m3.medium, c4.large, c4.xlarge, c4.2xlarge, c3.large, c3.xlarge, c3.4xlarge, r3.large, r3.xlarge, r3.4xlarge, i2.2xlarge, i2.4xlarge, d2.xlarge d2.2xlarge, d2.4xlarge, Basic tier: A0, A1, A2, A3, A4 Optimized Compute : D1, D2, D3, D4, D11, D12, D13 D1v2, D2v2, D3v2, D11v2, Latest CPUs: G1, G2, G3, Network Optimized: A8, A9 Compute Intensive: A10, A11, n1-standard-1, ns1-standard-2, ns1- standard-4, ns1-standard-8, ns1- standard-16, ns1highmem-2, ns1- highmem-4, ns1-highmem-8, n1- highcpu-2, n1-highcpu-4, n1-highcpu-8, n1-highcpu-16, n1-highcpu-32, f1-micro, g1-small Amazon EC2 Microsoft Azure Google Cloud Engine

14 TYRANNY of CHOICE

15 #THECLOUDISTOODAMNHARD What type? what instance? What base image? What is the cheapest configuration to run my job in 2 hours? How many to spin up? What price? spot? wait, Wait, WAIT oh god

16 Do choices matter? Equal Price/Hour, Aggregate Cores QR Factorization:1M by 1K 1 r3.8xlarge r3.4xlarge 4 r3.2xlarge 8 r3.xlarge 16 r3.large Time (s) Mem BW (GB/s) 150 ~2x performance difference for same price

17 Do choices MATTER? r3.4xlarge instances, QR Factorization:1M by 1K 30 Actual Ideal 22.5 Time (s) computation + communication yield non-linear scaling Cores 17

18 Computation patterns Time INPUT Time 1 machines

19 Communication Patterns ONE-To-ONE Tree DAG All-to-one CONSTANT LOG LINEAR 19

20 BASIC Model Computation (linear) All-to-One DAG input time= x 1 + x 2 machines + x 3 log(machines)+ x 4 (machines) Serial Tree DAG Execution Collect Training Data Fit Linear Regression

21 Optimal Design of Experiments Given a Linear Model y i = a T i x + w i, i =1,...,m Collection of (input, machines) Associate cost with each experiment λ i - Fraction of times each experiment is run minimize tr n Pm i=1 ia i a T i 1 o Lower variance! Better model i 2 [0, 1] P m I=1 c i i apple B Bound total cost

22 Number of instances Large Least-squares solve (TIMIT) on r3.xlarge instances Predicted Actual Time per iteration (s) Machines Machines

23 CHOOSING INSTANCE TYPES Running Time Training 8 r3.4xlarge r3.2xlarge r3.xlarge Time (seconds)

24 Wait. That still seems hard! and what happened to the RAM?

25 AWS LAMBDA 300 seconds single-core (AVX2) 512 MB in /tmp 1.5GB RAM Python, Java, Node S3 = Simple Storage Service. Essentially infinite RAM Communication at 600MB/s per machine (same speed as SATA) (PCI Bus is 1GB/s, Memory Bus is 80GB/s)

26 AWS LAMBDA 300 seconds single-core (AVX2) 512 MB in /tmp 1.5GB RAM Python, Java, Node

27 LAMBDA SCALABILITY Compute Data

28 Most wrens are small and rather inconspicuous, except for their loud and often complex songs. pywren

29 THE API

30 How it works future = runner.map(fn, data) Serialize func and data func data data data Put on S3 Invoke Lambda pull job from s3 download anaconda runtime python to run code pickle result stick in S3 future.result() poll S3 unpickle and return your laptop result the cloud

31 How expensive is S3? (Taking dimensionality analysis seriously, or beyond PSPACE ) Simple Storage Service Essentially infinite RAM Communication at 600MB/s per machine Same speed as SATA (PCI Bus is 1GB/s, Memory Bus is 80GB/s)

32 How do algorithms change when you have infinite memory (through a straw) Never discard intermediate information

33 numpywren That s a lot of SIMD cores! Parallel matrix multiplication is easy when output matrix is small D x N = D x D N x D Fits cleanly into map-reduce framework + + = D x D

34 numpywren However when output matrix is very large it becomes very difficult or expensive to store in memory D x N = N x D N x N For example for N = 1e6 and D=1e4 D x D matrix of doubles is 800 Mb N x N matrix of doubles is 8 TB Storing 8 TB in memory traditional cluster is expensive!

35 numpywren N D Lambdas Runtime Output Size N x N s 20 GB s 20 GB Solution: Use S3 to store matrices, stream blocks to Lambdas to compute output matrix in parallel 1.2 Millon 1.2 Million s 11 TB s 11 TB

36 Cholesky in numpywren apple A11 A T 12 A 12 A 22 = apple L11 0 L 12 L 22 apple L T 11 L T 12 0 L T 22 N x N Solution: Use S3 to store matrices, stream blocks to Lambdas to compute output matrix in parallel

37 Cholesky in numpywren apple A11 A T 12 A 12 A 22 = apple L11 0 L 12 L 22 apple L T 11 L T 12 0 L T 22 N x N Solution: Use S3 to store matrices, stream blocks to Lambdas to compute output matrix in parallel

38 Cholesky in numpywren apple A11 A T 12 A 12 A 22 = apple L11 0 L 12 L 22 apple L T 11 L T 12 0 L T 22 N x N Solution: Use S3 to store matrices, stream blocks to Lambdas to compute output matrix in parallel

39 Cholesky in numpywren apple A11 A T 12 A 12 A 22 = apple L11 0 L 12 L 22 apple L T 11 L T 12 0 L T 22 N x N L 21 = A 21 L T 11 Solution: Use S3 to store matrices, stream blocks to Lambdas to compute output matrix in parallel

40 Cholesky in numpywren apple A11 A T 12 A 12 A 22 = apple L11 0 L 12 L 22 apple L T 11 L T 12 0 L T 22 N x N L 21 = A 21 L T 11 Solution: Use S3 to store matrices, stream blocks to Lambdas to compute output matrix in parallel A (new) 22 = A 22 L 12 L T 12

41 The future is direct solves Fix a positive definite, n x n matrix A Find x such that Ax = b minimize 1 2 xt Ax b T x Iterative methods are not fast New substrates make large-scale direct solve reachable Need simpler tools and cost management There are algorithmic challenges in noniterative methods

References If you d like to try Pywren visit pywren.io Occupy the Cloud: Distributed Computing for the 99%. Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica, and Benjamin Recht.

Jordan, and Benjamin Recht. In Proceedings of the International Conference on Machine Learning (ICML). 2017. Ernest: Efficient Performance Prediction for Large Scale Advanced Analytics.

42 References If you d like to try Pywren visit pywren.io Occupy the Cloud: Distributed Computing for the 99%. Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica, and Benjamin Recht. In Proceedings of the ACM Symposium on Cloud Computing (SOCC) Breaking Locality Accelerates Block Gauss-Seidel. Stephen Tu, Shivaram Venkataraman, Ashia C. Wilson, Alex Gittens, Michael I. Jordan, and Benjamin Recht. In Proceedings of the International Conference on Machine Learning (ICML) Ernest: Efficient Performance Prediction for Large Scale Advanced Analytics. Shivaram Venkataraman, Zongheng Yang, Michael J Franklin, Benjamin Recht, and Ion Stoica. In Proceedings of the Symposium on Networked Systems Design (NSDI)

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,