Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)
2 How many bits do you need to represent a single number in machine learning systems?
Takeaways:
Training Neural Networks: 4 bits is enough for communication.
Training Linear Models: 4 bits is enough end-to-end.
Beyond Empirical: rigorous theoretical guarantees.
(Figure: 3-bit vs. 32-bit floating-point representation of a number.)
3 First Example: GPUs
GPUs have plenty of compute, yet bandwidth is relatively limited (PCIe or, newer, NVLink).
What happens in practice? For each minibatch: compute gradient, exchange gradient, update parameters.
Trend towards large models and datasets:
Vision: ImageNet (1.8M images), ResNet-152 model [He+15]: 60M parameters (~240 MB).
Speech: NIST2000 (2000 hours), LACEA model [Yu+16]: 65M parameters (~300 MB).
Gradient transmission is expensive.
6 First Example: GPUs — Compression
As models get bigger, gradient exchange dominates iteration time. One remedy: compress the gradients before exchanging them [Seide et al., Microsoft CNTK].
The Key Question: Can lossy compression provide speedup while preserving convergence?
Yes. Quantized SGD (QSGD) can converge as fast as SGD, with considerably less bandwidth.
(Figure: top-1 accuracy vs. time for AlexNet on ImageNet; QSGD reaches the same accuracy more than 2x faster.)
Why does QSGD work? 8
Notation in One Slide
Task: e.g., image classification. Data: $M$ examples $e_1, \dots, e_M$. Model: $x$.
Notion of quality: $f(x) = \frac{1}{M} \sum_{i=1}^{M} \mathrm{loss}(x, e_i)$.
Solved via an optimization procedure: find $\arg\min_x f(x)$.
Background on Stochastic Gradient Descent
Goal: find $\arg\min_x f(x)$.
Let $\tilde{g}(x)$ be the gradient of the loss at a randomly chosen data point.
Iteration: $x_{t+1} = x_t - \eta_t \tilde{g}(x_t)$, where $\mathbb{E}[\tilde{g}(x_t)] = \nabla f(x_t)$.
Let $\mathbb{E}\|\tilde{g}(x) - \nabla f(x)\|^2 \le \sigma^2$ (variance bound).
Theorem [informal]: Given $f$ nice (e.g., convex and smooth), and $R^2 = \|x_0 - x^*\|^2$, to converge within $\varepsilon$ of optimal it is sufficient to run for $T = O(R^2 \sigma^2 / \varepsilon^2)$ iterations.
Higher variance means more iterations to convergence.
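To make the iteration concrete, here is a minimal SGD sketch on a toy least-squares problem; the data, step size, and iteration count are illustrative choices, not from the talk.

```python
import numpy as np

def sgd_least_squares(A, b, eta=0.01, T=1000, seed=0):
    """Plain SGD on f(x) = (1/M) * sum_i (a_i . x - b_i)^2."""
    rng = np.random.default_rng(seed)
    M, n = A.shape
    x = np.zeros(n)
    for t in range(T):
        i = rng.integers(M)                    # pick a random example
        g = 2.0 * (A[i] @ x - b[i]) * A[i]     # unbiased stochastic gradient
        x = x - eta * g                        # x_{t+1} = x_t - eta_t * g(x_t)
    return x

# Toy usage: recover a planted model from noisy linear measurements.
rng = np.random.default_rng(1)
A = rng.normal(size=(500, 20))
x_true = rng.normal(size=20)
b = A @ x_true + 0.01 * rng.normal(size=500)
x_hat = sgd_least_squares(A, b)
print(np.linalg.norm(x_hat - x_true))          # small distance to the planted model
```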
11 Data Flow: Data-Parallel Training (e.g. GPUs)
Standard SGD step: $x_{t+1} = x_t - \eta_t \tilde{g}(x_t)$ at step $t$.
With two GPUs, each computes a gradient on its own data shard and the results are summed: $x_{t+1} = x_t - \eta_t (\tilde{g}_1(x_t) + \tilde{g}_2(x_t))$.
Quantized SGD step: $x_{t+1} = x_t - \eta_t Q(\tilde{g}(x_t))$ at step $t$.
(Figure: two GPUs, each with its own data, exchanging quantized gradients $Q(\tilde{g}_1(x_t))$, $Q(\tilde{g}_2(x_t))$ to update a shared model.)
How Do We Quantize?
The gradient is a vector $v$ of dimension $n$, normalized.
Quantization function: $Q[v_i] = \mathrm{sgn}(v_i)\,\xi_i(v)$, where $\xi_i(v) = 1$ with probability $|v_i|$, and $0$ otherwise.
Quantization is an unbiased estimator: $\mathbb{E}[Q(v)] = v$.
Example: $|v_i| = 0.7$ gives $\Pr[1] = 0.7$ and $\Pr[0] = 0.3$.
Why do this? Instead of $n$ floats $v_1, \dots, v_n$, we transmit one float (the scaling) plus $n$ signs and bits, e.g. +1 0 0 -1 -1 0 1 1 1 0 0 0 -1. Compression rate > 15x.
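A minimal sketch of this stochastic binary quantizer; the function name and the simulated encode/decode round trip are mine, not the QSGD reference implementation.

```python
import numpy as np

def quantize_1bit(v, rng):
    """Stochastic binary quantization of a gradient vector v.

    Conceptually encodes v as one float (the scale ||v||_2) plus a sign bit
    and a Bernoulli bit per coordinate; returns the decoded vector.
    Unbiased: E[Q(v)] = v."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    p = np.abs(v) / norm                  # normalized magnitudes in [0, 1]
    xi = (rng.random(v.shape) < p)        # 1 with probability |v_i| / ||v||
    return norm * np.sign(v) * xi

# Unbiasedness check: average many quantizations of the same vector.
rng = np.random.default_rng(0)
v = rng.normal(size=8)
avg = np.mean([quantize_1bit(v, rng) for _ in range(20000)], axis=0)
print(np.round(v, 3))
print(np.round(avg, 3))                   # close to v
```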
13 Gradient Compression
We apply stochastic rounding to gradients. The SGD iteration becomes: $x_{t+1} = x_t - \eta_t Q(\tilde{g}(x_t))$ at step $t$.
Theorem [QSGD: Alistarh, Grubic, Li, Tomioka, Vojnovic, 2016]: Given dimension $n$, QSGD guarantees the following:
1. Convergence: if SGD converges, then QSGD converges.
2. Convergence speed: if SGD converges in $T$ iterations, QSGD converges in $\sqrt{n}\,T$ iterations.
3. Bandwidth cost: each gradient can be coded using $2\sqrt{n}\log n$ bits.
The gamble: the benefit of reduced communication will outweigh the performance hit from extra iterations/variance and coding/decoding.
Generalizes to arbitrarily many quantization levels.
(Figure: stochastic rounding of a coordinate onto the uniform levels 0, 1/7, ..., 6/7, 1.)
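As a sketch of the generalization, here is stochastic rounding onto s uniform levels, plugged into a toy two-worker quantized-SGD step. The toy data, learning rate, and in-process simulation of the two GPUs are assumptions for illustration, not the CNTK implementation.

```python
import numpy as np

def quantize_slevel(v, s, rng):
    """Stochastic rounding of |v|/||v|| onto the s+1 uniform levels 0, 1/s, ..., 1.
    Unbiased: E[Q(v)] = v.  s = 1 recovers the one-bit scheme above."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s              # in [0, s]
    lower = np.floor(scaled)
    prob_up = scaled - lower                   # round up with this probability
    level = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * level / s

# Toy data-parallel QSGD loop with two simulated workers on least squares.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 10)); x_true = rng.normal(size=10)
b = A @ x_true
x, eta, s = np.zeros(10), 0.05, 4              # s = 4 levels, i.e. a few bits
for t in range(200):
    idx1, idx2 = rng.integers(200, size=32), rng.integers(200, size=32)
    g1 = 2 * (A[idx1] @ x - b[idx1]) @ A[idx1] / 32   # worker 1 minibatch gradient
    g2 = 2 * (A[idx2] @ x - b[idx2]) @ A[idx2] / 32   # worker 2 minibatch gradient
    # each worker "transmits" a quantized gradient; the average updates the model
    x = x - eta * (quantize_slevel(g1, s, rng) + quantize_slevel(g2, s, rng)) / 2
print(np.linalg.norm(x - x_true))              # converges despite quantization
```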
Does it actually work? 14
Experimental Setup
Where? Amazon EC2 p2.16xlarge (16 x NVIDIA K80 GPUs), Microsoft CNTK v2.0, with MPI-based communication (no NVIDIA NCCL).
What? Tasks: image classification (ImageNet) and speech recognition (CMU AN4). Networks: ResNet, VGG, Inception, and AlexNet for vision; an LSTM for speech. All with default hyperparameters.
Why? Accuracy vs. speed/scalability.
Open-source implementation, as well as Docker containers.
Experiments: Communication Cost
AlexNet on ImageNet-1K, 2 GPUs. With SGD, roughly 60% of time is compute and 40% communication; with QSGD, 95% compute and 5% communication.
(Figure: time breakdown, SGD vs. QSGD on AlexNet.)
Experiments: Strong Scaling
(Figure: strong-scaling experiments, with QSGD speedups of 2.3x, 3.5x, 1.6x, and 1.3x across the networks tested.)
Experiments: A Closer Look at Accuracy
(Figure: ResNet50 on ImageNet — 4-bit: -0.2%, 8-bit: +0.3% top-1 accuracy vs. full precision; 3-layer LSTM on CMU AN4 (speech) — 2.5x.)
Across all networks we tried, 4 bits are sufficient for full accuracy. (The QSGD arXiv tech report contains full numbers and comparisons.)
19 How many bits do you need to represent a single number in machine learning systems?
Takeaways:
Training Neural Networks: 4 bits is enough for communication.
Training Linear Models: 4 bits is enough end-to-end.
(Figure: 3-bit vs. 32-bit floating-point representation of a number.)
20 Data Flow in Machine Learning Systems
(1) The data source (sensor, database) produces the data a_r; (2) it moves through the storage device (DRAM, CPU cache); (3) the computation device (CPU, GPU, FPGA) combines it with the model x to compute the gradient dot(a_r, x) a_r.
21 ZipML
Can we also quantize the data a_r itself? Example: a data value of 0.7, to be stored with the levels 0 and 1.
Naive solution: nearest rounding (to 1) converges to a different solution.
Stochastic rounding: 0 with probability 0.3, 1 with probability 0.7. The expectation matches, so this is OK! (Over-simplified: we need to be careful about variance.) [NIPS 15]
22 ZipML
For least squares, the loss is (ax - b)^2 and the gradient is 2a(ax - b).
With stochastically rounded data (0 with probability 0.3, 1 with probability 0.7 for the value 0.7), the expectation matches, so everything is OK!
23 ZipML
Expectation matches => OK? NO!
Why? The gradient 2a(ax - b) is not linear in a: it contains a^2, and E[Q(a)^2] = a^2 + Var[Q(a)] != a^2, so plugging quantized data straight into the gradient gives a biased estimate.
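A quick numerical check of this bias, reusing the 0.7 example with levels 0 and 1; the values of x and b are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a, x, b = 0.7, 2.0, 1.0
true_grad = 2 * a * (a * x - b)                       # 2a(ax - b) = 0.56

# One stochastically rounded copy of `a`, reused inside the gradient.
samples = (rng.random(1_000_000) < a).astype(float)   # 1 w.p. 0.7, else 0
naive = 2 * samples * (samples * x - b)
print(naive.mean())     # ~1.4, not 0.56: E[Q(a)^2] = a = 0.7, not a^2 = 0.49
print(true_grad)
```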
24 ZipML: Double Sampling
How do we generate samples of a that give an unbiased estimator of 2a(ax - b)? Use TWO independent samples: 2 a_1 (a_2 x - b).
How many more bits do we need to store the second sample? Not 2x overhead!
With 3 bits for the first sample, the second sample can only be one level up, one level down, or the same, so 2 extra bits suffice.
We can do even better: the pair of samples is symmetric (unordered), leaving 15 distinct possibilities, so 4 bits store both samples. Only 1 bit of overhead. [arXiv 16]
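A sketch of the double-sampling estimator on the same toy scalar example; the bit-packing of the two samples described above is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
a, x, b = 0.7, 2.0, 1.0
true_grad = 2 * a * (a * x - b)                        # 0.56

# Two INDEPENDENT stochastically rounded copies of the same value a.
s1 = (rng.random(1_000_000) < a).astype(float)
s2 = (rng.random(1_000_000) < a).astype(float)
double = 2 * s1 * (s2 * x - b)                         # 2 a_1 (a_2 x - b)
print(double.mean())   # ~0.56: unbiased, since E[s1 s2] = E[s1] E[s2] = a^2
print(true_grad)
```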
It works! 25
26 Experiments
Tomographic reconstruction: linear regression with fancy regularization (but a 240 GB model).
(Figure: reconstruction with 32-bit floating point vs. 12-bit fixed point.)
It works, but is what we are doing optimal? 28
29 Not Really: Data-Optimal Quantization Strategy
Consider a value falling between two markers A and B, at distance a from A and b from B.
Probability of quantizing to A: P_A = b / (a + b). Probability of quantizing to B: P_B = a / (a + b).
Expected quantization error (variance): a P_A + b P_B = 2ab / (a + b).
Intuitively, shouldn't we put more markers where the data is dense? Adding a marker splits an interval and shrinks the error of every point inside it.
Finding the optimal set of markers c_1 < ... < c_s is P-TIME via dynamic programming, with the cost of an interval summing the error of all data points falling into it.
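A sketch of such a dynamic program, assuming (as a simplification) that markers are restricted to observed data values and using the slide's per-point error 2ab/(a+b) as the interval cost; this illustrates the idea rather than reproducing the ZipML implementation.

```python
import numpy as np

def interval_cost(points, left, right):
    """Total expected rounding error of all points strictly inside (left, right):
    a point at distance a from the left marker and b from the right marker
    contributes a*P_A + b*P_B = 2ab / (a + b), as on the slide."""
    inside = points[(points > left) & (points < right)]
    if inside.size == 0:
        return 0.0
    a = inside - left
    b = right - inside
    return float(np.sum(2 * a * b / (a + b)))

def optimal_markers(data, s):
    """Choose s markers (restricted to observed data values) minimizing the
    total expected rounding error, by dynamic programming over candidates."""
    cand = np.unique(data)                      # sorted candidate marker positions
    m = len(cand)
    INF = float("inf")
    # cost[k][j] = best cost covering cand[0..j] with k+1 markers, the last at cand[j]
    cost = np.full((s, m), INF)
    back = np.zeros((s, m), dtype=int)
    cost[0][0] = 0.0                            # the first marker sits at the minimum
    for k in range(1, s):
        for j in range(k, m):
            for i in range(k - 1, j):
                if cost[k - 1][i] == INF:
                    continue
                c = cost[k - 1][i] + interval_cost(data, cand[i], cand[j])
                if c < cost[k][j]:
                    cost[k][j], back[k][j] = c, i
    markers, j = [], m - 1                      # the last marker sits at the maximum
    for k in range(s - 1, -1, -1):
        markers.append(cand[j])
        j = back[k][j]
    return np.array(markers[::-1]), cost[s - 1][m - 1]

# Toy usage: most of the data lives near 0, so the markers crowd into the dense region.
rng = np.random.default_rng(0)
data = np.concatenate([rng.uniform(0.0, 0.1, 180), rng.uniform(0.1, 1.0, 20), [0.0, 1.0]])
markers, err = optimal_markers(data, s=4)
print(np.round(markers, 3), round(err, 3))
```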
Experiments 31
Beyond Linear Regression 32
33 General Loss Function: Chebyshev Polynomials
For linear regression or LS-SVM, the gradient is a low-degree polynomial in the data; losses such as logistic regression contain non-linear terms.
Challenge: non-linear terms. Solution: approximation via (Chebyshev) polynomials.
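For illustration, a sketch of the polynomial-approximation step: fitting a low-degree Chebyshev polynomial to the sigmoid that appears in the logistic-regression gradient. The degree, the interval [-5, 5], and the use of NumPy's Chebyshev class are my choices, not the paper's recipe.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Fit a low-degree Chebyshev approximation of the sigmoid on a bounded range
# of the margin t = a.x (degree and interval are illustrative).
deg, lo, hi = 8, -5.0, 5.0
nodes = np.cos((np.arange(deg + 1) + 0.5) * np.pi / (deg + 1))   # Chebyshev nodes in [-1, 1]
t_nodes = lo + (nodes + 1) * (hi - lo) / 2                       # mapped to [lo, hi]
approx = Chebyshev.fit(t_nodes, sigmoid(t_nodes), deg, domain=[lo, hi])

t = np.linspace(lo, hi, 1001)
print(np.max(np.abs(sigmoid(t) - approx(t))))   # max approximation error on [lo, hi]

# Once the non-linear term is a polynomial in a.x, each power of a can be fed
# independent quantized samples of a, in the spirit of double sampling.
```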
Experiments
Even for non-linear models, we can do end-to-end quantization at 8 bits, with guaranteed convergence.
Summary & Implementation
We can do end-to-end quantized linear regression with guarantees.
We can handle arbitrary classification losses via polynomial approximation.
Implemented on a Xeon + FPGA platform (Intel-Altera HARP). Quantized data means 8-16x more data per transfer.
(Figures: training loss vs. time for float CPU 1-thread, float CPU 10-threads, float FPGA, and quantized (Q2/Q4) FPGA, with speedups of 6.5x and 7.2x. Left: classification, gisette dataset, n = 5K. Right: regression, synthetic dataset, n = 100.)
Takeaways
Training Neural Networks: 4 bits is enough for communication.
Training Linear Models: 4 bits is enough end-to-end.
Beyond Empirical: rigorous theoretical guarantees.
(Figure: 3-bit vs. 32-bit floating-point representation of a number.)
Questions?