Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)


2 How many bits do you need to represent a single number in machine learning systems? Takeaways: Training neural networks: 4 bits is enough for communication. Training linear models: 4 bits is enough end-to-end. Beyond empirical: rigorous theoretical guarantees. [Chart: bits per number, 32-bit floating point vs. 8, 4, and 3 bits.]

3 First Example: GPUs. GPUs have plenty of compute, yet bandwidth is relatively limited (PCIe or, more recently, NVLink). What happens in practice? For each minibatch, every GPU computes its gradient, the gradients are exchanged, and the model parameters are updated. Meanwhile, there is a general trend towards large models and datasets. Vision: ImageNet (1.8M images), ResNet-152 [He+15]: 60M parameters (~240 MB). Speech: NIST2000 (2000 hours), LACEA [Yu+16]: 65M parameters (~300 MB). Gradient transmission is expensive.

4-5 First Example: GPUs (animation builds). The same picture with a bigger, and then even bigger, model: the gradient-exchange step takes up more and more of each iteration.

6 First Example: GPUs. Compression [Seide et al., Microsoft CNTK]: compress the gradients before exchanging them.

7 The Key Question. Can lossy compression provide speedup while preserving convergence? Yes: Quantized SGD (QSGD) can converge as fast as SGD, with considerably less bandwidth, reaching the same Top-1 accuracy for AlexNet (ImageNet) more than 2x faster. [Plot: Top-1 accuracy vs. time for AlexNet on ImageNet.]

8 Why does QSGD work?

9 Notation in One Slide. Task (e.g., image classification), solved via an optimization procedure over the data (M examples), with a notion of quality given by the loss. Model: x* = argmin_x f(x), where f(x) = (1/M) Σ_{i=1}^{M} loss(x, e_i).

10 Background on Stochastic Gradient Descent. Goal: find argmin_x f(x). Let ĝ(x) be the gradient of the loss at a randomly chosen data point. Iteration: x_{t+1} = x_t − η_t ĝ(x_t), where E[ĝ(x_t)] = ∇f(x_t). Assume E‖ĝ(x) − ∇f(x)‖² ≤ σ² (variance bound). Theorem [informal]: if f is nice (e.g., convex and smooth) and R² = ‖x_0 − x*‖², then to converge within ε of the optimum it suffices to run T = O(R²σ²/ε²) iterations. Higher variance means more iterations to convergence.
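
To make the iteration above concrete, here is a minimal sketch of plain SGD on a least-squares objective; the synthetic data, step size, and iteration count are illustrative assumptions, not the talk's setup.

```python
import numpy as np

def sgd_least_squares(A, b, eta=0.01, T=5000, seed=0):
    """Plain SGD on f(x) = (1/M) * sum_i (a_i . x - b_i)^2.

    Each step uses the gradient at a single randomly chosen example,
    an unbiased estimate of the full gradient: E[g^(x)] = grad f(x)."""
    rng = np.random.default_rng(seed)
    M, n = A.shape
    x = np.zeros(n)
    for _ in range(T):
        i = rng.integers(M)                    # pick a random data point
        g = 2.0 * A[i] * (A[i] @ x - b[i])     # stochastic gradient g^(x_t)
        x = x - eta * g                        # x_{t+1} = x_t - eta_t * g^(x_t)
    return x

# Illustrative usage on synthetic data.
rng = np.random.default_rng(1)
A = rng.normal(size=(1000, 10))
x_star = rng.normal(size=10)
b = A @ x_star + 0.01 * rng.normal(size=1000)
print(np.linalg.norm(sgd_least_squares(A, b) - x_star))  # small
```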

11 Data Flow: Data-Parallel Training (e.g., GPUs). Standard SGD step at step t: x_{t+1} = x_t − η_t ĝ(x_t). With two GPUs, each computing a stochastic gradient on its own data, the model update is x_{t+1} = x_t − η_t (ĝ_1(x_t) + ĝ_2(x_t)). Quantized SGD step at step t: x_{t+1} = x_t − η_t Q(ĝ(x_t)), i.e., the GPUs exchange Q(ĝ_1(x_t)) and Q(ĝ_2(x_t)) instead of the full-precision gradients.

12 How Do We Quantize? The gradient is a vector v of dimension n, normalized. Quantization function: Q[v_i] = ξ_i(v) · sgn(v_i), where ξ_i(v) = 1 with probability |v_i| and 0 otherwise. Example: if |v_i| = 0.7, then Pr[ξ_i = 1] = 0.7 and Pr[ξ_i = 0] = 0.3. Quantization is an unbiased estimator: E[Q(v)] = v. Why do this? Instead of n 32-bit floats (v_1, ..., v_n), we transmit one float scaling factor plus n sign/zero symbols (e.g., +1 0 0 −1 −1 0 1 1 1 0 0 0 −1), a compression rate > 15x.
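
Below is a small sketch of this quantizer; scaling by the l2 norm is one natural reading of "normalized" here, and the function names are mine.

```python
import numpy as np

def qsgd_quantize(v, rng):
    """One-level stochastic quantization of a gradient vector v.

    Each coordinate becomes ||v||_2 * sign(v_i) * xi_i, where xi_i = 1
    with probability |v_i| / ||v||_2 and 0 otherwise, so E[Q(v)] = v."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    keep_prob = np.abs(v) / norm                  # Pr[xi_i = 1]
    xi = rng.random(v.shape) < keep_prob          # stochastic 0/1 mask
    return norm * np.sign(v) * xi

# Unbiasedness check: averaging many quantized copies recovers v.
rng = np.random.default_rng(0)
v = rng.normal(size=8)
avg = np.mean([qsgd_quantize(v, rng) for _ in range(50000)], axis=0)
print(np.max(np.abs(avg - v)))   # close to 0
```

What actually gets transmitted is one floating-point scaling factor plus a sign/zero symbol per coordinate, which is where the >15x compression quoted on the slide comes from.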

13 Gradient Compression. We apply stochastic rounding to gradients; the SGD iteration becomes x_{t+1} = x_t − η_t Q(ĝ(x_t)) at step t. Theorem [QSGD: Alistarh, Grubic, Li, Tomioka, Vojnovic, 2016]: given dimension n, QSGD guarantees the following. 1. Convergence: if SGD converges, then QSGD converges. 2. Convergence speed: if SGD converges in T iterations, QSGD converges in at most √n · T iterations. 3. Bandwidth cost: each gradient can be coded using about 2√n log n bits. The gamble: the benefit of reduced communication will outweigh the performance hit from extra iterations/variance and coding/decoding. The scheme generalizes to arbitrarily many quantization levels (e.g., the eight levels 0, 1/7, ..., 6/7, 1 of a 3-bit quantizer).
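
The generalization to many quantization levels can be sketched as stochastic rounding onto a uniform level grid; the uniform grid, names, and parameters below are my assumptions (QSGD's actual bit encoding, e.g. the scheme behind the ~2√n log n bound, is not shown).

```python
import numpy as np

def qsgd_quantize_levels(v, s, rng):
    """Stochastic quantization of v onto the s+1 uniform levels {0, 1/s, ..., 1}.

    |v_i| / ||v||_2 is stochastically rounded up or down to a neighbouring
    level, so the result stays an unbiased estimate of v; each entry is then
    a sign plus a small integer level index."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    r = s * np.abs(v) / norm                          # position on the level grid, in [0, s]
    low = np.floor(r)
    level = low + (rng.random(v.shape) < (r - low))   # round up with probability r - low
    return norm * np.sign(v) * level / s

rng = np.random.default_rng(0)
v = rng.normal(size=8)
print(np.round(v, 3))
print(np.round(qsgd_quantize_levels(v, s=7, rng=rng), 3))  # 3-bit level grid
```

With s = 7 this uses the eight levels 0, 1/7, ..., 1 mentioned above.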

14 Does it actually work?

15 Experimental Setup. Where? Amazon EC2 p2.16xlarge (16 x NVIDIA K80 GPUs), Microsoft CNTK v2.0 with MPI-based communication (no NVIDIA NCCL). What? Tasks: image classification (ImageNet) and speech recognition (CMU AN4). Networks: ResNet, VGG, Inception, and AlexNet for vision, plus an LSTM for speech, all with default parameters. Why? Accuracy vs. speed/scalability. Open-source implementation, as well as Docker containers.

16 Experiments: Communication Cost (AlexNet on ImageNet-1K, 2 GPUs). SGD: ~60% compute, ~40% communication. QSGD: ~95% compute, ~5% communication. [Chart: per-iteration time breakdown, SGD vs. QSGD on AlexNet.]

17 Experiments: Strong Scaling. [Plots: strong-scaling results across the benchmarked networks, with measured speedups of 2.3x, 3.5x, 1.6x, and 1.3x.]

18 Experiments: A Closer Look at Accuracy. [Plots: ResNet-50 on ImageNet and a 3-layer LSTM on CMU AN4 (speech); 4-bit QSGD: −0.2% accuracy, 8-bit: +0.3%, with a 2.5x speedup.] Across all networks we tried, 4 bits are sufficient for full accuracy. (The QSGD arXiv tech report contains full numbers and comparisons.)

19 How many bits do you need to represent a single number in machine learning systems? Takeaways: Training neural networks: 4 bits is enough for communication. Training linear models: 4 bits is enough end-to-end. [Chart: bits per number, 32-bit floating point vs. 8, 4, and 3 bits.]

20 Data Flow in Machine Learning Systems. The data a_r flows from the data source (sensor, database), through the storage device (DRAM, CPU cache), to the computation device (GPU, CPU, FPGA), which computes the gradient dot(a_r, x) · a_r against the model x.

21 ZipML. Now quantize the data a_r itself along this path (e.g., the value 0.7 on a {0, 1} grid). Naive solution: nearest rounding (0.7 → 1), which converges to a different solution. Stochastic rounding: 0 with probability 0.3, 1 with probability 0.7; the expectation matches, so it is OK! (Over-simplified: one needs to be careful about the variance!) [NIPS'15]

22 ZipML. For least-squares regression the loss is (ax − b)² and the gradient is 2a(ax − b). With stochastically rounded data (0.7 → 0 with p = 0.3, 1 with p = 0.7), the expectation of the data matches, so it looks OK!

23 ZipML. Expectation matches, so it is OK? No! Why? The gradient 2a(ax − b) is not linear in a, so an unbiased sample of a does not give an unbiased sample of the gradient.

24 ZipML: Double Sampling. How do we generate samples of a that give an unbiased estimator of 2a(ax − b)? Use TWO independent samples: estimate the gradient as 2â_1(â_2 x − b). How many more bits do we need to store the second sample? Not 2x overhead: with 3 bits for the first sample, the second sample has only three possibilities relative to it (up, down, or the same), so 2 extra bits suffice. We can do even better: the samples are symmetric, leaving 15 distinct possibilities, so 4 bits store both samples, i.e., only 1 bit of overhead. [arXiv'16]
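
The bias-versus-double-sampling point can be checked numerically. The sketch below is mine (the grid, values, and trial count are illustrative): reusing a single stochastically rounded copy of a inside 2a(ax − b) is biased, while two independent copies are not.

```python
import numpy as np

def stochastic_round(a, grid, rng):
    """Round scalar a to one of its two neighbouring grid points, with
    probabilities chosen so that the expectation equals a exactly."""
    idx = np.clip(np.searchsorted(grid, a) - 1, 0, len(grid) - 2)
    lo, hi = grid[idx], grid[idx + 1]
    return hi if rng.random() < (a - lo) / (hi - lo) else lo

a, x, b = 0.7, 1.3, 0.2
grid = np.linspace(0.0, 1.0, 5)                 # a coarse quantization grid
rng = np.random.default_rng(0)
true_grad = 2 * a * (a * x - b)

trials = 200_000
naive = double = 0.0
for _ in range(trials):
    s = stochastic_round(a, grid, rng)          # one quantized copy, used twice
    naive += 2 * s * (s * x - b)                # biased: E[s^2] = a^2 + Var(s)
    s1 = stochastic_round(a, grid, rng)         # two independent copies
    s2 = stochastic_round(a, grid, rng)
    double += 2 * s1 * (s2 * x - b)             # unbiased: E[s1] * E[s2] = a^2

print(true_grad, naive / trials, double / trials)
```

The naive average settles away from 2a(ax − b), while the double-sampling average matches it.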

25 It works!

26 Experiments: Tomographic Reconstruction. Linear regression with fancy regularization (but a 240 GB model). [Images: reconstruction with 32-bit floating point vs. 12-bit fixed point.]

28 It works, but is what we are doing optimal?

29 Not really: a Data-Optimal Quantization Strategy. Consider a data point at distance a from marker A and distance b from marker B. Probability of quantizing to A: P_A = b / (a + b); probability of quantizing to B: P_B = a / (a + b). Expected quantization error for the point: a·P_A + b·P_B = 2ab / (a + b). Intuitively, shouldn't we put more markers where the data is dense?

30 Data-Optimal Quantization Strategy (continued). With markers at distances a and b, the expected error for the point is 2ab / (a + b); if its interval were instead bounded at distances a and b + b′, the error would grow to 2a(b + b′) / (a + b + b′). So adding markers where data points concentrate reduces the total error. Finding an optimal set of markers c_1 < ... < c_s is a polynomial-time dynamic program over the data points: dense regions of data get dense markers (a simple version is sketched below).
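
As a rough illustration of the dynamic-programming idea, the sketch below restricts candidate markers to the observed data values and minimizes the slide's 2ab/(a+b) error summed over all points; this simple O(s·m²)-state formulation and all names are my own, not necessarily the paper's exact algorithm.

```python
import numpy as np

def interval_error(points, lo, hi):
    """Total expected quantization error of the points strictly inside
    (lo, hi): sum of 2ab/(a+b) with a = p - lo and b = hi - p."""
    inside = points[(points > lo) & (points < hi)]
    if inside.size == 0:
        return 0.0
    return float(np.sum(2.0 * (inside - lo) * (hi - inside) / (hi - lo)))

def optimal_markers(data, s):
    """Pick s markers among the distinct data values (smallest and largest
    always included), minimizing total expected error, by dynamic programming."""
    vals = np.unique(data)
    m = len(vals)
    if s >= m:
        return vals
    cost = [[interval_error(data, vals[i], vals[j]) for j in range(m)]
            for i in range(m)]
    INF = float("inf")
    dp = [[INF] * m for _ in range(s + 1)]       # dp[k][j]: k markers, last at vals[j]
    parent = [[-1] * m for _ in range(s + 1)]
    dp[1][0] = 0.0
    for k in range(2, s + 1):
        for j in range(k - 1, m):
            for i in range(j):
                if dp[k - 1][i] + cost[i][j] < dp[k][j]:
                    dp[k][j] = dp[k - 1][i] + cost[i][j]
                    parent[k][j] = i
    markers, j = [], m - 1                       # the largest value is the final marker
    for k in range(s, 0, -1):
        markers.append(vals[j])
        j = parent[k][j]
    return np.array(markers[::-1])

# Data clustered around 0.1 and 0.9 pulls the markers into those regions.
rng = np.random.default_rng(0)
data = np.clip(np.concatenate([rng.normal(0.1, 0.02, 50),
                               rng.normal(0.9, 0.02, 50)]), 0.0, 1.0)
print(np.round(optimal_markers(data, s=4), 3))
```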

31 Experiments.

32 Beyond Linear Regression.

33 General Loss Functions via Chebyshev Polynomials. Linear regression or LS-SVM fit the framework above directly; logistic regression does not. Challenge: non-linear terms in the gradient. Approach: approximation via (Chebyshev) polynomials.
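
A minimal sketch of the polynomial-approximation idea: fit a Chebyshev polynomial to the sigmoid, the non-linear factor of the logistic-regression gradient (sigmoid(aᵀx) − y)·a. The degree, interval, and use of numpy's Chebyshev class are my illustrative choices, not the talk's construction.

```python
import numpy as np
from numpy.polynomial import Chebyshev

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit a degree-9 Chebyshev approximation of the sigmoid on [-8, 8],
# the non-linear term in the logistic-regression gradient.
z = np.linspace(-8.0, 8.0, 2001)
cheb = Chebyshev.fit(z, sigmoid(z), deg=9)
print(np.max(np.abs(cheb(z) - sigmoid(z))))     # max fit error on the interval

# Gradient with the exact sigmoid vs. the polynomial surrogate.
rng = np.random.default_rng(0)
a = rng.normal(size=20) / np.sqrt(20)           # one (scaled) data point
x = rng.normal(size=20)                         # model
y = 1.0                                         # label
exact_grad = (sigmoid(a @ x) - y) * a
approx_grad = (cheb(a @ x) - y) * a
print(np.max(np.abs(exact_grad - approx_grad)))
```

Once the gradient is a polynomial in aᵀx, each monomial can be handled with quantized data using additional independent samples, in the spirit of the double-sampling trick above.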

34 Experiments. Even for non-linear models, we can do end-to-end quantization at 8 bits, with guaranteed convergence.

Summary & Implementation. We can do end-to-end quantized linear regression with guarantees, and arbitrary classification losses via approximation. Implemented on a Xeon + FPGA platform (Intel-Altera HARP); quantized data gives 8-16x more data per transfer. [Plots: training loss vs. time for float CPU (1 thread), float CPU (10 threads), float FPGA, and quantized FPGA (Q2/Q4); classification (gisette dataset, n = 5K) and regression (synthetic dataset, n = 100), with speedups of 6.5x and 7.2x.]

Takeaways. How many bits do you need to represent a single number in machine learning systems? Training neural networks: 4 bits is enough for communication. Training linear models: 4 bits is enough end-to-end. Beyond empirical: rigorous theoretical guarantees. Questions?