Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models
Chengjie Qin (GraphSQL, Inc.), Martin Torres (University of California Merced), and Florin Rusu (University of California Merced)
August 31, 2017

Machine Learning (ML) Is Booming
https://ucbrise.github.io/cs294-rise-fa16/prediction serving.html
- General frameworks with ML libraries: Hadoop's Mahout, Spark's MLlib, GraphLab
- Specialized ML systems: Vowpal Wabbit, SystemML, SimSQL, TensorFlow
- In-database ML: MADlib, Bismarck, GLADE

ML for Generalized Linear Models
- Model is a d-dimensional vector $\vec{w}$, $d \geq 1$
- Training data X consists of N d-dimensional feature vectors $\vec{x}_i$ and their corresponding labels $y_i$, $1 \leq i \leq N$
- Objective function (or loss): $\Lambda(\vec{w}) = \min_{\vec{w} \in \mathbb{R}^d} \sum_{i=1}^{N} f(\vec{w}, \vec{x}_i; y_i)$
- Find the model $\vec{w}$ that minimizes the objective function over the training data
- Logistic Regression (LR): $\Lambda_{LR}(\vec{w}) = \sum_{i=1}^{N} \log\left(1 + e^{-y_i \vec{w} \cdot \vec{x}_i}\right)$
- Low-Rank Matrix Factorization (LMF): $\Lambda_{LMF}(L, R) = \frac{1}{2} \sum_{(i,j) \in M} \left(L_i^T R_j - M_{ij}\right)^2$
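To make the two loss functions concrete, here is a minimal Python sketch that evaluates them on toy data. The dense NumPy representation, the dict encoding of the observed cells of M, and the function names are assumptions made for the sketch, not part of the system described in the talk.

```python
import numpy as np

def lr_loss(w, X, y):
    """Lambda_LR(w) = sum_i log(1 + exp(-y_i * w.x_i))."""
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

def lmf_loss(L, R, M):
    """Lambda_LMF(L, R) = 1/2 * sum_{(i,j) in M} (L_i . R_j - M_ij)^2.

    M is given as a dict {(i, j): observed value}.
    """
    return 0.5 * sum((L[i] @ R[j] - m_ij) ** 2 for (i, j), m_ij in M.items())

# Toy usage: d = 4 features, N = 3 examples, rank-2 factors for a 3 x 2 matrix.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(3, 4)), np.array([1.0, -1.0, 1.0])
L, R = rng.normal(size=(3, 2)), rng.normal(size=(2, 2))
print(lr_loss(np.zeros(4), X, y), lmf_loss(L, R, {(0, 1): 3.0, (2, 0): 1.0}))
```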

Agenda
- Big Model Analytics
- Gradient Descent Optimization
- Scalable HOGWILD! for Big Models
- Experimental Results
- Conclusions

Big Model Example 1: Recommender Systems
http://www.slideshare.net/mrchrisjohnson/algorithmic-music-recommendations-at-spotify
- Spotify applies low-rank matrix factorization (LMF) to 24 million users and 20 million songs; even at a relatively small rank of 100, this yields (24M + 20M) x 100 = 4.4 billion features

Big Model Example 2: Text Analytics
- N-gram features of a sentence
- For the English Wikipedia corpus, a feature vector with 25 billion unigrams and 218 billion bigrams can be constructed [S. Lee, J. K. Kim et al., On Model Parallelization and Scheduling Strategies for Distributed Machine Learning, NIPS 2015]

Big Model Motivation
In the Cloud...
- There is always enough memory
- You can always add more servers
ML in IoT, Edge, and Fog Environments
- Push processing to the devices acquiring the data, which have rather scarce resources
- Data transfer is not a viable alternative for bandwidth and privacy reasons
- Secondary storage (disk, SSD, or flash) is plentiful

Support for Big Models
Existing ML systems do not support Big Models:
- The model is an in-memory container data structure, e.g., a vector or a map
- The model is an array attribute in a single-column table (at most 1 GB in PostgreSQL) and the in-memory state of a UDA (User-Defined Aggregate)
Parameter Server [M. Li et al., OSDI 2014]
- Partitions the model across the distributed shared memory of multiple servers, with each server storing a sufficiently small model partition that fits in its local memory
- Complex partitioning, replication, and synchronization
Dot-Product Join [C. Qin and F. Rusu, SSDBM 2017]
- Serial dot-product computation between a sparse matrix and a massive dense vector
- Range-based model partitioning independent of the training data
- Training data reordering at chunk level

Problem: In-Database Big Model ML
- Provide scalable in-database support for Big Model ML in a single multi-core server with attached storage, i.e., disk or SSD
Challenge
- Access to the model is unpredictable and there are many accesses for each training example
- Highly-concurrent READ/WRITE model accesses across partitions (worker threads)
[Figure: sparse training data X split into three example partitions, with the dense model w (indices 1-6 and their values) stored on disk; model indices are grouped into zero-value, non-zero-value, and correlated indices]

Agenda
- Big Model Analytics
- Gradient Descent Optimization
- Scalable HOGWILD! for Big Models
- Experimental Results
- Conclusions

Gradient Descent Optimization
$\min_{\vec{w} \in \mathbb{R}^d} \left\{ \Lambda(\vec{w}) = \sum_{i=1}^{N} f(\vec{w}, \vec{x}_i; y_i) \right\}$
- Update rule: $\vec{w}^{(k+1)} = \vec{w}^{(k)} - \alpha^{(k)} \nabla\Lambda(\vec{w}^{(k)})$
- $\alpha^{(k)}$ is the step size (or learning rate); $\vec{w}^{(0)}$ is the (random) starting point
- The gradient is $\nabla\Lambda(\vec{w}) = \left[ \frac{\partial \Lambda(\vec{w})}{\partial w_1}, \ldots, \frac{\partial \Lambda(\vec{w})}{\partial w_d} \right]$
- $\nabla\Lambda_{LR}(\vec{w}) = -\sum_{i=1}^{N} \frac{y_i \vec{x}_i}{1 + e^{y_i \vec{w} \cdot \vec{x}_i}}$
- $\frac{\partial \Lambda_{LMF}(L,R)}{\partial L_i} = \sum_{(i,j) \in M} \left( L_i^T R_j - M_{ij} \right) R_j^T$
- Convergence to the minimum is guaranteed for a convex objective function
http://www.yaldex.com/game-development/1592730043 ch18lev1sec4.html
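The LR gradient on this slide follows from the chain rule applied to the per-example loss; the short derivation below is included only for reference and matches the formula above.

```latex
\frac{\partial}{\partial \vec{w}} \log\!\left(1 + e^{-y_i \vec{w}\cdot\vec{x}_i}\right)
  = \frac{-y_i \vec{x}_i \, e^{-y_i \vec{w}\cdot\vec{x}_i}}{1 + e^{-y_i \vec{w}\cdot\vec{x}_i}}
  = \frac{-y_i \vec{x}_i}{1 + e^{\,y_i \vec{w}\cdot\vec{x}_i}}
\quad\Longrightarrow\quad
\nabla \Lambda_{LR}(\vec{w}) = -\sum_{i=1}^{N} \frac{y_i\,\vec{x}_i}{1 + e^{\,y_i \vec{w}\cdot\vec{x}_i}}
```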

Stochastic Gradient Descent (SGD)
Input: {(x_i, y_i)}, 1 <= i <= N; f; ∇f; w^(0); α^(0)
Output: w^(k-1)
  Let k = 1
  while (true) do
    if convergence({Λ(w^(l))}, 0 <= l < k) then break
    for each example (x_η(k), y_η(k)) do
      Approximate gradient: ∇f(w^(k), x_η(k); y_η(k))
      w^(k) = w^(k) - α^(k) ∇f(w^(k), x_η(k); y_η(k))
    end for
    Update step size α^(k)
    Let k = k + 1
  end while
  return w^(k-1)
http://www.holehouse.org/mlclass/17 Large Scale Machine Learning.html
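A minimal single-threaded Python rendering of this loop for the LR objective is sketched below. The convergence test (replaced by a fixed number of passes), the step-size decay schedule, and the function name are assumptions made for the sketch.

```python
import numpy as np

def sgd_lr(X, y, alpha=0.1, decay=0.95, epochs=10, seed=0):
    """Serial SGD for logistic regression: one model update per training example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                      # w^(0): starting point
    for _ in range(epochs):                       # stands in for the convergence test
        for i in rng.permutation(len(y)):         # visit examples in random order
            grad = -y[i] * X[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
            w -= alpha * grad                     # w^(k) = w^(k) - alpha^(k) * grad
        alpha *= decay                            # update the step size
    return w
```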

HOGWILD!: Parallel Asynchronous SGD
HOGWILD! main loop:
  for i = 1 to N do in parallel
    w ← w - α^(k) ∇f(w, x_i; y_i)
- Parallelize the inner loop while ignoring the sequential nature of SGD [F. Niu et al., NIPS 2011]
- Single model shared across threads
- Lock-free access to the model (data races): 2 READs (gradient and model update), 1 WRITE (model update)
- Convergence is preserved for sparse data
- Hardware cache coherence limits speedup [S. Sallinen et al., IPDPS 2016]
Naive extension to Big Models, with the model stored in a disk key-value store with a get/put interface (e.g., HyperLevelDB):
  for i = 1 to N do in parallel
    for each non-zero feature j ∈ {1, ..., d} in x_i do
      get w[j]
      compute ∇f(w[j], x_i; y_i)
    end for
    w ← w - α^(k) ∇f(w, x_i; y_i)
    for each non-zero feature j ∈ {1, ..., d} in x_i do
      put w[j]
    end for
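As a concrete, much-simplified rendering of the naive extension, the sketch below stores the model in a Python dict that stands in for the on-disk key-value store and races worker threads over it. The dict interface, the thread pool, and the LR gradient are assumptions of the sketch, not the system's actual HyperLevelDB-based implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# A plain dict stands in for the on-disk key-value store (e.g., HyperLevelDB);
# reading / writing kv[j] plays the role of the get/put calls in the pseudocode.
kv = {}  # model: feature index -> weight

def naive_hogwild_example(x_i, y_i, alpha=0.1):
    """Process one sparse example x_i = {feature index: value}: get, compute, update, put."""
    idx = list(x_i)
    w_local = np.array([kv.get(j, 0.0) for j in idx])      # get w[j] for the non-zeros
    vals = np.array([x_i[j] for j in idx])
    coeff = -y_i / (1.0 + np.exp(y_i * (w_local @ vals)))  # LR gradient weight
    w_local -= alpha * coeff * vals                        # model update on a local copy
    for j, w_j in zip(idx, w_local):
        kv[j] = w_j                                        # put w[j] back

def naive_hogwild(examples, threads=4):
    """Lock-free parallel pass over the data: threads race on the shared model."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(lambda ex: naive_hogwild_example(*ex), examples))

# Toy usage: two sparse examples over a 6-dimensional model.
naive_hogwild([({0: 1.0, 3: 2.0}, 1.0), ({1: 0.5, 3: 1.0}, -1.0)])
print(kv)
```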

Agenda
- Big Model Analytics
- Gradient Descent Optimization
- Scalable HOGWILD! for Big Models
- Experimental Results
- Conclusions

Scalable HOGWILD! for Big Models
Offline model vertical partitioning
- Correlate indices to reduce the number of get calls inside an example
Online asynchronous model access sharing
- Vertical partition traversal to reduce the number of get calls inside a partition
- Push-based model sharing to reduce the number of get calls across partitions
- Partition-level model updates (mini-batch) to reduce the number of put calls
[Figure: same data/model layout as before — sparse training data X in three partitions and the dense model w on disk, with correlated model indices highlighted]

Offline Model Partitioning
[Figure: the model vertical partitioning algorithm maps the original model indices 1-6 to the index map a: (1, 5), b: (2, 3), c: (4), d: (6), producing a vertically partitioned model with partitions a = {1, 5}, b = {2, 3}, c = {4}, d = {6}]
Composite key-value storage scheme
- Two-level values in the (key, value) intermediate representation
Model vertical partitioning algorithm
- Vertical partitioning as in physical design: the examples are the workload; the model is the relation
- Bottom-up greedy strategy with a precomputed affinity matrix (see the sketch after this list)
- Cost model combines seek time and payload access time
- Sampling & frequency-based pruning
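The slides describe the partitioner only at this level of detail, so the sketch below is a heavily simplified stand-in: a bottom-up greedy merger driven by a co-occurrence affinity matrix, with a maximum partition size standing in for the seek-time/payload cost model. The affinity definition, the max_size and min_gain parameters, and all function names are assumptions, not the paper's actual algorithm.

```python
import numpy as np

def affinity_matrix(examples, d):
    """affinity[j, k] = number of sampled examples in which features j and k co-occur."""
    A = np.zeros((d, d))
    for x in examples:                       # x is the set of non-zero indices of one example
        for a in x:
            for b in x:
                A[a, b] += 1
    return A

def greedy_vertical_partition(examples, d, max_size=2, min_gain=1):
    """Bottom-up greedy merging: start from singleton partitions and repeatedly merge
    the pair with the highest co-access affinity until no merge gains enough."""
    A = affinity_matrix(examples, d)
    parts = [{j} for j in range(d)]
    while len(parts) > 1:
        best, best_gain = None, min_gain - 1
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                if len(parts[i]) + len(parts[j]) > max_size:
                    continue                 # stands in for the cost model's payload penalty
                gain = sum(A[a, b] for a in parts[i] for b in parts[j])
                if gain > best_gain:
                    best, best_gain = (i, j), gain
        if best is None:
            break
        i, j = best
        parts[i] |= parts.pop(j)             # merge the two partitions
    return parts

# Toy usage: 12 sparse examples over a 6-dimensional model (0-based indices).
examples = [{0, 2, 3}, {1, 2, 5}, {1, 2, 4}, {0, 1, 2}, {0, 4}, {1, 2, 4},
            {0, 2, 4}, {0, 2, 4, 5}, {0, 1, 2}, {1, 4, 5}, {1, 2}, {1, 2, 5}]
print(greedy_vertical_partition(examples, d=6))
```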

Online Asynchronous Model Access Sharing
[Figure: the three data partitions rewritten with the composite keys a-d from the offline index map, per-partition inverted indexes from model keys to the examples that reference them, the resulting dot-products X·w, and the key-value stored model w]
- Key-value stored model
- Push-based sharing
- On-the-fly example mapping into the composite key-value scheme
- Vertical partition traversal (see the sketch after this list)
- Incremental dot-products in the gradient computation
- Push-based sharing of model indices across partitions
- Key-to-vector inverted index (cuckoo hash table): a single request across a partition
- Partition-level (mini-batch) model updates preceded by a get call
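A single-threaded, dict-based sketch of the vertical traversal with incremental dot-products and a partition-level update is shown below; in the actual system this runs asynchronously per worker over the disk-resident key-value store, with push-based sharing across partitions. The data layout (features grouped by composite key), the LR-specific update, and the function names are assumptions of the sketch.

```python
import numpy as np
from collections import defaultdict

def train_partition(partition, kv_store, alpha=0.1):
    """One pass over a data partition (LR objective) with vertical traversal and a
    single partition-level (mini-batch) model update.

    partition: list of (features, label); features maps composite_key -> {model index: value}
    kv_store:  dict composite_key -> {model index: weight}, standing in for the disk store
    """
    # Key-to-examples inverted index: enables one get per composite key for the whole partition.
    inverted = defaultdict(list)
    for eid, (features, _) in enumerate(partition):
        for key in features:
            inverted[key].append(eid)

    # Vertical traversal: accumulate incremental dot-products w.x for every example.
    dots = np.zeros(len(partition))
    for key, eids in inverted.items():
        sub_model = kv_store[key]                       # single get call per key
        for eid in eids:
            feats = partition[eid][0][key]
            dots[eid] += sum(sub_model[j] * v for j, v in feats.items())

    # Partition-level mini-batch update: one get/put pair per touched composite key.
    for key, eids in inverted.items():
        sub_model = dict(kv_store[key])                 # get
        for eid in eids:
            feats, y = partition[eid][0][key], partition[eid][1]
            coeff = -y / (1.0 + np.exp(y * dots[eid]))  # LR gradient weight
            for j, v in feats.items():
                sub_model[j] -= alpha * coeff * v
        kv_store[key] = sub_model                       # put

# Toy usage: two examples over model partitions a = {1, 5} and b = {2, 3}.
kv = {"a": {1: 1.9, 5: 0.7}, "b": {2: 1.2, 3: 1.0}}
data = [({"a": {1: 1.0}, "b": {3: 3.0}}, 1.0), ({"b": {2: 1.0, 3: 4.0}}, -1.0)]
train_partition(data, kv)
print(kv)
```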

Experimental Results (1)
[Plots comparing the naive key-value baseline (KV) with the scalable vertically-partitioned version (KV-VP): loss value vs. time [sec], time per iteration [sec] vs. # threads, speedup vs. # threads, and # key-value requests vs. # threads, for 1 to 16 threads]

Experimental Results (2)
[Plots: # requests reduced vs. K (100, 500, 1000) for the skewed, splice, matrix, and MovieLens datasets; time per iteration [sec] on splice and MovieLens for KV, KV-VP, and an in-memory baseline]

Conclusions
- Design a scalable model- and data-parallel framework for parallelizing stochastic optimization algorithms over big models: offline model partitioning and asynchronous online training
- Formalize model partitioning as vertical partitioning and design a scalable frequency-based model vertical partitioning algorithm
- Devise an asynchronous method to traverse the training examples in all the data partitions vertically
- Design a push-based model sharing mechanism for incremental gradient computation based on partial dot-products
- Implement the entire framework using User-Defined Aggregates (UDA), which provides generality across databases
- Evaluate the framework on three analytics tasks over synthetic and real datasets
- Results demonstrate the scalability and the reduced overhead incurred by model partitioning and the key-value store

Thank you. Questions?