Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics


Dot-Product Join: Scalable In-Database Linear Algebra for Big Model Analytics
Chengjie Qin (GraphSQL, Inc.) and Florin Rusu (University of California, Merced)
June 29, 2017

Machine Learning (ML) Is Booming
https://ucbrise.github.io/cs294-rise-fa16/prediction serving.html
- General frameworks with ML libraries: Hadoop's Mahout, Spark's MLlib, GraphLab
- Specialized ML systems: Vowpal Wabbit, SystemML, SimSQL, TensorFlow
- In-database ML libraries: MADlib, Bismarck, GLADE

Agenda: Big Model Analytics | Gradient Descent Optimization | Big Model Dot-Product | Dot-Product Join Operator | Vector Reordering | Batch Execution | Gradient Descent Integration | Conclusions

Recommender Systems: Big Model Example 1
http://www.slideshare.net/mrchrisjohnson/algorithmic-music-recommendations-at-spotify
Spotify applies low-rank matrix factorization (LMF) to 24 million users and 20 million songs, which amounts to 4.4 billion features at a relatively small rank of 100.

Text Analytics: Big Model Example 2
N-gram features of a sentence: for the English Wikipedia corpus, a feature vector with 25 billion unigrams and 218 billion bigrams can be constructed. [S. Lee, J. K. Kim et al., On Model Parallelization and Scheduling Strategies for Distributed Machine Learning, NIPS 2015]

Big Model Motivation
In the Cloud:
- There is always enough memory
- You can always add more servers
ML in IoT, Edge, and Fog Environments:
- Push processing to the devices acquiring the data, which have rather scarce resources
- Data transfer is not a viable alternative for bandwidth and privacy reasons
- Secondary storage (disk, SSD, or flash) is plentiful

Support for Big Models
Existing ML systems do not support Big Models:
- Model is an in-memory container data structure, e.g., vector or map
- Model is an array attribute in a single-column table (at most 1 GB in PostgreSQL) and the in-memory state of a UDA (User-Defined Aggregate)
Parameter Server [M. Li et al., Scaling Distributed Machine Learning with the Parameter Server, OSDI 2014]:
- Partition the model across the distributed shared memory of multiple servers, with each server storing a sufficiently small model partition that fits in its local memory
- Extensive hardware investment and considerable network traffic
- Complex partitioning, replication, and synchronization
- Targeted to the Cloud

In-Database Big Model ML
Problem: provide efficient and cost-effective in-database support for Big Model ML
Challenge:
- ML is heavily based on in-memory linear algebra (ordered arrays)
- Databases are based on disk relational algebra (unordered relations)
Solution:
- Offload the Big Model to secondary storage and leverage database processing techniques
- Design specialized in-database linear algebra operators for Big Model ML

Agenda: Big Model Analytics | Gradient Descent Optimization | Big Model Dot-Product | Dot-Product Join Operator | Vector Reordering | Batch Execution | Gradient Descent Integration | Conclusions

ML for Generalized Linear Models: Sparse Matrix Vector Multiplication (SpMV)
- Model is a d-dimensional vector $\vec{w}$, with $d \gg 1$
- Training data X consists of N d-dimensional feature vectors $\vec{x}_i$ and their corresponding labels $y_i$, $1 \le i \le N$
- Objective function (or loss): $\Lambda(\vec{w}) = \min_{\vec{w} \in \mathbb{R}^d} \sum_{i=1}^{N} f(\vec{w}, \vec{x}_i; y_i)$
- Find the model $\vec{w}$ that minimizes the objective function based on the training data
- Logistic Regression (LR): $\Lambda_{LR}(\vec{w}) = \sum_{i=1}^{N} \log\left(1 + e^{-y_i\,\vec{w} \cdot \vec{x}_i}\right)$
- Low-Rank Matrix Factorization (LMF): $\Lambda_{LMF}(L,R) = \frac{1}{2} \sum_{(i,j) \in M} \left(L_i^T R_j - M_{ij}\right)^2$

Gradient Descent Optimization
$$\min_{\vec{w} \in \mathbb{R}^d} \Big\{ \Lambda(\vec{w}) = \sum_{i=1}^{N} f(\vec{w}, \vec{x}_i; y_i) \Big\}, \qquad \vec{w}^{(k+1)} = \vec{w}^{(k)} - \alpha^{(k)} \nabla\Lambda\big(\vec{w}^{(k)}\big)$$
- $\alpha^{(k)}$ is the step size or learning rate
- $\vec{w}^{(0)}$ is the starting point (random)
- $\nabla\Lambda(\vec{w}) = \left[\frac{\partial \Lambda(\vec{w})}{\partial w_1}, \ldots, \frac{\partial \Lambda(\vec{w})}{\partial w_d}\right]$ is the gradient
- $\nabla\Lambda_{LR}(\vec{w}) = \sum_{i=1}^{N} \frac{e^{-y_i\,\vec{w} \cdot \vec{x}_i}}{1 + e^{-y_i\,\vec{w} \cdot \vec{x}_i}} \left(-y_i \vec{x}_i\right)$
- $\frac{\partial \Lambda_{LMF}(L,R)}{\partial L_i} = \sum_{(i,j) \in M} \left(L_i^T R_j - M_{ij}\right) R_j^T$
- Convergence to the minimum is guaranteed for a convex objective function
(Figure: http://www.yaldex.com/game-development/1592730043 ch18lev1sec4.html)

Batch Gradient Descent (BGD)
Input: {(x_j, y_j)}, 1 ≤ j ≤ N; f; Λ; ∇Λ; w^(0); α^(0)
Output: w^(k-1)
1: Let k = 1
2: while (true) do
3:   if convergence({Λ(w^(l))}, 0 ≤ l < k) then break
4:   Compute gradient: ∇Λ(w^(k-1)) over {(x_j, y_j)}, 1 ≤ j ≤ N
5:   Determine step size α^(k)
6:   Update model: w^(k) = w^(k-1) - α^(k) ∇Λ(w^(k-1))
7:   Let k = k + 1
8: end while
9: return w^(k-1)
(Figure: http://www.holehouse.org/mlclass/17 Large Scale Machine Learning.html)
The gradient ∇Λ is a standard SpMV between the training data X and the model w, i.e., X · w.
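To make the BGD pseudocode concrete, here is a minimal Python sketch of batch gradient descent for logistic regression on sparse data. It illustrates the loop above, not the in-database operator from the talk; the step size, convergence test, and toy data are assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix

def bgd_logistic_regression(X, y, alpha=0.1, max_iters=100, tol=1e-6):
    """Batch gradient descent for the LR loss sum_i log(1 + exp(-y_i * w.x_i)).
    X is a sparse N x d matrix, y a vector of +1/-1 labels."""
    N, d = X.shape
    w = np.zeros(d)                       # w^(0): starting point
    prev_loss = np.inf
    for k in range(max_iters):
        margins = -y * (X @ w)            # -y_i * (w . x_i): one SpMV per iteration
        loss = np.sum(np.logaddexp(0.0, margins))
        if abs(prev_loss - loss) < tol:   # simple convergence test
            break
        prev_loss = loss
        # gradient: sum_i (-y_i x_i) / (1 + exp(y_i w.x_i))
        coeff = -y / (1.0 + np.exp(-margins))
        grad = X.T @ coeff
        w -= alpha * grad                 # w^(k) = w^(k-1) - alpha^(k) * gradient
    return w

# Tiny usage example with synthetic data
X = csr_matrix(np.array([[1., 0., 3.], [0., 2., 0.], [4., 0., 1.]]))
y = np.array([1., -1., 1.])
w = bgd_logistic_regression(X, y)
```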

Stochastic Gradient Descent (SGD)
Input: {(x_j, y_j)}, 1 ≤ j ≤ N; f; ∇f; w^(0); α^(0)
Output: w^(k-1)
1: Let k = 1
2: while (true) do
3:   if convergence({Λ(w^(l))}, 0 ≤ l < k) then break
4:   for each example (x_η(i), y_η(i)) do
5:     Approximate gradient: ∇f(w^(k)_(i-1), x_η(i); y_η(i))
6:     w^(k)_(i) = w^(k)_(i-1) - α^(k) ∇f(w^(k)_(i-1), x_η(i); y_η(i))
7:   end for
8:   Update step size α^(k)
9:   Let k = k + 1
10: end while
11: return w^(k-1)
(Figure: http://www.holehouse.org/mlclass/17 Large Scale Machine Learning.html)
The approximate gradient ∇f is a standard vector dot-product between each training vector x_i and the model w, i.e., x_i · w.
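For contrast with BGD, the following Python sketch runs SGD for logistic regression: each step computes one sparse dot-product x_i · w and immediately updates only the model entries that the example touches. The learning rate, epoch count, and CSR traversal are illustrative choices rather than anything specified in the slides.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sgd_logistic_regression(X, y, alpha=0.1, epochs=5, seed=42):
    """SGD for logistic regression: one sparse dot-product x_i . w and one
    model update per training example. X is a sparse N x d CSR matrix."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for k in range(epochs):
        for i in rng.permutation(N):           # random example order eta(i)
            start, end = X.indptr[i], X.indptr[i + 1]
            idx, val = X.indices[start:end], X.data[start:end]
            dot = val @ w[idx]                 # sparse dot-product x_i . w
            coeff = -y[i] / (1.0 + np.exp(y[i] * dot))
            w[idx] -= alpha * coeff * val      # update only the touched entries
    return w
```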

BGD vs. SGD
BGD:
- $\vec{w}^{(k+1)} = \vec{w}^{(k)} - \alpha^{(k)} \nabla\Lambda\big(\vec{w}^{(k)}\big)$
- Exact gradient computation
- Single step for one iteration
- Faster convergence close to the minimum
SGD:
- $\vec{w}^{(k+1)} = \vec{w}^{(k)} - \beta^{(k)} \nabla f\big(\vec{w}^{(k)}, \vec{x}_{\eta(k)}; y_{\eta(k)}\big)$
- Approximate gradient at a data point
- One step for each random data point
- Faster convergence far from the minimum
(Figures: http://www.holehouse.org/mlclass/17 Large Scale Machine Learning.html)

Agenda: Big Model Analytics | Gradient Descent Optimization | Big Model Dot-Product | Dot-Product Join Operator | Vector Reordering | Batch Execution | Gradient Descent Integration | Conclusions

The Big Model Dot-Product Problem
- Set of (very) sparse vectors, i.e., a matrix, U = {ū_1, ..., ū_N}
- Dense vector V = v̄ that cannot fit in the available memory
- Compute DP = U · V in non-blocking mode, i.e., generate DP_i = ū_i · v̄ one-by-one
- Required for SGD, where V is updated after each DP_i is computed
- Novelty: constrained SpMV with V stored on disk
[Figure: example with 8 sparse vectors ū_1..ū_8 over 6 dimensions, the dense vector V = (0.1, 0.5, 0.3, 0.2, 0.1, 0.1), and the resulting dot-product vector DP = (2.8, 1.4, 6.2, 0.9, 1.4, 3.7, 1.0, 0.7)]
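As a rough illustration of the constrained setting, the sketch below computes DP_i = ū_i · V one vector at a time while paging V in from a binary file. The page size, file layout, and helper names are assumptions made for the example; there is no caching or reordering yet, which is exactly what the dot-product join operator later adds.

```python
import numpy as np

PAGE_SIZE = 1024  # number of V entries per page (assumption)

def load_page(v_file, page_id):
    """Read one page of the dense vector V from a binary file of float64s."""
    v_file.seek(page_id * PAGE_SIZE * 8)
    return np.frombuffer(v_file.read(PAGE_SIZE * 8), dtype=np.float64)

def dot_product_stream(U, v_path):
    """Yield DP_i = u_i . V one-by-one, with V paged in from secondary storage.
    U is an iterable of (indexes, values) pairs describing sparse vectors."""
    with open(v_path, "rb") as v_file:
        for indexes, values in U:
            dp, pages = 0.0, {}
            for idx, val in zip(indexes, values):
                page_id, offset = divmod(idx, PAGE_SIZE)
                if page_id not in pages:        # one disk access per distinct page
                    pages[page_id] = load_page(v_file, page_id)
                dp += val * pages[page_id][offset]
            yield dp
```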

Relational Dot-Product
Store the sparse matrix U and the dense vector V as two relational tables:
U(index INTEGER, value NUMERIC, tid INTEGER) -- {(1,1,1); (3,3,1); (4,9,1); ...}
V(index INTEGER, value NUMERIC) -- {(1,0.1); (2,0.5); (3,0.3); (4,0.2); (5,0.1); (6,0.1)}
The SQL query to compute DP is blocking:
SELECT U.tid, SUM(U.value*V.value)
FROM U, V
WHERE U.index = V.index
GROUP BY U.tid
Vectors are represented as tuples, which incurs redundancy.
[Query plan: U joined with V on U.index = V.index, followed by the group-by aggregate Γ_{U.tid; SUM(U.value*V.value)} producing DP]

Array-Relation Join
Store the sparse matrix U in coordinate representation, with two ARRAY attributes for the non-zero indexes and the corresponding values:
U(index INTEGER[], value NUMERIC[], tid INTEGER) -- {([1,3,4], [1,3,9], 1); ...}
V(index INTEGER, value NUMERIC) -- {(1,0.1); (2,0.5); (3,0.3); (4,0.2); (5,0.1); (6,0.1)}
The SQL query to compute DP is blocking:
SELECT U.tid, SUM(U.value[idx(U.index,V.index)] * V.value)
FROM U, V
WHERE V.index = ANY(U.index)
GROUP BY U.tid
The intermediate join size can be very large.
[Query plan: U joined with V on V.index = ANY(U.index), followed by the group-by aggregate Γ_{U.tid; SUM(U.value[idx(U.index,V.index)]*V.value)} producing DP]

Agenda: Big Model Analytics | Gradient Descent Optimization | Big Model Dot-Product | Dot-Product Join Operator | Vector Reordering | Batch Execution | Gradient Descent Integration | Conclusions

Dot-Product Join Operator: Idea
- Push the group-by aggregate into the join and generate the DP_i cells one-at-a-time by streaming U page-at-a-time and probing V (non-blocking)
- Each DP_i cell can be computed in memory
- V is range-based partitioned
[Query plans: the array-relation join with a separate group-by aggregate vs. the single Dot-Product Join operator over U and V producing DP directly]

Dot-Product Join Operator: Approach
- Minimize the number of secondary storage page accesses to vector V: reorder the vectors ū_i ∈ U at page level
- Minimize the number of requests for cells in V: batch the reordered vectors ū_i ∈ U and emit a single request for all the accessed pages
- Enhance the buffer manager with set requests
[Figure: the original U incurs 8 page accesses and 19 requests; the reordered U' incurs 4 page accesses and 16 requests; the reordered and batched U'' (3 batches) incurs 4 page accesses and 6 requests]

Dot-Product Join Operator: Algorithm
Require: U(index INTEGER[], value NUMERIC[], tid INTEGER); V(index INTEGER, value NUMERIC); memory budget M
Ensure: DP(tid INTEGER, product NUMERIC)
1: for each page u_p ∈ U do
   -- OPTIMIZATION
2:   Reorder the vectors ū_i to cluster similar vectors together
3:   Group the vectors ū_i into batches B_j that access at most M pages from V
   -- BATCH EXECUTION
4:   for each batch B_j do
5:     Collect the pages v_p ∈ V accessed by the vectors ū_i ∈ B_j into a set v_{B_j} = {v_p ∈ V | ∃ ū_i ∈ B_j that accesses v_p}
6:     Request access to the pages in v_{B_j}
       -- DOT-PRODUCT COMPUTATION
7:     dp_i ← Dot-Product(ū_i, V)
8:     return dp_i
9:   end for
10: end for
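The following Python sketch mirrors the structure of the algorithm above under simplifying assumptions: V is held in a numpy array standing in for the on-disk vector, each U vector is a dict with tid/indexes/values keys, and the buffer-manager set request is reduced to a slicing helper. It is meant to show the reorder-batch-request-compute flow, not the actual GLADE implementation.

```python
import numpy as np

PAGE_SIZE = 4096  # V entries per page (assumption)

def dot_product_join(u_pages, V, budget_pages, reorder=lambda vs: vs):
    """For each page of U: reorder its vectors, group them into batches whose
    combined V-page footprint fits the memory budget, fetch each batch's V
    pages with one set request, then emit (tid, dp_i) one at a time."""
    def pages_of(vec):
        return {idx // PAGE_SIZE for idx in vec["indexes"]}

    def fetch(page_ids):                  # stand-in for a buffer-manager set request
        return {p: V[p * PAGE_SIZE:(p + 1) * PAGE_SIZE] for p in page_ids}

    for u_page in u_pages:                # stream U page-at-a-time
        batch, footprint = [], set()
        for vec in list(reorder(u_page)) + [None]:   # None flushes the last batch
            if vec is None or len(footprint | pages_of(vec)) > budget_pages:
                if batch:
                    cache = fetch(footprint)          # single request per batch
                    for b in batch:
                        dp = sum(v * cache[i // PAGE_SIZE][i % PAGE_SIZE]
                                 for i, v in zip(b["indexes"], b["values"]))
                        yield b["tid"], dp
                batch, footprint = [], set()
            if vec is not None:
                batch.append(vec)
                footprint |= pages_of(vec)
```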

Vector Reordering
- Minimize the number of secondary storage page accesses to vector V
- Finding the optimal order is NP-hard (minimum Hamiltonian path)
- The Reverse Cuthill-McKee (RCM) algorithm used in SpMV permutes both rows and columns
- Cluster similar vectors together
- 3 heuristics compute a good ordering for the in-memory working set (inspired by the inspector-executor paradigm in SpMV): K-Center Clustering, Locality-Sensitive Hashing (LSH)-based Nearest Neighbor, and Radix Sort
[Figure: reordering U (8 page accesses, 19 requests) into U' (4 page accesses, 16 requests)]

K-Center Clustering
Require: vectors {ū_1, ..., ū_N} with page requests
Ensure: reordered input vectors {ū_{i_1}, ..., ū_{i_N}}
1: Initialize the first set of k centers with random vectors ū_i
2: Assign each vector to the center having the minimum set difference cardinality
3: Let X_l be the set of vectors assigned to center l, 1 ≤ l ≤ k
4: Call K-Center Reordering recursively for the sets X_l with requests that do not fit in memory
5: Reorder the centers and their corresponding vectors
- Complexity: O(kdN); k centers, N d-dimensional vectors
- Only partial ordering between two vectors
- Randomized algorithm sensitive to initialization
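A possible rendering of the k-center heuristic in Python is shown below. It assumes the set-difference cardinality is measured on page-request sets and omits the recursive refinement of oversized clusters, so it is a simplified sketch rather than the algorithm as implemented in the paper.

```python
import random

def k_center_order(vectors, k=8, seed=7):
    """Sketch of k-center reordering. Each vector is a (vector_id, page_request_set)
    pair; distance is the set-difference cardinality of page-request sets (an
    assumption). Recursive refinement of oversized clusters is omitted."""
    rng = random.Random(seed)
    centers = rng.sample(range(len(vectors)), min(k, len(vectors)))
    clusters = {c: [] for c in centers}
    for i, (_, pages) in enumerate(vectors):
        # assign to the center with the minimum set-difference cardinality
        best = min(centers, key=lambda c: len(pages - vectors[c][1]))
        clusters[best].append(i)
    # order clusters by their centers' smallest page id so nearby pages stay together
    ordered_centers = sorted(centers, key=lambda c: min(vectors[c][1], default=0))
    order = [i for c in ordered_centers for i in clusters[c]]
    return [vectors[i] for i in order]
```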

LSH-based Nearest Neighbor
Require: vectors {ū_1, ..., ū_N} with page requests; m minwise hash functions grouped into b bands
Ensure: reordered input vectors {ū_{i_1}, ..., ū_{i_N}}
Compute the LSH tables, then perform LSH-based nearest-neighbor search:
1: Initialize ū_{i_1} with a random vector ū_i
2: for j = 1 to N-1 do
3:   Let X_k be the set of vectors co-located in the same bucket as ū_{i_j} in hash table Hash_k and not yet selected
4:   X ← X_1 ∪ ... ∪ X_b
5:   Let ū_{i_{j+1}} be the vector in X with the minimum set difference cardinality C_{i_{j+1}, i_j} to the current vector ū_{i_j}
6: end for
- Complexity: O(mdN); m hash functions, N d-dimensional vectors
- Only partial ordering between two vectors
- Randomized algorithm sensitive to initialization
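The sketch below approximates the LSH-based ordering: minwise signatures are built with random linear hash functions, grouped into bands, and the chain greedily moves to the closest not-yet-selected vector found in a shared bucket. The number of hash functions, the number of bands, the choice of hash family, and the fallback when no bucket neighbor remains are all assumptions made for illustration.

```python
import random

def lsh_order(vectors, m=16, b=4, seed=13):
    """Sketch of LSH-based reordering over (vector_id, page_request_set) pairs.
    Builds m minwise hashes grouped into b bands, then greedily chains each
    vector to the closest unselected vector sharing a band bucket."""
    rng = random.Random(seed)
    prime = 2_147_483_647                 # large prime for the hash family
    hashes = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(m)]

    def minhash(pages, a, c):
        return min(((a * p + c) % prime) for p in pages)  # assumes non-empty sets

    sigs = [tuple(minhash(pages, a, c) for a, c in hashes) for _, pages in vectors]
    rows = m // b
    buckets = [dict() for _ in range(b)]  # one hash table per band
    for i, sig in enumerate(sigs):
        for band in range(b):
            key = sig[band * rows:(band + 1) * rows]
            buckets[band].setdefault(key, []).append(i)

    selected, order = set(), []
    current = rng.randrange(len(vectors))
    while len(order) < len(vectors):
        order.append(current)
        selected.add(current)
        # candidates co-located with the current vector in any band bucket
        cand = {j for band in range(b)
                for j in buckets[band].get(sigs[current][band * rows:(band + 1) * rows], [])
                if j not in selected}
        if not cand:                      # fallback: any unselected vector
            cand = set(range(len(vectors))) - selected
            if not cand:
                break
        # pick the neighbor with the smallest symmetric set difference
        current = min(cand, key=lambda j: len(vectors[j][1] ^ vectors[current][1]))
    return [vectors[i] for i in order]
```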

Radix Sort
Require: vectors {ū_1, ..., ū_N} with page requests
Ensure: reordered input vectors {ū_{i_1}, ..., ū_{i_N}}
Page request frequency computation:
1: Compute the page request frequency across the vectors {ū_1, ..., ū_N}
2: for each vector ū_i do
3:   Represent ū_i by a bitset of 0s and 1s, where a 1 at index k corresponds to the vector requesting page k
4:   Reorder the bitset in decreasing order of page request frequency, i.e., index 1 corresponds to the most frequent page
5: end for
Radix sort:
6: Apply radix sort to the set of bitsets
7: Let ū_{i_j} be the vector corresponding to the bitset at position j in the sorted order
- Complexity: O(dN); N d-dimensional vectors
- Strict ordering between two vectors based on a single dimension at a time
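A compact Python sketch of the radix-sort heuristic follows. It builds the frequency-ordered bitsets as described, but uses Python's lexicographic tuple sort in place of an explicit O(dN) radix pass, which yields the same ordering under a different cost model.

```python
from collections import Counter

def radix_order(vectors):
    """Sketch of radix-sort reordering over (vector_id, page_request_set) pairs:
    compute page-request frequencies, encode each vector as a bitset whose first
    position is the most frequently requested page, then sort the bitsets."""
    freq = Counter(p for _, pages in vectors for p in pages)
    # most frequently requested pages come first in the bitset
    page_rank = {p: r for r, (p, _) in enumerate(freq.most_common())}

    def bitset(pages):
        bits = [0] * len(page_rank)
        for p in pages:
            bits[page_rank[p]] = 1
        return tuple(bits)

    # descending order clusters vectors that hit the hottest pages first
    return sorted(vectors, key=lambda v: bitset(v[1]), reverse=True)
```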

Heuristics: Experimental Comparison
Setup: 16 cores, 28 GB of memory, and a 1 TB 7200 RPM SAS hard drive (120 MB/s)
[Figures: average reordering time per page (log scale) and relative improvement over LRU for LSH, Radix, and K-center at page sizes 1024, 4096, 16384, and 65536]
- Reordering time: radix sort is the only scalable solution (less than 1 second for 2^16 vectors)
- LRU comparison: LSH provides the largest reduction over LRU, at almost 30% for large pages

Batch Execution
- Minimize the number of requests for cells in V
- Construct batches by iterating over the reordered vectors until the memory budget for pages in V is full (optimal)
- Page accesses from the same batch are grouped into a single request
- Enhanced buffer manager with page set requests
[Figure: the reordered U' (4 page accesses, 16 requests) grouped into 3 batches in U'' (4 page accesses, 6 requests)]
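The greedy batch construction can be sketched as follows, assuming each reordered vector carries its set of requested V pages; one page-set request is later issued per batch instead of one request per vector, which is what reduces the request count in the figure.

```python
def build_batches(vectors, budget_pages):
    """Sketch of batch construction: greedily extend the current batch with the
    next reordered vector until the combined V-page footprint would exceed the
    memory budget, then start a new batch. Vectors are
    (vector_id, page_request_set) pairs; each batch later triggers a single
    set request to the buffer manager for its whole footprint."""
    batches, current, footprint = [], [], set()
    for vec_id, pages in vectors:
        if current and len(footprint | pages) > budget_pages:
            batches.append((current, footprint))
            current, footprint = [], set()
        current.append(vec_id)
        footprint |= pages
    if current:
        batches.append((current, footprint))
    return batches   # one page-set request per batch instead of one per vector

# Example: three vectors touching pages of V, memory budget of 2 pages
batches = build_batches([("u1", {0}), ("u3", {0, 1}), ("u5", {1, 2})], 2)
```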

Batch Execution: Evaluation
Dataset | # Dims | # Examples | Size   | Model
uniform | 1B     | 80K        | 4.2 GB | 8 GB
skewed  | 1B     | 1M         | 4.5 GB | 8 GB
splice  | 13M    | 500K       | 30 GB  | 100 MB
- Radix sort reordering, page size is 4096
- Memory budget is 20% of the model size (vector V dimensionality)
[Figures: average dot-product join time per page for tuple-at-a-time, batching, and reordering + batching on uniform, skewed, and splice (ranging from 153.25 down to 5.24 seconds); dot-product join time (log scale) over skewed for tuple-at-a-time vs. reordering + batching as available memory grows from 20% to 100%]

Gradient Descent Integration
[Query plan: U and V feed the DOT_PRODUCT_JOIN operator on index, whose output is consumed by the UDA_gradient aggregate and a selection on tid that performs the SGD model update]
- SGD includes gradient computation and model update beyond the dot-product
[Figures: execution time of dot-product vs. full SGD on splice for 20%-100% available memory and on matrix for 10%-20% available memory]

Dot-Product Join vs. Alternatives
Dataset   | # Dims    | # Examples | Size   | Model
uniform   | 1B        | 80K        | 4.2 GB | 8 GB
skewed    | 1B        | 1M         | 4.5 GB | 8 GB
matrix    | 10M x 10K | 300M       | 4.5 GB | 80 GB
splice    | 13M       | 500K       | 30 GB  | 100 MB
MovieLens | 6K x 4K   | 1M         | 24 MB  | 80 MB
- Memory budget is 40% of the model size (vector V dimensionality) for GLADE and Dot-Product Join; 10 GB in PostgreSQL (PG); unlimited in MonetDB and SciDB
- Execution time (in seconds) for relational (R) and array (A) solutions; N/A stands for not finishing in 24 hours
Dataset   | GLADE R | PG R  | PG A  | MonetDB R | SciDB A | DOT-PRODUCT JOIN
uniform   | 37851   | N/A   | N/A   | 4280      | N/A     | 3914
skewed    | 13000   | N/A   | N/A   | 4952      | N/A     | 786
matrix    | N/A     | N/A   | N/A   | N/A       | N/A     | 9045
splice    | 9554    | 81054 | N/A   | 2947      | N/A     | 6769
MovieLens | 905     | 5648  | 72480 | N/A       | 1116    | 477

Conclusions
- Investigate the Big Model analytics problem and identify dot-product as the critical operation in training generalized linear models
- Establish a direct correspondence with the sparse matrix vector multiplication (SpMV) problem
- Present several alternatives for implementing dot-product in a relational database
- Design the first array-relation dot-product join database operator targeted at secondary storage and introduce two optimizations, dynamic batch processing and reordering, to make the operator practical
- Devise three batch reordering heuristics (K-center, LSH, and Radix) inspired by optimizations to the SpMV kernel and evaluate them thoroughly
- Execute an extensive set of experiments that evaluate each sub-component of the operator and compare the overall solution with alternative dot-product implementations over synthetic and real data
- Dot-product join provides a significant reduction in execution time over alternative in-database solutions

Thank you. Questions?