In-Database Factorised Learning fdbresearch.github.io


In-Database Factorised Learning
fdbresearch.github.io
Mahmoud Abo Khamis, Hung Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich
December 2017, Logic for Data Science Seminar, Alan Turing Institute

Talk Outline
Current Landscape for ML over DB
Factorised Learning over Normalised Data
Learning under Functional Dependencies
General Problem Formulation
1/37

Brief Outlook at Current Landscape for ML over DB (1/2)
No integration: ML and DB are distinct tools on the technology stack; the DB exports the data as one table and the ML tool imports it in its own format. Spark/PostgreSQL + R supports virtually any ML task. Most ML-over-DB solutions operate in this space.
Loose integration: each ML task is implemented by a distinct UDF inside the DB; DB and ML share the same running process; the DB computes one table and ML works directly on it. MADlib supports a comprehensive library of ML UDFs.
2/37

Brief Outlook at Current Landscape for ML over DB (2/2)
Unified programming architecture: one framework for many ML tasks instead of one UDF per task, with possible code reuse across UDFs; the DB computes one table and ML works directly on it. Bismarck supports incremental gradient descent for convex programming, with up to 100% overhead over specialised UDFs.
Tight integration (in-database analytics): one evaluation plan for both the DB and the ML workload, with the opportunity to push parts of the ML task past the DB joins. Morpheus + Hamlet support GLMs and naïve Bayes. Our approach supports PR/FM, decision trees, ...
3/37

In-Database Analytics
Move the analytics, not the data: avoid expensive data export/import; exploit database technologies; exploit the relational structure (schema, query, dependencies); build better models, on larger datasets, faster.
Cast the analytics code as join-aggregate queries: many similar queries that massively share computation; a fixpoint computation is needed for model convergence.
4/37

In-database vs. Out-of-database Analytics
[Diagram: out-of-database, the materialised output of the feature extraction query is exported from the DB into an ML tool that trains the model θ; in-database, the model is reformulated into optimised join-aggregate queries that the query engine evaluates and feeds to a gradient-descent trainer.]
5/37

Does It Pay Off in Practice?
Retailer dataset (records)                   excerpt (17M)      full (86M)

Linear regression
  Features (cont+categ)                      33+55              33+3,653
  Aggregates (cont+categ)                    595+2,418          595+145k
  MADlib        Learn                        1,898.35 sec       > 24 h
  R             Join (PSQL)                  50.63 sec          -
                Export/Import                308.83 sec         -
                Learn                        490.13 sec         -
  Our approach  Join-Aggregate               25.51 sec          380.31 sec
  (1 core,      Converge (runs)              0.02 sec (343)     8.82 sec (366)
  commodity machine)

Polynomial regression, degree 2
  Features (cont+categ)                      562+2,363          562+141k
  Aggregates (cont+categ)                    158k+742k          158k+37M
  MADlib        Learn                        > 24 h             -
  Our approach  Join-Aggregate               132.43 sec         1,819.80 sec
  (1 core,      Converge (runs)              3.27 sec (321)     219.51 sec (180)
  commodity machine)
6/37

Talk Outline
Current Landscape for ML over DB
Factorised Learning over Normalised Data
Learning under Functional Dependencies
General Problem Formulation
7/37

Unified In-Database Analytics for Optimisation Problems
Our target: retail-planning and forecasting applications.
Typical databases: weekly sales, promotions, and products.
Training dataset: the result of a feature extraction query.
Task: train a model to predict the additional demand generated for a product due to a promotion.
Training algorithm: batch gradient descent.
ML tasks: ridge linear regression, polynomial regression, factorisation machines; logistic regression, SVM; PCA.
8/37

Typical Retail Example
Database I = (R_1, R_2, R_3, R_4, R_5), with relations
R_1(sku, store, day, unitssold), R_2(sku, color), R_3(day, quarter), R_4(store, city), R_5(city, country).
Feature selection query Q: Q(sku, store, color, city, country, unitssold).
Free variables: categorical (qualitative) F = {sku, store, color, city, country}; continuous (quantitative): unitssold.
Bound variables: categorical (qualitative) B = {day, quarter}.
9/37

Typical Retail Example
We learn the ridge linear regression model $\langle \theta, x \rangle = \sum_{f \in F} \langle \theta_f, x_f \rangle$.
Training dataset: D = Q(I), with feature vector x and response y = unitssold.
The parameters θ are obtained by minimising the objective function (least-squares loss plus regulariser):
$$J(\theta) = \frac{1}{2|D|} \sum_{(x,y) \in D} (\langle \theta, x \rangle - y)^2 + \frac{\lambda}{2} \|\theta\|_2^2$$
10/37

Side Note: One-hot Encoding of Categorical Variables
Continuous variables are mapped to scalars: y_unitssold ∈ R.
Categorical variables are mapped to indicator vectors: if country has the categories vietnam and england, then country is mapped to an indicator vector x_country = [x_vietnam, x_england] ∈ {0, 1}^2, e.g. x_country = [0, 1] for a tuple with country = england.
This encoding leads to wide training datasets with many zeros.
11/37
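
A minimal sketch of this encoding (illustrative only; the category order [vietnam, england] and the helper name one_hot are assumptions of this example, not part of the talk):

```python
def one_hot(value, categories):
    """Map a categorical value to its indicator vector over `categories`."""
    return [1 if c == value else 0 for c in categories]

countries = ["vietnam", "england"]
print(one_hot("england", countries))  # [0, 1]
print(one_hot("vietnam", countries))  # [1, 0]
```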

From Optimisation to Sum-Product Queries
We can solve θ* := arg min_θ J(θ) by repeatedly updating θ in the direction of the gradient until convergence: θ := θ − α ∇J(θ).
Define the matrix Σ = (σ_ij)_{i,j ∈ [|F|]}, the vector c = (c_i)_{i ∈ [|F|]}, and the scalar s_Y:
$$\sigma_{ij} = \frac{1}{|D|} \sum_{(x,y) \in D} x_i x_j \qquad c_i = \frac{1}{|D|} \sum_{(x,y) \in D} y \cdot x_i \qquad s_Y = \frac{1}{|D|} \sum_{(x,y) \in D} y^2$$
Then,
$$J(\theta) = \frac{1}{2|D|} \sum_{(x,y) \in D} (\langle \theta, x \rangle - y)^2 + \frac{\lambda}{2} \|\theta\|_2^2 = \frac{1}{2} \theta^\top \Sigma\, \theta - \langle \theta, c \rangle + \frac{s_Y}{2} + \frac{\lambda}{2} \|\theta\|_2^2$$
12/37
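
With Σ and c in hand, the gradient is ∇J(θ) = Σθ − c + λθ, so the descent loop never touches the data again. A minimal sketch under these assumptions (not the authors' implementation; train_ridge and its hyperparameters are made up for illustration, and the tiny X, y stand in for the aggregates):

```python
import numpy as np

def train_ridge(Sigma, c, lam=0.1, alpha=0.05, eps=1e-6, max_iters=10_000):
    """Minimise J(theta) = 1/2 theta^T Sigma theta - <theta, c> + s_Y/2 + lam/2 ||theta||^2
    by batch gradient descent on the precomputed aggregates Sigma and c."""
    theta = np.zeros_like(c)
    for _ in range(max_iters):
        grad = Sigma @ theta - c + lam * theta  # gradient of J at theta
        theta -= alpha * grad
        if np.linalg.norm(grad) < eps:          # converged
            break
    return theta

# Tiny synthetic stand-in for the aggregates: Sigma = X^T X / |D|, c = X^T y / |D|.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([3.0, 4.0, 6.0])
Sigma, c = X.T @ X / len(y), X.T @ y / len(y)
print(train_ridge(Sigma, c))
```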

Σ, c, s_Y Can Be Expressed as SQL Queries
SQL queries for σ_ij = (1/|D|) Σ_{(x,y) ∈ D} x_i x_j (without the factor 1/|D|), where D is the natural join of the tables R_1 to R_5 in our example:
x_i, x_j continuous → no group-by variable:
SELECT SUM(x_i * x_j) FROM D;
x_i categorical, x_j continuous → one group-by variable:
SELECT x_i, SUM(x_j) FROM D GROUP BY x_i;
x_i, x_j categorical → two group-by variables:
SELECT x_i, x_j, SUM(1) FROM D GROUP BY x_i, x_j;
This query encoding avoids the drawbacks of one-hot encoding.
13/37
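
The result of the group-by queries is naturally a sparse map from the category combinations that actually occur to aggregate values, rather than a dense matrix over all one-hot indicators. A small sketch of that representation (assumed for illustration; the toy relation D and the variable names are not from the talk):

```python
from collections import defaultdict

D = [  # toy join result: (city, country, unitssold)
    ("saigon", "vietnam", 10),
    ("hanoi",  "vietnam",  5),
    ("oxford", "england",  7),
    ("saigon", "vietnam",  3),
]

# SELECT city, country, SUM(1) FROM D GROUP BY city, country;
sigma_city_country = defaultdict(int)
# SELECT city, SUM(unitssold) FROM D GROUP BY city;
c_city = defaultdict(float)

for city, country, units in D:
    sigma_city_country[(city, country)] += 1
    c_city[city] += units

print(dict(sigma_city_country))  # only the (city, country) pairs that occur
print(dict(c_city))
```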

How to Compute These Join-Aggregate Queries Efficiently?
13/37

Factorised Query Computation by Example
Orders (O for short): customer, day, dish
  Elise  Monday  burger
  Elise  Friday  burger
  Steve  Friday  hotdog
  Joe    Friday  hotdog
Dish (D for short): dish, item
  burger  patty
  burger  onion
  burger  bun
  hotdog  bun
  hotdog  onion
  hotdog  sausage
Items (I for short): item, price
  patty    6
  onion    2
  bun      2
  sausage  4
Consider the natural join of the above relations: O(customer, day, dish), D(dish, item), I(item, price)
  customer  day     dish    item   price
  Elise     Monday  burger  patty  6
  Elise     Monday  burger  onion  2
  Elise     Monday  burger  bun    2
  Elise     Friday  burger  patty  6
  Elise     Friday  burger  onion  2
  Elise     Friday  burger  bun    2
  ...
14/37

Factorised Query Computation by Example
O(customer, day, dish), D(dish, item), I(item, price)
  customer  day     dish    item   price
  Elise     Monday  burger  patty  6
  Elise     Monday  burger  onion  2
  Elise     Monday  burger  bun    2
  Elise     Friday  burger  patty  6
  Elise     Friday  burger  onion  2
  Elise     Friday  burger  bun    2
  ...
An algebraic encoding uses product (×), union (∪), and values:
  Elise × Monday × burger × patty × 6
  ∪ Elise × Monday × burger × onion × 2
  ∪ Elise × Monday × burger × bun × 2
  ∪ Elise × Friday × burger × patty × 6
  ∪ Elise × Friday × burger × onion × 2
  ∪ Elise × Friday × burger × bun × 2
  ∪ ...
15/37

Factorised Join
Variable order: dish at the root, with day (and customer below day) and item (and price below item) as its children.
Grounding of the variable order over the input database:
  burger × (Monday × Elise ∪ Friday × Elise) × (patty × 6 ∪ bun × 2 ∪ onion × 2)
  ∪ hotdog × (Friday × (Joe ∪ Steve)) × (bun × 2 ∪ onion × 2 ∪ sausage × 4)
There are several algebraically equivalent factorised joins, defined by distributivity of product over union and their commutativity.
16/37
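
A minimal sketch of such a factorised join as a nested expression; the tuple-based encoding ("u" for union, "p" for product, (variable, value) leaves) is an assumption of this example, not the system's actual data structure. Expanding it reproduces the 12 tuples of the flat join:

```python
from itertools import product as cartesian

# Grounding of the variable order dish -> {day -> customer, item -> price}.
burger = ("p", [("dish", "burger"),
                ("u", [("p", [("day", "Monday"), ("u", [("customer", "Elise")])]),
                       ("p", [("day", "Friday"), ("u", [("customer", "Elise")])])]),
                ("u", [("p", [("item", "patty"), ("u", [("price", 6)])]),
                       ("p", [("item", "bun"),   ("u", [("price", 2)])]),
                       ("p", [("item", "onion"), ("u", [("price", 2)])])])])
hotdog = ("p", [("dish", "hotdog"),
                ("u", [("p", [("day", "Friday"), ("u", [("customer", "Joe"),
                                                        ("customer", "Steve")])])]),
                ("u", [("p", [("item", "bun"),     ("u", [("price", 2)])]),
                       ("p", [("item", "onion"),   ("u", [("price", 2)])]),
                       ("p", [("item", "sausage"), ("u", [("price", 4)])])])])
factorised_join = ("u", [burger, hotdog])

def enumerate_tuples(node):
    """Expand a factorised expression back into the flat tuples it encodes."""
    kind = node[0]
    if kind == "u":   # union: concatenate the tuples of the alternatives
        return [t for child in node[1] for t in enumerate_tuples(child)]
    if kind == "p":   # product: combine one tuple fragment per child
        parts = [enumerate_tuples(child) for child in node[1]]
        return [dict(kv for d in combo for kv in d.items()) for combo in cartesian(*parts)]
    var, value = node  # leaf
    return [{var: value}]

tuples = enumerate_tuples(factorised_join)
print(len(tuples))  # 12 tuples, as in the flat join
print(tuples[0])    # {'dish': 'burger', 'day': 'Monday', 'customer': 'Elise', 'item': 'patty', 'price': 6}
```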

.. Now with Further Compression
[Diagram: the same factorised join, but the price of each item (patty 6, bun 2, onion 2, sausage 4) is stored once and shared between the burger and hotdog branches.]
Observation: price is under item, which is under dish, but it only depends on item, so the same price appears under an item regardless of the dish.
Idea: cache the price for a specific item and avoid repetition!
17/37

Same Data, Different Factorisation
[Diagram: factorisation over the variable order day → customer → dish → item → price:
  Monday × Elise × burger × (patty × 6 ∪ bun × 2 ∪ onion × 2)
  ∪ Friday × (Elise × burger × (patty × 6 ∪ bun × 2 ∪ onion × 2)
              ∪ Joe × hotdog × (bun × 2 ∪ onion × 2 ∪ sausage × 4)
              ∪ Steve × hotdog × (bun × 2 ∪ onion × 2 ∪ sausage × 4))]
18/37

.. and Further Compressed
[Diagram: the same factorisation over day → customer → dish → item → price, with the item subtrees of burger (patty, bun, onion) and hotdog (bun, onion, sausage) and the prices of shared items (6, 2, 2, 4) each stored once and shared across branches, e.g. between Joe's and Steve's hotdog.]
19/37

Grounding Variable Orders to Factorised Joins
Our join O(customer, day, dish), D(dish, item), I(item, price) can be grounded to a factorised join as follows:
  dish:       O(_, _, dish) ⋈ D(dish, _)
    day:        O(_, day, dish)
      customer:   O(customer, day, dish)
    item:       D(dish, item)
      price:      I(item, price)
This grounding follows the previous variable order.
20/37

Grounding Variable Orders to Factorised Joins
  dish:       O(_, _, dish) ⋈ D(dish, _)
    day:        O(_, day, dish)
      customer:   O(customer, day, dish)
    item:       D(dish, item)
      price:      I(item, price)
The relations are sorted following a topological order of the variable order.
The intersection of O and D on dish takes time Õ(min(|π_dish O|, |π_dish D|)).
The remaining operations are lookups in the relations, where we first fix the dish value and then the day and item values.
21/37
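
A minimal sketch of the dish-intersection step under the stated bound (assumed, not the actual engine): probe the larger sorted projection with binary search for each value of the smaller one.

```python
from bisect import bisect_left

def sorted_intersect(small, large):
    """Intersect two sorted, duplicate-free lists of dish values."""
    if len(small) > len(large):
        small, large = large, small
    out = []
    for v in small:                    # O(min(m, n)) probes
        i = bisect_left(large, v)      # O(log max(m, n)) each
        if i < len(large) and large[i] == v:
            out.append(v)
    return out

dishes_in_orders = ["burger", "hotdog"]           # sorted pi_dish(Orders)
dishes_in_dish   = ["burger", "hotdog", "salad"]  # sorted pi_dish(Dish); "salad" is made up
print(sorted_intersect(dishes_in_orders, dishes_in_dish))  # ['burger', 'hotdog']
```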

Factorising the Computation of Aggregates (1/2)
[The (compressed) factorised join over the variable order dish → {day → customer, item → price}, as on the previous slides.]
COUNT(*) is computed in one pass over the factorisation: values ↦ 1, unions ↦ +, products ↦ ×.
22/37

Factorising the Computation of Aggregates (1/2)
[Diagram: the same factorisation with every value replaced by 1, unions by + and products by ×. Each dish branch evaluates to 1 × 2 × 3 = 6, and the root union to 6 + 6 = 12, i.e. COUNT(*) = 12.]
COUNT(*) is computed in one pass over the factorisation: values ↦ 1, unions ↦ +, products ↦ ×.
23/37

Factorising the Computation of Aggregates (2/2)
[The (compressed) factorised join over the variable order dish → {day → customer, item → price}, as before.]
SUM(dish * price) is computed in one pass over the factorisation: assume there is a function f that turns dish values into reals; all values other than those of dish and price are mapped to 1, unions to + and products to ×.
24/37

Factorising the Computation of Aggregates (2/2)
[Diagram: the same factorisation with dish values replaced by f(dish), price values kept, and all other values replaced by 1. The burger branch evaluates to f(burger) × 2 × (6 + 2 + 2) = 20 f(burger), the hotdog branch to f(hotdog) × 2 × (2 + 2 + 4) = 16 f(hotdog), and the root union to 20 f(burger) + 16 f(hotdog).]
SUM(dish * price) is computed in one pass over the factorisation: assume there is a function f that turns dish values into reals; all values other than those of dish and price are mapped to 1, unions to + and products to ×.
25/37
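
A sketch of this one-pass evaluation, using the same assumed nested encoding as the earlier sketch (re-declared here so the snippet runs standalone); the function f mapping dishes to reals is made up for illustration:

```python
burger = ("p", [("dish", "burger"),
                ("u", [("p", [("day", "Monday"), ("u", [("customer", "Elise")])]),
                       ("p", [("day", "Friday"), ("u", [("customer", "Elise")])])]),
                ("u", [("p", [("item", "patty"), ("u", [("price", 6)])]),
                       ("p", [("item", "bun"),   ("u", [("price", 2)])]),
                       ("p", [("item", "onion"), ("u", [("price", 2)])])])])
hotdog = ("p", [("dish", "hotdog"),
                ("u", [("p", [("day", "Friday"), ("u", [("customer", "Joe"),
                                                        ("customer", "Steve")])])]),
                ("u", [("p", [("item", "bun"),     ("u", [("price", 2)])]),
                       ("p", [("item", "onion"),   ("u", [("price", 2)])]),
                       ("p", [("item", "sausage"), ("u", [("price", 4)])])])])
factorised_join = ("u", [burger, hotdog])

def aggregate(node, leaf_value):
    """One pass over the factorisation: unions -> +, products -> *, leaves -> leaf_value."""
    kind = node[0]
    if kind == "u":
        return sum(aggregate(child, leaf_value) for child in node[1])
    if kind == "p":
        result = 1
        for child in node[1]:
            result *= aggregate(child, leaf_value)
        return result
    var, value = node  # leaf
    return leaf_value(var, value)

# COUNT(*): every value counts as 1.
print(aggregate(factorised_join, lambda var, value: 1))  # 12

# SUM(dish * price): dish -> f(dish), price -> price, everything else -> 1.
f = {"burger": 1.0, "hotdog": 2.0}  # hypothetical encoding of dish values as reals
print(aggregate(factorised_join,
                lambda var, value: f[value] if var == "dish"
                else (value if var == "price" else 1)))  # 20*f(burger) + 16*f(hotdog) = 52.0
```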

Talk Outline
Current Landscape for ML over DB
Factorised Learning over Normalised Data
Learning under Functional Dependencies
General Problem Formulation
26/37

Model Reparameterisation using Functional Dependencies
Consider the functional dependency city → country, with
  country categories: vietnam, england
  city categories: saigon, hanoi, oxford, leeds, bristol
The one-hot encoding enforces the following identities:
  x_vietnam = x_saigon + x_hanoi
    (country is vietnam when city is either saigon or hanoi; x_vietnam = 1 when either x_saigon = 1 or x_hanoi = 1)
  x_england = x_oxford + x_leeds + x_bristol
    (country is england when city is either oxford, leeds, or bristol; x_england = 1 when either x_oxford = 1, x_leeds = 1, or x_bristol = 1)
27/37

Model Reparameterisation using Functional Dependencies
Identities due to one-hot encoding:
  x_vietnam = x_saigon + x_hanoi
  x_england = x_oxford + x_leeds + x_bristol
Encode x_country as x_country = R x_city, where R is indexed by (country, city) categories:
              saigon  hanoi  oxford  leeds  bristol
  vietnam   [    1      1      0       0       0   ]
  england   [    0      0      1       1       1   ]
For instance, if city is saigon, i.e., x_city = [1, 0, 0, 0, 0], then country is vietnam, i.e.,
  x_country = R x_city = [[1, 1, 0, 0, 0], [0, 0, 1, 1, 1]] · [1, 0, 0, 0, 0]^T = [1, 0]^T.
28/37
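
A quick numeric check of this example (using the category order assumed above):

```python
import numpy as np

R = np.array([[1, 1, 0, 0, 0],   # vietnam <- saigon, hanoi
              [0, 0, 1, 1, 1]])  # england <- oxford, leeds, bristol

x_city = np.array([1, 0, 0, 0, 0])  # city = saigon
print(R @ x_city)                   # [1 0], i.e. country = vietnam
```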

Model Reparameterisation using Functional Dependencies
Functional dependency city → country, hence x_country = R x_city.
Replace all occurrences of x_country by R x_city:
$$\sum_{f \in F} \langle \theta_f, x_f \rangle = \sum_{f \in F \setminus \{city,\, country\}} \langle \theta_f, x_f \rangle + \langle \theta_{country}, x_{country} \rangle + \langle \theta_{city}, x_{city} \rangle$$
$$= \sum_{f \in F \setminus \{city,\, country\}} \langle \theta_f, x_f \rangle + \langle \theta_{country}, R x_{city} \rangle + \langle \theta_{city}, x_{city} \rangle$$
$$= \sum_{f \in F \setminus \{city,\, country\}} \langle \theta_f, x_f \rangle + \langle \underbrace{R^\top \theta_{country} + \theta_{city}}_{\gamma_{city}}, x_{city} \rangle$$
We avoid the computation of the aggregates over x_country, and we reparameterise and ignore the parameters θ_country. What about the penalty term in the loss function?
29/37

Model Reparameterisation using Functional Dependencies
Functional dependency city → country: x_country = R x_city and γ_city = R^⊤ θ_country + θ_city.
Rewrite the penalty term:
$$\|\theta\|_2^2 = \sum_{j \notin \{city,\, country\}} \|\theta_j\|_2^2 + \|\gamma_{city} - R^\top \theta_{country}\|_2^2 + \|\theta_{country}\|_2^2$$
Optimise out θ_country by expressing it in terms of γ_city:
$$\theta_{country} = (I_{country} + R R^\top)^{-1} R \gamma_{city} = R (I_{city} + R^\top R)^{-1} \gamma_{city}$$
The penalty term becomes
$$\|\theta\|_2^2 = \sum_{j \notin \{city,\, country\}} \|\theta_j\|_2^2 + \langle (I_{city} + R^\top R)^{-1} \gamma_{city},\, \gamma_{city} \rangle$$
30/37
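
A small numpy sketch checking the two closed forms for θ_country and the resulting penalty contribution, using the 2 × 5 matrix R from the running example (γ_city is an arbitrary illustrative vector, not from the talk):

```python
import numpy as np

R = np.array([[1., 1., 0., 0., 0.],
              [0., 0., 1., 1., 1.]])
gamma_city = np.array([0.3, -0.1, 0.7, 0.2, -0.4])
I_country, I_city = np.eye(R.shape[0]), np.eye(R.shape[1])

theta_country_1 = np.linalg.inv(I_country + R @ R.T) @ R @ gamma_city
theta_country_2 = R @ np.linalg.inv(I_city + R.T @ R) @ gamma_city
print(np.allclose(theta_country_1, theta_country_2))  # True: both closed forms agree

# The (city, country) part of the penalty after optimising out theta_country ...
penalty_block = gamma_city @ np.linalg.inv(I_city + R.T @ R) @ gamma_city
# ... equals ||gamma_city - R^T theta_country||^2 + ||theta_country||^2 at the optimum.
direct = (np.linalg.norm(gamma_city - R.T @ theta_country_1) ** 2
          + np.linalg.norm(theta_country_1) ** 2)
print(np.allclose(direct, penalty_block))  # True
```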

Talk Outline
Current Landscape for ML over DB
Factorised Learning over Normalised Data
Learning under Functional Dependencies
General Problem Formulation
31/37

General Problem Formulation
We want to solve θ* := arg min_θ J(θ), where
$$J(\theta) := \sum_{(x,y) \in D} L\big(\langle g(\theta), h(x) \rangle,\, y\big) + \Omega(\theta)$$
  θ = (θ_1, ..., θ_p) ∈ R^p are the parameters;
  g : R^p → R^m and h : R^n → R^m are functions, where g = (g_j)_{j ∈ [m]} is a vector of multivariate polynomials and h = (h_j)_{j ∈ [m]} is a vector of multivariate monomials;
  L is a loss function and Ω is the regulariser;
  D is the training dataset, with features x and response y.
Problems in this class: ridge linear regression, polynomial regression, factorisation machines; logistic regression, SVM; PCA.
32/37

Special Case: Ridge Linear Regression
Under square loss L, ℓ2-regularisation, data points x = (x_0, x_1, ..., x_n, y), p = n + 1 parameters θ = (θ_0, ..., θ_n), where x_0 = 1 corresponds to the bias parameter θ_0, and identity functions g and h, we obtain the following formulation for ridge linear regression:
$$J(\theta) := \frac{1}{2|D|} \sum_{(x,y) \in D} (\langle \theta, x \rangle - y)^2 + \frac{\lambda}{2} \|\theta\|_2^2$$
33/37

Special Case: Degree-d Polynomial Regression
Under square loss L, ℓ2-regularisation, data points x = (x_0, x_1, ..., x_n, y), p = m = 1 + n + n^2 + ... + n^d parameters θ = (θ_a), where a = (a_1, ..., a_n) ranges over tuples of non-negative integers such that ‖a‖_1 ≤ d, the components of h given by h_a(x) = ∏_{i=1}^n x_i^{a_i}, and g(θ) = θ, we obtain the following formulation for polynomial regression:
$$J(\theta) := \frac{1}{2|D|} \sum_{(x,y) \in D} (\langle g(\theta), h(x) \rangle - y)^2 + \frac{\lambda}{2} \|\theta\|_2^2$$
34/37
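
A sketch (illustrative only, not the authors' code) of the monomial feature map h; it enumerates the distinct monomials of total degree at most d, with the degree-0 term giving the constant 1:

```python
from itertools import combinations_with_replacement
import numpy as np

def monomial_features(x, d):
    """h(x): one component per monomial prod_i x_i^{a_i} with a_1 + ... + a_n <= d."""
    feats = []
    for degree in range(d + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            feats.append(np.prod([x[i] for i in idx]) if idx else 1.0)
    return np.array(feats)

x = np.array([2.0, 3.0])
print(monomial_features(x, 2))  # [1, 2, 3, 4, 6, 9]: 1, x1, x2, x1^2, x1*x2, x2^2
```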

Special Case: Factorisation Machines
Under square loss L, ℓ2-regularisation, data points x = (x_0, x_1, ..., x_n, y), p = 1 + n + r·n parameters, and m = 1 + n + (n choose 2) features, we obtain the following formulation for degree-2, rank-r factorisation machines:
$$J(\theta) := \frac{1}{2|D|} \sum_{(x,y) \in D} \Big( \sum_{i=0}^{n} \theta_i x_i + \sum_{\{i,j\} \in \binom{[n]}{2}} \sum_{l \in [r]} \theta_i^{(l)} \theta_j^{(l)} x_i x_j - y \Big)^2 + \frac{\lambda}{2} \|\theta\|_2^2$$
35/37
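
A sketch of the degree-2, rank-r factorisation machine score ⟨g(θ), h(x)⟩ that appears inside the square loss above (illustrative; the names fm_score, theta_lin, and V are made up here):

```python
import numpy as np

def fm_score(x, theta_lin, V):
    """x: length n+1 with x[0] = 1 (intercept); theta_lin: length n+1 linear weights;
    V: (n+1) x r latent factors, row 0 unused since the intercept has no pairwise term."""
    score = theta_lin @ x                         # sum_{i=0}^{n} theta_i * x_i
    n = len(x) - 1
    for i in range(1, n + 1):                     # sum over {i, j} subsets of [n]
        for j in range(i + 1, n + 1):
            score += (V[i] @ V[j]) * x[i] * x[j]  # sum_l theta_i^(l) theta_j^(l) x_i x_j
    return score

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -2.0, 3.0])  # n = 3 features plus the intercept x[0] = 1
theta_lin = rng.normal(size=4)
V = rng.normal(size=(4, 2))          # rank r = 2
print(fm_score(x, theta_lin, V))
```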

Special Case: Classifiers
Typically, the regulariser is (λ/2) ‖θ‖_2^2 and the response is binary: y ∈ {±1}. The loss function L(γ, y), where γ := ⟨g(θ), h(x)⟩, is
  L(γ, y) = max{1 − yγ, 0} for support vector machines,
  L(γ, y) = log(1 + e^{−yγ}) for logistic regression,
  L(γ, y) = e^{−yγ} for AdaBoost.
36/37
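
The three losses written out as plain functions of γ = ⟨g(θ), h(x)⟩ and y ∈ {−1, +1} (illustrative only):

```python
import math

def hinge_loss(gamma, y):        # support vector machines
    return max(1 - y * gamma, 0.0)

def logistic_loss(gamma, y):     # logistic regression
    return math.log(1 + math.exp(-y * gamma))

def exponential_loss(gamma, y):  # AdaBoost
    return math.exp(-y * gamma)

print(hinge_loss(0.8, +1), logistic_loss(0.8, +1), exponential_loss(0.8, +1))
```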

Zoom-in: In-database vs. Out-of-database Learning
[Diagram: out-of-database, the feature extraction query over R_1, ..., R_k is materialised as D and exported to an ML tool that returns θ; in-database, the model is reformulated into queries for the entries σ_ij of Σ and c_i of c, which the query optimiser evaluates with factorised query evaluation (cost N^faqw, versus the size |D| of the materialised join), after which a gradient-descent loop computes g(θ), ∇J(θ), and J(θ) from Σ and c until convergence.]
37/37

Thank you! 37/37