Scikit-learn: machine learning for the small and the many. Gaël Varoquaux. (scikit-learn: machine learning in Python)


Scikit-learn: machine learning for the small and the many. Gaël Varoquaux.
What I do: bridging psychology to neuroscience via machine learning on brain images.
In this meeting, I represent low-performance computing.

Outline:
1 Scikit-learn
2 Statistical algorithms
3 Scaling up / scaling out?

1 Scikit-learn: goals and tradeoffs

1 Scikit-learn's vision: machine learning for everyone
Outreach across scientific fields, applications, and communities. Enabling and fostering innovation.
Minimal prerequisites & assumptions.

1 The scikit-learn user base
350,000 returning users; 5,000 citations.
OS: Windows 50%, Mac 30%, Linux 20%.
Employer: industry 63%, academia 34%, other 3%.

1 A Python library
Python: a high-level language, for users and developers. General-purpose: suitable for any application. Excellent for interactive use.
Slow? Compiled code serves as a backend, and Python's primitive virtual machine makes that easy.
SciPy: a vibrant scientific stack. numpy arrays are wrappers on C pointers; pandas for columnar data; scikit-image for images.

1 A Python library
Users like Python. [Figure: Google Trends web-search volume.]

1 A Python library
And developers like Python. [Figure: number of contributors active in a week, 2010-2016.]
Huge set of features (~160 different statistical models).

1 API: simplify, but do not dumb down
Universal estimator interface:

    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, Y_train)
    Y_test = classifier.predict(X_test)
    # or
    X_red = classifier.transform(X_test)

classifier often has hyperparameters. Finding good defaults is crucial, and hard.
A lot of effort goes into the documentation. Example-driven development.
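
To make the hyperparameter point concrete, a minimal sketch of tuning SVC's C with a grid search, using the same estimator interface (the dataset and parameter grid are illustrative choices, not from the talk):

    from sklearn import svm
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # GridSearchCV is itself an estimator: same fit / predict interface
    search = GridSearchCV(svm.SVC(), param_grid={'C': [0.1, 1, 10]}, cv=5)
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    print(search.best_params_)

The uniform interface is what makes such composition possible: anything with fit and predict can be wrapped, cross-validated, or pipelined.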

1 Tradeoffs
Algorithms and models with good failure modes: avoid parameters that are hard to set, or fragile convergence. Statistical computing is ill-posed & data-dependent.
Little or no dependencies: easy to build everywhere; all compiled code generated from Cython.
High-level languages give features (Spark); low-level ones give speed (e.g. cache-friendly code).

2 Statistical algorithms
Fast algorithms accept statistical error.
Models most used in scikit-learn:
1. Logistic regression, SVM
2. Random forests
3. PCA
4. K-means
5. Naive Bayes
6. Nearest neighbors

Big data: many samples (a tall samples x features matrix) or many features (a wide samples x features matrix). [Figure: tall and wide data matrices.]
Examples: web behavior data; cheap sensors (cameras); medical patients; scientific experiments.

2 Linear models: min_w Σ_i l(y_i, x_i · w)
Many features → coordinate descent: iteratively optimize with respect to each w_j separately. It works because features are redundant, and sparse models can guess which w_j are zero. Progress = better selection of features.
Many samples → stochastic gradient descent: min_w E[l(y, x · w)]. Gradient descent updates w ← w − α ∇_w l; stochastic gradient descent updates w ← w − α Ê[∇_w l], where Ê is a cheap estimate of the expectation (e.g. from subsampling). Progress = second-order schemes; data-access locality.
Deep learning: a composition of linear models, optimized jointly (non-convex) with stochastic gradient descent.
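
As an illustration of the stochastic-gradient update above, a minimal NumPy sketch for a least-squares loss (the learning rate, batch size, and epoch count are illustrative assumptions):

    import numpy as np

    def sgd_least_squares(X, y, lr=0.01, batch_size=32, n_epochs=10, seed=0):
        """Minimize (1/n) sum_i (y_i - x_i . w)^2 with stochastic gradients."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_epochs):
            for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
                xb, yb = X[idx], y[idx]
                # Mini-batch = cheap estimate of the expected gradient
                grad = 2 * xb.T @ (xb @ w - yb) / len(idx)
                w -= lr * grad  # w <- w - alpha * estimated gradient
        return w

    # Usage: w = sgd_least_squares(X, y), comparable to np.linalg.lstsq(X, y)[0]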

2 Trees & (random) forests
Compute simple bi-variate statistics (on subsets of the data); split the data accordingly; ...
Speed-ups: share computation between trees, or precompute; cache-friendly access (optimize traversal order); approximate histograms / statistics (LightGBM, XGBoost). A sketch of the histogram trick follows.
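
A sketch of the approximate-histogram trick for split finding, in the spirit of LightGBM and XGBoost: per-bin sums replace a full scan over samples (the quantile binning and the variance-based score are simplifying assumptions, not either library's exact algorithm):

    import numpy as np

    def best_split_histogram(x, y, n_bins=32):
        """Threshold on feature x minimizing within-node squared error,
        computed from per-bin statistics instead of per-sample scans."""
        edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
        bins = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, n_bins - 1)
        cnt = np.bincount(bins, minlength=n_bins).astype(float)
        s = np.bincount(bins, weights=y, minlength=n_bins)
        s2 = np.bincount(bins, weights=y * y, minlength=n_bins)
        best_score, best_threshold = np.inf, None
        cl = sl = sl2 = 0.0
        for b in range(n_bins - 1):
            cl, sl, sl2 = cl + cnt[b], sl + s[b], sl2 + s2[b]
            cr, sr, sr2 = cnt.sum() - cl, s.sum() - sl, s2.sum() - sl2
            if cl == 0 or cr == 0:
                continue
            # Sum of squared deviations on each side of the candidate split
            score = (sl2 - sl ** 2 / cl) + (sr2 - sr ** 2 / cr)
            if score < best_score:
                best_score, best_threshold = score, edges[b + 1]
        return best_threshold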

2 PCA: principal component analysis
Truncated SVD (singular value decomposition): X = U s V^T
Randomized linear algebra gives 20x speed-ups:

    for i in [1, ..., k]:
        X_i = random_projection(X)    # e.g. subsampling
        U_i, s_i, Vt_i = SVD(X_i)
    V_red, R = QR([V_1, ..., V_k])
    X_red = V_red.T @ X
    U, s, Vt_small = SVD(X_red)
    Vt = Vt_small @ V_red.T

The projected X_i summarize the data well, and each SVD runs on local data. [Halko... 2011]
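
For comparison, a minimal NumPy sketch of the basic single-sketch randomized SVD of [Halko... 2011] (the rank and oversampling values are illustrative):

    import numpy as np

    def randomized_svd(X, rank, n_oversample=10, seed=0):
        """Truncated SVD via a random projection of the range of X."""
        rng = np.random.default_rng(seed)
        # Sketch the column space of X with a Gaussian random projection
        Q, _ = np.linalg.qr(X @ rng.standard_normal((X.shape[1], rank + n_oversample)))
        # Exact SVD of the small projected matrix, then lift back
        U_small, s, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
        return Q @ U_small[:, :rank], s[:rank], Vt[:rank]

    # Usage: U, s, Vt = randomized_svd(X, rank=10); X ~ U @ np.diag(s) @ Vt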

2 Stochastic factorization of huge matrices
Factorization of dense matrices (a 200,000 x 2,000,000 data matrix X): min_{U,V} ||X - U V^T||^2 + ||V||_1
Alternating minimization, made online [Mairal... 2010]: stream the columns of X, alternating data access, code computation, and dictionary update. At any time t only part of the data has been seen; the rest is processed as the stream advances. Out of core, huge speed-ups.
New subsampling algorithm [Mensch... 2017]: stream columns and also subsample rows. 10x speed-ups, or more.
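
A toy sketch of the online factorization loop of [Mairal... 2010], streaming one column at a time (a ridge code step stands in for the l1-penalized one, to keep the example short):

    import numpy as np

    def online_dict_learning(stream, n_components, alpha=1.0, seed=0):
        """Online matrix factorization: accumulate sufficient statistics
        over streamed samples and update the dictionary in place."""
        rng = np.random.default_rng(seed)
        A = np.zeros((n_components, n_components))
        B = D = None
        for x in stream:  # one sample (a column of X) at a time
            if D is None:
                D = rng.standard_normal((x.shape[0], n_components))
                B = np.zeros((x.shape[0], n_components))
            # Code step: ridge regression of x on the current dictionary
            code = np.linalg.solve(D.T @ D + alpha * np.eye(n_components), D.T @ x)
            # Sufficient statistics summarizing all samples seen so far
            A += np.outer(code, code)
            B += np.outer(x, code)
            # Dictionary step: block coordinate descent, one atom at a time
            for j in range(n_components):
                if A[j, j] > 1e-12:
                    D[:, j] += (B[:, j] - D @ A[:, j]) / A[j, j]
                    D[:, j] /= max(1.0, np.linalg.norm(D[:, j]))
        return D

    # Usage: D = online_dict_learning((col for col in X.T), n_components=10)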

3 Scaling up / scaling out?

3 Dataflow is key to scale
Patterns: data-parallel array computing; streaming; parallel computing with data + code transfer; out-of-memory persistence. [Figure: a data matrix chunked and streamed across CPUs.]
These patterns can yield horrible code.

3 Parallel-computing engine: joblib
sklearn.estimator(n_jobs=2): under the hood, joblib. Parallel for loops, because concurrency is hard; queues are the central abstraction.
New: distributed computing backends: Yarn, dask.distributed, IPython.parallel:

    import distributed.joblib
    from joblib import Parallel, parallel_backend

    with parallel_backend('dask.distributed',
                          scheduler_host='HOST:PORT'):
        ...  # normal joblib code

Middleware to plug in distributed infrastructures.
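
For reference, the basic joblib pattern that n_jobs builds on, a parallel for loop over delayed calls:

    from math import sqrt
    from joblib import Parallel, delayed

    # Each delayed call is dispatched to a worker through a queue
    results = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))
    print(results)

Swapping the backend changes where those workers run, without touching the loop itself.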

3 Distributed data flow and storage
Moving data around is costly. [Diagram: read-only data store, parameter database, each node's memory.]
Why databases and not files? They maintain integrity themselves; they know how to do data replication & distribution; fast lookup via indexes; not bound by POSIX FS specs.
Very big data calls for coupling a database to a computing engine.

3 joblib.Memory as a storage pool
A caching / function-memoizing system: it stores the results of function executions.
Out-of-memory computing:

    >>> result = mem.cache(g).call_and_shelve(a)
    >>> result
    MemorizedResult(cachedir=..., func='g', argument_hash=...)
    >>> c = result.get()

S3/HDFS/cloud backends: joblib.Memory(uri, backend='s3')
https://github.com/joblib/joblib/pull/397
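
The underlying caching pattern, as a minimal self-contained example (the cache directory is an illustrative choice):

    from joblib import Memory

    mem = Memory('/tmp/joblib_cache', verbose=0)

    @mem.cache
    def g(a):
        # Expensive computation; the result is persisted to disk
        return a ** 2

    g(3)  # computed and stored
    g(3)  # loaded from the cache, not recomputed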

Challenges and dreams
High-level constructs for distributed computation & data exchange: MPI feels too low-level, and without data concepts.
Goal: reusable algorithms from laptops to datacenters. Capturing data-access patterns is the missing piece.
The Dask project: limit to purely functional code; lazy computation / compilation; build a data-flow + execution graph. Also: deep-learning engines, for GPUs.
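
A small example of the lazy, purely functional style Dask encourages: operations only build a task graph, and nothing executes until compute() (the array sizes are illustrative):

    import dask.array as da

    # Chunked array: a graph of small NumPy blocks, not yet materialized
    x = da.random.random((10000, 10000), chunks=(1000, 1000))
    y = (x + x.T).mean(axis=0)

    # Execution happens here, scheduled chunk by chunk,
    # locally or on a distributed cluster
    print(y.compute()[:5])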

Lessons from scikit-learn (@GaelVaroquaux)
Small-computer machine learning trying to scale.
Python gets us very far: it enables focusing on algorithmic optimization, is great for growing a community, and can easily drop to compiled code.
Statistical algorithmics: algorithms operate on expectations; stochastic gradient descent and random projections can bring data locality.
Distributed data computing: data access is central and must be optimized for the algorithm; the file system and memory no longer suffice.
If you know what you're doing, you can scale scikit-learn. The challenge is to make this easy and generic.

4 References
N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011. doi: 10.1137/090771806.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19-60, 2010.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. arXiv preprint, 2017.