Scikit-learn: machine learning for the small and the many. Gaël Varoquaux. (scikit-learn: machine learning in Python)


Scikit-learn: machine learning for the small and the many. Gaël Varoquaux.
What I do: bridging psychology to neuroscience via machine learning on brain images.
In this meeting, I represent low-performance computing.

Outline:
1 Scikit-learn
2 Statistical algorithms
3 Scaling up / scaling out?

1 Scikit-learn: goals and tradeoffs

1 Scikit-learn's vision: machine learning for everyone
Outreach across scientific fields, applications, and communities. Enabling and fostering innovation.
Minimal prerequisites & assumptions.

1 The scikit-learn user base
350,000 returning users; 5,000 citations.
OS: Windows 50%, Mac 30%, Linux 20%.
Employer: industry 63%, academia 34%, other 3%.

1 A Python library
Python: a high-level language, for users and developers. General-purpose: suitable for any application. Excellent for interactive use.
Slow? Compiled code serves as a backend, and Python's primitive virtual machine makes that easy.
SciPy: a vibrant scientific stack. numpy arrays are wrappers on C pointers; pandas for columnar data; scikit-image for images.

1 A Python library
Users like Python. [Figure: Google Trends web-search volume.]

1 A Python library
And developers like Python. [Figure: number of contributors active in a week, 2010-2016.]
Huge set of features (~160 different statistical models).

1 API: simplify, but do not dumb down
Universal estimator interface:

    from sklearn import svm
    classifier = svm.SVC()
    classifier.fit(X_train, Y_train)
    Y_test = classifier.predict(X_test)
    # or
    X_red = classifier.transform(X_test)

classifier often has hyperparameters. Finding good defaults is crucial, and hard.
A lot of effort goes into the documentation. Example-driven development.
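
To make the hyperparameter point concrete, a minimal sketch of tuning SVC's C with a grid search, using the same estimator interface (the dataset and parameter grid are illustrative choices, not from the talk):

    from sklearn import svm
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # GridSearchCV is itself an estimator: same fit / predict interface
    search = GridSearchCV(svm.SVC(), param_grid={'C': [0.1, 1, 10]}, cv=5)
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)
    print(search.best_params_)

The uniform interface is what makes such composition possible: anything with fit and predict can be wrapped, cross-validated, or pipelined.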

1 Tradeoffs
Algorithms and models with good failure modes: avoid parameters that are hard to set, or fragile convergence. Statistical computing is ill-posed & data-dependent.
Little or no dependencies: easy to build everywhere; all compiled code generated from Cython.
High-level languages give features (Spark); low-level ones give speed (e.g. cache-friendly code).

2 Statistical algorithms
Fast algorithms accept statistical error.
Models most used in scikit-learn:
1. Logistic regression, SVM
2. Random forests
3. PCA
4. K-means
5. Naive Bayes
6. Nearest neighbors

Big data: many samples (a tall samples x features matrix) or many features (a wide samples x features matrix). [Figure: tall and wide data matrices.]
Examples: web behavior data; cheap sensors (cameras); medical patients; scientific experiments.

2 Linear models: min_w Σ_i l(y_i, x_i · w)
Many features → coordinate descent: iteratively optimize with respect to each w_j separately. It works because features are redundant, and sparse models can guess which w_j are zero. Progress = better selection of features.
Many samples → stochastic gradient descent: min_w E[l(y, x · w)]. Gradient descent updates w ← w − α ∇_w l; stochastic gradient descent updates w ← w − α Ê[∇_w l], where Ê is a cheap estimate of the expectation (e.g. from subsampling). Progress = second-order schemes; data-access locality.
Deep learning: a composition of linear models, optimized jointly (non-convex) with stochastic gradient descent.
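
As an illustration of the stochastic-gradient update above, a minimal NumPy sketch for a least-squares loss (the learning rate, batch size, and epoch count are illustrative assumptions):

    import numpy as np

    def sgd_least_squares(X, y, lr=0.01, batch_size=32, n_epochs=10, seed=0):
        """Minimize (1/n) sum_i (y_i - x_i . w)^2 with stochastic gradients."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_epochs):
            for idx in np.array_split(rng.permutation(n), max(1, n // batch_size)):
                xb, yb = X[idx], y[idx]
                # Mini-batch = cheap estimate of the expected gradient
                grad = 2 * xb.T @ (xb @ w - yb) / len(idx)
                w -= lr * grad  # w <- w - alpha * estimated gradient
        return w

    # Usage: w = sgd_least_squares(X, y), comparable to np.linalg.lstsq(X, y)[0]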

2 Trees & (random) forests
Compute simple bi-variate statistics (on subsets of the data); split the data accordingly; ...
Speed-ups: share computation between trees, or precompute; cache-friendly access (optimize traversal order); approximate histograms / statistics (LightGBM, XGBoost). A sketch of the histogram trick follows.
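
A sketch of the approximate-histogram trick for split finding, in the spirit of LightGBM and XGBoost: per-bin sums replace a full scan over samples (the quantile binning and the variance-based score are simplifying assumptions, not either library's exact algorithm):

    import numpy as np

    def best_split_histogram(x, y, n_bins=32):
        """Threshold on feature x minimizing within-node squared error,
        computed from per-bin statistics instead of per-sample scans."""
        edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
        bins = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, n_bins - 1)
        cnt = np.bincount(bins, minlength=n_bins).astype(float)
        s = np.bincount(bins, weights=y, minlength=n_bins)
        s2 = np.bincount(bins, weights=y * y, minlength=n_bins)
        best_score, best_threshold = np.inf, None
        cl = sl = sl2 = 0.0
        for b in range(n_bins - 1):
            cl, sl, sl2 = cl + cnt[b], sl + s[b], sl2 + s2[b]
            cr, sr, sr2 = cnt.sum() - cl, s.sum() - sl, s2.sum() - sl2
            if cl == 0 or cr == 0:
                continue
            # Sum of squared deviations on each side of the candidate split
            score = (sl2 - sl ** 2 / cl) + (sr2 - sr ** 2 / cr)
            if score < best_score:
                best_score, best_threshold = score, edges[b + 1]
        return best_threshold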

2 PCA: principal component analysis
Truncated SVD (singular value decomposition): X = U s V^T
Randomized linear algebra gives 20x speed-ups:

    for i in [1, ..., k]:
        X_i = random_projection(X)    # e.g. subsampling
        U_i, s_i, Vt_i = SVD(X_i)
    V_red, R = QR([V_1, ..., V_k])
    X_red = V_red.T @ X
    U, s, Vt_small = SVD(X_red)
    Vt = Vt_small @ V_red.T

The projected X_i summarize the data well, and each SVD runs on local data. [Halko... 2011]
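
For comparison, a minimal NumPy sketch of the basic single-sketch randomized SVD of [Halko... 2011] (the rank and oversampling values are illustrative):

    import numpy as np

    def randomized_svd(X, rank, n_oversample=10, seed=0):
        """Truncated SVD via a random projection of the range of X."""
        rng = np.random.default_rng(seed)
        # Sketch the column space of X with a Gaussian random projection
        Q, _ = np.linalg.qr(X @ rng.standard_normal((X.shape[1], rank + n_oversample)))
        # Exact SVD of the small projected matrix, then lift back
        U_small, s, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
        return Q @ U_small[:, :rank], s[:rank], Vt[:rank]

    # Usage: U, s, Vt = randomized_svd(X, rank=10); X ~ U @ np.diag(s) @ Vt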

2 Stochastic factorization of huge matrices
Factorization of dense matrices (a 200,000 x 2,000,000 data matrix X): min_{U,V} ||X - U V^T||^2 + ||V||_1
Alternating minimization, made online [Mairal... 2010]: stream the columns of X, alternating data access, code computation, and dictionary update. At any time t only part of the data has been seen; the rest is processed as the stream advances. Out of core, huge speed-ups.
New subsampling algorithm [Mensch... 2017]: stream columns and also subsample rows. 10x speed-ups, or more.
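
A toy sketch of the online factorization loop of [Mairal... 2010], streaming one column at a time (a ridge code step stands in for the l1-penalized one, to keep the example short):

    import numpy as np

    def online_dict_learning(stream, n_components, alpha=1.0, seed=0):
        """Online matrix factorization: accumulate sufficient statistics
        over streamed samples and update the dictionary in place."""
        rng = np.random.default_rng(seed)
        A = np.zeros((n_components, n_components))
        B = D = None
        for x in stream:  # one sample (a column of X) at a time
            if D is None:
                D = rng.standard_normal((x.shape[0], n_components))
                B = np.zeros((x.shape[0], n_components))
            # Code step: ridge regression of x on the current dictionary
            code = np.linalg.solve(D.T @ D + alpha * np.eye(n_components), D.T @ x)
            # Sufficient statistics summarizing all samples seen so far
            A += np.outer(code, code)
            B += np.outer(x, code)
            # Dictionary step: block coordinate descent, one atom at a time
            for j in range(n_components):
                if A[j, j] > 1e-12:
                    D[:, j] += (B[:, j] - D @ A[:, j]) / A[j, j]
                    D[:, j] /= max(1.0, np.linalg.norm(D[:, j]))
        return D

    # Usage: D = online_dict_learning((col for col in X.T), n_components=10)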

3 Scaling up / scaling out?

3 Dataflow is key to scale
Patterns: data-parallel array computing; streaming; parallel computing with data + code transfer; out-of-memory persistence. [Figure: a data matrix chunked and streamed across CPUs.]
These patterns can yield horrible code.

3 Parallel-computing engine: joblib
sklearn.estimator(n_jobs=2): under the hood, joblib. Parallel for loops, because concurrency is hard; queues are the central abstraction.
New: distributed computing backends: Yarn, dask.distributed, IPython.parallel:

    import distributed.joblib
    from joblib import Parallel, parallel_backend

    with parallel_backend('dask.distributed',
                          scheduler_host='HOST:PORT'):
        ...  # normal joblib code

Middleware to plug in distributed infrastructures.
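
For reference, the basic joblib pattern that n_jobs builds on, a parallel for loop over delayed calls:

    from math import sqrt
    from joblib import Parallel, delayed

    # Each delayed call is dispatched to a worker through a queue
    results = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(10))
    print(results)

Swapping the backend changes where those workers run, without touching the loop itself.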

3 Distributed data flow and storage
Moving data around is costly. [Diagram: read-only data store, parameter database, each node's memory.]
Why databases and not files? They maintain integrity themselves; they know how to do data replication & distribution; fast lookup via indexes; not bound by POSIX FS specs.
Very big data calls for coupling a database to a computing engine.

3 joblib.Memory as a storage pool
A caching / function-memoizing system: it stores the results of function executions.
Out-of-memory computing:

    >>> result = mem.cache(g).call_and_shelve(a)
    >>> result
    MemorizedResult(cachedir=..., func='g', argument_hash=...)
    >>> c = result.get()

S3/HDFS/cloud backends: joblib.Memory(uri, backend='s3')
https://github.com/joblib/joblib/pull/397
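
The underlying caching pattern, as a minimal self-contained example (the cache directory is an illustrative choice):

    from joblib import Memory

    mem = Memory('/tmp/joblib_cache', verbose=0)

    @mem.cache
    def g(a):
        # Expensive computation; the result is persisted to disk
        return a ** 2

    g(3)  # computed and stored
    g(3)  # loaded from the cache, not recomputed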

Challenges and dreams
High-level constructs for distributed computation & data exchange: MPI feels too low-level, and without data concepts.
Goal: reusable algorithms from laptops to datacenters. Capturing data-access patterns is the missing piece.
The Dask project: limit to purely functional code; lazy computation / compilation; build a data-flow + execution graph. Also: deep-learning engines, for GPUs.
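
A small example of the lazy, purely functional style Dask encourages: operations only build a task graph, and nothing executes until compute() (the array sizes are illustrative):

    import dask.array as da

    # Chunked array: a graph of small NumPy blocks, not yet materialized
    x = da.random.random((10000, 10000), chunks=(1000, 1000))
    y = (x + x.T).mean(axis=0)

    # Execution happens here, scheduled chunk by chunk,
    # locally or on a distributed cluster
    print(y.compute()[:5])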

Lessons from scikit-learn (@GaelVaroquaux)
Small-computer machine learning trying to scale.
Python gets us very far: it enables focusing on algorithmic optimization, is great for growing a community, and can easily drop to compiled code.
Statistical algorithmics: algorithms operate on expectations; stochastic gradient descent and random projections can bring data locality.
Distributed data computing: data access is central and must be optimized for the algorithm; the file system and memory no longer suffice.
If you know what you're doing, you can scale scikit-learn. The challenge is to make this easy and generic.

4 References
N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, 2011. doi: 10.1137/090771806.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19-60, 2010.
A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. arXiv preprint, 2017.