Scikit-learn. scikit. Machine learning for the small and the many Gaël Varoquaux. machine learning in Python

Size: px

Start display at page:

Download "Scikit-learn. scikit. Machine learning for the small and the many Gaël Varoquaux. machine learning in Python"

Janel Eaton
5 years ago
Views:

1 Scikit-learn Machine learning for the small and the many Gaël Varoquaux scikit machine learning in Python In this meeting, I represent low performance computing

2 Scikit-learn Machine learning for the small and the many Gae l Varoquaux What I do: bridging psychology to neuroscience via machine learning on brain images scikit machine learning in Python In this meeting, I represent low performance computing

3 G Varoquaux 2 1 Scikit-learn 2 Statistical algorithms 3 Scaling up / scaling out?

4 G Varoquaux 3 1 Scikit-learn Goals and tradeoff

5 G Varoquaux 4 Scikit-learn s vision: Machine learning for everyone Outreach across scientific fields, applications, communities Enabling foster innovation

6 Minimal prerequisites & assumptions G Varoquaux 4 Scikit-learn s vision: Machine learning for everyone Outreach across scientific fields, applications, communities Enabling foster innovation

7 1 scikit-learn user base G Varoquaux returning users citations OS Employer 50% 30% 20% 63% 34% 3% Windows Mac Linux industry academia other

8 1 A Python library G Varoquaux 6 Python High-level language, for users and developers General-purpose: suitable for any application Excellent interactive use

9 1 A Python library Python High-level language, for users and developers General-purpose: suitable for any application Excellent interactive use Slow compiled code as a backend Python s primitive virtual machine makes it easy G Varoquaux 6

10 1 A Python library G Varoquaux 6 Python High-level language, for users and developers General-purpose: suitable for any application Excellent interactive use Scipy Vibrant scientific stack numpy arrays = wrappers on C pointers pandas for columnar data scikit-image for images

11 1 A Python library Users like Python Web searches: Google trends G Varoquaux 7

12 1 A Python library And developpers like Python 50 Number of contributors active in a week G Varoquaux 8

13 1 A Python library And developpers like Python 50 Number of contributors active in a week Huge set of features ( 160 different statistical models) G Varoquaux 8

14 1 API: simplify, but do not dumb down G Varoquaux 9 Universal estimator interface from s k l e a r n import svm c l a s s i f i e r = svm.svc() c l a s s i f i e r. f i t ( X t r a i n, Y t r a i n ) Y t e s t = c l a s s i f i e r. p r e d i c t ( X t e s t ) # or X r e d = c l a s s i f i e r. t r a n s f o r m ( X t e s t )

15 1 API: simplify, but do not dumb down G Varoquaux 9 Universal estimator interface from s k l e a r n import svm c l a s s i f i e r = svm.svc() c l a s s i f i e r. f i t ( X t r a i n, Y t r a i n ) Y t e s t = c l a s s i f i e r. p r e d i c t ( X t e s t ) # or X r e d = c l a s s i f i e r. t r a n s f o r m ( X t e s t ) classifier often has hyperparameters Finding good defaults is crucial, and hard

16 1 API: simplify, but do not dumb down G Varoquaux 9 Universal estimator interface from s k l e a r n import svm c l a s s i f i e r = svm.svc() c l a s s i f i e r. f i t ( X t r a i n, Y t r a i n ) Y t e s t = c l a s s i f i e r. p r e d i c t ( X t e s t ) # or X r e d = c l a s s i f i e r. t r a n s f o r m ( X t e s t ) classifier often has hyperparameters Finding good defaults is crucial, and hard A lot of effort on the documentation Example-driven development

17 1 Tradeoffs G Varoquaux 10 Algorithms and models with good failure mode Avoid parameters hard to set or fragile convergence Statistical computing = ill-posed & data-dependent Little or no dependencies Easy build everywhere All compiled code generated from Cython High-level languages give features (Spark) Low-level gives speed (eg cache-friendly code)

18 2 Statistical algorithms Fast algorithms accept statistical error G Varoquaux 11

19 2 Statistical algorithms Fast algorithms accept statistical error Models most used in scikit-learn: 1. Logistic regression, SVM 4. Kmeans 2. Random forests 5. Naive Bayes 3. PCA 6. Nearest neighbor G Varoquaux 11

G Varoquaux 12 Big data Many samples samples features 0 3 0 7 8 0 9 0 7 0 7 9 0 7 0 0 7 9 0 7 5 2 7 0 0 5 7 8 9 4 0 7 1 0 0 6 0 0 0 7 9 7 0 0 9 7 0 0 0 8 0 0 7 0 0 0 1 0 0 0 0 4 0 0 4 0 0 0 9 0 0 0 0

20 G Varoquaux 12 Big data Many samples samples features or Many features samples features Web behavior data Cheap sensors (cameras) Medical patients Scientific experiments

21 G Varoquaux 13 2 Linear models min w i l(y i, x i w) Many features Coordinate descent Iteratively optimize w.r.t. w j separately It works because: Features are redundant Sparse models can guess which w j are zero Progress = better selection of features

22 2 Linear models min w i l(y i, x i w) Many features Coordinate descent Iteratively optimize w.r.t. w j separately Many samples Gradient descent: Stochastic gradient descent min E[l(y, x w)] w w w + α w l Stochastic gradient descent w w + αe[ w l] Use a cheap estimate of E[ w l] (e.g. subsampling) Progress = second order schemes G Varoquaux 13

23 2 Linear models min w i l(y i, x i w) Many features Coordinate descent Iteratively optimize w.r.t. w j separately Many samples Gradient descent: Stochastic gradient descent min E[l(y, x w)] w w w + α w l Stochastic gradient descent w w + αe[ w l] Use a cheap estimate of E[ w l] (e.g. subsampling) Data-access locality G Varoquaux 13

24 2 Linear models min l(y i, x i w) w i Deep learning Many features Composition Coordinate of linear descent models optimized Iteratively jointly optimize (non-convex) w.r.t. w j separately with stochastic gradient descent Many samples Stochastic gradient descent Gradient descent: min E[l(y, x w)] w w w + α w l Stochastic gradient descent w w + αe[ w l] Use a cheap estimate of E[ w l] (e.g. subsampling) Data-access locality G Varoquaux 13

25 2 Trees & (random) forests (on subsets of the data) Compute simple bi-variate statistics Split data accordingly... Speed ups Share computing between trees or precompute Cache friendly access optimize traversal order Approximate histograms / statistics LightGBM, XGBoost G Varoquaux 14

26 G Varoquaux 15 2 PCA: principal component analysis Truncated SVD (singular value decomposition) X = U s V T

27 2 PCA: principal component analysis Truncated SVD (singular value decomposition) X = U s V T Randomized linear algebra for i in [1,... k]: X =random projection(x) Ũ i, s i, ṼT i = SVD( X) 20x speed ups # e.g. subsampling V red, R = QR([Ṽ1,..., Ṽk]) X red = V T redx U s V T = SVD(X red ) V T = V T V T red G Varoquaux 15

28 2 PCA: principal component analysis Truncated SVD (singular value decomposition) X = U s V T Randomized linear algebra for i in [1,... k]: X =random projection(x) Ũ i, s i, ṼT i = SVD( X) 20x speed ups # e.g. subsampling V red, R = QR([Ṽ1,..., Ṽk]) X red = V T redx U s V T = SVD(X red ) V T = V T V T red X summarize well the data Each SVD is on local data [Halko ] G Varoquaux 15

29 2 Stochastic factorization of huge matrices G Varoquaux 16 Factorization of dense matrices Data matrix X U V min U,V X UVT 2 + V 1

30 2 Stochastic factorization of huge matrices G Varoquaux 16 Factorization of dense matrices Data matrix - Data access - Code computation Stream columns min X U,V UVT 2 + V 1 - Dictionary update

31 2 Stochastic factorization of huge matrices Factorization of dense matrices Data matrix Alternating minimization Online matrix factorization - Data access - Code computation Stream columns - Dictionary update Seen at t Seen at t+1 Unseen at t [Mairal ] out of core, huge speed ups G Varoquaux 16

32 2 Stochastic factorization of huge matrices Factorization of dense matrices Data matrix Alternating minimization Online matrix factorization New subsampling algorithm - Data access - Code computation - Dictionary update Stream columns Subsample rows Seen at t Seen at t+1 Unseen at t [Mensch ] 10X speed ups, or more G Varoquaux 16

33 3 Scaling up / scaling out? G Varoquaux 17

34 3 Dataflow is key to scale Data parallel Array computing Streaming CPU Parallel computing Data + code transfer Out-of-memory persistence These patterns can yield horrible code G Varoquaux 18

35 3 Parallel-computing engine: joblib sklearn.estimator(n jobs=2) Under the hood: joblib Parallel for loops concurrency is hard Queues are the central abstraction G Varoquaux 19

36 3 Parallel-computing engine: joblib sklearn.estimator(n jobs=2) Under the hood: joblib Parallel for loops concurrency is hard G Varoquaux 19

37 3 Parallel-computing engine: joblib sklearn.estimator(n jobs=2) Under the hood: joblib Parallel for loops concurrency is hard New: distributed computing backends: Yarn, dask.distributed, IPython.parallel import distributed.joblib from joblib import Parallel, parallel backend with parallel backend( dask.distributed, scheduler host= HOST:PORT ): # normal Joblib code G Varoquaux 19

38 3 Parallel-computing engine: joblib sklearn.estimator(n jobs=2) Under the hood: joblib Parallel for loops concurrency is hard New: distributed computing backends: Yarn, dask.distributed, IPython.parallel import distributed.joblib from joblib import Parallel, parallel backend with parallel backend( dask.distributed, scheduler host= HOST:PORT ): # normal Joblib code Middleware to plug in distributed infrastructures G Varoquaux 19

39 3 Distributed data flow and storage G Varoquaux Moving data around is costly

40 G Varoquaux 20 3 Distributed data flow and storage Read-only data store Moving data around is costly Parameter database Node's memory

G Varoquaux 20 3 Distributed data flow and storage 0 3 8 7 8 7 9 4 7 9 7 9 2 7 Read-only Why databases and not files?

41 G Varoquaux 20 3 Distributed data flow and storage Read-only Why databases and not files? Maintain integrity themselves Moving data around Know how to do data replication & distribution is costly Fast lookup via indexes Not bound by POSIX FS specs Very big data calls for coupling a database to a computing engine Parameter database data store Node's memory

42 3 joblib.memory as a storage pool A caching / function memoizing system Stores results of function executions G Varoquaux 21

43 3 joblib.memory as a storage pool A caching / function memoizing system Stores results of function executions G Varoquaux 21 Out-of-memory computing >>> result = mem.cache(g).call and shelve(a) >>> result MemorizedResult(cachedir=..., func= g, argument hash=... ) >>> c = result.get()

3 joblib.memory as a storage pool A caching / function memoizing system Stores results of function executions G Varoquaux 21 Out-of-memory computing >>> result = mem.cache(g).

44 3 joblib.memory as a storage pool A caching / function memoizing system Stores results of function executions G Varoquaux 21 Out-of-memory computing >>> result = mem.cache(g).call and shelve(a) >>> result MemorizedResult(cachedir=..., func= g, argument hash=... ) >>> c = result.get() S3/HDFS/cloud backend: joblib.memory( uri, backend= s3 )

45 Challenges and dreams Read-only data store High-level constructs for distributed computation & data exchange MPI feels too low level and without data concepts Parameter database Node's memory Goal: reusable algorithms from laptops to datacenters Capturing data access patterns is the missing piece G Varoquaux 22

Challenges and dreams 0387879479792 7 Read-only data store High-level constructs for distributed computation & data exchange MPI feels too low level and without data concepts Parameter database

46 Challenges and dreams Read-only data store High-level constructs for distributed computation & data exchange MPI feels too low level and without data concepts Parameter database Node's memory Goal: reusable algorithms from laptops to datacenters Capturing data access patterns is the missing piece Dask project: Limit to purely-functional code Lazy computation / compilation Build a data flow + execution graph Also: deep-learning engines, for GPUs G Varoquaux 22

47 @GaelVaroquaux Lessons from scikit-learn Small-computer machine-learning trying to scale Python gets us very far Enables focusing on algorithmic optimization Great to grow a community Can easily drop to compiled code

48 @GaelVaroquaux Lessons from scikit-learn Small-computer machine-learning trying to scale Python gets us very far Statistical algorithmics Algorithms operate on expectancies Stochastic Gradient Descent Random projections Can bring data locality

49 Lessons from scikit-learn Small-computer machine-learning trying to scale Python gets us very far Statistical algorithmics Distributed data computing Data acces is central Must be optimized for algorithm File system and memory no longer suffice

50 Lessons from scikit-learn Small-computer machine-learning trying to scale Python gets us very far Statistical algorithmics Distributed data computing If you know what your doing, you can scale scikit-learn The challenge is to make this easy and

51 4 References I N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev., 53, ISSN doi: / URL J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19, A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. Arxiv preprint, 2017.

ECE521 W17 Tutorial 1. Renjie Liao & Min Bai

ECE521 W17 Tutorial 1. Renjie Liao & Min Bai ECE521 W17 Tutorial 1 Renjie Liao & Min Bai Schedule Linear Algebra Review Matrices, vectors Basic operations Introduction to TensorFlow NumPy Computational Graphs Basic Examples Linear Algebra Review