Clustering algorithms distributed over a Cloud Computing Platform.
1 Clustering algorithms distributed over a Cloud Computing Platform. September 28th, 2012. Ph.D. thesis supervised by Pr. Fabrice Rossi. Matthieu Durut (Telecom/Lokad) 1 / 55
2 Outline. 1 Introduction to Cloud Computing 2 Context 3 Distributed Batch K-Means 4 Distributed Vector Quantization algorithms
3 Outline: Introduction to Cloud Computing. 1 Introduction to Cloud Computing 2 Context 3 Distributed Batch K-Means 4 Distributed Vector Quantization algorithms
4 Introduction to Cloud Computing. What is Cloud Computing? Some features: 1 Abstraction of commodity hardware that can be rented on demand on an hourly basis. 2 Quasi-infinite hardware scale-up. 3 Virtualization, which makes the maintenance of web applications easier. Grid vs Cloud: ownership; intensive use of Virtual Machines (VM); elasticity; hardware administration and maintenance.
5 Introduction to Cloud Computing. Everything as a Service. 1 Software as a Service (SaaS): Gmail, Salesforce, Lokad API, etc. 2 Platform as a Service (PaaS): Azure, Amazon S3, etc. 3 Infrastructure as a Service (IaaS): Amazon EC2, etc. Stack of Azure. Storage level: BlobStorage, TableStorage, QueueStorage, SQLAzure. Execution level: Dryad. Domain-Specific Language level: DryadLINQ, Scope.
6 Introduction to Cloud Computing. Figure: Illustration of the Google, Hadoop and Microsoft technology stacks for building cloud applications.
7 Introduction to Cloud Computing. MapReduce.
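The MapReduce model that the slide refers to can be sketched in a few lines of Python. This is an illustrative toy (word counting), not the Hadoop or Dryad API: the framework's real contribution is running the map and reduce phases on many machines with fault tolerance.

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit (word, 1) pairs for one input chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between the two phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

def mapreduce(chunks):
    pairs = [p for chunk in chunks for p in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = mapreduce(["the cloud", "the grid and the cloud"])
# counts == {"the": 3, "cloud": 2, "grid": 1, "and": 1}
```

Each chunk is processed independently in the map phase, which is exactly the data-level parallelism exploited later for Batch K-Means.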
8 Introduction to Cloud Computing. The Windows Azure Storage (WAS). BlobStorage: key-value pair (blob name/blob) storage. No more ACID, but atomicity, strong persistence and strong consistency per blob. Optimistic Read-Modify-Write primitive (RMW). QueueStorage: set of scalable queues; asynchronous message delivery mechanism; approximately FIFO; messages returned at least once => idempotency required.
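The optimistic Read-Modify-Write primitive can be illustrated with an ETag-style conditional write. The `FakeBlobStorage` class below is a hypothetical in-memory stand-in, not the Azure SDK: a conditional put fails when the blob changed since it was read, and the caller retries.

```python
import itertools

class FakeBlobStorage:
    """In-memory stand-in for a blob container with per-blob ETags
    (illustrative only; not the Azure Storage API)."""
    def __init__(self):
        self._blobs = {}                 # name -> (etag, value)
        self._etags = itertools.count()

    def get(self, name):
        return self._blobs[name]         # returns (etag, value)

    def put(self, name, value, expected_etag=None):
        # Conditional write: succeeds only if the ETag is unchanged.
        if name in self._blobs and expected_etag is not None:
            if self._blobs[name][0] != expected_etag:
                return False             # a concurrent write happened: retry
        self._blobs[name] = (next(self._etags), value)
        return True

def read_modify_write(storage, name, update):
    # Optimistic RMW loop: read, transform, conditional write, retry on conflict.
    while True:
        etag, value = storage.get(name)
        if storage.put(name, update(value), expected_etag=etag):
            return

storage = FakeBlobStorage()
storage.put("counter", 0)
read_modify_write(storage, "counter", lambda v: v + 1)
```

Per-blob atomicity plus this retry loop is what replaces locks when the storage is the only shared medium between workers.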
9 Introduction to Cloud Computing. Elements of Azure application architecture: no communication framework such as MPI; the WAS is used as a shared-memory abstraction; no affinity between storage and processing units; task agnosticity of workers (at least in the beginning); idempotence.
10 Outline: Context. 1 Introduction to Cloud Computing 2 Context 3 Distributed Batch K-Means 4 Distributed Vector Quantization algorithms
11 Context. Why clustering? One of Lokad's abilities is to deal with large-scale data. Need to group client data (clustering) to extract information from complex objects (e.g. time-series seasonality). Problem set-up: the data set is composed of N points {z_t}_{t=1}^N in R^d. Clustering point of view: find a simplified representation with κ vectors of R^d. These vectors are called prototypes/centroids and gathered in a quantization scheme w = (w_1, ..., w_κ) ∈ (R^d)^κ.
12 Context. Objective: the clustering challenge can be expressed as the minimization of the empirical distortion C_N, where C_N(w) = Σ_{t=1}^N min_{l=1,...,κ} ||z_t − w_l||², for w ∈ (R^d)^κ. Initial challenge: exact minimization is computationally intractable. Some approximate algorithms: Batch K-Means, Vector Quantization (Online K-Means), Neural Gas, Kohonen Maps.
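The empirical distortion C_N translates directly to a few lines of NumPy (a minimal sketch of the formula above; the sample points and prototypes are illustrative):

```python
import numpy as np

def empirical_distortion(points, prototypes):
    """C_N(w) = sum over t of min over l of ||z_t - w_l||^2."""
    # Pairwise squared distances between the N points and the kappa prototypes.
    d2 = ((points[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
prototypes = np.array([[1.0, 0.0], [10.0, 0.0]])
# nearest squared distances are 1.0, 1.0, 0.0, so C_N(w) = 2.0
```

All the algorithms listed above only approximate the minimizer of this quantity; evaluating C_N itself stays cheap, which is why it is used later as the performance measure.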
13 Context. Why distributed? A suitable way to get more computing resources: faster serial computers are increasingly expensive and hit physical limits. Cloud computing was adopted by Lokad (MS Azure); by early 2012, all apps ran on the cloud, scaled up to 300 VMs. Consequences: communication delays and the lack of efficient shared-memory asynchronous schemes.
14 Outline: Distributed Batch K-Means. 1 Introduction to Cloud Computing 2 Context 3 Distributed Batch K-Means 4 Distributed Vector Quantization algorithms
15 Distributed Batch K-Means. Sequential Batch K-Means.
Algorithm 1 Sequential Batch K-Means:
Select κ initial prototypes (w_k)_{k=1}^κ
repeat
  for t = 1 to N do
    for k = 1 to κ do
      compute ||z_t − w_k||²
    end for
    find the closest centroid w_{k*(t)} to z_t
  end for
  for k = 1 to κ do
    w_k = (1 / #{t : k*(t) = k}) Σ_{t : k*(t) = k} z_t
  end for
until the stopping criterion is met
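Algorithm 1 can be sketched in Python as follows. This is a minimal NumPy version (Lloyd iterations), not the thesis implementation; initialization and data are illustrative.

```python
import numpy as np

def batch_kmeans(points, kappa, iterations, seed=0):
    """Plain sequential Batch K-Means, as in Algorithm 1."""
    rng = np.random.default_rng(seed)
    # Select kappa initial prototypes among the data points.
    prototypes = points[rng.choice(len(points), kappa, replace=False)].copy()
    for _ in range(iterations):
        # Assignment phase: closest prototype for every point.
        d2 = ((points[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Recalculation phase: each prototype becomes the mean of its cluster.
        for k in range(kappa):
            members = points[assign == k]
            if len(members):
                prototypes[k] = members.mean(axis=0)
    return prototypes

data = np.array([[0.0], [1.0], [10.0], [11.0]])
centers = batch_kmeans(data, kappa=2, iterations=10)
# on this toy data the centers converge to {0.5, 10.5} (in some order)
```

The two inner loops are exactly the assignment and recalculation phases that the distribution scheme of the next slides will split across machines.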
16 Distributed Batch K-Means. Characteristics: relatively fast: Batch WallTime_seq = (3Nκd + Nκ + Nd + κd) I T_flop, where I refers to the number of iterations and T_flop to the time needed to evaluate one floating-point operation. Deterministic. Easy to set up. Results become stationary from a certain iteration on. Suited for parallelization? Obvious data-level parallelism. Same result as the sequential algorithm. Excellent speed-up efficiency already achieved.
17 Distributed Batch K-Means. Distribution scheme: data-level parallelism suggests an iterated Map-Reduce distribution. The data set {z_t}_{t=1}^N is homogeneously split into M chunks (one per processing unit): S_i, i ∈ {1..M}. Processing unit i computes the distances ||z_t^i − w_k||² for z_t^i ∈ S_i and k ∈ {1..κ} (Map phase). Then the new prototype version is recomputed by one or several machines (Reduce phase).
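The Map and Reduce phases above can be sketched as follows: each chunk returns per-prototype partial sums and counts, and merging them reproduces exactly the prototypes a sequential iteration would compute. Illustrative Python run on one machine, not the Azure implementation.

```python
import numpy as np

def map_chunk(chunk, prototypes):
    """Map phase: assign each point of one chunk, return partial sums/counts."""
    d2 = ((chunk[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    kappa, d = prototypes.shape
    sums, counts = np.zeros((kappa, d)), np.zeros(kappa)
    for point, k in zip(chunk, assign):
        sums[k] += point
        counts[k] += 1
    return sums, counts

def reduce_results(partials, prototypes):
    """Reduce phase: merge the partial results and recompute the prototypes."""
    sums = sum(s for s, _ in partials)
    counts = sum(c for _, c in partials)
    new = prototypes.copy()
    nonempty = counts > 0
    new[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new

data = np.array([[0.0], [1.0], [10.0], [11.0]])
w = np.array([[0.0], [11.0]])
chunks = np.array_split(data, 2)          # one chunk per processing unit
w_next = reduce_results([map_chunk(c, w) for c in chunks], w)
# identical to one sequential Batch K-Means iteration on the full data set
```

Because the merge only needs sums and counts, the Reduce phase moves O(κd) numbers per worker regardless of N, which is what makes the scheme communication-efficient.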
18 Distributed Batch K-Means. Batch K-Means distributed over a DMM architecture.
19 Distributed Batch K-Means. Wall time: Batch WallTime_DMM = T_M^comp + T_M^comm, where T_M^comp refers to the wall time of the assignment phase and T_M^comm to the wall time of the recalculation phase (mostly spent in communications). Assignment phase: T_M^comp = 3INκd T_flop / M.
20 Distributed Batch K-Means. Recalculation phase, DMM architecture with MPI: T_M^comm = log₂(M) IκdS / B, where S refers to the size of a double in memory (8 bytes in the following) and B to the communication bandwidth per machine. Wall time, DMM architecture with MPI: Batch WallTime_DMM = 3INκd T_flop / M + log₂(M) IκdS / B.
21 Distributed Batch K-Means. Speed-up, DMM architecture with MPI: SpeedUp_DMM(M, N) = 3N T_flop / (3N T_flop / M + (S / B) log₂(M)). Optimal number of processing units: M*_DMM = 3N T_flop B / S.
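The closed-form optimum can be checked numerically against the model. The hardware constants below are assumed orders of magnitude (roughly 1 GFlops and 100 MB/s of bandwidth), not measurements; minimizing 3NT_flop/M + (S/B)log₂(M) exactly yields M = 3NT_flop·B·ln(2)/S, i.e. the slide's closed form up to a ln(2) factor.

```python
import math

def speedup_dmm(M, N, T_flop, S, B):
    """SpeedUp_DMM(M, N) from the DMM/MPI communication model above."""
    return (3 * N * T_flop) / (3 * N * T_flop / M + (S / B) * math.log2(M))

# Assumed orders of magnitude: 1 Gflops per core, 100 MB/s, 8-byte doubles.
N, T_flop, S, B = 10**6, 1e-9, 8, 1e8
M_star = 3 * N * T_flop * B / S          # the slide's closed form
best = max(range(2, 10**5), key=lambda M: speedup_dmm(M, N, T_flop, S, B))
# best sits at M_star * ln(2), confirming the shape of the closed form
```

The takeaway of the model survives the ln(2) constant: the optimal machine count grows linearly with N and with the compute-to-bandwidth ratio.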
22 Distributed Batch K-Means. Batch K-Means distributed over Azure.
23 Distributed Batch K-Means. Figure: Distribution scheme of our cloud Batch K-Means. Each worker pushes its blob into the BlobStorage and pings the storage until it finds the expected blob, then downloads it. Map results (prototypes) flow from the mappers (Mapper 1 to Mapper 6) to partial reducers; the partial reduce results (prototypes) are merged by a final reducer into the final reduce result (prototypes).
24 Distributed Batch K-Means. Communication modeling: T_M^comm = I √M κdS (2T_Blob^read + T_Blob^write), where T_Blob^read (resp. T_Blob^write) refers to the time needed by a given processing unit to download (resp. upload) a blob from (resp. to) the storage, per memory unit. Speed-up, cloud architecture: SpeedUp(M, N) = 3N T_flop / (3N T_flop / M + √M S (2T_Blob^read + T_Blob^write)). Optimal number of workers: M*(N) = (6N T_flop / (S (2T_Blob^read + T_Blob^write)))^{2/3}.
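The (·)^{2/3} closed form follows from a communication term growing like √M (the two-level reduce tree of the figure). This can be checked numerically; the storage timing constants below are assumed for illustration, not measured values from the experiments.

```python
def cloud_walltime(M, N, T_flop, S, C):
    """Per-iteration cost model: compute term plus sqrt(M) communication term,
    with C standing for 2*T_read + T_write (sec per byte, assumed)."""
    return 3 * N * T_flop / M + (M ** 0.5) * S * C

# Assumed constants: 1 Gflops, and storage latencies on the order of 10^-8 s/byte.
N, T_flop, S = 10**7, 1e-9, 8
C = 2e-8 + 1e-8                     # assumed 2*T_read + T_write
M_star = (6 * N * T_flop / (S * C)) ** (2 / 3)
# M_star is the stationary point of cloud_walltime: moving away from it
# in either direction increases the modeled wall time
```

Setting the derivative −3NT_flop/M² + SC/(2√M) to zero gives M^{3/2} = 6NT_flop/(SC), hence the 2/3 exponent: the cloud optimum grows more slowly with N than the linear DMM/MPI optimum.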
25 Distributed Batch K-Means. Figure: Time to execute the Reduce phase per unit of memory (2T_Blob^read + T_Blob^write), in 10^-7 sec/byte, as a function of the number of communicating units.
26 Distributed Batch K-Means. Figure: Speedup as a function of the number of mappers — observed and theoretical speedup curves for different data set sizes N.
27 Distributed Batch K-Means. Table: Comparison between the effective optimal number of processing units M*_eff and the theoretical optimal number of processing units M* for different data set sizes (columns: N, M*_eff, M*, wall time, sequential theoretic time, effective speedup, theoretical speedup; the numeric entries are not recoverable from the transcription).
28 Distributed Batch K-Means. Figure: Speedup as a function of the number of mappers — observed and theoretical speedup curves for different numbers of processing units. For each value of M, the value of N is set so that the processing units are heavily loaded with data and computations.
29 Distributed Batch K-Means. Figure: Distribution of the processing time (in seconds) for multiple runs of the same computation task on multiple VMs.
30 Outline: Distributed Vector Quantization algorithms. 1 Introduction to Cloud Computing 2 Context 3 Distributed Batch K-Means 4 Distributed Vector Quantization algorithms
31 Distributed Vector Quantization algorithms. Asynchronous clustering: motivation. Joint work with Benoît Patra. Every action should be accounted for exactly once: no calculation should be discarded; no calculation should be used more than once; all writes should result in prototype updates everywhere; all reads should be used locally. (On War, Clausewitz.) Saturate bandwidth, memory, CPU, etc. => asynchronism => online, or at least mini-batch (no more batch).
32 Distributed Vector Quantization algorithms. Sequential VQ algorithm: consists of incremental updates of the (R^d)^κ-valued prototypes {w(t)}_{t≥0}, initiated from a random initial w(0) ∈ (R^d)^κ. Given a sequence of positive steps (ε_t)_{t>0}, it produces a sequence w(t) by updating w at each step with a descent term: H(z, w) = ( (w_l − z) 1_{{l = argmin_{i=1,...,κ} ||z − w_i||²}} )_{1 ≤ l ≤ κ}, and w(t + 1) = w(t) − ε_{t+1} H(z_{{t+1 mod n}}, w(t)), t ≥ 0.
33 Distributed Vector Quantization algorithms.
Algorithm 2 Sequential VQ algorithm:
Select κ initial prototypes (w_k)_{k=1}^κ
Set t = 0
repeat
  for k = 1 to κ do
    compute ||z_{{t+1 mod n}} − w_k||²
  end for
  deduce H(z_{{t+1 mod n}}, w)
  set w(t + 1) = w(t) − ε_{t+1} H(z_{{t+1 mod n}}, w(t))
  increment t
until the stopping criterion is met
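Algorithm 2 translates directly to Python. A minimal sketch: only the prototype closest to the current sample moves; the data, initialization and step sequence are illustrative.

```python
import numpy as np

def H(z, w):
    """Descent term: only the closest prototype moves toward the sample z."""
    g = np.zeros_like(w)
    l = ((w - z) ** 2).sum(axis=1).argmin()
    g[l] = w[l] - z
    return g

def sequential_vq(data, w0, steps):
    """Sequential VQ: w(t+1) = w(t) - eps * H(z_{t mod n}, w(t))."""
    w = w0.copy()
    n = len(data)
    for t, eps in enumerate(steps):
        w = w - eps * H(data[t % n], w)
    return w

data = np.array([[0.0], [1.0]])
w0 = np.array([[0.0], [10.0]])
w = sequential_vq(data, w0, steps=[1.0 / (t + 1) for t in range(100)])
# the prototype at 10.0 is never the closest one, so it never moves
```

Unlike Batch K-Means, each update touches a single prototype and uses a single point, which is what makes the procedure online and amenable to asynchronous distribution.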
34 Distributed Vector Quantization algorithms. Our context: we assume that a satisfactory VQ implementation has been found, but it is too slow. We are not concerned with the optimization of the various parameters (initialization, sequence of steps, etc.). We have access to a finite data set {z_t^i}_{t=0}^n, i ∈ {1,...,M}, distributed over M processing units. When does a distributed VQ implementation perform better than the corresponding sequential one?
35 Distributed Vector Quantization algorithms. Definition of speed-up for VQ algorithms: a reference prototype version is made available in the shared memory (BlobStorage), referred to as the prototypes shared version w_srd. Performance is measured with the corresponding empirical distortion: for all w ∈ (R^d)^κ, L_N(w) = (1/(nM)) Σ_{i=1}^M Σ_{t=1}^n min_{l=1,...,κ} ||z_t^i − w_l||². After any t wall-time seconds, the empirical distortion of the prototypes shared version should be lower than that of the prototype version produced by the sequential algorithm.
36 Distributed Vector Quantization algorithms. Previous work: VQ as a stochastic gradient descent method. With shared memory: interleaving the prototype version updates. Without shared memory but with loss convexity: averaging the prototype versions. In our case: no efficient shared memory, and no convexity of the loss function. Organization of our work: simulated distributed architecture on a single machine, then cloud implementation.
37 Distributed Vector Quantization algorithms. First distributed scheme: all the versions are set equal at time t = 0, w^1(0) = ... = w^M(0). For all i ∈ {1,...,M} and all t ≥ 0, we have the following iterations:
w_temp^i = w^i(t) − ε_{t+1} H(z^i_{{t+1 mod n}}, w^i(t)),
w^i(t + 1) = w_temp^i if t mod τ ≠ 0 or t = 0,
w_srd = (1/M) Σ_{j=1}^M w_temp^j and w^i(t + 1) = w_srd if t mod τ = 0 and t ≥ τ.
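The first scheme can be simulated on a single machine. Illustrative Python sketch: M local versions evolve independently for τ steps, then are all replaced by their average.

```python
import numpy as np

def H(z, w):
    # Descent term of the VQ iteration: only the closest prototype moves.
    g = np.zeros_like(w)
    l = ((w - z) ** 2).sum(axis=1).argmin()
    g[l] = w[l] - z
    return g

def parallel_vq_averaging(datasets, w0, steps, tau):
    """First scheme: every tau steps, all local versions are averaged."""
    M = len(datasets)
    workers = [w0.copy() for _ in range(M)]
    for t, eps in enumerate(steps, start=1):
        workers = [w - eps * H(data[t % len(data)], w)
                   for data, w in zip(datasets, workers)]
        if t % tau == 0:
            shared = sum(workers) / M        # reduce: average the versions
            workers = [shared.copy() for _ in range(M)]
    return workers

out = parallel_vq_averaging(
    [np.array([[0.0], [1.0]]), np.array([[2.0], [3.0]])],
    np.array([[0.0], [10.0]]), steps=[0.1] * 20, tau=10)
# after the final averaging phase all local versions coincide
```

As the next slides show, averaging divides each descent term by M, which is why this scheme fails to deliver the expected speed-up.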
38 Distributed Vector Quantization algorithms. A first basic parallelization scheme. Figure: A simple (and synchronous) scheme: whenever τ points are processed, an averaging phase occurs (global time reference).
39 Distributed Vector Quantization algorithms. Figure: Empirical distortion over t (iterations) with different numbers of computing entities: M = 1, 2, 10 and τ = 10.
40 Distributed Vector Quantization algorithms. A comparison between the previous parallel scheme and the sequential VQ, for t mod τ = 0 and t > 0. For all i ∈ {1,...,M}:
w^i(t + 1) = w^i(t − τ + 1) − Σ_{t'=t−τ+1}^{t} ε_{t'+1} (1/M) Σ_{j=1}^M H(z^j_{t'+1}, w^j(t')) (parallel),
w(t + 1) = w(t − τ + 1) − Σ_{t'=t−τ+1}^{t} ε_{t'+1} H(z_{{t'+1 mod n}}, w(t')) (sequential).
The averaged terms (highlighted in blue on the slide) are estimators of the gradient.
41 Distributed Vector Quantization algorithms. Two SGD algorithms with the same sequence of steps have similar convergence speed. The sequence of steps (learning rate) sets the trade-off between exploration and convergence. Introducing displacement/descent terms: for all j ∈ {1,...,M} and t₂ ≥ t₁ ≥ 0, set Δ^j_{t₁→t₂} = Σ_{t'=t₁+1}^{t₂} ε_{t'+1} H(z^j_{{t'+1 mod n}}, w^j(t')). It corresponds to the displacement of the prototypes computed by unit j during (t₁, t₂).
42 Distributed Vector Quantization algorithms. Second distributed scheme:
w_temp^i = w^i(t) − ε_{t+1} H(z^i_{{t+1 mod n}}, w^i(t)),
w^i(t + 1) = w_temp^i if t mod τ ≠ 0 or t = 0,
w_srd = w_srd − Σ_{j=1}^M Δ^j_{t−τ→t} and w^i(t + 1) = w_srd if t mod τ = 0 and t ≥ τ.
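The second scheme differs from the first only in the reduce step, which subtracts the sum of the displacement terms Δ^j from the shared version instead of averaging the local versions. Illustrative sketch; with M = 1 it reproduces the sequential VQ trajectory exactly.

```python
import numpy as np

def H(z, w):
    # Descent term: only the closest prototype moves toward the sample z.
    g = np.zeros_like(w)
    l = ((w - z) ** 2).sum(axis=1).argmin()
    g[l] = w[l] - z
    return g

def parallel_vq_displacement(datasets, w0, steps, tau):
    """Second scheme: the shared version accumulates the *sum* of the
    displacement terms Delta^j rather than an average of the versions."""
    M = len(datasets)
    w_srd = w0.copy()
    workers = [w0.copy() for _ in range(M)]
    deltas = [np.zeros_like(w0) for _ in range(M)]   # Delta^j over the window
    for t, eps in enumerate(steps, start=1):
        for j in range(M):
            d = eps * H(datasets[j][t % len(datasets[j])], workers[j])
            workers[j] = workers[j] - d
            deltas[j] = deltas[j] + d                # accumulate displacement
        if t % tau == 0:
            w_srd = w_srd - sum(deltas)              # reduce: sum, not average
            workers = [w_srd.copy() for _ in range(M)]
            deltas = [np.zeros_like(w0) for _ in range(M)]
    return w_srd
```

Summing keeps every worker's full displacement in the shared version, so no computation is discarded or scaled down, which is the motto of slide 48.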
43 Distributed Vector Quantization algorithms. Displacement terms. Figure: Illustration of the parallelization scheme of VQ procedures described by equations (43).
44 Distributed Vector Quantization algorithms. Figure: Empirical distortion over t (iterations) for the revised scheme, with M = 1, 2, 10 and τ = 10.
45 Distributed Vector Quantization algorithms. Delayed distributed scheme:
w_temp^i = w^i(t) − ε_{t+1} H(z^i_{{t+1 mod n}}, w^i(t)),
w^i(t + 1) = w_temp^i if t mod τ ≠ 0 or t = 0,
w_srd = w_srd − Σ_{j=1}^M Δ^j_{t−2τ→t−τ} if t mod τ = 0 and t ≥ 2τ,
w^i(t + 1) = w_srd − Δ^i_{t−τ→t} if t mod τ = 0 and t ≥ τ.
46 Distributed Vector Quantization algorithms. Figure: Illustration of the delayed parallelization scheme described by equations (46). The reducing phase is only drawn for processor 1, where t = 2τ, and processor 4, where t = 4τ.
47 Distributed Vector Quantization algorithms. Figure: Empirical distortion over t (iterations) for iterations (46) with different numbers of computing entities, M = 1, 2, 10 and τ = 10.
48 Distributed Vector Quantization algorithms. Simulated parallelization schemes, first conclusions. Motto: sum displacement terms rather than averaging versions. Experimental results: satisfactory speed-ups are recovered for the latter simulated parallel schemes. Delays (deterministic and random) are also studied: reasonable random delays do not have a severe impact on the convergence. Good perspectives for a true implementation on a cloud computing platform.
49 Distributed Vector Quantization algorithms. The CloudDALVQ project: a scientific project for testing new large-scale clustering/quantization algorithms distributed on a cloud platform (MS Azure). Open source, written in C#/.NET, released under the new BSD license.
50 Distributed Vector Quantization algorithms. Figure: Distribution scheme of our cloud VQ implementation. Each worker pushes its blob into the BlobStorage and pings the storage until it finds the expected blob, then downloads it. Map results (displacement terms) flow from the mappers (Mapper 1 to Mapper 6) to partial reducers; the partial reduce results (displacement terms) are merged by a final reducer into the final reduce result (prototypes).
51 Distributed Vector Quantization algorithms. Figure: Internal architecture of a ProcessService worker. A pull thread downloads the shared version from the BlobStorage into a read buffer; process threads (process actions 1, 2, 3) apply it to the local version, consume the local data and accumulate a displacement term; a push thread uploads the displacement term from the write buffer back to the BlobStorage.
52 Distributed Vector Quantization algorithms. Figure: Normalized quantization curves (empirical distortion over seconds) with M = 1, 2, 4, 8, 16. Troubles appear with M = 16 because the ReduceService is overloaded.
53 Distributed Vector Quantization algorithms. Figure: Normalized quantization curves (empirical distortion over seconds) with M = 8, 16, 32, 64, with an extra layer for the reducing task.
54 Distributed Vector Quantization algorithms. Figure: Competition between our cloud DALVQ algorithm and the cloud Batch K-Means: empirical distortion of both algorithms over time (seconds).
More informationIntroduction to Optimization
Introduction to Optimization Konstantin Tretyakov (kt@ut.ee) MTAT.03.227 Machine Learning So far Machine learning is important and interesting The general concept: Fitting models to data So far Machine
More informationProgressive & Algorithms & Systems
University of California Merced Lawrence Berkeley National Laboratory Progressive Computation for Data Exploration Progressive Computation Online Aggregation (OLA) in DB Query Result Estimate Result ε
More informationLeveraging Web GIS: An Introduction to the ArcGIS portal
Leveraging Web GIS: An Introduction to the ArcGIS portal Derek Law Product Management DLaw@esri.com Agenda Web GIS pattern Product overview Installation and deployment Configuration options Security options
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationA Reconfigurable Quantum Computer
A Reconfigurable Quantum Computer David Moehring CEO, IonQ, Inc. College Park, MD Quantum Computing for Business 4-6 December 2017, Mountain View, CA IonQ Highlights Full Stack Quantum Computing Company
More informationHigh-Performance Scientific Computing
High-Performance Scientific Computing Instructor: Randy LeVeque TA: Grady Lemoine Applied Mathematics 483/583, Spring 2011 http://www.amath.washington.edu/~rjl/am583 World s fastest computers http://top500.org
More informationEnergy-efficient Mapping of Big Data Workflows under Deadline Constraints
Energy-efficient Mapping of Big Data Workflows under Deadline Constraints Presenter: Tong Shu Authors: Tong Shu and Prof. Chase Q. Wu Big Data Center Department of Computer Science New Jersey Institute
More informationINF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)
INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder
More informationPresentation in Convex Optimization
Dec 22, 2014 Introduction Sample size selection in optimization methods for machine learning Introduction Sample size selection in optimization methods for machine learning Main results: presents a methodology
More informationOptimization for neural networks
0 - : Optimization for neural networks Prof. J.C. Kao, UCLA Optimization for neural networks We previously introduced the principle of gradient descent. Now we will discuss specific modifications we make
More informationVisualizing Big Data on Maps: Emerging Tools and Techniques. Ilir Bejleri, Sanjay Ranka
Visualizing Big Data on Maps: Emerging Tools and Techniques Ilir Bejleri, Sanjay Ranka Topics Web GIS Visualization Big Data GIS Performance Maps in Data Visualization Platforms Next: Web GIS Visualization
More informationElectrical and Computer Engineering Department University of Waterloo Canada
Predicting a Biological Response of Molecules from Their Chemical Properties Using Diverse and Optimized Ensembles of Stochastic Gradient Boosting Machine By Tarek Abdunabi and Otman Basir Electrical and
More informationPI SERVER 2012 Do. More. Faster. Now! Copyr i g h t 2012 O S Is o f t, L L C. 1
PI SERVER 2012 Do. More. Faster. Now! Copyr i g h t 2012 O S Is o f t, L L C. 1 AUGUST 7, 2007 APRIL 14, 2010 APRIL 24, 2012 Copyr i g h t 2012 O S Is o f t, L L C. 2 PI Data Archive Security PI Asset
More informationPredictive analysis on Multivariate, Time Series datasets using Shapelets
1 Predictive analysis on Multivariate, Time Series datasets using Shapelets Hemal Thakkar Department of Computer Science, Stanford University hemal@stanford.edu hemal.tt@gmail.com Abstract Multivariate,
More informationParallel Performance Theory
AMS 250: An Introduction to High Performance Computing Parallel Performance Theory Shawfeng Dong shaw@ucsc.edu (831) 502-7743 Applied Mathematics & Statistics University of California, Santa Cruz Outline
More informationKnowledge Discovery and Data Mining 1 (VO) ( )
Knowledge Discovery and Data Mining 1 (VO) (707.003) Map-Reduce Denis Helic KTI, TU Graz Oct 24, 2013 Denis Helic (KTI, TU Graz) KDDM1 Oct 24, 2013 1 / 82 Big picture: KDDM Probability Theory Linear Algebra
More informationTelecommunication Services Engineering (TSE) Lab. Chapter IX Presence Applications and Services.
Chapter IX Presence Applications and Services http://users.encs.concordia.ca/~glitho/ Outline 1. Basics 2. Interoperability 3. Presence service in clouds Basics 1 - IETF abstract model 2 - An example of
More informationREINFORCEMENT LEARNING
REINFORCEMENT LEARNING Larry Page: Where s Google going next? DeepMind's DQN playing Breakout Contents Introduction to Reinforcement Learning Deep Q-Learning INTRODUCTION TO REINFORCEMENT LEARNING Contents
More informationLogic Design II (17.342) Spring Lecture Outline
Logic Design II (17.342) Spring 2012 Lecture Outline Class # 10 April 12, 2012 Dohn Bowden 1 Today s Lecture First half of the class Circuits for Arithmetic Operations Chapter 18 Should finish at least
More informationImportance Sampling for Minibatches
Importance Sampling for Minibatches Dominik Csiba School of Mathematics University of Edinburgh 07.09.2016, Birmingham Dominik Csiba (University of Edinburgh) Importance Sampling for Minibatches 07.09.2016,
More informationNICTA Short Course. Network Analysis. Vijay Sivaraman. Day 1 Queueing Systems and Markov Chains. Network Analysis, 2008s2 1-1
NICTA Short Course Network Analysis Vijay Sivaraman Day 1 Queueing Systems and Markov Chains Network Analysis, 2008s2 1-1 Outline Why a short course on mathematical analysis? Limited current course offering
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationArcGIS Enterprise: What s New. Philip Heede Shannon Kalisky Melanie Summers Sam Williamson
ArcGIS Enterprise: What s New Philip Heede Shannon Kalisky Melanie Summers Sam Williamson ArcGIS Enterprise is the new name for ArcGIS for Server What is ArcGIS Enterprise ArcGIS Enterprise is powerful
More informationArcGIS is Advancing. Both Contributing and Integrating many new Innovations. IoT. Smart Mapping. Smart Devices Advanced Analytics
ArcGIS is Advancing IoT Smart Devices Advanced Analytics Smart Mapping Real-Time Faster Computing Web Services Crowdsourcing Sensor Networks Both Contributing and Integrating many new Innovations ArcGIS
More informationScikit-learn. scikit. Machine learning for the small and the many Gaël Varoquaux. machine learning in Python
Scikit-learn Machine learning for the small and the many Gaël Varoquaux scikit machine learning in Python In this meeting, I represent low performance computing Scikit-learn Machine learning for the small
More informationSpatial Analytics Workshop
Spatial Analytics Workshop Pete Skomoroch, LinkedIn (@peteskomoroch) Kevin Weil, Twitter (@kevinweil) Sean Gorman, FortiusOne (@seangorman) #spatialanalytics Introduction The Rise of Spatial Analytics
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 12: Real-Time Data Analytics (2/2) March 31, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationQR Decomposition in a Multicore Environment
QR Decomposition in a Multicore Environment Omar Ahsan University of Maryland-College Park Advised by Professor Howard Elman College Park, MD oha@cs.umd.edu ABSTRACT In this study we examine performance
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationCSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent
CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic
More informationOverview: Synchronous Computations
Overview: Synchronous Computations barriers: linear, tree-based and butterfly degrees of synchronization synchronous example 1: Jacobi Iterations serial and parallel code, performance analysis synchronous
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationMaster thesis. Multi-class Fork-Join queues & The stochastic knapsack problem
Master thesis Multi-class Fork-Join queues & The stochastic knapsack problem Sihan Ding August 26th, 2011 Supervisor UL: Dr. Floske Spieksma Supervisors CWI: Drs. Chrétien Verhoef Prof.dr. Rob van der
More informationIntroduction to Machine Learning (67577)
Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks
More informationStochastic Analogues to Deterministic Optimizers
Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured
More informationMachine Learning CS 4900/5900. Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science
Machine Learning CS 4900/5900 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning is Optimization Parametric ML involves minimizing an objective function
More informationDeep Learning & Neural Networks Lecture 4
Deep Learning & Neural Networks Lecture 4 Kevin Duh Graduate School of Information Science Nara Institute of Science and Technology Jan 23, 2014 2/20 3/20 Advanced Topics in Optimization Today we ll briefly
More informationOptimization for Machine Learning
Optimization for Machine Learning Elman Mansimov 1 September 24, 2015 1 Modified based on Shenlong Wang s and Jake Snell s tutorials, with additional contents borrowed from Kevin Swersky and Jasper Snoek
More informationNeed for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels
Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)
More informationHow to deal with uncertainties and dynamicity?
How to deal with uncertainties and dynamicity? http://graal.ens-lyon.fr/ lmarchal/scheduling/ 19 novembre 2012 1/ 37 Outline 1 Sensitivity and Robustness 2 Analyzing the sensitivity : the case of Backfilling
More informationMotivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning
Convex Optimization Lecture 15 - Gradient Descent in Machine Learning Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 21 Today s Lecture 1 Motivation 2 Subgradient Method 3 Stochastic
More informationOur Problem. Model. Clock Synchronization. Global Predicate Detection and Event Ordering
Our Problem Global Predicate Detection and Event Ordering To compute predicates over the state of a distributed application Model Clock Synchronization Message passing No failures Two possible timing assumptions:
More informationPortal for ArcGIS: An Introduction. Catherine Hynes and Derek Law
Portal for ArcGIS: An Introduction Catherine Hynes and Derek Law Agenda Web GIS pattern Product overview Installation and deployment Configuration options Security options and groups Portal for ArcGIS
More informationWeb GIS & ArcGIS Pro. Zena Pelletier Nick Popovich
Web GIS & ArcGIS Pro Zena Pelletier Nick Popovich Web GIS Transformation of the ArcGIS Platform Desktop Apps GIS Web Maps Web Scenes Layers Evolution of the modern GIS Desktop GIS (standalone GIS) GIS
More information