Distributed Deep Learning: Parallel Sparse Autoencoder
Abhik Lahiri, Raghav Pasari, Bobby Prochnow
December 10, 2010

1 Introduction

Much of the bleeding edge research in the areas of computer vision, natural language processing, and audio recognition revolves around the painful and time consuming process of hand-picking features from training data. Many researchers spend decades experimenting with complex feature selection processes in the hopes of improving the performance of learning algorithms. Deep learning approaches attempt to replace the practice of hand-picking features by instead algorithmically determining structure or organization hidden within the training data. Some early deep learning approaches have shown great promise, even outperforming many of the state-of-the-art algorithms that operate on hand-picked features.

Deep learning algorithms, however, are computationally expensive. Even on powerful computers, it can be impractical to have the algorithms learn on a sufficient amount of input data, making these algorithms considerably less practical for many problems. Parallelizing these algorithms and running them in a multi-core or distributed setting could result in a significant speedup. This, in turn, makes it more practical to feed larger amounts of data into the algorithms, which will improve their performance considerably. Thus, our goal is to understand how to scale deep learning methods to function on large clusters with many cores and machines. For the extent of this paper, we focused on parallelization of the sparse autoencoder learning algorithm. Towards this goal, we first did a survey of serial optimization algorithms for the sparse autoencoder (stochastic gradient descent, conjugate gradient, L-BFGS). We then parallelized the sparse autoencoder using a simple approximation to the cost function (which we have proven is a sufficient approximation). Finally, we performed small-scale benchmarks both in a multi-core environment and in a cluster environment.
2 Serial Sparse Autoencoder

The sparse autoencoder is a deep learning variant of a neural network used to represent the identity function on unlabeled training data. To force the network to find structure in the data, we enforce a sparsity constraint that ensures that each of the hidden nodes fires very infrequently over the course of the training set.

2.1 Stochastic Gradient Descent

Our naive approach used stochastic gradient descent to optimize the standard cost function:

    J(W, b; x^(i)) = (1/2) ||h(x^(i)) − x^(i)||^2 + λ_W Σ_{l} Σ_{i,j} (W_{ij}^(l))^2

Additionally, to enforce sparsity, after each iteration of stochastic gradient descent, we performed the following update on the biases for the hidden layer:

    b_i^(1) := b_i^(1) − αβ(ρ̂_i − ρ)

where ρ̂_i is a running estimate of the probability of the hidden node i firing and ρ is the desired sparsity.

2.2 Batch Optimization Algorithms

For optimization algorithms that iterate on entire batches of examples (L-BFGS and conjugate gradient), we integrate the sparsity constraint directly into the cost function and use the KL divergence to measure the difference between the current and target sparsities:

    J(W, b) = (1/m) Σ_{i=1}^{m} (1/2) ||h(x^(i)) − x^(i)||^2 + λ_W Σ_{l} Σ_{i,j} (W_{ij}^(l))^2 + λ_ρ Σ_j KL(ρ || p_j)

where ρ is the desired sparsity and p_j is the current sparsity for hidden node j over the entire batch of examples.

2.3 Comparison of Algorithms

In all benchmarks, the training examples are a random sampling of 8x8 patches from a set of ten 512x512 images (courtesy of Bruno Olshausen). We restrict the hidden layer of the network to 30 nodes, set λ_W = .00, set λ_ρ = 4, and target the probability of a hidden node firing to be .00.
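The batch objective from Section 2.2 can be sketched concretely in NumPy. This is a minimal sketch under assumed conventions: the sigmoid activation, the layer shapes, and the helper names (`batch_cost`, examples stored as columns) are our illustrative choices, not necessarily the authors' exact setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_cost(W1, b1, W2, b2, X, lam_w, lam_rho, rho):
    """Sparse autoencoder batch cost: mean reconstruction error,
    weight decay on both layers, and a KL-divergence sparsity penalty."""
    m = X.shape[1]                          # examples stored as columns
    A = sigmoid(W1 @ X + b1[:, None])       # hidden activations
    H = sigmoid(W2 @ A + b2[:, None])       # reconstruction h(x)
    err = 0.5 * np.sum((H - X) ** 2) / m    # (1/m) sum of 1/2 ||h(x)-x||^2
    decay = lam_w * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    p = A.mean(axis=1)                      # current sparsity p_j per hidden node
    kl = np.sum(rho * np.log(rho / p) + (1 - rho) * np.log((1 - rho) / (1 - p)))
    return err + decay + lam_rho * kl
```

A batch optimizer such as L-BFGS would minimize this value (together with its gradient) over W1, b1, W2, b2.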
On image input, we expect the learned weights for the hidden nodes of the sparse autoencoder to represent edges of independent orientation.

Figure 2.3.1. Sample learned hidden weights.

Quantifying how close the learned weights are to this goal is difficult, as extremely small differences in the value of the cost function, sparsity, or error can result in highly varied change with respect to how edge-like the learned weights are. For our purposes, however, it sufficed to quantitatively analyze how well the algorithms do with respect to minimizing the cost function (and then, as a sanity check, visualize the hidden weights to verify the expected output). In practice, quality output is achieved through 4 million iterations of stochastic gradient descent, or 500 iterations of L-BFGS/conjugate gradient with a 100K batch size. Below are the averaged results for 3 independent trials.

    Mean Execution Time (seconds)
    Stochastic Gradient Descent    648
    Conjugate Gradient            1098
    L-BFGS                         653

Figure 2.3.2. Average cost function over time. (To calculate the cost function for stochastic gradient descent, we calculated the batch cost function on a set of 100k examples every 8000 iterations.)

While stochastic gradient descent performs well in this case, it does not lend itself to much parallelism, as iterations must be performed in sequence, and each iteration is extremely cheap. Both L-BFGS and conjugate gradient operate on batches of examples, allowing for potential parallelism; however, conjugate gradient takes about twice as long as L-BFGS to learn the autoencoder. For this reason, we chose to use L-BFGS in our following parallel implementation.

3 Parallel Sparse Autoencoder

Our parallel algorithm is quite natural: we use a serial implementation of L-BFGS with a parallel cost function:

    proc ParallelAvgCostFunction(W, X)
        foreach t parallel do
            X_t := GetThreadData(X, t);
            cost_t, grad_t := SerialCostFunction(W, X_t);
        avgcost := average(..., cost_t, ...);
        avggrad := average(..., grad_t, ...);
        return avgcost, avggrad;
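A minimal executable version of the averaging scheme above, using Python threads. The quadratic `serial_cost_function` is a stand-in so the sketch is self-contained; in the real system, SerialCostFunction would be the full sparse autoencoder cost/gradient routine, and all names here are illustrative rather than the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def serial_cost_function(W, X_t):
    """Stand-in for SerialCostFunction: a simple quadratic cost and its
    gradient, averaged over the examples (columns) of X_t."""
    m_t = X_t.shape[1]
    D = W @ X_t
    cost = 0.5 * np.sum(D ** 2) / m_t
    grad = (D @ X_t.T) / m_t
    return cost, grad

def parallel_avg_cost_function(W, X, n_workers=4):
    """ParallelAvgCostFunction: split the batch into equal chunks, evaluate
    the serial cost on each chunk in parallel, and average the results."""
    chunks = np.array_split(X, n_workers, axis=1)     # plays the role of GetThreadData
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = list(ex.map(lambda X_t: serial_cost_function(W, X_t), chunks))
    costs, grads = zip(*results)
    return sum(costs) / len(costs), sum(grads) / len(grads)
```

When every chunk has the same number of examples, the averaged cost and gradient of a per-example-averaged term equal the values a single serial evaluation would compute on the full batch.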
This assumes that X_t is the same size for all t, so that all examples are considered with the same weight (when the cost values are averaged), but it is not difficult to account for cases where this does not hold. Also note that the algorithm is merely pseudocode here; among other things, in the implemented algorithm, X_t is stored permanently for each worker once, and does not need to be repeatedly computed or communicated between threads.

At first glance, this algorithm seems trivially correct; however, because of the KL divergence for the sparsity term, the above function does not necessarily compute the correct cost function such that ParallelAvgCostFunction(W, X) = SerialCostFunction(W, X). Regardless, we can prove that ParallelAvgCostFunction is an extremely good approximation, and the results confirm this. Also note that the gradient computed by ParallelAvgCostFunction is correct with respect to the cost function ParallelAvgCostFunction computes.
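One way to account for unequal chunk sizes, as alluded to above, is to weight each worker's contribution by the number of examples it processed. A small sketch under our own naming, not part of the authors' implementation:

```python
import numpy as np

def weighted_combine(costs, grads, chunk_sizes):
    """Combine per-worker (cost, grad) pairs, where each worker reports a
    per-example average over its own chunk. Weighting by chunk size makes
    the combined result match a single serial evaluation over the
    concatenated batch, even when chunks differ in size."""
    w = np.asarray(chunk_sizes, dtype=float)
    w = w / w.sum()                       # fraction of the batch per worker
    cost = float(np.dot(w, np.asarray(costs, dtype=float)))
    grad = sum(wi * g for wi, g in zip(w, grads))
    return cost, grad
```

With equal chunk sizes this reduces to the plain average used by ParallelAvgCostFunction.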
3 3. Alernave Algorihm From he cos funcion definiion, we noice ha he error-squared erm is rivially parallelizable - each hread can compue he erm on a differen subse of paches and hen he resuls are averaged - and he weighs-squared erm is relaively cheap o compue (so i does no need o be parallelized), bu he kl divergence erm (he las erm) is non-rivial o make parallel. An inuiive way o compue he kl divergence correcly in a parallel seing is wih he following algorihm: proc ParallelExacCosFuncion( W, X) foreach parallel do X := GeThreadDaa( X, ); cos, grad, p, a := SerialErrorSquared( W, X ); avgcos := average(... cos...); avggrad := average(... grad...); p := average(... p...); foreach parallel do X := GeThreadDaa( X, ); sgrad := SparsiyTermGrad( W, X, p, a ); wcos, wgrad = SumWeighsTerm(W); cos = avgcos + wcos + SparsiyTermCos(ρ, p); grad = avggrad + wgrad + sum(sgrad ); reurn cos, grad;. The complexiy (and poenial performance hi) from his approach arises from he fac ha in order o calculae he gradien wih respec o he kl erm on a bach of examples, he hread needs he correc value of p in addiion o all of he acivaions a calculaed on he bach. In an acual implemenaion, a would no be communicaed beween hreads. The worker hread would merely sore a locally in he call o SerialErrorSquared for laer use in he SparsiyTerm funcion. 3. Parallel Correcness Forunaely, we can prove high probabiliy bounds on he difference beween ParallelAvgCosFuncion and SerialCosFuncion. We ll sar, however, by proving a few basic facs. Our raining se consiss of iid examples, and for each raining example, he hidden node acivaes by a Bernoulli rial wih probabiliy p - he rue probabiliy of hidden node firing on a random example from he raining se. Le p (i) be he i h hread s approximaion o p. By definiion, p (i) is he mean of he m Bernoulli rials ha deermine he acivaion of n on hread i s chunk of he raining se. Fac 3.. 
With probability at least 1 − 2th·exp(−2(m/t)·exp(−2ε/h)), there does not exist a thread i and hidden node j such that |p_j^(i) − p_j| > exp(−ε/h).

Proof. By a corollary of the Hoeffding inequality (proven using the union bound [1]), since we have th independent estimations of the p_j, we have the following:

    P(∃ j ∈ [h], i ∈ [t] : |p_j^(i) − p_j| > exp(−ε/h)) ≤ 2th·exp(−2(m/t)·exp(−2ε/h))

Fact 3.2. Assume |x − y| ≤ z. This implies |log x − log y| ≤ |log z|.

Proof. Without loss of generality, assume x > y.

    |x − y| ≤ z                  (1)
    x − y ≤ z                    (2)
    x ≤ y + z                    (3)
    log x ≤ log(y + z)           (4)
    log x ≤ log y + |log z|      (5)
    log x − log y ≤ |log z|      (6)
    |log x − log y| ≤ |log z|    (7)

Step (1) is the assumption made in the fact statement. Step (2) followed from our assumption without loss of generality. Step (4) followed from the fact that log is an increasing function. Step (5) is justified by the fact that log is concave. Step (7) is justified by the fact that x ≥ y implies log x ≥ log y because log is increasing.

Fact 3.3. Assume |p_j^(i) − p_j| ≤ exp(−ε/h). This implies |log p_j^(i) − log p_j| ≤ ε/h and |log(1 − p_j^(i)) − log(1 − p_j)| ≤ ε/h.
Proof. Apply Fact 3.2 with x = p_j^(i), y = p_j, z = exp(−ε/h). Similarly, |p_j^(i) − p_j| ≤ exp(−ε/h) implies |(1 − p_j) − (1 − p_j^(i))| ≤ exp(−ε/h). Apply Fact 3.2 with x = (1 − p_j), y = (1 − p_j^(i)), z = exp(−ε/h).

Theorem 3.1. Let t be the number of threads, m be the total number of training examples, h be the number of hidden nodes, and ε be some permissible error. Let C be the actual cost function on W and C′ be the approximation calculated by ParallelAvgCostFunction. Then

    P(|C − C′| ≤ λ_ρ·ε) ≥ 1 − 2th·exp(−2(m/t)·exp(−2ε/h))

Proof. Consider the cost function split into terms:

    C = C_E + C_W + C_P,      C_P = λ_ρ Σ_j [ρ log(ρ/p_j) + (1 − ρ) log((1 − ρ)/(1 − p_j))]
    C′ = C′_E + C′_W + C′_P,  C′_P = λ_ρ (1/t) Σ_i Σ_j [ρ log(ρ/p_j^(i)) + (1 − ρ) log((1 − ρ)/(1 − p_j^(i)))]

That C′_E = C_E follows from the assumption that all threads have exactly m/t examples in their chunk. C′_W = C_W trivially, as W is the same for all threads. This leaves us with needing to bound the value |C_P − C′_P|. Assume that |p_j^(i) − p_j| ≤ exp(−ε/h) for all i and j. By Fact 3.1, we know this occurs with probability at least 1 − 2th·exp(−2(m/t)·exp(−2ε/h)).

    |C_P − C′_P|
      = |λ_ρ Σ_j [ρ log(ρ/p_j) + (1 − ρ) log((1 − ρ)/(1 − p_j))]
          − λ_ρ (1/t) Σ_i Σ_j [ρ log(ρ/p_j^(i)) + (1 − ρ) log((1 − ρ)/(1 − p_j^(i)))]|
      = λ_ρ |(1/t) Σ_i Σ_j [ρ (log p_j^(i) − log p_j) + (1 − ρ)(log(1 − p_j^(i)) − log(1 − p_j))]|
      ≤ λ_ρ (1/t) Σ_i Σ_j [ρ |log p_j^(i) − log p_j| + (1 − ρ) |log(1 − p_j^(i)) − log(1 − p_j)|]
      ≤ λ_ρ (1/t) Σ_i Σ_j [ρ (ε/h) + (1 − ρ)(ε/h)]
      = λ_ρ (1/t) · t · h · (ε/h)
      = λ_ρ ε

Aside from algebraic manipulation, we used the fact that |Σ_x f(x)| ≤ Σ_x |f(x)|, and we used substitution using Fact 3.3.

This result proves that ParallelAvgCostFunction is an extremely good approximation, so long as m is a reasonable value. For instance, in our most typical benchmark, we have m = 100000 and h = 30; suppose we were testing on t = 1000 nodes. The difference between ParallelAvgCostFunction and the true cost on those examples will then be no more than a tiny fraction of λ_ρ with probability extremely close to 1.

3.3 Multi-core Benchmarks

Using Parallel Python, we implemented ParallelAvgCostFunction for testing in a multi-core environment: a quad-core, hyper-threading enabled desktop (Intel Core i7).
According to Intel, hyper-threading improves performance by approximately 30% [2]. We ran benchmarks to demonstrate parallel speedup with respect to batch size. Each execution time was averaged over 3 independent trials. (We also benchmarked ParallelExactCostFunction. On a 100K batch size, ParallelExactCostFunction was an average of 8 to 12 seconds slower than ParallelAvgCostFunction, regardless of the number of threads.)
[Table: Total running time (seconds) for 1, 2, 4, and 8 workers across batch sizes up to 1M.]

Figure 3.3.1. Average speedup across batch size.

3.4 Cluster Benchmarks

The Parallel Python framework used in the multi-core benchmarks is unfortunately ill-suited for learning the sparse autoencoder on clusters. It is not possible (without modifying the source to Parallel Python) to have worker threads maintain copies of their own example sets in memory, meaning that the threads would have to hit the disk every iteration. Fortunately, another 229 group (see Acknowledgements) developed the QJAM parallel framework for Python. The following benchmarks were performed on the yggdrasil machines courtesy of the Stanford AI Lab:

[Table: Total running time (seconds) for 1, 2, 4, and 8 workers across batch sizes up to 100K.]

While the performance of the framework suffers on smaller batch sizes (because of the high cost of communicating within a cluster), a speedup of 5.5 on 8 cores for a batch size of 100K is quite significant. For more analysis of the cluster benchmarks, see the project paper written by the framework's creators.

4 Conclusions

In testing serial optimization algorithms for the sparse autoencoder, we determined that L-BFGS demonstrated faster convergence than conjugate gradient, and thus elected to use L-BFGS in our parallel implementation. We also demonstrated that our approximation ParallelAvgCostFunction is an intuitive and extremely accurate approximation to the actual value of the cost function. The parallelism obtained on the QJAM framework with our parallel implementation of the sparse autoencoder is quite promising, especially when contrasted with the results obtained by Parallel Python in a multi-core environment. With a batch size of 100K, the Parallel Python framework could only achieve a 1.75x speedup on 4 workers, compared to the full 4x speedup on 4 workers achieved by the QJAM framework.
The difference in serial execution time between our multi-core test machine and the yggdrasil machines is puzzling (100K patch size: 5030 seconds on yggdrasil compared to 650 seconds on our test machine), but the slower serial execution time alone cannot account for the better parallelism achieved on QJAM; with 1M patches and a serial execution time of 643 seconds, the Parallel Python framework still only achieved a 1.88x speedup on 4 workers.

Figure 3.4.1. Average speedup across batch size.

5 Acknowledgements

We would like to thank the following: Professor Ng and Adam Coates for advising this project; Juan Batiz-Benet, Quinn Slack, Matt Sparks, and Ali Yahya for their work on the QJAM Python parallel framework; Milinda Lakkam and Sisi Sarkizova for their collaboration on the sparse autoencoder.

6 References

[1] http:// notes4.pdf
[2] http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/
More information72 Calculus and Structures
72 Calculus and Srucures CHAPTER 5 DISTANCE AND ACCUMULATED CHANGE Calculus and Srucures 73 Copyrigh Chaper 5 DISTANCE AND ACCUMULATED CHANGE 5. DISTANCE a. Consan velociy Le s ake anoher look a Mary s
More informationThe Rosenblatt s LMS algorithm for Perceptron (1958) is built around a linear neuron (a neuron with a linear
In The name of God Lecure4: Percepron and AALIE r. Majid MjidGhoshunih Inroducion The Rosenbla s LMS algorihm for Percepron 958 is buil around a linear neuron a neuron ih a linear acivaion funcion. Hoever,
More informationBias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé
Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070
More informationThe field of mathematics has made tremendous impact on the study of
A Populaion Firing Rae Model of Reverberaory Aciviy in Neuronal Neworks Zofia Koscielniak Carnegie Mellon Universiy Menor: Dr. G. Bard Ermenrou Universiy of Pisburgh Inroducion: The field of mahemaics
More information13.3 Term structure models
13.3 Term srucure models 13.3.1 Expecaions hypohesis model - Simples "model" a) shor rae b) expecaions o ge oher prices Resul: y () = 1 h +1 δ = φ( δ)+ε +1 f () = E (y +1) (1) =δ + φ( δ) f (3) = E (y +)
More informationIntroduction to Mobile Robotics
Inroducion o Mobile Roboics Bayes Filer Kalman Filer Wolfram Burgard Cyrill Sachniss Giorgio Grisei Maren Bennewiz Chrisian Plagemann Bayes Filer Reminder Predicion bel p u bel d Correcion bel η p z bel
More informationUnit Root Time Series. Univariate random walk
Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he
More informationA variational radial basis function approximation for diffusion processes.
A variaional radial basis funcion approximaion for diffusion processes. Michail D. Vreas, Dan Cornford and Yuan Shen {vreasm, d.cornford, y.shen}@ason.ac.uk Ason Universiy, Birmingham, UK hp://www.ncrg.ason.ac.uk
More informationOnline Appendix to Solution Methods for Models with Rare Disasters
Online Appendix o Soluion Mehods for Models wih Rare Disasers Jesús Fernández-Villaverde and Oren Levinal In his Online Appendix, we presen he Euler condiions of he model, we develop he pricing Calvo block,
More informationLecture 3: Exponential Smoothing
NATCOR: Forecasing & Predicive Analyics Lecure 3: Exponenial Smoohing John Boylan Lancaser Cenre for Forecasing Deparmen of Managemen Science Mehods and Models Forecasing Mehod A (numerical) procedure
More informationScheduling of Crude Oil Movements at Refinery Front-end
Scheduling of Crude Oil Movemens a Refinery Fron-end Ramkumar Karuppiah and Ignacio Grossmann Carnegie Mellon Universiy ExxonMobil Case Sudy: Dr. Kevin Furman Enerprise-wide Opimizaion Projec March 15,
More informationRANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY
ECO 504 Spring 2006 Chris Sims RANDOM LAGRANGE MULTIPLIERS AND TRANSVERSALITY 1. INTRODUCTION Lagrange muliplier mehods are sandard fare in elemenary calculus courses, and hey play a cenral role in economic
More informationCSE/NB 528 Lecture 14: From Supervised to Reinforcement Learning (Chapter 9) R. Rao, 528: Lecture 14
CSE/NB 58 Lecure 14: From Supervised o Reinforcemen Learning Chaper 9 1 Recall from las ime: Sigmoid Neworks Oupu v T g w u g wiui w Inpu nodes u = u 1 u u 3 T i Sigmoid oupu funcion: 1 g a 1 a e 1 ga
More information5.1 - Logarithms and Their Properties
Chaper 5 Logarihmic Funcions 5.1 - Logarihms and Their Properies Suppose ha a populaion grows according o he formula P 10, where P is he colony size a ime, in hours. When will he populaion be 2500? We
More informationChapter 7: Solving Trig Equations
Haberman MTH Secion I: The Trigonomeric Funcions Chaper 7: Solving Trig Equaions Le s sar by solving a couple of equaions ha involve he sine funcion EXAMPLE a: Solve he equaion sin( ) The inverse funcions
More informationm = 41 members n = 27 (nonfounders), f = 14 (founders) 8 markers from chromosome 19
Sequenial Imporance Sampling (SIS) AKA Paricle Filering, Sequenial Impuaion (Kong, Liu, Wong, 994) For many problems, sampling direcly from he arge disribuion is difficul or impossible. One reason possible
More informationDistribution of Estimates
Disribuion of Esimaes From Economerics (40) Linear Regression Model Assume (y,x ) is iid and E(x e )0 Esimaion Consisency y α + βx + he esimaes approach he rue values as he sample size increases Esimaion
More informationLecture 9: September 25
0-725: Opimizaion Fall 202 Lecure 9: Sepember 25 Lecurer: Geoff Gordon/Ryan Tibshirani Scribes: Xuezhi Wang, Subhodeep Moira, Abhimanu Kumar Noe: LaTeX emplae couresy of UC Berkeley EECS dep. Disclaimer:
More informationBook Corrections for Optimal Estimation of Dynamic Systems, 2 nd Edition
Boo Correcions for Opimal Esimaion of Dynamic Sysems, nd Ediion John L. Crassidis and John L. Junins November 17, 017 Chaper 1 This documen provides correcions for he boo: Crassidis, J.L., and Junins,
More informationL07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS. NA568 Mobile Robotics: Methods & Algorithms
L07. KALMAN FILTERING FOR NON-LINEAR SYSTEMS NA568 Mobile Roboics: Mehods & Algorihms Today s Topic Quick review on (Linear) Kalman Filer Kalman Filering for Non-Linear Sysems Exended Kalman Filer (EKF)
More informationNon-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important
on-parameric echniques Insance Based Learning AKA: neares neighbor mehods, non-parameric, lazy, memorybased, or case-based learning Copyrigh 2005 by David Helmbold 1 Do no fi a model (as do LDA, logisic
More informationEconomics 8105 Macroeconomic Theory Recitation 6
Economics 8105 Macroeconomic Theory Reciaion 6 Conor Ryan Ocober 11h, 2016 Ouline: Opimal Taxaion wih Governmen Invesmen 1 Governmen Expendiure in Producion In hese noes we will examine a model in which
More informationLecture Notes 2. The Hilbert Space Approach to Time Series
Time Series Seven N. Durlauf Universiy of Wisconsin. Basic ideas Lecure Noes. The Hilber Space Approach o Time Series The Hilber space framework provides a very powerful language for discussing he relaionship
More informationIntroduction D P. r = constant discount rate, g = Gordon Model (1962): constant dividend growth rate.
Inroducion Gordon Model (1962): D P = r g r = consan discoun rae, g = consan dividend growh rae. If raional expecaions of fuure discoun raes and dividend growh vary over ime, so should he D/P raio. Since
More informationChapter Floating Point Representation
Chaper 01.05 Floaing Poin Represenaion Afer reading his chaper, you should be able o: 1. conver a base- number o a binary floaing poin represenaion,. conver a binary floaing poin number o is equivalen
More informationNon-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important
on-parameric echniques Insance Based Learning AKA: neares neighbor mehods, non-parameric, lazy, memorybased, or case-based learning Copyrigh 2005 by David Helmbold 1 Do no fi a model (as do LTU, decision
More informationZürich. ETH Master Course: L Autonomous Mobile Robots Localization II
Roland Siegwar Margaria Chli Paul Furgale Marco Huer Marin Rufli Davide Scaramuzza ETH Maser Course: 151-0854-00L Auonomous Mobile Robos Localizaion II ACT and SEE For all do, (predicion updae / ACT),
More informationMorning Time: 1 hour 30 minutes Additional materials (enclosed):
ADVANCED GCE 78/0 MATHEMATICS (MEI) Differenial Equaions THURSDAY JANUARY 008 Morning Time: hour 30 minues Addiional maerials (enclosed): None Addiional maerials (required): Answer Bookle (8 pages) Graph
More informationNotes on online convex optimization
Noes on online convex opimizaion Karl Sraos Online convex opimizaion (OCO) is a principled framework for online learning: OnlineConvexOpimizaion Inpu: convex se S, number of seps T For =, 2,..., T : Selec
More informationIsolated-word speech recognition using hidden Markov models
Isolaed-word speech recogniion using hidden Markov models Håkon Sandsmark December 18, 21 1 Inroducion Speech recogniion is a challenging problem on which much work has been done he las decades. Some of
More information