Fast Universal Background Model (UBM) Training on GPUs using Compute Unified Device Architecture (CUDA)


International Journal of Electrical & Computer Sciences IJECS-IJENS Vol: No: 04

M. Azhar, C. Ergün
Computer Engineering Department, Eastern Mediterranean University

M. A. is with the Computer Engineering Department, Eastern Mediterranean University, Famagusta, Mersin 10 Turkey (mohammad.azhar@emu.edu.tr).
C. E. is with the Computer Engineering Department, Eastern Mediterranean University, Famagusta, Mersin 10 Turkey (cem.ergun@emu.edu.tr).

Abstract: Universal Background Modeling (UBM) is an alternative hypothesized modeling that is used extensively in Speaker Verification (SV) systems. Training the background models from large speech data requires a significant amount of memory and computational load. In this paper a parallel implementation of a speaker verification system based on Gaussian Mixture Modeling - Universal Background Modeling (GMM-UBM), designed for the many-core architecture of NVIDIA's Graphics Processing Units (GPU) using the CUDA single instruction multiple threads (SIMT) model, is presented. The CUDA implementation of these algorithms is designed in such a way that the speed of computation of the algorithm increases with the number of GPU cores. In this experiment 30 times speedup for k-means clustering and 6 times speedup for Expectation Maximization (EM) was achieved for an input of about 350K frames of 6 dimensions and 1024 mixtures on a GeForce GTX 570 (NVIDIA Fermi Series) with 480 cores when compared to a single threaded implementation on the traditional CPU.

Index Terms: Compute Unified Device Architecture (CUDA), Expectation Maximization (EM), Speaker Verification, Universal Background Model (UBM)

I. INTRODUCTION

UNIVERSAL BACKGROUND MODELING is an approach to approximate the imposter model in speaker verification systems. In this approach, pooled training data from a large number of speakers is utilized to construct a large mixture model. The likelihood of the background speakers is used to form the reference. The UBM model parameters are also trained by estimating the parameters λ of the GMM that matches the distribution of the training feature vectors using training speech from a speaker. The most popular, well-established and robust method is Maximum Likelihood (ML), which iteratively refines the GMM parameters. One important property of ML estimation is that, for a large enough training set, the model estimate converges to the true model parameters as the data length grows. Since there is no closed form for computing the ML estimate, an iterative technique known as Expectation Maximization is used [1].

Optimization of SV systems can help to decrease development time. It should be kept in mind that mixture based modeling techniques can be easily optimized using a parallel implementation. The existence of a massively parallel computing technology such as CUDA has changed the face of computing in the past years. Currently having some of the most efficient ratings on the Green500 list [2] indicates that today's GPUs are more suited for scientific computing with less power consumption.

In this paper, research on parallelizing the UBM parameter estimation of SV is performed. In general, a main SV system includes front end processing (noise cancellation, feature extraction and selection), background model generation, adaptation and testing phases. In this paper the focus will be on background model parameter generation for Mel-Frequency Cepstrum Coefficients (MFCC) data and the parallelization of units such as MFCC, k-means and the Expectation Maximization algorithm. The contribution is on two different architectures of NVIDIA GPUs, namely GTX 285 (GT200 Architecture) and GTX 570 (Fermi Architecture). The speedup results of the latter will be compared with the traditional CPU architecture, both in single threaded and multi-core implementations. It will be observed how the use of even a naïve GPU implementation can result in faster calculations and how different hardware architectures can affect the speed. It will also be explained how the techniques in the GPU software implementation directly target optimality of the performance.

GPUs have a very high floating point calculation throughput (+ TFlop/s peak performance as opposed to a Core i7 CPU with a theoretical peak performance of 09 GFlop/s) [3]. In this paper, the nature of the problem was introduced with double precision floating points. The mentioned GPUs theoretically perform double precision operations at 75 GFlops and 86 GFlops for GTX 285 and GTX 570 respectively.
Considering the size of the problem (about 350K samples of 6 dimensions), the solution is not memory-bound and comes down to the computation limit. For the time being, shared memory approaches were not pursued, as they require a more programmatically complex implementation and suffer from issues such as memory coalescing [4].

The rest of the paper is organized in the following manner. In Section II, the MFCC, k-means and EM algorithms used to generate the UBM parameters are described. In Section III, the architecture of CUDA Fermi will be presented. In Section IV the design of the latter algorithms is discussed both for CPU Parallel and CUDA. The experimentation and results of the different architecture models and the analysis of performance boosts are presented in Section V. Section VI concludes the paper and presents some future work.

IJECS-IJENS August 2011 IJENS

II. FRONT-END MODULE AND UBM GENERATION

A. Mel-Frequency Cepstrum Coefficients

Mel-frequency Cepstrum coefficients are well known features used to describe the speech signal. They are based on the evidence that the information carried by low-frequency components of the speech signal is phonetically more important for humans than the information carried by high frequency components [5]. The Mel scale reflects this by using a nonlinear warping of the frequency scale, i.e., by reducing the frequency resolution as the frequency increases. The process of extracting MFCC from continuous speech is illustrated in Fig. 1.

Fig. 1. Block diagram of the MFCC computation process [5] (blocks: windowed speech segment, DFT, magnitude spectra of band filters over f (kHz), energy, DCT, MFCC feature vector)

In order to calculate the MFCC of a speech segment, the first block of the figure obtains the Discrete Fourier Transform (DFT) of the segment. At this step, the length of the sequence is increased from N to 2N by inserting zeros to improve the resolution in the frequency domain (zero padding). Then the segment is transformed into the frequency domain as

    S[k] = \sum_{n=0}^{2N-1} s[n] \, e^{-j 2\pi k n / 2N}, \quad k = 1, 2, \ldots, N    (1)

where s[n] is the windowed speech segment of length N. The power spectrum is computed by taking the absolute value of the DFT. This is represented as the second block in Fig. 1. Then the power spectrum is passed through each Mel filter of the filter bank to calculate the output power of each filter. The final processing step is to apply the Discrete Cosine Transform (DCT) to the logarithm of the filter-bank coefficients, yielding the MFCC parameters as

    C_m = \frac{1}{N_f} \sum_{j=1}^{N_f} \log(e_j) \cos\left[ m \left( j - \frac{1}{2} \right) \frac{\pi}{N_f} \right], \quad m = 1, 2, \ldots, D    (2)

where e_j is the output power of the j-th filter, N_f is the number of filters and D is the number of MFCC parameters calculated for each frame. In (2), it can be seen that C_0 represents the average log energy of the spectrum. Since it is preferred to use a feature set which is not susceptible to varying background noise and loudness of speech, C_0 is omitted, resulting in a D-dimensional MFCC feature vector as

    x = [C_1, C_2, \ldots, C_D]    (3)

B. k-means Clustering

k-means clustering is a method that helps to accelerate convergence [6]. In this method a set of observations (x_1, x_2, ..., x_n), each of which is a vector of d dimensions, is clustered into k sets (k ≤ n) S = {S_1, S_2, ..., S_k} such that the sum of squares within each cluster is minimized:

    \arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2    (4)

where \mu_i is the mean of the points in S_i.

C. Expectation Maximization Algorithm

The Expectation-maximization (EM) algorithm is a method that iteratively estimates the likelihood [7]. It is similar to the k-means clustering algorithm for Gaussian mixtures, since both look for the centers of clusters and refine them iteratively. For a given set of T observation vectors X = {x_1, x_2, ..., x_T}, and assuming the vectors are independent and identically distributed (iid), the best fitting model λ is the one that maximizes

    p(X \mid \lambda) = \prod_{t=1}^{T} p(x_t \mid \lambda)    (5)

This is a non-linear problem; therefore λ cannot be directly calculated. The EM algorithm consists of the following steps:

    1. Choose an initial model \lambda.
    2. Find a new model \bar{\lambda} so that p(X \mid \bar{\lambda}) > p(X \mid \lambda).
    3. Repeat step 2 until the difference p(X \mid \bar{\lambda}) - p(X \mid \lambda) has reached a convergence threshold or the maximum number of iterations has been reached.

Alg. 1. Expectation Maximization Procedure

In each EM iteration, the ML estimates for the means, variances and weights (a priori mixture probability) for a particular speaker model are computed as follows:

Mixture weight:

    \bar{m}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid x_t, \lambda)    (6)

Means:

    \bar{\mu}_i = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda) \, x_t}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)}    (7)

Variances:

    \bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda) \, x_t^2}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)} - \bar{\mu}_i^2    (8)

where \bar{m}_i, \bar{\mu}_i and \bar{\sigma}_i^2 are the updated values of m_i, \mu_i and \sigma_i^2 respectively; \sigma_i^2, \mu_i and x_t refer to arbitrary elements of the vectors \vec{\sigma}_i^2, \vec{\mu}_i and \vec{x}_t respectively; and x_t^2 is shorthand for diag(x_t x_t^T). The posterior probability for the i-th acoustic class is given by

    p(i \mid x_t, \lambda) = \frac{m_i \, b_i(x_t)}{\sum_{k=1}^{M} m_k \, b_k(x_t)}    (9)

One of the parameters that should be selected before training a Gaussian mixture speaker model is the order M of the mixture density. The model parameters must also be initialized prior to the EM algorithm. These selections are experimentally determined for a given task. In this experiment, the order of the mixture density (M) is 1024. The EM algorithm is guaranteed to find a locally optimal maximum likelihood model regardless of the starting point. However, the likelihood may have several local maxima, and different starting models can lead to different local maxima [8]. Alternatively, the k-means algorithm can be used to cluster the training data into M acoustic classes [9]. The M initial estimates for the mixture means and the covariance matrices can then be selected as the mean and the covariance of each cluster.

III. COMPUTE UNIFIED DEVICE ARCHITECTURE

CUDA [10] is the parallel computing architecture developed by NVIDIA and is the computing engine in NVIDIA graphical processing units (GPUs). This architecture is available through different programming languages and supports other application programming interfaces, such as CUDA FORTRAN, OpenCL, and DirectCompute. CUDA helps to solve many complex computational problems in a more efficient way than on a CPU.
The CUDA parallel programming model is designed to overcome the challenge of developing application software that transparently scales its parallelism while maintaining a low learning curve for programmers familiar with standard programming languages such as C. CUDA has several advantages over traditional general purpose computation on GPUs [11]. One of them is that the GPU code can access different addresses in memory. Another advantage is the availability of shared memory, which is a locally accessible memory that is shared between a group of threads. This helps to reduce global memory accesses and, as a result, provides a higher bandwidth.

However, there are some limitations. One is that data transfer between the device and the host memory is slower than within-device memory transfers. Threads are activated in groups of 32, with thousands of them running in total, although [12] discusses that this may not always be the case, as in [13] and in the current implementation, in which a small number of threads per block were used to get better results. Another limitation is that CUDA technology is only available for NVIDIA GPUs. It only supports the round-to-nearest mode of the IEEE 754 [14] standard for double precision calculations.

On the CUDA architecture, problems are divided into sub-problems and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block. Each block of threads can be scheduled on any of the available processor cores, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of processor cores, as illustrated by Fig. 2.

Fig. 2. Automatic Scalability [10]

Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks, as illustrated by Fig. 3. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.

Fig. 3. Grid of Thread Blocks [10]

Each running block is divided into Single Instruction Multiple Threads groups called warps. The sizes of these warps are equal. The running warps are scheduled in a time-sliced manner. The thread scheduler switches between warps to maximize the use of the multiprocessors' computational resources.

IV. IMPLEMENTATION AND DESIGN

A. MFCC

Due to the nature of speech data, it is very normal to have unvoiced sound segments. Here, Voice Activity Detection (VAD) is used to discard silence periods [5]. The other front-end operations, like segmentation and windowing, are also applied to each frame in order to produce the corresponding feature set. The complexity of MFCC extraction is mainly dominated by the Discrete Fourier Transform (DFT) of each window segment, which is O(N^2) where N is the window size. In this study only the DFT step of MFCC is parallelized, as shown in Fig. 1. The CPU DFT pseudo-code for a window size of 40 is illustrated in Alg. 2. In order to balance the load, the execution of the loop at line 5 of Alg. 2 was split across the cores, indexed by I. Each core is responsible for executing the operations for a range of window size frames. In the current case an Intel Core i7-920 [3] CPU was used to perform the parallel task on 4 cores. Note that the cosine and sine values are pre-computed to decrease the redundancy of the calculations.

    1. Cores = 4;
    2. N = 40; // Window size
    3. RangeSize = N / Cores;
    4. For core I (concurrently)
    5.   For k from [RangeSize * I] to [RangeSize * (I + 1)]
    6.     Initialize Real and Imaginary variables
    7.     For n from 0 to 2 * N
    8.       Calculate Real & Imaginary
    9.     Calculate DFT

Alg. 2. CPU DFT Parallel

In this algorithm, the time complexity is O(N^2). With CPU parallelism, the complexity is only divided by a number much smaller than N; therefore the complexity remains the same. The CUDA DFT implementation also follows the CPU model, with the exception that each iteration of loop k is run on a separate thread. In this case the complexity is reduced to O(N). There are also memory transfers between the host and the CUDA device, as shown in Alg. 3.

    1. Allocate & transfer sine and cosine tables from host to device memory.
    2. Allocate & transfer current window frames from host to device memory.
    3. Allocate space for DFT output on device memory.
    4. Perform DFT on CUDA with grid size of 6 and block size of 6.
    5. Copy the output from device to host memory.

Alg. 3. CUDA DFT Parallel

B. k-means Clustering

The k-means clustering algorithm pseudo-code in a nutshell is shown in Alg. 4. Since the initial selection of the means is random, mean vectors of size M mixtures by D dimensions were chosen from the feature vectors.

    1. Random initialization of mean vectors.
    2. T = number of frames;
    3. M = order of mixtures
    4. D = number of dimensions;
    5. For each frame T
    6.   For each mixture M
    7.     For each feature D
    8.       Calculate minimum distance
    9. Calculate average of selected feature vectors belonging to the same centroid.

Alg. 4. k-means Pseudo-Code

In a parallel implementation it is advised to give more work load to each processor to maximize utilization. In the case of k-means clustering, the major parallelizable loop is loop T, which includes the inner loops M and D. The computational complexity of k-means clustering is O(n^{dk+1} \log n). Since k and d are fixed, the problem comes down to the number n of entities to be clustered [5]. In both the CPU and GPU approaches the outermost loop was parallelized, which in this case is T, given the fact that the distances can be calculated concurrently. For the CUDA device implementation the calculations were performed for each feature vector simultaneously. As a result the complexity will be O(n'^{dk+1} \log n') where n' \ll n. Additionally, there are memory transfers from and to the device. The order of CUDA actions for the k-means algorithm is shown below:

    1. Random initialization of mean vectors.
    2. T = number of frames;
    3. M = order of mixtures
    4. D = number of dimensions;
    5. Transfer initialized random mean vectors to device.
    6. Transfer feature vectors to device.
    7. Allocate label memory on device.
    8. Allocate mean average memory on device.
    9. Allocate distance memory on device.
    10. Perform k-means on CUDA device with grid size of T/0+ and block size of 0.
    11. Transfer label memory back to host.
    12. Transfer mean-average memory back to host.
    13. Calculate average of selected feature vectors belonging to the same centroid.

Alg. 5. k-means CUDA

Note that all 2-dimensional jagged arrays are first converted into 1-dimensional arrays for device memory allocation.

C. Expectation Maximization

As explained in Section II.C, the first step of expectation maximization is to initialize the parameters: mixture weights (6), means (7) and variances (8). The computational complexity for the background speaker model is directly extracted from Alg. 6 as O(NMD), where N is the number of feature vectors, each having dimension D, and each speaker having M mixtures. In order to prevent redundant calculations, the common part of those parameters is calculated first, which is

    \sum_{t=1}^{T} p(i \mid x_t, \lambda), \quad i = 1, \ldots, M    (10)

where M is the number of mixtures. Following (9), the denominator of (9) is calculated so that it can later be applied to the posterior probability. The calculation then continues with the posterior probability, the sum of means and the sum of variances. Then (6), (7) and (8) are applied to calculate the updated values \bar{m}_i, \bar{\mu}_i and \bar{\sigma}_i^2. This concludes one iteration of the EM algorithm. In this implementation the iterations are repeated 50 times. The pseudo-code for the EM algorithm is illustrated in Alg. 6.

    1. Initialize mean, variance and weight values.
    2. T = Number of feature vectors.
    3. M = Order of mixtures
    4. D = Number of dimensions in a feature vector
    5. For iteration between 0 and 50 {
    6.   For k between 0 and M
    7.     Calculate determinant of sigma[k]
    8.   For t between 0 and T
    9.     For k between 0 and M
    10.      For l between 0 and D
    11.        prepare sum.
    12.      calculate sum_p_denominator
    13.  For k between 0 and M
    14.    For l between 0 and D
    15.      Initialize Sum_mean and Sum_variance
    16.    For t between 0 and T
    17.      Initialize Sum
    18.      For l between 0 and D
    19.        Calculate Sum
    20.      Calculate Sum_p
    21.      For l between 0 and D
    22.        Calculate Sum_mean
    23.        Calculate Sum_variance
    24.    For l between 0 and D
    25.      Calculate Mean and Variance
    26.    Calculate weight of mixture k
        }

Alg. 6. Expectation Maximization Pseudo-Code for CPU

The notable points of parallelization in Alg. 6 are where loops of size T exist. Since the convergence iterations are dependent and have to be executed sequentially, the most dominant parallelizable loop is T. In the case of CPU parallelism, a method similar to the one in Alg. 2 is applied to distribute the load. In a nutshell the following tasks are performed in CPU Parallel mode:

    1. Initialize mean, variance and weight values.
    2. T = Number of feature vectors. (around 350K)
    3. M = Order of mixtures (1024)
    4. D = Number of dimensions in a feature vector (6)
    5. For iteration between 0 and 50 {
    6.   Parallel For k between 0 and M
    7.     Calculate determinant sigma[k];
    8.   Parallel For t between 0 and T
    9.     Calculate sum_p_denominator;
         // Updating mean, weight and variance
    10.  For k between 0 and M
    11.    For l between 0 and D
    12.      sum_mean, sum_var initialization
    13.    Parallel For t between 0 and T
    14.      Calculate sum_mean and sum_var
    15.    For l between 0 and D
    16.      Update mean, variance
    17.    Update weight
        }

Alg. 7. EM Pseudo-Code for Parallel CPU

The D loops are left sequential because, in small loops, the cost of creating multiple threads is higher than the calculation cost in a single thread. On the other hand, the K loop is not parallelized because the inner loop T is big enough to cover the cost of thread activation.

In CUDA Parallel mode the procedures were sliced into smaller groups to handle the number of threads better. Since the CUDA code is SIMT, all threads run identical code, only on different parts of the memory. For example, in the case of initializing the sum of means and the sum of variances only D threads are needed, while for updating them T threads are activated. This calls for task separation. It can be understood from Alg. 7 that for the EM algorithm more memory transfers and allocations are involved. It is also known that the CUDA code will run on the device as kernels. Kernels are C extension functions of CUDA technology. They can be launched from the host code to run on many CUDA threads. The importance of grid size and block size will be discussed in Section V. The following is the EM algorithm designed to run on a CUDA device:

    1. Initialize mean, variance and weight values.
    2. T = Number of feature vectors. (around 350K)
    3. M = Order of mixtures (1024)
    4. D = Number of dimensions in a feature vector (6)
    5. Allocate memory on device and transfer the initialized mean, variance and weight.
    6. Transfer feature vectors to device memory.
    7. For iteration between 0 and 50 {
    8.   Calculate Determinant of Sigma on blocks of size M/;
    9.   Calculate Sum p denominator on T/0+ blocks of size 0;
    10.  For k between 0 and M
    11.    Initialize SumMean and SumVariance on block of size D;
    12.    Calculate SumMean and SumVariance on T/3+ blocks of size 3;
    13.    Update Mean & Variance on block of size D;
    14.    Update Weight on block of size ;
        }
    15. Transfer mean, weight and variance from device to host memory

Alg. 8. EM Pseudo-Code for Parallel CUDA

The complexity of this EM algorithm is reduced to O(MD).

V. EXPERIMENTS

The parallel algorithms were implemented using CUDA version 3.x. The experiments were performed on a PC with GTX 285 and GTX 570 GPUs and an Intel Core i7-920 CPU. The CPU has 4 cores (8 hyper-threads) running at 2.66 GHz. The main memory is 3 GB (DDR3-1600) with a peak bandwidth of 12.8 GB/sec. The specifications of the GPUs used can be seen in Table I.
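The per-frame kernels in Alg. 5 and Alg. 8 appear to be launched on a grid of T/B + 1 blocks of B threads each. The following minimal Python sketch emulates that indexing on the CPU to show why each thread needs a bounds check. The function names and the block size B = 32 are hypothetical and chosen only for illustration; they are not taken from the paper's code.

```python
def launch_config(num_items, block_size):
    """Grid size for the 'T / B + 1 blocks of B threads' pattern (integer division)."""
    return num_items // block_size + 1, block_size


def simulate_launch(num_items, block_size):
    """Emulate one thread per item; return (items handled, threads left idle)."""
    grid_size, block_size = launch_config(num_items, block_size)
    handled = 0
    idle = 0
    for block_idx in range(grid_size):
        for thread_idx in range(block_size):
            t = block_idx * block_size + thread_idx  # global thread index
            if t < num_items:   # bounds check: only real frames are processed
                handled += 1
            else:               # surplus threads in the last block do nothing
                idle += 1
    return handled, idle


T = 350_000   # roughly the number of feature vectors in the experiments
B = 32        # hypothetical block size, an assumption made for this sketch

print(launch_config(T, B))     # (10938, 32)
print(simulate_launch(T, B))   # (350000, 16)
print(simulate_launch(64, B))  # (64, 32): a whole extra block idles when B divides T
```

Under these assumptions the T/B + 1 rule always covers every frame, at the cost of one fully idle block whenever B divides T exactly; a ceiling division such as (T + B - 1) // B would avoid that extra block.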
TABLE I
SPECIFICATIONS OF NVIDIA GPUS

    GPU        Cores   DRAM   Processor Clock   Memory Bandwidth
    GTX 285    240     GB     1.47 GHz          159 GB/s
    GTX 570    480     GB     1.46 GHz          152 GB/s

The results are presented with the file read/write operations omitted, in order to concentrate on the speedup of the algorithms. The experiments were performed on a data set of around 350,000 feature vectors belonging to 9 female speakers. Table II shows the actual times taken to perform the MFCC and k-means algorithms in seconds.

TABLE II
TIME OF MFCC AND K-MEANS IN SECONDS

    Algorithm   CPU   CPU Parallel   CUDA (GTX 285)   CUDA (GTX 570)
    MFCC
    k-means

It can be observed that the experimented MFCC algorithm has poor performance compared to the traditional CPU algorithm. The reasons for this are the following:

- The only parallelized function within MFCC was the discrete Fourier transform algorithm. Although the complexity is reduced in theory, given the size of the problem for a single DFT run, it is not feasible to distribute the sub-problem over too many threads. For this experiment the size of the outer loop was 40 and hence 40 threads were activated, but because of the low occupancy level the time needed to read from and write to device global memory affected the performance.
- The other reason is the low throughput of memory transfer between host and device. The DFT algorithm is called almost 350,000 times, and each time it copies the window frames from host to device and copies the results of the DFT back to host memory.

In the k-means algorithm a considerable performance improvement can be seen when data-parallelism is compared to the single threaded CPU. However, the time taken on both GPU architectures is the same. One reason for this can be the lack of use of shared memory to benefit from architecture specific advantages. This may also indicate that the current algorithm suffers from memory partition camping [16]. Global memory accesses go through partitions. Successive 256-byte regions of global memory are assigned to successive partitions. Partition camping occurs when the global memory accesses at a given instant use only a subset of the partitions. This may hide the true potential of speedup scalability across cores. For optimal performance GPU accesses should be distributed evenly among partitions.
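The partition mapping described above can be illustrated with a small Python sketch. It only models the address arithmetic: the 256-byte partition width comes from the text, while the number of partitions (8) and the helper names are assumptions made for illustration, since the real partition count depends on the GPU model.

```python
from collections import Counter

PARTITION_WIDTH = 256  # bytes per partition stripe, as stated in the text
NUM_PARTITIONS = 8     # assumed for illustration; the real count is hardware-specific


def partition_of(byte_address):
    """Map a global-memory byte address to the partition that serves it."""
    return (byte_address // PARTITION_WIDTH) % NUM_PARTITIONS


def partition_load(addresses):
    """Count how many simultaneous accesses land on each partition."""
    counts = Counter(partition_of(a) for a in addresses)
    return [counts.get(p, 0) for p in range(NUM_PARTITIONS)]


def concurrent_addresses(num_blocks, stride_bytes):
    """Addresses touched at one instant when block b starts at b * stride_bytes."""
    return [b * stride_bytes for b in range(num_blocks)]


# Camping: a stride of one full sweep over all partitions sends every block to partition 0.
camped = partition_load(concurrent_addresses(16, PARTITION_WIDTH * NUM_PARTITIONS))
# Balanced: a stride of one partition width spreads the same 16 accesses evenly.
balanced = partition_load(concurrent_addresses(16, PARTITION_WIDTH))

print(camped)    # [16, 0, 0, 0, 0, 0, 0, 0]
print(balanced)  # [2, 2, 2, 2, 2, 2, 2, 2]
```

In this simplified model, the camped pattern queues all 16 accesses on a single partition while the other seven sit idle, which is the kind of imbalance that could mask the additional cores of the newer GPU.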
The results of the EM algorithm are measured in parts to show the execution times more clearly. Table III shows the EM time in seconds.

TABLE III
TIME OF EXPECTATION MAXIMIZATION IN SECONDS

    Algorithm                                        CPU   CPU Parallel   GTX 285   GTX 570
    SUM_P per EM iteration
    Updating Mean, Var & Weight per EM iteration
    EM total per iteration

Again, faster calculations on more cores, and further on the many cores of GPUs, can be seen. Table IV shows the speedup of all algorithms on the different systems. The speedups are calculated using T_s / T_p, where T_s is the single threaded CPU time and T_p is the target parallel model time. Additionally, Fig. 4 shows the graph corresponding to Table IV.

TABLE IV
SPEEDUP OF MFCC, K-MEANS AND EM ALGORITHMS

    Stage     CPU   CPU Parallel   GTX 285   GTX 570
    MFCC
    k-means
    EM

It is observed that an increase in the number of cores decreases the calculation time. However, the technique of choosing the correct grid and block sizes should be taken into consideration. Too few threads in a block may work against the occupancy factor of the active blocks [10]. Too many threads in a block may prevent the use of shared memory, since there is a limit of 16 KB on GTX 285 and 48 KB on GTX 570 models per block. In this study, a naïve GPU code that uses the global memory was provided. The number of threads used per CUDA block varies from to 3. That is because the current implementations performed better with low block dimension sizes. According to [12], in some cases such as [13] a better performance may be achieved with very low block sizes.

Fig. 4. Speedup scalability

VI. CONCLUSION

In this paper, CUDA based implementations of the DFT, k-means and EM algorithms were introduced. The CPU parallelism of the mentioned algorithms was also presented for better comparison. The results demonstrate the advantage of parallelization, specifically using GPUs to speed up the calculation of k-means up to 30 times and EM up to 7.9 times compared to a single threaded CPU. In the future, the effects of using shared memory on the performance will be studied. Additionally, running the algorithms concurrently on more than one GPU will be presented to show how GPUs can benefit from cross device memory access [10]. The advantage of concurrent kernels and multiple streams will also be investigated. Furthermore, a comparison between the performance of single-precision and double-precision floating point calculations will be made.

REFERENCES

[1] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, vol. 39, pp. 1-38, 1977.
[2] Green500. (2011, June) Green500. [Online]. http://
[3] Intel. (2011, June) Previous Generation Intel Core i7 Processor. [Online]. http:// x.htm
[4] N. Kumar, S. Satoor, and I. Buck, "Fast Parallel Expectation Maximization for Gaussian Mixture Models on GPUs Using CUDA," in High Performance Computing and Communications, 11th IEEE International Conference on, 2009, pp.
[5] R. W. Schafer and L. R. Rabiner, "Digital Representations of Speech Signals," IEEE Proceedings, vol. 63, pp. , April 1975.
[6] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification, 2nd ed.: John Wiley and Sons, 2001.
[7] Wikipedia. (2011, June) Expectation-maximization algorithm. [Online]. http://en.wikipedia.org/wiki/expectation_maximization_algorithm
[8] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," Speech and Audio Processing, IEEE Transactions on, vol. 3, no. 1, p. 72, Jan 1995.
[9] A. M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems, 2nd ed.: Wiley, 2004.
[10] NVIDIA. (2011, March) NVIDIA CUDA Programming Guide - Version 4.0. [Online]. http://developer.nvidia.com/cuda-toolkit-40
[11] Wikipedia. (2011, June) CUDA. [Online]. http://en.wikipedia.org/wiki/cuda
[12] V. Volkov. (2011, June) Better Performance at Lower Occupancy, GTC On-Demand: GTC 2010. [Online]. http://
[13] Anas Mohd Nazlee, Noohul Basheer Zain Ali, and Fawnizu Azmadi Hussin, "Transformation of CPU-based Applications to Leverage on Graphics Processors using CUDA," IJECS: International Journal of Electrical & Computer Sciences, vol. 10, no. , pp. , February 2010.
[14] "IEEE Standard for Floating-Point Arithmetic," IEEE Std 754-2008, pp. 1-58, August 2008.
[15] C. Ergün, "Speech Codec Detector Based Speaker Verification System in a Multi-Coder Environment," doctoral dissertation, Computer Engineering Department, Eastern Mediterranean University, Famagusta, Mersin 10 Turkey, 2004.
[16] NVIDIA. (2009, March) Optimizing CUDA, Australian National University: CUDA Tutorial. [Online]. http://cs.anu.edu.au/files/systems/gpuwksp/pdfs/04_optimizingcudA_full.pdf
