Accelerating Proton Computed Tomography with GPUs

1 Accelerating Proton Computed Tomography with GPUs
Thomas D. Uram, Argonne Leadership Computing Facility
Michael E. Papka, Argonne Leadership Computing Facility, Northern Illinois University
Nicholas T. Karonis, Northern Illinois University, Argonne National Laboratory

2 Overview
Proton computed tomography (pCT) is an alternative to X-ray based CAT scans that promises several medical benefits at the cost of being significantly more computationally expensive.
We designed a 60-node GPU cluster to meet the computational challenge.
- Computed tomography
- Benefits of proton computed tomography
- Computational problem description
- CPU/GPU performance comparison

3 What is Computed Tomography?
- CAT (or CT) scans are well known; CAT == "computerized axial tomography"
- CAT scans are used to reconstruct the density distribution within a volume, typically in medical imaging
- CAT scans are conducted with photons (X-rays)
What is Proton Computed Tomography?
- A reconstruction technique similar to X-ray computed tomography, conducted with protons instead of photons

4 Why Proton Computed Tomography?
- 13 million people are diagnosed with cancer each year worldwide
- 2.6 million of them are candidates for proton therapy treatment
- Proton therapy involves depositing protons at precise locations within a tumor site, where they irradiate the target tissue
  - The protons emit lower radiation as they travel through the body until they reach the target, where they emit a burst of radiation (the Bragg peak)
  - Healthy tissue beyond the tumor site receives nominally no radiation
- It is crucially important to precisely identify the tumor site
  - To ensure that cancerous tissue is destroyed
  - To avoid damaging healthy tissue surrounding the tumor, especially in sensitive areas
- Proton therapy treatment planning is currently performed using X-ray imaging
  - Photons and protons interact with intermediate material differently
  - Conversion between photon/proton modalities involves a systematic range error of 3-5%
Image source: Wikipedia

5 Proton computed tomography
- Our goal is to reconstruct the volume of an adult human head in under 10 minutes
- Protons are directed through two frontal planes, the target volume, two backing planes, and finally a calorimeter
- The system measures the position and angle of incidence of protons at entry and exit, and the energy loss
[Figure: detector layout]
- Final System (in black): 4 tracking planes with XY Si detectors; calorimeter with 64 end-on CsI crystals
- Planned Scaled Prototype (in red): 4 planes of XY Si detectors (2 X-SSDs and 2 Y-SSDs per plane); 8 CsI crystal bars
- Calorimeter: each bar corresponds to a 5cm x 5cm CsI crystal, read out by a photodiode
- Tracking Plane: each large square corresponds to one double-sided or two single-sided 9cm x 9cm SSDs

6 Problem Description
- Proton source, detector planes, and calorimeter are mounted on a rotating gantry, as in familiar X-ray CT configurations
- Data is collected over a full rotation of the gantry: 180 samples (every 2 degrees)
- The initial detector is designed to image a human head (nominally a 25cm cube)
- Based on the physics, and so that each voxel is sufficiently represented in the resulting system matrix, we estimate needing a volume of 256x256x36 (2,359,296, about 2.4M) voxels and 2 billion protons in total
- For each proton, we track 11 values:
  - [x,y,z] at entry
  - [x,y,z] at exit
  - angle at entry and exit
  - input and output energy
  - gantry rotation angle
[Figure: detector layout, as on slide 5]
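The problem size on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the field names in `ProtonRecord` are hypothetical, and the 4-byte width per value is an assumption (the slide does not state a storage format).

```python
from dataclasses import dataclass, fields


@dataclass
class ProtonRecord:
    """One tracked proton; field names are hypothetical, chosen for illustration."""
    x_in: float
    y_in: float
    z_in: float
    x_out: float
    y_out: float
    z_out: float
    angle_in: float
    angle_out: float
    energy_in: float
    energy_out: float
    gantry_angle: float


# The slide lists 11 tracked values per proton.
NUM_VALUES = len(fields(ProtonRecord))

# Reconstruction volume stated on the slide.
NUM_VOXELS = 256 * 256 * 36

# One gantry sample every 2 degrees over a full rotation.
GANTRY_ANGLES = list(range(0, 360, 2))

# Assumption: each value is stored as a 4-byte float.
BYTES_PER_VALUE = 4


def raw_input_bytes(num_protons: int) -> int:
    """Size of the raw per-proton measurements, before any path computation."""
    return num_protons * NUM_VALUES * BYTES_PER_VALUE


if __name__ == "__main__":
    print(NUM_VALUES, NUM_VOXELS, len(GANTRY_ANGLES))
    print(raw_input_bytes(2_000_000_000) / 1e9, "GB of raw proton data")
```

Under the 4-byte assumption, the raw measurements for 2 billion protons come to 88 GB, which is small next to the per-proton voxel lists produced later in the pipeline.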

7 Baseline execution times
- Began with serial code that took more than 7 hours to process 131M protons
- Parallelized with MPI to use multiple CPUs
- Established baseline execution times
Table: 1 billion protons, 60 nodes, CPU only [timing values not recoverable]
Phase | Execution time (seconds)
Setup |
Most Likely Path (MLP) |
Linear solver (CARP) |
Overall execution time |

8 MLP (Most Likely Path)
- In contrast with X-ray computed tomography, in which the particles traverse the volume in straight lines, in pCT the protons are scattered by the material as they travel through the volume
- MLP computes the path integral of the protons through the material based on their known entry and exit locations and angles and the energy loss
- The proton paths are discretized as the voxels touched while traversing the volume
- Path integral calculations are independent and parallelize at the level of protons (but are inherently sequential within each path)
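The discretization step, turning a curved path into the ordered list of voxels it touches, can be sketched as follows. This is not the MLP formula itself: a cubic Hermite curve between the entry and exit points stands in for the most likely path, and the function names, the sampling approach, and the `steps` parameter are all assumptions made for illustration.

```python
import math


def hermite_point(p0, m0, p1, m1, t):
    """Point at parameter t on a cubic Hermite curve from p0 to p1.

    m0 and m1 are tangent vectors at entry and exit (stand-ins for the
    measured entry and exit directions, scaled by the path length).
    """
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return tuple(
        h00 * a + h10 * u + h01 * b + h11 * v
        for a, u, b, v in zip(p0, m0, p1, m1)
    )


def touched_voxels(p0, m0, p1, m1, voxel_size, grid_shape, steps=1000):
    """Walk the curve from entry to exit and record each voxel entered.

    The walk is sequential within one path, but separate protons share no
    state, so each path could be handled by its own worker or GPU thread.
    """
    voxels = []
    for i in range(steps + 1):
        point = hermite_point(p0, m0, p1, m1, i / steps)
        index = tuple(math.floor(c / voxel_size) for c in point)
        inside = all(0 <= k < n for k, n in zip(index, grid_shape))
        # Skip consecutive repeats so each visit to a voxel is listed once.
        if inside and (not voxels or voxels[-1] != index):
            voxels.append(index)
    return voxels


if __name__ == "__main__":
    path = touched_voxels(
        (0.5, 0.5, 0.5), (3.0, 0.0, 0.0),
        (3.5, 0.5, 0.5), (3.0, 0.0, 0.0),
        voxel_size=1.0, grid_shape=(4, 1, 1),
    )
    print(path)
```

Fixed-step sampling can miss a voxel that the curve only clips at a corner; a production implementation would more likely step from voxel boundary to voxel boundary.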

9 Linear solver (CARP)
- The result of MLP is a system of equations relating each proton's touched voxels to the relative stopping power (roughly, the energy loss)
- We began the project with a CPU implementation of the row-action based sparse iterative solver CARP (component averaged row projections)
- CARP decomposes the matrix into row blocks, one block per processor, and iterates to satisfactory convergence:
  - Performs a Jacobi-like iteration sequentially through the rows to produce a per-block solution vector
  - Averages the per-block solution vectors (in component-wise fashion)
  - Redistributes the solution vector x to all processors
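The three CARP steps listed above can be sketched on a toy dense system. This is a minimal sketch under stated assumptions, not the project's solver: the per-block sweep is assumed to be a Kaczmarz-style row projection, rows are stored densely for readability (the real system is sparse), and the function names and the relaxation parameter `relax` are hypothetical.

```python
def kaczmarz_sweep(rows, rhs, x, relax=1.0):
    """One sequential pass over a block's rows, projecting x onto each equation."""
    x = list(x)
    for row, b in zip(rows, rhs):
        norm_sq = sum(v * v for v in row)
        if norm_sq == 0.0:
            continue
        residual = b - sum(v * xi for v, xi in zip(row, x))
        step = relax * residual / norm_sq
        for j, v in enumerate(row):
            x[j] += step * v
    return x


def carp(blocks, num_unknowns, iterations=500, relax=1.0):
    """Iterate: sweep each row block independently, then average per component.

    blocks is a list of (rows, rhs) pairs, one per (simulated) processor.
    """
    # Columns each block actually touches; only those blocks vote on a component.
    touches = [
        {j for row in rows for j, v in enumerate(row) if v != 0.0}
        for rows, _ in blocks
    ]
    x = [0.0] * num_unknowns
    for _ in range(iterations):
        # Step 1: per-block sweeps (independent, so they could run in parallel).
        local = [kaczmarz_sweep(rows, rhs, x, relax) for rows, rhs in blocks]
        # Step 2: component-wise average over the blocks that touch each unknown.
        merged = []
        for j in range(num_unknowns):
            votes = [sol[j] for sol, cols in zip(local, touches) if j in cols]
            merged.append(sum(votes) / len(votes) if votes else x[j])
        # Step 3: the merged vector is what every block starts from next time.
        x = merged
    return x


if __name__ == "__main__":
    block_a = ([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]], [3.0, 5.0])
    block_b = ([[1.0, 0.0, 1.0], [1.0, -1.0, 0.0]], [4.0, -1.0])
    print(carp([block_a, block_b], 3))
```

The sweep inside a block is sequential by construction, which is the granularity limit the later slides attribute to the GPU port.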

10 Hardware: Gaea GPU cluster at Northern Illinois University
- 60 compute nodes
- Node configuration
  - 2x Intel X5650 CPUs (12 cores per node)
  - 2x NVIDIA M2070 GPUs
  - 72GB RAM
  - QDR InfiniBand

11 Data decomposition
- 2.1B protons / 60 nodes = about 35M protons per node
- 2 GPUs -> 17M protons per GPU
- The maximum voxels per proton is ~364
- 17M protons x 364 voxels x 4 bytes/voxel = 25GB data per GPU
  - Larger than the available M2070 GPU memory of 6GB
- High watermark memory requirement on the cluster is 3TB (aggregate)
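The figures on this slide follow from straightforward arithmetic, reproduced below as a check. The constant names are hypothetical, and the batch count assumes the full 6GB of an M2070 is available for voxel data, which is optimistic.

```python
TOTAL_PROTONS = 2_100_000_000
NODES = 60
GPUS_PER_NODE = 2
MAX_VOXELS_PER_PROTON = 364
BYTES_PER_VOXEL = 4
GPU_MEMORY_BYTES = 6 * 10**9  # M2070 memory, treated as fully usable (assumption)

protons_per_node = TOTAL_PROTONS // NODES
protons_per_gpu = protons_per_node // GPUS_PER_NODE

# Worst case: every proton padded to the maximum voxel count.
bytes_per_gpu = protons_per_gpu * MAX_VOXELS_PER_PROTON * BYTES_PER_VOXEL
aggregate_bytes = TOTAL_PROTONS * MAX_VOXELS_PER_PROTON * BYTES_PER_VOXEL

# Ceiling division: minimum number of batches needed to fit in GPU memory.
min_batches_per_gpu = -(-bytes_per_gpu // GPU_MEMORY_BYTES)

if __name__ == "__main__":
    print(f"{protons_per_node / 1e6:.1f}M protons per node")
    print(f"{protons_per_gpu / 1e6:.1f}M protons per GPU")
    print(f"{bytes_per_gpu / 1e9:.1f} GB per GPU")
    print(f"{aggregate_bytes / 1e12:.2f} TB aggregate")
    print(f"at least {min_batches_per_gpu} batches per GPU")
```

The exact per-GPU count is 17.5M protons, which the slide rounds down to 17M; either way the per-GPU data lands near 25GB and the aggregate near 3TB.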

12 MLP (Most Likely Path) CUDA implementation
- MLP involves calculating the path integral of the protons
- Initial implementation assigns a thread per proton
- Per-GPU proton data is larger than GPU memory on the M2070
  - Stage batches of protons to the GPU
- MLP was ported to the GPU, with multiple variants
  - gpu struct: direct port of CPU-based code using structured proton/voxel data
  - gpu flat memory: flat memory space with per-proton padded voxel arrays
  - gpu flat memory + overlap: streaming computation to overlap compute and host-device transfers
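The "flat memory" layout and the batching can be illustrated on the host side in a few lines. This is a sketch of the idea only, with hypothetical names and a hypothetical padding sentinel; it does not reproduce the CUDA code.

```python
PAD = -1  # hypothetical sentinel marking unused slots in a padded row


def flatten_padded(paths, max_voxels):
    """Pack variable-length voxel lists into one flat, fixed-stride array.

    Proton i occupies slots [i * max_voxels, (i + 1) * max_voxels), so a
    thread can find its data from its index alone, with no pointer chasing.
    """
    flat = []
    for path in paths:
        if len(path) > max_voxels:
            raise ValueError("path longer than the padded row width")
        flat.extend(path)
        flat.extend([PAD] * (max_voxels - len(path)))
    return flat


def voxel_at(flat, max_voxels, proton, step):
    """Voxel id at a given step of a given proton's path (PAD if past the end)."""
    return flat[proton * max_voxels + step]


def batches(num_protons, batch_size):
    """Yield [start, end) proton ranges small enough to stage on the GPU."""
    for start in range(0, num_protons, batch_size):
        yield start, min(start + batch_size, num_protons)


if __name__ == "__main__":
    flat = flatten_padded([[5, 6], [7]], max_voxels=3)
    print(flat)
    print(list(batches(10, 4)))
```

Padding wastes memory on short paths but gives every proton the same stride. In the overlap variant, the transfer of one batch would run while the previous batch is being computed.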

13 MLP (Most Likely Path) CUDA implementation (26M protons, 2 GPUs)
Table: [timing and speedup values not recoverable]
Implementation | Execution time (seconds) | Speedup
cpu | |
gpu_struct | |
gpu_flat_memory | |
gpu_flat_memory + overlap | |

14 Linear solver (CARP) CUDA implementation (26M protons, 2 GPUs)
- CARP ported directly from CPU code
- Per-node row-block data is larger than GPU memory; batch process
- Further subdivide the per-node row-block into row-blocks per streaming multiprocessor
Table: [timing and speedup values not recoverable]
Implementation | Execution time (seconds) | Speedup
cpu | |
gpu | |
Limited speedup in the GPU implementation, because:
- the row-action based solver constrains parallel granularity
- scattered memory accesses constrain performance, as is typical of sparse matrix operations

15 Performance at scale
2 billion protons, 60 nodes, 12 CPU cores/node, 2 GPUs/node
Table: [values other than Setup not recoverable]
Phase | Execution time (seconds)
Setup | 22.3
Most Likely Path (MLP) |
Linear solver (CARP) |
Overall execution time |
Initial goal was to complete in <600s (10 mins)

16 Further work: CARP Hybrid CPU/GPU
- Assign row blocks to CPU and GPU simultaneously
- Weighted work distribution based on initial performance measurements
2 billion protons, 60 nodes, 12 cores/node, 2 GPUs/node
Table: [timing and speedup values not recoverable]
Implementation | Execution time (seconds) | Speedup
cpu | |
gpu | |
hybrid | |
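A weighted work distribution of this kind can be sketched as a proportional split. The function name is hypothetical, and the throughput numbers in the example are placeholders rather than measurements from this work.

```python
def split_rows(num_rows, rates):
    """Divide rows among devices in proportion to their measured throughput.

    rates maps a device name to rows processed per second in a calibration run.
    """
    total = sum(rates.values())
    shares = {name: int(num_rows * rate / total) for name, rate in rates.items()}
    # Rounding down can leave a few rows unassigned; give them to the fastest device.
    leftover = num_rows - sum(shares.values())
    fastest = max(rates, key=rates.get)
    shares[fastest] += leftover
    return shares


if __name__ == "__main__":
    # Placeholder rates for illustration only.
    print(split_rows(1000, {"cpu": 1.0, "gpu": 3.0}))
```

With a split like this, the devices should finish their sweeps at about the same time, so neither sits idle waiting for the averaging step.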

17 Future work
- Integrate alternative linear solvers to improve performance (AmgX, cuSPARSE, PETSc)
- Consider alternate data decompositions to improve cache locality
  - volume slab per streaming multiprocessor
  - volume wedge per streaming multiprocessor
- Measure performance on next-generation GPUs
  - K80 for greater performance
  - Jetson/TK1 for greater performance/watt
- Experiment with GPU cloud platforms (Amazon cloud)

18 Acknowledgements
- Nicholas T. Karonis, Northern Illinois University (NIU) and Argonne National Laboratory (ANL)
- Michael E. Papka, NIU and ANL
- Caesar Ordoñez, NIU
- Eric Olson, ANL
- Kirk Duffin, NIU
- Venkat Vishwanath, ANL
US Department of Defense contract number W81XWH-10-1-0170 sponsored this work.
