Building a Multi-FPGA Virtualized Restricted Boltzmann Machine Architecture Using Embedded MPI
Charles Lo and Paul Chow ({locharl1, pc}@eecg.toronto.edu)
Department of Electrical and Computer Engineering, University of Toronto
FPGA
Motivation
Restricted Boltzmann Machines (RBMs) are two-layer artificial neural networks. Deep neural networks can be trained efficiently with RBMs and have been applied successfully to interesting machine learning problems, but learning is slow with thousands of nodes per layer.
Motivation
RBM training is easily parallelized, and FPGA implementations achieve dramatic speed-ups, but RBM size is limited by the available on-chip RAM. A previously proposed virtualized architecture [1] removes this size limit, at a significant performance cost from limited memory bandwidth. This work builds upon the virtualized architecture to accelerate large RBMs and increase performance.
[1] D. Ly et al., "High-Performance Reconfigurable Hardware Architecture for Restricted Boltzmann Machines," IEEE Trans. Neural Networks, Nov.
The Rest of the Talk
- Restricted Boltzmann Machines
- Baseline Architecture
- Performance Improvements
- Embedded Message Passing Interface (MPI)
Restricted Boltzmann Machines
An RBM consists of a visible layer and a hidden layer of nodes, with connections between the two layers but none within a layer.
Alternating Gibbs Sampling
The visible and hidden layers are sampled alternately over time steps t = 0, 1, 2, 3, ...
Alternating Gibbs Sampling
Each step applies the weight matrix between the layers: weight w_{i,j} connects visible node i to hidden node j (here a 4x4 matrix, w_{0,0} through w_{3,3}).
RBM size is limited by on-chip memory.
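One full Gibbs step can be pictured in software. The following is a minimal pure-Python sketch of alternating sampling; the 4x4 weights, the seed, and the function names are illustrative, not part of the hardware design:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_layer(states, weights, transpose=False):
    """One half-step of alternating Gibbs sampling: compute activation
    probabilities for the opposite layer and draw binary samples."""
    n_vis, n_hid = len(weights), len(weights[0])
    if transpose:  # hidden -> visible
        probs = [sigmoid(sum(weights[i][j] * states[j] for j in range(n_hid)))
                 for i in range(n_vis)]
    else:          # visible -> hidden
        probs = [sigmoid(sum(weights[i][j] * states[i] for i in range(n_vis)))
                 for j in range(n_hid)]
    return [1 if random.random() < p else 0 for p in probs]

# Toy 4x4 weight matrix: w[i][j] links visible node i to hidden node j.
w = [[0.5 if i == j else -0.1 for j in range(4)] for i in range(4)]
v = [1, 0, 1, 0]                         # t = 0: clamp the visible layer
h = sample_layer(v, w)                   # t = 1: sample the hidden layer
v1 = sample_layer(h, w, transpose=True)  # t = 2: reconstruct the visible layer
```

Every half-step is a matrix-vector product followed by a sigmoid and a Bernoulli draw, which is exactly the operation the hardware parallelizes.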
Virtualization
Train a virtual RBM by time-multiplexing the hardware: the weight matrix is split into partitions W_{0,0}, W_{0,1}, W_{1,0}, W_{1,1}, and the physical engine processes one partition at a time, stepping through them in time.
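The time-multiplexing idea can be modeled in software. A minimal sketch, assuming a b x b physical engine streaming partitions of an n x n virtual weight matrix (function names are illustrative):

```python
def partition(W, b):
    """Split an n x n weight matrix into (n//b)^2 blocks of size b x b."""
    n = len(W)
    return {(I, J): [row[J*b:(J+1)*b] for row in W[I*b:(I+1)*b]]
            for I in range(n // b) for J in range(n // b)}

def virtual_forward(W, v, b):
    """Hidden pre-activations for a 'virtual' RBM, computed by loading one
    b x b partition at a time into the fixed-size engine (a context switch)
    and accumulating partial dot products."""
    n = len(W)
    acts = [0.0] * n
    context_switches = 0
    for (I, J), blk in sorted(partition(W, b).items()):
        context_switches += 1      # load partition W_{I,J} into block RAM
        for jj in range(b):        # partial sums for this block's hidden columns
            acts[J*b + jj] += sum(blk[ii][jj] * v[I*b + ii] for ii in range(b))
    return acts, context_switches
```

With n = 4 and b = 2 this takes four context switches and produces the same pre-activations as a direct matrix-vector product, which is the sense in which the virtual RBM is equivalent to the physical one.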
Baseline Architecture
A processor, DDR2 memory with its memory controller, and a DMA engine sit on the Processor Local Bus (PLB); the compute engines are connected through a Message Passing Interface (MPI) network, all on one FPGA.
Improving Performance
Three approaches:
- Partition size
- Memory interface
- Multi-FPGA extension
Partition Size
A partition W_{0,0} of the weight matrix is stored across block RAMs (BRAM 0, BRAM 1, ...), one matrix row per address.
- Larger partitions reduce context switches and increase parallel weight access.
- Diagonal assignment (row i rotated by i, e.g. address 1 holds w_{1,1} w_{1,2} w_{1,3} w_{1,0}) allows parallel access to both rows and columns.
- Block diagonal assignment provides access to packed weights.
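The diagonal assignment can be checked with a small software model. This sketch stripes an n x n matrix across n BRAMs so that any row lives at a single address and any column lives in n distinct BRAMs (function names are illustrative):

```python
def diagonal_store(W):
    """Store row a rotated left by a, so BRAM b at address a holds
    W[a][(a + b) % n].  A whole row then sits at one address, and a whole
    column is spread over n distinct BRAMs -- either can be read in one
    parallel cycle with single-ported memories."""
    n = len(W)
    return [[W[a][(a + b) % n] for b in range(n)] for a in range(n)]

def read_row(bram, r):
    """Read row r: fix address r, undo the rotation across BRAMs."""
    n = len(bram)
    return [bram[r][(c - r) % n] for c in range(n)]

def read_col(bram, c):
    """Read column c: element W[r][c] sits in BRAM (c - r) % n at address r,
    so the n accesses hit n different BRAMs -- no port conflict."""
    n = len(bram)
    return [bram[r][(c - r) % n] for r in range(n)]
```

Row access matters for the visible-to-hidden pass and column access for the hidden-to-visible pass, so supporting both from one copy of the weights avoids duplicating the matrix.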
Memory Interface
The baseline moves weights from DDR2 memory over the Processor Local Bus (PLB) via the DMA engine. The improved design adds a dedicated Memory Access Core attached to the memory controller's Native Port Interface (NPI), feeding the compute engines without crossing the PLB.
Single-FPGA Performance
The design is compared to the original virtualized architecture using Connection Updates per Second (CUPS) as the performance metric. Single-FPGA performance increased 4.2-fold over the original architecture.
Multi-FPGA Extension
With the weight matrix partitioned into a 4x4 grid of blocks W_{0,0} through W_{3,3}, 16 context switches are required on a single FPGA. Virtualization allows the partitions to be distributed in both time and space: the 16 partitions are spread across FPGA 0 through FPGA 3, each FPGA stepping through its share in time.
Inter-FPGA Accumulation
Accumulation in space requires inter-FPGA communication: partial results computed from partitions on different FPGAs must be combined. A tree adder dataflow combines the partial sums from F0 through F3 pairwise (F0 + F1 and F2 + F3, then the two intermediate results).
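The tree adder reduces p partial vectors in log2(p) communication stages rather than p - 1 sequential hops. A pure-Python sketch of the same pairwise reduction (names illustrative):

```python
def tree_reduce(partials):
    """Pairwise tree reduction of per-FPGA partial sums: at each stage,
    neighbouring vectors are added elementwise, halving the count until
    one fully accumulated vector remains."""
    vecs = list(partials)
    while len(vecs) > 1:
        nxt = []
        for i in range(0, len(vecs), 2):
            a = vecs[i]
            # Pad with zeros if an odd vector is left without a partner.
            b = vecs[i + 1] if i + 1 < len(vecs) else [0.0] * len(a)
            nxt.append([x + y for x, y in zip(a, b)])
        vecs = nxt
    return vecs[0]
```

For four FPGAs this is two stages, matching the F0/F1 and F2/F3 pairing on the slide.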
Inter-FPGA Communication
[Figure: performance scaling with network size — speed (MCUPS, up to 12,000) vs. number of nodes in the visible and hidden layers (up to 8,000), for 1-FPGA and 4-FPGA configurations.]
Embedded Message Passing Interface
ArchES-MPI [2] provides:
- A scalable, high-performance on-chip network
- System component implementations abstracted by ranks
- Simple incremental improvements
- Reconfigurable control and dataflow
- Design portability
[2] M. Saldaña et al., "MPI as an Abstraction for Software-Hardware Interaction for HPRCs," HPRCTA, Nov.
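The rank abstraction can be pictured with a toy in-process model: every component, whether a software process or a hardware engine, is addressed only by rank, so swapping one implementation for another leaves the communication code untouched. This is an illustrative sketch, not the ArchES-MPI API:

```python
from queue import Queue

class Rank:
    """Toy model of an MPI rank: each component owns an inbox keyed by
    its rank id and communicates only through send/recv, never through
    knowledge of what the peer actually is."""
    def __init__(self, rank_id, network):
        self.id = rank_id
        self.net = network
        self.net.setdefault(rank_id, Queue())

    def send(self, dest, payload):
        self.net[dest].put((self.id, payload))

    def recv(self):
        return self.net[self.id].get()

network = {}
processor = Rank(0, network)   # could be software on the embedded CPU
engine = Rank(1, network)      # could be a hardware compute engine
processor.send(1, "weights-ready")
```

Because the processor addresses rank 1 rather than a bus address, replacing the engine behind rank 1 (software model, then hardware core) is invisible to the rest of the system — the "simple incremental improvements" point above.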
Conclusion
- Virtualization allows large RBMs to be accelerated by distributing work in time and space
- Single-FPGA performance increased 4.2-fold
- Embedded MPI simplifies design changes and increases design portability
RBM Bibliography
- G. Hinton et al., "A fast learning algorithm for deep belief nets," Neural Computation, 2006
- G. Hinton et al., "Reducing the dimensionality of data with neural networks," Science, Jul. 2006
Weight Pipeline (backup)
[Diagram: diagonally stored weights read from BRAM 0 and BRAM 1 (e.g. address 0 holds w_{0,0} w_{0,1}; address 1 holds w_{1,1} w_{1,0}) pass through buffer stages that restore row order before reaching the compute engines.]
Mini-Batch Size Effects (backup)
[Figure: effect of mini-batch size on speed (MCUPS, up to 8,000) as the mini-batch size grows.]
FPGA Performance Comparison (backup)

Implementation          Network Size   mBatch Size   Absolute Performance
Virtualized 1-FPGA      1024x…         …             … MCUPS
Virtualized 4-FPGA      1024x…         …             … MCUPS
Virtualized 1-FPGA      1024x…         …             … MCUPS
Virtualized 4-FPGA      1024x…         …             … MCUPS
Virtualized 1-FPGA      8192x…         …             … MCUPS
Virtualized 4-FPGA      8192x…         …             … MCUPS
Virtualized 1-FPGA      256x…          …             … MCUPS
Kim et al. [3] 4-FPGA   …              …             … MCUPS

CUPS = n^2 / T

[3] S. Kim et al., "A Large-Scale Architecture for Restricted Boltzmann Machines," FPL.
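The CUPS formula above lends itself to a one-line helper. The layer size and update time below are illustrative inputs, not measurements from the paper:

```python
def mcups(n, seconds_per_update):
    """Connection Updates Per Second for an n x n RBM, in millions:
    every Gibbs step touches all n*n weight connections, so
    CUPS = n^2 / T."""
    return (n * n) / seconds_per_update / 1e6

# Illustrative only: a 1024x1024 RBM updated in 1 ms.
rate = mcups(1024, 0.001)
```

The quadratic numerator is why doubling the layer size quadruples the work, and why the metric rewards architectures that keep large matrices fed.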
More informationIntroducing a Bioinformatics Similarity Search Solution
Introducing a Bioinformatics Similarity Search Solution 1 Page About the APU 3 The APU as a Driver of Similarity Search 3 Similarity Search in Bioinformatics 3 POC: GSI Joins Forces with the Weizmann Institute
More informationRestricted Boltzmann Machines
Restricted Boltzmann Machines Boltzmann Machine(BM) A Boltzmann machine extends a stochastic Hopfield network to include hidden units. It has binary (0 or 1) visible vector unit x and hidden (latent) vector
More informationChapter 11. Stochastic Methods Rooted in Statistical Mechanics
Chapter 11. Stochastic Methods Rooted in Statistical Mechanics Neural Networks and Learning Machines (Haykin) Lecture Notes on Self-learning Neural Algorithms Byoung-Tak Zhang School of Computer Science
More informationImplementation of a Restricted Boltzmann Machine in a Spiking Neural Network
Implementation of a Restricted Boltzmann Machine in a Spiking Neural Network Srinjoy Das Department of Electrical and Computer Engineering University of California, San Diego srinjoyd@gmail.com Bruno Umbria
More informationNeural Networks. Mark van Rossum. January 15, School of Informatics, University of Edinburgh 1 / 28
1 / 28 Neural Networks Mark van Rossum School of Informatics, University of Edinburgh January 15, 2018 2 / 28 Goals: Understand how (recurrent) networks behave Find a way to teach networks to do a certain
More informationDigital Integrated Circuits A Design Perspective. Arithmetic Circuits. Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic.
Digital Integrated Circuits A Design Perspective Jan M. Rabaey Anantha Chandrakasan Borivoje Nikolic Arithmetic Circuits January, 2003 1 A Generic Digital Processor MEM ORY INPUT-OUTPUT CONTROL DATAPATH
More informationGoogle s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Google s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, et al. Google arxiv:1609.08144v2 Reviewed by : Bill
More informationARestricted Boltzmann machine (RBM) [1] is a probabilistic
1 Matrix Product Operator Restricted Boltzmann Machines Cong Chen, Kim Batselier, Ching-Yun Ko, and Ngai Wong chencong@eee.hku.hk, k.batselier@tudelft.nl, cyko@eee.hku.hk, nwong@eee.hku.hk arxiv:1811.04608v1
More informationCS470: Computer Architecture. AMD Quad Core
CS470: Computer Architecture Yashwant K. Malaiya, Professor malaiya@cs.colostate.edu AMD Quad Core 1 Architecture Layers Building blocks Gates, flip-flops Functional bocks: Combinational, Sequential Instruction
More informationBias-Variance Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions
- Trade-Off in Hierarchical Probabilistic Models Using Higher-Order Feature Interactions Simon Luo The University of Sydney Data61, CSIRO simon.luo@data61.csiro.au Mahito Sugiyama National Institute of
More informationBinary Convolutional Neural Network on RRAM
Binary Convolutional Neural Network on RRAM Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang Dept. of E.E, Tsinghua National Laboratory for Information Science and Technology (TNList) Tsinghua
More informationGaussian Cardinality Restricted Boltzmann Machines
Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence Gaussian Cardinality Restricted Boltzmann Machines Cheng Wan, Xiaoming Jin, Guiguang Ding and Dou Shen School of Software, Tsinghua
More informationWhy DNN Works for Acoustic Modeling in Speech Recognition?
Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,
More informationLarge-Scale FPGA implementations of Machine Learning Algorithms
Large-Scale FPGA implementations of Machine Learning Algorithms Philip Leong ( ) Computer Engineering Laboratory School of Electrical and Information Engineering, The University of Sydney Computer Engineering
More informationConvolutional Neural Networks
Convolutional Neural Networks Books» http://www.deeplearningbook.org/ Books http://neuralnetworksanddeeplearning.com/.org/ reviews» http://www.deeplearningbook.org/contents/linear_algebra.html» http://www.deeplearningbook.org/contents/prob.html»
More informationExploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units
Exploring the Potential of Instruction-Level Parallelism of Exposed Datapath Architectures with Buffered Processing Units Anoop Bhagyanath and Klaus Schneider Embedded Systems Chair University of Kaiserslautern
More informationHopfield Networks and Boltzmann Machines. Christian Borgelt Artificial Neural Networks and Deep Learning 296
Hopfield Networks and Boltzmann Machines Christian Borgelt Artificial Neural Networks and Deep Learning 296 Hopfield Networks A Hopfield network is a neural network with a graph G = (U,C) that satisfies
More informationCOVER SHEET: Problem#: Points
EEL 4712 Midterm 3 Spring 2017 VERSION 1 Name: UFID: Sign here to give permission for your test to be returned in class, where others might see your score: IMPORTANT: Please be neat and write (or draw)
More information