A Deep Convolutional Neural Network Based on Nested Residue Number System

1 A Deep Convolutional Neural Network Based on Nested Residue Number System. Hiroki Nakahara (Ehime University, Japan) and Tsutomu Sasao (Meiji University, Japan)

2 Outline: Background; Deep convolutional neural network (DCNN); Residue number system (RNS); DCNN using nested RNS (NRNS); Experimental results; Conclusion

3 Background: Deep Neural Network. A multi-layer neuron model, used for embedded vision systems. FPGA realization is suitable for real-time systems: faster than a CPU, lower power consumption than a GPU, and fixed-point representation is sufficient. High performance per area is desired.

4 Deep Convolutional Neural Network (DCNN)

5 Artificial Neuron. With a constant input x_0 = 1 and w_0 acting as the bias, the neuron computes the internal state u = Σ_{i=0}^{N} w_i x_i and outputs y = f(u). x_i: input signal, w_i: weight, u: internal state, f(u): activation function (Sigmoid, ReLU, etc.), y: output signal.
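
A minimal Python sketch of this neuron model (plain software, not the paper's FPGA realization; the ReLU choice and the example weights are only illustrative):

```python
def relu(u):
    # One possible activation function f(u)
    return max(0.0, u)

def neuron(x, w, f=relu):
    """Artificial neuron: u = sum_{i=0..N} w_i * x_i with x_0 = 1 (bias), y = f(u)."""
    inputs = [1.0] + list(x)                       # constant bias input x_0 = 1
    u = sum(wi * xi for wi, xi in zip(w, inputs))  # internal state u
    return f(u)                                    # output signal y

# Example: three inputs; w[0] is the bias weight
print(neuron([0.5, -1.0, 2.0], [0.1, 0.4, 0.3, 0.2]))   # 0.4
```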

6 Deep Convolutional Neural Network (DCNN) for ImageNet: 2D convolutional layers, pooling layers, and fully connected layers.

7 2D Convolutional Layer. Consumes more than 90% of the computation time; a multiply-accumulate (MAC) operation is performed: z_ij = y_ij + Σ_{m=0}^{K-1} Σ_{n=0}^{K-1} x_{i+m, j+n} · w_{mn}. x_ij: input signal, y_ij: bias, w_mn: weight, K: kernel size, z_ij: output signal.
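
The MAC loop above can be written directly as a nested-loop reference in Python; this is only a sketch with made-up sizes, not the time-multiplexed circuit the talk describes next:

```python
def conv2d_mac(x, w, bias):
    """z[i][j] = bias[i][j] + sum_{m,n} x[i+m][j+n] * w[m][n]  (valid convolution, K x K kernel)."""
    K = len(w)
    H, W = len(x), len(x[0])
    z = []
    for i in range(H - K + 1):
        row = []
        for j in range(W - K + 1):
            acc = bias[i][j]                      # start from the bias y_ij
            for m in range(K):                    # K*K multiply-accumulate operations
                for n in range(K):
                    acc += x[i + m][j + n] * w[m][n]
            row.append(acc)
        z.append(row)
    return z

x = [[r * 5 + c for c in range(5)] for r in range(5)]   # 5x5 input feature map
w = [[1] * 3 for _ in range(3)]                          # 3x3 kernel of ones
bias = [[0] * 3 for _ in range(3)]
print(conv2d_mac(x, w, bias))                            # 3x3 map of local sums
```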

8 Realization of the 2D Convolutional Layer. Requires more than a billion MACs! Our realization: time multiplexing and the Nested Residue Number System (NRNS). The FPGA datapath consists of binary-to-RNS converters, parallel multiply-add units fed from BRAMs, RNS-to-binary converters, and off-chip memory.

9 Residue Number System (RNS)

10 Residue Number System (RNS). Defined by a set of L mutually prime integer constants m_1, m_2, ..., m_L: no modulus has a common factor with any other. Typically, prime numbers are used as the moduli set. An arbitrary integer X can be uniquely represented by a tuple of L integers (X_1, X_2, ..., X_L), where X_i ≡ X (mod m_i). Dynamic range: M = Π_{i=1}^{L} m_i.
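
A minimal Python illustration of this representation and its dynamic range, using the <3,4,5> moduli set that appears in the next slide's example:

```python
from functools import reduce

def to_rns(x, moduli):
    """Represent integer X as the tuple (X mod m_1, ..., X mod m_L)."""
    return tuple(x % m for m in moduli)

moduli = (3, 4, 5)                               # pairwise coprime moduli
M = reduce(lambda a, b: a * b, moduli)           # dynamic range M = product of the moduli
print(to_rns(8, moduli), "dynamic range M =", M) # (2, 0, 3) dynamic range M = 60
```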

11 Multiplication on RNS. Moduli set <3,4,5>, X = 8, Y = 2, Z = X × Y = 16 = (1,0,1). Binary→RNS conversion: X = (2,0,3), Y = (2,2,2). Parallel multiplication: Z = (4 mod 3, 0 mod 4, 6 mod 5) = (1,0,1). RNS→Binary conversion: (1,0,1) = 16.
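
The same example can be checked in software; the CRT-based to_binary helper below is an illustrative stand-in for the RNS-to-binary converter hardware discussed later, not its implementation:

```python
from functools import reduce

MODULI = (3, 4, 5)

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mul(a, b):
    # Digit-wise (parallel, carry-free) modular multiplication
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))

def to_binary(r):
    """Chinese Remainder Theorem reconstruction (software stand-in for the RNS-to-binary converter)."""
    M = reduce(lambda p, q: p * q, MODULI)
    x = 0
    for ri, mi in zip(r, MODULI):
        Mi = M // mi
        x += ri * Mi * pow(Mi, -1, mi)   # Mi^{-1} mod mi (Python 3.8+)
    return x % M

X, Y = 8, 2
Z = rns_mul(to_rns(X), to_rns(Y))
print(to_rns(X), to_rns(Y), Z, to_binary(Z))   # (2, 0, 3) (2, 2, 2) (1, 0, 1) 16
```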

12 Binary→RNS Converter: converts X into its residues X mod m_i, one for each modulus of the moduli set.

13 Functional Decomposition. The input variables are split into bound variables X1 = (x1, x2) and free variables X2 = (x3, x4), so that the function is realized as f(X1, X2) = g(h(X1), X2); the column multiplicity of the decomposition chart determines how many bits the intermediate function h(X1) needs.

14 Decomposition Chart for X mod 3. Bound variables X1 = (x1, x2), free variables X2 = (x3, x4, x5); each entry of the chart is the residue i mod 3 of the corresponding 5-bit input i (0 mod 3 = 0, 1 mod 3 = 1, 2 mod 3 = 2, 3 mod 3 = 0, and so on).

15 Decomposition Chart for X mod 3 (with the intermediate function h(X1)). Identical columns are merged; the number of distinct column patterns is the column multiplicity, which determines how many values h(X1) must distinguish.
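
As an illustrative check, the column multiplicity of this chart can be counted with a few lines of Python (assuming x1 is the most significant bit, which the slide does not state explicitly):

```python
from itertools import product

def column_multiplicity(f, n_bound, n_free):
    """Count the distinct columns of the decomposition chart of f over (bound, free) variables."""
    columns = set()
    for bound in product((0, 1), repeat=n_bound):
        col = tuple(f(bound + free) for free in product((0, 1), repeat=n_free))
        columns.add(col)
    return len(columns)

def x_mod_3(bits):
    # bits[0] is taken as the MSB here (an assumption about the slide's bit ordering)
    x = int("".join(map(str, bits)), 2)
    return x % 3

# Bound variables (x1, x2), free variables (x3, x4, x5)
print(column_multiplicity(x_mod_3, 2, 3))   # 3 distinct columns -> h(X1) fits in 2 bits
```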

16 Binary→RNS Converter. One LUT cascade per modulus (a cascade for X mod m_1, another for X mod m_2, and so on), with each cascade realized by BRAMs.
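
A software emulation of such a cascade, sketched under the assumption that every cell consumes a fixed chunk of input bits and passes only the partial residue to the next cell:

```python
def lut_cascade_mod(x_bits, m, chunk=2):
    """Emulate a LUT cascade computing X mod m, consuming 'chunk' input bits per cell (MSB first)."""
    assert len(x_bits) % chunk == 0
    # One shared cell table: (incoming partial residue, next bits) -> outgoing partial residue
    table = {(r, b): (r * (1 << chunk) + b) % m
             for r in range(m) for b in range(1 << chunk)}
    r = 0
    for i in range(0, len(x_bits), chunk):
        b = int("".join(map(str, x_bits[i:i + chunk])), 2)
        r = table[(r, b)]            # one cell (table lookup) of the cascade
    return r

bits = [1, 0, 1, 1, 0, 1]                  # X = 45
print(lut_cascade_mod(bits, 3), 45 % 3)    # both print 0
```

Each cell's table has only m · 2^chunk entries, which is why a cell fits in a small LUT or BRAM.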

17 RNS→Binary Converter (m = 3): realized by a chain of mod-m adders connected by carries.

18 Problem. Since the moduli set of an RNS consists of mutually prime numbers, the sizes of the circuits are all different. Example: for <7,11,13>, the Binary→RNS converter (realized by BRAMs) and the RNS→Binary converter (realized by DSP blocks and BRAMs) need LUTs of different input sizes for each modulus.

19 DCNN using Nested RNS

20 Nested RNS: (Z_1, Z_2, ..., Z_i, ..., Z_L) → (Z_1, Z_2, ..., (Z_i1, Z_i2, ..., Z_ij), ..., Z_L). Ex: <7,11,13> (original moduli) → <7, <5,6,7>_11, <5,6,7>_13>. 1. Reuse the same moduli set. 2. Decompose a large modulus into smaller ones.

21 Example of Nested RNS: 19 × 22 (= 418) on <7, <5,6,7>_11, <5,6,7>_13>. Binary→NRNS conversion: 19 = <5,8,6> and 22 = <1,0,9>; nesting the residues for moduli 11 and 13 gives 19 = <5, <3,2,1>_11, <1,0,6>_13> and 22 = <1, <0,0,0>_11, <4,3,2>_13>. Modulo multiplication (digit-wise): <5, <0,0,0>_11, <4,0,5>_13>. RNS→Bin on the nested tuples followed by Bin→RNS (reducing by 11 and 13) gives <5,0,2> = 418.
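
This worked example can be reproduced with the short Python sketch below; the moduli sets follow the slide, while the CRT helper is an illustrative stand-in for the converters rather than the paper's LUT/DSP realization:

```python
from functools import reduce

OUTER = (7, 11, 13)      # original moduli set
INNER = (5, 6, 7)        # nested moduli set reused for the moduli 11 and 13

def to_rns(x, moduli):
    return tuple(x % m for m in moduli)

def crt(r, moduli):
    # Chinese Remainder Theorem reconstruction (software stand-in for the converters)
    M = reduce(lambda p, q: p * q, moduli)
    return sum(ri * (M // mi) * pow(M // mi, -1, mi) for ri, mi in zip(r, moduli)) % M

def to_nrns(x):
    """<7, <5,6,7>_11, <5,6,7>_13> representation of x."""
    z = to_rns(x, OUTER)
    return (z[0], to_rns(z[1], INNER), to_rns(z[2], INNER))

def nrns_mul(a, b):
    # Digit-wise multiplication; the nested digits are multiplied inside <5,6,7>
    return (a[0] * b[0] % 7,
            tuple(x * y % m for x, y, m in zip(a[1], b[1], INNER)),
            tuple(x * y % m for x, y, m in zip(a[2], b[2], INNER)))

def from_nrns(z):
    # Undo the nesting (reduce back mod 11 and mod 13), then undo the outer RNS
    outer = (z[0], crt(z[1], INNER) % 11, crt(z[2], INNER) % 13)
    return crt(outer, OUTER)

a, b = to_nrns(19), to_nrns(22)
p = nrns_mul(a, b)
print(a, b, p, from_nrns(p))   # last value is 418
```

The largest nested product (12 × 12 = 144) stays within the <5,6,7> dynamic range of 210, which is what keeps the digit-wise multiplication exact.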

22 Realization of Nested RNS. The Bin→<7,11,13> and Bin→<5,6,7> converters (Binary→NRNS) are realized by BRAMs; the modulo multipliers are realized by 6-input LUTs; the <5,6,7>→Bin and <7,11,13>→Bin converters (NRNS→Binary) are realized by BRAMs and DSP blocks.

23 Moduli Set for NRNS. Conventional RNS (uses 23 moduli): <3,4,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83>. Applying the NRNS to the moduli greater than 15: <3,4,5,7,11,13, <3,4,5,7,11,13>_17, <3,4,5,7,11,13>_19, <3,4,5,7,11,13,<3,4,5,7,11,13>_17>_23, <3,4,5,7,11,13,<3,4,5,7,11,13>_17>_29, ..., <3,4,5,7,11,13,<3,4,5,7,11,13>_17>_83>. All the 48-bit MAC operations are decomposed into 4-bit ones.

24 DCNN Architecture using the NRNS. Parallel Bin→NRNS converters (BRAMs), parallel modulo m_i 2D convolutional units, tree-based NRNS→Bin converters, a sequencer, on-chip memory, and a DDR3 controller attached to external DDR3 SODIMM.

25 Experimental Results

26 Implementation Setup. FPGA board: Xilinx VC707; FPGA: Virtex-7 VX485T; 1 GB DDR3 SODIMM (64-bit width). Realized the pre-trained ImageNet DCNN by Convnet with 48-bit fixed-point precision. Synthesis tool: Xilinx Vivado 2014; timing constraint: 400 MHz.

27 Comparison with Other Implementations: precision, FPGA device, maximum frequency [MHz], performance [GOPS], and performance per area [GOPS/Slice].
ASAP'09: 16-bit fixed, Virtex-5 LX330T
PACT: fixed point, Virtex-5 SX240T
FPL'09: 48-bit fixed, Spartan-3A DSP
ISCA: 48-bit fixed, Virtex-5 SX240T
ICCD'13: fixed point, Virtex-6 VLX240T
FPGA'15: 32-bit float, Virtex-7 VX485T
Proposed: 48-bit fixed, Virtex-7 VX485T

28 Conclusion. Realized the DCNN on the FPGA using time multiplexing and the nested RNS; the MAC operation is realized by small LUTs. Functional decomposition is used as follows: the Bin→NRNS converter is realized by BRAMs, and the NRNS→Bin converter is realized by DSP blocks and BRAMs. Performance per area (GOPS/Slice) is 5.86 times higher than ISCA's.
