HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU

1 HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU
Minmin Sun, NVIDIA
April 5th (April 4-7, 2016, Silicon Valley)

2 AGENDA
Brief Introduction of CTC
Alpha/Beta Matrix Computation
Gradient Matrix Computation
Overall Performance

3 BRIEF INTRODUCTION OF CTC

4 BRIEF INTRODUCTION OF CTC: Overview
[Figure: an RNN unrolled over frames t-1, t and t+1. At each frame the hidden layer h feeds the output layer y, whose softmax p is passed to CTC together with the label sequence (C, A, T); CTC returns the gradient g to the output layer.]
CTC is a loss function used to train the RNN.
Inputs: (1) p, the softmax output; (2) the label sequence.
Output: g, the gradient w.r.t. the output layer.
CTC includes: (1) alpha computation; (2) beta computation; (3) gradient computation.
6/7/2016

5 BRIEF INTRODUCTION OF CTC: Alpha/Beta Matrix Computation
If $l_s = \text{blank}$ or $l_s = l_{s-2}$:
$\alpha_t^s = \left(\alpha_{t-1}^{s} + \alpha_{t-1}^{s-1}\right) p_t^{l_s}$
else:
$\alpha_t^s = \left(\alpha_{t-1}^{s} + \alpha_{t-1}^{s-1} + \alpha_{t-1}^{s-2}\right) p_t^{l_s}$
Matrix dimensions: T rows by S columns.
S = 2L + 1 is the length of the augmented label sequence l.
L is the number of characters in the original label sequence.
T is the number of time-steps in the utterance.
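
The recursion above can be sketched on the CPU in a few lines of Python. This is only an illustration, not the kernel from the talk: the function name `ctc_alpha`, the choice of index 0 for the blank, the initial row (only the leading blank or the first character may be emitted at the first time-step, as in the standard CTC formulation) and the use of plain probabilities instead of log-probabilities are all assumptions made here.

```python
def ctc_alpha(probs, labels, blank=0):
    """Fill the T x S alpha matrix.

    probs  : T rows, each a list of A softmax outputs (p in the slides)
    labels : the L class indices of the original label sequence
    """
    # Augmented sequence l = blank, c1, blank, c2, ..., blank (S = 2L + 1).
    aug = [blank]
    for c in labels:
        aug += [c, blank]
    T, S = len(probs), len(aug)

    alpha = [[0.0] * S for _ in range(T)]
    # Assumed initial condition: start in the leading blank or the first character.
    alpha[0][0] = probs[0][aug[0]]
    if S > 1:
        alpha[0][1] = probs[0][aug[1]]

    for t in range(1, T):
        for s in range(S):
            acc = alpha[t - 1][s]
            if s >= 1:
                acc += alpha[t - 1][s - 1]
            # The s-2 term is only allowed in the "else" branch of the recursion.
            if s >= 2 and aug[s] != blank and aug[s] != aug[s - 2]:
                acc += alpha[t - 1][s - 2]
            alpha[t][s] = acc * probs[t][aug[s]]
    return alpha, aug
```

With this convention, the total likelihood of the label sequence is the sum of the last two entries of the last row of `alpha`.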

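6 BRIEF INTRODUCTION OF CTC: Alpha/Beta Matrix Computation
[Figure: the alpha matrix, with the augmented label sequence l = (blank, c, blank, a, blank, t, blank) along the columns and the time-steps along the rows. $\alpha_t^s$ in row t depends on $\alpha_{t-1}^{s-2}$, $\alpha_{t-1}^{s-1}$ and $\alpha_{t-1}^{s}$ in row t-1.]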

7 BRIEF INTRODUCTION OF CTC: Gradient Matrix Computation
$g_t^a = p_t^a - \frac{1}{p_t^a \cdot \mathrm{nll}} \sum_{s:\, l_s = a} \alpha_t^s \beta_t^s$
Matrix dimensions: T rows by A columns.
A is the alphabet size, e.g. 28 for English.
The sum is a key-value reduction that uses the character $l_s$ as the key.
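
A small CPU sketch of this gradient, with the key-value reduction written as a plain loop, may help make the formula concrete. Several things here are assumptions rather than content of the slides: the function names, blank index 0, the boundary conditions of alpha and beta (taken from the standard CTC formulation, where both include the emission at time t), and the reading of the normalizer written `nll` on the slide as the total sequence likelihood. Probabilities are kept in linear space and are assumed to be strictly positive, as softmax outputs are.

```python
def augment(labels, blank=0):
    aug = [blank]
    for c in labels:
        aug += [c, blank]
    return aug


def forward_backward(probs, aug, blank=0):
    T, S = len(probs), len(aug)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[0.0] * S for _ in range(T)]
    # Forward pass (alpha), assumed to start in the first blank or first character.
    alpha[0][0] = probs[0][aug[0]]
    if S > 1:
        alpha[0][1] = probs[0][aug[1]]
    for t in range(1, T):
        for s in range(S):
            acc = alpha[t - 1][s]
            if s >= 1:
                acc += alpha[t - 1][s - 1]
            if s >= 2 and aug[s] != blank and aug[s] != aug[s - 2]:
                acc += alpha[t - 1][s - 2]
            alpha[t][s] = acc * probs[t][aug[s]]
    # Backward pass (beta), assumed to end in the last blank or last character.
    beta[T - 1][S - 1] = probs[T - 1][aug[S - 1]]
    if S > 1:
        beta[T - 1][S - 2] = probs[T - 1][aug[S - 2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            acc = beta[t + 1][s]
            if s + 1 < S:
                acc += beta[t + 1][s + 1]
            if s + 2 < S and aug[s] != blank and aug[s] != aug[s + 2]:
                acc += beta[t + 1][s + 2]
            beta[t][s] = acc * probs[t][aug[s]]
    return alpha, beta


def ctc_gradient(probs, labels, blank=0):
    aug = augment(labels, blank)
    alpha, beta = forward_backward(probs, aug, blank)
    T, S, A = len(probs), len(aug), len(probs[0])
    likelihood = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    grad = []
    for t in range(T):
        bins = [0.0] * A                      # one bin per alphabet entry (the key)
        for s in range(S):
            bins[aug[s]] += alpha[t][s] * beta[t][s]
        grad.append([probs[t][a] - bins[a] / (probs[t][a] * likelihood)
                     for a in range(A)])
    return grad, likelihood
```

A quick sanity check under these assumptions is that every row of the returned gradient sums to zero, because each row of p sums to one and the reduced terms, once normalized, also sum to one.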

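8 BRIEF INTRODUCTION OF CTC: Gradient Matrix Computation
[Figure: at time-step t, row t of α and β, whose columns follow the augmented label sequence l = (blank, C, blank, A, blank, T, blank), is reduced into row t of g, whose columns are the alphabet (blank, A, B, C, ..., Z, space).]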

9 ALPHA/BETA MATRIX COMPUTATION

10 ALPHA/BETA MATRIX COMPUTATION: GPU Implementation
Each CUDA block owns one sequence, i.e. the number of blocks is the minibatch size.
Each thread owns one column of the Alpha/Beta matrix.
Threads iterate over the matrix rows, synchronizing after each iteration.
[Figure: within a block, threads s-2, s-1 and s hold $\alpha_{t-1}^{s-2}$, $\alpha_{t-1}^{s-1}$ and $\alpha_{t-1}^{s}$; thread s computes $\alpha_t^s$.]

11 ALPHA/BETA MATRIX COMPUTATION: Data Reuse
If $l_s = \text{blank}$ or $l_s = l_{s-2}$:
$\alpha_t^s = \left(\alpha_{t-1}^{s} + \alpha_{t-1}^{s-1}\right) p_t^{l_s}$
else:
$\alpha_t^s = \left(\alpha_{t-1}^{s} + \alpha_{t-1}^{s-1} + \alpha_{t-1}^{s-2}\right) p_t^{l_s}$
$l_s$ and $l_{s-2}$ are used by every iteration and do not change across iterations.
So load them into the register file once and reuse them in all iterations of the thread.

12 ALPHA/BETA MATRIX COMPUTATION: Data Reuse
In the alpha recursion (slide 5), $\alpha_{t-1}^{s}$ is the output of the previous iteration of the same thread.
Thus it can be passed on through the register file.

13 ALPHA/BETA MATRIX COMPUTATION: Data Reuse
In the alpha recursion (slide 5), $\alpha_{t-1}^{s-1}$ and $\alpha_{t-1}^{s-2}$ are outputs of the previous iteration of other threads in the same block.
Thus they can be passed on through shared memory.
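
The three data-reuse points can be mimicked in a sequential Python sketch of one block working on one sequence. This is not CUDA and not the talk's kernel: the lists `reg_l`, `reg_l2` and `reg_alpha` merely stand in for per-thread registers, the list `shared` stands in for the block's shared memory, and the function name, blank index 0 and initial row are assumptions made for illustration.

```python
def alpha_rows_block_style(probs, labels, blank=0):
    aug = [blank]
    for c in labels:
        aug += [c, blank]
    T, S = len(probs), len(aug)

    # "Registers": one slot per simulated thread s, loaded once before the loop.
    reg_l = list(aug)                                             # l_s
    reg_l2 = [aug[s - 2] if s >= 2 else None for s in range(S)]   # l_{s-2}
    reg_alpha = [0.0] * S                                         # own alpha of the last row
    reg_alpha[0] = probs[0][aug[0]]
    if S > 1:
        reg_alpha[1] = probs[0][aug[1]]

    rows = [list(reg_alpha)]
    for t in range(1, T):
        # Every thread publishes its previous alpha to "shared memory".
        # On a GPU a block-wide barrier would be needed after this store.
        shared = list(reg_alpha)
        for s in range(S):  # on a GPU these S iterations are parallel threads
            acc = reg_alpha[s]               # own alpha_{t-1}^s, read from a register
            if s >= 1:
                acc += shared[s - 1]         # neighbour's alpha_{t-1}^{s-1}
            if s >= 2 and reg_l[s] != blank and reg_l[s] != reg_l2[s]:
                acc += shared[s - 2]         # neighbour's alpha_{t-1}^{s-2}
            reg_alpha[s] = acc * probs[t][reg_l[s]]
        rows.append(list(reg_alpha))
    return rows
```

The only values a thread reads from `shared` are those of its two left neighbours; its own previous value and the two labels never leave its registers, which is the point of the three slides above.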

14 ALPHA/BETA MATRIX COMPUTATION: Performance on Titan X, Small Alphabet Size
T=150, L=40, A=28
       warp-ctc   optimized   speedup
N=1    0.41ms     0.22ms      1.89x
N=     ms         0.23ms      1.84x
N=     ms         0.23ms      1.82x
N=     ms         0.26ms      1.70x
N=     ms         0.30ms      1.56x
Warp-ctc:

15 ALPHA/BETA MATRIX COMPUTATION: Performance on Titan X, Large Alphabet Size
T=150, L=20, A=5000
       warp-ctc   optimized   speedup
N=1    0.41ms     0.25ms      1.65x
N=     ms         0.28ms      1.66x
N=     ms         0.28ms      1.65x
N=     ms         0.29ms      1.65x
N=     ms         0.30ms      1.68x
Warp-ctc:

16 GRADIENT MATRIX COMPUTATION

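17 GRADIENT MATRIX COMPUTATION: GPU Implementation
Each block owns one row of the Alpha and Beta matrices, i.e. the number of blocks = minibatch size * T.
Within each block, the key-value reduction is done with atomic operations on shared memory.
[Figure: block t reads row t of α and β and accumulates into its shared memory, which holds one element per alphabet entry of g (blank, A, B, C, ..., Z, space).]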

18 GRADIENT MATRIX COMPUTATION: Compute for Blanks Separately
Blanks cause most of the address conflicts.
Their exact positions in the label sequence are known.
Computing the blanks separately therefore turns their part into an ordinary parallel reduction.
[Figure: for the label sequence l = (blank, C, blank, A, blank, T, blank), block t reduces the blank positions of α and β in shared memory separately from the character positions.]
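
The idea can be sketched for a single row t as follows. The sketch assumes that the augmented sequence has blanks at the even positions 0, 2, ..., 2L and characters at the odd positions, and that no character shares the blank's index; the function name and blank index 0 are hypothetical, and the sequential `sum` only stands in for a parallel tree reduction on the GPU.

```python
def reduce_row_blanks_separately(alpha_row, beta_row, aug, alphabet_size, blank=0):
    """Return the per-character sums of alpha*beta for one time-step."""
    S = len(aug)
    bins = [0.0] * alphabet_size

    # Blanks: positions are known in advance, so this is a plain sum
    # with no keyed (atomic) accumulation.
    bins[blank] = float(sum(alpha_row[s] * beta_row[s] for s in range(0, S, 2)))

    # Characters: keyed accumulation, the part that would use atomics on a GPU.
    for s in range(1, S, 2):
        bins[aug[s]] += alpha_row[s] * beta_row[s]
    return bins
```

Since L + 1 of the 2L + 1 positions are blanks, taking them out of the keyed path removes more than half of the updates that would otherwise all target the same shared-memory address.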

19 GRADIENT MATRIX COMPUTATION: Allocate Redundant Shared Memory
Redundant copies of the shared-memory elements reduce address conflicts for atomic operations.
The results in the redundant elements are then accumulated for each character in parallel.
This is not applicable to languages with a large alphabet, like Chinese.
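
One way to picture the redundant allocation is a keyed reduction into several copies of the bins, followed by a per-character accumulation over the copies. The slides do not say how threads are mapped to copies; the `tid % copies` mapping, the function name and the default of four copies below are assumptions made purely for illustration.

```python
def replicated_reduce(keys, values, alphabet_size, copies=4):
    """Keyed sum of values, spread over `copies` redundant sets of bins."""
    partial = [[0.0] * alphabet_size for _ in range(copies)]

    # Simulated thread `tid` adds into its own copy, so two threads with the
    # same key collide only if they also map to the same copy.
    for tid, (key, value) in enumerate(zip(keys, values)):
        partial[tid % copies][key] += value   # an atomic add on a GPU

    # Second step: accumulate the redundant elements for each character.
    return [float(sum(partial[c][a] for c in range(copies)))
            for a in range(alphabet_size)]
```

The memory cost is `copies * alphabet_size` elements per block, which is affordable for A = 28 but not for A = 5000, consistent with the remark above about large alphabets.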

20 GRADIENT MATRIX COMPUTATION: Reuse the Memory of Matrix p for Gradient Matrix g
$g_t^a = p_t^a - \frac{1}{p_t^a \cdot \mathrm{nll}} \sum_{s:\, l_s = a} \alpha_t^s \beta_t^s$
For the large alphabet of Chinese, the sum is 0 for more than 99% of the characters.
So more than 99% of the elements of matrix g are the same as in matrix p, and nearly half of the time is spent copying them from matrix p to matrix g.
Matrix p is no longer used after the gradient computation.
By reusing the memory of matrix p for gradient matrix g, only the gradients of less than 1% of the matrix elements need to be updated.
This is not necessary for languages with a small alphabet, like English.
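
A minimal sketch of the in-place update follows. It assumes the per-character sums of α·β have already been reduced and are passed in as one small dictionary per time-step, containing only the characters that occur in the augmented label sequence; that data layout, the function name and the reading of the normalizer as the total sequence likelihood are assumptions, not details from the talk.

```python
def gradient_in_place(probs, keyed_sums, likelihood):
    """Turn the softmax matrix p into the gradient matrix g without copying.

    probs      : T rows of A softmax outputs, overwritten with the gradient
    keyed_sums : T dicts mapping a character index to its sum of alpha*beta
    likelihood : normalizer of the gradient formula (assumed > 0)
    """
    for t, row in enumerate(probs):
        # Characters absent from the label have a zero sum, so g equals p
        # there and the entry is simply left as it is.
        for a, ab_sum in keyed_sums[t].items():
            row[a] -= ab_sum / (row[a] * likelihood)
    return probs
```

With L = 20 and A = 5000, at most 21 of the 5000 entries in a row are touched, which matches the "less than 1%" figure on the slide.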

21 GRADIENT MATRIX COMPUTATION: Performance on Titan X, Small Alphabet Size
T=150, L=40, A=28
       warp-ctc   optimized   speedup
N=1    2.16ms     0.02ms      x
N=     ms         0.06ms      37.26x
N=     ms         0.11ms      19.32x
N=     ms         0.21ms      10.49x
N=     ms         0.41ms      5.52x
For warp-ctc, this is the run time of kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel.

22 GRADIENT MATRIX COMPUTATION: Performance on Titan X, Large Alphabet Size
T=150, L=20, A=5000
       warp-ctc   optimized   speedup
N=1    5.52ms     0.04ms      x
N=     ms         0.21ms      30.28x
N=     ms         0.47ms      13.73x
N=     ms         0.78ms      8.67x
N=     ms         1.56ms      4.63x
For warp-ctc, this is the run time of kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel.

23 OVERALL PERFORMANCE

24 OVERALL PERFORMANCE: CTC (Alpha+Beta+Gradient) on Titan X, Small Alphabet Size
T=150, L=40, A=28
       warp-ctc   optimized   speedup
N=1    2.98ms     0.45ms      6.57x
N=     ms         0.51ms      5.92x
N=     ms         0.58ms      5.25x
N=     ms         0.72ms      4.27x
N=     ms         1.01ms      3.14x

25 OVERALL PERFORMANCE: CTC (Alpha+Beta+Gradient) on Titan X, Large Alphabet Size
T=150, L=20, A=5000
       warp-ctc   optimized   speedup
N=1    6.34ms     0.54ms      11.67x
N=     ms         0.77ms      9.43x
N=     ms         1.04ms      7.14x
N=     ms         1.36ms      5.67x
N=     ms         2.15ms      3.81x

26 OVERALL PERFORMANCE: Softmax+CTC on Titan X, Small Alphabet Size
T=150, L=40, A=28
       warp-ctc   optimized   speedup
N=1    3.12ms     0.59ms      5.28x
N=     ms         0.65ms      4.89x
N=     ms         0.88ms      3.65x
N=     ms         1.08ms      3.07x
N=     ms         1.37ms      2.56x

27 OVERALL PERFORMANCE: Softmax+CTC on Titan X, Large Alphabet Size
T=150, L=20, A=5000
       warp-ctc   optimized   speedup
N=1    6.61ms     0.79ms      8.34x
N=     ms         2.69ms      3.40x
N=     ms         4.92ms      2.24x
N=     ms         8.67ms      1.71x
N=     ms         16.49ms     1.36x

28 THANK YOU
Join the NVIDIA Developer Program at developer.nvidia.com/join
