April 4-7, 2016, Silicon Valley

HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU
Minmin Sun, NVIDIA (minmins@nvidia.com)
April 5th
AGENDA
Brief Introduction of CTC
Alpha/Beta Matrix Computation
Gradient Matrix Computation
Overall Performance
BRIEF INTRODUCTION OF CTC
BRIEF INTRODUCTION OF CTC
Overview

[Figure: an RNN unrolled over frames t-1, t, t+1. Each frame has a hidden layer h[t], an output layer y[t], and a softmax output p[t]; CTC consumes p[t] and produces the gradient g[t]. Label sequence: C, A, T.]

CTC is a loss function used to train the RNN.
Inputs: (1) p, the softmax output; (2) the label sequence
Output: g, the gradient w.r.t. the output layer
CTC includes: (1) Alpha computation, (2) Beta computation, (3) Gradient computation
BRIEF INTRODUCTION OF CTC
Alpha/Beta Matrix Computation

If $l'_s = \text{blank}$ or $l'_s = l'_{s-2}$:
$$\alpha_t(s) = \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\bigr)\, p_t(l'_s)$$
else:
$$\alpha_t(s) = \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\, p_t(l'_s)$$

Matrix dimensions: T rows × S columns
S = 2L + 1 is the length of the augmented label sequence l'
L is the number of characters in the original label sequence
T is the number of time-steps in the utterance
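For example, with the label sequence C, A, T from the overview (so L = 3), the augmentation interleaves blanks before, between, and after the characters:
$$l' = (\text{blank},\ \mathrm{C},\ \text{blank},\ \mathrm{A},\ \text{blank},\ \mathrm{T},\ \text{blank}), \qquad S = 2 \cdot 3 + 1 = 7$$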
BRIEF INTRODUCTION OF CTC
Alpha/Beta Matrix Computation

[Figure: the Alpha matrix with the augmented label sequence l' = (blank, C, blank, A, blank, T, blank) along its columns; the cell α_t(s) is computed from α_{t-1}(s), α_{t-1}(s-1), and α_{t-1}(s-2) in the previous row.]
BRIEF INTRODUCTION OF CTC
Gradient Matrix Computation

$$g_t(a) = p_t(a) - \frac{e^{\mathrm{nll}}}{p_t(a)} \sum_{s:\, l'_s = a} \alpha_t(s)\,\beta_t(s)$$

where nll is the negative log-likelihood of the utterance.

Matrix dimensions: T rows × A columns
A is the alphabet size, e.g. 28 for English
The sum is a key-value reduction using the character l'_s as the key
BRIEF INTRODUCTION OF CTC
Gradient Matrix Computation

[Figure: one row of α·β products, indexed by the augmented sequence l' = (blank, C, blank, A, blank, T, blank), is reduced into one row of the gradient matrix g, indexed by the alphabet (blank, A, B, C, …, Z, space).]
ALPHA/BETA MATRIX COMPUTATION
ALPHA/BETA MATRIX COMPUTATION
GPU Implementation

Each CUDA block owns one sequence, i.e. #blocks is the minibatch size.
Each thread owns one column of the Alpha/Beta matrix.
Threads iterate over the matrix rows, with a synchronization after each iteration.

[Figure: within a block, threads s-2, s-1, and s each own a column; α_t(s) combines α_{t-1}(s-2), α_{t-1}(s-1), and α_{t-1}(s) from the previous row.]
ALPHA/BETA MATRIX COMPUTATION
Data Reuse

If $l'_s = \text{blank}$ or $l'_s = l'_{s-2}$:
$$\alpha_t(s) = \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\bigr)\, p_t(l'_s)$$
else:
$$\alpha_t(s) = \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\, p_t(l'_s)$$

l'_s and l'_{s-2} are used by every iteration and never change across iterations, so load them into the register file once and reuse them in all of the thread's iterations.
ALPHA/BETA MATRIX COMPUTATION
Data Reuse

(Same recurrence as above.)

α_{t-1}(s) is the output of the same thread's previous iteration, so it can be carried along in the register file.
ALPHA/BETA MATRIX COMPUTATION
Data Reuse

(Same recurrence as above.)

α_{t-1}(s-1) and α_{t-1}(s-2) are outputs of the previous iteration of other threads in the same block, so they are exchanged through shared memory.
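Putting the mapping and the three reuse patterns together, the alpha pass becomes a single kernel. The following is a minimal sketch, not the measured implementation: the kernel name, argument layout, and launch assumptions (one utterance per block, blockDim.x == S, equal-length utterances, S floats of dynamic shared memory) are invented here, and it works in the probability domain, whereas a production kernel would use log-space or rescaled probabilities for numerical stability.

```cuda
// Sketch of the alpha kernel: one block per utterance, one thread per
// column s of the Alpha matrix. All names and layouts are illustrative.
__global__ void ctc_alpha_kernel(const float* __restrict__ p,      // N x T x A softmax
                                 const int*   __restrict__ labels, // N x S augmented l'
                                 float* __restrict__ alpha,        // N x T x S output
                                 int T, int S, int A, int blank)
{
    extern __shared__ float prev_row[];      // alpha[t-1][*], S floats
    const int s = threadIdx.x;
    p      += (size_t)blockIdx.x * T * A;    // this block's utterance
    labels += (size_t)blockIdx.x * S;
    alpha  += (size_t)blockIdx.x * T * S;

    // l'_s and l'_{s-2} are invariant across t: keep them in registers.
    const int  ls       = labels[s];
    const int  ls2      = (s >= 2) ? labels[s - 2] : -1;
    const bool two_term = (ls == blank) || (ls == ls2);

    // t = 0: only s = 0 (blank) and s = 1 (first character) are reachable.
    float a = (s < 2) ? p[ls] : 0.0f;        // alpha[0][s], lives in a register
    alpha[s] = a;

    for (int t = 1; t < T; ++t) {
        prev_row[s] = a;                     // publish alpha[t-1][s] to neighbors
        __syncthreads();

        float sum = a;                                    // alpha[t-1][s]
        if (s >= 1) sum += prev_row[s - 1];               // alpha[t-1][s-1]
        if (!two_term && s >= 2) sum += prev_row[s - 2];  // alpha[t-1][s-2]

        a = sum * p[t * A + ls];             // alpha[t][s] stays in a register
        alpha[t * S + s] = a;
        __syncthreads();                     // prev_row may now be overwritten
    }
}
```

The beta pass is symmetric: it iterates from t = T-1 down to 0, with the dependency window on columns s, s+1, s+2.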
ALPHA/BETA MATRIX COMPUTATION
Performance on Titan X, Small Alphabet Size
T=150, L=40, A=28 (N is the minibatch size)

        warp-ctc   optimized   speedup
N=1     0.41 ms    0.22 ms     1.89x
N=16    0.42 ms    0.23 ms     1.84x
N=32    0.42 ms    0.23 ms     1.82x
N=64    0.43 ms    0.26 ms     1.70x
N=128   0.47 ms    0.30 ms     1.56x

warp-ctc: https://github.com/baidu-research/warp-ctc
ALPHA/BETA MATRIX COMPUTATION
Performance on Titan X, Large Alphabet Size
T=150, L=20, A=5000 (N is the minibatch size)

        warp-ctc   optimized   speedup
N=1     0.41 ms    0.25 ms     1.65x
N=16    0.47 ms    0.28 ms     1.66x
N=32    0.47 ms    0.28 ms     1.65x
N=64    0.48 ms    0.29 ms     1.65x
N=128   0.50 ms    0.30 ms     1.68x

warp-ctc: https://github.com/baidu-research/warp-ctc
GRADIENT MATRIX COMPUTATION
GRADIENT MATRIX COMPUTATION
GPU Implementation

Each block owns one row of the Alpha and Beta matrices, i.e. #blocks = minibatch × T.
Within each block, the key-value reduction is done with atomic operations on shared memory.

[Figure: block t reduces its row of α·β products into a shared-memory accumulator indexed by character (blank, A, B, C, …, Z, space), which is then written to row t of the gradient matrix g.]
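A minimal sketch of this reduction under the same caveats as before (invented names, a single utterance so gridDim.x == T, and the per-utterance factor e^{nll} precomputed on the host):

```cuda
// Sketch of the gradient kernel: one block per row t, a shared accumulator
// of A floats, atomicAdd keyed by the character l'_s.
__global__ void ctc_grad_kernel(const float* __restrict__ p,      // T x A softmax
                                const float* __restrict__ alpha,  // T x S
                                const float* __restrict__ beta,   // T x S
                                const int*   __restrict__ labels, // l', length S
                                float* __restrict__ g,            // T x A output
                                int S, int A, float exp_nll)      // exp_nll = e^{nll}
{
    extern __shared__ float accum[];          // one slot per character
    const int t = blockIdx.x;

    for (int a = threadIdx.x; a < A; a += blockDim.x)
        accum[a] = 0.0f;                      // zero the accumulator
    __syncthreads();

    // Key-value reduction: key = character l'_s, value = alpha_t(s)*beta_t(s).
    for (int s = threadIdx.x; s < S; s += blockDim.x)
        atomicAdd(&accum[labels[s]], alpha[t * S + s] * beta[t * S + s]);
    __syncthreads();

    // g_t(a) = p_t(a) - e^{nll} / p_t(a) * sum over {s : l'_s = a}
    for (int a = threadIdx.x; a < A; a += blockDim.x) {
        const float pa = p[t * A + a];
        g[t * A + a] = pa - exp_nll / pa * accum[a];
    }
}
```

The atomics serialize whenever several positions s share a character, which is exactly what the next two optimizations attack.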
GRADIENT MATRIX COMPUTATION
Compute for Blanks Separately

Blanks contribute most of the address conflicts, and we know their exact positions in the label sequence: every even index of l'. Handled separately, the blanks become an ordinary parallel-reduction problem, as the fragment below sketches.

[Figure: the blank positions of l' = (blank, C, blank, A, blank, T, blank) are gathered out of shared memory and summed by a standard tree reduction.]
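A device-side fragment along these lines could replace the blanks' atomic updates in the kernel sketch above (it reuses that sketch's names and assumes blockDim.x == 256, with 256 ≥ (S+1)/2, and `blank` holding the blank character's alphabet index):

```cuda
// Fragment: blanks occupy the L+1 even positions of l', so their
// alpha*beta products can be summed by a standard shared-memory tree
// reduction instead of atomicAdd.
__shared__ float blank_part[256];
const int i = threadIdx.x;                    // i-th blank sits at s = 2*i
blank_part[i] = (2 * i < S) ? alpha[t * S + 2 * i] * beta[t * S + 2 * i]
                            : 0.0f;
__syncthreads();
for (int stride = 128; stride > 0; stride >>= 1) {  // tree reduction
    if (i < stride) blank_part[i] += blank_part[i + stride];
    __syncthreads();
}
if (i == 0) accum[blank] = blank_part[0];     // single write, no conflicts
```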
GRADIENT MATRIX COMPUTATION
Allocate Redundant Shared Memory

Keeping several redundant copies of the accumulator reduces address conflicts for the atomic operations. The results in the redundant shared-memory elements are then accumulated for each character in parallel (see the fragment below).
Not applicable for languages with a large alphabet, like Chinese, where the extra copies would not fit in shared memory.
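One way to realize this, sketched with an assumed redundancy factor R = 8 (a tuning knob, not a value from the talk):

```cuda
// Fragment: R copies of the A-slot accumulator spread the atomic traffic
// across R addresses per character. Shared memory grows to R * A floats,
// which is why this only fits small alphabets.
const int R = 8;                              // assumed redundancy factor
extern __shared__ float accum_r[];            // R * A floats
for (int i = threadIdx.x; i < R * A; i += blockDim.x)
    accum_r[i] = 0.0f;
__syncthreads();

const int copy = threadIdx.x % R;             // this thread's copy
for (int s = threadIdx.x; s < S; s += blockDim.x)
    atomicAdd(&accum_r[copy * A + labels[s]],
              alpha[t * S + s] * beta[t * S + s]);
__syncthreads();

// Fold the R copies back together, one thread per character in parallel.
for (int a = threadIdx.x; a < A; a += blockDim.x) {
    float sum = 0.0f;
    for (int r = 0; r < R; ++r)
        sum += accum_r[r * A + a];
    accum_r[a] = sum;                         // final per-character sums in copy 0
}
```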
GRADIENT MATRIX COMPUTATION
Reuse the Memory of Matrix p for Gradient Matrix g

$$g_t(a) = p_t(a) - \frac{e^{\mathrm{nll}}}{p_t(a)} \sum_{s:\, l'_s = a} \alpha_t(s)\,\beta_t(s)$$

For the large alphabet of Chinese, the sum is 0 for more than 99% of the characters. So more than 99% of the elements of Matrix g equal the corresponding elements of Matrix p, and nearly half of the time is spent copying them from Matrix p to Matrix g.
Matrix p is no longer needed after the gradient computation. By reusing its memory for Matrix g, we only have to update the gradients of less than 1% of the matrix elements (see the sketch below).
Not necessary for languages with a small alphabet, like English.
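A sketch of the in-place variant, again with invented names and blockDim.x == S. Here the first occurrence of each character in l' is chosen as the single writer, which is one possible way (not necessarily the talk's) to avoid double updates for characters that repeat in the label:

```cuda
// Sketch: Matrix g aliases Matrix p's memory, so g_t(a) = p_t(a) already
// holds for every character absent from l'; only the <= S characters that
// occur in l' are corrected in place. One block per row t, S threads.
__global__ void ctc_grad_inplace_kernel(float* __restrict__ pg,           // T x A: p in, g out
                                        const float* __restrict__ alpha,  // T x S
                                        const float* __restrict__ beta,   // T x S
                                        const int*   __restrict__ labels, // l', length S
                                        int S, int A, float exp_nll)
{
    extern __shared__ float prod[];        // alpha_t(s) * beta_t(s), S floats
    const int t = blockIdx.x;
    const int s = threadIdx.x;             // requires blockDim.x == S
    const int c = labels[s];
    prod[s] = alpha[t * S + s] * beta[t * S + s];
    __syncthreads();

    // Exactly one thread per distinct character applies the correction:
    // the first occurrence of c in l'. S is small, so the scans are cheap.
    bool first_occurrence = true;
    for (int j = 0; j < s; ++j)
        if (labels[j] == c) first_occurrence = false;
    if (first_occurrence) {
        float sum = 0.0f;
        for (int j = s; j < S; ++j)
            if (labels[j] == c) sum += prod[j];
        const float pa = pg[t * A + c];
        pg[t * A + c] = pa - exp_nll / pa * sum;   // overwrite p with g in place
    }
}
```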
GRADIENT MATRIX COMPUTATION
Performance on Titan X, Small Alphabet Size
T=150, L=40, A=28 (N is the minibatch size)

        warp-ctc   optimized   speedup
N=1     2.16 ms    0.02 ms     134.89x
N=16    2.19 ms    0.06 ms     37.26x
N=32    2.20 ms    0.11 ms     19.32x
N=64    2.23 ms    0.21 ms     10.49x
N=128   2.24 ms    0.41 ms     5.52x

For warp-ctc, this is the run time of the kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel.
GRADIENT MATRIX COMPUTATION
Performance on Titan X, Large Alphabet Size
T=150, L=20, A=5000 (N is the minibatch size)

        warp-ctc   optimized   speedup
N=1     5.52 ms    0.04 ms     128.26x
N=16    6.36 ms    0.21 ms     30.28x
N=32    6.49 ms    0.47 ms     13.73x
N=64    6.75 ms    0.78 ms     8.67x
N=128   7.20 ms    1.56 ms     4.63x

For warp-ctc, this is the run time of the kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel.
OVERALL PERFORMANCE
OVERALL PERFORMANCE
CTC (Alpha + Beta + Gradient) on Titan X, Small Alphabet Size
T=150, L=40, A=28 (N is the minibatch size)

        warp-ctc   optimized   speedup
N=1     2.98 ms    0.45 ms     6.57x
N=16    3.03 ms    0.51 ms     5.92x
N=32    3.05 ms    0.58 ms     5.25x
N=64    3.10 ms    0.72 ms     4.27x
N=128   3.18 ms    1.01 ms     3.14x
OVERALL PERFORMANCE
CTC (Alpha + Beta + Gradient) on Titan X, Large Alphabet Size
T=150, L=20, A=5000 (N is the minibatch size)

        warp-ctc   optimized   speedup
N=1     6.34 ms    0.54 ms     11.67x
N=16    7.30 ms    0.77 ms     9.43x
N=32    7.43 ms    1.04 ms     7.14x
N=64    7.71 ms    1.36 ms     5.67x
N=128   8.20 ms    2.15 ms     3.81x
OVERALL PERFORMANCE
Softmax + CTC on Titan X, Small Alphabet Size
T=150, L=40, A=28 (N is the minibatch size)

        warp-ctc   optimized   speedup
N=1     3.12 ms    0.59 ms     5.28x
N=16    3.16 ms    0.65 ms     4.89x
N=32    3.20 ms    0.88 ms     3.65x
N=64    3.30 ms    1.08 ms     3.07x
N=128   3.49 ms    1.37 ms     2.56x
OVERALL PERFORMANCE
Softmax + CTC on Titan X, Large Alphabet Size
T=150, L=20, A=5000 (N is the minibatch size)

        warp-ctc   optimized   speedup
N=1     6.61 ms    0.79 ms     8.34x
N=16    9.13 ms    2.69 ms     3.40x
N=32    11.01 ms   4.92 ms     2.24x
N=64    14.83 ms   8.67 ms     1.71x
N=128   22.36 ms   16.49 ms    1.36x
April 4-7, 2016, Silicon Valley

THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join