HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU


April 4-7, 2016, Silicon Valley
HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU
Minmin Sun, NVIDIA (minmins@nvidia.com)
April 5th

AGENDA
- Brief Introduction of CTC
- Alpha/Beta Matrix Computation
- Gradient Matrix Computation
- Overall Performance

BRIEF INTRODUCTION OF CTC

BRIEF INTRODUCTION OF CTC: Overview

[Figure: an RNN unrolled over frames t-1, t, t+1. Each frame has a hidden layer h[t], an output layer y[t], and a softmax output p[t]. CTC consumes the softmax outputs together with the label sequence (e.g. C, A, T) and produces the gradients g[t] with respect to the output layer.]

CTC is a loss function used to train the RNN.
Inputs: (1) p, the softmax output; (2) the label sequence.
Output: g, the gradient w.r.t. the output layer.
CTC includes: (1) alpha computation, (2) beta computation, (3) gradient computation.

BRIEF INTRODUCTION OF CTC: Alpha/Beta Matrix Computation

$$\alpha_t(s) = \begin{cases} \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\bigr)\, p_t(l'_s) & \text{if } l'_s = \mathrm{blank} \text{ or } l'_s = l'_{s-2} \\ \bigl(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\bigr)\, p_t(l'_s) & \text{otherwise} \end{cases}$$

Matrix dimensions: T rows * S columns
- S = 2L + 1 is the length of the augmented label sequence l'
- L is the number of characters in the original label sequence
- T is the number of time-steps in the utterance
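To make the recursion concrete, here is a minimal host-side reference sketch (added for illustration, not code from the talk; the function name compute_alpha_ref, the row-major T x A / T x S layouts, and the non-log-space arithmetic are assumptions that simply mirror the formula above):

```cuda
// Host-side reference for the alpha recursion above (illustrative only).
// probs is the T x A softmax matrix, labels_aug the augmented label sequence
// l' of length S = 2L + 1, with blanks at the even positions.
#include <vector>

std::vector<float> compute_alpha_ref(const std::vector<float>& probs, int T, int A,
                                     const std::vector<int>& labels_aug, int blank) {
    const int S = static_cast<int>(labels_aug.size());
    std::vector<float> alpha(T * S, 0.0f);

    // t = 0: a path may start with the leading blank or with the first label.
    alpha[0 * S + 0] = probs[0 * A + labels_aug[0]];
    if (S > 1) alpha[0 * S + 1] = probs[0 * A + labels_aug[1]];

    for (int t = 1; t < T; ++t) {
        for (int s = 0; s < S; ++s) {
            float sum = alpha[(t - 1) * S + s];
            if (s >= 1) sum += alpha[(t - 1) * S + s - 1];
            // The skip transition is only allowed for a non-blank that differs
            // from the label two positions back.
            if (s >= 2 && labels_aug[s] != blank && labels_aug[s] != labels_aug[s - 2])
                sum += alpha[(t - 1) * S + s - 2];
            alpha[t * S + s] = sum * probs[t * A + labels_aug[s]];
        }
    }
    return alpha;
}
```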

BRIEF INTRODUCTION OF CTC: Alpha/Beta Matrix Computation

[Figure: the alpha matrix laid out over time-steps t (rows) and the augmented label sequence l' = (blank, c, blank, a, blank, t, blank) (columns). The entry α_t(s) depends on α_{t-1}(s), α_{t-1}(s-1) and α_{t-1}(s-2) from the previous row.]

BRIEF INTRODUCTION OF CTC: Gradient Matrix Computation

$$g_t(a) = p_t(a) - \frac{1}{p_t(a)\, e^{-\mathrm{nll}}} \sum_{s:\; l'_s = a} \alpha_t(s)\, \beta_t(s)$$

Matrix dimensions: T rows * A columns
- A is the alphabet size, e.g. 28 for English
- The sum is a key-value reduction using the character l'_s as the key
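A matching host-side sketch of the gradient formula (again illustrative; compute_grad_ref and the argument layout are assumptions, and nll is the negative log-likelihood of the utterance, so e^{-nll} = p(l|x)):

```cuda
// Host-side reference for the gradient formula above (illustrative only).
// alpha and beta are T x S, probs and grad are T x A.
#include <cmath>
#include <vector>

void compute_grad_ref(const std::vector<float>& alpha, const std::vector<float>& beta,
                      const std::vector<float>& probs, const std::vector<int>& labels_aug,
                      int T, int A, float nll, std::vector<float>& grad) {
    const int S = static_cast<int>(labels_aug.size());
    const float inv_lik = std::exp(nll);   // 1 / p(l|x), since nll = -ln p(l|x)
    grad = probs;                          // g_t(a) starts from p_t(a)

    for (int t = 0; t < T; ++t) {
        // Key-value reduction: key = character l'_s, value = alpha * beta.
        std::vector<float> acc(A, 0.0f);
        for (int s = 0; s < S; ++s)
            acc[labels_aug[s]] += alpha[t * S + s] * beta[t * S + s];
        for (int a = 0; a < A; ++a)
            if (acc[a] != 0.0f)
                grad[t * A + a] -= acc[a] * inv_lik / probs[t * A + a];
    }
}
```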

BRIEF INTRODUCTION OF CTC: Gradient Matrix Computation

[Figure: at time-step t, the alpha and beta rows over l' = (blank, C, blank, A, blank, T, blank) are reduced by character into one row of the gradient matrix g over the full alphabet (blank, A, B, C, ..., Z, space).]

ALPHA/BETA MATRIX COMPUTATION

ALPHA/BETA MATRIX COMPUTATION: GPU Implementation
- Each CUDA block owns one sequence, i.e. the number of blocks is the minibatch size.
- Each thread owns one column of the Alpha/Beta matrix.
- Threads iterate over the matrix rows, with a synchronization after each iteration.

[Figure: within a block, the threads for columns s-2, s-1 and s each hold one column; α_t(s) is computed from α_{t-1}(s-2), α_{t-1}(s-1) and α_{t-1}(s).]
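A minimal CUDA sketch of this mapping, written from the description on this slide (not the talk's actual kernel; the [N, T, A] and [N, T, S] layouts and the kernel name are assumptions). It would be launched with one block per sequence and one thread per column, e.g. alpha_kernel<<<N, S, S * sizeof(float)>>>(...):

```cuda
// One block per sequence, one thread per column of the alpha matrix,
// one synchronization per matrix row. Assumes blockDim.x == S.
__global__ void alpha_kernel(const float* probs,      // [N, T, A] softmax outputs
                             const int*   labels_aug, // [N, S] augmented labels l'
                             float*       alpha,      // [N, T, S]
                             int T, int S, int A, int blank) {
    extern __shared__ float prev[];   // previous alpha row (S floats)
    const int n = blockIdx.x;         // sequence owned by this block
    const int s = threadIdx.x;        // column owned by this thread

    // t = 0: only the leading blank and the first label can start a path.
    const float a0 = (s <= 1) ? probs[(n * T) * A + labels_aug[n * S + s]] : 0.0f;
    alpha[(n * T) * S + s] = a0;
    prev[s] = a0;
    __syncthreads();

    for (int t = 1; t < T; ++t) {
        // Labels are re-read from global memory every row here; the data-reuse
        // slides below move them (and the thread's own alpha) into registers.
        const int lab  = labels_aug[n * S + s];
        const int lab2 = (s >= 2) ? labels_aug[n * S + s - 2] : -1;

        float sum = prev[s];
        if (s >= 1) sum += prev[s - 1];
        if (s >= 2 && lab != blank && lab != lab2) sum += prev[s - 2];
        const float a = sum * probs[(n * T + t) * A + lab];

        alpha[(n * T + t) * S + s] = a;
        __syncthreads();              // everyone has finished reading prev[] for this row
        prev[s] = a;
        __syncthreads();              // new row fully written before the next iteration
    }
}
```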

ALPHA/BETA MATRIX COMPUTATION: Data Reuse

Recall that α_t(s) is computed from α_{t-1}(s), α_{t-1}(s-1) and, when l'_s is not a blank and l'_s ≠ l'_{s-2}, also α_{t-1}(s-2), all scaled by p_t(l'_s).

- l'_s and l'_{s-2} are used in every iteration and are invariant across iterations.
- So load them into the register file once and reuse them in all of the thread's iterations.

ALPHA/BETA MATRIX COMPUTATION: Data Reuse
- α_{t-1}(s) is the output of the same thread's previous iteration.
- It can therefore be carried across iterations in the register file.

ALPHA/BETA MATRIX COMPUTATION: Data Reuse
- α_{t-1}(s-1) and α_{t-1}(s-2) are outputs of the previous iteration of the other threads in the same block.
- They can therefore be exchanged through shared memory, as in the sketch below.
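Putting the three data-reuse points together, the kernel sketch from above becomes (same assumptions as before; this is a reconstruction of the described optimizations, not the original code):

```cuda
// (1) l'_s and l'_{s-2} cached in registers, (2) the thread's own alpha_{t-1}(s)
// carried in a register across iterations, (3) only the two left neighbors
// exchanged through shared memory. Assumes blockDim.x == S.
__global__ void alpha_kernel_reuse(const float* probs, const int* labels_aug,
                                   float* alpha, int T, int S, int A, int blank) {
    extern __shared__ float prev[];   // previous row, used only for the neighbor exchange
    const int n = blockIdx.x;
    const int s = threadIdx.x;

    // (1) Loop-invariant labels: load once into registers.
    const int  lab     = labels_aug[n * S + s];
    const int  lab2    = (s >= 2) ? labels_aug[n * S + s - 2] : -1;
    const bool no_skip = (lab == blank) || (lab == lab2);

    // (2) alpha_{t-1}(s) is this thread's own previous output: keep it in a register.
    float a = (s <= 1) ? probs[(n * T) * A + lab] : 0.0f;
    alpha[(n * T) * S + s] = a;

    for (int t = 1; t < T; ++t) {
        // (3) alpha_{t-1}(s-1) and alpha_{t-1}(s-2) come from other threads:
        // publish the own value and read the neighbors through shared memory.
        prev[s] = a;
        __syncthreads();

        float sum = a;                // own previous value, straight from the register
        if (s >= 1) sum += prev[s - 1];
        if (s >= 2 && !no_skip) sum += prev[s - 2];

        a = sum * probs[(n * T + t) * A + lab];
        alpha[(n * T + t) * S + s] = a;
        __syncthreads();              // prev[] may be overwritten in the next iteration
    }
}
```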

ALPHA/BETA MATRIX COMPUTATION: Performance on Titan X, Small Alphabet Size (T=150, L=40, A=28)

         warp-ctc   optimized   speedup
  N=1    0.41ms     0.22ms      1.89x
  N=16   0.42ms     0.23ms      1.84x
  N=32   0.42ms     0.23ms      1.82x
  N=64   0.43ms     0.26ms      1.70x
  N=128  0.47ms     0.30ms      1.56x

warp-ctc: https://github.com/baidu-research/warp-ctc

ALPHA/BETA MATRIX COMPUTATION: Performance on Titan X, Large Alphabet Size (T=150, L=20, A=5000)

         warp-ctc   optimized   speedup
  N=1    0.41ms     0.25ms      1.65x
  N=16   0.47ms     0.28ms      1.66x
  N=32   0.47ms     0.28ms      1.65x
  N=64   0.48ms     0.29ms      1.65x
  N=128  0.50ms     0.30ms      1.68x

warp-ctc: https://github.com/baidu-research/warp-ctc

GRADIENT MATRIX COMPUTATION

GRADIENT MATRIX COMPUTATION: GPU Implementation
- Each block owns one row of the Alpha and Beta matrices, i.e. the number of blocks is minibatch * T.
- Within each block, the key-value reduction is done through atomic operations on shared memory.

[Figure: block t reduces the alpha and beta rows through its shared memory into the gradient row g over the alphabet (blank, A, B, C, ..., Z, space).]
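A sketch of this per-row reduction (an illustrative reconstruction; it assumes the gradient matrix has been pre-filled with p, that nll holds the per-utterance negative log-likelihoods, and a launch such as grad_kernel<<<N * T, 128, A * sizeof(float)>>>(...)):

```cuda
// One block per (sequence, time-step) row; key-value reduction with atomics
// on a shared-memory accumulator indexed by character.
__global__ void grad_kernel(const float* alpha,      // [N, T, S]
                            const float* beta,       // [N, T, S]
                            const float* probs,      // [N, T, A]
                            const int*   labels_aug, // [N, S]
                            const float* nll,        // [N] negative log-likelihoods
                            float*       grad,       // [N, T, A], pre-filled with probs
                            int T, int S, int A) {
    extern __shared__ float acc[];                   // A floats, one slot per character
    const int n = blockIdx.x / T;                    // sequence index
    const int t = blockIdx.x % T;                    // time-step index

    for (int a = threadIdx.x; a < A; a += blockDim.x) acc[a] = 0.0f;
    __syncthreads();

    // Key-value reduction: key = character l'_s, value = alpha * beta.
    for (int s = threadIdx.x; s < S; s += blockDim.x) {
        const float ab = alpha[(n * T + t) * S + s] * beta[(n * T + t) * S + s];
        atomicAdd(&acc[labels_aug[n * S + s]], ab);
    }
    __syncthreads();

    // g_t(a) = p_t(a) - acc[a] / (p_t(a) * exp(-nll))
    const float inv_lik = expf(nll[n]);
    for (int a = threadIdx.x; a < A; a += blockDim.x)
        if (acc[a] != 0.0f)
            grad[(n * T + t) * A + a] -= acc[a] * inv_lik / probs[(n * T + t) * A + a];
}
```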

GRADIENT MATRIX COMPUTATION: Compute for Blanks Separately
- Blanks contribute most of the address conflicts.
- We know their exact positions in the label sequence (every other position of l').
- Computing the blanks separately therefore becomes a normal parallel reduction problem.

[Figure: within block t, the blank entries of the augmented sequence (blank, C, blank, A, blank, T, blank) are reduced separately in shared memory from the non-blank characters.]
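A sketch of that separate blank path (again an illustrative reconstruction): the blank entries sit at the even positions of l', so their alpha-beta products can be summed with an ordinary shared-memory tree reduction, while a kernel like the one above would then handle only the odd (non-blank) positions.

```cuda
// One block per (sequence, time-step) row; plain tree reduction over the even
// positions of l'. Assumes blockDim.x is a power of two.
__global__ void grad_blank_kernel(const float* alpha, const float* beta,
                                  const float* probs, const float* nll,
                                  float* grad, int T, int S, int A, int blank) {
    extern __shared__ float partial[];               // blockDim.x floats
    const int n = blockIdx.x / T;
    const int t = blockIdx.x % T;

    float sum = 0.0f;
    for (int s = 2 * threadIdx.x; s < S; s += 2 * blockDim.x)   // blanks: even positions
        sum += alpha[(n * T + t) * S + s] * beta[(n * T + t) * S + s];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction over the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        const float p = probs[(n * T + t) * A + blank];
        grad[(n * T + t) * A + blank] = p - partial[0] * expf(nll[n]) / p;
    }
}
```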

GRADIENT MATRIX COMPUTATION: Allocate Redundant Shared Memory
- Allocating redundant copies of the shared-memory accumulator reduces address conflicts for the atomic operations.
- The partial results in the redundant shared-memory elements are then accumulated for each character in parallel.
- Not applicable for languages with a large alphabet size, like Chinese.
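One way such a replicated accumulator can look for a small alphabet (a sketch; NUM_COPIES and the per-warp assignment of copies are assumed choices, not values from the talk):

```cuda
// Each warp updates its own copy of the accumulator, which spreads atomic
// updates to the same character across NUM_COPIES shared-memory addresses;
// the copies are summed at the end. Shared memory: NUM_COPIES * A floats.
#define NUM_COPIES 4

__global__ void grad_kernel_replicated(const float* alpha, const float* beta,
                                       const float* probs, const int* labels_aug,
                                       const float* nll, float* grad,
                                       int T, int S, int A) {
    extern __shared__ float acc[];
    const int n    = blockIdx.x / T;
    const int t    = blockIdx.x % T;
    const int copy = (threadIdx.x / 32) % NUM_COPIES;   // this thread's accumulator copy

    for (int i = threadIdx.x; i < NUM_COPIES * A; i += blockDim.x) acc[i] = 0.0f;
    __syncthreads();

    for (int s = threadIdx.x; s < S; s += blockDim.x) {
        const float ab = alpha[(n * T + t) * S + s] * beta[(n * T + t) * S + s];
        atomicAdd(&acc[copy * A + labels_aug[n * S + s]], ab);
    }
    __syncthreads();

    // Accumulate the redundant copies for each character in parallel.
    const float inv_lik = expf(nll[n]);
    for (int a = threadIdx.x; a < A; a += blockDim.x) {
        float total = 0.0f;
        for (int c = 0; c < NUM_COPIES; ++c) total += acc[c * A + a];
        if (total != 0.0f)
            grad[(n * T + t) * A + a] -= total * inv_lik / probs[(n * T + t) * A + a];
    }
}
```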

GRADIENT MATRIX COMPUTATION: Reuse the Memory of Matrix p for Gradient Matrix g

Recall g_t(a) = p_t(a) - (1 / (p_t(a) e^{-nll})) Σ_{s: l'_s = a} α_t(s) β_t(s).

- For the large alphabet of Chinese, the reduction term is 0 for more than 99% of the characters.
- So more than 99% of the elements of matrix g are the same as in matrix p, and nearly half the time is spent copying them from matrix p to matrix g.
- Matrix p is no longer needed after the gradient computation.
- By reusing the memory of matrix p for the gradient matrix g, we only need to update less than 1% of the matrix elements.
- Not necessary for languages with a small alphabet size, like English.
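A sketch of the in-place variant (illustrative; the kernel name and the read-then-sync-then-atomic pattern are a reconstruction): the gradient matrix aliases the softmax matrix p, so only entries whose character actually appears in l' are touched; everywhere else g_t(a) = p_t(a) already holds.

```cuda
// One block per (sequence, time-step) row, one thread per position of l'.
// Launch with blockDim.x >= S; p_and_g holds p on entry and g on exit.
__global__ void grad_kernel_inplace(const float* alpha, const float* beta,
                                    const int* labels_aug, const float* nll,
                                    float* p_and_g,        // [N, T, A]
                                    int T, int S, int A) {
    const int n = blockIdx.x / T;
    const int t = blockIdx.x % T;
    const int s = threadIdx.x;

    // Read p_t(l'_s) before any thread of this block starts writing the row.
    const int   lab = (s < S) ? labels_aug[n * S + s] : 0;
    const float p   = (s < S) ? p_and_g[(n * T + t) * A + lab] : 1.0f;
    const float ab  = (s < S) ? alpha[(n * T + t) * S + s] * beta[(n * T + t) * S + s] : 0.0f;
    __syncthreads();

    // Each position subtracts its own contribution; atomicAdd handles the case
    // where the same character occurs at several positions of l'.
    if (s < S && ab != 0.0f)
        atomicAdd(&p_and_g[(n * T + t) * A + lab], -ab * expf(nll[n]) / p);
}
```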

GRADIENT MATRIX COMPUTATION: Performance on Titan X, Small Alphabet Size (T=150, L=40, A=28)

         warp-ctc   optimized   speedup
  N=1    2.16ms     0.02ms      134.89x
  N=16   2.19ms     0.06ms      37.26x
  N=32   2.20ms     0.11ms      19.32x
  N=64   2.23ms     0.21ms      10.49x
  N=128  2.24ms     0.41ms      5.52x

For warp-ctc, this is the run time of the kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel.

GRADIENT MATRIX COMPUTATION: Performance on Titan X, Large Alphabet Size (T=150, L=20, A=5000)

         warp-ctc   optimized   speedup
  N=1    5.52ms     0.04ms      128.26x
  N=16   6.36ms     0.21ms      30.28x
  N=32   6.49ms     0.47ms      13.73x
  N=64   6.75ms     0.78ms      8.67x
  N=128  7.20ms     1.56ms      4.63x

For warp-ctc, this is the run time of the kernel compute_betas_grad_kernel minus the run time of compute_alpha_kernel.

OVERALL PERFORMANCE

OVERALL PERFORMANCE: CTC (Alpha+Beta+Gradient) on Titan X, Small Alphabet Size (T=150, L=40, A=28)

         warp-ctc   optimized   speedup
  N=1    2.98ms     0.45ms      6.57x
  N=16   3.03ms     0.51ms      5.92x
  N=32   3.05ms     0.58ms      5.25x
  N=64   3.10ms     0.72ms      4.27x
  N=128  3.18ms     1.01ms      3.14x

OVERALL PERFORMANCE: CTC (Alpha+Beta+Gradient) on Titan X, Large Alphabet Size (T=150, L=20, A=5000)

         warp-ctc   optimized   speedup
  N=1    6.34ms     0.54ms      11.67x
  N=16   7.30ms     0.77ms      9.43x
  N=32   7.43ms     1.04ms      7.14x
  N=64   7.71ms     1.36ms      5.67x
  N=128  8.20ms     2.15ms      3.81x

OVERALL PERFORMANCE: Softmax+CTC on Titan X, Small Alphabet Size (T=150, L=40, A=28)

         warp-ctc   optimized   speedup
  N=1    3.12ms     0.59ms      5.28x
  N=16   3.16ms     0.65ms      4.89x
  N=32   3.20ms     0.88ms      3.65x
  N=64   3.30ms     1.08ms      3.07x
  N=128  3.49ms     1.37ms      2.56x

OVERALL PERFORMANCE: Softmax+CTC on Titan X, Large Alphabet Size (T=150, L=20, A=5000)

         warp-ctc   optimized   speedup
  N=1    6.61ms     0.79ms      8.34x
  N=16   9.13ms     2.69ms      3.40x
  N=32   11.01ms    4.92ms      2.24x
  N=64   14.83ms    8.67ms      1.71x
  N=128  22.36ms    16.49ms     1.36x

April 4-7, 2016 Silicon Valley THANK YOU JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join