Deep Structured Prediction in Handwriting Recognition. Juan José Murillo Fuentes, P. M. Olmos (Univ. Carlos III) and J.C. Jaramillo (Univ. Sevilla)
1 Deep Structured Prediction in Handwriting Recognition. Juan José Murillo Fuentes, P. M. Olmos (Univ. Carlos III) and J.C. Jaramillo (Univ. Sevilla). Computational and Biological Learning Lab, Dept. of Engineering, University of Cambridge. Nov
2 Introduction
3 Deep Structured Prediction? A concept with a very broad meaning. From the ICML 2017 Workshop on Deep Structured Prediction: "many real problems involve highly dependent, structured variables. In such scenarios, it is desired or even necessary to model correlations and dependencies between the multiple input and output variables. Such problems arise in a wide range of domains, from natural language processing, computer vision, computational biology and others." From Wikipedia: "Structured prediction or structured (output) learning is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather than scalar discrete or real values (...) application domains including bioinformatics, natural language processing, speech recognition, and computer vision."
4 Goals. Objectives: understand deep learning concepts when recurrent networks (LSTM) are involved; overview temporal classification with the CTC; see how the theory can be implemented to automatically transcribe handwritten text.
5 Letter (digits) recognition with CNN
6 Letters (digits) recognition. Let's start with the problem of digit recognition (MNIST database). It is a multiclass classification problem, and here the output structure is simple compared to a whole text line. Straightforward solution: apply a simple NN?
7 MNIST with a simple NN. We use the softmax (output, $\hat{y}_k$) and the cross-entropy (cost function, $L$):
$\hat{y}_k = \dfrac{e^{a_k}}{\sum_{k'=1}^{K} e^{a_{k'}}}, \qquad L(y,\hat{y}) = -\dfrac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} y^{(n)}_k \log \hat{y}^{(n)}_k + \big(1 - y^{(n)}_k\big)\log\big(1 - \hat{y}^{(n)}_k\big)$
Training: back-propagation combined with stochastic gradient descent. Figures: Lipton, Berkowitz (2015), A Critical Review of Recurrent Neural Networks for Sequence Learning.
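As a quick reference, a minimal NumPy sketch of the two expressions above; the function names and the clipping constant are ours, not part of the TF sample code.

```python
import numpy as np

def softmax(a):
    # Subtract the row-wise max for numerical stability; each row then sums to 1.
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y, y_hat, eps=1e-12):
    # Mean over the N examples of the sum over the K classes, as in L(y, y_hat) above.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat), axis=1))
```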
8 MNIST with a very simple NN: $y = \sigma(Wx + b)$ (1), where $\sigma(\cdot)$ is the sigmoid function applied entry-wise (outputs between 0 and 1).
9 Learning of the simple NN for MNIST. We used the sample code in TF, but with Adam as optimizer and a larger number of iterations. It is a NN with just input and output layers, trained with minibatches of 100 images. Accuracy of 92% on the test set (number of correct outputs / size of the test set). Results shown with TensorBoard: orange is the test results, thin blue the train results and dark blue the averaged train results.
10 MNIST with multilayer and CNN: architecture. A convolutional NN is, perhaps, the most straightforward way to find and use structure in the prediction. Convolutional NNs can be seen as a fully connected structure where the weights are tied: the same weights are used several times. Multilayer architecture (see the TensorFlow documentation or Görner's blog), with 3,273,504 weights in total, where the MLP has an intermediate layer of 1024 units.
11 MNIST with multilayer and CNN: results. We get an accuracy of 99.3%. The best known solution uses a similar structure with 2 CNN layers to reach 99.79%. The best reported pure-MLP solution (6 layers) has an accuracy of 99.65%.
12 Problems with Long-Term Dependencies
13 Problems with long-term dependencies. Digit recognition: in the recognition of digits the input was restricted to a small image to provide one output out of 10; the size of the digits is uniform; learning local structures is enough; not a hard problem. Two problems in the recognition of letters in text lines: 1. In word and line recognition we should exploit information beyond the local region that a convolution can provide: we need to recover long-range dependencies. 2. Furthermore, we have to detect the positions of the characters (labellings) within the image: we need to recover the sequence of letters. To cope with these problems: 1. We use recurrent structures, long short-term memory (LSTM), as an evolution of recurrent NNs (RNN). 2. We use the connectionist temporal classification (CTC).
14 RNN and LSTM
15 RNN. In digit recognition, a CNN first looks for good local features encoding the digits. In a word or a line we should also use information from the previous and following letters. Recurrent NNs (RNN) are networks where every unit locally processes part of the input while including the result of the processing of the previous unit. They may use as the previous result either the output of the unit or its state. When using the state, they can provide an output at every unit or only at the end. While CNNs perform a filtering, RNNs update a state or memory:
$h^{(t)} = \tanh\!\big(b + W_{hh}\, h^{(t-1)} + W_{x}\, x^{(t)}\big), \qquad y^{(t)} = c + W_{yh}\, h^{(t)}$
Figure: A simple Recurrent Neural Network (unrolled over $x^{(1)}, \ldots, x^{(6)}$).
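A minimal sketch of this recursion in NumPy; the weight names and shapes are illustrative, not taken from any particular library.

```python
import numpy as np

def rnn_forward(X, W_hh, W_xh, W_yh, b, c):
    # X has shape (T, input_dim); the state h starts at zero and is updated at every step.
    h = np.zeros(W_hh.shape[0])
    H, Y = [], []
    for x_t in X:
        h = np.tanh(b + W_hh @ h + W_xh @ x_t)   # h(t) = tanh(b + W_hh h(t-1) + W_x x(t))
        H.append(h)
        Y.append(c + W_yh @ h)                   # y(t) = c + W_yh h(t)
    return np.stack(H), np.stack(Y)
```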
16 The challenge of long-term dependencies. Gradients propagated over many stages tend to either vanish (most of the time) or explode (rarely, but with much damage to the optimization) (Goodfellow et al. 2016). This can also be seen when expanding the state in terms of the previous ones,
$h^{(t)} = \tanh\!\big(b + W_{hh}\tanh\!\big(b + W_{hh}\tanh\!\big(b + W_{hh}(\ldots) + W_x x^{(t-3)}\big) + W_x x^{(t-2)}\big) + W_x x^{(t-1)}\big)$
In the simplest linear case $h^{(t)} = (W_{hh})^{t}\, h^{(0)} + \ldots$
Figure: Repeated function composition, by Goodfellow, Bengio and Courville, 2016.
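The linear case can be checked numerically: scaling the recurrent matrix slightly below or above 1 makes the propagated state (and hence the gradient) vanish or explode after enough steps. A small sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
h0 = rng.standard_normal(32)

for scale in (0.9, 1.1):                     # largest singular value just below / above 1
    Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))
    W_hh = scale * Q                         # scaled orthogonal matrix: all singular values equal `scale`
    h = h0.copy()
    for _ in range(100):
        h = W_hh @ h                         # linear case: h(t) = (W_hh)^t h(0)
    print(scale, np.linalg.norm(h) / np.linalg.norm(h0))   # roughly scale**100: vanishes or explodes
```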
17 Strategies to gain long-term dependency in RNNs. We comment on three strategies: 1. Skip connections: add direct connections between variables far apart; difficult to tune, and we have the same problem, only for a larger delay. Remark: this idea was successfully applied through layers (not through time) by the ImageNet winner of 2015. 2. Removing connections: remove length-one connections and replace them by longer ones. 3. Leaky units: units with linear self-connections. We accumulate a running average $\mu^{(t)}$ of some value $v^{(t)}$ by
$\mu^{(t)} \leftarrow \alpha\, \mu^{(t-1)} + (1-\alpha)\, v^{(t)}$ (2)
Using leaky units is the key idea of the LSTM cells.
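The leaky-unit update (2) is just an exponential moving average; a tiny sketch (the value of alpha is arbitrary):

```python
def leaky_update(mu, v, alpha=0.95):
    # alpha close to 1: long memory of past values; alpha close to 0: follow the newest value.
    return alpha * mu + (1 - alpha) * v

mu = 0.0
for v in [1.0, 1.0, 0.0, 0.0, 1.0]:
    mu = leaky_update(mu, v)
```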
18 LSTM cell. (Diagram: cell state $c^{(t)}$ surrounded by the input, forget and output gates, all fed by $x^{(t)}$ and $h^{(t-1)}$.)
$i^{(t)} = \sigma\!\big(W_x^i x^{(t)} + W_h^i h^{(t-1)} + b^i\big)$
$f^{(t)} = \sigma\!\big(W_x^f x^{(t)} + W_h^f h^{(t-1)} + b^f\big)$
$o^{(t)} = \sigma\!\big(W_x^o x^{(t)} + W_h^o h^{(t-1)} + b^o\big)$
$\tilde{c}^{(t)} = \tanh\!\big(W_x^c x^{(t)} + W_h^c h^{(t-1)} + b^c\big)$
$c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}$
$h^{(t)} = \tanh\!\big(c^{(t)}\big) \odot o^{(t)}$
This is the typical representation of the LSTM. It is the folded scheme and may be hard to understand. The gates act as leaky units: the input gate decides what part of the processed input $\tilde{c}$ is stored in the new state, the forget gate decides what part of the state (memory) $c$ remains, and the output gate decides what part of the memory exits. Then, in the next unit we process the next input, the previous memory $c$ and the previous output $h$.
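A single LSTM step written out in NumPy, following the equations above; packing the weights in dictionaries and the key names are our own convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    i = sigmoid(W['xi'] @ x + W['hi'] @ h_prev + b['i'])        # input gate
    f = sigmoid(W['xf'] @ x + W['hf'] @ h_prev + b['f'])        # forget gate
    o = sigmoid(W['xo'] @ x + W['ho'] @ h_prev + b['o'])        # output gate
    c_tilde = np.tanh(W['xc'] @ x + W['hc'] @ h_prev + b['c'])  # candidate (processed input)
    c = f * c_prev + i * c_tilde                                # keep part of the memory, add part of the input
    h = np.tanh(c) * o                                          # the output gate decides what leaves the cell
    return h, c
```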
19 LSTM vs RNN, unfolded. Figure: Unfolded RNN (top) and LSTM (bottom), from C. Olah's blog.
20 LSTM unfolded, from C. Olah's blog. (a) State flow. (b) Forget gate: $f^{(t)} = \sigma(W_x^f x^{(t)} + W_h^f h^{(t-1)} + b^f)$. (c) Input gate: $i^{(t)} = \sigma(W_x^i x^{(t)} + W_h^i h^{(t-1)} + b^i)$. (d) State update: $c^{(t)} = f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}$. (e) Output: $h^{(t)} = \tanh(c^{(t)}) \odot o^{(t)}$.
21 Further comments on RNNs. Bidirectional (Bi-RNN): in sequences we can have one RNN running from start to end and another from end to beginning (Figure: Bidirectional RNN). Multidimensional: in images we may have one RNN starting from every corner, e.g.
$h^{NW}_{i,j} = \mathrm{LSTM}\big(x_{i,j},\, h^{NW}_{i,j-1},\, h^{NW}_{i-1,j},\, h^{NW}_{i-1,j\pm 1}\big), \qquad h^{SE}_{i,j} = \mathrm{LSTM}\big(x_{i,j},\, h^{SE}_{i,j+1},\, h^{SE}_{i+1,j},\, h^{SE}_{i+1,j\pm 1}\big)$
and similarly for the other two corners. Stacking: we can have several RNNs arranged in several layers. See A. Karpathy's blog for amazing results with RNNs and also nice interpretations.
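A sketch of the bidirectional idea: run the same recursion left-to-right and right-to-left and concatenate the two states, so the feature at time t sees both the past and the future of the sequence (the parameter packing is ours).

```python
import numpy as np

def rnn_pass(X, W_hh, W_xh, b):
    # Plain forward recursion used by both directions.
    h = np.zeros(W_hh.shape[0])
    out = []
    for x_t in X:
        h = np.tanh(b + W_hh @ h + W_xh @ x_t)
        out.append(h)
    return np.stack(out)

def bidirectional(X, fwd_params, bwd_params):
    h_fwd = rnn_pass(X, *fwd_params)
    h_bwd = rnn_pass(X[::-1], *bwd_params)[::-1]    # run on the reversed input, then re-align in time
    return np.concatenate([h_fwd, h_bwd], axis=1)   # feature at t = [forward state, backward state]
```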
22 Connectionist Temporal Classification
23 Labelling alignment is a problem with RNNs. Since the RNN only outputs local classifications, a post-processing stage is required to give the final label sequence. Suppose we have the following image and we want to recognise the text: must the training data be pre-segmented (locating where every letter is)? Labelling unsegmented sequence data, known as temporal classification, is a well-known problem in real-world sequence learning. Suppose that we feed an RNN with the image above so that it produces 120 outputs, but we have 44 letters as labels: "monasteries, manors, townships, or wards and". In training we need to translate from the labels (44) to the outputs of the RNN (120). We will need two temporal indexes: u for labels and t for the outputs of the RNN. Several consecutive output indexes of the RNN correspond to the same label index.
24 Introduction to CTC (I). The connectionist temporal classification (CTC) does not require pre-segmented training data or external post-processing to extract the label sequence from the network outputs. It brought the phoneme error rate on TIMIT to a record value (Graves et al., ICASSP 2013). CTC models all aspects of the sequence with a single neural network:
25 Introduction to CTC (II). We want something on top of an RNN that translates from probabilities for labels at the T times of a sequence (image, audio, ...) into a sequence of U labels (letters, phonemes, ...). It avoids segmentation, i.e. providing the position of every label u in the training sequence (image, audio, ...). We want to use backpropagation to train the whole system.
26 Notation. $A$ is the label alphabet, e.g. the characters in the Latin alphabet; $A' = A \cup \{\text{blank}\}$. $X = (x^{(1)}, x^{(2)}, \ldots, x^{(T)})$ is the input sequence to the RNN. $Y = (y^{(1)}, y^{(2)}, \ldots, y^{(T)})$ is the RNN output, where $y^{(t)}_k \in [0,1]$, $k = 1, \ldots, |A'|$, and $\sum_{k=1}^{|A'|} y^{(t)}_k = 1$ (softmax layer at the RNN outputs); $y^{(t)}_k$ is the probability of observing label $k$ at time $t$. $\hat{L} = (\hat{l}_1, \hat{l}_2, \ldots, \hat{l}_U)$ is the estimated output label sequence, $\hat{l}_j \in A$. Objective: the recurrent neural network is defined by some weights $w_i$; determine a mapping $h(X) = \hat{L}$ from one sequence of length T to another of (unknown) length U, after training with a set of pairs (X, L).
27 From network outputs to labellings. Define the probability distribution over all possible paths $\pi$ in the set $A'^T$:
$p(\pi \mid X) = \prod_{t=1}^{T} y^{(t)}_{\pi_t}, \qquad \pi \in A'^T$
RNN and CTC: the expression above gives the input-output (X-Y) response of the RNN; we next build the CTC output on top of it. Different paths may provide the same output L due to repeated outputs and blanks. A many-to-one map $B : A'^T \to A^{\le T}$ is defined: remove repeated labels, then remove blanks, e.g. $B(a{-}ab{-}) = B({-}aa{-}{-}abb) = aab$. A new label is output when the network switches from predicting no label to predicting a label, or from predicting one label to another. For any $L \in A^U$ with $U \le T$,
$p(L \mid X) = \sum_{\pi \in B^{-1}(L)} p(\pi \mid X)$
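The map B is easy to state in code; a short sketch using '-' for the blank, checked on the example above:

```python
def collapse(path, blank='-'):
    # B: first remove repeated labels, then remove blanks.
    out = []
    for s in path:
        if not out or s != out[-1]:
            out.append(s)
    return ''.join(s for s in out if s != blank)

assert collapse('a-ab-') == 'aab'
assert collapse('-aa--abb') == 'aab'
```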
28 From network outputs to labellings. In $p(L \mid X) = \sum_{\pi \in B^{-1}(L)} p(\pi \mid X)$, $L \in A^U$. Collapsing: the CTC collapses different paths into the same label sequence L, which makes it possible to use unsegmented data. Solution: the CTC solution is the most probable labelling of the input sequence,
$h(X) = \arg\max_{L \in A^{\le T}} p(L \mid X)$
29 Output. Two different problems: Training: estimate the probability $p(L \mid X)$ for some L. Output: find the L providing the largest $p(L \mid X)$. Solutions for the CTC output: Solution 1, best path decoding (trivial): let $\pi^{*} = \arg\max_{\pi} p(\pi \mid X)$ (the sequence of highest RNN outputs), then $\hat{L} \approx B(\pi^{*})$. Solution 2, prefix search decoding: based on dynamic programming, it calculates the probabilities of successive extensions of labelling prefixes. Solution 2 is harder: a forward-backward algorithm can be applied, but the complexity grows exponentially with the length of the input sequence, so heuristics are used, such as breaking the input into subsequences.
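Solution 1 amounts to an argmax per frame followed by the collapse above; a sketch, where reserving index 0 for the blank is our assumption:

```python
import numpy as np

def best_path_decode(y, alphabet, blank=0):
    # y: (T, |A'|) softmax outputs of the RNN; alphabet[k] is the character for index k.
    path = y.argmax(axis=1)
    out, prev = [], blank
    for k in path:
        if k != blank and k != prev:   # emit only when switching to a (new) label
            out.append(alphabet[k])
        prev = k
    return ''.join(out)
```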
30 Notes on the best solution. Example: suppose you have two consecutive time instants t and t+1 and you want to know the most probable label sequence when the blank has probabilities $y^{t}_{-} = 0.7$ and $y^{t+1}_{-} = 0.6$, A has probabilities $y^{t}_{A} = 0.3$ and $y^{t+1}_{A} = 0.4$, and all other probabilities are zero. The most probable labelling is A because it adds the probabilities of the paths -A, AA and A-, giving 0.58, compared with 0.42 for the path --. (Figure: A. Graves, Ph.D. thesis, 2008.) Hence Solution 1 would output the empty labelling '-' while Solution 2 outputs 'A'.
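The numbers in the example can be checked by enumerating the four length-2 paths:

```python
# Index 0 = blank ('-'), index 1 = 'A'.
y = [[0.7, 0.3],   # frame t
     [0.6, 0.4]]   # frame t+1

p_A     = y[0][0] * y[1][1] + y[0][1] * y[1][1] + y[0][1] * y[1][0]   # paths -A, AA, A-
p_empty = y[0][0] * y[1][0]                                           # path --
print(p_A, p_empty)   # 0.58 vs 0.42: best path decoding picks '--', prefix search picks 'A'
```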
31 Training. Training: estimate the probability $p(L \mid X)$ for some given L. The problem is easier than computing the output: here we know L and its length. Note that the CTC itself has nothing to be trained; it provides a translation from the RNN outputs to a labelling, and we now discuss how the RNN is trained through it by backpropagation. The objective function uses the target labellings in the training set (L, X),
$\mathcal{L} = -\ln \prod_{(L,X)} p(L \mid X) = -\sum_{(L,X)} \ln p(L \mid X)$ (3)
Given the derivatives with respect to the RNN outputs, the weight gradients of the RNN can be computed with standard backpropagation; for a training pair (L, X),
$\dfrac{\partial \ln p(L \mid X)}{\partial y^{(t)}_k} = \dfrac{1}{p(L \mid X)\, y^{(t)}_k} \sum_{u :\, l'_u = k} \alpha(t,u)\, \beta(t,u)$ (4)
We need to compute $p(L \mid X)$ efficiently: this is solved with a dynamic programming algorithm, with a forward and a backward pass that provide $\alpha(t,u)$ and $\beta(t,u)$.
32 CTC forward-backward (I). Given L, we define L' as L with blanks added at the beginning and end and inserted between every pair of labels. For t = 1, ..., T and u = 1, ..., 2U+1 we define the following two sets of paths and accumulated probabilities:
$V(t,u) = \big\{\pi \in A'^{\,t} : B(\pi) = L_{1:\lfloor u/2 \rfloor},\ \pi_t = l'_u\big\}, \qquad \alpha(t,u) = \sum_{\pi \in V(t,u)} \prod_{i=1}^{t} y^{(i)}_{\pi_i}$
$W(t,u) = \big\{\pi \in A'^{\,T-t} : B(\hat{\pi} + \pi) = L\ \ \forall\, \hat{\pi} \in V(t,u)\big\}, \qquad \beta(t,u) = \sum_{\pi \in W(t,u)} \prod_{i=t}^{T} y^{(i)}_{\pi_i}$
Figure: A. Graves et al., Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks (2006).
33 CTC forward-backward (II). Both can be computed recursively from neighbouring values in t and u:
$\alpha(t,u) = f\big(\alpha(t-1,u),\ \alpha(t-1,u-1),\ \alpha(t-1,u-2)\big)$
$\beta(t,u) = g\big(\beta(t+1,u),\ \beta(t+1,u+1),\ \beta(t+1,u+2)\big)$
$p(L \mid X) = \alpha(T, 2U) + \alpha(T, 2U+1)$
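A compact sketch of the forward (alpha) recursion, assuming per-frame softmax outputs y of shape (T, |A'|), integer labels, and index 0 reserved for the blank; it returns p(L|X) as alpha(T, 2U) + alpha(T, 2U+1). In practice the recursion is run in log space (or with per-step rescaling) to avoid underflow.

```python
import numpy as np

def ctc_forward(y, label, blank=0):
    T = y.shape[0]
    ext = [blank]                        # L': blanks at the ends and between every pair of labels
    for l in label:
        ext += [l, blank]
    S = len(ext)                         # S = 2U + 1

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, ext[0]]           # start with a blank ...
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]       # ... or with the first label

    for t in range(1, T):
        for u in range(S):
            a = alpha[t - 1, u]
            if u > 0:
                a += alpha[t - 1, u - 1]
            # Skipping the previous blank is only allowed between two different labels.
            if u > 1 and ext[u] != blank and ext[u] != ext[u - 2]:
                a += alpha[t - 1, u - 2]
            alpha[t, u] = a * y[t, ext[u]]

    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]
```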
34 Line Recognition
35 IAM lines database. English, 300 dpi, PNG, 256 grey levels. 5545 training lines, 616 validation lines, 2772 test lines. We measure the CER (character error rate, %).
36 IAM line recognition. TensorFlow has the LSTM, Bi-LSTM and CTC implemented, but not the MD-LSTM. We use the framework implemented by researchers at the University of Aachen (Germany), where the MD-LSTM layer is implemented in CUDA; programming the MD-LSTM can run into memory problems (even on GPUs). Following state-of-the-art works, we train solutions with 3 and 5 layers (Conv + MD-LSTM), with an architecture of the type shown. Figure: structure adapted from Pham et al. 2014.
37 IAM line recognition. In particular, for the intermediate layers we use the following (5-layer option), with dropout (forward) of 25%, Adam as optimizer, and minibatches of 10 images. Figure: structure following V. Paul et al., Int. Conf. on Frontiers in Handwriting Recognition.
38 IAM line recognition: CER. Each epoch takes 1 h 50 min on an NVIDIA Tesla P-series GPU. The model reaches a CER of 5.7%. Note that no language rules have been used here.
39 IAM line recognition: CTC. CTC output after the transcription of a line of the test set.
40 IAM line recognition: transcribed text.
41 Conclusions
42 Conclusions. For structured inputs/outputs (handwriting, speech, ...) the LSTM is widely exploited. Temporal classification is a problem solved with the CTC, with no need for a grammar or dictionary (these can later be used to further improve the solution). CNNs, LSTM and CTC are available in TensorFlow; the multidimensional LSTM is not (CUDA was used). Future? Current work on augmented RNNs includes (see C. Olah's blog on this topic): Neural Turing Machines (Graves et al., Neural Turing Machines, 2014); attentional interfaces: attention has been included before the CTC layer in Bluche 2016... THANK YOU for your attention
43 Readings
44 Readings.
I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. Chapters 6 to 9 for concepts on deep learning; Chapter 10 in particular for recurrent networks (but see below for the LSTM).
M. Görner, TensorFlow and Deep Learning without a PhD. Quick review of the main concepts and examples on DL.
Z. C. Lipton, J. Berkowitz, C. Elkan, A Critical Review of Recurrent Neural Networks for Sequence Learning.
K. Cho, Natural Language Understanding with Distributed Representation, 2016. For a detailed explanation of the LSTM.
C. Olah, Understanding LSTM Networks. For a good explanation of the LSTM.
A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Ph.D. thesis. For the explanation of the CTC.
T. Bluche, Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition, NIPS 2016. State-of-the-art solution to the handwriting recognition problem.