Chinese Character Handwriting Generation in TensorFlow

Heri Zhao, Jiayu Ye, Ke Xu

Abstract

Recurrent neural networks (RNNs) have proved successful in several sequence generation tasks. An RNN can make reasonable predictions about a sequence by remembering previous states. For example, trained on Wikipedia text, an RNN can predict the next word in a sentence or even generate new text. Trained on Linux kernel code, an RNN can imitate C code. Generating realistic English handwriting strokes is another capability of RNNs. We are therefore curious whether the same technique applies to Chinese characters, which have many more distinct characters and more complicated stroke sequences. We experimented with a joint Mixture Density Network and RNN model to generate handwritten Chinese characters.

I. INTRODUCTION

Chinese character recognition has achieved great results using conventional convolutional neural networks [1], but drawing handwritten characters is a completely different task. Given the success of generating handwritten English characters [2], we aim to craft a model for handwritten Chinese characters.

As a writing system with thousands of years of history, Chinese poses a drastically different challenge from English. First, its dictionary contains tens of thousands of characters, compared with only 26 letters in English. Second, there are written styles that diverge greatly from the canonical representation (standard font). Third, each character contains many more strokes than an English letter.

Fortunately, the RNN provides a generic model for representing sequential data. Researchers have used RNNs to generate fake Chinese characters [3], which shows great potential for generating real, human-recognizable Chinese characters. That model generates new strokes based on the state stored in the RNN, and it produces meaningful parts composed of a non-trivial number of strokes.
A Mixture Density Network (MDN) is applied there so that the neural network can produce multi-valued outputs [4]. Other researchers have already proposed a framework that generates handwritten Chinese characters conditioned on a specific character [5]. That work provides a generative model that uses an RNN and a Gaussian Mixture Model (GMM) to predict the next positions and states of the pen. The GMM there is a modified version of the Mixture Density Network. Besides the primitive input strokes, the RNN is also conditioned on a condensed representation of each character, so that the system draws specific, human-recognizable characters.

Given these successful explorations, we experimented with a model that uses a generalized MDN to synthesize handwritten Chinese characters in TensorFlow [6]. This generative model is trained on KanjiVG [7] data with the goal of producing realistic strokes.

The rest of this paper is organized as follows. Section II introduces the dataset. Sections III and IV present the baseline and oracle models. Section V describes the core generative RNN model. Section VI explains the evaluation metrics. Section VII compares the proposed model with state-of-the-art approaches. Section VIII reports experimental results.

II. DATASET AND REPRESENTATION

The dataset used for training is KanjiVG, where each Chinese character is represented as a variable-length sequence of points in the order of pen movement:

$$[[x_1, y_1, eos_1, eoc_1], \ldots, [x_n, y_n, eos_n, eoc_n]] \tag{1}$$

Here $x_i$ and $y_i$ are the xy-coordinates of each point, $eos_i$ (end of stroke) is a single bit indicating whether the current point is the end point of a stroke,
and $eoc_i$ (end of character) is also a single bit indicating whether the current point ends the character. This representation is better than an image-based one because the sequence contains not only the spatial information but also the temporal order of the strokes, which is very useful when generating characters later. Fig. 1 visualizes the character 卒, which contains 8 strokes.

Fig. 1: A visualized example with stroke orders

The xy-coordinates are scaled down by a factor of 15 to help fit the training parameters, and then scaled back up for visualization in SVG format.

| Strokes in character | Correct | Total | Random accuracy | Bigram accuracy | Bigram w/ distance cost |
|---|---|---|---|---|---|
| 3 | 23 | 27 | 1/(3-1)! = 50% | 70.37% | 85.19% |
| 4 | 32 | 55 | 1/(4-1)! = 16.7% | 29.63% | 58.18% |
| 5 | 15 | 68 | 1/(5-1)! = 4.17% | 13.04% | 22.06% |
| 6 | 15 | 111 | 1/(6-1)! = 0.83% | 1.8% | 13.51% |
| 7 | 10 | 125 | 1/(7-1)! = 0.14% | N/A | 8.00% |

TABLE I: Baseline Results

III. BASELINE: ENHANCED N-GRAM

Using the given dataset, the baseline algorithm tries to predict the next stroke from the current stroke, given the full set of strokes of a character as prior knowledge. The cost depends not only on how frequently a (previous-stroke, next-stroke) combination occurs in the training set, but also on the distance between the two strokes. The n-gram model is trained and validated on 185 random characters from the dataset.

A. Algorithm

The task is a stroke-order search problem, which can be solved by any standard search algorithm:

State: a pair [previous stroke, list of strokes not yet selected]
Start state: the ground-truth first stroke
End state: the list of remaining strokes is empty
SuccAndCost: given a state, return (action, new state, cost) based on the bigram cost plus the distance cost

B. Distance Cost

A stroke distance cost is introduced to exploit spatial information when predicting the next state.
The distance is simply the Manhattan distance between the center points of two strokes $(s_1, s_2)$, scaled by a coefficient $k$:

$$\mathrm{DistanceCost} = k \cdot \mathrm{ManhattanDistance}(s_1, s_2) \tag{2}$$

This cost takes spatial information into account and prefers the closest stroke as the next prediction, which is a very common stroke movement in Chinese writing.

C. Results

We applied uniform cost search to the stroke-reordering problem with distance cost ($k = 3$). Table I shows the results of the baseline model in predicting characters with 3 to 7 strokes. More complicated characters with more than 7 strokes were not tested, because the accuracy would be very close to random accuracy, meaning the model would not give any meaningful predictions. The baseline model performs well on characters with few strokes but degrades as the number of strokes increases.

D. Baseline Takeaway

Based on the baseline results, predicting only the order of strokes is not very interesting. Thus, Section V describes another
method, which performs character generation instead of only prediction. The generative method not only provides the order of the strokes but also generates stylized characters. The rest of the report focuses on the generative model, and the evaluation measures how similar the generated characters are to the original characters, rather than only the order of strokes.

IV. ORACLE

For stroke prediction and character generation, the oracle is a human being, which in this case is represented by the character dataset. For stroke order prediction, the oracle takes all stroke information from the dataset (a Chinese dictionary) and behaves like a dictionary lookup, cheating by looking at the correct answers. For handwriting generation, the oracle outputs the characters in the dataset directly, without adding any variation. Therefore, the oracle's accuracy is 100% for stroke prediction, and its loss is 0 for the RNN model. Fig. 2 shows the baseline and oracle results, as well as the comparison.

Fig. 2: Baseline/Oracle Comparison

V. CONDITIONAL GENERATIVE RECURRENT NEURAL NETWORK

A. Recurrent Neural Network (RNN)

A recurrent neural network is a general model for representing sequential data. The hidden state is fed back into the hidden node in a loop. At time $t$, the input $x_t$ and the previous hidden state $h_{t-1}$ are fed into the RNN to produce the output $y_t$ and the new hidden state $h_t$. The hidden state $h_t$ represents the compressed information of the previous input sequence. It is well known, however, that RNNs were hard to train until Long Short-Term Memory appeared [9].

B. Long Short-Term Memory (LSTM)

The intuition behind LSTM is that a cell state $C_t$ is added to the RNN flow to keep track of the internal state. The state is controlled by three kinds of gates: the forget gate $f_t$, the input gate $i_t$, and the output gate $o_t$. As formally described in [9],

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \tag{3}$$
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{4}$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \tag{6}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{7}$$
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{8}$$
$$h_t = o_t \odot \tanh(C_t) \tag{9}$$

All the $W$ are weight matrices to be trained, as are the bias terms $b$.

C. Gated Recurrent Unit (GRU)

The GRU provides a simpler way to abstract the cell state and hidden state. It merges the cell state and hidden state into a single state and combines the forget and input gates into a single update gate. Nevertheless, this simplified model provides similar performance [10]. As formally described in [9],

$$z_t = \sigma(W_z [h_{t-1}, x_t]) \tag{10}$$
$$r_t = \sigma(W_r [h_{t-1}, x_t]) \tag{11}$$
$$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t]) \tag{12}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{13}$$

Given the complex input representation described in Section II, we prefer to adapt the simpler model to tackle the task. Applied to our data, $x_t \in \mathbb{R}^{1 \times 4}$ is a single data entry with position and eos/eoc information.
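As a rough illustration of the GRU update in Eqs. (10)-(13), the following pure-Python sketch runs one recurrence step per pen point. The helper names (`gru_step`, `matvec`, `random_matrix`), the hidden size of 8, the sample points, and the random weights are assumptions made for this example, not the TensorFlow code used for the experiments. Biases are omitted as in the equations, and the candidate state concatenates the input $x_t$ with the reset hidden state, as in the standard GRU formulation.

```python
import math
import random


def sigmoid(v):
    """Element-wise logistic function on a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def gru_step(h_prev, x, W_z, W_r, W_h):
    """One GRU update; every weight matrix has len(h_prev) + len(x) columns."""
    hx = h_prev + x                               # concatenation [h_{t-1}, x_t]
    z = sigmoid(matvec(W_z, hx))                  # update gate, Eq. (10)
    r = sigmoid(matvec(W_r, hx))                  # reset gate, Eq. (11)
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in matvec(W_h, reset_h + x)]  # Eq. (12)
    # Blend the old state and the candidate state, Eq. (13).
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, h_tilde)]


def random_matrix(rows, cols, rng):
    """Small random weights, for illustration only."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden_size, input_size = 8, 4                    # x_t = [x, y, eos, eoc]
W_z = random_matrix(hidden_size, hidden_size + input_size, rng)
W_r = random_matrix(hidden_size, hidden_size + input_size, rng)
W_h = random_matrix(hidden_size, hidden_size + input_size, rng)

# Three hypothetical pen points in the representation of Section II.
points = [[0.1, 0.2, 0.0, 0.0], [0.3, 0.1, 1.0, 0.0], [0.5, 0.5, 1.0, 1.0]]
h = [0.0] * hidden_size
for point in points:
    h = gru_step(h, point, W_z, W_r, W_h)
print(h)
```

Because each new state is a gated blend of the previous state and a tanh output, every entry of `h` stays strictly inside (-1, 1) when the initial state is zero.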
Fig. 3: A visualized example of a GRU unit from [9]

D. Adjusted GRU with Class Information

Besides the input data, class information is also fed into the GRU. Currently a one-hot vector $c \in \mathbb{R}^{\#classes}$ is used to condition the RNN on a specific class. The modified GRU is as follows:

$$z_t = \sigma(W_z [h_{t-1}, x_t] + M_z c) \tag{14}$$
$$r_t = \sigma(W_r [h_{t-1}, x_t] + M_r c) \tag{15}$$
$$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t]) \tag{16}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{17}$$

$M_z$ and $M_r$ are additional parameters to train in the RNN so that its output depends on the character class.

E. Mixture Density Network

When writing a Chinese character, we do not care about the exact location of the next point that follows the previous sequence. The problem is therefore a multi-valued output problem, on which a traditional feed-forward neural network would not perform well. This is where the Mixture Density Network comes in [4]. At a high level, an MDN is a generic way of describing multi-valued data using a Gaussian mixture model on top of a neural network. The inference is defined as the probability distribution of the target $t$ conditioned on the input $x$, following Eq. (22) of [4]:

$$p(t \mid x) = \sum_{i=1}^{m} \alpha_i(x)\, \phi_i(t \mid x) \tag{18}$$

Here, $\alpha_i$ is the mixing weight (importance) of the $i$-th Gaussian component and $\phi_i$ is the corresponding Gaussian distribution. It is then the RNN's responsibility to generate the parameters $\mu$ and $\sigma$ of the $m$ Gaussian distributions. The training loss is defined as the negative log of the overall inference, following Eq. (26) of [2]:

$$L(x) = \sum_{t=1}^{T} -\log\Big(\sum_{j} \pi_t^j\, \mathcal{N}(x_{t+1} \mid \mu_t^j, \sigma_t^j)\Big) \tag{19}$$

Note that we dropped the end-of-character and end-of-stroke loss terms to keep the model simple.

VI. EVALUATION METRICS

Due to the nature of a generative model, evaluation cannot fully capture the errors the model makes. We implemented a preliminary verification method that calculates the total location difference of each stroke:

$$L(c) = \log\Big(\sum \Delta\,\mathrm{location}\Big) \tag{20}$$

Note that if the strokes are not in the right order, the location loss is still high.
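One way to read Eq. (20), sketched below in Python, is to sum the Euclidean distances between corresponding points of a generated sequence and a reference sequence and then take the logarithm. The function name `location_loss`, the point-by-point matching of equal-length sequences, the toy coordinates, and the small epsilon inside the logarithm are all assumptions for illustration; the exact form of the location difference is not specified here.

```python
import math


def location_loss(generated, reference, eps=1e-8):
    """Log of the summed point-wise Euclidean distance (assumed reading of Eq. 20).

    Both arguments are equal-length lists of (x, y) points in pen order.
    eps is an assumption that keeps the logarithm finite for a perfect match.
    """
    if len(generated) != len(reference):
        raise ValueError("sequences must have the same number of points")
    total = sum(math.hypot(gx - rx, gy - ry)
                for (gx, gy), (rx, ry) in zip(generated, reference))
    return math.log(total + eps)


# Hypothetical character with two horizontal strokes of two points each.
reference = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Same stroke order, slightly perturbed points.
jittered = [(0.05, 0.0), (1.0, 0.05), (0.0, 0.95), (1.05, 1.0)]
# Exactly the same points, but the two strokes are drawn in the opposite order.
swapped = [(0.0, 1.0), (1.0, 1.0), (0.0, 0.0), (1.0, 0.0)]

loss_jittered = location_loss(jittered, reference)
loss_swapped = location_loss(swapped, reference)
print(loss_jittered, loss_swapped)
```

Swapping the two strokes keeps every point on the character, yet the loss is much larger than for small jitter, which illustrates why a wrong stroke order is penalized by a location-based metric.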
Furthermore, we use an existing Chinese character classifier to provide a more formal measure. Let $c$ be the class; the classifier outputs a predicted class given a character image $\mathit{Image}$. The error is then

$$L(\text{all } c) = \sum_{c} \big(\mathrm{Classifier}(\mathit{Image}) \neq c\big) \tag{21}$$

VII. LITERATURE REVIEW

This section compares the proposed approach with the state-of-the-art approach for English sentence synthesis [2] and the state-of-the-art approach for Chinese character drawing [5]. We compare them in terms of dataset, models, and results.

A. Dataset

1) Compared with the English synthesis approach: Although this approach reaches very high accuracy and performs very well on an English word dataset, it is probably not directly applicable to Chinese characters, because Chinese characters are far more complicated than English sentences, which are built from a fixed number of letters. Each Chinese character is composed of multiple different strokes in different combinations. Besides, Chinese characters carry more spatial information, like a graph, rather than being an ordered sequence of letters.
2) Compared with the Chinese drawing approach: The Chinese drawing approach uses an online handwritten Chinese character dataset, which contains more than two million training characters written by humans, with multiple samples (written by different people) for each character. Our approach, however, uses a dataset of printed-style characters, where each character corresponds to only one sample. This dataset limitation makes training more difficult. Fig. 4 shows the differences between printed-style characters and online handwritten characters.

Fig. 4: Comparison of datasets. (a) Printed-style characters. (b) Online handwritten characters [11].

B. Specific GMM and General MDN

As mentioned above, we employ a more generic MDN model with less complexity. Compared with the Chinese character drawing paper [5], we drop the end-of-character and end-of-stroke data in the training phase. Also, instead of using a dense character embedding matrix, we use only a one-hot vector to represent the class information.

C. Inside GRU or Filtering Layer

In the sequence generation paper [2], the author proposes an additional filter layer in the RNN so that the network is conditioned on a specific character. However, it is targeted at producing cursive writing over a window of several English characters. To avoid this complication, we embed the one-hot class vector directly inside the RNN cell. Given that English sequence generation is a purely generative task, there is no actual evaluation metric to compare with.

D. Result Comparison

The Chinese character drawing paper [5] reaches a mean accuracy of 93.98% over 3755 classes. Due to time constraints, we experimented with only 5 classes, which yields a mean accuracy of 14%.

VIII. EXPERIMENTS AND ANALYSIS

In this section, we present the experiments we ran and how we arrived at the final results by simplifying features or pruning the training set.

A. Taste the Full Dataset

Initially, we used the full dataset, which contains more than 10,000 characters. Because we use a one-hot vector to represent the class information, this yields a huge vector, which produces very poor results when sampling conditional characters. Because the per-point data is only of size 4 or 5 (depending on whether end of stroke and end of character are included), the massive class index vector dilutes the weights of the actual stroke data in the RNN. Hence we reduced the dataset to gain more intuition.

B. Result for Limited Dataset

We then limited the dataset to just 5 characters to verify our ideas. It produces some occluded
or prolonged samples that are human-recognizable to some extent. We believe this is because the dataset is polluted with images scaled without a fixed aspect ratio.

C. Remove Training Noise and Modify Batch

We then looked into the data generation process, which produces randomly generated data at different scales to enlarge the number of training examples. Since we want to generate real Chinese characters that are comparable to the canonical representation, we removed the random sampling. We also removed some inconsistent batching mechanisms, since we do not want the RNN to learn the transition from one character to another.

D. Remove eos and eoc

We removed the end-of-stroke and end-of-character bits from the training set so that the model can focus solely on generating the next xy-coordinates. The model converges after around 8,000 batches and is finally able to generate human-recognizable characters. Fig. 5 shows 20 results generated purely by the RNN for the character 七.

Fig. 5: Results purely generated by RNN

E. Training Loss and Explanation

Table II shows the training results for each attempt.

| Experiment | Convergence | Training loss | Batches until convergence |
|---|---|---|---|
| Full | No | > 30,000 | N/A |
| Limited | No | > 30,000 | N/A |
| No eos & eoc | Yes | 1.1 | 8,000 |

TABLE II: Training Loss and Convergence

IX. CONCLUSIONS

This paper presents a joint Mixture Density Network and RNN model to generate handwritten Chinese characters. Although there is existing work on drawing characters, given all the difficulties stated above, we tried several different methods and showed some preliminary results here. The final model applies only to a very limited dataset and set of character classes. There is still much future work to be done on this interesting topic.

REFERENCES

[1] Y. H. Zhang, Deep Convolutional Network for Handwritten Chinese Character Recognition, cs231n.
[2] A. Graves, Generating Sequences With Recurrent Neural Networks, arXiv:1308.0850v5 [cs.NE], 5 Jun 2014.
[3] S. Otoro, Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow, studio otoro.
[4] C. M. Bishop, Mixture Density Networks, Neural Computing Research Group Report NCRG/94/004.
[5] X. Y. Zhang, F. Yin, Y. M. Zhang, C. L. Liu, Y. Bengio, Drawing and Recognizing Chinese Characters with Recurrent Neural Network, arXiv:1606.06539v1 [cs.CV], 21 Jun 2016.
[6] Google Research, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems.
[7] KanjiVG: A description of the sinographs (or kanji) used by the Japanese language.
[8] Colah, Understanding LSTM Networks.
[9] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9(8):1735-1780, 1997.
[10] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, arXiv:1412.3555v1 [cs.NE], 11 Dec 2014.
[11] C. L. Liu, F. Yin, D. H. Wang, Q. F. Wang, CASIA Online and Offline Chinese Handwriting Databases, National Laboratory of Pattern Recognition (NLPR).